Runs, approvals & checkpoints

Run lifecycle and statuses, human approval/input, retry, checkpoints, and stale-run handling.

Once a workflow is authored and stored, every execution becomes a run — a tracked instance with its own status, per-step results, and a checkpoint on disk. This page covers the run lifecycle and its statuses, how to pause a run for human approval or freeform input, how to retry a failed run, how checkpoints persist state, and the stale-run gotcha you hit on ephemeral hosts like Cloud Run.

Use this page when you operate workflows day to day — watching runs in the dashboard, clearing approval gates, retrying failures. If you have not run a workflow yet, start with Your first workflow. For the full definition CRUD and run endpoints, see the Workflows & Architect API.

Run lifecycle & status

A run is created when you launch a workflow (from the dashboard Execute button, POST /api/workflows/run/{name}, a cron trigger, or an entity-event trigger). The operator executes the step DAG and the gateway reports status by overlaying local checkpoint data on top of the Kumiho run record, so progress stays current even between persistence flushes.

Status	Meaning
`pending`	Created but not yet picked up by the executor.
`running`	The executor is active.
`paused`	Waiting at a `human_approval` or `human_input` gate.
`completed`	All steps finished successfully.
`failed`	A step failed; the checkpoint is preserved so you can retry.
`cancelled`	Stopped by a user or the system.
`stale`	The run record still shows a non-terminal status, but there is no checkpoint and no live lock. See Stale runs.

Inspecting a run

Fetch a single run by its id. Every workflow route requires the pairing bearer token (see Pairing & authentication).

GET /api/workflows/runs/{run_id}
Authorization: Bearer <token>

The response is { "run": { ... } } with per-step detail. Live runs are read from the executor’s in-memory status; finished runs are resolved from the Kumiho run record. An unknown run_id returns 404.
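As a rough illustration, the request above could be issued from Python's standard library as follows. The base URL and the helper names (`build_get_run_request`, `fetch_run`) are assumptions made for this sketch and are not part of the gateway; only the path, the bearer header, the `{ "run": ... }` envelope, and the 404 behaviour come from this page.

```python
import json
import urllib.error
import urllib.request

# Assumption: placeholder address; substitute your gateway's real base URL.
BASE_URL = "http://localhost:8080"


def build_get_run_request(run_id: str, token: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the GET request for a single run."""
    url = f"{base_url}/api/workflows/runs/{run_id}"
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_run(run_id: str, token: str, base_url: str = BASE_URL):
    """Send the request and return the inner "run" object, or None for an unknown run_id."""
    request = build_get_run_request(run_id, token, base_url)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)["run"]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # unknown run_id
            return None
        raise
```

Keeping the request builder separate from the network call makes the URL and headers easy to check without a running gateway.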

To list recent runs:

GET /api/workflows/runs?limit=20&workflow=hello-world
Authorization: Bearer <token>

Query param	Default	Meaning
`limit`	`20`	Number of runs to return.
`workflow`	—	Filter by workflow name.

Cancel and delete

Cancelling a run sends a stop request to the executor, which halts at the next step boundary and kills the shell/python subprocesses it owns where possible. Deleting a run removes its Kumiho run record and, on a best-effort basis, its local checkpoint and artifact files. It does not delete or deprecate the workflow definition (definitions use separate /api/workflows/{kref} routes).

Method	Path	Effect
`POST`	`/api/workflows/runs/{run_id}/cancel`	Stop an active run at the next boundary.
`DELETE`	`/api/workflows/runs/{run_id}`	Delete the run record and local files.

The full run endpoint surface (list, trigger, approve, retry, cancel, delete, dashboard stats, agent-activity) is documented in the Workflows & Architect API.

Checkpoints

With checkpoint: true (the workflow-level default), the executor saves run state to disk after each step completes and on pause for human approval. This is what makes retry and approval-resume possible.

name: deploy-pipeline
version: "1.0"
checkpoint: true          # workflow-level default is true
steps:
  - id: build
    type: shell
    shell:
      command: "npm run build"

Artifact	Path	Purpose
Checkpoint file	`~/.revka/workflow_checkpoints/{run_id}.json`	Snapshot of completed steps and their outputs.
Lock file	`~/.revka/workflow_locks/{run_id[:12]}.lock`	Advisory lock held while the run is executing.

The gateway reads both files to determine the live status of a run: an active lock means the executor is alive, and the checkpoint supplies progress up to the most recently completed step.
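The two path patterns in the table can be derived from a run id as in the sketch below. The function names and the `home` parameter are hypothetical helpers for illustration; only the directory layout and the 12-character lock prefix come from the table.

```python
from pathlib import Path

# Default state directory, per the paths in the table above.
REVKA_HOME = Path.home() / ".revka"


def checkpoint_path(run_id: str, home: Path = REVKA_HOME) -> Path:
    """Checkpoint files are named after the full run id."""
    return home / "workflow_checkpoints" / f"{run_id}.json"


def lock_path(run_id: str, home: Path = REVKA_HOME) -> Path:
    """Lock files use only the first 12 characters of the run id."""
    return home / "workflow_locks" / f"{run_id[:12]}.lock"
```

Note that a lock file merely existing on disk does not prove the advisory lock is currently held, so a check such as `lock_path(run_id).exists()` is at best a hint.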

Retry

Per-step retry handles transient failures during a run. Set retry (number of additional attempts after the first) and an optional retry_delay (seconds between attempts):

- id: flaky_step
  type: agent
  agent:
    role: researcher
    prompt: "Fetch and summarize the latest report."
  retry: 2                # retry up to 2 times after the first attempt
  retry_delay: 10         # wait 10 seconds between retries

Field	Type	Default	Meaning
`retry`	int	`0`	Extra attempts after the first failure.
`retry_delay`	int (seconds)	`0`	Delay between retry attempts.
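To make the counting explicit: retry: 2 means up to three attempts in total, with retry_delay seconds between consecutive attempts. The sketch below models that behaviour in Python. It is an illustration of the semantics described here, not the executor's actual code, and `run_with_retry` is a hypothetical name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    retry: int = 0,
    retry_delay: float = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call step once, then up to `retry` more times if it raises."""
    attempts = 1 + max(0, retry)  # the first attempt plus the extra ones
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: the step fails for good
            sleep(retry_delay)  # wait before the next attempt
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the delay can be stubbed out when experimenting.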

Retrying a whole run

When a run reaches failed, the preserved checkpoint lets you retry from the first failed step — successful step outputs are reused, so only the failed step and its downstream steps re-execute.

From the dashboard: on the Workflow Runs page, select the failed run and click Retry. The retry path re-launches execution from the checkpoint.

From the API:

POST /api/workflows/runs/{run_id}/retry
Authorization: Bearer <token>
Content-Type: application/json

{ "cwd": "/path/to/project" }

Field	Required	Meaning
`cwd`	no	Working directory for shell and agent steps on the retry.

The gateway forwards the request to the operator and broadcasts a workflow_retry SSE event to dashboard clients.
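The "failed step and its downstream steps" rule can be pictured as a graph traversal. The sketch below assumes the DAG is available as a plain mapping from each step id to the ids it depends on. That mapping and the name `steps_to_rerun` are illustrative assumptions, not the operator's internal representation.

```python
from collections import deque


def steps_to_rerun(deps: dict[str, list[str]], failed: str) -> set[str]:
    """Return the failed step plus every step that transitively depends on it."""
    # Invert the edges: parent step -> steps that depend on it.
    children: dict[str, list[str]] = {step: [] for step in deps}
    for step, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, []).append(step)

    rerun = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in rerun:
                rerun.add(child)
                queue.append(child)
    return rerun
```

Every step outside the returned set keeps the output stored in the checkpoint.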

Human approval

A human_approval step pauses the run and waits for a yes/no decision. The checkpoint is written before the run pauses, so an approval that arrives hours later resumes cleanly.

- id: approve
  type: human_approval
  human_approval:
    message: "Deploy to production?"
    timeout: 3600          # seconds — here, 1 hour

Field	Type	Meaning
`message`	string	The prompt shown to the approver.
`timeout`	int (seconds)	How long to wait before the gate times out.

While it waits for a decision, the run's status is paused. Submit the decision over the API:

POST /api/workflows/runs/{run_id}/approve
Authorization: Bearer <token>
Content-Type: application/json

{ "approved": true, "feedback": "LGTM — ship it." }

Field	Required	Meaning
`approved`	yes	`true` to approve and continue, `false` to reject.
`feedback`	no	Freeform note passed back into the run.

On resolution the gateway broadcasts a human_approval_resolved SSE event (carrying run_id and approved) to dashboard clients.
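A minimal Python sketch of submitting a decision is shown below. The base URL default and the function name `build_approve_request` are assumptions for illustration; the path, headers, and body fields are the ones documented in this section.

```python
import json
import urllib.request
from typing import Optional


def build_approve_request(
    run_id: str,
    token: str,
    approved: bool,
    feedback: Optional[str] = None,
    base_url: str = "http://localhost:8080",  # assumption: placeholder address
) -> urllib.request.Request:
    """Build (but do not send) the POST that resolves an approval gate."""
    body = {"approved": approved}
    if feedback is not None:
        body["feedback"] = feedback  # optional freeform note
    return urllib.request.Request(
        f"{base_url}/api/workflows/runs/{run_id}/approve",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. A rejection uses the same call with `approved=False`.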

Human input

A human_input step pauses for freeform text instead of a yes/no. The submitted text becomes available to downstream steps as ${step_id.output}.

- id: ask_user
  type: human_input
  human_input:
    message: "What changes do you want?"
    channel: dashboard
    timeout: 3600

Field	Type	Meaning
`message`	string	The prompt shown to the operator.
`channel`	string	Where to ask (for example, `dashboard`).
`timeout`	int (seconds)	How long to wait for a response.

A later step can consume the answer directly, e.g. prompt: "Apply these changes: ${ask_user.output}". See Variables, expressions & triggers for the full namespace list.
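To show what the substitution amounts to, here is a small Python model of resolving ${step_id.output} references. It handles only the `.output` form and is a simplified stand-in for the real expression engine, whose full rules are described on the Variables page. The name `interpolate` is hypothetical.

```python
import re

# Matches ${step_id.output}, where step_id is a simple identifier.
_OUTPUT_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\.output\}")


def interpolate(template: str, outputs: dict[str, str]) -> str:
    """Replace each ${step_id.output} with that step's recorded output."""

    def replace(match: re.Match) -> str:
        step_id = match.group(1)
        # Assumption for this sketch: unknown references are left as-is.
        return outputs.get(step_id, match.group(0))

    return _OUTPUT_REF.sub(replace, template)
```

For instance, `interpolate("Apply these changes: ${ask_user.output}", {"ask_user": "rename the button"})` returns `"Apply these changes: rename the button"`.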

Approval registry (workflow human-approval)

Approval gates are not limited to the dashboard — they can be cleared from a chat channel. The gateway keeps a process-global approval registry that bridges workflow human_approval steps to Discord, Slack, and Telegram replies.

When a run hits an approval step, it registers a pending approval in this registry. After the channel adapter posts the approval prompt, it attaches the channel’s thread and message IDs so the registry can scope the match precisely. When a user replies, the registry matches the message and atomically claims the approval (try_claim), which removes the entry so it cannot be double-resolved — this is what prevents a race between, say, a Discord reply and a dashboard click landing at the same time.

Match	Rule
Approve	Message starts with one of the approve keywords (case-insensitive).
Reject	Message starts with a reject keyword; text after the keyword becomes the `feedback`.
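The claim-once behaviour and the prefix matching can be modelled in a few lines of Python. This is a conceptual sketch only: the keyword lists are invented placeholders (the real keywords are not listed on this page), the registry is keyed by run id for simplicity rather than by thread and message IDs, and the class and function names are hypothetical apart from try_claim.

```python
import threading
from dataclasses import dataclass
from typing import Optional

# Hypothetical keywords, for illustration only.
APPROVE_KEYWORDS = ("approve", "yes")
REJECT_KEYWORDS = ("reject", "no")


@dataclass
class Decision:
    approved: bool
    feedback: str = ""


def parse_reply(text: str) -> Optional[Decision]:
    """Classify a chat reply by its leading keyword (case-insensitive)."""
    stripped = text.strip()
    for keyword in APPROVE_KEYWORDS:
        if stripped[: len(keyword)].lower() == keyword:
            return Decision(approved=True)
    for keyword in REJECT_KEYWORDS:
        if stripped[: len(keyword)].lower() == keyword:
            # Text after the keyword becomes the feedback.
            return Decision(approved=False, feedback=stripped[len(keyword):].strip())
    return None  # not an approval reply


class ApprovalRegistry:
    """Pending approvals that can each be claimed exactly once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: set[str] = set()

    def register(self, run_id: str) -> None:
        with self._lock:
            self._pending.add(run_id)

    def try_claim(self, run_id: str) -> bool:
        """Atomically remove the entry; only the first caller gets True."""
        with self._lock:
            if run_id in self._pending:
                self._pending.remove(run_id)
                return True
            return False
```

Because the membership check and the removal happen under one lock, a chat reply and a dashboard click racing for the same run cannot both succeed.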

For the underlying real-time surface and how SSE events reach the dashboard, see Realtime: WebSocket, SSE & Live Canvas. Tool-call-level approvals (a separate gate that sits underneath the workflow) are covered in Policy, commands & sandboxing and Autonomy levels & approvals.

Dashboard — visual DAG & run viewer

The dashboard keeps workflow definitions and workflow runs separate. The runs surface (/runs) gives you run-scoped controls only — stop an active run, retry a failed one, or delete a run record — and a live graph view.

The run viewer renders the workflow as an interactive DAG and pins each run to the exact YAML revision it executed (by resolving its workflow_revision_kref). This means later edits to the definition never change the graph of an already-completed run — what you see is what actually ran.

Completed conditional nodes show the matched branch, goto target, branch value, and emitted output, in both the graph and the step inspector.
Failed nodes show the best available failure detail — the executor error, structured output_data, an stderr preview, or the captured inputs.
The step inspector (click any node) shows status, agent id and role, an output preview, injected skills, and any group-chat transcript.
“Run to here” uses target_step_id to execute only the transitive ancestor closure of a target step plus the step itself, then stop — handy for iterating on one branch.
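The "transitive ancestor closure" used by "Run to here" is the target step plus everything it depends on, directly or indirectly. A Python sketch follows, assuming the DAG is given as a mapping from step id to the ids it depends on. That mapping and the function name are illustrative assumptions.

```python
def run_to_here(deps: dict[str, list[str]], target_step_id: str) -> set[str]:
    """Return the target step and all of its transitive ancestors."""
    selected: set[str] = set()
    stack = [target_step_id]
    while stack:
        step = stack.pop()
        if step in selected:
            continue  # already visited via another path
        selected.add(step)
        stack.extend(deps.get(step, []))  # walk up to the parents
    return selected
```

Steps outside the returned set, such as sibling branches and anything downstream of the target, are not executed.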

The dashboard also exposes aggregated stats via GET /api/workflows/dashboard (definition count, active runs, recent runs) and a per-agent drill-down via GET /api/workflows/agent-activity/{agent_id}, backed by JSONL run logs at ~/.revka/operator_mcp/runlogs/{agent_id}.jsonl. For the full dashboard walkthrough, see Workflows, editor & runs.

Stale runs & the Cloud Run gotcha

Checkpoints and lock files live on the local filesystem of the host running the operator. On a persistent host this is invisible — the files survive between requests. On ephemeral hosts that wipe disk between deploys or scale-to-zero cycles (Cloud Run, similar serverless PaaS), those files vanish while the Kumiho run record still says running or paused.

When the gateway then loads such a run and finds no checkpoint and no live lock, it cannot honestly report it as live, so it surfaces it as stale. A stale run is effectively orphaned: there is no executor holding it and no on-disk state to resume from.
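The rule can be summarised as a small decision function. The Python below is a sketch of the logic as described on this page, not the gateway's source. The function name is hypothetical, and treating only running and paused as candidates is an assumption, since the page does not say how pending is handled.

```python
# Assumption: the record statuses that can turn stale, per the text above.
CAN_GO_STALE = {"running", "paused"}


def effective_status(record_status: str, has_checkpoint: bool, lock_is_live: bool) -> str:
    """Overlay local evidence on the status stored in the run record."""
    if record_status in CAN_GO_STALE and not has_checkpoint and not lock_is_live:
        return "stale"  # nothing on disk, and no executor holds the run
    return record_status
```

With either a checkpoint or a live lock present, the recorded status is reported unchanged.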

See Docker, Compose & one-click PaaS and Runtime modes, adapters & resource limits for deployment guidance on keeping that state durable.

Where to go next

Workflows & Architect API: Full run endpoints (trigger, approve, retry, cancel, delete, stats).

Step types reference: human_approval, human_input, and every other step type in detail.

Variables, expressions & triggers: Consume human_input output and reference step results.

Workflows, editor & runs: The dashboard DAG editor and run viewer in depth.

Realtime: WebSocket, SSE & Live Canvas: The SSE event stream that drives live run updates.

Docker, Compose & one-click PaaS: Keep checkpoint state durable on ephemeral hosts.
