Runs, approvals & checkpoints
Run lifecycle and statuses, human approval/input, retry, checkpoints, and stale-run handling.
Once a workflow is authored and stored, every execution becomes a run — a tracked instance with its own status, per-step results, and a checkpoint on disk. This page covers the run lifecycle and its statuses, how to pause a run for human approval or freeform input, how to retry a failed run, how checkpoints persist state, and the stale-run gotcha you hit on ephemeral hosts like Cloud Run.
Use this page when you operate workflows day to day — watching runs in the dashboard, clearing approval gates, retrying failures. If you have not run a workflow yet, start with Your first workflow. For the full definition CRUD and run endpoints, see the Workflows & Architect API.
Run lifecycle & status
A run is created when you launch a workflow (from the dashboard Execute button, `POST /api/workflows/run/{name}`, a cron trigger, or an entity-event trigger). The operator executes the step DAG, and the gateway reports status by overlaying local checkpoint data on top of the Kumiho run record, so progress stays current even between persistence flushes.
| Status | Meaning |
|---|---|
| `pending` | Created but not yet picked up by the executor. |
| `running` | The executor is active. |
| `paused` | Waiting at a `human_approval` or `human_input` gate. |
| `completed` | All steps finished successfully. |
| `failed` | A step failed; the checkpoint is preserved so you can retry. |
| `cancelled` | Stopped by a user or the system. |
| `stale` | A non-terminal status with no checkpoint and no live lock; see Stale runs. |
Inspecting a run
Fetch a single run by its id. Every workflow route requires the pairing bearer token (see Pairing & authentication).

    GET /api/workflows/runs/{run_id}
    Authorization: Bearer <token>

The response is `{ "run": { ... } }` with per-step detail. Live runs are read from the executor’s in-memory status; finished runs are resolved from the Kumiho run record. An unknown `run_id` returns 404.
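As a rough sketch, a client could assemble this request with Python's standard library. The base URL `http://localhost:8080`, the token value, and the helper name `build_run_request` are assumptions for illustration; only the path and the bearer header come from this page.

```python
import urllib.parse
import urllib.request


def build_run_request(base_url: str, token: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for a single run."""
    safe_id = urllib.parse.quote(run_id, safe="")
    url = f"{base_url.rstrip('/')}/api/workflows/runs/{safe_id}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical gateway address, token, and run id, for illustration only.
req = build_run_request("http://localhost:8080", "example-token", "run-123")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a 404 for an unknown `run_id` would then surface as `urllib.error.HTTPError`.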
To list recent runs:

    GET /api/workflows/runs?limit=20&workflow=hello-world
    Authorization: Bearer <token>

| Query param | Default | Meaning |
|---|---|---|
| `limit` | 20 | Number of runs to return. |
| `workflow` | (none) | Filter by workflow name. |
Cancel and delete
A cancel request goes to the executor, which halts the run at the next step boundary and kills owned shell/python subprocesses where possible. Deleting a run removes its Kumiho run record plus, on a best-effort basis, its local checkpoint and artifact files. It does not delete or deprecate the workflow definition (definitions use separate `/api/workflows/{kref}` routes).
| Method | Path | Effect |
|---|---|---|
| POST | `/api/workflows/runs/{run_id}/cancel` | Stop an active run at the next boundary. |
| DELETE | `/api/workflows/runs/{run_id}` | Delete the run record and local files. |
The full run endpoint surface (list, trigger, approve, retry, cancel, delete, dashboard stats, agent-activity) is documented in the Workflows & Architect API.
Checkpoints
With `checkpoint: true` (the workflow-level default), the executor saves run state to disk after each step completes and on pause for human approval. This is what makes retry and approval-resume possible.

    name: deploy-pipeline
    version: "1.0"
    checkpoint: true   # workflow-level default is true
    steps:
      - id: build
        type: shell
        shell:
          command: "npm run build"

| Artifact | Path | Purpose |
|---|---|---|
| Checkpoint file | `~/.revka/workflow_checkpoints/{run_id}.json` | Snapshot of completed steps and their outputs. |
| Lock file | `~/.revka/workflow_locks/{run_id[:12]}.lock` | Advisory lock held while the run is executing. |
The gateway reads both files to determine the live status of a run: an active lock means the executor is alive, and the checkpoint supplies progress up to the most recently completed step.
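To make the two paths concrete, here is a minimal sketch that derives them from a run id and reports which files exist on the current host. The helper names are hypothetical, and file existence is only a proxy: this page does not say how the gateway decides that a lock is actively held.

```python
from __future__ import annotations

from pathlib import Path


def run_state_paths(run_id: str, home: Path | None = None) -> dict[str, Path]:
    """Return the checkpoint and lock paths for a run, per the table above."""
    root = (home or Path.home()) / ".revka"
    return {
        "checkpoint": root / "workflow_checkpoints" / f"{run_id}.json",
        # The lock file name uses only the first 12 characters of the run id.
        "lock": root / "workflow_locks" / f"{run_id[:12]}.lock",
    }


def local_state(run_id: str, home: Path | None = None) -> dict[str, bool]:
    """Report which of the two files currently exist on this host."""
    return {name: path.exists() for name, path in run_state_paths(run_id, home).items()}
```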
Per-step retry handles transient failures during a run. Set `retry` (number of additional attempts after the first) and an optional `retry_delay` (seconds between attempts):

    - id: flaky_step
      type: agent
      agent:
        role: researcher
        prompt: "Fetch and summarize the latest report."
      retry: 2         # retry up to 2 times after the first attempt
      retry_delay: 10  # wait 10 seconds between retries

| Field | Type | Default | Meaning |
|---|---|---|---|
| `retry` | int | 0 | Extra attempts after the first failure. |
| `retry_delay` | int (seconds) | 0 | Delay between retry attempts. |
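The semantics of the two fields can be illustrated with a small sketch. This is not the executor's code; `run_with_retry` and its injectable `sleep` parameter are assumptions made so the behaviour is easy to follow and test.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    retry: int = 0,
    retry_delay: float = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `step`, allowing `retry` extra attempts after the first one."""
    attempts = 1 + max(retry, 0)
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            sleep(retry_delay)  # wait before the next attempt
    raise AssertionError("unreachable")
```

With `retry: 2`, a step that fails twice and then succeeds completes on its third attempt, after two delays of `retry_delay` seconds.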
Retrying a whole run
When a run reaches `failed`, the preserved checkpoint lets you retry from the first failed step: successful step outputs are reused, so only the failed step and its downstream steps re-execute.
On the Workflow Runs page, select the failed run and click Retry. The retry path re-launches execution from the checkpoint.

The same retry is available over the API:

    POST /api/workflows/runs/{run_id}/retry
    Authorization: Bearer <token>
    Content-Type: application/json

    { "cwd": "/path/to/project" }

| Field | Required | Meaning |
|---|---|---|
| `cwd` | no | Working directory for shell and agent steps on the retry. |
The gateway forwards the request to the operator and broadcasts a workflow_retry SSE event to dashboard clients.
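The rule that only the failed step and its downstream steps re-execute can be sketched as a graph traversal. The `depends_on` mapping (step id to the ids it depends on) is a hypothetical representation; this page does not show how the executor stores DAG edges.

```python
from __future__ import annotations

from collections import deque


def steps_to_rerun(depends_on: dict[str, list[str]], failed_step: str) -> set[str]:
    """Return the failed step plus every step downstream of it."""
    # Invert the dependency map: step -> steps that depend on it.
    dependents: dict[str, list[str]] = {step: [] for step in depends_on}
    for step, parents in depends_on.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(step)

    rerun = {failed_step}
    queue = deque([failed_step])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in rerun:
                rerun.add(child)
                queue.append(child)
    return rerun


# Hypothetical four-step workflow: deploy needs both test and lint.
dag = {"build": [], "lint": [], "test": ["build"], "deploy": ["test", "lint"]}
print(sorted(steps_to_rerun(dag, "test")))  # ['deploy', 'test']
```

In this example `build` and `lint` keep their checkpointed outputs, while `test` and `deploy` run again.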
Human approval
A `human_approval` step pauses the run and waits for a yes/no decision. The checkpoint is written before the run pauses, so an approval that arrives hours later resumes cleanly.

    - id: approve
      type: human_approval
      human_approval:
        message: "Deploy to production?"
        timeout: 3600   # seconds (here, 1 hour)

| Field | Type | Meaning |
|---|---|---|
| `message` | string | The prompt shown to the approver. |
| `timeout` | int (seconds) | How long to wait before the gate times out. |
While it waits at the gate, the run's status is `paused`. Submit the decision over the API:

    POST /api/workflows/runs/{run_id}/approve
    Authorization: Bearer <token>
    Content-Type: application/json

    { "approved": true, "feedback": "LGTM, ship it." }

| Field | Required | Meaning |
|---|---|---|
| `approved` | yes | `true` to approve and continue, `false` to reject. |
| `feedback` | no | Freeform note passed back into the run. |
On resolution the gateway broadcasts a human_approval_resolved SSE event (carrying run_id and approved) to dashboard clients.
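A script that clears approval gates might build the request as follows. This is a sketch under assumptions: the base URL and the helper name `build_approval_request` are hypothetical, and the request is constructed but not sent.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_approval_request(
    base_url: str,
    token: str,
    run_id: str,
    approved: bool,
    feedback: Optional[str] = None,
) -> urllib.request.Request:
    """Build the POST request that approves or rejects a paused run."""
    body = {"approved": approved}
    if feedback is not None:
        body["feedback"] = feedback  # optional freeform note
    safe_id = urllib.parse.quote(run_id, safe="")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/workflows/runs/{safe_id}/approve",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Omitting `feedback` leaves the key out of the JSON body entirely, which matches its "not required" status in the table above.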
Human input
A `human_input` step pauses for freeform text instead of a yes/no. The submitted text becomes available to downstream steps as `${step_id.output}`.

    - id: ask_user
      type: human_input
      human_input:
        message: "What changes do you want?"
        channel: dashboard
        timeout: 3600

| Field | Type | Meaning |
|---|---|---|
| `message` | string | The prompt shown to the operator. |
| `channel` | string | Where to ask (for example, `dashboard`). |
| `timeout` | int (seconds) | How long to wait for a response. |
A later step can consume the answer directly, e.g. prompt: "Apply these changes: ${ask_user.output}". See Variables, expressions & triggers for the full namespace list.
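To show how a `${step_id.output}` reference resolves, here is a deliberately small sketch. It handles only the `.output` form; the real expression engine supports more namespaces, and the function name `interpolate` is an assumption.

```python
from __future__ import annotations

import re

# Matches ${some_step.output}; other namespaces are out of scope for this sketch.
_OUTPUT_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\.output\}")


def interpolate(template: str, outputs: dict[str, str]) -> str:
    """Replace each ${step_id.output} with that step's recorded output."""

    def substitute(match: re.Match) -> str:
        step_id = match.group(1)
        if step_id not in outputs:
            raise KeyError(f"no output recorded for step {step_id!r}")
        return outputs[step_id]

    # A function replacement inserts the text literally, with no escape handling.
    return _OUTPUT_REF.sub(substitute, template)


prompt = interpolate(
    "Apply these changes: ${ask_user.output}",
    {"ask_user": "make the header blue"},
)
print(prompt)  # Apply these changes: make the header blue
```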
Approval registry (workflow human-approval)
Approval gates are not limited to the dashboard; they can also be cleared from a chat channel. The gateway keeps a process-global approval registry that bridges workflow `human_approval` steps to Discord, Slack, and Telegram replies.
When a run hits an approval step, it registers a pending approval in this registry. After the channel adapter posts the approval prompt, it attaches the channel’s thread and message IDs so the registry can scope the match precisely. When a user replies, the registry matches the message and atomically claims the approval (try_claim), which removes the entry so it cannot be double-resolved — this is what prevents a race between, say, a Discord reply and a dashboard click landing at the same time.
| Match | Rule |
|---|---|
| Approve | Message starts with one of the approve keywords (case-insensitive). |
| Reject | Message starts with a reject keyword; text after the keyword becomes the feedback. |
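The claim-once behaviour and the keyword rules above can be sketched together. Everything here is illustrative: the keyword lists are hypothetical (the real ones are not given on this page), and the registry is keyed by run id for brevity rather than by thread and message IDs.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists, for illustration only.
APPROVE_KEYWORDS = ("approve", "lgtm")
REJECT_KEYWORDS = ("reject", "deny")


@dataclass
class Decision:
    approved: bool
    feedback: str = ""


def parse_reply(text: str) -> Optional[Decision]:
    """Match a chat reply against the approve/reject rules (case-insensitive)."""
    stripped = text.strip()
    lowered = stripped.lower()
    for keyword in APPROVE_KEYWORDS:
        if lowered.startswith(keyword):
            return Decision(approved=True)
    for keyword in REJECT_KEYWORDS:
        if lowered.startswith(keyword):
            # Text after the keyword becomes the feedback.
            return Decision(approved=False, feedback=stripped[len(keyword):].strip())
    return None  # not an approval reply


class ApprovalRegistry:
    """Pending approvals that can each be claimed exactly once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: dict[str, dict] = {}

    def register(self, run_id: str, entry: dict) -> None:
        with self._lock:
            self._pending[run_id] = entry

    def try_claim(self, run_id: str) -> Optional[dict]:
        """Atomically remove and return the entry, or None if already claimed."""
        with self._lock:
            return self._pending.pop(run_id, None)
```

Because `try_claim` removes the entry under the lock, a chat reply and a dashboard click that arrive together cannot both resolve the same gate: the second caller gets `None`.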
For the underlying real-time surface and how SSE events reach the dashboard, see Realtime: WebSocket, SSE & Live Canvas. Tool-call-level approvals (a separate gate that sits underneath the workflow) are covered in Policy, commands & sandboxing and Autonomy levels & approvals.
Dashboard — visual DAG & run viewer
The dashboard keeps workflow definitions and workflow runs separate. The runs surface (`/runs`) gives you run-scoped controls only (stop an active run, retry a failed one, or delete a run record) and a live graph view.
The run viewer renders the workflow as an interactive DAG and pins each run to the exact YAML revision it executed (by resolving its workflow_revision_kref). This means later edits to the definition never change the graph of an already-completed run — what you see is what actually ran.
- Completed conditional nodes show the matched branch, goto target, branch value, and emitted output, in both the graph and the step inspector.
- Failed nodes show the best available failure detail: the executor error, structured `output_data`, an stderr preview, or the captured inputs.
- The step inspector (click any node) shows status, agent id and role, an output preview, injected skills, and any group-chat transcript.
- “Run to here” uses `target_step_id` to execute only the transitive ancestor closure of a target step plus the step itself, then stop, which is handy for iterating on one branch.
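The "Run to here" selection in the last bullet amounts to an ancestor closure over the DAG. A minimal sketch, assuming a hypothetical `depends_on` mapping from each step id to the ids it depends on:

```python
from __future__ import annotations


def run_to_here(depends_on: dict[str, list[str]], target_step_id: str) -> set[str]:
    """Return the target step plus all of its transitive ancestors."""
    selected: set[str] = set()
    stack = [target_step_id]
    while stack:
        step = stack.pop()
        if step in selected:
            continue  # already visited via another path
        selected.add(step)
        stack.extend(depends_on.get(step, []))
    return selected


# Hypothetical four-step workflow: deploy needs both test and lint.
dag = {"build": [], "lint": [], "test": ["build"], "deploy": ["test", "lint"]}
print(sorted(run_to_here(dag, "test")))  # ['build', 'test']
```

Here targeting `test` selects `build` and `test` only; `lint` and `deploy` are left out of the partial run.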
The dashboard also exposes aggregated stats via GET /api/workflows/dashboard (definition count, active runs, recent runs) and a per-agent drill-down via GET /api/workflows/agent-activity/{agent_id}, backed by JSONL run logs at ~/.revka/operator_mcp/runlogs/{agent_id}.jsonl. For the full dashboard walkthrough, see Workflows, editor & runs.
Stale runs & the Cloud Run gotcha
Checkpoints and lock files live on the local filesystem of the host running the operator. On a persistent host this is invisible because the files survive between requests. On ephemeral hosts that wipe disk between deploys or scale-to-zero cycles (Cloud Run, similar serverless PaaS), those files vanish while the Kumiho run record still says `running` or `paused`.
When the gateway then loads such a run and finds no checkpoint and no live lock, it cannot honestly report it as live, so it surfaces it as stale. A stale run is effectively orphaned: there is no executor holding it and no on-disk state to resume from.
See Docker, Compose & one-click PaaS and Runtime modes, adapters & resource limits for deployment guidance on keeping that state durable.
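The stale rule described above can be written as a small decision function. This is a sketch, not the gateway's implementation: the function name is hypothetical, and treating `completed`, `failed`, and `cancelled` as the terminal statuses is an assumption.

```python
# Assumption: these three statuses are the terminal ones.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})


def reported_status(record_status: str, has_checkpoint: bool, has_live_lock: bool) -> str:
    """Return the status to surface for a run, applying the stale rule."""
    if record_status in TERMINAL_STATUSES:
        return record_status  # finished runs are never stale
    if not has_checkpoint and not has_live_lock:
        return "stale"  # non-terminal, but nothing on disk backs it
    return record_status


# A run recorded as running whose files were wiped by a redeploy:
print(reported_status("running", has_checkpoint=False, has_live_lock=False))  # stale
```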