Runs, approvals & checkpoints

Run lifecycle and statuses, human approval/input, retry, checkpoints, and stale-run handling.

Once a workflow is authored and stored, every execution becomes a run — a tracked instance with its own status, per-step results, and a checkpoint on disk. This page covers the run lifecycle and its statuses, how to pause a run for human approval or freeform input, how to retry a failed run, how checkpoints persist state, and the stale-run gotcha you hit on ephemeral hosts like Cloud Run.

Use this page when you operate workflows day to day — watching runs in the dashboard, clearing approval gates, retrying failures. If you have not run a workflow yet, start with Your first workflow. For the full definition CRUD and run endpoints, see the Workflows & Architect API.

A run is created when you launch a workflow (from the dashboard Execute button, POST /api/workflows/run/{name}, a cron trigger, or an entity-event trigger). The operator executes the step DAG and the gateway reports status by overlaying local checkpoint data on top of the Kumiho run record, so progress stays current even between persistence flushes.

| Status | Meaning |
| --- | --- |
| pending | Created but not yet picked up by the executor. |
| running | The executor is active. |
| paused | Waiting at a human_approval or human_input gate. |
| completed | All steps finished successfully. |
| failed | A step failed; the checkpoint is preserved so you can retry. |
| cancelled | Stopped by a user or the system. |
| stale | A non-terminal status with no checkpoint and no live lock (see Stale runs). |

Fetch a single run by its id. Every workflow route requires the pairing bearer token (see Pairing & authentication).

GET /api/workflows/runs/{run_id}
Authorization: Bearer <token>

The response is { "run": { ... } } with per-step detail. Live runs are read from the executor’s in-memory status; finished runs are resolved from the Kumiho run record. An unknown run_id returns 404.
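As a minimal client-side sketch of the request above, the following uses only the Python standard library. The base URL is a placeholder assumption (the gateway address is not given on this page), and the helper names are illustrative rather than part of any official client.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: replace with your gateway address


def build_get_run_request(run_id: str, token: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) GET /api/workflows/runs/{run_id} with the bearer token."""
    url = f"{base_url}/api/workflows/runs/{urllib.parse.quote(run_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_run(run_id: str, token: str, base_url: str = BASE_URL):
    """Return the object under the "run" key, or None when the gateway answers 404."""
    request = build_get_run_request(run_id, token, base_url)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)["run"]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # unknown run_id
            return None
        raise
```

Splitting request construction from sending keeps the URL and header logic easy to check without a live gateway.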

To list recent runs:

GET /api/workflows/runs?limit=20&workflow=hello-world
Authorization: Bearer <token>

| Query param | Default | Meaning |
| --- | --- | --- |
| limit | 20 | Number of runs to return. |
| workflow | | Filter by workflow name. |

Stopping a run sends a cancel request to the executor, which halts at the next step boundary and kills owned shell/python subprocesses where possible. Deleting a run removes its Kumiho run record and, on a best-effort basis, its local checkpoint and artifact files. It does not delete or deprecate the workflow definition (definitions use separate /api/workflows/{kref} routes).

| Method | Path | Effect |
| --- | --- | --- |
| POST | /api/workflows/runs/{run_id}/cancel | Stop an active run at the next boundary. |
| DELETE | /api/workflows/runs/{run_id} | Delete the run record and local files. |

The full run endpoint surface (list, trigger, approve, retry, cancel, delete, dashboard stats, agent-activity) is documented in the Workflows & Architect API.

With checkpoint: true (the workflow-level default), the executor saves run state to disk after each step completes and on pause for human approval. This is what makes retry and approval-resume possible.

name: deploy-pipeline
version: "1.0"
checkpoint: true  # workflow-level default is true
steps:
  - id: build
    type: shell
    shell:
      command: "npm run build"

| Artifact | Path | Purpose |
| --- | --- | --- |
| Checkpoint file | ~/.revka/workflow_checkpoints/{run_id}.json | Snapshot of completed steps and their outputs. |
| Lock file | ~/.revka/workflow_locks/{run_id[:12]}.lock | Advisory lock held while the run is executing. |

The gateway reads both files to determine the live status of a run: an active lock means the executor is alive, and the checkpoint supplies progress up to the last completed step.
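To make the path layout concrete, here is a small sketch that derives both file locations from a run id. The function names and the `home` parameter are illustrative assumptions; only the directory names and the 12-character lock prefix come from the table above.

```python
from pathlib import Path

REVKA_HOME = Path.home() / ".revka"


def checkpoint_path(run_id: str, home: Path = REVKA_HOME) -> Path:
    """Checkpoint file: named after the full run id."""
    return home / "workflow_checkpoints" / f"{run_id}.json"


def lock_path(run_id: str, home: Path = REVKA_HOME) -> Path:
    """Lock file: named after the first 12 characters of the run id."""
    return home / "workflow_locks" / f"{run_id[:12]}.lock"
```

Note that the lock name is a prefix of the run id, so a tool that inspects lock files should compare on the first 12 characters rather than the full id.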

Per-step retry handles transient failures during a run. Set retry (number of additional attempts after the first) and an optional retry_delay (seconds between attempts):

- id: flaky_step
  type: agent
  agent:
    role: researcher
    prompt: "Fetch and summarize the latest report."
  retry: 2         # retry up to 2 times after the first attempt
  retry_delay: 10  # wait 10 seconds between retries

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| retry | int | 0 | Extra attempts after the first failure. |
| retry_delay | int (seconds) | 0 | Delay between retry attempts. |
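The semantics of the two fields can be sketched as a plain retry loop. This is an illustration of "extra attempts after the first" under simple assumptions, not the executor's actual code; `run_with_retry` and its injectable `sleep` parameter are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    retry: int = 0,
    retry_delay: float = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `step` once, then up to `retry` more times if it raises."""
    attempts = 1 + max(0, retry)  # retry counts attempts AFTER the first
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            if retry_delay > 0:
                sleep(retry_delay)  # wait only between attempts
    raise AssertionError("unreachable")
```

With retry: 2 and retry_delay: 10, a step that fails twice and then succeeds runs three times in total with two 10-second waits.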

When a run reaches failed, the preserved checkpoint lets you retry from the first failed step — successful step outputs are reused, so only the failed step and its downstream steps re-execute.
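Which steps re-execute can be pictured as a downstream walk over the step DAG. The sketch below assumes each step lists its upstream dependencies in a `depends_on` mapping; that representation is hypothetical and chosen only to illustrate the idea.

```python
def steps_to_rerun(depends_on: dict[str, list[str]], failed_step: str) -> set[str]:
    """Return the failed step plus every step reachable downstream of it."""
    # Invert the edges: parent -> steps that depend on it.
    dependents: dict[str, list[str]] = {}
    for step, parents in depends_on.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(step)

    to_rerun: set[str] = set()
    stack = [failed_step]
    while stack:
        step = stack.pop()
        if step in to_rerun:
            continue
        to_rerun.add(step)
        stack.extend(dependents.get(step, []))
    return to_rerun
```

For a DAG where deploy depends on test and lint, and test depends on build, a failure in test yields test and deploy; the outputs of build and lint are reused from the checkpoint.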

On the Workflow Runs page, select the failed run and click Retry. The retry path re-launches execution from the checkpoint.

A human_approval step pauses the run and waits for a yes/no decision. The checkpoint is written before the run pauses, so an approval that arrives hours later resumes cleanly.

- id: approve
  type: human_approval
  human_approval:
    message: "Deploy to production?"
    timeout: 3600  # seconds (here, 1 hour)

| Field | Type | Meaning |
| --- | --- | --- |
| message | string | The prompt shown to the approver. |
| timeout | int (seconds) | How long to wait before the gate times out. |

While paused, the run sits in paused. Submit the decision over the API:

POST /api/workflows/runs/{run_id}/approve
Authorization: Bearer <token>
Content-Type: application/json

{ "approved": true, "feedback": "LGTM, ship it." }

| Field | Required | Meaning |
| --- | --- | --- |
| approved | yes | true to approve and continue, false to reject. |
| feedback | no | Freeform note passed back into the run. |
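A script that clears gates could build the approval request as follows. This is a sketch using the standard library; the function name and the base URL argument are assumptions, while the path, headers, and body fields mirror the request shown above.

```python
import json
import urllib.request
from typing import Optional


def build_approve_request(
    base_url: str,
    run_id: str,
    token: str,
    approved: bool,
    feedback: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) POST /api/workflows/runs/{run_id}/approve."""
    payload = {"approved": approved}
    if feedback is not None:  # feedback is optional, so omit it when unset
        payload["feedback"] = feedback
    return urllib.request.Request(
        f"{base_url}/api/workflows/runs/{run_id}/approve",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the decision; set `approved=False` with a feedback string to reject.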

On resolution the gateway broadcasts a human_approval_resolved SSE event (carrying run_id and approved) to dashboard clients.

A human_input step pauses for freeform text instead of a yes/no. The submitted text becomes available to downstream steps as ${step_id.output}.

- id: ask_user
  type: human_input
  human_input:
    message: "What changes do you want?"
    channel: dashboard
    timeout: 3600

| Field | Type | Meaning |
| --- | --- | --- |
| message | string | The prompt shown to the operator. |
| channel | string | Where to ask (for example, dashboard). |
| timeout | int (seconds) | How long to wait for a response. |

A later step can consume the answer directly, e.g. prompt: "Apply these changes: ${ask_user.output}". See Variables, expressions & triggers for the full namespace list.
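The substitution can be illustrated with a small resolver that handles only the `${step_id.output}` form. It is a simplified sketch: the real expression engine supports more namespaces, and the regex and function name here are assumptions.

```python
import re

# Matches ${step_id.output}; step ids are assumed to be word characters or hyphens.
_OUTPUT_REF = re.compile(r"\$\{([A-Za-z_][\w-]*)\.output\}")


def interpolate(template: str, outputs: dict[str, str]) -> str:
    """Replace each ${step_id.output} with that step's recorded output."""

    def replace(match: re.Match) -> str:
        step_id = match.group(1)
        if step_id not in outputs:
            raise KeyError(f"no output recorded for step {step_id!r}")
        return outputs[step_id]

    return _OUTPUT_REF.sub(replace, template)
```

Raising on an unknown step id, rather than leaving the placeholder in place, makes a mistyped reference fail loudly in this sketch.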

Approval registry (workflow human-approval)

Approval gates are not limited to the dashboard — they can be cleared from a chat channel. The gateway keeps a process-global approval registry that bridges workflow human_approval steps to Discord, Slack, and Telegram replies.

When a run hits an approval step, it registers a pending approval in this registry. After the channel adapter posts the approval prompt, it attaches the channel’s thread and message IDs so the registry can scope the match precisely. When a user replies, the registry matches the message and atomically claims the approval (try_claim), which removes the entry so it cannot be double-resolved — this is what prevents a race between, say, a Discord reply and a dashboard click landing at the same time.

| Match | Rule |
| --- | --- |
| Approve | Message starts with one of the approve keywords (case-insensitive). |
| Reject | Message starts with a reject keyword; text after the keyword becomes the feedback. |
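The claim-once behavior and the keyword matching can be sketched together. Everything here is illustrative: the keyword lists are placeholders (the actual keywords are not listed on this page), and the registry is keyed by run id only, without the thread and message scoping described above.

```python
import threading
from dataclasses import dataclass
from typing import Optional

APPROVE_KEYWORDS = ("approve",)  # hypothetical keyword list
REJECT_KEYWORDS = ("reject",)    # hypothetical keyword list


@dataclass(frozen=True)
class Decision:
    approved: bool
    feedback: str = ""


def parse_reply(text: str) -> Optional[Decision]:
    """Map a chat reply to a decision, or None if it matches no keyword."""
    stripped = text.strip()
    lowered = stripped.lower()
    for keyword in APPROVE_KEYWORDS:
        if lowered.startswith(keyword):
            return Decision(approved=True)
    for keyword in REJECT_KEYWORDS:
        if lowered.startswith(keyword):
            # Text after the keyword becomes the feedback.
            return Decision(approved=False, feedback=stripped[len(keyword):].strip())
    return None


class ApprovalRegistry:
    """Pending approvals that can each be claimed exactly once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: set[str] = set()

    def register(self, run_id: str) -> None:
        with self._lock:
            self._pending.add(run_id)

    def try_claim(self, run_id: str) -> bool:
        """Atomically remove the entry; only the first caller gets True."""
        with self._lock:
            if run_id in self._pending:
                self._pending.remove(run_id)
                return True
            return False
```

Because the check and the removal happen under one lock, a chat reply and a dashboard click racing on the same run cannot both succeed; the loser sees False and can drop its decision.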

For the underlying real-time surface and how SSE events reach the dashboard, see Realtime: WebSocket, SSE & Live Canvas. Tool-call-level approvals (a separate gate that sits underneath the workflow) are covered in Policy, commands & sandboxing and Autonomy levels & approvals.

The dashboard keeps workflow definitions and workflow runs separate. The runs surface (/runs) gives you run-scoped controls only — stop an active run, retry a failed one, or delete a run record — and a live graph view.

The run viewer renders the workflow as an interactive DAG and pins each run to the exact YAML revision it executed (by resolving its workflow_revision_kref). This means later edits to the definition never change the graph of an already-completed run — what you see is what actually ran.

  • Completed conditional nodes show the matched branch, goto target, branch value, and emitted output, in both the graph and the step inspector.
  • Failed nodes show the best available failure detail — the executor error, structured output_data, an stderr preview, or the captured inputs.
  • The step inspector (click any node) shows status, agent id and role, an output preview, injected skills, and any group-chat transcript.
  • “Run to here” uses target_step_id to execute only the transitive ancestor closure of a target step plus the step itself, then stop — handy for iterating on one branch.
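The "Run to here" selection described in the last bullet amounts to an upstream walk from the target step. The sketch below assumes a hypothetical `depends_on` mapping from each step to its direct upstream steps; it is not the dashboard's implementation.

```python
def run_to_here(depends_on: dict[str, list[str]], target_step_id: str) -> set[str]:
    """Return the target step plus all of its transitive ancestors."""
    selected: set[str] = set()
    stack = [target_step_id]
    while stack:
        step = stack.pop()
        if step in selected:
            continue
        selected.add(step)
        stack.extend(depends_on.get(step, []))  # walk upstream edges
    return selected
```

Steps outside the returned set, such as sibling branches and anything downstream of the target, are skipped for that run.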

The dashboard also exposes aggregated stats via GET /api/workflows/dashboard (definition count, active runs, recent runs) and a per-agent drill-down via GET /api/workflows/agent-activity/{agent_id}, backed by JSONL run logs at ~/.revka/operator_mcp/runlogs/{agent_id}.jsonl. For the full dashboard walkthrough, see Workflows, editor & runs.

Checkpoints and lock files live on the local filesystem of the host running the operator. On a persistent host this is invisible — the files survive between requests. On ephemeral hosts that wipe disk between deploys or scale-to-zero cycles (Cloud Run, similar serverless PaaS), those files vanish while the Kumiho run record still says running or paused.

When the gateway then loads such a run and finds no checkpoint and no live lock, it cannot honestly report it as live, so it surfaces it as stale. A stale run is effectively orphaned: there is no executor holding it and no on-disk state to resume from.
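The rule reduces to a small decision function. This is a sketch of the logic as described here, with the two filesystem checks passed in as booleans; the function and parameter names are assumptions.

```python
NON_TERMINAL = frozenset({"pending", "running", "paused"})


def effective_status(recorded_status: str, has_checkpoint: bool, lock_is_live: bool) -> str:
    """Report "stale" for a non-terminal run with no checkpoint and no live lock."""
    if recorded_status in NON_TERMINAL and not has_checkpoint and not lock_is_live:
        return "stale"
    return recorded_status
```

Terminal statuses pass through unchanged, so a completed run whose local files were wiped is still reported as completed.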

See Docker, Compose & one-click PaaS and Runtime modes, adapters & resource limits for deployment guidance on keeping that state durable.