Observability & tracing

Observer backends (log/verbose/prometheus/otel), Prometheus metrics, OTLP export, and runtime traces.

Revka emits telemetry through a pluggable Observer pipeline. A single config key — [observability] backend — selects how the runtime records the agent lifecycle: discard it (none), log it (log), print it to your terminal (verbose), expose it as Prometheus metrics (prometheus), or export traces and metrics over OTLP (otel). A separate runtime trace logger persists structured JSONL events to disk for after-the-fact debugging, queryable with revka doctor traces.

Reach for this page when you want to wire Revka into Grafana, Jaeger/Tempo/Honeycomb, or your existing log pipeline, or when you need to inspect exactly which tool calls and model replies ran. Everything here is configured under the [observability] section of ~/.revka/config.toml; see the Configuration overview for how that file is resolved.

The Observer pipeline

Every backend implements one Observer trait with two hot-path methods, a shutdown drain, and a name accessor:

Method	When it fires	Notes
`record_event(&ObserverEvent)`	On each lifecycle event	Called synchronously on the hot path — backends must never block.
`record_metric(&ObserverMetric)`	On each measurement	Same synchronous contract.
`flush()`	On graceful shutdown	Drains buffered spans/metrics (used by the OTel backend).
`name()`	Anytime	Backend identifier for logs.

Because observers run inline with agent execution, the built-in backends are designed to be cheap (a log line, a counter increment, or a buffer append). Heavy work, such as the OTLP network export, is buffered and pushed asynchronously, then drained by flush() at shutdown.
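The buffer-then-drain pattern described above can be sketched with a channel and a worker thread. This is only an illustration under stated assumptions: `BufferedExporter`, `Span`, and `Msg` are hypothetical names, not Revka's types, and pushing into a `Vec` stands in for the real network export.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a span produced on the hot path.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub name: String,
}

enum Msg {
    Span(Span),
    /// Ask the worker to acknowledge once everything queued before this is exported.
    Flush(Sender<()>),
}

/// `record` only enqueues; a background worker performs the slow export.
pub struct BufferedExporter {
    tx: Sender<Msg>,
    exported: Arc<Mutex<Vec<Span>>>,
    _worker: JoinHandle<()>,
}

impl BufferedExporter {
    pub fn new() -> Self {
        let (tx, rx) = channel::<Msg>();
        let exported = Arc::new(Mutex::new(Vec::new()));
        let sink = Arc::clone(&exported);
        let worker = thread::spawn(move || {
            // Messages arrive in send order, so a Flush is handled only
            // after every span that was queued before it.
            for msg in rx {
                match msg {
                    // Pushing to a Vec stands in for the network export.
                    Msg::Span(span) => sink.lock().unwrap().push(span),
                    Msg::Flush(ack) => {
                        let _ = ack.send(());
                    }
                }
            }
        });
        Self { tx, exported, _worker: worker }
    }

    /// Hot path: an unbounded channel send, never a network call.
    pub fn record(&self, name: &str) {
        let _ = self.tx.send(Msg::Span(Span { name: name.to_string() }));
    }

    /// Shutdown drain: block until the worker has handled everything queued so far.
    pub fn flush(&self) {
        let (ack_tx, ack_rx) = channel();
        if self.tx.send(Msg::Flush(ack_tx)).is_ok() {
            let _ = ack_rx.recv();
        }
    }

    pub fn exported_count(&self) -> usize {
        self.exported.lock().unwrap().len()
    }
}
```

Calling `record` twice and then `flush` leaves both spans in the exported list. Without the `flush` call there is no guarantee the worker has caught up, which is why the drain matters at shutdown.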

The factory in the runtime reads backend at startup and constructs the matching observer once. There is no per-request switching.

Event taxonomy

record_event receives one of these ObserverEvent variants, covering the full agent loop, channels, the heartbeat, the response cache, errors, “hands” (agent runs), and DORA deployment signals:

Group	Events
Agent loop	`AgentStart`, `AgentEnd`, `LlmRequest`, `LlmResponse`, `ToolCallStart`, `ToolCall`, `TurnComplete`
Channels & heartbeat	`ChannelMessage`, `HeartbeatTick`
Response cache	`CacheHit`, `CacheMiss`
Errors	`Error`
Hands (agent runs)	`HandStarted`, `HandCompleted`, `HandFailed`
Deployments (DORA)	`DeploymentStarted`, `DeploymentCompleted`, `DeploymentFailed`, `RecoveryCompleted`

CacheHit / CacheMiss distinguish the hot (in-memory) and warm (SQLite) cache layers and carry tokens_saved, so you can quantify what the response cache saved you.
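As a rough sketch of how cache events could be turned into per-layer savings figures, the code below tallies hits, misses, and tokens saved for each layer. The `CacheEvent` shape here is a simplified assumption (in particular, it assumes only hits contribute to `tokens_saved`); the real `ObserverEvent` variants may carry different fields.

```rust
use std::collections::HashMap;

/// Cache layers named in the text: hot (in-memory) and warm (SQLite).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum CacheLayer {
    Hot,
    Warm,
}

/// Hypothetical, simplified event shape used only for this sketch.
#[derive(Debug, Clone, Copy)]
pub enum CacheEvent {
    Hit { layer: CacheLayer, tokens_saved: u64 },
    Miss { layer: CacheLayer },
}

#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct LayerStats {
    pub hits: u64,
    pub misses: u64,
    pub tokens_saved: u64,
}

impl LayerStats {
    /// Fraction of lookups served from this layer; 0.0 when nothing was recorded.
    pub fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }
}

/// Fold a stream of cache events into one stats record per layer.
pub fn tally(events: &[CacheEvent]) -> HashMap<CacheLayer, LayerStats> {
    let mut stats: HashMap<CacheLayer, LayerStats> = HashMap::new();
    for event in events {
        match *event {
            CacheEvent::Hit { layer, tokens_saved } => {
                let s = stats.entry(layer).or_default();
                s.hits += 1;
                s.tokens_saved += tokens_saved;
            }
            CacheEvent::Miss { layer } => {
                stats.entry(layer).or_default().misses += 1;
            }
        }
    }
    stats
}
```

The three fields correspond loosely to the per-`cache_type` hit, miss, and tokens-saved counters listed in the Prometheus metrics reference.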

Metric taxonomy

record_metric receives one of these ObserverMetric variants:

Metric	Type	Meaning
`RequestLatency(Duration)`	timing	Wall-clock latency of an LLM request
`TokensUsed(u64)`	count	Tokens consumed by the last request
`ActiveSessions(u64)`	gauge	Currently active sessions
`QueueDepth(u64)`	gauge	Pending work queue depth
`HandRunDuration`	timing	Duration of a hand (agent) run
`HandFindingsCount`	count	Findings produced by a hand
`HandSuccessRate`	ratio	Rolling hand success rate
`DeploymentLeadTime(Duration)`	timing	DORA lead time for changes
`RecoveryTime(Duration)`	timing	DORA time to restore service

Choosing a backend

[observability]
backend = "none"   # "none" | "noop" | "log" | "verbose" | "prometheus" | "otel"

`backend`	What it does	External deps	Build feature
`none` / `noop`	Zero-overhead no-op. Default.	None	—
`log`	Structured `tracing::info!` lines for every event and metric	None	—
`verbose`	Human-readable `>` / `<` progress lines on stderr (interactive only)	None	—
`prometheus`	Exposes metrics at `GET /metrics`	Prometheus server	`observability-prometheus`
`otel`	Pushes traces + metrics over OTLP HTTP	OTel collector	`observability-otel`

`none` / `noop` (default)

The safe default. All observer methods compile to no-ops — no overhead, no dependencies. The factory also falls back here (with a warn!) when a feature-gated backend is requested but its Cargo feature is absent, or when backend is an unrecognised string.
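The fallback rules can be illustrated with a small resolver. This is a sketch, not the runtime's factory: `Backend`, `Features`, and `resolve_backend` are hypothetical names, the `Features` struct stands in for compile-time Cargo feature checks so the logic can be exercised directly, and exact (case-sensitive) string matching is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Noop,
    Log,
    Verbose,
    Prometheus,
    Otel,
}

/// Which optional Cargo features this (hypothetical) build includes.
pub struct Features {
    pub prometheus: bool,
    pub otel: bool,
}

/// Returns the backend to construct, plus a warning message when a fallback happened.
pub fn resolve_backend(name: &str, features: &Features) -> (Backend, Option<String>) {
    match name {
        "none" | "noop" => (Backend::Noop, None),
        "log" => (Backend::Log, None),
        "verbose" => (Backend::Verbose, None),
        "prometheus" if features.prometheus => (Backend::Prometheus, None),
        "otel" | "opentelemetry" | "otlp" if features.otel => (Backend::Otel, None),
        // A known feature-gated backend whose feature is missing: fall back and warn.
        "prometheus" | "otel" | "opentelemetry" | "otlp" => (
            Backend::Noop,
            Some(format!("backend {name:?} requested but its feature is not compiled in")),
        ),
        // Unrecognised string: fall back and warn.
        other => (
            Backend::Noop,
            Some(format!("unknown backend {other:?}, using no-op observer")),
        ),
    }
}
```

With both features enabled, `"otlp"` resolves to `Backend::Otel`. With neither enabled, `"prometheus"` resolves to `Backend::Noop` and carries a warning, which is the situation the `warn!` line in the daemon log points at.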

`log`

Emits every event and metric as a structured tracing::info! line with named fields (agent.start, tool.call, cache.hit, metric.tokens_used, …). It has no external dependencies and works with any tracing subscriber, so it composes with your existing log shipping:

RUST_LOG=info revka daemon

If you run the daemon under a JSON-formatting tracing-subscriber, the output is structured JSON ready for ingestion. This is the recommended first step before adding Prometheus or OTel.

`verbose`

Prints compact, human-readable progress to stderr for interactive CLI sessions — LLM thinking, tool start/end, and turn completion. It does not record metrics and only shows progress indicators, never prompt content:

> Thinking
> Send (provider=openrouter, model=claude-sonnet, messages=3)
< Receive (success=true, duration_ms=412)
> Tool shell
< Complete

Prometheus metrics

Build with the feature, select the backend, and scrape /metrics:

Build with the Prometheus feature.

cargo build --release --features observability-prometheus

Select the backend.
```
[observability]
backend = "prometheus"
```
Scrape the endpoint. It is served by the gateway at /metrics, unauthenticated, in Prometheus text format.
```
curl http://127.0.0.1:42617/metrics
```

GET /metrics
Auth:    none (read-only)
Returns: text/plain; version=0.0.4

If the backend is not prometheus, or the binary was built without observability-prometheus, the /metrics endpoint returns a human-readable hint instead of metrics. If curl shows a short message with no metric samples, the backend or feature is not active.
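One way to tell a real scrape from the hint programmatically is to look for sample lines in the response body. The sketch below is a deliberately small reader for the Prometheus text format, not a full parser: `sum_metric` and `has_samples` are hypothetical helpers, and the check assumes the hint text contains no line shaped like `name value`.

```rust
/// Parse one exposition line into (metric name, value), or None if it is
/// blank, a `#` comment, or not shaped like a sample.
fn parse_sample(line: &str) -> Option<(&str, f64)> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // A sample looks like `name{labels} value [timestamp]` or `name value [timestamp]`.
    let name_end = line.find(|c: char| c == '{' || c == ' ')?;
    let rest = match line.rfind('}') {
        Some(i) => &line[i + 1..],
        None => &line[name_end..],
    };
    // The value is the first whitespace-separated token after the labels.
    let value = rest.split_whitespace().next()?.parse::<f64>().ok()?;
    Some((&line[..name_end], value))
}

/// True if the body contains at least one metric sample.
pub fn has_samples(body: &str) -> bool {
    body.lines().any(|l| parse_sample(l).is_some())
}

/// Sum every sample of `metric` across all label sets.
pub fn sum_metric(body: &str, metric: &str) -> f64 {
    body.lines()
        .filter_map(parse_sample)
        .filter(|(name, _)| *name == metric)
        .map(|(_, value)| value)
        .sum()
}
```

`has_samples` returning false on a successful response suggests the hint is being served. `sum_metric` is handy for a quick sanity check of a counter such as `revka_tokens_input_total` before building dashboards on it.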

Metrics reference

Metric	Type	Labels
`revka_agent_starts_total`	counter	`provider`, `model`
`revka_llm_requests_total`	counter	`provider`, `model`, `success`
`revka_tokens_input_total`	counter	`provider`, `model`
`revka_tokens_output_total`	counter	`provider`, `model`
`revka_agent_duration_seconds`	histogram (0.1–60s)	`provider`, `model`
`revka_tool_calls_total`	counter	`tool`, `success`
`revka_tool_duration_seconds`	histogram (0.01–10s)	`tool`
`revka_channel_messages_total`	counter	`channel`, `direction`
`revka_heartbeat_ticks_total`	counter	—
`revka_errors_total`	counter	`component`
`revka_cache_hits_total`	counter	`cache_type`
`revka_cache_misses_total`	counter	`cache_type`
`revka_cache_tokens_saved_total`	counter	`cache_type`
`revka_request_latency_seconds`	histogram (0.01–10s)	—
`revka_tokens_used_last`	gauge	—
`revka_active_sessions`	gauge	—
`revka_queue_depth`	gauge	—
`revka_hand_runs_total`	counter	`hand`, `success`
`revka_hand_duration_seconds`	histogram	`hand`
`revka_hand_findings_total`	counter	`hand`
`revka_deployments_total`	counter	`status`
`revka_deployment_lead_time_seconds`	summary	—
`revka_deployment_failure_rate`	gauge	—
`revka_recovery_time_seconds`	summary	—
`revka_mttr_seconds`	summary	—

Scrape config & Grafana

Point a Prometheus server at the gateway:

scrape_configs:
  - job_name: "revka"
    static_configs:
      - targets: ["127.0.0.1:42617"]

From there, build Grafana panels on the metrics above — for example, a token-spend graph from revka_tokens_input_total / revka_tokens_output_total by provider and model, tool-latency heatmaps from revka_tool_duration_seconds, and a DORA dashboard from the deployment series. For dollar-cost tracking specifically, prefer the dedicated Cost tracking & budgets ledger, which records computed USD per call.

OpenTelemetry (OTLP)

The OTel backend exports both traces and metrics over OTLP HTTP/protobuf to any OpenTelemetry-compatible collector — Jaeger, Tempo, Honeycomb, Datadog, and others.

Build with the OTel feature.

cargo build --release --features observability-otel

Configure the backend and collector endpoint.

[observability]
backend = "otel"                          # aliases: "opentelemetry", "otlp"
otel_endpoint = "http://localhost:4318"   # default
otel_service_name = "revka"               # default; sets the service.name resource attribute

Start the daemon. Spans and metrics begin flowing to the collector.
```
revka daemon
```

Key	Type	Default	Meaning
`otel_endpoint`	string	`http://localhost:4318`	OTLP HTTP base URL. Traces are posted to `<endpoint>/v1/traces`, metrics to `<endpoint>/v1/metrics`.
`otel_service_name`	string	`"revka"`	`service.name` resource attribute on all spans and metrics.
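To make the endpoint derivation concrete, here is a minimal sketch that builds the two export URLs from the base endpoint. `OtelConfig` is a hypothetical struct mirroring the two config keys; trimming a trailing slash before appending the path is an assumption about how a careful implementation would behave, not documented behaviour.

```rust
/// The two `[observability]` keys that apply to the `otel` backend.
#[derive(Debug, Clone, PartialEq)]
pub struct OtelConfig {
    pub otel_endpoint: String,
    pub otel_service_name: String,
}

impl Default for OtelConfig {
    /// Defaults as listed in the table above.
    fn default() -> Self {
        Self {
            otel_endpoint: "http://localhost:4318".to_string(),
            otel_service_name: "revka".to_string(),
        }
    }
}

impl OtelConfig {
    /// Base URL without a trailing slash (assumed normalisation).
    fn base(&self) -> &str {
        self.otel_endpoint.trim_end_matches('/')
    }

    /// Where traces are posted: `<endpoint>/v1/traces`.
    pub fn traces_url(&self) -> String {
        format!("{}/v1/traces", self.base())
    }

    /// Where metrics are posted: `<endpoint>/v1/metrics`.
    pub fn metrics_url(&self) -> String {
        format!("{}/v1/metrics", self.base())
    }
}
```

The practical consequence is that `otel_endpoint` should be the collector's base URL, without the `/v1/traces` or `/v1/metrics` suffix.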

Spans

The backend creates spans for agent.invocation, llm.call, tool.call, hand.run, and error, with attributes drawn from the event payloads:

Attribute	Appears on
`provider`, `model`, `success`, `duration_s`	`llm.call`, `agent.invocation`
`tokens_used`, `cost_usd`	`llm.call`
`tool.name`	`tool.call`
`hand.name`	`hand.run`
`error.message`	`error`

Metric instruments mirror the Prometheus set, prefixed revka.*, and are pushed over OTLP rather than scraped.

Runtime trace logger

Independent of the metrics backend, the runtime trace logger persists structured JSONL events — tool calls, model replies, and errors — to disk for post-hoc diagnostics. It is disabled by default.

[observability]
runtime_trace_mode = "rolling"                    # "none" | "rolling" | "full"
runtime_trace_path = "state/runtime-trace.jsonl"  # relative to workspace unless absolute
runtime_trace_max_entries = 200                   # rolling mode only

Key	Type	Default	Meaning
`runtime_trace_mode`	string	`"none"`	`none` (disabled), `rolling` (keep last N), `full` (unbounded)
`runtime_trace_path`	string	`state/runtime-trace.jsonl`	Trace file; relative paths resolve against the workspace
`runtime_trace_max_entries`	integer	`200`	Max lines retained in `rolling` mode

Each RuntimeTraceEvent line carries: id (UUID), timestamp (RFC 3339), event_type, optional channel / provider / model / turn_id / success / message, and a payload JSON object.

Mode tradeoffs:

rolling trims on every append via an atomic temp-file rename, so the file never grows beyond runtime_trace_max_entries lines — safe to leave on.
full grows unbounded — use it only for short-lived debugging.
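The rolling behaviour can be sketched as "append, trim to the newest N lines, write a temp file, rename over the original". The function below is an illustration only: `append_rolling` is a hypothetical name, and it rereads the whole file on every append, which is reasonable for a few hundred entries but says nothing about how Revka actually implements it.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Append one JSONL line, then keep only the newest `max_entries` lines.
pub fn append_rolling(path: &Path, line: &str, max_entries: usize) -> io::Result<()> {
    // Read what is already on disk; a missing file means "no entries yet".
    let existing = match fs::read_to_string(path) {
        Ok(text) => text,
        Err(e) if e.kind() == io::ErrorKind::NotFound => String::new(),
        Err(e) => return Err(e),
    };
    let mut lines: Vec<&str> = existing.lines().filter(|l| !l.is_empty()).collect();
    lines.push(line);
    // Index of the first line to keep so that at most `max_entries` remain.
    let start = lines.len().saturating_sub(max_entries);

    // Write the trimmed content to a sibling temp file, then rename it over
    // the original so readers never observe a half-written trace file.
    let tmp = path.with_extension("jsonl.tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        for l in &lines[start..] {
            writeln!(file, "{l}")?;
        }
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}
```

Appending five events with `max_entries = 3` leaves only the last three lines in the file, oldest first.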

Querying traces

Inspect the trace file with revka doctor traces (not a separate revka trace command). Events are returned newest-first, and the list view truncates the message preview at 80 characters.

revka doctor traces                                   # 20 most recent events
revka doctor traces --limit 50                        # show 50 events
revka doctor traces --event tool_call_result          # exact event-type filter
revka doctor traces --contains "timeout"              # full-text substring search
revka doctor traces --id <uuid>                       # one event, full JSON payload

Flag	Default	Meaning
`--limit <n>`	`20`	Maximum events to list
`--event <type>`	—	Case-insensitive exact match on `event_type`
`--contains <text>`	—	Substring search across `event_type`, `message`, `payload`, `channel`, `provider`, `model`
`--id <uuid>`	—	Fetch a single event by UUID as pretty-printed JSON (ignores other filters)

If runtime_trace_mode = "none" the file does not exist, and revka doctor traces prints a message telling you to enable rolling mode first.
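The filter semantics in the flag table can be expressed in a few lines, which may help when post-processing the JSONL file yourself. Everything here is a sketch: `TraceEvent` keeps only the searchable fields and stores the payload as raw text, `Query`, `query`, and `preview` are hypothetical names, and `--contains` is assumed to be case-sensitive because the source does not say otherwise.

```rust
/// Simplified trace record: only the fields the filters look at.
#[derive(Debug, Clone, Default)]
pub struct TraceEvent {
    pub id: String,
    pub event_type: String,
    pub message: Option<String>,
    pub channel: Option<String>,
    pub provider: Option<String>,
    pub model: Option<String>,
    /// Raw JSON text of the payload object.
    pub payload: String,
}

/// Mirrors `--limit`, `--event`, and `--contains`.
#[derive(Debug, Default)]
pub struct Query<'a> {
    pub limit: Option<usize>,
    pub event: Option<&'a str>,
    pub contains: Option<&'a str>,
}

impl TraceEvent {
    fn matches(&self, q: &Query<'_>) -> bool {
        // --event: case-insensitive exact match on event_type.
        let event_ok = q
            .event
            .map_or(true, |e| self.event_type.eq_ignore_ascii_case(e));
        // --contains: substring search across the listed fields.
        let contains_ok = q.contains.map_or(true, |needle| {
            let opt = |o: &Option<String>| o.as_deref().map_or(false, |s| s.contains(needle));
            self.event_type.contains(needle)
                || self.payload.contains(needle)
                || opt(&self.message)
                || opt(&self.channel)
                || opt(&self.provider)
                || opt(&self.model)
        });
        event_ok && contains_ok
    }
}

/// `events` is in file order (oldest first); results come back newest-first.
pub fn query<'e>(events: &'e [TraceEvent], q: &Query<'_>) -> Vec<&'e TraceEvent> {
    events
        .iter()
        .rev()
        .filter(|e| e.matches(q))
        .take(q.limit.unwrap_or(20))
        .collect()
}

/// List-view preview: at most 80 characters of the message.
pub fn preview(message: &str) -> String {
    message.chars().take(80).collect()
}
```

Note that the limit is applied after filtering, so a limit of 20 combined with an event filter yields the 20 newest matching events rather than the matches among the 20 newest events. That ordering is an assumption of this sketch.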

Multi-observer fan-out

Internally, Revka can compose observers with MultiObserver, which fans out every event and metric to a list of child observers (and propagates flush() to all of them) — for example, emitting log and prometheus simultaneously.

MultiObserver::new(vec![Box::new(obs1), Box::new(obs2)])
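To show what the fan-out amounts to, here is a self-contained sketch with a cut-down trait. The `Observer` trait below is a simplified assumption (events are plain strings and `record_metric` is omitted), and `CountingObserver` is a hypothetical test double; only the `MultiObserver::new(vec![...])` call shape comes from the text above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Cut-down stand-in for the real trait.
pub trait Observer {
    fn record_event(&self, event: &str);
    fn flush(&self) {}
    fn name(&self) -> &str;
}

/// Test double that counts calls through shared counters.
pub struct CountingObserver {
    label: &'static str,
    events: Arc<AtomicUsize>,
    flushes: Arc<AtomicUsize>,
}

impl CountingObserver {
    pub fn new(label: &'static str, events: Arc<AtomicUsize>, flushes: Arc<AtomicUsize>) -> Self {
        Self { label, events, flushes }
    }
}

impl Observer for CountingObserver {
    fn record_event(&self, _event: &str) {
        self.events.fetch_add(1, Ordering::SeqCst);
    }
    fn flush(&self) {
        self.flushes.fetch_add(1, Ordering::SeqCst);
    }
    fn name(&self) -> &str {
        self.label
    }
}

/// Forwards every call to each child, in order.
pub struct MultiObserver {
    children: Vec<Box<dyn Observer>>,
}

impl MultiObserver {
    pub fn new(children: Vec<Box<dyn Observer>>) -> Self {
        Self { children }
    }
}

impl Observer for MultiObserver {
    fn record_event(&self, event: &str) {
        for child in &self.children {
            child.record_event(event);
        }
    }
    fn flush(&self) {
        for child in &self.children {
            child.flush();
        }
    }
    fn name(&self) -> &str {
        "multi"
    }
}
```

Because `MultiObserver` itself implements `Observer`, callers do not need to know whether they hold one backend or several. Each child still runs inline, so the cost of an event is the sum of the children's costs.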

`[observability]` config reference

The full section, with defaults:

[observability]
backend = "none"                                  # "none" | "noop" | "log" | "verbose" | "prometheus" | "otel"
otel_endpoint = "http://localhost:4318"           # OTel only
otel_service_name = "revka"                        # OTel only
runtime_trace_mode = "none"                        # "none" | "rolling" | "full"
runtime_trace_path = "state/runtime-trace.jsonl"
runtime_trace_max_entries = 200

Key	Default	Applies to
`backend`	`"none"`	All
`otel_endpoint`	`"http://localhost:4318"`	`otel`
`otel_service_name`	`"revka"`	`otel`
`runtime_trace_mode`	`"none"`	Runtime traces
`runtime_trace_path`	`"state/runtime-trace.jsonl"`	Runtime traces
`runtime_trace_max_entries`	`200`	Runtime traces (`rolling`)

Every field except backend is optional and the defaults are safe (no-op observer, no trace file). The opentelemetry and otlp values are aliases for otel.

Gateway endpoints

Two unauthenticated gateway endpoints surface observability data. Both are served by the running gateway (default http://127.0.0.1:42617):

Endpoint	Method	Auth	Returns
`/metrics`	`GET`	none	Prometheus metrics (`text/plain; version=0.0.4`), or a hint if Prometheus is not the active backend
`/health`	`GET`	none	Component health snapshot JSON (always `200`; inspect the body for status)

curl http://127.0.0.1:42617/metrics
curl http://127.0.0.1:42617/health

The /health endpoint is the primary liveness signal used by Docker HEALTHCHECK, load balancers, and revka status --format exit-code. For its full response shape and the component health registry, see Status, health, config & tools endpoints and the Updating, runbook & troubleshooting page.

Next steps

Cost tracking & budgets: per-call USD ledger, daily/monthly budgets, and the /api/cost endpoint.

revka doctor, status & self-test: run diagnostics, query traces, and probe model catalogs.

Updating, runbook & troubleshooting: health signals, incident triage, and safe rollout/rollback.

Cargo feature flags & ADRs: build with observability-prometheus and observability-otel.
