
Observability & tracing

Observer backends (log/verbose/prometheus/otel), Prometheus metrics, OTLP export, and runtime traces.

Revka emits telemetry through a pluggable Observer pipeline. A single config key — [observability] backend — selects how the runtime records the agent lifecycle: discard it (none), log it (log), print it to your terminal (verbose), expose it as Prometheus metrics (prometheus), or export traces and metrics over OTLP (otel). A separate runtime trace logger persists structured JSONL events to disk for after-the-fact debugging, queryable with revka doctor traces.

Reach for this page when you want to wire Revka into Grafana, Jaeger/Tempo/Honeycomb, or your existing log pipeline, or when you need to inspect exactly which tool calls and model replies ran. Everything here is configured under the [observability] section of ~/.revka/config.toml; see the Configuration overview for how that file is resolved.

Every backend implements a single Observer trait with two hot-path methods, a shutdown drain, and a name accessor:

| Method | When it fires | Notes |
| --- | --- | --- |
| record_event(&ObserverEvent) | On each lifecycle event | Called synchronously on the hot path; backends must never block. |
| record_metric(&ObserverMetric) | On each measurement | Same synchronous contract. |
| flush() | On graceful shutdown | Drains buffered spans/metrics (used by the OTel backend). |
| name() | Anytime | Backend identifier for logs. |

Because observers run inline with agent execution, the built-in backends are designed to be cheap (a log line, a counter increment, or a buffer append). Heavy work — like the OTLP network export — is buffered and pushed asynchronously, then force-flushed on flush() at shutdown.
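
As a rough illustration of that contract, the sketch below shows what such a trait and a cheap backend might look like. The trait shape is inferred from the method table, the enums are a trimmed subset with assumed payload fields, and CountingObserver is a hypothetical backend that is not part of Revka.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Trimmed, hypothetical stand-ins for Revka's event and metric enums.
#[derive(Debug)]
pub enum ObserverEvent {
    AgentStart,
    ToolCall { tool: String, success: bool },
    Error { message: String },
}

#[derive(Debug)]
pub enum ObserverMetric {
    RequestLatency(Duration),
    TokensUsed(u64),
}

// Assumed trait shape: two hot-path methods, a shutdown drain, and a name.
pub trait Observer: Send + Sync {
    fn record_event(&self, event: &ObserverEvent);
    fn record_metric(&self, metric: &ObserverMetric);
    fn flush(&self) {} // nothing buffered by default
    fn name(&self) -> &str;
}

// A cheap backend: each call is one atomic increment, so it never blocks.
#[derive(Default)]
pub struct CountingObserver {
    events: AtomicU64,
    tokens: AtomicU64,
}

impl CountingObserver {
    pub fn events(&self) -> u64 {
        self.events.load(Ordering::Relaxed)
    }
    pub fn tokens(&self) -> u64 {
        self.tokens.load(Ordering::Relaxed)
    }
}

impl Observer for CountingObserver {
    fn record_event(&self, _event: &ObserverEvent) {
        self.events.fetch_add(1, Ordering::Relaxed);
    }
    fn record_metric(&self, metric: &ObserverMetric) {
        // Only token counts are accumulated; other metrics are ignored.
        if let ObserverMetric::TokensUsed(n) = metric {
            self.tokens.fetch_add(*n, Ordering::Relaxed);
        }
    }
    fn name(&self) -> &str {
        "counting"
    }
}
```

Taking &self and using atomics keeps both hot-path methods lock-free, which is one way to honour the "must never block" rule; a backend that needs network I/O would instead append to a buffer here and do the slow work elsewhere.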

The factory in the runtime reads backend at startup and constructs the matching observer once. There is no per-request switching.

record_event receives one of these ObserverEvent variants, covering the full agent loop, channels, the heartbeat, the response cache, “hands” (agent runs), and DORA deployment signals:

| Group | Events |
| --- | --- |
| Agent loop | AgentStart, AgentEnd, LlmRequest, LlmResponse, ToolCallStart, ToolCall, TurnComplete |
| Channels & heartbeat | ChannelMessage, HeartbeatTick |
| Response cache | CacheHit, CacheMiss |
| Errors | Error |
| Hands (agent runs) | HandStarted, HandCompleted, HandFailed |
| Deployments (DORA) | DeploymentStarted, DeploymentCompleted, DeploymentFailed, RecoveryCompleted |

CacheHit / CacheMiss distinguish the hot (in-memory) and warm (SQLite) cache layers and carry tokens_saved, so you can quantify what the response cache saved you.

record_metric receives one of these ObserverMetric variants:

| Metric | Type | Meaning |
| --- | --- | --- |
| RequestLatency(Duration) | timing | Wall-clock latency of an LLM request |
| TokensUsed(u64) | count | Tokens consumed by the last request |
| ActiveSessions(u64) | gauge | Currently active sessions |
| QueueDepth(u64) | gauge | Pending work queue depth |
| HandRunDuration | timing | Duration of a hand (agent) run |
| HandFindingsCount | count | Findings produced by a hand |
| HandSuccessRate | ratio | Rolling hand success rate |
| DeploymentLeadTime(Duration) | timing | DORA lead time for changes |
| RecoveryTime(Duration) | timing | DORA time to restore service |

[observability]
backend = "none" # "none" | "noop" | "log" | "verbose" | "prometheus" | "otel"

| backend | What it does | External deps | Build feature |
| --- | --- | --- | --- |
| none / noop | Zero-overhead no-op. Default. | None | |
| log | Structured tracing::info! lines for every event and metric | None | |
| verbose | Human-readable > / < progress lines on stderr (interactive only) | None | |
| prometheus | Exposes metrics at GET /metrics | Prometheus server | observability-prometheus |
| otel | Pushes traces + metrics over OTLP HTTP | OTel collector | observability-otel |

The none backend (alias noop) is the safe default. All observer methods compile to no-ops, so it adds no overhead and no dependencies. The factory also falls back to it, with a warn!, when a feature-gated backend is requested but its Cargo feature is absent, or when backend is an unrecognised string.
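
The fallback rules can be sketched as a small resolver. This is an illustration under assumptions, not Revka's factory: BackendKind and resolve_backend are hypothetical names, the two booleans stand in for compile-time Cargo feature checks, and matching is assumed to be case-sensitive.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendKind {
    Noop,
    Log,
    Verbose,
    Prometheus,
    Otel,
}

/// Map the `backend` config string to a backend kind.
/// `has_prometheus` / `has_otel` stand in for compile-time feature checks
/// (hypothetical parameters, used here so the sketch is testable).
pub fn resolve_backend(backend: &str, has_prometheus: bool, has_otel: bool) -> BackendKind {
    match backend {
        "none" | "noop" => BackendKind::Noop,
        "log" => BackendKind::Log,
        "verbose" => BackendKind::Verbose,
        "prometheus" if has_prometheus => BackendKind::Prometheus,
        // "opentelemetry" and "otlp" are aliases for "otel".
        "otel" | "opentelemetry" | "otlp" if has_otel => BackendKind::Otel,
        other => {
            // Reached for unknown strings and for feature-gated backends
            // whose feature is missing: warn and fall back to the no-op.
            eprintln!(
                "warning: observability backend {:?} is unavailable; using noop",
                other
            );
            BackendKind::Noop
        }
    }
}
```

Putting the feature check in a match guard lets a missing feature and an unknown string share one fallback arm, which mirrors the single warn-and-fall-back behaviour described above.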

The log backend emits every event and metric as a structured tracing::info! line with named fields (agent.start, tool.call, cache.hit, metric.tokens_used, …). It has no external dependencies and works with any tracing subscriber, so it composes with your existing log shipping:

RUST_LOG=info revka daemon

If you run the daemon under a JSON-formatting tracing-subscriber, the output is structured JSON ready for ingestion. This is the recommended first step before adding Prometheus or OTel.

The verbose backend prints compact, human-readable progress to stderr for interactive CLI sessions: LLM thinking, tool start/end, and turn completion. It records no metrics and shows only progress indicators, never prompt content:

> Thinking
> Send (provider=openrouter, model=claude-sonnet, messages=3)
< Receive (success=true, duration_ms=412)
> Tool shell
< Complete

To use the prometheus backend, build with the feature, select the backend, and scrape /metrics:

  1. Build with the Prometheus feature.

    cargo build --release --features observability-prometheus
  2. Select the backend.

    [observability]
    backend = "prometheus"
  3. Scrape the endpoint. It is served by the gateway at /metrics, unauthenticated, in Prometheus text format.

    curl http://127.0.0.1:42617/metrics

GET /metrics
Auth: none (read-only)
Returns: text/plain; version=0.0.4

If the backend is not prometheus, or the binary was built without observability-prometheus, the /metrics endpoint returns a human-readable hint instead of metrics. A response with no metric lines therefore means the backend or the feature is not active.

| Metric | Type | Labels |
| --- | --- | --- |
| revka_agent_starts_total | counter | provider, model |
| revka_llm_requests_total | counter | provider, model, success |
| revka_tokens_input_total | counter | provider, model |
| revka_tokens_output_total | counter | provider, model |
| revka_agent_duration_seconds | histogram (0.1–60s) | provider, model |
| revka_tool_calls_total | counter | tool, success |
| revka_tool_duration_seconds | histogram (0.01–10s) | tool |
| revka_channel_messages_total | counter | channel, direction |
| revka_heartbeat_ticks_total | counter | |
| revka_errors_total | counter | component |
| revka_cache_hits_total | counter | cache_type |
| revka_cache_misses_total | counter | cache_type |
| revka_cache_tokens_saved_total | counter | cache_type |
| revka_request_latency_seconds | histogram (0.01–10s) | |
| revka_tokens_used_last | gauge | |
| revka_active_sessions | gauge | |
| revka_queue_depth | gauge | |
| revka_hand_runs_total | counter | hand, success |
| revka_hand_duration_seconds | histogram | hand |
| revka_hand_findings_total | counter | hand |
| revka_deployments_total | counter | status |
| revka_deployment_lead_time_seconds | summary | |
| revka_deployment_failure_rate | gauge | |
| revka_recovery_time_seconds | summary | |
| revka_mttr_seconds | summary | |
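
For a quick sanity check of a scrape without a Prometheus server, a labelled counter can be totalled straight from the text response. The helper below is a minimal sketch, not a full parser: sum_metric is a hypothetical name, and it assumes one sample per line with no trailing timestamp and no spaces after the value.

```rust
/// Sum every sample of `metric` in a Prometheus text-format scrape,
/// across all label combinations. Returns 0.0 when the metric is absent.
pub fn sum_metric(scrape: &str, metric: &str) -> f64 {
    scrape
        .lines()
        .map(str::trim)
        // Skip blank lines and `# HELP` / `# TYPE` comment lines.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| {
            // Assumption: the value is the last space-separated field.
            let (series, value) = line.rsplit_once(' ')?;
            // The metric name ends at the first '{' when labels are present.
            let name = series.split('{').next()?.trim();
            if name == metric {
                value.parse::<f64>().ok()
            } else {
                None
            }
        })
        .sum()
}
```

Fed the body of curl http://127.0.0.1:42617/metrics, sum_metric(body, "revka_tokens_input_total") would give the input-token total across every provider and model. Because the hint response contains no sample lines, it yields 0.0 for every metric.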

Point a Prometheus server at the gateway:

scrape_configs:
  - job_name: "revka"
    static_configs:
      - targets: ["127.0.0.1:42617"]

From there, build Grafana panels on the metrics above: for example, a token-spend graph from revka_tokens_input_total and revka_tokens_output_total grouped by provider and model, tool-latency heatmaps from revka_tool_duration_seconds, and a DORA dashboard from the deployment series. For dollar-cost tracking specifically, prefer the dedicated Cost tracking & budgets ledger, which records computed USD per call.

The OTel backend exports both traces and metrics over OTLP HTTP/protobuf to any OpenTelemetry-compatible collector — Jaeger, Tempo, Honeycomb, Datadog, and others.

  1. Build with the OTel feature.

    cargo build --release --features observability-otel
  2. Configure the backend and collector endpoint.

    [observability]
    backend = "otel" # aliases: "opentelemetry", "otlp"
    otel_endpoint = "http://localhost:4318" # default
    otel_service_name = "revka" # default; sets the service.name resource attribute
  3. Start the daemon. Spans and metrics begin flowing to the collector.

    revka daemon

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| otel_endpoint | string | http://localhost:4318 | OTLP HTTP base URL. Traces are posted to <endpoint>/v1/traces, metrics to <endpoint>/v1/metrics. |
| otel_service_name | string | "revka" | service.name resource attribute on all spans and metrics. |
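
The endpoint-to-URL mapping in the table can be written out as a tiny helper. This is a sketch: otlp_urls is a hypothetical name, and stripping a trailing slash from the base URL is an assumption about how the backend treats it, not documented behaviour.

```rust
/// Derive the OTLP HTTP signal URLs from the configured base endpoint.
/// Returns `(traces_url, metrics_url)`.
pub fn otlp_urls(endpoint: &str) -> (String, String) {
    // Assumption: a trailing slash on the base URL is tolerated and stripped.
    let base = endpoint.trim_end_matches('/');
    (
        format!("{}/v1/traces", base),
        format!("{}/v1/metrics", base),
    )
}
```

With the default endpoint this gives http://localhost:4318/v1/traces and http://localhost:4318/v1/metrics, the URLs a collector's OTLP HTTP receiver needs to be reachable on.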

The backend creates spans for agent.invocation, llm.call, tool.call, hand.run, and error, with attributes drawn from the event payloads:

| Attribute | Appears on |
| --- | --- |
| provider, model, success, duration_s | llm.call, agent.invocation |
| tokens_used, cost_usd | llm.call |
| tool.name | tool.call |
| hand.name | hand.run |
| error.message | error |

Metric instruments mirror the Prometheus set, prefixed revka.*, and are pushed over OTLP rather than scraped.

Independent of the metrics backend, the runtime trace logger persists structured JSONL events — tool calls, model replies, and errors — to disk for post-hoc diagnostics. It is disabled by default.

[observability]
runtime_trace_mode = "rolling" # "none" | "rolling" | "full"
runtime_trace_path = "state/runtime-trace.jsonl" # relative to workspace unless absolute
runtime_trace_max_entries = 200 # rolling mode only

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| runtime_trace_mode | string | "none" | none (disabled), rolling (keep last N), full (unbounded) |
| runtime_trace_path | string | state/runtime-trace.jsonl | Trace file; relative paths resolve against the workspace |
| runtime_trace_max_entries | integer | 200 | Max lines retained in rolling mode |

Each RuntimeTraceEvent line carries: id (UUID), timestamp (RFC 3339), event_type, optional channel / provider / model / turn_id / success / message, and a payload JSON object.

Mode tradeoffs:

  • rolling trims on every append via an atomic temp-file rename, so the file never grows beyond runtime_trace_max_entries lines — safe to leave on.
  • full grows unbounded — use it only for short-lived debugging.
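
The rolling behaviour can be sketched with the standard library alone. This is a simplified illustration of the trim-then-rename idea, not Revka's implementation: append_rolling is a hypothetical name, the temp-file naming is an assumption, and the sketch rereads the whole file on each append.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Append `line` to the trace file at `path`, keeping only the newest
/// `max_entries` lines.
pub fn append_rolling(path: &Path, line: &str, max_entries: usize) -> io::Result<()> {
    // Read the existing lines; a missing file counts as empty.
    let existing = match fs::read_to_string(path) {
        Ok(text) => text,
        Err(e) if e.kind() == io::ErrorKind::NotFound => String::new(),
        Err(e) => return Err(e),
    };
    let mut lines: Vec<&str> = existing
        .lines()
        .filter(|l| !l.trim().is_empty())
        .collect();
    lines.push(line);

    // Drop the oldest entries so at most `max_entries` remain.
    let start = lines.len().saturating_sub(max_entries);
    let mut out = lines[start..].join("\n");
    out.push('\n');

    // Write a sibling temp file, then rename it over the original so a
    // reader sees either the old file or the new one, never a partial write.
    let tmp = path.with_extension("jsonl.tmp");
    fs::write(&tmp, out)?;
    fs::rename(&tmp, path)
}
```

Keeping the temp file next to the target matters: a rename is only a single-step replacement when both paths are on the same filesystem.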

Inspect the trace file with revka doctor traces (not a separate revka trace command). Events are returned newest-first, and the list view truncates the message preview at 80 characters.

revka doctor traces # 20 most recent events
revka doctor traces --limit 50 # show 50 events
revka doctor traces --event tool_call_result # exact event-type filter
revka doctor traces --contains "timeout" # full-text substring search
revka doctor traces --id <uuid> # one event, full JSON payload

| Flag | Default | Meaning |
| --- | --- | --- |
| --limit <n> | 20 | Maximum events to list |
| --event <type> | | Case-insensitive exact match on event_type |
| --contains <text> | | Substring search across event_type, message, payload, channel, provider, model |
| --id <uuid> | | Fetch a single event by UUID as pretty-printed JSON (ignores other filters) |
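
The flag semantics in the table can be modelled as a filter over in-memory events. The sketch below is an assumption-laden illustration rather than the CLI's code: TraceEvent is a trimmed, hypothetical view of a trace line, select is a hypothetical name, only three of the searched fields are covered, and substring matching is assumed to be case-sensitive.

```rust
/// Trimmed, hypothetical view of one trace line (the real event has more fields).
#[derive(Debug, Clone)]
pub struct TraceEvent {
    pub id: String,
    pub event_type: String,
    pub message: Option<String>,
    pub provider: Option<String>,
}

/// Apply `--event`, `--contains`, and `--limit` to events stored oldest-first,
/// returning matches newest-first.
pub fn select<'a>(
    events: &'a [TraceEvent],
    event: Option<&str>,
    contains: Option<&str>,
    limit: usize,
) -> Vec<&'a TraceEvent> {
    events
        .iter()
        .rev() // the file is appended to, so reversing gives newest-first
        // --event: case-insensitive exact match on event_type.
        .filter(|e| event.map_or(true, |t| e.event_type.eq_ignore_ascii_case(t)))
        // --contains: substring search (case-sensitivity is an assumption).
        .filter(|e| {
            contains.map_or(true, |needle| {
                e.event_type.contains(needle)
                    || e.message.as_deref().map_or(false, |m| m.contains(needle))
                    || e.provider.as_deref().map_or(false, |p| p.contains(needle))
            })
        })
        // --limit: applied after filtering, so it caps matches, not input.
        .take(limit)
        .collect()
}
```

Applying the limit after the filters is a design choice in this sketch; it means --limit 20 with --contains "timeout" would show up to 20 matching events rather than the matches among the last 20 lines.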

If runtime_trace_mode = "none" the file does not exist, and revka doctor traces prints a message telling you to enable rolling mode first.

Internally, Revka can compose observers with MultiObserver, which fans out every event and metric to a list of child observers (and propagates flush() to all of them) — for example, emitting log and prometheus simultaneously.

MultiObserver::new(vec![Box::new(obs1), Box::new(obs2)])
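
To make the fan-out concrete, here is a minimal sketch of the pattern. It is not Revka's MultiObserver: the Observer trait is reduced to two methods with string events, and Counter is a hypothetical child used only to show that every call reaches every child.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Simplified stand-in for the Observer trait; events are plain strings here.
pub trait Observer {
    fn record_event(&self, event: &str);
    fn flush(&self);
}

/// Fans every call out to each child observer, in order.
pub struct MultiObserver {
    children: Vec<Box<dyn Observer>>,
}

impl MultiObserver {
    pub fn new(children: Vec<Box<dyn Observer>>) -> Self {
        Self { children }
    }
}

impl Observer for MultiObserver {
    fn record_event(&self, event: &str) {
        for child in &self.children {
            child.record_event(event);
        }
    }
    fn flush(&self) {
        // Propagate the shutdown drain to every child.
        for child in &self.children {
            child.flush();
        }
    }
}

/// Hypothetical child that counts calls through shared counters.
pub struct Counter {
    pub events: Arc<AtomicUsize>,
    pub flushes: Arc<AtomicUsize>,
}

impl Observer for Counter {
    fn record_event(&self, _event: &str) {
        self.events.fetch_add(1, Ordering::SeqCst);
    }
    fn flush(&self) {
        self.flushes.fetch_add(1, Ordering::SeqCst);
    }
}
```

Because the composite implements the same trait as its children, callers hold a single observer and do not need to know how many backends sit behind it. The cost is that each hot-path call now takes as long as all children combined.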

The full section, with defaults:

[observability]
backend = "none" # "none" | "noop" | "log" | "verbose" | "prometheus" | "otel"
otel_endpoint = "http://localhost:4318" # OTel only
otel_service_name = "revka" # OTel only
runtime_trace_mode = "none" # "none" | "rolling" | "full"
runtime_trace_path = "state/runtime-trace.jsonl"
runtime_trace_max_entries = 200

| Key | Default | Applies to |
| --- | --- | --- |
| backend | "none" | All |
| otel_endpoint | "http://localhost:4318" | otel |
| otel_service_name | "revka" | otel |
| runtime_trace_mode | "none" | Runtime traces |
| runtime_trace_path | "state/runtime-trace.jsonl" | Runtime traces |
| runtime_trace_max_entries | 200 | Runtime traces (rolling) |

Every field is optional, including backend (which defaults to "none"), and the defaults are safe: a no-op observer and no trace file. The opentelemetry and otlp values are aliases for otel.

Two unauthenticated gateway endpoints surface observability data. Both are served by the running gateway (default http://127.0.0.1:42617):

| Endpoint | Method | Auth | Returns |
| --- | --- | --- | --- |
| /metrics | GET | none | Prometheus metrics (text/plain; version=0.0.4), or a hint if Prometheus is not the active backend |
| /health | GET | none | Component health snapshot JSON (always 200; inspect the body for status) |

curl http://127.0.0.1:42617/metrics
curl http://127.0.0.1:42617/health

The /health endpoint is the primary liveness signal used by Docker HEALTHCHECK, load balancers, and revka status --format exit-code. For its full response shape and the component health registry, see Status, health, config & tools endpoints and the Updating, runbook & troubleshooting page.