Observability & tracing
Observer backends (log/verbose/prometheus/otel), Prometheus metrics, OTLP export, and runtime traces.
Revka emits telemetry through a pluggable Observer pipeline. A single config key — [observability] backend — selects how the runtime records the agent lifecycle: discard it (none), log it (log), print it to your terminal (verbose), expose it as Prometheus metrics (prometheus), or export traces and metrics over OTLP (otel). A separate runtime trace logger persists structured JSONL events to disk for after-the-fact debugging, queryable with revka doctor traces.
Reach for this page when you want to wire Revka into Grafana, Jaeger/Tempo/Honeycomb, or your existing log pipeline, or when you need to inspect exactly which tool calls and model replies ran. Everything here is configured under the [observability] section of ~/.revka/config.toml; see the Configuration overview for how that file is resolved.
The Observer pipeline
Every backend implements one Observer trait with two hot-path methods plus a shutdown drain:
| Method | When it fires | Notes |
|---|---|---|
| record_event(&ObserverEvent) | On each lifecycle event | Called synchronously on the hot path; backends must never block. |
| record_metric(&ObserverMetric) | On each measurement | Same synchronous contract. |
| flush() | On graceful shutdown | Drains buffered spans/metrics (used by the OTel backend). |
| name() | Anytime | Backend identifier for logs. |
Because observers run inline with agent execution, the built-in backends are designed to be cheap (a log line, a counter increment, or a buffer append). Heavy work — like the OTLP network export — is buffered and pushed asynchronously, then force-flushed on flush() at shutdown.
The factory in the runtime reads backend at startup and constructs the matching observer once. There is no per-request switching.
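To make the contract concrete, here is a minimal sketch of what such a trait and a cheap backend could look like. The trait shape is inferred from the method table above, and the trimmed-down `ObserverEvent` / `ObserverMetric` enums and the `CountingObserver` type are hypothetical stand-ins, not Revka's actual definitions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Hypothetical, trimmed-down payload types; the real enums have more variants and fields.
pub enum ObserverEvent {
    AgentStart { provider: String, model: String },
    ToolCall { tool: String, success: bool },
}

pub enum ObserverMetric {
    RequestLatency(Duration),
    TokensUsed(u64),
}

// Assumed trait shape, inferred from the method table.
pub trait Observer: Send + Sync {
    fn record_event(&self, event: &ObserverEvent);
    fn record_metric(&self, metric: &ObserverMetric);
    // Backends with nothing buffered can keep the default no-op.
    fn flush(&self) {}
    fn name(&self) -> &str;
}

// A cheap backend: every call is a single atomic add, so it never blocks the hot path.
#[derive(Default)]
pub struct CountingObserver {
    events: AtomicU64,
    tokens: AtomicU64,
}

impl CountingObserver {
    pub fn events(&self) -> u64 {
        self.events.load(Ordering::Relaxed)
    }
    pub fn tokens(&self) -> u64 {
        self.tokens.load(Ordering::Relaxed)
    }
}

impl Observer for CountingObserver {
    fn record_event(&self, _event: &ObserverEvent) {
        self.events.fetch_add(1, Ordering::Relaxed);
    }
    fn record_metric(&self, metric: &ObserverMetric) {
        // Only token counts are accumulated; other metrics are ignored in this sketch.
        if let ObserverMetric::TokensUsed(n) = metric {
            self.tokens.fetch_add(*n, Ordering::Relaxed);
        }
    }
    fn name(&self) -> &str {
        "counting"
    }
}
```

Taking `&self` and using atomics keeps the observer shareable across threads without a lock, which is one way to honor the "must never block" rule.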
Event taxonomy
record_event receives one of these ObserverEvent variants, covering the full agent loop, channels, the heartbeat, the response cache, “hands” (agent runs), and DORA deployment signals:
| Group | Events |
|---|---|
| Agent loop | AgentStart, AgentEnd, LlmRequest, LlmResponse, ToolCallStart, ToolCall, TurnComplete |
| Channels & heartbeat | ChannelMessage, HeartbeatTick |
| Response cache | CacheHit, CacheMiss |
| Errors | Error |
| Hands (agent runs) | HandStarted, HandCompleted, HandFailed |
| Deployments (DORA) | DeploymentStarted, DeploymentCompleted, DeploymentFailed, RecoveryCompleted |
CacheHit / CacheMiss distinguish the hot (in-memory) and warm (SQLite) cache layers and carry tokens_saved, so you can quantify what the response cache saved you.
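As an illustration of how those savings could be tallied, the sketch below aggregates hits, misses, and saved tokens per cache layer. The `CacheEvent` payload shape is an assumption (only the hot/warm split and `tokens_saved` come from the text above), and `summarize` is a hypothetical helper.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum CacheLayer {
    Hot,  // in-memory
    Warm, // SQLite
}

// Hypothetical payloads: this sketch assumes only hits carry `tokens_saved`.
pub enum CacheEvent {
    Hit { layer: CacheLayer, tokens_saved: u64 },
    Miss { layer: CacheLayer },
}

#[derive(Default, Debug, PartialEq)]
pub struct CacheStats {
    pub hits: u64,
    pub misses: u64,
    pub tokens_saved: u64,
}

impl CacheStats {
    /// Fraction of lookups that hit, or `None` when nothing was recorded.
    pub fn hit_rate(&self) -> Option<f64> {
        let total = self.hits + self.misses;
        if total == 0 {
            None
        } else {
            Some(self.hits as f64 / total as f64)
        }
    }
}

/// Folds a stream of cache events into per-layer totals.
pub fn summarize(events: &[CacheEvent]) -> HashMap<CacheLayer, CacheStats> {
    let mut stats: HashMap<CacheLayer, CacheStats> = HashMap::new();
    for event in events {
        match event {
            CacheEvent::Hit { layer, tokens_saved } => {
                let entry = stats.entry(*layer).or_default();
                entry.hits += 1;
                entry.tokens_saved += *tokens_saved;
            }
            CacheEvent::Miss { layer } => {
                stats.entry(*layer).or_default().misses += 1;
            }
        }
    }
    stats
}
```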
Metric taxonomy
record_metric receives one of these ObserverMetric variants:
| Metric | Type | Meaning |
|---|---|---|
| RequestLatency(Duration) | timing | Wall-clock latency of an LLM request |
| TokensUsed(u64) | count | Tokens consumed by the last request |
| ActiveSessions(u64) | gauge | Currently active sessions |
| QueueDepth(u64) | gauge | Pending work queue depth |
| HandRunDuration | timing | Duration of a hand (agent) run |
| HandFindingsCount | count | Findings produced by a hand |
| HandSuccessRate | ratio | Rolling hand success rate |
| DeploymentLeadTime(Duration) | timing | DORA lead time for changes |
| RecoveryTime(Duration) | timing | DORA time to restore service |
Choosing a backend
```toml
[observability]
backend = "none" # "none" | "noop" | "log" | "verbose" | "prometheus" | "otel"
```

| backend | What it does | External deps | Build feature |
|---|---|---|---|
| none / noop | Zero-overhead no-op. Default. | None | - |
| log | Structured tracing::info! lines for every event and metric | None | - |
| verbose | Human-readable > / < progress lines on stderr (interactive only) | None | - |
| prometheus | Exposes metrics at GET /metrics | Prometheus server | observability-prometheus |
| otel | Pushes traces + metrics over OTLP HTTP | OTel collector | observability-otel |
none / noop (default)
The safe default. All observer methods compile to no-ops, with no overhead and no dependencies. The factory also falls back here (with a warn!) when a feature-gated backend is requested but its Cargo feature is absent, or when backend is an unrecognised string.
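The fallback rules can be sketched as a single match. This is an illustrative reconstruction, not the runtime's actual factory: `Backend`, `Features`, and `resolve_backend` are hypothetical names, and the warning is returned as a string rather than logged so the logic stays testable. The `opentelemetry` and `otlp` aliases are the ones documented for the OTel backend.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Noop,
    Log,
    Verbose,
    Prometheus,
    Otel,
}

// Hypothetical stand-in for the Cargo features compiled into the binary.
#[derive(Clone, Copy)]
pub struct Features {
    pub prometheus: bool,
    pub otel: bool,
}

/// Returns the backend to construct, plus an optional warning for the caller to log.
pub fn resolve_backend(name: &str, features: Features) -> (Backend, Option<String>) {
    match name {
        "none" | "noop" => (Backend::Noop, None),
        "log" => (Backend::Log, None),
        "verbose" => (Backend::Verbose, None),
        "prometheus" if features.prometheus => (Backend::Prometheus, None),
        "otel" | "opentelemetry" | "otlp" if features.otel => (Backend::Otel, None),
        // Known backend, but its feature was not compiled in: fall back to the no-op.
        "prometheus" | "otel" | "opentelemetry" | "otlp" => (
            Backend::Noop,
            Some(format!("backend '{name}' requires a build feature that is absent; using noop")),
        ),
        // Anything else is an unrecognised string: also fall back to the no-op.
        other => (
            Backend::Noop,
            Some(format!("unknown backend '{other}'; using noop")),
        ),
    }
}
```

Because the factory runs once at startup, a misspelled backend degrades to silence rather than failing the daemon, so the warning line is the only sign that telemetry is off.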
log
Emits every event and metric as a structured tracing::info! line with named fields (agent.start, tool.call, cache.hit, metric.tokens_used, …). It has no external dependencies and works with any tracing subscriber, so it composes with your existing log shipping:

```sh
RUST_LOG=info revka daemon
```

If you run the daemon under a JSON-formatting tracing-subscriber, the output is structured JSON ready for ingestion. This is the recommended first step before adding Prometheus or OTel.
verbose
Prints compact, human-readable progress to stderr for interactive CLI sessions: LLM thinking, tool start/end, and turn completion. It does not record metrics and only shows progress indicators, never prompt content:

```text
> Thinking
> Send (provider=openrouter, model=claude-sonnet, messages=3)
< Receive (success=true, duration_ms=412)
> Tool shell
< Complete
```

Prometheus metrics
Build with the feature, set the backend, and scrape /metrics:

1. Build with the Prometheus feature.

   ```sh
   cargo build --release --features observability-prometheus
   ```

2. Select the backend.

   ```toml
   [observability]
   backend = "prometheus"
   ```

3. Scrape the endpoint. It is served by the gateway at /metrics, unauthenticated, in Prometheus text format.

   ```sh
   curl http://127.0.0.1:42617/metrics
   ```
GET /metrics
- Auth: none (read-only)
- Returns: text/plain; version=0.0.4

If the backend is not prometheus, or the binary was built without observability-prometheus, the /metrics endpoint returns a human-readable hint instead of metrics. If curl shows that hint rather than metric lines, the backend or feature is not active.
Metrics reference
| Metric | Type | Labels |
|---|---|---|
| revka_agent_starts_total | counter | provider, model |
| revka_llm_requests_total | counter | provider, model, success |
| revka_tokens_input_total | counter | provider, model |
| revka_tokens_output_total | counter | provider, model |
| revka_agent_duration_seconds | histogram (0.1–60s) | provider, model |
| revka_tool_calls_total | counter | tool, success |
| revka_tool_duration_seconds | histogram (0.01–10s) | tool |
| revka_channel_messages_total | counter | channel, direction |
| revka_heartbeat_ticks_total | counter | - |
| revka_errors_total | counter | component |
| revka_cache_hits_total | counter | cache_type |
| revka_cache_misses_total | counter | cache_type |
| revka_cache_tokens_saved_total | counter | cache_type |
| revka_request_latency_seconds | histogram (0.01–10s) | - |
| revka_tokens_used_last | gauge | - |
| revka_active_sessions | gauge | - |
| revka_queue_depth | gauge | - |
| revka_hand_runs_total | counter | hand, success |
| revka_hand_duration_seconds | histogram | hand |
| revka_hand_findings_total | counter | hand |
| revka_deployments_total | counter | status |
| revka_deployment_lead_time_seconds | summary | - |
| revka_deployment_failure_rate | gauge | - |
| revka_recovery_time_seconds | summary | - |
| revka_mttr_seconds | summary | - |
Scrape config & Grafana
Point a Prometheus server at the gateway:

```yaml
scrape_configs:
  - job_name: "revka"
    static_configs:
      - targets: ["127.0.0.1:42617"]
```

From there, build Grafana panels on the metrics above. For example: a token-spend graph from revka_tokens_input_total / revka_tokens_output_total by provider and model, tool-latency heatmaps from revka_tool_duration_seconds, and a DORA dashboard from the deployment series. For dollar-cost tracking specifically, prefer the dedicated Cost tracking & budgets ledger, which records computed USD per call.
OpenTelemetry (OTLP)
The OTel backend exports both traces and metrics over OTLP HTTP/protobuf to any OpenTelemetry-compatible collector, such as Jaeger, Tempo, Honeycomb, or Datadog.

1. Build with the OTel feature.

   ```sh
   cargo build --release --features observability-otel
   ```

2. Configure the backend and collector endpoint.

   ```toml
   [observability]
   backend = "otel"                          # aliases: "opentelemetry", "otlp"
   otel_endpoint = "http://localhost:4318"   # default
   otel_service_name = "revka"               # default; sets the service.name resource attribute
   ```

3. Start the daemon. Spans and metrics begin flowing to the collector.

   ```sh
   revka daemon
   ```
| Key | Type | Default | Meaning |
|---|---|---|---|
| otel_endpoint | string | http://localhost:4318 | OTLP HTTP base URL. Traces are posted to `<endpoint>/v1/traces`, metrics to `<endpoint>/v1/metrics`. |
| otel_service_name | string | "revka" | service.name resource attribute on all spans and metrics. |
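The way the two signal URLs follow from the single base endpoint can be sketched as below. `otlp_urls` and `OtlpUrls` are hypothetical names, and trimming a trailing slash is an assumption about how the base URL is normalised, not confirmed behaviour.

```rust
/// Signal-specific OTLP URLs derived from one base endpoint.
#[derive(Debug, PartialEq)]
pub struct OtlpUrls {
    pub traces: String,
    pub metrics: String,
}

pub fn otlp_urls(endpoint: &str) -> OtlpUrls {
    // Assumption: a trailing slash on the configured base URL is tolerated and trimmed.
    let base = endpoint.trim_end_matches('/');
    OtlpUrls {
        traces: format!("{base}/v1/traces"),
        metrics: format!("{base}/v1/metrics"),
    }
}
```

In practice this means otel_endpoint should be the collector's base URL only. If it already ended in /v1/traces, the derived paths would be doubled.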
The backend creates spans for agent.invocation, llm.call, tool.call, hand.run, and error, with attributes drawn from the event payloads:
| Attribute | Appears on |
|---|---|
| provider, model, success, duration_s | llm.call, agent.invocation |
| tokens_used, cost_usd | llm.call |
| tool.name | tool.call |
| hand.name | hand.run |
| error.message | error |
Metric instruments mirror the Prometheus set, prefixed revka.*, and are pushed over OTLP rather than scraped.
Runtime trace logger
Independent of the metrics backend, the runtime trace logger persists structured JSONL events (tool calls, model replies, and errors) to disk for post-hoc diagnostics. It is disabled by default.

```toml
[observability]
runtime_trace_mode = "rolling"                    # "none" | "rolling" | "full"
runtime_trace_path = "state/runtime-trace.jsonl"  # relative to workspace unless absolute
runtime_trace_max_entries = 200                   # rolling mode only
```

| Key | Type | Default | Meaning |
|---|---|---|---|
| runtime_trace_mode | string | "none" | none (disabled), rolling (keep last N), full (unbounded) |
| runtime_trace_path | string | state/runtime-trace.jsonl | Trace file; relative paths resolve against the workspace |
| runtime_trace_max_entries | integer | 200 | Max lines retained in rolling mode |
Each RuntimeTraceEvent line carries: id (UUID), timestamp (RFC 3339), event_type, optional channel / provider / model / turn_id / success / message, and a payload JSON object.
Mode tradeoffs:
- rolling trims on every append via an atomic temp-file rename, so the file never grows beyond runtime_trace_max_entries lines. It is safe to leave on.
- full grows unbounded. Use it only for short-lived debugging.
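The rolling behaviour described above (trim on append, then swap the file in with a rename) can be sketched with the standard library alone. This is an assumption-laden illustration, not Revka's implementation: `append_rolling` is a hypothetical function, the `.tmp` sibling file name is made up, and re-reading the whole file on each append is simply the most direct way to show the idea.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Appends `line` to the JSONL file at `path`, keeping only the newest `max_entries` lines.
pub fn append_rolling(path: &Path, line: &str, max_entries: usize) -> io::Result<()> {
    // Read whatever is already on disk; a missing file is treated as empty.
    let existing = match fs::read_to_string(path) {
        Ok(text) => text,
        Err(err) if err.kind() == io::ErrorKind::NotFound => String::new(),
        Err(err) => return Err(err),
    };

    let mut lines: Vec<&str> = existing.lines().filter(|l| !l.is_empty()).collect();
    lines.push(line);
    // Index of the first line to keep; zero while the file is under the cap.
    let start = lines.len().saturating_sub(max_entries);

    // Write the trimmed contents to a sibling temp file, then rename it over the
    // original. A rename within one directory is atomic on POSIX filesystems, so
    // readers see either the old file or the new one, never a partial write.
    let tmp = path.with_extension("jsonl.tmp"); // hypothetical temp-file name
    {
        let mut file = fs::File::create(&tmp)?;
        for l in &lines[start..] {
            writeln!(file, "{l}")?;
        }
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}
```

With a cap of 200 the rewrite is cheap. The same approach would become costly for very large caps, which is one reason an unbounded full mode is a separate setting rather than a huge rolling limit.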
Querying traces
Inspect the trace file with revka doctor traces (not a separate revka trace command). Events are returned newest-first, and the list view truncates the message preview at 80 characters.

```sh
revka doctor traces                           # 20 most recent events
revka doctor traces --limit 50                # show 50 events
revka doctor traces --event tool_call_result  # exact event-type filter
revka doctor traces --contains "timeout"      # full-text substring search
revka doctor traces --id <uuid>               # one event, full JSON payload
```

| Flag | Default | Meaning |
|---|---|---|
| `--limit <n>` | 20 | Maximum events to list |
| `--event <type>` | - | Case-insensitive exact match on event_type |
| `--contains <text>` | - | Substring search across event_type, message, payload, channel, provider, model |
| `--id <uuid>` | - | Fetch a single event by UUID as pretty-printed JSON (ignores other filters) |
If runtime_trace_mode = "none" the file does not exist, and revka doctor traces prints a message telling you to enable rolling mode first.
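The filter semantics from the flag table can be modelled in a few lines. The sketch below is not the CLI's code: `TraceEvent`, `Query`, and `query` are hypothetical, the payload is kept as raw JSON text, and treating `--contains` as case-sensitive is an assumption the documentation does not settle.

```rust
// Hypothetical in-memory view of one trace line; field names follow the event description.
#[derive(Debug, Clone, Default)]
pub struct TraceEvent {
    pub id: String,
    pub event_type: String,
    pub channel: Option<String>,
    pub provider: Option<String>,
    pub model: Option<String>,
    pub message: Option<String>,
    pub payload: String, // raw JSON text
}

pub struct Query<'a> {
    pub event: Option<&'a str>,
    pub contains: Option<&'a str>,
    pub limit: usize,
}

impl Default for Query<'_> {
    fn default() -> Self {
        // 20 mirrors the documented default for --limit.
        Query { event: None, contains: None, limit: 20 }
    }
}

fn is_match(ev: &TraceEvent, q: &Query) -> bool {
    // --event: case-insensitive exact match on event_type.
    if let Some(wanted) = q.event {
        if !ev.event_type.eq_ignore_ascii_case(wanted) {
            return false;
        }
    }
    // --contains: substring search across the documented fields (case-sensitive here).
    if let Some(needle) = q.contains {
        let opt = |f: &Option<String>| f.as_deref().map_or(false, |s| s.contains(needle));
        let hit = ev.event_type.contains(needle)
            || ev.payload.contains(needle)
            || opt(&ev.message)
            || opt(&ev.channel)
            || opt(&ev.provider)
            || opt(&ev.model);
        if !hit {
            return false;
        }
    }
    true
}

/// `events` is assumed to be in file order (oldest first); results come back newest first.
pub fn query<'e>(events: &'e [TraceEvent], q: &Query) -> Vec<&'e TraceEvent> {
    events.iter().rev().filter(|ev| is_match(ev, q)).take(q.limit).collect()
}
```

Applying the limit after filtering means `--limit 50 --event tool_call_result` yields up to 50 matching events, not the matches among the last 50 lines. Whether the real command orders the two steps this way is likewise an assumption.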
Multi-observer fan-out
Internally, Revka can compose observers with MultiObserver, which fans out every event and metric to a list of child observers and propagates flush() to all of them. This makes it possible, for example, to emit to log and prometheus simultaneously.
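The fan-out idea itself is small enough to sketch in full. The code below is a simplified illustration rather than Revka's MultiObserver: the `Observer` trait is reduced to string events, and `FanOut` and `Counter` are hypothetical types introduced only for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Assumed, simplified observer contract; events are reduced to a string label for brevity.
pub trait Observer: Send + Sync {
    fn record_event(&self, event: &str);
    fn flush(&self) {}
    fn name(&self) -> &str;
}

/// Hypothetical fan-out wrapper in the spirit of MultiObserver.
pub struct FanOut {
    children: Vec<Box<dyn Observer>>,
}

impl FanOut {
    pub fn new(children: Vec<Box<dyn Observer>>) -> Self {
        Self { children }
    }
}

impl Observer for FanOut {
    fn record_event(&self, event: &str) {
        // Every child sees every event, in registration order.
        for child in &self.children {
            child.record_event(event);
        }
    }
    fn flush(&self) {
        // Shutdown drains are propagated to all children.
        for child in &self.children {
            child.flush();
        }
    }
    fn name(&self) -> &str {
        "fan-out"
    }
}

/// Test double that counts calls through shared counters.
pub struct Counter {
    pub seen: Arc<AtomicUsize>,
    pub flushed: Arc<AtomicUsize>,
}

impl Observer for Counter {
    fn record_event(&self, _event: &str) {
        self.seen.fetch_add(1, Ordering::SeqCst);
    }
    fn flush(&self) {
        self.flushed.fetch_add(1, Ordering::SeqCst);
    }
    fn name(&self) -> &str {
        "counter"
    }
}
```

Because the wrapper implements the same trait as its children, the rest of the runtime does not need to know whether it is talking to one backend or several. The cost is that each child's work is added to the hot path in sequence.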
```rust
MultiObserver::new(vec![Box::new(obs1), Box::new(obs2)])
```

[observability] config reference
The full section, with defaults:

```toml
[observability]
backend = "none"                                  # "none" | "noop" | "log" | "verbose" | "prometheus" | "otel"
otel_endpoint = "http://localhost:4318"           # OTel only
otel_service_name = "revka"                       # OTel only
runtime_trace_mode = "none"                       # "none" | "rolling" | "full"
runtime_trace_path = "state/runtime-trace.jsonl"
runtime_trace_max_entries = 200
```

| Key | Default | Applies to |
|---|---|---|
| backend | "none" | All |
| otel_endpoint | "http://localhost:4318" | otel |
| otel_service_name | "revka" | otel |
| runtime_trace_mode | "none" | Runtime traces |
| runtime_trace_path | "state/runtime-trace.jsonl" | Runtime traces |
| runtime_trace_max_entries | 200 | Runtime traces (rolling) |
Every field except backend is optional and the defaults are safe (no-op observer, no trace file). The opentelemetry and otlp values are aliases for otel.
Gateway endpoints
Two unauthenticated gateway endpoints surface observability data. Both are served by the running gateway (default http://127.0.0.1:42617):

| Endpoint | Method | Auth | Returns |
|---|---|---|---|
| /metrics | GET | none | Prometheus metrics (text/plain; version=0.0.4), or a hint if Prometheus is not the active backend |
| /health | GET | none | Component health snapshot JSON (always 200; inspect the body for status) |

```sh
curl http://127.0.0.1:42617/metrics
curl http://127.0.0.1:42617/health
```

The /health endpoint is the primary liveness signal used by Docker HEALTHCHECK, load balancers, and revka status --format exit-code. For its full response shape and the component health registry, see Status, health, config & tools endpoints and the Updating, runbook & troubleshooting page.