Voice, TTS & transcription

Voice call channels, wake-word capture, ClawdTalk SIP, and the shared TTS/transcription/media/link-enricher subsystems.

Revka can talk and listen, not just type. This page covers the three voice channels — the multi-provider Voice Call channel (Twilio / Telnyx / Plivo), the local-microphone Voice Wake word detector, and ClawdTalk (Telnyx SIP) — plus the four shared subsystems that voice and text channels both rely on: TTS (text-to-speech), Transcription (speech-to-text), the Media Pipeline, and the Link Enricher.

TTS, transcription, the media pipeline, and the link enricher are configured once at the top level of ~/.revka/config.toml and are then consumed by whichever channels need them. The voice channels live under [channels_config] like every other channel. If you have not connected a channel before, read the Channels overview for the shared trait model, allowlist semantics, and the polling-vs-webhook distinction.

Voice Call

The Voice Call channel handles real-time inbound and outbound phone calls through one of three telephony providers — Twilio, Telnyx, or Plivo — selected with a single provider key. It streams speech-to-text and text-to-speech during the call, optionally logs the full transcript to the workspace, and gates outbound calls behind an approval workflow. The telephony provider POSTs call events to Revka’s webhook, which translates them into channel messages for the agent loop.

[channels_config.voice_call]
provider = "twilio"               # twilio | telnyx | plivo
account_id = "ACxxxxxxxxxxxxxxxx" # Twilio Account SID / Telnyx API Key / Plivo Auth ID
auth_token = "your-auth-token"    # Twilio auth token / Telnyx API secret / Plivo auth token
from_number = "+15551234567"      # E.164 caller ID for outbound calls
webhook_port = 8090
require_outbound_approval = true
transcription_logging = true
# tts_voice = "Polly.Joanna"      # provider-specific voice name
max_call_duration_secs = 3600
# webhook_base_url = "https://your-tunnel.example.com"

Field	Type / values	Default	Meaning
`provider`	`"twilio"` | `"telnyx"` | `"plivo"`	`"twilio"`	Telephony provider.
`account_id`	string (required)	—	Twilio Account SID, Telnyx API Key, or Plivo Auth ID.
`auth_token`	string (required)	—	Twilio auth token, Telnyx API secret, or Plivo auth token.
`from_number`	string (required)	—	E.164 caller ID used for outbound calls.
`webhook_port`	integer	`8090`	Port the channel listens on for telephony webhooks.
`require_outbound_approval`	bool	`true`	Require human approval before placing an outbound call.
`transcription_logging`	bool	`true`	Log the full call transcript to the workspace directory.
`tts_voice`	string (optional)	—	Provider-specific voice name for call audio (e.g. `Polly.Joanna`).
`max_call_duration_secs`	integer	`3600`	Hard cap on call length, in seconds.
`webhook_base_url`	string (optional)	auto-detect	Public URL override for the webhook callback (e.g. an ngrok / Tailscale tunnel).
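As a rough pre-flight check, the constraints in the table above can be expressed in a few lines of Python. This is a minimal sketch, not Revka's own validation: `check_voice_call` is a hypothetical helper, it takes the `[channels_config.voice_call]` table as an already-parsed dict, and the port-range check is an added assumption. The E.164 pattern itself is standard: a `+`, then at most 15 digits, the first of which is non-zero.

```python
import re

# E.164: a leading "+", then up to 15 digits, the first of which is non-zero.
E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")
PROVIDERS = {"twilio", "telnyx", "plivo"}
REQUIRED = ("account_id", "auth_token", "from_number")


def check_voice_call(section: dict) -> list:
    """Return a list of problems in a parsed [channels_config.voice_call] table.

    Hypothetical helper for illustration; not part of Revka.
    """
    problems = []
    provider = section.get("provider", "twilio")  # documented default
    if provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider!r}")
    for key in REQUIRED:
        if not section.get(key):
            problems.append(f"missing required key: {key}")
    number = section.get("from_number", "")
    if number and not E164_RE.match(number):
        problems.append(f"from_number is not E.164: {number!r}")
    port = section.get("webhook_port", 8090)  # documented default
    if not 1 <= port <= 65535:  # assumption: any valid TCP port is accepted
        problems.append(f"webhook_port out of range: {port}")
    return problems
```

An empty result means the table passes these checks; it says nothing about whether the credentials are valid with the provider.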

Because telephony providers deliver call events by webhook (push), Voice Call needs a reachable HTTPS callback URL. Set webhook_base_url to your public address, or expose the gateway through a tunnel — see Expose your gateway with a tunnel.

Voice Wake

Voice Wake turns the host machine’s microphone into an always-on wake-word trigger. It listens continuously on the default audio input via cpal, uses energy-based voice activity detection (VAD) to spot speech, transcribes a short window to check for your configured wake word, and on a match captures the full utterance and dispatches it to the agent. Internally it runs a four-state machine: Listening → Triggered → Capturing → Processing.

[channels_config.voice_wake]
wake_word = "hey revka"
silence_timeout_ms = 2000
energy_threshold = 0.01
max_capture_secs = 30

Field	Type	Default	Meaning
`wake_word`	string	`"hey revka"`	Phrase to listen for; matched as a case-insensitive substring of the trigger-window transcript.
`silence_timeout_ms`	integer	`2000`	Silence (ms) after the last energy spike before a capture is finalized.
`energy_threshold`	float	`0.01`	RMS floor for VAD; samples below this count as silence.
`max_capture_secs`	integer	`30`	Maximum capture length before transcription is forced.
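To make the four settings concrete, here is a minimal Python sketch of the energy gate, the wake-word check, the capture cut-off, and the four-state machine described above. It is an illustration under stated assumptions, not Revka's implementation: every function name is hypothetical, samples are assumed to be floats in [-1.0, 1.0], and the exact transition rules (for example, falling back to Listening when the wake word is not found) are guesses consistent with the description.

```python
import math
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    TRIGGERED = auto()
    CAPTURING = auto()
    PROCESSING = auto()


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame, energy_threshold=0.01):
    """Energy-based VAD: anything below the threshold counts as silence."""
    return rms(frame) >= energy_threshold


def has_wake_word(transcript, wake_word="hey revka"):
    """Case-insensitive substring match against the trigger-window transcript."""
    return wake_word.lower() in transcript.lower()


def capture_done(ms_since_speech, captured_secs,
                 silence_timeout_ms=2000, max_capture_secs=30):
    """A capture ends on sustained silence or when the length cap is reached."""
    return ms_since_speech >= silence_timeout_ms or captured_secs >= max_capture_secs


def next_state(state, *, speech=False, wake_match=False, done=False):
    """Assumed transitions for Listening -> Triggered -> Capturing -> Processing."""
    if state is State.LISTENING:
        return State.TRIGGERED if speech else State.LISTENING
    if state is State.TRIGGERED:
        return State.CAPTURING if wake_match else State.LISTENING
    if state is State.CAPTURING:
        return State.PROCESSING if done else State.CAPTURING
    return State.LISTENING  # after Processing, resume listening
```

In a noisy room, raising `energy_threshold` reduces spurious transitions out of Listening at the cost of missing quiet speech.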

Voice Wake also requires the top-level [transcription] section — it has no STT of its own and calls the shared transcription subsystem to turn captured audio into text. Configure transcription before enabling it.

Build with the feature enabled:

cargo build --release --locked --features voice-wake

If [channels_config.voice_wake] is configured but the binary was built without the flag, Revka intentionally skips the channel rather than erroring — revka channel list and revka channel doctor report it as skipped for this build.

ClawdTalk

ClawdTalk is a dedicated AI-voice channel built on Telnyx’s global SIP network, using the Telnyx API v2 (https://api.telnyx.com/v2) for call management. It is distinct from the multi-provider Voice Call channel: where Voice Call abstracts over three telephony vendors, ClawdTalk is Telnyx-SIP-specific and optimized for low-latency conversational calls. Inbound call events arrive by webhook with optional signature verification.

[channels_config.clawdtalk]
api_key = "KEY01xxxxxxxxxxxxxxxx"     # Telnyx API key (required)
connection_id = "telnyx-connection-id" # Telnyx SIP connection (required)
from_number = "+15551234567"           # E.164 caller ID (required)
allowed_destinations = []              # empty = allow all destinations
# webhook_secret = "telnyx-webhook-secret"

Field	Type	Default	Meaning
`api_key`	string (required)	—	Telnyx API key.
`connection_id`	string (required)	—	Telnyx SIP connection ID.
`from_number`	string (required)	—	E.164 caller ID for outbound calls.
`allowed_destinations`	list	`[]` → allow all	Destination number prefixes/patterns, or `"*"`. An empty list allows every destination.
`webhook_secret`	string (optional)	—	Telnyx webhook signature secret for verifying inbound events.
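The `allowed_destinations` rule can be pictured with a short Python sketch. It assumes entries are plain number prefixes or the `"*"` wildcard; the table also mentions patterns, whose syntax is not specified here, so they are not modeled. `destination_allowed` is a hypothetical name used only for illustration.

```python
def destination_allowed(number, allowed_destinations):
    """Return True if an outbound call to `number` would pass the allowlist.

    Assumption: entries are literal prefixes or "*". An empty list allows
    every destination, as documented.
    """
    if not allowed_destinations:
        return True
    for entry in allowed_destinations:
        if entry == "*" or number.startswith(entry):
            return True
    return False
```

For example, `["+1555", "+44"]` would permit `+15551234567` and any UK number while rejecting `+33123456789`. Because the empty default allows everything, a restrictive deployment needs at least one explicit entry.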

TTS (Text-to-Speech)

The [tts] section is a shared, multi-provider synthesis subsystem consumed by voice channels (and by WhatsApp Web). It is not called directly — channels invoke it when they need to turn the agent’s text into audio. Supported providers: OpenAI, ElevenLabs, Google Cloud TTS, Edge TTS (a free subprocess-based backend), and Piper (a local, OpenAI-compatible endpoint).

[tts]
enabled = true
default_provider = "openai"       # openai | elevenlabs | google | edge | piper
default_voice = "alloy"
default_format = "mp3"            # mp3 | opus | wav
max_text_length = 4096

[tts.openai]
# api_key = "..."                 # falls back to OPENAI_API_KEY
model = "tts-1"
speed = 1.0

[tts.elevenlabs]
# api_key = "..."                 # falls back to ELEVENLABS_API_KEY
model_id = "eleven_monolingual_v1"
stability = 0.5
similarity_boost = 0.5

[tts.google]
# api_key = "..."                 # falls back to GOOGLE_TTS_API_KEY
language_code = "en-US"

[tts.edge]
binary_path = "edge-tts"

[tts.piper]
api_url = "http://127.0.0.1:5000/v1/audio/speech"

Top-level [tts] keys:

Key	Type	Default	Meaning
`enabled`	bool	`false`	Master toggle for TTS synthesis.
`default_provider`	string	`"openai"`	`openai`, `elevenlabs`, `google`, `edge`, or `piper`.
`default_voice`	string	`"alloy"`	Voice ID passed to the selected provider.
`default_format`	string	`"mp3"`	Output audio format: `mp3`, `opus`, or `wav`.
`max_text_length`	integer	`4096`	Maximum input text length, in characters.

Per-provider sub-tables:

Sub-table	Keys (defaults)
`[tts.openai]`	`api_key`, `model` (`tts-1`), `speed` (`1.0`)
`[tts.elevenlabs]`	`api_key`, `model_id` (`eleven_monolingual_v1`), `stability` (`0.5`), `similarity_boost` (`0.5`)
`[tts.google]`	`api_key`, `language_code` (`en-US`)
`[tts.edge]`	`binary_path` (`edge-tts`)
`[tts.piper]`	`api_url` (`http://127.0.0.1:5000/v1/audio/speech`)

OpenAI’s built-in voices are alloy, echo, fable, onyx, nova, and shimmer. Edge TTS uses Microsoft Neural voices and is free but requires the edge-tts binary on PATH. Piper runs locally against an OpenAI-compatible HTTP endpoint — useful for fully offline, GPU-accelerated synthesis.
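To show what "OpenAI-compatible" means in practice, the sketch below builds (but does not send) a request in the shape of OpenAI's `/v1/audio/speech` API, using the defaults from the tables above. Treat it as an assumption-laden illustration: `build_speech_request` is a hypothetical helper, a given Piper server may ignore or not require the `model` field, and rejecting over-length text is only one possible reading of `max_text_length` (how Revka itself handles over-length text is not specified here).

```python
import json
import urllib.request


def build_speech_request(text, *,
                         api_url="http://127.0.0.1:5000/v1/audio/speech",
                         model="tts-1", voice="alloy", audio_format="mp3",
                         max_text_length=4096):
    """Build a POST request in the OpenAI speech-API shape.

    Assumption: over-length input is rejected rather than truncated or split.
    """
    if len(text) > max_text_length:
        raise ValueError(
            f"text is {len(text)} characters; limit is {max_text_length}")
    body = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": audio_format,
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a compatible server listening on that address, passing the request to `urllib.request.urlopen` and reading the response would yield the audio bytes.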

Transcription (Speech-to-Text)

The [transcription] section is the shared STT subsystem. It powers Voice Wake, the Media Pipeline’s audio leg, and audio-attachment handling on every channel that accepts voice notes (Telegram, Discord, Slack, Mattermost, Matrix, WhatsApp Web, WATI, and others). The default provider is Groq (Whisper-compatible); you can also use OpenAI, Deepgram, AssemblyAI, Google Cloud Speech-to-Text, or a self-hosted Whisper endpoint.

[transcription]
enabled = true
default_provider = "groq"         # groq | openai | deepgram | assemblyai | google | local_whisper
# api_key = "..."                 # Groq provider; falls back to GROQ_API_KEY
api_url = "https://api.groq.com/openai/v1/audio/transcriptions"
model = "whisper-large-v3-turbo"
# language = "en"                 # optional ISO-639-1 hint
# initial_prompt = "Revka, Kumiho"  # bias toward expected vocabulary
max_duration_secs = 120
transcribe_non_ptt_audio = false

# Optional per-provider sub-tables:
[transcription.openai]
# api_key = "..."
model = "whisper-1"

[transcription.local_whisper]
url = "http://10.10.0.1:8001/v1/transcribe"
# bearer_token = "..."
# max_audio_bytes = 26214400      # 25 MB
timeout_secs = 300

Key	Type	Default	Meaning
`enabled`	bool	`false`	Master toggle for transcription.
`default_provider`	string	`"groq"`	`groq`, `openai`, `deepgram`, `assemblyai`, `google`, or `local_whisper`.
`api_key`	string (optional)	—	Key for the Groq provider; falls back to `GROQ_API_KEY`.
`api_url`	string	`https://api.groq.com/openai/v1/audio/transcriptions`	Whisper-compatible endpoint (Groq provider).
`model`	string	`"whisper-large-v3-turbo"`	Whisper model name (Groq provider).
`language`	string (optional)	—	ISO-639-1 language hint (e.g. `en`, `ru`).
`initial_prompt`	string (optional)	—	Prompt that biases transcription toward expected proper nouns / terms.
`max_duration_secs`	integer	`120`	Skip audio longer than this many seconds.
`transcribe_non_ptt_audio`	bool	`false`	Also transcribe non-voice-note (forwarded/regular) audio on WhatsApp.
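How `enabled`, `max_duration_secs`, and `transcribe_non_ptt_audio` interact can be sketched as a single gate. This is a simplified, hypothetical helper: the non-voice-note rule is documented for WhatsApp only, and the sketch applies it to any clip for brevity.

```python
def should_transcribe(duration_secs, *, is_voice_note,
                      enabled=False, max_duration_secs=120,
                      transcribe_non_ptt_audio=False):
    """Decide whether an inbound audio clip is sent to the STT provider.

    Defaults mirror the documented [transcription] defaults.
    """
    if not enabled:
        return False  # master toggle is off
    if duration_secs > max_duration_secs:
        return False  # clip is too long and is skipped
    if not is_voice_note and not transcribe_non_ptt_audio:
        return False  # regular or forwarded audio is ignored by default
    return True
```

With the defaults and `enabled = true`, a 45-second voice note is transcribed, while a 45-second forwarded music clip or a 5-minute voice note is not.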

Per-provider sub-tables:

Sub-table	Keys (defaults)
`[transcription.openai]`	`api_key`, `model` (`whisper-1`)
`[transcription.deepgram]`	`api_key`, `model` (`nova-2`)
`[transcription.assemblyai]`	`api_key`
`[transcription.google]`	`api_key`, `language_code` (`en-US`)
`[transcription.local_whisper]`	`url`, `bearer_token`, `max_audio_bytes`, `timeout_secs`

The local_whisper sub-table points at any reachable Whisper-compatible endpoint:

Key	Type	Default	Meaning
`url`	string (required)	—	HTTP(S) endpoint URL.
`bearer_token`	string (optional)	—	Auth token; omit for unauthenticated local endpoints.
`max_audio_bytes`	integer	`26214400` (25 MB)	Maximum accepted audio size; peak memory per request is roughly 2× this.
`timeout_secs`	integer	`300`	Request timeout, in seconds; the generous default leaves room for large files on a local GPU.
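The arithmetic behind `max_audio_bytes` is worth spelling out when sizing a host: the default is exactly 25 MiB, so the "roughly 2×" peak works out to about 50 MiB per in-flight request. A small sketch, with hypothetical helper names and an assumed inclusive limit:

```python
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 26_214_400, the documented default


def fits_limit(size_bytes, max_audio_bytes=MAX_AUDIO_BYTES):
    """Assumption: a file exactly at the limit is still accepted."""
    return size_bytes <= max_audio_bytes


def estimated_peak_bytes(max_audio_bytes=MAX_AUDIO_BYTES):
    """Rough per-request peak memory, using the documented 2x rule of thumb."""
    return 2 * max_audio_bytes
```

Four concurrent requests at the default limit would therefore need roughly 200 MiB of headroom for audio buffers alone.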

Media Pipeline

The Media Pipeline is the cross-channel inbound media-understanding stage. When enabled, it pre-processes attachments before the agent sees them, replacing raw files with text annotations: audio is transcribed, images are described, and video is summarized. The result is that an agent on any text model can reason over a voice note or a photo without a separate tool call.

[media_pipeline]
enabled = true
transcribe_audio = true
describe_images = true
summarize_video = true

Key	Type	Default	Meaning
`enabled`	bool	`false`	Master toggle for the pipeline.
`transcribe_audio`	bool	`true`	Transcribe audio attachments via the `[transcription]` provider.
`describe_images`	bool	`true`	Describe images when a vision-capable model is active.
`summarize_video`	bool	`true`	Summarize video attachments (requires an external API).

The audio leg reuses your [transcription] configuration, and image description uses the active provider’s vision capability. When the provider has no vision support, image description falls back gracefully to a simple [Image: attached] annotation rather than failing.
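The replace-with-annotation behavior, including the image fallback, can be sketched as follows. Everything here is illustrative: `annotate_attachment` and its callable parameters are hypothetical, and apart from `[Image: attached]` the annotation formats are invented for the example.

```python
def annotate_attachment(kind, data, cfg, *,
                        transcribe=None, describe=None, summarize=None):
    """Return a text annotation for an attachment, or None to leave it as-is.

    `cfg` is a parsed [media_pipeline] table. The three callables stand in
    for the STT provider, a vision-capable model, and a video summarizer.
    """
    if not cfg.get("enabled", False):
        return None
    if kind == "audio" and cfg.get("transcribe_audio", True) and transcribe:
        return f"[Audio transcript: {transcribe(data)}]"  # hypothetical format
    if kind == "image" and cfg.get("describe_images", True):
        if describe is None:
            return "[Image: attached]"  # documented no-vision fallback
        return f"[Image: {describe(data)}]"  # hypothetical format
    if kind == "video" and cfg.get("summarize_video", True) and summarize:
        return f"[Video summary: {summarize(data)}]"  # hypothetical format
    return None
```

Passing `describe=None` models a provider without vision support, which yields the plain placeholder instead of an error.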

Link Enricher

The Link Enricher fetches the content behind URLs in inbound messages and prepends a short summary, so the agent has link context without an explicit web-fetch tool call. It is off by default and is SSRF-protected.

[link_enricher]
enabled = false
max_links = 3
timeout_secs = 10

Key	Type	Default	Meaning
`enabled`	bool	`false`	Master toggle for the enricher.
`max_links`	integer	`3`	Maximum unique URLs fetched per message.
`timeout_secs`	integer	`10`	Per-URL fetch timeout, in seconds.
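A minimal sketch of the selection step, assuming URLs are de-duplicated in order of appearance and that non-public targets are dropped before the `max_links` cap is applied (the actual ordering and SSRF rules are not documented here). The helper names are hypothetical. The address check only covers literal IPs and `localhost`; a production guard must also vet the addresses a hostname resolves to.

```python
import ipaddress
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s<>]+")


def is_public_target(url):
    """Reject localhost and literal IPs that are not globally routable."""
    host = urlsplit(url).hostname
    if host is None or host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname; real SSRF protection must check resolved IPs
    return ip.is_global


def candidate_links(message, max_links=3):
    """Pick up to `max_links` unique, public URLs in order of appearance."""
    picked = []
    for raw in URL_RE.findall(message):
        if len(picked) >= max_links:
            break
        url = raw.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in picked and is_public_target(url):
            picked.append(url)
    return picked
```

Each surviving URL would then be fetched with the per-URL `timeout_secs` budget.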

How the pieces fit together

The voice channels are the audio I/O surface; the four subsystems are the shared machinery they (and text channels) draw on:

Component	Config location	Role	Feature flag
Voice Call	`[channels_config.voice_call]`	Phone calls via Twilio / Telnyx / Plivo (webhook)	—
Voice Wake	`[channels_config.voice_wake]`	Local mic wake-word capture (input-only)	`voice-wake`
ClawdTalk	`[channels_config.clawdtalk]`	Telnyx SIP voice calls (webhook)	—
TTS	`[tts]`	Text → speech for voice replies	—
Transcription	`[transcription]`	Speech → text for voice input	—
Media Pipeline	`[media_pipeline]`	Annotate inbound audio/image/video	—
Link Enricher	`[link_enricher]`	Fetch & summarize URLs in messages	—

A typical end-to-end voice setup combines several of these: enable [transcription] so spoken input becomes text, enable [tts] so replies become audio, then add a voice channel ([channels_config.voice_call] for the phone, or [channels_config.voice_wake] for the local mic). Voice Call additionally logs transcripts when transcription_logging = true.

Related pages

Channels overview: The channel trait, delivery modes, allowlists, and the feature-flag matrix.

Expose your gateway with a tunnel: Give webhook channels like Voice Call a public HTTPS callback.

Autonomy levels & approvals: The approval workflow behind outbound-call gating.

Media & vision tools: Vision and media tooling the pipeline builds on.

Config: channels, tools & integrations: Every channel config key in one place.

Cargo feature flags & ADRs: The full build feature catalog, including voice-wake.
