Voice, TTS & transcription
Voice call channels, wake-word capture, ClawdTalk SIP, and the shared TTS/transcription/media/link-enricher subsystems.
Revka can talk and listen, not just type. This page covers the three voice channels — the multi-provider Voice Call channel (Twilio / Telnyx / Plivo), the local-microphone Voice Wake word detector, and ClawdTalk (Telnyx SIP) — plus the four shared subsystems that voice and text channels both rely on: TTS (text-to-speech), Transcription (speech-to-text), the Media Pipeline, and the Link Enricher.
TTS, transcription, the media pipeline, and the link enricher are configured once at the top level of ~/.revka/config.toml and are then consumed by whichever channels need them. The voice channels live under [channels_config] like every other channel. If you have not connected a channel before, read the Channels overview for the shared trait model, allowlist semantics, and the polling-vs-webhook distinction.
Voice Call
The Voice Call channel handles real-time inbound and outbound phone calls through one of three telephony providers (Twilio, Telnyx, or Plivo), selected with a single provider key. It streams speech-to-text and text-to-speech during the call, optionally logs the full transcript to the workspace, and gates outbound calls behind an approval workflow. The telephony provider POSTs call events to Revka’s webhook, which translates them into channel messages for the agent loop.
```toml
[channels_config.voice_call]
provider = "twilio"                    # twilio | telnyx | plivo
account_id = "ACxxxxxxxxxxxxxxxx"      # Twilio Account SID / Telnyx API Key / Plivo Auth ID
auth_token = "your-auth-token"         # Twilio auth token / Telnyx API secret / Plivo auth token
from_number = "+15551234567"           # E.164 caller ID for outbound calls
webhook_port = 8090
require_outbound_approval = true
transcription_logging = true
# tts_voice = "Polly.Joanna"           # provider-specific voice name
max_call_duration_secs = 3600
# webhook_base_url = "https://your-tunnel.example.com"
```

| Field | Type / values | Default | Meaning |
|---|---|---|---|
| provider | "twilio", "telnyx", or "plivo" | "twilio" | Telephony provider. |
| account_id | string (required) | — | Twilio Account SID, Telnyx API Key, or Plivo Auth ID. |
| auth_token | string (required) | — | Twilio auth token, Telnyx API secret, or Plivo auth token. |
| from_number | string (required) | — | E.164 caller ID used for outbound calls. |
| webhook_port | integer | 8090 | Port the channel listens on for telephony webhooks. |
| require_outbound_approval | bool | true | Require human approval before placing an outbound call. |
| transcription_logging | bool | true | Log the full call transcript to the workspace directory. |
| tts_voice | string (optional) | — | Provider-specific voice name for call audio (e.g. Polly.Joanna). |
| max_call_duration_secs | integer | 3600 | Hard cap on call length, in seconds. |
| webhook_base_url | string (optional) | auto-detect | Public URL override for the webhook callback (e.g. an ngrok / Tailscale tunnel). |
Because telephony providers deliver call events by webhook (push), Voice Call needs a reachable HTTPS callback URL. Set webhook_base_url to your public address, or expose the gateway through a tunnel — see Expose your gateway with a tunnel.
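The from_number field expects E.164 format. As a minimal pre-flight sketch (this is not Revka's own validation, which this page does not describe), the check below applies the general E.164 shape: a leading +, a non-zero first digit, and at most 15 digits in total. The function name `is_e164` is hypothetical.

```python
import re

# E.164 shape: "+" followed by up to 15 digits, with no leading zero.
# Assumption: at least two digits are required; real numbers are longer.
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number: str) -> bool:
    """Return True when `number` looks like an E.164 phone number."""
    return E164_PATTERN.fullmatch(number) is not None
```

Running a check like this before writing from_number into config.toml catches the common mistakes (missing +, embedded spaces, a national-format leading zero) before the provider rejects the call.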
Voice Wake
Voice Wake turns the host machine’s microphone into an always-on wake-word trigger. It listens continuously on the default audio input via cpal, uses energy-based voice activity detection (VAD) to spot speech, transcribes a short window to check for your configured wake word, and on a match captures the full utterance and dispatches it to the agent. Internally it runs a four-state machine: Listening → Triggered → Capturing → Processing.
```toml
[channels_config.voice_wake]
wake_word = "hey revka"
silence_timeout_ms = 2000
energy_threshold = 0.01
max_capture_secs = 30
```

| Field | Type | Default | Meaning |
|---|---|---|---|
| wake_word | string | "hey revka" | Case-insensitive substring matched in the trigger window. |
| silence_timeout_ms | integer | 2000 | Silence (ms) after the last energy spike before a capture is finalized. |
| energy_threshold | float | 0.01 | RMS floor for VAD; samples below this count as silence. |
| max_capture_secs | integer | 30 | Maximum capture length before transcription is forced. |
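To make the interaction between these four fields concrete, here is a rough Python sketch of energy-based VAD, wake-word matching, and capture finalization. It is an illustration only: Revka's actual implementation is built on cpal and is not shown on this page, so the function names, the frame-based loop, and the inclusive threshold comparison are all assumptions.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def wake_word_matches(transcript, wake_word="hey revka"):
    """Case-insensitive substring match, as described for wake_word."""
    return wake_word.casefold() in transcript.casefold()


def capture_ms(frames, frame_ms, energy_threshold=0.01,
               silence_timeout_ms=2000, max_capture_secs=30):
    """Model the Capturing state: return how many ms are captured
    before the utterance is finalized (hypothetical helper)."""
    elapsed = 0
    silence = 0
    for frame in frames:
        elapsed += frame_ms
        if rms(frame) >= energy_threshold:
            silence = 0            # speech energy resets the silence timer
        else:
            silence += frame_ms    # frame counted as silence
        if silence >= silence_timeout_ms:
            break                  # enough trailing silence: finalize
        if elapsed >= max_capture_secs * 1000:
            break                  # hard cap reached: force transcription
    return elapsed
```

With 100 ms frames, five loud frames followed by silence are finalized after 2500 ms (500 ms of speech plus the 2000 ms silence timeout), while continuous speech is cut off at 30000 ms.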
Voice Wake also requires the top-level [transcription] section — it has no STT of its own and calls the shared transcription subsystem to turn captured audio into text. Configure transcription before enabling it.
Build with the feature enabled:
```bash
cargo build --release --locked --features voice-wake
```

If [channels_config.voice_wake] is configured but the binary was built without the flag, Revka intentionally skips the channel rather than erroring: revka channel list and revka channel doctor report it as skipped for this build.
ClawdTalk
ClawdTalk is a dedicated AI-voice channel built on Telnyx’s global SIP network, using the Telnyx API v2 (https://api.telnyx.com/v2) for call management. It is distinct from the multi-provider Voice Call channel: where Voice Call abstracts over three telephony vendors, ClawdTalk is Telnyx-SIP-specific and optimized for low-latency conversational calls. Inbound call events arrive by webhook with optional signature verification.
```toml
[channels_config.clawdtalk]
api_key = "KEY01xxxxxxxxxxxxxxxx"        # Telnyx API key (required)
connection_id = "telnyx-connection-id"   # Telnyx SIP connection (required)
from_number = "+15551234567"             # E.164 caller ID (required)
allowed_destinations = []                # empty = allow all destinations
# webhook_secret = "telnyx-webhook-secret"
```

| Field | Type | Default | Meaning |
|---|---|---|---|
| api_key | string (required) | — | Telnyx API key. |
| connection_id | string (required) | — | Telnyx SIP connection ID. |
| from_number | string (required) | — | E.164 caller ID for outbound calls. |
| allowed_destinations | list | [] → allow all | Destination number prefixes/patterns, or "*". An empty list allows every destination. |
| webhook_secret | string (optional) | — | Telnyx webhook signature secret for verifying inbound events. |
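The allowed_destinations semantics can be sketched as follows. This is a simplified model that treats every entry as a literal prefix; the page mentions "patterns" without defining their syntax, so anything beyond prefixes and "*" is left out. The function name is hypothetical.

```python
def destination_allowed(number, allowed_destinations):
    """Return True when an outbound call to `number` would be permitted.

    Assumptions: entries are literal prefixes such as "+1" or "+4420";
    "*" and an empty list both mean "allow everything".
    """
    if not allowed_destinations:
        return True
    return any(
        entry == "*" or number.startswith(entry)
        for entry in allowed_destinations
    )
```

For example, `allowed_destinations = ["+1"]` would permit "+15551234567" but reject "+442071234567" under this model.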
TTS (Text-to-Speech)
The [tts] section is a shared, multi-provider synthesis subsystem consumed by voice channels (and by WhatsApp Web). It is not called directly; channels invoke it when they need to turn the agent’s text into audio. Supported providers: OpenAI, ElevenLabs, Google Cloud TTS, Edge TTS (a free subprocess-based backend), and Piper (a local, OpenAI-compatible endpoint).
```toml
[tts]
enabled = true
default_provider = "openai"   # openai | elevenlabs | google | edge | piper
default_voice = "alloy"
default_format = "mp3"        # mp3 | opus | wav
max_text_length = 4096

[tts.openai]
# api_key = "..."             # falls back to OPENAI_API_KEY
model = "tts-1"
speed = 1.0

[tts.elevenlabs]
# api_key = "..."             # falls back to ELEVENLABS_API_KEY
model_id = "eleven_monolingual_v1"
stability = 0.5
similarity_boost = 0.5

[tts.google]
# api_key = "..."             # falls back to GOOGLE_TTS_API_KEY
language_code = "en-US"

[tts.edge]
binary_path = "edge-tts"

[tts.piper]
api_url = "http://127.0.0.1:5000/v1/audio/speech"
```

Top-level [tts] keys:
| Key | Type | Default | Meaning |
|---|---|---|---|
| enabled | bool | false | Master toggle for TTS synthesis. |
| default_provider | string | "openai" | openai, elevenlabs, google, edge, or piper. |
| default_voice | string | "alloy" | Voice ID passed to the selected provider. |
| default_format | string | "mp3" | Output audio format: mp3, opus, or wav. |
| max_text_length | integer | 4096 | Maximum input text length, in characters. |
Per-provider sub-tables:
| Sub-table | Keys (defaults) |
|---|---|
| [tts.openai] | api_key, model (tts-1), speed (1.0) |
| [tts.elevenlabs] | api_key, model_id (eleven_monolingual_v1), stability (0.5), similarity_boost (0.5) |
| [tts.google] | api_key, language_code (en-US) |
| [tts.edge] | binary_path (edge-tts) |
| [tts.piper] | api_url (http://127.0.0.1:5000/v1/audio/speech) |
OpenAI’s built-in voices are alloy, echo, fable, onyx, nova, and shimmer. Edge TTS uses Microsoft Neural voices and is free but requires the edge-tts binary on PATH. Piper runs locally against an OpenAI-compatible HTTP endpoint — useful for fully offline, GPU-accelerated synthesis.
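The sketch below illustrates two behaviors described above: the API-key fallback from config to environment variable, and how the [tts] defaults might map onto an OpenAI-style /v1/audio/speech request body (the shape Piper's endpoint is said to be compatible with). Both function names are hypothetical, the field mapping is an assumption, and rejecting over-long text is one possible reading of max_text_length; this page does not say whether Revka rejects or truncates.

```python
import os


def resolve_api_key(configured, env_var):
    """Prefer the key set in config.toml; otherwise use the environment."""
    return configured or os.environ.get(env_var)


def build_speech_request(text, voice="alloy", model="tts-1",
                         audio_format="mp3", speed=1.0,
                         max_text_length=4096):
    """Build (but do not send) an OpenAI-style speech request body."""
    if len(text) > max_text_length:
        # Assumption: over-long input is rejected rather than truncated.
        raise ValueError(f"text exceeds {max_text_length} characters")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": audio_format,
        "speed": speed,
    }
```

Keeping the request builder separate from the HTTP call makes it straightforward to point the same body at either the hosted OpenAI endpoint or a local Piper api_url.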
Transcription (Speech-to-Text)
The [transcription] section is the shared STT subsystem. It powers Voice Wake, the Media Pipeline’s audio leg, and audio-attachment handling on every channel that accepts voice notes (Telegram, Discord, Slack, Mattermost, Matrix, WhatsApp Web, WATI, and others). The default provider is Groq (Whisper-compatible); you can also use OpenAI, Deepgram, AssemblyAI, Google Cloud Speech-to-Text, or a self-hosted Whisper endpoint.
```toml
[transcription]
enabled = true
default_provider = "groq"             # groq | openai | deepgram | assemblyai | google | local_whisper
# api_key = "..."                     # Groq provider; falls back to GROQ_API_KEY
api_url = "https://api.groq.com/openai/v1/audio/transcriptions"
model = "whisper-large-v3-turbo"
# language = "en"                     # optional ISO-639-1 hint
# initial_prompt = "Revka, Kumiho"    # bias toward expected vocabulary
max_duration_secs = 120
transcribe_non_ptt_audio = false

# Optional per-provider sub-tables:
[transcription.openai]
# api_key = "..."
model = "whisper-1"

[transcription.local_whisper]
url = "http://10.10.0.1:8001/v1/transcribe"
# bearer_token = "..."
# max_audio_bytes = 26214400          # 25 MB
timeout_secs = 300
```

| Key | Type | Default | Meaning |
|---|---|---|---|
| enabled | bool | false | Master toggle for transcription. |
| default_provider | string | "groq" | groq, openai, deepgram, assemblyai, google, or local_whisper. |
| api_key | string (optional) | — | Key for the Groq provider; falls back to GROQ_API_KEY. |
| api_url | string | https://api.groq.com/openai/v1/audio/transcriptions | Whisper-compatible endpoint (Groq provider). |
| model | string | "whisper-large-v3-turbo" | Whisper model name (Groq provider). |
| language | string (optional) | — | ISO-639-1 language hint (e.g. en, ru). |
| initial_prompt | string (optional) | — | Prompt that biases transcription toward expected proper nouns / terms. |
| max_duration_secs | integer | 120 | Skip audio longer than this many seconds. |
| transcribe_non_ptt_audio | bool | false | Also transcribe non-voice-note (forwarded/regular) audio on WhatsApp. |
Per-provider sub-tables: [transcription.openai] (api_key, model = whisper-1), [transcription.deepgram] (api_key, model = nova-2), [transcription.assemblyai] (api_key), [transcription.google] (api_key, language_code = en-US), and [transcription.local_whisper].
The local_whisper sub-table points at any reachable Whisper-compatible endpoint:
| Key | Type | Default | Meaning |
|---|---|---|---|
| url | string (required) | — | HTTP(S) endpoint URL. |
| bearer_token | string (optional) | — | Auth token; omit for unauthenticated local endpoints. |
| max_audio_bytes | integer | 26214400 (25 MB) | Maximum accepted audio size; peak memory per request is roughly 2× this. |
| timeout_secs | integer | 300 | Request timeout (large files on local GPU). |
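The size and duration limits reduce to simple arithmetic. The sketch below works through the default max_audio_bytes value and the "roughly 2×" memory note, and combines it with the top-level max_duration_secs check. The function names are hypothetical, and treating both limits as inclusive is an assumption.

```python
MAX_AUDIO_BYTES = 25 * 1024 * 1024   # 26214400, the documented default


def peak_memory_estimate(max_audio_bytes=MAX_AUDIO_BYTES):
    """Rough per-request peak memory: about twice the size cap."""
    return 2 * max_audio_bytes


def within_limits(size_bytes, duration_secs,
                  max_audio_bytes=MAX_AUDIO_BYTES, max_duration_secs=120):
    """True when a clip passes both the size and duration limits."""
    return size_bytes <= max_audio_bytes and duration_secs <= max_duration_secs
```

With the defaults, the peak estimate is 52428800 bytes (50 MiB) per in-flight request, which is worth multiplying by expected concurrency when sizing a small host.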
Media Pipeline
The Media Pipeline is the cross-channel inbound media-understanding stage. When enabled, it pre-processes attachments before the agent sees them, replacing raw files with text annotations: audio is transcribed, images are described, and video is summarized. The result is that an agent on any text model can reason over a voice note or a photo without a separate tool call.
```toml
[media_pipeline]
enabled = true
transcribe_audio = true
describe_images = true
summarize_video = true
```

| Key | Type | Default | Meaning |
|---|---|---|---|
| enabled | bool | false | Master toggle for the pipeline. |
| transcribe_audio | bool | true | Transcribe audio attachments via the [transcription] provider. |
| describe_images | bool | true | Describe images when a vision-capable model is active. |
| summarize_video | bool | true | Summarize video attachments (requires an external API). |
The audio leg reuses your [transcription] configuration, and image description uses the active provider’s vision capability. When the provider has no vision support, image description falls back gracefully to a simple [Image: attached] annotation rather than failing.
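A simplified model of that annotation step, including the image fallback, might look like this. Only the `[Image: attached]` string comes from this page; the other annotation formats, the `annotate` function, and the callback parameters are hypothetical, and the video leg is omitted.

```python
from dataclasses import dataclass


@dataclass
class MediaPipelineConfig:
    enabled: bool = False
    transcribe_audio: bool = True
    describe_images: bool = True
    summarize_video: bool = True


def annotate(kind, config, transcribe=None, describe=None):
    """Return a text annotation for an attachment, or None to leave it as-is.

    `transcribe` and `describe` are zero-argument callables standing in for
    the [transcription] provider and a vision-capable model; passing None
    for `describe` models a provider without vision support.
    """
    if not config.enabled:
        return None
    if kind == "audio" and config.transcribe_audio and transcribe is not None:
        return f"[Audio transcript: {transcribe()}]"   # hypothetical format
    if kind == "image" and config.describe_images:
        if describe is None:
            return "[Image: attached]"                 # documented fallback
        return f"[Image: {describe()}]"                # hypothetical format
    return None
```

The key property is that a missing vision capability degrades to a placeholder annotation instead of raising, so the message still reaches the agent.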
Link Enricher
The Link Enricher fetches the content behind URLs in inbound messages and prepends a short summary, so the agent has link context without an explicit web-fetch tool call. It is off by default and is SSRF-protected.
```toml
[link_enricher]
enabled = false
max_links = 3
timeout_secs = 10
```

| Key | Type | Default | Meaning |
|---|---|---|---|
| enabled | bool | false | Master toggle for the enricher. |
| max_links | integer | 3 | Maximum unique URLs fetched per message. |
| timeout_secs | integer | 10 | Per-URL fetch timeout, in seconds. |
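The two ideas here, capping unique URLs per message and refusing to fetch internal addresses, can be sketched with the Python standard library. This is a generic illustration of the technique, not Revka's implementation: the function names, the URL regex, and the exact set of blocked ranges are assumptions, and a real SSRF guard would also resolve hostnames and check every resolved address.

```python
import ipaddress
import re

URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def select_urls(message, max_links=3):
    """Return up to `max_links` unique URLs, in order of first appearance."""
    selected = []
    for url in URL_PATTERN.findall(message):
        if len(selected) >= max_links:
            break
        url = url.rstrip(".,;:!?)")      # drop trailing sentence punctuation
        if url not in selected:
            selected.append(url)
    return selected


def is_blocked_address(address):
    """True for IP literals that an SSRF guard would typically refuse."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local      # includes 169.254.169.254 metadata endpoints
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

Deduplicating before applying the cap means a message that repeats one link several times still leaves room for other distinct URLs.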
How the pieces fit together
The voice channels are the audio I/O surface; the four subsystems are the shared machinery they (and text channels) draw on:
| Component | Config location | Role | Feature flag |
|---|---|---|---|
| Voice Call | [channels_config.voice_call] | Phone calls via Twilio / Telnyx / Plivo (webhook) | — |
| Voice Wake | [channels_config.voice_wake] | Local mic wake-word capture (input-only) | voice-wake |
| ClawdTalk | [channels_config.clawdtalk] | Telnyx SIP voice calls (webhook) | — |
| TTS | [tts] | Text → speech for voice replies | — |
| Transcription | [transcription] | Speech → text for voice input | — |
| Media Pipeline | [media_pipeline] | Annotate inbound audio/image/video | — |
| Link Enricher | [link_enricher] | Fetch & summarize URLs in messages | — |
A typical end-to-end voice setup combines several of these: enable [transcription] so spoken input becomes text, enable [tts] so replies become audio, then add a voice channel ([channels_config.voice_call] for the phone, or [channels_config.voice_wake] for the local mic). Voice Call additionally logs transcripts when transcription_logging = true.