Voice, TTS & transcription

Voice call channels, wake-word capture, ClawdTalk SIP, and the shared TTS/transcription/media/link-enricher subsystems.

Revka can talk and listen, not just type. This page covers the three voice channels — the multi-provider Voice Call channel (Twilio / Telnyx / Plivo), the local-microphone Voice Wake word detector, and ClawdTalk (Telnyx SIP) — plus the four shared subsystems that voice and text channels both rely on: TTS (text-to-speech), Transcription (speech-to-text), the Media Pipeline, and the Link Enricher.

TTS, transcription, the media pipeline, and the link enricher are configured once at the top level of ~/.revka/config.toml and are then consumed by whichever channels need them. The voice channels live under [channels_config] like every other channel. If you have not connected a channel before, read the Channels overview for the shared trait model, allowlist semantics, and the polling-vs-webhook distinction.

The Voice Call channel handles real-time inbound and outbound phone calls through one of three telephony providers — Twilio, Telnyx, or Plivo — selected with a single provider key. It streams speech-to-text and text-to-speech during the call, optionally logs the full transcript to the workspace, and gates outbound calls behind an approval workflow. The telephony provider POSTs call events to Revka’s webhook, which translates them into channel messages for the agent loop.

[channels_config.voice_call]
provider = "twilio" # twilio | telnyx | plivo
account_id = "ACxxxxxxxxxxxxxxxx" # Twilio Account SID / Telnyx API Key / Plivo Auth ID
auth_token = "your-auth-token" # Twilio auth token / Telnyx API secret / Plivo auth token
from_number = "+15551234567" # E.164 caller ID for outbound calls
webhook_port = 8090
require_outbound_approval = true
transcription_logging = true
# tts_voice = "Polly.Joanna" # provider-specific voice name
max_call_duration_secs = 3600
# webhook_base_url = "https://your-tunnel.example.com"

| Field | Type / values | Default | Meaning |
| --- | --- | --- | --- |
| provider | "twilio", "telnyx", or "plivo" | "twilio" | Telephony provider. |
| account_id | string (required) | - | Twilio Account SID, Telnyx API Key, or Plivo Auth ID. |
| auth_token | string (required) | - | Twilio auth token, Telnyx API secret, or Plivo auth token. |
| from_number | string (required) | - | E.164 caller ID used for outbound calls. |
| webhook_port | integer | 8090 | Port the channel listens on for telephony webhooks. |
| require_outbound_approval | bool | true | Require human approval before placing an outbound call. |
| transcription_logging | bool | true | Log the full call transcript to the workspace directory. |
| tts_voice | string (optional) | - | Provider-specific voice name for call audio (e.g. Polly.Joanna). |
| max_call_duration_secs | integer | 3600 | Hard cap on call length, in seconds. |
| webhook_base_url | string (optional) | auto-detect | Public URL override for the webhook callback (e.g. an ngrok / Tailscale tunnel). |

Because telephony providers deliver call events by webhook (push), Voice Call needs a reachable HTTPS callback URL. Set webhook_base_url to your public address, or expose the gateway through a tunnel — see Expose your gateway with a tunnel.
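As an illustration, a Telnyx-backed variant with an explicit public callback URL might look like the following. It uses only the keys documented above; the credentials, phone number, and tunnel hostname are placeholders, and the assumption that the tunnel forwards to webhook_port is ours rather than something this page states.

```toml
[channels_config.voice_call]
provider = "telnyx"
account_id = "your-telnyx-api-key"       # placeholder: Telnyx API Key
auth_token = "your-telnyx-api-secret"    # placeholder: Telnyx API secret
from_number = "+15551234567"             # E.164 caller ID for outbound calls
webhook_port = 8090                      # assumed to be the local port your tunnel forwards to
webhook_base_url = "https://your-tunnel.example.com"  # placeholder public HTTPS address
require_outbound_approval = true         # keep the approval gate on outbound calls
```

Setting webhook_base_url explicitly bypasses auto-detection, which is likely the safer choice whenever the machine sits behind NAT or a tunnel.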

Voice Wake turns the host machine’s microphone into an always-on wake-word trigger. It listens continuously on the default audio input via cpal, uses energy-based voice activity detection (VAD) to spot speech, transcribes a short window to check for your configured wake word, and on a match captures the full utterance and dispatches it to the agent. Internally it runs a four-state machine: Listening → Triggered → Capturing → Processing.

[channels_config.voice_wake]
wake_word = "hey revka"
silence_timeout_ms = 2000
energy_threshold = 0.01
max_capture_secs = 30

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| wake_word | string | "hey revka" | Case-insensitive substring matched in the trigger window. |
| silence_timeout_ms | integer | 2000 | Silence (ms) after the last energy spike before a capture is finalized. |
| energy_threshold | float | 0.01 | RMS floor for VAD; samples below this count as silence. |
| max_capture_secs | integer | 30 | Maximum capture length before transcription is forced. |

Voice Wake also requires the top-level [transcription] section — it has no STT of its own and calls the shared transcription subsystem to turn captured audio into text. Configure transcription before enabling it.

Build with the feature enabled:

cargo build --release --locked --features voice-wake

If [channels_config.voice_wake] is configured but the binary was built without the flag, Revka intentionally skips the channel rather than erroring — revka channel list and revka channel doctor report it as skipped for this build.
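Because Voice Wake depends on the shared transcription subsystem, the two sections are usually configured together. The sketch below pairs them using only documented keys; the energy_threshold and silence_timeout_ms values are illustrative guesses for a noisier room, not recommended settings.

```toml
# Shared STT subsystem: Voice Wake has no STT of its own.
[transcription]
enabled = true
default_provider = "groq"    # api_key falls back to GROQ_API_KEY

[channels_config.voice_wake]
wake_word = "hey revka"
energy_threshold = 0.03      # illustrative: above the 0.01 default to ignore more background noise
silence_timeout_ms = 1500    # illustrative: finalize a capture sooner than the 2000 ms default
max_capture_secs = 30
```

Raising energy_threshold trades sensitivity for fewer false triggers, so expect to tune it against your own microphone and environment.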

ClawdTalk is a dedicated AI-voice channel built on Telnyx’s global SIP network, using the Telnyx API v2 (https://api.telnyx.com/v2) for call management. It is distinct from the multi-provider Voice Call channel: where Voice Call abstracts over three telephony vendors, ClawdTalk is Telnyx-SIP-specific and optimized for low-latency conversational calls. Inbound call events arrive by webhook with optional signature verification.

[channels_config.clawdtalk]
api_key = "KEY01xxxxxxxxxxxxxxxx" # Telnyx API key (required)
connection_id = "telnyx-connection-id" # Telnyx SIP connection (required)
from_number = "+15551234567" # E.164 caller ID (required)
allowed_destinations = [] # empty = allow all destinations
# webhook_secret = "telnyx-webhook-secret"

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| api_key | string (required) | - | Telnyx API key. |
| connection_id | string (required) | - | Telnyx SIP connection ID. |
| from_number | string (required) | - | E.164 caller ID for outbound calls. |
| allowed_destinations | list | [] (allow all) | Destination number prefixes/patterns, or "*". An empty list allows every destination. |
| webhook_secret | string (optional) | - | Telnyx webhook signature secret for verifying inbound events. |
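A more locked-down ClawdTalk configuration could restrict outbound destinations and turn on signature verification. This page says entries are prefixes or patterns but does not spell out the exact pattern syntax, so the plain E.164 prefixes below are an assumption, and the prefixes themselves are hypothetical.

```toml
[channels_config.clawdtalk]
api_key = "KEY01xxxxxxxxxxxxxxxx"
connection_id = "telnyx-connection-id"
from_number = "+15551234567"
allowed_destinations = ["+1555", "+4420"]  # assumed prefix form; hypothetical prefixes
webhook_secret = "telnyx-webhook-secret"   # enables verification of inbound webhook events
```

Leaving allowed_destinations empty allows every destination, so an explicit list is the more conservative choice for an agent that can place calls.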

The [tts] section is a shared, multi-provider synthesis subsystem consumed by voice channels (and by WhatsApp Web). It is not called directly — channels invoke it when they need to turn the agent’s text into audio. Supported providers: OpenAI, ElevenLabs, Google Cloud TTS, Edge TTS (a free subprocess-based backend), and Piper (a local, OpenAI-compatible endpoint).

[tts]
enabled = true
default_provider = "openai" # openai | elevenlabs | google | edge | piper
default_voice = "alloy"
default_format = "mp3" # mp3 | opus | wav
max_text_length = 4096
[tts.openai]
# api_key = "..." # falls back to OPENAI_API_KEY
model = "tts-1"
speed = 1.0
[tts.elevenlabs]
# api_key = "..." # falls back to ELEVENLABS_API_KEY
model_id = "eleven_monolingual_v1"
stability = 0.5
similarity_boost = 0.5
[tts.google]
# api_key = "..." # falls back to GOOGLE_TTS_API_KEY
language_code = "en-US"
[tts.edge]
binary_path = "edge-tts"
[tts.piper]
api_url = "http://127.0.0.1:5000/v1/audio/speech"

Top-level [tts] keys:

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| enabled | bool | false | Master toggle for TTS synthesis. |
| default_provider | string | "openai" | openai, elevenlabs, google, edge, or piper. |
| default_voice | string | "alloy" | Voice ID passed to the selected provider. |
| default_format | string | "mp3" | Output audio format: mp3, opus, or wav. |
| max_text_length | integer | 4096 | Maximum input text length, in characters. |

Per-provider sub-tables:

| Sub-table | Keys (defaults) |
| --- | --- |
| [tts.openai] | api_key, model (tts-1), speed (1.0) |
| [tts.elevenlabs] | api_key, model_id (eleven_monolingual_v1), stability (0.5), similarity_boost (0.5) |
| [tts.google] | api_key, language_code (en-US) |
| [tts.edge] | binary_path (edge-tts) |
| [tts.piper] | api_url (http://127.0.0.1:5000/v1/audio/speech) |

OpenAI’s built-in voices are alloy, echo, fable, onyx, nova, and shimmer. Edge TTS uses Microsoft Neural voices and is free but requires the edge-tts binary on PATH. Piper runs locally against an OpenAI-compatible HTTP endpoint — useful for fully offline, GPU-accelerated synthesis.
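For a setup with no API key, switching the default provider to Edge TTS might look like this. The voice ID is an assumption: it is one of Microsoft's Neural voice names, but which IDs Revka accepts for the edge provider is not stated on this page.

```toml
[tts]
enabled = true
default_provider = "edge"
default_voice = "en-US-AriaNeural"   # assumed voice ID for the edge provider
default_format = "mp3"

[tts.edge]
binary_path = "edge-tts"             # the edge-tts binary must be available on PATH
```

Note that default_voice is passed to whichever provider is selected, so the OpenAI default of "alloy" would need changing when you switch providers.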

The [transcription] section is the shared STT subsystem. It powers Voice Wake, the Media Pipeline’s audio leg, and audio-attachment handling on every channel that accepts voice notes (Telegram, Discord, Slack, Mattermost, Matrix, WhatsApp Web, WATI, and others). The default provider is Groq (Whisper-compatible); you can also use OpenAI, Deepgram, AssemblyAI, Google Cloud Speech-to-Text, or a self-hosted Whisper endpoint.

[transcription]
enabled = true
default_provider = "groq" # groq | openai | deepgram | assemblyai | google | local_whisper
# api_key = "..." # Groq provider; falls back to GROQ_API_KEY
api_url = "https://api.groq.com/openai/v1/audio/transcriptions"
model = "whisper-large-v3-turbo"
# language = "en" # optional ISO-639-1 hint
# initial_prompt = "Revka, Kumiho" # bias toward expected vocabulary
max_duration_secs = 120
transcribe_non_ptt_audio = false
# Optional per-provider sub-tables:
[transcription.openai]
# api_key = "..."
model = "whisper-1"
[transcription.local_whisper]
url = "http://10.10.0.1:8001/v1/transcribe"
# bearer_token = "..."
# max_audio_bytes = 26214400 # 25 MB
timeout_secs = 300

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| enabled | bool | false | Master toggle for transcription. |
| default_provider | string | "groq" | groq, openai, deepgram, assemblyai, google, or local_whisper. |
| api_key | string (optional) | - | Key for the Groq provider; falls back to GROQ_API_KEY. |
| api_url | string | https://api.groq.com/openai/v1/audio/transcriptions | Whisper-compatible endpoint (Groq provider). |
| model | string | "whisper-large-v3-turbo" | Whisper model name (Groq provider). |
| language | string (optional) | - | ISO-639-1 language hint (e.g. en, ru). |
| initial_prompt | string (optional) | - | Prompt that biases transcription toward expected proper nouns / terms. |
| max_duration_secs | integer | 120 | Skip audio longer than this many seconds. |
| transcribe_non_ptt_audio | bool | false | Also transcribe non-voice-note (forwarded/regular) audio on WhatsApp. |

Per-provider sub-tables: [transcription.openai] (api_key, model = whisper-1), [transcription.deepgram] (api_key, model = nova-2), [transcription.assemblyai] (api_key), [transcription.google] (api_key, language_code = en-US), and [transcription.local_whisper].

The local_whisper sub-table points at any reachable Whisper-compatible endpoint:

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| url | string (required) | - | HTTP(S) endpoint URL. |
| bearer_token | string (optional) | - | Auth token; omit for unauthenticated local endpoints. |
| max_audio_bytes | integer | 26214400 (25 MB) | Maximum accepted audio size; peak memory per request is roughly 2× this. |
| timeout_secs | integer | 300 | Request timeout (large files on local GPU). |
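Putting the two tables together, a self-hosted transcription setup could be sketched as follows. The loopback URL is a hypothetical endpoint, and the lower max_audio_bytes value is only there to show how the memory note plays out.

```toml
[transcription]
enabled = true
default_provider = "local_whisper"
max_duration_secs = 120

[transcription.local_whisper]
url = "http://127.0.0.1:8001/v1/transcribe"  # hypothetical local endpoint
# bearer_token omitted: unauthenticated local endpoint
max_audio_bytes = 10485760                    # 10 MB cap, so roughly 20 MB peak memory per request
timeout_secs = 300
```

Lowering max_audio_bytes is one way to bound memory on a small host, at the cost of rejecting larger attachments.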

The Media Pipeline is the cross-channel inbound media-understanding stage. When enabled, it pre-processes attachments before the agent sees them, replacing raw files with text annotations: audio is transcribed, images are described, and video is summarized. The result is that an agent on any text model can reason over a voice note or a photo without a separate tool call.

[media_pipeline]
enabled = true
transcribe_audio = true
describe_images = true
summarize_video = true

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| enabled | bool | false | Master toggle for the pipeline. |
| transcribe_audio | bool | true | Transcribe audio attachments via the [transcription] provider. |
| describe_images | bool | true | Describe images when a vision-capable model is active. |
| summarize_video | bool | true | Summarize video attachments (requires an external API). |

The audio leg reuses your [transcription] configuration, and image description uses the active provider’s vision capability. When the provider has no vision support, image description falls back gracefully to a simple [Image: attached] annotation rather than failing.
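If you want audio and image annotations but have no external video API available, one plausible configuration is to leave the video leg off. This sketch assumes [transcription] is already enabled elsewhere in the file.

```toml
[media_pipeline]
enabled = true
transcribe_audio = true    # uses the [transcription] provider
describe_images = true     # falls back to "[Image: attached]" without vision support
summarize_video = false    # off here because it requires an external API
```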

The Link Enricher fetches the content behind URLs in inbound messages and prepends a short summary, so the agent has link context without an explicit web-fetch tool call. It is off by default and is SSRF-protected.

[link_enricher]
enabled = false
max_links = 3
timeout_secs = 10

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| enabled | bool | false | Master toggle for the enricher. |
| max_links | integer | 3 | Maximum unique URLs fetched per message. |
| timeout_secs | integer | 10 | Per-URL fetch timeout, in seconds. |

The voice channels are the audio I/O surface; the four subsystems are the shared machinery they (and text channels) draw on:

| Component | Config location | Role | Feature flag |
| --- | --- | --- | --- |
| Voice Call | [channels_config.voice_call] | Phone calls via Twilio / Telnyx / Plivo (webhook) | - |
| Voice Wake | [channels_config.voice_wake] | Local mic wake-word capture (input-only) | voice-wake |
| ClawdTalk | [channels_config.clawdtalk] | Telnyx SIP voice calls (webhook) | - |
| TTS | [tts] | Text → speech for voice replies | - |
| Transcription | [transcription] | Speech → text for voice input | - |
| Media Pipeline | [media_pipeline] | Annotate inbound audio/image/video | - |
| Link Enricher | [link_enricher] | Fetch & summarize URLs in messages | - |

A typical end-to-end voice setup combines several of these: enable [transcription] so spoken input becomes text, enable [tts] so replies become audio, then add a voice channel ([channels_config.voice_call] for the phone, or [channels_config.voice_wake] for the local mic). Voice Call additionally logs transcripts when transcription_logging = true.
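The combination described above can be sketched as a single fragment of ~/.revka/config.toml. It is a minimal starting point that leans on documented defaults and environment-variable fallbacks; all credentials and the phone number are placeholders.

```toml
[transcription]
enabled = true
default_provider = "groq"        # api_key falls back to GROQ_API_KEY

[tts]
enabled = true
default_provider = "openai"      # [tts.openai] api_key falls back to OPENAI_API_KEY
default_voice = "alloy"

[channels_config.voice_call]
provider = "twilio"
account_id = "ACxxxxxxxxxxxxxxxx"   # placeholder Twilio Account SID
auth_token = "your-auth-token"      # placeholder Twilio auth token
from_number = "+15551234567"        # E.164 caller ID for outbound calls
require_outbound_approval = true
transcription_logging = true        # write call transcripts to the workspace
```

For the local microphone instead of the phone, swapping the last table for [channels_config.voice_wake] and building with the voice-wake feature should give the equivalent input-only setup.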