What VoiceLayer — Voice I/O for AI Coding Agents can do
From fire-and-forget to full conversation — automatically
voice_speak handles text-to-speech (announcements, briefings, status updates). voice_ask handles bidirectional Q&A with session booking. Auto-mode detection chooses the right interaction pattern from context, so no manual mode switching is needed.
~300ms transcription, no cloud required
Recording uses sox at 16kHz mono PCM, processed in 1-second chunks with RMS energy detection. Transcription runs through whisper.cpp locally — ~200-400ms on Apple Silicon with ggml-large-v3-turbo. A cloud fallback via Wispr Flow WebSocket API handles cases where local setup isn't available. Backend selection is automatic based on what's installed.
{
  "mcpServers": {
    "qa-voice": {
      "command": "bunx",
      "args": ["voicelayer-mcp"],
      "env": {
        "QA_VOICE_STT_BACKEND": "auto",
        "QA_VOICE_TTS_VOICE": "en-US-JennyNeural"
      }
    }
  }
}
MCP config with STT backend auto-detection
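The RMS energy detection mentioned above can be sketched as follows. This is a minimal illustration, assuming signed 16-bit PCM samples; the function names and the 0.01 silence threshold are hypothetical and not taken from VoiceLayer itself.

```typescript
// Sketch only: names and the threshold are illustrative assumptions.
const SAMPLE_RATE = 16000;         // 16 kHz mono, as described above
const CHUNK_SAMPLES = SAMPLE_RATE; // one second of audio per chunk

/** Root-mean-square energy of signed 16-bit PCM samples, normalized to 0..1. */
function rmsEnergy(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const x = samples[i] / 32768; // scale to [-1, 1)
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

/** Hypothetical silence check; 0.01 is an assumed threshold. */
function isSilent(samples: Int16Array, threshold: number = 0.01): boolean {
  return rmsEnergy(samples) < threshold;
}

/** Number of consecutive silent one-second chunks at the end of the buffer. */
function trailingSilentSeconds(pcm: Int16Array, threshold: number = 0.01): number {
  let silent = 0;
  for (let start = 0; start + CHUNK_SAMPLES <= pcm.length; start += CHUNK_SAMPLES) {
    const chunk = pcm.subarray(start, start + CHUNK_SAMPLES);
    silent = isSilent(chunk, threshold) ? silent + 1 : 0; // reset on any sound
  }
  return silent;
}
```

A recorder built this way could stop once trailingSilentSeconds reaches the configured silence limit.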
One microphone, no conflicts
Only one Claude session can use the microphone at a time. A lockfile at /tmp/voicelayer-session.lock stores the owning PID, session ID, and start timestamp. The lock is created with the exclusive wx write flag, which fails if the file already exists, so two sessions cannot both acquire it in a race. Dead-process detection uses signal 0: if the owning PID no longer exists, the stale lock is cleaned up automatically.
// Other sessions see:
{
  isError: true,
  content: [{
    type: "text",
    text: "Line is busy — session abc123 " +
      "(PID 4821) since 14:30:00. " +
      "Fall back to text input."
  }]
}
"Line busy" response with owner details
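The lock acquisition described above might look roughly like this in Node-compatible TypeScript. It is a sketch under stated assumptions: the SessionLock shape, the function names, and the retry-once policy are hypothetical, and only the wx flag and the signal 0 check come from the description.

```typescript
import * as fs from "node:fs";

// Hypothetical lock payload; field names are assumptions for illustration.
interface SessionLock {
  pid: number;
  sessionId: string;
  startedAt: string;
}

/** Signal 0 delivers nothing; it only checks whether the PID exists. */
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

/** Returns the new lock on success, or null if a live session owns the mic. */
function tryAcquireLock(lockPath: string, sessionId: string): SessionLock | null {
  const lock: SessionLock = {
    pid: process.pid,
    sessionId,
    startedAt: new Date().toISOString(),
  };
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" fails with EEXIST if the file already exists.
      fs.writeFileSync(lockPath, JSON.stringify(lock), { flag: "wx" });
      return lock;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
      let owner: SessionLock | null = null;
      try {
        owner = JSON.parse(fs.readFileSync(lockPath, "utf8"));
      } catch {
        owner = null; // unreadable or corrupt lock: treat as stale
      }
      if (owner && typeof owner.pid === "number" && isProcessAlive(owner.pid)) {
        return null; // line is busy
      }
      fs.rmSync(lockPath, { force: true }); // stale lock: remove, then retry once
    }
  }
  return null;
}
```

A caller that receives null would return the "line busy" response and fall back to text input.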
Free, high-quality, rate-adaptive speech
Microsoft Edge-TTS provides neural-quality speech synthesis at zero cost. Speech rate auto-adjusts based on content length — shorter messages play faster, longer explanations slow down by up to 15%. Each voice mode has its own rate default. Audio plays through the platform-native player (afplay on macOS, mpv/ffplay on Linux).
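One way to picture the rate adjustment is a linear ramp from the mode's default rate down to a 15% slowdown. The sketch below is purely illustrative: the character thresholds, the linear interpolation, the function names, and the signed-percentage string format are all assumptions, since the text above does not specify the actual formula.

```typescript
// Hypothetical tuning knobs; only the 15% cap comes from the description above.
interface RateOptions {
  shortChars: number;         // at or below this length, no slowdown
  longChars: number;          // at or above this length, full slowdown
  maxSlowdownPercent: number; // cap on the slowdown
}

const DEFAULT_RATE_OPTIONS: RateOptions = {
  shortChars: 100,
  longChars: 1000,
  maxSlowdownPercent: 15,
};

/** Format a percentage offset as a signed string such as "+10%" or "-5%". */
function formatRate(percent: number): string {
  const rounded = Math.round(percent);
  return `${rounded >= 0 ? "+" : ""}${rounded}%`;
}

/** Start from the voice mode's default rate and slow down as the text grows. */
function adaptiveRate(
  text: string,
  modeDefaultPercent: number,
  opts: RateOptions = DEFAULT_RATE_OPTIONS,
): string {
  const span = opts.longChars - opts.shortChars;
  // Clamp the interpolation factor to the range 0..1.
  const t = Math.min(1, Math.max(0, (text.length - opts.shortChars) / span));
  return formatRate(modeDefaultPercent - t * opts.maxSlowdownPercent);
}
```

With these assumed defaults, a two-word status update keeps the mode's default rate, while a long explanation is slowed by the full 15 percentage points.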
Unix philosophy: touch a file to stop
Both recording and playback can be stopped instantly by touching a signal file: touch /tmp/voicelayer-stop. A polling loop checks for this file every 300ms during audio output. For recording, silence detection (5s by default) provides a natural stop, and a maximum timeout prevents runaway sessions. The signal file is cleaned up automatically after each interaction.
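The stop-file polling described above could be sketched like this. The function names, the 30-second default timeout, and the return values are assumptions for illustration; only the signal file idea and the 300ms interval come from the text.

```typescript
import * as fs from "node:fs";

/** Returns true if the stop file exists, deleting it so the next run starts clean. */
function consumeStopSignal(stopFile: string): boolean {
  if (!fs.existsSync(stopFile)) return false;
  fs.rmSync(stopFile, { force: true });
  return true;
}

/**
 * Poll for the stop file until it appears or the timeout elapses.
 * timeoutMs is a hypothetical default, not a documented VoiceLayer value.
 */
function waitForStop(
  stopFile: string,
  pollMs: number = 300,
  timeoutMs: number = 30000,
): Promise<"stopped" | "timeout"> {
  return new Promise((resolve) => {
    const startedAt = Date.now();
    const timer = setInterval(() => {
      if (consumeStopSignal(stopFile)) {
        clearInterval(timer);
        resolve("stopped");
      } else if (Date.now() - startedAt >= timeoutMs) {
        clearInterval(timer);
        resolve("timeout");
      }
    }, pollMs);
  });
}
```

A playback routine could race waitForStop against the audio player's exit and kill the player if "stopped" wins.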