Features

What VoiceLayer (Voice I/O for AI Coding Agents) can do

2 Tools, Auto Mode Detection

From fire-and-forget to full conversation — automatically

voice_speak handles text-to-speech (announcements, briefings, status updates). voice_ask handles bidirectional Q&A with session booking. Auto-mode detection picks the interaction pattern from context, so no manual mode switching is needed.

voice_speak — fire-and-forget TTS, rate adaptation by content length
voice_ask — full Q&A with microphone lock, STT via whisper.cpp
Auto-mode — picks announce, brief, consult, converse, or think automatically

Local Speech-to-Text

~300ms transcription, no cloud required

Recording uses sox at 16 kHz mono PCM, processed in 1-second chunks with RMS energy detection. Transcription runs locally through whisper.cpp and takes roughly 200-400ms on Apple Silicon with ggml-large-v3-turbo. A cloud fallback via the Wispr Flow WebSocket API handles cases where a local setup isn't available. Backend selection is automatic based on what's installed.

json

{
  "mcpServers": {
    "qa-voice": {
      "command": "bunx",
      "args": ["voicelayer-mcp"],
      "env": {
        "QA_VOICE_STT_BACKEND": "auto",
        "QA_VOICE_TTS_VOICE": "en-US-JennyNeural"
      }
    }
  }
}

MCP config with STT backend auto-detection
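
To make the RMS energy step concrete, here is a minimal sketch of how one-second chunks of 16 kHz mono PCM could be scored. The threshold value and the names rmsEnergy, classifyChunks, and SPEECH_RMS are assumptions for illustration, not VoiceLayer's actual code.

```typescript
// Illustrative sketch only. The threshold and function names are assumptions.
const SAMPLE_RATE = 16000;         // 16 kHz mono, as described above
const CHUNK_SAMPLES = SAMPLE_RATE; // one second of audio per chunk
const SPEECH_RMS = 0.01;           // hypothetical threshold on a 0..1 scale

/** Root-mean-square energy of signed 16-bit PCM samples, normalized to 0..1. */
function rmsEnergy(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const x = samples[i] / 32768; // scale to the range -1..1
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

/** Returns one flag per one-second chunk: true when energy reaches the threshold. */
function classifyChunks(pcm: Int16Array): boolean[] {
  const flags: boolean[] = [];
  for (let start = 0; start < pcm.length; start += CHUNK_SAMPLES) {
    const chunk = pcm.subarray(start, start + CHUNK_SAMPLES);
    flags.push(rmsEnergy(chunk) >= SPEECH_RMS);
  }
  return flags;
}
```

A recorder could use flags like these to decide when the speaker has stopped, for example after some number of consecutive low-energy chunks. That stopping rule is likewise an assumption here.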

Session Booking

One microphone, no conflicts

Only one Claude session can use the microphone at a time. A lockfile at /tmp/voicelayer-session.lock stores the owning PID, session ID, and start timestamp. The lock is created with the atomic wx write flag (exclusive create, which fails if the file already exists) to prevent race conditions. Dead-process detection uses signal 0: if the owning PID no longer exists, the stale lock is cleaned up automatically.
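
The booking logic described above could look roughly like the following sketch, which uses Node's fs module and process.kill. The LockInfo shape and the function names acquireLock and isAlive are hypothetical, and the real implementation may differ.

```typescript
import { writeFileSync, readFileSync, unlinkSync } from "node:fs";

// Hypothetical lock payload: owning PID, session ID, and start timestamp.
interface LockInfo { pid: number; sessionId: string; startedAt: string; }

/** Signal 0 delivers nothing; it only checks whether the PID exists. */
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

/** Returns the new lock on success, or null when a live session owns it. */
function acquireLock(lockPath: string, sessionId: string): LockInfo | null {
  const mine: LockInfo = {
    pid: process.pid,
    sessionId,
    startedAt: new Date().toISOString(),
  };
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" creates the file exclusively and fails with EEXIST if it exists.
      writeFileSync(lockPath, JSON.stringify(mine), { flag: "wx" });
      return mine;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
      let owner: LockInfo | undefined;
      try {
        owner = JSON.parse(readFileSync(lockPath, "utf8")) as LockInfo;
      } catch {
        // Unreadable lock: this sketch treats it as stale (an assumption).
      }
      if (owner && isAlive(owner.pid)) return null; // line is busy
      try { unlinkSync(lockPath); } catch { /* already removed by someone else */ }
    }
  }
  return null;
}
```

The exclusive create is what makes acquisition atomic. The stale-lock cleanup in this sketch is not: two sessions could both remove a dead owner's lock at nearly the same moment, so a production version would need extra care at that step.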

typescript

// Other sessions see:
{
  isError: true,
  content: [{
    type: "text",
    text: "Line is busy — session abc123 " +
          "(PID 4821) since 14:30:00. " +
          "Fall back to text input."
  }]
}

"Line busy" response with owner details

Edge-TTS + Smart Chunking

Free, high-quality speech with word-boundary splitting

Microsoft Edge-TTS provides neural-quality speech synthesis at zero cost. Long messages are automatically chunked at word boundaries to prevent truncation. Speech rate auto-adjusts based on content length — shorter messages play faster, longer explanations slow down by up to 15%. Each voice mode has its own rate default.

Neural-quality voices at $0 cost
Word-boundary text splitting for long messages
Auto rate adjustment by content length
Mode-specific rate defaults
Configurable voice selection
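
As a rough illustration of word-boundary chunking and length-based rate adjustment, consider the sketch below. The length cutoffs, the linear ramp, and the signed-percent rate string are assumptions made for this example. Only the 15% maximum slowdown comes from the description above.

```typescript
/** Splits text into chunks of at most maxChars, breaking only between words. */
function chunkAtWordBoundaries(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    if (current && current.length + 1 + word.length > maxChars) {
      chunks.push(current);
      current = word; // a single oversized word becomes its own chunk
    } else {
      current = current ? current + " " + word : word;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Hypothetical cutoffs: at or below SHORT_CHARS there is no slowdown,
// at or above LONG_CHARS the full slowdown applies.
const SHORT_CHARS = 100;
const LONG_CHARS = 1000;
const MAX_SLOWDOWN_PCT = 15; // "up to 15%" per the description

/** Returns a signed percent string such as "+10%" or "-15%". */
function rateForLength(chars: number, modeBasePct: number = 0): string {
  const ramp = (chars - SHORT_CHARS) / (LONG_CHARS - SHORT_CHARS);
  const t = Math.min(1, Math.max(0, ramp));
  const pct = Math.round(modeBasePct - t * MAX_SLOWDOWN_PCT);
  return (pct >= 0 ? "+" : "") + pct + "%";
}
```

The modeBasePct parameter stands in for the mode-specific rate default, so a faster mode starts from a higher base before the length-based slowdown is subtracted.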

MCP Daemon Architecture

Singleton voice service via socat — always on

VoiceLayer runs as a macOS MCP daemon with dual-protocol support (NDJSON + MCP Content-Length). A socat-based singleton ensures only one voice service instance runs, even across multiple Claude sessions. The daemon auto-starts via a macOS LaunchAgent. Users can stop it with a signal file, and a 5-minute orphan timeout cleans up stale session bookings.

Socat singleton — one daemon, many sessions
Dual-protocol — NDJSON + MCP Content-Length
LaunchAgent auto-start — zero manual setup
5-minute orphan timeout for stale sessions
touch /tmp/voicelayer-stop — instant stop
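
One way a daemon could accept both framings on the same stream is to inspect the leading bytes of each message. The sketch below assumes that NDJSON messages are newline-terminated JSON and that the Content-Length framing follows the LSP-style header layout. The readFrame function and Frame type are hypothetical names, and the real daemon may detect the protocol differently.

```typescript
import { Buffer } from "node:buffer";

type Frame = {
  protocol: "ndjson" | "content-length";
  message: unknown;
  rest: Buffer; // bytes left over after this frame
};

/** Extracts one complete message from buf, or returns null if more bytes are needed. */
function readFrame(buf: Buffer): Frame | null {
  const head = buf.subarray(0, 15).toString("ascii").toLowerCase();
  if (head.startsWith("content-length:")) {
    const headerEnd = buf.indexOf("\r\n\r\n");
    if (headerEnd === -1) return null; // header not complete yet
    const header = buf.subarray(0, headerEnd).toString("ascii");
    const match = /content-length:\s*(\d+)/i.exec(header);
    if (!match) throw new Error("malformed Content-Length header");
    const bodyStart = headerEnd + 4;
    const bodyEnd = bodyStart + Number(match[1]);
    if (buf.length < bodyEnd) return null; // body not complete yet
    const body = buf.subarray(bodyStart, bodyEnd).toString("utf8");
    return { protocol: "content-length", message: JSON.parse(body), rest: buf.subarray(bodyEnd) };
  }
  const newline = buf.indexOf("\n");
  if (newline === -1) return null; // line not complete yet
  const line = buf.subarray(0, newline).toString("utf8");
  return { protocol: "ndjson", message: JSON.parse(line), rest: buf.subarray(newline + 1) };
}
```

A caller would append incoming socket data to a buffer and call readFrame in a loop until it returns null, carrying rest forward between calls.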