VoiceLayer — Voice I/O for AI Coding Agents
VoiceLayer adds bidirectional voice to AI coding agents via the Model Context Protocol. It provides 5 voice modes (announce, brief, consult, converse, think) for different interaction patterns, from fire-and-forget status updates to full voice Q&A with local speech-to-text. It uses edge-tts for neural text-to-speech and whisper.cpp for local transcription (~300ms on Apple Silicon). Session booking prevents mic conflicts between parallel Claude sessions. Everything runs locally with zero cloud APIs.
Project Journey
The Problem
Typing every interaction with AI coding agents felt wrong. QA testing, code review, and design discussions should be conversations, not typing marathons. Existing voice platforms charge by the minute and send data to the cloud.
5 Voice Modes
Designed 5 distinct modes for different moments: announce (fire-and-forget status), brief (agent reads back findings), consult (checkpoint before action), converse (full bidirectional Q&A), and think (silent notes to markdown).
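A minimal sketch of how the five modes might dispatch, assuming hypothetical `speak()` and `listen()` helpers standing in for the edge-tts and whisper.cpp layers; VoiceLayer's actual tool names and signatures may differ:

```python
from enum import Enum

class VoiceMode(Enum):
    ANNOUNCE = "announce"  # fire-and-forget status update
    BRIEF = "brief"        # agent reads back findings, no reply expected
    CONSULT = "consult"    # checkpoint: speak, then wait for approval
    CONVERSE = "converse"  # full bidirectional Q&A
    THINK = "think"        # silent note to markdown, no audio

def speak(text: str) -> None:
    print(f"[tts] {text}")        # stand-in for edge-tts playback

def listen() -> str:
    return input("[stt] > ")      # stand-in for mic capture + whisper.cpp

def handle(mode: VoiceMode, text: str) -> str | None:
    """Route a message through the requested voice mode."""
    if mode is VoiceMode.THINK:
        with open("notes.md", "a") as f:   # silent: append, never speak
            f.write(f"- {text}\n")
        return None
    speak(text)                            # every other mode produces speech
    if mode in (VoiceMode.CONSULT, VoiceMode.CONVERSE):
        return listen()                    # block on the mic for a reply
    return None                            # announce/brief never wait
```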
Local STT via whisper.cpp
Speech-to-text runs locally using whisper.cpp with CoreML/Metal acceleration — transcription in ~200-400ms on Apple Silicon. No cloud APIs, no per-minute billing, no data leaving the machine.
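As a sketch of this step, transcription can be a single call into a whisper.cpp CLI build; the binary name (`whisper-cli` here, `main` in older builds), the model path, and the exact flags are assumptions to adjust for your install:

```python
import subprocess

def transcribe(wav_path: str,
               model: str = "models/ggml-base.en.bin") -> str:
    """Run whisper.cpp on a 16 kHz mono WAV and return plain text."""
    result = subprocess.run(
        ["whisper-cli",
         "-m", model,           # GGML model file
         "-f", wav_path,        # input audio
         "--no-timestamps"],    # plain text only, no [00:00] markers
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```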
Session Booking
A lockfile-based mutex prevents mic conflicts. Only one voice session runs at a time; other Claude sessions see "line busy" and fall back to text. Stale locks from dead processes are auto-cleaned.
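A minimal sketch of the booking scheme, assuming a fixed lock path and PID-based liveness checks; this is illustrative, not VoiceLayer's exact implementation:

```python
import os

LOCK = "/tmp/voicelayer.lock"

def book_session() -> bool:
    """Try to claim the mic; False means 'line busy'."""
    try:
        # O_CREAT | O_EXCL fails atomically if the lock already exists
        fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())  # record the owner PID
        os.close(fd)
        return True
    except FileExistsError:
        try:
            with open(LOCK) as f:
                pid = int(f.read())
            os.kill(pid, 0)                      # signal 0 = liveness probe
        except (ValueError, FileNotFoundError, ProcessLookupError):
            # garbled lock, vanished lock, or dead owner: reclaim it
            try:
                os.remove(LOCK)
            except FileNotFoundError:
                pass
            return book_session()                # one clean retry
        except PermissionError:
            pass                                 # pid alive under another user
        return False                             # owner is alive: line busy

def release_session() -> None:
    if os.path.exists(LOCK):
        os.remove(LOCK)
```

The atomic `O_CREAT | O_EXCL` create is the key design choice: two sessions racing for the mic can never both win, and the recorded PID lets a later session distinguish a live owner from a stale lock.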