End-to-end voice systems that feel natural and respond in real time.
Overview
Voice agents require tight integration of speech recognition, language modeling, and speech synthesis with latency budgets measured in milliseconds. We research full-duplex dialogue systems, real-time streaming pipelines, and voice cloning for personalized enterprise deployments.
Research Directions
Streaming ASR
Chunk-based recognition that begins transcription before the speaker finishes, enabling sub-200ms response initiation.
Turn-taking detection
Prosodic and semantic signals for distinguishing speech pauses from turn completions.
Neural TTS at low latency
Synthesis based on flow matching and consistency models that produces intelligible audio with a time to first byte (TTFB) under 100ms.
Voice cloning
Speaker adaptation from 10-30 seconds of reference audio for enterprise personas.
Emotion and prosody control
Fine-grained control over speaking rate, pitch variation, and emotional tone.
Natural conversation requires response latency under 500ms from the end of user speech to the first audio byte. This budget must cover end-of-speech detection (50-80ms), transcription (50-100ms), LLM prefill and first token (100-200ms), and the first TTS chunk (80-120ms). These ranges sum to 280-500ms, so the worst case consumes the entire budget with no headroom. Every component must operate in streaming mode; batch processing at any stage exceeds the budget.
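A budget like this is easy to check mechanically. The sketch below encodes the stage ranges listed above and compares measured latencies against them. The names `StageBudget`, `totalRange`, and `checkMeasured` are illustrative and not part of any existing codebase, and treating a missing measurement as the worst case is an assumption.

```typescript
// Per-stage latency ranges in milliseconds, taken from the budget in the text.
interface StageBudget {
  name: string;
  minMs: number;
  maxMs: number;
}

const STAGES: StageBudget[] = [
  { name: "end-of-speech detection", minMs: 50, maxMs: 80 },
  { name: "transcription", minMs: 50, maxMs: 100 },
  { name: "LLM prefill and first token", minMs: 100, maxMs: 200 },
  { name: "TTS first chunk", minMs: 80, maxMs: 120 },
];

// End of user speech to first audio byte must stay under this value.
const TOTAL_BUDGET_MS = 500;

// Sum the best-case and worst-case latency across all stages.
function totalRange(stages: StageBudget[]): { minMs: number; maxMs: number } {
  return stages.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

// Hypothetical helper: compare measured per-stage latencies with the budget.
// A stage without a measurement is assumed to hit its worst case.
function checkMeasured(measured: Record<string, number>): {
  totalMs: number;
  withinBudget: boolean;
  overCeiling: string[];
} {
  let totalMs = 0;
  const overCeiling: string[] = [];
  for (const stage of STAGES) {
    const ms = measured[stage.name] ?? stage.maxMs;
    totalMs += ms;
    if (ms > stage.maxMs) overCeiling.push(stage.name);
  }
  // "Under 500ms" is read as a strict inequality.
  return { totalMs, withinBudget: totalMs < TOTAL_BUDGET_MS, overCeiling };
}
```

Calling `checkMeasured` with per-stage timings from a trace flags both an overall overrun and the individual stages that exceeded their ceiling, which helps locate the regression rather than only detecting it.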
Traditional voice assistants are half-duplex: they listen, then respond. Full-duplex systems process incoming audio continuously, even while speaking, enabling barge-in (user interruption) and backchanneling (affirmative acknowledgments during user speech). We implement this as two parallel audio streams with a turn manager that arbitrates output based on prosodic signals and semantic completion confidence.
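The arbitration logic of a turn manager can be illustrated with a small pure function. This is a minimal sketch under stated assumptions, not the system described above: the signal names, the equal weighting of prosodic and semantic scores, and every threshold value are hypothetical and would need tuning on real conversations.

```typescript
type AgentState = "listening" | "speaking";
type TurnAction =
  | "keep-listening"
  | "start-speaking"
  | "backchannel"
  | "keep-speaking"
  | "stop-speaking";

// Signals derived from the incoming audio stream (all hypothetical).
interface TurnSignals {
  userSpeechMs: number;       // length of the current voiced run, 0 if the user is silent
  pauseMs: number;            // silence since the user's last voiced frame
  prosodicFinality: number;   // 0..1, e.g. falling pitch at the end of a phrase
  semanticCompletion: number; // 0..1, confidence that the utterance is complete
}

interface TurnConfig {
  bargeInMs: number;           // user speech needed to interrupt the agent
  endOfTurnPauseMs: number;    // minimum pause before the agent may take the turn
  backchannelPauseMs: number;  // pause after which an acknowledgment is offered
  completionThreshold: number; // combined score needed to take the turn
}

// Assumed values for illustration only.
const DEFAULT_CONFIG: TurnConfig = {
  bargeInMs: 200,
  endOfTurnPauseMs: 60,
  backchannelPauseMs: 400,
  completionThreshold: 0.7,
};

function decideTurn(
  state: AgentState,
  s: TurnSignals,
  cfg: TurnConfig = DEFAULT_CONFIG,
): TurnAction {
  if (state === "speaking") {
    // Barge-in: sustained user speech while the agent is talking stops output.
    return s.userSpeechMs >= cfg.bargeInMs ? "stop-speaking" : "keep-speaking";
  }
  // The user is still talking, so the agent stays silent.
  if (s.userSpeechMs > 0) return "keep-listening";

  // Equal-weight blend of prosodic and semantic evidence (an assumption).
  const completion = 0.5 * s.prosodicFinality + 0.5 * s.semanticCompletion;
  if (s.pauseMs >= cfg.endOfTurnPauseMs && completion >= cfg.completionThreshold) {
    return "start-speaking";
  }
  // A long pause inside an apparently unfinished utterance: acknowledge, do not take the turn.
  if (s.pauseMs >= cfg.backchannelPauseMs) return "backchannel";
  return "keep-listening";
}
```

Keeping the decision a pure function of state and signals makes it straightforward to replay recorded conversations through it when tuning thresholds.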
Pipeline topology
Each component in the voice pipeline communicates over an in-process message bus with backpressure. Audio chunks flow as 20ms frames. The ASR component emits partial and final transcripts; the LLM begins generating on partial transcripts above a confidence threshold.
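The gating of LLM generation on partial transcripts can be sketched with async generators. The stages `fakeAsr` and `fakeLlm` are stand-ins invented for this example, the 0.8 threshold is an assumed value, and the `Transcript` shape is repeated locally so the block runs on its own. Because async generators are pull-based, an upstream stage only advances when the downstream stage asks for the next item, which gives a simple form of backpressure without an explicit bus.

```typescript
interface Transcript {
  text: string;
  confidence: number;
  isFinal: boolean;
}

// Stand-in ASR stage: replays a fixed sequence of partial and final transcripts.
async function* fakeAsr(items: Transcript[]): AsyncGenerator<Transcript> {
  for (const t of items) yield t;
}

// Gate stage: final transcripts always pass; partial ones pass only when
// their confidence is above the threshold.
async function* gateTranscripts(
  source: AsyncIterable<Transcript>,
  minConfidence: number,
): AsyncGenerator<Transcript> {
  for await (const t of source) {
    if (t.isFinal || t.confidence > minConfidence) yield t;
  }
}

// Stand-in LLM stage: labels each input so the example output shows which
// transcripts reached it. A real stage would start or restart generation here.
async function* fakeLlm(source: AsyncIterable<Transcript>): AsyncGenerator<string> {
  for await (const t of source) {
    yield `${t.isFinal ? "final" : "draft"}:${t.text}`;
  }
}

// Wire the stages together and collect what the LLM stage produced.
async function runDemo(): Promise<string[]> {
  const asr = fakeAsr([
    { text: "turn on", confidence: 0.55, isFinal: false },
    { text: "turn on the", confidence: 0.85, isFinal: false },
    { text: "turn on the lights", confidence: 0.6, isFinal: true },
  ]);
  const out: string[] = [];
  for await (const reply of fakeLlm(gateTranscripts(asr, 0.8))) out.push(reply);
  return out;
}
```

In this run the first partial is dropped by the gate, the second reaches the LLM stage as a draft, and the final transcript passes regardless of its confidence.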
interface VoiceFrame {
  pcm: Float32Array // 20ms at 16kHz = 320 samples
  timestamp: number
  isFinal: boolean
}

interface Transcript {
  text: string
  confidence: number
  isFinal: boolean
}

// Pipeline: Mic -> VAD -> ASR -> LLM -> TTS -> Speaker
// Each stage is an async generator consuming and producing typed frames

Autoregressive TTS models (e.g., VALL-E, ElevenLabs architecture) produce high-quality audio but generate tokens sequentially, making first-chunk latency proportional to utterance length. We use a non-autoregressive flow-matching model that generates all mel spectrogram frames in parallel, achieving consistent sub-100ms TTFB regardless of output length. Vocoder inference runs on a separate CPU thread to overlap with generation.
Enterprise deployments require branded voice personas consistent across all customer interactions. We train speaker-conditional models using as little as 15 seconds of reference audio with a contrastive speaker encoder. Fine-tuning takes under 2 minutes on a single GPU. The resulting voice maintains the speaking style and timbre of the reference while remaining fully controllable via prosody conditioning.