06 Active Research

Voice Agents

End-to-end voice systems that feel natural and respond in real time.

ASR, TTS, Streaming, Latency, Dialogue

Overview

Voice agents require tight integration of speech recognition, language modeling, and speech synthesis with latency budgets measured in milliseconds. We research full-duplex dialogue systems, real-time streaming pipelines, and voice cloning for personalized enterprise deployments.

Research Directions

Streaming ASR

Chunk-based recognition that begins transcription before the speaker finishes, enabling sub-200ms response initiation.

Turn-taking detection

Prosodic and semantic signals for distinguishing speech pauses from turn completions.

Neural TTS at low latency

Flow-matching and consistency-model-based synthesis that produces intelligible audio with a time to first byte (TTFB) under 100ms.

Voice cloning

Speaker adaptation from 10-30 seconds of reference audio for enterprise personas.

Emotion and prosody control

Fine-grained control over speaking rate, pitch variation, and emotional tone.

The latency budget

Natural conversation requires response latency under 500ms from the end of user speech to the first audio byte. This budget must cover end-of-speech detection (50-80ms), transcription (50-100ms), LLM prefill and first token (100-200ms), and TTS first chunk (80-120ms). These ranges sum to 280-500ms, so a pipeline in which every stage hits its upper bound has no headroom left. Every component must operate in streaming mode; batch processing at any stage blows the budget.
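The budget arithmetic can be checked with a short sketch. It sums the per-stage ranges quoted above and reports the remaining headroom against the 500ms target. The names `StageBudget`, `totalBudget`, and `TARGET_MS` are illustrative assumptions, not part of any described system.

```typescript
// Per-stage latency ranges in milliseconds, copied from the budget above.
interface StageBudget {
  stage: string;
  minMs: number;
  maxMs: number;
}

const stages: StageBudget[] = [
  { stage: "end-of-speech detection", minMs: 50, maxMs: 80 },
  { stage: "transcription", minMs: 50, maxMs: 100 },
  { stage: "LLM prefill and first token", minMs: 100, maxMs: 200 },
  { stage: "TTS first chunk", minMs: 80, maxMs: 120 },
];

// Target from end of user speech to first audio byte.
const TARGET_MS = 500;

// Sum the best-case and worst-case latency across all stages.
function totalBudget(items: StageBudget[]): { minMs: number; maxMs: number } {
  return items.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

const total = totalBudget(stages);
const headroomMs = TARGET_MS - total.maxMs;
console.log(`best case ${total.minMs}ms, worst case ${total.maxMs}ms, headroom ${headroomMs}ms`);
```

With the quoted ranges the worst case lands exactly on the target, which is why any stage that falls back to batch processing pushes the pipeline over budget.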

Full-duplex architecture

Traditional voice assistants are half-duplex: they listen, then respond. Full-duplex systems process incoming audio continuously, even while speaking, enabling barge-in (user interruption) and backchanneling (affirmative acknowledgments during user speech). We implement this as two parallel audio streams with a turn manager that arbitrates output based on prosodic signals and semantic completion confidence.
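As a rough illustration of the arbitration step, the sketch below reduces the turn manager to a pure function over four signals. Everything in it is an assumption made for illustration: the type names, the `arbitrate` function, the 0-to-1 score scale, and the threshold values are hypothetical, and a real turn manager would also need to separate genuine barge-in from short user backchannels.

```typescript
type TurnAction = "keep-listening" | "start-response" | "stop-speaking" | "keep-speaking";

interface TurnSignals {
  userSpeaking: boolean;      // voice activity on the incoming stream
  agentSpeaking: boolean;     // whether synthesized audio is currently playing
  prosodicEndScore: number;   // 0..1, hypothetical end-of-turn score from prosody
  semanticCompletion: number; // 0..1, confidence that the utterance is complete
}

interface TurnThresholds {
  prosodic: number;
  semantic: number;
}

// Hypothetical default thresholds; real values would need tuning on dialogue data.
const DEFAULT_THRESHOLDS: TurnThresholds = { prosodic: 0.6, semantic: 0.7 };

function arbitrate(s: TurnSignals, t: TurnThresholds = DEFAULT_THRESHOLDS): TurnAction {
  // Barge-in: the user starts talking while the agent is speaking.
  if (s.agentSpeaking && s.userSpeaking) return "stop-speaking";
  if (s.agentSpeaking) return "keep-speaking";

  // A pause counts as a turn completion only when both signals agree.
  const turnComplete =
    !s.userSpeaking &&
    s.prosodicEndScore >= t.prosodic &&
    s.semanticCompletion >= t.semantic;

  return turnComplete ? "start-response" : "keep-listening";
}

// A mid-sentence pause: prosody suggests an ending, but the sentence is incomplete.
const pauseDecision = arbitrate({
  userSpeaking: false,
  agentSpeaking: false,
  prosodicEndScore: 0.8,
  semanticCompletion: 0.3,
});
console.log(pauseDecision);
```

Requiring both scores to clear their thresholds is one way to keep a hesitation pause from being treated as a completed turn; other combination rules, such as a weighted sum, are equally plausible.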

Pipeline topology

Each component in the voice pipeline communicates over an in-process message bus with backpressure. Audio chunks flow as 20ms frames. The ASR component emits partial and final transcripts; the LLM begins generating on partial transcripts above a confidence threshold.

typescript

interface VoiceFrame {
  pcm:        Float32Array   // 20ms at 16kHz = 320 samples
  timestamp:  number
  isFinal:    boolean
}

interface Transcript {
  text:       string
  confidence: number
  isFinal:    boolean
}

// Pipeline: Mic -> VAD -> ASR -> LLM -> TTS -> Speaker
// Each stage is an async generator consuming and producing typed frames
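To make the async-generator pattern concrete, here is a minimal sketch of one stage: a gate that sits between ASR and the LLM and forwards only final transcripts and partials at or above a confidence threshold. The names `passesGate`, `gatePartials`, `fakeAsr`, and `collect`, the scripted transcripts, and the 0.8 threshold are all hypothetical; the `Transcript` shape is repeated from above so the block stands alone.

```typescript
interface Transcript {
  text: string;
  confidence: number;
  isFinal: boolean;
}

// Decide whether a transcript should be forwarded to the LLM stage.
function passesGate(t: Transcript, minConfidence: number): boolean {
  return t.isFinal || t.confidence >= minConfidence;
}

// Hypothetical stand-in for an ASR stage: yields a scripted sequence of transcripts.
async function* fakeAsr(): AsyncGenerator<Transcript> {
  yield { text: "what's the", confidence: 0.4, isFinal: false };
  yield { text: "what's the weather", confidence: 0.85, isFinal: false };
  yield { text: "what's the weather today", confidence: 0.95, isFinal: true };
}

// Gate stage: consumes transcripts and re-emits only those that pass the gate.
async function* gatePartials(
  source: AsyncIterable<Transcript>,
  minConfidence: number,
): AsyncGenerator<Transcript> {
  for await (const t of source) {
    if (passesGate(t, minConfidence)) yield t;
  }
}

// Drain a stage into an array (useful for inspection and testing).
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
  const out: T[] = [];
  for await (const item of source) out.push(item);
  return out;
}

const forwarded = collect(gatePartials(fakeAsr(), 0.8));
forwarded.then((items) => console.log(items.map((t) => t.text)));
```

Because async generators are pull-based, an upstream stage does not advance until the downstream consumer requests the next item, which gives a simple form of backpressure without an explicit queue.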

Neural TTS for low latency

Autoregressive TTS models (e.g., VALL-E, ElevenLabs architecture) produce high-quality audio but generate tokens sequentially, so synthesis time grows with utterance length, and first-chunk latency grows with it unless the output is streamed. We use a non-autoregressive flow-matching model that generates all mel-spectrogram frames in parallel, achieving a consistent sub-100ms TTFB regardless of output length. Vocoder inference runs on a separate CPU thread so that it overlaps with generation.

Voice cloning for enterprise personas

Enterprise deployments require branded voice personas consistent across all customer interactions. We train speaker-conditional models using as little as 15 seconds of reference audio with a contrastive speaker encoder. Fine-tuning takes under 2 minutes on a single GPU. The resulting voice maintains the speaking style and timbre of the reference while remaining fully controllable via prosody conditioning.
