Voice AI Call Agent

Real-time AI phone agent with sub-second latency

Role

Architect & Engineer

Year

2026

Status

Private

Overview

A full voice-AI stack built for production telephony. Combines a low-latency speech pipeline, a visual conversation flow editor, a campaign-based outbound dialer, automated post-call data extraction, and synced call recording. Engines (STT, LLM, TTS) are configured at runtime so they can be hot-swapped per call without restarts. Designed to run agent logic on the telephony edge while offloading GPU inference to serverless infrastructure for cost-efficient scaling.

Highlights

Average time-to-talk: 833–1311 ms across production calls
Supports both chat-style (STT+LLM+TTS) and Realtime (speech-to-speech) modes
Edge agent + serverless GPU split removes per-frame network hops
Per-campaign concurrency limits and retry policies
Graceful fallback chain for missing plugins or provider outages
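A fallback chain of this kind can be sketched in a few lines. This is only an illustration under assumptions: the `call_with_fallback` helper, the `AllProvidersFailed` exception, and the idea that each provider is a plain callable taking a prompt are all hypothetical and not taken from the project.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a provider is a (name, callable) pair.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(providers: Sequence[Provider], prompt: str) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors: List[str] = []
    for name, generate in providers:
        try:
            return name, generate(prompt)
        except Exception as exc:  # missing plugin, timeout, outage, ...
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

The ordering of the list is the policy: the preferred engine goes first and cheaper or more available engines follow.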

Capabilities

Low-Latency Voice Pipeline

STT → LLM → TTS pipeline tuned for ~800–1300ms time-to-talk. Silero VAD endpointing, preemptive generation, and PCM streaming minimize time-to-first-audio.

Multi-Provider LLM Layer

Pluggable LLM back-ends — Gemini, OpenAI (chat + Realtime speech-to-speech), DeepSeek, Cloudflare Workers AI, and self-hosted vLLM — selected per call from an admin UI.

Realtime Speech-to-Speech

Optional single-component pipeline using the OpenAI Realtime API for ~300–500 ms voice-to-voice latency, with transcripts captured via parallel ASR for post-call analysis.

Visual Conversation Flows

Drag-and-drop state-machine editor for call flows. Each node carries its own prompt, transitions, and tool calls. Flows are scoped per LLM and versioned independently.
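As a rough sketch of the data model such an editor might export, each node holds a prompt, its outgoing transitions, and its tool names. The `FlowNode` and `Flow` classes, the event names, and the "stay on the current node for unknown events" rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlowNode:
    prompt: str
    # Maps an event label (hypothetical, e.g. "interested") to the next node id.
    transitions: Dict[str, str] = field(default_factory=dict)
    tools: List[str] = field(default_factory=list)


@dataclass
class Flow:
    nodes: Dict[str, FlowNode]
    start: str

    def next_node(self, current: str, event: str) -> str:
        """Follow a transition; stay on the current node if the event is unknown."""
        return self.nodes[current].transitions.get(event, current)


flow = Flow(
    start="greeting",
    nodes={
        "greeting": FlowNode("Greet the caller.", {"interested": "qualify"}),
        "qualify": FlowNode("Ask about budget.", {"done": "wrap_up"}, ["book_meeting"]),
        "wrap_up": FlowNode("Thank the caller and end the call."),
    },
)
```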

Campaign-Based Outbound Dialer

A long-lived dialer service places calls via the telephony SDK, tracks lead state (queued → calling → completed / no-answer / voicemail), retries up to a per-campaign attempt cap, and enforces per-campaign concurrency limits.
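The lead lifecycle can be illustrated with a small state model. The state names come from the description above; everything else (the `Lead` class, the helper functions, and the choice to retry only no-answer and voicemail outcomes) is an assumption for the sake of the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class LeadState(str, Enum):
    QUEUED = "queued"
    CALLING = "calling"
    COMPLETED = "completed"
    NO_ANSWER = "no-answer"
    VOICEMAIL = "voicemail"


# Assumption: only these outcomes are worth another attempt.
RETRYABLE = {LeadState.NO_ANSWER, LeadState.VOICEMAIL}


@dataclass
class Lead:
    phone: str
    state: LeadState = LeadState.QUEUED
    attempts: int = 0


def next_batch(leads: List[Lead], max_concurrent: int) -> List[Lead]:
    """Pick queued leads without exceeding the campaign's concurrency limit."""
    active = sum(1 for lead in leads if lead.state is LeadState.CALLING)
    free_slots = max(0, max_concurrent - active)
    queued = [lead for lead in leads if lead.state is LeadState.QUEUED]
    return queued[:free_slots]


def start_call(lead: Lead) -> None:
    lead.state = LeadState.CALLING
    lead.attempts += 1


def finish_call(lead: Lead, outcome: LeadState, max_attempts: int) -> None:
    """Requeue retryable outcomes until the attempt cap is reached."""
    if outcome in RETRYABLE and lead.attempts < max_attempts:
        lead.state = LeadState.QUEUED
    else:
        lead.state = outcome
```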

CSV Lead Ingestion

Two-step CSV upload with auto-detected field mapping. Unmapped columns become per-lead template variables surfaced as {{tokens}} in prompts.
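Token substitution along these lines might look like the following minimal sketch. The `render_prompt` helper is hypothetical, and leaving unknown tokens untouched is an assumed behaviour rather than something the project documents.

```python
import re
from typing import Mapping

# Matches {{name}} with optional whitespace inside the braces.
TOKEN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: Mapping[str, object]) -> str:
    """Replace {{tokens}} with per-lead values; unknown tokens are left as-is."""
    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)

    return TOKEN.sub(substitute, template)
```

For example, `render_prompt("Hi {{first_name}}", {"first_name": "Ada"})` yields `"Hi Ada"`.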

Post-Call Data Extraction

After hangup, the transcript is sent to the active LLM with a JSON schema derived from the campaign’s expected fields. Type-coerced output is stored as a structured opportunity record.
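The type-coercion step could be approximated as below. This is a sketch under assumptions: the field-type names (`string`, `integer`, `number`, `boolean`), the `coerce_fields` helper, and the policy of storing `None` for missing or unparseable values are illustrative, not the project's actual schema handling.

```python
import json
from typing import Any, Callable, Dict, Optional


def _to_bool(value: Any) -> bool:
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "yes", "1"}


# Assumed mapping from schema type names to Python coercion functions.
COERCERS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": _to_bool,
}


def coerce_fields(raw_json: str, expected: Dict[str, str]) -> Dict[str, Optional[Any]]:
    """Parse the LLM's JSON reply and coerce each expected field to its type."""
    data = json.loads(raw_json)
    record: Dict[str, Optional[Any]] = {}
    for name, type_name in expected.items():
        value = data.get(name)
        try:
            record[name] = None if value is None else COERCERS[type_name](value)
        except (TypeError, ValueError):
            record[name] = None  # unparseable value: keep the field, drop the data
    return record
```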

Recording with Synced Captions

Audio-only recordings are streamed to object storage. The call detail UI plays the recording with a custom audio player and highlights each transcript turn in lock-step with playback.
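One simple way to find the turn to highlight is a binary search over turn start times. The sketch below assumes each transcript turn carries a start offset in seconds, sorted ascending; the `active_turn` function is hypothetical, and the real player presumably runs equivalent logic in the browser.

```python
from bisect import bisect_right
from typing import List, Optional


def active_turn(turn_starts: List[float], playback_s: float) -> Optional[int]:
    """Return the index of the turn being spoken at playback_s, if any.

    turn_starts must be sorted ascending (seconds from the start of the recording).
    """
    index = bisect_right(turn_starts, playback_s) - 1
    return index if index >= 0 else None
```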

Hot-Swappable Engine Config

The active STT, LLM, TTS, and flow settings are fetched from a config API at the start of each call. Activate a different engine from the admin UI and the next call uses it, with no worker restart required.
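The per-call lookup is what makes the swap take effect without a restart: nothing is cached at worker start-up. A minimal sketch, assuming a hypothetical `fetch_active` callable that stands in for the config API and a made-up `EngineConfig` shape:

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class EngineConfig:
    stt: str
    llm: str
    tts: str
    flow: str


def config_for_call(fetch_active: Callable[[], Dict[str, str]]) -> EngineConfig:
    """Resolve engines at call start, so each call sees the current selection."""
    raw = fetch_active()
    return EngineConfig(stt=raw["stt"], llm=raw["llm"], tts=raw["tts"], flow=raw["flow"])
```

Because the returned config is frozen, a call already in progress keeps the engines it started with even if the selection changes mid-call.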

Tech stack

Python, LiveKit Agents, Modal, Django, FastAPI, PostgreSQL, Whisper, Kokoro TTS, Gemini, OpenAI Realtime, Deepgram, vLLM, Drawflow

Privacy

This is a private project. Architecture, capabilities, and outcomes shown here are generalized; client identity, domain specifics, infrastructure addresses, and operational data are intentionally omitted.

Have a similar idea? Let's talk.

info@jayasena.dev