Exploring voice AI agents
Voice AI is having a moment. OpenAI cut its Realtime API pricing by 60-87% in late 2024. ElevenLabs is reportedly raising at a $3B+ valuation. Platforms like Vapi and Deepgram are racing to hit sub-300ms latency for real-time speech.
The underlying tech has crossed a threshold where voice agents can actually feel conversational. Not the robotic IVR menus we’ve all suffered through, but something closer to talking to a person.
So I decided to try it out.
The project: Budget Buddy #
I wanted something that made sense as a phone call, not just speech-to-text piped into an LLM.
The idea is a voice agent that lets you log expenses by calling in. You’re walking out of a restaurant, hands full, and you call to say “I spent thirty dollars on dinner.” The agent logs it, tells you where you stand against your budget, and you hang up. Ten seconds.
Voice works well here because the interaction happens when your hands are busy. You’re driving, carrying bags, or just don’t want to open an app. A quick phone call has almost no friction. And motivational feedback about your spending actually lands better when spoken aloud than when read on a screen.
The core flow is simple. Call in, say what you spent, hear back your budget status. You can also ask questions like “How much have I spent on groceries this month?” or “Am I over budget on dining?”
The tech stack #
I used Vapi for voice orchestration. They handle the hard parts like speech-to-text, turn-taking, text-to-speech, and the phone number itself. You define an assistant with a system prompt and tools, and Vapi coordinates everything.
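Concretely, the assistant is little more than a system prompt, a set of tool schemas, and a webhook URL. Here's a rough sketch of that definition as a Python dict; the field names approximate Vapi's assistant schema rather than reproduce it exactly, and the URL is a placeholder:

```python
# Sketch of a Vapi assistant definition. Field names approximate
# Vapi's schema -- check their docs for the exact shape.
assistant = {
    "name": "Budget Buddy",
    "model": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "messages": [{
            "role": "system",
            "content": (
                "You are Budget Buddy, a friendly expense-tracking agent. "
                "Keep replies short and speakable. Confirm the amount and "
                "category before logging an expense."
            ),
        }],
        "tools": [],  # tool schemas -- sketched in the Tools section below
    },
    # Where Vapi sends webhooks (tool calls, transcripts, status updates).
    "serverUrl": "https://example.com/vapi/webhook",  # placeholder URL
}
```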
The architecture looks like this:
```
┌─────────────────┐
│   User Phone    │
└────────┬────────┘
         │ Phone call
         ▼
┌─────────────────┐
│      Vapi       │ ← Voice orchestration (speech-to-text, text-to-speech, turn-taking)
└────────┬────────┘
         │ Webhook
         ▼
┌─────────────────┐      ┌─────────────────┐
│     Backend     │◄────►│   GPT-4o-mini   │
│    (FastAPI)    │      │      (LLM)      │
└────────┬────────┘      └─────────────────┘
         │
         ▼
┌─────────────────┐
│       DB        │
└─────────────────┘
```
When you call, Vapi transcribes your speech and sends it to the LLM. The LLM decides whether to call a tool (like log_expense or get_budget_status), and Vapi webhooks my backend to execute it. The backend saves to the database and returns the result. The LLM gets that result and speaks a response.
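Here's a minimal sketch of that webhook handler in FastAPI. I'm assuming a payload shape where tool calls arrive as a list and results go back keyed by tool-call ID; treat the exact field names as approximations of Vapi's format, and the two handlers as stubs for the real DB-backed versions:

```python
import json
from fastapi import FastAPI, Request

app = FastAPI()

def log_expense(amount: float, category: str) -> str:
    # Stub: the real version writes a row to the database.
    return f"Logged ${amount:.2f} to {category}."

def get_budget_status(category: str = "all") -> str:
    # Stub: the real version queries totals against budget limits.
    return f"Budget status for {category}: placeholder."

TOOL_HANDLERS = {
    "log_expense": log_expense,
    "get_budget_status": get_budget_status,
}

@app.post("/vapi/webhook")
async def vapi_webhook(request: Request):
    payload = await request.json()
    message = payload.get("message", {})

    # Vapi sends several message types (transcripts, status updates, ...);
    # only tool calls need a response from us.
    if message.get("type") != "tool-calls":
        return {}

    results = []
    for call in message.get("toolCallList", []):
        args = call["function"]["arguments"]
        if isinstance(args, str):  # arguments may arrive JSON-encoded
            args = json.loads(args)
        handler = TOOL_HANDLERS.get(call["function"]["name"])
        result = handler(**args) if handler else "Unknown tool."
        results.append({"toolCallId": call["id"], "result": result})

    # Vapi feeds these results back to the LLM, which speaks the reply.
    return {"results": results}
```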
The whole thing runs on a small cloud instance.
What I learned #
Webhooks #
The webhook-based interaction model was interesting to work with. Your code doesn’t initiate anything. External services call you when things happen. It’s a different mental model than typical request-response, and it forces you to think carefully about state.
State management #
State management turned out to be surprisingly complex. You don’t want to overwhelm the LLM with data, but you also need enough context for it to be useful. Finding that balance took iteration. How much transaction history should it see? Should it know about spending trends? There’s a tension between keeping prompts lean and giving the agent enough to be helpful.
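The compromise I keep coming back to is summarizing: give the model running per-category totals plus only the last few raw transactions. A sketch, with a made-up Transaction type:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Transaction:
    amount: float
    category: str
    note: str

def build_context(transactions: list[Transaction], max_recent: int = 5) -> str:
    """Compress history into a short, prompt-friendly summary:
    per-category totals plus only the most recent transactions."""
    totals: dict[str, float] = defaultdict(float)
    for t in transactions:
        totals[t.category] += t.amount

    lines = ["Month-to-date totals:"]
    lines += [f"- {cat}: ${amt:.2f}" for cat, amt in sorted(totals.items())]
    lines.append(f"Last {max_recent} transactions:")
    lines += [f"- ${t.amount:.2f} {t.category} ({t.note})"
              for t in transactions[-max_recent:]]
    return "\n".join(lines)
```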
Tools #
Splitting functionality into a "core agent" and its tools felt natural. The LLM handles conversation and intent; tools handle data operations. Clean separation. This decomposition pattern seems like the right way to think about voice agents.
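For reference, the two tools are just function schemas the LLM can call. A sketch in OpenAI-style function-calling format (the category list and descriptions are illustrative):

```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "log_expense",
            "description": "Record an expense the caller just described.",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {
                        "type": "number",
                        "description": "Amount in dollars",
                    },
                    "category": {
                        "type": "string",
                        "enum": ["dining", "groceries", "transport", "other"],
                    },
                },
                "required": ["amount", "category"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_budget_status",
            "description": "Report spending vs. budget, optionally per category.",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {"type": "string"},
                },
            },
        },
    },
]
```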
Voice UX #
Voice UX is its own discipline. You have to be conversational but also somewhat deterministic: users expect the agent to sound natural, but you also need consistent behavior. It's a different way of thinking from standard UX or API design. Little things matter, like saying "thirty dollars" instead of "$30.00", or varying confirmation phrases so the agent doesn't sound robotic.
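A small example of that kind of polish, using the third-party num2words package for the number-to-words part (a sketch; in the real agent some of this behavior lives in the system prompt instead):

```python
import random
from num2words import num2words  # third-party: pip install num2words

CONFIRMATIONS = [
    "Got it, {spoken} for {category}.",
    "Logged {spoken} under {category}.",
    "Okay, {spoken} on {category}, saved.",
]

def spoken_amount(amount: float) -> str:
    """Render an amount the way a person says it:
    'thirty dollars', not '$30.00'."""
    dollars, cents = int(amount), round(amount % 1 * 100)
    if cents == 0:
        return f"{num2words(dollars)} dollars"
    return f"{num2words(dollars)} dollars and {num2words(cents)} cents"

def confirmation(amount: float, category: str) -> str:
    # Vary the phrasing so back-to-back calls don't sound canned.
    template = random.choice(CONFIRMATIONS)
    return template.format(spoken=spoken_amount(amount), category=category)

# confirmation(30, "dinner") -> e.g. "Got it, thirty dollars for dinner."
```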
Testing is tricky. There’s inherent randomness in LLM responses, and you’re dealing with audio, transcription, and speech synthesis. Traditional unit tests don’t quite fit. I ended up relying more on end-to-end call testing and logging.
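What does test cleanly is the deterministic layer: post a hand-built tool-call payload at the webhook and assert on the result, skipping audio, transcription, and the LLM entirely. A sketch using FastAPI's test client, matching the payload shape assumed in the handler above:

```python
from fastapi.testclient import TestClient

client = TestClient(app)  # the app from the webhook sketch above

def test_log_expense_tool():
    payload = {
        "message": {
            "type": "tool-calls",
            "toolCallList": [{
                "id": "call_1",
                "function": {
                    "name": "log_expense",
                    "arguments": {"amount": 30.0, "category": "dining"},
                },
            }],
        }
    }
    response = client.post("/vapi/webhook", json=payload)
    assert response.status_code == 200
    result = response.json()["results"][0]
    assert result["toolCallId"] == "call_1"
    assert "30" in result["result"]
```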
Demo #
If you want to try it out, call 650-252-8420!
Possible things to build next #
The MVP works. You can call in, log expenses, and query your budget status. It’s single-user and uses caller ID for auth.
Future additions could include a simple web UI for configuring budget limits, SMS receipts after each call, weekly summary outbound calls, or integrating with bank data. But even the basic version is surprisingly useful.
Is voice the future? #
Probably not for everything. But for specific use cases where hands are busy, friction needs to be minimal, or the interaction is inherently conversational, voice agents make a lot of sense.
The tooling has gotten good enough that you can build something real in a short time. If you’ve been curious about voice AI, now’s a decent time to experiment.