What We Shipped: Geta.Team v2.1.11. Voice Calls Now Reply in a Second, Plus Calendar Tools Mid-Call

When you pick up the phone to your AI employee and there's a three-second silence before they reply, the illusion breaks. You're no longer talking to a colleague. You're talking to a chatbot with a voice. The pause tells you everything you need to know about what's actually happening behind the curtain.

We've been chasing that pause for months. v2.1.11 is the release where we finally caught it. Voice replies now start in about a second, down from the three-to-four seconds we were stuck at before. The fix wasn't optimisation. It was ripping out the entire voice stack and replacing it with something architecturally different.

What was wrong with the old voice path

The previous setup, running on Twilio's official ConversationRelay integration, looked sensible on paper. Twilio handled speech-to-text. We sent the transcript to our LLM. The LLM's reply went to ElevenLabs for text-to-speech. Three components, each best-in-class at its job.

The problem with three components is that each one has to wait for the previous one before it can start. Twilio waits for end-of-speech detection. Our LLM waits for the full transcript. ElevenLabs waits for a coherent text chunk to synthesise. Even with streaming, and even after a full speed pass on every layer, we couldn't get below about 700 to 1,500 ms of pure handoff latency per turn. That floor comes from the architecture, not the implementation.

We confirmed the floor was real, then made the call to switch to a half-cascade speech-to-speech model (Gemini Live) running directly on Twilio's Media Streams.

What the new path looks like

Voice now flows through a single WebSocket bridge: Twilio Media Streams in on one side, Gemini Live's audio session on the other. Twilio's 8 kHz mu-law audio gets decoded and upsampled to 16 kHz for Gemini; Gemini's 24 kHz audio gets downsampled and re-encoded as 8 kHz mu-law for Twilio. The model owns the full turn. It hears, it thinks, it speaks, all in one session, with no transcript intermediary.
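
To make the inbound half of that conversion concrete, here is a minimal sketch of the Twilio-to-Gemini direction: standard G.711 mu-law decoding followed by a naive 2x linear-interpolation upsample. The function names are illustrative, and the interpolation is an assumption chosen for brevity; the post does not say which resampling method the real bridge uses.

```typescript
// G.711 mu-law byte -> signed 16-bit PCM sample (standard decoding formula).
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole Twilio media payload (already base64-decoded to bytes).
function decodeMuLaw(payload: Uint8Array): Int16Array {
  const pcm = new Int16Array(payload.length);
  for (let i = 0; i < payload.length; i++) {
    pcm[i] = muLawToPcm16(payload[i]);
  }
  return pcm;
}

// Naive 8 kHz -> 16 kHz upsample: keep each sample and insert the midpoint
// between it and the next one (the last sample is simply repeated).
function upsample2x(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const cur = input[i];
    const next = i + 1 < input.length ? input[i + 1] : cur;
    out[2 * i] = cur;
    out[2 * i + 1] = (cur + next) >> 1;
  }
  return out;
}
```

The outbound direction is the mirror image: reduce 24 kHz to 8 kHz, then mu-law encode. A production resampler would normally add a low-pass filter rather than interpolating or dropping samples directly.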

The numbers from production calls since the cutover: time-to-first-audio is around one second for a typical reply. Transcript capture is more accurate because we're reading what the model actually produced, not a second-pass STT on the call audio. And the real unlock: we can wire arbitrary tools directly into the live audio session. The model can decide mid-call to call calendar_list_events or send_email without breaking the conversation.

A few things had to come along for the ride to make this work reliably:

  • A voice prompt compactor. Sending each employee's full CLAUDE.md (several hundred lines) as the system instruction added two to four seconds to the session cold start and made the model more prone to inventing facts. We now run Gemini Flash Lite at the start of every call to compress the full prompt down to a fixed-shape 300-token voice prompt, regenerated every time so edits land instantly. The compactor runs in parallel with Twilio's WebSocket upgrade, so the 1.4 to 1.7 seconds it takes is hidden in the call setup.
  • Pre-warmed bridges for outbound calls. Gemini Live's STT cold start is 5 to 7 seconds. For outbound calls we pre-open the Gemini session in parallel with the Twilio dial, so by the time the called party picks up, the model is already listening.
  • A new voice picker. ElevenLabs is gone for phone calls. In its place: 8 prebuilt Gemini Live voices (Aoede, Leda, Kore, Zephyr, Charon, Puck, Fenrir, Orus). You pick one per phone line in the Twilio sheet or the dashboard connector.
  • VAD tuning. Gemini Live defaults to HIGH start-of-speech sensitivity, which meant breathing or quiet acknowledgments would interrupt the AI mid-sentence. We dropped it to LOW. End-of-speech detection stays high, so the model still knows when you've actually stopped talking.
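
As a rough illustration of how the voice picker and the VAD tuning could come together in one session config, here is a hedged sketch. The field names (speechConfig, prebuiltVoiceConfig, realtimeInputConfig, automaticActivityDetection) and the sensitivity values follow the Gemini Live API documentation as best understood at the time of writing; treat them as assumptions and check them against the current reference. buildLiveConfig and PHONE_VOICES are hypothetical names, not the product's actual code.

```typescript
// The eight prebuilt voices listed above.
const PHONE_VOICES = [
  "Aoede", "Leda", "Kore", "Zephyr", "Charon", "Puck", "Fenrir", "Orus",
] as const;
type PhoneVoice = (typeof PHONE_VOICES)[number];

function isPhoneVoice(name: string): name is PhoneVoice {
  return (PHONE_VOICES as readonly string[]).includes(name);
}

// Hypothetical helper: builds a Live session config for one phone line.
function buildLiveConfig(voiceName: string, compactedPrompt: string) {
  if (!isPhoneVoice(voiceName)) {
    throw new Error(`Unknown phone voice: ${voiceName}`);
  }
  return {
    responseModalities: ["AUDIO"],
    systemInstruction: compactedPrompt, // the ~300-token voice prompt
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName } },
    },
    realtimeInputConfig: {
      automaticActivityDetection: {
        // LOW: breathing or a quiet "mm-hm" should not interrupt the model.
        startOfSpeechSensitivity: "START_SENSITIVITY_LOW",
        // HIGH: still notice promptly when the caller has stopped talking.
        endOfSpeechSensitivity: "END_SENSITIVITY_HIGH",
      },
    },
  };
}
```

Validating the voice name up front means a typo in the Twilio sheet fails at call setup instead of producing a session with an unexpected default voice.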

Calendar tools, live, mid-call

The second headline feature in v2.1.11: your AI employee can now consult and modify your calendar during a phone call. Google Calendar and Office 365 are both supported.

The Twilio Voice sheet has a new dropdown for calendar access (None, Google, or Office 365). Unconnected providers grey out automatically. Once enabled, the model gets four tools wired into the live session:

  • calendar_list_events, to list upcoming events.
  • calendar_create_event, to create one.
  • calendar_update_event, to modify an existing event.
  • calendar_delete_event, to cancel.

The system prompt for the call gets a "Calendar Access" block injected that names the provider unmistakably, prohibits the model from referencing the wrong one (even when an email address looks like a Google address), and requires a read-back-and-confirm step before any write tool fires. So before the AI moves your dentist appointment, you'll hear: "Confirming. Moving your dentist appointment from Thursday at 2 pm to Friday at 10 am. Should I make that change?"
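
A minimal sketch of how such a block might be assembled from the dropdown value. The exact prompt wording, the provider identifiers, and the function names here are all illustrative assumptions; only the four tool names and the three rules (name the provider, forbid the other one, confirm before writes) come from the description above.

```typescript
type CalendarProvider = "none" | "google" | "office365";

const CALENDAR_TOOLS = [
  "calendar_list_events",
  "calendar_create_event",
  "calendar_update_event",
  "calendar_delete_event",
] as const;

// Every tool except the read-only list call changes the calendar.
const WRITE_TOOLS = CALENDAR_TOOLS.filter((t) => t !== "calendar_list_events");

const PROVIDER_LABEL = {
  google: "Google Calendar",
  office365: "Office 365",
} as const;

// No provider selected means no calendar tools are exposed to the model.
function toolsForProvider(provider: CalendarProvider): readonly string[] {
  return provider === "none" ? [] : CALENDAR_TOOLS;
}

// Hypothetical prompt builder; the real wording is not published.
function buildCalendarAccessBlock(provider: CalendarProvider): string {
  if (provider === "none") return "";
  const other = provider === "google" ? "office365" : "google";
  return [
    "Calendar Access",
    `The only connected calendar is ${PROVIDER_LABEL[provider]}.`,
    `Never refer to ${PROVIDER_LABEL[other]}, even if an email address suggests it.`,
    `Before calling ${WRITE_TOOLS.join(", ")}, read the change back and wait for a clear yes.`,
  ].join("\n");
}
```

Deriving the write-tool list from the full tool list keeps the confirmation rule in sync if more tools are added later.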

User timezone propagates all the way from the database to the calendar CLI, so events get created and listed in your local time instead of UTC. For Office 365 specifically, we switched the list endpoint from /me/events to /me/calendarView, which is Microsoft's recommended endpoint for date-range queries and, critically, actually expands recurring event instances. Without that change, weekly recurring meetings were invisible to the voice tool.
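
For the Office 365 side, the request can be sketched roughly as follows. The calendarView endpoint with startDateTime and endDateTime query parameters, and the Prefer: outlook.timezone header, are part of Microsoft Graph; the function name and return shape are hypothetical, and which time zone name formats your tenant accepts is worth verifying against the Graph documentation.

```typescript
// Hypothetical helper: describes a Graph calendarView request for a date range.
// calendarView expands recurring series into individual occurrences.
function buildCalendarViewRequest(
  startIso: string,
  endIso: string,
  timeZone: string,
): { url: string; headers: Record<string, string> } {
  const url = new URL("https://graph.microsoft.com/v1.0/me/calendarView");
  url.searchParams.set("startDateTime", startIso);
  url.searchParams.set("endDateTime", endIso);
  return {
    url: url.toString(),
    headers: {
      // Ask Graph to return event times in the user's zone instead of UTC.
      Prefer: `outlook.timezone="${timeZone}"`,
    },
  };
}
```

A caller would add the Authorization header and pass the result to fetch; both required query parameters are what distinguish this from a plain /me/events listing.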

The reliability work nobody sees

Three production fixes that aren't glamorous but are the difference between "demo magic" and "actually pick up the phone":

An AI-hung watchdog. Gemini Live's half-cascade occasionally freezes mid-call. The session stays alive, no error fires, but both the STT pipeline and the model output go silent for ten-plus seconds while you're still speaking. We now run a watchdog that checks every 2 seconds: if Twilio media is still arriving but Gemini hasn't emitted any event for more than 6 seconds, we push a synthetic system nudge to wake it up. The nudge is rate-limited so it can't loop, and the watchdog stops automatically on cleanup.
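
The watchdog logic can be sketched as a pure decision function plus a timer. This assumes the "2-second" figure is the check interval; the 6-second silence threshold comes from the description above, while the media-freshness window and the minimum gap between nudges are hypothetical values, as are all the names.

```typescript
interface WatchdogState {
  lastTwilioMediaAt: number;   // ms timestamp of the last inbound Twilio frame
  lastGeminiEventAt: number;   // ms timestamp of the last event from Gemini
  lastNudgeAt: number | null;  // ms timestamp of the last nudge, if any
}

const CHECK_INTERVAL_MS = 2_000;    // assumed meaning of "2-second watchdog"
const SILENCE_THRESHOLD_MS = 6_000; // from the description above
const MEDIA_FRESH_MS = 1_000;       // hypothetical: caller audio counts as "still arriving"
const MIN_NUDGE_GAP_MS = 15_000;    // hypothetical rate limit between nudges

function shouldNudge(state: WatchdogState, now: number): boolean {
  const callerStillSending = now - state.lastTwilioMediaAt <= MEDIA_FRESH_MS;
  const modelSilent = now - state.lastGeminiEventAt > SILENCE_THRESHOLD_MS;
  const rateLimited =
    state.lastNudgeAt !== null && now - state.lastNudgeAt < MIN_NUDGE_GAP_MS;
  return callerStillSending && modelSilent && !rateLimited;
}

// Starts the periodic check and returns a stop function to call on cleanup.
function startWatchdog(state: WatchdogState, sendNudge: () => void): () => void {
  const timer = setInterval(() => {
    const now = Date.now();
    if (shouldNudge(state, now)) {
      state.lastNudgeAt = now;
      sendNudge();
    }
  }, CHECK_INTERVAL_MS);
  return () => clearInterval(timer);
}
```

Keeping the decision in a pure function makes the thresholds easy to unit-test without waiting on real timers.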

Inbound greeting trigger. On inbound calls (where you dial the AI's number), both parties used to sit silent. You expect to be greeted; the model is waiting for audio. After several seconds you'd hang up. The bridge now sends a synthetic user-role turn immediately after the WebSocket attaches, telling the model to greet briefly and ask how it can help. The greeting is language-aware across French, English, Spanish, Hebrew, Italian, and German.
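
A hedged sketch of what that synthetic turn could look like. The payload shape (turns, role, parts, turnComplete) mirrors the client-content message of the Gemini Live API as best understood, so verify it against the current SDK; the instruction wording, language codes, and function name are assumptions for illustration.

```typescript
type GreetingLang = "fr" | "en" | "es" | "he" | "it" | "de";

const LANGUAGE_NAMES: Record<GreetingLang, string> = {
  fr: "French",
  en: "English",
  es: "Spanish",
  he: "Hebrew",
  it: "Italian",
  de: "German",
};

// Hypothetical builder for the synthetic user-role turn sent right after
// the bridge attaches on an inbound call.
function buildGreetingTurn(lang: GreetingLang) {
  const text =
    `The caller has just connected. Greet them briefly in ` +
    `${LANGUAGE_NAMES[lang]} and ask how you can help.`;
  return {
    turns: [{ role: "user", parts: [{ text }] }],
    turnComplete: true, // tell the model it may start speaking now
  };
}
```

Marking the turn as complete is what prompts the model to respond immediately instead of continuing to wait for caller audio.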

A streamSid race fix. This one is the kind of bug that makes you stare at logs for an hour. Twilio's start event carries the streamSid you need to route outgoing audio back to the call. That event can arrive in the 36 to 500 ms gap between WebSocket connect and the bridge attaching its listener. Node's EventEmitter doesn't replay missed events, so streamSid would stay null forever, and every outgoing audio frame would silently fail the routing check. The model was speaking; the caller heard nothing. Fixed by reading streamSid from any incoming Twilio message that carries it, not just the start event.
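
The shape of that fix can be sketched as a small tracker that inspects every incoming message rather than subscribing to one event. Twilio Media Streams messages such as media carry a top-level streamSid, and the start message also nests one under start; the class and method names below are illustrative, not the bridge's actual code.

```typescript
interface TwilioStreamMessage {
  event?: string;
  streamSid?: string;
  start?: { streamSid?: string };
}

class StreamSidTracker {
  private sid: string | null = null;

  // Feed every raw WebSocket message here, not only the "start" event.
  observe(rawMessage: string): void {
    if (this.sid !== null) return; // already known, nothing to do
    let parsed: unknown;
    try {
      parsed = JSON.parse(rawMessage);
    } catch {
      return; // ignore malformed frames
    }
    if (typeof parsed !== "object" || parsed === null) return;
    const msg = parsed as TwilioStreamMessage;
    const candidate = msg.streamSid ?? msg.start?.streamSid;
    if (typeof candidate === "string" && candidate.length > 0) {
      this.sid = candidate;
    }
  }

  // Returns null until some message carrying a streamSid has been seen.
  current(): string | null {
    return this.sid;
  }
}
```

Because media frames arrive continuously during a call, a listener that attaches late recovers the streamSid from the next frame instead of staying null for the whole call.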

Post-call hand-off, with receipts

When a call ends, the transcript gets handed off to the agent in chat mode. Previously, tool calls and Google Search grounding didn't make it into that transcript, which meant the post-call agent (Claude, in our case) would occasionally look at a clean transcript that mentioned "I sent the confirmation email" and decide Victoria had hallucinated, because there was no evidence of the send in front of it. Mildly maddening.

Now the bridge records every tool call as a role='tool' transcript entry ([<toolName>] args=<json> -> OK/ERROR | result=<json>) and intercepts Google Search grounding metadata that Gemini Live uses internally. The post-call message to the chat agent includes a legend explaining how to read the transcript (OK with sources = grounded fact, OK with empty sources = unreliable, ERROR = action didn't happen) and explicit self-evaluation rules: don't accuse the voice agent of hallucinating an action when the corresponding tool call OK is right there above the spoken claim.
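
A minimal sketch of a formatter that produces entries in the pattern quoted above. The record shape and function name are assumptions; only the output pattern itself comes from the description.

```typescript
interface ToolCallRecord {
  toolName: string;
  args: unknown;
  ok: boolean;
  result: unknown;
}

// JSON.stringify returns undefined for undefined input, so fall back to "null".
function toJson(value: unknown): string {
  return JSON.stringify(value) ?? "null";
}

// Produces: [<toolName>] args=<json> -> OK/ERROR | result=<json>
function formatToolEntry(record: ToolCallRecord): { role: "tool"; text: string } {
  const status = record.ok ? "OK" : "ERROR";
  return {
    role: "tool",
    text: `[${record.toolName}] args=${toJson(record.args)} -> ${status} | result=${toJson(record.result)}`,
  };
}
```

For example, a successful send_email call to a@example.com that returns {"id":"m1"} would be recorded as [send_email] args={"to":"a@example.com"} -> OK | result={"id":"m1"}, which is the line the post-call agent can match against the spoken claim.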

Two non-voice fixes worth mentioning

A .env rewrite bug that occasionally wiped environment variables on container restart is gone. Two unconditional rewrite paths in server.js and a token migration script have been removed. Reading .env for in-process use is unchanged; we just don't write back to it anymore.

The memory-db skill was creating orphan databases for some employees because DB_PATH was being computed from process.cwd(), which varied depending on how the session was launched. Telexcel and Clementia both hit this and silently lost prior memories. The path is now anchored to the skill folder itself, so the database location is stable regardless of caller working directory.
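
The difference between the two approaches, sketched with hypothetical names (the actual database filename and folder layout are not given in this post):

```typescript
import * as path from "node:path";

const DB_FILENAME = "memory.db"; // hypothetical filename

// Old behaviour (sketch): the location depends on wherever the session was
// launched from, so two launch directories yield two different databases.
function cwdRelativeDbPath(cwd: string): string {
  return path.join(cwd, DB_FILENAME);
}

// New behaviour (sketch): the location is anchored to the skill's own folder,
// which does not change with the caller's working directory.
function skillAnchoredDbPath(skillDir: string): string {
  return path.join(skillDir, DB_FILENAME);
}
```

In a CommonJS module the skill folder would typically be __dirname; in an ES module it can be derived from import.meta.url. Either way the key point is that process.cwd() never enters the calculation.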

What's still on the list

Gemini Live's half-cascade still occasionally ignores tool instructions or freezes despite the watchdog. We're evaluating gemini-2.5-flash-native-audio-latest for the next round. It is reportedly more stable on tool invocation and turn-taking, and it would also let us bring back inline voice style tags (cheerful, serious, etc.), which the current half-cascade ignores. Post-call Claude can still over-flag in the rare cases where a tool entry doesn't quite match the pattern it's looking for. We plan to keep tuning this.

The voice product feels different now in a way that's hard to convey in a changelog. A second is a long time when you're talking to a person, but it's the difference between "this is a tool" and "this is a colleague." We've crossed that line on this release. Pick up the phone and try it.

Want to test the most advanced AI employees? Try it here: https://Geta.Team
