Fix Gemini Live local VAD by sending correct activity events to server #4146
markbackman merged 2 commits into main
Conversation
When Gemini Live was configured with local VAD (server-side VAD disabled), the service was listening for the wrong frame types and not sending ActivityStart/ActivityEnd events to the server. Now it listens for VADUserStartedSpeakingFrame/VADUserStoppedSpeakingFrame and sends the appropriate activity signals when local VAD is in use. Also removes the unnecessary local SileroVADAnalyzer from server-side VAD examples and adds a new 26a example demonstrating local VAD configuration.
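The frame-to-signal mapping described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the service's actual code: the frame classes are local stand-ins for the pipecat frames named in the description, `AudioFrame` and `ActivitySignalRouter` are hypothetical names, and the activity signals are represented as plain strings instead of real Gemini API messages.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Local stand-ins for the frame types named in the PR description.
class Frame:
    pass


class VADUserStartedSpeakingFrame(Frame):
    pass


class VADUserStoppedSpeakingFrame(Frame):
    pass


class AudioFrame(Frame):
    """Hypothetical unrelated frame, used to show that other frames are ignored."""


@dataclass
class ActivitySignalRouter:
    """Hypothetical helper: turns local VAD frames into activity signals.

    Signals are only produced when server-side VAD is disabled, which is
    the case the PR description says was previously broken.
    """

    server_vad_disabled: bool
    sent: List[str] = field(default_factory=list)  # record of signals "sent"

    def process_frame(self, frame: Frame) -> Optional[str]:
        # With server-side VAD enabled, the server detects turns itself.
        if not self.server_vad_disabled:
            return None
        if isinstance(frame, VADUserStartedSpeakingFrame):
            signal = "ActivityStart"
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            signal = "ActivityEnd"
        else:
            return None
        self.sent.append(signal)
        return signal


# Usage: one user turn detected by local VAD.
router = ActivitySignalRouter(server_vad_disabled=True)
for frame in (VADUserStartedSpeakingFrame(), AudioFrame(), VADUserStoppedSpeakingFrame()):
    router.process_frame(frame)
print(router.sent)  # ['ActivityStart', 'ActivityEnd']
```

In the real service the signals would presumably be sent over the live session instead of being appended to a list; the sketch only shows which frames trigger which signal and that nothing is sent while server-side VAD is active.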
Parameters:
- disabled: Whether to disable VAD. Defaults to None.
+ disabled: Whether to disable VAD. Defaults to None (server-side VAD is enabled).
Was server-side VAD enabled always the default and you're just calling it out explicitly here? Or is this a change?
Follow-up question: if so, prior to your changes to the examples in this PR, the examples attempted to have both local and server-side VAD going?
And follow-up to that: haven't we long maintained that local VAD is faster/more reliable than server-side, and that we recommend treating server-side signals as "supplementary"? (I remember that's what we recommended for AWS Nova Sonic at least)
I'm not sure if it has always been, but it's definitely now the default. In looking at the Gemini docs, it was unclear what the behavior is, but in testing it, it's very clear that the default is to use the server-side VAD.
> the examples attempted to have both local and server-side VAD going?
Correct. The local VAD was running but doing nothing, AFAICT.
> haven't we long maintained that local VAD is faster/more reliable than server-side, and that we recommend treating server-side signals as "supplementary"?
Yes, that is the case for most things. Though, for Gemini Live, the server-side VAD yields useful user transcripts whereas using the local VAD yields garbage for the user transcripts. I'm not sure why this is; perhaps their STT model requires a specific amount of silence padding. We'll have to ask the Google team.
Summary
- Fixes local VAD (GeminiVADParams(disabled=True)) not working. The service now correctly detects user speech via VAD frames and sends ActivityStart/ActivityEnd signals to the Gemini API to indicate turn boundaries.
- Adds a new example (26a-gemini-live-local-vad.py) demonstrating local VAD configuration.
Testing
python examples/foundational/26a-gemini-live-local-vad.py