Overview
Improve the two-way voice experience in Maple AI to be more stable and more hands-free for users who are walking, driving, or otherwise away from the screen.
Goals:
- More stable audio functionality — no data loss on errors
- A continuous voice conversation loop — the user's only required actions are tapping Send when done talking and tapping X to exit
Non-goals:
- Automatic stop detection (the app will NOT listen and guess when the user stops talking — recording stops only when the user taps Send)
- Streaming TTS from the server (TTS remains local, on-device)
Current Behavior
- User taps the microphone button. The input area transforms into a recording UI with waveform animation, timer, Cancel (X), and Send buttons.
- User speaks. Recording continues until the user taps Send.
- On Send, the input area shows "Processing..." while the audio is transcribed via Whisper Large V3.
- On success, the transcribed text is sent as a message. The user sees their transcribed text appear in the chat conversation as their sent message, and the AI begins responding.
- On failure, an inline error message appears below the input and auto-dismisses after 5-8 seconds. The audio blob is lost.
- After the AI responds, the user can tap the speaker icon on the message. If TTS models are not installed, a download dialog appears (~264 MB). If installed, the app generates a WAV file locally and plays it.
- During TTS generation, the speaker icon changes to a stop (square) icon but no audio plays until generation is complete (can take 10+ seconds for long responses). There is no visual indication that generation is in progress vs. stuck.
1. Voice Mode
Concept
Voice mode is a continuous conversation loop that lives in the input area. When active, the input area cycles through recording, processing, and playback states. The loop runs until the user explicitly exits.
The input area already transforms for recording today. Voice mode extends this existing behavior with additional states for TTS generation and playback, creating a continuous loop.
Entry
- The user taps the microphone button in the input area.
- Voice recording requires Pro/Max/Team access and the Whisper model to be available (`hasProAccess && localState.hasWhisperModel`). If the user doesn't meet these requirements, the existing upgrade dialog is shown — voice mode is not entered.
- The input area transforms into voice mode, starting in the recording state (same as current behavior).
- A short audio tone plays to confirm mic activation.
The Loop
┌─→ Recording ──(Send)──→ Processing ──→ Waiting ──→ Generating ──→ Playing ─┐
│                                                                            │
└───────────────────────────── (500ms pause) ←───────────────────────────────┘
- Recording — Mic is active, user is speaking (existing behavior)
- Processing — Audio is being transcribed, then sent as a message (existing behavior)
- Waiting — AI is streaming its response (new)
- Generating — TTS is generating audio from the completed response (new)
- Playing — TTS audio is playing back (new)
- After playback ends → 500ms pause → back to Recording (mic activated tone plays)
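The loop above can be sketched as a pure transition function. This is a sketch only; the state and event names are assumptions, not identifiers from the app:

```typescript
// Voice mode loop states (hypothetical names; the real app may differ).
type VoiceState =
  | "recording"
  | "processing"
  | "waiting"
  | "generating"
  | "playing";

type VoiceEvent =
  | "send"           // user taps Send
  | "transcribed"    // transcription succeeded, message sent
  | "responseDone"   // AI finished streaming
  | "ttsReady"       // TTS synthesis complete
  | "playbackEnded"; // playback finished (after the 500ms pause)

// Advance the loop; events that don't apply leave the state unchanged.
function next(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (state) {
    case "recording":  return event === "send" ? "processing" : state;
    case "processing": return event === "transcribed" ? "waiting" : state;
    case "waiting":    return event === "responseDone" ? "generating" : state;
    case "generating": return event === "ttsReady" ? "playing" : state;
    case "playing":    return event === "playbackEnded" ? "recording" : state;
  }
}
```

Keeping the transitions in one pure function makes the loop easy to unit-test independently of audio APIs.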
Exit
- The user taps the X button at any point during voice mode.
- X is always visible and always exits voice mode, regardless of current state.
- On exit, the input area returns to the normal text input box.
- What happens to in-flight work on exit:
- During Recording: recording is discarded, mic is released.
- During Processing: the transcription/send is cancelled. If cancellation happens too late and the message has already been sent, the AI response appears in the chat normally but without TTS playback.
- During Waiting: nothing to cancel — the AI response continues streaming into the chat normally, but without TTS playback.
- During Generating: TTS generation is cancelled. Since `tts_synthesize` has no server-side abort mechanism, cancellation means the JS side marks voice mode as exited and discards the synthesis result when it returns. The Tauri command runs to completion in the background but the audio is never played. A generation sequence ID or similar token should be used so the `speak()` callback can detect that its result is stale.
- During Playing: playback stops immediately.
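The stale-result token mentioned above could look like this. A minimal sketch, assuming a `synthesize`/`play` split; neither name is from the real TTSContext API:

```typescript
// Monotonic token: each speak() call claims a new ID; exiting voice
// mode (or starting a newer speak) makes older in-flight results stale.
let generationSeq = 0;

async function speak(
  synthesize: (text: string) => Promise<ArrayBuffer>,
  play: (audio: ArrayBuffer) => void,
  text: string,
): Promise<void> {
  const myId = ++generationSeq;
  const audio = await synthesize(text); // Tauri command runs to completion
  if (myId !== generationSeq) return;   // stale: exited or superseded
  play(audio);
}

// Exiting voice mode bumps the counter so any in-flight result is dropped.
function exitVoiceMode(): void {
  generationSeq++;
}
```

The same counter also covers the speaker-icon race described later: only the most recent `speak()` call can play.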
Input Area States
The recording state already exists and behaves as it does today. The new states (waiting, generating, playing) extend the input area with additional visual treatments:
| State | Visual | Label | Buttons |
|---|---|---|---|
| Recording | Waveform animation (current behavior) | "Recording" + timer (MM:SS) | X (left), Send (right) |
| Processing | Spinner or subtle pulse animation (current behavior) | "Processing..." | X (left) |
| Error | Error icon (new) | Error message + timer | X (left), Retry (center), Discard (right) |
| Waiting | Subtle breathing/pulse animation (new) | "Waiting for response..." | X (left) |
| Generating | Waveform animation (playback style, different color/treatment from recording) (new) | "Generating audio..." | X (left) |
| Playing | Waveform animation (playback style, animated to audio) (new) | "Playing" | X (left) |
Visual distinction between recording and playback waveforms:
- Recording waveform: uses the current visual style
- Playback waveform (generating + playing): visually distinct treatment (e.g., cool/blue tones, speaker-style animation) so the user can glance at the screen and immediately know whether the app is listening or speaking
Audio Cues
Short, subtle, non-verbal tones at state transitions. These are critical for the hands-free use case where the user is not looking at the screen.
| Transition | Cue |
|---|---|
| Mic activated (entering recording) | Gentle ascending tone |
| Send tapped (recording → processing) | Short confirmation tone |
Two cues are sufficient. TTS playback starting and ending are self-evident audio events — the voice itself signals both transitions. Adding extra tones around playback would feel noisy.
Audio cues should be designed as a cohesive pair — short (under 0.5s each), distinct from each other, and quiet enough to not compete with TTS playback or the user's environment.
Audio cues will be implemented as pre-recorded WAV files bundled with the app at:
- `frontend/public/audio/mic-on.wav` — plays when the mic activates
- `frontend/public/audio/mic-off.wav` — plays when the user taps Send
These can be loaded at runtime via `new Audio("/audio/mic-on.wav")`. This approach is simple, consistent across platforms, and easy to iterate on by swapping files.
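A minimal cue-playing helper, assuming the file paths above. The `Audio` constructor is looked up lazily so the module is safe to import outside a browser:

```typescript
// Cue names mapped to the bundled files listed above.
const CUE_FILES = {
  "mic-on": "/audio/mic-on.wav",   // mic activated
  "mic-off": "/audio/mic-off.wav", // user tapped Send
} as const;

type CueName = keyof typeof CUE_FILES;

const cueCache = new Map<CueName, any>();

// Play a cue, constructing and caching the Audio element on first use.
function playCue(name: CueName): void {
  const AudioCtor = (globalThis as any).Audio; // DOM API; absent in Node
  if (!AudioCtor) return;
  let el = cueCache.get(name);
  if (!el) {
    el = new AudioCtor(CUE_FILES[name]);
    cueCache.set(name, el);
  }
  el.currentTime = 0; // restart if a previous play is still in flight
  void el.play();     // play() returns a promise; failures are non-fatal
}
```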
TTS Availability
Voice mode (the full loop with TTS) is only available on platforms that support TTS. The app uses the same platform check that determines whether the speaker icon is shown on messages: `isTauriDesktop() || (isTauri() && isIOS())`. On platforms where this check is false (Android, web), tapping the mic button uses the current single-recording flow (record → transcribe → text response) with no voice mode loop.
On supported platforms, voice mode additionally requires TTS models to be downloaded. If TTS is supported but models are not yet installed when the user taps the mic button:
- The recording and transcription flow works as it does today (single recording, no loop).
- TTS discovery prompt: The first time a user sends a voice message without TTS models installed, after the AI responds, the input area shows an inline prompt instead of returning to the normal text box. The prompt says something like "Enable voice responses?" with a Download button and an X (dismiss) button.
- Download opens the existing TTS download dialog (~264 MB).
- X dismisses the prompt and returns the input area to the normal text box.
- This prompt appears only once. If dismissed, it does not appear again. The user can still discover TTS via the speaker icons on messages.
- The timing is intentional — after receiving a text response to a voice message, the gap between "I spoke to it" and "it can't speak back" is most felt. This is the natural moment to surface the capability.
- After successful TTS install from this prompt: The app immediately generates TTS for the AI's latest response and begins playing it. Once playback finishes, the mic auto-activates — the user is now in voice mode with the full loop running. This creates a seamless transition from discovery into the complete voice conversation experience.
- Once TTS models are installed, the next time the user taps the mic button, voice mode runs with the full loop.
Edge Cases
- Voice mode is not available while AI is streaming: This is existing behavior — the app does not allow new input while an AI response is streaming. The mic button is disabled during streaming, same as the text input send button. The user can only tap Stop (to halt the AI response) or navigate to a different chat. No changes needed here.
- TTS generation fails: Show the error briefly in the voice mode input area, then skip to mic reactivation. Do not exit voice mode on TTS failure. The conversation loop continues. The AI's text response is already committed to the chat by the time TTS generation begins, so TTS failure only affects audio playback — the user's message and the AI's response remain visible in the conversation.
- App goes to background during voice mode: Platform-dependent. If the app suspends and resumes without reloading state, voice mode should still be active. If the app fully reloads, voice mode is lost (user re-enters by tapping mic).
- User switches chat or starts a new chat during voice mode: Voice mode exits immediately. Recording is discarded, in-flight transcription results are discarded, TTS generation results are discarded, playback stops. The media stream and audio blob are released. Any AI response that was already streaming continues into the previous chat normally. This cleanup must be triggered by both the `conversationselected` and `newchat` events in UnifiedChat.
- Incoming phone call or audio interruption: The platform will likely suspend audio. On resume, if recording was active, it may be corrupted — treat as an error (show error state with option to retry or re-record). If playback was active, attempt to resume or restart playback.
- iOS-specific concern: When iOS suspends audio for a phone call, `AudioBufferSourceNode.onended` may not fire (the source is paused/interrupted, not ended). This could leave voice mode stuck in the Playing state. The implementation should listen to `AudioContext.onstatechange` and check for the `"interrupted"` state as a fallback. When an interrupted state is detected during Playing, treat it the same as playback ending — clean up and advance to the next loop state (or stay paused until the app resumes, then advance).
- For recording, listen to `MediaStreamTrack.onended` to detect when the OS kills the mic stream (e.g., during a phone call). If the track ends unexpectedly during recording, transition to the error state with the existing audio blob (if any data was captured) or restart recording when the interruption ends.
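The recovery decision for both interruption cases can be isolated as a pure helper, keeping the platform event wiring thin. A sketch; the state and action names are assumptions:

```typescript
type InterruptAction =
  | "show-error"         // recording track died with data captured: offer Retry
  | "restart-recording"  // recording track died with nothing captured
  | "finish-playing";    // playback interrupted: treat as playback ended

// Decide what to do when the OS interrupts audio mid-voice-mode.
function onAudioInterrupted(
  state: "recording" | "playing",
  capturedBytes: number,
): InterruptAction {
  if (state === "playing") return "finish-playing";
  return capturedBytes > 0 ? "show-error" : "restart-recording";
}
```

The `MediaStreamTrack` `ended` listener and the `AudioContext` `statechange` listener would both call this and dispatch the returned action.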
2. Audio Recording Error Recovery
Problem
When audio transcription fails (network error, server error, timeout), the user's recording is permanently lost. The error message auto-dismisses and the audio blob is garbage collected. Data loss destroys user trust.
Behavior
On transcription failure:
- Voice mode transitions to the error state instead of dismissing or advancing.
- The error state displays:
- The error message (e.g., "Transcription failed. Check your connection.")
- The recording duration (so the user can see their audio still exists)
- A Retry button — re-submits the same audio blob for transcription
- A Discard button — discards the recording and returns to the recording state (mic reactivates, voice mode continues)
- The X button — exits voice mode entirely
- The audio blob is retained in memory until one of these outcomes:
- Transcription succeeds (blob is released)
- User taps Discard (blob is released, mic reactivates)
- User taps X to exit voice mode (blob is released)
- Only one blob is retained at a time. If the user taps Discard (which returns to the recording state) and records again, the previous blob is released when the new recording begins. There is no accumulation of failed recording blobs across retry/re-record cycles.
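The single-blob rule could be enforced by a small holder object. A sketch; `RetainedRecording` is a hypothetical name:

```typescript
// Holds at most one failed-recording blob; retaining a new one (or
// clearing) drops the previous reference so it can be collected.
class RetainedRecording {
  private blob: Blob | null = null;
  private durationMs = 0;

  retain(blob: Blob, durationMs: number): void {
    this.blob = blob; // any previous blob is released here
    this.durationMs = durationMs;
  }

  // For Retry: hand back the same blob for re-submission.
  get(): { blob: Blob; durationMs: number } | null {
    return this.blob ? { blob: this.blob, durationMs: this.durationMs } : null;
  }

  // For Discard, exit, or successful transcription: drop the reference.
  release(): void {
    this.blob = null;
    this.durationMs = 0;
  }
}
```

Keeping the duration alongside the blob lets the error state show "your audio still exists" without touching the audio itself.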
Retry behavior:
- On retry, voice mode returns to the "Processing..." state.
- If retry fails, voice mode returns to the error state with the updated error message.
- There is no retry limit. The user can retry as many times as they want.
- After 3 consecutive failures, append to the error message: "If this keeps failing, check your internet connection and try again later."
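The escalating hint is a pure function of the failure count. A sketch using the wording above:

```typescript
// Build the error text for the error state, appending the connectivity
// hint once failures reach three in a row.
function errorText(base: string, consecutiveFailures: number): string {
  const hint =
    "If this keeps failing, check your internet connection and try again later.";
  return consecutiveFailures >= 3 ? `${base} ${hint}` : base;
}
```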
Error recovery also applies outside voice mode: If the user does a single voice recording without TTS installed (no voice mode loop), the error state still applies. The input area shows the error with Retry and Discard, retaining the audio blob. This is a critical stability improvement regardless of voice mode.
3. Speaker Icon Playback (One-Shot)
Speaker icons on individual AI messages remain as one-shot playback. They are independent from voice mode.
Behavior
- Tapping a speaker icon generates TTS for that specific message and plays it.
- This does NOT enter voice mode. The input area remains a text box.
- When playback finishes, nothing else happens. No mic activation, no loop.
- If voice mode is active and the user taps a speaker icon on a different message, voice mode exits. The speaker icon's one-shot playback proceeds normally. The user was clearly looking at the screen and interacting with a specific message, so exiting voice mode is the expected behavior. They can re-enter voice mode by tapping the mic button.
- Race condition note: If voice mode is in the Generating state when the user taps a speaker icon, two `tts_synthesize` calls may be in flight simultaneously (the voice mode synthesis and the one-shot synthesis). The voice mode synthesis result must be discarded — use a generation sequence ID so the `speak()` callback knows its result is stale and should not be played. Only the most recent `speak()` call should produce audio.
TTS Generation Feedback
TTS generation can take 10+ seconds for long responses. The speaker icon needs distinct states so the user knows what's happening.
| State | Icon | Description |
|---|---|---|
| Ready | Volume2 (static speaker) | Tap to generate and play. Long-press opens TTS management dialog. |
| Generating | Volume2 with animated/pulsing sound waves | Audio is being generated. Tap to cancel. |
| Playing | Square (stop icon) | Audio is playing. Tap to stop. |
| Not installed | Volume2 (normal) | TTS not set up. Tap opens the TTS download dialog (existing behavior). Long-press also opens the dialog. |
The Generating animation should clearly convey "working" — pulsing or rippling waves, not steady animation. This is distinct from the static Ready icon and the stop-square Playing icon.
Tapping during generation cancels the generation and returns to Ready state.
Note: The same generating animation concept (pulsing/rippling) should feel consistent between the speaker icon on messages and the voice mode generating state in the input area, even though they are visually different contexts.
Stretch Goals
These are not included in the initial implementation but are documented for future consideration.
Chunked TTS Playback
For long AI responses, split the text into chunks (e.g., by paragraph), generate TTS for each chunk sequentially, and begin playback of chunk 1 while chunk 2 is generating. This dramatically reduces perceived latency for long responses. The `tts_synthesize` Tauri command wraps the Supertone TTS engine and accepts arbitrary text, so it can be called multiple times with smaller segments.
Considerations:
- Chunk boundary selection (paragraph breaks, sentence breaks, or fixed character count)
- Seamless audio transitions between chunks (no gaps or clicks)
- Handling stop/cancel mid-stream (stop current playback, discard pending chunks)
- Progress indication (how to show which chunk is generating vs. playing)
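A chunking sketch: paragraph boundaries first, with a character cap as fallback. The 600-character cap is an assumption, not a measured value:

```typescript
// Split response text into TTS-sized chunks: prefer paragraph
// boundaries, but cap each chunk so no single synthesis call is huge.
function chunkForTTS(text: string, maxChars = 600): string[] {
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  for (const p of paragraphs) {
    if (p.length <= maxChars) {
      chunks.push(p);
      continue;
    }
    // Oversized paragraph: split on sentence ends, then pack greedily.
    let current = "";
    for (const sentence of p.split(/(?<=[.!?])\s+/)) {
      if (current && current.length + sentence.length + 1 > maxChars) {
        chunks.push(current);
        current = sentence;
      } else {
        current = current ? `${current} ${sentence}` : sentence;
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```

Each returned chunk would be passed to `tts_synthesize` in order, with chunk N+1 synthesizing while chunk N plays.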
Haptic Feedback
Add subtle haptic taps for key voice interaction moments on iOS:
- Recording starts (light tap)
- Send tapped (medium tap)
- TTS playback begins (light tap)
- TTS playback ends (light tap)
- Error occurs (error pattern / double tap)
This is particularly valuable for the hands-free use case (AirPods, phone in pocket) where haptics confirm actions without looking at the screen. Maple does not currently use haptics anywhere, so this would require establishing the haptic feedback infrastructure.
Hardware Media Button Support
Explore using headphone/earphone media buttons (play/pause) to control voice mode:
- Play/Pause press during recording → triggers Send (ends recording and sends audio)
- Play/Pause press during playback → triggers Stop (stops TTS playback)
This would enable a fully hands-free experience where the user never touches the phone screen during a voice conversation. The only required touch would be entering voice mode initially via the mic button.
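The button-to-action mapping above could be sketched as a pure function, with hedged MediaSession wiring (guarded so it is a no-op where the API is absent); the helper names are assumptions:

```typescript
// Map a play/pause press to a voice mode action.
function mediaButtonAction(
  state: "recording" | "playing" | "other",
): "send" | "stop" | null {
  if (state === "recording") return "send"; // end recording, submit audio
  if (state === "playing") return "stop";   // halt TTS playback
  return null;                              // ignore in other states
}

// Browser wiring via the MediaSession API (no-op outside a browser).
function registerMediaButtons(
  getState: () => "recording" | "playing" | "other",
  dispatch: (a: "send" | "stop") => void,
): void {
  const session = (globalThis as any).navigator?.mediaSession;
  if (!session) return;
  for (const type of ["play", "pause"] as const) {
    session.setActionHandler(type, () => {
      const action = mediaButtonAction(getState());
      if (action) dispatch(action);
    });
  }
}
```

Whether Tauri delivers these events to the webview on each platform is one of the open considerations below.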
Considerations:
- Platform support: iOS MediaSession API, Android media button intents
- Handling different headphone button behaviors (single press, double press, long press)
- Conflicts with system media controls (music apps, phone calls)
- Tauri's ability to intercept media button events on each platform
Implementation Notes
Files likely affected
| File | Changes |
|---|---|
| `frontend/src/components/RecordingOverlay.tsx` | Extend with new voice mode states (waiting, generating, playing, error). May be refactored or renamed to reflect broader role. |
| `frontend/src/components/UnifiedChat.tsx` | Voice mode state machine, auto-advance through loop states, retain audio blob on failure, retry logic |
| `frontend/src/services/tts/TTSContext.tsx` | Expose generation-in-progress state (distinct from playing), support cancellation of in-progress generation |
| Message component (where TTSButton lives) | Updated icon states for generating vs. playing (one-shot speaker icons) |
| New: audio cue assets | Pre-recorded short tone files bundled with the app |
Voice mode state machine
┌──────────────────────────────────────┐
│ X pressed │
│ (exit voice mode) │
▼ │
[Text Input] │
│ │
│ Mic button tapped │
▼ │
[Recording] ──── X ──→ [Text Input] │
│ │
│ Send tapped │
▼ │
[Processing] ── X ──→ [Text Input]* │
│ │
├── error ──→ [Error] ── Retry ──→ [Processing]
│ │ │
│ ├── Discard ──→ [Recording]
│ └── X ──→ [Text Input] │
│ │
│ success │
▼ │
[Waiting] ──── X ──→ [Text Input] │
│ │
│ AI response complete │
▼ │
[Generating] ── X ──→ [Text Input] │
│ │
│ TTS ready │
▼ │
[Playing] ──── X ──→ [Text Input] │
│ │
│ Playback ends │
│ (500ms pause) │
│ │
└──→ [Recording] ───────────────────────┘
* Cancels if possible; if message already sent, AI responds
normally but without TTS playback
Appendix: Audio Cue Generation Prompts
The following prompts can be given to an audio generation AI to produce the two audio cue files. Each prompt includes a shared preamble to ensure the cues feel like a cohesive pair.
Shared Context (include at the top of each prompt)
These are part of a set of 2 UI audio cues for a voice conversation feature in a mobile/desktop app. The full set is: (1) mic activated, (2) message sent. They should feel like a cohesive pair — same tonal palette, same instrument/synthesis style, same reverb/space. Think of the subtle, polished tones you hear from AirPods connecting or iOS system sounds: clean, minimal, warm, and immediately recognizable without being intrusive. All cues should be under 0.5 seconds, 44.1kHz, mono or stereo, delivered as WAV files. No melody, no music — just short tonal gestures. The overall vibe is calm, confident, and modern.
Prompt 1: Mic Activated (entering recording state)
Using the shared context above, generate the "mic activated" audio cue. This plays when the app begins listening for the user's voice. It should feel like an invitation to speak — a gentle, ascending two-note tone that rises in pitch. Warm and soft, not sharp or startling. Think of it as the app saying "I'm listening." Duration: 0.3-0.4 seconds. The attack should be soft (no hard click or pop at the start), with a quick natural fade-out.
Prompt 2: Message Sent (recording ended, audio being processed)
Using the shared context above, generate the "message sent" audio cue. This plays when the user taps Send to submit their voice recording. It should feel like a brief, satisfying confirmation — a single, clean tone with a slightly percussive attack, like a soft tap or ping. Neutral pitch (not high, not low). Think of it as the app saying "Got it." Duration: 0.2-0.3 seconds. Quick decay, no lingering tail.