
audio: Whisper STT + Piper TTS as MCP-callable node capabilities #288

Open
RBrid wants to merge 36 commits into openclaw:master from RBrid:user/rbrid/audio-support1

RBrid (Contributor) commented May 7, 2026

Summary

Lands additional speech-related features — the Windows tray now exposes STT and TTS as first-class node capabilities that the gateway and chat agents can invoke. Local-first, privacy-respecting, no Microsoft cloud dependencies.

Credit

Builds on @NichUK's foundational PR #120 — first-pass Windows Voice Mode. His Talk Mode UX work, configurable provider catalog, and the broader voice-mode architecture shaped the design space we're operating in. This PR takes a different implementation path in places — Whisper-only STT, Piper as the default TTS, MCP capability surface instead of UI-only — but the architectural framing is heavily informed by his work.

What's new

STT capability (stt.*)

Three new node commands exposed to the gateway / MCP:

  • stt.transcribe — fixed-duration capture (capped 30 s) + transcription
  • stt.listen — VAD-driven capture, returns at end-of-speech or after timeoutMs
  • stt.status — engine readiness probe (no PII)

Backed by Whisper.net (local, on-disk model). Silero VAD ONNX for endpointing. NAudio for capture. Default off via NodeSttEnabled setting.

TTS capability (tts.*)

  • tts.speak — speak text on the Windows node, three providers:
    • Piper (default) — local neural TTS via Sherpa-ONNX, no network egress
    • Windows SAPI
    • ElevenLabs (cloud, opt-in)

Asset management

  • SHA-256 verification on every downloaded asset (Whisper, Piper, Silero VAD) — pinned hashes + HTTPS-only URLs, asserted in tests
  • Single-flight downloads (no duplicate concurrent fetches)
  • Tar pre-flight on Piper voice archives
  • One-click model/voice download UX from Voice Settings; throttled progress UI
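
A minimal sketch of the pinned-hash check described above, written in Python for illustration only (the tray itself is C#). The function names and the chunk size are hypothetical; only "SHA-256 against a pinned hash" and "HTTPS-only URLs" come from this PR.

```python
import hashlib
from typing import BinaryIO
from urllib.parse import urlsplit


def is_https_url(url: str) -> bool:
    """Accept only https:// download URLs (hypothetical helper)."""
    return urlsplit(url).scheme.lower() == "https"


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large models are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(stream: BinaryIO, pinned_hex: str) -> bool:
    """True only when the downloaded bytes hash to the pinned value."""
    return sha256_of_stream(stream) == pinned_hex.strip().lower()
```

A caller would typically delete the downloaded file and report a generic failure when `matches_pinned_hash` returns False, rather than loading the asset.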

Voice Settings page

Full rebuild — STT engine (Whisper), TTS provider, language, model selection, download/re-download buttons, voice preview. Fully localized across all 5 locales. Privacy invariant: never echoes raw exception messages into UI status text (full detail goes to logger only).

Voice Overlay (Voice Chat) window

Compact mic/transcript/playback surface for in-tray voice conversation (Phase 3 partial — uses existing chat session, not yet streaming first-chunk TTS).

Cleanups landed alongside

  • Removed legacy WinRT SpeechRecognizer + SAPI fallback (online-speech leak risk; doesn't activate in unpackaged builds)
  • Robustness fixes in AudioPipeline (don't drop the final utterance on stop/timeout, drain in-flight transcriptions on Stop, suppress misleading shutdown errors)
  • Hub UX polish: right-click tray opens Hub, Companion rename, expanded NavView memory

Privacy notes

  • All STT runs locally (Whisper). No audio leaves the device.
  • Piper TTS is fully local; ElevenLabs only fires when explicitly selected and configured.
  • stt.status carries no PII (no transcript history, no language history, no device IDs, no model paths).
  • Validation errors never echo caller-supplied input back; handler exceptions never propagate Message into responses.
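
The last invariant can be sketched roughly as follows. This is a Python illustration under assumptions, not the C# capability code: `invoke_sanitized` and the response dictionary shape are hypothetical, while the fixed error strings ("Transcribe failed", "Listen failed") are the ones named in the commits below.

```python
import logging
from typing import Callable

logger = logging.getLogger("stt")


def invoke_sanitized(handler: Callable[[], dict], public_error: str) -> dict:
    """Run a handler; on failure return a fixed string, never the exception text."""
    try:
        return {"ok": True, "payload": handler()}
    except Exception:
        # Full detail (message + stack) stays in the local log only.
        logger.exception("handler failed")
        return {"ok": False, "error": public_error}
```

The point of the fixed `public_error` is that nothing derived from the exception or from caller input can reach the response, whatever the handler throws.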

Out of scope (deferred to later phases)

  • Voice Wake / hands-free trigger — needs platform exploration first
  • Push-to-Talk
  • True streaming first-chunk TTS playback
  • Streaming STT via AudioGraph pipeline / cloud providers
  • Selected non-default microphone/speaker for STT capture
  • Voice control record parsing
  • Central pronunciation dictionary

Validation

  • ./build.ps1
  • OpenClaw.Shared.Tests 1292 passed / 20 skipped ✅
  • OpenClaw.Tray.Tests 460 passed ✅
  • End-to-end: after pairing, wsl openclaw nodes describe and the chat agent both report all 24 commands, including stt.transcribe, stt.listen, stt.status, and tts.speak.

Operator notes

To expose voice commands to chat agents, the gateway needs them in its allowlist:

wsl openclaw config set gateway.nodes.allowCommands '["stt.transcribe","stt.listen","stt.status","tts.speak"]'
wsl openclaw gateway restart

(Gateway-side note: when a paired node's capability shape grows, the gateway's paired.json doesn't always refresh commands on reconnect — manual re-pair currently required. Tracked outside this repo.)

RBrid and others added 30 commits May 1, 2026 18:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review-driven cleanup on top of the initial stt.transcribe capability.
No behavior change for successful invocations.

Privacy:
* SttCapability no longer echoes the caller-supplied language tag in the
  "Invalid language tag" error, and no longer interpolates the underlying
  exception's Message into "Transcribe failed". Both could end up in the
  recent-activity stream and BuildSupportBundle output, which can be
  shared off-device. Full detail still goes to the local logger.
* App.OnNodeInvokeCompleted now sanitizes failed-invoke details for
  privacy-sensitive commands (stt.transcribe, camera.snap/clip,
  screen.snapshot/record). Recent activity and support bundles record
  only "privacy-sensitive | <ms> | error" instead of the raw error
  string. Non-privacy-sensitive commands keep the error text since it is
  useful for diagnostics and does not carry mic/camera args.
* Models.cs PermissionDiagnostics microphone detail now mentions
  stt.transcribe instead of "future voice features", so users hitting
  0x800455A0 see the microphone flagged as relevant in their permissions
  checklist.
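
For illustration, the failed-invoke scrubbing above might look like this Python sketch. The command set is my reading of "stt.transcribe, camera.snap/clip, screen.snapshot/record"; the "metadata" class name, the function names, and the exact rendering of the duration are assumptions, not the real NodeInvokeActivityFormatter.

```python
# Commands treated as privacy-sensitive in this commit (expanded from the
# shorthand in the message; the expansion is an assumption).
PRIVACY_SENSITIVE = frozenset({
    "stt.transcribe",
    "camera.snap", "camera.clip",
    "screen.snapshot", "screen.record",
})


def get_privacy_class(command: str) -> str:
    """Classify a command; 'metadata' as the default class is an assumption."""
    return "privacy-sensitive" if command in PRIVACY_SENSITIVE else "metadata"


def build_failure_details(command: str, duration_ms: int, error: str) -> str:
    """Details string for a failed invoke; raw error kept only when not sensitive."""
    privacy_class = get_privacy_class(command)
    if privacy_class == "privacy-sensitive":
        # Fixed sentinel instead of the raw error string.
        return f"{privacy_class} | {duration_ms} | error"
    return f"{privacy_class} | {duration_ms} | {error}"
```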

Refactors for testability (no behavior change):
* New Services/NodeInvokeActivityFormatter.cs owns GetPrivacyClass and
  BuildDetails. App.OnNodeInvokeCompleted delegates to it.
* New Services/NodeCapabilityGating.cs owns the optional-capability
  predicates. NodeService.RegisterCapabilities calls into it instead of
  inlining "_settings?.NodeXxxEnabled" checks. Privacy-sensitive
  defaults stay off; everything else stays default-on.
* Both helpers are linked into OpenClaw.Tray.Tests.

Localization:
* SettingsWindow.xaml gains x:Uid for every TTS and STT control. The
  literal Text/Header/PlaceholderText values are kept as dev-time
  fallbacks, matching the SettingsTokenTextBox and SettingsMcpDescription
  pattern already in the file.
* en-us, fr-fr, nl-nl, zh-cn, and zh-tw .resw files gain matching
  entries for the 14 new TTS/STT keys. Brand names (ElevenLabs),
  command names (tts.speak, stt.transcribe, gateway.nodes.allowCommands,
  MSIX), BCP-47 tags, and the eleven_multilingual_v2 model identifier
  are kept verbatim across all locales.
* SettingsMcpDescription.Text in all five locales now lists "microphone"
  and "speakers" alongside camera/screen/canvas so the local MCP-server
  description reflects the full Phase 1 + Phase 2 voice surface.

Tests:
* Two new privacy regression tests in CapabilityTests verify that an
  invalid language and a thrown handler exception never leak their text
  into the response error.
* New NodeInvokeActivityFormatterTests pin the privacy-class table, the
  sanitized details for privacy-sensitive failures, and the full
  ActivityStreamService.BuildSupportBundle path.
* New NodeCapabilityGatingTests pin that tts.speak and stt.transcribe
  default off (including for null settings) and that the two capabilities
  are independent consent surfaces.
* New SettingsWindowLocalizationCoverageTests parses SettingsWindow.xaml
  and asserts every new TTS/STT x:Uid resolves to the expected
  .Header/.Text/.Content/.PlaceholderText keys in en-us.
* ActivityStreamServiceTests and NodeInvokeActivityFormatterTests now
  share a non-parallel xUnit collection because ActivityStreamService is
  a static singleton; running both classes in parallel could otherwise
  cause flaky support-bundle assertions.
* NodeCapabilityGatingTests cleans up its temp settings directories.

Cleanup:
* Drop "Phase 2" wording from SpeechToTextService.cs; the resw section
  comments referring to "Phase 1 TTS / Phase 2 STT" are likewise
  reworded to plain "TTS / STT settings". Phase numbering is a planning
  artifact and should not appear in the codebase.

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj
  --no-restore  (1173 passed, 20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj
  --no-restore  (465 passed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace constructor-injected sample data with empty/loading states:
- UsagePage: remove fabricated provider costs and daily data
- SessionsPage: remove 3 fake AI conversation sessions
- NodesPage: remove fake Desktop-PC/MacBook-Pro nodes
- ChannelsPage: remove fake Telegram/WhatsApp channels
- SkillsPage: remove fake skills and stale 'API not yet wired' warning
- CronPage: remove fake cron jobs, stale warning, fix hardcoded defaults

All pages now show proper empty states until real gateway data arrives.
The Skills and Cron APIs were already fully wired; the warnings were
simply outdated and misleading.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add full voice interaction capabilities to the Windows node:

Core audio pipeline:
- NAudio WASAPI microphone capture with MTA thread initialization
- Energy-based voice activity detection with hysteresis
- Whisper.net speech-to-text with multi-threaded inference
- Pre-buffer to capture speech onset before VAD triggers
- Auto-download of Whisper models from HuggingFace
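
A rough Python sketch of how energy-based VAD with hysteresis and a pre-buffer can fit together. It is an illustration under assumptions, not the AudioPipeline code: the class name, thresholds, and frame representation (lists of float samples) are hypothetical, and a real pipeline would also apply a silence timeout before ending an utterance.

```python
import math
from collections import deque
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class EnergyVad:
    """Two thresholds: speech starts above `start`, ends only below `stop`."""

    def __init__(self, start: float = 0.02, stop: float = 0.01, pre_frames: int = 5):
        self.start = start
        self.stop = stop
        self.speaking = False
        self._pre = deque(maxlen=pre_frames)  # recent silent frames

    def push(self, frame: Sequence[float]) -> List[Sequence[float]]:
        """Return the frames to forward to the recognizer for this input frame."""
        level = rms(frame)
        if not self.speaking:
            if level >= self.start:
                self.speaking = True
                out = list(self._pre) + [frame]  # replay onset captured earlier
                self._pre.clear()
                return out
            self._pre.append(frame)
            return []
        if level < self.stop:
            self.speaking = False
            self._pre.append(frame)
            return []
        return [frame]
```

The gap between the two thresholds is what keeps a level hovering near the boundary from toggling the detector on every frame.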

Voice overlay window:
- Modern WinUI 3 floating window with Mica backdrop and custom title bar
- Chat-style transcript bubbles with segment consolidation
- Real-time audio level visualization
- Start/Stop, Mute, and Settings controls

STT node capability:
- stt.listen and stt.status MCP commands for agent-initiated listening
- Follows existing capability pattern (like TTS)

Voice settings page:
- Model size selection (tiny/base/small) with download management
- Language selection (auto-detect + 9 languages)
- Silence timeout slider
- TTS voice picker with Windows neural voice enumeration
- ElevenLabs provider configuration
- Voice preview button

Integration:
- Tray menu Voice item
- Ctrl+Alt+Shift+V global hotkey for push-to-talk
- Deep links: openclaw://voice, openclaw://voice-stop
- Gateway chat responses shown in voice overlay
- TTS response playback with mic muting to prevent echo
- Capabilities page STT toggle
- Hub navigation Voice & Audio page

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Master refactored 8 separate windows into a unified Hub app (openclaw#272), which
removed src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml(.cs) and
WebChatWindow.xaml.cs. Node-capability toggles now live in
Pages/CapabilitiesPage as a code-built list (one icon + label per
capability) instead of an XAML page with x:Uid-localized headers.

Conflict resolution and re-integration:

* Accepted master's deletion of SettingsWindow.xaml, SettingsWindow.xaml.cs,
  and WebChatWindow.xaml.cs. The TTS/STT controls and code-behind that this
  branch added to those files are obsolete with the new Hub UI.

* Pages/CapabilitiesPage.xaml.cs gains a Speech-to-Text toggle alongside the
  existing Camera/Canvas/Screen/Location/TTS toggles, plus 'stt' in the
  active-capabilities summary string. This is the natural minimal alignment
  with the new pattern: one capability = one entry in the toggle list.

* The TTS provider / ElevenLabs key/voice/model UI that this branch had
  added is dropped because master removed the corresponding settings
  surface entirely. The backend services (TextToSpeechService,
  ElevenLabsTextToSpeechClient) and the SettingsManager keys are intact;
  the values can be set via direct settings.json edit until a new UI
  surface lands.

* Resolved 5 .resw conflicts (en-us, fr-fr, nl-nl, zh-cn, zh-tw) by taking
  master's content. All TTS/STT resource keys this branch had added are
  removed because the controls referencing them are gone. The earlier
  SettingsMcpDescription update (adding 'microphone' and 'speakers' to
  the capability list) is outside the conflict region and is preserved.

* Deleted tests/OpenClaw.Tray.Tests/SettingsWindowLocalizationCoverageTests.cs.
  It pinned that 14 specific x:Uids on SettingsWindow.xaml had matching
  resw entries; the controls and the file no longer exist.

Refactors from this branch survived the auto-merge cleanly:
* App.xaml.cs OnNodeInvokeCompleted still delegates to
  NodeInvokeActivityFormatter for privacy-class scrubbing.
* NodeService.RegisterCapabilities still calls NodeCapabilityGating
  predicates for every optional capability, including TTS and STT.

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj
  --no-restore  (1183 passed, 20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj
  --no-restore  (418 passed; restore required first because master's
  Tray.Tests now links GatewayDiscoveryService.cs which needs Zeroconf)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Master's Hub refactor (openclaw#272) removed the per-capability detail UI that
previously lived on SettingsWindow. The capability backends are intact but
have no in-app surface anymore: STT had no way to set the BCP-47 language
tag, and TTS had no way to pick the provider, ElevenLabs API key, voice
ID, or model without hand-editing settings.json.

CapabilitiesPage.xaml gains two new detail cards beneath the capability
toggle grid, mirroring the existing McpCard pattern (visible only when
the capability is enabled):

* SttCard:
    - Language TextBox bound to SttLanguage.
    - Commits on LostFocus or Enter.
    - Empty input restores the "en-US" default rather than persisting "".
    - Validates with SttCapability.NormalizeLanguageTag before saving so a
      typo in Settings cannot ship a broken default to the WinRT recognizer.
    - Status text never echoes the user-supplied tag back on the failure
      path; only the local UI affordance shows it (the activity stream
      / support bundle path was already privacy-scrubbed by an earlier
      commit on this branch).

* TtsCard:
    - Provider ComboBox (Windows built-in / ElevenLabs).
    - ElevenLabs sub-panel becomes visible only when that provider is
      selected. Holds API key (PasswordBox), voice ID, and model.
    - API key handling: when a key is already saved we render a fixed
      mask sentinel ("••••••••") instead of any plaintext. Saving the
      form treats the sentinel as "keep current key" so the user can
      change voice ID / model without retyping the key, and rotation
      requires explicitly typing a new key. The on-disk DPAPI encryption
      done by SettingsManager is unchanged.
    - All ElevenLabs fields commit on LostFocus.
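
The mask-sentinel behavior for the API key can be sketched like this. It is a Python illustration only; the helper names are hypothetical, and treating any non-sentinel value as a new key is an assumption about the save path.

```python
from typing import Optional

MASK_SENTINEL = "\u2022" * 8  # eight bullet characters, as rendered in the PasswordBox


def display_value(saved_key: Optional[str]) -> str:
    """What the form shows: the mask if a key exists, never the plaintext."""
    return MASK_SENTINEL if saved_key else ""


def resolve_key_on_save(form_value: str, saved_key: Optional[str]) -> Optional[str]:
    """Sentinel means 'keep current key'; anything else is treated as a new key."""
    if form_value == MASK_SENTINEL:
        return saved_key
    return form_value
```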

SttCapability.NormalizeLanguageTag is promoted from private to public so
the UI validates against exactly the rule the wire protocol applies. No
behavior change for the capability itself.

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj
  --no-restore  (1183 passed, 20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj
  --no-restore  (418 passed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combine ranjeshj's audio stack (NAudio + Silero VAD + Whisper.net, voice
overlay, push-to-talk hotkey, deep links, voice settings page,
stt.listen + stt.status) with rbrid's STT work (stt.transcribe, privacy
hardening, gating refactor, BCP-47 validator, tests, localization).

Both engines coexist; user picks the preferred one in the new STT card
on CapabilitiesPage. Whisper is the default. While the Whisper model is
downloading or otherwise unavailable, calls fall back transparently to
the WinRT engine; explicit WinRT selection never silently upgrades to
Whisper.

Unified SttCapability (OpenClaw.Shared) exposes three commands:
* stt.transcribe — bounded fixed-duration capture (rbrid's), capped at
  30000 ms; per-call language overrides settings.
* stt.listen     — VAD-driven capture, 1000..120000 ms (ranjeshj's),
  default 30000 ms; "auto" language by default.
* stt.status     — per-engine readiness, no PII.
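
The bounds above could be applied along these lines. This Python sketch assumes out-of-range stt.listen timeouts are clamped rather than rejected and that stt.transcribe only enforces the upper cap; the function names are hypothetical.

```python
from typing import Optional

LISTEN_MIN_MS = 1_000
LISTEN_MAX_MS = 120_000
LISTEN_DEFAULT_MS = 30_000
TRANSCRIBE_MAX_MS = 30_000


def clamp_listen_timeout(timeout_ms: Optional[int]) -> int:
    """Default when omitted, otherwise clamp into [1000, 120000] ms."""
    if timeout_ms is None:
        return LISTEN_DEFAULT_MS
    return max(LISTEN_MIN_MS, min(LISTEN_MAX_MS, timeout_ms))


def cap_transcribe_duration(duration_ms: int) -> int:
    """Fixed-duration capture never exceeds 30000 ms."""
    return min(duration_ms, TRANSCRIBE_MAX_MS)
```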

Privacy invariants from rbrid's branch are preserved across all three:
no caller-supplied language echo on validation errors; no exception
text in error responses; activity-stream details record only the
privacy class + duration + sanitized "error" sentinel for failed
privacy-sensitive invocations.

Engine selection lives in NodeService.OnStt* handlers:
* OnSttTranscribeAsync / OnSttListenAsync probe IsWhisperReady; on
  Whisper preference + not-ready, fall back to WinRT and tag the
  response with engineEffective + engineFallbackReason.
* OnSttStatusAsync probes both engines and reports preferred /
  effective engine plus per-engine readiness with no PII.

Settings consolidation:
* New SttEngine = "whisper" (default).
* SttLanguage default changes from "en-US" (rbrid) to "auto" (ranjeshj
  + audio plan); BCP-47 validator extended to accept "auto" sentinel.
* Kept ranjeshj's SttModelName, SttSilenceTimeout, VoiceTtsEnabled,
  VoiceAudioFeedback. SettingsManager round-trip updated; duplicate
  field declarations from the auto-merge resolved.

VoiceService now exposes IsWhisperReady, IsWhisperDownloadingModel,
WhisperModelDownloadProgress for the selector. DownloadModelAsync
sets the downloading flag for the duration so stt.status surfaces it.

UI:
* CapabilitiesPage STT card gains an Engine ComboBox (Whisper /
  Windows built-in) and a hint line that reflects download state.
  "More voice settings…" link navigates to the existing
  VoiceSettingsPage (deep config: model, silence, voices).
* TTS card from rbrid's branch is preserved.

Conflict resolution policy applied during merge:
* 8 stale "remove fake sample data" pages (CronPage, ChannelsPage,
  NodesPage, SessionsPage, SkillsPage, UsagePage, ChatWindow.xaml.cs,
  repo-assist.lock.yml) — took HEAD (post-cleanup version). Master
  and ranjeshj agreed; rbrid's branch was just behind.
* HubWindow + App.xaml.cs voice-related additions — took HEAD.
* SttCapability.cs — rewritten as the unified three-command class.
* CapabilitiesPage XAML + code-behind — kept rbrid's STT/TTS cards
  and extended the STT card with the engine picker.
* OpenClaw.Tray.WinUI.csproj — kept both NAudio.Wasapi and
  System.Speech package refs.

Validation (no-restore baseline):
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj
  --no-restore  (1241 passed, 20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj
  --no-restore  (419 passed)

Follow-up tracked in plan.md (still pending):
* Audio_STM.md (STRIDE).
* Post-merge security review.
* Localization compliance audit.
* Test coverage extension (engine adapters, selector, MCP, locale).
* Cleanup and final push.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NodeInvokeActivityFormatter.GetPrivacyClass now classifies any stt.*
command as privacy-sensitive, not just stt.transcribe. This catches
stt.listen (microphone capture) and stt.status (engine internals)
under the same scrubbing rules in the activity stream / support
bundle, and keeps the rule simple ("anything in the stt namespace").

Tests added:
* GetPrivacyClass: stt.listen, stt.status, stt.future-command rows.
* PrivacySensitive_FailedInvoke_OmitsErrorTextFromDetails: theory
  rows for stt.listen and stt.status alongside the existing
  stt.transcribe / camera.* / screen.* coverage.
* SttCapabilityTests: full coverage of the unified surface
  - Listen: timeoutMs clamps (below min, above max), default language
    "auto", invalid language rejected without echo, handler not
    wired, handler exception sanitized to "Listen failed", segments +
    engine metadata round-trip, cancellation.
  - Status: handler not wired, handler exception sanitized to
    "Status failed", per-engine readiness round-trip with download
    progress.
  - NormalizeLanguageTag: BCP-47 tags + "auto" sentinel
    (case-insensitive, normalized to lowercase) accepted; underscore
    / spaces / "automatic" rejected.
* SettingsRoundTripTests: round-trips SttEngine, SttModelName,
  SttSilenceTimeout, VoiceTtsEnabled, VoiceAudioFeedback through
  SettingsData.ToJson / FromJson.
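
A simplified Python sketch of the language-tag rule those tests pin. The regular expression is a loose structural approximation of BCP-47, not the full grammar and not the real NormalizeLanguageTag; returning None for rejected input is also an assumption.

```python
import re
from typing import Optional

# Loose shape check: a 2-8 letter primary subtag, then hyphen-separated
# alphanumeric subtags of 1-8 characters. No underscores, no whitespace.
_TAG_SHAPE = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")


def normalize_language_tag(tag: str) -> Optional[str]:
    """Return the accepted tag, 'auto' for the sentinel, or None if rejected."""
    if tag.lower() == "auto":
        return "auto"  # sentinel is case-insensitive, normalized to lowercase
    if _TAG_SHAPE.fullmatch(tag):
        return tag
    return None
```

Callers that reject a tag should report a fixed message such as "Invalid language tag" without including the rejected value, in line with the privacy invariant.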

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests --no-restore  (1266 passed,
  20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests --no-restore  (425 passed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Added E:\OpenClawWindowsNode\Audio_STM.md — full STRIDE analysis
  of the merged audio surface (assets, trust boundaries, per-component
  threats, cross-references to code + tests, follow-up backlog).
* Promoted every new STT/TTS card string in CapabilitiesPage.xaml to
  x:Uid + resw entries across all five locales (en-us, fr-fr, nl-nl,
  zh-cn, zh-tw): engine picker labels, language input + help, "More
  voice settings…" link, TTS provider picker, ElevenLabs sub-panel
  fields. Brand names (ElevenLabs), the "auto" BCP-47 sentinel, and
  the eleven_multilingual_v2 model identifier are kept verbatim and
  registered as InvariantOrDeferred in LocalizationValidationTests.
* Added CapabilitiesPageLocalizationCoverageTests — pins every new
  STT/TTS x:Uid against expected resw key suffixes (.Text, .Header,
  .Content, .PlaceholderText) so a future hardcoded-string regression
  fails fast.

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests --no-restore  (1266 passed,
  20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests --no-restore  (461 passed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The engine-selection logic that NodeService.OnSttTranscribeAsync /
OnSttListenAsync / OnSttStatusAsync inline-implemented is now a pure
helper in Services/SttEngineSelector.cs and is consumed identically
from all three handlers. No behavior change.

Selector rules (pinned by SttEngineSelectorTests, 21 cases):
* Whisper preference + Whisper ready → Whisper, no fallback.
* Whisper preference + Whisper NOT ready + WinRT ready → WinRT,
  fallbackReason="whisper-model-not-ready". Graceful degradation while
  the model downloads on first launch.
* Whisper preference + neither ready → keep Whisper preference,
  fallbackReason="whisper-and-winrt-unavailable". Dispatch fails;
  the user's preference is reported unchanged so stt.status is
  honest about what they asked for.
* WinRT preference + WinRT ready → WinRT, no fallback.
* WinRT preference + WinRT ready + Whisper ALSO ready → still WinRT.
  Critical invariant: explicit user choice is never silently upgraded
  to Whisper when the model finishes downloading.
* WinRT preference + WinRT NOT ready → keep WinRT,
  fallbackReason="winrt-unavailable". Same invariant: do not fall
  back to Whisper without explicit user opt-in.
* null/empty/whitespace/unknown engine string → treat as Whisper
  preference. A typo in settings.json must not hard-fail STT.
* Case- and whitespace-insensitive parsing of "whisper" / "winrt".
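
The rules above fit in one pure function. The following Python sketch mirrors them for illustration; the `Selection` type and the function signature are hypothetical, while the engine names and fallbackReason strings are taken from the list.

```python
from dataclasses import dataclass
from typing import Optional

WHISPER = "whisper"
WINRT = "winrt"


@dataclass(frozen=True)
class Selection:
    preferred: str                   # what the user asked for
    effective: str                   # what will actually be dispatched to
    fallback_reason: Optional[str]   # None when no fallback or failure applies


def select_engine(setting: Optional[str], whisper_ready: bool, winrt_ready: bool) -> Selection:
    # Unknown, empty, or missing values are treated as a Whisper preference.
    preferred = WINRT if (setting or "").strip().lower() == WINRT else WHISPER
    if preferred == WINRT:
        # An explicit WinRT choice never moves to Whisper, ready or not.
        return Selection(WINRT, WINRT, None if winrt_ready else "winrt-unavailable")
    if whisper_ready:
        return Selection(WHISPER, WHISPER, None)
    if winrt_ready:
        return Selection(WHISPER, WINRT, "whisper-model-not-ready")
    return Selection(WHISPER, WHISPER, "whisper-and-winrt-unavailable")
```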

Engine identifier constants are mirrored locally on
SttEngineSelector.SharedConstants (free of cross-assembly deps);
MirroredConstantsMatchSttCapability pins they stay in sync.

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests --no-restore  (1266 passed,
  20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests --no-restore  (482 passed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Findings from the post-merge security review (full review recorded in
the session at files/security-review.md and reflected in the STM
follow-up backlog):

CRITICAL (1 fixed, 1 deferred):
* I-1 — UI now warns that selecting WinRT honors the Windows Online
  speech recognition toggle and may upload audio to Microsoft when
  that toggle is on. CapabilitiesPage SttEngineHint text updated to
  steer users to Whisper for fully local processing.
* S-4 / T-1 — DEFERRED: SHA-256 verification of the Whisper model
  (download AND load time) requires embedding canonical hashes for
  tiny / base / small from HuggingFace. Tracked as a Critical
  pre-GA follow-up in Audio_STM.md section 6, not blocking this
  merge. (Existing TLS + system trust chain remains the only check.)

HIGH (3 fixed):
* S-3 / D-1 — NodeService.OnSttListenAsync now enforces a 1-second
  cooldown between successive stt.listen invocations. Imperceptible
  to a real user but throttles a hostile loop from a compromised
  gateway. Throws InvalidOperationException("Listen rate limit")
  which the SttCapability sanitization wraps as "Listen failed".
* D-7 — AudioPipeline.CleanupCapture now wraps event-detach,
  capture.Dispose, and CTS dispose in independent try/catch blocks
  so a failure in one step doesn't leak the NAudio WasapiCapture
  COM object (which would hold the mic LED lit until process exit).
  Also added CleanupCapture() calls in StartAsync's two catch
  branches so the mic is released after a failed start.
* I-2 — VoiceOverlayWindow audit confirmed no transcript text reaches
  ActivityStreamService. Status: PIN, no code change needed.
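
One way to express the S-3 / D-1 cooldown, sketched in Python with an injectable clock so it can be tested without sleeping. The class name is hypothetical, and measuring the interval from the start of the previous accepted call is an assumption.

```python
import time
from typing import Callable, Optional


class ListenCooldown:
    """Rejects a call that arrives sooner than `cooldown_s` after the last accepted one."""

    def __init__(self, cooldown_s: float = 1.0, clock: Callable[[], float] = time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last: Optional[float] = None

    def check(self) -> None:
        now = self._clock()
        if self._last is not None and now - self._last < self._cooldown_s:
            # The capability layer would surface this as the fixed "Listen failed".
            raise RuntimeError("Listen rate limit")
        self._last = now
```

A monotonic clock is used here so that wall-clock adjustments cannot shorten or extend the cooldown.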

MEDIUM (1 fixed):
* NEW-1 — TtsCapability previously returned $"Speak failed: {ex.Message}",
  which can leak ElevenLabs key prefixes from 401 responses or
  device names from OS audio errors into the support bundle.
  Now returns a fixed "Speak failed" matching the SttCapability
  pattern. NodeInvokeActivityFormatter.GetPrivacyClass also now
  classifies tts.* as privacy-sensitive (was metadata) so failed-
  invoke details are uniformly scrubbed.

PIN (no change needed, confirmed by review):
* T-3 — SttModelName path-traversal: WhisperModelManager validates
  against the {tiny, base, small} allow-list before any Path.Combine.
* I-4 — ElevenLabs key DPAPI-encrypted at rest.
* I-5 — ElevenLabs key UI shows masked sentinel; plaintext never
  re-rendered after save.
* I-8 / PI-5 — stt.status response carries no PII (only readiness
  strings, engine name, capability flags, numeric download progress).
* PI-3 — Validation/handler errors don't echo caller input or
  exception text across stt.* and now tts.* as well.
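
The T-3 pattern (validate against an allow-list before building any path) can be sketched as below. This is an illustrative Python version; the function name and the "ggml-<name>.bin" file name are assumptions, not WhisperModelManager's actual layout.

```python
from pathlib import Path

ALLOWED_MODELS = frozenset({"tiny", "base", "small"})


def model_path(models_dir: Path, model_name: str) -> Path:
    """Build a model path only for allow-listed names; never echo the bad input."""
    if model_name not in ALLOWED_MODELS:
        raise ValueError("Unknown model")
    # File name pattern is hypothetical.
    return models_dir / f"ggml-{model_name}.bin"
```

Because the check is an exact membership test, values such as "..\\..\\x" or "tiny/../other" never reach the path join at all.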

Test additions:
* Speak_HandlerException_DoesNotLeakExceptionMessageIntoError —
  pins the new TTS privacy invariant with an "ElevenLabs 401:
  invalid key sk-secret-prefix" payload.
* Speak_ReturnsError_WhenHandlerThrows updated to assert the exact
  sanitized "Speak failed" message instead of leaking ex.Message.
* GetPrivacyClass theory rows now cover tts.speak and
  tts.future-command as privacy-sensitive (was metadata).

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests --no-restore  (1271
  passed, 20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests --no-restore  (483 passed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both legacy stacks are removed; SttCapability now dispatches every
stt.* call to a single Whisper engine via VoiceService. When the
Whisper model is not yet downloaded, handlers return a clear error
pointing the caller at the Voice Settings page download button —
there is no automatic fallback engine.

Rationale (from the discussion with Ranjesh):
* WinRT SpeechRecognizer is an old API that fails to activate in
  unpackaged tray builds (the long-standing 0x800455A0 issue) and,
  when the OS Online speech recognition toggle is on, may upload
  audio to Microsoft cloud — at odds with our local-first posture.
* System.Speech (desktop SAPI) is even older and offers no advantage
  over Whisper in any modern scenario.
* Carrying two engines complicated the merge with no real upside
  now that Whisper.net runs reliably on every supported PC.

Removed:
* src/OpenClaw.Tray.WinUI/Services/SpeechToText/SpeechToTextService.cs
  (the WinRT + SAPI engine).
* src/OpenClaw.Tray.WinUI/Services/SttEngineSelector.cs (no engines
  to select between).
* tests/OpenClaw.Tray.Tests/SttEngineSelectorTests.cs.
* System.Speech NuGet package reference (was duplicated; both
  copies removed).
* SttEngine setting (SettingsData + SettingsManager round-trip).
* SttCapability.EngineWinRt and DefaultEngine constants.
* SttTranscribeResult.EngineFallbackReason and
  SttListenResult.EngineFallbackReason — no fallback to report.
* CapabilitiesPage Engine ComboBox + the engine-related UI strings
  in all five locales.
* The "Windows built-in may upload audio" caveat (no longer relevant).

Simplified:
* SttStatusResult: replaced PreferredEngine/EffectiveEngine plus
  per-engine readiness blocks with a single Engine + Readiness pair
  (engine is always "whisper" today; the field stays so a future
  engine doesn't break the wire).
* NodeService.OnSttTranscribeAsync / OnSttListenAsync /
  OnSttStatusAsync: dropped selector logic + WinRT marshalling.
  When VoiceService.IsWhisperReady is false, throw clear
  "Whisper model not downloaded" — wrapped to "Transcribe failed"
  / "Listen failed" by SttCapability's privacy sanitizer.
* CapabilitiesPage STT card hint surfaces model download state
  ("Whisper model is ready" / "downloading" / "not downloaded —
  open More voice settings…").
* McpToolBridge curated descriptions: drop engineFallbackReason
  field and the per-engine blocks from stt.status.

Tests:
* CapabilityTests.Status_ReturnsEngineReadiness rewritten for the
  flat shape; now also asserts no language/path strings appear in
  the JSON (tightens PI-5 enforcement).
* SettingsRoundTripTests: dropped SttEngine field assertions.
* CapabilitiesPageLocalizationCoverageTests: dropped engine ComboBox
  Uids from the contract list.
* LocalizationValidationTests: removed the engine ComboBox keys
  from the InvariantOrDeferred allow-list (no longer needed; the
  invariants list now only protects "auto", "ElevenLabs", and
  "eleven_multilingual_v2").

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests --no-restore  (1271 passed,
  20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests --no-restore  (456 passed)

Audio_STM.md and Audio_FollowUps.md updated to reflect the engine
removal (smaller test-seam refactor surface; I-1 "WinRT online
speech caveat" follow-up is retired).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a third TTS provider, "piper", that runs Piper voices fully
locally on this PC through the official Sherpa-ONNX .NET binding
(org.k2fsa.sherpa.onnx 1.13.0). No cloud egress; the voice model
downloads once to %LOCALAPPDATA%\OpenClawTray\models\piper\<voice-id>\
and is reused across calls.

Backend (OpenClaw.Shared/Audio/PiperVoiceManager.cs):
* Curated catalog of 6 starter voices (en-US ×2, en-GB, fr-FR, de-DE,
  zh-CN) sourced from the sherpa-onnx tts-models GitHub release
  tarballs — these are repackaged Piper voices that include the
  language-specific espeak-ng-data, so the user only downloads one
  archive per voice instead of model + tokens + espeak separately.
* Download with progress callback; extraction via OS-bundled tar.exe
  (Win10 1803+); atomic per-voice directory layout; cleanup of
  partial files on failure or cancellation.
* IsVoiceDownloaded / GetVoiceSize / DeleteVoice for the (forthcoming)
  Voice Settings page UI.
* TODO marker for SHA-256 verification (Audio_FollowUps.md §2).

Tray service (OpenClawTray/Services/TextToSpeech/PiperTextToSpeechClient.cs):
* Wraps SherpaOnnx.OfflineTts; loads one voice at a time and reuses
  the loaded model across calls (load is the expensive ~200-500 ms
  step). Single-flight gate prevents concurrent generates from
  racing the same TTS instance.
* Inference runs on a background Task so cancellation can race the
  synthesis.
* Converts Sherpa's 32-bit float PCM samples to a standard 16-bit
  PCM mono WAV blob the WinUI MediaPlayer can play with no further
  transcoding.
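
The float-to-PCM conversion in the last bullet can be sketched as follows. This is a minimal Python illustration, not the C# in PiperTextToSpeechClient; the function name `float_samples_to_wav` and the clamping step are assumptions made for the example.

```python
import io
import struct
import wave


def float_samples_to_wav(samples, sample_rate):
    """Convert float samples in [-1.0, 1.0] to a 16-bit PCM mono WAV blob."""
    pcm = bytearray()
    for s in samples:
        # Clamp first: a synthesizer can overshoot slightly, and an
        # unclamped value would not fit in a signed 16-bit integer.
        clamped = max(-1.0, min(1.0, s))
        pcm += struct.pack("<h", int(round(clamped * 32767)))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buffer.getvalue()
```

Scaling by 32767 rather than 32768 keeps +1.0 inside the signed 16-bit range at the cost of never emitting -32768.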

Wiring (OpenClaw.Tray.WinUI/Services/TextToSpeech/TextToSpeechService.cs):
* Third branch in SpeakAsync's provider dispatch. SpeakWithPiperAsync
  resolves the voice from args.VoiceId or settings.TtsPiperVoiceId,
  fails with a "voice not downloaded" error pointing the user at
  Voice Settings if the file isn't present, and otherwise reuses the
  cached PiperTextToSpeechClient (rebuilds it only when the voice id
  changes).
* TextToSpeechService.PiperVoices exposed so the Voice Settings page
  can drive download / delete from the same instance.

UI (OpenClaw.Tray.WinUI/Pages/CapabilitiesPage.xaml + .xaml.cs):
* Added Piper as the first ComboBoxItem on the TTS provider picker
  ("Piper (local ML, recommended)"). Resw entries across all 5
  locales (en-us, fr-fr, nl-nl, zh-cn, zh-tw).
* UpdateTtsCard reads TtsProvider with a 3-way switch (piper /
  windows / elevenlabs); unknown / null defaults to Piper.

Capability + settings:
* TtsCapability.PiperProvider = "piper" wire constant.
* SettingsData.TtsPiperVoiceId / SettingsManager.TtsPiperVoiceId,
  default "en_US-amy-low" (~50 MB, smallest English voice).
  Round-trip preserved through Save/Load.

Tests:
* SettingsRoundTripTests asserts TtsPiperVoiceId persists.
* CapabilitiesPageLocalizationCoverageTests pins the new
  CapabilitiesPage_TtsProviderPiper x:Uid against en-us.
* PiperVoiceManager + PiperTextToSpeechClient have no unit tests
  yet — same blocker as the rest of the audio engine layer
  (Audio_FollowUps.md §1: needs interface extraction first).

Audio_FollowUps.md §3 updated with a "Status update — basic Piper
plumbing landed" subsection enumerating exactly what shipped and
what remains (Voice download UI, manager tests, SHA-256 verification,
spike validation).

Validation:
* .\build.ps1
* dotnet test tests/OpenClaw.Shared.Tests --no-restore  (1271 passed,
  20 skipped)
* dotnet test tests/OpenClaw.Tray.Tests --no-restore  (462 passed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User-visible

* New Piper voice download panel on the Voice & Audio page
  (catalog of 6 voices, download with progress, delete, preview).
* Piper is now the default TTS provider for fresh installs.
* Read responses aloud toggle now drives every chat reply,
  not only voice-overlay sessions.
* Voice Overlay's Settings button opens the Voice & Audio page
  (was a no-op stub).
* First Whisper auto-download surfaces a status line in the
  Voice Overlay so the user knows the silent ~140 MB fetch is
  why nothing is being transcribed yet.
* Speech Model card refreshes its 'Model ready / Download
  required' status whenever the page becomes visible, even if
  NodeService hasn't wired its VoiceService yet.
* Stale 'Windows built-in' fallback text removed from the
  Speech-to-Text card description (5 locales). Whisper has been
  the only engine since ff11467.
* Width bumps so labels no longer truncate (the Speech Model
  size combo, the Provider combo).
* Dropped 'STT' jargon from the Language ComboBox header.
* Fixed misleading '~50-80 MB each' Piper size copy (real range
  is ~25-150 MB depending on quality).

Plumbing

* New SettingsRequested event on VoiceOverlayWindow; App hooks
  it to ShowHub('voice').
* TtsCapability.ResolveProvider falls back to Piper.
* App.OnNotificationReceived no longer gates TTS on
  VoiceMode != Inactive.
* VoiceSettingsPage.UpdateModelStatus queries the file system
  via WhisperModelManager directly so it works before
  NodeService finishes lazy-init of VoiceService.
* VoiceService.InitializeAsync fires DiagnosticMessage events
  around silent VAD/Whisper auto-downloads.

Tests: Shared 1271 / Tray 462 (default-provider asserts updated).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
#6-#8, Low #9)

High #2: Reuse a singleton TextToSpeechService for chat replies
* App.SpeakResponseAsync now goes through NodeService.TextToSpeech
  (a new public accessor on the existing _textToSpeechService field)
  instead of constructing a fresh service per call. Cached Piper
  client is reused across replies; the service-internal _playbackGate
  + _activePlayer now actually serialize back-to-back replies, and
  Interrupt=true takes effect.

High #3: Per-provider VoiceId routing
* New TtsWindowsVoiceId setting (round-tripped via SettingsManager
  + SettingsData; SettingsRoundTripTests assert it).
* SpeakResponseAsync no longer passes _settings.TtsElevenLabsVoiceId
  as a generic VoiceId; the per-provider Speak* paths each look up
  their own setting (TtsPiperVoiceId / TtsWindowsVoiceId /
  TtsElevenLabsVoiceId).
* SpeakWithWindowsAsync falls back to TtsWindowsVoiceId when
  args.VoiceId is blank.
* VoiceSettingsPage.OnWindowsVoiceChanged writes TtsWindowsVoiceId
  (was overwriting TtsElevenLabsVoiceId, a real cross-provider bug).

High #4: stt.listen returns a complete utterance, not the first segment
* New AudioPipeline.UtteranceTranscribed event fires once per silence-
  bounded utterance with all Whisper segments aggregated and an
  immutable Segments snapshot.
* VoiceService bubbles it as UtteranceCompleted.
* ListenOnceAsync subscribes to UtteranceCompleted (drops the
  per-fragment accumulator) so multi-segment utterances no longer
  return truncated text.

High #5: Voice Overlay submits one chat message per utterance
* OnTranscriptionReceived keeps the per-fragment streaming bubble
  update; chat submission moved to a new OnUtteranceCompleted
  handler so the gateway sees one message per spoken utterance.

Medium #6: Per-asset cancellation tokens in VoiceSettingsPage
* Split _downloadCts into _whisperDownloadCts and _piperDownloadCts
  so starting a Piper download no longer cancels an in-flight
  Whisper download (and vice versa).

Medium #7: Preflight tar.exe before Piper download
* PiperVoiceManager.EnsureExtractorAvailable runs a fast
  `tar --version` check before any network I/O. Downlevel Windows
  users now get a clear actionable error instead of a wasted ~50-150
  MB download that would later fail at extraction.

Medium #8: Refresh stale MCP tool descriptions
* stt.transcribe / stt.listen / stt.status now describe the single
  Whisper engine surface (no preferredEngine / effectiveEngine /
  engineFallbackReason); stt.listen description explicitly notes
  the result is the full silence-bounded utterance.
* tts.speak description includes `piper` in the provider list and
  notes the fresh-install default.
* Updated McpToolBridgeTests assertion for the new shape.

Low #9: Per-asset single-flight in download managers
* Both WhisperModelManager and PiperVoiceManager wrap their
  Download*Async in a static ConcurrentDictionary<string,Task> keyed
  on the canonical asset ID. Concurrent calls for the same asset
  await the same in-flight Task instead of racing on the same .tmp
  file. Failed downloads remove themselves from the table so a fresh
  retry isn't blocked.

Tests: Shared 1271 / Tray 462. Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical (rubber-duck #1): fail-closed integrity check before install.

* New `Sha256` field on `WhisperModelInfo` and `PiperVoiceInfo`.
* All 9 catalog entries (3 Whisper models + 6 Piper voices) carry a
  pinned lowercase-hex SHA-256, captured against the live HuggingFace
  and sherpa-onnx GitHub releases on 2026-05-05.
* Download core methods now:
    1. Refuse outright if the catalog entry has no pinned hash
       (`InvalidOperationException`).
    2. Compute SHA-256 of the temp file BEFORE the atomic rename
       (Whisper) or BEFORE the tar extraction (Piper).
    3. On mismatch, throw `System.Security.SecurityException`,
       delete the temp file, and let the catch block tear down any
       half-installed directory. Sanitized message — does NOT echo
       the actual hash (no confirmation oracle).
* New `AssetHashPinningTests` enforces that every catalog entry has
  a 64-hex-char SHA-256 and an https URL — future additions that
  forget the hash now break the build.
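
The fail-closed sequence above (refuse without a pinned hash, hash the temp file, delete on mismatch, rename only on match) can be sketched in Python. This is a simplified illustration, not the C# in the download managers; `install_verified` and `IntegrityError` are hypothetical names standing in for the real methods and for `SecurityException`.

```python
import hashlib
import os


class IntegrityError(Exception):
    """Raised when a downloaded asset does not match its pinned hash."""


def install_verified(tmp_path, final_path, pinned_sha256):
    if not pinned_sha256:
        # Fail closed: an entry without a pinned hash is never installed.
        raise ValueError("catalog entry has no pinned SHA-256")

    digest = hashlib.sha256()
    with open(tmp_path, "rb") as f:
        # Hash in 1 MiB chunks so large models are not read into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    if digest.hexdigest() != pinned_sha256.lower():
        os.remove(tmp_path)
        # Deliberately generic: the computed hash is not echoed back.
        raise IntegrityError("downloaded asset failed integrity verification")

    os.replace(tmp_path, final_path)  # atomic when both paths share a volume
```

The rename happens only after the hash check passes, so a partially verified file never appears under its final name.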

Audio_FollowUps.md §2 updated:
* Status block at the top documents what landed today.
* Pre-public-release TODO list trimmed to: independent re-verification
  of the pinned hashes, on-load verification (not just on download),
  and a future signed-manifest format so updates don't require a tray
  rebuild. The original detailed design notes are preserved as the
  spec for that next iteration.

Tests: Shared 1275 / Tray 462. Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OnDownloadClick previously routed through VoiceService.DownloadModelAsync,
which silently no-op'd whenever _voiceService was null — and _voiceService
is only constructed inside NodeService.RegisterCapabilities (which runs on
Connect / StartLocalOnly, and only when NodeSttEnabled is true). A user
who toggled STT on without reconnecting, or who hadn't enabled MCP-only
mode, would tap Download and see nothing happen.

Construct a WhisperModelManager directly from
SettingsManager.SettingsDirectoryPath and download via that. Same
on-disk result as the VoiceService auto-download path, but available
regardless of NodeService lifecycle state. Same SHA-256 verification
applies (the manager owns it).

Tests: Tray 462 (no change in surface).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three coordinated tweaks based on the morning UX review.

1. Right-click on the tray icon now opens BOTH the popup quick-menu
   AND the companion app window. ShowHub gained an `activate` flag;
   for this code path we call ShowHub(activate:false) so the Hub
   surfaces via AppWindow.Show(activateWindow:false) and the popup
   (which is light-dismiss) stays the foreground window. Without this
   the Hub's Activate() would steal focus and dismiss the popup.

2. NavigationView pane mode is now expanded by default and remembered
   across sessions. PaneDisplayMode flipped from Auto to Left, and a
   new HubNavPaneOpen setting (default true) is round-tripped via
   SettingsManager / SettingsData. PaneOpening / PaneClosing handlers
   on HubWindow persist the user's last toggle. SettingsRoundTripTests
   covers the new field.

3. Renamed the mascot from 'Molty' to 'Companion' across the surface:

   User-facing strings:
   * VoiceOverlayWindow Title and header text → `Companion Voice`.
   * VoiceSettingsPage section header → `🔊 Companion Voice`.
   * Both Preview-button sample texts (Windows + Piper) now say
     `Hello! This is your Companion speaking.`.

   Code identifiers (HomePage):
   * MoltyRing → CompanionRing
   * MoltyProgressRing → CompanionProgressRing
   * UpdateMoltyRing → UpdateCompanionRing
   * Comment `<!-- Molty mascot -->` → `<!-- Companion mascot -->`

   `grep -i molty src/` returns zero hits.

Tests: Shared 1275 / Tray 462. Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n pane default

Two findings from the second rubber-duck pass.

Medium: ShowHub(activate:false) was a no-op when the Hub was previously
minimized. AppWindow.Show(activateWindow:false) does not restore minimized
windows. Detect OverlappedPresenter.State == Minimized first and
Restore(activateWindow:false) so the window actually surfaces behind the
popup, then call Show.

Low: regression test for HubNavPaneOpen migration. Settings files written
before this field existed must deserialize to true (NavView expanded).
Added an explicit FromJson("{}") assertion plus pinned the field's
default in MissingFields_UseDefaults and BackwardCompatibility_OldSettings*
so a future refactor can't silently flip new installs to a collapsed pane.

Tests: Tray 463 (one new). Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Windows shell canonicalizes openclaw://send?args=... to
openclaw://send/?args=... before handing it to us. The previous
implementation called TrimEnd('/') on the WHOLE remainder before
splitting off the query, so the trailing slash before the '?' was
never trimmed and Path came out as 'send/' instead of 'send'.

Trim the slash from the path SEGMENT after splitting off the query.

Three new theory cases pin the regression for send / agent / activity
deep links — categories that all carry query parameters in the
launcher canonicalized form. Existing TrailingSlash test (no query)
still passes with the new placement.
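
The ordering fix can be illustrated with a small Python sketch. It is not the C# parser from the PR; `parse_deep_link` is a hypothetical name, and the return shape is an assumption made for the example.

```python
def parse_deep_link(uri, scheme="openclaw://"):
    """Split an openclaw:// URI into (path, query)."""
    if not uri.startswith(scheme):
        raise ValueError("not an openclaw deep link")
    remainder = uri[len(scheme):]
    # Split off the query first, then trim the slash from the path
    # segment only. Trimming the whole remainder would leave 'send/'
    # untouched because the string ends with the query, not the slash.
    path, _, query = remainder.partition("?")
    return path.rstrip("/"), query
```

With this ordering, `openclaw://send?args=hi` and the shell-canonicalized `openclaw://send/?args=hi` both yield the path `send`.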

Credit to the parallel Copilot session for catching this.

Tests: Shared 1275 / Tray 466 (3 new). Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion queue; normalize BCP-47

Three coordinated STT pipeline fixes from the latest review.

#1 (High): Buffered speech was discarded on Stop/Timeout
* AudioPipeline.StopAsync used to call _cts.Cancel() BEFORE flushing,
  and the flush passed the canceled token straight into Whisper.net
  (which honored cancel and dropped the final utterance). Reordered
  to: stop capture -> flush with a fresh CancellationToken.None ->
  cancel _cts -> cleanup. Adds an overrideToken parameter on
  TranscribeSamplesAsync so the flush can opt out of the pipeline cancel.
* VoiceService.ListenOnceAsync used to throw TimeoutException as soon
  as the linkedCts fired, even when speech was actively buffered. It
  now waits on Task.WhenAny(utteranceTcs, timeoutSentinel), and on
  timeout it gives pipeline.StopAsync up to 2 s to flush — only then
  reports timeout. stt.transcribe inherits this fix.

#3 (Medium): Whisper.net language mismatch
* SpeechToTextService.NormalizeForWhisper trims BCP-47 input down to
  the 2-letter ISO 639-1 primary subtag that Whisper.net's WithLanguage
  call expects. `en-US` -> `en`, `zh-Hans-CN` -> `zh`, garbage
  -> `auto`. Capability validator + MCP docs continue to advertise
  the wider BCP-47 shape (no breaking change for callers); this fixes
  the gap to Whisper.
* Result.Language now echoes the normalized form so the caller sees
  what Whisper actually used.
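
The normalization rule can be sketched in Python as below. This is an approximation of the behaviour described, not the C# NormalizeForWhisper itself: it assumes hyphen-separated subtags and does not check that the two letters are a real ISO 639-1 code.

```python
import re

# Two-letter primary subtag, optionally followed by hyphenated subtags.
_PRIMARY_SUBTAG = re.compile(r"^([A-Za-z]{2})(?:-[A-Za-z0-9]+)*$")


def normalize_for_whisper(tag):
    """Reduce a BCP-47-style tag to its 2-letter primary subtag, else 'auto'."""
    if not tag:
        return "auto"
    match = _PRIMARY_SUBTAG.match(tag.strip())
    return match.group(1).lower() if match else "auto"
```

Anything that does not start with exactly two letters, including the literal `auto`, falls through to `auto`.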

#4 (Medium): Unbounded transcription queue
* Each VAD-bounded segment fired `_ = Task.Run(TranscribeSamplesAsync)`
  with no in-flight cap. SpeechToTextService gates Whisper work but
  callbacks accumulate behind the gate, each holding a sample buffer.
  Now bounded with Interlocked counter + MaxConcurrentTranscriptions
  cap (2). Excess segments are dropped with a clear DiagnosticMessage
  rather than silently queued — better UX than getting stale utterances
  arriving minutes after the user stopped speaking.
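
A drop-instead-of-queue gate like the one described can be sketched in Python. This is illustrative only: `TranscriptionGate` is a hypothetical name, and a lock-protected counter stands in for the Interlocked counter used in the C# code.

```python
import threading


class TranscriptionGate:
    """Admit at most max_in_flight jobs; reject the rest instead of queueing."""

    def __init__(self, max_in_flight=2):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self):
        with self._lock:
            if self._in_flight >= self._max:
                return False  # caller drops the segment and logs a diagnostic
            self._in_flight += 1
            return True

    def leave(self):
        # Call from a finally block so a failed job still frees its slot.
        with self._lock:
            self._in_flight -= 1
```

A caller would check `try_enter()` before scheduling work and call `leave()` when the transcription finishes, so memory held by waiting sample buffers stays bounded.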

Tests: Shared 1291 / Tray 466 (16 new normalizer tests). Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes the inconsistency the buddy review flagged: Whisper and Piper
download paths are hash-pinned and fail closed on mismatch, but the
Silero VAD download path (VoiceService.DownloadVadModelAsync) was just
HTTPS + system trust chain — no integrity verification before File.Move
into the models directory.

* New SileroVadModelManifest holds the URL, SHA-256, and approximate
  size as public constants in OpenClaw.Shared.Audio. Hash captured from
  the upstream raw URL on 2026-05-05; same pre-public-release re-verify
  TODO as the other manifests (Audio_FollowUps.md §2).
* DownloadVadModelAsync now hashes the temp file with SHA-256 BEFORE
  the atomic rename. On mismatch it throws SecurityException and the
  catch block tears down the .tmp file. Sanitized error — does not echo
  the actual hash (no confirmation oracle).
* AssetHashPinningTests gains a SileroVadModel_HasPinnedSha256 case so
  a future renaming/forgetting of the constant trips the build.

Tests: Shared 1292 (1 new). Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The SkillMdDriftTests pinning test was failing — 14 commands present in
McpToolBridge.KnownCommands had no matching ### heading in skill.md:

* The 4 new entries this branch added: stt.transcribe, stt.listen,
  stt.status, tts.speak.
* 10 pre-existing app.* entries (app.navigate, app.status, app.sessions,
  app.agents, app.nodes, app.config.get, app.settings.get, app.settings.set,
  app.menu, app.search) that already drifted before the audio work.

Fixing them all in one pass so the test goes green and stays green.
Each new section follows the existing format: H3 heading, brief
description, JSON-shaped param block, return shape.

Privacy + provider notes added for stt.* and tts.* so agent readers
understand: stt.* is local Whisper only and requires NodeSttEnabled,
tts.* defaults to Piper (local neural).

Tests: SkillMdDriftTests now passes. Shared 1292 / Tray 466. Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two manual-test follow-ups on the Voice Settings page.

* Throttle progress UI updates to >=150 ms intervals on both the
  Whisper and Piper download paths. The streaming downloads emit a
  progress callback every ~80 KB chunk, so a 466 MB model produces
  ~5,800 dispatcher hops (Progress<T> + DispatcherQueue.TryEnqueue
  doubled the load). The dispatcher queue saturated and the app
  appeared frozen mid-download. Coalescing limits the rate to a few
  updates per second, with a forced final 100% report so the user
  never sees a stuck "99%" right before "Model ready". Also dropped
  the redundant inner DispatcherQueue.TryEnqueue (Progress<T> already
  marshals to the captured UI SyncContext).

* Re-download button now actually re-downloads. WhisperModelManager
  short-circuits DownloadModelAsync when the file is already present,
  so OnDownloadClick now calls the existing DeleteModel(modelName)
  first when the file is on disk. Net effect: delete -> fresh fetch
  -> SHA-256 re-verify -> atomic rename. Same on-disk result.
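
The progress throttling from the first bullet can be sketched in Python. This is a simplified illustration, not the code-behind in VoiceSettingsPage; `ThrottledProgress` and its injectable `clock` parameter are assumptions made so the behaviour can be exercised without real time passing.

```python
import time


class ThrottledProgress:
    """Forward progress reports at most once per min_interval seconds."""

    def __init__(self, report, min_interval=0.15, clock=time.monotonic):
        self._report = report
        self._min_interval = min_interval
        self._clock = clock
        self._last = None  # time of the last forwarded report

    def update(self, fraction):
        now = self._clock()
        is_final = fraction >= 1.0
        due = self._last is None or now - self._last >= self._min_interval
        # Always let the final report through so the UI never sticks at 99%.
        if is_final or due:
            self._last = now
            self._report(fraction)
```

Intermediate reports that arrive inside the interval are simply discarded; only the latest state matters for a progress bar, so nothing needs to be queued.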

Tests: Shared 1292 / Tray 466 (no test surface change). Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes the buddy review's last finding. The new voice UI was English-only
hard-coded in both XAML and code-behind, while the rest of the tray (and
the freshly redone CapabilitiesPage STT/TTS card) reads from .resw via
x:Uid + LocalizationHelper.GetString.

Coverage:

* VoiceSettingsPage.xaml — every user-facing TextBlock / Header /
  ComboBoxItem / Button content / placeholder gets x:Uid (page title,
  card headers, STT toggle, model + language combos, voice chat
  controls, all 3 TTS provider items, Piper download/delete/preview,
  ElevenLabs slot, privacy note).
* VoiceOverlayWindow.xaml — header text, status badge, empty state,
  status text, start/stop label, mute + settings tooltips.
* VoiceSettingsPage.xaml.cs and VoiceOverlayWindow.xaml.cs — runtime
  status messages (download progress, model-ready, preview failures,
  pipeline state transitions, mute/listen state) now read from
  LocalizationHelper.GetString. Format strings use Lf(...) so {0}/{1}
  placeholders are honored under CurrentCulture.

Translations pinned for en-us / fr-fr / nl-nl / zh-cn / zh-tw —
~95 new keys per locale (475 total resw entries). Translations are
best-effort; native speakers should review pre-public-release.

LocalizationValidationTests:
* AllLocales_HaveExactlySameKeysAsEnUs ✅
* Resources_AreTranslatedAllOrNoneAcrossNonEnglishLocales ✅
  (added VoiceSettingsPage_StatusError + ElevenLabs sample-ID
  placeholder keys to the InvariantOrDeferred list — they're
  intentionally identical across locales)

Build green. Shared 1292 / Tray 466.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes the buddy review's stt.transcribe finding. The handler used to
adapt SttTranscribeArgs into SttListenArgs and call ListenOnceAsync,
which inherited VAD-based silence shutdown — so a 5 000 ms request
would return after 1 s if the user stopped speaking. The advertised
contract (skill.md, McpToolBridge) promises bounded fixed-duration
capture, not silence-bounded.

Implementation:

* AudioPipeline.CaptureFixedDurationAsync — new top-level method that
  starts WASAPI capture, accumulates every resampled+gain-applied
  16 kHz mono sample into _fixedCaptureBuffer for exactly durationMs
  (or until cancellation), then returns the buffer. OnDataAvailable
  branches on a new _fixedCaptureMode flag and bypasses the VAD path
  entirely in this mode.

* VoiceService.TranscribeFixedDurationAsync — wraps
  CaptureFixedDurationAsync + SpeechToTextService.TranscribeAsync and
  returns SttTranscribeResult directly. Empty buffer (cancelled
  immediately or no audio) returns transcribed=false rather than
  throwing.

* NodeService.OnSttTranscribeAsync now calls TranscribeFixedDurationAsync
  instead of bouncing through ListenOnceAsync.

stt.listen behavior is unchanged.

Tests: Shared 1292 / Tray 466. Build green. (No new tests — exercising
this path requires a real WASAPI device. The capture/transcribe
boundary is tightly coupled to NAudio + Whisper.net, which were the
test seams already deferred to Audio_FollowUps.md §1.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RBrid and others added 6 commits May 6, 2026 14:55
Adds x:Uid="VoiceOverlayWindow" on the WindowEx root, plus the
VoiceOverlayWindow_winexWindowEx_2.Title key in all 5 locale resw
files. Listed in InvariantOrDeferredResourceKeys so the parity test
allows the title to read identical "Companion Voice" in every locale —
matches the existing convention for ChatWindow / HubWindow /
CanvasWindow / TrayMenuWindow.

The visible header text and runtime status messages were already
localized; this just closes the gap on the actual OS-level window
title (alt-tab, taskbar).

Build green. Shared 1292 / Tray 466.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The MCP / wire-side stt.transcribe and stt.listen entry points
short-circuited with "Whisper model not downloaded" whenever
_voiceService.IsWhisperReady was false. That property reads
SpeechToTextService.IsModelLoaded — which is true only after the
model has been LOADED INTO MEMORY by EnsureInitializedAsync.

On a freshly-launched tray (or any state where the user hasn't
opened the Voice Overlay yet), the .bin file is on disk but the
model isn't loaded. The pre-flight check rejected the call before
the inner TranscribeFixedDurationAsync / ListenOnceAsync could run
EnsureInitializedAsync to load it lazily.

Net result: every first MCP STT call after launch failed with a
misleading "model not downloaded" error, even though the file was
right there.

Switch the pre-flight check to IsModelDownloaded (file on disk).
The lazy load happens inside the inner call as it always did.

Verified end-to-end via the local MCP HTTP server: tools/call
stt.transcribe with maxDurationMs:5000 returned a real transcript
("Hello, how is everybody doing?") on first invocation after a
fresh tray launch.

Tests: Shared 1292 / Tray 466. Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… cleanup

Three buckets of fixes from this afternoon's manual testing pass.

i18n: dot-suffix lookup bug in code-behind
* LocalizationHelper.GetString(X.Text) returns the raw key when the
  resource name has a dot — XAML x:Uid resolution interprets the
  trailing .Text as a property suffix, but direct programmatic
  lookup doesn't, so the resource map can't find it. Six call sites
  were displaying literal keys like "VoiceOverlayWindow_StatusBadge.Text"
  in the running UI.
* Added six dot-free code-only keys (BadgeReady, StatusReadyMessage,
  ButtonStartListening, ButtonDownloadModel, PiperButtonDownloadVoice,
  PreviewVoiceButtonContent) translated across all 5 locales, and
  swapped the call sites in VoiceOverlayWindow.xaml.cs and
  VoiceSettingsPage.xaml.cs to use them.

audio: Voice Overlay "Failed to encode audio features" on Stop
* Mid-encode interruptions from Whisper.net don't surface as a clean
  OperationCanceledException — they bubble up as misleading errors
  like "Failed to encode audio features." Pressing Stop while a
  transcription Task.Run was in-flight produced exactly that toast.
* AudioPipeline.StopAsync now drains in-flight transcriptions for up
  to 3 s before cancelling _cts, so the user's last utterance has a
  chance to actually complete.
* TranscribeSamplesAsync's catch block suppresses errors when
  _isStopping or the cancel token is set; those are expected
  shutdown-induced interruptions, not user-visible failures. Also
  sanitized the diagnostic toast (no raw ex.Message).

Capabilities page rework
* Removed the redundant Language TextBox + label + help + status
  block. The Voice & Audio page already owns the language picker via
  a curated ComboBox (the textbox accepted any string and silently
  failed validation on garbage like "foobar", which was a paper cut).
* "More voice settings…" hyperlink stays as the deep-link.
* Speech-to-Text card hint now reads file presence directly via a
  fresh WhisperModelManager rooted at SettingsManager.SettingsDirectoryPath
  (instead of hub.VoiceServiceInstance?.IsWhisperReady, which is null
  on a freshly-launched tray and reads "loaded into memory" rather
  than "file on disk"). Same trick used by VoiceSettingsPage's
  UpdateModelStatus.
* Updated the Capabilities help text in all 5 locales to say "Two-letter
  ISO 639-1 code (e.g. en, fr, ja)" instead of "BCP-47 tag (e.g. en-US,
  fr-FR, ja-JP)" — matches what NormalizeForWhisper actually accepts
  (region is stripped). (Help text is now only consumed by the language
  picker on Voice & Audio, but the resw key was renamed/repurposed to
  match.)
* Dropped the now-orphan SttLanguageLabel/TextBox/Help resw entries
  from all 5 locales, the CapabilitiesPageLocalizationCoverageTests
  catalog, and the LocalizationValidationTests invariant list.

Tests: Shared 1292 / Tray 460 (6 fewer cases — the
CapabilitiesPageLocalizationCoverageTests theory shrank by 3 keys ×
2 non-en locales). Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Title key in resw was VoiceOverlayWindow_winexWindowEx_2.Title, but
the root x:Uid was just "VoiceOverlayWindow" — so WinUI's auto-derived
property-suffix lookup (Window-typed elements get the _winexWindowEx_2
suffix) couldn't find a match and the title fell back to the XAML
default. Aligned the x:Uid to "VoiceOverlayWindow_winexWindowEx_2",
matching the existing pattern used by ChatWindow / HubWindow /
CanvasWindow / TrayMenuWindow.

(Also: the buddy's parallel "trailing whitespace in resw" finding is
already addressed by subsequent commits — XmlDocument.Save normalized
the formatting; `Get-Content | -match '\s+$'` returns 0 on every
locale today.)

Build green. Tray 460.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The voice settings handlers and the Voice Overlay's start/stop catch
were formatting raw exception messages straight into user-facing UI
status text (and from there potentially into screenshots, error toasts,
support bundles, the activity stream). ex.Message can carry URLs,
local paths, hash digests, HTTP body fragments, or other implementation
detail that the user shouldn't see.

Seven call sites updated:
* VoiceSettingsPage.xaml.cs — Whisper download error, Piper download
  failure, Piper delete failure, Piper preview failure, Windows voice
  enumeration failure, Windows preview failure (6 sites).
* VoiceOverlayWindow.xaml.cs — overlay start/stop catch (1 site).

For each: full ex (message + type + stack) is logged via Logger.Error
or _logger.Error; the UI shows a generic localized message that ends in
"(see Debug log)" so users know where the detail lives.

Resw side:
* Six error-string keys in all 5 locales had their {0} format
  placeholders replaced with self-contained generic messages
  (translated, not just placeholder-stripped).
* VoiceSettingsPage_StatusError dropped from
  LocalizationValidationTests.InvariantOrDeferredResourceKeys — it
  used to be flagged invariant because the placeholder made every
  locale identical; with real translations it now varies and shouldn't
  be exempt.
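
The pattern applied at each call site (full exception detail to the log, a generic message to the UI) can be sketched in Python. This is an illustration, not the C# handlers; the logger name, the `GENERIC_ERROR` wording, and `run_with_sanitized_errors` are placeholders invented for the example.

```python
import logging

logger = logging.getLogger("voice")  # hypothetical logger name

# Placeholder wording; the real strings are localized resw entries.
GENERIC_ERROR = "Something went wrong (see Debug log)"


def run_with_sanitized_errors(action):
    """Return (ok, user_message); exception detail goes to the log only."""
    try:
        action()
        return True, ""
    except Exception:
        # logger.exception records the message, type, and stack trace.
        logger.exception("voice action failed")
        return False, GENERIC_ERROR
```

The user-facing string never interpolates the exception, so paths, URLs, or digests carried in the message cannot reach status text or screenshots.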

Tests: Tray 460. Build green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These two commands were already wired up in NodeService and advertised by
SttCapability, but the gateway's Windows platform-default policy hides any
command that isn't either platform-default (system.*, browser.proxy) or in
the node's DangerousCommands opt-in list. Only stt.transcribe was in that
list, so chat agents only saw stt.transcribe even when NodeSttEnabled was
on.

Adding stt.listen and stt.status lets them get the same explicit gateway
opt-in treatment as stt.transcribe, so once the operator allows them in
gateway.nodes.allowCommands they flow through to the agent's tools list.

Verified end-to-end: after re-pair, chat reports the full 24-command list
including stt.listen, stt.status, and tts.speak.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RBrid marked this pull request as ready for review May 7, 2026 01:29
github-actions Bot added a commit that referenced this pull request May 7, 2026
…ponses

Exception messages in node capability error responses can contain sensitive
data: file paths, device names, API key prefixes, environment variable values,
or command output fragments. These responses flow back through the gateway to
the chat agent and may be persisted in recent activity / support bundles.

The new SttCapability (PR #288) explicitly documents the privacy invariant:
"never echo raw exception text into the response; full detail stays in the
local log only". This commit applies the same rule to the existing capabilities
that were not yet following it.

Affected capabilities:
- SystemCapability: system.run execution errors (can expose env vars/output),
  V2 approval handler exceptions, execApprovals.set policy update errors
- CameraCapability: camera.list / camera.snap / camera.clip errors
  (can expose device names and paths)
- ScreenCapability: screen.snapshot / screen.record errors (can expose paths)
- LocationCapability: location.get errors (can expose system paths)

Not changed:
- TtsCapability: already fixed by the in-flight PR #288
- BrowserProxyCapability: intentionally includes connectivity context
  (port/host reachability) which is user-facing diagnostic data
- CanvasCapability: mixed — some errors are intentional UI feedback

Exception detail is preserved in Logger.Error calls that were already present,
so diagnostics remain available in local logs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

🤖 This is an automated response from Repo Assist.

Great contribution, @RBrid! This is substantial work and the architecture looks well-considered. A few observations after reviewing the code:

What looks good

  • SHA-256 pinning on every asset with a clear policy of refusing downloads with missing hashes — good security posture.
  • Privacy invariants are clearly documented: SttCapability never echoes exception messages into responses, stt.status deliberately carries no PII. The same pattern was retroactively applied to TtsCapability — this is exactly right.
  • Single-flight download using ConcurrentDictionary<string, Task> — nice approach to prevent race conditions.
  • Credits @NichUK's PR #120 (Add first-pass Windows Voice Mode), which is appreciated.

One implementation concern

Both WhisperModelManager.DownloadModelCoreAsync and PiperVoiceManager.DownloadVoiceCoreAsync create new HttpClient() inside a using block (per-call). This is the classic socket exhaustion anti-pattern — each new HttpClient() creates a new connection pool; the underlying sockets are held in TIME_WAIT for up to 4 minutes after disposal. For large model downloads (80–500 MB) that rarely happen this is low-risk in practice, but it sets a precedent. Consider using a static readonly HttpClient (or IHttpClientFactory if the tray already uses one) shared across download calls.

Dependency surface

The PR adds Whisper.net, NAudio, Sherpa-ONNX, Silero VAD ONNX — several large native dependencies. This project has historically prioritized a minimal dependency footprint. Maintainers will want to think through:

  • Distribution size implications (native ONNX runtimes are sizeable)
  • Long-term maintenance burden of pinned model hashes
  • Whether default-off (NodeSttEnabled) is sufficient gating

The Audio_FollowUps.md §2 TODO for re-verifying Piper SHA-256 hashes before GA is important — the current implementation does verify hashes (the comment in the summary doc is outdated), but the hashes themselves need periodic re-verification.

Note on TtsCapability conflict

A separate PR (from this run) also fixes TtsCapability's ex.Message leak, which overlaps with your change to the same file. If both are in-flight simultaneously, there will be a merge conflict in TtsCapability.cs. The changes are compatible — one or the other should be merged first, then the other rebased.

Overall this is a strong PR. The local-first privacy design and careful security review in the asset management code are exactly the right approach for voice capabilities.

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@97143ac59cb3a13ef2a77581f929f06719c7402a
