Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
f2568de
Add Windows STT transcribe capability
RBrid May 2, 2026
0231bdb
Merge origin/master into STT capability branch
RBrid May 2, 2026
54b01e5
stt: privacy hardening, localization, and test coverage
RBrid May 4, 2026
0f50220
Remove fake/sample data from 6 UI pages
ranjeshj May 5, 2026
e77bd55
feat: add voice/audio support with local Whisper STT
ranjeshj May 5, 2026
72d0d9d
Merge master into user/rbrid/stt-capability
RBrid May 5, 2026
702ebe9
stt/tts: refill settings UI gaps after the unified Hub merge
RBrid May 5, 2026
d3ea24d
Merge remote-tracking branch 'ranjeshj/user/ranjeshj/audio' into user…
RBrid May 5, 2026
1e44186
Merge user/rbrid/stt-capability into user/rbrid/audio-support1
RBrid May 5, 2026
3a6db5f
audio: extend privacy class + tests for stt.listen and stt.status
RBrid May 5, 2026
8d167ec
audio: STM, locale audit, and coverage tests for STT/TTS card
RBrid May 5, 2026
3a83d26
audio: extract SttEngineSelector + tests for engine selection rules
RBrid May 5, 2026
c35b5b7
audio: security review fixes from STM walkthrough
RBrid May 5, 2026
ff11467
audio: drop WinRT SpeechRecognizer + SAPI fallback; Whisper-only
RBrid May 5, 2026
a7cd07a
Merge remote-tracking branch 'origin/master' into user/rbrid/audio-su…
RBrid May 5, 2026
5ed6db3
audio: add Piper TTS provider via Sherpa-ONNX
RBrid May 6, 2026
8ccdcb6
audio: voice download UI, Piper-as-default, first-listen polish
RBrid May 6, 2026
5f8210a
audio: address rubber-duck review (Highs #2-#5, Mediums #6-#8, Low #9)
RBrid May 6, 2026
f8facfb
audio: SHA-256 verification of Whisper models and Piper voices
RBrid May 6, 2026
945c1fb
audio: Download Model button works without VoiceService
RBrid May 6, 2026
361b409
ux: Companion rename, expanded NavView memory, right-click opens Hub
RBrid May 6, 2026
facda94
Merge remote-tracking branch 'origin/master' into user/rbrid/audio-su…
RBrid May 6, 2026
f569c20
ux: rubber-duck #2 — restore minimized Hub on right-click; pin pane d…
RBrid May 6, 2026
259f8b0
fix: deep-link parser strips trailing slash before query (#-)
RBrid May 6, 2026
c49ca1c
audio: don't drop final utterance on stop or timeout; bound transcrip…
RBrid May 6, 2026
1bf93a7
audio: SHA-256 verification of Silero VAD model
RBrid May 6, 2026
90aa5f5
docs: bring skill.md back in sync with capability registry
RBrid May 6, 2026
7b97b9d
ux: throttle Whisper/Piper download progress UI; wire Re-download button
RBrid May 6, 2026
90fa065
i18n: localize VoiceSettingsPage and VoiceOverlayWindow surfaces
RBrid May 6, 2026
c682a87
audio: stt.transcribe is now a true fixed-duration capture
RBrid May 6, 2026
fcccf31
i18n: localize VoiceOverlayWindow root window title
RBrid May 6, 2026
32497a7
audio: gate stt.* on file presence, not in-memory load state
RBrid May 6, 2026
6aae03a
ux: voice UI testing round — localization, shutdown, and Capabilities…
RBrid May 6, 2026
1e39ba4
i18n: align VoiceOverlayWindow root x:Uid with WindowEx convention
RBrid May 6, 2026
3e89b72
privacy: stop leaking ex.Message into voice UI status text
RBrid May 6, 2026
597a01d
audio: include stt.listen + stt.status in DangerousCommands
RBrid May 7, 2026
cd3f969
fix(audio): isolate shared download cancellation
shanselman May 7, 2026
14942f9
fix(tray): keep right-click to context menu only
shanselman May 7, 2026
032baa1
fix(voice): allow local overlay without node pairing
shanselman May 7, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -177,6 +177,7 @@ When Node Mode is enabled in Settings, your Windows PC becomes a **node** that t
| **Canvas** | `canvas.present`, `canvas.hide`, `canvas.navigate`, `canvas.eval`, `canvas.snapshot`, `canvas.a2ui.push`, `canvas.a2ui.pushJSONL`, `canvas.a2ui.reset` | Display and control a WebView2 window |
| **Screen** | `screen.snapshot`, `screen.record` | Capture screenshots and fixed-duration MP4 screen recordings |
| **Camera** | `camera.list`, `camera.snap`, `camera.clip` | Enumerate cameras and capture still photos or short video clips |
| **Speech-to-text** | `stt.transcribe` | Capture audio from the default microphone for a bounded duration and return transcribed text. Default-off; opt-in via Settings. When enabled, advertised to both gateway callers (subject to gateway allowlist) and local MCP clients (subject to bearer token). |
| **Location** | `location.get` | Return Windows geolocation when permission is available |
| **Device** | `device.info`, `device.status` | Return Windows host/app metadata and lightweight status |
| **Text-to-speech** | `tts.speak` | Speak text aloud through Windows speech synthesis, or ElevenLabs when configured |
Expand Down
35 changes: 35 additions & 0 deletions docs/WINDOWS_NODE_TESTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ These features need the gateway to send `node.invoke` commands:
| `location.get` | Get Windows location | Uses Windows location permission/settings |
| `device.info` / `device.status` | Device metadata/status | Returns host/app/locale plus battery/storage/network/uptime payloads |
| `browser.proxy` | Proxy browser-control host requests | Requires Browser proxy bridge enabled, a compatible browser-control host listening on gateway port + 2, and matching browser-control auth |
| `stt.transcribe` | Speech-to-text from default microphone | Default-off; bounded `maxDurationMs` ≤ 30000; concatenates phrases until duration elapses; requires explicit gateway allowlist |
| `tts.speak` | Speak text aloud | Requires Text-to-speech playback enabled in Settings; gateway mode also requires `tts.speak` in `gateway.nodes.allowCommands` |

## Capabilities Advertised
Expand Down Expand Up @@ -112,6 +113,40 @@ When the node connects, it advertises these capabilities:
- If you see "Camera access blocked", enable camera access for desktop apps in Windows Privacy settings
- Packaged MSIX builds will show the system consent prompt automatically

### `stt.transcribe` returns "Speech recognition failed" or "Internal Speech Error"
- Open Windows Settings → Privacy & security → Speech (`ms-settings:privacy-speech`)
- Turn **Online speech recognition** on. The Windows speech recognizer's default dictation grammar often fails without it, and Windows surfaces the resulting unmapped HRESULT as "Internal Speech Error"
- Open Windows Settings → Time & language → Language & region (`ms-settings:regionlanguage`), select your display language → Language options, and confirm **Speech** appears under Installed features (install it if not, ~50 MB; reboot or sign out/in afterward)
- Verify the recognizer end-to-end with `ms-settings:speech` → "Microphone" → **Get started** before re-trying `stt.transcribe`

### `stt.transcribe` returns "Microphone permission denied"
- Open Windows Settings → Privacy & security → Microphone
- Ensure **Microphone access** (top-level toggle) is on
- For **unpackaged** tray builds (the default `.\build.ps1` output): ensure **Let desktop apps access your microphone** is on. The tray exe will **not** appear as its own row — desktop-app access is granted as a group, not per-app
- For **packaged MSIX** tray builds: the tray appears as its own entry under "Let apps access your microphone" and must be individually enabled (the OS shows a consent prompt on first use)
- After changing permissions, re-pair the node so the gateway picks up the new advertised command

### `stt.transcribe` returns "Language pack 'X' is not installed"
- Open Windows Settings → Time & language → Language & region
- Add the requested display language and ensure the **Speech** optional feature is installed
- Restart the tray after installing the speech pack

### Manual STT validation
1. Enable Node Mode in Settings.
2. Enable **Speech-to-text (microphone)** in Settings → Node mode.
3. Append `stt.transcribe` to your existing gateway allowlist (do **not** copy a literal `...` — substitute the commands you already allow). For example, starting from the recommended Windows safe companion list:
```bash
openclaw config set gateway.nodes.allowCommands '["canvas.present","canvas.hide","canvas.navigate","canvas.eval","canvas.snapshot","canvas.a2ui.push","canvas.a2ui.pushJSONL","canvas.a2ui.reset","camera.list","location.get","screen.snapshot","device.info","device.status","system.execApprovals.get","system.execApprovals.set","stt.transcribe"]'
openclaw gateway restart
```
4. Re-pair or re-approve the node so the gateway refreshes its command snapshot.
5. Invoke and speak a short phrase:
```bash
openclaw nodes invoke --node <id> --command stt.transcribe \
--params '{"maxDurationMs":5000,"language":"en-US"}'
```
6. The Windows OS microphone indicator should appear during recognition. Confirm the response payload contains `transcribed:true` along with the recognized text.

## Remaining Work (Roadmap)

1. ~~**system.run + exec approvals**~~ ✅ Implemented
Expand Down
6 changes: 4 additions & 2 deletions docs/gateway-node-integration.md
Original file line number Diff line number Diff line change
Expand Up @@ -336,7 +336,7 @@ Recommended gateway defaults:
| Command bucket | Windows default? | Reason |
|----------------|------------------|--------|
| Safe declared companion commands: `canvas.*`, `camera.list`, `location.get`, `screen.snapshot`, `device.info`, `device.status` | Yes | Matches macOS parity and only applies when declared by the node |
| Dangerous/privacy-heavy commands: `camera.snap`, `camera.clip`, `screen.record`, write commands like `contacts.add` | No | Existing gateway model already requires explicit `gateway.nodes.allowCommands` |
| Dangerous/privacy-heavy commands: `camera.snap`, `camera.clip`, `screen.record`, `stt.transcribe`, write commands like `contacts.add` | No | Existing gateway model already requires explicit `gateway.nodes.allowCommands` |
| Exec commands: `system.run`, `system.run.prepare`, `system.which`, `system.notify`, `browser.proxy` | Yes | Existing Windows headless-host behavior |

Until the gateway expands Windows safe defaults, the practical local solution is:
Expand Down Expand Up @@ -364,6 +364,7 @@ Privacy-sensitive commands should stay out of the default safe list and should o
camera.snap
camera.clip
screen.record
stt.transcribe
```

After changing either `gateway.nodes.allowCommands` or `gateway.nodes.denyCommands`, re-approve or re-pair the Windows node. Approved device records may keep a snapshot of the commands that were visible at approval time, so a gateway restart alone may not refresh existing approvals.
Expand Down Expand Up @@ -424,6 +425,7 @@ Proposal:
- `camera.snap`
- `camera.clip`
- `screen.record`
- `stt.transcribe`
- write commands such as `contacts.add`, `calendar.add`, etc.

This does not grant capabilities to headless Windows hosts by itself. A command still has to pass both gates: the node must declare it in `commands`, and the gateway policy must allow it. Headless Windows node hosts that only declare `system.run` / `system.which` remain exec-only.
Expand All @@ -441,7 +443,7 @@ When shipping the Windows node, README/wiki should tell users:
> ```
> Then re-pair the node (`openclaw devices reject <old-id>` + re-approve).
>
> Add `camera.snap`, `camera.clip`, and `screen.record` only when you explicitly want to allow privacy-sensitive camera or screen capture.
> Add `camera.snap`, `camera.clip`, `screen.record`, and `stt.transcribe` only when you explicitly want to allow privacy-sensitive camera, screen, or microphone capture.
>
> The Windows tray Command Center (`openclaw://commandcenter`) surfaces these policy problems directly: it separates safe companion allowlist fixes from privacy-sensitive opt-ins and provides copyable repair text for safe fixes or pending pairing approval.

Expand Down
70 changes: 70 additions & 0 deletions src/OpenClaw.Shared/Audio/AudioModels.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
using System;
using System.Collections.Generic;

namespace OpenClaw.Shared.Audio;

/// <summary>
/// Result of a single speech-to-text transcription segment. One silence-bounded
/// utterance may produce several segments; see <see cref="UtteranceResult"/> for
/// the aggregated, whole-utterance form.
/// </summary>
public sealed class TranscriptionResult
{
/// <summary>Transcribed text for this segment; empty when nothing was recognized.</summary>
public string Text { get; init; } = "";
/// <summary>Start of the segment relative to capture start.</summary>
public TimeSpan Start { get; init; }
/// <summary>End of the segment relative to capture start.</summary>
public TimeSpan End { get; init; }
/// <summary>Language code for the segment (defaults to "en").</summary>
public string Language { get; init; } = "en";
}

/// <summary>
/// Aggregated result of a single silence-bounded utterance — i.e. all the
/// Whisper segments produced from one VAD-bounded speech burst, combined.
/// Consumers that need "what the user said" (chat submission, stt.listen)
/// should listen for this event instead of per-segment
/// <see cref="TranscriptionResult"/> to avoid sending partial text.
/// </summary>
public sealed class UtteranceResult
{
/// <summary>Concatenated text across all segments, single-spaced.</summary>
public string Text { get; init; } = "";
/// <summary>Language detected on the first segment, or null if no segments.</summary>
public string? Language { get; init; }
/// <summary>Start of the first segment relative to capture start.</summary>
public TimeSpan Start { get; init; }
/// <summary>End of the last segment relative to capture start.</summary>
public TimeSpan End { get; init; }
/// <summary>Immutable snapshot of the per-segment results; empty when the utterance produced no segments.</summary>
public IReadOnlyList<TranscriptionResult> Segments { get; init; } = Array.Empty<TranscriptionResult>();
}

/// <summary>Voice-activity detection event describing whether the audio stream currently contains speech.</summary>
public sealed class VadEvent
{
/// <summary>True when the detector classifies the current audio as speech.</summary>
public bool IsSpeaking { get; init; }
/// <summary>Speech probability reported by the detector (0.0–1.0); presumably
/// compared against <see cref="AudioPipelineOptions.VadThreshold"/> — confirm in the pipeline.</summary>
public float Probability { get; init; }
}

/// <summary>
/// Configuration for the audio pipeline (microphone capture → VAD → Whisper STT).
/// All members are init-only, so an instance is immutable after construction.
/// </summary>
public sealed class AudioPipelineOptions
{
/// <summary>Path to the Whisper GGML model file. Empty by default; must be set before use.</summary>
public string ModelPath { get; init; } = "";

/// <summary>Language code for STT (e.g. "en", "auto"). Defaults to "auto" (let the model detect).</summary>
public string Language { get; init; } = "auto";

/// <summary>Seconds of silence before a speech segment is finalized. Defaults to 1.5 s.</summary>
public float SilenceTimeoutSeconds { get; init; } = 1.5f;

/// <summary>Optional audio device ID. Null = system default microphone.</summary>
public string? DeviceId { get; init; }

/// <summary>VAD probability threshold (0.0–1.0). Audio above this is considered speech. Defaults to 0.3.</summary>
public float VadThreshold { get; init; } = 0.3f;
}

/// <summary>Lifecycle state of the audio pipeline.</summary>
public enum AudioPipelineState
{
/// <summary>Pipeline is not running.</summary>
Stopped,
/// <summary>Transitional state while the pipeline is starting up.</summary>
Starting,
/// <summary>Pipeline is running and capturing audio.</summary>
Listening,
/// <summary>Pipeline is busy processing captured audio (e.g. transcription in progress).</summary>
Processing,
/// <summary>Pipeline encountered an error; NOTE(review): recovery semantics not visible here — confirm whether a restart is required.</summary>
Error
}
Loading
Loading