
Add Android RecognitionService for system-wide voice input#19

Merged
ivan-digital merged 9 commits into main from feat/recognition-service on May 10, 2026

Conversation

ivan-digital (Contributor) commented Apr 11, 2026

Summary

Adds a new SpeechRecognitionService (in audio.soniqo.speech.service) that wraps SpeechPipeline so any app using the SpeechRecognizer API (Gboard, Duolingo, the system voice-input picker) can invoke fully on-device STT.

This PR also absorbs what was originally split into PR #21 (interface refactor + Robolectric tests); it replaces #21, which should be closed once this lands.

Service contract

  • Owns its own AudioRecord (VOICE_RECOGNITION, 16 kHz, PCM_FLOAT) — callers do not push audio.
  • Event mapping: SpeechStarted → beginningOfSpeech, PartialTranscription → partialResults, TranscriptionCompleted → results (session ends), SpeechEnded → endOfSpeech, Error → error(ERROR_SERVER).
  • Honors EXTRA_PARTIAL_RESULTS by wiring emitPartialTranscriptions on the pipeline.
  • EXTRA_LANGUAGE is logged but not enforced — Parakeet TDT v3 auto-detects.
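The event mapping above can be sketched in pure Kotlin. This is an illustrative model only: `SpeechEventKind` and `callbackFor` are hypothetical stand-ins for the SDK's real event types and the service's dispatch code.

```kotlin
// Hypothetical stand-ins for the SDK's pipeline events, for illustration only.
sealed class SpeechEventKind {
    object SpeechStarted : SpeechEventKind()
    data class PartialTranscription(val text: String) : SpeechEventKind()
    data class TranscriptionCompleted(val text: String) : SpeechEventKind()
    object SpeechEnded : SpeechEventKind()
    data class Error(val message: String) : SpeechEventKind()
}

/** Names the RecognitionService callback each pipeline event should drive. */
fun callbackFor(event: SpeechEventKind): String = when (event) {
    is SpeechEventKind.SpeechStarted -> "beginningOfSpeech"
    is SpeechEventKind.PartialTranscription -> "partialResults"
    is SpeechEventKind.TranscriptionCompleted -> "results"
    is SpeechEventKind.SpeechEnded -> "endOfSpeech"
    is SpeechEventKind.Error -> "error(ERROR_SERVER)"
}
```

Note that only TranscriptionCompleted ends the session; partials keep it open.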

Service polish (added during review)

  • onCheckRecognitionSupport (API 33+). Returns a RecognitionSupport with our SUPPORTED_LANGUAGES (27 BCP-47 tags from Parakeet TDT v3) marked installed-on-device when ModelManager.areModelsReady() is true, pending otherwise. Lets callers surface a "downloading models" UX instead of silently falling back to an online recognizer.
  • Audio focus management. Acquires AUDIOFOCUS_GAIN_TRANSIENT with USAGE_VOICE_COMMUNICATION when a session starts, abandons on tear down. On AUDIOFOCUS_LOSS / LOSS_TRANSIENT the listener tears down the session — yields the mic to incoming calls and nav prompts. Best-effort: a denied focus request logs and proceeds.
  • Settings activity (in the demo app). Without it, the gear icon next to the recognizer in the system Voice-input picker is greyed out. Currently informational — shows model-readiness state and the supported-language list. Wired via android:settingsActivity in recognition_service.xml and a RECOGNIZER_INTENT filter in the demo manifest.
  • ModelManager.areModelsReady() public API. Synchronous, side-effect-free check used by onCheckRecognitionSupport and the settings activity.
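The installed-vs-pending split in onCheckRecognitionSupport can be reduced to a small pure function. A minimal sketch, assuming the real override only flips on ModelManager.areModelsReady(); `languageSupport` and `LanguageSupport` are hypothetical names, not the service's actual API:

```kotlin
// Hypothetical model of the onCheckRecognitionSupport decision.
data class LanguageSupport(val installed: List<String>, val pending: List<String>)

/** When models are on disk, all supported tags are installed; otherwise all are pending. */
fun languageSupport(supported: List<String>, modelsReady: Boolean): LanguageSupport =
    if (modelsReady) LanguageSupport(installed = supported, pending = emptyList())
    else LanguageSupport(installed = emptyList(), pending = supported)
```

The all-or-nothing shape matches a single multilingual model: either the Parakeet files are present (every language works) or none do.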

Bug fixes pinned by tests

  • Stop-hang. onStopListening cut the mic without flushing → VAD never saw silence → TranscriptionCompleted never fired. Now pushes ~1 s of zero frames after cancelling the mic job.
  • Start-race. The busy check happened before launching the suspending setup, so two concurrent starts both passed the gate. Now claims an AtomicBoolean synchronously.
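Both fixes boil down to small, testable pieces of logic. A sketch under stated assumptions: `SessionGate` models the AtomicBoolean claim, and `silenceChunks` computes how many zero-filled chunks cover ~1 s at 16 kHz (the 512-sample chunk size is an assumption, chosen to land near the ~30 chunks mentioned below):

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Models the start-race fix: the first caller claims the session synchronously,
// so a concurrent second start is rejected before any suspending setup runs.
class SessionGate {
    private val busy = AtomicBoolean(false)
    /** Returns true only for the call that wins the session. */
    fun tryClaim(): Boolean = busy.compareAndSet(false, true)
    fun release() = busy.set(false)
}

/** Models the stop-hang fix: chunks of zeroes needed to cover ~1 s of audio. */
fun silenceChunks(sampleRate: Int = 16_000, chunkSize: Int = 512): Int =
    (sampleRate + chunkSize - 1) / chunkSize  // ceiling division
```

With these assumptions, flushing ~1 s of silence takes 32 chunks of 512 samples, which the VAD reads as end-of-utterance and finalizes the transcription.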

Refactor + Robolectric coverage

  • SpeechPipeline becomes an interface with SpeechPipelineImpl; companion invoke keeps SpeechPipeline(config) working at every existing call site (demo + androidTest).
  • Service opened for subclassing with three protected seams: createPipeline, resolveModelDir, newAudioRecord.
  • 8 Robolectric tests exercising the contract end-to-end on the JVM in <1 s each:
| Test | What it pins |
| --- | --- |
| startListening_setsUpPipelineAndSignalsReady | Happy path — readyForSpeech fires after pipeline init |
| startListening_concurrentCallReturnsBusy | Regression — start-race fix |
| stopListening_flushesPipelineWithSilence | Regression — stop-hang fix (~30 zero-frame chunks) |
| startListening_withoutPermission_reportsInsufficient | Permission-denied path |
| transcriptionCompleted_emitsResultsAndTearsDownSession | Final event delivers results(...) and closes the pipeline |
| startListening_requestsAudioFocus | Audio-focus request goes out at session start |
| audioFocusLoss_tearsDownSession | AUDIOFOCUS_LOSS callback closes the pipeline |
| onCheckRecognitionSupport_modelsNotReady_marksLanguagesPending | API 33+ language-support path returns the right shape |

Tests use a TestableService subclass that overrides the seams, a FakeSpeechPipeline implementing the new interface, and a MockK-mocked AudioRecord.
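The "protected seams" pattern behind TestableService can be sketched generically. This is a simplified illustration, not the service's actual code: `RecognizerService`/`describe` are hypothetical, while the real seams are createPipeline, resolveModelDir, and newAudioRecord.

```kotlin
// Generic sketch of the protected-seam pattern used for JVM testability.
open class RecognizerService {
    // In the real service, seams like this construct the pipeline, resolve the
    // model directory, and open the AudioRecord — all things a JVM test must avoid.
    protected open fun createPipeline(): String = "real-pipeline"
    fun describe(): String = "using " + createPipeline()
}

// Test subclass overrides the seam (and may widen visibility for test access),
// so the production code path runs against a fake collaborator.
class TestableService : RecognizerService() {
    public override fun createPipeline(): String = "fake-pipeline"
}
```

The payoff is that the untouched production logic (session setup, teardown, callbacks) runs on the JVM against fakes, with only the Android-bound construction swapped out.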

Scope

Out of scope (deferred):

  • A true language hint to STT — would need an API change in SpeechConfig / parakeet_stt.cpp. Tracked separately.
  • The legacy BroadcastReceiver for RecognizerIntent.ACTION_GET_LANGUAGE_DETAILS (pre-API-33 language discovery). minSdk is 26 but the modern onCheckRecognitionSupport path covers the dominant case; can add the receiver later if real-world demand surfaces.

Closes #4.

Test plan

  • ./gradlew :sdk:assembleDebug :app:assembleDebug — green
  • ./gradlew :sdk:testDebugUnitTest — 23/23 pass (8 service + 15 ModelManager)
  • ./gradlew :sdk:connectedDebugAndroidTest — 34/34 pass on arm64 emulator (verifies the SpeechPipeline interface refactor doesn't break the existing pipeline tests)
  • Installed demo on arm64 emulator; service registered (dumpsys package audio.soniqo.speech.demo):
    • android.speech.RecognitionService filter on SpeechRecognitionService with RECORD_AUDIO permission
    • android.speech.action.RECOGNIZER_INTENT filter on SpeechRecognitionSettingsActivity
  • Set as system default: settings put secure voice_recognition_service audio.soniqo.speech.demo/audio.soniqo.speech.service.SpeechRecognitionService — readback confirmed
  • Settings activity launches via am start -a android.speech.action.RECOGNIZER_INTENT — renders title, model-readiness state, and the 27-language list (screenshot in PR thread)
  • Manual: open Gboard in any text field, tap mic, speak — verify transcription comes back from our service
  • Manual: Settings → System → Languages & input → Voice input picker — confirm our service appears with the gear icon enabled


ivan-digital and others added 2 commits April 11, 2026 21:18
Exposes on-device STT via the standard android.speech.RecognitionService
API so keyboards and apps (Gboard, Duolingo, etc.) can use the pipeline
system-wide. The demo APK registers the service; users can pick it as
the default voice input under Settings → System → Languages & input.

Closes #4
onStopListening previously cancelled the mic feed without pushing any
audio to the pipeline, so VAD never saw silence and the final
TranscriptionCompleted never fired. After cutting the mic, push ~1 s of
zero frames so the pipeline finalizes and the caller gets results.

onStartListening only checked session != null before launching the
suspending setup, so two concurrent starts could both pass the gate and
race to assign session, leaking an AudioRecord and pipeline. Claim an
AtomicBoolean synchronously and reject duplicates with ERROR_RECOGNIZER_BUSY.
Ivan added 6 commits May 10, 2026 10:56
Refactor SpeechPipeline to an interface with an internal SpeechPipelineImpl
backed by NativeBridge. The factory `SpeechPipeline(config)` is preserved
via a companion `invoke` so all existing call sites in the demo app and
androidTest suite are unchanged.

Open SpeechRecognitionService for test subclassing and extract three
protected seams — createPipeline, resolveModelDir, newAudioRecord — so
JVM unit tests can run without loading the .so or opening the mic.

Add Robolectric + MockK and five tests covering the two bugs we fixed in
the previous commit (busy-race, stop-hang) plus permission denial,
ready-for-speech signaling, and TranscriptionCompleted teardown.
RecognitionService.onStartListening and onStopListening are protected in
the Android SDK, so tests outside the inheritance chain cannot call them.
Add startListening() / stopListening() public wrappers on TestableService
that delegate to the protected callbacks.

Verified locally: ./gradlew :sdk:testDebugUnitTest — 20/20 pass
(5 new SpeechRecognitionServiceTest, 15 existing ModelManagerDownloadTest).
Synchronous, side-effect-free check that every required model file for
the given precision is on disk and passes isValidModel(). Used by paths
that must answer 'are we ready?' without blocking, in particular
SpeechRecognitionService.onCheckRecognitionSupport(), which has to tell
the framework whether on-device recognition is currently available.
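A minimal sketch of what a synchronous, side-effect-free readiness check can look like, assuming the real ModelManager walks a known file list; `requiredFiles` and the size check stand in for the actual internals (the real check also runs isValidModel()):

```kotlin
import java.io.File

// Hypothetical readiness check: every required model file must exist on disk
// and be non-empty. No downloads, no locks, no blocking I/O beyond stat calls.
fun areModelsReady(modelDir: File, requiredFiles: List<String>): Boolean =
    requiredFiles.all { name ->
        val f = File(modelDir, name)
        f.isFile && f.length() > 0  // real implementation also validates contents
    }
```

Keeping the check read-only is what makes it safe to call from onCheckRecognitionSupport on the framework's schedule.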
…vice

Three additions that round out the RecognitionService contract:

1. **Audio focus management.** Acquire AUDIOFOCUS_GAIN_TRANSIENT with
   USAGE_VOICE_COMMUNICATION when a session starts, abandon when it
   tears down. On AUDIOFOCUS_LOSS / LOSS_TRANSIENT the listener tears
   down the session — yielding the mic to incoming calls and nav
   prompts is the right behavior, and we don't currently support
   pause/resume mid-utterance anyway. Best-effort: a denied focus
   request logs and proceeds.

2. **onCheckRecognitionSupport (API 33+).** Override the framework hook
   that tells callers (Gboard etc.) which BCP-47 languages we can
   recognize and whether they're installed-on-device or pending
   download. Built off ModelManager.areModelsReady() — installed when
   models are present, pending otherwise. Lets the caller surface a
   'downloading models' UX rather than silently falling back to an
   online recognizer.

3. **SUPPORTED_LANGUAGES constant.** A representative subset of the
   languages Parakeet TDT v3 claims (ar, cs, da, de, el, en, es, fi,
   fr, he, hi, hu, id, it, ja, ko, nb, nl, pl, pt, ru, sv, th, tr, uk,
   vi, zh — 27). Public on the companion object so apps can mirror
   it in their own settings UI.

Tests: three new Robolectric tests covering the audio-focus request,
audio-focus loss → teardown, and onCheckRecognitionSupport's pending
state. Also adds androidx.annotation:annotation:1.8.2 for @RequiresApi.

Local: ./gradlew :sdk:testDebugUnitTest — 23/23 pass (8 service +
15 ModelManager).
Adds the settings entry that the system Voice-input picker (Settings →
System → Languages & input → Voice input) opens via the gear icon next
to our recognizer. Without it, the gear is greyed out and users can't
tell the recognizer is alive or configurable.

Currently informational only — shows model-readiness state and the
SDK's SUPPORTED_LANGUAGES list. Nothing user-tunable yet.

Wired into recognition_service.xml via android:settingsActivity and
declared in the demo manifest with the RECOGNIZER_INTENT intent
filter that the picker queries for.
Android 15 / One UI 8 forces edge-to-edge layouts by default. Without
inset handling the bottom mic button slides under the gesture-nav bar
on Galaxy devices (and the status bar overlaps content at the top),
making the button untappable.

Wire ViewCompat.setOnApplyWindowInsetsListener on each Activity's root
LinearLayout to pad by the system-bar insets:

- MainActivity.buildUI() — Echo mode mic at the bottom
- DictationActivity.buildUI() — Dictation mic at the bottom
- SpeechRecognitionSettingsActivity.onCreate() — Settings entry that
  the system Voice-input picker opens (preserves existing 64/96 padding
  and adds inset padding on top)

No SDK change. Pure demo-app fix.
Adds a third demo entry (Recognizer test) that calls
SpeechRecognizer.createSpeechRecognizer(ctx) without a ComponentName,
exercising the system-default voice recognition service path
end-to-end through the binder boundary. Useful for smoke-testing the
recognition service without going through Gboard or Samsung Keyboard
(both of which bypass the system default).

README gains a new "System voice input (RecognitionService)" section
with a 4-step setup: manifest registration (including the RECORD_AUDIO
uses-permission and the @xml/recognition_service resource that readers
would otherwise miss), selecting the system default via Settings or
adb, and verification via the new test screen. Mirrored into all 9
translations.
@ivan-digital ivan-digital merged commit 256b4ba into main May 10, 2026
@ivan-digital ivan-digital deleted the feat/recognition-service branch May 10, 2026 16:11
ivan-digital pushed a commit that referenced this pull request May 13, 2026
…h-core

speech-core PRs #19 and #20 lifted all the model wrappers, audio utilities,
and Linux examples out of this repo. This PR finishes the migration by
deleting the now-duplicated source and slimming the native side to a single
~250-line JNI bridge.

Net change: 51 files, +717 / -7412.

Bumped:
- speech-core submodule pointer: 679869d → ba75579 (PR #19 + #20 merged)

Deleted (now in speech-core):
- sdk/src/main/cpp/audio/  — fft, mel, stft (live at speech_core::audio)
- sdk/src/main/cpp/util/   — json.h
- sdk/src/main/cpp/models/ — silero_vad, parakeet_stt, kokoro_tts +
  phonemizer + multilingual, deepfilter, onnx_engine, inference_engine,
  onnx_backend, soc_detect
- linux/                   — moved verbatim to speech-core/examples/linux/
                             (libspeech.so, demo, CLIs, integration test)

Rewrote:
- sdk/src/main/cpp/jni_bridge.cpp (388 → 269 lines) — the model wrappers
  in speech_core::* directly implement VADInterface / STTInterface /
  TTSInterface / EnhancerInterface, so the 100+ lines of C-vtable adapter
  boilerplate (vad_process_chunk, stt_transcribe, tts_synthesize, etc.)
  that wrapped each model class into sc_*_vtable_t structs are gone. The
  bridge now constructs speech_core::SileroVad / ParakeetStt / KokoroTts
  and hands references to speech_core::VoicePipeline.
- sdk/src/main/cpp/CMakeLists.txt — replaced the manual list of speech-core
  source files with add_subdirectory(${SPEECH_CORE_DIR}) using
  SPEECH_CORE_WITH_ONNX=ON. Link speech_android against speech_core_models.

Compatibility:
- Kotlin contract unchanged. NativeBridge.onEvent still receives the same
  int event-type values (0..11). The new speech_core::EventType enum has
  ResponseDone and ResponseAudioDelta swapped relative to the old C ABI
  (sc_event_t.type) — added to_kotlin_event() to map explicitly so the
  Kotlin side keeps working without any change.
- Public Kotlin API (SpeechPipeline, SpeechConfig, SpeechEvent) untouched.

Docs:
- README.md rewritten as Android-only (Linux/Yocto/QNN sections moved
  to a one-line cross-link pointing at speech-core/examples/linux).
- All 9 README translations updated to mirror the new structure
  (zh, ja, ko, es, de, fr, hi, pt, ru) with existing high-quality
  translations preserved where the underlying English text is unchanged.
- AGENTS.md rewritten — Android-only scope, points contributors at
  speech-core for any C++ / model / Linux changes.
- .gitignore drops the linux/tests/models/ and /ort-linux/ entries that
  are no longer relevant.
- setup.sh trimmed to just the Android ORT download + submodule init
  (it was previously rewriting the .gitignore on every invocation).

Verified locally:
- ./gradlew :sdk:externalNativeBuildDebug — BUILD SUCCESSFUL, 5.6 MB
  libspeech_android.so produced for arm64-v8a, links libonnxruntime.so
  and libc++_shared.so cleanly.
- ./gradlew :sdk:assembleDebug :sdk:test — BUILD SUCCESSFUL, 77 tasks.

Next: connectedAndroidTest needs to run on an emulator (downloads
1.2 GB of models on first run); will run that in CI rather than locally.
ivan-digital added a commit that referenced this pull request May 13, 2026
Slim speech-android to Android-only after speech-core PRs #19/#20