This project is an Android voice assistant application that combines Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS) into a single on-device pipeline.
- High-Level Modules
- Runtime Pipeline
- User Interface Layer
- Configuration and Models
- LLM Framework Selection
- Optional Visual Question Answering (VQA)
- Key Files
- `app/`: Android application (Jetpack Compose UI, view models, pipeline orchestration)
- `stt/`: Speech-to-Text module based on whisper.cpp
- `llm/`: LLM module based on llama.cpp (and other selectable backends)
- `resources/`: Shared assets and model configuration files
At runtime, the app coordinates the following steps:
- Audio capture: `SpeechRecorder` records microphone input to a local audio file.
- Transcription: `Whisper` (STT) converts audio to text using the configured STT model.
- LLM inference: `Llm` generates a response using the selected LLM backend and model.
- Speech output: `SpeechSynthesis` drives Android Text-to-Speech to speak the response.
- UI updates: The UI receives incremental updates as partial LLM tokens arrive.
The orchestration happens in app/src/main/java/com/arm/voiceassistant/Pipeline.kt, which owns the lifecycle of the STT and LLM engines and manages coroutines, state, and error flow.
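The four stages above can be sketched in plain Kotlin. The interfaces and stub implementations below are hypothetical simplifications invented for illustration (the real `Pipeline.kt` also manages engine lifecycles, coroutines, and error flow); they only show the order in which the stages run and how partial tokens could reach the UI.

```kotlin
import java.io.File

// Hypothetical, simplified stand-ins for the real engine classes.
interface SpeechRecorder { fun record(): File }
interface SttEngine { fun transcribe(audio: File): String }
interface LlmEngine { fun generate(prompt: String, onToken: (String) -> Unit): String }
interface Tts { fun speak(text: String) }

// One pass through the pipeline: capture -> transcribe -> generate -> speak.
fun runPipeline(recorder: SpeechRecorder, stt: SttEngine, llm: LlmEngine, tts: Tts): String {
    val audio = recorder.record()                // 1. Audio capture
    val prompt = stt.transcribe(audio)           // 2. Transcription
    val reply = llm.generate(prompt) { token ->  // 3. LLM inference;
        print(token)                             //    partial tokens stream to the UI
    }
    tts.speak(reply)                             // 4. Speech output
    return reply
}

fun main() {
    // Stub engines so the sketch runs anywhere, without a device.
    val reply = runPipeline(
        recorder = object : SpeechRecorder { override fun record() = File("utterance.wav") },
        stt = object : SttEngine { override fun transcribe(audio: File) = "hello" },
        llm = object : LlmEngine {
            override fun generate(prompt: String, onToken: (String) -> Unit): String {
                val tokens = listOf("Hi", " there", "!")
                tokens.forEach(onToken)
                return tokens.joinToString("")
            }
        },
        tts = object : Tts { override fun speak(text: String) { println("\nTTS: $text") } }
    )
    println("reply=$reply")
}
```

In the real app these stages are asynchronous; the sketch collapses them into one sequential call to keep the data flow visible.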
```mermaid
flowchart LR
    mic[(Microphone)] --> recorder[SpeechRecorder]
    image[(Image Upload)] --> encoder[Vision Encoder]
    subgraph Pipeline["Pipeline.kt"]
        recorder --> stt[Whisper STT] --> llm[LLM backend] --> tts[SpeechSynthesis - Android TTS]
        encoder -->|embeddings| llm
    end
    tts --> speaker[Speaker]
```
Speech synthesis happens in the Android app layer via the SpeechSynthesis component, which wraps the platform Text-to-Speech engine. It is invoked from Pipeline.kt after an LLM response is produced, and the audio is rendered locally on the device.
The UI is built with Jetpack Compose and is organized under:
- `app/src/main/java/com/arm/voiceassistant/ui/`
- `app/src/main/java/com/arm/voiceassistant/ui/composables/`
- `app/src/main/java/com/arm/voiceassistant/ui/screens/`
- `app/src/main/java/com/arm/voiceassistant/ui/theme/`
- `app/src/main/java/com/arm/voiceassistant/viewmodels/`
MainActivity sets up the UI, requests microphone permissions, and initializes the main view model.
- STT config: `stt/stt-src/model_configuration_files/whisperTextConfig.json`
- LLM config: `llm/llm-src/model_configuration_files/{Framework}{Text|Vision}Config-{ModelName}.json`
- Model files: downloaded during the build and pushed to the device via `app/pushAppResources.py`
The Pipeline loads default configs when user configs are missing or invalid, and can read custom configs if provided.
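A custom config uses the same JSON shape as the bundled defaults. The fragment below is purely illustrative; the field names are hypothetical, and the authoritative schema is whatever the files under `model_configuration_files/` define:

```json
{
  "modelPath": "models/example-model.gguf",
  "contextSize": 2048,
  "temperature": 0.7
}
```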
The LLM backend is chosen at build time via the llmFramework Gradle property. Supported values include:
- llama.cpp (default)
- onnxruntime-genai
- mnn
- mediapipe
Example:

```bash
./gradlew assembleDebug -PllmFramework=onnxruntime-genai
```

The app supports optional image-based prompts. An image can be uploaded and encoded into embeddings, which are retained in context for follow-up queries until the context is reset.
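One way to model the "embeddings retained until reset" behavior is a small context holder. This is a hypothetical sketch, not the actual `Pipeline.kt` implementation; the class and method names are invented for illustration:

```kotlin
// Hypothetical holder for multimodal context: image embeddings survive
// across follow-up text queries until the context is explicitly reset.
class MultimodalContext {
    private var imageEmbeddings: FloatArray? = null
    private val history = mutableListOf<String>()

    fun attachImage(embeddings: FloatArray) { imageEmbeddings = embeddings }
    fun addTurn(userText: String) { history.add(userText) }
    fun hasImage(): Boolean = imageEmbeddings != null

    // Reset clears both the image embeddings and the conversation history.
    fun reset() {
        imageEmbeddings = null
        history.clear()
    }
}

fun main() {
    val ctx = MultimodalContext()
    ctx.attachImage(FloatArray(8))      // user uploads an image
    ctx.addTurn("What is in this picture?")
    ctx.addTurn("What color is it?")    // follow-up query still sees the image
    println(ctx.hasImage())             // true
    ctx.reset()
    println(ctx.hasImage())             // false
}
```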
- `app/src/main/java/com/arm/voiceassistant/Pipeline.kt`
- `app/src/main/java/com/arm/voiceassistant/MainActivity.kt`
- `app/pushAppResources.py`
- `stt/` and `llm/` module build files and native bindings