`spot_audio` is a ROS2 package consisting of two nodes:

- `spot_microphone_node.py`
- `spot_speaker_node.py`
Spot uses the ReSpeaker Microphone Array v2.0 ("respeaker" for short) to collect audio data. The respeaker consists of four microphones in a circular array, but we use only one of them.
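For reference, here is a minimal sketch of keeping one channel from the respeaker's interleaved stream, assuming PyAudio and the respeaker's default 6-channel firmware; the channel count, channel choice, and chunk size are illustrative assumptions, not taken from the node:

```python
# Illustrative sketch: de-interleave one mic's samples from the respeaker.
# Assumes PyAudio and the 6-channel firmware; values are not from the node.
import numpy as np
import pyaudio

RATE = 16000
CHANNELS = 6        # default firmware exposes 6 channels
MIC_CHANNEL = 1     # channels 1-4 carry the raw audio of the four mics
CHUNK = 1024

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=CHANNELS, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

frames = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
mono = frames[MIC_CHANNEL::CHANNELS]  # keep one mic's samples
```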
Spot makes use of two deep-learning models to perform its listening tasks: Whisper and Audio Spectrogram Transformer (AST).
Whisper is an off-the-shelf speech-to-text (STT) model that Spot uses to transcribe any speech it hears. We use faster-whisper, a Python 3 reimplementation of this model.
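A minimal sketch of transcribing one audio window with faster-whisper; the model size, device, and options here are illustrative, not necessarily what the node uses:

```python
# Transcribe a short audio window with faster-whisper.
# Model size and options are illustrative.
from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cpu", compute_type="int8")

# transcribe() accepts a file path or a 16 kHz float32 NumPy array
segments, info = model.transcribe("window.wav", beam_size=5)
text = " ".join(segment.text.strip() for segment in segments)
print(text)
```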
AST is an off-the-shelf audio classifier that Spot uses to detect speech, nonverbal vocalizations, and respiratory distress. We interact with this model through a Python 3 package.
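One way to run AST is through the Hugging Face `transformers` audio-classification pipeline with the public AudioSet checkpoint; the package and checkpoint here are assumptions, and the node may use a different wrapper:

```python
# Hedged sketch: AST audio classification via the transformers pipeline.
# The package and checkpoint are assumptions, not confirmed by this repo.
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

# Prints the top AudioSet labels for the clip (e.g. "Speech", "Cough")
for result in classifier("window.wav", top_k=5):
    print(f"{result['label']}: {result['score']:.2f}")
```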
`spot_microphone_node.py` is the ROS2 node that processes the raw audio from the microphone. The node publishes four ROS2 topics and serves a single ROS2 service:
- `String.msg` published on the `/<SPOT_NAME>/heart_beat` topic. We publish to this topic every 8 seconds to alert the `spot_state_manager` node that the microphone node is live.
- `Speech.msg` published on the `/<SPOT_NAME>/speech` topic. We publish to this topic every 3 seconds if `whisper` picked up on any speech.
- `AudioData.msg` published on the `/<SPOT_NAME>/raw_audio` topic. We publish to this topic continuously as we collect data from the microphone.
- `Observation.msg` published on the `/<SPOT_NAME>/observations_no_id` topic. This gets published whenever `whisper` picks up speech or the `AST` detects speech, a non-verbal vocalization, or respiratory distress.
- `StopListening.srv` served on `/<SPOT_NAME>/stop_listening_service_name`. The `spot_speaker_node.py` calls this service whenever it's about to play audio containing speech, so that Spot doesn't accidentally transcribe its own speech.
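A minimal sketch of how this wiring might look in `rclpy`; the interface package name `spot_audio_interfaces` and the `audio_common_msgs` dependency are assumptions made for illustration:

```python
# Sketch of the node's publisher/service wiring; interface package names
# are hypothetical, not confirmed by this repo.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from audio_common_msgs.msg import AudioData                 # assumed dependency
from spot_audio_interfaces.msg import Observation, Speech   # hypothetical package
from spot_audio_interfaces.srv import StopListening         # hypothetical package


class SpotMicrophoneNode(Node):
    def __init__(self, spot_name: str):
        super().__init__("spot_microphone_node")
        ns = f"/{spot_name}"
        self.heart_beat_pub = self.create_publisher(String, f"{ns}/heart_beat", 10)
        self.speech_pub = self.create_publisher(Speech, f"{ns}/speech", 10)
        self.raw_audio_pub = self.create_publisher(AudioData, f"{ns}/raw_audio", 10)
        self.observation_pub = self.create_publisher(
            Observation, f"{ns}/observations_no_id", 10)
        self.stop_listening_srv = self.create_service(
            StopListening, f"{ns}/stop_listening_service_name",
            self.on_stop_listening)
        # Heartbeat every 8 seconds so spot_state_manager knows we're live
        self.create_timer(8.0, self.publish_heart_beat)

    def publish_heart_beat(self):
        self.heart_beat_pub.publish(String(data="spot_microphone_node alive"))

    def on_stop_listening(self, request, response):
        # Pause transcription while the speaker plays audio (details omitted)
        return response
```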
`spot_microphone_node.py` uses two threads whose jobs are:
- To poll the microphone for raw audio data and put it onto a buffer, and
- To process the audio buffer using `whisper` and `AST`.
The second thread processes the audio...
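A minimal sketch of this producer/consumer layout, assuming PyAudio for capture; the buffer structure, names, and parameters are illustrative:

```python
# Sketch of the two-thread layout: one thread fills a shared buffer from
# the mic, the other drains ~3 s windows for whisper/AST. Illustrative only.
import collections
import threading
import time

import pyaudio

RATE = 16000   # whisper expects 16 kHz mono
CHUNK = 1024
WINDOW_SECONDS = 3

audio_buffer = collections.deque()
buffer_lock = threading.Lock()

def poll_microphone():
    """Producer: read raw frames from the mic and append them to the buffer."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    while True:
        data = stream.read(CHUNK, exception_on_overflow=False)
        with buffer_lock:
            audio_buffer.append(data)

def process_buffer():
    """Consumer: drain ~3 s of audio at a time and hand it to whisper/AST."""
    frames_per_window = int(WINDOW_SECONDS * RATE / CHUNK)
    while True:
        with buffer_lock:
            ready = len(audio_buffer) >= frames_per_window
            if ready:
                window = b"".join(audio_buffer.popleft()
                                  for _ in range(frames_per_window))
        if not ready:
            time.sleep(0.1)
            continue
        # run whisper / AST on `window` here

threading.Thread(target=poll_microphone, daemon=True).start()
threading.Thread(target=process_buffer, daemon=True).start()
```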
The noise Spot produces while walking drowns out all other audio picked up by the microphone, which limits us to using the microphone when Spot isn't moving. To address this, we plan to purchase a directional microphone, which will hopefully attenuate noise not coming from the direction of the speaker. Integrating this microphone into the software stack may take substantial time.
Whisper's transcriptions are imperfect and often miss the casualty's speech. Transcriptions improve with longer audio segments, but we restrict processing to three-second segments: with longer windows, Spot's responses feel noticeably delayed once the speaker finishes talking. Resolving this will require some minor refinements to how the node processes audio data; one possible approach is sketched below.
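A hypothetical refinement (not the current implementation): transcribe overlapping windows, so `whisper` sees more context while Spot still responds every three seconds:

```python
# Hypothetical: keep a rolling context window, emit a transcription every
# STEP_SECONDS. Longer context for whisper, same 3-second response cadence.
import collections

RATE = 16000
CONTEXT_SECONDS = 9   # audio whisper sees per transcription
STEP_SECONDS = 3      # how often a transcription is emitted

window = collections.deque(maxlen=CONTEXT_SECONDS * RATE)

def on_new_samples(samples):
    """Called every STEP_SECONDS with fresh samples from the mic thread."""
    window.extend(samples)
    # transcribe(list(window)) here: more context, unchanged cadence
```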
Rather than publishing a custom `heart_beat` message, we should be using the standard `diagnostic_msgs` ROS2 package.
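A hedged sketch of what that could look like, publishing a standard `DiagnosticStatus` on `/diagnostics`; the hardware ID string is illustrative:

```python
# Sketch: replace the custom heartbeat with the standard diagnostics stack.
from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus

def publish_heart_beat(node, publisher):
    status = DiagnosticStatus(
        level=DiagnosticStatus.OK,
        name="spot_microphone_node",
        message="alive",
        hardware_id="respeaker_v2",   # illustrative
    )
    msg = DiagnosticArray()
    msg.header.stamp = node.get_clock().now().to_msg()
    msg.status = [status]
    publisher.publish(msg)

# publisher = node.create_publisher(DiagnosticArray, "/diagnostics", 10)
```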
`import` statements should be in alphabetical order.
Lists of dependencies in `package.xml` or `CMakeLists.txt` should be in alphabetical order.