VoxSafeBench is a comprehensive benchmark designed to assess the social alignment of Speech Language Models (SLMs) built around three core pillars: Safety, Fairness, and Privacy. VoxSafeBench adopts a unique Two-Tier design: Tier 1 evaluates content-centric risks with matched text and audio inputs, while Tier 2 evaluates audio-conditioned risks in which the transcript is benign but the correct response depends on who is speaking, how they speak, or where they speak.
First, clone the repository and navigate into the directory:

```shell
git clone https://github.com/AmphionTeam/VoxSafeBench.git
cd VoxSafeBench
```

Evaluating different models requires configuring each model's environment under its respective code repository.
For closed-source models (e.g., gemini_3_flash, gemini_3_pro, and gpt_4o_audio) and the overall evaluation scripts (run_inference.py, run_evaluation.py), install the following dependencies:

```shell
pip install openai google-genai python-dotenv tqdm
```

For open-source models, please refer to the specific setup instructions in their respective official repositories:
- Qwen3-omni: QwenLM/Qwen3-Omni
- Mimo-audio: XiaomiMiMo/MiMo-Audio
- Kimi-audio: MoonshotAI/Kimi-Audio
Download the VoxSafeBench dataset from Hugging Face:

```shell
export HF_ENDPOINT=https://hf-mirror.com
HF_HUB_DOWNLOAD_TIMEOUT=240 huggingface-cli download --repo-type dataset --resume-download YuxiangW/VoxSafeBench --local-dir ./datasets --max-workers 32
```

If the download fails due to network issues, simply retry the command; `--resume-download` will continue from where it left off.
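If the download keeps failing, a small retry wrapper can save manual restarts. This is only a sketch — the `retry` helper below is not part of the repository:

```shell
# Retry a command up to 3 times with a short pause between attempts.
# Combined with --resume-download, each attempt continues the partial download.
retry() {
  attempts=0
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
      echo "giving up after $attempts attempts" >&2
      return 1
    fi
    sleep 2
  done
}
```

Then prefix the download command with it, e.g. `retry huggingface-cli download --repo-type dataset --resume-download YuxiangW/VoxSafeBench --local-dir ./datasets`.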
**Qwen3-Omni (Instruct):**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model_warehouse/Qwen3_omni
```

**Qwen3-Omni (Thinking):**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen3-Omni-30B-A3B-Thinking --local-dir ./model_warehouse/Qwen3_omni_thinking
```

**MiMo-Audio:**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download XiaomiMiMo/MiMo-Audio-7B-Instruct --local-dir ./model_warehouse/Mimo_audio/MiMo-Audio-7B-Instruct
huggingface-cli download --resume-download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./model_warehouse/Mimo_audio/MiMo-Audio-Tokenizer
```

**Kimi-Audio:**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./model_warehouse/Kimi_audio
```
> ⚠️ Please replace `model_warehouse/Kimi_audio/modeling_moonshot_kimia.py` with `utils/modeling_moonshot_kimia.py`. Without this replacement, Kimi-audio cannot handle pure text input.
Please put your Gemini and OpenAI API keys (`gemini_key` and `Openai_key`) in the `.env` file.
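A `.env` file along these lines should work; the exact variable names are an assumption here, so check what `python-dotenv` loads in the scripts:

```shell
# .env — variable names are an assumption; verify against run_inference.py
gemini_key=your-gemini-api-key
Openai_key=your-openai-api-key
```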
We provide a unified inference script run_inference.py to generate responses from any model across various tasks seamlessly.
You can use the unified runner by specifying the model and the task:
```shell
# Run a specific task for a model
python run_inference.py --model Qwen3_omni --task Safety-tier2/Emotion

# Run all tasks for a model
python run_inference.py --model Qwen3_omni --all
```

Available Models:

- `Qwen3_omni`
- `Qwen3_omni_thinking`
- `Mimo_audio`
- `Mimo_audio_thinking`
- `Kimi_audio`
- `gemini_3_flash`
- `gemini_3_pro`
- `gpt_4o_audio`
(You can list all available models dynamically via `python run_inference.py --help`.)
Available Tasks:

**Fairness Tasks**

- `Fairness-tier1/test`
- `Fairness-tier2/Bias_analysis`
- `Fairness-tier2/test`

**Privacy Tasks**

- `Privacy-tier1/Hard_privacy`
- `Privacy-tier1/Soft_privacy`
- `Privacy-tier2/Audio_conditioned_privacy`
- `Privacy-tier2/Interactional_privacy`

**Safety Tasks (Tier 1)**

- `Safety-tier1/Agentic_Action_Risks`
- `Safety-tier1/Multiturn_jailbreak`
- `Safety-tier1/No_jailbreak`
- `Safety-tier1/Singleturn_jailbreak`

**Safety Tasks (Tier 2)**

- `Safety-tier2/Child_presence`
- `Safety-tier2/Child_voice`
- `Safety-tier2/Emotion`
- `Safety-tier2/Impaired_capacity`
- `Safety-tier2/Overlap_instruction_injection`
- `Safety-tier2/Symbolic_background`
- `Safety-tier2/Unsafe_ambient`
You can override default settings per run:

```shell
python run_inference.py --model gpt_4o_audio --all --max-workers 8 --save-interval 20
```

- `--max-workers`: Set the number of concurrent workers (useful for API models like GPT-4o or Gemini).
- `--save-interval`: How often to save results to disk.
- `--model-name`: Override the internal model name string passed to the API or model loader.
Results are written automatically to:
```
results/<model_name>/<task_name>/results.jsonl
```
The runner automatically:
- Reads input data from the appropriate `datasets/**/metadata.jsonl`.
- Resumes from existing `results.jsonl` (skips completed items, retries items that failed with `ERROR:`).
- Merges the model's outputs into the existing dataset fields.
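The resume behavior above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code; the field names `id` and `response` are assumptions:

```python
import json
import os


def load_completed_ids(results_path):
    """Collect IDs of items already answered successfully.

    Items whose stored response starts with "ERROR:" are treated as
    failed and will be retried on the next run.
    """
    done = set()
    if not os.path.exists(results_path):
        return done
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            # Assumed field names: "id" and "response".
            if not str(item.get("response", "")).startswith("ERROR:"):
                done.add(item["id"])
    return done


def pending_items(metadata_path, results_path):
    """Yield dataset items that still need inference."""
    done = load_completed_ids(results_path)
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            if item["id"] not in done:
                yield item
```

Because completed IDs are skipped and `ERROR:` entries are not, an interrupted or partially failed run can simply be re-launched with the same command.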
After running model inference, you can evaluate the generated responses and compute metrics using the run_evaluation.py script. The evaluation uses automated judges (e.g., DeepSeek, OpenAI Moderation) and rule-based evaluators to score safety, privacy, and fairness.
Ensure you have the required API keys configured in your .env file, as the script uses these external services for evaluation.
You can run the evaluation script with different levels of granularity:
```shell
# Evaluate ALL models and ALL tasks (found in the `results/` directory)
python run_evaluation.py

# Evaluate all tasks for a specific model
python run_evaluation.py --model Qwen3_omni

# Evaluate a specific task for a specific model
python run_evaluation.py --model Qwen3_omni --task Safety-tier1/No_jailbreak
```

You can specify the number of concurrent threads used during evaluation:

```shell
python run_evaluation.py --model gpt_4o_audio --threads 16
```

- `--threads`: Set the maximum number of worker threads (default is 8) to speed up API-based evaluations.
The evaluation script reads from the results/ directory and writes the final evaluated data and metrics to the final_eval_results/ directory:
```
final_eval_results/<model_name>/<task_name>/
├── results.jsonl   # Data with evaluation scores and judgments appended
└── log.txt         # Computed metrics and summary statistics
```
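To check which model/task combinations have already been evaluated, a small helper that relies only on the directory layout above can be handy. This is a sketch, not part of the repository:

```python
import os


def list_evaluated_tasks(root="final_eval_results"):
    """Map each model to the tasks that have a metrics log.

    Relies only on the layout final_eval_results/<model_name>/<task_name>/log.txt,
    where <task_name> may contain a "/" (e.g. Safety-tier2/Emotion).
    """
    evaluated = {}
    if not os.path.isdir(root):
        return evaluated
    for model in sorted(os.listdir(root)):
        model_dir = os.path.join(root, model)
        if not os.path.isdir(model_dir):
            continue
        tasks = []
        # Walk all levels under the model directory, since task names are nested.
        for dirpath, _dirnames, filenames in os.walk(model_dir):
            if "log.txt" in filenames:
                tasks.append(os.path.relpath(dirpath, model_dir))
        evaluated[model] = sorted(tasks)
    return evaluated
```

Running it after an evaluation pass gives a quick overview of coverage before comparing metrics across models.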
- For the Privacy-tier2/Inferential_privacy task, please use the HearSay Benchmark.
- All experiments were conducted on NVIDIA A800 GPUs.
