VoxSafeBench is a comprehensive benchmark designed to assess the social alignment of Speech Language Models (SLMs) built around three core pillars: Safety, Fairness, and Privacy. VoxSafeBench adopts a unique Two-Tier design: Tier 1 evaluates content-centric risks with matched text and audio inputs, while Tier 2 evaluates audio-conditioned risks in which the transcript is benign but the correct response depends on who is speaking, how they speak, or where they speak.
First, clone the repository and navigate into the directory:

```shell
git clone https://github.com/AmphionTeam/VoxSafeBench.git
cd VoxSafeBench
```

Evaluating different models requires configuring each model's environment under its respective code repository.
For closed-source models (e.g., gemini_3_flash, gemini_3_pro, and gpt_4o_audio) and the overall evaluation scripts (run_inference.py, run_evaluation.py), install the following dependencies:

```shell
pip install openai google-genai python-dotenv tqdm
```

For open-source models, please refer to the specific setup instructions in their respective official repositories:
- Qwen3-omni: QwenLM/Qwen3-Omni
- Mimo-audio: XiaomiMiMo/MiMo-Audio
- Kimi-audio: MoonshotAI/Kimi-Audio
Download the VoxSafeBench dataset from Hugging Face:

```shell
export HF_ENDPOINT=https://hf-mirror.com
HF_HUB_DOWNLOAD_TIMEOUT=240 huggingface-cli download --repo-type dataset --resume-download YuxiangW/VoxSafeBench --local-dir ./datasets --max-workers 32
```

If the download fails due to network issues, simply retry the command; `--resume-download` will continue from where it left off.
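If the download keeps failing, a small retry wrapper can save manual restarts. This is only a sketch — the `retry` helper below is not part of the repository:

```shell
# Retry a command up to 3 times with a short pause between attempts.
# Combined with --resume-download, each attempt continues the partial download.
retry() {
  attempts=0
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
      echo "giving up after $attempts attempts" >&2
      return 1
    fi
    sleep 2
  done
}
```

Then prefix the download command with it, e.g. `retry huggingface-cli download --repo-type dataset --resume-download YuxiangW/VoxSafeBench --local-dir ./datasets`.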
**Qwen3-Omni (Instruct):**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model_warehouse/Qwen3_omni
```

**Qwen3-Omni (Thinking):**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen3-Omni-30B-A3B-Thinking --local-dir ./model_warehouse/Qwen3_omni_thinking
```

**MiMo-Audio:**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download XiaomiMiMo/MiMo-Audio-7B-Instruct --local-dir ./model_warehouse/Mimo_audio/MiMo-Audio-7B-Instruct
huggingface-cli download --resume-download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./model_warehouse/Mimo_audio/MiMo-Audio-Tokenizer
```

**Kimi-Audio:**

```shell
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./model_warehouse/Kimi_audio
```
> ⚠️ Please replace `model_warehouse/Kimi_audio/modeling_moonshot_kimia.py` with `utils/modeling_moonshot_kimia.py`. Without this replacement, Kimi-audio cannot handle pure text input.
Please put your Gemini and OpenAI API keys (`gemini_key` and `Openai_key`) in the `.env` file.
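A `.env` file along these lines should work; the exact variable names are an assumption here, so check what `python-dotenv` loads in the scripts:

```shell
# .env — variable names are an assumption; verify against run_inference.py
gemini_key=your-gemini-api-key
Openai_key=your-openai-api-key
```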
We provide a unified inference script run_inference.py to generate responses from any model across various tasks seamlessly.
You can use the unified runner by specifying the model and the task:
```shell
# Run a specific task for a model
python run_inference.py --model Qwen3_omni --task Safety-tier2/Emotion

# Run all tasks for a model
python run_inference.py --model Qwen3_omni --all
```

Available Models:

- `Qwen3_omni`
- `Qwen3_omni_thinking`
- `Mimo_audio`
- `Mimo_audio_thinking`
- `Kimi_audio`
- `gemini_3_flash`
- `gemini_3_pro`
- `gpt_4o_audio`
(You can list all available models dynamically via `python run_inference.py --help`.)
Available Tasks:

**Fairness Tasks**

- `Fairness-tier1/test`
- `Fairness-tier2/Bias_analysis`
- `Fairness-tier2/test`

**Privacy Tasks**

- `Privacy-tier1/Hard_privacy`
- `Privacy-tier1/Soft_privacy`
- `Privacy-tier2/Audio_conditioned_privacy`
- `Privacy-tier2/Interactional_privacy`

**Safety Tasks (Tier 1)**

- `Safety-tier1/Agentic_Action_Risks`
- `Safety-tier1/Multiturn_jailbreak`
- `Safety-tier1/No_jailbreak`
- `Safety-tier1/Singleturn_jailbreak`

**Safety Tasks (Tier 2)**

- `Safety-tier2/Child_presence`
- `Safety-tier2/Child_voice`
- `Safety-tier2/Emotion`
- `Safety-tier2/Impaired_capacity`
- `Safety-tier2/Overlap_instruction_injection`
- `Safety-tier2/Symbolic_background`
- `Safety-tier2/Unsafe_ambient`
You can override default settings per run:

```shell
python run_inference.py --model gpt_4o_audio --all --max-workers 8 --save-interval 20
```

- `--max-workers`: Set the number of concurrent workers (useful for API models like GPT-4o or Gemini).
- `--save-interval`: How often to save results to disk.
- `--model-name`: Override the internal model name string passed to the API or model loader.
Results are written automatically to:
```
results/<model_name>/<task_name>/results.jsonl
```
The runner automatically:
- Reads input data from the appropriate `datasets/**/metadata.jsonl`.
- Resumes from existing `results.jsonl` (skips completed items, retries items that failed with `ERROR:`).
- Merges the model's outputs into the existing dataset fields.
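The resume behavior above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code; the field names `id` and `response` are assumptions:

```python
import json
import os


def load_completed_ids(results_path):
    """Collect IDs of items already answered successfully.

    Items whose stored response starts with "ERROR:" are treated as
    failed and will be retried on the next run.
    """
    done = set()
    if not os.path.exists(results_path):
        return done
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            # Assumed field names: "id" and "response".
            if not str(item.get("response", "")).startswith("ERROR:"):
                done.add(item["id"])
    return done


def pending_items(metadata_path, results_path):
    """Yield dataset items that still need inference."""
    done = load_completed_ids(results_path)
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            if item["id"] not in done:
                yield item
```

Because completed IDs are skipped and `ERROR:` entries are not, an interrupted or partially failed run can simply be re-launched with the same command.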
After running model inference, you can evaluate the generated responses and compute metrics using the run_evaluation.py script. The evaluation uses automated judges (e.g., DeepSeek, OpenAI Moderation) and rule-based evaluators to score safety, privacy, and fairness.
Ensure you have the required API keys configured in your .env file, as the script uses these external services for evaluation.
You can run the evaluation script with different levels of granularity:
```shell
# Evaluate ALL models and ALL tasks (found in the `results/` directory)
python run_evaluation.py

# Evaluate all tasks for a specific model
python run_evaluation.py --model Qwen3_omni

# Evaluate a specific task for a specific model
python run_evaluation.py --model Qwen3_omni --task Safety-tier1/No_jailbreak
```

You can specify the number of concurrent threads used during evaluation:

```shell
python run_evaluation.py --model gpt_4o_audio --threads 16
```

- `--threads`: Set the maximum number of worker threads (default is 8) to speed up API-based evaluations.
The evaluation script reads from the results/ directory and writes the final evaluated data and metrics to the final_eval_results/ directory:
```
final_eval_results/<model_name>/<task_name>/
├── results.jsonl   # Data with evaluation scores and judgments appended
└── log.txt         # Computed metrics and summary statistics
```
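To check which model/task combinations have already been evaluated, a small helper that relies only on the directory layout above can be handy. This is a sketch, not part of the repository:

```python
import os


def list_evaluated_tasks(root="final_eval_results"):
    """Map each model to the tasks that have a metrics log.

    Relies only on the layout final_eval_results/<model_name>/<task_name>/log.txt,
    where <task_name> may contain a "/" (e.g. Safety-tier2/Emotion).
    """
    evaluated = {}
    if not os.path.isdir(root):
        return evaluated
    for model in sorted(os.listdir(root)):
        model_dir = os.path.join(root, model)
        if not os.path.isdir(model_dir):
            continue
        tasks = []
        # Walk all levels under the model directory, since task names are nested.
        for dirpath, _dirnames, filenames in os.walk(model_dir):
            if "log.txt" in filenames:
                tasks.append(os.path.relpath(dirpath, model_dir))
        evaluated[model] = sorted(tasks)
    return evaluated
```

Running it after an evaluation pass gives a quick overview of coverage before comparing metrics across models.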
- For the Privacy-tier2/Inferential_privacy task, please use the HearSay Benchmark.
- All experiments were conducted on NVIDIA A800 GPUs.
