
πŸŽ™οΈ VoxSafeBench

Not Just What Is Said, but Who, How, and Where



VoxSafeBench is a comprehensive benchmark for assessing the social alignment of Speech Language Models (SLMs) across three core pillars: Safety, Fairness, and Privacy. VoxSafeBench adopts a two-tier design: Tier 1 evaluates content-centric risks with matched text and audio inputs, while Tier 2 evaluates audio-conditioned risks in which the transcript is benign but the correct response depends on who is speaking, how they speak, or where they speak.

(Figure: VoxSafeBench benchmark overview)



βš™οΈ Environment Setup

First, clone the repository and navigate into the directory:

git clone https://github.com/AmphionTeam/VoxSafeBench.git
cd VoxSafeBench

Each model requires its own environment; configure it by following the setup instructions in that model's code repository.

For closed-source models (e.g., gemini_3_flash, gemini_3_pro, and gpt_4o_audio) and the top-level scripts (run_inference.py, run_evaluation.py), install the following dependencies:

pip install openai google-genai python-dotenv tqdm

For open-source models, please refer to the setup instructions in their respective official repositories.


πŸ“₯ Dataset download

export HF_ENDPOINT=https://hf-mirror.com

HF_HUB_DOWNLOAD_TIMEOUT=240 huggingface-cli download --repo-type dataset --resume-download YuxiangW/VoxSafeBench --local-dir ./datasets --max-workers 32

If the download fails due to network issues, simply rerun the command; --resume-download continues from where it left off.
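If transient failures persist, a small retry wrapper can automate the reruns. This is a sketch: the attempt count and delay are arbitrary, and repeated attempts are safe only because --resume-download picks up partially downloaded files.

```shell
# retry N CMD... -- run CMD up to N times, pausing between failed attempts.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i/$attempts failed" >&2
    if ((i < attempts)); then sleep 2; fi
  done
  return 1
}
```

For example: `retry 5 huggingface-cli download --repo-type dataset --resume-download YuxiangW/VoxSafeBench --local-dir ./datasets --max-workers 32`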

πŸ€– Models download

Qwen3-Omni

export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download --resume-download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model_warehouse/Qwen3_omni

Qwen3-Omni-thinking

export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download --resume-download Qwen/Qwen3-Omni-30B-A3B-Thinking --local-dir ./model_warehouse/Qwen3_omni_thinking

Mimo-audio/Mimo-audio-thinking

export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download --resume-download XiaomiMiMo/MiMo-Audio-7B-Instruct --local-dir ./model_warehouse/Mimo_audio/MiMo-Audio-7B-Instruct

huggingface-cli download --resume-download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./model_warehouse/Mimo_audio/MiMo-Audio-Tokenizer

Kimi-audio

export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download --resume-download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./model_warehouse/Kimi_audio

⚠️ Please replace model_warehouse/Kimi_audio/modeling_moonshot_kimia.py with utils/modeling_moonshot_kimia.py.
⚠️ Without this replacement, Kimi-audio cannot handle text-only input.
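The swap can be scripted as below. This is a sketch with a hypothetical helper name (install_patched_kimia is not part of the repository); it also keeps a .bak copy of the stock file so the change is reversible.

```shell
# Replace the stock Kimi-audio modeling file with the patched one that
# enables text-only input, keeping a .bak copy of the original.
install_patched_kimia() {
  local src=$1 dst=$2
  cp "$dst" "${dst}.bak"   # back up the original file
  cp "$src" "$dst"         # install the patched version
}
# Usage:
# install_patched_kimia utils/modeling_moonshot_kimia.py \
#   model_warehouse/Kimi_audio/modeling_moonshot_kimia.py
```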

Gemini-3 & GPT-4o-audio

Please put your Gemini and OpenAI API keys in the .env file.
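A minimal .env might look like the following. The variable names here are an assumption, not confirmed by the repository; check the model wrappers for the exact names they read via python-dotenv.

```shell
# .env -- loaded via python-dotenv; keep this file out of version control.
# NOTE: variable names below are illustrative, not confirmed by the repo.
GEMINI_API_KEY=your-gemini-key
OPENAI_API_KEY=your-openai-key
```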


πŸš€ Model Inference (Unified Runner)

We provide a unified inference script, run_inference.py, that generates responses from any supported model on any task.

Basic Usage

You can use the unified runner by specifying the model and the task:

# Run a specific task for a model
python run_inference.py --model Qwen3_omni --task Safety-tier2/Emotion

# Run all tasks for a model
python run_inference.py --model Qwen3_omni --all

Available Models:

  • Qwen3_omni
  • Qwen3_omni_thinking
  • Mimo_audio
  • Mimo_audio_thinking
  • Kimi_audio
  • gemini_3_flash
  • gemini_3_pro
  • gpt_4o_audio

(You can see all available models dynamically via python run_inference.py --help)

Available Tasks:

Fairness Tasks
  • Fairness-tier1/test
  • Fairness-tier2/Bias_analysis
  • Fairness-tier2/test
Privacy Tasks
  • Privacy-tier1/Hard_privacy
  • Privacy-tier1/Soft_privacy
  • Privacy-tier2/Audio_conditioned_privacy
  • Privacy-tier2/Interactional_privacy
Safety Tasks (Tier 1)
  • Safety-tier1/Agentic_Action_Risks
  • Safety-tier1/Multiturn_jailbreak
  • Safety-tier1/No_jailbreak
  • Safety-tier1/Singleturn_jailbreak
Safety Tasks (Tier 2)
  • Safety-tier2/Child_presence
  • Safety-tier2/Child_voice
  • Safety-tier2/Emotion
  • Safety-tier2/Impaired_capacity
  • Safety-tier2/Overlap_instruction_injection
  • Safety-tier2/Symbolic_background
  • Safety-tier2/Unsafe_ambient
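To run a hand-picked subset of the tasks above back to back, a small wrapper around the runner can help. This is a sketch; run_tasks is a hypothetical helper, not part of the repository, and it simply shells out to run_inference.py once per task.

```shell
# run_tasks MODEL TASK... -- invoke run_inference.py once per task,
# echoing each command before it executes.
run_tasks() {
  local model=$1; shift
  local task
  for task in "$@"; do
    echo "+ python run_inference.py --model $model --task $task"
    python run_inference.py --model "$model" --task "$task"
  done
}
# Example:
# run_tasks Qwen3_omni Safety-tier2/Emotion Safety-tier2/Child_voice
```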

Advanced Options

You can override default settings per run:

python run_inference.py --model gpt_4o_audio --all --max-workers 8 --save-interval 20
  • --max-workers: Set the number of concurrent workers (useful for API models like GPT-4o or Gemini).
  • --save-interval: How often to save results to disk.
  • --model-name: Override the internal model name string passed to the API or model loader.

Output Location

Results are written automatically to:

results/<model_name>/<task_name>/results.jsonl

The runner automatically:

  • Reads input data from the appropriate datasets/**/metadata.jsonl.
  • Resumes from existing results.jsonl (skips completed items, retries items that failed with ERROR:).
  • Merges the model's outputs into the existing dataset fields.
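Because failed items are stored with an ERROR: prefix and retried on the next run, a quick progress check only needs to count lines in a results file. A sketch, assuming one JSON object per line and the ERROR: failure marker appearing verbatim in failed records:

```shell
# progress FILE -- report how many items a results.jsonl contains and how
# many of them failed with an "ERROR:" marker (retried on the next run).
progress() {
  local f=$1
  local total errors
  total=$(grep -c '' "$f")                  # total lines, one item per line
  errors=$(grep -c 'ERROR:' "$f" || true)   # lines with the failure marker
  echo "total=$total errors=$errors"
}
# Example:
# progress results/Qwen3_omni/Safety-tier2/Emotion/results.jsonl
```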

πŸ“ˆ Final Evaluation

After running model inference, you can evaluate the generated responses and compute metrics using the run_evaluation.py script. The evaluation uses automated judges (e.g., DeepSeek, OpenAI Moderation) and rule-based evaluators to score safety, privacy, and fairness.

Prerequisite

Ensure you have the required API keys configured in your .env file, as the script uses these external services for evaluation.

Basic Usage

You can run the evaluation script with different levels of granularity:

# Evaluate ALL models and ALL tasks (found in the `results/` directory)
python run_evaluation.py

# Evaluate all tasks for a specific model
python run_evaluation.py --model Qwen3_omni

# Evaluate a specific task for a specific model
python run_evaluation.py --model Qwen3_omni --task Safety-tier1/No_jailbreak

Advanced Options

You can specify the number of concurrent threads used during evaluation:

python run_evaluation.py --model gpt_4o_audio --threads 16
  • --threads: Set the maximum number of worker threads (default is 8) to speed up API-based evaluations.

Output Location

The evaluation script reads from the results/ directory and writes the final evaluated data and metrics to the final_eval_results/ directory:

final_eval_results/<model_name>/<task_name>/
β”œβ”€β”€ results.jsonl   # Data with evaluation scores and judgments appended
└── log.txt         # Computed metrics and summary statistics

⚠️ Note

  • For the Privacy-tier2/Inferential_privacy task, please use the HearSay Benchmark.
  • All experiments were conducted on NVIDIA A800 GPUs.
