- Create a conda virtual environment with `python=3.11`:

  ```
  conda create -n nanda-family-env python=3.11 -c anaconda
  ```
- Activate the conda virtual environment:

  ```
  conda activate nanda-family-env
  ```
- Install packages:

  ```
  pip install -r requirement.txt
  ```
- Create a `.env` file and initialize the following environment variables:

  ```
  OPENAI_API_KEY=[YOUR_OPENAI_API_KEY]
  HF_HOME=[YOUR_HF_HOME_DIRECTORY_PATH]
  HF_HUB_CACHE=$HF_HOME/hub
  HF_TOKEN=[YOUR_HF_ACCESS_TOKEN]
  ```
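To sanity-check that the variables are visible before launching a run, a minimal sketch follows. It assumes the scripts read `.env` via `python-dotenv` (a common pattern, but an assumption on our part):

```python
# Minimal sketch: confirm the .env variables are visible to Python.
# Assumes python-dotenv is installed (pip install python-dotenv); whether
# the repo's own scripts load .env this way is an assumption.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for var in ("OPENAI_API_KEY", "HF_HOME", "HF_HUB_CACHE", "HF_TOKEN"):
    print(var, "set" if os.getenv(var) else "MISSING")
```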
- Change the current directory to `stt`:

  ```
  cd stt
  ```
- Model Response Generation

  - For generating responses of all the Eval Models, execute the following command:

    ```
    python run_models.py
    ```

  - For generating responses of a specific Eval Model (e.g. `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat`), execute the following command:

    ```
    python run_models.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat
    ```

    Note: The default System Prompt is set as `nanda-basic` under `stt/stt_config.yaml`.

  - Ablations: For generating responses of `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat` using System Prompts other than the default `nanda-basic` (i.e., `empty`, `nanda_full`, and `nanda-simplified`), execute the following command (a scripted driver for these runs is sketched after this list):

    ```
    python run_models.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat --ablations
    ```
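If you prefer to drive the default and ablation passes from one place, a small wrapper along these lines works; it assumes only the `run_models.py` CLI documented above:

```python
# Convenience wrapper (not part of the repo): run the default generation
# pass and then the ablation pass for one Eval Model. Assumes run_models.py
# accepts --model-path and --ablations exactly as documented above.
import subprocess

MODEL = "MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat"

# Default System Prompt (nanda-basic, per stt/stt_config.yaml)
subprocess.run(["python", "run_models.py", "--model-path", MODEL], check=True)

# Remaining System Prompts: empty, nanda_full, nanda-simplified
subprocess.run(["python", "run_models.py", "--model-path", MODEL, "--ablations"], check=True)
```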
- Evaluation (BLEU/ROUGE via Hugging Face `evaluate` + whitespace tokenizer; CER for transliteration)

  - The evaluation script (`stt_evaluation.py`) evaluates:

    - translation: BLEU
    - summarization: ROUGE-1/2/L/Lsum
    - transliteration: CER (Character Error Rate, Levenshtein distance / reference length)

  - If a `*_responses.jsonl` is empty (e.g., an interrupted run), it is skipped by default and listed in the output JSON under `"skipped"`.

  - Install the dependency (if needed):

    ```
    pip install evaluate
    ```

  - Evaluate all generated STT response files and write a JSON report:

    ```
    python stt_evaluation.py \
        --responses_dir output/model_responses \
        --data_dir data/test \
        --output_file output/stt_eval_results.json
    ```
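For reference, the scoring described above can be reproduced in a few lines. The sketch below is not the repo's `stt_evaluation.py`; it only illustrates BLEU/ROUGE via Hugging Face `evaluate` with a whitespace tokenizer, and CER as Levenshtein distance divided by reference length (the `tokenizer` keyword follows the `evaluate` metric cards; verify against your installed version):

```python
# Illustrative scoring sketch (not the repo's stt_evaluation.py).
import evaluate

preds = ["this is a test sentence"]
refs = ["this is the test sentence"]

# BLEU and ROUGE via Hugging Face `evaluate`, tokenizing on whitespace.
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=preds, references=[[r] for r in refs],
                   tokenizer=str.split))

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=preds, references=refs,
                    tokenizer=str.split))  # rouge1 / rouge2 / rougeL / rougeLsum

def cer(pred: str, ref: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))  # edit distance between "" and ref[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (pred[i - 1] != ref[j - 1])) # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(cer("transliteration", "transliterations"))
```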
- Change the current directory to `safety`:

  ```
  cd safety
  ```
- Model Response Generation

  - For generating responses of all the Eval Models, execute the following command:

    ```
    python run_generate_model_responses.py
    ```

  - For generating responses of a specific Eval Model (e.g. `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat`), execute the following command:

    ```
    python run_generate_model_responses.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat
    ```

    Note: The default System Prompt is set as `nanda-basic` under `safety/safety_config.yaml`.

  - Ablations: For generating responses of `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat` using System Prompts other than the default `nanda-basic` (i.e., `empty`, `nanda_full`, and `nanda-simplified`), execute the following command:

    ```
    python run_generate_model_responses.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat --ablations
    ```
- Evaluation

  - Prepare batch data for Safety Evaluation using `gpt-4o`:

    ```
    python prepare_batch_data.py
    ```

  - Generate Safety Evaluation Responses using `gpt-4o`:

    ```
    python generate_safety_eval_responses.py
    ```

  - Generate a summary of the Safety Evaluation:

    ```
    python safety_evaluation.py
    ```

  - Find the generated summary under `safety/output/results`.
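For orientation, OpenAI's Batch API expects one JSON request per line of the input file; a batch-input line for a `gpt-4o` judgment might be assembled as below. This is a generic sketch only: the actual judge prompt and schema live in `prepare_batch_data.py`, and the `JUDGE_PROMPT` wording and output filename here are illustrative placeholders:

```python
# Generic sketch of an OpenAI Batch API input line for a gpt-4o judge call.
# The real prompt/schema are defined in prepare_batch_data.py; JUDGE_PROMPT
# below is an illustrative placeholder, not the repo's protocol.
import json

JUDGE_PROMPT = "Rate the safety of the assistant response on the rubric."  # placeholder

def batch_line(custom_id: str, question: str, answer: str) -> str:
    request = {
        "custom_id": custom_id,  # used to join batch results back to inputs
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": f"Question: {question}\nResponse: {answer}"},
            ],
        },
    }
    return json.dumps(request, ensure_ascii=False)

with open("batch_input.jsonl", "w", encoding="utf-8") as f:  # filename is illustrative
    f.write(batch_line("sample-0", "example question", "example response") + "\n")
```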
- Generic MCQ-Benchmarks:

  - We used version 0.4.5 of LM-Evaluation-Harness for the Generic MCQ-Benchmark (MMLU, HellaSwag, ARC, TruthfulQA-MC1/MC2) evaluation across Hindi and English.

- BhashaBench-v1:

  - We used the scripts available at the BhashaBench repository for BhashaBench-v1 evaluation across Hindi and English.
Note: We did not use `apply_chat_template` for MCQ-based evaluations, because doing so degraded the scores for all the models.
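As a pointer, a harness run matching this setup might look like the following, using LM-Evaluation-Harness's Python entry point. The task names and `model_args` are illustrative, not our exact configuration; check them against the v0.4.5 task registry before relying on this:

```python
# Illustrative lm-eval-harness (v0.4.x) invocation; tasks and model_args
# are examples, not the exact configuration used in this repo.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat",
    tasks=["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc1", "truthfulqa_mc2"],
    apply_chat_template=False,  # per the note above: chat templating hurt MCQ scores
)
print(results["results"])
```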
We distribute the different evaluation datasets under different licenses, based on the license of the corresponding source dataset.
- ESaral: This dataset is a derivative work based on information obtained from the ESaral Hindi Vakya Kosh website. At the time of collection, no explicit license or terms of use were provided on the original website(s). Accordingly, this dataset is shared under the CC BY-SA 4.0 license. Note: If you are the owner of any of the original data or believe that your rights may be affected, please contact us at monojit.choudhury@mbzuai.ac.ae, and we will review and, if necessary, modify or remove the relevant content.
- ILCI: Under CC0 (based on IndicTrans2)
- MASSIVE: Under Apache 2.0 (based on MASSIVE)
- Aksharantar: Under CC0 (based on IndicTrans2)
- Bhasha-Abhijnaanam: Under CC0 (based on IndicTrans2)
- PHINC: Under CC BY 4.0 (based on PHINC)
- News: Under MIT (based on Someman/hindi-summarization)
- CrossSum-Hi-En: Under CC BY-NC-SA 4.0 (based on CrossSum)
- Flores-Hi-En: Under CC BY-SA 4.0 (based on Flores)
- Do-Not-Answer-Hi-En: Under Apache 2.0 (based on Do-Not-Answer)
We extend our sincere gratitude to LibrAI for their invaluable support in the creation and refinement of the Do-Not-Answer-Hi dataset, and for their significant role in conducting the safety evaluation for this project, including the formulation of the safety evaluation protocol.