- Create a conda virtual environment with `python=3.11`:

  ```
  conda create -n nanda-family-env python=3.11 -c anaconda
  ```
- Activate the conda virtual environment:

  ```
  conda activate nanda-family-env
  ```
- Install packages:

  ```
  pip install -r requirement.txt
  ```
- Create a `.env` file and initialize the following environment variables:

  ```
  OPENAI_API_KEY=[YOUR_OPENAI_API_KEY]
  HF_HOME=[YOUR_HF_HOME_DIRECTORY_PATH]
  HF_HUB_CACHE=$HF_HOME/hub
  HF_TOKEN=[YOUR_HF_ACCESS_TOKEN]
  ```
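To sanity-check that the variables are visible before launching a run, a minimal sketch follows. It assumes the scripts read `.env` via `python-dotenv` (a common pattern, but an assumption on our part):

```python
# Minimal sketch: confirm the .env variables are visible to Python.
# Assumes python-dotenv is installed (pip install python-dotenv); whether
# the repo's own scripts load .env this way is an assumption.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for var in ("OPENAI_API_KEY", "HF_HOME", "HF_HUB_CACHE", "HF_TOKEN"):
    print(var, "set" if os.getenv(var) else "MISSING")
```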
- Change the current directory to `stt`:

  ```
  cd stt
  ```
- Model Response Generation

  - For generating responses of all the Eval Models, execute the following command:

    ```
    python run_models.py
    ```

  - For generating responses of a specific Eval Model (e.g. `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat`), execute the following command:

    ```
    python run_models.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat
    ```

    Note: The default System Prompt is set as `nanda-basic` under `stt/stt_config.yaml`.

  - Ablations: For generating responses of `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat` using System Prompts other than the default `nanda-basic` (i.e., `empty`, `nanda_full`, and `nanda-simplified`), execute the following command (a scripted driver for these runs is sketched after this list):

    ```
    python run_models.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat --ablations
    ```
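If you prefer to drive the default and ablation passes from one place, a small wrapper along these lines works; it assumes only the `run_models.py` CLI documented above:

```python
# Convenience wrapper (not part of the repo): run the default generation
# pass and then the ablation pass for one Eval Model. Assumes run_models.py
# accepts --model-path and --ablations exactly as documented above.
import subprocess

MODEL = "MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat"

# Default System Prompt (nanda-basic, per stt/stt_config.yaml)
subprocess.run(["python", "run_models.py", "--model-path", MODEL], check=True)

# Remaining System Prompts: empty, nanda_full, nanda-simplified
subprocess.run(["python", "run_models.py", "--model-path", MODEL, "--ablations"], check=True)
```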
- Evaluation (BLEU/ROUGE via Hugging Face `evaluate` + whitespace tokenizer; CER for transliteration)

  - The evaluation script (`stt_evaluation.py`) evaluates:

    - translation: BLEU
    - summarization: ROUGE-1/2/L/Lsum
    - transliteration: CER (Character Error Rate, Levenshtein distance / reference length)

  - If a `*_responses.jsonl` is empty (e.g., an interrupted run), it is skipped by default and listed in the output JSON under `"skipped"`.

  - Install the dependency (if needed):

    ```
    pip install evaluate
    ```

  - Evaluate all generated STT response files and write a JSON report:

    ```
    python stt_evaluation.py \
        --responses_dir output/model_responses \
        --data_dir data/test \
        --output_file output/stt_eval_results.json
    ```
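For reference, the scoring described above can be reproduced in a few lines. The sketch below is not the repo's `stt_evaluation.py`; it only illustrates BLEU/ROUGE via Hugging Face `evaluate` with a whitespace tokenizer, and CER as Levenshtein distance divided by reference length (the `tokenizer` keyword follows the `evaluate` metric cards; verify against your installed version):

```python
# Illustrative scoring sketch (not the repo's stt_evaluation.py).
import evaluate

preds = ["this is a test sentence"]
refs = ["this is the test sentence"]

# BLEU and ROUGE via Hugging Face `evaluate`, tokenizing on whitespace.
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=preds, references=[[r] for r in refs],
                   tokenizer=str.split))

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=preds, references=refs,
                    tokenizer=str.split))  # rouge1 / rouge2 / rougeL / rougeLsum

def cer(pred: str, ref: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))  # edit distance between "" and ref[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (pred[i - 1] != ref[j - 1])) # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(cer("transliteration", "transliterations"))
```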
- Change the current directory to `safety`:

  ```
  cd safety
  ```
- Model Response Generation

  - For generating responses of all the Eval Models, execute the following command:

    ```
    python run_generate_model_responses.py
    ```

  - For generating responses of a specific Eval Model (e.g. `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat`), execute the following command:

    ```
    python run_generate_model_responses.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat
    ```

    Note: The default System Prompt is set as `nanda-basic` under `safety/safety_config.yaml`.

  - Ablations: For generating responses of `MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat` using System Prompts other than the default `nanda-basic` (i.e., `empty`, `nanda_full`, and `nanda-simplified`), execute the following command:

    ```
    python run_generate_model_responses.py --model-path MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat --ablations
    ```
- Evaluation

  - Prepare batch data for Safety Evaluation using `gpt-4o`:

    ```
    python prepare_batch_data.py
    ```

  - Generate Safety Evaluation Responses using `gpt-4o`:

    ```
    python generate_safety_eval_responses.py
    ```

  - Generate a summary of the Safety Evaluation:

    ```
    python safety_evaluation.py
    ```

  - Find the generated summary under `safety/output/results`.
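For orientation, OpenAI's Batch API expects one JSON request per line of the input file; a batch-input line for a `gpt-4o` judgment might be assembled as below. This is a generic sketch only: the actual judge prompt and schema live in `prepare_batch_data.py`, and the `JUDGE_PROMPT` wording and output filename here are illustrative placeholders:

```python
# Generic sketch of an OpenAI Batch API input line for a gpt-4o judge call.
# The real prompt/schema are defined in prepare_batch_data.py; JUDGE_PROMPT
# below is an illustrative placeholder, not the repo's protocol.
import json

JUDGE_PROMPT = "Rate the safety of the assistant response on the rubric."  # placeholder

def batch_line(custom_id: str, question: str, answer: str) -> str:
    request = {
        "custom_id": custom_id,  # used to join batch results back to inputs
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": f"Question: {question}\nResponse: {answer}"},
            ],
        },
    }
    return json.dumps(request, ensure_ascii=False)

with open("batch_input.jsonl", "w", encoding="utf-8") as f:  # filename is illustrative
    f.write(batch_line("sample-0", "example question", "example response") + "\n")
```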
- Generic MCQ-Benchmarks:

  - We used version 0.4.5 of LM-Evaluation-Harness for the Generic MCQ-Benchmark (MMLU, HellaSwag, ARC, TruthfulQA-MC1/MC2) evaluation across Hindi and English.

- BhashaBench-v1:

  - We used the scripts available at the BhashaBench repository for BhashaBench-v1 evaluation across Hindi and English.
Note: We did not use `apply_chat_template` for MCQ-based evaluations, because doing so degraded the scores for all the models.
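As a pointer, a harness run matching this setup might look like the following, using LM-Evaluation-Harness's Python entry point. The task names and `model_args` are illustrative, not our exact configuration; check them against the v0.4.5 task registry before relying on this:

```python
# Illustrative lm-eval-harness (v0.4.x) invocation; tasks and model_args
# are examples, not the exact configuration used in this repo.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat",
    tasks=["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc1", "truthfulqa_mc2"],
    apply_chat_template=False,  # per the note above: chat templating hurt MCQ scores
)
print(results["results"])
```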
We distribute the different evaluation datasets under different licenses, based on the license of the corresponding source dataset.
- ESaral: This dataset is a derivative work based on information obtained from the ESaral Hindi Vakya Kosh website. At the time of collection, no explicit license or terms of use were provided on the original website(s). Accordingly, this dataset is shared under the CC BY-SA 4.0 license. Note: If you are the owner of any of the original data or believe that your rights may be affected, please contact us at monojit.choudhury@mbzuai.ac.ae, and we will review and, if necessary, modify or remove the relevant content.
- ILCI: Under CC0 (based on IndicTrans2)
- MASSIVE: Under Apache 2.0 (based on MASSIVE)
- Aksharantar: Under CC0 (based on IndicTrans2)
- Bhasha-Abhijnaanam: Under CC0 (based on IndicTrans2)
- PHINC: Under CC BY 4.0 (based on PHINC)
- News: Under MIT (based on Someman/hindi-summarization)
- CrossSum-Hi-En: Under CC BY-NC-SA 4.0 (based on CrossSum)
- Flores-Hi-En: Under CC BY-SA 4.0 (based on Flores)
- Do-Not-Answer-Hi-En: Under Apache 2.0 (based on Do-Not-Answer)
We extend our sincere gratitude to LibrAI for their invaluable support in the creation and refinement of the Do-Not-Answer-Hi dataset, and for their significant role in conducting the safety evaluation for this project, including the formulation of the safety evaluation protocol.