This project focuses on fine-tuning small open-source language models (e.g., Gemma3:4B) to act as ReAct agents for domain-specific tasks—specifically, interacting with UAE labor laws and employment regulations. The goal is to reduce production costs by avoiding reliance on large-scale or paid models like ChatGPT or Gemini, while still delivering robust tool-augmented reasoning.
Deploying large proprietary models in production can be prohibitively expensive and often unnecessary for specialized applications. This project explores whether a small, instruction-tuned model (4B or smaller) can be trained to:
- Reliably select tools in a ReAct agent setting
- Generate accurate, grounded answers
- Mimic the performance of a much larger teacher model (Mistral 7B) through knowledge distillation
Using LlamaIndex:
- Initial parsing with `MarkdownNodeParser`
- If a chunk exceeds 512 tokens, further split using `SentenceSplitter` with overlap
- Embed all nodes and perform K-Means clustering
- For each cluster, apply a sliding-window technique to generate questions
- Inject persona-based variations (HR manager, Employer, Employee, Domestic Worker)
- Use `Mistral-7B-AWQ` via `vLLM` to generate QA pairs
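The two-stage chunking rule above can be illustrated with a dependency-free sketch (in the project itself this is handled by LlamaIndex's `MarkdownNodeParser` and `SentenceSplitter`; here tokens are approximated by whitespace-separated words, and the 64-token overlap is an assumed value):

```python
def split_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size, each overlapping the previous."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

def chunk_node(text, max_tokens=512, overlap=64):
    """Keep a parsed node whole if it fits; otherwise fall back to overlapping splits."""
    tokens = text.split()  # crude whitespace "tokenizer", for illustration only
    if len(tokens) <= max_tokens:
        return [text]
    return [" ".join(w) for w in split_with_overlap(tokens, max_tokens, overlap)]
```

The overlap preserves context across split boundaries, which matters for legal text where an article's conditions can span several sentences.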
- Use LlamaIndex's `Workflow` API to generate ReAct-style agent traces
- Each trace contains:
  - User query
  - Model thought
  - Tool selection
  - Observation
  - Final answer
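A single trace might look like the following (an illustrative record only: the field names and the `search_labor_law` tool are assumptions, not the project's exact schema):

```python
# One illustrative ReAct trace record (field names are assumptions).
trace = {
    "user_query": "How many days of annual leave is a full-time employee entitled to?",
    "steps": [
        {
            "thought": "This concerns leave entitlements; I should consult the labor-law index.",
            "tool": "search_labor_law",  # hypothetical tool name
            "tool_input": {"query": "annual leave entitlement full-time employee"},
            "observation": "Employees are entitled to 30 days of annual leave after one year of service.",
        }
    ],
    "final_answer": "A full-time employee is entitled to 30 days of annual leave after completing one year of service.",
}
```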
- Hosted Mistral 7B via `vLLM` for high-throughput generation
- Fine-tuned with Unsloth + `PEFT`
- Chat template modified to support the ReAct agent flow: `user → assistant → observation → assistant → ...`
- Enabled `train_on_responses_only` to fine-tune only the `assistant` responses
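The effect of `train_on_responses_only` can be illustrated as label masking: tokens outside the assistant turns receive the ignore index (-100), so only assistant tokens contribute to the cross-entropy loss. This is a dependency-free sketch of the idea, not Unsloth's actual implementation:

```python
IGNORE_INDEX = -100  # the ignore index used by cross-entropy loss in HF Transformers

def mask_non_assistant(token_ids, roles):
    """Return training labels: keep assistant tokens, ignore all other turns.

    token_ids: list of input token ids.
    roles: parallel list giving each token's role ("user", "assistant", "observation").
    """
    return [tid if role == "assistant" else IGNORE_INDEX
            for tid, role in zip(token_ids, roles)]
```

This keeps the model from wasting capacity learning to reproduce user queries or tool observations, which it never needs to generate.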
- Untuned `Gemma3:4b-it` struggled with:
  - Hallucinations
  - Infinite reasoning loops
  - Tool selection failures
After fine-tuning, the model:
- Knows when to stop reasoning
- Selects the correct tool
- Avoids infinite loops and hallucinations
To explore model compression, knowledge distillation was attempted from the 4B model to a 1B model.
- The KL-divergence-based loss in Hugging Face's `GKDTrainer` requires an identical tokenizer vocabulary, but `Gemma3:4b-it` and `Gemma3:1b-it` have different tokenizers
Implemented based on the paper: “Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs”
- Extended `SFTTrainer` instead of `GKDTrainer`
- Performed loss computation per batch (to reduce CUDA memory)
- Used top-K token filtering to reduce unnecessary computation and accelerate training
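The core idea of the Universal Logit Distillation loss is to compare the *sorted* probability distributions of teacher and student, which makes the loss tokenizer-agnostic: only the shape of each distribution is matched, never token ids. Below is a per-position, dependency-free sketch (the project's implementation operates on batched PyTorch tensors; `top_k=50` is an assumed value):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uld_loss(student_logits, teacher_logits, top_k=50):
    """Universal Logit Distillation loss for one sequence position (sketch).

    Sorting probabilities in descending order removes any dependence on
    token ids, so teacher and student may use different tokenizers.
    Top-K filtering drops the long tail of near-zero probabilities.
    """
    s = sorted(softmax(student_logits), reverse=True)[:top_k]
    t = sorted(softmax(teacher_logits), reverse=True)[:top_k]
    # Pad the shorter list with zeros (the two vocabularies differ in size).
    n = max(len(s), len(t))
    s += [0.0] * (n - len(s))
    t += [0.0] * (n - len(t))
    return sum(abs(a - b) for a, b in zip(s, t))
```

Two distributions with the same shape give zero loss regardless of which tokens carry the mass, which is exactly what allows distilling across mismatched tokenizers.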
| Model | Role | Behavior |
|---|---|---|
| `Gemma3:4b-it` | Untuned | Hallucinates, fails tool usage |
| `work-right-uae` | Fine-tuned (ReAct) | Predicts tool + halts correctly |
| `Gemma3:1b-it` | Distilled via ULD | Still under testing |
- Merge summarizer and ReAct agent into one compact model
- Enhance data quality and complexity to enable the model to handle more nuanced queries, multi-turn conversations, and complex reasoning across chat history
- Deploy with `llama.cpp` for CPU-only environments (reduce production cost)
- Further evaluate distilled `1B` model performance
- Models: `Gemma3:4b-it`, `Mistral-7B-AWQ`, `Gemma3:1b-it`
- Libraries: `LlamaIndex`, `Unsloth`, `vLLM`, `PEFT`, `Transformers`, `Ollama`, `ChromaDB`
- Distillation: Custom implementation of Universal Logit Distillation
