10 changes: 10 additions & 0 deletions issue-2-ai-skill-evaluator/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
__pycache__/
*.py[cod]
.ipynb_checkpoints/
.DS_Store
.env
.venv/
venv/
artifacts/
outputs/
*.log
44 changes: 44 additions & 0 deletions issue-2-ai-skill-evaluator/README.md
@@ -0,0 +1,44 @@
# AI Model for Evaluating 21st Century Skills (Issue #2)

## Project Overview
This module develops a cost-efficient, fine-tuned open-source Vision Language Model (VLM)
to evaluate student-submitted artifacts (drawings, written responses) against rubric-based
frameworks measuring 21st-century skills: creativity, critical thinking, problem-solving, and agency.

- **Target cost:** < ₹0.10 per evaluation
- **Replaces:** Gemini-based evaluation pipeline
- **Approach:** Supervised fine-tuning of open-source VLMs (see the candidates listed below) using PyTorch
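
As a rough sketch of how the ₹0.10 target can be checked, the snippet below converts a per-token price into a per-evaluation cost in INR. The function name `cost_per_evaluation_inr`, the token count, the token price, and the exchange rate are all illustrative assumptions, not measured values.

```python
def cost_per_evaluation_inr(
    tokens_per_eval: int,
    usd_per_million_tokens: float,
    inr_per_usd: float = 83.0,  # assumed exchange rate; update as needed
) -> float:
    """Convert a token price into an estimated per-evaluation cost in INR."""
    usd_cost = usd_per_million_tokens * tokens_per_eval / 1_000_000
    return usd_cost * inr_per_usd


TARGET_INR = 0.10

# Hypothetical inputs: 350 tokens per evaluation at USD 0.20 per 1M tokens.
estimate = cost_per_evaluation_inr(tokens_per_eval=350, usd_per_million_tokens=0.20)
print(f"Estimated cost: INR {estimate:.4f} (within target: {estimate < TARGET_INR})")
```

For self-hosted models the token price would have to be derived from GPU rental cost and measured throughput, so treat the inputs as placeholders until benchmarks are available.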

## Project Structure
```text
data/ -> Rubric schemas and labeled artifact datasets
notebooks/ -> EDA, benchmarking, and training experiments
src/ -> Core source code (data utils, evaluator, fine-tuning pipeline)
```

## Getting Started
```bash
pip install -r requirements.txt
```

## Rubric Framework
Skills assessed:
- **Creativity** - originality, expression, divergent thinking
- **Critical Thinking** - analysis, evaluation, logical reasoning
- **Problem Solving** - approach, method, outcome quality
- **Agency** - self-direction, initiative, reflection

## Model Candidates Under Evaluation
| Model | Params | Multimodal | License |
|-------|--------|-----------|---------|
| LLaVA-1.5 | 7B/13B | ✅ | Apache 2.0 (code); Llama 2 Community License (weights) |
| InternVL2 | 2B/8B | ✅ | MIT |
| Qwen2-VL | 2B/7B | ✅ | Apache 2.0 |
| PaliGemma | 3B | ✅ | Gemma License |

## Milestones
- [ ] Dataset preparation and schema design
- [ ] Model benchmarking (zero-shot performance)
- [ ] Fine-tuning pipeline setup
- [ ] Cost-efficiency analysis
- [ ] Benchmark against Gemini and human evaluators
20 changes: 20 additions & 0 deletions issue-2-ai-skill-evaluator/data/README.md
@@ -0,0 +1,20 @@
# Data Directory

This directory stores rubric definitions and labeled datasets used for benchmarking and
fine-tuning rubric-based artifact evaluators.

## Expected Inputs
- `*.json` rubric files defining the scoring schema for a skill
- Artifact files such as `.png`, `.jpg`, `.jpeg`, `.txt`, or `.md`
- Label tables in `.csv` format linking artifact identifiers to rubric scores
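
A minimal sketch of reading a label table and separating usable rows from rows that need review is shown here. The column names `artifact_id`, `skill`, and `score`, the helper `load_labels`, and the inline sample rows are assumptions for illustration; this README does not prescribe column names.

```python
import csv
import io

# Hypothetical label table; real files would sit next to the artifacts.
SAMPLE_CSV = """artifact_id,skill,score
drawing_001.png,creativity,3
essay_014.txt,creativity,5
drawing_002.png,creativity,1
"""


def load_labels(source, valid_scores=frozenset({1, 2, 3, 4})):
    """Return (valid_rows, rejected_rows) from an open label CSV."""
    valid, rejected = [], []
    for row in csv.DictReader(source):
        try:
            score = int(row["score"])
        except (KeyError, TypeError, ValueError):
            # Missing column, short row, or non-numeric score.
            rejected.append(row)
            continue
        if score in valid_scores:
            valid.append({**row, "score": score})
        else:
            rejected.append(row)
    return valid, rejected


valid_rows, rejected_rows = load_labels(io.StringIO(SAMPLE_CSV))
print(f"{len(valid_rows)} valid rows, {len(rejected_rows)} rejected")
```

Keeping rejected rows instead of dropping them silently makes it easier to trace labeling mistakes back to their source.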

## Suggested Rubric Schema
Each rubric JSON file should include:
- `rubric_version`
- `skill`
- `description`
- `levels` with `score`, `label`, and `descriptor`
- `artifact_types`
- `evaluation_modalities`

See [sample_rubric.json](sample_rubric.json) for a starter example.
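
To make the suggested schema concrete, here is a small validation sketch that uses only the standard library. The helper name `validate_rubric` and the specific checks are assumptions, and the inline rubric is a deliberately incomplete example rather than project data.

```python
import json

REQUIRED_KEYS = {
    "rubric_version", "skill", "description",
    "levels", "artifact_types", "evaluation_modalities",
}
LEVEL_KEYS = {"score", "label", "descriptor"}


def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems found; an empty list means the rubric passed."""
    problems = []
    missing = REQUIRED_KEYS - set(rubric)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    levels = rubric.get("levels", [])
    for index, level in enumerate(levels):
        missing_level = LEVEL_KEYS - set(level)
        if missing_level:
            problems.append(f"level {index} is missing: {sorted(missing_level)}")
    scores = [level.get("score") for level in levels]
    if len(scores) != len(set(scores)):
        problems.append("level scores must be unique")
    return problems


# Inline example with one incomplete level; in practice, parse the rubric file.
example = json.loads("""
{
  "rubric_version": "1.0",
  "skill": "creativity",
  "description": "Assesses originality and creative expression",
  "levels": [
    {"score": 1, "label": "Beginning", "descriptor": "Minimal originality"},
    {"score": 2, "label": "Developing"}
  ],
  "artifact_types": ["drawing"],
  "evaluation_modalities": ["vision"]
}
""")
print(validate_rubric(example))
```

Returning a list of problems rather than raising on the first one lets a rubric author fix every issue in a single pass.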
29 changes: 29 additions & 0 deletions issue-2-ai-skill-evaluator/data/sample_rubric.json
@@ -0,0 +1,29 @@
{
"rubric_version": "1.0",
"skill": "creativity",
"description": "Assesses originality and creative expression in student artifacts",
"levels": [
{
"score": 1,
"label": "Beginning",
"descriptor": "Work shows minimal originality; heavily relies on given prompts or examples"
},
{
"score": 2,
"label": "Developing",
"descriptor": "Shows some original elements but largely conventional in approach"
},
{
"score": 3,
"label": "Proficient",
"descriptor": "Demonstrates clear original thinking with creative connections"
},
{
"score": 4,
"label": "Exemplary",
"descriptor": "Highly original work showing inventive, divergent thinking throughout"
}
],
"artifact_types": ["drawing", "written_response", "prototype"],
"evaluation_modalities": ["vision", "text", "multimodal"]
}
169 changes: 169 additions & 0 deletions issue-2-ai-skill-evaluator/notebooks/01_model_benchmarking.ipynb
@@ -0,0 +1,169 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model Benchmarking for Skill Evaluation\n",
"\n",
"This notebook is a starter for benchmarking multimodal models on rubric-based artifact evaluation.\n",
"For this project, benchmarking means comparing candidate VLMs on:\n",
"- zero-shot rubric alignment,\n",
"- inference latency,\n",
"- estimated cost per evaluation,\n",
"- practicality for the INR 0.10 target.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Pick a model checkpoint that fits local hardware. The example below shows a zero-shot path for LLaVA-style checkpoints using `transformers`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import time\n",
"\n",
"import pandas as pd\n",
"from PIL import Image\n",
"\n",
"# Uncomment when running with a supported checkpoint locally.\n",
"# import torch\n",
"# from transformers import AutoProcessor, LlavaForConditionalGeneration\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load a Test Artifact\n",
"\n",
"Point this to a sample student artifact image before running model inference."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"artifact_path = Path(\"../data/sample_artifact.png\")\n",
"\n",
"if artifact_path.exists():\n",
" test_image = Image.open(artifact_path).convert(\"RGB\")\n",
" test_image\n",
"else:\n",
" print(f\"Add a sample image at {artifact_path} to run inference.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zero-Shot Inference Template\n",
"\n",
"Use this block to test a candidate model such as LLaVA-1.5 or InternVL2 on a rubric prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_NAME = \"llava-hf/llava-1.5-7b-hf\"\n",
"PROMPT = \"\"\"Evaluate this student artifact for creativity using a 1-4 rubric.\\n\"\n",
"PROMPT += \"Return a score, confidence from 0 to 1, and a short justification.\"\"\"\n",
"\n",
"# Example template for local benchmarking.\n",
"# processor = AutoProcessor.from_pretrained(MODEL_NAME)\n",
"# model = LlavaForConditionalGeneration.from_pretrained(\n",
"# MODEL_NAME,\n",
"# torch_dtype=torch.float16,\n",
"# low_cpu_mem_usage=True,\n",
"# )\n",
"#\n",
"# inputs = processor(text=PROMPT, images=test_image, return_tensors=\"pt\")\n",
"# start = time.perf_counter()\n",
"# output = model.generate(**inputs, max_new_tokens=128)\n",
"# inference_seconds = time.perf_counter() - start\n",
"# decoded = processor.batch_decode(output, skip_special_tokens=True)[0]\n",
"# print(decoded)\n",
"# print(f\"Inference time: {inference_seconds:.2f}s\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark Comparison Table\n",
"\n",
"Fill in observed latency and token estimates as you benchmark each model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"benchmark_df = pd.DataFrame(\n",
" [\n",
" {\"model\": \"LLaVA-1.5 7B\", \"params_b\": 7, \"inference_time_s\": None, \"avg_tokens_per_eval\": 350, \"cost_per_1m_tokens_usd\": 0.20},\n",
" {\"model\": \"InternVL2 2B\", \"params_b\": 2, \"inference_time_s\": None, \"avg_tokens_per_eval\": 320, \"cost_per_1m_tokens_usd\": 0.12},\n",
" {\"model\": \"Qwen2-VL 2B\", \"params_b\": 2, \"inference_time_s\": None, \"avg_tokens_per_eval\": 340, \"cost_per_1m_tokens_usd\": 0.15},\n",
" ]\n",
")\n",
"benchmark_df\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cost Calculation\n",
"\n",
"Estimate per-evaluation cost using token pricing assumptions and a USD-to-INR exchange rate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exchange_rate = 83.0 # INR per USD\n",
"\n",
"benchmark_df[\"estimated_cost_inr\"] = (\n",
" benchmark_df[\"cost_per_1m_tokens_usd\"]\n",
" * benchmark_df[\"avg_tokens_per_eval\"]\n",
" / 1_000_000\n",
" * exchange_rate\n",
")\n",
"\n",
"benchmark_df[[\"model\", \"estimated_cost_inr\"]]\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
13 changes: 13 additions & 0 deletions issue-2-ai-skill-evaluator/requirements.txt
@@ -0,0 +1,13 @@
torch>=2.0.0
transformers>=4.40.0
pillow>=10.0.0
pandas>=2.0.0
numpy>=1.24.0
jupyter>=1.0.0
matplotlib>=3.7.0
seaborn>=0.12.0
tqdm>=4.65.0
datasets>=2.14.0
accelerate>=0.24.0
peft>=0.6.0
bitsandbytes>=0.41.0
1 change: 1 addition & 0 deletions issue-2-ai-skill-evaluator/src/__init__.py
@@ -0,0 +1 @@
"""Utilities for the AI skill evaluator project."""