theapprenticeproject · Nidhi18-git · Apr 28, 2026
diff --git a/issue-2-ai-skill-evaluator/.gitignore b/issue-2-ai-skill-evaluator/.gitignore
@@ -0,0 +1,10 @@
+__pycache__/
+*.py[cod]
+.ipynb_checkpoints/
+.DS_Store
+.env
+.venv/
+venv/
+artifacts/
+outputs/
+*.log
diff --git a/issue-2-ai-skill-evaluator/README.md b/issue-2-ai-skill-evaluator/README.md
@@ -0,0 +1,44 @@
+# AI Model for Evaluating 21st Century Skills (Issue #2)
+
+## Project Overview
+This module develops a cost-efficient, fine-tuned open-source Vision Language Model (VLM)
+to evaluate student-submitted artifacts (drawings, written responses) against rubric-based
+frameworks measuring 21st-century skills: creativity, critical thinking, problem-solving, and agency.
+
+**Target cost:** < ₹0.10 per evaluation  
+**Replaces:** Gemini-based evaluation pipeline  
+**Approach:** Supervised fine-tuning of open-source VLMs (LLaMA-based and others; see Model Candidates Under Evaluation) using PyTorch
+
+## Project Structure
+```text
+data/          -> Rubric schemas and labeled artifact datasets
+notebooks/     -> EDA, benchmarking, and training experiments
+src/           -> Core source code (data utils, evaluator, fine-tuning pipeline)
+```
+
+## Getting Started
+```bash
+pip install -r requirements.txt
+```
+
+## Rubric Framework
+Skills assessed:
+- **Creativity** - originality, expression, divergent thinking
+- **Critical Thinking** - analysis, evaluation, logical reasoning
+- **Problem Solving** - approach, method, outcome quality
+- **Agency** - self-direction, initiative, reflection
+
+## Model Candidates Under Evaluation
+| Model | Params | Multimodal | License |
+|-------|--------|-----------|---------|
+| LLaVA-1.5 | 7B/13B | ✅ | Apache 2.0 (code); Llama 2 Community License (weights) |
+| InternVL2 | 2B/8B | ✅ | MIT |
+| Qwen2-VL | 2B/7B | ✅ | Apache 2.0 |
+| PaliGemma | 3B | ✅ | Gemma License |
+
+## Milestones
+- [ ] Dataset preparation and schema design
+- [ ] Model benchmarking (zero-shot performance)
+- [ ] Fine-tuning pipeline setup
+- [ ] Cost-efficiency analysis
+- [ ] Benchmark against Gemini and human evaluators
diff --git a/issue-2-ai-skill-evaluator/data/README.md b/issue-2-ai-skill-evaluator/data/README.md
@@ -0,0 +1,20 @@
+# Data Directory
+
+This directory stores rubric definitions and labeled datasets used for benchmarking and
+fine-tuning rubric-based artifact evaluators.
+
+## Expected Inputs
+- `*.json` rubric files defining the scoring schema for a skill
+- Artifact files such as `.png`, `.jpg`, `.jpeg`, `.txt`, or `.md`
+- Label tables in `.csv` format linking artifact identifiers to rubric scores
+
+## Suggested Rubric Schema
+Each rubric JSON file should include:
+- `rubric_version`
+- `skill`
+- `description`
+- `levels` with `score`, `label`, and `descriptor`
+- `artifact_types`
+- `evaluation_modalities`
+
+See [sample_rubric.json](sample_rubric.json) for a starter example.
diff --git a/issue-2-ai-skill-evaluator/data/sample_rubric.json b/issue-2-ai-skill-evaluator/data/sample_rubric.json
@@ -0,0 +1,29 @@
+{
+  "rubric_version": "1.0",
+  "skill": "creativity",
+  "description": "Assesses originality and creative expression in student artifacts",
+  "levels": [
+    {
+      "score": 1,
+      "label": "Beginning",
+      "descriptor": "Work shows minimal originality; heavily relies on given prompts or examples"
+    },
+    {
+      "score": 2,
+      "label": "Developing",
+      "descriptor": "Shows some original elements but largely conventional in approach"
+    },
+    {
+      "score": 3,
+      "label": "Proficient",
+      "descriptor": "Demonstrates clear original thinking with creative connections"
+    },
+    {
+      "score": 4,
+      "label": "Exemplary",
+      "descriptor": "Highly original work showing inventive, divergent thinking throughout"
+    }
+  ],
+  "artifact_types": ["drawing", "written_response", "prototype"],
+  "evaluation_modalities": ["vision", "text", "multimodal"]
+}
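The schema suggested in `data/README.md` can be checked mechanically before a rubric file is used for benchmarking. The following is a minimal sketch, assuming that all six top-level fields and all three per-level fields listed there are required and that scores should be unique. The names `validate_rubric`, `REQUIRED_FIELDS`, and `REQUIRED_LEVEL_FIELDS` are hypothetical and not part of this PR, and the inline rubric is a trimmed copy of `sample_rubric.json`.

```python
# Field names come from data/README.md; treating every one as required is an assumption.
REQUIRED_FIELDS = {
    "rubric_version", "skill", "description",
    "levels", "artifact_types", "evaluation_modalities",
}
REQUIRED_LEVEL_FIELDS = {"score", "label", "descriptor"}


def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems; an empty list means the rubric passed."""
    problems = []
    missing = REQUIRED_FIELDS - set(rubric)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    levels = rubric.get("levels")
    if not isinstance(levels, list) or not levels:
        problems.append("levels must be a non-empty list")
        return problems

    for i, level in enumerate(levels):
        if not isinstance(level, dict):
            problems.append(f"levels[{i}] must be an object")
            continue
        missing_level = REQUIRED_LEVEL_FIELDS - set(level)
        if missing_level:
            problems.append(f"levels[{i}] missing: {sorted(missing_level)}")

    # Uniqueness of scores is an illustrative extra check, not stated in the README.
    scores = [lvl["score"] for lvl in levels if isinstance(lvl, dict) and "score" in lvl]
    if len(scores) != len(set(scores)):
        problems.append("duplicate scores in levels")
    return problems


# Trimmed copy of data/sample_rubric.json (two of the four levels shown).
sample_rubric = {
    "rubric_version": "1.0",
    "skill": "creativity",
    "description": "Assesses originality and creative expression in student artifacts",
    "levels": [
        {"score": 1, "label": "Beginning", "descriptor": "Work shows minimal originality"},
        {"score": 4, "label": "Exemplary", "descriptor": "Highly original work"},
    ],
    "artifact_types": ["drawing", "written_response", "prototype"],
    "evaluation_modalities": ["vision", "text", "multimodal"],
}

print(validate_rubric(sample_rubric))  # prints []
```

To check a file on disk, the same function can be called on the result of `json.load`. Returning a list of problems instead of raising on the first one lets a dataset script report every issue in a rubric at once.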
diff --git a/issue-2-ai-skill-evaluator/notebooks/01_model_benchmarking.ipynb b/issue-2-ai-skill-evaluator/notebooks/01_model_benchmarking.ipynb
@@ -0,0 +1,169 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Model Benchmarking for Skill Evaluation\n",
+        "\n",
+        "This notebook is a starter for benchmarking multimodal models on rubric-based artifact evaluation.\n",
+        "For this project, benchmarking means comparing candidate VLMs on:\n",
+        "- zero-shot rubric alignment,\n",
+        "- inference latency,\n",
+        "- estimated cost per evaluation,\n",
+        "- practicality for the INR 0.10 target.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Setup\n",
+        "\n",
+        "Pick a model checkpoint that fits local hardware. The example below shows a zero-shot path for LLaVA-style checkpoints using `transformers`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from pathlib import Path\n",
+        "import time\n",
+        "\n",
+        "import pandas as pd\n",
+        "from PIL import Image\n",
+        "\n",
+        "# Uncomment when running with a supported checkpoint locally.\n",
+        "# import torch\n",
+        "# from transformers import AutoProcessor, LlavaForConditionalGeneration\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Load a Test Artifact\n",
+        "\n",
+        "Point this to a sample student artifact image before running model inference."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "artifact_path = Path(\"../data/sample_artifact.png\")\n",
+        "\n",
+        "if artifact_path.exists():\n",
+        "    test_image = Image.open(artifact_path).convert(\"RGB\")\n",
+        "    display(test_image)\n",
+        "else:\n",
+        "    print(f\"Add a sample image at {artifact_path} to run inference.\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Zero-Shot Inference Template\n",
+        "\n",
+        "Use this block to test a candidate model such as LLaVA-1.5 or InternVL2 on a rubric prompt."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "MODEL_NAME = \"llava-hf/llava-1.5-7b-hf\"\n",
+        "PROMPT = \"Evaluate this student artifact for creativity using a 1-4 rubric.\\n\"\n",
+        "PROMPT += \"Return a score, confidence from 0 to 1, and a short justification.\"\n",
+        "\n",
+        "# Example template for local benchmarking.\n",
+        "# processor = AutoProcessor.from_pretrained(MODEL_NAME)\n",
+        "# model = LlavaForConditionalGeneration.from_pretrained(\n",
+        "#     MODEL_NAME,\n",
+        "#     torch_dtype=torch.float16,\n",
+        "#     low_cpu_mem_usage=True,\n",
+        "# )\n",
+        "#\n",
+        "# inputs = processor(text=f\"USER: <image>\\n{PROMPT} ASSISTANT:\", images=test_image, return_tensors=\"pt\")\n",
+        "# start = time.perf_counter()\n",
+        "# output = model.generate(**inputs, max_new_tokens=128)\n",
+        "# inference_seconds = time.perf_counter() - start\n",
+        "# decoded = processor.batch_decode(output, skip_special_tokens=True)[0]\n",
+        "# print(decoded)\n",
+        "# print(f\"Inference time: {inference_seconds:.2f}s\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Benchmark Comparison Table\n",
+        "\n",
+        "Fill in observed latency and token estimates as you benchmark each model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "benchmark_df = pd.DataFrame(\n",
+        "    [\n",
+        "        {\"model\": \"LLaVA-1.5 7B\", \"params_b\": 7, \"inference_time_s\": None, \"avg_tokens_per_eval\": 350, \"cost_per_1m_tokens_usd\": 0.20},\n",
+        "        {\"model\": \"InternVL2 2B\", \"params_b\": 2, \"inference_time_s\": None, \"avg_tokens_per_eval\": 320, \"cost_per_1m_tokens_usd\": 0.12},\n",
+        "        {\"model\": \"Qwen2-VL 2B\", \"params_b\": 2, \"inference_time_s\": None, \"avg_tokens_per_eval\": 340, \"cost_per_1m_tokens_usd\": 0.15},\n",
+        "    ]\n",
+        ")\n",
+        "benchmark_df\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Cost Calculation\n",
+        "\n",
+        "Estimate per-evaluation cost using token pricing assumptions and a USD-to-INR exchange rate."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "exchange_rate = 83.0  # INR per USD\n",
+        "\n",
+        "benchmark_df[\"estimated_cost_inr\"] = (\n",
+        "    benchmark_df[\"cost_per_1m_tokens_usd\"]\n",
+        "    * benchmark_df[\"avg_tokens_per_eval\"]\n",
+        "    / 1_000_000\n",
+        "    * exchange_rate\n",
+        ")\n",
+        "\n",
+        "benchmark_df[[\"model\", \"estimated_cost_inr\"]]\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python",
+      "version": "3.11"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
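The cost cell in the notebook reduces to one formula: price per 1M tokens (USD) × tokens per evaluation ÷ 1,000,000 × INR per USD. The following is a minimal pure-Python sketch of that arithmetic, run against the placeholder figures from the benchmark table. The token counts, per-token prices, and the 83.0 exchange rate are the notebook's assumptions, not measurements, and the names `cost_per_eval_inr`, `TARGET_INR`, and `candidates` are hypothetical.

```python
TARGET_INR = 0.10      # per-evaluation target stated in the project README
EXCHANGE_RATE = 83.0   # INR per USD, the same assumption the notebook uses


def cost_per_eval_inr(cost_per_1m_tokens_usd: float,
                      avg_tokens_per_eval: float,
                      exchange_rate: float = EXCHANGE_RATE) -> float:
    """Estimated cost of one evaluation in INR under per-token pricing."""
    return cost_per_1m_tokens_usd * avg_tokens_per_eval / 1_000_000 * exchange_rate


# (price in USD per 1M tokens, average tokens per evaluation): placeholder values
# copied from the notebook's benchmark table.
candidates = {
    "LLaVA-1.5 7B": (0.20, 350),
    "InternVL2 2B": (0.12, 320),
    "Qwen2-VL 2B": (0.15, 340),
}

for name, (price_usd, tokens) in candidates.items():
    cost = cost_per_eval_inr(price_usd, tokens)
    print(f"{name}: {cost:.5f} INR (within target: {cost < TARGET_INR})")
```

Under these placeholder figures every candidate comes out well under the ₹0.10 target (roughly ₹0.003 to ₹0.006 per evaluation), so the conclusion is only as reliable as the token and price assumptions that feed it.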
diff --git a/issue-2-ai-skill-evaluator/requirements.txt b/issue-2-ai-skill-evaluator/requirements.txt
@@ -0,0 +1,13 @@
+torch>=2.0.0
+transformers>=4.40.0
+pillow>=10.0.0
+pandas>=2.0.0
+numpy>=1.24.0
+jupyter>=1.0.0
+matplotlib>=3.7.0
+seaborn>=0.12.0
+tqdm>=4.65.0
+datasets>=2.14.0
+accelerate>=0.24.0
+peft>=0.6.0
+bitsandbytes>=0.41.0
diff --git a/issue-2-ai-skill-evaluator/src/__init__.py b/issue-2-ai-skill-evaluator/src/__init__.py
@@ -0,0 +1 @@
+"""Utilities for the AI skill evaluator project."""