- Frontend: React + Tailwind CSS + react-force-graph (for the visualization) + framer-motion (for "crazy" animations).
- Backend: FastAPI + asyncio (for concurrency) + Tree-sitter (parsing) + ChromaDB (vector store).
- AI: OpenRouter (
nvidia/nemotron-3-nano-30b-a3b:free) + AWS Bedrock (Titan Embeddings v2).
Phase 1: The "Skeleton & Ingestion" (Days 1-2)
Goal: Get the system running and able to "eat" a GitHub repository.
Backend Deliverables (FastAPI):
-
Project Shell: Setup FastAPI with uvicorn. Create the HybridStorageManager class to handle RAM vs. Disk storage logic.
-
Ingestion Endpoint: Create POST /api/v1/repository/ingest.
-
Action: Use subprocess to run git clone --depth 1 <url> to a temp dir.
-
Action: Use httpx or PyGithub to fetch metadata (stars, forks) via GitHub API.
-
-
Tree-sitter Setup: Install tree-sitter and tree-sitter-languages (Python, JS, Go). Write a parser that walks the AST and extracts (Node: File) -> (Edge: Import).
Frontend Deliverables (React + Tailwind):
- "Crazy" Landing Page: Create a hero section with a glowing input field for the GitHub URL.
- UI Tip: Use a "Matrix-style" rain or particle background effect to signal "Code Intelligence."
- Repo Loading State: A terminal-like loader that streams real-time logs ("Cloning repo...", "Parsing AST...", "Vectorizing chunks...") as the backend processes data.
Phase 2: The "Brain" & "The Map" (Days 3-5)
Goal: Make the backend smart and the frontend visual.
Backend Deliverables:
-
Vector Pipeline: Implement the Hybrid Vector Engine.
-
Chunking: Split code by class/function (not just lines).
-
Embedding: Send chunks to AWS Bedrock (Titan v2) and store in ChromaDB.
-
-
Graph Endpoint: Create GET /api/v1/repository/graph.
- Return the JSON schema required by react-force-graph (nodes = files, links = dependencies).
Frontend Deliverables (The "Crazy UI"):
- 3D Force-Directed Graph: Integrate react-force-graph-3d.
- Visuals: Make nodes glowing spheres. Dependencies should be laser-like lines.
- Interaction: Clicking a node zooms the camera into it and opens a side panel with file details.
- HUD Layout: Build a "Heads-Up Display" overlay on top of the graph.
- Left Panel: File Explorer (Glassmorphism effect).
- Right Panel: AI Chat/Context (Hidden by default, slides in).
Phase 3: The "Intelligence" Features (Days 6-8)
Goal: Connect the RAG features and "Senior Mentor" mode.
Backend Deliverables:
-
Search Endpoint (Issue-to-Code): Implement POST /api/v1/search.
- Perform Hybrid Search (Dense Vector + Sparse BM25) + Reranking to find relevant files for a query.
-
Mentor Endpoint (Jargon Buster): Implement POST /api/v1/explain.
- Send selected code/text to OpenRouter (Nemotron-3) with a prompt to identify jargon and explain it simply.
-
Intent Endpoint: Fetch PR history for a file, summarize it using Map-Reduce if token count > 15k, and return the "Architectural Intent".
Frontend Deliverables:
- Contextual Chat: When a user selects a file in the graph, allow them to "Ask the Repo".
- UI Tip: Use a typewriter effect for AI responses. Highlight code snippets with syntax highlighting.
- "Jargon Hover": If the user toggles "Junior Mode", highlight complex words in the UI. Hovering them shows a tooltip with the "Student-Friendly Analogy".
Phase 4: Polish & Production (Days 9-10)
Goal: Stability and user experience refinement.
Steps:
-
Rate Limit Guardrails: Implement the asyncio.Semaphore logic to prevent AWS Bedrock 429 errors during mass vectorization.
-
The "Fat Repo" Check: Add the 1MB file size limit check in the parsing loop to prevent memory crashes.
-
Deployment:
- Frontend: Vercel or Netlify.
- Backend: AWS EC2 or Render (Dockerized). Use a persistent volume for ChromaDB if you aren't using S3 yet.
Phase 5: Resilience, Security & Institutional Memory (The "Production" Layer) Goal: Transform the prototype into a battle-hardened application capable of handling large repositories, hostile inputs, and deep context retrieval without crashing.
- The "Fat Repo" Shield (Ingestion Protection) The Problem: Parsing massive autogenerated files (like package-lock.json or minified bundles) causes memory spikes that crash the server.
Technical Implementation:
Pre-Flight Filtering: Implement a middleware layer is_safe_to_process() that inspects file metadata before reading content.
Hard Limits: Enforce a strict 1MB file size limit. Files exceeding this are logged as warnings and skipped, ensuring the graph generation continues for the rest of the repo.
Blocklist: Automatically exclude high-noise directories (node_modules, dist, pycache, .git) to reduce vector noise.
- Institutional Memory (GitHub GraphQL Integration) The Problem: The current system understands what the code does, but not why it was written. It lacks the context of past decisions found in PRs and Issues.
Technical Implementation:
GraphQL Client: Integrate a lightweight GraphQL client (python-graphql-client or requests) to query the GitHub GraphQL API v4.
Single-Shot Fetching: Instead of making 100+ REST API calls, execute a single complex query to retrieve the last 50 merged PRs, their associated issue threads, and the "Files Changed" list in one network round-trip.
Context Mapping: Map retrieved PR descriptions to specific file nodes in the graph. When a user clicks a file, the system displays "Related PRs" to show historical intent.
- Rate Limit Architecture (The Traffic Control) The Problem: Rapidly vectorizing 1,000+ code chunks triggers 429 Too Many Requests errors from LLM providers (OpenRouter/Bedrock), causing data gaps.
Technical Implementation:
Async Semaphores: Implement asyncio.Semaphore(n) to strictly cap the number of concurrent outbound requests (e.g., max 10 parallel embedding tasks).
Exponential Backoff: Wrap external API calls with a resilience library (like tenacity). If a request fails, the system automatically pauses (jittered wait) and retries up to 5 times before failing gracefully.
- Security Guardrails (Prompt Injection Defense) The Problem: Malicious actors could plant "Ignore previous instructions" commands inside public GitHub issues to hijack the AI.
Technical Implementation:
XML Sandboxing: Wrap all untrusted data (code snippets, issue comments) in strict XML tags (e.g., <untrusted_context>) within the system prompt.
Sandboxed Instructions: Explicitly instruct the LLM to treat all content within these tags as inert strings, preventing command execution.
- Automated Onboarding (Environment Setup) The Problem: Students often struggle to simply get a repo running.
Technical Implementation:
Deterministic Templating: A rule-based engine scans the root directory for configuration files (requirements.txt, package.json, Dockerfile, Cargo.toml).
Script Generation: Based on the detected stack, the system dynamically generates a copy-pasteable setup.sh (or PowerShell) script that installs dependencies and starts the local server.
Phase 6: The "Bharat" & "Contributor" Modules (Days 11-12)
Goal: Fulfill the specific "AI for Bharat" hackathon requirements (Regional Language Support & Issue Matching).
1. The Indic Bridge (Multilingual Explanations)
- The Problem: The current "Jargon Buster" only works in English, excluding non-native speakers (a core demographic for the hackathon).
- Technical Implementation:
- Language Parameter: Update the
POST /api/v1/explainendpoint to accept alanguagefield (e.g.,hi(Hindi),ta(Tamil),hinglish). - System Prompt Injection: Dynamically append a linguistic instruction to the OpenRouter/Bedrock prompt:
"Output the explanation in [Target Language]. If the target is Hinglish, use Roman script with common English technical terms (e.g., 'Function call kar raha hai')."
- Frontend Toggle: Add a simple dropdown in the "Senior Mentor" panel to switch languages instantly.
2. The "Good First Issue" Matcher (Static Logic)
- The Problem: Beginners don't know where to start contributing.
- Technical Implementation:
- Issue Scraper: Create a new endpoint
GET /api/v1/issues/recommendthat fetches OPEN issues from GitHub with labels likegood first issue,beginner, orhelp wanted. - Duplicate Work Detection: Implement a logic check that scans the issue's timeline. If an issue is referenced by an OPEN Pull Request, flag it as
⚠️ In Progress. - Skill Mapping: A simple keyword matcher that checks the issue title/body against the repo's language stack (e.g., if Repo is 90% Python, tag issues as "Python Required").
3. Code Entity Extraction & Indexing
- The Problem: The frontend needs to know what functions, classes, and methods exist within each file to help users jump to the exact symbol they need to understand or modify.
- Technical Implementation:
- Tree-sitter Entity Queries: Enhance the existing AST parsing in
app/services/parser.py(which previously only found imports) to also execute language-specific queries (Python, JS/TS, Go) forfunction_definition,class_definition,method_definition, etc. - Node Enhancement: Append an
extracted_namesarray to every fileNodereturned in theGET /api/v1/repository/graphendpoint payload, giving complete symbol visibility per file.
- Tree-sitter Entity Queries: Enhance the existing AST parsing in
You are absolutely right. The previous "Router" model was just a glorified FAQ bot. To make "Repo Buddy" truly valuable—and to win the Hackathon—it needs to be an Agentic Workflow Engine, not just a chatbot.
Instead of just telling the user what to do, it should plan the mission for them.
Here is the Advanced Phase 7: The "DevLens Architect" (Agentic Repo Buddy). This replaces the simple router with a state-aware agent that guides the user from "I want to help" to "git push".
Goal: Transform the chatbot into an active "Pair Programmer" that creates a step-by-step contribution plan based on the specific type of issue.
-
The Upgrade: instead of treating every message as a new query, the bot maintains a "Session State" (e.g.,
Current_Mission: Fix Issue #42). - Technical Implementation:
-
Workflow Engine: Implement a simple State Machine in Python (or use
LangGraphif you are feeling adventurous, but a Pythonclassworks for hackathons). - The 3 Modes: The bot detects the type of work and switches logic:
-
Mode A: The Exterminator (Bug Fixes)
$\rightarrow$ Focuses on reproducing the error and finding the faulty function. -
Mode B: The Builder (New Features)
$\rightarrow$ Focuses on architectural fit and where to add new files. -
Mode C: The Janitor (Refactoring/Docs)
$\rightarrow$ Focuses on dependency safety and isolating changes.
- The Problem: Standard bots just read the issue text. DevLens has a Dependency Graph (Phase 2). It should use it.
- Technical Implementation:
- Graph-RAG Retrieval: When a user selects an issue, the Agent doesn't just read the text. It:
- Extracts keywords (e.g., "Login failed").
- Hits the Vector Store (Phase 3) to find
auth.py. - CRITICAL STEP: It queries the Tree-sitter Graph (Phase 2) to find *what depends on
auth.py*.
- The "Blast Radius" Calculation: The bot tells the user: "If you touch
auth.py, you might breakuser_profile.ts. Be careful."
- The Upgrade: Instead of a generic "Go fix it," generate a specific Todo List.
- Technical Implementation:
- Prompt Strategy: Feed the Issue + The Linked Code +
CONTRIBUTING.mdinto the LLM. - Output Structure:
**Mission Plan for Issue #102:**
1. [ ] **Reproduce:** Run `pytest tests/test_auth.py` (I detected this is the relevant test).
2. [ ] **Locate:** The bug is likely in `AuthService.login()` in `src/services/auth.py`.
3. [ ] **Fix:** Ensure you handle the `NullReference` exception here.
4. [ ] **Verify:** Run the linter: `npm run lint`.
- The Feature: Don't make the user guess the git commands.
- Technical Implementation:
- Based on the
CONTRIBUTING.md(which you parsed in Phase 5), generate the exact setup commands. - Chat Output:
"Ready to start? Run this terminal command:"
git checkout -b fix/issue-102-login-error
# I noticed this repo uses poetry
poetry install
5a. The "Mission Update" Loop (Validation)
- The Problem: The Agent fires-and-forgets instructions. It needs to react to terminal outputs.
- Technical Implementation:
- Update Endpoint:
POST /api/v1/chatbotshould accept an optionalmission_contextpayload.
{
"message": "Error: Module not found",
"mission_id": "issue-102",
"current_step": 1,
"type": "terminal_output" // Tell the AI this is a system log, not user chat
}
- Logic:
- Input: Terminal Output (
Error: 401 Unauthorized). - Action: The Agent doesn't just "Analyze." It performs a "Rollback Check."
- Prompt: "Did the last command cause this error, or did it reveal an existing one?"
- If it caused it, the next step is automatically:
git checkout .(Revert).
- Role: sits between the "Investigator" and the "Tactical Planner."
- Logic:
- Investigator finds
auth.py(2000 lines). - Sniper asks LLM: "Based on the issue 'Login Failed', which functions in
auth.pyare relevant?" - LLM responds:
login()andverify_token(). - Sniper extracts only those 50 lines + imports.
- Result: High-signal context, low token usage.
- Logic: If
type == "terminal_output", the Agent uses a different prompt: "Analyze this error log specifically against the previous instruction. Do not change the subject."
Goal: Transform DevLens from a generic tool into a personalized "Co-Pilot" that adapts to the user's skill level and filters out dead or overwhelming repositories.
1. The "Gatekeeper" (Repo Health Audit)
-
The Problem: Beginners often pick repositories that are abandoned ("dead"), highly competitive (too many open PRs), or architecturally overwhelming (Dependency Hell), leading to immediate discouragement.
-
Technical Implementation:
-
Feasibility Endpoint: Create a pre-ingestion check
GET /api/v1/gatekeeper?url=.... -
The "Liveness" Check: Fetch the
last_commit_date. If it is > 1 year old$\rightarrow$ Flag as 🔴 "Inactive/Dead". -
The "Traffic" Check: Count open Pull Requests. If > 50
$\rightarrow$ Flag as 🟡 "High Competition". -
The "Complexity" Check: Parse
requirements.txtorpackage.json. -
If dependencies > 500
$\rightarrow$ Flag as "Expert Level". -
If standard stack
$\rightarrow$ Flag as "Beginner Friendly ✅". -
Verdict Generation: The UI blocks or warns the user before the expensive Graph generation begins.
2. The User Context Engine (Session-Based Profiling)
-
The Problem: A "First-Year Student" needs simple analogies (e.g., "This function is like a traffic cop"), while a "Senior Architect" needs concise technical facts (e.g., "This handles race conditions"). A one-size-fits-all AI response fails both.
-
Technical Implementation:
-
Onboarding Modal (Frontend): A "One-Minute Calibration" popup collecting:
-
Level: [Student / Junior / Senior]
-
Language: [English / Hindi / Hinglish]
-
Goal: [Learning / Contributing]
-
Context Injection (Backend): The backend does not use a database for this. Instead, the frontend passes this
user_profileJSON header with every request. -
Dynamic System Prompt: The prompt construction logic injects tone modifiers:
-
If Student: "Explain using real-world analogies (cooking, traffic). Avoid jargon."
-
If Hinglish: "Reply in Roman Hindi mixed with English technical terms."
3. The "Anti-Gravity" Handover (Local Setup Generator)
- The Problem: We removed the Cloud IDE to prioritize safety and learning. We must now make the manual transition to the user's local terminal feel seamless ("Magical").
- Technical Implementation:
- Command Synthesizer: Based on the repo's specific package manager (detected in Phase 5), generate a precise block of terminal commands.
- The "One-Click" Copy:
# 🚀 Mission Start: Paste this into your terminal
git clone https://github.com/owner/repo.git
cd repo
# Detected Poetry project:
poetry install
# Create your mission branch:
git checkout -b fix/issue-102-login-bug
- Safety Check: The Agent explicitly warns the user: "Do not run this if you don't have Python 3.10 installed. Check your version first."