devlens/implementation.md at main · haragam22/devlens

The Stack Strategy

Frontend: React + Tailwind CSS + react-force-graph (for the visualization) + framer-motion (for "crazy" animations).
Backend: FastAPI + asyncio (for concurrency) + Tree-sitter (parsing) + ChromaDB (vector store).
AI: OpenRouter (nvidia/nemotron-3-nano-30b-a3b:free) + AWS Bedrock (Titan Embeddings v2).

---

Phase 1: The "Skeleton & Ingestion" (Days 1-2)

Goal: Get the system running and able to "eat" a GitHub repository.

Backend Deliverables (FastAPI):

Project Shell: Setup FastAPI with uvicorn. Create the HybridStorageManager class to handle RAM vs. Disk storage logic.
Ingestion Endpoint: Create POST /api/v1/repository/ingest.
- Action: Use subprocess to run git clone --depth 1 <url> to a temp dir.
- Action: Use httpx or PyGithub to fetch metadata (stars, forks) via GitHub API.
Tree-sitter Setup: Install tree-sitter and tree-sitter-languages (Python, JS, Go). Write a parser that walks the AST and extracts (Node: File) -> (Edge: Import).

Frontend Deliverables (React + Tailwind):

"Crazy" Landing Page: Create a hero section with a glowing input field for the GitHub URL.
- UI Tip: Use a "Matrix-style" rain or particle background effect to signal "Code Intelligence."
Repo Loading State: A terminal-like loader that streams real-time logs ("Cloning repo...", "Parsing AST...", "Vectorizing chunks...") as the backend processes data.

---

Phase 2: The "Brain" & "The Map" (Days 3-5)

Goal: Make the backend smart and the frontend visual.

Backend Deliverables:

Vector Pipeline: Implement the Hybrid Vector Engine.
- Chunking: Split code by class/function (not just lines).
- Embedding: Send chunks to AWS Bedrock (Titan v2) and store in ChromaDB.
Graph Endpoint: Create GET /api/v1/repository/graph.
- Return the JSON schema required by react-force-graph (nodes = files, links = dependencies).

Frontend Deliverables (The "Crazy UI"):

3D Force-Directed Graph: Integrate react-force-graph-3d.
- Visuals: Make nodes glowing spheres. Dependencies should be laser-like lines.
- Interaction: Clicking a node zooms the camera into it and opens a side panel with file details.
HUD Layout: Build a "Heads-Up Display" overlay on top of the graph.
- Left Panel: File Explorer (Glassmorphism effect).
- Right Panel: AI Chat/Context (Hidden by default, slides in).

---

Phase 3: The "Intelligence" Features (Days 6-8)

Goal: Connect the RAG features and "Senior Mentor" mode.

Backend Deliverables:

Search Endpoint (Issue-to-Code): Implement POST /api/v1/search.
- Perform Hybrid Search (Dense Vector + Sparse BM25) + Reranking to find relevant files for a query.
Mentor Endpoint (Jargon Buster): Implement POST /api/v1/explain.
- Send selected code/text to OpenRouter (Nemotron-3) with a prompt to identify jargon and explain it simply.
Intent Endpoint: Fetch PR history for a file, summarize it using Map-Reduce if token count > 15k, and return the "Architectural Intent".

Frontend Deliverables:

Contextual Chat: When a user selects a file in the graph, allow them to "Ask the Repo".
- UI Tip: Use a typewriter effect for AI responses. Highlight code snippets with syntax highlighting.
"Jargon Hover": If the user toggles "Junior Mode", highlight complex words in the UI. Hovering them shows a tooltip with the "Student-Friendly Analogy".

---

Phase 4: Polish & Production (Days 9-10)

Goal: Stability and user experience refinement.

Steps:

Rate Limit Guardrails: Implement the asyncio.Semaphore logic to prevent AWS Bedrock 429 errors during mass vectorization.
The "Fat Repo" Check: Add the 1MB file size limit check in the parsing loop to prevent memory crashes.
Deployment:
- Frontend: Vercel or Netlify.
- Backend: AWS EC2 or Render (Dockerized). Use a persistent volume for ChromaDB if you aren't using S3 yet.

Phase 5: Resilience, Security & Institutional Memory (The "Production" Layer) Goal: Transform the prototype into a battle-hardened application capable of handling large repositories, hostile inputs, and deep context retrieval without crashing.

The "Fat Repo" Shield (Ingestion Protection) The Problem: Parsing massive autogenerated files (like package-lock.json or minified bundles) causes memory spikes that crash the server.

Technical Implementation:

Pre-Flight Filtering: Implement a middleware layer is_safe_to_process() that inspects file metadata before reading content.

Hard Limits: Enforce a strict 1MB file size limit. Files exceeding this are logged as warnings and skipped, ensuring the graph generation continues for the rest of the repo.

Blocklist: Automatically exclude high-noise directories (node_modules, dist, pycache, .git) to reduce vector noise.

Institutional Memory (GitHub GraphQL Integration) The Problem: The current system understands what the code does, but not why it was written. It lacks the context of past decisions found in PRs and Issues.

Technical Implementation:

GraphQL Client: Integrate a lightweight GraphQL client (python-graphql-client or requests) to query the GitHub GraphQL API v4.

Single-Shot Fetching: Instead of making 100+ REST API calls, execute a single complex query to retrieve the last 50 merged PRs, their associated issue threads, and the "Files Changed" list in one network round-trip.

Context Mapping: Map retrieved PR descriptions to specific file nodes in the graph. When a user clicks a file, the system displays "Related PRs" to show historical intent.

Rate Limit Architecture (The Traffic Control) The Problem: Rapidly vectorizing 1,000+ code chunks triggers 429 Too Many Requests errors from LLM providers (OpenRouter/Bedrock), causing data gaps.

Technical Implementation:

Async Semaphores: Implement asyncio.Semaphore(n) to strictly cap the number of concurrent outbound requests (e.g., max 10 parallel embedding tasks).

Exponential Backoff: Wrap external API calls with a resilience library (like tenacity). If a request fails, the system automatically pauses (jittered wait) and retries up to 5 times before failing gracefully.

Security Guardrails (Prompt Injection Defense) The Problem: Malicious actors could plant "Ignore previous instructions" commands inside public GitHub issues to hijack the AI.

Technical Implementation:

XML Sandboxing: Wrap all untrusted data (code snippets, issue comments) in strict XML tags (e.g., <untrusted_context>) within the system prompt.

Sandboxed Instructions: Explicitly instruct the LLM to treat all content within these tags as inert strings, preventing command execution.

Automated Onboarding (Environment Setup) The Problem: Students often struggle to simply get a repo running.

Technical Implementation:

Deterministic Templating: A rule-based engine scans the root directory for configuration files (requirements.txt, package.json, Dockerfile, Cargo.toml).

Script Generation: Based on the detected stack, the system dynamically generates a copy-pasteable setup.sh (or PowerShell) script that installs dependencies and starts the local server.

Phase 6: The "Bharat" & "Contributor" Modules (Days 11-12)

Goal: Fulfill the specific "AI for Bharat" hackathon requirements (Regional Language Support & Issue Matching).

1. The Indic Bridge (Multilingual Explanations)

The Problem: The current "Jargon Buster" only works in English, excluding non-native speakers (a core demographic for the hackathon).
Technical Implementation:
Language Parameter: Update the POST /api/v1/explain endpoint to accept a language field (e.g., hi (Hindi), ta (Tamil), hinglish).
System Prompt Injection: Dynamically append a linguistic instruction to the OpenRouter/Bedrock prompt:

"Output the explanation in [Target Language]. If the target is Hinglish, use Roman script with common English technical terms (e.g., 'Function call kar raha hai')."

Frontend Toggle: Add a simple dropdown in the "Senior Mentor" panel to switch languages instantly.

2. The "Good First Issue" Matcher (Static Logic)

The Problem: Beginners don't know where to start contributing.
Technical Implementation:
Issue Scraper: Create a new endpoint GET /api/v1/issues/recommend that fetches OPEN issues from GitHub with labels like good first issue, beginner, or help wanted.
Duplicate Work Detection: Implement a logic check that scans the issue's timeline. If an issue is referenced by an OPEN Pull Request, flag it as ⚠️ In Progress.
Skill Mapping: A simple keyword matcher that checks the issue title/body against the repo's language stack (e.g., if Repo is 90% Python, tag issues as "Python Required").

3. Code Entity Extraction & Indexing

The Problem: The frontend needs to know what functions, classes, and methods exist within each file to help users jump to the exact symbol they need to understand or modify.
Technical Implementation:
- Tree-sitter Entity Queries: Enhance the existing AST parsing in app/services/parser.py (which previously only found imports) to also execute language-specific queries (Python, JS/TS, Go) for function_definition, class_definition, method_definition, etc.
- Node Enhancement: Append an extracted_names array to every file Node returned in the GET /api/v1/repository/graph endpoint payload, giving complete symbol visibility per file.

---

You are absolutely right. The previous "Router" model was just a glorified FAQ bot. To make "Repo Buddy" truly valuable—and to win the Hackathon—it needs to be an Agentic Workflow Engine, not just a chatbot.

Instead of just telling the user what to do, it should plan the mission for them.

Here is the Advanced Phase 7: The "DevLens Architect" (Agentic Repo Buddy). This replaces the simple router with a state-aware agent that guides the user from "I want to help" to "git push".

Phase 7: The "DevLens Architect" (End-to-End Contribution Agent)

Goal: Transform the chatbot into an active "Pair Programmer" that creates a step-by-step contribution plan based on the specific type of issue.

1. The "Mission Control" (State-Aware Workflow)

The Upgrade: instead of treating every message as a new query, the bot maintains a "Session State" (e.g., Current_Mission: Fix Issue #42).
Technical Implementation:
Workflow Engine: Implement a simple State Machine in Python (or use LangGraph if you are feeling adventurous, but a Python class works for hackathons).
The 3 Modes: The bot detects the type of work and switches logic:
Mode A: The Exterminator (Bug Fixes) $\rightarrow$ Focuses on reproducing the error and finding the faulty function.
Mode B: The Builder (New Features) $\rightarrow$ Focuses on architectural fit and where to add new files.
Mode C: The Janitor (Refactoring/Docs) $\rightarrow$ Focuses on dependency safety and isolating changes.

2. The "Investigator" Agent (Connecting Issues to the Graph)

The Problem: Standard bots just read the issue text. DevLens has a Dependency Graph (Phase 2). It should use it.
Technical Implementation:
Graph-RAG Retrieval: When a user selects an issue, the Agent doesn't just read the text. It:

Extracts keywords (e.g., "Login failed").
Hits the Vector Store (Phase 3) to find auth.py.
CRITICAL STEP: It queries the Tree-sitter Graph (Phase 2) to find *what depends on auth.py*.

The "Blast Radius" Calculation: The bot tells the user: "If you touch auth.py, you might break user_profile.ts. Be careful."

3. The "Tactical Planner" (Step-by-Step Guide)

The Upgrade: Instead of a generic "Go fix it," generate a specific Todo List.
Technical Implementation:
Prompt Strategy: Feed the Issue + The Linked Code + CONTRIBUTING.md into the LLM.
Output Structure:

**Mission Plan for Issue #102:**
1. [ ] **Reproduce:** Run `pytest tests/test_auth.py` (I detected this is the relevant test).
2. [ ] **Locate:** The bug is likely in `AuthService.login()` in `src/services/auth.py`.
3. [ ] **Fix:** Ensure you handle the `NullReference` exception here.
4. [ ] **Verify:** Run the linter: `npm run lint`.

4. The "Git Commander" (Actionable Commands)

The Feature: Don't make the user guess the git commands.
Technical Implementation:
Based on the CONTRIBUTING.md (which you parsed in Phase 5), generate the exact setup commands.
Chat Output:

"Ready to start? Run this terminal command:"

git checkout -b fix/issue-102-login-error
# I noticed this repo uses poetry
poetry install

5a. The "Mission Update" Loop (Validation)

The Problem: The Agent fires-and-forgets instructions. It needs to react to terminal outputs.
Technical Implementation:
Update Endpoint: POST /api/v1/chatbot should accept an optional mission_context payload.

{
  "message": "Error: Module not found",
  "mission_id": "issue-102",
  "current_step": 1, 
  "type": "terminal_output" // Tell the AI this is a system log, not user chat
}

5b. "Mission Update" Loop (The Error Handler)

Logic:
Input: Terminal Output (Error: 401 Unauthorized).
Action: The Agent doesn't just "Analyze." It performs a "Rollback Check."
Prompt: "Did the last command cause this error, or did it reveal an existing one?"
If it caused it, the next step is automatically: git checkout . (Revert).

6. The "Sniper" (Context Pruner)

Role: sits between the "Investigator" and the "Tactical Planner."
Logic:

Investigator finds auth.py (2000 lines).
Sniper asks LLM: "Based on the issue 'Login Failed', which functions in auth.py are relevant?"
LLM responds: login() and verify_token().
Sniper extracts only those 50 lines + imports.
Result: High-signal context, low token usage.

Logic: If type == "terminal_output", the Agent uses a different prompt: "Analyze this error log specifically against the previous instruction. Do not change the subject."

Phase 8: The "Personalization & Feasibility" Layer (Day 15)

Goal: Transform DevLens from a generic tool into a personalized "Co-Pilot" that adapts to the user's skill level and filters out dead or overwhelming repositories.

1. The "Gatekeeper" (Repo Health Audit)

The Problem: Beginners often pick repositories that are abandoned ("dead"), highly competitive (too many open PRs), or architecturally overwhelming (Dependency Hell), leading to immediate discouragement.
Technical Implementation:
Feasibility Endpoint: Create a pre-ingestion check GET /api/v1/gatekeeper?url=....
The "Liveness" Check: Fetch the last_commit_date. If it is > 1 year old $\rightarrow$ Flag as 🔴 "Inactive/Dead".
The "Traffic" Check: Count open Pull Requests. If > 50 $\rightarrow$ Flag as 🟡 "High Competition".
The "Complexity" Check: Parse requirements.txt or package.json.
If dependencies > 500 $\rightarrow$ Flag as "Expert Level".
If standard stack $\rightarrow$ Flag as "Beginner Friendly ✅".
Verdict Generation: The UI blocks or warns the user before the expensive Graph generation begins.

2. The User Context Engine (Session-Based Profiling)

The Problem: A "First-Year Student" needs simple analogies (e.g., "This function is like a traffic cop"), while a "Senior Architect" needs concise technical facts (e.g., "This handles race conditions"). A one-size-fits-all AI response fails both.
Technical Implementation:
Onboarding Modal (Frontend): A "One-Minute Calibration" popup collecting:
Level: [Student / Junior / Senior]
Language: [English / Hindi / Hinglish]
Goal: [Learning / Contributing]
Context Injection (Backend): The backend does not use a database for this. Instead, the frontend passes this user_profile JSON header with every request.
Dynamic System Prompt: The prompt construction logic injects tone modifiers:
If Student: "Explain using real-world analogies (cooking, traffic). Avoid jargon."
If Hinglish: "Reply in Roman Hindi mixed with English technical terms."

3. The "Anti-Gravity" Handover (Local Setup Generator)

The Problem: We removed the Cloud IDE to prioritize safety and learning. We must now make the manual transition to the user's local terminal feel seamless ("Magical").
Technical Implementation:
Command Synthesizer: Based on the repo's specific package manager (detected in Phase 5), generate a precise block of terminal commands.
The "One-Click" Copy:

# 🚀 Mission Start: Paste this into your terminal
git clone https://github.com/owner/repo.git
cd repo
# Detected Poetry project:
poetry install
# Create your mission branch:
git checkout -b fix/issue-102-login-bug

Safety Check: The Agent explicitly warns the user: "Do not run this if you don't have Python 3.10 installed. Check your version first."

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The Stack Strategy

---

---

---

---

---

Phase 7: The "DevLens Architect" (End-to-End Contribution Agent)

1. The "Mission Control" (State-Aware Workflow)

2. The "Investigator" Agent (Connecting Issues to the Graph)

3. The "Tactical Planner" (Step-by-Step Guide)

4. The "Git Commander" (Actionable Commands)

5b. "Mission Update" Loop (The Error Handler)

6. The "Sniper" (Context Pruner)

Phase 8: The "Personalization & Feasibility" Layer (Day 15)

---

FilesExpand file tree

implementation.md

Latest commit

History

implementation.md

File metadata and controls

The Stack Strategy

---

---

---

---

---

Phase 7: The "DevLens Architect" (End-to-End Contribution Agent)

1. The "Mission Control" (State-Aware Workflow)

2. The "Investigator" Agent (Connecting Issues to the Graph)

3. The "Tactical Planner" (Step-by-Step Guide)

4. The "Git Commander" (Actionable Commands)

5b. "Mission Update" Loop (The Error Handler)

6. The "Sniper" (Context Pruner)

Phase 8: The "Personalization & Feasibility" Layer (Day 15)

---