Research: Git Integration Layer for AST-RAG
🎯 Goal
Design and prototype a full Git integration layer that adds commit/author nodes and blame-tracking edges to the AST graph, enabling queries like "who wrote this function?" and "when was this bug introduced?".
📋 Current State
- ✅ MVCC versioning via `valid_from` / `valid_to` properties (commit hashes)
- ✅ Incremental updates via `update_from_git()` using git diff
- ✅ Diff computation between commits
- ❌ No `Commit` or `Author` nodes in the graph
- ❌ No edges like `AUTHORED`, `COMMITTED_ON`, `MODIFIED`, `INTRODUCED`
- ❌ No blame information (who wrote each line/function)
- ❌ No temporal queries (show graph state at commit X)
🔍 Research Questions
1. Graph Schema Design
Questions:
- Should `Commit` and `Author` be separate node types or properties?
- How to handle merge commits (multiple parents)?
- Should every AST node have edges to commits, or only track changes?
Proposed Schema:
    // New Node Types
    (:Commit {
      hash: str,
      author_email: str,
      author_name: str,
      committer_email: str,
      committer_name: str,
      message: str,
      timestamp: datetime,
      parents: list[str]
    })
    (:Author {
      email: str,
      name: str,
      first_commit: datetime,
      last_commit: datetime
    })
    // New Edge Types
    (:Author)-[:AUTHORED]->(:Commit)
    (:Commit)-[:MODIFIED]->(:Function {valid_from: commit.hash})
    (:Commit)-[:INTRODUCED]->(:Function)
    (:Function)-[:CHANGED_IN]->(:Commit)
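One way the proposed node and edge types could be populated is with idempotent `MERGE` statements, so that re-running an update does not duplicate commits or authors. The sketch below only builds the query text and its parameter map. The helper name `build_commit_upsert` and the input dictionary layout are assumptions for illustration, and running the query against the graph database is left out.

```python
from datetime import datetime, timezone

# Hypothetical upsert for one commit and its author, following the proposed schema.
COMMIT_UPSERT = """
MERGE (a:Author {email: $author_email})
  ON CREATE SET a.name = $author_name,
                a.first_commit = $timestamp,
                a.last_commit = $timestamp
  ON MATCH SET a.first_commit = CASE WHEN $timestamp < a.first_commit
                                     THEN $timestamp ELSE a.first_commit END,
               a.last_commit = CASE WHEN $timestamp > a.last_commit
                                    THEN $timestamp ELSE a.last_commit END
MERGE (c:Commit {hash: $hash})
  ON CREATE SET c.author_email = $author_email,
                c.author_name = $author_name,
                c.committer_email = $committer_email,
                c.committer_name = $committer_name,
                c.message = $message,
                c.timestamp = $timestamp,
                c.parents = $parents
MERGE (a)-[:AUTHORED]->(c)
"""


def build_commit_upsert(commit: dict) -> tuple[str, dict]:
    """Return the Cypher text and parameters for one commit (illustrative helper)."""
    params = {
        "hash": commit["hash"],
        "author_email": commit["author_email"],
        "author_name": commit["author_name"],
        # Fall back to the author when no separate committer is given.
        "committer_email": commit.get("committer_email", commit["author_email"]),
        "committer_name": commit.get("committer_name", commit["author_name"]),
        "message": commit["message"],
        "timestamp": commit["timestamp"],
        "parents": list(commit.get("parents", [])),
    }
    return COMMIT_UPSERT, params


query, params = build_commit_upsert({
    "hash": "abc123",
    "author_email": "ann@example.com",
    "author_name": "Ann",
    "message": "Fix parser",
    "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
})
```

Keeping `Author` as its own node, keyed by email, lets `first_commit` and `last_commit` be maintained in the same statement that creates the commit.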
2. Performance Impact
Questions:
- How many new nodes/edges will be added? (estimate: one `Commit` node and one `AUTHORED` edge per commit; `Author` nodes are shared across commits)
- Will queries slow down with blame edges?
- Should blame edges be lazy-loaded or precomputed?
Estimates for medium project (1000 commits, 5000 functions):
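- New `Commit` nodes: ~1000
- New `Author` nodes: ~10-50
- New `MODIFIED` edges: ~10,000-50,000 (assuming each commit modifies roughly 10-50 functions; the theoretical ceiling of functions × commits would be 5,000,000)
- Graph size increase: 2-5×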
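The estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch: the function name `estimate_git_overlay` and the figure of 10-50 functions touched per commit are assumptions chosen to match the edge range quoted above, not measurements.

```python
def estimate_git_overlay(commits: int, authors: int, functions_per_commit: int) -> dict:
    """Rough count of the nodes and edges the Git layer would add."""
    return {
        "commit_nodes": commits,
        "author_nodes": authors,
        "authored_edges": commits,  # one AUTHORED edge per commit
        "modified_edges": commits * functions_per_commit,
    }


low = estimate_git_overlay(commits=1000, authors=10, functions_per_commit=10)
high = estimate_git_overlay(commits=1000, authors=50, functions_per_commit=50)
print(low["modified_edges"], high["modified_edges"])  # 10000 50000
```

The edge count is dominated by `commits * functions_per_commit`, so measuring the real average number of functions touched per commit on a sample repository would tighten the estimate considerably.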
3. Blame Analysis Implementation
Approach:
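    import git  # GitPython

    # Use the GitPython blame API
    repo = git.Repo(path)  # path: repository root
    blame = repo.blame('HEAD', 'path/to/file.py')
    line_no = 1
    for commit, lines in blame:
        # commit.author, commit.message, lines
        start, end = line_no, line_no + len(lines) - 1
        line_no = end + 1
        # Link AST nodes overlapping [start, end] to commit.hexsha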
Challenges:
- Blame is line-based, AST nodes are function/class-based
- Need to map line ranges to AST nodes
- Handle moved/renamed functions (git blame -C)
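The line-range mapping mentioned in the challenges above could be done with a plain interval-overlap pass. The sketch below counts, for one AST node, how many of its lines each commit is blamed for. `BlameRange`, `AstNodeSpan`, and `lines_per_commit` are hypothetical names, and it assumes 1-based inclusive line numbers on both sides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BlameRange:
    commit_hash: str
    start_line: int  # 1-based, inclusive
    end_line: int    # inclusive


@dataclass(frozen=True)
class AstNodeSpan:
    node_id: str
    start_line: int  # 1-based, inclusive
    end_line: int    # inclusive


def lines_per_commit(node: AstNodeSpan, blame: list[BlameRange]) -> Counter:
    """Count how many lines of `node` are attributed to each commit."""
    counts: Counter = Counter()
    for entry in blame:
        lo = max(node.start_line, entry.start_line)
        hi = min(node.end_line, entry.end_line)
        if lo <= hi:  # the two ranges overlap
            counts[entry.commit_hash] += hi - lo + 1
    return counts


blame = [
    BlameRange("aaa", 1, 10),
    BlameRange("bbb", 11, 14),
    BlameRange("aaa", 15, 30),
]
func = AstNodeSpan("func:foo", 8, 16)
counts = lines_per_commit(func, blame)
print(counts)  # Counter({'aaa': 5, 'bbb': 4})
```

Every commit with a non-zero count is a candidate for a `MODIFIED` edge, and the per-commit line share could feed a confidence score if one is wanted.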
4. Temporal Queries
Use Cases:
- "Show me the graph as it was at commit abc123"
- "When was function X introduced?"
- "Who last modified this class?"
- "Show all changes between two dates"
Implementation Options:
- Snapshot approach: Store full graph snapshots per commit (expensive)
- Delta approach: Reconstruct state by replaying commits (slow queries)
- Hybrid: Current MVCC + Commit edges (proposed)
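For the hybrid option, commit hashes carry no ordering of their own, so a temporal lookup needs a position index over the history. The sketch below assumes a linear history (oldest first), that `valid_to` names the commit in which a version was superseded (exclusive), and that `None` means the version is still current. These semantics are assumptions about the existing MVCC properties, not confirmed behaviour.

```python
from typing import Optional


def visible_at(valid_from: str, valid_to: Optional[str],
               target: str, order: dict[str, int]) -> bool:
    """True if a node version exists in the graph state at commit `target`."""
    at = order[target]
    if order[valid_from] > at:
        return False  # introduced after the target commit
    return valid_to is None or at < order[valid_to]


def snapshot(versions: list[dict], target: str, order: dict[str, int]) -> list[str]:
    """Ids of all node versions visible at `target`."""
    return [
        v["id"] for v in versions
        if visible_at(v["valid_from"], v["valid_to"], target, order)
    ]


history = ["c1", "c2", "c3", "c4"]  # oldest first
order = {commit: i for i, commit in enumerate(history)}
versions = [
    {"id": "foo@v1", "valid_from": "c1", "valid_to": "c3"},
    {"id": "foo@v2", "valid_from": "c3", "valid_to": None},
    {"id": "bar@v1", "valid_from": "c4", "valid_to": None},
]
print(snapshot(versions, "c2", order))  # ['foo@v1']
print(snapshot(versions, "c4", order))  # ['foo@v2', 'bar@v1']
```

In the graph itself the position could be stored on each `Commit` node, so the same comparison can run inside a query rather than in application code.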
📐 Proposed Architecture
Schema Changes
File: ast_rag/schema/graph_schema.cql
    // New constraints
    CREATE CONSTRAINT commit_hash IF NOT EXISTS FOR (c:Commit) REQUIRE c.hash IS UNIQUE;
    CREATE CONSTRAINT author_email IF NOT EXISTS FOR (a:Author) REQUIRE a.email IS UNIQUE;
    // New indexes
    CREATE INDEX commit_timestamp IF NOT EXISTS FOR (c:Commit) ON (c.timestamp);
    // Assumes AUTHORED edges carry an author_email property; the proposed schema stores it on Commit only
    CREATE INDEX authored_by IF NOT EXISTS FOR ()-[a:AUTHORED]-() ON (a.author_email);
New DTOs
File: ast_rag/dto/git.py (new)
    from datetime import datetime

    from pydantic import BaseModel

    class GitCommit(BaseModel):
        hash: str
        short_hash: str
        author_name: str
        author_email: str
        committer_name: str
        committer_email: str
        message: str
        timestamp: datetime
        parents: list[str]

    class GitBlameEntry(BaseModel):
        commit: GitCommit
        start_line: int
        end_line: int
        path: str
New Services
File: ast_rag/services/git_service.py (new)
    class GitService:
        def extract_commits(self, repo_path: str, from_commit: str, to_commit: str) -> list[GitCommit]: ...
        def get_blame(self, repo_path: str, file_path: str, commit: str) -> list[GitBlameEntry]: ...
        def get_author(self, email: str) -> "Author": ...  # an Author DTO still needs to be defined
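As a sketch of how `extract_commits` might obtain its data without a library dependency, the output of `git log` can be requested with explicit field and record separators and parsed as text. The format placeholders used (`%H`, `%an`, `%ae`, `%cn`, `%ce`, `%at`, `%P`, `%B`, `%x1f`, `%x1e`) are standard `git log` pretty-format codes. The function name `parse_git_log`, the dictionary layout, and the 7-character `short_hash` are assumptions for illustration.

```python
from datetime import datetime, timezone

FIELD_SEP = "\x1f"   # unit separator between fields
RECORD_SEP = "\x1e"  # record separator between commits

# Intended for: git log --format=<LOG_FORMAT> <from_commit>..<to_commit>
LOG_FORMAT = "%H%x1f%an%x1f%ae%x1f%cn%x1f%ce%x1f%at%x1f%P%x1f%B%x1e"


def parse_git_log(raw: str) -> list[dict]:
    """Parse `git log` output produced with LOG_FORMAT into plain dicts."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip("\n")
        if not record:
            continue
        (hash_, author_name, author_email, committer_name,
         committer_email, unix_ts, parents, message) = record.split(FIELD_SEP, 7)
        commits.append({
            "hash": hash_,
            "short_hash": hash_[:7],  # assumption: fixed 7-character abbreviation
            "author_name": author_name,
            "author_email": author_email,
            "committer_name": committer_name,
            "committer_email": committer_email,
            "message": message.strip(),
            "timestamp": datetime.fromtimestamp(int(unix_ts), tz=timezone.utc),
            "parents": parents.split(),  # empty for a root commit, 2+ for merges
        })
    return commits


sample = FIELD_SEP.join([
    "a" * 40, "Ann", "ann@example.com", "Ann", "ann@example.com",
    "1700000000", "b" * 40 + " " + "c" * 40, "Merge branch 'feature'\n",
]) + RECORD_SEP + "\n"
commits = parse_git_log(sample)
print(commits[0]["short_hash"], len(commits[0]["parents"]))  # aaaaaaa 2
```

Each resulting dictionary has the same field names as `GitCommit`, so it could be passed to the DTO constructor directly.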
File: ast_rag/services/graph_updater_service.py (modify)
    def update_from_git(
        # ... existing params
        extract_git_metadata: bool = True,  # New flag
    ) -> DiffResult:
        # After applying diff:
        if extract_git_metadata:
            self._update_git_nodes(diff, new_commit_hash)
🧪 Prototype Plan
Phase 1: Basic Commit/Author Nodes
- Add `Commit` and `Author` node types to schema
- Extract commit metadata during `update_from_git()`
- Create `AUTHORED` and `COMMITTED_ON` edges
- Test with small repository
Phase 2: Blame Integration
- Implement `GitService.get_blame()`
- Map blame line ranges to AST nodes
- Create `MODIFIED` edges from commits to AST nodes
- Add confidence scores (direct blame = 1.0, inherited = 0.5)
Phase 3: Temporal Queries API
- Add `get_node_at_commit(node_id, commit_hash)` method
- Add `get_commit_history(node_id)` method
- Add `get_changes_between(from_commit, to_commit)` method
- CLI commands: `ast-rag blame <function>`, `ast-rag history <function>`
Phase 4: Performance Optimization
- Benchmark query performance with blame edges
- Add edge expiration (don't track every minor change)
- Consider edge compression (group by author/date ranges)
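Edge compression could, for example, collapse consecutive modifications of the same node by the same author into a single edge that carries a date range and a count. The sketch below shows one such grouping. The `Modification` record, the `compress` function, and the shape of the output are assumptions for illustration, not part of the existing code base.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class Modification:
    node_id: str
    author_email: str
    commit_hash: str
    day: date


def compress(mods: list[Modification]) -> list[dict]:
    """Merge runs of consecutive edits by one author to one node into range edges."""
    ordered = sorted(mods, key=lambda m: (m.node_id, m.day))
    edges = []
    # groupby only merges adjacent items, so an author who returns later starts a new run.
    for (node_id, author), run in groupby(ordered, key=lambda m: (m.node_id, m.author_email)):
        items = list(run)
        edges.append({
            "node_id": node_id,
            "author_email": author,
            "first_day": items[0].day,
            "last_day": items[-1].day,
            "commit_count": len(items),
            "last_commit": items[-1].commit_hash,
        })
    return edges


mods = [
    Modification("func:foo", "ann@example.com", "c1", date(2024, 1, 1)),
    Modification("func:foo", "ann@example.com", "c2", date(2024, 1, 3)),
    Modification("func:foo", "bob@example.com", "c3", date(2024, 1, 5)),
    Modification("func:foo", "ann@example.com", "c4", date(2024, 1, 9)),
]
edges = compress(mods)
print([(e["author_email"], e["commit_count"]) for e in edges])
# [('ann@example.com', 2), ('bob@example.com', 1), ('ann@example.com', 1)]
```

The trade-off is that individual commit hashes inside a run are no longer reachable from the edge, so this fits "who last modified" queries better than exact history queries.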
⚠️ Risks & Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| Graph size explosion | High | Limit blame tracking to function/class level, not lines |
| Query performance degradation | Medium | Add indexes, use edge type filtering |
| Git blame is slow | Medium | Cache blame results, lazy loading |
| Merge commit complexity | Low | Track first parent only, or all parents with weights |
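The "first parent only" mitigation for merge commits amounts to walking a single chain through the commit graph. A minimal sketch, assuming the `parents` list of each commit is ordered as Git reports it (mainline parent first) and is available as a dictionary, is shown below. `first_parent_chain` is a hypothetical helper name.

```python
from typing import Optional


def first_parent_chain(head: str, parents: dict[str, list[str]]) -> list[str]:
    """Follow only the first parent of each commit, newest to oldest."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = head
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        commit_parents = parents.get(current, [])
        current = commit_parents[0] if commit_parents else None
    return chain


# "m" is a merge of mainline commit "c2" and feature commit "f1".
parents = {
    "m": ["c2", "f1"],
    "c2": ["c1"],
    "f1": ["c1"],
    "c1": [],
}
chain = first_parent_chain("m", parents)
print(chain)  # ['m', 'c2', 'c1']
```

Commits that are reachable only through a second parent (here `f1`) are skipped, which keeps the history linear but attributes their changes to the merge commit.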
📊 Success Metrics
- `ast-rag blame <function>` works
📚 References
🎯 Deliverables
- Research document (this file) with findings
- Prototype implementation in feature branch
- Performance benchmarks
- Final design document with recommendations
- GitHub issues for implementation tasks
Labels: research, enhancement, git-integration
Priority: High
Estimated Research Time: 2-3 days