Description
In the context of the project's evolution and the existing issues regarding multi-project support, we are missing a crucial dimension: Time (Versioning). Code within the same repository can exist in multiple states (branches, tags, commit history), and the RAG engine must be capable of handling this context.
This task is a "Research of Research" (Meta-Task). Its goal is not immediate feature implementation, but a deep analysis of the problem space, identification of use cases, and the creation of specific technical tasks (Sub-issues) based on findings.
Research Goals
Define key use cases for RAG interaction with versioned code.
Identify technical constraints and risks (performance issues, data duplication).
Formulate a strategy for how raged will "travel" through history (git worktree, shallow clones, diff-based indexing).
Define requirements for the interface (CLI, API) and internal storage structure.
Key Research Areas
1. Use Cases
We need to categorize the queries a user might make:
Diffing: "What is the difference between the implementation of X in branch main vs feature/Y?"
Archeology (History): "Why was this code written this way?" (find the relevant commits and commit messages).
Isolation (Branch Context): Questions that must be answered strictly within the context of a specific version (e.g., for deprecated APIs).
CI/CD Check: Automatic analysis of changes in PRs (diff-only indexing).
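To make the diff-only indexing idea from the CI/CD use case concrete, here is a minimal sketch. It assumes each version can be summarized as a mapping from file path to blob hash (for example, built from the output of `git ls-tree -r <rev>`); the function name `plan_reindex` and the sample data are hypothetical and not part of raged.

```python
def plan_reindex(base: dict, head: dict) -> tuple:
    """Compare two path -> blob-hash snapshots and decide what to touch.

    Returns (to_index, to_drop): paths that are new or changed in `head`,
    and paths that exist only in `base`.
    """
    # A path needs (re)indexing if it is new or its blob hash changed.
    to_index = sorted(p for p, h in head.items() if base.get(p) != h)
    # A path is dropped from the PR's view if it no longer exists in head.
    to_drop = sorted(p for p in base if p not in head)
    return to_index, to_drop


# Hypothetical snapshots of a target branch and a PR head.
base = {"a.py": "h1", "b.py": "h2", "c.py": "h3"}
head = {"a.py": "h1", "b.py": "h9", "d.py": "h4"}
to_index, to_drop = plan_reindex(base, head)
```

Unchanged files (`a.py` here) are never re-embedded, which is the point of indexing only the diff.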
2. Challenges & Mitigations
Storage Explosion (Duplication): If we simply index every branch separately, the vector database will bloat, because the bulk of the code (roughly 90%, an estimate to be verified during the research) is identical across branches.
Hypothesis: Use content hashing (content-addressable storage) or index only diffs.
Context Confusion: The LLM might mix code from different branches in a single response (version hallucinations).
Hypothesis: Strict metadata filtering in the vector DB.
Index Staleness: The develop branch moves forward, but the index still reflects an older commit.
Hypothesis: Integration with git hooks or incremental indexing mechanisms.
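The first two hypotheses (content hashing and strict metadata filtering) can be sketched together. This is an illustration under stated assumptions, not a proposed design: `VersionedIndex` and `chunk_id` are hypothetical names, the store is an in-memory dict, and a substring match stands in for vector search.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


def chunk_id(text: str) -> str:
    """Content address: identical text always maps to the same id."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class VersionedIndex:
    # chunk id -> chunk text (a real store would keep the embedding here too)
    chunks: dict = field(default_factory=dict)
    # ref name (branch or tag) -> ids of the chunks visible in that ref
    refs: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, ref: str, text: str) -> str:
        cid = chunk_id(text)
        # Stored once, no matter how many refs share the same content.
        self.chunks.setdefault(cid, text)
        self.refs[ref].add(cid)
        return cid

    def search(self, ref: str, needle: str) -> list:
        """Substring match standing in for vector search, limited to one ref."""
        visible = self.refs.get(ref, set())
        return sorted(
            self.chunks[cid] for cid in visible if needle in self.chunks[cid]
        )


index = VersionedIndex()
index.add("main", "def parse(x): return int(x)")
index.add("feature/Y", "def parse(x): return int(x)")  # duplicate content
index.add("feature/Y", "def parse_strict(x): return int(x, 10)")
```

Three `add` calls leave only two stored chunks, and a search scoped to `main` can never return `parse_strict`, which addresses the context-confusion risk at retrieval time.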
3. Implementation Strategies
Git Worktree: Should we leverage native worktree support to keep several versions checked out side by side on disk?
Semantic Search Patterns: Are there existing patterns for versioned code among embedding providers?
Git Native Approach: Should we parse .git objects directly to access old file versions without checking them out?
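As a feasibility check for the Git Native Approach, the sketch below reads a loose object straight from an object directory. It relies on Git's loose-object format (zlib-compressed `<type> <size>\0<body>`, named by the SHA-1 of the uncompressed bytes) and deliberately ignores packfiles, which a real implementation would also have to handle. Both function names are hypothetical, and `write_loose_object` exists only so the example can run without a real repository.

```python
from __future__ import annotations

import hashlib
import zlib
from pathlib import Path


def write_loose_object(git_dir: Path, obj_type: str, body: bytes) -> str:
    """Store `body` as a loose object and return its object id (hex SHA-1)."""
    raw = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    oid = hashlib.sha1(raw).hexdigest()
    path = git_dir / "objects" / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return oid


def read_loose_object(git_dir: Path, oid: str) -> tuple[str, bytes]:
    """Return (type, body) for a loose object; packed objects are not handled."""
    path = git_dir / "objects" / oid[:2] / oid[2:]
    raw = zlib.decompress(path.read_bytes())
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError(f"corrupt object {oid}: size mismatch")
    return obj_type, body
```

Reading a file at an old revision would additionally require walking commit and tree objects to resolve a path to a blob id, so part of the research question is whether this is worth doing by hand compared with shelling out to `git` or using a Git library.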
Expected Deliverables
Upon completion of this research, a summary should be posted as a comment on this issue, and separate Story/Task issues should be created:
RFC (Request for Comments): A document outlining the chosen architecture for version storage.
Sub-issues: Concrete tasks (e.g., "Add git_commit_hash to metadata schema", "Implement branch filtering in query engine").
Roadmap Decision: Determining if this fits into the MVP or qualifies as future work.
Rationale
Without this preliminary stage, we risk implementing a "naive" solution (simply scanning all worktrees as separate projects), which would lead to index bloat and logical errors in LLM responses.