23 changes: 22 additions & 1 deletion README.md
@@ -46,7 +46,28 @@ Use Codegraph when you need fast structural answers about a repo without relying
"score": 59
}
],
"recommendedCommands": ["codegraph hotspots --root \"/workspace/codegraph/src\" --limit 20 --json"]
"duplicates": {
"total": 1,
"omitted": 0,
"minConfidence": "high",
"top": [
{
"confidence": "high",
"cloneType": "exact",
"score": 100,
"left": { "file": "src/a.ts", "startLine": 10, "endLine": 24, "tokenCount": 86 },
"right": { "file": "src/b.ts", "startLine": 8, "endLine": 22, "tokenCount": 86 },
"rawPairCount": 1,
"reasons": ["identical text", "matching normalized token stream"]
}
]
},
"recommendedCommands": [
"codegraph hotspots --root \"/workspace/codegraph\" \"/workspace/codegraph/src\" --limit 20 --json",
"codegraph graph --root \"/workspace/codegraph\" \"/workspace/codegraph/src\" --json --symbols-detailed --compact-json",
"codegraph duplicates --root \"/workspace/codegraph\" \"/workspace/codegraph/src\" --min-confidence medium --limit 20 --include-same-file",
"codegraph doctor \"/workspace/codegraph/.codegraph-cache/index-v1\""
]
}
```
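
For consumers that parse this output, the `duplicates` section above can be described with a small type. The sketch below is inferred from the sample JSON only; the interface names and the `describeDuplicate` helper are hypothetical and are not Codegraph's published types.

```typescript
// Shapes inferred from the sample `inspect` JSON above; all names are illustrative assumptions.
interface DuplicateSide {
  file: string;
  startLine: number;
  endLine: number;
  tokenCount: number;
}

interface DuplicateOpportunity {
  confidence: string;
  cloneType: string;
  score: number;
  left: DuplicateSide;
  right: DuplicateSide;
  rawPairCount: number;
  reasons: string[];
}

interface DuplicateSummary {
  total: number;
  omitted: number;
  minConfidence: string;
  top: DuplicateOpportunity[];
}

// Hypothetical helper: renders one opportunity as a single triage line.
function describeDuplicate(d: DuplicateOpportunity): string {
  const side = (s: DuplicateSide): string => `${s.file}:${s.startLine}-${s.endLine}`;
  return `${d.confidence} ${d.cloneType} (${d.score}): ${side(d.left)} <-> ${side(d.right)}`;
}

// The sample values are copied from the JSON example above.
const sampleDuplicates: DuplicateSummary = {
  total: 1,
  omitted: 0,
  minConfidence: "high",
  top: [
    {
      confidence: "high",
      cloneType: "exact",
      score: 100,
      left: { file: "src/a.ts", startLine: 10, endLine: 24, tokenCount: 86 },
      right: { file: "src/b.ts", startLine: 8, endLine: 22, tokenCount: 86 },
      rawPairCount: 1,
      reasons: ["identical text", "matching normalized token stream"],
    },
  ],
};

const duplicateLines: string[] = sampleDuplicates.top.map(describeDuplicate);
// duplicateLines[0] is "high exact (100): src/a.ts:10-24 <-> src/b.ts:8-22"
```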

3 changes: 2 additions & 1 deletion codegraph-skill/codegraph/SKILL.md
@@ -53,7 +53,7 @@ Then choose the narrowest follow-up command:
- Artifact bundle: `codegraph artifact build --root . --out codegraph-out --json`
- MCP server: `codegraph mcp serve --root . --stdio` or `codegraph mcp serve --root . --port 7331`

Use `--json` when the output will feed later reasoning, scripts, or another agent step. `search` is deterministic and returns project-relative explainable handles, evidence, neighbors, follow-up commands, result counts, limits, and omission counts. `explain` accepts those handles plus file paths, symbol names, and SQL object names, then returns bounded symbols, dependencies, reverse dependencies, references, snippets, SQL relation facts, changed-context review tasks/candidate tests, explicit limits, omission counts, and next commands. Generated command strings POSIX-shell-quote dynamic arguments when needed. For SQL objects, use search handles or schema-qualified names when basenames may be ambiguous. Reference and snippet omission counts are lower bounds after bounded navigation hits its cap. `artifact build` writes a durable SQLite, self-describing project-relative graph JSON, report, questions with unique stable-handle command IDs, and manifest bundle for handoff while excluding its own in-repo output directory and linked outside-root files. With `--force`, recognizable stale artifact files are removed, unrelated operator files are preserved, and unrecognized reserved-name collisions are refused. `codegraph doctor <artifact-dir>` recognizes manifest-backed artifact bundle directories and reports expected artifact presence. `mcp serve` exposes the same primitives as read-only MCP tools by default over stdio, or over Streamable HTTP with `--port <number>` at `/mcp`; HTTP binds to `127.0.0.1` unless `--host <host>` is passed, validates Host headers, and allows loopback Host headers for wildcard binds. File/artifact paths are confined after realpath resolution, SQLite query results are row- and byte-bounded, synthetic payload functions are rejected, and `--allow-build` is required before an agent may write artifact output.
Use `--json` when the output will feed later reasoning, scripts, or another agent step. `inspect` includes compact high-confidence duplicate opportunities plus a recommended `duplicates` command for full grouped JSON. `search` is deterministic and returns project-relative explainable handles, evidence, neighbors, follow-up commands, result counts, limits, and omission counts. `explain` accepts those handles plus file paths, symbol names, and SQL object names, then returns bounded symbols, dependencies, reverse dependencies, references, snippets, SQL relation facts, changed-context review tasks/candidate tests, explicit limits, omission counts, and next commands. Generated command strings POSIX-shell-quote dynamic arguments when needed. For SQL objects, use search handles or schema-qualified names when basenames may be ambiguous. Reference and snippet omission counts are lower bounds after bounded navigation hits its cap. `artifact build` writes a durable SQLite, self-describing project-relative graph JSON, report, questions with unique stable-handle command IDs, and manifest bundle for handoff while excluding its own in-repo output directory and linked outside-root files. With `--force`, recognizable stale artifact files are removed, unrelated operator files are preserved, and unrecognized reserved-name collisions are refused. `codegraph doctor <artifact-dir>` recognizes manifest-backed artifact bundle directories and reports expected artifact presence. `mcp serve` exposes the same primitives as read-only MCP tools by default over stdio, or over Streamable HTTP with `--port <number>` at `/mcp`; HTTP binds to `127.0.0.1` unless `--host <host>` is passed, validates Host headers, and allows loopback Host headers for wildcard binds. File/artifact paths are confined after realpath resolution, SQLite query results are row- and byte-bounded, synthetic payload functions are rejected, and `--allow-build` is required before an agent may write artifact output.

Numeric options such as `--limit`, `--threads`, `--depth`, `--max-refs`, and token bounds must be integers in their documented ranges; invalid numeric values fail instead of being silently clamped or ignored.
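
The fail-instead-of-clamp rule above can be illustrated with a strict parser. This is a minimal sketch under stated assumptions: `parseBoundedInteger` and the example range of 1 to 1000 are hypothetical, not Codegraph's actual option-parsing code or documented limits.

```typescript
// Hypothetical strict parser; not the real implementation in src/cli/options.ts.
function parseBoundedInteger(name: string, raw: string, min: number, max: number): number {
  const text = raw.trim();
  // Accept only plain decimal integers, so "20.5", "1e3", "", and "abc" are rejected.
  if (!/^-?\d+$/.test(text)) {
    throw new Error(`${name} must be an integer, got "${raw}"`);
  }
  const value = Number(text);
  // Out-of-range values are errors; they are never clamped to the nearest bound.
  if (!Number.isSafeInteger(value) || value < min || value > max) {
    throw new Error(`${name} must be between ${min} and ${max}, got "${raw}"`);
  }
  return value;
}

// Example: an assumed range of 1..1000 for a limit-style option.
const parsedLimit: number = parseBoundedInteger("--limit", "20", 1, 1000);
```

Throwing on bad input keeps a mistyped option from silently producing a smaller or larger scan than the caller intended.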

@@ -200,6 +200,7 @@ For git-provider impact and git-scoped review/index/graph commands, `WORKTREE` c

- Start here when you need an architecture summary:
`codegraph inspect ./src --limit 20`
Includes compact high-confidence duplicate opportunities and follow-up commands.
- Dependencies of a file:
`codegraph deps <file>`
- Reverse dependencies:
2 changes: 2 additions & 0 deletions docs/cli.md
@@ -96,6 +96,8 @@ codegraph index --report
codegraph review --report --report-file review.report.json
```

`inspect` emits bounded hotspots, unresolved imports, cycles, and high-confidence duplicate opportunities. Duplicate opportunities are intentionally compact and include file ranges, confidence, clone type, score, token counts, and raw pair counts; run the recommended `duplicates` command for full grouped JSON.

Graph, index, and review reports include `backend.native.byLanguage` so native usage and fallback remain visible per language. Build reports also include `backend.parser` when syntax-tree backend degradation leaves files without parser context. Reports also include `graph.fallbackImportExtraction.byLanguage` and `byReason` when regex import extraction is used. Review JSON reports `diagnostics.symbolMappingParseFailures`, `diagnostics.missingFiles`, `changedFiles[].status` as `updated`, `deleted`, or `missing`, and `sqlContext` when changed SQL files or changed SQL literals make SQL artifact facts relevant.

### Symbols, navigation, grep, and chunking
184 changes: 184 additions & 0 deletions docs/superpowers/plans/2026-05-23-eliminate-duplicate-findings.md
@@ -0,0 +1,184 @@
# Duplicate Findings Refactor Plan

## Baseline Scan

Generated from the repo-local CLI on branch `elim-dups` after rebuilding `dist`.

Commands:

```bash
node ./dist/cli.js duplicates --root . ./src --min-confidence medium --limit 100 --include-same-file
node ./dist/cli.js duplicates --root . . --min-confidence high --limit 200 --include-same-file
```

Results:

- `src` scan: 100 returned groups, 1234 omitted groups, 1669 omitted raw suggestions.
- whole-repo scan: 200 returned groups, 2097 omitted groups, 2995 omitted raw suggestions.
- Whole-repo top results are dominated by test helpers and setup snippets.
- Product-code top results are concentrated in `src/graphs`, `src/cli`, `src/impact`, `src/chunking`, `src/mcp`, and language definitions.

Post-refactor comparison:

- `src` scan: 100 returned groups, 869 omitted groups, 1611 omitted raw suggestions.
- whole-repo scan: 200 returned groups, 899 omitted groups, 2904 omitted raw suggestions.
- The previous `src/graphs/symbol-render.ts`, `src/cli/graph.ts`, chunk tokenizer, AST range, CSS/LESS, and first-pass test helper findings dropped out of the top product-code results.
- The analyzer no longer ranks the large C/C++ query chunk against tiny language snippets as a top medium/high finding.
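
Because the returned-group counts are capped by `--limit`, the omitted counts are the useful comparison. The sketch below works through that arithmetic using only the numbers reported above; the type and variable names are illustrative.

```typescript
// Counts copied from the baseline and post-refactor results above.
interface ScanCounts {
  omittedGroups: number;
  omittedRawSuggestions: number;
}

const baselineCounts: Record<string, ScanCounts> = {
  src: { omittedGroups: 1234, omittedRawSuggestions: 1669 },
  wholeRepo: { omittedGroups: 2097, omittedRawSuggestions: 2995 },
};

const postRefactorCounts: Record<string, ScanCounts> = {
  src: { omittedGroups: 869, omittedRawSuggestions: 1611 },
  wholeRepo: { omittedGroups: 899, omittedRawSuggestions: 2904 },
};

// Percentage reduction from `before` to `after`.
function percentDrop(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

for (const scope of Object.keys(baselineCounts)) {
  const before = baselineCounts[scope];
  const after = postRefactorCounts[scope];
  const groups = percentDrop(before.omittedGroups, after.omittedGroups);
  const raw = percentDrop(before.omittedRawSuggestions, after.omittedRawSuggestions);
  console.log(
    `${scope}: omitted groups -${groups.toFixed(1)}%, omitted raw suggestions -${raw.toFixed(1)}%`,
  );
}
```

On these numbers, omitted groups fall by roughly 30% for `src` and 57% for the whole repo, while omitted raw suggestions fall by only about 3% to 3.5%.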

The grouped output is usable for triage. Remaining caveats:

- Some chunk findings are sub-ranges of a larger duplicate and should be handled through the larger refactor.
- Some renamed findings compare very different-sized chunks; those are analyzer noise unless a human can identify a clear shared behavior.
- Repeated declarative language-definition shapes are not automatically bad duplication.

## Refactor Checklist

### Product Code

- [x] Refactor symbol graph renderers.
- Findings: `src/graphs/symbol-render.ts:84-149` and `src/graphs/symbol-render.ts:151-218`.
- Extract shared file-node, symbol-node, and graph-edge collection.
- Keep Mermaid and DOT formatting separate so escaping and syntax remain explicit.
- Add or update renderer tests if output ordering or formatting can change.

- [x] Refactor compact graph symbol projection.
- Findings: `src/cli/graph.ts:121-146` and `src/cli/graph.ts:169-194`.
- Extract shared file index, symbol index, symbol array, and sorted symbol edge construction.
- Keep the difference between full compact graph output and symbols-only output visible at the call site.

- [x] Share AST range conversion.
- Findings: `src/impact/suggestions.ts:372-385` and `src/util/ast.ts:32-41`.
- Reuse `toRange` from `src/util/ast.ts` or move the non-null conversion into a shared helper.
- Preserve the existing null-node behavior used by current callers.

- [x] Share default token counting.
- Findings: `src/chunking/chunkFile.ts:28-31` and `src/chunking/chunkTextFile.ts:21-24`.
- Introduce a small shared tokenizer helper in `src/chunking`.
- Keep public chunking options unchanged.

- [x] Consolidate dependency and reverse-dependency wrappers where useful.
- Findings: `src/agent-tools.ts:366-441`, `src/mcp/server.ts:228-246`, and `src/mcp/tools.ts:88-114`.
- Prefer a small result-mapping helper over forcing identical public response shapes.
- Verify both CLI/agent tools and MCP tools still expose `dependencies` and `reverseDependencies` separately.

- [x] Revisit CSS and Less language definitions.
- Findings: `src/languages/definitions/css.ts:9-17` and `src/languages/definitions/less.ts:9-17`.
- Extract shared CSS-family structure/query pieces only if it keeps each language definition readable.
- Consider whether Vue/Svelte stylesheet definitions can reuse the same helper without hiding language-specific behavior.

- [x] Evaluate JS fallback type duplication.
- Findings: `packages/codegraph-js-fallback/js-fallback.d.ts` and `src/jsFallback.ts`.
- Prefer generating or importing a single declaration source if package boundaries allow it.
- Leave as-is if the duplication is required to keep the fallback package self-contained.
- Decision: leave as-is because the fallback package publishes a self-contained `js-fallback.d.ts`.

- [x] Review smaller wrapper candidates opportunistically.
- `src/cli/artifact.ts`, `src/cli/explain.ts`, and `src/cli/search.ts` command context interfaces.
- `src/cli/projectFile.ts` and `src/session.ts` file-input resolution.
- `src/cli/options.ts` positive and non-negative integer parsers.
- `src/sqlite/canned-query.ts` direct dependencies and dependents queries.
- Decision: extracted the common agent CLI context and canned-query edge loader; left session file resolution and option parsing as-is because their existing helpers already separate concerns.
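
As an illustration of the renderer item above, a shared collection step can feed two format-specific writers. This is a minimal sketch, not the code in `src/graphs/symbol-render.ts`: the `SymbolEdge` shape and all function names are hypothetical.

```typescript
// Hypothetical minimal edge model; the real symbol graph types may differ.
interface SymbolEdge {
  from: string;
  to: string;
}

function compareText(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Shared step: dedupe and sort edges once so both renderers see the same order.
function collectEdges(edges: SymbolEdge[]): SymbolEdge[] {
  const unique = new Map<string, SymbolEdge>();
  for (const edge of edges) {
    unique.set(`${edge.from}\u0000${edge.to}`, edge);
  }
  return Array.from(unique.values()).sort(
    (a, b) => compareText(a.from, b.from) || compareText(a.to, b.to),
  );
}

// Mermaid-specific step: node ids are sanitized to a safe character set.
// This sanitizing is lossy ("a.ts" and "a_ts" collide); it is for illustration only.
function renderMermaid(edges: SymbolEdge[]): string {
  const id = (s: string): string => s.replace(/[^A-Za-z0-9_]/g, "_");
  const lines = collectEdges(edges).map((e) => `  ${id(e.from)} --> ${id(e.to)}`);
  return ["graph TD", ...lines].join("\n");
}

// DOT-specific step: quoted IDs need embedded double quotes escaped.
function renderDot(edges: SymbolEdge[]): string {
  const quote = (s: string): string => `"${s.replace(/"/g, '\\"')}"`;
  const lines = collectEdges(edges).map((e) => `  ${quote(e.from)} -> ${quote(e.to)};`);
  return ["digraph G {", ...lines, "}"].join("\n");
}
```

Keeping the escaping inside each renderer means a change to one output syntax cannot silently alter the other, which is the reason the plan keeps Mermaid and DOT formatting separate.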

### Tests

- [x] Add a shared temporary directory helper for tests.
- Findings: repeated `mkTmpDir` helpers across dynamic resolution, fast graph edge cases, node modules, resolution precedence, robust fast graph, TS paths workspace, cache, and parsed-cache tests.
- Put it near existing test helpers and migrate only obvious identical helpers first.

- [x] Add shared edge-normalization helpers for graph tests.
- Findings: `tests/fast-graph.test.ts`, `tests/monorepo-fast-graph.test.ts`, and related fast graph tests.
- Replace duplicated `normEdge`, `toKey`, and slash normalization only where it improves readability.

- [x] Consolidate repeated SQLite/test database setup blocks.
- Findings: repeated chunks in `tests/sqlite.test.ts`, `tests/sql-artifact-graph.test.ts`, and `tests/sql-review-context.test.ts`.
- Extract helpers that describe domain intent, not just line-for-line setup.

- [x] Leave intentional fixture repetition alone.
- Repeated sample snippets are often test data, not production maintenance debt.
- Do not refactor setup that would make an individual test harder to read.
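
A shared temporary directory helper of the kind described above could look like the following. It is a sketch that assumes Node's built-in `fs`, `os`, and `path` modules; the default prefix and the `rmTmpDir` companion are hypothetical and may not match the helper added in this PR.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shared helper: creates a unique directory under the OS temp dir.
function mkTmpDir(prefix: string = "codegraph-test-"): string {
  return fs.mkdtempSync(path.join(os.tmpdir(), prefix));
}

// Hypothetical cleanup companion: removes the directory and its contents,
// and does not throw if the directory is already gone.
function rmTmpDir(dir: string): void {
  fs.rmSync(dir, { recursive: true, force: true });
}
```

A test would typically call `mkTmpDir()` in its setup and `rmTmpDir(dir)` in its teardown, so each test gets an isolated directory.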

### Analyzer Follow-Ups

- [x] Consider a length-ratio guard for high-confidence renamed groups.
- Example noise: large C/C++ query chunks paired with tiny language-definition snippets.
- The detector already reports `lengthRatio`; ranking or confidence can use it more aggressively.

- [x] Consider collapsing adjacent same-file chunk findings under a larger group.
- Example: multiple `src/cli/graph.ts` chunk findings are one underlying helper extraction.
- Keep raw variants available through `--raw-pairs`.
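
The length-ratio guard above can be sketched as a confidence demotion. Everything here is an assumption for illustration: the `RenamedGroup` shape, the definition of the ratio as shorter side over longer side, and the `0.5` default threshold are not taken from the analyzer.

```typescript
type Confidence = "low" | "medium" | "high";

// Hypothetical group shape; only the `lengthRatio` concept is named in the plan.
interface RenamedGroup {
  confidence: Confidence;
  leftTokens: number;
  rightTokens: number;
}

// Assumed definition: shorter side divided by longer side, in [0, 1].
function lengthRatio(group: RenamedGroup): number {
  const longer = Math.max(group.leftTokens, group.rightTokens);
  const shorter = Math.min(group.leftTokens, group.rightTokens);
  return longer === 0 ? 1 : shorter / longer;
}

// Demote confidence by one step when the two sides are too lopsided.
// The 0.5 default is an arbitrary illustrative threshold.
function applyLengthRatioGuard(group: RenamedGroup, minRatio: number = 0.5): RenamedGroup {
  if (lengthRatio(group) >= minRatio) {
    return group;
  }
  const demoted: Confidence = group.confidence === "high" ? "medium" : "low";
  return { ...group, confidence: demoted };
}
```

Demoting rather than dropping the group keeps the finding visible at lower confidence thresholds, so a human can still confirm or dismiss it.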

## Follow-Up Scan: 2026-05-23

Generated after the first duplicate cleanup pass on branch `elim-dups`.

Commands:

```bash
node ./dist/cli.js duplicates --root . ./src --min-confidence medium --limit 120 --include-same-file
node ./dist/cli.js duplicates --root . . --min-confidence high --limit 120 --include-same-file
node ./dist/cli.js inspect --root . ./src --limit 8
node ./dist/cli.js review --base main --head HEAD --summary
node ./dist/cli.js doctor
```

Results:

- `src` scan: 120 returned groups, 849 omitted groups, 1611 omitted raw suggestions.
- whole-repo scan: 120 returned groups, 973 omitted groups, 2898 omitted raw suggestions.
- `doctor` reports native runtime availability and artifact health only; it does not surface duplicate cleanup opportunities.
- `review --summary` reports diff risk and candidate tests only; it does not surface duplicate cleanup opportunities.
- `inspect` reports hotspots, unresolved imports, cycles, and recommended commands only; it is the best current home for an at-a-glance duplicate opportunity section.

### Product Output Follow-Ups

- [x] Coalesce repeated grouped duplicate findings with the same primary ranges.
- Finding: `src/indexer/imports/languageSpecific.ts:249-258 applyJavaStatementOverride` vs `src/indexer/imports/languageSpecific.ts:260-269 applyKotlinStatementOverride` appeared multiple times with the same primary pair.
- Preserve raw evidence counts through `rawPairCount` and bounded `variants`.
- Keep `--raw-pairs` as the explicit escape hatch for low-level pair inspection.

- [x] Surface bounded duplicate opportunities in `inspect` output.
- Include high-signal fields only: confidence, clone type, score, files/ranges, token counts, and raw pair count.
- Keep the summary bounded by `--limit`.
- Add a `duplicates` follow-up command to `recommendedCommands` so agents can drill into full grouped JSON.
- Leave `doctor` focused on package/runtime/artifact health.
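
Coalescing findings that share the same primary ranges, while keeping the evidence count and a bounded set of variants, can be sketched as follows. The shapes and the `maxVariants` default are hypothetical; only the `rawPairCount` and `variants` names come from the plan.

```typescript
// Hypothetical shapes for illustration.
interface SourceRange {
  file: string;
  startLine: number;
  endLine: number;
}

interface Finding {
  left: SourceRange;
  right: SourceRange;
  score: number;
}

interface CoalescedGroup extends Finding {
  rawPairCount: number;
  variants: Finding[];
}

function rangeKey(r: SourceRange): string {
  return `${r.file}:${r.startLine}-${r.endLine}`;
}

// Merge findings with identical primary ranges into one group.
function coalesceFindings(findings: Finding[], maxVariants: number = 3): CoalescedGroup[] {
  const groups = new Map<string, CoalescedGroup>();
  for (const finding of findings) {
    const key = `${rangeKey(finding.left)}|${rangeKey(finding.right)}`;
    const group = groups.get(key);
    if (!group) {
      // The first finding becomes the displayed primary pair.
      groups.set(key, { ...finding, rawPairCount: 1, variants: [] });
      continue;
    }
    // Every repeat is counted, but only a bounded number is retained.
    group.rawPairCount += 1;
    group.score = Math.max(group.score, finding.score);
    if (group.variants.length < maxVariants) {
      group.variants.push(finding);
    }
  }
  return Array.from(groups.values());
}
```

In this sketch `variants` holds only the repeats after the primary pair, so `rawPairCount` can exceed `variants.length + 1` once the bound is reached.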

### Remaining Source Cleanup Candidates

- [x] Share C/C++ language-definition scaffolding where it stays readable.
- Findings: `src/languages/definitions/c.ts:14-147` and `src/languages/definitions/cpp.ts:14-165`.
- Preserve C-specific macro support and C++ namespace/class/alias/using/lambda behavior.

- [x] Share package export target selection between node-module and workspace resolution.
- Findings: `src/util/resolution/node.ts:22-34 tryResolveRelative` and `src/util/workspace.ts:298-307 pickExportTarget`.
- Prefer a small resolver helper over duplicating `exports` target precedence.

- [x] Share Java/Kotlin import override plumbing.
- Findings: `src/indexer/imports/languageSpecific.ts:249-258` and `src/indexer/imports/languageSpecific.ts:260-269`.
- Keep Java and Kotlin parsing differences explicit at the call site.

- [x] Share range line-counting for coverage suggestions.
- Findings: `src/impact/report-suggestions.ts:820-827` and `src/impact/report-suggestions.ts:829-836`.
- Preserve zero-count behavior when no coverage is available.

- [x] Share file-like agent handle parsing.
- Findings: `src/agent/handles.ts:54-60` and `src/agent/handles.ts:82-88`.
- Keep public handle prefixes and return types unchanged.

- [x] Share discovery glob relative-path matching.
- Findings: `src/cli/context.ts:25-35` and `src/util/projectFiles.ts:278-288`.
- Keep CLI include/ignore globs relative to active scan roots and config globs relative to project roots.
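
For the range line-counting item above, a shared helper might look like this. It is a sketch under assumptions: `LineRange` and `countRangeLines` are hypothetical names, ranges are treated as inclusive, and overlapping ranges are not merged.

```typescript
// Hypothetical inclusive line range.
interface LineRange {
  startLine: number;
  endLine: number;
}

// Sums the lines covered by each inclusive range.
// Missing or empty input yields 0, preserving zero-count behavior
// when no coverage data is available.
function countRangeLines(ranges: LineRange[] | undefined): number {
  if (!ranges || ranges.length === 0) {
    return 0;
  }
  return ranges.reduce(
    (total, range) => total + Math.max(0, range.endLine - range.startLine + 1),
    0,
  );
}
```

Two call sites that count different range lists could then differ only in which list they pass in.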

Follow-up verification:

- [x] `src` scan reports `0` repeated primary pairs among the first 80 displayed groups at medium or higher confidence.
- [x] `inspect --root . ./src --limit 5` emits compact high-confidence duplicate opportunities and a `duplicates` follow-up command.
- [x] Focused duplicate, inspect, C/C++, resolution, references, impact-suggestion, and agent-search tests pass.
- [x] Full test suite passes.

## Verification Plan

- [x] Run `npm run build`.
- [x] Run `npm run lint`.
- [x] Run focused tests for touched areas.
- [x] Run `npm test` before pushing a completed refactor batch.
- [x] Re-run the duplicate analyzer and compare top findings against this baseline.