Merged
35 changes: 31 additions & 4 deletions CHANGELOG.md
@@ -1,5 +1,32 @@
# Changelog

## [1.4.3] - 2026-03-03

### Cache Contract

- Cache schema bumped from `v1.2` to `v1.3`.
- Added signed analysis profile to cache payload:
  - `payload.ap.min_loc`
  - `payload.ap.min_stmt`
- Cache compatibility now requires `payload.ap` to match current CLI analysis thresholds. On mismatch, cache is ignored
  with `cache_status=analysis_profile_mismatch` and analysis continues without cache.
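
A minimal sketch of the compatibility rule above, assuming the stored and current profiles are plain mappings with `min_loc` and `min_stmt` keys. The function name `check_profile` is hypothetical; the two status strings are the ones that appear in this PR.

```python
from typing import Mapping


def check_profile(stored: Mapping[str, int], current: Mapping[str, int]) -> str:
    """Return a cache status string for a stored vs. current analysis profile."""
    keys = ("min_loc", "min_stmt")
    if all(stored.get(k) == current.get(k) for k in keys):
        return "ok"
    # On mismatch the caller ignores the cache and runs a full analysis.
    return "analysis_profile_mismatch"


current = {"min_loc": 15, "min_stmt": 6}
same = check_profile({"min_loc": 15, "min_stmt": 6}, current)
changed = check_profile({"min_loc": 10, "min_stmt": 6}, current)
```

A cache written under different thresholds may hold a different set of extracted units, which is presumably why the profile is part of the compatibility check rather than advisory metadata.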

### CLI

- CLI now constructs cache context with effective `--min-loc` and `--min-stmt` values, so cache reuse is consistent
with active analysis thresholds.
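
As a rough illustration of "effective" thresholds, the sketch below resolves `--min-loc` and `--min-stmt` with `argparse` and returns the profile a cache would be built with. The helper names and the CLI defaults (15 and 6, copied from the `Cache.__init__` defaults in this PR) are assumptions, not the project's actual parser.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    # Assumed defaults: they mirror the Cache constructor defaults in this PR.
    parser = argparse.ArgumentParser(prog="codeclone-sketch")
    parser.add_argument("--min-loc", type=int, default=15)
    parser.add_argument("--min-stmt", type=int, default=6)
    return parser


def cache_context(argv: list[str]) -> dict[str, int]:
    """Hypothetical helper: the thresholds that would be handed to the cache."""
    args = build_parser().parse_args(argv)
    return {"min_loc": args.min_loc, "min_stmt": args.min_stmt}
```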

### Tests

- Added regression coverage for analysis-profile cache mismatch/match behavior in:
  - `tests/test_cache.py`
  - `tests/test_cli_inprocess.py`

### Contract Notes

- Baseline contract is unchanged (`schema v1.0`, `fingerprint version 1`).
- Report schema is unchanged (`v1.1`); cache metadata adds a new `cache_status` enum value.

## [1.4.2] - 2026-02-17

### Overview
@@ -44,10 +71,10 @@ unchanged.
### Notes

- No changes to:
- detection semantics / fingerprints
- baseline hash inputs (`payload_sha256` semantic payload)
- exit code contract and precedence
- schema versions (baseline v1.0, cache v1.2, report v1.1)
- detection semantics / fingerprints
- baseline hash inputs (`payload_sha256` semantic payload)
- exit code contract and precedence
- schema versions (baseline v1.0, cache v1.2, report v1.1)

---

15 changes: 8 additions & 7 deletions README.md
@@ -117,12 +117,12 @@ Full contract details: [`docs/book/06-baseline.md`](docs/book/06-baseline.md)

CodeClone uses a deterministic exit code contract:

| Code | Meaning |
|------|-----------------------------------------------------------------------------|
| `0` | Success — run completed without gating failures |
| Code | Meaning |
|------|-------------------------------------------------------------------------------------------------------------------------------------|
| `0` | Success — run completed without gating failures |
| `2` | Contract error — baseline missing/untrusted, invalid output extensions, incompatible versions, unreadable source files in CI/gating |
| `3` | Gating failure — new clones detected or threshold exceeded |
| `5` | Internal error — unexpected exception |
| `3` | Gating failure — new clones detected or threshold exceeded |
| `5` | Internal error — unexpected exception |

**Priority:** Contract errors (`2`) override gating failures (`3`) when both occur.
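
The precedence rule can be sketched as follows. The enum and function names are illustrative; only the numeric codes and the rule that `2` overrides `3` come from the table above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    CONTRACT_ERROR = 2
    GATING_FAILURE = 3
    INTERNAL_ERROR = 5


def resolve_exit_code(contract_error: bool, gating_failure: bool) -> ExitCode:
    """Pick the exit code when both conditions may be present."""
    if contract_error:
        # Contract errors take priority over gating failures.
        return ExitCode.CONTRACT_ERROR
    if gating_failure:
        return ExitCode.GATING_FAILURE
    return ExitCode.SUCCESS
```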

@@ -182,7 +182,7 @@ Canonical report contract: [`docs/book/08-report.md`](docs/book/08-report.md)
"cache_path": "/path/to/.cache/codeclone/cache.json",
"cache_used": true,
"cache_status": "ok",
"cache_schema_version": "1.2",
"cache_schema_version": "1.3",
"files_skipped_source_io": 0,
"groups_counts": {
"functions": {
@@ -263,7 +263,8 @@ Canonical report contract: [`docs/book/08-report.md`](docs/book/08-report.md)
Cache is an optimization layer only and is never a source of truth.

- Default path: `<root>/.cache/codeclone/cache.json`
- Schema version: **v1.2**
- Schema version: **v1.3**
- Compatibility includes analysis profile (`min_loc`, `min_stmt`)
- Invalid or oversized cache is ignored with warning and rebuilt (fail-open)

Full contract details: [`docs/book/07-cache.md`](docs/book/07-cache.md)
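
A minimal fail-open loader, assuming a JSON cache file and a byte-size limit. The function name, return shape, and warning texts are hypothetical; the point is only that every failure path yields an empty cache plus a warning instead of an exception.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_cache_fail_open(path: Path, max_size_bytes: int) -> tuple[dict, str | None]:
    """Return (cache_data, warning); any problem yields an empty cache."""
    try:
        if path.stat().st_size > max_size_bytes:
            return {}, "cache too large; ignoring cache"
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return {}, f"cache unreadable ({exc.__class__.__name__}); ignoring cache"
    if not isinstance(data, dict):
        return {}, "cache format invalid; ignoring cache"
    return data, None
```

Because the cache is never a source of truth, discarding it costs only analysis time, so failing open is a safe default.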
59 changes: 59 additions & 0 deletions codeclone/cache.py
@@ -39,6 +39,7 @@ class CacheStatus(str, Enum):
    VERSION_MISMATCH = "version_mismatch"
    PYTHON_TAG_MISMATCH = "python_tag_mismatch"
    FINGERPRINT_MISMATCH = "mismatch_fingerprint_version"
    ANALYSIS_PROFILE_MISMATCH = "analysis_profile_mismatch"
    INTEGRITY_FAILED = "integrity_failed"


@@ -84,15 +85,22 @@ class CacheEntry(TypedDict):
    segments: list[SegmentDict]


class AnalysisProfile(TypedDict):
    min_loc: int
    min_stmt: int


class CacheData(TypedDict):
    version: str
    python_tag: str
    fingerprint_version: str
    analysis_profile: AnalysisProfile
    files: dict[str, CacheEntry]


class Cache:
    __slots__ = (
        "analysis_profile",
        "cache_schema_version",
        "data",
        "fingerprint_version",
@@ -112,14 +120,21 @@ def __init__(
        *,
        root: str | Path | None = None,
        max_size_bytes: int | None = None,
        min_loc: int = 15,
        min_stmt: int = 6,
    ):
        self.path = Path(path)
        self.root = _resolve_root(root)
        self.fingerprint_version = BASELINE_FINGERPRINT_VERSION
        self.analysis_profile: AnalysisProfile = {
            "min_loc": min_loc,
            "min_stmt": min_stmt,
        }
        self.data: CacheData = _empty_cache_data(
            version=self._CACHE_VERSION,
            python_tag=current_python_tag(),
            fingerprint_version=self.fingerprint_version,
            analysis_profile=self.analysis_profile,
        )
        self.legacy_secret_warning = self._detect_legacy_secret_warning()
        self.cache_schema_version: str | None = None
@@ -164,6 +179,7 @@ def _ignore_cache(
            version=self._CACHE_VERSION,
            python_tag=current_python_tag(),
            fingerprint_version=self.fingerprint_version,
            analysis_profile=self.analysis_profile,
        )

    def _sign_data(self, data: Mapping[str, object]) -> str:
@@ -309,6 +325,28 @@ def _parse_cache_document(self, raw_obj: object) -> CacheData | None:
            )
            return None

        analysis_profile = _as_analysis_profile(payload.get("ap"))
        if analysis_profile is None:
            self._ignore_cache(
                "Cache format invalid; ignoring cache.",
                status=CacheStatus.INVALID_TYPE,
                schema_version=version,
            )
            return None

        if analysis_profile != self.analysis_profile:
            self._ignore_cache(
                "Cache analysis profile mismatch "
                f"(found min_loc={analysis_profile['min_loc']}, "
                f"min_stmt={analysis_profile['min_stmt']}; "
                f"expected min_loc={self.analysis_profile['min_loc']}, "
                f"min_stmt={self.analysis_profile['min_stmt']}); "
                "ignoring cache.",
                status=CacheStatus.ANALYSIS_PROFILE_MISMATCH,
                schema_version=version,
            )
            return None

        files_obj = payload.get("files")
        files_dict = _as_str_dict(files_obj)
        if files_dict is None:
@@ -337,6 +375,7 @@ def _parse_cache_document(self, raw_obj: object) -> CacheData | None:
            "version": self._CACHE_VERSION,
            "python_tag": runtime_tag,
            "fingerprint_version": self.fingerprint_version,
            "analysis_profile": self.analysis_profile,
            "files": parsed_files,
        }

@@ -356,6 +395,7 @@ def save(self) -> None:
            payload: dict[str, object] = {
                "py": current_python_tag(),
                "fp": self.fingerprint_version,
                "ap": self.analysis_profile,
                "files": wire_files,
            }
            signed_doc = {
@@ -371,6 +411,7 @@ def save(self) -> None:
            self.data["version"] = self._CACHE_VERSION
            self.data["python_tag"] = current_python_tag()
            self.data["fingerprint_version"] = self.fingerprint_version
            self.data["analysis_profile"] = self.analysis_profile

        except OSError as e:
            raise CacheError(f"Failed to save cache: {e}") from e
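
Since `ap` is placed inside the payload that gets signed, changing the stored thresholds changes the signature. The body of `_sign_data` is not shown in this diff, so the sketch below is an assumption: it hashes a canonical JSON encoding with SHA-256 to illustrate the idea. The name `sign_payload` and the sample `py` tag are hypothetical.

```python
import hashlib
import json
from typing import Mapping


def sign_payload(payload: Mapping[str, object]) -> str:
    """Illustrative only: SHA-256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {"py": "cp313", "fp": "1", "ap": {"min_loc": 15, "min_stmt": 6}, "files": {}}
tampered = {**base, "ap": {"min_loc": 1, "min_stmt": 6}}
signatures_differ = sign_payload(base) != sign_payload(tampered)
```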
@@ -508,11 +549,13 @@ def _empty_cache_data(
    version: str,
    python_tag: str,
    fingerprint_version: str,
    analysis_profile: AnalysisProfile,
) -> CacheData:
    return {
        "version": version,
        "python_tag": python_tag,
        "fingerprint_version": fingerprint_version,
        "analysis_profile": analysis_profile,
        "files": {},
    }

@@ -542,6 +585,22 @@ def _as_str_dict(value: object) -> dict[str, object] | None:
    return value


def _as_analysis_profile(value: object) -> AnalysisProfile | None:
    obj = _as_str_dict(value)
    if obj is None:
        return None

    if set(obj.keys()) != {"min_loc", "min_stmt"}:
        return None

    min_loc = _as_int(obj.get("min_loc"))
    min_stmt = _as_int(obj.get("min_stmt"))
    if min_loc is None or min_stmt is None:
        return None

    return {"min_loc": min_loc, "min_stmt": min_stmt}
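
The strict parsing above can be exercised in isolation with a self-contained approximation. `_as_str_dict` and `_as_int` are not shown in this diff, so the stand-ins below (including the explicit rejection of `bool`) are assumptions about their behavior rather than the project's code.

```python
from __future__ import annotations


def as_int(value: object) -> int | None:
    # Assumed behavior: accept real ints only. bool is rejected explicitly
    # because isinstance(True, int) is True in Python.
    if isinstance(value, bool) or not isinstance(value, int):
        return None
    return value


def as_analysis_profile(value: object) -> dict[str, int] | None:
    """Accept exactly {"min_loc": int, "min_stmt": int}; anything else is None."""
    if not isinstance(value, dict):
        return None
    if set(value.keys()) != {"min_loc", "min_stmt"}:
        return None
    min_loc = as_int(value.get("min_loc"))
    min_stmt = as_int(value.get("min_stmt"))
    if min_loc is None or min_stmt is None:
        return None
    return {"min_loc": min_loc, "min_stmt": min_stmt}
```

Requiring the exact key set means a payload with extra or missing keys is treated as an invalid format, not as a profile mismatch.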


def _decode_wire_file_entry(value: object, filepath: str) -> CacheEntry | None:
    obj = _as_str_dict(value)
    if obj is None:
2 changes: 2 additions & 0 deletions codeclone/cli.py
@@ -310,6 +310,8 @@ def _main_impl() -> None:
        cache_path,
        root=root_path,
        max_size_bytes=args.max_cache_size_mb * 1024 * 1024,
        min_loc=args.min_loc,
        min_stmt=args.min_stmt,
    )
    cache.load()
    if cache.load_warning:
2 changes: 1 addition & 1 deletion codeclone/contracts.py
@@ -14,7 +14,7 @@
BASELINE_SCHEMA_VERSION: Final = "1.0"
BASELINE_FINGERPRINT_VERSION: Final = "1"

CACHE_VERSION: Final = "1.2"
CACHE_VERSION: Final = "1.3"
REPORT_SCHEMA_VERSION: Final = "1.1"


4 changes: 2 additions & 2 deletions docs/README.md
@@ -19,7 +19,7 @@ This directory has two documentation layers.
- Config and defaults: [`docs/book/04-config-and-defaults.md`](book/04-config-and-defaults.md)
- Core pipeline and invariants: [`docs/book/05-core-pipeline.md`](book/05-core-pipeline.md)
- Baseline contract (schema v1): [`docs/book/06-baseline.md`](book/06-baseline.md)
- Cache contract (schema v1.2): [`docs/book/07-cache.md`](book/07-cache.md)
- Cache contract (schema v1.3): [`docs/book/07-cache.md`](book/07-cache.md)
- Report contract (schema v1.1): [`docs/book/08-report.md`](book/08-report.md)

## Interfaces
@@ -44,4 +44,4 @@ This directory has two documentation layers.

- Status enums and typed contracts: [`docs/book/appendix/a-status-enums.md`](book/appendix/a-status-enums.md)
- Schema layouts (baseline/cache/report): [`docs/book/appendix/b-schema-layouts.md`](book/appendix/b-schema-layouts.md)
- Error catalog (contract vs internal): [`docs/book/appendix/c-error-catalog.md`](book/appendix/c-error-catalog.md)
- Error catalog (contract vs internal): [`docs/book/appendix/c-error-catalog.md`](book/appendix/c-error-catalog.md)
44 changes: 29 additions & 15 deletions docs/book/01-architecture-map.md
@@ -1,74 +1,88 @@
# 01. Architecture Map

## Purpose

Document the current module boundaries and ownership in the codebase.

## Public surface

Main ownership layers:

- Core detection pipeline: scanner → extractor → cfg/normalize → grouping.
- Contracts/IO: baseline, cache, CLI validation, exit semantics.
- Report model/serialization: JSON/TXT generation and explainability facts.
- Render layer: HTML rendering and template assets.

## Data model
| Layer | Modules | Responsibility |
| --- | --- | --- |
| Contracts | `codeclone/contracts.py`, `codeclone/errors.py` | Shared schema versions, URLs, exit-code enum, typed exceptions |
| Discovery + parsing | `codeclone/scanner.py`, `codeclone/extractor.py` | Enumerate files, parse AST, extract function/block/segment units |
| Structural analysis | `codeclone/cfg.py`, `codeclone/normalize.py`, `codeclone/blockhash.py`, `codeclone/fingerprint.py`, `codeclone/blocks.py` | CFG, normalization, statement hashes, block/segment windows |
| Grouping + report core | `codeclone/_report_grouping.py`, `codeclone/_report_blocks.py`, `codeclone/_report_segments.py`, `codeclone/_report_explain.py` | Build groups, merge windows, suppress segment noise, compute explainability facts |
| Report serialization | `codeclone/_report_serialize.py`, `codeclone/_cli_meta.py` | Canonical JSON/TXT schema + shared report metadata |
| Rendering | `codeclone/html_report.py`, `codeclone/_html_escape.py`, `codeclone/_html_snippets.py`, `codeclone/templates.py` | HTML-only view layer over report model |
| Runtime orchestration | `codeclone/cli.py`, `codeclone/_cli_args.py`, `codeclone/_cli_paths.py`, `codeclone/_cli_summary.py`, `codeclone/ui_messages.py` | CLI UX, status handling, outputs, error category markers |

| Layer | Modules | Responsibility |
|------------------------|----------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| Contracts | `codeclone/contracts.py`, `codeclone/errors.py` | Shared schema versions, URLs, exit-code enum, typed exceptions |
| Discovery + parsing | `codeclone/scanner.py`, `codeclone/extractor.py` | Enumerate files, parse AST, extract function/block/segment units |
| Structural analysis | `codeclone/cfg.py`, `codeclone/normalize.py`, `codeclone/blockhash.py`, `codeclone/fingerprint.py`, `codeclone/blocks.py` | CFG, normalization, statement hashes, block/segment windows |
| Grouping + report core | `codeclone/_report_grouping.py`, `codeclone/_report_blocks.py`, `codeclone/_report_segments.py`, `codeclone/_report_explain.py` | Build groups, merge windows, suppress segment noise, compute explainability facts |
| Report serialization | `codeclone/_report_serialize.py`, `codeclone/_cli_meta.py` | Canonical JSON/TXT schema + shared report metadata |
| Rendering | `codeclone/html_report.py`, `codeclone/_html_escape.py`, `codeclone/_html_snippets.py`, `codeclone/templates.py` | HTML-only view layer over report model |
| Runtime orchestration | `codeclone/cli.py`, `codeclone/_cli_args.py`, `codeclone/_cli_paths.py`, `codeclone/_cli_summary.py`, `codeclone/ui_messages.py` | CLI UX, status handling, outputs, error category markers |

Refs:

- `codeclone/report.py`
- `codeclone/cli.py:_main_impl`

## Contracts

- Core pipeline does not depend on HTML modules.
- HTML rendering receives already-computed report data/facts.
- Baseline and cache contracts are validated before being trusted.

Refs:

- `codeclone/report.py`
- `codeclone/html_report.py:build_html_report`
- `codeclone/baseline.py:Baseline.load`
- `codeclone/cache.py:Cache.load`

## Invariants (MUST)

- Report serialization is deterministic and schema-versioned.
- UI is render-only and must not recompute detection semantics.
- Status enums are domain-owned in baseline/cache modules.

Refs:

- `codeclone/_report_serialize.py:to_json_report`
- `codeclone/_report_explain.py:build_block_group_facts`
- `codeclone/baseline.py:BaselineStatus`
- `codeclone/cache.py:CacheStatus`

## Failure modes
| Condition | Layer |
| --- | --- |

| Condition | Layer |
|----------------------------------------|---------------------------------------------------|
| Invalid CLI args / invalid output path | Runtime orchestration (`_cli_args`, `_cli_paths`) |
| Baseline schema/integrity mismatch | Baseline contract layer |
| Cache corruption/version mismatch | Cache contract layer (fail-open) |
| HTML snippet read failure | Render layer fallback snippet |
| Baseline schema/integrity mismatch | Baseline contract layer |
| Cache corruption/version mismatch | Cache contract layer (fail-open) |
| HTML snippet read failure | Render layer fallback snippet |

## Determinism / canonicalization

- File iteration and group key ordering are explicit sorts.
- Report serializer uses fixed record layouts and sorted keys.

Refs:

- `codeclone/scanner.py:iter_py_files`
- `codeclone/_report_serialize.py:GROUP_ITEM_LAYOUT`
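
The two determinism rules above can be illustrated with a small sketch. The helper names are hypothetical and the real serializer uses its own record layouts; this only shows explicit sorting and sorted-key JSON output producing identical bytes regardless of input order.

```python
from __future__ import annotations

import json


def iter_sorted(paths: list[str]) -> list[str]:
    """Explicit sort so traversal order never depends on the filesystem."""
    return sorted(paths)


def serialize(report: dict[str, object]) -> str:
    """Sorted keys and fixed separators give a stable textual form."""
    return json.dumps(report, sort_keys=True, separators=(",", ":"))


first = serialize({"b": 1, "a": 2})
second = serialize({"a": 2, "b": 1})
```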

## Locked by tests

- `tests/test_report.py::test_report_json_compact_v11_contract`
- `tests/test_html_report.py::test_html_report_uses_core_block_group_facts`
- `tests/test_cache.py::test_cache_v12_uses_relpaths_when_root_set`
- `tests/test_cache.py::test_cache_v13_uses_relpaths_when_root_set`
- `tests/test_cli_unit.py::test_argument_parser_contract_error_marker_for_invalid_args`

## Non-guarantees

- Internal module split may change in v1.x if public contracts are preserved.
- Import tree acyclicity is a policy goal, not currently enforced by tooling.