treedb: add template lab tooling and class-aware rewrite observability #854
Tracking ticket with full experiment details and exact reproduction commands is now filed at: #855
Key outcome from this run:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f8896fcf8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Pull request overview
This PR advances the template-compression investigation by wiring template store/config into the value-log rewrite path, adding class-aware rewrite observability, and introducing lab tooling to extract corpora and run template sweeps.
Changes:
- Add template rewrite stats (attempted/kept/input/output + per-class reason counters) and expose them in treemap vlog-rewrite.
- Wire TemplateStore into rewrite options and side-store lookup plumbing.
- Add new lab commands: template_corpus_extract, template_lab, and template_seed_train (+ docs/benchmark notes) to support corpus extraction, sweeps, and store seeding.
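As a rough illustration of the first bullet, per-class counters with a reason map might be shaped as below. This is only a sketch: the type names, field names, and the `no_gain` reason string are hypothetical and are not taken from the PR, and it assumes a rejected template attempt is written back unencoded.

```go
package main

import "fmt"

// rewriteClass is a hypothetical label for the two value classes named in this PR.
type rewriteClass string

const (
	classPointerValue rewriteClass = "pointer_value"
	classOuterLeaf    rewriteClass = "outer_leaf"
)

// classStats is an illustrative shape, not the PR's actual struct.
type classStats struct {
	Attempted   uint64
	Kept        uint64
	InputBytes  uint64
	OutputBytes uint64
	Reasons     map[string]uint64 // why a template encoding was not kept
}

// templateStats keys the counters by class.
type templateStats map[rewriteClass]*classStats

// record tallies one template attempt; reason is ignored when kept is true.
func (t templateStats) record(c rewriteClass, in, out int, kept bool, reason string) {
	s := t[c]
	if s == nil {
		s = &classStats{Reasons: make(map[string]uint64)}
		t[c] = s
	}
	s.Attempted++
	s.InputBytes += uint64(in)
	if kept {
		s.Kept++
		s.OutputBytes += uint64(out)
		return
	}
	// Assumption: a rejected attempt is stored unencoded, so output equals input.
	s.OutputBytes += uint64(in)
	s.Reasons[reason]++
}

// line renders one class-specific summary line.
func (t templateStats) line(c rewriteClass) string {
	s := t[c]
	if s == nil {
		return fmt.Sprintf("%s attempted=0", c)
	}
	return fmt.Sprintf("%s attempted=%d kept=%d in=%d out=%d",
		c, s.Attempted, s.Kept, s.InputBytes, s.OutputBytes)
}
```

Keeping the reason counters inside each class entry means a class with zero keeps can still show why every attempt was rejected.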
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/benchmarks/CELESTIA_VLOG_COMPRESSION_NOTES_2026-03-26.md | Adds reproducible commands/results for the template lab sweep. |
| TreeDB/vlog_rewrite.go | Public treedb rewrite stats extended with template/class counters and reason maps. |
| TreeDB/db/vlog_rewrite.go | Implements template-aware rewrite writer, applies template compression during offline rewrite, and plumbs stats/reason snapshots. |
| TreeDB/db/db.go | Adds ValueLogOptions.TemplateStore to support template encoding (e.g., rewrite prepass). |
| TreeDB/side_store_lookups.go | Wires templatedb lookups and conditionally sets TemplateStore when template mode is enabled. |
| TreeDB/cmd/treemap/main.go | Extends vlog-rewrite command with template flags, templatedb wiring, and class-aware output lines. |
| TreeDB/db/vlog_rewrite_test.go | Adds rewrite writer tests validating template prepass ordering and template-only dict bypass. |
| TreeDB/caching/db.go | Adds per-record dict-class range splitting and appends mixed-class batches via multiple dict IDs. |
| TreeDB/caching/vlog_dict_classifier_test.go | Adds tests for class-range splitting behavior (split outer-leaf vs single mode). |
| TreeDB/cmd/template_corpus_extract/main.go | New command to extract pointer/outer-leaf corpora + manifest. |
| TreeDB/cmd/template_corpus_extract/README.md | Documents corpus extraction output format and usage. |
| TreeDB/cmd/template_lab/main.go | New sweep harness for template configs and optional outer-leaf pretransform experiments. |
| TreeDB/cmd/template_lab/README.md | Documents lab harness input format, flags, and notes. |
| TreeDB/cmd/template_seed_train/main.go | New command to seed/warm templatedb from an existing app dir. |
| TreeDB/cmd/template_seed_train/README.md | Documents seeding usage and behavior. |
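The TreeDB/caching/db.go row describes splitting a batch into per-class ranges so that each range can be appended under its own dict ID. A minimal sketch of that idea follows; `dictClass`, `classRange`, and `splitByClass` are hypothetical names, and the real implementation may differ.

```go
package main

// dictClass is a hypothetical per-record class tag.
type dictClass uint8

const (
	classDefault dictClass = iota
	classOuterLeaf
)

// classRange covers records [Start, End), all sharing Class.
type classRange struct {
	Start, End int
	Class      dictClass
}

// splitByClass groups consecutive records of the same class into ranges,
// preserving record order.
func splitByClass(classes []dictClass) []classRange {
	var out []classRange
	for i, c := range classes {
		if n := len(out); n > 0 && out[n-1].Class == c {
			out[n-1].End = i + 1 // extend the current run
			continue
		}
		out = append(out, classRange{Start: i, End: i + 1, Class: c})
	}
	return out
}
```

Grouping only consecutive records keeps the original write order intact, at the cost of producing more ranges when classes interleave.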
Follow-up update pushed in
Key result (pointer corpus):
Outer-leaf remained unchanged ( |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94055dff59
Follow-up pushed in
What was added
Experiments
Corpus:
Conclusion
Outer-leaf pages remain non-templateable in this path even with directory normalization + aggressive template training knobs. This further supports pivoting outer-leaf optimization toward non-template mechanisms (block/dict/codec path), while template remains high-impact for pointer-value payloads.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c6fbfd238
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 2 comments.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3736bfc31
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2834854a91
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
func (s readOnlyTemplateStore) GetTemplateDef(ctx context.Context, templateID uint64) ([]byte, error) {
	if s.inner == nil {
		return nil, nil
readOnlyTemplateStore.GetTemplateDef returns (nil, nil) when inner is nil. The template.Store contract expects an error like template.ErrMissingTemplate on misses; returning nil error with empty bytes can lead to confusing downstream decode errors if this wrapper is ever used without an inner store.
Suggested change
-		return nil, nil
+		return nil, template.ErrMissingTemplate
		return stats, errors.New("value-log reader unavailable")
	}

	seen := make(map[outerLeafKey]struct{}, 1<<20)
The outer-leaf dedupe set is preallocated with a fixed capacity of 1<<20 regardless of the requested -limit/-stride. This can allocate a large amount of memory for small runs; consider sizing the map based on the effective limit (or leaving it unhinted) similar to template_corpus_extract’s approach.
Suggested change
-	seen := make(map[outerLeafKey]struct{}, 1<<20)
+	var seen map[outerLeafKey]struct{}
+	if limit > 0 {
+		// Approximate upper bound on distinct sampled entries we may keep in the dedupe set.
+		effectiveMax := (limit + stride - 1) / stride
+		if effectiveMax < 1 {
+			effectiveMax = 1
+		}
+		// Avoid over-allocating excessively large maps.
+		if effectiveMax > 1<<20 {
+			effectiveMax = 1 << 20
+		}
+		seen = make(map[outerLeafKey]struct{}, effectiveMax)
+	} else {
+		// Unbounded run: let the map grow as needed without a large fixed hint.
+		seen = make(map[outerLeafKey]struct{})
+	}
	if reader == nil {
		return 0, 0, errors.New("value-log reader unavailable")
	}
	seen := make(map[outerLeafKey]struct{}, 1<<20)
The probe outer-leaf dedupe set is preallocated with a fixed capacity of 1<<20. For small probe limits this can be unnecessarily memory-heavy; consider basing the initial capacity on the limit (or omitting the hint).
Suggested change
-	seen := make(map[outerLeafKey]struct{}, 1<<20)
+	// Size the dedupe set based on the probe limit to avoid unnecessary over-allocation.
+	seenCap := 0
+	if limit > 0 {
+		seenCap = limit
+		if seenCap > 1<<20 {
+			seenCap = 1 << 20
+		}
+	}
+	seen := make(map[outerLeafKey]struct{}, seenCap)
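The two sizing suggestions above follow the same pattern, so one way to apply them would be a shared helper. The sketch below is an assumption about how that could look: `seenCapacity` and `maxSeenHint` are hypothetical names, and it assumes, as the first suggestion does, that roughly limit/stride entries can end up in the set. It also guards against a zero or negative stride, which the inline suggestion would divide by.

```go
package main

// maxSeenHint caps the preallocation; 1<<20 mirrors the constant in the reviewed code.
const maxSeenHint = 1 << 20

// seenCapacity returns a map size hint for a dedupe set.
// A result of 0 means "no hint": the map grows on demand.
func seenCapacity(limit, stride int) int {
	if limit <= 0 {
		return 0 // unbounded run
	}
	if stride < 1 {
		stride = 1 // avoid dividing by zero on an unset stride
	}
	// ceil(limit/stride) without the overflow risk of limit+stride-1.
	n := limit / stride
	if limit%stride != 0 {
		n++
	}
	if n > maxSeenHint {
		n = maxSeenHint
	}
	return n
}
```

Both call sites could then read `seen := make(map[outerLeafKey]struct{}, seenCapacity(limit, stride))`, passing a stride of 1 where no stride applies.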
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9428f3d9c6
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Summary
This PR executes the next phase of the compression investigation plan:
What changed
- pointer_value vs outer_leaf attempted/kept/input/output bytes and reason maps.
- TemplateStore in value-log options, side-store wiring).
- treemap vlog-rewrite output with class-specific template lines.
- template_corpus_extract command: pointer_values.bin, outer_leaf_pages.bin, manifest.json.
- template_lab command: off|header_v1 outer-leaf pre-transform, -disable-mask-templates for anchor-only runs.
- template_seed_train command + README for seeding templatedb from existing app dirs.
Validation
go test ./TreeDB/caching -run 'TestValueLogDictClassRangesForRecords_SplitOuterLeaf|TestValueLogDictClassRangesForRecords_SingleMode' -count=1
go test ./TreeDB/db -run 'TestRewriteWriter_TemplatePrepassEncodesBeforeDict|TestRewriteWriter_TemplateOnlyDisablesDictCompression' -count=1
go test ./TreeDB/cmd/treemap ./TreeDB/cmd/template_corpus_extract ./TreeDB/cmd/template_lab ./TreeDB/cmd/template_seed_train -count=1
go test ./TreeDB -run 'TestVacuumIndexOffline_WithTemplateFrames_WiresTemplateLookup' -count=1
Experiment run (full corpus)
Corpus:
/tmp/template_corpus_run_20260327_120046
Sweep artifacts:
/tmp/template_lab_run_off_20260327.json
/tmp/template_lab_run_header_v1_20260327.json
Observed:
- 20092/42802, encoded bytes improved by ~0.3904%.
- 0 keeps across tested configs.
- header_v1 pre-transform: no measurable outer-leaf improvement in this pass.
Repro details are documented in docs/benchmarks/CELESTIA_VLOG_COMPRESSION_NOTES_2026-03-26.md.