
treedb: add template lab tooling and class-aware rewrite observability #854

Merged
snissn merged 13 commits into main from followon/template-lab-plan-exec
Mar 30, 2026

Owner

@snissn snissn commented Mar 27, 2026

Summary

This PR executes the next phase of the compression investigation plan:

  1. PR1 scope: add rewrite observability for template outcomes by payload class and add corpus extraction tooling.
  2. PR2 scope: add a microbench/sweep harness for template experiments on extracted corpora.
  3. PR3 scope (flagged prototype): add an outer-leaf pre-transform experiment path in the lab harness.

What changed

  • Added class-aware template counters/reason stats to rewrite stats:
    • pointer_value vs outer_leaf attempted/kept/input/output bytes and reason maps.
  • Wired template store/config into rewrite path (TemplateStore in value-log options, side-store wiring).
  • Extended treemap vlog-rewrite output with class-specific template lines.
  • Added template_corpus_extract command:
    • emits pointer_values.bin, outer_leaf_pages.bin, manifest.json.
  • Added template_lab command:
    • sweep over min-savings/fingerprint-k/max-fetch,
    • emits JSON/Markdown reports,
    • supports off|header_v1 outer-leaf pre-transform,
    • supports -disable-mask-templates for anchor-only runs.
  • Added template_seed_train command + README for seeding templatedb from existing app dirs.
  • Added benchmark note updates with reproducible commands/results.
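A sweep over min-savings, fingerprint-k, and max-fetch amounts to enumerating the Cartesian product of the three parameter lists. The following is a minimal sketch of that idea, not the template_lab implementation: `sweepConfig`, `buildSweep`, and the parameter values are hypothetical, and the label format is inferred from the row name `tmpl_ms4_fk8_fetch16` reported later in this thread.

```go
package main

import "fmt"

// sweepConfig is a hypothetical stand-in for one row of a template sweep.
type sweepConfig struct {
	MinSavings   int
	FingerprintK int
	MaxFetch     int
}

// name builds a row label. The format is an assumption inferred from the
// label tmpl_ms4_fk8_fetch16 that appears in this thread.
func (c sweepConfig) name() string {
	return fmt.Sprintf("tmpl_ms%d_fk%d_fetch%d", c.MinSavings, c.FingerprintK, c.MaxFetch)
}

// buildSweep returns the Cartesian product of the three parameter lists,
// varying max-fetch fastest and min-savings slowest.
func buildSweep(minSavings, fingerprintK, maxFetch []int) []sweepConfig {
	out := make([]sweepConfig, 0, len(minSavings)*len(fingerprintK)*len(maxFetch))
	for _, ms := range minSavings {
		for _, fk := range fingerprintK {
			for _, mf := range maxFetch {
				out = append(out, sweepConfig{MinSavings: ms, FingerprintK: fk, MaxFetch: mf})
			}
		}
	}
	return out
}

func main() {
	// Illustrative parameter values only; the real sweep ranges are not stated here.
	for _, c := range buildSweep([]int{4, 8}, []int{4, 8}, []int{16}) {
		fmt.Println(c.name())
	}
}
```

Each generated row would then be run against the extracted corpus and written to the JSON/Markdown report.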

Validation

  • go test ./TreeDB/caching -run 'TestValueLogDictClassRangesForRecords_SplitOuterLeaf|TestValueLogDictClassRangesForRecords_SingleMode' -count=1
  • go test ./TreeDB/db -run 'TestRewriteWriter_TemplatePrepassEncodesBeforeDict|TestRewriteWriter_TemplateOnlyDisablesDictCompression' -count=1
  • go test ./TreeDB/cmd/treemap ./TreeDB/cmd/template_corpus_extract ./TreeDB/cmd/template_lab ./TreeDB/cmd/template_seed_train -count=1
  • go test ./TreeDB -run 'TestVacuumIndexOffline_WithTemplateFrames_WiresTemplateLookup' -count=1

Experiment run (full corpus)

Corpus: /tmp/template_corpus_run_20260327_120046

Sweep artifacts:

  • /tmp/template_lab_run_off_20260327.json
  • /tmp/template_lab_run_header_v1_20260327.json

Observed:

  • Pointer corpus: template kept 20092/42802, encoded bytes improved by ~0.3904%.
  • Outer-leaf corpus: 0 keeps across tested configs.
  • header_v1 pre-transform: no measurable outer-leaf improvement in this pass.

Repro details are documented in docs/benchmarks/CELESTIA_VLOG_COMPRESSION_NOTES_2026-03-26.md.

Owner Author

snissn commented Mar 27, 2026

Tracking ticket with full experiment details and exact reproduction commands is now filed at: #855

Key outcome from this run:

  • Pointer corpus: ~0.39% encoded-byte reduction (20092/42802 keeps)
  • Outer-leaf corpus: 0 keeps across tested configs, including header_v1 pre-transform prototype

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f8896fcf8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
Comment thread TreeDB/cmd/template_corpus_extract/main.go Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR advances the template-compression investigation by wiring template store/config into the value-log rewrite path, adding class-aware rewrite observability, and introducing lab tooling to extract corpora and run template sweeps.

Changes:

  • Add template rewrite stats (attempted/kept/input/output + per-class reason counters) and expose them in treemap vlog-rewrite.
  • Wire TemplateStore into rewrite options and side-store lookup plumbing.
  • Add new lab commands: template_corpus_extract, template_lab, and template_seed_train (+ docs/benchmark notes) to support corpus extraction, sweeps, and store seeding.
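To make the per-class counters concrete, here is a minimal sketch of how attempted/kept/input/output totals and reason counters could be grouped by payload class. All type and method names are hypothetical and do not reflect the actual fields in TreeDB/vlog_rewrite.go; the choice to count a rejected payload as passing through at its input size is also an assumption.

```go
package main

import "fmt"

// payloadClass names the two classes discussed in this PR.
type payloadClass string

const (
	classPointerValue payloadClass = "pointer_value"
	classOuterLeaf    payloadClass = "outer_leaf"
)

// classStats is a hypothetical per-class counter set.
type classStats struct {
	Attempted   uint64
	Kept        uint64
	InputBytes  uint64
	OutputBytes uint64
	Reasons     map[string]uint64 // rejection reason -> count
}

// templateStats maps each payload class to its counters.
type templateStats map[payloadClass]*classStats

// record adds one template attempt to the counters for its class.
func (s templateStats) record(class payloadClass, inBytes, outBytes int, kept bool, reason string) {
	cs := s[class]
	if cs == nil {
		cs = &classStats{Reasons: make(map[string]uint64)}
		s[class] = cs
	}
	cs.Attempted++
	cs.InputBytes += uint64(inBytes)
	if kept {
		cs.Kept++
		cs.OutputBytes += uint64(outBytes)
		return
	}
	// Assumption: a rejected payload is written unchanged, so it counts at input size.
	cs.OutputBytes += uint64(inBytes)
	cs.Reasons[reason]++
}

// keepRate returns kept/attempted, or 0 when nothing was attempted.
func (c *classStats) keepRate() float64 {
	if c == nil || c.Attempted == 0 {
		return 0
	}
	return float64(c.Kept) / float64(c.Attempted)
}

func main() {
	stats := templateStats{}
	stats.record(classPointerValue, 100, 40, true, "")
	stats.record(classPointerValue, 100, 0, false, "tmpl_no_candidates")
	stats.record(classOuterLeaf, 4096, 0, false, "tmpl_no_candidates")
	for _, class := range []payloadClass{classPointerValue, classOuterLeaf} {
		cs := stats[class]
		fmt.Printf("%s attempted=%d kept=%d in=%d out=%d keep_rate=%.2f reasons=%v\n",
			class, cs.Attempted, cs.Kept, cs.InputBytes, cs.OutputBytes, cs.keepRate(), cs.Reasons)
	}
}
```

Keeping the reason map per class is what lets a report separate, for example, pointer-value rejections from outer-leaf rejections instead of showing one blended total.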

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

File | Description
docs/benchmarks/CELESTIA_VLOG_COMPRESSION_NOTES_2026-03-26.md | Adds reproducible commands/results for the template lab sweep.
TreeDB/vlog_rewrite.go | Public treedb rewrite stats extended with template/class counters and reason maps.
TreeDB/db/vlog_rewrite.go | Implements template-aware rewrite writer, applies template compression during offline rewrite, and plumbs stats/reason snapshots.
TreeDB/db/db.go | Adds ValueLogOptions.TemplateStore to support template encoding (e.g., rewrite prepass).
TreeDB/side_store_lookups.go | Wires templatedb lookups and conditionally sets TemplateStore when template mode is enabled.
TreeDB/cmd/treemap/main.go | Extends vlog-rewrite command with template flags, templatedb wiring, and class-aware output lines.
TreeDB/db/vlog_rewrite_test.go | Adds rewrite writer tests validating template prepass ordering and template-only dict bypass.
TreeDB/caching/db.go | Adds per-record dict-class range splitting and appends mixed-class batches via multiple dict IDs.
TreeDB/caching/vlog_dict_classifier_test.go | Adds tests for class-range splitting behavior (split outer-leaf vs single mode).
TreeDB/cmd/template_corpus_extract/main.go | New command to extract pointer/outer-leaf corpora + manifest.
TreeDB/cmd/template_corpus_extract/README.md | Documents corpus extraction output format and usage.
TreeDB/cmd/template_lab/main.go | New sweep harness for template configs and optional outer-leaf pretransform experiments.
TreeDB/cmd/template_lab/README.md | Documents lab harness input format, flags, and notes.
TreeDB/cmd/template_seed_train/main.go | New command to seed/warm templatedb from an existing app dir.
TreeDB/cmd/template_seed_train/README.md | Documents seeding usage and behavior.
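The TreeDB/caching/db.go entry above mentions splitting a batch into per-record dict-class ranges. One plausible reading is that consecutive records of the same class are grouped into contiguous runs, so each run can be appended with its own dict ID. The sketch below illustrates only that grouping; `classRange` and `splitClassRanges` are hypothetical names, and the real function may differ in inputs and in how "single mode" is handled.

```go
package main

import "fmt"

// classRange is a half-open run [Start, End) of records sharing one class.
type classRange struct {
	Class string
	Start int
	End   int
}

// splitClassRanges groups consecutive records with the same class into runs.
// The input holds one class label per record, in batch order.
func splitClassRanges(classes []string) []classRange {
	var out []classRange
	for i, c := range classes {
		if n := len(out); n > 0 && out[n-1].Class == c {
			out[n-1].End = i + 1 // extend the current run
			continue
		}
		out = append(out, classRange{Class: c, Start: i, End: i + 1})
	}
	return out
}

func main() {
	batch := []string{"pointer_value", "pointer_value", "outer_leaf", "outer_leaf", "pointer_value"}
	for _, r := range splitClassRanges(batch) {
		fmt.Printf("%s [%d,%d)\n", r.Class, r.Start, r.End)
	}
}
```

A batch containing a single class yields exactly one range, which corresponds to the unsplit case.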


Comment thread TreeDB/cmd/template_lab/main.go
Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
Comment thread TreeDB/cmd/template_corpus_extract/main.go Outdated
Comment thread TreeDB/side_store_lookups.go
Comment thread TreeDB/caching/db.go
Comment thread TreeDB/cmd/template_corpus_extract/main.go Outdated
Owner Author

snissn commented Mar 27, 2026

Follow-up update pushed in 94055dff:

  • Fixed template mask sparse-encode correctness bug in TreeDB/template/match.go (removed in-place mutation of input values).
  • Added regression test: TestEncodeMaskTemplateSparse_DoesNotMutateInput.
  • Switched template_lab default to mask-enabled mode (-disable-mask-templates=false).
  • Re-ran full corpus sweeps with mask mode enabled:
    • /tmp/template_lab_run_mask_on_off_20260327.json
    • /tmp/template_lab_run_mask_on_header_v1_20260327.json
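The class of bug fixed here (an encoder that mutates its input) and the shape of a regression check for it can be illustrated with a small sketch. This is not the code in TreeDB/template/match.go: `xorDelta` is a hypothetical encoder chosen only because it makes in-place mutation easy to picture.

```go
package main

import (
	"bytes"
	"fmt"
)

// xorDelta is a hypothetical encoder: it returns value XOR tmpl for the
// overlapping prefix and copies any remaining bytes of value as-is.
// It writes into a fresh buffer, so the caller's value slice is never modified.
func xorDelta(tmpl, value []byte) []byte {
	out := make([]byte, len(value))
	for i := range value {
		if i < len(tmpl) {
			out[i] = value[i] ^ tmpl[i]
		} else {
			out[i] = value[i]
		}
	}
	return out
}

func main() {
	tmpl := []byte("key=AAAA")
	value := []byte("key=ABCA")

	// Regression-style check: snapshot the input, encode, then compare.
	before := append([]byte(nil), value...)
	delta := xorDelta(tmpl, value)

	fmt.Println("input unchanged:", bytes.Equal(before, value))
	fmt.Println("delta:", delta)
}
```

A buggy variant would write the XOR result back into `value`, which silently corrupts later encode attempts that reuse the same slice; the snapshot comparison catches that.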

Key result (pointer corpus):

  • Best row in off run: tmpl_ms4_fk8_fetch16
  • Encoded bytes 132,634,951 vs raw 880,146,915 (~84.93% reduction)
  • Encoded gzip 19,144,812 vs raw gzip 428,482,010
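The reduction percentage quoted above follows from 100 × (1 − encoded/raw). A short sketch to reproduce it from the byte counts in this comment (`reductionPct` is an illustrative helper, not part of the lab tooling):

```go
package main

import "fmt"

// reductionPct returns the percentage by which encoded is smaller than raw.
func reductionPct(raw, encoded uint64) float64 {
	if raw == 0 {
		return 0
	}
	return 100 * (1 - float64(encoded)/float64(raw))
}

func main() {
	// Byte counts copied from the pointer-corpus result in this comment.
	fmt.Printf("encoded vs raw:   %.2f%%\n", reductionPct(880146915, 132634951))
	fmt.Printf("gzip(encoded) vs gzip(raw): %.2f%%\n", reductionPct(428482010, 19144812))
}
```

Applied to the gzip figures, the same formula gives a reduction of roughly 95.53%, so the template-encoded output also remains much smaller after general-purpose compression.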

Outer-leaf remained unchanged (0 keeps across tested configs), including header_v1 pre-transform.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94055dff59


Comment thread TreeDB/cmd/template_lab/main.go Outdated
Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
Copilot AI review requested due to automatic review settings March 27, 2026 22:53
Owner Author

snissn commented Mar 27, 2026

Follow-up pushed in 3c6fbfd2:

What was added

  • New outer-leaf pretransform mode in template_lab:
    • header_dir_delta_v1
    • Carries header side-bytes (PageID+Checksum) and delta-normalizes columnar-prefix directory metadata (key/val/prefix dirs) in-place.
  • New round-trip tests in TreeDB/cmd/template_lab/main_test.go to prove lossless reverse transform.
  • New template_lab tuning flags for training/cold-search investigation:
    • -template-train-sample-stride
    • -template-synthesize-every
    • -template-min-anchor-freq
    • -template-min-presence-ratio
    • -template-min-publish-savings
    • -template-min-publish-ratio
    • -template-cold-search-after
    • -template-cold-search-probe-every
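As a rough illustration of delta normalization with a lossless reverse transform, the sketch below delta-encodes a directory of offsets and checks the round trip. It is a simplified model under stated assumptions: the real header_dir_delta_v1 layout, field widths, and in-place behavior are not described in this thread, so `deltaEncode`/`deltaDecode` and the `uint32` offsets are hypothetical, and this version allocates new slices rather than working in place.

```go
package main

import "fmt"

// deltaEncode replaces each offset with its difference from the previous one.
// Subtraction wraps modulo 2^32, so decoding is exact even for non-monotonic input.
func deltaEncode(offsets []uint32) []uint32 {
	out := make([]uint32, len(offsets))
	var prev uint32
	for i, v := range offsets {
		out[i] = v - prev
		prev = v
	}
	return out
}

// deltaDecode reverses deltaEncode by taking a running sum.
func deltaDecode(deltas []uint32) []uint32 {
	out := make([]uint32, len(deltas))
	var acc uint32
	for i, d := range deltas {
		acc += d
		out[i] = acc
	}
	return out
}

func main() {
	dir := []uint32{64, 96, 128, 160} // illustrative directory offsets
	enc := deltaEncode(dir)
	dec := deltaDecode(enc)
	fmt.Println("encoded:", enc)
	fmt.Println("decoded:", dec)
}
```

The motivation for such a transform is that evenly spaced entries turn into repeated small deltas, which makes pages with different absolute offsets look more alike to a matcher.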

Experiments

Corpus: /tmp/template_corpus_run_20260327_120046

  1. Baseline header_dir_delta_v1 outer-leaf sweep (20k pages):
     • /tmp/template_lab_outer_header_dir_delta_20k.json
     • Keeps: 0 across all rows.
  2. Aggressive synthesis/routing sweep (20k pages):
     • /tmp/template_lab_outer_header_dir_delta_20k_aggr.json
     • Keeps: 0 across all rows.
     • Reasons: tmpl_no_candidates=40000, templates_published=0.
  3. Coarser fingerprint check (k=1) still produced 0 keeps.

Conclusion

Outer-leaf pages remain non-templateable in this path even with directory normalization + aggressive template training knobs. This further supports pivoting outer-leaf optimization toward non-template mechanisms (block/dict/codec path), while template remains high-impact for pointer-value payloads.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c6fbfd238


Comment thread TreeDB/cmd/template_seed_train/main.go
Comment thread TreeDB/cmd/template_lab/main.go Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.



Comment thread TreeDB/cmd/template_lab/main.go Outdated
Comment thread TreeDB/cmd/treemap/main.go
Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
Comment thread TreeDB/cmd/template_seed_train/main.go
Comment thread TreeDB/cmd/template_corpus_extract/main.go Outdated
Comment thread TreeDB/cmd/template_lab/main.go Outdated
Copilot AI review requested due to automatic review settings March 30, 2026 07:38
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 2 comments.



Comment thread TreeDB/cmd/template_corpus_extract/main.go Outdated
Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3736bfc31


Comment thread TreeDB/cmd/template_lab/main.go
Comment thread TreeDB/side_store_lookups.go Outdated
Copilot AI review requested due to automatic review settings March 30, 2026 07:50
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2834854a91


Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated 3 comments.



Comment thread TreeDB/db/vlog_rewrite.go Outdated
Comment thread TreeDB/cmd/template_seed_train/main.go
Comment thread TreeDB/db/vlog_rewrite.go
@snissn snissn changed the base branch from followon/vlog-dict-class-split-experiment to main March 30, 2026 08:05
Copilot AI review requested due to automatic review settings March 30, 2026 08:15
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.



Comment thread TreeDB/cmd/template_seed_train/main.go Outdated

func (s readOnlyTemplateStore) GetTemplateDef(ctx context.Context, templateID uint64) ([]byte, error) {
	if s.inner == nil {
		return nil, nil
Copilot AI Mar 30, 2026

readOnlyTemplateStore.GetTemplateDef returns (nil, nil) when inner is nil. The template.Store contract expects an error like template.ErrMissingTemplate on misses; returning nil error with empty bytes can lead to confusing downstream decode errors if this wrapper is ever used without an inner store.

Suggested change
-		return nil, nil
+		return nil, template.ErrMissingTemplate

Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
		return stats, errors.New("value-log reader unavailable")
	}

	seen := make(map[outerLeafKey]struct{}, 1<<20)
Copilot AI Mar 30, 2026

The outer-leaf dedupe set is preallocated with a fixed capacity of 1<<20 regardless of the requested -limit/-stride. This can allocate a large amount of memory for small runs; consider sizing the map based on the effective limit (or leaving it unhinted) similar to template_corpus_extract’s approach.

Suggested change
-	seen := make(map[outerLeafKey]struct{}, 1<<20)
+	var seen map[outerLeafKey]struct{}
+	if limit > 0 {
+		// Approximate upper bound on distinct sampled entries we may keep in the dedupe set.
+		effectiveMax := (limit + stride - 1) / stride
+		if effectiveMax < 1 {
+			effectiveMax = 1
+		}
+		// Avoid over-allocating excessively large maps.
+		if effectiveMax > 1<<20 {
+			effectiveMax = 1 << 20
+		}
+		seen = make(map[outerLeafKey]struct{}, effectiveMax)
+	} else {
+		// Unbounded run: let the map grow as needed without a large fixed hint.
+		seen = make(map[outerLeafKey]struct{})
+	}

Comment thread TreeDB/cmd/template_seed_train/main.go Outdated
	if reader == nil {
		return 0, 0, errors.New("value-log reader unavailable")
	}
	seen := make(map[outerLeafKey]struct{}, 1<<20)
Copilot AI Mar 30, 2026

The probe outer-leaf dedupe set is preallocated with a fixed capacity of 1<<20. For small probe limits this can be unnecessarily memory-heavy; consider basing the initial capacity on the limit (or omitting the hint).

Suggested change
-	seen := make(map[outerLeafKey]struct{}, 1<<20)
+	// Size the dedupe set based on the probe limit to avoid unnecessary over-allocation.
+	seenCap := 0
+	if limit > 0 {
+		seenCap = limit
+		if seenCap > 1<<20 {
+			seenCap = 1 << 20
+		}
+	}
+	seen := make(map[outerLeafKey]struct{}, seenCap)

Comment thread TreeDB/cmd/template_corpus_extract/main.go
Comment thread TreeDB/cmd/template_corpus_extract/main.go
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9428f3d9c6


Comment thread TreeDB/cmd/template_seed_train/main.go
Comment thread TreeDB/cmd/template_lab/main.go
Copilot AI review requested due to automatic review settings March 30, 2026 08:35
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.



Comment thread TreeDB/cmd/template_lab/main.go
Comment thread TreeDB/cmd/template_lab/main.go
Comment thread TreeDB/caching/db.go
@snissn snissn merged commit 18a6e9c into main Mar 30, 2026
17 checks passed
2 participants