Prioritize watcher drain writes#338
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
@codex review Please focus on drain priority, write-queue starvation, and FTS-trigger side effects for watcher writes. |
|
@coderabbitai review Please review correctness around queue ordering, dedupe same-id merges, and preview_text insert behavior. |
|
@cursor review Please check for regressions in drain semantics and DB write behavior. |
|
@BugBot review Please check for queue starvation, DB locking, and trigger-related regressions. |
📝 WalkthroughWalkthroughThis PR optimizes deduplication merge operations by skipping tag updates when unchanged, adds automatic preview text generation during chunk insertion, prioritizes enrichment queue files in drain processing, and introduces comprehensive test coverage for these enhancements. ChangesDedupe: Conditional tags updates
Drain: Preview text and queue prioritization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 187a31a58c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| files = sorted(queue_dir.glob("*.jsonl"), key=lambda path: (path.name.startswith("enrichment-"), path.name))[ | ||
| :batch_size | ||
| ] |
There was a problem hiding this comment.
Prevent enrichment starvation under sustained watcher load
When the queue has at least batch_size non-enrichment-* files, this ordering slices the batch before any enrichment file is selected, so a steady watcher/hook/store producer can keep enrichment updates out of every drain pass indefinitely. That keeps new chunks visible, but it also means summaries/tags/resolved queries for older chunks may never be applied while the realtime write rate stays above the drain capacity; consider reserving some batch capacity or aging in enrichment files rather than making the priority absolute.
Useful? React with 👍 / 👎.
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Get PR diff and files changed
gh pr view 338 --repo EtanHey/brainlayer --json files,body,title
echo "---DIFF---"
gh pr diff 338 --repo EtanHey/brainlayerLength of output: 11412 🏁 Script executed: #!/bin/bash
# Read the full _insert_chunk function and surrounding area in drain.py
echo "=== drain.py _insert_chunk ==="
rg -n "_insert_chunk|_preview_text|compute_dedupe_fields|preview_text" src/brainlayer/drain.py -A 3 -B 1
echo ""
echo "=== Full _insert_chunk function ==="
ast-grep --pattern 'def _insert_chunk($$$) {
$$$
}' src/brainlayer/drain.py 2>/dev/null || true
# fallback: grab lines around _insert_chunk
grep -n "_insert_chunk\|preview_text\|compute_dedupe_fields" src/brainlayer/drain.py | head -60
echo ""
echo "=== drain.py lines 95-160 ==="
sed -n '95,160p' src/brainlayer/drain.py
echo ""
echo "=== merge_existing_chunk_seen in dedupe.py: existing_tags source ==="
sed -n '600,680p' src/brainlayer/dedupe.py
echo ""
echo "=== json.dumps sort_keys usage in dedupe.py ==="
rg -n "json.dumps" src/brainlayer/dedupe.py -A 1 -B 1Length of output: 9092
1. Queue Ordering ✅files = sorted(queue_dir.glob("*.jsonl"),
key=lambda path: (path.name.startswith("enrichment-"), path.name))[:batch_size]Correct. 2. Dedupe Same-id Merge ✅ (with one minor note)merged_tags_json = json.dumps(merged_tags) if merged_tags else None
...
if merged_tags_json != existing_tags:
updates["tags"] = merged_tags_jsonCorrect for the common case. One convergence note: If Empty-list edge case ( 3.
|
| Area | Verdict | Note |
|---|---|---|
| Queue ordering | ✅ | Correct; test covers it |
| Same-id tag dedup | ✅ | One-time sort-canonicalize write if tags were stored unsorted |
preview_text on insert |
Whitespace-only summary bypasses content fallback, emits "", trigger still fires |
187a31a to
46ae636
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/brainlayer/dedupe.py`:
- Around line 643-649: The code compares serialized JSON (merged_tags_json) to
existing_tags which can falsely detect changes when order/format differs;
instead parse existing_tags into a Python structure and compare it semantically
to merged_tags (e.g., load existing = json.loads(existing_tags) if existing_tags
else None, normalize types—convert tag lists to sets or sort lists—then compare
existing == merged_tags) and only set updates["tags"] = merged_tags_json when
the parsed/normalized structures differ; on JSONDecodeError or other parse
issues treat as different and write the updated tags.
In `@src/brainlayer/drain.py`:
- Around line 517-519: The current sort in drain_once uses key
(path.name.startswith("enrichment-"), path.name) which correctly prioritizes
non-enrichment files before enrichment ones; to avoid starving enrichment
backlog add a simple fairness step after listing files from queue_dir.glob:
split into files_non_enrich and files_enrich (based on
path.name.startswith("enrichment-")), then build the batch by taking up to
N_enrich = max(1, min(len(files_enrich), batch_size // 10)) or at least one
enrichment when any exist and fill the remainder from files_non_enrich (truncate
to batch_size) so enrichment files get periodically processed; update the
variable files used downstream in drain_once accordingly and document the
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3a94b999-b5c2-407e-9a59-f9d09621f911
📒 Files selected for processing (3)
src/brainlayer/dedupe.pysrc/brainlayer/drain.pytests/test_arbitration.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Cursor Bugbot
- GitHub Check: Cursor Bugbot
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests
**/*.py: Usepaths.py:get_db_path()for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches
Files:
src/brainlayer/dedupe.pytests/test_arbitration.pysrc/brainlayer/drain.py
src/brainlayer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/brainlayer/**/*.py: Use retry logic onSQLITE_BUSYerrors; each worker must use its own database connection to handle concurrency safely
Classification must preserveai_code,stack_trace, anduser_messageverbatim; skipnoiseentries entirely and summarizebuild_loganddir_listingentries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback viaenrichment_controller.py, and Ollama as offline last-resort; allow override viaBRAINLAYER_ENRICH_BACKENDenv var
Configure enrichment rate viaBRAINLAYER_ENRICH_RATEenvironment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns:superseded_by,aggregated_into,archived_aton chunks table; exclude lifecycle-managed chunks from default search; allowinclude_archived=Trueto show history
Implementbrain_supersedewith safety gate for personal data (journals, notes, health/finance); use soft-delete forbrain_archivewith timestamp
Addsupersedesparameter tobrain_storefor atomic store-and-replace operations
Run linting and formatting with:ruff check src/ && ruff format src/
Run tests withpytest
UsePRAGMA wal_checkpoint(FULL)before and after bulk database operations to prevent WAL bloat
Files:
src/brainlayer/dedupe.pysrc/brainlayer/drain.py
🔇 Additional comments (3)
tests/test_arbitration.py (1)
83-130: LGTM!Also applies to: 189-223, 225-261, 263-307
src/brainlayer/drain.py (2)
100-103: LGTM!
107-108: ⚡ Quick winEnsure Python
_preview_textmatches the DBpreviewExpressionrules (otherwise the trigger may still overwritepreview_text)File:
src/brainlayer/drain.py(around lines 107-108)
BrainDatabase.swift’spreviewExpressioncomputespreview_textas:
coalesce(nullif(summary, ''), content)- replace
char(10)(\n),char(13)(\r),char(9)(\t) with' 'trim(...)substr(..., 1, 220)Python’s
_preview_text(...)used when incomingpreview_textis blank must match this precedence, the exact whitespace replacements, trimming behavior, and 220-char truncation so the insert reliably prevents the trigger from firing/overwriting.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46ae6365cf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| files = sorted(queue_dir.glob("*.jsonl"), key=lambda path: (path.name.startswith("enrichment-"), path.name))[ | ||
| :batch_size |
There was a problem hiding this comment.
Prevent stale enrichment from applying after newer writes
When an enrichment-*.jsonl file was queued for a chunk before a later watcher/store update for the same chunk_id, this ordering now drains the newer write first and then applies the older enrichment afterward. _apply_enrichment unconditionally updates summary/tags/resolved queries and only stores content_hash without checking it against the current row, so a stale enrichment result can overwrite retrieval metadata for the newly merged content; this is especially likely for same-id watcher updates, which the existing arbitration tests show are merged into the canonical chunk.
Useful? React with 👍 / 👎.
46ae636 to
1a4aefb
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Addressed CodeRabbit's preview_text edge case: added a regression test for whitespace-only summary falling back to content and updated _preview_text accordingly. Local lint/format and queue tests pass; pre-push BrainLayer gate passed after amend. |
1a4aefb to
e6a916b
Compare
|
Addressed the still-valid CodeRabbit tag comparison finding by comparing tags semantically via _loads_tags, with a regression test for unsorted JSON tags. I am intentionally not adding enrichment fairness to drain_once: this incident required writes to drain ahead of enrichment, and the new watcher file that appeared during live enrichment drain cleared before the enrichment backlog. Enrichment fairness can be separate once the FTS update-trigger cost is fixed. |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
e6a916b to
2749cd5
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
| current_hash = row[0] or _content_hash(str(row[1] or "")) | ||
| if current_hash and current_hash != event["content_hash"]: | ||
| logger.warning("Skipping stale enrichment for chunk_id=%s content_hash mismatch", chunk_id) | ||
| return |
There was a problem hiding this comment.
Stale enrichment hash trusts column
High Severity
_apply_enrichment treats a matching content_hash column as proof the enrichment matches the row, but same-id watcher content merges update content without updating content_hash. After a rewrite, an queued enrichment for the prior body can still apply summary/tags to the new content.
Reviewed by Cursor Bugbot for commit 2749cd5. Configure here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2749cd5feb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| current_hash = row[0] or _content_hash(str(row[1] or "")) | ||
| if current_hash and current_hash != event["content_hash"]: |
There was a problem hiding this comment.
Verify actual content when rejecting stale enrichment
When a queued watcher/store update rewrites an existing chunk through _insert_or_merge_chunk (the same-id content merge or duplicate merge paths), those merge helpers update content but do not update content_hash; this new guard then trusts the stale stored hash instead of hashing the current content. In that state, an older enrichment-* event whose content_hash matches the pre-merge content will still pass this check and overwrite summary/tags/resolved-query metadata for the merged content, so the freshness check should either keep content_hash in sync on content-changing merges or compare against _content_hash(content) here. Fresh evidence is that this commit added the hash guard here, while merge_existing_chunk_content/merge_duplicate_chunk still omit content_hash from their content update sets.
Useful? React with 👍 / 👎.
2749cd5 to
8e4639a
Compare
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 8e4639a. Configure here.
| ); | ||
| INSERT INTO chunk_fts_rowids(chunk_id, trigram_rowid) | ||
| VALUES (new.id, last_insert_rowid()) | ||
| ON CONFLICT(chunk_id) DO UPDATE SET trigram_rowid = excluded.trigram_rowid; |
There was a problem hiding this comment.
Repair breaks trigram rowid map
Medium Severity
repair_fts rebuilds chunks_fts_trigram with new rowids but does not refresh chunk_fts_rowids.trigram_rowid. Trigram update/delete triggers then target wrong or missing rowids, leaving orphan FTS rows and broken trigram sync after BRAINLAYER_REPAIR=1.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8e4639a. Configure here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e4639a068
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| END | ||
| """) | ||
| cursor.execute("DROP TRIGGER IF EXISTS chunks_fts_update") |
There was a problem hiding this comment.
Keep rowid map in sync during FTS backfills
When chunks_fts is populated outside the insert triggers, such as the startup empty-FTS backfill at src/brainlayer/vector_store.py:1483-1490 or the repair/rebuild paths, chunk_fts_rowids is not updated. A later chunk update then reaches this new mapped-rowid delete with a NULL or stale fts_rowid, so the old FTS row is not removed before inserting the replacement; searches can keep matching stale terms and may return duplicate rows after any repair/fresh backfill. The same synchronization gap applies to the trigram mapping when repair_fts() rebuilds chunks_fts_trigram.
Useful? React with 👍 / 👎.
|
Addressed the stale-enrichment review path and the live drain follow-up: enrichment updates now verify content_hash before mutating a chunk, watcher/hook/store inserts populate content_hash for future validation, and FTS update/delete triggers now delete by mapped FTS rowid instead of scanning FTS by chunk_id. Local verification: focused drain/FTS tests passed; full pytest passed on system Python; pre-push BrainLayer gate passed (2193 unit selected + MCP registration + isolated eval/hooks + bun stale-index + FTS5 determinism). Live verification: active queue drained to 0, enrichment restarted, watcher still running, and live chunks_fts_update/chunks_fts_trigram_update both use chunk_fts_rowids. |


Summary
Verification
Live recovery notes
Note
Medium Risk
Changes touch core drain ordering, chunk enrichment correctness, dedupe merges, and FTS trigger behavior on every chunk update; mistakes could cause stale metadata, search desync, or missed tag/FTS side effects.
Overview
Drain ordering and enrichment safety: The queue drain now processes non-
enrichment-*JSONL files before enrichment backlog so watcher/hook writes land first. Enrichment events can carry acontent_hash; if it no longer matches the chunk (including a hash derived from current content when the column is empty), the update is skipped. New ingests from store, watcher, and hook paths persist SHA-256content_hash, and_insert_chunkfillspreview_textfrom summary or content when missing so DB preview triggers do not fire a follow-up UPDATE.Dedupe seen merges: Same-id seen merges only include
tagsin the UPDATE when the merged tag set differs semantically from what is already stored (set equivalence via_loads_tags, not raw JSON string compare).FTS triggers:
vector_storeaddschunk_fts_rowidsto mapchunk_idto FTS/trigramrowids; insert/update/delete triggers use those rowids instead of deleting bychunk_id, with backfill for existing FTS rows.Tests: Coverage in
test_arbitration.pyandtest_search_trigram_fts.pyfor queue priority, stale enrichment, preview_text, tag merge behavior, and FTS update triggers.Reviewed by Cursor Bugbot for commit 8e4639a. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Prioritize watcher drain writes over enrichment backlog in
drain_oncedrain_oncenow sorts queue files so non-enrichment writes are processed beforeenrichment-*.jsonlfiles within a batch.content_hashdoes not match the chunk's current hash, preventing stale enrichments from overwriting rewritten content.content_hash(SHA-256 of trimmed content) is now persisted on insert/merge for store, watcher, and hook events.preview_textfrom summary or content when the column exists and no non-blank preview was supplied.vector_store.pynow uses achunk_fts_rowidsmapping table so deletes and updates target the correct rowid in both FTS tables.dedupe.pyno longer write thetagscolumn when the merged value is semantically identical to the existing value.Macroscope summarized 8e4639a.
Summary by CodeRabbit
New Features
Improvements