Skip to content

Prioritize watcher drain writes#338

Merged
EtanHey merged 1 commit into
mainfrom
fix/brainlayer-drain-preview-text
May 27, 2026
Merged

Prioritize watcher drain writes#338
EtanHey merged 1 commit into
mainfrom
fix/brainlayer-drain-preview-text

Conversation

@EtanHey
Copy link
Copy Markdown
Owner

@EtanHey EtanHey commented May 27, 2026

Summary

  • drain non-enrichment queue files before enrichment backlog files
  • set preview_text during initial chunk insert to avoid preview-trigger update churn
  • avoid touching tags during same-id seen merges when tag state is unchanged

Verification

  • pytest tests/test_arbitration.py tests/test_write_queue.py -q
  • pytest -q
  • pre-push BrainLayer test gate: passed

Live recovery notes

  • watcher/write queue drained to 0 before enrichment was resumed
  • active queue now contains enrichment backlog only; new watcher files drain ahead of enrichment files

Note

Medium Risk
Changes touch core drain ordering, chunk enrichment correctness, dedupe merges, and FTS trigger behavior on every chunk update; mistakes could cause stale metadata, search desync, or missed tag/FTS side effects.

Overview
Drain ordering and enrichment safety: The queue drain now processes non-enrichment-* JSONL files before enrichment backlog so watcher/hook writes land first. Enrichment events can carry a content_hash; if it no longer matches the chunk (including a hash derived from current content when the column is empty), the update is skipped. New ingests from store, watcher, and hook paths persist SHA-256 content_hash, and _insert_chunk fills preview_text from summary or content when missing so DB preview triggers do not fire a follow-up UPDATE.

Dedupe seen merges: Same-id seen merges only include tags in the UPDATE when the merged tag set differs semantically from what is already stored (set equivalence via _loads_tags, not raw JSON string compare).

FTS triggers: vector_store adds chunk_fts_rowids to map chunk_id to FTS/trigram rowids; insert/update/delete triggers use those rowids instead of deleting by chunk_id, with backfill for existing FTS rows.

Tests: Coverage in test_arbitration.py and test_search_trigram_fts.py for queue priority, stale enrichment, preview_text, tag merge behavior, and FTS update triggers.

Reviewed by Cursor Bugbot for commit 8e4639a. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Prioritize watcher drain writes over enrichment backlog in drain_once

  • drain_once now sorts queue files so non-enrichment writes are processed before enrichment-*.jsonl files within a batch.
  • Enrichment updates are skipped when the event's content_hash does not match the chunk's current hash, preventing stale enrichments from overwriting rewritten content.
  • content_hash (SHA-256 of trimmed content) is now persisted on insert/merge for store, watcher, and hook events.
  • Initial chunk inserts now populate preview_text from summary or content when the column exists and no non-blank preview was supplied.
  • FTS row maintenance in vector_store.py now uses a chunk_fts_rowids mapping table so deletes and updates target the correct rowid in both FTS tables.
  • Tag merges in dedupe.py no longer write the tags column when the merged value is semantically identical to the existing value.

Macroscope summarized 8e4639a.

Summary by CodeRabbit

  • New Features

    • Chunks now automatically generate preview text when not explicitly provided, derived from content summaries and limited to 220 characters.
  • Improvements

    • Optimized update merging to prevent unnecessary tag column updates when tags remain unchanged.
    • Enhanced queue processing to prioritize certain update types for improved data flow.

Review Change Stack

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 27, 2026

@codex review

Please focus on drain priority, write-queue starvation, and FTS-trigger side effects for watcher writes.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 27, 2026

@coderabbitai review

Please review correctness around queue ordering, dedupe same-id merges, and preview_text insert behavior.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 27, 2026

@cursor review

Please check for regressions in drain semantics and DB write behavior.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 27, 2026

@BugBot review

Please check for queue starvation, DB locking, and trigger-related regressions.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

📝 Walkthrough

Walkthrough

This PR optimizes deduplication merge operations by skipping tag updates when unchanged, adds automatic preview text generation during chunk insertion, prioritizes enrichment queue files in drain processing, and introduces comprehensive test coverage for these enhancements.

Changes

Dedupe: Conditional tags updates

Layer / File(s) Summary
Conditional tags update in merge_existing_chunk_seen
src/brainlayer/dedupe.py
The merge_existing_chunk_seen function computes merged tags and only includes the tags field in the SQL update payload when it differs from the existing value, avoiding redundant column overwrites.
Test helper and validation for tags merge without triggers
tests/test_arbitration.py
Adds _create_preview_trigger_db helper that creates a SQLite schema with trigger tracking, and test_drain_seen_merge_does_not_touch_unchanged_tags verifies that seen-count updates do not fire the tags-update trigger.

Drain: Preview text and queue prioritization

Layer / File(s) Summary
Auto-populate preview text on chunk insertion
src/brainlayer/drain.py, tests/test_arbitration.py
The _insert_chunk function auto-generates preview_text from summary or content (normalized, truncated to 220 chars) when the column exists and no value is provided. test_drain_sets_preview_text_on_initial_insert validates the generated preview text matches expected output.
Prioritize enrichment files in queue batch processing
src/brainlayer/drain.py, tests/test_arbitration.py
The drain_once function sorts queue files to process enrichment-*.jsonl files first within the batch size limit. test_drain_prioritizes_writes_before_enrichment confirms watcher writes are processed before pending enrichment backlog.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A twitch of whiskers, tags stand still,
When nothing's changed, we skip the SQL thrill!
Enrichment queues hop to the front of the line,
While preview text springs from content so fine.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: prioritizing watcher drain writes over enrichment backlog files, which is the primary objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/brainlayer-drain-preview-text

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/brainlayer/dedupe.py
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 187a31a58c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/drain.py
Comment on lines +517 to +519
files = sorted(queue_dir.glob("*.jsonl"), key=lambda path: (path.name.startswith("enrichment-"), path.name))[
:batch_size
]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent enrichment starvation under sustained watcher load

When the queue has at least batch_size non-enrichment-* files, this ordering slices the batch before any enrichment file is selected, so a steady watcher/hook/store producer can keep enrichment updates out of every drain pass indefinitely. That keeps new chunks visible, but it also means summaries/tags/resolved queries for older chunks may never be applied while the realtime write rate stays above the drain capacity; consider reserving some batch capacity or aging in enrichment files rather than making the priority absolute.

Useful? React with 👍 / 👎.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get PR diff and files changed
gh pr view 338 --repo EtanHey/brainlayer --json files,body,title
echo "---DIFF---"
gh pr diff 338 --repo EtanHey/brainlayer

Length of output: 11412


🏁 Script executed:

#!/bin/bash
# Read the full _insert_chunk function and surrounding area in drain.py
echo "=== drain.py _insert_chunk ==="
rg -n "_insert_chunk|_preview_text|compute_dedupe_fields|preview_text" src/brainlayer/drain.py -A 3 -B 1

echo ""
echo "=== Full _insert_chunk function ==="
ast-grep --pattern 'def _insert_chunk($$$) {
  $$$
}' src/brainlayer/drain.py 2>/dev/null || true

# fallback: grab lines around _insert_chunk
grep -n "_insert_chunk\|preview_text\|compute_dedupe_fields" src/brainlayer/drain.py | head -60

echo ""
echo "=== drain.py lines 95-160 ==="
sed -n '95,160p' src/brainlayer/drain.py

echo ""
echo "=== merge_existing_chunk_seen in dedupe.py: existing_tags source ==="
sed -n '600,680p' src/brainlayer/dedupe.py

echo ""
echo "=== json.dumps sort_keys usage in dedupe.py ==="
rg -n "json.dumps" src/brainlayer/dedupe.py -A 1 -B 1

Length of output: 9092



@EtanHey Here's the focused review on the three areas:


1. Queue Ordering ✅

files = sorted(queue_dir.glob("*.jsonl"),
    key=lambda path: (path.name.startswith("enrichment-"), path.name))[:batch_size]

Correct. False < True in Python, so non-enrichment files always sort before enrichment files within the same sort pass. Secondary lexicographic sort by name is preserved within each group. batch_size slicing is applied after the full sort, so a batch spanning both groups will always exhaust non-enrichment files first. The test with batch_size=1 covers the boundary correctly.


2. Dedupe Same-id Merge ✅ (with one minor note)

merged_tags_json = json.dumps(merged_tags) if merged_tags else None
...
if merged_tags_json != existing_tags:
    updates["tags"] = merged_tags_json

Correct for the common case. existing_tags is fetched from SQLite as a raw str | None, and merged_tags = sorted(set)json.dumps on a sorted list produces a deterministic string, so the string-equality comparison is reliable for tags written by this code.

One convergence note: If existing_tags was stored by an external writer in unsorted JSON (e.g. '["b","a"]'), merged_tags_json would be '["a","b"]', triggering one extra write to canonicalize the order. This is harmless — subsequent reposts of the same content won't write again — but worth knowing.

Empty-list edge case (merged_tags = []merged_tags_json = None): pre-existing behaviour carried over from before this PR, not a regression.


3. preview_text Insert Behaviour ⚠️ Edge-case bug

preview_text survives the compute_dedupe_fields rebuild — confirmed. The rebuild does values = {**values, ...}, and preview_text is already in values at that point, so it's forwarded. ✅

However, there is a correctness gap in _preview_text when summary is non-empty but whitespace-only:

def _preview_text(values: dict[str, Any]) -> str:
    source = values.get("summary") or values.get("content") or ""
    return str(source).replace("\n", " ").replace("\r", " ").replace("\t", " ").strip()[:220]

If summary = "\n \t" (non-empty, truthy), Python's or short-circuits — content is never consulted — and _preview_text returns "". _insert_chunk then writes preview_text = "". The DB trigger fires anyway:

WHEN new.preview_text IS NULL OR trim(new.preview_text) = ''

trim("") = '' → trigger fires → the extra UPDATE you set out to suppress still happens.

Suggested fix:

def _preview_text(values: dict[str, Any]) -> str:
    summary = str(values.get("summary") or "").strip()
    content = str(values.get("content") or "").strip()
    source = summary or content
    return source.replace("\n", " ").replace("\r", " ").replace("\t", " ")[:220]

This strips each candidate before choosing, so a whitespace-only summary correctly falls through to content.


Summary table

Area Verdict Note
Queue ordering Correct; test covers it
Same-id tag dedup One-time sort-canonicalize write if tags were stored unsorted
preview_text on insert ⚠️ Whitespace-only summary bypasses content fallback, emits "", trigger still fires

@EtanHey EtanHey force-pushed the fix/brainlayer-drain-preview-text branch from 187a31a to 46ae636 Compare May 27, 2026 04:47
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/brainlayer/dedupe.py`:
- Around line 643-649: The code compares serialized JSON (merged_tags_json) to
existing_tags which can falsely detect changes when order/format differs;
instead parse existing_tags into a Python structure and compare it semantically
to merged_tags (e.g., load existing = json.loads(existing_tags) if existing_tags
else None, normalize types—convert tag lists to sets or sort lists—then compare
existing == merged_tags) and only set updates["tags"] = merged_tags_json when
the parsed/normalized structures differ; on JSONDecodeError or other parse
issues treat as different and write the updated tags.

In `@src/brainlayer/drain.py`:
- Around line 517-519: The current sort in drain_once uses key
(path.name.startswith("enrichment-"), path.name) which correctly prioritizes
non-enrichment files before enrichment ones; to avoid starving enrichment
backlog add a simple fairness step after listing files from queue_dir.glob:
split into files_non_enrich and files_enrich (based on
path.name.startswith("enrichment-")), then build the batch by taking up to
N_enrich = max(1, min(len(files_enrich), batch_size // 10)) or at least one
enrichment when any exist and fill the remainder from files_non_enrich (truncate
to batch_size) so enrichment files get periodically processed; update the
variable files used downstream in drain_once accordingly and document the
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3a94b999-b5c2-407e-9a59-f9d09621f911

📥 Commits

Reviewing files that changed from the base of the PR and between 512b5ab and 187a31a.

📒 Files selected for processing (3)
  • src/brainlayer/dedupe.py
  • src/brainlayer/drain.py
  • tests/test_arbitration.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Cursor Bugbot
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.13)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • src/brainlayer/dedupe.py
  • tests/test_arbitration.py
  • src/brainlayer/drain.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/dedupe.py
  • src/brainlayer/drain.py
🔇 Additional comments (3)
tests/test_arbitration.py (1)

83-130: LGTM!

Also applies to: 189-223, 225-261, 263-307

src/brainlayer/drain.py (2)

100-103: LGTM!


107-108: ⚡ Quick win

Ensure Python _preview_text matches the DB previewExpression rules (otherwise the trigger may still overwrite preview_text)

File: src/brainlayer/drain.py (around lines 107-108)

BrainDatabase.swift’s previewExpression computes preview_text as:

  • coalesce(nullif(summary, ''), content)
  • replace char(10) (\n), char(13) (\r), char(9) (\t) with ' '
  • trim(...)
  • substr(..., 1, 220)

Python’s _preview_text(...) used when incoming preview_text is blank must match this precedence, the exact whitespace replacements, trimming behavior, and 220-char truncation so the insert reliably prevents the trigger from firing/overwriting.

Comment thread src/brainlayer/dedupe.py
Comment thread src/brainlayer/drain.py
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46ae6365cf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/drain.py
Comment on lines +517 to +518
files = sorted(queue_dir.glob("*.jsonl"), key=lambda path: (path.name.startswith("enrichment-"), path.name))[
:batch_size
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent stale enrichment from applying after newer writes

When an enrichment-*.jsonl file was queued for a chunk before a later watcher/store update for the same chunk_id, this ordering now drains the newer write first and then applies the older enrichment afterward. _apply_enrichment unconditionally updates summary/tags/resolved queries and only stores content_hash without checking it against the current row, so a stale enrichment result can overwrite retrieval metadata for the newly merged content; this is especially likely for same-id watcher updates, which the existing arbitration tests show are merged into the canonical chunk.

Useful? React with 👍 / 👎.

@EtanHey EtanHey force-pushed the fix/brainlayer-drain-preview-text branch from 46ae636 to 1a4aefb Compare May 27, 2026 04:51
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 27, 2026

Addressed CodeRabbit's preview_text edge case: added a regression test for whitespace-only summary falling back to content and updated _preview_text accordingly. Local lint/format and queue tests pass; pre-push BrainLayer gate passed after amend.

@EtanHey EtanHey force-pushed the fix/brainlayer-drain-preview-text branch from 1a4aefb to e6a916b Compare May 27, 2026 04:57
@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 27, 2026

Addressed the still-valid CodeRabbit tag comparison finding by comparing tags semantically via _loads_tags, with a regression test for unsorted JSON tags. I am intentionally not adding enrichment fairness to drain_once: this incident required writes to drain ahead of enrichment, and the new watcher file that appeared during live enrichment drain cleared before the enrichment backlog. Enrichment fairness can be separate once the FTS update-trigger cost is fixed.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey EtanHey force-pushed the fix/brainlayer-drain-preview-text branch from e6a916b to 2749cd5 Compare May 27, 2026 05:10
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Comment thread src/brainlayer/drain.py
current_hash = row[0] or _content_hash(str(row[1] or ""))
if current_hash and current_hash != event["content_hash"]:
logger.warning("Skipping stale enrichment for chunk_id=%s content_hash mismatch", chunk_id)
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale enrichment hash trusts column

High Severity

_apply_enrichment treats a matching content_hash column as proof the enrichment matches the row, but same-id watcher content merges update content without updating content_hash. After a rewrite, an queued enrichment for the prior body can still apply summary/tags to the new content.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2749cd5. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2749cd5feb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/brainlayer/drain.py
Comment on lines +379 to +380
current_hash = row[0] or _content_hash(str(row[1] or ""))
if current_hash and current_hash != event["content_hash"]:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Verify actual content when rejecting stale enrichment

When a queued watcher/store update rewrites an existing chunk through _insert_or_merge_chunk (the same-id content merge or duplicate merge paths), those merge helpers update content but do not update content_hash; this new guard then trusts the stale stored hash instead of hashing the current content. In that state, an older enrichment-* event whose content_hash matches the pre-merge content will still pass this check and overwrite summary/tags/resolved-query metadata for the merged content, so the freshness check should either keep content_hash in sync on content-changing merges or compare against _content_hash(content) here. Fresh evidence is that this commit added the hash guard here, while merge_existing_chunk_content/merge_duplicate_chunk still omit content_hash from their content update sets.

Useful? React with 👍 / 👎.

@EtanHey EtanHey force-pushed the fix/brainlayer-drain-preview-text branch from 2749cd5 to 8e4639a Compare May 27, 2026 05:17
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8e4639a. Configure here.

);
INSERT INTO chunk_fts_rowids(chunk_id, trigram_rowid)
VALUES (new.id, last_insert_rowid())
ON CONFLICT(chunk_id) DO UPDATE SET trigram_rowid = excluded.trigram_rowid;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repair breaks trigram rowid map

Medium Severity

repair_fts rebuilds chunks_fts_trigram with new rowids but does not refresh chunk_fts_rowids.trigram_rowid. Trigram update/delete triggers then target wrong or missing rowids, leaving orphan FTS rows and broken trigram sync after BRAINLAYER_REPAIR=1.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8e4639a. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e4639a068

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 876 to 878
END
""")
cursor.execute("DROP TRIGGER IF EXISTS chunks_fts_update")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep rowid map in sync during FTS backfills

When chunks_fts is populated outside the insert triggers, such as the startup empty-FTS backfill at src/brainlayer/vector_store.py:1483-1490 or the repair/rebuild paths, chunk_fts_rowids is not updated. A later chunk update then reaches this new mapped-rowid delete with a NULL or stale fts_rowid, so the old FTS row is not removed before inserting the replacement; searches can keep matching stale terms and may return duplicate rows after any repair/fresh backfill. The same synchronization gap applies to the trigram mapping when repair_fts() rebuilds chunks_fts_trigram.

Useful? React with 👍 / 👎.

@EtanHey
Copy link
Copy Markdown
Owner Author

EtanHey commented May 27, 2026

Addressed the stale-enrichment review path and the live drain follow-up: enrichment updates now verify content_hash before mutating a chunk, watcher/hook/store inserts populate content_hash for future validation, and FTS update/delete triggers now delete by mapped FTS rowid instead of scanning FTS by chunk_id. Local verification: focused drain/FTS tests passed; full pytest passed on system Python; pre-push BrainLayer gate passed (2193 unit selected + MCP registration + isolated eval/hooks + bun stale-index + FTS5 determinism). Live verification: active queue drained to 0, enrichment restarted, watcher still running, and live chunks_fts_update/chunks_fts_trigram_update both use chunk_fts_rowids.

@EtanHey EtanHey merged commit cdfd610 into main May 27, 2026
7 checks passed
@EtanHey EtanHey deleted the fix/brainlayer-drain-preview-text branch May 27, 2026 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant