fix(auto-capture): strip OpenClaw wrapper metadata before extraction by chenhaonan-eth · Pull Request #229 · CortexReach/memory-lancedb-pro

chenhaonan-eth · 2026-03-15T16:48:04Z

Summary

This PR fixes auto-capture memory pollution caused by OpenClaw wrapper metadata being passed through to smart extraction as if it were user content.

It closes the gap reported in #211 and broadens the cleanup beyond Feishu so the same class of contamination is removed for other channel payloads that carry OpenClaw transport/context wrappers.

Problem

In some OpenClaw sessions, the user message payload can contain transport/context wrappers such as:

<relevant-memories>...</relevant-memories>
[UNTRUSTED DATA ...] ... [END UNTRUSTED DATA]
Conversation info (untrusted metadata): + fenced JSON
Sender (untrusted metadata): + fenced JSON
runtime event lines like System: [timestamp] Exec completed ...
session reset bootstrap text injected after /new or /reset

When these wrappers were not fully removed before auto-capture/smart extraction, the plugin could store garbage memories containing metadata like message_id, username, Conversation info, etc.

Root cause

normalizeQuery() already had cleanup behavior for retrieval, but the auto-capture path still had incomplete sanitization.

stripAutoCaptureInjectedPrefix() handled some wrappers, but it did not reliably strip the OpenClaw inbound metadata blocks and related system-event noise before smart extraction.

As a result, upstream wrapper text from OpenClaw could leak into:

pending ingress text
agent_end message normalization
smart extraction prompts
stored memory entries

What changed

1) Extracted cleanup into a dedicated helper

Added src/auto-capture-cleanup.ts to centralize auto-capture text normalization.

2) Strip wrapper metadata more aggressively in auto-capture

The new cleanup removes:

<relevant-memories> blocks
[UNTRUSTED DATA ...] blocks
session reset bootstrap prefix
inbound metadata sentinel blocks with fenced JSON payloads
System: [..] Exec completed|failed|started ... event lines
leading mention/addressing prefixes

It also re-runs cleanup passes and collapses extra blank lines so mixed wrapper payloads are handled robustly.
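
To make the cleanup passes above concrete, here is a minimal sketch of what such a helper could look like. It is not the code in src/auto-capture-cleanup.ts: the function name `cleanupAutoCaptureText`, every regex, and the pass limit are assumptions made for illustration. The session reset bootstrap prefix is left out because its exact text is not given in this description.

```typescript
// Hypothetical sketch only: names and patterns are illustrative assumptions,
// not the actual contents of src/auto-capture-cleanup.ts.

// Built at runtime so the regex source never contains a literal code fence.
const FENCE = "`".repeat(3);

const RELEVANT_MEMORIES_RE = /<relevant-memories>[\s\S]*?<\/relevant-memories>/g;
const UNTRUSTED_DATA_RE = /\[UNTRUSTED DATA[^\]]*\][\s\S]*?\[END UNTRUSTED DATA\]/g;
// Sentinel label, "(untrusted metadata):", then a fenced JSON payload.
const INBOUND_METADATA_RE = new RegExp(
  "(?:Conversation info|Sender)\\s*\\(untrusted metadata\\):\\s*" +
    FENCE + "json[\\s\\S]*?" + FENCE,
  "g",
);
// Runtime event lines such as "System: [timestamp] Exec completed ...".
const SYSTEM_EVENT_LINE_RE =
  /^System: \[[^\]]*\] Exec (?:completed|failed|started)\b.*$/gm;
// One or more leading "@name " addressing prefixes.
const LEADING_MENTION_RE = /^\s*(?:@\S+\s+)+/;

function cleanupAutoCaptureText(input: string): string {
  let text = input;
  // Re-run the passes until nothing changes (bounded), so a wrapper that
  // only becomes "leading" after another wrapper is removed is still caught.
  for (let pass = 0; pass < 5; pass++) {
    const before = text;
    text = text
      .replace(RELEVANT_MEMORIES_RE, "")
      .replace(UNTRUSTED_DATA_RE, "")
      .replace(INBOUND_METADATA_RE, "")
      .replace(SYSTEM_EVENT_LINE_RE, "")
      .trim()
      .replace(LEADING_MENTION_RE, "");
    if (text === before) break;
  }
  // Collapse runs of three or more newlines left behind by removed blocks.
  return text.replace(/\n{3,}/g, "\n\n").trim();
}
```

The fixed-point loop is one way to get the "re-runs cleanup passes" behavior described above; a real implementation may order or bound the passes differently.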

3) Reused the same normalization in all auto-capture entry points

Updated index.ts so both:

ingress capture normalization
agent_end message/block normalization

flow through the same helper and still honor shouldSkipReflectionMessage().
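
The shared-helper design can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual index.ts wiring: `makeAutoCaptureNormalizer`, `onIngress`, `onAgentEnd`, and both stand-in functions are hypothetical, and the real signature of shouldSkipReflectionMessage() is not shown in this PR description.

```typescript
// Hypothetical sketch: both entry points share one normalizer, so cleanup
// rules cannot drift apart between the ingress and agent_end paths.

type Cleaner = (text: string) => string;
type SkipCheck = (text: string) => boolean;

function makeAutoCaptureNormalizer(
  clean: Cleaner,
  shouldSkip: SkipCheck,
): (raw: string) => string | null {
  return (raw: string): string | null => {
    const cleaned = clean(raw);
    if (cleaned.length === 0) return null; // nothing left after stripping
    if (shouldSkip(cleaned)) return null; // still honor the skip check
    return cleaned;
  };
}

// Stand-ins for the real cleanup helper and for shouldSkipReflectionMessage();
// their behavior here is assumed purely for the example.
const standInClean: Cleaner = (text) =>
  text.replace(/<relevant-memories>[\s\S]*?<\/relevant-memories>/g, "").trim();
const standInSkip: SkipCheck = (text) => text.startsWith("[reflection]");

const normalize = makeAutoCaptureNormalizer(standInClean, standInSkip);

// Ingress path: a single raw message.
function onIngress(rawText: string): string | null {
  return normalize(rawText);
}

// agent_end path: several message blocks, dropping those that normalize away.
function onAgentEnd(blocks: string[]): string[] {
  return blocks
    .map((block) => normalize(block))
    .filter((text): text is string => text !== null);
}
```

Routing both paths through one function means a new wrapper pattern only has to be added in one place.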

Regression coverage

Added a scenario to test/smart-extractor-branches.mjs that simulates a polluted payload shape with wrapper blocks, fenced metadata JSON, runtime event noise, and a real user sentence.

The test asserts that:

smart extraction prompt still contains the real user statement
extraction prompt no longer contains wrapper metadata
stored entries contain the real memory content
stored entries do not contain Conversation info, Sender, message_id, or username
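
The four assertions can be pictured with a small checker like the one below. It is a sketch only and is not taken from test/smart-extractor-branches.mjs: the function name `findWrapperLeaks` and its return shape are assumptions, and it presumes the mocked extraction stores the user statement verbatim.

```typescript
// Hypothetical sketch of the regression checks; not the actual test code.

// Markers the PR description says must not reach prompts or stored entries.
const FORBIDDEN_MARKERS = ["Conversation info", "Sender", "message_id", "username"];

// Returns a list of human-readable problems; an empty list means no leak.
function findWrapperLeaks(
  extractionPrompt: string,
  storedEntries: string[],
  realStatement: string,
): string[] {
  const problems: string[] = [];
  if (!extractionPrompt.includes(realStatement)) {
    problems.push("extraction prompt is missing the real user statement");
  }
  if (!storedEntries.some((entry) => entry.includes(realStatement))) {
    problems.push("no stored entry contains the real memory content");
  }
  for (const marker of FORBIDDEN_MARKERS) {
    if (extractionPrompt.includes(marker)) {
      problems.push(`extraction prompt contains "${marker}"`);
    }
    if (storedEntries.some((entry) => entry.includes(marker))) {
      problems.push(`a stored entry contains "${marker}"`);
    }
  }
  return problems;
}
```

Collecting all problems before failing makes a regression easier to diagnose than stopping at the first failed assertion.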

Validation

Ran:

node test/smart-extractor-branches.mjs
npm test

Both passed locally.

User impact

After this patch, auto-capture should stop writing transport-level wrapper noise into memory when OpenClaw channels inject context blocks around real user messages.

That means cleaner smart extraction, cleaner stored memories, and fewer false facts caused by metadata pollution.

Fixes #211

CortexReach#227 - Default agentId to 'main' in memory_forget and memory_update
When resolveRuntimeAgentId returns undefined, scope resolution breaks and operations silently fail. Fallback to 'main'.

CortexReach#229 - Strip OpenClaw wrapper metadata before auto-capture extraction
Extracts cleanup logic into src/auto-capture-cleanup.ts module. Strips <relevant-memories>, [UNTRUSTED DATA], conversation/sender metadata blocks, session reset prefixes, and @mention prefixes before content reaches smart extraction. Prevents metadata pollution in stored memories.

rwmjhb · 2026-03-16T03:20:52Z

I think this fix is worth taking.

The problem is real, the scope is appropriate, and the cleanup is being applied in the right place: before auto-capture/smart extraction turns wrapper text into candidate memory content. I also like that this is not just a one-off Feishu patch. Pulling the normalization into a dedicated helper and reusing it across the auto-capture entry points is the right direction.

I ran the targeted regression (node test/smart-extractor-branches.mjs) and also ran the full npm test suite on the PR branch in an isolated worktree. Both passed for me.

At this point I’m not seeing any additional blocking issue in the code itself. The only thing preventing merge right now is that GitHub shows the PR as CONFLICTING against current master.

So my recommendation is:

rebase / resolve conflicts against current master
rerun node test/smart-extractor-branches.mjs and npm test
if still green, merge it

I would support merge after the branch is rebased.

AliceLJY

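@chenhaonan-eth This PR is high quality. A few highlights:

Architecture: the cleanup logic is extracted into a standalone module, auto-capture-cleanup.ts, which slims down index.ts and separates responsibilities cleanly
Tests: a complete integration test with a mock LLM and embedding server, asserting that metadata does not appear in the extraction prompt or in the stored results
Match at any position: stripLeadingInboundMetadata now iterates with a regex instead of handling only the start of the text
System event lines: AUTO_CAPTURE_SYSTEM_EVENT_LINE_RE covers a case we missed in our own analysis
Global matching: the ^ anchor is dropped for <relevant-memories> and [UNTRUSTED DATA], so repeated occurrences are stripped too

Two gaps are not covered by this PR; closing them would make it complete: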

Gap 1: `stripEnvelopeMetadata` in `src/smart-extractor.ts` (~line 80)

The smart extractor has its own stripping layer before it calls the LLM, but its regex covers only 3 of the 6 sentinels:

/(?:Conversation info|Sender|Replied message)\s*\(untrusted[^)]*\):\s*```json\s*\{[\s\S]*?\}\s*```/g

It is missing Thread starter, Forwarded message context, and Chat history since last reply. When Layer 1 (your cleanup module) skips a block because of an unexpected format, this is the last line of defense.
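
One possible way to close Gap 1 is to generate the alternation from a list of all six sentinel labels. The sketch below is an assumption about how that could look, not the actual stripEnvelopeMetadata code: the names `ENVELOPE_SENTINELS`, `ENVELOPE_METADATA_RE`, and `stripEnvelopeMetadataSketch` are hypothetical, and the rest of the pattern mirrors the existing regex quoted above.

```typescript
// Hypothetical sketch: extend the existing envelope regex to all six sentinels.

const ENVELOPE_SENTINELS = [
  "Conversation info",
  "Sender",
  "Replied message",
  "Thread starter",
  "Forwarded message context",
  "Chat history since last reply",
];

// Built at runtime so the regex source never contains a literal code fence.
const JSON_FENCE = "`".repeat(3);

// Same shape as the quoted regex: label, "(untrusted ...)", colon,
// then a fenced JSON object.
const ENVELOPE_METADATA_RE = new RegExp(
  "(?:" + ENVELOPE_SENTINELS.join("|") + ")" +
    "\\s*\\(untrusted[^)]*\\):\\s*" +
    JSON_FENCE + "json\\s*\\{[\\s\\S]*?\\}\\s*" + JSON_FENCE,
  "g",
);

function stripEnvelopeMetadataSketch(text: string): string {
  return text.replace(ENVELOPE_METADATA_RE, "").trim();
}
```

Keeping the labels in one array would also let the fallback path described in Gap 2 reuse the same list, if the maintainers choose to share it.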

Gap 2: `shouldCapture` fallback regex in `index.ts` (~line 1311)

The metadataPattern on the regex fallback path covers only Conversation info and Sender:

const metadataPattern = /^(Conversation info|Sender) \(untrusted metadata\):[\s\S]*?\n\s*\n/gim;

It needs to be extended to all 6 sentinels. Note that the parenthetical comes in two forms: (untrusted metadata) and (untrusted, for context).
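
A sketch of what the extended fallback pattern might look like follows. It keeps the structure of the existing metadataPattern and only widens the label and parenthetical alternations. The names `metadataPatternSketch` and `stripFallbackMetadata` are hypothetical, and since the comment does not say which sentinel uses which parenthetical, this version accepts either form for every sentinel.

```typescript
// Hypothetical sketch: widen the fallback pattern to six sentinels and
// both parenthetical spellings. Not the actual index.ts code.

const metadataPatternSketch =
  /^(?:Conversation info|Sender|Replied message|Thread starter|Forwarded message context|Chat history since last reply) \((?:untrusted metadata|untrusted, for context)\):[\s\S]*?\n\s*\n/gim;

// Removes each metadata block up to and including the blank line that ends it.
function stripFallbackMetadata(text: string): string {
  return text.replace(metadataPatternSketch, "");
}
```

As in the original pattern, a block is considered finished at the first blank line, so a metadata block with no trailing blank line would not match.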

If you are willing to cover both in this PR, we can help review. Alternatively, merge the existing part first and we will open a follow-up PR for these two spots.

fix(auto-capture): strip injected wrapper metadata

425e748

chenhaonan-eth force-pushed the fix/autocapture-wrapper-cleanup branch from 454f277 to 425e748 on March 15, 2026 17:03

rwmjhb mentioned this pull request Mar 22, 2026

refactor: extract atomic auto-capture helpers #241

Closed

AliceLJY mentioned this pull request Mar 22, 2026

fix: complete metadata stripping for all channel sentinel patterns (#211) #310

Closed

3 tasks

AliceLJY reviewed Mar 22, 2026

chenhaonan-eth wants to merge 1 commit into CortexReach:master from chenhaonan-eth:fix/autocapture-wrapper-cleanup
