fix(auto-capture): strip OpenClaw wrapper metadata before extraction #229

Open
chenhaonan-eth wants to merge 1 commit into CortexReach:master from chenhaonan-eth:fix/autocapture-wrapper-cleanup
Conversation

@chenhaonan-eth
@chenhaonan-eth chenhaonan-eth commented Mar 15, 2026

Summary

This PR fixes auto-capture memory pollution caused by OpenClaw wrapper metadata being passed through to smart extraction as if it were user content.

It closes the gap reported in #211 and broadens the cleanup beyond Feishu so the same class of contamination is removed for other channel payloads that carry OpenClaw transport/context wrappers.

Problem

In some OpenClaw sessions, the user message payload can contain transport/context wrappers such as:

  • <relevant-memories>...</relevant-memories>
  • [UNTRUSTED DATA ...] ... [END UNTRUSTED DATA]
  • Conversation info (untrusted metadata): + fenced JSON
  • Sender (untrusted metadata): + fenced JSON
  • runtime event lines like System: [timestamp] Exec completed ...
  • session reset bootstrap text injected after /new or /reset

When these wrappers were not fully removed before auto-capture/smart extraction, the plugin could store garbage memories containing metadata like message_id, username, Conversation info, etc.

Root cause

normalizeQuery() already had cleanup behavior for retrieval, but the auto-capture path still had incomplete sanitization.

stripAutoCaptureInjectedPrefix() handled some wrappers, but it did not reliably strip the OpenClaw inbound metadata blocks and related system-event noise before smart extraction.

As a result, upstream wrapper text from OpenClaw could leak into:

  1. pending ingress text
  2. agent_end message normalization
  3. smart extraction prompts
  4. stored memory entries

What changed

1) Extracted cleanup into a dedicated helper

Added src/auto-capture-cleanup.ts to centralize auto-capture text normalization.

2) Strip wrapper metadata more aggressively in auto-capture

The new cleanup removes:

  • <relevant-memories> blocks
  • [UNTRUSTED DATA ...] blocks
  • session reset bootstrap prefix
  • inbound metadata sentinel blocks with fenced JSON payloads
  • System: [..] Exec completed|failed|started ... event lines
  • leading mention/addressing prefixes

It also re-runs cleanup passes and collapses extra blank lines so mixed wrapper payloads are handled robustly.
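
As a rough illustration of the cleanup described above, here is a minimal sketch. It is not the code in src/auto-capture-cleanup.ts: the function name, constant names, regex details, and pass limit are all assumptions made for this example, and the session reset bootstrap prefix is left out because its exact wording is not given here.

```typescript
// Hypothetical sketch; none of these names or patterns are taken from the PR.
const RELEVANT_MEMORIES_RE = /<relevant-memories>[\s\S]*?<\/relevant-memories>/g;
const UNTRUSTED_DATA_RE = /\[UNTRUSTED DATA[^\]]*\][\s\S]*?\[END UNTRUSTED DATA\]/g;
// `{3} matches a three-backtick fence without writing one literally.
const INBOUND_METADATA_RE =
  /(?:Conversation info|Sender) \(untrusted metadata\):\s*`{3}json\s*[\s\S]*?`{3}/g;
const SYSTEM_EVENT_LINE_RE =
  /^System: \[[^\]]*\] Exec (?:completed|failed|started)\b.*$/gm;
const LEADING_MENTION_RE = /^\s*@\S+[\s,:]*/;
const MAX_PASSES = 5; // assumed upper bound on repeated cleanup passes

function cleanupAutoCaptureText(input: string): string {
  let text = input;
  for (let pass = 0; pass < MAX_PASSES; pass++) {
    const before = text;
    text = text
      .replace(RELEVANT_MEMORIES_RE, "")
      .replace(UNTRUSTED_DATA_RE, "")
      .replace(INBOUND_METADATA_RE, "")
      .replace(SYSTEM_EVENT_LINE_RE, "")
      .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
      .trim()
      .replace(LEADING_MENTION_RE, ""); // drop a leading @mention prefix
    if (text === before) break; // stop once a pass changes nothing
  }
  return text;
}
```

Repeating the passes until the text stops changing is one way to handle mixed payloads, where removing one wrapper exposes another at the start of the text.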

3) Reused the same normalization in all auto-capture entry points

Updated index.ts so both:

  • ingress capture normalization
  • agent_end message/block normalization

flow through the same helper and still honor shouldSkipReflectionMessage().
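
The shared flow could look roughly like the sketch below. The factory, the type names, and the order of the checks (clean first, then skip) are assumptions for illustration only; in the real index.ts the predicate is shouldSkipReflectionMessage() and the wiring may differ.

```typescript
// Hypothetical sketch; these helpers are illustrative stand-ins, not the PR's code.
type TextCleaner = (raw: string) => string;
type SkipCheck = (text: string) => boolean;
type Normalizer = (raw: string) => string | null;

// Build one normalizer that every auto-capture entry point can share.
function makeAutoCaptureNormalizer(clean: TextCleaner, shouldSkip: SkipCheck): Normalizer {
  return (raw: string): string | null => {
    const text = clean(raw);
    if (text.length === 0) return null; // nothing left after cleanup
    return shouldSkip(text) ? null : text; // still honor the skip predicate
  };
}

// Example entry point: normalize a list of agent_end message blocks.
function normalizeAgentEndBlocks(blocks: string[], normalize: Normalizer): string[] {
  return blocks.map((block) => normalize(block)).filter((t): t is string => t !== null);
}
```

Routing ingress text and agent_end blocks through the same normalizer means a new wrapper pattern only has to be added in one place.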

Regression coverage

Added a scenario to test/smart-extractor-branches.mjs that simulates a polluted payload shape with wrapper blocks, fenced metadata JSON, runtime event noise, and a real user sentence.

The test asserts that:

  • smart extraction prompt still contains the real user statement
  • extraction prompt no longer contains wrapper metadata
  • stored entries contain the real memory content
  • stored entries do not contain Conversation info, Sender, message_id, or username
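
The negative assertions above amount to a substring check over the prompt and the stored entries. The helper below is a hypothetical sketch of that check, not code from test/smart-extractor-branches.mjs; only the four marker strings come from the list above.

```typescript
// Hypothetical helper; the marker list mirrors the assertions described above.
const FORBIDDEN_MARKERS = ["Conversation info", "Sender", "message_id", "username"];

// Return every forbidden marker that appears in at least one of the given texts.
function findLeakedMarkers(texts: string[]): string[] {
  return FORBIDDEN_MARKERS.filter((marker) =>
    texts.some((text) => text.indexOf(marker) !== -1),
  );
}
```

A test can then assert that `findLeakedMarkers(storedTexts)` is empty while separately checking that the real user sentence is still present.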

Validation

Ran:

node test/smart-extractor-branches.mjs
npm test

Both passed locally.

User impact

After this patch, auto-capture should stop writing transport-level wrapper noise into memory when OpenClaw channels inject context blocks around real user messages.

That means cleaner smart extraction, cleaner stored memories, and fewer false facts caused by metadata pollution.

Fixes #211

@chenhaonan-eth chenhaonan-eth force-pushed the fix/autocapture-wrapper-cleanup branch from 454f277 to 425e748 on March 15, 2026 17:03
caspian-coder added a commit to thepetshark/memory-lancedb-pro that referenced this pull request Mar 16, 2026
CortexReach#227 - Default agentId to 'main' in memory_forget and memory_update
  When resolveRuntimeAgentId returns undefined, scope resolution
  breaks and operations silently fail. Fallback to 'main'.

CortexReach#229 - Strip OpenClaw wrapper metadata before auto-capture extraction
  Extracts cleanup logic into src/auto-capture-cleanup.ts module.
  Strips <relevant-memories>, [UNTRUSTED DATA], conversation/sender
  metadata blocks, session reset prefixes, and @mention prefixes
  before content reaches smart extraction. Prevents metadata
  pollution in stored memories.
@rwmjhb
Collaborator

rwmjhb commented Mar 16, 2026

I think this fix is worth taking.

The problem is real, the scope is appropriate, and the cleanup is being applied in the right place: before auto-capture/smart extraction turns wrapper text into candidate memory content. I also like that this is not just a one-off Feishu patch. Pulling the normalization into a dedicated helper and reusing it across the auto-capture entry points is the right direction.

I ran the targeted regression (node test/smart-extractor-branches.mjs) and also ran the full npm test suite on the PR branch in an isolated worktree. Both passed for me.

At this point I’m not seeing any additional blocking issue in the code itself. The only thing preventing merge right now is that GitHub shows the PR as CONFLICTING against current master.

So my recommendation is:

  • rebase / resolve conflicts against current master
  • rerun node test/smart-extractor-branches.mjs and npm test
  • if still green, merge it

I would support merge after the branch is rebased.

Collaborator

@AliceLJY AliceLJY left a comment

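@chenhaonan-eth This PR is high quality. A few highlights:

  1. Architecture: the cleanup logic is extracted into a standalone module, auto-capture-cleanup.ts, which slims down index.ts and separates responsibilities cleanly.
  2. Tests: a complete integration test with a mock LLM and embedding server, asserting that metadata does not appear in the extraction prompt or in the stored results.
  3. Matching at any position: stripLeadingInboundMetadata now iterates with a regex instead of handling only the start of the text.
  4. System event lines: AUTO_CAPTURE_SYSTEM_EVENT_LINE_RE. We missed this one in our own analysis.
  5. Global matching: the <relevant-memories> and [UNTRUSTED DATA] patterns drop the ^ anchor, so repeated occurrences are stripped as well.

This PR leaves two gaps uncovered. Closing them would make it complete: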

Gap 1: stripEnvelopeMetadata in src/smart-extractor.ts (~line 80)

The smart extractor has its own stripping layer before it calls the LLM, but its regex covers only 3 of the 6 sentinels:

/(?:Conversation info|Sender|Replied message)\s*\(untrusted[^)]*\):\s*```json\s*\{[\s\S]*?\}\s*```/g

It is missing Thread starter, Forwarded message context, and Chat history since last reply. When Layer 1 (your cleanup module) skips a block because of an unusual format, this is the last line of defense.
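
One possible shape for the extended pattern is sketched below. The six sentinel names come from this comment; the constant and function names are hypothetical, and the payload part simply mirrors the existing regex (a fenced JSON object), which may not fit every sentinel's payload.

```typescript
// Hypothetical sketch; only the sentinel names are taken from this review.
const ENVELOPE_SENTINELS = [
  "Conversation info",
  "Sender",
  "Replied message",
  "Thread starter",
  "Forwarded message context",
  "Chat history since last reply",
];

// `{3} matches a three-backtick fence without writing one literally.
const FENCE = "`{3}";

const ENVELOPE_METADATA_RE = new RegExp(
  "(?:" + ENVELOPE_SENTINELS.join("|") + ")\\s*\\(untrusted[^)]*\\):\\s*" +
    FENCE + "json\\s*\\{[\\s\\S]*?\\}\\s*" + FENCE,
  "g",
);

// Illustrative stand-in for the stripping step; not the real stripEnvelopeMetadata.
function stripEnvelopeMetadataSketch(text: string): string {
  return text.replace(ENVELOPE_METADATA_RE, "").trim();
}
```

Building the alternation from an array keeps the sentinel list in one place, so it could be shared with other patterns that need the same names.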

Gap 2: shouldCapture fallback regex in index.ts (~line 1311)

The metadataPattern on the regex fallback path covers only Conversation info and Sender:

const metadataPattern = /^(Conversation info|Sender) \(untrusted metadata\):[\s\S]*?\n\s*\n/gim;

It needs to be extended to all 6 sentinels. Note that the parenthetical comes in two forms: (untrusted metadata) and (untrusted, for context).
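
A sketch of what the extended fallback pattern might look like, keeping the structure and flags of the existing metadataPattern. The constant and function names are hypothetical, and the real fix may prefer a looser parenthetical match.

```typescript
// Hypothetical sketch; sentinel names and parenthetical forms come from this review.
const FALLBACK_SENTINELS = [
  "Conversation info",
  "Sender",
  "Replied message",
  "Thread starter",
  "Forwarded message context",
  "Chat history since last reply",
];

const metadataPatternSketch = new RegExp(
  "^(?:" + FALLBACK_SENTINELS.join("|") + ") " +
    "\\((?:untrusted metadata|untrusted, for context)\\):" +
    "[\\s\\S]*?\\n\\s*\\n", // same body as the existing pattern: up to the next blank line
  "gim",
);

// Illustrative helper showing how the pattern would be applied.
function stripFallbackMetadata(text: string): string {
  return text.replace(metadataPatternSketch, "");
}
```

As in the existing pattern, each block is assumed to end at the first blank line, so a metadata block with no trailing blank line would not match.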


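If you are willing to cover these two spots in this PR as well, we can help review. Alternatively, merge the existing part first and we will open a follow-up PR for them.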

Development

Successfully merging this pull request may close these issues.

[Bug] autoCapture includes Feishu metadata in stored memories

3 participants