fix(auto-capture): strip OpenClaw wrapper metadata before extraction #229
chenhaonan-eth wants to merge 1 commit into CortexReach:master
CortexReach#227 - Default agentId to 'main' in memory_forget and memory_update
When resolveRuntimeAgentId returns undefined, scope resolution breaks and operations silently fail. Fall back to 'main'.
CortexReach#229 - Strip OpenClaw wrapper metadata before auto-capture extraction
Extracts cleanup logic into the src/auto-capture-cleanup.ts module. Strips <relevant-memories>, [UNTRUSTED DATA], conversation/sender metadata blocks, session reset prefixes, and @mention prefixes before content reaches smart extraction. Prevents metadata pollution in stored memories.
I think this fix is worth taking. The problem is real, the scope is appropriate, and the cleanup is applied in the right place: before auto-capture/smart extraction turns wrapper text into candidate memory content.
I also like that this is not just a one-off Feishu patch. Pulling the normalization into a dedicated helper and reusing it across the auto-capture entry points is the right direction.
I ran the targeted regression ( […]
At this point I'm not seeing any additional blocking issue in the code itself. The only thing preventing merge right now is that GitHub shows the PR as […]
So my recommendation is: […]
I would support merge after the branch is rebased.
AliceLJY left a comment:
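@chenhaonan-eth This PR is high quality. A few highlights:
- Architecture: the cleanup logic is extracted into a standalone module, auto-capture-cleanup.ts, which slims down index.ts and separates responsibilities cleanly.
- Tests: a complete integration test with a mock LLM and embedding server, asserting that metadata appears neither in the extraction prompt nor in the stored results.
- Matching at any position: stripLeadingInboundMetadata now iterates with a regex instead of handling only the start of the text.
- System event lines: AUTO_CAPTURE_SYSTEM_EVENT_LINE_RE. We missed this one in our own analysis.
- Global matching: the <relevant-memories> and [UNTRUSTED DATA] patterns drop the ^ anchor, so repeated occurrences are stripped too.
There are two gaps this PR does not cover. Closing them would make it complete:
Gap 1: stripEnvelopeMetadata in src/smart-extractor.ts (~line 80)
The smart extractor has its own stripping layer before it calls the LLM, but its regex covers only 3 of the 6 sentinels:
    /(?:Conversation info|Sender|Replied message)\s*\(untrusted[^)]*\):\s*```json\s*\{[\s\S]*?\}\s*```/g
It is missing Thread starter, Forwarded message context, and Chat history since last reply. When Layer 1 (your cleanup module) skips a block because of an unexpected format, this is the last line of defense.
Gap 2: shouldCapture fallback regex in index.ts (~line 1311)
The metadataPattern on the regex fallback path covers only Conversation info and Sender:
    const metadataPattern = /^(Conversation info|Sender) \(untrusted metadata\):[\s\S]*?\n\s*\n/gim;
It needs to be extended to all 6 sentinels. Note that the parenthetical comes in two forms: (untrusted metadata) and (untrusted, for context).
If you are willing to cover these in this PR, we can help review. Alternatively, merge the existing part first and we will open a follow-up PR for these two places.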
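One way the two gaps from the review comment might be closed is to build a single pattern from the full sentinel list. The sketch below is only an illustration under stated assumptions: `ENVELOPE_SENTINELS`, `ENVELOPE_BLOCK_RE`, and `stripEnvelopeBlocks` are hypothetical names, not identifiers from the plugin, and it assumes each sentinel label is followed by a fenced JSON block as in the Gap 1 regex.

```typescript
// Hypothetical sketch: these names are illustrative, not the plugin's actual code.
// The six sentinel labels are the ones listed in the review comment.
const ENVELOPE_SENTINELS = [
  "Conversation info",
  "Sender",
  "Replied message",
  "Thread starter",
  "Forwarded message context",
  "Chat history since last reply",
] as const;

// Escape regex metacharacters so each label is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Three backticks, built at runtime so this example stays readable inside a fence.
const FENCE = "`".repeat(3);

// "<sentinel> (untrusted ...):" followed by a fenced JSON object, anywhere in the text.
// "[^)]*" accepts both "(untrusted metadata)" and "(untrusted, for context)".
const ENVELOPE_BLOCK_RE = new RegExp(
  "(?:" + ENVELOPE_SENTINELS.map(escapeRegExp).join("|") + ")" +
    "\\s*\\(untrusted[^)]*\\):\\s*" +
    FENCE + "json\\s*\\{[\\s\\S]*?\\}\\s*" + FENCE,
  "g",
);

// Remove every envelope block, then collapse the blank lines left behind.
function stripEnvelopeBlocks(text: string): string {
  return text.replace(ENVELOPE_BLOCK_RE, "").replace(/\n{3,}/g, "\n\n").trim();
}
```

Deriving the alternation from one list keeps the extractor and the fallback path from drifting apart when a new sentinel is introduced. The anchored, blank-line-terminated form used by metadataPattern in Gap 2 would need its own variant built from the same list.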
Summary
This PR fixes auto-capture memory pollution caused by OpenClaw wrapper metadata being passed through to smart extraction as if it were user content.
It closes the gap reported in #211 and broadens the cleanup beyond Feishu so the same class of contamination is removed for other channel payloads that carry OpenClaw transport/context wrappers.
Problem
In some OpenClaw sessions, the user message payload can contain transport/context wrappers such as:
- <relevant-memories>...</relevant-memories>
- [UNTRUSTED DATA ...] ... [END UNTRUSTED DATA]
- Conversation info (untrusted metadata): + fenced JSON
- Sender (untrusted metadata): + fenced JSON
- System: [timestamp] Exec completed ...
- /new or /reset
When these wrappers were not fully removed before auto-capture/smart extraction, the plugin could store garbage memories containing metadata like message_id, username, Conversation info, etc.
Root cause
normalizeQuery() already had cleanup behavior for retrieval, but the auto-capture path still had incomplete sanitization.
stripAutoCaptureInjectedPrefix() handled some wrappers, but it did not reliably strip the OpenClaw inbound metadata blocks and related system-event noise before smart extraction.
As a result, upstream wrapper text from OpenClaw could leak into:
What changed
1) Extracted cleanup into a dedicated helper
Added src/auto-capture-cleanup.ts to centralize auto-capture text normalization.
2) Strip wrapper metadata more aggressively in auto-capture
The new cleanup removes:
- <relevant-memories> blocks
- [UNTRUSTED DATA ...] blocks
- System: [..] Exec completed|failed|started ... event lines
It also re-runs cleanup passes and collapses extra blank lines so mixed wrapper payloads are handled robustly.
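The multi-pass cleanup described above could look roughly like the following. This is a minimal sketch and not the contents of src/auto-capture-cleanup.ts: the function name `cleanupAutoCaptureText`, the three regexes, and the `maxPasses` limit are all assumptions made for illustration.

```typescript
// Hypothetical sketch: patterns and names are illustrative, not the module's real ones.
const RELEVANT_MEMORIES_RE = /<relevant-memories>[\s\S]*?<\/relevant-memories>/gi;
const UNTRUSTED_DATA_RE = /\[UNTRUSTED DATA[^\]]*\][\s\S]*?\[END UNTRUSTED DATA\]/gi;
const SYSTEM_EVENT_LINE_RE =
  /^System:\s*\[[^\]]*\]\s*Exec (?:completed|failed|started).*$/gim;

function cleanupAutoCaptureText(input: string, maxPasses = 5): string {
  let text = input;
  // Re-run the strip passes until nothing changes, so a wrapper exposed by
  // removing another wrapper is also caught. maxPasses bounds the loop.
  for (let pass = 0; pass < maxPasses; pass++) {
    const next = text
      .replace(RELEVANT_MEMORIES_RE, "")
      .replace(UNTRUSTED_DATA_RE, "")
      .replace(SYSTEM_EVENT_LINE_RE, "");
    if (next === text) break; // reached a fixed point
    text = next;
  }
  // Collapse the blank-line runs left behind by removed blocks.
  return text.replace(/\n{3,}/g, "\n\n").trim();
}

const polluted = [
  "<relevant-memories>old fact</relevant-memories>",
  "[UNTRUSTED DATA src=chat]",
  "noise",
  "[END UNTRUSTED DATA]",
  "System: [2024-01-01 10:00] Exec completed ok",
  "",
  "I like green tea.",
].join("\n");

console.log(cleanupAutoCaptureText(polluted)); // prints "I like green tea."
```

None of the patterns is anchored to the start of the text, so wrappers are removed wherever they occur and however often they repeat, which matches the global-matching behavior noted in the review.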
3) Reused the same normalization in all auto-capture entry points
Updated index.ts so both:
- agent_end message/block normalization
flow through the same helper and still honor shouldSkipReflectionMessage().
Regression coverage
Added a scenario to test/smart-extractor-branches.mjs that simulates a polluted payload shape with wrapper blocks, fenced metadata JSON, runtime event noise, and a real user sentence.
The test asserts that:
- […] Conversation info, Sender, message_id, or username
Validation
Ran:
    node test/smart-extractor-branches.mjs
    npm test
Both passed locally.
User impact
After this patch, auto-capture should stop writing transport-level wrapper noise into memory when OpenClaw channels inject context blocks around real user messages.
That means cleaner smart extraction, cleaner stored memories, and fewer false facts caused by metadata pollution.
Fixes #211