Skip to content

fix: add provenance header and speaker IDs to Slack transcript imports#815

Merged
bensig merged 2 commits intoMemPalace:developfrom
Kesshite:fix/security-normalize-roles
Apr 15, 2026
Merged

fix: add provenance header and speaker IDs to Slack transcript imports#815
bensig merged 2 commits intoMemPalace:developfrom
Kesshite:fix/security-normalize-roles

Conversation

@Kesshite
Copy link
Copy Markdown
Contributor

Summary

Slack exports are multi-party chats where no speaker is inherently the "user" or "assistant". The parser previously assigned these roles purely by position, allowing a crafted export to place attacker text in the "user" role — making it appear as the memory owner's own words in all future retrieval (data poisoning via stored memory).

This PR adds two mitigations while preserving compatibility with the existing exchange-pair chunking pipeline:

  • Provenance header at the top of every Slack transcript: [source: slack-export | multi-party chat — speaker roles are positional, not verified]
  • Speaker ID prefix on every message: [U1] Hello instead of just Hello

Role alternation (user/assistant) is preserved so convo_miner.py's > marker-based chunking continues to work — but now every message carries its original speaker identity and the transcript is clearly marked as multi-party with unverified roles.

Refs: #809 (Finding 6)

What changed

mempalace/normalize.py

  • _try_slack_json() — provenance header prepended to output; each message prefixed with [speaker_id]

tests/test_normalize.py

  • test_slack_json_has_provenance_header — verifies header presence and keywords
  • test_slack_json_preserves_speaker_id — verifies [U1], [U2] in output
  • test_slack_json_attacker_first_message_attributed — verifies attacker's ID is visible even when placed first

Test plan

  • pytest tests/test_normalize.py -v — 91/91 passed
  • pytest tests/ -v --ignore=tests/benchmarks — 689 passed, 2 failed (pre-existing version mismatch)
  • ruff check — all checks passed
  • ruff format --check — all files formatted
  • No new dependencies added
  • Exchange-pair chunking preserved (user/assistant alternation unchanged)

Slack exports are multi-party chats where no speaker is inherently
the "user" or "assistant". The parser previously assigned these roles
purely by position, allowing a crafted export to place attacker text
in the "user" role — making it appear as the memory owner's words
in all future retrieval (data poisoning via stored memory).

Changes:
- Add provenance header marking Slack transcripts as multi-party
  with positional (unverified) role assignment
- Prefix each message with the original speaker ID ([U1], [U2], etc.)
  so downstream consumers can distinguish authors
- Keep user/assistant role alternation for exchange-pair chunking
  compatibility with convo_miner.py

Tests:
- Provenance header presence and content
- Speaker ID preservation in output
- Attacker-first-message attribution verification

Refs: MemPalace#809
…onstant

- Move provenance notice from header to footer to prevent it becoming
  a standalone ChromaDB drawer via paragraph chunking on exports
  with fewer than 3 exchange pairs (violates verbatim-always principle)
- Sanitize speaker user_id/username: strip brackets, newlines, and
  control characters to prevent chunk-boundary injection via crafted
  Slack exports
- Extract header string to _SLACK_PROVENANCE_FOOTER module constant,
  consistent with _TOOL_RESULT_* constants pattern; tests import it
  instead of duplicating the literal

Refs: MemPalace#809
@Kesshite Kesshite force-pushed the fix/security-normalize-roles branch from a60839f to 2704b15 Compare April 14, 2026 11:05
Copy link
Copy Markdown
Collaborator

@bensig bensig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code reviewed — no issues found. CLAUDE.md compliance verified.

@bensig bensig merged commit e61dc2a into MemPalace:develop Apr 15, 2026
6 checks passed
igorls added a commit that referenced this pull request Apr 16, 2026
Advisor caught: initial boundary (962776c..develop) skipped PRs that
landed on develop after v3.3.0 tag but before the sync-back merge.
Adds entries for #871 MEMPAL_VERBOSE, #811 research() local-only
default, #866 init .gitignore, #864 MCP stdout redirect, #863
precompact hook, #865 searcher empty results, #831 cold-start palace,
#862 init help, #815 Slack provenance, #840 save hook auto-mine.
Also drops the awkward caveat on #846 created_at — it's post-v3.3.0.
shafdev pushed a commit to shafdev/mempalace that referenced this pull request Apr 17, 2026
Advisor caught: initial boundary (962776c..develop) skipped PRs that
landed on develop after v3.3.0 tag but before the sync-back merge.
Adds entries for MemPalace#871 MEMPAL_VERBOSE, MemPalace#811 research() local-only
default, MemPalace#866 init .gitignore, MemPalace#864 MCP stdout redirect, MemPalace#863
precompact hook, MemPalace#865 searcher empty results, MemPalace#831 cold-start palace,
MemPalace#862 init help, MemPalace#815 Slack provenance, MemPalace#840 save hook auto-mine.
Also drops the awkward caveat on MemPalace#846 created_at — it's post-v3.3.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mining File and conversation mining bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants