refactor(database): enforce unique archive_id for reliable deduplication #2136

Merged
jubalh merged 3 commits into master from db-dup
Apr 2, 2026
@jubalh jubalh commented Mar 27, 2026

Implement a UNIQUE constraint on the archive_id column (the XEP-0359 stanza-id) to prevent duplicate messages in the chat log. This addresses cases where the same message could be stored multiple times when it was received both via MAM and via regular (live) delivery.
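
As a rough illustration of the effect (not the actual schema or code from this PR), the sketch below uses Python's `sqlite3` module with a hypothetical, simplified `chat_log` table. The table name, the column set, and the `store` helper are assumptions made for the example only.

```python
import sqlite3

# Hypothetical, simplified schema; the real chat log table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chat_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        archive_id TEXT UNIQUE,   -- XEP-0359 stanza-id, NULL when absent
        message TEXT NOT NULL
    )
    """
)


def store(archive_id, message):
    """Insert one row; return True if it was stored, False if rejected.

    For brevity, any integrity error is treated as a duplicate here.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO chat_log (archive_id, message) VALUES (?, ?)",
                (archive_id, message),
            )
        return True
    except sqlite3.IntegrityError:
        return False


live = store("stanza-1", "hello")      # received via regular delivery
from_mam = store("stanza-1", "hello")  # same message replayed from MAM

# SQLite treats NULLs as distinct in a UNIQUE column, so messages
# without a stanza-id are never rejected by this constraint.
no_id_a = store(None, "first without id")
no_id_b = store(None, "second without id")

row_count = conn.execute("SELECT COUNT(*) FROM chat_log").fetchone()[0]
```

With the constraint in place, the second insert for `stanza-1` is rejected, while rows without an `archive_id` are unaffected.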

@jubalh jubalh added this to the 0.18.0 milestone Mar 27, 2026
@jubalh jubalh requested a review from sjaeckel March 27, 2026 22:10
@jubalh jubalh self-assigned this Mar 27, 2026
@jubalh jubalh added the MAM label Mar 27, 2026
Implement a UNIQUE constraint on the archive_id column (the XEP-0359
stanza-id) to prevent duplicate messages in the chat log. This addresses
cases where the same message could be stored multiple times when it was
received both via MAM and via regular (live) delivery.

Signed-off-by: Michael Vetter <jubalh@iodoru.org>
jubalh added 2 commits March 27, 2026 23:57
Implement security checks to ensure the 'archive_id' (XEP-0359) used
for database deduplication originates from a trusted source.

XEP-0359 Section 3.1: "The 'by' attribute MUST be the XMPP
address (JID) of the entity assigning the unique and stable stanza ID."
Furthermore, Section 4 (Security Considerations) specifies: "A client
SHOULD only trust <stanza-id/> elements from its own server or from
a MUC service it is joined to."

XEP-0313 Section 4.1.2: "The 'by' attribute of the <result/> element is
the JID of the archive being queried." and "If the 'by' attribute is not
present, the recipient MUST assume that the results are from their own
personal archive."

Let _handle_chat verify that the <stanza-id/> 'by' attribute matches our
bare JID.
Let _handle_groupchat verify that the <stanza-id/> 'by' attribute matches
the MUC's bare JID.
Let _handle_mam verify that the <result/> 'by' attribute matches the outer
message 'from' (the archive JID). If 'by' is missing, 'from' must match our
own bare JID (personal archive).
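
The three checks above can be sketched roughly as follows. This is an
illustrative Python sketch, not the C code from this commit: the function
names, the simplified bare-JID comparison (strip the resource, lowercase),
and the handling of a missing outer 'from' are all assumptions.

```python
def bare_jid(jid):
    """Strip the resource part: 'user@host/res' -> 'user@host'.

    Simplification: real JID comparison involves more normalisation
    than lowercasing.
    """
    return jid.split("/", 1)[0].lower()


def trusted_chat_id(by, own_jid):
    """1:1 chat: the stanza-id must be assigned by our own account."""
    return by is not None and bare_jid(by) == bare_jid(own_jid)


def trusted_groupchat_id(by, room_jid):
    """Groupchat: the stanza-id must be assigned by the MUC itself."""
    return by is not None and bare_jid(by) == bare_jid(room_jid)


def trusted_mam_result(by, outer_from, own_jid):
    """MAM: <result/> 'by' must match the outer message 'from'.

    Assumption: a missing outer 'from' is treated as our own account.
    """
    own = bare_jid(own_jid)
    archive = bare_jid(outer_from) if outer_from else own
    if by is None:
        # No 'by' means personal archive, so 'from' must be ourselves.
        return archive == own
    return bare_jid(by) == archive


me = "alice@example.org/laptop"
room = "room@muc.example.org"
ok_chat = trusted_chat_id("alice@example.org", me)
spoofed_chat = trusted_chat_id("mallory@evil.example", me)
ok_muc = trusted_groupchat_id(room, room + "/nick")
ok_personal_mam = trusted_mam_result(None, "alice@example.org", me)
bad_mam = trusted_mam_result(None, room, me)
```

An archive_id that fails its check would simply not be used for
deduplication.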

Signed-off-by: Michael Vetter <jubalh@iodoru.org>
Replace 'INSERT OR IGNORE' with 'INSERT ... ON CONFLICT(`archive_id`) DO
NOTHING RETURNING id'. The RETURNING clause is only available since SQLite
3.35.0. This way deduplication only applies to `archive_id`, and we no
longer silently ignore other errors or constraint violations (like NOT NULL).

We can now detect if an insertion was skipped due to duplication by
checking the result of 'RETURNING id'.
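
A minimal sketch of this pattern, using Python's `sqlite3` module rather
than the C code in this commit. The `chat_log` table, its columns, and the
`insert_message` helper are hypothetical. The fallback for SQLite older
than 3.35.0 is also an assumption added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chat_log ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " archive_id TEXT UNIQUE,"
    " message TEXT NOT NULL)"
)

# RETURNING needs SQLite >= 3.35.0; otherwise fall back to rowcount.
HAS_RETURNING = sqlite3.sqlite_version_info >= (3, 35, 0)

INSERT_SQL = (
    "INSERT INTO chat_log (archive_id, message) VALUES (?, ?) "
    "ON CONFLICT(archive_id) DO NOTHING"
)


def insert_message(archive_id, message):
    """Return the new row id, or None if skipped as a duplicate."""
    with conn:  # commits on success, rolls back on error
        if HAS_RETURNING:
            # fetchall() runs the statement to completion before commit.
            rows = conn.execute(
                INSERT_SQL + " RETURNING id", (archive_id, message)
            ).fetchall()
            return rows[0][0] if rows else None
        cur = conn.execute(INSERT_SQL, (archive_id, message))
        return cur.lastrowid if cur.rowcount == 1 else None


first = insert_message("stanza-1", "hello")   # stored, returns an id
second = insert_message("stanza-1", "hello")  # duplicate, returns None

# Other constraint violations are no longer swallowed: a NULL message
# violates NOT NULL and raises instead of being silently skipped.
try:
    insert_message("stanza-2", None)
    not_null_raised = False
except sqlite3.IntegrityError:
    not_null_raised = True

row_count = conn.execute("SELECT COUNT(*) FROM chat_log").fetchone()[0]
```

An empty RETURNING result means the row was skipped because of the
`archive_id` conflict. Any other failure surfaces as an error.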

We don't log skipped duplicates, since this happens often and would be
too noisy. This matches what Dino does.

Signed-off-by: Michael Vetter <jubalh@iodoru.org>
jubalh commented Apr 1, 2026

Anybody ready to review this?

@jubalh jubalh merged commit 923dfea into master Apr 2, 2026
9 checks passed
@jubalh jubalh deleted the db-dup branch April 2, 2026 10:31