Add SQLite memory lifecycle hashing, versioning, and forget support by RYB-404 · Pull Request #199 · XortexAI/XMem

RYB-404 · 2026-05-23T05:30:54Z

Summary

Adds reusable memory lifecycle metadata helpers for normalized SHA-256 content hashes, version lineage, current-record status, and soft-forget audit fields.
Updates the SQLite vector store to reuse existing current memories with the same normalized hash instead of inserting duplicates.
Adds explicit SQLite helpers for versioning (add_version) and soft/hard forget (forget) while keeping superseded or forgotten memories out of semantic and metadata retrieval.

Tests

python -m pytest tests\unit\test_memory_lifecycle.py -q
python -m pytest tests\e2e\test_memory_flow.py -q
python -m ruff check src\storage\memory_lifecycle.py src\storage\local.py tests\unit\test_memory_lifecycle.py

Addresses #166.

gemini-code-assist

Code Review

This pull request introduces a memory lifecycle management system, including duplicate detection via content hashing, versioning, and soft-forgetting capabilities. The SQLiteVectorStore was updated to support these features, and a new memory_lifecycle utility module was added. Feedback focuses on performance optimizations for database queries, specifically suggesting the use of SQLite's JSON functions for hash lookups and batching updates in the forget method. Additionally, it was noted that the versioning process should be made atomic to prevent data inconsistency.

gemini-code-assist · 2026-05-23T05:32:21Z

+    def _find_current_by_hash(self, content_hash: str) -> Optional[str]:
+        rows = self._conn.execute(
+            "SELECT id, metadata FROM xmem_vectors WHERE namespace = ?",
+            (self._namespace,),
+        ).fetchall()
+        for row in rows:
+            meta = json.loads(row["metadata"] or "{}")
+            if meta.get(CONTENT_HASH_KEY) == content_hash and is_retrievable_memory(meta):
+                return row["id"]
+        return None


The current implementation of _find_current_by_hash is highly inefficient. It fetches every single record in the namespace and parses their JSON metadata in Python to find a match. This results in $O(N)$ complexity per call, which becomes $O(M \times N)$ during batch add operations. As the database grows, this will cause severe performance degradation.

You should leverage SQLite's built-in JSON functions to perform the filtering at the database level. This is significantly faster and avoids loading unnecessary data into memory.

Suggested change

def _find_current_by_hash(self, content_hash: str) -> Optional[str]:

rows = self._conn.execute(

"SELECT id, metadata FROM xmem_vectors WHERE namespace = ?",

(self._namespace,),

).fetchall()

for row in rows:

meta = json.loads(row["metadata"] or "{}")

if meta.get(CONTENT_HASH_KEY) == content_hash and is_retrievable_memory(meta):

return row["id"]

return None

def _find_current_by_hash(self, content_hash: str) -> Optional[str]:

row = self._conn.execute(

f"SELECT id FROM xmem_vectors "

f"WHERE namespace = ? "

f"AND json_extract(metadata, '$.{CONTENT_HASH_KEY}') = ? "

f"AND json_extract(metadata, '$.{IS_CURRENT_KEY}') IS NOT 0 "

f"AND json_extract(metadata, '$.{FORGOTTEN_AT_KEY}') IS NULL "

f"LIMIT 1",

(self._namespace, content_hash),

).fetchone()

return row["id"] if row else None

gemini-code-assist · 2026-05-23T05:32:21Z

+        for vec_id in ids:
+            row = self._conn.execute(
+                "SELECT metadata FROM xmem_vectors WHERE namespace = ? AND id = ?",
+                (self._namespace, vec_id),
+            ).fetchone()
+            if not row:
+                continue
+            meta = json.loads(row["metadata"] or "{}")
+            meta[IS_CURRENT_KEY] = False
+            meta[FORGOTTEN_AT_KEY] = utc_now_iso()
+            meta[FORGET_REASON_KEY] = reason
+            self.update(vec_id, metadata=meta)


The forget method is inefficient for multiple IDs. It performs a SELECT and then calls self.update, which internally performs another SELECT and a COMMIT for every single ID. For a list of $N$ IDs, this results in $3N$ database operations and $N$ commits.

It is better to fetch the metadata for all IDs in a single query, process the updates in Python, and then use executemany to update the database in a single transaction.

if not ids: return True placeholders = ",".join("?" for _ in ids) rows = self._conn.execute( f"SELECT id, metadata FROM xmem_vectors WHERE namespace = ? AND id IN ({placeholders})", (self._namespace, *ids), ).fetchall() now = utc_now_iso() updates = [] for row in rows: meta = json.loads(row["metadata"] or "{}") meta[IS_CURRENT_KEY] = False meta[FORGOTTEN_AT_KEY] = now meta[FORGET_REASON_KEY] = reason updates.append((json.dumps(meta), self._namespace, row["id"])) if updates: self._conn.executemany( "UPDATE xmem_vectors SET metadata = ?, updated_at = CURRENT_TIMESTAMP " "WHERE namespace = ? AND id = ?", updates ) self._conn.commit()

gemini-code-assist · 2026-05-23T05:32:21Z

+        parent_meta[IS_CURRENT_KEY] = False
+        self.update(parent_id, metadata=parent_meta)
+        self.add([text], [embedding], ids=[new_id], metadata=[new_meta])


The versioning process is not atomic. The parent record is updated to is_current=False in one call, and the new version is added in another. If the second call fails (e.g., due to a connection issue or constraint violation) or the process is interrupted, the memory will be effectively 'lost' from retrieval because the parent is no longer marked as current, but no successor was successfully created. These operations should be wrapped in a single transaction to ensure data consistency.

greptile-apps · 2026-05-23T09:20:33Z

Greptile Summary

This PR adds memory lifecycle management to SQLiteVectorStore: SHA-256 content-hash deduplication on add(), a versioning helper (add_version) that marks parents as non-current, and a soft-forget path (forget) that stamps forgotten_at and keeps audit history. A new memory_lifecycle.py module centralises the hash, metadata construction, and retrievability check used by all three code paths.

add() now calls build_lifecycle_metadata, which forcefully assigns all lifecycle fields so callers cannot inject stale is_current/forgotten_at values through metadata; dedup is scoped by well-known tenant/user keys via _find_current_by_hash.
add_version guards against forgotten or superseded parents via is_retrievable_memory before proceeding, and marks the parent non-current atomically with the child insert.
update() recalculates the content hash and blocks collisions with other current records, but never calls is_retrievable_memory on the target and does not re-pin lifecycle fields after current_meta.update(metadata), leaving a path to silently modify or restore forgotten records.

Confidence Score: 3/5

Not safe to merge as-is: the update() method bypasses the lifecycle contract that the rest of this PR carefully enforces, allowing forgotten or superseded records to be silently modified or restored.

The add/add_version/forget paths are well-guarded and consistent. However, update() neither checks the retrievability of the target record nor re-pins lifecycle fields after merging caller metadata — a caller can pass {"forgotten_at": None, "is_current": True} to restore a forgotten memory without any audit trail, directly contradicting the guarantee that forget() is supposed to provide.

src/storage/local.py — specifically the update() method, which is the only write path that does not call build_lifecycle_metadata or is_retrievable_memory.

Important Files Changed

Filename	Overview
src/storage/local.py	Adds dedup, add_version, and forget support to SQLiteVectorStore; update() lacks a lifecycle guard allowing modification of forgotten/superseded records and silent override of lifecycle fields through the metadata parameter.
src/storage/memory_lifecycle.py	New module providing SHA-256 content hashing, lifecycle metadata construction, and retrievability check; build_lifecycle_metadata forcefully assigns all lifecycle fields, preventing caller overrides in the add() path.
tests/unit/test_memory_lifecycle.py	Good coverage of the happy path, hash dedup scoping, versioning, forgotten-parent guard, and soft-forget; no test exercises update() with lifecycle field overrides or with a non-current target record.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[add / add_version / update / forget] --> B{Which method?}

    B --> C[add]
    C --> C1[build_lifecycle_metadata\nforces all lifecycle fields]
    C1 --> C2[_find_current_by_hash\nscoped by namespace + dedup keys]
    C2 -->|duplicate found| C3[Return existing ID]
    C2 -->|no duplicate| C4[INSERT new record\nis_current=True]

    B --> D[add_version]
    D --> D1[get parent record]
    D1 --> D2{is_retrievable_memory?}
    D2 -->|No| D3[Return None]
    D2 -->|Yes| D4[build_lifecycle_metadata\nnext_version, parent_memory_id]
    D4 --> D5[_find_current_by_hash]
    D5 -->|existing == parent_id| D6[Return parent unchanged]
    D5 -->|other existing| D7[Mark parent is_current=False\nReturn existing ID]
    D5 -->|no duplicate| D8[Mark parent is_current=False\nINSERT new child record]

    B --> E[forget]
    E --> E1{hard_delete?}
    E1 -->|Yes| E2[delete rows]
    E1 -->|No| E3[Stamp is_current=False\nforgotten_at + reason]

    B --> F[update]
    F --> F1[Fetch row — no lifecycle guard]
    F1 --> F2[current_meta.update — lifecycle fields can be overridden]
    F2 --> F3[Recalculate CONTENT_HASH_KEY only]
    F3 --> F4[_find_current_by_hash collision check]
    F4 -->|collision with other record| F5[Return False]
    F4 -->|no collision| F6[UPDATE record]

Comments Outside Diff (1)

src/storage/local.py, line 190-232 (link)

update() has no lifecycle guard and exposes lifecycle fields to caller override

update() never calls is_retrievable_memory() on the fetched record, so it will happily mutate the content and embedding of a soft-forgotten or superseded memory. Worse, current_meta.update(metadata or {}) runs before the lifecycle fields are reconciled, so a caller can pass metadata={"forgotten_at": None, "is_current": True} to silently restore a forgotten record without going through forget()'s audit trail. Only CONTENT_HASH_KEY is re-pinned after the merge; every other lifecycle field (IS_CURRENT_KEY, FORGOTTEN_AT_KEY, FORGET_REASON_KEY, PARENT_MEMORY_ID_KEY, VERSION_KEY) can be freely overwritten. This creates an inconsistency with add(), which calls build_lifecycle_metadata and overwrites all lifecycle fields unconditionally.

_{Reviews (3): Last reviewed commit: "Prevent memory lifecycle hash collisions" | Re-trigger Greptile}

RYB-404 · 2026-05-23T11:51:40Z

Addressed the latest bot review points in the pushed update:

scoped hash dedup by tenant/user-style metadata (user_id, tenant_id, org_id, workspace_id, project_id) to avoid cross-user memory reuse in shared namespaces
tightened _find_current_by_hash to require explicit is_current = true
made duplicate-version handling supersede the parent when the successor already exists
lifecycle fields are now owned by build_lifecycle_metadata instead of caller metadata overrides
added regression coverage for scoped dedup, lifecycle override protection, and duplicate-version superseding

Re-ran:

python -m pytest tests\unit\test_memory_lifecycle.py -q
python -m pytest tests\e2e\test_memory_flow.py -q
python -m ruff check src\storage\memory_lifecycle.py src\storage\local.py tests\unit\test_memory_lifecycle.py

RYB-404

Pushed updates for the latest lifecycle review findings:

add_version() now refuses forgotten or superseded parents instead of creating a new current child from inactive history.
update() now rejects same-scope current hash collisions instead of leaving two current records with the same lifecycle hash.
Added regression coverage for forgotten/superseded version parents and update hash collisions.

Re-ran:

python -m pytest tests\unit\test_memory_lifecycle.py -q
python -m pytest tests\e2e\test_memory_flow.py -q
python -m ruff check src\storage\local.py tests\unit\test_memory_lifecycle.py

RYB-404 requested review from ishaanxgupta and ved015 as code owners May 23, 2026 05:30

github-actions Bot added the tests label May 23, 2026

gemini-code-assist Bot reviewed May 23, 2026

View reviewed changes

RYB-404 force-pushed the ryb-xmem-memory-lifecycle-166 branch from e0b0a55 to e006a29 Compare May 23, 2026 05:48

greptile-apps Bot reviewed May 23, 2026

View reviewed changes

Comment thread src/storage/local.py Outdated

Comment thread src/storage/local.py Outdated

Comment thread src/storage/local.py Outdated

Comment thread src/storage/memory_lifecycle.py

Add memory lifecycle metadata support

2601823

RYB-404 force-pushed the ryb-xmem-memory-lifecycle-166 branch from e006a29 to 2601823 Compare May 23, 2026 11:51

greptile-apps Bot reviewed May 23, 2026

View reviewed changes

Comment thread src/storage/local.py

RYB-404 added 2 commits May 24, 2026 11:32

Guard versioning against inactive memories

ee7021b

Prevent memory lifecycle hash collisions

b2613df

RYB-404 commented May 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SQLite memory lifecycle hashing, versioning, and forget support#199

Add SQLite memory lifecycle hashing, versioning, and forget support#199
RYB-404 wants to merge 3 commits into
XortexAI:mainfrom
RYB-404:ryb-xmem-memory-lifecycle-166

RYB-404 commented May 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Uh oh!

gemini-code-assist Bot May 23, 2026

Uh oh!

gemini-code-assist Bot May 23, 2026

Uh oh!

greptile-apps Bot commented May 23, 2026 •

edited

Loading

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

RYB-404 commented May 23, 2026

Uh oh!

Uh oh!

RYB-404 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

RYB-404 commented May 23, 2026

Summary

Tests

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

RYB-404 commented May 23, 2026

Uh oh!

Uh oh!

RYB-404 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 23, 2026 •

edited

Loading