Add SQLite memory lifecycle hashing, versioning, and forget support#199
Add SQLite memory lifecycle hashing, versioning, and forget support#199RYB-404 wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a memory lifecycle management system, including duplicate detection via content hashing, versioning, and soft-forgetting capabilities. The SQLiteVectorStore was updated to support these features, and a new memory_lifecycle utility module was added. Feedback focuses on performance optimizations for database queries, specifically suggesting the use of SQLite's JSON functions for hash lookups and batching updates in the forget method. Additionally, it was noted that the versioning process should be made atomic to prevent data inconsistency.
| def _find_current_by_hash(self, content_hash: str) -> Optional[str]: | ||
| rows = self._conn.execute( | ||
| "SELECT id, metadata FROM xmem_vectors WHERE namespace = ?", | ||
| (self._namespace,), | ||
| ).fetchall() | ||
| for row in rows: | ||
| meta = json.loads(row["metadata"] or "{}") | ||
| if meta.get(CONTENT_HASH_KEY) == content_hash and is_retrievable_memory(meta): | ||
| return row["id"] | ||
| return None |
There was a problem hiding this comment.
The current implementation of _find_current_by_hash is highly inefficient. It fetches every single record in the namespace and parses their JSON metadata in Python to find a match. This results in add operations. As the database grows, this will cause severe performance degradation.
You should leverage SQLite's built-in JSON functions to perform the filtering at the database level. This is significantly faster and avoids loading unnecessary data into memory.
| def _find_current_by_hash(self, content_hash: str) -> Optional[str]: | |
| rows = self._conn.execute( | |
| "SELECT id, metadata FROM xmem_vectors WHERE namespace = ?", | |
| (self._namespace,), | |
| ).fetchall() | |
| for row in rows: | |
| meta = json.loads(row["metadata"] or "{}") | |
| if meta.get(CONTENT_HASH_KEY) == content_hash and is_retrievable_memory(meta): | |
| return row["id"] | |
| return None | |
| def _find_current_by_hash(self, content_hash: str) -> Optional[str]: | |
| row = self._conn.execute( | |
| f"SELECT id FROM xmem_vectors " | |
| f"WHERE namespace = ? " | |
| f"AND json_extract(metadata, '$.{CONTENT_HASH_KEY}') = ? " | |
| f"AND json_extract(metadata, '$.{IS_CURRENT_KEY}') IS NOT 0 " | |
| f"AND json_extract(metadata, '$.{FORGOTTEN_AT_KEY}') IS NULL " | |
| f"LIMIT 1", | |
| (self._namespace, content_hash), | |
| ).fetchone() | |
| return row["id"] if row else None |
| for vec_id in ids: | ||
| row = self._conn.execute( | ||
| "SELECT metadata FROM xmem_vectors WHERE namespace = ? AND id = ?", | ||
| (self._namespace, vec_id), | ||
| ).fetchone() | ||
| if not row: | ||
| continue | ||
| meta = json.loads(row["metadata"] or "{}") | ||
| meta[IS_CURRENT_KEY] = False | ||
| meta[FORGOTTEN_AT_KEY] = utc_now_iso() | ||
| meta[FORGET_REASON_KEY] = reason | ||
| self.update(vec_id, metadata=meta) |
There was a problem hiding this comment.
The forget method is inefficient for multiple IDs. It performs a SELECT and then calls self.update, which internally performs another SELECT and a COMMIT for every single ID. For a list of
It is better to fetch the metadata for all IDs in a single query, process the updates in Python, and then use executemany to update the database in a single transaction.
if not ids:
return True
placeholders = ",".join("?" for _ in ids)
rows = self._conn.execute(
f"SELECT id, metadata FROM xmem_vectors WHERE namespace = ? AND id IN ({placeholders})",
(self._namespace, *ids),
).fetchall()
now = utc_now_iso()
updates = []
for row in rows:
meta = json.loads(row["metadata"] or "{}")
meta[IS_CURRENT_KEY] = False
meta[FORGOTTEN_AT_KEY] = now
meta[FORGET_REASON_KEY] = reason
updates.append((json.dumps(meta), self._namespace, row["id"]))
if updates:
self._conn.executemany(
"UPDATE xmem_vectors SET metadata = ?, updated_at = CURRENT_TIMESTAMP "
"WHERE namespace = ? AND id = ?",
updates
)
self._conn.commit()| parent_meta[IS_CURRENT_KEY] = False | ||
| self.update(parent_id, metadata=parent_meta) | ||
| self.add([text], [embedding], ids=[new_id], metadata=[new_meta]) |
There was a problem hiding this comment.
The versioning process is not atomic. The parent record is updated to is_current=False in one call, and the new version is added in another. If the second call fails (e.g., due to a connection issue or constraint violation) or the process is interrupted, the memory will be effectively 'lost' from retrieval because the parent is no longer marked as current, but no successor was successfully created. These operations should be wrapped in a single transaction to ensure data consistency.
e0b0a55 to
e006a29
Compare
|
| Filename | Overview |
|---|---|
| src/storage/local.py | Adds dedup, add_version, and forget support to SQLiteVectorStore; update() lacks a lifecycle guard allowing modification of forgotten/superseded records and silent override of lifecycle fields through the metadata parameter. |
| src/storage/memory_lifecycle.py | New module providing SHA-256 content hashing, lifecycle metadata construction, and retrievability check; build_lifecycle_metadata forcefully assigns all lifecycle fields, preventing caller overrides in the add() path. |
| tests/unit/test_memory_lifecycle.py | Good coverage of the happy path, hash dedup scoping, versioning, forgotten-parent guard, and soft-forget; no test exercises update() with lifecycle field overrides or with a non-current target record. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[add / add_version / update / forget] --> B{Which method?}
B --> C[add]
C --> C1[build_lifecycle_metadata\nforces all lifecycle fields]
C1 --> C2[_find_current_by_hash\nscoped by namespace + dedup keys]
C2 -->|duplicate found| C3[Return existing ID]
C2 -->|no duplicate| C4[INSERT new record\nis_current=True]
B --> D[add_version]
D --> D1[get parent record]
D1 --> D2{is_retrievable_memory?}
D2 -->|No| D3[Return None]
D2 -->|Yes| D4[build_lifecycle_metadata\nnext_version, parent_memory_id]
D4 --> D5[_find_current_by_hash]
D5 -->|existing == parent_id| D6[Return parent unchanged]
D5 -->|other existing| D7[Mark parent is_current=False\nReturn existing ID]
D5 -->|no duplicate| D8[Mark parent is_current=False\nINSERT new child record]
B --> E[forget]
E --> E1{hard_delete?}
E1 -->|Yes| E2[delete rows]
E1 -->|No| E3[Stamp is_current=False\nforgotten_at + reason]
B --> F[update]
F --> F1[Fetch row — no lifecycle guard]
F1 --> F2[current_meta.update — lifecycle fields can be overridden]
F2 --> F3[Recalculate CONTENT_HASH_KEY only]
F3 --> F4[_find_current_by_hash collision check]
F4 -->|collision with other record| F5[Return False]
F4 -->|no collision| F6[UPDATE record]
Comments Outside Diff (1)
-
src/storage/local.py, line 190-232 (link)update()has no lifecycle guard and exposes lifecycle fields to caller overrideupdate()never callsis_retrievable_memory()on the fetched record, so it will happily mutate the content and embedding of a soft-forgotten or superseded memory. Worse,current_meta.update(metadata or {})runs before the lifecycle fields are reconciled, so a caller can passmetadata={"forgotten_at": None, "is_current": True}to silently restore a forgotten record without going throughforget()'s audit trail. OnlyCONTENT_HASH_KEYis re-pinned after the merge; every other lifecycle field (IS_CURRENT_KEY,FORGOTTEN_AT_KEY,FORGET_REASON_KEY,PARENT_MEMORY_ID_KEY,VERSION_KEY) can be freely overwritten. This creates an inconsistency withadd(), which callsbuild_lifecycle_metadataand overwrites all lifecycle fields unconditionally.
Reviews (3): Last reviewed commit: "Prevent memory lifecycle hash collisions" | Re-trigger Greptile
e006a29 to
2601823
Compare
|
Addressed the latest bot review points in the pushed update:
Re-ran:
|
RYB-404
left a comment
There was a problem hiding this comment.
Pushed updates for the latest lifecycle review findings:
add_version()now refuses forgotten or superseded parents instead of creating a new current child from inactive history.update()now rejects same-scope current hash collisions instead of leaving two current records with the same lifecycle hash.- Added regression coverage for forgotten/superseded version parents and update hash collisions.
Re-ran:
python -m pytest tests\unit\test_memory_lifecycle.py -qpython -m pytest tests\e2e\test_memory_flow.py -qpython -m ruff check src\storage\local.py tests\unit\test_memory_lifecycle.py
Summary
add_version) and soft/hard forget (forget) while keeping superseded or forgotten memories out of semantic and metadata retrieval.Tests
python -m pytest tests\unit\test_memory_lifecycle.py -qpython -m pytest tests\e2e\test_memory_flow.py -qpython -m ruff check src\storage\memory_lifecycle.py src\storage\local.py tests\unit\test_memory_lifecycle.pyAddresses #166.