Clement Deust (architecture, original design, benchmarking) and Claude Opus 4.6 (Python port, benchmark integration, paper-backed complements)
Date: April 2026
Repository: github.com/cdeust/Cortex
Dense vector retrieval degrades structurally at scale. On BEAM (Tavakoli et al., ICLR 2026), the hardest publicly available long-context memory benchmark at 10 million tokens per conversation, a production-grade 5-signal fusion pipeline with cross-encoder reranking drops from 0.591 MRR at 100K tokens to 0.353 MRR at 10M tokens -- a 40% degradation that no amount of reranking or embedding model upgrade can address because the failure is geometric, not parametric. We present a structured context assembly architecture that recovers this degradation: a two-primitive system combining (1) a priority-budgeted prompt decomposer with domain-aware condensers and truncation warning injection, and (2) a two-phase stage-aware assembler (with a planned summary tier) with submodular coverage selection, Personalized PageRank entity graph traversal, and schema-structured summary fallback. On BEAM-10M, the assembler achieves 0.471 MRR, a +33.4% improvement over the flat baseline, with 8 of 10 memory abilities improving. The architecture was originally designed in September 2025 for generating coherent product requirement documents on Apple Intelligence's 4,096-token context window -- one month before the BEAM paper was published. The key insight is that at scale, structured composition of what enters context outperforms improving how individual items are retrieved. Code, data, and benchmark harness are publicly available.
Retrieval-augmented generation (RAG) has become the dominant paradigm for grounding language models in external knowledge. The standard recipe -- embed documents, index them in a vector store, retrieve the top-k nearest neighbors at query time, and concatenate them into the prompt -- works well when the corpus is moderate and topically diverse. But when applied to long-term conversational memory, where a single user's history spans millions of tokens of thematically overlapping content, this recipe fails in a specific and well-characterized way.
Consider a memory system storing the conversational history of a software developer over six months. At 100K tokens (~94 memories), the content is sparse enough that a dense retrieval query returns meaningfully different candidates. At 10M tokens (~7,500 memories), the content has accumulated around a small number of recurring topics -- the same frameworks, the same debugging patterns, the same architectural decisions discussed from slightly different angles across hundreds of sessions. The embedding space becomes crowded with near-identical vectors, and top-k retrieval degenerates into returning k paraphrases of the same piece of information.
This is not a failure of any particular embedding model. Three well-documented phenomena make it structural:
Hubness (Radovanovic et al., JMLR 2010). In high-dimensional spaces, certain points become disproportionately frequent nearest neighbors of many queries. Formally, let Nₖ(x) denote the number of times point x appears in the k-nearest-neighbor lists of all other points. In uniformly distributed data, E[Nₖ(x)] is constant, but the variance of Nₖ grows with dimensionality. The skewness of the Nₖ distribution -- the hubness -- increases monotonically with d/log n where d is the dimensionality and n is the sample size. Radovanovic et al. showed that this is an inherent property of the geometry, not an artifact of particular distance metrics. As corpus size grows, a small set of "hub" memories dominates the top-k results regardless of query content. At 7,500 memories in 384 dimensions, we observe empirically that the top-5 retrievals for topically distinct queries share 2-3 common memories -- the hubs. These hub memories are not necessarily irrelevant; they are typically the most "central" memories that have high average similarity to all queries. But their dominance crowds out the specific, targeted memories that would actually answer the query.
Concentration of distances (Beyer et al., ICDT 1999). As dimensionality grows relative to sample size, the ratio of the nearest-neighbor distance to the farthest-neighbor distance converges to 1. Formally, for i.i.d. points drawn from any distribution with finite variance:
lim(d->inf) Pr[ (||X_max - q|| - ||X_min - q||) / ||X_min - q|| <= ε ] = 1
for any ε > 0. The practical consequence: in 384 dimensions with 7,500 points, the gap between the "most relevant" and "50th most relevant" memory shrinks to the point where cosine similarity cannot reliably discriminate them. Adding more dimensions helps, but the next phenomenon places a lower bound on how many dimensions suffice.
Dimensionality lower bounds (Larsen & Nelson, FOCS 2017). The Johnson-Lindenstrauss lemma guarantees that pairwise distances among n points can be preserved to within (1 ± ε) in O(ε⁻² log n) dimensions. Larsen and Nelson proved this is tight: any embedding that preserves pairwise distances among n points to within (1 ± ε) requires Ω(ε⁻² log n) dimensions. For n = 7,500 and ε = 0.1 (10% distance preservation), this gives:
d >= (1/ε²) · ln(n) = (1/0.01) · ln(7500) ≈ 100 · 8.92 ≈ 892
Our 384-dimensional embeddings (all-MiniLM-L6-v2) operate below this lower bound, meaning pairwise distance preservation to 10% accuracy is not guaranteed even in principle for 7,500 points.
The consequence is measurable. On the BEAM benchmark (Tavakoli et al., ICLR 2026), our production Weighted Reciprocal Rank Fusion (WRRF) pipeline -- which fuses five PostgreSQL-side signals (vector similarity, full-text search, trigram matching, thermodynamic heat, temporal recency) with client-side FlashRank cross-encoder reranking -- scores:
- BEAM-100K (94 memories/conversation): 0.591 MRR
- BEAM-10M (7,500 memories/conversation): 0.353 MRR
A 40% degradation at 100x scale. No reranking layer, query rewriting strategy, or embedding model upgrade addresses a geometric ceiling. The architecture must change.
The insight behind this work is that the problem is not how to retrieve better, but what to compose into context. A flat retrieval pipeline treats the context window as a ranked list: the top-k most similar items, concatenated. A structured assembly pipeline treats the context window as a document with typed sections, priority ordering, and explicit awareness of what was included and what was cut.
This distinction matters most when the corpus is large and thematically dense -- precisely the regime where flat retrieval fails. By partitioning the corpus into topical stages, retrieving within the current stage first, following entity graph connections to related stages, and falling back to summaries for uncovered stages, the assembler recovers information that flat top-k cannot reach.
The analogy is to how humans prepare for a meeting: you do not read every email from the past year sorted by relevance score. You identify the current topic, gather the directly relevant material, follow references to related discussions, and glance at summaries of background context. The structured approach is not better because it uses a superior search algorithm -- it is better because it organizes the search into scopes that match the information need.
This architecture did not originate from a retrieval research project.
It was designed in September 2025 for a production Swift application
(ai-architect-prd-builder) that generates comprehensive product
requirement documents using Apple Intelligence, which has a 4,096-token
context window. The application must produce coherent 9-page documents
with cross-referenced requirements, working code examples, verification
reports, and Jira ticket specifications -- none of which fit in the
model's context simultaneously.
The solution: do not try to fit everything. Structure what goes in, prioritize it, and tell the model what was cut.
This principle -- born from a practical constraint on a tiny context window -- turns out to be exactly what is needed for 10-million-token memory retrieval. The problem is structurally identical at both scales: the available context is smaller than the relevant information, so you must be intelligent about what enters the context.
We make four contributions:
-
ContextDecomposer: a priority-budgeted prompt assembly primitive with typed placeholder slots, domain-aware condensers, progressive condensation, and truncation warning injection -- a mechanism that, to our knowledge, has no direct precedent we are aware of in the 2024-2026 retrieval, prompt engineering, or agent memory literature.
-
StageAwareContextAssembler: a two-phase hierarchical retrieval architecture with a configurable budget split (default 60/30/10), combining submodular coverage selection (Krause & Guestrin, 2008), Personalized PageRank entity graph traversal (Gutierrez et al., NeurIPS 2024), and schema-structured summary fallback (Tse et al., 2007).
-
Empirical validation on BEAM at two scales (100K and 10M tokens), showing that structured assembly is net-flat at small scale (where flat retrieval is already sufficient) and provides a +33.4% improvement at large scale (where flat retrieval collapses) -- with 8 of 10 memory abilities improving.
-
Ablation analysis isolating the contribution of submodular selection (+17.6% over naive top-k) and establishing that the architecture's value is compositional rather than attributable to any single mechanism.
Section 2 reviews related work across dense retrieval, long-context memory systems, hierarchical retrieval, and the biological foundations that inspired design choices. Section 3 presents the method in detail, including the ContextDecomposer algorithm, the two-phase assembler, and the integration with the existing Cortex pipeline. Section 4 describes the experimental setup, benchmarks, baselines, and measurement protocol. Section 5 presents results at both scales with per-ability breakdowns. Section 6 provides analysis of scale-dependent behavior, the multi-session reasoning sign flip, and comparison with published systems. Section 7 documents the provenance timeline with verifiable commit SHAs. Section 8 discusses limitations and future work. Section 9 concludes.
The standard dense retrieval pipeline embeds queries and documents into a shared vector space and retrieves by nearest-neighbor search. The bi-encoder architecture (Karpukhin et al., 2020; Reimers & Gurevych, 2019) maps queries and passages independently, enabling pre-computation of passage embeddings and sublinear retrieval via approximate nearest-neighbor indices (HNSW, IVF, product quantization).
Benchmarking dense retrieval. BEIR (Thakur et al., NeurIPS 2021) established zero-shot evaluation across 18 diverse datasets, revealing that dense retrievers trained on MS MARCO generalize poorly to domain-specific corpora. MTEB (Muennighoff et al., EACL 2023) extended this to 56 datasets across 8 tasks, becoming the standard benchmark for embedding models. Both benchmarks evaluate on corpora where documents are topically diverse -- web pages, Wikipedia passages, scientific abstracts, and forum posts.
Long-document retrieval. LongEmbed (Zhu et al., 2024) extends evaluation to documents exceeding 8,000 tokens, testing whether embedding models preserve relevance over longer passages. However, it still assumes a diverse corpus: the challenge is encoding long documents, not discriminating among thousands of topically similar documents from a single user.
The gap in existing benchmarks. The failure mode we address -- topical density within a single user's conversational history, where 7,500 memories discuss overlapping topics in similar vocabulary -- is not covered by existing retrieval benchmarks. LongMemEval (Wu et al., ICLR 2025) evaluates conversational memory but at modest scale (~115K tokens, ~40 sessions). LoCoMo (Maharana et al., ACL 2024) tests multi-hop and adversarial queries but caps at 10 conversations. BEAM is the first benchmark to stress-test memory retrieval at 10M tokens per conversation, and the results confirm that existing systems collapse at this scale: the best published system (LIGHT with Llama-4-Maverick) achieves only 0.266.
Our baseline, Weighted Reciprocal Rank Fusion (WRRF), extends the original Reciprocal Rank Fusion (Cormack et al., 2009) with per-signal weights. Given m ranking signals, the fused score for document d is:
WRRF(d) = Σᵢ wᵢ / (k + rᵢ(d)) for i = 1..m
where rᵢ(d) is the rank of d under signal i, k is a smoothing constant (we use k = 60, following the original RRF recommendation), and wᵢ is the weight for signal i.
Cortex's WRRF pipeline fuses five signals:
| Signal | Weight | Computed At |
|---|---|---|
| Vector similarity (pgvector HNSW, 384D) | 0.35 | PostgreSQL |
| Full-text search (tsvector, BM25-weighted) | 0.25 | PostgreSQL |
| Trigram matching (pg_trgm) | 0.15 | PostgreSQL |
| Thermodynamic heat (access recency/frequency) | 0.15 | PostgreSQL |
| Temporal recency (exponential decay) | 0.10 | PostgreSQL |
All five signals are computed server-side in a single PL/pgSQL stored
procedure (recall_memories()), returning pre-fused results.
Client-side, FlashRank (ONNX cross-encoder) reranks the top-3k
candidates to produce the final ranking.
This pipeline is strong at moderate scale: 97.8% R@10 on LongMemEval, 92.6% R@10 on LoCoMo. The five-signal fusion mitigates any single signal's weakness (e.g., vector similarity misses lexical matches that trigram catches; FTS misses paraphrases that vectors catch). But at BEAM-10M scale, all five signals suffer from the same underlying geometric degradation, and fusion of degraded signals produces a degraded result.
Several systems have been proposed for persistent conversational memory, each addressing different aspects of the long-term memory challenge:
MemGPT / Letta (Packer et al., 2024). Implements a tiered memory system with conversation buffer (recent messages), archival storage (compressed older messages), and recall storage (searchable facts). Memory management is delegated to the LLM via tool calls -- the model decides when to read, write, and search its own memory. This is an elegant design for self-managing memory, but it relies on the LLM's judgment for retrieval, introducing the same failure modes as any single-signal retrieval (the LLM may generate poor search queries) plus additional latency and cost for each memory operation. Letta does not implement structured context assembly, priority budgeting, or entity graph traversal.
mem0 (Chheda et al., 2024). Extracts key-value facts from conversations using an LLM and stores them as a flat memory bank. Retrieval is by embedding similarity over the extracted facts, with optional filtering by category or time range. The extraction step is valuable (it produces cleaner, more searchable memories than raw conversation turns), but the retrieval layer is standard dense search without any structural organization. At scale, mem0 would face the same hubness and concentration issues we document.
MIRIX (Wang & Chen, 2025). The closest architectural neighbor in the active retrieval dimension. MIRIX introduces three key mechanisms: (1) typed memory categories (episodic, semantic, working), (2) active retrieval -- the agent reformulates queries before searching, and (3) reflection -- the agent periodically reviews and consolidates its memories. The system reports 85.4% on LoCoMo. However, MIRIX does not implement stage-scoped retrieval, entity graph traversal, priority-budgeted prompt assembly, or truncation awareness. Its active retrieval component (query reformulation via LLM) is complementary to our approach and could be composed with our assembler.
A-MEM (Xu et al., NeurIPS 2025). Introduces Zettelkasten-style agentic memory with on-write reconsolidation: when a new memory is stored, the system uses an LLM to compare it against existing memories, create bidirectional links, and update an index. The write-time intelligence builds a richer memory graph, which could benefit our Phase 2 entity traversal. However, A-MEM's retrieval layer is standard dense search, and the write-time LLM calls add significant latency and cost at scale.
LIGHT (Tavakoli et al., ICLR 2026). The strongest published system on BEAM, achieving 0.266 overall on BEAM-10M with Llama-4-Maverick. LIGHT implements a three-tier architecture:
- Episodic memory: stores raw conversation turns, indexed for retrieval.
- Working memory: a fixed-size buffer (typically 10 items) of the most recently accessed memories, refreshed each query.
- Scratchpad: an LLM-generated intermediate representation that helps the reader reason across multiple retrieved passages.
LIGHT's three tiers serve different purposes than our three phases. LIGHT's working memory is a recency-biased buffer; our Phase 1 is a relevance-biased stage-scoped selection. LIGHT's scratchpad is a reader-side reasoning aid; we have no equivalent. LIGHT does not implement priority budgeting, truncation awareness, submodular diversity, or entity graph traversal. Its scratchpad mechanism and our context assembly are complementary approaches that could potentially be combined.
HippoRAG (Gutierrez et al., NeurIPS 2024). Draws on the hippocampal indexing theory (Teyler & Rudy, 2007) to propose a retrieval architecture that separates the neocortex (knowledge graph of extracted entities and relations) from the hippocampus (passage store) and bridges them via Personalized PageRank. The PPR walk propagates relevance from query-mentioned entities outward through the knowledge graph, scoring passages by the aggregated PPR mass of their contained entities.
On multi-hop QA benchmarks (MuSiQue, 2WikiMultihopQA, HotpotQA), HippoRAG demonstrates that graph-based traversal reaches relevant passages that flat dense retrieval misses -- particularly when the answer requires synthesizing information from passages that do not directly mention the query terms but are connected through shared entities.
We adopt their PPR formulation for Phase 2 of our assembler but apply it to a different problem: cross-stage entity bridging in conversational memory rather than multi-hop factoid QA. The key adaptation is the seeding strategy: HippoRAG seeds PPR on query-extracted entities, while we seed on entities extracted from Phase 1 results. This two-stage seeding (WRRF retrieval first, then PPR traversal) is more robust than direct query seeding because Phase 1 results are enriched with the memory system's entity annotations, which are more complete than entities extractable from a short query string.
RAPTOR (Sarthi et al., ICLR 2024). Builds a tree of summaries over document chunks using recursive LLM summarization: leaf nodes are original chunks, and each internal node summarizes its children. Retrieval can occur at any level of the tree, enabling queries to match at different levels of abstraction. The hierarchical summary structure is related to our Phase 3 summary fallback, but RAPTOR operates on documents (where the tree structure reflects the document's own hierarchy) rather than conversational stages (where the structure reflects topical shifts over time). RAPTOR's LLM summarization cost at build time is significant; we use rule-based schema extraction instead.
Late Chunking (Jina AI, 2025). Proposes deferring chunking until after encoding with a long-context model, so each chunk's embedding reflects the full document context. This addresses a different failure mode -- loss of cross-chunk context during embedding -- and is complementary to our approach. Late chunking could improve the quality of per-stage embeddings within our Phase 1.
ColBERT and Multi-Vector Retrieval (Khattab & Zaharia, SIGIR 2020). Represents each passage as a set of token-level vectors and computes relevance via MaxSim aggregation. This finer-grained representation mitigates some hubness effects by allowing individual tokens to contribute independently. However, the storage and computation overhead (one vector per token) is substantial at 7,500-memory scale, and the approach does not address the stage-composition problem.
Submodular set functions have a rich history in information selection problems. A set function f: 2^V → R is submodular if for all A ⊆ B ⊆ V and e ∉ B:
f(A ∪ {e}) - f(A) >= f(B ∪ {e}) - f(B)
This diminishing returns property formalizes the intuition that adding an item is more valuable when less has already been selected.
Sensor placement (Krause & Guestrin, JMLR 2008). Proved that for monotone submodular functions, the greedy algorithm (iteratively select the item with the highest marginal gain) achieves a (1 - 1/e) ≈ 0.63 approximation to the optimal set. This guarantee holds regardless of the constraint set (cardinality, matroid, knapsack), with different approximation ratios for different constraints.
Document summarization (Lin & Bilmes, ACL 2011). Applied submodular optimization to extractive summarization, treating sentence selection as a budget-constrained submodular maximization problem. The objective combines coverage (how much of the document's information is represented) with diversity (how different the selected sentences are).
Maximal Marginal Relevance (Carbonell & Goldstein, SIGIR 1998). The MMR criterion -- score(c) - λ · max(sim(c, s) for s in S) -- is the most common instantiation of submodular selection in retrieval. Carbonell and Goldstein introduced it for reranking search results; we apply it to memory selection within a stage. When λ < 1, the MMR objective is submodular (proof: the max-similarity penalty is a modular function, and the difference of a modular function from a monotone function is submodular).
The stage-aware assembly architecture has structural parallels to established neuroscience models. We emphasize that these are design analogies rather than mechanistic implementations -- we do not claim that the algorithms replicate biological processes, only that the biological principles informed useful design choices.
Hippocampal indexing theory (Teyler & Rudy, 2007). The hippocampus stores pointers to neocortical representations, not the representations themselves. During retrieval, hippocampal activation triggers reactivation of the corresponding neocortical patterns. This "index + content" separation maps to our Phase 2 architecture: the entity graph serves as an index (hippocampus), and the memories stored in PostgreSQL serve as the content (neocortex). PPR traversal over the entity graph activates the memories that share the most entity overlap with the current query context.
Grid modules (Stensola et al., Nature 2012). The entorhinal cortex organizes spatial information in modular grids at different scales -- small grids for local navigation, large grids for global orientation. Our two-phase structure (current stage, adjacent stages, all stages) mirrors this multi-scale organization: Phase 1 operates at the finest scale (within-stage), Phase 2 at an intermediate scale (across nearby stages via entity connections), and Phase 3 at the broadest scale (summaries of all stages). The analogy should not be pushed further than the structural level -- we do not implement grid-like spatial representations.
Schema consolidation (Tse et al., Science 2007). Prior knowledge structures (schemas) facilitate rapid integration of new information that is consistent with existing schemas. Tse et al. showed that rats with a pre-existing spatial schema could learn new object-location associations in a single trial, compared to weeks without a schema. Our Phase 3 summary fallback draws on this principle: schema-structured summaries of each stage serve as compressed prior knowledge that new queries can be matched against, even when the original memories are too numerous or too distant to retrieve directly.
Complementary learning systems (McClelland et al., 1995). The hippocampal and neocortical systems serve complementary roles: hippocampal learning is fast and episodic (one-shot, detail-rich), while neocortical learning is slow and semantic (gradual, pattern- extracting). Our architecture embodies this complementarity: Phase 1 retrieves specific episodic memories (fast, detail-rich), while Phase 3 retrieves consolidated semantic summaries (compressed, pattern-level). Phase 2 bridges the two via entity-structural connections.
The architecture comprises two core primitives that can be used independently or composed. The ContextDecomposer manages the prompt-level budget: given a template with typed placeholders, it allocates tokens by priority and communicates truncation decisions to the reader model. The StageAwareContextAssembler manages the retrieval-level composition: given a query and a corpus partitioned into stages, it selects memories from three scopes with configurable budget proportions.
Figure 1 (conceptual) illustrates the data flow:
Query
|
v
[Stage Detection] --> current_stage
|
v
[Phase 1: Own-Stage WRRF + Submodular Select] --> 60% budget
|
v
[Phase 2: Entity Extract --> PPR Walk --> Cross-Stage Select] --> 30% budget
|
v
[Phase 3: Summary Fallback for Uncovered Stages] --> 10% budget
|
v
[Structured Context: ## Current Stage | ## Related Prior | ## Summaries]
|
v
[ContextDecomposer: Priority Budget + Condense + Truncation Warning]
|
v
Final Prompt --> Reader Model
A prompt template contains typed placeholder slots. Each slot is defined by a 4-tuple:
Pᵢ = (kᵢ, vᵢ, pᵢ, cᵢ)
where:
- kᵢ is the template key (e.g.,
{{CONTEXT}},{{QUERY}},{{RELATED}}) - vᵢ is the content to fill it
- pᵢ ∈ N is the priority rank (lower = more important; priority 1 is the last to be condensed)
- cᵢ: (text, budget) → text is an optional domain-aware condenser function
The assembly algorithm operates in six steps:
Compute the template with all placeholders replaced by empty strings. Let s = τ(shell) where τ is the token estimator (default: conservative ceil(|text| / 3), swappable for tiktoken at integration).
Given context window W and headroom fraction h (default 0.75, reserving 25% for the response):
B = max(300, floor(W · h) - s)
The variable budget B is the total token budget available for all placeholder content. The minimum of 300 tokens ensures that even with a large shell, some content always survives.
If Σ τ(vᵢ) <= B, all placeholders are used verbatim. No condensation is needed. This is the common case for short queries with small context.
When the fast path fails, placeholders are condensed progressively from least to most important. Sort placeholders by descending priority number (highest pᵢ first -- least important first). For each placeholder Pᵢ in this order:
- Compute a proportional share of the remaining budget:
shareᵢ = max(50, floor(B_remaining / |{Pⱼ : j >= i, Pⱼ not yet assigned}|))
-
If τ(vᵢ) <= shareᵢ, use vᵢ verbatim and subtract τ(vᵢ) from B_remaining.
-
Otherwise, apply the condenser:
vᵢ' = cᵢ(vᵢ, shareᵢ) if cᵢ != null
truncate(vᵢ, shareᵢ) otherwise
- Use vᵢ' and subtract min(τ(vᵢ'), B_remaining) from B_remaining.
The proportional allocation ensures that low-priority placeholders with small content are not penalized -- they get their full content if it fits within their share. Only placeholders that exceed their share are condensed.
After all placeholders are assigned, assemble the final prompt by substituting each placeholder with its assigned content. If the assembled prompt still exceeds (W - safety_margin) tokens (due to estimation errors or template overhead), iteratively halve the content of the lowest-priority placeholder at the nearest line boundary. This is a defensive measure -- it should rarely trigger if the budget computation is accurate.
while estimate_tokens(prompt) > (context_window - safety_margin):
for p in sorted_by_priority_desc(placeholders):
if estimate_tokens(current[p.key]) <= 50:
continue # Cannot meaningfully halve further
# Halve at nearest line boundary
half = len(current[p.key]) // 2
cutpoint = current[p.key][:half].rfind('\n')
current[p.key] = current[p.key][:cutpoint + 1]
rebuild prompt
breakThe loop terminates after at most 50 iterations or when no further reduction is possible.
For each placeholder whose surviving fraction is below 90%:
surviving(Pᵢ) = τ(finalᵢ) / τ(originalᵢ)
build a warning line. If any placeholders were materially truncated, prepend a banner to the final prompt:
WARNING: CONTEXT TRUNCATION
The following sections were truncated to fit the context window.
You may be missing information. Prioritize the content you CAN see.
- {{RELATED_CONTEXT}}: 45% retained (340/756 tokens)
- {{STAGE_SUMMARIES}}: 22% retained (88/400 tokens)
Novelty claim. The truncation warning is, to our knowledge, novel. We surveyed the 2024-2026 literature across RAG, prompt engineering, context management, and agent architectures and found no direct precedent we are aware of for injecting explicit truncation awareness into the prompt itself. The closest mechanisms are:
- Anthropic's contextual retrieval (2024): adds chunk-level LLM summaries, but these enrich content rather than communicate truncation decisions.
- LangChain / LlamaIndex token budgeting: truncates silently without informing the model what was cut.
- OpenAI's function calling with
max_tokensconstraints: truncates tool outputs but does not inject a warning about the truncation.
The design rationale is straightforward: a model that knows what it cannot see can hedge its confidence appropriately, request clarification, or flag uncertainty. A model that receives a silently truncated context has no basis for distinguishing "this information does not exist" from "this information was cut." We hypothesize (but have not experimentally verified) that truncation warnings improve answer quality for questions whose evidence was partially truncated.
Generic truncation (keeping the first N tokens) is a poor strategy because the most important information is rarely at the beginning. Domain-aware condensers exploit structural properties of different content types to preserve high-signal content and drop filler:
| Content Type | Strategy | Signal Preserved |
|---|---|---|
| User messages | First sentence + questions + last sentence | Intent, explicit queries |
| Assistant messages | Code blocks verbatim; prose compressed to topic sentences | Factual content, implementations |
| Entity triples | Keep (S, P, O) lines; drop narrative prose |
Relationship structure |
| Timeline events | Extract [when] what fixed-slot format |
Temporal anchors |
| Code blocks | Function/class/import signatures only | API surface, definitions |
The condenser is selected automatically by content shape detection. The detection heuristics are:
- Tag-driven (highest priority): if the memory has tags like
code,timeline,event, dispatch directly. - Code fence detection: presence of triple backticks or
= 3 indented lines triggers the assistant message condenser.
- Arrow operator count: >= 2 occurrences of
->or triggers the entity triple condenser. - Role prefix detection: lines starting with
[user]:or[assistant]:trigger the appropriate message condenser. - Default: treat as prose user message.
Each condenser is a pure function (text, token_budget) -> text
with a fast path: if the input already fits within the budget,
return it unchanged.
A stage is a distinct topical segment of a conversation -- analogous to a work session, a debugging episode, or a design discussion. Stage detection is pluggable (Section 3.2.1). The assembler operates in three phases with a configurable budget split (β₁, β₂, β₃) where β₁ + β₂ + β₃ = 1 (default: 0.6, 0.3, 0.1).
The StageDetector interface abstracts stage assignment with two
operations:
stage_of(memory) -> str: assigns a single memory to its stage IDall_stages(corpus) -> list[str]: returns all distinct stages in a stable order
We provide three implementations:
ExplicitStageDetector. Stage = value of an explicit metadata
field. For BEAM, this is plan_id (the dataset provides plan
indices per turn, with each plan representing a thematic conversation
segment). For production use, this could be agent_topic,
project_directory, or any other session-level label. The detector
accepts a fallback value (default: "default") for memories
lacking the field.
TemporalStageDetector. Stage = contiguous block of memories
where inter-memory temporal gaps are below a threshold (default:
4 hours). When a gap exceeds the threshold, a new stage begins.
The 4-hour default matches a typical work-session boundary. The
detector maintains an internal cache of memory_id -> stage_id
mappings after the first corpus scan, making subsequent
stage_of() calls O(1).
CompositeStageDetector. Tries detectors in priority order, falling back from explicit to temporal when labels are absent. This is the recommended production configuration: use explicit labels when available, temporal boundaries otherwise.
The pluggable design enables A/B testing of stage strategies without
modifying the assembly logic. For the experiments in this paper, we
use ExplicitStageDetector with plan_id on BEAM (where plan
boundaries are ground truth) and note that production deployment
would use CompositeStageDetector.
Phase 1 retrieves from the current stage's memories. The procedure is:
-
Candidate generation. Query the WRRF pipeline with a stage filter and 3x oversampling: retrieve
max_chunks * 3candidates from the current stage. The oversample provides a rich candidate pool for diversity-aware selection. -
Submodular selection. From the oversample, select the final set via the MMR-submodular objective:
S* = argmax(|S| <= k) Σ(c in S) [score(c) - λ · max(sim(c, c') for c' in S\{c})]
where score(c) is the WRRF score, sim(c, c') is cosine similarity over 384D embeddings, and λ = 0.5 balances relevance against diversity.
- Greedy algorithm. Iterate up to
max_chunkstimes. At each step, select the candidate not yet in S that maximizes the marginal gain:
function SUBMODULAR_SELECT(candidates, max_chunks, lambda):
selected = []
selected_embeddings = []
for i = 1 to max_chunks:
best_idx = -1
best_gain = -infinity
for j in range(len(candidates)):
if j in selected: continue
if selected_embeddings is empty or embedding[j] is None:
penalty = 0
else:
penalty = max(cosine(embedding[j], s) for s in selected_embeddings)
gain = score[j] - lambda * penalty
if gain > best_gain:
best_gain = gain
best_idx = j
if best_idx < 0: break
selected.append(best_idx)
if embedding[best_idx] is not None:
selected_embeddings.append(embedding[best_idx])
return [candidates[i] for i in sorted(selected)]
Approximation guarantee. By Krause & Guestrin (2008), the greedy algorithm achieves f(Sₖ) >= (1 - 1/e) · f(Sₖ) ≈ 0.63 · f(Sₖ) for any monotone submodular function under cardinality constraints. The MMR objective with λ < 1 is submodular (the max-similarity penalty is submodular as the pointwise maximum of linear functions), so this guarantee applies.
Why submodular selection matters at scale. At BEAM-10M, the top-5 candidates from WRRF within a single stage frequently overlap in content. A specific failure mode we observed: three of five selected memories described the same debugging session from slightly different timestamps. Submodular selection penalizes this redundancy, forcing the selected set to cover more aspects of the queried topic.
Decoupling selection from budget. A deliberate design choice:
selection is always by count (max_chunks), never by token budget.
This ensures retrieval ranking metrics (MRR, R@k) are well-defined
regardless of individual memory sizes. Token budgeting is enforced
at assembly time by the ContextDecomposer, which may truncate
individual chunks but never reduces the count of selected items. The
two concerns -- "which memories matter?" and "how many tokens can we
spend?" -- are independent and should not contaminate each other.
This decoupling was itself a bug fix. The initial implementation
enforced token budgets at selection time, causing the assembler to
select fewer than max_chunks items when individual memories were
long. This degraded retrieval recall (fewer items means fewer
chances for a hit) without improving prompt quality (the
ContextDecomposer would have condensed the long items anyway).
Phase 2 addresses the question: "What information from other stages is relevant to the current query?"
Flat dense retrieval over the entire corpus is the default answer, but at 7,500 memories it degenerates (Section 1.1). Instead, we follow the HippoRAG approach (Gutierrez et al., NeurIPS 2024) and use entity co-occurrence as a structural bridge between stages.
Step 1: Entity extraction from Phase 1. From the Phase 1
selected memories, extract the set of entity IDs. Each memory in
Cortex is annotated with its contained entities at ingest time
(stored in the entity_memory junction table). The seed entity set
is built by aggregating entity IDs across all Phase 1 results, with
each entity weighted by how many Phase 1 memories contain it:
s(e) = |{m in S₁ : e in entities(m)}|
where S₁ is the Phase 1 selected set.
Step 2: Graph construction. Build an adjacency structure over Cortex's entity and relationship tables. Each entity is a node; each relationship adds a bidirectional weighted edge with strength proportional to the co-occurrence strength stored in the database. The graph is constructed on-demand from PostgreSQL:
adjacency = {}
for entity in entities:
adjacency[entity.id] = []
for rel in relationships:
adjacency[rel.source].append((rel.target, rel.strength))
adjacency[rel.target].append((rel.source, rel.strength))Edge weights are normalized to transition probabilities within the PPR computation.
Step 3: Personalized PageRank. Compute PPR scores via power iteration with restart probability α = 0.15 (Brin & Page, 1998):
r(t+1) = α · s + (1 - α) · M · r(t)
where:
- s is the normalized seed vector (mass concentrated on Phase 1 entities)
- M is the column-stochastic transition matrix derived from the adjacency structure
- Iteration continues until ||r(t+1) - r(t)||₁ < 10⁻⁴ or 30 iterations
Dangling nodes (entities with no outgoing edges) redistribute their mass to the seed distribution, maintaining the random walk interpretation. This is the standard treatment of dangling nodes in PageRank (Langville & Meyer, 2005).
The restart probability α = 0.15 is the canonical value from the original PageRank paper. Higher α produces more localized results (closer to the seed entities); lower α produces more global results. We did not tune this parameter.
Step 4: Memory scoring by PPR aggregation. Following HippoRAG Section 3.3, a memory's relevance under PPR is the sum of PPR mass of its contained entities:
score_PPR(m) = Σ(e in entities(m)) r(e)
Memories from the current stage are excluded (they were already
considered in Phase 1). The remaining cross-stage memories are
ranked by PPR score, and the top max_chunks_per_phase are selected,
optionally constrained by the adjacent-stage token budget.
Why PPR over BFS. Cortex already implements spreading activation (Collins & Loftus, 1975), which is a decaying breadth-first search with activation decay factor d per hop. PPR provides a stationary distribution rather than a depth-bounded traversal. The differences matter:
| Property | Spreading Activation | PPR |
|---|---|---|
| Termination | After max hops | At convergence |
| Cycle handling | May revisit with decayed weight | Natural via restart |
| Long-range connections | Penalized by distance | Captured if reachable |
| Sensitivity to graph density | High (dense regions amplify) | Lower (normalization) |
PPR is better suited to cross-stage bridging where the relevant connection may be at variable depth and the graph has heterogeneous density.
For stages not covered by Phase 1 or Phase 2, the assembler retrieves pre-computed schema-structured summaries ordered by stage proximity (temporal distance from the current stage). This ensures that even distant stages contribute contextual background.
The summaries are produced by Cortex's schema engine, which implements Tse et al. (2007) schema-congruent consolidation: recurring patterns across memories within a stage are extracted into a fixed-slot format:
- Entities: key actors, systems, and concepts
- Decisions: choices made and their rationale
- Outcomes: results of actions taken
- Open questions: unresolved issues
This fixed-slot format compresses efficiently (50-200 tokens per stage vs. thousands for raw memories) and matches queries structurally rather than lexically.
Current status. In the implementation evaluated in this paper, Phase 3 is a stub -- the summary callback returns empty strings for all stages. Wiring it to Cortex's CLS consolidation engine, which already produces schema-structured summaries per stage, is planned but not yet benchmarked. The results reported here reflect Phase 1
- Phase 2 only. We include Phase 3 in the architecture description because the interface is defined, the callback mechanism is wired, and the schema engine exists -- only the plumbing is missing.
The default split of 60/30/10 (β₁ = 0.6, β₂ = 0.3, β₃ = 0.1) is motivated by three observations:
Locality dominance. For the majority of queries, the answer exists within the current stage. On BEAM-10M, 65% of questions target information from a single plan (stage). The 60% allocation to Phase 1 ensures that same-stage retrieval has the largest share of resources.
Diminishing cross-stage returns. Entity graph traversal is high-precision but low-recall: it finds specific connections between stages via shared entities, but the number of useful cross-stage memories for any single query is small (typically 1-3, rarely more than 5). The 30% allocation is sufficient to retrieve these without diluting the context with marginally relevant cross-stage content.
Summaries as compressed context. Stage summaries are compressed by design (50-200 tokens each). The 10% allocation accommodates several summaries, providing broad coverage at minimal token cost. Even 10% of a 4,096-token window (409 tokens) can hold 2-4 summaries.
Origin. The 60/30/10 values descend from the Swift ContextManager's heuristic for PRD generation (approximately 55/25/20 in the original), rounded to cleaner values. We did not perform a systematic grid search over budget splits and expect the optimal split to be task-dependent. Automated tuning (e.g., Bayesian optimization over (β₁, β₂, β₃) with MRR as the objective) is left to future work.
The assembler is a composition layer on top of the existing retrieval pipeline, not a replacement. This is a deliberate architectural choice: the assembler composes retrieval primitives without reimplementing them. The full data flow is:
-
Query intent classification. Cortex classifies the query as temporal, causal, semantic, entity, knowledge-update, or multi-hop using regex-based rules and keyword detection (pure logic, no LLM). This classification informs signal weights in the WRRF pipeline.
-
Stage detection. The
StageDetectormaps the query to a current stage, either from an explicit label or by temporal proximity. -
Phase 1: Own-stage WRRF + submodular select. The WRRF pipeline's
recall_memories()stored procedure runs with a stage filter added to the WHERE clause. Returns 3x oversample. Submodular selection reduces tomax_chunks. -
Phase 2: Entity PPR traversal. Entity IDs from Phase 1 seed a PPR walk over the entity graph. Cross-stage memories scored by PPR mass.
-
Phase 3: Summary fallback. Summaries for uncovered stages retrieved by proximity.
-
Assembly. The three phases produce a structured context with labeled sections (
## Current Stage Context,## Related Prior Context,## Stage Summaries). -
FlashRank reranking (optional). Client-side cross-encoder reranking on the combined candidate set.
-
Prompt assembly (if reader downstream). The ContextDecomposer fits the structured context into the model's context window with priority budgeting.
For retrieval evaluation (BEAM benchmark), steps 7-8 are bypassed:
the metric is the rank of the gold memory within the
selected_memories list produced by step 6.
Following MIRIX (Wang & Chen, 2025), we provide an optional active retrieval layer that reformulates the query before it enters the WRRF pipeline. Two implementations are available:
KeywordExtractor (rule-based). Strips question words and filler, preserves quoted strings, dates, capitalized words (likely proper nouns), and words of length >= 4. Zero latency, deterministic, no model required.
LLMReformulator (model-based). Rewrites the query using a small local model with the prompt: "Rewrite the following question as a search query optimized for retrieving relevant passages from a conversation log." Gated by model availability.
In the experiments reported here, active retrieval is not enabled. Both WRRF and assembler conditions receive the raw query. We include the component in the architecture description because the interface is defined and wired; ablation against query reformulation is planned for future work.
BEAM (Beyond a Million Tokens; Tavakoli et al., ICLR 2026) is a long-term memory benchmark comprising 10 synthetic conversations generated by GPT-4, each available at two scales:
-
BEAM-100K: ~94 memories per conversation, ~100K tokens total. Each conversation spans 5 plans (stages), with each plan containing approximately 19 turns of user/assistant interaction.
-
BEAM-10M: ~7,500 memories per conversation, ~10M tokens total. Each conversation spans 50 plans (stages), with each plan containing approximately 150 turns.
Each conversation is probed with 20 questions (196 total at 10M due to 4 missing summarization questions in some conversations) spanning 10 memory abilities:
| Ability | What it tests | Count (10M) |
|---|---|---|
| abstention | Correctly declining to answer when no relevant memory exists | 20 |
| contradiction_resolution | Identifying and surfacing conflicting user statements | 20 |
| event_ordering | Sequencing events chronologically across sessions | 20 |
| info_extraction | Retrieving specific facts stated in conversation | 20 |
| instruction_following | Adhering to user-stated interaction preferences | 20 |
| knowledge_update | Surfacing the latest version of an updated fact | 20 |
| multi_session_reasoning | Synthesizing evidence from multiple sessions | 20 |
| preference_following | Tracking evolving user preferences | 20 |
| summarization | Producing broad summaries across many sessions | 16 |
| temporal_reasoning | Answering "when" questions about events | 20 |
BEAM is notable for including three abilities that no prior benchmark tests: contradiction resolution, event ordering, and instruction following. These abilities specifically stress multi-session memory management -- the regime where our architecture is designed to help.
WRRF baseline. Cortex's production pipeline without the assembler: 5-signal server-side fusion + FlashRank client-side reranking. This is a strong baseline: 97.8% R@10 on LongMemEval, 92.6% R@10 on LoCoMo, and 0.591 MRR on BEAM-100K. It represents the state of the art for multi-signal hybrid retrieval without structural organization.
LIGHT (Tavakoli et al., ICLR 2026). The strongest published system on BEAM, achieving 0.266 overall on BEAM-10M with Llama-4-Maverick (170B parameters). LIGHT scores are end-to-end QA (LLM-as-judge nugget scoring), not retrieval MRR. We include LIGHT scores in our results tables for directional reference but emphasize that direct comparison is not valid: LIGHT's score reflects both retrieval quality and reader quality, while our MRR is retrieval-only.
| Component | Specification |
|---|---|
| Embedding model | sentence-transformers all-MiniLM-L6-v2 (384D, 256 max tokens) |
| Database | PostgreSQL 17.4 + pgvector 0.8.0 (HNSW: m=16, ef_construction=200, ef_search=200) |
| Index | pg_trgm for trigram similarity |
| Reranker | FlashRank v0.4.3 (ONNX cross-encoder, ms-marco-MiniLM-L-12-v2) |
| Entity extraction | Rule-based NER (capitalized multi-word phrases, quoted strings, technical terms) |
| PPR parameters | alpha=0.15, max_iters=30, tolerance=1e-4 |
| Submodular selection | lambda=0.5, max_chunks=5, oversample=3x |
| Stage detection | ExplicitStageDetector (field="plan_id") for BEAM |
| Token estimator | Conservative chars/3 heuristic |
| Hardware | Apple M1 Max, 64GB RAM, PostgreSQL on localhost |
Both WRRF and assembler conditions use the identical embedding model, database, reranker, and entity extraction. The only difference is whether the assembled results are stage-scoped and submodular-selected (assembler) or flat and top-k-selected (WRRF).
Metric: retrieval-proxy MRR. We measure the Mean Reciprocal Rank of the first retrieved memory whose content substring-matches the gold source turn or answer text. Formally, for a question q with gold source text g, and a ranked list of retrieved memories [m₁, m₂, ..., mₖ]:
RR(q) = 1/rank(m*) if exists m* : g ⊆ m*.content
0 otherwise
MRR = (1/|Q|) · Σ(q in Q) RR(q)
This is not the BEAM paper's metric. The paper uses LLM-as-judge nugget scoring for end-to-end QA quality, which evaluates whether the reader model's answer contains the key information nuggets. Our retrieval-proxy MRR evaluates whether the retrieved context contains the gold source -- a necessary but not sufficient condition for good answers. Both WRRF and assembler are measured with the identical harness, so relative comparisons are valid.
Database isolation. Fresh cortex_bench database per benchmark
run:
DROP DATABASE IF EXISTS cortex_bench;
CREATE DATABASE cortex_bench;
-- Extensions and schema created by the benchmark harnessBetween conversations within a run, all data tables are truncated:
TRUNCATE memories, entities, relationships, entity_memory CASCADE;This ensures zero cross-conversation contamination. Each conversation starts with an empty database, ingests its turns, then evaluates its questions.
BEAM-10M turn ID fix. Turn IDs in the raw BEAM-10M dataset are plan-relative: each plan restarts turn numbering from 0. Gold annotations reference turns by their global index across all plans. We apply cumulative plan offsets to produce globally unique IDs:
offset = 0
for plan_idx, plan in enumerate(conversation.plans):
for turn in plan.turns:
turn.global_id = offset + turn.local_id
offset += len(plan.turns)Without this fix, gold memory matching fails entirely (0% hit rate
across all questions). This data issue is not documented in the
BEAM paper. See commit 5348f74.
FlashRank preflight verification. Before each benchmark run, the harness verifies FlashRank by reranking a synthetic pair:
result = flashrank.rerank([
{"content": "The cat sat on the mat", "score": 0.0},
{"content": "Dogs are loyal pets", "score": 0.0},
], query="What did the cat do?")
assert result[0]["score"] > 0, "FlashRank model loading failed"This catches a failure mode discovered during development: FlashRank occasionally fails to load its ONNX model and returns 0.0 scores for all candidates without raising an exception, producing random rankings for both conditions.
Variance characterization. Results are deterministic within ± 0.01 MRR across runs with PostgreSQL restart between runs. Sources of variance:
- Embedding model: deterministic on CPU (no dropout, no stochastic operations).
- HNSW search: approximate, bounded by
ef_search=200. For our corpus sizes, ef_search=200 produces exact results in practice (verified by comparing against exact brute-force search on a subset). - PPR convergence: deterministic (power iteration with fixed seed).
- Submodular selection: deterministic (greedy, no randomization).
| Ability | WRRF | Assembler | Δ |
|---|---|---|---|
| temporal_reasoning | 0.950 | 0.920 | -0.030 |
| knowledge_update | 0.900 | 0.900 | 0.000 |
| contradiction_resolution | 0.817 | 0.783 | -0.034 |
| multi_session_reasoning | 0.812 | 0.500 | -0.312 |
| event_ordering | 0.380 | 0.525 | +0.145 |
| info_extraction | -- | 0.700 | +0.157 |
| instruction_following | 0.448 | 0.478 | +0.030 |
| preference_following | 0.442 | 0.442 | 0.000 |
| abstention | 0.400 | 0.400 | 0.000 |
| summarization | 0.271 | 0.370 | +0.099 |
| Overall MRR | 0.591 | 0.602 | +0.011 (+1.9%) |
At 100K scale (94 memories per conversation, 5 plans per conversation), the assembler is net-flat. The overall improvement of +0.011 MRR (+1.9%) is within noise. Individual ability-level patterns are worth examining:
Multi-session reasoning: -0.312. The sharpest regression and the most informative data point. At 94 memories, flat WRRF over the entire corpus reaches cross-session evidence easily -- the embedding space is sparse enough for nearest-neighbor search to discriminate. Stage-scoping restricts Phase 1 to the current stage's ~19 memories, missing evidence from other stages. Phase 2 PPR can partially recover this, but the overhead of the two-phase process loses more than it gains at this scale.
Event ordering: +0.145. Submodular selection helps here by avoiding the redundancy problem: chronological ordering questions need memories from different time points, and submodular diversity naturally selects temporally diverse memories over topically similar ones.
Info extraction: +0.157. Specific-fact questions benefit from stage-scoped retrieval: the relevant fact is usually within the current stage, and submodular selection avoids drowning it in paraphrases.
Summarization: +0.099. A surprising positive at 100K. We attribute this to submodular diversity: summarization questions benefit from broad coverage, and submodular selection covers more aspects of the stage than naive top-k.
| Ability | WRRF | Assembler | Δ | R@5 | R@10 | LIGHT* |
|---|---|---|---|---|---|---|
| knowledge_update | 0.835 | 0.892 | +0.057 | 100.0% | 100.0% | 0.375 |
| contradiction_resolution | 0.633 | 0.725 | +0.092 | 90.0% | 90.0% | 0.050 |
| multi_session_reasoning | 0.415 | 0.543 | +0.128 | 80.0% | 80.0% | 0.000 |
| info_extraction | 0.448 | 0.487 | +0.039 | 70.0% | 70.0% | -- |
| preference_following | 0.412 | 0.481 | +0.069 | 65.0% | 65.0% | 0.483 |
| temporal_reasoning | 0.370 | 0.467 | +0.097 | 50.0% | 50.0% | 0.075 |
| abstention | 0.100 | 0.350 | +0.250 | 35.0% | 35.0% | 0.750 |
| instruction_following | 0.068 | 0.125 | +0.057 | 15.0% | 15.0% | 0.500 |
| event_ordering | 0.067 | 0.067 | 0.000 | 10.0% | 10.0% | 0.266 |
| summarization | 0.186 | 0.150 | -0.036 | 22.2% | 22.2% | 0.277 |
| Overall MRR | 0.353 | 0.429 | +0.076 (+33.4%) | 0.266* |
*LIGHT scores are end-to-end QA (LLM-as-judge, Llama-4-Maverick). Not comparable to retrieval MRR. Shown for directional reference.
At 10M scale (7,500 memories per conversation, 50 plans per conversation), 8 of 10 categories improve, 1 is unchanged, and 1 regresses. We analyze each category:
Abstention: +0.250 (the largest gain). When the assembler's stage-scoped retrieval returns no relevant memories within the current stage, the absence itself is informative. Flat WRRF always returns something from 7,500 candidates -- typically hub memories that are generically similar to everything. These false positives make abstention decisions impossible. The assembler's stage-scoping effectively creates a "null result" signal that WRRF cannot produce.
Multi-session reasoning: +0.128 (the critical validation). This category regressed at 100K (-0.312) but improves at 10M (+0.128). The sign flip is the strongest evidence for the scale-dependent thesis. At 7,500 memories, flat WRRF cannot reach cross-session evidence through embedding similarity alone (the geometric ceiling of Section 1.1). Phase 2's PPR traversal follows entity connections to memories in other stages that share specific entities -- project names, API endpoints, error codes, people's names -- with the current stage's content. This structural bridge reaches evidence that the embedding space cannot distinguish.
Temporal reasoning: +0.097. Stage boundaries provide implicit temporal structure. A memory's stage ID encodes "when" information (which plan/session it belongs to), supplementing the explicit timestamp that WRRF already uses.
Contradiction resolution: +0.092. Stage-scoped retrieval naturally isolates contradicting statements that occur in different stages. Flat WRRF tends to surface the more recent (and more frequently accessed) version of a contradicted fact, because thermodynamic heat biases toward recently accessed memories. The assembler's Phase 2 can surface both the original statement (in a distant stage) and the contradiction (in a recent stage) through entity connections.
Knowledge update: +0.057. WRRF is already strong here (0.835) because Cortex's thermodynamic heat naturally surfaces the newest version of a fact. The assembler's stage-scoping concentrates retrieval on the stage where the update occurred, further improving precision.
Preference following: +0.069. User preferences (e.g., "I prefer tabs over spaces") are often stated once and referenced implicitly thereafter. Entity graph connections between the preference statement and subsequent conversations that reference the same entities help surface the original preference.
Instruction following: +0.057. A modest improvement on the hardest category (0.068 baseline). Instructions are syntactically similar to normal conversation -- "Please always include a brief summary at the end of your responses" looks like a request, not an instruction. Neither WRRF nor the assembler has a mechanism to distinguish instructions from requests.
Event ordering: 0.000. No change. Chronological sequencing requires explicit temporal reasoning (ordering by timestamp), which neither retrieval architecture provides. Both systems retrieve relevant memories but in relevance order, not chronological order.
Summarization: -0.036. The only regression. Summarization questions require broad coverage across many stages. Stage-scoped retrieval focuses depth within a stage at the cost of breadth across stages. Phase 3 is designed to address this via stage summaries, but it is currently a stub.
The central finding is that structured assembly is net-flat at small scale and dominates at large scale. We can characterize this more precisely:
At 100K tokens, the embedding space density is:
ρ_100K = n/d = 94/384 ≈ 0.24
At 10M tokens:
ρ_10M = n/d = 7500/384 ≈ 19.5
The hubness phenomenon (Radovanovic et al., 2010) becomes significant when ρ >> 1. At ρ = 0.24, the embedding space is underpopulated; nearest-neighbor search is reliable. At ρ = 19.5, the space is overcrowded; hubness, distance concentration, and the JL lower bound all contribute to retrieval degradation.
The crossover point -- where structured assembly's benefit exceeds its overhead -- lies somewhere in the range 1 < ρ < 19.5, corresponding to roughly 400-7,500 memories in 384 dimensions, or approximately 400K to 10M tokens of conversation. We did not evaluate at intermediate scales, but we conjecture that the benefit increases monotonically with ρ.
We conducted ablation experiments on BEAM-100K (5 conversations) to isolate the contribution of individual mechanisms. Each ablation removes one mechanism while holding all others constant.
| Configuration | Overall MRR | Δ vs WRRF |
|---|---|---|
| WRRF baseline (no assembler) | 0.591 | -- |
| Assembler, no submodular (naive top-k) | 0.512 | -0.079 |
| Assembler, Phase 1 only (submodular, no PPR) | 0.513 | -0.078 |
| Assembler, Phase 1 + Phase 2 (sub + PPR) | 0.602 | +0.011 |
| Full assembler (Phase 1 + 2, Phase 3 stub) | 0.602 | +0.011 |
Detailed per-ability ablation (BEAM-100K):
| Ability | WRRF | No Submod | Ph1 Only | Ph1+2 | Full |
|---|---|---|---|---|---|
| knowledge_update | 0.850 | 0.850 | 0.850 | 0.900 | 0.900 |
| temporal_reasoning | 0.950 | 0.900 | 0.900 | 0.920 | 0.920 |
| contradiction_res. | 0.817 | 0.600 | 0.650 | 0.783 | 0.783 |
| multi_session_reas. | 0.812 | 0.500 | 0.500 | 0.500 | 0.500 |
| event_ordering | 0.380 | 0.333 | 0.400 | 0.525 | 0.525 |
| instruction_follow. | 0.448 | 0.283 | 0.233 | 0.478 | 0.478 |
| preference_follow. | 0.442 | 0.300 | 0.250 | 0.442 | 0.442 |
| abstention | 0.400 | 0.400 | 0.400 | 0.400 | 0.400 |
| summarization | 0.271 | 0.250 | 0.250 | 0.370 | 0.370 |
| Overall | 0.591 | 0.512 | 0.513 | 0.602 | 0.602 |
Key findings:
Submodular selection is essential. Without it, the assembler hurts performance (-0.079 vs. WRRF). This is because stage-scoping reduces the candidate pool from 94 to ~19 memories (one stage), and naive top-k within that smaller pool selects redundant memories. The stage-scoping restriction is only beneficial when paired with diversity-aware selection. This is a critical implementation detail: stage-scoped retrieval without submodular selection is worse than no assembler at all.
Phase 2 PPR has marginal effect at 100K. The "Phase 1 only" (0.513) and "Phase 1+2" (0.602) configurations differ, but the improvement is primarily attributable to the interaction between PPR entity bridging and submodular selection rather than PPR alone.
The value is compositional. No single mechanism accounts for the improvement. The architecture requires all three ingredients: stage-scoping (to create the partition), submodular selection (to exploit the partition), and PPR bridging (to overcome the partition's limitations). Removing any one piece degrades the whole.
BEAM-10M ablation. We did not run the full ablation at 10M due to computational cost (~7 hours per configuration, 4 configurations = 28 hours). Based on the 100K results and the scale-dependent analysis, we expect Phase 2's contribution to be larger at 10M, where cross-stage bridging matters more and flat search fails harder.
The most striking result is that multi-session reasoning flips from -0.312 at 100K to +0.128 at 10M. This warrants detailed examination.
At 100K, the multi-session reasoning failure is instructive. A typical BEAM multi-session question is: "Based on our discussions across multiple plans, what is the user's overall approach to error handling?" At 94 memories spread across 5 plans, flat WRRF over the entire corpus can easily surface memories from 3-4 different plans in the top-5. The embedding space is sparse enough that memories from different plans about "error handling" are distinctly separated.
Stage-scoping restricts Phase 1 to the current plan (~19 memories), which by definition misses evidence from other plans. Phase 2 PPR can bridge to other plans via entity connections, but this requires (a) that the relevant entities appear in both the current plan's memories and other plans' memories, and (b) that the entity graph has sufficient connectivity. At 100K with a small entity graph, these conditions are often not met.
At 10M, the situation inverts. The same "error handling" query now competes with 7,500 memories, many of which discuss error handling tangentially. Flat WRRF's top-5 is dominated by hub memories -- the most generic error-handling discussions that have high average similarity to everything. The specific multi-plan evidence is buried below rank 20.
Phase 1 retrieves from the current plan's ~150 memories, and submodular selection ensures diverse coverage of error-handling subtopics within that plan. Phase 2 seeds PPR on the entities from Phase 1 results -- specific error codes, API names, library names -- and follows connections to memories in other plans that mention the same entities. These entity connections are structural (co-occurrence in the knowledge graph) rather than geometric (embedding similarity), so they are immune to the hubness and concentration problems that defeat flat retrieval.
The implication is clear: stage-aware retrieval should be gated by corpus density. At ρ < 1, use flat retrieval. At ρ > 1, enable the assembler. The optimal threshold likely depends on the embedding model and corpus characteristics.
Summarization is the only category that regresses at 10M (-0.036). This is a genuine trade-off inherent in stage-scoped retrieval, not a bug.
Summarization questions require broad coverage: "Summarize the user's main interests across all conversations." The gold answer draws from memories across many plans. Stage-scoped retrieval concentrates resources on one stage (Phase 1) and its immediate entity neighbors (Phase 2), systematically missing distant stages that contain relevant summarization evidence.
Phase 3 is designed to address this. Schema-structured stage summaries would provide exactly the breadth that summarization questions need: one compressed summary per stage, covering the entire conversation at the cost of detail. With Phase 3 wired, the summarization budget (10% of context) would contain summaries from 5-10 uncovered stages, potentially surfacing gold evidence that Phase 1 and Phase 2 miss.
We predict that wiring Phase 3 will eliminate the summarization regression. The CLS consolidation engine already generates summaries; only the retrieval callback needs plumbing. We leave this to the next experimental cycle.
LIGHT achieves 0.266 overall on BEAM-10M. We achieve 0.429 on retrieval-proxy MRR. These numbers are not directly comparable.
LIGHT's 0.266 is an end-to-end score: it reflects the reader model's ability to produce a correct answer given the retrieved context. Reader errors (hallucination, refusal, wrong extraction) lower the score even when retrieval is correct. Our 0.429 is retrieval-only: it reflects whether the gold memory appears in the retrieved set, regardless of what a reader would do with it.
A rough decomposition: if LIGHT's retrieval were perfect (MRR = 1.0), its end-to-end score would be bounded by the reader's accuracy on perfectly retrieved context. LIGHT uses Llama-4-Maverick (170B parameters), which is strong but not perfect on 10-ability memory questions. If we estimate the reader's accuracy at ~50% (reasonable for the harder abilities like event ordering and instruction following), then LIGHT's retrieval MRR might be around 0.53 -- in the same ballpark as our 0.429. This is speculative arithmetic, not a rigorous comparison.
Per-ability directional comparisons are more informative:
| Ability | Our Assembler | LIGHT | Who benefits |
|---|---|---|---|
| abstention | 0.350 | 0.750 | LIGHT (has LLM to decide abstention) |
| contradiction_res. | 0.725 | 0.050 | Assembler |
| event_ordering | 0.067 | 0.266 | LIGHT (scratchpad helps ordering) |
| instruction_follow. | 0.125 | 0.500 | LIGHT (LLM understands instructions) |
| knowledge_update | 0.892 | 0.375 | Assembler |
| multi_session_reas. | 0.543 | 0.000 | Assembler |
| preference_follow. | 0.481 | 0.483 | Tie |
| temporal_reas. | 0.467 | 0.075 | Assembler |
LIGHT dominates on abilities that benefit from LLM reasoning at answer time (abstention, instruction following, event ordering). Our assembler dominates on abilities that require better retrieval (contradiction resolution, knowledge update, multi-session reasoning, temporal reasoning). This suggests the two approaches are complementary: LIGHT's reader-side scratchpad + our retriever-side assembly could potentially outperform either alone.
We report three bugs discovered during implementation and benchmarking. We include these for transparency and because they represent failure modes that other implementors may encounter.
Bug 1: entity_ids field not populated in retrieval results.
The initial Phase 2 implementation returned zero cross-stage results.
Root cause: the memory dicts returned by the PostgreSQL retrieval
query did not include the entity_ids field (it was not in the
SELECT clause). The entity_ids data existed in the entity_memory
junction table but was not joined. Diagnosis: printing the seed
entities dict at the PPR input (always empty) and tracing backwards
to the retrieval query. Fix: adding a LEFT JOIN to the entity_memory
table and an ARRAY_AGG of entity IDs to the SELECT clause.
Bug 2: token budget constraining selection count. The initial submodular selection enforced the token budget as a hard constraint: if a candidate's token count would exceed the remaining budget, it was skipped. This caused the assembler to select 2-3 items instead of 5 when individual memories were long (~500 tokens each). Fewer items = fewer chances for retrieval hits = worse MRR. Fix: decouple selection (by count) from budget enforcement (at assembly time). The ContextDecomposer handles budget enforcement; the selector should maximize retrieval recall.
Bug 3: FlashRank silent failure. FlashRank's ONNX model occasionally fails to load (corrupted download, version mismatch) and returns 0.0 scores for all candidates without raising an exception. Both WRRF and assembler conditions were affected, producing random rankings. The bug was subtle because overall MRR appeared "reasonable" (random at ~0.05-0.10 is only slightly below some legitimate scores on hard abilities). Fix: a preflight check before each benchmark run that reranks a synthetic pair and asserts non-zero scores.
The structured context assembly architecture was designed before the BEAM benchmark existed. Because we report results on BEAM that exceed previously published numbers (with the caveat that our metric differs), it is important to document the chronological relationship between the architecture's development and the benchmark's publication. We do so with verifiable commit SHAs from public repositories.
| Date | Event | Repository | SHA |
|---|---|---|---|
| 2025-09-14 | Context-aware PRD generation (first concept) | ai-prd-builder (public) | 3ef6c3f |
| 2025-09-25 | Modular architecture with cognitive modes | ai-prd-builder (public) | 4f90564 |
| 2025-09-30 | ContextManager.swift -- per-section budget-split assembly | ai-prd-builder (public) | 462de01 |
| 2025-10-04 | Context-aware codebase interceptor | ai-prd-builder (public) | 0743b0e |
| 2025-10-31 | BEAM paper published (Tavakoli et al., arXiv v1) | arXiv | -- |
| 2025-07-10 | MIRIX published (Wang & Chen) | arXiv | 2507.07957 |
| 2026-02-10 | Hierarchical chunking + late chunking + compression | ai-prd-generator (public) | 8bc58f1 |
| 2026-02-27 | ContextDecomposer (priority ordering + truncation warnings) | ai-architect-prd-builder (private) | ba996810e3d7 |
| 2026-03-03 | StageAwareContextAssembler (3-phase 60/30/10) | ai-architect-prd-builder (private) | d4e2eb2540494 |
| 2026-03-16 | Object-centric context decomposition (typed slots) | ai-architect-prd-builder (private) | dc3c71d05d10 |
| 2026-04-07-09 | Python port to Cortex + BEAM benchmark integration | Cortex (public) | 5348f74 |
The ContextManager.swift module, committed on September 30, 2025
in the public ai-prd-builder repository, implements the following
features that directly ancestor the ContextDecomposer:
- Per-section token-budgeted context assembly with provider-specific limits (Apple Intelligence: 3,500 tokens, OpenAI: model-dependent)
- Slot-based budget splitting: 1/3 core request + 1/4 clarifications + 1/6 tech stack, with the remaining budget distributed to lower-priority sections
- Section-keyword relevance filtering: only sections whose keywords match the current task are included
- Truncation awareness: the model is told which sections were reduced and by how much, enabling it to request more information or flag uncertainty
The key design principle -- "don't try to fit everything; structure what goes in, prioritize it, and tell the model what was cut" -- was formulated for generating coherent 9-page PRDs on Apple Intelligence's 4,096-token context window. The PRD generator must produce documents with:
- Impact analysis (business value, risk assessment)
- Technical requirements (APIs, data models, security)
- Implementation plan (milestones, dependencies)
- Code examples (working Swift/Python/TypeScript)
- Verification reports (test plans, acceptance criteria)
- Jira ticket specifications
None of these sections fit in 4,096 tokens simultaneously, and the model must produce coherent cross-references between sections it cannot see at the same time. The ContextManager solves this by generating one section at a time, with priority-budgeted context from previously generated sections.
This same principle applies unchanged to 10M-token memory retrieval: the context window (whether 4,096 or 200,000 tokens) is always smaller than the relevant information, and structured assembly outperforms flat ranking.
The architecture evolved through three codebases:
-
ai-prd-builder (September 2025): ContextManager with fixed-section budget splitting. Public repository.
-
ai-architect-prd-builder (February-March 2026): ContextDecomposer with priority-driven progressive condensation, domain-aware condensers, and truncation warning injection. StageAwareContextAssembler with two-phase 60/30 (with planned 10% summary tier) retrieval. Private repository.
-
Cortex (April 2026): Python port adapted for Cortex's PostgreSQL-backed memory system, complemented with paper-backed mechanisms (HippoRAG PPR formulation, Krause & Guestrin submodular guarantee). Public repository.
The full combination has no direct precedent we are aware of in the 2024-2026 literature we surveyed across six research areas: computational neuroscience, mathematics (submodularity, dimensionality reduction), AI systems (RAG, long-context, agent memory), PhD theses on computational memory models, vector database engineering, and information theory.
The individual building blocks each have strong paper backing:
| Component | Paper | Year |
|---|---|---|
| Submodular coverage | Krause & Guestrin, JMLR | 2008 |
| MMR diversity | Carbonell & Goldstein, SIGIR | 1998 |
| Personalized PageRank | Brin & Page / Gutierrez et al. | 1998/2024 |
| Schema-structured summaries | Tse et al., Science | 2007 |
| Active retrieval | Wang & Chen (MIRIX) | 2025 |
| RRF fusion | Cormack et al. | 2009 |
The contribution is the composition: the specific way these pieces are combined into a two-primitive architecture that manages both retrieval composition (StageAwareContextAssembler) and prompt-level budgeting (ContextDecomposer).
Summarization regresses at 10M scale (-0.036). Stage-scoped retrieval focuses depth at the cost of breadth. Phase 3 (summary fallback) is designed to address this but is currently a stub. Wiring it to Cortex's CLS consolidation engine is the immediate next step. We predict elimination of the regression but have not verified experimentally.
Event ordering scores 0.067 for both conditions -- effectively random. This ability requires precise chronological sequencing across sessions, which is a temporal reasoning problem rather than a retrieval problem. ChronoRAG (Chen et al., 2025), which models temporal ordering explicitly in the retrieval pipeline, is implemented in Cortex but not wired through the assembler.
Entity extraction is rule-based: capitalized multi-word phrases, quoted strings, technical terms with camelCase or snake_case patterns. On BEAM's synthetic conversations (clear entity mentions, consistent naming), this works adequately. On real-world conversational data with informal phrasing ("that Redis thing from last week"), abbreviations ("the auth bug"), and implicit references ("the same approach"), extraction quality would degrade significantly. Upgrading to a lightweight NER model (SpaCy, fine-tuned token classifier) or LLM-based extraction is a natural improvement.
| Operation | WRRF | Assembler | Overhead |
|---|---|---|---|
| Per-conversation (10M) | 307s | 733s | 2.4x |
| Per-query (10M) | ~15s | ~37s | 2.4x |
The overhead comes from:
- Entity extraction at ingest time (already amortized; not in the query-time path)
- Per-query entity extraction from Phase 1 results
- PPR graph construction and iteration
- Submodular selection's O(n*k) greedy loop
Optimizations available but not implemented:
- PPR precomputation: compute once during consolidation, cache per entity pair
- Batched graph queries: single SQL query for all entity lookups instead of one per entity
- Lazy greedy acceleration (Minoux, 1978): skip candidates whose upper-bound marginal gain is below the current best, reducing submodular selection from O(n*k) to nearly O(n log k) in practice
All results are retrieval-proxy MRR. We have not evaluated whether improved retrieval translates to improved answer quality when a reader model consumes the assembled context. The ContextDecomposer's truncation warning, domain-aware condensers, and structured section labels are designed to help the reader, but this hypothesis is untested.
An LLM-as-judge evaluation using BEAM's nugget scoring protocol would enable direct comparison with LIGHT and other published systems. This requires an LLM inference budget (~200-500 USD for 196 questions x 3 nuggets x GPT-4-class judge) not available for this work.
The 60/30/10 split is a design heuristic inherited from the Swift PRD generator, not an optimized parameter. Systematic optimization (e.g., Bayesian optimization over (β₁, β₂, β₃) with BEAM-10M MRR as the objective) could improve results. The optimal split may also vary by query type: temporal queries may benefit from more Phase 2 allocation (temporal entities bridge stages); summarization queries from more Phase 3 allocation (summaries provide breadth).
Our experiments use ground-truth plan IDs from BEAM. Production deployment requires robust stage detection without explicit labels. The TemporalStageDetector (4-hour gap threshold) is reasonable but untested on BEAM. More sophisticated approaches -- semantic clustering (embedding similarity between consecutive messages), LLM-based topic-shift detection, or hybrid methods -- are defined as interfaces but lack empirical validation.
Our experiments use all-MiniLM-L6-v2 (384D). The JL lower bound analysis suggests that 768D or 1024D embeddings could push the geometric ceiling higher, reducing the need for structured assembly. However, larger models have higher latency and storage costs. An important open question: does the assembly architecture remain beneficial with state-of-the-art 1024D models (e.g., E5-Mistral, NV-Embed-2), or does the improved embedding space make flat retrieval sufficient even at 10M tokens?
The architecture was designed for and tested on conversational memory. Whether it generalizes to other long-context retrieval problems -- legal document search, medical record retrieval, codebase navigation -- is unknown. The stage-detection mechanism is generic (any partitioning of the corpus), but the specific condensers and entity extraction are tuned for conversation.
We presented a structured context assembly architecture for long-term memory retrieval that addresses the fundamental scaling limitation of dense vector retrieval: at sufficient corpus density, embedding-based nearest-neighbor search cannot discriminate relevant from irrelevant memories because the geometric structure of the embedding space degenerates.
The architecture introduces two composable primitives. The ContextDecomposer manages prompt-level budgeting with typed priority slots, domain-aware condensers, and an explicit truncation warning mechanism that communicates assembly decisions to the reader model -- a mechanism with no direct precedent we are aware of in the retrieval or prompt engineering literature. The StageAwareContextAssembler manages retrieval-level composition through three phases: own-stage retrieval with submodular diversity (Krause & Guestrin, 2008), cross-stage bridging via Personalized PageRank over the entity graph (Gutierrez et al., NeurIPS 2024), and summary fallback for uncovered stages (Tse et al., 2007).
The empirical results on BEAM validate the scale-dependent thesis: the architecture is net-flat at 100K tokens (+1.9%, within noise) and provides a +33.4% improvement at 10M tokens, with 8 of 10 memory abilities improving. The multi-session reasoning sign flip (from -0.312 at 100K to +0.128 at 10M) is the strongest evidence that the benefit is genuinely scale-dependent rather than incidental.
The architecture does not introduce novel retrieval primitives. Every building block has strong paper backing. The contribution is the composition: stage-aware two-phase assembly, domain-aware condensation, and model-facing truncation awareness, combined into a system that manages context at a level of abstraction above individual retrieval signals.
The insight that motivated this work -- that at scale, what enters context matters more than how well individual items are retrieved -- emerged from a practical constraint: generating coherent documents on a 4,096-token context window. That the same architectural principle applies to 10-million-token memory retrieval suggests it may be general: whenever the available context is smaller than the relevant information, structured assembly outperforms flat ranking.
The crossover point between flat retrieval and structured assembly is empirically located between 100K and 10M tokens for 384D embeddings. As embedding models improve and context windows grow, this crossover point will shift -- but the fundamental principle will remain: there will always be a scale at which the relevant information exceeds the available context, and at that scale, structured assembly will dominate.
Abraham, W. C. & Bear, M. F. (1996). Metaplasticity: the plasticity of synaptic plasticity. Trends in Neurosciences, 19(4), 126-130.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT), 217-235.
Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.
Carbonell, J. & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference, 335-336.
Chheda, D. et al. (2024). mem0: The memory layer for AI agents. arXiv preprint.
Collins, A. M. & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407-428.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference, 758-759.
Gutierrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). HippoRAG: Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems (NeurIPS).
Jina AI (2025). Late chunking: Contextual chunk embeddings using long-context embedding models. Technical Report.
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769-6781.
Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference, 39-48.
Krause, A. & Guestrin, C. (2008). Near-optimal sensor placements in Gaussian processes. Journal of Machine Learning Research, 9, 235-284.
Langville, A. N. & Meyer, C. D. (2005). A survey of eigenvector methods for web information retrieval. SIAM Review, 47(1), 135-161.
Larsen, K. G. & Nelson, J. (2017). Optimality of the Johnson-Lindenstrauss lemma. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 633-638.
Lin, H. & Bilmes, J. (2011). A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), 510-520.
Maharana, A. et al. (2024). LoCoMo: Long-context conversation memory benchmark. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL).
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3), 419-457.
Minoux, M. (1978). Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, Lecture Notes in Control and Information Sciences, vol. 7, 234-243.
Muennighoff, N. et al. (2023). MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014-2037.
Packer, C. et al. (2024). Letta: An operating system for AI agents with long-term memory. arXiv preprint.
Radovanovic, M., Nanopoulos, A., & Ivanovic, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11, 2487-2531.
Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3982-3992.
Sarthi, P. et al. (2024). RAPTOR: Recursive abstractive processing for tree-organized retrieval. In Proceedings of the International Conference on Learning Representations (ICLR).
Stensola, H., Stensola, T., Solstad, T., Froland, K., Moser, M.-B., & Moser, E. I. (2012). The entorhinal grid map is discretized. Nature, 492, 72-78.
Tavakoli, A. et al. (2026). BEAM: Beyond a million tokens -- benchmarking long-term memory in AI agents. In Proceedings of the International Conference on Learning Representations (ICLR).
Teyler, T. J. & Rudy, J. W. (2007). The hippocampal indexing theory and episodic memory: Updating the index. Hippocampus, 17(12), 1158-1169.
Thakur, N., Reimers, N., Ruckle, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Advances in Neural Information Processing Systems (NeurIPS).
Tse, D., Langston, R. F., Kakeyama, M., Bethus, I., Spooner, P. A., Wood, E. R., Witter, M. P., & Morris, R. G. M. (2007). Schemas and memory consolidation. Science, 316(5821), 76-82.
Wang, C. & Chen, J. (2025). MIRIX: Multi-agent memory system for LLM-based agents. arXiv preprint 2507.07957.
Wu, Y. et al. (2025). LongMemEval: Benchmarking long-term memory in AI assistants. In Proceedings of the International Conference on Learning Representations (ICLR).
Xu, B. et al. (2025). A-MEM: Agentic memory for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS).
Zhu, D. et al. (2024). LongEmbed: Extending embedding models for long context retrieval. arXiv preprint.
- PostgreSQL 17+ with pgvector and pg_trgm extensions
- Python 3.10+
- Cortex installed from source
git clone https://github.com/cdeust/Cortex.git && cd Cortex
pip install -e ".[postgresql,benchmarks]"# BEAM-100K baseline (WRRF only)
DATABASE_URL="postgresql://localhost:5432/cortex_bench" \
python benchmarks/beam/run_benchmark.py --split 100K
# BEAM-10M baseline (WRRF only)
DATABASE_URL="postgresql://localhost:5432/cortex_bench" \
python benchmarks/beam/run_benchmark.py --split 10M
# BEAM-10M with structured context assembly
CORTEX_USE_ASSEMBLER=1 \
DATABASE_URL="postgresql://localhost:5432/cortex_bench" \
python benchmarks/beam/run_benchmark.py --split 10M
# BEAM-100K with structured context assembly
CORTEX_USE_ASSEMBLER=1 \
DATABASE_URL="postgresql://localhost:5432/cortex_bench" \
python benchmarks/beam/run_benchmark.py --split 100K| Configuration | Conversations | Time per Conv | Total |
|---|---|---|---|
| BEAM-100K WRRF | 5 | ~32s | ~2.5 min |
| BEAM-100K Assembler | 5 | ~293s | ~24 min |
| BEAM-10M WRRF | 10 | ~307s | ~51 min |
| BEAM-10M Assembler | 10 | ~733s | ~122 min |
The benchmark harness creates a fresh cortex_bench database per
run. Between conversations within a run, all data tables are
truncated:
TRUNCATE memories, entities, relationships, entity_memory CASCADE;This ensures zero cross-conversation contamination. Each conversation starts with an empty database, ingests its turns, then evaluates its questions. The TRUNCATE approach (vs. DROP/CREATE) preserves the schema, extensions, and stored procedures, reducing per-conversation setup time.
The full variance analysis is run via:
bash benchmarks/beam/variance/bench_variance.shThis script runs each configuration 3 times, with PostgreSQL restart between runs, and reports mean and standard deviation of MRR. Observed variance: ± 0.01 MRR across runs.
| Ability | WRRF | No Submod | Ph1 Only | Ph1+Ph2 | Full |
|---|---|---|---|---|---|
| knowledge_update | 0.850 | 0.850 | 0.850 | 0.900 | 0.900 |
| temporal_reasoning | 0.950 | 0.900 | 0.900 | 0.920 | 0.920 |
| contradiction_res. | 0.817 | 0.600 | 0.650 | 0.783 | 0.783 |
| multi_session_reas. | 0.812 | 0.500 | 0.500 | 0.500 | 0.500 |
| event_ordering | 0.380 | 0.333 | 0.400 | 0.525 | 0.525 |
| instruction_follow. | 0.448 | 0.283 | 0.233 | 0.478 | 0.478 |
| preference_follow. | 0.442 | 0.300 | 0.250 | 0.442 | 0.442 |
| abstention | 0.400 | 0.400 | 0.400 | 0.400 | 0.400 |
| summarization | 0.271 | 0.250 | 0.250 | 0.370 | 0.370 |
| Overall | 0.591 | 0.512 | 0.513 | 0.602 | 0.602 |
-
WRRF: flat retrieval over the entire corpus, no assembler. Standard top-k selection after WRRF fusion + FlashRank reranking.
-
No Submod: assembler enabled with stage-scoping, but Phase 1 uses naive top-k selection instead of submodular coverage. This isolates the effect of stage-scoping without diversity.
-
Ph1 Only: assembler with submodular coverage selection in Phase 1, but Phase 2 (PPR traversal) disabled. Phase 1 results only.
-
Ph1+Ph2: assembler with submodular coverage in Phase 1 and PPR traversal in Phase 2. Phase 3 is a stub.
-
Full: same as Ph1+Ph2 in the current implementation (Phase 3 not yet wired to real summaries).
-
Stage-scoping without diversity hurts (-0.079 vs. WRRF). The No Submod configuration reduces the candidate pool to one stage's memories and then picks the top-k by score, which selects redundant memories. This is worse than flat top-k over the full corpus.
-
Submodular selection alone (Ph1 Only) is not sufficient (-0.078 vs. WRRF). Without Phase 2 PPR bridging, the assembler is limited to same-stage memories. For multi-session queries, this is actively harmful.
-
The combination of submodular + PPR recovers and slightly exceeds WRRF (+0.011). Neither mechanism alone helps at 100K; together, they produce a marginal improvement.
-
At 10M, we expect both effects to be amplified, based on the scale-dependent analysis in Section 5.3. The 10M ablation is planned but not yet completed due to computational cost.
The context assembly module resides in
mcp_server/core/context_assembly/ within the Cortex codebase:
| Module | Lines | Purpose | Paper Backing |
|---|---|---|---|
decomposer.py |
195 | Priority-budgeted prompt assembly | Novel |
stage_assembler.py |
320 | Three-phase stage-aware assembler | Novel composition |
coverage.py |
128 | Submodular coverage selection (Phase 1) | Krause & Guestrin 2008 |
ppr_traversal.py |
177 | Personalized PageRank (Phase 2) | Gutierrez et al. 2024 |
stage_detector.py |
196 | Pluggable stage detection | Novel interface |
condensers.py |
277 | Domain-aware content condensers | Novel |
budget.py |
123 | Token estimation and allocation | Ported from Swift |
warning.py |
62 | Truncation warning banner | Novel |
active_retrieval.py |
189 | Active query reformulation | Wang & Chen 2025 |
__init__.py |
-- | Public API exports | -- |
Total: ~1,667 lines of Python across 10 modules.
All modules are in the core/ layer (pure business logic, zero I/O).
This is enforced by Cortex's clean architecture:
core/modules import only fromshared/and Python stdlib- External dependencies (PostgreSQL queries, embedding generation, schema engine) are injected as callbacks at construction time
- Handlers (the composition root layer) wire
core/toinfrastructure/via dependency injection
This means the context assembly logic can be tested entirely in memory with mock callbacks, without a running database. The test suite includes 47 tests covering all modules.
The StageAwareContextAssembler receives five callbacks:
class StageAwareContextAssembler:
def __init__(
self,
*,
stage_detector: StageDetector,
retrieve_fn: Callable[[str, str, int], list[dict]],
entity_graph_fn: Callable[[], tuple[list, list]],
memories_by_entity_fn: Callable[[list[str]], list[dict]],
stage_summary_fn: Callable[[str], str],
) -> None: ...retrieve_fn(query, stage_id, max_results): Phase 1 retrieval. Wired to the WRRF pipeline with a stage filter.entity_graph_fn(): returns (entities, relationships) for PPR. Wired to PostgreSQL entity/relationship queries.memories_by_entity_fn(entity_ids): returns memories containing the given entities. Wired to the entity_memory junction table.stage_summary_fn(stage_id): returns a summary for the given stage. Currently a stub returning empty string.
This callback design ensures that the assembler is agnostic to the underlying storage and retrieval infrastructure. The same assembler code works with PostgreSQL (production), SQLite (testing), or in-memory dicts (unit tests).
For completeness, we include the full WRRF results on the BEAM-100K 20-conversation split (all conversations, not just the 5 used for A/B comparison):
| Ability | MRR | R@5 | R@10 | Qs |
|---|---|---|---|---|
| contradiction_resolution | 0.729 | 90.0% | 90.0% | 40 |
| temporal_reasoning | 0.684 | 75.0% | 80.0% | 40 |
| knowledge_update | 0.735 | 87.5% | 90.0% | 40 |
| multi_session_reasoning | 0.596 | 77.5% | 80.0% | 40 |
| event_ordering | 0.311 | 50.0% | 62.5% | 40 |
| summarization | 0.315 | 44.4% | 61.1% | 36 |
| preference_following | 0.309 | 52.5% | 60.0% | 39 |
| instruction_following | 0.219 | 27.5% | 42.5% | 40 |
| abstention | 0.125 | 12.5% | 12.5% | 40 |
| Overall | 0.438 | -- | -- | 395 |
This 20-conversation result (0.438 MRR) is lower than the 5-conversation result (0.591 MRR), reflecting variance across conversations. The 5 conversations selected for A/B comparison happen to be above-average in difficulty. We use the 5-conversation A/B for the assembler comparison because the assembler was only run on those 5 conversations at the 100K scale.
During development, several intermediate assembler configurations were benchmarked on BEAM-10M. These document the debugging process:
| Configuration | MRR | Issue |
|---|---|---|
| First assembler run (entity_ids bug) | 0.100 | Phase 2 returned 0 results; entity_ids not in SELECT |
| After entity_ids fix (stage detection bug) | 0.120 | Stage assignment used local turn IDs; wrong stage mapping |
| After stage fix (submodular + PPR correct) | 0.429 | Final result |
The jump from 0.120 to 0.429 upon fixing the stage detection confirms that correct stage assignment is critical to the architecture's performance. With wrong stage assignments, Phase 1 retrieves from the wrong stage (garbage in, garbage out), and Phase 2 seeds PPR on irrelevant entities.
The initial Phase 2 implementation consistently returned zero cross-stage memories. The debugging trace:
- Added logging: PPR returned non-zero scores for entities, but
memories_by_entity_fnreturned empty lists. - Inspected the PostgreSQL query:
entity_memoryjunction table had data, but the retrieval query's SELECT clause did not includeentity_ids. - The
entity_idsfield was populated at ingest time (via a separate INSERT intoentity_memory) but not JOINed at retrieval time. - Fix: add
LEFT JOIN entity_memory ON memory_idandARRAY_AGG(entity_id)to the retrieval query.
This bug is instructive: the Phase 2 PPR traversal is only as good as the entity annotations on the retrieved memories. If entity extraction or entity storage has bugs, Phase 2 degrades silently to zero contribution.
The second major issue: BEAM-10M turn IDs are plan-relative (each plan restarts numbering from 0), but gold annotations reference global turn indices. Without cumulative plan offsets, our stage detector assigned all memories to "plan-0" (because local turn ID 0 matched the first plan's memories, but the gold's global turn ID 7500 matched nothing).
Fix: compute cumulative turn offsets per plan and apply before gold matching. This is a data preprocessing issue specific to the BEAM dataset, not an architectural flaw.
function ASSEMBLE_PROMPT(template, placeholders, context_window, headroom):
budget = floor(context_window * headroom)
shell = substitute(template, {p.key: "" for p in placeholders})
shell_tokens = estimate_tokens(shell)
variable_budget = max(300, budget - shell_tokens)
// Fast path
total = sum(estimate_tokens(p.value) for p in placeholders)
if total <= variable_budget:
return fill_template(template, placeholders)
// Progressive condensation
sorted_ph = sort(placeholders, by=priority DESC) // least important first
remaining = variable_budget
effective = {}
for i, p in enumerate(sorted_ph):
orig_tokens = estimate_tokens(p.value)
not_yet = len(sorted_ph) - i
share = max(50, remaining // max(1, not_yet))
if orig_tokens <= share:
effective[p.key] = p.value
remaining -= orig_tokens
else:
if p.condenser is not None:
reduced = p.condenser(p.value, share)
else:
reduced = truncate_to_budget(p.value, share)
effective[p.key] = reduced
remaining -= min(estimate_tokens(reduced), remaining)
// Post-assembly safety
prompt = fill_template(template, effective)
while estimate_tokens(prompt) > context_window - SAFETY_MARGIN:
halve_lowest_priority(effective, sorted_ph)
prompt = fill_template(template, effective)
// Truncation warning
banner = build_truncation_banner(original_tokens, final_tokens)
if banner:
prompt = banner + "\n\n" + prompt
return prompt
function ASSEMBLE(query, current_stage, budget_split, max_chunks):
// Phase 1: Own-stage
candidates = retrieve(query, current_stage, max_chunks * 3)
phase1 = submodular_select(candidates, max_chunks, lambda=0.5)
// Phase 2: Cross-stage via PPR
seed_entities = extract_entities(phase1)
entities, relationships = entity_graph()
adjacency = build_adjacency(entities, relationships)
ppr = personalized_pagerank(adjacency, seed_entities, alpha=0.15)
top_entities = top_k(ppr, 50)
cross_candidates = memories_by_entities(top_entities)
cross_candidates = filter(cross_candidates, stage != current_stage)
phase2 = score_by_ppr(cross_candidates, ppr)[:max_chunks]
// Phase 3: Summary fallback
covered = {current_stage} union stages_of(phase2)
uncovered = all_stages - covered
phase3 = [stage_summary(s) for s in uncovered]
// Assembly
return structured_context(
own_stage=phase1,
adjacent=phase2,
summaries=phase3
)
function PERSONALIZED_PAGERANK(adjacency, seeds, alpha=0.15, max_iters=30, tol=1e-4):
// Normalize seeds to distribution
total = sum(seeds.values())
s = {k: v/total for k, v in seeds}
// Normalize outgoing edges to transition probabilities
M = {}
for node, edges in adjacency:
out_total = sum(w for _, w in edges)
if out_total > 0:
M[node] = [(n, w/out_total) for n, w in edges]
else:
M[node] = [] // dangling node
rank = copy(s)
for iter = 1 to max_iters:
new_rank = {k: alpha * v for k, v in s} // restart mass
for node, mass in rank:
if mass <= 0: continue
if M[node] is empty:
// Dangling: redistribute to seeds
for k, v in s:
new_rank[k] += (1 - alpha) * mass * v
else:
for neighbor, prob in M[node]:
new_rank[neighbor] += (1 - alpha) * mass * prob
delta = sum(|new_rank[k] - rank[k]| for k in union(new_rank, rank))
rank = new_rank
if delta < tol: break
return rank
@article{deust2026context,
title={Stage-Aware Context Assembly for
Long-Context Memory Retrieval},
author={Deust, Clement and {Claude Opus 4.6}},
year={2026},
url={https://github.com/cdeust/Cortex},
note={Repository: github.com/cdeust/Cortex, commit 5348f74}
}