Ideas

Active: Memory-integrated attention

1. Memory-augmented attention ← building now

Add an external Gram memory bank to a transformer with hybrid Plücker attention. Each attention head computes standard Q·K over the sequence plus scores each query against stored Gram matrices using the Plücker inner product. The query line from token i gets scored against the memory's stored lines — same p·(★q) operation the memory already uses. Output is a weighted combination of sequence values and memory-retrieved values.

This is a direct extension of hybrid v4 — instead of the Plücker bias coming only from other tokens in the sequence, it also comes from an external Gram matrix storing relational knowledge.

Why it works: The Plücker inner product is the same operation in both the memory system and the attention mechanism. No adapter needed.

2. Transversal retrieval as context injection

Given a prompt, use the generative mode (4 lines → 2 transversals) to retrieve related concepts from the Gram memory, then prepend them as additional context tokens. The transformer processes them alongside the input.

Simpler than option 1 — doesn't require modifying the attention mechanism at all. Just uses memory as a retrieval augmentation step before inference.

3. Online memory accumulation ★ WORKING — first geometry win

Status: implemented, PPL 206 vs standard 209

As the model processes text, it accumulates Plücker lines into Gram matrices (the M = Σ pᵢ⊗pᵢ operation) in real time. Later tokens can query these accumulated memories. Similar to Neural Turing Machines or modern Hopfield networks, but using geometric incidence instead of dot-product addressing.

Why it works: 36 numbers (6×6 Gram) compress the relational structure of the entire past sequence. Standard attention has no way to compute this — it sees individual tokens, not structural summaries. The Gram eigenvectors are emergent "principal relational axes" of the text.

Implementation: exp_mem_attn.py, OnlineMemoryAttention class. Memory-efficient version computes (B,H,T,T) Plücker incidence matrix (same footprint as standard attention) rather than materializing (B,T,H,6,6) Gram matrices.

3a. Dual-pathway incidence attention ← easiest to test

Instead of collapsing the (B,H,T,T) Plücker incidence matrix to a scalar via sum-of-squares, use it as a second attention pathway. Softmax over the incidence matrix and use it to route values, just like standard attention does. Two parallel pathways: Q·K dot-product attention + Plücker incidence attention, each producing weighted sums of values, combined with a learned gate.

This is the simplest change — the incidence matrix is already computed, we just need to softmax it and multiply by values instead of squaring and summing.

Why this should help: Currently the geometry produces a scalar "how related is my query to the past?" but throws away which past tokens are geometrically related. The attention pathway preserves this information.

3b. Longer sequences ← TESTED, peaks at 256

4-layer results (10 epochs from scratch):

Seq len	Standard	Online mem	Delta	%
128	~244	~244	~0	~0%
256	259.1	250.8	-8.3	3.2%
512	282.7	279.6	-3.1	1.1%

2-layer fast screening (10 epochs from scratch, d=128, 4 heads):

Seq len	Standard	Online mem	Delta	%
256	432.3	421.8	-10.5	2.4%

The advantage peaks at seq=256 then declines at 512. This confirms the 6D Gram saturation hypothesis: a rank-6, 21-independent-entry matrix can only encode so much structure. At 256 tokens it captures useful patterns; at 512 it's overwhelmed.

The 2-layer result at seq=256 confirms this isn't a fluke of the 4-layer architecture — geometry helps at both scales, though more at 4 layers (3.2% vs 2.4%) where the model has more capacity to refine the geometric signal.

This makes a strong case for higher-dimensional Plücker spaces (idea 4) or the multi-scale memory with separate fast/slow banks (idea 3f) for longer sequences.

3c. Tune decay parameter ← TESTED, decay doesn't matter

Status: tested, all decay values beat standard by ~3.3-3.5%

Sweep on fast 2-layer model (7 epochs each):

Decay (λ)	Best PPL	vs Standard (503.4)
0.95	486.0	-3.5% (best)
0.99	486.8	-3.3%
1.0	486.6	-3.3%
0.995	490.4	-2.6%

Lesson: The Gram memory mechanism is robust — decay is not a critical hyperparameter. The model benefits from geometric structure regardless of forgetting rate. λ=0.95 is marginally best but all values work.

Longer sequence results (seq=256, fast 2-layer):

Config	PPL	vs Standard (511.9)
λ=0.95	513.3	+0.3% (worse!)
λ=0.99	507.6	-0.8%

The memory advantage shrank from 3.5% at seq=128 to <1% at seq=256. The 6D Gram matrix (rank 6, 21 independent entries) likely saturates when compressing 200+ tokens of relational structure. This motivates both higher- dimensional Plücker spaces (idea 4) and amplifying memory (3d-ii below).

3d. Amplifying memory (λ > 1) with normalization

Motivation: The seq=256 results showed the memory advantage disappearing, and the decay sweep showed λ doesn't matter much in [0.95, 1.0]. The natural question: what happens above 1.0? If λ < 1 forgets old patterns and λ = 1 weights all patterns equally, then λ > 1 should amplify established patterns — the more a relational direction has been reinforced, the stronger it becomes.

This reframes the Gram matrix from "running average of relational structure" to "reinforcement signal for persistent patterns." The topic of a document is the relational structure that persists — recent tokens are noise until they reinforce the pattern. λ > 1 says "trust what's been confirmed."

Implementation: Applied as temporal weights on the incidence matrix (no loop needed): score_t = Σ_{s<t} λ^(t-s) · incidence(t,s)², with max-normalization per position when λ > 1 to prevent unbounded scores.

Status: tested, λ=1.05 hurts at both seq lengths

λ	PPL (seq=128)	vs Std (503.4)	PPL (seq=256)	vs Std (511.9)
0.95	486.0	-3.5%	513.3	+0.3%
0.99	486.8	-3.3%	507.6	-0.8%
1.0	486.6	-3.3%	—	—
1.05	496.8	-1.3%	513.3	+0.3%

Lesson: Amplification doesn't help. λ=0.95 and λ=1.05 fail at seq=256 for opposite reasons — fast forgetting loses long-range structure, while amplification over-weights old patterns and normalization washes out the signal. λ=0.99 (half-life ≈ 69 tokens) is the sweet spot, but the advantage still shrinks from 3.3% → 0.8% going 128 → 256 tokens. The real bottleneck is the 6D Gram space saturating, not the temporal weighting.

3e. Triadic-seeded Gram memory ← TESTED, doesn't help

Status: fast screening shows +2% improvement (494.8 → 485.2 PPL, 2-layer)

Two-tier memory: online Gram accumulation (within-sequence) + triadic associative memory (cross-sequence persistent storage). At end of each 128-token sequence, the accumulated Gram is encoded as an SDR via random projection and stored as (context_SDR, layer_SDR, gram_SDR) in triadic memory. At start of next sequence, context from first 16 token embeddings is used to recall the closest matching Gram, which seeds M₀.

Implementation: exp_triadic_seed.py

Fast screening (2-layer, 7 epochs, from scratch):

Online mem (no seed): PPL 494.8
Triadic-seeded: PPL 485.2 (2.0% better)
Seeded vs non-seeded eval: identical (485.2 both)

Full model (4-layer, 10 epochs, from scratch, bs=64):

Online mem (no seed): PPL 242.7
Triadic-seeded: PPL 244.0 (0.5% worse)
Seeded vs non-seeded eval: identical (244.0 vs 244.2)
Overhead: 92s → 128s/epoch as triadic memory grew to 11,640 triples

Key findings:

Seeding helped the 2-layer model but hurt the 4-layer model
The 4-layer model is expressive enough to learn its own structural priors
Batch-averaged context + Gram is too coarse — all topics blur together
Seeded eval never helps — the learned weights already capture the signal

Per-sequence storage test (2-layer, 7 epochs, from scratch):

Baseline: PPL 492.7
Seeded (per-seq, 25% subsample): PPL tracking ~5% better per epoch BUT epoch times explode (72s → 454s at 28k triples) as triadic memory grows. Per-sequence recall does 128×n_layers queries per batch.
Per-seq storage is impractical without batched triadic queries.

Optimization ideas to test:

Per-sequence storage (not batch-averaged): current approach averages context across 128 sequences in a batch, losing topic signal. Store each sequence individually for sharper topic recall. Risk: memory grows 128x. Mitigation: subsample 25% of sequences.
Contiguous sequence ordering: WikiText is article-structured. Process sequences in order (no shuffle) so consecutive sequences share topic. Triadic recall should be strongest when the context actually repeats.
Multi-scale write lines: Grassmann Flows uses offsets {1,2,4,8,12,16} instead of just bigrams (offset=1). This gives the Gram richer structure. Easy to add — just loop over offsets in the write line computation.
Larger SDR space: N=10000, P=100 for better reconstruction fidelity (current cos=0.795 is lossy). Trade-off: slower triadic query.
Seed at inference too: Instead of training-only benefit, build triadic memory from training data, then seed during validation. Need separate context computation (first 16 tokens of val sequences).
Gradient through seed: Currently frozen retrieval. Use straight-through estimator to let gradients flow through the recalled Gram → teach the model to use the seed signal more effectively.
Contrastive seed loss: Add auxiliary loss that encourages the model's online Gram at position T to be close to the recalled seed Gram. This regularizes the Gram space to be consistent across related sequences.
Short-sequence test (seq=32) ← TESTED, no benefit: At seq=32 the model barely builds M before the sequence ends, so a recalled Gram seed should matter most here. Result: PPL 265.8 seeded vs 266.4 baseline (-0.2%, noise). Even with minimal tokens to build structure, the model adapts fast enough that cross-sequence memory adds no value. This effectively closes the triadic seeding approach — it doesn't help at any sequence length (32, 128) or model size (2, 4 layers). The SDR encode/decode pipeline is too lossy and the model's learned projections already handle cold-start efficiently.

3f. Multi-scale memory ← TESTED, promising at seq=256

Dual decay: Half heads at λ=0.95 (sentence), half at λ=0.999 (document). Learned decay: Each head learns its own λ via sigmoid(param).

Rapid screening (2-layer, 10% data, ROCm):

seq=128:

Model	vs Standard
Online mem (λ=0.99)	-0.6%
Dual decay (0.95/0.999)	-0.6%
Learned decay	+1.0% (no benefit)

seq=256:

Model	vs Standard
Online mem (λ=0.99)	-1.7%
Dual decay (0.95/0.999)	-3.2%
Learned decay	-3.5%

seq=512: Too underfit with 10% data to draw conclusions.

Comprehensive variant sweep at seq=256, rapid (10% data, 2-layer):

Rank	Variant	vs Standard	Key idea
1	learned_decay	-3.5%	Learned λ per head + additive gate
2	dual_decay	-3.2%	Fixed fast/slow λ
3	residual_gram	-2.8%	Multiplicative (1+g)*out
4	multi_write	-2.3%	2 write projection pairs
5	trigram	-1.8%	Trigram write context
6	online_mem	-1.7%	Single λ=0.99
7	abs_inc	-1.4%
8	inc_route	-0.5%	Geometry routes values
9	learned_power	-0.2%	Learned exponent p
10	inc_bias	+0.8%	Geometry biases attention
11	resid_learned	+4.2%	Residual + learned (destructive!)

Key findings from variant sweep:

Additive scalar gate wins: Geometry as separate gated signal > routing > bias
incidence² is optimal: Learned power and |incidence| don't improve
Learned per-head decay is the strongest single improvement
Multiplicative + learned decay is destructive: Each works alone but their combination creates unstable gradient interactions
Wider write context marginal: Trigrams and multi-write help modestly
J6 matters at seq=256: Learned decay with identity gets -1.4% vs -3.5% with J6. At seq=128 J6 doesn't matter, but at seq=256 the geometric structure becomes load-bearing.

Head count scaling (d=192, 2 layers, seq=256, rapid 10% data):

Heads	d_head	Standard	Learned decay	Delta
4	32*	~3184	~3074	-3.5%
6	32	1730.8	1676.6	-3.1%
8	24	1777.5	1687.5	-5.1%
12	16	1748.7	1693.9	-3.1%

*d=128 fast model

8 heads is the sweet spot. At 12 heads (d_head=16), attention quality degrades and the advantage shrinks. The optimal regime balances d_head≥24 for attention quality with enough heads for timescale diversity.

Model width scaling (2 layers, 8 heads, seq=256, rapid 10% data):

d_model	Standard	Learned decay	Delta
128*	~3184	~3074	-3.5%
192	1777.5	1687.5	-5.1%
256	1318.3	1232.5	-6.5%
384	1026.9	964.3	-6.1%

*4 heads (fast model)

The geometry advantage grows with width and plateaus at ~6%. Wider models exploit the Gram's complementary signal better, peaking at d=256 (-6.5%) and holding at d=384 (-6.1%). The plateau makes sense: the 6×6 Gram has 21 independent parameters — its information capacity is finite.

This confirms the Gram provides genuinely useful structural information that standard attention alone cannot compute. The ~6% ceiling corresponds to the geometric capacity of a rank-6 symmetric matrix.

Depth scaling (d=192, 8 heads):

Layers	Standard	Learned decay	Delta
2	1777.5	1687.5	-5.1%
4	1731.3	1673.3	-3.3%

Deeper models show smaller advantage — they already capture structural information through their multiple attention layers.

Needs full-data verification at seq=256.

3g. Gram eigenstructure as features ← TESTED, strong signal

Instead of scoring read lines against the full Gram matrix, extract the top-k eigenvectors of M_t and use them as additional "memory keys" that the standard attention can attend to. This exposes the principal relational axes directly to the attention mechanism.

exp_sparse_gram.py results (synthetic induction head task, 2-layer, vocab=64):

Variant	Final Acc	Description
Standard	10.1%	Vanilla Q·K attention
Gram bias (full incidence)	24.8%	Q·K + additive
Eigen bias (eigenstructure)	81.1%	Q·K + low-rank eigenprojection bias (O(T·k))

The eigen_bias variant hits a phase transition around step 1000 where accuracy jumps from 7% to 22%, then climbs steadily to 81%. The eigendecomposition provides a dramatically better signal than raw incidence — the low-rank approximation acts as a denoiser, filtering out irrelevant geometric interactions and exposing the dominant relational axes.

Key insight: the full incidence matrix (gram_bias) only reaches 24.8% despite having strictly more information than the eigen approximation. The eigendecomposition's rank constraint forces the model to attend to the principal structure rather than being distracted by noise.

Also tested: Gram MLP readout (exp_fast.py, GramMLPAttention) — extracts upper triangle of Gram (21 features for 6×6), projects through MLP to d_head vector per head. 2-layer WikiText PPL 495.1 vs standard 504.2 vs scalar gate 492.7. The MLP readout beats standard but loses to the simpler scalar gate, suggesting the bottleneck is the read interface structure (eigen basis vs learned MLP), not just the dimensionality of the output.

WikiText-2 result (exp_fast.py, EigenBiasAttention, 2-layer from scratch): PPL 501.3 — barely beats standard (504.2), loses to scalar gate (492.7). The eigen bias does NOT transfer from synthetic to real tokens. On synthetic induction data the Gram has clean repeated structure that eigenvectors capture; on real text the Gram is noisy and the top-3 eigenvectors don't isolate the relational structure that matters for next-token prediction. The global Gram eigenvectors solve a different problem (principal axes of the whole sequence) than what the LM needs (position-specific routing). The scalar gate remains the best geometry approach for real language modeling.

3h. Multi-scale write lines (from Grassmann Flows) ← TESTED, marginal

Instead of only pairing consecutive tokens (offset=1), pair at offsets {1, 2, 4, 8}. Each offset produces write lines that capture structure at different timescales. All offsets accumulate into the same Gram. This is the one useful idea from the Grassmann Flows paper (Dec 2025).

Implementation: exp_fast.py, MultiScaleMemoryAttention class.

Results (from scratch):

Model	2-layer PPL	4-layer PPL
Standard	504.2	~244
Online mem (offset=1)	492.7	243.7
Multi-scale ({1,2,4,8})	489.1	243.4

Finding: Multi-scale helps at 2 layers (-3.6 PPL) but is negligible at 4 layers (-0.3 PPL). The deeper model already captures multi-scale structure through its multiple attention layers. Not worth the ~18% computational overhead for the full model.

Overhead: ~8s/epoch (2-layer), ~16s/epoch (4-layer) — modest.

3i. Dual-pathway incidence attention ← TESTED, negative

Status: tested, PPL 372 from scratch (1.8x worse than standard)

Used the (B,H,T,T) Plücker incidence matrix as a second attention pathway alongside standard Q·K. Each pathway softmaxes its logits and routes values independently, combined with a learned gate: (1-g)*std_out + g*geo_out.

Results: from scratch PPL 372 vs standard ~206 (1.8x worse). From checkpoint init PPL 209 (matches baseline but geometry isn't helping — the gate likely learns to suppress the geometric pathway).

Why it failed: When geometry competes with standard attention for value routing, it loses. Standard Q·K is a much better attention pattern for LM. The successful approach (online memory scalar gate) works because it ADDS geometric information on top of standard attention rather than replacing any part of the attention computation.

Lesson: Geometry should augment, not compete. Scalar gating > dual pathway.

3i-b. Systematic Gram memory sweep (50% data, from scratch)

Status: completed March 15, 2026

Fast 2-layer model (d=128, 4 heads):

Config	Standard PPL	Online Mem PPL	Gap
seq=64	424.0	423.8	-0.05% (tied)
seq=128	524.3	505.3	-3.6%
seq=256	535.6	528.6	-1.3%
decay=0.95, seq=128	(524.3)	511.7	-2.4%
decay=0.999, seq=128	519.2	513.0	-1.2%
multi_scale, seq=128	(524.3)	517.2	-1.4%

Depth sweep (50% data, seq=128, 15 epochs, from scratch):

Layers	d_model	Standard PPL	Online Mem PPL	Gap
1	96	695.6	625.5	-10.1%
2	128	524.3	505.3	-3.6%
3	160	457.9	453.9	-0.9%
4	192	418.1	417.7	-0.1%

Depth × sequence interaction (1-layer d=96, 50% data):

Layers	seq=128 Gap	seq=256 Gap
1	-10.1% (695.6→625.5)	-7.3% (682.1→632.2)
2	-3.6% (524.3→505.3)	-1.3% (535.6→528.6)
4	-0.1% (418.1→417.7)	—

Key insight: Geometry benefit is inversely proportional to model depth. The Gram memory provides structural information that deeper models learn to compute internally through multiple attention layers. A 1-layer model gets a massive 10% boost because it has no way to build multi-hop patterns without the explicit geometric memory. This suggests the Gram memory is acting as a "cheap extra layer" that captures relational structure the attention mechanism would otherwise need depth to learn.

The seq=256 penalty (10.1%→7.3% for 1-layer) persists even at high capacity deficit, confirming the 6D Gram saturation is intrinsic to the geometric representation, not model capacity.

Data scaling (2-layer d=128, seq=128):

Data fraction	Standard PPL	Online Mem PPL	Gap
50%	524.3	505.3	-3.6%
100%	330.0	326.6	-1.0%

More data reduces geometry advantage too — the model learns internally what the Gram was providing. The geometry benefit is largest in the low-capacity, low-data regime (shallow model, limited data).

Width vs depth: geometry benefit depends on DEPTH, not total capacity:

Config	Params	Standard	Online Mem	Gap
2L d=256	14.5M	408.0	394.8	-3.2%
4L d=192	11.5M	418.1	417.7	-0.1%

The 2L d=256 model has 27% more parameters than 4L d=192, yet geometry still helps 3.2%. This proves the Gram memory compensates specifically for sequential composition (multi-hop patterns across layers), not raw parameter count. A wide-and-shallow model can't compute multi-step relational patterns through depth, so the geometric summary fills that gap.

Findings:

Geometry advantage strongest at seq=128 with 2-layer model (3.6%)
At seq=64, too few tokens for Gram to accumulate useful structure
At seq=256, 6D Gram saturates (consistent with prior results)
4-layer model absorbs geometry benefit — more layers = less need for explicit Gram memory
decay=0.99 > decay=0.95 > decay=0.999 (medium forgetting is optimal)
multi_scale worse than single-scale online_mem at 2 layers
Baselines vary ±5 PPL across runs (random init) — relative gap is the stable metric

3j. Online associative memory (SDR-based) ← SHELVED

Status: shelved — indicator memory recall returns uniform distribution

Instead of seeding M₀ once at sequence start (which doesn't help at any sequence length or model size), give the model a persistent associative memory that it reads/writes at every position during the forward pass.

Architecture:

Keys: Plücker write lines (6D per head, flattened to 36D per position)
Values: hidden states (d_model per position)
Storage: write every 8 positions during training (controls memory growth)
Retrieval: batched GPU queries at every position via matrix multiply against indicator matrices. All B×T queries computed in one matmul.
Integration: recalled values → learned linear projection → scalar gate (starts small at 0.01, model learns how much to trust the memory)

Key differences from triadic seeding (3e):

Query at every position, not just sequence start
Keys are geometric (write lines), not semantic (context embeddings)
Values are hidden states, not Gram matrices
Batched GPU queries instead of one-at-a-time CPU queries
The memory is truly associative — "what have I seen before that looks geometrically like what I'm seeing now?"

Implementation: exp_assoc_mem.py

Concern: double forward pass per batch (one no-grad to get write lines for queries, then the real forward with recalled values). This doubles compute when reading is active. Could optimize by caching write lines from the previous batch instead.

3k. Iterated Gram (M² 2-hop matching) ← TESTED, negligible

Status: tested March 23, 2026 — no benefit over standard Gram

Instead of rd·M·rd (standard), interpolate with rd·M²·rd (2-hop matching). M² captures degree-6 interactions: two writes that share structure with a common third write both contribute. Learned interpolation parameter α between M and M² via sigmoid.

Implementation: exp_fast.py IteratedGramAttention, exp_new_ideas.py.

Results (2-layer, d=128, 4 heads, synthetic bigram data, 10 epochs):

Model	PPL	Δ vs standard
Standard	190.6	←baseline
Online mem	190.3	-0.2%
Iterated Gram	190.6	-0.0%

Finding: M² adds no value. The 2-hop relational matching doesn't capture structure that M alone misses. The learned α likely stays near 0 (pure M). This is consistent with the finding that the information bottleneck is the model's ability to USE geometric signal, not the Gram's ability to encode it.

3l. Learned Gram transition (SSM-style) ← TESTED, marginal

Status: tested March 23, 2026 — ties online_mem, 2x slower

Replace scalar decay M_t = λ·M_{t-1} + w⊗w with learned 6×6 transition matrix: M_t = A·M_{t-1}·A^T + w_t⊗w_t. Allows the Gram to rotate/mix its relational structure at each step (like a linear SSM on the Gram). Initialized as 0.99·I (near scalar decay).

Implementation: exp_fast.py LearnedTransitionAttention, exp_new_ideas.py.

Results (2-layer, d=128, 4 heads, synthetic bigram data, 10 epochs):

Model	PPL	Δ vs standard	Time/epoch
Standard	190.6	←baseline	~14s
Online mem	190.3	-0.2%	~18s
Learned transition	190.2	-0.2%	~36s

Finding: Ties online_mem but is 2x slower due to the sequential scan (can't be parallelized via cumsum like scalar decay). The 6×6 transition matrix has 36 parameters per head (vs 1 scalar decay), but doesn't learn anything more useful. A likely explanation: on this data, scalar decay already captures the temporal weighting well — there's no rotational structure in the bigram-generated Gram that A can exploit.

Lesson: Sequential scans are expensive. Only worth it if they provide qualitatively different behavior (like Mamba's input-dependent transitions).

3m. Separate read/write Grams ← TESTED, marginal

Status: tested March 23, 2026 — ties online_mem

Maintains two Gram matrices:

M_write = Σ Jw⊗Jw (write structure, queried by read lines)
M_read = Σ Jr⊗Jr (read structure, queried by write lines)

Score = (1-α)·rd·M_write·rd + α·Jw·M_read·Jw, where α is learned. The second term is "backward association" — what past reads match my write?

Implementation: exp_fast.py SeparateRWGramAttention, exp_new_ideas.py.

Results (2-layer, d=128, 4 heads, synthetic bigram data, 10 epochs):

Model	PPL	Δ vs standard
Standard	190.6	←baseline
Online mem	190.3	-0.2%
Separate R/W	190.3	-0.2%

Finding: Ties online_mem. The read Gram (backward association) doesn't add value — likely because in a causal LM, past reads have no privileged information about the future. The model probably learns α≈0 (all weight on write Gram = standard behavior).

Summary of 3k-3m

All three new Gram variants (iterated, learned transition, separate R/W) fail to improve over the simple online_mem baseline. This strengthens the key finding from the project: the bottleneck is not in the Gram's expressiveness but in the model's ability to use the geometric signal. Richer Gram representations (M², SSM transitions, dual Grams) don't help because the scalar gating mechanism can't extract more from them.

Note: tested on synthetic bigram data due to network restrictions. Should be re-verified on WikiText-2 where the baseline gap is larger. However, the relative rankings should be stable.

4. Higher Grassmannian attention ← TESTED, marginal

Status: tested March 15, 2026 — higher dims barely help

Tested G(2,4)→G(2,5)→G(2,6) on 1-layer model (where geometry matters most):

seq=128 (baseline standard: 695.6):

Gram	Plücker dim	Indep. entries	PPL	Gap
G(2,4)	6D	21	625.5	-10.1%
G(2,5)	10D	55	616.3	-11.4%
G(2,6)	15D	105	615.3	-11.5%

seq=256 (baseline standard: 682.1):

Gram	Plücker dim	Indep. entries	PPL	Gap
G(2,4)	6D	21	632.2	-7.3%
G(2,5)	10D	55	630.5	-7.6%
G(2,6)	15D	105	629.5	-7.7%

Key finding: 5x more Gram capacity (21→105 entries) adds only ~1.4% at seq=128 and ~0.4% at seq=256. The seq degradation (128→256) stays nearly identical across all dimensions.

Conclusion: Saturation is NOT the bottleneck. The 6D Gram already captures most useful geometric structure. The information bottleneck is the model's ability to USE the geometric signal, not the Gram's ability to ENCODE it. Higher Grassmannians add marginal value.

Fast 2-layer screening (10 epochs, from scratch, ROCm AMD 8060S):

Model	Best PPL	Params	J matrix
Standard (no geometry)	403.9	6,846,080	—
Online mem G(2,4)	400.5	6,896,528	J6 (Hodge)
Online mem G(2,6)	403.7	6,904,720	Identity

| Online mem G(2,4) no-J6 | 400.8 | 6,896,528 | Identity |

Finding: J6 is NOT the key ingredient (400.5 vs 400.8, noise-level difference). The exterior product structure itself provides the geometric signal — the specific bilinear form (J6 vs identity) is irrelevant.

However, G(2,6) with 15D coordinates (403.7) matches standard (403.9) and loses the G(2,4) advantage. Higher dimensionality actively hurts.

Why G(2,4) works but G(2,6) doesn't: The 4D→6D exterior product is a compact nonlinear feature map (quadratic interactions of 4 coordinates = 6 features). Going to 6D→15D creates 15 features from 6 coordinates, but the extra dimensions are redundant — the model only has 4 heads with 32 dims each to learn structure. The 6D sweet spot balances expressivity with learnability.

Full sweep (fast 2-layer, 10 epochs, ROCm):

Model	Best PPL	Plücker dim
Standard	403.9	—
G(2,4) + J6	400.5	6D
G(2,4) + identity	400.8	6D
G(2,5) + identity	401.0	10D
G(2,6) + identity	403.7	15D

Lesson: G(2,4) is the sweet spot. Higher Grassmannians dilute the signal. The geometry's value is in the exterior product as a compact nonlinear feature map (4 coords → 6 features), not in the Plücker inner product or higher-dimensional Grassmannian structure.

Active: Retrieval and reasoning

5. Few-shot relational reasoning ← TESTED, negative with neural embeddings

Tested with 384D MiniLM sentence embeddings across 5 relations (country→capital, animal→sound, material→product, country→language, profession→tool), 25 pairs each.

Results: geometry loses at ALL K values. Best geometry (Gram^0.05 on PCA-16D) reaches p@10=0.84 vs cosine centroid p@10=0.91. Offset-based Gram (encoding relational transformation tgt-src) is even worse (p@10=0.43).

Root cause: 384D neural embeddings already capture relational semantics so well that cosine similarity achieves >0.9 precision. Geometry can only lose information through the 384D→4D projection. The K≥10 crossover from the word association benchmark only occurs with weak embeddings (32D PPMI+SVD).

The geometry's niche is weak embeddings + medium data. When embeddings are information-poor (statistical, low-D, non-neural), Plücker projection extracts complementary structure. With modern sentence transformers, it's strictly dominated.

6. Compositional transversal chains

Chain multiple transversal operations: output of one query becomes input to the next. This is the geometric analogue of multi-hop reasoning. Already demonstrated with music→bassist→multitrack→arpeggio chains.

Could be formalized as a graph traversal where each step uses the multi-transversal generation to propose next nodes, and the Gram energy scoring to filter.

7. Cross-word geometry transfer

The Gram matrix comparison (e.g., dog/cat similarity = 0.86) captures relational structure. Can this be used for zero-shot transfer of associate patterns between related source words? If dog and cat have similar Gram matrices, can we use dog's relational structure to predict cat's associates?

Plücker attention LM experiment results

Results summary

Variant	Architecture	Best PPL	vs Standard
Standard	Q·K dot product	209	baseline
Online memory	Q·K + causal Gram accumulation	206	0.99x (wins!)
Hybrid v4	Q·K + learnable Plücker bias	207	1.00x (tied)
Bigram v3	Token-pair query lines, single-token key lines	215	1.03x worse
Kernel v2	Asymmetric Q/K line projections	252	1.21x worse
Original v1	Symmetric -log	incidence

Key findings

Plücker geometry cannot replace dot-product attention but can augment it
Online Gram accumulation is the first variant to beat standard attention (PPL 206 vs 209)
- Causal M_t = Σ_{s<t} decay^{t-s} · p_s⊗p_s creates differentiable growing memory
- Only 56,652 memory-specific parameters (0.5% overhead)
- Bigram write lines + separate read projections + sigmoid gating
Degree-4 interactions help when they encode context (bigram > kernel)
The original failure was architectural (symmetry, -log|.|, expressivity), not mathematical
Learned Plücker projections ≈ random projections for retrieval (geometry is decorative for discriminative tasks)
Hybrid approach matches standard — geometric incidence captures complementary structure

What went wrong (v1) and fixes

Symmetric Q/K: attn(i,j) = attn(j,i) → fixed with separate projections
-log|incidence| scoring: spiky gradients → fixed with signed inner product
Single-token lines: too little expressivity → fixed with bigram encoding

Resolved questions

Does higher Grassmannian help the RRF ensemble? No. G(2,4) through G(2,17) all add zero unique information when full 32D embedding covariance is exploited.
Can Plücker incidence serve as a continuous similarity signal? No. Quadratic in projected embeddings, dominated by embedding covariance (Mahalanobis).
Can an optimized Gram^0.05 multi-seed ensemble add to the RRF? No. Despite reaching p@10=0.1065 standalone (9.7x improvement), it's entirely subsumed by embedding-space covariance signals.
Signal pruning? Yes. Removing redundant signals improved p@10 from 0.123 to 0.128 (+4%). Fewer signals = less dilution in rank fusion.
Mahalanobis-cent is the strongest signal (-0.003 when ablated).
Alternative fusion methods? All degrade vs standard RRF with uniform K.

Geometry beats embeddings at K≥10

Multi-seed Gram^0.05 outperforms all embedding methods when given ≥10 seed associates. Crossover at K≈7-10 because 6D Plücker needs fewer samples to estimate structure than 32D Mahalanobis. This is the system's unique advantage.

Unified associative database (TDGA + transversal geometry)

Working prototype in associative_db.py. Combines TDGA's exact triadic recall with multi-seed Gram^0.05 geometric similarity and generative expansion.

Demonstrated

Exact triadic recall: "Paris capital_of ?" → France (50/50 overlap)
Geometric subject similarity: France↔Germany 0.54 (shared relational patterns)
Generative expansion: Paris → Spain, French, Germany, Italy (consistent patterns)
Cosine search: standard semantic retrieval over stored facts

Directions

Few-shot relational reasoning ← NEGATIVE with neural embeddings Tested: geometry loses at all K values when using 384D MiniLM embeddings. Cosine centroid p@10=0.91 vs best geometry p@10=0.84. The K≥10 crossover only applies with weak (32D PPMI+SVD) embeddings. See §5 above.

Knowledge graph completion The expansion operation scores unstored (subject, object) pairs by geometric consistency. This is link prediction without training a GNN — score candidate triples against accumulated Gram signatures.

Memory layer for LLM agents Triadic recall is O(P²·N) regardless of corpus size. An agent stores facts as (entity, relation, value) triples with constant-time recall. Geometry adds "what else might be true?" for reasoning beyond stored facts.

Incremental knowledge accumulation Both layers support one-shot storage, no retraining. Geometric signatures become more discriminative as more facts arrive (more lines → richer Gram).

Anomaly / novelty detection Score new candidate facts against existing geometric patterns. Low score against all subject signatures = genuinely novel fact. High score but unstored = plausible inference.

Limitations

Subjects with <2 facts get zero geometric signatures (need ≥2 lines for Gram)
Requires TDGA checkpoint + sentence-transformers (external dependencies)
384D→4D projection is lossy; geometry works because ensemble (50 seeds) recovers info

Active: Geometric induction head hypothesis

Background (Olsson et al., 2022)

Induction heads are the 2-head circuit in transformers that implement in-context pattern completion: find a previous occurrence of the current token, then predict that the next token will be whatever followed that token before. Key facts:

Induction heads require at least 2 layers (one head to match, one to copy)
A 1-layer model structurally cannot form induction heads
During training, induction heads appear in a sharp phase transition where most in-context learning ability is acquired simultaneously
They implement "fuzzy" pattern matching — not just exact token copies but abstract sequence completion (N-gram and concept-level induction)

Connection to Gram memory

The depth scaling result from §3i-b is the key evidence:

Layers	Geometry gap
1	-10.1%
2	-3.6%
3	-0.9%
4	-0.1%

Hypothesis: The online Gram memory provides a continuous, geometric analogue of induction head behavior. A 1-layer model cannot form induction heads, so the Gram memory fills that gap — providing multi-hop relational composition that normally requires depth. Deeper models grow their own induction circuits and the Gram becomes redundant.

The Gram memory operates at a higher abstraction level than token-level induction: it accumulates relational structure between bigrams (Plücker lines from token pairs), not token identities. This makes it closer to concept-level induction heads than standard prefix-matching ones.

Testable predictions

Selective benefit on induction positions: Gram memory should reduce loss specifically on tokens where the correct prediction could be made by pattern- matching against earlier context ("induction positions"), not uniformly.
Training dynamics: The 1-layer+Gram model should skip or smooth the loss bump that normally corresponds to induction head formation during training.
Scaling: If Gram memory genuinely substitutes for induction heads, the benefit should scale — not just help tiny models but any model that is induction-head-limited (undertrained, shallow, or handling novel patterns).

Experiment: exp_induction_test.py ← TESTED, POSITIVE

Status: prediction 1 confirmed — Gram memory selectively helps on induction positions

Methodology (follows Olsson et al.):

Synthetic data with learnable bigram statistics (vocab=64, ~5 successors/token)
Training: 50% pure bigram sequences + 50% repeated-half sequences
Repeated halves: [bigram first half] [exact copy of first half]
The second half = induction positions (model could predict by copying)
Train 1-layer models (d=96, h=4) to convergence on CPU

Results (1-layer, 50 epochs):

Position type	Standard PPL	Online mem PPL	Delta
Total	8.8	8.4	-4.8%
Induction	8.3	7.6	-7.7%
Non-induction	9.4	9.2	-1.8%

Selectivity: -5.9% (induction_delta - non_induction_delta)

Key findings:

Gram memory helps 4.3x more on induction positions than non-induction (-7.7% vs -1.8%)
Both models reach near the bigram entropy floor (~8.9 PPL) on non-induction positions
The online_mem model substantially exceeds the bigram floor on induction positions (7.6 vs 8.9), meaning it's genuinely exploiting the repetition structure — not just better bigram prediction
The standard 1-layer model also shows some induction ability (8.3 < 9.4) but less than online_mem
Training loss for online_mem continues decreasing well past where standard plateaus (2.168 vs 2.198 at epoch 50), suggesting the Gram provides useful gradient signal

Interpretation: The Gram memory acts as a geometric induction head — it accumulates Plücker lines from bigram pairs, and when the same bigram pattern recurs in the second half, the accumulated Gram energy provides a signal that helps the model predict the next token. This is a continuous, geometric version of the discrete prefix-matching that induction heads perform.

Next: Run with --all to test at 2 and 4 layers. If the selectivity decreases with depth (as predicted), that confirms the Gram substitutes specifically for the induction circuit that deeper models can form on their own.

Vector-valued neural networks (V-Nets) connection

The Gram memory is already vector-valued in spirit — the Plücker embedding p = a ∧ b takes two vectors and produces a 6-vector encoding the line they span in P³. The Gram matrix M = Σ pᵢ⊗pᵢ is a second-order statistic of geometric vectors. This is structurally similar to V-Net architectures where neurons operate on vectors rather than scalars.

V-Net integration opportunities:

Learned projections into Plücker space: Replace fixed linear W₁, W₂ with equivariant vector neuron layers that preserve orientation and relational direction. Current finding (learned ≈ random projections) suggests the linear projection isn't extracting geometric structure; equivariant layers might.
Richer Gram computation: V-Net layers could compose Plücker lines with learned rotations/reflections/gating rather than just linear superposition.
Natural Grassmannian generalization: V-Nets in ℝ⁵ or ℝ⁶ map directly onto G(2,6), G(2,7) without deriving each Hodge dual by hand. However, G(2,4) was shown to be the sweet spot (§4), so this may not help.

Two-path architecture with attention

The most promising integration would combine temporal (triadic) memory for exact sequence recall with transversal memory for generative retrieval, connected by attention:

Router attention: soft gate between temporal (exact recall) and transversal (generative) paths, learned from query + confidence signals
Attention over Gram eigenvectors: expose principal relational axes as additional memory keys that standard attention can attend to (addresses the scalar gate's limitation of answering "how much structure?" but not "which structure is relevant to this query?")
Cross-attention: transversal candidates scored against temporal store for consistency — temporal memory as a prior on the generative path

Exclusive Gram attention (XSA-inspired) ← testing

Inspired by Exclusive Self Attention (XSA) which projects out the self-value direction from attention output (two-line change, consistent gains up to 2.7B params).

Applied the exclusion idea to both pathways:

XSA on Q·K: project out self-value direction from attention output, so attention focuses purely on contextual (non-self) information.
Geometric exclusion on Gram: project out the self-write Plücker direction from read lines before computing incidence, so the Gram memory only captures relational structure orthogonal to the current token's geometry.

Motivation: the residual already carries self-information to the FFN, so both attention and Gram memory should focus exclusively on other tokens' information.

Implementation: ExclusiveOnlineMemoryAttention in exp_mem_attn.py

Forward pass and gradient flow verified
Run: uv run python exp_mem_attn.py xsa_online_mem
Status: awaiting WikiText PPL results (needs local run with data/network access)

Synthetic benchmark results (test_xsa_gram.py, vocab=256, d=128, 2 layers):

standard: PPL 104.0
online_mem: PPL 111.1 (+7.1 worse — Gram overhead hurts on tiny synthetic task)
xsa_online_mem: PPL 106.1 (+2.1 worse — exclusion recovers 5 PPL of the gap)
XSA exclusion clearly reduces self-information noise in the Gram pathway
On WikiText (where online_mem already beats standard), XSA should amplify the win

Induction head results (exp_sparse_gram.py, vocab=64, seq=48, 3000 steps):

Non-causal (BUGGY — future leak in eigendecomposition):

eigen_bias: acc 0.772, xsa_eigen_bias: acc 0.631

Causal fix (cumulative Gram via cumsum, direct rd·M_t·Jw — no eigendecomp needed):

standard: acc 0.101 (baseline)
gram_bias: acc 0.263 (+0.16, full incidence as bias, already causal)
eigen_bias: acc 0.793 (+0.69, causal Gram-mediated incidence)
xsa_eigen_bias: acc 0.609 (+0.51, XSA hurts — -0.18 vs eigen_bias)

Causal version is actually BETTER than non-causal (0.793 vs 0.772) — future tokens were adding noise. Direct rd·M_t·Jw is simpler, stabler, and correct.

Lesson: XSA exclusion helps in LM (self-info redundant with residual) but hurts on induction/pattern-matching (self-identity IS the query signal). The exclusion should be task-conditional or gated, not always-on.

Open questions

Geometry only helps with weak embeddings: Can geometry add to neural embedding systems in any task, or is the K≥10 crossover specific to co-occurrence vectors?
Optimal K regime: Can the Gram ensemble be made more sample-efficient?
Higher-order interactions: Non-linear kernels or tensor decompositions in 32D without lossy projection to capture useful higher-order interactions?
Sequence prediction via transversals: Can geometric constraints predict next tokens in BPE-tokenized text?
Geometric induction heads ← ANSWERED YES: Gram memory selectively helps on induction positions (7.7% vs 1.8% on non-induction, 4x selectivity). Eigen bias achieves 81.1% induction accuracy vs 10.1% standard (exp_sparse_gram.py).
V-Net projections: Can equivariant vector neuron layers learn better projections into Plücker space than linear W₁, W₂?

ARC-AGI experiments (exp_arc_real.py)

Setup: Train on ARC training tasks, evaluate on ARC evaluation set (unseen tasks) via autoregressive generation (no teacher forcing). Implements key mdlARC ideas: 3D positional encoding (x=col, y=row, z=section), per-task embeddings, 8x dihedral augmentation.

Results (98 train tasks, 15 eval pairs, autoregressive generation):

Variant	Layers	Tok Acc	Grid Acc	Params
Standard	1	24.2%	0%	131K
Eigen bias	1	24.2%	0%	137K
Standard	4	14.7%	0%	818K
Eigen bias	4	26.8%	0%	851K

Key findings:

3D positional encoding closed the gap at 1 layer (both 24.2%) — without it, eigen_bias was 23.2% vs standard 4.7%. Spatial awareness matters more than geometric attention for grid tasks.
At 4 layers, standard degrades (14.7% < 24.2%) — overfitting to 98 tasks. Eigen bias improves (26.8%) — Gram bias acts as regularization.
0% grid accuracy across the board — no complete ARC solutions with this setup.

Round 2: mdlARC-matched training (d=64, 4 layers, 274K params):

Closed all training gaps except data and model size:

3D RoPE, RMSNorm, SiLU gated FFN, AdamW (0.9, 0.95), warmup+WSD
Color permutation augmentation + 8x dihedral
Packed batches (pad to batch max, not global max)
Scheduled sampling (0%→50% model predictions during training) Prevents memorization — loss stays at 0.3-0.7 instead of dropping to 0

Variant	Tok Acc	Grid Acc	Notes
Standard (scheduled sampling)	184/641 (28.7%) peak	0/15	Stable, best standard result
Eigen bias (whole-seq Gram)	collapsed to 2.7%	0/15	Gram amplifies SS noise
Eigen bias (per-pair Gram)	153/469 (32.6%)	0/10	Best overall, stable

Key findings:

Scheduled sampling is critical — without it, models memorize 98 tasks instantly (loss→0 by step 200) and eval accuracy stays random. With SS, loss stays high and the model learns transferable in-context reasoning.
Per-pair Gram reset fixes noise amplification. Resetting the Gram at each <start> token prevents garbage write lines from one demo/test pair contaminating the Gram for subsequent pairs. The whole-sequence Gram collapsed from 26.4% to 2.7% under scheduled sampling; per-pair held at 32.6%.
3D positional encoding closed the 1-layer gap — without it, eigen_bias was 23.2% vs standard 4.7%. With it, both reach 24.2%. Spatial awareness matters more than geometric attention for grid tasks at 1 layer.
At 4 layers, standard degrades but geometry helps — standard 14.7% vs eigen_bias 26.8% (without SS). The Gram acts as regularization.
Teacher forcing creates train/eval mismatch — the model trains on correct previous tokens but generates from its own predictions at eval. Scheduled sampling bridges this gap.

Remaining gaps vs mdlARC (44% on ARC-1 eval):

Data: 98 tasks vs thousands (ARC-1 + ARC-2 + ConceptARC) — dominant factor
Params: 274K vs 75M

Spatial Plücker geometry for ARC grids (exploration)

Encode adjacent grid cells as Plücker lines in P³:

Each cell → point (1, x, y, color)
Adjacent cells → Plücker line via exterior product
Horizontal + vertical adjacencies → full set of spatial lines
6×6 Gram M = L^T @ L captures spatial structure in 36 numbers

Preliminary findings (task 007bbfb7, tiling rule, 3×3 → 9×9):

Input Gram is low-rank: top-3 eigenvalues capture 99.1% of variance
Eigenvector alignment between input and output Grams is strong (cos 0.7-0.97) — the transform preserves geometric principal axes
48% of input-output line pairs are nearly incident (|inner product| < 0.01) — consistent with tiling (output contains copies of input pattern)
Eigenvector alignment is consistent across all 5 training pairs

This suggests the rule can be characterized as a Gram transport — a linear map W such that M_out ≈ W · M_in · W^T. Learning W from demo pairs and applying to test input could predict output geometric structure without any neural network.

Gram transport results (all 400 ARC training tasks, same-size only):

Median 5.9% error predicting output Grams (131 tasks)
93% of tasks under 50% error, 58% under 10%
Joint Gram transport (input+output in one Gram): median 3.7% error
Zero learning — least-squares fit on 2-5 demo pairs

Multi-embedding multi-transversal ARC solver ★ BREAKTHROUGH

10+ ARC tasks solved at rank 1 with zero learning, pure Plücker geometry.

Pipeline (exp_arc_fast_solve.py):

Eight complementary embeddings per cell pair:
- hist+color: input/output color one-hot + histogram difference (with proper per-histogram tables for ≤2000 histograms)
- color-only: pure color mapping, no position
- pos+color: position + color mapping
- all: position + color + histograms combined
- row_feat, col_feat: row/column histograms + uniformity
- color_count: per-color frequency + mode indicator
- diagonal: diagonal indices + position
Each embedding → Plücker lines from adjacent cell pairs → 200 transversals per training pair via multi-transversal sampling (P3Memory)
Precomputed score tables: fold line @ J6 @ transversals.T into a lookup table score[adj_pair][color_a][color_b]. Scoring becomes table lookup + addition — zero matmul during scoring.
Dual scoring strategy:
- ≤2000 histograms: per-histogram tables for hist_color (proper per-candidate histogram), other embeddings as non-histogram tables
- >2000 histograms: all 8 embeddings with placeholder histograms, raw sum
Score ALL candidates exhaustively (MPS-accelerated, 234M candidates/sec) or via random sampling with chunking for large grids

Results:

Task	Colors	Grid	Candidates	Rank	Time
25ff71a9	3	3×3	19K	1	<1s
794b24be	3	3×3	19K	1	<1s
0d3d703e	8	3×3	134M	1	68s
25d8a9c8	10	3×3	1B	1	<2s
74dd1130	8	3×3	134M	1	68s
3618c87e	3	5×5	847B	1	3s
a9f96cdd	6	3×5	470B	1	2s
aabf363d	6	7×7	10^38	1	8s
ae3edfdc	5	15×15	10^157	1	46s
6e02f1e3	5	3×3	1.9M	13	30s
ed36ccf7	5	3×3	1.9M	57	30s
a85d4709	5	3×3	1.9M	718	30s

Key insights:

Histogram-only scoring is best for small grids — when per-histogram tables are feasible, using hist_color alone gives the best ranks. Adding non-histogram embeddings adds noise that hurts discrimination.
All 8 embeddings needed for large grids — when proper histograms aren't feasible, the 8 embeddings via raw sum scoring provide enough complementary constraints for rank 1.
Multi-transversal is essential — a single transversal is a weak constraint (1 scalar). 200+ transversals per training pair from different 4-tuples provide complementary constraints that intersect on the answer.
Precomputed tables eliminate the bottleneck — scoring goes from 58K/s (CPU matmul) to 234M/s (MPS table lookup), enabling exhaustive search over 134M+ candidates.
The approach scales to 10^157+ — large grids solved via sampling (0/10M random beat correct).
NaN/overflow protection needed — transversals and Plücker inner products can overflow float32. Filtering, clipping, and nan_to_num prevent score corruption.

Resolved: X+Y sorting via Plücker geometry (exp_xy_sort.py)

Question: Can Plücker geometry provide a sub-O(n² log n) algorithm for X+Y sorting?

Answer: No asymptotic improvement, but useful structural properties.

Key findings:

Gram eigenvectors cleanly separate X and Y factors: EV1 correlates ρ=0.996 with x, EV2 correlates ρ=0.992 with y — the geometry recovers the factored structure from the sum encoding
Weighted eigenprojection achieves Spearman ρ=0.995 with true sums from only 10% training samples — near-perfect approximate sort from 6 numbers per line
"Plücker direct" sort achieves Kendall τ=0.95 consistently (n=10 to n=100)
Multi-transversal partition reduces cell range to 11.1% of total range
Fundamental obstacle: Plücker inner product is a single scalar comparison, still need Ω(n² log n) evaluations

Practical value: O(36)-size Gram as compact summary for approximate sort, top-k queries, and threshold queries without full sort. The eigenvector factorization is a novel way to recover X/Y marginals from the sum structure.

FilesExpand file tree

IDEAS.md

Latest commit

History

IDEAS.md

File metadata and controls

Ideas

Active: Memory-integrated attention

1. Memory-augmented attention ← building now

2. Transversal retrieval as context injection

3. Online memory accumulation ★ WORKING — first geometry win

3a. Dual-pathway incidence attention ← easiest to test

3b. Longer sequences ← TESTED, peaks at 256

3c. Tune decay parameter ← TESTED, decay doesn't matter

3d. Amplifying memory (λ > 1) with normalization

3e. Triadic-seeded Gram memory ← TESTED, doesn't help

3f. Multi-scale memory ← TESTED, promising at seq=256

3g. Gram eigenstructure as features ← TESTED, strong signal

3h. Multi-scale write lines (from Grassmann Flows) ← TESTED, marginal

3i. Dual-pathway incidence attention ← TESTED, negative

3i-b. Systematic Gram memory sweep (50% data, from scratch)

3j. Online associative memory (SDR-based) ← SHELVED

3k. Iterated Gram (M² 2-hop matching) ← TESTED, negligible

3l. Learned Gram transition (SSM-style) ← TESTED, marginal

3m. Separate read/write Grams ← TESTED, marginal

Summary of 3k-3m

4. Higher Grassmannian attention ← TESTED, marginal

Active: Retrieval and reasoning

5. Few-shot relational reasoning ← TESTED, negative with neural embeddings

6. Compositional transversal chains

7. Cross-word geometry transfer

Plücker attention LM experiment results

Results summary

Key findings

What went wrong (v1) and fixes

Resolved questions

Geometry beats embeddings at K≥10

Unified associative database (TDGA + transversal geometry)

Demonstrated

Directions

Limitations

Active: Geometric induction head hypothesis

Background (Olsson et al., 2022)

Connection to Gram memory

Testable predictions

Experiment: exp_induction_test.py ← TESTED, POSITIVE

Vector-valued neural networks (V-Nets) connection

Two-path architecture with attention

Exclusive Gram attention (XSA-inspired) ← testing

Open questions

ARC-AGI experiments (exp_arc_real.py)

Spatial Plücker geometry for ARC grids (exploration)

Multi-embedding multi-transversal ARC solver ★ BREAKTHROUGH

Resolved: X+Y sorting via Plücker geometry (exp_xy_sort.py)