File: `PR_DRAFT.md` (added, 33 lines)
# Title

Non-record: add HelixRecur v2 shared-depth recurrence submission

# Short Summary

This PR adds a single non-record submission folder for `records/track_non_record_16mb/2026-03-26_HelixRecur_v2` on top of `upstream/main`.

HelixRecur v2 keeps the HelixRecur recurrence stack compact with `6` shared blocks over `11` virtual passes and adds only `44` conditioning parameters to recover a small amount of depth-specific behavior. It is the active recurrence champion in this branch's history, and it is submitted as exploratory non-record work rather than as a leaderboard attempt.

# Key Metrics

- Quick comparison used for judgment: `val_loss 7.54165596`, `val_bpb 4.46659346`, compressed `3042658`, total `3113435`, `step_avg 675.97ms`
- Longer non-record pass: `val_loss 4.63764717`, `val_bpb 2.74667588`, compressed `4224324`, total `4295101`, `step_avg 676.24ms`
- Code bytes: `70777`
- Parameter count: `15187040`
- Added trainable parameters vs v1: `44`

# Non-record Rationale

- This branch line remained far from the accepted SOTA, so it does not justify a record claim.
- The work is best framed as an exploratory recurrence result with a favorable quality-per-byte tradeoff inside its own local line.
- Later branch work did not produce a replacement winner: v3 lost on the longer pass, and the later micro-variant tournament did not dethrone v2.

# Why It Is Still Promising

- v2 materially improved on HelixRecur v1 in the fair quick comparison while slightly shrinking total bytes.
- The model keeps a strong byte-efficiency story by reusing shared transformer blocks and spending only a negligible parameter budget on virtual-depth conditioning.
- The line established a cleaner recurrence baseline that survived both the v3 follow-up and the later micro-variant tournament.

# Next Direction

The next direction is gene-coded low-rank specialization: preserve the compact shared-depth recurrent base and add tiny learned low-rank specialization paths keyed by virtual depth, instead of stacking more scalar-only rescue knobs.
File: `records/track_non_record_16mb/2026-03-26_HelixRecur_v2/README.md` (added, 120 lines)
## Non-record: HelixRecur v2

HelixRecur v2 is the active non-record recurrence champion from this branch. It keeps the HelixRecur v1 shared-depth recurrence intact and adds only a tiny virtual-depth conditioning table so repeated passes can recover a small amount of depth-specific behavior without giving up the recurrence byte win.

This is an exploratory non-record submission, not a record claim. The longer pass remained well behind the accepted SOTA, but the variant is still worth preserving because it materially improved the recurrence line without giving up its compact artifact story.

### Exact architecture summary

- Base lineage: donor -> HelixRecur v1 -> HelixRecur v2
- Tokenizer and dataset: unchanged `sp1024` setup
- Core transformer shape:
- `11` virtual layers
- `6` shared recurrent blocks
- recurrence schedule `0,1,2,3,4,5,4,3,2,1,0`
- model dim `512`
- `8` attention heads
- `4` KV heads
- MLP multiplier `3.0`
- Preserved donor features:
- tied embeddings
- logit softcap `30.0`
- BigramHash with `2048` buckets and `128`-dim embedding
- SmearGate local-feature mixer
- partial RoPE with `16` rotary dims
- shared value embeddings with `VE_DIM=128` on layers `9,10`
- XSA on the last `4` layers
- LN scale path
- donor optimizer, quantization, compression, and eval path
- v2-specific addition:
- exactly `44` trainable parameters
- an `11 x 4` virtual-depth conditioning table
- per-virtual-pass modulation of existing scalar pathways only:
- LN scale multiplier
- attention output scale multiplier
- MLP output scale multiplier
- attention `q_gain` multiplier
- each multiplier is bounded as `1 + 0.05 * tanh(param)`
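The conditioning scheme above can be sketched in plain Python. The schedule, table shape, knob names, and `1 + 0.05 * tanh(param)` bound come straight from the list above; the zero initialization (so every multiplier starts at exactly `1.0`) is an assumption about how such a table would sensibly be initialized, not a claim about the actual `train_gpt.py`:

```python
import math

SCHEDULE = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0]  # 11 virtual passes over 6 shared blocks
N_VIRTUAL = len(SCHEDULE)                      # 11 virtual layers
N_KNOBS = 4  # LN scale, attention output scale, MLP output scale, q_gain

# Raw conditioning table; zero init (an assumption) makes every multiplier start at 1.0.
cond_table = [[0.0] * N_KNOBS for _ in range(N_VIRTUAL)]

def multiplier(raw: float) -> float:
    """Bounded modulation: always stays strictly inside (0.95, 1.05)."""
    return 1.0 + 0.05 * math.tanh(raw)

added_params = N_VIRTUAL * N_KNOBS  # 44, matching the stated v2 addition
```

Because the bound is a `tanh`, no amount of training can push any scalar pathway more than 5% away from its donor behavior, which is why the addition is described as a rescue knob rather than a new pathway.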

### Why this is DNA-inspired in engineering terms

- The model reuses a small shared block set across an ordered pass schedule, analogous to reusing a compact genetic program across repeated developmental stages rather than storing a fully separate block for every depth.
- The tiny virtual-depth table acts like a minimal gene-expression control sheet: it does not add new heavy pathways, it only modulates existing scalar controls at each virtual depth.
- BigramHash plus SmearGate provides a cheap motif-sensitive local branch, which fits the intended "sequence motifs + compact regulatory controls" engineering direction.

### Recorded metrics

Quick comparison used for judgment, `1xH100`, `SEED=1337`, `MAX_WALLCLOCK_SECONDS=180`, `EVAL_SEQ_LEN=64`:

| Model | val_loss | val_bpb | compressed bytes | total bytes | step_avg |
|---|---:|---:|---:|---:|---:|
| Donor quick | `7.55493163` | `4.47445606` | `5,019,273` | `5,086,876` | `668.25ms` |
| HelixRecur v1 quick | `7.85509273` | `4.65222837` | `3,081,539` | `3,150,613` | `655.32ms` |
| HelixRecur v2 quick | `7.54165596` | `4.46659346` | `3,042,658` | `3,113,435` | `675.97ms` |

Longer non-record pass, `1xH100`, `SEED=1337`, `MAX_WALLCLOCK_SECONDS=600`, `EVAL_SEQ_LEN=64`:

| Model | val_loss | val_bpb | compressed bytes | total bytes | train stop | step_avg |
|---|---:|---:|---:|---:|---:|---:|
| HelixRecur v2 long | `4.63764717` | `2.74667588` | `4,224,324` | `4,295,101` | `600.504s` | `676.24ms` |
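One internal consistency check on the tables above: the loss-to-bpb ratio is the same across all four rows, which is what you would expect if the harness uses the standard conversion `val_bpb = val_loss / (ln 2 * mean_bytes_per_token)`. The per-row numbers are from the tables; the conversion formula itself is an assumption about the eval path:

```python
import math

rows = {  # (val_loss, val_bpb) taken from the two tables above
    "donor_quick": (7.55493163, 4.47445606),
    "v1_quick":    (7.85509273, 4.65222837),
    "v2_quick":    (7.54165596, 4.46659346),
    "v2_long":     (4.63764717, 2.74667588),
}

ratios = [loss / bpb for loss, bpb in rows.values()]
spread = max(ratios) - min(ratios)         # ~0 if one shared conversion factor is in play
bytes_per_token = ratios[0] / math.log(2)  # implied mean bytes per sp1024 token, if the formula holds
```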

### Byte profile

- v2 code bytes: `70,777`
- v2 parameter count: `15,187,040`
- v2 added parameters vs v1: `+44`
- v2 quick compressed bytes: `3,042,658`
- v2 quick total bytes: `3,113,435`
- v2 long compressed bytes: `4,224,324`
- v2 long total bytes: `4,295,101`
- v1 quick total bytes: `3,150,613`
- donor reproduced artifact for reference: compressed `16,073,037`, total `16,140,640`
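The v2 byte figures above decompose exactly: each reported total is the compressed artifact plus the `70,777` code bytes. A quick arithmetic check:

```python
CODE_BYTES = 70_777  # v2 code bytes, from the byte profile above

runs = {  # run -> (compressed bytes, total bytes), from the byte profile above
    "v2_quick": (3_042_658, 3_113_435),
    "v2_long":  (4_224_324, 4_295_101),
}

for name, (compressed, total) in runs.items():
    assert compressed + CODE_BYTES == total, name
```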

### Exact commands already used

Compile sanity:

```bash
python -m py_compile records/track_non_record_16mb/2026-03-26_HelixRecur_v2/train_gpt.py
```

Train smoke:

```bash
env RUN_ID=helixrecur2-train-smoke SEED=1337 MAX_WALLCLOCK_SECONDS=45 EVAL_SEQ_LEN=64 TRAIN_LOG_EVERY=1000 VAL_LOSS_EVERY=4000 DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model torchrun --standalone --nproc_per_node=1 records/track_non_record_16mb/2026-03-26_HelixRecur_v2/train_gpt.py > helixrecur2_train_smoke.out 2>&1
```

Eval smoke:

```bash
env RUN_ID=helixrecur2-eval-smoke SEED=1337 MAX_WALLCLOCK_SECONDS=1 EVAL_SEQ_LEN=64 TRAIN_LOG_EVERY=1000 VAL_LOSS_EVERY=4000 DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model torchrun --standalone --nproc_per_node=1 records/track_non_record_16mb/2026-03-26_HelixRecur_v2/train_gpt.py > helixrecur2_eval_smoke.out 2>&1
```

Initial quick comparison attempt:

```bash
env RUN_ID=helixrecur2-quickcmp SEED=1337 MAX_WALLCLOCK_SECONDS=180 EVAL_SEQ_LEN=64 TRAIN_LOG_EVERY=1000 VAL_LOSS_EVERY=4000 DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model torchrun --standalone --nproc_per_node=1 records/track_non_record_16mb/2026-03-26_HelixRecur_v2/train_gpt.py > helixrecur2_quickcmp.out 2>&1
```

Fair solo quick comparison used for judgment:

```bash
env RUN_ID=helixrecur2-quickcmp-solo SEED=1337 MAX_WALLCLOCK_SECONDS=180 EVAL_SEQ_LEN=64 TRAIN_LOG_EVERY=1000 VAL_LOSS_EVERY=4000 DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model torchrun --standalone --nproc_per_node=1 records/track_non_record_16mb/2026-03-26_HelixRecur_v2/train_gpt.py > helixrecur2_quickcmp_solo.out 2>&1
```

Longer non-record pass:

```bash
env RUN_ID=helixrecur2-long SEED=1337 MAX_WALLCLOCK_SECONDS=600 EVAL_SEQ_LEN=64 TRAIN_LOG_EVERY=1000 VAL_LOSS_EVERY=4000 DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model torchrun --standalone --nproc_per_node=1 records/track_non_record_16mb/2026-03-26_HelixRecur_v2/train_gpt.py > helixrecur2_long.out 2>&1
```

### Known limitations

- This is not record-competitive. The longer pass result `2.74667588 val_bpb` is far from the accepted SOTA gate.
- The evidence is single-seed and non-record only.
- The gains are local to this recurrence line; they do not show that recurrence alone is a winning final submission direction.
- The quick proxy improved, but longer-pass behavior is still fragile enough that later micro-variants did not replace v2.

### Next direction

Keep the shared-depth recurrence and byte discipline, but move the next experiment toward gene-coded low-rank specialization: tiny learned depth-conditioned low-rank adapters that preserve the compact shared-block base while giving each virtual pass a more explicit specialization channel.
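A minimal NumPy sketch of that direction, to make "tiny depth-conditioned low-rank adapters" concrete. Everything here is hypothetical: the rank, the zero initialization of one factor (so training starts exactly at v2 behavior), and the single adapted matrix are illustrative choices, not a design spec:

```python
import numpy as np

D, RANK, N_VIRTUAL = 512, 2, 11  # model dim from this README; rank 2 is a hypothetical choice
rng = np.random.default_rng(1337)

W = rng.standard_normal((D, D)) / np.sqrt(D)  # one shared-block weight, reused at every pass
U = np.zeros((N_VIRTUAL, D, RANK))            # per-virtual-depth factors; zero init = v2 behavior at start
V = rng.standard_normal((N_VIRTUAL, RANK, D)) * 0.01

def effective_weight(depth: int) -> np.ndarray:
    """Shared weight plus a tiny depth-keyed low-rank specialization."""
    return W + U[depth] @ V[depth]

added_params = N_VIRTUAL * (D * RANK + RANK * D)  # per adapted matrix
```

Even at rank 2 this adds on the order of tens of thousands of parameters per adapted matrix, so the byte discipline of the v2 line would need the rank and the set of adapted matrices chosen carefully.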
File: submission metadata JSON (added, 10 lines)
```json
{
  "author": "OpenAI Codex",
  "github_id": "codex",
  "name": "Non-record: HelixRecur v2",
  "blurb": "HelixRecur v1 plus a tiny 11x4 virtual-depth conditioning table that modulates LN scale, attention scale, MLP scale, and q_gain while keeping the 6-shared-block recurrence, donor local features, and donor quantization/eval path intact.",
  "date": "2026-03-27T00:00:00Z",
  "val_loss": 4.63764717,
  "val_bpb": 2.74667588,
  "bytes_total": 4295101
}
```