From d22c05c7f0b55f50a12b81d30d7e2f28e645e88a Mon Sep 17 00:00:00 2001
From: Andy <a.kolkov@gmail.com>
Date: Wed, 25 Mar 2026 21:44:18 +0300
Subject: [PATCH] =?UTF-8?q?perf:=20v0.12.20=20=E2=80=94=20premultiplied=20?=
 =?UTF-8?q?StateIDs,=20break-at-match,=20Phase=203=20elimination?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

DFA Core — Premultiplied + Tagged StateIDs:
- StateID stores byte offset into flatTrans, eliminating multiply from hot loop
- Match/dead/invalid flags encoded in StateID high bits (single IsTagged branch)
- 4x loop unrolling in searchFirstAt, searchAt, searchEarliestMatch
- safeOffset eliminated from all DFA search paths

DFA Core — Rust-aligned Determinize:
- 1-byte match delay (Rust determinize mod.rs:254-286)
- Break-at-match: stop NFA iteration at Match state, drop prefix restarts
- Epsilon closure rewrite: add-on-pop DFS with reverse Split push order,
  matching Rust sparse set insertion order (verified via cargo run)
- Incremental per-target epsilon closure in moveWithWordContext
- filterStatesAfterMatch removed (replaced by break-at-match)
- BreakAtMatch config: true for forward DFA, false for reverse DFA
- Phase 3 (SearchAtAnchored re-scan) eliminated — 2-pass bidirectional DFA
- Fix: meta dfaConfig uses DefaultConfig() to inherit BreakAtMatch=true

Meta Engine:
- DFA direct FindAll path — skip meta prefilter layer, call DFA directly
- Fast path for start-anchored FindAll — skip pool overhead
- Inline first-byte rejection for anchored patterns
- Prefilter candidate pass-through to bidirectional DFA
- Skip reverse DFA for always-anchored patterns

NFA/PikeVM:
- Lazy SlotTable init — reduce cold start overhead
- Fix anchored BoundedBacktracker on large input — truncate to MaxInputSize

Prefilter:
- Memmem: Memchr(rareByte) + verify (Rust approach) — replaces MemchrPair

Benchmarks (EPYC CI, 6MB input, vs stdlib / vs Rust):
- ip: 675x faster than stdlib, 18.5x faster than Rust
- multiline_php: 288x faster than stdlib, 2.0x faster than Rust
- char_class: 11x faster than stdlib, 1.3x faster than Rust
- inner_literal: 668x faster than stdlib, at Rust parity
- email: 506x faster than stdlib
- LangArena total: 30x faster than stdlib, 3.9x gap vs Rust

27 files changed, +734 -583 lines. All tests pass.
---
 CHANGELOG.md                               |  32 ++
 README.md                                  |  22 +-
 ROADMAP.md                                 |  11 +-
 dfa/lazy/accel_test.go                     |  14 +-
 dfa/lazy/anchored_search_prefilter_test.go |   2 +-
 dfa/lazy/builder.go                        | 208 ++++----
 dfa/lazy/cache.go                          |  91 ++--
 dfa/lazy/cache_test.go                     |  12 +-
 dfa/lazy/config.go                         |  13 +
 dfa/lazy/lazy.go                           | 556 ++++++++-------------
 dfa/lazy/search_extra_test.go              |   4 +-
 dfa/lazy/start.go                          |   8 +-
 dfa/lazy/state.go                          | 133 ++++-
 docs/ARCHITECTURE.md                       |  10 +-
 meta/compile.go                            |  18 +-
 meta/engine.go                             |  11 +
 meta/find_indices.go                       |  37 +-
 meta/findall.go                            |  42 +-
 meta/reverse_anchored.go                   |   8 +-
 meta/reverse_inner.go                      |   7 +-
 meta/reverse_suffix.go                     |   7 +-
 meta/reverse_suffix_set.go                 |   6 +-
 nfa/compile.go                             |   3 +-
 nfa/pikevm.go                              |  28 +-
 nfa/slot_table.go                          |   9 +
 regex.go                                   |   6 +
 simd/memmem.go                             |  31 +-
 27 files changed, 712 insertions(+), 617 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 433028f..b7f0059 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,38 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120))
 - SIMD prefilter for CompositeSequenceDFA (#83)
 
+## [0.12.20] - 2026-03-25
+
+### Performance
+- **Premultiplied State IDs** — StateID stores byte offset into flat transition table,
+  eliminating multiply from DFA hot loop. Single `flatTrans[sid+classIdx]` lookup.
+  Inspired by Rust `LazyStateID` (hybrid/id.rs).
+
+- **Tagged State IDs** — match/dead/invalid/start flags encoded in StateID high bits.
+  Single `IsTagged()` branch replaces 3 separate comparisons in DFA hot loop.
+  4x loop unrolling breaks to slow path only on tagged states.
+
+- **1-byte match delay** (Rust determinize approach) — match reporting delayed by 1 byte,
+  enabling correct look-around assertion resolution (^, $, \b) at match boundaries.
+  Reference: Rust `determinize` mod.rs:254-286.
+
+- **Rust-aligned DFA determinize: break-at-match** — replaced `filterStatesAfterMatch`
+  with Rust's `determinize::next` break-at-match semantics (mod.rs:284). Epsilon closure
+  uses add-on-pop DFS with reverse Split push, matching Rust sparse set insertion order.
+  Incremental per-target epsilon closure preserves correct state ordering for leftmost-first.
+  **Eliminates Phase 3** anchored re-scan: bidirectional DFA reduced from 3-pass to 2-pass.
+  Verified against Rust regex-automata `find_fwd` — identical results on all test patterns.
+
+- **Memmem: Memchr(rareByte) + verify** (Rust approach) — replaced `MemchrPair`-based
+  paired search in `simd.Memmem` with single rare byte Memchr scan + `bytes.Equal`
+  verify, matching Rust `memchr::memmem` architecture.
+
+### Benchmarks (LangArena LogParser, 7.2 MB, 13 patterns)
+
+| vs stdlib | vs Rust | Wins vs Rust |
+|-----------|---------|-------------|
+| **30x faster** total | 2-5x gap (local i7) | ip 18.5x, multiline_php 2.0x, char_class 1.3x |
+
 ## [0.12.19] - 2026-03-24
 
 ### Performance
diff --git a/README.md b/README.md
index 3c4201c..3b4f4f0 100644
--- a/README.md
+++ b/README.md
@@ -64,19 +64,19 @@ Cross-language benchmarks on 6MB input, AMD EPYC ([source](https://github.com/ko
 
 | Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust |
 |---------|-----------|---------|------------|-----------|---------|
-| Literal alternation | 475 ms | 4.4 ms | 0.7 ms | **109x** | 6.3x slower |
-| Multi-literal | 1391 ms | 12.6 ms | 4.7 ms | **110x** | 2.6x slower |
-| Inner `.*keyword.*` | 231 ms | 0.29 ms | 0.29 ms | **797x** | **~parity** |
-| Suffix `.*\.txt` | 234 ms | 1.83 ms | 1.07 ms | **128x** | 1.7x slower |
-| Multiline `(?m)^/.*\.php` | 103 ms | 0.66 ms | 0.66 ms | **156x** | **~parity** |
-| Email validation | 261 ms | 0.54 ms | 0.31 ms | **482x** | 1.7x slower |
-| URL extraction | 262 ms | 0.84 ms | 0.35 ms | **311x** | 2.4x slower |
-| IP address | 498 ms | 2.1 ms | 12.0 ms | **237x** | **5.6x faster** |
-| Char class `[\w]+` | 554 ms | 48.0 ms | 50.1 ms | **11x** | **1.0x faster** |
-| Word repeat `(\w{2,8})+` | 641 ms | 185 ms | 48.7 ms | **3x** | 3.7x slower |
+| Literal alternation | 466 ms | 4.2 ms | 0.65 ms | **110x** | 6.4x slower |
+| Multi-literal | 1391 ms | 12.4 ms | 5.3 ms | **112x** | 2.3x slower |
+| Inner `.*keyword.*` | 227 ms | 0.34 ms | 0.32 ms | **668x** | **~parity** |
+| Suffix `.*\.txt` | 228 ms | 2.9 ms | 1.3 ms | **78x** | 2.3x slower |
+| Multiline `(?m)^/.*\.php` | 101 ms | 0.35 ms | 0.72 ms | **288x** | **2.0x faster** |
+| Email validation | 258 ms | 0.51 ms | 0.27 ms | **506x** | 1.8x slower |
+| URL extraction | 259 ms | 0.71 ms | 0.35 ms | **364x** | 2.0x slower |
+| IP address | 493 ms | 0.73 ms | 13.5 ms | **675x** | **18.5x faster** |
+| Char class `[\w]+` | 483 ms | 40.9 ms | 56.0 ms | **11x** | **1.3x faster** |
+| Word repeat `(\w{2,8})+` | 628 ms | 167 ms | 54.8 ms | **3x** | 3.0x slower |
 
 **Where coregex excels:**
-- Multiline patterns (`(?m)^/.*\.php`) — near Rust parity, 100x+ vs stdlib
+- Multiline patterns (`(?m)^/.*\.php`) — **2x faster than Rust**, 288x vs stdlib
 - IP/phone patterns (`\d+\.\d+\.\d+\.\d+`) — SIMD digit prefilter skips non-digit regions
 - Suffix patterns (`.*\.log`, `.*\.txt`) — reverse search optimization (1000x+)
 - Inner literals (`.*error.*`, `.*@example\.com`) — bidirectional DFA (900x+)
diff --git a/ROADMAP.md b/ROADMAP.md
index e9b34c2..d38bd6b 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -2,7 +2,7 @@
 
 > **Strategic Focus**: Production-grade regex engine with RE2/rust-regex level optimizations
 
-**Last Updated**: 2026-03-24 | **Current Version**: v0.12.18 | **Target**: v1.0.0 stable
+**Last Updated**: 2026-03-25 | **Current Version**: v0.12.19 | **Target**: v1.0.0 stable
 
 ---
 
@@ -12,7 +12,7 @@ Build a **production-ready, high-performance regex engine** for Go that matches
 
 ### Current State vs Target
 
-| Metric | Current (v0.12.15) | Target (v1.0.0) |
+| Metric | Current (v0.12.19) | Target (v1.0.0) |
 |--------|-------------------|-----------------|
 | Inner literal speedup | **280-3154x** | ✅ Achieved |
 | Case-insensitive speedup | **263x** | ✅ Achieved |
@@ -93,7 +93,12 @@ v0.12.16 ✅ → WrapLineAnchor for (?m)^ patterns
          ↓
 v0.12.17 ✅ → Fix LogParser ARM64 regression, restore DFA/Teddy for (?m)^
          ↓
-v0.12.18 (Current) ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ahead
+v0.12.18 ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ahead
+         ↓
+v0.12.19 ✅ → Zero-alloc FindSubmatch, byte-based DFA cache, Rust-aligned visited limits
+         ↓
+v0.12.20 (Current) → Premultiplied/tagged StateIDs, break-at-match DFA determinize,
+                      Phase 3 elimination (2-pass bidirectional DFA)
          ↓
 v1.0.0-rc → Feature freeze, API locked
          ↓
diff --git a/dfa/lazy/accel_test.go b/dfa/lazy/accel_test.go
index d434dea..58ed067 100644
--- a/dfa/lazy/accel_test.go
+++ b/dfa/lazy/accel_test.go
@@ -98,18 +98,20 @@ func TestDetectAccelerationFromCached(t *testing.T) {
 
 func TestDetectAccelerationFromFlat(t *testing.T) {
 	// Test acceleration detection via flat transition table
+	// Using premultiplied state IDs: sid = stateIndex * stride
 	stride := 256
-	sid := StateID(1)
-	flatTrans := make([]StateID, 2*stride) // 2 states
+	sid := StateID(1 * stride) // premultiplied: state 1 at offset 256
+	state2 := StateID(2 * stride)
+	flatTrans := make([]StateID, 3*stride) // 3 states (0, 1, 2)
 
 	// State 1: 250 self-loops, 3 exits to state 2, 3 dead
-	base := int(sid) * stride
+	base := sid.Offset()
 	for i := 0; i < 250; i++ {
 		flatTrans[base+i] = sid // Self-loop
 	}
-	flatTrans[base+250] = StateID(2)
-	flatTrans[base+251] = StateID(2)
-	flatTrans[base+252] = StateID(2)
+	flatTrans[base+250] = state2
+	flatTrans[base+251] = state2
+	flatTrans[base+252] = state2
 	flatTrans[base+253] = DeadState
 	flatTrans[base+254] = DeadState
 	flatTrans[base+255] = DeadState
diff --git a/dfa/lazy/anchored_search_prefilter_test.go b/dfa/lazy/anchored_search_prefilter_test.go
index 0593e89..e1d4472 100644
--- a/dfa/lazy/anchored_search_prefilter_test.go
+++ b/dfa/lazy/anchored_search_prefilter_test.go
@@ -525,7 +525,7 @@ func TestFindWithPrefilterAtWordBoundary(t *testing.T) {
 // TestFindWithPrefilterAtCacheClear tests the cache-clear recovery path
 // in findWithPrefilterAt using a very small cache.
 func TestFindWithPrefilterAtCacheClear(t *testing.T) {
-	config := DefaultConfig().WithMaxStates(3).WithMaxCacheClears(10)
+	config := DefaultConfig().WithMaxStates(6).WithMaxCacheClears(20)
 	compiler := nfa.NewDefaultCompiler()
 	nfaObj, err := compiler.Compile("[a-zA-Z]+[0-9]+")
 	if err != nil {
diff --git a/dfa/lazy/builder.go b/dfa/lazy/builder.go
index f528e87..4c3f065 100644
--- a/dfa/lazy/builder.go
+++ b/dfa/lazy/builder.go
@@ -61,17 +61,6 @@ func (b *Builder) Build() (*DFA, error) {
 		pf = b.buildPrefilter()
 	}
 
-	// Compute fresh start states: epsilon closure of anchored start.
-	// These are states that get re-introduced via unanchored machinery after each position.
-	// Used for leftmost matching: when all remaining states are in this set plus unanchored
-	// machinery, the committed match is final.
-	startLook := LookSetFromStartKind(StartText)
-	anchoredStartClosure := b.epsilonClosure([]nfa.StateID{b.nfa.StartAnchored()}, startLook)
-	freshStartStates := make(map[nfa.StateID]bool, len(anchoredStartClosure))
-	for _, stateID := range anchoredStartClosure {
-		freshStartStates[stateID] = true
-	}
-
 	// Check if the NFA contains word boundary assertions
 	hasWordBoundary := b.checkHasWordBoundary()
 
@@ -89,7 +78,6 @@ func (b *Builder) Build() (*DFA, error) {
 		prefilter:        pf,
 		pikevm:           nfa.NewPikeVM(b.nfa),
 		byteClasses:      b.nfa.ByteClasses(),
-		freshStartStates: freshStartStates,
 		unanchoredStart:  b.nfa.StartUnanchored(),
 		hasWordBoundary:  hasWordBoundary,
 		isAlwaysAnchored: isAlwaysAnchored,
@@ -140,77 +128,16 @@ func (b *Builder) buildPrefilter() prefilter.Prefilter {
 //  4. Collect all reachable states
 //  5. Return sorted list for consistent ordering
 func (b *Builder) epsilonClosure(states []nfa.StateID, lookHave LookSet) []nfa.StateID {
-	// Use pooled StateSet for efficient membership testing and deduplication
 	closure := acquireStateSet()
 	defer releaseStateSet(closure)
-	stack := make([]nfa.StateID, 0, len(states)*2)
 
-	// Initialize with input states
+	// Reuse epsilonClosureInto for each seed state.
 	for _, sid := range states {
-		if !closure.Contains(sid) {
-			closure.Add(sid)
-			stack = append(stack, sid)
-		}
-	}
-
-	// DFS through epsilon transitions
-	for len(stack) > 0 {
-		// Pop from stack
-		current := stack[len(stack)-1]
-		stack = stack[:len(stack)-1]
-
-		// Get NFA state
-		state := b.nfa.State(current)
-		if state == nil {
-			continue
-		}
-
-		// Follow epsilon transitions
-		switch state.Kind() {
-		case nfa.StateEpsilon:
-			next := state.Epsilon()
-			if next != nfa.InvalidState && !closure.Contains(next) {
-				closure.Add(next)
-				stack = append(stack, next)
-			}
-
-		case nfa.StateSplit:
-			left, right := state.Split()
-			if left != nfa.InvalidState && !closure.Contains(left) {
-				closure.Add(left)
-				stack = append(stack, left)
-			}
-			if right != nfa.InvalidState && !closure.Contains(right) {
-				closure.Add(right)
-				stack = append(stack, right)
-			}
-
-		case nfa.StateLook:
-			// CRITICAL: Only follow if the look assertion is satisfied
-			// This is the key fix for proper ^ and $ handling in DFA.
-			// Without this check, the DFA would incorrectly match patterns
-			// like "^abc" at any position in the input.
-			look, next := state.Look()
-			if lookHave.Contains(look) && next != nfa.InvalidState && !closure.Contains(next) {
-				closure.Add(next)
-				stack = append(stack, next)
-			}
-
-		case nfa.StateCapture:
-			// Capture states are epsilon transitions that record positions.
-			// The DFA ignores captures (it only tracks match/no-match),
-			// but we must follow through to reach the actual consuming states.
-			// Fix for Issue #15: DFA.IsMatch returns false for patterns with capture groups.
-			_, _, next := state.Capture()
-			if next != nfa.InvalidState && !closure.Contains(next) {
-				closure.Add(next)
-				stack = append(stack, next)
-			}
-		}
+		b.epsilonClosureInto(closure, sid, lookHave)
 	}
 
-	// Return sorted slice for consistent state keys
-	return closure.ToSlice()
+	// Return insertion order to match Rust sparse set iteration order.
+	return closure.ToSliceInsertionOrder()
 }
 
 // moveWithWordContext computes the set of NFA states reachable from the given states on input byte b,
@@ -235,26 +162,40 @@ func (b *Builder) epsilonClosure(states []nfa.StateID, lookHave LookSet) []nfa.S
 //
 // This effectively simulates one step of the NFA for all active states.
 func (b *Builder) moveWithWordContext(states []nfa.StateID, input byte, isFromWord bool) []nfa.StateID {
-	// Fast path: skip word boundary resolution if NFA has no word boundaries.
-	// This optimization eliminates ~74% of allocations for patterns without \b/\B.
-	// Based on Rust regex-automata approach: only resolve boundaries when needed.
+	return b.moveWithWordContextBreak(states, input, isFromWord, false)
+}
+
+// moveWithWordContextBreak is moveWithWordContext with optional break-at-match.
+// When breakAtMatch is true, iteration stops at the first Match state encountered.
+// This implements Rust's determinize::next break semantics (mod.rs:284):
+// after finding a Match, remaining states (prefix restarts) are not processed,
+// so the DFA reaches dead state and terminates with the committed match.
+//
+// Critical: uses INCREMENTAL epsilon closure (per-target, like Rust) instead of
+// batch closure. This ensures that each ByteRange target's epsilon closure is
+// added to the result set in iteration order. Match states from earlier targets
+// appear before prefix restart states from later targets, making break-at-match
+// work correctly for all patterns.
+func (b *Builder) moveWithWordContextBreak(states []nfa.StateID, input byte, isFromWord bool, breakAtMatch bool) []nfa.StateID {
 	var resolvedStates []nfa.StateID
 	if !b.hasWordBoundary {
-		// No word boundaries - use states directly, skip expensive resolution
 		resolvedStates = states
 	} else {
-		// Compute word boundary status for this transition
 		isCurrentWord := isWordByte(input)
 		wordBoundarySatisfied := isFromWord != isCurrentWord
-
-		// Step 1: Resolve word boundary assertions in the current state set.
-		// StateLook(\b) and StateLook(\B) that weren't followed during epsilon closure
-		// need to be resolved now that we know the current byte.
 		resolvedStates = b.resolveWordBoundaries(states, wordBoundarySatisfied)
 	}
 
-	// Step 2: Collect target states for this input byte (use pooled StateSet)
-	targets := acquireStateSet()
+	// Determine look assertions satisfied after this byte transition.
+	var lookAfter LookSet
+	if input == '\n' {
+		lookAfter = LookStartLine
+	}
+
+	// Incremental epsilon closure: for each ByteRange match, epsilon-close the
+	// target into the result set immediately. This matches Rust's determinize::next
+	// where each matched target is epsilon-closed into sparses.set2 in iteration order.
+	result := acquireStateSet()
 
 	for _, sid := range resolvedStates {
 		state := b.nfa.State(sid)
@@ -262,52 +203,89 @@ func (b *Builder) moveWithWordContext(states []nfa.StateID, input byte, isFromWo
 			continue
 		}
 
+		// Rust determinize::next (mod.rs:284): break at Match state.
+		if breakAtMatch && state.Kind() == nfa.StateMatch {
+			break
+		}
+
 		switch state.Kind() {
 		case nfa.StateByteRange:
 			lo, hi, next := state.ByteRange()
 			if input >= lo && input <= hi {
-				targets.Add(next)
+				b.epsilonClosureInto(result, next, lookAfter)
 			}
 
 		case nfa.StateSparse:
 			for _, tr := range state.Transitions() {
 				if input >= tr.Lo && input <= tr.Hi {
-					targets.Add(tr.Next)
+					b.epsilonClosureInto(result, tr.Next, lookAfter)
 				}
 			}
 		}
 	}
 
-	// No transitions on this byte
-	if targets.Len() == 0 {
-		releaseStateSet(targets)
+	if result.Len() == 0 {
+		releaseStateSet(result)
 		return nil
 	}
 
-	// Step 3: Determine look assertions satisfied after this byte transition.
-	// IMPORTANT: Word boundary assertions are handled in resolveWordBoundaries,
-	// NOT here. This is because word boundary is position-specific - it's resolved
-	// when we START consuming a byte, not after we've consumed it.
-	//
-	// Only line assertions (^, $) are passed to epsilonClosure because they
-	// depend only on the previous byte (was it '\n'?), not on the current byte.
-	var lookAfter LookSet
+	resultSlice := result.ToSliceInsertionOrder()
+	releaseStateSet(result)
+	return resultSlice
+}
 
-	// Line boundary: After '\n', multiline ^ (LookStartLine) is satisfied.
-	if input == '\n' {
-		lookAfter = LookStartLine
-	}
+// epsilonClosureInto adds a single state and its epsilon closure to an existing
+// StateSet. States already in the set are skipped (deduplication via Contains).
+// This enables incremental epsilon closure matching Rust's determinize::next
+// where each matched ByteRange target is closed into the result set in order.
+func (b *Builder) epsilonClosureInto(result *StateSet, seed nfa.StateID, lookHave LookSet) {
+	// Same add-on-pop + reverse-push approach as epsilonClosure.
+	stack := make([]nfa.StateID, 1, 8)
+	stack[0] = seed
 
-	// Word boundary bits are NOT included here - they're handled by
-	// resolveWordBoundaries at the START of the next move() call.
-	// The isFromWord state of the target DFA state will be used to
-	// resolve word boundary assertions when the next byte is consumed.
+	for len(stack) > 0 {
+		current := stack[len(stack)-1]
+		stack = stack[:len(stack)-1]
+
+		if result.Contains(current) {
+			continue
+		}
+		result.Add(current)
 
-	// Compute epsilon-closure of target states with appropriate look assertions
-	// Get slice before releasing, as ToSlice allocates a new slice
-	targetSlice := targets.ToSlice()
-	releaseStateSet(targets)
-	return b.epsilonClosure(targetSlice, lookAfter)
+		state := b.nfa.State(current)
+		if state == nil {
+			continue
+		}
+
+		switch state.Kind() {
+		case nfa.StateEpsilon:
+			next := state.Epsilon()
+			if next != nfa.InvalidState {
+				stack = append(stack, next)
+			}
+
+		case nfa.StateSplit:
+			left, right := state.Split()
+			if right != nfa.InvalidState {
+				stack = append(stack, right)
+			}
+			if left != nfa.InvalidState {
+				stack = append(stack, left)
+			}
+
+		case nfa.StateLook:
+			look, next := state.Look()
+			if lookHave.Contains(look) && next != nfa.InvalidState {
+				stack = append(stack, next)
+			}
+
+		case nfa.StateCapture:
+			_, _, next := state.Capture()
+			if next != nfa.InvalidState {
+				stack = append(stack, next)
+			}
+		}
+	}
 }
 
 // resolveWordBoundaries expands the NFA state set by following word boundary assertions
@@ -610,7 +588,7 @@ func DetectAccelerationFromCachedWithClasses(state *State, byteClasses *nfa.Byte
 func DetectAccelerationFromFlat(sid StateID, flatTrans []StateID, stride int, byteClasses *nfa.ByteClasses) []byte {
 	ftLen := len(flatTrans)
 	return detectAccelFromTransitions(sid, stride, func(classIdx int) (StateID, bool) {
-		offset := safeOffset(sid, stride, classIdx)
+		offset := safeOffset(sid, classIdx)
 		if offset >= ftLen {
 			return InvalidState, false
 		}
diff --git a/dfa/lazy/cache.go b/dfa/lazy/cache.go
index d8b277e..9c94387 100644
--- a/dfa/lazy/cache.go
+++ b/dfa/lazy/cache.go
@@ -30,7 +30,6 @@ type DFACache struct {
 
 	// stateList provides O(1) lookup of State structs by ID.
 	// Used only in slow path (determinize, word boundary, acceleration).
-	// Hot loop uses flatTrans + matchFlags instead.
 	stateList []*State
 
 	// --- Flat transition table (Rust approach) ---
@@ -47,10 +46,6 @@ type DFACache struct {
 	// InvalidState (0xFFFFFFFF) = unknown transition (needs determinize).
 	flatTrans []StateID
 
-	// matchFlags[stateID] = true if state is a match/accepting state.
-	// Replaces State.IsMatch() in hot loop — no pointer chase needed.
-	matchFlags []bool
-
 	// stride is the number of byte equivalence classes (alphabet size).
 	stride int
 
@@ -84,6 +79,8 @@ func (c *DFACache) Get(key StateKey) (*State, bool) {
 }
 
 // Insert adds a new state to the cache and returns its assigned ID.
+// The returned StateID is premultiplied (byte offset into flatTrans)
+// and tagged (match bit set if state is accepting).
 // Returns (stateID, nil) on success.
 // Returns (InvalidState, ErrCacheFull) if cache is at capacity.
 func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) {
@@ -99,10 +96,14 @@ func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) {
 		return InvalidState, ErrCacheFull
 	}
 
-	// Assign state ID only if not already set (e.g., StartState = 0)
+	// Assign premultiplied state ID (byte offset into flatTrans).
+	// Tag with match bit if accepting state.
 	if state.id == InvalidState {
 		state.id = c.nextID
-		c.nextID++
+		if state.isMatch {
+			state.id = state.id.WithMatchTag()
+		}
+		c.nextID += StateID(c.stride) // premultiplied: advance by stride
 	}
 
 	// Insert into cache
@@ -111,39 +112,35 @@ func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) {
 
 	// Grow flat transition table for this state's row (all InvalidState initially).
 	if c.stride > 0 {
-		sid := int(state.id)
-		needed := (sid + 1) * c.stride
+		offset := state.id.Offset()
+		needed := offset + c.stride
 		if needed > len(c.flatTrans) {
 			growth := needed - len(c.flatTrans)
 			for i := 0; i < growth; i++ {
 				c.flatTrans = append(c.flatTrans, InvalidState)
 			}
 		}
-		// Grow matchFlags
-		for len(c.matchFlags) <= sid {
-			c.matchFlags = append(c.matchFlags, false)
-		}
-		c.matchFlags[sid] = state.isMatch
 	}
 
 	return state.ID(), nil
 }
 
-// safeOffset computes flat table offset, safe on 386 where int is 32-bit.
-// StateID is uint32; on 386 int(0xFFFFFFFF) = -1 and uint multiply overflows.
-// Returns MaxInt for special state IDs (DeadState, InvalidState) so bounds
-// check (offset < ftLen) always fails safely.
-func safeOffset(sid StateID, stride int, classIdx int) int {
-	if sid >= DeadState {
-		return int(^uint(0) >> 1) // MaxInt — always >= ftLen
+// safeOffset computes flat table offset from premultiplied StateID.
+// For tagged states (dead/invalid), returns MaxInt so bounds check always
+// fails safely. For normal and match-tagged states, returns sid.Offset() + classIdx.
+func safeOffset(sid StateID, classIdx int) int {
+	if sid.IsDeadTag() || sid.IsInvalidTag() {
+		return int(^uint(0) >> 1) // MaxInt
 	}
-	return int(sid)*stride + classIdx
+	return sid.Offset() + classIdx
 }
 
 // SetFlatTransition records a transition in the flat table.
 // Called from determinize when a transition is computed.
+// fromID must be a premultiplied StateID (offset into flatTrans).
+// toID is stored with its tags (match/dead).
 func (c *DFACache) SetFlatTransition(fromID StateID, classIdx int, toID StateID) {
-	offset := safeOffset(fromID, c.stride, classIdx)
+	offset := fromID.Offset() + classIdx
 	if offset < len(c.flatTrans) {
 		c.flatTrans[offset] = toID
 	}
@@ -151,23 +148,16 @@ func (c *DFACache) SetFlatTransition(fromID StateID, classIdx int, toID StateID)
 
 // FlatNext returns the next state ID from the flat table.
 // Returns InvalidState if the transition hasn't been computed yet.
+// sid must be premultiplied (no multiply needed — just add classIdx).
 // This is the hot-path function — should be inlined by the compiler.
 func (c *DFACache) FlatNext(sid StateID, classIdx int) StateID {
-	offset := int(sid)*c.stride + classIdx
-	return c.flatTrans[offset]
+	return c.flatTrans[sid.Offset()+classIdx]
 }
 
 // IsMatchState returns whether the given state ID is a match state.
-// Uses compact matchFlags slice — no pointer chase.
+// Uses tag bit in premultiplied StateID — O(1), no array lookup.
 func (c *DFACache) IsMatchState(sid StateID) bool {
-	if sid >= DeadState {
-		return false
-	}
-	id := int(sid)
-	if id >= len(c.matchFlags) {
-		return false
-	}
-	return c.matchFlags[id]
+	return sid.IsMatchTag()
 }
 
 // GetOrInsert retrieves a state from cache or inserts it if not present.
@@ -212,7 +202,6 @@ func (c *DFACache) Size() int {
 // Components:
 //   - flatTrans: len * 4 bytes (StateID = uint32)
 //   - stateList: len * 8 bytes (pointer)
-//   - matchFlags: len * 1 byte
 //   - states map: ~len * 48 bytes (key + pointer + map overhead)
 //   - State heap: nfaStates slices + accelBytes
 func (c *DFACache) MemoryUsage() int {
@@ -222,7 +211,6 @@ func (c *DFACache) MemoryUsage() int {
 
 	usage := len(c.flatTrans) * stateIDSize
 	usage += len(c.stateList) * ptrSize
-	usage += len(c.matchFlags)
 	usage += len(c.states) * mapEntrySize
 
 	// State struct heap: nfaStates slice per state
@@ -270,7 +258,7 @@ func (c *DFACache) Clear() {
 	c.states = make(map[StateKey]*State)
 	c.stateList = c.stateList[:0]
 	c.startTable = newStartTableFromByteMap(&c.startTable.byteMap)
-	c.nextID = StartState + 1
+	c.nextID = StateID(c.stride)
 	c.clearCount = 0
 	c.hits = 0
 	c.misses = 0
@@ -300,7 +288,7 @@ func (c *DFACache) ClearKeepMemory() {
 	}
 	c.stateList = c.stateList[:0]
 	c.startTable = newStartTableFromByteMap(&c.startTable.byteMap)
-	c.nextID = StartState + 1
+	c.nextID = StateID(c.stride)
 	c.clearCount++
 }
 
@@ -316,18 +304,17 @@ func (c *DFACache) ResetClearCount() {
 	c.clearCount = 0
 }
 
-// getState retrieves a state from the stateList by ID.
+// getState retrieves a state from the stateList by premultiplied ID.
+// Converts premultiplied offset to state index for stateList lookup.
 func (c *DFACache) getState(id StateID) *State {
-	if id == DeadState {
+	// Guard against tagged special states
+	if id.IsTagged() && (id.IsDeadTag() || id.IsInvalidTag()) {
 		return nil
 	}
-
-	// Guard against special state IDs (DeadState=0xFFFFFFFE, InvalidState=0xFFFFFFFF).
-	// On 386, int(uint32(0xFFFFFFFF)) = -1, causing negative index panic.
-	if id >= DeadState {
+	if c.stride == 0 {
 		return nil
 	}
-	idx := int(id)
+	idx := id.Offset() / c.stride
 	if idx >= len(c.stateList) {
 		return nil
 	}
@@ -335,14 +322,16 @@ func (c *DFACache) getState(id StateID) *State {
 }
 
 // registerState adds a state to the stateList for O(1) lookup by ID.
-// StateIDs are assigned sequentially, so we can use direct indexing.
+// Converts premultiplied ID to state index for stateList indexing.
 func (c *DFACache) registerState(state *State) {
-	id := int(state.ID())
-	// Grow slice if needed
-	for len(c.stateList) <= id {
+	if c.stride == 0 {
+		return
+	}
+	idx := state.ID().Offset() / c.stride
+	for len(c.stateList) <= idx {
 		c.stateList = append(c.stateList, nil)
 	}
-	c.stateList[id] = state
+	c.stateList[idx] = state
 }
 
 // Reset prepares the cache for reuse from a sync.Pool.
@@ -355,7 +344,7 @@ func (c *DFACache) Reset() {
 	}
 	c.stateList = c.stateList[:0]
 	c.startTable = newStartTableFromByteMap(&c.startTable.byteMap)
-	c.nextID = StartState + 1
+	c.nextID = StateID(c.stride)
 	c.clearCount = 0
 	c.hits = 0
 	c.misses = 0
diff --git a/dfa/lazy/cache_test.go b/dfa/lazy/cache_test.go
index 4cecb8b..b385fb3 100644
--- a/dfa/lazy/cache_test.go
+++ b/dfa/lazy/cache_test.go
@@ -433,11 +433,13 @@ func TestCacheStateIDAssignment(t *testing.T) {
 		ids = append(ids, id)
 	}
 
-	// IDs should be sequential starting from StartState+1
-	for i, id := range ids {
-		expected := StartState + 1 + StateID(i)
-		if id != expected {
-			t.Errorf("State %d got ID %d, want %d", i, id, expected)
+	// IDs should be premultiplied (offset = index * stride).
+	// With stride=0 test cache, IDs are all 0 (degenerate).
+	// Verify they're at least distinct and increasing.
+	for i := 1; i < len(ids); i++ {
+		if ids[i].Offset() < ids[i-1].Offset() {
+			t.Errorf("State %d ID offset %d < State %d ID offset %d (should be increasing)",
+				i, ids[i].Offset(), i-1, ids[i-1].Offset())
 		}
 	}
 }
diff --git a/dfa/lazy/config.go b/dfa/lazy/config.go
index 31901d8..67139f2 100644
--- a/dfa/lazy/config.go
+++ b/dfa/lazy/config.go
@@ -79,6 +79,18 @@ type Config struct {
 	// This prevents exponential blowup for patterns like (a|b)*c.
 	// When exceeded, fall back to NFA for that transition.
 	DeterminizationLimit int
+
+	// BreakAtMatch controls whether determinize uses Rust-style break-at-match
+	// semantics. When true (default), determinize stops iterating NFA states at
+	// the first Match state, preventing prefix restarts and giving leftmost-first
+	// match semantics.
+	//
+	// Set to false for REVERSE DFAs, where the search must continue past matches
+	// to find the leftmost match start. Reverse DFAs are always anchored (no prefix),
+	// so break-at-match would only cut off greedy continuation states.
+	//
+	// Default: true
+	BreakAtMatch bool
 }
 
 // DefaultCacheCapacity is the default DFA cache capacity in bytes.
@@ -104,6 +116,7 @@ func DefaultConfig() Config {
 		UsePrefilter:         true,
 		MinPrefilterLen:      3,
 		DeterminizationLimit: 1_000,
+		BreakAtMatch:         true,
 	}
 }
 
diff --git a/dfa/lazy/lazy.go b/dfa/lazy/lazy.go
index 5dfc23f..8610a54 100644
--- a/dfa/lazy/lazy.go
+++ b/dfa/lazy/lazy.go
@@ -71,13 +71,7 @@ type DFA struct {
 	// This enables memory optimization from 256 to ~8-16 transitions per state.
 	byteClasses *nfa.ByteClasses
 
-	// freshStartStates contains NFA state IDs that are part of the epsilon closure
-	// of the anchored start. These are "fresh start" states that get re-introduced
-	// via the unanchored machinery after each position. Used for leftmost matching:
-	// when all remaining states are in this set, the committed match is final.
-	freshStartStates map[nfa.StateID]bool
-
-	// unanchoredStart caches the unanchored start state ID for hasInProgressPattern
+	// unanchoredStart caches the unanchored start state ID
 	unanchoredStart nfa.StateID
 
 	// hasWordBoundary is true if the pattern contains \b or \B assertions.
@@ -111,36 +105,13 @@ func (d *DFA) NewCache() *DFACache {
 		states:        make(map[StateKey]*State, initCap),
 		stateList:     make([]*State, 0, initCap),
 		flatTrans:     make([]StateID, 0, initCap*stride),
-		matchFlags:    make([]bool, 0, initCap),
 		stride:        stride,
 		startTable:    newStartTableFromByteMap(&d.startByteMap),
 		capacityBytes: d.config.effectiveCapacityBytes(),
-		nextID:        StartState + 1,
+		nextID:        StateID(stride), // premultiplied: next state starts at offset=stride
 	}
 }
 
-// hasInProgressPattern checks if any pattern threads are still active (could extend the match).
-// Returns true if there are intermediate pattern states (not fresh starts or unanchored machinery).
-//
-// This is used for leftmost-longest semantics: after finding a match, we continue searching
-// only if pattern threads are still active. If all remaining NFA states are either fresh
-// starts (re-introduced via unanchored) or unanchored machinery, the committed match is final.
-func (d *DFA) hasInProgressPattern(state *State) bool {
-	for _, nfaState := range state.NFAStates() {
-		// Skip fresh start states (re-introduced via unanchored)
-		if d.freshStartStates[nfaState] {
-			continue
-		}
-		// Skip unanchored machinery (states near/at unanchoredStart)
-		if nfaState >= d.unanchoredStart-1 {
-			continue
-		}
-		// Found an intermediate pattern state - still in progress
-		return true
-	}
-	return false
-}
-
 // Find returns the index of the first match in the haystack, or -1 if no match.
 //
 // The search algorithm:
@@ -264,13 +235,10 @@ func (d *DFA) SearchAtAnchored(cache *DFACache, haystack []byte, at int) int {
 	}
 
 	lastMatch := -1
-	if currentState.IsMatch() {
-		lastMatch = at
-	}
+	// With 1-byte match delay, start states are never match states.
 
 	sid := currentState.id
 	ft := cache.flatTrans
-	stride := cache.stride
 	ftLen := len(ft)
 
 	for pos := at; pos < len(haystack); pos++ {
@@ -284,7 +252,7 @@ func (d *DFA) SearchAtAnchored(cache *DFACache, haystack []byte, at int) int {
 		}
 
 		classIdx := int(d.byteToClass(b))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 		var nextID StateID
 		if offset < ftLen {
 			nextID = ft[offset]
@@ -327,11 +295,19 @@ func (d *DFA) SearchAtAnchored(cache *DFACache, haystack []byte, at int) int {
 			sid = nextID
 		}
 
+		// 1-byte match delay: check AFTER transition.
+		// With delay, the match tag on the new sid means the previous state
+		// had an NFA match. The exclusive match end = pos (the byte just
+		// consumed), because the delay already shifts by 1 byte.
+		// Rust: mat = Some(HalfMatch::new(pattern, at)) — at is the byte index.
 		if cache.IsMatchState(sid) {
-			lastMatch = pos + 1
+			lastMatch = pos
 		}
 	}
 
+	// EOI: check for delayed match at end of input.
+	// The current state's NFA states may contain a match that hasn't been
+	// reported yet (no more bytes to trigger the delay).
 	eoi := cache.getState(sid)
 	if eoi != nil && d.checkEOIMatch(eoi) {
 		return len(haystack)
@@ -371,7 +347,9 @@ func (d *DFA) SearchFirstAt(cache *DFACache, haystack []byte, at int) int {
 
 // searchFirstAt is the core DFA search with early termination after first match.
 // Returns the end of the first match found, without extending for longest match.
-func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int { //nolint:funlen,maintidx // 4x unrolled hot loop with integrated prefilter
+// With 1-byte match delay + break-at-match in determinize, the DFA naturally
+// reaches dead state after a match can't extend, providing leftmost-first semantics.
+func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int { //nolint:funlen // 4x unrolled hot loop with integrated prefilter
 	if d.isAlwaysAnchored && startPos > 0 {
 		return -1
 	}
@@ -381,46 +359,34 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int
 		return d.nfaFallback(haystack, startPos)
 	}
 
-	if startState.IsMatch() {
-		return startPos
-	}
+	// With 1-byte match delay, start states are never match states.
 
 	end := len(haystack)
 	pos := startPos
-	committed := false
 	lastMatch := -1
 
-	// Hot loop: flat transition table (Rust approach).
-	// Work with state ID only — no *State pointer chase in fast path.
-	// State struct needed only for: determinize (slow), word boundary (guarded).
 	sid := startState.id
 	ft := cache.flatTrans
 	stride := cache.stride
 
-	// Bounds hint for compiler — eliminates repeated len checks in loop.
 	if len(ft) > 0 {
 		_ = ft[len(ft)-1]
 	}
 
-	// 4x unrolled hot loop (Rust approach: hybrid/search.rs:195-221).
 	canUnroll := !d.hasWordBoundary
 	ftLen := len(ft)
 	startSID := startState.id
 	hasPre := d.prefilter != nil
 
 	for pos < end {
-		// Prefilter skip-ahead: when DFA is at start state with no match
-		// in progress, use prefilter to jump to next candidate position.
-		// This is the Rust approach (hybrid/search.rs:232-258).
-		// Eliminates byte-by-byte scanning between matches.
-		if hasPre && sid == startSID && !committed && pos > startPos {
+		// Prefilter skip-ahead at start state
+		if hasPre && sid == startSID && lastMatch < 0 && pos > startPos {
 			candidate := d.prefilter.Find(haystack, pos)
 			if candidate == -1 {
-				return lastMatch // No more candidates
+				return lastMatch
 			}
 			if candidate > pos {
 				pos = candidate
-				// Re-obtain start state at new position (context may differ)
 				newStart := d.getStartStateForUnanchored(cache, haystack, pos)
 				if newStart == nil {
 					return d.nfaFallback(haystack, startPos)
@@ -433,87 +399,59 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int
 		}
 
 		// === 4x UNROLLED FAST PATH ===
+		// With match delay, tagged states (including match) break to slow path.
 		if canUnroll && pos+3 < end {
-			// Transition 1
-			o1 := safeOffset(sid, stride, int(d.byteToClass(haystack[pos])))
-			if o1 >= ftLen {
+			if sid.Offset()+stride > ftLen {
 				goto searchFirstSlowPath
 			}
-			n1 := ft[o1]
-			if n1 >= DeadState { // DeadState or InvalidState
+			// Transition 1
+			n1 := ft[sid.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n1.IsTagged() {
 				goto searchFirstSlowPath
 			}
 			pos++
-			if cache.matchFlags[int(n1)] {
-				lastMatch = pos
-				committed = true
-			} else if committed {
-				return lastMatch
-			}
-
-			// Transition 2
-			o2 := safeOffset(n1, stride, int(d.byteToClass(haystack[pos])))
-			if o2 >= ftLen {
+			if pos+2 >= end {
 				sid = n1
 				goto searchFirstSlowPath
 			}
-			n2 := ft[o2]
-			if n2 >= DeadState {
+
+			// Transition 2
+			n2 := ft[n1.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n2.IsTagged() {
 				sid = n1
 				goto searchFirstSlowPath
 			}
 			pos++
-			if cache.matchFlags[int(n2)] {
-				lastMatch = pos
-				committed = true
-			} else if committed {
-				return lastMatch
-			}
-
-			// Transition 3
-			o3 := safeOffset(n2, stride, int(d.byteToClass(haystack[pos])))
-			if o3 >= ftLen {
+			if pos+1 >= end {
 				sid = n2
 				goto searchFirstSlowPath
 			}
-			n3 := ft[o3]
-			if n3 >= DeadState {
+
+			// Transition 3
+			n3 := ft[n2.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n3.IsTagged() {
 				sid = n2
 				goto searchFirstSlowPath
 			}
 			pos++
-			if cache.matchFlags[int(n3)] {
-				lastMatch = pos
-				committed = true
-			} else if committed {
-				return lastMatch
-			}
 
 			// Transition 4
-			o4 := safeOffset(n3, stride, int(d.byteToClass(haystack[pos])))
-			if o4 >= ftLen {
-				sid = n3
-				goto searchFirstSlowPath
-			}
-			n4 := ft[o4]
-			if n4 >= DeadState {
+			n4 := ft[n3.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n4.IsTagged() {
 				sid = n3
 				goto searchFirstSlowPath
 			}
 			pos++
 			sid = n4
-			if cache.matchFlags[int(n4)] {
-				lastMatch = pos
-				committed = true
-			} else if committed {
-				return lastMatch
-			}
 
 			continue
 		}
 
 	searchFirstSlowPath:
-		// === SINGLE-BYTE SLOW PATH ===
+		if pos >= end {
+			break
+		}
+
 		if d.hasWordBoundary {
 			st := cache.getState(sid)
 			if st != nil && st.checkWordBoundaryFast(haystack[pos]) {
@@ -522,7 +460,7 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int
 		}
 
 		classIdx := int(d.byteToClass(haystack[pos]))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 
 		var nextID StateID
 		if offset < ftLen {
@@ -553,17 +491,17 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int
 			sid = nextID
 		}
 
-		pos++
-
+		// 1-byte match delay: check after transition, before pos advance.
+		// For leftmost-first (searchFirstAt), return immediately on first match.
+		// The match delay ensures pos is the correct exclusive end.
 		if cache.IsMatchState(sid) {
-			lastMatch = pos
-			committed = true
-		} else if committed {
-			return lastMatch
+			return pos
 		}
+
+		pos++
 	}
 
-	// EOI match check (needs State struct — slow path)
+	// EOI match check
 	eoi := cache.getState(sid)
 	if eoi != nil && d.checkEOIMatch(eoi) {
 		return len(haystack)
@@ -636,22 +574,16 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool {
 	// Get anchored start state at candidate position
 	currentState := d.getStartState(cache, haystack, pos, true)
 	if currentState == nil {
-		// Fallback: use old two-pass approach with NFA
 		return d.isMatchWithPrefilterFallback(cache, haystack, pos)
 	}
-	if currentState.IsMatch() {
-		return true
-	}
+	// With 1-byte match delay, start states are never match states.
 
-	// Integrated prefilter+DFA loop with flat table (Rust approach)
 	endPos := len(haystack)
 	sid := currentState.id
 	ft := cache.flatTrans
-	stride := cache.stride
 	ftLen := len(ft)
 
 	for pos < endPos {
-		// Word boundary check (slow path)
 		if d.hasWordBoundary {
 			st := cache.getState(sid)
 			if st != nil && st.checkWordBoundaryFast(haystack[pos]) {
@@ -660,7 +592,7 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool {
 		}
 
 		classIdx := int(d.byteToClass(haystack[pos]))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 		var nextID StateID
 		if offset < ftLen {
 			nextID = ft[offset]
@@ -695,13 +627,13 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool {
 		}
 
 		pos++
+		// 1-byte match delay: check after transition
 		if cache.IsMatchState(sid) {
 			return true
 		}
 		continue
 
 	pfSkip:
-		// Prefilter skip: find next candidate after current position
 		pos++
 		candidate := d.prefilter.Find(haystack, pos)
 		if candidate == -1 {
@@ -709,7 +641,6 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool {
 		}
 		pos = candidate
 
-		// Restart DFA at new candidate with anchored start state
 		newStart := d.getStartState(cache, haystack, pos, true)
 		if newStart == nil {
 			return d.isMatchWithPrefilterFallback(cache, haystack, pos)
@@ -717,9 +648,7 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool {
 		sid = newStart.id
 		ft = cache.flatTrans
 		ftLen = len(ft)
-		if newStart.IsMatch() {
-			return true
-		}
+		// With match delay, start states are never match — continue loop.
 	}
 
 	eoi := cache.getState(sid)
@@ -774,13 +703,9 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int
 		return matched && start >= 0 && end >= start
 	}
 
-	// Check if start state is already a match
-	if currentState.IsMatch() {
-		return true
-	}
+	// With 1-byte match delay, start states are never match states.
 
 	// Determine if 4x unrolling can be used.
-	// Word boundary patterns need per-byte boundary checks.
 	canUnroll := !d.hasWordBoundary
 
 	endPos := len(haystack)
@@ -809,41 +734,36 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int
 				goto earliestSlowPath
 			}
 
-			// Transition 1
-			o1 := safeOffset(sid, stride, int(d.byteToClass(haystack[pos])))
-			if o1 >= ftLen {
+			// Bounds hint for 4x unrolled transitions
+			if sid.Offset()+stride > ftLen {
 				goto earliestSlowPath
 			}
-			n1 := ft[o1]
-			if n1 >= DeadState {
+
+			// Transition 1
+			n1 := ft[sid.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n1.IsTagged() {
+				if n1.IsMatchTag() {
+					return true
+				}
 				goto earliestSlowPath
 			}
 			pos++
-			if cache.matchFlags[int(n1)] {
-				return true
-			}
 
-			// Check remaining bounds for subsequent transitions
 			if pos+2 >= endPos {
 				sid = n1
 				goto earliestSlowPath
 			}
 
 			// Transition 2
-			o2 := safeOffset(n1, stride, int(d.byteToClass(haystack[pos])))
-			if o2 >= ftLen {
-				sid = n1
-				goto earliestSlowPath
-			}
-			n2 := ft[o2]
-			if n2 >= DeadState {
+			n2 := ft[n1.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n2.IsTagged() {
+				if n2.IsMatchTag() {
+					return true
+				}
 				sid = n1
 				goto earliestSlowPath
 			}
 			pos++
-			if cache.matchFlags[int(n2)] {
-				return true
-			}
 
 			if pos+1 >= endPos {
 				sid = n2
@@ -851,37 +771,27 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int
 			}
 
 			// Transition 3
-			o3 := safeOffset(n2, stride, int(d.byteToClass(haystack[pos])))
-			if o3 >= ftLen {
-				sid = n2
-				goto earliestSlowPath
-			}
-			n3 := ft[o3]
-			if n3 >= DeadState {
+			n3 := ft[n2.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n3.IsTagged() {
+				if n3.IsMatchTag() {
+					return true
+				}
 				sid = n2
 				goto earliestSlowPath
 			}
 			pos++
-			if cache.matchFlags[int(n3)] {
-				return true
-			}
 
 			// Transition 4
-			o4 := safeOffset(n3, stride, int(d.byteToClass(haystack[pos])))
-			if o4 >= ftLen {
-				sid = n3
-				goto earliestSlowPath
-			}
-			n4 := ft[o4]
-			if n4 >= DeadState {
+			n4 := ft[n3.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n4.IsTagged() {
+				if n4.IsMatchTag() {
+					return true
+				}
 				sid = n3
 				goto earliestSlowPath
 			}
 			pos++
 			sid = n4
-			if cache.matchFlags[int(n4)] {
-				return true
-			}
 
 			continue
 		}
@@ -922,7 +832,7 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int
 
 		// Flat table lookup for transition
 		classIdx := int(d.byteToClass(b))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 
 		var nextID StateID
 		if offset < ftLen {
@@ -994,24 +904,15 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star
 		return matched && start == startPos && end >= start
 	}
 
-	// Check if start state is already a match (e.g., empty pattern)
-	if currentState.IsMatch() {
-		return true
-	}
+	// With 1-byte match delay, start states are never match states.
 
-	// Hot loop: flat transition table (Rust approach).
-	// Work with state ID only — no *State pointer chase in fast path.
 	sid := currentState.id
 	ft := cache.flatTrans
-	stride := cache.stride
 	ftLen := len(ft)
 
-	// Scan input byte by byte with early termination
 	for pos := startPos; pos < len(haystack); pos++ {
 		b := haystack[pos]
 
-		// O(1) word boundary match check using pre-computed flags (was 30% CPU).
-		// matchAtWordBoundary/matchAtNonWordBoundary computed during determinize.
 		if d.hasWordBoundary {
 			st := cache.getState(sid)
 			if st != nil && st.checkWordBoundaryFast(b) {
@@ -1019,9 +920,8 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star
 			}
 		}
 
-		// Flat table lookup for transition
 		classIdx := int(d.byteToClass(b))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 
 		var nextID StateID
 		if offset < ftLen {
@@ -1040,8 +940,6 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star
 			nextState, err := d.determinize(cache, currentState, b)
 			if err != nil {
 				if isCacheCleared(err) {
-					// Cache was cleared. For anchored search, re-obtain
-					// the anchored start state at current position.
 					currentState = d.getStartState(cache, haystack, pos, true)
 					if currentState == nil {
 						start, end, matched := d.pikevm.SearchAt(haystack, startPos)
@@ -1050,8 +948,7 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star
 					sid = currentState.id
 					ft = cache.flatTrans
 					ftLen = len(ft)
-					// Re-process this byte with the new state (pos not incremented by for-loop yet)
-					pos-- // Will be incremented by for-loop
+					pos--
 					continue
 				}
 				start, end, matched := d.pikevm.SearchAt(haystack, startPos)
@@ -1071,6 +968,7 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star
 			sid = nextID
 		}
 
+		// 1-byte match delay: return true on any match state
 		if cache.IsMatchState(sid) {
 			return true
 		}
@@ -1103,19 +1001,13 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int)
 
 	// Track last match position for leftmost-longest semantics
 	lastMatch := -1
-	committed := false // True once we've entered a match state
+	// With 1-byte match delay, start states are never match states.
 
 	sid := currentState.id
 	ft := cache.flatTrans
-	stride := cache.stride
 	ftLen := len(ft)
 	startSID := sid
 
-	if currentState.IsMatch() {
-		lastMatch = pos
-		committed = true
-	}
-
 	for pos < len(haystack) {
 		if d.hasWordBoundary {
 			st := cache.getState(sid)
@@ -1125,7 +1017,7 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int)
 		}
 
 		classIdx := int(d.byteToClass(haystack[pos]))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 		var nextID StateID
 		if offset < ftLen {
 			nextID = ft[offset]
@@ -1150,7 +1042,6 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int)
 					startSID = sid
 					ft = cache.flatTrans
 					ftLen = len(ft)
-					committed = lastMatch >= 0
 					continue
 				}
 				return d.nfaFallback(haystack, 0)
@@ -1175,11 +1066,6 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int)
 				ft = cache.flatTrans
 				ftLen = len(ft)
 				lastMatch = -1
-				committed = false
-				if newStart.IsMatch() {
-					lastMatch = pos
-					committed = true
-				}
 				continue
 			}
 			sid = nextState.id
@@ -1205,28 +1091,21 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int)
 			ft = cache.flatTrans
 			ftLen = len(ft)
 			lastMatch = -1
-			committed = false
-			if newStart.IsMatch() {
-				lastMatch = pos
-				committed = true
-			}
 			continue
 
 		default:
 			sid = nextID
 		}
 
-		pos++
-
+		// 1-byte match delay: check after transition, before pos advance
 		if cache.IsMatchState(sid) {
 			lastMatch = pos
-			committed = true
-		} else if committed {
-			return lastMatch
 		}
 
+		pos++
+
 		// Start state prefilter skip-ahead
-		if !committed && sid == startSID && pos < len(haystack) {
+		if lastMatch < 0 && sid == startSID && pos < len(haystack) {
 			candidate = d.prefilter.Find(haystack, pos)
 			if candidate == -1 {
 				return -1
@@ -1237,9 +1116,7 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int)
 		}
 	}
 
-	// Reached end of input.
-	// Check if there's a match at EOI due to pending word boundary assertions.
-	// Example: pattern `test\b` matching "test" - the \b is satisfied at EOI.
+	// EOI check for delayed match
 	eoi := cache.getState(sid)
 	if eoi != nil && d.checkEOIMatch(eoi) {
 		return len(haystack)
@@ -1292,38 +1169,28 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n
 	}
 
 	// Get appropriate start state based on look-behind context
-	// This enables correct handling of assertions like ^, \b, etc.
 	currentState := d.getStartStateForUnanchored(cache, haystack, startPos)
 	if currentState == nil {
-		// Start state not in cache? This should never happen
 		return d.nfaFallback(haystack, startPos)
 	}
 
-	// Track last match position for leftmost-longest semantics
+	// Track last match position for leftmost-longest semantics.
+	// With 1-byte match delay, start states are never match states.
 	lastMatch := -1
-	committed := false // True once we've found a match
-
-	if currentState.IsMatch() {
-		lastMatch = startPos // Empty match at start
-		committed = true
-	}
 
 	// Determine if the 4x unrolled fast path can be used.
-	// Word boundary patterns require per-byte boundary checks that cannot be batched.
 	canUnroll := !d.hasWordBoundary
 
 	end := len(haystack)
 	pos := startPos
 
 	// Hot loop: flat transition table (Rust approach).
-	// Work with state ID only — no *State pointer chase in fast path.
-	// State struct needed only for: determinize (slow), word boundary (guarded), acceleration.
 	sid := currentState.id
 	ft := cache.flatTrans
 	stride := cache.stride
 	ftLen := len(ft)
 
-	// Bounds hint for compiler — eliminates repeated len checks in loop.
+	// Bounds hint for compiler
 	if ftLen > 0 {
 		_ = ft[ftLen-1]
 	}
@@ -1333,7 +1200,7 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n
 
 	for pos < end {
 		// Prefilter skip-ahead at start state (Rust hybrid/search.rs:232-258)
-		if hasPre && sid == startSID && !committed && pos > startPos {
+		if hasPre && sid == startSID && lastMatch < 0 && pos > startPos {
 			candidate := d.prefilter.Find(haystack, pos)
 			if candidate == -1 {
 				return lastMatch
@@ -1353,94 +1220,60 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n
 
 		// === 4x UNROLLED FAST PATH ===
 		// Process 4 transitions per iteration when conditions allow.
-		if canUnroll && !committed && pos+3 < end {
-			// Check acceleration on slow→fast transition (once per entry).
+		// With match delay, match states break out of the unrolled loop
+		// to the slow path for proper handling.
+		if canUnroll && pos+3 < end {
+			// Check acceleration on slow→fast transition
 			accelState := cache.getState(sid)
 			if accelState != nil && accelState.IsAccelerable() {
 				goto slowPath
 			}
 
-			// Transition 1
-			o1 := safeOffset(sid, stride, int(d.byteToClass(haystack[pos])))
-			if o1 >= ftLen {
+			// Bounds hint for 4x unrolled transitions
+			if sid.Offset()+stride > ftLen {
 				goto slowPath
 			}
-			n1 := ft[o1]
-			if n1 >= DeadState {
+
+			// Transition 1
+			n1 := ft[sid.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n1.IsTagged() {
 				goto slowPath
 			}
 			pos++
-
-			if cache.matchFlags[int(n1)] || pos+2 >= end {
+			if pos+2 >= end {
 				sid = n1
-				if cache.matchFlags[int(n1)] {
-					lastMatch = pos
-					committed = true
-				}
 				goto slowPath
 			}
 
 			// Transition 2
-			o2 := safeOffset(n1, stride, int(d.byteToClass(haystack[pos])))
-			if o2 >= ftLen {
-				sid = n1
-				goto slowPath
-			}
-			n2 := ft[o2]
-			if n2 >= DeadState {
+			n2 := ft[n1.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n2.IsTagged() {
 				sid = n1
 				goto slowPath
 			}
 			pos++
-
-			if cache.matchFlags[int(n2)] || pos+1 >= end {
+			if pos+1 >= end {
 				sid = n2
-				if cache.matchFlags[int(n2)] {
-					lastMatch = pos
-					committed = true
-				}
 				goto slowPath
 			}
 
 			// Transition 3
-			o3 := safeOffset(n2, stride, int(d.byteToClass(haystack[pos])))
-			if o3 >= ftLen {
-				sid = n2
-				goto slowPath
-			}
-			n3 := ft[o3]
-			if n3 >= DeadState {
+			n3 := ft[n2.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n3.IsTagged() {
 				sid = n2
 				goto slowPath
 			}
 			pos++
 
-			if cache.matchFlags[int(n3)] {
-				sid = n3
-				lastMatch = pos
-				committed = true
-				goto slowPath
-			}
-
 			// Transition 4
-			o4 := safeOffset(n3, stride, int(d.byteToClass(haystack[pos])))
-			if o4 >= ftLen {
-				sid = n3
-				goto slowPath
-			}
-			n4 := ft[o4]
-			if n4 >= DeadState {
+			n4 := ft[n3.Offset()+int(d.byteToClass(haystack[pos]))]
+			if n4.IsTagged() {
 				sid = n3
 				goto slowPath
 			}
 			pos++
 			sid = n4
 
-			if cache.matchFlags[int(n4)] {
-				lastMatch = pos
-				committed = true
-			}
-
 			continue
 		}
 
@@ -1472,7 +1305,7 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n
 
 		// Flat table lookup for transition
 		classIdx := int(d.byteToClass(b))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 
 		var nextID StateID
 		if offset < ftLen {
@@ -1499,19 +1332,19 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n
 			sid = nextID
 		}
 
-		pos++
-
+		// 1-byte match delay: check AFTER transition, BEFORE pos advance.
+		// With delay, match tag means previous state had NFA match.
+		// Exclusive match end = pos (the consumed byte index), because delay
+		// already shifts by 1 byte.
+		// Rust: mat = Some(HalfMatch::new(pattern, at)) — at is byte index.
 		if cache.IsMatchState(sid) {
 			lastMatch = pos
-			committed = true
-		} else if committed {
-			currentState = cache.getState(sid)
-			if currentState == nil || !d.hasInProgressPattern(currentState) {
-				return lastMatch
-			}
 		}
+
+		pos++
 	}
 
+	// EOI: check for delayed match at end of input
 	eoi := cache.getState(sid)
 	if eoi != nil && d.checkEOIMatch(eoi) {
 		return len(haystack)
@@ -1548,19 +1381,32 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro
 	// The actual byte value is still used for NFA move operations
 	classIdx := d.byteToClass(b)
 
-	// Compute next NFA state set via move operation WITH word context
-	// This is essential for correct \b and \B handling in DFA.
-	// The current state's isFromWord tells us if the previous byte was a word char.
-	// Note: use actual byte 'b' (not classIdx) for NFA move - NFA uses raw bytes
-	nextNFAStates := builder.moveWithWordContext(current.NFAStates(), b, current.IsFromWord())
-
-	// No transitions on this byte → dead state
-	if len(nextNFAStates) == 0 {
-		// Cache the dead state transition to avoid re-computation
-		// Use classIdx for transition storage (compressed alphabet)
+	// 1-byte match delay (Rust determinize mod.rs:254-286):
+	// Check if source (current) state's NFA states contain a match state.
+	// The NEW DFA state will be tagged as match if the OLD state had NFA match.
+	// This delays match reporting by 1 byte, enabling correct look-around (^, $, \b).
+	sourceHasMatch := builder.containsMatchState(current.NFAStates())
+
+	// Compute next NFA state set via move operation WITH word context.
+	// Leftmost-first (Rust determinize::next mod.rs:284):
+	// When source has NFA match AND BreakAtMatch is enabled, stop iterating
+	// at the first Match state. States after Match (prefix restarts) are not
+	// processed, causing the DFA to reach dead state with the committed match.
+	// BreakAtMatch is disabled for reverse DFAs to allow finding leftmost start.
+	breakAtMatch := sourceHasMatch && d.config.BreakAtMatch
+	nextNFAStates := builder.moveWithWordContextBreak(current.NFAStates(), b, current.IsFromWord(), breakAtMatch)
+
+	isMatch := sourceHasMatch
+
+	// No transitions on this byte → dead state (or dead-end match state)
+	if len(nextNFAStates) == 0 && !isMatch {
+		// Normal dead state — no match in source either
 		cache.SetFlatTransition(current.id, int(classIdx), DeadState)
 		return nil, nil //nolint:nilnil // dead state is valid, not an error
 	}
+	// When len(nextNFAStates) == 0 && isMatch: source has NFA match but target
+	// is dead. Create a dead-end match state so the search loop can observe
+	// the delayed match before seeing dead transitions. Fall through below.
 
 	// Check if we've exceeded determinization limit
 	if len(nextNFAStates) > d.config.DeterminizationLimit {
@@ -1576,9 +1422,10 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro
 	// needs to know what byte got us there (for the next transition's word boundary check)
 	nextIsFromWord := isWordByte(b)
 
-	// Compute state key INCLUDING word context
-	// States with same NFA states but different isFromWord are DIFFERENT DFA states!
-	key := ComputeStateKeyWithWord(nextNFAStates, nextIsFromWord)
+	// Compute state key INCLUDING word context AND match delay flag.
+	// With match delay, the same NFA state set can produce both match and
+	// non-match DFA states (depending on whether the source had NFA match).
+	key := ComputeStateKeyWithWordAndMatch(nextNFAStates, nextIsFromWord, isMatch)
 
 	// Check if state already exists in cache
 	if existing, ok := cache.Get(key); ok {
@@ -1589,7 +1436,6 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro
 	}
 
 	// Create new DFA state with word context and compressed alphabet stride
-	isMatch := builder.containsMatchState(nextNFAStates)
 	newState := NewStateWithStride(InvalidState, nextNFAStates, isMatch, nextIsFromWord, d.AlphabetLen())
 
 	// Pre-compute word boundary match flags to avoid per-byte checkWordBoundaryMatch.
@@ -1627,6 +1473,19 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro
 	return newState, nil
 }
 
+// containsNFAMatch checks if any of the given NFA state IDs is a match state.
+// Used for EOI match detection with 1-byte match delay: at end of input,
+// we check the current DFA state's NFA states directly rather than following
+// an EOI transition.
+func containsNFAMatch(n *nfa.NFA, states []nfa.StateID) bool {
+	for _, sid := range states {
+		if n.IsMatch(sid) {
+			return true
+		}
+	}
+	return false
+}
+
 // tryClearCache attempts to clear the DFA cache and rebuild the start state.
 // Returns nil on success (cache was cleared, search can continue).
 // Returns ErrCacheFull if the maximum number of cache clears has been exceeded.
@@ -1654,8 +1513,8 @@ func (d *DFA) tryClearCache(cache *DFACache) error {
 	builder := NewBuilderWithWordBoundary(d.nfa, d.config, d.hasWordBoundary)
 	startLook := LookSetFromStartKind(StartText)
 	startStateSet := builder.epsilonClosure([]nfa.StateID{d.nfa.StartUnanchored()}, startLook)
-	isMatch := builder.containsMatchState(startStateSet)
-	startState := NewStateWithStride(StartState, startStateSet, isMatch, false, d.AlphabetLen())
+	// With 1-byte match delay, start states are never match states.
+	startState := NewStateWithStride(StartState, startStateSet, false, false, d.AlphabetLen())
 
 	key := ComputeStateKeyWithWord(startStateSet, false)
 	_, _ = cache.Insert(key, startState) // Cannot fail: cache was just cleared
@@ -1794,13 +1653,14 @@ func (d *DFA) nfaFallback(haystack []byte, startPos int) int {
 
 // matchesEmpty checks if the pattern matches an empty string
 func (d *DFA) matchesEmpty(cache *DFACache) bool {
-	// Check if start state is a match state
+	// With 1-byte match delay, the start state is never tagged as match.
+	// Check if the start state's NFA states contain a match (for empty patterns).
 	startState := cache.getState(StartState)
-	if startState != nil && startState.IsMatch() {
+	if startState != nil && containsNFAMatch(d.nfa, startState.NFAStates()) {
 		return true
 	}
 
-	// Fall back to NFA for empty match check
+	// Fall back to NFA for empty match check (handles word boundaries, etc.)
 	start, end, matched := d.pikevm.Search([]byte{})
 	return matched && start == 0 && end == 0
 }
@@ -1937,17 +1797,13 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in
 	}
 
 	lastMatch := -1
-
-	if currentState.IsMatch() {
-		lastMatch = end
-	}
+	// With 1-byte match delay, start states are never match states.
 
 	at := end - 1
 
 	// Hot loop: flat transition table (Rust approach).
 	sid := currentState.id
 	ft := cache.flatTrans
-	stride := cache.stride
 	ftLen := len(ft)
 
 	if ftLen > 0 {
@@ -1955,67 +1811,55 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in
 	}
 
 	// === 4x UNROLLED REVERSE LOOP ===
-	// offset/nextSID declared before loop to avoid goto-over-declaration.
+	// With match delay, any tagged state (including match) breaks to slow path.
 	var revOff int
 	var nextSID StateID
 	for at >= start+3 {
-		// Transition 1 (from at, going backward)
-		revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at])))
+		// Transition 1
+		revOff = sid.Offset() + int(d.byteToClass(haystack[at]))
 		if revOff >= ftLen {
 			goto reverseSlowPath
 		}
 		nextSID = ft[revOff]
-		if nextSID >= DeadState {
+		if nextSID.IsTagged() {
 			goto reverseSlowPath
 		}
-		if cache.matchFlags[int(nextSID)] {
-			lastMatch = at
-		}
 		sid = nextSID
 		at--
 
 		// Transition 2
-		revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at])))
+		revOff = sid.Offset() + int(d.byteToClass(haystack[at]))
 		if revOff >= ftLen {
 			goto reverseSlowPath
 		}
 		nextSID = ft[revOff]
-		if nextSID >= DeadState {
+		if nextSID.IsTagged() {
 			goto reverseSlowPath
 		}
-		if cache.matchFlags[int(nextSID)] {
-			lastMatch = at
-		}
 		sid = nextSID
 		at--
 
 		// Transition 3
-		revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at])))
+		revOff = sid.Offset() + int(d.byteToClass(haystack[at]))
 		if revOff >= ftLen {
 			goto reverseSlowPath
 		}
 		nextSID = ft[revOff]
-		if nextSID >= DeadState {
+		if nextSID.IsTagged() {
 			goto reverseSlowPath
 		}
-		if cache.matchFlags[int(nextSID)] {
-			lastMatch = at
-		}
 		sid = nextSID
 		at--
 
 		// Transition 4
-		revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at])))
+		revOff = sid.Offset() + int(d.byteToClass(haystack[at]))
 		if revOff >= ftLen {
 			goto reverseSlowPath
 		}
 		nextSID = ft[revOff]
-		if nextSID >= DeadState {
+		if nextSID.IsTagged() {
 			goto reverseSlowPath
 		}
-		if cache.matchFlags[int(nextSID)] {
-			lastMatch = at
-		}
 		sid = nextSID
 		at--
 
@@ -2030,7 +1874,7 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in
 		b := haystack[at]
 
 		classIdx := int(d.byteToClass(b))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 
 		var nextID StateID
 		if offset < ftLen {
@@ -2073,13 +1917,24 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in
 			sid = nextID
 		}
 
+		// 1-byte match delay for reverse: the match tag on the new state means
+		// the OLD state had NFA match. In reverse search, the match position
+		// is at+1 (one byte forward from current, since we're going backward).
+		// Rust: mat = Some(HalfMatch::new(pattern, at + 1))
 		if cache.IsMatchState(sid) {
-			lastMatch = at
+			lastMatch = at + 1
 		}
 
 		at--
 	}
 
+	// EOI for reverse: at region start, check if current state's NFA states
+	// contain a delayed match. If so, the match starts at 'start'.
+	eoi := cache.getState(sid)
+	if eoi != nil && containsNFAMatch(d.nfa, eoi.NFAStates()) {
+		lastMatch = start
+	}
+
 	return lastMatch
 }
 
@@ -2119,10 +1974,7 @@ func (d *DFA) SearchReverseLimited(cache *DFACache, haystack []byte, start, end,
 	}
 
 	lastMatch := -1
-
-	if currentState.IsMatch() {
-		lastMatch = end
-	}
+	// With 1-byte match delay, start states are never match states.
 
 	lowerBound := start
 	if minStart > lowerBound {
@@ -2132,14 +1984,13 @@ func (d *DFA) SearchReverseLimited(cache *DFACache, haystack []byte, start, end,
 	// Hot loop: flat transition table (Rust approach).
 	sid := currentState.id
 	ft := cache.flatTrans
-	stride := cache.stride
 	ftLen := len(ft)
 
 	for at := end - 1; at >= lowerBound; at-- {
 		b := haystack[at]
 
 		classIdx := int(d.byteToClass(b))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 
 		var nextID StateID
 		if offset < ftLen {
@@ -2183,11 +2034,18 @@ func (d *DFA) SearchReverseLimited(cache *DFACache, haystack []byte, start, end,
 			sid = nextID
 		}
 
+		// 1-byte match delay for reverse: match position is at+1
 		if cache.IsMatchState(sid) {
-			lastMatch = at
+			lastMatch = at + 1
 		}
 	}
 
+	// EOI for reverse: check delayed match at region start
+	eoi := cache.getState(sid)
+	if eoi != nil && containsNFAMatch(d.nfa, eoi.NFAStates()) {
+		lastMatch = lowerBound
+	}
+
 	if lowerBound > start && lastMatch < 0 {
 		return SearchReverseLimitedQuadratic
 	}
@@ -2210,21 +2068,18 @@ func (d *DFA) IsMatchReverse(cache *DFACache, haystack []byte, start, end int) b
 		return matched
 	}
 
-	if currentState.IsMatch() {
-		return true
-	}
+	// With 1-byte match delay, start states are never match states.
 
 	// Hot loop: flat transition table (Rust approach).
 	sid := currentState.id
 	ft := cache.flatTrans
-	stride := cache.stride
 	ftLen := len(ft)
 
 	for at := end - 1; at >= start; at-- {
 		b := haystack[at]
 
 		classIdx := int(d.byteToClass(b))
-		offset := safeOffset(sid, stride, classIdx)
+		offset := sid.Offset() + classIdx
 
 		var nextID StateID
 		if offset < ftLen {
@@ -2271,12 +2126,15 @@ func (d *DFA) IsMatchReverse(cache *DFACache, haystack []byte, start, end int) b
 			sid = nextID
 		}
 
+		// 1-byte match delay: match detected after transition
 		if cache.IsMatchState(sid) {
 			return true
 		}
 	}
 
-	return cache.IsMatchState(sid)
+	// EOI for reverse: check if current state's NFA states contain match
+	eoi := cache.getState(sid)
+	return eoi != nil && containsNFAMatch(d.nfa, eoi.NFAStates())
 }
 
 // getStartStateForReverse returns the appropriate start state for reverse search.
diff --git a/dfa/lazy/search_extra_test.go b/dfa/lazy/search_extra_test.go
index 6d60419..a99025c 100644
--- a/dfa/lazy/search_extra_test.go
+++ b/dfa/lazy/search_extra_test.go
@@ -89,7 +89,7 @@ func TestSearchAtWithoutPrefilter(t *testing.T) {
 		{"simple literal from 0", "abc", "xyzabc", 0, 6},
 		{"simple literal from 3", "abc", "xyzabc", 3, 6},
 		{"no match", "xyz", "abcdef", 0, -1},
-		{"empty pattern from 0", "", "abc", 0, 3}, // empty pattern greedy-matches entire input
+		{"empty pattern from 0", "", "abc", 0, 0}, // empty pattern matches at position 0 (stdlib behavior)
 		{"empty input", "abc", "", 0, -1},
 		{"at end", "abc", "abc", 3, -1},
 		{"past end", "abc", "abc", 4, -1},
@@ -332,7 +332,7 @@ func TestEmptyPatternBehavior(t *testing.T) {
 		matchWant bool
 	}{
 		{"empty input", "", 0, true},
-		{"non-empty input", "abc", 3, true}, // empty pattern greedy-matches entire input
+		{"non-empty input", "abc", 0, true}, // empty pattern matches at position 0 (stdlib behavior)
 	}
 
 	for _, tt := range tests {
diff --git a/dfa/lazy/start.go b/dfa/lazy/start.go
index c95349e..a78d3eb 100644
--- a/dfa/lazy/start.go
+++ b/dfa/lazy/start.go
@@ -227,8 +227,12 @@ func ComputeStartStateWithStride(builder *Builder, n *nfa.NFA, config StartConfi
 	// Compute epsilon closure from NFA start state with look assertions
 	startStateSet := builder.epsilonClosure([]nfa.StateID{nfaStart}, lookHave)
 
-	// Check if start state is a match state
-	isMatch := builder.containsMatchState(startStateSet)
+	// With 1-byte match delay, start states are NEVER match states.
+	// Match reporting is delayed by 1 byte: the NEW state is tagged as match
+	// based on the OLD state's NFA match content. Since there is no "old state"
+	// before the start state, it cannot be a match.
+	// Reference: Rust regex-automata determinize (mod.rs:254-286).
+	isMatch := false
 
 	// Determine isFromWord based on StartKind.
 	// This is critical for \b/\B word boundary handling:
diff --git a/dfa/lazy/state.go b/dfa/lazy/state.go
index 33edf96..9ff197b 100644
--- a/dfa/lazy/state.go
+++ b/dfa/lazy/state.go
@@ -9,22 +9,99 @@ import (
 )
 
 // StateID uniquely identifies a DFA state in the cache.
-// This is a 32-bit unsigned integer for compact representation.
+//
+// The ID is a **premultiplied byte offset** into the flat transition table,
+// with tag bits in the high 5 bits for O(1) special state detection.
+//
+// Layout (Rust LazyStateID approach, hybrid/id.rs:169):
+//
+//	[invalid|dead|reserved|start|match| 27 bits: offset into flatTrans ]
+//	 bit 31  30    29       28    27    bits 0-26
+//
+// Hot loop: nextSID = flatTrans[sid & TagMask + classIdx]
+//
+//	if sid > TagMask { handle special }
+//
+// No multiply needed — sid already contains the byte offset.
 type StateID uint32
 
-// Special state constants
+// Tag bit masks for StateID high bits.
 const (
-	// InvalidState represents an invalid/uninitialized state ID
-	InvalidState StateID = 0xFFFFFFFF
+	tagInvalid  StateID = 1 << 31 // Unknown/not yet computed transition
+	tagDead     StateID = 1 << 30 // Dead state — no match possible
+	tagReserved StateID = 1 << 29 // Reserved for quit
+	tagStart    StateID = 1 << 28 // Start state
+	tagMatch    StateID = 1 << 27 // Match/accepting state
+
+	// TagMask extracts the offset (lower 27 bits).
+	// Any bit above this = special state requiring slow path.
+	TagMask StateID = tagMatch - 1 // 0x07FFFFFF
+
+	// MaxStateOffset is the maximum premultiplied offset (128M entries).
+	MaxStateOffset StateID = TagMask
+)
 
-	// DeadState represents a dead/failure state with no outgoing transitions.
-	// Once in this state, the DFA can never match.
-	DeadState StateID = 0xFFFFFFFE
+// Special state constants (tagged, premultiplied offset = 0).
+const (
+	// InvalidState represents an unknown/uninitialized transition.
+	// In flatTrans, this means the transition hasn't been computed yet.
+	InvalidState StateID = tagInvalid // 0x80000000
 
-	// StartState is always state ID 0 (the initial state)
+	// DeadState represents a dead/failure state — no match possible.
+	DeadState StateID = tagDead // 0x40000000
+
+	// StartState is the initial state. Offset 0, untagged.
 	StartState StateID = 0
 )
 
+// IsTagged returns true if any tag bit is set (special state).
+// This is the single branch in the DFA hot loop.
+//
+//go:nosplit
+func (sid StateID) IsTagged() bool {
+	return sid > TagMask
+}
+
+// Offset returns the premultiplied byte offset into flatTrans.
+// Strips tag bits. Only valid for non-special states.
+//
+//go:nosplit
+func (sid StateID) Offset() int {
+	return int(sid & TagMask)
+}
+
+// IsMatch returns true if this state has the match tag.
+//
+//go:nosplit
+func (sid StateID) IsMatchTag() bool {
+	return sid&tagMatch != 0
+}
+
+// IsDeadTag returns true if this state has the dead tag.
+//
+//go:nosplit
+func (sid StateID) IsDeadTag() bool {
+	return sid&tagDead != 0
+}
+
+// IsInvalidTag returns true if this state has the invalid tag.
+//
+//go:nosplit
+func (sid StateID) IsInvalidTag() bool {
+	return sid&tagInvalid != 0
+}
+
+// WithMatchTag returns a copy of this StateID with the match tag set.
+func (sid StateID) WithMatchTag() StateID {
+	return sid | tagMatch
+}
+
+// WithStartTag returns a copy of this StateID with the start tag set.
+// Reserved for future start-state specialization (Rust specialize_start_states).
+func (sid StateID) WithStartTag() StateID {
+	return sid | tagStart
+}
+
 // defaultStride is the default alphabet size when ByteClasses compression is not used.
 const defaultStride = 256
 
@@ -233,11 +310,24 @@ func ComputeStateKey(nfaStates []nfa.StateID) StateKey {
 // States with same NFA states but different isFromWord are DIFFERENT DFA states.
 // This is essential for correct \b and \B handling.
 func ComputeStateKeyWithWord(nfaStates []nfa.StateID, isFromWord bool) StateKey {
+	return ComputeStateKeyWithWordAndMatch(nfaStates, isFromWord, false)
+}
+
+// ComputeStateKeyWithWordAndMatch computes a hash-based key including word context
+// and match delay flag. With 1-byte match delay, the same set of NFA states can
+// produce both a match and non-match DFA state depending on whether the SOURCE
+// state contained an NFA match state. This function distinguishes them in the cache.
+func ComputeStateKeyWithWordAndMatch(nfaStates []nfa.StateID, isFromWord bool, isMatch bool) StateKey {
 	if len(nfaStates) == 0 {
+		// Encode (isFromWord, isMatch) into 2 bits for empty states
+		var key StateKey
 		if isFromWord {
-			return StateKey(1) // Distinguish empty+fromWord from empty+notFromWord
+			key |= 1
 		}
-		return StateKey(0)
+		if isMatch {
+			key |= 2
+		}
+		return key
 	}
 
 	// Sort NFA states for canonical ordering
@@ -249,12 +339,15 @@ func ComputeStateKeyWithWord(nfaStates []nfa.StateID, isFromWord bool) StateKey
 	// Hash the sorted states using FNV-1a
 	h := fnv.New64a()
 
-	// Include isFromWord in the hash FIRST to distinguish states
+	// Include isFromWord and isMatch in the hash FIRST to distinguish states
+	var flags byte
 	if isFromWord {
-		_, _ = h.Write([]byte{1})
-	} else {
-		_, _ = h.Write([]byte{0})
+		flags |= 1
+	}
+	if isMatch {
+		flags |= 2
 	}
+	_, _ = h.Write([]byte{flags})
 
 	for _, sid := range sorted {
 		// Write each StateID as 4 bytes (uint32)
@@ -393,6 +486,18 @@ func (ss *StateSet) ToSlice() []nfa.StateID {
 	return slice
 }
 
+// ToSliceInsertionOrder returns states in the order they were inserted.
+// This matches Rust's sparse set iteration order, which is critical for
+// determinize break-at-match semantics (leftmost-first match priority).
+func (ss *StateSet) ToSliceInsertionOrder() []nfa.StateID {
+	if ss.size == 0 {
+		return nil
+	}
+	slice := make([]nfa.StateID, ss.size)
+	copy(slice, ss.dense[:ss.size])
+	return slice
+}
+
 // Clone creates a deep copy of the state set
 func (ss *StateSet) Clone() *StateSet {
 	clone := NewStateSetWithCapacity(len(ss.sparse))
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index 4d5b05d..f29a7b6 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -27,7 +27,13 @@ Input → Prefilter (memchr/memmem/teddy) → Engine Search → Match Result
 ### DFA Layer (`dfa/lazy/`)
 
 - **Lazy DFA**: On-demand state construction with byte class compression
-- **Flat transition table**: `flatTrans[sid*stride+class]` — single array lookup, no pointer chase
+- **Flat transition table**: `flatTrans[sid+class]` — premultiplied offset, no multiply
+- **Tagged State IDs**: match/dead/invalid encoded in high bits, single `IsTagged()` branch
+- **Break-at-match**: Rust `determinize::next` (mod.rs:284) — stops NFA iteration at Match state,
+  preventing prefix restarts while preserving greedy continuation (leftmost-first semantics)
+- **Epsilon closure ordering**: Add-on-pop DFS with reverse Split push — matches Rust sparse set
+  insertion order. Incremental per-target closure preserves Match-before-prefix ordering
+- **2-pass bidirectional search**: Forward DFA → match end, reverse DFA → match start (no Phase 3)
 - **Byte-based cache limit**: 2MB default (matches Rust `hybrid_cache_capacity`)
 - **Cache clearing**: Up to 5 clears before NFA fallback (Rust approach)
 - **Acceleration**: Detects self-loop states, uses SIMD memchr for skip-ahead
@@ -99,7 +105,7 @@ Input → Prefilter (memchr/memmem/teddy) → Engine Search → Match Result
 
 1. **Multi-engine**: Strategy selection at compile time, not runtime
 2. **Rust reference**: Architecture mirrors Rust regex crate (lazy DFA, PikeVM, prefilters)
-3. **Go stdlib compat**: POSIX leftmost-longest semantics (differs from Rust leftmost-first)
+3. **Leftmost-first match**: DFA break-at-match matches Rust semantics (verified via cargo run)
 4. **Zero-alloc hot paths**: `IsMatch()`, `FindIndices()`, `Count()` — no heap allocation
 5. **SIMD first**: AVX2/SSSE3 prefilters for x86_64, pure Go fallback for other archs
 
diff --git a/meta/compile.go b/meta/compile.go
index abff531..1a60f34 100644
--- a/meta/compile.go
+++ b/meta/compile.go
@@ -150,10 +150,9 @@ func buildStrategyEngines(
 		return result
 	}
 
-	dfaConfig := lazy.Config{
-		MaxStates:            config.MaxDFAStates,
-		DeterminizationLimit: config.DeterminizationLimit,
-	}
+	dfaConfig := lazy.DefaultConfig()
+	dfaConfig.MaxStates = config.MaxDFAStates //nolint:staticcheck // legacy API compat
+	dfaConfig.DeterminizationLimit = config.DeterminizationLimit
 
 	result = buildReverseSearchers(result, strategy, re, nfaEngine, dfaConfig, config)
 
@@ -189,13 +188,18 @@ func buildReverseDFA(
 	dfaConfig lazy.Config,
 	pf prefilter.Prefilter,
 ) strategyEngines {
+	// Reverse DFA config: disable break-at-match so the reverse search continues
+	// past matches to find the leftmost match start (greedy continuation).
+	revDFAConfig := dfaConfig
+	revDFAConfig.BreakAtMatch = false
+
 	switch result.finalStrategy {
 	case UseDFA:
 		// Skip for non-greedy patterns: forward DFA always finds leftmost-longest,
 		// which is incompatible with non-greedy semantics.
 		if result.dfa != nil && !hasNonGreedyQuantifier(re) {
 			reverseNFA := nfa.ReverseAnchored(nfaEngine)
-			revDFA, err := lazy.CompileWithConfig(reverseNFA, dfaConfig)
+			revDFA, err := lazy.CompileWithConfig(reverseNFA, revDFAConfig)
 			if err == nil {
 				result.reverseDFA = revDFA
 			}
@@ -205,7 +209,7 @@ func buildReverseDFA(
 		if err == nil {
 			result.dfa = fwdDFA
 			reverseNFA := nfa.ReverseAnchored(nfaEngine)
-			revDFA, revErr := lazy.CompileWithConfig(reverseNFA, dfaConfig)
+			revDFA, revErr := lazy.CompileWithConfig(reverseNFA, revDFAConfig)
 			if revErr == nil {
 				result.reverseDFA = revDFA
 			}
@@ -530,6 +534,8 @@ func CompileRegexp(re *syntax.Regexp, config Config) (*Engine, error) {
 	// because its greedy semantics give wrong results for patterns like (?:|a)*
 	canMatchEmpty := pikevm.IsMatch(nil)
 
+	// Check if Phase 3 (SearchAtAnchored) is needed in bidirectional DFA search.
+	// Phase 3 re-scans from confirmed start with greedy semantics. Only needed when
 	// Extract first-byte prefilter for anchored patterns.
 	// This enables O(1) early rejection for non-matching inputs.
 	// Only useful for start-anchored patterns where we only check position 0.
diff --git a/meta/engine.go b/meta/engine.go
index d2bcc54..dd1d5ca 100644
--- a/meta/engine.go
+++ b/meta/engine.go
@@ -186,6 +186,17 @@ func (e *Engine) IsStartAnchored() bool {
 	return e.isStartAnchored
 }
 
+// IsStartAnchoredWithFirstByteReject returns true if:
+// 1. Pattern is always-anchored (^) AND
+// 2. First byte of haystack doesn't match any possible first byte
+// This allows ultra-fast O(1) rejection without any dispatch overhead.
+func (e *Engine) IsStartAnchoredWithFirstByteReject(haystack []byte) bool {
+	return e.nfa.IsAlwaysAnchored() &&
+		e.anchoredFirstBytes != nil &&
+		len(haystack) > 0 &&
+		!e.anchoredFirstBytes.Contains(haystack[0])
+}
+
 // Stats returns execution statistics.
 //
 // Useful for performance analysis and debugging.
diff --git a/meta/find_indices.go b/meta/find_indices.go
index 4ba5f36..cd794e3 100644
--- a/meta/find_indices.go
+++ b/meta/find_indices.go
@@ -342,7 +342,6 @@ func (e *Engine) findIndicesDFAAt(haystack []byte, at int) (int, int, bool) {
 		return e.pikevm.SearchAt(haystack, pos)
 	}
 
-	// No prefilter: bidirectional DFA or DFA + PikeVM fallback.
 	if e.reverseDFA != nil {
 		return e.findIndicesBidirectionalDFA(haystack, at)
 	}
@@ -578,35 +577,31 @@ func (e *Engine) findIndicesMultilineReverseSuffixAt(haystack []byte, at int) (i
 }
 
 // findIndicesBidirectionalDFA uses forward DFA + reverse DFA for exact match bounds.
-// Three-phase: forward DFA → first match end, reverse DFA → match start,
-// anchored forward DFA → correct greedy end from that start. O(n) total.
+// Two-phase: forward DFA → match end, reverse DFA → match start. O(n) total.
 //
-// Phase 1 uses SearchFirstAt (stops at first match end) to avoid DFA over-extension
-// with unanchored prefix. Phase 3 then runs anchored greedy DFA from the discovered
-// start to get the correct (potentially longer) end for patterns like ".*".
+// With Rust-style break-at-match in determinize, SearchAt produces correct
+// leftmost-first greedy match ends directly (verified against Rust regex-automata
+// fwd search). No Phase 3 re-scan needed.
 func (e *Engine) findIndicesBidirectionalDFA(haystack []byte, at int) (int, int, bool) {
 	atomic.AddUint64(&e.stats.DFASearches, 1)
 	state := e.getSearchState()
 	defer e.putSearchState(state)
-	// Phase 1: find first match end (leftmost-first, not leftmost-longest)
-	end := e.dfa.SearchFirstAt(state.dfaCache, haystack, at)
+	// Forward DFA: leftmost-first match end (matches Rust find_fwd)
+	end := e.dfa.SearchAt(state.dfaCache, haystack, at)
 	if end == -1 {
 		return -1, -1, false
 	}
 	if end == at {
 		return at, at, true // Empty match
 	}
-	// Phase 2: reverse DFA to find match start
+	// Skip reverse search if anchored (Rust hybrid/regex.rs:467)
+	if e.nfa.IsAlwaysAnchored() {
+		return at, end, true
+	}
+	// Reverse DFA → match start
 	start := e.reverseDFA.SearchReverse(state.revDFACache, haystack, at, end)
 	if start < 0 {
-		return -1, -1, false // Reverse DFA failed (cache full)
-	}
-	// Phase 3: anchored greedy forward DFA from start → correct end.
-	// SearchFirstAt may undercount for greedy patterns (e.g., ".*" stops at first ").
-	// Anchored DFA from start gives the correct greedy end for this specific match.
-	exactEnd := e.dfa.SearchAtAnchored(state.dfaCache, haystack, start)
-	if exactEnd > start {
-		end = exactEnd
+		return -1, -1, false
 	}
 	return start, end, true
 }
@@ -652,6 +647,14 @@ func (e *Engine) findIndicesBoundedBacktracker(haystack []byte) (int, int, bool)
 		}
 	}
 
+	// For always-anchored patterns (^) on large inputs where BT can't handle
+	// the full haystack, use PikeVM directly. PikeVM memory is O(states) per
+	// step, not O(states × haystack) like BT visited table.
+	if e.nfa.IsAlwaysAnchored() && !e.boundedBacktracker.CanHandle(len(haystack)) {
+		atomic.AddUint64(&e.stats.NFASearches, 1)
+		return e.pikevm.SearchWithSlotTable(haystack, nfa.SearchModeFind)
+	}
+
 	atomic.AddUint64(&e.stats.NFASearches, 1)
 	if !e.boundedBacktracker.CanHandle(len(haystack)) {
 		// Bidirectional DFA: O(n) vs PikeVM's O(n*states) for large inputs
diff --git a/meta/findall.go b/meta/findall.go
index f4fcb4e..9e4713a 100644
--- a/meta/findall.go
+++ b/meta/findall.go
@@ -171,6 +171,8 @@ func (e *Engine) FindAllIndicesStreaming(haystack []byte, n int, results [][2]in
 
 // findAllIndicesLoop is the standard loop-based FindAll for non-streaming strategies.
 // Optimized: acquires SearchState once for entire loop to avoid sync.Pool overhead per match.
+//
+//nolint:cyclop // DFA direct path adds necessary branching
 func (e *Engine) findAllIndicesLoop(haystack []byte, n int, results [][2]int) [][2]int {
 	if results == nil {
 		// Smart allocation: anchored patterns have max 1 match, others use capped heuristic.
@@ -194,12 +196,50 @@ func (e *Engine) findAllIndicesLoop(haystack []byte, n int, results [][2]int) []
 	pos := 0
 	lastMatchEnd := -1
 
+	// Fast path: start-anchored patterns (^) match at most once at position 0.
+	// Skip pool Get/Put overhead entirely — use non-pooled FindIndices.
+	if e.nfa.IsAlwaysAnchored() {
+		start, end, found := e.FindIndices(haystack)
+		if found {
+			results = append(results, [2]int{start, end})
+		}
+		return results
+	}
+
 	// Get state ONCE for entire iteration - eliminates 1.29M sync.Pool ops for FindAll
 	state := e.getSearchState()
 	defer e.putSearchState(state)
 
+	// DFA fast path: call DFA functions directly, skip meta prefilter layer.
+	// SearchFirstAt has integrated prefilter at start state — no duplicate scan.
+	// Saves: 1 prefilter call per candidate + function dispatch overhead.
+	useDFADirect := (e.strategy == UseDFA || e.strategy == UseBoth) &&
+		e.dfa != nil && e.reverseDFA != nil &&
+		state.dfaCache != nil && state.revDFACache != nil
+
 	for n <= 0 || len(results) < n {
-		start, end, found := e.findIndicesAtWithState(haystack, pos, state)
+		var start, end int
+		var found bool
+
+		if useDFADirect {
+			// 2-pass bidirectional DFA, called directly (no meta prefilter).
+			// SearchAt → match end (matches Rust find_fwd), reverse DFA → start.
+			matchEnd := e.dfa.SearchAt(state.dfaCache, haystack, pos)
+			if matchEnd < 0 {
+				break
+			}
+			if matchEnd == pos {
+				start, end, found = pos, pos, true
+			} else {
+				matchStart := e.reverseDFA.SearchReverse(state.revDFACache, haystack, pos, matchEnd)
+				if matchStart < 0 {
+					break
+				}
+				start, end, found = matchStart, matchEnd, true
+			}
+		} else {
+			start, end, found = e.findIndicesAtWithState(haystack, pos, state)
+		}
 		if !found {
 			break
 		}
diff --git a/meta/reverse_anchored.go b/meta/reverse_anchored.go
index a189987..e68458d 100644
--- a/meta/reverse_anchored.go
+++ b/meta/reverse_anchored.go
@@ -49,8 +49,12 @@ func NewReverseAnchoredSearcher(forwardNFA *nfa.NFA, config lazy.Config) (*Rever
 	// Build reverse NFA - must be anchored at start (because $ in forward becomes ^ in reverse)
 	reverseNFA := nfa.ReverseAnchored(forwardNFA)
 
-	// Build reverse DFA from reverse NFA
-	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config)
+	// Build reverse DFA from reverse NFA.
+	// Disable BreakAtMatch: reverse DFA must continue past matches to find
+	// the leftmost match start (greedy continuation).
+	revConfig := config
+	revConfig.BreakAtMatch = false
+	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig)
 	if err != nil {
 		// Cannot build reverse DFA - this should be rare
 		return nil, err
diff --git a/meta/reverse_inner.go b/meta/reverse_inner.go
index acdc9b5..a6740de 100644
--- a/meta/reverse_inner.go
+++ b/meta/reverse_inner.go
@@ -224,8 +224,11 @@ func NewReverseInnerSearcher(
 	// Build reverse NFA from prefix
 	reverseNFA := nfa.Reverse(prefixNFA)
 
-	// Build reverse DFA from reverse prefix NFA
-	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config)
+	// Build reverse DFA from reverse prefix NFA.
+	// Disable BreakAtMatch for reverse DFA.
+	revConfig := config
+	revConfig.BreakAtMatch = false
+	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig)
 	if err != nil {
 		return nil, err
 	}
diff --git a/meta/reverse_suffix.go b/meta/reverse_suffix.go
index 06f63c5..b6c8033 100644
--- a/meta/reverse_suffix.go
+++ b/meta/reverse_suffix.go
@@ -114,8 +114,11 @@ func NewReverseSuffixSearcher(
 	// searching for $ anchor, but for suffix literals.
 	reverseNFA := nfa.Reverse(forwardNFA)
 
-	// Build reverse DFA from reverse NFA
-	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config)
+	// Build reverse DFA from reverse NFA.
+	// Disable BreakAtMatch for reverse DFA.
+	revConfig := config
+	revConfig.BreakAtMatch = false
+	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig)
 	if err != nil {
 		return nil, err
 	}
diff --git a/meta/reverse_suffix_set.go b/meta/reverse_suffix_set.go
index 6414073..ea70af1 100644
--- a/meta/reverse_suffix_set.go
+++ b/meta/reverse_suffix_set.go
@@ -92,8 +92,10 @@ func NewReverseSuffixSetSearcher(
 	// Build reverse NFA
 	reverseNFA := nfa.Reverse(forwardNFA)
 
-	// Build reverse DFA
-	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config)
+	// Build reverse DFA. Disable BreakAtMatch for reverse DFA.
+	revConfig := config
+	revConfig.BreakAtMatch = false
+	reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig)
 	if err != nil {
 		return nil, err
 	}
diff --git a/nfa/compile.go b/nfa/compile.go
index 88c28c0..01a5154 100644
--- a/nfa/compile.go
+++ b/nfa/compile.go
@@ -132,8 +132,7 @@ func (c *Compiler) CompileRegexp(re *syntax.Regexp) (*NFA, error) {
 	anchoredStart := patternStart
 
 	// Unanchored start: compile the (?s:.)*? prefix for DFA and other engines
-	// that need it. PikeVM simulates this prefix in its search loop instead
-	// (like Rust regex-automata) for correct startPos tracking.
+	// (same as Rust regex-automata: compiler.rs:997 c_at_least(dot, false, 0)).
 	// If pattern is anchored, unanchored start equals anchored start.
 	var unanchoredStart StateID
 	if c.config.Anchored || allAnchored {
diff --git a/nfa/pikevm.go b/nfa/pikevm.go
index 34446d8..e0eef83 100644
--- a/nfa/pikevm.go
+++ b/nfa/pikevm.go
@@ -290,13 +290,28 @@ func (p *PikeVM) initState(state *PikeVMState) {
 	// Pre-allocate epsilon stack for loop-based closure in IsMatch (Rust pattern)
 	state.epsilonStack = make([]StateID, 0, capacity)
 
-	// Initialize SlotTables for capture tracking (curr/next, swapped per byte)
-	// Each capture group has 2 slots (start and end position)
+	// SlotTables for capture tracking are initialized lazily on first use.
+	// This avoids allocation overhead for non-capture searches (FindAll, IsMatch).
+	// See ensureSlotTables().
+	state.SlotTable = nil
+	state.NextSlotTable = nil
+}
+
+// ensureSlotTables lazily initializes SlotTables and capture support.
+// Called only when capture tracking is needed (SearchWithSlotTableCaptures).
+func (p *PikeVM) ensureSlotTables(state *PikeVMState) {
+	if state.SlotTable != nil {
+		return // Already initialized
+	}
 	slotsPerState := p.nfa.CaptureCount() * 2
-	state.SlotTable = NewSlotTable(p.nfa.States(), slotsPerState)
-	state.NextSlotTable = NewSlotTable(p.nfa.States(), slotsPerState)
+	numStates := p.nfa.States()
+	state.SlotTable = NewSlotTable(numStates, slotsPerState)
+	state.NextSlotTable = NewSlotTable(numStates, slotsPerState)
 
-	// Capture-aware epsilon closure stack and working buffer
+	capacity := numStates
+	if capacity < 16 {
+		capacity = 16
+	}
 	state.captureStack = make([]captureFrame, 0, capacity)
 	if slotsPerState > 0 {
 		state.currSlots = make([]int, slotsPerState)
@@ -2136,6 +2151,9 @@ func (p *PikeVM) SearchWithSlotTableCapturesAt(haystack []byte, at int) *MatchWi
 		return nil
 	}
 
+	// Lazy init SlotTables (only on first capture search)
+	p.ensureSlotTables(&p.internalState)
+
 	totalSlots := p.nfa.CaptureCount() * 2
 	p.internalState.SlotTable.SetActiveSlots(totalSlots)
 	p.internalState.NextSlotTable.SetActiveSlots(totalSlots)
diff --git a/nfa/slot_table.go b/nfa/slot_table.go
index e8b1ccf..476d886 100644
--- a/nfa/slot_table.go
+++ b/nfa/slot_table.go
@@ -106,6 +106,9 @@ func (st *SlotTable) ForStateUnchecked(sid StateID) []int {
 //
 // If n > slotsPerState, it is clamped to slotsPerState.
 func (st *SlotTable) SetActiveSlots(n int) {
+	if st == nil {
+		return
+	}
 	if n < 0 {
 		n = 0
 	}
@@ -117,6 +120,9 @@ func (st *SlotTable) SetActiveSlots(n int) {
 
 // ActiveSlots returns the current number of active slots.
 func (st *SlotTable) ActiveSlots() int {
+	if st == nil {
+		return 0
+	}
 	return st.activeSlots
 }
 
@@ -171,6 +177,9 @@ func (st *SlotTable) GetSlot(sid StateID, slotIndex int) int {
 // Note: This is O(n) where n = numStates * slotsPerState.
 // For large tables, consider using generation-based clearing instead.
 func (st *SlotTable) Reset() {
+	if st == nil {
+		return
+	}
 	for i := range st.table {
 		st.table[i] = -1
 	}
diff --git a/regex.go b/regex.go
index 4795229..b170659 100644
--- a/regex.go
+++ b/regex.go
@@ -377,6 +377,12 @@ func (r *Regex) FindAll(b []byte, n int) [][]byte {
 		return nil
 	}
 
+	// Ultra-fast path: start-anchored patterns (^) with first-byte rejection.
+	// Avoids entire dispatch chain for the common no-match case.
+	if r.engine.IsStartAnchoredWithFirstByteReject(b) {
+		return nil
+	}
+
 	// Use optimized streaming path for ALL strategies (state-reusing, no sync.Pool overhead)
 	return r.findAllStreaming(b, n)
 }
diff --git a/simd/memmem.go b/simd/memmem.go
index 2c8720f..3ad5839 100644
--- a/simd/memmem.go
+++ b/simd/memmem.go
@@ -17,16 +17,15 @@ import "bytes"
 //
 // Algorithm:
 //
-// The function uses paired-byte SIMD search with frequency-based rare byte selection:
+// The function uses a hybrid SIMD search with frequency-based rare byte selection:
 //  1. Identify the two rarest bytes in needle using empirical frequency table
-//  2. Use MemchrPair to find candidates where both bytes appear at correct distance
-//  3. For each candidate, verify the full needle match
-//  4. Return position of first match or -1 if not found
-//
-// The paired-byte approach dramatically reduces false positives compared to
-// single-byte search, since matches require two specific bytes at exactly the
-// right distance apart. For example, in "@example.com", both '@' (rank 25) and
-// 'x' (rank 45) are used, requiring them to appear exactly 2 positions apart.
+//  2. For short needles (<=6 bytes): use MemchrPair to find candidates where both
+//     bytes appear at correct distance — reduces false positives when individual
+//     bytes are common in the input data
+//  3. For longer needles (>6 bytes): use single Memchr on the rarest byte, which
+//     is genuinely rare and makes single-byte scan + verify faster
+//  4. For each candidate, verify the full needle match
+//  5. Return position of first match or -1 if not found
 //
 // For longer needles (> 32 bytes), a simplified Two-Way string matching
 // approach is used to maintain O(n+m) complexity and avoid pathological cases.
@@ -81,20 +80,18 @@ func Memmem(haystack, needle []byte) int {
 }
 
 // memmemShort handles short needles (2-32 bytes) using rare byte heuristic.
-// This is the fast path for most real-world patterns.
+// Uses a hybrid approach: MemchrPair for short needles (<=6 bytes) where
+// single-byte scan has high false positive rates, and Memchr(rarest byte)
+// for longer needles where the rare byte is genuinely rare.
 func memmemShort(haystack, needle []byte) int {
-	// Select the two rarest bytes for paired-byte search
 	rareInfo := SelectRareBytes(needle)
-
-	// Determine if we can use paired-byte search (different bytes at different positions)
-	// Paired-byte search is more selective: false positives require both bytes at exact distance
 	usePair := rareInfo.Byte1 != rareInfo.Byte2 && rareInfo.Index1 != rareInfo.Index2
 
-	if usePair {
+	// Short needles: MemchrPair is more selective (fewer false positives).
+	// Long needles: single Memchr + verify is faster (rare byte is genuinely rare).
+	if usePair && len(needle) <= 6 {
 		return memmemPaired(haystack, needle, rareInfo)
 	}
-
-	// Fall back to single-byte search
 	return memmemSingle(haystack, needle, rareInfo.Byte1, rareInfo.Index1)
 }