coregx · kolkov · Mar 24, 2026 · Mar 23, 2026 · Mar 23, 2026 · Mar 23, 2026
@@ -12,6 +12,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120))
 - SIMD prefilter for CompositeSequenceDFA (#83)
 
+## [0.12.18] - 2026-03-24
+
+### Performance
+- **Flat DFA transition table** (Rust approach) — replaced double pointer chase
+  (`stateList[id].transitions[class]`) with flat array (`flatTrans[sid*stride+class]`).
+  Hot loop works with state ID only — no `*State` pointer in fast path. Applied to
+  all 6 DFA search functions. Inspired by Rust `Cache.trans` flat layout.
+
+- **4x loop unrolling** in `searchFirstAt` — process 4 bytes per iteration when
+  all transitions are in the flat table. Falls back to the single-byte slow path on special states.
+
+- **DFA integrated prefilter skip-ahead** (Rust approach) — when DFA returns to
+  start state with no match in progress, uses `prefilter.Find()` to skip ahead
+  instead of byte-by-byte scanning. Applied to `searchFirstAt` and `searchAt`.
+  Reference: Rust `hybrid/search.rs:232-258`.
+  `peak_hours`: 197ms → **90ms** (gap vs Rust: 9x → 4x).
+
+- **PikeVM integrated prefilter skip-ahead** — prefilter integrated inside PikeVM
+  search loop (mirroring Rust `pikevm.rs:1293`). When the NFA has no active threads, PikeVM jumps to
+  the next candidate. Safe for partial-coverage prefilters.
+
+### Fixed
+- **NFA candidate loop guard** — replaced `IsComplete()` with `partialCoverage`
+  flag. `IsComplete()` blocked ALL incomplete prefilters including prefix-only ones.
+  `errors` pattern: 1984ms → **80ms**.
+
+- **DFA prefilter skip for incomplete prefilters** — `IsComplete()` guard blocked
+  DFA prefilter skip-ahead for memmem/Teddy prefix-only prefilters. Because the DFA verifies
+  the full pattern, the skip is always safe. `sessions`: 229ms → **30ms**.
+
 ## [0.12.17] - 2026-03-23
 
 ### Fixed
@@ -39,6 +69,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   Now allows UseTeddy when anchors are only `(?m)^` (no \b, $, etc).
   `http_methods` on macOS ARM64: 89ms → **<1ms** (restored to v0.12.14 level).
 
+- **Fix NFA candidate loop guard** — `IsComplete()` guard blocked prefilter
+  candidate loop for ALL incomplete prefilters, including prefix-only ones
+  where all alternation branches are represented. Now uses `partialCoverage`
+  flag (set only on overflow truncation) instead of `IsComplete()`. Pattern
+  ` [5][0-9]{2} | [4][0-9]{2} ` (Kostya's `errors`): 1984ms → **109ms**.
+  Rust handles this by integrating prefilter as skip-ahead inside PikeVM
+  (not as an external correctness gate) — see `pikevm.rs:1293-1299`.
+
 ## [0.12.16] - 2026-03-21
 
 ### Performance
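
The DFA prefilter skip-ahead described in the 0.12.18 entry can be illustrated with a small, self-contained sketch. It is not coregex's implementation: the hand-written DFA for `foo[0-9]`, the `step` and `search` helpers, and the use of `bytes.Index` as a stand-in for `prefilter.Find()` are all assumptions made for illustration. The idea shown is only that the prefilter is consulted when the DFA is back in its start state with no match in progress.

```go
package main

import (
	"bytes"
	"fmt"
)

// Toy DFA states for an unanchored search of foo[0-9] (hypothetical example).
const (
	stStart = iota // no partial match in progress
	stF            // seen "f"
	stFo           // seen "fo"
	stFoo          // seen "foo"
	stMatch        // seen "foo" followed by a digit
)

// step computes one DFA transition for the toy pattern.
func step(state int, b byte) int {
	switch {
	case state == stFoo && b >= '0' && b <= '9':
		return stMatch
	case b == 'f':
		return stF
	case state == stF && b == 'o':
		return stFo
	case state == stFo && b == 'o':
		return stFoo
	}
	return stStart
}

// search returns the end offset of the first match (or -1) and the number
// of DFA steps taken. With usePrefilter set, it jumps to the next "foo"
// candidate whenever the DFA is back in the start state.
func search(hay []byte, usePrefilter bool) (end, steps int) {
	state := stStart
	for i := 0; i < len(hay); {
		if usePrefilter && state == stStart {
			// bytes.Index stands in for a real prefilter here.
			j := bytes.Index(hay[i:], []byte("foo"))
			if j < 0 {
				return -1, steps
			}
			i += j
		}
		state = step(state, hay[i])
		steps++
		i++
		if state == stMatch {
			return i, steps
		}
	}
	return -1, steps
}

func main() {
	hay := []byte("xxxxxxxxxx foo! xxxxxxxxxx foo7")
	endA, stepsA := search(hay, false)
	endB, stepsB := search(hay, true)
	fmt.Println("same result:", endA == endB, "fewer steps:", stepsB < stepsA)
}
```

Because the DFA still verifies every candidate in full, a prefix-only prefilter cannot cause a false match in this sketch; it only changes how many bytes the DFA has to step through.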

@@ -64,16 +64,16 @@ Cross-language benchmarks on 6MB input, AMD EPYC ([source](https://github.com/ko
 
 | Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust |
 |---------|-----------|---------|------------|-----------|---------|
-| Literal alternation | 475 ms | 4.4 ms | 0.6 ms | **108x** | 7.1x slower |
-| Multi-literal | 1412 ms | 12.8 ms | 4.7 ms | **110x** | 2.7x slower |
-| Inner `.*keyword.*` | 232 ms | 0.30 ms | 0.27 ms | **774x** | 1.1x slower |
-| Suffix `.*\.txt` | 236 ms | 1.82 ms | 1.13 ms | **129x** | 1.6x slower |
-| Multiline `(?m)^/.*\.php` | 103 ms | 0.50 ms | 0.67 ms | **206x** | **1.3x faster** |
-| Email validation | 265 ms | 0.62 ms | 0.27 ms | **428x** | 2.2x slower |
-| URL extraction | 353 ms | 0.65 ms | 0.35 ms | **543x** | 1.8x slower |
-| IP address | 496 ms | 2.1 ms | 12.1 ms | **231x** | **5.6x faster** |
-| Char class `[\w]+` | 581 ms | 51.2 ms | 50.2 ms | **11x** | ~parity |
-| Word repeat `(\w{2,8})+` | 712 ms | 186 ms | 48.7 ms | **3x** | 3.8x slower |
+| Literal alternation | 475 ms | 4.4 ms | 0.7 ms | **109x** | 6.3x slower |
+| Multi-literal | 1391 ms | 12.6 ms | 4.7 ms | **110x** | 2.6x slower |
+| Inner `.*keyword.*` | 231 ms | 0.29 ms | 0.29 ms | **797x** | **~parity** |
+| Suffix `.*\.txt` | 234 ms | 1.83 ms | 1.07 ms | **128x** | 1.7x slower |
+| Multiline `(?m)^/.*\.php` | 103 ms | 0.66 ms | 0.66 ms | **156x** | **~parity** |
+| Email validation | 261 ms | 0.54 ms | 0.31 ms | **482x** | 1.7x slower |
+| URL extraction | 262 ms | 0.84 ms | 0.35 ms | **311x** | 2.4x slower |
+| IP address | 498 ms | 2.1 ms | 12.0 ms | **237x** | **5.6x faster** |
+| Char class `[\w]+` | 554 ms | 48.0 ms | 50.1 ms | **11x** | **~parity** |
+| Word repeat `(\w{2,8})+` | 641 ms | 185 ms | 48.7 ms | **3x** | 3.7x slower |
 
 **Where coregex excels:**
 - Multiline patterns (`(?m)^/.*\.php`) — near Rust parity, 100x+ vs stdlib

@@ -2,7 +2,7 @@
 
 > **Strategic Focus**: Production-grade regex engine with RE2/rust-regex level optimizations
 
-**Last Updated**: 2026-03-20 | **Current Version**: v0.12.15 | **Target**: v1.0.0 stable
+**Last Updated**: 2026-03-24 | **Current Version**: v0.12.18 | **Target**: v1.0.0 stable
 
 ---
 
@@ -87,7 +87,13 @@ v0.12.13 ✅ → FatTeddy fix, prefilter acceleration, AC v0.2.1
          ↓
 v0.12.14 ✅ → Concurrent safety fix for isMatchDFA prefilter (#137)
          ↓
-v0.12.15 (Current) ✅ → Per-goroutine DFA cache, word boundary 30%→0.3% CPU, AC prefilter
+v0.12.15 ✅ → Per-goroutine DFA cache, word boundary 30%→0.3% CPU, AC prefilter
+         ↓
+v0.12.16 ✅ → WrapLineAnchor for (?m)^ patterns
+         ↓
+v0.12.17 ✅ → Fix LogParser ARM64 regression, restore DFA/Teddy for (?m)^
+         ↓
+v0.12.18 (Current) ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ahead
          ↓
 v1.0.0-rc → Feature freeze, API locked
          ↓
@@ -130,7 +136,10 @@ v1.0.0 STABLE → Production release with API stability guarantee
 - ✅ **v0.12.12**: Prefix trimming for case-fold literals
 - ✅ **v0.12.13**: FatTeddy fix (ANDL→ORL, VPTEST), prefilter acceleration, AC v0.2.1
 - ✅ **v0.12.14**: Concurrent safety fix for isMatchDFA prefilter (#137)
-- ✅ **v0.12.15**: Per-goroutine DFA cache (Rust approach), word boundary 30%→0.3% CPU, AC DFA prefilter for >32 literals (7-13x faster)
+- ✅ **v0.12.15**: Per-goroutine DFA cache (Rust approach), word boundary 30%→0.3% CPU, 7 correctness fixes
+- ✅ **v0.12.16**: WrapLineAnchor for (?m)^ patterns
+- ✅ **v0.12.17**: Fix LogParser ARM64 regression — restore DFA/Teddy for (?m)^, partial prefilter
+- ✅ **v0.12.18**: Flat DFA transition table (Rust approach), integrated prefilter skip-ahead in DFA+PikeVM, 4x unrolling: **35% faster than v0.12.14, 3x slower than Rust**
 
 ---
 

@@ -27,36 +27,50 @@ import (
 //   - After too many clears, falls back to NFA
 //   - Clearing keeps allocated memory to avoid re-allocation
 type DFACache struct {
-	// states maps StateKey -> DFA State
+	// states maps StateKey -> DFA State (used only in determinize slow path)
 	states map[StateKey]*State
 
-	// stateList provides O(1) lookup of states by ID via direct indexing.
-	// StateIDs are sequential (0, 1, 2...), so slice indexing is faster than map.
-	// This was previously DFA.states — moved here because it grows during search.
+	// stateList provides O(1) lookup of State structs by ID.
+	// Used only in slow path (determinize, word boundary, acceleration).
+	// Hot loop uses flatTrans + matchFlags instead.
 	stateList []*State
 
+	// --- Flat transition table (Rust approach) ---
+	// Hot loop uses ONLY these fields — no *State pointer chase.
+	//
+	// Rust: cache.trans[sid + class] — single flat array, premultiplied ID.
+	// We use: flatTrans[int(sid)*stride + class] — same layout.
+	//
+	// This replaces per-state State.transitions[] in the hot loop:
+	// ONE slice access instead of TWO pointer chases (stateList → State → transitions).
+
+	// flatTrans is the flat transition table.
+	// Layout: [state0_c0, state0_c1, ..., state0_cN, state1_c0, ...]
+	// InvalidState (0xFFFFFFFF) = unknown transition (needs determinize).
+	flatTrans []StateID
+
+	// matchFlags[stateID] = true if state is a match/accepting state.
+	// Replaces State.IsMatch() in hot loop — no pointer chase needed.
+	matchFlags []bool
+
+	// stride is the number of byte equivalence classes (alphabet size).
+	stride int
+
 	// startTable caches start states for different look-behind contexts.
-	// This enables correct handling of assertions (^, \b, etc.) and
-	// avoids recomputing epsilon closures on every search.
-	// Previously lived on DFA — moved here because it is populated lazily.
 	startTable StartTable
 
 	// maxStates is the capacity limit
 	maxStates uint32
 
 	// nextID is the next available state ID.
-	// Start at 1 (0 is reserved for StartState).
 	nextID StateID
 
-	// clearCount tracks how many times the cache has been cleared during
-	// the current search. This is used to detect pathological cache thrashing
-	// and trigger NFA fallback when clears exceed the configured limit.
-	// Inspired by Rust regex-automata's hybrid DFA cache clearing strategy.
+	// clearCount tracks cache clear count for NFA fallback threshold.
 	clearCount int
 
-	// Statistics for cache performance tuning
-	hits   uint64 // Number of cache hits
-	misses uint64 // Number of cache misses
+	// Statistics
+	hits   uint64
+	misses uint64
 }
 
 // Get retrieves a state by its key.
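
The layout change documented in the `DFACache` comments can be sketched in isolation. The following is a minimal illustration, not the project's code: `nestedState` and `flatten` are hypothetical names, and only `StateID`, `InvalidState`, and the `sid*stride + class` indexing rule come from the diff. It shows that the flat slice holds the same transitions as the per-state rows, reachable with a single index computation.

```go
package main

import "fmt"

// StateID and InvalidState mirror names from the diff; the rest is a toy.
type StateID uint32

const InvalidState StateID = 0xFFFFFFFF

// nestedState models the old layout: one heap object per state,
// each holding its own transitions slice (hypothetical type).
type nestedState struct {
	transitions []StateID
}

// flatten copies the per-state rows into one slice so that the transition
// for (sid, class) lives at index sid*stride + class. It assumes every
// row has exactly stride entries.
func flatten(states []*nestedState, stride int) []StateID {
	flat := make([]StateID, 0, len(states)*stride)
	for _, s := range states {
		flat = append(flat, s.transitions...)
	}
	return flat
}

func main() {
	const stride = 3 // number of byte equivalence classes in this toy
	states := []*nestedState{
		{transitions: []StateID{1, 0, InvalidState}},
		{transitions: []StateID{1, 2, 0}},
		{transitions: []StateID{2, 2, 2}},
	}
	flat := flatten(states, stride)

	same := true
	for sid := range states {
		for class := 0; class < stride; class++ {
			nested := states[sid].transitions[class] // two dependent loads
			direct := flat[sid*stride+class]         // one slice access
			if nested != direct {
				same = false
			}
		}
	}
	fmt.Println("layouts agree:", same)
}
```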
@@ -95,9 +109,67 @@ func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) {
 	c.states[key] = state
 	c.misses++
 
+	// Grow flat transition table for this state's row (all InvalidState initially).
+	if c.stride > 0 {
+		sid := int(state.id)
+		needed := (sid + 1) * c.stride
+		if needed > len(c.flatTrans) {
+			growth := needed - len(c.flatTrans)
+			for i := 0; i < growth; i++ {
+				c.flatTrans = append(c.flatTrans, InvalidState)
+			}
+		}
+		// Grow matchFlags
+		for len(c.matchFlags) <= sid {
+			c.matchFlags = append(c.matchFlags, false)
+		}
+		c.matchFlags[sid] = state.isMatch
+	}
+
 	return state.ID(), nil
 }
 
+// safeOffset computes flat table offset, safe on 386 where int is 32-bit.
+// StateID is uint32; on 386 int(0xFFFFFFFF) = -1 and the int multiply can overflow.
+// Returns MaxInt for special state IDs (DeadState, InvalidState) so bounds
+// check (offset < ftLen) always fails safely.
+func safeOffset(sid StateID, stride int, classIdx int) int {
+	if sid >= DeadState {
+		return int(^uint(0) >> 1) // MaxInt — always >= ftLen
+	}
+	return int(sid)*stride + classIdx
+}
+
+// SetFlatTransition records a transition in the flat table.
+// Called from determinize when a transition is computed.
+func (c *DFACache) SetFlatTransition(fromID StateID, classIdx int, toID StateID) {
+	offset := safeOffset(fromID, c.stride, classIdx)
+	if offset < len(c.flatTrans) {
+		c.flatTrans[offset] = toID
+	}
+}
+
+// FlatNext returns the next state ID from the flat table.
+// Returns InvalidState if the transition hasn't been computed yet.
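+// This is the hot-path function and should be inlined by the compiler.
+// Unlike SetFlatTransition, it applies no safeOffset guard: callers must pass
+// a regular state ID (< DeadState) whose row already exists in flatTrans.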
+func (c *DFACache) FlatNext(sid StateID, classIdx int) StateID {
+	offset := int(sid)*c.stride + classIdx
+	return c.flatTrans[offset]
+}
+
+// IsMatchState returns whether the given state ID is a match state.
+// Uses compact matchFlags slice — no pointer chase.
+func (c *DFACache) IsMatchState(sid StateID) bool {
+	if sid >= DeadState {
+		return false
+	}
+	id := int(sid)
+	if id >= len(c.matchFlags) {
+		return false
+	}
+	return c.matchFlags[id]
+}
+
 // GetOrInsert retrieves a state from cache or inserts it if not present.
 // This is the primary method used during DFA construction.
 //
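
A flat lookup of the `FlatNext` kind is what makes the 4x unrolled hot loop from the changelog practical. The sketch below is an assumption-laden illustration rather than the actual `searchFirstAt`: `next`, `runUnrolled`, `runSimple`, and `digitClasses` are hypothetical helpers, and the toy stops and hands control back to the caller at the first `InvalidState` instead of running a real determinize slow path.

```go
package main

import "fmt"

type StateID uint32

const InvalidState StateID = 0xFFFFFFFF

// next is the flat lookup: one slice access per input byte.
func next(flat []StateID, stride int, sid StateID, class byte) StateID {
	return flat[int(sid)*stride+int(class)]
}

// runUnrolled consumes four bytes per iteration while every transition is
// known. It returns the last valid state and the offset where it stopped,
// which is len(in) unless an InvalidState transition was hit.
func runUnrolled(flat []StateID, stride int, cls *[256]byte, sid StateID, in []byte) (StateID, int) {
	i := 0
	for i+4 <= len(in) {
		s1 := next(flat, stride, sid, cls[in[i]])
		if s1 == InvalidState {
			return sid, i
		}
		s2 := next(flat, stride, s1, cls[in[i+1]])
		if s2 == InvalidState {
			return s1, i + 1
		}
		s3 := next(flat, stride, s2, cls[in[i+2]])
		if s3 == InvalidState {
			return s2, i + 2
		}
		s4 := next(flat, stride, s3, cls[in[i+3]])
		if s4 == InvalidState {
			return s3, i + 3
		}
		sid, i = s4, i+4
	}
	// Tail: fewer than four bytes remain, so step one byte at a time.
	rest, n := runSimple(flat, stride, cls, sid, in[i:])
	return rest, i + n
}

// runSimple is the byte-at-a-time reference loop.
func runSimple(flat []StateID, stride int, cls *[256]byte, sid StateID, in []byte) (StateID, int) {
	for i, b := range in {
		s := next(flat, stride, sid, cls[b])
		if s == InvalidState {
			return sid, i
		}
		sid = s
	}
	return sid, len(in)
}

// digitClasses maps ASCII digits to class 1 and everything else to class 0.
func digitClasses() *[256]byte {
	var c [256]byte
	for b := '0'; b <= '9'; b++ {
		c[b] = 1
	}
	return &c
}

func main() {
	cls := digitClasses()
	flat := []StateID{0, 1, 0, 1} // 2 states x 2 classes, all transitions known
	s, pos := runUnrolled(flat, 2, cls, 0, []byte("ab12cd345"))
	fmt.Println(s, pos)
}
```

Both loops return the same state and offset for any input; the unrolled version only reduces loop overhead while transitions stay inside the table.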
@@ -220,6 +292,11 @@ func (c *DFACache) getState(id StateID) *State {
 		return nil
 	}
 
+	// Guard against special state IDs (DeadState=0xFFFFFFFE, InvalidState=0xFFFFFFFF).
+	// On 386, int(uint32(0xFFFFFFFF)) = -1, causing negative index panic.
+	if id >= DeadState {
+		return nil
+	}
 	idx := int(id)
 	if idx >= len(c.stateList) {
 		return nil
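
The 386 concern behind the `id >= DeadState` guard can be reproduced on any platform by converting through `int32` explicitly. This is a minimal sketch under stated assumptions: `indexOn32Bit` and `guardedIndex` are hypothetical helpers, and `guardedIndex` compares in unsigned space, which is a variation on the guard in the diff rather than a copy of it. The constant values are the ones given in the diff's comment.

```go
package main

import "fmt"

type StateID uint32

// Values as stated in the diff's comment.
const (
	DeadState    StateID = 0xFFFFFFFE
	InvalidState StateID = 0xFFFFFFFF
)

// indexOn32Bit mimics int(sid) on a platform where int is 32 bits wide.
// A non-constant integer conversion wraps instead of failing to compile.
func indexOn32Bit(sid StateID) int32 {
	return int32(sid)
}

// guardedIndex rejects special state IDs before converting to int, then
// bounds-checks in unsigned space so the check does not depend on int width.
func guardedIndex(sid StateID, length int) (int, bool) {
	if sid >= DeadState {
		return 0, false
	}
	if uint64(sid) >= uint64(length) {
		return 0, false
	}
	return int(sid), true
}

func main() {
	fmt.Println(indexOn32Bit(InvalidState), indexOn32Bit(DeadState))
	idx, ok := guardedIndex(InvalidState, 10)
	fmt.Println(idx, ok)
	idx, ok = guardedIndex(3, 10)
	fmt.Println(idx, ok)
}
```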