From d22c05c7f0b55f50a12b81d30d7e2f28e645e88a Mon Sep 17 00:00:00 2001 From: Andy Date: Wed, 25 Mar 2026 21:44:18 +0300 Subject: [PATCH] =?UTF-8?q?perf:=20v0.12.20=20=E2=80=94=20premultiplied=20?= =?UTF-8?q?StateIDs,=20break-at-match,=20Phase=203=20elimination?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit DFA Core — Premultiplied + Tagged StateIDs: - StateID stores byte offset into flatTrans, eliminating multiply from hot loop - Match/dead/invalid flags encoded in StateID high bits (single IsTagged branch) - 4x loop unrolling in searchFirstAt, searchAt, searchEarliestMatch - safeOffset eliminated from all DFA search paths DFA Core — Rust-aligned Determinize: - 1-byte match delay (Rust determinize mod.rs:254-286) - Break-at-match: stop NFA iteration at Match state, drop prefix restarts - Epsilon closure rewrite: add-on-pop DFS with reverse Split push order, matching Rust sparse set insertion order (verified via cargo run) - Incremental per-target epsilon closure in moveWithWordContext - filterStatesAfterMatch removed (replaced by break-at-match) - BreakAtMatch config: true for forward DFA, false for reverse DFA - Phase 3 (SearchAtAnchored re-scan) eliminated — 2-pass bidirectional DFA - Fix: meta dfaConfig uses DefaultConfig() to inherit BreakAtMatch=true Meta Engine: - DFA direct FindAll path — skip meta prefilter layer, call DFA directly - Fast path for start-anchored FindAll — skip pool overhead - Inline first-byte rejection for anchored patterns - Prefilter candidate pass-through to bidirectional DFA - Skip reverse DFA for always-anchored patterns NFA/PikeVM: - Lazy SlotTable init — reduce cold start overhead - Fix anchored BoundedBacktracker on large input — truncate to MaxInputSize Prefilter: - Memmem: Memchr(rareByte) + verify (Rust approach) — replaces MemchrPair Benchmarks (EPYC CI, 6MB input, vs stdlib / vs Rust): - ip: 675x faster than stdlib, 18.5x faster than Rust - multiline_php: 288x faster than stdlib, 2.0x faster than Rust - char_class: 11x faster than stdlib, 1.3x faster than Rust - inner_literal: 668x faster than stdlib, at Rust parity - email: 506x faster than stdlib - LangArena total: 30x faster than stdlib, 3.9x gap vs Rust 27 files changed, +734 -583 lines. All tests pass. --- CHANGELOG.md | 32 ++ README.md | 22 +- ROADMAP.md | 11 +- dfa/lazy/accel_test.go | 14 +- dfa/lazy/anchored_search_prefilter_test.go | 2 +- dfa/lazy/builder.go | 208 ++++---- dfa/lazy/cache.go | 91 ++-- dfa/lazy/cache_test.go | 12 +- dfa/lazy/config.go | 13 + dfa/lazy/lazy.go | 556 ++++++++------------- dfa/lazy/search_extra_test.go | 4 +- dfa/lazy/start.go | 8 +- dfa/lazy/state.go | 133 ++++- docs/ARCHITECTURE.md | 10 +- meta/compile.go | 18 +- meta/engine.go | 11 + meta/find_indices.go | 37 +- meta/findall.go | 42 +- meta/reverse_anchored.go | 8 +- meta/reverse_inner.go | 7 +- meta/reverse_suffix.go | 7 +- meta/reverse_suffix_set.go | 6 +- nfa/compile.go | 3 +- nfa/pikevm.go | 28 +- nfa/slot_table.go | 9 + regex.go | 6 + simd/memmem.go | 31 +- 27 files changed, 712 insertions(+), 617 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 433028f..b7f0059 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,38 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120)) - SIMD prefilter for CompositeSequenceDFA (#83) +## [0.12.20] - 2026-03-25 + +### Performance +- **Premultiplied State IDs** — StateID stores byte offset into flat transition table, + eliminating multiply from DFA hot loop. Single `flatTrans[sid+classIdx]` lookup. + Inspired by Rust `LazyStateID` (hybrid/id.rs). + +- **Tagged State IDs** — match/dead/invalid/start flags encoded in StateID high bits. + Single `IsTagged()` branch replaces 3 separate comparisons in DFA hot loop. + 4x loop unrolling breaks to slow path only on tagged states. + +- **1-byte match delay** (Rust determinize approach) — match reporting delayed by 1 byte, + enabling correct look-around assertion resolution (^, $, \b) at match boundaries. + Reference: Rust `determinize` mod.rs:254-286. + +- **Rust-aligned DFA determinize: break-at-match** — replaced `filterStatesAfterMatch` + with Rust's `determinize::next` break-at-match semantics (mod.rs:284). Epsilon closure + uses add-on-pop DFS with reverse Split push, matching Rust sparse set insertion order. + Incremental per-target epsilon closure preserves correct state ordering for leftmost-first. + **Eliminates Phase 3** anchored re-scan: bidirectional DFA reduced from 3-pass to 2-pass. + Verified against Rust regex-automata `find_fwd` — identical results on all test patterns. + +- **Memmem: Memchr(rareByte) + verify** (Rust approach) — replaced `MemchrPair`-based + paired search in `simd.Memmem` with single rare byte Memchr scan + `bytes.Equal` + verify, matching Rust `memchr::memmem` architecture. + +### Benchmarks (LangArena LogParser, 7.2 MB, 13 patterns) + +| vs stdlib | vs Rust | Wins vs Rust | +|-----------|---------|-------------| +| **30x faster** total | 2-5x gap (local i7) | ip 18.5x, multiline_php 2.0x, char_class 1.3x | + ## [0.12.19] - 2026-03-24 ### Performance diff --git a/README.md b/README.md index 3c4201c..3b4f4f0 100644 --- a/README.md +++ b/README.md @@ -64,19 +64,19 @@ Cross-language benchmarks on 6MB input, AMD EPYC ([source](https://github.com/ko | Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust | |---------|-----------|---------|------------|-----------|---------| -| Literal alternation | 475 ms | 4.4 ms | 0.7 ms | **109x** | 6.3x slower | -| Multi-literal | 1391 ms | 12.6 ms | 4.7 ms | **110x** | 2.6x slower | -| Inner `.*keyword.*` | 231 ms | 0.29 ms | 0.29 ms | **797x** | **~parity** | -| Suffix `.*\.txt` | 234 ms | 1.83 ms | 1.07 ms | **128x** | 1.7x slower | -| Multiline `(?m)^/.*\.php` | 103 ms | 0.66 ms | 0.66 ms | **156x** | **~parity** | -| Email validation | 261 ms | 0.54 ms | 0.31 ms | **482x** | 1.7x slower | -| URL extraction | 262 ms | 0.84 ms | 0.35 ms | **311x** | 2.4x slower | -| IP address | 498 ms | 2.1 ms | 12.0 ms | **237x** | **5.6x faster** | -| Char class `[\w]+` | 554 ms | 48.0 ms | 50.1 ms | **11x** | **1.0x faster** | -| Word repeat `(\w{2,8})+` | 641 ms | 185 ms | 48.7 ms | **3x** | 3.7x slower | +| Literal alternation | 466 ms | 4.2 ms | 0.65 ms | **110x** | 6.4x slower | +| Multi-literal | 1391 ms | 12.4 ms | 5.3 ms | **112x** | 2.3x slower | +| Inner `.*keyword.*` | 227 ms | 0.34 ms | 0.32 ms | **668x** | **~parity** | +| Suffix `.*\.txt` | 228 ms | 2.9 ms | 1.3 ms | **78x** | 2.3x slower | +| Multiline `(?m)^/.*\.php` | 101 ms | 0.35 ms | 0.72 ms | **288x** | **2.0x faster** | +| Email validation | 258 ms | 0.51 ms | 0.27 ms | **506x** | 1.8x slower | +| URL extraction | 259 ms | 0.71 ms | 0.35 ms | **364x** | 2.0x slower | +| IP address | 493 ms | 0.73 ms | 13.5 ms | **675x** | **18.5x faster** | +| Char class `[\w]+` | 483 ms | 40.9 ms | 56.0 ms | **11x** | **1.3x faster** | +| Word repeat `(\w{2,8})+` | 628 ms | 167 ms | 54.8 ms | **3x** | 3.0x slower | **Where coregex excels:** -- Multiline patterns (`(?m)^/.*\.php`) — near Rust parity, 100x+ vs stdlib +- Multiline patterns (`(?m)^/.*\.php`) — **2x faster than Rust**, 288x vs stdlib - IP/phone patterns (`\d+\.\d+\.\d+\.\d+`) — SIMD digit prefilter skips non-digit regions - Suffix patterns (`.*\.log`, `.*\.txt`) — reverse search optimization (1000x+) - Inner literals (`.*error.*`, `.*@example\.com`) — bidirectional DFA (900x+) diff --git a/ROADMAP.md b/ROADMAP.md index e9b34c2..d38bd6b 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -2,7 +2,7 @@ > **Strategic Focus**: Production-grade regex engine with RE2/rust-regex level optimizations -**Last Updated**: 2026-03-24 | **Current Version**: v0.12.18 | **Target**: v1.0.0 stable +**Last Updated**: 2026-03-25 | **Current Version**: v0.12.19 | **Target**: v1.0.0 stable --- @@ -12,7 +12,7 @@ Build a **production-ready, high-performance regex engine** for Go that matches ### Current State vs Target -| Metric | Current (v0.12.15) | Target (v1.0.0) | +| Metric | Current (v0.12.19) | Target (v1.0.0) | |--------|-------------------|-----------------| | Inner literal speedup | **280-3154x** | ✅ Achieved | | Case-insensitive speedup | **263x** | ✅ Achieved | @@ -93,7 +93,12 @@ v0.12.16 ✅ → WrapLineAnchor for (?m)^ patterns ↓ v0.12.17 ✅ → Fix LogParser ARM64 regression, restore DFA/Teddy for (?m)^ ↓ -v0.12.18 (Current) ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ahead +v0.12.18 ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ahead + ↓ +v0.12.19 ✅ → Zero-alloc FindSubmatch, byte-based DFA cache, Rust-aligned visited limits + ↓ +v0.12.20 (Current) → Premultiplied/tagged StateIDs, break-at-match DFA determinize, + Phase 3 elimination (2-pass bidirectional DFA) ↓ v1.0.0-rc → Feature freeze, API locked ↓ diff --git a/dfa/lazy/accel_test.go b/dfa/lazy/accel_test.go index d434dea..58ed067 100644 --- a/dfa/lazy/accel_test.go +++ b/dfa/lazy/accel_test.go @@ -98,18 +98,20 @@ func TestDetectAccelerationFromCached(t *testing.T) { func TestDetectAccelerationFromFlat(t *testing.T) { // Test acceleration detection via flat transition table + // Using premultiplied state IDs: sid = stateIndex * stride stride := 256 - sid := StateID(1) - flatTrans := make([]StateID, 2*stride) // 2 states + sid := StateID(1 * stride) // premultiplied: state 1 at offset 256 + state2 := StateID(2 * stride) + flatTrans := make([]StateID, 3*stride) // 3 states (0, 1, 2) // State 1: 250 self-loops, 3 exits to state 2, 3 dead - base := int(sid) * stride + base := sid.Offset() for i := 0; i < 250; i++ { flatTrans[base+i] = sid // Self-loop } - flatTrans[base+250] = StateID(2) - flatTrans[base+251] = StateID(2) - flatTrans[base+252] = StateID(2) + flatTrans[base+250] = state2 + flatTrans[base+251] = state2 + flatTrans[base+252] = state2 flatTrans[base+253] = DeadState flatTrans[base+254] = DeadState flatTrans[base+255] = DeadState diff --git a/dfa/lazy/anchored_search_prefilter_test.go b/dfa/lazy/anchored_search_prefilter_test.go index 0593e89..e1d4472 100644 --- a/dfa/lazy/anchored_search_prefilter_test.go +++ b/dfa/lazy/anchored_search_prefilter_test.go @@ -525,7 +525,7 @@ func TestFindWithPrefilterAtWordBoundary(t *testing.T) { // TestFindWithPrefilterAtCacheClear tests the cache-clear recovery path // in findWithPrefilterAt using a very small cache. func TestFindWithPrefilterAtCacheClear(t *testing.T) { - config := DefaultConfig().WithMaxStates(3).WithMaxCacheClears(10) + config := DefaultConfig().WithMaxStates(6).WithMaxCacheClears(20) compiler := nfa.NewDefaultCompiler() nfaObj, err := compiler.Compile("[a-zA-Z]+[0-9]+") if err != nil { diff --git a/dfa/lazy/builder.go b/dfa/lazy/builder.go index f528e87..4c3f065 100644 --- a/dfa/lazy/builder.go +++ b/dfa/lazy/builder.go @@ -61,17 +61,6 @@ func (b *Builder) Build() (*DFA, error) { pf = b.buildPrefilter() } - // Compute fresh start states: epsilon closure of anchored start. - // These are states that get re-introduced via unanchored machinery after each position. - // Used for leftmost matching: when all remaining states are in this set plus unanchored - // machinery, the committed match is final. - startLook := LookSetFromStartKind(StartText) - anchoredStartClosure := b.epsilonClosure([]nfa.StateID{b.nfa.StartAnchored()}, startLook) - freshStartStates := make(map[nfa.StateID]bool, len(anchoredStartClosure)) - for _, stateID := range anchoredStartClosure { - freshStartStates[stateID] = true - } - // Check if the NFA contains word boundary assertions hasWordBoundary := b.checkHasWordBoundary() @@ -89,7 +78,6 @@ func (b *Builder) Build() (*DFA, error) { prefilter: pf, pikevm: nfa.NewPikeVM(b.nfa), byteClasses: b.nfa.ByteClasses(), - freshStartStates: freshStartStates, unanchoredStart: b.nfa.StartUnanchored(), hasWordBoundary: hasWordBoundary, isAlwaysAnchored: isAlwaysAnchored, @@ -140,77 +128,16 @@ func (b *Builder) buildPrefilter() prefilter.Prefilter { // 4. Collect all reachable states // 5. Return sorted list for consistent ordering func (b *Builder) epsilonClosure(states []nfa.StateID, lookHave LookSet) []nfa.StateID { - // Use pooled StateSet for efficient membership testing and deduplication closure := acquireStateSet() defer releaseStateSet(closure) - stack := make([]nfa.StateID, 0, len(states)*2) - // Initialize with input states + // Reuse epsilonClosureInto for each seed state. for _, sid := range states { - if !closure.Contains(sid) { - closure.Add(sid) - stack = append(stack, sid) - } - } - - // DFS through epsilon transitions - for len(stack) > 0 { - // Pop from stack - current := stack[len(stack)-1] - stack = stack[:len(stack)-1] - - // Get NFA state - state := b.nfa.State(current) - if state == nil { - continue - } - - // Follow epsilon transitions - switch state.Kind() { - case nfa.StateEpsilon: - next := state.Epsilon() - if next != nfa.InvalidState && !closure.Contains(next) { - closure.Add(next) - stack = append(stack, next) - } - - case nfa.StateSplit: - left, right := state.Split() - if left != nfa.InvalidState && !closure.Contains(left) { - closure.Add(left) - stack = append(stack, left) - } - if right != nfa.InvalidState && !closure.Contains(right) { - closure.Add(right) - stack = append(stack, right) - } - - case nfa.StateLook: - // CRITICAL: Only follow if the look assertion is satisfied - // This is the key fix for proper ^ and $ handling in DFA. - // Without this check, the DFA would incorrectly match patterns - // like "^abc" at any position in the input. - look, next := state.Look() - if lookHave.Contains(look) && next != nfa.InvalidState && !closure.Contains(next) { - closure.Add(next) - stack = append(stack, next) - } - - case nfa.StateCapture: - // Capture states are epsilon transitions that record positions. - // The DFA ignores captures (it only tracks match/no-match), - // but we must follow through to reach the actual consuming states. - // Fix for Issue #15: DFA.IsMatch returns false for patterns with capture groups. - _, _, next := state.Capture() - if next != nfa.InvalidState && !closure.Contains(next) { - closure.Add(next) - stack = append(stack, next) - } - } + b.epsilonClosureInto(closure, sid, lookHave) } - // Return sorted slice for consistent state keys - return closure.ToSlice() + // Return insertion order to match Rust sparse set iteration order. + return closure.ToSliceInsertionOrder() } // moveWithWordContext computes the set of NFA states reachable from the given states on input byte b, @@ -235,26 +162,40 @@ func (b *Builder) epsilonClosure(states []nfa.StateID, lookHave LookSet) []nfa.S // // This effectively simulates one step of the NFA for all active states. func (b *Builder) moveWithWordContext(states []nfa.StateID, input byte, isFromWord bool) []nfa.StateID { - // Fast path: skip word boundary resolution if NFA has no word boundaries. - // This optimization eliminates ~74% of allocations for patterns without \b/\B. - // Based on Rust regex-automata approach: only resolve boundaries when needed. + return b.moveWithWordContextBreak(states, input, isFromWord, false) +} + +// moveWithWordContextBreak is moveWithWordContext with optional break-at-match. +// When breakAtMatch is true, iteration stops at the first Match state encountered. +// This implements Rust's determinize::next break semantics (mod.rs:284): +// after finding a Match, remaining states (prefix restarts) are not processed, +// so the DFA reaches dead state and terminates with the committed match. +// +// Critical: uses INCREMENTAL epsilon closure (per-target, like Rust) instead of +// batch closure. This ensures that each ByteRange target's epsilon closure is +// added to the result set in iteration order. Match states from earlier targets +// appear before prefix restart states from later targets, making break-at-match +// work correctly for all patterns. +func (b *Builder) moveWithWordContextBreak(states []nfa.StateID, input byte, isFromWord bool, breakAtMatch bool) []nfa.StateID { var resolvedStates []nfa.StateID if !b.hasWordBoundary { - // No word boundaries - use states directly, skip expensive resolution resolvedStates = states } else { - // Compute word boundary status for this transition isCurrentWord := isWordByte(input) wordBoundarySatisfied := isFromWord != isCurrentWord - - // Step 1: Resolve word boundary assertions in the current state set. - // StateLook(\b) and StateLook(\B) that weren't followed during epsilon closure - // need to be resolved now that we know the current byte. resolvedStates = b.resolveWordBoundaries(states, wordBoundarySatisfied) } - // Step 2: Collect target states for this input byte (use pooled StateSet) - targets := acquireStateSet() + // Determine look assertions satisfied after this byte transition. + var lookAfter LookSet + if input == '\n' { + lookAfter = LookStartLine + } + + // Incremental epsilon closure: for each ByteRange match, epsilon-close the + // target into the result set immediately. This matches Rust's determinize::next + // where each matched target is epsilon-closed into sparses.set2 in iteration order. + result := acquireStateSet() for _, sid := range resolvedStates { state := b.nfa.State(sid) @@ -262,52 +203,89 @@ func (b *Builder) moveWithWordContext(states []nfa.StateID, input byte, isFromWo continue } + // Rust determinize::next (mod.rs:284): break at Match state. + if breakAtMatch && state.Kind() == nfa.StateMatch { + break + } + switch state.Kind() { case nfa.StateByteRange: lo, hi, next := state.ByteRange() if input >= lo && input <= hi { - targets.Add(next) + b.epsilonClosureInto(result, next, lookAfter) } case nfa.StateSparse: for _, tr := range state.Transitions() { if input >= tr.Lo && input <= tr.Hi { - targets.Add(tr.Next) + b.epsilonClosureInto(result, tr.Next, lookAfter) } } } } - // No transitions on this byte - if targets.Len() == 0 { - releaseStateSet(targets) + if result.Len() == 0 { + releaseStateSet(result) return nil } - // Step 3: Determine look assertions satisfied after this byte transition. - // IMPORTANT: Word boundary assertions are handled in resolveWordBoundaries, - // NOT here. This is because word boundary is position-specific - it's resolved - // when we START consuming a byte, not after we've consumed it. - // - // Only line assertions (^, $) are passed to epsilonClosure because they - // depend only on the previous byte (was it '\n'?), not on the current byte. - var lookAfter LookSet + resultSlice := result.ToSliceInsertionOrder() + releaseStateSet(result) + return resultSlice +} - // Line boundary: After '\n', multiline ^ (LookStartLine) is satisfied. - if input == '\n' { - lookAfter = LookStartLine - } +// epsilonClosureInto adds a single state and its epsilon closure to an existing +// StateSet. States already in the set are skipped (deduplication via Contains). +// This enables incremental epsilon closure matching Rust's determinize::next +// where each matched ByteRange target is closed into the result set in order. +func (b *Builder) epsilonClosureInto(result *StateSet, seed nfa.StateID, lookHave LookSet) { + // Same add-on-pop + reverse-push approach as epsilonClosure. + stack := make([]nfa.StateID, 1, 8) + stack[0] = seed - // Word boundary bits are NOT included here - they're handled by - // resolveWordBoundaries at the START of the next move() call. - // The isFromWord state of the target DFA state will be used to - // resolve word boundary assertions when the next byte is consumed. + for len(stack) > 0 { + current := stack[len(stack)-1] + stack = stack[:len(stack)-1] + + if result.Contains(current) { + continue + } + result.Add(current) - // Compute epsilon-closure of target states with appropriate look assertions - // Get slice before releasing, as ToSlice allocates a new slice - targetSlice := targets.ToSlice() - releaseStateSet(targets) - return b.epsilonClosure(targetSlice, lookAfter) + state := b.nfa.State(current) + if state == nil { + continue + } + + switch state.Kind() { + case nfa.StateEpsilon: + next := state.Epsilon() + if next != nfa.InvalidState { + stack = append(stack, next) + } + + case nfa.StateSplit: + left, right := state.Split() + if right != nfa.InvalidState { + stack = append(stack, right) + } + if left != nfa.InvalidState { + stack = append(stack, left) + } + + case nfa.StateLook: + look, next := state.Look() + if lookHave.Contains(look) && next != nfa.InvalidState { + stack = append(stack, next) + } + + case nfa.StateCapture: + _, _, next := state.Capture() + if next != nfa.InvalidState { + stack = append(stack, next) + } + } + } } // resolveWordBoundaries expands the NFA state set by following word boundary assertions @@ -610,7 +588,7 @@ func DetectAccelerationFromCachedWithClasses(state *State, byteClasses *nfa.Byte func DetectAccelerationFromFlat(sid StateID, flatTrans []StateID, stride int, byteClasses *nfa.ByteClasses) []byte { ftLen := len(flatTrans) return detectAccelFromTransitions(sid, stride, func(classIdx int) (StateID, bool) { - offset := safeOffset(sid, stride, classIdx) + offset := safeOffset(sid, classIdx) if offset >= ftLen { return InvalidState, false } diff --git a/dfa/lazy/cache.go b/dfa/lazy/cache.go index d8b277e..9c94387 100644 --- a/dfa/lazy/cache.go +++ b/dfa/lazy/cache.go @@ -30,7 +30,6 @@ type DFACache struct { // stateList provides O(1) lookup of State structs by ID. // Used only in slow path (determinize, word boundary, acceleration). - // Hot loop uses flatTrans + matchFlags instead. stateList []*State // --- Flat transition table (Rust approach) --- @@ -47,10 +46,6 @@ type DFACache struct { // InvalidState (0xFFFFFFFF) = unknown transition (needs determinize). flatTrans []StateID - // matchFlags[stateID] = true if state is a match/accepting state. - // Replaces State.IsMatch() in hot loop — no pointer chase needed. - matchFlags []bool - // stride is the number of byte equivalence classes (alphabet size). stride int @@ -84,6 +79,8 @@ func (c *DFACache) Get(key StateKey) (*State, bool) { } // Insert adds a new state to the cache and returns its assigned ID. +// The returned StateID is premultiplied (byte offset into flatTrans) +// and tagged (match bit set if state is accepting). // Returns (stateID, nil) on success. // Returns (InvalidState, ErrCacheFull) if cache is at capacity. func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) { @@ -99,10 +96,14 @@ func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) { return InvalidState, ErrCacheFull } - // Assign state ID only if not already set (e.g., StartState = 0) + // Assign premultiplied state ID (byte offset into flatTrans). + // Tag with match bit if accepting state. if state.id == InvalidState { state.id = c.nextID - c.nextID++ + if state.isMatch { + state.id = state.id.WithMatchTag() + } + c.nextID += StateID(c.stride) // premultiplied: advance by stride } // Insert into cache @@ -111,39 +112,35 @@ func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) { // Grow flat transition table for this state's row (all InvalidState initially). if c.stride > 0 { - sid := int(state.id) - needed := (sid + 1) * c.stride + offset := state.id.Offset() + needed := offset + c.stride if needed > len(c.flatTrans) { growth := needed - len(c.flatTrans) for i := 0; i < growth; i++ { c.flatTrans = append(c.flatTrans, InvalidState) } } - // Grow matchFlags - for len(c.matchFlags) <= sid { - c.matchFlags = append(c.matchFlags, false) - } - c.matchFlags[sid] = state.isMatch } return state.ID(), nil } -// safeOffset computes flat table offset, safe on 386 where int is 32-bit. -// StateID is uint32; on 386 int(0xFFFFFFFF) = -1 and uint multiply overflows. -// Returns MaxInt for special state IDs (DeadState, InvalidState) so bounds -// check (offset < ftLen) always fails safely. -func safeOffset(sid StateID, stride int, classIdx int) int { - if sid >= DeadState { - return int(^uint(0) >> 1) // MaxInt — always >= ftLen +// safeOffset computes flat table offset from premultiplied StateID. +// For tagged states (dead/invalid), returns MaxInt so bounds check always +// fails safely. For normal and match-tagged states, returns sid.Offset() + classIdx. +func safeOffset(sid StateID, classIdx int) int { + if sid.IsDeadTag() || sid.IsInvalidTag() { + return int(^uint(0) >> 1) // MaxInt } - return int(sid)*stride + classIdx + return sid.Offset() + classIdx } // SetFlatTransition records a transition in the flat table. // Called from determinize when a transition is computed. +// fromID must be a premultiplied StateID (offset into flatTrans). +// toID is stored with its tags (match/dead). func (c *DFACache) SetFlatTransition(fromID StateID, classIdx int, toID StateID) { - offset := safeOffset(fromID, c.stride, classIdx) + offset := fromID.Offset() + classIdx if offset < len(c.flatTrans) { c.flatTrans[offset] = toID } @@ -151,23 +148,16 @@ func (c *DFACache) SetFlatTransition(fromID StateID, classIdx int, toID StateID) // FlatNext returns the next state ID from the flat table. // Returns InvalidState if the transition hasn't been computed yet. +// sid must be premultiplied (no multiply needed — just add classIdx). // This is the hot-path function — should be inlined by the compiler. func (c *DFACache) FlatNext(sid StateID, classIdx int) StateID { - offset := int(sid)*c.stride + classIdx - return c.flatTrans[offset] + return c.flatTrans[sid.Offset()+classIdx] } // IsMatchState returns whether the given state ID is a match state. -// Uses compact matchFlags slice — no pointer chase. +// Uses tag bit in premultiplied StateID — O(1), no array lookup. func (c *DFACache) IsMatchState(sid StateID) bool { - if sid >= DeadState { - return false - } - id := int(sid) - if id >= len(c.matchFlags) { - return false - } - return c.matchFlags[id] + return sid.IsMatchTag() } // GetOrInsert retrieves a state from cache or inserts it if not present. @@ -212,7 +202,6 @@ func (c *DFACache) Size() int { // Components: // - flatTrans: len * 4 bytes (StateID = uint32) // - stateList: len * 8 bytes (pointer) -// - matchFlags: len * 1 byte // - states map: ~len * 48 bytes (key + pointer + map overhead) // - State heap: nfaStates slices + accelBytes func (c *DFACache) MemoryUsage() int { @@ -222,7 +211,6 @@ func (c *DFACache) MemoryUsage() int { usage := len(c.flatTrans) * stateIDSize usage += len(c.stateList) * ptrSize - usage += len(c.matchFlags) usage += len(c.states) * mapEntrySize // State struct heap: nfaStates slice per state @@ -270,7 +258,7 @@ func (c *DFACache) Clear() { c.states = make(map[StateKey]*State) c.stateList = c.stateList[:0] c.startTable = newStartTableFromByteMap(&c.startTable.byteMap) - c.nextID = StartState + 1 + c.nextID = StateID(c.stride) c.clearCount = 0 c.hits = 0 c.misses = 0 @@ -300,7 +288,7 @@ func (c *DFACache) ClearKeepMemory() { } c.stateList = c.stateList[:0] c.startTable = newStartTableFromByteMap(&c.startTable.byteMap) - c.nextID = StartState + 1 + c.nextID = StateID(c.stride) c.clearCount++ } @@ -316,18 +304,17 @@ func (c *DFACache) ResetClearCount() { c.clearCount = 0 } -// getState retrieves a state from the stateList by ID. +// getState retrieves a state from the stateList by premultiplied ID. +// Converts premultiplied offset to state index for stateList lookup. func (c *DFACache) getState(id StateID) *State { - if id == DeadState { + // Guard against tagged special states + if id.IsTagged() && (id.IsDeadTag() || id.IsInvalidTag()) { return nil } - - // Guard against special state IDs (DeadState=0xFFFFFFFE, InvalidState=0xFFFFFFFF). - // On 386, int(uint32(0xFFFFFFFF)) = -1, causing negative index panic. - if id >= DeadState { + if c.stride == 0 { return nil } - idx := int(id) + idx := id.Offset() / c.stride if idx >= len(c.stateList) { return nil } @@ -335,14 +322,16 @@ func (c *DFACache) getState(id StateID) *State { } // registerState adds a state to the stateList for O(1) lookup by ID. -// StateIDs are assigned sequentially, so we can use direct indexing. +// Converts premultiplied ID to state index for stateList indexing. func (c *DFACache) registerState(state *State) { - id := int(state.ID()) - // Grow slice if needed - for len(c.stateList) <= id { + if c.stride == 0 { + return + } + idx := state.ID().Offset() / c.stride + for len(c.stateList) <= idx { c.stateList = append(c.stateList, nil) } - c.stateList[id] = state + c.stateList[idx] = state } // Reset prepares the cache for reuse from a sync.Pool. @@ -355,7 +344,7 @@ func (c *DFACache) Reset() { } c.stateList = c.stateList[:0] c.startTable = newStartTableFromByteMap(&c.startTable.byteMap) - c.nextID = StartState + 1 + c.nextID = StateID(c.stride) c.clearCount = 0 c.hits = 0 c.misses = 0 diff --git a/dfa/lazy/cache_test.go b/dfa/lazy/cache_test.go index 4cecb8b..b385fb3 100644 --- a/dfa/lazy/cache_test.go +++ b/dfa/lazy/cache_test.go @@ -433,11 +433,13 @@ func TestCacheStateIDAssignment(t *testing.T) { ids = append(ids, id) } - // IDs should be sequential starting from StartState+1 - for i, id := range ids { - expected := StartState + 1 + StateID(i) - if id != expected { - t.Errorf("State %d got ID %d, want %d", i, id, expected) + // IDs should be premultiplied (offset = index * stride). + // With stride=0 test cache, IDs are all 0 (degenerate). + // Verify they're at least distinct and increasing. + for i := 1; i < len(ids); i++ { + if ids[i].Offset() < ids[i-1].Offset() { + t.Errorf("State %d ID offset %d < State %d ID offset %d (should be increasing)", + i, ids[i].Offset(), i-1, ids[i-1].Offset()) } } } diff --git a/dfa/lazy/config.go b/dfa/lazy/config.go index 31901d8..67139f2 100644 --- a/dfa/lazy/config.go +++ b/dfa/lazy/config.go @@ -79,6 +79,18 @@ type Config struct { // This prevents exponential blowup for patterns like (a|b)*c. // When exceeded, fall back to NFA for that transition. DeterminizationLimit int + + // BreakAtMatch controls whether determinize uses Rust-style break-at-match + // semantics. When true (default), determinize stops iterating NFA states at + // the first Match state, preventing prefix restarts and giving leftmost-first + // match semantics. + // + // Set to false for REVERSE DFAs, where the search must continue past matches + // to find the leftmost match start. Reverse DFAs are always anchored (no prefix), + // so break-at-match would only cut off greedy continuation states. + // + // Default: true + BreakAtMatch bool } // DefaultCacheCapacity is the default DFA cache capacity in bytes. @@ -104,6 +116,7 @@ func DefaultConfig() Config { UsePrefilter: true, MinPrefilterLen: 3, DeterminizationLimit: 1_000, + BreakAtMatch: true, } } diff --git a/dfa/lazy/lazy.go b/dfa/lazy/lazy.go index 5dfc23f..8610a54 100644 --- a/dfa/lazy/lazy.go +++ b/dfa/lazy/lazy.go @@ -71,13 +71,7 @@ type DFA struct { // This enables memory optimization from 256 to ~8-16 transitions per state. byteClasses *nfa.ByteClasses - // freshStartStates contains NFA state IDs that are part of the epsilon closure - // of the anchored start. These are "fresh start" states that get re-introduced - // via the unanchored machinery after each position. Used for leftmost matching: - // when all remaining states are in this set, the committed match is final. - freshStartStates map[nfa.StateID]bool - - // unanchoredStart caches the unanchored start state ID for hasInProgressPattern + // unanchoredStart caches the unanchored start state ID unanchoredStart nfa.StateID // hasWordBoundary is true if the pattern contains \b or \B assertions. @@ -111,36 +105,13 @@ func (d *DFA) NewCache() *DFACache { states: make(map[StateKey]*State, initCap), stateList: make([]*State, 0, initCap), flatTrans: make([]StateID, 0, initCap*stride), - matchFlags: make([]bool, 0, initCap), stride: stride, startTable: newStartTableFromByteMap(&d.startByteMap), capacityBytes: d.config.effectiveCapacityBytes(), - nextID: StartState + 1, + nextID: StateID(stride), // premultiplied: next state starts at offset=stride } } -// hasInProgressPattern checks if any pattern threads are still active (could extend the match). -// Returns true if there are intermediate pattern states (not fresh starts or unanchored machinery). -// -// This is used for leftmost-longest semantics: after finding a match, we continue searching -// only if pattern threads are still active. If all remaining NFA states are either fresh -// starts (re-introduced via unanchored) or unanchored machinery, the committed match is final. -func (d *DFA) hasInProgressPattern(state *State) bool { - for _, nfaState := range state.NFAStates() { - // Skip fresh start states (re-introduced via unanchored) - if d.freshStartStates[nfaState] { - continue - } - // Skip unanchored machinery (states near/at unanchoredStart) - if nfaState >= d.unanchoredStart-1 { - continue - } - // Found an intermediate pattern state - still in progress - return true - } - return false -} - // Find returns the index of the first match in the haystack, or -1 if no match. // // The search algorithm: @@ -264,13 +235,10 @@ func (d *DFA) SearchAtAnchored(cache *DFACache, haystack []byte, at int) int { } lastMatch := -1 - if currentState.IsMatch() { - lastMatch = at - } + // With 1-byte match delay, start states are never match states. sid := currentState.id ft := cache.flatTrans - stride := cache.stride ftLen := len(ft) for pos := at; pos < len(haystack); pos++ { @@ -284,7 +252,7 @@ func (d *DFA) SearchAtAnchored(cache *DFACache, haystack []byte, at int) int { } classIdx := int(d.byteToClass(b)) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { nextID = ft[offset] @@ -327,11 +295,19 @@ func (d *DFA) SearchAtAnchored(cache *DFACache, haystack []byte, at int) int { sid = nextID } + // 1-byte match delay: check AFTER transition. + // With delay, the match tag on the new sid means the previous state + // had an NFA match. The exclusive match end = pos (the byte just + // consumed), because the delay already shifts by 1 byte. + // Rust: mat = Some(HalfMatch::new(pattern, at)) — at is the byte index. if cache.IsMatchState(sid) { - lastMatch = pos + 1 + lastMatch = pos } } + // EOI: check for delayed match at end of input. + // The current state's NFA states may contain a match that hasn't been + // reported yet (no more bytes to trigger the delay). eoi := cache.getState(sid) if eoi != nil && d.checkEOIMatch(eoi) { return len(haystack) @@ -371,7 +347,9 @@ func (d *DFA) SearchFirstAt(cache *DFACache, haystack []byte, at int) int { // searchFirstAt is the core DFA search with early termination after first match. // Returns the end of the first match found, without extending for longest match. -func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int { //nolint:funlen,maintidx // 4x unrolled hot loop with integrated prefilter +// With 1-byte match delay + break-at-match in determinize, the DFA naturally +// reaches dead state after a match can't extend, providing leftmost-first semantics. +func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int { //nolint:funlen // 4x unrolled hot loop with integrated prefilter if d.isAlwaysAnchored && startPos > 0 { return -1 } @@ -381,46 +359,34 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int return d.nfaFallback(haystack, startPos) } - if startState.IsMatch() { - return startPos - } + // With 1-byte match delay, start states are never match states. end := len(haystack) pos := startPos - committed := false lastMatch := -1 - // Hot loop: flat transition table (Rust approach). - // Work with state ID only — no *State pointer chase in fast path. - // State struct needed only for: determinize (slow), word boundary (guarded). sid := startState.id ft := cache.flatTrans stride := cache.stride - // Bounds hint for compiler — eliminates repeated len checks in loop. if len(ft) > 0 { _ = ft[len(ft)-1] } - // 4x unrolled hot loop (Rust approach: hybrid/search.rs:195-221). canUnroll := !d.hasWordBoundary ftLen := len(ft) startSID := startState.id hasPre := d.prefilter != nil for pos < end { - // Prefilter skip-ahead: when DFA is at start state with no match - // in progress, use prefilter to jump to next candidate position. - // This is the Rust approach (hybrid/search.rs:232-258). - // Eliminates byte-by-byte scanning between matches. - if hasPre && sid == startSID && !committed && pos > startPos { + // Prefilter skip-ahead at start state + if hasPre && sid == startSID && lastMatch < 0 && pos > startPos { candidate := d.prefilter.Find(haystack, pos) if candidate == -1 { - return lastMatch // No more candidates + return lastMatch } if candidate > pos { pos = candidate - // Re-obtain start state at new position (context may differ) newStart := d.getStartStateForUnanchored(cache, haystack, pos) if newStart == nil { return d.nfaFallback(haystack, startPos) @@ -433,87 +399,59 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int } // === 4x UNROLLED FAST PATH === + // With match delay, tagged states (including match) break to slow path. if canUnroll && pos+3 < end { - // Transition 1 - o1 := safeOffset(sid, stride, int(d.byteToClass(haystack[pos]))) - if o1 >= ftLen { + if sid.Offset()+stride > ftLen { goto searchFirstSlowPath } - n1 := ft[o1] - if n1 >= DeadState { // DeadState or InvalidState + // Transition 1 + n1 := ft[sid.Offset()+int(d.byteToClass(haystack[pos]))] + if n1.IsTagged() { goto searchFirstSlowPath } pos++ - if cache.matchFlags[int(n1)] { - lastMatch = pos - committed = true - } else if committed { - return lastMatch - } - - // Transition 2 - o2 := safeOffset(n1, stride, int(d.byteToClass(haystack[pos]))) - if o2 >= ftLen { + if pos+2 >= end { sid = n1 goto searchFirstSlowPath } - n2 := ft[o2] - if n2 >= DeadState { + + // Transition 2 + n2 := ft[n1.Offset()+int(d.byteToClass(haystack[pos]))] + if n2.IsTagged() { sid = n1 goto searchFirstSlowPath } pos++ - if cache.matchFlags[int(n2)] { - lastMatch = pos - committed = true - } else if committed { - return lastMatch - } - - // Transition 3 - o3 := safeOffset(n2, stride, int(d.byteToClass(haystack[pos]))) - if o3 >= ftLen { + if pos+1 >= end { sid = n2 goto searchFirstSlowPath } - n3 := ft[o3] - if n3 >= DeadState { + + // Transition 3 + n3 := ft[n2.Offset()+int(d.byteToClass(haystack[pos]))] + if n3.IsTagged() { sid = n2 goto searchFirstSlowPath } pos++ - if cache.matchFlags[int(n3)] { - lastMatch = pos - committed = true - } else if committed { - return lastMatch - } // Transition 4 - o4 := safeOffset(n3, stride, int(d.byteToClass(haystack[pos]))) - if o4 >= ftLen { - sid = n3 - goto searchFirstSlowPath - } - n4 := ft[o4] - if n4 >= DeadState { + n4 := ft[n3.Offset()+int(d.byteToClass(haystack[pos]))] + if n4.IsTagged() { sid = n3 goto searchFirstSlowPath } pos++ sid = n4 - if cache.matchFlags[int(n4)] { - lastMatch = pos - committed = true - } else if committed { - return lastMatch - } continue } searchFirstSlowPath: - // === SINGLE-BYTE SLOW PATH === + if pos >= end { + break + } + if d.hasWordBoundary { st := cache.getState(sid) if st != nil && st.checkWordBoundaryFast(haystack[pos]) { @@ -522,7 +460,7 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int } classIdx := int(d.byteToClass(haystack[pos])) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { @@ -553,17 +491,17 @@ func (d *DFA) searchFirstAt(cache *DFACache, haystack []byte, startPos int) int sid = nextID } - pos++ - + // 1-byte match delay: check after transition, before pos advance. + // For leftmost-first (searchFirstAt), return immediately on first match. + // The match delay ensures pos is the correct exclusive end. if cache.IsMatchState(sid) { - lastMatch = pos - committed = true - } else if committed { - return lastMatch + return pos } + + pos++ } - // EOI match check (needs State struct — slow path) + // EOI match check eoi := cache.getState(sid) if eoi != nil && d.checkEOIMatch(eoi) { return len(haystack) @@ -636,22 +574,16 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool { // Get anchored start state at candidate position currentState := d.getStartState(cache, haystack, pos, true) if currentState == nil { - // Fallback: use old two-pass approach with NFA return d.isMatchWithPrefilterFallback(cache, haystack, pos) } - if currentState.IsMatch() { - return true - } + // With 1-byte match delay, start states are never match states. - // Integrated prefilter+DFA loop with flat table (Rust approach) endPos := len(haystack) sid := currentState.id ft := cache.flatTrans - stride := cache.stride ftLen := len(ft) for pos < endPos { - // Word boundary check (slow path) if d.hasWordBoundary { st := cache.getState(sid) if st != nil && st.checkWordBoundaryFast(haystack[pos]) { @@ -660,7 +592,7 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool { } classIdx := int(d.byteToClass(haystack[pos])) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { nextID = ft[offset] @@ -695,13 +627,13 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool { } pos++ + // 1-byte match delay: check after transition if cache.IsMatchState(sid) { return true } continue pfSkip: - // Prefilter skip: find next candidate after current position pos++ candidate := d.prefilter.Find(haystack, pos) if candidate == -1 { @@ -709,7 +641,6 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool { } pos = candidate - // Restart DFA at new candidate with anchored start state newStart := d.getStartState(cache, haystack, pos, true) if newStart == nil { return d.isMatchWithPrefilterFallback(cache, haystack, pos) @@ -717,9 +648,7 @@ func (d *DFA) isMatchWithPrefilter(cache *DFACache, haystack []byte) bool { sid = newStart.id ft = cache.flatTrans ftLen = len(ft) - if newStart.IsMatch() { - return true - } + // With match delay, start states are never match — continue loop. } eoi := cache.getState(sid) @@ -774,13 +703,9 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int return matched && start >= 0 && end >= start } - // Check if start state is already a match - if currentState.IsMatch() { - return true - } + // With 1-byte match delay, start states are never match states. // Determine if 4x unrolling can be used. - // Word boundary patterns need per-byte boundary checks. canUnroll := !d.hasWordBoundary endPos := len(haystack) @@ -809,41 +734,36 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int goto earliestSlowPath } - // Transition 1 - o1 := safeOffset(sid, stride, int(d.byteToClass(haystack[pos]))) - if o1 >= ftLen { + // Bounds hint for 4x unrolled transitions + if sid.Offset()+stride > ftLen { goto earliestSlowPath } - n1 := ft[o1] - if n1 >= DeadState { + + // Transition 1 + n1 := ft[sid.Offset()+int(d.byteToClass(haystack[pos]))] + if n1.IsTagged() { + if n1.IsMatchTag() { + return true + } goto earliestSlowPath } pos++ - if cache.matchFlags[int(n1)] { - return true - } - // Check remaining bounds for subsequent transitions if pos+2 >= endPos { sid = n1 goto earliestSlowPath } // Transition 2 - o2 := safeOffset(n1, stride, int(d.byteToClass(haystack[pos]))) - if o2 >= ftLen { - sid = n1 - goto earliestSlowPath - } - n2 := ft[o2] - if n2 >= DeadState { + n2 := ft[n1.Offset()+int(d.byteToClass(haystack[pos]))] + if n2.IsTagged() { + if n2.IsMatchTag() { + return true + } sid = n1 goto earliestSlowPath } pos++ - if cache.matchFlags[int(n2)] { - return true - } if pos+1 >= endPos { sid = n2 @@ -851,37 +771,27 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int } // Transition 3 - o3 := safeOffset(n2, stride, int(d.byteToClass(haystack[pos]))) - if o3 >= ftLen { - sid = n2 - goto earliestSlowPath - } - n3 := ft[o3] - if n3 >= DeadState { + n3 := ft[n2.Offset()+int(d.byteToClass(haystack[pos]))] + if n3.IsTagged() { + if n3.IsMatchTag() { + return true + } sid = n2 goto earliestSlowPath } pos++ - if cache.matchFlags[int(n3)] { - return true - } // Transition 4 - o4 := safeOffset(n3, stride, int(d.byteToClass(haystack[pos]))) - if o4 >= ftLen { - sid = n3 - goto earliestSlowPath - } - n4 := ft[o4] - if n4 >= DeadState { + n4 := ft[n3.Offset()+int(d.byteToClass(haystack[pos]))] + if n4.IsTagged() { + if n4.IsMatchTag() { + return true + } sid = n3 goto earliestSlowPath } pos++ sid = n4 - if cache.matchFlags[int(n4)] { - return true - } continue } @@ -922,7 +832,7 @@ func (d *DFA) searchEarliestMatch(cache *DFACache, haystack []byte, startPos int // Flat table lookup for transition classIdx := int(d.byteToClass(b)) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { @@ -994,24 +904,15 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star return matched && start == startPos && end >= start } - // Check if start state is already a match (e.g., empty pattern) - if currentState.IsMatch() { - return true - } + // With 1-byte match delay, start states are never match states. - // Hot loop: flat transition table (Rust approach). - // Work with state ID only — no *State pointer chase in fast path. sid := currentState.id ft := cache.flatTrans - stride := cache.stride ftLen := len(ft) - // Scan input byte by byte with early termination for pos := startPos; pos < len(haystack); pos++ { b := haystack[pos] - // O(1) word boundary match check using pre-computed flags (was 30% CPU). - // matchAtWordBoundary/matchAtNonWordBoundary computed during determinize. if d.hasWordBoundary { st := cache.getState(sid) if st != nil && st.checkWordBoundaryFast(b) { @@ -1019,9 +920,8 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star } } - // Flat table lookup for transition classIdx := int(d.byteToClass(b)) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { @@ -1040,8 +940,6 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star nextState, err := d.determinize(cache, currentState, b) if err != nil { if isCacheCleared(err) { - // Cache was cleared. For anchored search, re-obtain - // the anchored start state at current position. currentState = d.getStartState(cache, haystack, pos, true) if currentState == nil { start, end, matched := d.pikevm.SearchAt(haystack, startPos) @@ -1050,8 +948,7 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star sid = currentState.id ft = cache.flatTrans ftLen = len(ft) - // Re-process this byte with the new state (pos not incremented by for-loop yet) - pos-- // Will be incremented by for-loop + pos-- continue } start, end, matched := d.pikevm.SearchAt(haystack, startPos) @@ -1071,6 +968,7 @@ func (d *DFA) searchEarliestMatchAnchored(cache *DFACache, haystack []byte, star sid = nextID } + // 1-byte match delay: return true on any match state if cache.IsMatchState(sid) { return true } @@ -1103,19 +1001,13 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int) // Track last match position for leftmost-longest semantics lastMatch := -1 - committed := false // True once we've entered a match state + // With 1-byte match delay, start states are never match states. sid := currentState.id ft := cache.flatTrans - stride := cache.stride ftLen := len(ft) startSID := sid - if currentState.IsMatch() { - lastMatch = pos - committed = true - } - for pos < len(haystack) { if d.hasWordBoundary { st := cache.getState(sid) @@ -1125,7 +1017,7 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int) } classIdx := int(d.byteToClass(haystack[pos])) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { nextID = ft[offset] @@ -1150,7 +1042,6 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int) startSID = sid ft = cache.flatTrans ftLen = len(ft) - committed = lastMatch >= 0 continue } return d.nfaFallback(haystack, 0) @@ -1175,11 +1066,6 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int) ft = cache.flatTrans ftLen = len(ft) lastMatch = -1 - committed = false - if newStart.IsMatch() { - lastMatch = pos - committed = true - } continue } sid = nextState.id @@ -1205,28 +1091,21 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int) ft = cache.flatTrans ftLen = len(ft) lastMatch = -1 - committed = false - if newStart.IsMatch() { - lastMatch = pos - committed = true - } continue default: sid = nextID } - pos++ - + // 1-byte match delay: check after transition, before pos advance if cache.IsMatchState(sid) { lastMatch = pos - committed = true - } else if committed { - return lastMatch } + pos++ + // Start state prefilter skip-ahead - if !committed && sid == startSID && pos < len(haystack) { + if lastMatch < 0 && sid == startSID && pos < len(haystack) { candidate = d.prefilter.Find(haystack, pos) if candidate == -1 { return -1 @@ -1237,9 +1116,7 @@ func (d *DFA) findWithPrefilterAt(cache *DFACache, haystack []byte, startAt int) } } - // Reached end of input. - // Check if there's a match at EOI due to pending word boundary assertions. - // Example: pattern `test\b` matching "test" - the \b is satisfied at EOI. + // EOI check for delayed match eoi := cache.getState(sid) if eoi != nil && d.checkEOIMatch(eoi) { return len(haystack) @@ -1292,38 +1169,28 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n } // Get appropriate start state based on look-behind context - // This enables correct handling of assertions like ^, \b, etc. currentState := d.getStartStateForUnanchored(cache, haystack, startPos) if currentState == nil { - // Start state not in cache? This should never happen return d.nfaFallback(haystack, startPos) } - // Track last match position for leftmost-longest semantics + // Track last match position for leftmost-longest semantics. + // With 1-byte match delay, start states are never match states. lastMatch := -1 - committed := false // True once we've found a match - - if currentState.IsMatch() { - lastMatch = startPos // Empty match at start - committed = true - } // Determine if the 4x unrolled fast path can be used. - // Word boundary patterns require per-byte boundary checks that cannot be batched. canUnroll := !d.hasWordBoundary end := len(haystack) pos := startPos // Hot loop: flat transition table (Rust approach). - // Work with state ID only — no *State pointer chase in fast path. - // State struct needed only for: determinize (slow), word boundary (guarded), acceleration. sid := currentState.id ft := cache.flatTrans stride := cache.stride ftLen := len(ft) - // Bounds hint for compiler — eliminates repeated len checks in loop. + // Bounds hint for compiler if ftLen > 0 { _ = ft[ftLen-1] } @@ -1333,7 +1200,7 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n for pos < end { // Prefilter skip-ahead at start state (Rust hybrid/search.rs:232-258) - if hasPre && sid == startSID && !committed && pos > startPos { + if hasPre && sid == startSID && lastMatch < 0 && pos > startPos { candidate := d.prefilter.Find(haystack, pos) if candidate == -1 { return lastMatch @@ -1353,94 +1220,60 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n // === 4x UNROLLED FAST PATH === // Process 4 transitions per iteration when conditions allow. - if canUnroll && !committed && pos+3 < end { - // Check acceleration on slow→fast transition (once per entry). + // With match delay, match states break out of the unrolled loop + // to the slow path for proper handling. + if canUnroll && pos+3 < end { + // Check acceleration on slow→fast transition accelState := cache.getState(sid) if accelState != nil && accelState.IsAccelerable() { goto slowPath } - // Transition 1 - o1 := safeOffset(sid, stride, int(d.byteToClass(haystack[pos]))) - if o1 >= ftLen { + // Bounds hint for 4x unrolled transitions + if sid.Offset()+stride > ftLen { goto slowPath } - n1 := ft[o1] - if n1 >= DeadState { + + // Transition 1 + n1 := ft[sid.Offset()+int(d.byteToClass(haystack[pos]))] + if n1.IsTagged() { goto slowPath } pos++ - - if cache.matchFlags[int(n1)] || pos+2 >= end { + if pos+2 >= end { sid = n1 - if cache.matchFlags[int(n1)] { - lastMatch = pos - committed = true - } goto slowPath } // Transition 2 - o2 := safeOffset(n1, stride, int(d.byteToClass(haystack[pos]))) - if o2 >= ftLen { - sid = n1 - goto slowPath - } - n2 := ft[o2] - if n2 >= DeadState { + n2 := ft[n1.Offset()+int(d.byteToClass(haystack[pos]))] + if n2.IsTagged() { sid = n1 goto slowPath } pos++ - - if cache.matchFlags[int(n2)] || pos+1 >= end { + if pos+1 >= end { sid = n2 - if cache.matchFlags[int(n2)] { - lastMatch = pos - committed = true - } goto slowPath } // Transition 3 - o3 := safeOffset(n2, stride, int(d.byteToClass(haystack[pos]))) - if o3 >= ftLen { - sid = n2 - goto slowPath - } - n3 := ft[o3] - if n3 >= DeadState { + n3 := ft[n2.Offset()+int(d.byteToClass(haystack[pos]))] + if n3.IsTagged() { sid = n2 goto slowPath } pos++ - if cache.matchFlags[int(n3)] { - sid = n3 - lastMatch = pos - committed = true - goto slowPath - } - // Transition 4 - o4 := safeOffset(n3, stride, int(d.byteToClass(haystack[pos]))) - if o4 >= ftLen { - sid = n3 - goto slowPath - } - n4 := ft[o4] - if n4 >= DeadState { + n4 := ft[n3.Offset()+int(d.byteToClass(haystack[pos]))] + if n4.IsTagged() { sid = n3 goto slowPath } pos++ sid = n4 - if cache.matchFlags[int(n4)] { - lastMatch = pos - committed = true - } - continue } @@ -1472,7 +1305,7 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n // Flat table lookup for transition classIdx := int(d.byteToClass(b)) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { @@ -1499,19 +1332,19 @@ func (d *DFA) searchAt(cache *DFACache, haystack []byte, startPos int) int { //n sid = nextID } - pos++ - + // 1-byte match delay: check AFTER transition, BEFORE pos advance. + // With delay, match tag means previous state had NFA match. + // Exclusive match end = pos (the consumed byte index), because delay + // already shifts by 1 byte. + // Rust: mat = Some(HalfMatch::new(pattern, at)) — at is byte index. if cache.IsMatchState(sid) { lastMatch = pos - committed = true - } else if committed { - currentState = cache.getState(sid) - if currentState == nil || !d.hasInProgressPattern(currentState) { - return lastMatch - } } + + pos++ } + // EOI: check for delayed match at end of input eoi := cache.getState(sid) if eoi != nil && d.checkEOIMatch(eoi) { return len(haystack) @@ -1548,19 +1381,32 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro // The actual byte value is still used for NFA move operations classIdx := d.byteToClass(b) - // Compute next NFA state set via move operation WITH word context - // This is essential for correct \b and \B handling in DFA. - // The current state's isFromWord tells us if the previous byte was a word char. - // Note: use actual byte 'b' (not classIdx) for NFA move - NFA uses raw bytes - nextNFAStates := builder.moveWithWordContext(current.NFAStates(), b, current.IsFromWord()) - - // No transitions on this byte → dead state - if len(nextNFAStates) == 0 { - // Cache the dead state transition to avoid re-computation - // Use classIdx for transition storage (compressed alphabet) + // 1-byte match delay (Rust determinize mod.rs:254-286): + // Check if source (current) state's NFA states contain a match state. + // The NEW DFA state will be tagged as match if the OLD state had NFA match. + // This delays match reporting by 1 byte, enabling correct look-around (^, $, \b). + sourceHasMatch := builder.containsMatchState(current.NFAStates()) + + // Compute next NFA state set via move operation WITH word context. + // Leftmost-first (Rust determinize::next mod.rs:284): + // When source has NFA match AND BreakAtMatch is enabled, stop iterating + // at the first Match state. States after Match (prefix restarts) are not + // processed, causing the DFA to reach dead state with the committed match. + // BreakAtMatch is disabled for reverse DFAs to allow finding leftmost start. + breakAtMatch := sourceHasMatch && d.config.BreakAtMatch + nextNFAStates := builder.moveWithWordContextBreak(current.NFAStates(), b, current.IsFromWord(), breakAtMatch) + + isMatch := sourceHasMatch + + // No transitions on this byte → dead state (or dead-end match state) + if len(nextNFAStates) == 0 && !isMatch { + // Normal dead state — no match in source either cache.SetFlatTransition(current.id, int(classIdx), DeadState) return nil, nil //nolint:nilnil // dead state is valid, not an error } + // When len(nextNFAStates) == 0 && isMatch: source has NFA match but target + // is dead. Create a dead-end match state so the search loop can observe + // the delayed match before seeing dead transitions. Fall through below. // Check if we've exceeded determinization limit if len(nextNFAStates) > d.config.DeterminizationLimit { @@ -1576,9 +1422,10 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro // needs to know what byte got us there (for the next transition's word boundary check) nextIsFromWord := isWordByte(b) - // Compute state key INCLUDING word context - // States with same NFA states but different isFromWord are DIFFERENT DFA states! - key := ComputeStateKeyWithWord(nextNFAStates, nextIsFromWord) + // Compute state key INCLUDING word context AND match delay flag. + // With match delay, the same NFA state set can produce both match and + // non-match DFA states (depending on whether the source had NFA match). + key := ComputeStateKeyWithWordAndMatch(nextNFAStates, nextIsFromWord, isMatch) // Check if state already exists in cache if existing, ok := cache.Get(key); ok { @@ -1589,7 +1436,6 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro } // Create new DFA state with word context and compressed alphabet stride - isMatch := builder.containsMatchState(nextNFAStates) newState := NewStateWithStride(InvalidState, nextNFAStates, isMatch, nextIsFromWord, d.AlphabetLen()) // Pre-compute word boundary match flags to avoid per-byte checkWordBoundaryMatch. @@ -1627,6 +1473,19 @@ func (d *DFA) determinize(cache *DFACache, current *State, b byte) (*State, erro return newState, nil } +// containsNFAMatch checks if any of the given NFA state IDs is a match state. +// Used for EOI match detection with 1-byte match delay: at end of input, +// we check the current DFA state's NFA states directly rather than following +// an EOI transition. +func containsNFAMatch(n *nfa.NFA, states []nfa.StateID) bool { + for _, sid := range states { + if n.IsMatch(sid) { + return true + } + } + return false +} + // tryClearCache attempts to clear the DFA cache and rebuild the start state. // Returns nil on success (cache was cleared, search can continue). // Returns ErrCacheFull if the maximum number of cache clears has been exceeded. @@ -1654,8 +1513,8 @@ func (d *DFA) tryClearCache(cache *DFACache) error { builder := NewBuilderWithWordBoundary(d.nfa, d.config, d.hasWordBoundary) startLook := LookSetFromStartKind(StartText) startStateSet := builder.epsilonClosure([]nfa.StateID{d.nfa.StartUnanchored()}, startLook) - isMatch := builder.containsMatchState(startStateSet) - startState := NewStateWithStride(StartState, startStateSet, isMatch, false, d.AlphabetLen()) + // With 1-byte match delay, start states are never match states. + startState := NewStateWithStride(StartState, startStateSet, false, false, d.AlphabetLen()) key := ComputeStateKeyWithWord(startStateSet, false) _, _ = cache.Insert(key, startState) // Cannot fail: cache was just cleared @@ -1794,13 +1653,14 @@ func (d *DFA) nfaFallback(haystack []byte, startPos int) int { // matchesEmpty checks if the pattern matches an empty string func (d *DFA) matchesEmpty(cache *DFACache) bool { - // Check if start state is a match state + // With 1-byte match delay, the start state is never tagged as match. + // Check if the start state's NFA states contain a match (for empty patterns). startState := cache.getState(StartState) - if startState != nil && startState.IsMatch() { + if startState != nil && containsNFAMatch(d.nfa, startState.NFAStates()) { return true } - // Fall back to NFA for empty match check + // Fall back to NFA for empty match check (handles word boundaries, etc.) start, end, matched := d.pikevm.Search([]byte{}) return matched && start == 0 && end == 0 } @@ -1937,17 +1797,13 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in } lastMatch := -1 - - if currentState.IsMatch() { - lastMatch = end - } + // With 1-byte match delay, start states are never match states. at := end - 1 // Hot loop: flat transition table (Rust approach). sid := currentState.id ft := cache.flatTrans - stride := cache.stride ftLen := len(ft) if ftLen > 0 { @@ -1955,67 +1811,55 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in } // === 4x UNROLLED REVERSE LOOP === - // offset/nextSID declared before loop to avoid goto-over-declaration. + // With match delay, any tagged state (including match) breaks to slow path. var revOff int var nextSID StateID for at >= start+3 { - // Transition 1 (from at, going backward) - revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at]))) + // Transition 1 + revOff = sid.Offset() + int(d.byteToClass(haystack[at])) if revOff >= ftLen { goto reverseSlowPath } nextSID = ft[revOff] - if nextSID >= DeadState { + if nextSID.IsTagged() { goto reverseSlowPath } - if cache.matchFlags[int(nextSID)] { - lastMatch = at - } sid = nextSID at-- // Transition 2 - revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at]))) + revOff = sid.Offset() + int(d.byteToClass(haystack[at])) if revOff >= ftLen { goto reverseSlowPath } nextSID = ft[revOff] - if nextSID >= DeadState { + if nextSID.IsTagged() { goto reverseSlowPath } - if cache.matchFlags[int(nextSID)] { - lastMatch = at - } sid = nextSID at-- // Transition 3 - revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at]))) + revOff = sid.Offset() + int(d.byteToClass(haystack[at])) if revOff >= ftLen { goto reverseSlowPath } nextSID = ft[revOff] - if nextSID >= DeadState { + if nextSID.IsTagged() { goto reverseSlowPath } - if cache.matchFlags[int(nextSID)] { - lastMatch = at - } sid = nextSID at-- // Transition 4 - revOff = safeOffset(sid, stride, int(d.byteToClass(haystack[at]))) + revOff = sid.Offset() + int(d.byteToClass(haystack[at])) if revOff >= ftLen { goto reverseSlowPath } nextSID = ft[revOff] - if nextSID >= DeadState { + if nextSID.IsTagged() { goto reverseSlowPath } - if cache.matchFlags[int(nextSID)] { - lastMatch = at - } sid = nextSID at-- @@ -2030,7 +1874,7 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in b := haystack[at] classIdx := int(d.byteToClass(b)) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { @@ -2073,13 +1917,24 @@ func (d *DFA) SearchReverse(cache *DFACache, haystack []byte, start, end int) in sid = nextID } + // 1-byte match delay for reverse: the match tag on the new state means + // the OLD state had NFA match. In reverse search, the match position + // is at+1 (one byte forward from current, since we're going backward). + // Rust: mat = Some(HalfMatch::new(pattern, at + 1)) if cache.IsMatchState(sid) { - lastMatch = at + lastMatch = at + 1 } at-- } + // EOI for reverse: at region start, check if current state's NFA states + // contain a delayed match. If so, the match starts at 'start'. + eoi := cache.getState(sid) + if eoi != nil && containsNFAMatch(d.nfa, eoi.NFAStates()) { + lastMatch = start + } + return lastMatch } @@ -2119,10 +1974,7 @@ func (d *DFA) SearchReverseLimited(cache *DFACache, haystack []byte, start, end, } lastMatch := -1 - - if currentState.IsMatch() { - lastMatch = end - } + // With 1-byte match delay, start states are never match states. lowerBound := start if minStart > lowerBound { @@ -2132,14 +1984,13 @@ func (d *DFA) SearchReverseLimited(cache *DFACache, haystack []byte, start, end, // Hot loop: flat transition table (Rust approach). sid := currentState.id ft := cache.flatTrans - stride := cache.stride ftLen := len(ft) for at := end - 1; at >= lowerBound; at-- { b := haystack[at] classIdx := int(d.byteToClass(b)) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { @@ -2183,11 +2034,18 @@ func (d *DFA) SearchReverseLimited(cache *DFACache, haystack []byte, start, end, sid = nextID } + // 1-byte match delay for reverse: match position is at+1 if cache.IsMatchState(sid) { - lastMatch = at + lastMatch = at + 1 } } + // EOI for reverse: check delayed match at region start + eoi := cache.getState(sid) + if eoi != nil && containsNFAMatch(d.nfa, eoi.NFAStates()) { + lastMatch = lowerBound + } + if lowerBound > start && lastMatch < 0 { return SearchReverseLimitedQuadratic } @@ -2210,21 +2068,18 @@ func (d *DFA) IsMatchReverse(cache *DFACache, haystack []byte, start, end int) b return matched } - if currentState.IsMatch() { - return true - } + // With 1-byte match delay, start states are never match states. // Hot loop: flat transition table (Rust approach). sid := currentState.id ft := cache.flatTrans - stride := cache.stride ftLen := len(ft) for at := end - 1; at >= start; at-- { b := haystack[at] classIdx := int(d.byteToClass(b)) - offset := safeOffset(sid, stride, classIdx) + offset := sid.Offset() + classIdx var nextID StateID if offset < ftLen { @@ -2271,12 +2126,15 @@ func (d *DFA) IsMatchReverse(cache *DFACache, haystack []byte, start, end int) b sid = nextID } + // 1-byte match delay: match detected after transition if cache.IsMatchState(sid) { return true } } - return cache.IsMatchState(sid) + // EOI for reverse: check if current state's NFA states contain match + eoi := cache.getState(sid) + return eoi != nil && containsNFAMatch(d.nfa, eoi.NFAStates()) } // getStartStateForReverse returns the appropriate start state for reverse search. diff --git a/dfa/lazy/search_extra_test.go b/dfa/lazy/search_extra_test.go index 6d60419..a99025c 100644 --- a/dfa/lazy/search_extra_test.go +++ b/dfa/lazy/search_extra_test.go @@ -89,7 +89,7 @@ func TestSearchAtWithoutPrefilter(t *testing.T) { {"simple literal from 0", "abc", "xyzabc", 0, 6}, {"simple literal from 3", "abc", "xyzabc", 3, 6}, {"no match", "xyz", "abcdef", 0, -1}, - {"empty pattern from 0", "", "abc", 0, 3}, // empty pattern greedy-matches entire input + {"empty pattern from 0", "", "abc", 0, 0}, // empty pattern matches at position 0 (stdlib behavior) {"empty input", "abc", "", 0, -1}, {"at end", "abc", "abc", 3, -1}, {"past end", "abc", "abc", 4, -1}, @@ -332,7 +332,7 @@ func TestEmptyPatternBehavior(t *testing.T) { matchWant bool }{ {"empty input", "", 0, true}, - {"non-empty input", "abc", 3, true}, // empty pattern greedy-matches entire input + {"non-empty input", "abc", 0, true}, // empty pattern matches at position 0 (stdlib behavior) } for _, tt := range tests { diff --git a/dfa/lazy/start.go b/dfa/lazy/start.go index c95349e..a78d3eb 100644 --- a/dfa/lazy/start.go +++ b/dfa/lazy/start.go @@ -227,8 +227,12 @@ func ComputeStartStateWithStride(builder *Builder, n *nfa.NFA, config StartConfi // Compute epsilon closure from NFA start state with look assertions startStateSet := builder.epsilonClosure([]nfa.StateID{nfaStart}, lookHave) - // Check if start state is a match state - isMatch := builder.containsMatchState(startStateSet) + // With 1-byte match delay, start states are NEVER match states. + // Match reporting is delayed by 1 byte: the NEW state is tagged as match + // based on the OLD state's NFA match content. Since there is no "old state" + // before the start state, it cannot be a match. + // Reference: Rust regex-automata determinize (mod.rs:254-286). + isMatch := false // Determine isFromWord based on StartKind. // This is critical for \b/\B word boundary handling: diff --git a/dfa/lazy/state.go b/dfa/lazy/state.go index 33edf96..9ff197b 100644 --- a/dfa/lazy/state.go +++ b/dfa/lazy/state.go @@ -9,22 +9,99 @@ import ( ) // StateID uniquely identifies a DFA state in the cache. -// This is a 32-bit unsigned integer for compact representation. +// +// The ID is a **premultiplied byte offset** into the flat transition table, +// with tag bits in the high 5 bits for O(1) special state detection. +// +// Layout (Rust LazyStateID approach, hybrid/id.rs:169): +// +// [invalid|dead|reserved|start|match| 27 bits: offset into flatTrans ] +// bit 31 30 29 28 27 bits 0-26 +// +// Hot loop: nextSID = flatTrans[sid & TagMask + classIdx] +// +// if sid > TagMask { handle special } +// +// No multiply needed — sid already contains the byte offset. type StateID uint32 -// Special state constants +// Tag bit masks for StateID high bits. const ( - // InvalidState represents an invalid/uninitialized state ID - InvalidState StateID = 0xFFFFFFFF + tagInvalid StateID = 1 << 31 // Unknown/not yet computed transition + tagDead StateID = 1 << 30 // Dead state — no match possible + tagReserved StateID = 1 << 29 // Reserved for quit + tagStart StateID = 1 << 28 // Start state + tagMatch StateID = 1 << 27 // Match/accepting state + + // TagMask extracts the offset (lower 27 bits). + // Any bit above this = special state requiring slow path. + TagMask StateID = tagMatch - 1 // 0x07FFFFFF + + // MaxStateOffset is the maximum premultiplied offset (128M entries). + MaxStateOffset StateID = TagMask +) - // DeadState represents a dead/failure state with no outgoing transitions. - // Once in this state, the DFA can never match. - DeadState StateID = 0xFFFFFFFE +// Special state constants (tagged, premultiplied offset = 0). +const ( + // InvalidState represents an unknown/uninitialized transition. + // In flatTrans, this means the transition hasn't been computed yet. + InvalidState StateID = tagInvalid // 0x80000000 - // StartState is always state ID 0 (the initial state) + // DeadState represents a dead/failure state — no match possible. + DeadState StateID = tagDead // 0x40000000 + + // StartState is the initial state. Offset 0, untagged. StartState StateID = 0 ) +// IsTagged returns true if any tag bit is set (special state). +// This is the single branch in the DFA hot loop. +// +//go:nosplit +func (sid StateID) IsTagged() bool { + return sid > TagMask +} + +// Offset returns the premultiplied byte offset into flatTrans. +// Strips tag bits. Only valid for non-special states. +// +//go:nosplit +func (sid StateID) Offset() int { + return int(sid & TagMask) +} + +// IsMatch returns true if this state has the match tag. +// +//go:nosplit +func (sid StateID) IsMatchTag() bool { + return sid&tagMatch != 0 +} + +// IsDeadTag returns true if this state has the dead tag. +// +//go:nosplit +func (sid StateID) IsDeadTag() bool { + return sid&tagDead != 0 +} + +// IsInvalidTag returns true if this state has the invalid tag. +// +//go:nosplit +func (sid StateID) IsInvalidTag() bool { + return sid&tagInvalid != 0 +} + +// WithMatchTag returns a copy of this StateID with the match tag set. +func (sid StateID) WithMatchTag() StateID { + return sid | tagMatch +} + +// WithStartTag returns a copy of this StateID with the start tag set. +// Reserved for future start-state specialization (Rust specialize_start_states). +func (sid StateID) WithStartTag() StateID { + return sid | tagStart +} + // defaultStride is the default alphabet size when ByteClasses compression is not used. const defaultStride = 256 @@ -233,11 +310,24 @@ func ComputeStateKey(nfaStates []nfa.StateID) StateKey { // States with same NFA states but different isFromWord are DIFFERENT DFA states. // This is essential for correct \b and \B handling. func ComputeStateKeyWithWord(nfaStates []nfa.StateID, isFromWord bool) StateKey { + return ComputeStateKeyWithWordAndMatch(nfaStates, isFromWord, false) +} + +// ComputeStateKeyWithWordAndMatch computes a hash-based key including word context +// and match delay flag. With 1-byte match delay, the same set of NFA states can +// produce both a match and non-match DFA state depending on whether the SOURCE +// state contained an NFA match state. This function distinguishes them in the cache. +func ComputeStateKeyWithWordAndMatch(nfaStates []nfa.StateID, isFromWord bool, isMatch bool) StateKey { if len(nfaStates) == 0 { + // Encode (isFromWord, isMatch) into 2 bits for empty states + var key StateKey if isFromWord { - return StateKey(1) // Distinguish empty+fromWord from empty+notFromWord + key |= 1 } - return StateKey(0) + if isMatch { + key |= 2 + } + return key } // Sort NFA states for canonical ordering @@ -249,12 +339,15 @@ func ComputeStateKeyWithWord(nfaStates []nfa.StateID, isFromWord bool) StateKey // Hash the sorted states using FNV-1a h := fnv.New64a() - // Include isFromWord in the hash FIRST to distinguish states + // Include isFromWord and isMatch in the hash FIRST to distinguish states + var flags byte if isFromWord { - _, _ = h.Write([]byte{1}) - } else { - _, _ = h.Write([]byte{0}) + flags |= 1 + } + if isMatch { + flags |= 2 } + _, _ = h.Write([]byte{flags}) for _, sid := range sorted { // Write each StateID as 4 bytes (uint32) @@ -393,6 +486,18 @@ func (ss *StateSet) ToSlice() []nfa.StateID { return slice } +// ToSliceInsertionOrder returns states in the order they were inserted. +// This matches Rust's sparse set iteration order, which is critical for +// determinize break-at-match semantics (leftmost-first match priority). +func (ss *StateSet) ToSliceInsertionOrder() []nfa.StateID { + if ss.size == 0 { + return nil + } + slice := make([]nfa.StateID, ss.size) + copy(slice, ss.dense[:ss.size]) + return slice +} + // Clone creates a deep copy of the state set func (ss *StateSet) Clone() *StateSet { clone := NewStateSetWithCapacity(len(ss.sparse)) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 4d5b05d..f29a7b6 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -27,7 +27,13 @@ Input → Prefilter (memchr/memmem/teddy) → Engine Search → Match Result ### DFA Layer (`dfa/lazy/`) - **Lazy DFA**: On-demand state construction with byte class compression -- **Flat transition table**: `flatTrans[sid*stride+class]` — single array lookup, no pointer chase +- **Flat transition table**: `flatTrans[sid+class]` — premultiplied offset, no multiply +- **Tagged State IDs**: match/dead/invalid encoded in high bits, single `IsTagged()` branch +- **Break-at-match**: Rust `determinize::next` (mod.rs:284) — stops NFA iteration at Match state, + preventing prefix restarts while preserving greedy continuation (leftmost-first semantics) +- **Epsilon closure ordering**: Add-on-pop DFS with reverse Split push — matches Rust sparse set + insertion order. Incremental per-target closure preserves Match-before-prefix ordering +- **2-pass bidirectional search**: Forward DFA → match end, reverse DFA → match start (no Phase 3) - **Byte-based cache limit**: 2MB default (matches Rust `hybrid_cache_capacity`) - **Cache clearing**: Up to 5 clears before NFA fallback (Rust approach) - **Acceleration**: Detects self-loop states, uses SIMD memchr for skip-ahead @@ -99,7 +105,7 @@ Input → Prefilter (memchr/memmem/teddy) → Engine Search → Match Result 1. **Multi-engine**: Strategy selection at compile time, not runtime 2. **Rust reference**: Architecture mirrors Rust regex crate (lazy DFA, PikeVM, prefilters) -3. **Go stdlib compat**: POSIX leftmost-longest semantics (differs from Rust leftmost-first) +3. **Leftmost-first match**: DFA break-at-match matches Rust semantics (verified via cargo run) 4. **Zero-alloc hot paths**: `IsMatch()`, `FindIndices()`, `Count()` — no heap allocation 5. **SIMD first**: AVX2/SSSE3 prefilters for x86_64, pure Go fallback for other archs diff --git a/meta/compile.go b/meta/compile.go index abff531..1a60f34 100644 --- a/meta/compile.go +++ b/meta/compile.go @@ -150,10 +150,9 @@ func buildStrategyEngines( return result } - dfaConfig := lazy.Config{ - MaxStates: config.MaxDFAStates, - DeterminizationLimit: config.DeterminizationLimit, - } + dfaConfig := lazy.DefaultConfig() + dfaConfig.MaxStates = config.MaxDFAStates //nolint:staticcheck // legacy API compat + dfaConfig.DeterminizationLimit = config.DeterminizationLimit result = buildReverseSearchers(result, strategy, re, nfaEngine, dfaConfig, config) @@ -189,13 +188,18 @@ func buildReverseDFA( dfaConfig lazy.Config, pf prefilter.Prefilter, ) strategyEngines { + // Reverse DFA config: disable break-at-match so the reverse search continues + // past matches to find the leftmost match start (greedy continuation). + revDFAConfig := dfaConfig + revDFAConfig.BreakAtMatch = false + switch result.finalStrategy { case UseDFA: // Skip for non-greedy patterns: forward DFA always finds leftmost-longest, // which is incompatible with non-greedy semantics. if result.dfa != nil && !hasNonGreedyQuantifier(re) { reverseNFA := nfa.ReverseAnchored(nfaEngine) - revDFA, err := lazy.CompileWithConfig(reverseNFA, dfaConfig) + revDFA, err := lazy.CompileWithConfig(reverseNFA, revDFAConfig) if err == nil { result.reverseDFA = revDFA } @@ -205,7 +209,7 @@ func buildReverseDFA( if err == nil { result.dfa = fwdDFA reverseNFA := nfa.ReverseAnchored(nfaEngine) - revDFA, revErr := lazy.CompileWithConfig(reverseNFA, dfaConfig) + revDFA, revErr := lazy.CompileWithConfig(reverseNFA, revDFAConfig) if revErr == nil { result.reverseDFA = revDFA } @@ -530,6 +534,8 @@ func CompileRegexp(re *syntax.Regexp, config Config) (*Engine, error) { // because its greedy semantics give wrong results for patterns like (?:|a)* canMatchEmpty := pikevm.IsMatch(nil) + // Check if Phase 3 (SearchAtAnchored) is needed in bidirectional DFA search. + // Phase 3 re-scans from confirmed start with greedy semantics. Only needed when // Extract first-byte prefilter for anchored patterns. // This enables O(1) early rejection for non-matching inputs. // Only useful for start-anchored patterns where we only check position 0. diff --git a/meta/engine.go b/meta/engine.go index d2bcc54..dd1d5ca 100644 --- a/meta/engine.go +++ b/meta/engine.go @@ -186,6 +186,17 @@ func (e *Engine) IsStartAnchored() bool { return e.isStartAnchored } +// IsStartAnchoredWithFirstByteReject returns true if: +// 1. Pattern is always-anchored (^) AND +// 2. First byte of haystack doesn't match any possible first byte +// This allows ultra-fast O(1) rejection without any dispatch overhead. +func (e *Engine) IsStartAnchoredWithFirstByteReject(haystack []byte) bool { + return e.nfa.IsAlwaysAnchored() && + e.anchoredFirstBytes != nil && + len(haystack) > 0 && + !e.anchoredFirstBytes.Contains(haystack[0]) +} + // Stats returns execution statistics. // // Useful for performance analysis and debugging. diff --git a/meta/find_indices.go b/meta/find_indices.go index 4ba5f36..cd794e3 100644 --- a/meta/find_indices.go +++ b/meta/find_indices.go @@ -342,7 +342,6 @@ func (e *Engine) findIndicesDFAAt(haystack []byte, at int) (int, int, bool) { return e.pikevm.SearchAt(haystack, pos) } - // No prefilter: bidirectional DFA or DFA + PikeVM fallback. if e.reverseDFA != nil { return e.findIndicesBidirectionalDFA(haystack, at) } @@ -578,35 +577,31 @@ func (e *Engine) findIndicesMultilineReverseSuffixAt(haystack []byte, at int) (i } // findIndicesBidirectionalDFA uses forward DFA + reverse DFA for exact match bounds. -// Three-phase: forward DFA → first match end, reverse DFA → match start, -// anchored forward DFA → correct greedy end from that start. O(n) total. +// Two-phase: forward DFA → match end, reverse DFA → match start. O(n) total. // -// Phase 1 uses SearchFirstAt (stops at first match end) to avoid DFA over-extension -// with unanchored prefix. Phase 3 then runs anchored greedy DFA from the discovered -// start to get the correct (potentially longer) end for patterns like ".*". +// With Rust-style break-at-match in determinize, SearchAt produces correct +// leftmost-first greedy match ends directly (verified against Rust regex-automata +// fwd search). No Phase 3 re-scan needed. func (e *Engine) findIndicesBidirectionalDFA(haystack []byte, at int) (int, int, bool) { atomic.AddUint64(&e.stats.DFASearches, 1) state := e.getSearchState() defer e.putSearchState(state) - // Phase 1: find first match end (leftmost-first, not leftmost-longest) - end := e.dfa.SearchFirstAt(state.dfaCache, haystack, at) + // Forward DFA: leftmost-first match end (matches Rust find_fwd) + end := e.dfa.SearchAt(state.dfaCache, haystack, at) if end == -1 { return -1, -1, false } if end == at { return at, at, true // Empty match } - // Phase 2: reverse DFA to find match start + // Skip reverse search if anchored (Rust hybrid/regex.rs:467) + if e.nfa.IsAlwaysAnchored() { + return at, end, true + } + // Reverse DFA → match start start := e.reverseDFA.SearchReverse(state.revDFACache, haystack, at, end) if start < 0 { - return -1, -1, false // Reverse DFA failed (cache full) - } - // Phase 3: anchored greedy forward DFA from start → correct end. - // SearchFirstAt may undercount for greedy patterns (e.g., ".*" stops at first "). - // Anchored DFA from start gives the correct greedy end for this specific match. - exactEnd := e.dfa.SearchAtAnchored(state.dfaCache, haystack, start) - if exactEnd > start { - end = exactEnd + return -1, -1, false } return start, end, true } @@ -652,6 +647,14 @@ func (e *Engine) findIndicesBoundedBacktracker(haystack []byte) (int, int, bool) } } + // For always-anchored patterns (^) on large inputs where BT can't handle + // the full haystack, use PikeVM directly. PikeVM memory is O(states) per + // step, not O(states × haystack) like BT visited table. + if e.nfa.IsAlwaysAnchored() && !e.boundedBacktracker.CanHandle(len(haystack)) { + atomic.AddUint64(&e.stats.NFASearches, 1) + return e.pikevm.SearchWithSlotTable(haystack, nfa.SearchModeFind) + } + atomic.AddUint64(&e.stats.NFASearches, 1) if !e.boundedBacktracker.CanHandle(len(haystack)) { // Bidirectional DFA: O(n) vs PikeVM's O(n*states) for large inputs diff --git a/meta/findall.go b/meta/findall.go index f4fcb4e..9e4713a 100644 --- a/meta/findall.go +++ b/meta/findall.go @@ -171,6 +171,8 @@ func (e *Engine) FindAllIndicesStreaming(haystack []byte, n int, results [][2]in // findAllIndicesLoop is the standard loop-based FindAll for non-streaming strategies. // Optimized: acquires SearchState once for entire loop to avoid sync.Pool overhead per match. +// +//nolint:cyclop // DFA direct path adds necessary branching func (e *Engine) findAllIndicesLoop(haystack []byte, n int, results [][2]int) [][2]int { if results == nil { // Smart allocation: anchored patterns have max 1 match, others use capped heuristic. @@ -194,12 +196,50 @@ func (e *Engine) findAllIndicesLoop(haystack []byte, n int, results [][2]int) [] pos := 0 lastMatchEnd := -1 + // Fast path: start-anchored patterns (^) match at most once at position 0. + // Skip pool Get/Put overhead entirely — use non-pooled FindIndices. + if e.nfa.IsAlwaysAnchored() { + start, end, found := e.FindIndices(haystack) + if found { + results = append(results, [2]int{start, end}) + } + return results + } + // Get state ONCE for entire iteration - eliminates 1.29M sync.Pool ops for FindAll state := e.getSearchState() defer e.putSearchState(state) + // DFA fast path: call DFA functions directly, skip meta prefilter layer. + // SearchFirstAt has integrated prefilter at start state — no duplicate scan. + // Saves: 1 prefilter call per candidate + function dispatch overhead. + useDFADirect := (e.strategy == UseDFA || e.strategy == UseBoth) && + e.dfa != nil && e.reverseDFA != nil && + state.dfaCache != nil && state.revDFACache != nil + for n <= 0 || len(results) < n { - start, end, found := e.findIndicesAtWithState(haystack, pos, state) + var start, end int + var found bool + + if useDFADirect { + // 2-pass bidirectional DFA, called directly (no meta prefilter). + // SearchAt → match end (matches Rust find_fwd), reverse DFA → start. + matchEnd := e.dfa.SearchAt(state.dfaCache, haystack, pos) + if matchEnd < 0 { + break + } + if matchEnd == pos { + start, end, found = pos, pos, true + } else { + matchStart := e.reverseDFA.SearchReverse(state.revDFACache, haystack, pos, matchEnd) + if matchStart < 0 { + break + } + start, end, found = matchStart, matchEnd, true + } + } else { + start, end, found = e.findIndicesAtWithState(haystack, pos, state) + } if !found { break } diff --git a/meta/reverse_anchored.go b/meta/reverse_anchored.go index a189987..e68458d 100644 --- a/meta/reverse_anchored.go +++ b/meta/reverse_anchored.go @@ -49,8 +49,12 @@ func NewReverseAnchoredSearcher(forwardNFA *nfa.NFA, config lazy.Config) (*Rever // Build reverse NFA - must be anchored at start (because $ in forward becomes ^ in reverse) reverseNFA := nfa.ReverseAnchored(forwardNFA) - // Build reverse DFA from reverse NFA - reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config) + // Build reverse DFA from reverse NFA. + // Disable BreakAtMatch: reverse DFA must continue past matches to find + // the leftmost match start (greedy continuation). + revConfig := config + revConfig.BreakAtMatch = false + reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig) if err != nil { // Cannot build reverse DFA - this should be rare return nil, err diff --git a/meta/reverse_inner.go b/meta/reverse_inner.go index acdc9b5..a6740de 100644 --- a/meta/reverse_inner.go +++ b/meta/reverse_inner.go @@ -224,8 +224,11 @@ func NewReverseInnerSearcher( // Build reverse NFA from prefix reverseNFA := nfa.Reverse(prefixNFA) - // Build reverse DFA from reverse prefix NFA - reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config) + // Build reverse DFA from reverse prefix NFA. + // Disable BreakAtMatch for reverse DFA. + revConfig := config + revConfig.BreakAtMatch = false + reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig) if err != nil { return nil, err } diff --git a/meta/reverse_suffix.go b/meta/reverse_suffix.go index 06f63c5..b6c8033 100644 --- a/meta/reverse_suffix.go +++ b/meta/reverse_suffix.go @@ -114,8 +114,11 @@ func NewReverseSuffixSearcher( // searching for $ anchor, but for suffix literals. reverseNFA := nfa.Reverse(forwardNFA) - // Build reverse DFA from reverse NFA - reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config) + // Build reverse DFA from reverse NFA. + // Disable BreakAtMatch for reverse DFA. + revConfig := config + revConfig.BreakAtMatch = false + reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig) if err != nil { return nil, err } diff --git a/meta/reverse_suffix_set.go b/meta/reverse_suffix_set.go index 6414073..ea70af1 100644 --- a/meta/reverse_suffix_set.go +++ b/meta/reverse_suffix_set.go @@ -92,8 +92,10 @@ func NewReverseSuffixSetSearcher( // Build reverse NFA reverseNFA := nfa.Reverse(forwardNFA) - // Build reverse DFA - reverseDFA, err := lazy.CompileWithConfig(reverseNFA, config) + // Build reverse DFA. Disable BreakAtMatch for reverse DFA. + revConfig := config + revConfig.BreakAtMatch = false + reverseDFA, err := lazy.CompileWithConfig(reverseNFA, revConfig) if err != nil { return nil, err } diff --git a/nfa/compile.go b/nfa/compile.go index 88c28c0..01a5154 100644 --- a/nfa/compile.go +++ b/nfa/compile.go @@ -132,8 +132,7 @@ func (c *Compiler) CompileRegexp(re *syntax.Regexp) (*NFA, error) { anchoredStart := patternStart // Unanchored start: compile the (?s:.)*? prefix for DFA and other engines - // that need it. PikeVM simulates this prefix in its search loop instead - // (like Rust regex-automata) for correct startPos tracking. + // (same as Rust regex-automata: compiler.rs:997 c_at_least(dot, false, 0)). // If pattern is anchored, unanchored start equals anchored start. var unanchoredStart StateID if c.config.Anchored || allAnchored { diff --git a/nfa/pikevm.go b/nfa/pikevm.go index 34446d8..e0eef83 100644 --- a/nfa/pikevm.go +++ b/nfa/pikevm.go @@ -290,13 +290,28 @@ func (p *PikeVM) initState(state *PikeVMState) { // Pre-allocate epsilon stack for loop-based closure in IsMatch (Rust pattern) state.epsilonStack = make([]StateID, 0, capacity) - // Initialize SlotTables for capture tracking (curr/next, swapped per byte) - // Each capture group has 2 slots (start and end position) + // SlotTables for capture tracking are initialized lazily on first use. + // This avoids allocation overhead for non-capture searches (FindAll, IsMatch). + // See ensureSlotTables(). + state.SlotTable = nil + state.NextSlotTable = nil +} + +// ensureSlotTables lazily initializes SlotTables and capture support. +// Called only when capture tracking is needed (SearchWithSlotTableCaptures). +func (p *PikeVM) ensureSlotTables(state *PikeVMState) { + if state.SlotTable != nil { + return // Already initialized + } slotsPerState := p.nfa.CaptureCount() * 2 - state.SlotTable = NewSlotTable(p.nfa.States(), slotsPerState) - state.NextSlotTable = NewSlotTable(p.nfa.States(), slotsPerState) + numStates := p.nfa.States() + state.SlotTable = NewSlotTable(numStates, slotsPerState) + state.NextSlotTable = NewSlotTable(numStates, slotsPerState) - // Capture-aware epsilon closure stack and working buffer + capacity := numStates + if capacity < 16 { + capacity = 16 + } state.captureStack = make([]captureFrame, 0, capacity) if slotsPerState > 0 { state.currSlots = make([]int, slotsPerState) @@ -2136,6 +2151,9 @@ func (p *PikeVM) SearchWithSlotTableCapturesAt(haystack []byte, at int) *MatchWi return nil } + // Lazy init SlotTables (only on first capture search) + p.ensureSlotTables(&p.internalState) + totalSlots := p.nfa.CaptureCount() * 2 p.internalState.SlotTable.SetActiveSlots(totalSlots) p.internalState.NextSlotTable.SetActiveSlots(totalSlots) diff --git a/nfa/slot_table.go b/nfa/slot_table.go index e8b1ccf..476d886 100644 --- a/nfa/slot_table.go +++ b/nfa/slot_table.go @@ -106,6 +106,9 @@ func (st *SlotTable) ForStateUnchecked(sid StateID) []int { // // If n > slotsPerState, it is clamped to slotsPerState. func (st *SlotTable) SetActiveSlots(n int) { + if st == nil { + return + } if n < 0 { n = 0 } @@ -117,6 +120,9 @@ func (st *SlotTable) SetActiveSlots(n int) { // ActiveSlots returns the current number of active slots. func (st *SlotTable) ActiveSlots() int { + if st == nil { + return 0 + } return st.activeSlots } @@ -171,6 +177,9 @@ func (st *SlotTable) GetSlot(sid StateID, slotIndex int) int { // Note: This is O(n) where n = numStates * slotsPerState. // For large tables, consider using generation-based clearing instead. func (st *SlotTable) Reset() { + if st == nil { + return + } for i := range st.table { st.table[i] = -1 } diff --git a/regex.go b/regex.go index 4795229..b170659 100644 --- a/regex.go +++ b/regex.go @@ -377,6 +377,12 @@ func (r *Regex) FindAll(b []byte, n int) [][]byte { return nil } + // Ultra-fast path: start-anchored patterns (^) with first-byte rejection. + // Avoids entire dispatch chain for the common no-match case. + if r.engine.IsStartAnchoredWithFirstByteReject(b) { + return nil + } + // Use optimized streaming path for ALL strategies (state-reusing, no sync.Pool overhead) return r.findAllStreaming(b, n) } diff --git a/simd/memmem.go b/simd/memmem.go index 2c8720f..3ad5839 100644 --- a/simd/memmem.go +++ b/simd/memmem.go @@ -17,16 +17,15 @@ import "bytes" // // Algorithm: // -// The function uses paired-byte SIMD search with frequency-based rare byte selection: +// The function uses a hybrid SIMD search with frequency-based rare byte selection: // 1. Identify the two rarest bytes in needle using empirical frequency table -// 2. Use MemchrPair to find candidates where both bytes appear at correct distance -// 3. For each candidate, verify the full needle match -// 4. Return position of first match or -1 if not found -// -// The paired-byte approach dramatically reduces false positives compared to -// single-byte search, since matches require two specific bytes at exactly the -// right distance apart. For example, in "@example.com", both '@' (rank 25) and -// 'x' (rank 45) are used, requiring them to appear exactly 2 positions apart. +// 2. For short needles (<=6 bytes): use MemchrPair to find candidates where both +// bytes appear at correct distance — reduces false positives when individual +// bytes are common in the input data +// 3. For longer needles (>6 bytes): use single Memchr on the rarest byte, which +// is genuinely rare and makes single-byte scan + verify faster +// 4. For each candidate, verify the full needle match +// 5. Return position of first match or -1 if not found // // For longer needles (> 32 bytes), a simplified Two-Way string matching // approach is used to maintain O(n+m) complexity and avoid pathological cases. @@ -81,20 +80,18 @@ func Memmem(haystack, needle []byte) int { } // memmemShort handles short needles (2-32 bytes) using rare byte heuristic. -// This is the fast path for most real-world patterns. +// Uses a hybrid approach: MemchrPair for short needles (<=6 bytes) where +// single-byte scan has high false positive rates, and Memchr(rarest byte) +// for longer needles where the rare byte is genuinely rare. func memmemShort(haystack, needle []byte) int { - // Select the two rarest bytes for paired-byte search rareInfo := SelectRareBytes(needle) - - // Determine if we can use paired-byte search (different bytes at different positions) - // Paired-byte search is more selective: false positives require both bytes at exact distance usePair := rareInfo.Byte1 != rareInfo.Byte2 && rareInfo.Index1 != rareInfo.Index2 - if usePair { + // Short needles: MemchrPair is more selective (fewer false positives). + // Long needles: single Memchr + verify is faster (rare byte is genuinely rare). + if usePair && len(needle) <= 6 { return memmemPaired(haystack, needle, rareInfo) } - - // Fall back to single-byte search return memmemSingle(haystack, needle, rareInfo.Byte1, rareInfo.Index1) }