38 changes: 38 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120))
- SIMD prefilter for CompositeSequenceDFA (#83)

## [0.12.18] - 2026-03-24

### Performance
- **Flat DFA transition table** (Rust approach) — replaced double pointer chase
(`stateList[id].transitions[class]`) with flat array (`flatTrans[sid*stride+class]`).
Hot loop works with state ID only — no `*State` pointer in fast path. Applied to
all 6 DFA search functions. Inspired by Rust `Cache.trans` flat layout.

- **4x loop unrolling** in `searchFirstAt` — process 4 bytes per iteration when
all transitions are in the flat table. Falls back to the single-byte slow path on special states.

- **DFA integrated prefilter skip-ahead** (Rust approach) — when DFA returns to
start state with no match in progress, it uses `prefilter.Find()` to skip ahead
instead of byte-by-byte scanning. Applied to `searchFirstAt` and `searchAt`.
Reference: Rust `hybrid/search.rs:232-258`.
`peak_hours`: 197ms → **90ms** (gap vs Rust: 9x → 4x).

- **PikeVM integrated prefilter skip-ahead** — prefilter integrated inside PikeVM
search loop (Rust reference: `pikevm.rs:1293`). When the NFA has no active threads,
PikeVM jumps to the next candidate. Safe for partial-coverage prefilters.
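
To illustrate the flat-table lookup and the 4x unrolling described in the entries above, here is a minimal, self-contained sketch. It is not the coregex implementation: `flatDFA`, `newABDFA`, and `firstMatchEnd` are hypothetical names, the toy automaton only recognizes the literal `ab`, and an uncomputed transition is simply treated as "no match" rather than triggering determinization.

```go
package main

import "fmt"

// StateID mirrors the uint32 state identifier used in the diff.
type StateID uint32

// invalidState marks a transition that has not been computed yet
// (hypothetical stand-in for InvalidState).
const invalidState StateID = 0xFFFFFFFF

// flatDFA is a toy DFA whose transitions live in one contiguous slice.
type flatDFA struct {
	trans  []StateID // row-major: trans[sid*stride+class]
	match  []bool    // match[sid] reports accepting states
	class  [256]byte // byte -> equivalence class
	stride int       // number of classes per state row
}

// newABDFA builds a 3-state automaton that accepts as soon as "ab" is seen.
func newABDFA() *flatDFA {
	d := &flatDFA{stride: 3}
	d.class['a'] = 1
	d.class['b'] = 2
	d.trans = []StateID{
		0, 1, 0, // state 0: start
		0, 1, 2, // state 1: saw 'a'
		2, 2, 2, // state 2: match
	}
	d.match = []bool{false, false, true}
	return d
}

// next is the single indexed load that replaces two pointer chases.
func (d *flatDFA) next(sid StateID, b byte) StateID {
	return d.trans[int(sid)*d.stride+int(d.class[b])]
}

// firstMatchEnd returns the offset just past the first match, or -1.
func (d *flatDFA) firstMatchEnd(input []byte) int {
	sid, i := StateID(0), 0
	// Fast path: four transitions per iteration. If any intermediate
	// state is special (unknown or match), leave sid and i untouched
	// and let the slow path replay those bytes one at a time.
	for i+4 <= len(input) {
		s1 := d.next(sid, input[i])
		if s1 == invalidState || d.match[s1] {
			break
		}
		s2 := d.next(s1, input[i+1])
		if s2 == invalidState || d.match[s2] {
			break
		}
		s3 := d.next(s2, input[i+2])
		if s3 == invalidState || d.match[s3] {
			break
		}
		s4 := d.next(s3, input[i+3])
		if s4 == invalidState || d.match[s4] {
			break
		}
		sid, i = s4, i+4
	}
	// Slow path: single-byte steps with full checks.
	for ; i < len(input); i++ {
		sid = d.next(sid, input[i])
		if sid == invalidState {
			return -1 // toy behavior; a lazy DFA would determinize here
		}
		if d.match[sid] {
			return i + 1
		}
	}
	return -1
}

func main() {
	d := newABDFA()
	fmt.Println(d.firstMatchEnd([]byte("xxxxab")))
	fmt.Println(d.firstMatchEnd([]byte("xxxx")))
}
```

The fast loop never commits a partially processed group of four bytes, so the slow path can always resume from a consistent `(sid, i)` pair.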

### Fixed
- **NFA candidate loop guard** — replaced `IsComplete()` with `partialCoverage`
flag. `IsComplete()` blocked ALL incomplete prefilters including prefix-only ones.
`errors` pattern: 1984ms → **80ms**.

- **DFA prefilter skip for incomplete prefilters** — `IsComplete()` guard blocked
DFA prefilter skip-ahead for memmem/Teddy prefix-only prefilters. Because the DFA
verifies the full pattern, the skip is always safe. `sessions`: 229ms → **30ms**.
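
The reasoning above (a prefix-only prefilter is safe as long as every candidate is fully verified) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `verifyAt` is a hypothetical stand-in for the DFA and checks the toy pattern `ab[0-9]`, while `bytes.Index` plays the role of `prefilter.Find()`.

```go
package main

import (
	"bytes"
	"fmt"
)

// verifyAt stands in for the DFA: it checks the whole toy pattern
// `ab[0-9]` starting at pos.
func verifyAt(h []byte, pos int) bool {
	return pos+3 <= len(h) &&
		h[pos] == 'a' && h[pos+1] == 'b' &&
		h[pos+2] >= '0' && h[pos+2] <= '9'
}

// findSkipAhead returns the start of the first match (or -1) and the
// number of candidates that had to be verified.
func findSkipAhead(h []byte) (start, verified int) {
	prefix := []byte("ab") // prefix-only: necessary, not sufficient
	pos := 0
	for pos < len(h) {
		off := bytes.Index(h[pos:], prefix)
		if off < 0 {
			return -1, verified // no candidate left, so no match
		}
		cand := pos + off
		verified++
		if verifyAt(h, cand) {
			return cand, verified
		}
		pos = cand + 1 // false positive: resume right after it
	}
	return -1, verified
}

func main() {
	start, verified := findSkipAhead([]byte("xxabxxab7"))
	fmt.Println(start, verified)
}
```

A false positive from the prefilter only costs one extra verification; it can never produce a wrong match, because the verifier has the final say.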

## [0.12.17] - 2026-03-23

### Fixed
@@ -39,6 +69,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
Now allows UseTeddy when anchors are only `(?m)^` (no \b, $, etc).
`http_methods` on macOS ARM64: 89ms → **<1ms** (restored to v0.12.14 level).

- **Fix NFA candidate loop guard** — `IsComplete()` guard blocked prefilter
candidate loop for ALL incomplete prefilters, including prefix-only ones
where all alternation branches are represented. Now uses `partialCoverage`
flag (set only on overflow truncation) instead of `IsComplete()`. Pattern
` [5][0-9]{2} | [4][0-9]{2} ` (Kostya's `errors`): 1984ms → **109ms**.
Rust handles this by integrating prefilter as skip-ahead inside PikeVM
(not as an external correctness gate) — see `pikevm.rs:1293-1299`.
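
The difference between the two guards can be shown with a small sketch. Only the names `IsComplete()` and `partialCoverage` come from the entry above; the `prefilter` struct and both guard functions are hypothetical and model nothing beyond the decision itself.

```go
package main

import "fmt"

// prefilter is a hypothetical summary of what a prefilter can promise.
type prefilter struct {
	complete        bool // a hit alone proves a full match
	partialCoverage bool // literal set was truncated on overflow
}

// IsComplete mirrors the method named in the changelog entry.
func (p prefilter) IsComplete() bool { return p.complete }

// oldGuardAllows models the previous check: only complete prefilters
// were allowed to drive the candidate loop.
func oldGuardAllows(p prefilter) bool { return p.IsComplete() }

// newGuardAllows models the fixed check: any prefilter may drive the
// loop unless some branches are not represented at all.
func newGuardAllows(p prefilter) bool { return !p.partialCoverage }

func main() {
	prefixOnly := prefilter{complete: false, partialCoverage: false}
	truncated := prefilter{complete: false, partialCoverage: true}
	fmt.Println(oldGuardAllows(prefixOnly), newGuardAllows(prefixOnly))
	fmt.Println(oldGuardAllows(truncated), newGuardAllows(truncated))
}
```

A prefix-only prefilter is incomplete (each hit still needs verification) but covers every branch, which is the case the old guard rejected unnecessarily.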

## [0.12.16] - 2026-03-21

### Performance
20 changes: 10 additions & 10 deletions README.md
@@ -64,16 +64,16 @@ Cross-language benchmarks on 6MB input, AMD EPYC ([source](https://github.com/ko

| Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust |
|---------|-----------|---------|------------|-----------|---------|
-| Literal alternation | 475 ms | 4.4 ms | 0.6 ms | **108x** | 7.1x slower |
-| Multi-literal | 1412 ms | 12.8 ms | 4.7 ms | **110x** | 2.7x slower |
-| Inner `.*keyword.*` | 232 ms | 0.30 ms | 0.27 ms | **774x** | 1.1x slower |
-| Suffix `.*\.txt` | 236 ms | 1.82 ms | 1.13 ms | **129x** | 1.6x slower |
-| Multiline `(?m)^/.*\.php` | 103 ms | 0.50 ms | 0.67 ms | **206x** | **1.3x faster** |
-| Email validation | 265 ms | 0.62 ms | 0.27 ms | **428x** | 2.2x slower |
-| URL extraction | 353 ms | 0.65 ms | 0.35 ms | **543x** | 1.8x slower |
-| IP address | 496 ms | 2.1 ms | 12.1 ms | **231x** | **5.6x faster** |
-| Char class `[\w]+` | 581 ms | 51.2 ms | 50.2 ms | **11x** | ~parity |
-| Word repeat `(\w{2,8})+` | 712 ms | 186 ms | 48.7 ms | **3x** | 3.8x slower |
+| Literal alternation | 475 ms | 4.4 ms | 0.7 ms | **109x** | 6.3x slower |
+| Multi-literal | 1391 ms | 12.6 ms | 4.7 ms | **110x** | 2.6x slower |
+| Inner `.*keyword.*` | 231 ms | 0.29 ms | 0.29 ms | **797x** | **~parity** |
+| Suffix `.*\.txt` | 234 ms | 1.83 ms | 1.07 ms | **128x** | 1.7x slower |
+| Multiline `(?m)^/.*\.php` | 103 ms | 0.66 ms | 0.66 ms | **156x** | **~parity** |
+| Email validation | 261 ms | 0.54 ms | 0.31 ms | **482x** | 1.7x slower |
+| URL extraction | 262 ms | 0.84 ms | 0.35 ms | **311x** | 2.4x slower |
+| IP address | 498 ms | 2.1 ms | 12.0 ms | **237x** | **5.6x faster** |
+| Char class `[\w]+` | 554 ms | 48.0 ms | 50.1 ms | **11x** | **1.0x faster** |
+| Word repeat `(\w{2,8})+` | 641 ms | 185 ms | 48.7 ms | **3x** | 3.7x slower |

**Where coregex excels:**
- Multiline patterns (`(?m)^/.*\.php`) — near Rust parity, 100x+ vs stdlib
15 changes: 12 additions & 3 deletions ROADMAP.md
@@ -2,7 +2,7 @@

> **Strategic Focus**: Production-grade regex engine with RE2/rust-regex level optimizations

-**Last Updated**: 2026-03-20 | **Current Version**: v0.12.15 | **Target**: v1.0.0 stable
+**Last Updated**: 2026-03-24 | **Current Version**: v0.12.18 | **Target**: v1.0.0 stable

---

@@ -87,7 +87,13 @@ v0.12.13 ✅ → FatTeddy fix, prefilter acceleration, AC v0.2.1
v0.12.14 ✅ → Concurrent safety fix for isMatchDFA prefilter (#137)
-v0.12.15 (Current) ✅ → Per-goroutine DFA cache, word boundary 30%→0.3% CPU, AC prefilter
+v0.12.15 ✅ → Per-goroutine DFA cache, word boundary 30%→0.3% CPU, AC prefilter
+v0.12.16 ✅ → WrapLineAnchor for (?m)^ patterns
+v0.12.17 ✅ → Fix LogParser ARM64 regression, restore DFA/Teddy for (?m)^
+v0.12.18 (Current) ✅ → Flat DFA transition table, integrated prefilter, PikeVM skip-ahead
v1.0.0-rc → Feature freeze, API locked
@@ -130,7 +136,10 @@ v1.0.0 STABLE → Production release with API stability guarantee
- ✅ **v0.12.12**: Prefix trimming for case-fold literals
- ✅ **v0.12.13**: FatTeddy fix (ANDL→ORL, VPTEST), prefilter acceleration, AC v0.2.1
- ✅ **v0.12.14**: Concurrent safety fix for isMatchDFA prefilter (#137)
- ✅ **v0.12.15**: Per-goroutine DFA cache (Rust approach), word boundary 30%→0.3% CPU, AC DFA prefilter for >32 literals (7-13x faster)
- ✅ **v0.12.15**: Per-goroutine DFA cache (Rust approach), word boundary 30%→0.3% CPU, 7 correctness fixes
- ✅ **v0.12.16**: WrapLineAnchor for (?m)^ patterns
- ✅ **v0.12.17**: Fix LogParser ARM64 regression — restore DFA/Teddy for (?m)^, partial prefilter
- ✅ **v0.12.18**: Flat DFA transition table (Rust approach), integrated prefilter skip-ahead in DFA+PikeVM, 4x unrolling — **35% faster than v0.12.14, 3x from Rust**

---

107 changes: 92 additions & 15 deletions dfa/lazy/cache.go
@@ -27,36 +27,50 @@ import (
// - After too many clears, falls back to NFA
// - Clearing keeps allocated memory to avoid re-allocation
type DFACache struct {
// states maps StateKey -> DFA State
// states maps StateKey -> DFA State (used only in determinize slow path)
states map[StateKey]*State

// stateList provides O(1) lookup of states by ID via direct indexing.
// StateIDs are sequential (0, 1, 2...), so slice indexing is faster than map.
// This was previously DFA.states — moved here because it grows during search.
// stateList provides O(1) lookup of State structs by ID.
// Used only in slow path (determinize, word boundary, acceleration).
// Hot loop uses flatTrans + matchFlags instead.
stateList []*State

// --- Flat transition table (Rust approach) ---
// Hot loop uses ONLY these fields — no *State pointer chase.
//
// Rust: cache.trans[sid + class] — single flat array, premultiplied ID.
// We use: flatTrans[int(sid)*stride + class] — same layout.
//
// This replaces per-state State.transitions[] in the hot loop:
// ONE slice access instead of TWO pointer chases (stateList → State → transitions).

// flatTrans is the flat transition table.
// Layout: [state0_c0, state0_c1, ..., state0_cN, state1_c0, ...]
// InvalidState (0xFFFFFFFF) = unknown transition (needs determinize).
flatTrans []StateID

// matchFlags[stateID] = true if state is a match/accepting state.
// Replaces State.IsMatch() in hot loop — no pointer chase needed.
matchFlags []bool

// stride is the number of byte equivalence classes (alphabet size).
stride int

// startTable caches start states for different look-behind contexts.
// This enables correct handling of assertions (^, \b, etc.) and
// avoids recomputing epsilon closures on every search.
// Previously lived on DFA — moved here because it is populated lazily.
startTable StartTable

// maxStates is the capacity limit
maxStates uint32

// nextID is the next available state ID.
// Start at 1 (0 is reserved for StartState).
nextID StateID

-// clearCount tracks how many times the cache has been cleared during
-// the current search. This is used to detect pathological cache thrashing
-// and trigger NFA fallback when clears exceed the configured limit.
-// Inspired by Rust regex-automata's hybrid DFA cache clearing strategy.
+// clearCount tracks cache clear count for NFA fallback threshold.
clearCount int

-// Statistics for cache performance tuning
-hits uint64 // Number of cache hits
-misses uint64 // Number of cache misses
+// Statistics
+hits uint64
+misses uint64
}

// Get retrieves a state by its key.
@@ -95,9 +109,67 @@ func (c *DFACache) Insert(key StateKey, state *State) (StateID, error) {
c.states[key] = state
c.misses++

// Grow flat transition table for this state's row (all InvalidState initially).
if c.stride > 0 {
sid := int(state.id)
needed := (sid + 1) * c.stride
if needed > len(c.flatTrans) {
growth := needed - len(c.flatTrans)
for i := 0; i < growth; i++ {
c.flatTrans = append(c.flatTrans, InvalidState)
}
}
// Grow matchFlags
for len(c.matchFlags) <= sid {
c.matchFlags = append(c.matchFlags, false)
}
c.matchFlags[sid] = state.isMatch
}

return state.ID(), nil
}
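
As a reading aid for the row-growth step in `Insert` above, the sketch below shows the same row-major layout being extended in one `append` call instead of one element at a time. It is an alternative shape under assumptions, not what the PR does: `growRows` and `invalidState` are hypothetical names.

```go
package main

import "fmt"

// StateID mirrors the uint32 state identifier used in the diff.
type StateID uint32

// invalidState is a hypothetical stand-in for InvalidState.
const invalidState StateID = 0xFFFFFFFF

// growRows extends tbl so that it holds full rows for state IDs
// 0..sid, filling every new cell with invalidState. Existing cells
// are left untouched.
func growRows(tbl []StateID, sid, stride int) []StateID {
	needed := (sid + 1) * stride
	if needed <= len(tbl) {
		return tbl
	}
	old := len(tbl)
	// One append grows the slice; new cells start as zero values.
	tbl = append(tbl, make([]StateID, needed-old)...)
	for i := old; i < needed; i++ {
		tbl[i] = invalidState
	}
	return tbl
}

func main() {
	var tbl []StateID
	tbl = growRows(tbl, 0, 3) // row for state 0
	tbl[0*3+1] = 1            // record transition: state 0, class 1 -> state 1
	tbl = growRows(tbl, 2, 3) // rows for states 1 and 2
	fmt.Println(len(tbl), tbl[1], tbl[8] == invalidState)
}
```

The explicit fill loop is needed because the zero value of `StateID` is 0, which is a valid state ID and not the "unknown" marker.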

// safeOffset computes flat table offset, safe on 386 where int is 32-bit.
// StateID is uint32; on 386 int(0xFFFFFFFF) = -1 and uint multiply overflows.
// Returns MaxInt for special state IDs (DeadState, InvalidState) so bounds
// check (offset < ftLen) always fails safely.
func safeOffset(sid StateID, stride int, classIdx int) int {
if sid >= DeadState {
return int(^uint(0) >> 1) // MaxInt — always >= ftLen
}
return int(sid)*stride + classIdx
}
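
The 32-bit hazard that `safeOffset` guards against can be reproduced on any platform by doing the arithmetic in `int32` explicitly. The sketch below is illustrative only: `naiveOffset32`, `safeOffset32`, and `deadState` are hypothetical names that mimic the 386 behavior described in the comment above.

```go
package main

import (
	"fmt"
	"math"
)

// deadState is a hypothetical stand-in for DeadState (0xFFFFFFFE).
const deadState uint32 = 0xFFFFFFFE

// naiveOffset32 converts the ID the way a 32-bit int would.
// Special IDs wrap to small negative numbers.
func naiveOffset32(sid uint32, stride, class int32) int32 {
	return int32(sid)*stride + class
}

// safeOffset32 returns MaxInt32 for special IDs, so a check of the
// form "offset < len(table)" always fails for them.
func safeOffset32(sid uint32, stride, class int32) int32 {
	if sid >= deadState {
		return math.MaxInt32
	}
	return int32(sid)*stride + class
}

func main() {
	const invalid uint32 = 0xFFFFFFFF
	fmt.Println(naiveOffset32(invalid, 4, 1)) // negative index
	fmt.Println(safeOffset32(invalid, 4, 1))  // sentinel, never in range
	fmt.Println(safeOffset32(5, 4, 1))        // ordinary state
}
```

A negative offset would pass an `offset < len(table)` test and then panic on indexing, which is why the sentinel is chosen at the top of the range instead.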

// SetFlatTransition records a transition in the flat table.
// Called from determinize when a transition is computed.
func (c *DFACache) SetFlatTransition(fromID StateID, classIdx int, toID StateID) {
offset := safeOffset(fromID, c.stride, classIdx)
if offset < len(c.flatTrans) {
c.flatTrans[offset] = toID
}
}

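// FlatNext returns the next state ID from the flat table.
// Returns InvalidState if the transition hasn't been computed yet.
// This is the hot-path function and should be inlined by the compiler.
// It does not go through safeOffset and does no range check of its own,
// so callers must pass a regular state ID (sid < DeadState) whose row exists.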
func (c *DFACache) FlatNext(sid StateID, classIdx int) StateID {
offset := int(sid)*c.stride + classIdx
return c.flatTrans[offset]
}

// IsMatchState returns whether the given state ID is a match state.
// Uses compact matchFlags slice — no pointer chase.
func (c *DFACache) IsMatchState(sid StateID) bool {
if sid >= DeadState {
return false
}
id := int(sid)
if id >= len(c.matchFlags) {
return false
}
return c.matchFlags[id]
}

// GetOrInsert retrieves a state from cache or inserts it if not present.
// This is the primary method used during DFA construction.
//
@@ -220,6 +292,11 @@ func (c *DFACache) getState(id StateID) *State {
return nil
}

// Guard against special state IDs (DeadState=0xFFFFFFFE, InvalidState=0xFFFFFFFF).
// On 386, int(uint32(0xFFFFFFFF)) = -1, causing negative index panic.
if id >= DeadState {
return nil
}
idx := int(id)
if idx >= len(c.stateList) {
return nil