coregx/coregex

coregex

High-performance regex engine for Go. Drop-in replacement for regexp with 3-3000x speedup.*

* Typical speedup 15-240x on real-world patterns. 1000x+ achieved on specific edge cases where prefilters skip entire input (e.g., IP pattern on text with no digits).

Why coregex?

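Go's stdlib regexp is intentionally simple: it is built around NFA simulation (with small one-pass and bounded-backtracking fast paths) and has no lazy DFA or SIMD literal prefilters. This guarantees O(n) time but leaves performance on the table.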

coregex brings Rust regex-crate architecture to Go:

  • Multi-engine: 17 strategies — Lazy DFA, PikeVM, OnePass, BoundedBacktracker, and more
  • SIMD prefilters: AVX2/SSSE3 for fast candidate rejection
  • Reverse search: Suffix/inner literal patterns run 1000x+ faster
  • O(n) guarantee: No backtracking, no ReDoS vulnerabilities

Installation

go get github.com/coregx/coregex

Requires Go 1.25+. Minimal dependencies (golang.org/x/sys, github.com/coregx/ahocorasick).

Quick Start

package main

import (
    "fmt"
    "github.com/coregx/coregex"
)

func main() {
    re := coregex.MustCompile(`\w+@\w+\.\w+`)

    text := []byte("Contact support@example.com for help")

    // Find first match
    fmt.Printf("Found: %s\n", re.Find(text))

    // Check if matches (zero allocation)
    if re.MatchString("test@email.com") {
        fmt.Println("Valid email format")
    }
}

Performance

Cross-language benchmarks on 6MB input, AMD EPYC (source):

Pattern                     Go stdlib   coregex   Rust regex   vs stdlib   vs Rust
Literal alternation         554 ms      4.5 ms    0.72 ms      122x        6.2x slower
Multi-literal               1572 ms     12.4 ms   5.5 ms       126x        2.2x slower
Inner .*keyword.*           238 ms      0.27 ms   0.33 ms      881x        1.2x faster
Suffix .*\.txt              239 ms      1.9 ms    1.2 ms       125x        1.5x slower
Multiline (?m)^/.*\.php     102 ms      0.34 ms   0.75 ms      299x        2.2x faster
Email validation            257 ms      0.46 ms   0.31 ms      557x        1.4x slower
URL extraction              256 ms      0.62 ms   0.37 ms      413x        1.6x slower
IP address                  494 ms      0.72 ms   13.5 ms      685x        18.8x faster
Version \d+.\d+.\d+         164 ms      0.62 ms   0.79 ms      263x        1.2x faster
Char class [\w]+            478 ms      42.1 ms   56.4 ms      11x         1.3x faster
Word repeat (\w{2,8})+      690 ms      180 ms    54.7 ms      3x          3.2x slower

Where coregex excels:

  • Multiline patterns ((?m)^/.*\.php) — 2.2x faster than Rust, 299x vs stdlib
  • IP/phone patterns (\d+\.\d+\.\d+\.\d+) — SIMD digit prefilter skips non-digit regions
  • Suffix patterns (.*\.log, .*\.txt) — reverse search optimization (1000x+)
  • Inner literals (.*error.*, .*@example\.com) — bidirectional DFA (900x+)
  • Multi-pattern (foo|bar|baz|...) — Slim Teddy (≤32), Fat Teddy (33-64), or Aho-Corasick (>64)
  • Anchored alternations (^(\d+|UUID|hex32)) — O(1) branch dispatch (5-20x)
  • Concatenated char classes ([a-zA-Z]+[0-9]+) — DFA with byte classes (5-7x)
  • Zero-alloc iterators (AllIndex, AppendAllIndex) — 0 heap allocs, up to 30% faster than FindAll. Email pattern faster than Rust with AppendAllIndex.

Features

Engine Selection

coregex automatically selects the optimal engine:

Strategy                  Pattern Type                  Speedup
AnchoredLiteral           ^prefix.*suffix$              32-133x
MultilineReverseSuffix    (?m)^/.*\.php                 100-552x
ReverseInner              .*keyword.*                   100-900x
ReverseSuffix             .*\.txt                       100-1100x
BranchDispatch            ^(\d+|UUID|hex32)             5-20x
CompositeSequenceDFA      [a-zA-Z]+[0-9]+               5-7x
LazyDFA                   IP, complex patterns          10-150x
AhoCorasick               a|b|c|...|z (>64 patterns)    75-113x
CharClassSearcher         [\w]+, \d+                    4-25x
Slim Teddy                foo|bar|baz (2-32 patterns)   15-240x
Fat Teddy                 33-64 patterns                60-73x
OnePass                   Anchored captures             10x
BoundedBacktracker        Small patterns                2-5x
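
As a rough illustration of what "selecting a strategy from the pattern" can mean, the sketch below classifies a pattern by its literal prefix using the stdlib `(*regexp.Regexp).LiteralPrefix` method. The function `pickStrategy` and the three strategy names it returns are hypothetical and far coarser than the table above; coregex's real selection logic is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// pickStrategy is a hypothetical, heavily simplified selector. It looks only
// at the literal prefix that stdlib regexp extracts from the pattern.
func pickStrategy(pattern string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	prefix, complete := re.LiteralPrefix()
	switch {
	case complete:
		// The whole pattern is one literal: substring search is enough.
		return "literal-search", nil
	case prefix != "":
		// Every match starts with a known literal: search for it first.
		return "prefix-prefilter+engine", nil
	default:
		// Nothing to exploit: fall back to a general engine.
		return "general-engine", nil
	}
}

func main() {
	for _, p := range []string{`foo`, `foo\d+`, `\d+`} {
		s, err := pickStrategy(p)
		fmt.Println(p, s, err)
	}
}
```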

API Compatibility

Drop-in replacement for regexp.Regexp:

// stdlib
re := regexp.MustCompile(pattern)

// coregex — same API
re := coregex.MustCompile(pattern)

Supported methods:

  • Match, MatchString, MatchReader
  • Find, FindString, FindAll, FindAllString
  • FindIndex, FindStringIndex, FindAllIndex
  • FindSubmatch, FindStringSubmatch, FindAllSubmatch
  • ReplaceAll, ReplaceAllString, ReplaceAllFunc
  • Split, SubexpNames, NumSubexp
  • Longest, Copy, String

Zero-Allocation APIs

// Zero allocations — boolean match
matched := re.IsMatch(text)

// Zero allocations — single match indices
start, end, found := re.FindIndices(text)

// Zero allocations — iterator over all matches (Go 1.23+)
for m := range re.AllIndex(data) {
    fmt.Printf("match at [%d, %d]\n", m[0], m[1])
}

// Zero allocations — match content iterator
for s := range re.AllString(text) {
    fmt.Println(s)
}

// Buffer-reuse — append to caller's slice (strconv.Append* pattern)
var buf [][2]int
for _, chunk := range chunks {
    buf = re.AppendAllIndex(buf[:0], chunk, -1)
    process(buf)
}

Configuration

config := coregex.DefaultConfig()
config.DFAMaxStates = 10000      // Limit DFA cache
config.EnablePrefilter = true    // SIMD acceleration

re, err := coregex.CompileWithConfig(pattern, config)

Thread Safety

A compiled *Regexp is safe for concurrent use by multiple goroutines:

re := coregex.MustCompile(`\d+`)

// Safe: multiple goroutines sharing one compiled pattern
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        re.FindString("test 123 data")  // thread-safe
    }()
}
wg.Wait()

Internally, coregex uses sync.Pool (the same pattern as Go's stdlib regexp) to manage per-search state.

Syntax Support

Uses Go's regexp/syntax parser:

Feature             Support
Character classes   [a-z], \d, \w, \s
Quantifiers         *, +, ?, {n,m}
Anchors             ^, $, \b, \B
Groups              (...), (?:...), (?P<name>...)
Unicode             \p{L}, \P{N}
Flags               (?i), (?m), (?s)
Backreferences      Not supported (O(n) guarantee)
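
Because the syntax comes from Go's regexp/syntax parser, its behavior can be explored with the stdlib directly. The sketch below parses a pattern with the Perl flag set (the one regexp.Compile uses) and shows that named groups are accepted while a backreference is rejected at parse time. The helper name `captureNames` is illustrative.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// captureNames parses pattern with Perl-style flags and returns the capture
// group names. Index 0 is the whole match; unnamed groups are "".
func captureNames(pattern string) ([]string, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, err
	}
	return re.CapNames(), nil
}

func main() {
	names, err := captureNames(`(?P<year>\d{4})-(\d{2})`)
	fmt.Println(names, err)

	// Backreferences are not part of the syntax, so parsing fails.
	_, err = captureNames(`(a)\1`)
	fmt.Println(err)
}
```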

Architecture

Pattern → Parse → NFA → Literal Extract → Strategy Select
                                               ↓
                  ┌────────────────────────────────────────────┐
                  │ Engines (17 strategies):                   │
                  │  LazyDFA, PikeVM, OnePass,                 │
                  │  BoundedBacktracker, ReverseAnchored,      │
                  │  ReverseInner, ReverseSuffix,              │
                  │  ReverseSuffixSet, MultilineReverseSuffix, │
                  │  AnchoredLiteral, CharClassSearcher,       │
                  │  Teddy, DigitPrefilter, AhoCorasick,       │
                  │  CompositeSearcher, BranchDispatch, Both   │
                  └────────────────────────────────────────────┘
                                               ↓
Input → Prefilter (SIMD) → Engine → Match Result

For detailed architecture documentation, see docs/ARCHITECTURE.md. For optimization details, see docs/OPTIMIZATIONS.md.

SIMD Primitives (AMD64):

  • memchr — single byte search (AVX2)
  • memmem — substring search (SSSE3)
  • Slim Teddy — multi-pattern search, 2-32 patterns (SSSE3, 9+ GB/s)
  • Fat Teddy — multi-pattern search, 33-64 patterns (AVX2, 9+ GB/s)

Pure Go fallback on other architectures.
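
The prefilter stage can be approximated with stdlib calls: bytes.IndexByte plays the role of memchr and bytes.Index the role of memmem. The sketch below is an assumption-laden illustration rather than coregex code. It relies on the observation that every match of the Quick Start email pattern must contain '@', so inputs without that byte can be rejected before any regex engine runs. The helper name `findEmail` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

var emailRe = regexp.MustCompile(`\w+@\w+\.\w+`)

// findEmail returns the first email-like match in text, or nil.
func findEmail(text []byte) []byte {
	// Prefilter: a required byte is missing, so no match is possible.
	if bytes.IndexByte(text, '@') < 0 {
		return nil
	}
	// Candidate present: let the full engine confirm and find the bounds.
	return emailRe.Find(text)
}

func main() {
	fmt.Printf("%q\n", findEmail([]byte("Contact support@example.com for help")))
	fmt.Printf("%q\n", findEmail([]byte("no address in this line")))
}
```

A real prefilter would also use the candidate position to narrow where the engine starts; this version only shows the fast rejection path.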

Battle-Tested

coregex was tested in GoAWK. This real-world testing uncovered 15+ edge cases that synthetic benchmarks missed.

Powered by coregex: uawk

uawk is a modern AWK interpreter built on coregex:

Benchmark (10MB)     GoAWK    uawk     Speedup
Regex alternation    1.85s    97ms     19x
IP matching          290ms    99ms     2.9x
General regex        320ms    100ms    3.2x

go install github.com/kolkov/uawk/cmd/uawk@latest
uawk '/error/ { print $0 }' server.log

We need more testers! If you have a project using regexp, try coregex and report issues.

Comparison

                  coregex           stdlib      regexp2
Performance       3-3000x faster    Baseline    Slower
SIMD              AVX2/SSSE3        No          No
O(n) guarantee    Yes               Yes         No
Backreferences    No                No          Yes
API               Drop-in           -           Different

Use coregex for performance-critical code that needs an O(n) guarantee. Use stdlib for simple cases where performance doesn't matter. Use regexp2 if you need backreferences and can accept exponential worst-case time.

License

MIT — see LICENSE.


Status: Pre-1.0 (API may change). Ready for testing and feedback.
