diff --git a/README.md b/README.md index c4751e3..72e792a 100644 --- a/README.md +++ b/README.md @@ -5,9 +5,9 @@ # Contact: qubitium@modelcloud.ai, x.com/qubitium --> -# PyPcre (Python Pcre2 Binding) +# PyPcre (Python PCRE2 Binding) -Modern `nogil` Python bindings for the Pcre2 library with `stdlib.re` api compatibility. +Modern `nogil` Python bindings for the PCRE2 library with `stdlib.re` API compatibility.

GitHub release @@ -20,23 +20,24 @@ Modern `nogil` Python bindings for the Pcre2 library with `stdlib.re` api compat ## Latest News -* 03/21/2026 [0.2.14](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.4): Python 3.14 compat -* 03/02/2026 [0.2.11](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.11): Auto-detect `Visual Studio` for `Windows` env during install/compile. -* 02/24/2026 [0.2.10](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.10): Allow VisualStudio (VS) compiler version check override via env var. -* 12/15/2025 [0.2.8](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.8): Fixed multi-arch Linux os compatibility where both x86_64 and i386 libs of pcre2 are installed. -* 10/20/2025 [0.2.4](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.4): Removed dependency on system having python3-dev packge. python.h will be optimistically downloaded from python.org when needed. -* 10/12/2025 [0.2.3](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.3): 🤗 Full `GIL=0` compliance for Python >= 3.13T. Reduced cache thread contention. Improved performance for all api. Expanded ci testing coverage. FreeBSD, Solaris and Windows compatibility validated. -* 10/09/2025 [0.1.0](https://github.com/ModelCloud/PyPcre/releases/tag/v0.1.0): 🎉 First release. Thread safe, auto JIT, auto pattern caching and optimistic linking to system library for fast install. +* 03/22/2026 [0.2.15](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.15): Python 3.15 `re` compatibility (`prefixmatch`, `NOFLAG`) +* 03/21/2026 [0.2.14](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.14): Python 3.14 compatibility +* 03/02/2026 [0.2.11](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.11): Auto-detect `Visual Studio` in Windows environments during install and compile. +* 02/24/2026 [0.2.10](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.10): Allow a `Visual Studio` (VS) compiler version check override via an environment variable. 
+* 12/15/2025 [0.2.8](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.8): Fixed multi-arch Linux OS compatibility when both x86_64 and i386 `pcre2` libraries are installed. +* 10/20/2025 [0.2.4](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.4): Removed the dependency on a system `python3-dev` package. `Python.h` will be downloaded optimistically from python.org when needed. +* 10/12/2025 [0.2.3](https://github.com/ModelCloud/PyPcre/releases/tag/v0.2.3): 🤗 Full `GIL=0` compliance for Python >= 3.13T. Reduced cache thread contention. Improved performance across all APIs. Expanded CI test coverage. FreeBSD, Solaris, and Windows compatibility validated. +* 10/09/2025 [0.1.0](https://github.com/ModelCloud/PyPcre/releases/tag/v0.1.0): 🎉 First release. Thread-safe, with auto JIT, auto pattern caching, and optimistic linking to the system library for fast installs. -## Why PyPcre: +## Why PyPcre -PyPcre is a modern Pcre2 binding designed to be both super fast and thread-safe in the `GIL=0` world. In the old days of global interpreter locks, Python had real threads but mostly fake concurrency (with the exception of some low-level apis and packages). In 2025, Python is moving toward full `GIl=0` design which will unlock true multi-threaded concurrency and finally bring Python in parity with other modern languages. +PyPcre is a modern PCRE2 binding designed to be both fast and thread-safe in a `GIL=0` world. In the era of the global interpreter lock, Python had real threads but often only limited concurrency, aside from a handful of low-level APIs and packages. As Python moves toward a fuller `GIL=0` design, true multi-threaded concurrency becomes practical and brings Python closer to parity with other modern languages. -Many Python regular expression packages will either out-right segfault due to safety under `GIL=0` or suffer sub-optimal performance due to non-threaded design mindset. 
+Many Python regular expression packages either segfault under `GIL=0` or suffer suboptimal performance because they were not designed with threaded execution in mind. -PyPcre is fully ci tested where every single api and Pcre2 flag is tested in a continuous development environment backed by the ModelCloud.AI team. Fuzz (clobber) tests are also performed to catch any memory safety, accuracy, or memory leak regressions. +PyPcre is fully CI-tested. Every API and PCRE2 flag is exercised in a continuous development environment backed by the ModelCloud.AI team. Fuzz (clobber) tests are also run to catch memory safety, accuracy, and memory leak regressions. -Safety first: PyPcre will optimistically link to the os provided `libpcre2` package for maximum safetey since PyPcre will automatically enjoy upstream security patches. You can force full source compile via `PYPCRE_BUILD_FROM_SOURCE=1` env toggle. +For safety, PyPcre preferentially links against the OS-provided `libpcre2` package so it can benefit from upstream security patches. You can force a full source build with the `PYPCRE_BUILD_FROM_SOURCE=1` environment variable. ## Installation @@ -44,16 +45,15 @@ Safety first: PyPcre will optimistically link to the os provided `libpcre2` pack pip install PyPcre ``` -The package prioritizes linking against the `libpcre2-8` shared library in system for fast install and max security protection which gets latest patches from OS. See [Building](#building) for manual build details. +The package prefers linking against the system `libpcre2-8` shared library for fast installs and to inherit security updates from the OS. See [Building](#building) for manual build details. 
-## Platform Support (Validated): +## Platform Support (Validated) -`Linux`, `MacOS`, `Windows`, `WSL`, `FreeBSD` +`Linux`, `macOS`, `Windows`, `WSL`, `FreeBSD` ## Usage - If you already rely on the standard library `re`, migrating is as simple as changing your import: @@ -61,17 +61,13 @@ simple as changing your import: import pcre as re ``` -The module-level entry points (`match`, `search`, `fullmatch`, `findall`, -`finditer`, `split`, `sub`, `subn`, `compile`, `escape`, `purge`) expose the -same call signatures as their `re` counterparts, making existing code work -unchanged. Every standard flag with a PCRE2 equivalent—`IGNORECASE`, -`MULTILINE`, `DOTALL`, `VERBOSE`, `ASCII`, and friends—is supported via the -re-exported constants and the `pcre.Flag` enum. +The high-level API keeps the standard library shape, so most existing `re` +code can move over with little or no rewriting. -### Sample Usage +### Quick start ```python -from pcre import match, search, findall, compile, Flag +from pcre import compile, findall, match, search, Flag if match(r"(?P<word>\\w+)", "hello world"): print("found word") @@ -80,16 +76,59 @@ pattern = compile(rb"\d+", flags=Flag.MULTILINE) numbers = pattern.findall(b"line 1\nline 22") ``` -`pcre` mirrors the core helpers from Python’s standard library `re` module -`prefixmatch`, `match`, `search`, `fullmatch`, `finditer`, `findall`, and `compile` while -exposing PCRE2’s extended flag set through the Pythonic `Flag` enum -(`Flag.CASELESS`, `Flag.MULTILINE`, `Flag.UTF`, ...). +### User-facing API + +- Module helpers: `prefixmatch`, `match`, `search`, `fullmatch`, `finditer`, + `findall`, `split`, `sub`, `subn`, `compile`, `escape`, `purge`, and + `parallel_map`. +- `compile()` returns a `Pattern` object with the familiar matching helpers + plus `split()`, `sub()`, and `subn()`. +- `Pattern` exposes `.pattern`, `.flags`, `.jit`, `.groupindex`, and `.groups` + for introspection. 
+- `Match` objects expose the usual `group()`, `groups()`, `groupdict()`, + `start()`, `end()`, `span()`, and `expand()` methods, along with `.re`, + `.string`, `.pos`, `.endpos`, `.lastindex`, `.lastgroup`, and `.regs`. +- Flags are available through `pcre.Flag` and familiar aliases such as + `IGNORECASE`, `MULTILINE`, `DOTALL`, `VERBOSE`, `ASCII`, `UNICODE`, and + `NOFLAG`. +- Errors are raised as `pcre.PcreError`; `error` and `PatternError` are kept as + compatibility aliases. + +### Common examples + +Compiled patterns: + +```python +from pcre import compile, Flag + +pattern = compile(r"(?P<name>[A-Za-z]+)", flags=Flag.CASELESS) +match = pattern.search("User: alice") +print(match.group("name")) # User +``` + +Substitution: + +```python +from pcre import sub + +result = sub(r"\d+", "#", "room 101") +print(result) # room # +``` + +Bytes: + +```python +from pcre import compile + +pattern = compile(br"\w+") +print(pattern.findall(b"ab cd")) # [b'ab', b'cd'] +``` ### Stdlib `re` compatibility - Module-level helpers and the `Pattern` class follow the same call shapes as the standard library `re` module, including `pos`, `endpos`, and `flags` - behaviour. + behavior. - Python 3.15's `prefixmatch()` alias is available at both the module level and on compiled `Pattern` objects, and `re.NOFLAG` is re-exported as the zero-value compatibility alias. @@ -107,6 +146,7 @@ exposing PCRE2’s extended flag set through the Pythonic `Flag` enum raises a compatibility `ValueError` to prevent silent divergences. - `pcre.escape()` delegates directly to `re.escape` for byte and text patterns so escaping semantics remain identical. +- String patterns enable Unicode behavior by default. Byte patterns do not. 
### `regex` package compatibility @@ -122,143 +162,31 @@ pattern = compile(r"\\U0001F600", flags=Flag.COMPAT_UNICODE_ESCAPE) assert pattern.pattern == r"\\x{0001F600}" ``` -Set the default behaviour globally with `pcre.configure(compat_regex=True)` +Set the default behavior globally with `pcre.configure(compat_regex=True)` so that subsequent calls to `compile()` and the module-level helpers apply the conversion without repeating the flag. -### Automatic pattern caching - -`pcre.compile()` caches the final `Pattern` wrapper for up to 128 -unique `(pattern, flags)` pairs when the pattern object is hashable. By default -the cache is **thread-local**, keeping per-thread LRU stores so workers do not -contend with one another. Adjust the capacity with `pcre.set_cache_limit(n)`—pass -`0` to disable caching completely or `None` for an unlimited cache—and check the -current limit with `pcre.get_cache_limit()`. The cache can be emptied at any time -with `pcre.clear_cache()`. - -Applications that prefer the historic global cache can opt back in before any -compilation takes place by setting `PYPCRE_CACHE_PATTERN_GLOBAL=1` in the -environment **before importing** `pcre`. Runtime switching is no longer -supported; altering the value after patterns have been compiled raises -`RuntimeError`. - -### Text versus bytes defaults - -String patterns follow the same defaults as Python’s `re` module, -automatically enabling the `Flag.UTF` and `Flag.UCP` options so Unicode -pattern and character semantics “just work.” Byte patterns remain raw by -default—neither option is activated—so you retain full control over -binary-oriented matching. Explicitly set `Flag.NO_UTF`/`Flag.NO_UCP` if you -need to opt out for strings, or add the UTF/UCP flags yourself when compiling -bytes. - -### Working with compiled patterns - -- `compile()` accepts either a pattern literal or an existing `Pattern` - instance, making it easy to mix compiled objects with the convenience - helpers. 
-- `Pattern.match/search/fullmatch/finditer/findall` accept optional - `pos`, `endpos`, and `options` arguments, mirroring the standard library - `re` module while letting you thread PCRE2 execution flags through - individual calls. - -### Threaded execution - -- `pcre.parallel_map()` fans out work across a shared thread pool for - `match`, `search`, `fullmatch`, and `findall`. The helper preserves the - order of the provided subjects and returns the same result objects you’d - normally receive from the `Pattern` methods. -- The threaded backend activates only on machines with at least eight CPU - cores; otherwise execution falls back to the sequential path regardless of - flags or configuration. -- Threading is **opt-in by default** when Python runs without the GIL - (e.g. Python with `-X gil=0` or `PYTHON_GIL=0`). When the GIL is active the default falls - back to sequential execution to avoid needless overhead. -- With auto threading enabled (`configure_threads(enabled=True)`), the pool - is only engaged when at least one subject is larger than the configured - threshold (60 kB by default). Smaller jobs run sequentially to avoid the - cost of thread hand-offs; adjust the boundary via - `configure_threads(threshold=...)`. -- Use `Flag.THREADS` to force threaded execution for a specific pattern or - `Flag.NO_THREADS` to lock it to sequential mode regardless of global - settings. -- `pcre.configure_thread_pool(max_workers=...)` controls the size of the - shared executor (capped to half the available CPUs); call it with - `preload=True` to spin the pool up eagerly, and `shutdown_thread_pool()` - to tear it down manually if needed. - -### Performance considerations - -- **Precompile for hot loops.** The module-level helpers mirror the `re` - API and route through the shared compilation cache, but the extra call - plumbing still adds overhead. 
With a simple pattern like `"fo"`, using - the low-level `pcre_ext_c.Pattern` directly costs ~0.60 µs per call, - whereas the high-level `pcre.match()` helper lands at ~4.4 µs per call - under the same workload. For sustained loops, create a `Pattern` object - once and reuse it. -- **Benchmark toggles.** The extension defaults to the fastest safe - configuration, but you can flip individual knobs back to the legacy - behaviour by setting environment variables *before* importing `pcre`: - - | Env var | Effect (per-call, `pattern.match("fo")`) | - |--------------------------------|------------------------------------------| - | _(baseline)_ | 0.60 µs | - | `PYPCRE_DISABLE_CONTEXT_CACHE=1` | 0.60 µs | - | `PYPCRE_FORCE_JIT_LOCK=1` | 0.60 µs | - | `pcre.match()` helper | 4.43 µs | - - The toggles reintroduce the legacy GIL hand-off, per-call match-context - allocation, and explicit locks so you can quantify the impact of each - optimisation on your workload. Measurements were taken on CPython 3.14 (rc3) - with 200 000 evaluations of `pcre_ext_c.compile("fo").match("foobar")`; absolute - values will vary by platform, but the relative differences are - representative. Leave the variables unset in production to keep the new fast - paths active. - -### JIT Pattern Compilation and Execution - -Pcre2’s JIT compiler is enabled by default for every compiled pattern. The -wrapper exposes two complementary ways to adjust that behaviour: - -- Toggle the global default at runtime with `pcre.configure(jit=False)` to - turn JIT off (call `pcre.configure(jit=True)` to turn it back on). 
-- Override the default per pattern using the Python-only flags `Flag.JIT` - and `Flag.NO_JIT`: - - ```python - from pcre import compile, configure, Flag - - configure(jit=False) # disable JIT globally - baseline = compile(r"expr") # JIT disabled - - fast = compile(r"expr", flags=Flag.JIT) # force-enable for this pattern - slow = compile(r"expr", flags=Flag.NO_JIT) # force-disable for this pattern - ``` - -## Pattern cache -- `pcre.compile()` caches hashable `(pattern, flags)` pairs, keeping up to 128 entries per thread by default. -- Set `PYPCRE_CACHE_PATTERN_GLOBAL=1` before importing `pcre` if you need a shared, process-wide cache instead of isolated thread stores. -- Use `pcre.clear_cache()` when you need to free the active cache proactively. -- Non-hashable pattern objects skip the cache and are compiled each time. - -## Default flags for text patterns -- String patterns enable `Flag.UTF` and `Flag.UCP` automatically so behaviour matches `re`. -- Byte patterns keep both flags disabled; opt in manually if Unicode semantics are desired. -- Explicitly supply `Flag.NO_UTF`/`Flag.NO_UCP` to override the defaults for strings. - -## Additional usage notes -- All top-level helpers (`match`, `search`, `fullmatch`, `finditer`, `findall`) defer to the cached compiler. -- Compiled `Pattern` objects expose `.pattern`, `.flags`, `.jit`, and `.groupindex` for introspection. -- Execution helpers accept `pos`, `endpos`, and `options`, allowing you to thread PCRE2 execution flags per call. - -## Memory allocation -- By default PyPcre uses CPython's `PyMem` allocator. -- Override the allocator explicitly by setting `PYPCRE_ALLOCATOR` to one of - `pymem`, `malloc`, `jemalloc`, or `tcmalloc` before importing the module. The - optional allocators are still loaded with `dlopen`, so no additional link - flags are required when they are absent. -- Call `pcre_ext_c.get_allocator()` to inspect which backend is active at - runtime. 
+### Common issues + +- Unsupported stdlib flags such as `re.DEBUG`, `re.LOCALE`, and `re.ASCII` + raise `ValueError`. If you want ASCII-style behavior, use `pcre.ASCII` or + `Flag.NO_UTF | Flag.NO_UCP`. +- Replacement types must match the subject type: text patterns use `str` + replacements, while byte patterns use bytes-like replacements. +- If you are porting patterns from the third-party `regex` package, check + `\u` and `\U` escapes first. That is the most common compatibility gap. +- Most users do not need to tune caching, JIT, or threading. The defaults are + intended to work well out of the box. + +### Optional runtime controls + +- `pcre.configure(jit=False)` disables JIT globally. `Flag.JIT` and + `Flag.NO_JIT` let you override that per pattern. +- `pcre.set_cache_limit()`, `pcre.get_cache_limit()`, and `pcre.clear_cache()` + control the high-level compile cache. +- `pcre.configure_threads()`, `pcre.configure_thread_pool()`, + `shutdown_thread_pool()`, `Flag.THREADS`, and `Flag.NO_THREADS` are available + if you want to opt into or restrict threaded execution. ## Building @@ -267,7 +195,7 @@ variant). Install the development headers for your platform before building, for example `apt install libpcre2-dev` on Debian/Ubuntu, `dnf install pcre2-devel` on Fedora/RHEL derivatives, or `brew install pcre2` on macOS. 
-If the headers or library live in a non-standard location you can export one +If the headers or library live in a non-standard location, you can export one or more of the following environment variables prior to invoking the build (`pip install .`, `python -m build`, etc.): @@ -275,16 +203,19 @@ or more of the following environment variables prior to invoking the build - `PYPCRE_INCLUDE_DIR` - `PYPCRE_LIBRARY_DIR` - `PYPCRE_LIBRARY_PATH` *(pathsep-separated directories or explicit library files to - prioritise when resolving `libpcre2-8`)* + prioritize when resolving `libpcre2-8`)* - `PYPCRE_LIBRARIES` - `PYPCRE_CFLAGS` - `PYPCRE_LDFLAGS` -When `pkg-config` is available the build will automatically pick up the +If you would rather force a source build, set `PYPCRE_BUILD_FROM_SOURCE=1` +before installing. + +When `pkg-config` is available, the build automatically picks up the required include and link flags via `pkg-config --cflags/--libs libpcre2-8`. Without `pkg-config`, the build script scans common installation prefixes for Linux distributions (Debian, Ubuntu, Fedora/RHEL/CentOS, openSUSE, Alpine), -FreeBSD and macOS (including Homebrew) to locate the headers and +FreeBSD, and macOS (including Homebrew) to locate the headers and libraries. If your system ships `libpcre2-8` under `/usr` but you also maintain a