real-regex 2026.6.0__tar.gz

@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 René Chenard
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,7 @@
+ # Make the sdist self-contained: carry the C++ headers the binding compiles
+ # against and the binding source, plus the project docs.
+ graft include
+ recursive-include python/src *.cpp
+ recursive-include python/tests *.py
+ include python/README.md
+ include README.md LICENSE ARCHITECTURE.md
@@ -0,0 +1,206 @@
+ Metadata-Version: 2.4
+ Name: real-regex
+ Version: 2026.6.0
+ Summary: REAL — linear-time (ReDoS-safe) regex engine with an re-compatible API
+ Author-email: René Chenard <rene.chenard.1@ulaval.ca>
+ License-Expression: MIT
+ Project-URL: Homepage, https://github.com/RECHE23/real-regex
+ Project-URL: Repository, https://github.com/RECHE23/real-regex
+ Project-URL: Documentation, https://reche23.github.io/real-regex/
+ Project-URL: Issues, https://github.com/RECHE23/real-regex/issues
+ Keywords: regex,regular-expression,redos,linear-time,re
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: C++
+ Classifier: Topic :: Text Processing
+ Classifier: Topic :: Software Development :: Libraries
+ Classifier: Operating System :: OS Independent
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Dynamic: license-file
+
+ # REAL
+
+ **Regular Expression Algorithmic Library** — a header-only C++20 regex engine,
+ constexpr from end to end, with an `re`-compatible Python binding.
+
+ - **Linear time, always.** The engine is a Pike VM (Thompson NFA simulation):
+ no backtracking, ReDoS-safe by construction.
+ - **Constexpr-friendly.** Patterns known at compile time are parsed, compiled
+ and matched at compile time.
+ - **Minimal memory.** Static (sizes fixed at compile time, zero allocation),
+ dynamic (storage sized exactly once at pattern compilation), or hybrid
+ (compile-time pattern, runtime text, zero heap allocation).
+ - **Zero dependencies.** One include.
+
+ Unsupported syntax is rejected with `real::regex_error` rather than silently
+ diverging. Deferred (and rejected): lookarounds, backreferences,
+ atomic/possessive groups, Unicode property classes, Unicode case folding,
+ `re.X`, `pos`/`endpos`. The planned next step is a lazy DFA for the
+ dense-candidate cases where `re` is still ahead.
+
+ Over the benchmark suite (`make bench-python`), REAL is **1.98x faster than
+ Python's `re`** at the geometric mean, with identical outputs; the `(a+)+b`
+ ReDoS case completes in microseconds where `re` takes over a second.
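
To make the linear-time claim concrete, here is a minimal Python sketch of a Thompson-style thread-set simulation for `(a+)+b`. It is not REAL's implementation: the instruction names (`CHAR`, `SPLIT`, `MATCH`), the hand-assembled `PROGRAM`, and the `closure` and `fullmatch` helpers are illustrative assumptions, and captures and match priority are left out. Each input character is processed once against a set of at most `len(PROGRAM)` threads, so the work is bounded by text length times program size however the quantifiers are nested.

```python
# Hypothetical toy VM for illustration, not REAL's actual instruction set.
CHAR, SPLIT, MATCH = "char", "split", "match"

# Hand-assembled program for (a+)+b, without captures.
PROGRAM = [
    (CHAR, "a"),    # 0: consume 'a'
    (SPLIT, 0, 2),  # 1: inner a+ : loop back or fall through
    (SPLIT, 0, 3),  # 2: outer (...)+ : repeat the group or continue
    (CHAR, "b"),    # 3: consume 'b'
    (MATCH,),       # 4: accept
]


def closure(program, pcs):
    """Follow SPLIT edges; each instruction is visited at most once."""
    seen = set()
    stack = list(pcs)
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue
        seen.add(pc)
        op = program[pc]
        if op[0] == SPLIT:
            stack.extend(op[1:])
    return seen


def fullmatch(program, text):
    """Return True if the whole text is accepted by the program."""
    threads = closure(program, [0])
    for ch in text:
        advanced = [
            pc + 1
            for pc in threads
            if program[pc][0] == CHAR and program[pc][1] == ch
        ]
        threads = closure(program, advanced)
        if not threads:
            return False  # every thread died; no match is possible
    return any(program[pc][0] == MATCH for pc in threads)
```

`fullmatch(PROGRAM, "a" * 20000)` rejects after a single pass over the input, which is the case where a backtracking matcher tries exponentially many ways to split the run of `a`s.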
+
+ ## Supported syntax
+
+ | Syntax | Meaning |
+ |---|---|
+ | `abc` | literal bytes (UTF-8 patterns match their UTF-8 bytes) |
+ | `\.` `\*` `\\` … | escaped metacharacter, matched literally |
+ | `.` | any codepoint except `\n` |
+ | `[abc]` `[a-z]` `[^abc]` | character class (members must be ASCII); `[^…]` matches any codepoint outside the set |
+ | `\d \D \w \W \s \S` | digit / word / space classes (ASCII sets, like Python's `re.ASCII`) |
+ | `\n \t \r \f \v \a \0` `\xHH` | control and hex escapes |
+ | `x*` `x+` `x?` | quantifiers (greedy; append `?` for lazy) |
+ | `x{n}` `x{n,}` `x{,m}` `x{n,m}` | counted repetition (greedy or lazy; counts capped at 1000) |
+ | `a\|b` | alternation, leftmost branch preferred |
+ | `(…)` `(?:…)` | capturing / non-capturing group |
+ | `(?P<name>…)` `(?<name>…)` | named capturing group (Python and .NET styles) |
+ | `^` `$` | line/text anchors (Python semantics: `$` also matches before a final `\n`) |
+ | `\A` `\Z` | strict text start / end |
+ | `\b` `\B` | word boundary / non-boundary (ASCII word characters) |
+ | `(?ims)` prefix | global flags: `i` case-insensitive (ASCII), `m` multiline, `s` dotall — also `real::flags` on the constructor |
+
+ **Unicode model:** matching is UTF-8 byte-based, but every construct consumes
+ whole codepoints (multi-byte sequences compile to byte-level alternatives), so
+ match boundaries never split a character. Class members and the `\d \w \s`
+ sets are ASCII by design; `[^…]`, `\D \W \S` and `.` do match non-ASCII
+ codepoints.
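
Several rows above define behaviour by reference to Python's `re`. The sketch below uses only the standard-library `re` module, not REAL, to show that reference behaviour: `$` matching before a final newline, `\Z` matching only at the very end, the ASCII-only `\d` set under `re.ASCII`, and a negated ASCII class accepting a non-ASCII codepoint. The variable names are illustrative.

```python
import re

# `$` matches at the very end and also just before a final "\n".
dollar = re.search(r"abc$", "abc\n")

# `\Z` is strict: it matches only at the very end of the text.
strict = re.search(r"abc\Z", "abc\n")

# Under re.ASCII, \d covers only 0-9; the default str mode also accepts
# other Unicode decimal digits such as U+0663 ARABIC-INDIC DIGIT THREE.
ascii_digit = re.fullmatch(r"\d", "\u0663", flags=re.ASCII)
unicode_digit = re.fullmatch(r"\d", "\u0663")

# A negated ASCII class still matches a non-ASCII codepoint (U+00E9).
negated = re.fullmatch(r"[^abc]", "\u00e9")
```

Here `dollar` and `negated` are match objects, while `strict` and `ascii_digit` are `None`.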
+
+ **Divergence from Python:** when a *nullable* loop body ends with an empty
+ iteration — e.g. `(a*)*` on `"aa"` — Python captures that final empty
+ iteration (`''`); REAL, like Perl/PCRE, keeps the last non-empty one (`"aa"`).
+ Group 0 is identical either way.
+
+ ## C++ API
+
+ ```cpp
+ #include <real/real.hpp>
+
+ real::regex rx("hello"); // runtime pattern, storage sized exactly once
+ rx.match("hello world"); // anchored at the start (Python re.match)
+ rx.fullmatch("hello"); // whole text (Python re.fullmatch)
+ rx.search("say hello"); // leftmost match anywhere (Python re.search)
+ ```
+
+ `match`/`fullmatch`/`search` return a `real::match_result`: `matched()`,
+ `operator bool`, `start(g)`, `end(g)`, `m[g]` (a `std::string_view` into the
+ searched text, which must outlive the result), and the same accessors by group
+ name (`m["year"]`, `group_index`).
+
+ ```cpp
+ for (auto& m : rx.find_iter(text)) { … } // lazy, Python finditer rules
+ rx.find_all(text); // eager vector<match_result>
+ rx.replace(text, "$2:$1"); // $&, $1…, ${name}, $$ — re.sub
+ rx.replace(text, "#", 2); // count limit
+ rx.split(text); // Python re.split, with groups
+ ```
+
+ Empty matches follow Python's rules: they are yielded (even right after a
+ non-empty match) and the scan then advances one whole codepoint.
+ `find_iter`/`find_all` cannot be called on a temporary regex, and
+ `match`/`search`/`split` cannot take a temporary `std::string`.
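
The Python rule for empty matches referred to above can be observed with the standard-library `re` module (Python 3.7 or later). This shows the reference behaviour only; it does not call REAL.

```python
import re

# Empty matches are reported at every position, including directly
# after the non-empty match "x" (the position between "x" and "d").
spans = [m.span() for m in re.finditer(r"x*", "abxd")]

# The same rule drives substitution: one replacement per reported match.
replaced = re.sub(r"x*", "-", "abxd")
```

The scan yields an empty match at index 3, immediately after consuming `x` at `(2, 3)`, and then moves on by one character.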
+
+ ### Three memory modes
+
+ ```cpp
+ // Static: pattern compiled at compile time into exactly-sized constexpr
+ // arrays; an invalid pattern is a *compile error*.
+ constexpr real::static_regex<"(\\d{4})-(\\d{2})"> date;
+ static_assert(date.search("on 2026-06-10")[1] == "2026"); // constexpr match
+
+ // Hybrid: compile-time pattern, runtime text — matching performs zero heap
+ // allocations (state lives on the stack).
+ date.search(runtime_text);
+
+ // Dynamic: everything at runtime; the program is sized exactly once at
+ // compilation, match state is per-run scratch.
+ real::regex rx2(user_pattern, real::flags::icase);
+ ```
+
+ The pure library is standard C++20 with no platform dependencies. `real::real`
+ is the CMake `FetchContent`/`find_package` target.
+
+ ## Python binding
+
+ An `re`-compatible module backed by the C++ engine (CPython Limited API, one
+ abi3 extension, zero dependencies):
+
+ ```python
+ import real
+
+ real.search(r"(?P<y>\d{4})-(?P<m>\d{2})", "on 2026-06-10").groupdict()
+ real.compile(r"\w+").findall(text) # findall/finditer/split/sub/subn
+ real.sub(r"\s+", " ", text) # templates: \1, \g<name>, callables
+ real.compile(rb"[^;]+").findall(raw) # bytes patterns: raw-byte semantics
+ ```
+
+ `str` matching is UTF-8 with character indices in `start/end/span`; `bytes`
+ patterns get `re`'s exact raw-byte semantics. Unsupported `re` features raise
+ `real.error` at compile time. Build with `make python && make python-test`.
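
Matching on UTF-8 bytes while reporting character indices for `str` implies translating byte offsets into character positions. A minimal sketch of that translation follows, assuming a hypothetical helper `byte_to_char_index`; it is not part of the `real` module, and how the binding actually performs the mapping is not shown in this package description.

```python
def byte_to_char_index(text: str, byte_offset: int) -> int:
    """Map an offset into text.encode('utf-8') to a str index.

    Hypothetical helper for illustration. It assumes byte_offset lies on
    a codepoint boundary, which whole-codepoint matching would guarantee.
    """
    encoded = text.encode("utf-8")
    return len(encoded[:byte_offset].decode("utf-8"))


text = "caf\u00e9 2026"                   # U+00E9 takes two bytes in UTF-8
encoded = text.encode("utf-8")
byte_start = encoded.index(b"2026")       # offset measured in bytes
char_start = byte_to_char_index(text, byte_start)
```

The two offsets differ by one here because the single non-ASCII character before the match occupies two bytes.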
+
+ Once published: `pip install real-regex` (one `cp310-abi3` wheel per platform
+ serves CPython 3.10+; the self-contained sdist compiles where no wheel
+ matches).
+
+ **Release process (manual + tag-driven, for reliability):**
+ - Use calendar versioning `YYYY.M.PATCH` with monthly patch reset
+ (e.g. `2026.6.0` for the first release of June 2026, then `2026.6.1` etc.;
+ next month starts at `.0`).
+ - Update the version in **both** places:
+   - `pyproject.toml` (the one used by the release guard and PyPI)
+   - `python/real/__init__.py` (the runtime `__version__` exposed to users)
+ - Commit the change (optionally include `[release]` in the message as a
+ human signal or for future automation).
+ - `git tag v2026.6.0`
+ - `git push origin main v2026.6.0`
+
+ The tag triggers `.github/workflows/release.yml`:
+ - `check-version` ensures the tag exactly matches the version in `pyproject.toml`.
+ - It builds abi3 wheels with `cibuildwheel` (Linux/macOS/Windows) + sdist.
+ - Publishes to PyPI using Trusted Publishing (OIDC) — no secrets.
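
The `check-version` guard described above amounts to a format check plus a string comparison. A minimal sketch follows, assuming hypothetical helpers `is_calver` and `tag_matches_version` and an illustrative pattern for `YYYY.M.PATCH` with an unpadded month; the actual workflow script is not part of this package, so none of these names come from it.

```python
import re

# YYYY.M.PATCH: four-digit year, month 1-12 without zero padding,
# non-negative patch number. An illustrative pattern, not the project's.
CALVER = re.compile(r"(\d{4})\.(1[0-2]|[1-9])\.(0|[1-9]\d*)")


def is_calver(version: str) -> bool:
    """Check that a version string follows YYYY.M.PATCH."""
    return CALVER.fullmatch(version) is not None


def tag_matches_version(tag: str, version: str) -> bool:
    """True when the tag is exactly 'v' + version and version is CalVer."""
    return is_calver(version) and tag == f"v{version}"
```

With these definitions, `tag_matches_version("v2026.6.0", "2026.6.0")` holds, while a tag pushed without bumping the version, such as `v2026.6.1` against `2026.6.0`, is rejected.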
+
+ We deliberately kept the process simple and explicit (no auto-bump on
+ merge yet) to avoid accidental publishes and keep the history auditable.
+ The tag-based guard + OIDC is the reliable core.
+
+ ## Development
+
+ ```bash
+ make help # list all targets
+ make test # build and run the test suite
+ make coverage # line coverage report (LLVM)
+ make sanitize # tests under ASan + UBSan
+ make lint # clang-tidy
+ make misra # MISRA C++:2023-oriented analysis
+ make fuzz # libFuzzer robustness fuzzing (clang)
+ make doc # API reference (Doxygen)
+ ```
+
+ The API reference is published at <https://reche23.github.io/real-regex/>.
+
+ Select the compiler with `make test CXX=g++-14`. Every behaviour is tested at
+ runtime and in constexpr (`static_assert`) under Clang and GCC; an equivalence
+ suite checks the prefilter and fast paths never change results; a parity suite
+ and a randomized differential fuzzer compare Python outputs against `re`.
+
+ CI exercises:
+
+ | Platform | Architecture | Compiler |
+ |----------|--------------|----------|
+ | Linux | x86-64 | GCC, Clang |
+ | Linux | AArch64 | GCC |
+ | macOS | Apple Silicon (arm64) | Apple Clang |
+ | Windows | x86-64 | MSVC |
+
+ IntelLLVM (`icpx`), x86-64 macOS and the BSDs share the Clang flag set and are
+ supported by the build configuration but not exercised in CI.
+
+ ## License
+
+ MIT — Copyright (c) 2026 René Chenard
@@ -0,0 +1,182 @@
+ # REAL
+
+ **Regular Expression Algorithmic Library** — a header-only C++20 regex engine,
+ constexpr from end to end, with an `re`-compatible Python binding.
+
+ - **Linear time, always.** The engine is a Pike VM (Thompson NFA simulation):
+ no backtracking, ReDoS-safe by construction.
+ - **Constexpr-friendly.** Patterns known at compile time are parsed, compiled
+ and matched at compile time.
+ - **Minimal memory.** Static (sizes fixed at compile time, zero allocation),
+ dynamic (storage sized exactly once at pattern compilation), or hybrid
+ (compile-time pattern, runtime text, zero heap allocation).
+ - **Zero dependencies.** One include.
+
+ Unsupported syntax is rejected with `real::regex_error` rather than silently
+ diverging. Deferred (and rejected): lookarounds, backreferences,
+ atomic/possessive groups, Unicode property classes, Unicode case folding,
+ `re.X`, `pos`/`endpos`. The planned next step is a lazy DFA for the
+ dense-candidate cases where `re` is still ahead.
+
+ Over the benchmark suite (`make bench-python`), REAL is **1.98x faster than
+ Python's `re`** at the geometric mean, with identical outputs; the `(a+)+b`
+ ReDoS case completes in microseconds where `re` takes over a second.
+
+ ## Supported syntax
+
+ | Syntax | Meaning |
+ |---|---|
+ | `abc` | literal bytes (UTF-8 patterns match their UTF-8 bytes) |
+ | `\.` `\*` `\\` … | escaped metacharacter, matched literally |
+ | `.` | any codepoint except `\n` |
+ | `[abc]` `[a-z]` `[^abc]` | character class (members must be ASCII); `[^…]` matches any codepoint outside the set |
+ | `\d \D \w \W \s \S` | digit / word / space classes (ASCII sets, like Python's `re.ASCII`) |
+ | `\n \t \r \f \v \a \0` `\xHH` | control and hex escapes |
+ | `x*` `x+` `x?` | quantifiers (greedy; append `?` for lazy) |
+ | `x{n}` `x{n,}` `x{,m}` `x{n,m}` | counted repetition (greedy or lazy; counts capped at 1000) |
+ | `a\|b` | alternation, leftmost branch preferred |
+ | `(…)` `(?:…)` | capturing / non-capturing group |
+ | `(?P<name>…)` `(?<name>…)` | named capturing group (Python and .NET styles) |
+ | `^` `$` | line/text anchors (Python semantics: `$` also matches before a final `\n`) |
+ | `\A` `\Z` | strict text start / end |
+ | `\b` `\B` | word boundary / non-boundary (ASCII word characters) |
+ | `(?ims)` prefix | global flags: `i` case-insensitive (ASCII), `m` multiline, `s` dotall — also `real::flags` on the constructor |
+
+ **Unicode model:** matching is UTF-8 byte-based, but every construct consumes
+ whole codepoints (multi-byte sequences compile to byte-level alternatives), so
+ match boundaries never split a character. Class members and the `\d \w \s`
+ sets are ASCII by design; `[^…]`, `\D \W \S` and `.` do match non-ASCII
+ codepoints.
+
+ **Divergence from Python:** when a *nullable* loop body ends with an empty
+ iteration — e.g. `(a*)*` on `"aa"` — Python captures that final empty
+ iteration (`''`); REAL, like Perl/PCRE, keeps the last non-empty one (`"aa"`).
+ Group 0 is identical either way.
+
+ ## C++ API
+
+ ```cpp
+ #include <real/real.hpp>
+
+ real::regex rx("hello"); // runtime pattern, storage sized exactly once
+ rx.match("hello world"); // anchored at the start (Python re.match)
+ rx.fullmatch("hello"); // whole text (Python re.fullmatch)
+ rx.search("say hello"); // leftmost match anywhere (Python re.search)
+ ```
+
+ `match`/`fullmatch`/`search` return a `real::match_result`: `matched()`,
+ `operator bool`, `start(g)`, `end(g)`, `m[g]` (a `std::string_view` into the
+ searched text, which must outlive the result), and the same accessors by group
+ name (`m["year"]`, `group_index`).
+
+ ```cpp
+ for (auto& m : rx.find_iter(text)) { … } // lazy, Python finditer rules
+ rx.find_all(text); // eager vector<match_result>
+ rx.replace(text, "$2:$1"); // $&, $1…, ${name}, $$ — re.sub
+ rx.replace(text, "#", 2); // count limit
+ rx.split(text); // Python re.split, with groups
+ ```
+
+ Empty matches follow Python's rules: they are yielded (even right after a
+ non-empty match) and the scan then advances one whole codepoint.
+ `find_iter`/`find_all` cannot be called on a temporary regex, and
+ `match`/`search`/`split` cannot take a temporary `std::string`.
+
+ ### Three memory modes
+
+ ```cpp
+ // Static: pattern compiled at compile time into exactly-sized constexpr
+ // arrays; an invalid pattern is a *compile error*.
+ constexpr real::static_regex<"(\\d{4})-(\\d{2})"> date;
+ static_assert(date.search("on 2026-06-10")[1] == "2026"); // constexpr match
+
+ // Hybrid: compile-time pattern, runtime text — matching performs zero heap
+ // allocations (state lives on the stack).
+ date.search(runtime_text);
+
+ // Dynamic: everything at runtime; the program is sized exactly once at
+ // compilation, match state is per-run scratch.
+ real::regex rx2(user_pattern, real::flags::icase);
+ ```
+
+ The pure library is standard C++20 with no platform dependencies. `real::real`
+ is the CMake `FetchContent`/`find_package` target.
+
+ ## Python binding
+
+ An `re`-compatible module backed by the C++ engine (CPython Limited API, one
+ abi3 extension, zero dependencies):
+
+ ```python
+ import real
+
+ real.search(r"(?P<y>\d{4})-(?P<m>\d{2})", "on 2026-06-10").groupdict()
+ real.compile(r"\w+").findall(text) # findall/finditer/split/sub/subn
+ real.sub(r"\s+", " ", text) # templates: \1, \g<name>, callables
+ real.compile(rb"[^;]+").findall(raw) # bytes patterns: raw-byte semantics
+ ```
+
+ `str` matching is UTF-8 with character indices in `start/end/span`; `bytes`
+ patterns get `re`'s exact raw-byte semantics. Unsupported `re` features raise
+ `real.error` at compile time. Build with `make python && make python-test`.
+
+ Once published: `pip install real-regex` (one `cp310-abi3` wheel per platform
+ serves CPython 3.10+; the self-contained sdist compiles where no wheel
+ matches).
+
+ **Release process (manual + tag-driven, for reliability):**
+ - Use calendar versioning `YYYY.M.PATCH` with monthly patch reset
+ (e.g. `2026.6.0` for the first release of June 2026, then `2026.6.1` etc.;
+ next month starts at `.0`).
+ - Update the version in **both** places:
+   - `pyproject.toml` (the one used by the release guard and PyPI)
+   - `python/real/__init__.py` (the runtime `__version__` exposed to users)
+ - Commit the change (optionally include `[release]` in the message as a
+ human signal or for future automation).
+ - `git tag v2026.6.0`
+ - `git push origin main v2026.6.0`
+
+ The tag triggers `.github/workflows/release.yml`:
+ - `check-version` ensures the tag exactly matches the version in `pyproject.toml`.
+ - It builds abi3 wheels with `cibuildwheel` (Linux/macOS/Windows) + sdist.
+ - Publishes to PyPI using Trusted Publishing (OIDC) — no secrets.
+
+ We deliberately kept the process simple and explicit (no auto-bump on
+ merge yet) to avoid accidental publishes and keep the history auditable.
+ The tag-based guard + OIDC is the reliable core.
+
+ ## Development
+
+ ```bash
+ make help # list all targets
+ make test # build and run the test suite
+ make coverage # line coverage report (LLVM)
+ make sanitize # tests under ASan + UBSan
+ make lint # clang-tidy
+ make misra # MISRA C++:2023-oriented analysis
+ make fuzz # libFuzzer robustness fuzzing (clang)
+ make doc # API reference (Doxygen)
+ ```
+
+ The API reference is published at <https://reche23.github.io/real-regex/>.
+
+ Select the compiler with `make test CXX=g++-14`. Every behaviour is tested at
+ runtime and in constexpr (`static_assert`) under Clang and GCC; an equivalence
+ suite checks the prefilter and fast paths never change results; a parity suite
+ and a randomized differential fuzzer compare Python outputs against `re`.
+
+ CI exercises:
+
+ | Platform | Architecture | Compiler |
+ |----------|--------------|----------|
+ | Linux | x86-64 | GCC, Clang |
+ | Linux | AArch64 | GCC |
+ | macOS | Apple Silicon (arm64) | Apple Clang |
+ | Windows | x86-64 | MSVC |
+
+ IntelLLVM (`icpx`), x86-64 macOS and the BSDs share the Clang flag set and are
+ supported by the build configuration but not exercised in CI.
+
+ ## License
+
+ MIT — Copyright (c) 2026 René Chenard