scilex 2026.6.0__tar.gz

@@ -0,0 +1,91 @@
1
+ # SciLex — performance baseline
2
+
3
+ A reproducible wall-time baseline for the Python binding, comparing SciLex against
4
+ Python's standard-library `re`. Its purpose is twofold: a **regression tripwire**
5
+ between versions on the same machine, and an honest statement of **where SciLex wins
6
+ and where it does not**.
7
+
8
+ Run it with **`make bench`** (or `python benchmarks/bench.py`). It is informational
9
+ only — it prints a table, is never invoked by `full-local-gate`, and never fails a
10
+ build.
11
+
12
+ ## The honest headline
13
+
14
+ SciLex is **not** built to beat `re` on raw throughput, and it does not. `re` is a
15
+ mature C backtracking engine; SciLex runs REAL's linear-time NFA at every position,
16
+ through the abi3 binding, and builds rich `Token` objects. On benign input that costs
17
+ about 5× more per byte (see B1).
18
+
19
+ What SciLex guarantees instead is **linear time, ReDoS-safe by construction**: no rule
20
+ can make the scanner backtrack catastrophically. On an adversarial (or simply
21
+ unlucky) pattern, `re` can degrade exponentially while SciLex stays linear, and *that* is
22
+ the difference that matters for a lexer fed untrusted or machine-generated input.
23
+
24
+ ### Gains and losses at a glance
25
+
26
+ | input | winner | why |
27
+ | --- | --- | --- |
28
+ | benign token soup | `re` (~5×, see B1) | a mature C backtracking engine; SciLex runs REAL's NFA per position, through the binding, and builds rich `Token`s |
29
+ | adversarial / ReDoS (B2) | **SciLex** (linear vs exponential) | REAL is linear-time and ReDoS-safe; `re` backtracks catastrophically |
30
+ | untrusted / machine-generated | **SciLex** | the linear bound holds on *every* input — no pathological cliff |
31
+
32
+ ## Conditions of this baseline
33
+
34
+ | | |
35
+ | --- | --- |
36
+ | Machine | Apple Silicon (`arm64`), Darwin 23.6.0 |
37
+ | Binding | abi3 CPython extension as built by `setup.py` (`Py_LIMITED_API` 3.10) |
38
+ | Method | best-of-5 timed runs, **minimum** reported |
39
+ | As of | 2026-06-19 — the Python-binding maturation reserve |
40
+
41
+ ## Baseline
42
+
43
+ ### B1. Benign tokenization (the everyday case — `re` wins)
44
+
45
+ Tokenizing ~10 KB of ordinary `ident = ident + number * ident - number ;` soup into
46
+ 4000 tokens (numbers, identifiers, operators; whitespace skipped). SciLex compiles the
47
+ rule set once (a reused `Lexer`); the `re` baseline is the standard "master pattern"
48
+ tokenizer (`(?P<NUM>…)|(?P<ID>…)|…` + `finditer`).
49
+
50
+ | tokenizer | time | vs `re` |
51
+ | --- | ---: | ---: |
52
+ | `scilex.Lexer.tokenize` | ~7.4 ms | ~5.2× |
53
+ | `re.finditer` (master pattern) | ~1.4 ms | 1.0× (baseline) |
54
+
55
+ **Reading.** `re` is ~5× faster here. That is expected and reported plainly: the cost
56
+ buys SciLex's linear guarantee and its ordered maximal-munch semantics, not a speed
57
+ record on benign input. Tightening the per-position scan is a known, deliberately
58
+ deferred lever — to be pursued only if a real workload makes it the bottleneck.
59
+
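The "master pattern" baseline described in B1 can be sketched roughly as follows. This is a minimal illustration, not the code in `benchmarks/bench.py`: the four patterns and the `tokenize` helper are assumptions chosen to fit the `ident = ident + number * ident - number ;` shape of the input.

```python
import re

# Hypothetical rule set: these patterns are assumptions for illustration,
# not the ones used by benchmarks/bench.py.
MASTER = re.compile(
    r"(?P<WS>\s+)"
    r"|(?P<NUM>[0-9]+)"
    r"|(?P<ID>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<OP>[=+\-*;])"
)


def tokenize(text):
    """Return (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    for m in MASTER.finditer(text):
        if m.start() != pos:
            # finditer silently skips unmatched text, so gaps are checked by hand.
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
    if pos != len(text):
        raise ValueError(f"unexpected character at offset {pos}")
    return tokens


statement = "total = total + 42 * rate - 7 ;"
tokens = tokenize(statement)
print(len(tokens), tokens[:3])
```

Note that alternation in `re` picks the first alternative that matches, not the longest one, so a master pattern only approximates maximal munch when the alternatives are ordered with care.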
60
+ ### B2. Pathological input (the linearity guarantee — SciLex wins decisively)
61
+
62
+ The classic ReDoS trigger `(a+)+b` over a run of `n` `a`s with no terminating `b`. A
63
+ backtracking engine explores `O(2ⁿ)` partitions; REAL (and therefore SciLex) is linear.
64
+
65
+ | n | `scilex` (linear) | `re.match` (backtracking) |
66
+ | ---: | ---: | ---: |
67
+ | 16 | ~2.1 µs | ~2.2 ms |
68
+ | 18 | ~2.2 µs | ~9.1 ms |
69
+ | 20 | ~2.4 µs | ~35.9 ms |
70
+ | 22 | ~2.6 µs | ~142.7 ms |
71
+ | 24 | ~2.8 µs | ~574 ms |
72
+ | 26 | ~2.9 µs | ~2.30 s |
73
+ | 1000 | ~78 µs | would not finish |
74
+
75
+ **Reading.** `re`'s time roughly **quadruples every +2** in `n` (exponential); SciLex
76
+ grows **linearly** and is still ~78 µs at `n = 1000`, where `re` would not finish in any
77
+ practical time. This is the case SciLex exists for.
78
+
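The "quadruples every +2" reading can be checked directly from the `re.match` column of the table. The sketch below only does arithmetic on the published numbers; the extrapolation to `n = 1000` rests on the assumption that the time keeps doubling with every extra character, which is a model, not a measurement.

```python
import math

# (n, seconds) for re.match, copied from the B2 table.
RE_TIMES = [(16, 2.2e-3), (18, 9.1e-3), (20, 35.9e-3),
            (22, 142.7e-3), (24, 574e-3), (26, 2.30)]

# Ratio between consecutive rows; each row adds 2 to n.
ratios = [later / earlier
          for (_, earlier), (_, later) in zip(RE_TIMES, RE_TIMES[1:])]

# Per-character growth factor implied by each ratio (square root, since the step is 2).
per_char = [math.sqrt(r) for r in ratios]

# Extrapolation to n = 1000, ASSUMING the time keeps doubling per extra character.
n_last, t_last = RE_TIMES[-1]
log10_seconds = math.log10(t_last) + (1000 - n_last) * math.log10(2)

print([round(r, 2) for r in ratios])    # [4.14, 3.95, 3.97, 4.02, 4.01]
print([round(f, 2) for f in per_char])
print(f"about 10^{math.floor(log10_seconds)} seconds at n = 1000")
```

Every ratio sits close to 4, that is a factor of about 2 per character, which is what an `O(2ⁿ)` search predicts. Under that model the `n = 1000` row would need on the order of 10^293 seconds, which is why the table can only say "would not finish".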
79
+ ## Methodology & reproduction
80
+
81
+ - **Goal:** a regression tripwire plus an honest win/lose map — not a throughput
82
+ contest. Compare a fresh `make bench` to this table **on the same machine**; a clear,
83
+ repeatable change is the signal.
84
+ - **Reproduce:** `make bench` builds the extension in place and runs
85
+ `benchmarks/bench.py`. The pathological sweep stops `re` once a single match passes
86
+ one second (its curve is already established); SciLex is measured well past that.
87
+ - **Not gated.** `make bench` is excluded from `full-local-gate` on purpose — a noisy
88
+ wall-time measurement must never turn a clean build red.
89
+ - **Deferred:** a compile-time `static_lexer` (REAL's `static_regex`) and a faster
90
+ per-position scan (e.g. first-byte / trie dispatch) are known levers, left until a
91
+ measured workload justifies them. No phantom numbers here for paths not yet built.
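
The best-of-5 minimum and the early stop for `re` could look roughly like the sketch below. It is an assumption-laden illustration, not the contents of `benchmarks/bench.py`: `best_of`, `sweep_re` and the `budget_s` parameter are hypothetical names, and the budget is set far below the one second quoted above so the sketch finishes quickly.

```python
import re
import time


def best_of(fn, repeats=5):
    """Run fn `repeats` times and return the minimum wall time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def sweep_re(sizes, budget_s=1.0):
    """Time re.match on (a+)+b for each n, stopping once a timing exceeds budget_s."""
    pattern = re.compile(r"(a+)+b")
    results = []
    for n in sizes:
        subject = "a" * n          # no terminating 'b', so the match must fail
        elapsed = best_of(lambda: pattern.match(subject))
        results.append((n, elapsed))
        if elapsed > budget_s:
            break                  # curve already established; larger n is skipped
    return results


# A deliberately small sweep and budget so the sketch finishes quickly.
results = sweep_re(range(10, 31, 2), budget_s=0.02)
for n, seconds in results:
    print(f"n={n:3d}  {seconds * 1e3:8.3f} ms")
```

Reporting the minimum rather than the mean is a common choice for wall-time microbenchmarks, since interference from the rest of the machine can only make a run slower, never faster.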
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 René Chenard
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,7 @@
1
+ # Make the sdist self-contained: carry the C++ headers the binding compiles
2
+ # against and the binding source, plus the project docs. (REAL's headers are also
3
+ # needed at build time; they come from the real-regex build dependency.)
4
+ graft include
5
+ recursive-include python/src *.cpp
6
+ recursive-include python/tests *.py
7
+ include README.md LICENSE BENCHMARKS.md
@@ -0,0 +1,227 @@
1
+ Metadata-Version: 2.4
2
+ Name: scilex
3
+ Version: 2026.6.0
4
+ Summary: SciLex — a generic, linear-time maximal-munch lexer built on REAL
5
+ Author-email: René Chenard <rene.chenard.1@ulaval.ca>
6
+ License-Expression: MIT
7
+ Project-URL: Repository, https://github.com/RECHE23/scilex
8
+ Project-URL: Issues, https://github.com/RECHE23/scilex/issues
9
+ Keywords: lexer,tokenizer,maximal-munch,scanner,linear-time
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3 :: Only
14
+ Classifier: Programming Language :: C++
15
+ Classifier: Topic :: Software Development :: Compilers
16
+ Classifier: Topic :: Software Development :: Libraries
17
+ Classifier: Operating System :: OS Independent
18
+ Requires-Python: >=3.10
19
+ Description-Content-Type: text/markdown
20
+ License-File: LICENSE
21
+ Dynamic: license-file
22
+
23
+ # SciLex
24
+
25
+ A small, header-only C++20 lexer built on [REAL](https://github.com/RECHE23/real-regex).
26
+
27
+ Define an ordered set of token rules — each a kind paired with a REAL regular
28
+ expression — and SciLex tokenizes source text by **maximal munch** (longest
29
+ match wins, rule order breaks ties). Because REAL is a linear-time engine,
30
+ tokenization is linear and ReDoS-safe by construction: no token rule can make
31
+ the scanner backtrack catastrophically.
32
+
33
+ This is a deliberate fresh start under the same premises as REAL — purity,
34
+ simplicity, and measured optimality. SciLex is a thin layer over REAL, not a
35
+ re-implementation of pattern matching.
36
+
37
+ ## Scope (v1)
38
+
39
+ **Included:**
40
+
41
+ - Token rules: a kind, a `real::regex`, and a `skip` flag (whitespace, comments).
42
+ - Maximal-munch tokenization with rule-order priority on equal-length ties.
43
+ - Source positions: 0-based byte `offset`, 1-based `line`, and 1-based byte `column`.
44
+ - Non-owning tokens: each `lexeme` views into the source.
45
+ - Two ways to consume tokens: `tokenize` (eager, into a vector) and `scan`
46
+ (a lazy single-pass range that produces one token at a time — no token
47
+ vector is allocated; the parser-friendly access pattern).
48
+ - Optional synthetic end-of-input token (`eof_policy::append`): emits a final
49
+ `end_of_input` token at the real end position, so a parser always has a
50
+ current token to match.
51
+ - Optional indentation layout (`scilex::layout`, an opt-in header): rewrites a
52
+ token stream with synthetic `newline` / `indent` / `dedent` tokens for
53
+ indentation-significant languages; throws `layout_error` on an inconsistent
54
+ dedent.
55
+ - Lexical errors as exceptions carrying the failing position (`lex_error`).
56
+ - Linear-time / ReDoS-safe tokenization, inherited from REAL.
57
+
58
+ **Not in v1 (and honestly excluded — no phantom features):**
59
+
60
+ - Modes / context-sensitive lexing.
61
+ - Compile-time `static_lexer` (built on REAL's `static_regex`).
62
+ - Codepoint columns (columns are byte-based, matching REAL's UTF-8 model).
63
+ - Command-line tool, JSON specification language.
64
+
65
+ These may come later, each only if it earns its place — measured, tested, and
66
+ kept minimal.
67
+
68
+ ## Dependencies
69
+
70
+ SciLex is header-only and depends only on REAL's headers. By default the build
71
+ looks for them in a sibling checkout:
72
+
73
+ ```
74
+ ~/Projects/
75
+ ├── real-v1/ # REAL
76
+ └── scilex-v1/ # SciLex (uses ../real-v1/include by default)
77
+ ```
78
+
79
+ Point the build elsewhere with `REAL_INCLUDE` (Makefile) or
80
+ `-DSCILEX_REAL_INCLUDE=...` (CMake) — for instance at the path printed by
81
+ `python -c "import real; print(real.get_include())"` when REAL is installed via
82
+ pip.
83
+
84
+ For CI or a reproducible build — where no on-disk layout can be assumed — fetch
85
+ REAL with CMake FetchContent instead (`make build FETCH=1`, or
86
+ `-DSCILEX_FETCH_DEPS=ON`); point it at a remote and pin a tag with
87
+ `-DSCILEX_REAL_REPO=https://… -DSCILEX_REAL_TAG=v2026.6.6`.
88
+
89
+ ## Build
90
+
91
+ ```bash
92
+ make test # build and run the test suite
93
+ make coverage # line-coverage summary + HTML report
94
+ make sanitize # tests under AddressSanitizer + UndefinedBehaviorSanitizer
95
+ make lint # clang-tidy
96
+ make format # uncrustify, in place
97
+ make doc # API reference (Doxygen) with embedded coverage
98
+ ```
99
+
100
+ Override the compiler with `make test CXX=g++-14`.
101
+
102
+ `scilex::scilex` is the CMake target, reachable through `add_subdirectory`, `FetchContent`, or an
103
+ installed config package. The config calls `find_dependency(real)`, so installing
104
+ REAL's config package alongside (on the same prefix) makes the whole chain
105
+ resolve from one `find_package`:
106
+
107
+ ```cmake
108
+ # With REAL and SciLex installed under <prefix>:
109
+ find_package(scilex CONFIG REQUIRED) # pulls in real:: transitively
110
+ target_link_libraries(app PRIVATE scilex::scilex)
111
+ ```
112
+
113
+ ## Releasing
114
+
115
+ `make release` computes the next calendar version `YYYY.M.PATCH` — the patch
116
+ resets each month, the first release of a month is `.0` (PEP 440 drops leading
117
+ zeros, so `2026.6.1`, never `2026.06.001`) — bumps it in `pyproject.toml` and
118
+ `python/scilex/__init__.py`, then commits, tags and pushes from a clean `main`.
119
+ The pushed tag drives `.github/workflows/release.yml`,
120
+ which checks the tag matches the version, builds abi3 wheels (`cibuildwheel`,
121
+ Linux/macOS/Windows) and the self-contained sdist, and publishes to PyPI via
122
+ Trusted Publishing (OIDC, no stored secret). It then populates the GitHub
123
+ `/releases` page with auto-generated notes and the built artifacts. The pushed tag
124
+ is the single thing that triggers a publish; SciLex remains consumable as source
125
+ too (sibling checkout / FetchContent / `get_include()`).
126
+
127
+ **One-time PyPI setup.** Publishing needs a PyPI
128
+ [Trusted Publisher](https://docs.pypi.org/trusted-publishers/) configured once for
129
+ the project (publisher `RECHE23/scilex`, workflow `release.yml`, environment
130
+ `pypi`) and a matching `pypi` GitHub environment — no API token is stored.
131
+
132
+ ## Python binding
133
+
134
+ SciLex ships an abi3 CPython extension that exposes the C++ lexer to Python.
135
+ `pip install scilex` installs one `cp310-abi3` wheel per platform (CPython 3.10+;
136
+ the self-contained sdist compiles where no wheel matches, pulling REAL's headers
137
+ from the `real-regex` build dependency). For a source checkout:
138
+
139
+ ```bash
140
+ make python # build the extension in place
141
+ make python-test # run the binding test suite
142
+ ```
143
+
144
+ ```python
145
+ import scilex
146
+ lx = scilex.Lexer([
147
+ (0, r"\s+", True), # (kind, pattern, skip) — skipped
148
+ (1, r"[0-9]+", False), # number
149
+ (2, r"[A-Za-z_][A-Za-z0-9_]*", False), # identifier
150
+ ])
151
+
152
+ # Eager: a list of rich Token objects (kind, lexeme, structured position).
153
+ [(t.kind, t.lexeme) for t in lx.tokenize("foo 42")] # [(2, 'foo'), (1, '42')]
154
+
155
+ # Lazy: a generator yielding one Token at a time — nothing else is held.
156
+ for tok in lx.scan("foo 42"):
157
+ tok.kind, tok.lexeme, tok.position.line, tok.position.column
158
+
159
+ # A lexical error carries the failing position.
160
+ try:
161
+ lx.tokenize("foo @")
162
+ except scilex.error as e:
163
+ e.position # Position(line=1, column=5, offset=4)
164
+
165
+ # eof=True appends a terminal END_OF_INPUT token (a parser always has a token).
166
+ lx.tokenize("42", eof=True)[-1].kind == scilex.END_OF_INPUT
167
+ ```
168
+
169
+ For indentation-significant languages, `Layout` rewrites an `eof=True` token
170
+ stream with `NEWLINE` / `INDENT` / `DEDENT` tokens read from each line's leading
171
+ column:
172
+
173
+ ```python
174
+ lx = scilex.Lexer([(0, r"\s+", True), (1, r"\w+", False), (2, r":", False)])
175
+ laid = scilex.Layout().apply(lx.tokenize("if x:\n a\nb", eof=True))
176
+ [t.kind for t in laid]
177
+ # [1, 1, 2, NEWLINE, INDENT, 1, NEWLINE, DEDENT, 1, NEWLINE, END_OF_INPUT]
178
+ ```
179
+
180
+ `scilex.get_include()` returns the header directory so a C++ project can compile
181
+ against SciLex located through its Python install (add `real.get_include()` too,
182
+ since SciLex's headers include REAL's).
183
+
184
+ ## Example
185
+
186
+ ```cpp
187
+ #include <scilex/scilex.hpp>
188
+ #include <vector>
189
+
190
+ enum kind { WS, KW_IF, ID, NUM, PLUS };
191
+
192
+ std::vector<scilex::rule> rules;
193
+ rules.push_back({WS, real::regex("\\s+"), true}); // skipped
194
+ rules.push_back({KW_IF, real::regex("if")}); // before ID: wins ties
195
+ rules.push_back({ID, real::regex("[a-z]+")});
196
+ rules.push_back({NUM, real::regex("[0-9]+")});
197
+ rules.push_back({PLUS, real::regex("\\+")});
198
+
199
+ const scilex::lexer lexer(std::move(rules));
200
+
201
+ // Eager: all tokens in a vector.
202
+ for (const scilex::token& t : lexer.tokenize("if x + 42")) {
203
+ // t.kind, t.lexeme, t.start.{offset,line,column}
204
+ }
205
+
206
+ // Lazy: one token at a time, nothing else materialized.
207
+ for (const scilex::token& t : lexer.scan("if x + 42")) {
208
+ // ...
209
+ }
210
+ ```
211
+
212
+ ## Performance
213
+
214
+ See [BENCHMARKS.md](BENCHMARKS.md) for a reproducible, honest baseline (`make
215
+ bench`). The short of it: on benign input Python's `re` is faster (a mature C
216
+ backtracking engine), but SciLex is **linear-time and ReDoS-safe by construction**:
217
+ on a pathological pattern like `(a+)+b` it grows only linearly (~78 µs at 1000 chars)
218
+ where `re` explodes exponentially (seconds, then never finishing). SciLex trades
219
+ raw throughput on easy inputs for a guarantee that holds on every input.
220
+
221
+ ## License
222
+
223
+ MIT — see [LICENSE](LICENSE).
224
+
225
+ ## Author
226
+
227
+ René Chenard — rene.chenard@gmail.com
@@ -0,0 +1,205 @@
1
+ # SciLex
2
+
3
+ A small, header-only C++20 lexer built on [REAL](https://github.com/RECHE23/real-regex).
4
+
5
+ Define an ordered set of token rules — each a kind paired with a REAL regular
6
+ expression — and SciLex tokenizes source text by **maximal munch** (longest
7
+ match wins, rule order breaks ties). Because REAL is a linear-time engine,
8
+ tokenization is linear and ReDoS-safe by construction: no token rule can make
9
+ the scanner backtrack catastrophically.
10
+
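The maximal-munch rule (longest match wins, rule order breaks ties) can be sketched in a few lines of Python. This is only an illustration of the selection rule: it uses Python's `re` as a stand-in matcher, so it is neither linear-time nor ReDoS-safe, and for each rule it takes whatever length `re` returns. The `scan` and `tokenize` helpers and the rule names are assumptions that mirror the shape of the API described here, not SciLex's implementation.

```python
import re


def scan(rules, text):
    """Lazily yield (kind, lexeme, offset); rules are (kind, pattern, skip) triples."""
    compiled = [(kind, re.compile(pattern), skip) for kind, pattern, skip in rules]
    pos = 0
    while pos < len(text):
        best_kind, best_len, best_skip = None, 0, False
        for kind, regex, skip in compiled:
            m = regex.match(text, pos)
            # Strictly longer wins; on equal length the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_kind, best_len, best_skip = kind, m.end() - pos, skip
        if best_len == 0:
            raise ValueError(f"no rule matches at offset {pos}")
        if not best_skip:
            yield best_kind, text[pos:pos + best_len], pos
        pos += best_len


def tokenize(rules, text):
    """Eager variant: materialize every token in a list."""
    return list(scan(rules, text))


RULES = [
    ("WS", r"\s+", True),
    ("KW_IF", r"if", False),      # listed before ID, so it wins the tie on "if"
    ("ID", r"[a-z]+", False),
    ("NUM", r"[0-9]+", False),
    ("PLUS", r"\+", False),
]
kinds = [kind for kind, _, _ in tokenize(RULES, "if iffy + 42")]
print(kinds)    # ['KW_IF', 'ID', 'PLUS', 'NUM']
```

On `if` both `KW_IF` and `ID` match two characters and the earlier rule wins; on `iffy` the `ID` match is longer, so rule order never comes into play.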
11
+ This is a deliberate fresh start under the same premises as REAL — purity,
12
+ simplicity, and measured optimality. SciLex is a thin layer over REAL, not a
13
+ re-implementation of pattern matching.
14
+
15
+ ## Scope (v1)
16
+
17
+ **Included:**
18
+
19
+ - Token rules: a kind, a `real::regex`, and a `skip` flag (whitespace, comments).
20
+ - Maximal-munch tokenization with rule-order priority on equal-length ties.
21
+ - Source positions: 0-based byte `offset`, 1-based `line`, and 1-based byte `column`.
22
+ - Non-owning tokens: each `lexeme` views into the source.
23
+ - Two ways to consume tokens: `tokenize` (eager, into a vector) and `scan`
24
+ (a lazy single-pass range that produces one token at a time — no token
25
+ vector is allocated; the parser-friendly access pattern).
26
+ - Optional synthetic end-of-input token (`eof_policy::append`): emits a final
27
+ `end_of_input` token at the real end position, so a parser always has a
28
+ current token to match.
29
+ - Optional indentation layout (`scilex::layout`, an opt-in header): rewrites a
30
+ token stream with synthetic `newline` / `indent` / `dedent` tokens for
31
+ indentation-significant languages; throws `layout_error` on an inconsistent
32
+ dedent.
33
+ - Lexical errors as exceptions carrying the failing position (`lex_error`).
34
+ - Linear-time / ReDoS-safe tokenization, inherited from REAL.
35
+
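To make the position model in the list above concrete, here is a small sketch of how a byte offset maps to a line and a byte column. The `position_at` helper is hypothetical and recomputes everything from scratch; a real scanner would presumably track line and column incrementally as it advances.

```python
def position_at(source: bytes, offset: int):
    """Return (line, column, offset) for a byte offset into UTF-8 source."""
    line = source.count(b"\n", 0, offset) + 1
    line_start = source.rfind(b"\n", 0, offset) + 1   # 0 when no newline precedes
    column = offset - line_start + 1                  # 1-based, counted in bytes
    return line, column, offset


print(position_at("foo @".encode("utf-8"), 4))    # (1, 5, 4)

# Byte columns: "é" is two bytes in UTF-8, so "x" is the third character
# but sits at byte column 4.
print(position_at("é x".encode("utf-8"), 3))      # (1, 4, 3)
```

The first call reproduces the position that the Python binding example reports for the stray `@` in `"foo @"`.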
36
+ **Not in v1 (and honestly excluded — no phantom features):**
37
+
38
+ - Modes / context-sensitive lexing.
39
+ - Compile-time `static_lexer` (built on REAL's `static_regex`).
40
+ - Codepoint columns (columns are byte-based, matching REAL's UTF-8 model).
41
+ - Command-line tool, JSON specification language.
42
+
43
+ These may come later, each only if it earns its place — measured, tested, and
44
+ kept minimal.
45
+
46
+ ## Dependencies
47
+
48
+ SciLex is header-only and depends only on REAL's headers. By default the build
49
+ looks for them in a sibling checkout:
50
+
51
+ ```
52
+ ~/Projects/
53
+ ├── real-v1/ # REAL
54
+ └── scilex-v1/ # SciLex (uses ../real-v1/include by default)
55
+ ```
56
+
57
+ Point the build elsewhere with `REAL_INCLUDE` (Makefile) or
58
+ `-DSCILEX_REAL_INCLUDE=...` (CMake) — for instance at the path printed by
59
+ `python -c "import real; print(real.get_include())"` when REAL is installed via
60
+ pip.
61
+
62
+ For CI or a reproducible build — where no on-disk layout can be assumed — fetch
63
+ REAL with CMake FetchContent instead (`make build FETCH=1`, or
64
+ `-DSCILEX_FETCH_DEPS=ON`); point it at a remote and pin a tag with
65
+ `-DSCILEX_REAL_REPO=https://… -DSCILEX_REAL_TAG=v2026.6.6`.
66
+
67
+ ## Build
68
+
69
+ ```bash
70
+ make test # build and run the test suite
71
+ make coverage # line-coverage summary + HTML report
72
+ make sanitize # tests under AddressSanitizer + UndefinedBehaviorSanitizer
73
+ make lint # clang-tidy
74
+ make format # uncrustify, in place
75
+ make doc # API reference (Doxygen) with embedded coverage
76
+ ```
77
+
78
+ Override the compiler with `make test CXX=g++-14`.
79
+
80
+ `scilex::scilex` is the CMake target, reachable through `add_subdirectory`, `FetchContent`, or an
81
+ installed config package. The config calls `find_dependency(real)`, so installing
82
+ REAL's config package alongside (on the same prefix) makes the whole chain
83
+ resolve from one `find_package`:
84
+
85
+ ```cmake
86
+ # With REAL and SciLex installed under <prefix>:
87
+ find_package(scilex CONFIG REQUIRED) # pulls in real:: transitively
88
+ target_link_libraries(app PRIVATE scilex::scilex)
89
+ ```
90
+
91
+ ## Releasing
92
+
93
+ `make release` computes the next calendar version `YYYY.M.PATCH` — the patch
94
+ resets each month, the first release of a month is `.0` (PEP 440 drops leading
95
+ zeros, so `2026.6.1`, never `2026.06.001`) — bumps it in `pyproject.toml` and
96
+ `python/scilex/__init__.py`, then commits, tags and pushes from a clean `main`.
97
+ The pushed tag drives `.github/workflows/release.yml`,
98
+ which checks the tag matches the version, builds abi3 wheels (`cibuildwheel`,
99
+ Linux/macOS/Windows) and the self-contained sdist, and publishes to PyPI via
100
+ Trusted Publishing (OIDC, no stored secret). It then populates the GitHub
101
+ `/releases` page with auto-generated notes and the built artifacts. The pushed tag
102
+ is the single thing that triggers a publish; SciLex remains consumable as source
103
+ too (sibling checkout / FetchContent / `get_include()`).
104
+
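The calendar-version rule can be written down as a tiny function. This sketch only encodes the rule stated above (patch bumps within a month, resets to `.0` in a new month, no leading zeros); `next_version` is a hypothetical helper, and how `make release` actually finds the previous version is not shown here.

```python
from datetime import date


def next_version(previous: str, today: date) -> str:
    """Return the next YYYY.M.PATCH version after `previous` for a release on `today`."""
    year, month, patch = (int(part) for part in previous.split("."))
    if (year, month) == (today.year, today.month):
        return f"{year}.{month}.{patch + 1}"      # same month: bump the patch
    return f"{today.year}.{today.month}.0"        # new month: patch resets to 0


print(next_version("2026.6.0", date(2026, 6, 19)))   # 2026.6.1
print(next_version("2026.6.3", date(2026, 7, 2)))    # 2026.7.0
```

Formatting the components as plain integers is what keeps the result free of leading zeros.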
105
+ **One-time PyPI setup.** Publishing needs a PyPI
106
+ [Trusted Publisher](https://docs.pypi.org/trusted-publishers/) configured once for
107
+ the project (publisher `RECHE23/scilex`, workflow `release.yml`, environment
108
+ `pypi`) and a matching `pypi` GitHub environment — no API token is stored.
109
+
110
+ ## Python binding
111
+
112
+ SciLex ships an abi3 CPython extension that exposes the C++ lexer to Python.
113
+ `pip install scilex` installs one `cp310-abi3` wheel per platform (CPython 3.10+;
114
+ the self-contained sdist compiles where no wheel matches, pulling REAL's headers
115
+ from the `real-regex` build dependency). For a source checkout:
116
+
117
+ ```bash
118
+ make python # build the extension in place
119
+ make python-test # run the binding test suite
120
+ ```
121
+
122
+ ```python
123
+ import scilex
124
+ lx = scilex.Lexer([
125
+ (0, r"\s+", True), # (kind, pattern, skip) — skipped
126
+ (1, r"[0-9]+", False), # number
127
+ (2, r"[A-Za-z_][A-Za-z0-9_]*", False), # identifier
128
+ ])
129
+
130
+ # Eager: a list of rich Token objects (kind, lexeme, structured position).
131
+ [(t.kind, t.lexeme) for t in lx.tokenize("foo 42")] # [(2, 'foo'), (1, '42')]
132
+
133
+ # Lazy: a generator yielding one Token at a time — nothing else is held.
134
+ for tok in lx.scan("foo 42"):
135
+ tok.kind, tok.lexeme, tok.position.line, tok.position.column
136
+
137
+ # A lexical error carries the failing position.
138
+ try:
139
+ lx.tokenize("foo @")
140
+ except scilex.error as e:
141
+ e.position # Position(line=1, column=5, offset=4)
142
+
143
+ # eof=True appends a terminal END_OF_INPUT token (a parser always has a token).
144
+ lx.tokenize("42", eof=True)[-1].kind == scilex.END_OF_INPUT
145
+ ```
146
+
147
+ For indentation-significant languages, `Layout` rewrites an `eof=True` token
148
+ stream with `NEWLINE` / `INDENT` / `DEDENT` tokens read from each line's leading
149
+ column:
150
+
151
+ ```python
152
+ lx = scilex.Lexer([(0, r"\s+", True), (1, r"\w+", False), (2, r":", False)])
153
+ laid = scilex.Layout().apply(lx.tokenize("if x:\n a\nb", eof=True))
154
+ [t.kind for t in laid]
155
+ # [1, 1, 2, NEWLINE, INDENT, 1, NEWLINE, DEDENT, 1, NEWLINE, END_OF_INPUT]
156
+ ```
157
+
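The layout pass itself is short enough to mirror in Python. The sketch below follows the logic of `layout.hpp` in this package, simplified to `(kind, line, column)` triples and returning only the kinds: lexemes and the positions attached to the synthetic tokens are left out, the kind constants are plain strings instead of the reserved integers, and a `ValueError` stands in for `layout_error`.

```python
NEWLINE, INDENT, DEDENT, END_OF_INPUT = "NEWLINE", "INDENT", "DEDENT", "END_OF_INPUT"


def layout(tokens):
    """tokens: (kind, line, column) triples ending with END_OF_INPUT; returns kinds."""
    out, levels = [], [0]
    started, previous_line = False, 0
    for kind, line, column in tokens:
        if kind == END_OF_INPUT:
            continue                      # re-emitted once, at the very end
        if not started or line != previous_line:
            if started:
                out.append(NEWLINE)       # the previous logical line has ended
            width = column - 1            # columns are 1-based
            if width > levels[-1]:
                levels.append(width)
                out.append(INDENT)
            else:
                while width < levels[-1]:
                    levels.pop()
                    out.append(DEDENT)
                if width != levels[-1]:
                    raise ValueError(f"inconsistent indentation on line {line}")
            started = True
        out.append(kind)
        previous_line = line
    if started:
        out.append(NEWLINE)
    while levels[-1] > 0:                 # close every block still open
        levels.pop()
        out.append(DEDENT)
    out.append(END_OF_INPUT)
    return out


# Positions for "if x:\n a\nb" with kinds 1 (word) and 2 (colon).
stream = [(1, 1, 1), (1, 1, 4), (2, 1, 5), (1, 2, 2), (1, 3, 1), (END_OF_INPUT, 3, 2)]
laid = layout(stream)
print(laid)
```

Run on the positions of the example above, it yields the same kind sequence: `1, 1, 2, NEWLINE, INDENT, 1, NEWLINE, DEDENT, 1, NEWLINE, END_OF_INPUT`.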
158
+ `scilex.get_include()` returns the header directory so a C++ project can compile
159
+ against SciLex located through its Python install (add `real.get_include()` too,
160
+ since SciLex's headers include REAL's).
161
+
162
+ ## Example
163
+
164
+ ```cpp
165
+ #include <scilex/scilex.hpp>
166
+ #include <vector>
167
+
168
+ enum kind { WS, KW_IF, ID, NUM, PLUS };
169
+
170
+ std::vector<scilex::rule> rules;
171
+ rules.push_back({WS, real::regex("\\s+"), true}); // skipped
172
+ rules.push_back({KW_IF, real::regex("if")}); // before ID: wins ties
173
+ rules.push_back({ID, real::regex("[a-z]+")});
174
+ rules.push_back({NUM, real::regex("[0-9]+")});
175
+ rules.push_back({PLUS, real::regex("\\+")});
176
+
177
+ const scilex::lexer lexer(std::move(rules));
178
+
179
+ // Eager: all tokens in a vector.
180
+ for (const scilex::token& t : lexer.tokenize("if x + 42")) {
181
+ // t.kind, t.lexeme, t.start.{offset,line,column}
182
+ }
183
+
184
+ // Lazy: one token at a time, nothing else materialized.
185
+ for (const scilex::token& t : lexer.scan("if x + 42")) {
186
+ // ...
187
+ }
188
+ ```
189
+
190
+ ## Performance
191
+
192
+ See [BENCHMARKS.md](BENCHMARKS.md) for a reproducible, honest baseline (`make
193
+ bench`). The short of it: on benign input Python's `re` is faster (a mature C
194
+ backtracking engine), but SciLex is **linear-time and ReDoS-safe by construction**:
195
+ on a pathological pattern like `(a+)+b` it grows only linearly (~78 µs at 1000 chars)
196
+ where `re` explodes exponentially (seconds, then never finishing). SciLex trades
197
+ raw throughput on easy inputs for a guarantee that holds on every input.
198
+
199
+ ## License
200
+
201
+ MIT — see [LICENSE](LICENSE).
202
+
203
+ ## Author
204
+
205
+ René Chenard — rene.chenard@gmail.com
@@ -0,0 +1,125 @@
1
+ /*!
2
+ * \file layout.hpp
3
+ * \brief Optional indentation layout: insert NEWLINE / INDENT / DEDENT tokens.
4
+ *
5
+ * Some languages (Python-like, e.g. SciLang) make indentation significant. This
6
+ * opt-in pass turns a flat token stream into a layout-aware one: it inserts a
7
+ * \ref scilex::newline at each logical line end, and \ref scilex::indent /
8
+ * \ref scilex::dedent where the leading indentation changes.
9
+ *
10
+ * It works purely from token **positions** — every \ref scilex::token already
11
+ * carries its source line and (byte) column — so the base lexer needs no change
12
+ * and may keep skipping whitespace. Lines with no token (blank or
13
+ * comment-only) carry no structure and are naturally ignored.
14
+ *
15
+  * Indentation width is the 0-based byte column of a line's first token (tabs and spaces
16
+ * each count as one column; v1 does not police mixed tabs/spaces, and there is
17
+ * no implicit line continuation inside brackets).
18
+ *
19
+ * Input must be an end-of-input-terminated token sequence (the lexer's
20
+ * `eof_policy::append`); the terminal \ref scilex::end_of_input is preserved.
21
+ */
22
+ #ifndef SCILEX_LAYOUT_HPP
23
+ #define SCILEX_LAYOUT_HPP
24
+
25
+ #include <cstddef>
26
+ #include <limits>
27
+ #include <span>
28
+ #include <stdexcept>
29
+ #include <string>
30
+ #include <vector>
31
+
32
+ #include "token.hpp"
33
+
34
+ namespace scilex {
35
+
36
+ //! \brief Reserved kind: end of a logical line.
37
+ inline constexpr int newline {std::numeric_limits<int>::min() + 1};
38
+ //! \brief Reserved kind: indentation increased (start of a deeper block).
39
+ inline constexpr int indent {std::numeric_limits<int>::min() + 2};
40
+ //! \brief Reserved kind: indentation decreased (end of a block).
41
+ inline constexpr int dedent {std::numeric_limits<int>::min() + 3};
42
+
43
+ /*!
44
+ * \brief Thrown when a line's indentation matches no enclosing level.
45
+ */
46
+ class layout_error : public std::runtime_error
47
+ {
48
+ public:
49
+
50
+ //! \brief Builds the error. \param[in] message Cause. \param[in] where Position.
51
+ layout_error(const std::string& message,
52
+ position where)
53
+ : std::runtime_error(message),
54
+ where_(where)
55
+ {}
56
+
57
+ //! \brief Returns the position of the offending line.
58
+ [[nodiscard]] position where() const noexcept
59
+ {
60
+ return where_;
61
+ }
62
+
63
+ private:
64
+
65
+ position where_; //!< Where the indentation was inconsistent.
66
+ };
67
+
68
+ /*!
69
+ * \brief Rewrites \p tokens with NEWLINE / INDENT / DEDENT inserted.
70
+ *
71
+ * \param[in] tokens An end-of-input-terminated token sequence.
72
+ * \return The layout-aware token sequence (still end-of-input-terminated).
73
+ * \throws layout_error If a line dedents to an indentation that no open block
74
+ * used.
75
+ */
76
+ [[nodiscard]] inline std::vector<token> layout(std::span<const token> tokens)
77
+ {
78
+ std::vector<token> out;
79
+ std::vector<std::size_t> levels {0};
80
+ bool started {false};
81
+ std::size_t previous_line {0};
82
+ position end_position {0, 1, 1};
83
+
84
+ for (const token& current : tokens) {
85
+ if (current.kind == end_of_input) {
86
+ end_position = current.start; // remember; emit our own terminal at the end
87
+ continue;
88
+ }
89
+ if (!started || current.start.line != previous_line) {
90
+ if (started) {
91
+ out.push_back(token {newline, {}, current.start});
92
+ }
93
+ const std::size_t width {current.start.column - 1};
94
+ if (width > levels.back()) {
95
+ levels.push_back(width);
96
+ out.push_back(token {indent, {}, current.start});
97
+ }
98
+ else {
99
+ while (width < levels.back()) {
100
+ levels.pop_back();
101
+ out.push_back(token {dedent, {}, current.start});
102
+ }
103
+ if (width != levels.back()) {
104
+ throw layout_error("inconsistent indentation", current.start);
105
+ }
106
+ }
107
+ started = true;
108
+ }
109
+ out.push_back(current);
110
+ previous_line = current.start.line;
111
+ }
112
+
113
+ if (started) {
114
+ out.push_back(token {newline, {}, end_position});
115
+ }
116
+ while (levels.back() > 0) {
117
+ levels.pop_back();
118
+ out.push_back(token {dedent, {}, end_position});
119
+ }
120
+ out.push_back(token {end_of_input, {}, end_position});
121
+ return out;
122
+ }
123
+ } // namespace scilex
124
+
125
+ #endif // SCILEX_LAYOUT_HPP