scilex 2026.6.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- scilex-2026.6.0/BENCHMARKS.md +91 -0
- scilex-2026.6.0/LICENSE +21 -0
- scilex-2026.6.0/MANIFEST.in +7 -0
- scilex-2026.6.0/PKG-INFO +227 -0
- scilex-2026.6.0/README.md +205 -0
- scilex-2026.6.0/include/scilex/layout.hpp +125 -0
- scilex-2026.6.0/include/scilex/lexer.hpp +387 -0
- scilex-2026.6.0/include/scilex/scilex.hpp +22 -0
- scilex-2026.6.0/include/scilex/token.hpp +54 -0
- scilex-2026.6.0/pyproject.toml +55 -0
- scilex-2026.6.0/python/scilex/__init__.py +348 -0
- scilex-2026.6.0/python/scilex.egg-info/PKG-INFO +227 -0
- scilex-2026.6.0/python/scilex.egg-info/SOURCES.txt +17 -0
- scilex-2026.6.0/python/scilex.egg-info/dependency_links.txt +1 -0
- scilex-2026.6.0/python/scilex.egg-info/top_level.txt +1 -0
- scilex-2026.6.0/python/src/_scilex.cpp +445 -0
- scilex-2026.6.0/python/tests/test_scilex.py +271 -0
- scilex-2026.6.0/setup.cfg +4 -0
- scilex-2026.6.0/setup.py +68 -0
scilex-2026.6.0/BENCHMARKS.md
ADDED
|
@@ -0,0 +1,91 @@
# SciLex — performance baseline

A reproducible wall-time baseline for the Python binding, comparing SciLex against
Python's standard-library `re`. Its purpose is twofold: a **regression tripwire**
between versions on the same machine, and an honest statement of **where SciLex wins
and where it does not**.

Run it with **`make bench`** (or `python benchmarks/bench.py`). It is informational
only — it prints a table, is never invoked by `full-local-gate`, and never fails a
build.

## The honest headline

SciLex is **not** built to beat `re` on raw throughput, and it does not. `re` is a
mature C backtracking engine; SciLex runs REAL's linear-time NFA at every position,
through the abi3 binding, and builds rich `Token` objects. On benign input that costs
roughly 5× more per byte (see B1).

What SciLex guarantees instead is **linear time, ReDoS-safe by construction**: no rule
can make the scanner backtrack catastrophically. On an adversarial (or simply
unlucky) pattern, `re` degrades exponentially while SciLex stays flat — and *that* is
the difference that matters for a lexer fed untrusted or machine-generated input.

### Gains and losses at a glance

| input | winner | why |
| --- | --- | --- |
| benign token soup | `re` (~5× faster, see B1) | a mature C backtracking engine; SciLex runs REAL's NFA per position, through the binding, and builds rich `Token`s |
| adversarial / ReDoS (B2) | **SciLex** (linear vs exponential) | REAL is linear-time and ReDoS-safe; `re` backtracks catastrophically |
| untrusted / machine-generated | **SciLex** | the linear bound holds on *every* input — no pathological cliff |

## Conditions of this baseline

| condition | value |
| --- | --- |
| Machine | Apple Silicon (`arm64`), Darwin 23.6.0 |
| Binding | abi3 CPython extension as built by `setup.py` (`Py_LIMITED_API` 3.10) |
| Method | best-of-5 timed runs, **minimum** reported |
| As of | 2026-06-19 — the Python-binding maturation reserve |
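
The "best-of-5, minimum reported" method in the table above can be sketched with the standard library alone. This is an illustrative helper, not the contents of `benchmarks/bench.py` (which this diff does not include); the name `best_of` is an assumption made for the example.

```python
import time


def best_of(func, repeats=5):
    """Return the minimum wall time, in seconds, over `repeats` calls of func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)


elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed * 1e6:.1f} µs")
```

Reporting the minimum rather than the mean keeps scheduler noise out of the figure, which suits a same-machine regression tripwire.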

## Baseline

### B1. Benign tokenization (the everyday case — `re` wins)

Tokenizing ~10 KB of ordinary `ident = ident + number * ident - number ;` soup into
4000 tokens (numbers, identifiers, operators; whitespace skipped). SciLex compiles the
rule set once (a reused `Lexer`); the `re` baseline is the standard "master pattern"
tokenizer (`(?P<NUM>…)|(?P<ID>…)|…` + `finditer`).

| tokenizer | time | vs `re` |
| --- | ---: | ---: |
| `scilex.Lexer.tokenize` | ~7.4 ms | ~5.2× slower |
| `re.finditer` (master pattern) | ~1.4 ms | 1.0× (baseline) |

**Reading.** `re` is ~5× faster here. That is expected and reported plainly: the cost
buys SciLex's linear guarantee and its ordered maximal-munch semantics, not a speed
record on benign input. Tightening the per-position scan is a known, deliberately
deferred lever — to be pursued only if a real workload makes it the bottleneck.
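
For readers unfamiliar with the "master pattern" technique used as the `re` baseline, a minimal sketch follows. The rule names (`NUM`, `ID`, `OP`, `WS`) and their patterns are assumptions chosen for illustration; the actual rule set in `benchmarks/bench.py` is not part of this diff.

```python
import re

# Hypothetical rule set for "ident = ident + number ;" style input.
TOKEN_SPEC = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[=+*;-]"),
    ("WS", r"\s+"),
]
# One alternation of named groups; the group that matched names the token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, lexeme) pairs, skipping whitespace."""
    for match in MASTER.finditer(text):
        if match.lastgroup != "WS":
            yield match.lastgroup, match.group()


tokens = list(tokenize("x = y + 42 * z - 7 ;"))
print(tokens)
```

Two caveats apply to this baseline style: `finditer` silently skips characters that no alternative matches, and `re` alternation takes the first alternative that matches rather than the longest one, so it is ordered-choice rather than maximal munch.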

### B2. Pathological input (the linearity guarantee — SciLex wins decisively)

The classic ReDoS trigger `(a+)+b` over a run of `n` `a`s with no terminating `b`. A
backtracking engine explores `O(2ⁿ)` partitions; REAL (and therefore SciLex) is linear.

| n | `scilex` (linear) | `re.match` (backtracking) |
| ---: | ---: | ---: |
| 16 | ~2.1 µs | ~2.2 ms |
| 18 | ~2.2 µs | ~9.1 ms |
| 20 | ~2.4 µs | ~35.9 ms |
| 22 | ~2.6 µs | ~142.7 ms |
| 24 | ~2.8 µs | ~574 ms |
| 26 | ~2.9 µs | ~2.30 s |
| 1000 | ~78 µs | would not finish |

**Reading.** `re`'s time roughly **quadruples every +2** in `n` (exponential); SciLex
grows **linearly** and is still ~78 µs at `n = 1000`, where `re` would not finish in any
practical time. This is the case SciLex exists for.
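
The `re` side of this sweep can be reproduced with the standard library. The sketch below is an assumption about how such a measurement might look, not the benchmark script itself; it stays at small `n` so it finishes quickly, and `extrapolate` simply projects the doubling-per-character trend read off the table.

```python
import re
import time

PATTERN = re.compile(r"(a+)+b")


def time_match(n):
    """Wall time, in seconds, for one failed match on a run of n 'a's."""
    subject = "a" * n
    start = time.perf_counter()
    result = PATTERN.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # no terminating 'b', so the match must fail
    return elapsed


def extrapolate(t_ref, n_ref, n):
    """Project a time that doubles with each extra character from (n_ref, t_ref)."""
    return t_ref * 2.0 ** (n - n_ref)


timings = {n: time_match(n) for n in (10, 14, 18)}
for n, seconds in timings.items():
    print(f"n={n:2d}  {seconds * 1e3:8.3f} ms")
```

Feeding the table's own figure into the projection, `extrapolate(2.30, 26, 1000)` is on the order of 10^293 seconds, which is what "would not finish" means in practice.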

## Methodology & reproduction

- **Goal:** a regression tripwire plus an honest win/lose map — not a throughput
  contest. Compare a fresh `make bench` to this table **on the same machine**; a clear,
  repeatable change is the signal.
- **Reproduce:** `make bench` builds the extension in place and runs
  `benchmarks/bench.py`. The pathological sweep stops `re` once a single match passes
  one second (its curve is already established); SciLex is measured well past that.
- **Not gated.** `make bench` is excluded from `full-local-gate` on purpose — a noisy
  wall-time measurement must never turn a clean build red.
- **Deferred:** a compile-time `static_lexer` (REAL's `static_regex`) and a faster
  per-position scan (e.g. first-byte / trie dispatch) are known levers, left until a
  measured workload justifies them. No phantom numbers here for paths not yet built.
scilex-2026.6.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 René Chenard

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
scilex-2026.6.0/MANIFEST.in
ADDED
|
@@ -0,0 +1,7 @@
# Make the sdist self-contained: carry the C++ headers the binding compiles
# against and the binding source, plus the project docs. (REAL's headers are
# also needed at build time; they come from the real-regex build dependency.)
graft include
recursive-include python/src *.cpp
recursive-include python/tests *.py
include README.md LICENSE BENCHMARKS.md
scilex-2026.6.0/PKG-INFO
ADDED
|
@@ -0,0 +1,227 @@
Metadata-Version: 2.4
Name: scilex
Version: 2026.6.0
Summary: SciLex — a generic, linear-time maximal-munch lexer built on REAL
Author-email: René Chenard <rene.chenard.1@ulaval.ca>
License-Expression: MIT
Project-URL: Repository, https://github.com/RECHE23/scilex
Project-URL: Issues, https://github.com/RECHE23/scilex/issues
Keywords: lexer,tokenizer,maximal-munch,scanner,linear-time
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: C++
Classifier: Topic :: Software Development :: Compilers
Classifier: Topic :: Software Development :: Libraries
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# SciLex

A small, header-only C++20 lexer built on [REAL](https://github.com/RECHE23/real-regex).

Define an ordered set of token rules — each a kind paired with a REAL regular
expression — and SciLex tokenizes source text by **maximal munch** (longest
match wins, rule order breaks ties). Because REAL is a linear-time engine,
tokenization is linear and ReDoS-safe by construction: no token rule can make
the scanner backtrack catastrophically.
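
The maximal-munch rule (longest match wins, rule order breaks ties) can be illustrated with a small reference sketch. It uses Python's `re` purely to show the selection semantics; it is not how SciLex or REAL is implemented and carries none of their linear-time guarantee. The function name `scan` and the rule names are chosen for the example.

```python
import re


def scan(rules, text):
    """Lazily yield (kind, lexeme, offset) triples by maximal munch.

    `rules` is an ordered list of (kind, pattern, skip) triples.
    """
    compiled = [(kind, re.compile(pattern), skip) for kind, pattern, skip in rules]
    offset = 0
    while offset < len(text):
        best_kind, best_len, best_skip = None, 0, False
        for kind, regex, skip in compiled:
            match = regex.match(text, offset)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if match and len(match.group()) > best_len:
                best_kind, best_len, best_skip = kind, len(match.group()), skip
        if best_len == 0:
            raise ValueError(f"no rule matches at offset {offset}")
        if not best_skip:
            yield best_kind, text[offset:offset + best_len], offset
        offset += best_len


RULES = [
    ("WS", r"\s+", True),
    ("KW_IF", r"if", False),  # listed before ID, so it wins the equal-length tie
    ("ID", r"[a-z]+", False),
    ("NUM", r"[0-9]+", False),
]
tokens = list(scan(RULES, "if iffy 42"))
print(tokens)
```

Here `if` becomes a keyword because both `KW_IF` and `ID` match two characters and `KW_IF` comes first, while `iffy` becomes an identifier because `ID` matches four characters. One limitation of the sketch: each rule's match length comes from `re`, which is leftmost-first, so a rule containing alternation may report a shorter match than a longest-match engine would.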

This is a deliberate fresh start under the same premises as REAL — purity,
simplicity, and measured optimality. SciLex is a thin layer over REAL, not a
re-implementation of pattern matching.

## Scope (v1)

**Included:**

- Token rules: a kind, a `real::regex`, and a `skip` flag (whitespace, comments).
- Maximal-munch tokenization with rule-order priority on equal-length ties.
- Source positions: 0-based byte `offset`, 1-based `line`, and 1-based byte `column`.
- Non-owning tokens: each `lexeme` views into the source.
- Two ways to consume tokens: `tokenize` (eager, into a vector) and `scan`
  (a lazy single-pass range that produces one token at a time — no token
  vector is allocated; the parser-friendly access pattern).
- Optional synthetic end-of-input token (`eof_policy::append`): emits a final
  `end_of_input` token at the real end position, so a parser always has a
  current token to match.
- Optional indentation layout (`scilex::layout`, an opt-in header): rewrites a
  token stream with synthetic `newline` / `indent` / `dedent` tokens for
  indentation-significant languages; throws `layout_error` on an inconsistent
  dedent.
- Lexical errors as exceptions carrying the failing position (`lex_error`).
- Linear-time / ReDoS-safe tokenization, inherited from REAL.
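
To make the position model concrete, the sketch below computes a byte offset, 1-based line and 1-based byte column for UTF-8 input. `Position` and `position_at` are hypothetical names for this illustration (the field order mirrors the `Position(line=…, column=…, offset=…)` shown in the Python binding example); they are not the binding's implementation.

```python
from typing import NamedTuple


class Position(NamedTuple):
    line: int    # 1-based
    column: int  # 1-based, counted in bytes
    offset: int  # 0-based byte offset


def position_at(source: bytes, offset: int) -> Position:
    """Locate a byte offset within UTF-8 encoded source."""
    line_start = source.rfind(b"\n", 0, offset) + 1  # 0 when no newline precedes
    line = source.count(b"\n", 0, offset) + 1
    return Position(line=line, column=offset - line_start + 1, offset=offset)


print(position_at(b"foo @", 4))
print(position_at("éx".encode("utf-8"), 2))
```

Because columns count bytes, the `x` after a two-byte `é` sits at column 3, where a codepoint-based column would be 2.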

**Not in v1 (and honestly excluded — no phantom features):**

- Modes / context-sensitive lexing.
- Compile-time `static_lexer` (built on REAL's `static_regex`).
- Codepoint columns (columns are byte-based, matching REAL's UTF-8 model).
- Command-line tool, JSON specification language.

These may come later, each only if it earns its place — measured, tested, and
kept minimal.

## Dependencies

SciLex is header-only and depends only on REAL's headers. By default the build
looks for them in a sibling checkout:

```
~/Projects/
├── real-v1/      # REAL
└── scilex-v1/    # SciLex (uses ../real-v1/include by default)
```

Point the build elsewhere with `REAL_INCLUDE` (Makefile) or
`-DSCILEX_REAL_INCLUDE=...` (CMake) — for instance at the path printed by
`python -c "import real; print(real.get_include())"` when REAL is installed via
pip.

For CI or a reproducible build — where no on-disk layout can be assumed — fetch
REAL with CMake FetchContent instead (`make build FETCH=1`, or
`-DSCILEX_FETCH_DEPS=ON`); point it at a remote and pin a tag with
`-DSCILEX_REAL_REPO=https://… -DSCILEX_REAL_TAG=v2026.6.6`.

## Build

```bash
make test        # build and run the test suite
make coverage    # line-coverage summary + HTML report
make sanitize    # tests under AddressSanitizer + UndefinedBehaviorSanitizer
make lint        # clang-tidy
make format      # uncrustify, in place
make doc         # API reference (Doxygen) with embedded coverage
```

Override the compiler with `make test CXX=g++-14`.

`scilex::scilex` is the CMake target, available through `add_subdirectory`,
`FetchContent`, or an installed config package. The config calls
`find_dependency(real)`, so installing REAL's config package alongside (on the
same prefix) makes the whole chain resolve from one `find_package`:

```cmake
# With REAL and SciLex installed under <prefix>:
find_package(scilex CONFIG REQUIRED)  # pulls in real:: transitively
target_link_libraries(app PRIVATE scilex::scilex)
```

## Releasing

`make release` computes the next calendar version `YYYY.M.PATCH` — the patch
resets each month, the first release of a month is `.0` (PEP 440 drops leading
zeros, so `2026.6.1`, never `2026.06.001`) — bumps it in `pyproject.toml` and
`python/scilex/__init__.py`, then commits, tags and pushes from a clean `main`.
The pushed tag drives `.github/workflows/release.yml`,
which checks the tag matches the version, builds abi3 wheels (`cibuildwheel`,
Linux/macOS/Windows) and the self-contained sdist, and publishes to PyPI via
Trusted Publishing (OIDC, no stored secret). It then populates the GitHub
`/releases` page with auto-generated notes and the built artifacts. The pushed tag
is the single thing that triggers a publish; SciLex remains consumable as source
too (sibling checkout / FetchContent / `get_include()`).

**One-time PyPI setup.** Publishing needs a PyPI
[Trusted Publisher](https://docs.pypi.org/trusted-publishers/) configured once for
the project (publisher `RECHE23/scilex`, workflow `release.yml`, environment
`pypi`) and a matching `pypi` GitHub environment — no API token is stored.

## Python binding

SciLex ships an abi3 CPython extension (use the C++ lexer from Python).
`pip install scilex` installs one `cp310-abi3` wheel per platform (CPython 3.10+;
the self-contained sdist compiles where no wheel matches, pulling REAL's headers
from the `real-regex` build dependency). For a source checkout:

```bash
make python         # build the extension in place
make python-test    # run the binding test suite
```

```python
import scilex
lx = scilex.Lexer([
    (0, r"\s+", True),                      # (kind, pattern, skip) — skipped
    (1, r"[0-9]+", False),                  # number
    (2, r"[A-Za-z_][A-Za-z0-9_]*", False),  # identifier
])

# Eager: a list of rich Token objects (kind, lexeme, structured position).
[(t.kind, t.lexeme) for t in lx.tokenize("foo 42")]  # [(2, 'foo'), (1, '42')]

# Lazy: a generator yielding one Token at a time — nothing else is held.
for tok in lx.scan("foo 42"):
    tok.kind, tok.lexeme, tok.position.line, tok.position.column

# A lexical error carries the failing position.
try:
    lx.tokenize("foo @")
except scilex.error as e:
    e.position  # Position(line=1, column=5, offset=4)

# eof=True appends a terminal END_OF_INPUT token (a parser always has a token).
lx.tokenize("42", eof=True)[-1].kind == scilex.END_OF_INPUT
```

For indentation-significant languages, `Layout` rewrites an `eof=True` token
stream with `NEWLINE` / `INDENT` / `DEDENT` tokens read from each line's leading
column:

```python
lx = scilex.Lexer([(0, r"\s+", True), (1, r"\w+", False), (2, r":", False)])
laid = scilex.Layout().apply(lx.tokenize("if x:\n a\nb", eof=True))
[t.kind for t in laid]
# [1, 1, 2, NEWLINE, INDENT, 1, NEWLINE, DEDENT, 1, NEWLINE, END_OF_INPUT]
```

`scilex.get_include()` returns the header directory so a C++ project can compile
against SciLex located through its Python install (add `real.get_include()` too,
since SciLex's headers include REAL's).

## Example

```cpp
#include <scilex/scilex.hpp>
#include <utility>
#include <vector>

enum kind { WS, KW_IF, ID, NUM, PLUS };

int main()
{
    std::vector<scilex::rule> rules;
    rules.push_back({WS,    real::regex("\\s+"), true}); // skipped
    rules.push_back({KW_IF, real::regex("if")});         // before ID: wins ties
    rules.push_back({ID,    real::regex("[a-z]+")});
    rules.push_back({NUM,   real::regex("[0-9]+")});
    rules.push_back({PLUS,  real::regex("\\+")});

    const scilex::lexer lexer(std::move(rules));

    // Eager: all tokens in a vector.
    for (const scilex::token& t : lexer.tokenize("if x + 42")) {
        // t.kind, t.lexeme, t.start.{offset,line,column}
    }

    // Lazy: one token at a time, nothing else materialized.
    for (const scilex::token& t : lexer.scan("if x + 42")) {
        // ...
    }
}
```

## Performance

See [BENCHMARKS.md](BENCHMARKS.md) for a reproducible, honest baseline (`make
bench`). The short of it: on benign input Python's `re` is faster (a mature C
backtracking engine), but SciLex is **linear-time and ReDoS-safe by construction**
— on a pathological pattern like `(a+)+b` it stays flat (~78 µs at 1000 chars)
where `re` explodes exponentially (seconds, then never finishing). SciLex trades
raw throughput on easy inputs for a guarantee that holds on every input.

## License

MIT — see [LICENSE](LICENSE).

## Author

René Chenard — rene.chenard@gmail.com
scilex-2026.6.0/README.md
ADDED
|
@@ -0,0 +1,205 @@
# SciLex

A small, header-only C++20 lexer built on [REAL](https://github.com/RECHE23/real-regex).

Define an ordered set of token rules — each a kind paired with a REAL regular
expression — and SciLex tokenizes source text by **maximal munch** (longest
match wins, rule order breaks ties). Because REAL is a linear-time engine,
tokenization is linear and ReDoS-safe by construction: no token rule can make
the scanner backtrack catastrophically.

This is a deliberate fresh start under the same premises as REAL — purity,
simplicity, and measured optimality. SciLex is a thin layer over REAL, not a
re-implementation of pattern matching.

## Scope (v1)

**Included:**

- Token rules: a kind, a `real::regex`, and a `skip` flag (whitespace, comments).
- Maximal-munch tokenization with rule-order priority on equal-length ties.
- Source positions: 0-based byte `offset`, 1-based `line`, and 1-based byte `column`.
- Non-owning tokens: each `lexeme` views into the source.
- Two ways to consume tokens: `tokenize` (eager, into a vector) and `scan`
  (a lazy single-pass range that produces one token at a time — no token
  vector is allocated; the parser-friendly access pattern).
- Optional synthetic end-of-input token (`eof_policy::append`): emits a final
  `end_of_input` token at the real end position, so a parser always has a
  current token to match.
- Optional indentation layout (`scilex::layout`, an opt-in header): rewrites a
  token stream with synthetic `newline` / `indent` / `dedent` tokens for
  indentation-significant languages; throws `layout_error` on an inconsistent
  dedent.
- Lexical errors as exceptions carrying the failing position (`lex_error`).
- Linear-time / ReDoS-safe tokenization, inherited from REAL.

**Not in v1 (and honestly excluded — no phantom features):**

- Modes / context-sensitive lexing.
- Compile-time `static_lexer` (built on REAL's `static_regex`).
- Codepoint columns (columns are byte-based, matching REAL's UTF-8 model).
- Command-line tool, JSON specification language.

These may come later, each only if it earns its place — measured, tested, and
kept minimal.

## Dependencies

SciLex is header-only and depends only on REAL's headers. By default the build
looks for them in a sibling checkout:

```
~/Projects/
├── real-v1/      # REAL
└── scilex-v1/    # SciLex (uses ../real-v1/include by default)
```

Point the build elsewhere with `REAL_INCLUDE` (Makefile) or
`-DSCILEX_REAL_INCLUDE=...` (CMake) — for instance at the path printed by
`python -c "import real; print(real.get_include())"` when REAL is installed via
pip.

For CI or a reproducible build — where no on-disk layout can be assumed — fetch
REAL with CMake FetchContent instead (`make build FETCH=1`, or
`-DSCILEX_FETCH_DEPS=ON`); point it at a remote and pin a tag with
`-DSCILEX_REAL_REPO=https://… -DSCILEX_REAL_TAG=v2026.6.6`.

## Build

```bash
make test        # build and run the test suite
make coverage    # line-coverage summary + HTML report
make sanitize    # tests under AddressSanitizer + UndefinedBehaviorSanitizer
make lint        # clang-tidy
make format      # uncrustify, in place
make doc         # API reference (Doxygen) with embedded coverage
```

Override the compiler with `make test CXX=g++-14`.

`scilex::scilex` is the CMake target, available through `add_subdirectory`,
`FetchContent`, or an installed config package. The config calls
`find_dependency(real)`, so installing REAL's config package alongside (on the
same prefix) makes the whole chain resolve from one `find_package`:

```cmake
# With REAL and SciLex installed under <prefix>:
find_package(scilex CONFIG REQUIRED)  # pulls in real:: transitively
target_link_libraries(app PRIVATE scilex::scilex)
```

## Releasing

`make release` computes the next calendar version `YYYY.M.PATCH` — the patch
resets each month, the first release of a month is `.0` (PEP 440 drops leading
zeros, so `2026.6.1`, never `2026.06.001`) — bumps it in `pyproject.toml` and
`python/scilex/__init__.py`, then commits, tags and pushes from a clean `main`.
The pushed tag drives `.github/workflows/release.yml`,
which checks the tag matches the version, builds abi3 wheels (`cibuildwheel`,
Linux/macOS/Windows) and the self-contained sdist, and publishes to PyPI via
Trusted Publishing (OIDC, no stored secret). It then populates the GitHub
`/releases` page with auto-generated notes and the built artifacts. The pushed tag
is the single thing that triggers a publish; SciLex remains consumable as source
too (sibling checkout / FetchContent / `get_include()`).
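
The `YYYY.M.PATCH` rule described above can be sketched as a small function. This is an assumption about the logic, written only from the prose; the actual `make release` recipe is not part of this diff, and `next_version` is a hypothetical name.

```python
from datetime import date


def next_version(previous: str, today: date) -> str:
    """Next YYYY.M.PATCH version after `previous`, as of `today`."""
    year, month, patch = (int(part) for part in previous.split("."))
    if (year, month) == (today.year, today.month):
        # Same calendar month: only the patch counter moves.
        return f"{year}.{month}.{patch + 1}"
    # First release of a new month: the patch counter restarts at 0.
    return f"{today.year}.{today.month}.0"


print(next_version("2026.6.0", date(2026, 6, 19)))
print(next_version("2026.6.3", date(2026, 7, 2)))
```

Formatting the integers directly, without zero padding, keeps the result in the normalized form that PEP 440 would produce anyway.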

**One-time PyPI setup.** Publishing needs a PyPI
[Trusted Publisher](https://docs.pypi.org/trusted-publishers/) configured once for
the project (publisher `RECHE23/scilex`, workflow `release.yml`, environment
`pypi`) and a matching `pypi` GitHub environment — no API token is stored.

## Python binding

SciLex ships an abi3 CPython extension (use the C++ lexer from Python).
`pip install scilex` installs one `cp310-abi3` wheel per platform (CPython 3.10+;
the self-contained sdist compiles where no wheel matches, pulling REAL's headers
from the `real-regex` build dependency). For a source checkout:

```bash
make python         # build the extension in place
make python-test    # run the binding test suite
```

```python
import scilex
lx = scilex.Lexer([
    (0, r"\s+", True),                      # (kind, pattern, skip) — skipped
    (1, r"[0-9]+", False),                  # number
    (2, r"[A-Za-z_][A-Za-z0-9_]*", False),  # identifier
])

# Eager: a list of rich Token objects (kind, lexeme, structured position).
[(t.kind, t.lexeme) for t in lx.tokenize("foo 42")]  # [(2, 'foo'), (1, '42')]

# Lazy: a generator yielding one Token at a time — nothing else is held.
for tok in lx.scan("foo 42"):
    tok.kind, tok.lexeme, tok.position.line, tok.position.column

# A lexical error carries the failing position.
try:
    lx.tokenize("foo @")
except scilex.error as e:
    e.position  # Position(line=1, column=5, offset=4)

# eof=True appends a terminal END_OF_INPUT token (a parser always has a token).
lx.tokenize("42", eof=True)[-1].kind == scilex.END_OF_INPUT
```
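
Why a terminal token helps a parser can be shown with a tiny cursor that never runs off the end of the stream. The sketch works on plain `(kind, lexeme)` tuples so it runs without the binding; `Cursor`, `parse_sum` and the stand-in value of `END_OF_INPUT` are assumptions made for the example.

```python
END_OF_INPUT = -1  # stand-in value; the binding exposes scilex.END_OF_INPUT


class Cursor:
    """Minimal parser cursor over an END_OF_INPUT-terminated token list."""

    def __init__(self, tokens):
        assert tokens and tokens[-1][0] == END_OF_INPUT
        self._tokens = tokens
        self._index = 0

    @property
    def current(self):
        return self._tokens[self._index]

    def advance(self):
        token = self.current
        # Never step past the sentinel, so `current` is always valid.
        if token[0] != END_OF_INPUT:
            self._index += 1
        return token


def parse_sum(cursor, number_kind=1):
    """Sum a flat run of number tokens, stopping at the first other kind."""
    total = 0
    while cursor.current[0] == number_kind:
        total += int(cursor.advance()[1])
    return total


cursor = Cursor([(1, "4"), (1, "38"), (END_OF_INPUT, "")])
print(parse_sum(cursor))
```

The loop in `parse_sum` needs no bounds check: the sentinel's kind differs from every real kind, so the comparison itself ends the loop.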

For indentation-significant languages, `Layout` rewrites an `eof=True` token
stream with `NEWLINE` / `INDENT` / `DEDENT` tokens read from each line's leading
column:

```python
lx = scilex.Lexer([(0, r"\s+", True), (1, r"\w+", False), (2, r":", False)])
laid = scilex.Layout().apply(lx.tokenize("if x:\n a\nb", eof=True))
[t.kind for t in laid]
# [1, 1, 2, NEWLINE, INDENT, 1, NEWLINE, DEDENT, 1, NEWLINE, END_OF_INPUT]
```
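
The rewrite itself is short enough to follow in pure Python. The sketch below transcribes the algorithm of `include/scilex/layout.hpp` from this release onto simplified `(kind, line, column)` tuples; the `Tok` type, the string-valued kind constants and the use of `ValueError` in place of `layout_error` are assumptions made so the example runs without the binding.

```python
from typing import NamedTuple

# Stand-in kinds; the binding's own constants and their values are not assumed.
NEWLINE, INDENT, DEDENT, END_OF_INPUT = "NEWLINE", "INDENT", "DEDENT", "END_OF_INPUT"


class Tok(NamedTuple):
    kind: object
    line: int    # 1-based
    column: int  # 1-based, in bytes


def layout(tokens):
    """Insert NEWLINE / INDENT / DEDENT into an END_OF_INPUT-terminated stream."""
    out, levels = [], [0]
    previous_line = None
    end = Tok(END_OF_INPUT, 1, 1)
    for tok in tokens:
        if tok.kind == END_OF_INPUT:
            end = tok  # remember its position; the terminal is re-emitted last
            continue
        if tok.line != previous_line:  # first token of a new logical line
            if previous_line is not None:
                out.append(Tok(NEWLINE, tok.line, tok.column))
            width = tok.column - 1
            if width > levels[-1]:
                levels.append(width)
                out.append(Tok(INDENT, tok.line, tok.column))
            else:
                while width < levels[-1]:
                    levels.pop()
                    out.append(Tok(DEDENT, tok.line, tok.column))
                if width != levels[-1]:
                    raise ValueError(f"inconsistent indentation at line {tok.line}")
        out.append(tok)
        previous_line = tok.line
    if previous_line is not None:
        out.append(Tok(NEWLINE, end.line, end.column))
    while levels[-1] > 0:  # close every block still open at end of input
        levels.pop()
        out.append(Tok(DEDENT, end.line, end.column))
    out.append(Tok(END_OF_INPUT, end.line, end.column))
    return out


stream = [Tok(1, 1, 1), Tok(1, 1, 4), Tok(2, 1, 5), Tok(1, 2, 2), Tok(1, 3, 1),
          Tok(END_OF_INPUT, 3, 2)]
print([t.kind for t in layout(stream)])
```

The `stream` above mirrors the `"if x:\n a\nb"` example: the one-column indent on line 2 opens a block and the return to column 1 on line 3 closes it. Only token positions are consulted, which is why the base lexer can keep skipping whitespace.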

`scilex.get_include()` returns the header directory so a C++ project can compile
against SciLex located through its Python install (add `real.get_include()` too,
since SciLex's headers include REAL's).

## Example

```cpp
#include <scilex/scilex.hpp>
#include <utility>
#include <vector>

enum kind { WS, KW_IF, ID, NUM, PLUS };

int main()
{
    std::vector<scilex::rule> rules;
    rules.push_back({WS,    real::regex("\\s+"), true}); // skipped
    rules.push_back({KW_IF, real::regex("if")});         // before ID: wins ties
    rules.push_back({ID,    real::regex("[a-z]+")});
    rules.push_back({NUM,   real::regex("[0-9]+")});
    rules.push_back({PLUS,  real::regex("\\+")});

    const scilex::lexer lexer(std::move(rules));

    // Eager: all tokens in a vector.
    for (const scilex::token& t : lexer.tokenize("if x + 42")) {
        // t.kind, t.lexeme, t.start.{offset,line,column}
    }

    // Lazy: one token at a time, nothing else materialized.
    for (const scilex::token& t : lexer.scan("if x + 42")) {
        // ...
    }
}
```

## Performance

See [BENCHMARKS.md](BENCHMARKS.md) for a reproducible, honest baseline (`make
bench`). The short of it: on benign input Python's `re` is faster (a mature C
backtracking engine), but SciLex is **linear-time and ReDoS-safe by construction**
— on a pathological pattern like `(a+)+b` it stays flat (~78 µs at 1000 chars)
where `re` explodes exponentially (seconds, then never finishing). SciLex trades
raw throughput on easy inputs for a guarantee that holds on every input.

## License

MIT — see [LICENSE](LICENSE).

## Author

René Chenard — rene.chenard@gmail.com
scilex-2026.6.0/include/scilex/layout.hpp
ADDED
|
@@ -0,0 +1,125 @@
/*!
 * \file layout.hpp
 * \brief Optional indentation layout: insert NEWLINE / INDENT / DEDENT tokens.
 *
 * Some languages (Python-like, e.g. SciLang) make indentation significant. This
 * opt-in pass turns a flat token stream into a layout-aware one: it inserts a
 * \ref scilex::newline at each logical line end, and \ref scilex::indent /
 * \ref scilex::dedent where the leading indentation changes.
 *
 * It works purely from token **positions** — every \ref scilex::token already
 * carries its source line and (byte) column — so the base lexer needs no change
 * and may keep skipping whitespace. Lines with no token (blank or
 * comment-only) carry no structure and are naturally ignored.
 *
 * Indentation width is the byte column of a line's first token (tabs and spaces
 * each count as one column; v1 does not police mixed tabs/spaces, and there is
 * no implicit line continuation inside brackets).
 *
 * Input must be an end-of-input-terminated token sequence (the lexer's
 * `eof_policy::append`); the terminal \ref scilex::end_of_input is preserved.
 */
#ifndef SCILEX_LAYOUT_HPP
#define SCILEX_LAYOUT_HPP

#include <cstddef>
#include <limits>
#include <span>
#include <stdexcept>
#include <string>
#include <vector>

#include "token.hpp"

namespace scilex {

//! \brief Reserved kind: end of a logical line.
inline constexpr int newline {std::numeric_limits<int>::min() + 1};
//! \brief Reserved kind: indentation increased (start of a deeper block).
inline constexpr int indent {std::numeric_limits<int>::min() + 2};
//! \brief Reserved kind: indentation decreased (end of a block).
inline constexpr int dedent {std::numeric_limits<int>::min() + 3};

/*!
 * \brief Thrown when a line's indentation matches no enclosing level.
 */
class layout_error : public std::runtime_error
{
public:

    //! \brief Builds the error. \param[in] message Cause. \param[in] where Position.
    layout_error(const std::string& message,
                 position where)
        : std::runtime_error(message),
          where_(where)
    {}

    //! \brief Returns the position of the offending line.
    [[nodiscard]] position where() const noexcept
    {
        return where_;
    }

private:

    position where_; //!< Where the indentation was inconsistent.
};

/*!
 * \brief Rewrites \p tokens with NEWLINE / INDENT / DEDENT inserted.
 *
 * \param[in] tokens An end-of-input-terminated token sequence.
 * \return The layout-aware token sequence (still end-of-input-terminated).
 * \throws layout_error If a line dedents to an indentation that no open block
 * used.
 */
[[nodiscard]] inline std::vector<token> layout(std::span<const token> tokens)
{
    std::vector<token> out;
    std::vector<std::size_t> levels {0};
    bool started {false};
    std::size_t previous_line {0};
    position end_position {0, 1, 1};

    for (const token& current : tokens) {
        if (current.kind == end_of_input) {
            end_position = current.start; // remember; emit our own terminal at the end
            continue;
        }
        if (!started || current.start.line != previous_line) {
            if (started) {
                out.push_back(token {newline, {}, current.start});
            }
            const std::size_t width {current.start.column - 1};
            if (width > levels.back()) {
                levels.push_back(width);
                out.push_back(token {indent, {}, current.start});
            }
            else {
                while (width < levels.back()) {
                    levels.pop_back();
                    out.push_back(token {dedent, {}, current.start});
                }
                if (width != levels.back()) {
                    throw layout_error("inconsistent indentation", current.start);
                }
            }
            started = true;
        }
        out.push_back(current);
        previous_line = current.start.line;
    }

    if (started) {
        out.push_back(token {newline, {}, end_position});
    }
    while (levels.back() > 0) {
        levels.pop_back();
        out.push_back(token {dedent, {}, end_position});
    }
    out.push_back(token {end_of_input, {}, end_position});
    return out;
}
} // namespace scilex

#endif // SCILEX_LAYOUT_HPP