justhtml 0.24.0__py3-none-any.whl → 0.38.0__py3-none-any.whl
- justhtml/__init__.py +44 -2
- justhtml/__main__.py +45 -9
- justhtml/constants.py +12 -0
- justhtml/errors.py +8 -3
- justhtml/linkify.py +438 -0
- justhtml/node.py +54 -35
- justhtml/parser.py +105 -38
- justhtml/sanitize.py +511 -282
- justhtml/selector.py +3 -1
- justhtml/serialize.py +398 -72
- justhtml/tokenizer.py +121 -21
- justhtml/tokens.py +21 -3
- justhtml/transforms.py +2568 -0
- justhtml/treebuilder.py +247 -190
- justhtml/treebuilder_modes.py +108 -102
- {justhtml-0.24.0.dist-info → justhtml-0.38.0.dist-info}/METADATA +28 -7
- justhtml-0.38.0.dist-info/RECORD +26 -0
- {justhtml-0.24.0.dist-info → justhtml-0.38.0.dist-info}/licenses/LICENSE +1 -1
- justhtml-0.24.0.dist-info/RECORD +0 -24
- {justhtml-0.24.0.dist-info → justhtml-0.38.0.dist-info}/WHEEL +0 -0
- {justhtml-0.24.0.dist-info → justhtml-0.38.0.dist-info}/entry_points.txt +0 -0
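The per-file counts in the list above can be totalled mechanically. This is a minimal sketch, assuming each summary line ends in `+<added> -<removed>` as shown; the helper name `total_changes` and the three sample lines are picked for illustration only.

```python
import re

# Matches the trailing "+<added> -<removed>" pair on a file summary line.
COUNT_RE = re.compile(r"\+(\d+) -(\d+)\s*$")


def total_changes(lines):
    """Sum added and removed line counts over diff summary lines."""
    added = removed = 0
    for line in lines:
        match = COUNT_RE.search(line)
        if match is None:
            continue  # skip lines that carry no counts
        added += int(match.group(1))
        removed += int(match.group(2))
    return added, removed


# Three entries copied from the file list above.
sample = [
    "- justhtml/linkify.py +438 -0",
    "- justhtml/sanitize.py +511 -282",
    "- justhtml/transforms.py +2568 -0",
]
print(total_changes(sample))  # (3517, 282)
```

These three files alone account for most of the added lines in this release.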
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: justhtml
-Version: 0.
+Version: 0.38.0
 Summary: A pure Python HTML5 parser that just works.
 Project-URL: Homepage, https://github.com/emilstenstrom/justhtml
 Project-URL: Issues, https://github.com/emilstenstrom/justhtml/issues
@@ -52,7 +52,7 @@ A pure Python HTML5 parser that just works. No C extensions to compile. No syste
 # Requires: [intentionally left blank]
 ```
 
-- **Just... Secure 🔒** — Safe-by-default
+- **Just... Secure 🔒** — Safe-by-default sanitization at construction time — built-in Bleach-style allowlist sanitization on `JustHTML(...)` (disable with `safe=False`). Can sanitize inline CSS rules. ([Sanitization & Security](docs/sanitization.md))
 
 ```python
 JustHTML(
@@ -74,12 +74,29 @@ A pure Python HTML5 parser that just works. No C extensions to compile. No syste
 # => <p class="x">Hi</p>
 ```
 
+- **Just... Transform 🏗️** — Built-in DOM transforms for: drop/unwrap nodes, rewrite attributes, linkify text, and compose safe pipelines. ([Transforms](docs/transforms.md))
+
+```python
+from justhtml import JustHTML, Linkify, SetAttrs, Unwrap
+
+doc = JustHTML(
+    "<p>Hello <span class=\"x\">world</span> example.com</p>",
+    transforms=[
+        Unwrap("span.x"),
+        Linkify(),
+        SetAttrs("a", rel="nofollow"),
+    ],
+)
+print(doc.to_html(pretty=False))
+# => <p>Hello world <a href="https://example.com" rel="nofollow">example.com</a></p>
+```
+
 - **Just... Fast Enough ⚡** — Fast for the common case (fastest pure-Python HTML5 parser available); for terabytes, use a C/Rust parser like `html5ever`. ([Benchmarks](benchmarks/performance.py))
 
 ```bash
-
-| python -m justhtml - > /dev/null
-# 0.
+/usr/bin/time -f '%e s' bash -lc \
+"curl -Ls https://en.wikipedia.org/wiki/HTML | python -m justhtml - > /dev/null"
+# 0.41 s
 ```
 
 ## Comparison
@@ -95,7 +112,7 @@ A pure Python HTML5 parser that just works. No C extensions to compile. No syste
 | **`selectolax`**<br>Python wrapper of C-based Lexbor | 🟡 68% | 🚀 Very Fast | ✅ CSS selectors | ❌ Needs sanitization | Very fast but less compliant. |
 | **`html.parser`**<br>Python stdlib | 🔴 4% | ⚡ Fast | ❌ None | ❌ Needs sanitization | Standard library. Chokes on malformed HTML. |
 | **`BeautifulSoup`**<br>Pure Python | 🔴 4% (default) | 🐢 Slow | 🟡 Custom API | ❌ Needs sanitization | Wraps `html.parser` (default). Can use lxml or html5lib. |
-| **`lxml`**<br>Python wrapper of C-based libxml2 | 🔴 1% | 🚀 Very Fast | 🟡 XPath |
+| **`lxml`**<br>Python wrapper of C-based libxml2 | 🔴 1% | 🚀 Very Fast | 🟡 XPath | ❌ Needs sanitization | Fast but not HTML5 compliant. Don't use the old lxml.html.clean module! |
 
 [1]: Parser compliance scores are from a strict run of the [html5lib-tests](https://github.com/html5lib/html5lib-tests) tree-construction fixtures (1,743 non-script tests). See [docs/correctness.md](docs/correctness.md) for details.
 
@@ -170,9 +187,13 @@ A pure Python HTML5 parser that just works. No C extensions to compile. No syste
 
 - **Just... Correct ✅** — Spec-perfect HTML5 parsing with browser-grade error recovery — passes the official 9k+ [html5lib-tests](https://github.com/html5lib/html5lib-tests) suite, with 100% line+branch coverage. ([Correctness](/EmilStenstrom/justhtml/blob/main/docs/correctness.md))
 - **Just... Python 🐍** — Pure Python, zero dependencies — no C extensions or system libraries, easy to debug, and works anywhere Python runs (including PyPy and Pyodide). ([Quickstart](/EmilStenstrom/justhtml/blob/main/docs/quickstart.md))
-- **Just... Secure 🔒** — Safe-by-default
+- **Just... Secure 🔒** — Safe-by-default sanitization at construction time — built-in Bleach-style allowlist sanitization on `JustHTML(...)` (disable with `safe=False`), plus URL/CSS rules. ([Sanitization & Security](/EmilStenstrom/justhtml/blob/main/docs/sanitization.md))
 ```
 
+## Security
+
+For security policy and vulnerability reporting, please see [SECURITY.md](SECURITY.md).
+
 ## Contributing
 
 See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.
@@ -0,0 +1,26 @@
+justhtml/__init__.py,sha256=cyFtwOsxM_m-xG3vNdO4YvBQvEp0HOWUN3EnfGwGotc,1183
+justhtml/__main__.py,sha256=aupMvpS2_C4b11GcSNm5_JdlDkllaQLE3_CR8ttUmmk,6559
+justhtml/constants.py,sha256=85cNNHS3fCSwvFGsQSV7uk_G1Ce0llHBkg3sW8k7WZ8,11881
+justhtml/context.py,sha256=Ac4mV-a3ZgJILQbstFu-EB6bRA5oYlSkHqpTxMlMfk0,293
+justhtml/encoding.py,sha256=9mscoXtBb57zehG_BxzN6aTTJHaNfywk5gwxrnH92K8,11310
+justhtml/entities.py,sha256=_cQ3MBrV2hJwAUPVF8JJf7zbrdrxycKOe3Z_thg93Ng,11161
+justhtml/errors.py,sha256=XVTgiXmfh1tX3PjGKBuhiCQ-72gNVuimBUXexHW9pKo,11045
+justhtml/linkify.py,sha256=qTrEJ4UeSC8fVbryst6HfZkgAs69YvaNWkM2sB3zS74,14112
+justhtml/node.py,sha256=A9IetRR8_MC2QCmmcEiAV5nIg97rorUnlDZ9-LfkjOM,27857
+justhtml/parser.py,sha256=STLG33TkMvb0Z_RH5gUDmcsEjWF_QQ2aabRDXhuUF1I,9984
+justhtml/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+justhtml/sanitize.py,sha256=D0aOgy_iFtCnyNZjFaxoAoZvLvoHhUWgeL02p_M9d7k,33188
+justhtml/selector.py,sha256=FLW-rOZwJxGf4uD6ZdHYI7QcEGzstBOOrf-Ubo-37uA,36015
+justhtml/serialize.py,sha256=AZIGuIFJ8oLfpzz938svNN4wGgxYNA_EGheMzuwoi2s,32766
+justhtml/stream.py,sha256=n8pKtVAivG0VerCWEcXSEBwzj8Tm1ltEAL7F46RGUVM,3431
+justhtml/tokenizer.py,sha256=_v3dpjAuq89gjPJMbZLOKgrTc6GmV-QhuDSKGQA_3Pk,107171
+justhtml/tokens.py,sha256=mk3VBdiula7voCKahRFJ45F14_Qh9Ega-XQ4wwavjMg,7695
+justhtml/transforms.py,sha256=ptHXJ26AtbGTz0zZNIZQP47JphbATme3TyKK7x-qzw4,95289
+justhtml/treebuilder.py,sha256=7RQCtHhRTj4uGlALPZtIzVD-ZoEK0ezyn1-Tto9yw3k,60972
+justhtml/treebuilder_modes.py,sha256=8xupHR4IMaCyLGwKX6lcGDMwalMFlgne3B_fhMvyAE0,98887
+justhtml/treebuilder_utils.py,sha256=LjK9tg9sNYR-sJdXKemJCzzzgh6lQW1KBqyvhpWtaoQ,2912
+justhtml-0.38.0.dist-info/METADATA,sha256=yU5XJ-gqssbudTodF55FvNeRQchuPgTu3bFvI7Y9OuU,10171
+justhtml-0.38.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
+justhtml-0.38.0.dist-info/entry_points.txt,sha256=UN06mPn7J0cBM1dqyf245FvmU9mF3ivgplSr5ppdp6g,52
+justhtml-0.38.0.dist-info/licenses/LICENSE,sha256=_IBvKQiU5PIZRnE1-yHzMEj41agX8PgoQkbXLaKdVy4,1256
+justhtml-0.38.0.dist-info/RECORD,,
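Each RECORD line has the form `path,sha256=<digest>,size`. In the wheel format, the digest is the URL-safe base64 encoding of the raw SHA-256 hash with the trailing `=` padding removed. Below is a minimal sketch of recomputing one; the helper name `record_digest` is an assumption for illustration, not part of any packaging tool. The check uses the `justhtml/py.typed` entry, whose recorded size is 0.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the SHA-256 digest of data in the form used by wheel RECORD files."""
    raw = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 without the trailing '=' padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# py.typed is listed with size 0, so its digest is that of empty input.
recorded = "47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU"
print(record_digest(b"") == recorded)  # True
```

To check any other entry, pass the file's bytes from the unpacked wheel and compare the result with the value after `sha256=`.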
@@ -2,7 +2,7 @@ MIT License
 
 Copyright (c) 2025 Emil Stenström (JustHTML)
 Copyright (c) 2014-2017, The html5ever Project Developers (html5ever inspiration)
-Copyright (c) 2006-2013 James Graham,
+Copyright (c) 2006-2013 James Graham, Sam Sneddon, and
 other contributors (html5lib-tests)
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
justhtml-0.24.0.dist-info/RECORD
DELETED
@@ -1,24 +0,0 @@
-justhtml/__init__.py,sha256=fDm2MolicILd_aORC05rrF0VpROIT7U5DbgyzDwyPhs,586
-justhtml/__main__.py,sha256=2qH55lmN9F14K3bqljm5B0YTvSdTC-r1U5BoymAI-uw,5204
-justhtml/constants.py,sha256=-UATvXXQ7ueFWxJHW79c2eMmMWaSKoqwwcNIGesTAj0,11603
-justhtml/context.py,sha256=Ac4mV-a3ZgJILQbstFu-EB6bRA5oYlSkHqpTxMlMfk0,293
-justhtml/encoding.py,sha256=9mscoXtBb57zehG_BxzN6aTTJHaNfywk5gwxrnH92K8,11310
-justhtml/entities.py,sha256=_cQ3MBrV2hJwAUPVF8JJf7zbrdrxycKOe3Z_thg93Ng,11161
-justhtml/errors.py,sha256=cxoYDDOxGoC_sCIP85pHSDWb1Pm_sfZLALWiTMhb8kc,10754
-justhtml/node.py,sha256=UnavBYOa_T7Yr7CVcb7tK2nVmt3s9Rs0nTlp9xNioMY,26916
-justhtml/parser.py,sha256=huuBeS9bQSjfCyFbfYiLEHVxHLE0XTj3V96rnzb6v_4,6364
-justhtml/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-justhtml/sanitize.py,sha256=efa56hNQct4a5pitj1JunPlKpsyz3DKkg7XFNEDbiXM,24884
-justhtml/selector.py,sha256=ZQDOgHlmPHSBRBZmRVTCDWOqHniy8iZew1MzonAUt3s,35836
-justhtml/serialize.py,sha256=3LSpyRG2IGIvhLsQFstUafBhLIIZiuM53uWw820gBb4,20475
-justhtml/stream.py,sha256=n8pKtVAivG0VerCWEcXSEBwzj8Tm1ltEAL7F46RGUVM,3431
-justhtml/tokenizer.py,sha256=wSiLfu0KtfH6XDV8XN2FsUBQhV1En_zGKb9itdiGa8w,103018
-justhtml/tokens.py,sha256=7SGTlB9mjMFU2QvBPnnGiMJXLk1oiEMvvCmbMUXE_Kc,7051
-justhtml/treebuilder.py,sha256=DZcrEW6p1IkC_jPu0Q0SsgITDIP2G2GVJJGbo2jZdkw,57712
-justhtml/treebuilder_modes.py,sha256=84NzalfDmb6_hwd6a3nBst3S_q1CndEopCt5wCRcpUA,97691
-justhtml/treebuilder_utils.py,sha256=LjK9tg9sNYR-sJdXKemJCzzzgh6lQW1KBqyvhpWtaoQ,2912
-justhtml-0.24.0.dist-info/METADATA,sha256=7wGboRVZbhcPKMinUzk-d_dRUhEPRZk8tSlgusTWeLM,9520
-justhtml-0.24.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
-justhtml-0.24.0.dist-info/entry_points.txt,sha256=UN06mPn7J0cBM1dqyf245FvmU9mF3ivgplSr5ppdp6g,52
-justhtml-0.24.0.dist-info/licenses/LICENSE,sha256=QGxhcdDa0J9T8bc3rQFQFR0sY9zPFwRw2X5h3NgBDe0,1261
-justhtml-0.24.0.dist-info/RECORD,,
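Comparing the two RECORD listings shows which files are byte-identical across the two versions: an entry whose digest is the same in both did not change. This is a minimal sketch using the standard `csv` module, with a few entries copied from the listings above; the function names `parse_record` and `unchanged_paths` are assumptions for illustration.

```python
import csv
import io


def parse_record(text: str) -> dict[str, str]:
    """Map each path in a RECORD listing to its recorded digest field."""
    rows = csv.reader(io.StringIO(text))
    return {row[0]: row[1] for row in rows if len(row) >= 2}


def unchanged_paths(old: str, new: str) -> list[str]:
    """Return paths present in both listings with identical, non-empty digests."""
    old_map, new_map = parse_record(old), parse_record(new)
    return sorted(
        path
        for path, digest in old_map.items()
        if digest and new_map.get(path) == digest
    )


# Entries copied from the 0.24.0 listing.
OLD = """\
justhtml/context.py,sha256=Ac4mV-a3ZgJILQbstFu-EB6bRA5oYlSkHqpTxMlMfk0,293
justhtml/errors.py,sha256=cxoYDDOxGoC_sCIP85pHSDWb1Pm_sfZLALWiTMhb8kc,10754
justhtml/stream.py,sha256=n8pKtVAivG0VerCWEcXSEBwzj8Tm1ltEAL7F46RGUVM,3431
"""
# The same paths as they appear in the 0.38.0 listing.
NEW = """\
justhtml/context.py,sha256=Ac4mV-a3ZgJILQbstFu-EB6bRA5oYlSkHqpTxMlMfk0,293
justhtml/errors.py,sha256=XVTgiXmfh1tX3PjGKBuhiCQ-72gNVuimBUXexHW9pKo,11045
justhtml/stream.py,sha256=n8pKtVAivG0VerCWEcXSEBwzj8Tm1ltEAL7F46RGUVM,3431
"""
print(unchanged_paths(OLD, NEW))
# ['justhtml/context.py', 'justhtml/stream.py']
```

The RECORD file's own entry carries an empty digest, so the sketch skips entries without one rather than reporting them as unchanged.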
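{justhtml-0.24.0.dist-info → justhtml-0.38.0.dist-info}/WHEEL
File without changes
{justhtml-0.24.0.dist-info → justhtml-0.38.0.dist-info}/entry_points.txt
File without changes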