smol-html 0.1.5__tar.gz → 0.1.7__tar.gz
- {smol_html-0.1.5 → smol_html-0.1.7}/PKG-INFO +50 -6
- {smol_html-0.1.5 → smol_html-0.1.7}/README.md +65 -21
- {smol_html-0.1.5 → smol_html-0.1.7}/pyproject.toml +1 -1
- {smol_html-0.1.5 → smol_html-0.1.7}/.gitignore +0 -0
- {smol_html-0.1.5 → smol_html-0.1.7}/LICENSE +0 -0
- {smol_html-0.1.5 → smol_html-0.1.7}/smol.png +0 -0
{smol_html-0.1.5 → smol_html-0.1.7}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: smol-html
-Version: 0.1.
+Version: 0.1.7
 Summary: Small, dependable HTML cleaner/minifier with sensible defaults
 Project-URL: Homepage, https://github.com/NosibleAI/smol-html
 Project-URL: Repository, https://github.com/NosibleAI/smol-html
@@ -35,6 +35,21 @@ Description-Content-Type: text/markdown
 
 Small, dependable HTML cleaner/minifier with sensible defaults.
 
+---
+
+## Credits
+
+This package is built on top of excellent open-source projects:
+
+- [**minify-html**](https://github.com/wilsonzlin/minify-html) (MIT License) – High-performance HTML minifier.
+- [**lxml**](https://lxml.de/) (BSD-like License) – Fast and feature-rich XML/HTML parsing and cleaning.
+- [**BeautifulSoup4**](https://www.crummy.com/software/BeautifulSoup/) (MIT License) – Python library for parsing and navigating HTML/XML.
+- [**Brotli**](https://github.com/google/brotli) (MIT License) – Compression algorithm developed by Google.
+
+We are grateful to the maintainers and contributors of these projects — `smol-html` wouldn’t be possible without their work.
+
+---
+
 ## Motivation
 
 Nosible is a search engine, which means we need to store and process a very large number of webpages. To make this tractable, we strip out visual chrome and other non-essential components that don’t matter for downstream tasks (indexing, ranking, retrieval, and LLM pipelines) while preserving the important content and structure. This package cleans and minifies HTML, greatly reducing size on disk; combined with Brotli compression (by Google), the savings are even larger.
@@ -125,11 +140,17 @@ out = cleaner.make_smol(raw_html="<p>Hi</p>")
 
 ## Compressed Bytes Output
 
-Produce compressed bytes using Brotli with `make_smol_bytes
+Produce compressed bytes using Brotli with `make_smol_bytes`.
+
+- By default, the compressed bytes are URL-safe Base64 encoded (`base64_encode=True`).
+- If you enable Base64, you must decode before Brotli-decompressing.
+- You can disable Base64 by passing `base64_encode=False` and decompress directly.
 
+Default (Base64-encoded) output:
 
 ```python
 from smol_html import SmolHtmlCleaner
+import base64
 import brotli  # only needed if you want to decompress here in the example
 
 html = """
@@ -145,15 +166,31 @@ cleaner = SmolHtmlCleaner()
 # Get compressed bytes (quality 11 is strong compression)
 compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
 
-#
-
+# Because Base64 is enabled by default, decode before decompressing
+decoded = base64.urlsafe_b64decode(compressed)
+decompressed = brotli.decompress(decoded).decode("utf-8")
 print(decompressed)
 
-# Or write compressed output directly to a file
-with open("page.html.br", "wb") as f:
+# Or write Base64-encoded compressed output directly to a file
+with open("page.html.br.b64", "wb") as f:
     f.write(compressed)
 ```
 
+Disable Base64 and decompress directly:
+
+```python
+from smol_html import SmolHtmlCleaner
+import brotli
+
+cleaner = SmolHtmlCleaner()
+compressed_raw = cleaner.make_smol_bytes(
+    raw_html="<p>Hi</p>",
+    compression_level=11,
+    base64_encode=False,
+)
+print(brotli.decompress(compressed_raw).decode("utf-8"))
+```
+
 ## Parameter Reference
 
 To improve readability, the reference is split into two tables:
@@ -213,3 +250,10 @@ To improve readability, the reference is split into two tables:
 | `remove_unknown_tags` | `bool` | `True` |
 | `safe_attrs_only` | `bool` | `True` |
 | `safe_attrs` | `set[str]` | curated set |
+
+### `make_smol_bytes` Options
+
+| Parameter | Type | Default |
+|---|---|---|
+| `compression_level` | `int` | `4` |
+| `base64_encode` | `bool` | `True` |
{smol_html-0.1.5 → smol_html-0.1.7}/README.md
@@ -5,6 +5,21 @@
 
 Small, dependable HTML cleaner/minifier with sensible defaults.
 
+---
+
+## Credits
+
+This package is built on top of excellent open-source projects:
+
+- [**minify-html**](https://github.com/wilsonzlin/minify-html) (MIT License) – High-performance HTML minifier.
+- [**lxml**](https://lxml.de/) (BSD-like License) – Fast and feature-rich XML/HTML parsing and cleaning.
+- [**BeautifulSoup4**](https://www.crummy.com/software/BeautifulSoup/) (MIT License) – Python library for parsing and navigating HTML/XML.
+- [**Brotli**](https://github.com/google/brotli) (MIT License) – Compression algorithm developed by Google.
+
+We are grateful to the maintainers and contributors of these projects — `smol-html` wouldn’t be possible without their work.
+
+---
+
 ## Motivation
 
 Nosible is a search engine, which means we need to store and process a very large number of webpages. To make this tractable, we strip out visual chrome and other non-essential components that don’t matter for downstream tasks (indexing, ranking, retrieval, and LLM pipelines) while preserving the important content and structure. This package cleans and minifies HTML, greatly reducing size on disk; combined with Brotli compression (by Google), the savings are even larger.
@@ -17,22 +32,22 @@ Nosible is a search engine, which means we need to store and process a very larg
 pip install smol-html
 ```
 
-### ⚡ Installing with uv
-
-```bash
-uv pip install smol-html
-```
-
-### Requirements
-
-- Python: 3.9
-- Dependencies:
-  - beautifulsoup4>=4.0.1
-  - brotli>=0.5.2
-  - lxml[html-clean]>=1.3.2
-  - minify-html>=0.2.6
-
-## Quick Start
+### ⚡ Installing with uv
+
+```bash
+uv pip install smol-html
+```
+
+### Requirements
+
+- Python: 3.9
+- Dependencies:
+  - beautifulsoup4>=4.0.1
+  - brotli>=0.5.2
+  - lxml[html-clean]>=1.3.2
+  - minify-html>=0.2.6
+
+## Quick Start
 
 Clean an HTML string (or page contents):
 
@@ -95,11 +110,17 @@ out = cleaner.make_smol(raw_html="<p>Hi</p>")
 
 ## Compressed Bytes Output
 
-Produce compressed bytes using Brotli with `make_smol_bytes
+Produce compressed bytes using Brotli with `make_smol_bytes`.
+
+- By default, the compressed bytes are URL-safe Base64 encoded (`base64_encode=True`).
+- If you enable Base64, you must decode before Brotli-decompressing.
+- You can disable Base64 by passing `base64_encode=False` and decompress directly.
 
+Default (Base64-encoded) output:
 
 ```python
 from smol_html import SmolHtmlCleaner
+import base64
 import brotli  # only needed if you want to decompress here in the example
 
 html = """
@@ -115,15 +136,31 @@ cleaner = SmolHtmlCleaner()
 # Get compressed bytes (quality 11 is strong compression)
 compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
 
-#
-
+# Because Base64 is enabled by default, decode before decompressing
+decoded = base64.urlsafe_b64decode(compressed)
+decompressed = brotli.decompress(decoded).decode("utf-8")
 print(decompressed)
 
-# Or write compressed output directly to a file
-with open("page.html.br", "wb") as f:
+# Or write Base64-encoded compressed output directly to a file
+with open("page.html.br.b64", "wb") as f:
     f.write(compressed)
 ```
 
+Disable Base64 and decompress directly:
+
+```python
+from smol_html import SmolHtmlCleaner
+import brotli
+
+cleaner = SmolHtmlCleaner()
+compressed_raw = cleaner.make_smol_bytes(
+    raw_html="<p>Hi</p>",
+    compression_level=11,
+    base64_encode=False,
+)
+print(brotli.decompress(compressed_raw).decode("utf-8"))
+```
+
 ## Parameter Reference
 
 To improve readability, the reference is split into two tables:
@@ -183,3 +220,10 @@ To improve readability, the reference is split into two tables:
 | `remove_unknown_tags` | `bool` | `True` |
 | `safe_attrs_only` | `bool` | `True` |
 | `safe_attrs` | `set[str]` | curated set |
+
+### `make_smol_bytes` Options
+
+| Parameter | Type | Default |
+|---|---|---|
+| `compression_level` | `int` | `4` |
+| `base64_encode` | `bool` | `True` |
File without changes
File without changes
File without changes
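
The decode order introduced by the README changes in this diff (URL-safe Base64 first, then Brotli) can be sketched as one small helper. This is a minimal sketch, not part of smol-html: `decode_smol_payload` and its `decompress` parameter are hypothetical names, and `zlib` stands in for Brotli only so the example runs with the standard library alone.

```python
import base64
import zlib
from typing import Callable


def decode_smol_payload(
    payload: bytes,
    base64_encoded: bool = True,
    decompress: Callable[[bytes], bytes] = zlib.decompress,
) -> str:
    """Undo optional URL-safe Base64, then decompress and decode as UTF-8.

    Hypothetical helper: `decompress` defaults to zlib (an assumption made
    so this sketch has no third-party dependency). For real output of
    `make_smol_bytes`, pass `brotli.decompress` instead.
    """
    raw = base64.urlsafe_b64decode(payload) if base64_encoded else payload
    return decompress(raw).decode("utf-8")


# Simulate both output modes, with zlib as the stand-in compressor.
html = "<p>Hi</p>"
compressed_raw = zlib.compress(html.encode("utf-8"))
compressed_b64 = base64.urlsafe_b64encode(compressed_raw)

# Base64 mode (mirrors the new default, base64_encode=True)
print(decode_smol_payload(compressed_b64))
# Raw mode (mirrors base64_encode=False)
print(decode_smol_payload(compressed_raw, base64_encoded=False))
```

With the real package, the equivalent call would presumably be `decode_smol_payload(compressed, decompress=brotli.decompress)`, with `base64_encoded` set to match the `base64_encode` value used when producing the bytes.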