smol-html 0.1.5__tar.gz → 0.1.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: smol-html
- Version: 0.1.5
+ Version: 0.1.7
  Summary: Small, dependable HTML cleaner/minifier with sensible defaults
  Project-URL: Homepage, https://github.com/NosibleAI/smol-html
  Project-URL: Repository, https://github.com/NosibleAI/smol-html
@@ -35,6 +35,21 @@ Description-Content-Type: text/markdown
 
  Small, dependable HTML cleaner/minifier with sensible defaults.
 
+ ---
+
+ ## Credits
+
+ This package is built on top of excellent open-source projects:
+
+ - [**minify-html**](https://github.com/wilsonzlin/minify-html) (MIT License) – High-performance HTML minifier.
+ - [**lxml**](https://lxml.de/) (BSD-like License) – Fast and feature-rich XML/HTML parsing and cleaning.
+ - [**BeautifulSoup4**](https://www.crummy.com/software/BeautifulSoup/) (MIT License) – Python library for parsing and navigating HTML/XML.
+ - [**Brotli**](https://github.com/google/brotli) (MIT License) – Compression algorithm developed by Google.
+
+ We are grateful to the maintainers and contributors of these projects — `smol-html` wouldn’t be possible without their work.
+
+ ---
+
  ## Motivation
 
  Nosible is a search engine, which means we need to store and process a very large number of webpages. To make this tractable, we strip out visual chrome and other non-essential components that don’t matter for downstream tasks (indexing, ranking, retrieval, and LLM pipelines) while preserving the important content and structure. This package cleans and minifies HTML, greatly reducing size on disk; combined with Brotli compression (by Google), the savings are even larger.
@@ -125,11 +140,17 @@ out = cleaner.make_smol(raw_html="<p>Hi</p>")
 
  ## Compressed Bytes Output
 
- Produce compressed bytes using Brotli with `make_smol_bytes`
+ Produce compressed bytes using Brotli with `make_smol_bytes`.
+
+ - By default, the compressed bytes are URL-safe Base64 encoded (`base64_encode=True`).
+ - If you enable Base64, you must decode before Brotli-decompressing.
+ - You can disable Base64 by passing `base64_encode=False` and decompress directly.
 
+ Default (Base64-encoded) output:
 
  ```python
  from smol_html import SmolHtmlCleaner
+ import base64
  import brotli # only needed if you want to decompress here in the example
 
  html = """
@@ -145,15 +166,31 @@ cleaner = SmolHtmlCleaner()
  # Get compressed bytes (quality 11 is strong compression)
  compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
 
- # Example: decompress back to text to inspect (optional)
- decompressed = brotli.decompress(compressed).decode("utf-8")
+ # Because Base64 is enabled by default, decode before decompressing
+ decoded = base64.urlsafe_b64decode(compressed)
+ decompressed = brotli.decompress(decoded).decode("utf-8")
  print(decompressed)
 
- # Or write compressed output directly to a file
- with open("page.html.br", "wb") as f:
+ # Or write Base64-encoded compressed output directly to a file
+ with open("page.html.br.b64", "wb") as f:
      f.write(compressed)
  ```
 
+ Disable Base64 and decompress directly:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+ import brotli
+
+ cleaner = SmolHtmlCleaner()
+ compressed_raw = cleaner.make_smol_bytes(
+     raw_html="<p>Hi</p>",
+     compression_level=11,
+     base64_encode=False,
+ )
+ print(brotli.decompress(compressed_raw).decode("utf-8"))
+ ```
+
  ## Parameter Reference
 
  To improve readability, the reference is split into two tables:
@@ -213,3 +250,10 @@ To improve readability, the reference is split into two tables:
  | `remove_unknown_tags` | `bool` | `True` |
  | `safe_attrs_only` | `bool` | `True` |
  | `safe_attrs` | `set[str]` | curated set |
+
+ ### `make_smol_bytes` Options
+
+ | Parameter | Type | Default |
+ |---|---|---|
+ | `compression_level` | `int` | `4` |
+ | `base64_encode` | `bool` | `True` |
@@ -5,6 +5,21 @@
 
  Small, dependable HTML cleaner/minifier with sensible defaults.
 
+ ---
+
+ ## Credits
+
+ This package is built on top of excellent open-source projects:
+
+ - [**minify-html**](https://github.com/wilsonzlin/minify-html) (MIT License) – High-performance HTML minifier.
+ - [**lxml**](https://lxml.de/) (BSD-like License) – Fast and feature-rich XML/HTML parsing and cleaning.
+ - [**BeautifulSoup4**](https://www.crummy.com/software/BeautifulSoup/) (MIT License) – Python library for parsing and navigating HTML/XML.
+ - [**Brotli**](https://github.com/google/brotli) (MIT License) – Compression algorithm developed by Google.
+
+ We are grateful to the maintainers and contributors of these projects — `smol-html` wouldn’t be possible without their work.
+
+ ---
+
  ## Motivation
 
  Nosible is a search engine, which means we need to store and process a very large number of webpages. To make this tractable, we strip out visual chrome and other non-essential components that don’t matter for downstream tasks (indexing, ranking, retrieval, and LLM pipelines) while preserving the important content and structure. This package cleans and minifies HTML, greatly reducing size on disk; combined with Brotli compression (by Google), the savings are even larger.
@@ -17,22 +32,22 @@ Nosible is a search engine, which means we need to store and process a very larg
  pip install smol-html
  ```
 
- ### ⚡ Installing with uv
-
- ```bash
- uv pip install smol-html
- ```
-
- ### Requirements
-
- - Python: 3.9
- - Dependencies:
- - beautifulsoup4>=4.0.1
- - brotli>=0.5.2
- - lxml[html-clean]>=1.3.2
- - minify-html>=0.2.6
-
- ## Quick Start
+ ### ⚡ Installing with uv
+
+ ```bash
+ uv pip install smol-html
+ ```
+
+ ### Requirements
+
+ - Python: 3.9
+ - Dependencies:
+ - beautifulsoup4>=4.0.1
+ - brotli>=0.5.2
+ - lxml[html-clean]>=1.3.2
+ - minify-html>=0.2.6
+
+ ## Quick Start
 
  Clean an HTML string (or page contents):
 
@@ -95,11 +110,17 @@ out = cleaner.make_smol(raw_html="<p>Hi</p>")
 
  ## Compressed Bytes Output
 
- Produce compressed bytes using Brotli with `make_smol_bytes`
+ Produce compressed bytes using Brotli with `make_smol_bytes`.
+
+ - By default, the compressed bytes are URL-safe Base64 encoded (`base64_encode=True`).
+ - If you enable Base64, you must decode before Brotli-decompressing.
+ - You can disable Base64 by passing `base64_encode=False` and decompress directly.
 
+ Default (Base64-encoded) output:
 
  ```python
  from smol_html import SmolHtmlCleaner
+ import base64
  import brotli # only needed if you want to decompress here in the example
 
  html = """
@@ -115,15 +136,31 @@ cleaner = SmolHtmlCleaner()
  # Get compressed bytes (quality 11 is strong compression)
  compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
 
- # Example: decompress back to text to inspect (optional)
- decompressed = brotli.decompress(compressed).decode("utf-8")
+ # Because Base64 is enabled by default, decode before decompressing
+ decoded = base64.urlsafe_b64decode(compressed)
+ decompressed = brotli.decompress(decoded).decode("utf-8")
  print(decompressed)
 
- # Or write compressed output directly to a file
- with open("page.html.br", "wb") as f:
+ # Or write Base64-encoded compressed output directly to a file
+ with open("page.html.br.b64", "wb") as f:
      f.write(compressed)
  ```
 
+ Disable Base64 and decompress directly:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+ import brotli
+
+ cleaner = SmolHtmlCleaner()
+ compressed_raw = cleaner.make_smol_bytes(
+     raw_html="<p>Hi</p>",
+     compression_level=11,
+     base64_encode=False,
+ )
+ print(brotli.decompress(compressed_raw).decode("utf-8"))
+ ```
+
  ## Parameter Reference
 
  To improve readability, the reference is split into two tables:
@@ -183,3 +220,10 @@ To improve readability, the reference is split into two tables:
  | `remove_unknown_tags` | `bool` | `True` |
  | `safe_attrs_only` | `bool` | `True` |
  | `safe_attrs` | `set[str]` | curated set |
+
+ ### `make_smol_bytes` Options
+
+ | Parameter | Type | Default |
+ |---|---|---|
+ | `compression_level` | `int` | `4` |
+ | `base64_encode` | `bool` | `True` |
@@ -1,6 +1,6 @@
  [project]
  name = "smol-html"
- version = "0.1.5"
+ version = "0.1.7"
  description = "Small, dependable HTML cleaner/minifier with sensible defaults"
  readme = { file = "README.md", content-type = "text/markdown" }
  requires-python = ">=3.9"
File without changes
File without changes
File without changes
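
Note on the `make_smol_bytes` change: the README hunks in this diff document URL-safe Base64 output by default in 0.1.7 (`base64_encode=True`), whereas the 0.1.5 example passed the result straight to `brotli.decompress`. A consumer that reads payloads written under both conventions may need to tell them apart. The sketch below is one possible heuristic and is not part of `smol-html`; the helper name `strip_base64_layer` is hypothetical, and it relies only on the standard library. It assumes that raw Brotli output almost always contains bytes outside the Base64 alphabet, which is not guaranteed for very short payloads.

```python
import base64
import binascii


def strip_base64_layer(payload: bytes) -> bytes:
    """Remove a URL-safe Base64 layer if one appears to be present.

    Hypothetical helper (not part of smol-html). Heuristic: if the payload
    decodes as strictly valid URL-safe Base64, return the decoded bytes;
    otherwise assume it is already raw compressed data and return it as-is.
    """
    try:
        # validate=True rejects any byte outside the Base64 alphabet,
        # so raw binary data falls through to the except branch.
        return base64.b64decode(payload, altchars=b"-_", validate=True)
    except (binascii.Error, ValueError):
        return payload


if __name__ == "__main__":
    # Stand-in bytes for a compressed payload; contains non-Base64 bytes.
    raw = bytes([0x1B, 0x07, 0x00, 0xF8, 0xFF])
    encoded = base64.urlsafe_b64encode(raw)
    print(strip_base64_layer(encoded) == raw)  # Base64 layer removed
    print(strip_base64_layer(raw) == raw)      # raw bytes passed through
```

The returned bytes could then be handed to `brotli.decompress(...)` as in the README examples. Where the writer's setting is known, passing `base64_encode` explicitly and skipping the heuristic is the safer choice.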