smol-html 0.1.6__tar.gz → 0.1.7__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: smol-html
- Version: 0.1.6
+ Version: 0.1.7
  Summary: Small, dependable HTML cleaner/minifier with sensible defaults
  Project-URL: Homepage, https://github.com/NosibleAI/smol-html
  Project-URL: Repository, https://github.com/NosibleAI/smol-html
@@ -35,6 +35,21 @@ Description-Content-Type: text/markdown

  Small, dependable HTML cleaner/minifier with sensible defaults.

+ ---
+
+ ## Credits
+
+ This package is built on top of excellent open-source projects:
+
+ - [**minify-html**](https://github.com/wilsonzlin/minify-html) (MIT License) – High-performance HTML minifier.
+ - [**lxml**](https://lxml.de/) (BSD-like License) – Fast and feature-rich XML/HTML parsing and cleaning.
+ - [**BeautifulSoup4**](https://www.crummy.com/software/BeautifulSoup/) (MIT License) – Python library for parsing and navigating HTML/XML.
+ - [**Brotli**](https://github.com/google/brotli) (MIT License) – Compression algorithm developed by Google.
+
+ We are grateful to the maintainers and contributors of these projects — `smol-html` wouldn’t be possible without their work.
+
+ ---
+
  ## Motivation

  Nosible is a search engine, which means we need to store and process a very large number of webpages. To make this tractable, we strip out visual chrome and other non-essential components that don’t matter for downstream tasks (indexing, ranking, retrieval, and LLM pipelines) while preserving the important content and structure. This package cleans and minifies HTML, greatly reducing size on disk; combined with Brotli compression (by Google), the savings are even larger.
@@ -5,6 +5,21 @@

  Small, dependable HTML cleaner/minifier with sensible defaults.

+ ---
+
+ ## Credits
+
+ This package is built on top of excellent open-source projects:
+
+ - [**minify-html**](https://github.com/wilsonzlin/minify-html) (MIT License) – High-performance HTML minifier.
+ - [**lxml**](https://lxml.de/) (BSD-like License) – Fast and feature-rich XML/HTML parsing and cleaning.
+ - [**BeautifulSoup4**](https://www.crummy.com/software/BeautifulSoup/) (MIT License) – Python library for parsing and navigating HTML/XML.
+ - [**Brotli**](https://github.com/google/brotli) (MIT License) – Compression algorithm developed by Google.
+
+ We are grateful to the maintainers and contributors of these projects — `smol-html` wouldn’t be possible without their work.
+
+ ---
+
  ## Motivation

  Nosible is a search engine, which means we need to store and process a very large number of webpages. To make this tractable, we strip out visual chrome and other non-essential components that don’t matter for downstream tasks (indexing, ranking, retrieval, and LLM pipelines) while preserving the important content and structure. This package cleans and minifies HTML, greatly reducing size on disk; combined with Brotli compression (by Google), the savings are even larger.
@@ -17,22 +32,22 @@ Nosible is a search engine, which means we need to store and process a very larg
  pip install smol-html
  ```

- ### ⚡ Installing with uv
-
- ```bash
- uv pip install smol-html
- ```
-
- ### Requirements
-
- - Python: 3.9
- - Dependencies:
-   - beautifulsoup4>=4.0.1
-   - brotli>=0.5.2
-   - lxml[html-clean]>=1.3.2
-   - minify-html>=0.2.6
-
- ## Quick Start
+ ### ⚡ Installing with uv
+
+ ```bash
+ uv pip install smol-html
+ ```
+
+ ### Requirements
+
+ - Python: 3.9
+ - Dependencies:
+   - beautifulsoup4>=4.0.1
+   - brotli>=0.5.2
+   - lxml[html-clean]>=1.3.2
+   - minify-html>=0.2.6
+
+ ## Quick Start

  Clean an HTML string (or page contents):
@@ -93,58 +108,58 @@ cleaner = SmolHtmlCleaner(
  out = cleaner.make_smol(raw_html="<p>Hi</p>")
  ```

- ## Compressed Bytes Output
-
- Produce compressed bytes using Brotli with `make_smol_bytes`.
-
- - By default, the compressed bytes are URL-safe Base64 encoded (`base64_encode=True`).
- - If you enable Base64, you must decode before Brotli-decompressing.
- - You can disable Base64 by passing `base64_encode=False` and decompress directly.
-
- Default (Base64-encoded) output:
-
- ```python
- from smol_html import SmolHtmlCleaner
- import base64
- import brotli  # only needed if you want to decompress here in the example
-
- html = """
- <html>
- <body>
- <div> Hello <span> world </span> </div>
- </body>
- </html>
- """
-
- cleaner = SmolHtmlCleaner()
-
- # Get compressed bytes (quality 11 is strong compression)
- compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
-
- # Because Base64 is enabled by default, decode before decompressing
- decoded = base64.urlsafe_b64decode(compressed)
- decompressed = brotli.decompress(decoded).decode("utf-8")
- print(decompressed)
-
- # Or write Base64-encoded compressed output directly to a file
- with open("page.html.br.b64", "wb") as f:
-     f.write(compressed)
- ```
-
- Disable Base64 and decompress directly:
-
- ```python
- from smol_html import SmolHtmlCleaner
- import brotli
-
- cleaner = SmolHtmlCleaner()
- compressed_raw = cleaner.make_smol_bytes(
-     raw_html="<p>Hi</p>",
-     compression_level=11,
-     base64_encode=False,
- )
- print(brotli.decompress(compressed_raw).decode("utf-8"))
- ```
+ ## Compressed Bytes Output
+
+ Produce compressed bytes using Brotli with `make_smol_bytes`.
+
+ - By default, the compressed bytes are URL-safe Base64 encoded (`base64_encode=True`).
+ - If you enable Base64, you must decode before Brotli-decompressing.
+ - You can disable Base64 by passing `base64_encode=False` and decompress directly.
+
+ Default (Base64-encoded) output:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+ import base64
+ import brotli  # only needed if you want to decompress here in the example
+
+ html = """
+ <html>
+ <body>
+ <div> Hello <span> world </span> </div>
+ </body>
+ </html>
+ """
+
+ cleaner = SmolHtmlCleaner()
+
+ # Get compressed bytes (quality 11 is strong compression)
+ compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
+
+ # Because Base64 is enabled by default, decode before decompressing
+ decoded = base64.urlsafe_b64decode(compressed)
+ decompressed = brotli.decompress(decoded).decode("utf-8")
+ print(decompressed)
+
+ # Or write Base64-encoded compressed output directly to a file
+ with open("page.html.br.b64", "wb") as f:
+     f.write(compressed)
+ ```
+
+ Disable Base64 and decompress directly:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+ import brotli
+
+ cleaner = SmolHtmlCleaner()
+ compressed_raw = cleaner.make_smol_bytes(
+     raw_html="<p>Hi</p>",
+     compression_level=11,
+     base64_encode=False,
+ )
+ print(brotli.decompress(compressed_raw).decode("utf-8"))
+ ```

  ## Parameter Reference
@@ -179,7 +194,7 @@ To improve readability, the reference is split into two tables:
  | `safe_attrs_only` | Only allow attributes listed in `safe_attrs`. | Set `False` if you need to keep arbitrary attributes. |
  | `safe_attrs` | Allowed HTML attributes when `safe_attrs_only=True`. | Extend to keep additional attributes you trust. |

- ### Types and Defaults
+ ### Types and Defaults

  | Parameter | Type | Default |
  |---|---|---|
@@ -204,11 +219,11 @@ To improve readability, the reference is split into two tables:
  | `kill_tags` | `set[str] &#124; None` | `None` |
  | `remove_unknown_tags` | `bool` | `True` |
  | `safe_attrs_only` | `bool` | `True` |
- | `safe_attrs` | `set[str]` | curated set |
-
- ### `make_smol_bytes` Options
-
- | Parameter | Type | Default |
- |---|---|---|
- | `compression_level` | `int` | `4` |
- | `base64_encode` | `bool` | `True` |
+ | `safe_attrs` | `set[str]` | curated set |
+
+ ### `make_smol_bytes` Options
+
+ | Parameter | Type | Default |
+ |---|---|---|
+ | `compression_level` | `int` | `4` |
+ | `base64_encode` | `bool` | `True` |
@@ -1,6 +1,6 @@
  [project]
  name = "smol-html"
- version = "0.1.6"
+ version = "0.1.7"
  description = "Small, dependable HTML cleaner/minifier with sensible defaults"
  readme = { file = "README.md", content-type = "text/markdown" }
  requires-python = ">=3.9"
File without changes
File without changes
File without changes
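
Note on the `Compressed Bytes Output` section carried in this diff: the README says that Base64-encoded output must be decoded before it is decompressed. The sketch below illustrates only that ordering. It is not the package's implementation: it uses the standard library's `zlib` as a stand-in for Brotli so that it runs without extra dependencies, and `pack` and `unpack` are hypothetical helper names that do not exist in `smol-html`.

```python
import base64
import zlib  # assumption: stdlib stand-in for Brotli, used only for this sketch


def pack(text: str, base64_encode: bool = True) -> bytes:
    """Compress text, then optionally wrap the result in URL-safe Base64."""
    compressed = zlib.compress(text.encode("utf-8"), level=9)
    if base64_encode:
        return base64.urlsafe_b64encode(compressed)
    return compressed


def unpack(payload: bytes, base64_encoded: bool = True) -> str:
    """Reverse pack(): Base64-decode first (if it was applied), then decompress."""
    raw = base64.urlsafe_b64decode(payload) if base64_encoded else payload
    return zlib.decompress(raw).decode("utf-8")


html = "<div>Hello <span>world</span></div>"

# Round trip with Base64 enabled (mirrors the README's default behaviour).
assert unpack(pack(html)) == html

# Round trip with Base64 disabled: the payload is decompressed directly.
assert unpack(pack(html, base64_encode=False), base64_encoded=False) == html
```

With the real package, the README's own examples apply the same order: `base64.urlsafe_b64decode` first, then `brotli.decompress`.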