smol-html 0.1.5__py3-none-any.whl → 0.1.7__py3-none-any.whl

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: smol-html
- Version: 0.1.5
+ Version: 0.1.7
  Summary: Small, dependable HTML cleaner/minifier with sensible defaults
  Project-URL: Homepage, https://github.com/NosibleAI/smol-html
  Project-URL: Repository, https://github.com/NosibleAI/smol-html
@@ -35,6 +35,21 @@ Description-Content-Type: text/markdown
 
  Small, dependable HTML cleaner/minifier with sensible defaults.
 
+ ---
+
+ ## Credits
+
+ This package is built on top of excellent open-source projects:
+
+ - [**minify-html**](https://github.com/wilsonzlin/minify-html) (MIT License) – High-performance HTML minifier.
+ - [**lxml**](https://lxml.de/) (BSD-like License) – Fast and feature-rich XML/HTML parsing and cleaning.
+ - [**BeautifulSoup4**](https://www.crummy.com/software/BeautifulSoup/) (MIT License) – Python library for parsing and navigating HTML/XML.
+ - [**Brotli**](https://github.com/google/brotli) (MIT License) – Compression algorithm developed by Google.
+
+ We are grateful to the maintainers and contributors of these projects — `smol-html` wouldn’t be possible without their work.
+
+ ---
+
  ## Motivation
 
  Nosible is a search engine, which means we need to store and process a very large number of webpages. To make this tractable, we strip out visual chrome and other non-essential components that don’t matter for downstream tasks (indexing, ranking, retrieval, and LLM pipelines) while preserving the important content and structure. This package cleans and minifies HTML, greatly reducing size on disk; combined with Brotli compression (by Google), the savings are even larger.
@@ -125,11 +140,17 @@ out = cleaner.make_smol(raw_html="<p>Hi</p>")
 
  ## Compressed Bytes Output
 
- Produce compressed bytes using Brotli with `make_smol_bytes`
+ Produce compressed bytes using Brotli with `make_smol_bytes`.
+
+ - By default, the compressed bytes are URL-safe Base64 encoded (`base64_encode=True`).
+ - If you enable Base64, you must decode before Brotli-decompressing.
+ - You can disable Base64 by passing `base64_encode=False` and decompress directly.
 
+ Default (Base64-encoded) output:
 
  ```python
  from smol_html import SmolHtmlCleaner
+ import base64
  import brotli # only needed if you want to decompress here in the example
 
  html = """
@@ -145,15 +166,31 @@ cleaner = SmolHtmlCleaner()
  # Get compressed bytes (quality 11 is strong compression)
  compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
 
- # Example: decompress back to text to inspect (optional)
- decompressed = brotli.decompress(compressed).decode("utf-8")
+ # Because Base64 is enabled by default, decode before decompressing
+ decoded = base64.urlsafe_b64decode(compressed)
+ decompressed = brotli.decompress(decoded).decode("utf-8")
  print(decompressed)
 
- # Or write compressed output directly to a file
- with open("page.html.br", "wb") as f:
+ # Or write Base64-encoded compressed output directly to a file
+ with open("page.html.br.b64", "wb") as f:
      f.write(compressed)
  ```
 
+ Disable Base64 and decompress directly:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+ import brotli
+
+ cleaner = SmolHtmlCleaner()
+ compressed_raw = cleaner.make_smol_bytes(
+     raw_html="<p>Hi</p>",
+     compression_level=11,
+     base64_encode=False,
+ )
+ print(brotli.decompress(compressed_raw).decode("utf-8"))
+ ```
+
  ## Parameter Reference
 
  To improve readability, the reference is split into two tables:
159
196
  To improve readability, the reference is split into two tables:
@@ -213,3 +250,10 @@ To improve readability, the reference is split into two tables:
  | `remove_unknown_tags` | `bool` | `True` |
  | `safe_attrs_only` | `bool` | `True` |
  | `safe_attrs` | `set[str]` | curated set |
+
+ ### `make_smol_bytes` Options
+
+ | Parameter | Type | Default |
+ |---|---|---|
+ | `compression_level` | `int` | `4` |
+ | `base64_encode` | `bool` | `True` |
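Reviewer note on the new `base64_encode` default: the README hunks above say the Brotli bytes are wrapped in URL-safe Base64 unless `base64_encode=False` is passed. The sketch below uses only the standard library to illustrate what that wrapping costs and why it must be undone before decompressing. The `raw` payload is a stand-in, not real `make_smol_bytes` output, the helper name `encoded_size` is hypothetical, and the size formula assumes the padded output that `base64.urlsafe_b64encode` produces.

```python
import base64

# Stand-in for Brotli output (assumption: real bytes would come from
# make_smol_bytes(..., base64_encode=False)).
raw = bytes(range(256)) + bytes(44)  # 300 arbitrary bytes


def encoded_size(raw_len: int) -> int:
    """Size of padded Base64 output: 4 bytes for every 3 input bytes."""
    return 4 * ((raw_len + 2) // 3)


encoded = base64.urlsafe_b64encode(raw)

# The encoded form is about one third larger than the raw bytes.
assert len(encoded) == encoded_size(len(raw))

# The URL-safe alphabet replaces "+" and "/" with "-" and "_".
assert b"+" not in encoded and b"/" not in encoded

# Decoding restores the exact bytes a Brotli decompressor expects.
assert base64.urlsafe_b64decode(encoded) == raw

print(len(raw), len(encoded))  # 300 400
```

If the payload is only written to disk or sent over a binary-safe channel, passing `base64_encode=False` avoids that one-third overhead. The encoded form is mainly useful for text-only transports such as JSON or URLs.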
@@ -0,0 +1,4 @@
+ smol_html-0.1.7.dist-info/METADATA,sha256=z62fJR48fIqXyaCKEFyX6_N5D9GOFf2G8JQukboEXFg,10058
+ smol_html-0.1.7.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+ smol_html-0.1.7.dist-info/licenses/LICENSE,sha256=88yg3BujRGq8MYlWhbrzB2YMNWJaXnBck3c7l23labs,1089
+ smol_html-0.1.7.dist-info/RECORD,,
@@ -1,4 +0,0 @@
- smol_html-0.1.5.dist-info/METADATA,sha256=jriTNIRVbdSkr7EXyEa1ssdm_rZmwo4IV_FLVEJTJrE,8539
- smol_html-0.1.5.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
- smol_html-0.1.5.dist-info/licenses/LICENSE,sha256=88yg3BujRGq8MYlWhbrzB2YMNWJaXnBck3c7l23labs,1089
- smol_html-0.1.5.dist-info/RECORD,,