smol-html 0.1.0__tar.gz → 0.1.2__tar.gz

@@ -1 +1 @@
- 3.12
+ 3.12
@@ -0,0 +1,200 @@
+ Metadata-Version: 2.4
+ Name: smol-html
+ Version: 0.1.2
+ Summary: Small, dependable HTML cleaner/minifier with sensible defaults
+ Project-URL: Homepage, https://github.com/NosibleAI/smol-html
+ Project-URL: Repository, https://github.com/NosibleAI/smol-html
+ Project-URL: Issues, https://github.com/NosibleAI/smol-html/issues
+ Author-email: Gareth Warburton <garethw738@gmail.com>, Stuart Reid <stuart@nosible.com>, Matthew Dicks <matthew@nosible.com>, Richard Taylor <richard@nosible.com>
+ License: MIT
+ License-File: LICENSE
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3 :: Only
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Requires-Python: >=3.9
+ Requires-Dist: beautifulsoup4>=4.0.1
+ Requires-Dist: brotli>=0.5.2
+ Requires-Dist: lxml[html-clean]>=1.3.2
+ Requires-Dist: minify-html>=0.2.6
+ Description-Content-Type: text/markdown
+
+ ![smol](smol.png)
+
+
+ # smol-html
+
+ Small, dependable HTML cleaner/minifier with sensible defaults.
+
+ ### 📦 Installation
+
+ ```bash
+ pip install smol-html
+ ```
+
+ ### ⚡ Installing with uv
+
+ ```bash
+ uv pip install smol-html
+ ```
+
+ ## Quick Start
+
+ Clean an HTML string (or page contents):
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ html = """
+ <html>
+ <head><title> Example </title></head>
+ <body>
+ <div> Hello <span> world </span> </div>
+ </body>
+ </html>
+ """
+
+ # All constructor arguments are keyword-only and optional.
+ cleaner = SmolHtmlCleaner()
+ cleaned = cleaner.make_smol(raw_html=html)
+
+ print(cleaned)
+ ```
+
+ ## Customization
+
+ `SmolHtmlCleaner` exposes keyword-only parameters with practical defaults. You can:
+ - Pass overrides to the constructor, or
+ - Adjust attributes on the instance after creation.
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ cleaner = SmolHtmlCleaner()
+ cleaner.attr_stop_words.add("advert") # e.g., add a custom stop word
+ ```
+
+ ## Usage Examples
+
+ Minimal:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ cleaner = SmolHtmlCleaner()
+ out = cleaner.make_smol(raw_html="<p>Hi <!-- note --> <a href='x'>link</a></p>")
+ ```
+
+ Customize a few options:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ cleaner = SmolHtmlCleaner(
+     attr_stop_words={"nav", "advert"},
+     remove_header_lists=False,
+     minify=True,
+ )
+
+ out = cleaner.make_smol(raw_html="<p>Hi</p>")
+ ```
+
+ ## Compressed Bytes Output
+
+ Produce Brotli-compressed bytes with `make_smol_bytes`:
+
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+ import brotli # only needed if you want to decompress here in the example
+
+ html = """
+ <html>
+ <body>
+ <div> Hello <span> world </span> </div>
+ </body>
+ </html>
+ """
+
+ cleaner = SmolHtmlCleaner()
+
+ # Get compressed bytes (quality 11 is Brotli's maximum: strongest, slowest)
+ compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
+
+ # Example: decompress back to text to inspect (optional)
+ decompressed = brotli.decompress(compressed).decode("utf-8")
+ print(decompressed)
+
+ # Or write compressed output directly to a file
+ with open("page.html.br", "wb") as f:
+     f.write(compressed)
+ ```
+
+ ## Parameter Reference
+
+ To improve readability, the reference is split into two tables:
+ - What it does and when to change
+ - Types and default values
+
+ ### What It Does
+
+ | Parameter | What it does | When to change |
+ |---|---|---|
+ | `non_text_to_keep` | Whitelist of empty/non-text tags to preserve (e.g., images, figures, tables, line breaks). | If important non-text elements are being removed or you want to keep/drop more empty tags. |
+ | `attr_stop_words` | Tokens matched against `id`/`class`/`role`/`item_type` on small elements; matches are removed as likely non-content. | Add tokens like `advert`, `hero`, `menu` to aggressively drop UI chrome, or remove tokens if content is lost. |
+ | `remove_header_lists` | Removes links/lists/images within `<header>` to reduce nav clutter. | Set `False` if your header contains meaningful content you want to keep. |
+ | `remove_footer_lists` | Removes links/lists/images within `<footer>` to reduce boilerplate. | Set `False` for content-heavy footers you need. |
+ | `minify` | Minifies output HTML using `minify_html`. | Set `False` for readability or debugging; use `--pretty` in the CLI. |
+ | `minify_kwargs` | Extra options passed to `minify_html.minify`. | Tune minification behavior (e.g., whitespace, comments) without changing cleaning. |
+ | `meta` | lxml Cleaner option: remove `<meta>` content when `True`. | Usually leave `False`; enable only for strict sanitation. |
+ | `page_structure` | lxml Cleaner option: remove page-structure tags (e.g., `<head>`, `<body>`) when `True`. | Rarely needed; keep `False` to preserve structure. |
+ | `links` | lxml Cleaner option: sanitize/clean links. | Leave `True` unless you need raw anchors untouched. |
+ | `scripts` | lxml Cleaner option: remove `<script>` tags when `True`. | Keep `False` to preserve scripts; usually safe to remove via `javascript=True` anyway. |
+ | `javascript` | lxml Cleaner option: remove JS and event handlers. | Set `False` only if you truly need inline JS (not recommended). |
+ | `comments` | lxml Cleaner option: remove HTML comments. | Set `False` to retain comments for debugging. |
+ | `style` | lxml Cleaner option: remove CSS and style attributes. | Set `False` to keep inline styles/CSS. |
+ | `processing_instructions` | lxml Cleaner option: remove processing instructions. | Rarely change; keep for safety. |
+ | `embedded` | lxml Cleaner option: remove embedded content (e.g., `<embed>`, `<object>`). | Set `False` to keep embedded media. |
+ | `frames` | lxml Cleaner option: remove frames/iframes. | Set `False` if iframes contain needed content. |
+ | `forms` | lxml Cleaner option: remove form elements. | Set `False` if you need to keep forms/inputs. |
+ | `annoying_tags` | lxml Cleaner option: remove tags considered "annoying" by lxml (e.g., `<blink>`, `<marquee>`). | Rarely change. |
+ | `kill_tags` | Additional explicit tags to remove entirely. | Add site-specific or custom tags to drop. |
+ | `remove_unknown_tags` | lxml Cleaner option: drop unknown/invalid tags. | Set `False` if you rely on custom elements. |
+ | `safe_attrs_only` | Only allow attributes listed in `safe_attrs`. | Set `False` if you need to keep arbitrary attributes. |
+ | `safe_attrs` | Allowed HTML attributes when `safe_attrs_only=True`. | Extend to keep additional attributes you trust. |
+
+ ### Types and Defaults
+
+ | Parameter | Type | Default |
+ |---|---|---|
+ | `non_text_to_keep` | `set[str]` | media/meta/table/`br` tags |
+ | `attr_stop_words` | `set[str]` | common UI/navigation tokens |
+ | `remove_header_lists` | `bool` | `True` |
+ | `remove_footer_lists` | `bool` | `True` |
+ | `minify` | `bool` | `True` |
+ | `minify_kwargs` | `dict` | `{}` |
+ | `meta` | `bool` | `False` |
+ | `page_structure` | `bool` | `False` |
+ | `links` | `bool` | `True` |
+ | `scripts` | `bool` | `False` |
+ | `javascript` | `bool` | `True` |
+ | `comments` | `bool` | `True` |
+ | `style` | `bool` | `True` |
+ | `processing_instructions` | `bool` | `True` |
+ | `embedded` | `bool` | `True` |
+ | `frames` | `bool` | `True` |
+ | `forms` | `bool` | `True` |
+ | `annoying_tags` | `bool` | `True` |
+ | `kill_tags` | `set[str] &#124; None` | `None` |
+ | `remove_unknown_tags` | `bool` | `True` |
+ | `safe_attrs_only` | `bool` | `True` |
+ | `safe_attrs` | `set[str]` | curated set |
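Reviewer note: the hunk above is core package metadata, which uses an email-style header block followed by the long description. As a minimal sketch (the `PKG_INFO` string is a trimmed copy of the hunk and `summarize_metadata` is a hypothetical helper, neither is part of smol-html), the standard library's `email` parser can pull out the declared requirements:

```python
from email import message_from_string

# Trimmed copy of the metadata in the hunk above; headers end at the first blank line.
PKG_INFO = "\n".join([
    "Metadata-Version: 2.4",
    "Name: smol-html",
    "Version: 0.1.2",
    "Requires-Python: >=3.9",
    "Requires-Dist: beautifulsoup4>=4.0.1",
    "Requires-Dist: brotli>=0.5.2",
    "Requires-Dist: lxml[html-clean]>=1.3.2",
    "Requires-Dist: minify-html>=0.2.6",
    "Description-Content-Type: text/markdown",
    "",
    "# smol-html",
])


def summarize_metadata(text):
    """Return name, version, requirements, and description from metadata text."""
    msg = message_from_string(text)
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        # Requires-Dist may repeat, so collect every occurrence.
        "requires": msg.get_all("Requires-Dist") or [],
        # Everything after the blank line is the long description (the README).
        "description": msg.get_payload(),
    }


summary = summarize_metadata(PKG_INFO)
print(summary["name"], summary["version"])
for requirement in summary["requires"]:
    print("  requires:", requirement)
```

This only reads the header fields; it does not validate them against the metadata specification.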
@@ -0,0 +1,170 @@
+ ![smol](smol.png)
+
+
+ # smol-html
+
+ Small, dependable HTML cleaner/minifier with sensible defaults.
+
+ ### 📦 Installation
+
+ ```bash
+ pip install smol-html
+ ```
+
+ ### ⚡ Installing with uv
+
+ ```bash
+ uv pip install smol-html
+ ```
+
+ ## Quick Start
+
+ Clean an HTML string (or page contents):
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ html = """
+ <html>
+ <head><title> Example </title></head>
+ <body>
+ <div> Hello <span> world </span> </div>
+ </body>
+ </html>
+ """
+
+ # All constructor arguments are keyword-only and optional.
+ cleaner = SmolHtmlCleaner()
+ cleaned = cleaner.make_smol(raw_html=html)
+
+ print(cleaned)
+ ```
+
+ ## Customization
+
+ `SmolHtmlCleaner` exposes keyword-only parameters with practical defaults. You can:
+ - Pass overrides to the constructor, or
+ - Adjust attributes on the instance after creation.
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ cleaner = SmolHtmlCleaner()
+ cleaner.attr_stop_words.add("advert") # e.g., add a custom stop word
+ ```
+
+ ## Usage Examples
+
+ Minimal:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ cleaner = SmolHtmlCleaner()
+ out = cleaner.make_smol(raw_html="<p>Hi <!-- note --> <a href='x'>link</a></p>")
+ ```
+
+ Customize a few options:
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+
+ cleaner = SmolHtmlCleaner(
+     attr_stop_words={"nav", "advert"},
+     remove_header_lists=False,
+     minify=True,
+ )
+
+ out = cleaner.make_smol(raw_html="<p>Hi</p>")
+ ```
+
+ ## Compressed Bytes Output
+
+ Produce Brotli-compressed bytes with `make_smol_bytes`:
+
+
+ ```python
+ from smol_html import SmolHtmlCleaner
+ import brotli # only needed if you want to decompress here in the example
+
+ html = """
+ <html>
+ <body>
+ <div> Hello <span> world </span> </div>
+ </body>
+ </html>
+ """
+
+ cleaner = SmolHtmlCleaner()
+
+ # Get compressed bytes (quality 11 is Brotli's maximum: strongest, slowest)
+ compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
+
+ # Example: decompress back to text to inspect (optional)
+ decompressed = brotli.decompress(compressed).decode("utf-8")
+ print(decompressed)
+
+ # Or write compressed output directly to a file
+ with open("page.html.br", "wb") as f:
+     f.write(compressed)
+ ```
+
+ ## Parameter Reference
+
+ To improve readability, the reference is split into two tables:
+ - What it does and when to change
+ - Types and default values
+
+ ### What It Does
+
+ | Parameter | What it does | When to change |
+ |---|---|---|
+ | `non_text_to_keep` | Whitelist of empty/non-text tags to preserve (e.g., images, figures, tables, line breaks). | If important non-text elements are being removed or you want to keep/drop more empty tags. |
+ | `attr_stop_words` | Tokens matched against `id`/`class`/`role`/`item_type` on small elements; matches are removed as likely non-content. | Add tokens like `advert`, `hero`, `menu` to aggressively drop UI chrome, or remove tokens if content is lost. |
+ | `remove_header_lists` | Removes links/lists/images within `<header>` to reduce nav clutter. | Set `False` if your header contains meaningful content you want to keep. |
+ | `remove_footer_lists` | Removes links/lists/images within `<footer>` to reduce boilerplate. | Set `False` for content-heavy footers you need. |
+ | `minify` | Minifies output HTML using `minify_html`. | Set `False` for readability or debugging; use `--pretty` in the CLI. |
+ | `minify_kwargs` | Extra options passed to `minify_html.minify`. | Tune minification behavior (e.g., whitespace, comments) without changing cleaning. |
+ | `meta` | lxml Cleaner option: remove `<meta>` content when `True`. | Usually leave `False`; enable only for strict sanitation. |
+ | `page_structure` | lxml Cleaner option: remove page-structure tags (e.g., `<head>`, `<body>`) when `True`. | Rarely needed; keep `False` to preserve structure. |
+ | `links` | lxml Cleaner option: sanitize/clean links. | Leave `True` unless you need raw anchors untouched. |
+ | `scripts` | lxml Cleaner option: remove `<script>` tags when `True`. | Keep `False` to preserve scripts; usually safe to remove via `javascript=True` anyway. |
+ | `javascript` | lxml Cleaner option: remove JS and event handlers. | Set `False` only if you truly need inline JS (not recommended). |
+ | `comments` | lxml Cleaner option: remove HTML comments. | Set `False` to retain comments for debugging. |
+ | `style` | lxml Cleaner option: remove CSS and style attributes. | Set `False` to keep inline styles/CSS. |
+ | `processing_instructions` | lxml Cleaner option: remove processing instructions. | Rarely change; keep for safety. |
+ | `embedded` | lxml Cleaner option: remove embedded content (e.g., `<embed>`, `<object>`). | Set `False` to keep embedded media. |
+ | `frames` | lxml Cleaner option: remove frames/iframes. | Set `False` if iframes contain needed content. |
+ | `forms` | lxml Cleaner option: remove form elements. | Set `False` if you need to keep forms/inputs. |
+ | `annoying_tags` | lxml Cleaner option: remove tags considered "annoying" by lxml (e.g., `<blink>`, `<marquee>`). | Rarely change. |
+ | `kill_tags` | Additional explicit tags to remove entirely. | Add site-specific or custom tags to drop. |
+ | `remove_unknown_tags` | lxml Cleaner option: drop unknown/invalid tags. | Set `False` if you rely on custom elements. |
+ | `safe_attrs_only` | Only allow attributes listed in `safe_attrs`. | Set `False` if you need to keep arbitrary attributes. |
+ | `safe_attrs` | Allowed HTML attributes when `safe_attrs_only=True`. | Extend to keep additional attributes you trust. |
+
+ ### Types and Defaults
+
+ | Parameter | Type | Default |
+ |---|---|---|
+ | `non_text_to_keep` | `set[str]` | media/meta/table/`br` tags |
+ | `attr_stop_words` | `set[str]` | common UI/navigation tokens |
+ | `remove_header_lists` | `bool` | `True` |
+ | `remove_footer_lists` | `bool` | `True` |
+ | `minify` | `bool` | `True` |
+ | `minify_kwargs` | `dict` | `{}` |
+ | `meta` | `bool` | `False` |
+ | `page_structure` | `bool` | `False` |
+ | `links` | `bool` | `True` |
+ | `scripts` | `bool` | `False` |
+ | `javascript` | `bool` | `True` |
+ | `comments` | `bool` | `True` |
+ | `style` | `bool` | `True` |
+ | `processing_instructions` | `bool` | `True` |
+ | `embedded` | `bool` | `True` |
+ | `frames` | `bool` | `True` |
+ | `forms` | `bool` | `True` |
+ | `annoying_tags` | `bool` | `True` |
+ | `kill_tags` | `set[str] &#124; None` | `None` |
+ | `remove_unknown_tags` | `bool` | `True` |
+ | `safe_attrs_only` | `bool` | `True` |
+ | `safe_attrs` | `set[str]` | curated set |
@@ -1,6 +1,6 @@
  [project]
  name = "smol-html"
- version = "0.1.0"
+ version = "0.1.2"
  description = "Small, dependable HTML cleaner/minifier with sensible defaults"
  readme = { file = "README.md", content-type = "text/markdown" }
  requires-python = ">=3.9"
@@ -12,9 +12,10 @@ authors = [
  ]
  license = { text = "MIT" }
  dependencies = [
- "beautifulsoup4>=4.13.5",
- "lxml[html-clean]>=6.0.1",
- "minify-html>=0.16.4",
+ "beautifulsoup4>=4.0.1",
+ "brotli>=0.5.2",
+ "lxml[html-clean]>=1.3.2",
+ "minify-html>=0.2.6",
  ]
  classifiers = [
  "Development Status :: 4 - Beta",
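Reviewer note: this hunk lowers every dependency floor and adds `brotli`. A minimal sketch of what that changes for a resolver, assuming purely numeric dotted versions (real tools follow the full version specification, which this does not) and using hypothetical installed versions chosen only for illustration:

```python
def parse_version(text):
    """Turn a purely numeric dotted version such as '4.13.5' into (4, 13, 5)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """True when `installed` is at least `minimum`, compared numerically."""
    return parse_version(installed) >= parse_version(minimum)


# Lower bounds copied from the removed (-) and added (+) lines of the hunk.
OLD_MINIMUMS = {"beautifulsoup4": "4.13.5", "lxml": "6.0.1", "minify-html": "0.16.4"}
NEW_MINIMUMS = {"beautifulsoup4": "4.0.1", "brotli": "0.5.2",
                "lxml": "1.3.2", "minify-html": "0.2.6"}

# Hypothetical environment, not taken from the package.
installed = {"beautifulsoup4": "4.12.0", "lxml": "5.2.0", "minify-html": "0.15.0"}

for name, version in installed.items():
    old_ok = satisfies_minimum(version, OLD_MINIMUMS[name])
    new_ok = satisfies_minimum(version, NEW_MINIMUMS[name])
    print(f"{name} {version}: 0.1.0 bound ok={old_ok}, 0.1.2 bound ok={new_ok}")
```

Comparing integer tuples rather than strings matters here: as text, `"0.16.4"` sorts before `"0.2.6"`, while numerically it is the newer release.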
Binary file
@@ -0,0 +1,4 @@
+ from smol_html.smol_html import SmolHtmlCleaner
+
+ all = ["__version__", "SmolHtmlCleaner"]
+ __version__ = "0.1.0"
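Reviewer note: the added module assigns its export list to `all`, while Python consults only a module-level `__all__` for star imports; it also sets `__version__` to "0.1.0" although the package metadata says 0.1.2. Whether either is intentional is not stated in the diff. A minimal sketch of the difference, using a throwaway module named `demo_pkg` with a placeholder `Cleaner` class (both hypothetical, not smol-html code):

```python
import sys
import types


def star_import_names(source, name="demo_pkg"):
    """Exec `source` as a throwaway module and return what `from name import *` binds."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module  # make the module importable by name
    try:
        namespace = {}
        exec(f"from {name} import *", namespace)
    finally:
        del sys.modules[name]  # leave the import system as we found it
    namespace.pop("__builtins__", None)  # added by exec, not by the import
    return set(namespace)


BODY = 'class Cleaner:\n    pass\n\n__version__ = "0.1.0"\n'

with_plain_all = star_import_names(BODY + 'all = ["__version__", "Cleaner"]\n')
with_dunder_all = star_import_names(BODY + '__all__ = ["__version__", "Cleaner"]\n')

print(sorted(with_plain_all))   # ['Cleaner', 'all']
print(sorted(with_dunder_all))  # ['Cleaner', '__version__']
```

With a plain `all`, the star import skips `__version__` (underscore-prefixed names are excluded by default) and exports a list named `all` that shadows the built-in in the importing namespace. Explicit imports such as `from smol_html import SmolHtmlCleaner` are unaffected either way.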