PyPI - smol-html - Versions diffs - 0.1.0__tar.gz → 0.1.2__tar.gz - Mend

smol-html 0.1.0.tar.gz → 0.1.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

{smol_html-0.1.0 → smol_html-0.1.2}/.python-version +1 -1
smol_html-0.1.2/PKG-INFO +200 -0
smol_html-0.1.2/README.md +170 -0
{smol_html-0.1.0 → smol_html-0.1.2}/pyproject.toml +5 -4
smol_html-0.1.2/smol.png +0 -0
smol_html-0.1.2/src/smol_html/__init__.py +4 -0
smol_html-0.1.0/src/smol_html/core.py → smol_html-0.1.2/src/smol_html/smol_html.py +377 -299
smol_html-0.1.2/uv.lock +453 -0
smol_html-0.1.0/PKG-INFO +0 -127
smol_html-0.1.0/README.md +0 -98
smol_html-0.1.0/src/smol_html/__init__.py +0 -4
{smol_html-0.1.0 → smol_html-0.1.2}/.gitignore +0 -0
{smol_html-0.1.0 → smol_html-0.1.2}/LICENSE +0 -0

{smol_html-0.1.0 → smol_html-0.1.2}/.python-version RENAMED

@@ -1 +1 @@
-3.12
+3.12

smol_html-0.1.2/PKG-INFO ADDED

@@ -0,0 +1,200 @@
+Metadata-Version: 2.4
+Name: smol-html
+Version: 0.1.2
+Summary: Small, dependable HTML cleaner/minifier with sensible defaults
+Project-URL: Homepage, https://github.com/NosibleAI/smol-html
+Project-URL: Repository, https://github.com/NosibleAI/smol-html
+Project-URL: Issues, https://github.com/NosibleAI/smol-html/issues
+Author-email: Gareth Warburton <garethw738@gmail.com>, Stuart Reid <stuart@nosible.com>, Matthew Dicks <matthew@nosible.com>, Richard Taylor <richard@nosible.com>
+License: MIT
+License-File: LICENSE
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.9
+Requires-Dist: beautifulsoup4>=4.0.1
+Requires-Dist: brotli>=0.5.2
+Requires-Dist: lxml[html-clean]>=1.3.2
+Requires-Dist: minify-html>=0.2.6
+Description-Content-Type: text/markdown
+![smol](smol.png)
+# smol-html
+Small, dependable HTML cleaner/minifier with sensible defaults.
+### 📦 Installation
+```bash
+pip install smol-html
+```
+### ⚡ Installing with uv
+```bash
+uv pip install smol-html
+```
+## Quick Start
+Clean an HTML string (or page contents):
+```python
+from smol_html import SmolHtmlCleaner
+html = """
+<html>
+  <head><title> Example </title></head>
+  <body>
+    <div>  Hello <span> world </span> </div>
+  </body>
+</html>
+"""
+# All constructor arguments are keyword-only and optional.
+cleaner = SmolHtmlCleaner()
+cleaned = cleaner.make_smol(raw_html=html)
+print(cleaned)
+```
+## Customization
+`SmolHtmlCleaner` exposes keyword-only parameters with practical defaults. You can:
+- Pass overrides to the constructor, or
+- Adjust attributes on the instance after creation.
+```python
+from smol_html import SmolHtmlCleaner
+cleaner = SmolHtmlCleaner()
+cleaner.attr_stop_words.add("advert")  # e.g., add a custom stop word
+```
+## Usage Examples
+Minimal:
+```python
+from smol_html import SmolHtmlCleaner
+cleaner = SmolHtmlCleaner()
+out = cleaner.make_smol(raw_html="<p>Hi <!-- note --> <a href='x'>link</a></p>")
+```
+Customize a few options:
+```python
+from smol_html import SmolHtmlCleaner
+cleaner = SmolHtmlCleaner(
+    attr_stop_words={"nav", "advert"},
+    remove_header_lists=False,
+    minify=True,
+)
+out = cleaner.make_smol(raw_html="<p>Hi</p>")
+```
+## Compressed Bytes Output
+Produce compressed bytes using Brotli with `make_smol_bytes`
+```python
+from smol_html import SmolHtmlCleaner
+import brotli  # only needed if you want to decompress here in the example
+html = """
+<html>
+  <body>
+    <div>  Hello <span> world </span> </div>
+  </body>
+</html>
+"""
+cleaner = SmolHtmlCleaner()
+# Get compressed bytes (quality 11 is strong compression)
+compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
+# Example: decompress back to text to inspect (optional)
+decompressed = brotli.decompress(compressed).decode("utf-8")
+print(decompressed)
+# Or write compressed output directly to a file
+with open("page.html.br", "wb") as f:
+    f.write(compressed)
+```
+## Parameter Reference
+To improve readability, the reference is split into two tables:
+- What it does and when to change
+- Types and default values
+### What It Does
+| Parameter | What it does | When to change |
+|---|---|---|
+| `non_text_to_keep` | Whitelist of empty/non-text tags to preserve (e.g., images, figures, tables, line breaks). | If important non-text elements are being removed or you want to keep/drop more empty tags. |
+| `attr_stop_words` | Tokens matched against `id`/`class`/`role`/`item_type` on small elements; matches are removed as likely non-content. | Add tokens like `advert`, `hero`, `menu` to aggressively drop UI chrome, or remove tokens if content is lost. |
+| `remove_header_lists` | Removes links/lists/images within `<header>` to reduce nav clutter. | Set `False` if your header contains meaningful content you want to keep. |
+| `remove_footer_lists` | Removes links/lists/images within `<footer>` to reduce boilerplate. | Set `False` for content-heavy footers you need. |
+| `minify` | Minifies output HTML using `minify_html`. | Set `False` for readability or debugging; use `--pretty` in the CLI. |
+| `minify_kwargs` | Extra options passed to `minify_html.minify`. | Tune minification behavior (e.g., whitespace, comments) without changing cleaning. |
+| `meta` | lxml Cleaner option: remove `<meta>` content when `True`. | Usually leave `False`; enable only for strict sanitation. |
+| `page_structure` | lxml Cleaner option: remove page-structure tags (e.g., `<head>`, `<body>`) when `True`. | Rarely needed; keep `False` to preserve structure. |
+| `links` | lxml Cleaner option: sanitize/clean links. | Leave `True` unless you need raw anchors untouched. |
+| `scripts` | lxml Cleaner option: remove `<script>` tags when `True`. | Keep `False` to preserve scripts; usually safe to remove via `javascript=True` anyway. |
+| `javascript` | lxml Cleaner option: remove JS and event handlers. | Set `False` only if you truly need inline JS (not recommended). |
+| `comments` | lxml Cleaner option: remove HTML comments. | Set `False` to retain comments for debugging. |
+| `style` | lxml Cleaner option: remove CSS and style attributes. | Set `False` to keep inline styles/CSS. |
+| `processing_instructions` | lxml Cleaner option: remove processing instructions. | Rarely change; keep for safety. |
+| `embedded` | lxml Cleaner option: remove embedded content (e.g., `<embed>`, `<object>`). | Set `False` to keep embedded media. |
+| `frames` | lxml Cleaner option: remove frames/iframes. | Set `False` if iframes contain needed content. |
+| `forms` | lxml Cleaner option: remove form elements. | Set `False` if you need to keep forms/inputs. |
+| `annoying_tags` | lxml Cleaner option: remove tags considered "annoying" by lxml (e.g., `<blink>`, `<marquee>`). | Rarely change. |
+| `kill_tags` | Additional explicit tags to remove entirely. | Add site-specific or custom tags to drop. |
+| `remove_unknown_tags` | lxml Cleaner option: drop unknown/invalid tags. | Set `False` if you rely on custom elements. |
+| `safe_attrs_only` | Only allow attributes listed in `safe_attrs`. | Set `False` if you need to keep arbitrary attributes. |
+| `safe_attrs` | Allowed HTML attributes when `safe_attrs_only=True`. | Extend to keep additional attributes you trust. |
+### Types and Defaults
+| Parameter | Type | Default |
+|---|---|---|
+| `non_text_to_keep` | `set[str]` | media/meta/table/`br` tags |
+| `attr_stop_words` | `set[str]` | common UI/navigation tokens |
+| `remove_header_lists` | `bool` | `True` |
+| `remove_footer_lists` | `bool` | `True` |
+| `minify` | `bool` | `True` |
+| `minify_kwargs` | `dict` | `{}` |
+| `meta` | `bool` | `False` |
+| `page_structure` | `bool` | `False` |
+| `links` | `bool` | `True` |
+| `scripts` | `bool` | `False` |
+| `javascript` | `bool` | `True` |
+| `comments` | `bool` | `True` |
+| `style` | `bool` | `True` |
+| `processing_instructions` | `bool` | `True` |
+| `embedded` | `bool` | `True` |
+| `frames` | `bool` | `True` |
+| `forms` | `bool` | `True` |
+| `annoying_tags` | `bool` | `True` |
+| `kill_tags` | `set[str] &#124; None` | `None` |
+| `remove_unknown_tags` | `bool` | `True` |
+| `safe_attrs_only` | `bool` | `True` |
+| `safe_attrs` | `set[str]` | curated set |
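
The header block of the PKG-INFO above uses an RFC 822-style `Key: value` layout, so it can be read with Python's standard `email` parser. The following is a minimal sketch, not part of the package: `PKG_INFO` is an abbreviated copy of the headers shown in the diff, and the variable names are illustrative.

```python
from email.parser import Parser

# Abbreviated copy of the PKG-INFO headers shown in the diff above.
# The blank line separates the headers from the Markdown description.
PKG_INFO = """\
Metadata-Version: 2.4
Name: smol-html
Version: 0.1.2
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.0.1
Requires-Dist: brotli>=0.5.2
Requires-Dist: lxml[html-clean]>=1.3.2
Requires-Dist: minify-html>=0.2.6
Description-Content-Type: text/markdown

# smol-html
"""

metadata = Parser().parsestr(PKG_INFO)

name = metadata["Name"]
version = metadata["Version"]
# Repeated headers such as Requires-Dist are collected with get_all().
requires = metadata.get_all("Requires-Dist")

print(name, version)
for requirement in requires:
    print("  requires:", requirement)
```

Reading the metadata this way makes it easy to check the declared dependencies against the `pyproject.toml` hunk further down.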

smol_html-0.1.2/README.md ADDED

@@ -0,0 +1,170 @@
+![smol](smol.png)
+# smol-html
+Small, dependable HTML cleaner/minifier with sensible defaults.
+### 📦 Installation
+```bash
+pip install smol-html
+```
+### ⚡ Installing with uv
+```bash
+uv pip install smol-html
+```
+## Quick Start
+Clean an HTML string (or page contents):
+```python
+from smol_html import SmolHtmlCleaner
+html = """
+<html>
+  <head><title> Example </title></head>
+  <body>
+    <div>  Hello <span> world </span> </div>
+  </body>
+</html>
+"""
+# All constructor arguments are keyword-only and optional.
+cleaner = SmolHtmlCleaner()
+cleaned = cleaner.make_smol(raw_html=html)
+print(cleaned)
+```
+## Customization
+`SmolHtmlCleaner` exposes keyword-only parameters with practical defaults. You can:
+- Pass overrides to the constructor, or
+- Adjust attributes on the instance after creation.
+```python
+from smol_html import SmolHtmlCleaner
+cleaner = SmolHtmlCleaner()
+cleaner.attr_stop_words.add("advert")  # e.g., add a custom stop word
+```
+## Usage Examples
+Minimal:
+```python
+from smol_html import SmolHtmlCleaner
+cleaner = SmolHtmlCleaner()
+out = cleaner.make_smol(raw_html="<p>Hi <!-- note --> <a href='x'>link</a></p>")
+```
+Customize a few options:
+```python
+from smol_html import SmolHtmlCleaner
+cleaner = SmolHtmlCleaner(
+    attr_stop_words={"nav", "advert"},
+    remove_header_lists=False,
+    minify=True,
+)
+out = cleaner.make_smol(raw_html="<p>Hi</p>")
+```
+## Compressed Bytes Output
+Produce compressed bytes using Brotli with `make_smol_bytes`
+```python
+from smol_html import SmolHtmlCleaner
+import brotli  # only needed if you want to decompress here in the example
+html = """
+<html>
+  <body>
+    <div>  Hello <span> world </span> </div>
+  </body>
+</html>
+"""
+cleaner = SmolHtmlCleaner()
+# Get compressed bytes (quality 11 is strong compression)
+compressed = cleaner.make_smol_bytes(raw_html=html, compression_level=11)
+# Example: decompress back to text to inspect (optional)
+decompressed = brotli.decompress(compressed).decode("utf-8")
+print(decompressed)
+# Or write compressed output directly to a file
+with open("page.html.br", "wb") as f:
+    f.write(compressed)
+```
+## Parameter Reference
+To improve readability, the reference is split into two tables:
+- What it does and when to change
+- Types and default values
+### What It Does
+| Parameter | What it does | When to change |
+|---|---|---|
+| `non_text_to_keep` | Whitelist of empty/non-text tags to preserve (e.g., images, figures, tables, line breaks). | If important non-text elements are being removed or you want to keep/drop more empty tags. |
+| `attr_stop_words` | Tokens matched against `id`/`class`/`role`/`item_type` on small elements; matches are removed as likely non-content. | Add tokens like `advert`, `hero`, `menu` to aggressively drop UI chrome, or remove tokens if content is lost. |
+| `remove_header_lists` | Removes links/lists/images within `<header>` to reduce nav clutter. | Set `False` if your header contains meaningful content you want to keep. |
+| `remove_footer_lists` | Removes links/lists/images within `<footer>` to reduce boilerplate. | Set `False` for content-heavy footers you need. |
+| `minify` | Minifies output HTML using `minify_html`. | Set `False` for readability or debugging; use `--pretty` in the CLI. |
+| `minify_kwargs` | Extra options passed to `minify_html.minify`. | Tune minification behavior (e.g., whitespace, comments) without changing cleaning. |
+| `meta` | lxml Cleaner option: remove `<meta>` content when `True`. | Usually leave `False`; enable only for strict sanitation. |
+| `page_structure` | lxml Cleaner option: remove page-structure tags (e.g., `<head>`, `<body>`) when `True`. | Rarely needed; keep `False` to preserve structure. |
+| `links` | lxml Cleaner option: sanitize/clean links. | Leave `True` unless you need raw anchors untouched. |
+| `scripts` | lxml Cleaner option: remove `<script>` tags when `True`. | Keep `False` to preserve scripts; usually safe to remove via `javascript=True` anyway. |
+| `javascript` | lxml Cleaner option: remove JS and event handlers. | Set `False` only if you truly need inline JS (not recommended). |
+| `comments` | lxml Cleaner option: remove HTML comments. | Set `False` to retain comments for debugging. |
+| `style` | lxml Cleaner option: remove CSS and style attributes. | Set `False` to keep inline styles/CSS. |
+| `processing_instructions` | lxml Cleaner option: remove processing instructions. | Rarely change; keep for safety. |
+| `embedded` | lxml Cleaner option: remove embedded content (e.g., `<embed>`, `<object>`). | Set `False` to keep embedded media. |
+| `frames` | lxml Cleaner option: remove frames/iframes. | Set `False` if iframes contain needed content. |
+| `forms` | lxml Cleaner option: remove form elements. | Set `False` if you need to keep forms/inputs. |
+| `annoying_tags` | lxml Cleaner option: remove tags considered "annoying" by lxml (e.g., `<blink>`, `<marquee>`). | Rarely change. |
+| `kill_tags` | Additional explicit tags to remove entirely. | Add site-specific or custom tags to drop. |
+| `remove_unknown_tags` | lxml Cleaner option: drop unknown/invalid tags. | Set `False` if you rely on custom elements. |
+| `safe_attrs_only` | Only allow attributes listed in `safe_attrs`. | Set `False` if you need to keep arbitrary attributes. |
+| `safe_attrs` | Allowed HTML attributes when `safe_attrs_only=True`. | Extend to keep additional attributes you trust. |
+### Types and Defaults
+| Parameter | Type | Default |
+|---|---|---|
+| `non_text_to_keep` | `set[str]` | media/meta/table/`br` tags |
+| `attr_stop_words` | `set[str]` | common UI/navigation tokens |
+| `remove_header_lists` | `bool` | `True` |
+| `remove_footer_lists` | `bool` | `True` |
+| `minify` | `bool` | `True` |
+| `minify_kwargs` | `dict` | `{}` |
+| `meta` | `bool` | `False` |
+| `page_structure` | `bool` | `False` |
+| `links` | `bool` | `True` |
+| `scripts` | `bool` | `False` |
+| `javascript` | `bool` | `True` |
+| `comments` | `bool` | `True` |
+| `style` | `bool` | `True` |
+| `processing_instructions` | `bool` | `True` |
+| `embedded` | `bool` | `True` |
+| `frames` | `bool` | `True` |
+| `forms` | `bool` | `True` |
+| `annoying_tags` | `bool` | `True` |
+| `kill_tags` | `set[str] &#124; None` | `None` |
+| `remove_unknown_tags` | `bool` | `True` |
+| `safe_attrs_only` | `bool` | `True` |
+| `safe_attrs` | `set[str]` | curated set |

{smol_html-0.1.0 → smol_html-0.1.2}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "smol-html"
-version = "0.1.0"
+version = "0.1.2"
 description = "Small, dependable HTML cleaner/minifier with sensible defaults"
 readme = { file = "README.md", content-type = "text/markdown" }
 requires-python = ">=3.9"
@@ -12,9 +12,10 @@ authors = [
 ]
 license = { text = "MIT" }
 dependencies = [
-  "beautifulsoup4>=4.13.5",
-  "lxml[html-clean]>=6.0.1",
-  "minify-html>=0.16.4",
+  "beautifulsoup4>=4.0.1",
+  "brotli>=0.5.2",
+  "lxml[html-clean]>=1.3.2",
+  "minify-html>=0.2.6",
 ]
 classifiers = [
   "Development Status :: 4 - Beta",
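
The dependency hunk above lowers every existing version floor and adds `brotli`. The sketch below classifies those changes, under the assumption that every specifier is a plain `name>=X.Y.Z` string, as it is in this hunk. The helper names `parse_floor` and `floor_changes` are hypothetical; for real specifiers a dedicated library such as `packaging` would be the safer choice.

```python
# Dependency lists copied from the pyproject.toml hunk above.
OLD = ["beautifulsoup4>=4.13.5", "lxml[html-clean]>=6.0.1", "minify-html>=0.16.4"]
NEW = [
    "beautifulsoup4>=4.0.1",
    "brotli>=0.5.2",
    "lxml[html-clean]>=1.3.2",
    "minify-html>=0.2.6",
]


def parse_floor(spec):
    """Split 'name>=X.Y.Z' into (name, (X, Y, Z)). Only handles '>=' specifiers."""
    name, version = spec.split(">=")
    return name, tuple(int(part) for part in version.split("."))


def floor_changes(old_specs, new_specs):
    """Label each dependency as added, removed, lowered, raised, or unchanged."""
    old = dict(parse_floor(s) for s in old_specs)
    new = dict(parse_floor(s) for s in new_specs)
    changes = {}
    for name, floor in new.items():
        if name not in old:
            changes[name] = "added"
        elif floor < old[name]:
            changes[name] = "lowered"
        elif floor > old[name]:
            changes[name] = "raised"
        else:
            changes[name] = "unchanged"
    for name in old:
        if name not in new:
            changes[name] = "removed"
    return changes


changes = floor_changes(OLD, NEW)
for name, status in changes.items():
    print(f"{name}: {status}")
```

Versions are compared as integer tuples rather than strings, because a string comparison would rank `0.2.6` above `0.16.4`.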

smol_html-0.1.2/smol.png ADDED

Binary file

smol_html-0.1.2/src/smol_html/__init__.py ADDED

@@ -0,0 +1,4 @@
+from smol_html.smol_html import SmolHtmlCleaner
+all = ["__version__", "SmolHtmlCleaner"]
+__version__ = "0.1.0"
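
Two details of the `__init__.py` above stand out. The export list is bound to `all` rather than `__all__`, and `__version__` still reads `"0.1.0"` while `pyproject.toml` declares `0.1.2`. Python consults only `__all__` for `from package import *`, so a plain `all` list shadows the builtin without restricting exports. The sketch below demonstrates the difference on a throwaway module. It assumes the authors meant `__all__`, which the diff does not confirm; `demo_pkg`, `star_import_names`, and the stub class are hypothetical stand-ins.

```python
import sys
import types


def star_import_names(source):
    """Build a throwaway module from `source` and list what `import *` exposes."""
    module = types.ModuleType("demo_pkg")
    exec(source, module.__dict__)
    sys.modules["demo_pkg"] = module
    try:
        namespace = {}
        exec("from demo_pkg import *", namespace)
    finally:
        del sys.modules["demo_pkg"]
    # exec() adds __builtins__ to the namespace; it is not an imported name.
    namespace.pop("__builtins__", None)
    return sorted(namespace)


# Stand-in for the module as shipped: the list is named `all`.
AS_WRITTEN = '''
class SmolHtmlCleaner: pass
all = ["__version__", "SmolHtmlCleaner"]
__version__ = "0.1.0"
'''

# The same module with the list named `__all__` (assumed intent).
WITH_DUNDER_ALL = AS_WRITTEN.replace("all =", "__all__ =")

as_written = star_import_names(AS_WRITTEN)
with_dunder_all = star_import_names(WITH_DUNDER_ALL)

print("as written:  ", as_written)
print("with __all__:", with_dunder_all)
```

Without `__all__`, the star import exposes every public name, including the stray `all` list, and omits `__version__` because it starts with an underscore.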
