everythingtohtml-0.1.2.tar.gz
- everythingtohtml-0.1.2/.gitignore +35 -0
- everythingtohtml-0.1.2/CHANGELOG.md +72 -0
- everythingtohtml-0.1.2/LICENSE +21 -0
- everythingtohtml-0.1.2/PKG-INFO +294 -0
- everythingtohtml-0.1.2/README.md +240 -0
- everythingtohtml-0.1.2/pyproject.toml +107 -0
- everythingtohtml-0.1.2/src/everythingtohtml/__about__.py +3 -0
- everythingtohtml-0.1.2/src/everythingtohtml/__init__.py +34 -0
- everythingtohtml-0.1.2/src/everythingtohtml/__main__.py +130 -0
- everythingtohtml-0.1.2/src/everythingtohtml/_base_converter.py +78 -0
- everythingtohtml-0.1.2/src/everythingtohtml/_everything_to_html.py +408 -0
- everythingtohtml-0.1.2/src/everythingtohtml/_exceptions.py +63 -0
- everythingtohtml-0.1.2/src/everythingtohtml/_html_builder.py +106 -0
- everythingtohtml-0.1.2/src/everythingtohtml/_merge.py +145 -0
- everythingtohtml-0.1.2/src/everythingtohtml/_stream_info.py +46 -0
- everythingtohtml-0.1.2/src/everythingtohtml/_text_utils.py +46 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/__init__.py +45 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_csv_converter.py +73 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_doc_converter.py +385 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_docx_converter.py +105 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_eml_converter.py +104 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_epub_converter.py +131 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_html_converter.py +66 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_ipynb_converter.py +96 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_json_converter.py +78 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_markdown_converter.py +57 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_odt_converter.py +171 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_pdf_converter.py +204 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_plain_text_converter.py +64 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_pptx_converter.py +233 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_rss_converter.py +146 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_rst_converter.py +57 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_xlsx_converter.py +84 -0
- everythingtohtml-0.1.2/src/everythingtohtml/converters/_yaml_converter.py +56 -0
- everythingtohtml-0.1.2/src/everythingtohtml/py.typed +0 -0
@@ -0,0 +1,35 @@
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.eggs/
+build/
+dist/
+*.egg
+
+# Virtual environments
+.venv/
+venv/
+env/
+
+# Test / coverage
+.pytest_cache/
+.coverage
+htmlcov/
+.mypy_cache/
+.ruff_cache/
+
+# Editors / OS
+.vscode/
+.idea/
+*.swp
+.DS_Store
+Thumbs.db
+
+# Local scratch output
+*.local.html
+/scratch/
+examples/output/
+
+# Built in CI for the in-browser demo
+site/wheels/
@@ -0,0 +1,72 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project are documented here. The format is based on
|
|
4
|
+
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and this project adheres
|
|
5
|
+
to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
6
|
+
|
|
7
|
+
## [Unreleased]
|
|
8
|
+
|
|
9
|
+
## [0.1.2] - 2026-06-09
|
|
10
|
+
|
|
11
|
+
### Changed
|
|
12
|
+
|
|
13
|
+
- **Tables render much better.** The default stylesheet now gives every table a
|
|
14
|
+
shaded header row, zebra striping, compact cells, and horizontal scrolling for
|
|
15
|
+
wide tables (so they no longer overflow the page). This applies to DOCX, XLSX,
|
|
16
|
+
legacy DOC (via LibreOffice), CSV, and Markdown output alike.
|
|
17
|
+
- **DOCX tables**: mammoth's bare tables are post-processed to promote the first
|
|
18
|
+
row to a real `<thead>`/`<th>` header and to unwrap single-paragraph cells, so
|
|
19
|
+
they read as proper tables instead of an unstyled grid.
|
|
20
|
+
- Version bump also refreshes the in-browser demo wheel (cache-bust).
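The DOCX table post-processing described above can be pictured with a small sketch. This is not the package's code: `promote_header` is a hypothetical helper, it assumes a flat, well-formed `<table><tr><td>` fragment, and it uses the standard-library `xml.etree` parser only to stay dependency-free.

```python
import xml.etree.ElementTree as ET


def promote_header(table_html: str) -> str:
    """Promote the first row to <thead>/<th> and unwrap single-<p> cells.

    Hypothetical helper; assumes a well-formed <table> whose rows are
    direct <tr> children (no existing <thead>/<tbody>).
    """
    table = ET.fromstring(table_html)
    rows = table.findall("tr")
    if not rows:
        return table_html

    # Unwrap cells whose only content is a single <p> element.
    for cell in list(table.iter("td")):
        children = list(cell)
        if (
            len(children) == 1
            and children[0].tag == "p"
            and not (cell.text or "").strip()
        ):
            paragraph = children[0]
            cell.remove(paragraph)
            cell.text = paragraph.text
            for child in list(paragraph):  # keep inline markup such as <strong>
                cell.append(child)

    # Turn the first row's cells into header cells.
    first, rest = rows[0], rows[1:]
    for cell in first.findall("td"):
        cell.tag = "th"

    # Rebuild the table as <thead> + <tbody>.
    for row in rows:
        table.remove(row)
    ET.SubElement(table, "thead").append(first)
    ET.SubElement(table, "tbody").extend(rest)
    return ET.tostring(table, encoding="unicode")
```

With a real `<thead>` in place, a stylesheet can target the header row and the body rows separately, which is what makes the shaded header and zebra striping possible.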
|
|
21
|
+
|
|
22
|
+
## [0.1.1] - 2026-06-09
|
|
23
|
+
|
|
24
|
+
### Added
|
|
25
|
+
|
|
26
|
+
- **EPUB** converter (built in, no extra): follows the spine reading order and
|
|
27
|
+
concatenates chapters into one HTML document.
|
|
28
|
+
- **Email** (`.eml`) converter (built in): renders headers, body (HTML or plain),
|
|
29
|
+
and an attachment list; HTML bodies are stripped of active content.
|
|
30
|
+
- **OpenDocument Text** (`.odt`) converter (built in): maps headings, paragraphs,
|
|
31
|
+
lists, and tables to semantic HTML using only core dependencies.
|
|
32
|
+
- **PDF** converter behind the new `pdf` extra (`pdfminer.six`): recovers prose as
|
|
33
|
+
paragraphs, one section per page.
|
|
34
|
+
- **Legacy `.doc`** converter behind the new `doc` extra: uses headless
|
|
35
|
+
LibreOffice when available for high-fidelity output, with a pure-Python
|
|
36
|
+
fallback otherwise.
|
|
37
|
+
- **`EverythingToHtml.merge()`**: combine several sources into one HTML document,
|
|
38
|
+
with `layout="stacked"` (table of contents) or `layout="columns"` (side by
|
|
39
|
+
side). Exposed on the CLI by passing two or more sources, plus `--columns`.
|
|
40
|
+
- **`EverythingToHtml.diff()`**: render a highlighted, line-by-line comparison of
|
|
41
|
+
two documents. Exposed on the CLI via `--diff`.
|
|
42
|
+
- **In-browser "universal reader" demo** (GitHub Pages + Pyodide): drag in a file
|
|
43
|
+
and read it as HTML entirely client-side, with multi-file merge and two-file
|
|
44
|
+
diff. PPTX shapes are positioned by their slide coordinates.
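To illustrate what "follows the spine reading order" means for the EPUB entry above, here is a minimal sketch. It assumes the standard EPUB layout (`META-INF/container.xml` pointing at an OPF package file); `spine_documents` is a hypothetical name, not the converter's API, and `xml.etree` is used only for brevity. For untrusted files a hardened parser such as the `defusedxml` dependency listed in the metadata would be the safer choice.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_bytes: bytes) -> list[str]:
    """Return archive paths of the content documents in spine order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml names the package (OPF) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]

        # The manifest maps item ids to hrefs; the spine lists ids in reading order.
        package = ET.fromstring(zf.read(opf_path))
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF file
        return [
            posixpath.join(base, manifest[ref.attrib["idref"]])
            for ref in package.findall("opf:spine/opf:itemref", NS)
            if ref.attrib["idref"] in manifest
        ]
```

Reading the chapters in this order and concatenating their bodies gives the single HTML document the changelog entry describes.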
|
|
45
|
+
|
|
46
|
+
### Fixed
|
|
47
|
+
|
|
48
|
+
- **Legacy `.doc` mojibake**: the pure-Python fallback now parses the Word piece
|
|
49
|
+
table (CLX) from the table stream and decodes each text piece with its own
|
|
50
|
+
8-bit/16-bit encoding (UTF-16LE or the language-appropriate code page). This
|
|
51
|
+
fixes garbled output — Chinese especially — that the earlier single-span
|
|
52
|
+
heuristic produced. The heuristic remains as a last-resort fallback.
|
|
53
|
+
- Optional-dependency errors now surface as `MissingDependencyException` with the
|
|
54
|
+
exact install hint, instead of being hidden inside a generic
|
|
55
|
+
`FileConversionException`.
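The piece-table fix above can be sketched as follows. This is an illustration based on the public Word binary format layout (a `PlcPcd` is an array of character positions followed by 8-byte piece descriptors), not the package's implementation. `decode_pieces` and its `codepage` parameter are hypothetical, locating the `PlcPcd` inside the CLX and choosing the language-appropriate code page are left out, and so are the format's special cases for a few 8-bit byte values.

```python
import struct


def decode_pieces(word_stream: bytes, plc_pcd: bytes, codepage: str = "cp1252") -> str:
    """Decode each text piece with its own 8-bit or 16-bit encoding.

    word_stream: bytes of the WordDocument stream.
    plc_pcd:     n + 1 character positions (4 bytes each) followed by
                 n piece descriptors (8 bytes each).
    """
    n = (len(plc_pcd) - 4) // 12
    cps = struct.unpack_from(f"<{n + 1}I", plc_pcd, 0)
    parts = []
    for i in range(n):
        descriptor = 4 * (n + 1) + 8 * i
        # Bytes 2..5 of a descriptor hold the file offset plus the "compressed" flag.
        (fc_raw,) = struct.unpack_from("<I", plc_pcd, descriptor + 2)
        char_count = cps[i + 1] - cps[i]
        fc = fc_raw & 0x3FFFFFFF
        if fc_raw & 0x40000000:
            # Compressed piece: one byte per character, stored at fc / 2.
            start = fc // 2
            raw = word_stream[start:start + char_count]
            parts.append(raw.decode(codepage, errors="replace"))
        else:
            # Uncompressed piece: UTF-16LE, two bytes per character, stored at fc.
            raw = word_stream[fc:fc + 2 * char_count]
            parts.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(parts)
```

Decoding per piece matters because one document can mix both encodings; treating the whole text span as a single encoding is what yields mojibake for scripts such as Chinese.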
|
|
56
|
+
|
|
57
|
+
## [0.1.0] - 2026-06-08
|
|
58
|
+
|
|
59
|
+
### Added
|
|
60
|
+
|
|
61
|
+
- Initial release of **everythingtohtml**.
|
|
62
|
+
- `EverythingToHtml` engine with stream detection, priority-based converter
|
|
63
|
+
dispatch, and entry-point plugin support.
|
|
64
|
+
- Built-in converters (no extra dependencies): plain text, Markdown, HTML
|
|
65
|
+
normalization, CSV/TSV, JSON/JSONL, Jupyter notebooks, and RSS/Atom feeds.
|
|
66
|
+
- Optional converters behind extras: Word (`docx`), Excel (`xlsx`),
|
|
67
|
+
PowerPoint (`pptx`), reStructuredText (`rst`), and YAML (`yaml`).
|
|
68
|
+
- `everythingtohtml` / `e2h` command-line interface with stdin support.
|
|
69
|
+
- Self-contained, dark-mode-aware HTML output with an overridable stylesheet.
|
|
70
|
+
|
|
71
|
+
[Unreleased]: https://github.com/He-wei-gui/everythingtohtml/compare/v0.1.0...HEAD
|
|
72
|
+
[0.1.0]: https://github.com/He-wei-gui/everythingtohtml/releases/tag/v0.1.0
|
|
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 everythingtohtml contributors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,294 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: everythingtohtml
|
|
3
|
+
Version: 0.1.2
|
|
4
|
+
Summary: Convert PDF, Office, data, and markup files into clean, self-contained HTML — for humans and for LLMs.
|
|
5
|
+
Project-URL: Homepage, https://github.com/He-wei-gui/everythingtohtml
|
|
6
|
+
Project-URL: Repository, https://github.com/He-wei-gui/everythingtohtml
|
|
7
|
+
Project-URL: Issues, https://github.com/He-wei-gui/everythingtohtml/issues
|
|
8
|
+
Project-URL: Changelog, https://github.com/He-wei-gui/everythingtohtml/blob/main/CHANGELOG.md
|
|
9
|
+
Author: everythingtohtml contributors
|
|
10
|
+
License-Expression: MIT
|
|
11
|
+
License-File: LICENSE
|
|
12
|
+
Keywords: converter,csv,document-conversion,docx,html,json,llm,markdown,pptx,rag,xlsx
|
|
13
|
+
Classifier: Development Status :: 4 - Beta
|
|
14
|
+
Classifier: Intended Audience :: Developers
|
|
15
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
16
|
+
Classifier: Operating System :: OS Independent
|
|
17
|
+
Classifier: Programming Language :: Python :: 3
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
22
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
23
|
+
Classifier: Topic :: Text Processing :: Markup :: HTML
|
|
24
|
+
Requires-Python: >=3.10
|
|
25
|
+
Requires-Dist: beautifulsoup4>=4.12
|
|
26
|
+
Requires-Dist: charset-normalizer>=3.0
|
|
27
|
+
Requires-Dist: defusedxml>=0.7
|
|
28
|
+
Requires-Dist: markdown-it-py>=3.0
|
|
29
|
+
Requires-Dist: mdurl>=0.1
|
|
30
|
+
Requires-Dist: puremagic>=1.20
|
|
31
|
+
Provides-Extra: all
|
|
32
|
+
Requires-Dist: docutils>=0.20; extra == 'all'
|
|
33
|
+
Requires-Dist: mammoth>=1.6; extra == 'all'
|
|
34
|
+
Requires-Dist: olefile>=0.46; extra == 'all'
|
|
35
|
+
Requires-Dist: openpyxl>=3.1; extra == 'all'
|
|
36
|
+
Requires-Dist: pdfminer-six>=20231228; extra == 'all'
|
|
37
|
+
Requires-Dist: python-pptx>=0.6.21; extra == 'all'
|
|
38
|
+
Requires-Dist: pyyaml>=6.0; extra == 'all'
|
|
39
|
+
Provides-Extra: doc
|
|
40
|
+
Requires-Dist: olefile>=0.46; extra == 'doc'
|
|
41
|
+
Provides-Extra: docx
|
|
42
|
+
Requires-Dist: mammoth>=1.6; extra == 'docx'
|
|
43
|
+
Provides-Extra: pdf
|
|
44
|
+
Requires-Dist: pdfminer-six>=20231228; extra == 'pdf'
|
|
45
|
+
Provides-Extra: pptx
|
|
46
|
+
Requires-Dist: python-pptx>=0.6.21; extra == 'pptx'
|
|
47
|
+
Provides-Extra: rst
|
|
48
|
+
Requires-Dist: docutils>=0.20; extra == 'rst'
|
|
49
|
+
Provides-Extra: xlsx
|
|
50
|
+
Requires-Dist: openpyxl>=3.1; extra == 'xlsx'
|
|
51
|
+
Provides-Extra: yaml
|
|
52
|
+
Requires-Dist: pyyaml>=6.0; extra == 'yaml'
|
|
53
|
+
Description-Content-Type: text/markdown
|
|
54
|
+
|
|
55
|
+
# everythingtohtml
|
|
56
|
+
|
|
57
|
+
> Convert (almost) any file into clean, self-contained HTML — a universal file reader for your browser and scripts.
|
|
58
|
+
|
|
59
|
+
[CI](https://github.com/He-wei-gui/everythingtohtml/actions/workflows/ci.yml)
|
|
60
|
+
[PyPI](https://pypi.org/project/everythingtohtml/)
|
|
61
|
+
|
|
62
|
+
[License](LICENSE)
|
|
63
|
+
|
|
64
|
+
English | [Chinese launch notes](docs/LAUNCH.zh-CN.md) | **[▶ Live demo: drag a file, read it as HTML](https://he-wei-gui.github.io/everythingtohtml/)**
|
|
65
|
+
|
|
66
|
+
<p align="center">
|
|
67
|
+
<a href="https://he-wei-gui.github.io/everythingtohtml/">
|
|
68
|
+
<img src="site/screenshot.png" alt="everythingtohtml in-browser universal file reader" width="760">
|
|
69
|
+
</a>
|
|
70
|
+
</p>
|
|
71
|
+
|
|
72
|
+
**everythingtohtml** is the spiritual inverse of tools like
|
|
73
|
+
[markitdown](https://github.com/microsoft/markitdown): instead of flattening rich
|
|
74
|
+
documents *down* to Markdown, it lifts a wide range of formats *up* into clean,
|
|
75
|
+
styled, standalone HTML you can open in a browser, embed in a page, or feed to a
|
|
76
|
+
workflow that wants structured markup.
|
|
77
|
+
|
|
78
|
+
One small API. One CLI. A pluggable converter registry. No browser, no network
|
|
79
|
+
required for local files.
|
|
80
|
+
|
|
81
|
+
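**Summary (translated from Chinese)**: everythingtohtml is a universal file reader in the browser, and also a Python package and CLI. It converts common files such as PDF, Office, Markdown, CSV, JSON, and EPUB into clean, self-contained HTML that is easy to read, share, and process automatically.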
|
|
82
|
+
|
|
83
|
+
```python
|
|
84
|
+
from everythingtohtml import EverythingToHtml
|
|
85
|
+
|
|
86
|
+
eth = EverythingToHtml()
|
|
87
|
+
result = eth.convert("quarterly-report.docx")
|
|
88
|
+
print(result.html) # a complete <!DOCTYPE html> document
|
|
89
|
+
print(result.title) # best-effort document title
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
```console
|
|
93
|
+
$ everythingtohtml notes.md -o notes.html
|
|
94
|
+
$ everythingtohtml data.csv > data.html
|
|
95
|
+
$ everythingtohtml https://example.com/feed.rss > feed.html
|
|
96
|
+
```
|
|
97
|
+
|
|
98
|
+
## Why HTML (and not Markdown)?
|
|
99
|
+
|
|
100
|
+
Markdown is lossy: tables get flattened, styling vanishes, slide structure
|
|
101
|
+
disappears, and nested data becomes ambiguous. HTML keeps the structure that
|
|
102
|
+
matters — headings, tables, lists, sections, links, images — while staying:
|
|
103
|
+
|
|
104
|
+
- **Human-friendly** — open the output in any browser, no toolchain needed.
|
|
105
|
+
- **Restyleable** — every document ships with a small, overridable stylesheet.
|
|
106
|
+
- **Structure-preserving** — explicit `<table>`/`<section>` markup keeps tables,
|
|
107
|
+
sections, and nested content easy to inspect and process.
|
|
108
|
+
- **Self-contained** — one file, valid HTML5, dark-mode aware.
|
|
109
|
+
|
|
110
|
+
## Supported formats
|
|
111
|
+
|
|
112
|
+
| Format | Extensions | Extra needed |
|
|
113
|
+
| --- | --- | --- |
|
|
114
|
+
| Plain text | `.txt`, anything textual | — (built in) |
|
|
115
|
+
| Markdown | `.md`, `.markdown`, `.mkd` | — (built in) |
|
|
116
|
+
| HTML (clean/normalize) | `.html`, `.htm`, `.xhtml` | — (built in) |
|
|
117
|
+
| CSV / TSV | `.csv`, `.tsv` | — (built in) |
|
|
118
|
+
| JSON / JSONL | `.json`, `.jsonl`, `.ndjson` | — (built in) |
|
|
119
|
+
| Jupyter notebook | `.ipynb` | — (built in) |
|
|
120
|
+
| RSS / Atom feeds | `.rss`, `.atom` | — (built in) |
|
|
121
|
+
| EPUB e-books | `.epub` | — (built in) |
|
|
122
|
+
| Email | `.eml` | — (built in) |
|
|
123
|
+
| OpenDocument Text | `.odt` | — (built in) |
|
|
124
|
+
| YAML | `.yaml`, `.yml` | `pip install everythingtohtml[yaml]` |
|
|
125
|
+
| reStructuredText | `.rst` | `pip install everythingtohtml[rst]` |
|
|
126
|
+
| Word | `.docx` | `pip install everythingtohtml[docx]` |
|
|
127
|
+
| Word (legacy) | `.doc` | `pip install everythingtohtml[doc]` (LibreOffice recommended) |
|
|
128
|
+
| Excel | `.xlsx`, `.xlsm` | `pip install everythingtohtml[xlsx]` |
|
|
129
|
+
| PowerPoint | `.pptx` | `pip install everythingtohtml[pptx]` |
|
|
130
|
+
| PDF | `.pdf` | `pip install everythingtohtml[pdf]` |
|
|
131
|
+
|
|
132
|
+
> **Legacy `.doc`**: best results come from having [LibreOffice](https://www.libreoffice.org/)
|
|
133
|
+
> installed (used headlessly for high-fidelity conversion). Without it, a
|
|
134
|
+
> pure-Python `olefile` fallback recovers the text content.
|
|
135
|
+
|
|
136
|
+
> Want everything? `pip install everythingtohtml[all]`
|
|
137
|
+
|
|
138
|
+
New formats are just a small class away — see [Writing a converter](#writing-a-converter).
|
|
139
|
+
|
|
140
|
+
## Installation
|
|
141
|
+
|
|
142
|
+
```console
|
|
143
|
+
# core formats only (tiny dependency footprint)
|
|
144
|
+
pip install everythingtohtml
|
|
145
|
+
|
|
146
|
+
# pull in every optional format (Office, PDF, reStructuredText, YAML)
|
|
147
|
+
pip install "everythingtohtml[all]"
|
|
148
|
+
|
|
149
|
+
# or cherry-pick
|
|
150
|
+
pip install "everythingtohtml[docx,xlsx]"
|
|
151
|
+
```
|
|
152
|
+
|
|
153
|
+
Requires Python 3.10+.
|
|
154
|
+
|
|
155
|
+
## Usage
|
|
156
|
+
|
|
157
|
+
### Library
|
|
158
|
+
|
|
159
|
+
```python
|
|
160
|
+
from everythingtohtml import EverythingToHtml
|
|
161
|
+
|
|
162
|
+
eth = EverythingToHtml()
|
|
163
|
+
|
|
164
|
+
# From a path
|
|
165
|
+
result = eth.convert("slides.pptx")
|
|
166
|
+
|
|
167
|
+
# From bytes or an open stream
|
|
168
|
+
with open("data.csv", "rb") as f:
|
|
169
|
+
result = eth.convert(f)
|
|
170
|
+
|
|
171
|
+
# From a URL (http/https/file/data URIs)
|
|
172
|
+
result = eth.convert("https://example.com/posts.atom")
|
|
173
|
+
|
|
174
|
+
# Give hints when the source is ambiguous (e.g. stdin)
|
|
175
|
+
from everythingtohtml import StreamInfo
|
|
176
|
+
result = eth.convert(raw_bytes, stream_info=StreamInfo(extension=".md"))
|
|
177
|
+
|
|
178
|
+
result.html # the full HTML document (str)
|
|
179
|
+
result.title # detected title, or None
|
|
180
|
+
result.text_content # alias for .html (drop-in for markdown-style code)
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
### Command line
|
|
184
|
+
|
|
185
|
+
```console
|
|
186
|
+
everythingtohtml SOURCE [-o OUTPUT] [--extension .md] [--mimetype text/markdown]
|
|
187
|
+
|
|
188
|
+
# convert a file to a file
|
|
189
|
+
everythingtohtml report.docx -o report.html
|
|
190
|
+
|
|
191
|
+
# pipe through stdin (give it a hint)
|
|
192
|
+
cat notes.md | everythingtohtml --extension .md > notes.html
|
|
193
|
+
|
|
194
|
+
# fetch and convert a remote feed
|
|
195
|
+
everythingtohtml https://hnrss.org/frontpage > hn.html
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
The CLI is also available as `e2h` for the impatient.
|
|
199
|
+
|
|
200
|
+
## Merging and comparing documents
|
|
201
|
+
|
|
202
|
+
Need to collate a stack of Word files into one page, or see exactly what changed
|
|
203
|
+
between two revisions? everythingtohtml does both — for **any** supported format.
|
|
204
|
+
|
|
205
|
+
```python
|
|
206
|
+
eth = EverythingToHtml()
|
|
207
|
+
|
|
208
|
+
# Merge several documents into one HTML page (each becomes a section, with a TOC)
|
|
209
|
+
merged = eth.merge(["intro.docx", "chapter1.doc", "appendix.pdf"])
|
|
210
|
+
|
|
211
|
+
# Place them side by side for visual comparison
|
|
212
|
+
columns = eth.merge(["draft-v1.docx", "draft-v2.docx"], layout="columns")
|
|
213
|
+
|
|
214
|
+
# Produce a highlighted, line-by-line diff of two documents' text
|
|
215
|
+
changes = eth.diff("spec-old.docx", "spec-new.docx")
|
|
216
|
+
with open("changes.html", "w", encoding="utf-8") as out:
    out.write(changes.html)
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
From the CLI:
|
|
220
|
+
|
|
221
|
+
```console
|
|
222
|
+
# two or more sources are merged automatically
|
|
223
|
+
everythingtohtml intro.docx chapter1.doc appendix.pdf -o handbook.html
|
|
224
|
+
|
|
225
|
+
# side-by-side layout
|
|
226
|
+
everythingtohtml old.docx new.docx --columns -o compare.html
|
|
227
|
+
|
|
228
|
+
# highlighted diff of exactly two documents
|
|
229
|
+
everythingtohtml spec-old.docx spec-new.docx --diff -o changes.html
|
|
230
|
+
```
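As a rough picture of what a highlighted, line-by-line diff involves, here is a minimal sketch built on the standard library's `difflib`. It is an assumption about the general technique, not the library's `diff()` implementation; `diff_to_html` and the `<ins>`/`<del>` markup are illustrative choices.

```python
import difflib
import html


def diff_to_html(old: str, new: str) -> str:
    """Render a line-by-line comparison of two texts as an HTML fragment."""
    rows = []
    for line in difflib.ndiff(old.splitlines(), new.splitlines()):
        marker, text = line[:2], html.escape(line[2:])
        if marker == "+ ":
            rows.append(f"<ins>{text}</ins>")    # line only in the new text
        elif marker == "- ":
            rows.append(f"<del>{text}</del>")    # line only in the old text
        elif marker == "  ":
            rows.append(f"<span>{text}</span>")  # unchanged line
        # "? " hint lines emitted by ndiff are skipped.
    return "<pre>" + "\n".join(rows) + "</pre>"
```

Comparing extracted text rather than raw markup keeps the diff focused on what a reader would see as a change.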
|
|
231
|
+
|
|
232
|
+
## Architecture
|
|
233
|
+
|
|
234
|
+
everythingtohtml borrows the proven shape of markitdown:
|
|
235
|
+
|
|
236
|
+
```
|
|
237
|
+
EverythingToHtml # engine: detection + dispatch + plugins
|
|
238
|
+
├─ StreamInfo # immutable bag of hints (ext, mime, charset, …)
|
|
239
|
+
├─ DocumentConverter # base class: accepts() + convert()
|
|
240
|
+
│ ├─ MarkdownConverter
|
|
241
|
+
│ ├─ CsvConverter
|
|
242
|
+
│ ├─ DocxConverter (mammoth)
|
|
243
|
+
│ └─ … one small class per format
|
|
244
|
+
└─ DocumentConverterResult # { html, title, metadata }
|
|
245
|
+
```
|
|
246
|
+
|
|
247
|
+
When you call `convert()`, the engine:
|
|
248
|
+
|
|
249
|
+
1. **Detects** the stream — extension, mimetype, declared charset, and magic-byte
|
|
250
|
+
sniffing via `puremagic` fill in a `StreamInfo`.
|
|
251
|
+
2. **Dispatches** — converters are tried in priority order; each `accepts()` is a
|
|
252
|
+
cheap, non-destructive check. Specific formats win over the plain-text
|
|
253
|
+
catch-all.
|
|
254
|
+
3. **Converts** — the winning converter returns a `DocumentConverterResult`. If a
|
|
255
|
+
converter accepts but raises, the engine records it and tries the next one, so
|
|
256
|
+
one greedy converter can't sink the whole conversion.
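The dispatch and convert steps above can be sketched in a few lines. This is a simplified model, not the engine's code: `Registration`, `dispatch`, and the rule that a lower priority value is tried first are all assumptions made for illustration.

```python
import io
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Registration:
    """Hypothetical registry entry; only `priority` takes part in ordering."""
    priority: float
    name: str = field(compare=False)
    accepts: Callable[[io.BytesIO, str], bool] = field(compare=False)
    convert: Callable[[io.BytesIO], str] = field(compare=False)


def dispatch(stream: io.BytesIO, extension: str, registrations: list) -> str:
    """Try converters in priority order; fall through when one raises."""
    failures = []
    for reg in sorted(registrations):  # assumption: lower value is tried first
        stream.seek(0)                 # keep accepts() non-destructive
        if not reg.accepts(stream, extension):
            continue
        stream.seek(0)
        try:
            return reg.convert(stream)
        except Exception as exc:       # record the failure and try the next one
            failures.append((reg.name, exc))
    raise ValueError(f"no converter succeeded; failures: {failures}")
```

Rewinding the stream before every `accepts()` and `convert()` call is what lets a later converter start from clean input after an earlier one has failed.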
|
|
257
|
+
|
|
258
|
+
### Writing a converter
|
|
259
|
+
|
|
260
|
+
```python
|
|
261
|
+
from everythingtohtml import DocumentConverter, DocumentConverterResult, StreamInfo
|
|
262
|
+
from everythingtohtml._html_builder import wrap_document, escape_text
|
|
263
|
+
|
|
264
|
+
class UpperTextConverter(DocumentConverter):
|
|
265
|
+
def accepts(self, file_stream, stream_info: StreamInfo, **kwargs) -> bool:
|
|
266
|
+
return stream_info.normalized_extension() == ".loud"
|
|
267
|
+
|
|
268
|
+
def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
|
|
269
|
+
text = file_stream.read().decode("utf-8").upper()
|
|
270
|
+
return DocumentConverterResult(wrap_document(f"<pre>{escape_text(text)}</pre>"))
|
|
271
|
+
|
|
272
|
+
eth = EverythingToHtml()
|
|
273
|
+
eth.register_converter(UpperTextConverter())
|
|
274
|
+
```
|
|
275
|
+
|
|
276
|
+
Ship it as a package and expose it as a plugin via entry points so any user can create
|
|
277
|
+
`EverythingToHtml(enable_plugins=True)` and pick it up automatically — see
|
|
278
|
+
[`docs/PLUGINS.md`](docs/PLUGINS.md).
|
|
279
|
+
|
|
280
|
+
## Contributing
|
|
281
|
+
|
|
282
|
+
Contributions are very welcome — new converters especially. See
|
|
283
|
+
[CONTRIBUTING.md](CONTRIBUTING.md) and our [Code of Conduct](CODE_OF_CONDUCT.md).
|
|
284
|
+
Found a security issue? See [SECURITY.md](SECURITY.md).
|
|
285
|
+
|
|
286
|
+
## Acknowledgements
|
|
287
|
+
|
|
288
|
+
The converter-registry design is directly inspired by Microsoft's excellent
|
|
289
|
+
[markitdown](https://github.com/microsoft/markitdown). everythingtohtml aims to be
|
|
290
|
+
its mirror image for teams that want structure-preserving HTML instead of Markdown.
|
|
291
|
+
|
|
292
|
+
## License
|
|
293
|
+
|
|
294
|
+
[MIT](LICENSE) © everythingtohtml contributors
|
|
@@ -0,0 +1,240 @@
|
|
|
1
|
+
# everythingtohtml
|
|
2
|
+
|
|
3
|
+
> Convert (almost) any file into clean, self-contained HTML — a universal file reader for your browser and scripts.
|
|
4
|
+
|
|
5
|
+
[CI](https://github.com/He-wei-gui/everythingtohtml/actions/workflows/ci.yml)
|
|
6
|
+
[PyPI](https://pypi.org/project/everythingtohtml/)
|
|
7
|
+
|
|
8
|
+
[License](LICENSE)
|
|
9
|
+
|
|
10
|
+
English | [Chinese launch notes](docs/LAUNCH.zh-CN.md) | **[▶ Live demo: drag a file, read it as HTML](https://he-wei-gui.github.io/everythingtohtml/)**
|
|
11
|
+
|
|
12
|
+
<p align="center">
|
|
13
|
+
<a href="https://he-wei-gui.github.io/everythingtohtml/">
|
|
14
|
+
<img src="site/screenshot.png" alt="everythingtohtml in-browser universal file reader" width="760">
|
|
15
|
+
</a>
|
|
16
|
+
</p>
|
|
17
|
+
|
|
18
|
+
**everythingtohtml** is the spiritual inverse of tools like
|
|
19
|
+
[markitdown](https://github.com/microsoft/markitdown): instead of flattening rich
|
|
20
|
+
documents *down* to Markdown, it lifts a wide range of formats *up* into clean,
|
|
21
|
+
styled, standalone HTML you can open in a browser, embed in a page, or feed to a
|
|
22
|
+
workflow that wants structured markup.
|
|
23
|
+
|
|
24
|
+
One small API. One CLI. A pluggable converter registry. No browser, no network
|
|
25
|
+
required for local files.
|
|
26
|
+
|
|
27
|
+
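**Summary (translated from Chinese)**: everythingtohtml is a universal file reader in the browser, and also a Python package and CLI. It converts common files such as PDF, Office, Markdown, CSV, JSON, and EPUB into clean, self-contained HTML that is easy to read, share, and process automatically.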
|
|
28
|
+
|
|
29
|
+
```python
|
|
30
|
+
from everythingtohtml import EverythingToHtml
|
|
31
|
+
|
|
32
|
+
eth = EverythingToHtml()
|
|
33
|
+
result = eth.convert("quarterly-report.docx")
|
|
34
|
+
print(result.html) # a complete <!DOCTYPE html> document
|
|
35
|
+
print(result.title) # best-effort document title
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
```console
|
|
39
|
+
$ everythingtohtml notes.md -o notes.html
|
|
40
|
+
$ everythingtohtml data.csv > data.html
|
|
41
|
+
$ everythingtohtml https://example.com/feed.rss > feed.html
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
## Why HTML (and not Markdown)?
|
|
45
|
+
|
|
46
|
+
Markdown is lossy: tables get flattened, styling vanishes, slide structure
|
|
47
|
+
disappears, and nested data becomes ambiguous. HTML keeps the structure that
|
|
48
|
+
matters — headings, tables, lists, sections, links, images — while staying:
|
|
49
|
+
|
|
50
|
+
- **Human-friendly** — open the output in any browser, no toolchain needed.
|
|
51
|
+
- **Restyleable** — every document ships with a small, overridable stylesheet.
|
|
52
|
+
- **Structure-preserving** — explicit `<table>`/`<section>` markup keeps tables,
|
|
53
|
+
sections, and nested content easy to inspect and process.
|
|
54
|
+
- **Self-contained** — one file, valid HTML5, dark-mode aware.
|
|
55
|
+
|
|
56
|
+
## Supported formats
|
|
57
|
+
|
|
58
|
+
| Format | Extensions | Extra needed |
|
|
59
|
+
| --- | --- | --- |
|
|
60
|
+
| Plain text | `.txt`, anything textual | — (built in) |
|
|
61
|
+
| Markdown | `.md`, `.markdown`, `.mkd` | — (built in) |
|
|
62
|
+
| HTML (clean/normalize) | `.html`, `.htm`, `.xhtml` | — (built in) |
|
|
63
|
+
| CSV / TSV | `.csv`, `.tsv` | — (built in) |
|
|
64
|
+
| JSON / JSONL | `.json`, `.jsonl`, `.ndjson` | — (built in) |
|
|
65
|
+
| Jupyter notebook | `.ipynb` | — (built in) |
|
|
66
|
+
| RSS / Atom feeds | `.rss`, `.atom` | — (built in) |
|
|
67
|
+
| EPUB e-books | `.epub` | — (built in) |
|
|
68
|
+
| Email | `.eml` | — (built in) |
|
|
69
|
+
| OpenDocument Text | `.odt` | — (built in) |
|
|
70
|
+
| YAML | `.yaml`, `.yml` | `pip install everythingtohtml[yaml]` |
|
|
71
|
+
| reStructuredText | `.rst` | `pip install everythingtohtml[rst]` |
|
|
72
|
+
| Word | `.docx` | `pip install everythingtohtml[docx]` |
|
|
73
|
+
| Word (legacy) | `.doc` | `pip install everythingtohtml[doc]` (LibreOffice recommended) |
|
|
74
|
+
| Excel | `.xlsx`, `.xlsm` | `pip install everythingtohtml[xlsx]` |
|
|
75
|
+
| PowerPoint | `.pptx` | `pip install everythingtohtml[pptx]` |
|
|
76
|
+
| PDF | `.pdf` | `pip install everythingtohtml[pdf]` |
|
|
77
|
+
|
|
78
|
+
> **Legacy `.doc`**: best results come from having [LibreOffice](https://www.libreoffice.org/)
|
|
79
|
+
> installed (used headlessly for high-fidelity conversion). Without it, a
|
|
80
|
+
> pure-Python `olefile` fallback recovers the text content.
|
|
81
|
+
|
|
82
|
+
> Want everything? `pip install everythingtohtml[all]`
|
|
83
|
+
|
|
84
|
+
New formats are just a small class away — see [Writing a converter](#writing-a-converter).
|
|
85
|
+
|
|
86
|
+
## Installation
|
|
87
|
+
|
|
88
|
+
```console
|
|
89
|
+
# core formats only (tiny dependency footprint)
|
|
90
|
+
pip install everythingtohtml
|
|
91
|
+
|
|
92
|
+
# pull in every optional format (Office, PDF, reStructuredText, YAML)
|
|
93
|
+
pip install "everythingtohtml[all]"
|
|
94
|
+
|
|
95
|
+
# or cherry-pick
|
|
96
|
+
pip install "everythingtohtml[docx,xlsx]"
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
Requires Python 3.10+.
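Because most formats sit behind extras, a converter has to cope with its backing library being absent. One common pattern is a guarded import that turns the failure into an actionable install hint. The sketch below shows that pattern only: `require` and `MissingDependencyError` are hypothetical names and are not claimed to match the package's own exception classes.

```python
import importlib


class MissingDependencyError(ImportError):
    """Hypothetical error carrying an install hint for an optional extra."""


def require(module_name: str, extra: str):
    """Import an optional module or raise with the matching pip command."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"{module_name!r} is required for this format; "
            f'install it with: pip install "everythingtohtml[{extra}]"'
        ) from exc
```

Deferring the import until a converter actually runs keeps the core install small while still telling the user exactly which extra to add.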
|
|
100
|
+
|
|
101
|
+
## Usage
|
|
102
|
+
|
|
103
|
+
### Library
|
|
104
|
+
|
|
105
|
+
```python
|
|
106
|
+
from everythingtohtml import EverythingToHtml
|
|
107
|
+
|
|
108
|
+
eth = EverythingToHtml()
|
|
109
|
+
|
|
110
|
+
# From a path
|
|
111
|
+
result = eth.convert("slides.pptx")
|
|
112
|
+
|
|
113
|
+
# From bytes or an open stream
|
|
114
|
+
with open("data.csv", "rb") as f:
|
|
115
|
+
result = eth.convert(f)
|
|
116
|
+
|
|
117
|
+
# From a URL (http/https/file/data URIs)
|
|
118
|
+
result = eth.convert("https://example.com/posts.atom")
|
|
119
|
+
|
|
120
|
+
# Give hints when the source is ambiguous (e.g. stdin)
|
|
121
|
+
from everythingtohtml import StreamInfo
|
|
122
|
+
result = eth.convert(raw_bytes, stream_info=StreamInfo(extension=".md"))
|
|
123
|
+
|
|
124
|
+
result.html # the full HTML document (str)
|
|
125
|
+
result.title # detected title, or None
|
|
126
|
+
result.text_content # alias for .html (drop-in for markdown-style code)
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
### Command line
|
|
130
|
+
|
|
131
|
+
```console
|
|
132
|
+
everythingtohtml SOURCE [-o OUTPUT] [--extension .md] [--mimetype text/markdown]
|
|
133
|
+
|
|
134
|
+
# convert a file to a file
|
|
135
|
+
everythingtohtml report.docx -o report.html
|
|
136
|
+
|
|
137
|
+
# pipe through stdin (give it a hint)
|
|
138
|
+
cat notes.md | everythingtohtml --extension .md > notes.html
|
|
139
|
+
|
|
140
|
+
# fetch and convert a remote feed
|
|
141
|
+
everythingtohtml https://hnrss.org/frontpage > hn.html
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
The CLI is also available as `e2h` for the impatient.
|
|
145
|
+
|
|
146
|
+
## Merging and comparing documents
|
|
147
|
+
|
|
148
|
+
Need to collate a stack of Word files into one page, or see exactly what changed
|
|
149
|
+
between two revisions? everythingtohtml does both — for **any** supported format.
|
|
150
|
+
|
|
151
|
+
```python
|
|
152
|
+
eth = EverythingToHtml()
|
|
153
|
+
|
|
154
|
+
# Merge several documents into one HTML page (each becomes a section, with a TOC)
|
|
155
|
+
merged = eth.merge(["intro.docx", "chapter1.doc", "appendix.pdf"])
|
|
156
|
+
|
|
157
|
+
# Place them side by side for visual comparison
|
|
158
|
+
columns = eth.merge(["draft-v1.docx", "draft-v2.docx"], layout="columns")
|
|
159
|
+
|
|
160
|
+
# Produce a highlighted, line-by-line diff of two documents' text
|
|
161
|
+
changes = eth.diff("spec-old.docx", "spec-new.docx")
|
|
162
|
+
with open("changes.html", "w", encoding="utf-8") as out:
    out.write(changes.html)
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
From the CLI:
|
|
166
|
+
|
|
167
|
+
```console
|
|
168
|
+
# two or more sources are merged automatically
|
|
169
|
+
everythingtohtml intro.docx chapter1.doc appendix.pdf -o handbook.html
|
|
170
|
+
|
|
171
|
+
# side-by-side layout
|
|
172
|
+
everythingtohtml old.docx new.docx --columns -o compare.html
|
|
173
|
+
|
|
174
|
+
# highlighted diff of exactly two documents
|
|
175
|
+
everythingtohtml spec-old.docx spec-new.docx --diff -o changes.html
|
|
176
|
+
```
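The stacked layout (one section per source, preceded by a table of contents) can be sketched as plain string assembly. This is an illustration of the idea, not the library's `merge()`; `merge_stacked`, the `doc-N` anchor scheme, and the `(title, body_html)` input shape are assumptions.

```python
import html


def merge_stacked(docs: list[tuple[str, str]]) -> str:
    """Combine (title, body_html) pairs into one fragment with a TOC."""
    toc, sections = [], []
    for index, (title, body) in enumerate(docs, start=1):
        anchor = f"doc-{index}"        # assumed anchor naming scheme
        safe_title = html.escape(title)
        toc.append(f'<li><a href="#{anchor}">{safe_title}</a></li>')
        sections.append(
            f'<section id="{anchor}"><h2>{safe_title}</h2>{body}</section>'
        )
    return "<nav><ol>" + "".join(toc) + "</ol></nav>" + "".join(sections)
```

Each body is assumed to be already-converted, trusted HTML; only the titles are escaped here.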
|
|
177
|
+
|
|
178
|
+
## Architecture
|
|
179
|
+
|
|
180
|
+
everythingtohtml borrows the proven shape of markitdown:
|
|
181
|
+
|
|
182
|
+
```
|
|
183
|
+
EverythingToHtml # engine: detection + dispatch + plugins
|
|
184
|
+
├─ StreamInfo # immutable bag of hints (ext, mime, charset, …)
|
|
185
|
+
├─ DocumentConverter # base class: accepts() + convert()
|
|
186
|
+
│ ├─ MarkdownConverter
|
|
187
|
+
│ ├─ CsvConverter
|
|
188
|
+
│ ├─ DocxConverter (mammoth)
|
|
189
|
+
│ └─ … one small class per format
|
|
190
|
+
└─ DocumentConverterResult # { html, title, metadata }
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
When you call `convert()`, the engine:
|
|
194
|
+
|
|
195
|
+
1. **Detects** the stream — extension, mimetype, declared charset, and magic-byte
|
|
196
|
+
sniffing via `puremagic` fill in a `StreamInfo`.
|
|
197
|
+
2. **Dispatches** — converters are tried in priority order; each `accepts()` is a
|
|
198
|
+
cheap, non-destructive check. Specific formats win over the plain-text
|
|
199
|
+
catch-all.
|
|
200
|
+
3. **Converts** — the winning converter returns a `DocumentConverterResult`. If a
|
|
201
|
+
converter accepts but raises, the engine records it and tries the next one, so
|
|
202
|
+
one greedy converter can't sink the whole conversion.
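The detection step can be illustrated with a small sketch. It assumes explicit hints win over the filename, and the filename wins over magic-byte sniffing; that ordering, the `Hints` class (a stand-in for `StreamInfo`), and the three-entry signature table are illustrative assumptions. The real engine is described as using `puremagic` for sniffing.

```python
import mimetypes
import os
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Hints:
    """Hypothetical stand-in for StreamInfo: an immutable bag of hints."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


# Tiny illustrative signature table; a real sniffer knows far more formats.
MAGIC = {
    b"%PDF-": ".pdf",
    b"PK\x03\x04": ".zip",        # ZIP container (DOCX, XLSX, PPTX, EPUB)
    b"\xd0\xcf\x11\xe0": ".doc",  # OLE2 compound file
}


def detect(head: bytes, filename: Optional[str] = None, hints: Hints = Hints()) -> Hints:
    """Fill in missing hints from the filename, then from the leading bytes."""
    ext = hints.extension
    if ext is None and filename:
        ext = os.path.splitext(filename)[1].lower() or None
    if ext is None:
        ext = next((e for sig, e in MAGIC.items() if head.startswith(sig)), None)
    mime = hints.mimetype
    if mime is None and ext:
        mime = mimetypes.guess_type("x" + ext)[0]
    # Frozen dataclass: return an updated copy instead of mutating in place.
    return replace(hints, extension=ext, mimetype=mime)
```

Keeping the hints immutable means every converter sees the same detection result, and a caller-supplied hint such as an extension for stdin input is never silently overwritten.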
|
|
203
|
+
|
|
204
|
+
### Writing a converter
|
|
205
|
+
|
|
206
|
+
```python
|
|
207
|
+
from everythingtohtml import DocumentConverter, DocumentConverterResult, StreamInfo
|
|
208
|
+
from everythingtohtml._html_builder import wrap_document, escape_text
|
|
209
|
+
|
|
210
|
+
class UpperTextConverter(DocumentConverter):
|
|
211
|
+
def accepts(self, file_stream, stream_info: StreamInfo, **kwargs) -> bool:
|
|
212
|
+
return stream_info.normalized_extension() == ".loud"
|
|
213
|
+
|
|
214
|
+
def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
|
|
215
|
+
text = file_stream.read().decode("utf-8").upper()
|
|
216
|
+
return DocumentConverterResult(wrap_document(f"<pre>{escape_text(text)}</pre>"))
|
|
217
|
+
|
|
218
|
+
eth = EverythingToHtml()
|
|
219
|
+
eth.register_converter(UpperTextConverter())
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
Ship it as a package and expose it as a plugin via entry points so any user can create
|
|
223
|
+
`EverythingToHtml(enable_plugins=True)` and pick it up automatically — see
|
|
224
|
+
[`docs/PLUGINS.md`](docs/PLUGINS.md).
|
|
225
|
+
|
|
226
|
+
## Contributing
|
|
227
|
+
|
|
228
|
+
Contributions are very welcome — new converters especially. See
|
|
229
|
+
[CONTRIBUTING.md](CONTRIBUTING.md) and our [Code of Conduct](CODE_OF_CONDUCT.md).
|
|
230
|
+
Found a security issue? See [SECURITY.md](SECURITY.md).
|
|
231
|
+
|
|
232
|
+
## Acknowledgements
|
|
233
|
+
|
|
234
|
+
The converter-registry design is directly inspired by Microsoft's excellent
|
|
235
|
+
[markitdown](https://github.com/microsoft/markitdown). everythingtohtml aims to be
|
|
236
|
+
its mirror image for teams that want structure-preserving HTML instead of Markdown.
|
|
237
|
+
|
|
238
|
+
## License
|
|
239
|
+
|
|
240
|
+
[MIT](LICENSE) © everythingtohtml contributors
|