everythingtohtml 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (35)
  1. everythingtohtml-0.1.2/.gitignore +35 -0
  2. everythingtohtml-0.1.2/CHANGELOG.md +72 -0
  3. everythingtohtml-0.1.2/LICENSE +21 -0
  4. everythingtohtml-0.1.2/PKG-INFO +294 -0
  5. everythingtohtml-0.1.2/README.md +240 -0
  6. everythingtohtml-0.1.2/pyproject.toml +107 -0
  7. everythingtohtml-0.1.2/src/everythingtohtml/__about__.py +3 -0
  8. everythingtohtml-0.1.2/src/everythingtohtml/__init__.py +34 -0
  9. everythingtohtml-0.1.2/src/everythingtohtml/__main__.py +130 -0
  10. everythingtohtml-0.1.2/src/everythingtohtml/_base_converter.py +78 -0
  11. everythingtohtml-0.1.2/src/everythingtohtml/_everything_to_html.py +408 -0
  12. everythingtohtml-0.1.2/src/everythingtohtml/_exceptions.py +63 -0
  13. everythingtohtml-0.1.2/src/everythingtohtml/_html_builder.py +106 -0
  14. everythingtohtml-0.1.2/src/everythingtohtml/_merge.py +145 -0
  15. everythingtohtml-0.1.2/src/everythingtohtml/_stream_info.py +46 -0
  16. everythingtohtml-0.1.2/src/everythingtohtml/_text_utils.py +46 -0
  17. everythingtohtml-0.1.2/src/everythingtohtml/converters/__init__.py +45 -0
  18. everythingtohtml-0.1.2/src/everythingtohtml/converters/_csv_converter.py +73 -0
  19. everythingtohtml-0.1.2/src/everythingtohtml/converters/_doc_converter.py +385 -0
  20. everythingtohtml-0.1.2/src/everythingtohtml/converters/_docx_converter.py +105 -0
  21. everythingtohtml-0.1.2/src/everythingtohtml/converters/_eml_converter.py +104 -0
  22. everythingtohtml-0.1.2/src/everythingtohtml/converters/_epub_converter.py +131 -0
  23. everythingtohtml-0.1.2/src/everythingtohtml/converters/_html_converter.py +66 -0
  24. everythingtohtml-0.1.2/src/everythingtohtml/converters/_ipynb_converter.py +96 -0
  25. everythingtohtml-0.1.2/src/everythingtohtml/converters/_json_converter.py +78 -0
  26. everythingtohtml-0.1.2/src/everythingtohtml/converters/_markdown_converter.py +57 -0
  27. everythingtohtml-0.1.2/src/everythingtohtml/converters/_odt_converter.py +171 -0
  28. everythingtohtml-0.1.2/src/everythingtohtml/converters/_pdf_converter.py +204 -0
  29. everythingtohtml-0.1.2/src/everythingtohtml/converters/_plain_text_converter.py +64 -0
  30. everythingtohtml-0.1.2/src/everythingtohtml/converters/_pptx_converter.py +233 -0
  31. everythingtohtml-0.1.2/src/everythingtohtml/converters/_rss_converter.py +146 -0
  32. everythingtohtml-0.1.2/src/everythingtohtml/converters/_rst_converter.py +57 -0
  33. everythingtohtml-0.1.2/src/everythingtohtml/converters/_xlsx_converter.py +84 -0
  34. everythingtohtml-0.1.2/src/everythingtohtml/converters/_yaml_converter.py +56 -0
  35. everythingtohtml-0.1.2/src/everythingtohtml/py.typed +0 -0
@@ -0,0 +1,35 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ .eggs/
6
+ build/
7
+ dist/
8
+ *.egg
9
+
10
+ # Virtual environments
11
+ .venv/
12
+ venv/
13
+ env/
14
+
15
+ # Test / coverage
16
+ .pytest_cache/
17
+ .coverage
18
+ htmlcov/
19
+ .mypy_cache/
20
+ .ruff_cache/
21
+
22
+ # Editors / OS
23
+ .vscode/
24
+ .idea/
25
+ *.swp
26
+ .DS_Store
27
+ Thumbs.db
28
+
29
+ # Local scratch output
30
+ *.local.html
31
+ /scratch/
32
+ examples/output/
33
+
34
+ # Built in CI for the in-browser demo
35
+ site/wheels/
@@ -0,0 +1,72 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project are documented here. The format is based on
4
+ [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and this project adheres
5
+ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
6
+
7
+ ## [Unreleased]
8
+
9
+ ## [0.1.2] - 2026-06-09
10
+
11
+ ### Changed
12
+
13
+ - **Tables render much better.** The default stylesheet now gives every table a
14
+ shaded header row, zebra striping, compact cells, and horizontal scrolling for
15
+ wide tables (so they no longer overflow the page). This applies to DOCX, XLSX,
16
+ legacy DOC (via LibreOffice), CSV, and Markdown output alike.
17
+ - **DOCX tables**: mammoth's bare tables are post-processed to promote the first
18
+ row to a real `<thead>`/`<th>` header and to unwrap single-paragraph cells, so
19
+ they read as proper tables instead of an unstyled grid.
20
+ - Version bump also refreshes the in-browser demo wheel (cache-bust).
21
+
22
+ ## [0.1.1] - 2026-06-09
23
+
24
+ ### Added
25
+
26
+ - **EPUB** converter (built in, no extra): follows the spine reading order and
27
+ concatenates chapters into one HTML document.
28
+ - **Email** (`.eml`) converter (built in): renders headers, body (HTML or plain),
29
+ and an attachment list; HTML bodies are stripped of active content.
30
+ - **OpenDocument Text** (`.odt`) converter (built in): maps headings, paragraphs,
31
+ lists, and tables to semantic HTML using only core dependencies.
32
+ - **PDF** converter behind the new `pdf` extra (`pdfminer.six`): recovers prose as
33
+ paragraphs, one section per page.
34
+ - **Legacy `.doc`** converter behind the new `doc` extra: uses headless
35
+ LibreOffice when available for high-fidelity output, with a pure-Python
36
+ fallback otherwise.
37
+ - **`EverythingToHtml.merge()`**: combine several sources into one HTML document,
38
+ with `layout="stacked"` (table of contents) or `layout="columns"` (side by
39
+ side). Exposed on the CLI by passing two or more sources, plus `--columns`.
40
+ - **`EverythingToHtml.diff()`**: render a highlighted, line-by-line comparison of
41
+ two documents. Exposed on the CLI via `--diff`.
42
+ - **In-browser "universal reader" demo** (GitHub Pages + Pyodide): drag in a file
43
+ and read it as HTML entirely client-side, with multi-file merge and two-file
44
+ diff. PPTX shapes are positioned by their slide coordinates.
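
The `layout="stacked"` behaviour described in the `merge()` entry above (one section per source, preceded by a table of contents) can be pictured with a small standard-library sketch. This is not the package's implementation: `merge_stacked`, the `doc-N` anchor scheme, and the `<nav>`/`<section>` markup are assumptions made purely for illustration.

```python
import html


def merge_stacked(sections):
    """Join (title, body_html) pairs into one fragment with a table of contents.

    Hypothetical helper: the real merge() output format is not shown in this diff.
    """
    toc, bodies = [], []
    for index, (title, body_html) in enumerate(sections, start=1):
        anchor = f"doc-{index}"          # assumed anchor naming scheme
        safe_title = html.escape(title)  # titles are untrusted text, so escape them
        toc.append(f'<li><a href="#{anchor}">{safe_title}</a></li>')
        bodies.append(
            f'<section id="{anchor}"><h2>{safe_title}</h2>{body_html}</section>'
        )
    return "<nav><ol>" + "".join(toc) + "</ol></nav>" + "".join(bodies)


page = merge_stacked([("Intro", "<p>a</p>"), ("Q&A", "<p>b</p>")])
```

Each body is passed through untouched because it is assumed to be HTML already produced by a converter; only the titles are escaped.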
45
+
46
+ ### Fixed
47
+
48
+ - **Legacy `.doc` mojibake**: the pure-Python fallback now parses the Word piece
49
+ table (CLX) from the table stream and decodes each text piece with its own
50
+ 8-bit/16-bit encoding (UTF-16LE or the language-appropriate code page). This
51
+ fixes garbled output — Chinese especially — that the earlier single-span
52
+ heuristic produced. The heuristic remains as a last-resort fallback.
53
+ - Optional-dependency errors now surface as `MissingDependencyException` with the
54
+ exact install hint, instead of being hidden inside a generic
55
+ `FileConversionException`.
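
The mojibake fix above hinges on decoding every text piece with its own encoding instead of guessing one encoding for the whole document. The sketch below shows only that idea. `TextPiece`, `decode_pieces`, and the `cp1252` default are hypothetical; parsing the real piece table (CLX) out of the table stream is not attempted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPiece:
    """Hypothetical stand-in for one entry of a Word piece table."""

    raw: bytes        # the bytes of this piece as stored in the file
    compressed: bool  # True: one byte per character; False: UTF-16LE


def decode_pieces(pieces, codepage="cp1252"):
    """Decode each piece with its own encoding and join the results."""
    parts = []
    for piece in pieces:
        # 8-bit pieces use the (assumed) code page, 16-bit pieces use UTF-16LE.
        encoding = codepage if piece.compressed else "utf-16-le"
        parts.append(piece.raw.decode(encoding, errors="replace"))
    return "".join(parts)


pieces = [
    TextPiece("Hello, ".encode("cp1252"), compressed=True),
    TextPiece("世界".encode("utf-16-le"), compressed=False),
]
text = decode_pieces(pieces)
```

Decoding the second piece as `cp1252` (what a single-encoding heuristic might do) would yield garbage, which is the failure mode the changelog entry describes.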
56
+
57
+ ## [0.1.0] - 2026-06-08
58
+
59
+ ### Added
60
+
61
+ - Initial release of **everythingtohtml**.
62
+ - `EverythingToHtml` engine with stream detection, priority-based converter
63
+ dispatch, and entry-point plugin support.
64
+ - Built-in converters (no extra dependencies): plain text, Markdown, HTML
65
+ normalization, CSV/TSV, JSON/JSONL, Jupyter notebooks, and RSS/Atom feeds.
66
+ - Optional converters behind extras: Word (`docx`), Excel (`xlsx`),
67
+ PowerPoint (`pptx`), reStructuredText (`rst`), and YAML (`yaml`).
68
+ - `everythingtohtml` / `e2h` command-line interface with stdin support.
69
+ - Self-contained, dark-mode-aware HTML output with an overridable stylesheet.
70
+
71
+ [Unreleased]: https://github.com/He-wei-gui/everythingtohtml/compare/v0.1.0...HEAD
72
+ [0.1.0]: https://github.com/He-wei-gui/everythingtohtml/releases/tag/v0.1.0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 everythingtohtml contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,294 @@
1
+ Metadata-Version: 2.4
2
+ Name: everythingtohtml
3
+ Version: 0.1.2
4
+ Summary: Convert PDF, Office, data, and markup files into clean, self-contained HTML — for humans and for LLMs.
5
+ Project-URL: Homepage, https://github.com/He-wei-gui/everythingtohtml
6
+ Project-URL: Repository, https://github.com/He-wei-gui/everythingtohtml
7
+ Project-URL: Issues, https://github.com/He-wei-gui/everythingtohtml/issues
8
+ Project-URL: Changelog, https://github.com/He-wei-gui/everythingtohtml/blob/main/CHANGELOG.md
9
+ Author: everythingtohtml contributors
10
+ License-Expression: MIT
11
+ License-File: LICENSE
12
+ Keywords: converter,csv,document-conversion,docx,html,json,llm,markdown,pptx,rag,xlsx
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Programming Language :: Python :: 3.13
22
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
23
+ Classifier: Topic :: Text Processing :: Markup :: HTML
24
+ Requires-Python: >=3.10
25
+ Requires-Dist: beautifulsoup4>=4.12
26
+ Requires-Dist: charset-normalizer>=3.0
27
+ Requires-Dist: defusedxml>=0.7
28
+ Requires-Dist: markdown-it-py>=3.0
29
+ Requires-Dist: mdurl>=0.1
30
+ Requires-Dist: puremagic>=1.20
31
+ Provides-Extra: all
32
+ Requires-Dist: docutils>=0.20; extra == 'all'
33
+ Requires-Dist: mammoth>=1.6; extra == 'all'
34
+ Requires-Dist: olefile>=0.46; extra == 'all'
35
+ Requires-Dist: openpyxl>=3.1; extra == 'all'
36
+ Requires-Dist: pdfminer-six>=20231228; extra == 'all'
37
+ Requires-Dist: python-pptx>=0.6.21; extra == 'all'
38
+ Requires-Dist: pyyaml>=6.0; extra == 'all'
39
+ Provides-Extra: doc
40
+ Requires-Dist: olefile>=0.46; extra == 'doc'
41
+ Provides-Extra: docx
42
+ Requires-Dist: mammoth>=1.6; extra == 'docx'
43
+ Provides-Extra: pdf
44
+ Requires-Dist: pdfminer-six>=20231228; extra == 'pdf'
45
+ Provides-Extra: pptx
46
+ Requires-Dist: python-pptx>=0.6.21; extra == 'pptx'
47
+ Provides-Extra: rst
48
+ Requires-Dist: docutils>=0.20; extra == 'rst'
49
+ Provides-Extra: xlsx
50
+ Requires-Dist: openpyxl>=3.1; extra == 'xlsx'
51
+ Provides-Extra: yaml
52
+ Requires-Dist: pyyaml>=6.0; extra == 'yaml'
53
+ Description-Content-Type: text/markdown
54
+
55
+ # everythingtohtml
56
+
57
+ > Convert (almost) any file into clean, self-contained HTML — a universal file reader for your browser and scripts.
58
+
59
+ [![CI](https://github.com/He-wei-gui/everythingtohtml/actions/workflows/ci.yml/badge.svg)](https://github.com/He-wei-gui/everythingtohtml/actions/workflows/ci.yml)
60
+ [![PyPI](https://img.shields.io/pypi/v/everythingtohtml.svg)](https://pypi.org/project/everythingtohtml/)
61
+ [![Python versions](https://img.shields.io/pypi/pyversions/everythingtohtml.svg)](https://pypi.org/project/everythingtohtml/)
62
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
63
+
64
+ English | [Chinese launch notes](docs/LAUNCH.zh-CN.md) | **[▶ Live demo — drag a file, read it as HTML](https://he-wei-gui.github.io/everythingtohtml/)**
65
+
66
+ <p align="center">
67
+ <a href="https://he-wei-gui.github.io/everythingtohtml/">
68
+ <img src="site/screenshot.png" alt="everythingtohtml in-browser universal file reader" width="760">
69
+ </a>
70
+ </p>
71
+
72
+ **everythingtohtml** is the spiritual inverse of tools like
73
+ [markitdown](https://github.com/microsoft/markitdown): instead of flattening rich
74
+ documents *down* to Markdown, it lifts a wide range of formats *up* into clean,
75
+ styled, standalone HTML you can open in a browser, embed in a page, or feed to a
76
+ workflow that wants structured markup.
77
+
78
+ One small API. One CLI. A pluggable converter registry. No browser, no network
79
+ required for local files.
80
+
81
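+ **Introduction (translated from Chinese)**: everythingtohtml is a universal file reader that runs in the browser, and also a Python package and CLI. It converts common files such as PDF, Office, Markdown, CSV, JSON, and EPUB into clean, self-contained HTML that is easy to read, share, and process automatically.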
82
+
83
+ ```python
84
+ from everythingtohtml import EverythingToHtml
85
+
86
+ eth = EverythingToHtml()
87
+ result = eth.convert("quarterly-report.docx")
88
+ print(result.html) # a complete <!DOCTYPE html> document
89
+ print(result.title) # best-effort document title
90
+ ```
91
+
92
+ ```console
93
+ $ everythingtohtml notes.md -o notes.html
94
+ $ everythingtohtml data.csv > data.html
95
+ $ everythingtohtml https://example.com/feed.rss > feed.html
96
+ ```
97
+
98
+ ## Why HTML (and not Markdown)?
99
+
100
+ Markdown is lossy: tables get flattened, styling vanishes, slide structure
101
+ disappears, and nested data becomes ambiguous. HTML keeps the structure that
102
+ matters — headings, tables, lists, sections, links, images — while staying:
103
+
104
+ - **Human-friendly** — open the output in any browser, no toolchain needed.
105
+ - **Restyleable** — every document ships with a small, overridable stylesheet.
106
+ - **Structure-preserving** — explicit `<table>`/`<section>` markup keeps tables,
107
+ sections, and nested content easy to inspect and process.
108
+ - **Self-contained** — one file, valid HTML5, dark-mode aware.
109
+
110
+ ## Supported formats
111
+
112
+ | Format | Extensions | Extra needed |
113
+ | --- | --- | --- |
114
+ | Plain text | `.txt`, anything textual | — (built in) |
115
+ | Markdown | `.md`, `.markdown`, `.mkd` | — (built in) |
116
+ | HTML (clean/normalize) | `.html`, `.htm`, `.xhtml` | — (built in) |
117
+ | CSV / TSV | `.csv`, `.tsv` | — (built in) |
118
+ | JSON / JSONL | `.json`, `.jsonl`, `.ndjson` | — (built in) |
119
+ | Jupyter notebook | `.ipynb` | — (built in) |
120
+ | RSS / Atom feeds | `.rss`, `.atom` | — (built in) |
121
+ | EPUB e-books | `.epub` | — (built in) |
122
+ | Email | `.eml` | — (built in) |
123
+ | OpenDocument Text | `.odt` | — (built in) |
124
+ | YAML | `.yaml`, `.yml` | `pip install everythingtohtml[yaml]` |
125
+ | reStructuredText | `.rst` | `pip install everythingtohtml[rst]` |
126
+ | Word | `.docx` | `pip install everythingtohtml[docx]` |
127
+ | Word (legacy) | `.doc` | `pip install everythingtohtml[doc]` (LibreOffice recommended) |
128
+ | Excel | `.xlsx`, `.xlsm` | `pip install everythingtohtml[xlsx]` |
129
+ | PowerPoint | `.pptx` | `pip install everythingtohtml[pptx]` |
130
+ | PDF | `.pdf` | `pip install everythingtohtml[pdf]` |
131
+
132
+ > **Legacy `.doc`**: best results come from having [LibreOffice](https://www.libreoffice.org/)
133
+ > installed (used headlessly for high-fidelity conversion). Without it, a
134
+ > pure-Python `olefile` fallback recovers the text content.
135
+
136
+ > Want everything? `pip install everythingtohtml[all]`
137
+
138
+ New formats are just a small class away — see [Writing a converter](#writing-a-converter).
139
+
140
+ ## Installation
141
+
142
+ ```console
143
+ # core formats only (tiny dependency footprint)
144
+ pip install everythingtohtml
145
+
146
+ # pull in Office + data formats
147
+ pip install "everythingtohtml[all]"
148
+
149
+ # or cherry-pick
150
+ pip install "everythingtohtml[docx,xlsx]"
151
+ ```
152
+
153
+ Requires Python 3.10+.
154
+
155
+ ## Usage
156
+
157
+ ### Library
158
+
159
+ ```python
160
+ from everythingtohtml import EverythingToHtml
161
+
162
+ eth = EverythingToHtml()
163
+
164
+ # From a path
165
+ result = eth.convert("slides.pptx")
166
+
167
+ # From bytes or an open stream
168
+ with open("data.csv", "rb") as f:
169
+     result = eth.convert(f)
170
+
171
+ # From a URL (http/https/file/data URIs)
172
+ result = eth.convert("https://example.com/posts.atom")
173
+
174
+ # Give hints when the source is ambiguous (e.g. stdin)
175
+ from everythingtohtml import StreamInfo
176
+ result = eth.convert(raw_bytes, stream_info=StreamInfo(extension=".md"))
177
+
178
+ result.html # the full HTML document (str)
179
+ result.title # detected title, or None
180
+ result.text_content # alias for .html (drop-in for markdown-style code)
181
+ ```
182
+
183
+ ### Command line
184
+
185
+ ```console
186
+ everythingtohtml SOURCE [-o OUTPUT] [--extension .md] [--mimetype text/markdown]
187
+
188
+ # convert a file to a file
189
+ everythingtohtml report.docx -o report.html
190
+
191
+ # pipe through stdin (give it a hint)
192
+ cat notes.md | everythingtohtml --extension .md > notes.html
193
+
194
+ # fetch and convert a remote feed
195
+ everythingtohtml https://hnrss.org/frontpage > hn.html
196
+ ```
197
+
198
+ The CLI is also available as `e2h` for the impatient.
199
+
200
+ ## Merging and comparing documents
201
+
202
+ Need to collate a stack of Word files into one page, or see exactly what changed
203
+ between two revisions? everythingtohtml does both — for **any** supported format.
204
+
205
+ ```python
206
+ eth = EverythingToHtml()
207
+
208
+ # Merge several documents into one HTML page (each becomes a section, with a TOC)
209
+ merged = eth.merge(["intro.docx", "chapter1.doc", "appendix.pdf"])
210
+
211
+ # Place them side by side for visual comparison
212
+ columns = eth.merge(["draft-v1.docx", "draft-v2.docx"], layout="columns")
213
+
214
+ # Produce a highlighted, line-by-line diff of two documents' text
215
+ changes = eth.diff("spec-old.docx", "spec-new.docx")
216
+ open("changes.html", "w", encoding="utf-8").write(changes.html)
217
+ ```
218
+
219
+ From the CLI:
220
+
221
+ ```console
222
+ # two or more sources are merged automatically
223
+ everythingtohtml intro.docx chapter1.doc appendix.pdf -o handbook.html
224
+
225
+ # side-by-side layout
226
+ everythingtohtml old.docx new.docx --columns -o compare.html
227
+
228
+ # highlighted diff of exactly two documents
229
+ everythingtohtml spec-old.docx spec-new.docx --diff -o changes.html
230
+ ```
231
+
232
+ ## Architecture
233
+
234
+ everythingtohtml borrows the proven shape of markitdown:
235
+
236
+ ```
237
+ EverythingToHtml # engine: detection + dispatch + plugins
238
+ ├─ StreamInfo # immutable bag of hints (ext, mime, charset, …)
239
+ ├─ DocumentConverter # base class: accepts() + convert()
240
+ │ ├─ MarkdownConverter
241
+ │ ├─ CsvConverter
242
+ │ ├─ DocxConverter (mammoth)
243
+ │ └─ … one small class per format
244
+ └─ DocumentConverterResult # { html, title, metadata }
245
+ ```
246
+
247
+ When you call `convert()`, the engine:
248
+
249
+ 1. **Detects** the stream — extension, mimetype, declared charset, and magic-byte
250
+ sniffing via `puremagic` fill in a `StreamInfo`.
251
+ 2. **Dispatches** — converters are tried in priority order; each `accepts()` is a
252
+ cheap, non-destructive check. Specific formats win over the plain-text
253
+ catch-all.
254
+ 3. **Converts** — the winning converter returns a `DocumentConverterResult`. If a
255
+ converter accepts but raises, the engine records it and tries the next one, so
256
+ one greedy converter can't sink the whole conversion.
257
+
258
+ ### Writing a converter
259
+
260
+ ```python
261
+ from everythingtohtml import DocumentConverter, DocumentConverterResult, EverythingToHtml, StreamInfo
262
+ from everythingtohtml._html_builder import wrap_document, escape_text
263
+
264
+ class UpperTextConverter(DocumentConverter):
265
+     def accepts(self, file_stream, stream_info: StreamInfo, **kwargs) -> bool:
266
+         return stream_info.normalized_extension() == ".loud"
267
+
268
+     def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
269
+         text = file_stream.read().decode("utf-8").upper()
270
+         return DocumentConverterResult(wrap_document(f"<pre>{escape_text(text)}</pre>"))
271
+
272
+ eth = EverythingToHtml()
273
+ eth.register_converter(UpperTextConverter())
274
+ ```
275
+
276
+ Ship it as a package and expose it as a plugin via entry points so any user can
277
+ `EverythingToHtml(enable_plugins=True)` and pick it up automatically — see
278
+ [`docs/PLUGINS.md`](docs/PLUGINS.md).
279
+
280
+ ## Contributing
281
+
282
+ Contributions are very welcome — new converters especially. See
283
+ [CONTRIBUTING.md](CONTRIBUTING.md) and our [Code of Conduct](CODE_OF_CONDUCT.md).
284
+ Found a security issue? See [SECURITY.md](SECURITY.md).
285
+
286
+ ## Acknowledgements
287
+
288
+ The converter-registry design is directly inspired by Microsoft's excellent
289
+ [markitdown](https://github.com/microsoft/markitdown). everythingtohtml aims to be
290
+ its mirror image for teams that want structure-preserving HTML instead of Markdown.
291
+
292
+ ## License
293
+
294
+ [MIT](LICENSE) © everythingtohtml contributors
@@ -0,0 +1,240 @@
1
+ # everythingtohtml
2
+
3
+ > Convert (almost) any file into clean, self-contained HTML — a universal file reader for your browser and scripts.
4
+
5
+ [![CI](https://github.com/He-wei-gui/everythingtohtml/actions/workflows/ci.yml/badge.svg)](https://github.com/He-wei-gui/everythingtohtml/actions/workflows/ci.yml)
6
+ [![PyPI](https://img.shields.io/pypi/v/everythingtohtml.svg)](https://pypi.org/project/everythingtohtml/)
7
+ [![Python versions](https://img.shields.io/pypi/pyversions/everythingtohtml.svg)](https://pypi.org/project/everythingtohtml/)
8
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
9
+
10
+ English | [Chinese launch notes](docs/LAUNCH.zh-CN.md) | **[▶ Live demo — drag a file, read it as HTML](https://he-wei-gui.github.io/everythingtohtml/)**
11
+
12
+ <p align="center">
13
+ <a href="https://he-wei-gui.github.io/everythingtohtml/">
14
+ <img src="site/screenshot.png" alt="everythingtohtml in-browser universal file reader" width="760">
15
+ </a>
16
+ </p>
17
+
18
+ **everythingtohtml** is the spiritual inverse of tools like
19
+ [markitdown](https://github.com/microsoft/markitdown): instead of flattening rich
20
+ documents *down* to Markdown, it lifts a wide range of formats *up* into clean,
21
+ styled, standalone HTML you can open in a browser, embed in a page, or feed to a
22
+ workflow that wants structured markup.
23
+
24
+ One small API. One CLI. A pluggable converter registry. No browser, no network
25
+ required for local files.
26
+
27
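+ **Introduction (translated from Chinese)**: everythingtohtml is a universal file reader that runs in the browser, and also a Python package and CLI. It converts common files such as PDF, Office, Markdown, CSV, JSON, and EPUB into clean, self-contained HTML that is easy to read, share, and process automatically.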
28
+
29
+ ```python
30
+ from everythingtohtml import EverythingToHtml
31
+
32
+ eth = EverythingToHtml()
33
+ result = eth.convert("quarterly-report.docx")
34
+ print(result.html) # a complete <!DOCTYPE html> document
35
+ print(result.title) # best-effort document title
36
+ ```
37
+
38
+ ```console
39
+ $ everythingtohtml notes.md -o notes.html
40
+ $ everythingtohtml data.csv > data.html
41
+ $ everythingtohtml https://example.com/feed.rss > feed.html
42
+ ```
43
+
44
+ ## Why HTML (and not Markdown)?
45
+
46
+ Markdown is lossy: tables get flattened, styling vanishes, slide structure
47
+ disappears, and nested data becomes ambiguous. HTML keeps the structure that
48
+ matters — headings, tables, lists, sections, links, images — while staying:
49
+
50
+ - **Human-friendly** — open the output in any browser, no toolchain needed.
51
+ - **Restyleable** — every document ships with a small, overridable stylesheet.
52
+ - **Structure-preserving** — explicit `<table>`/`<section>` markup keeps tables,
53
+ sections, and nested content easy to inspect and process.
54
+ - **Self-contained** — one file, valid HTML5, dark-mode aware.
55
+
56
+ ## Supported formats
57
+
58
+ | Format | Extensions | Extra needed |
59
+ | --- | --- | --- |
60
+ | Plain text | `.txt`, anything textual | — (built in) |
61
+ | Markdown | `.md`, `.markdown`, `.mkd` | — (built in) |
62
+ | HTML (clean/normalize) | `.html`, `.htm`, `.xhtml` | — (built in) |
63
+ | CSV / TSV | `.csv`, `.tsv` | — (built in) |
64
+ | JSON / JSONL | `.json`, `.jsonl`, `.ndjson` | — (built in) |
65
+ | Jupyter notebook | `.ipynb` | — (built in) |
66
+ | RSS / Atom feeds | `.rss`, `.atom` | — (built in) |
67
+ | EPUB e-books | `.epub` | — (built in) |
68
+ | Email | `.eml` | — (built in) |
69
+ | OpenDocument Text | `.odt` | — (built in) |
70
+ | YAML | `.yaml`, `.yml` | `pip install everythingtohtml[yaml]` |
71
+ | reStructuredText | `.rst` | `pip install everythingtohtml[rst]` |
72
+ | Word | `.docx` | `pip install everythingtohtml[docx]` |
73
+ | Word (legacy) | `.doc` | `pip install everythingtohtml[doc]` (LibreOffice recommended) |
74
+ | Excel | `.xlsx`, `.xlsm` | `pip install everythingtohtml[xlsx]` |
75
+ | PowerPoint | `.pptx` | `pip install everythingtohtml[pptx]` |
76
+ | PDF | `.pdf` | `pip install everythingtohtml[pdf]` |
77
+
78
+ > **Legacy `.doc`**: best results come from having [LibreOffice](https://www.libreoffice.org/)
79
+ > installed (used headlessly for high-fidelity conversion). Without it, a
80
+ > pure-Python `olefile` fallback recovers the text content.
81
+
82
+ > Want everything? `pip install everythingtohtml[all]`
83
+
84
+ New formats are just a small class away — see [Writing a converter](#writing-a-converter).
85
+
86
+ ## Installation
87
+
88
+ ```console
89
+ # core formats only (tiny dependency footprint)
90
+ pip install everythingtohtml
91
+
92
+ # pull in Office + data formats
93
+ pip install "everythingtohtml[all]"
94
+
95
+ # or cherry-pick
96
+ pip install "everythingtohtml[docx,xlsx]"
97
+ ```
98
+
99
+ Requires Python 3.10+.
100
+
101
+ ## Usage
102
+
103
+ ### Library
104
+
105
+ ```python
106
+ from everythingtohtml import EverythingToHtml
107
+
108
+ eth = EverythingToHtml()
109
+
110
+ # From a path
111
+ result = eth.convert("slides.pptx")
112
+
113
+ # From bytes or an open stream
114
+ with open("data.csv", "rb") as f:
115
+     result = eth.convert(f)
116
+
117
+ # From a URL (http/https/file/data URIs)
118
+ result = eth.convert("https://example.com/posts.atom")
119
+
120
+ # Give hints when the source is ambiguous (e.g. stdin)
121
+ from everythingtohtml import StreamInfo
122
+ result = eth.convert(raw_bytes, stream_info=StreamInfo(extension=".md"))
123
+
124
+ result.html # the full HTML document (str)
125
+ result.title # detected title, or None
126
+ result.text_content # alias for .html (drop-in for markdown-style code)
127
+ ```
128
+
129
+ ### Command line
130
+
131
+ ```console
132
+ everythingtohtml SOURCE [-o OUTPUT] [--extension .md] [--mimetype text/markdown]
133
+
134
+ # convert a file to a file
135
+ everythingtohtml report.docx -o report.html
136
+
137
+ # pipe through stdin (give it a hint)
138
+ cat notes.md | everythingtohtml --extension .md > notes.html
139
+
140
+ # fetch and convert a remote feed
141
+ everythingtohtml https://hnrss.org/frontpage > hn.html
142
+ ```
143
+
144
+ The CLI is also available as `e2h` for the impatient.
145
+
146
+ ## Merging and comparing documents
147
+
148
+ Need to collate a stack of Word files into one page, or see exactly what changed
149
+ between two revisions? everythingtohtml does both — for **any** supported format.
150
+
151
+ ```python
152
+ eth = EverythingToHtml()
153
+
154
+ # Merge several documents into one HTML page (each becomes a section, with a TOC)
155
+ merged = eth.merge(["intro.docx", "chapter1.doc", "appendix.pdf"])
156
+
157
+ # Place them side by side for visual comparison
158
+ columns = eth.merge(["draft-v1.docx", "draft-v2.docx"], layout="columns")
159
+
160
+ # Produce a highlighted, line-by-line diff of two documents' text
161
+ changes = eth.diff("spec-old.docx", "spec-new.docx")
162
+ open("changes.html", "w", encoding="utf-8").write(changes.html)
163
+ ```
164
+
165
+ From the CLI:
166
+
167
+ ```console
168
+ # two or more sources are merged automatically
169
+ everythingtohtml intro.docx chapter1.doc appendix.pdf -o handbook.html
170
+
171
+ # side-by-side layout
172
+ everythingtohtml old.docx new.docx --columns -o compare.html
173
+
174
+ # highlighted diff of exactly two documents
175
+ everythingtohtml spec-old.docx spec-new.docx --diff -o changes.html
176
+ ```
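
For readers curious what a "highlighted, line-by-line diff" of two documents' text can look like, here is a minimal sketch built on the standard library's `difflib`. It is an illustration under assumptions, not the code behind `EverythingToHtml.diff()`: the function name `render_line_diff` and the `<ins>`/`<del>`/`<span>` markup are made up for this example.

```python
import difflib
import html


def render_line_diff(old_text: str, new_text: str) -> str:
    """Return an HTML fragment marking added and removed lines."""
    rows = []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        tag, body = line[:2], html.escape(line[2:])
        if tag == "+ ":
            rows.append(f"<ins>{body}</ins>")    # present only in new_text
        elif tag == "- ":
            rows.append(f"<del>{body}</del>")    # present only in old_text
        elif tag == "  ":
            rows.append(f"<span>{body}</span>")  # unchanged line
        # Lines starting with "? " are ndiff hints, not content, so they are skipped.
    return "<pre>" + "\n".join(rows) + "</pre>"


fragment = render_line_diff("alpha\nbeta\n", "alpha\ngamma\n")
```

The inputs here are plain text; in practice the text would first have to be extracted from each converted document, a step this sketch leaves out.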
177
+
178
+ ## Architecture
179
+
180
+ everythingtohtml borrows the proven shape of markitdown:
181
+
182
+ ```
183
+ EverythingToHtml # engine: detection + dispatch + plugins
184
+ ├─ StreamInfo # immutable bag of hints (ext, mime, charset, …)
185
+ ├─ DocumentConverter # base class: accepts() + convert()
186
+ │ ├─ MarkdownConverter
187
+ │ ├─ CsvConverter
188
+ │ ├─ DocxConverter (mammoth)
189
+ │ └─ … one small class per format
190
+ └─ DocumentConverterResult # { html, title, metadata }
191
+ ```
192
+
193
+ When you call `convert()`, the engine:
194
+
195
+ 1. **Detects** the stream — extension, mimetype, declared charset, and magic-byte
196
+ sniffing via `puremagic` fill in a `StreamInfo`.
197
+ 2. **Dispatches** — converters are tried in priority order; each `accepts()` is a
198
+ cheap, non-destructive check. Specific formats win over the plain-text
199
+ catch-all.
200
+ 3. **Converts** — the winning converter returns a `DocumentConverterResult`. If a
201
+ converter accepts but raises, the engine records it and tries the next one, so
202
+ one greedy converter can't sink the whole conversion.
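
The three steps above can be sketched in a few lines of plain Python. Everything in this block is hypothetical and independent of the package: `Hints` stands in for `StreamInfo`, the two converter classes are toys, and treating a lower `priority` value as "tried earlier" is an assumption of the sketch rather than documented behaviour.

```python
import html
import io
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hints:
    """Hypothetical stand-in for StreamInfo: immutable hints about a stream."""

    extension: Optional[str] = None


class UpperConverter:
    priority = 0.0  # assumption: lower value means tried earlier

    def accepts(self, stream, hints):
        return hints.extension == ".loud"

    def convert(self, stream, hints):
        raise ValueError("simulated failure")  # accepts, then raises


class PlainTextConverter:
    priority = 10.0  # catch-all, tried last

    def accepts(self, stream, hints):
        return True

    def convert(self, stream, hints):
        return "<pre>" + html.escape(stream.read().decode("utf-8")) + "</pre>"


def dispatch(converters, data: bytes, hints: Hints):
    """Try converters in priority order; record failures and fall through."""
    failures = []
    for conv in sorted(converters, key=lambda c: c.priority):
        stream = io.BytesIO(data)  # fresh stream so every attempt starts at offset 0
        if not conv.accepts(stream, hints):
            continue
        stream.seek(0)  # accepts() may have peeked at the bytes
        try:
            return conv.convert(stream, hints), failures
        except Exception as exc:
            failures.append((type(conv).__name__, exc))
    raise RuntimeError(f"no converter succeeded: {failures}")


html_out, failures = dispatch(
    [PlainTextConverter(), UpperConverter()], b"hi", Hints(".loud")
)
```

Here the specific converter is tried first, fails, and is recorded, after which the plain-text catch-all produces the result, mirroring the "one greedy converter can't sink the whole conversion" behaviour.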
203
+
204
+ ### Writing a converter
205
+
206
+ ```python
207
+ from everythingtohtml import DocumentConverter, DocumentConverterResult, EverythingToHtml, StreamInfo
208
+ from everythingtohtml._html_builder import wrap_document, escape_text
209
+
210
+ class UpperTextConverter(DocumentConverter):
211
+     def accepts(self, file_stream, stream_info: StreamInfo, **kwargs) -> bool:
212
+         return stream_info.normalized_extension() == ".loud"
213
+
214
+     def convert(self, file_stream, stream_info: StreamInfo, **kwargs):
215
+         text = file_stream.read().decode("utf-8").upper()
216
+         return DocumentConverterResult(wrap_document(f"<pre>{escape_text(text)}</pre>"))
217
+
218
+ eth = EverythingToHtml()
219
+ eth.register_converter(UpperTextConverter())
220
+ ```
221
+
222
+ Ship it as a package and expose it as a plugin via entry points so any user can
223
+ `EverythingToHtml(enable_plugins=True)` and pick it up automatically — see
224
+ [`docs/PLUGINS.md`](docs/PLUGINS.md).
225
+
226
+ ## Contributing
227
+
228
+ Contributions are very welcome — new converters especially. See
229
+ [CONTRIBUTING.md](CONTRIBUTING.md) and our [Code of Conduct](CODE_OF_CONDUCT.md).
230
+ Found a security issue? See [SECURITY.md](SECURITY.md).
231
+
232
+ ## Acknowledgements
233
+
234
+ The converter-registry design is directly inspired by Microsoft's excellent
235
+ [markitdown](https://github.com/microsoft/markitdown). everythingtohtml aims to be
236
+ its mirror image for teams that want structure-preserving HTML instead of Markdown.
237
+
238
+ ## License
239
+
240
+ [MIT](LICENSE) © everythingtohtml contributors