@acedatacloud/skills 2026.615.0 → 2026.620.0

package/AGENTS.md CHANGED
@@ -31,6 +31,7 @@ Skills are located in the `skills/` directory (also mirrored to `.agents/skills/
  - **google-search** — Search the web, images, news, maps, places, and videos via Google
  - **face-transform** — Face analysis, beautification, age/gender transform, swap, cartoon
  - **short-url** — Create and manage short URLs
+ - **onepage-pdf** — Convert an HTML page into one tall single-page PDF (local; no API token; needs Python + pymupdf + Chrome/Edge)
  - **acedatacloud-api** — API usage guide — authentication, SDKs, error handling
  
  ## Authentication
package/README.md CHANGED
@@ -50,6 +50,7 @@ Compatible with **30+ AI coding agents** via the [agentskills.io](https://agents
  | [google-search](skills/google-search/) | Search the web, images, news, maps, places, and videos via Google |
  | [face-transform](skills/face-transform/) | Face analysis, beautification, age/gender transform, swap, cartoon |
  | [short-url](skills/short-url/) | Create and manage short URLs |
+ | [onepage-pdf](skills/onepage-pdf/) | Convert an HTML page into one tall single-page PDF — no pagination breaks (local, no token) |
  | [acedatacloud-api](skills/acedatacloud-api/) | API usage guide — authentication, SDKs, error handling |
  
  ### Connectors
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@acedatacloud/skills",
-   "version": "2026.615.0",
+   "version": "2026.620.0",
    "description": "Agent Skills for AceDataCloud AI services — music, image, video generation, LLM chat, web search. Compatible with Claude Code, GitHub Copilot, Gemini CLI, OpenAI Codex, and 30+ AI coding agents.",
    "keywords": [
      "agent-skills",
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 XNTJ LLC
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,97 @@
+ ---
+ name: onepage-pdf
+ description: Convert an HTML page into a single continuous-page PDF (one tall page, no pagination breaks), preserving desktop layout, backgrounds and selectable text. Use when the user asks to turn an HTML file/report/proposal into a "single-page PDF", "long PDF", "不分页 PDF", "单页 PDF", "长图式 PDF", or complains that an HTML-to-PDF export breaks into pages or cuts content. Supports optional string redaction with leak verification.
+ license: MIT
+ metadata:
+   author: xntj-ai
+   version: "1.0"
+   source: https://github.com/xntj-ai/onepage-pdf
+ compatibility: No API token required. Requires Python 3.10+ with `pymupdf` (`pip install pymupdf`) and a local Chrome or Edge browser. Runs fully offline against local HTML — does not call api.acedata.cloud.
+ ---
+
+ # onepage-pdf
+
+ > Vendored from [github.com/xntj-ai/onepage-pdf](https://github.com/xntj-ai/onepage-pdf)
+ > by 张拼拼 · XNTJ, under the MIT License. See [LICENSE](LICENSE).
+
+ Render HTML to one tall PDF page via headless Chrome, then crop the page to the
+ real content height with PyMuPDF. Height is never predicted (print layout is
+ not screen layout); it is measured after rendering, which is exact.
+
+ **Requires:** Python with `pymupdf`, plus a local Chrome or Edge. No API token.
+
+ ## Workflow
+
+ ### 1. Inspect the source HTML first
+
+ Read the HTML and check four things; they decide whether `--extra-css` is needed:
+
+ 1. **Responsive breakpoints.** Print media queries evaluate against the default
+    paper width (~741px), NOT the `@page` size. Any `@media (max-width: N)` with
+    N ≥ 741 will fire during print and collapse the desktop layout. For each such
+    rule, write an override locking the desktop value with `!important`
+    (e.g. `.grid{grid-template-columns:repeat(3,1fr)!important}`).
+ 2. **Glassmorphism.** `backdrop-filter` blur is silently dropped in PDF output.
+    If glass elements sit on busy backgrounds, add a print fallback:
+    `.glass{backdrop-filter:none!important;background:rgba(255,255,255,.88)!important}`.
+ 3. **Scroll-reveal animations.** Common class patterns (`fade*`, `reveal*`,
+    `animate*`, `aos`) are forced visible automatically. Anything else that
+    starts at `opacity:0` needs an explicit `opacity:1!important` override.
+ 4. **vh/vw sizing.** Viewport units resolve against the page area in print and
+    drift ~1%; a `min-height:100vh` hero becomes ~187in tall on the bedrock
+    page. Override such rules with fixed px values.
+
+ Put all overrides in one CSS file and pass it via `--extra-css`.
+
+ ### 2. Convert
+
+ Paths below are relative to this skill directory.
+
+ ```bash
+ python scripts/onepage_pdf.py input.html -o output.pdf --width 1280 \
+   [--extra-css fixes.css] [--replace subs.json --forbid words.txt]
+ ```
+
+ - `--width`: match the design width of the page (snapped to 8px; non-8px
+   page sizes hit MediaBox rounding bugs that spawn phantom pages).
+ - `--replace`: JSON `[["old","new"], ...]`, applied in order — put longer /
+   more specific strings first. Use for redaction before publishing.
+ - `--forbid`: one word per line; the script aborts if any survives in the
+   HTML or in the final PDF text layer. Always pair with `--replace`.
+ - `--crop pixel`: switch to raster row-scanning if the vector crop misjudges
+   (e.g. a decorative element painted taller than the real content).
+
+ The script self-handles: oversized-bedrock rendering with auto-retry on
+ overflow, content cropping (MediaBox + CropBox rewritten identically for
+ viewer compatibility), CJK-safe output paths, single-page assertion.
+
+ A bundled example lives in `examples/` — try it end to end:
+
+ ```bash
+ python scripts/onepage_pdf.py examples/demo.html -o /tmp/demo.pdf \
+   --width 1280 --extra-css examples/demo-fixes.css
+ ```
+
+ ### 3. Verify
+
+ The script prints `OK 1 page, WxHpt`. Then:
+
+ 1. Render a thumbnail and eyeball it (layout intact, no collapsed grids,
+    backgrounds present, nothing cut at the bottom):
+    ```python
+    import pymupdf
+    doc = pymupdf.open("output.pdf")
+    doc[0].get_pixmap(dpi=40).save("check.png")
+    ```
+ 2. If redaction was used, the forbid check already ran against the PDF text;
+    still spot-check the rendered image for sensitive content in raster form.
+ 3. Heed the script warnings: heights above 14400pt break Acrobat (Chrome,
+    Firefox and WeChat preview are fine); "content nearly fills the bedrock"
+    means inspect the tail for truncation.
+
+ ## Troubleshooting and mechanics
+
+ Read [references/mechanics.md](references/mechanics.md) when output looks wrong
+ (collapsed layout, missing backgrounds, blank page, phantom second page, blurry
+ or missing CJK glyphs) — it documents the Chrome print-rendering rules this
+ tool is built around, plus the CDP-based alternative route.
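Reviewer note on the `--replace` / `--forbid` options described in the file above: the advice to put longer, more specific strings first matters because the pairs are applied sequentially. The sketch below is not part of the package. It is a minimal illustration with hypothetical names (`apply_pairs`, `surviving`) and made-up sample strings, assuming plain substring replacement as the option text describes.

```python
def apply_pairs(text, pairs):
    """Apply [old, new] pairs strictly in list order (plain substring replace)."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


def surviving(text, forbidden):
    """Return the forbidden words that are still present in text."""
    return [w for w in forbidden if w in text]


# Hypothetical sample data, for illustration only.
source = "Contact Acme Holdings Ltd or Acme support."
specific_first = [["Acme Holdings Ltd", "Client A"], ["Acme", "Client"]]
generic_first = [["Acme", "Client"], ["Acme Holdings Ltd", "Client A"]]

good = apply_pairs(source, specific_first)
bad = apply_pairs(source, generic_first)

print(good)  # the long name is replaced as a unit
print(bad)   # the short pair fired first, so the long pair never matches
print(surviving(bad, ["Acme", "Holdings"]))
```

With the generic pair first, part of the long name is left behind, which is the kind of leftover a forbid list is meant to catch.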
@@ -0,0 +1,5 @@
+ /* lock the 900px breakpoint back to desktop (print MQ evaluates at ~741px) */
+ .card-grid { grid-template-columns: repeat(3, 1fr) !important; }
+ .hero h1 { font-size: 56px !important; }
+ /* glassmorphism: backdrop-filter blur is dropped in PDF — solidify the card */
+ .glass { backdrop-filter: none !important; background: rgba(255, 255, 255, .88) !important; }
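Reviewer note: the override file above is written by hand after reading the source CSS. A rough way to find the breakpoints that need such a lock is sketched below. It is not part of the package; the function name `breakpoints_to_lock` and the sample CSS are hypothetical, the 741px figure is the approximate value quoted in the skill notes, and the regex only handles `max-width` values given in px.

```python
import re

PRINT_MQ_WIDTH_PX = 741  # approximate print media-query width quoted in the skill notes


def breakpoints_to_lock(css, mq_width=PRINT_MQ_WIDTH_PX):
    """List max-width breakpoints (px) that would match at the print MQ width."""
    pattern = re.compile(
        r"@media[^{]*\(\s*max-width\s*:\s*(\d+(?:\.\d+)?)px\s*\)"
    )
    widths = {float(m.group(1)) for m in pattern.finditer(css)}
    # (max-width: N) matches when the evaluated width is <= N
    return sorted(w for w in widths if w >= mq_width)


# Hypothetical stylesheet, for illustration only.
sample = """
@media (max-width: 900px) { .card-grid { grid-template-columns: 1fr; } }
@media (max-width: 600px) { .hero { padding: 24px; } }
@media (max-width: 1024px) { nav { display: none; } }
"""

print(breakpoints_to_lock(sample))
```

Each width it reports still needs a manual override that restores the desktop value, as the two rules in the file above do for the 900px breakpoint.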
@@ -0,0 +1,94 @@
+ <!DOCTYPE html>
+ <html lang="zh-CN">
+ <head>
+ <meta charset="UTF-8">
+ <title>示例提案 — 星尘咖啡烘焙工坊数字化方案</title>
+ <style>
+   :root { --accent: #b45309; --ink: #1f2937; }
+   * { margin: 0; padding: 0; box-sizing: border-box; }
+   body {
+     font-family: "Noto Sans SC", "Microsoft YaHei", sans-serif;
+     color: var(--ink);
+     background: linear-gradient(175deg, #fdf6ec 0%, #f3e7d3 45%, #e8d5b5 100%);
+     width: 1280px; margin: 0 auto;
+   }
+   .hero { padding: 120px 80px 80px; }
+   .hero h1 { font-size: 56px; font-weight: 700; letter-spacing: 2px; }
+   .hero p { font-size: 20px; color: #6b7280; margin-top: 24px; max-width: 720px; }
+   section { padding: 64px 80px; }
+   h2 { font-size: 32px; margin-bottom: 32px; border-left: 6px solid var(--accent); padding-left: 18px; }
+   .card-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 24px; }
+   .card {
+     background: #fff; border-radius: 16px; padding: 28px;
+     box-shadow: 0 8px 24px rgba(31, 41, 55, .08);
+   }
+   .card h3 { font-size: 20px; margin-bottom: 12px; color: var(--accent); }
+   .card p { font-size: 15px; line-height: 1.8; color: #4b5563; }
+   .glass {
+     margin-top: 48px; padding: 36px; border-radius: 20px;
+     background: rgba(255, 255, 255, .35);
+     backdrop-filter: blur(14px);
+     border: 1px solid rgba(255, 255, 255, .6);
+   }
+   .fade-in { opacity: 0; transform: translateY(24px); transition: all .8s ease; }
+   table { width: 100%; border-collapse: collapse; background: #fff; border-radius: 12px; overflow: hidden; }
+   th, td { padding: 14px 20px; text-align: left; border-bottom: 1px solid #f3f4f6; font-size: 15px; }
+   th { background: var(--ink); color: #fff; font-weight: 500; }
+   footer { padding: 48px 80px 96px; color: #9ca3af; font-size: 13px; }
+   /* breakpoint >= 741px: fires during print unless locked via --extra-css */
+   @media (max-width: 900px) {
+     .card-grid { grid-template-columns: 1fr; }
+     .hero h1 { font-size: 36px; }
+   }
+ </style>
+ </head>
+ <body>
+   <div class="hero fade-in">
+     <h1>星尘咖啡烘焙工坊<br>数字化升级方案</h1>
+     <p>虚构示例:本页面用于演示 onepage-pdf 的转换效果 — 渐变背景、三列卡片栅格、玻璃拟态、滚动渐入动画与中文排版,全部应在单页 PDF 中保持桌面布局。</p>
+   </div>
+
+   <section class="fade-in">
+     <h2>一、现状与目标</h2>
+     <div class="card-grid">
+       <div class="card"><h3>门店运营</h3><p>三家门店的点单、库存与会员数据彼此孤立,月度对账依赖人工汇总表格,平均耗时两个工作日,且错漏率随单量增长而上升。</p></div>
+       <div class="card"><h3>烘焙生产</h3><p>烘焙排期靠经验估算,生豆损耗未量化。引入批次追踪后,可将每炉烘焙曲线与杯测评分关联,沉淀为可复用的烘焙档案。</p></div>
+       <div class="card"><h3>会员增长</h3><p>现有会员体系仅有储值功能,缺少消费行为分析。目标是基于购买频次与口味偏好做精准触达,提升复购率至行业均值以上。</p></div>
+     </div>
+     <div class="glass">
+       <h3>方案总览(玻璃拟态卡片)</h3>
+       <p>分三期推进:第一期打通点单与库存数据流,第二期上线烘焙批次追踪,第三期构建会员画像与自动化营销。每期交付均含验收标准与回滚预案。</p>
+     </div>
+   </section>
+
+   <section class="fade-in">
+     <h2>二、实施排期</h2>
+     <table>
+       <tr><th>阶段</th><th>内容</th><th>周期</th><th>验收标准</th></tr>
+       <tr><td>一期</td><td>点单系统与库存打通</td><td>4 周</td><td>日结报表自动生成,误差为零</td></tr>
+       <tr><td>二期</td><td>烘焙批次追踪上线</td><td>6 周</td><td>每炉烘焙档案可检索、可对比</td></tr>
+       <tr><td>三期</td><td>会员画像与营销自动化</td><td>8 周</td><td>触达转化率较基线提升 20%</td></tr>
+     </table>
+   </section>
+
+   <section class="fade-in">
+     <h2>三、投入与回报</h2>
+     <div class="card-grid">
+       <div class="card"><h3>一次性投入</h3><p>系统搭建与数据迁移共计十二万元,含三个月驻场支持。所有数据资产归属工坊,供应商不留存任何经营数据副本。</p></div>
+       <div class="card"><h3>年度运维</h3><p>云资源与维护费用每年两万四千元,按季度支付。合同期内功能迭代不另行收费,重大版本升级提前一个月公示变更说明。</p></div>
+       <div class="card"><h3>预期回报</h3><p>按当前单量测算,对账人力成本年省约五万元;会员复购提升带来的增量收入预计在第二年覆盖全部投入。</p></div>
+     </div>
+   </section>
+
+   <footer>本文档为 onepage-pdf 演示用虚构内容 · 星尘咖啡烘焙工坊并不存在 · 生成于示例管线</footer>
+
+   <script>
+     // scroll-reveal: in a headless print pass these never fire — the injected
+     // CSS must force .fade-in visible or the PDF renders blank blocks
+     const io = new IntersectionObserver(es => es.forEach(e => {
+       if (e.isIntersecting) { e.target.style.opacity = 1; e.target.style.transform = 'none'; }
+     }));
+     document.querySelectorAll('.fade-in').forEach(el => io.observe(el));
+   </script>
+ </body>
+ </html>
@@ -0,0 +1,149 @@
+ # Chrome print-rendering mechanics
+
+ Why this tool crops instead of calculating, and what to do when output looks
+ wrong. Findings verified against Chromium source / specs / production code
+ (Gotenberg), June 2026.
+
+ ## Contents
+
+ - [Why height cannot be predicted](#why-height-cannot-be-predicted)
+ - [Media queries vs @page — two viewports](#media-queries-vs-page--two-viewports)
+ - [The 1.5x hidden shrink](#the-15x-hidden-shrink)
+ - [Unit rounding and the phantom page](#unit-rounding-and-the-phantom-page)
+ - [PDF size limits](#pdf-size-limits)
+ - [Fidelity losses in PDF output](#fidelity-losses-in-pdf-output)
+ - [Resource readiness](#resource-readiness)
+ - [CJK pitfalls](#cjk-pitfalls)
+ - [Crop internals](#crop-internals)
+ - [Alternative route: CDP measured height](#alternative-route-cdp-measured-height)
+
+ ## Why height cannot be predicted
+
+ Screen-measured `scrollHeight` does not match the printed layout height:
+
+ - `page.pdf()` / `--print-to-pdf` render with **print** media CSS; measuring on
+   screen measures a different layout.
+ - Print media queries evaluate against a different width than the layout width
+   (next section), shifting breakpoints.
+ - Web fonts arriving after measurement change line heights.
+ - The px→inch→pt conversion chain rounds non-monotonically (puppeteer#2278).
+
+ Hence: render onto an oversized bedrock page, measure the result, crop.
+
+ ## Media queries vs @page — two viewports
+
+ Per css-page-3 §7.1, media queries **must ignore `@page size`** (anti-cycle
+ rule) and evaluate against the default paper. Measured in Chrome: MQ width is
+ the default paper's page area, **~741px** (US Letter minus default margins),
+ regardless of any `@page` declaration.
+
+ Meanwhile vw/vh resolve against the **first page area** — which DOES honor
+ `@page size` — with ~1% drift from device-unit rounding. So one document
+ evaluates MQ at 741px and vw at the @page width simultaneously. Container
+ queries follow the laid-out container and are the only reliable responsive
+ mechanism in print.
+
+ Practical rule: lock every breakpoint ≥741px to its desktop value via
+ `--extra-css`, and replace vh/vw sizing with px.
+
+ ## The 1.5x hidden shrink
+
+ `printingMaximumShrinkFactor = 1.5` in Blink: if the widest unbreakable content
+ exceeds the page width, Chrome relayouts and **scales the whole document down
+ by up to 1.5x** to fit (beyond that it clips). One overflowing element changes
+ the scale of everything. Keep actual content width ≤ `--width`.
+
+ ## Unit rounding and the phantom page
+
+ Chrome converts px → inches → pt with rounding errors up to ~0.2% (MediaBox
+ 73pt becomes 72.959999). A height short by a fraction of a pt spills one line
+ onto a second page. Mitigations used here:
+
+ - widths/heights snapped to 8px (8px = 6pt exactly);
+ - the bedrock is intentionally huge, so rounding never matters at render time;
+ - Gotenberg's equivalent trick for the CDP route is `pageRanges: "1"`.
+
+ ## PDF size limits
+
+ - 14400pt (200in, ≈19200px) is the **Acrobat implementation limit** from ISO
+   32000-1 Annex C — not a format limit. PDF 2.0 drops it.
+ - Chrome happily writes larger pages (verified: 22500pt, no clamp, no
+   UserUnit). Chrome/Firefox/PDFium/WeChat render them; old Acrobat may refuse.
+ - The script warns when output exceeds 14400pt; that's a compatibility note,
+   not an error.
+
+ ## Fidelity losses in PDF output
+
+ | Feature | Behavior in PDF | Fix |
+ |---|---|---|
+ | `backdrop-filter` blur | **Silently dropped** (crbug 40895818); translucent bg layer survives | Print fallback: `backdrop-filter:none` + near-opaque `background` |
+ | `background-attachment: fixed` | Painted on page 1 only, off-spec | Injected CSS forces `scroll` |
+ | `box-shadow` | Fine, stays vector | — |
+ | Perspective transforms | Rendered as identity (SkPDF) | Avoid in print |
+ | Element behind CSS `filter`/image filter | Rasterized; text loses selectability | Avoid filters on text containers |
+ | Backgrounds in general | Need `print-color-adjust: exact` (injected) | — |
+
+ ## Resource readiness
+
+ CLI `--print-to-pdf` waits for the `load` event only — not fonts, not
+ JS-injected content. The script passes:
+
+ - `--virtual-time-budget=10000`: fast-forwards timer-driven JS (experimental;
+   does not wait for slow networks or CPU);
+ - `--run-all-compositor-stages-before-draw`: ensures rasterization completes
+   before capture.
+
+ If fonts or late-loading images still miss: raise `--virtual-time`, inline
+ resources as data URLs, or preload fonts in `<head>`. Note headless
+ `--print-to-pdf` silently refuses external resources referenced from inside
+ `@page` margin-box CSS — inline those as data URLs.
+
+ `loading="lazy"` images render fine in current headless print (verified Chrome
+ 149); for older Chrome force `loading="eager"` via `--replace`.
+
+ ## CJK pitfalls
+
+ - Set `<html lang="zh-CN">` — without it, fontconfig on Linux may pick the
+   Thin weight of Noto CJK for body text.
+ - Docker/CI images need `fonts-noto-cjk` installed or all CJK renders as tofu.
+ - CJK fonts often ship Regular only; `font-weight:600+` triggers synthetic
+   bold that looks smeared in PDF. Load real weight files (e.g. Noto Sans SC
+   500/700).
+ - Webfonts without CJK glyphs fall back to system fonts on screen but may
+   embed nothing in the PDF — Acrobat shows missing glyphs. Declare a CJK
+   fallback explicitly in `font-family`.
+
+ ## Crop internals
+
+ - Content bottom comes from `page.get_bboxlog()` — a render-level command log
+   covering text, images, vector paths, and contents of Form XObjects (the
+   APIs `get_text('blocks')` + `get_drawings()` miss XObject internals and
+   inline images). `ignore-text` (invisible text) entries are skipped, as are
+   fills taller than 97% of the page (the body background painted across the
+   whole bedrock).
+ - Pixel mode (`--crop pixel`) rasterizes at 24dpi and scans rows bottom-up for
+   horizontal variance > threshold; vertical gradients are horizontally uniform
+   so they don't count as content. Use it when vector mode keeps decorative
+   geometry that extends below the real content.
+ - Both MediaBox and CropBox are rewritten to the same rect via `xref_set_key`
+   in raw PDF (y-up) coordinates. This avoids PyMuPDF's `set_mediabox` side
+   effect (it deletes CropBox) and the viewer ambiguity between viewers that
+   honor MediaBox (some mobile/embedded viewers) vs CropBox (Acrobat, pdf.js).
+
+ ## Alternative route: CDP measured height
+
+ Gotenberg 8's production `singlePage=true` implementation (tasks.go):
+
+ ```
+ Emulation.setEmulatedMedia(media="print")
+ m = Page.getLayoutMetrics() # cssContentSize is the authoritative height
+ Page.printToPDF(paperWidth = W/96,
+                 paperHeight = m.cssContentSize.height/96,
+                 marginTop=0, ..., pageRanges="1") # "1" discards rounding spill
+ ```
+
+ Pros: no bedrock, no crop, exact MediaBox at render time. Cons: requires a CDP
+ client (websocket) instead of a bare CLI call; lazy-load / viewport-relative
+ layouts can still mismeasure (gotenberg#1046). This tool deliberately stays on
+ the CLI+crop route for zero runtime dependencies beyond PyMuPDF; switch to CDP
+ only if a page class consistently defeats the crop heuristics.
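Reviewer note: the unit figures quoted in the file above (8px = 6pt, 14400pt ≈ 19200px) follow from the usual CSS ratios of 96px and 72pt per inch. The sketch below is not part of the package; it only reproduces that arithmetic, with hypothetical helper names (`px_to_pt`, `pt_to_px`, `snap8`). `snap8` is assumed to mirror the 8px snapping the notes describe.

```python
PX_PER_IN = 96  # CSS reference pixels per inch
PT_PER_IN = 72  # PDF points per inch


def px_to_pt(px):
    """Convert CSS px to PDF points (exact ratio 3/4)."""
    return px * PT_PER_IN / PX_PER_IN


def pt_to_px(pt):
    """Convert PDF points to CSS px."""
    return pt * PX_PER_IN / PT_PER_IN


def snap8(px):
    """Round to the nearest multiple of 8px, never below 8."""
    return max(8, int(round(px / 8.0)) * 8)


print(px_to_pt(8))            # one 8px step in points
print(snap8(1283))            # an off-grid width snapped to the 8px grid
print(px_to_pt(snap8(1283)))  # snapped width in points is a whole number
print(pt_to_px(14400))        # the Acrobat limit expressed in px
```

Because every multiple of 8px maps to a whole number of points, a snapped size leaves no fractional remainder for the px to pt conversion to round.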
@@ -0,0 +1,287 @@
+ #!/usr/bin/env python3
+ """onepage-pdf: render an HTML file into a single continuous-page PDF (no pagination).
+
+ Strategy: never *predict* the page height (screen layout != print layout, and the
+ px->inch->pt conversion chain rounds non-monotonically). Instead render onto an
+ oversized "bedrock" page, then measure the real content bottom with PyMuPDF's
+ render-level bbox log and shrink MediaBox + CropBox to fit. Height is cropped,
+ not calculated.
+
+ Dependencies: PyMuPDF (pip install pymupdf) + a local Chrome or Edge.
+ """
+
+ import argparse
+ import json
+ import shutil
+ import subprocess
+ import sys
+ import tempfile
+ import uuid
+ from pathlib import Path
+
+ try:
+     import pymupdf
+ except ImportError:  # PyMuPDF < 1.24 exposes the legacy name only
+     import fitz as pymupdf
+
+ BROWSERS = [
+     r"C:\Program Files\Google\Chrome\Application\chrome.exe",
+     r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
+     r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe",
+     "/usr/bin/google-chrome",
+     "/usr/bin/chromium",
+     "/usr/bin/chromium-browser",
+     "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
+ ]
+
+ # Injected before </head>. The @page rule is the only official channel for paper
+ # size in CLI headless mode (--print-to-pdf hardcodes preferCSSPageSize:true).
+ # Width/height must be multiples of 8px: 8px = 6pt exactly, anything else risks
+ # MediaBox rounding errors that spill content onto a phantom second page.
+ PRINT_CSS = """
+ <style data-onepage-pdf>
+   @page {{ size: {width}px {bedrock}px; margin: 0; }}
+   * {{
+     -webkit-print-color-adjust: exact !important;
+     print-color-adjust: exact !important;
+     animation: none !important;
+     transition: none !important;
+     background-attachment: scroll !important; /* fixed bg paints only page 1 */
+   }}
+   /* scroll-reveal patterns start at opacity:0 and never "enter the viewport"
+      during a headless print pass — force them visible */
+   [class*="fade"], [class*="reveal"], [class*="animate"], [class*="aos"],
+   [data-aos] {{
+     opacity: 1 !important;
+     transform: none !important;
+     visibility: visible !important;
+   }}
+   {extra}
+ </style>
+ </head>"""
+
+ ACROBAT_LIMIT_PT = 14400  # ISO 32000-1 Annex C; Chrome will exceed it happily
+
+
+ def find_browser(explicit):
+     if explicit:
+         if Path(explicit).exists():
+             return explicit
+         sys.exit(f"browser not found: {explicit}")
+     for p in BROWSERS:
+         if Path(p).exists():
+             return p
+     found = shutil.which("chrome") or shutil.which("chromium") or shutil.which("msedge")
+     if found:
+         return found
+     sys.exit("no Chrome/Edge found; pass --chrome PATH")
+
+
+ def align8(px):
+     return max(8, int(round(px / 8.0)) * 8)
+
+
+ def apply_replacements(html, replace_file):
+     pairs = json.loads(Path(replace_file).read_text(encoding="utf-8"))
+     for old, new in pairs:  # order matters: longest/most specific first
+         html = html.replace(old, new)
+     return html
+
+
+ def check_forbidden(text, forbid_file, where):
+     words = [w.strip() for w in Path(forbid_file).read_text(encoding="utf-8").splitlines() if w.strip()]
+     leaks = [w for w in words if w in text]
+     if leaks:
+         sys.exit(f"LEAK in {where}: {leaks}")
+     return len(words)
+
+
+ def render(browser, html_path, pdf_path, width, virtual_time):
+     cmd = [
+         browser,
+         "--headless=new",
+         "--disable-gpu",
+         "--no-pdf-header-footer",
+         "--hide-scrollbars",
+         f"--window-size={width},2000",
+         f"--virtual-time-budget={virtual_time}",
+         "--run-all-compositor-stages-before-draw",
+         f"--print-to-pdf={pdf_path}",
+         html_path.as_uri(),
+     ]
+     subprocess.run(cmd, check=True, timeout=180,
+                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+
+ def content_bottom_vector(page):
+     """Bottom edge (pt, from page top) via the render-level bbox log.
+
+     get_bboxlog() covers text, images, vector paths and unrolls Form XObjects —
+     the only single API that misses nothing. Full-page-height fills are the
+     body background painted across the whole bedrock page; skip them or the
+     union degenerates to the entire page.
+     """
+     page_h = page.rect.height
+     bottom = None
+     for kind, bbox in page.get_bboxlog():
+         if kind == "ignore-text":  # invisible text, must not drive the crop
+             continue
+         r = pymupdf.Rect(bbox)
+         if r.is_empty or r.is_infinite or r.height >= 0.97 * page_h:
+             continue
+         bottom = r.y1 if bottom is None else max(bottom, r.y1)
+     try:
+         for annot in page.annots():
+             bottom = annot.rect.y1 if bottom is None else max(bottom, annot.rect.y1)
+     except TypeError:
+         pass
+     return bottom
+
+
+ def content_bottom_pixel(page, dpi=24, stdev_threshold=3.0):
+     """Fallback: rasterize and scan rows bottom-up for horizontal variance.
+
+     A vertical gradient background is horizontally uniform (variance ~0);
+     real content (text, cards, rules) breaks horizontal uniformity.
+     """
+     pix = page.get_pixmap(dpi=dpi, colorspace=pymupdf.csGRAY)
+     w, h, buf = pix.width, pix.height, pix.samples
+     threshold = stdev_threshold ** 2
+     for row in range(h - 1, -1, -1):
+         line = buf[row * w:(row + 1) * w]
+         mean = sum(line) / w
+         if sum((b - mean) ** 2 for b in line) / w > threshold:
+             return (row + 1) / dpi * 72.0
+     return None
+
+
+ def crop_to_content(doc, padding_px, crop_mode):
+     page = doc[0]
+     bottom = None
+     if crop_mode != "pixel":
+         bottom = content_bottom_vector(page)
+     if bottom is None:
+         bottom = content_bottom_pixel(page)
+     if bottom is None:
+         sys.exit("could not locate any content on the page")
+     new_h = min(page.rect.height, bottom + padding_px * 0.75)  # px -> pt
+     # Rewrite both boxes via xref in raw PDF coordinates (y-up) — sidesteps
+     # PyMuPDF's top-left flip ambiguity and the set_mediabox side effect of
+     # deleting CropBox. Some viewers read MediaBox, some CropBox ∩ MediaBox;
+     # writing both identically removes the difference.
+     mb = page.mediabox
+     box = f"[{mb.x0:.2f} {mb.y1 - new_h:.2f} {mb.x1:.2f} {mb.y1:.2f}]"
+     doc.xref_set_key(page.xref, "MediaBox", box)
+     doc.xref_set_key(page.xref, "CropBox", box)
+     return new_h
+
+
+ def main():
+     ap = argparse.ArgumentParser(description="HTML -> single continuous-page PDF")
+     ap.add_argument("input", help="source HTML file")
+     ap.add_argument("-o", "--output", required=True, help="destination PDF")
+     ap.add_argument("--width", type=int, default=1280,
+                     help="page width in CSS px (default 1280, snapped to 8px)")
+     ap.add_argument("--bedrock", type=int, default=18000,
+                     help="initial oversized page height in px (default 18000)")
+     ap.add_argument("--padding", type=int, default=32,
+                     help="whitespace kept below content, px (default 32)")
+     ap.add_argument("--extra-css", help="CSS file appended to the injected print styles "
+                     "(breakpoint locks, glassmorphism fallbacks, ...)")
+     ap.add_argument("--replace", help="JSON file [[old,new],...] applied to the HTML "
+                     "before rendering (redaction / token substitution)")
+     ap.add_argument("--forbid", help="text file, one word per line; abort if any "
+                     "survives in the HTML or the final PDF text")
+     ap.add_argument("--crop", choices=["vector", "pixel"], default="vector",
+                     help="content-bottom detection (default vector, auto-falls back)")
+     ap.add_argument("--chrome", help="explicit browser executable")
+     ap.add_argument("--virtual-time", type=int, default=10000,
+                     help="--virtual-time-budget ms (default 10000)")
+     ap.add_argument("--max-retries", type=int, default=2,
+                     help="bedrock doublings when content overflows (default 2)")
+     ap.add_argument("--keep-temp", action="store_true")
+     args = ap.parse_args()
+
+     try:
+         sys.stdout.reconfigure(encoding="utf-8", errors="replace")
+     except AttributeError:
+         pass
+
+     src = Path(args.input)
+     html = src.read_text(encoding="utf-8")
+     if args.replace:
+         html = apply_replacements(html, args.replace)
+     if args.forbid:
+         n = check_forbidden(html, args.forbid, "HTML after replacements")
+         print(f"forbid check (html): {n} words, clean")
+
+     width = align8(args.width)
+     extra = ""
+     if args.extra_css:
+         extra = Path(args.extra_css).read_text(encoding="utf-8")
+     if "</head>" not in html:
+         sys.exit("input HTML has no </head>; cannot inject print CSS")
+
+     browser = find_browser(args.chrome)
+     # Chrome writes PDFs unreliably to non-ASCII paths; stage everything in TEMP
+     workdir = Path(tempfile.gettempdir()) / f"onepage-pdf-{uuid.uuid4().hex[:8]}"
+     workdir.mkdir()
+     tmp_pdf = workdir / "out.pdf"
+
+     try:
+         bedrock = align8(args.bedrock)
+         for attempt in range(args.max_retries + 1):
+             injected = html.replace(
+                 "</head>",
+                 PRINT_CSS.format(width=width, bedrock=bedrock, extra=extra), 1)
+             tmp_html = workdir / "in.html"
+             tmp_html.write_text(injected, encoding="utf-8")
+             render(browser, tmp_html, tmp_pdf, width, args.virtual_time)
+             doc = pymupdf.open(tmp_pdf)
+             if doc.page_count == 1:
+                 break
+             print(f"content overflowed {bedrock}px bedrock "
+                   f"({doc.page_count} pages); doubling")
+             doc.close()
+             bedrock = align8(bedrock * 2)
+         else:
+             sys.exit(f"still paginated after {args.max_retries} retries; "
+                      f"raise --bedrock explicitly")
+
+         new_h = crop_to_content(doc, args.padding, args.crop)
+         final = workdir / "final.pdf"
+         doc.save(final)
+         doc.close()
+
+         check = pymupdf.open(final)
+         page = check[0]
+         assert check.page_count == 1
+         w_pt, h_pt = page.rect.width, page.rect.height
+         if args.forbid:
+             text = "".join(p.get_text() for p in check)
+             check_forbidden(text, args.forbid, "final PDF text")
+             print("forbid check (pdf): clean")
+         check.close()
+
+         dest = Path(args.output)
+         dest.parent.mkdir(parents=True, exist_ok=True)
+         shutil.copyfile(final, dest)
+
+         size_kb = dest.stat().st_size / 1024
+         print(f"OK 1 page, {w_pt:.0f}x{h_pt:.0f}pt "
+               f"({w_pt/72*96:.0f}x{h_pt/72*96:.0f}px), {size_kb:.0f} KB -> {dest}")
+         if h_pt > ACROBAT_LIMIT_PT:
+             print(f"warning: height {h_pt:.0f}pt exceeds the 14400pt Acrobat "
+                   f"limit; fine in Chrome/Firefox/WeChat, may fail in Acrobat")
+         if new_h >= bedrock * 0.75 * 0.985:
+             print("note: content nearly fills the bedrock; check the PDF tail "
+                   "for accidental truncation")
+     finally:
+         if args.keep_temp:
+             print(f"temp kept: {workdir}")
+         else:
+             shutil.rmtree(workdir, ignore_errors=True)
+
+
+ if __name__ == "__main__":
+     main()
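Reviewer note: the pixel fallback in the script above scans grayscale rows from the bottom up and stops at the first row whose horizontal variance exceeds a threshold. The sketch below is not part of the package. It replays that idea on a small synthetic buffer so it can run without PyMuPDF or a browser; the helper name `last_content_row` and the 16x10 test image are hypothetical.

```python
def last_content_row(buf, width, height, stdev_threshold=3.0):
    """Index of the lowest row whose horizontal variance exceeds the threshold."""
    threshold = stdev_threshold ** 2
    for row in range(height - 1, -1, -1):
        line = buf[row * width:(row + 1) * width]
        mean = sum(line) / width
        variance = sum((b - mean) ** 2 for b in line) / width
        if variance > threshold:
            return row
    return None


# Synthetic 16x10 grayscale image: a vertical gradient (each row is uniform)
# with a dark run of pixels on row 4 standing in for text.
width, height = 16, 10
rows = [bytearray([200 + y * 5] * width) for y in range(height)]
gradient_only = b"".join(rows)
rows[4][3:9] = bytes(6)  # six black pixels break the row's uniformity
with_content = b"".join(rows)

row = last_content_row(with_content, width, height)
print(row)                                            # lowest non-uniform row
print((row + 1) / 24 * 72.0)                          # same row as pt at 24 dpi
print(last_content_row(gradient_only, width, height))  # gradient alone: no content
```

The gradient rows below row 4 are ignored because each one is horizontally uniform, which is the property the script's docstring relies on.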