@smart-cloud/publisher-exporter 1.1.19 → 1.1.20
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +41 -0
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -29,9 +29,13 @@ publisher-exporter install-browsers
|
|
|
29
29
|
If `PLAYWRIGHT_BROWSERS_PATH` points to a shared system directory, create that directory first and make it writable by the same OS user that will run `publisher-exporter install-browsers`. A one-time elevated setup step to create or re-own the directory is fine. The later cron job does not need elevated privileges to use an already installed shared browser cache, but it does need read and execute access to that directory tree.
|
|
30
30
|
|
|
31
31
|
## Commands
|
|
32
|
+
- Normal usage:
|
|
32
33
|
|
|
33
34
|
```bash
|
|
34
35
|
PUBLISHER_CONFIG=./publisher.config.json publisher-exporter crawl
|
|
36
|
+
PUBLISHER_CONFIG=./publisher.config.json publisher-exporter crawl --crawl-mode incremental
|
|
37
|
+
PUBLISHER_CONFIG=./publisher.config.json publisher-exporter crawl --resume-rewrite
|
|
38
|
+
PUBLISHER_CONFIG=./publisher.config.json publisher-exporter crawl --retry-timeouts
|
|
35
39
|
PUBLISHER_CONFIG=./publisher.config.json publisher-exporter deploy
|
|
36
40
|
PUBLISHER_CONFIG=./publisher.config.json publisher-exporter invalidate
|
|
37
41
|
publisher-exporter publish
|
|
@@ -42,11 +46,40 @@ Equivalent `npx` usage:
|
|
|
42
46
|
|
|
43
47
|
```bash
|
|
44
48
|
PUBLISHER_CONFIG=./publisher.config.json npx @smart-cloud/publisher-exporter crawl
|
|
49
|
+
PUBLISHER_CONFIG=./publisher.config.json npx @smart-cloud/publisher-exporter crawl --crawl-mode incremental
|
|
50
|
+
PUBLISHER_CONFIG=./publisher.config.json npx @smart-cloud/publisher-exporter crawl --resume-rewrite
|
|
45
51
|
PUBLISHER_CONFIG=./publisher.config.json npx @smart-cloud/publisher-exporter deploy
|
|
46
52
|
PUBLISHER_CONFIG=./publisher.config.json npx @smart-cloud/publisher-exporter invalidate
|
|
47
53
|
npx @smart-cloud/publisher-exporter queue-runner --runtime-dir /srv/site/runtime --max-jobs 1
|
|
48
54
|
```
|
|
49
55
|
|
|
56
|
+
## Crawl Modes And Repair Workflows
|
|
57
|
+
|
|
58
|
+
- `crawl` without extra flags performs a normal full crawl, asset download, and final text rewrite.
|
|
59
|
+
- `crawl --crawl-mode incremental` reuses the existing crawl manifest when possible and rewrites only the files affected by re-crawled pages, text assets, or changed asset-map entries.
|
|
60
|
+
- `crawl --resume-rewrite` skips discovery, rendering, and asset download, then reruns the final text rewrite over the existing output tree using the current rewrite rules.
|
|
61
|
+
- `crawl --retry-timeouts` retries timed-out URLs from the latest archived full crawl or publish log snapshot.
|
|
62
|
+
|
|
63
|
+
Use `--resume-rewrite` when the already exported output is otherwise valid but the rewrite logic changed or a previous crawl stopped during the rewrite phase. This is the fastest repair path for stale output caused by rewrite-only bugs such as escaped JSON replacement issues, protocol-relative asset URLs like `//host/path.css`, or an interrupted final rewrite.
|
|
64
|
+
|
|
65
|
+
If you need to repair existing exported HTML after a rewrite bug, prefer `--resume-rewrite` over a new incremental crawl. Incremental crawl only rewrites targeted files; `--resume-rewrite` reprocesses every text file in the current output.
|
|
66
|
+
|
|
67
|
+
During `--resume-rewrite`, the terminal can be quieter than a full crawl. Progress still updates in `runtime/current-progress.json` and the live crawl event snapshot under the configured log directory.
|
|
68
|
+
|
|
69
|
+
## Direct CLI With Runtime-Managed Config
|
|
70
|
+
|
|
71
|
+
When the WordPress plugin already wrote `runtime/config.json` into shared publisher storage, direct CLI invocation still needs both the config path and the runtime directory.
|
|
72
|
+
|
|
73
|
+
Example:
|
|
74
|
+
|
|
75
|
+
```bash
|
|
76
|
+
STATIC_PUBLISHER_RUNTIME_DIR=/mnt/site/runtime \
|
|
77
|
+
PUBLISHER_CONFIG=/mnt/site/runtime/config.json \
|
|
78
|
+
publisher-exporter crawl --resume-rewrite
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
`PUBLISHER_CONFIG` selects the main exporter config file. `STATIC_PUBLISHER_RUNTIME_DIR` anchors runtime artifacts such as `current-progress.json`, `crawl-manifest.json`, queue files, and storage-relative output/log paths.
|
|
82
|
+
|
|
50
83
|
## Queue Runner
|
|
51
84
|
|
|
52
85
|
The queue runner reads the runtime JSON files generated by the WordPress plugin.
|
|
@@ -123,6 +156,14 @@ Archived files are gzip-compressed per file and listed in `job.json`. WordPress
|
|
|
123
156
|
|
|
124
157
|
Prune old archive directories with `publisher-exporter prune-logs --runtime-dir /srv/site/runtime --older-than-days 30`. Add that command to daily cron or another retention scheduler that matches your log policy.
|
|
125
158
|
|
|
159
|
+
## Troubleshooting
|
|
160
|
+
|
|
161
|
+
- If exported HTML still contains old rewritten URLs after a rewrite fix, run `crawl --resume-rewrite`. This reprocesses every text file in the current output tree and does not depend on earlier `rewriteComplete: true` entries in `crawl-manifest.json`.
|
|
162
|
+
- If `--resume-rewrite` appears to stop after the initial resume message, check `runtime/current-progress.json` and `<logDir>/current-crawl-event.json` before assuming it is stuck. That phase can stay mostly quiet on stdout while it rewrites files.
|
|
163
|
+
- If direct CLI execution seems to ignore the expected runtime state, verify that `PUBLISHER_CONFIG` and `STATIC_PUBLISHER_RUNTIME_DIR` both point at the same runtime tree. A mismatched config path and runtime dir can make the run look idle or incomplete.
|
|
164
|
+
- Prefer incremental crawl when the existing manifest is trusted and you only need a fast recrawl of changed pages or assets. Prefer a full crawl when discovery rules changed, the manifest is missing or suspect, or you want a clean end-to-end rebuild.
|
|
165
|
+
- For generated 404 capture, set `generated404RequestPath` to a path that really returns HTTP 404 from the origin. The exporter validates that response before reusing it as the generated 404 page.
|
|
166
|
+
|
|
126
167
|
## WordPress Integration
|
|
127
168
|
|
|
128
169
|
The companion WordPress plugin can optionally store an `External exporter dir` setting. Point it at this package root when you want PHP-side diagnostics to verify the local CLI install.
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@smart-cloud/publisher-exporter",
|
|
3
|
-
"version": "1.1.
|
|
3
|
+
"version": "1.1.20",
|
|
4
4
|
"license": "MIT",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"description": "Headless Playwright static publisher for WordPress/Elementor sites with sitemap-only page discovery, strict asset capture, escaped URL rewrite, structured logs, and targeted retry modes.",
|