@dev-pi2pie/word-counter 0.1.5-canary.3 → 0.1.5-canary.5
- package/README.md +136 -4
- package/dist/cjs/detector.cjs +866 -208
- package/dist/cjs/detector.cjs.map +1 -1
- package/dist/cjs/index.cjs.map +1 -1
- package/dist/cjs/markdown.cjs +82 -14
- package/dist/cjs/markdown.cjs.map +1 -1
- package/dist/esm/bin.mjs +4662 -2537
- package/dist/esm/bin.mjs.map +1 -1
- package/dist/esm/detector.d.mts +157 -1
- package/dist/esm/detector.mjs +863 -210
- package/dist/esm/detector.mjs.map +1 -1
- package/dist/esm/index.mjs +1 -1
- package/dist/esm/markdown.mjs +71 -15
- package/dist/esm/markdown.mjs.map +1 -1
- package/dist/esm/worker/count-worker.mjs +559 -248
- package/dist/esm/worker/count-worker.mjs.map +1 -1
- package/dist/esm/worker-pool.mjs +3 -2
- package/dist/esm/worker-pool.mjs.map +1 -1
- package/package.json +7 -3
package/README.md
CHANGED
@@ -4,7 +4,7 @@ Locale-aware word counting powered by the Web API [`Intl.Segmenter`](https://dev
 
 ## Quick Start (npx)
 
-Runtime requirement: Node.js `>=
+Runtime requirement: Node.js `>=22.18.0`.
 
 Run without installing:
 
@@ -101,18 +101,145 @@ Enable the optional WASM detector for ambiguous Latin and Han routes:
 ```bash
 word-counter --detector wasm "This sentence should clearly be detected as English for the wasm detector path."
 word-counter --detector wasm "漢字測試需要更多內容才能觸發偵測"
+word-counter --detector wasm --content-gate strict "Internationalization documentation remains understandable."
+word-counter --detector wasm --content-gate loose "四字成語"
+word-counter --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+```
+
+Inspect detector behavior without count output:
+
+```bash
+word-counter inspect "こんにちは、世界!これはテストです。"
+word-counter inspect --detector wasm --view engine "This sentence should clearly be detected as English for the wasm detector path."
+word-counter inspect --detector regex -f json "こんにちは、世界!これはテストです。"
+word-counter inspect --detector regex -f json --pretty "こんにちは、世界!これはテストです。"
+word-counter inspect --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+word-counter inspect -p ./examples/yaml-basic.md
+word-counter inspect -p ./examples/test-case-multi-files-support
+word-counter inspect -p ./examples/test-case-multi-files-support --section content -f json --pretty
 ```
 
 Detector mode notes:
 
 - `--detector regex` is the default behavior.
 - `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
+- `--content-gate default|strict|loose|off` configures the shared detector policy mode used by the WASM detector path.
+  - `default`: current fixture-backed project policy
+  - `strict`: raises detector eligibility thresholds and makes more borderline windows fall back
+  - `loose`: lowers detector eligibility thresholds and makes more borderline windows eligible or upgradable
+  - `off`: bypasses `contentGate` evaluation only
+  - mode behavior differs by route:
+    - `und-Latn`: `default|strict|loose` affect both eligibility and the Latin prose-style `contentGate`
+    - `und-Hani`: `default|strict|loose` affect eligibility only, while `contentGate` still reports `policy=none`
+  - current Hani behavior:
+    - `default`: keeps the current Hani diagnostic-sample threshold
+    - `strict`: raises the Hani diagnostic-sample threshold
+    - `loose`: uses a short-window Han-focused threshold so idiom-length samples such as `四字成語` can become eligible
+    - `off`: keeps the same Hani eligibility thresholds as `default`
 - `--detector regex` keeps the original script/regex chunk-first detection path.
 - `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
 - In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
 - Very short chunks stay on the original `und-*` fallback.
 - Low-confidence or unsupported detector results fall back to `und-*`.
 - Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
+- inspect/debug disclosure uses `contentGate` as the canonical gate field.
+- legacy debug/evidence payloads still emit `qualityGate` as a compatibility alias derived from `contentGate.passed`.
+- for practical verification, use `inspect` to compare direct mode outcomes across `default`, `strict`, `loose`, and `off`; use `--debug --detector-evidence` when you specifically need counting-flow event details or legacy `qualityGate` compatibility
+- `word-counter inspect` supports:
+  - positional text input
+  - one direct `-p, --path <file>` input
+  - repeated `-p, --path` inputs for batch inspect
+  - directory inputs in default `--path-mode auto`
+  - literal file-only path handling in `--path-mode manual`
+  - `--section all|frontmatter|content`
+- batch inspect keeps counting-style path acquisition but not counting aggregation:
+  - no inspect `--merged`
+  - no inspect `--per-file`
+  - no inspect `--jobs`
+
+### Config Files
+
+`word-counter` supports config files in these canonical names:
+
+- `wc-intl-seg.config.toml`
+- `wc-intl-seg.config.json`
+- `wc-intl-seg.config.jsonc`
+
+Config precedence is:
+
+```text
+built-in defaults
+< user config dir / wc-intl-seg.config.{toml|jsonc|json}
+< cwd / wc-intl-seg.config.{toml|jsonc|json}
+< environment variables
+< flag options
+```
+
+Same-scope file priority is `toml > jsonc > json`.
+If lower-priority sibling config files are ignored, the CLI emits a warning.
+
+Detector config notes:
+
+- counting defaults to `regex`
+- `inspect` also defaults to `regex`
+- root `detector` controls normal counting
+- optional `inspect.detector` overrides inspect-only behavior
+- root `contentGate.mode` controls detector-policy defaults for counting
+- optional `inspect.contentGate.mode` overrides inspect-only detector-policy behavior
+- `WORD_COUNTER_CONTENT_GATE` overrides config-derived content-gate defaults
+- `--content-gate` stays the highest-precedence detector-policy override
+- `inspect --detector` only affects the current inspect invocation
+
+Examples:
+
+```bash
+word-counter -d wasm "This sentence should clearly be detected as English for the wasm detector path."
+word-counter --content-gate strict "Internationalization documentation remains understandable."
+word-counter inspect -d regex -f json "こんにちは、世界!これはテストです。"
+word-counter inspect --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+word-counter --path ./examples/test-case-multi-files-support --format json
+```
+
+Default-reference config examples live under:
+
+- `examples/wc-config/wc-intl-seg.config.toml`
+- `examples/wc-config/wc-intl-seg.config.json`
+- `examples/wc-config/wc-intl-seg.config.jsonc`
+
+For full config behavior, platform-specific user config locations, merge rules, and examples, see [`docs/config-usage-guide.md`](docs/config-usage-guide.md).
+
+### Detector Subpath (`@dev-pi2pie/word-counter/detector`)
+
+Use the detector subpath when you need async detector-aware APIs directly in library code.
+
+```ts
+import {
+  inspectTextWithDetector,
+  segmentTextByLocaleWithDetector,
+  wordCounterWithDetector,
+} from "@dev-pi2pie/word-counter/detector";
+
+const inspectResult = await inspectTextWithDetector("こんにちは、世界!これはテストです。", {
+  detector: "wasm",
+  view: "pipeline",
+});
+const countResult = await wordCounterWithDetector(
+  "Internationalization documentation remains understandable.",
+  {
+    detector: "wasm",
+    contentGate: { mode: "strict" },
+  },
+);
+```
+
+Detector subpath notes:
+
+- detector entrypoints are async
+- use the root package for normal counting when you do not need detector-specific control
+- detector-subpath APIs that execute detector policy also accept:
+  - `contentGate: { mode: "default" | "strict" | "loose" | "off" }`
+- use `detectorDebug` for counting-flow runtime diagnostics
+- use `inspectTextWithDetector()` for direct detector diagnosis as structured data
 
 Collect non-words (emoji/symbols/punctuation):
 
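Note on the Config Files hunk: the precedence chain and the same-scope file priority it documents can be pictured with a small sketch. This is only an illustration under stated assumptions, not the package's implementation: `ConfigLayer`, `pickConfigFile`, and `resolveConfig` are hypothetical names, only the `detector` and `contentGate.mode` keys are modeled, and the `inspect.*` overrides are left out.

```typescript
// Hypothetical types and helpers for illustration; not the package's internals.
type ConfigLayer = {
  detector?: "regex" | "wasm";
  contentGate?: { mode?: "default" | "strict" | "loose" | "off" };
};

const FORMAT_PRIORITY = ["toml", "jsonc", "json"] as const;
type ConfigFormat = (typeof FORMAT_PRIORITY)[number];

// Same-scope priority: toml > jsonc > json. Returns the winner plus the
// ignored siblings, which the README says trigger a CLI warning.
function pickConfigFile(found: ConfigFormat[]): {
  chosen: ConfigFormat | undefined;
  ignored: ConfigFormat[];
} {
  const ordered = FORMAT_PRIORITY.filter((format) => found.indexOf(format) !== -1);
  return { chosen: ordered[0], ignored: ordered.slice(1) };
}

// Later layers win: defaults < user config < cwd config < env < flags.
function resolveConfig(...layers: ConfigLayer[]): ConfigLayer {
  const resolved: ConfigLayer = {};
  for (const layer of layers) {
    if (layer.detector !== undefined) resolved.detector = layer.detector;
    if (layer.contentGate?.mode !== undefined) {
      resolved.contentGate = { mode: layer.contentGate.mode };
    }
  }
  return resolved;
}

const defaults: ConfigLayer = { detector: "regex", contentGate: { mode: "default" } };
const userConfig: ConfigLayer = { detector: "wasm" };
const cwdConfig: ConfigLayer = { contentGate: { mode: "loose" } };
// Stands in for WORD_COUNTER_CONTENT_GATE=strict.
const env: ConfigLayer = { contentGate: { mode: "strict" } };
// No --detector or --content-gate flag passed in this example.
const flags: ConfigLayer = {};

const resolved = resolveConfig(defaults, userConfig, cwdConfig, env, flags);
const siblings = pickConfigFile(["json", "toml"]);
```

In this sketch the detector comes from the user config, the content-gate mode comes from the environment layer because it sits above both config files, and the `json` sibling is reported as ignored in favor of `toml`.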
@@ -140,6 +267,7 @@ Directory scans are recursive by default:
 
 ```bash
 word-counter --path ./examples/test-case-multi-files-support
+word-counter --path ./examples/test-case-multi-files-support --recursive
 word-counter --path ./examples/test-case-multi-files-support --no-recursive
 ```
 
@@ -153,6 +281,7 @@ Progress behavior in standard batch mode:
 
 ```bash
 word-counter --path ./examples/test-case-multi-files-support
+word-counter --path ./examples/test-case-multi-files-support --progress
 word-counter --path ./examples/test-case-multi-files-support --no-progress
 word-counter --path ./examples/test-case-multi-files-support --keep-progress
 ```
@@ -174,7 +303,7 @@ Quick policy:
 - `--jobs 1`: async main-thread `load+count` baseline.
 - `--jobs > 1`: worker `load+count` with async fallback when workers are unavailable.
 - if requested `--jobs` exceeds host `suggestedMaxJobs` (from `--print-jobs-limit`), the CLI warns and runs with the suggested limit as a safety cap.
-- use `--quiet-warnings` to suppress non-fatal warning lines (for example jobs-limit advisory and worker-fallback warning).
+- use `--quiet-warnings` to suppress non-fatal warning lines (for example config discovery notes, jobs-limit advisory, and worker-fallback warning).
 
 Inspect host jobs diagnostics:
 
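Note on the jobs policy hunk: the safety cap and the `--quiet-warnings` behavior described in the context lines amount to a clamp plus an optional advisory. A minimal sketch, assuming a hypothetical `resolveJobs` helper and a made-up warning message (neither is taken from the package):

```typescript
// Hypothetical helper; the name, return shape, and message text are assumptions.
function resolveJobs(
  requested: number,
  suggestedMaxJobs: number,
  quietWarnings = false,
): { jobs: number; warnings: string[] } {
  const warnings: string[] = [];
  let jobs = requested;
  if (requested > suggestedMaxJobs) {
    // Run with the suggested limit as a safety cap.
    jobs = suggestedMaxJobs;
    if (!quietWarnings) {
      warnings.push(
        `--jobs ${requested} exceeds the suggested limit ${suggestedMaxJobs}; using ${suggestedMaxJobs}`,
      );
    }
  }
  return { jobs, warnings };
}

const capped = resolveJobs(16, 8);
const cappedQuietly = resolveJobs(16, 8, true);
const withinLimit = resolveJobs(4, 8);
```

The cap applies whether or not warnings are suppressed; `--quiet-warnings` only changes what is printed.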
@@ -196,7 +325,7 @@ word-counter doctor --format json --pretty
 
 Doctor scope in v1:
 
-- checks runtime support policy against Node.js `>=
+- checks runtime support policy against Node.js `>=22.18.0`
 - verifies `Intl.Segmenter` availability plus word/grapheme constructor health
 - reports batch jobs host limits using the same heuristics as `--print-jobs-limit`
 - reports worker-route preflight signals and the worker-disable env toggle that affects worker availability
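Note on the doctor hunk: the runtime policy check against Node.js `>=22.18.0` needs a numeric comparison per version component, because a plain string comparison would rank `22.9.0` above `22.18.0`. A minimal sketch, assuming a hypothetical `meetsMinimumVersion` helper that handles plain `major.minor.patch` strings only (prerelease tags are not handled):

```typescript
// Hypothetical helper; compares dotted numeric versions component by component.
// Assumption: inputs look like "22.18.0" or "v22.18.0", with no prerelease suffix.
function meetsMinimumVersion(current: string, minimum: string): boolean {
  const left = current.replace(/^v/, "").split(".").map(Number);
  const right = minimum.replace(/^v/, "").split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const a = left[i] ?? 0;
    const b = right[i] ?? 0;
    if (a !== b) return a > b;
  }
  return true; // all three components are equal
}

const MINIMUM_NODE = "22.18.0";
const exact = meetsMinimumVersion("22.18.0", MINIMUM_NODE);
const olderPatchLine = meetsMinimumVersion("v22.17.1", MINIMUM_NODE);
const singleDigitMinor = meetsMinimumVersion("22.9.0", MINIMUM_NODE);
const newerMajor = meetsMinimumVersion("24.0.0", MINIMUM_NODE);
```

In a real CLI the first argument would typically come from `process.versions.node`; it is passed in explicitly here to keep the sketch free of runtime dependencies.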
@@ -216,7 +345,9 @@ For full policy details, JSON parity expectations (`--misc`, `--total-of whitesp
 ### Stable Path Resolution Contract
 
 - Repeated `--path` values are accepted as mixed inputs (file + directory).
-- In `--path-mode auto` (default), directory inputs are expanded to files
+- In `--path-mode auto` (default), directory inputs are expanded to files.
+  `--recursive` explicitly enables recursive traversal and overrides non-recursive config/env defaults.
+  `--no-recursive` explicitly disables recursive traversal for the current invocation.
 - In `--path-mode manual`, `--path` values are treated as literal file inputs; `--path <dir>` is not supported and is skipped as `not a regular file`.
 - Extension and regex filters apply only to files discovered from directory expansion.
 - Direct file inputs are always considered regardless of `--include-ext` / `--exclude-ext` / `--regex`.
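Note on the path resolution hunk: the contract listed above can be modeled over an in-memory tree to make the rules concrete. This is a sketch under stated assumptions, not the CLI's code: `Entry`, `resolveInputs`, and the option names are hypothetical, and only an include-extension filter is modeled.

```typescript
// Hypothetical in-memory model of path inputs; not the package's types.
type Entry = { path: string; kind: "file" | "dir"; children?: Entry[] };
type Resolution = { files: string[]; skipped: { path: string; reason: string }[] };
type ResolveOptions = { pathMode: "auto" | "manual"; recursive: boolean; includeExt?: string[] };

function resolveInputs(inputs: Entry[], options: ResolveOptions): Resolution {
  const result: Resolution = { files: [], skipped: [] };
  const includeExt = options.includeExt;
  const passesFilter = (path: string): boolean =>
    includeExt === undefined || includeExt.some((ext) => path.endsWith(ext));

  // Filters apply only to files discovered through directory expansion.
  const expand = (dir: Entry): void => {
    for (const child of dir.children ?? []) {
      if (child.kind === "file") {
        if (passesFilter(child.path)) result.files.push(child.path);
      } else if (options.recursive) {
        expand(child);
      }
    }
  };

  for (const input of inputs) {
    if (input.kind === "file") {
      result.files.push(input.path); // direct file inputs bypass filters
    } else if (options.pathMode === "manual") {
      result.skipped.push({ path: input.path, reason: "not a regular file" });
    } else {
      expand(input);
    }
  }
  return result;
}

const docs: Entry = {
  path: "docs",
  kind: "dir",
  children: [
    { path: "docs/a.md", kind: "file" },
    { path: "docs/b.txt", kind: "file" },
    { path: "docs/nested", kind: "dir", children: [{ path: "docs/nested/c.md", kind: "file" }] },
  ],
};
const notes: Entry = { path: "notes.txt", kind: "file" };

const auto = resolveInputs([notes, docs], { pathMode: "auto", recursive: true, includeExt: [".md"] });
const flat = resolveInputs([notes, docs], { pathMode: "auto", recursive: false, includeExt: [".md"] });
const manual = resolveInputs([notes, docs], { pathMode: "manual", recursive: true, includeExt: [".md"] });
```

Here `notes.txt` is kept in every case even though it does not match the `.md` filter, because it was passed directly; `docs/b.txt` is dropped because it was discovered by expansion.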
@@ -500,6 +631,7 @@ Import from `@dev-pi2pie/word-counter/detector` for the explicit detector-enable
 | `wordCounterWithDetector` | function | Async detector-aware counting entrypoint. |
 | `segmentTextByLocaleWithDetector` | function | Async detector-aware locale segmentation. |
 | `countSectionsWithDetector` | function | Async detector-aware section counting. |
+| `inspectTextWithDetector` | function | Async detector-aware inspect entrypoint. |
 | `DEFAULT_DETECTOR_MODE` | value | Current default detector mode (`regex`). |
 | `DETECTOR_MODES` | value | Supported detector modes. |
 