@dev-pi2pie/word-counter 0.1.5-canary.2 → 0.1.5-canary.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +102 -3
- package/dist/cjs/detector.cjs +1159 -87
- package/dist/cjs/detector.cjs.map +1 -1
- package/dist/cjs/index.cjs.map +1 -1
- package/dist/cjs/markdown.cjs +88 -14
- package/dist/cjs/markdown.cjs.map +1 -1
- package/dist/esm/bin.mjs +2808 -979
- package/dist/esm/bin.mjs.map +1 -1
- package/dist/esm/detector.d.mts +196 -2
- package/dist/esm/detector.mjs +1156 -89
- package/dist/esm/detector.mjs.map +1 -1
- package/dist/esm/index.mjs +1 -1
- package/dist/esm/markdown.mjs +71 -15
- package/dist/esm/markdown.mjs.map +1 -1
- package/dist/esm/worker/count-worker.mjs +878 -107
- package/dist/esm/worker/count-worker.mjs.map +1 -1
- package/dist/esm/worker-pool.mjs +17 -3
- package/dist/esm/worker-pool.mjs.map +1 -1
- package/package.json +6 -2
--- package/README.md
+++ package/README.md
@@ -101,16 +101,94 @@ Enable the optional WASM detector for ambiguous Latin and Han routes:
 ```bash
 word-counter --detector wasm "This sentence should clearly be detected as English for the wasm detector path."
 word-counter --detector wasm "漢字測試需要更多內容才能觸發偵測"
+word-counter --detector wasm --content-gate strict "Internationalization documentation remains understandable."
+word-counter --detector wasm --content-gate loose "四字成語"
+word-counter --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+```
+
+Inspect detector behavior without count output:
+
+```bash
+word-counter inspect "こんにちは、世界!これはテストです。"
+word-counter inspect --view engine "This sentence should clearly be detected as English for the wasm detector path."
+word-counter inspect --detector regex -f json "こんにちは、世界!これはテストです。"
+word-counter inspect --detector regex -f json --pretty "こんにちは、世界!これはテストです。"
+word-counter inspect --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+word-counter inspect -p ./examples/yaml-basic.md
+word-counter inspect -p ./examples/test-case-multi-files-support
+word-counter inspect -p ./examples/test-case-multi-files-support --section content -f json --pretty
 ```
 
 Detector mode notes:
 
 - `--detector regex` is the default behavior.
 - `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
+- `--content-gate default|strict|loose|off` configures the shared detector policy mode used by the WASM detector path.
+  - `default`: current fixture-backed project policy
+  - `strict`: raises detector eligibility thresholds and makes more borderline windows fall back
+  - `loose`: lowers detector eligibility thresholds and makes more borderline windows eligible or upgradable
+  - `off`: bypasses `contentGate` evaluation only
+  - mode behavior differs by route:
+    - `und-Latn`: `default|strict|loose` affect both eligibility and the Latin prose-style `contentGate`
+    - `und-Hani`: `default|strict|loose` affect eligibility only, while `contentGate` still reports `policy=none`
+  - current Hani behavior:
+    - `default`: keeps the current Hani diagnostic-sample threshold
+    - `strict`: raises the Hani diagnostic-sample threshold
+    - `loose`: uses a short-window Han-focused threshold so idiom-length samples such as `四字成語` can become eligible
+    - `off`: keeps the same Hani eligibility thresholds as `default`
 - `--detector regex` keeps the original script/regex chunk-first detection path.
 - `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
+- In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
 - Very short chunks stay on the original `und-*` fallback.
 - Low-confidence or unsupported detector results fall back to `und-*`.
+- Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
+- inspect/debug disclosure uses `contentGate` as the canonical gate field.
+- legacy debug/evidence payloads still emit `qualityGate` as a compatibility alias derived from `contentGate.passed`.
+- for practical verification, use `inspect` to compare direct mode outcomes across `default`, `strict`, `loose`, and `off`; use `--debug --detector-evidence` when you specifically need counting-flow event details or legacy `qualityGate` compatibility
+- `word-counter inspect` supports:
+  - positional text input
+  - one direct `-p, --path <file>` input
+  - repeated `-p, --path` inputs for batch inspect
+  - directory inputs in default `--path-mode auto`
+  - literal file-only path handling in `--path-mode manual`
+  - `--section all|frontmatter|content`
+- batch inspect keeps counting-style path acquisition but not counting aggregation:
+  - no inspect `--merged`
+  - no inspect `--per-file`
+  - no inspect `--jobs`
+
+### Detector Subpath (`@dev-pi2pie/word-counter/detector`)
+
+Use the detector subpath when you need async detector-aware APIs directly in library code.
+
+```ts
+import {
+  inspectTextWithDetector,
+  segmentTextByLocaleWithDetector,
+  wordCounterWithDetector,
+} from "@dev-pi2pie/word-counter/detector";
+
+const inspectResult = await inspectTextWithDetector("こんにちは、世界!これはテストです。", {
+  detector: "wasm",
+  view: "pipeline",
+});
+const countResult = await wordCounterWithDetector(
+  "Internationalization documentation remains understandable.",
+  {
+    detector: "wasm",
+    contentGate: { mode: "strict" },
+  },
+);
+```
+
+Detector subpath notes:
+
+- detector entrypoints are async
+- use the root package for normal counting when you do not need detector-specific control
+- detector-subpath APIs that execute detector policy also accept:
+  - `contentGate: { mode: "default" | "strict" | "loose" | "off" }`
+- use `detectorDebug` for counting-flow runtime diagnostics
+- use `inspectTextWithDetector()` for direct detector diagnosis as structured data
 
 Collect non-words (emoji/symbols/punctuation):
 
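Reviewer note on the hunk above: the detector mode notes describe an ordering (detector result first, then fallback to `und-*`, with Latin hints applied only to unresolved `und-Latn` output). The sketch below models that ordering with plain functions. It does not call the package; `resolveChunkTag`, `DetectorGuess`, `ResolveOptions`, and both thresholds are hypothetical names and values chosen for illustration, and the real eligibility rules are not visible in this diff.

```typescript
// Hypothetical names and thresholds throughout; nothing here is the package's API.
type RouteTag = "und-Latn" | "und-Hani";

interface DetectorGuess {
  tag: string | null; // null models an unsupported detector result
  confidence: number; // assumed to lie in 0..1
}

interface ResolveOptions {
  minLength: number; // assumed minimum chunk length, in code points
  minConfidence: number; // assumed acceptance threshold
  latinHint?: string; // stands in for a Latin hint rule or explicit hint flag
}

function resolveChunkTag(
  route: RouteTag,
  text: string,
  guess: DetectorGuess | null,
  options: ResolveOptions,
): string {
  // Count code points rather than UTF-16 units so Han text is measured per character.
  const length = Array.from(text).length;

  // 1. Accept the detector guess only for long-enough, supported, confident results.
  if (
    length >= options.minLength &&
    guess !== null &&
    guess.tag !== null &&
    guess.confidence >= options.minConfidence
  ) {
    return guess.tag;
  }

  // 2. Deferred Latin hints relabel unresolved und-Latn output only.
  if (route === "und-Latn" && options.latinHint !== undefined) {
    return options.latinHint;
  }

  // 3. Everything else stays on the original und-* fallback.
  return route;
}
```

Under these assumptions, a confident guess wins over a hint, a low-confidence guess on a Latin chunk yields the hint if one is set, and a Han chunk never receives a Latin hint.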
@@ -285,14 +363,24 @@ word-counter --path ./examples/test-case-multi-files-support --debug --verbose
 
 Use `--debug-report [path]` to route debug diagnostics to a JSONL report file:
 
-- no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-<pid>.jsonl`
+- no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
+- no path with `--detector-evidence`: writes with pattern `wc-detector-evidence-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
 - path provided: writes to the specified location
 - default-name collision handling: appends `-<n>` suffix to avoid overwriting existing files
 - explicit path validation: existing directories are rejected (explicit paths are treated as file targets)
+- compatibility note: the autogenerated filename moved from the older local-time pattern to the new UTC `...-utc-...jsonl` pattern
 
 By default with `--debug-report`, debug lines are file-only (not mirrored to terminal).
 Use `--debug-report-tee` (alias: `--debug-tee`) to mirror to both file and `stderr`.
-Flag dependencies: `--verbose` requires `--debug`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+Flag dependencies: `--verbose` requires `--debug`; `--detector-evidence` requires `--debug` and `--detector wasm`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+
+Use `--detector-evidence` to add per-window detector evidence onto the same debug stream:
+
+- only meaningful with `--detector wasm`
+- compact mode emits bounded single-line previews plus detector decision metadata
+- verbose mode emits full raw detector windows and full normalized samples
+- evidence remains detector-window based even when output mode changes to `collector`, `char`, or another counting mode
+- fallback evidence reports the post-fallback final tag used by downstream counting output; in rare split-relabel cases it may also include `finalLocales`
 
 Examples:
 
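Reviewer note on the hunk above: the README now states four flag dependencies and two UTC-based default report names. A minimal sketch of both rules follows, assuming nothing about how the CLI implements them. `Flags`, `checkFlagDependencies`, and `defaultReportName` are hypothetical names; only the flag names and filename patterns are taken from the README text. The `-<n>` collision suffix is left out because its exact position in the filename is not stated.

```typescript
// Hypothetical model of the CLI flags named in the README; not the package's own types.
interface Flags {
  debug?: boolean;
  verbose?: boolean;
  debugReport?: boolean;
  debugReportTee?: boolean; // also covers the --debug-tee alias
  detectorEvidence?: boolean;
  detector?: "regex" | "wasm";
}

// Returns one message per violated dependency; an empty array means the combination is valid.
function checkFlagDependencies(flags: Flags): string[] {
  const errors: string[] = [];
  if (flags.verbose && !flags.debug) {
    errors.push("--verbose requires --debug");
  }
  if (flags.detectorEvidence && !(flags.debug && flags.detector === "wasm")) {
    errors.push("--detector-evidence requires --debug and --detector wasm");
  }
  if (flags.debugReport && !flags.debug) {
    errors.push("--debug-report requires --debug");
  }
  if (flags.debugReportTee && !flags.debugReport) {
    errors.push("--debug-report-tee requires --debug-report");
  }
  return errors;
}

function pad2(value: number): string {
  return String(value).padStart(2, "0");
}

// Builds the autogenerated name from UTC fields, matching the documented
// `<prefix>-YYYYMMDD-HHmmss-utc-<pid>.jsonl` pattern.
function defaultReportName(now: Date, pid: number, detectorEvidence: boolean): string {
  const date = `${now.getUTCFullYear()}${pad2(now.getUTCMonth() + 1)}${pad2(now.getUTCDate())}`;
  const time = `${pad2(now.getUTCHours())}${pad2(now.getUTCMinutes())}${pad2(now.getUTCSeconds())}`;
  const prefix = detectorEvidence ? "wc-detector-evidence" : "wc-debug";
  return `${prefix}-${date}-${time}-utc-${pid}.jsonl`;
}
```

Reading the UTC getters rather than the local ones is what makes the generated name independent of the machine's time zone, which appears to be the point of the compatibility note in the hunk.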
@@ -301,17 +389,26 @@ word-counter --path ./examples/test-case-multi-files-support --debug --debug-rep
 word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl
 word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-report-tee
 word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-tee
+word-counter --detector wasm --debug --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+word-counter --detector wasm --debug --verbose --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+word-counter --detector wasm --debug --detector-evidence --debug-report
 ```
 
 Skip details stay debug-gated and can be suppressed with `--quiet-skips`.
 
+When `--format json` is combined with `--debug`, debug-only diagnostics are emitted under `debug.*`:
+
+- single input and merged batch may include `debug.detector`
+- per-file batch may include `debug.skipped`, `debug.detector`, and per-entry `files[i].debug.detector`
+- per-file top-level `skipped` is still emitted temporarily for compatibility
+
 ## How It Works
 
 - The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `und-Hani`, `ja`).
 - Adjacent characters that share the same locale tag are grouped into a chunk.
 - Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
 - Per-locale counts are summed into an overall total and printed to stdout.
-- With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting.
+- With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting; unresolved `und-Latn` chunks then fall back to the existing Latin hint rules and explicit Latin hint precedence.
 
 ## Locale vs Language Code
 
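Reviewer note on the hunk above: because per-file JSON output now carries `debug.skipped` while still emitting the top-level `skipped` "temporarily for compatibility", a consumer may want to read the new location first and fall back to the legacy one. The sketch below shows one way to do that. `PerFileJson` and `readSkipped` are hypothetical; the diff does not show the entry shape, so entries are typed as `unknown`, and treating both fields as arrays is an assumption.

```typescript
// Assumed, partial shape of per-file JSON output when --debug is set.
// Only the field paths come from the README; the value types are guesses.
interface PerFileJson {
  skipped?: unknown[]; // legacy top-level field, kept for compatibility
  debug?: {
    skipped?: unknown[];
    detector?: unknown;
  };
  files?: Array<{ debug?: { detector?: unknown } }>;
}

// Prefer the new debug.skipped location, then the legacy field, then an empty list.
function readSkipped(output: PerFileJson): unknown[] {
  return output.debug?.skipped ?? output.skipped ?? [];
}
```

Written this way, the consumer keeps working if a later release drops the top-level field.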
@@ -479,6 +576,7 @@ Import from `@dev-pi2pie/word-counter/detector` for the explicit detector-enable
 | `wordCounterWithDetector` | function | Async detector-aware counting entrypoint. |
 | `segmentTextByLocaleWithDetector` | function | Async detector-aware locale segmentation. |
 | `countSectionsWithDetector` | function | Async detector-aware section counting. |
+| `inspectTextWithDetector` | function | Async detector-aware inspect entrypoint. |
 | `DEFAULT_DETECTOR_MODE` | value | Current default detector mode (`regex`). |
 | `DETECTOR_MODES` | value | Supported detector modes. |
 
@@ -696,6 +794,7 @@ Example JSON (trimmed):
 - Detection is regex/script based, not statistical language-ID.
 - Ambiguous Latin defaults to `und-Latn`; Han fallback defaults to `und-Hani`.
 - `--detector wasm` is optional and conservative; it only runs for ambiguous chunks that meet minimum script-bearing length thresholds.
+- In `--detector wasm` mode, ambiguous Latin stays on `und-Latn` for detector eligibility first, then built-in/custom Latin rules and explicit Latin hints are applied only if the detector leaves that chunk unresolved.
 - The current first WASM engine is `whatlang`, remapped into this package's public tags.
 - The npm package ships one portable WASM artifact; users do not install per-OS detector packages.
 - Use explicit tag and hint flags when you need deterministic tagging.
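Reviewer note on the hunk above: the README describes default detection as script based, with ambiguous Latin tagged `und-Latn` and Han tagged `und-Hani`. The sketch below illustrates what script-based chunking of that kind can look like. It is not the package's implementation: `tagForChar` and `chunkByScript` are hypothetical, the code point ranges cover only basic Latin letters, kana, and the main CJK block, mapping kana to `ja` is an assumption based on the example tags the README lists, and attaching neutral characters to the preceding chunk is a simplification.

```typescript
interface Chunk {
  tag: string;
  text: string;
}

// Maps one character to a tag by code point range; returns null for neutral
// characters such as whitespace, digits, and punctuation.
function tagForChar(ch: string): string | null {
  const cp = ch.codePointAt(0) ?? 0;
  if ((cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) return "und-Latn";
  if (cp >= 0x3040 && cp <= 0x30ff) return "ja"; // Hiragana and Katakana (assumed mapping)
  if (cp >= 0x4e00 && cp <= 0x9fff) return "und-Hani"; // CJK Unified Ideographs
  return null;
}

// Groups adjacent characters that share a tag; neutral characters join the current chunk.
function chunkByScript(text: string): Chunk[] {
  const chunks: Chunk[] = [];
  for (const ch of Array.from(text)) {
    const tag = tagForChar(ch);
    const last = chunks[chunks.length - 1];
    if (last && (tag === null || tag === last.tag)) {
      last.text += ch;
    } else {
      // A leading neutral character starts an und-Latn chunk (an arbitrary choice here).
      chunks.push({ tag: tag ?? "und-Latn", text: ch });
    }
  }
  return chunks;
}
```

With these rules, `chunkByScript("Hello 世界")` yields an `und-Latn` chunk `"Hello "` followed by an `und-Hani` chunk `"世界"`. In this sketch nothing distinguishes English from any other Latin-script language, which is the ambiguity the optional WASM detector is meant to narrow.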