@dev-pi2pie/word-counter 0.1.5-canary.2 → 0.1.5-canary.4

package/README.md CHANGED
@@ -101,16 +101,94 @@ Enable the optional WASM detector for ambiguous Latin and Han routes:
  ```bash
  word-counter --detector wasm "This sentence should clearly be detected as English for the wasm detector path."
  word-counter --detector wasm "漢字測試需要更多內容才能觸發偵測"
+ word-counter --detector wasm --content-gate strict "Internationalization documentation remains understandable."
+ word-counter --detector wasm --content-gate loose "四字成語"
+ word-counter --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+ ```
+
+ Inspect detector behavior without count output:
+
+ ```bash
+ word-counter inspect "こんにちは、世界!これはテストです。"
+ word-counter inspect --view engine "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter inspect --detector regex -f json "こんにちは、世界!これはテストです。"
+ word-counter inspect --detector regex -f json --pretty "こんにちは、世界!これはテストです。"
+ word-counter inspect --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+ word-counter inspect -p ./examples/yaml-basic.md
+ word-counter inspect -p ./examples/test-case-multi-files-support
+ word-counter inspect -p ./examples/test-case-multi-files-support --section content -f json --pretty
  ```
 
  Detector mode notes:
 
  - `--detector regex` is the default behavior.
  - `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
+ - `--content-gate default|strict|loose|off` configures the shared detector policy mode used by the WASM detector path.
+   - `default`: current fixture-backed project policy
+   - `strict`: raises detector eligibility thresholds and makes more borderline windows fall back
+   - `loose`: lowers detector eligibility thresholds and makes more borderline windows eligible or upgradable
+   - `off`: bypasses `contentGate` evaluation only
+ - mode behavior differs by route:
+   - `und-Latn`: `default|strict|loose` affect both eligibility and the Latin prose-style `contentGate`
+   - `und-Hani`: `default|strict|loose` affect eligibility only, while `contentGate` still reports `policy=none`
+ - current Hani behavior:
+   - `default`: keeps the current Hani diagnostic-sample threshold
+   - `strict`: raises the Hani diagnostic-sample threshold
+   - `loose`: uses a short-window Han-focused threshold so idiom-length samples such as `四字成語` can become eligible
+   - `off`: keeps the same Hani eligibility thresholds as `default`
  - `--detector regex` keeps the original script/regex chunk-first detection path.
  - `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
+ - In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
  - Very short chunks stay on the original `und-*` fallback.
  - Low-confidence or unsupported detector results fall back to `und-*`.
+ - Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
+ - inspect/debug disclosure uses `contentGate` as the canonical gate field.
+ - legacy debug/evidence payloads still emit `qualityGate` as a compatibility alias derived from `contentGate.passed`.
+ - for practical verification, use `inspect` to compare direct mode outcomes across `default`, `strict`, `loose`, and `off`; use `--debug --detector-evidence` when you specifically need counting-flow event details or legacy `qualityGate` compatibility
+ - `word-counter inspect` supports:
+   - positional text input
+   - one direct `-p, --path <file>` input
+   - repeated `-p, --path` inputs for batch inspect
+   - directory inputs in default `--path-mode auto`
+   - literal file-only path handling in `--path-mode manual`
+   - `--section all|frontmatter|content`
+ - batch inspect keeps counting-style path acquisition but not counting aggregation:
+   - no inspect `--merged`
+   - no inspect `--per-file`
+   - no inspect `--jobs`
+
+ ### Detector Subpath (`@dev-pi2pie/word-counter/detector`)
+
+ Use the detector subpath when you need async detector-aware APIs directly in library code.
+
+ ```ts
+ import {
+   inspectTextWithDetector,
+   segmentTextByLocaleWithDetector,
+   wordCounterWithDetector,
+ } from "@dev-pi2pie/word-counter/detector";
+
+ const inspectResult = await inspectTextWithDetector("こんにちは、世界!これはテストです。", {
+   detector: "wasm",
+   view: "pipeline",
+ });
+ const countResult = await wordCounterWithDetector(
+   "Internationalization documentation remains understandable.",
+   {
+     detector: "wasm",
+     contentGate: { mode: "strict" },
+   },
+ );
+ ```
+
+ Detector subpath notes:
+
+ - detector entrypoints are async
+ - use the root package for normal counting when you do not need detector-specific control
+ - detector-subpath APIs that execute detector policy also accept:
+   - `contentGate: { mode: "default" | "strict" | "loose" | "off" }`
+ - use `detectorDebug` for counting-flow runtime diagnostics
+ - use `inspectTextWithDetector()` for direct detector diagnosis as structured data
 
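The notes above suggest comparing outcomes across `default`, `strict`, `loose`, and `off`. From library code, that comparison could be a small loop over the four modes. The sketch below is only an illustration under assumptions: `compareGateModes`, `InspectFn`, and `InspectOptions` are hypothetical names that do not come from the package, and it assumes `inspectTextWithDetector` accepts `contentGate` the same way `wordCounterWithDetector` does in the example above. The inspect function is injected so the snippet does not require the package to be installed.

```ts
// Hypothetical types for illustration; the package's real option types may differ.
type ContentGateMode = "default" | "strict" | "loose" | "off";

interface InspectOptions {
  detector: "regex" | "wasm";
  contentGate?: { mode: ContentGateMode };
}

// The inspect function is passed in rather than imported, so this sketch
// runs without the package. In real code it would be inspectTextWithDetector.
type InspectFn<T> = (text: string, options: InspectOptions) => Promise<T>;

const GATE_MODES: readonly ContentGateMode[] = ["default", "strict", "loose", "off"];

// Runs one inspect call per content-gate mode and returns the results keyed by mode.
async function compareGateModes<T>(
  inspect: InspectFn<T>,
  text: string,
): Promise<Record<ContentGateMode, T>> {
  const results = {} as Record<ContentGateMode, T>;
  for (const mode of GATE_MODES) {
    results[mode] = await inspect(text, { detector: "wasm", contentGate: { mode } });
  }
  return results;
}
```

Called as `compareGateModes(inspectTextWithDetector, "四字成語")`, this would give four results to compare side by side, provided the assumption about `contentGate` holds.
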
  Collect non-words (emoji/symbols/punctuation):
 
@@ -285,14 +363,24 @@ word-counter --path ./examples/test-case-multi-files-support --debug --verbose
 
  Use `--debug-report [path]` to route debug diagnostics to a JSONL report file:
 
- - no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-<pid>.jsonl`
+ - no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
+ - no path with `--detector-evidence`: writes with pattern `wc-detector-evidence-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
  - path provided: writes to the specified location
  - default-name collision handling: appends `-<n>` suffix to avoid overwriting existing files
  - explicit path validation: existing directories are rejected (explicit paths are treated as file targets)
+ - compatibility note: the autogenerated filename moved from the older local-time pattern to the new UTC `...-utc-...jsonl` pattern
 
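To make the new default filename pattern concrete, the sketch below builds a name of the form `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl` from a `Date` and a process id, using UTC getters as the compatibility note implies. `defaultReportName` and `ReportKind` are hypothetical names, not the package's implementation, and the `-<n>` collision suffix is left out because the diff does not say where the suffix is placed.

```ts
// Hypothetical helper names; only the filename pattern comes from the README.
type ReportKind = "wc-debug" | "wc-detector-evidence";

// Left-pads a non-negative integer with zeros to the given width.
function pad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) text = "0" + text;
  return text;
}

// Builds the autogenerated report name from UTC date parts and a pid.
function defaultReportName(now: Date, pid: number, kind: ReportKind = "wc-debug"): string {
  const date =
    pad(now.getUTCFullYear(), 4) + pad(now.getUTCMonth() + 1, 2) + pad(now.getUTCDate(), 2);
  const time =
    pad(now.getUTCHours(), 2) + pad(now.getUTCMinutes(), 2) + pad(now.getUTCSeconds(), 2);
  return `${kind}-${date}-${time}-utc-${pid}.jsonl`;
}
```

For example, `defaultReportName(new Date(Date.UTC(2025, 0, 2, 3, 4, 5)), 4321)` yields `wc-debug-20250102-030405-utc-4321.jsonl`.
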
  By default with `--debug-report`, debug lines are file-only (not mirrored to terminal).
  Use `--debug-report-tee` (alias: `--debug-tee`) to mirror to both file and `stderr`.
- Flag dependencies: `--verbose` requires `--debug`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+ Flag dependencies: `--verbose` requires `--debug`; `--detector-evidence` requires `--debug` and `--detector wasm`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+
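The flag dependencies listed above can be read as a small set of rules. The sketch below encodes them as a validator over parsed flags. It is a hypothetical illustration: `DebugFlags`, `checkFlagDependencies`, and the message wording are assumptions, and the diff does not show how the real CLI reports a violation.

```ts
// Hypothetical shape for already-parsed CLI flags.
interface DebugFlags {
  debug?: boolean;
  verbose?: boolean;
  detector?: "regex" | "wasm";
  detectorEvidence?: boolean;
  debugReport?: boolean;
  debugReportTee?: boolean; // also covers the --debug-tee alias
}

// Returns one message per violated dependency; an empty array means the
// combination satisfies every rule listed in the README.
function checkFlagDependencies(flags: DebugFlags): string[] {
  const errors: string[] = [];
  if (flags.verbose && !flags.debug) {
    errors.push("--verbose requires --debug");
  }
  if (flags.detectorEvidence && !flags.debug) {
    errors.push("--detector-evidence requires --debug");
  }
  if (flags.detectorEvidence && flags.detector !== "wasm") {
    errors.push("--detector-evidence requires --detector wasm");
  }
  if (flags.debugReport && !flags.debug) {
    errors.push("--debug-report requires --debug");
  }
  if (flags.debugReportTee && !flags.debugReport) {
    errors.push("--debug-report-tee requires --debug-report");
  }
  return errors;
}
```

Because `--detector regex` is the default, passing `--debug --detector-evidence` without `--detector wasm` would be flagged by this check.
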
+ Use `--detector-evidence` to add per-window detector evidence onto the same debug stream:
+
+ - only meaningful with `--detector wasm`
+ - compact mode emits bounded single-line previews plus detector decision metadata
+ - verbose mode emits full raw detector windows and full normalized samples
+ - evidence remains detector-window based even when output mode changes to `collector`, `char`, or another counting mode
+ - fallback evidence reports the post-fallback final tag used by downstream counting output; in rare split-relabel cases it may also include `finalLocales`
 
  Examples:
 
@@ -301,17 +389,26 @@ word-counter --path ./examples/test-case-multi-files-support --debug --debug-rep
  word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl
  word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-report-tee
  word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-tee
+ word-counter --detector wasm --debug --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter --detector wasm --debug --verbose --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter --detector wasm --debug --detector-evidence --debug-report
  ```
 
  Skip details stay debug-gated and can be suppressed with `--quiet-skips`.
 
+ When `--format json` is combined with `--debug`, debug-only diagnostics are emitted under `debug.*`:
+
+ - single input and merged batch may include `debug.detector`
+ - per-file batch may include `debug.skipped`, `debug.detector`, and per-entry `files[i].debug.detector`
+ - per-file top-level `skipped` is still emitted temporarily for compatibility
+
  ## How It Works
 
  - The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `und-Hani`, `ja`).
  - Adjacent characters that share the same locale tag are grouped into a chunk.
  - Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
  - Per-locale counts are summed into an overall total and printed to stdout.
- - With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting.
+ - With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting; unresolved `und-Latn` chunks then fall back to the existing Latin hint rules and explicit Latin hint precedence.
 
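The chunk-then-count flow described in this list can be sketched in a few lines. This is a simplified illustration and not the package's code: `tagForChar`, `segmentByLocale`, and `countWords` are hypothetical names, the script check uses a handful of code point ranges instead of full Unicode script data, and counting only `isWordLike` segments is an assumption about what counts as a word.

```ts
interface Chunk {
  locale: string;
  text: string;
}

// Rough script check by code point range (assumption: the real rules are richer).
function tagForChar(ch: string): string | null {
  const cp = ch.codePointAt(0) ?? 0;
  if (cp >= 0x3040 && cp <= 0x30ff) return "ja"; // Hiragana and Katakana
  if (cp >= 0x4e00 && cp <= 0x9fff) return "und-Hani"; // CJK Unified Ideographs
  const basicLatin = (cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a);
  if (basicLatin || (cp >= 0xc0 && cp <= 0x24f)) return "und-Latn"; // roughly Latin letters
  return null; // spaces, digits, punctuation: stay in the current chunk
}

// Groups adjacent characters that share a locale tag into chunks.
function segmentByLocale(text: string): Chunk[] {
  const chunks: Chunk[] = [];
  for (const ch of Array.from(text)) {
    const tag = tagForChar(ch);
    const last = chunks[chunks.length - 1];
    if (last && (tag === null || tag === last.locale)) {
      last.text += ch;
    } else {
      chunks.push({ locale: tag ?? "und-Latn", text: ch });
    }
  }
  return chunks;
}

// Cache of segmenters keyed by locale, so each locale is instantiated once.
const segmenters = new Map<string, Intl.Segmenter>();

function segmenterFor(locale: string): Intl.Segmenter {
  let segmenter = segmenters.get(locale);
  if (!segmenter) {
    segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    segmenters.set(locale, segmenter);
  }
  return segmenter;
}

// Counts word-like segments per chunk and sums them per locale and overall.
function countWords(text: string): { total: number; byLocale: Record<string, number> } {
  const byLocale: Record<string, number> = {};
  let total = 0;
  for (const chunk of segmentByLocale(text)) {
    let words = 0;
    for (const part of segmenterFor(chunk.locale).segment(chunk.text)) {
      if (part.isWordLike) words += 1;
    }
    byLocale[chunk.locale] = (byLocale[chunk.locale] ?? 0) + words;
    total += words;
  }
  return { total, byLocale };
}
```

With this sketch, `segmentByLocale("Hello 世界")` produces an `und-Latn` chunk followed by an `und-Hani` chunk. Word counts for Han and Japanese text depend on the ICU dictionary data in the runtime.
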
  ## Locale vs Language Code
 
@@ -479,6 +576,7 @@ Import from `@dev-pi2pie/word-counter/detector` for the explicit detector-enable
  | `wordCounterWithDetector` | function | Async detector-aware counting entrypoint. |
  | `segmentTextByLocaleWithDetector` | function | Async detector-aware locale segmentation. |
  | `countSectionsWithDetector` | function | Async detector-aware section counting. |
+ | `inspectTextWithDetector` | function | Async detector-aware inspect entrypoint. |
  | `DEFAULT_DETECTOR_MODE` | value | Current default detector mode (`regex`). |
  | `DETECTOR_MODES` | value | Supported detector modes. |
 
@@ -696,6 +794,7 @@ Example JSON (trimmed):
  - Detection is regex/script based, not statistical language-ID.
  - Ambiguous Latin defaults to `und-Latn`; Han fallback defaults to `und-Hani`.
  - `--detector wasm` is optional and conservative; it only runs for ambiguous chunks that meet minimum script-bearing length thresholds.
+ - In `--detector wasm` mode, ambiguous Latin stays on `und-Latn` for detector eligibility first, then built-in/custom Latin rules and explicit Latin hints are applied only if the detector leaves that chunk unresolved.
  - The current first WASM engine is `whatlang`, remapped into this package's public tags.
  - The npm package ships one portable WASM artifact; users do not install per-OS detector packages.
  - Use explicit tag and hint flags when you need deterministic tagging.
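
The resolution order these notes describe (detector first for eligible ambiguous chunks, Latin hints only for chunks left unresolved) can be illustrated with a small decision function. Everything here is a hypothetical sketch: `resolveTag`, `ResolveInput`, and both thresholds are made-up names and values, and the content-gate step is omitted.

```ts
// Hypothetical detector output: a public tag plus a confidence score.
interface DetectorGuess {
  tag: string;
  confidence: number;
}

interface ResolveInput {
  chunkTag: string; // tag from the regex/script pass, e.g. "und-Latn"
  scriptChars: number; // number of script-bearing characters in the chunk
  guess?: DetectorGuess; // detector result, if the detector ran
  latinHint?: string; // tag from Latin hint rules or explicit hint flags
}

// Made-up thresholds for illustration only; the package's values are not shown in the diff.
const MIN_SCRIPT_CHARS = 24;
const MIN_CONFIDENCE = 0.75;

function resolveTag(input: ResolveInput): string {
  const ambiguous = input.chunkTag === "und-Latn" || input.chunkTag === "und-Hani";
  if (!ambiguous) return input.chunkTag; // the detector only handles ambiguous chunks

  let tag = input.chunkTag;
  const eligible = input.scriptChars >= MIN_SCRIPT_CHARS;
  if (eligible && input.guess && input.guess.confidence >= MIN_CONFIDENCE) {
    tag = input.guess.tag; // detector result accepted
  }
  // Latin hints apply only when the chunk is still unresolved after the detector step.
  if (tag === "und-Latn" && input.latinHint) {
    tag = input.latinHint;
  }
  return tag;
}
```

Under these assumptions, a long Latin chunk with a confident `en` guess resolves to `en` even when a hint says `fr`, while a chunk too short for the detector takes the hint.
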