@dev-pi2pie/word-counter 0.1.5-canary.3 → 0.1.5-canary.5

package/README.md CHANGED
@@ -4,7 +4,7 @@ Locale-aware word counting powered by the Web API [`Intl.Segmenter`](https://dev
 
  ## Quick Start (npx)
 
- Runtime requirement: Node.js `>=20`.
+ Runtime requirement: Node.js `>=22.18.0`.
 
  Run without installing:
 
@@ -101,18 +101,145 @@ Enable the optional WASM detector for ambiguous Latin and Han routes:
  ```bash
  word-counter --detector wasm "This sentence should clearly be detected as English for the wasm detector path."
  word-counter --detector wasm "漢字測試需要更多內容才能觸發偵測"
+ word-counter --detector wasm --content-gate strict "Internationalization documentation remains understandable."
+ word-counter --detector wasm --content-gate loose "四字成語"
+ word-counter --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+ ```
+
+ Inspect detector behavior without count output:
+
+ ```bash
+ word-counter inspect "こんにちは、世界!これはテストです。"
+ word-counter inspect --detector wasm --view engine "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter inspect --detector regex -f json "こんにちは、世界!これはテストです。"
+ word-counter inspect --detector regex -f json --pretty "こんにちは、世界!これはテストです。"
+ word-counter inspect --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+ word-counter inspect -p ./examples/yaml-basic.md
+ word-counter inspect -p ./examples/test-case-multi-files-support
+ word-counter inspect -p ./examples/test-case-multi-files-support --section content -f json --pretty
  ```
 
  Detector mode notes:
 
  - `--detector regex` is the default behavior.
  - `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
+ - `--content-gate default|strict|loose|off` configures the shared detector policy mode used by the WASM detector path.
+   - `default`: current fixture-backed project policy
+   - `strict`: raises detector eligibility thresholds and makes more borderline windows fall back
+   - `loose`: lowers detector eligibility thresholds and makes more borderline windows eligible or upgradable
+   - `off`: bypasses `contentGate` evaluation only
+   - mode behavior differs by route:
+     - `und-Latn`: `default|strict|loose` affect both eligibility and the Latin prose-style `contentGate`
+     - `und-Hani`: `default|strict|loose` affect eligibility only, while `contentGate` still reports `policy=none`
+   - current Hani behavior:
+     - `default`: keeps the current Hani diagnostic-sample threshold
+     - `strict`: raises the Hani diagnostic-sample threshold
+     - `loose`: uses a short-window Han-focused threshold so idiom-length samples such as `四字成語` can become eligible
+     - `off`: keeps the same Hani eligibility thresholds as `default`
  - `--detector regex` keeps the original script/regex chunk-first detection path.
  - `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
  - In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
  - Very short chunks stay on the original `und-*` fallback.
  - Low-confidence or unsupported detector results fall back to `und-*`.
  - Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
+ - inspect/debug disclosure uses `contentGate` as the canonical gate field.
+ - legacy debug/evidence payloads still emit `qualityGate` as a compatibility alias derived from `contentGate.passed`.
+ - for practical verification, use `inspect` to compare direct mode outcomes across `default`, `strict`, `loose`, and `off`; use `--debug --detector-evidence` when you specifically need counting-flow event details or legacy `qualityGate` compatibility
+ - `word-counter inspect` supports:
+   - positional text input
+   - one direct `-p, --path <file>` input
+   - repeated `-p, --path` inputs for batch inspect
+   - directory inputs in default `--path-mode auto`
+   - literal file-only path handling in `--path-mode manual`
+   - `--section all|frontmatter|content`
+ - batch inspect keeps counting-style path acquisition but not counting aggregation:
+   - no inspect `--merged`
+   - no inspect `--per-file`
+   - no inspect `--jobs`
+
+ ### Config Files
+
+ `word-counter` supports config files in these canonical names:
+
+ - `wc-intl-seg.config.toml`
+ - `wc-intl-seg.config.json`
+ - `wc-intl-seg.config.jsonc`
+
+ Config precedence is:
+
+ ```text
+ built-in defaults
+ < user config dir / wc-intl-seg.config.{toml|jsonc|json}
+ < cwd / wc-intl-seg.config.{toml|jsonc|json}
+ < environment variables
+ < flag options
+ ```
+
+ Same-scope file priority is `toml > jsonc > json`.
+ If lower-priority sibling config files are ignored, the CLI emits a warning.
+
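The precedence chain and the same-scope file priority added in this hunk can be illustrated with a minimal sketch. It is not the package's implementation: `ConfigLayer`, `pickSameScopeFile`, `mergeLayers`, and the `recursive` key are hypothetical names chosen for illustration, and the sketch assumes a shallow "later layer wins" merge, which the README defers to `docs/config-usage-guide.md`.

```ts
// Hypothetical flat config shape; the real config schema is not shown in this diff.
type ConfigLayer = Record<string, string | number | boolean | undefined>;

// Same-scope file priority from the README: toml > jsonc > json.
const FORMAT_PRIORITY = ["toml", "jsonc", "json"] as const;
type ConfigFormat = (typeof FORMAT_PRIORITY)[number];

// Pick the winning file format in one directory and report the ignored siblings,
// which is the situation in which the CLI is described as emitting a warning.
function pickSameScopeFile(found: ConfigFormat[]): {
  chosen: ConfigFormat | undefined;
  ignored: ConfigFormat[];
} {
  const ordered = FORMAT_PRIORITY.filter((format) => found.includes(format));
  const [chosen, ...ignored] = ordered;
  return { chosen, ignored };
}

// Merge layers from lowest to highest precedence; undefined values never override.
function mergeLayers(...layers: ConfigLayer[]): ConfigLayer {
  const merged: ConfigLayer = {};
  for (const layer of layers) {
    for (const [key, value] of Object.entries(layer)) {
      if (value !== undefined) merged[key] = value;
    }
  }
  return merged;
}

const sameScope = pickSameScopeFile(["json", "toml"]);

const effective = mergeLayers(
  { detector: "regex", recursive: true }, // built-in defaults
  { detector: "wasm" },                   // user config dir
  { recursive: false },                   // cwd config
  {},                                     // environment variables (none set)
  { detector: "regex" },                  // flag options
);
```

Here the flag layer restores `detector: "regex"` over the user config, while the cwd config's `recursive: false` survives because no higher layer sets it.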
+ Detector config notes:
+
+ - counting defaults to `regex`
+ - `inspect` also defaults to `regex`
+ - root `detector` controls normal counting
+ - optional `inspect.detector` overrides inspect-only behavior
+ - root `contentGate.mode` controls detector-policy defaults for counting
+ - optional `inspect.contentGate.mode` overrides inspect-only detector-policy behavior
+ - `WORD_COUNTER_CONTENT_GATE` overrides config-derived content-gate defaults
+ - `--content-gate` stays the highest-precedence detector-policy override
+ - `inspect --detector` only affects the current inspect invocation
+
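The content-gate notes above describe a resolution order: `--content-gate`, then `WORD_COUNTER_CONTENT_GATE`, then `inspect.contentGate.mode` (inspect only), then root `contentGate.mode`. A minimal sketch of that order follows. `GateSources` and `resolveContentGateMode` are hypothetical names, and three behaviors are assumptions rather than documented facts: inspect falls back to the root config value when no inspect-specific value exists, unrecognized values are skipped (the real CLI may reject them instead), and the final fallback is `default`.

```ts
type ContentGateMode = "default" | "strict" | "loose" | "off";
const MODES: readonly string[] = ["default", "strict", "loose", "off"];

// Hypothetical container for the raw values found in each source.
interface GateSources {
  flag?: string;          // --content-gate
  env?: string;           // WORD_COUNTER_CONTENT_GATE
  inspectConfig?: string; // inspect.contentGate.mode
  rootConfig?: string;    // contentGate.mode
}

function isMode(value: string | undefined): value is ContentGateMode {
  return value !== undefined && MODES.includes(value);
}

function resolveContentGateMode(
  sources: GateSources,
  command: "count" | "inspect",
): ContentGateMode {
  // Highest precedence first; the inspect-only override is ignored when counting.
  const candidates = [
    sources.flag,
    sources.env,
    command === "inspect" ? sources.inspectConfig : undefined,
    sources.rootConfig,
  ];
  return candidates.find(isMode) ?? "default"; // assumed fallback
}

const configOnly: GateSources = { rootConfig: "strict", inspectConfig: "loose" };
const forCount = resolveContentGateMode(configOnly, "count");
const forInspect = resolveContentGateMode(configOnly, "inspect");
const withEnv = resolveContentGateMode({ ...configOnly, env: "off" }, "inspect");
```

With only config values present, counting resolves to `strict` and inspect to `loose`; setting the environment variable to `off` overrides both config values.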
+ Examples:
+
+ ```bash
+ word-counter -d wasm "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter --content-gate strict "Internationalization documentation remains understandable."
+ word-counter inspect -d regex -f json "こんにちは、世界!これはテストです。"
+ word-counter inspect --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+ word-counter --path ./examples/test-case-multi-files-support --format json
+ ```
+
+ Default-reference config examples live under:
+
+ - `examples/wc-config/wc-intl-seg.config.toml`
+ - `examples/wc-config/wc-intl-seg.config.json`
+ - `examples/wc-config/wc-intl-seg.config.jsonc`
+
+ For full config behavior, platform-specific user config locations, merge rules, and examples, see [`docs/config-usage-guide.md`](docs/config-usage-guide.md).
+
+ ### Detector Subpath (`@dev-pi2pie/word-counter/detector`)
+
+ Use the detector subpath when you need async detector-aware APIs directly in library code.
+
+ ```ts
+ import {
+   inspectTextWithDetector,
+   segmentTextByLocaleWithDetector,
+   wordCounterWithDetector,
+ } from "@dev-pi2pie/word-counter/detector";
+
+ const inspectResult = await inspectTextWithDetector("こんにちは、世界!これはテストです。", {
+   detector: "wasm",
+   view: "pipeline",
+ });
+ const countResult = await wordCounterWithDetector(
+   "Internationalization documentation remains understandable.",
+   {
+     detector: "wasm",
+     contentGate: { mode: "strict" },
+   },
+ );
+ ```
+
+ Detector subpath notes:
+
+ - detector entrypoints are async
+ - use the root package for normal counting when you do not need detector-specific control
+ - detector-subpath APIs that execute detector policy also accept:
+   - `contentGate: { mode: "default" | "strict" | "loose" | "off" }`
+ - use `detectorDebug` for counting-flow runtime diagnostics
+ - use `inspectTextWithDetector()` for direct detector diagnosis as structured data
 
  Collect non-words (emoji/symbols/punctuation):
 
@@ -140,6 +267,7 @@ Directory scans are recursive by default:
 
  ```bash
  word-counter --path ./examples/test-case-multi-files-support
+ word-counter --path ./examples/test-case-multi-files-support --recursive
  word-counter --path ./examples/test-case-multi-files-support --no-recursive
  ```
 
@@ -153,6 +281,7 @@ Progress behavior in standard batch mode:
 
  ```bash
  word-counter --path ./examples/test-case-multi-files-support
+ word-counter --path ./examples/test-case-multi-files-support --progress
  word-counter --path ./examples/test-case-multi-files-support --no-progress
  word-counter --path ./examples/test-case-multi-files-support --keep-progress
  ```
@@ -174,7 +303,7 @@ Quick policy:
  - `--jobs 1`: async main-thread `load+count` baseline.
  - `--jobs > 1`: worker `load+count` with async fallback when workers are unavailable.
  - if requested `--jobs` exceeds host `suggestedMaxJobs` (from `--print-jobs-limit`), the CLI warns and runs with the suggested limit as a safety cap.
- - use `--quiet-warnings` to suppress non-fatal warning lines (for example jobs-limit advisory and worker-fallback warning).
+ - use `--quiet-warnings` to suppress non-fatal warning lines (for example config discovery notes, jobs-limit advisory, and worker-fallback warning).
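The quick policy above (safety cap, worker fallback, and warning suppression) can be sketched as a pure function. This is an illustration under stated assumptions, not the CLI's code: `planJobs`, `JobsPlan`, the route labels, and the warning texts are all hypothetical, and clamping requests below 1 up to 1 is an assumption.

```ts
interface JobsPlan {
  jobs: number;
  route: "async-main-thread" | "worker";
  warnings: string[];
}

function planJobs(
  requested: number,
  suggestedMaxJobs: number,
  options: { workersAvailable?: boolean; quietWarnings?: boolean } = {},
): JobsPlan {
  const { workersAvailable = true, quietWarnings = false } = options;
  const warnings: string[] = [];

  // Assumption: non-integer or sub-1 requests are normalized to at least 1.
  let jobs = Math.max(1, Math.floor(requested));

  // Safety cap: warn and run with the suggested limit.
  if (jobs > suggestedMaxJobs) {
    warnings.push(
      `requested jobs ${jobs} exceeds suggested limit ${suggestedMaxJobs}; using ${suggestedMaxJobs}`,
    );
    jobs = suggestedMaxJobs;
  }

  // --jobs 1 stays on the main thread; --jobs > 1 uses workers when available.
  let route: JobsPlan["route"] = "async-main-thread";
  if (jobs > 1) {
    if (workersAvailable) {
      route = "worker";
    } else {
      warnings.push("workers unavailable; falling back to async load+count");
    }
  }

  // --quiet-warnings suppresses the non-fatal lines without changing the plan.
  return { jobs, route, warnings: quietWarnings ? [] : warnings };
}

const capped = planJobs(8, 4);
const quiet = planJobs(8, 4, { quietWarnings: true });
```

Keeping warnings in the returned plan, rather than printing them inside the function, makes the suppression flag a pure output filter: `capped` and `quiet` run with the same `jobs` value and differ only in their warning list.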
 
  Inspect host jobs diagnostics:
 
@@ -196,7 +325,7 @@ word-counter doctor --format json --pretty
 
  Doctor scope in v1:
 
- - checks runtime support policy against Node.js `>=20`
+ - checks runtime support policy against Node.js `>=22.18.0`
  - verifies `Intl.Segmenter` availability plus word/grapheme constructor health
  - reports batch jobs host limits using the same heuristics as `--print-jobs-limit`
  - reports worker-route preflight signals and the worker-disable env toggle that affects worker availability
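The first two doctor checks listed above (runtime policy against `>=22.18.0` and `Intl.Segmenter` health) can be approximated with a short sketch. It does not reproduce the actual `doctor` command: `meetsMinimum` and `segmenterHealthy` are hypothetical helpers, pre-release suffixes are simply ignored, and the segmenter is looked up dynamically so the sketch does not depend on a particular TypeScript `lib` setting.

```ts
// Parse "v22.18.0" or "22.18.0-nightly" into numeric parts, ignoring any suffix.
function parseVersion(version: string): number[] {
  const core = version.replace(/^v/, "").split("-")[0] ?? "";
  return core.split(".").map((part) => Number.parseInt(part, 10) || 0);
}

// Numeric comparison of major.minor.patch; string comparison would misorder 22.9 and 22.18.
function meetsMinimum(actual: string, minimum: string): boolean {
  const a = parseVersion(actual);
  const m = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const diff = (a[i] ?? 0) - (m[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return true;
}

type SegmenterLike = { segment(input: string): Iterable<unknown> };
type SegmenterCtor = new (
  locale: string,
  options: { granularity: string },
) => SegmenterLike;

// Check that the constructor exists and yields at least one segment.
function segmenterHealthy(granularity: "word" | "grapheme"): boolean {
  const ctor = (Intl as unknown as Record<string, unknown>)["Segmenter"];
  if (typeof ctor !== "function") return false;
  try {
    const segmenter = new (ctor as SegmenterCtor)("en", { granularity });
    return Array.from(segmenter.segment("hello world")).length > 0;
  } catch {
    return false;
  }
}

const MINIMUM_NODE = "22.18.0";
const report = {
  node20Supported: meetsMinimum("20.11.1", MINIMUM_NODE),
  node22_18Supported: meetsMinimum("v22.18.0", MINIMUM_NODE),
  wordSegmenter: segmenterHealthy("word"),
  graphemeSegmenter: segmenterHealthy("grapheme"),
};
```

In a real Node.js process the first argument would typically come from `process.versions.node`; fixed strings are used here so the comparison stays deterministic.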
@@ -216,7 +345,9 @@ For full policy details, JSON parity expectations (`--misc`, `--total-of whitesp
  ### Stable Path Resolution Contract
 
  - Repeated `--path` values are accepted as mixed inputs (file + directory).
- - In `--path-mode auto` (default), directory inputs are expanded to files (recursive unless `--no-recursive`).
+ - In `--path-mode auto` (default), directory inputs are expanded to files.
+   `--recursive` explicitly enables recursive traversal and overrides non-recursive config/env defaults.
+   `--no-recursive` explicitly disables recursive traversal for the current invocation.
  - In `--path-mode manual`, `--path` values are treated as literal file inputs; `--path <dir>` is not supported and is skipped as `not a regular file`.
  - Extension and regex filters apply only to files discovered from directory expansion.
  - Direct file inputs are always considered regardless of `--include-ext` / `--exclude-ext` / `--regex`.
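The contract lines in this hunk can be illustrated with a small in-memory sketch. It is an assumption-laden illustration, not the CLI's resolver: `PathInput`, `resolveInputs`, and the pre-listed `files` array are hypothetical, directory traversal (including the `--recursive` / `--no-recursive` choice) is assumed to have happened before this step, and only an extension filter is modeled.

```ts
// Hypothetical input model: directories arrive with their already-discovered files.
type PathInput =
  | { kind: "file"; path: string }
  | { kind: "directory"; path: string; files: string[] };

interface ResolveOptions {
  pathMode: "auto" | "manual";
  includeExt?: string[]; // for example [".md"]
}

interface Resolved {
  files: string[];
  skipped: { path: string; reason: string }[];
}

function resolveInputs(inputs: PathInput[], options: ResolveOptions): Resolved {
  const result: Resolved = { files: [], skipped: [] };
  const include = options.includeExt;

  for (const input of inputs) {
    if (input.kind === "file") {
      // Direct file inputs bypass extension filters in both modes.
      result.files.push(input.path);
    } else if (options.pathMode === "manual") {
      // Manual mode treats every --path as a literal file.
      result.skipped.push({ path: input.path, reason: "not a regular file" });
    } else {
      // Auto mode expands directories; filters apply only to discovered files.
      for (const file of input.files) {
        if (!include || include.some((ext) => file.endsWith(ext))) {
          result.files.push(file);
        }
      }
    }
  }
  return result;
}

const inputs: PathInput[] = [
  { kind: "file", path: "notes.bin" },
  { kind: "directory", path: "docs", files: ["docs/a.md", "docs/b.txt"] },
];
const auto = resolveInputs(inputs, { pathMode: "auto", includeExt: [".md"] });
const manual = resolveInputs(inputs, { pathMode: "manual", includeExt: [".md"] });
```

In `auto` mode `notes.bin` is kept despite the `.md` filter because it was passed directly, while `docs/b.txt` is filtered out; in `manual` mode the `docs` directory is skipped with the reason quoted in the contract.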
@@ -500,6 +631,7 @@ Import from `@dev-pi2pie/word-counter/detector` for the explicit detector-enable
  | `wordCounterWithDetector` | function | Async detector-aware counting entrypoint. |
  | `segmentTextByLocaleWithDetector` | function | Async detector-aware locale segmentation. |
  | `countSectionsWithDetector` | function | Async detector-aware section counting. |
+ | `inspectTextWithDetector` | function | Async detector-aware inspect entrypoint. |
  | `DEFAULT_DETECTOR_MODE` | value | Current default detector mode (`regex`). |
  | `DETECTOR_MODES` | value | Supported detector modes. |