@dev-pi2pie/word-counter 0.1.5-canary.1 → 0.1.5-canary.3

package/README.md CHANGED
@@ -37,6 +37,8 @@ For local development in this repository:
  ```bash
  git clone https://github.com/dev-pi2pie/word-counter.git
  cd word-counter
+ rustup target add wasm32-unknown-unknown
+ cargo install wasm-pack --locked
  bun install
  bun run build
  npm link
@@ -94,6 +96,24 @@ word-counter --han-language zh-Hant "漢字測試"
  word-counter --han-tag zh-Hans "汉字测试"
  ```
 
+ Enable the optional WASM detector for ambiguous Latin and Han routes:
+
+ ```bash
+ word-counter --detector wasm "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter --detector wasm "漢字測試需要更多內容才能觸發偵測"
+ ```
+
+ Detector mode notes:
+
+ - `--detector regex` is the default behavior.
+ - `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
+ - `--detector regex` keeps the original script/regex chunk-first detection path.
+ - `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
+ - In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
+ - Very short chunks stay on the original `und-*` fallback.
+ - Low-confidence or unsupported detector results fall back to `und-*`.
+ - Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
+
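The gating described in the detector mode notes can be pictured as a small decision function. The following is only a sketch under stated assumptions, not the package's implementation: `resolveTag`, `letterCount`, `MIN_LETTERS`, `MIN_CONFIDENCE`, and `SUPPORTED_TAGS` are hypothetical names and values, and `detect` stands in for whatever the WASM detector returns. The deferred Latin hint rules and the technical-noise guard are not modeled here.

```javascript
// All names and numbers here are illustrative assumptions, not package internals.
const MIN_LETTERS = 12;      // hypothetical minimum script-bearing length
const MIN_CONFIDENCE = 0.8;  // hypothetical confidence floor
const SUPPORTED_TAGS = new Set(["en", "fr", "de", "zh-Hans", "zh-Hant"]); // assumed subset

// Count letters only, so digits and punctuation do not make a chunk look longer.
function letterCount(text) {
  return (text.match(/\p{L}/gu) ?? []).length;
}

// chunk: { tag, text }; detect: (text) => { tag, confidence } | null
function resolveTag(chunk, detect) {
  const ambiguous = chunk.tag === "und-Latn" || chunk.tag === "und-Hani";
  if (!ambiguous) return chunk.tag;                          // detector never runs
  if (letterCount(chunk.text) < MIN_LETTERS) return chunk.tag; // very short chunk
  const result = detect(chunk.text);
  if (!result) return chunk.tag;                             // no result
  if (!SUPPORTED_TAGS.has(result.tag)) return chunk.tag;     // unsupported language
  if (result.confidence < MIN_CONFIDENCE) return chunk.tag;  // low confidence
  return result.tag;                                         // accepted relabel
}
```

Every rejected branch returns the original `und-*` tag, which matches the conservative fallback the notes describe.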
  Collect non-words (emoji/symbols/punctuation):
 
  ```bash
@@ -267,14 +287,24 @@ word-counter --path ./examples/test-case-multi-files-support --debug --verbose
 
  Use `--debug-report [path]` to route debug diagnostics to a JSONL report file:
 
- - no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-<pid>.jsonl`
+ - no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
+ - no path with `--detector-evidence`: writes with pattern `wc-detector-evidence-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
  - path provided: writes to the specified location
  - default-name collision handling: appends `-<n>` suffix to avoid overwriting existing files
  - explicit path validation: existing directories are rejected (explicit paths are treated as file targets)
+ - compatibility note: the autogenerated filename moved from the older local-time pattern to the new UTC `...-utc-...jsonl` pattern
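
For tooling that needs to predict or match the autogenerated report names, the UTC pattern can be reproduced roughly as follows. This is a sketch, not the package's code: `defaultReportName` and `avoidCollision` are hypothetical helpers, and placing the `-<n>` collision suffix before the `.jsonl` extension is an assumption, since the list above does not say where the suffix goes.

```javascript
function pad2(n) {
  return String(n).padStart(2, "0");
}

// Builds `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl` (or the detector-evidence
// variant) from UTC date fields. Hypothetical helper, not a package export.
function defaultReportName(date, pid, { detectorEvidence = false } = {}) {
  const prefix = detectorEvidence ? "wc-detector-evidence" : "wc-debug";
  const ymd =
    `${date.getUTCFullYear()}${pad2(date.getUTCMonth() + 1)}${pad2(date.getUTCDate())}`;
  const hms =
    `${pad2(date.getUTCHours())}${pad2(date.getUTCMinutes())}${pad2(date.getUTCSeconds())}`;
  return `${prefix}-${ymd}-${hms}-utc-${pid}.jsonl`;
}

// Assumption: the `-<n>` suffix is inserted before the extension.
// `exists` is any predicate, e.g. a wrapper around a file-system check.
function avoidCollision(name, exists) {
  if (!exists(name)) return name;
  const base = name.replace(/\.jsonl$/, "");
  for (let n = 1; ; n += 1) {
    const candidate = `${base}-${n}.jsonl`;
    if (!exists(candidate)) return candidate;
  }
}
```

Scripts that glob for the older local-time names would need a pattern that also accepts the `-utc-` segment.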
 
  By default with `--debug-report`, debug lines are file-only (not mirrored to terminal).
  Use `--debug-report-tee` (alias: `--debug-tee`) to mirror to both file and `stderr`.
- Flag dependencies: `--verbose` requires `--debug`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+ Flag dependencies: `--verbose` requires `--debug`; `--detector-evidence` requires `--debug` and `--detector wasm`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+
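The flag dependencies above amount to a handful of independent checks. A wrapper script could pre-validate its arguments along these lines; the `flags` object shape and the `checkFlagDependencies` name are hypothetical, and the CLI's own error messages may differ.

```javascript
// flags: { debug, verbose, detector, detectorEvidence, debugReport, debugReportTee }
// Hypothetical shape; `debugReportTee` covers both --debug-report-tee and --debug-tee.
function checkFlagDependencies(flags) {
  const errors = [];
  if (flags.verbose && !flags.debug) {
    errors.push("--verbose requires --debug");
  }
  if (flags.detectorEvidence && !(flags.debug && flags.detector === "wasm")) {
    errors.push("--detector-evidence requires --debug and --detector wasm");
  }
  if (flags.debugReport && !flags.debug) {
    errors.push("--debug-report requires --debug");
  }
  if (flags.debugReportTee && !flags.debugReport) {
    errors.push("--debug-report-tee requires --debug-report");
  }
  return errors; // empty array means the combination is consistent
}
```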
+ Use `--detector-evidence` to add per-window detector evidence onto the same debug stream:
+
+ - only meaningful with `--detector wasm`
+ - compact mode emits bounded single-line previews plus detector decision metadata
+ - verbose mode emits full raw detector windows and full normalized samples
+ - evidence remains detector-window based even when output mode changes to `collector`, `char`, or another counting mode
+ - fallback evidence reports the post-fallback final tag used by downstream counting output; in rare split-relabel cases it may also include `finalLocales`
 
  Examples:
 
@@ -283,16 +313,26 @@ word-counter --path ./examples/test-case-multi-files-support --debug --debug-rep
  word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl
  word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-report-tee
  word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-tee
+ word-counter --detector wasm --debug --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter --detector wasm --debug --verbose --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+ word-counter --detector wasm --debug --detector-evidence --debug-report
  ```
 
  Skip details stay debug-gated and can be suppressed with `--quiet-skips`.
 
+ When `--format json` is combined with `--debug`, debug-only diagnostics are emitted under `debug.*`:
+
+ - single input and merged batch may include `debug.detector`
+ - per-file batch may include `debug.skipped`, `debug.detector`, and per-entry `files[i].debug.detector`
+ - per-file top-level `skipped` is still emitted temporarily for compatibility
+
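Because `debug.detector` may appear at the top level, per entry, or not at all, a consumer of the JSON output probably wants to read it defensively. The sketch below only relies on the paths named in the list above; `collectDetectorDebug` is a hypothetical helper, and the detector payload is treated as opaque because its shape is not described here.

```javascript
// Gathers every detector debug payload from a parsed `--format json` result.
// Works for single-input, merged-batch, and per-file shapes alike.
function collectDetectorDebug(result) {
  const found = [];
  if (result?.debug?.detector !== undefined) {
    found.push({ source: "debug.detector", detector: result.debug.detector });
  }
  (result?.files ?? []).forEach((file, i) => {
    if (file?.debug?.detector !== undefined) {
      found.push({ source: `files[${i}].debug.detector`, detector: file.debug.detector });
    }
  });
  return found;
}
```

Reading skip details from `debug.skipped` first and falling back to top-level `skipped` would follow the same pattern while the compatibility field is still emitted.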
  ## How It Works
 
  - The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `und-Hani`, `ja`).
  - Adjacent characters that share the same locale tag are grouped into a chunk.
  - Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
  - Per-locale counts are summed into an overall total and printed to stdout.
+ - With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting; unresolved `und-Latn` chunks then fall back to the existing Latin hint rules and explicit Latin hint precedence.
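
The regex path in the steps above can be approximated in a few lines of plain JavaScript. This is a simplified sketch rather than the package's code: `tagForChar`, `chunkByTag`, `getSegmenter`, and `countWords` are hypothetical names, the script-to-tag mapping is an assumed minimal subset, and attaching neutral characters (whitespace, digits, punctuation) to the preceding chunk is one possible design choice among several.

```javascript
// Assumed minimal mapping from Unicode script to a locale tag.
function tagForChar(ch) {
  if (/\p{Script=Hiragana}|\p{Script=Katakana}/u.test(ch)) return "ja";
  if (/\p{Script=Han}/u.test(ch)) return "und-Hani";
  if (/\p{Script=Latin}/u.test(ch)) return "und-Latn";
  return null; // neutral: whitespace, digits, punctuation, emoji
}

// Group adjacent characters that share a tag; neutral characters join the current chunk.
function chunkByTag(text) {
  const chunks = [];
  for (const ch of text) {
    const tag = tagForChar(ch);
    const last = chunks[chunks.length - 1];
    if (last && (tag === null || tag === last.tag)) {
      last.text += ch;
    } else {
      // Leading neutral characters default to und-Latn in this sketch.
      chunks.push({ tag: tag ?? "und-Latn", text: ch });
    }
  }
  return chunks;
}

// Cache one word segmenter per tag to avoid re-instantiation.
const segmenters = new Map();
function getSegmenter(tag) {
  if (!segmenters.has(tag)) {
    segmenters.set(tag, new Intl.Segmenter(tag, { granularity: "word" }));
  }
  return segmenters.get(tag);
}

// Count word-like segments per chunk and sum them per tag and overall.
function countWords(text) {
  const counts = {};
  let total = 0;
  for (const chunk of chunkByTag(text)) {
    let n = 0;
    for (const segment of getSegmenter(chunk.tag).segment(chunk.text)) {
      if (segment.isWordLike) n += 1;
    }
    counts[chunk.tag] = (counts[chunk.tag] ?? 0) + n;
    total += n;
  }
  return { total, counts };
}
```

In `--detector wasm` mode, the relabeling step would sit between `chunkByTag` and the counting loop, replacing an ambiguous chunk's tag before a segmenter is chosen for it.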
 
  ## Locale vs Language Code
 
@@ -316,6 +356,10 @@ import wordCounter, {
  segmentTextByLocale,
  showSingularOrPluralWord,
  } from "@dev-pi2pie/word-counter";
+ import {
+ wordCounterWithDetector,
+ segmentTextByLocaleWithDetector,
+ } from "@dev-pi2pie/word-counter/detector";
 
  wordCounter("Hello world", { latinLanguageHint: "en" });
  wordCounter("Hello world", { latinTagHint: "en" });
@@ -329,6 +373,11 @@ wordCounter("Hi 👋, world!", { mode: "char", nonWords: true });
  wordCounter("飛鳥 bird 貓 cat", { mode: "char-collector" });
  wordCounter("Hi\tthere\n", { nonWords: true, includeWhitespace: true });
  countCharsForLocale("👋", "en");
+ await wordCounterWithDetector(
+ "This sentence should clearly be detected as English for the wasm detector path.",
+ { detector: "wasm" },
+ );
+ await segmentTextByLocaleWithDetector("Hello 世界", { detector: "regex" });
  ```
 
  Note: `includeWhitespace` only affects results when `nonWords: true` is enabled.
@@ -362,6 +411,7 @@ Sample output (with `nonWords: true` and `includeWhitespace: true`):
 
  ```js
  const wordCounter = require("@dev-pi2pie/word-counter");
+ const detector = require("@dev-pi2pie/word-counter/detector");
  const {
  countCharsForLocale,
  countWordsForLocale,
@@ -383,6 +433,10 @@ wordCounter("Hi 👋, world!", { mode: "char", nonWords: true });
  wordCounter("飛鳥 bird 貓 cat", { mode: "char-collector" });
  wordCounter("Hi\tthere\n", { nonWords: true, includeWhitespace: true });
  countCharsForLocale("👋", "en");
+ await detector.wordCounterWithDetector(
+ "This sentence should clearly be detected as English for the wasm detector path.",
+ { detector: "wasm" },
+ );
  ```
 
  Note: `includeWhitespace` only affects results when `nonWords: true` is enabled.
@@ -437,6 +491,18 @@ Sample output (with `nonWords: true` and `includeWhitespace: true`):
  | -------------------------- | -------- | ------------------------------ |
  | `showSingularOrPluralWord` | function | Formats singular/plural words. |
 
+ #### Detector Subpath
+
+ Import from `@dev-pi2pie/word-counter/detector` for the explicit detector-enabled API.
+
+ | Export | Kind | Notes |
+ | ----------------------------- | -------- | ----------------------------------------------- |
+ | `wordCounterWithDetector` | function | Async detector-aware counting entrypoint. |
+ | `segmentTextByLocaleWithDetector` | function | Async detector-aware locale segmentation. |
+ | `countSectionsWithDetector` | function | Async detector-aware section counting. |
+ | `DEFAULT_DETECTOR_MODE` | value | Current default detector mode (`regex`). |
+ | `DETECTOR_MODES` | value | Supported detector modes. |
+
  #### Types
 
  | Export | Kind | Notes |
@@ -650,6 +716,10 @@ Example JSON (trimmed):
 
  - Detection is regex/script based, not statistical language-ID.
  - Ambiguous Latin defaults to `und-Latn`; Han fallback defaults to `und-Hani`.
+ - `--detector wasm` is optional and conservative; it only runs for ambiguous chunks that meet minimum script-bearing length thresholds.
+ - In `--detector wasm` mode, ambiguous Latin stays on `und-Latn` for detector eligibility first, then built-in/custom Latin rules and explicit Latin hints are applied only if the detector leaves that chunk unresolved.
+ - The current first WASM engine is `whatlang`, remapped into this package's public tags.
+ - The npm package ships one portable WASM artifact; users do not install per-OS detector packages.
  - Use explicit tag and hint flags when you need deterministic tagging.
  - Full notes (built-in heuristics, limitations, and override guidance) are tracked in `docs/locale-tag-detection-notes.md`.