npm - @dev-pi2pie/word-counter - Versions diffs - 0.1.5-canary.2 → 0.1.5-canary.4 - Mend

@dev-pi2pie/word-counter 0.1.5-canary.2 → 0.1.5-canary.4

This diff shows the contents of publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.

Files changed (19)

package/README.md +102 -3
package/dist/cjs/detector.cjs +1159 -87
package/dist/cjs/detector.cjs.map +1 -1
package/dist/cjs/index.cjs.map +1 -1
package/dist/cjs/markdown.cjs +88 -14
package/dist/cjs/markdown.cjs.map +1 -1
package/dist/esm/bin.mjs +2808 -979
package/dist/esm/bin.mjs.map +1 -1
package/dist/esm/detector.d.mts +196 -2
package/dist/esm/detector.mjs +1156 -89
package/dist/esm/detector.mjs.map +1 -1
package/dist/esm/index.mjs +1 -1
package/dist/esm/markdown.mjs +71 -15
package/dist/esm/markdown.mjs.map +1 -1
package/dist/esm/worker/count-worker.mjs +878 -107
package/dist/esm/worker/count-worker.mjs.map +1 -1
package/dist/esm/worker-pool.mjs +17 -3
package/dist/esm/worker-pool.mjs.map +1 -1
package/package.json +6 -2

package/README.md

@@ -101,16 +101,94 @@ Enable the optional WASM detector for ambiguous Latin and Han routes:
 ```bash
 word-counter --detector wasm "This sentence should clearly be detected as English for the wasm detector path."
 word-counter --detector wasm "漢字測試需要更多內容才能觸發偵測"
+word-counter --detector wasm --content-gate strict "Internationalization documentation remains understandable."
+word-counter --detector wasm --content-gate loose "四字成語"
+word-counter --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+```
+Inspect detector behavior without count output:
+```bash
+word-counter inspect "こんにちは、世界！これはテストです。"
+word-counter inspect --view engine "This sentence should clearly be detected as English for the wasm detector path."
+word-counter inspect --detector regex -f json "こんにちは、世界！これはテストです。"
+word-counter inspect --detector regex -f json --pretty "こんにちは、世界！これはテストです。"
+word-counter inspect --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
+word-counter inspect -p ./examples/yaml-basic.md
+word-counter inspect -p ./examples/test-case-multi-files-support
+word-counter inspect -p ./examples/test-case-multi-files-support --section content -f json --pretty
 ```
 Detector mode notes:
 - `--detector regex` is the default behavior.
 - `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
+- `--content-gate default|strict|loose|off` configures the shared detector policy mode used by the WASM detector path.
+  - `default`: current fixture-backed project policy
+  - `strict`: raises detector eligibility thresholds and makes more borderline windows fall back
+  - `loose`: lowers detector eligibility thresholds and makes more borderline windows eligible or upgradable
+  - `off`: bypasses `contentGate` evaluation only
+- mode behavior differs by route:
+  - `und-Latn`: `default|strict|loose` affect both eligibility and the Latin prose-style `contentGate`
+  - `und-Hani`: `default|strict|loose` affect eligibility only, while `contentGate` still reports `policy=none`
+- current Hani behavior:
+  - `default`: keeps the current Hani diagnostic-sample threshold
+  - `strict`: raises the Hani diagnostic-sample threshold
+  - `loose`: uses a short-window Han-focused threshold so idiom-length samples such as `四字成語` can become eligible
+  - `off`: keeps the same Hani eligibility thresholds as `default`
 - `--detector regex` keeps the original script/regex chunk-first detection path.
 - `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
+- In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
 - Very short chunks stay on the original `und-*` fallback.
 - Low-confidence or unsupported detector results fall back to `und-*`.
+- Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
+- inspect/debug disclosure uses `contentGate` as the canonical gate field.
+- legacy debug/evidence payloads still emit `qualityGate` as a compatibility alias derived from `contentGate.passed`.
+- for practical verification, use `inspect` to compare direct mode outcomes across `default`, `strict`, `loose`, and `off`; use `--debug --detector-evidence` when you specifically need counting-flow event details or legacy `qualityGate` compatibility
+- `word-counter inspect` supports:
+  - positional text input
+  - one direct `-p, --path <file>` input
+  - repeated `-p, --path` inputs for batch inspect
+  - directory inputs in default `--path-mode auto`
+  - literal file-only path handling in `--path-mode manual`
+  - `--section all|frontmatter|content`
+- batch inspect keeps counting-style path acquisition but not counting aggregation:
+  - no inspect `--merged`
+  - no inspect `--per-file`
+  - no inspect `--jobs`
+### Detector Subpath (`@dev-pi2pie/word-counter/detector`)
+Use the detector subpath when you need async detector-aware APIs directly in library code.
+```ts
+import {
+  inspectTextWithDetector,
+  segmentTextByLocaleWithDetector,
+  wordCounterWithDetector,
+} from "@dev-pi2pie/word-counter/detector";
+const inspectResult = await inspectTextWithDetector("こんにちは、世界！これはテストです。", {
+  detector: "wasm",
+  view: "pipeline",
+});
+const countResult = await wordCounterWithDetector(
+  "Internationalization documentation remains understandable.",
+  {
+    detector: "wasm",
+    contentGate: { mode: "strict" },
+  },
+);
+```
+Detector subpath notes:
+- detector entrypoints are async
+- use the root package for normal counting when you do not need detector-specific control
+- detector-subpath APIs that execute detector policy also accept:
+  - `contentGate: { mode: "default" | "strict" | "loose" | "off" }`
+- use `detectorDebug` for counting-flow runtime diagnostics
+- use `inspectTextWithDetector()` for direct detector diagnosis as structured data
 Collect non-words (emoji/symbols/punctuation):
@@ -285,14 +363,24 @@ word-counter --path ./examples/test-case-multi-files-support --debug --verbose
 Use `--debug-report [path]` to route debug diagnostics to a JSONL report file:
-- no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-<pid>.jsonl`
+- no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
+- no path with `--detector-evidence`: writes with pattern `wc-detector-evidence-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
 - path provided: writes to the specified location
 - default-name collision handling: appends `-<n>` suffix to avoid overwriting existing files
 - explicit path validation: existing directories are rejected (explicit paths are treated as file targets)
+- compatibility note: the autogenerated filename moved from the older local-time pattern to the new UTC `...-utc-...jsonl` pattern
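
Reviewer note: the autogenerated report names described above can be illustrated with a small sketch. It is not the package's implementation. `defaultReportName` and `avoidCollision` are hypothetical helpers, and placing the `-<n>` suffix before the `.jsonl` extension is an assumption, since the README only says a suffix is appended.

```typescript
// Hypothetical sketch of the documented default report names.
// Only the filename patterns come from the README; everything else is assumed.
type ReportKind = "debug" | "detector-evidence";

// Zero-pad a number to two digits.
function two(value: number): string {
  return value < 10 ? "0" + value : String(value);
}

// Build wc-<kind>-YYYYMMDD-HHmmss-utc-<pid>.jsonl from UTC date parts.
function defaultReportName(kind: ReportKind, now: Date, pid: number): string {
  const date = String(now.getUTCFullYear()) + two(now.getUTCMonth() + 1) + two(now.getUTCDate());
  const time = two(now.getUTCHours()) + two(now.getUTCMinutes()) + two(now.getUTCSeconds());
  return "wc-" + kind + "-" + date + "-" + time + "-utc-" + pid + ".jsonl";
}

// Append -<n> until the candidate name is free.
// Assumption: the suffix goes before the .jsonl extension.
function avoidCollision(name: string, exists: (candidate: string) => boolean): string {
  if (!exists(name)) return name;
  const stem = name.replace(/\.jsonl$/, "");
  for (let n = 1; ; n += 1) {
    const candidate = stem + "-" + n + ".jsonl";
    if (!exists(candidate)) return candidate;
  }
}
```

Passing the clock and the existence check in as arguments keeps the sketch deterministic; a real CLI would presumably read `process.pid` and the file system instead.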
 By default with `--debug-report`, debug lines are file-only (not mirrored to terminal).
 Use `--debug-report-tee` (alias: `--debug-tee`) to mirror to both file and `stderr`.
-Flag dependencies: `--verbose` requires `--debug`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+Flag dependencies: `--verbose` requires `--debug`; `--detector-evidence` requires `--debug` and `--detector wasm`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
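
Reviewer note: the flag dependencies listed in the line above can be read as a small validation table. The sketch below is one way to express them. `DebugFlags` and `checkFlagDependencies` are hypothetical names, the error strings are illustrative, and the real CLI may validate differently.

```typescript
// Hypothetical option shape; property names are assumptions, not the CLI's parser output.
interface DebugFlags {
  debug?: boolean;
  verbose?: boolean;
  debugReport?: boolean;
  debugReportTee?: boolean; // covers both --debug-report-tee and its alias --debug-tee
  detectorEvidence?: boolean;
  detector?: "regex" | "wasm"; // regex is the documented default
}

// Return one message per violated dependency; an empty array means the combination is valid.
function checkFlagDependencies(flags: DebugFlags): string[] {
  const errors: string[] = [];
  if (flags.verbose && !flags.debug) {
    errors.push("--verbose requires --debug");
  }
  if (flags.debugReport && !flags.debug) {
    errors.push("--debug-report requires --debug");
  }
  if (flags.debugReportTee && !flags.debugReport) {
    errors.push("--debug-report-tee requires --debug-report");
  }
  if (flags.detectorEvidence) {
    if (!flags.debug) {
      errors.push("--detector-evidence requires --debug");
    }
    if (flags.detector !== "wasm") {
      errors.push("--detector-evidence requires --detector wasm");
    }
  }
  return errors;
}
```

Collecting all violations instead of stopping at the first one lets a caller report every missing flag in a single pass.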
+Use `--detector-evidence` to add per-window detector evidence onto the same debug stream:
+- only meaningful with `--detector wasm`
+- compact mode emits bounded single-line previews plus detector decision metadata
+- verbose mode emits full raw detector windows and full normalized samples
+- evidence remains detector-window based even when output mode changes to `collector`, `char`, or another counting mode
+- fallback evidence reports the post-fallback final tag used by downstream counting output; in rare split-relabel cases it may also include `finalLocales`
 Examples:
@@ -301,17 +389,26 @@ word-counter --path ./examples/test-case-multi-files-support --debug --debug-rep
 word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl
 word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-report-tee
 word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-tee
+word-counter --detector wasm --debug --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+word-counter --detector wasm --debug --verbose --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
+word-counter --detector wasm --debug --detector-evidence --debug-report
 ```
 Skip details stay debug-gated and can be suppressed with `--quiet-skips`.
+When `--format json` is combined with `--debug`, debug-only diagnostics are emitted under `debug.*`:
+- single input and merged batch may include `debug.detector`
+- per-file batch may include `debug.skipped`, `debug.detector`, and per-entry `files[i].debug.detector`
+- per-file top-level `skipped` is still emitted temporarily for compatibility
 ## How It Works
 - The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `und-Hani`, `ja`).
 - Adjacent characters that share the same locale tag are grouped into a chunk.
 - Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
 - Per-locale counts are summed into an overall total and printed to stdout.
-- With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting.
+- With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting; unresolved `und-Latn` chunks then fall back to the existing Latin hint rules and explicit Latin hint precedence.
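
Reviewer note: the regex-path steps in the list above (tag each character by script, group adjacent characters, count each chunk with a cached `Intl.Segmenter`) can be sketched as follows. This is a minimal illustration and not the package's code: `tagForChar`, `chunkByTag`, and `countText` are hypothetical names, the script-to-tag mapping is heavily simplified, and attaching neutral characters to the preceding chunk is an assumption.

```typescript
type Chunk = { tag: string; text: string };

// Built with the RegExp constructor so Unicode property escapes are resolved at runtime.
const KANA = new RegExp("\\p{Script=Hiragana}|\\p{Script=Katakana}", "u");
const HAN = new RegExp("\\p{Script=Han}", "u");
const LATIN = new RegExp("\\p{Script=Latin}", "u");

// Simplified, assumed mapping from a character's script to a locale tag.
function tagForChar(char: string): string | null {
  if (KANA.test(char)) return "ja";
  if (HAN.test(char)) return "und-Hani";
  if (LATIN.test(char)) return "und-Latn";
  return null; // neutral: spaces, digits, punctuation
}

// Group adjacent characters that share a tag; neutral characters join the current chunk.
function chunkByTag(text: string): Chunk[] {
  const chunks: Chunk[] = [];
  for (const char of Array.from(text)) {
    const tag = tagForChar(char);
    const last = chunks[chunks.length - 1];
    if (last && (tag === null || tag === last.tag)) {
      last.text += char;
    } else {
      // Assumption: text that starts with a neutral character opens an und-Latn chunk.
      chunks.push({ tag: tag === null ? "und-Latn" : tag, text: char });
    }
  }
  return chunks;
}

// Cache one segmenter per tag to avoid re-instantiation.
const segmenters = new Map<string, Intl.Segmenter>();
function segmenterFor(tag: string): Intl.Segmenter {
  let segmenter = segmenters.get(tag);
  if (!segmenter) {
    segmenter = new Intl.Segmenter(tag, { granularity: "word" });
    segmenters.set(tag, segmenter);
  }
  return segmenter;
}

// Count word-like segments per tag and sum them into a total.
function countText(text: string): { total: number; byTag: Record<string, number> } {
  const byTag: Record<string, number> = {};
  let total = 0;
  for (const chunk of chunkByTag(text)) {
    let words = 0;
    for (const segment of Array.from(segmenterFor(chunk.tag).segment(chunk.text))) {
      if (segment.isWordLike) words += 1;
    }
    byTag[chunk.tag] = (byTag[chunk.tag] || 0) + words;
    total += words;
  }
  return { total, byTag };
}
```

The sketch covers only the default regex/script path. The WASM detector pass and the Latin hint fallback described in the last bullet would sit between chunking and counting.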
 ## Locale vs Language Code
@@ -479,6 +576,7 @@ Import from `@dev-pi2pie/word-counter/detector` for the explicit detector-enable
 | `wordCounterWithDetector`     | function | Async detector-aware counting entrypoint.       |
 | `segmentTextByLocaleWithDetector` | function | Async detector-aware locale segmentation.  |
 | `countSectionsWithDetector`   | function | Async detector-aware section counting.          |
+| `inspectTextWithDetector`     | function | Async detector-aware inspect entrypoint.        |
 | `DEFAULT_DETECTOR_MODE`       | value    | Current default detector mode (`regex`).        |
 | `DETECTOR_MODES`              | value    | Supported detector modes.                       |
@@ -696,6 +794,7 @@ Example JSON (trimmed):
 - Detection is regex/script based, not statistical language-ID.
 - Ambiguous Latin defaults to `und-Latn`; Han fallback defaults to `und-Hani`.
 - `--detector wasm` is optional and conservative; it only runs for ambiguous chunks that meet minimum script-bearing length thresholds.
+- In `--detector wasm` mode, ambiguous Latin stays on `und-Latn` for detector eligibility first, then built-in/custom Latin rules and explicit Latin hints are applied only if the detector leaves that chunk unresolved.
 - The current first WASM engine is `whatlang`, remapped into this package's public tags.
 - The npm package ships one portable WASM artifact; users do not install per-OS detector packages.
 - Use explicit tag and hint flags when you need deterministic tagging.