Mend version diff: @dev-pi2pie/word-counter (npm) 0.1.0-canary.4 → 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/README.md CHANGED

@@ -2,130 +2,116 @@
 Locale-aware word counting powered by the Web API [`Intl.Segmenter`](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter). The script automatically detects the primary writing system for each portion of the input, segments the text with matching BCP 47 locale tags, and reports word totals per locale.
-## How It Works
+## Quick Start (npx)
-- The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `zh-Hani`, `ja`).
-- Adjacent characters that share the same locale tag are grouped into a chunk.
-- Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
-- Per-locale counts are summed into a overall total and printed to stdout.
+Runtime requirement: Node.js `>=20`.
-## Locale vs Language Code
-- Output keeps the field name `locale` for compatibility.
-- In this project, locale values are BCP 47 tags and are often language/script focused (for example: `en`, `und-Latn`, `zh-Hani`) rather than region-specific tags (for example: `en-US`, `zh-TW`).
-- Default detection prefers language/script tags to avoid incorrect region assumptions.
-- You can still provide region-specific locale tags through hint flags when needed.
-## Installation
-### For Development
-Clone the repository and set up locally:
+Run without installing:
 ```bash
-git clone https://github.com/dev-pi2pie/word-counter.git
-cd word-counter
-bun install
-bun run build
-npm link
+npx @dev-pi2pie/word-counter "Hello 世界 안녕"
 ```
-After linking, you can use the `word-counter` command globally:
+Pipe stdin:
 ```bash
-word-counter "Hello 世界 안녕"
+echo "こんにちは world مرحبا" | npx @dev-pi2pie/word-counter
 ```
-To use the linked package inside another project:
+File input:
 ```bash
-npm link @dev-pi2pie/word-counter
+npx @dev-pi2pie/word-counter --path ./examples/yaml-basic.md
 ```
-To uninstall the global link:
+## Install and Usage Paths
-```bash
-npm unlink --global @dev-pi2pie/word-counter
-```
+Pick one path based on how often you use it:
-### From npm Registry (npmjs.com)
+1. One-off use: `npx @dev-pi2pie/word-counter ...` (no install, best for quick checks and CI snippets).
+2. Frequent CLI use: `npm install -g @dev-pi2pie/word-counter@latest` then run `word-counter ...`.
+3. Library use in code: `npm install @dev-pi2pie/word-counter` and import from your app/scripts.
+For local development in this repository:
 ```bash
-npm install -g @dev-pi2pie/word-counter@latest
+git clone https://github.com/dev-pi2pie/word-counter.git
+cd word-counter
+bun install
+bun run build
+npm link
 ```
-## Usage
-Once installed (via `npm link` or the npm registry), you can use the CLI directly:
+Then:
 ```bash
 word-counter "Hello 世界 안녕"
 ```
-Alternatively, run the built CLI with Node:
+To remove the global link:
 ```bash
-node dist/esm/bin.mjs "Hello 世界 안녕"
+npm unlink --global @dev-pi2pie/word-counter
 ```
-You can also pipe text:
+## CLI Usage
+Basic text:
 ```bash
-echo "こんにちは world مرحبا" | word-counter
+word-counter "Hello 世界 안녕"
 ```
-Hint a locale tag for ambiguous Latin text (ASCII-heavy content):
+Hint a language tag for ambiguous Latin text:
 ```bash
 word-counter --latin-language en "Hello world"
 word-counter --latin-tag en "Hello world"
 ```
-Hint a locale tag for Han text fallback:
+Hint a language tag for Han fallback:
 ```bash
 word-counter --han-language zh-Hant "漢字測試"
 word-counter --han-tag zh-Hans "汉字测试"
 ```
-Collect non-word segments (emoji, symbols, punctuation):
+Collect non-words (emoji/symbols/punctuation):
 ```bash
 word-counter --non-words "Hi 👋, world!"
 ```
-When enabled, `total` includes words + non-words (emoji, symbols, punctuation).
-Or read from a file:
+Override total composition:
 ```bash
-word-counter --path ./fixtures/sample.txt
+word-counter --non-words --total-of words "Hi 👋, world!"
+word-counter --total-of punctuation --format raw "Hi, world!"
+word-counter --total-of words,emoji --format json "Hi 👋, world!"
 ```
-`--path` accepts any readable text-like file, including empty or whitespace-only files.
-Such files are treated as valid inputs and contribute zero words by default.
-### Batch Counting
+## Batch Counting (`--path`)
-Process multiple files by repeating `--path`:
+Repeat `--path` for mixed inputs (files and/or directories):
 ```bash
-word-counter --path ./docs/a.md --path ./docs/b.txt
+word-counter --path ./docs/a.md --path ./docs --path ./notes.txt
 ```
-Pass a directory path to scan files recursively (default):
+Directory scans are recursive by default:
 ```bash
 word-counter --path ./examples/test-case-multi-files-support
+word-counter --path ./examples/test-case-multi-files-support --no-recursive
 ```
-Show per-file results plus merged summary:
+Show per-file plus merged summary:
 ```bash
 word-counter --path ./examples/test-case-multi-files-support --per-file
 ```
-Batch progress is auto-enabled for multi-file standard output and is transient:
+Progress behavior in standard batch mode:
 ```bash
 word-counter --path ./examples/test-case-multi-files-support
@@ -133,33 +119,83 @@ word-counter --path ./examples/test-case-multi-files-support --no-progress
 word-counter --path ./examples/test-case-multi-files-support --keep-progress
 ```
-Progress updates follow this style while running:
+Progress is transient by default, auto-disabled for single-input runs, and suppressed in `--format raw` and `--format json`.
-```text
-Counting files [██████░░░░░░░░░░░░]  31%  37/120 elapsed 00:01.2
-Counting files [███████████░░░░░░░]  58%  70/120 elapsed 00:02.8
-Counting files [████████████████████] 100% 120/120 elapsed 00:04.1
-```
+### Stable Path Resolution Contract (`#26`)
+- Repeated `--path` values are accepted as mixed inputs (file + directory).
+- In `--path-mode auto` (default), directory inputs are expanded to files (recursive unless `--no-recursive`).
+- In `--path-mode manual`, directory inputs are not expanded and are skipped as non-regular files.
+- Extension filters apply only to files discovered from directory expansion.
+- Direct file inputs are always considered regardless of `--include-ext` / `--exclude-ext`.
+- Overlap dedupe is by resolved absolute file path.
+- If the same file is discovered multiple ways (repeated roots, nested roots, explicit file + directory), it is counted once.
+- Final processing order is deterministic: resolved files are sorted by absolute path ascending before load/count.
-Single-input runs do not show progress by default. Progress is also suppressed in `--format raw` and `--format json`.
-Use `--keep-progress` when you want the final progress line to stay visible after completion.
+### Extension Filters
-Restrict directory scanning extensions:
+Use include/exclude filters for directory scans:
 ```bash
 word-counter --path ./examples/test-case-multi-files-support --include-ext .md,.mdx
 word-counter --path ./examples/test-case-multi-files-support --include-ext .md,.txt --exclude-ext .txt
 ```
-Skip diagnostics are debug-gated. By default, skipped-file details are hidden.
-Use `--debug` to print skipped-file diagnostics to `stderr`:
+Direct file path example (filters do not block explicit file inputs):
+```bash
+word-counter --path ./examples/test-case-multi-files-support/ignored.js --include-ext .md --exclude-ext .md
+```
+### Debugging Diagnostics (`--debug`)
+`--debug` remains the diagnostics gate and now defaults to `compact` event volume:
+- lifecycle/stage timing events
+- resolved/skipped summary events
+- dedupe/filter summary counts
+Use `--verbose` to include per-file/per-path events:
+```bash
+word-counter --path ./examples/test-case-multi-files-support --debug --verbose
+```
+Use `--debug-report [path]` to route debug diagnostics to a JSONL report file:
+- no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-<pid>.jsonl`
+- path provided: writes to the specified location
+- default-name collision handling: appends `-<n>` suffix to avoid overwriting existing files
+- explicit path validation: existing directories are rejected (explicit paths are treated as file targets)
+By default with `--debug-report`, debug lines are file-only (not mirrored to terminal).
+Use `--debug-report-tee` (alias: `--debug-tee`) to mirror to both file and `stderr`.
+Flag dependencies: `--verbose` requires `--debug`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+Examples:
 ```bash
-word-counter --path ./examples/test-case-multi-files-support --debug
+word-counter --path ./examples/test-case-multi-files-support --debug --debug-report
+word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl
+word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-report-tee
+word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-tee
 ```
-With `--debug`, batch resolution/progress lifecycle diagnostics are emitted as structured `[debug]` entries on `stderr` (stdout remains clean).
-In `--debug` mode, the final progress line is kept visible (not auto-cleared).
+Skip details stay debug-gated and can still be suppressed with `--quiet-skips`.
+## How It Works
+- The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `zh-Hani`, `ja`).
+- Adjacent characters that share the same locale tag are grouped into a chunk.
+- Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
+- Per-locale counts are summed into an overall total and printed to stdout.
+## Locale vs Language Code
+- Output keeps the field name `locale` for compatibility.
+- In this project, locale values are BCP 47 tags and are often language/script focused (for example: `en`, `und-Latn`, `zh-Hani`) rather than region-specific tags (for example: `en-US`, `zh-TW`).
+- Default detection prefers language/script tags to avoid incorrect region assumptions.
+- You can still provide region-specific locale tags through hint flags when needed.
 ## Library Usage
@@ -182,6 +218,7 @@ wordCounter("Hello world", { latinTagHint: "en" });
 wordCounter("漢字測試", { hanTagHint: "zh-Hant" });
 wordCounter("Hi 👋, world!", { nonWords: true });
 wordCounter("Hi 👋, world!", { mode: "char", nonWords: true });
+wordCounter("飛鳥 bird 貓 cat", { mode: "char-collector" });
 wordCounter("Hi\tthere\n", { nonWords: true, includeWhitespace: true });
 countCharsForLocale("👋", "en");
 ```
@@ -231,6 +268,7 @@ wordCounter("Hello world", { latinTagHint: "en" });
 wordCounter("漢字測試", { hanTagHint: "zh-Hant" });
 wordCounter("Hi 👋, world!", { nonWords: true });
 wordCounter("Hi 👋, world!", { mode: "char", nonWords: true });
+wordCounter("飛鳥 bird 貓 cat", { mode: "char-collector" });
 wordCounter("Hi\tthere\n", { nonWords: true, includeWhitespace: true });
 countCharsForLocale("👋", "en");
 ```
@@ -294,7 +332,7 @@ Sample output (with `nonWords: true` and `includeWhitespace: true`):
 | `WordCounterOptions`   | type | Options for the `wordCounter` function.           |
 | `WordCounterResult`    | type | Returned by `wordCounter`.                        |
 | `WordCounterBreakdown` | type | Breakdown payload in `WordCounterResult`.         |
-| `WordCounterMode`      | type | `"chunk" \| "segments" \| "collector" \| "char"`. |
+| `WordCounterMode`      | type | `"chunk" \| "segments" \| "collector" \| "char" \| "char-collector"`. |
 | `NonWordCollection`    | type | Non-word segments + counts payload.               |
 ### Display Modes
@@ -306,6 +344,7 @@ Choose a breakdown style with `--mode` (or `-m`):
 - `collector` – aggregate counts per locale regardless of text position.
   Keeps per-locale segment lists in memory, so very large corpora can use noticeably more memory than `chunk` mode.
 - `char` – count grapheme clusters (user-perceived characters) per locale.
+- `char-collector` – aggregate grapheme-cluster counts per locale (collector-style char mode).
 Aliases are normalized for CLI + API:
@@ -313,6 +352,7 @@ Aliases are normalized for CLI + API:
 - `segments`, `segment`, `seg`
 - `collector`, `collect`, `colle`
 - `char`, `chars`, `character`, `characters`
+- `char-collector`, `charcollector`, `char-collect`, `collector-char`, `characters-collector`, `colchar`, `charcol`, `char-col`, `char-colle`
 Examples:
@@ -328,6 +368,9 @@ word-counter -m collector "飛鳥 bird 貓 cat; how do you do?"
 # grapheme-aware character count
 word-counter -m char "Hi 👋, world!"
+# aggregate grapheme-aware character counts per locale
+word-counter -m char-collector "飛鳥 bird 貓 cat; how do you do?"
 ```
 ### Section Modes (Frontmatter)
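
The "Stable Path Resolution Contract" bullets added to the README in this diff can be read as a small pure function. The following is only a sketch under stated assumptions, not the package's implementation: `resolveBatchFiles` and the `{ file, source }` input shape are hypothetical, directory expansion is assumed to have already happened, exclude is assumed to win over include, and "ascending" is assumed to mean plain string order.

```javascript
const path = require("node:path");

// Hypothetical helper (not part of @dev-pi2pie/word-counter).
// `inputs` are candidate files tagged with how they were found:
//   "direct"    = passed explicitly via --path
//   "directory" = discovered by expanding a directory passed via --path
function resolveBatchFiles(inputs, options = {}) {
  const { includeExt = [], excludeExt = [], cwd = process.cwd() } = options;
  const unique = new Set();

  for (const { file, source } of inputs) {
    const absolute = path.resolve(cwd, file);

    // Extension filters apply only to files found through directory expansion;
    // direct file inputs bypass them.
    if (source === "directory") {
      const ext = path.extname(absolute).toLowerCase();
      if (includeExt.length > 0 && !includeExt.includes(ext)) continue;
      if (excludeExt.includes(ext)) continue; // assumption: exclude wins over include
    }

    // Dedupe by resolved absolute path, so a file reached several ways counts once.
    unique.add(absolute);
  }

  // Deterministic order: sort absolute paths ascending (assumed plain string order).
  return [...unique].sort();
}
```

Keeping the function free of file-system calls makes the three rules (filter scope, dedupe key, final ordering) easy to check in isolation; a real resolver would also need to walk directories and skip non-regular files.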

package/dist/cjs/index.cjs CHANGED

@@ -208,6 +208,32 @@ function analyzeCharChunk(chunk, collectNonWords, includeWhitespace) {
 		nonWords: nonWords ?? void 0
 	};
 }
+function aggregateCharsByLocale(chunks) {
+	const order = [];
+	const map = /* @__PURE__ */ new Map();
+	for (const chunk of chunks) {
+		const existing = map.get(chunk.locale);
+		if (existing) {
+			existing.chars += chunk.chars;
+			existing.wordChars += chunk.wordChars;
+			existing.nonWordChars += chunk.nonWordChars;
+			if (chunk.nonWords) {
+				if (!existing.nonWords) existing.nonWords = createNonWordCollection();
+				mergeNonWordCollections(existing.nonWords, chunk.nonWords);
+			}
+			continue;
+		}
+		order.push(chunk.locale);
+		map.set(chunk.locale, {
+			locale: chunk.locale,
+			chars: chunk.chars,
+			wordChars: chunk.wordChars,
+			nonWordChars: chunk.nonWordChars,
+			nonWords: chunk.nonWords ? mergeNonWordCollections(createNonWordCollection(), chunk.nonWords) : void 0
+		});
+	}
+	return order.map((locale) => map.get(locale));
+}
 function aggregateByLocale(chunks) {
 	const order = [];
 	const map = /* @__PURE__ */ new Map();
@@ -242,11 +268,55 @@ const MODE_ALIASES = {
 	char: "char",
 	chars: "char",
 	character: "char",
-	characters: "char"
+	characters: "char",
+	"char-collector": "char-collector"
 };
+const CHAR_MODE_ALIASES = new Set([
+	"char",
+	"chars",
+	"character",
+	"characters"
+]);
+const COLLECTOR_MODE_ALIASES = new Set([
+	"collector",
+	"collect",
+	"colle",
+	"col"
+]);
+function collapseSeparators(value) {
+	return value.replace(/[-_\s]+/g, "");
+}
+function isComposedCharCollectorFromTokens(value) {
+	const tokens = value.split(/[-_\s]+/).map((token) => token.trim()).filter((token) => token.length > 0);
+	if (tokens.length < 2) return false;
+	let hasCharAlias = false;
+	let hasCollectorAlias = false;
+	for (const token of tokens) {
+		if (CHAR_MODE_ALIASES.has(token)) {
+			hasCharAlias = true;
+			continue;
+		}
+		if (COLLECTOR_MODE_ALIASES.has(token)) {
+			hasCollectorAlias = true;
+			continue;
+		}
+		return false;
+	}
+	return hasCharAlias && hasCollectorAlias;
+}
+function isComposedCharCollectorCompact(value) {
+	for (const charAlias of CHAR_MODE_ALIASES) for (const collectorAlias of COLLECTOR_MODE_ALIASES) if (value === `${charAlias}${collectorAlias}` || value === `${collectorAlias}${charAlias}`) return true;
+	return false;
+}
 function normalizeMode(input) {
 	if (!input) return null;
-	return MODE_ALIASES[input.trim().toLowerCase()] ?? null;
+	const normalized = input.trim().toLowerCase();
+	const direct = MODE_ALIASES[normalized];
+	if (direct) return direct;
+	if (isComposedCharCollectorFromTokens(normalized)) return "char-collector";
+	const compact = collapseSeparators(normalized);
+	if (isComposedCharCollectorCompact(compact)) return "char-collector";
+	return MODE_ALIASES[compact] ?? null;
 }
 function resolveMode(input, fallback = "chunk") {
 	return normalizeMode(input) ?? fallback;
@@ -408,25 +478,37 @@ function wordCounter(text, options = {}) {
 		hanLanguageHint: options.hanLanguageHint,
 		hanTagHint: options.hanTagHint
 	});
-	if (mode === "char") {
-		const analyzed$1 = chunks.map((chunk) => analyzeCharChunk(chunk, collectNonWords, includeWhitespace));
-		const total$1 = analyzed$1.reduce((sum, chunk) => sum + chunk.chars, 0);
-		const items = analyzed$1.map((chunk) => ({
-			locale: chunk.locale,
-			text: chunk.text,
-			chars: chunk.chars,
-			nonWords: chunk.nonWords
-		}));
+	if (mode === "char" || mode === "char-collector") {
+		const analyzed = chunks.map((chunk) => analyzeCharChunk(chunk, collectNonWords, includeWhitespace));
+		const total = analyzed.reduce((sum, chunk) => sum + chunk.chars, 0);
+		const counts = collectNonWords ? {
+			words: analyzed.reduce((sum, chunk) => sum + chunk.wordChars, 0),
+			nonWords: analyzed.reduce((sum, chunk) => sum + chunk.nonWordChars, 0),
+			total
+		} : void 0;
+		if (mode === "char") return {
+			total,
+			counts,
+			breakdown: {
+				mode,
+				items: analyzed.map((chunk) => ({
+					locale: chunk.locale,
+					text: chunk.text,
+					chars: chunk.chars,
+					nonWords: chunk.nonWords
+				}))
+			}
+		};
 		return {
-			total: total$1,
-			counts: collectNonWords ? {
-				words: analyzed$1.reduce((sum, chunk) => sum + chunk.wordChars, 0),
-				nonWords: analyzed$1.reduce((sum, chunk) => sum + chunk.nonWordChars, 0),
-				total: total$1
-			} : void 0,
+			total,
+			counts,
 			breakdown: {
 				mode,
-				items
+				items: aggregateCharsByLocale(analyzed).map((chunk) => ({
+					locale: chunk.locale,
+					chars: chunk.chars,
+					nonWords: chunk.nonWords
+				}))
 			}
 		};
 	}
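
The `char-collector` branch in the hunk above folds per-chunk character counts into one entry per locale while keeping first-seen locale order. A reduced sketch of that idea follows; `sumCharsByLocale` is a hypothetical name, the sample chunks are made up for illustration, and the non-word bookkeeping (`wordChars`, `nonWordChars`, `nonWords`) is deliberately left out.

```javascript
// Hypothetical, simplified stand-in for the aggregation step shown in the diff.
function sumCharsByLocale(chunks) {
  // A Map iterates in insertion order, and updating an existing key keeps its
  // original position, so locales come out in first-seen order.
  const totals = new Map();
  for (const { locale, chars } of chunks) {
    totals.set(locale, (totals.get(locale) ?? 0) + chars);
  }
  return [...totals].map(([locale, chars]) => ({ locale, chars }));
}

// Made-up sample chunks; these are not real output from the library.
const sampleChunks = [
  { locale: "zh-Hani", chars: 2 },
  { locale: "und-Latn", chars: 4 },
  { locale: "zh-Hani", chars: 1 },
  { locale: "und-Latn", chars: 3 },
];

console.log(sumCharsByLocale(sampleChunks));
```

The bundled code tracks order with a separate `order` array next to its `Map`; relying on the Map's own insertion order, as here, is an equivalent shortcut for this reduced case.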
@@ -796,7 +878,7 @@ function parseTomlFrontmatter(frontmatter) {
 					index += 1;
 					const nextLine = lines[index] ?? "";
 					combined += `\n${nextLine}`;
-					if ((/* @__PURE__ */ new RegExp(`${delimiter}\\s*$`)).test(nextLine)) {
+					if (new RegExp(`${delimiter}\\s*$`).test(nextLine)) {
 						closed = true;
 						break;
 					}
@@ -924,10 +1006,10 @@ function parseMarkdown(input) {
 			data: null,
 			frontmatterType: null
 		};
-		const frontmatter$1 = jsonBlock.jsonText;
+		const frontmatter = jsonBlock.jsonText;
 		let content = normalizedWithoutBom.slice(jsonBlock.endIndex + 1);
 		if (content.startsWith("\n")) content = content.slice(1);
-		const data = parseFrontmatter(frontmatter$1, "json");
+		const data = parseFrontmatter(frontmatter, "json");
 		if (!data) return {
 			frontmatter: null,
 			content: normalizedWithoutBom,
@@ -935,7 +1017,7 @@ function parseMarkdown(input) {
 			frontmatterType: null
 		};
 		return {
-			frontmatter: frontmatter$1,
+			frontmatter,
 			content,
 			data,
 			frontmatterType: "json"