khmer-segment 0.2.0 → 0.2.1

Files changed (2)
  1. package/README.md +7 -21
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,5 +1,3 @@
-
-
  # khmer-segment
 
  A framework-agnostic Khmer text processing library for JavaScript and TypeScript.
@@ -23,6 +21,7 @@ npm install khmer-segment
  ```ts
  import {
  containsKhmer,
+ isKhmerText,
  normalizeKhmer,
  splitClusters,
  countClusters,
@@ -59,41 +58,33 @@ console.log(result.tokens);
 
  ### Detection
 
-
  | Function | Description |
  | --------------------- | --------------------------------------------------------- |
  | `isKhmerChar(char)` | Returns `true` if the character is a Khmer code point |
  | `containsKhmer(text)` | Returns `true` if the text contains any Khmer characters |
  | `isKhmerText(text)` | Returns `true` if all non-whitespace characters are Khmer |
 
-
  ### Normalization
 
-
  | Function | Description |
  | -------------------------------- | ------------------------------------------------------------------------------------------ |
  | `normalizeKhmer(text)` | Reorders Khmer characters into canonical order (base → coeng → shift signs → vowel → sign) |
  | `normalizeKhmerCluster(cluster)` | Normalizes a single cluster |
 
-
  ### Cluster Utilities
 
-
  | Function | Description |
  | ---------------------------- | ------------------------------------------------- |
  | `splitClusters(text)` | Splits text into Khmer-safe grapheme clusters |
  | `countClusters(text)` | Returns the number of clusters in the text |
  | `getClusterBoundaries(text)` | Returns `{ start, end }` offsets for each cluster |
 
-
  ### Segmentation
 
-
  | Function | Description |
  | ------------------------------ | -------------------------------------------------------------- |
  | `segmentWords(text, options?)` | Segments text into word tokens using dictionary-based matching |
 
-
  #### `SegmentOptions`
 
  ```ts
@@ -123,12 +114,10 @@ interface SegmentToken {
 
  ### Dictionary
 
-
  | Function | Description |
  | --------------------------------------- | ------------------------------------------------ |
  | `createDictionary(words, frequencies?)` | Creates an in-memory dictionary from a word list |
 
-
  ```ts
  const dict = createDictionary(['សួស្តី', 'អ្នក', 'ខ្មែរ']);
 
@@ -175,7 +164,7 @@ console.log(freqData.words.length); // 49113
  console.log(freqData.frequencies.get('ជា')); // 701541
  ```
 
- This is a **separate import** — the core `khmer-segment` package stays small (~8KB). Only import the dictionary when you need it.
+ This is a **separate import** — the core `khmer-segment` package stays small (~11KB). The dictionary module is ~3.9MB. Only import the dictionary when you need it.
 
  ---
 
@@ -242,7 +231,7 @@ const result = segmentWords('កខគ');
 
  ## Dictionary Strategy
 
- The library ships a **separate optional dictionary** via `khmer-segment/dictionary` with 49,113 Khmer words. This keeps the core package small (~8KB).
+ The library ships a **separate optional dictionary** via `khmer-segment/dictionary` with 49,113 Khmer words. This keeps the core package small (~11KB).
 
  Options:
 
@@ -272,7 +261,6 @@ const dict = createDictionary([...words, 'custom_word'], frequencies);
 
  ## Framework Compatibility
 
-
  | Environment | Support |
  | ------------------- | ------- |
  | Node.js (ESM + CJS) | Yes |
@@ -282,7 +270,6 @@ const dict = createDictionary([...words, 'custom_word'], frequencies);
  | Angular | Yes |
  | Vue | Yes |
 
-
  No framework-specific code in the core. Tree-shakeable with `sideEffects: false`.
 
  ---
@@ -316,8 +303,7 @@ No framework-specific code in the core. Tree-shakeable with `sideEffects: false`
  - Fixed normalization for MUUSIKATOAN (៉) and TRIISAP (៊) — shift signs now placed before vowels
  - Fixed Unicode range constants (NIKAHIT, REAHMUK, YUUKEALAKHMOU are signs, not vowels)
  - 149 tests
- - `compareTyping(expected, actual)` for MonkeyType-like apps
- - Better token metadata (`isKhmer`, `clusterCount`)
+ - Rebuilt dictionary with 49,113 words (merged from 10 sources)
 
  ### v0.3.0
 
@@ -352,7 +338,7 @@ npm run lint # TypeScript type check
  ### Automated Tests
 
  ```bash
- npm test # run 98 tests with vitest
+ npm test # run 149 tests with vitest
  npm run test:watch # watch mode — re-runs on changes
  npm run lint # TypeScript type check
  ```
@@ -385,10 +371,10 @@ Features:
  - **[Word Segmentation of Khmer Text Using Conditional Random Fields](https://medium.com/@phylypo/segmentation-of-khmer-text-using-conditional-random-fields-3a2d4d73956a)** — Phylypo Tum (2019). Comprehensive overview of Khmer segmentation approaches from dictionary-based to CRF, achieving 99.7% accuracy with Linear Chain CRF.
  - **[Khmer Word Segmentation Using Conditional Random Fields](https://www.niptict.edu.kh/khmer-word-segmentation-tool/)** — Vichea Chea, Ye Kyaw Thu, et al. (2015). The prior state-of-the-art CRF model for Khmer segmentation (98.5% accuracy, 5-tag system).
  - **[Benchmark dataset and Python notebooks](https://github.com/phylypo/segmentation-crf-khmer)** — 10K+ segmented Khmer news articles useful for evaluating segmentation quality.
- - **[khmerlbdict](https://github.com/silnrsi/khmerlbdict)** — Source of the default dictionary used by this library (MIT license, 34K+ words).
+ - **[khmerlbdict](https://github.com/silnrsi/khmerlbdict)** — Source of the default dictionary used by this library (MIT license). Merged with Royal Academy of Cambodia's Khmer Dictionary for a total of 49,113 words.
 
  ---
 
  ## License
 
- MIT
+ MIT
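
Note on the README change: 0.2.1 adds `isKhmerText` to the import example, and the Detection table describes it as returning `true` if all non-whitespace characters are Khmer. The sketch below is only a rough illustration of that description next to `containsKhmer`, not the package's source. The names `isKhmerCodePoint`, `containsKhmerSketch` and `isKhmerTextSketch` are hypothetical. The range check assumes only the main Unicode Khmer block (U+1780 to U+17FF) counts as Khmer, and returning `false` for empty or whitespace-only input is likewise an assumption; the library may differ on both points.

```typescript
// Hypothetical stand-ins for illustration; not the khmer-segment implementation.

// Assumption: "Khmer" means the main Unicode Khmer block, U+1780..U+17FF.
const isKhmerCodePoint = (cp: number): boolean => cp >= 0x1780 && cp <= 0x17ff;

// True as soon as any code point in the text falls in the Khmer block.
function containsKhmerSketch(text: string): boolean {
  for (const ch of text) {
    if (isKhmerCodePoint(ch.codePointAt(0)!)) return true;
  }
  return false;
}

// True only if every non-whitespace code point is Khmer.
// Assumption: empty or whitespace-only input yields false.
function isKhmerTextSketch(text: string): boolean {
  let sawKhmer = false;
  for (const ch of text) {
    if (/\s/.test(ch)) continue; // skip whitespace as matched by \s
    if (!isKhmerCodePoint(ch.codePointAt(0)!)) return false;
    sawKhmer = true;
  }
  return sawKhmer;
}

console.log(containsKhmerSketch('hello ខ្មែរ')); // mixed text contains Khmer
console.log(isKhmerTextSketch('hello ខ្មែរ')); // but is not all Khmer
console.log(isKhmerTextSketch('សួស្តី អ្នក')); // Khmer words separated by a space
```

One caveat with this sketch: JavaScript's `\s` does not match the zero-width space (U+200B), which Khmer text often uses as a word separator, so a string containing it would be rejected here unless that character is handled explicitly.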
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "khmer-segment",
- "version": "0.2.0",
+ "version": "0.2.1",
  "description": "Khmer text segmentation, normalization, and cluster utilities for JavaScript and TypeScript.",
  "type": "module",
  "main": "./dist/index.cjs",