khmer-segment 0.2.0 → 0.2.1
- package/README.md +7 -21
- package/package.json +1 -1
package/README.md
CHANGED
@@ -1,5 +1,3 @@
-
-
 # khmer-segment
 
 A framework-agnostic Khmer text processing library for JavaScript and TypeScript.
@@ -23,6 +21,7 @@ npm install khmer-segment
 ```ts
 import {
   containsKhmer,
+  isKhmerText,
   normalizeKhmer,
   splitClusters,
   countClusters,
@@ -59,41 +58,33 @@ console.log(result.tokens);
 
 ### Detection
 
-
 | Function | Description |
 | --------------------- | --------------------------------------------------------- |
 | `isKhmerChar(char)` | Returns `true` if the character is a Khmer code point |
 | `containsKhmer(text)` | Returns `true` if the text contains any Khmer characters |
 | `isKhmerText(text)` | Returns `true` if all non-whitespace characters are Khmer |
 
-
 ### Normalization
 
-
 | Function | Description |
 | -------------------------------- | ------------------------------------------------------------------------------------------ |
 | `normalizeKhmer(text)` | Reorders Khmer characters into canonical order (base → coeng → shift signs → vowel → sign) |
 | `normalizeKhmerCluster(cluster)` | Normalizes a single cluster |
 
-
 ### Cluster Utilities
 
-
 | Function | Description |
 | ---------------------------- | ------------------------------------------------- |
 | `splitClusters(text)` | Splits text into Khmer-safe grapheme clusters |
 | `countClusters(text)` | Returns the number of clusters in the text |
 | `getClusterBoundaries(text)` | Returns `{ start, end }` offsets for each cluster |
 
-
 ### Segmentation
 
-
 | Function | Description |
 | ------------------------------ | -------------------------------------------------------------- |
 | `segmentWords(text, options?)` | Segments text into word tokens using dictionary-based matching |
 
-
 #### `SegmentOptions`
 
 ```ts
@@ -123,12 +114,10 @@ interface SegmentToken {
 
 ### Dictionary
 
-
 | Function | Description |
 | --------------------------------------- | ------------------------------------------------ |
 | `createDictionary(words, frequencies?)` | Creates an in-memory dictionary from a word list |
 
-
 ```ts
 const dict = createDictionary(['សួស្តី', 'អ្នក', 'ខ្មែរ']);
 
@@ -175,7 +164,7 @@ console.log(freqData.words.length); // 49113
 console.log(freqData.frequencies.get('ជា')); // 701541
 ```
 
-This is a **separate import** — the core `khmer-segment` package stays small (~
+This is a **separate import** — the core `khmer-segment` package stays small (~11KB). The dictionary module is ~3.9MB. Only import the dictionary when you need it.
 
 ---
 
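The hunks above concern `createDictionary` and the optional `khmer-segment/dictionary` module, which the README says feed "dictionary-based matching" in `segmentWords`. As a rough illustration of what such matching can look like, here is a minimal sketch of greedy longest-match segmentation. It is not the library's implementation: the README does not say which strategy `segmentWords` uses, `longestMatchSegment` is a hypothetical name, and the sketch walks UTF-16 code units rather than the cluster boundaries a real Khmer segmenter would respect.

```typescript
// Hypothetical sketch; not part of khmer-segment.
// Greedy longest-match: at each position, take the longest dictionary word
// that starts there; if none matches, emit one code point as an unknown token.
function longestMatchSegment(text: string, words: string[]): string[] {
  const dictionary = new Set(words);

  // Longest word length bounds how far ahead we need to look.
  let maxLen = 0;
  for (let k = 0; k < words.length; k++) {
    maxLen = Math.max(maxLen, words[k].length);
  }

  const tokens: string[] = [];
  let i = 0;
  while (i < text.length) {
    let matched = "";
    const limit = Math.min(maxLen, text.length - i);

    // Try the longest candidate first, then shrink.
    for (let len = limit; len >= 1; len--) {
      const candidate = text.slice(i, i + len);
      if (dictionary.has(candidate)) {
        matched = candidate;
        break;
      }
    }

    if (matched === "") {
      // No dictionary word starts here: fall back to a single code point
      // so that surrogate pairs are never split.
      matched = String.fromCodePoint(text.codePointAt(i)!);
    }

    tokens.push(matched);
    i += matched.length;
  }
  return tokens;
}

// Usage with the word list from the README's createDictionary example.
const sampleWords = ['សួស្តី', 'អ្នក', 'ខ្មែរ'];
console.log(longestMatchSegment('សួស្តី' + 'អ្នក', sampleWords)); // ['សួស្តី', 'អ្នក']
```

Greedy matching is simple but can pick a long word that forces a poor split later; the `frequencies` argument accepted by `createDictionary` suggests the library can weigh candidates, though how it does so is not shown in this diff.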
@@ -242,7 +231,7 @@ const result = segmentWords('កខគ');
 
 ## Dictionary Strategy
 
-The library ships a **separate optional dictionary** via `khmer-segment/dictionary` with 49,113 Khmer words. This keeps the core package small (~
+The library ships a **separate optional dictionary** via `khmer-segment/dictionary` with 49,113 Khmer words. This keeps the core package small (~11KB).
 
 Options:
 
@@ -272,7 +261,6 @@ const dict = createDictionary([...words, 'custom_word'], frequencies);
 
 ## Framework Compatibility
 
-
 | Environment | Support |
 | ------------------- | ------- |
 | Node.js (ESM + CJS) | Yes |
@@ -282,7 +270,6 @@ const dict = createDictionary([...words, 'custom_word'], frequencies);
 | Angular | Yes |
 | Vue | Yes |
 
-
 No framework-specific code in the core. Tree-shakeable with `sideEffects: false`.
 
 ---
@@ -316,8 +303,7 @@ No framework-specific code in the core. Tree-shakeable with `sideEffects: false`
 - Fixed normalization for MUUSIKATOAN (៉) and TRIISAP (៊) — shift signs now placed before vowels
 - Fixed Unicode range constants (NIKAHIT, REAHMUK, YUUKEALAKHMOU are signs, not vowels)
 - 149 tests
--
-- Better token metadata (`isKhmer`, `clusterCount`)
+- Rebuilt dictionary with 49,113 words (merged from 10 sources)
 
 ### v0.3.0
 
@@ -352,7 +338,7 @@ npm run lint # TypeScript type check
 ### Automated Tests
 
 ```bash
-npm test # run
+npm test # run 149 tests with vitest
 npm run test:watch # watch mode — re-runs on changes
 npm run lint # TypeScript type check
 ```
@@ -385,10 +371,10 @@ Features:
 - **[Word Segmentation of Khmer Text Using Conditional Random Fields](https://medium.com/@phylypo/segmentation-of-khmer-text-using-conditional-random-fields-3a2d4d73956a)** — Phylypo Tum (2019). Comprehensive overview of Khmer segmentation approaches from dictionary-based to CRF, achieving 99.7% accuracy with Linear Chain CRF.
 - **[Khmer Word Segmentation Using Conditional Random Fields](https://www.niptict.edu.kh/khmer-word-segmentation-tool/)** — Vichea Chea, Ye Kyaw Thu, et al. (2015). The prior state-of-the-art CRF model for Khmer segmentation (98.5% accuracy, 5-tag system).
 - **[Benchmark dataset and Python notebooks](https://github.com/phylypo/segmentation-crf-khmer)** — 10K+ segmented Khmer news articles useful for evaluating segmentation quality.
-- **[khmerlbdict](https://github.com/silnrsi/khmerlbdict)** — Source of the default dictionary used by this library (MIT license,
+- **[khmerlbdict](https://github.com/silnrsi/khmerlbdict)** — Source of the default dictionary used by this library (MIT license). Merged with Royal Academy of Cambodia's Khmer Dictionary for a total of 49,113 words.
 
 ---
 
 ## License
 
-MIT
+MIT