flappa-doormal 2.0.0 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +325 -0
- package/README.md +77 -1
- package/dist/index.d.mts +155 -1
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +267 -16
- package/dist/index.mjs.map +1 -1
- package/package.json +3 -2
package/AGENTS.md
ADDED
@@ -0,0 +1,325 @@

# AGENTS.md

## Project Overview

**flappa-doormal** is a declarative text segmentation library for Arabic texts. It splits pages of content into logical segments (hadiths, chapters, verses) using pattern-based rules with a human-readable template syntax.

### Why This Library Exists

Traditional Arabic text segmentation requires:
- Hand-crafted regular expressions with complex Unicode patterns
- Deep knowledge of Arabic diacritics and character equivalences
- Handling page boundaries and multi-page content spans
- Manual capture group management for metadata extraction

**flappa-doormal** solves these problems with:
- **Declarative rules**: Describe *what* to match, not *how* to match it
- **Template tokens**: `{{raqms}} {{dash}}` instead of `[\u0660-\u0669]+\s*[-–—ـ]`
- **Named captures**: `{{raqms:hadithNum}}` → automatic `meta.hadithNum` extraction
- **Fuzzy matching**: Diacritic-insensitive matching for harakat variations
- **Page-aware**: Tracks which page each segment comes from (see the usage sketch below)

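A minimal usage sketch of that declarative flow (the page text, rule, and `meta` values here are illustrative; the authoritative option shapes live in `segmentation/types.ts`):

```typescript
import { segmentPages } from 'flappa-doormal';

// Hypothetical two-page input; real callers pass book pages as { id, content }.
const pages = [
    { id: 1, content: '٣٤ - حدثنا قتيبة عن مالك...' },
    { id: 2, content: '٣٥ - حدثنا علي بن عبد الله...' },
];

const segments = segmentPages(pages, {
    rules: [
        // "{{raqms:num}} {{dash}} " reads as: an Arabic-Indic number (captured
        // as meta.num), a dash, then the hadith text that follows.
        { lineStartsAfter: ['{{raqms:num}} {{dash}} '], meta: { type: 'hadith' } },
    ],
});

// Each segment records its source page and captured metadata, e.g.
// { content: 'حدثنا قتيبة عن مالك...', from: 1, meta: { num: '٣٤', type: 'hadith' } }
console.log(segments);
```
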
## Architecture

### Repository Structure

```text
src/
├── index.ts                      # Main entry point and exports
├── pattern-detection.ts          # Token detection for auto-generating rules (NEW)
├── pattern-detection.test.ts     # Pattern detection tests (22 tests)
└── segmentation/
    ├── types.ts                  # TypeScript type definitions for rules/segments
    ├── segmenter.ts              # Core segmentation engine (segmentPages, applyBreakpoints)
    ├── breakpoint-utils.ts       # Extracted breakpoint processing utilities
    ├── tokens.ts                 # Token definitions and expansion logic
    ├── fuzzy.ts                  # Diacritic-insensitive matching utilities
    ├── html.ts                   # HTML utilities (stripHtmlTags)
    ├── textUtils.ts              # Text processing utilities
    ├── match-utils.ts            # Extracted match processing utilities
    ├── segmenter.test.ts         # Core test suite (150+ tests including breakpoints)
    ├── segmenter.bukhari.test.ts # Real-world test cases
    ├── breakpoint-utils.test.ts  # Breakpoint utility tests (42 tests)
    ├── tokens.test.ts            # Token expansion tests
    ├── fuzzy.test.ts             # Fuzzy matching tests
    ├── textUtils.test.ts         # Text utility tests
    └── match-utils.test.ts       # Utility function tests

test/
├── 2576.json                     # Test data for book 2576 (Sahih Bukhari)
└── 2588.json                     # Test data for book 2588 (Al-Mughni)

docs/
├── checkpoints/                  # AI agent handoff documentation
│   └── 2025-12-09-handoff.md
└── reviews/                      # Performance analysis reports
    └── 2025-12-10/
```

### Core Components

1. **`segmentPages(pages, options)`** - Main entry point
   - Takes array of `{id, content}` pages and split rules
   - Returns array of `{content, from, to?, meta?}` segments (sketched after this list)

2. **`tokens.ts`** - Template system
   - `TOKEN_PATTERNS` - Map of token names to regex patterns
   - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
   - Supports fuzzy transform for diacritic-insensitive matching

3. **`match-utils.ts`** - Extracted utilities (for testability)
   - `extractNamedCaptures()` - Get named groups from regex match
   - `filterByConstraints()` - Apply min/max page filters
   - `anyRuleAllowsId()` - Check if page passes rule constraints

4. **`breakpoint-utils.ts`** - Breakpoint processing utilities (NEW)
   - `normalizeBreakpoint()` - Convert string to BreakpointRule object
   - `isPageExcluded()` - Check if page is in exclude list
   - `isInBreakpointRange()` - Validate page against min/max/exclude constraints
   - `buildExcludeSet()` - Create Set from PageRange[] for O(1) lookups
   - `createSegment()` - Create segment with optional to/meta fields
   - `expandBreakpoints()` - Expand patterns with pre-compiled regexes
   - `findActualEndPage()` - Search backwards for ending page by content
   - `findBreakPosition()` - Find break position using breakpoint patterns
   - `hasExcludedPageInRange()` - Check if range contains excluded pages
   - `findNextPagePosition()` - Find next page content position
   - `findPatternBreakPosition()` - Find pattern match by preference

5. **`types.ts`** - Type definitions
   - `Logger` interface - Optional logging for debugging
   - `SegmentationOptions` - Options with `logger` property
   - Verbosity levels: `trace`, `debug`, `info`, `warn`, `error`

6. **`fuzzy.ts`** - Arabic text normalization
   - `makeDiacriticInsensitive()` - Generate regex that ignores diacritics

7. **`pattern-detection.ts`** - Token auto-detection (NEW)
   - `detectTokenPatterns()` - Detect tokens in text with positions
   - `generateTemplateFromText()` - Convert text to template string
   - `suggestPatternConfig()` - Suggest rule configuration
   - `analyzeTextForRule()` - Complete analysis returning template + config

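As a rough sketch of the shapes named in item 1 (field types are paraphrased from this document; `segmentation/types.ts` holds the real definitions):

```typescript
// Paraphrased shapes, not the exported types themselves.
type Page = {
    id: number; // page identifier, used for from/to tracking (type assumed)
    content: string; // raw page text
};

type Segment = {
    content: string; // segment text
    from: number; // id of the page the segment starts on
    to?: number; // present when the segment spans multiple pages
    meta?: Record<string, string>; // named captures plus rule-level meta (shape assumed)
};
```
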
## Key Algorithms

### Token Expansion

```
Input:  "{{raqms:num}} {{dash}} {{:text}}"
Output: "(?<num>[\u0660-\u0669]+) [-–—ـ] (?<text>.+)"
```

The expansion algorithm (see the sketch after this list):
1. Splits query into token and text segments
2. Looks up token patterns from `TOKEN_PATTERNS`
3. Wraps in named capture group if `:name` suffix present
4. Applies fuzzy transform if enabled (before wrapping in groups)

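An illustrative re-implementation of the expansion idea, useful as a mental model only — the real logic lives in `expandTokensWithCaptures()` in `tokens.ts` and also handles fuzzy transforms and escaping:

```typescript
// Small local token map for the sketch; the library exports the real TOKEN_PATTERNS.
const LOCAL_TOKENS: Record<string, string> = {
    raqms: '[\\u0660-\\u0669]+', // one or more Arabic-Indic digits
    dash: '[-–—ـ]',
};

/** Expand "{{token:name}}" placeholders into (optionally named) capture groups. */
const expandTemplate = (template: string): string =>
    template.replace(/\{\{(\w*)(?::(\w+))?\}\}/g, (_m, token, name) => {
        // "{{:text}}" has no token name → generic capture; unknown tokens stay as-is.
        const pattern = token ? (LOCAL_TOKENS[token] ?? token) : '.+';
        return name ? `(?<${name}>${pattern})` : pattern;
    });

console.log(expandTemplate('{{raqms:num}} {{dash}} {{:text}}'));
// → "(?<num>[\u0660-\u0669]+) [-–—ـ] (?<text>.+)"
```
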
### Fuzzy Application Order

**Critical design decision**: Fuzzy transforms are applied to raw token patterns and plain text *before* they're wrapped in regex groups.

```
WRONG: makeDiacriticInsensitive("(?<name>حدثنا)")           // Breaks ( ? < > )
RIGHT: "(?<name>" + makeDiacriticInsensitive("حدثنا") + ")"
```

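Composing a fuzzy named group the "RIGHT" way, as a usage sketch (the exact generated pattern is up to `makeDiacriticInsensitive`):

```typescript
import { makeDiacriticInsensitive } from 'flappa-doormal';

// Fuzzy-transform the raw Arabic text first...
const fuzzy = makeDiacriticInsensitive('حدثنا');

// ...then wrap the result in the named capture group.
const narrated = new RegExp(`(?<narrator>${fuzzy})`);

// Per the docs, vowelled variants such as "حَدَّثَنَا" should also match.
console.log(narrated.test('حَدَّثَنَا قتيبة'));
```
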
### lineStartsAfter Content Capture

For patterns like `^٦٦٩٦ - (content)`, the content capture is the *last* positional group:

```typescript
// Pattern:  ^(?:(?<num>[\u0660-\u0669]+) [-–—ـ] )(.*)
// Match:    m[1] = named group value, m[2] = content
// Solution: Iterate backward from m.length-1 to find last defined capture
```

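A self-contained illustration of that backward scan (not the library's internal code):

```typescript
const pattern = /^(?:(?<num>[\u0660-\u0669]+) [-–—ـ] )(.*)$/u;
const m = '٦٦٩٦ - حدثنا قتيبة'.match(pattern);

if (m) {
    // m[1] is the named group's value, m[2] is the content capture.
    // Walking backwards finds the content no matter how many named
    // groups precede it.
    let content: string | undefined;
    for (let i = m.length - 1; i > 0; i--) {
        if (m[i] !== undefined) {
            content = m[i];
            break;
        }
    }
    console.log(content); // "حدثنا قتيبة"
}
```
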
### Breakpoints Post-Processing Algorithm

The `breakpoints` option provides a post-processing mechanism for limiting segment size. Unlike the deprecated `maxSpan` (which was per-rule), the breakpoints pass runs AFTER all structural rules.

**API Options:**
```typescript
interface SegmentationOptions {
    rules: SplitRule[];
    maxPages?: number;             // Maximum pages a segment can span
    breakpoints?: string[];        // Ordered array of regex patterns (supports token expansion)
    prefer?: 'longer' | 'shorter'; // Select last or first match within window
}
```

**How it works:**
1. Structural rules run first, creating initial segments
2. The breakpoints pass then processes any segment exceeding `maxPages`
3. Patterns are tried in order until one matches
4. Empty string `''` means "fall back to page boundary"

**Example:**
```typescript
segmentPages(pages, {
    rules: [
        { lineStartsWith: ['{{basmalah}}'], split: 'at' },
        { lineStartsWith: ['{{bab}}'], split: 'at', meta: { type: 'chapter' } },
    ],
    maxPages: 2,
    breakpoints: ['{{tarqim}}\\s*', '\\n', ''], // Try: punctuation → newline → page boundary
    prefer: 'longer', // Greedy: make segments as large as possible
});
```

**Key behaviors:**
- **Pattern order matters**: First matching pattern wins
- **`prefer: 'longer'`**: Finds the LAST match in the window (greedy)
- **`prefer: 'shorter'`**: Finds the FIRST match (conservative)
- **Recursive**: If a split result still exceeds `maxPages`, the breakpoints pass runs again

> **Note**: The old `maxSpan` and `fallback` properties on `SplitRule` are deprecated and have been removed.

## Design Decisions

### 1. Why `{{double-braces}}`?

- Single braces `{}` conflict with regex quantifiers `{n,m}`
- Double braces are visually distinct and rarely appear in content
- Consistent with template systems (Handlebars, Mustache)

### 2. Why `lineStartsAfter` vs `lineStartsWith`?

| Pattern | Marker in content? | Use case |
|---------|-------------------|----------|
| `lineStartsWith` | ✅ Yes | Keep marker, segment at boundary |
| `lineStartsAfter` | ❌ No | Strip marker, capture only content |

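A sketch of the same difference expressed as rules (page content and `meta` values are illustrative; the exact `SplitRule` fields are defined in `segmentation/types.ts`):

```typescript
import { segmentPages } from 'flappa-doormal';

const pages = [{ id: 1, content: 'باب الإيمان\n٣٤ - حدثنا قتيبة عن مالك' }];

const segments = segmentPages(pages, {
    rules: [
        // lineStartsWith: the matched marker stays in the segment content.
        { lineStartsWith: ['{{bab}}'], split: 'at', meta: { type: 'chapter' } },
        // lineStartsAfter: the marker is stripped; only the text after it is
        // captured, and the number lands in meta.num.
        { lineStartsAfter: ['{{raqms:num}} {{dash}} '], meta: { type: 'hadith' } },
    ],
});

// Expected: one segment starting with "باب الإيمان" (marker kept) and one
// starting with "حدثنا قتيبة..." with meta.num === '٣٤' (marker stripped).
console.log(segments);
```
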
### 3. Why fuzzy transform at token level?

Applying fuzzy globally would corrupt regex metacharacters. Instead:
- Fuzzy is passed to `expandTokensWithCaptures()`
- Applied only to Arabic text portions
- Preserves `(`, `)`, `|`, `?`, etc.

### 4. Why extract match utilities?

The original `segmentPages` had complexity 37 (max: 15). Extraction:
- Creates independently testable units
- Reduces main function complexity
- Improves code readability

## Working with the Codebase

### Adding a New Token

1. Add to `TOKEN_PATTERNS` in `tokens.ts`:
```typescript
export const TOKEN_PATTERNS = {
    // ...existing
    verse: '﴿[^﴾]+﴾', // Quranic verse markers
};
```
2. Add test cases in `segmenter.test.ts` (see the sketch below)
3. Document in README.md

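For step 2, a minimal test sketch using the hypothetical `verse` token from step 1 (`getTokenPattern` is part of the public API):

```typescript
import { describe, expect, it } from 'bun:test';
import { getTokenPattern } from 'flappa-doormal';

describe('verse token', () => {
    it('matches a Quranic verse marker', () => {
        // Assumes the hypothetical `verse` entry from step 1 has been added.
        const pattern = getTokenPattern('verse');
        expect(pattern).toBeDefined();
        expect(new RegExp(pattern as string).test('﴿الحمد لله﴾')).toBe(true);
    });
});
```
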
### Adding a New Pattern Type

1. Add type to union in `types.ts`:
```typescript
type NewPattern = { newPatternField: string[] };
type PatternType = ... | NewPattern;
```
2. Handle in `buildRuleRegex()` in `segmenter.ts`
3. Add comprehensive tests

### Testing Strategy

- **Unit tests**: Each utility function has dedicated tests
- **Integration tests**: Full pipeline tests in `segmenter.test.ts`
- **Real-world tests**: `segmenter.bukhari.test.ts` uses actual hadith data
- Run: `bun test`

## Code Quality Standards

1. **TypeScript strict mode** - No `any` types
2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
3. **JSDoc comments** - All exported functions documented
4. **Test coverage** - 251 tests across 8 files

## Dependencies

### Development

- **@biomejs/biome** - Linting and formatting
- **tsdown** - Build tool (generates `.mjs` and `.d.mts`)
- **Bun** - Runtime and test runner

## Build & Release
|
|
252
|
+
|
|
253
|
+
```bash
|
|
254
|
+
# Install dependencies
|
|
255
|
+
bun install
|
|
256
|
+
|
|
257
|
+
# Run all tests
|
|
258
|
+
bun test
|
|
259
|
+
|
|
260
|
+
# Build distribution
|
|
261
|
+
bun run build
|
|
262
|
+
# Output: dist/index.mjs (~17 KB gzip ~5.7 KB)
|
|
263
|
+
|
|
264
|
+
# Run performance test (generates 50K pages, measures segmentation speed/memory)
|
|
265
|
+
bun run perf
|
|
266
|
+
|
|
267
|
+
# Format code
|
|
268
|
+
bunx biome format --write .
|
|
269
|
+
|
|
270
|
+
# Lint code
|
|
271
|
+
bunx biome lint .
|
|
272
|
+
```
|
|
273
|
+
|
|
274
|
+
## Lessons Learned

### From Development

1. **Named captures shift positional indices**: When `(?<name>…)` appears before `(.*)`, the content is at `m[2]`, not `m[1]`. Solution: iterate backward to find the last defined group.

2. **Fuzzy + metacharacters don't mix**: `makeDiacriticInsensitive` expands Arabic to character classes. If applied to `(?<name>text)`, it corrupts the `(`, `?`, `<`, `>` characters.

3. **Alternations need per-alternative fuzzy**: Token `narrated: 'حدثنا|أخبرنا'` requires splitting at `|`, applying fuzzy to each alternative, then rejoining (see the sketch after this list).

4. **Complexity extraction works**: Pulling logic into `match-utils.ts` reduced main function complexity from 37 to 10 and made the code testable.

5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.

6. **Post-processing beats per-rule limits**: The `maxSpan` approach (per-rule page limits) caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.

7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.

8. **Escaping in tests requires care**: The TypeScript string `'\\.'` creates the regex `\.`, but the regex literal `/\./` is already escaped. Double-backslash in strings, single in literals.

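A minimal sketch of the per-alternative approach from lesson 3 (illustrative; the library applies this inside token expansion):

```typescript
import { makeDiacriticInsensitive } from 'flappa-doormal';

// Apply the fuzzy transform to each alternative separately so the '|'
// metacharacters are never fed into makeDiacriticInsensitive.
const fuzzyAlternation = (pattern: string): string =>
    pattern
        .split('|')
        .map((alternative) => makeDiacriticInsensitive(alternative))
        .join('|');

const narrated = fuzzyAlternation('حدثنا|أخبرنا');
// Each alternative is now diacritic-insensitive and the alternation
// structure is preserved.
```
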
### Architecture Insights

- **Declarative > Imperative**: Users describe patterns, the library handles the regex
- **Composability**: Tokens can be combined freely with `:name` captures
- **Fail gracefully**: Unknown tokens are left as-is, allowing partial templates
- **Post-process > Inline**: The breakpoints pass runs after rules, avoiding conflicts
- **Dependency injection for testability**: `breakpoint-utils.ts` accepts a `PatternProcessor` function instead of importing `processPattern` directly, enabling independent testing without mocking
- **Optional logging**: Use optional chaining (`logger?.debug?.()`) for zero overhead when no logger is provided. All log methods are optional, allowing clients to subscribe to only the verbosity levels they need (see the sketch below).

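A sketch of the optional-chaining pattern from the last bullet (the call site is illustrative; `Logger` is the exported type):

```typescript
import type { Logger } from 'flappa-doormal';

// Inside library code a call site can simply be `logger?.debug?.(...)`:
// if no logger (or no debug method) was supplied, nothing happens.
const logWith = (logger?: Logger) => {
    logger?.info?.('breakpoint processing started');
    logger?.debug?.('segment window', { from: 1, to: 2 });
};

// Subscribe only to the levels you care about:
logWith({ info: console.info });
```
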
---

## Token Reference

| Token | Pattern Description | Example Match |
|-------|---------------------|---------------|
| `{{tarqim}}` | Arabic punctuation (. , ; ? ! ( ) etc.) | `؛` `،` `.` |
| `{{basmalah}}` | "بِسْمِ اللَّهِ" patterns | بِسْمِ اللَّهِ الرَّحْمَنِ |
| `{{bab}}` | "باب" (chapter) | بَابُ الإيمان |
| `{{fasl}}` | "فصل" (section) | فصل: في الطهارة |
| `{{kitab}}` | "كتاب" (book) | كتاب الصلاة |
| `{{raqm}}` | Single Arabic-Indic numeral | ٥ |
| `{{raqms}}` | Multiple Arabic-Indic numerals | ٧٥٦٣ |
| `{{raqms:num}}` | Numerals with named capture | `meta.num = "٧٥٦٣"` |
| `{{dash}}` | Various dash characters | - – — ـ |
| `{{numbered}}` | Composite: `{{raqms}} {{dash}}` | ٧٥٦٣ - |

**Named captures**: Add `:name` suffix to capture into `meta`:
```typescript
'{{raqms:hadithNum}} {{dash}}'
// → segment.meta.hadithNum = "٧٥٦٣"
```

package/README.md
CHANGED

@@ -395,6 +395,32 @@ console.log(TOKEN_PATTERNS.narrated);
 // 'حدثنا|أخبرنا|حدثني|وحدثنا|أنبأنا|سمعت'
 ```
 
+### Pattern Detection
+
+Auto-detect tokens in Arabic text for building rules:
+
+```typescript
+import { detectTokenPatterns, analyzeTextForRule } from 'flappa-doormal';
+
+// Detect individual tokens
+const tokens = detectTokenPatterns('٣٤ - حدثنا');
+// [
+//   { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
+//   { token: 'dash', match: '-', index: 3, endIndex: 4 },
+//   { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }
+// ]
+
+// Get complete rule suggestion
+const rule = analyzeTextForRule('٣٤ - ');
+// {
+//   template: '{{raqms}} {{dash}} ',
+//   patternType: 'lineStartsAfter',
+//   fuzzy: false,
+//   metaType: 'hadith',
+//   detected: [...]
+// }
+```
+
 ## Types
 
 ### `SplitRule`

@@ -433,6 +459,56 @@ type Segment = {
 };
 ```
 
+### `Logger`
+
+Optional logging interface for debugging segmentation:
+
+```typescript
+interface Logger {
+    trace?: (message: string, ...args: unknown[]) => void; // Per-iteration details
+    debug?: (message: string, ...args: unknown[]) => void; // Detailed operations
+    info?: (message: string, ...args: unknown[]) => void;  // Key progress points
+    warn?: (message: string, ...args: unknown[]) => void;  // Potential issues
+    error?: (message: string, ...args: unknown[]) => void; // Critical failures
+}
+```
+
+## Debugging
+
+### Using the Logger
+
+Pass a `logger` option to receive detailed information about the segmentation process:
+
+```typescript
+// Console logger for development
+const segments = segmentPages(pages, {
+    rules: [...],
+    logger: {
+        debug: console.debug,
+        info: console.info,
+        warn: console.warn,
+    }
+});
+
+// Production logger (only errors)
+const segments = segmentPages(pages, {
+    rules: [...],
+    logger: {
+        error: (msg, ...args) => myLoggingService.error(msg, args),
+    }
+});
+```
+
+**Verbosity levels:**
+- `trace` - Per-iteration loop details (very verbose)
+- `debug` - Segment processing, pattern matching
+- `info` - Start/completion of breakpoint processing
+- `warn` - Safety checks triggered
+- `error` - Infinite loop detection
+
+When no logger is provided, no logging overhead is incurred.
+
+
 ## Usage with Next.js / Node.js
 
 ```typescript

@@ -474,7 +550,7 @@ console.log(`Found ${segments.length} segments`);
 # Install dependencies
 bun install
 
-# Run tests (
+# Run tests (251 tests)
 bun test
 
 # Build
package/dist/index.d.mts
CHANGED

@@ -460,6 +460,42 @@ type BreakpointRule = {
  * { pattern: '{{tarqim}}\\s*', min: 10 }
  */
 type Breakpoint = string | BreakpointRule;
+/**
+ * Logger interface for custom logging implementations.
+ *
+ * All methods are optional - only implement the verbosity levels you need.
+ * When no logger is provided, no logging overhead is incurred.
+ *
+ * Compatible with the Logger interface from ffmpeg-simplified and similar libraries.
+ *
+ * @example
+ * // Simple console logger
+ * const logger: Logger = {
+ *   debug: console.debug,
+ *   info: console.info,
+ *   warn: console.warn,
+ *   error: console.error,
+ * };
+ *
+ * @example
+ * // Production logger (only warnings and errors)
+ * const prodLogger: Logger = {
+ *   warn: (msg, ...args) => myLoggingService.warn(msg, args),
+ *   error: (msg, ...args) => myLoggingService.error(msg, args),
+ * };
+ */
+interface Logger {
+  /** Log a debug message (verbose debugging output) */
+  debug?: (message: string, ...args: unknown[]) => void;
+  /** Log an error message (critical failures) */
+  error?: (message: string, ...args: unknown[]) => void;
+  /** Log an informational message (key progress points) */
+  info?: (message: string, ...args: unknown[]) => void;
+  /** Log a trace message (extremely verbose, per-iteration details) */
+  trace?: (message: string, ...args: unknown[]) => void;
+  /** Log a warning message (potential issues) */
+  warn?: (message: string, ...args: unknown[]) => void;
+}
 /**
  * Segmentation options controlling how pages are split.
  *

@@ -480,6 +516,17 @@ type Breakpoint = string | BreakpointRule;
  * breakpoints: ['{{tarqim}}\\s*', '\\n', ''],
  * prefer: 'longer'
  * };
+ *
+ * @example
+ * // With custom logger for debugging
+ * const options: SegmentationOptions = {
+ *   rules: [...],
+ *   logger: {
+ *     debug: console.debug,
+ *     info: console.info,
+ *     warn: console.warn,
+ *   }
+ * };
  */
 type SegmentationOptions = {
   /**

@@ -542,6 +589,38 @@ type SegmentationOptions = {
    * @default 'longer'
    */
   prefer?: 'longer' | 'shorter';
+  /**
+   * Optional logger for debugging segmentation.
+   *
+   * Provide a logger to receive detailed information about the segmentation
+   * process. Useful for debugging pattern matching, page tracking, and
+   * breakpoint processing issues.
+   *
+   * When not provided, no logging overhead is incurred (methods are not called).
+   *
+   * Verbosity levels:
+   * - `trace`: Per-iteration details (very verbose)
+   * - `debug`: Detailed operation information
+   * - `info`: Key progress points
+   * - `warn`: Potential issues
+   * - `error`: Critical failures
+   *
+   * @example
+   * // Console logger for development
+   * logger: {
+   *   debug: console.debug,
+   *   info: console.info,
+   *   warn: console.warn,
+   * }
+   *
+   * @example
+   * // Custom logger integration
+   * logger: {
+   *   debug: (msg, ...args) => winston.debug(msg, { meta: args }),
+   *   error: (msg, ...args) => winston.error(msg, { meta: args }),
+   * }
+   */
+  logger?: Logger;
 };
 /**
  * Output segment produced by `segmentPages()`.

@@ -846,5 +925,80 @@ declare const getAvailableTokens: () => string[];
  */
 declare const getTokenPattern: (tokenName: string) => string | undefined;
 //#endregion
-
+//#region src/pattern-detection.d.ts
+/**
+ * Pattern detection utilities for recognizing template tokens in Arabic text.
+ * Used to auto-detect patterns from user-highlighted text in the segmentation dialog.
+ *
+ * @module pattern-detection
+ */
+/**
+ * Result of detecting a token pattern in text
+ */
+type DetectedPattern = {
+  /** Token name from TOKEN_PATTERNS (e.g., 'raqms', 'dash') */
+  token: string;
+  /** The matched text */
+  match: string;
+  /** Start index in the original text */
+  index: number;
+  /** End index (exclusive) */
+  endIndex: number;
+};
+/**
+ * Analyzes text and returns all detected token patterns with their positions.
+ * Patterns are detected in priority order to avoid partial matches.
+ *
+ * @param text - The text to analyze for token patterns
+ * @returns Array of detected patterns sorted by position
+ *
+ * @example
+ * detectTokenPatterns("٣٤ - حدثنا")
+ * // Returns: [
+ * //   { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
+ * //   { token: 'dash', match: '-', index: 3, endIndex: 4 },
+ * //   { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }
+ * // ]
+ */
+declare const detectTokenPatterns: (text: string) => DetectedPattern[];
+/**
+ * Generates a template pattern from text using detected tokens.
+ * Replaces matched portions with {{token}} syntax.
+ *
+ * @param text - Original text
+ * @param detected - Array of detected patterns from detectTokenPatterns
+ * @returns Template string with tokens, e.g., "{{raqms}} {{dash}} "
+ *
+ * @example
+ * const detected = detectTokenPatterns("٣٤ - ");
+ * generateTemplateFromText("٣٤ - ", detected);
+ * // Returns: "{{raqms}} {{dash}} "
+ */
+declare const generateTemplateFromText: (text: string, detected: DetectedPattern[]) => string;
+/**
+ * Determines the best pattern type for auto-generated rules based on detected patterns.
+ *
+ * @param detected - Array of detected patterns
+ * @returns Suggested pattern type and whether to use fuzzy matching
+ */
+declare const suggestPatternConfig: (detected: DetectedPattern[]) => {
+  patternType: "lineStartsWith" | "lineStartsAfter";
+  fuzzy: boolean;
+  metaType?: string;
+};
+/**
+ * Analyzes text and generates a complete suggested rule configuration.
+ *
+ * @param text - Highlighted text from the page
+ * @returns Suggested rule configuration or null if no patterns detected
+ */
+declare const analyzeTextForRule: (text: string) => {
+  template: string;
+  patternType: "lineStartsWith" | "lineStartsAfter";
+  fuzzy: boolean;
+  metaType?: string;
+  detected: DetectedPattern[];
+} | null;
+//#endregion
+export { type Breakpoint, type BreakpointRule, type DetectedPattern, type ExpandResult, type Logger, type Page, type PageRange, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, analyzeTextForRule, containsTokens, detectTokenPatterns, escapeRegex, expandTokens, expandTokensWithCaptures, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, normalizeLineEndings, segmentPages, stripHtmlTags, suggestPatternConfig, templateToRegex };
 //# sourceMappingURL=index.d.mts.map
package/dist/index.d.mts.map
CHANGED

@@ -1 +1 @@
-{"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/segmenter.ts","../src/segmentation/textUtils.ts","../src/segmentation/tokens.ts"],"sourcesContent":[],"mappings":"…"}
+{"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/segmenter.ts","../src/segmentation/textUtils.ts","../src/segmentation/tokens.ts","../src/pattern-detection.ts"],"sourcesContent":[],"mappings":"…"}