flappa-doormal 2.18.0 → 2.20.0
- package/AGENTS.md +40 -11
- package/README.md +313 -10
- package/dist/index.d.mts +366 -113
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +1396 -202
- package/dist/index.mjs.map +1 -1
- package/package.json +7 -3
package/AGENTS.md
CHANGED
|
@@ -30,6 +30,12 @@ src/
|
|
|
30
30
|
│ ├── breakpoints.ts # Breakpoint types
|
|
31
31
|
│ ├── options.ts # SegmentationOptions and Logger
|
|
32
32
|
│ └── segmenter.ts # Internal segmenter types
|
|
33
|
+
├── dictionary/ # Dictionary-specific compiler, runtime, profiles, diagnostics
|
|
34
|
+
│ ├── arabic-dictionary-rule.ts
|
|
35
|
+
│ ├── heading-classifier.ts
|
|
36
|
+
│ ├── profile.ts
|
|
37
|
+
│ ├── profiles.ts
|
|
38
|
+
│ └── runtime.ts
|
|
33
39
|
├── analysis/ # Pattern discovery module
|
|
34
40
|
│ ├── line-starts.ts # analyzeCommonLineStarts (frequent line markers)
|
|
35
41
|
│ ├── repeating-sequences.ts # analyzeRepeatingSequences (N-grams)
|
|
@@ -56,6 +62,22 @@ src/
|
|
|
56
62
|
├── detection.ts # Pattern auto-detection (standalone)
|
|
57
63
|
└── *.test.ts # Unit and integration tests (co-located)
|
|
58
64
|
|
|
65
|
+
testing/
|
|
66
|
+
├── exports.test.ts # Public export contract test
|
|
67
|
+
└── fixtures/
|
|
68
|
+
├── README.md # Fixture purpose, source, and refresh workflow
|
|
69
|
+
├── dictionary-book-options.ts # Local golden options for the four reference dictionaries
|
|
70
|
+
├── dictionary-books.ts # Test fixture loader helpers
|
|
71
|
+
├── dictionary-fixture-manifest.ts
|
|
72
|
+
└── dictionary-books/ # Extracted markdown pages used by integration tests
|
|
73
|
+
|
|
74
|
+
scripts/
|
|
75
|
+
├── analyze-dictionary-profile.ts # Full-book diagnostics against an explicit input file/books dir
|
|
76
|
+
├── export-dictionary-book-options.ts
|
|
77
|
+
├── extract-dictionary-test-fixtures.ts
|
|
78
|
+
├── generate-dictionary-html-previews.ts
|
|
79
|
+
└── split-dictionary-csvs.ts
|
|
80
|
+
|
|
59
81
|
### Core Components
|
|
60
82
|
|
|
61
83
|
1. **`segmentPages(pages, options)`** - Main entry point (`src/segmentation/segmenter.ts`)
|
|
@@ -386,7 +408,8 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
|
|
|
386
408
|
|
|
387
409
|
- **Unit tests**: Each utility function has dedicated tests
|
|
388
410
|
- **Integration tests**: Full pipeline tests in `src/segmentation/segmenter.test.ts`
|
|
389
|
-
- **
|
|
411
|
+
- **Dictionary integration tests**: `src/dictionary/*.test.ts` use extracted markdown fixtures under `testing/fixtures/dictionary-books/`
|
|
412
|
+
- **Optional corpus tooling**: full-book diagnostics/preview scripts can use external Shamela JSONs via `--input` or `--books-dir`, but the test suite does not require a local `books/` directory
|
|
390
413
|
- **Style convention**: Prefer `it('should ...', () => { ... })` (Bun) for consistency across the suite
|
|
391
414
|
- Run: `bun test`
|
|
392
415
|
|
|
@@ -395,7 +418,7 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
|
|
|
395
418
|
1. **TypeScript strict mode** - No `any` types
|
|
396
419
|
2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
|
|
397
420
|
3. **JSDoc comments** - All exported functions documented
|
|
398
|
-
4. **Test coverage** -
|
|
421
|
+
4. **Test coverage** - keep coverage representative; do not rely on local corpora for CI
|
|
399
422
|
|
|
400
423
|
## Dependencies
|
|
401
424
|
|
|
@@ -417,14 +440,26 @@ bun test
|
|
|
417
440
|
bun run build
|
|
418
441
|
# Output: dist/index.mjs (~17 KB gzip ~5.7 KB)
|
|
419
442
|
|
|
420
|
-
# Run performance
|
|
421
|
-
bun run perf
|
|
443
|
+
# Run performance tests
|
|
444
|
+
bun run test:perf
|
|
445
|
+
|
|
446
|
+
# Regenerate extracted dictionary test fixtures (requires external books dir if not using ./books)
|
|
447
|
+
bun run dictionary:extract-fixtures -- --books-dir /path/to/books
|
|
448
|
+
|
|
449
|
+
# Export built-in dictionary options (writes to out/dictionary-options by default)
|
|
450
|
+
bun run dictionary:export-options
|
|
451
|
+
|
|
452
|
+
# Scan a full book with a builtin dictionary profile
|
|
453
|
+
bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
|
|
454
|
+
|
|
455
|
+
# Validate a dictionary profile shape in userland
|
|
456
|
+
# (public API: validateDictionaryProfile(profile))
|
|
422
457
|
|
|
423
458
|
# Format code
|
|
424
459
|
bunx biome format --write .
|
|
425
460
|
|
|
426
461
|
# Lint code
|
|
427
|
-
bunx biome
|
|
462
|
+
bunx biome check .
|
|
428
463
|
```
|
|
429
464
|
|
|
430
465
|
## Lessons Learned
|
|
@@ -605,12 +640,6 @@ bunx biome lint .
|
|
|
605
640
|
|
|
606
641
|
57. **Validation Hints Specificity**: Generic error hints like "Check segmenter.ts" are unhelpful. Provide specific file names and logical components (e.g., "Check maxPages windowing in breakpoint-processor.ts"). User-friendly validation reports guide debugging much faster than "Something is wrong".
|
|
607
642
|
|
|
608
|
-
### Process Template (Multi-agent design review, TDD-first)
|
|
609
|
-
|
|
610
|
-
If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
|
|
611
|
-
|
|
612
|
-
- `docs/ai-multi-agent-tdd-template.md`
|
|
613
|
-
|
|
614
643
|
### Architecture Insights
|
|
615
644
|
|
|
616
645
|
- **Declarative > Imperative**: Users describe patterns, library handles regex
|
package/README.md
CHANGED
|
@@ -285,6 +285,7 @@ const rules = [{
|
|
|
285
285
|
| `lineEndsWith` | ✅ Included | Match patterns at end of line |
|
|
286
286
|
| `template` | Depends | Custom pattern with full control |
|
|
287
287
|
| `regex` | Depends | Raw regex for complex cases |
|
|
288
|
+
| `dictionaryEntry` | ✅ Included | Serializable Arabic dictionary headword rule |
|
|
288
289
|
|
|
289
290
|
#### Building UIs with Pattern Type Keys
|
|
290
291
|
|
|
@@ -293,7 +294,7 @@ The library exports `PATTERN_TYPE_KEYS` (a const array) and `PatternTypeKey` (a
|
|
|
293
294
|
```typescript
|
|
294
295
|
import { PATTERN_TYPE_KEYS, type PatternTypeKey } from 'flappa-doormal';
|
|
295
296
|
|
|
296
|
-
// PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex']
|
|
297
|
+
// PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex', 'dictionaryEntry']
|
|
297
298
|
|
|
298
299
|
// Build a dropdown/select
|
|
299
300
|
PATTERN_TYPE_KEYS.map(key => <option value={key}>{key}</option>)
|
|
@@ -348,10 +349,88 @@ const segments = segmentPages(pages, {
|
|
|
348
349
|
If the previous page ends with strong sentence punctuation (`.`, `!`, `?`, `؟`, `؛`),
|
|
349
350
|
the stoplist guard is skipped and the page-start match is allowed.
|
|
350
351
|
|
|
351
|
-
####
|
|
352
|
+
#### Preferred Dictionary Profile
|
|
352
353
|
|
|
353
|
-
|
|
354
|
-
|
|
354
|
+
For new Shamela-style dictionary work, prefer the top-level `dictionary`
|
|
355
|
+
profile over hand-built raw regexes or the older one-rule helper:
|
|
356
|
+
|
|
357
|
+
```typescript
|
|
358
|
+
import { segmentPages } from 'flappa-doormal';
|
|
359
|
+
|
|
360
|
+
const segments = segmentPages(pages, {
|
|
361
|
+
breakpoints: ['{{tarqim}}'],
|
|
362
|
+
dictionary: {
|
|
363
|
+
version: 2,
|
|
364
|
+
zones: [{
|
|
365
|
+
name: 'main',
|
|
366
|
+
blockers: [
|
|
367
|
+
{ appliesTo: ['lineEntry', 'inlineSubentry'], use: 'pageContinuation' },
|
|
368
|
+
{ appliesTo: ['lineEntry', 'inlineSubentry'], use: 'intro' },
|
|
369
|
+
{
|
|
370
|
+
appliesTo: ['lineEntry', 'inlineSubentry'],
|
|
371
|
+
use: 'stopLemma',
|
|
372
|
+
words: ['ومعناه', 'ويقال', 'وقيل']
|
|
373
|
+
},
|
|
374
|
+
],
|
|
375
|
+
families: [
|
|
376
|
+
{ classes: ['chapter'], emit: 'chapter', use: 'heading' },
|
|
377
|
+
{ emit: 'entry', use: 'lineEntry', wrappers: 'none' },
|
|
378
|
+
{ emit: 'entry', prefixes: ['و'], stripPrefixesFromLemma: false, use: 'inlineSubentry' },
|
|
379
|
+
],
|
|
380
|
+
}],
|
|
381
|
+
},
|
|
382
|
+
maxPages: 1,
|
|
383
|
+
});
|
|
384
|
+
```
|
|
385
|
+
|
|
386
|
+
Why this is preferred:
|
|
387
|
+
- serializable JSON authoring shape
|
|
388
|
+
- profile-scoped blockers instead of giant regex blobs
|
|
389
|
+
- zone support for books that change layout later
|
|
390
|
+
- compatible with diagnostics tooling via `diagnoseDictionaryProfile()`
|
|
391
|
+
- first-class validation via `validateDictionaryProfile()`
|
|
392
|
+
|
|
393
|
+
The production dictionary implementation now lives under `src/dictionary/`
|
|
394
|
+
inside the repo, separate from the generic segmentation internals.
|
|
395
|
+
|
|
396
|
+
Dictionary runtime semantics:
|
|
397
|
+
- `segmentPages()` is still the only entry point; dictionary profiles do not use
|
|
398
|
+
a separate API
|
|
399
|
+
- dictionary split points are merged with ordinary `rules`
|
|
400
|
+
- when a rule split and a dictionary split land at the same offset, metadata is
|
|
401
|
+
merged; if `debug` is enabled, `_flappa.rule` and `_flappa.dictionary` can
|
|
402
|
+
both appear on the same segment
|
|
403
|
+
- for dictionary-only configs, content before the first detected entry/chapter
|
|
404
|
+
is preserved as a leading segment with no dictionary metadata
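The merge behaviour described above can be pictured with a small, self-contained sketch. The `SplitPoint` shape and `mergeSplitPoints` helper are hypothetical names invented for illustration, not the library's internal types, and letting later metadata overwrite earlier keys on conflict is an assumption.

```typescript
// Hypothetical shapes for illustration; not the library's internal types.
type SplitPoint = {
    offset: number;
    meta?: Record<string, unknown>;
    debug?: { rule?: unknown; dictionary?: unknown };
};

// Merge rule splits and dictionary splits, combining entries that share an offset.
// Assumption: on a key conflict, the later (dictionary) value wins.
function mergeSplitPoints(ruleSplits: SplitPoint[], dictionarySplits: SplitPoint[]): SplitPoint[] {
    const byOffset = new Map<number, SplitPoint>();
    for (const split of [...ruleSplits, ...dictionarySplits]) {
        const existing = byOffset.get(split.offset);
        byOffset.set(
            split.offset,
            existing
                ? {
                      offset: split.offset,
                      meta: { ...existing.meta, ...split.meta },
                      debug: { ...existing.debug, ...split.debug },
                  }
                : split,
        );
    }
    return [...byOffset.values()].sort((a, b) => a.offset - b.offset);
}

const merged = mergeSplitPoints(
    [{ offset: 0, meta: { type: 'chapter' }, debug: { rule: { index: 0 } } }],
    [
        { offset: 0, meta: { lemma: 'عز' }, debug: { dictionary: { family: 'lineEntry' } } },
        { offset: 42, meta: { lemma: 'لع' }, debug: { dictionary: { family: 'lineEntry' } } },
    ],
);
console.log(merged.length); // 2
```

In this sketch the split at offset 0 ends up carrying both the rule metadata and the dictionary metadata, which mirrors how `_flappa.rule` and `_flappa.dictionary` can appear together on one segment.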
|
|
405
|
+
|
|
406
|
+
#### Advanced: Single-Rule Arabic Dictionary Matching
|
|
407
|
+
|
|
408
|
+
`createArabicDictionaryEntryRule()` and the native `dictionaryEntry` rule shape
|
|
409
|
+
are still supported as the lower-level, advanced path for clients who want one
|
|
410
|
+
Arabic dictionary-style matcher inside a broader `rules` pipeline.
|
|
411
|
+
|
|
412
|
+
Use this path when:
|
|
413
|
+
- you need exactly one conservative dictionary headword rule
|
|
414
|
+
- you want to compose it with ordinary `SplitRule[]`
|
|
415
|
+
- you do not need profile zones, per-family blockers, or full-book tuning
|
|
416
|
+
|
|
417
|
+
Prefer the top-level `dictionary` profile when:
|
|
418
|
+
- segmenting an entire dictionary book
|
|
419
|
+
- persisting JSON config for a corpus
|
|
420
|
+
- the book changes layout in different sections
|
|
421
|
+
- you need diagnostics, blocker hit rates, or book-specific profile tuning
|
|
422
|
+
|
|
423
|
+
Decision guide:
|
|
424
|
+
|
|
425
|
+
| Use case | Preferred API |
|
|
426
|
+
|----------|---------------|
|
|
427
|
+
| One conservative lemma matcher inside a normal segmentation pipeline | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
|
|
428
|
+
| Full-book dictionary segmentation with blockers, families, and zones | top-level `dictionary` |
|
|
429
|
+
| Persisted JSON config for real books | top-level `dictionary` |
|
|
430
|
+
| Advanced composition with other `SplitRule[]` rules | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
|
|
431
|
+
|
|
432
|
+
The helper returns a serializable native `dictionaryEntry` rule rather than an
|
|
433
|
+
eagerly-compiled regex blob:
|
|
355
434
|
|
|
356
435
|
```typescript
|
|
357
436
|
import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
|
|
@@ -364,20 +443,221 @@ const rule = createArabicDictionaryEntryRule({
|
|
|
364
443
|
allowParenthesized: true, // e.g. (عنبر) :
|
|
365
444
|
allowWhitespaceBeforeColon: true, // e.g. عنبر :
|
|
366
445
|
allowCommaSeparated: true, // e.g. سبد، دبس:
|
|
446
|
+
midLineSubentries: false, // line/page starts only
|
|
367
447
|
});
|
|
368
448
|
|
|
369
449
|
const segments = segmentPages(pages, { rules: [rule] });
|
|
370
450
|
```
|
|
371
451
|
|
|
452
|
+
Equivalent direct JSON-authored rule:
|
|
453
|
+
|
|
454
|
+
```typescript
|
|
455
|
+
const rule = {
|
|
456
|
+
dictionaryEntry: {
|
|
457
|
+
stopWords: ['وقيل', 'ويقال', 'قال', 'العجاج', 'أخاك'],
|
|
458
|
+
allowParenthesized: true,
|
|
459
|
+
allowWhitespaceBeforeColon: true,
|
|
460
|
+
allowCommaSeparated: true,
|
|
461
|
+
midLineSubentries: false,
|
|
462
|
+
},
|
|
463
|
+
pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
|
|
464
|
+
samePagePrevWordStoplist: ['جل'],
|
|
465
|
+
meta: { type: 'entry' },
|
|
466
|
+
};
|
|
467
|
+
```
|
|
468
|
+
|
|
372
469
|
Behavior:
|
|
373
470
|
- Keeps the lemma marker in `segment.content`
|
|
374
471
|
- Stores the matched lemma in `segment.meta.lemma`
|
|
375
472
|
- Matches root entries at true line/page starts like `عز:` and `لع:`
|
|
376
473
|
- Matches mid-line subentries conservatively when they begin with `و`
|
|
474
|
+
- Supports disabling mid-line subentries entirely with `midLineSubentries: false`
|
|
377
475
|
- Can match parenthesized headwords like `(عنبر) :` when enabled
|
|
378
476
|
- Can match comma-separated headword lists like `سبد، دبس:` when enabled
|
|
379
477
|
- Can suppress same-page false positives like `جلّ وعزّ:` with `samePagePrevWordStoplist`
|
|
380
478
|
|
|
479
|
+
Option notes:
|
|
480
|
+
- `stopWords`
|
|
481
|
+
- exact lemma-level blockers for non-lexical heads like `وقيل` or `ويقال`
|
|
482
|
+
- use this for rejecting candidate headwords themselves
|
|
483
|
+
- `pageStartPrevWordStoplist`
|
|
484
|
+
- blocks a page-start candidate when the previous page ends with one of these
|
|
485
|
+
words
|
|
486
|
+
- useful for page-wrap false positives after citation/introduction prose
|
|
487
|
+
- `samePagePrevWordStoplist`
|
|
488
|
+
- blocks a same-page candidate when the previous local word matches
|
|
489
|
+
- useful for phrases like `جلّ وعزّ`
|
|
490
|
+
- `allowParenthesized`
|
|
491
|
+
- enables heads like `(عنبر):`
|
|
492
|
+
- `allowWhitespaceBeforeColon`
|
|
493
|
+
- enables spacing variants like `عنبر :`
|
|
494
|
+
- `allowCommaSeparated`
|
|
495
|
+
- enables grouped heads like `سبد، دبس:`
|
|
496
|
+
- `midLineSubentries`
|
|
497
|
+
- when `true`, allows conservative same-line subentries such as `والعزاء:`
|
|
498
|
+
- when `false`, only line-start/page-start heads are emitted
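As a rough mental model of the two previous-word stoplists, here is a minimal sketch. It is not the library's implementation: `isBlockedByPrevWord`, the diacritic stripping, and the trailing-punctuation handling are all assumptions made for illustration. Only the stoplist idea and the strong-punctuation skip at page starts come from the README text.

```typescript
// Illustrative only: names and normalization rules here are assumptions.
const ARABIC_DIACRITICS = /[\u064B-\u0652]/g;
const TRAILING_PUNCTUATION = /[.!?؟؛،:]+$/;
const STRONG_PUNCTUATION = /[.!?؟؛]$/;

// Strip diacritics and trailing punctuation so that 'جلّ' compares equal to 'جل'.
const normalize = (word: string): string =>
    word.replace(ARABIC_DIACRITICS, '').replace(TRAILING_PUNCTUATION, '');

// Last whitespace-delimited word of the text that precedes a candidate headword.
const lastWord = (text: string): string => text.trim().split(/\s+/).pop() ?? '';

function isBlockedByPrevWord(precedingText: string, stoplist: string[], isPageStart: boolean): boolean {
    const trimmed = precedingText.trim();
    // At a page start, strong sentence-final punctuation skips the stoplist guard.
    if (isPageStart && STRONG_PUNCTUATION.test(trimmed)) {
        return false;
    }
    const previous = normalize(lastWord(trimmed));
    return stoplist.some((word) => normalize(word) === previous);
}

// Same-page guard: the candidate follows 'جلّ', so it is suppressed.
console.log(isBlockedByPrevWord('قال الله جلّ', ['جل'], false)); // true
// Page-start guard: the previous page ends with a stoplisted word.
console.log(isBlockedByPrevWord('ثم قال', ['قال', 'وقيل'], true)); // true
// Page-start guard skipped: the previous page ends with strong punctuation.
console.log(isBlockedByPrevWord('ثم قال.', ['قال'], true)); // false
```

`stopWords` works differently from both lists: it rejects the candidate headword itself, whereas these guards look only at the word before the candidate.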
|
|
499
|
+
|
|
500
|
+
Serialization tradeoff:
|
|
501
|
+
- `dictionaryEntry` is serializable and safe to keep in JSON
|
|
502
|
+
- but it is still a single-rule primitive
|
|
503
|
+
- if you need corpus-wide blocker tuning, families, or zones, move up to the
|
|
504
|
+
top-level `dictionary` profile
|
|
505
|
+
|
|
506
|
+
Example: compose with chapter rules
|
|
507
|
+
|
|
508
|
+
```typescript
|
|
509
|
+
import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
|
|
510
|
+
|
|
511
|
+
const segments = segmentPages(pages, {
|
|
512
|
+
rules: [
|
|
513
|
+
{ lineStartsAfter: ['## '], meta: { type: 'chapter' } },
|
|
514
|
+
{
|
|
515
|
+
fuzzy: true,
|
|
516
|
+
lineStartsAfter: ['{{bab}} '],
|
|
517
|
+
meta: { type: 'chapter' },
|
|
518
|
+
},
|
|
519
|
+
createArabicDictionaryEntryRule({
|
|
520
|
+
stopWords: ['وقيل', 'ويقال', 'قال'],
|
|
521
|
+
pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
|
|
522
|
+
samePagePrevWordStoplist: ['جل'],
|
|
523
|
+
allowCommaSeparated: true,
|
|
524
|
+
}),
|
|
525
|
+
],
|
|
526
|
+
breakpoints: ['{{tarqim}}'],
|
|
527
|
+
maxPages: 1,
|
|
528
|
+
});
|
|
529
|
+
```
|
|
530
|
+
|
|
531
|
+
Example: one-off advanced rule inside a non-dictionary pipeline
|
|
532
|
+
|
|
533
|
+
```typescript
|
|
534
|
+
import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
|
|
535
|
+
|
|
536
|
+
const segments = segmentPages(pages, {
|
|
537
|
+
rules: [
|
|
538
|
+
{ lineStartsWith: ['{{kitab}}'], meta: { type: 'book' } },
|
|
539
|
+
{ lineStartsWith: ['{{bab}}'], meta: { type: 'chapter' } },
|
|
540
|
+
createArabicDictionaryEntryRule({
|
|
541
|
+
stopWords: ['وقيل', 'ويقال'],
|
|
542
|
+
midLineSubentries: false,
|
|
543
|
+
allowParenthesized: true,
|
|
544
|
+
}),
|
|
545
|
+
],
|
|
546
|
+
});
|
|
547
|
+
```
|
|
548
|
+
|
|
549
|
+
Use `createArabicDictionaryEntryRule()` or `dictionaryEntry` when you only need
|
|
550
|
+
one conservative dictionary matcher and want it to behave like a normal
|
|
551
|
+
`SplitRule`.
|
|
552
|
+
|
|
553
|
+
For full-book dictionary profiling, diagnostics, and book-specific tuning,
|
|
554
|
+
prefer the top-level `dictionary` contract above.
|
|
555
|
+
|
|
556
|
+
#### Repo Fixture Book Options
|
|
557
|
+
|
|
558
|
+
The repo keeps book-specific golden options for the four reference Shamela
|
|
559
|
+
dictionaries as local test/support fixtures, not as part of the public package
|
|
560
|
+
API.
|
|
561
|
+
|
|
562
|
+
If you want standalone JSON copies of those fixture options for your own local
|
|
563
|
+
workflow, export them on demand:
|
|
564
|
+
|
|
565
|
+
```bash
|
|
566
|
+
bun run dictionary:export-options
|
|
567
|
+
bun run dictionary:export-options -- --out-dir /path/to/dictionary-options
|
|
568
|
+
```
|
|
569
|
+
|
|
570
|
+
By default this writes to `out/dictionary-options/`, which is not intended to
|
|
571
|
+
be checked into the repo.
|
|
572
|
+
|
|
573
|
+
#### Dictionary Diagnostics
|
|
574
|
+
|
|
575
|
+
Use `diagnoseDictionaryProfile()` when tuning blockers and families for a
|
|
576
|
+
dictionary profile:
|
|
577
|
+
|
|
578
|
+
```typescript
|
|
579
|
+
import { diagnoseDictionaryProfile } from 'flappa-doormal';
|
|
580
|
+
|
|
581
|
+
const diagnostics = diagnoseDictionaryProfile(pages, profile, {
|
|
582
|
+
sampleLimit: 25,
|
|
583
|
+
});
|
|
584
|
+
|
|
585
|
+
console.log(diagnostics.blockerHits);
|
|
586
|
+
console.log(diagnostics.rejectedLemmas.slice(0, 10));
|
|
587
|
+
```
|
|
588
|
+
|
|
589
|
+
Returned diagnostics include:
|
|
590
|
+
- accepted vs rejected candidate counts
|
|
591
|
+
- accepted counts by `kind`
|
|
592
|
+
- accepted/rejected counts by family and zone
|
|
593
|
+
- blocker hit counts (`intro`, `stopLemma`, `pageContinuation`, etc.)
|
|
594
|
+
- top rejected lemmas
|
|
595
|
+
- sampled accepted/rejected candidates for quick inspection
|
|
596
|
+
|
|
597
|
+
Validate profiles before persisting them or shipping them to an editor/CI step:
|
|
598
|
+
|
|
599
|
+
```typescript
|
|
600
|
+
import { validateDictionaryProfile } from 'flappa-doormal';
|
|
601
|
+
|
|
602
|
+
const issues = validateDictionaryProfile(profile);
|
|
603
|
+
if (issues.length > 0) {
|
|
604
|
+
console.error(issues);
|
|
605
|
+
}
|
|
606
|
+
```
|
|
607
|
+
|
|
608
|
+
Validation catches:
|
|
609
|
+
- empty or duplicate zones
|
|
610
|
+
- invalid gate shapes
|
|
611
|
+
- empty blocker lists
|
|
612
|
+
- inert heading families (for example, a heading family that emits `entry` but
|
|
613
|
+
never matches `entry` headings)
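To make the first category concrete, the following sketch shows what a zone-level check could look like. It is a hypothetical stand-in, not `validateDictionaryProfile()` itself: the `ProfileSketch` and `Issue` shapes, the `checkZones` name, and the reading of "empty zone" as a zone without families are assumptions.

```typescript
// Hypothetical, simplified shapes; the real profile and issue types may differ.
type ZoneSketch = { name: string; families: unknown[] };
type ProfileSketch = { version: number; zones: ZoneSketch[] };
type Issue = { path: string; message: string };

// Report missing zones, zones without families, and duplicate zone names.
function checkZones(profile: ProfileSketch): Issue[] {
    const issues: Issue[] = [];
    if (profile.zones.length === 0) {
        issues.push({ path: 'zones', message: 'profile has no zones' });
    }
    const seen = new Set<string>();
    profile.zones.forEach((zone, index) => {
        if (zone.families.length === 0) {
            issues.push({ path: `zones[${index}].families`, message: 'zone has no families' });
        }
        if (seen.has(zone.name)) {
            issues.push({ path: `zones[${index}].name`, message: `duplicate zone name "${zone.name}"` });
        }
        seen.add(zone.name);
    });
    return issues;
}

const zoneIssues = checkZones({
    version: 2,
    zones: [
        { name: 'main', families: [{ emit: 'entry', use: 'lineEntry' }] },
        { name: 'main', families: [] },
    ],
});
console.log(zoneIssues.map((issue) => issue.path));
```

Returning a list of issues rather than throwing on the first one lets an editor or CI step show every problem in a single pass.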
|
|
614
|
+
|
|
615
|
+
The runtime throws `DictionaryProfileValidationError` if invalid profiles reach
|
|
616
|
+
`segmentPages()` or `diagnoseDictionaryProfile()`.
|
|
617
|
+
|
|
618
|
+
#### Dictionary Surface Analysis
|
|
619
|
+
|
|
620
|
+
For corpus exploration and profile authoring, the library also exposes the
|
|
621
|
+
heading/surface scanner used during the proposal phase:
|
|
622
|
+
|
|
623
|
+
```typescript
|
|
624
|
+
import {
|
|
625
|
+
analyzeDictionaryMarkdownPages,
|
|
626
|
+
classifyDictionaryHeading,
|
|
627
|
+
scanDictionaryMarkdownPage,
|
|
628
|
+
} from 'flappa-doormal';
|
|
629
|
+
|
|
630
|
+
const kind = classifyDictionaryHeading('## (خَ غ)');
|
|
631
|
+
const pageMatches = scanDictionaryMarkdownPage(page);
|
|
632
|
+
const report = analyzeDictionaryMarkdownPages(pages);
|
|
633
|
+
```
|
|
634
|
+
|
|
635
|
+
Use these for:
|
|
636
|
+
- inspecting `convertContentToMarkdown()` output before profile authoring
|
|
637
|
+
- spotting structural marker/code lines
|
|
638
|
+
- building your own authoring tools around the same heading classifier
|
|
639
|
+
|
|
640
|
+
These are analysis helpers, not a replacement for the full runtime.
|
|
641
|
+
|
|
642
|
+
For full-book scans, use the bundled script:
|
|
643
|
+
|
|
644
|
+
```bash
|
|
645
|
+
bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
|
|
646
|
+
bun run dictionary:scan -- --book 7031 --books-dir /path/to/books --json
|
|
647
|
+
bun run dictionary:scan -- --book 1687 --input /path/to/1687.json --out diagnostics/1687.txt
|
|
648
|
+
```
|
|
649
|
+
|
|
650
|
+
The scan script:
|
|
651
|
+
- reads an explicit `--input` file or resolves `<books-dir>/<book>.json`
|
|
652
|
+
- converts each page with `convertContentToMarkdown()`
|
|
653
|
+
- applies `removeZeroWidth`
|
|
654
|
+
- runs `diagnoseDictionaryProfile()` with the repo-local golden profile fixture
|
|
655
|
+
for that book
|
|
656
|
+
|
|
657
|
+
The test suite does not require the full Shamela corpora. It uses extracted
|
|
658
|
+
markdown fixtures under `testing/fixtures/dictionary-books/`, so moving your
|
|
659
|
+
local `books/` directory will not break CI or the built-in tests.
|
|
660
|
+
|
|
381
661
|
#### Dictionary Letter-Code Lines
|
|
382
662
|
|
|
383
663
|
For dictionary-specific letter-code lines like `ك ش ن` or `(هـ ث)`, use
|
|
@@ -539,11 +819,6 @@ Pass an optional `logger` to trace segmentation decisions or enable `debug` to a
|
|
|
539
819
|
const segments = segmentPages(pages, {
|
|
540
820
|
rules: [...],
|
|
541
821
|
debug: true, // Enables detailed match metadata
|
|
542
|
-
logger: {
|
|
543
|
-
debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
|
|
544
|
-
info: (msg, data) => console.info(`[INFO] ${msg}`, data),
|
|
545
|
-
warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
|
|
546
|
-
error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
|
|
547
822
|
logger: {
|
|
548
823
|
debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
|
|
549
824
|
info: (msg, data) => console.info(`[INFO] ${msg}`, data),
|
|
@@ -598,7 +873,35 @@ If a segment was created by a `breakpoint` pattern (e.g. because it exceeded `ma
|
|
|
598
873
|
}
|
|
599
874
|
```
|
|
600
875
|
|
|
601
|
-
**3.
|
|
876
|
+
**3. Dictionary-based Splits**
|
|
877
|
+
If a segment was created by a dictionary profile:
|
|
878
|
+
```json
|
|
879
|
+
{
|
|
880
|
+
"meta": {
|
|
881
|
+
"_flappa": {
|
|
882
|
+
"dictionary": {
|
|
883
|
+
"family": "lineEntry"
|
|
884
|
+
}
|
|
885
|
+
}
|
|
886
|
+
}
|
|
887
|
+
}
|
|
888
|
+
```
|
|
889
|
+
|
|
890
|
+
Heading-driven dictionary splits can also record the heading class:
|
|
891
|
+
```json
|
|
892
|
+
{
|
|
893
|
+
"meta": {
|
|
894
|
+
"_flappa": {
|
|
895
|
+
"dictionary": {
|
|
896
|
+
"family": "heading",
|
|
897
|
+
"headingClass": "chapter"
|
|
898
|
+
}
|
|
899
|
+
}
|
|
900
|
+
}
|
|
901
|
+
}
|
|
902
|
+
```
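When consuming this debug metadata, a small defensive reader keeps downstream code from assuming the shape. The helper below is a sketch: `getDictionaryDebug` and `DictionaryDebug` are hypothetical names, and only the `_flappa.dictionary.family` and `headingClass` fields shown in the JSON above are taken from the documentation.

```typescript
// Hypothetical helper; only `family` and `headingClass` come from the documented JSON.
type DictionaryDebug = { family: string; headingClass?: string };

const isRecord = (value: unknown): value is Record<string, unknown> =>
    typeof value === 'object' && value !== null;

// Return the dictionary debug info from a segment's meta, or undefined if absent.
function getDictionaryDebug(meta: unknown): DictionaryDebug | undefined {
    if (!isRecord(meta) || !isRecord(meta._flappa)) {
        return undefined;
    }
    const dictionary = meta._flappa.dictionary;
    if (!isRecord(dictionary) || typeof dictionary.family !== 'string') {
        return undefined;
    }
    const { family, headingClass } = dictionary;
    return typeof headingClass === 'string' ? { family, headingClass } : { family };
}

const headingInfo = getDictionaryDebug({
    _flappa: { dictionary: { family: 'heading', headingClass: 'chapter' } },
});
console.log(headingInfo?.family, headingInfo?.headingClass); // heading chapter
```

Segments produced by ordinary rules or breakpoints have no `dictionary` key under `_flappa`, so the helper returns `undefined` for them.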
|
|
903
|
+
|
|
904
|
+
**4. Safety Fallback Splits (`maxContentLength`)**
|
|
602
905
|
If no rule or breakpoint matched and the library was forced to perform a safety fallback split:
|
|
603
906
|
```json
|
|
604
907
|
{
|