flappa-doormal 2.19.0 → 2.21.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +63 -11
- package/README.md +383 -11
- package/dist/index.d.mts +440 -132
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +2 -4445
- package/dist/index.mjs.map +1 -1
- package/dist/mcp/server.d.mts +1 -0
- package/dist/mcp/server.mjs +156 -0
- package/dist/mcp/server.mjs.map +1 -0
- package/dist/segmentation-advisor-D375TL8-.mjs +6128 -0
- package/dist/segmentation-advisor-D375TL8-.mjs.map +1 -0
- package/package.json +18 -4
package/AGENTS.md
CHANGED
@@ -30,6 +30,17 @@ src/
 │ ├── breakpoints.ts # Breakpoint types
 │ ├── options.ts # SegmentationOptions and Logger
 │ └── segmenter.ts # Internal segmenter types
+├── dictionary/ # Dictionary-specific compiler, runtime, profiles, diagnostics
+│ ├── arabic-dictionary-rule.ts
+│ ├── constants.ts
+│ ├── dictionary-blockers.ts
+│ ├── dictionary-candidates.ts
+│ ├── dictionary-diagnostics.ts
+│ ├── dictionary-zones.ts
+│ ├── heading-classifier.ts
+│ ├── profile.ts
+│ ├── profiles.ts
+│ └── runtime.ts
 ├── analysis/ # Pattern discovery module
 │ ├── line-starts.ts # analyzeCommonLineStarts (frequent line markers)
 │ ├── repeating-sequences.ts # analyzeRepeatingSequences (N-grams)
@@ -56,6 +67,22 @@ src/
 ├── detection.ts # Pattern auto-detection (standalone)
 └── *.test.ts # Unit and integration tests (co-located)
 
+testing/
+├── exports.test.ts # Public export contract test
+└── fixtures/
+    ├── README.md # Fixture purpose, source, and refresh workflow
+    ├── dictionary-book-options.ts # Local golden options for the four reference dictionaries
+    ├── dictionary-books.ts # Test fixture loader helpers
+    ├── dictionary-fixture-manifest.ts
+    └── dictionary-books/ # Extracted markdown pages used by integration tests
+
+scripts/
+├── analyze-dictionary-profile.ts # Full-book diagnostics against an explicit input file/books dir
+├── export-dictionary-book-options.ts
+├── extract-dictionary-test-fixtures.ts
+├── generate-dictionary-html-previews.ts
+└── split-dictionary-csvs.ts
+
 ### Core Components
 
 1. **`segmentPages(pages, options)`** - Main entry point (`src/segmentation/segmenter.ts`)
@@ -100,6 +127,24 @@ src/
 - `options.ts`: Comprehensive `SegmentationOptions` and `Logger` definitions
 - `index.ts`: Public API types for consumers
 
+### Dictionary Blocker Notes
+
+- `previousWord.scope` defaults to `'samePage'` and only checks the same page's
+  preceding Arabic word unless you opt into cross-page behavior.
+- `previousWord.scope: 'pageStart'` only runs for page-start candidates and
+  compares against the previous page's last Arabic word, skipping the check when
+  the previous page ends with strong sentence punctuation.
+- `previousWord.scope: 'any'` combines the page-start cross-page check with the
+  usual same-page check for non-page-start candidates.
+- `pageContinuation.authorityPrecision` defaults to `'high'`; set it to
+  `'aggressive'` when page-start continuation blocking should treat
+  authority-like prefixes more conservatively.
+- `qualifierTail` and `structuralLeak` are intentionally non-configurable global
+  safety checks. They run before zone blockers and appear in diagnostics as
+  rejection reasons.
+- `diagnoseDictionaryProfile()` now reports `rejectionReasons` rather than
+  the former `blockerHits`.
+
 11. **`textUtils.ts`** - Low-level helpers (`src/utils/textUtils.ts`)
 - `makeDiacriticInsensitive()`: Arabic-aware regex generation
 - `adjustForUnicodeBoundary()`: Prevents invalid splits across multi-character clusters
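The `previousWord.scope` rules described in those blocker notes form a small decision table. The sketch below is an illustrative reconstruction of that documented behavior, not the library's actual code; the function name `shouldCheckPreviousWord` is hypothetical.

```typescript
// Illustrative reconstruction of the documented previousWord scoping rules —
// NOT flappa-doormal's implementation, just the decision table from the notes.
type PreviousWordScope = 'samePage' | 'pageStart' | 'any';

function shouldCheckPreviousWord(
    scope: PreviousWordScope,
    isPageStartCandidate: boolean,
    previousPageEndsWithStrongPunctuation: boolean,
): boolean {
    if (isPageStartCandidate) {
        // The cross-page comparison runs only for 'pageStart' / 'any', and is
        // skipped when the previous page ends with strong sentence punctuation.
        return (
            (scope === 'pageStart' || scope === 'any') &&
            !previousPageEndsWithStrongPunctuation
        );
    }
    // Non-page-start candidates get the usual same-page check.
    return scope === 'samePage' || scope === 'any';
}
```

Note how `'any'` is simply the union of the other two behaviors, which matches the "combines the page-start cross-page check with the usual same-page check" wording.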
@@ -386,7 +431,8 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
 
 - **Unit tests**: Each utility function has dedicated tests
 - **Integration tests**: Full pipeline tests in `src/segmentation/segmenter.test.ts`
-- **
+- **Dictionary integration tests**: `src/dictionary/*.test.ts` use extracted markdown fixtures under `testing/fixtures/dictionary-books/`
+- **Optional corpus tooling**: full-book diagnostics/preview scripts can use external Shamela JSONs via `--input` or `--books-dir`, but the test suite does not require a local `books/` directory
 - **Style convention**: Prefer `it('should ...', () => { ... })` (Bun) for consistency across the suite
 - Run: `bun test`
 
@@ -395,7 +441,7 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
 1. **TypeScript strict mode** - No `any` types
 2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
 3. **JSDoc comments** - All exported functions documented
-4. **Test coverage** -
+4. **Test coverage** - keep coverage representative; do not rely on local corpora for CI
 
 ## Dependencies
 
@@ -417,14 +463,26 @@ bun test
 bun run build
 # Output: dist/index.mjs (~17 KB gzip ~5.7 KB)
 
-# Run performance
-bun run perf
+# Run performance tests
+bun run test:perf
+
+# Regenerate extracted dictionary test fixtures (requires external books dir if not using ./books)
+bun run dictionary:extract-fixtures -- --books-dir /path/to/books
+
+# Export built-in dictionary options (writes to out/dictionary-options by default)
+bun run dictionary:export-options
+
+# Scan a full book with a builtin dictionary profile
+bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
+
+# Validate a dictionary profile shape in userland
+# (public API: validateDictionaryProfile(profile))
 
 # Format code
 bunx biome format --write .
 
 # Lint code
-bunx biome
+bunx biome check .
 ```
 
 ## Lessons Learned
@@ -605,12 +663,6 @@ bunx biome lint .
 
 57. **Validation Hints Specificity**: Generic error hints like "Check segmenter.ts" are unhelpful. Provide specific file names and logical components (e.g., "Check maxPages windowing in breakpoint-processor.ts"). User-friendly validation reports guide debugging much faster than "Something is wrong".
 
-### Process Template (Multi-agent design review, TDD-first)
-
-If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
-
-- `docs/ai-multi-agent-tdd-template.md`
-
 ### Architecture Insights
 
 - **Declarative > Imperative**: Users describe patterns, library handles regex
package/README.md
CHANGED
@@ -349,12 +349,100 @@ const segments = segmentPages(pages, {
 If the previous page ends with strong sentence punctuation (`.`, `!`, `?`, `؟`, `؛`),
 the stoplist guard is skipped and the page-start match is allowed.
 
-####
+#### Preferred Dictionary Profile
 
-
-
-
-
+For new Shamela-style dictionary work, prefer the top-level `dictionary`
+profile over hand-built raw regexes or the older one-rule helper:
+
+```typescript
+import { segmentPages } from 'flappa-doormal';
+
+const segments = segmentPages(pages, {
+  breakpoints: ['{{tarqim}}'],
+  dictionary: {
+    version: 2,
+    zones: [{
+      name: 'main',
+      blockers: [
+        { appliesTo: ['lineEntry', 'inlineSubentry'], use: 'pageContinuation' },
+        { appliesTo: ['lineEntry', 'inlineSubentry'], use: 'intro' },
+        {
+          appliesTo: ['lineEntry', 'inlineSubentry'],
+          use: 'stopLemma',
+          words: ['ومعناه', 'ويقال', 'وقيل']
+        },
+      ],
+      families: [
+        { classes: ['chapter'], emit: 'chapter', use: 'heading' },
+        { emit: 'entry', use: 'lineEntry', wrappers: 'none' },
+        { emit: 'entry', prefixes: ['و'], stripPrefixesFromLemma: false, use: 'inlineSubentry' },
+      ],
+    }],
+  },
+  maxPages: 1,
+});
+```
+
+Why this is preferred:
+- serializable JSON authoring shape
+- profile-scoped blockers instead of giant regex blobs
+- zone support for books that change layout later
+- compatible with diagnostics tooling via `diagnoseDictionaryProfile()`
+- first-class validation via `validateDictionaryProfile()`
+
+Blocker authoring notes:
+- `previousWord.scope` defaults to `'samePage'`
+- set `scope: 'pageStart'` to compare only against the previous page's last
+  Arabic word for page-start candidates
+- set `scope: 'any'` to combine the page-start cross-page check with the normal
+  same-page check
+- `pageContinuation.authorityPrecision` defaults to `'high'`; use
+  `'aggressive'` when page-start continuation filtering should treat
+  authority-like prefixes more conservatively
+- `qualifierTail` and `structuralLeak` are always-on global safety checks and
+  show up in diagnostics even though they are not zone-declared blockers
+
+The production dictionary implementation now lives under `src/dictionary/`
+inside the repo, separate from the generic segmentation internals.
+
+Dictionary runtime semantics:
+- `segmentPages()` is still the only entry point; dictionary profiles do not use
+  a separate API
+- dictionary split points are merged with ordinary `rules`
+- when a rule split and a dictionary split land at the same offset, metadata is
+  merged; if `debug` is enabled, `_flappa.rule` and `_flappa.dictionary` can
+  both appear on the same segment
+- for dictionary-only configs, content before the first detected entry/chapter
+  is preserved as a leading segment with no dictionary metadata
+
+#### Advanced: Single-Rule Arabic Dictionary Matching
+
+`createArabicDictionaryEntryRule()` and the native `dictionaryEntry` rule shape
+are still supported as the lower-level, advanced path for clients who want one
+Arabic dictionary-style matcher inside a broader `rules` pipeline.
+
+Use this path when:
+- you need exactly one conservative dictionary headword rule
+- you want to compose it with ordinary `SplitRule[]`
+- you do not need profile zones, per-family blockers, or full-book tuning
+
+Prefer the top-level `dictionary` profile when:
+- segmenting an entire dictionary book
+- persisting JSON config for a corpus
+- the book changes layout in different sections
+- you need diagnostics, rejection-reason rates, or book-specific profile tuning
+
+Decision guide:
+
+| Use case | Preferred API |
+|----------|---------------|
+| One conservative lemma matcher inside a normal segmentation pipeline | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
+| Full-book dictionary segmentation with blockers, families, and zones | top-level `dictionary` |
+| Persisted JSON config for real books | top-level `dictionary` |
+| Advanced composition with other `SplitRule[]` rules | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
+
+The helper returns a serializable native `dictionaryEntry` rule rather than an
+eagerly-compiled regex blob:
 
 ```typescript
 import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
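The serializability point in the decision guide above can be made concrete. In the sketch below, the `dictionaryEntry` key and option names are taken from the surrounding docs, but the exact rule shape is an assumption — check the package typings before relying on it.

```typescript
// A plain-object dictionaryEntry rule (shape assumed from the docs above)
// survives a JSON round-trip intact, which is what makes it safe to persist.
const rule = {
    dictionaryEntry: {
        stopWords: ['وقيل', 'ويقال'],
        allowCommaSeparated: true,
    },
};

const roundTripped = JSON.parse(JSON.stringify(rule));
console.log(roundTripped.dictionaryEntry.stopWords.length); // 2

// By contrast, a compiled RegExp silently serializes to an empty object,
// which is why "eagerly-compiled regex blobs" cannot be persisted as JSON.
console.log(JSON.stringify(/عنبر:/)); // {}
```

This round-trip property is the practical difference between the serializable rule shapes and hand-compiled `RegExp` values.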
@@ -400,6 +488,193 @@ Behavior:
 - Can match comma-separated headword lists like `سبد، دبس:` when enabled
 - Can suppress same-page false positives like `جلّ وعزّ:` with `samePagePrevWordStoplist`
 
+Option notes:
+- `stopWords`
+  - exact lemma-level blockers for non-lexical heads like `وقيل` or `ويقال`
+  - use this for rejecting candidate headwords themselves
+- `pageStartPrevWordStoplist`
+  - blocks a page-start candidate when the previous page ends with one of these
+    words
+  - useful for page-wrap false positives after citation/introduction prose
+- `samePagePrevWordStoplist`
+  - blocks a same-page candidate when the previous local word matches
+  - useful for phrases like `جلّ وعزّ`
+- `allowParenthesized`
+  - enables heads like `(عنبر):`
+- `allowWhitespaceBeforeColon`
+  - enables spacing variants like `عنبر :`
+- `allowCommaSeparated`
+  - enables grouped heads like `سبد، دبس:`
+- `midLineSubentries`
+  - when `true`, allows conservative same-line subentries such as `والعزاء:`
+  - when `false`, only line-start/page-start heads are emitted
+
+Serialization tradeoff:
+- `dictionaryEntry` is serializable and safe to keep in JSON
+- but it is still a single-rule primitive
+- if you need corpus-wide blocker tuning, families, or zones, move up to the
+  top-level `dictionary` profile
+
+Example: compose with chapter rules
+
+```typescript
+import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
+
+const segments = segmentPages(pages, {
+  rules: [
+    { lineStartsAfter: ['## '], meta: { type: 'chapter' } },
+    {
+      fuzzy: true,
+      lineStartsAfter: ['{{bab}} '],
+      meta: { type: 'chapter' },
+    },
+    createArabicDictionaryEntryRule({
+      stopWords: ['وقيل', 'ويقال', 'قال'],
+      pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
+      samePagePrevWordStoplist: ['جل'],
+      allowCommaSeparated: true,
+    }),
+  ],
+  breakpoints: ['{{tarqim}}'],
+  maxPages: 1,
+});
+```
+
+Example: one-off advanced rule inside a non-dictionary pipeline
+
+```typescript
+import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
+
+const segments = segmentPages(pages, {
+  rules: [
+    { lineStartsWith: ['{{kitab}}'], meta: { type: 'book' } },
+    { lineStartsWith: ['{{bab}}'], meta: { type: 'chapter' } },
+    createArabicDictionaryEntryRule({
+      stopWords: ['وقيل', 'ويقال'],
+      midLineSubentries: false,
+      allowParenthesized: true,
+    }),
+  ],
+});
+```
+
+Use `createArabicDictionaryEntryRule()` or `dictionaryEntry` when you only need
+one conservative dictionary matcher and want it to behave like a normal
+`SplitRule`.
+
+For full-book dictionary profiling, diagnostics, and book-specific tuning,
+prefer the top-level `dictionary` contract above.
+
+#### Repo Fixture Book Options
+
+The repo keeps book-specific golden options for the four reference Shamela
+dictionaries as local test/support fixtures, not as part of the public package
+API.
+
+If you want standalone JSON copies of those fixture options for your own local
+workflow, export them on demand:
+
+```bash
+bun run dictionary:export-options
+bun run dictionary:export-options -- --out-dir /path/to/dictionary-options
+```
+
+By default this writes to `out/dictionary-options/`, which is not intended to
+be checked into the repo.
+
+#### Dictionary Diagnostics
+
+Use `diagnoseDictionaryProfile()` when tuning blockers and families for a
+dictionary profile:
+
+```typescript
+import { diagnoseDictionaryProfile } from 'flappa-doormal';
+
+const diagnostics = diagnoseDictionaryProfile(pages, profile, {
+  sampleLimit: 25,
+});
+
+console.log(diagnostics.rejectionReasons);
+console.log(diagnostics.rejectedLemmas.slice(0, 10));
+```
+
+Returned diagnostics include:
+- accepted vs rejected candidate counts
+- accepted counts by `kind`
+- accepted/rejected counts by family and zone
+- rejection-reason counts (`intro`, `stopLemma`, `pageContinuation`,
+  `qualifierTail`, `structuralLeak`, etc.)
+- top rejected lemmas
+- sampled accepted/rejected candidates for quick inspection
+
+`diagnoseDictionaryProfile()` is primarily a tuning API for profile authoring,
+so consumers should treat its output shape as less stable than the segmentation
+API itself.
+
+Validate profiles before persisting them or shipping them to an editor/CI step:
+
+```typescript
+import { validateDictionaryProfile } from 'flappa-doormal';
+
+const issues = validateDictionaryProfile(profile);
+if (issues.length > 0) {
+  console.error(issues);
+}
+```
+
+Validation catches:
+- empty or duplicate zones
+- invalid gate shapes
+- empty blocker lists
+- inert heading families (for example, a heading family that emits `entry` but
+  never matches `entry` headings)
+
+The runtime throws `DictionaryProfileValidationError` if invalid profiles reach
+`segmentPages()` or `diagnoseDictionaryProfile()`.
+
+#### Dictionary Surface Analysis
+
+For corpus exploration and profile authoring, the library also exposes the
+heading/surface scanner used during the proposal phase:
+
+```typescript
+import {
+  analyzeDictionaryMarkdownPages,
+  classifyDictionaryHeading,
+  scanDictionaryMarkdownPage,
+} from 'flappa-doormal';
+
+const kind = classifyDictionaryHeading('## (خَ غ)');
+const pageMatches = scanDictionaryMarkdownPage(page);
+const report = analyzeDictionaryMarkdownPages(pages);
+```
+
+Use these for:
+- inspecting `convertContentToMarkdown()` output before profile authoring
+- spotting structural marker/code lines
+- building your own authoring tools around the same heading classifier
+
+These are analysis helpers, not a replacement for the full runtime.
+
+For full-book scans, use the bundled script:
+
+```bash
+bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
+bun run dictionary:scan -- --book 7031 --books-dir /path/to/books --json
+bun run dictionary:scan -- --book 1687 --input /path/to/1687.json --out diagnostics/1687.txt
+```
+
+The scan script:
+- reads an explicit `--input` file or resolves `<books-dir>/<book>.json`
+- converts each page with `convertContentToMarkdown()`
+- applies `removeZeroWidth`
+- runs `diagnoseDictionaryProfile()` with the repo-local golden profile fixture
+  for that book
+
+The test suite does not require the full Shamela corpora. It uses extracted
+markdown fixtures under `testing/fixtures/dictionary-books/`, so moving your
+local `books/` directory will not break CI or the built-in tests.
+
 #### Dictionary Letter-Code Lines
 
 For dictionary-specific letter-code lines like `ك ش ن` or `(هـ ث)`, use
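The diagnostics output described above lends itself to simple post-processing while tuning. The sketch below uses hand-written sample data; the field names `accepted`, `rejected`, and `rejectionReasons` follow the bullet list, but treat them as assumptions about the real output shape (which the docs themselves flag as less stable).

```typescript
// Sketch: summarizing a diagnostics result of the documented shape.
// `diagnostics` here is hand-written sample data, not real library output.
const diagnostics = {
    accepted: 180,
    rejected: 20,
    rejectionReasons: { stopLemma: 12, pageContinuation: 5, qualifierTail: 3 },
};

// Overall rejection rate — a quick health metric while tuning blockers.
const total = diagnostics.accepted + diagnostics.rejected;
const rejectionRate = diagnostics.rejected / total;

// Most common rejection reason: the natural first target when adjusting
// stoplists or blocker configuration.
const topReason = Object.entries(diagnostics.rejectionReasons)
    .sort((a, b) => b[1] - a[1])[0][0];

console.log(rejectionRate, topReason); // 0.1 stopLemma
```

A loop like this — adjust the profile, re-run diagnostics, watch the top rejection reason — is the tuning workflow the diagnostics API is aimed at.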
@@ -561,11 +836,6 @@ Pass an optional `logger` to trace segmentation decisions or enable `debug` to a
 const segments = segmentPages(pages, {
   rules: [...],
   debug: true, // Enables detailed match metadata
-  logger: {
-    debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
-    info: (msg, data) => console.info(`[INFO] ${msg}`, data),
-    warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
-    error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
   logger: {
     debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
     info: (msg, data) => console.info(`[INFO] ${msg}`, data),
@@ -620,7 +890,35 @@ If a segment was created by a `breakpoint` pattern (e.g. because it exceeded `ma
 }
 ```
 
-**3.
+**3. Dictionary-based Splits**
+If a segment was created by a dictionary profile:
+```json
+{
+  "meta": {
+    "_flappa": {
+      "dictionary": {
+        "family": "lineEntry"
+      }
+    }
+  }
+}
+```
+
+Heading-driven dictionary splits can also record the heading class:
+```json
+{
+  "meta": {
+    "_flappa": {
+      "dictionary": {
+        "family": "heading",
+        "headingClass": "chapter"
+      }
+    }
+  }
+}
+```
+
+**4. Safety Fallback Splits (`maxContentLength`)**
 If no rule or breakpoint matched and the library was forced to perform a safety fallback split:
 ```json
 {
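The `_flappa.dictionary` shapes shown in that hunk are easy to consume downstream. The sketch below filters hand-written sample segments by family and heading class; the sample data mirrors the documented metadata shape, but is not real library output.

```typescript
// Sketch: consuming the documented _flappa.dictionary metadata.
// The segment objects are hand-written samples in the shape shown above.
interface DictionaryMeta {
    family: string;
    headingClass?: string;
}

interface Segment {
    content: string;
    meta?: { _flappa?: { dictionary?: DictionaryMeta } };
}

const segments: Segment[] = [
    { content: 'باب الألف', meta: { _flappa: { dictionary: { family: 'heading', headingClass: 'chapter' } } } },
    { content: 'عنبر: ...', meta: { _flappa: { dictionary: { family: 'lineEntry' } } } },
    { content: 'leading prose', meta: {} },
];

// Separate dictionary entries from chapter headings; untagged leading
// content (no dictionary metadata) falls through both filters.
const entries = segments.filter((s) => s.meta?._flappa?.dictionary?.family === 'lineEntry');
const chapters = segments.filter((s) => s.meta?._flappa?.dictionary?.headingClass === 'chapter');

console.log(entries.length, chapters.length); // 1 1
```

Optional chaining keeps the filters safe for segments with no `_flappa` block at all, such as the preserved leading content mentioned earlier in the README.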
@@ -1099,6 +1397,80 @@ const segments = segmentPages(pages, { rules });
 // ]
 ```
 
+## Agent Advisor Workflow
+
+If you want an AI agent to start from raw pages and get to a draft configuration with less hand-written glue, use `suggestSegmentationOptions()`:
+
+```typescript
+import { suggestSegmentationOptions } from 'flappa-doormal';
+
+const report = suggestSegmentationOptions(pages, {
+  maxRules: 4,
+  topLineStarts: 12,
+  topRepeatingSequences: 8,
+});
+
+console.log(report.assessment);
+console.log(report.recommendedOptions);
+console.log(report.ruleSuggestions.slice(0, 5));
+```
+
+The report includes:
+
+- preprocess cleanup hints (`removeZeroWidth`, `condenseEllipsis`, `fixTrailingWaw`)
+- an assessment of whether the book looks `structured`, `continuous`, or `mixed`
+- draft `SplitRule[]` suggestions with examples and confidence
+- a ready-to-run `recommendedOptions` object
+- rule validation output
+- self-evaluation of the generated segmentation draft
+- optional breakpoint suggestions when the draft still produces very large segments
+
+For local JSON files, you can run the bundled script:
+
+```bash
+bun run segment:advise -- --input ./pages.json
+bun run segment:advise -- --input ./book.json --format markdown --out ./segmentation-report.md
+```
+
+Input can be either:
+
+- `Page[]`
+- `{ pages: Page[] }`
+
+## MCP Server
+
+The repo now includes a stdio MCP server wrapper for agent workflows:
+
+```bash
+bun run mcp:serve
+```
+
+When packaged, the server binary is:
+
+```bash
+flappa-doormal-mcp
+```
+
+Exposed MCP tools:
+
+- `inspect_book`
+  Input: `{ pages, advisorOptions? }`
+  Returns preprocess detections, line-start analysis, repeating sequences, and draft rule suggestions.
+- `suggest_segmentation_options`
+  Input: `{ pages, advisorOptions? }`
+  Returns the full advisor report, including `recommendedOptions`.
+- `preview_segmentation`
+  Input: `{ pages, options, sampleSegments? }`
+  Runs segmentation and returns segments, samples, and validation.
+- `validate_segmentation`
+  Input: `{ pages, options, segments }`
+  Validates caller-provided segments against the source book.
+- `score_candidate_options`
+  Input: `{ pages, candidates, sampleSegments? }`
+  Ranks multiple `SegmentationOptions` candidates using validation and segment-shape heuristics.
+
+All tool results are returned as JSON-friendly objects so agents can iterate without scraping prose output.
+
 ## Advanced: Metadata Extraction & Data Migration
 
 If you already have pre-segmented data (e.g., records from a database or JSON file) and want to use **flappa-doormal's** token system to extract metadata and clean the content without further splitting, you can use the **Metadata Extraction** pattern.
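The `score_candidate_options` tool above ranks candidates "using validation and segment-shape heuristics". The toy below illustrates only the segment-shape half with a hypothetical unevenness heuristic and hand-written data — it is not the tool's actual scoring.

```typescript
// Toy illustration of segment-shape ranking (hypothetical heuristic and
// sample data; the real tool also incorporates validation results).
interface Candidate {
    name: string;
    segmentLengths: number[];
}

const candidates: Candidate[] = [
    { name: 'rules-only', segmentLengths: [12000, 90, 80] },
    { name: 'rules-plus-breakpoints', segmentLengths: [900, 800, 950, 700] },
];

// Penalize uneven segment sizes: a lower max/mean ratio ranks higher.
function unevenness(c: Candidate): number {
    const mean = c.segmentLengths.reduce((a, b) => a + b, 0) / c.segmentLengths.length;
    return Math.max(...c.segmentLengths) / mean;
}

const ranked = [...candidates].sort((a, b) => unevenness(a) - unevenness(b));
console.log(ranked[0].name); // rules-plus-breakpoints
```

This is the same intuition behind the README's breakpoint suggestions: a candidate that still produces one huge segment alongside tiny ones scores worse than one with evenly sized segments.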