flappa-doormal 2.19.0 → 2.20.0

package/AGENTS.md CHANGED
@@ -30,6 +30,12 @@ src/
30
30
  │ ├── breakpoints.ts # Breakpoint types
31
31
  │ ├── options.ts # SegmentationOptions and Logger
32
32
  │ └── segmenter.ts # Internal segmenter types
33
+ ├── dictionary/ # Dictionary-specific compiler, runtime, profiles, diagnostics
34
+ │ ├── arabic-dictionary-rule.ts
35
+ │ ├── heading-classifier.ts
36
+ │ ├── profile.ts
37
+ │ ├── profiles.ts
38
+ │ └── runtime.ts
33
39
  ├── analysis/ # Pattern discovery module
34
40
  │ ├── line-starts.ts # analyzeCommonLineStarts (frequent line markers)
35
41
  │ ├── repeating-sequences.ts # analyzeRepeatingSequences (N-grams)
@@ -56,6 +62,22 @@ src/
56
62
  ├── detection.ts # Pattern auto-detection (standalone)
57
63
  └── *.test.ts # Unit and integration tests (co-located)
58
64
 
65
+ testing/
66
+ ├── exports.test.ts # Public export contract test
67
+ └── fixtures/
68
+ ├── README.md # Fixture purpose, source, and refresh workflow
69
+ ├── dictionary-book-options.ts # Local golden options for the four reference dictionaries
70
+ ├── dictionary-books.ts # Test fixture loader helpers
71
+ ├── dictionary-fixture-manifest.ts
72
+ └── dictionary-books/ # Extracted markdown pages used by integration tests
73
+
74
+ scripts/
75
+ ├── analyze-dictionary-profile.ts # Full-book diagnostics against an explicit input file/books dir
76
+ ├── export-dictionary-book-options.ts
77
+ ├── extract-dictionary-test-fixtures.ts
78
+ ├── generate-dictionary-html-previews.ts
79
+ └── split-dictionary-csvs.ts
80
+
59
81
  ### Core Components
60
82
 
61
83
  1. **`segmentPages(pages, options)`** - Main entry point (`src/segmentation/segmenter.ts`)
@@ -386,7 +408,8 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
386
408
 
387
409
  - **Unit tests**: Each utility function has dedicated tests
388
410
  - **Integration tests**: Full pipeline tests in `src/segmentation/segmenter.test.ts`
389
- - **Real-world tests**: `src/segmentation/segmenter.bukhari.test.ts` uses actual hadith data
411
+ - **Dictionary integration tests**: `src/dictionary/*.test.ts` use extracted markdown fixtures under `testing/fixtures/dictionary-books/`
412
+ - **Optional corpus tooling**: full-book diagnostics/preview scripts can use external Shamela JSONs via `--input` or `--books-dir`, but the test suite does not require a local `books/` directory
390
413
  - **Style convention**: Prefer `it('should ...', () => { ... })` (Bun) for consistency across the suite
391
414
  - Run: `bun test`
392
415
 
@@ -395,7 +418,7 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
395
418
  1. **TypeScript strict mode** - No `any` types
396
419
  2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
397
420
  3. **JSDoc comments** - All exported functions documented
398
- 4. **Test coverage** - 642 tests across 21 files
421
+ 4. **Test coverage** - keep coverage representative; do not rely on local corpora for CI
399
422
 
400
423
  ## Dependencies
401
424
 
@@ -417,14 +440,26 @@ bun test
417
440
  bun run build
418
441
  # Output: dist/index.mjs (~17 KB gzip ~5.7 KB)
419
442
 
420
- # Run performance test (generates 50K pages, measures segmentation speed/memory)
421
- bun run perf
443
+ # Run performance tests
444
+ bun run test:perf
445
+
446
+ # Regenerate extracted dictionary test fixtures (requires external books dir if not using ./books)
447
+ bun run dictionary:extract-fixtures -- --books-dir /path/to/books
448
+
449
+ # Export built-in dictionary options (writes to out/dictionary-options by default)
450
+ bun run dictionary:export-options
451
+
452
+ # Scan a full book with a builtin dictionary profile
453
+ bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
454
+
455
+ # Validate a dictionary profile shape in userland
456
+ # (public API: validateDictionaryProfile(profile))
422
457
 
423
458
  # Format code
424
459
  bunx biome format --write .
425
460
 
426
461
  # Lint code
427
- bunx biome lint .
462
+ bunx biome check .
428
463
  ```
429
464
 
430
465
  ## Lessons Learned
@@ -605,12 +640,6 @@ bunx biome lint .
605
640
 
606
641
  57. **Validation Hints Specificity**: Generic error hints like "Check segmenter.ts" are unhelpful. Provide specific file names and logical components (e.g., "Check maxPages windowing in breakpoint-processor.ts"). User-friendly validation reports guide debugging much faster than "Something is wrong".
607
642
 
608
- ### Process Template (Multi-agent design review, TDD-first)
609
-
610
- If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
611
-
612
- - `docs/ai-multi-agent-tdd-template.md`
613
-
614
643
  ### Architecture Insights
615
644
 
616
645
  - **Declarative > Imperative**: Users describe patterns, library handles regex
package/README.md CHANGED
@@ -349,12 +349,88 @@ const segments = segmentPages(pages, {
349
349
  If the previous page ends with strong sentence punctuation (`.`, `!`, `?`, `؟`, `؛`),
350
350
  the stoplist guard is skipped and the page-start match is allowed.
351
351
 
352
- #### Arabic Dictionary Helper
352
+ #### Preferred Dictionary Profile
353
353
 
354
- Use `createArabicDictionaryEntryRule()` to build a conservative rule for Arabic
355
- dictionaries with lemma capture, stopword filtering, and page-wrap protection.
356
- The helper now returns a serializable native `dictionaryEntry` rule rather than
357
- an eagerly-compiled regex blob:
354
+ For new Shamela-style dictionary work, prefer the top-level `dictionary`
355
+ profile over hand-built raw regexes or the older one-rule helper:
356
+
357
+ ```typescript
358
+ import { segmentPages } from 'flappa-doormal';
359
+
360
+ const segments = segmentPages(pages, {
361
+ breakpoints: ['{{tarqim}}'],
362
+ dictionary: {
363
+ version: 2,
364
+ zones: [{
365
+ name: 'main',
366
+ blockers: [
367
+ { appliesTo: ['lineEntry', 'inlineSubentry'], use: 'pageContinuation' },
368
+ { appliesTo: ['lineEntry', 'inlineSubentry'], use: 'intro' },
369
+ {
370
+ appliesTo: ['lineEntry', 'inlineSubentry'],
371
+ use: 'stopLemma',
372
+ words: ['ومعناه', 'ويقال', 'وقيل']
373
+ },
374
+ ],
375
+ families: [
376
+ { classes: ['chapter'], emit: 'chapter', use: 'heading' },
377
+ { emit: 'entry', use: 'lineEntry', wrappers: 'none' },
378
+ { emit: 'entry', prefixes: ['و'], stripPrefixesFromLemma: false, use: 'inlineSubentry' },
379
+ ],
380
+ }],
381
+ },
382
+ maxPages: 1,
383
+ });
384
+ ```
385
+
386
+ Why this is preferred:
387
+ - serializable JSON authoring shape
388
+ - profile-scoped blockers instead of giant regex blobs
389
+ - zone support for books that change layout later
390
+ - compatible with diagnostics tooling via `diagnoseDictionaryProfile()`
391
+ - first-class validation via `validateDictionaryProfile()`
392
+
393
+ The production dictionary implementation now lives under `src/dictionary/`
394
+ inside the repo, separate from the generic segmentation internals.
395
+
396
+ Dictionary runtime semantics:
397
+ - `segmentPages()` is still the only entry point; dictionary profiles do not use
398
+ a separate API
399
+ - dictionary split points are merged with ordinary `rules`
400
+ - when a rule split and a dictionary split land at the same offset, metadata is
401
+ merged; if `debug` is enabled, `_flappa.rule` and `_flappa.dictionary` can
402
+ both appear on the same segment
403
+ - for dictionary-only configs, content before the first detected entry/chapter
404
+ is preserved as a leading segment with no dictionary metadata
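
The merge rule in the list above (a rule split and a dictionary split at the same offset share one segment boundary and combine their metadata) can be illustrated with a small standalone sketch. The `SplitPoint` shape and the `mergeSplitPoints` helper are hypothetical names made up for this illustration and are not part of the package's public API; the real implementation may order or merge fields differently.

```typescript
// Hypothetical shape for illustration only; not the library's internal type.
type SplitPoint = {
    offset: number;
    meta?: Record<string, unknown>;
    // Stand-in for the `_flappa.rule` / `_flappa.dictionary` debug payloads.
    debug?: { rule?: unknown; dictionary?: unknown };
};

// Combine two lists of split points; entries at the same offset are merged
// into one split whose metadata and debug payloads are shallow-merged.
const mergeSplitPoints = (ruleSplits: SplitPoint[], dictionarySplits: SplitPoint[]): SplitPoint[] => {
    const byOffset = new Map<number, SplitPoint>();
    for (const split of [...ruleSplits, ...dictionarySplits]) {
        const existing = byOffset.get(split.offset);
        byOffset.set(
            split.offset,
            existing
                ? {
                      offset: split.offset,
                      meta: { ...existing.meta, ...split.meta },
                      debug: { ...existing.debug, ...split.debug },
                  }
                : split,
        );
    }
    // Return the splits in document order.
    return [...byOffset.values()].sort((a, b) => a.offset - b.offset);
};

const merged = mergeSplitPoints(
    [{ offset: 0, meta: { type: 'chapter' }, debug: { rule: { index: 0 } } }],
    [
        { offset: 0, meta: { kind: 'chapter' }, debug: { dictionary: { family: 'heading' } } },
        { offset: 42, meta: { kind: 'entry' }, debug: { dictionary: { family: 'lineEntry' } } },
    ],
);
console.log(merged);
```

In this sketch the split at offset 0 ends up carrying both a `rule` and a `dictionary` debug payload, which mirrors the documented case where `_flappa.rule` and `_flappa.dictionary` appear on the same segment.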
405
+
406
+ #### Advanced: Single-Rule Arabic Dictionary Matching
407
+
408
+ `createArabicDictionaryEntryRule()` and the native `dictionaryEntry` rule shape
409
+ are still supported as the lower-level, advanced path for clients who want one
410
+ Arabic dictionary-style matcher inside a broader `rules` pipeline.
411
+
412
+ Use this path when:
413
+ - you need exactly one conservative dictionary headword rule
414
+ - you want to compose it with ordinary `SplitRule[]`
415
+ - you do not need profile zones, per-family blockers, or full-book tuning
416
+
417
+ Prefer the top-level `dictionary` profile when:
418
+ - segmenting an entire dictionary book
419
+ - persisting JSON config for a corpus
420
+ - the book changes layout in different sections
421
+ - you need diagnostics, blocker hit rates, or book-specific profile tuning
422
+
423
+ Decision guide:
424
+
425
+ | Use case | Preferred API |
426
+ |----------|---------------|
427
+ | One conservative lemma matcher inside a normal segmentation pipeline | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
428
+ | Full-book dictionary segmentation with blockers, families, and zones | top-level `dictionary` |
429
+ | Persisted JSON config for real books | top-level `dictionary` |
430
+ | Advanced composition with other `SplitRule[]` rules | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
431
+
432
+ The helper returns a serializable native `dictionaryEntry` rule rather than an
433
+ eagerly-compiled regex blob:
358
434
 
359
435
  ```typescript
360
436
  import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
@@ -400,6 +476,188 @@ Behavior:
400
476
  - Can match comma-separated headword lists like `سبد، دبس:` when enabled
401
477
  - Can suppress same-page false positives like `جلّ وعزّ:` with `samePagePrevWordStoplist`
402
478
 
479
+ Option notes:
480
+ - `stopWords`
481
+ - exact lemma-level blockers for non-lexical heads like `وقيل` or `ويقال`
482
+ - use this for rejecting candidate headwords themselves
483
+ - `pageStartPrevWordStoplist`
484
+ - blocks a page-start candidate when the previous page ends with one of these
485
+ words
486
+ - useful for page-wrap false positives after citation/introduction prose
487
+ - `samePagePrevWordStoplist`
488
+ - blocks a same-page candidate when the previous local word matches
489
+ - useful for phrases like `جلّ وعزّ`
490
+ - `allowParenthesized`
491
+ - enables heads like `(عنبر):`
492
+ - `allowWhitespaceBeforeColon`
493
+ - enables spacing variants like `عنبر :`
494
+ - `allowCommaSeparated`
495
+ - enables grouped heads like `سبد، دبس:`
496
+ - `midLineSubentries`
497
+ - when `true`, allows conservative same-line subentries such as `والعزاء:`
498
+ - when `false`, only line-start/page-start heads are emitted
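
As a rough illustration of how `pageStartPrevWordStoplist` interacts with the strong-punctuation exception described earlier in the README, the following standalone sketch decides whether a page-start candidate would be blocked. `isPageStartBlocked` is a hypothetical helper written for this example; it assumes plain whitespace tokenization and exact string comparison, whereas the library may normalize diacritics or trailing punctuation differently.

```typescript
// Strong sentence punctuation listed in the README: if the previous page ends
// with one of these, the stoplist guard is skipped.
const STRONG_PUNCTUATION = new Set(['.', '!', '?', '؟', '؛']);

// Hypothetical helper: returns true when a page-start candidate should be
// rejected because the previous page ends with a stoplisted word.
const isPageStartBlocked = (previousPageText: string, stoplist: string[]): boolean => {
    const trimmed = previousPageText.trimEnd();
    if (trimmed.length === 0) {
        return false; // nothing precedes the candidate, so nothing can block it
    }
    if (STRONG_PUNCTUATION.has(trimmed.slice(-1))) {
        return false; // the sentence ended cleanly; the guard is skipped
    }
    const lastWord = trimmed.split(/\s+/).pop() ?? '';
    return stoplist.includes(lastWord);
};

console.log(isPageStartBlocked('وهذا كما قال', ['قال', 'وقيل'])); // previous page ends mid-citation
console.log(isPageStartBlocked('انتهى الكلام.', ['قال', 'وقيل'])); // previous page ends a sentence
```

The first call models the page-wrap false positive the option is meant to suppress; the second models the documented exception where a finished sentence lets the page-start match through.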
499
+
500
+ Serialization tradeoff:
501
+ - `dictionaryEntry` is serializable and safe to keep in JSON
502
+ - but it is still a single-rule primitive
503
+ - if you need corpus-wide blocker tuning, families, or zones, move up to the
504
+ top-level `dictionary` profile
505
+
506
+ Example: compose with chapter rules
507
+
508
+ ```typescript
509
+ import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
510
+
511
+ const segments = segmentPages(pages, {
512
+ rules: [
513
+ { lineStartsAfter: ['## '], meta: { type: 'chapter' } },
514
+ {
515
+ fuzzy: true,
516
+ lineStartsAfter: ['{{bab}} '],
517
+ meta: { type: 'chapter' },
518
+ },
519
+ createArabicDictionaryEntryRule({
520
+ stopWords: ['وقيل', 'ويقال', 'قال'],
521
+ pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
522
+ samePagePrevWordStoplist: ['جل'],
523
+ allowCommaSeparated: true,
524
+ }),
525
+ ],
526
+ breakpoints: ['{{tarqim}}'],
527
+ maxPages: 1,
528
+ });
529
+ ```
530
+
531
+ Example: one-off advanced rule inside a non-dictionary pipeline
532
+
533
+ ```typescript
534
+ import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
535
+
536
+ const segments = segmentPages(pages, {
537
+ rules: [
538
+ { lineStartsWith: ['{{kitab}}'], meta: { type: 'book' } },
539
+ { lineStartsWith: ['{{bab}}'], meta: { type: 'chapter' } },
540
+ createArabicDictionaryEntryRule({
541
+ stopWords: ['وقيل', 'ويقال'],
542
+ midLineSubentries: false,
543
+ allowParenthesized: true,
544
+ }),
545
+ ],
546
+ });
547
+ ```
548
+
549
+ Use `createArabicDictionaryEntryRule()` or `dictionaryEntry` when you only need
550
+ one conservative dictionary matcher and want it to behave like a normal
551
+ `SplitRule`.
552
+
553
+ For full-book dictionary profiling, diagnostics, and book-specific tuning,
554
+ prefer the top-level `dictionary` contract above.
555
+
556
+ #### Repo Fixture Book Options
557
+
558
+ The repo keeps book-specific golden options for the four reference Shamela
559
+ dictionaries as local test/support fixtures, not as part of the public package
560
+ API.
561
+
562
+ If you want standalone JSON copies of those fixture options for your own local
563
+ workflow, export them on demand:
564
+
565
+ ```bash
566
+ bun run dictionary:export-options
567
+ bun run dictionary:export-options -- --out-dir /path/to/dictionary-options
568
+ ```
569
+
570
+ By default this writes to `out/dictionary-options/`, which is not intended to
571
+ be checked into the repo.
572
+
573
+ #### Dictionary Diagnostics
574
+
575
+ Use `diagnoseDictionaryProfile()` when tuning blockers and families for a
576
+ dictionary profile:
577
+
578
+ ```typescript
579
+ import { diagnoseDictionaryProfile } from 'flappa-doormal';
580
+
581
+ const diagnostics = diagnoseDictionaryProfile(pages, profile, {
582
+ sampleLimit: 25,
583
+ });
584
+
585
+ console.log(diagnostics.blockerHits);
586
+ console.log(diagnostics.rejectedLemmas.slice(0, 10));
587
+ ```
588
+
589
+ Returned diagnostics include:
590
+ - accepted vs rejected candidate counts
591
+ - accepted counts by `kind`
592
+ - accepted/rejected counts by family and zone
593
+ - blocker hit counts (`intro`, `stopLemma`, `pageContinuation`, etc.)
594
+ - top rejected lemmas
595
+ - sampled accepted/rejected candidates for quick inspection
596
+
597
+ Validate profiles before persisting them or shipping them to an editor/CI step:
598
+
599
+ ```typescript
600
+ import { validateDictionaryProfile } from 'flappa-doormal';
601
+
602
+ const issues = validateDictionaryProfile(profile);
603
+ if (issues.length > 0) {
604
+ console.error(issues);
605
+ }
606
+ ```
607
+
608
+ Validation catches:
609
+ - empty or duplicate zones
610
+ - invalid gate shapes
611
+ - empty blocker lists
612
+ - inert heading families (for example, a heading family that emits `entry` but
613
+ never matches `entry` headings)
614
+
615
+ The runtime throws `DictionaryProfileValidationError` if invalid profiles reach
616
+ `segmentPages()` or `diagnoseDictionaryProfile()`.
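
To make two of the checks listed above concrete (empty or duplicate zones, empty blocker lists), here is a minimal standalone sketch of the kind of shape check involved. Everything in it is an assumption made for illustration: `SketchProfile`, `checkProfileShape`, and the issue strings are hypothetical, and "empty blocker list" is read here as an empty `words` array on a blocker. In real code, call `validateDictionaryProfile(profile)` rather than reimplementing it.

```typescript
// Reduced, hypothetical profile shape based on the fields shown in the README example.
type SketchBlocker = { use: string; appliesTo: string[]; words?: string[] };
type SketchZone = { name: string; blockers?: SketchBlocker[] };
type SketchProfile = { version: number; zones: SketchZone[] };

// Hypothetical check: returns human-readable issues, empty when none are found.
const checkProfileShape = (profile: SketchProfile): string[] => {
    const issues: string[] = [];
    if (profile.zones.length === 0) {
        issues.push('profile has no zones');
    }
    const seenNames = new Set<string>();
    for (const zone of profile.zones) {
        if (seenNames.has(zone.name)) {
            issues.push(`duplicate zone name: ${zone.name}`);
        }
        seenNames.add(zone.name);
        for (const blocker of zone.blockers ?? []) {
            // Assumption: a blocker that declares `words` must list at least one.
            if (blocker.words !== undefined && blocker.words.length === 0) {
                issues.push(`zone ${zone.name}: blocker ${blocker.use} has an empty words list`);
            }
        }
    }
    return issues;
};

const issuesFound = checkProfileShape({
    version: 2,
    zones: [
        { name: 'main' },
        { name: 'main', blockers: [{ appliesTo: ['lineEntry'], use: 'stopLemma', words: [] }] },
    ],
});
console.log(issuesFound);
```

Returning a list of issues instead of throwing matches the `issues.length > 0` usage shown for `validateDictionaryProfile()`, and lets an editor or CI step report every problem in one pass.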
617
+
618
+ #### Dictionary Surface Analysis
619
+
620
+ For corpus exploration and profile authoring, the library also exposes the
621
+ heading/surface scanner used during the proposal phase:
622
+
623
+ ```typescript
624
+ import {
625
+ analyzeDictionaryMarkdownPages,
626
+ classifyDictionaryHeading,
627
+ scanDictionaryMarkdownPage,
628
+ } from 'flappa-doormal';
629
+
630
+ const kind = classifyDictionaryHeading('## (خَ غ)');
631
+ const pageMatches = scanDictionaryMarkdownPage(page);
632
+ const report = analyzeDictionaryMarkdownPages(pages);
633
+ ```
634
+
635
+ Use these for:
636
+ - inspecting `convertContentToMarkdown()` output before profile authoring
637
+ - spotting structural marker/code lines
638
+ - building your own authoring tools around the same heading classifier
639
+
640
+ These are analysis helpers, not a replacement for the full runtime.
641
+
642
+ For full-book scans, use the bundled script:
643
+
644
+ ```bash
645
+ bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
646
+ bun run dictionary:scan -- --book 7031 --books-dir /path/to/books --json
647
+ bun run dictionary:scan -- --book 1687 --input /path/to/1687.json --out diagnostics/1687.txt
648
+ ```
649
+
650
+ The scan script:
651
+ - reads an explicit `--input` file or resolves `<books-dir>/<book>.json`
652
+ - converts each page with `convertContentToMarkdown()`
653
+ - applies `removeZeroWidth`
654
+ - runs `diagnoseDictionaryProfile()` with the repo-local golden profile fixture
655
+ for that book
656
+
657
+ The test suite does not require the full Shamela corpora. It uses extracted
658
+ markdown fixtures under `testing/fixtures/dictionary-books/`, so moving your
659
+ local `books/` directory will not break CI or the built-in tests.
660
+
403
661
  #### Dictionary Letter-Code Lines
404
662
 
405
663
  For dictionary-specific letter-code lines like `ك ش ن` or `(هـ ث)`, use
@@ -561,11 +819,6 @@ Pass an optional `logger` to trace segmentation decisions or enable `debug` to a
561
819
  const segments = segmentPages(pages, {
562
820
  rules: [...],
563
821
  debug: true, // Enables detailed match metadata
564
- logger: {
565
- debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
566
- info: (msg, data) => console.info(`[INFO] ${msg}`, data),
567
- warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
568
- error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
569
822
  logger: {
570
823
  debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
571
824
  info: (msg, data) => console.info(`[INFO] ${msg}`, data),
@@ -620,7 +873,35 @@ If a segment was created by a `breakpoint` pattern (e.g. because it exceeded `ma
620
873
  }
621
874
  ```
622
875
 
623
- **3. Safety Fallback Splits (`maxContentLength`)**
876
+ **3. Dictionary-based Splits**
877
+ If a segment was created by a dictionary profile:
878
+ ```json
879
+ {
880
+ "meta": {
881
+ "_flappa": {
882
+ "dictionary": {
883
+ "family": "lineEntry"
884
+ }
885
+ }
886
+ }
887
+ }
888
+ ```
889
+
890
+ Heading-driven dictionary splits can also record the heading class:
891
+ ```json
892
+ {
893
+ "meta": {
894
+ "_flappa": {
895
+ "dictionary": {
896
+ "family": "heading",
897
+ "headingClass": "chapter"
898
+ }
899
+ }
900
+ }
901
+ }
902
+ ```
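
When `debug` is enabled, the metadata shapes above can be used to tally which dictionary family produced each segment. The sketch below is a standalone illustration: `SegmentLike` is a reduced, hypothetical stand-in for the library's segment type, and `countByFamily` is not a package export.

```typescript
// Reduced stand-ins for the debug metadata shown in the JSON examples above.
type DictionaryDebug = { family: string; headingClass?: string };
type SegmentLike = {
    content: string;
    meta?: { _flappa?: { dictionary?: DictionaryDebug } };
};

// Count segments per dictionary family; segments without dictionary metadata
// (for example, leading content before the first entry) land in "(none)".
const countByFamily = (segments: SegmentLike[]): Record<string, number> => {
    const counts: Record<string, number> = {};
    for (const segment of segments) {
        const family = segment.meta?._flappa?.dictionary?.family ?? '(none)';
        counts[family] = (counts[family] ?? 0) + 1;
    }
    return counts;
};

const familyCounts = countByFamily([
    { content: 'مقدمة' },
    { content: 'باب', meta: { _flappa: { dictionary: { family: 'heading', headingClass: 'chapter' } } } },
    { content: 'عنبر', meta: { _flappa: { dictionary: { family: 'lineEntry' } } } },
    { content: 'سبد', meta: { _flappa: { dictionary: { family: 'lineEntry' } } } },
]);
console.log(familyCounts);
```

A tally like this is a quick sanity check while tuning a profile; for blocker hit rates and rejected candidates, `diagnoseDictionaryProfile()` remains the documented tool.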
903
+
904
+ **4. Safety Fallback Splits (`maxContentLength`)**
624
905
  If no rule or breakpoint matched and the library was forced to perform a safety fallback split:
625
906
  ```json
626
907
  {