flappa-doormal 2.19.0 → 2.21.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +63 -11
- package/README.md +383 -11
- package/dist/index.d.mts +440 -132
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +2 -4445
- package/dist/index.mjs.map +1 -1
- package/dist/mcp/server.d.mts +1 -0
- package/dist/mcp/server.mjs +156 -0
- package/dist/mcp/server.mjs.map +1 -0
- package/dist/segmentation-advisor-D375TL8-.mjs +6128 -0
- package/dist/segmentation-advisor-D375TL8-.mjs.map +1 -0
- package/package.json +18 -4
package/AGENTS.md
CHANGED
@@ -30,6 +30,17 @@ src/
 │ ├── breakpoints.ts # Breakpoint types
 │ ├── options.ts # SegmentationOptions and Logger
 │ └── segmenter.ts # Internal segmenter types
+├── dictionary/ # Dictionary-specific compiler, runtime, profiles, diagnostics
+│ ├── arabic-dictionary-rule.ts
+│ ├── constants.ts
+│ ├── dictionary-blockers.ts
+│ ├── dictionary-candidates.ts
+│ ├── dictionary-diagnostics.ts
+│ ├── dictionary-zones.ts
+│ ├── heading-classifier.ts
+│ ├── profile.ts
+│ ├── profiles.ts
+│ └── runtime.ts
 ├── analysis/ # Pattern discovery module
 │ ├── line-starts.ts # analyzeCommonLineStarts (frequent line markers)
 │ ├── repeating-sequences.ts # analyzeRepeatingSequences (N-grams)
@@ -56,6 +67,22 @@ src/
 ├── detection.ts # Pattern auto-detection (standalone)
 └── *.test.ts # Unit and integration tests (co-located)
 
+testing/
+├── exports.test.ts # Public export contract test
+└── fixtures/
+    ├── README.md # Fixture purpose, source, and refresh workflow
+    ├── dictionary-book-options.ts # Local golden options for the four reference dictionaries
+    ├── dictionary-books.ts # Test fixture loader helpers
+    ├── dictionary-fixture-manifest.ts
+    └── dictionary-books/ # Extracted markdown pages used by integration tests
+
+scripts/
+├── analyze-dictionary-profile.ts # Full-book diagnostics against an explicit input file/books dir
+├── export-dictionary-book-options.ts
+├── extract-dictionary-test-fixtures.ts
+├── generate-dictionary-html-previews.ts
+└── split-dictionary-csvs.ts
+
 ### Core Components
 
 1. **`segmentPages(pages, options)`** - Main entry point (`src/segmentation/segmenter.ts`)
@@ -100,6 +127,24 @@ src/
 - `options.ts`: Comprehensive `SegmentationOptions` and `Logger` definitions
 - `index.ts`: Public API types for consumers
 
+### Dictionary Blocker Notes
+
+- `previousWord.scope` defaults to `'samePage'` and only checks the same page's
+  preceding Arabic word unless you opt into cross-page behavior.
+- `previousWord.scope: 'pageStart'` only runs for page-start candidates and
+  compares against the previous page's last Arabic word, skipping the check when
+  the previous page ends with strong sentence punctuation.
+- `previousWord.scope: 'any'` combines the page-start cross-page check with the
+  usual same-page check for non-page-start candidates.
+- `pageContinuation.authorityPrecision` defaults to `'high'`; set it to
+  `'aggressive'` when page-start continuation blocking should treat
+  authority-like prefixes more conservatively.
+- `qualifierTail` and `structuralLeak` are intentionally non-configurable global
+  safety checks. They run before zone blockers and appear in diagnostics as
+  rejection reasons.
+- `diagnoseDictionaryProfile()` now reports `rejectionReasons` rather than
+  the former `blockerHits`.
+
 11. **`textUtils.ts`** - Low-level helpers (`src/utils/textUtils.ts`)
 - `makeDiacriticInsensitive()`: Arabic-aware regex generation
 - `adjustForUnicodeBoundary()`: Prevents invalid splits across multi-character clusters
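The `previousWord.scope` rules described in those blocker notes form a small decision table. The sketch below is an illustrative reconstruction of that documented behavior, not the library's actual code; the function name `shouldCheckPreviousWord` is hypothetical.

```typescript
// Illustrative reconstruction of the documented previousWord scoping rules —
// NOT flappa-doormal's implementation, just the decision table from the notes.
type PreviousWordScope = 'samePage' | 'pageStart' | 'any';

function shouldCheckPreviousWord(
    scope: PreviousWordScope,
    isPageStartCandidate: boolean,
    previousPageEndsWithStrongPunctuation: boolean,
): boolean {
    if (isPageStartCandidate) {
        // The cross-page comparison runs only for 'pageStart' / 'any', and is
        // skipped when the previous page ends with strong sentence punctuation.
        return (
            (scope === 'pageStart' || scope === 'any') &&
            !previousPageEndsWithStrongPunctuation
        );
    }
    // Non-page-start candidates get the usual same-page check.
    return scope === 'samePage' || scope === 'any';
}
```

Note how `'any'` is simply the union of the other two behaviors, which matches the "combines the page-start cross-page check with the usual same-page check" wording.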
@@ -386,7 +431,8 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
 
 - **Unit tests**: Each utility function has dedicated tests
 - **Integration tests**: Full pipeline tests in `src/segmentation/segmenter.test.ts`
-- **
+- **Dictionary integration tests**: `src/dictionary/*.test.ts` use extracted markdown fixtures under `testing/fixtures/dictionary-books/`
+- **Optional corpus tooling**: full-book diagnostics/preview scripts can use external Shamela JSONs via `--input` or `--books-dir`, but the test suite does not require a local `books/` directory
 - **Style convention**: Prefer `it('should ...', () => { ... })` (Bun) for consistency across the suite
 - Run: `bun test`
 
@@ -395,7 +441,7 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
 1. **TypeScript strict mode** - No `any` types
 2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
 3. **JSDoc comments** - All exported functions documented
-4. **Test coverage** -
+4. **Test coverage** - keep coverage representative; do not rely on local corpora for CI
 
 ## Dependencies
 
@@ -417,14 +463,26 @@ bun test
 bun run build
 # Output: dist/index.mjs (~17 KB gzip ~5.7 KB)
 
-# Run performance
-bun run perf
+# Run performance tests
+bun run test:perf
+
+# Regenerate extracted dictionary test fixtures (requires external books dir if not using ./books)
+bun run dictionary:extract-fixtures -- --books-dir /path/to/books
+
+# Export built-in dictionary options (writes to out/dictionary-options by default)
+bun run dictionary:export-options
+
+# Scan a full book with a builtin dictionary profile
+bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
+
+# Validate a dictionary profile shape in userland
+# (public API: validateDictionaryProfile(profile))
 
 # Format code
 bunx biome format --write .
 
 # Lint code
-bunx biome
+bunx biome check .
 ```
 
 ## Lessons Learned
@@ -605,12 +663,6 @@ bunx biome lint .
 
 57. **Validation Hints Specificity**: Generic error hints like "Check segmenter.ts" are unhelpful. Provide specific file names and logical components (e.g., "Check maxPages windowing in breakpoint-processor.ts"). User-friendly validation reports guide debugging much faster than "Something is wrong".
 
-### Process Template (Multi-agent design review, TDD-first)
-
-If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
-
-- `docs/ai-multi-agent-tdd-template.md`
-
 ### Architecture Insights
 
 - **Declarative > Imperative**: Users describe patterns, library handles regex
package/README.md
CHANGED
@@ -349,12 +349,100 @@ const segments = segmentPages(pages, {
 If the previous page ends with strong sentence punctuation (`.`, `!`, `?`, `؟`, `؛`),
 the stoplist guard is skipped and the page-start match is allowed.
 
-####
+#### Preferred Dictionary Profile
 
-
-
-
-
+For new Shamela-style dictionary work, prefer the top-level `dictionary`
+profile over hand-built raw regexes or the older one-rule helper:
+
+```typescript
+import { segmentPages } from 'flappa-doormal';
+
+const segments = segmentPages(pages, {
+  breakpoints: ['{{tarqim}}'],
+  dictionary: {
+    version: 2,
+    zones: [{
+      name: 'main',
+      blockers: [
+        { appliesTo: ['lineEntry', 'inlineSubentry'], use: 'pageContinuation' },
+        { appliesTo: ['lineEntry', 'inlineSubentry'], use: 'intro' },
+        {
+          appliesTo: ['lineEntry', 'inlineSubentry'],
+          use: 'stopLemma',
+          words: ['ومعناه', 'ويقال', 'وقيل']
+        },
+      ],
+      families: [
+        { classes: ['chapter'], emit: 'chapter', use: 'heading' },
+        { emit: 'entry', use: 'lineEntry', wrappers: 'none' },
+        { emit: 'entry', prefixes: ['و'], stripPrefixesFromLemma: false, use: 'inlineSubentry' },
+      ],
+    }],
+  },
+  maxPages: 1,
+});
+```
+
+Why this is preferred:
+- serializable JSON authoring shape
+- profile-scoped blockers instead of giant regex blobs
+- zone support for books that change layout later
+- compatible with diagnostics tooling via `diagnoseDictionaryProfile()`
+- first-class validation via `validateDictionaryProfile()`
+
+Blocker authoring notes:
+- `previousWord.scope` defaults to `'samePage'`
+- set `scope: 'pageStart'` to compare only against the previous page's last
+  Arabic word for page-start candidates
+- set `scope: 'any'` to combine the page-start cross-page check with the normal
+  same-page check
+- `pageContinuation.authorityPrecision` defaults to `'high'`; use
+  `'aggressive'` when page-start continuation filtering should treat
+  authority-like prefixes more conservatively
+- `qualifierTail` and `structuralLeak` are always-on global safety checks and
+  show up in diagnostics even though they are not zone-declared blockers
+
+The production dictionary implementation now lives under `src/dictionary/`
+inside the repo, separate from the generic segmentation internals.
+
+Dictionary runtime semantics:
+- `segmentPages()` is still the only entry point; dictionary profiles do not use
+  a separate API
+- dictionary split points are merged with ordinary `rules`
+- when a rule split and a dictionary split land at the same offset, metadata is
+  merged; if `debug` is enabled, `_flappa.rule` and `_flappa.dictionary` can
+  both appear on the same segment
+- for dictionary-only configs, content before the first detected entry/chapter
+  is preserved as a leading segment with no dictionary metadata
+
+#### Advanced: Single-Rule Arabic Dictionary Matching
+
+`createArabicDictionaryEntryRule()` and the native `dictionaryEntry` rule shape
+are still supported as the lower-level, advanced path for clients who want one
+Arabic dictionary-style matcher inside a broader `rules` pipeline.
+
+Use this path when:
+- you need exactly one conservative dictionary headword rule
+- you want to compose it with ordinary `SplitRule[]`
+- you do not need profile zones, per-family blockers, or full-book tuning
+
+Prefer the top-level `dictionary` profile when:
+- segmenting an entire dictionary book
+- persisting JSON config for a corpus
+- the book changes layout in different sections
+- you need diagnostics, rejection-reason rates, or book-specific profile tuning
+
+Decision guide:
+
+| Use case | Preferred API |
+|----------|---------------|
+| One conservative lemma matcher inside a normal segmentation pipeline | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
+| Full-book dictionary segmentation with blockers, families, and zones | top-level `dictionary` |
+| Persisted JSON config for real books | top-level `dictionary` |
+| Advanced composition with other `SplitRule[]` rules | `createArabicDictionaryEntryRule()` / `dictionaryEntry` |
+
+The helper returns a serializable native `dictionaryEntry` rule rather than an
+eagerly-compiled regex blob:
 
 ```typescript
 import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
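The serializability point in the decision guide above can be made concrete. In the sketch below, the `dictionaryEntry` key and option names are taken from the surrounding docs, but the exact rule shape is an assumption — check the package typings before relying on it.

```typescript
// A plain-object dictionaryEntry rule (shape assumed from the docs above)
// survives a JSON round-trip intact, which is what makes it safe to persist.
const rule = {
    dictionaryEntry: {
        stopWords: ['وقيل', 'ويقال'],
        allowCommaSeparated: true,
    },
};

const roundTripped = JSON.parse(JSON.stringify(rule));
console.log(roundTripped.dictionaryEntry.stopWords.length); // 2

// By contrast, a compiled RegExp silently serializes to an empty object,
// which is why "eagerly-compiled regex blobs" cannot be persisted as JSON.
console.log(JSON.stringify(/عنبر:/)); // {}
```

This round-trip property is the practical difference between the serializable rule shapes and hand-compiled `RegExp` values.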
@@ -400,6 +488,193 @@ Behavior:
 - Can match comma-separated headword lists like `سبد، دبس:` when enabled
 - Can suppress same-page false positives like `جلّ وعزّ:` with `samePagePrevWordStoplist`
 
+Option notes:
+- `stopWords`
+  - exact lemma-level blockers for non-lexical heads like `وقيل` or `ويقال`
+  - use this for rejecting candidate headwords themselves
+- `pageStartPrevWordStoplist`
+  - blocks a page-start candidate when the previous page ends with one of these
+    words
+  - useful for page-wrap false positives after citation/introduction prose
+- `samePagePrevWordStoplist`
+  - blocks a same-page candidate when the previous local word matches
+  - useful for phrases like `جلّ وعزّ`
+- `allowParenthesized`
+  - enables heads like `(عنبر):`
+- `allowWhitespaceBeforeColon`
+  - enables spacing variants like `عنبر :`
+- `allowCommaSeparated`
+  - enables grouped heads like `سبد، دبس:`
+- `midLineSubentries`
+  - when `true`, allows conservative same-line subentries such as `والعزاء:`
+  - when `false`, only line-start/page-start heads are emitted
+
+Serialization tradeoff:
+- `dictionaryEntry` is serializable and safe to keep in JSON
+- but it is still a single-rule primitive
+- if you need corpus-wide blocker tuning, families, or zones, move up to the
+  top-level `dictionary` profile
+
+Example: compose with chapter rules
+
+```typescript
+import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
+
+const segments = segmentPages(pages, {
+  rules: [
+    { lineStartsAfter: ['## '], meta: { type: 'chapter' } },
+    {
+      fuzzy: true,
+      lineStartsAfter: ['{{bab}} '],
+      meta: { type: 'chapter' },
+    },
+    createArabicDictionaryEntryRule({
+      stopWords: ['وقيل', 'ويقال', 'قال'],
+      pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
+      samePagePrevWordStoplist: ['جل'],
+      allowCommaSeparated: true,
+    }),
+  ],
+  breakpoints: ['{{tarqim}}'],
+  maxPages: 1,
+});
+```
+
+Example: one-off advanced rule inside a non-dictionary pipeline
+
+```typescript
+import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
+
+const segments = segmentPages(pages, {
+  rules: [
+    { lineStartsWith: ['{{kitab}}'], meta: { type: 'book' } },
+    { lineStartsWith: ['{{bab}}'], meta: { type: 'chapter' } },
+    createArabicDictionaryEntryRule({
+      stopWords: ['وقيل', 'ويقال'],
+      midLineSubentries: false,
+      allowParenthesized: true,
+    }),
+  ],
+});
+```
+
+Use `createArabicDictionaryEntryRule()` or `dictionaryEntry` when you only need
+one conservative dictionary matcher and want it to behave like a normal
+`SplitRule`.
+
+For full-book dictionary profiling, diagnostics, and book-specific tuning,
+prefer the top-level `dictionary` contract above.
+
+#### Repo Fixture Book Options
+
+The repo keeps book-specific golden options for the four reference Shamela
+dictionaries as local test/support fixtures, not as part of the public package
+API.
+
+If you want standalone JSON copies of those fixture options for your own local
+workflow, export them on demand:
+
+```bash
+bun run dictionary:export-options
+bun run dictionary:export-options -- --out-dir /path/to/dictionary-options
+```
+
+By default this writes to `out/dictionary-options/`, which is not intended to
+be checked into the repo.
+
+#### Dictionary Diagnostics
+
+Use `diagnoseDictionaryProfile()` when tuning blockers and families for a
+dictionary profile:
+
+```typescript
+import { diagnoseDictionaryProfile } from 'flappa-doormal';
+
+const diagnostics = diagnoseDictionaryProfile(pages, profile, {
+  sampleLimit: 25,
+});
+
+console.log(diagnostics.rejectionReasons);
+console.log(diagnostics.rejectedLemmas.slice(0, 10));
+```
+
+Returned diagnostics include:
+- accepted vs rejected candidate counts
+- accepted counts by `kind`
+- accepted/rejected counts by family and zone
+- rejection-reason counts (`intro`, `stopLemma`, `pageContinuation`,
+  `qualifierTail`, `structuralLeak`, etc.)
+- top rejected lemmas
+- sampled accepted/rejected candidates for quick inspection
+
+`diagnoseDictionaryProfile()` is primarily a tuning API for profile authoring,
+so consumers should treat its output shape as less stable than the segmentation
+API itself.
+
+Validate profiles before persisting them or shipping them to an editor/CI step:
+
+```typescript
+import { validateDictionaryProfile } from 'flappa-doormal';
+
+const issues = validateDictionaryProfile(profile);
+if (issues.length > 0) {
+  console.error(issues);
+}
+```
+
+Validation catches:
+- empty or duplicate zones
+- invalid gate shapes
+- empty blocker lists
+- inert heading families (for example, a heading family that emits `entry` but
+  never matches `entry` headings)
+
+The runtime throws `DictionaryProfileValidationError` if invalid profiles reach
+`segmentPages()` or `diagnoseDictionaryProfile()`.
+
+#### Dictionary Surface Analysis
+
+For corpus exploration and profile authoring, the library also exposes the
+heading/surface scanner used during the proposal phase:
+
+```typescript
+import {
+  analyzeDictionaryMarkdownPages,
+  classifyDictionaryHeading,
+  scanDictionaryMarkdownPage,
+} from 'flappa-doormal';
+
+const kind = classifyDictionaryHeading('## (خَ غ)');
+const pageMatches = scanDictionaryMarkdownPage(page);
+const report = analyzeDictionaryMarkdownPages(pages);
+```
+
+Use these for:
+- inspecting `convertContentToMarkdown()` output before profile authoring
+- spotting structural marker/code lines
+- building your own authoring tools around the same heading classifier
+
+These are analysis helpers, not a replacement for the full runtime.
+
+For full-book scans, use the bundled script:
+
+```bash
+bun run dictionary:scan -- --book 1687 --input /path/to/1687.json
+bun run dictionary:scan -- --book 7031 --books-dir /path/to/books --json
+bun run dictionary:scan -- --book 1687 --input /path/to/1687.json --out diagnostics/1687.txt
+```
+
+The scan script:
+- reads an explicit `--input` file or resolves `<books-dir>/<book>.json`
+- converts each page with `convertContentToMarkdown()`
+- applies `removeZeroWidth`
+- runs `diagnoseDictionaryProfile()` with the repo-local golden profile fixture
+  for that book
+
+The test suite does not require the full Shamela corpora. It uses extracted
+markdown fixtures under `testing/fixtures/dictionary-books/`, so moving your
+local `books/` directory will not break CI or the built-in tests.
+
 #### Dictionary Letter-Code Lines
 
 For dictionary-specific letter-code lines like `ك ش ن` or `(هـ ث)`, use
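The diagnostics output described above lends itself to simple post-processing while tuning. The sketch below uses hand-written sample data; the field names `accepted`, `rejected`, and `rejectionReasons` follow the bullet list, but treat them as assumptions about the real output shape (which the docs themselves flag as less stable).

```typescript
// Sketch: summarizing a diagnostics result of the documented shape.
// `diagnostics` here is hand-written sample data, not real library output.
const diagnostics = {
    accepted: 180,
    rejected: 20,
    rejectionReasons: { stopLemma: 12, pageContinuation: 5, qualifierTail: 3 },
};

// Overall rejection rate — a quick health metric while tuning blockers.
const total = diagnostics.accepted + diagnostics.rejected;
const rejectionRate = diagnostics.rejected / total;

// Most common rejection reason: the natural first target when adjusting
// stoplists or blocker configuration.
const topReason = Object.entries(diagnostics.rejectionReasons)
    .sort((a, b) => b[1] - a[1])[0][0];

console.log(rejectionRate, topReason); // 0.1 stopLemma
```

A loop like this — adjust the profile, re-run diagnostics, watch the top rejection reason — is the tuning workflow the diagnostics API is aimed at.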
@@ -561,11 +836,6 @@ Pass an optional `logger` to trace segmentation decisions or enable `debug` to a
 const segments = segmentPages(pages, {
   rules: [...],
   debug: true, // Enables detailed match metadata
-  logger: {
-    debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
-    info: (msg, data) => console.info(`[INFO] ${msg}`, data),
-    warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
-    error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
   logger: {
     debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
     info: (msg, data) => console.info(`[INFO] ${msg}`, data),
@@ -620,7 +890,35 @@ If a segment was created by a `breakpoint` pattern (e.g. because it exceeded `ma
 }
 ```
 
-**3.
+**3. Dictionary-based Splits**
+If a segment was created by a dictionary profile:
+```json
+{
+  "meta": {
+    "_flappa": {
+      "dictionary": {
+        "family": "lineEntry"
+      }
+    }
+  }
+}
+```
+
+Heading-driven dictionary splits can also record the heading class:
+```json
+{
+  "meta": {
+    "_flappa": {
+      "dictionary": {
+        "family": "heading",
+        "headingClass": "chapter"
+      }
+    }
+  }
+}
+```
+
+**4. Safety Fallback Splits (`maxContentLength`)**
 If no rule or breakpoint matched and the library was forced to perform a safety fallback split:
 ```json
 {
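The `_flappa.dictionary` shapes shown in that hunk are easy to consume downstream. The sketch below filters hand-written sample segments by family and heading class; the sample data mirrors the documented metadata shape, but is not real library output.

```typescript
// Sketch: consuming the documented _flappa.dictionary metadata.
// The segment objects are hand-written samples in the shape shown above.
interface DictionaryMeta {
    family: string;
    headingClass?: string;
}

interface Segment {
    content: string;
    meta?: { _flappa?: { dictionary?: DictionaryMeta } };
}

const segments: Segment[] = [
    { content: 'باب الألف', meta: { _flappa: { dictionary: { family: 'heading', headingClass: 'chapter' } } } },
    { content: 'عنبر: ...', meta: { _flappa: { dictionary: { family: 'lineEntry' } } } },
    { content: 'leading prose', meta: {} },
];

// Separate dictionary entries from chapter headings; untagged leading
// content (no dictionary metadata) falls through both filters.
const entries = segments.filter((s) => s.meta?._flappa?.dictionary?.family === 'lineEntry');
const chapters = segments.filter((s) => s.meta?._flappa?.dictionary?.headingClass === 'chapter');

console.log(entries.length, chapters.length); // 1 1
```

Optional chaining keeps the filters safe for segments with no `_flappa` block at all, such as the preserved leading content mentioned earlier in the README.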
@@ -1099,6 +1397,80 @@ const segments = segmentPages(pages, { rules });
 // ]
 ```
 
+## Agent Advisor Workflow
+
+If you want an AI agent to start from raw pages and get to a draft configuration with less hand-written glue, use `suggestSegmentationOptions()`:
+
+```typescript
+import { suggestSegmentationOptions } from 'flappa-doormal';
+
+const report = suggestSegmentationOptions(pages, {
+  maxRules: 4,
+  topLineStarts: 12,
+  topRepeatingSequences: 8,
+});
+
+console.log(report.assessment);
+console.log(report.recommendedOptions);
+console.log(report.ruleSuggestions.slice(0, 5));
+```
+
+The report includes:
+
+- preprocess cleanup hints (`removeZeroWidth`, `condenseEllipsis`, `fixTrailingWaw`)
+- an assessment of whether the book looks `structured`, `continuous`, or `mixed`
+- draft `SplitRule[]` suggestions with examples and confidence
+- a ready-to-run `recommendedOptions` object
+- rule validation output
+- self-evaluation of the generated segmentation draft
+- optional breakpoint suggestions when the draft still produces very large segments
+
+For local JSON files, you can run the bundled script:
+
+```bash
+bun run segment:advise -- --input ./pages.json
+bun run segment:advise -- --input ./book.json --format markdown --out ./segmentation-report.md
+```
+
+Input can be either:
+
+- `Page[]`
+- `{ pages: Page[] }`
+
+## MCP Server
+
+The repo now includes a stdio MCP server wrapper for agent workflows:
+
+```bash
+bun run mcp:serve
+```
+
+When packaged, the server binary is:
+
+```bash
+flappa-doormal-mcp
+```
+
+Exposed MCP tools:
+
+- `inspect_book`
+  Input: `{ pages, advisorOptions? }`
+  Returns preprocess detections, line-start analysis, repeating sequences, and draft rule suggestions.
+- `suggest_segmentation_options`
+  Input: `{ pages, advisorOptions? }`
+  Returns the full advisor report, including `recommendedOptions`.
+- `preview_segmentation`
+  Input: `{ pages, options, sampleSegments? }`
+  Runs segmentation and returns segments, samples, and validation.
+- `validate_segmentation`
+  Input: `{ pages, options, segments }`
+  Validates caller-provided segments against the source book.
+- `score_candidate_options`
+  Input: `{ pages, candidates, sampleSegments? }`
+  Ranks multiple `SegmentationOptions` candidates using validation and segment-shape heuristics.
+
+All tool results are returned as JSON-friendly objects so agents can iterate without scraping prose output.
+
 ## Advanced: Metadata Extraction & Data Migration
 
 If you already have pre-segmented data (e.g., records from a database or JSON file) and want to use **flappa-doormal's** token system to extract metadata and clean the content without further splitting, you can use the **Metadata Extraction** pattern.
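The `score_candidate_options` tool above ranks candidates "using validation and segment-shape heuristics". The toy below illustrates only the segment-shape half with a hypothetical unevenness heuristic and hand-written data — it is not the tool's actual scoring.

```typescript
// Toy illustration of segment-shape ranking (hypothetical heuristic and
// sample data; the real tool also incorporates validation results).
interface Candidate {
    name: string;
    segmentLengths: number[];
}

const candidates: Candidate[] = [
    { name: 'rules-only', segmentLengths: [12000, 90, 80] },
    { name: 'rules-plus-breakpoints', segmentLengths: [900, 800, 950, 700] },
];

// Penalize uneven segment sizes: a lower max/mean ratio ranks higher.
function unevenness(c: Candidate): number {
    const mean = c.segmentLengths.reduce((a, b) => a + b, 0) / c.segmentLengths.length;
    return Math.max(...c.segmentLengths) / mean;
}

const ranked = [...candidates].sort((a, b) => unevenness(a) - unevenness(b));
console.log(ranked[0].name); // rules-plus-breakpoints
```

This is the same intuition behind the README's breakpoint suggestions: a candidate that still produces one huge segment alongside tiny ones scores worse than one with evenly sized segments.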