PyPI - biblicus - Versions diffs - 0.7.0__tar.gz → 0.8.0__tar.gz - Mend

biblicus 0.7.0tar.gz → 0.8.0tar.gz

Files changed (222)

{biblicus-0.7.0 → biblicus-0.8.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.7.0
+Version: 0.8.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -25,8 +25,21 @@ Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
 Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
 Provides-Extra: ocr
 Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
+Provides-Extra: paddleocr
+Requires-Dist: paddleocr>=2.7.0; extra == "paddleocr"
+Requires-Dist: paddlepaddle>=2.5.0; extra == "paddleocr"
+Requires-Dist: huggingface_hub>=0.20.0; extra == "paddleocr"
+Requires-Dist: requests>=2.28.0; extra == "paddleocr"
 Provides-Extra: markitdown
 Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
+Provides-Extra: deepgram
+Requires-Dist: deepgram-sdk>=3.0; extra == "deepgram"
+Provides-Extra: docling
+Requires-Dist: docling[vlm]>=2.0.0; extra == "docling"
+Provides-Extra: docling-mlx
+Requires-Dist: docling[mlx-vlm]>=2.0.0; extra == "docling-mlx"
+Provides-Extra: topic-modeling
+Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
 Dynamic: license-file
 # Biblicus
@@ -160,9 +173,14 @@ python3 -m pip install biblicus
 Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
-- Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Advanced optical character recognition with PaddleOCR: `python3 -m pip install "biblicus[paddleocr]"`
+- Document understanding with Docling VLM: `python3 -m pip install "biblicus[docling]"`
+- Document understanding with Docling VLM and MLX acceleration: `python3 -m pip install "biblicus[docling-mlx]"`
+- Speech to text transcription with OpenAI: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Speech to text transcription with Deepgram: `python3 -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
 - MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
+- Topic modeling analysis with BERTopic: `python3 -m pip install "biblicus[topic-modeling]"`
 ## Quick start
@@ -420,6 +438,7 @@ The documents below follow the pipeline from raw items to model context:
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Speech to text][speech-to-text]
 - [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
@@ -468,21 +487,97 @@ corpus/
 Two backends are included.
 - `scan` is a minimal baseline that scans raw items directly.
-- `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+- `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
+For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
 ## Extraction backends
-These extractors are built in. Optional ones require extra dependencies.
+These extractors are built in. Optional ones require extra dependencies. See [text extraction documentation][text-extraction] for details.
+### Text and document extraction
+- [`pass-through-text`](docs/extractors/text-document/pass-through.md) reads text items and strips Markdown front matter.
+- [`metadata-text`](docs/extractors/text-document/metadata.md) turns catalog metadata into a small text artifact.
+- [`pdf-text`](docs/extractors/text-document/pdf.md) extracts text from Portable Document Format items with `pypdf`.
+- [`unstructured`](docs/extractors/text-document/unstructured.md) provides broad document parsing (optional).
+- [`markitdown`](docs/extractors/text-document/markitdown.md) converts many formats into Markdown-like text (optional).
+### Optical character recognition
+- [`ocr-rapidocr`](docs/extractors/ocr/rapidocr.md) does optical character recognition on images (optional).
+- [`ocr-paddleocr-vl`](docs/extractors/ocr/paddleocr-vl.md) does advanced optical character recognition with PaddleOCR vision-language model (optional).
+### Vision-language models
+- [`docling-smol`](docs/extractors/vlm-document/docling-smol.md) uses the SmolDocling-256M vision-language model for fast document understanding (optional).
+- [`docling-granite`](docs/extractors/vlm-document/docling-granite.md) uses the Granite Docling-258M vision-language model for high-accuracy extraction (optional).
+### Speech to text
+- [`stt-openai`](docs/extractors/speech-to-text/openai.md) performs speech to text on audio using OpenAI (optional).
+- [`stt-deepgram`](docs/extractors/speech-to-text/deepgram.md) performs speech to text on audio using Deepgram (optional).
+### Pipeline utilities
+- [`select-text`](docs/extractors/pipeline-utilities/select-text.md) chooses one prior extraction result in a pipeline.
+- [`select-longest-text`](docs/extractors/pipeline-utilities/select-longest.md) chooses the longest prior extraction result.
+- [`select-override`](docs/extractors/pipeline-utilities/select-override.md) chooses the last extraction result for matching media types in a pipeline.
+- [`select-smart-override`](docs/extractors/pipeline-utilities/select-smart-override.md) intelligently chooses between extraction results based on confidence and content quality.
+For detailed documentation on all extractors, see the [Extractor Reference][extractor-reference].
+## Topic modeling analysis
+Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Topic modeling is the first
+analysis backend. It reads an extraction run, optionally applies an LLM-driven extraction pass, applies lexical
+processing, runs BERTopic, and optionally applies an LLM fine-tuning pass to label topics. The output is structured
+JavaScript Object Notation.
+Run a topic analysis using a recipe file:
+```
+biblicus analyze topics --corpus corpora/example --recipe recipes/topic-modeling.yml --extraction-run pipeline:<run_id>
+```
+If `--extraction-run` is omitted, Biblicus uses the most recent extraction run and emits a warning about
+reproducibility. The analysis output is stored under:
+```
+.biblicus/runs/analysis/topic-modeling/<run_id>/output.json
+```
+Minimal recipe example:
+```yaml
+schema_version: 1
+text_source:
+  sample_size: 200
+llm_extraction:
+  enabled: false
+lexical_processing:
+  enabled: true
+  lowercase: true
+  strip_punctuation: false
+  collapse_whitespace: true
+bertopic_analysis:
+  parameters:
+    min_topic_size: 8
+    nr_topics: 10
+llm_fine_tuning:
+  enabled: false
+```
+LLM extraction and fine-tuning require `biblicus[openai]` and a configured OpenAI API key.
+Recipe files are validated strictly against the topic modeling schema, so type mismatches or unknown fields are errors.
+For a repeatable, real-world integration run that downloads a Wikipedia corpus and executes topic modeling, use:
+```
+python3 scripts/topic_modeling_integration.py --corpus corpora/wiki_demo --force
+```
-- `pass-through-text` reads text items and strips Markdown front matter.
-- `metadata-text` turns catalog metadata into a small text artifact.
-- `pdf-text` extracts text from Portable Document Format items with `pypdf`.
-- `select-text` chooses one prior extraction result in a pipeline.
-- `select-longest-text` chooses the longest prior extraction result.
-- `ocr-rapidocr` does optical character recognition on images (optional).
-- `stt-openai` performs speech to text on audio (optional).
-- `unstructured` provides broad document parsing (optional).
-- `markitdown` converts many formats into Markdown-like text (optional).
+See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
 ## Integration corpus and evaluation dataset
@@ -539,6 +634,9 @@ License terms are in `LICENSE`.
 [corpus]: docs/CORPUS.md
 [knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
+[extractor-reference]: docs/extractors/index.md
+[backend-reference]: docs/backends/index.md
+[speech-to-text]: docs/STT.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md
 [context-packs]: docs/CONTEXT_PACK.md
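
The metadata hunk above adds five extras (`paddleocr`, `deepgram`, `docling`, `docling-mlx`, `topic-modeling`). As a rough way to confirm which requirements each extra pulls in, the sketch below groups `Requires-Dist` entries by extra using only the standard library. The `PKG_INFO` string is an abbreviated copy of the 0.8.0 headers, the function name `requirements_by_extra` is made up for this note, and the marker matching is a plain substring check rather than a full environment-marker parser.

```python
from email.parser import HeaderParser

# Abbreviated copy of the 0.8.0 metadata headers shown in the diff above.
PKG_INFO = """\
Metadata-Version: 2.4
Name: biblicus
Version: 0.8.0
Provides-Extra: ocr
Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
Provides-Extra: paddleocr
Requires-Dist: paddleocr>=2.7.0; extra == "paddleocr"
Requires-Dist: paddlepaddle>=2.5.0; extra == "paddleocr"
Provides-Extra: deepgram
Requires-Dist: deepgram-sdk>=3.0; extra == "deepgram"
Provides-Extra: topic-modeling
Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
"""


def requirements_by_extra(pkg_info_text):
    """Group Requires-Dist entries under the extra named in their marker.

    Hypothetical helper: it only understands markers that contain the
    literal text `extra == "<name>"`.
    """
    headers = HeaderParser().parsestr(pkg_info_text)
    grouped = {name: [] for name in headers.get_all("Provides-Extra", [])}
    for line in headers.get_all("Requires-Dist", []):
        requirement, _, marker = line.partition(";")
        for name in grouped:
            if f'extra == "{name}"' in marker:
                grouped[name].append(requirement.strip())
    return grouped


extras = requirements_by_extra(PKG_INFO)
for name, requirements in extras.items():
    print(f"{name}: {', '.join(requirements)}")
```

For an installed copy of the package, `importlib.metadata.metadata("biblicus")` returns the same headers without reading `PKG-INFO` by hand.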

biblicus-0.7.0/src/biblicus.egg-info/PKG-INFO → biblicus-0.8.0/README.md RENAMED

@@ -1,34 +1,3 @@
-Metadata-Version: 2.4
-Name: biblicus
-Version: 0.7.0
-Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
-License: MIT
-Requires-Python: >=3.9
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: pydantic>=2.0
-Requires-Dist: PyYAML>=6.0
-Requires-Dist: pypdf>=4.0
-Provides-Extra: dev
-Requires-Dist: behave>=1.2.6; extra == "dev"
-Requires-Dist: coverage[toml]>=7.0; extra == "dev"
-Requires-Dist: sphinx>=7.0; extra == "dev"
-Requires-Dist: myst-parser>=2.0; extra == "dev"
-Requires-Dist: sphinx_rtd_theme>=2.0; extra == "dev"
-Requires-Dist: ruff>=0.4.0; extra == "dev"
-Requires-Dist: black>=24.0; extra == "dev"
-Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
-Provides-Extra: openai
-Requires-Dist: openai>=1.0; extra == "openai"
-Provides-Extra: unstructured
-Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
-Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
-Provides-Extra: ocr
-Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
-Provides-Extra: markitdown
-Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
-Dynamic: license-file
 # Biblicus
 ![Continuous integration][continuous-integration-badge]
@@ -160,9 +129,14 @@ python3 -m pip install biblicus
 Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
-- Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Advanced optical character recognition with PaddleOCR: `python3 -m pip install "biblicus[paddleocr]"`
+- Document understanding with Docling VLM: `python3 -m pip install "biblicus[docling]"`
+- Document understanding with Docling VLM and MLX acceleration: `python3 -m pip install "biblicus[docling-mlx]"`
+- Speech to text transcription with OpenAI: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Speech to text transcription with Deepgram: `python3 -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
 - MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
+- Topic modeling analysis with BERTopic: `python3 -m pip install "biblicus[topic-modeling]"`
 ## Quick start
@@ -420,6 +394,7 @@ The documents below follow the pipeline from raw items to model context:
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Speech to text][speech-to-text]
 - [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
@@ -468,21 +443,97 @@ corpus/
 Two backends are included.
 - `scan` is a minimal baseline that scans raw items directly.
-- `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+- `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
+For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
 ## Extraction backends
-These extractors are built in. Optional ones require extra dependencies.
+These extractors are built in. Optional ones require extra dependencies. See [text extraction documentation][text-extraction] for details.
+### Text and document extraction
+- [`pass-through-text`](docs/extractors/text-document/pass-through.md) reads text items and strips Markdown front matter.
+- [`metadata-text`](docs/extractors/text-document/metadata.md) turns catalog metadata into a small text artifact.
+- [`pdf-text`](docs/extractors/text-document/pdf.md) extracts text from Portable Document Format items with `pypdf`.
+- [`unstructured`](docs/extractors/text-document/unstructured.md) provides broad document parsing (optional).
+- [`markitdown`](docs/extractors/text-document/markitdown.md) converts many formats into Markdown-like text (optional).
+### Optical character recognition
+- [`ocr-rapidocr`](docs/extractors/ocr/rapidocr.md) does optical character recognition on images (optional).
+- [`ocr-paddleocr-vl`](docs/extractors/ocr/paddleocr-vl.md) does advanced optical character recognition with PaddleOCR vision-language model (optional).
+### Vision-language models
+- [`docling-smol`](docs/extractors/vlm-document/docling-smol.md) uses the SmolDocling-256M vision-language model for fast document understanding (optional).
+- [`docling-granite`](docs/extractors/vlm-document/docling-granite.md) uses the Granite Docling-258M vision-language model for high-accuracy extraction (optional).
+### Speech to text
+- [`stt-openai`](docs/extractors/speech-to-text/openai.md) performs speech to text on audio using OpenAI (optional).
+- [`stt-deepgram`](docs/extractors/speech-to-text/deepgram.md) performs speech to text on audio using Deepgram (optional).
+### Pipeline utilities
+- [`select-text`](docs/extractors/pipeline-utilities/select-text.md) chooses one prior extraction result in a pipeline.
+- [`select-longest-text`](docs/extractors/pipeline-utilities/select-longest.md) chooses the longest prior extraction result.
+- [`select-override`](docs/extractors/pipeline-utilities/select-override.md) chooses the last extraction result for matching media types in a pipeline.
+- [`select-smart-override`](docs/extractors/pipeline-utilities/select-smart-override.md) intelligently chooses between extraction results based on confidence and content quality.
+For detailed documentation on all extractors, see the [Extractor Reference][extractor-reference].
+## Topic modeling analysis
+Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Topic modeling is the first
+analysis backend. It reads an extraction run, optionally applies an LLM-driven extraction pass, applies lexical
+processing, runs BERTopic, and optionally applies an LLM fine-tuning pass to label topics. The output is structured
+JavaScript Object Notation.
+Run a topic analysis using a recipe file:
+```
+biblicus analyze topics --corpus corpora/example --recipe recipes/topic-modeling.yml --extraction-run pipeline:<run_id>
+```
+If `--extraction-run` is omitted, Biblicus uses the most recent extraction run and emits a warning about
+reproducibility. The analysis output is stored under:
+```
+.biblicus/runs/analysis/topic-modeling/<run_id>/output.json
+```
+Minimal recipe example:
+```yaml
+schema_version: 1
+text_source:
+  sample_size: 200
+llm_extraction:
+  enabled: false
+lexical_processing:
+  enabled: true
+  lowercase: true
+  strip_punctuation: false
+  collapse_whitespace: true
+bertopic_analysis:
+  parameters:
+    min_topic_size: 8
+    nr_topics: 10
+llm_fine_tuning:
+  enabled: false
+```
+LLM extraction and fine-tuning require `biblicus[openai]` and a configured OpenAI API key.
+Recipe files are validated strictly against the topic modeling schema, so type mismatches or unknown fields are errors.
+For a repeatable, real-world integration run that downloads a Wikipedia corpus and executes topic modeling, use:
+```
+python3 scripts/topic_modeling_integration.py --corpus corpora/wiki_demo --force
+```
-- `pass-through-text` reads text items and strips Markdown front matter.
-- `metadata-text` turns catalog metadata into a small text artifact.
-- `pdf-text` extracts text from Portable Document Format items with `pypdf`.
-- `select-text` chooses one prior extraction result in a pipeline.
-- `select-longest-text` chooses the longest prior extraction result.
-- `ocr-rapidocr` does optical character recognition on images (optional).
-- `stt-openai` performs speech to text on audio (optional).
-- `unstructured` provides broad document parsing (optional).
-- `markitdown` converts many formats into Markdown-like text (optional).
+See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
 ## Integration corpus and evaluation dataset
@@ -539,6 +590,9 @@ License terms are in `LICENSE`.
 [corpus]: docs/CORPUS.md
 [knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
+[extractor-reference]: docs/extractors/index.md
+[backend-reference]: docs/backends/index.md
+[speech-to-text]: docs/STT.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md
 [context-packs]: docs/CONTEXT_PACK.md
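
The README text added above states that recipe files are validated strictly, so type mismatches and unknown fields are errors. The sketch below illustrates what that kind of strictness means for the minimal recipe. It is not the Biblicus implementation: the `SCHEMA` mapping covers only the fields visible in the minimal example, `validate` is a hypothetical helper, and the recipe is written as the dictionary a YAML loader would produce so the example needs no third-party packages.

```python
from copy import deepcopy

# Hypothetical schema: only the fields shown in the minimal recipe example.
SCHEMA = {
    "schema_version": int,
    "text_source": {"sample_size": int},
    "llm_extraction": {"enabled": bool},
    "lexical_processing": {
        "enabled": bool,
        "lowercase": bool,
        "strip_punctuation": bool,
        "collapse_whitespace": bool,
    },
    "bertopic_analysis": {"parameters": {"min_topic_size": int, "nr_topics": int}},
    "llm_fine_tuning": {"enabled": bool},
}


def validate(value, schema, path="recipe"):
    """Return a list of error strings; an empty list means the value passed.

    Unknown keys and exact-type mismatches are both reported. Missing keys
    are not checked in this sketch.
    """
    errors = []
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected a mapping"]
        for key, item in value.items():
            if key not in schema:
                errors.append(f"{path}.{key}: unknown field")
            else:
                errors.extend(validate(item, schema[key], f"{path}.{key}"))
    elif type(value) is not schema:
        errors.append(f"{path}: expected {schema.__name__}, got {type(value).__name__}")
    return errors


# The minimal recipe from the README, as a loaded dictionary.
recipe = {
    "schema_version": 1,
    "text_source": {"sample_size": 200},
    "llm_extraction": {"enabled": False},
    "lexical_processing": {
        "enabled": True,
        "lowercase": True,
        "strip_punctuation": False,
        "collapse_whitespace": True,
    },
    "bertopic_analysis": {"parameters": {"min_topic_size": 8, "nr_topics": 10}},
    "llm_fine_tuning": {"enabled": False},
}

# A broken variant: one unknown field and one value of the wrong type.
broken = deepcopy(recipe)
broken["lexical_processing"]["stem"] = True
broken["bertopic_analysis"]["parameters"]["nr_topics"] = "10"

valid_errors = validate(recipe, SCHEMA)
broken_errors = validate(broken, SCHEMA)
for message in broken_errors:
    print(message)
```

The real schema may accept more fields than this one, particularly under `bertopic_analysis.parameters`, so treat the field list here as illustrative only.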

{biblicus-0.7.0 → biblicus-0.8.0}/docs/BACKENDS.md RENAMED

@@ -3,6 +3,8 @@
 Backends are pluggable engines that implement a small, stable interface.
 The goal is to make new retrieval ideas easy to test without reshaping the corpus.
+For user documentation on available backends, see the [Backend Reference](backends/index.md).
 ## Backend contract
 Backends implement two operations:

{biblicus-0.7.0 → biblicus-0.8.0}/docs/DEMOS.md RENAMED

@@ -185,6 +185,28 @@ python3 -m biblicus extract build --corpus corpora/demo --step pass-through-text
 The output includes a `run_id` you can reuse when building a retrieval backend.
+### Topic modeling integration run
+Use the integration script to download a Wikipedia corpus, run extraction, and run topic modeling with a single command.
+```
+python3 scripts/topic_modeling_integration.py --corpus corpora/wiki_demo --force
+```
+Run with a smaller corpus and a higher topic count:
+```
+python3 scripts/topic_modeling_integration.py \
+  --corpus corpora/wiki_demo \
+  --force \
+  --limit 20 \
+  --bertopic-param nr_topics=8 \
+  --bertopic-param min_topic_size=2
+```
+The command prints the analysis run identifier and the output path. Open the `output.json` file to inspect per-topic labels,
+keywords, and document examples.
 ### Select extracted text within a pipeline
 When you want an explicit choice among multiple extraction outputs, add a selection extractor step at the end of the pipeline.

{biblicus-0.7.0 → biblicus-0.8.0}/docs/EXTRACTION.md RENAMED

@@ -1,9 +1,11 @@
-# Text extraction
+# Text Extraction Pipeline
 Text extraction is a separate pipeline stage that produces derived text artifacts under a corpus.
 This separation matters because it lets you combine extraction choices and retrieval backends independently.
+For detailed documentation on specific extractors, see [Extractor Reference](extractors/index.md).
 ## What extraction produces
 An extraction run produces:
@@ -31,99 +33,42 @@ corpus/
                   <item id>.txt
 ```
-## Built in extractors
-Version zero includes a small set of deterministic extractors.
-`pass-through-text`
-- Reads text items and returns their content
-- For Markdown items, it strips YAML front matter and returns only the body
-- Skips non text items
-`metadata-text`
-- Builds a small text representation from catalog metadata
-- This is useful when you have a non text item with meaningful tags or a title
+## Available Extractors
-`pdf-text`
+Biblicus provides 16 built-in extractors organized by category:
-- Attempts to extract text from Portable Document Format items
-- Skips items that are not Portable Document Format
-- Uses the `pypdf` library
-- Produces empty output for scanned Portable Document Format files that contain no extractable text without optical character recognition
+### Text & Document Processing
-`select-text`
+- [`pass-through-text`](extractors/text-document/pass-through.md) - Direct text file reading
+- [`metadata-text`](extractors/text-document/metadata.md) - Text from item metadata
+- [`pdf-text`](extractors/text-document/pdf.md) - PDF text extraction using pypdf
+- [`markitdown`](extractors/text-document/markitdown.md) - Office documents via MarkItDown
+- [`unstructured`](extractors/text-document/unstructured.md) - Universal document parsing
-- Selects extracted text artifacts from earlier pipeline steps
-- This is used when you have multiple pipeline steps that can produce usable text for the same items and you want one chosen result
-- Records which step supplied the selected text
+### Optical Character Recognition
-`unstructured`
+- [`ocr-rapidocr`](extractors/ocr/rapidocr.md) - Fast ONNX-based OCR
+- [`ocr-paddleocr-vl`](extractors/ocr/paddleocr-vl.md) - Advanced OCR with VL model
-- Broad document text extraction backed by the optional `unstructured` dependency
-- Intended as a last-resort extractor for non-text items when more specific extractors cannot produce usable text
-- Skips items that are already text so the pass-through extractor remains the canonical choice for text items
+### Vision-Language Models
-To install:
+- [`docling-smol`](extractors/vlm-document/docling-smol.md) - SmolDocling-256M for fast document processing
+- [`docling-granite`](extractors/vlm-document/docling-granite.md) - Granite Docling-258M for high-accuracy extraction
-```
-python3 -m pip install "biblicus[unstructured]"
-```
-`markitdown`
+### Speech-to-Text
-- Converts common document formats into Markdown-like text
-- Backed by the optional `markitdown` dependency
-- Requires Python 3.10 or higher
-- Skips items that are already text so the pass-through extractor remains the canonical choice for text items
-- This means it will not process `text/html` or other text media types unless that policy changes
-To install:
-```
-python3 -m pip install "biblicus[markitdown]"
-```
-Example:
-```
-python3 -m biblicus extract build --corpus corpora/extraction-demo \\
-  --step markitdown
-```
-`ocr-rapidocr`
-- Optical character recognition for image items
-- Backed by the optional `rapidocr-onnxruntime` dependency
-- Intended as a practical default when you need text from images without running a service
-To install:
-```
-python3 -m pip install "biblicus[ocr]"
-```
+- [`stt-openai`](extractors/speech-to-text/openai.md) - OpenAI Whisper API
+- [`stt-deepgram`](extractors/speech-to-text/deepgram.md) - Deepgram Nova-3 API
-`stt-openai`
+### Pipeline Utilities
-- Speech to text transcription for audio items
-- Backed by the optional `openai` dependency
-- Requires an OpenAI API key (from `OPENAI_API_KEY` or the user configuration file)
+- [`select-text`](extractors/pipeline-utilities/select-text.md) - First successful extractor
+- [`select-longest-text`](extractors/pipeline-utilities/select-longest.md) - Longest output selection
+- [`select-override`](extractors/pipeline-utilities/select-override.md) - Per-item override by ID
+- [`select-smart-override`](extractors/pipeline-utilities/select-smart-override.md) - Media type-based routing
+- [`pipeline`](extractors/pipeline-utilities/pipeline.md) - Multi-step extraction workflow
-To install:
-```
-python3 -m pip install "biblicus[openai]"
-```
-To configure:
-- Create `~/.biblicus/config.yml` or `./.biblicus/config.yml` with:
-```
-openai:
-  api_key: YOUR_KEY_HERE
-```
+For detailed documentation including configuration options, usage examples, and best practices, see the [Extractor Reference](extractors/index.md).
 ## How selection chooses text
@@ -131,12 +76,12 @@ The `select-text` extractor does not attempt to judge extraction quality. It cho
 Usable means non-empty after stripping whitespace.
-This means selection does not automatically choose the longest extracted text or the extraction with the most content. If you want a scoring rule such as choose the longest extracted text, that should be a separate selection extractor so the policy is explicit, versioned, and testable.
+This means selection does not automatically choose the longest extracted text or the extraction with the most content. If you want a scoring rule such as choose the longest extracted text, use the [`select-longest-text`](extractors/pipeline-utilities/select-longest.md) extractor instead.
-`select-longest-text`
+Other selection strategies include:
-- Selects the longest usable extracted text from earlier pipeline steps
-- Useful when you have multiple competing extractors for the same item types and you want a deterministic “more content wins” policy
+- [`select-override`](extractors/pipeline-utilities/select-override.md) - Override extraction for specific items by ID
+- [`select-smart-override`](extractors/pipeline-utilities/select-smart-override.md) - Route items based on media type patterns
 ## Pipeline extractor
@@ -146,6 +91,8 @@ The pipeline runs every step in order and records all step outputs. Each step re
 This lets you build explicit extraction policies while keeping every step outcome available for comparison and metrics.
+For details, see the [`pipeline` extractor documentation](extractors/pipeline-utilities/pipeline.md).
 ## Complementary versus competing extractors
 The pipeline is designed for complementary steps that do not overlap much in what they handle.
@@ -169,9 +116,9 @@ python3 -m biblicus init corpora/extraction-demo
 printf 'x' > /tmp/image.png
 python3 -m biblicus ingest --corpus corpora/extraction-demo /tmp/image.png --tag extracted
-python3 -m biblicus extract build --corpus corpora/extraction-demo \\
-  --step pass-through-text \\
-  --step pdf-text \\
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step pass-through-text \
+  --step pdf-text \
   --step metadata-text
 ```
@@ -182,14 +129,38 @@ The extracted text for the image comes from the `metadata-text` step because the
 Selection is a pipeline step that chooses extracted text from previous pipeline steps. Selection is just another extractor in the pipeline, and it decides which prior output to carry forward.
 ```
-python3 -m biblicus extract build --corpus corpora/extraction-demo \\
-  --step pass-through-text \\
-  --step metadata-text \\
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step pass-through-text \
+  --step metadata-text \
   --step select-text
 ```
 The pipeline run produces one extraction run under `pipeline`. You can point retrieval backends at that run.
+## Example: PDF with OCR fallback
+Try text extraction first, fall back to OCR for scanned documents:
+```
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step pdf-text \
+  --step ocr-rapidocr \
+  --step select-text
+```
+This pipeline tries `pdf-text` first for PDFs with text layers, falls back to `ocr-rapidocr` for scanned PDFs, and uses `select-text` to pick the first successful result.
+## Example: VLM for complex documents
+Use vision-language models for documents with complex layouts:
+```
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step docling-granite
+```
+The `docling-granite` extractor uses IBM Research's Granite Docling-258M VLM for high-accuracy extraction of tables, code blocks, and equations.
 ## Inspecting and deleting extraction runs
 Extraction runs are stored under the corpus and can be listed and inspected.
@@ -202,8 +173,8 @@ python3 -m biblicus extract show --corpus corpora/extraction-demo --run pipeline
 Deletion is explicit and requires typing the exact run reference as confirmation:
 ```
-python3 -m biblicus extract delete --corpus corpora/extraction-demo \\
-  --run pipeline:EXTRACTION_RUN_ID \\
+python3 -m biblicus extract delete --corpus corpora/extraction-demo \
+  --run pipeline:EXTRACTION_RUN_ID \
   --confirm pipeline:EXTRACTION_RUN_ID
 ```
@@ -212,7 +183,7 @@ python3 -m biblicus extract delete --corpus corpora/extraction-demo \\
 Retrieval backends can build and query using a selected extraction run. This is configured by passing `extraction_run=extractor_id:run_id` to the backend build command.
 ```
-python3 -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full-text-search \\
+python3 -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full-text-search \
   --config extraction_run=pipeline:EXTRACTION_RUN_ID
 python3 -m biblicus query --corpus corpora/extraction-demo --query extracted
 ```
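
The EXTRACTION.md changes above contrast two selection policies: `select-text` picks the first usable result from earlier steps, where usable means non-empty after stripping whitespace, and `select-longest-text` picks the longest usable result. The sketch below models both policies over a plain list of `(step name, text)` pairs. The function names, the data shape, and the tie-breaking rule (the earliest step wins) are assumptions made for illustration, not the library's API.

```python
def is_usable(text):
    """Usable means non-empty after stripping whitespace."""
    return text is not None and text.strip() != ""


def select_first_usable(step_outputs):
    """Model of the `select-text` policy: the first usable output wins."""
    for step_name, text in step_outputs:
        if is_usable(text):
            return step_name, text
    return None


def select_longest_usable(step_outputs):
    """Model of the `select-longest-text` policy: the longest usable output wins.

    Assumption: length is measured after stripping whitespace, and ties go
    to the earliest step because max() keeps the first maximum it sees.
    """
    usable = [(name, text) for name, text in step_outputs if is_usable(text)]
    if not usable:
        return None
    return max(usable, key=lambda pair: len(pair[1].strip()))


# Hypothetical outputs for one scanned PDF: pdf-text finds no text layer.
outputs = [
    ("pdf-text", "   "),
    ("ocr-rapidocr", "Scanned page"),
    ("metadata-text", "Title: Scanned page, tags: extracted"),
]

first_choice = select_first_usable(outputs)
longest_choice = select_longest_usable(outputs)
print(first_choice[0], longest_choice[0])
```

In this example the two policies disagree: the first-usable rule returns the `ocr-rapidocr` output, while the longest rule returns the `metadata-text` output. That difference is why step order matters for `select-text` but not for `select-longest-text`.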
