PyPI - biblicus - Versions diffs - 0.6.0__tar.gz → 0.8.0__tar.gz - Mend

biblicus 0.6.0__tar.gz → 0.8.0__tar.gz


Files changed (223)

{biblicus-0.6.0 → biblicus-0.8.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.6.0
+Version: 0.8.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -25,6 +25,21 @@ Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
 Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
 Provides-Extra: ocr
 Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
+Provides-Extra: paddleocr
+Requires-Dist: paddleocr>=2.7.0; extra == "paddleocr"
+Requires-Dist: paddlepaddle>=2.5.0; extra == "paddleocr"
+Requires-Dist: huggingface_hub>=0.20.0; extra == "paddleocr"
+Requires-Dist: requests>=2.28.0; extra == "paddleocr"
+Provides-Extra: markitdown
+Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
+Provides-Extra: deepgram
+Requires-Dist: deepgram-sdk>=3.0; extra == "deepgram"
+Provides-Extra: docling
+Requires-Dist: docling[vlm]>=2.0.0; extra == "docling"
+Provides-Extra: docling-mlx
+Requires-Dist: docling[mlx-vlm]>=2.0.0; extra == "docling-mlx"
+Provides-Extra: topic-modeling
+Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
 Dynamic: license-file
 # Biblicus
@@ -67,7 +82,7 @@ If you want to run a real, executable version of this story, use `scripts/readme
 This simplified sequence diagram shows the same idea at a high level.
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
   participant App as Your assistant code
   participant KB as Knowledge base
@@ -106,7 +121,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
 This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
   participant User
   participant App as Your assistant code
@@ -158,8 +173,14 @@ python3 -m pip install biblicus
 Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
-- Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Advanced optical character recognition with PaddleOCR: `python3 -m pip install "biblicus[paddleocr]"`
+- Document understanding with Docling VLM: `python3 -m pip install "biblicus[docling]"`
+- Document understanding with Docling VLM and MLX acceleration: `python3 -m pip install "biblicus[docling-mlx]"`
+- Speech to text transcription with OpenAI: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Speech to text transcription with Deepgram: `python3 -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
+- MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
+- Topic modeling analysis with BERTopic: `python3 -m pip install "biblicus[topic-modeling]"`
 ## Quick start
@@ -417,6 +438,7 @@ The documents below follow the pipeline from raw items to model context:
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Speech to text][speech-to-text]
 - [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
@@ -465,7 +487,97 @@ corpus/
 Two backends are included.
 - `scan` is a minimal baseline that scans raw items directly.
-- `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+- `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
+For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
+## Extraction backends
+These extractors are built in. Optional ones require extra dependencies. See [text extraction documentation][text-extraction] for details.
+### Text and document extraction
+- [`pass-through-text`](docs/extractors/text-document/pass-through.md) reads text items and strips Markdown front matter.
+- [`metadata-text`](docs/extractors/text-document/metadata.md) turns catalog metadata into a small text artifact.
+- [`pdf-text`](docs/extractors/text-document/pdf.md) extracts text from Portable Document Format items with `pypdf`.
+- [`unstructured`](docs/extractors/text-document/unstructured.md) provides broad document parsing (optional).
+- [`markitdown`](docs/extractors/text-document/markitdown.md) converts many formats into Markdown-like text (optional).
+### Optical character recognition
+- [`ocr-rapidocr`](docs/extractors/ocr/rapidocr.md) does optical character recognition on images (optional).
+- [`ocr-paddleocr-vl`](docs/extractors/ocr/paddleocr-vl.md) does advanced optical character recognition with PaddleOCR vision-language model (optional).
+### Vision-language models
+- [`docling-smol`](docs/extractors/vlm-document/docling-smol.md) uses the SmolDocling-256M vision-language model for fast document understanding (optional).
+- [`docling-granite`](docs/extractors/vlm-document/docling-granite.md) uses the Granite Docling-258M vision-language model for high-accuracy extraction (optional).
+### Speech to text
+- [`stt-openai`](docs/extractors/speech-to-text/openai.md) performs speech to text on audio using OpenAI (optional).
+- [`stt-deepgram`](docs/extractors/speech-to-text/deepgram.md) performs speech to text on audio using Deepgram (optional).
+### Pipeline utilities
+- [`select-text`](docs/extractors/pipeline-utilities/select-text.md) chooses one prior extraction result in a pipeline.
+- [`select-longest-text`](docs/extractors/pipeline-utilities/select-longest.md) chooses the longest prior extraction result.
+- [`select-override`](docs/extractors/pipeline-utilities/select-override.md) chooses the last extraction result for matching media types in a pipeline.
+- [`select-smart-override`](docs/extractors/pipeline-utilities/select-smart-override.md) intelligently chooses between extraction results based on confidence and content quality.
+For detailed documentation on all extractors, see the [Extractor Reference][extractor-reference].
+## Topic modeling analysis
+Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Topic modeling is the first
+analysis backend. It reads an extraction run, optionally applies an LLM-driven extraction pass, applies lexical
+processing, runs BERTopic, and optionally applies an LLM fine-tuning pass to label topics. The output is structured
+JavaScript Object Notation.
+Run a topic analysis using a recipe file:
+```
+biblicus analyze topics --corpus corpora/example --recipe recipes/topic-modeling.yml --extraction-run pipeline:<run_id>
+```
+If `--extraction-run` is omitted, Biblicus uses the most recent extraction run and emits a warning about
+reproducibility. The analysis output is stored under:
+```
+.biblicus/runs/analysis/topic-modeling/<run_id>/output.json
+```
+Minimal recipe example:
+```yaml
+schema_version: 1
+text_source:
+  sample_size: 200
+llm_extraction:
+  enabled: false
+lexical_processing:
+  enabled: true
+  lowercase: true
+  strip_punctuation: false
+  collapse_whitespace: true
+bertopic_analysis:
+  parameters:
+    min_topic_size: 8
+    nr_topics: 10
+llm_fine_tuning:
+  enabled: false
+```
+LLM extraction and fine-tuning require `biblicus[openai]` and a configured OpenAI API key.
+Recipe files are validated strictly against the topic modeling schema, so type mismatches or unknown fields are errors.
+For a repeatable, real-world integration run that downloads a Wikipedia corpus and executes topic modeling, use:
+```
+python3 scripts/topic_modeling_integration.py --corpus corpora/wiki_demo --force
+```
+See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
 ## Integration corpus and evaluation dataset
@@ -522,6 +634,9 @@ License terms are in `LICENSE`.
 [corpus]: docs/CORPUS.md
 [knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
+[extractor-reference]: docs/extractors/index.md
+[backend-reference]: docs/backends/index.md
+[speech-to-text]: docs/STT.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md
 [context-packs]: docs/CONTEXT_PACK.md
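
The metadata hunk above adds six optional extras. As a rough aid for reading it, the sketch below groups the added `Requires-Dist` lines by extra name. The helper `group_by_extra` and the regular expression are illustrative assumptions, not part of Biblicus or of any packaging tool; a real project would more likely use a dedicated metadata parser.

```python
import re
from collections import defaultdict

# Requires-Dist lines added in the 0.8.0 metadata, copied from the hunk above.
ADDED_LINES = """\
Requires-Dist: paddleocr>=2.7.0; extra == "paddleocr"
Requires-Dist: paddlepaddle>=2.5.0; extra == "paddleocr"
Requires-Dist: huggingface_hub>=0.20.0; extra == "paddleocr"
Requires-Dist: requests>=2.28.0; extra == "paddleocr"
Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
Requires-Dist: deepgram-sdk>=3.0; extra == "deepgram"
Requires-Dist: docling[vlm]>=2.0.0; extra == "docling"
Requires-Dist: docling[mlx-vlm]>=2.0.0; extra == "docling-mlx"
Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
"""

# Matches the `extra == "name"` clause of an environment marker.
EXTRA_PATTERN = re.compile(r'extra\s*==\s*"([^"]+)"')


def group_by_extra(metadata_text):
    """Map each extra name to the requirement specifiers it pulls in (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in metadata_text.splitlines():
        if not line.startswith("Requires-Dist:"):
            continue
        body = line[len("Requires-Dist:"):].strip()
        # Everything before the first ";" is the requirement, the rest is the marker.
        requirement, _, marker = body.partition(";")
        match = EXTRA_PATTERN.search(marker)
        if match:
            grouped[match.group(1)].append(requirement.strip())
    return dict(grouped)


extras_added = group_by_extra(ADDED_LINES)
for extra_name, requirements in sorted(extras_added.items()):
    print(f"biblicus[{extra_name}]: {', '.join(requirements)}")
```

Each printed name corresponds to one of the `python3 -m pip install "biblicus[...]"` commands added to the README section of the same file.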

biblicus-0.6.0/src/biblicus.egg-info/PKG-INFO → biblicus-0.8.0/README.md RENAMED

@@ -1,32 +1,3 @@
-Metadata-Version: 2.4
-Name: biblicus
-Version: 0.6.0
-Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
-License: MIT
-Requires-Python: >=3.9
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: pydantic>=2.0
-Requires-Dist: PyYAML>=6.0
-Requires-Dist: pypdf>=4.0
-Provides-Extra: dev
-Requires-Dist: behave>=1.2.6; extra == "dev"
-Requires-Dist: coverage[toml]>=7.0; extra == "dev"
-Requires-Dist: sphinx>=7.0; extra == "dev"
-Requires-Dist: myst-parser>=2.0; extra == "dev"
-Requires-Dist: sphinx_rtd_theme>=2.0; extra == "dev"
-Requires-Dist: ruff>=0.4.0; extra == "dev"
-Requires-Dist: black>=24.0; extra == "dev"
-Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
-Provides-Extra: openai
-Requires-Dist: openai>=1.0; extra == "openai"
-Provides-Extra: unstructured
-Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
-Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
-Provides-Extra: ocr
-Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
-Dynamic: license-file
 # Biblicus
 ![Continuous integration][continuous-integration-badge]
@@ -67,7 +38,7 @@ If you want to run a real, executable version of this story, use `scripts/readme
 This simplified sequence diagram shows the same idea at a high level.
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
   participant App as Your assistant code
   participant KB as Knowledge base
@@ -106,7 +77,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
 This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
   participant User
   participant App as Your assistant code
@@ -158,8 +129,14 @@ python3 -m pip install biblicus
 Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
-- Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Advanced optical character recognition with PaddleOCR: `python3 -m pip install "biblicus[paddleocr]"`
+- Document understanding with Docling VLM: `python3 -m pip install "biblicus[docling]"`
+- Document understanding with Docling VLM and MLX acceleration: `python3 -m pip install "biblicus[docling-mlx]"`
+- Speech to text transcription with OpenAI: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Speech to text transcription with Deepgram: `python3 -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
+- MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
+- Topic modeling analysis with BERTopic: `python3 -m pip install "biblicus[topic-modeling]"`
 ## Quick start
@@ -417,6 +394,7 @@ The documents below follow the pipeline from raw items to model context:
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Speech to text][speech-to-text]
 - [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
@@ -465,7 +443,97 @@ corpus/
 Two backends are included.
 - `scan` is a minimal baseline that scans raw items directly.
-- `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+- `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
+For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
+## Extraction backends
+These extractors are built in. Optional ones require extra dependencies. See [text extraction documentation][text-extraction] for details.
+### Text and document extraction
+- [`pass-through-text`](docs/extractors/text-document/pass-through.md) reads text items and strips Markdown front matter.
+- [`metadata-text`](docs/extractors/text-document/metadata.md) turns catalog metadata into a small text artifact.
+- [`pdf-text`](docs/extractors/text-document/pdf.md) extracts text from Portable Document Format items with `pypdf`.
+- [`unstructured`](docs/extractors/text-document/unstructured.md) provides broad document parsing (optional).
+- [`markitdown`](docs/extractors/text-document/markitdown.md) converts many formats into Markdown-like text (optional).
+### Optical character recognition
+- [`ocr-rapidocr`](docs/extractors/ocr/rapidocr.md) does optical character recognition on images (optional).
+- [`ocr-paddleocr-vl`](docs/extractors/ocr/paddleocr-vl.md) does advanced optical character recognition with PaddleOCR vision-language model (optional).
+### Vision-language models
+- [`docling-smol`](docs/extractors/vlm-document/docling-smol.md) uses the SmolDocling-256M vision-language model for fast document understanding (optional).
+- [`docling-granite`](docs/extractors/vlm-document/docling-granite.md) uses the Granite Docling-258M vision-language model for high-accuracy extraction (optional).
+### Speech to text
+- [`stt-openai`](docs/extractors/speech-to-text/openai.md) performs speech to text on audio using OpenAI (optional).
+- [`stt-deepgram`](docs/extractors/speech-to-text/deepgram.md) performs speech to text on audio using Deepgram (optional).
+### Pipeline utilities
+- [`select-text`](docs/extractors/pipeline-utilities/select-text.md) chooses one prior extraction result in a pipeline.
+- [`select-longest-text`](docs/extractors/pipeline-utilities/select-longest.md) chooses the longest prior extraction result.
+- [`select-override`](docs/extractors/pipeline-utilities/select-override.md) chooses the last extraction result for matching media types in a pipeline.
+- [`select-smart-override`](docs/extractors/pipeline-utilities/select-smart-override.md) intelligently chooses between extraction results based on confidence and content quality.
+For detailed documentation on all extractors, see the [Extractor Reference][extractor-reference].
+## Topic modeling analysis
+Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Topic modeling is the first
+analysis backend. It reads an extraction run, optionally applies an LLM-driven extraction pass, applies lexical
+processing, runs BERTopic, and optionally applies an LLM fine-tuning pass to label topics. The output is structured
+JavaScript Object Notation.
+Run a topic analysis using a recipe file:
+```
+biblicus analyze topics --corpus corpora/example --recipe recipes/topic-modeling.yml --extraction-run pipeline:<run_id>
+```
+If `--extraction-run` is omitted, Biblicus uses the most recent extraction run and emits a warning about
+reproducibility. The analysis output is stored under:
+```
+.biblicus/runs/analysis/topic-modeling/<run_id>/output.json
+```
+Minimal recipe example:
+```yaml
+schema_version: 1
+text_source:
+  sample_size: 200
+llm_extraction:
+  enabled: false
+lexical_processing:
+  enabled: true
+  lowercase: true
+  strip_punctuation: false
+  collapse_whitespace: true
+bertopic_analysis:
+  parameters:
+    min_topic_size: 8
+    nr_topics: 10
+llm_fine_tuning:
+  enabled: false
+```
+LLM extraction and fine-tuning require `biblicus[openai]` and a configured OpenAI API key.
+Recipe files are validated strictly against the topic modeling schema, so type mismatches or unknown fields are errors.
+For a repeatable, real-world integration run that downloads a Wikipedia corpus and executes topic modeling, use:
+```
+python3 scripts/topic_modeling_integration.py --corpus corpora/wiki_demo --force
+```
+See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
 ## Integration corpus and evaluation dataset
@@ -522,6 +590,9 @@ License terms are in `LICENSE`.
 [corpus]: docs/CORPUS.md
 [knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
+[extractor-reference]: docs/extractors/index.md
+[backend-reference]: docs/backends/index.md
+[speech-to-text]: docs/STT.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md
 [context-packs]: docs/CONTEXT_PACK.md
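
The README text added above says recipe files are validated strictly, so type mismatches and unknown fields are errors. The sketch below illustrates that idea using only the standard library. It is not Biblicus's validator: the `ALLOWED_FIELDS` table is inferred solely from the minimal recipe in the hunk, and `validate_strict` is a hypothetical helper. The recipe is written as a Python dictionary to avoid assuming a YAML loader.

```python
# Hypothetical schema, inferred only from the minimal recipe shown in the diff.
# The real topic modeling schema may define more fields than these.
ALLOWED_FIELDS = {
    "schema_version": int,
    "text_source": {"sample_size": int},
    "llm_extraction": {"enabled": bool},
    "lexical_processing": {
        "enabled": bool,
        "lowercase": bool,
        "strip_punctuation": bool,
        "collapse_whitespace": bool,
    },
    "bertopic_analysis": {"parameters": dict},
    "llm_fine_tuning": {"enabled": bool},
}

# The minimal recipe from the README, expressed as a dictionary.
RECIPE = {
    "schema_version": 1,
    "text_source": {"sample_size": 200},
    "llm_extraction": {"enabled": False},
    "lexical_processing": {
        "enabled": True,
        "lowercase": True,
        "strip_punctuation": False,
        "collapse_whitespace": True,
    },
    "bertopic_analysis": {"parameters": {"min_topic_size": 8, "nr_topics": 10}},
    "llm_fine_tuning": {"enabled": False},
}


def validate_strict(data, schema, path="recipe"):
    """Return a list of error strings; unknown fields and wrong types are both errors."""
    errors = []
    for key, value in data.items():
        if key not in schema:
            errors.append(f"{path}.{key}: unknown field")
            continue
        expected = schema[key]
        if isinstance(expected, dict):
            # Nested section: recurse into it with the nested schema.
            if isinstance(value, dict):
                errors.extend(validate_strict(value, expected, f"{path}.{key}"))
            else:
                errors.append(f"{path}.{key}: expected a mapping")
        elif type(value) is not expected:
            # Exact type check, so a string such as "yes" is rejected for a bool field.
            errors.append(
                f"{path}.{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


# A deliberately broken variant: one type mismatch and one unknown field.
BAD_RECIPE = dict(RECIPE)
BAD_RECIPE["lexical_processing"] = {"enabled": True, "lowercase": "yes", "stemming": True}

print(validate_strict(RECIPE, ALLOWED_FIELDS))
for problem in validate_strict(BAD_RECIPE, ALLOWED_FIELDS):
    print(problem)
```

Rejecting unknown fields outright means a misspelled option fails loudly instead of being silently ignored, which fits the reproducibility concern the README raises about `--extraction-run`.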

{biblicus-0.6.0 → biblicus-0.8.0}/docs/BACKENDS.md RENAMED

@@ -3,6 +3,8 @@
 Backends are pluggable engines that implement a small, stable interface.
 The goal is to make new retrieval ideas easy to test without reshaping the corpus.
+For user documentation on available backends, see the [Backend Reference](backends/index.md).
 ## Backend contract
 Backends implement two operations:

{biblicus-0.6.0 → biblicus-0.8.0}/docs/DEMOS.md RENAMED

@@ -185,6 +185,28 @@ python3 -m biblicus extract build --corpus corpora/demo --step pass-through-text
 The output includes a `run_id` you can reuse when building a retrieval backend.
+### Topic modeling integration run
+Use the integration script to download a Wikipedia corpus, run extraction, and run topic modeling with a single command.
+```
+python3 scripts/topic_modeling_integration.py --corpus corpora/wiki_demo --force
+```
+Run with a smaller corpus and a higher topic count:
+```
+python3 scripts/topic_modeling_integration.py \
+  --corpus corpora/wiki_demo \
+  --force \
+  --limit 20 \
+  --bertopic-param nr_topics=8 \
+  --bertopic-param min_topic_size=2
+```
+The command prints the analysis run identifier and the output path. Open the `output.json` file to inspect per-topic labels,
+keywords, and document examples.
 ### Select extracted text within a pipeline
 When you want an explicit choice among multiple extraction outputs, add a selection extractor step at the end of the pipeline.
@@ -221,6 +243,25 @@ python3 -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-tex
 python3 -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"
 ```
+### Wikipedia retrieval demo (Python)
+This example downloads a few Wikipedia summaries about retrieval and knowledge bases, builds an extraction run, creates a local full text index, and returns evidence plus a context pack.
+```
+rm -rf corpora/wikipedia_rag_demo
+python3 scripts/wikipedia_rag_demo.py --corpus corpora/wikipedia_rag_demo --force
+```
+### MarkItDown extraction demo (Python 3.10+)
+MarkItDown requires Python 3.10 or higher. This example uses the `py311` conda environment to run the extractor over the mixed sample corpus.
+```
+conda run -n py311 python -m pip install -e . "markitdown[all]"
+conda run -n py311 python scripts/download_mixed_samples.py --corpus corpora/markitdown_demo_py311 --force
+conda run -n py311 python -m biblicus extract build --corpus corpora/markitdown_demo_py311 --step markitdown
+```
 ### Mixed modality integration corpus
 This example assembles a tiny mixed corpus with a Markdown note, a Hypertext Markup Language page, an image, a Portable Document Format file with extractable text, and a generated Portable Document Format file with no extractable text.
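
The topic modeling integration commands added to DEMOS.md repeat `--bertopic-param` once per `name=value` pair. When driving the script from Python, the argument list can be assembled as in the sketch below. It uses only the flags that appear in the diff (`--corpus`, `--force`, `--limit`, `--bertopic-param`); the function `build_integration_command` is a hypothetical helper, not part of Biblicus.

```python
import shlex


def build_integration_command(corpus, force=False, limit=None, bertopic_params=None):
    """Assemble the argument list for scripts/topic_modeling_integration.py (hypothetical helper)."""
    command = ["python3", "scripts/topic_modeling_integration.py", "--corpus", corpus]
    if force:
        command.append("--force")
    if limit is not None:
        command.extend(["--limit", str(limit)])
    # Each parameter becomes its own repeated --bertopic-param name=value pair.
    for name, value in (bertopic_params or {}).items():
        command.extend(["--bertopic-param", f"{name}={value}"])
    return command


command = build_integration_command(
    "corpora/wiki_demo",
    force=True,
    limit=20,
    bertopic_params={"nr_topics": 8, "min_topic_size": 2},
)
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` from a checkout of the repository would run the script without shell quoting concerns, assuming the script is present at that path.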

{biblicus-0.6.0 → biblicus-0.8.0}/docs/EXTRACTION.md RENAMED

@@ -1,9 +1,11 @@
-# Text extraction
+# Text Extraction Pipeline
 Text extraction is a separate pipeline stage that produces derived text artifacts under a corpus.
 This separation matters because it lets you combine extraction choices and retrieval backends independently.
+For detailed documentation on specific extractors, see [Extractor Reference](extractors/index.md).
 ## What extraction produces
 An extraction run produces:
@@ -31,78 +33,42 @@ corpus/
                   <item id>.txt
 ```
-## Built in extractors
-Version zero includes a small set of deterministic extractors.
-`pass-through-text`
-- Reads text items and returns their content
-- For Markdown items, it strips YAML front matter and returns only the body
-- Skips non text items
-`metadata-text`
-- Builds a small text representation from catalog metadata
-- This is useful when you have a non text item with meaningful tags or a title
-`pdf-text`
-- Attempts to extract text from Portable Document Format items
-- Skips items that are not Portable Document Format
-- Uses the `pypdf` library
-- Produces empty output for scanned Portable Document Format files that contain no extractable text without optical character recognition
-`select-text`
+## Available Extractors
-- Selects extracted text artifacts from earlier pipeline steps
-- This is used when you have multiple pipeline steps that can produce usable text for the same items and you want one chosen result
-- Records which step supplied the selected text
+Biblicus provides 16 built-in extractors organized by category:
-`unstructured`
+### Text & Document Processing
-- Broad document text extraction backed by the optional `unstructured` dependency
-- Intended as a last-resort extractor for non-text items when more specific extractors cannot produce usable text
-- Skips items that are already text so the pass-through extractor remains the canonical choice for text items
+- [`pass-through-text`](extractors/text-document/pass-through.md) - Direct text file reading
+- [`metadata-text`](extractors/text-document/metadata.md) - Text from item metadata
+- [`pdf-text`](extractors/text-document/pdf.md) - PDF text extraction using pypdf
+- [`markitdown`](extractors/text-document/markitdown.md) - Office documents via MarkItDown
+- [`unstructured`](extractors/text-document/unstructured.md) - Universal document parsing
-To install:
+### Optical Character Recognition
-```
-python3 -m pip install "biblicus[unstructured]"
-```
-`ocr-rapidocr`
+- [`ocr-rapidocr`](extractors/ocr/rapidocr.md) - Fast ONNX-based OCR
+- [`ocr-paddleocr-vl`](extractors/ocr/paddleocr-vl.md) - Advanced OCR with VL model
-- Optical character recognition for image items
-- Backed by the optional `rapidocr-onnxruntime` dependency
-- Intended as a practical default when you need text from images without running a service
+### Vision-Language Models
-To install:
+- [`docling-smol`](extractors/vlm-document/docling-smol.md) - SmolDocling-256M for fast document processing
+- [`docling-granite`](extractors/vlm-document/docling-granite.md) - Granite Docling-258M for high-accuracy extraction
-```
-python3 -m pip install "biblicus[ocr]"
-```
+### Speech-to-Text
-`stt-openai`
+- [`stt-openai`](extractors/speech-to-text/openai.md) - OpenAI Whisper API
+- [`stt-deepgram`](extractors/speech-to-text/deepgram.md) - Deepgram Nova-3 API
-- Speech to text transcription for audio items
-- Backed by the optional `openai` dependency
-- Requires an OpenAI API key (from `OPENAI_API_KEY` or the user configuration file)
-To install:
-```
-python3 -m pip install "biblicus[openai]"
-```
+### Pipeline Utilities
-To configure:
+- [`select-text`](extractors/pipeline-utilities/select-text.md) - First successful extractor
+- [`select-longest-text`](extractors/pipeline-utilities/select-longest.md) - Longest output selection
+- [`select-override`](extractors/pipeline-utilities/select-override.md) - Per-item override by ID
+- [`select-smart-override`](extractors/pipeline-utilities/select-smart-override.md) - Media type-based routing
+- [`pipeline`](extractors/pipeline-utilities/pipeline.md) - Multi-step extraction workflow
-- Create `~/.biblicus/config.yml` or `./.biblicus/config.yml` with:
-```
-openai:
-  api_key: YOUR_KEY_HERE
-```
+For detailed documentation including configuration options, usage examples, and best practices, see the [Extractor Reference](extractors/index.md).
 ## How selection chooses text
@@ -110,12 +76,12 @@ The `select-text` extractor does not attempt to judge extraction quality. It cho
 Usable means non-empty after stripping whitespace.
-This means selection does not automatically choose the longest extracted text or the extraction with the most content. If you want a scoring rule such as choose the longest extracted text, that should be a separate selection extractor so the policy is explicit, versioned, and testable.
+This means selection does not automatically choose the longest extracted text or the extraction with the most content. If you want a scoring rule such as choose the longest extracted text, use the [`select-longest-text`](extractors/pipeline-utilities/select-longest.md) extractor instead.
-`select-longest-text`
+Other selection strategies include:
-- Selects the longest usable extracted text from earlier pipeline steps
-- Useful when you have multiple competing extractors for the same item types and you want a deterministic “more content wins” policy
+- [`select-override`](extractors/pipeline-utilities/select-override.md) - Override extraction for specific items by ID
+- [`select-smart-override`](extractors/pipeline-utilities/select-smart-override.md) - Route items based on media type patterns
 ## Pipeline extractor
@@ -125,6 +91,8 @@ The pipeline runs every step in order and records all step outputs. Each step re
 This lets you build explicit extraction policies while keeping every step outcome available for comparison and metrics.
+For details, see the [`pipeline` extractor documentation](extractors/pipeline-utilities/pipeline.md).
 ## Complementary versus competing extractors
 The pipeline is designed for complementary steps that do not overlap much in what they handle.
@@ -148,9 +116,9 @@ python3 -m biblicus init corpora/extraction-demo
 printf 'x' > /tmp/image.png
 python3 -m biblicus ingest --corpus corpora/extraction-demo /tmp/image.png --tag extracted
-python3 -m biblicus extract build --corpus corpora/extraction-demo \\
-  --step pass-through-text \\
-  --step pdf-text \\
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step pass-through-text \
+  --step pdf-text \
   --step metadata-text
 ```
@@ -161,14 +129,38 @@ The extracted text for the image comes from the `metadata-text` step because the
 Selection is a pipeline step that chooses extracted text from previous pipeline steps. Selection is just another extractor in the pipeline, and it decides which prior output to carry forward.
 ```
-python3 -m biblicus extract build --corpus corpora/extraction-demo \\
-  --step pass-through-text \\
-  --step metadata-text \\
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step pass-through-text \
+  --step metadata-text \
   --step select-text
 ```
 The pipeline run produces one extraction run under `pipeline`. You can point retrieval backends at that run.
+## Example: PDF with OCR fallback
+Try text extraction first, fall back to OCR for scanned documents:
+```
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step pdf-text \
+  --step ocr-rapidocr \
+  --step select-text
+```
+This pipeline tries `pdf-text` first for PDFs with text layers, falls back to `ocr-rapidocr` for scanned PDFs, and uses `select-text` to pick the first successful result.
+## Example: VLM for complex documents
+Use vision-language models for documents with complex layouts:
+```
+python3 -m biblicus extract build --corpus corpora/extraction-demo \
+  --step docling-granite
+```
+The `docling-granite` extractor uses IBM Research's Granite Docling-258M VLM for high-accuracy extraction of tables, code blocks, and equations.
 ## Inspecting and deleting extraction runs
 Extraction runs are stored under the corpus and can be listed and inspected.
@@ -181,8 +173,8 @@ python3 -m biblicus extract show --corpus corpora/extraction-demo --run pipeline
 Deletion is explicit and requires typing the exact run reference as confirmation:
 ```
-python3 -m biblicus extract delete --corpus corpora/extraction-demo \\
-  --run pipeline:EXTRACTION_RUN_ID \\
+python3 -m biblicus extract delete --corpus corpora/extraction-demo \
+  --run pipeline:EXTRACTION_RUN_ID \
   --confirm pipeline:EXTRACTION_RUN_ID
 ```
@@ -191,7 +183,7 @@ python3 -m biblicus extract delete --corpus corpora/extraction-demo \\
 Retrieval backends can build and query using a selected extraction run. This is configured by passing `extraction_run=extractor_id:run_id` to the backend build command.
 ```
-python3 -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full-text-search \\
+python3 -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full-text-search \
   --config extraction_run=pipeline:EXTRACTION_RUN_ID
 python3 -m biblicus query --corpus corpora/extraction-demo --query extracted
 ```
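
EXTRACTION.md distinguishes two selection policies: `select-text`, described in the diff as picking the first successful result, and `select-longest-text`, where more content wins. The sketch below contrasts the two over a list of step outputs. It is a simplified model, not the Biblicus implementation: the function names, the `(step_name, text)` tuple shape, and measuring length after stripping whitespace are all assumptions. The usability test follows the doc's definition: non-empty after stripping whitespace.

```python
def is_usable(text):
    """Usable means non-empty after stripping whitespace, as the doc defines it."""
    return text is not None and text.strip() != ""


def select_first_usable(step_outputs):
    """Model of a 'first successful result' policy, in pipeline order."""
    for step_name, text in step_outputs:
        if is_usable(text):
            return step_name, text
    return None


def select_longest_usable(step_outputs):
    """Model of a 'more content wins' policy; ties go to the earlier step (an assumption)."""
    usable = [(name, text) for name, text in step_outputs if is_usable(text)]
    if not usable:
        return None
    return max(usable, key=lambda pair: len(pair[1].strip()))


# Hypothetical outputs for one scanned PDF item, in pipeline order.
outputs = [
    ("pdf-text", "   "),                      # no text layer, so only whitespace
    ("ocr-rapidocr", "Scanned page text"),
    ("metadata-text", "title: Scanned page text, tags: invoice"),
]

print(select_first_usable(outputs)[0])
print(select_longest_usable(outputs)[0])
```

The two policies can disagree on the same item, which is why the doc treats them as separate, explicitly named extractors.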

{biblicus-0.6.0 → biblicus-0.8.0}/docs/FEATURE_INDEX.md RENAMED

@@ -123,6 +123,7 @@ What it does:
 - Includes a Portable Document Format text extractor plugin.
 - Includes a speech to text extractor plugin for audio items.
 - Includes a selection extractor step for choosing extracted text within a pipeline.
+- Includes a MarkItDown extractor plugin for document conversion.
 Documentation:
@@ -139,6 +140,7 @@ Behavior specifications:
 - `features/ocr_extractor.feature`
 - `features/stt_extractor.feature`
 - `features/unstructured_extractor.feature`
+- `features/markitdown_extractor.feature`
 - `features/integration_unstructured_extraction.feature`
 Primary implementation:
