biblicus 0.5.0__tar.gz → 0.7.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {biblicus-0.5.0/src/biblicus.egg-info → biblicus-0.7.0}/PKG-INFO +57 -4
- {biblicus-0.5.0 → biblicus-0.7.0}/README.md +54 -3
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/DEMOS.md +19 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/EXTRACTION.md +21 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/FEATURE_INDEX.md +17 -0
- biblicus-0.7.0/docs/KNOWLEDGE_BASE.md +68 -0
- biblicus-0.7.0/docs/ROADMAP.md +155 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/api.rst +4 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/index.rst +1 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/environment.py +26 -0
- biblicus-0.7.0/features/knowledge_base.feature +55 -0
- biblicus-0.7.0/features/markitdown_extractor.feature +99 -0
- biblicus-0.7.0/features/steps/knowledge_base_steps.py +90 -0
- biblicus-0.7.0/features/steps/markitdown_steps.py +173 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/pyproject.toml +5 -1
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/test.py +15 -4
- biblicus-0.7.0/scripts/wikipedia_rag_demo.py +212 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/__init__.py +3 -1
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/__init__.py +2 -0
- biblicus-0.7.0/src/biblicus/extractors/markitdown_text.py +128 -0
- biblicus-0.7.0/src/biblicus/knowledge_base.py +191 -0
- {biblicus-0.5.0 → biblicus-0.7.0/src/biblicus.egg-info}/PKG-INFO +57 -4
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/SOURCES.txt +8 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/requires.txt +5 -0
- biblicus-0.5.0/docs/ROADMAP.md +0 -81
- {biblicus-0.5.0 → biblicus-0.7.0}/LICENSE +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/MANIFEST.in +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/THIRD_PARTY_NOTICES.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/datasets/wikipedia_mini.json +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/ARCHITECTURE.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/BACKENDS.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/CONTEXT_PACK.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/CORPUS.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/CORPUS_DESIGN.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/TESTING.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/USER_CONFIGURATION.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/conf.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/backend_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/biblicus_corpus.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/cli_entrypoint.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/cli_parsing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/content_sniffing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/context_pack.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/context_pack_cli.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_edge_cases.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_identity.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_purge.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/crawl.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/error_cases.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/evaluation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/evidence_processing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_error_handling.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_run_lifecycle.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_selection.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_selection_longest.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extractor_pipeline.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extractor_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/frontmatter.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/hook_config_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/hook_error_handling.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/import_tree.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/ingest_sources.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_audio_samples.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_image_samples.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_mixed_corpus.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_mixed_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_ocr_image_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_pdf_retrieval.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_pdf_samples.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_unstructured_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_wikipedia.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/lifecycle_hooks.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/model_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/ocr_extractor.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/pdf_text_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/python_api.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/python_hook_logging.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/query_processing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_budget.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_scan.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_uses_extraction_run.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_utilities.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/source_loading.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/backend_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/cli_parsing_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/cli_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/context_pack_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/crawl_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/evidence_processing_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extraction_run_lifecycle_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extraction_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extractor_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/frontmatter_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/model_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/openai_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/pdf_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/python_api_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/rapidocr_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/retrieval_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/stt_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/unstructured_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/user_config_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/streaming_ingest.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/stt_extractor.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/text_extraction_runs.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/token_budget.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/unstructured_extractor.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/user_config.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_audio_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_image_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_mixed_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_pdf_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_wikipedia.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/readme_end_to_end_demo.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/setup.cfg +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/__main__.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/__init__.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/base.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/scan.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/cli.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/constants.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/context.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/corpus.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/crawl.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/errors.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/evaluation.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/evidence_processing.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extraction.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/base.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/metadata_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/openai_stt.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pass_through_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pdf_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pipeline.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/select_longest_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/select_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/unstructured_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/frontmatter.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hook_logging.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hook_manager.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hooks.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/ignore.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/models.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/retrieval.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/sources.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/time.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/uris.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/user_config.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/entry_points.txt +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/top_level.txt +0 -0
{biblicus-0.5.0/src/biblicus.egg-info → biblicus-0.7.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.5.0
+Version: 0.7.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -25,6 +25,8 @@ Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
 Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
 Provides-Extra: ocr
 Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
+Provides-Extra: markitdown
+Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
 Dynamic: license-file
 
 # Biblicus
@@ -45,6 +47,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
 
 See [retrieval augmented generation overview] for a short introduction to the idea.
 
+## Start with a knowledge base
+
+If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
+
+This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+
+
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+
+print(context_pack.text)
+```
+
+If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
+
+This simplified sequence diagram shows the same idea at a high level.
+
+```mermaid
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+sequenceDiagram
+    participant App as Your assistant code
+    participant KB as Knowledge base
+    participant LLM as Large language model
+
+    App->>KB: query
+    KB-->>App: evidence and context
+    App->>LLM: context plus prompt
+    LLM-->>App: response draft
+```
+
 ## A simple mental model
 
 Think in three stages.
@@ -72,7 +108,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
 This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
 
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
     participant User
     participant App as Your assistant code
@@ -126,6 +162,7 @@ Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
 - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
+- MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
 
 ## Quick start
 
@@ -153,11 +190,11 @@ biblicus crawl --corpus corpora/example \\
   --tag crawled
 ```
 
-## End-to-end example:
+## End-to-end example: lower-level control
 
 The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
 
-
+This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
 
 ```python
 from biblicus.backends import get_backend
@@ -383,6 +420,7 @@ The documents below follow the pipeline from raw items to model context:
 
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
 - [Testing and evaluation][testing]
@@ -432,6 +470,20 @@ Two backends are included.
 
 - `scan` is a minimal baseline that scans raw items directly.
 - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+## Extraction backends
+
+These extractors are built in. Optional ones require extra dependencies.
+
+- `pass-through-text` reads text items and strips Markdown front matter.
+- `metadata-text` turns catalog metadata into a small text artifact.
+- `pdf-text` extracts text from Portable Document Format items with `pypdf`.
+- `select-text` chooses one prior extraction result in a pipeline.
+- `select-longest-text` chooses the longest prior extraction result.
+- `ocr-rapidocr` does optical character recognition on images (optional).
+- `stt-openai` performs speech to text on audio (optional).
+- `unstructured` provides broad document parsing (optional).
+- `markitdown` converts many formats into Markdown-like text (optional).
+
 ## Integration corpus and evaluation dataset
 
 Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
@@ -485,6 +537,7 @@ License terms are in `LICENSE`.
 [roadmap]: docs/ROADMAP.md
 [feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
+[knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md

{biblicus-0.5.0 → biblicus-0.7.0}/README.md

@@ -16,6 +16,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
 
 See [retrieval augmented generation overview] for a short introduction to the idea.
 
+## Start with a knowledge base
+
+If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
+
+This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+
+
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+
+print(context_pack.text)
+```
+
+If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
+
+This simplified sequence diagram shows the same idea at a high level.
+
+```mermaid
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+sequenceDiagram
+    participant App as Your assistant code
+    participant KB as Knowledge base
+    participant LLM as Large language model
+
+    App->>KB: query
+    KB-->>App: evidence and context
+    App->>LLM: context plus prompt
+    LLM-->>App: response draft
+```
+
 ## A simple mental model
 
 Think in three stages.
@@ -43,7 +77,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
 This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
 
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
     participant User
     participant App as Your assistant code
@@ -97,6 +131,7 @@ Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
 - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
+- MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
 
 ## Quick start
 
@@ -124,11 +159,11 @@ biblicus crawl --corpus corpora/example \\
   --tag crawled
 ```
 
-## End-to-end example:
+## End-to-end example: lower-level control
 
 The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
 
-
+This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
 
 ```python
 from biblicus.backends import get_backend
@@ -354,6 +389,7 @@ The documents below follow the pipeline from raw items to model context:
 
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
 - [Testing and evaluation][testing]
@@ -403,6 +439,20 @@ Two backends are included.
 
 - `scan` is a minimal baseline that scans raw items directly.
 - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+## Extraction backends
+
+These extractors are built in. Optional ones require extra dependencies.
+
+- `pass-through-text` reads text items and strips Markdown front matter.
+- `metadata-text` turns catalog metadata into a small text artifact.
+- `pdf-text` extracts text from Portable Document Format items with `pypdf`.
+- `select-text` chooses one prior extraction result in a pipeline.
+- `select-longest-text` chooses the longest prior extraction result.
+- `ocr-rapidocr` does optical character recognition on images (optional).
+- `stt-openai` performs speech to text on audio (optional).
+- `unstructured` provides broad document parsing (optional).
+- `markitdown` converts many formats into Markdown-like text (optional).
+
 ## Integration corpus and evaluation dataset
 
 Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
@@ -456,6 +506,7 @@ License terms are in `LICENSE`.
 [roadmap]: docs/ROADMAP.md
 [feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
+[knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md

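The sequence diagram added to the README ends with "context plus prompt", but the new section stops short of showing that hand-off in code. Here is a minimal sketch of that last step, using only the `KnowledgeBase` calls shown in the diff above; `call_model` is a hypothetical stand-in for whatever model client your assistant already uses.

```python
from biblicus.knowledge_base import KnowledgeBase


def call_model(prompt: str) -> str:
    """Hypothetical placeholder for your own large language model client."""
    raise NotImplementedError


def answer(question: str) -> str:
    # Build (or reuse) the knowledge base from a plain folder of notes.
    kb = KnowledgeBase.from_folder("notes")
    # Retrieval returns evidence; the context pack shapes it to a token budget.
    result = kb.query(question)
    context_pack = kb.context_pack(result, max_tokens=800)
    # "Context plus prompt" from the sequence diagram: prepend evidence to the question.
    prompt = f"{context_pack.text}\n\nQuestion: {question}"
    return call_model(prompt)
```
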
{biblicus-0.5.0 → biblicus-0.7.0}/docs/DEMOS.md

@@ -221,6 +221,25 @@ python3 -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-tex
 python3 -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"
 ```
 
+### Wikipedia retrieval demo (Python)
+
+This example downloads a few Wikipedia summaries about retrieval and knowledge bases, builds an extraction run, creates a local full text index, and returns evidence plus a context pack.
+
+```
+rm -rf corpora/wikipedia_rag_demo
+python3 scripts/wikipedia_rag_demo.py --corpus corpora/wikipedia_rag_demo --force
+```
+
+### MarkItDown extraction demo (Python 3.10+)
+
+MarkItDown requires Python 3.10 or higher. This example uses the `py311` conda environment to run the extractor over the mixed sample corpus.
+
+```
+conda run -n py311 python -m pip install -e . "markitdown[all]"
+conda run -n py311 python scripts/download_mixed_samples.py --corpus corpora/markitdown_demo_py311 --force
+conda run -n py311 python -m biblicus extract build --corpus corpora/markitdown_demo_py311 --step markitdown
+```
+
 ### Mixed modality integration corpus
 
 This example assembles a tiny mixed corpus with a Markdown note, a Hypertext Markup Language page, an image, a Portable Document Format file with extractable text, and a generated Portable Document Format file with no extractable text.

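The Wikipedia demo above is driven from the shell. For readers who prefer to stay in Python, here is a rough sketch of the same idea using the high-level interface from the README; this is not the demo script itself, it assumes the downloaded summaries have already been saved as plain text files under a hypothetical `wikipedia_summaries/` folder, and it accepts the default backend rather than the demo's explicit full text index build.

```python
from biblicus.knowledge_base import KnowledgeBase

# Hypothetical folder of previously downloaded Wikipedia summaries saved as .txt files.
kb = KnowledgeBase.from_folder("wikipedia_summaries")

# Retrieve evidence and shape it into a context pack, as the demo script does.
result = kb.query("retrieval augmented generation")
context_pack = kb.context_pack(result, max_tokens=800)
print(context_pack.text)
```
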
{biblicus-0.5.0 → biblicus-0.7.0}/docs/EXTRACTION.md

@@ -71,6 +71,27 @@ To install:
 python3 -m pip install "biblicus[unstructured]"
 ```
 
+`markitdown`
+
+- Converts common document formats into Markdown-like text
+- Backed by the optional `markitdown` dependency
+- Requires Python 3.10 or higher
+- Skips items that are already text so the pass-through extractor remains the canonical choice for text items
+- This means it will not process `text/html` or other text media types unless that policy changes
+
+To install:
+
+```
+python3 -m pip install "biblicus[markitdown]"
+```
+
+Example:
+
+```
+python3 -m biblicus extract build --corpus corpora/extraction-demo \\
+  --step markitdown
+```
+
 `ocr-rapidocr`
 
 - Optical character recognition for image items

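Because the MarkItDown extractor is both optional and gated on the interpreter version, a quick preflight check can save a confusing failure later. The sketch below uses only the standard library; the extra name matches the `biblicus[markitdown]` install line shown in the diff above.

```python
import importlib.util
import sys


def markitdown_available() -> bool:
    """Return True when the optional markitdown extra can actually be used."""
    if sys.version_info < (3, 10):
        # The markitdown extra is only declared for Python 3.10 and higher.
        return False
    # The dependency is optional, so it may simply not be installed.
    return importlib.util.find_spec("markitdown") is not None


if not markitdown_available():
    print('Install with: python3 -m pip install "biblicus[markitdown]" (Python 3.10 or higher)')
```
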
{biblicus-0.5.0 → biblicus-0.7.0}/docs/FEATURE_INDEX.md

@@ -123,6 +123,7 @@ What it does:
 - Includes a Portable Document Format text extractor plugin.
 - Includes a speech to text extractor plugin for audio items.
 - Includes a selection extractor step for choosing extracted text within a pipeline.
+- Includes a MarkItDown extractor plugin for document conversion.
 
 Documentation:
 
@@ -139,6 +140,7 @@ Behavior specifications:
 - `features/ocr_extractor.feature`
 - `features/stt_extractor.feature`
 - `features/unstructured_extractor.feature`
+- `features/markitdown_extractor.feature`
 - `features/integration_unstructured_extraction.feature`
 
 Primary implementation:
@@ -208,6 +210,21 @@ Primary implementation:
 
 - `src/biblicus/context.py`
 
+## Knowledge base
+
+What it does:
+
+- Provides a turnkey interface that accepts a folder and returns a ready-to-query workflow.
+- Applies sensible defaults for import, retrieval, and context pack shaping.
+
+Behavior specifications:
+
+- `features/knowledge_base.feature`
+
+Primary implementation:
+
+- `src/biblicus/knowledge_base.py`
+
 ## Testing, coverage, and documentation build
 
 What it does:

biblicus-0.7.0/docs/KNOWLEDGE_BASE.md (new file)

@@ -0,0 +1,68 @@
+# Knowledge base
+
+The knowledge base is the high‑level, turnkey workflow that makes Biblicus feel effortless. You hand it a folder. It chooses sensible defaults, builds a retrieval run, and gives you evidence you can turn into context.
+
+This is the right layer when you want to use Biblicus without spending time on setup. You can still override the defaults later when you want fine‑grained control.
+
+## What it does
+
+- Creates or opens a corpus at a chosen location (or a temporary location if you do not provide one).
+- Imports a folder tree into that corpus.
+- Builds a retrieval run using a default backend.
+- Exposes a simple `query` method that returns evidence.
+- Exposes a `context_pack` helper to shape evidence into model context.
+
+## Minimal use
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+
+
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+
+print(context_pack.text)
+```
+
+## Default behavior
+
+The knowledge base wraps existing primitives. Defaults are explicit and deterministic.
+
+- **Corpus**: stored on disk and fully inspectable.
+- **Import**: uses the folder tree import, preserving relative paths.
+- **Backend**: defaults to the `scan` backend.
+- **Query budget**: defaults to a small, conservative evidence budget.
+
+## Overrides
+
+You can override the defaults when needed.
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+from biblicus.models import QueryBudget
+
+
+kb = KnowledgeBase.from_folder(
+    "notes",
+    backend_id="scan",
+    recipe_name="Knowledge base demo",
+    query_budget=QueryBudget(max_total_items=10, max_total_characters=4000, max_items_per_source=None),
+    tags=["memory"],
+    corpus_root="corpora/knowledge-base",
+)
+```
+
+## How it relates to lower‑level control
+
+The knowledge base is a convenience layer. It uses the same underlying parts that the lower‑level examples use.
+
+- `Corpus` for ingestion and storage
+- `import_tree` for folder ingestion
+- A backend run (`scan` by default)
+- `QueryBudget` for evidence limits
+- `ContextPackPolicy` and token fitting for context shaping
+
+You can always drop down to those lower‑level primitives when you need more control.
+
+If the high‑level workflow is not enough, switch to `Corpus`, `get_backend`, and `ContextPackPolicy` directly.

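Since the new document says the knowledge base "creates or opens a corpus at a chosen location", a natural pattern is to pin `corpus_root` so repeated runs reuse the same on-disk corpus instead of a temporary one. Below is a minimal sketch, assuming only the keyword arguments shown in the overrides example above; the `corpora/assistant-memory` path and the budget numbers are illustrative.

```python
from biblicus.knowledge_base import KnowledgeBase
from biblicus.models import QueryBudget

# Pin the corpus location so successive runs open the same on-disk corpus.
kb = KnowledgeBase.from_folder(
    "notes",
    corpus_root="corpora/assistant-memory",
    query_budget=QueryBudget(
        max_total_items=5,
        max_total_characters=2000,
        max_items_per_source=None,
    ),
)

result = kb.query("favorite color")
print(kb.context_pack(result, max_tokens=400).text)
```
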
biblicus-0.7.0/docs/ROADMAP.md (new file)

@@ -0,0 +1,155 @@
+# Roadmap
+
+This document describes what we plan to build next.
+
+If you are looking for runnable examples, see `docs/DEMOS.md`.
+
+If you are looking for what already exists, start with:
+
+- `docs/FEATURE_INDEX.md` for a map of features to behavior specifications and modules.
+- `CHANGELOG.md` for released changes.
+
+## Principles
+
+- Behavior specifications are the authoritative definition of behavior.
+- Every behavior that exists is specified.
+- Validation and documentation are part of the product.
+- Raw corpus items remain readable, portable files.
+- Derived artifacts are stored under the corpus and can coexist for multiple implementations.
+
+## Next: retrieval evaluation and datasets
+
+Goal: make evaluation results easier to interpret and compare.
+
+Deliverables:
+
+- A dataset authoring workflow that supports small hand-labeled sets and larger synthetic sets.
+- A report that includes per-query diagnostics and a clear summary.
+
+Acceptance checks:
+
+- Dataset formats are versioned when they change.
+- Reports remain deterministic for the same inputs.
+
+## Next: context pack policy surfaces
+
+Goal: make context shaping policies easier to evaluate and swap.
+
+Deliverables:
+
+- A clear set of context pack policy variants (formatting, ordering, metadata inclusion).
+- Token budget strategies that can use a real tokenizer.
+- Documentation that explains where context shaping fits in the pipeline.
+
+Acceptance checks:
+
+- Behavior specifications cover policy selection and budgeting behaviors.
+- Example outputs show how context packs differ across policies.
+
+## Next: extraction backends (OCR and document understanding)
+
+Goal: treat optical character recognition and document understanding as pluggable extractors with consistent inputs and outputs.
+
+Deliverables:
+
+- A baseline OCR extractor that is fast and local for smoke tests.
+- A higher quality OCR extractor candidate (for example: Paddle OCR or Docling OCR).
+- A general document understanding extractor candidate (for example: Docling or Unstructured).
+- A consistent output contract that captures text plus optional confidence and per-page metadata.
+- A selector policy for choosing between multiple extractor outputs in a pipeline.
+- A shared evaluation harness for extraction backends using the same corpus and dataset.
+
+Acceptance checks:
+
+- Behavior specifications cover extractor selection and output provenance.
+- Evaluation reports compare accuracy, processable fraction, latency, and cost.
+
+## Next: corpus analysis tools
+
+Goal: provide lightweight analysis utilities that summarize corpus themes and guide curation.
+
+Deliverables:
+
+- A topic modeling workflow for corpus analysis (for example: BERTopic).
+- A report that highlights dominant themes and outliers.
+- A way to compare topic distributions across corpora or corpus snapshots.
+
+Acceptance checks:
+
+- Analysis is reproducible for the same corpus state.
+- Reports are exportable and readable without custom tooling.
+
+### Candidate backend ecosystem (for planning and evaluation)
+
+Document understanding and OCR blur together at the interface level in Biblicus, so the roadmap treats them as extractor candidates with the same input/output contract.
+
+Docling family candidates:
+
+- Docling (document understanding with structured outputs)
+- docling-ocr (OCR component in the Docling ecosystem)
+
+General-purpose extraction candidates:
+
+- Unstructured (element-oriented extraction for many formats)
+- MarkItDown (lightweight conversion to Markdown)
+- Kreuzberg (speed-focused extraction for bulk workflows)
+- ExtractThinker (schema-driven extraction using Pydantic contracts)
+
+Ecosystem adapters:
+
+- LangChain document loaders (uniform loader interface across many sources)
+
+### Guidance for choosing early targets
+
+- If you need layout and table understanding, prioritize Docling and docling-ocr.
+- If you need speed and simplicity, prioritize MarkItDown or Kreuzberg.
+- If you need schema-first extraction, prioritize ExtractThinker layered on an OCR or document extractor.
+
+## Later: alternate backends and hosting modes
+
+Goal: broaden the backend surface while keeping the core predictable.
+
+Deliverables:
+
+- A second backend with different performance tradeoffs.
+- A tool server that exposes a backend through a stable interface.
+- Documentation that shows how to run a backend out of process.
+
+Acceptance checks:
+
+- Local tests remain fast and deterministic.
+- Integration tests validate retrieval through the tool boundary.
+
+## Deferred: corpus and extraction work
+
+These are valuable, but intentionally not the near-term focus while retrieval becomes practical end to end.
+
+### In-memory corpus for ephemeral workflows
+
+Goal: allow programmatic, temporary corpora that live in memory for short-lived agents or tests.
+
+Deliverables:
+
+- A memory-backed corpus implementation that supports the same ingestion and catalog APIs.
+- A serialization option for snapshots so ephemeral corpora can be persisted when needed.
+- Documentation that explains tradeoffs versus file-based corpora.
+
+Acceptance checks:
+
+- Behavior specifications cover ingestion, listing, and reindexing in memory.
+- Retrieval and extraction can operate on the in-memory corpus without special casing.
+
+### Extractor datasets and evaluation harness
+
+Goal: compare extraction approaches in a way that is measurable, repeatable, and useful for practical engineering decisions.
+
+Deliverables:
+
+- Dataset authoring workflow for extraction ground truth (for example: expected transcripts and expected optical character recognition text).
+- Evaluation metrics for accuracy, speed, and cost, including “processable fraction” for a given extractor recipe.
+- A report format that can compare multiple extraction recipes against the same corpus and dataset.
+
+Acceptance checks:
+
+- Evaluation results are stable and reproducible for the same corpus and dataset inputs.
+- Reports make it clear when an extractor fails to process an item versus producing empty output.

{biblicus-0.5.0 → biblicus-0.7.0}/features/environment.py

@@ -134,6 +134,32 @@ def after_scenario(context, scenario) -> None:
                 sys.modules.pop(name, None)
         context._fake_rapidocr_unavailable_installed = False
         context._fake_rapidocr_unavailable_original_modules = {}
+    if getattr(context, "_fake_markitdown_installed", False):
+        original_modules = getattr(context, "_fake_markitdown_original_modules", {})
+        for name in [
+            "markitdown",
+        ]:
+            if name in original_modules:
+                sys.modules[name] = original_modules[name]
+            else:
+                sys.modules.pop(name, None)
+        context._fake_markitdown_installed = False
+        context._fake_markitdown_original_modules = {}
+    if getattr(context, "_fake_markitdown_unavailable_installed", False):
+        original_modules = getattr(context, "_fake_markitdown_unavailable_original_modules", {})
+        for name in [
+            "markitdown",
+        ]:
+            if name in original_modules:
+                sys.modules[name] = original_modules[name]
+            else:
+                sys.modules.pop(name, None)
+        context._fake_markitdown_unavailable_installed = False
+        context._fake_markitdown_unavailable_original_modules = {}
+    original_sys_version_info = getattr(context, "_original_sys_version_info", None)
+    if original_sys_version_info is not None:
+        sys.version_info = original_sys_version_info
+        context._original_sys_version_info = None
     if hasattr(context, "_tmp"):
         context._tmp.cleanup()
 

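The teardown above restores `sys.modules` after the markitdown behavior tests install a fake module. The matching setup step lives in `features/steps/markitdown_steps.py`, which is not shown in this diff, so here is a minimal sketch of what such a step could look like, assuming behave step conventions and the same `context` flags the teardown checks; the step text and the fake `MarkItDown` class are illustrative, not the package's actual implementation.

```python
import sys
import types

from behave import given


@given("a fake markitdown module is installed")
def install_fake_markitdown(context) -> None:
    fake_module = types.ModuleType("markitdown")

    class MarkItDown:  # Illustrative stand-in for the real converter class.
        def convert(self, path):
            return types.SimpleNamespace(text_content="fake converted text")

    fake_module.MarkItDown = MarkItDown
    # Remember whatever was imported before so after_scenario can restore it.
    context._fake_markitdown_original_modules = {
        name: sys.modules[name] for name in ["markitdown"] if name in sys.modules
    }
    sys.modules["markitdown"] = fake_module
    context._fake_markitdown_installed = True
```
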
biblicus-0.7.0/features/knowledge_base.feature (new file)

@@ -0,0 +1,55 @@
+Feature: Knowledge base (turnkey workflow)
+  A knowledge base is a high-level workflow that hides the plumbing while keeping behavior explicit.
+  It should accept a folder, ingest files, build defaults, and allow retrieval with minimal configuration.
+
+  Scenario: Build a knowledge base from a folder and query it
+    Given a folder "notes" exists with text files:
+      | filename  | contents                                                               |
+      | note1.txt | The user's name is Tactus Maximus.                                     |
+      | note2.txt | Primary button style preference: the user's favorite color is magenta. |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "Primary button style preference"
+    Then the knowledge base returns evidence that includes "favorite color is magenta"
+
+  Scenario: Knowledge base context pack is shaped with a token budget
+    Given a folder "notes" exists with text files:
+      | filename  | contents      |
+      | note1.txt | one two three |
+      | note2.txt | four five six |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "one"
+    And I build a context pack from the knowledge base query with token budget 3
+    Then the context pack text equals:
+      """
+      one two three
+      """
+
+  Scenario: Knowledge base context pack defaults to no token budget
+    Given a folder "notes" exists with text files:
+      | filename  | contents   |
+      | note1.txt | alpha beta |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "alpha"
+    And I build a context pack from the knowledge base query without a token budget
+    Then the context pack text equals:
+      """
+      alpha beta
+      """
+
+  Scenario: Knowledge base rejects missing folder
+    When I attempt to create a knowledge base from folder "missing"
+    Then the knowledge base error includes "does not exist"
+
+  Scenario: Knowledge base rejects non-folder path
+    Given a file "not-a-folder.txt" exists with contents "hello"
+    When I attempt to create a knowledge base from folder "not-a-folder.txt"
+    Then the knowledge base error includes "not a directory"
+
+  Scenario: Knowledge base can use an explicit corpus root
+    Given a folder "notes" exists with text files:
+      | filename  | contents |
+      | note1.txt | alpha    |
+    And a folder "kb-root" exists
+    When I create a knowledge base from folder "notes" using corpus root "kb-root"
+    And I query the knowledge base for "alpha"
+    Then the knowledge base returns evidence that includes "alpha"
