PyPI - biblicus - Versions diffs - 0.5.0__tar.gz → 0.6.0__tar.gz - Mend

biblicus 0.5.0tar.gz → 0.6.0tar.gz


Files changed (153)

{biblicus-0.5.0/src/biblicus.egg-info → biblicus-0.6.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.5.0
+Version: 0.6.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -45,6 +45,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
 See [retrieval augmented generation overview] for a short introduction to the idea.
+## Start with a knowledge base
+If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
+This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
+```python
+from biblicus.knowledge_base import KnowledgeBase
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+print(context_pack.text)
+```
+If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
+This simplified sequence diagram shows the same idea at a high level.
+```mermaid
+%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+sequenceDiagram
+  participant App as Your assistant code
+  participant KB as Knowledge base
+  participant LLM as Large language model
+  App->>KB: query
+  KB-->>App: evidence and context
+  App->>LLM: context plus prompt
+  LLM-->>App: response draft
+```
 ## A simple mental model
 Think in three stages.
@@ -153,11 +187,11 @@ biblicus crawl --corpus corpora/example \\
   --tag crawled
 ```
-## End-to-end example: evidence to assistant context
+## End-to-end example: lower-level control
 The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
-Start with a few short “memories” from a chat system. Each memory is stored as a normal item in the corpus.
+This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
 ```python
 from biblicus.backends import get_backend
@@ -383,6 +417,7 @@ The documents below follow the pipeline from raw items to model context:
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
 - [Testing and evaluation][testing]
@@ -485,6 +520,7 @@ License terms are in `LICENSE`.
 [roadmap]: docs/ROADMAP.md
 [feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
+[knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md

{biblicus-0.5.0 → biblicus-0.6.0}/README.md RENAMED

@@ -16,6 +16,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
 See [retrieval augmented generation overview] for a short introduction to the idea.
+## Start with a knowledge base
+If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
+This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
+```python
+from biblicus.knowledge_base import KnowledgeBase
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+print(context_pack.text)
+```
+If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
+This simplified sequence diagram shows the same idea at a high level.
+```mermaid
+%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+sequenceDiagram
+  participant App as Your assistant code
+  participant KB as Knowledge base
+  participant LLM as Large language model
+  App->>KB: query
+  KB-->>App: evidence and context
+  App->>LLM: context plus prompt
+  LLM-->>App: response draft
+```
 ## A simple mental model
 Think in three stages.
@@ -124,11 +158,11 @@ biblicus crawl --corpus corpora/example \\
   --tag crawled
 ```
-## End-to-end example: evidence to assistant context
+## End-to-end example: lower-level control
 The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
-Start with a few short “memories” from a chat system. Each memory is stored as a normal item in the corpus.
+This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
 ```python
 from biblicus.backends import get_backend
@@ -354,6 +388,7 @@ The documents below follow the pipeline from raw items to model context:
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
 - [Testing and evaluation][testing]
@@ -456,6 +491,7 @@ License terms are in `LICENSE`.
 [roadmap]: docs/ROADMAP.md
 [feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
+[knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md

{biblicus-0.5.0 → biblicus-0.6.0}/docs/FEATURE_INDEX.md RENAMED

@@ -208,6 +208,21 @@ Primary implementation:
 - `src/biblicus/context.py`
+## Knowledge base
+What it does:
+- Provides a turnkey interface that accepts a folder and returns a ready-to-query workflow.
+- Applies sensible defaults for import, retrieval, and context pack shaping.
+Behavior specifications:
+- `features/knowledge_base.feature`
+Primary implementation:
+- `src/biblicus/knowledge_base.py`
 ## Testing, coverage, and documentation build
 What it does:

biblicus-0.6.0/docs/KNOWLEDGE_BASE.md ADDED

@@ -0,0 +1,68 @@
+# Knowledge base
+The knowledge base is the high‑level, turnkey workflow that makes Biblicus feel effortless. You hand it a folder. It chooses sensible defaults, builds a retrieval run, and gives you evidence you can turn into context.
+This is the right layer when you want to use Biblicus without spending time on setup. You can still override the defaults later when you want fine‑grained control.
+## What it does
+- Creates or opens a corpus at a chosen location (or a temporary location if you do not provide one).
+- Imports a folder tree into that corpus.
+- Builds a retrieval run using a default backend.
+- Exposes a simple `query` method that returns evidence.
+- Exposes a `context_pack` helper to shape evidence into model context.
+## Minimal use
+```python
+from biblicus.knowledge_base import KnowledgeBase
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+print(context_pack.text)
+```
+## Default behavior
+The knowledge base wraps existing primitives. Defaults are explicit and deterministic.
+- **Corpus**: stored on disk and fully inspectable.
+- **Import**: uses the folder tree import, preserving relative paths.
+- **Backend**: defaults to the `scan` backend.
+- **Query budget**: defaults to a small, conservative evidence budget.
+## Overrides
+You can override the defaults when needed.
+```python
+from biblicus.knowledge_base import KnowledgeBase
+from biblicus.models import QueryBudget
+kb = KnowledgeBase.from_folder(
+    "notes",
+    backend_id="scan",
+    recipe_name="Knowledge base demo",
+    query_budget=QueryBudget(max_total_items=10, max_total_characters=4000, max_items_per_source=None),
+    tags=["memory"],
+    corpus_root="corpora/knowledge-base",
+)
+```
+## How it relates to lower‑level control
+The knowledge base is a convenience layer. It uses the same underlying parts that the lower‑level examples use.
+- `Corpus` for ingestion and storage
+- `import_tree` for folder ingestion
+- A backend run (`scan` by default)
+- `QueryBudget` for evidence limits
+- `ContextPackPolicy` and token fitting for context shaping
+You can always drop down to those lower‑level primitives when you need more control.
+If the high‑level workflow is not enough, switch to `Corpus`, `get_backend`, and `ContextPackPolicy` directly.
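
The `QueryBudget` override above names three limits: `max_total_items`, `max_total_characters`, and `max_items_per_source`. The diff does not show how Biblicus enforces them, so the following is only a minimal sketch of one plausible reading. `EvidenceSketch`, `BudgetSketch`, and `apply_budget` are hypothetical names invented for illustration and are not part of the Biblicus API. The choice to skip an oversized item rather than stop is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EvidenceSketch:
    """Hypothetical stand-in for one ranked evidence item (not a Biblicus class)."""

    source: str
    text: str


@dataclass(frozen=True)
class BudgetSketch:
    """Hypothetical mirror of the three QueryBudget fields shown in the diff."""

    max_total_items: int
    max_total_characters: Optional[int] = None
    max_items_per_source: Optional[int] = None


def apply_budget(ranked: List[EvidenceSketch], budget: BudgetSketch) -> List[EvidenceSketch]:
    """Walk ranked evidence in order and keep items that fit every limit."""
    kept: List[EvidenceSketch] = []
    per_source: Dict[str, int] = {}
    total_characters = 0
    for item in ranked:
        # Stop as soon as the overall item limit is reached.
        if len(kept) >= budget.max_total_items:
            break
        # Skip items from a source that already used its share.
        seen = per_source.get(item.source, 0)
        if budget.max_items_per_source is not None and seen >= budget.max_items_per_source:
            continue
        # Assumption: skip (rather than stop at) an item that would exceed
        # the character limit, so a shorter later item can still fit.
        if (
            budget.max_total_characters is not None
            and total_characters + len(item.text) > budget.max_total_characters
        ):
            continue
        kept.append(item)
        per_source[item.source] = seen + 1
        total_characters += len(item.text)
    return kept
```

Setting a limit to `None`, as the override example does for `max_items_per_source`, disables that check in this sketch.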

{biblicus-0.5.0 → biblicus-0.6.0}/docs/ROADMAP.md RENAMED

@@ -46,6 +46,65 @@ Acceptance checks:
 - Behavior specifications cover policy selection and budgeting behaviors.
 - Example outputs show how context packs differ across policies.
+## Next: extraction backends (OCR and document understanding)
+Goal: treat optical character recognition and document understanding as pluggable extractors with consistent inputs and outputs.
+Deliverables:
+- A baseline OCR extractor that is fast and local for smoke tests.
+- A higher quality OCR extractor candidate (for example: Paddle OCR or Docling OCR).
+- A general document understanding extractor candidate (for example: Docling or Unstructured).
+- A consistent output contract that captures text plus optional confidence and per-page metadata.
+- A selector policy for choosing between multiple extractor outputs in a pipeline.
+- A shared evaluation harness for extraction backends using the same corpus and dataset.
+Acceptance checks:
+- Behavior specifications cover extractor selection and output provenance.
+- Evaluation reports compare accuracy, processable fraction, latency, and cost.
+## Next: corpus analysis tools
+Goal: provide lightweight analysis utilities that summarize corpus themes and guide curation.
+Deliverables:
+- A topic modeling workflow for corpus analysis (for example: BERTopic).
+- A report that highlights dominant themes and outliers.
+- A way to compare topic distributions across corpora or corpus snapshots.
+Acceptance checks:
+- Analysis is reproducible for the same corpus state.
+- Reports are exportable and readable without custom tooling.
+### Candidate backend ecosystem (for planning and evaluation)
+Document understanding and OCR blur together at the interface level in Biblicus, so the roadmap treats them as extractor candidates with the same input/output contract.
+Docling family candidates:
+- Docling (document understanding with structured outputs)
+- docling-ocr (OCR component in the Docling ecosystem)
+General-purpose extraction candidates:
+- Unstructured (element-oriented extraction for many formats)
+- MarkItDown (lightweight conversion to Markdown)
+- Kreuzberg (speed-focused extraction for bulk workflows)
+- ExtractThinker (schema-driven extraction using Pydantic contracts)
+Ecosystem adapters:
+- LangChain document loaders (uniform loader interface across many sources)
+### Guidance for choosing early targets
+- If you need layout and table understanding, prioritize Docling and docling-ocr.
+- If you need speed and simplicity, prioritize MarkItDown or Kreuzberg.
+- If you need schema-first extraction, prioritize ExtractThinker layered on an OCR or document extractor.
 ## Later: alternate backends and hosting modes
 Goal: broaden the backend surface while keeping the core predictable.

{biblicus-0.5.0 → biblicus-0.6.0}/docs/api.rst RENAMED

@@ -8,6 +8,10 @@ Core
    :members:
    :undoc-members:
+.. automodule:: biblicus.knowledge_base
+   :members:
+   :undoc-members:
 .. automodule:: biblicus.models
    :members:
    :undoc-members:

{biblicus-0.5.0 → biblicus-0.6.0}/docs/index.rst RENAMED

@@ -11,6 +11,7 @@ Contents
    CORPUS
    EXTRACTION
+   KNOWLEDGE_BASE
    BACKENDS
    CONTEXT_PACK
    DEMOS

biblicus-0.6.0/features/knowledge_base.feature ADDED

@@ -0,0 +1,55 @@
+Feature: Knowledge base (turnkey workflow)
+  A knowledge base is a high-level workflow that hides the plumbing while keeping behavior explicit.
+  It should accept a folder, ingest files, build defaults, and allow retrieval with minimal configuration.
+  Scenario: Build a knowledge base from a folder and query it
+    Given a folder "notes" exists with text files:
+      | filename | contents                                                   |
+      | note1.txt | The user's name is Tactus Maximus.                       |
+      | note2.txt | Primary button style preference: the user's favorite color is magenta. |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "Primary button style preference"
+    Then the knowledge base returns evidence that includes "favorite color is magenta"
+  Scenario: Knowledge base context pack is shaped with a token budget
+    Given a folder "notes" exists with text files:
+      | filename | contents                              |
+      | note1.txt | one two three                         |
+      | note2.txt | four five six                         |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "one"
+    And I build a context pack from the knowledge base query with token budget 3
+    Then the context pack text equals:
+      """
+      one two three
+      """
+  Scenario: Knowledge base context pack defaults to no token budget
+    Given a folder "notes" exists with text files:
+      | filename | contents      |
+      | note1.txt | alpha beta   |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "alpha"
+    And I build a context pack from the knowledge base query without a token budget
+    Then the context pack text equals:
+      """
+      alpha beta
+      """
+  Scenario: Knowledge base rejects missing folder
+    When I attempt to create a knowledge base from folder "missing"
+    Then the knowledge base error includes "does not exist"
+  Scenario: Knowledge base rejects non-folder path
+    Given a file "not-a-folder.txt" exists with contents "hello"
+    When I attempt to create a knowledge base from folder "not-a-folder.txt"
+    Then the knowledge base error includes "not a directory"
+  Scenario: Knowledge base can use an explicit corpus root
+    Given a folder "notes" exists with text files:
+      | filename | contents |
+      | note1.txt | alpha |
+    And a folder "kb-root" exists
+    When I create a knowledge base from folder "notes" using corpus root "kb-root"
+    And I query the knowledge base for "alpha"
+    Then the knowledge base returns evidence that includes "alpha"
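
The two context pack scenarios above pin down observable behavior: with a token budget of 3 the pack text is `one two three`, and without a budget the text comes back unchanged. The sketch below reproduces that behavior under stated assumptions and is not the Biblicus implementation. It assumes whitespace tokenization, a blank-line separator between blocks, and a drop-from-the-end fitting strategy. `count_tokens` and `fit_context` are hypothetical helper names.

```python
from typing import List, Optional


def count_tokens(text: str) -> int:
    """Assumption: one token per whitespace-separated word."""
    return len(text.split())


def fit_context(
    blocks: List[str],
    max_tokens: Optional[int] = None,
    separator: str = "\n\n",
) -> str:
    """Join evidence blocks, dropping trailing blocks until the budget fits."""
    kept = list(blocks)
    if max_tokens is not None:
        # Remove the lowest-ranked (last) block until the joined text fits.
        while kept and count_tokens(separator.join(kept)) > max_tokens:
            kept.pop()
    return separator.join(kept)
```

Passing `max_tokens=None` skips fitting entirely, which corresponds to the "defaults to no token budget" scenario.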

biblicus-0.6.0/features/steps/knowledge_base_steps.py ADDED

@@ -0,0 +1,90 @@
+from __future__ import annotations
+from pathlib import Path
+from behave import given, then, when
+from biblicus.knowledge_base import KnowledgeBase
+@given('a folder "{folder}" exists')
+def given_folder_exists(context, folder: str) -> None:
+    root = Path(context.workdir) / folder
+    root.mkdir(parents=True, exist_ok=True)
+    context.knowledge_base_folder = root
+@given('a folder "{folder}" exists with text files:')
+def given_folder_exists_with_text_files(context, folder: str) -> None:
+    root = Path(context.workdir) / folder
+    root.mkdir(parents=True, exist_ok=True)
+    for row in context.table:
+        filename = row["filename"]
+        contents = row["contents"]
+        path = root / filename
+        path.write_text(contents, encoding="utf-8")
+    context.knowledge_base_folder = root
+@given('a file "{filename}" exists with contents "{contents}"')
+def given_file_exists_with_contents(context, filename: str, contents: str) -> None:
+    path = Path(context.workdir) / filename
+    path.write_text(contents, encoding="utf-8")
+    context.knowledge_base_file = path
+@when('I create a knowledge base from folder "{folder}" only')
+def when_create_knowledge_base_from_folder(context, folder: str) -> None:
+    root = Path(context.workdir) / folder
+    context.knowledge_base = KnowledgeBase.from_folder(root)
+@when('I create a knowledge base from folder "{folder}" using corpus root "{corpus_root}"')
+def when_create_knowledge_base_from_folder_with_corpus_root(
+    context, folder: str, corpus_root: str
+) -> None:
+    root = Path(context.workdir) / folder
+    corpus_root_path = Path(context.workdir) / corpus_root
+    context.knowledge_base = KnowledgeBase.from_folder(root, corpus_root=corpus_root_path)
+@when('I attempt to create a knowledge base from folder "{folder}"')
+def when_attempt_create_knowledge_base_from_folder(context, folder: str) -> None:
+    root = Path(context.workdir) / folder
+    try:
+        KnowledgeBase.from_folder(root)
+    except (FileNotFoundError, NotADirectoryError) as exc:
+        context.knowledge_base_error = exc
+@then('the knowledge base error includes "{text}"')
+def then_knowledge_base_error_includes(context, text: str) -> None:
+    error = context.knowledge_base_error
+    assert text in str(error)
+@when('I query the knowledge base for "{query_text}"')
+def when_query_knowledge_base(context, query_text: str) -> None:
+    context.knowledge_base_result = context.knowledge_base.query(query_text)
+@when("I build a context pack from the knowledge base query with token budget {max_tokens:d}")
+def when_build_context_pack_from_knowledge_base_query(context, max_tokens: int) -> None:
+    context.context_pack = context.knowledge_base.context_pack(
+        context.knowledge_base_result,
+        max_tokens=max_tokens,
+    )
+@when("I build a context pack from the knowledge base query without a token budget")
+def when_build_context_pack_from_knowledge_base_query_without_budget(context) -> None:
+    context.context_pack = context.knowledge_base.context_pack(
+        context.knowledge_base_result,
+    )
+@then('the knowledge base returns evidence that includes "{text}"')
+def then_knowledge_base_returns_evidence_that_includes(context, text: str) -> None:
+    evidence_items = context.knowledge_base_result.evidence
+    evidence_texts = [item.text or "" for item in evidence_items]
+    assert any(text in evidence_text for evidence_text in evidence_texts)
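
The step definitions above expect `KnowledgeBase.from_folder` to raise `FileNotFoundError` or `NotADirectoryError`, with messages containing "does not exist" and "not a directory". The body of `from_folder` is not part of this excerpt, so the following is only a sketch of a validation helper that would satisfy those steps. `validate_source_folder` is a hypothetical name, and the exact message wording beyond the two asserted substrings is assumed.

```python
from pathlib import Path
from typing import Union


def validate_source_folder(folder: Union[str, Path]) -> Path:
    """Return the folder as a Path, or raise the error type the steps catch."""
    path = Path(folder)
    if not path.exists():
        # Substring "does not exist" is what the feature file asserts on.
        raise FileNotFoundError(f"Knowledge base folder does not exist: {path}")
    if not path.is_dir():
        # Substring "not a directory" is what the feature file asserts on.
        raise NotADirectoryError(f"Knowledge base folder is not a directory: {path}")
    return path
```

Checking existence before the directory test matters here: a missing path is also "not a directory", so reversing the order would report the wrong error for the "missing" scenario.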

{biblicus-0.5.0 → biblicus-0.6.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "biblicus"
-version = "0.5.0"
+version = "0.6.0"
 description = "Command line interface and Python library for corpus ingestion, retrieval, and evaluation."
 readme = "README.md"
 requires-python = ">=3.9"

{biblicus-0.5.0 → biblicus-0.6.0}/src/biblicus/__init__.py RENAMED

@@ -3,6 +3,7 @@ Biblicus public package interface.
 """
 from .corpus import Corpus
+from .knowledge_base import KnowledgeBase
 from .models import (
     CorpusConfig,
     Evidence,
@@ -19,10 +20,11 @@ __all__ = [
     "CorpusConfig",
     "Evidence",
     "IngestResult",
+    "KnowledgeBase",
     "QueryBudget",
     "RecipeManifest",
     "RetrievalResult",
     "RetrievalRun",
 ]
-__version__ = "0.5.0"
+__version__ = "0.6.0"
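
This release bumps the version string in `PKG-INFO`, `pyproject.toml`, and `src/biblicus/__init__.py`, and all three must agree. A small check like the one below could guard against a missed bump. It is a sketch rather than anything shipped in the package: `extract_versions` is a hypothetical helper, and it assumes the simple `version = "..."` and `__version__ = "..."` line shapes visible in the diff.

```python
import re
from typing import Tuple


def extract_versions(pyproject_text: str, init_text: str) -> Tuple[str, str]:
    """Pull the version strings out of pyproject.toml and __init__.py text."""
    # Anchored at line start so `__version__` cannot match the first pattern.
    pyproject_match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    init_match = re.search(r'^__version__\s*=\s*"([^"]+)"', init_text, re.MULTILINE)
    if pyproject_match is None or init_match is None:
        raise ValueError("version string not found")
    return pyproject_match.group(1), init_match.group(1)
```

A release script could read both files, call `extract_versions`, and fail when the two returned values differ.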
