biblicus 0.5.0__tar.gz → 0.7.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {biblicus-0.5.0/src/biblicus.egg-info → biblicus-0.7.0}/PKG-INFO +57 -4
- {biblicus-0.5.0 → biblicus-0.7.0}/README.md +54 -3
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/DEMOS.md +19 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/EXTRACTION.md +21 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/FEATURE_INDEX.md +17 -0
- biblicus-0.7.0/docs/KNOWLEDGE_BASE.md +68 -0
- biblicus-0.7.0/docs/ROADMAP.md +155 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/api.rst +4 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/index.rst +1 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/environment.py +26 -0
- biblicus-0.7.0/features/knowledge_base.feature +55 -0
- biblicus-0.7.0/features/markitdown_extractor.feature +99 -0
- biblicus-0.7.0/features/steps/knowledge_base_steps.py +90 -0
- biblicus-0.7.0/features/steps/markitdown_steps.py +173 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/pyproject.toml +5 -1
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/test.py +15 -4
- biblicus-0.7.0/scripts/wikipedia_rag_demo.py +212 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/__init__.py +3 -1
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/__init__.py +2 -0
- biblicus-0.7.0/src/biblicus/extractors/markitdown_text.py +128 -0
- biblicus-0.7.0/src/biblicus/knowledge_base.py +191 -0
- {biblicus-0.5.0 → biblicus-0.7.0/src/biblicus.egg-info}/PKG-INFO +57 -4
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/SOURCES.txt +8 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/requires.txt +5 -0
- biblicus-0.5.0/docs/ROADMAP.md +0 -81
- {biblicus-0.5.0 → biblicus-0.7.0}/LICENSE +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/MANIFEST.in +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/THIRD_PARTY_NOTICES.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/datasets/wikipedia_mini.json +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/ARCHITECTURE.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/BACKENDS.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/CONTEXT_PACK.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/CORPUS.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/CORPUS_DESIGN.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/TESTING.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/USER_CONFIGURATION.md +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/docs/conf.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/backend_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/biblicus_corpus.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/cli_entrypoint.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/cli_parsing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/content_sniffing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/context_pack.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/context_pack_cli.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_edge_cases.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_identity.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_purge.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/crawl.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/error_cases.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/evaluation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/evidence_processing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_error_handling.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_run_lifecycle.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_selection.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_selection_longest.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extractor_pipeline.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/extractor_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/frontmatter.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/hook_config_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/hook_error_handling.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/import_tree.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/ingest_sources.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_audio_samples.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_image_samples.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_mixed_corpus.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_mixed_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_ocr_image_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_pdf_retrieval.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_pdf_samples.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_unstructured_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_wikipedia.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/lifecycle_hooks.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/model_validation.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/ocr_extractor.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/pdf_text_extraction.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/python_api.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/python_hook_logging.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/query_processing.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_budget.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_scan.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_uses_extraction_run.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_utilities.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/source_loading.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/backend_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/cli_parsing_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/cli_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/context_pack_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/crawl_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/evidence_processing_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extraction_run_lifecycle_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extraction_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extractor_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/frontmatter_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/model_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/openai_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/pdf_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/python_api_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/rapidocr_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/retrieval_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/stt_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/unstructured_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/user_config_steps.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/streaming_ingest.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/stt_extractor.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/text_extraction_runs.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/token_budget.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/unstructured_extractor.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/features/user_config.feature +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_audio_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_image_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_mixed_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_pdf_samples.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_wikipedia.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/scripts/readme_end_to_end_demo.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/setup.cfg +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/__main__.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/__init__.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/base.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/scan.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/cli.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/constants.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/context.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/corpus.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/crawl.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/errors.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/evaluation.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/evidence_processing.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extraction.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/base.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/metadata_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/openai_stt.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pass_through_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pdf_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pipeline.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/select_longest_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/select_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/unstructured_text.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/frontmatter.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hook_logging.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hook_manager.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hooks.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/ignore.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/models.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/retrieval.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/sources.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/time.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/uris.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/user_config.py +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/entry_points.txt +0 -0
- {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/top_level.txt +0 -0
{biblicus-0.5.0/src/biblicus.egg-info → biblicus-0.7.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.5.0
+Version: 0.7.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -25,6 +25,8 @@ Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
 Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
 Provides-Extra: ocr
 Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
+Provides-Extra: markitdown
+Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
 Dynamic: license-file
 
 # Biblicus
@@ -45,6 +47,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
 
 See [retrieval augmented generation overview] for a short introduction to the idea.
 
+## Start with a knowledge base
+
+If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
+
+This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+
+
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+
+print(context_pack.text)
+```
+
+If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
+
+This simplified sequence diagram shows the same idea at a high level.
+
+```mermaid
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+sequenceDiagram
+    participant App as Your assistant code
+    participant KB as Knowledge base
+    participant LLM as Large language model
+
+    App->>KB: query
+    KB-->>App: evidence and context
+    App->>LLM: context plus prompt
+    LLM-->>App: response draft
+```
+
 ## A simple mental model
 
 Think in three stages.
@@ -72,7 +108,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
 This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
 
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
     participant User
     participant App as Your assistant code
@@ -126,6 +162,7 @@ Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
 - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
+- MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
 
 ## Quick start
 
@@ -153,11 +190,11 @@ biblicus crawl --corpus corpora/example \\
   --tag crawled
 ```
 
-## End-to-end example:
+## End-to-end example: lower-level control
 
 The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
 
-
+This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
 
 ```python
 from biblicus.backends import get_backend
@@ -383,6 +420,7 @@ The documents below follow the pipeline from raw items to model context:
 
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
 - [Testing and evaluation][testing]
@@ -432,6 +470,20 @@ Two backends are included.
 
 - `scan` is a minimal baseline that scans raw items directly.
 - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+## Extraction backends
+
+These extractors are built in. Optional ones require extra dependencies.
+
+- `pass-through-text` reads text items and strips Markdown front matter.
+- `metadata-text` turns catalog metadata into a small text artifact.
+- `pdf-text` extracts text from Portable Document Format items with `pypdf`.
+- `select-text` chooses one prior extraction result in a pipeline.
+- `select-longest-text` chooses the longest prior extraction result.
+- `ocr-rapidocr` does optical character recognition on images (optional).
+- `stt-openai` performs speech to text on audio (optional).
+- `unstructured` provides broad document parsing (optional).
+- `markitdown` converts many formats into Markdown-like text (optional).
+
 ## Integration corpus and evaluation dataset
 
 Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
@@ -485,6 +537,7 @@ License terms are in `LICENSE`.
 [roadmap]: docs/ROADMAP.md
 [feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
+[knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md

{biblicus-0.5.0 → biblicus-0.7.0}/README.md

@@ -16,6 +16,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
 
 See [retrieval augmented generation overview] for a short introduction to the idea.
 
+## Start with a knowledge base
+
+If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
+
+This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+
+
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+
+print(context_pack.text)
+```
+
+If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
+
+This simplified sequence diagram shows the same idea at a high level.
+
+```mermaid
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+sequenceDiagram
+    participant App as Your assistant code
+    participant KB as Knowledge base
+    participant LLM as Large language model
+
+    App->>KB: query
+    KB-->>App: evidence and context
+    App->>LLM: context plus prompt
+    LLM-->>App: response draft
+```
+
 ## A simple mental model
 
 Think in three stages.
@@ -43,7 +77,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
 This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
 
 ```mermaid
-%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
+%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
 sequenceDiagram
     participant User
     participant App as Your assistant code
@@ -97,6 +131,7 @@ Some extractors are optional so the base install stays small.
 - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
 - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
 - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
+- MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
 
 ## Quick start
 
@@ -124,11 +159,11 @@ biblicus crawl --corpus corpora/example \\
   --tag crawled
 ```
 
-## End-to-end example:
+## End-to-end example: lower-level control
 
 The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
 
-
+This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
 
 ```python
 from biblicus.backends import get_backend
@@ -354,6 +389,7 @@ The documents below follow the pipeline from raw items to model context:
 
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [Knowledge base][knowledge-base]
 - [Backends][backends]
 - [Context packs][context-packs]
 - [Testing and evaluation][testing]
@@ -403,6 +439,20 @@ Two backends are included.
 
 - `scan` is a minimal baseline that scans raw items directly.
 - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+## Extraction backends
+
+These extractors are built in. Optional ones require extra dependencies.
+
+- `pass-through-text` reads text items and strips Markdown front matter.
+- `metadata-text` turns catalog metadata into a small text artifact.
+- `pdf-text` extracts text from Portable Document Format items with `pypdf`.
+- `select-text` chooses one prior extraction result in a pipeline.
+- `select-longest-text` chooses the longest prior extraction result.
+- `ocr-rapidocr` does optical character recognition on images (optional).
+- `stt-openai` performs speech to text on audio (optional).
+- `unstructured` provides broad document parsing (optional).
+- `markitdown` converts many formats into Markdown-like text (optional).
+
 ## Integration corpus and evaluation dataset
 
 Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
@@ -456,6 +506,7 @@ License terms are in `LICENSE`.
 [roadmap]: docs/ROADMAP.md
 [feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
+[knowledge-base]: docs/KNOWLEDGE_BASE.md
 [text-extraction]: docs/EXTRACTION.md
 [user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md

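The sequence diagram added to the README ends with "context plus prompt", but the new section stops short of showing that hand-off in code. Here is a minimal sketch of that last step, using only the `KnowledgeBase` calls shown in the diff above; `call_model` is a hypothetical stand-in for whatever model client your assistant already uses.

```python
from biblicus.knowledge_base import KnowledgeBase


def call_model(prompt: str) -> str:
    """Hypothetical placeholder for your own large language model client."""
    raise NotImplementedError


def answer(question: str) -> str:
    # Build (or reuse) the knowledge base from a plain folder of notes.
    kb = KnowledgeBase.from_folder("notes")
    # Retrieval returns evidence; the context pack shapes it to a token budget.
    result = kb.query(question)
    context_pack = kb.context_pack(result, max_tokens=800)
    # "Context plus prompt" from the sequence diagram: prepend evidence to the question.
    prompt = f"{context_pack.text}\n\nQuestion: {question}"
    return call_model(prompt)
```
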
{biblicus-0.5.0 → biblicus-0.7.0}/docs/DEMOS.md

@@ -221,6 +221,25 @@ python3 -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-tex
 python3 -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"
 ```
 
+### Wikipedia retrieval demo (Python)
+
+This example downloads a few Wikipedia summaries about retrieval and knowledge bases, builds an extraction run, creates a local full text index, and returns evidence plus a context pack.
+
+```
+rm -rf corpora/wikipedia_rag_demo
+python3 scripts/wikipedia_rag_demo.py --corpus corpora/wikipedia_rag_demo --force
+```
+
+### MarkItDown extraction demo (Python 3.10+)
+
+MarkItDown requires Python 3.10 or higher. This example uses the `py311` conda environment to run the extractor over the mixed sample corpus.
+
+```
+conda run -n py311 python -m pip install -e . "markitdown[all]"
+conda run -n py311 python scripts/download_mixed_samples.py --corpus corpora/markitdown_demo_py311 --force
+conda run -n py311 python -m biblicus extract build --corpus corpora/markitdown_demo_py311 --step markitdown
+```
+
 ### Mixed modality integration corpus
 
 This example assembles a tiny mixed corpus with a Markdown note, a Hypertext Markup Language page, an image, a Portable Document Format file with extractable text, and a generated Portable Document Format file with no extractable text.

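The Wikipedia demo above is driven from the shell. For readers who prefer to stay in Python, here is a rough sketch of the same idea using the high-level interface from the README; this is not the demo script itself, it assumes the downloaded summaries have already been saved as plain text files under a hypothetical `wikipedia_summaries/` folder, and it accepts the default backend rather than the demo's explicit full text index build.

```python
from biblicus.knowledge_base import KnowledgeBase

# Hypothetical folder of previously downloaded Wikipedia summaries saved as .txt files.
kb = KnowledgeBase.from_folder("wikipedia_summaries")

# Retrieve evidence and shape it into a context pack, as the demo script does.
result = kb.query("retrieval augmented generation")
context_pack = kb.context_pack(result, max_tokens=800)
print(context_pack.text)
```
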
{biblicus-0.5.0 → biblicus-0.7.0}/docs/EXTRACTION.md

@@ -71,6 +71,27 @@ To install:
 python3 -m pip install "biblicus[unstructured]"
 ```
 
+`markitdown`
+
+- Converts common document formats into Markdown-like text
+- Backed by the optional `markitdown` dependency
+- Requires Python 3.10 or higher
+- Skips items that are already text so the pass-through extractor remains the canonical choice for text items
+- This means it will not process `text/html` or other text media types unless that policy changes
+
+To install:
+
+```
+python3 -m pip install "biblicus[markitdown]"
+```
+
+Example:
+
+```
+python3 -m biblicus extract build --corpus corpora/extraction-demo \\
+  --step markitdown
+```
+
 `ocr-rapidocr`
 
 - Optical character recognition for image items

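Because the MarkItDown extractor is both optional and gated on the interpreter version, a quick preflight check can save a confusing failure later. The sketch below uses only the standard library; the extra name matches the `biblicus[markitdown]` install line shown in the diff above.

```python
import importlib.util
import sys


def markitdown_available() -> bool:
    """Return True when the optional markitdown extra can actually be used."""
    if sys.version_info < (3, 10):
        # The markitdown extra is only declared for Python 3.10 and higher.
        return False
    # The dependency is optional, so it may simply not be installed.
    return importlib.util.find_spec("markitdown") is not None


if not markitdown_available():
    print('Install with: python3 -m pip install "biblicus[markitdown]" (Python 3.10 or higher)')
```
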
{biblicus-0.5.0 → biblicus-0.7.0}/docs/FEATURE_INDEX.md

@@ -123,6 +123,7 @@ What it does:
 - Includes a Portable Document Format text extractor plugin.
 - Includes a speech to text extractor plugin for audio items.
 - Includes a selection extractor step for choosing extracted text within a pipeline.
+- Includes a MarkItDown extractor plugin for document conversion.
 
 Documentation:
 
@@ -139,6 +140,7 @@ Behavior specifications:
 - `features/ocr_extractor.feature`
 - `features/stt_extractor.feature`
 - `features/unstructured_extractor.feature`
+- `features/markitdown_extractor.feature`
 - `features/integration_unstructured_extraction.feature`
 
 Primary implementation:
@@ -208,6 +210,21 @@ Primary implementation:
 
 - `src/biblicus/context.py`
 
+## Knowledge base
+
+What it does:
+
+- Provides a turnkey interface that accepts a folder and returns a ready-to-query workflow.
+- Applies sensible defaults for import, retrieval, and context pack shaping.
+
+Behavior specifications:
+
+- `features/knowledge_base.feature`
+
+Primary implementation:
+
+- `src/biblicus/knowledge_base.py`
+
 ## Testing, coverage, and documentation build
 
 What it does:

biblicus-0.7.0/docs/KNOWLEDGE_BASE.md (new file)

@@ -0,0 +1,68 @@
+# Knowledge base
+
+The knowledge base is the high‑level, turnkey workflow that makes Biblicus feel effortless. You hand it a folder. It chooses sensible defaults, builds a retrieval run, and gives you evidence you can turn into context.
+
+This is the right layer when you want to use Biblicus without spending time on setup. You can still override the defaults later when you want fine‑grained control.
+
+## What it does
+
+- Creates or opens a corpus at a chosen location (or a temporary location if you do not provide one).
+- Imports a folder tree into that corpus.
+- Builds a retrieval run using a default backend.
+- Exposes a simple `query` method that returns evidence.
+- Exposes a `context_pack` helper to shape evidence into model context.
+
+## Minimal use
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+
+
+kb = KnowledgeBase.from_folder("notes")
+result = kb.query("Primary button style preference")
+context_pack = kb.context_pack(result, max_tokens=800)
+
+print(context_pack.text)
+```
+
+## Default behavior
+
+The knowledge base wraps existing primitives. Defaults are explicit and deterministic.
+
+- **Corpus**: stored on disk and fully inspectable.
+- **Import**: uses the folder tree import, preserving relative paths.
+- **Backend**: defaults to the `scan` backend.
+- **Query budget**: defaults to a small, conservative evidence budget.
+
+## Overrides
+
+You can override the defaults when needed.
+
+```python
+from biblicus.knowledge_base import KnowledgeBase
+from biblicus.models import QueryBudget
+
+
+kb = KnowledgeBase.from_folder(
+    "notes",
+    backend_id="scan",
+    recipe_name="Knowledge base demo",
+    query_budget=QueryBudget(max_total_items=10, max_total_characters=4000, max_items_per_source=None),
+    tags=["memory"],
+    corpus_root="corpora/knowledge-base",
+)
+```
+
+## How it relates to lower‑level control
+
+The knowledge base is a convenience layer. It uses the same underlying parts that the lower‑level examples use.
+
+- `Corpus` for ingestion and storage
+- `import_tree` for folder ingestion
+- A backend run (`scan` by default)
+- `QueryBudget` for evidence limits
+- `ContextPackPolicy` and token fitting for context shaping
+
+You can always drop down to those lower‑level primitives when you need more control.
+
+If the high‑level workflow is not enough, switch to `Corpus`, `get_backend`, and `ContextPackPolicy` directly.

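Since the new document says the knowledge base "creates or opens a corpus at a chosen location", a natural pattern is to pin `corpus_root` so repeated runs reuse the same on-disk corpus instead of a temporary one. Below is a minimal sketch, assuming only the keyword arguments shown in the overrides example above; the `corpora/assistant-memory` path and the budget numbers are illustrative.

```python
from biblicus.knowledge_base import KnowledgeBase
from biblicus.models import QueryBudget

# Pin the corpus location so successive runs open the same on-disk corpus.
kb = KnowledgeBase.from_folder(
    "notes",
    corpus_root="corpora/assistant-memory",
    query_budget=QueryBudget(
        max_total_items=5,
        max_total_characters=2000,
        max_items_per_source=None,
    ),
)

result = kb.query("favorite color")
print(kb.context_pack(result, max_tokens=400).text)
```
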
biblicus-0.7.0/docs/ROADMAP.md (new file)

@@ -0,0 +1,155 @@
+# Roadmap
+
+This document describes what we plan to build next.
+
+If you are looking for runnable examples, see `docs/DEMOS.md`.
+
+If you are looking for what already exists, start with:
+
+- `docs/FEATURE_INDEX.md` for a map of features to behavior specifications and modules.
+- `CHANGELOG.md` for released changes.
+
+## Principles
+
+- Behavior specifications are the authoritative definition of behavior.
+- Every behavior that exists is specified.
+- Validation and documentation are part of the product.
+- Raw corpus items remain readable, portable files.
+- Derived artifacts are stored under the corpus and can coexist for multiple implementations.
+
+## Next: retrieval evaluation and datasets
+
+Goal: make evaluation results easier to interpret and compare.
+
+Deliverables:
+
+- A dataset authoring workflow that supports small hand-labeled sets and larger synthetic sets.
+- A report that includes per-query diagnostics and a clear summary.
+
+Acceptance checks:
+
+- Dataset formats are versioned when they change.
+- Reports remain deterministic for the same inputs.
+
+## Next: context pack policy surfaces
+
+Goal: make context shaping policies easier to evaluate and swap.
+
+Deliverables:
+
+- A clear set of context pack policy variants (formatting, ordering, metadata inclusion).
+- Token budget strategies that can use a real tokenizer.
+- Documentation that explains where context shaping fits in the pipeline.
+
+Acceptance checks:
+
+- Behavior specifications cover policy selection and budgeting behaviors.
+- Example outputs show how context packs differ across policies.
+
+## Next: extraction backends (OCR and document understanding)
+
+Goal: treat optical character recognition and document understanding as pluggable extractors with consistent inputs and outputs.
+
+Deliverables:
+
+- A baseline OCR extractor that is fast and local for smoke tests.
+- A higher quality OCR extractor candidate (for example: Paddle OCR or Docling OCR).
+- A general document understanding extractor candidate (for example: Docling or Unstructured).
+- A consistent output contract that captures text plus optional confidence and per-page metadata.
+- A selector policy for choosing between multiple extractor outputs in a pipeline.
+- A shared evaluation harness for extraction backends using the same corpus and dataset.
+
+Acceptance checks:
+
+- Behavior specifications cover extractor selection and output provenance.
+- Evaluation reports compare accuracy, processable fraction, latency, and cost.
+
+## Next: corpus analysis tools
+
+Goal: provide lightweight analysis utilities that summarize corpus themes and guide curation.
+
+Deliverables:
+
+- A topic modeling workflow for corpus analysis (for example: BERTopic).
+- A report that highlights dominant themes and outliers.
+- A way to compare topic distributions across corpora or corpus snapshots.
+
+Acceptance checks:
+
+- Analysis is reproducible for the same corpus state.
+- Reports are exportable and readable without custom tooling.
+
+### Candidate backend ecosystem (for planning and evaluation)
+
+Document understanding and OCR blur together at the interface level in Biblicus, so the roadmap treats them as extractor candidates with the same input/output contract.
+
+Docling family candidates:
+
+- Docling (document understanding with structured outputs)
+- docling-ocr (OCR component in the Docling ecosystem)
+
+General-purpose extraction candidates:
+
+- Unstructured (element-oriented extraction for many formats)
+- MarkItDown (lightweight conversion to Markdown)
+- Kreuzberg (speed-focused extraction for bulk workflows)
+- ExtractThinker (schema-driven extraction using Pydantic contracts)
+
+Ecosystem adapters:
+
+- LangChain document loaders (uniform loader interface across many sources)
+
+### Guidance for choosing early targets
+
+- If you need layout and table understanding, prioritize Docling and docling-ocr.
+- If you need speed and simplicity, prioritize MarkItDown or Kreuzberg.
+- If you need schema-first extraction, prioritize ExtractThinker layered on an OCR or document extractor.
+
+## Later: alternate backends and hosting modes
+
+Goal: broaden the backend surface while keeping the core predictable.
+
+Deliverables:
+
+- A second backend with different performance tradeoffs.
+- A tool server that exposes a backend through a stable interface.
+- Documentation that shows how to run a backend out of process.
+
+Acceptance checks:
+
+- Local tests remain fast and deterministic.
+- Integration tests validate retrieval through the tool boundary.
+
+## Deferred: corpus and extraction work
+
+These are valuable, but intentionally not the near-term focus while retrieval becomes practical end to end.
+
+### In-memory corpus for ephemeral workflows
+
+Goal: allow programmatic, temporary corpora that live in memory for short-lived agents or tests.
+
+Deliverables:
+
+- A memory-backed corpus implementation that supports the same ingestion and catalog APIs.
+- A serialization option for snapshots so ephemeral corpora can be persisted when needed.
+- Documentation that explains tradeoffs versus file-based corpora.
+
+Acceptance checks:
+
+- Behavior specifications cover ingestion, listing, and reindexing in memory.
+- Retrieval and extraction can operate on the in-memory corpus without special casing.
+
+### Extractor datasets and evaluation harness
+
+Goal: compare extraction approaches in a way that is measurable, repeatable, and useful for practical engineering decisions.
+
+Deliverables:
+
+- Dataset authoring workflow for extraction ground truth (for example: expected transcripts and expected optical character recognition text).
+- Evaluation metrics for accuracy, speed, and cost, including “processable fraction” for a given extractor recipe.
+- A report format that can compare multiple extraction recipes against the same corpus and dataset.
+
+Acceptance checks:
+
+- Evaluation results are stable and reproducible for the same corpus and dataset inputs.
+- Reports make it clear when an extractor fails to process an item versus producing empty output.

{biblicus-0.5.0 → biblicus-0.7.0}/features/environment.py

@@ -134,6 +134,32 @@ def after_scenario(context, scenario) -> None:
                 sys.modules.pop(name, None)
         context._fake_rapidocr_unavailable_installed = False
         context._fake_rapidocr_unavailable_original_modules = {}
+    if getattr(context, "_fake_markitdown_installed", False):
+        original_modules = getattr(context, "_fake_markitdown_original_modules", {})
+        for name in [
+            "markitdown",
+        ]:
+            if name in original_modules:
+                sys.modules[name] = original_modules[name]
+            else:
+                sys.modules.pop(name, None)
+        context._fake_markitdown_installed = False
+        context._fake_markitdown_original_modules = {}
+    if getattr(context, "_fake_markitdown_unavailable_installed", False):
+        original_modules = getattr(context, "_fake_markitdown_unavailable_original_modules", {})
+        for name in [
+            "markitdown",
+        ]:
+            if name in original_modules:
+                sys.modules[name] = original_modules[name]
+            else:
+                sys.modules.pop(name, None)
+        context._fake_markitdown_unavailable_installed = False
+        context._fake_markitdown_unavailable_original_modules = {}
+    original_sys_version_info = getattr(context, "_original_sys_version_info", None)
+    if original_sys_version_info is not None:
+        sys.version_info = original_sys_version_info
+        context._original_sys_version_info = None
     if hasattr(context, "_tmp"):
         context._tmp.cleanup()
 

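The teardown above restores `sys.modules` after the markitdown behavior tests install a fake module. The matching setup step lives in `features/steps/markitdown_steps.py`, which is not shown in this diff, so here is a minimal sketch of what such a step could look like, assuming behave step conventions and the same `context` flags the teardown checks; the step text and the fake `MarkItDown` class are illustrative, not the package's actual implementation.

```python
import sys
import types

from behave import given


@given("a fake markitdown module is installed")
def install_fake_markitdown(context) -> None:
    fake_module = types.ModuleType("markitdown")

    class MarkItDown:  # Illustrative stand-in for the real converter class.
        def convert(self, path):
            return types.SimpleNamespace(text_content="fake converted text")

    fake_module.MarkItDown = MarkItDown
    # Remember whatever was imported before so after_scenario can restore it.
    context._fake_markitdown_original_modules = {
        name: sys.modules[name] for name in ["markitdown"] if name in sys.modules
    }
    sys.modules["markitdown"] = fake_module
    context._fake_markitdown_installed = True
```
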
biblicus-0.7.0/features/knowledge_base.feature (new file)

@@ -0,0 +1,55 @@
+Feature: Knowledge base (turnkey workflow)
+  A knowledge base is a high-level workflow that hides the plumbing while keeping behavior explicit.
+  It should accept a folder, ingest files, build defaults, and allow retrieval with minimal configuration.
+
+  Scenario: Build a knowledge base from a folder and query it
+    Given a folder "notes" exists with text files:
+      | filename  | contents                                                               |
+      | note1.txt | The user's name is Tactus Maximus.                                     |
+      | note2.txt | Primary button style preference: the user's favorite color is magenta. |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "Primary button style preference"
+    Then the knowledge base returns evidence that includes "favorite color is magenta"
+
+  Scenario: Knowledge base context pack is shaped with a token budget
+    Given a folder "notes" exists with text files:
+      | filename  | contents      |
+      | note1.txt | one two three |
+      | note2.txt | four five six |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "one"
+    And I build a context pack from the knowledge base query with token budget 3
+    Then the context pack text equals:
+      """
+      one two three
+      """
+
+  Scenario: Knowledge base context pack defaults to no token budget
+    Given a folder "notes" exists with text files:
+      | filename  | contents   |
+      | note1.txt | alpha beta |
+    When I create a knowledge base from folder "notes" only
+    And I query the knowledge base for "alpha"
+    And I build a context pack from the knowledge base query without a token budget
+    Then the context pack text equals:
+      """
+      alpha beta
+      """
+
+  Scenario: Knowledge base rejects missing folder
+    When I attempt to create a knowledge base from folder "missing"
+    Then the knowledge base error includes "does not exist"
+
+  Scenario: Knowledge base rejects non-folder path
+    Given a file "not-a-folder.txt" exists with contents "hello"
+    When I attempt to create a knowledge base from folder "not-a-folder.txt"
+    Then the knowledge base error includes "not a directory"
+
+  Scenario: Knowledge base can use an explicit corpus root
+    Given a folder "notes" exists with text files:
+      | filename  | contents |
+      | note1.txt | alpha    |
+    And a folder "kb-root" exists
+    When I create a knowledge base from folder "notes" using corpus root "kb-root"
+    And I query the knowledge base for "alpha"
+    Then the knowledge base returns evidence that includes "alpha"
