biblicus 0.5.0__tar.gz → 0.7.0__tar.gz

Files changed (158)
  1. {biblicus-0.5.0/src/biblicus.egg-info → biblicus-0.7.0}/PKG-INFO +57 -4
  2. {biblicus-0.5.0 → biblicus-0.7.0}/README.md +54 -3
  3. {biblicus-0.5.0 → biblicus-0.7.0}/docs/DEMOS.md +19 -0
  4. {biblicus-0.5.0 → biblicus-0.7.0}/docs/EXTRACTION.md +21 -0
  5. {biblicus-0.5.0 → biblicus-0.7.0}/docs/FEATURE_INDEX.md +17 -0
  6. biblicus-0.7.0/docs/KNOWLEDGE_BASE.md +68 -0
  7. biblicus-0.7.0/docs/ROADMAP.md +155 -0
  8. {biblicus-0.5.0 → biblicus-0.7.0}/docs/api.rst +4 -0
  9. {biblicus-0.5.0 → biblicus-0.7.0}/docs/index.rst +1 -0
  10. {biblicus-0.5.0 → biblicus-0.7.0}/features/environment.py +26 -0
  11. biblicus-0.7.0/features/knowledge_base.feature +55 -0
  12. biblicus-0.7.0/features/markitdown_extractor.feature +99 -0
  13. biblicus-0.7.0/features/steps/knowledge_base_steps.py +90 -0
  14. biblicus-0.7.0/features/steps/markitdown_steps.py +173 -0
  15. {biblicus-0.5.0 → biblicus-0.7.0}/pyproject.toml +5 -1
  16. {biblicus-0.5.0 → biblicus-0.7.0}/scripts/test.py +15 -4
  17. biblicus-0.7.0/scripts/wikipedia_rag_demo.py +212 -0
  18. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/__init__.py +3 -1
  19. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/__init__.py +2 -0
  20. biblicus-0.7.0/src/biblicus/extractors/markitdown_text.py +128 -0
  21. biblicus-0.7.0/src/biblicus/knowledge_base.py +191 -0
  22. {biblicus-0.5.0 → biblicus-0.7.0/src/biblicus.egg-info}/PKG-INFO +57 -4
  23. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/SOURCES.txt +8 -0
  24. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/requires.txt +5 -0
  25. biblicus-0.5.0/docs/ROADMAP.md +0 -81
  26. {biblicus-0.5.0 → biblicus-0.7.0}/LICENSE +0 -0
  27. {biblicus-0.5.0 → biblicus-0.7.0}/MANIFEST.in +0 -0
  28. {biblicus-0.5.0 → biblicus-0.7.0}/THIRD_PARTY_NOTICES.md +0 -0
  29. {biblicus-0.5.0 → biblicus-0.7.0}/datasets/wikipedia_mini.json +0 -0
  30. {biblicus-0.5.0 → biblicus-0.7.0}/docs/ARCHITECTURE.md +0 -0
  31. {biblicus-0.5.0 → biblicus-0.7.0}/docs/BACKENDS.md +0 -0
  32. {biblicus-0.5.0 → biblicus-0.7.0}/docs/CONTEXT_PACK.md +0 -0
  33. {biblicus-0.5.0 → biblicus-0.7.0}/docs/CORPUS.md +0 -0
  34. {biblicus-0.5.0 → biblicus-0.7.0}/docs/CORPUS_DESIGN.md +0 -0
  35. {biblicus-0.5.0 → biblicus-0.7.0}/docs/TESTING.md +0 -0
  36. {biblicus-0.5.0 → biblicus-0.7.0}/docs/USER_CONFIGURATION.md +0 -0
  37. {biblicus-0.5.0 → biblicus-0.7.0}/docs/conf.py +0 -0
  38. {biblicus-0.5.0 → biblicus-0.7.0}/features/backend_validation.feature +0 -0
  39. {biblicus-0.5.0 → biblicus-0.7.0}/features/biblicus_corpus.feature +0 -0
  40. {biblicus-0.5.0 → biblicus-0.7.0}/features/cli_entrypoint.feature +0 -0
  41. {biblicus-0.5.0 → biblicus-0.7.0}/features/cli_parsing.feature +0 -0
  42. {biblicus-0.5.0 → biblicus-0.7.0}/features/content_sniffing.feature +0 -0
  43. {biblicus-0.5.0 → biblicus-0.7.0}/features/context_pack.feature +0 -0
  44. {biblicus-0.5.0 → biblicus-0.7.0}/features/context_pack_cli.feature +0 -0
  45. {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_edge_cases.feature +0 -0
  46. {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_identity.feature +0 -0
  47. {biblicus-0.5.0 → biblicus-0.7.0}/features/corpus_purge.feature +0 -0
  48. {biblicus-0.5.0 → biblicus-0.7.0}/features/crawl.feature +0 -0
  49. {biblicus-0.5.0 → biblicus-0.7.0}/features/error_cases.feature +0 -0
  50. {biblicus-0.5.0 → biblicus-0.7.0}/features/evaluation.feature +0 -0
  51. {biblicus-0.5.0 → biblicus-0.7.0}/features/evidence_processing.feature +0 -0
  52. {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_error_handling.feature +0 -0
  53. {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_run_lifecycle.feature +0 -0
  54. {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_selection.feature +0 -0
  55. {biblicus-0.5.0 → biblicus-0.7.0}/features/extraction_selection_longest.feature +0 -0
  56. {biblicus-0.5.0 → biblicus-0.7.0}/features/extractor_pipeline.feature +0 -0
  57. {biblicus-0.5.0 → biblicus-0.7.0}/features/extractor_validation.feature +0 -0
  58. {biblicus-0.5.0 → biblicus-0.7.0}/features/frontmatter.feature +0 -0
  59. {biblicus-0.5.0 → biblicus-0.7.0}/features/hook_config_validation.feature +0 -0
  60. {biblicus-0.5.0 → biblicus-0.7.0}/features/hook_error_handling.feature +0 -0
  61. {biblicus-0.5.0 → biblicus-0.7.0}/features/import_tree.feature +0 -0
  62. {biblicus-0.5.0 → biblicus-0.7.0}/features/ingest_sources.feature +0 -0
  63. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_audio_samples.feature +0 -0
  64. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_image_samples.feature +0 -0
  65. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_mixed_corpus.feature +0 -0
  66. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_mixed_extraction.feature +0 -0
  67. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_ocr_image_extraction.feature +0 -0
  68. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_pdf_retrieval.feature +0 -0
  69. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_pdf_samples.feature +0 -0
  70. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_unstructured_extraction.feature +0 -0
  71. {biblicus-0.5.0 → biblicus-0.7.0}/features/integration_wikipedia.feature +0 -0
  72. {biblicus-0.5.0 → biblicus-0.7.0}/features/lifecycle_hooks.feature +0 -0
  73. {biblicus-0.5.0 → biblicus-0.7.0}/features/model_validation.feature +0 -0
  74. {biblicus-0.5.0 → biblicus-0.7.0}/features/ocr_extractor.feature +0 -0
  75. {biblicus-0.5.0 → biblicus-0.7.0}/features/pdf_text_extraction.feature +0 -0
  76. {biblicus-0.5.0 → biblicus-0.7.0}/features/python_api.feature +0 -0
  77. {biblicus-0.5.0 → biblicus-0.7.0}/features/python_hook_logging.feature +0 -0
  78. {biblicus-0.5.0 → biblicus-0.7.0}/features/query_processing.feature +0 -0
  79. {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_budget.feature +0 -0
  80. {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_scan.feature +0 -0
  81. {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
  82. {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_uses_extraction_run.feature +0 -0
  83. {biblicus-0.5.0 → biblicus-0.7.0}/features/retrieval_utilities.feature +0 -0
  84. {biblicus-0.5.0 → biblicus-0.7.0}/features/source_loading.feature +0 -0
  85. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/backend_steps.py +0 -0
  86. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/cli_parsing_steps.py +0 -0
  87. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/cli_steps.py +0 -0
  88. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/context_pack_steps.py +0 -0
  89. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/crawl_steps.py +0 -0
  90. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/evidence_processing_steps.py +0 -0
  91. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extraction_run_lifecycle_steps.py +0 -0
  92. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extraction_steps.py +0 -0
  93. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/extractor_steps.py +0 -0
  94. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/frontmatter_steps.py +0 -0
  95. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/model_steps.py +0 -0
  96. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/openai_steps.py +0 -0
  97. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/pdf_steps.py +0 -0
  98. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/python_api_steps.py +0 -0
  99. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/rapidocr_steps.py +0 -0
  100. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/retrieval_steps.py +0 -0
  101. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/stt_steps.py +0 -0
  102. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/unstructured_steps.py +0 -0
  103. {biblicus-0.5.0 → biblicus-0.7.0}/features/steps/user_config_steps.py +0 -0
  104. {biblicus-0.5.0 → biblicus-0.7.0}/features/streaming_ingest.feature +0 -0
  105. {biblicus-0.5.0 → biblicus-0.7.0}/features/stt_extractor.feature +0 -0
  106. {biblicus-0.5.0 → biblicus-0.7.0}/features/text_extraction_runs.feature +0 -0
  107. {biblicus-0.5.0 → biblicus-0.7.0}/features/token_budget.feature +0 -0
  108. {biblicus-0.5.0 → biblicus-0.7.0}/features/unstructured_extractor.feature +0 -0
  109. {biblicus-0.5.0 → biblicus-0.7.0}/features/user_config.feature +0 -0
  110. {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_audio_samples.py +0 -0
  111. {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_image_samples.py +0 -0
  112. {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_mixed_samples.py +0 -0
  113. {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_pdf_samples.py +0 -0
  114. {biblicus-0.5.0 → biblicus-0.7.0}/scripts/download_wikipedia.py +0 -0
  115. {biblicus-0.5.0 → biblicus-0.7.0}/scripts/readme_end_to_end_demo.py +0 -0
  116. {biblicus-0.5.0 → biblicus-0.7.0}/setup.cfg +0 -0
  117. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/__main__.py +0 -0
  118. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
  119. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
  120. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
  121. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
  122. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/__init__.py +0 -0
  123. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/base.py +0 -0
  124. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/scan.py +0 -0
  125. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
  126. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/cli.py +0 -0
  127. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/constants.py +0 -0
  128. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/context.py +0 -0
  129. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/corpus.py +0 -0
  130. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/crawl.py +0 -0
  131. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/errors.py +0 -0
  132. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/evaluation.py +0 -0
  133. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/evidence_processing.py +0 -0
  134. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extraction.py +0 -0
  135. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/base.py +0 -0
  136. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/metadata_text.py +0 -0
  137. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/openai_stt.py +0 -0
  138. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pass_through_text.py +0 -0
  139. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pdf_text.py +0 -0
  140. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/pipeline.py +0 -0
  141. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
  142. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/select_longest_text.py +0 -0
  143. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/select_text.py +0 -0
  144. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/extractors/unstructured_text.py +0 -0
  145. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/frontmatter.py +0 -0
  146. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hook_logging.py +0 -0
  147. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hook_manager.py +0 -0
  148. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/hooks.py +0 -0
  149. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/ignore.py +0 -0
  150. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/models.py +0 -0
  151. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/retrieval.py +0 -0
  152. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/sources.py +0 -0
  153. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/time.py +0 -0
  154. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/uris.py +0 -0
  155. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus/user_config.py +0 -0
  156. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
  157. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/entry_points.txt +0 -0
  158. {biblicus-0.5.0 → biblicus-0.7.0}/src/biblicus.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: biblicus
3
- Version: 0.5.0
3
+ Version: 0.7.0
4
4
  Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
5
5
  License: MIT
6
6
  Requires-Python: >=3.9
@@ -25,6 +25,8 @@ Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
25
25
  Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
26
26
  Provides-Extra: ocr
27
27
  Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
28
+ Provides-Extra: markitdown
29
+ Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
28
30
  Dynamic: license-file
29
31
 
30
32
  # Biblicus
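The `Requires-Dist` line added in this hunk gates the `markitdown` extra on the interpreter version. Installers evaluate such environment markers themselves (for example through the `packaging` library), so the following is only a rough plain-Python illustration of the condition. The function name `markitdown_extra_applies` is hypothetical and is not part of Biblicus.

```python
import sys
from typing import Iterable, Sequence


def markitdown_extra_applies(version_info: Sequence[int], requested_extras: Iterable[str]) -> bool:
    """Approximate the marker: python_version >= "3.10" and extra == "markitdown"."""
    # Compare (major, minor) as integers so that 3.10 sorts after 3.9.
    new_enough = tuple(version_info[:2]) >= (3, 10)
    # The dependency is only considered when the "markitdown" extra was requested.
    return new_enough and "markitdown" in set(requested_extras)


# Check the running interpreter for an install such as `biblicus[markitdown]`.
print(markitdown_extra_applies(sys.version_info, ["markitdown"]))
```

On Python 3.9 the marker is false, so requesting the extra would install Biblicus without pulling in `markitdown[all]`.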
@@ -45,6 +47,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
45
47
 
46
48
  See [retrieval augmented generation overview] for a short introduction to the idea.
47
49
 
50
+ ## Start with a knowledge base
51
+
52
+ If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
53
+
54
+ This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
55
+
56
+ ```python
57
+ from biblicus.knowledge_base import KnowledgeBase
58
+
59
+
60
+ kb = KnowledgeBase.from_folder("notes")
61
+ result = kb.query("Primary button style preference")
62
+ context_pack = kb.context_pack(result, max_tokens=800)
63
+
64
+ print(context_pack.text)
65
+ ```
66
+
67
+ If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
68
+
69
+ This simplified sequence diagram shows the same idea at a high level.
70
+
71
+ ```mermaid
72
+ %%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
73
+ sequenceDiagram
74
+ participant App as Your assistant code
75
+ participant KB as Knowledge base
76
+ participant LLM as Large language model
77
+
78
+ App->>KB: query
79
+ KB-->>App: evidence and context
80
+ App->>LLM: context plus prompt
81
+ LLM-->>App: response draft
82
+ ```
83
+
48
84
  ## A simple mental model
49
85
 
50
86
  Think in three stages.
@@ -72,7 +108,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
72
108
  This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
73
109
 
74
110
  ```mermaid
75
- %%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
111
+ %%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
76
112
  sequenceDiagram
77
113
  participant User
78
114
  participant App as Your assistant code
@@ -126,6 +162,7 @@ Some extractors are optional so the base install stays small.
126
162
  - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
127
163
  - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
128
164
  - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
165
+ - MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
129
166
 
130
167
  ## Quick start
131
168
 
@@ -153,11 +190,11 @@ biblicus crawl --corpus corpora/example \
153
190
  --tag crawled
154
191
  ```
155
192
 
156
- ## End-to-end example: evidence to assistant context
193
+ ## End-to-end example: lower-level control
157
194
 
158
195
  The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
159
196
 
160
- Start with a few short “memories” from a chat system. Each memory is stored as a normal item in the corpus.
197
+ This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
161
198
 
162
199
  ```python
163
200
  from biblicus.backends import get_backend
@@ -383,6 +420,7 @@ The documents below follow the pipeline from raw items to model context:
383
420
 
384
421
  - [Corpus][corpus]
385
422
  - [Text extraction][text-extraction]
423
+ - [Knowledge base][knowledge-base]
386
424
  - [Backends][backends]
387
425
  - [Context packs][context-packs]
388
426
  - [Testing and evaluation][testing]
@@ -432,6 +470,20 @@ Two backends are included.
432
470
  - `scan` is a minimal baseline that scans raw items directly.
433
471
  - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
434
472
 
473
+ ## Extraction backends
474
+
475
+ These extractors are built in. Optional ones require extra dependencies.
476
+
477
+ - `pass-through-text` reads text items and strips Markdown front matter.
478
+ - `metadata-text` turns catalog metadata into a small text artifact.
479
+ - `pdf-text` extracts text from Portable Document Format items with `pypdf`.
480
+ - `select-text` chooses one prior extraction result in a pipeline.
481
+ - `select-longest-text` chooses the longest prior extraction result.
482
+ - `ocr-rapidocr` does optical character recognition on images (optional).
483
+ - `stt-openai` performs speech to text on audio (optional).
484
+ - `unstructured` provides broad document parsing (optional).
485
+ - `markitdown` converts many formats into Markdown-like text (optional).
486
+
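The `select-longest-text` entry in the list above describes choosing the longest prior extraction result. A minimal sketch of that selection rule follows. It assumes candidates arrive as optional strings, that empty or whitespace-only results are ignored, and that ties go to the earliest candidate. None of these details are confirmed by this diff, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def select_longest_text(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Return the longest non-empty candidate, or None when nothing usable exists."""
    # Drop missing results and results that contain only whitespace.
    usable = [text for text in candidates if text and text.strip()]
    if not usable:
        return None
    # max() returns the first maximal element, so earlier pipeline steps win ties.
    return max(usable, key=len)


# Example: an OCR step produced little text, a PDF step produced more.
print(select_longest_text(["", None, "short", "a longer result"]))
```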
435
487
  ## Integration corpus and evaluation dataset
436
488
 
437
489
  Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
@@ -485,6 +537,7 @@ License terms are in `LICENSE`.
485
537
  [roadmap]: docs/ROADMAP.md
486
538
  [feature-index]: docs/FEATURE_INDEX.md
487
539
  [corpus]: docs/CORPUS.md
540
+ [knowledge-base]: docs/KNOWLEDGE_BASE.md
488
541
  [text-extraction]: docs/EXTRACTION.md
489
542
  [user-configuration]: docs/USER_CONFIGURATION.md
490
543
  [backends]: docs/BACKENDS.md
@@ -16,6 +16,40 @@ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or
16
16
 
17
17
  See [retrieval augmented generation overview] for a short introduction to the idea.
18
18
 
19
+ ## Start with a knowledge base
20
+
21
+ If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
22
+
23
+ This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
24
+
25
+ ```python
26
+ from biblicus.knowledge_base import KnowledgeBase
27
+
28
+
29
+ kb = KnowledgeBase.from_folder("notes")
30
+ result = kb.query("Primary button style preference")
31
+ context_pack = kb.context_pack(result, max_tokens=800)
32
+
33
+ print(context_pack.text)
34
+ ```
35
+
36
+ If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
37
+
38
+ This simplified sequence diagram shows the same idea at a high level.
39
+
40
+ ```mermaid
41
+ %%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
42
+ sequenceDiagram
43
+ participant App as Your assistant code
44
+ participant KB as Knowledge base
45
+ participant LLM as Large language model
46
+
47
+ App->>KB: query
48
+ KB-->>App: evidence and context
49
+ App->>LLM: context plus prompt
50
+ LLM-->>App: response draft
51
+ ```
52
+
19
53
  ## A simple mental model
20
54
 
21
55
  Think in three stages.
@@ -43,7 +77,7 @@ In a coding assistant, retrieval is often triggered by what the user is doing ri
43
77
  This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
44
78
 
45
79
  ```mermaid
46
- %%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
80
+ %%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
47
81
  sequenceDiagram
48
82
  participant User
49
83
  participant App as Your assistant code
@@ -97,6 +131,7 @@ Some extractors are optional so the base install stays small.
97
131
  - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
98
132
  - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
99
133
  - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
134
+ - MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
100
135
 
101
136
  ## Quick start
102
137
 
@@ -124,11 +159,11 @@ biblicus crawl --corpus corpora/example \
124
159
  --tag crawled
125
160
  ```
126
161
 
127
- ## End-to-end example: evidence to assistant context
162
+ ## End-to-end example: lower-level control
128
163
 
129
164
  The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
130
165
 
131
- Start with a few short “memories” from a chat system. Each memory is stored as a normal item in the corpus.
166
+ This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
132
167
 
133
168
  ```python
134
169
  from biblicus.backends import get_backend
@@ -354,6 +389,7 @@ The documents below follow the pipeline from raw items to model context:
354
389
 
355
390
  - [Corpus][corpus]
356
391
  - [Text extraction][text-extraction]
392
+ - [Knowledge base][knowledge-base]
357
393
  - [Backends][backends]
358
394
  - [Context packs][context-packs]
359
395
  - [Testing and evaluation][testing]
@@ -403,6 +439,20 @@ Two backends are included.
403
439
  - `scan` is a minimal baseline that scans raw items directly.
404
440
  - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
405
441
 
442
+ ## Extraction backends
443
+
444
+ These extractors are built in. Optional ones require extra dependencies.
445
+
446
+ - `pass-through-text` reads text items and strips Markdown front matter.
447
+ - `metadata-text` turns catalog metadata into a small text artifact.
448
+ - `pdf-text` extracts text from Portable Document Format items with `pypdf`.
449
+ - `select-text` chooses one prior extraction result in a pipeline.
450
+ - `select-longest-text` chooses the longest prior extraction result.
451
+ - `ocr-rapidocr` does optical character recognition on images (optional).
452
+ - `stt-openai` performs speech to text on audio (optional).
453
+ - `unstructured` provides broad document parsing (optional).
454
+ - `markitdown` converts many formats into Markdown-like text (optional).
455
+
406
456
  ## Integration corpus and evaluation dataset
407
457
 
408
458
  Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
@@ -456,6 +506,7 @@ License terms are in `LICENSE`.
456
506
  [roadmap]: docs/ROADMAP.md
457
507
  [feature-index]: docs/FEATURE_INDEX.md
458
508
  [corpus]: docs/CORPUS.md
509
+ [knowledge-base]: docs/KNOWLEDGE_BASE.md
459
510
  [text-extraction]: docs/EXTRACTION.md
460
511
  [user-configuration]: docs/USER_CONFIGURATION.md
461
512
  [backends]: docs/BACKENDS.md
@@ -221,6 +221,25 @@ python3 -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-tex
221
221
  python3 -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"
222
222
  ```
223
223
 
224
+ ### Wikipedia retrieval demo (Python)
225
+
226
+ This example downloads a few Wikipedia summaries about retrieval and knowledge bases, builds an extraction run, creates a local full text index, and returns evidence plus a context pack.
227
+
228
+ ```
229
+ rm -rf corpora/wikipedia_rag_demo
230
+ python3 scripts/wikipedia_rag_demo.py --corpus corpora/wikipedia_rag_demo --force
231
+ ```
232
+
233
+ ### MarkItDown extraction demo (Python 3.10+)
234
+
235
+ MarkItDown requires Python 3.10 or higher. This example uses the `py311` conda environment to run the extractor over the mixed sample corpus.
236
+
237
+ ```
238
+ conda run -n py311 python -m pip install -e . "markitdown[all]"
239
+ conda run -n py311 python scripts/download_mixed_samples.py --corpus corpora/markitdown_demo_py311 --force
240
+ conda run -n py311 python -m biblicus extract build --corpus corpora/markitdown_demo_py311 --step markitdown
241
+ ```
242
+
224
243
  ### Mixed modality integration corpus
225
244
 
226
245
  This example assembles a tiny mixed corpus with a Markdown note, a Hypertext Markup Language page, an image, a Portable Document Format file with extractable text, and a generated Portable Document Format file with no extractable text.
@@ -71,6 +71,27 @@ To install:
71
71
  python3 -m pip install "biblicus[unstructured]"
72
72
  ```
73
73
 
74
+ `markitdown`
75
+
76
+ - Converts common document formats into Markdown-like text
77
+ - Backed by the optional `markitdown` dependency
78
+ - Requires Python 3.10 or higher
79
+ - Skips items that are already text so the pass-through extractor remains the canonical choice for text items
80
+ - This means it will not process `text/html` or other text media types unless that policy changes
81
+
82
+ To install:
83
+
84
+ ```
85
+ python3 -m pip install "biblicus[markitdown]"
86
+ ```
87
+
88
+ Example:
89
+
90
+ ```
91
+ python3 -m biblicus extract build --corpus corpora/extraction-demo \
92
+ --step markitdown
93
+ ```
94
+
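The `markitdown` notes above say the extractor skips items that are already text, including `text/html`. A minimal sketch of such a policy check, assuming the decision is made purely from the item's media type string. The function name `should_convert` is hypothetical, and the actual check in `markitdown_text.py` is not shown in this diff.

```python
def should_convert(media_type: str) -> bool:
    """Return True when a media type is not already text and could be converted."""
    # Ignore parameters such as "; charset=utf-8" and normalize case.
    base_type = media_type.split(";", 1)[0].strip().lower()
    if not base_type:
        return False
    # Text items are left to the pass-through extractor.
    return not base_type.startswith("text/")


for candidate in ["application/pdf", "text/html; charset=utf-8", "text/plain"]:
    print(candidate, should_convert(candidate))
```

Under this policy a Hypertext Markup Language page would be skipped even though MarkItDown can convert it, which matches the caveat in the list above.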
74
95
  `ocr-rapidocr`
75
96
 
76
97
  - Optical character recognition for image items
@@ -123,6 +123,7 @@ What it does:
123
123
  - Includes a Portable Document Format text extractor plugin.
124
124
  - Includes a speech to text extractor plugin for audio items.
125
125
  - Includes a selection extractor step for choosing extracted text within a pipeline.
126
+ - Includes a MarkItDown extractor plugin for document conversion.
126
127
 
127
128
  Documentation:
128
129
 
@@ -139,6 +140,7 @@ Behavior specifications:
139
140
  - `features/ocr_extractor.feature`
140
141
  - `features/stt_extractor.feature`
141
142
  - `features/unstructured_extractor.feature`
143
+ - `features/markitdown_extractor.feature`
142
144
  - `features/integration_unstructured_extraction.feature`
143
145
 
144
146
  Primary implementation:
@@ -208,6 +210,21 @@ Primary implementation:
208
210
 
209
211
  - `src/biblicus/context.py`
210
212
 
213
+ ## Knowledge base
214
+
215
+ What it does:
216
+
217
+ - Provides a turnkey interface that accepts a folder and returns a ready-to-query workflow.
218
+ - Applies sensible defaults for import, retrieval, and context pack shaping.
219
+
220
+ Behavior specifications:
221
+
222
+ - `features/knowledge_base.feature`
223
+
224
+ Primary implementation:
225
+
226
+ - `src/biblicus/knowledge_base.py`
227
+
211
228
  ## Testing, coverage, and documentation build
212
229
 
213
230
  What it does:
@@ -0,0 +1,68 @@
1
+ # Knowledge base
2
+
3
+ The knowledge base is the high‑level, turnkey workflow that makes Biblicus feel effortless. You hand it a folder. It chooses sensible defaults, builds a retrieval run, and gives you evidence you can turn into context.
4
+
5
+ This is the right layer when you want to use Biblicus without spending time on setup. You can still override the defaults later when you want fine‑grained control.
6
+
7
+ ## What it does
8
+
9
+ - Creates or opens a corpus at a chosen location (or a temporary location if you do not provide one).
10
+ - Imports a folder tree into that corpus.
11
+ - Builds a retrieval run using a default backend.
12
+ - Exposes a simple `query` method that returns evidence.
13
+ - Exposes a `context_pack` helper to shape evidence into model context.
14
+
15
+ ## Minimal use
16
+
17
+ ```python
18
+ from biblicus.knowledge_base import KnowledgeBase
19
+
20
+
21
+ kb = KnowledgeBase.from_folder("notes")
22
+ result = kb.query("Primary button style preference")
23
+ context_pack = kb.context_pack(result, max_tokens=800)
24
+
25
+ print(context_pack.text)
26
+ ```
27
+
28
+ ## Default behavior
29
+
30
+ The knowledge base wraps existing primitives. Defaults are explicit and deterministic.
31
+
32
+ - **Corpus**: stored on disk and fully inspectable.
33
+ - **Import**: uses the folder tree import, preserving relative paths.
34
+ - **Backend**: defaults to the `scan` backend.
35
+ - **Query budget**: defaults to a small, conservative evidence budget.
36
+
37
+ ## Overrides
38
+
39
+ You can override the defaults when needed.
40
+
41
+ ```python
42
+ from biblicus.knowledge_base import KnowledgeBase
43
+ from biblicus.models import QueryBudget
44
+
45
+
46
+ kb = KnowledgeBase.from_folder(
47
+ "notes",
48
+ backend_id="scan",
49
+ recipe_name="Knowledge base demo",
50
+ query_budget=QueryBudget(max_total_items=10, max_total_characters=4000, max_items_per_source=None),
51
+ tags=["memory"],
52
+ corpus_root="corpora/knowledge-base",
53
+ )
54
+ ```
55
+
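The override example above passes three limits through `QueryBudget`. As a sketch of how limits with those names could be applied to ranked evidence, the following uses stand-in classes `Budget` and `Evidence` that are hypothetical and are not the Biblicus models. Whether Biblicus skips or stops at an item that exceeds the character limit is an assumption here (this sketch skips it and keeps looking).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Budget:  # hypothetical stand-in, not biblicus.models.QueryBudget
    max_total_items: int
    max_total_characters: int
    max_items_per_source: Optional[int] = None


@dataclass
class Evidence:  # hypothetical stand-in for one retrieved item
    source: str
    text: str
    score: float


def apply_budget(evidence: List[Evidence], budget: Budget) -> List[Evidence]:
    """Keep the highest scoring evidence that fits within all three limits."""
    kept: List[Evidence] = []
    used_characters = 0
    per_source: Dict[str, int] = {}
    for item in sorted(evidence, key=lambda e: e.score, reverse=True):
        if len(kept) >= budget.max_total_items:
            break
        limit = budget.max_items_per_source
        if limit is not None and per_source.get(item.source, 0) >= limit:
            continue  # this source already contributed its share
        if used_characters + len(item.text) > budget.max_total_characters:
            continue  # assumption: skip oversized items rather than stop
        kept.append(item)
        used_characters += len(item.text)
        per_source[item.source] = per_source.get(item.source, 0) + 1
    return kept
```

Passing `max_items_per_source=None`, as in the override example, disables the per-source limit in this sketch.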
56
+ ## How it relates to lower‑level control
57
+
58
+ The knowledge base is a convenience layer. It uses the same underlying parts that the lower‑level examples use.
59
+
60
+ - `Corpus` for ingestion and storage
61
+ - `import_tree` for folder ingestion
62
+ - A backend run (`scan` by default)
63
+ - `QueryBudget` for evidence limits
64
+ - `ContextPackPolicy` and token fitting for context shaping
65
+
66
+ You can always drop down to those lower‑level primitives when you need more control.
67
+
68
+ When you do, use `Corpus`, `get_backend`, and `ContextPackPolicy` directly.
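The override example above passes a `QueryBudget` with `max_total_items`, `max_total_characters`, and `max_items_per_source`. As a rough illustration of how limits with those names could interact, here is a minimal sketch. The `Evidence` and `Budget` classes and the `apply_budget` function are hypothetical stand-ins written for this note, not Biblicus code, and the skip-and-continue behavior is an assumption rather than the library's documented semantics.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record: a source identifier plus matched text."""

    source: str
    text: str


@dataclass(frozen=True)
class Budget:
    """Hypothetical stand-in that mirrors the QueryBudget field names."""

    max_total_items: int
    max_total_characters: Optional[int] = None
    max_items_per_source: Optional[int] = None


def apply_budget(ranked: List[Evidence], budget: Budget) -> List[Evidence]:
    """Walk ranked evidence in order and keep items that fit every limit."""
    kept: List[Evidence] = []
    per_source: Dict[str, int] = {}
    total_characters = 0
    for item in ranked:
        if len(kept) >= budget.max_total_items:
            break  # the item limit is a hard stop
        count = per_source.get(item.source, 0)
        if budget.max_items_per_source is not None and count >= budget.max_items_per_source:
            continue  # this source is already represented often enough
        if (
            budget.max_total_characters is not None
            and total_characters + len(item.text) > budget.max_total_characters
        ):
            continue  # assumption: skip oversized items instead of truncating them
        kept.append(item)
        per_source[item.source] = count + 1
        total_characters += len(item.text)
    return kept


ranked = [
    Evidence("note2.txt", "favorite color is magenta"),
    Evidence("note2.txt", "Primary button style preference"),
    Evidence("note1.txt", "The user's name is Tactus Maximus."),
]
selected = apply_budget(ranked, Budget(max_total_items=2, max_items_per_source=1))
print([item.source for item in selected])  # ['note2.txt', 'note1.txt']
```

Leaving a limit as `None` disables it in this sketch, which matches how the example above passes `max_items_per_source=None`.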
@@ -0,0 +1,155 @@
1
+ # Roadmap
2
+
3
+ This document describes what we plan to build next.
4
+
5
+ If you are looking for runnable examples, see `docs/DEMOS.md`.
6
+
7
+ If you are looking for what already exists, start with:
8
+
9
+ - `docs/FEATURE_INDEX.md` for a map of features to behavior specifications and modules.
10
+ - `CHANGELOG.md` for released changes.
11
+
12
+ ## Principles
13
+
14
+ - Behavior specifications are the authoritative definition of behavior.
15
+ - Every behavior that exists is specified.
16
+ - Validation and documentation are part of the product.
17
+ - Raw corpus items remain readable, portable files.
18
+ - Derived artifacts are stored under the corpus, and artifacts from multiple implementations can coexist.
19
+
20
+ ## Next: retrieval evaluation and datasets
21
+
22
+ Goal: make evaluation results easier to interpret and compare.
23
+
24
+ Deliverables:
25
+
26
+ - A dataset authoring workflow that supports small hand-labeled sets and larger synthetic sets.
27
+ - A report that includes per-query diagnostics and a clear summary.
28
+
29
+ Acceptance checks:
30
+
31
+ - Dataset formats are versioned when they change.
32
+ - Reports remain deterministic for the same inputs.
33
+
34
+ ## Next: context pack policy surfaces
35
+
36
+ Goal: make context shaping policies easier to evaluate and swap.
37
+
38
+ Deliverables:
39
+
40
+ - A clear set of context pack policy variants (formatting, ordering, metadata inclusion).
41
+ - Token budget strategies that can use a real tokenizer.
42
+ - Documentation that explains where context shaping fits in the pipeline.
43
+
44
+ Acceptance checks:
45
+
46
+ - Behavior specifications cover policy selection and budgeting behaviors.
47
+ - Example outputs show how context packs differ across policies.
48
+
49
+ ## Next: extraction backends (OCR and document understanding)
50
+
51
+ Goal: treat optical character recognition and document understanding as pluggable extractors with consistent inputs and outputs.
52
+
53
+ Deliverables:
54
+
55
+ - A baseline OCR extractor that is fast and local for smoke tests.
56
+ - A higher-quality OCR extractor candidate (for example: PaddleOCR or Docling OCR).
57
+ - A general document understanding extractor candidate (for example: Docling or Unstructured).
58
+ - A consistent output contract that captures text plus optional confidence and per-page metadata.
59
+ - A selector policy for choosing between multiple extractor outputs in a pipeline.
60
+ - A shared evaluation harness for extraction backends using the same corpus and dataset.
61
+
62
+ Acceptance checks:
63
+
64
+ - Behavior specifications cover extractor selection and output provenance.
65
+ - Evaluation reports compare accuracy, processable fraction, latency, and cost.
66
+
67
+ ## Next: corpus analysis tools
68
+
69
+ Goal: provide lightweight analysis utilities that summarize corpus themes and guide curation.
70
+
71
+ Deliverables:
72
+
73
+ - A topic modeling workflow for corpus analysis (for example: BERTopic).
74
+ - A report that highlights dominant themes and outliers.
75
+ - A way to compare topic distributions across corpora or corpus snapshots.
76
+
77
+ Acceptance checks:
78
+
79
+ - Analysis is reproducible for the same corpus state.
80
+ - Reports are exportable and readable without custom tooling.
81
+
82
+ ### Candidate backend ecosystem (for planning and evaluation)
83
+
84
+ Document understanding and OCR blur together at the interface level in Biblicus, so the roadmap treats them as extractor candidates with the same input/output contract.
85
+
86
+ Docling family candidates:
87
+
88
+ - Docling (document understanding with structured outputs)
89
+ - docling-ocr (OCR component in the Docling ecosystem)
90
+
91
+ General-purpose extraction candidates:
92
+
93
+ - Unstructured (element-oriented extraction for many formats)
94
+ - MarkItDown (lightweight conversion to Markdown)
95
+ - Kreuzberg (speed-focused extraction for bulk workflows)
96
+ - ExtractThinker (schema-driven extraction using Pydantic contracts)
97
+
98
+ Ecosystem adapters:
99
+
100
+ - LangChain document loaders (uniform loader interface across many sources)
101
+
102
+ ### Guidance for choosing early targets
103
+
104
+ - If you need layout and table understanding, prioritize Docling and docling-ocr.
105
+ - If you need speed and simplicity, prioritize MarkItDown or Kreuzberg.
106
+ - If you need schema-first extraction, prioritize ExtractThinker layered on an OCR or document extractor.
107
+
108
+ ## Later: alternate backends and hosting modes
109
+
110
+ Goal: broaden the backend surface while keeping the core predictable.
111
+
112
+ Deliverables:
113
+
114
+ - A second backend with different performance tradeoffs.
115
+ - A tool server that exposes a backend through a stable interface.
116
+ - Documentation that shows how to run a backend out of process.
117
+
118
+ Acceptance checks:
119
+
120
+ - Local tests remain fast and deterministic.
121
+ - Integration tests validate retrieval through the tool boundary.
122
+
123
+ ## Deferred: corpus and extraction work
124
+
125
+ These items are valuable but are intentionally not the near-term focus while retrieval becomes practical end to end.
126
+
127
+ ### In-memory corpus for ephemeral workflows
128
+
129
+ Goal: allow programmatic, temporary corpora that live in memory for short-lived agents or tests.
130
+
131
+ Deliverables:
132
+
133
+ - A memory-backed corpus implementation that supports the same ingestion and catalog APIs.
134
+ - A serialization option for snapshots so ephemeral corpora can be persisted when needed.
135
+ - Documentation that explains tradeoffs versus file-based corpora.
136
+
137
+ Acceptance checks:
138
+
139
+ - Behavior specifications cover ingestion, listing, and reindexing in memory.
140
+ - Retrieval and extraction can operate on the in-memory corpus without special casing.
141
+
142
+ ### Extractor datasets and evaluation harness
143
+
144
+ Goal: compare extraction approaches in a way that is measurable, repeatable, and useful for practical engineering decisions.
145
+
146
+ Deliverables:
147
+
148
+ - Dataset authoring workflow for extraction ground truth (for example: expected transcripts and expected optical character recognition text).
149
+ - Evaluation metrics for accuracy, speed, and cost, including “processable fraction” for a given extractor recipe.
150
+ - A report format that can compare multiple extraction recipes against the same corpus and dataset.
151
+
152
+ Acceptance checks:
153
+
154
+ - Evaluation results are stable and reproducible for the same corpus and dataset inputs.
155
+ - Reports make it clear when an extractor fails to process an item versus producing empty output.
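The roadmap names a "processable fraction" metric and asks that reports separate extractor failures from empty output, but it does not define either. The sketch below shows one possible reading: an item counts as processable only when the extractor returns non-empty text, and a failure is recorded as `None`. The `ExtractionOutcome` class, the `summarize` function, and this definition are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ExtractionOutcome:
    """Hypothetical per-item result; text is None when the extractor failed."""

    item_id: str
    text: Optional[str]


def summarize(outcomes: List[ExtractionOutcome]) -> Dict[str, float]:
    """Report failed, empty, and processable items as fractions of the total."""
    total = len(outcomes)
    if total == 0:
        return {"processable_fraction": 0.0, "empty_fraction": 0.0, "failed_fraction": 0.0}
    failed = sum(1 for outcome in outcomes if outcome.text is None)
    empty = sum(
        1 for outcome in outcomes if outcome.text is not None and not outcome.text.strip()
    )
    processable = total - failed - empty
    return {
        "processable_fraction": processable / total,
        "empty_fraction": empty / total,
        "failed_fraction": failed / total,
    }


outcomes = [
    ExtractionOutcome("a.pdf", "Invoice 2024"),
    ExtractionOutcome("b.pdf", "Meeting notes"),
    ExtractionOutcome("c.png", "   "),  # extractor ran but found no text
    ExtractionOutcome("d.bin", None),  # extractor could not process the item
]
print(summarize(outcomes))
```

Keeping the empty and failed counts separate is what lets a report distinguish "could not process" from "processed and found nothing".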
@@ -8,6 +8,10 @@ Core
8
8
  :members:
9
9
  :undoc-members:
10
10
 
11
+ .. automodule:: biblicus.knowledge_base
12
+ :members:
13
+ :undoc-members:
14
+
11
15
  .. automodule:: biblicus.models
12
16
  :members:
13
17
  :undoc-members:
@@ -11,6 +11,7 @@ Contents
11
11
 
12
12
  CORPUS
13
13
  EXTRACTION
14
+ KNOWLEDGE_BASE
14
15
  BACKENDS
15
16
  CONTEXT_PACK
16
17
  DEMOS
@@ -134,6 +134,32 @@ def after_scenario(context, scenario) -> None:
134
134
  sys.modules.pop(name, None)
135
135
  context._fake_rapidocr_unavailable_installed = False
136
136
  context._fake_rapidocr_unavailable_original_modules = {}
137
+ if getattr(context, "_fake_markitdown_installed", False):
138
+ original_modules = getattr(context, "_fake_markitdown_original_modules", {})
139
+ for name in [
140
+ "markitdown",
141
+ ]:
142
+ if name in original_modules:
143
+ sys.modules[name] = original_modules[name]
144
+ else:
145
+ sys.modules.pop(name, None)
146
+ context._fake_markitdown_installed = False
147
+ context._fake_markitdown_original_modules = {}
148
+ if getattr(context, "_fake_markitdown_unavailable_installed", False):
149
+ original_modules = getattr(context, "_fake_markitdown_unavailable_original_modules", {})
150
+ for name in [
151
+ "markitdown",
152
+ ]:
153
+ if name in original_modules:
154
+ sys.modules[name] = original_modules[name]
155
+ else:
156
+ sys.modules.pop(name, None)
157
+ context._fake_markitdown_unavailable_installed = False
158
+ context._fake_markitdown_unavailable_original_modules = {}
159
+ original_sys_version_info = getattr(context, "_original_sys_version_info", None)
160
+ if original_sys_version_info is not None:
161
+ sys.version_info = original_sys_version_info
162
+ context._original_sys_version_info = None
137
163
  if hasattr(context, "_tmp"):
138
164
  context._tmp.cleanup()
139
165
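The two MarkItDown cleanup branches in this hunk repeat the same restore-or-remove loop over `sys.modules`. A small helper could express that pattern once. The `restore_modules` name and the demo module name below are hypothetical and do not appear in the package.

```python
import sys
import types
from typing import Dict, Iterable


def restore_modules(names: Iterable[str], originals: Dict[str, types.ModuleType]) -> None:
    """Put back each saved module, or drop the fake one if nothing was saved."""
    for name in names:
        if name in originals:
            sys.modules[name] = originals[name]
        else:
            sys.modules.pop(name, None)


# Usage: install a fake module, then undo it with no saved original.
fake_name = "hypothetical_fake_module"
sys.modules[fake_name] = types.ModuleType(fake_name)
restore_modules([fake_name], originals={})
print(fake_name in sys.modules)  # False
```

With a helper like this, each cleanup branch would reduce to one call plus the two lines that reset the flag and the saved-module dictionary on `context`.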
 
@@ -0,0 +1,55 @@
1
+ Feature: Knowledge base (turnkey workflow)
2
+ A knowledge base is a high-level workflow that hides the plumbing while keeping behavior explicit.
3
+ It should accept a folder, ingest files, build defaults, and allow retrieval with minimal configuration.
4
+
5
+ Scenario: Build a knowledge base from a folder and query it
6
+ Given a folder "notes" exists with text files:
7
+ | filename | contents |
8
+ | note1.txt | The user's name is Tactus Maximus. |
9
+ | note2.txt | Primary button style preference: the user's favorite color is magenta. |
10
+ When I create a knowledge base from folder "notes" only
11
+ And I query the knowledge base for "Primary button style preference"
12
+ Then the knowledge base returns evidence that includes "favorite color is magenta"
13
+
14
+ Scenario: Knowledge base context pack is shaped with a token budget
15
+ Given a folder "notes" exists with text files:
16
+ | filename | contents |
17
+ | note1.txt | one two three |
18
+ | note2.txt | four five six |
19
+ When I create a knowledge base from folder "notes" only
20
+ And I query the knowledge base for "one"
21
+ And I build a context pack from the knowledge base query with token budget 3
22
+ Then the context pack text equals:
23
+ """
24
+ one two three
25
+ """
26
+
27
+ Scenario: Knowledge base context pack defaults to no token budget
28
+ Given a folder "notes" exists with text files:
29
+ | filename | contents |
30
+ | note1.txt | alpha beta |
31
+ When I create a knowledge base from folder "notes" only
32
+ And I query the knowledge base for "alpha"
33
+ And I build a context pack from the knowledge base query without a token budget
34
+ Then the context pack text equals:
35
+ """
36
+ alpha beta
37
+ """
38
+
39
+ Scenario: Knowledge base rejects missing folder
40
+ When I attempt to create a knowledge base from folder "missing"
41
+ Then the knowledge base error includes "does not exist"
42
+
43
+ Scenario: Knowledge base rejects non-folder path
44
+ Given a file "not-a-folder.txt" exists with contents "hello"
45
+ When I attempt to create a knowledge base from folder "not-a-folder.txt"
46
+ Then the knowledge base error includes "not a directory"
47
+
48
+ Scenario: Knowledge base can use an explicit corpus root
49
+ Given a folder "notes" exists with text files:
50
+ | filename | contents |
51
+ | note1.txt | alpha |
52
+ And a folder "kb-root" exists
53
+ When I create a knowledge base from folder "notes" using corpus root "kb-root"
54
+ And I query the knowledge base for "alpha"
55
+ Then the knowledge base returns evidence that includes "alpha"
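The scenarios above pin down two observable behaviors: error messages for a bad folder path, and context pack text under an optional token budget. The sketch below reproduces those outcomes with standard-library code only. The `validate_folder` and `fit_to_token_budget` functions are hypothetical, the exception types are a guess, and whitespace-separated words stand in for tokens. The feature file does not say how Biblicus counts tokens or which exceptions it raises.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def validate_folder(path: Path) -> Path:
    """Reject paths that are missing or that are not directories."""
    if not path.exists():
        raise FileNotFoundError(f"Folder {path} does not exist")
    if not path.is_dir():
        raise NotADirectoryError(f"Path {path} is not a directory")
    return path


def fit_to_token_budget(blocks: List[str], max_tokens: Optional[int]) -> str:
    """Drop trailing blocks until the joined text fits; None means no limit."""
    kept = list(blocks)
    if max_tokens is not None:
        # Assumption: one whitespace-separated word counts as one token.
        while kept and len("\n\n".join(kept).split()) > max_tokens:
            kept.pop()
    return "\n\n".join(kept)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "not-a-folder.txt").write_text("hello", encoding="utf-8")
    for candidate in (root / "missing", root / "not-a-folder.txt"):
        try:
            validate_folder(candidate)
        except (FileNotFoundError, NotADirectoryError) as error:
            print(type(error).__name__)

print(fit_to_token_budget(["one two three", "four five six"], max_tokens=3))
```

In the token budget scenario the query "one" may match only `note1.txt`, in which case no trimming would be needed at all. The two-block call here simply shows what trimming would do if both notes were returned.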