biblicus 0.14.0__tar.gz → 0.15.1__tar.gz

Files changed (341)
  1. {biblicus-0.14.0/src/biblicus.egg-info → biblicus-0.15.1}/PKG-INFO +98 -28
  2. {biblicus-0.14.0 → biblicus-0.15.1}/README.md +90 -27
  3. biblicus-0.15.1/docs/ANALYSIS.md +143 -0
  4. biblicus-0.15.1/docs/ARCHITECTURE.md +46 -0
  5. biblicus-0.15.1/docs/ARCHITECTURE_DETAIL.md +267 -0
  6. {biblicus-0.14.0 → biblicus-0.15.1}/docs/BACKENDS.md +24 -0
  7. {biblicus-0.14.0 → biblicus-0.15.1}/docs/CONTEXT_PACK.md +58 -0
  8. {biblicus-0.14.0 → biblicus-0.15.1}/docs/CORPUS.md +49 -10
  9. {biblicus-0.14.0 → biblicus-0.15.1}/docs/CORPUS_DESIGN.md +18 -5
  10. {biblicus-0.14.0 → biblicus-0.15.1}/docs/DEMOS.md +75 -49
  11. {biblicus-0.14.0 → biblicus-0.15.1}/docs/EXTRACTION.md +46 -11
  12. {biblicus-0.14.0 → biblicus-0.15.1}/docs/EXTRACTION_EVALUATION.md +33 -3
  13. {biblicus-0.14.0 → biblicus-0.15.1}/docs/FEATURE_INDEX.md +145 -0
  14. {biblicus-0.14.0 → biblicus-0.15.1}/docs/KNOWLEDGE_BASE.md +19 -0
  15. biblicus-0.15.1/docs/MARKOV_ANALYSIS.md +262 -0
  16. {biblicus-0.14.0 → biblicus-0.15.1}/docs/PROFILING.md +65 -1
  17. biblicus-0.15.1/docs/PR_FAQ_TEXT_ANNOTATE.md +118 -0
  18. {biblicus-0.14.0 → biblicus-0.15.1}/docs/RETRIEVAL.md +33 -7
  19. {biblicus-0.14.0 → biblicus-0.15.1}/docs/RETRIEVAL_EVALUATION.md +44 -7
  20. {biblicus-0.14.0 → biblicus-0.15.1}/docs/RETRIEVAL_QUALITY.md +9 -3
  21. {biblicus-0.14.0 → biblicus-0.15.1}/docs/ROADMAP.md +42 -14
  22. {biblicus-0.14.0 → biblicus-0.15.1}/docs/STT.md +4 -4
  23. {biblicus-0.14.0 → biblicus-0.15.1}/docs/TESTING.md +15 -4
  24. biblicus-0.15.1/docs/TEXT_ANNOTATE.md +119 -0
  25. biblicus-0.15.1/docs/TEXT_EXTRACT.md +671 -0
  26. biblicus-0.15.1/docs/TEXT_LINK.md +124 -0
  27. biblicus-0.15.1/docs/TEXT_REDACT.md +170 -0
  28. biblicus-0.15.1/docs/TEXT_SLICE.md +319 -0
  29. biblicus-0.15.1/docs/TEXT_UTILITIES.md +137 -0
  30. {biblicus-0.14.0 → biblicus-0.15.1}/docs/TOPIC_MODELING.md +78 -5
  31. {biblicus-0.14.0 → biblicus-0.15.1}/docs/USER_CONFIGURATION.md +11 -0
  32. biblicus-0.15.1/docs/USE_CASES.md +37 -0
  33. biblicus-0.15.1/docs/UTILITIES.md +23 -0
  34. {biblicus-0.14.0 → biblicus-0.15.1}/docs/backends/index.md +25 -0
  35. {biblicus-0.14.0 → biblicus-0.15.1}/docs/backends/vector.md +2 -2
  36. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/index.md +12 -1
  37. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/ocr/index.md +8 -0
  38. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/pipeline-utilities/index.md +11 -0
  39. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/speech-to-text/index.md +8 -0
  40. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/text-document/index.md +11 -0
  41. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/vlm-document/index.md +8 -0
  42. biblicus-0.15.1/docs/index.rst +222 -0
  43. biblicus-0.15.1/docs/use_cases/notes_to_context_pack.md +48 -0
  44. biblicus-0.15.1/docs/use_cases/sequence_markov.md +82 -0
  45. biblicus-0.15.1/docs/use_cases/text_folder_search.md +39 -0
  46. biblicus-0.15.1/docs/use_cases/text_redact.md +50 -0
  47. biblicus-0.15.1/features/ai_llm.feature +25 -0
  48. biblicus-0.15.1/features/ai_models.feature +74 -0
  49. {biblicus-0.14.0 → biblicus-0.15.1}/features/analysis_schema.feature +1 -1
  50. biblicus-0.15.1/features/embeddings.feature +39 -0
  51. {biblicus-0.14.0 → biblicus-0.15.1}/features/environment.py +61 -0
  52. biblicus-0.15.1/features/integration_text_annotate.feature +22 -0
  53. biblicus-0.15.1/features/integration_text_extract.feature +69 -0
  54. biblicus-0.15.1/features/integration_text_link.feature +25 -0
  55. biblicus-0.15.1/features/integration_text_redact.feature +31 -0
  56. biblicus-0.15.1/features/integration_text_slice.feature +27 -0
  57. biblicus-0.15.1/features/integration_use_cases.feature +10 -0
  58. biblicus-0.15.1/features/integration_use_cases_sequence_markov.feature +15 -0
  59. biblicus-0.15.1/features/markov_analysis.feature +36 -0
  60. biblicus-0.15.1/features/markov_analysis_categorical.feature +42 -0
  61. biblicus-0.15.1/features/markov_analysis_llm.feature +65 -0
  62. biblicus-0.15.1/features/markov_analysis_topic_modeling.feature +40 -0
  63. biblicus-0.15.1/features/markov_analysis_variants.feature +559 -0
  64. biblicus-0.15.1/features/markov_internal_branches.feature +297 -0
  65. biblicus-0.15.1/features/markov_schema.feature +161 -0
  66. biblicus-0.15.1/features/markov_start_end_labels.feature +10 -0
  67. biblicus-0.15.1/features/profiling_config_overrides.feature +16 -0
  68. biblicus-0.15.1/features/recipe_cascading.feature +63 -0
  69. biblicus-0.15.1/features/recipe_utilities.feature +77 -0
  70. biblicus-0.15.1/features/steps/ai_llm_steps.py +44 -0
  71. biblicus-0.15.1/features/steps/ai_models_steps.py +181 -0
  72. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/analysis_steps.py +8 -6
  73. biblicus-0.15.1/features/steps/embeddings_steps.py +122 -0
  74. biblicus-0.15.1/features/steps/markov_internal_steps.py +1933 -0
  75. biblicus-0.15.1/features/steps/markov_schema_steps.py +729 -0
  76. biblicus-0.15.1/features/steps/markov_start_end_steps.py +38 -0
  77. biblicus-0.15.1/features/steps/markov_steps.py +451 -0
  78. biblicus-0.15.1/features/steps/openai_steps.py +715 -0
  79. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/profiling_steps.py +74 -0
  80. biblicus-0.15.1/features/steps/recipe_steps.py +96 -0
  81. biblicus-0.15.1/features/steps/text_annotate_steps.py +477 -0
  82. biblicus-0.15.1/features/steps/text_extract_steps.py +480 -0
  83. biblicus-0.15.1/features/steps/text_internal_steps.py +64 -0
  84. biblicus-0.15.1/features/steps/text_link_internal_steps.py +379 -0
  85. biblicus-0.15.1/features/steps/text_link_steps.py +494 -0
  86. biblicus-0.15.1/features/steps/text_mock_steps.py +199 -0
  87. biblicus-0.15.1/features/steps/text_redact_steps.py +509 -0
  88. biblicus-0.15.1/features/steps/text_slice_steps.py +433 -0
  89. biblicus-0.15.1/features/steps/text_tool_loop_steps.py +36 -0
  90. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/topic_modeling_steps.py +45 -0
  91. biblicus-0.15.1/features/steps/use_cases_steps.py +118 -0
  92. biblicus-0.15.1/features/text_annotate.feature +227 -0
  93. biblicus-0.15.1/features/text_extract.feature +226 -0
  94. biblicus-0.15.1/features/text_internal_branches.feature +52 -0
  95. biblicus-0.15.1/features/text_link.feature +146 -0
  96. biblicus-0.15.1/features/text_link_internal_branches.feature +106 -0
  97. biblicus-0.15.1/features/text_mock.feature +86 -0
  98. biblicus-0.15.1/features/text_redact.feature +135 -0
  99. biblicus-0.15.1/features/text_slice.feature +135 -0
  100. biblicus-0.15.1/features/text_utilities.feature +51 -0
  101. {biblicus-0.14.0 → biblicus-0.15.1}/features/topic_modeling.feature +3 -3
  102. biblicus-0.15.1/features/use_cases.feature +21 -0
  103. {biblicus-0.14.0 → biblicus-0.15.1}/pyproject.toml +11 -1
  104. biblicus-0.15.1/scripts/markov_analysis_demo.py +279 -0
  105. biblicus-0.15.1/scripts/markov_cached_segments_demo.py +603 -0
  106. biblicus-0.15.1/scripts/markov_run_report.py +243 -0
  107. biblicus-0.15.1/scripts/use_cases/notes_to_context_pack_demo.py +119 -0
  108. biblicus-0.15.1/scripts/use_cases/sequence_markov_demo.py +189 -0
  109. biblicus-0.15.1/scripts/use_cases/text_folder_search_demo.py +132 -0
  110. biblicus-0.15.1/scripts/use_cases/text_redact_demo.py +116 -0
  111. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/__init__.py +1 -1
  112. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/_vendor/dotyaml/__init__.py +2 -2
  113. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/_vendor/dotyaml/loader.py +40 -1
  114. biblicus-0.15.1/src/biblicus/ai/__init__.py +39 -0
  115. biblicus-0.15.1/src/biblicus/ai/embeddings.py +114 -0
  116. biblicus-0.15.1/src/biblicus/ai/llm.py +138 -0
  117. biblicus-0.15.1/src/biblicus/ai/models.py +226 -0
  118. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/analysis/__init__.py +5 -2
  119. biblicus-0.15.1/src/biblicus/analysis/markov.py +1624 -0
  120. biblicus-0.15.1/src/biblicus/analysis/models.py +1530 -0
  121. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/analysis/topic_modeling.py +98 -19
  122. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/backends/sqlite_full_text_search.py +4 -2
  123. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/cli.py +118 -23
  124. biblicus-0.15.1/src/biblicus/recipes.py +136 -0
  125. biblicus-0.15.1/src/biblicus/text/__init__.py +43 -0
  126. biblicus-0.15.1/src/biblicus/text/annotate.py +222 -0
  127. biblicus-0.15.1/src/biblicus/text/extract.py +210 -0
  128. biblicus-0.15.1/src/biblicus/text/link.py +519 -0
  129. biblicus-0.15.1/src/biblicus/text/markup.py +200 -0
  130. biblicus-0.15.1/src/biblicus/text/models.py +319 -0
  131. biblicus-0.15.1/src/biblicus/text/prompts.py +113 -0
  132. biblicus-0.15.1/src/biblicus/text/redact.py +229 -0
  133. biblicus-0.15.1/src/biblicus/text/slice.py +155 -0
  134. biblicus-0.15.1/src/biblicus/text/tool_loop.py +334 -0
  135. {biblicus-0.14.0 → biblicus-0.15.1/src/biblicus.egg-info}/PKG-INFO +98 -28
  136. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus.egg-info/SOURCES.txt +88 -2
  137. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus.egg-info/requires.txt +9 -0
  138. biblicus-0.14.0/docs/ANALYSIS.md +0 -47
  139. biblicus-0.14.0/docs/ARCHITECTURE.md +0 -180
  140. biblicus-0.14.0/docs/index.rst +0 -33
  141. biblicus-0.14.0/features/steps/openai_steps.py +0 -314
  142. biblicus-0.14.0/src/biblicus/analysis/llm.py +0 -106
  143. biblicus-0.14.0/src/biblicus/analysis/models.py +0 -777
  144. {biblicus-0.14.0 → biblicus-0.15.1}/LICENSE +0 -0
  145. {biblicus-0.14.0 → biblicus-0.15.1}/MANIFEST.in +0 -0
  146. {biblicus-0.14.0 → biblicus-0.15.1}/THIRD_PARTY_NOTICES.md +0 -0
  147. {biblicus-0.14.0 → biblicus-0.15.1}/datasets/extraction_lab/labels.json +0 -0
  148. {biblicus-0.14.0 → biblicus-0.15.1}/datasets/retrieval_lab/labels.json +0 -0
  149. {biblicus-0.14.0 → biblicus-0.15.1}/datasets/wikipedia_mini.json +0 -0
  150. {biblicus-0.14.0 → biblicus-0.15.1}/docs/api.rst +0 -0
  151. {biblicus-0.14.0 → biblicus-0.15.1}/docs/backends/scan.md +0 -0
  152. {biblicus-0.14.0 → biblicus-0.15.1}/docs/backends/sqlite-full-text-search.md +0 -0
  153. {biblicus-0.14.0 → biblicus-0.15.1}/docs/conf.py +0 -0
  154. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/ocr/paddleocr-vl.md +0 -0
  155. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/ocr/rapidocr.md +0 -0
  156. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/pipeline-utilities/pipeline.md +0 -0
  157. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/pipeline-utilities/select-longest.md +0 -0
  158. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/pipeline-utilities/select-override.md +0 -0
  159. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/pipeline-utilities/select-smart-override.md +0 -0
  160. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/pipeline-utilities/select-text.md +0 -0
  161. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/speech-to-text/deepgram.md +0 -0
  162. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/speech-to-text/openai.md +0 -0
  163. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/text-document/markitdown.md +0 -0
  164. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/text-document/metadata.md +0 -0
  165. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/text-document/pass-through.md +0 -0
  166. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/text-document/pdf.md +0 -0
  167. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/text-document/unstructured.md +0 -0
  168. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/vlm-document/docling-granite.md +0 -0
  169. {biblicus-0.14.0 → biblicus-0.15.1}/docs/extractors/vlm-document/docling-smol.md +0 -0
  170. {biblicus-0.14.0 → biblicus-0.15.1}/features/backend_validation.feature +0 -0
  171. {biblicus-0.14.0 → biblicus-0.15.1}/features/biblicus_corpus.feature +0 -0
  172. {biblicus-0.14.0 → biblicus-0.15.1}/features/cli_entrypoint.feature +0 -0
  173. {biblicus-0.14.0 → biblicus-0.15.1}/features/cli_parsing.feature +0 -0
  174. {biblicus-0.14.0 → biblicus-0.15.1}/features/cli_step_spec_parsing.feature +0 -0
  175. {biblicus-0.14.0 → biblicus-0.15.1}/features/content_sniffing.feature +0 -0
  176. {biblicus-0.14.0 → biblicus-0.15.1}/features/context_pack.feature +0 -0
  177. {biblicus-0.14.0 → biblicus-0.15.1}/features/context_pack_cli.feature +0 -0
  178. {biblicus-0.14.0 → biblicus-0.15.1}/features/context_pack_policies.feature +0 -0
  179. {biblicus-0.14.0 → biblicus-0.15.1}/features/corpus_edge_cases.feature +0 -0
  180. {biblicus-0.14.0 → biblicus-0.15.1}/features/corpus_identity.feature +0 -0
  181. {biblicus-0.14.0 → biblicus-0.15.1}/features/corpus_purge.feature +0 -0
  182. {biblicus-0.14.0 → biblicus-0.15.1}/features/crawl.feature +0 -0
  183. {biblicus-0.14.0 → biblicus-0.15.1}/features/docling_granite_extractor.feature +0 -0
  184. {biblicus-0.14.0 → biblicus-0.15.1}/features/docling_smol_extractor.feature +0 -0
  185. {biblicus-0.14.0 → biblicus-0.15.1}/features/error_cases.feature +0 -0
  186. {biblicus-0.14.0 → biblicus-0.15.1}/features/evaluation.feature +0 -0
  187. {biblicus-0.14.0 → biblicus-0.15.1}/features/evidence_processing.feature +0 -0
  188. {biblicus-0.14.0 → biblicus-0.15.1}/features/extraction_error_handling.feature +0 -0
  189. {biblicus-0.14.0 → biblicus-0.15.1}/features/extraction_evaluation.feature +0 -0
  190. {biblicus-0.14.0 → biblicus-0.15.1}/features/extraction_evaluation_lab.feature +0 -0
  191. {biblicus-0.14.0 → biblicus-0.15.1}/features/extraction_run_lifecycle.feature +0 -0
  192. {biblicus-0.14.0 → biblicus-0.15.1}/features/extraction_selection.feature +0 -0
  193. {biblicus-0.14.0 → biblicus-0.15.1}/features/extraction_selection_longest.feature +0 -0
  194. {biblicus-0.14.0 → biblicus-0.15.1}/features/extractor_pipeline.feature +0 -0
  195. {biblicus-0.14.0 → biblicus-0.15.1}/features/extractor_validation.feature +0 -0
  196. {biblicus-0.14.0 → biblicus-0.15.1}/features/frontmatter.feature +0 -0
  197. {biblicus-0.14.0 → biblicus-0.15.1}/features/hook_config_validation.feature +0 -0
  198. {biblicus-0.14.0 → biblicus-0.15.1}/features/hook_error_handling.feature +0 -0
  199. {biblicus-0.14.0 → biblicus-0.15.1}/features/import_tree.feature +0 -0
  200. {biblicus-0.14.0 → biblicus-0.15.1}/features/inference_backend.feature +0 -0
  201. {biblicus-0.14.0 → biblicus-0.15.1}/features/ingest_sources.feature +0 -0
  202. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_audio_samples.feature +0 -0
  203. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_image_samples.feature +0 -0
  204. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_mixed_corpus.feature +0 -0
  205. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_mixed_extraction.feature +0 -0
  206. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_ocr_image_extraction.feature +0 -0
  207. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_pdf_retrieval.feature +0 -0
  208. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_pdf_samples.feature +0 -0
  209. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_unstructured_extraction.feature +0 -0
  210. {biblicus-0.14.0 → biblicus-0.15.1}/features/integration_wikipedia.feature +0 -0
  211. {biblicus-0.14.0 → biblicus-0.15.1}/features/knowledge_base.feature +0 -0
  212. {biblicus-0.14.0 → biblicus-0.15.1}/features/lifecycle_hooks.feature +0 -0
  213. {biblicus-0.14.0 → biblicus-0.15.1}/features/markitdown_extractor.feature +0 -0
  214. {biblicus-0.14.0 → biblicus-0.15.1}/features/model_validation.feature +0 -0
  215. {biblicus-0.14.0 → biblicus-0.15.1}/features/ocr_extractor.feature +0 -0
  216. {biblicus-0.14.0 → biblicus-0.15.1}/features/paddleocr_vl_extractor.feature +0 -0
  217. {biblicus-0.14.0 → biblicus-0.15.1}/features/paddleocr_vl_parse_api_response.feature +0 -0
  218. {biblicus-0.14.0 → biblicus-0.15.1}/features/pdf_text_extraction.feature +0 -0
  219. {biblicus-0.14.0 → biblicus-0.15.1}/features/profiling.feature +0 -0
  220. {biblicus-0.14.0 → biblicus-0.15.1}/features/python_api.feature +0 -0
  221. {biblicus-0.14.0 → biblicus-0.15.1}/features/python_hook_logging.feature +0 -0
  222. {biblicus-0.14.0 → biblicus-0.15.1}/features/query_processing.feature +0 -0
  223. {biblicus-0.14.0 → biblicus-0.15.1}/features/recipe_file_extraction.feature +0 -0
  224. {biblicus-0.14.0 → biblicus-0.15.1}/features/retrieval_budget.feature +0 -0
  225. {biblicus-0.14.0 → biblicus-0.15.1}/features/retrieval_evaluation_lab.feature +0 -0
  226. {biblicus-0.14.0 → biblicus-0.15.1}/features/retrieval_quality.feature +0 -0
  227. {biblicus-0.14.0 → biblicus-0.15.1}/features/retrieval_scan.feature +0 -0
  228. {biblicus-0.14.0 → biblicus-0.15.1}/features/retrieval_sqlite_full_text_search.feature +0 -0
  229. {biblicus-0.14.0 → biblicus-0.15.1}/features/retrieval_uses_extraction_run.feature +0 -0
  230. {biblicus-0.14.0 → biblicus-0.15.1}/features/retrieval_utilities.feature +0 -0
  231. {biblicus-0.14.0 → biblicus-0.15.1}/features/select_override.feature +0 -0
  232. {biblicus-0.14.0 → biblicus-0.15.1}/features/smart_override_selection.feature +0 -0
  233. {biblicus-0.14.0 → biblicus-0.15.1}/features/source_loading.feature +0 -0
  234. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/backend_steps.py +0 -0
  235. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/cli_parsing_steps.py +0 -0
  236. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/cli_steps.py +0 -0
  237. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/context_pack_steps.py +0 -0
  238. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/crawl_steps.py +0 -0
  239. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/deepgram_steps.py +0 -0
  240. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/docling_steps.py +0 -0
  241. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/evidence_processing_steps.py +0 -0
  242. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/extraction_evaluation_lab_steps.py +0 -0
  243. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/extraction_evaluation_steps.py +0 -0
  244. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/extraction_run_lifecycle_steps.py +0 -0
  245. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/extraction_steps.py +0 -0
  246. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/extractor_steps.py +0 -0
  247. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/frontmatter_steps.py +0 -0
  248. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/inference_steps.py +0 -0
  249. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/knowledge_base_steps.py +0 -0
  250. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/markitdown_steps.py +0 -0
  251. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/model_steps.py +0 -0
  252. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/paddleocr_mock_steps.py +0 -0
  253. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/paddleocr_vl_steps.py +0 -0
  254. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/paddleocr_vl_unit_steps.py +0 -0
  255. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/pdf_steps.py +0 -0
  256. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/python_api_steps.py +0 -0
  257. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/rapidocr_steps.py +0 -0
  258. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/requests_mock_steps.py +0 -0
  259. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/retrieval_evaluation_lab_steps.py +0 -0
  260. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/retrieval_quality_steps.py +0 -0
  261. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/retrieval_steps.py +0 -0
  262. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/stt_deepgram_steps.py +0 -0
  263. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/stt_steps.py +0 -0
  264. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/unstructured_steps.py +0 -0
  265. {biblicus-0.14.0 → biblicus-0.15.1}/features/steps/user_config_steps.py +0 -0
  266. {biblicus-0.14.0 → biblicus-0.15.1}/features/streaming_ingest.feature +0 -0
  267. {biblicus-0.14.0 → biblicus-0.15.1}/features/stt_deepgram_extractor.feature +0 -0
  268. {biblicus-0.14.0 → biblicus-0.15.1}/features/stt_extractor.feature +0 -0
  269. {biblicus-0.14.0 → biblicus-0.15.1}/features/text_extraction_runs.feature +0 -0
  270. {biblicus-0.14.0 → biblicus-0.15.1}/features/token_budget.feature +0 -0
  271. {biblicus-0.14.0 → biblicus-0.15.1}/features/unstructured_extractor.feature +0 -0
  272. {biblicus-0.14.0 → biblicus-0.15.1}/features/user_config.feature +0 -0
  273. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/download_ag_news.py +0 -0
  274. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/download_audio_samples.py +0 -0
  275. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/download_image_samples.py +0 -0
  276. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/download_mixed_samples.py +0 -0
  277. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/download_pdf_samples.py +0 -0
  278. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/download_wikipedia.py +0 -0
  279. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/extraction_evaluation_demo.py +0 -0
  280. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/extraction_evaluation_lab.py +0 -0
  281. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/profiling_demo.py +0 -0
  282. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/readme_end_to_end_demo.py +0 -0
  283. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/retrieval_evaluation_lab.py +0 -0
  284. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/test.py +0 -0
  285. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/topic_modeling_integration.py +0 -0
  286. {biblicus-0.14.0 → biblicus-0.15.1}/scripts/wikipedia_rag_demo.py +0 -0
  287. {biblicus-0.14.0 → biblicus-0.15.1}/setup.cfg +0 -0
  288. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/__main__.py +0 -0
  289. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
  290. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
  291. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/analysis/base.py +0 -0
  292. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/analysis/profiling.py +0 -0
  293. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/analysis/schema.py +0 -0
  294. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/backends/__init__.py +0 -0
  295. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/backends/base.py +0 -0
  296. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/backends/hybrid.py +0 -0
  297. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/backends/scan.py +0 -0
  298. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/backends/vector.py +0 -0
  299. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/constants.py +0 -0
  300. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/context.py +0 -0
  301. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/corpus.py +0 -0
  302. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/crawl.py +0 -0
  303. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/errors.py +0 -0
  304. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/evaluation.py +0 -0
  305. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/evidence_processing.py +0 -0
  306. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extraction.py +0 -0
  307. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extraction_evaluation.py +0 -0
  308. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/__init__.py +0 -0
  309. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/base.py +0 -0
  310. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/deepgram_stt.py +0 -0
  311. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/docling_granite_text.py +0 -0
  312. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/docling_smol_text.py +0 -0
  313. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/markitdown_text.py +0 -0
  314. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/metadata_text.py +0 -0
  315. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/openai_stt.py +0 -0
  316. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/paddleocr_vl_text.py +0 -0
  317. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/pass_through_text.py +0 -0
  318. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/pdf_text.py +0 -0
  319. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/pipeline.py +0 -0
  320. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/rapidocr_text.py +0 -0
  321. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/select_longest_text.py +0 -0
  322. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/select_override.py +0 -0
  323. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/select_smart_override.py +0 -0
  324. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/select_text.py +0 -0
  325. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/extractors/unstructured_text.py +0 -0
  326. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/frontmatter.py +0 -0
  327. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/hook_logging.py +0 -0
  328. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/hook_manager.py +0 -0
  329. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/hooks.py +0 -0
  330. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/ignore.py +0 -0
  331. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/inference.py +0 -0
  332. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/knowledge_base.py +0 -0
  333. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/models.py +0 -0
  334. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/retrieval.py +0 -0
  335. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/sources.py +0 -0
  336. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/time.py +0 -0
  337. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/uris.py +0 -0
  338. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus/user_config.py +0 -0
  339. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus.egg-info/dependency_links.txt +0 -0
  340. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus.egg-info/entry_points.txt +0 -0
  341. {biblicus-0.14.0 → biblicus-0.15.1}/src/biblicus.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: biblicus
- Version: 0.14.0
+ Version: 0.15.1
  Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
  License: MIT
  Requires-Python: >=3.9
@@ -9,6 +9,8 @@ License-File: LICENSE
  Requires-Dist: pydantic>=2.0
  Requires-Dist: PyYAML>=6.0
  Requires-Dist: pypdf>=4.0
+ Requires-Dist: Jinja2>=3.1
+ Requires-Dist: dotyaml>=0.1.3
  Provides-Extra: dev
  Requires-Dist: behave>=1.2.6; extra == "dev"
  Requires-Dist: coverage[toml]>=7.0; extra == "dev"
@@ -18,6 +20,9 @@ Requires-Dist: sphinx_rtd_theme>=2.0; extra == "dev"
  Requires-Dist: ruff>=0.4.0; extra == "dev"
  Requires-Dist: black>=24.0; extra == "dev"
  Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
+ Provides-Extra: dspy
+ Requires-Dist: dspy>=2.5; extra == "dspy"
+ Requires-Dist: litellm>=1.0; extra == "dspy"
  Provides-Extra: openai
  Requires-Dist: openai>=1.0; extra == "openai"
  Provides-Extra: unstructured
@@ -40,6 +45,8 @@ Provides-Extra: docling-mlx
  Requires-Dist: docling[mlx-vlm]>=2.0.0; extra == "docling-mlx"
  Provides-Extra: topic-modeling
  Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
+ Provides-Extra: markov-analysis
+ Requires-Dist: hmmlearn>=0.3.0; extra == "markov-analysis"
  Provides-Extra: datasets
  Requires-Dist: datasets>=2.18.0; extra == "datasets"
  Dynamic: license-file
@@ -50,18 +57,33 @@ Dynamic: license-file
  ![Coverage][coverage-badge]
  ![Documentation][documentation-badge]
  
- Make your documents usable by your assistant, then decide later how you will search and retrieve them.
-
+ <p>
+ <img
+ src="docs/_static/Biblicus-logo.png"
+ alt="Biblicus logo"
+ align="right"
+ width="216"
+ />
+ Make your documents usable by your assistant, then decide later how you will search and retrieve them.
+ </p>
  If you are building an assistant in Python, you probably have material you want it to use: notes, documents, web pages, and reference files. A common approach is retrieval augmented generation, where a system retrieves relevant material and uses it as evidence when generating a response.
  
  The first practical problem is not retrieval. It is collection and care. You need a stable place to put raw items, you need a small amount of metadata so you can find them again, and you need a way to evolve your retrieval approach over time without rewriting ingestion.
  
- This library gives you a corpus, which is a normal folder on disk. It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw corpus.
+ Biblicus gives you a normal folder on disk to manage. In Biblicus documentation, that managed folder is called a *corpus* (plural: *corpora*). It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw files.
  
  It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or your own setup. Use it from Python or from the command line interface.
  
  See [retrieval augmented generation overview] for a short introduction to the idea.
  
+ ## Analysis highlights
+
+ - `biblicus analyze markov` learns a directed, weighted state transition graph over segmented text.
+ - YAML recipes support cascading composition plus dotted `--config key=value` overrides.
+ - Text extract splits long texts with an LLM by inserting XML tags in-place for structured spans.
+ - See `docs/MARKOV_ANALYSIS.md` for Markov analysis details and runnable demos.
+ - See `docs/TEXT_EXTRACT.md` for the text extract utility and examples.
+
  ## Start with a knowledge base
  
  If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
@@ -106,7 +128,7 @@ Think in three stages.
  
  If you learn a few project words, the rest of the system becomes predictable.
  
- - Corpus is the folder that holds raw items and their metadata.
+ - Corpus is the managed folder that holds raw items and their metadata.
  - Item is the raw bytes plus optional metadata and source information.
  - Catalog is the rebuildable index of the corpus.
  - Extraction run is a recorded extraction build that produces text artifacts.
@@ -161,28 +183,28 @@ sequenceDiagram
  This repository is a working Python package. Install it into a virtual environment from the repository root.
  
  ```
- python3 -m pip install -e .
+ python -m pip install -e .
  ```
  
  After the first release, you can install it from Python Package Index.
  
  ```
- python3 -m pip install biblicus
+ python -m pip install biblicus
  ```
  
  ### Optional extras
  
  Some extractors are optional so the base install stays small.
  
- - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
- - Advanced optical character recognition with PaddleOCR: `python3 -m pip install "biblicus[paddleocr]"`
- - Document understanding with Docling VLM: `python3 -m pip install "biblicus[docling]"`
- - Document understanding with Docling VLM and MLX acceleration: `python3 -m pip install "biblicus[docling-mlx]"`
- - Speech to text transcription with OpenAI: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
- - Speech to text transcription with Deepgram: `python3 -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
- - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
- - MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
- - Topic modeling analysis with BERTopic: `python3 -m pip install "biblicus[topic-modeling]"`
+ - Optical character recognition for images: `python -m pip install "biblicus[ocr]"`
+ - Advanced optical character recognition with PaddleOCR: `python -m pip install "biblicus[paddleocr]"`
+ - Document understanding with Docling VLM: `python -m pip install "biblicus[docling]"`
+ - Document understanding with Docling VLM and MLX acceleration: `python -m pip install "biblicus[docling-mlx]"`
+ - Speech to text transcription with OpenAI: `python -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+ - Speech to text transcription with Deepgram: `python -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+ - Broad document parsing fallback: `python -m pip install "biblicus[unstructured]"`
+ - MarkItDown document conversion (requires Python 3.10 or higher): `python -m pip install "biblicus[markitdown]"`
207
+ - Topic modeling analysis with BERTopic: `python -m pip install "biblicus[topic-modeling]"`
186
208
 
187
209
  ## Quick start
188
210
 
@@ -200,16 +222,49 @@ biblicus build --corpus corpora/example --backend scan
200
222
  biblicus query --corpus corpora/example --query "note"
201
223
  ```
202
224
 
203
- If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
225
+ ## Web Ingestion
226
+
227
+ Biblicus supports ingesting content directly from the web using two approaches.
228
+
229
+ ### Ingest from URLs
204
230
 
231
+ Ingest individual documents or web pages from URLs. The `ingest` command automatically detects content types including PDF, HTML, Markdown, images, and audio:
232
+
233
+ ```bash
234
+ # Ingest a document from a URL
235
+ biblicus ingest https://example.com/document.pdf --tags "research"
236
+
237
+ # Ingest a web page
238
+ biblicus ingest https://example.com/article.html --tags "article"
239
+
240
+ # Ingest with a corpus path specified
241
+ biblicus ingest --corpus corpora/example https://docs.example.com/guide.md --tags "documentation"
205
242
  ```
206
- biblicus crawl --corpus corpora/example \\
207
- --root-url https://example.com/docs/index.html \\
208
- --allowed-prefix https://example.com/docs/ \\
209
- --max-items 50 \\
210
- --tag crawled
243
+
244
+ ### Crawl Websites
245
+
246
+ Crawl entire website sections with automatic link discovery. The crawler follows links within the allowed prefix and stores discovered content:
247
+
248
+ ```bash
249
+ # Crawl a documentation site
250
+ biblicus crawl \
251
+ --corpus corpora/example \
252
+ --root-url https://docs.example.com/ \
253
+ --allowed-prefix https://docs.example.com/ \
254
+ --max-items 100 \
255
+ --tags "documentation"
256
+
257
+ # Crawl a specific blog category
258
+ biblicus crawl \
259
+ --corpus corpora/example \
260
+ --root-url https://blog.example.com/category/tutorials/ \
261
+ --allowed-prefix https://blog.example.com/category/tutorials/ \
262
+ --max-items 50 \
263
+ --tags "tutorials,blog"
211
264
  ```
212
265
 
266
+ The `--allowed-prefix` parameter restricts the crawler to only follow links that start with the specified URL prefix, preventing it from crawling outside the intended scope. The crawler respects `.biblicusignore` rules and stores items under `raw/imports/crawl/` in your corpus.
267
+
213
268
  ## End-to-end example: lower-level control
214
269
 
215
270
  The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
@@ -490,7 +545,7 @@ Three backends are included.
490
545
 
491
546
  - `scan` is a minimal baseline that scans raw items directly.
492
547
  - `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
493
- - `vector` is a deterministic term-frequency vector baseline with cosine similarity scoring.
548
+ - `tf-vector` is a deterministic term-frequency vector baseline with cosine similarity scoring.
494
549
 
495
550
  For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
496
551
 
@@ -540,6 +595,21 @@ For detailed documentation on all extractors, see the [Extractor Reference][extr
540
595
  For extraction evaluation workflows, dataset formats, and report interpretation, see
541
596
  `docs/EXTRACTION_EVALUATION.md`.
542
597
 
598
+ ## Text extract utility
599
+
600
+ Text extract is a reusable analysis utility that lets a model insert XML tags into a long text without re-emitting the
601
+ entire document. It returns structured spans and the marked-up text, and it is used as a segmentation option in Markov
602
+ analysis.
603
+
604
+ See `docs/TEXT_EXTRACT.md` for the utility API and examples, and `docs/MARKOV_ANALYSIS.md` for the Markov integration.
605
+
606
+ ## Text slice utility
607
+
608
+ Text slice is a reusable analysis utility that lets a model insert `<slice/>` markers into a long text without
609
+ re-emitting the entire document. It returns ordered slices and the marked-up text for auditing and reuse.
610
+
611
+ See `docs/TEXT_SLICE.md` for the utility API and examples.
612
+
543
613
  ## Topic modeling analysis
544
614
 
545
615
  Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling
@@ -594,7 +664,7 @@ AG News integration runs require `biblicus[datasets]` in addition to `biblicus[t
594
664
  For a repeatable, real-world integration run that downloads AG News and executes topic modeling, use:
595
665
 
596
666
  ```
597
- python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
667
+ python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
598
668
  ```
599
669
 
600
670
  See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
@@ -608,13 +678,13 @@ Use `scripts/download_pdf_samples.py` to download a small Portable Document Form
608
678
  ## Tests and coverage
609
679
 
610
680
  ```
611
- python3 scripts/test.py
681
+ python scripts/test.py
612
682
  ```
613
683
 
614
684
  To include integration scenarios that download public test data at runtime, run this command.
615
685
 
616
686
  ```
617
- python3 scripts/test.py --integration
687
+ python scripts/test.py --integration
618
688
  ```
619
689
 
620
690
  ## Releases
@@ -632,13 +702,13 @@ Reference documentation is generated from Sphinx style docstrings.
632
702
  Install development dependencies:
633
703
 
634
704
  ```
635
- python3 -m pip install -e ".[dev]"
705
+ python -m pip install -e ".[dev]"
636
706
  ```
637
707
 
638
708
  Build the documentation:
639
709
 
640
710
  ```
641
- python3 -m sphinx -b html docs docs/_build/html
711
+ python -m sphinx -b html docs docs/_build/html
642
712
  ```
643
713
 
644
714
  ## License
@@ -4,18 +4,33 @@
4
4
  ![Coverage][coverage-badge]
5
5
  ![Documentation][documentation-badge]
6
6
 
7
- Make your documents usable by your assistant, then decide later how you will search and retrieve them.
8
-
7
+ <p>
8
+ <img
9
+ src="docs/_static/Biblicus-logo.png"
10
+ alt="Biblicus logo"
11
+ align="right"
12
+ width="216"
13
+ />
14
+ Make your documents usable by your assistant, then decide later how you will search and retrieve them.
15
+ </p>
9
16
  If you are building an assistant in Python, you probably have material you want it to use: notes, documents, web pages, and reference files. A common approach is retrieval augmented generation, where a system retrieves relevant material and uses it as evidence when generating a response.
10
17
 
11
18
  The first practical problem is not retrieval. It is collection and care. You need a stable place to put raw items, you need a small amount of metadata so you can find them again, and you need a way to evolve your retrieval approach over time without rewriting ingestion.
12
19
 
13
- This library gives you a corpus, which is a normal folder on disk. It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw corpus.
20
+ Biblicus gives you a normal folder on disk to manage. In Biblicus documentation, that managed folder is called a *corpus* (plural: *corpora*). It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw files.
14
21
 
15
22
  It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or your own setup. Use it from Python or from the command line interface.
16
23
 
17
24
  See [retrieval augmented generation overview] for a short introduction to the idea.
18
25
 
26
+ ## Analysis highlights
27
+
28
+ - `biblicus analyze markov` learns a directed, weighted state transition graph over segmented text.
29
+ - YAML recipes support cascading composition plus dotted `--config key=value` overrides.
30
+ - Text extract splits long texts with an LLM by inserting XML tags in-place for structured spans.
31
+ - See `docs/MARKOV_ANALYSIS.md` for Markov analysis details and runnable demos.
32
+ - See `docs/TEXT_EXTRACT.md` for the text extract utility and examples.
33
+
19
34
  ## Start with a knowledge base
20
35
 
21
36
  If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
@@ -60,7 +75,7 @@ Think in three stages.
60
75
 
61
76
  If you learn a few project words, the rest of the system becomes predictable.
62
77
 
63
- - Corpus is the folder that holds raw items and their metadata.
78
+ - Corpus is the managed folder that holds raw items and their metadata.
64
79
  - Item is the raw bytes plus optional metadata and source information.
65
80
  - Catalog is the rebuildable index of the corpus.
66
81
  - Extraction run is a recorded extraction build that produces text artifacts.
@@ -115,28 +130,28 @@ sequenceDiagram
115
130
  This repository is a working Python package. Install it into a virtual environment from the repository root.
116
131
 
117
132
  ```
118
- python3 -m pip install -e .
133
+ python -m pip install -e .
119
134
  ```
120
135
 
121
136
  After the first release, you can install it from the Python Package Index.
122
137
 
123
138
  ```
124
- python3 -m pip install biblicus
139
+ python -m pip install biblicus
125
140
  ```
126
141
 
127
142
  ### Optional extras
128
143
 
129
144
  Some extractors are optional so the base install stays small.
130
145
 
131
- - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
132
- - Advanced optical character recognition with PaddleOCR: `python3 -m pip install "biblicus[paddleocr]"`
133
- - Document understanding with Docling VLM: `python3 -m pip install "biblicus[docling]"`
134
- - Document understanding with Docling VLM and MLX acceleration: `python3 -m pip install "biblicus[docling-mlx]"`
135
- - Speech to text transcription with OpenAI: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
136
- - Speech to text transcription with Deepgram: `python3 -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
137
- - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
138
- - MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
139
- - Topic modeling analysis with BERTopic: `python3 -m pip install "biblicus[topic-modeling]"`
146
+ - Optical character recognition for images: `python -m pip install "biblicus[ocr]"`
147
+ - Advanced optical character recognition with PaddleOCR: `python -m pip install "biblicus[paddleocr]"`
148
+ - Document understanding with Docling VLM: `python -m pip install "biblicus[docling]"`
149
+ - Document understanding with Docling VLM and MLX acceleration: `python -m pip install "biblicus[docling-mlx]"`
150
+ - Speech to text transcription with OpenAI: `python -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
151
+ - Speech to text transcription with Deepgram: `python -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
152
+ - Broad document parsing fallback: `python -m pip install "biblicus[unstructured]"`
153
+ - MarkItDown document conversion (requires Python 3.10 or higher): `python -m pip install "biblicus[markitdown]"`
154
+ - Topic modeling analysis with BERTopic: `python -m pip install "biblicus[topic-modeling]"`
140
155
 
141
156
  ## Quick start
142
157
 
@@ -154,16 +169,49 @@ biblicus build --corpus corpora/example --backend scan
154
169
  biblicus query --corpus corpora/example --query "note"
155
170
  ```
156
171
 
157
- If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
172
+ ## Web Ingestion
173
+
174
+ Biblicus supports ingesting content directly from the web using two approaches.
175
+
176
+ ### Ingest from URLs
158
177
 
178
+ Ingest individual documents or web pages from URLs. The `ingest` command automatically detects content types including PDF, HTML, Markdown, images, and audio:
179
+
180
+ ```bash
181
+ # Ingest a document from a URL
182
+ biblicus ingest https://example.com/document.pdf --tags "research"
183
+
184
+ # Ingest a web page
185
+ biblicus ingest https://example.com/article.html --tags "article"
186
+
187
+ # Ingest with a corpus path specified
188
+ biblicus ingest --corpus corpora/example https://docs.example.com/guide.md --tags "documentation"
159
189
  ```
160
- biblicus crawl --corpus corpora/example \\
161
- --root-url https://example.com/docs/index.html \\
162
- --allowed-prefix https://example.com/docs/ \\
163
- --max-items 50 \\
164
- --tag crawled
190
+
191
+ ### Crawl Websites
192
+
193
+ Crawl entire website sections with automatic link discovery. The crawler follows links within the allowed prefix and stores discovered content:
194
+
195
+ ```bash
196
+ # Crawl a documentation site
197
+ biblicus crawl \
198
+ --corpus corpora/example \
199
+ --root-url https://docs.example.com/ \
200
+ --allowed-prefix https://docs.example.com/ \
201
+ --max-items 100 \
202
+ --tags "documentation"
203
+
204
+ # Crawl a specific blog category
205
+ biblicus crawl \
206
+ --corpus corpora/example \
207
+ --root-url https://blog.example.com/category/tutorials/ \
208
+ --allowed-prefix https://blog.example.com/category/tutorials/ \
209
+ --max-items 50 \
210
+ --tags "tutorials,blog"
165
211
  ```
166
212
 
213
+ The `--allowed-prefix` parameter restricts the crawler to only follow links that start with the specified URL prefix, preventing it from crawling outside the intended scope. The crawler respects `.biblicusignore` rules and stores items under `raw/imports/crawl/` in your corpus.
214
+
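The prefix rule described above can be illustrated with a small sketch. This is not the Biblicus crawler; `links_in_scope` is a hypothetical helper that only shows how a "starts with the allowed prefix" check might be applied to links discovered on a page, using the standard library to resolve relative links first.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def links_in_scope(page_url: str, hrefs: Iterable[str], allowed_prefix: str) -> List[str]:
    """Resolve hrefs against the page URL and keep those under the allowed prefix.

    Hypothetical helper for illustration; the real crawler may normalize
    URLs differently.
    """
    in_scope: List[str] = []
    seen = set()
    for href in hrefs:
        # Make the link absolute and drop any #fragment so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(allowed_prefix) and absolute not in seen:
            seen.add(absolute)
            in_scope.append(absolute)
    return in_scope


links = links_in_scope(
    "https://docs.example.com/guide/index.html",
    [
        "intro.html",
        "/guide/setup.html#install",
        "https://example.com/pricing",
        "../blog/post.html",
    ],
    "https://docs.example.com/guide/",
)
print(links)
```

Only the first two links survive: the third is on a different host, and the fourth resolves to a path outside `/guide/`.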
167
215
  ## End-to-end example: lower-level control
168
216
 
169
217
  The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
@@ -444,7 +492,7 @@ Three backends are included.
444
492
 
445
493
  - `scan` is a minimal baseline that scans raw items directly.
446
494
  - `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
447
- - `vector` is a deterministic term-frequency vector baseline with cosine similarity scoring.
495
+ - `tf-vector` is a deterministic term-frequency vector baseline with cosine similarity scoring.
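To make "term-frequency vector with cosine similarity" concrete, here is a minimal sketch of the general technique. It is an illustration under assumptions, not the `tf-vector` backend itself: the tokenizer, the function names, and the tie-breaking rule are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def term_frequencies(text: str) -> Counter:
    """Count lowercase alphanumeric tokens (an assumed, very simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(left: Counter, right: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * right[term] for term, count in left.items())
    left_norm = math.sqrt(sum(count * count for count in left.values()))
    right_norm = math.sqrt(sum(count * count for count in right.values()))
    if left_norm == 0 or right_norm == 0:
        return 0.0
    return dot / (left_norm * right_norm)


def rank(query: str, documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score every document against the query, highest score first."""
    query_vector = term_frequencies(query)
    scored = [
        (item_id, cosine_similarity(query_vector, term_frequencies(text)))
        for item_id, text in documents.items()
    ]
    # Sort by descending score, then by identifier so ties are deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


ranked = rank(
    "note",
    {"a": "a note about retrieval", "b": "meeting agenda", "c": "note note note"},
)
print(ranked)
```

Because nothing here is learned or randomized, the same corpus and query always give the same ranking, which is the property that makes such a baseline useful for tests.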
448
496
 
449
497
  For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
450
498
 
@@ -494,6 +542,21 @@ For detailed documentation on all extractors, see the [Extractor Reference][extr
494
542
  For extraction evaluation workflows, dataset formats, and report interpretation, see
495
543
  `docs/EXTRACTION_EVALUATION.md`.
496
544
 
545
+ ## Text extract utility
546
+
547
+ Text extract is a reusable analysis utility that lets a model insert XML tags into a long text without re-emitting the
548
+ entire document. It returns structured spans and the marked-up text, and it is used as a segmentation option in Markov
549
+ analysis.
550
+
551
+ See `docs/TEXT_EXTRACT.md` for the utility API and examples, and `docs/MARKOV_ANALYSIS.md` for the Markov integration.
552
+
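As a rough illustration of what "structured spans and the marked-up text" can mean, the sketch below recovers spans from text that already contains inserted XML-style tags. It is not the Biblicus API: `Span`, `extract_spans`, and the tag syntax accepted by the pattern are assumptions, and the sketch only handles attribute-free, properly nested tags.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    tag: str
    start: int  # offset into the text with tags removed
    end: int
    text: str


TAG_PATTERN = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def extract_spans(marked_up_text: str) -> Tuple[str, List[Span]]:
    """Strip tags and report each tagged region as offsets into the plain text."""
    plain_parts: List[str] = []
    raw_spans: List[Tuple[str, int, int]] = []
    open_tags: List[Tuple[str, int]] = []  # stack of (tag name, start offset)
    cursor = 0
    plain_length = 0
    for match in TAG_PATTERN.finditer(marked_up_text):
        chunk = marked_up_text[cursor:match.start()]
        plain_parts.append(chunk)
        plain_length += len(chunk)
        cursor = match.end()
        is_closing, tag = bool(match.group(1)), match.group(2)
        if not is_closing:
            open_tags.append((tag, plain_length))
            continue
        if not open_tags or open_tags[-1][0] != tag:
            raise ValueError(f"Unexpected closing tag: {tag}")
        _, start = open_tags.pop()
        raw_spans.append((tag, start, plain_length))
    plain_parts.append(marked_up_text[cursor:])
    if open_tags:
        raise ValueError(f"Unclosed tag: {open_tags[-1][0]}")
    plain = "".join(plain_parts)
    return plain, [Span(tag, start, end, plain[start:end]) for tag, start, end in raw_spans]


plain, spans = extract_spans("Hello. <greeting>Good morning</greeting> everyone.")
print(plain)
print(spans)
```

Checking that the tag-free text equals the original input is a cheap way to confirm that a model only inserted tags and did not alter the document.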
553
+ ## Text slice utility
554
+
555
+ Text slice is a reusable analysis utility that lets a model insert `<slice/>` markers into a long text without
556
+ re-emitting the entire document. It returns ordered slices and the marked-up text for auditing and reuse.
557
+
558
+ See `docs/TEXT_SLICE.md` for the utility API and examples.
559
+
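A minimal sketch of how marked-up text with `<slice/>` markers could be turned into ordered slices. The function name and the whitespace handling are assumptions for illustration; the actual utility is described in the referenced documentation.

```python
from typing import List

SLICE_MARKER = "<slice/>"


def split_on_slice_markers(marked_up_text: str) -> List[str]:
    """Return non-empty slices in document order, trimmed of surrounding whitespace."""
    pieces = marked_up_text.split(SLICE_MARKER)
    return [piece.strip() for piece in pieces if piece.strip()]


original = "First topic sentence. Second topic begins here."
marked = "First topic sentence. <slice/>Second topic begins here.<slice/>"
slices = split_on_slice_markers(marked)

# Audit step: removing the markers should give back the original text.
markers_only_inserted = marked.replace(SLICE_MARKER, "") == original
print(slices, markers_only_inserted)
```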
497
560
  ## Topic modeling analysis
498
561
 
499
562
  Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling
@@ -548,7 +611,7 @@ AG News integration runs require `biblicus[datasets]` in addition to `biblicus[t
548
611
  For a repeatable, real-world integration run that downloads AG News and executes topic modeling, use:
549
612
 
550
613
  ```
551
- python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
614
+ python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
552
615
  ```
553
616
 
554
617
  See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
@@ -562,13 +625,13 @@ Use `scripts/download_pdf_samples.py` to download a small Portable Document Form
562
625
  ## Tests and coverage
563
626
 
564
627
  ```
565
- python3 scripts/test.py
628
+ python scripts/test.py
566
629
  ```
567
630
 
568
631
  To include integration scenarios that download public test data at runtime, run this command.
569
632
 
570
633
  ```
571
- python3 scripts/test.py --integration
634
+ python scripts/test.py --integration
572
635
  ```
573
636
 
574
637
  ## Releases
@@ -586,13 +649,13 @@ Reference documentation is generated from Sphinx style docstrings.
586
649
  Install development dependencies:
587
650
 
588
651
  ```
589
- python3 -m pip install -e ".[dev]"
652
+ python -m pip install -e ".[dev]"
590
653
  ```
591
654
 
592
655
  Build the documentation:
593
656
 
594
657
  ```
595
- python3 -m sphinx -b html docs docs/_build/html
658
+ python -m sphinx -b html docs docs/_build/html
596
659
  ```
597
660
 
598
661
  ## License
@@ -0,0 +1,143 @@
1
+ # Corpus analysis
2
+
3
+ Biblicus supports analysis backends that run on extracted text artifacts without changing the raw corpus. Analysis is a
4
+ pluggable phase that reads an extraction run, produces structured output, and stores artifacts under the corpus runs
5
+ folder. Each analysis backend declares its own configuration schema and output contract, and all schemas are validated
6
+ strictly.
7
+
8
+ ## How analysis runs work
9
+
10
+ - Analysis runs are tied to a corpus state via the extraction run reference.
11
+ - The analysis output is written under `.biblicus/runs/analysis/<analysis-id>/<run_id>/`.
12
+ - Analysis is reproducible when you supply the same extraction run and corpus catalog state.
13
+ - Analysis configuration is stored as a recipe manifest in the run metadata.
14
+
15
+ If you omit the extraction run, Biblicus uses the most recent extraction run and emits a reproducibility warning. For
16
+ repeatable analysis runs, always pass the extraction run reference explicitly.
17
+
18
+ ## Analysis run artifacts
19
+
20
+ Every analysis run records a manifest alongside the output:
21
+
22
+ ```
23
+ .biblicus/runs/analysis/<analysis-id>/<run_id>/
24
+ manifest.json
25
+ output.json
26
+ ```
27
+
28
+ The manifest captures the recipe, extraction run reference, and catalog timestamp so results can be reproduced and
29
+ compared later.
30
+
31
+ ## Inspecting output
32
+
33
+ Analysis outputs are JSON documents. You can view them directly:
34
+
35
+ ```
36
+ cat corpora/example/.biblicus/runs/analysis/profiling/RUN_ID/output.json
37
+ ```
38
+
39
+ Each analysis backend defines its own `report` payload. The run metadata is consistent across backends.
40
+
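For scripted inspection, the directory layout above can be read with the standard library. The sketch below is an assumption-laden convenience, not part of Biblicus: `load_latest_analysis_output` is a hypothetical helper, and choosing the "latest" run by file modification time is a shortcut that trades away the reproducibility an explicit run identifier gives you.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_latest_analysis_output(corpus_root: Path, analysis_id: str) -> Dict[str, Any]:
    """Load output.json from the most recently written run of one analysis backend.

    Hypothetical helper; paths follow the layout documented above.
    """
    analysis_dir = corpus_root / ".biblicus" / "runs" / "analysis" / analysis_id
    run_dirs = [path for path in analysis_dir.iterdir() if (path / "output.json").is_file()]
    if not run_dirs:
        raise FileNotFoundError(f"No analysis runs with output.json under {analysis_dir}")
    # Pick the run whose output.json was modified most recently.
    latest = max(run_dirs, key=lambda path: (path / "output.json").stat().st_mtime)
    return json.loads((latest / "output.json").read_text(encoding="utf-8"))
```

A call such as `load_latest_analysis_output(Path("corpora/example"), "profiling")` would return the parsed document, from which the backend-specific `report` payload can be read.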
41
+ ## Comparing analysis runs
42
+
43
+ When you compare analysis results, record:
44
+
45
+ - Corpus path and catalog timestamp.
46
+ - Extraction run reference.
47
+ - Analysis recipe name and configuration.
48
+ - Analysis run identifier and output path.
49
+
50
+ These make it possible to rerun the analysis and explain differences.
51
+
52
+ ## Pluggable analysis backends
53
+
54
+ Analysis backends implement the `CorpusAnalysisBackend` interface and are registered under `biblicus.analysis`.
55
+ A backend receives the corpus, a recipe name, a configuration mapping, and an extraction run reference. It returns a
56
+ Pydantic model that is serialized to JavaScript Object Notation for storage.
57
+
58
+ ## Choosing an analysis backend
59
+
60
+ Start with profiling when you need fast, deterministic baselines. Use topic modeling when you want thematic clustering
61
+ and exploratory labels. Use Markov analysis when you want state-transition structure over sequences of segments.
62
+ Combine multiple backends for a clear view of corpus composition, themes, and state dynamics.
63
+
64
+ ## Recipe files
65
+
66
+ Analysis recipes are optional JavaScript Object Notation or YAML files that capture configuration in a repeatable way.
67
+ They are useful for sharing experiments and keeping runs reproducible.
68
+
69
+ Recipes support cascading composition. When a command accepts `--recipe`, you can pass multiple recipe files. Biblicus
70
+ merges them in order, where later recipes override earlier recipes via a deep merge. You can then apply `--config`
71
+ overrides on top of the composed view.
72
+
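The composition order described above (recipes merged in sequence, then `--config` overrides applied last) can be sketched as follows. This is a hedged illustration of a deep merge, not the Biblicus implementation: the function names are hypothetical, and override values are kept as plain strings here, whereas the real CLI may parse them into numbers or booleans.

```python
from copy import deepcopy
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new mapping where `override` wins and nested mappings are merged."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def apply_dotted_override(config: Dict[str, Any], assignment: str) -> Dict[str, Any]:
    """Apply one `a.b.c=value` assignment, creating intermediate mappings as needed."""
    dotted_key, _, raw_value = assignment.partition("=")
    result = deepcopy(config)
    cursor = result
    parts = dotted_key.split(".")
    for part in parts[:-1]:
        next_value = cursor.get(part)
        if not isinstance(next_value, dict):
            next_value = {}
            cursor[part] = next_value
        cursor = next_value
    cursor[parts[-1]] = raw_value
    return result


def compose(recipes: List[Dict[str, Any]], overrides: List[str]) -> Dict[str, Any]:
    """Merge recipes in order, then apply dotted overrides on top."""
    composed: Dict[str, Any] = {}
    for recipe in recipes:
        composed = deep_merge(composed, recipe)
    for assignment in overrides:
        composed = apply_dotted_override(composed, assignment)
    return composed


base = {"schema_version": 1, "model": {"family": "gaussian", "n_states": 8}}
experiment = {"model": {"n_states": 4}, "segmentation": {"method": "sentence"}}
composed = compose([base, experiment], ["observations.encoder=tfidf"])
print(composed)
```

The later recipe changes only `model.n_states`; `model.family` survives from the base recipe because nested mappings are merged rather than replaced.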
73
+ Minimal profiling recipe:
74
+
75
+ ```
76
+ schema_version: 1
77
+ ```
78
+
79
+ Minimal topic modeling recipe:
80
+
81
+ ```
82
+ schema_version: 1
83
+ text_source:
84
+ sample_size: 500
85
+ bertopic_analysis:
86
+ parameters:
87
+ nr_topics: 8
88
+ ```
89
+
90
+ Minimal Markov analysis recipe:
91
+
92
+ ```
93
+ schema_version: 1
94
+ model:
95
+ family: gaussian
96
+ n_states: 8
97
+ segmentation:
98
+ method: sentence
99
+ observations:
100
+ encoder: tfidf
101
+ ```
102
+
103
+ ## Topic modeling
104
+
105
+ Topic modeling is the first analysis backend. It uses BERTopic to cluster extracted text, produces per-topic evidence,
106
+ and optionally labels topics using an LLM. See `docs/TOPIC_MODELING.md` for detailed configuration and examples.
107
+
108
+ The integration demo script is a working reference you can use as a starting point:
109
+
110
+ ```
111
+ python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
112
+ ```
113
+
114
+ The command prints the analysis run identifier and the output path. Open the resulting `output.json` to inspect per-topic
115
+ labels, keywords, and document examples.
116
+
117
+ ## Markov analysis
118
+
119
+ Markov analysis learns a directed, weighted state transition graph over sequences of text segments. The output includes
120
+ per-state exemplars, per-item decoded paths, and optional GraphViz exports. See `docs/MARKOV_ANALYSIS.md` for detailed
121
+ configuration and examples.
122
+
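To show what a "directed, weighted state transition graph" looks like as data, the sketch below builds one from state paths that have already been decoded. It is a deliberate simplification and an assumption: it counts observed transitions and normalizes them, and it does not fit a hidden-state model the way the recipe fields `family` and `n_states` suggest the real backend does.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def transition_graph(paths: Sequence[Sequence[int]]) -> Dict[int, Dict[int, float]]:
    """Map each source state to its outgoing edges, weighted by observed probability."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for path in paths:
        # Each adjacent pair of segments contributes one directed edge.
        for source, target in zip(path, path[1:]):
            counts[source][target] += 1
    graph: Dict[int, Dict[int, float]] = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        graph[source] = {target: count / total for target, count in targets.items()}
    return graph


# Two items, each decoded into a sequence of state identifiers (illustrative data).
decoded_paths = [[0, 1, 1, 2], [0, 1, 2, 2]]
graph = transition_graph(decoded_paths)
print(graph)
```

Each inner mapping sums to 1, so an edge weight reads as "how often a segment in this state is followed by a segment in that state".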
123
+ Text extract is available as a segmentation strategy for long texts. It inserts XML tags in-place using a virtual file
124
+ editing loop, then extracts spans without requiring the model to re-emit the full transcript.
125
+
126
+ ## Profiling analysis
127
+
128
+ Profiling is the baseline analysis backend. It summarizes corpus composition and extraction coverage using
129
+ deterministic counts and distribution metrics. See `docs/PROFILING.md` for the full reference and working demo.
130
+
131
+ ### Minimal profiling run
132
+
133
+ ```
134
+ python -m biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
135
+ ```
136
+
137
+ The command writes an analysis run directory and prints the run identifier.
138
+
139
+ If the `biblicus` command is installed on your path, the same run can be started without `python -m`:
140
+
141
+ ```
142
+ biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
143
+ ```
@@ -0,0 +1,46 @@
1
+ # Biblicus Architecture
2
+
3
+ Biblicus sits between raw, unstructured data and the moment you need reliable answers from it.
4
+ It is built for teams who receive large, messy corpora and must extract usable signals without
5
+ losing provenance or reproducibility. Retrieval-augmented generation is one use case, but the
6
+ system is broader than chatbots: it supports any pipeline that needs structured insight from
7
+ unstructured data.
8
+
9
+ At a high level the system does five things:
10
+
11
+ 1. **Ingests** raw content into a corpus with minimal friction.
12
+ 2. **Extracts** text from diverse media (documents, images, audio).
13
+ 3. **Transforms** and annotates text with reusable LLM utilities.
14
+ 4. **Retrieves** evidence through explicit, reproducible stages.
15
+ 5. **Evaluates** results so improvements are measurable, not anecdotal.
16
+
17
+ The guiding idea is that every retrieval produces **evidence**: structured outputs with scores
18
+ and provenance that can be inspected, audited, and reused. Context packs, summaries, and downstream
19
+ generation are all derived from that evidence.
20
+
21
+ ## Why it exists
22
+
23
+ Real-world AI work often starts with a folder full of files, not a clean database. Biblicus is the
24
+ toolkit that turns those files into a manageable, testable system. It supports workflows like:
25
+
26
+ - Indexing large collections of emails and making them searchable while protecting sensitive data.
27
+ - Processing discovery dumps of scanned PDFs with OCR and extracting evidence for analysis.
28
+ - Turning policy or rules documents into a controlled knowledge base for assistants.
29
+
30
+ ## How it fits into AI systems
31
+
32
+ Biblicus integrates with agent frameworks through explicit tool interfaces. It does not hide
33
+ retrieval inside the model. Instead, it provides repeatable pipelines that expose *what* was
34
+ retrieved and *why*, so models can use evidence directly and safely.
35
+
36
+ ## Where to go next
37
+
38
+ - Start with **CORPUS** and **EXTRACTION** to understand how raw content is ingested.
39
+ - Move to **RETRIEVAL** and **RETRIEVAL_EVALUATION** to see how evidence is produced and tested.
40
+ - Explore **TOPIC_MODELING** and **MARKOV_ANALYSIS** if you need higher-level analysis tools.
41
+ - See **TEXT_UTILITIES** for reusable, AI-assisted text transformations.
42
+
43
+ ## Detailed architecture and policies
44
+
45
+ For a deep, internal reference (including design policies and architectural constraints), see
46
+ `ARCHITECTURE_DETAIL.md`.