biblicus 0.3.0__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (139)
  1. {biblicus-0.3.0/src/biblicus.egg-info → biblicus-0.4.0}/PKG-INFO +29 -13
  2. {biblicus-0.3.0 → biblicus-0.4.0}/README.md +28 -12
  3. {biblicus-0.3.0 → biblicus-0.4.0}/docs/CORPUS.md +14 -1
  4. {biblicus-0.3.0 → biblicus-0.4.0}/docs/DEMOS.md +49 -9
  5. {biblicus-0.3.0 → biblicus-0.4.0}/docs/EXTRACTION.md +19 -2
  6. biblicus-0.4.0/docs/ROADMAP.md +200 -0
  7. {biblicus-0.3.0 → biblicus-0.4.0}/docs/conf.py +0 -1
  8. biblicus-0.4.0/features/crawl.feature +81 -0
  9. biblicus-0.4.0/features/extraction_run_lifecycle.feature +117 -0
  10. {biblicus-0.3.0 → biblicus-0.4.0}/features/extractor_pipeline.feature +3 -3
  11. biblicus-0.4.0/features/steps/crawl_steps.py +68 -0
  12. biblicus-0.4.0/features/steps/extraction_run_lifecycle_steps.py +148 -0
  13. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/extraction_steps.py +38 -6
  14. {biblicus-0.3.0 → biblicus-0.4.0}/features/text_extraction_runs.feature +1 -1
  15. {biblicus-0.3.0 → biblicus-0.4.0}/pyproject.toml +1 -1
  16. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/__init__.py +1 -1
  17. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/cli.py +147 -7
  18. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/corpus.py +170 -1
  19. biblicus-0.4.0/src/biblicus/crawl.py +186 -0
  20. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extraction.py +4 -2
  21. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/models.py +31 -0
  22. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/time.py +1 -1
  23. {biblicus-0.3.0 → biblicus-0.4.0/src/biblicus.egg-info}/PKG-INFO +29 -13
  24. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/SOURCES.txt +5 -0
  25. biblicus-0.3.0/docs/ROADMAP.md +0 -174
  26. {biblicus-0.3.0 → biblicus-0.4.0}/LICENSE +0 -0
  27. {biblicus-0.3.0 → biblicus-0.4.0}/MANIFEST.in +0 -0
  28. {biblicus-0.3.0 → biblicus-0.4.0}/THIRD_PARTY_NOTICES.md +0 -0
  29. {biblicus-0.3.0 → biblicus-0.4.0}/datasets/wikipedia_mini.json +0 -0
  30. {biblicus-0.3.0 → biblicus-0.4.0}/docs/ARCHITECTURE.md +0 -0
  31. {biblicus-0.3.0 → biblicus-0.4.0}/docs/BACKENDS.md +0 -0
  32. {biblicus-0.3.0 → biblicus-0.4.0}/docs/CORPUS_DESIGN.md +0 -0
  33. {biblicus-0.3.0 → biblicus-0.4.0}/docs/FEATURE_INDEX.md +0 -0
  34. {biblicus-0.3.0 → biblicus-0.4.0}/docs/TESTING.md +0 -0
  35. {biblicus-0.3.0 → biblicus-0.4.0}/docs/USER_CONFIGURATION.md +0 -0
  36. {biblicus-0.3.0 → biblicus-0.4.0}/docs/api.rst +0 -0
  37. {biblicus-0.3.0 → biblicus-0.4.0}/docs/index.rst +6 -6
  38. {biblicus-0.3.0 → biblicus-0.4.0}/features/backend_validation.feature +0 -0
  39. {biblicus-0.3.0 → biblicus-0.4.0}/features/biblicus_corpus.feature +0 -0
  40. {biblicus-0.3.0 → biblicus-0.4.0}/features/cli_entrypoint.feature +0 -0
  41. {biblicus-0.3.0 → biblicus-0.4.0}/features/cli_parsing.feature +0 -0
  42. {biblicus-0.3.0 → biblicus-0.4.0}/features/content_sniffing.feature +0 -0
  43. {biblicus-0.3.0 → biblicus-0.4.0}/features/corpus_edge_cases.feature +0 -0
  44. {biblicus-0.3.0 → biblicus-0.4.0}/features/corpus_identity.feature +0 -0
  45. {biblicus-0.3.0 → biblicus-0.4.0}/features/corpus_purge.feature +0 -0
  46. {biblicus-0.3.0 → biblicus-0.4.0}/features/environment.py +0 -0
  47. {biblicus-0.3.0 → biblicus-0.4.0}/features/error_cases.feature +0 -0
  48. {biblicus-0.3.0 → biblicus-0.4.0}/features/evaluation.feature +0 -0
  49. {biblicus-0.3.0 → biblicus-0.4.0}/features/extraction_error_handling.feature +0 -0
  50. {biblicus-0.3.0 → biblicus-0.4.0}/features/extraction_selection.feature +0 -0
  51. {biblicus-0.3.0 → biblicus-0.4.0}/features/extraction_selection_longest.feature +0 -0
  52. {biblicus-0.3.0 → biblicus-0.4.0}/features/extractor_validation.feature +0 -0
  53. {biblicus-0.3.0 → biblicus-0.4.0}/features/frontmatter.feature +0 -0
  54. {biblicus-0.3.0 → biblicus-0.4.0}/features/hook_config_validation.feature +0 -0
  55. {biblicus-0.3.0 → biblicus-0.4.0}/features/hook_error_handling.feature +0 -0
  56. {biblicus-0.3.0 → biblicus-0.4.0}/features/import_tree.feature +0 -0
  57. {biblicus-0.3.0 → biblicus-0.4.0}/features/ingest_sources.feature +0 -0
  58. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_audio_samples.feature +0 -0
  59. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_image_samples.feature +0 -0
  60. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_mixed_corpus.feature +0 -0
  61. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_mixed_extraction.feature +0 -0
  62. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_ocr_image_extraction.feature +0 -0
  63. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_pdf_retrieval.feature +0 -0
  64. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_pdf_samples.feature +0 -0
  65. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_unstructured_extraction.feature +0 -0
  66. {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_wikipedia.feature +0 -0
  67. {biblicus-0.3.0 → biblicus-0.4.0}/features/lifecycle_hooks.feature +0 -0
  68. {biblicus-0.3.0 → biblicus-0.4.0}/features/model_validation.feature +0 -0
  69. {biblicus-0.3.0 → biblicus-0.4.0}/features/ocr_extractor.feature +0 -0
  70. {biblicus-0.3.0 → biblicus-0.4.0}/features/pdf_text_extraction.feature +0 -0
  71. {biblicus-0.3.0 → biblicus-0.4.0}/features/python_api.feature +0 -0
  72. {biblicus-0.3.0 → biblicus-0.4.0}/features/python_hook_logging.feature +0 -0
  73. {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_budget.feature +0 -0
  74. {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_scan.feature +0 -0
  75. {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
  76. {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_uses_extraction_run.feature +0 -0
  77. {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_utilities.feature +0 -0
  78. {biblicus-0.3.0 → biblicus-0.4.0}/features/source_loading.feature +0 -0
  79. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/backend_steps.py +0 -0
  80. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/cli_parsing_steps.py +0 -0
  81. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/cli_steps.py +0 -0
  82. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/extractor_steps.py +0 -0
  83. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/frontmatter_steps.py +0 -0
  84. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/model_steps.py +0 -0
  85. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/openai_steps.py +0 -0
  86. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/pdf_steps.py +0 -0
  87. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/python_api_steps.py +0 -0
  88. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/rapidocr_steps.py +0 -0
  89. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/retrieval_steps.py +0 -0
  90. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/stt_steps.py +0 -0
  91. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/unstructured_steps.py +0 -0
  92. {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/user_config_steps.py +0 -0
  93. {biblicus-0.3.0 → biblicus-0.4.0}/features/streaming_ingest.feature +0 -0
  94. {biblicus-0.3.0 → biblicus-0.4.0}/features/stt_extractor.feature +0 -0
  95. {biblicus-0.3.0 → biblicus-0.4.0}/features/unstructured_extractor.feature +0 -0
  96. {biblicus-0.3.0 → biblicus-0.4.0}/features/user_config.feature +0 -0
  97. {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_audio_samples.py +0 -0
  98. {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_image_samples.py +0 -0
  99. {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_mixed_samples.py +0 -0
  100. {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_pdf_samples.py +0 -0
  101. {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_wikipedia.py +0 -0
  102. {biblicus-0.3.0 → biblicus-0.4.0}/scripts/test.py +0 -0
  103. {biblicus-0.3.0 → biblicus-0.4.0}/setup.cfg +0 -0
  104. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/__main__.py +0 -0
  105. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
  106. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
  107. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
  108. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
  109. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/__init__.py +0 -0
  110. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/base.py +0 -0
  111. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/scan.py +0 -0
  112. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
  113. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/constants.py +0 -0
  114. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/errors.py +0 -0
  115. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/evaluation.py +0 -0
  116. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/__init__.py +0 -0
  117. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/base.py +0 -0
  118. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/metadata_text.py +0 -0
  119. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/openai_stt.py +0 -0
  120. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/pass_through_text.py +0 -0
  121. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/pdf_text.py +0 -0
  122. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/pipeline.py +0 -0
  123. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
  124. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/select_longest_text.py +0 -0
  125. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/select_text.py +0 -0
  126. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/unstructured_text.py +0 -0
  127. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/frontmatter.py +0 -0
  128. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/hook_logging.py +0 -0
  129. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/hook_manager.py +0 -0
  130. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/hooks.py +0 -0
  131. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/ignore.py +0 -0
  132. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/retrieval.py +0 -0
  133. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/sources.py +0 -0
  134. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/uris.py +0 -0
  135. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/user_config.py +0 -0
  136. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
  137. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/entry_points.txt +0 -0
  138. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/requires.txt +0 -0
  139. {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: biblicus
- Version: 0.3.0
+ Version: 0.4.0
  Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
  License: MIT
  Requires-Python: >=3.9
@@ -77,10 +77,7 @@ flowchart LR
  direction LR
  LegendArtifact[Stored artifact or evidence]
  LegendStep[Step]
- LegendStable[Stable region]
- LegendPluggable[Pluggable region]
  LegendArtifact --- LegendStep
- LegendStable --- LegendPluggable
  end
 
  subgraph Main[" "]
@@ -93,14 +90,14 @@ flowchart LR
  Raw --> Catalog[Catalog file]
  end
 
- subgraph PluggableExtractionPipeline[Pluggable extraction pipeline]
+ subgraph PluggableExtractionPipeline[Pluggable: extraction pipeline]
  direction TB
  Catalog --> Extract[Extract pipeline]
  Extract --> ExtractedText[Extracted text artifacts]
  ExtractedText --> ExtractionRun[Extraction run manifest]
  end
 
- subgraph PluggableRetrievalBackend[Pluggable retrieval backend]
+ subgraph PluggableRetrievalBackend[Pluggable: retrieval backend]
  direction LR
 
  subgraph BackendIngestionIndexing[Ingestion and indexing]
@@ -154,8 +151,6 @@ flowchart LR
  style Main fill:#ffffff,stroke:#ffffff,color:#111111
  style LegendArtifact fill:#f3e5f5,stroke:#8e24aa,color:#111111
  style LegendStep fill:#eceff1,stroke:#90a4ae,color:#111111
- style LegendStable fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
- style LegendPluggable fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
  ```
 
  ## Practical value
@@ -168,6 +163,7 @@ flowchart LR
 
  - Initialize a corpus folder.
  - Ingest items from file paths, web addresses, or text input.
+ - Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
  - Run extraction when you want derived text artifacts from non-text sources.
  - Reindex to refresh the catalog after edits.
  - Build a retrieval run with a backend.
@@ -205,11 +201,22 @@ biblicus init corpora/example
  biblicus ingest --corpus corpora/example notes/example.txt
  echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
  biblicus list --corpus corpora/example
- biblicus extract --corpus corpora/example --step pass-through-text --step metadata-text
+ biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
+ biblicus extract list --corpus corpora/example
  biblicus build --corpus corpora/example --backend scan
  biblicus query --corpus corpora/example --query "note"
  ```
 
+ If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
+
+ ```
+ biblicus crawl --corpus corpora/example \\
+ --root-url https://example.com/docs/index.html \\
+ --allowed-prefix https://example.com/docs/ \\
+ --max-items 50 \\
+ --tag crawled
+ ```
+
  ## Python usage
 
  From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
@@ -233,7 +240,7 @@ In an assistant system, retrieval usually produces context for a model call. Thi
 
  ## Learn more
 
- Full documentation is available on [ReadTheDocs](https://biblicus.readthedocs.io/).
+ Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
 
  The documents below are written to be read in order.
 
@@ -262,7 +269,16 @@ corpus/
  config.json
  catalog.json
  runs/
- run-id.json
+ extraction/
+ pipeline/
+ <run id>/
+ manifest.json
+ text/
+ <item id>.txt
+ retrieval/
+ <backend id>/
+ <run id>/
+ manifest.json
  ```
 
  ## Retrieval backends
@@ -313,7 +329,7 @@ python3 -m pip install -e ".[dev]"
  Build the documentation:
 
  ```
- python3 -m sphinx -b html docs docs/_build
+ python3 -m sphinx -b html docs docs/_build/html
  ```
 
  ## License
@@ -333,4 +349,4 @@ License terms are in `LICENSE`.
 
  [continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
  [coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
- [documentation-badge]: https://readthedocs.org/projects/biblicus/badge/?version=latest
+ [documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
@@ -48,10 +48,7 @@ flowchart LR
  direction LR
  LegendArtifact[Stored artifact or evidence]
  LegendStep[Step]
- LegendStable[Stable region]
- LegendPluggable[Pluggable region]
  LegendArtifact --- LegendStep
- LegendStable --- LegendPluggable
  end
 
  subgraph Main[" "]
@@ -64,14 +61,14 @@ flowchart LR
  Raw --> Catalog[Catalog file]
  end
 
- subgraph PluggableExtractionPipeline[Pluggable extraction pipeline]
+ subgraph PluggableExtractionPipeline[Pluggable: extraction pipeline]
  direction TB
  Catalog --> Extract[Extract pipeline]
  Extract --> ExtractedText[Extracted text artifacts]
  ExtractedText --> ExtractionRun[Extraction run manifest]
  end
 
- subgraph PluggableRetrievalBackend[Pluggable retrieval backend]
+ subgraph PluggableRetrievalBackend[Pluggable: retrieval backend]
  direction LR
 
  subgraph BackendIngestionIndexing[Ingestion and indexing]
@@ -125,8 +122,6 @@ flowchart LR
  style Main fill:#ffffff,stroke:#ffffff,color:#111111
  style LegendArtifact fill:#f3e5f5,stroke:#8e24aa,color:#111111
  style LegendStep fill:#eceff1,stroke:#90a4ae,color:#111111
- style LegendStable fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
- style LegendPluggable fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
  ```
 
  ## Practical value
@@ -139,6 +134,7 @@ flowchart LR
 
  - Initialize a corpus folder.
  - Ingest items from file paths, web addresses, or text input.
+ - Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
  - Run extraction when you want derived text artifacts from non-text sources.
  - Reindex to refresh the catalog after edits.
  - Build a retrieval run with a backend.
@@ -176,11 +172,22 @@ biblicus init corpora/example
  biblicus ingest --corpus corpora/example notes/example.txt
  echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
  biblicus list --corpus corpora/example
- biblicus extract --corpus corpora/example --step pass-through-text --step metadata-text
+ biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
+ biblicus extract list --corpus corpora/example
  biblicus build --corpus corpora/example --backend scan
  biblicus query --corpus corpora/example --query "note"
  ```
 
+ If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
+
+ ```
+ biblicus crawl --corpus corpora/example \\
+ --root-url https://example.com/docs/index.html \\
+ --allowed-prefix https://example.com/docs/ \\
+ --max-items 50 \\
+ --tag crawled
+ ```
+
  ## Python usage
 
  From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
@@ -204,7 +211,7 @@ In an assistant system, retrieval usually produces context for a model call. Thi
 
  ## Learn more
 
- Full documentation is available on [ReadTheDocs](https://biblicus.readthedocs.io/).
+ Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
 
  The documents below are written to be read in order.
 
@@ -233,7 +240,16 @@ corpus/
  config.json
  catalog.json
  runs/
- run-id.json
+ extraction/
+ pipeline/
+ <run id>/
+ manifest.json
+ text/
+ <item id>.txt
+ retrieval/
+ <backend id>/
+ <run id>/
+ manifest.json
  ```
 
  ## Retrieval backends
@@ -284,7 +300,7 @@ python3 -m pip install -e ".[dev]"
  Build the documentation:
 
  ```
- python3 -m sphinx -b html docs docs/_build
+ python3 -m sphinx -b html docs docs/_build/html
  ```
 
  ## License
@@ -304,4 +320,4 @@ License terms are in `LICENSE`.
 
  [continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
  [coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
- [documentation-badge]: https://readthedocs.org/projects/biblicus/badge/?version=latest
+ [documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
@@ -43,6 +43,20 @@ Ingest a web address:
  python3 -m biblicus ingest --corpus corpora/example https://example.com --tag web
  ```
 
+ ## Crawl a website prefix
+
+ To build a corpus from a website section, crawl a root uniform resource locator and restrict the crawl to an allowed prefix.
+
+ ```
+ python3 -m biblicus crawl --corpus corpora/example \\
+ --root-url https://example.com/docs/index.html \\
+ --allowed-prefix https://example.com/docs/ \\
+ --max-items 50 \\
+ --tag crawled
+ ```
+
+ The crawl command only follows links within the allowed prefix, and it respects `.biblicusignore` patterns against the path relative to the allowed prefix.
+
  Ingest a text note:
 
  ```
@@ -100,4 +114,3 @@ Purging deletes all items and derived artifacts under the corpus. It requires yo
  ```
  python3 -m biblicus purge --corpus corpora/example --confirm example
  ```
-
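Reviewer note on the `docs/CORPUS.md` change above: the documented crawl rule (follow only links within the allowed prefix, and match ignore patterns against the path relative to that prefix) can be sketched roughly as below. This is a minimal illustration, not the code in `src/biblicus/crawl.py`, which this diff listing does not show. The function names `relative_crawl_path` and `should_follow` are hypothetical, and using `fnmatch` for ignore patterns is an assumption about how `.biblicusignore` patterns behave.

```python
from fnmatch import fnmatch
from typing import Iterable, Optional
from urllib.parse import urldefrag, urljoin


def relative_crawl_path(url: str, allowed_prefix: str) -> Optional[str]:
    """Return the path relative to the allowed prefix, or None when out of scope.

    Hypothetical helper: the real crawler may normalize addresses differently.
    """
    clean_url, _fragment = urldefrag(url)  # drop "#section" fragments
    if not clean_url.startswith(allowed_prefix):
        return None
    return clean_url[len(allowed_prefix):]


def should_follow(
    page_url: str,
    href: str,
    allowed_prefix: str,
    ignore_patterns: Iterable[str] = (),
) -> bool:
    """Decide whether a link found on page_url stays inside the crawl."""
    absolute = urljoin(page_url, href)  # resolve relative links against the page
    relative = relative_crawl_path(absolute, allowed_prefix)
    if relative is None:
        return False
    # Assumption: ignore patterns are glob-style and apply to the relative path.
    return not any(fnmatch(relative, pattern) for pattern in ignore_patterns)
```

With the local demo site from `docs/DEMOS.md`, a link to `page.html` found on `http://127.0.0.1:8000/site/index.html` stays inside the prefix `http://127.0.0.1:8000/site/`, while a link to `../other.html` resolves outside it and is skipped.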
@@ -133,6 +133,46 @@ The catalog is rebuildable. You can edit raw files or sidecar metadata, then ref
  python3 -m biblicus reindex --corpus corpora/demo
  ```
 
+ ### Crawl a website prefix
+
+ To turn a website section into corpus items, crawl a root page and restrict the crawl to an allowed prefix.
+
+ In one terminal, create a tiny local website and serve it:
+
+ ```
+ rm -rf /tmp/biblicus-site
+ mkdir -p /tmp/biblicus-site/site/subdir
+ cat > /tmp/biblicus-site/site/index.html <<'HTML'
+ <html>
+ <body>
+ <a href="page.html">Page</a>
+ <a href="subdir/">Subdir</a>
+ </body>
+ </html>
+ HTML
+ cat > /tmp/biblicus-site/site/page.html <<'HTML'
+ <html><body>hello</body></html>
+ HTML
+ cat > /tmp/biblicus-site/site/subdir/index.html <<'HTML'
+ <html><body>subdir</body></html>
+ HTML
+
+ python3 -m http.server 8000 --directory /tmp/biblicus-site
+ ```
+
+ In another terminal:
+
+ ```
+ rm -rf corpora/crawl-demo
+ python3 -m biblicus init corpora/crawl-demo
+ python3 -m biblicus crawl --corpus corpora/crawl-demo \\
+ --root-url http://127.0.0.1:8000/site/index.html \\
+ --allowed-prefix http://127.0.0.1:8000/site/ \\
+ --max-items 50 \\
+ --tag crawled
+ python3 -m biblicus list --corpus corpora/crawl-demo
+ ```
+
  ### Build an extraction run
 
  Text extraction is a separate pipeline stage from retrieval. An extraction run produces derived text artifacts under the corpus.
@@ -140,7 +180,7 @@ Text extraction is a separate pipeline stage from retrieval. An extraction run p
  This extractor reads text items and skips non-text items.
 
  ```
- python3 -m biblicus extract --corpus corpora/demo --step pass-through-text
+ python3 -m biblicus extract build --corpus corpora/demo --step pass-through-text
  ```
 
  The output includes a `run_id` you can reuse when building a retrieval backend.
@@ -150,7 +190,7 @@ The output includes a `run_id` you can reuse when building a retrieval backend.
  When you want an explicit choice among multiple extraction outputs, add a selection extractor step at the end of the pipeline.
 
  ```
- python3 -m biblicus extract --corpus corpora/demo \\
+ python3 -m biblicus extract build --corpus corpora/demo \\
  --step pass-through-text \\
  --step metadata-text \\
  --step select-text
@@ -171,7 +211,7 @@ This example downloads a small set of public Portable Document Format files, ext
  rm -rf corpora/pdf_samples
  python3 scripts/download_pdf_samples.py --corpus corpora/pdf_samples --force
 
- python3 -m biblicus extract --corpus corpora/pdf_samples --step pdf-text
+ python3 -m biblicus extract build --corpus corpora/pdf_samples --step pdf-text
  ```
 
  Copy the `run_id` from the JavaScript Object Notation output. You will use it as `PDF_EXTRACTION_RUN_ID` in the next command.
@@ -211,7 +251,7 @@ python3 -m pip install "biblicus[ocr]"
  Then build an extraction run:
 
  ```
- python3 -m biblicus extract --corpus corpora/image_samples --step ocr-rapidocr
+ python3 -m biblicus extract build --corpus corpora/image_samples --step ocr-rapidocr
  ```
 
  ### Optional: Unstructured as a last-resort extractor
@@ -227,7 +267,7 @@ python3 -m pip install "biblicus[unstructured]"
  Then build an extraction run:
 
  ```
- python3 -m biblicus extract --corpus corpora/pdf_samples --step unstructured
+ python3 -m biblicus extract build --corpus corpora/pdf_samples --step unstructured
  ```
 
  To see Unstructured handle a non-Portable-Document-Format format, use the mixed corpus demo, which includes a `.docx` sample:
@@ -235,13 +275,13 @@ To see Unstructured handle a non-Portable-Document-Format format, use the mixed
  ```
  rm -rf corpora/mixed_samples
  python3 scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
- python3 -m biblicus extract --corpus corpora/mixed_samples --step unstructured
+ python3 -m biblicus extract build --corpus corpora/mixed_samples --step unstructured
  ```
 
  When you want to prefer one extractor over another for the same item types, order the steps and end with `select-text`:
 
  ```
- python3 -m biblicus extract --corpus corpora/pdf_samples \\
+ python3 -m biblicus extract build --corpus corpora/pdf_samples \\
  --step unstructured \\
  --step pdf-text \\
  --step select-text
@@ -263,7 +303,7 @@ python3 -m biblicus list --corpus corpora/audio_samples
  If you only want a metadata-only baseline, extract `metadata-text`:
 
  ```
- python3 -m biblicus extract --corpus corpora/audio_samples --step metadata-text
+ python3 -m biblicus extract build --corpus corpora/audio_samples --step metadata-text
  ```
 
  For real speech to text transcription with the OpenAI backend, install the optional dependency and set an API key:
@@ -272,7 +312,7 @@ For real speech to text transcription with the OpenAI backend, install the optio
  python3 -m pip install "biblicus[openai]"
  mkdir -p .biblicus
  printf "openai:\n api_key: ...\n" > .biblicus/config.yml
- python3 -m biblicus extract --corpus corpora/audio_samples --step stt-openai
+ python3 -m biblicus extract build --corpus corpora/audio_samples --step stt-openai
  ```
 
  ### Build and query the minimal backend
@@ -148,7 +148,7 @@ python3 -m biblicus init corpora/extraction-demo
  printf 'x' > /tmp/image.png
  python3 -m biblicus ingest --corpus corpora/extraction-demo /tmp/image.png --tag extracted
 
- python3 -m biblicus extract --corpus corpora/extraction-demo \\
+ python3 -m biblicus extract build --corpus corpora/extraction-demo \\
  --step pass-through-text \\
  --step pdf-text \\
  --step metadata-text
@@ -161,7 +161,7 @@ The extracted text for the image comes from the `metadata-text` step because the
  Selection is a pipeline step that chooses extracted text from previous pipeline steps. Selection is just another extractor in the pipeline, and it decides which prior output to carry forward.
 
  ```
- python3 -m biblicus extract --corpus corpora/extraction-demo \\
+ python3 -m biblicus extract build --corpus corpora/extraction-demo \\
  --step pass-through-text \\
  --step metadata-text \\
  --step select-text
@@ -169,6 +169,23 @@ python3 -m biblicus extract --corpus corpora/extraction-demo \\
 
  The pipeline run produces one extraction run under `pipeline`. You can point retrieval backends at that run.
 
+ ## Inspecting and deleting extraction runs
+
+ Extraction runs are stored under the corpus and can be listed and inspected.
+
+ ```
+ python3 -m biblicus extract list --corpus corpora/extraction-demo
+ python3 -m biblicus extract show --corpus corpora/extraction-demo --run pipeline:EXTRACTION_RUN_ID
+ ```
+
+ Deletion is explicit and requires typing the exact run reference as confirmation:
+
+ ```
+ python3 -m biblicus extract delete --corpus corpora/extraction-demo \\
+ --run pipeline:EXTRACTION_RUN_ID \\
+ --confirm pipeline:EXTRACTION_RUN_ID
+ ```
+
  ## Use extracted text in retrieval
 
  Retrieval backends can build and query using a selected extraction run. This is configured by passing `extraction_run=extractor_id:run_id` to the backend build command.
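Reviewer note on the `docs/EXTRACTION.md` change above: the run reference format `extractor_id:run_id` and the "type the exact reference to confirm" rule for deletion might look roughly like this in code. This is a sketch under stated assumptions, not the actual logic in `src/biblicus/cli.py` or `src/biblicus/corpus.py`, which this diff listing does not show. The names `parse_run_reference` and `deletion_confirmed` are hypothetical, and splitting on the first colon is an assumption.

```python
from typing import Tuple


def parse_run_reference(reference: str) -> Tuple[str, str]:
    """Split "extractor_id:run_id" into its two parts.

    Hypothetical helper: assumes the first colon separates the parts.
    """
    extractor_id, separator, run_id = reference.partition(":")
    if not separator or not extractor_id or not run_id:
        raise ValueError(f"expected extractor_id:run_id, got {reference!r}")
    return extractor_id, run_id


def deletion_confirmed(run_reference: str, confirmation: str) -> bool:
    """Allow deletion only when the confirmation repeats the reference exactly."""
    return confirmation == run_reference
```

Requiring an exact repeat of the reference, rather than a generic yes or no, makes it harder to delete the wrong run by accident when several runs exist for the same corpus.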
@@ -0,0 +1,200 @@
1
+ # Roadmap
2
+
3
+ This document is the ordered plan for what to build next.
4
+
5
+ If you are looking for runnable examples, see `docs/DEMOS.md`.
6
+
7
+ ## Principles
8
+
9
+ - Behavior specifications are the authoritative definition of behavior.
10
+ - Every behavior that exists is specified.
11
+ - Validation and documentation are part of the product.
12
+ - Raw corpus items remain readable, portable files.
13
+ - Derived artifacts are stored under the corpus and can coexist for multiple implementations.
14
+
15
+ ## Current state
16
+
17
+ Version zero includes:
18
+
19
+ - A file based corpus with ingestion, catalog rebuild, import, ignore rules, and lifecycle hooks.
20
+ - A retrieval baseline (`scan`) and a practical local backend (`sqlite-full-text-search`).
21
+ - A separate text extraction stage with extraction runs and a composable extractor pipeline.
22
+ - Selection extractor steps that choose extracted text within a pipeline.
23
+ - A speech to text extractor plugin (`stt-openai`) implemented as an optional dependency.
24
+ - An optical character recognition extractor plugin (`ocr-rapidocr`) implemented as an optional dependency.
25
+ - A broad catchall extractor plugin (`unstructured`) implemented as an optional dependency.
26
+ - Integration corpora that include deterministic non-text cases such as a blank Portable Document Format file and a silence Waveform Audio File Format clip.
27
+
28
+ Milestones 1 through 4 are complete. The next planned work begins at Milestone 5.
29
+
30
+ ## Near-term focus
31
+
32
+ The next work will focus on the retrieval side of the pipeline:
33
+
34
+ - Make retrieval runs and evidence production the simplest possible practical “minimum viable product”.
35
+ - Add explicit evidence quality stages (rerank and filter) that are easy to compose, test, and evaluate.
36
+ - Expand retrieval evaluation so it is easy to compare backends using the same corpora and datasets.
37
+
38
+ Lower-priority work related to corpus ingestion conveniences and extractor evaluation remains valuable, but it is deferred while we make retrieval practical end to end.
39
+
40
+ ## Milestones
41
+
42
+ ### Milestone 1: Artifact lifecycle and storage layout
43
+
44
+ Goal: make derived artifacts easy to inspect, compare, and retain across multiple extraction implementations.
45
+
46
+ Status: complete.
47
+
48
+ Deliverables:
49
+
50
+ - A stable on-disk layout for extracted artifacts that partitions by extraction recipe and extractor identity.
51
+ - A clear, human-readable manifest for each extraction run that includes configuration, timing, and summary stats.
52
+ - Corpus-level tooling to list, inspect, and delete derived artifacts without touching raw items.
53
+
54
+ Acceptance checks:
55
+
56
+ - Raw items remain readable, portable files in `raw/`.
57
+ - Derived artifacts can coexist for multiple extractors and multiple recipes over the same raw items.
58
+ - Behavior specifications cover artifact layout and lifecycle operations.
59
+
60
+ ### Milestone 2: Idempotency and change detection
61
+
62
+ Goal: make extraction runs repeatable, fast, and safe by skipping work when nothing relevant changed.
63
+
64
+ Status: complete.
65
+
66
+ Deliverables:
67
+
68
+ - Change detection for extraction inputs (raw bytes identity) and extraction settings (extractor identity and configuration).
69
+ - Extraction run behavior that cleanly separates “skipped because already present” from “skipped because unsupported”.
70
+ - A simple “rebuild” workflow that is explicit and safe: delete an extraction run, then build it again.
71
+
72
+ Acceptance checks:
73
+
74
+ - Running the same extraction recipe twice produces the same outputs and reports predictable skip counts.
75
+ - Behavior specifications cover idempotency and change detection outcomes.
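
Editor's sketch (not part of the released file): one way to read the change detection deliverable above is as a fingerprint over the raw bytes, the extractor identity, and the extractor configuration. The function names `extraction_fingerprint` and `should_skip` are hypothetical and are not taken from the biblicus source; the choice of SHA-256 and canonical JSON is also an assumption.

```python
import hashlib
import json


def extraction_fingerprint(raw_bytes: bytes, extractor_id: str, config: dict) -> str:
    """Hypothetical helper: fingerprint one (item, extractor, configuration) combination."""
    digest = hashlib.sha256()
    # Identity of the raw input: a hash of its bytes, not its path or timestamp.
    digest.update(hashlib.sha256(raw_bytes).digest())
    digest.update(b"\0")
    # Identity of the extractor.
    digest.update(extractor_id.encode("utf-8"))
    digest.update(b"\0")
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()


def should_skip(fingerprint: str, already_built: set) -> bool:
    """Skip work only when an artifact with the same fingerprint already exists."""
    return fingerprint in already_built
```

Under this reading, a second run over unchanged inputs yields the same fingerprints and can report every item as "skipped because already present", while any change to the bytes or the configuration produces a new fingerprint and triggers a rebuild.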
76
+
77
+ ### Milestone 3: Failure semantics and reporting
78
+
79
+ Goal: make extraction outcomes diagnosable and measurable without reading log output.
80
+
81
+ Status: complete.
82
+
83
+ Deliverables:
84
+
85
+ - A clear set of extraction outcome categories (success, empty output, skipped, fatal error) with structured reasons.
86
+ - Per-run reporting that summarizes outcomes and provides a path to per-item details.
87
+ - Consistent, user-facing errors when optional dependencies or required configuration are missing.
88
+
89
+ Acceptance checks:
90
+
91
+ - Behavior specifications cover error classification and summary reporting.
92
+ - Reports remain deterministic for the same corpus and recipe.
93
+
94
+ ### Milestone 4: Corpus import and crawl utilities
95
+
96
+ Goal: make it easy to build a corpus from real-world sources while keeping the corpus readable and portable.
97
+
98
+ Status: complete.
99
+
100
+ Deliverables:
101
+
102
+ - Folder tree import ergonomics: stable naming, media type detection, and predictable metadata sidecars.
103
+ - A website crawl command that stays within an allow-listed uniform resource locator prefix and respects `.biblicusignore`.
104
+ - Integration downloads that produce a small, realistic, repeatable corpus for experimentation without committing third-party content to the repository.
105
+
106
+ Acceptance checks:
107
+
108
+ - The crawl and import workflows are fully specified with behavior specifications.
109
+ - Integration corpora remain gitignored, and can be regenerated from scripts.
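
Editor's sketch (not part of the released file): the crawl deliverable above combines an allow-listed prefix check with ignore filtering. The helper names below are hypothetical, and the use of `fnmatch` glob patterns is an assumption; the actual `.biblicusignore` matching rules may differ.

```python
from fnmatch import fnmatch
from urllib.parse import urldefrag, urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a link found on page_url to an absolute URL without its fragment."""
    return urldefrag(urljoin(page_url, href)).url


def is_allowed(url: str, allowed_prefix: str) -> bool:
    """Keep the crawl inside the allow-listed prefix.

    The prefix should end with "/" so that ".../docs/" does not also admit ".../docs-old/".
    """
    return url.startswith(allowed_prefix)


def is_ignored(url: str, allowed_prefix: str, ignore_patterns: list) -> bool:
    """Assumed semantics: patterns are globs matched against the path below the prefix."""
    relative = url[len(allowed_prefix):]
    return any(fnmatch(relative, pattern) for pattern in ignore_patterns)
```

A crawler built on these helpers would resolve each discovered link first, then enqueue it only if `is_allowed` is true and `is_ignored` is false, so relative links such as `../blog/post.html` cannot escape the prefix.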
110
+
111
+ ### Milestone 6: Evidence quality stages
112
+
113
+ Goal: add explicit rerank and filter stages to retrieval.
114
+
115
+ Status: next.
116
+
117
+ Deliverables:
118
+
119
+ - A rerank stage interface that takes evidence and returns reordered evidence.
120
+ - A filter stage interface that applies metadata and source constraints.
121
+ - Documentation that explains how to configure budgets and stage ordering.
122
+
123
+ Acceptance checks:
124
+
125
+ - Behavior specifications cover the new stages.
126
+ - Evaluation reports show per-stage metrics and final metrics.
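
Editor's sketch (not part of the released file): Milestone 6 is planned work, so no interface exists yet in this release. The following is only one possible shape for composable rerank and filter stages, where each stage is a function from a list of evidence to a list of evidence. The `Evidence` fields and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Evidence:
    """Hypothetical evidence record; field names are illustrative only."""
    item_id: str
    text: str
    score: float
    metadata: Dict[str, str] = field(default_factory=dict)


# A stage takes evidence and returns evidence, so stages compose in any order.
Stage = Callable[[List[Evidence]], List[Evidence]]


def rerank_by_score(evidence: List[Evidence]) -> List[Evidence]:
    """Rerank stage: return the same evidence, highest score first."""
    return sorted(evidence, key=lambda e: e.score, reverse=True)


def filter_by_metadata(key: str, value: str) -> Stage:
    """Filter stage factory: keep evidence whose metadata[key] equals value."""
    def stage(evidence: List[Evidence]) -> List[Evidence]:
        return [e for e in evidence if e.metadata.get(key) == value]
    return stage


def run_stages(evidence: List[Evidence], stages: List[Stage]) -> List[Evidence]:
    """Apply stages left to right; an empty stage list returns the input unchanged."""
    for stage in stages:
        evidence = stage(evidence)
    return evidence
```

Keeping every stage to the same signature makes stage ordering a plain list, which is easy to configure and to test stage by stage.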
127
+
128
+ ### Milestone 7: Evaluation reports and datasets
129
+
130
+ Goal: make evaluation results easier to interpret and compare.
131
+
132
+ Status: next.
133
+
134
+ Deliverables:
135
+
136
+ - A dataset authoring workflow that supports small hand-labeled sets and larger synthetic sets.
137
+ - A report that includes per-query diagnostics and a clear summary.
138
+
139
+ Acceptance checks:
140
+
141
+ - The existing dataset format remains stable or is versioned.
142
+ - Reports remain deterministic for the same inputs.
143
+
144
+ ### Milestone 8: Pluggable backend hosting modes
145
+
146
+ Goal: add one reference backend in an external process or remote service mode.
147
+
148
+ Status: later.
149
+
150
+ Deliverables:
151
+
152
+ - A tool server that exposes a backend through a stable interface.
153
+ - Documentation that shows how to run a backend out of process and connect to it.
154
+
155
+ Acceptance checks:
156
+
157
+ - Local tests remain fast and deterministic.
158
+ - Integration tests validate end-to-end retrieval through the tool boundary.
159
+
160
+ ## Where to put design notes
161
+
162
+ Design notes live in `docs/` so they are easy to browse and cross-link.
163
+
164
+ Executable behavior lives in `features/*.feature`.
165
+
166
+ ## Completed milestones (version zero)
167
+
168
+ These milestones are complete as of version zero, and are maintained through behavior specifications:
169
+
170
+ - Portable Document Format text extraction (`pdf-text`).
171
+ - Optical character recognition extraction (`ocr-rapidocr`).
172
+ - Catchall extraction for wide format coverage (`unstructured`).
173
+ - Selection extractor steps (`select-text`, `select-longest-text`).
174
+
175
+ ## Completed milestones (post version zero)
176
+
177
+ These milestones were completed after version zero, and remain defined by behavior specifications:
178
+
179
+ - Extraction run lifecycle operations (`extract list`, `extract show`, `extract delete`) and a stable artifact layout.
180
+ - Deterministic extraction run identifiers based on recipe and catalog version (idempotent extraction runs).
181
+ - Crawl ingestion (`crawl`) with allow-listed prefix enforcement and `.biblicusignore` filtering.
182
+
183
+ ## Deferred milestones
184
+
185
+ These milestones remain planned, but are not the near-term focus.
186
+
187
+ ### Milestone 5: Extractor datasets and evaluation harness (deferred)
188
+
189
+ Goal: compare extraction approaches in a way that is measurable, repeatable, and useful for practical engineering decisions.
190
+
191
+ Deliverables:
192
+
193
+ - Dataset authoring workflow for extraction ground truth (for example: expected transcripts and expected optical character recognition text).
194
+ - Evaluation metrics for accuracy, speed, and cost, including “processable fraction” for a given extractor recipe.
195
+ - A report format that can compare multiple extraction recipes against the same corpus and dataset.
196
+
197
+ Acceptance checks:
198
+
199
+ - Evaluation results are stable and reproducible for the same corpus and dataset inputs.
200
+ - Reports make it clear when an extractor fails to process an item versus producing empty output.
@@ -23,7 +23,6 @@ autodoc_typehints = "description"
23
23
  html_theme = "sphinx_rtd_theme"
24
24
 
25
25
  html_theme_options = {
26
- "display_version": True,
27
26
  "prev_next_buttons_location": "bottom",
28
27
  "style_external_links": False,
29
28
  "collapse_navigation": False,