biblicus 0.3.0__tar.gz → 0.4.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {biblicus-0.3.0/src/biblicus.egg-info → biblicus-0.4.0}/PKG-INFO +29 -13
- {biblicus-0.3.0 → biblicus-0.4.0}/README.md +28 -12
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/CORPUS.md +14 -1
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/DEMOS.md +49 -9
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/EXTRACTION.md +19 -2
- biblicus-0.4.0/docs/ROADMAP.md +200 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/conf.py +0 -1
- biblicus-0.4.0/features/crawl.feature +81 -0
- biblicus-0.4.0/features/extraction_run_lifecycle.feature +117 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/extractor_pipeline.feature +3 -3
- biblicus-0.4.0/features/steps/crawl_steps.py +68 -0
- biblicus-0.4.0/features/steps/extraction_run_lifecycle_steps.py +148 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/extraction_steps.py +38 -6
- {biblicus-0.3.0 → biblicus-0.4.0}/features/text_extraction_runs.feature +1 -1
- {biblicus-0.3.0 → biblicus-0.4.0}/pyproject.toml +1 -1
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/__init__.py +1 -1
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/cli.py +147 -7
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/corpus.py +170 -1
- biblicus-0.4.0/src/biblicus/crawl.py +186 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extraction.py +4 -2
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/models.py +31 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/time.py +1 -1
- {biblicus-0.3.0 → biblicus-0.4.0/src/biblicus.egg-info}/PKG-INFO +29 -13
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/SOURCES.txt +5 -0
- biblicus-0.3.0/docs/ROADMAP.md +0 -174
- {biblicus-0.3.0 → biblicus-0.4.0}/LICENSE +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/MANIFEST.in +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/THIRD_PARTY_NOTICES.md +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/datasets/wikipedia_mini.json +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/ARCHITECTURE.md +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/BACKENDS.md +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/CORPUS_DESIGN.md +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/FEATURE_INDEX.md +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/TESTING.md +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/USER_CONFIGURATION.md +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/api.rst +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/docs/index.rst +6 -6
- {biblicus-0.3.0 → biblicus-0.4.0}/features/backend_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/biblicus_corpus.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/cli_entrypoint.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/cli_parsing.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/content_sniffing.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/corpus_edge_cases.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/corpus_identity.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/corpus_purge.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/environment.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/error_cases.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/evaluation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/extraction_error_handling.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/extraction_selection.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/extraction_selection_longest.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/extractor_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/frontmatter.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/hook_config_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/hook_error_handling.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/import_tree.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/ingest_sources.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_audio_samples.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_image_samples.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_mixed_corpus.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_mixed_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_ocr_image_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_pdf_retrieval.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_pdf_samples.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_unstructured_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/integration_wikipedia.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/lifecycle_hooks.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/model_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/ocr_extractor.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/pdf_text_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/python_api.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/python_hook_logging.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_budget.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_scan.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_uses_extraction_run.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/retrieval_utilities.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/source_loading.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/backend_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/cli_parsing_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/cli_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/extractor_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/frontmatter_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/model_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/openai_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/pdf_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/python_api_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/rapidocr_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/retrieval_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/stt_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/unstructured_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/steps/user_config_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/streaming_ingest.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/stt_extractor.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/unstructured_extractor.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/features/user_config.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_audio_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_image_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_mixed_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_pdf_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/scripts/download_wikipedia.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/scripts/test.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/setup.cfg +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/__main__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/__init__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/base.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/scan.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/constants.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/errors.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/evaluation.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/__init__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/base.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/metadata_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/openai_stt.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/pass_through_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/pdf_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/pipeline.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/select_longest_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/select_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/extractors/unstructured_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/frontmatter.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/hook_logging.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/hook_manager.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/hooks.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/ignore.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/retrieval.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/sources.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/uris.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus/user_config.py +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/entry_points.txt +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/requires.txt +0 -0
- {biblicus-0.3.0 → biblicus-0.4.0}/src/biblicus.egg-info/top_level.txt +0 -0
{biblicus-0.3.0/src/biblicus.egg-info → biblicus-0.4.0}/PKG-INFO +29 -13
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.
+Version: 0.4.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -77,10 +77,7 @@ flowchart LR
 direction LR
 LegendArtifact[Stored artifact or evidence]
 LegendStep[Step]
-LegendStable[Stable region]
-LegendPluggable[Pluggable region]
 LegendArtifact --- LegendStep
-LegendStable --- LegendPluggable
 end
 
 subgraph Main[" "]
@@ -93,14 +90,14 @@ flowchart LR
 Raw --> Catalog[Catalog file]
 end
 
-subgraph PluggableExtractionPipeline[Pluggable extraction pipeline]
+subgraph PluggableExtractionPipeline[Pluggable: extraction pipeline]
 direction TB
 Catalog --> Extract[Extract pipeline]
 Extract --> ExtractedText[Extracted text artifacts]
 ExtractedText --> ExtractionRun[Extraction run manifest]
 end
 
-subgraph PluggableRetrievalBackend[Pluggable retrieval backend]
+subgraph PluggableRetrievalBackend[Pluggable: retrieval backend]
 direction LR
 
 subgraph BackendIngestionIndexing[Ingestion and indexing]
@@ -154,8 +151,6 @@ flowchart LR
 style Main fill:#ffffff,stroke:#ffffff,color:#111111
 style LegendArtifact fill:#f3e5f5,stroke:#8e24aa,color:#111111
 style LegendStep fill:#eceff1,stroke:#90a4ae,color:#111111
-style LegendStable fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
-style LegendPluggable fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
 ```
 
 ## Practical value
@@ -168,6 +163,7 @@ flowchart LR
 
 - Initialize a corpus folder.
 - Ingest items from file paths, web addresses, or text input.
+- Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
 - Run extraction when you want derived text artifacts from non-text sources.
 - Reindex to refresh the catalog after edits.
 - Build a retrieval run with a backend.
@@ -205,11 +201,22 @@ biblicus init corpora/example
 biblicus ingest --corpus corpora/example notes/example.txt
 echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
 biblicus list --corpus corpora/example
-biblicus extract --corpus corpora/example --step pass-through-text --step metadata-text
+biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
+biblicus extract list --corpus corpora/example
 biblicus build --corpus corpora/example --backend scan
 biblicus query --corpus corpora/example --query "note"
 ```
 
+If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
+
+```
+biblicus crawl --corpus corpora/example \\
+--root-url https://example.com/docs/index.html \\
+--allowed-prefix https://example.com/docs/ \\
+--max-items 50 \\
+--tag crawled
+```
+
 ## Python usage
 
 From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
@@ -233,7 +240,7 @@ In an assistant system, retrieval usually produces context for a model call. Thi
 
 ## Learn more
 
-Full documentation is
+Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
 
 The documents below are written to be read in order.
 
@@ -262,7 +269,16 @@ corpus/
 config.json
 catalog.json
 runs/
-
+extraction/
+pipeline/
+<run id>/
+manifest.json
+text/
+<item id>.txt
+retrieval/
+<backend id>/
+<run id>/
+manifest.json
 ```
 
 ## Retrieval backends
@@ -313,7 +329,7 @@ python3 -m pip install -e ".[dev]"
 Build the documentation:
 
 ```
-python3 -m sphinx -b html docs docs/_build
+python3 -m sphinx -b html docs docs/_build/html
 ```
 
 ## License
@@ -333,4 +349,4 @@ License terms are in `LICENSE`.
 
 [continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
 [coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
-[documentation-badge]: https://
+[documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
{biblicus-0.3.0 → biblicus-0.4.0}/README.md +28 -12
@@ -48,10 +48,7 @@ flowchart LR
 direction LR
 LegendArtifact[Stored artifact or evidence]
 LegendStep[Step]
-LegendStable[Stable region]
-LegendPluggable[Pluggable region]
 LegendArtifact --- LegendStep
-LegendStable --- LegendPluggable
 end
 
 subgraph Main[" "]
@@ -64,14 +61,14 @@ flowchart LR
 Raw --> Catalog[Catalog file]
 end
 
-subgraph PluggableExtractionPipeline[Pluggable extraction pipeline]
+subgraph PluggableExtractionPipeline[Pluggable: extraction pipeline]
 direction TB
 Catalog --> Extract[Extract pipeline]
 Extract --> ExtractedText[Extracted text artifacts]
 ExtractedText --> ExtractionRun[Extraction run manifest]
 end
 
-subgraph PluggableRetrievalBackend[Pluggable retrieval backend]
+subgraph PluggableRetrievalBackend[Pluggable: retrieval backend]
 direction LR
 
 subgraph BackendIngestionIndexing[Ingestion and indexing]
@@ -125,8 +122,6 @@ flowchart LR
 style Main fill:#ffffff,stroke:#ffffff,color:#111111
 style LegendArtifact fill:#f3e5f5,stroke:#8e24aa,color:#111111
 style LegendStep fill:#eceff1,stroke:#90a4ae,color:#111111
-style LegendStable fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
-style LegendPluggable fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
 ```
 
 ## Practical value
@@ -139,6 +134,7 @@ flowchart LR
 
 - Initialize a corpus folder.
 - Ingest items from file paths, web addresses, or text input.
+- Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
 - Run extraction when you want derived text artifacts from non-text sources.
 - Reindex to refresh the catalog after edits.
 - Build a retrieval run with a backend.
@@ -176,11 +172,22 @@ biblicus init corpora/example
 biblicus ingest --corpus corpora/example notes/example.txt
 echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
 biblicus list --corpus corpora/example
-biblicus extract --corpus corpora/example --step pass-through-text --step metadata-text
+biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
+biblicus extract list --corpus corpora/example
 biblicus build --corpus corpora/example --backend scan
 biblicus query --corpus corpora/example --query "note"
 ```
 
+If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
+
+```
+biblicus crawl --corpus corpora/example \\
+--root-url https://example.com/docs/index.html \\
+--allowed-prefix https://example.com/docs/ \\
+--max-items 50 \\
+--tag crawled
+```
+
 ## Python usage
 
 From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
@@ -204,7 +211,7 @@ In an assistant system, retrieval usually produces context for a model call. Thi
 
 ## Learn more
 
-Full documentation is
+Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
 
 The documents below are written to be read in order.
 
@@ -233,7 +240,16 @@ corpus/
 config.json
 catalog.json
 runs/
-
+extraction/
+pipeline/
+<run id>/
+manifest.json
+text/
+<item id>.txt
+retrieval/
+<backend id>/
+<run id>/
+manifest.json
 ```
 
 ## Retrieval backends
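The run layout added in this hunk can be read as a set of path rules. The sketch below is one hedged reading of it, not code from the package: the helper names are hypothetical, and the nesting (`extraction/<extractor id>/<run id>/...` and `retrieval/<backend id>/<run id>/...` under `runs/`) is an assumption, because the diff rendering does not preserve the tree's indentation.

```python
from pathlib import Path


def extraction_run_dir(runs_dir, extractor_id, run_id):
    """Directory assumed to hold one extraction run (hypothetical helper)."""
    return Path(runs_dir) / "extraction" / extractor_id / run_id


def extracted_text_path(runs_dir, extractor_id, run_id, item_id):
    """Text artifact for one item inside an extraction run (assumed nesting)."""
    return extraction_run_dir(runs_dir, extractor_id, run_id) / "text" / f"{item_id}.txt"


def retrieval_manifest_path(runs_dir, backend_id, run_id):
    """Manifest for one retrieval run of one backend (assumed nesting)."""
    return Path(runs_dir) / "retrieval" / backend_id / run_id / "manifest.json"


# RUN_ID and ITEM_ID are placeholders, not real identifiers.
print(extracted_text_path("runs", "pipeline", "RUN_ID", "ITEM_ID").as_posix())
print(retrieval_manifest_path("runs", "scan", "RUN_ID").as_posix())
```

Keeping extraction and retrieval runs in separate subtrees keyed by extractor or backend is what would let several runs coexist over the same raw items.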
@@ -284,7 +300,7 @@ python3 -m pip install -e ".[dev]"
 Build the documentation:
 
 ```
-python3 -m sphinx -b html docs docs/_build
+python3 -m sphinx -b html docs docs/_build/html
 ```
 
 ## License
@@ -304,4 +320,4 @@ License terms are in `LICENSE`.
 
 [continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
 [coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
-[documentation-badge]: https://
+[documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
{biblicus-0.3.0 → biblicus-0.4.0}/docs/CORPUS.md +14 -1
@@ -43,6 +43,20 @@ Ingest a web address:
 python3 -m biblicus ingest --corpus corpora/example https://example.com --tag web
 ```
 
+## Crawl a website prefix
+
+To build a corpus from a website section, crawl a root uniform resource locator and restrict the crawl to an allowed prefix.
+
+```
+python3 -m biblicus crawl --corpus corpora/example \\
+--root-url https://example.com/docs/index.html \\
+--allowed-prefix https://example.com/docs/ \\
+--max-items 50 \\
+--tag crawled
+```
+
+The crawl command only follows links within the allowed prefix, and it respects `.biblicusignore` patterns against the path relative to the allowed prefix.
+
 Ingest a text note:
 
 ```
@@ -100,4 +114,3 @@ Purging deletes all items and derived artifacts under the corpus. It requires yo
 ```
 python3 -m biblicus purge --corpus corpora/example --confirm example
 ```
-
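The crawl rule stated in the `docs/CORPUS.md` change (follow only links under the allowed prefix, and match ignore patterns against the path relative to that prefix) can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `src/biblicus/crawl.py`: the function name `should_visit` is hypothetical, and `fnmatch`-style globs stand in for whatever pattern syntax `.biblicusignore` really uses.

```python
from fnmatch import fnmatch
from urllib.parse import urldefrag


def should_visit(url, allowed_prefix, ignore_patterns):
    """Return True when a hypothetical crawler should fetch `url`."""
    # Drop any #fragment so one page is not treated as several URLs.
    url, _fragment = urldefrag(url)
    # Rule 1: stay inside the allowed prefix.
    if not url.startswith(allowed_prefix):
        return False
    # Rule 2: match ignore patterns against the path relative to the prefix.
    relative_path = url[len(allowed_prefix):]
    return not any(fnmatch(relative_path, pattern) for pattern in ignore_patterns)


prefix = "https://example.com/docs/"
patterns = ["drafts/*"]  # assumed glob syntax; the real ignore rules may differ
print(should_visit("https://example.com/docs/guide.html", prefix, patterns))
print(should_visit("https://example.com/blog/post.html", prefix, patterns))
print(should_visit("https://example.com/docs/drafts/wip.html", prefix, patterns))
```

Matching against the relative path means the same ignore file can be reused if the site moves to a different host or base path.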
{biblicus-0.3.0 → biblicus-0.4.0}/docs/DEMOS.md +49 -9
@@ -133,6 +133,46 @@ The catalog is rebuildable. You can edit raw files or sidecar metadata, then ref
 python3 -m biblicus reindex --corpus corpora/demo
 ```
 
+### Crawl a website prefix
+
+To turn a website section into corpus items, crawl a root page and restrict the crawl to an allowed prefix.
+
+In one terminal, create a tiny local website and serve it:
+
+```
+rm -rf /tmp/biblicus-site
+mkdir -p /tmp/biblicus-site/site/subdir
+cat > /tmp/biblicus-site/site/index.html <<'HTML'
+<html>
+<body>
+<a href="page.html">Page</a>
+<a href="subdir/">Subdir</a>
+</body>
+</html>
+HTML
+cat > /tmp/biblicus-site/site/page.html <<'HTML'
+<html><body>hello</body></html>
+HTML
+cat > /tmp/biblicus-site/site/subdir/index.html <<'HTML'
+<html><body>subdir</body></html>
+HTML
+
+python3 -m http.server 8000 --directory /tmp/biblicus-site
+```
+
+In another terminal:
+
+```
+rm -rf corpora/crawl-demo
+python3 -m biblicus init corpora/crawl-demo
+python3 -m biblicus crawl --corpus corpora/crawl-demo \\
+--root-url http://127.0.0.1:8000/site/index.html \\
+--allowed-prefix http://127.0.0.1:8000/site/ \\
+--max-items 50 \\
+--tag crawled
+python3 -m biblicus list --corpus corpora/crawl-demo
+```
+
 ### Build an extraction run
 
 Text extraction is a separate pipeline stage from retrieval. An extraction run produces derived text artifacts under the corpus.
@@ -140,7 +180,7 @@ Text extraction is a separate pipeline stage from retrieval. An extraction run p
 This extractor reads text items and skips non-text items.
 
 ```
-python3 -m biblicus extract --corpus corpora/demo --step pass-through-text
+python3 -m biblicus extract build --corpus corpora/demo --step pass-through-text
 ```
 
 The output includes a `run_id` you can reuse when building a retrieval backend.
@@ -150,7 +190,7 @@ The output includes a `run_id` you can reuse when building a retrieval backend.
 When you want an explicit choice among multiple extraction outputs, add a selection extractor step at the end of the pipeline.
 
 ```
-python3 -m biblicus extract --corpus corpora/demo \\
+python3 -m biblicus extract build --corpus corpora/demo \\
 --step pass-through-text \\
 --step metadata-text \\
 --step select-text
@@ -171,7 +211,7 @@ This example downloads a small set of public Portable Document Format files, ext
 rm -rf corpora/pdf_samples
 python3 scripts/download_pdf_samples.py --corpus corpora/pdf_samples --force
 
-python3 -m biblicus extract --corpus corpora/pdf_samples --step pdf-text
+python3 -m biblicus extract build --corpus corpora/pdf_samples --step pdf-text
 ```
 
 Copy the `run_id` from the JavaScript Object Notation output. You will use it as `PDF_EXTRACTION_RUN_ID` in the next command.
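Copying the `run_id` by hand can be replaced with a few lines of Python. The sketch below assumes the command prints a single JSON object with a top-level `run_id` field; only the field name comes from the documentation, so the sample payload and the function name are made up for illustration.

```python
import json


def extraction_run_reference(command_output, extractor_id="pipeline"):
    """Build an `extractor_id:run_id` reference from JSON command output.

    Assumes `run_id` is a top-level key; adjust if the real output nests it.
    """
    payload = json.loads(command_output)
    return f"{extractor_id}:{payload['run_id']}"


# Made-up payload: the documentation only states that a `run_id` is present.
sample_output = '{"run_id": "EXAMPLE_RUN_ID"}'
print(extraction_run_reference(sample_output))
```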
@@ -211,7 +251,7 @@ python3 -m pip install "biblicus[ocr]"
 Then build an extraction run:
 
 ```
-python3 -m biblicus extract --corpus corpora/image_samples --step ocr-rapidocr
+python3 -m biblicus extract build --corpus corpora/image_samples --step ocr-rapidocr
 ```
 
 ### Optional: Unstructured as a last-resort extractor
@@ -227,7 +267,7 @@ python3 -m pip install "biblicus[unstructured]"
 Then build an extraction run:
 
 ```
-python3 -m biblicus extract --corpus corpora/pdf_samples --step unstructured
+python3 -m biblicus extract build --corpus corpora/pdf_samples --step unstructured
 ```
 
 To see Unstructured handle a non-Portable-Document-Format format, use the mixed corpus demo, which includes a `.docx` sample:
@@ -235,13 +275,13 @@ To see Unstructured handle a non-Portable-Document-Format format, use the mixed
 ```
 rm -rf corpora/mixed_samples
 python3 scripts/download_mixed_samples.py --corpus corpora/mixed_samples --force
-python3 -m biblicus extract --corpus corpora/mixed_samples --step unstructured
+python3 -m biblicus extract build --corpus corpora/mixed_samples --step unstructured
 ```
 
 When you want to prefer one extractor over another for the same item types, order the steps and end with `select-text`:
 
 ```
-python3 -m biblicus extract --corpus corpora/pdf_samples \\
+python3 -m biblicus extract build --corpus corpora/pdf_samples \\
 --step unstructured \\
 --step pdf-text \\
 --step select-text
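One way to picture why step order matters before `select-text` is a "first usable output wins" rule. The sketch below assumes that rule; the diff does not show how `select-text` chooses, so treat `select_first_usable` as a hypothetical stand-in rather than the extractor's real logic.

```python
def select_first_usable(step_outputs):
    """Return (step_name, text) for the earliest step whose text is non-blank.

    `step_outputs` is a list of (step_name, text) pairs in pipeline order.
    Returns None when no step produced usable text.
    """
    for step_name, text in step_outputs:
        if text and text.strip():
            return step_name, text
    return None


# Made-up outputs for one item: the first step found nothing, the second did.
outputs = [("unstructured", ""), ("pdf-text", "Quarterly report")]
print(select_first_usable(outputs))
```

Under this assumption, listing `unstructured` before `pdf-text` prefers the former whenever it yields text and falls back to the latter otherwise.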
@@ -263,7 +303,7 @@ python3 -m biblicus list --corpus corpora/audio_samples
 If you only want a metadata-only baseline, extract `metadata-text`:
 
 ```
-python3 -m biblicus extract --corpus corpora/audio_samples --step metadata-text
+python3 -m biblicus extract build --corpus corpora/audio_samples --step metadata-text
 ```
 
 For real speech to text transcription with the OpenAI backend, install the optional dependency and set an API key:
@@ -272,7 +312,7 @@ For real speech to text transcription with the OpenAI backend, install the optio
 python3 -m pip install "biblicus[openai]"
 mkdir -p .biblicus
 printf "openai:\n api_key: ...\n" > .biblicus/config.yml
-python3 -m biblicus extract --corpus corpora/audio_samples --step stt-openai
+python3 -m biblicus extract build --corpus corpora/audio_samples --step stt-openai
 ```
 
 ### Build and query the minimal backend
{biblicus-0.3.0 → biblicus-0.4.0}/docs/EXTRACTION.md +19 -2
@@ -148,7 +148,7 @@ python3 -m biblicus init corpora/extraction-demo
 printf 'x' > /tmp/image.png
 python3 -m biblicus ingest --corpus corpora/extraction-demo /tmp/image.png --tag extracted
 
-python3 -m biblicus extract --corpus corpora/extraction-demo \\
+python3 -m biblicus extract build --corpus corpora/extraction-demo \\
 --step pass-through-text \\
 --step pdf-text \\
 --step metadata-text
@@ -161,7 +161,7 @@ The extracted text for the image comes from the `metadata-text` step because the
 Selection is a pipeline step that chooses extracted text from previous pipeline steps. Selection is just another extractor in the pipeline, and it decides which prior output to carry forward.
 
 ```
-python3 -m biblicus extract --corpus corpora/extraction-demo \\
+python3 -m biblicus extract build --corpus corpora/extraction-demo \\
 --step pass-through-text \\
 --step metadata-text \\
 --step select-text
@@ -169,6 +169,23 @@ python3 -m biblicus extract --corpus corpora/extraction-demo \\
 
 The pipeline run produces one extraction run under `pipeline`. You can point retrieval backends at that run.
 
+## Inspecting and deleting extraction runs
+
+Extraction runs are stored under the corpus and can be listed and inspected.
+
+```
+python3 -m biblicus extract list --corpus corpora/extraction-demo
+python3 -m biblicus extract show --corpus corpora/extraction-demo --run pipeline:EXTRACTION_RUN_ID
+```
+
+Deletion is explicit and requires typing the exact run reference as confirmation:
+
+```
+python3 -m biblicus extract delete --corpus corpora/extraction-demo \\
+--run pipeline:EXTRACTION_RUN_ID \\
+--confirm pipeline:EXTRACTION_RUN_ID
+```
+
 ## Use extracted text in retrieval
 
 Retrieval backends can build and query using a selected extraction run. This is configured by passing `extraction_run=extractor_id:run_id` to the backend build command.
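The `extractor_id:run_id` reference and the "type the exact reference to confirm" rule in this hunk can be sketched in a few lines. The function names are hypothetical and the validation shown is an assumption about what a careful implementation might check; the package's own parsing may differ.

```python
def parse_run_reference(reference):
    """Split `extractor_id:run_id` into its two parts, rejecting malformed input."""
    extractor_id, separator, run_id = reference.partition(":")
    if not separator or not extractor_id or not run_id:
        raise ValueError(f"expected extractor_id:run_id, got {reference!r}")
    return extractor_id, run_id


def deletion_confirmed(run_reference, confirmation):
    """Allow deletion only when the confirmation repeats the reference exactly."""
    parse_run_reference(run_reference)  # fail early on a malformed reference
    return confirmation == run_reference


print(parse_run_reference("pipeline:EXTRACTION_RUN_ID"))
print(deletion_confirmed("pipeline:EXTRACTION_RUN_ID", "pipeline:EXTRACTION_RUN_ID"))
print(deletion_confirmed("pipeline:EXTRACTION_RUN_ID", "EXTRACTION_RUN_ID"))
```

Requiring the full reference, rather than a yes/no prompt, makes it harder to delete the wrong run by reflex.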
@@ -0,0 +1,200 @@
|
|
|
1
|
+
# Roadmap
|
|
2
|
+
|
|
3
|
+
This document is the ordered plan for what to build next.
|
|
4
|
+
|
|
5
|
+
If you are looking for runnable examples, see `docs/DEMOS.md`.
|
|
6
|
+
|
|
7
|
+
## Principles
|
|
8
|
+
|
|
9
|
+
- Behavior specifications are the authoritative definition of behavior.
|
|
10
|
+
- Every behavior that exists is specified.
|
|
11
|
+
- Validation and documentation are part of the product.
|
|
12
|
+
- Raw corpus items remain readable, portable files.
|
|
13
|
+
- Derived artifacts are stored under the corpus and can coexist for multiple implementations.
|
|
14
|
+
|
|
15
|
+
## Current state
|
|
16
|
+
|
|
17
|
+
Version zero includes:
|
|
18
|
+
|
|
19
|
+
- A file based corpus with ingestion, catalog rebuild, import, ignore rules, and lifecycle hooks.
|
|
20
|
+
- A retrieval baseline (`scan`) and a practical local backend (`sqlite-full-text-search`).
|
|
21
|
+
- A separate text extraction stage with extraction runs and a composable extractor pipeline.
|
|
22
|
+
- Selection extractor steps that choose extracted text within a pipeline.
|
|
23
|
+
- A speech to text extractor plugin (`stt-openai`) implemented as an optional dependency.
|
|
24
|
+
- An optical character recognition extractor plugin (`ocr-rapidocr`) implemented as an optional dependency.
|
|
25
|
+
- A broad catchall extractor plugin (`unstructured`) implemented as an optional dependency.
|
|
26
|
+
- Integration corpora that include deterministic non-text cases such as a blank Portable Document Format file and a silence Waveform Audio File Format clip.
|
|
27
|
+
|
|
28
|
+
Milestones 1 through 4 are complete. Milestone 5 is deferred, so the next planned work begins at Milestone 6.
|
|
29
|
+
|
|
30
|
+
## Near-term focus
|
|
31
|
+
|
|
32
|
+
The next work will focus on the retrieval side of the pipeline:
|
|
33
|
+
|
|
34
|
+
- Make retrieval runs and evidence production a simple, practical “minimum viable product”.
|
|
35
|
+
- Add explicit evidence quality stages (rerank and filter) that are easy to compose, test, and evaluate.
|
|
36
|
+
- Expand retrieval evaluation so it is easy to compare backends using the same corpora and datasets.
|
|
37
|
+
|
|
38
|
+
Lower-priority work related to corpus ingestion conveniences and extractor evaluation remains valuable, but it is deferred while we make retrieval practical end to end.
|
|
39
|
+
|
|
40
|
+
## Milestones
|
|
41
|
+
|
|
42
|
+
### Milestone 1: Artifact lifecycle and storage layout
|
|
43
|
+
|
|
44
|
+
Goal: make derived artifacts easy to inspect, compare, and retain across multiple extraction implementations.
|
|
45
|
+
|
|
46
|
+
Status: complete.
|
|
47
|
+
|
|
48
|
+
Deliverables:
|
|
49
|
+
|
|
50
|
+
- A stable on-disk layout for extracted artifacts that partitions by extraction recipe and extractor identity.
|
|
51
|
+
- A clear, human-readable manifest for each extraction run that includes configuration, timing, and summary stats.
|
|
52
|
+
- Corpus-level tooling to list, inspect, and delete derived artifacts without touching raw items.
|
|
53
|
+
|
|
54
|
+
Acceptance checks:
|
|
55
|
+
|
|
56
|
+
- Raw items remain readable, portable files in `raw/`.
|
|
57
|
+
- Derived artifacts can coexist for multiple extractors and multiple recipes over the same raw items.
|
|
58
|
+
- Behavior specifications cover artifact layout and lifecycle operations.
|
|
59
|
+
|
|
60
|
+
### Milestone 2: Idempotency and change detection
|
|
61
|
+
|
|
62
|
+
Goal: make extraction runs repeatable, fast, and safe by skipping work when nothing relevant changed.
|
|
63
|
+
|
|
64
|
+
Status: complete.
|
|
65
|
+
|
|
66
|
+
Deliverables:
|
|
67
|
+
|
|
68
|
+
- Change detection for extraction inputs (raw bytes identity) and extraction settings (extractor identity and configuration).
|
|
69
|
+
- Extraction run behavior that cleanly separates “skipped because already present” from “skipped because unsupported”.
|
|
70
|
+
- A simple “rebuild” workflow that is explicit and safe: delete an extraction run, then build it again.
|
|
71
|
+
|
|
72
|
+
Acceptance checks:
|
|
73
|
+
|
|
74
|
+
- Running the same extraction recipe twice produces the same outputs and reports predictable skip counts.
|
|
75
|
+
- Behavior specifications cover idempotency and change detection outcomes.
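One way to get idempotent runs is to derive the run identifier from the inputs that matter. The sketch below hashes a canonical JSON form of the recipe together with a catalog version string. The function name, the payload shape, and the choice of SHA-256 are assumptions made for illustration; the roadmap does not say how biblicus actually computes its identifiers.

```python
import hashlib
import json
from typing import Any, Dict


def deterministic_run_id(recipe: Dict[str, Any], catalog_version: str) -> str:
    """Return a stable identifier for a (recipe, catalog version) pair.

    Hypothetical sketch: keys are sorted so that two recipes with the same
    content but different key order produce the same identifier.
    """
    payload = json.dumps(
        {"recipe": recipe, "catalog_version": catalog_version},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With an identifier like this, running the same recipe against an unchanged catalog resolves to the same run, so existing outputs can be detected and skipped.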
|
|
76
|
+
|
|
77
|
+
### Milestone 3: Failure semantics and reporting
|
|
78
|
+
|
|
79
|
+
Goal: make extraction outcomes diagnosable and measurable without reading log output.
|
|
80
|
+
|
|
81
|
+
Status: complete.
|
|
82
|
+
|
|
83
|
+
Deliverables:
|
|
84
|
+
|
|
85
|
+
- A clear set of extraction outcome categories (success, empty output, skipped, fatal error) with structured reasons.
|
|
86
|
+
- Per-run reporting that summarizes outcomes and provides a path to per-item details.
|
|
87
|
+
- Consistent, user-facing errors when optional dependencies or required configuration are missing.
|
|
88
|
+
|
|
89
|
+
Acceptance checks:
|
|
90
|
+
|
|
91
|
+
- Behavior specifications cover error classification and summary reporting.
|
|
92
|
+
- Reports remain deterministic for the same corpus and recipe.
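The outcome categories above lend themselves to an enumeration plus a small summary function. This is a minimal sketch with hypothetical names (`Outcome`, `ItemResult`, `summarize`); the actual biblicus models may be shaped differently.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class Outcome(str, Enum):
    """The four outcome categories named in the deliverables."""

    SUCCESS = "success"
    EMPTY_OUTPUT = "empty_output"
    SKIPPED = "skipped"
    FATAL_ERROR = "fatal_error"


@dataclass(frozen=True)
class ItemResult:
    """Hypothetical per-item record with a structured reason."""

    item_id: str
    outcome: Outcome
    reason: Optional[str] = None


def summarize(results: Iterable[ItemResult]) -> Dict[str, int]:
    """Count results per outcome, always listing every category in a fixed order."""
    counts = Counter(result.outcome for result in results)
    return {outcome.value: counts.get(outcome, 0) for outcome in Outcome}
```

Emitting every category, including zero counts, in enumeration order keeps the summary deterministic for the same corpus and recipe.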
|
|
93
|
+
|
|
94
|
+
### Milestone 4: Corpus import and crawl utilities
|
|
95
|
+
|
|
96
|
+
Goal: make it easy to build a corpus from real-world sources while keeping the corpus readable and portable.
|
|
97
|
+
|
|
98
|
+
Status: complete.
|
|
99
|
+
|
|
100
|
+
Deliverables:
|
|
101
|
+
|
|
102
|
+
- Folder tree import ergonomics: stable naming, media type detection, and predictable metadata sidecars.
|
|
103
|
+
- A website crawl command that stays within an allow-listed uniform resource locator prefix and respects `.biblicusignore`.
|
|
104
|
+
- Integration downloads that produce a small, realistic, repeatable corpus for experimentation without committing third-party content to the repository.
|
|
105
|
+
|
|
106
|
+
Acceptance checks:
|
|
107
|
+
|
|
108
|
+
- The crawl and import workflows are fully specified with behavior specifications.
|
|
109
|
+
- Integration corpora remain gitignored, and can be regenerated from scripts.
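Prefix enforcement for a crawl can be illustrated with the standard library alone. The helper below is a hypothetical sketch, not the code in `src/biblicus/crawl.py`: it resolves a link against the page it was found on, drops the fragment, and keeps the result only if it starts with the allowed prefix. It does not model `.biblicusignore` filtering.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin


def resolve_allowed_link(page_url: str, href: str, allowed_prefix: str) -> Optional[str]:
    """Return the absolute link if it stays under the allowed prefix, else None.

    Hypothetical helper. The prefix check is a plain string comparison, so the
    prefix should end with "/" to avoid matching sibling paths such as
    ".../docs-old/" when ".../docs" was intended.
    """
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    return absolute if absolute.startswith(allowed_prefix) else None
```

Resolving relative links before the comparison matters: a link such as `../other/page.html` looks harmless as text but can leave the allowed prefix once it is made absolute.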
|
|
110
|
+
|
|
111
|
+
### Milestone 6: Evidence quality stages
|
|
112
|
+
|
|
113
|
+
Goal: add explicit rerank and filter stages to retrieval.
|
|
114
|
+
|
|
115
|
+
Status: next.
|
|
116
|
+
|
|
117
|
+
Deliverables:
|
|
118
|
+
|
|
119
|
+
- A rerank stage interface that takes evidence and returns reordered evidence.
|
|
120
|
+
- A filter stage interface that applies metadata and source constraints.
|
|
121
|
+
- Documentation that explains how to configure budgets and stage ordering.
|
|
122
|
+
|
|
123
|
+
Acceptance checks:
|
|
124
|
+
|
|
125
|
+
- Behavior specs cover the new stages.
|
|
126
|
+
- Evaluation reports show per-stage metrics and final metrics.
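Since this milestone is not yet built, the following is only one possible shape for composable stages, with every name (`Evidence`, `Stage`, `rerank_by_score`, `filter_by_metadata`, `run_stages`) invented for the example. Each stage takes a sequence of evidence and returns a new list, so rerank and filter stages share one signature and can be chained in any order.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record: an item, a score, and string metadata."""

    item_id: str
    score: float
    metadata: Dict[str, str] = field(default_factory=dict)


# A stage maps evidence to evidence; rerank and filter stages both fit this type.
Stage = Callable[[Sequence[Evidence]], List[Evidence]]


def rerank_by_score(evidence: Sequence[Evidence]) -> List[Evidence]:
    """Rerank stage: reorder evidence by descending score (stable for ties)."""
    return sorted(evidence, key=lambda item: item.score, reverse=True)


def filter_by_metadata(key: str, value: str) -> Stage:
    """Build a filter stage that keeps evidence whose metadata[key] equals value."""

    def stage(evidence: Sequence[Evidence]) -> List[Evidence]:
        return [item for item in evidence if item.metadata.get(key) == value]

    return stage


def run_stages(evidence: Sequence[Evidence], stages: Sequence[Stage]) -> List[Evidence]:
    """Apply each stage in order and return the final evidence list."""
    current = list(evidence)
    for stage in stages:
        current = stage(current)
    return current
```

Keeping stages as plain callables makes each one testable in isolation, which is what per-stage evaluation metrics would need.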
|
|
127
|
+
|
|
128
|
+
### Milestone 7: Evaluation reports and datasets
|
|
129
|
+
|
|
130
|
+
Goal: make evaluation results easier to interpret and compare.
|
|
131
|
+
|
|
132
|
+
Status: next.
|
|
133
|
+
|
|
134
|
+
Deliverables:
|
|
135
|
+
|
|
136
|
+
- A dataset authoring workflow that supports small hand-labeled sets and larger synthetic sets.
|
|
137
|
+
- A report that includes per-query diagnostics and a clear summary.
|
|
138
|
+
|
|
139
|
+
Acceptance checks:
|
|
140
|
+
|
|
141
|
+
- The existing dataset format remains stable or is versioned.
|
|
142
|
+
- Reports remain deterministic for the same inputs.
|
|
143
|
+
|
|
144
|
+
### Milestone 8: Pluggable backend hosting modes
|
|
145
|
+
|
|
146
|
+
Goal: add one reference backend in an external process or remote service mode.
|
|
147
|
+
|
|
148
|
+
Status: later.
|
|
149
|
+
|
|
150
|
+
Deliverables:
|
|
151
|
+
|
|
152
|
+
- A tool server that exposes a backend through a stable interface.
|
|
153
|
+
- Documentation that shows how to run a backend out of process and connect to it.
|
|
154
|
+
|
|
155
|
+
Acceptance checks:
|
|
156
|
+
|
|
157
|
+
- Local tests remain fast and deterministic.
|
|
158
|
+
- Integration tests validate end-to-end retrieval through the tool boundary.
|
|
159
|
+
|
|
160
|
+
## Where to put design notes
|
|
161
|
+
|
|
162
|
+
Design notes live in `docs/` so they are easy to browse and cross link.
|
|
163
|
+
|
|
164
|
+
Executable behavior lives in `features/*.feature`.
|
|
165
|
+
|
|
166
|
+
## Completed milestones (version zero)
|
|
167
|
+
|
|
168
|
+
These milestones are complete as of version zero, and are maintained through behavior specifications:
|
|
169
|
+
|
|
170
|
+
- Portable Document Format text extraction (`pdf-text`).
|
|
171
|
+
- Optical character recognition extraction (`ocr-rapidocr`).
|
|
172
|
+
- Catchall extraction for wide format coverage (`unstructured`).
|
|
173
|
+
- Selection extractor steps (`select-text`, `select-longest-text`).
|
|
174
|
+
|
|
175
|
+
## Completed milestones (post version zero)
|
|
176
|
+
|
|
177
|
+
These milestones are complete after version zero, and remain defined by behavior specifications:
|
|
178
|
+
|
|
179
|
+
- Extraction run lifecycle operations (`extract list`, `extract show`, `extract delete`) and a stable artifact layout.
|
|
180
|
+
- Deterministic extraction run identifiers based on recipe and catalog version (idempotent extraction runs).
|
|
181
|
+
- Crawl ingestion (`crawl`) with allow-listed prefix enforcement and `.biblicusignore` filtering.
|
|
182
|
+
|
|
183
|
+
## Deferred milestones
|
|
184
|
+
|
|
185
|
+
These milestones remain planned, but are not the near-term focus.
|
|
186
|
+
|
|
187
|
+
### Milestone 5: Extractor datasets and evaluation harness (deferred)
|
|
188
|
+
|
|
189
|
+
Goal: compare extraction approaches in a way that is measurable, repeatable, and useful for practical engineering decisions.
|
|
190
|
+
|
|
191
|
+
Deliverables:
|
|
192
|
+
|
|
193
|
+
- Dataset authoring workflow for extraction ground truth (for example: expected transcripts and expected optical character recognition text).
|
|
194
|
+
- Evaluation metrics for accuracy, speed, and cost, including “processable fraction” for a given extractor recipe.
|
|
195
|
+
- A report format that can compare multiple extraction recipes against the same corpus and dataset.
|
|
196
|
+
|
|
197
|
+
Acceptance checks:
|
|
198
|
+
|
|
199
|
+
- Evaluation results are stable and reproducible for the same corpus and dataset inputs.
|
|
200
|
+
- Reports make it clear when an extractor fails to process an item versus producing empty output.
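The roadmap does not define “processable fraction”, so the sketch below fixes one plausible reading as an explicit assumption: an item counts as processed if the recipe produced a result for it, including empty output, and as unprocessed if it was skipped or failed. The outcome strings and the function name are hypothetical.

```python
from typing import Sequence

# Assumption: empty output still counts as "processed", because the extractor
# ran to completion; skipped and failed items do not.
PROCESSED_OUTCOMES = frozenset({"success", "empty_output"})


def processable_fraction(outcomes: Sequence[str]) -> float:
    """Return the share of items with a processed outcome, or 0.0 for no items."""
    if not outcomes:
        return 0.0
    processed = sum(1 for outcome in outcomes if outcome in PROCESSED_OUTCOMES)
    return processed / len(outcomes)
```

Separating empty output from failure in the input data is what lets a report distinguish the two cases named in the acceptance checks.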
|