biblicus 0.3.0__tar.gz → 0.5.0__tar.gz
- {biblicus-0.3.0/src/biblicus.egg-info → biblicus-0.5.0}/PKG-INFO +273 -112
- biblicus-0.5.0/README.md +468 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/ARCHITECTURE.md +1 -0
- biblicus-0.5.0/docs/CONTEXT_PACK.md +61 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/CORPUS.md +14 -1
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/DEMOS.md +49 -9
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/EXTRACTION.md +19 -2
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/FEATURE_INDEX.md +20 -0
- biblicus-0.5.0/docs/ROADMAP.md +81 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/api.rst +4 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/conf.py +0 -1
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/index.rst +7 -6
- biblicus-0.5.0/features/context_pack.feature +42 -0
- biblicus-0.5.0/features/context_pack_cli.feature +29 -0
- biblicus-0.5.0/features/crawl.feature +81 -0
- biblicus-0.5.0/features/evidence_processing.feature +25 -0
- biblicus-0.5.0/features/extraction_run_lifecycle.feature +117 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/extractor_pipeline.feature +3 -3
- biblicus-0.5.0/features/query_processing.feature +27 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/cli_steps.py +102 -0
- biblicus-0.5.0/features/steps/context_pack_steps.py +115 -0
- biblicus-0.5.0/features/steps/crawl_steps.py +68 -0
- biblicus-0.5.0/features/steps/evidence_processing_steps.py +47 -0
- biblicus-0.5.0/features/steps/extraction_run_lifecycle_steps.py +148 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/extraction_steps.py +38 -6
- {biblicus-0.3.0 → biblicus-0.5.0}/features/text_extraction_runs.feature +1 -1
- biblicus-0.5.0/features/token_budget.feature +37 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/pyproject.toml +1 -1
- biblicus-0.5.0/scripts/readme_end_to_end_demo.py +81 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/__init__.py +1 -1
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/cli.py +236 -7
- biblicus-0.5.0/src/biblicus/context.py +183 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/corpus.py +170 -1
- biblicus-0.5.0/src/biblicus/crawl.py +186 -0
- biblicus-0.5.0/src/biblicus/evidence_processing.py +201 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extraction.py +4 -2
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/models.py +31 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/time.py +1 -1
- {biblicus-0.3.0 → biblicus-0.5.0/src/biblicus.egg-info}/PKG-INFO +273 -112
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus.egg-info/SOURCES.txt +16 -0
- biblicus-0.3.0/README.md +0 -307
- biblicus-0.3.0/docs/ROADMAP.md +0 -174
- {biblicus-0.3.0 → biblicus-0.5.0}/LICENSE +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/MANIFEST.in +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/THIRD_PARTY_NOTICES.md +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/datasets/wikipedia_mini.json +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/BACKENDS.md +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/CORPUS_DESIGN.md +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/TESTING.md +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/docs/USER_CONFIGURATION.md +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/backend_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/biblicus_corpus.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/cli_entrypoint.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/cli_parsing.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/content_sniffing.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/corpus_edge_cases.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/corpus_identity.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/corpus_purge.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/environment.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/error_cases.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/evaluation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/extraction_error_handling.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/extraction_selection.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/extraction_selection_longest.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/extractor_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/frontmatter.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/hook_config_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/hook_error_handling.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/import_tree.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/ingest_sources.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_audio_samples.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_image_samples.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_mixed_corpus.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_mixed_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_ocr_image_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_pdf_retrieval.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_pdf_samples.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_unstructured_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/integration_wikipedia.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/lifecycle_hooks.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/model_validation.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/ocr_extractor.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/pdf_text_extraction.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/python_api.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/python_hook_logging.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/retrieval_budget.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/retrieval_scan.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/retrieval_uses_extraction_run.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/retrieval_utilities.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/source_loading.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/backend_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/cli_parsing_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/extractor_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/frontmatter_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/model_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/openai_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/pdf_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/python_api_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/rapidocr_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/retrieval_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/stt_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/unstructured_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/steps/user_config_steps.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/streaming_ingest.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/stt_extractor.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/unstructured_extractor.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/features/user_config.feature +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/scripts/download_audio_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/scripts/download_image_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/scripts/download_mixed_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/scripts/download_pdf_samples.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/scripts/download_wikipedia.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/scripts/test.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/setup.cfg +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/__main__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/backends/__init__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/backends/base.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/backends/scan.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/constants.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/errors.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/evaluation.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/__init__.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/base.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/metadata_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/openai_stt.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/pass_through_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/pdf_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/pipeline.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/select_longest_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/select_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/extractors/unstructured_text.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/frontmatter.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/hook_logging.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/hook_manager.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/hooks.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/ignore.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/retrieval.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/sources.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/uris.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus/user_config.py +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus.egg-info/entry_points.txt +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus.egg-info/requires.txt +0 -0
- {biblicus-0.3.0 → biblicus-0.5.0}/src/biblicus.egg-info/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: biblicus
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.5.0
|
|
4
4
|
Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
|
|
5
5
|
License: MIT
|
|
6
6
|
Requires-Python: >=3.9
|
|
@@ -41,11 +41,11 @@ The first practical problem is not retrieval. It is collection and care. You nee
|
|
|
41
41
|
|
|
42
42
|
This library gives you a corpus, which is a normal folder on disk. It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw corpus.
|
|
43
43
|
|
|
44
|
-
It can be used alongside
|
|
44
|
+
It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or your own setup. Use it from Python or from the command line interface.
|
|
45
45
|
|
|
46
46
|
See [retrieval augmented generation overview] for a short introduction to the idea.
|
|
47
47
|
|
|
48
|
-
## A
|
|
48
|
+
## A simple mental model
|
|
49
49
|
|
|
50
50
|
Think in three stages.
|
|
51
51
|
|
|
@@ -63,99 +63,30 @@ If you learn a few project words, the rest of the system becomes predictable.
|
|
|
63
63
|
- Run is a recorded retrieval build for a corpus.
|
|
64
64
|
- Evidence is what retrieval returns, with identifiers and source information.
|
|
65
65
|
|
|
66
|
-
##
|
|
66
|
+
## Where it fits in an assistant
|
|
67
67
|
|
|
68
|
-
|
|
69
|
-
Extraction is introduced here as a separate stage so you can swap extraction approaches without changing the raw corpus.
|
|
70
|
-
The legend shows what the block styles mean.
|
|
71
|
-
Your code is where you decide how to turn evidence into context and how to call a model.
|
|
68
|
+
Biblicus does not answer user questions. It is not a language model. It helps your assistant answer them by retrieving relevant material and returning it as structured evidence. Your code decides how to turn evidence into a context pack for the model call, which is then passed to a model you choose.
|
|
72
69
|
|
|
73
|
-
|
|
74
|
-
%%{init: {"flowchart": {"useMaxWidth": true, "nodeSpacing": 18, "rankSpacing": 22}}}%%
|
|
75
|
-
flowchart LR
|
|
76
|
-
subgraph Legend[Legend]
|
|
77
|
-
direction LR
|
|
78
|
-
LegendArtifact[Stored artifact or evidence]
|
|
79
|
-
LegendStep[Step]
|
|
80
|
-
LegendStable[Stable region]
|
|
81
|
-
LegendPluggable[Pluggable region]
|
|
82
|
-
LegendArtifact --- LegendStep
|
|
83
|
-
LegendStable --- LegendPluggable
|
|
84
|
-
end
|
|
85
|
-
|
|
86
|
-
subgraph Main[" "]
|
|
87
|
-
direction TB
|
|
88
|
-
|
|
89
|
-
subgraph StableCore[Stable core]
|
|
90
|
-
direction TB
|
|
91
|
-
Source[Source items] --> Ingest[Ingest]
|
|
92
|
-
Ingest --> Raw[Raw item files]
|
|
93
|
-
Raw --> Catalog[Catalog file]
|
|
94
|
-
end
|
|
95
|
-
|
|
96
|
-
subgraph PluggableExtractionPipeline[Pluggable extraction pipeline]
|
|
97
|
-
direction TB
|
|
98
|
-
Catalog --> Extract[Extract pipeline]
|
|
99
|
-
Extract --> ExtractedText[Extracted text artifacts]
|
|
100
|
-
ExtractedText --> ExtractionRun[Extraction run manifest]
|
|
101
|
-
end
|
|
102
|
-
|
|
103
|
-
subgraph PluggableRetrievalBackend[Pluggable retrieval backend]
|
|
104
|
-
direction LR
|
|
105
|
-
|
|
106
|
-
subgraph BackendIngestionIndexing[Ingestion and indexing]
|
|
107
|
-
direction TB
|
|
108
|
-
ExtractionRun --> Build[Build run]
|
|
109
|
-
Build --> BackendIndex[Backend index]
|
|
110
|
-
BackendIndex --> Run[Run manifest]
|
|
111
|
-
end
|
|
112
|
-
|
|
113
|
-
subgraph BackendRetrievalGeneration[Retrieval and generation]
|
|
114
|
-
direction TB
|
|
115
|
-
Run --> Query[Query]
|
|
116
|
-
Query --> Evidence[Evidence]
|
|
117
|
-
end
|
|
118
|
-
end
|
|
70
|
+
In a coding assistant, retrieval is often triggered by what the user is doing right now. For example: you are about to propose a user interface change, so you retrieve the user's stated preferences, then you include that as context for the model call.
|
|
119
71
|
|
|
120
|
-
|
|
72
|
+
This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
|
|
121
73
|
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
style ExtractionRun fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
139
|
-
style BackendIndex fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
140
|
-
style Run fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
141
|
-
style Evidence fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
142
|
-
style Context fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
143
|
-
style Answer fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
144
|
-
style Source fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
145
|
-
|
|
146
|
-
style Ingest fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
147
|
-
style Extract fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
148
|
-
style Build fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
149
|
-
style Query fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
150
|
-
style Model fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
151
|
-
end
|
|
152
|
-
|
|
153
|
-
style Legend fill:#ffffff,stroke:#ffffff,color:#111111
|
|
154
|
-
style Main fill:#ffffff,stroke:#ffffff,color:#111111
|
|
155
|
-
style LegendArtifact fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
156
|
-
style LegendStep fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
157
|
-
style LegendStable fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
|
|
158
|
-
style LegendPluggable fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
|
|
74
|
+
```mermaid
|
|
75
|
+
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
|
|
76
|
+
sequenceDiagram
|
|
77
|
+
participant User
|
|
78
|
+
participant App as Your assistant code
|
|
79
|
+
participant Bib as Biblicus
|
|
80
|
+
participant LLM as Large language model
|
|
81
|
+
|
|
82
|
+
User->>App: request
|
|
83
|
+
App->>Bib: query retrieval
|
|
84
|
+
Bib-->>App: retrieval result evidence JSON
|
|
85
|
+
App->>Bib: build context pack from evidence
|
|
86
|
+
Bib-->>App: context pack text
|
|
87
|
+
App->>LLM: context pack plus prompt
|
|
88
|
+
LLM-->>App: response draft
|
|
89
|
+
App-->>User: response
|
|
159
90
|
```
|
|
160
91
|
|
|
161
92
|
## Practical value
|
|
@@ -168,6 +99,7 @@ flowchart LR
|
|
|
168
99
|
|
|
169
100
|
- Initialize a corpus folder.
|
|
170
101
|
- Ingest items from file paths, web addresses, or text input.
|
|
102
|
+
- Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
|
|
171
103
|
- Run extraction when you want derived text artifacts from non-text sources.
|
|
172
104
|
- Reindex to refresh the catalog after edits.
|
|
173
105
|
- Build a retrieval run with a backend.
|
|
@@ -205,11 +137,232 @@ biblicus init corpora/example
|
|
|
205
137
|
biblicus ingest --corpus corpora/example notes/example.txt
|
|
206
138
|
echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
|
|
207
139
|
biblicus list --corpus corpora/example
|
|
208
|
-
biblicus extract --corpus corpora/example --step pass-through-text --step metadata-text
|
|
140
|
+
biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
|
|
141
|
+
biblicus extract list --corpus corpora/example
|
|
209
142
|
biblicus build --corpus corpora/example --backend scan
|
|
210
143
|
biblicus query --corpus corpora/example --query "note"
|
|
211
144
|
```
|
|
212
145
|
|
|
146
|
+
If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
|
|
147
|
+
|
|
148
|
+
```
|
|
149
|
+
biblicus crawl --corpus corpora/example \\
|
|
150
|
+
--root-url https://example.com/docs/index.html \\
|
|
151
|
+
--allowed-prefix https://example.com/docs/ \\
|
|
152
|
+
--max-items 50 \\
|
|
153
|
+
--tag crawled
|
|
154
|
+
```
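The allowed-prefix restriction in the crawl command above can be pictured as a simple address check. The following is a minimal sketch of that idea using only the Python standard library. It is not the code in `src/biblicus/crawl.py`; the helper name `is_within_allowed_prefix` is hypothetical and the actual crawler may normalize addresses differently.

```python
from urllib.parse import urldefrag, urljoin


def is_within_allowed_prefix(page_url: str, link: str, allowed_prefix: str) -> bool:
    """Hypothetical helper: resolve a link and test it against the allowed prefix."""
    # Resolve relative links against the page they were found on.
    absolute_url = urljoin(page_url, link)
    # Drop any #fragment so an anchor on a page is not treated as a new address.
    absolute_url, _fragment = urldefrag(absolute_url)
    # Plain string prefix test, mirroring the --allowed-prefix idea.
    return absolute_url.startswith(allowed_prefix)


root_url = "https://example.com/docs/index.html"
allowed_prefix = "https://example.com/docs/"

# A link below the docs section stays inside the crawl.
print(is_within_allowed_prefix(root_url, "guide/intro.html", allowed_prefix))
# A link that climbs out of the docs section is skipped.
print(is_within_allowed_prefix(root_url, "../blog/post.html", allowed_prefix))
```

A plain prefix test keeps the crawl boundary easy to reason about: anything that does not start with the prefix is never fetched.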
|
|
155
|
+
|
|
156
|
+
## End-to-end example: evidence to assistant context
|
|
157
|
+
|
|
158
|
+
The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
|
|
159
|
+
|
|
160
|
+
Start with a few short “memories” from a chat system. Each memory is stored as a normal item in the corpus.
|
|
161
|
+
|
|
162
|
+
```python
|
|
163
|
+
from biblicus.backends import get_backend
|
|
164
|
+
from biblicus.context import ContextPackPolicy, TokenBudget, build_context_pack, fit_context_pack_to_token_budget
|
|
165
|
+
from biblicus.corpus import Corpus
|
|
166
|
+
from biblicus.models import QueryBudget
|
|
167
|
+
|
|
168
|
+
|
|
169
|
+
corpus = Corpus.init("corpora/story")
|
|
170
|
+
|
|
171
|
+
notes = [
|
|
172
|
+
("User name", "The user's name is Tactus Maximus."),
|
|
173
|
+
("Button style preference", "Primary button style preference: the user's favorite color is magenta."),
|
|
174
|
+
("Style preference", "The user prefers concise answers."),
|
|
175
|
+
("Language preference", "The user dislikes idioms and abbreviations."),
|
|
176
|
+
("Engineering preference", "The user likes code that is over-documented and behavior-driven."),
|
|
177
|
+
]
|
|
178
|
+
for note_title, note_text in notes:
|
|
179
|
+
corpus.ingest_note(note_text, title=note_title, tags=["memory"])
|
|
180
|
+
|
|
181
|
+
backend = get_backend("scan")
|
|
182
|
+
run = backend.build_run(corpus, recipe_name="Story demo", config={})
|
|
183
|
+
budget = QueryBudget(max_total_items=5, max_total_characters=2000, max_items_per_source=None)
|
|
184
|
+
result = backend.query(
|
|
185
|
+
corpus,
|
|
186
|
+
run=run,
|
|
187
|
+
query_text="Primary button style preference",
|
|
188
|
+
budget=budget,
|
|
189
|
+
)
|
|
190
|
+
|
|
191
|
+
policy = ContextPackPolicy(join_with="\n\n")
|
|
192
|
+
context_pack = build_context_pack(result, policy=policy)
|
|
193
|
+
context_pack = fit_context_pack_to_token_budget(
|
|
194
|
+
context_pack,
|
|
195
|
+
policy=policy,
|
|
196
|
+
token_budget=TokenBudget(max_tokens=60),
|
|
197
|
+
)
|
|
198
|
+
print(context_pack.text)
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
If you want a runnable version of this story, use the script at `scripts/readme_end_to_end_demo.py`.
|
|
202
|
+
|
|
203
|
+
If you prefer the command-line interface, here is the same flow in compressed form:
|
|
204
|
+
|
|
205
|
+
```
|
|
206
|
+
biblicus init corpora/story
|
|
207
|
+
biblicus ingest --corpus corpora/story --stdin --title "User name" --tag memory <<< "The user's name is Tactus Maximus."
|
|
208
|
+
biblicus ingest --corpus corpora/story --stdin --title "Button style preference" --tag memory <<< "Primary button style preference: the user's favorite color is magenta."
|
|
209
|
+
biblicus ingest --corpus corpora/story --stdin --title "Style preference" --tag memory <<< "The user prefers concise answers."
|
|
210
|
+
biblicus ingest --corpus corpora/story --stdin --title "Language preference" --tag memory <<< "The user dislikes idioms and abbreviations."
|
|
211
|
+
biblicus ingest --corpus corpora/story --stdin --title "Engineering preference" --tag memory <<< "The user likes code that is over-documented and behavior-driven."
|
|
212
|
+
biblicus build --corpus corpora/story --backend scan
|
|
213
|
+
biblicus query --corpus corpora/story --query "Primary button style preference"
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
Example output:
|
|
217
|
+
|
|
218
|
+
```json
|
|
219
|
+
{
|
|
220
|
+
"query_text": "Primary button style preference",
|
|
221
|
+
"budget": {
|
|
222
|
+
"max_total_items": 5,
|
|
223
|
+
"max_total_characters": 2000,
|
|
224
|
+
"max_items_per_source": null
|
|
225
|
+
},
|
|
226
|
+
"run_id": "RUN_ID",
|
|
227
|
+
"recipe_id": "RECIPE_ID",
|
|
228
|
+
"backend_id": "scan",
|
|
229
|
+
"generated_at": "2026-01-29T00:00:00.000000Z",
|
|
230
|
+
"evidence": [
|
|
231
|
+
{
|
|
232
|
+
"item_id": "ITEM_ID",
|
|
233
|
+
"source_uri": "text",
|
|
234
|
+
"media_type": "text/markdown",
|
|
235
|
+
"score": 1.0,
|
|
236
|
+
"rank": 1,
|
|
237
|
+
"text": "Primary button style preference: the user's favorite color is magenta.",
|
|
238
|
+
"content_ref": null,
|
|
239
|
+
"span_start": null,
|
|
240
|
+
"span_end": null,
|
|
241
|
+
"stage": "scan",
|
|
242
|
+
"recipe_id": "RECIPE_ID",
|
|
243
|
+
"run_id": "RUN_ID",
|
|
244
|
+
"hash": null
|
|
245
|
+
}
|
|
246
|
+
],
|
|
247
|
+
"stats": {}
|
|
248
|
+
}
|
|
249
|
+
```
|
|
250
|
+
|
|
251
|
+
Evidence is the output contract. Your code decides how to convert evidence into assistant context.
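Because evidence is plain JavaScript Object Notation, a script can consume it without importing Biblicus at all. The sketch below assumes only the field names visible in the example output above (`evidence`, `rank`, `text`); the function name `evidence_to_context` is hypothetical and is not part of the Biblicus API.

```python
import json

# A trimmed copy of the example output, keeping only the fields used here.
retrieval_result_json = """
{
  "query_text": "Primary button style preference",
  "evidence": [
    {
      "item_id": "ITEM_ID",
      "score": 1.0,
      "rank": 1,
      "text": "Primary button style preference: the user's favorite color is magenta."
    }
  ]
}
"""


def evidence_to_context(result: dict, join_with: str = "\n\n") -> str:
    """Hypothetical helper: join evidence text in rank order."""
    ranked = sorted(result["evidence"], key=lambda evidence: evidence["rank"])
    # Skip evidence entries that carry no inline text.
    texts = [evidence["text"] for evidence in ranked if evidence.get("text")]
    return join_with.join(texts)


result = json.loads(retrieval_result_json)
print(evidence_to_context(result))
```

In a shell pipeline, the same function could read the output of `biblicus query` from standard input instead of a string literal.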
|
|
252
|
+
|
|
253
|
+
### Turn evidence into a context pack
|
|
254
|
+
|
|
255
|
+
A context pack is a readable text block you send to a model. There is no single correct format. Treat it as a policy surface you can iterate on.
|
|
256
|
+
|
|
257
|
+
Here is a minimal example that builds a context pack from evidence:
|
|
258
|
+
|
|
259
|
+
```python
|
|
260
|
+
from biblicus.context import ContextPackPolicy, build_context_pack
|
|
261
|
+
|
|
262
|
+
|
|
263
|
+
policy = ContextPackPolicy(
|
|
264
|
+
join_with="\n\n",
|
|
265
|
+
)
|
|
266
|
+
context_pack = build_context_pack(result, policy=policy)
|
|
267
|
+
print(context_pack.text)
|
|
268
|
+
```
|
|
269
|
+
|
|
270
|
+
Example context pack output:
|
|
271
|
+
|
|
272
|
+
```text
|
|
273
|
+
Primary button style preference: the user's favorite color is magenta.
|
|
274
|
+
```
|
|
275
|
+
|
|
276
|
+
You can also build a context pack from the command-line interface by piping the retrieval result:
|
|
277
|
+
|
|
278
|
+
```
|
|
279
|
+
biblicus query --corpus corpora/story --query "Primary button style preference" \\
|
|
280
|
+
| biblicus context-pack build
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
Most production systems also apply a budget when building context. If you want a precise token budget, the budgeting logic needs a specific tokenizer and should be treated as its own stage.
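To make the budgeting idea concrete, here is a rough sketch of fitting text blocks to a token budget as a separate stage. It is not the implementation of `fit_context_pack_to_token_budget`; the function names are hypothetical, and the whitespace-based counter is a stand-in assumption for whatever tokenizer your model actually uses.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not model tokens."""
    return len(text.split())


def fit_blocks_to_token_budget(
    blocks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    join_with: str = "\n\n",
) -> str:
    """Hypothetical helper: keep leading blocks while the joined text fits the budget."""
    kept: List[str] = []
    for block in blocks:
        candidate = join_with.join(kept + [block])
        # Stop at the first block that would push the joined text over budget.
        if count_tokens(candidate) > max_tokens:
            break
        kept.append(block)
    return join_with.join(kept)


blocks = [
    "The user prefers concise answers.",
    "The user dislikes idioms and abbreviations.",
]
# Under the stand-in counter the blocks are 5 and 6 tokens, so a budget of 8 keeps only the first.
print(fit_blocks_to_token_budget(blocks, max_tokens=8))
```

Passing the counter as a parameter lets you swap in a real tokenizer later without changing the fitting logic.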
|
|
284
|
+
|
|
285
|
+
## Pipeline diagram
|
|
286
|
+
|
|
287
|
+
This diagram shows how a corpus becomes evidence for your assistant. Your code decides how to turn evidence into context and how to call a model.
|
|
288
|
+
|
|
289
|
+
```mermaid
|
|
290
|
+
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff"}, "flowchart": {"useMaxWidth": true, "nodeSpacing": 18, "rankSpacing": 22}}}%%
|
|
291
|
+
flowchart TB
|
|
292
|
+
subgraph Legend[Legend]
|
|
293
|
+
direction LR
|
|
294
|
+
LegendArtifact[Stored artifact or evidence]
|
|
295
|
+
LegendStep[Step]
|
|
296
|
+
LegendArtifact --- LegendStep
|
|
297
|
+
end
|
|
298
|
+
|
|
299
|
+
subgraph Main[" "]
|
|
300
|
+
direction TB
|
|
301
|
+
|
|
302
|
+
subgraph Pipeline[" "]
|
|
303
|
+
direction TB
|
|
304
|
+
|
|
305
|
+
subgraph RowStable[Stable core]
|
|
306
|
+
direction TB
|
|
307
|
+
Source[Source items] --> Ingest[Ingest] --> Raw[Raw item files] --> Catalog[Catalog file]
|
|
308
|
+
end
|
|
309
|
+
|
|
310
|
+
subgraph RowExtraction[Pluggable: extraction pipeline]
|
|
311
|
+
direction TB
|
|
312
|
+
Catalog --> Extract[Extract pipeline] --> ExtractedText[Extracted text artifacts] --> ExtractionRun[Extraction run manifest]
|
|
313
|
+
end
|
|
314
|
+
|
|
315
|
+
subgraph RowRetrieval[Pluggable: retrieval backend]
|
|
316
|
+
direction TB
|
|
317
|
+
ExtractionRun --> Build[Build run] --> BackendIndex[Backend index] --> Run[Run manifest] --> Retrieve[Retrieve] --> Rerank[Rerank optional] --> Filter[Filter optional] --> Evidence[Evidence]
|
|
318
|
+
end
|
|
319
|
+
|
|
320
|
+
subgraph RowContext[Context]
|
|
321
|
+
direction TB
|
|
322
|
+
Evidence --> ContextPack[Context pack] --> FitTokens[Fit tokens optional] --> Context[Assistant context]
|
|
323
|
+
end
|
|
324
|
+
|
|
325
|
+
subgraph RowYourCode[Your code]
|
|
326
|
+
direction TB
|
|
327
|
+
Context --> Model[Large language model call] --> Answer[Answer]
|
|
328
|
+
end
|
|
329
|
+
end
|
|
330
|
+
|
|
331
|
+
style RowStable fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
|
|
332
|
+
style RowExtraction fill:#ffffff,stroke:#5e35b1,stroke-dasharray:6 3,stroke-width:2px,color:#111111
|
|
333
|
+
style RowRetrieval fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
|
|
334
|
+
style RowContext fill:#ffffff,stroke:#7b1fa2,stroke-width:2px,color:#111111
|
|
335
|
+
style RowYourCode fill:#ffffff,stroke:#d81b60,stroke-width:2px,color:#111111
|
|
336
|
+
|
|
337
|
+
style Raw fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
338
|
+
style Catalog fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
339
|
+
style ExtractedText fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
340
|
+
style ExtractionRun fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
341
|
+
style BackendIndex fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
342
|
+
style Run fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
343
|
+
style Evidence fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
344
|
+
style ContextPack fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
345
|
+
style Context fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
346
|
+
style Answer fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
347
|
+
style Source fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
348
|
+
|
|
349
|
+
style Ingest fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
350
|
+
style Extract fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
351
|
+
style Build fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
352
|
+
style Retrieve fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
353
|
+
style Rerank fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
354
|
+
style Filter fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
355
|
+
style FitTokens fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
356
|
+
style Model fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
357
|
+
end
|
|
358
|
+
|
|
359
|
+
style Legend fill:#ffffff,stroke:#ffffff,color:#111111
|
|
360
|
+
style Main fill:#ffffff,stroke:#ffffff,color:#111111
|
|
361
|
+
style Pipeline fill:#ffffff,stroke:#ffffff,color:#111111
|
|
362
|
+
style LegendArtifact fill:#f3e5f5,stroke:#8e24aa,color:#111111
|
|
363
|
+
style LegendStep fill:#eceff1,stroke:#90a4ae,color:#111111
|
|
364
|
+
```
|
|
365
|
+
|
|
213
366
|
## Python usage
|
|
214
367
|
|
|
215
368
|
From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
|
|
@@ -222,30 +375,28 @@ From Python, the same flow is available through the Corpus class and backend int
|
|
|
222
375
|
- Query a run with `backend.query`.
|
|
223
376
|
- Evaluate with `evaluate_run`.
|
|
224
377
|
|
|
225
|
-
## How it fits into an assistant
|
|
226
|
-
|
|
227
|
-
In an assistant system, retrieval usually produces context for a model call. This library treats evidence as the primary output so you can decide how to use it.
|
|
228
|
-
|
|
229
|
-
- Use a corpus as the source of truth for raw items.
|
|
230
|
-
- Use a backend run to build any derived artifacts needed for retrieval.
|
|
231
|
-
- Use queries to obtain evidence objects.
|
|
232
|
-
- Convert evidence into the format your framework expects, such as message content, tool output, or citations.
|
|
233
|
-
|
|
234
378
|
## Learn more
|
|
235
379
|
|
|
236
|
-
Full documentation is
|
|
380
|
+
Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
|
|
237
381
|
|
|
238
|
-
The documents below
|
|
382
|
+
The documents below follow the pipeline from raw items to model context:
|
|
239
383
|
|
|
240
|
-
- [Architecture][architecture]
|
|
241
|
-
- [Roadmap][roadmap]
|
|
242
|
-
- [Feature index][feature-index]
|
|
243
384
|
- [Corpus][corpus]
|
|
244
385
|
- [Text extraction][text-extraction]
|
|
245
|
-
- [User configuration][user-configuration]
|
|
246
386
|
- [Backends][backends]
|
|
387
|
+
- [Context packs][context-packs]
|
|
388
|
+
- [Testing and evaluation][testing]
|
|
389
|
+
|
|
390
|
+
Reference:
|
|
391
|
+
|
|
247
392
|
- [Demos][demos]
|
|
248
|
-
- [
|
|
393
|
+
- [User configuration][user-configuration]
|
|
394
|
+
|
|
395
|
+
Design and implementation map:
|
|
396
|
+
|
|
397
|
+
- [Feature index][feature-index]
|
|
398
|
+
- [Roadmap][roadmap]
|
|
399
|
+
- [Architecture][architecture]
|
|
249
400
|
|
|
250
401
|
## Metadata and catalog
|
|
251
402
|
|
|
@@ -262,7 +413,16 @@ corpus/
|
|
|
262
413
|
config.json
|
|
263
414
|
catalog.json
|
|
264
415
|
runs/
|
|
265
|
-
|
|
416
|
+
extraction/
|
|
417
|
+
pipeline/
|
|
418
|
+
<run id>/
|
|
419
|
+
manifest.json
|
|
420
|
+
text/
|
|
421
|
+
<item id>.txt
|
|
422
|
+
retrieval/
|
|
423
|
+
<backend id>/
|
|
424
|
+
<run id>/
|
|
425
|
+
manifest.json
|
|
266
426
|
```
|
|
267
427
|
|
|
268
428
|
## Retrieval backends
|
|
@@ -313,7 +473,7 @@ python3 -m pip install -e ".[dev]"
|
|
|
313
473
|
Build the documentation:
|
|
314
474
|
|
|
315
475
|
```
|
|
316
|
-
python3 -m sphinx -b html docs docs/_build
|
|
476
|
+
python3 -m sphinx -b html docs docs/_build/html
|
|
317
477
|
```
|
|
318
478
|
|
|
319
479
|
## License
|
|
@@ -328,9 +488,10 @@ License terms are in `LICENSE`.
|
|
|
328
488
|
[text-extraction]: docs/EXTRACTION.md
|
|
329
489
|
[user-configuration]: docs/USER_CONFIGURATION.md
|
|
330
490
|
[backends]: docs/BACKENDS.md
|
|
491
|
+
[context-packs]: docs/CONTEXT_PACK.md
|
|
331
492
|
[demos]: docs/DEMOS.md
|
|
332
493
|
[testing]: docs/TESTING.md
|
|
333
494
|
|
|
334
495
|
[continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
|
|
335
496
|
[coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
|
|
336
|
-
[documentation-badge]: https://
|
|
497
|
+
[documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
|