biblicus 0.6.0__py3-none-any.whl

Files changed (48)
  1. biblicus/__init__.py +30 -0
  2. biblicus/__main__.py +8 -0
  3. biblicus/_vendor/dotyaml/__init__.py +14 -0
  4. biblicus/_vendor/dotyaml/interpolation.py +63 -0
  5. biblicus/_vendor/dotyaml/loader.py +181 -0
  6. biblicus/_vendor/dotyaml/transformer.py +135 -0
  7. biblicus/backends/__init__.py +42 -0
  8. biblicus/backends/base.py +65 -0
  9. biblicus/backends/scan.py +375 -0
  10. biblicus/backends/sqlite_full_text_search.py +487 -0
  11. biblicus/cli.py +804 -0
  12. biblicus/constants.py +12 -0
  13. biblicus/context.py +183 -0
  14. biblicus/corpus.py +1531 -0
  15. biblicus/crawl.py +186 -0
  16. biblicus/errors.py +15 -0
  17. biblicus/evaluation.py +257 -0
  18. biblicus/evidence_processing.py +201 -0
  19. biblicus/extraction.py +531 -0
  20. biblicus/extractors/__init__.py +44 -0
  21. biblicus/extractors/base.py +68 -0
  22. biblicus/extractors/metadata_text.py +106 -0
  23. biblicus/extractors/openai_stt.py +180 -0
  24. biblicus/extractors/pass_through_text.py +84 -0
  25. biblicus/extractors/pdf_text.py +100 -0
  26. biblicus/extractors/pipeline.py +105 -0
  27. biblicus/extractors/rapidocr_text.py +129 -0
  28. biblicus/extractors/select_longest_text.py +105 -0
  29. biblicus/extractors/select_text.py +100 -0
  30. biblicus/extractors/unstructured_text.py +100 -0
  31. biblicus/frontmatter.py +89 -0
  32. biblicus/hook_logging.py +180 -0
  33. biblicus/hook_manager.py +203 -0
  34. biblicus/hooks.py +261 -0
  35. biblicus/ignore.py +64 -0
  36. biblicus/knowledge_base.py +191 -0
  37. biblicus/models.py +445 -0
  38. biblicus/retrieval.py +133 -0
  39. biblicus/sources.py +212 -0
  40. biblicus/time.py +17 -0
  41. biblicus/uris.py +63 -0
  42. biblicus/user_config.py +138 -0
  43. biblicus-0.6.0.dist-info/METADATA +533 -0
  44. biblicus-0.6.0.dist-info/RECORD +48 -0
  45. biblicus-0.6.0.dist-info/WHEEL +5 -0
  46. biblicus-0.6.0.dist-info/entry_points.txt +2 -0
  47. biblicus-0.6.0.dist-info/licenses/LICENSE +21 -0
  48. biblicus-0.6.0.dist-info/top_level.txt +1 -0
@@ -0,0 +1,533 @@
1
+ Metadata-Version: 2.4
2
+ Name: biblicus
3
+ Version: 0.6.0
4
+ Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
5
+ License: MIT
6
+ Requires-Python: >=3.9
7
+ Description-Content-Type: text/markdown
8
+ License-File: LICENSE
9
+ Requires-Dist: pydantic>=2.0
10
+ Requires-Dist: PyYAML>=6.0
11
+ Requires-Dist: pypdf>=4.0
12
+ Provides-Extra: dev
13
+ Requires-Dist: behave>=1.2.6; extra == "dev"
14
+ Requires-Dist: coverage[toml]>=7.0; extra == "dev"
15
+ Requires-Dist: sphinx>=7.0; extra == "dev"
16
+ Requires-Dist: myst-parser>=2.0; extra == "dev"
17
+ Requires-Dist: sphinx_rtd_theme>=2.0; extra == "dev"
18
+ Requires-Dist: ruff>=0.4.0; extra == "dev"
19
+ Requires-Dist: black>=24.0; extra == "dev"
20
+ Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
21
+ Provides-Extra: openai
22
+ Requires-Dist: openai>=1.0; extra == "openai"
23
+ Provides-Extra: unstructured
24
+ Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
25
+ Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
26
+ Provides-Extra: ocr
27
+ Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
28
+ Dynamic: license-file
29
+
30
+ # Biblicus
31
+
32
+ ![Continuous integration][continuous-integration-badge]
33
+ ![Coverage][coverage-badge]
34
+ ![Documentation][documentation-badge]
35
+
36
+ Make your documents usable by your assistant, then decide later how you will search and retrieve them.
37
+
38
+ If you are building an assistant in Python, you probably have material you want it to use: notes, documents, web pages, and reference files. A common approach is retrieval augmented generation, where a system retrieves relevant material and uses it as evidence when generating a response.
39
+
40
+ The first practical problem is not retrieval. It is collection and care. You need a stable place to put raw items, you need a small amount of metadata so you can find them again, and you need a way to evolve your retrieval approach over time without rewriting ingestion.
41
+
42
+ This library gives you a corpus, which is a normal folder on disk. It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw corpus.
43
+
44
+ It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or your own setup. Use it from Python or from the command line interface.
45
+
46
+ See [retrieval augmented generation overview] for a short introduction to the idea.
47
+
48
+ ## Start with a knowledge base
49
+
50
+ If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
51
+
52
+ This example assumes a folder called `notes/` with a few `.txt` files. The knowledge base handles sensible defaults and still gives you a clear context pack for your model call.
53
+
54
+ ```python
55
+ from biblicus.knowledge_base import KnowledgeBase
56
+
57
+
58
+ kb = KnowledgeBase.from_folder("notes")
59
+ result = kb.query("Primary button style preference")
60
+ context_pack = kb.context_pack(result, max_tokens=800)
61
+
62
+ print(context_pack.text)
63
+ ```
64
+
65
+ If you want to run a real, executable version of this story, use `scripts/readme_end_to_end_demo.py` from a fresh clone.
66
+
67
+ This simplified sequence diagram shows the same idea at a high level.
68
+
69
+ ```mermaid
70
+ %%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
71
+ sequenceDiagram
72
+ participant App as Your assistant code
73
+ participant KB as Knowledge base
74
+ participant LLM as Large language model
75
+
76
+ App->>KB: query
77
+ KB-->>App: evidence and context
78
+ App->>LLM: context plus prompt
79
+ LLM-->>App: response draft
80
+ ```
81
+
82
+ ## A simple mental model
83
+
84
+ Think in three stages.
85
+
86
+ - Ingest puts raw items into a corpus. This is file first and human inspectable.
87
+ - Extract turns items into usable text. This is where you would do text extraction from Portable Document Format files, optical character recognition for images, or speech to text for audio. If an item is already text, extraction can simply read it. Extraction outputs are derived artifacts, not edits to the raw files.
88
+ - Retrieve searches extracted text and returns evidence. Evidence is structured so you can turn it into context for your model call in whatever way your project prefers.
89
+
90
+ If you learn a few project words, the rest of the system becomes predictable.
91
+
92
+ - Corpus is the folder that holds raw items and their metadata.
93
+ - Item is the raw bytes plus optional metadata and source information.
94
+ - Catalog is the rebuildable index of the corpus.
95
+ - Extraction run is a recorded extraction build that produces text artifacts.
96
+ - Backend is a pluggable retrieval implementation.
97
+ - Run is a recorded retrieval build for a corpus.
98
+ - Evidence is what retrieval returns, with identifiers and source information.
99
+
100
+ ## Where it fits in an assistant
101
+
102
+ Biblicus does not answer user questions. It is not a language model. It helps your assistant answer them by retrieving relevant material and returning it as structured evidence. Your code decides how to turn evidence into a context pack for the model call, which is then passed to a model you choose.
103
+
104
+ In a coding assistant, retrieval is often triggered by what the user is doing right now. For example: you are about to propose a user interface change, so you retrieve the user's stated preferences, then you include that as context for the model call.
105
+
106
+ This diagram shows two sequential Biblicus calls. They are shown separately to make the boundaries explicit: retrieval returns evidence, and context pack building consumes evidence.
107
+
108
+ ```mermaid
109
+ %%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff", "noteBkgColor": "#ffffff", "noteTextColor": "#111111", "actorBkg": "#f3e5f5", "actorBorder": "#8e24aa", "actorTextColor": "#111111"}}}%%
110
+ sequenceDiagram
111
+ participant User
112
+ participant App as Your assistant code
113
+ participant Bib as Biblicus
114
+ participant LLM as Large language model
115
+
116
+ User->>App: request
117
+ App->>Bib: query retrieval
118
+ Bib-->>App: retrieval result evidence JSON
119
+ App->>Bib: build context pack from evidence
120
+ Bib-->>App: context pack text
121
+ App->>LLM: context pack plus prompt
122
+ LLM-->>App: response draft
123
+ App-->>User: response
124
+ ```
125
+
126
+ ## Practical value
127
+
128
+ - You can ingest raw material once, then try many retrieval approaches over time.
129
+ - You can keep raw files readable and portable, without locking your data inside a database.
130
+ - You can evaluate retrieval runs against shared datasets and compare backends using the same corpus.
131
+
132
+ ## Typical flow
133
+
134
+ - Initialize a corpus folder.
135
+ - Ingest items from file paths, web addresses, or text input.
136
+ - Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
137
+ - Run extraction when you want derived text artifacts from non-text sources.
138
+ - Reindex to refresh the catalog after edits.
139
+ - Build a retrieval run with a backend.
140
+ - Query the run to collect evidence and evaluate it with datasets.
141
+
142
+ ## Install
143
+
144
+ This repository is a working Python package. Install it into a virtual environment from the repository root.
145
+
146
+ ```
147
+ python3 -m pip install -e .
148
+ ```
149
+
150
+ Released versions can also be installed from the Python Package Index.
151
+
152
+ ```
153
+ python3 -m pip install biblicus
154
+ ```
155
+
156
+ ### Optional extras
157
+
158
+ Some extractors are optional so the base install stays small.
159
+
160
+ - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
161
+ - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
162
+ - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
163
+
164
+ ## Quick start
165
+
166
+ ```
167
+ mkdir -p notes
168
+ echo "A small file note" > notes/example.txt
169
+
170
+ biblicus init corpora/example
171
+ biblicus ingest --corpus corpora/example notes/example.txt
172
+ echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
173
+ biblicus list --corpus corpora/example
174
+ biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
175
+ biblicus extract list --corpus corpora/example
176
+ biblicus build --corpus corpora/example --backend scan
177
+ biblicus query --corpus corpora/example --query "note"
178
+ ```
179
+
180
+ If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
181
+
182
+ ```
183
+ biblicus crawl --corpus corpora/example \
184
+   --root-url https://example.com/docs/index.html \
185
+   --allowed-prefix https://example.com/docs/ \
186
+   --max-items 50 \
187
+   --tag crawled
188
+ ```
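The prefix restriction in the crawl command can be pictured as a simple filter over discovered links. The sketch below is an illustration only, not the Biblicus crawl implementation: `is_allowed` and `select_links` are hypothetical helper names, and the plain string prefix test is an assumption about how such a restriction might work.

```python
from urllib.parse import urldefrag, urljoin


def is_allowed(url: str, allowed_prefix: str) -> bool:
    """Return True when url falls under allowed_prefix (plain string prefix test)."""
    return url.startswith(allowed_prefix)


def select_links(page_url, hrefs, allowed_prefix, max_items):
    """Resolve hrefs against page_url, drop fragments and duplicates, keep allowed links."""
    selected = []
    seen = set()
    for href in hrefs:
        # Make the link absolute and strip any "#fragment" so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute in seen or not is_allowed(absolute, allowed_prefix):
            continue
        seen.add(absolute)
        selected.append(absolute)
        if len(selected) >= max_items:
            break
    return selected


links = select_links(
    "https://example.com/docs/index.html",
    ["guide.html", "guide.html#intro", "../blog/post.html", "https://other.example/x"],
    allowed_prefix="https://example.com/docs/",
    max_items=50,
)
print(links)
```

Only the link under `https://example.com/docs/` survives here; the blog page and the external site fall outside the allowed prefix.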
189
+
190
+ ## End-to-end example: lower-level control
191
+
192
+ The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
193
+
194
+ This version shows the lower-level pieces explicitly. You are building the corpus, controlling each memory string, choosing the backend, and shaping the context pack yourself.
195
+
196
+ ```python
197
+ from biblicus.backends import get_backend
198
+ from biblicus.context import ContextPackPolicy, TokenBudget, build_context_pack, fit_context_pack_to_token_budget
199
+ from biblicus.corpus import Corpus
200
+ from biblicus.models import QueryBudget
201
+
202
+
203
+ corpus = Corpus.init("corpora/story")
204
+
205
+ notes = [
206
+     ("User name", "The user's name is Tactus Maximus."),
207
+     ("Button style preference", "Primary button style preference: the user's favorite color is magenta."),
208
+     ("Style preference", "The user prefers concise answers."),
209
+     ("Language preference", "The user dislikes idioms and abbreviations."),
210
+     ("Engineering preference", "The user likes code that is over-documented and behavior-driven."),
211
+ ]
212
+ for note_title, note_text in notes:
213
+     corpus.ingest_note(note_text, title=note_title, tags=["memory"])
214
+
215
+ backend = get_backend("scan")
216
+ run = backend.build_run(corpus, recipe_name="Story demo", config={})
217
+ budget = QueryBudget(max_total_items=5, max_total_characters=2000, max_items_per_source=None)
218
+ result = backend.query(
219
+     corpus,
220
+     run=run,
221
+     query_text="Primary button style preference",
222
+     budget=budget,
223
+ )
224
+
225
+ policy = ContextPackPolicy(join_with="\n\n")
226
+ context_pack = build_context_pack(result, policy=policy)
227
+ context_pack = fit_context_pack_to_token_budget(
228
+     context_pack,
229
+     policy=policy,
230
+     token_budget=TokenBudget(max_tokens=60),
231
+ )
232
+ print(context_pack.text)
233
+ ```
234
+
235
+ If you want a runnable version of this story, use the script at `scripts/readme_end_to_end_demo.py`.
236
+
237
+ If you prefer the command-line interface, here is the same flow in compressed form:
238
+
239
+ ```
240
+ biblicus init corpora/story
241
+ biblicus ingest --corpus corpora/story --stdin --title "User name" --tag memory <<< "The user's name is Tactus Maximus."
242
+ biblicus ingest --corpus corpora/story --stdin --title "Button style preference" --tag memory <<< "Primary button style preference: the user's favorite color is magenta."
243
+ biblicus ingest --corpus corpora/story --stdin --title "Style preference" --tag memory <<< "The user prefers concise answers."
244
+ biblicus ingest --corpus corpora/story --stdin --title "Language preference" --tag memory <<< "The user dislikes idioms and abbreviations."
245
+ biblicus ingest --corpus corpora/story --stdin --title "Engineering preference" --tag memory <<< "The user likes code that is over-documented and behavior-driven."
246
+ biblicus build --corpus corpora/story --backend scan
247
+ biblicus query --corpus corpora/story --query "Primary button style preference"
248
+ ```
249
+
250
+ Example output:
251
+
252
+ ```json
253
+ {
254
+   "query_text": "Primary button style preference",
255
+   "budget": {
256
+     "max_total_items": 5,
257
+     "max_total_characters": 2000,
258
+     "max_items_per_source": null
259
+   },
260
+   "run_id": "RUN_ID",
261
+   "recipe_id": "RECIPE_ID",
262
+   "backend_id": "scan",
263
+   "generated_at": "2026-01-29T00:00:00.000000Z",
264
+   "evidence": [
265
+     {
266
+       "item_id": "ITEM_ID",
267
+       "source_uri": "text",
268
+       "media_type": "text/markdown",
269
+       "score": 1.0,
270
+       "rank": 1,
271
+       "text": "Primary button style preference: the user's favorite color is magenta.",
272
+       "content_ref": null,
273
+       "span_start": null,
274
+       "span_end": null,
275
+       "stage": "scan",
276
+       "recipe_id": "RECIPE_ID",
277
+       "run_id": "RUN_ID",
278
+       "hash": null
279
+     }
280
+   ],
281
+   "stats": {}
282
+ }
283
+ ```
284
+
285
+ Evidence is the output contract. Your code decides how to convert evidence into assistant context.
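Because the query result is plain JavaScript Object Notation, a script can consume it with the standard library alone. The sketch below assumes only the field names visible in the example output (`evidence`, `rank`, `text`); `evidence_to_context` is a hypothetical helper, not part of Biblicus, and the embedded document is a trimmed copy of the example.

```python
import json

# Trimmed copy of the example output; only the fields used below are kept.
raw_output = """
{
  "query_text": "Primary button style preference",
  "backend_id": "scan",
  "evidence": [
    {
      "item_id": "ITEM_ID",
      "score": 1.0,
      "rank": 1,
      "text": "Primary button style preference: the user's favorite color is magenta."
    }
  ]
}
"""


def evidence_to_context(result: dict, join_with: str = "\n\n") -> str:
    """Join evidence texts in rank order, skipping entries that carry no text."""
    ranked = sorted(result["evidence"], key=lambda entry: entry["rank"])
    return join_with.join(entry["text"] for entry in ranked if entry.get("text"))


result = json.loads(raw_output)
context = evidence_to_context(result)
print(context)
```

In a real script the JSON would come from the standard output of `biblicus query` rather than from an inline string.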
286
+
287
+ ### Turn evidence into a context pack
288
+
289
+ A context pack is a readable text block you send to a model. There is no single correct format. Treat it as a policy surface you can iterate on.
290
+
291
+ Here is a minimal example that builds a context pack from evidence:
292
+
293
+ ```python
294
+ from biblicus.context import ContextPackPolicy, build_context_pack
295
+
296
+
297
+ policy = ContextPackPolicy(
298
+     join_with="\n\n",
299
+ )
300
+ context_pack = build_context_pack(result, policy=policy)
301
+ print(context_pack.text)
302
+ ```
303
+
304
+ Example context pack output:
305
+
306
+ ```text
307
+ Primary button style preference: the user's favorite color is magenta.
308
+ ```
309
+
310
+ You can also build a context pack from the command-line interface by piping the retrieval result:
311
+
312
+ ```
313
+ biblicus query --corpus corpora/story --query "Primary button style preference" \
314
+   | biblicus context-pack build
315
+ ```
316
+
317
+ Most production systems also apply a budget when building context. If you want a precise token budget, the budgeting logic needs a specific tokenizer and should be treated as its own stage.
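To make the budgeting idea concrete, here is a minimal sketch that keeps the best-ranked blocks until a token limit is reached. It is not how `fit_context_pack_to_token_budget` is implemented: the function names are hypothetical, and whitespace splitting is a rough stand-in for the specific tokenizer a precise budget would need.

```python
def approximate_token_count(text: str) -> int:
    """Count whitespace-separated words as a rough proxy for model tokens."""
    return len(text.split())


def fit_blocks_to_budget(blocks, max_tokens, join_with="\n\n"):
    """Keep leading blocks (assumed ranked best first) while the total stays within max_tokens."""
    kept = []
    used = 0
    for block in blocks:
        cost = approximate_token_count(block)
        if used + cost > max_tokens:
            break  # stop at the first block that would exceed the budget
        kept.append(block)
        used += cost
    return join_with.join(kept)


blocks = [
    "Primary button style preference: the user's favorite color is magenta.",
    "The user prefers concise answers.",
    "The user dislikes idioms and abbreviations.",
]
fitted = fit_blocks_to_budget(blocks, max_tokens=15)
print(fitted)
```

With a budget of 15 approximate tokens, the first two blocks fit and the third is dropped. Swapping `approximate_token_count` for a real tokenizer is the only change needed to make the budget precise.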
318
+
319
+ ## Pipeline diagram
320
+
321
+ This diagram shows how a corpus becomes evidence for your assistant. Your code decides how to turn evidence into context and how to call a model.
322
+
323
+ ```mermaid
324
+ %%{init: {"theme": "base", "themeVariables": {"primaryColor": "#f3e5f5", "primaryTextColor": "#111111", "primaryBorderColor": "#8e24aa", "lineColor": "#90a4ae", "secondaryColor": "#eceff1", "tertiaryColor": "#ffffff"}, "flowchart": {"useMaxWidth": true, "nodeSpacing": 18, "rankSpacing": 22}}}%%
325
+ flowchart TB
326
+ subgraph Legend[Legend]
327
+ direction LR
328
+ LegendArtifact[Stored artifact or evidence]
329
+ LegendStep[Step]
330
+ LegendArtifact --- LegendStep
331
+ end
332
+
333
+ subgraph Main[" "]
334
+ direction TB
335
+
336
+ subgraph Pipeline[" "]
337
+ direction TB
338
+
339
+ subgraph RowStable[Stable core]
340
+ direction TB
341
+ Source[Source items] --> Ingest[Ingest] --> Raw[Raw item files] --> Catalog[Catalog file]
342
+ end
343
+
344
+ subgraph RowExtraction[Pluggable: extraction pipeline]
345
+ direction TB
346
+ Catalog --> Extract[Extract pipeline] --> ExtractedText[Extracted text artifacts] --> ExtractionRun[Extraction run manifest]
347
+ end
348
+
349
+ subgraph RowRetrieval[Pluggable: retrieval backend]
350
+ direction TB
351
+ ExtractionRun --> Build[Build run] --> BackendIndex[Backend index] --> Run[Run manifest] --> Retrieve[Retrieve] --> Rerank[Rerank optional] --> Filter[Filter optional] --> Evidence[Evidence]
352
+ end
353
+
354
+ subgraph RowContext[Context]
355
+ direction TB
356
+ Evidence --> ContextPack[Context pack] --> FitTokens[Fit tokens optional] --> Context[Assistant context]
357
+ end
358
+
359
+ subgraph RowYourCode[Your code]
360
+ direction TB
361
+ Context --> Model[Large language model call] --> Answer[Answer]
362
+ end
363
+ end
364
+
365
+ style RowStable fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
366
+ style RowExtraction fill:#ffffff,stroke:#5e35b1,stroke-dasharray:6 3,stroke-width:2px,color:#111111
367
+ style RowRetrieval fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
368
+ style RowContext fill:#ffffff,stroke:#7b1fa2,stroke-width:2px,color:#111111
369
+ style RowYourCode fill:#ffffff,stroke:#d81b60,stroke-width:2px,color:#111111
370
+
371
+ style Raw fill:#f3e5f5,stroke:#8e24aa,color:#111111
372
+ style Catalog fill:#f3e5f5,stroke:#8e24aa,color:#111111
373
+ style ExtractedText fill:#f3e5f5,stroke:#8e24aa,color:#111111
374
+ style ExtractionRun fill:#f3e5f5,stroke:#8e24aa,color:#111111
375
+ style BackendIndex fill:#f3e5f5,stroke:#8e24aa,color:#111111
376
+ style Run fill:#f3e5f5,stroke:#8e24aa,color:#111111
377
+ style Evidence fill:#f3e5f5,stroke:#8e24aa,color:#111111
378
+ style ContextPack fill:#f3e5f5,stroke:#8e24aa,color:#111111
379
+ style Context fill:#f3e5f5,stroke:#8e24aa,color:#111111
380
+ style Answer fill:#f3e5f5,stroke:#8e24aa,color:#111111
381
+ style Source fill:#f3e5f5,stroke:#8e24aa,color:#111111
382
+
383
+ style Ingest fill:#eceff1,stroke:#90a4ae,color:#111111
384
+ style Extract fill:#eceff1,stroke:#90a4ae,color:#111111
385
+ style Build fill:#eceff1,stroke:#90a4ae,color:#111111
386
+ style Retrieve fill:#eceff1,stroke:#90a4ae,color:#111111
387
+ style Rerank fill:#eceff1,stroke:#90a4ae,color:#111111
388
+ style Filter fill:#eceff1,stroke:#90a4ae,color:#111111
389
+ style FitTokens fill:#eceff1,stroke:#90a4ae,color:#111111
390
+ style Model fill:#eceff1,stroke:#90a4ae,color:#111111
391
+ end
392
+
393
+ style Legend fill:#ffffff,stroke:#ffffff,color:#111111
394
+ style Main fill:#ffffff,stroke:#ffffff,color:#111111
395
+ style Pipeline fill:#ffffff,stroke:#ffffff,color:#111111
396
+ style LegendArtifact fill:#f3e5f5,stroke:#8e24aa,color:#111111
397
+ style LegendStep fill:#eceff1,stroke:#90a4ae,color:#111111
398
+ ```
399
+
400
+ ## Python usage
401
+
402
+ From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
403
+
404
+ - Create a corpus with `Corpus.init` or open one with `Corpus.open`.
405
+ - Ingest notes with `Corpus.ingest_note`.
406
+ - Ingest files or web addresses with `Corpus.ingest_source`.
407
+ - List items with `Corpus.list_items`.
408
+ - Build a retrieval run with `get_backend` and `backend.build_run`.
409
+ - Query a run with `backend.query`.
410
+ - Evaluate with `evaluate_run`.
411
+
412
+ ## Learn more
413
+
414
+ Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
415
+
416
+ The documents below follow the pipeline from raw items to model context:
417
+
418
+ - [Corpus][corpus]
419
+ - [Text extraction][text-extraction]
420
+ - [Knowledge base][knowledge-base]
421
+ - [Backends][backends]
422
+ - [Context packs][context-packs]
423
+ - [Testing and evaluation][testing]
424
+
425
+ Reference:
426
+
427
+ - [Demos][demos]
428
+ - [User configuration][user-configuration]
429
+
430
+ Design and implementation map:
431
+
432
+ - [Feature index][feature-index]
433
+ - [Roadmap][roadmap]
434
+ - [Architecture][architecture]
435
+
436
+ ## Metadata and catalog
437
+
438
+ Raw items are stored as files in the corpus raw directory. Metadata can live in a Markdown front matter block or a sidecar file with the suffix `.biblicus.yml`. The catalog lives in `.biblicus/catalog.json` and can be rebuilt at any time with `biblicus reindex`.
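The catalog can be rebuilt because it is derived entirely from the files on disk. The sketch below illustrates that idea with the standard library only. The `.biblicus.yml` suffix comes from the text above, but `rebuild_catalog` and the dictionary it returns are hypothetical and do not reflect the real `catalog.json` schema.

```python
import json
import tempfile
from pathlib import Path

SIDECAR_SUFFIX = ".biblicus.yml"  # sidecar suffix named in the text above


def rebuild_catalog(raw_dir: Path) -> dict:
    """Scan raw_dir and map each item file name to its size and sidecar name, if any."""
    catalog = {}
    for path in sorted(raw_dir.iterdir()):
        # Sidecar files describe items; they are not items themselves.
        if not path.is_file() or path.name.endswith(SIDECAR_SUFFIX):
            continue
        sidecar = path.with_name(path.name + SIDECAR_SUFFIX)
        catalog[path.name] = {
            "bytes": path.stat().st_size,
            "sidecar": sidecar.name if sidecar.exists() else None,
        }
    return catalog


# Build a throwaway raw directory to demonstrate the scan.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    (raw / "item.bin").write_bytes(b"\x00\x01\x02")
    (raw / "item.bin.biblicus.yml").write_bytes(b"title: Example\n")
    (raw / "note.md").write_bytes(b"A short note\n")
    catalog = rebuild_catalog(raw)

print(json.dumps(catalog, indent=2))
```

Deleting the derived dictionary and running the scan again yields the same result, which is the property that makes a catalog safe to rebuild at any time.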
439
+
440
+ ## Corpus layout
441
+
442
+ ```
443
+ corpus/
444
+   raw/
445
+     item.bin
446
+     item.bin.biblicus.yml
447
+   .biblicus/
448
+     config.json
449
+     catalog.json
450
+     runs/
451
+       extraction/
452
+         pipeline/
453
+           <run id>/
454
+             manifest.json
455
+             text/
456
+               <item id>.txt
457
+       retrieval/
458
+         <backend id>/
459
+           <run id>/
460
+             manifest.json
461
+ ```
462
+
463
+ ## Retrieval backends
464
+
465
+ Two backends are included.
466
+
467
+ - `scan` is a minimal baseline that scans raw items directly.
468
+ - `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
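To show what a scanning baseline can look like, the sketch below scores each text by the fraction of query terms it contains. It is an assumption-laden illustration, not the code in `biblicus/backends/scan.py`: the `scan` function, its scoring rule, and the returned dictionaries are all hypothetical.

```python
def scan(items: dict, query_text: str, max_total_items: int = 5) -> list:
    """Score each text by the fraction of query terms it contains and return ranked matches."""
    terms = query_text.lower().split()
    scored = []
    for item_id, text in items.items():
        lowered = text.lower()
        hits = sum(1 for term in terms if term in lowered)
        if hits:
            scored.append({"item_id": item_id, "score": hits / len(terms), "text": text})
    # Highest score first; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda entry: entry["score"], reverse=True)
    top = scored[:max_total_items]
    for rank, entry in enumerate(top, start=1):
        entry["rank"] = rank
    return top


items = {
    "a": "Primary button style preference: the user's favorite color is magenta.",
    "b": "The user prefers concise answers.",
    "c": "The user's name is Tactus Maximus.",
}
results = scan(items, "Primary button style preference")
print(results)
```

A scan like this needs no index, which makes it a useful reference point when comparing an indexed backend on the same corpus.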
469
+
470
+ ## Integration corpus and evaluation dataset
471
+
472
+ Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
473
+
474
+ The dataset file `datasets/wikipedia_mini.json` provides a small evaluation set that matches the integration corpus.
475
+
476
+ Use `scripts/download_pdf_samples.py` to download a small Portable Document Format integration corpus when running tests or demos. The repository does not include that content.
477
+
478
+ ## Tests and coverage
479
+
480
+ ```
481
+ python3 scripts/test.py
482
+ ```
483
+
484
+ To include integration scenarios that download public test data at runtime, run this command.
485
+
486
+ ```
487
+ python3 scripts/test.py --integration
488
+ ```
489
+
490
+ ## Releases
491
+
492
+ Releases are automated from the main branch using semantic versioning and conventional commit messages.
493
+
494
+ The release pipeline publishes a GitHub release and uploads the package to Python Package Index when continuous integration succeeds.
495
+
496
+ Publishing uses a Python Package Index token stored in the GitHub secret named PYPI_TOKEN.
497
+
498
+ ## Documentation
499
+
500
+ Reference documentation is generated from Sphinx-style docstrings.
501
+
502
+ Install development dependencies:
503
+
504
+ ```
505
+ python3 -m pip install -e ".[dev]"
506
+ ```
507
+
508
+ Build the documentation:
509
+
510
+ ```
511
+ python3 -m sphinx -b html docs docs/_build/html
512
+ ```
513
+
514
+ ## License
515
+
516
+ License terms are in `LICENSE`.
517
+
518
+ [retrieval augmented generation overview]: https://en.wikipedia.org/wiki/Retrieval-augmented_generation
519
+ [architecture]: docs/ARCHITECTURE.md
520
+ [roadmap]: docs/ROADMAP.md
521
+ [feature-index]: docs/FEATURE_INDEX.md
522
+ [corpus]: docs/CORPUS.md
523
+ [knowledge-base]: docs/KNOWLEDGE_BASE.md
524
+ [text-extraction]: docs/EXTRACTION.md
525
+ [user-configuration]: docs/USER_CONFIGURATION.md
526
+ [backends]: docs/BACKENDS.md
527
+ [context-packs]: docs/CONTEXT_PACK.md
528
+ [demos]: docs/DEMOS.md
529
+ [testing]: docs/TESTING.md
530
+
531
+ [continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
532
+ [coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
533
+ [documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
@@ -0,0 +1,48 @@
1
+ biblicus/__init__.py,sha256=jxBNIMVKudpRsbzdiE5CmU6nIjgnNhCRq0OZLSwt_kM,495
2
+ biblicus/__main__.py,sha256=ipfkUoTlocVnrQDM69C7TeBqQxmHVeiWMRaT3G9rtnk,117
3
+ biblicus/cli.py,sha256=hBau464XNdSGdWeOCE2Q7dm0P8I4sR0W-NgVT0wPmh4,27724
4
+ biblicus/constants.py,sha256=R6fZDoLVMCwgKvTaxEx7G0CstwHGaUTlW9MsmNLDZ44,269
5
+ biblicus/context.py,sha256=qnT9CH7_ldoPcg-rxnUOtRhheOmpDAbF8uqhf8OdjC4,5832
6
+ biblicus/corpus.py,sha256=gF1RNl6fdz7wplzpHEIkEBkhYxHgKTKguBR_kD9IgUw,54109
7
+ biblicus/crawl.py,sha256=n8rXBMnziBK9vtKQQCXYOpBzqsPCswj2PzVJUb370KY,6250
8
+ biblicus/errors.py,sha256=uMajd5DvgnJ_-jq5sbeom1GV8DPUc-kojBaECFi6CsY,467
9
+ biblicus/evaluation.py,sha256=5xWpb-8f49Osh9aHzo1ab3AXOmls3Imc5rdnEC0pN-8,8143
10
+ biblicus/evidence_processing.py,sha256=EMv1AkV_Eufk-poBz9nRR1dZgC-QewvI-NrULBUGVGA,6074
11
+ biblicus/extraction.py,sha256=VEjBjIpaBboftGgEcpDj7z7um41e5uDZpP_7acQg7fw,19448
12
+ biblicus/frontmatter.py,sha256=JOGjIDzbbOkebQw2RzA-3WDVMAMtJta2INjS4e7-LMg,2463
13
+ biblicus/hook_logging.py,sha256=IMvde-JhVWrx9tNz3eDJ1CY_rr5Sj7DZ2YNomYCZbz0,5366
14
+ biblicus/hook_manager.py,sha256=ZCAkE5wLvn4lnQz8jho_o0HGEC9KdQd9qitkAEUQRcw,6997
15
+ biblicus/hooks.py,sha256=OHQOmOi7rUcQqYWVeod4oPe8nVLepD7F_SlN7O_-BsE,7863
16
+ biblicus/ignore.py,sha256=fyjt34E6tWNNrm1FseOhgH2MgryyVBQVzxhKL5s4aio,1800
17
+ biblicus/knowledge_base.py,sha256=JmlJw8WD_fgstuq1PyWVzU9kzvVzyv7_xOvhS70xwUw,6654
18
+ biblicus/models.py,sha256=6SWQ2Czg9O3zjuam8a4m8V3LlEgcGLbEctYDB6F1rRs,15317
19
+ biblicus/retrieval.py,sha256=A1SI4WK5cX-WbtN6FJ0QQxqlEOtQhddLrL0LZIuoTC4,4180
20
+ biblicus/sources.py,sha256=EFy8-rQNLsyzz-98mH-z8gEHMYbqigcNFKLaR92KfDE,7241
21
+ biblicus/time.py,sha256=3BSKOSo7R10K-0Dzrbdtl3fh5_yShTYqfdlKvvdkx7M,485
22
+ biblicus/uris.py,sha256=xXD77lqsT9NxbyzI1spX9Y5a3-U6sLYMnpeSAV7g-nM,2013
23
+ biblicus/user_config.py,sha256=DqO08yLn82DhTiFpmIyyLj_J0nMbrtE8xieTj2Cgd6A,4287
24
+ biblicus/_vendor/dotyaml/__init__.py,sha256=e4zbejeJRwlD4I0q3YvotMypO19lXqmT8iyU1q6SvhY,376
25
+ biblicus/_vendor/dotyaml/interpolation.py,sha256=PfUAEEOTFobv7Ox0E6nAxht6BqhHIDe4hP32fZn5TOs,1992
26
+ biblicus/_vendor/dotyaml/loader.py,sha256=KePkjyhKZSvQZphmlmlzTYZJBQsqL5qhtGV1y7G6wzM,5624
27
+ biblicus/_vendor/dotyaml/transformer.py,sha256=2AKPS8DMOPuYtzmM-dlwIqVbARfbBH5jYV1m5qpR49E,3725
28
+ biblicus/backends/__init__.py,sha256=wLXIumV51l6ZIKzjoKKeU7AgIxGOryG7T7ls3a_Fv98,1212
29
+ biblicus/backends/base.py,sha256=Erfj9dXg0nkRKnEcNjHR9_0Ddb2B1NvbmRksVm_g1dU,1776
30
+ biblicus/backends/scan.py,sha256=hdNnQWqi5IH6j95w30BZHxLJ0W9PTaOkqfWJuxCCEMI,12478
31
+ biblicus/backends/sqlite_full_text_search.py,sha256=KgmwOiKvkA0pv7vD0V7bcOdDx_nZIOfuIN6Z4Ij7I68,16516
32
+ biblicus/extractors/__init__.py,sha256=X3pu18QL85IBpYf56l6_5PUxFPhEN5qLTlOrxYpfGck,1776
33
+ biblicus/extractors/base.py,sha256=ka-nz_1zHPr4TS9sU4JfOoY-PJh7lbHPBOEBrbQFGSc,2171
34
+ biblicus/extractors/metadata_text.py,sha256=7FbEPp0K1mXc7FH1_c0KhPhPexF9U6eLd3TVY1vTp1s,3537
35
+ biblicus/extractors/openai_stt.py,sha256=fggErIu6YN6tXbleNTuROhfYi7zDgMd2vD_ecXZ7eXs,7162
36
+ biblicus/extractors/pass_through_text.py,sha256=DNxkCwpH2bbXjPGPEQwsx8kfqXi6rIxXNY_n3TU2-WI,2777
37
+ biblicus/extractors/pdf_text.py,sha256=YtUphgLVxyWJXew6ZsJ8wBRh67Y5ri4ZTRlMmq3g1Bk,3255
38
+ biblicus/extractors/pipeline.py,sha256=LY6eM3ypw50MDB2cPEQqZrjxkhVvIc6sv4UEhHdNDrE,3208
39
+ biblicus/extractors/rapidocr_text.py,sha256=OMAuZealLSSTFVVmBalT-AFJy2pEpHyyvpuWxlnY-GU,4531
40
+ biblicus/extractors/select_longest_text.py,sha256=wRveXAfYLdj7CpGuo4RoD7zE6SIfylRCbv40z2azO0k,3702
41
+ biblicus/extractors/select_text.py,sha256=w0ATmDy3tWWbOObzW87jGZuHbgXllUhotX5XyySLs-o,3395
42
+ biblicus/extractors/unstructured_text.py,sha256=l2S_wD_htu7ZHoJQNQtP-kGlEgOeKV_w2IzAC93lePE,3564
43
+ biblicus-0.6.0.dist-info/licenses/LICENSE,sha256=lw44GXFG_Q0fS8m5VoEvv_xtdBXK26pBcbSPUCXee_Q,1078
44
+ biblicus-0.6.0.dist-info/METADATA,sha256=NXcMvQZklQCSukUOGcZaLSw_aqUm6wFojy6k_pfZvzc,21311
45
+ biblicus-0.6.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
46
+ biblicus-0.6.0.dist-info/entry_points.txt,sha256=BZmO4H8Uz00fyi1RAFryOCGfZgX7eHWkY2NE-G54U5A,47
47
+ biblicus-0.6.0.dist-info/top_level.txt,sha256=sUD_XVZwDxZ29-FBv1MknTGh4mgDXznGuP28KJY_WKc,9
48
+ biblicus-0.6.0.dist-info/RECORD,,
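Each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the SHA-256 hash encoded as URL-safe Base64 with the trailing `=` padding removed. The sketch below shows how such a line can be produced and checked against file bytes. The helper names are hypothetical, and the sample bytes are invented rather than taken from the package.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line: path, unpadded URL-safe Base64 SHA-256 digest, byte size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


def matches_record(line: str, data: bytes) -> bool:
    """Return True when data has the hash and size that the RECORD line declares."""
    path, hash_field, _size = line.rsplit(",", 2)
    if not hash_field:
        # The RECORD file lists itself without a hash or size, so there is nothing to check.
        return False
    return record_entry(path, data) == line


sample = b"hello\n"  # invented content for illustration
line = record_entry("pkg/example.txt", sample)
print(line)
print(matches_record(line, sample))
```

To audit an installed wheel, the same check would be applied to each listed path using the bytes read from disk.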
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (80.10.2)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ biblicus = biblicus.cli:main
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Biblicus Contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1 @@
1
+ biblicus