biblicus 0.1.1__py3-none-any.whl → 0.3.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. biblicus/__init__.py +2 -2
  2. biblicus/_vendor/dotyaml/__init__.py +14 -0
  3. biblicus/_vendor/dotyaml/interpolation.py +63 -0
  4. biblicus/_vendor/dotyaml/loader.py +181 -0
  5. biblicus/_vendor/dotyaml/transformer.py +135 -0
  6. biblicus/backends/__init__.py +0 -2
  7. biblicus/backends/base.py +3 -3
  8. biblicus/backends/scan.py +96 -13
  9. biblicus/backends/sqlite_full_text_search.py +74 -14
  10. biblicus/cli.py +126 -19
  11. biblicus/constants.py +2 -0
  12. biblicus/corpus.py +455 -45
  13. biblicus/errors.py +15 -0
  14. biblicus/evaluation.py +4 -8
  15. biblicus/extraction.py +529 -0
  16. biblicus/extractors/__init__.py +44 -0
  17. biblicus/extractors/base.py +68 -0
  18. biblicus/extractors/metadata_text.py +106 -0
  19. biblicus/extractors/openai_stt.py +180 -0
  20. biblicus/extractors/pass_through_text.py +84 -0
  21. biblicus/extractors/pdf_text.py +100 -0
  22. biblicus/extractors/pipeline.py +105 -0
  23. biblicus/extractors/rapidocr_text.py +129 -0
  24. biblicus/extractors/select_longest_text.py +105 -0
  25. biblicus/extractors/select_text.py +100 -0
  26. biblicus/extractors/unstructured_text.py +100 -0
  27. biblicus/frontmatter.py +0 -3
  28. biblicus/hook_logging.py +180 -0
  29. biblicus/hook_manager.py +203 -0
  30. biblicus/hooks.py +261 -0
  31. biblicus/ignore.py +64 -0
  32. biblicus/models.py +107 -0
  33. biblicus/retrieval.py +0 -4
  34. biblicus/sources.py +85 -5
  35. biblicus/time.py +0 -1
  36. biblicus/uris.py +3 -4
  37. biblicus/user_config.py +138 -0
  38. biblicus-0.3.0.dist-info/METADATA +336 -0
  39. biblicus-0.3.0.dist-info/RECORD +44 -0
  40. biblicus-0.1.1.dist-info/METADATA +0 -174
  41. biblicus-0.1.1.dist-info/RECORD +0 -22
  42. {biblicus-0.1.1.dist-info → biblicus-0.3.0.dist-info}/WHEEL +0 -0
  43. {biblicus-0.1.1.dist-info → biblicus-0.3.0.dist-info}/entry_points.txt +0 -0
  44. {biblicus-0.1.1.dist-info → biblicus-0.3.0.dist-info}/licenses/LICENSE +0 -0
  45. {biblicus-0.1.1.dist-info → biblicus-0.3.0.dist-info}/top_level.txt +0 -0
@@ -1,174 +0,0 @@
- Metadata-Version: 2.4
- Name: biblicus
- Version: 0.1.1
- Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
- License: MIT
- Requires-Python: >=3.9
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: pydantic>=2.0
- Requires-Dist: PyYAML>=6.0
- Provides-Extra: dev
- Requires-Dist: behave>=1.2.6; extra == "dev"
- Requires-Dist: coverage[toml]>=7.0; extra == "dev"
- Requires-Dist: sphinx>=7.0; extra == "dev"
- Requires-Dist: myst-parser>=2.0; extra == "dev"
- Requires-Dist: ruff>=0.4.0; extra == "dev"
- Requires-Dist: black>=24.0; extra == "dev"
- Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
- Dynamic: license-file
-
- # Biblicus
-
- Make your documents usable by your assistant, then decide later how you will search and retrieve them.
-
- If you are building an assistant in Python, you probably have material you want it to use: notes, documents, web pages, and reference files. A common approach is retrieval augmented generation, where a system retrieves relevant material and uses it as evidence when generating a response.
-
- The first practical problem is not retrieval. It is collection and care. You need a stable place to put raw items, you need a small amount of metadata so you can find them again, and you need a way to evolve your retrieval approach over time without rewriting ingestion.
-
- This library gives you a corpus, which is a normal folder on disk. It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw corpus.
-
- It integrates with LangChain, Tactus, Pydantic AI, and the agent development kit. Use it from Python or from the command line interface.
-
- See [retrieval augmented generation overview] for a short introduction to the idea.
-
- ## The framework
-
- The framework is a small, explicit vocabulary that appears in code, specifications, and documentation. If you learn these words, the rest of the system becomes predictable.
-
- - Corpus is the folder that holds raw items and their metadata.
- - Item is the raw bytes of a document or other artifact, plus its source.
- - Catalog is the rebuildable index of the corpus.
- - Evidence is what retrieval returns, ready to be turned into context for a large language model.
- - Run is a recorded retrieval build for a corpus.
- - Backend is a pluggable retrieval implementation.
- - Recipe is a named configuration for a backend.
- - Pipeline stage is a distinct retrieval step such as retrieve, rerank, and filter.
-
- ## Practical value
-
- - You can ingest raw material once, then try many retrieval approaches over time.
- - You can keep raw files readable and portable, without locking your data inside a database.
- - You can evaluate retrieval runs against shared datasets and compare backends using the same corpus.
-
- ## Typical flow
-
- - Initialize a corpus folder.
- - Ingest items from file paths, web addresses, or text input.
- - Reindex to refresh the catalog after edits.
- - Build a retrieval run with a backend.
- - Query the run to collect evidence and evaluate it with datasets.
-
- ## Install
-
- This repository is a working Python package. Install it into a virtual environment from the repository root.
-
- ```
- python3 -m pip install -e .
- ```
-
- After the first release, you can install it from Python Package Index.
-
- ```
- python3 -m pip install biblicus
- ```
-
- ## Quick start
-
- ```
- biblicus init corpora/example
- biblicus ingest --corpus corpora/example notes/example.txt
- echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
- biblicus list --corpus corpora/example
- biblicus build --corpus corpora/example --backend scan
- biblicus query --corpus corpora/example --query "note"
- ```
-
- ## Python usage
-
- From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
-
- - Create a corpus with `Corpus.init` or open one with `Corpus.open`.
- - Ingest notes with `Corpus.ingest_note`.
- - Ingest files or web addresses with `Corpus.ingest_source`.
- - List items with `Corpus.list_items`.
- - Build a retrieval run with `get_backend` and `backend.build_run`.
- - Query a run with `backend.query`.
- - Evaluate with `evaluate_run`.
-
- ## How it fits into an assistant
-
- In an assistant system, retrieval usually produces context for a model call. This library treats evidence as the primary output so you can decide how to use it.
-
- - Use a corpus as the source of truth for raw items.
- - Use a backend run to build any derived artifacts needed for retrieval.
- - Use queries to obtain evidence objects.
- - Convert evidence into the format your framework expects, such as message content, tool output, or citations.
-
- ## Learn more
-
- The documents below are written to be read in order.
-
- - [Architecture][architecture]
- - [Backends][backends]
-
- ## Metadata and catalog
-
- Raw items are stored as files in the corpus raw directory. Metadata can live in a Markdown front matter block or a sidecar file with the suffix `.biblicus.yml`. The catalog lives in `.biblicus/catalog.json` and can be rebuilt at any time with `biblicus reindex`.
-
- ## Corpus layout
-
- ```
- corpus/
-   raw/
-     item.bin
-     item.bin.biblicus.yml
-   .biblicus/
-     config.json
-     catalog.json
-     runs/
-       run-id.json
- ```
-
- ## Retrieval backends
-
- Two backends are included.
-
- - `scan` is a minimal baseline that scans raw items directly.
- - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
-
- ## Integration corpus and evaluation dataset
-
- Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
-
- The dataset file `datasets/wikipedia_mini.json` provides a small evaluation set that matches the integration corpus.
-
- ## Tests and coverage
-
- ```
- python3 scripts/test.py
- ```
-
- ## Releases
-
- Releases are automated from the main branch using semantic versioning and conventional commit messages.
-
- The release pipeline publishes a GitHub release and uploads the package to Python Package Index when continuous integration succeeds.
-
- Publishing uses a Python Package Index token stored in the GitHub secret named PYPI_TOKEN.
-
- ## Documentation
-
- Reference documentation is generated from Sphinx style docstrings. Build the documentation with the command below.
-
- ```
- sphinx-build -b html docs docs/_build
- ```
-
- ## License
-
- License terms are in `LICENSE`.
-
- [retrieval augmented generation overview]: https://en.wikipedia.org/wiki/Retrieval-augmented_generation
- [architecture]: docs/ARCHITECTURE.md
- [backends]: docs/BACKENDS.md
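Reviewer note: the removed file in the hunk above is the 0.1.1 `METADATA`, which uses the core metadata layout of RFC 822 style headers followed by the README body. As a minimal sketch of how the dependency headers can be read when comparing releases, the standard library `email.parser` can split runtime requirements from those gated on an extra. The names `METADATA_TEXT` and `split_requirements` are hypothetical helpers for illustration, and the sample text is only a subset of the headers shown above.

```python
from email.parser import Parser

# A subset of the header lines from the removed 0.1.1 METADATA, copied from the hunk above.
METADATA_TEXT = """\
Metadata-Version: 2.4
Name: biblicus
Version: 0.1.1
Requires-Python: >=3.9
Requires-Dist: pydantic>=2.0
Requires-Dist: PyYAML>=6.0
Provides-Extra: dev
Requires-Dist: behave>=1.2.6; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"

# Biblicus
"""


def split_requirements(metadata_text):
    """Return (name, version, runtime requirements, extra-only requirements)."""
    # Headers end at the first blank line; everything after it is the description body.
    message = Parser().parsestr(metadata_text)
    runtime, extras = [], []
    for requirement in message.get_all("Requires-Dist", []):
        # The environment marker, if any, follows the first ';'.
        _, _, marker = requirement.partition(";")
        if "extra" in marker:
            extras.append(requirement)
        else:
            runtime.append(requirement)
    return message["Name"], message["Version"], runtime, extras


name, version, runtime, extras = split_requirements(METADATA_TEXT)
print(name, version, runtime, extras)
```

Running the same helper over the 0.3.0 `METADATA` would give a quick view of which runtime dependencies and extras changed between the two wheels.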
@@ -1,22 +0,0 @@
- biblicus/__init__.py,sha256=o_1kQ7q9DCcjH7zm5MAvPx49hArnSvbr88kHKzBFMvM,432
- biblicus/__main__.py,sha256=ipfkUoTlocVnrQDM69C7TeBqQxmHVeiWMRaT3G9rtnk,117
- biblicus/cli.py,sha256=DwnvcDmjelzUq_9VMo_U_-FoBs3Si3QONVJdWGonXs4,15116
- biblicus/constants.py,sha256=t8p0yStpJAYPxsFlM0u5zJcQr_ARKEqEnIgNckjyF5Y,196
- biblicus/corpus.py,sha256=953gzT77HvYeTs2pcBXyixYRTxh65nm1JtlHVfKvCzg,30921
- biblicus/evaluation.py,sha256=H_W35vF5_L4B2JCfLu19VRu402tZ2pFkN2BbBP69lVY,8119
- biblicus/frontmatter.py,sha256=8Tqlpd3bVzZrGRB9Rdj2IwHMSJLvd2ABxMNOi3L5br4,2466
- biblicus/models.py,sha256=ZDb7-t9pycPpgZWVs5CcrpyeA_8OZLoQk-aflKjU7M4,10512
- biblicus/retrieval.py,sha256=T7HELWCNAxZ26yj7dPH8IBUaxV_gx8Ql9iwwGz0teyI,4184
- biblicus/sources.py,sha256=XFF75kqMyYdeYy6k8NtDnOmCxAmroW7DH6mdzWMPMuY,4358
- biblicus/time.py,sha256=rvp2fJXSLVmyA76GCfNKtZoifASodemJTOWN8smPt0s,486
- biblicus/uris.py,sha256=sRDyGmoHr_H4XR4qv_lSbQJXylYD0fNEr02H5wjomnQ,1986
- biblicus/backends/__init__.py,sha256=5OXKSzsn7THhwh9T5StOvEqojx_85XXuYSGdTpMK11U,1214
- biblicus/backends/base.py,sha256=699TKygGgL72Ifkhz1V890nOK6BslwO0-OY7xeqZl-I,1764
- biblicus/backends/scan.py,sha256=qvktqHIB0459sjzEO4EnS1PCXwwM19LjOx8oaDoU7DQ,9245
- biblicus/backends/sqlite_full_text_search.py,sha256=s_3gsEcdlxSFuluWcug4XEklwEoY42_Dgd7luY-BqqI,14152
- biblicus-0.1.1.dist-info/licenses/LICENSE,sha256=lw44GXFG_Q0fS8m5VoEvv_xtdBXK26pBcbSPUCXee_Q,1078
- biblicus-0.1.1.dist-info/METADATA,sha256=lgvWJUgESiwWTCZ6_uUzgZeM3SkvnwjIzcsb8OE53BA,6635
- biblicus-0.1.1.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
- biblicus-0.1.1.dist-info/entry_points.txt,sha256=BZmO4H8Uz00fyi1RAFryOCGfZgX7eHWkY2NE-G54U5A,47
- biblicus-0.1.1.dist-info/top_level.txt,sha256=sUD_XVZwDxZ29-FBv1MknTGh4mgDXznGuP28KJY_WKc,9
- biblicus-0.1.1.dist-info/RECORD,,
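Reviewer note: each row of the removed `RECORD` file above has the form `path,sha256=<digest>,size`, where the digest is the URL-safe Base64 encoding of the SHA-256 hash with trailing `=` padding stripped, and the `RECORD` row itself carries no hash or size. The following is a minimal sketch of checking an unpacked file against such a row, assuming the file bytes are already in memory. The helper names `record_digest`, `parse_record`, and `entry_matches` are hypothetical and not part of biblicus or pip.

```python
import base64
import csv
import hashlib
import io


def record_digest(data: bytes) -> str:
    """Format a SHA-256 digest the way wheel RECORD files store it."""
    digest = hashlib.sha256(data).digest()
    # URL-safe Base64 with the trailing '=' padding removed.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def parse_record(text: str):
    """Parse RECORD text into (path, digest or None, size or None) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # tolerate blank lines
        path, digest, size = row
        rows.append((path, digest or None, int(size) if size else None))
    return rows


def entry_matches(data: bytes, digest: str, size: int) -> bool:
    """Check file bytes against the digest and size recorded for them."""
    return len(data) == size and record_digest(data) == digest


# Two rows copied from the hunk above; the file contents themselves are not in the diff.
SAMPLE = (
    "biblicus/time.py,sha256=rvp2fJXSLVmyA76GCfNKtZoifASodemJTOWN8smPt0s,486\n"
    "biblicus-0.1.1.dist-info/RECORD,,\n"
)
rows = parse_record(SAMPLE)
print(rows)

# Round trip on made-up bytes, since the real file contents are not shown here.
payload = b"example bytes\n"
print(entry_matches(payload, record_digest(payload), len(payload)))
```

Because the diff page shows only the recorded digests and not the file bytes, the round trip uses made-up data. Against a real unpacked wheel, the same check would be run once per parsed row.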