PyPI - biblicus - Versions diffs - 0.2.0__tar.gz → 0.4.0__tar.gz - Mend

biblicus 0.2.0__tar.gz → 0.4.0__tar.gz


Files changed (147)

{biblicus-0.2.0 → biblicus-0.4.0}/MANIFEST.in RENAMED

@@ -1,5 +1,7 @@
 include README.md
 include LICENSE
+include THIRD_PARTY_NOTICES.md
+include .biblicus/config.example.yml
 include pyproject.toml
 recursive-include src *.py

{biblicus-0.2.0 → biblicus-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.2.0
+Version: 0.4.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -8,20 +8,30 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: pydantic>=2.0
 Requires-Dist: PyYAML>=6.0
+Requires-Dist: pypdf>=4.0
 Provides-Extra: dev
 Requires-Dist: behave>=1.2.6; extra == "dev"
 Requires-Dist: coverage[toml]>=7.0; extra == "dev"
 Requires-Dist: sphinx>=7.0; extra == "dev"
 Requires-Dist: myst-parser>=2.0; extra == "dev"
+Requires-Dist: sphinx_rtd_theme>=2.0; extra == "dev"
 Requires-Dist: ruff>=0.4.0; extra == "dev"
 Requires-Dist: black>=24.0; extra == "dev"
 Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
+Provides-Extra: openai
+Requires-Dist: openai>=1.0; extra == "openai"
+Provides-Extra: unstructured
+Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
+Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
+Provides-Extra: ocr
+Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
 Dynamic: license-file
 # Biblicus
 ![Continuous integration][continuous-integration-badge]
 ![Coverage][coverage-badge]
+![Documentation][documentation-badge]
 Make your documents usable by your assistant, then decide later how you will search and retrieve them.
@@ -31,28 +41,34 @@ The first practical problem is not retrieval. It is collection and care. You nee
 This library gives you a corpus, which is a normal folder on disk. It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw corpus.
-It integrates with LangChain, Tactus, Pydantic AI, and the agent development kit. Use it from Python or from the command line interface.
+It can be used alongside LangChain, Tactus, Pydantic AI, or the agent development kit. Use it from Python or from the command line interface.
 See [retrieval augmented generation overview] for a short introduction to the idea.
-## The framework
+## A beginner friendly mental model
-The framework is a small, explicit vocabulary that appears in code, specifications, and documentation. If you learn these words, the rest of the system becomes predictable.
+Think in three stages.
+- Ingest puts raw items into a corpus. This is file first and human inspectable.
+- Extract turns items into usable text. This is where you would do text extraction from Portable Document Format files, optical character recognition for images, or speech to text for audio. If an item is already text, extraction can simply read it. Extraction outputs are derived artifacts, not edits to the raw files.
+- Retrieve searches extracted text and returns evidence. Evidence is structured so you can turn it into context for your model call in whatever way your project prefers.
+If you learn a few project words, the rest of the system becomes predictable.
 - Corpus is the folder that holds raw items and their metadata.
-- Item is the raw bytes of a document or other artifact, plus its source.
+- Item is the raw bytes plus optional metadata and source information.
 - Catalog is the rebuildable index of the corpus.
-- Evidence is what retrieval returns, ready to be turned into context for a large language model.
-- Run is a recorded retrieval build for a corpus.
+- Extraction run is a recorded extraction build that produces text artifacts.
 - Backend is a pluggable retrieval implementation.
-- Recipe is a named configuration for a backend.
-- Pipeline stage is a distinct retrieval step such as retrieve, rerank, and filter.
+- Run is a recorded retrieval build for a corpus.
+- Evidence is what retrieval returns, with identifiers and source information.
 ## Diagram
 This diagram shows how a corpus becomes evidence for an assistant.
-The legend shows what the border styles and fill styles mean.
-The your code region is where you decide how to turn evidence into context and how to call a model.
+Extraction is introduced here as a separate stage so you can swap extraction approaches without changing the raw corpus.
+The legend shows what the block styles mean.
+Your code is where you decide how to turn evidence into context and how to call a model.
 ```mermaid
 %%{init: {"flowchart": {"useMaxWidth": true, "nodeSpacing": 18, "rankSpacing": 22}}}%%
@@ -74,12 +90,19 @@ flowchart LR
       Raw --> Catalog[Catalog file]
     end
-    subgraph PluggableRetrievalBackend[Pluggable retrieval backend]
+    subgraph PluggableExtractionPipeline[Pluggable: extraction pipeline]
+      direction TB
+      Catalog --> Extract[Extract pipeline]
+      Extract --> ExtractedText[Extracted text artifacts]
+      ExtractedText --> ExtractionRun[Extraction run manifest]
+    end
+    subgraph PluggableRetrievalBackend[Pluggable: retrieval backend]
       direction LR
       subgraph BackendIngestionIndexing[Ingestion and indexing]
         direction TB
-        Catalog --> Build[Build run]
+        ExtractionRun --> Build[Build run]
         Build --> BackendIndex[Backend index]
         BackendIndex --> Run[Run manifest]
       end
@@ -100,6 +123,7 @@ flowchart LR
     end
     style StableCore fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
+    style PluggableExtractionPipeline fill:#ffffff,stroke:#5e35b1,stroke-dasharray:6 3,stroke-width:2px,color:#111111
     style PluggableRetrievalBackend fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
     style YourCode fill:#ffffff,stroke:#d81b60,stroke-width:2px,color:#111111
     style BackendIngestionIndexing fill:#ffffff,stroke:#cfd8dc,color:#111111
@@ -107,6 +131,8 @@ flowchart LR
     style Raw fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Catalog fill:#f3e5f5,stroke:#8e24aa,color:#111111
+    style ExtractedText fill:#f3e5f5,stroke:#8e24aa,color:#111111
+    style ExtractionRun fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style BackendIndex fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Run fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Evidence fill:#f3e5f5,stroke:#8e24aa,color:#111111
@@ -115,6 +141,7 @@ flowchart LR
     style Source fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Ingest fill:#eceff1,stroke:#90a4ae,color:#111111
+    style Extract fill:#eceff1,stroke:#90a4ae,color:#111111
     style Build fill:#eceff1,stroke:#90a4ae,color:#111111
     style Query fill:#eceff1,stroke:#90a4ae,color:#111111
     style Model fill:#eceff1,stroke:#90a4ae,color:#111111
@@ -136,6 +163,8 @@ flowchart LR
 - Initialize a corpus folder.
 - Ingest items from file paths, web addresses, or text input.
+- Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
+- Run extraction when you want derived text artifacts from non-text sources.
 - Reindex to refresh the catalog after edits.
 - Build a retrieval run with a backend.
 - Query the run to collect evidence and evaluate it with datasets.
@@ -154,17 +183,40 @@ After the first release, you can install it from Python Package Index.
 python3 -m pip install biblicus
 ```
+### Optional extras
+Some extractors are optional so the base install stays small.
+- Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
+- Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
 ## Quick start
 ```
+mkdir -p notes
+echo "A small file note" > notes/example.txt
 biblicus init corpora/example
 biblicus ingest --corpus corpora/example notes/example.txt
 echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
 biblicus list --corpus corpora/example
+biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
+biblicus extract list --corpus corpora/example
 biblicus build --corpus corpora/example --backend scan
 biblicus query --corpus corpora/example --query "note"
 ```
+If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
+```
+biblicus crawl --corpus corpora/example \
+  --root-url https://example.com/docs/index.html \
+  --allowed-prefix https://example.com/docs/ \
+  --max-items 50 \
+  --tag crawled
+```
 ## Python usage
 From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
@@ -188,13 +240,18 @@ In an assistant system, retrieval usually produces context for a model call. Thi
 ## Learn more
+Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
 The documents below are written to be read in order.
 - [Architecture][architecture]
+- [Roadmap][roadmap]
+- [Feature index][feature-index]
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [User configuration][user-configuration]
 - [Backends][backends]
-- [Next steps][next-steps]
+- [Demos][demos]
 - [Testing][testing]
 ## Metadata and catalog
@@ -212,7 +269,16 @@ corpus/
     config.json
     catalog.json
     runs/
-      run-id.json
+      extraction/
+        pipeline/
+          <run id>/
+            manifest.json
+            text/
+              <item id>.txt
+      retrieval/
+        <backend id>/
+          <run id>/
+            manifest.json
 ```
 ## Retrieval backends
@@ -252,10 +318,18 @@ Publishing uses a Python Package Index token stored in the GitHub secret named P
 ## Documentation
-Reference documentation is generated from Sphinx style docstrings. Build the documentation with the command below.
+Reference documentation is generated from Sphinx style docstrings.
+Install development dependencies:
+```
+python3 -m pip install -e ".[dev]"
+```
+Build the documentation:
 ```
-sphinx-build -b html docs docs/_build
+python3 -m sphinx -b html docs docs/_build/html
 ```
 ## License
@@ -264,11 +338,15 @@ License terms are in `LICENSE`.
 [retrieval augmented generation overview]: https://en.wikipedia.org/wiki/Retrieval-augmented_generation
 [architecture]: docs/ARCHITECTURE.md
+[roadmap]: docs/ROADMAP.md
+[feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
 [text-extraction]: docs/EXTRACTION.md
+[user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md
-[next-steps]: docs/NEXT_STEPS.md
+[demos]: docs/DEMOS.md
 [testing]: docs/TESTING.md
 [continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
 [coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
+[documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
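
Review note: the PKG-INFO hunk above adds one base requirement (`pypdf>=4.0`) and three extras (`openai`, `unstructured`, `ocr`). As a rough way to check which requirement belongs to which extra, the sketch below parses an excerpt of the 0.4.0 metadata with the standard library. The helper name `requirements_by_extra` is hypothetical, and the marker handling is deliberately naive (it only looks for `extra == "name"`), so treat it as an illustration rather than a full marker parser.

```python
from collections import defaultdict
from email.parser import Parser

# Excerpt of the 0.4.0 metadata shown in the hunk above (dev extra omitted).
PKG_INFO = """\
Metadata-Version: 2.4
Name: biblicus
Version: 0.4.0
Requires-Dist: pydantic>=2.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: pypdf>=4.0
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: unstructured
Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
Provides-Extra: ocr
Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
"""


def requirements_by_extra(metadata_text):
    """Group Requires-Dist entries by extra; base requirements use the key ''.

    Hypothetical helper for illustration, not part of biblicus.
    """
    message = Parser().parsestr(metadata_text)
    grouped = defaultdict(list)
    for entry in message.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        extra = ""
        if "extra ==" in marker:
            # Naive marker handling: take the quoted name after `extra ==`.
            extra = marker.split("extra ==", 1)[1].strip().strip("\"'")
        grouped[extra].append(requirement.strip())
    return dict(grouped)


grouped = requirements_by_extra(PKG_INFO)
for extra, requirements in grouped.items():
    print(extra or "(base)", requirements)
```

Keeping the heavier extractors behind extras matches the README's stated intent that "the base install stays small".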

biblicus-0.2.0/src/biblicus.egg-info/PKG-INFO → biblicus-0.4.0/README.md RENAMED

@@ -1,27 +1,8 @@
-Metadata-Version: 2.4
-Name: biblicus
-Version: 0.2.0
-Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
-License: MIT
-Requires-Python: >=3.9
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: pydantic>=2.0
-Requires-Dist: PyYAML>=6.0
-Provides-Extra: dev
-Requires-Dist: behave>=1.2.6; extra == "dev"
-Requires-Dist: coverage[toml]>=7.0; extra == "dev"
-Requires-Dist: sphinx>=7.0; extra == "dev"
-Requires-Dist: myst-parser>=2.0; extra == "dev"
-Requires-Dist: ruff>=0.4.0; extra == "dev"
-Requires-Dist: black>=24.0; extra == "dev"
-Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
-Dynamic: license-file
 # Biblicus
 ![Continuous integration][continuous-integration-badge]
 ![Coverage][coverage-badge]
+![Documentation][documentation-badge]
 Make your documents usable by your assistant, then decide later how you will search and retrieve them.
@@ -31,28 +12,34 @@ The first practical problem is not retrieval. It is collection and care. You nee
 This library gives you a corpus, which is a normal folder on disk. It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw corpus.
-It integrates with LangChain, Tactus, Pydantic AI, and the agent development kit. Use it from Python or from the command line interface.
+It can be used alongside LangChain, Tactus, Pydantic AI, or the agent development kit. Use it from Python or from the command line interface.
 See [retrieval augmented generation overview] for a short introduction to the idea.
-## The framework
+## A beginner friendly mental model
+Think in three stages.
+- Ingest puts raw items into a corpus. This is file first and human inspectable.
+- Extract turns items into usable text. This is where you would do text extraction from Portable Document Format files, optical character recognition for images, or speech to text for audio. If an item is already text, extraction can simply read it. Extraction outputs are derived artifacts, not edits to the raw files.
+- Retrieve searches extracted text and returns evidence. Evidence is structured so you can turn it into context for your model call in whatever way your project prefers.
-The framework is a small, explicit vocabulary that appears in code, specifications, and documentation. If you learn these words, the rest of the system becomes predictable.
+If you learn a few project words, the rest of the system becomes predictable.
 - Corpus is the folder that holds raw items and their metadata.
-- Item is the raw bytes of a document or other artifact, plus its source.
+- Item is the raw bytes plus optional metadata and source information.
 - Catalog is the rebuildable index of the corpus.
-- Evidence is what retrieval returns, ready to be turned into context for a large language model.
-- Run is a recorded retrieval build for a corpus.
+- Extraction run is a recorded extraction build that produces text artifacts.
 - Backend is a pluggable retrieval implementation.
-- Recipe is a named configuration for a backend.
-- Pipeline stage is a distinct retrieval step such as retrieve, rerank, and filter.
+- Run is a recorded retrieval build for a corpus.
+- Evidence is what retrieval returns, with identifiers and source information.
 ## Diagram
 This diagram shows how a corpus becomes evidence for an assistant.
-The legend shows what the border styles and fill styles mean.
-The your code region is where you decide how to turn evidence into context and how to call a model.
+Extraction is introduced here as a separate stage so you can swap extraction approaches without changing the raw corpus.
+The legend shows what the block styles mean.
+Your code is where you decide how to turn evidence into context and how to call a model.
 ```mermaid
 %%{init: {"flowchart": {"useMaxWidth": true, "nodeSpacing": 18, "rankSpacing": 22}}}%%
@@ -74,12 +61,19 @@ flowchart LR
       Raw --> Catalog[Catalog file]
     end
-    subgraph PluggableRetrievalBackend[Pluggable retrieval backend]
+    subgraph PluggableExtractionPipeline[Pluggable: extraction pipeline]
+      direction TB
+      Catalog --> Extract[Extract pipeline]
+      Extract --> ExtractedText[Extracted text artifacts]
+      ExtractedText --> ExtractionRun[Extraction run manifest]
+    end
+    subgraph PluggableRetrievalBackend[Pluggable: retrieval backend]
       direction LR
       subgraph BackendIngestionIndexing[Ingestion and indexing]
         direction TB
-        Catalog --> Build[Build run]
+        ExtractionRun --> Build[Build run]
         Build --> BackendIndex[Backend index]
         BackendIndex --> Run[Run manifest]
       end
@@ -100,6 +94,7 @@ flowchart LR
     end
     style StableCore fill:#ffffff,stroke:#8e24aa,stroke-width:2px,color:#111111
+    style PluggableExtractionPipeline fill:#ffffff,stroke:#5e35b1,stroke-dasharray:6 3,stroke-width:2px,color:#111111
     style PluggableRetrievalBackend fill:#ffffff,stroke:#1e88e5,stroke-dasharray:6 3,stroke-width:2px,color:#111111
     style YourCode fill:#ffffff,stroke:#d81b60,stroke-width:2px,color:#111111
     style BackendIngestionIndexing fill:#ffffff,stroke:#cfd8dc,color:#111111
@@ -107,6 +102,8 @@ flowchart LR
     style Raw fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Catalog fill:#f3e5f5,stroke:#8e24aa,color:#111111
+    style ExtractedText fill:#f3e5f5,stroke:#8e24aa,color:#111111
+    style ExtractionRun fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style BackendIndex fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Run fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Evidence fill:#f3e5f5,stroke:#8e24aa,color:#111111
@@ -115,6 +112,7 @@ flowchart LR
     style Source fill:#f3e5f5,stroke:#8e24aa,color:#111111
     style Ingest fill:#eceff1,stroke:#90a4ae,color:#111111
+    style Extract fill:#eceff1,stroke:#90a4ae,color:#111111
     style Build fill:#eceff1,stroke:#90a4ae,color:#111111
     style Query fill:#eceff1,stroke:#90a4ae,color:#111111
     style Model fill:#eceff1,stroke:#90a4ae,color:#111111
@@ -136,6 +134,8 @@ flowchart LR
 - Initialize a corpus folder.
 - Ingest items from file paths, web addresses, or text input.
+- Crawl a website section into corpus items when you want a repeatable “import from the web” workflow.
+- Run extraction when you want derived text artifacts from non-text sources.
 - Reindex to refresh the catalog after edits.
 - Build a retrieval run with a backend.
 - Query the run to collect evidence and evaluate it with datasets.
@@ -154,17 +154,40 @@ After the first release, you can install it from Python Package Index.
 python3 -m pip install biblicus
 ```
+### Optional extras
+Some extractors are optional so the base install stays small.
+- Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
+- Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
 ## Quick start
 ```
+mkdir -p notes
+echo "A small file note" > notes/example.txt
 biblicus init corpora/example
 biblicus ingest --corpus corpora/example notes/example.txt
 echo "A short note" | biblicus ingest --corpus corpora/example --stdin --title "First note"
 biblicus list --corpus corpora/example
+biblicus extract build --corpus corpora/example --step pass-through-text --step metadata-text
+biblicus extract list --corpus corpora/example
 biblicus build --corpus corpora/example --backend scan
 biblicus query --corpus corpora/example --query "note"
 ```
+If you want to turn a website section into corpus items, crawl a root web address while restricting the crawl to an allowed prefix:
+```
+biblicus crawl --corpus corpora/example \
+  --root-url https://example.com/docs/index.html \
+  --allowed-prefix https://example.com/docs/ \
+  --max-items 50 \
+  --tag crawled
+```
 ## Python usage
 From Python, the same flow is available through the Corpus class and backend interfaces. The public surface area is small on purpose.
@@ -188,13 +211,18 @@ In an assistant system, retrieval usually produces context for a model call. Thi
 ## Learn more
+Full documentation is published on GitHub Pages: https://anthusai.github.io/Biblicus/
 The documents below are written to be read in order.
 - [Architecture][architecture]
+- [Roadmap][roadmap]
+- [Feature index][feature-index]
 - [Corpus][corpus]
 - [Text extraction][text-extraction]
+- [User configuration][user-configuration]
 - [Backends][backends]
-- [Next steps][next-steps]
+- [Demos][demos]
 - [Testing][testing]
 ## Metadata and catalog
@@ -212,7 +240,16 @@ corpus/
     config.json
     catalog.json
     runs/
-      run-id.json
+      extraction/
+        pipeline/
+          <run id>/
+            manifest.json
+            text/
+              <item id>.txt
+      retrieval/
+        <backend id>/
+          <run id>/
+            manifest.json
 ```
 ## Retrieval backends
@@ -252,10 +289,18 @@ Publishing uses a Python Package Index token stored in the GitHub secret named P
 ## Documentation
-Reference documentation is generated from Sphinx style docstrings. Build the documentation with the command below.
+Reference documentation is generated from Sphinx style docstrings.
+Install development dependencies:
+```
+python3 -m pip install -e ".[dev]"
+```
+Build the documentation:
 ```
-sphinx-build -b html docs docs/_build
+python3 -m sphinx -b html docs docs/_build/html
 ```
 ## License
@@ -264,11 +309,15 @@ License terms are in `LICENSE`.
 [retrieval augmented generation overview]: https://en.wikipedia.org/wiki/Retrieval-augmented_generation
 [architecture]: docs/ARCHITECTURE.md
+[roadmap]: docs/ROADMAP.md
+[feature-index]: docs/FEATURE_INDEX.md
 [corpus]: docs/CORPUS.md
 [text-extraction]: docs/EXTRACTION.md
+[user-configuration]: docs/USER_CONFIGURATION.md
 [backends]: docs/BACKENDS.md
-[next-steps]: docs/NEXT_STEPS.md
+[demos]: docs/DEMOS.md
 [testing]: docs/TESTING.md
 [continuous-integration-badge]: https://github.com/AnthusAI/Biblicus/actions/workflows/ci.yml/badge.svg?branch=main
 [coverage-badge]: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/AnthusAI/Biblicus/main/coverage_badge.json
+[documentation-badge]: https://img.shields.io/badge/docs-GitHub%20Pages-blue
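
Review note: the README hunk above replaces the flat `runs/run-id.json` file with nested `runs/extraction/pipeline/<run id>/` and `runs/retrieval/<backend id>/<run id>/` directories. The sketch below shows how a script might locate artifacts under that layout. It is not the library's API: both helper names are hypothetical, the parent of `runs/` is not visible in the hunk and is therefore passed in as a parameter, and `run-1` and `item-a` are made-up identifiers.

```python
import tempfile
from pathlib import Path


def extraction_text_paths(runs_dir, run_id):
    """Map item id -> text artifact path for one extraction run.

    Hypothetical helper; assumes runs/extraction/pipeline/<run id>/text/<item id>.txt.
    """
    text_dir = Path(runs_dir) / "extraction" / "pipeline" / run_id / "text"
    return {path.stem: path for path in sorted(text_dir.glob("*.txt"))}


def retrieval_manifest_path(runs_dir, backend_id, run_id):
    """Build the manifest path for one retrieval run (hypothetical helper)."""
    return Path(runs_dir) / "retrieval" / backend_id / run_id / "manifest.json"


# Usage against a throwaway directory that mimics the new layout.
with tempfile.TemporaryDirectory() as tmp:
    runs_dir = Path(tmp) / "runs"
    text_dir = runs_dir / "extraction" / "pipeline" / "run-1" / "text"
    text_dir.mkdir(parents=True)
    (text_dir / "item-a.txt").write_text("A small file note", encoding="utf-8")

    found = extraction_text_paths(runs_dir, "run-1")
    item_ids = sorted(found)
    manifest = retrieval_manifest_path(runs_dir, "scan", "run-1")
    manifest_tail = manifest.relative_to(runs_dir).as_posix()
    print(item_ids, manifest_tail)
```

Because extraction and retrieval runs now live in separate subtrees, any tooling written against the 0.2.0 `runs/run-id.json` path would need updating.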

biblicus-0.4.0/THIRD_PARTY_NOTICES.md ADDED

@@ -0,0 +1,36 @@
+# Third-party notices
+This project includes vendored third-party source code.
+## dotyaml
+Portions of this repository vendor code from the `dotyaml` project.
+- Project: `dotyaml`
+- Source: `../dotyaml` (vendored into `src/biblicus/_vendor/dotyaml/`)
+- License: MIT
+```
+MIT License
+Copyright (c) 2025 yamlenv
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+```

{biblicus-0.2.0 → biblicus-0.4.0}/docs/CORPUS.md RENAMED

@@ -43,6 +43,20 @@ Ingest a web address:
 python3 -m biblicus ingest --corpus corpora/example https://example.com --tag web
 ```
+## Crawl a website prefix
+To build a corpus from a website section, crawl a root uniform resource locator and restrict the crawl to an allowed prefix.
+```
+python3 -m biblicus crawl --corpus corpora/example \
+  --root-url https://example.com/docs/index.html \
+  --allowed-prefix https://example.com/docs/ \
+  --max-items 50 \
+  --tag crawled
+```
+The crawl command only follows links within the allowed prefix, and it respects `.biblicusignore` patterns against the path relative to the allowed prefix.
 Ingest a text note:
 ```
@@ -100,4 +114,3 @@ Purging deletes all items and derived artifacts under the corpus. It requires yo
 ```
 python3 -m biblicus purge --corpus corpora/example --confirm example
 ```
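
Review note: the CORPUS.md hunk above says the crawl only follows links within the allowed prefix and applies `.biblicusignore` patterns to the path relative to that prefix. The sketch below illustrates that rule in isolation. It is not the biblicus implementation: the function name `is_crawl_candidate` is hypothetical, the ignore patterns are assumed to behave like `fnmatch` globs, and `drafts/*` is a made-up pattern.

```python
from fnmatch import fnmatch


def is_crawl_candidate(url, allowed_prefix, ignore_patterns=()):
    """Return True if `url` is under `allowed_prefix` and matches no ignore pattern.

    Hypothetical helper; patterns are assumed to be fnmatch-style globs
    evaluated against the path relative to the allowed prefix.
    """
    if not url.startswith(allowed_prefix):
        return False
    relative_path = url[len(allowed_prefix):]
    return not any(fnmatch(relative_path, pattern) for pattern in ignore_patterns)


prefix = "https://example.com/docs/"
patterns = ["drafts/*"]  # made-up ignore pattern for illustration
checks = [
    is_crawl_candidate("https://example.com/docs/guide.html", prefix, patterns),
    is_crawl_candidate("https://example.com/blog/post.html", prefix, patterns),
    is_crawl_candidate("https://example.com/docs/drafts/wip.html", prefix, patterns),
]
print(checks)
```

Matching against the prefix-relative path means the same ignore file can be reused if the documentation moves to a different host or base path.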

biblicus-0.2.0/docs/CORPUS_WORKFLOWS.md → biblicus-0.4.0/docs/CORPUS_DESIGN.md RENAMED

@@ -1,13 +1,9 @@
-# Corpus workflows and lifecycle hooks
+# Corpus design
-This document records the design decisions and outcomes for corpus management and lifecycle hooks in version zero. It is written in a decision format because the long term shape of the library is determined by corpus workflows more than by any particular retrieval backend.
+This document records design decisions and outcomes for corpus management and lifecycle hooks in version zero.
 The goal is to make corpus management practical for day to day use, while keeping the raw corpus durable and readable as ordinary files on disk.
-## Initiative constraints
-The project uses strict behavior driven development. Behavior specifications in `features/*.feature` are the authoritative definition of system behavior.
 ## What exists today
 The project already supports:
