biblicus 0.12.0__tar.gz → 0.14.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (252)
  1. {biblicus-0.12.0/src/biblicus.egg-info → biblicus-0.14.0}/PKG-INFO +8 -3
  2. {biblicus-0.12.0 → biblicus-0.14.0}/README.md +7 -2
  3. biblicus-0.14.0/datasets/extraction_lab/labels.json +19 -0
  4. biblicus-0.14.0/datasets/retrieval_lab/labels.json +25 -0
  5. {biblicus-0.12.0 → biblicus-0.14.0}/docs/BACKENDS.md +1 -0
  6. {biblicus-0.12.0 → biblicus-0.14.0}/docs/DEMOS.md +38 -0
  7. {biblicus-0.12.0 → biblicus-0.14.0}/docs/EXTRACTION.md +5 -0
  8. biblicus-0.14.0/docs/EXTRACTION_EVALUATION.md +147 -0
  9. {biblicus-0.12.0 → biblicus-0.14.0}/docs/FEATURE_INDEX.md +25 -0
  10. biblicus-0.14.0/docs/RETRIEVAL.md +96 -0
  11. biblicus-0.14.0/docs/RETRIEVAL_EVALUATION.md +181 -0
  12. biblicus-0.14.0/docs/RETRIEVAL_QUALITY.md +106 -0
  13. {biblicus-0.12.0 → biblicus-0.14.0}/docs/backends/index.md +14 -3
  14. biblicus-0.14.0/docs/backends/vector.md +59 -0
  15. {biblicus-0.12.0 → biblicus-0.14.0}/docs/index.rst +1 -0
  16. biblicus-0.14.0/features/extraction_evaluation.feature +201 -0
  17. biblicus-0.14.0/features/extraction_evaluation_lab.feature +12 -0
  18. biblicus-0.14.0/features/retrieval_evaluation_lab.feature +10 -0
  19. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/cli_steps.py +1 -0
  20. biblicus-0.14.0/features/steps/extraction_evaluation_lab_steps.py +85 -0
  21. biblicus-0.14.0/features/steps/extraction_evaluation_steps.py +216 -0
  22. biblicus-0.14.0/features/steps/retrieval_evaluation_lab_steps.py +77 -0
  23. {biblicus-0.12.0 → biblicus-0.14.0}/pyproject.toml +1 -1
  24. biblicus-0.14.0/scripts/extraction_evaluation_demo.py +206 -0
  25. biblicus-0.14.0/scripts/extraction_evaluation_lab.py +229 -0
  26. biblicus-0.14.0/scripts/retrieval_evaluation_lab.py +284 -0
  27. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/__init__.py +1 -1
  28. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/backends/hybrid.py +6 -1
  29. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/cli.py +69 -0
  30. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/constants.py +1 -0
  31. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/context.py +2 -2
  32. biblicus-0.14.0/src/biblicus/extraction_evaluation.py +312 -0
  33. {biblicus-0.12.0 → biblicus-0.14.0/src/biblicus.egg-info}/PKG-INFO +8 -3
  34. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus.egg-info/SOURCES.txt +14 -0
  35. biblicus-0.12.0/docs/RETRIEVAL.md +0 -47
  36. biblicus-0.12.0/docs/RETRIEVAL_EVALUATION.md +0 -74
  37. biblicus-0.12.0/docs/RETRIEVAL_QUALITY.md +0 -42
  38. {biblicus-0.12.0 → biblicus-0.14.0}/LICENSE +0 -0
  39. {biblicus-0.12.0 → biblicus-0.14.0}/MANIFEST.in +0 -0
  40. {biblicus-0.12.0 → biblicus-0.14.0}/THIRD_PARTY_NOTICES.md +0 -0
  41. {biblicus-0.12.0 → biblicus-0.14.0}/datasets/wikipedia_mini.json +0 -0
  42. {biblicus-0.12.0 → biblicus-0.14.0}/docs/ANALYSIS.md +0 -0
  43. {biblicus-0.12.0 → biblicus-0.14.0}/docs/ARCHITECTURE.md +0 -0
  44. {biblicus-0.12.0 → biblicus-0.14.0}/docs/CONTEXT_PACK.md +0 -0
  45. {biblicus-0.12.0 → biblicus-0.14.0}/docs/CORPUS.md +0 -0
  46. {biblicus-0.12.0 → biblicus-0.14.0}/docs/CORPUS_DESIGN.md +0 -0
  47. {biblicus-0.12.0 → biblicus-0.14.0}/docs/KNOWLEDGE_BASE.md +0 -0
  48. {biblicus-0.12.0 → biblicus-0.14.0}/docs/PROFILING.md +0 -0
  49. {biblicus-0.12.0 → biblicus-0.14.0}/docs/ROADMAP.md +0 -0
  50. {biblicus-0.12.0 → biblicus-0.14.0}/docs/STT.md +0 -0
  51. {biblicus-0.12.0 → biblicus-0.14.0}/docs/TESTING.md +0 -0
  52. {biblicus-0.12.0 → biblicus-0.14.0}/docs/TOPIC_MODELING.md +0 -0
  53. {biblicus-0.12.0 → biblicus-0.14.0}/docs/USER_CONFIGURATION.md +0 -0
  54. {biblicus-0.12.0 → biblicus-0.14.0}/docs/api.rst +0 -0
  55. {biblicus-0.12.0 → biblicus-0.14.0}/docs/backends/scan.md +0 -0
  56. {biblicus-0.12.0 → biblicus-0.14.0}/docs/backends/sqlite-full-text-search.md +0 -0
  57. {biblicus-0.12.0 → biblicus-0.14.0}/docs/conf.py +0 -0
  58. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/index.md +0 -0
  59. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/ocr/index.md +0 -0
  60. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/ocr/paddleocr-vl.md +0 -0
  61. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/ocr/rapidocr.md +0 -0
  62. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/index.md +0 -0
  63. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/pipeline.md +0 -0
  64. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-longest.md +0 -0
  65. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-override.md +0 -0
  66. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-smart-override.md +0 -0
  67. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-text.md +0 -0
  68. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/speech-to-text/deepgram.md +0 -0
  69. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/speech-to-text/index.md +0 -0
  70. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/speech-to-text/openai.md +0 -0
  71. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/text-document/index.md +0 -0
  72. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/text-document/markitdown.md +0 -0
  73. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/text-document/metadata.md +0 -0
  74. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/text-document/pass-through.md +0 -0
  75. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/text-document/pdf.md +0 -0
  76. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/text-document/unstructured.md +0 -0
  77. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/vlm-document/docling-granite.md +0 -0
  78. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/vlm-document/docling-smol.md +0 -0
  79. {biblicus-0.12.0 → biblicus-0.14.0}/docs/extractors/vlm-document/index.md +0 -0
  80. {biblicus-0.12.0 → biblicus-0.14.0}/features/analysis_schema.feature +0 -0
  81. {biblicus-0.12.0 → biblicus-0.14.0}/features/backend_validation.feature +0 -0
  82. {biblicus-0.12.0 → biblicus-0.14.0}/features/biblicus_corpus.feature +0 -0
  83. {biblicus-0.12.0 → biblicus-0.14.0}/features/cli_entrypoint.feature +0 -0
  84. {biblicus-0.12.0 → biblicus-0.14.0}/features/cli_parsing.feature +0 -0
  85. {biblicus-0.12.0 → biblicus-0.14.0}/features/cli_step_spec_parsing.feature +0 -0
  86. {biblicus-0.12.0 → biblicus-0.14.0}/features/content_sniffing.feature +0 -0
  87. {biblicus-0.12.0 → biblicus-0.14.0}/features/context_pack.feature +0 -0
  88. {biblicus-0.12.0 → biblicus-0.14.0}/features/context_pack_cli.feature +0 -0
  89. {biblicus-0.12.0 → biblicus-0.14.0}/features/context_pack_policies.feature +0 -0
  90. {biblicus-0.12.0 → biblicus-0.14.0}/features/corpus_edge_cases.feature +0 -0
  91. {biblicus-0.12.0 → biblicus-0.14.0}/features/corpus_identity.feature +0 -0
  92. {biblicus-0.12.0 → biblicus-0.14.0}/features/corpus_purge.feature +0 -0
  93. {biblicus-0.12.0 → biblicus-0.14.0}/features/crawl.feature +0 -0
  94. {biblicus-0.12.0 → biblicus-0.14.0}/features/docling_granite_extractor.feature +0 -0
  95. {biblicus-0.12.0 → biblicus-0.14.0}/features/docling_smol_extractor.feature +0 -0
  96. {biblicus-0.12.0 → biblicus-0.14.0}/features/environment.py +0 -0
  97. {biblicus-0.12.0 → biblicus-0.14.0}/features/error_cases.feature +0 -0
  98. {biblicus-0.12.0 → biblicus-0.14.0}/features/evaluation.feature +0 -0
  99. {biblicus-0.12.0 → biblicus-0.14.0}/features/evidence_processing.feature +0 -0
  100. {biblicus-0.12.0 → biblicus-0.14.0}/features/extraction_error_handling.feature +0 -0
  101. {biblicus-0.12.0 → biblicus-0.14.0}/features/extraction_run_lifecycle.feature +0 -0
  102. {biblicus-0.12.0 → biblicus-0.14.0}/features/extraction_selection.feature +0 -0
  103. {biblicus-0.12.0 → biblicus-0.14.0}/features/extraction_selection_longest.feature +0 -0
  104. {biblicus-0.12.0 → biblicus-0.14.0}/features/extractor_pipeline.feature +0 -0
  105. {biblicus-0.12.0 → biblicus-0.14.0}/features/extractor_validation.feature +0 -0
  106. {biblicus-0.12.0 → biblicus-0.14.0}/features/frontmatter.feature +0 -0
  107. {biblicus-0.12.0 → biblicus-0.14.0}/features/hook_config_validation.feature +0 -0
  108. {biblicus-0.12.0 → biblicus-0.14.0}/features/hook_error_handling.feature +0 -0
  109. {biblicus-0.12.0 → biblicus-0.14.0}/features/import_tree.feature +0 -0
  110. {biblicus-0.12.0 → biblicus-0.14.0}/features/inference_backend.feature +0 -0
  111. {biblicus-0.12.0 → biblicus-0.14.0}/features/ingest_sources.feature +0 -0
  112. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_audio_samples.feature +0 -0
  113. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_image_samples.feature +0 -0
  114. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_mixed_corpus.feature +0 -0
  115. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_mixed_extraction.feature +0 -0
  116. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_ocr_image_extraction.feature +0 -0
  117. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_pdf_retrieval.feature +0 -0
  118. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_pdf_samples.feature +0 -0
  119. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_unstructured_extraction.feature +0 -0
  120. {biblicus-0.12.0 → biblicus-0.14.0}/features/integration_wikipedia.feature +0 -0
  121. {biblicus-0.12.0 → biblicus-0.14.0}/features/knowledge_base.feature +0 -0
  122. {biblicus-0.12.0 → biblicus-0.14.0}/features/lifecycle_hooks.feature +0 -0
  123. {biblicus-0.12.0 → biblicus-0.14.0}/features/markitdown_extractor.feature +0 -0
  124. {biblicus-0.12.0 → biblicus-0.14.0}/features/model_validation.feature +0 -0
  125. {biblicus-0.12.0 → biblicus-0.14.0}/features/ocr_extractor.feature +0 -0
  126. {biblicus-0.12.0 → biblicus-0.14.0}/features/paddleocr_vl_extractor.feature +0 -0
  127. {biblicus-0.12.0 → biblicus-0.14.0}/features/paddleocr_vl_parse_api_response.feature +0 -0
  128. {biblicus-0.12.0 → biblicus-0.14.0}/features/pdf_text_extraction.feature +0 -0
  129. {biblicus-0.12.0 → biblicus-0.14.0}/features/profiling.feature +0 -0
  130. {biblicus-0.12.0 → biblicus-0.14.0}/features/python_api.feature +0 -0
  131. {biblicus-0.12.0 → biblicus-0.14.0}/features/python_hook_logging.feature +0 -0
  132. {biblicus-0.12.0 → biblicus-0.14.0}/features/query_processing.feature +0 -0
  133. {biblicus-0.12.0 → biblicus-0.14.0}/features/recipe_file_extraction.feature +0 -0
  134. {biblicus-0.12.0 → biblicus-0.14.0}/features/retrieval_budget.feature +0 -0
  135. {biblicus-0.12.0 → biblicus-0.14.0}/features/retrieval_quality.feature +0 -0
  136. {biblicus-0.12.0 → biblicus-0.14.0}/features/retrieval_scan.feature +0 -0
  137. {biblicus-0.12.0 → biblicus-0.14.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
  138. {biblicus-0.12.0 → biblicus-0.14.0}/features/retrieval_uses_extraction_run.feature +0 -0
  139. {biblicus-0.12.0 → biblicus-0.14.0}/features/retrieval_utilities.feature +0 -0
  140. {biblicus-0.12.0 → biblicus-0.14.0}/features/select_override.feature +0 -0
  141. {biblicus-0.12.0 → biblicus-0.14.0}/features/smart_override_selection.feature +0 -0
  142. {biblicus-0.12.0 → biblicus-0.14.0}/features/source_loading.feature +0 -0
  143. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/analysis_steps.py +0 -0
  144. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/backend_steps.py +0 -0
  145. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/cli_parsing_steps.py +0 -0
  146. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/context_pack_steps.py +0 -0
  147. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/crawl_steps.py +0 -0
  148. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/deepgram_steps.py +0 -0
  149. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/docling_steps.py +0 -0
  150. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/evidence_processing_steps.py +0 -0
  151. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/extraction_run_lifecycle_steps.py +0 -0
  152. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/extraction_steps.py +0 -0
  153. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/extractor_steps.py +0 -0
  154. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/frontmatter_steps.py +0 -0
  155. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/inference_steps.py +0 -0
  156. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/knowledge_base_steps.py +0 -0
  157. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/markitdown_steps.py +0 -0
  158. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/model_steps.py +0 -0
  159. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/openai_steps.py +0 -0
  160. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/paddleocr_mock_steps.py +0 -0
  161. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/paddleocr_vl_steps.py +0 -0
  162. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/paddleocr_vl_unit_steps.py +0 -0
  163. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/pdf_steps.py +0 -0
  164. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/profiling_steps.py +0 -0
  165. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/python_api_steps.py +0 -0
  166. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/rapidocr_steps.py +0 -0
  167. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/requests_mock_steps.py +0 -0
  168. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/retrieval_quality_steps.py +0 -0
  169. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/retrieval_steps.py +0 -0
  170. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/stt_deepgram_steps.py +0 -0
  171. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/stt_steps.py +0 -0
  172. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/topic_modeling_steps.py +0 -0
  173. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/unstructured_steps.py +0 -0
  174. {biblicus-0.12.0 → biblicus-0.14.0}/features/steps/user_config_steps.py +0 -0
  175. {biblicus-0.12.0 → biblicus-0.14.0}/features/streaming_ingest.feature +0 -0
  176. {biblicus-0.12.0 → biblicus-0.14.0}/features/stt_deepgram_extractor.feature +0 -0
  177. {biblicus-0.12.0 → biblicus-0.14.0}/features/stt_extractor.feature +0 -0
  178. {biblicus-0.12.0 → biblicus-0.14.0}/features/text_extraction_runs.feature +0 -0
  179. {biblicus-0.12.0 → biblicus-0.14.0}/features/token_budget.feature +0 -0
  180. {biblicus-0.12.0 → biblicus-0.14.0}/features/topic_modeling.feature +0 -0
  181. {biblicus-0.12.0 → biblicus-0.14.0}/features/unstructured_extractor.feature +0 -0
  182. {biblicus-0.12.0 → biblicus-0.14.0}/features/user_config.feature +0 -0
  183. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/download_ag_news.py +0 -0
  184. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/download_audio_samples.py +0 -0
  185. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/download_image_samples.py +0 -0
  186. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/download_mixed_samples.py +0 -0
  187. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/download_pdf_samples.py +0 -0
  188. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/download_wikipedia.py +0 -0
  189. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/profiling_demo.py +0 -0
  190. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/readme_end_to_end_demo.py +0 -0
  191. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/test.py +0 -0
  192. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/topic_modeling_integration.py +0 -0
  193. {biblicus-0.12.0 → biblicus-0.14.0}/scripts/wikipedia_rag_demo.py +0 -0
  194. {biblicus-0.12.0 → biblicus-0.14.0}/setup.cfg +0 -0
  195. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/__main__.py +0 -0
  196. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
  197. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
  198. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
  199. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
  200. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/analysis/__init__.py +0 -0
  201. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/analysis/base.py +0 -0
  202. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/analysis/llm.py +0 -0
  203. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/analysis/models.py +0 -0
  204. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/analysis/profiling.py +0 -0
  205. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/analysis/schema.py +0 -0
  206. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/analysis/topic_modeling.py +0 -0
  207. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/backends/__init__.py +0 -0
  208. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/backends/base.py +0 -0
  209. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/backends/scan.py +0 -0
  210. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
  211. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/backends/vector.py +0 -0
  212. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/corpus.py +0 -0
  213. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/crawl.py +0 -0
  214. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/errors.py +0 -0
  215. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/evaluation.py +0 -0
  216. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/evidence_processing.py +0 -0
  217. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extraction.py +0 -0
  218. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/__init__.py +0 -0
  219. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/base.py +0 -0
  220. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/deepgram_stt.py +0 -0
  221. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/docling_granite_text.py +0 -0
  222. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/docling_smol_text.py +0 -0
  223. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/markitdown_text.py +0 -0
  224. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/metadata_text.py +0 -0
  225. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/openai_stt.py +0 -0
  226. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/paddleocr_vl_text.py +0 -0
  227. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/pass_through_text.py +0 -0
  228. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/pdf_text.py +0 -0
  229. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/pipeline.py +0 -0
  230. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
  231. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/select_longest_text.py +0 -0
  232. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/select_override.py +0 -0
  233. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/select_smart_override.py +0 -0
  234. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/select_text.py +0 -0
  235. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/extractors/unstructured_text.py +0 -0
  236. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/frontmatter.py +0 -0
  237. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/hook_logging.py +0 -0
  238. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/hook_manager.py +0 -0
  239. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/hooks.py +0 -0
  240. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/ignore.py +0 -0
  241. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/inference.py +0 -0
  242. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/knowledge_base.py +0 -0
  243. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/models.py +0 -0
  244. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/retrieval.py +0 -0
  245. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/sources.py +0 -0
  246. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/time.py +0 -0
  247. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/uris.py +0 -0
  248. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus/user_config.py +0 -0
  249. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
  250. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus.egg-info/entry_points.txt +0 -0
  251. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus.egg-info/requires.txt +0 -0
  252. {biblicus-0.12.0 → biblicus-0.14.0}/src/biblicus.egg-info/top_level.txt +0 -0
{biblicus-0.12.0/src/biblicus.egg-info → biblicus-0.14.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: biblicus
- Version: 0.12.0
+ Version: 0.14.0
  Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
  License: MIT
  Requires-Python: >=3.9
@@ -486,10 +486,11 @@ corpus/
 
  ## Retrieval backends
 
- Two backends are included.
+ Three backends are included.
 
  - `scan` is a minimal baseline that scans raw items directly.
  - `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
+ - `vector` is a deterministic term-frequency vector baseline with cosine similarity scoring.
 
  For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
 
@@ -497,7 +498,8 @@ For detailed documentation including performance characte
 
  For the retrieval pipeline overview and run artifacts, see `docs/RETRIEVAL.md`. For retrieval quality upgrades
  (tuned lexical baseline, reranking, hybrid retrieval), see `docs/RETRIEVAL_QUALITY.md`. For evaluation workflows
- and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`.
+ and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`. For a runnable walkthrough, use the retrieval evaluation lab
+ script (`scripts/retrieval_evaluation_lab.py`).
 
  ## Extraction backends
 
@@ -535,6 +537,9 @@ These extractors are built in. Optional ones require extra dependencies. See [te
 
  For detailed documentation on all extractors, see the [Extractor Reference][extractor-reference].
 
+ For extraction evaluation workflows, dataset formats, and report interpretation, see
+ `docs/EXTRACTION_EVALUATION.md`.
+
  ## Topic modeling analysis
 
  Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling

{biblicus-0.12.0 → biblicus-0.14.0}/README.md

@@ -440,10 +440,11 @@ corpus/
 
  ## Retrieval backends
 
- Two backends are included.
+ Three backends are included.
 
  - `scan` is a minimal baseline that scans raw items directly.
  - `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
+ - `vector` is a deterministic term-frequency vector baseline with cosine similarity scoring.
 
  For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
 
@@ -451,7 +452,8 @@ For detailed documentation including performance characte
 
  For the retrieval pipeline overview and run artifacts, see `docs/RETRIEVAL.md`. For retrieval quality upgrades
  (tuned lexical baseline, reranking, hybrid retrieval), see `docs/RETRIEVAL_QUALITY.md`. For evaluation workflows
- and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`.
+ and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`. For a runnable walkthrough, use the retrieval evaluation lab
+ script (`scripts/retrieval_evaluation_lab.py`).
 
  ## Extraction backends
 
@@ -489,6 +491,9 @@ These extractors are built in. Optional ones require extra dependencies. See [te
 
  For detailed documentation on all extractors, see the [Extractor Reference][extractor-reference].
 
+ For extraction evaluation workflows, dataset formats, and report interpretation, see
+ `docs/EXTRACTION_EVALUATION.md`.
+
  ## Topic modeling analysis
 
  Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling
biblicus-0.14.0/datasets/extraction_lab/labels.json

@@ -0,0 +1,19 @@
+ {
+   "schema_version": 1,
+   "name": "extraction-evaluation-lab",
+   "description": "Bundled labels for the extraction evaluation lab.",
+   "items": [
+     {
+       "filename": "alpha.md",
+       "expected_text": "Alpha note"
+     },
+     {
+       "filename": "beta.txt",
+       "expected_text": "Beta note"
+     },
+     {
+       "filename": "blank.md",
+       "expected_text": ""
+     }
+   ]
+ }

biblicus-0.14.0/datasets/retrieval_lab/labels.json

@@ -0,0 +1,25 @@
+ {
+   "schema_version": 1,
+   "name": "retrieval-evaluation-lab",
+   "description": "Bundled labels for the retrieval evaluation lab.",
+   "queries": [
+     {
+       "query_id": "q1",
+       "query_text": "alpha unique",
+       "expected_filename": "alpha.txt",
+       "kind": "gold"
+     },
+     {
+       "query_id": "q2",
+       "query_text": "beta unique",
+       "expected_filename": "beta.txt",
+       "kind": "gold"
+     },
+     {
+       "query_id": "q3",
+       "query_text": "gamma unique",
+       "expected_filename": "gamma.txt",
+       "kind": "gold"
+     }
+   ]
+ }
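Both bundled labels files follow the versioned JSON schema documented in the new evaluation guides. As a quick illustration of consuming them, here is a minimal sketch; the `load_labels` helper is hypothetical and not part of the package, and it only assumes the fields visible in the diff above:

```python
import json
from pathlib import Path

def load_labels(path: Path) -> dict:
    """Load a bundled labels file and check the documented top-level fields."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if data.get("schema_version") != 1:
        raise ValueError(f"unsupported schema_version: {data.get('schema_version')}")
    # Extraction labels carry an "items" list; retrieval labels carry "queries".
    if "items" not in data and "queries" not in data:
        raise ValueError("expected an 'items' or 'queries' list")
    return data

labels = load_labels(Path("datasets/retrieval_lab/labels.json"))
print(labels["name"], len(labels["queries"]))  # retrieval-evaluation-lab 3
```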
{biblicus-0.12.0 → biblicus-0.14.0}/docs/BACKENDS.md

@@ -37,3 +37,4 @@ See:
 
  - `biblicus.backends.scan.ScanBackend` (minimal baseline)
  - `biblicus.backends.sqlite_full_text_search.SqliteFullTextSearchBackend` (practical local backend)
+ - `biblicus.backends.vector.VectorBackend` (term-frequency vector baseline)
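The new `VectorBackend` is described in this release as a deterministic term-frequency baseline with cosine similarity scoring. Its source is not shown in this diff, so the following is only a textbook sketch of what term-frequency cosine scoring means, not the package's implementation:

```python
from collections import Counter
from math import sqrt

def term_frequency_vector(text: str) -> Counter:
    # Lowercase whitespace tokenization keeps the scoring deterministic.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = term_frequency_vector("beta")
document = term_frequency_vector("alpha beta beta")
print(cosine_similarity(query, document))  # ~0.894; higher when query terms dominate
```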
{biblicus-0.12.0 → biblicus-0.14.0}/docs/DEMOS.md

@@ -198,6 +198,44 @@ python3 -m pip install "biblicus[datasets,topic-modeling]"
  python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
  ```
 
+ ### Extraction evaluation demo run
+
+ Use the extraction evaluation demo to build an extraction run, write a labeled dataset from AG News items, and evaluate
+ coverage and accuracy.
+
+ Install optional dependencies first:
+
+ ```
+ python3 -m pip install "biblicus[datasets]"
+ ```
+
+ ```
+ python3 scripts/extraction_evaluation_demo.py --corpus corpora/ag_news_extraction_eval --force
+ ```
+
+ The script prints the dataset path, extraction run reference, and evaluation output path so you can inspect the results.
+
+ ### Extraction evaluation lab run
+
+ Use the lab script for a fast, fully local walkthrough with bundled files and labels:
+
+ ```
+ python3 scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force
+ ```
+
+ The lab writes a generated dataset file and an evaluation output, and prints both paths in the command output.
+
+ ### Retrieval evaluation lab run
+
+ Use the retrieval evaluation lab to build a tiny corpus, run extraction, build a retrieval backend, and evaluate it
+ against bundled labels:
+
+ ```
+ python3 scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
+ ```
+
+ The script prints the dataset path, retrieval run identifier, and evaluation output location.
+
  Run with a larger corpus and a higher topic count:
 
  ```
{biblicus-0.12.0 → biblicus-0.14.0}/docs/EXTRACTION.md

@@ -188,6 +188,11 @@ python3 -m biblicus build --corpus corpora/extraction-demo --backend sqlite-full
  python3 -m biblicus query --corpus corpora/extraction-demo --query extracted
  ```
 
+ ## Evaluate extraction quality
+
+ Extraction evaluation measures coverage and accuracy for a given extractor recipe. See `docs/EXTRACTION_EVALUATION.md`
+ for the dataset format, command-line interface usage, and report interpretation.
+
  ## What extraction is not
 
  Text extraction does not mutate the raw corpus. It is derived output that can be regenerated and compared across implementations.
biblicus-0.14.0/docs/EXTRACTION_EVALUATION.md

@@ -0,0 +1,147 @@
+ # Extraction evaluation
+
+ Biblicus provides an extraction evaluation harness that measures how well an extractor recipe turns raw items into text.
+ It is designed to be deterministic, auditable, and useful for selecting a default extraction pipeline.
+
+ ## What extraction evaluation measures
+
+ Extraction evaluation reports:
+
+ - Coverage of extracted text (present, empty, missing)
+ - Accuracy against labeled ground truth text
+ - Processable fraction for each extractor recipe
+ - Optional system metrics such as latency and external cost
+
+ The output is structured JSON so you can version it, compare it across runs, and use it in reports.
+
+ ## Dataset format
+
+ Extraction evaluation datasets are JSON with a versioned schema. Each entry maps a corpus item to its expected extracted
+ text.
+
+ Example:
+
+ ```json
+ {
+   "schema_version": 1,
+   "name": "Extraction baseline",
+   "description": "Short labeled texts for extraction accuracy",
+   "items": [
+     {
+       "item_id": "3a2c3f0b-...",
+       "expected_text": "Hello world",
+       "kind": "gold"
+     },
+     {
+       "source_uri": "file:///corpora/demo/report.pdf",
+       "expected_text": "Quarterly results",
+       "kind": "gold"
+     }
+   ]
+ }
+ ```
+
+ Fields:
+
+ - `schema_version`: dataset schema version, currently `1`
+ - `name`: dataset name
+ - `description`: optional description
+ - `items`: list of labeled items with either `item_id` or `source_uri`
+ - `expected_text`: expected extracted text for the item
+ - `kind`: label kind, for example `gold` or `synthetic`
+
+ ## Run extraction evaluation from the CLI
+
+ ```
+ biblicus extract evaluate --corpus corpora/example \
+   --run pipeline:EXTRACTION_RUN_ID \
+   --dataset datasets/extraction.json
+ ```
+
+ If you omit `--run`, Biblicus uses the latest extraction run and emits a reproducibility warning.
+
+ ## Run extraction evaluation from Python
+
+ ```
+ from pathlib import Path
+
+ from biblicus.corpus import Corpus
+ from biblicus.extraction_evaluation import evaluate_extraction_run, load_extraction_dataset
+ from biblicus.models import ExtractionRunReference
+
+ corpus = Corpus.open(Path("corpora/example"))
+ run = corpus.load_extraction_run("pipeline", "RUN_ID")
+ dataset = load_extraction_dataset(Path("datasets/extraction.json"))
+ result = evaluate_extraction_run(corpus=corpus, run=run, dataset=dataset)
+ print(result.model_dump())
+ ```
+
+ ## Output location
+
+ Extraction evaluation artifacts are stored under:
+
+ ```
+ .biblicus/runs/evaluation/extraction/<run_id>/output.json
+ ```
+
+ ## Working demo
+
+ A runnable demo is provided in `scripts/extraction_evaluation_demo.py`. It downloads AG News, runs extraction, builds a
+ dataset from the ingested items, and evaluates the extraction run:
+
+ ```
+ python3 scripts/extraction_evaluation_demo.py --corpus corpora/ag_news_extraction_eval --force
+ ```
+
+ ## Extraction evaluation lab
+
+ For a fast, fully local walkthrough, use the bundled lab. It ingests a tiny set of files, runs extraction, generates a
+ dataset, and evaluates the run in seconds.
+
+ ```
+ python3 scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force
+ ```
+
+ The lab uses the bundled files under `datasets/extraction_lab/items` and writes the generated dataset to
+ `datasets/extraction_lab_output.json` by default. The command output includes the evaluation artifact path so you can
+ inspect the metrics immediately.
+
+ ### Lab walkthrough
+
+ 1) Run the lab:
+
+ ```
+ python3 scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_lab --force
+ ```
+
+ 2) Inspect the generated dataset:
+
+ ```
+ cat datasets/extraction_lab_output.json
+ ```
+
+ The dataset is small and deterministic. Each entry maps a corpus item to the expected extracted text.
+
+ 3) Inspect the evaluation output:
+
+ ```
+ cat corpora/extraction_eval_lab/.biblicus/runs/evaluation/extraction/<run_id>/output.json
+ ```
+
+ The output includes:
+
+ - Coverage counts for present, empty, and missing extracted text.
+ - Processable fraction for the extractor recipe.
+ - Average similarity between expected and extracted text.
+
+ 4) Compare metrics to raw items:
+
+ The lab includes a Markdown note, a plain text file, and a blank Markdown note. The blank note yields empty extracted
+ text, which should be reflected in the coverage metrics. Because the expected text matches the extracted text for the
+ non-empty items, the similarity score is 1.0 for those items.
+
+ ## Interpretation tips
+
+ - Use coverage metrics to detect extractors that skip or fail on specific media types.
+ - Use accuracy metrics to compare competing extractors on labeled samples.
+ - Track processable fraction before optimizing quality so you know what fraction of the corpus is actually evaluated.
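To make the documented coverage and similarity metrics concrete, here is an illustrative sketch. The exact similarity measure used by `evaluate_extraction_run` is not shown in this diff; the sketch assumes a simple sequence-ratio similarity and uses hypothetical data:

```python
from difflib import SequenceMatcher

# Expected vs. extracted text per item; None models a missing extraction.
pairs = {
    "alpha.md": ("Alpha note", "Alpha note"),
    "beta.txt": ("Beta note", "Beta note"),
    "blank.md": ("", ""),
    "broken.pdf": ("Quarterly results", None),
}

present = sum(1 for _, got in pairs.values() if got)
empty = sum(1 for _, got in pairs.values() if got == "")
missing = sum(1 for _, got in pairs.values() if got is None)
similarities = [
    SequenceMatcher(None, want, got).ratio()
    for want, got in pairs.values()
    if got is not None
]
print(present, empty, missing)                # coverage counts: 2 1 1
print(sum(similarities) / len(similarities))  # average similarity over evaluable items
```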
{biblicus-0.12.0 → biblicus-0.14.0}/docs/FEATURE_INDEX.md

@@ -148,6 +148,26 @@ Primary implementation:
  - `src/biblicus/extraction.py`
  - `src/biblicus/extractors/`
 
+ ## Extraction evaluation
+
+ What it does:
+
+ - Evaluates extraction runs against labeled datasets.
+ - Reports coverage, accuracy, and processable fraction metrics.
+
+ Documentation:
+
+ - `docs/EXTRACTION_EVALUATION.md`
+
+ Behavior specifications:
+
+ - `features/extraction_evaluation.feature`
+ - `features/extraction_evaluation_lab.feature`
+
+ Primary implementation:
+
+ - `src/biblicus/extraction_evaluation.py`
+
  ## Retrieval backends
 
  What it does:
@@ -180,10 +200,15 @@ What it does:
 
  - Evaluates retrieval runs against datasets and budgets.
 
+ Documentation:
+
+ - `docs/RETRIEVAL_EVALUATION.md`
+
  Behavior specifications:
 
  - `features/evaluation.feature`
  - `features/model_validation.feature`
+ - `features/retrieval_evaluation_lab.feature`
 
  Primary implementation:
 
biblicus-0.14.0/docs/RETRIEVAL.md

@@ -0,0 +1,96 @@
+ # Retrieval
+
+ Biblicus treats retrieval as a reproducible, explicit pipeline stage that transforms a corpus into structured evidence.
+ Retrieval is separated from extraction and context shaping so each can be evaluated independently and swapped without
+ rewriting ingestion.
+
+ ## Retrieval concepts
+
+ - **Backend**: a pluggable retrieval implementation that can build and query runs.
+ - **Run**: a recorded retrieval build for a corpus and extraction run.
+ - **Evidence**: structured output containing identifiers, provenance, and scores.
+ - **Stage**: explicit steps such as retrieve, rerank, and filter.
+
+ ## How retrieval runs work
+
+ 1) Ingest raw items into a corpus.
+ 2) Build an extraction run to produce text artifacts.
+ 3) Build a retrieval run with a backend, referencing the extraction run.
+ 4) Query the run to return evidence.
+
+ Retrieval runs are stored under:
+
+ ```
+ .biblicus/runs/retrieval/<backend_id>/<run_id>/
+ ```
+
+ ## A minimal run you can execute
+
+ This walkthrough uses the full text search backend and produces evidence you can inspect immediately.
+
+ ```
+ rm -rf corpora/retrieval_demo
+ python3 -m biblicus init corpora/retrieval_demo
+ printf "alpha beta\n" > /tmp/retrieval-alpha.txt
+ printf "beta gamma\n" > /tmp/retrieval-beta.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_demo /tmp/retrieval-alpha.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_demo /tmp/retrieval-beta.txt
+
+ python3 -m biblicus extract build --corpus corpora/retrieval_demo --step pass-through-text
+ python3 -m biblicus build --corpus corpora/retrieval_demo --backend sqlite-full-text-search
+ python3 -m biblicus query --corpus corpora/retrieval_demo --query "beta"
+ ```
+
+ The query output is structured evidence with identifiers and scores. That evidence is the primary output for evaluation
+ and downstream context packing.
+
+ ## Backends
+
+ See `docs/backends/index.md` for backend selection and configuration.
+
+ ## Choosing a backend
+
+ Start with the simplest backend that answers your question:
+
+ - `scan` for tiny corpora or sanity checks.
+ - `sqlite-full-text-search` for a practical lexical baseline.
+ - `vector` when you want deterministic term-frequency similarity without external dependencies.
+
+ You can compare them with the same dataset and budget using the retrieval evaluation workflow.
+
+ ## Evaluation
+
+ Retrieval runs are evaluated against datasets with explicit budgets. See `docs/RETRIEVAL_EVALUATION.md` for the
+ dataset format and workflow, `docs/FEATURE_INDEX.md` for the behavior specifications, and `docs/CONTEXT_PACK.md` for
+ how evidence feeds into context packs.
+
+ ## Labs and demos
+
+ When you want a repeatable example with bundled data, use the retrieval evaluation lab:
+
+ ```
+ python3 scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
+ ```
+
+ The lab builds a tiny corpus, runs extraction, builds a retrieval run, and evaluates it. It prints the dataset path and
+ evaluation output path so you can open the JavaScript Object Notation output directly.
+
+ ## Reproducibility checklist
+
+ Use these habits when you want repeatable retrieval experiments:
+
+ - Record the extraction run identifier and pass it explicitly when you build a retrieval run.
+ - Keep evaluation datasets in source control and treat them as immutable inputs.
+ - Capture the full retrieval run identifier when you compare outputs across backends.
+
+ ## Why the separation matters
+
+ Keeping extraction and retrieval distinct makes it possible to:
+
+ - Reuse the same extracted artifacts across many retrieval backends.
+ - Compare backends against the same corpus and dataset inputs.
+ - Record and audit retrieval decisions without mixing in prompting or context formatting.
+
+ ## Retrieval quality
+
+ For retrieval quality upgrades, see `docs/RETRIEVAL_QUALITY.md`.
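Given the run layout documented above (`.biblicus/runs/retrieval/<backend_id>/<run_id>/`), a small sketch of enumerating recorded runs can be useful when comparing backends. The `list_retrieval_runs` helper below is hypothetical, not a package API; it only assumes the documented directory structure:

```python
from pathlib import Path

def list_retrieval_runs(corpus: Path):
    """Yield (backend_id, run_id) pairs from the documented run layout."""
    runs_root = corpus / ".biblicus" / "runs" / "retrieval"
    if not runs_root.is_dir():
        return
    for backend_dir in sorted(runs_root.iterdir()):
        for run_dir in sorted(backend_dir.iterdir()):
            yield backend_dir.name, run_dir.name

for backend_id, run_id in list_retrieval_runs(Path("corpora/retrieval_demo")):
    print(backend_id, run_id)
```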
biblicus-0.14.0/docs/RETRIEVAL_EVALUATION.md

@@ -0,0 +1,181 @@
+ # Retrieval evaluation
+
+ Biblicus evaluates retrieval runs against deterministic datasets so quality comparisons are repeatable across backends
+ and corpora. Evaluations keep the evidence-first model intact by reporting per-query evidence alongside summary
+ metrics.
+
+ ## Dataset format
+
+ Retrieval datasets are stored as JavaScript Object Notation files with a strict schema:
+
+ ```json
+ {
+   "schema_version": 1,
+   "name": "example-dataset",
+   "description": "Small hand-labeled dataset for smoke tests.",
+   "queries": [
+     {
+       "query_id": "q-001",
+       "query_text": "alpha",
+       "expected_item_id": "item-id-123",
+       "kind": "gold"
+     }
+   ]
+ }
+ ```
+
+ Each query includes either an `expected_item_id` or an `expected_source_uri`. The `kind` field records whether the
+ query is hand-labeled (`gold`) or synthetic.
+
+ ## Metrics primer
+
+ Retrieval evaluation reports a small set of textbook metrics:
+
+ - **Hit rate**: the fraction of queries that retrieved the expected item at any rank.
+ - **Precision-at-k**: hit rate normalized by the evidence budget (`max_total_items`).
+ - **Mean reciprocal rank**: the average of `1 / rank` for the first matching item per query.
+
+ These metrics are deterministic for the same corpus, run, dataset, and budget.
+
+ ## Running an evaluation
+
+ Use the command-line interface to evaluate a retrieval run against a dataset:
+
+ ```bash
+ biblicus eval --corpus corpora/example --run <run_id> --dataset datasets/retrieval.json \
+   --max-total-items 5 --max-total-characters 2000 --max-items-per-source 5
+ ```
+
+ If `--run` is omitted, the latest retrieval run is used. Evaluations are deterministic for the same corpus, run, and
+ budget.
+
+ ## End-to-end evaluation example
+
+ This example builds a tiny corpus, creates a retrieval run, and evaluates it against a minimal dataset:
+
+ ```
+ rm -rf corpora/retrieval_eval_demo
+ python3 -m biblicus init corpora/retrieval_eval_demo
+ printf "alpha apple\n" > /tmp/eval-alpha.txt
+ printf "beta banana\n" > /tmp/eval-beta.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_eval_demo /tmp/eval-alpha.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_eval_demo /tmp/eval-beta.txt
+
+ python3 -m biblicus extract build --corpus corpora/retrieval_eval_demo --step pass-through-text
+ python3 -m biblicus build --corpus corpora/retrieval_eval_demo --backend sqlite-full-text-search
+
+ cat > /tmp/retrieval_eval_dataset.json <<'JSON'
+ {
+   "schema_version": 1,
+   "name": "retrieval-eval-demo",
+   "description": "Minimal dataset for evaluation walkthroughs.",
+   "queries": [
+     {
+       "query_id": "q1",
+       "query_text": "apple",
+       "expected_item_id": "ITEM_ID_FOR_ALPHA",
+       "kind": "gold"
+     }
+   ]
+ }
+ JSON
+ ```
+
+ Replace `ITEM_ID_FOR_ALPHA` with the item identifier from `biblicus list`, then run:
+
+ ```
+ python3 -m biblicus eval --corpus corpora/retrieval_eval_demo --dataset /tmp/retrieval_eval_dataset.json \
+   --max-total-items 3 --max-total-characters 2000 --max-items-per-source 5
+ ```
+
+ ## Retrieval evaluation lab
+
+ The retrieval evaluation lab ships with bundled files and labels so you can run a deterministic end-to-end evaluation
+ without external dependencies.
+
+ ```
+ python3 scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
+ ```
+
+ The script prints a summary that includes the generated dataset path, the retrieval run identifier, and the evaluation
+ output path.
+
+ ## Output
+
+ The evaluation output includes:
+
+ - Dataset metadata (name, description, query count).
+ - Run metadata (backend ID, run ID, evaluation timestamp).
+ - Metrics (hit rate, precision-at-k, mean reciprocal rank).
+ - System diagnostics (latency percentiles and index size).
+
+ The output is JavaScript Object Notation suitable for downstream reporting.
+
+ Example snippet:
+
+ ```json
+ {
+   "dataset": {
+     "name": "retrieval-eval-demo",
+     "description": "Minimal dataset for evaluation walkthroughs.",
+     "queries": 1
+   },
+   "backend_id": "sqlite-full-text-search",
+   "run_id": "RUN_ID",
+   "evaluated_at": "2024-01-01T00:00:00Z",
+   "metrics": {
+     "hit_rate": 1.0,
+     "precision_at_max_total_items": 0.3333333333333333,
+     "mean_reciprocal_rank": 1.0
+   },
+   "system": {
+     "average_latency_milliseconds": 1.2,
+     "percentile_95_latency_milliseconds": 2.4,
+     "index_bytes": 2048.0
+   }
+ }
+ ```
+
+ The `metrics` section is the primary signal for retriever quality. The `system` section helps compare performance and
+ storage costs across backends.
+
+ ## What to record for comparisons
+
+ When you compare retrieval runs, capture the same inputs every time:
+
+ - Corpus path (and whether the catalog has been reindexed).
+ - Extraction run identifier used by the retrieval run.
+ - Retrieval backend identifier and run identifier.
+ - Evaluation dataset path and schema version.
+ - Evidence budget values.
+
+ This metadata allows you to rerun the evaluation and explain differences between results.
+
+ ## Common pitfalls
+
+ - Evaluating against a dataset built for a different corpus or extraction run.
+ - Changing budgets between runs and expecting metrics to be comparable.
+ - Using stale item identifiers after reindexing or re-ingesting content.
+
+ ## Python usage
+
+ ```python
+ from pathlib import Path
+
+ from biblicus.corpus import Corpus
+ from biblicus.evaluation import evaluate_run, load_dataset
+ from biblicus.models import QueryBudget
+
+ corpus = Corpus.open("corpora/example")
+ run = corpus.load_run("<run_id>")
+ dataset = load_dataset(Path("datasets/retrieval.json"))
+ budget = QueryBudget(max_total_items=5, max_total_characters=2000, max_items_per_source=5)
+ result = evaluate_run(corpus=corpus, run=run, dataset=dataset, budget=budget)
+ print(result.model_dump_json(indent=2))
+ ```
+
+ ## Design notes
+
+ - Evaluation is reproducible by construction: the run manifest, dataset, and budget fully determine the results.
+ - The evaluation workflow expects retrieval stages to remain explicit in the run artifacts.
+ - Reports are portable, so comparisons across backends and corpora are straightforward.
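As a worked illustration of the metric definitions in this new document (hit rate, precision-at-k as hit rate normalized by `max_total_items`, and mean reciprocal rank), here is a self-contained sketch; it is not the package's evaluator, only a restatement of the documented formulas:

```python
def retrieval_metrics(ranked_ids_per_query, expected_ids, max_total_items):
    """Compute the documented metrics from ranked result lists.

    ranked_ids_per_query: list of ranked item-id lists, one per query.
    expected_ids: the expected item id for each query, in the same order.
    """
    hits, reciprocal_ranks = 0, []
    for ranked, expected in zip(ranked_ids_per_query, expected_ids):
        if expected in ranked:
            hits += 1
            reciprocal_ranks.append(1.0 / (ranked.index(expected) + 1))
        else:
            reciprocal_ranks.append(0.0)
    queries = len(expected_ids)
    hit_rate = hits / queries
    return {
        "hit_rate": hit_rate,
        # The docs define precision-at-k as hit rate normalized by the budget.
        "precision_at_max_total_items": hit_rate / max_total_items,
        "mean_reciprocal_rank": sum(reciprocal_ranks) / queries,
    }

print(retrieval_metrics([["a", "b", "c"]], ["a"], max_total_items=3))
# {'hit_rate': 1.0, 'precision_at_max_total_items': 0.333..., 'mean_reciprocal_rank': 1.0}
```

With one query whose expected item ranks first under a budget of three, the sketch reproduces the example snippet above: hit rate 1.0, precision-at-k 0.333..., mean reciprocal rank 1.0.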