biblicus 0.13.0__tar.gz → 0.14.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (252)
  1. {biblicus-0.13.0/src/biblicus.egg-info → biblicus-0.14.0}/PKG-INFO +3 -2
  2. {biblicus-0.13.0 → biblicus-0.14.0}/README.md +2 -1
  3. biblicus-0.14.0/datasets/retrieval_lab/labels.json +25 -0
  4. {biblicus-0.13.0 → biblicus-0.14.0}/docs/DEMOS.md +11 -0
  5. {biblicus-0.13.0 → biblicus-0.14.0}/docs/FEATURE_INDEX.md +5 -0
  6. biblicus-0.14.0/docs/RETRIEVAL.md +96 -0
  7. biblicus-0.14.0/docs/RETRIEVAL_EVALUATION.md +181 -0
  8. biblicus-0.14.0/docs/RETRIEVAL_QUALITY.md +106 -0
  9. biblicus-0.14.0/features/retrieval_evaluation_lab.feature +10 -0
  10. biblicus-0.14.0/features/steps/retrieval_evaluation_lab_steps.py +77 -0
  11. {biblicus-0.13.0 → biblicus-0.14.0}/pyproject.toml +1 -1
  12. biblicus-0.14.0/scripts/retrieval_evaluation_lab.py +284 -0
  13. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/__init__.py +1 -1
  14. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/backends/hybrid.py +6 -1
  15. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/context.py +2 -2
  16. {biblicus-0.13.0 → biblicus-0.14.0/src/biblicus.egg-info}/PKG-INFO +3 -2
  17. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus.egg-info/SOURCES.txt +4 -0
  18. biblicus-0.13.0/docs/RETRIEVAL.md +0 -47
  19. biblicus-0.13.0/docs/RETRIEVAL_EVALUATION.md +0 -74
  20. biblicus-0.13.0/docs/RETRIEVAL_QUALITY.md +0 -42
  21. {biblicus-0.13.0 → biblicus-0.14.0}/LICENSE +0 -0
  22. {biblicus-0.13.0 → biblicus-0.14.0}/MANIFEST.in +0 -0
  23. {biblicus-0.13.0 → biblicus-0.14.0}/THIRD_PARTY_NOTICES.md +0 -0
  24. {biblicus-0.13.0 → biblicus-0.14.0}/datasets/extraction_lab/labels.json +0 -0
  25. {biblicus-0.13.0 → biblicus-0.14.0}/datasets/wikipedia_mini.json +0 -0
  26. {biblicus-0.13.0 → biblicus-0.14.0}/docs/ANALYSIS.md +0 -0
  27. {biblicus-0.13.0 → biblicus-0.14.0}/docs/ARCHITECTURE.md +0 -0
  28. {biblicus-0.13.0 → biblicus-0.14.0}/docs/BACKENDS.md +0 -0
  29. {biblicus-0.13.0 → biblicus-0.14.0}/docs/CONTEXT_PACK.md +0 -0
  30. {biblicus-0.13.0 → biblicus-0.14.0}/docs/CORPUS.md +0 -0
  31. {biblicus-0.13.0 → biblicus-0.14.0}/docs/CORPUS_DESIGN.md +0 -0
  32. {biblicus-0.13.0 → biblicus-0.14.0}/docs/EXTRACTION.md +0 -0
  33. {biblicus-0.13.0 → biblicus-0.14.0}/docs/EXTRACTION_EVALUATION.md +0 -0
  34. {biblicus-0.13.0 → biblicus-0.14.0}/docs/KNOWLEDGE_BASE.md +0 -0
  35. {biblicus-0.13.0 → biblicus-0.14.0}/docs/PROFILING.md +0 -0
  36. {biblicus-0.13.0 → biblicus-0.14.0}/docs/ROADMAP.md +0 -0
  37. {biblicus-0.13.0 → biblicus-0.14.0}/docs/STT.md +0 -0
  38. {biblicus-0.13.0 → biblicus-0.14.0}/docs/TESTING.md +0 -0
  39. {biblicus-0.13.0 → biblicus-0.14.0}/docs/TOPIC_MODELING.md +0 -0
  40. {biblicus-0.13.0 → biblicus-0.14.0}/docs/USER_CONFIGURATION.md +0 -0
  41. {biblicus-0.13.0 → biblicus-0.14.0}/docs/api.rst +0 -0
  42. {biblicus-0.13.0 → biblicus-0.14.0}/docs/backends/index.md +0 -0
  43. {biblicus-0.13.0 → biblicus-0.14.0}/docs/backends/scan.md +0 -0
  44. {biblicus-0.13.0 → biblicus-0.14.0}/docs/backends/sqlite-full-text-search.md +0 -0
  45. {biblicus-0.13.0 → biblicus-0.14.0}/docs/backends/vector.md +0 -0
  46. {biblicus-0.13.0 → biblicus-0.14.0}/docs/conf.py +0 -0
  47. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/index.md +0 -0
  48. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/ocr/index.md +0 -0
  49. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/ocr/paddleocr-vl.md +0 -0
  50. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/ocr/rapidocr.md +0 -0
  51. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/index.md +0 -0
  52. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/pipeline.md +0 -0
  53. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-longest.md +0 -0
  54. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-override.md +0 -0
  55. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-smart-override.md +0 -0
  56. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/pipeline-utilities/select-text.md +0 -0
  57. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/speech-to-text/deepgram.md +0 -0
  58. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/speech-to-text/index.md +0 -0
  59. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/speech-to-text/openai.md +0 -0
  60. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/text-document/index.md +0 -0
  61. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/text-document/markitdown.md +0 -0
  62. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/text-document/metadata.md +0 -0
  63. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/text-document/pass-through.md +0 -0
  64. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/text-document/pdf.md +0 -0
  65. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/text-document/unstructured.md +0 -0
  66. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/vlm-document/docling-granite.md +0 -0
  67. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/vlm-document/docling-smol.md +0 -0
  68. {biblicus-0.13.0 → biblicus-0.14.0}/docs/extractors/vlm-document/index.md +0 -0
  69. {biblicus-0.13.0 → biblicus-0.14.0}/docs/index.rst +0 -0
  70. {biblicus-0.13.0 → biblicus-0.14.0}/features/analysis_schema.feature +0 -0
  71. {biblicus-0.13.0 → biblicus-0.14.0}/features/backend_validation.feature +0 -0
  72. {biblicus-0.13.0 → biblicus-0.14.0}/features/biblicus_corpus.feature +0 -0
  73. {biblicus-0.13.0 → biblicus-0.14.0}/features/cli_entrypoint.feature +0 -0
  74. {biblicus-0.13.0 → biblicus-0.14.0}/features/cli_parsing.feature +0 -0
  75. {biblicus-0.13.0 → biblicus-0.14.0}/features/cli_step_spec_parsing.feature +0 -0
  76. {biblicus-0.13.0 → biblicus-0.14.0}/features/content_sniffing.feature +0 -0
  77. {biblicus-0.13.0 → biblicus-0.14.0}/features/context_pack.feature +0 -0
  78. {biblicus-0.13.0 → biblicus-0.14.0}/features/context_pack_cli.feature +0 -0
  79. {biblicus-0.13.0 → biblicus-0.14.0}/features/context_pack_policies.feature +0 -0
  80. {biblicus-0.13.0 → biblicus-0.14.0}/features/corpus_edge_cases.feature +0 -0
  81. {biblicus-0.13.0 → biblicus-0.14.0}/features/corpus_identity.feature +0 -0
  82. {biblicus-0.13.0 → biblicus-0.14.0}/features/corpus_purge.feature +0 -0
  83. {biblicus-0.13.0 → biblicus-0.14.0}/features/crawl.feature +0 -0
  84. {biblicus-0.13.0 → biblicus-0.14.0}/features/docling_granite_extractor.feature +0 -0
  85. {biblicus-0.13.0 → biblicus-0.14.0}/features/docling_smol_extractor.feature +0 -0
  86. {biblicus-0.13.0 → biblicus-0.14.0}/features/environment.py +0 -0
  87. {biblicus-0.13.0 → biblicus-0.14.0}/features/error_cases.feature +0 -0
  88. {biblicus-0.13.0 → biblicus-0.14.0}/features/evaluation.feature +0 -0
  89. {biblicus-0.13.0 → biblicus-0.14.0}/features/evidence_processing.feature +0 -0
  90. {biblicus-0.13.0 → biblicus-0.14.0}/features/extraction_error_handling.feature +0 -0
  91. {biblicus-0.13.0 → biblicus-0.14.0}/features/extraction_evaluation.feature +0 -0
  92. {biblicus-0.13.0 → biblicus-0.14.0}/features/extraction_evaluation_lab.feature +0 -0
  93. {biblicus-0.13.0 → biblicus-0.14.0}/features/extraction_run_lifecycle.feature +0 -0
  94. {biblicus-0.13.0 → biblicus-0.14.0}/features/extraction_selection.feature +0 -0
  95. {biblicus-0.13.0 → biblicus-0.14.0}/features/extraction_selection_longest.feature +0 -0
  96. {biblicus-0.13.0 → biblicus-0.14.0}/features/extractor_pipeline.feature +0 -0
  97. {biblicus-0.13.0 → biblicus-0.14.0}/features/extractor_validation.feature +0 -0
  98. {biblicus-0.13.0 → biblicus-0.14.0}/features/frontmatter.feature +0 -0
  99. {biblicus-0.13.0 → biblicus-0.14.0}/features/hook_config_validation.feature +0 -0
  100. {biblicus-0.13.0 → biblicus-0.14.0}/features/hook_error_handling.feature +0 -0
  101. {biblicus-0.13.0 → biblicus-0.14.0}/features/import_tree.feature +0 -0
  102. {biblicus-0.13.0 → biblicus-0.14.0}/features/inference_backend.feature +0 -0
  103. {biblicus-0.13.0 → biblicus-0.14.0}/features/ingest_sources.feature +0 -0
  104. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_audio_samples.feature +0 -0
  105. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_image_samples.feature +0 -0
  106. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_mixed_corpus.feature +0 -0
  107. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_mixed_extraction.feature +0 -0
  108. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_ocr_image_extraction.feature +0 -0
  109. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_pdf_retrieval.feature +0 -0
  110. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_pdf_samples.feature +0 -0
  111. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_unstructured_extraction.feature +0 -0
  112. {biblicus-0.13.0 → biblicus-0.14.0}/features/integration_wikipedia.feature +0 -0
  113. {biblicus-0.13.0 → biblicus-0.14.0}/features/knowledge_base.feature +0 -0
  114. {biblicus-0.13.0 → biblicus-0.14.0}/features/lifecycle_hooks.feature +0 -0
  115. {biblicus-0.13.0 → biblicus-0.14.0}/features/markitdown_extractor.feature +0 -0
  116. {biblicus-0.13.0 → biblicus-0.14.0}/features/model_validation.feature +0 -0
  117. {biblicus-0.13.0 → biblicus-0.14.0}/features/ocr_extractor.feature +0 -0
  118. {biblicus-0.13.0 → biblicus-0.14.0}/features/paddleocr_vl_extractor.feature +0 -0
  119. {biblicus-0.13.0 → biblicus-0.14.0}/features/paddleocr_vl_parse_api_response.feature +0 -0
  120. {biblicus-0.13.0 → biblicus-0.14.0}/features/pdf_text_extraction.feature +0 -0
  121. {biblicus-0.13.0 → biblicus-0.14.0}/features/profiling.feature +0 -0
  122. {biblicus-0.13.0 → biblicus-0.14.0}/features/python_api.feature +0 -0
  123. {biblicus-0.13.0 → biblicus-0.14.0}/features/python_hook_logging.feature +0 -0
  124. {biblicus-0.13.0 → biblicus-0.14.0}/features/query_processing.feature +0 -0
  125. {biblicus-0.13.0 → biblicus-0.14.0}/features/recipe_file_extraction.feature +0 -0
  126. {biblicus-0.13.0 → biblicus-0.14.0}/features/retrieval_budget.feature +0 -0
  127. {biblicus-0.13.0 → biblicus-0.14.0}/features/retrieval_quality.feature +0 -0
  128. {biblicus-0.13.0 → biblicus-0.14.0}/features/retrieval_scan.feature +0 -0
  129. {biblicus-0.13.0 → biblicus-0.14.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
  130. {biblicus-0.13.0 → biblicus-0.14.0}/features/retrieval_uses_extraction_run.feature +0 -0
  131. {biblicus-0.13.0 → biblicus-0.14.0}/features/retrieval_utilities.feature +0 -0
  132. {biblicus-0.13.0 → biblicus-0.14.0}/features/select_override.feature +0 -0
  133. {biblicus-0.13.0 → biblicus-0.14.0}/features/smart_override_selection.feature +0 -0
  134. {biblicus-0.13.0 → biblicus-0.14.0}/features/source_loading.feature +0 -0
  135. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/analysis_steps.py +0 -0
  136. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/backend_steps.py +0 -0
  137. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/cli_parsing_steps.py +0 -0
  138. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/cli_steps.py +0 -0
  139. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/context_pack_steps.py +0 -0
  140. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/crawl_steps.py +0 -0
  141. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/deepgram_steps.py +0 -0
  142. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/docling_steps.py +0 -0
  143. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/evidence_processing_steps.py +0 -0
  144. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/extraction_evaluation_lab_steps.py +0 -0
  145. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/extraction_evaluation_steps.py +0 -0
  146. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/extraction_run_lifecycle_steps.py +0 -0
  147. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/extraction_steps.py +0 -0
  148. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/extractor_steps.py +0 -0
  149. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/frontmatter_steps.py +0 -0
  150. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/inference_steps.py +0 -0
  151. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/knowledge_base_steps.py +0 -0
  152. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/markitdown_steps.py +0 -0
  153. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/model_steps.py +0 -0
  154. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/openai_steps.py +0 -0
  155. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/paddleocr_mock_steps.py +0 -0
  156. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/paddleocr_vl_steps.py +0 -0
  157. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/paddleocr_vl_unit_steps.py +0 -0
  158. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/pdf_steps.py +0 -0
  159. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/profiling_steps.py +0 -0
  160. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/python_api_steps.py +0 -0
  161. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/rapidocr_steps.py +0 -0
  162. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/requests_mock_steps.py +0 -0
  163. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/retrieval_quality_steps.py +0 -0
  164. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/retrieval_steps.py +0 -0
  165. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/stt_deepgram_steps.py +0 -0
  166. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/stt_steps.py +0 -0
  167. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/topic_modeling_steps.py +0 -0
  168. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/unstructured_steps.py +0 -0
  169. {biblicus-0.13.0 → biblicus-0.14.0}/features/steps/user_config_steps.py +0 -0
  170. {biblicus-0.13.0 → biblicus-0.14.0}/features/streaming_ingest.feature +0 -0
  171. {biblicus-0.13.0 → biblicus-0.14.0}/features/stt_deepgram_extractor.feature +0 -0
  172. {biblicus-0.13.0 → biblicus-0.14.0}/features/stt_extractor.feature +0 -0
  173. {biblicus-0.13.0 → biblicus-0.14.0}/features/text_extraction_runs.feature +0 -0
  174. {biblicus-0.13.0 → biblicus-0.14.0}/features/token_budget.feature +0 -0
  175. {biblicus-0.13.0 → biblicus-0.14.0}/features/topic_modeling.feature +0 -0
  176. {biblicus-0.13.0 → biblicus-0.14.0}/features/unstructured_extractor.feature +0 -0
  177. {biblicus-0.13.0 → biblicus-0.14.0}/features/user_config.feature +0 -0
  178. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/download_ag_news.py +0 -0
  179. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/download_audio_samples.py +0 -0
  180. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/download_image_samples.py +0 -0
  181. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/download_mixed_samples.py +0 -0
  182. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/download_pdf_samples.py +0 -0
  183. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/download_wikipedia.py +0 -0
  184. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/extraction_evaluation_demo.py +0 -0
  185. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/extraction_evaluation_lab.py +0 -0
  186. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/profiling_demo.py +0 -0
  187. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/readme_end_to_end_demo.py +0 -0
  188. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/test.py +0 -0
  189. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/topic_modeling_integration.py +0 -0
  190. {biblicus-0.13.0 → biblicus-0.14.0}/scripts/wikipedia_rag_demo.py +0 -0
  191. {biblicus-0.13.0 → biblicus-0.14.0}/setup.cfg +0 -0
  192. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/__main__.py +0 -0
  193. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
  194. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
  195. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
  196. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
  197. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/analysis/__init__.py +0 -0
  198. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/analysis/base.py +0 -0
  199. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/analysis/llm.py +0 -0
  200. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/analysis/models.py +0 -0
  201. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/analysis/profiling.py +0 -0
  202. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/analysis/schema.py +0 -0
  203. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/analysis/topic_modeling.py +0 -0
  204. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/backends/__init__.py +0 -0
  205. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/backends/base.py +0 -0
  206. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/backends/scan.py +0 -0
  207. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/backends/sqlite_full_text_search.py +0 -0
  208. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/backends/vector.py +0 -0
  209. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/cli.py +0 -0
  210. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/constants.py +0 -0
  211. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/corpus.py +0 -0
  212. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/crawl.py +0 -0
  213. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/errors.py +0 -0
  214. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/evaluation.py +0 -0
  215. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/evidence_processing.py +0 -0
  216. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extraction.py +0 -0
  217. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extraction_evaluation.py +0 -0
  218. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/__init__.py +0 -0
  219. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/base.py +0 -0
  220. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/deepgram_stt.py +0 -0
  221. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/docling_granite_text.py +0 -0
  222. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/docling_smol_text.py +0 -0
  223. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/markitdown_text.py +0 -0
  224. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/metadata_text.py +0 -0
  225. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/openai_stt.py +0 -0
  226. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/paddleocr_vl_text.py +0 -0
  227. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/pass_through_text.py +0 -0
  228. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/pdf_text.py +0 -0
  229. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/pipeline.py +0 -0
  230. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
  231. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/select_longest_text.py +0 -0
  232. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/select_override.py +0 -0
  233. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/select_smart_override.py +0 -0
  234. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/select_text.py +0 -0
  235. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/extractors/unstructured_text.py +0 -0
  236. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/frontmatter.py +0 -0
  237. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/hook_logging.py +0 -0
  238. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/hook_manager.py +0 -0
  239. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/hooks.py +0 -0
  240. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/ignore.py +0 -0
  241. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/inference.py +0 -0
  242. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/knowledge_base.py +0 -0
  243. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/models.py +0 -0
  244. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/retrieval.py +0 -0
  245. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/sources.py +0 -0
  246. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/time.py +0 -0
  247. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/uris.py +0 -0
  248. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus/user_config.py +0 -0
  249. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
  250. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus.egg-info/entry_points.txt +0 -0
  251. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus.egg-info/requires.txt +0 -0
  252. {biblicus-0.13.0 → biblicus-0.14.0}/src/biblicus.egg-info/top_level.txt +0 -0

{biblicus-0.13.0/src/biblicus.egg-info → biblicus-0.14.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: biblicus
- Version: 0.13.0
+ Version: 0.14.0
  Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
  License: MIT
  Requires-Python: >=3.9

@@ -498,7 +498,8 @@ For detailed documentation including configuration options, performance characte

  For the retrieval pipeline overview and run artifacts, see `docs/RETRIEVAL.md`. For retrieval quality upgrades
  (tuned lexical baseline, reranking, hybrid retrieval), see `docs/RETRIEVAL_QUALITY.md`. For evaluation workflows
- and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`.
+ and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`. For a runnable walkthrough, use the retrieval evaluation lab
+ script (`scripts/retrieval_evaluation_lab.py`).

  ## Extraction backends

{biblicus-0.13.0 → biblicus-0.14.0}/README.md

@@ -452,7 +452,8 @@ For detailed documentation including configuration options, performance characte

  For the retrieval pipeline overview and run artifacts, see `docs/RETRIEVAL.md`. For retrieval quality upgrades
  (tuned lexical baseline, reranking, hybrid retrieval), see `docs/RETRIEVAL_QUALITY.md`. For evaluation workflows
- and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`.
+ and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`. For a runnable walkthrough, use the retrieval evaluation lab
+ script (`scripts/retrieval_evaluation_lab.py`).

  ## Extraction backends

biblicus-0.14.0/datasets/retrieval_lab/labels.json

@@ -0,0 +1,25 @@
+ {
+   "schema_version": 1,
+   "name": "retrieval-evaluation-lab",
+   "description": "Bundled labels for the retrieval evaluation lab.",
+   "queries": [
+     {
+       "query_id": "q1",
+       "query_text": "alpha unique",
+       "expected_filename": "alpha.txt",
+       "kind": "gold"
+     },
+     {
+       "query_id": "q2",
+       "query_text": "beta unique",
+       "expected_filename": "beta.txt",
+       "kind": "gold"
+     },
+     {
+       "query_id": "q3",
+       "query_text": "gamma unique",
+       "expected_filename": "gamma.txt",
+       "kind": "gold"
+     }
+   ]
+ }
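
These bundled labels are plain JSON, so they are easy to inspect outside the lab. As a minimal sketch, assuming the repository layout shown in the file list above (this is ordinary `json`/`pathlib` usage, not the package's own dataset loader):

```python
# Illustrative sketch: read the bundled labels file and sanity-check the
# fields visible in the diff above. Not the biblicus loader.
import json
from pathlib import Path

labels = json.loads(Path("datasets/retrieval_lab/labels.json").read_text())
assert labels["schema_version"] == 1
for query in labels["queries"]:
    assert query["kind"] == "gold"
    print(query["query_id"], query["query_text"], "->", query["expected_filename"])
```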

{biblicus-0.13.0 → biblicus-0.14.0}/docs/DEMOS.md

@@ -225,6 +225,17 @@ python3 scripts/extraction_evaluation_lab.py --corpus corpora/extraction_eval_la

  The lab writes a generated dataset file and evaluation output path and prints both in the command output.

+ ### Retrieval evaluation lab run
+
+ Use the retrieval evaluation lab to build a tiny corpus, run extraction, build a retrieval backend, and evaluate it
+ against bundled labels:
+
+ ```
+ python3 scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
+ ```
+
+ The script prints the dataset path, retrieval run identifier, and evaluation output location.
+
  Run with a larger corpus and a higher topic count:

  ```

{biblicus-0.13.0 → biblicus-0.14.0}/docs/FEATURE_INDEX.md

@@ -200,10 +200,15 @@ What it does:

  - Evaluates retrieval runs against datasets and budgets.

+ Documentation:
+
+ - `docs/RETRIEVAL_EVALUATION.md`
+
  Behavior specifications:

  - `features/evaluation.feature`
  - `features/model_validation.feature`
+ - `features/retrieval_evaluation_lab.feature`

  Primary implementation:

biblicus-0.14.0/docs/RETRIEVAL.md

@@ -0,0 +1,96 @@
+ # Retrieval
+
+ Biblicus treats retrieval as a reproducible, explicit pipeline stage that transforms a corpus into structured evidence.
+ Retrieval is separated from extraction and context shaping so each can be evaluated independently and swapped without
+ rewriting ingestion.
+
+ ## Retrieval concepts
+
+ - **Backend**: a pluggable retrieval implementation that can build and query runs.
+ - **Run**: a recorded retrieval build for a corpus and extraction run.
+ - **Evidence**: structured output containing identifiers, provenance, and scores.
+ - **Stage**: explicit steps such as retrieve, rerank, and filter.
+
+ ## How retrieval runs work
+
+ 1) Ingest raw items into a corpus.
+ 2) Build an extraction run to produce text artifacts.
+ 3) Build a retrieval run with a backend, referencing the extraction run.
+ 4) Query the run to return evidence.
+
+ Retrieval runs are stored under:
+
+ ```
+ .biblicus/runs/retrieval/<backend_id>/<run_id>/
+ ```
+
+ ## A minimal run you can execute
+
+ This walkthrough uses the full text search backend and produces evidence you can inspect immediately.
+
+ ```
+ rm -rf corpora/retrieval_demo
+ python3 -m biblicus init corpora/retrieval_demo
+ printf "alpha beta\n" > /tmp/retrieval-alpha.txt
+ printf "beta gamma\n" > /tmp/retrieval-beta.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_demo /tmp/retrieval-alpha.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_demo /tmp/retrieval-beta.txt
+
+ python3 -m biblicus extract build --corpus corpora/retrieval_demo --step pass-through-text
+ python3 -m biblicus build --corpus corpora/retrieval_demo --backend sqlite-full-text-search
+ python3 -m biblicus query --corpus corpora/retrieval_demo --query "beta"
+ ```
+
+ The query output is structured evidence with identifiers and scores. That evidence is the primary output for evaluation
+ and downstream context packing.
+
+ ## Backends
+
+ See `docs/backends/index.md` for backend selection and configuration.
+
+ ## Choosing a backend
+
+ Start with the simplest backend that answers your question:
+
+ - `scan` for tiny corpora or sanity checks.
+ - `sqlite-full-text-search` for a practical lexical baseline.
+ - `vector` when you want deterministic term-frequency similarity without external dependencies.
+
+ You can compare them with the same dataset and budget using the retrieval evaluation workflow.
+
+ ## Evaluation
+
+ Retrieval runs are evaluated against datasets with explicit budgets. See `docs/RETRIEVAL_EVALUATION.md` for the
+ dataset format and workflow, `docs/FEATURE_INDEX.md` for the behavior specifications, and `docs/CONTEXT_PACK.md` for
+ how evidence feeds into context packs.
+
+ ## Labs and demos
+
+ When you want a repeatable example with bundled data, use the retrieval evaluation lab:
+
+ ```
+ python3 scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
+ ```
+
+ The lab builds a tiny corpus, runs extraction, builds a retrieval run, and evaluates it. It prints the dataset path and
+ evaluation output so you can open the JavaScript Object Notation directly.
+
+ ## Reproducibility checklist
+
+ Use these habits when you want repeatable retrieval experiments:
+
+ - Record the extraction run identifier and pass it explicitly when you build a retrieval run.
+ - Keep evaluation datasets in source control and treat them as immutable inputs.
+ - Capture the full retrieval run identifier when you compare outputs across backends.
+
+ ## Why the separation matters
+
+ Keeping extraction and retrieval distinct makes it possible to:
+
+ - Reuse the same extracted artifacts across many retrieval backends.
+ - Compare backends against the same corpus and dataset inputs.
+ - Record and audit retrieval decisions without mixing in prompting or context formatting.
+
+ ## Retrieval quality
+
+ For retrieval quality upgrades, see `docs/RETRIEVAL_QUALITY.md`.

biblicus-0.14.0/docs/RETRIEVAL_EVALUATION.md

@@ -0,0 +1,181 @@
+ # Retrieval evaluation
+
+ Biblicus evaluates retrieval runs against deterministic datasets so quality comparisons are repeatable across backends
+ and corpora. Evaluations keep the evidence-first model intact by reporting per-query evidence alongside summary
+ metrics.
+
+ ## Dataset format
+
+ Retrieval datasets are stored as JavaScript Object Notation files with a strict schema:
+
+ ```json
+ {
+   "schema_version": 1,
+   "name": "example-dataset",
+   "description": "Small hand-labeled dataset for smoke tests.",
+   "queries": [
+     {
+       "query_id": "q-001",
+       "query_text": "alpha",
+       "expected_item_id": "item-id-123",
+       "kind": "gold"
+     }
+   ]
+ }
+ ```
+
+ Each query includes either an `expected_item_id` or an `expected_source_uri`. The `kind` field records whether the
+ query is hand-labeled (`gold`) or synthetic.
+
+ ## Metrics primer
+
+ Retrieval evaluation reports a small set of textbook metrics:
+
+ - **Hit rate**: the fraction of queries that retrieved the expected item at any rank.
+ - **Precision-at-k**: hit rate normalized by the evidence budget (`max_total_items`).
+ - **Mean reciprocal rank**: the average of `1 / rank` for the first matching item per query.
+
+ These metrics are deterministic for the same corpus, run, dataset, and budget.
+
+ ## Running an evaluation
+
+ Use the command-line interface to evaluate a retrieval run against a dataset:
+
+ ```bash
+ biblicus eval --corpus corpora/example --run <run_id> --dataset datasets/retrieval.json \
+   --max-total-items 5 --max-total-characters 2000 --max-items-per-source 5
+ ```
+
+ If `--run` is omitted, the latest retrieval run is used. Evaluations are deterministic for the same corpus, run, and
+ budget.
+
+ ## End-to-end evaluation example
+
+ This example builds a tiny corpus, creates a retrieval run, and evaluates it against a minimal dataset:
+
+ ```
+ rm -rf corpora/retrieval_eval_demo
+ python3 -m biblicus init corpora/retrieval_eval_demo
+ printf "alpha apple\n" > /tmp/eval-alpha.txt
+ printf "beta banana\n" > /tmp/eval-beta.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_eval_demo /tmp/eval-alpha.txt
+ python3 -m biblicus ingest --corpus corpora/retrieval_eval_demo /tmp/eval-beta.txt
+
+ python3 -m biblicus extract build --corpus corpora/retrieval_eval_demo --step pass-through-text
+ python3 -m biblicus build --corpus corpora/retrieval_eval_demo --backend sqlite-full-text-search
+
+ cat > /tmp/retrieval_eval_dataset.json <<'JSON'
+ {
+   "schema_version": 1,
+   "name": "retrieval-eval-demo",
+   "description": "Minimal dataset for evaluation walkthroughs.",
+   "queries": [
+     {
+       "query_id": "q1",
+       "query_text": "apple",
+       "expected_item_id": "ITEM_ID_FOR_ALPHA",
+       "kind": "gold"
+     }
+   ]
+ }
+ JSON
+ ```
+
+ Replace `ITEM_ID_FOR_ALPHA` with the item identifier from `biblicus list`, then run:
+
+ ```
+ python3 -m biblicus eval --corpus corpora/retrieval_eval_demo --dataset /tmp/retrieval_eval_dataset.json \
+   --max-total-items 3 --max-total-characters 2000 --max-items-per-source 5
+ ```
+
+ ## Retrieval evaluation lab
+
+ The retrieval evaluation lab ships with bundled files and labels so you can run a deterministic end-to-end evaluation
+ without external dependencies.
+
+ ```
+ python3 scripts/retrieval_evaluation_lab.py --corpus corpora/retrieval_eval_lab --force
+ ```
+
+ The script prints a summary that includes the generated dataset path, the retrieval run identifier, and the evaluation
+ output path.
+
+ ## Output
+
+ The evaluation output includes:
+
+ - Dataset metadata (name, description, query count).
+ - Run metadata (backend ID, run ID, evaluation timestamp).
+ - Metrics (hit rate, precision-at-k, mean reciprocal rank).
+ - System diagnostics (latency percentiles and index size).
+
+ The output is JavaScript Object Notation suitable for downstream reporting.
+
+ Example snippet:
+
+ ```json
+ {
+   "dataset": {
+     "name": "retrieval-eval-demo",
+     "description": "Minimal dataset for evaluation walkthroughs.",
+     "queries": 1
+   },
+   "backend_id": "sqlite-full-text-search",
+   "run_id": "RUN_ID",
+   "evaluated_at": "2024-01-01T00:00:00Z",
+   "metrics": {
+     "hit_rate": 1.0,
+     "precision_at_max_total_items": 0.3333333333333333,
+     "mean_reciprocal_rank": 1.0
+   },
+   "system": {
+     "average_latency_milliseconds": 1.2,
+     "percentile_95_latency_milliseconds": 2.4,
+     "index_bytes": 2048.0
+   }
+ }
+ ```
+
+ The `metrics` section is the primary signal for retriever quality. The `system` section helps compare performance and
+ storage costs across backends.
+
+ ## What to record for comparisons
+
+ When you compare retrieval runs, capture the same inputs every time:
+
+ - Corpus path (and whether the catalog has been reindexed).
+ - Extraction run identifier used by the retrieval run.
+ - Retrieval backend identifier and run identifier.
+ - Evaluation dataset path and schema version.
+ - Evidence budget values.
+
+ This metadata allows you to rerun the evaluation and explain differences between results.
+
+ ## Common pitfalls
+
+ - Evaluating against a dataset built for a different corpus or extraction run.
+ - Changing budgets between runs and expecting metrics to be comparable.
+ - Using stale item identifiers after reindexing or re-ingesting content.
+
+ ## Python usage
+
+ ```python
+ from pathlib import Path
+
+ from biblicus.corpus import Corpus
+ from biblicus.evaluation import evaluate_run, load_dataset
+ from biblicus.models import QueryBudget
+
+ corpus = Corpus.open("corpora/example")
+ run = corpus.load_run("<run_id>")
+ dataset = load_dataset(Path("datasets/retrieval.json"))
+ budget = QueryBudget(max_total_items=5, max_total_characters=2000, max_items_per_source=5)
+ result = evaluate_run(corpus=corpus, run=run, dataset=dataset, budget=budget)
+ print(result.model_dump_json(indent=2))
+ ```
+
+ ## Design notes
+
+ - Evaluation is reproducible by construction: the run manifest, dataset, and budget fully determine the results.
+ - The evaluation workflow expects retrieval stages to remain explicit in the run artifacts.
+ - Reports are portable, so comparisons across backends and corpora are straightforward.
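
The metrics primer in this new document can be made concrete with a short sketch. The data shapes below (`ranked_results`, `expected`) are hypothetical stand-ins, not the biblicus evidence model; the package's own implementation (`src/biblicus/evaluation.py` in the file list above) is authoritative.

```python
# Illustrative only: the three reported metrics computed from ranked results.
# `ranked_results` maps query_id -> ordered list of retrieved item ids;
# `expected` maps query_id -> the expected item id. Hypothetical shapes.

def hit_rate(ranked_results, expected):
    # Fraction of queries that retrieved the expected item at any rank.
    hits = sum(1 for q, ranked in ranked_results.items() if expected[q] in ranked)
    return hits / len(ranked_results)

def mean_reciprocal_rank(ranked_results, expected):
    # Average of 1 / rank for the first matching item per query (ranks are 1-based).
    total = 0.0
    for q, ranked in ranked_results.items():
        if expected[q] in ranked:
            total += 1.0 / (ranked.index(expected[q]) + 1)
    return total / len(ranked_results)

def precision_at_max_total_items(ranked_results, expected, max_total_items):
    # Hit rate normalized by the evidence budget, per the primer above.
    return hit_rate(ranked_results, expected) / max_total_items
```

With one query whose expected item ranks first and `max_total_items=3`, this reproduces the example output above: hit rate 1.0, mean reciprocal rank 1.0, precision 0.3333....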

biblicus-0.14.0/docs/RETRIEVAL_QUALITY.md

@@ -0,0 +1,106 @@
+ # Retrieval quality upgrades
+
+ This document describes the retrieval quality upgrades available in Biblicus. It is a reference for how retrieval
+ quality is expressed in runs and how to interpret the signals in artifacts and evidence.
+
+ ## Goals
+
+ - Improve relevance without losing determinism or reproducibility.
+ - Keep retrieval stages explicit and visible in run artifacts.
+ - Preserve the evidence-first output model.
+
+ ## Available upgrades
+
+ ### 1) Tuned lexical baseline
+
+ Biblicus exposes the knobs you use to shape lexical relevance without losing determinism:
+
+ - BM25-style scoring with configurable parameters.
+ - N-gram range controls.
+ - Stop word strategy per backend.
+ - Field weighting (for example: title, body, metadata).
+
+ Example configuration (SQLite full text search):
+
+ ```
+ python3 -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search \
+   --config chunk_size=200 \
+   --config chunk_overlap=50 \
+   --config snippet_characters=120 \
+   --config ngram_min=1 \
+   --config ngram_max=2
+ ```
+
+ ### 2) Reranking stage
+
+ The optional rerank stage keeps retrieval quality transparent. It re-scores a bounded candidate set and
+ records rerank scores alongside retrieve scores in evidence metadata.
+
+ Example configuration:
+
+ ```
+ python3 -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search \
+   --config rerank_enabled=true \
+   --config rerank_model=cross-encoder \
+   --config rerank_top_k=10
+ ```
+
+ ### 3) Hybrid retrieval
+
+ Hybrid retrieval combines lexical and vector signals. It expands candidate pools for each component backend, fuses
+ scores with explicit weights, and then applies the final budget.
+
+ Example configuration:
+
+ ```
+ python3 -m biblicus build --corpus corpora/demo --backend hybrid \
+   --config lexical_backend=sqlite-full-text-search \
+   --config embedding_backend=vector \
+   --config lexical_weight=0.7 \
+   --config embedding_weight=0.3
+ ```
+
+ Evidence items record both stage scores in `stage_scores` and preserve the hybrid weights in the run metadata so
+ evaluation can interpret how the fused ranking was produced.
+
+ ## Evaluation guidance
+
+ Evaluation keeps the retrieval stages explicit and makes comparisons easy:
+
+ - Measure hit rate, precision-at-k, and mean reciprocal rank against shared datasets.
+ - Use the retrieval evaluation lab for a repeatable walkthrough (`scripts/retrieval_evaluation_lab.py`).
+ - Run artifacts capture each stage and configuration for auditability.
+ - Deterministic settings remain available as the default baseline.
+
+ ## Interpreting evidence signals
+
+ Evidence returned by retrieval runs includes a `stage` label and optional `stage_scores` map:
+
+ - `stage` identifies the last stage that produced the evidence (for example, `retrieve`, `rerank`, `hybrid`).
+ - `stage_scores` contains per-stage scores so you can compare lexical and vector contributions in hybrid runs.
+
+ Use these fields to understand how a candidate moved through the pipeline and why it ranked where it did.
+
+ ## Budget awareness
+
+ Budgets shape every retrieval comparison:
+
+ - `max_total_items` limits the evidence list length and defines the denominator for precision-at-k.
+ - `max_total_characters` controls how much text can survive into evidence outputs.
+ - `max_items_per_source` prevents one source from dominating the final list.
+
+ When you compare backends, keep budgets constant and note any candidate expansion in hybrid runs so fused rankings are
+ drawn from comparable pools.
+
+ ## Non-goals
+
+ - Automated hyperparameter tuning.
+ - Hidden fallback stages that obscure retrieval behavior.
+ - UI-driven tuning in this phase.
+
+ ## Summary
+
+ Retrieval quality upgrades in Biblicus keep determinism intact while making scoring richer and more interpretable.
+ Start with tuned lexical baselines, add reranking when you need sharper relevance, and reach for hybrid retrieval when
+ you want to balance lexical precision with semantic similarity signals. Evaluate each change with the same dataset and
+ budget so improvements remain credible and reproducible.
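
The weighted fusion this document describes can be sketched in a few lines. This is a minimal illustration, assuming per-item score maps from each component backend; the function and dict shapes are hypothetical, and the actual implementation is `src/biblicus/backends/hybrid.py` from the file list above.

```python
# Illustrative weighted fusion of lexical and embedding scores, mirroring the
# lexical_weight / embedding_weight configuration shown above. Hypothetical
# shapes: each input maps item_id -> score from that component backend.

def fuse_scores(lexical, embedding, lexical_weight=0.7, embedding_weight=0.3):
    """Combine per-item scores from two backends into one descending ranking."""
    fused = {}
    for item_id in set(lexical) | set(embedding):
        score = (lexical_weight * lexical.get(item_id, 0.0)
                 + embedding_weight * embedding.get(item_id, 0.0))
        # Keep per-stage scores so evidence can explain the fused rank,
        # analogous to the stage_scores map described above.
        fused[item_id] = {
            "score": score,
            "stage_scores": {
                "lexical": lexical.get(item_id, 0.0),
                "embedding": embedding.get(item_id, 0.0),
            },
        }
    return sorted(fused.items(), key=lambda kv: kv[1]["score"], reverse=True)
```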

biblicus-0.14.0/features/retrieval_evaluation_lab.feature

@@ -0,0 +1,10 @@
+ Feature: Retrieval evaluation lab
+   The retrieval evaluation lab provides a deterministic walkthrough with bundled data.
+
+   Scenario: Retrieval evaluation lab reports expected metrics
+     When I run the retrieval evaluation lab with corpus "corpus" and dataset "dataset.json"
+     Then the retrieval evaluation lab dataset file exists
+     And the retrieval evaluation lab output file exists
+     And the retrieval evaluation lab metrics include hit_rate 1
+     And the retrieval evaluation lab metrics include mean_reciprocal_rank 1
+     And the retrieval evaluation lab metrics include precision_at_max_total_items 0.3333333333333333

biblicus-0.14.0/features/steps/retrieval_evaluation_lab_steps.py

@@ -0,0 +1,77 @@
+ from __future__ import annotations
+
+ import json
+ import math
+ import subprocess
+ from pathlib import Path
+
+ from behave import then, when
+
+
+ def _corpus_path(context, name: str) -> Path:
+     return (context.workdir / name).resolve()
+
+
+ def _parse_json_output(standard_output: str) -> dict[str, object]:
+     return json.loads(standard_output)
+
+
+ def _expect_metric(metrics: dict[str, object], key: str, expected: float) -> None:
+     actual = float(metrics[key])
+     assert math.isclose(actual, expected, rel_tol=1e-12, abs_tol=1e-12)
+
+
+ @when('I run the retrieval evaluation lab with corpus "{corpus_name}" and dataset "{dataset_name}"')
+ def step_run_retrieval_evaluation_lab(context, corpus_name: str, dataset_name: str) -> None:
+     corpus = _corpus_path(context, corpus_name)
+     dataset_path = (context.workdir / dataset_name).resolve()
+     result = subprocess.run(
+         [
+             "python3",
+             "scripts/retrieval_evaluation_lab.py",
+             "--corpus",
+             str(corpus),
+             "--dataset-path",
+             str(dataset_path),
+             "--force",
+         ],
+         cwd=context.repo_root,
+         capture_output=True,
+         text=True,
+         check=False,
+     )
+     context.last_result = result
+     assert result.returncode == 0, result.stderr
+     context.retrieval_lab_summary = _parse_json_output(result.stdout)
+
+
+ @then("the retrieval evaluation lab dataset file exists")
+ def step_retrieval_lab_dataset_exists(context) -> None:
+     summary = context.retrieval_lab_summary
+     dataset_path = Path(summary["dataset_path"])
+     assert dataset_path.is_file()
+
+
+ @then("the retrieval evaluation lab output file exists")
+ def step_retrieval_lab_output_exists(context) -> None:
+     summary = context.retrieval_lab_summary
+     output_path = Path(summary["evaluation_output_path"])
+     assert output_path.is_file()
+
+
+ @then("the retrieval evaluation lab metrics include hit_rate {expected:g}")
+ def step_retrieval_lab_hit_rate(context, expected: float) -> None:
+     metrics = context.retrieval_lab_summary["metrics"]
+     _expect_metric(metrics, "hit_rate", expected)
+
+
+ @then("the retrieval evaluation lab metrics include mean_reciprocal_rank {expected:g}")
+ def step_retrieval_lab_mean_reciprocal_rank(context, expected: float) -> None:
+     metrics = context.retrieval_lab_summary["metrics"]
+     _expect_metric(metrics, "mean_reciprocal_rank", expected)
+
+
+ @then("the retrieval evaluation lab metrics include precision_at_max_total_items {expected:g}")
+ def step_retrieval_lab_precision_at_max_total_items(context, expected: float) -> None:
+     metrics = context.retrieval_lab_summary["metrics"]
+     _expect_metric(metrics, "precision_at_max_total_items", expected)

{biblicus-0.13.0 → biblicus-0.14.0}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "biblicus"
- version = "0.13.0"
+ version = "0.14.0"
  description = "Command line interface and Python library for corpus ingestion, retrieval, and evaluation."
  readme = "README.md"
  requires-python = ">=3.9"