retrieval-observatory 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (111)

retrieval_observatory-0.1.2/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Ameya Kiwalkar
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

retrieval_observatory-0.1.2/MANIFEST.in ADDED

@@ -0,0 +1,6 @@
+graft retrieval_observatory
+prune retrieval_observatory/dashboard/ui/node_modules
+global-exclude *.db
+global-exclude __pycache__
+global-exclude .retobs
+recursive-exclude results *

retrieval_observatory-0.1.2/PKG-INFO ADDED

@@ -0,0 +1,655 @@
+Metadata-Version: 2.4
+Name: retrieval-observatory
+Version: 0.1.2
+Summary: Framework-agnostic benchmarking for hybrid RAG retrieval pipelines
+Author-email: Ameya Kiwalkar <akiwalkar@berkeley.edu>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/AmeyaKI/retrieval-observatory
+Project-URL: Documentation, https://github.com/AmeyaKI/retrieval-observatory#readme
+Project-URL: Repository, https://github.com/AmeyaKI/retrieval-observatory
+Project-URL: Issues, https://github.com/AmeyaKI/retrieval-observatory/issues
+Project-URL: Changelog, https://github.com/AmeyaKI/retrieval-observatory/releases
+Keywords: rag,retrieval,benchmark,evaluation,nlp
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pydantic>=2.0
+Requires-Dist: httpx>=0.27
+Requires-Dist: aiosqlite>=0.20
+Requires-Dist: rich>=13.0
+Requires-Dist: typer>=0.12
+Requires-Dist: pyyaml>=6.0
+Requires-Dist: numpy>=1.26
+Provides-Extra: beir
+Requires-Dist: datasets>=2.0; extra == "beir"
+Requires-Dist: beir>=2.0; extra == "beir"
+Provides-Extra: cohere
+Requires-Dist: cohere>=5.0; extra == "cohere"
+Provides-Extra: hf
+Requires-Dist: sentence-transformers>=3.0; extra == "hf"
+Requires-Dist: torch; extra == "hf"
+Provides-Extra: dense
+Requires-Dist: sentence-transformers>=3.0; extra == "dense"
+Requires-Dist: faiss-cpu>=1.7; extra == "dense"
+Requires-Dist: torch; extra == "dense"
+Provides-Extra: langchain
+Requires-Dist: langchain-core>=0.2; extra == "langchain"
+Provides-Extra: llamaindex
+Requires-Dist: llama-index-core>=0.10; extra == "llamaindex"
+Provides-Extra: pgvector
+Requires-Dist: asyncpg>=0.29; extra == "pgvector"
+Requires-Dist: pgvector>=0.3; extra == "pgvector"
+Provides-Extra: postgres
+Requires-Dist: asyncpg>=0.29; extra == "postgres"
+Provides-Extra: dashboard
+Requires-Dist: fastapi>=0.111; extra == "dashboard"
+Requires-Dist: uvicorn>=0.29; extra == "dashboard"
+Requires-Dist: python-multipart>=0.0.9; extra == "dashboard"
+Provides-Extra: llm-judge
+Requires-Dist: google-generativeai>=0.8; extra == "llm-judge"
+Requires-Dist: anthropic>=0.28; extra == "llm-judge"
+Requires-Dist: openai>=1.0; extra == "llm-judge"
+Provides-Extra: demo
+Requires-Dist: beir>=2.0; extra == "demo"
+Requires-Dist: datasets>=2.0; extra == "demo"
+Requires-Dist: rank-bm25>=0.2.2; extra == "demo"
+Provides-Extra: classifier
+Requires-Dist: scikit-learn>=1.4; extra == "classifier"
+Requires-Dist: joblib>=1.3; extra == "classifier"
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
+Requires-Dist: coverage; extra == "dev"
+Requires-Dist: respx>=0.21; extra == "dev"
+Requires-Dist: scikit-learn>=1.4; extra == "dev"
+Requires-Dist: joblib>=1.3; extra == "dev"
+Dynamic: license-file
+# retrieval-observatory (retobs)
+[![PyPI version](https://badge.fury.io/py/retrieval-observatory.svg)](https://pypi.org/project/retrieval-observatory/)
+Most RAG evaluation tools score end-to-end answer quality and stop there. They don't tell you **which stage helped**, **what it cost in latency**, or **which queries will fail before you run retrieval**. retobs is an open-source multi-stage retrieval benchmark and local dashboard that measures per-stage contribution, failure diagnosis, latency–quality tradeoffs, and query difficulty — so you can decide whether to add that reranker (or switch to dense) with evidence, not intuition.
+**Headline result:** On BEIR/FiQA, dense retrieval (`all-MiniLM-L6-v2`) outperforms BM25 by **+132% NDCG@10** (0.369 vs 0.159) at **~130× lower latency** than cross-encoder reranking. On SciFact and FiQA, dense-only is the **sole Pareto-optimal** pipeline. On NFCorpus, dense/rerank/RRF NDCG CIs overlap — no single winner on quality alone.
+[Figure: Quality–Latency Tradeoff, NFCorpus Pareto frontier]
+---
+## Install
+```bash
+pip install "retrieval-observatory[demo,dashboard,dense]"
+```
+For development from source:
+```bash
+git clone https://github.com/AmeyaKI/retrieval-observatory.git && cd retrieval-observatory
+python -m venv .venv && source .venv/bin/activate
+pip install -e ".[demo,dashboard,dense]"
+```
+---
+## Quickstart (~5 minutes)
+Run BM25 on 50 SciFact queries, then open the dashboard.
+**PyPI install** (bundled example config):
+```bash
+CFG="$(python -c 'from retrieval_observatory import EXAMPLES_DIR; print(EXAMPLES_DIR / "quickstart_scifact.yaml")')"
+retobs validate --config "$CFG"
+retobs run --config "$CFG"
+retobs serve --db .retobs/quickstart_scifact.db
+```
+**From a git clone** (repo `examples/` tree):
+```bash
+retobs validate --config examples/quickstart_scifact.yaml
+retobs run --config examples/quickstart_scifact.yaml
+retobs serve --db .retobs/quickstart_scifact.db
+```
+Open `http://localhost:8000` — explore metrics, latency, and query-level diagnostics.
+### Full examples and BEIR publish configs
+The PyPI wheel includes quickstart YAMLs only. For the full `examples/` demos (HTTP quickstart, temporal demo, dashboard demo with JSONL data) and multi-dataset BEIR sweeps, clone the repo:
+```bash
+git clone https://github.com/AmeyaKI/retrieval-observatory.git
+cd retrieval-observatory
+./scripts/run_beir_publish.sh full-sweep   # uses configs/beir_publish/
+```
+---
+## Benchmark Results
+Cross-dataset summary (full BEIR test splits, 4 independent pipelines). See [results/BENCHMARK_ANALYSIS.md](results/BENCHMARK_ANALYSIS.md) for motivation, Pareto analysis, classifier calibration, and limitations.
+| Dataset         | bm25 NDCG@10 | dense_only | rrf_hybrid | bm25__rerank | Pareto optimal   |
+| --------------- | ------------ | ---------- | ---------- | ------------ | ---------------- |
+| NFCorpus (323q) | 0.264        | **0.310**  | 0.304      | 0.310        | bm25, dense_only |
+| SciFact (300q)  | 0.544        | **0.640**  | 0.623      | 0.628        | dense_only       |
+| FiQA (648q)     | 0.159        | **0.369**  | 0.290      | 0.260        | dense_only       |
+Four pipelines: `bm25`, `dense_only`, `rrf_hybrid`, `bm25__rerank`. Stage attribution uses the bm25 → bm25__rerank prefix pair only. JSON exports and regeneration: [results/RESULTS_OVERVIEW.md](results/RESULTS_OVERVIEW.md).
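One way to read the "Pareto optimal" column: a pipeline is Pareto-optimal when no other pipeline is at least as good on both NDCG@10 and latency and strictly better on one of them. The sketch below illustrates that definition; it is not retobs's own implementation. The NDCG values are the NFCorpus row above, and the `bm25` and `bm25__rerank` latencies are the P50 values from the stage contribution table in this README. The `dense_only` and `rrf_hybrid` latencies are placeholder values made up for the example.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    ndcg: float        # higher is better
    latency_ms: float  # lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.ndcg >= b.ndcg and a.latency_ms <= b.latency_ms
    better = a.ndcg > b.ndcg or a.latency_ms < b.latency_ms
    return no_worse and better


def pareto_front(points: list[Point]) -> list[str]:
    """Names of the points that no other point dominates, in input order."""
    return [p.name for p in points if not any(dominates(q, p) for q in points)]


points = [
    Point("bm25", 0.264, 2.0),
    Point("dense_only", 0.310, 15.0),   # latency is a placeholder, not a measurement
    Point("rrf_hybrid", 0.304, 20.0),   # latency is a placeholder, not a measurement
    Point("bm25__rerank", 0.310, 4057.0),
]
front = pareto_front(points)
print(front)
```

With these inputs `bm25__rerank` ties `dense_only` on quality but is slower, so it drops off the frontier. Different latency values for the two placeholder rows could change that outcome.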
+---
+## What retobs tells you
+```
+Stage Contribution: bm25 → bm25__rerank
+┌───────────────┬──────────┬──────────┬────────────────┬────────────────┐
+│ Metric        │ Before   │ After    │ Δ              │ Significant?   │
+├───────────────┼──────────┼──────────┼────────────────┼────────────────┤
+│ recall@10     │ 0.1190   │ 0.1380   │ +0.0190 (+16%) │ q=0.041 ✓      │
+│ ndcg@10       │ 0.2640   │ 0.3100   │ +0.0460 (+17%) │ q=0.012 ✓      │
+│ Latency P50   │ 2ms      │ 4,057ms  │ +4,055ms       │ n/a            │
+└───────────────┴──────────┴──────────┴────────────────┴────────────────┘
+```
+1. **Stage attribution** — What did each stage add in quality, cost, and latency? BH-corrected significance on paired queries.
+2. **Failure diagnosis** — Candidate misses, lexical mismatches, reranker drops — labeled per query.
+3. **Latency–quality tradeoff** — Pareto frontier and budget slider; see whether reranking is worth it at your latency budget.
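"BH-corrected" refers to the Benjamini-Hochberg procedure, which turns a set of p-values into q-values like the ones in the table above. The sketch below is a textbook version of that adjustment, not code taken from retobs. How retobs computes the underlying paired p-values is not shown here, and the input p-values are made up.

```python
def bh_adjust(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


p = [0.01, 0.04, 0.03, 0.20]  # illustrative p-values, one per metric
q = bh_adjust(p)
print(q)
```

A metric would then be flagged as significant when its q-value falls below the chosen false discovery rate, for example 0.05.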
+Core promise:
+- Comparable **Recall@K, NDCG@K, MRR, MAP, latency percentiles, and estimated cost per 1k queries** across pipelines.
+- Multi-stage pipelines with independent stage analysis and temporal recall for time-sensitive datasets.
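For reference, Recall@K and NDCG@K can be computed per query roughly as follows. This is a sketch of the common definitions, assuming linear gain and a log2 rank discount for NDCG; the exact variant retobs uses is not stated in this README. The grades reuse the `doc_17` and `doc_22` labels from the custom dataset example, and `doc_03` is a made-up non-relevant document.

```python
import math


def recall_at_k(ranked_ids, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, grades, k):
    """NDCG@k with linear gain and a log2 rank discount (assumed variant)."""
    def dcg(gains):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [grades.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


grades = {"doc_17": 2, "doc_22": 1}
ranking = ["doc_22", "doc_03", "doc_17"]

recall_2 = recall_at_k(ranking, grades, 2)
recall_3 = recall_at_k(ranking, grades, 3)
ndcg_3 = ndcg_at_k(ranking, grades, 3)
print(recall_2, recall_3, round(ndcg_3, 4))
```

The ranking retrieves both relevant documents by rank 3, so Recall@3 is perfect, but NDCG@3 is penalized because the higher-graded document is ranked last.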
+---
+## How It's Different
+| Tool            | What it measures                                                                   |
+| --------------- | ---------------------------------------------------------------------------------- |
+| BEIR            | End-to-end pipeline accuracy on fixed datasets                                     |
+| RAGAs / TruLens | Answer quality given retrieved context                                             |
+| **retobs**      | **Per-stage contribution: what did each stage add in quality, cost, and latency?** |
+retobs is not a leaderboard and not an answer evaluator. It's a diagnostic layer between "I have a retrieval pipeline" and "I understand how to improve it."
+---
+## Install (development)
+```bash
+python -m venv .venv
+source .venv/bin/activate
+# Full local development setup
+pip install -e ".[demo,dashboard,dense,dev,llm-judge]"
+```
+For a smaller install:
+```bash
+pip install -e ".[demo,dashboard]"
+```
+---
+## Stage Attribution in 60 Seconds
+Add `ablations: true` to your combinations config and retobs automatically runs the prefix pipeline too:
+```yaml
+stages:
+  bm25:
+    type: adapter.bm25
+    config: {k: 100}
+  rerank:
+    type: adapter.hf_crossencoder
+    config:
+      model: cross-encoder/ms-marco-MiniLM-L-6-v2
+      k: 10
+combinations:
+  include:
+    - [bm25, rerank]
+  ablations: true   # automatically also runs [bm25] alone — no extra config needed
+```
+`retobs run` then prints the stage contribution table showing exactly what the reranker added.
+For a 3-stage pipeline, `ablations: true` generates **all valid ordered subsequences**, not just prefixes. In the example below, every generated pipeline keeps `bm25` as its first stage:
+```yaml
+combinations:
+  include:
+    - [bm25, fast_rerank, precise_rerank]
+  ablations: true
+# Generates: bm25 | bm25__fast_rerank | bm25__precise_rerank | bm25__fast_rerank__precise_rerank
+# Answers: does skipping fast_rerank and going direct to precise_rerank beat the cascade?
+```
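The four pipeline IDs listed in the comment above are consistent with "every ordered subsequence that keeps the first stage". The sketch below reproduces that enumeration with `itertools.combinations`. It is an illustration of the rule as inferred from the example, not retobs's expansion code, and `ablation_pipelines` is a hypothetical helper name.

```python
from itertools import combinations


def ablation_pipelines(stages: list[str]) -> list[str]:
    """Every ordered subsequence of `stages` that keeps the first stage."""
    first, rest = stages[0], stages[1:]
    pipelines = []
    for size in range(len(rest) + 1):
        # combinations() yields items in their original relative order.
        for kept in combinations(rest, size):
            pipelines.append("__".join([first, *kept]))
    return pipelines


ids = ablation_pipelines(["bm25", "fast_rerank", "precise_rerank"])
print(" | ".join(ids))
```

For a pipeline with one retriever and n later stages this yields 2^n combinations, so the number of runs doubles with each added reranking stage.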
+To test only whether a specific stage pays for itself, name it explicitly:
+```yaml
+combinations:
+  include:
+    - [bm25, fast_rerank, precise_rerank]
+  ablations: [fast_rerank]   # generates only: without fast_rerank vs with fast_rerank
+```
+Optionally set a latency budget to get a one-line verdict in CI:
+```bash
+retobs run --config my_experiment.yaml --latency-budget-ms 1000
+```
+---
+## Query Difficulty Classifier
+Predict whether a query will be hard for retrieval **before** running your pipeline, using only query text. Labels come from post-hoc diagnostics (mean Recall across pipelines on a specific corpus), so models are **dataset-specific**.
+```bash
+# Install classifier dependencies
+pip install -e ".[classifier]"
+# After one or more benchmark runs on the same dataset:
+retobs classifier train --dataset beir/nfcorpus
+# Inspect cross-val accuracy, Brier score, and feature importances:
+retobs classifier report --dataset beir/nfcorpus
+# Score a single query:
+retobs classifier predict --model .retobs/models/query_difficulty_beir_nfcorpus.joblib \
+  --query "What mitochondrial mechanisms were studied since 2019?"
+# Next benchmark run auto-applies predictions when a matching model exists
+retobs run --config my_experiment.yaml
+```
+The dashboard shows **Classifier Calibration**: mean Recall@10 (with bootstrap CIs) grouped by predicted difficulty. If predicted-hard queries have lower Recall@10 than predicted-easy ones, the classifier is doing useful work.
+**Caveat:** The classifier predicts observed difficulty under *your* pipelines on *your* corpus, not intrinsic question hardness. Train and evaluate on the same dataset; cross-dataset use is unsupported.
+---
+## HTTP Quickstart
+If your retrieval service is already running, point retobs at it and get metrics immediately:
+```bash
+# Start the mock server (uvicorn blocks, so keep it running in its own terminal)
+pip install fastapi uvicorn rank-bm25
+uvicorn examples.http_quickstart.server:app --port 8000
+# Benchmark it from a second terminal
+retobs run --config examples/http_quickstart/config.yaml
+```
+The HTTP adapter POSTs `{"query": str, "k": int}` and expects `{"results": [{"id", "text", "score"}]}`.
+---
+## Quick Test Of The Observatory
+```bash
+# 1. Install/update editable package
+source .venv/bin/activate
+pip install -e ".[demo,dashboard,dense,dev,llm-judge]"
+# 2. Confirm CLI commands are registered
+retobs --help
+# 3. Generate a starter experiment config
+retobs init --mode bm25+reranker --output my_experiment.yaml
+# 4. Validate before running
+retobs validate --config my_experiment.yaml
+# 5. Run the benchmark (stage attribution table printed automatically)
+retobs run --config my_experiment.yaml --no-cache
+# 6. Open the interactive dashboard
+retobs serve --db .retobs/results.db --port 8000
+```
+Open `http://localhost:8000` — move the latency budget slider and watch the stage verdict update live.
+Load multiple result databases in one dashboard (sidebar tabs per DB):
+```bash
+retobs serve --db .retobs/publish_smoke_scifact.db --db .retobs/dashboard_demo.db
+# or comma-separated:
+retobs serve --db .retobs/a.db,.retobs/b.db
+# or env var (colon-separated):
+RETOBS_DASHBOARD_DBS=.retobs/a.db:.retobs/b.db retobs serve
+```
+---
+## YAML Stage Combinations
+You can define stages once and ask `retobs` to expand the exact combinations you want to benchmark.
+```yaml
+experiment:
+  name: my-rag-sweep
+dataset:
+  type: custom
+  name: custom
+  queries_path: data/queries.jsonl
+  corpus_path: data/corpus.jsonl
+  timestamp_field: timestamp
+  metadata_fields: [source]
+stages:
+  bm25:
+    type: adapter.bm25
+    config: {k: 100}
+  dense:
+    type: adapter.hf_biencoder
+    config:
+      model: sentence-transformers/all-MiniLM-L6-v2
+      k: 100
+  rerank:
+    type: adapter.hf_crossencoder
+    config:
+      model: cross-encoder/ms-marco-MiniLM-L-6-v2
+      k: 10
+combinations:
+  include:
+    - [bm25, rerank]
+    - [dense, rerank]
+  ablations: true   # auto-generates [bm25] and [dense] prefix pipelines
+metrics:
+  recall_at_k: [1, 5, 10, 20]
+  precision_at_k: [5, 10]
+  ndcg_at_k: [10]
+  mrr: true
+  map: true
+execution:
+  concurrency: 4
+  timeout_seconds: 60
+  cache_results: true
+output:
+  store: sqlite
+  db_path: .retobs/results.db
+```
+Expanded pipeline IDs are stable, for example `bm25`, `dense`, `bm25__rerank`, and `dense__rerank`.
+Cost is configured for relative tradeoff analysis:
+```yaml
+costs:
+  bm25:
+    per_1k_queries: 0.10
+  rerank:
+    per_1k_queries: 1.50
+```
+`retobs run` and the dashboard both treat this as an **estimated** cost model from your YAML, not measured cloud billing telemetry.
+> **Stage cache note:** When `execution.cache_results: true`, retrieval stages are cached by
+> `hash(stage_config + upstream_candidates + query_id)`. The upstream candidate fingerprint ensures
+> that two pipelines sharing the same reranker but with different first-stage retrievers (e.g.
+> `bm25→rerank` vs `dense→rerank`) never share reranker snapshots. Stage 0 (first retriever) still
+> shares cache entries across ablation combos as intended. Use `--no-cache` when you want
+> fully independent execution for reproducibility auditing.
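The cache-key idea in the note above can be sketched as follows. This is only an illustration of `hash(stage_config + upstream_candidates + query_id)`. The choice of SHA-256, the JSON serialization, and fingerprinting upstream candidates by document ID alone are all assumptions, and `stage_cache_key` is a hypothetical name.

```python
import hashlib
import json


def stage_cache_key(stage_config: dict, upstream_ids: list[str], query_id: str) -> str:
    """Hash the stage config, the upstream candidate IDs, and the query ID."""
    payload = json.dumps(
        {"config": stage_config, "upstream": upstream_ids, "query_id": query_id},
        sort_keys=True,  # stable key order so equal configs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rerank_cfg = {"model": "cross-encoder/ms-marco-MiniLM-L-6-v2", "k": 10}

# Same reranker and query, but different first-stage candidates: keys differ.
after_bm25 = stage_cache_key(rerank_cfg, ["doc_17", "doc_22"], "q1")
after_dense = stage_cache_key(rerank_cfg, ["doc_22", "doc_40"], "q1")

# A first-stage retriever has no upstream candidates, so its key is shared
# by every combination that starts with the same stage config.
first_stage_a = stage_cache_key({"k": 100}, [], "q1")
first_stage_b = stage_cache_key({"k": 100}, [], "q1")

print(after_bm25 != after_dense, first_stage_a == first_stage_b)
```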
+### HTTP adapter schema
+The `adapter.http` stage wraps any REST endpoint. Your server must accept:
+**Request** — `POST` with JSON body:
+```json
+{"query": "user question text", "k": 100}
+```
+When query filters are set, a `filters` object is also included.
+**Response** — JSON in either shape:
+```json
+{"documents": [{"id": "doc_1", "text": "...", "score": 0.92}]}
+```
+```json
+[{"id": "doc_1", "text": "...", "score": 0.92}]
+```
+Each document object must include the configured ID field (default `id`). Text and score fields default to `text` and `score` but can be remapped:
+```yaml
+- type: adapter.http
+  url: http://localhost:8080/retrieve
+  config:
+    k: 100
+    id_field: doc_id
+    text_field: content
+    score_field: relevance
+```
+See [`examples/http_quickstart/server.py`](examples/http_quickstart/server.py) for a reference implementation.
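To make the two accepted response shapes and the field remapping concrete, here is a small sketch of a normalizer. `parse_documents` is a hypothetical helper written for this example and is not part of retobs. Its handling of a missing text or score field (empty string and `None`) is an assumption.

```python
def parse_documents(payload, id_field="id", text_field="text", score_field="score"):
    """Normalize either response shape into a list of (id, text, score) tuples."""
    # Accept {"documents": [...]} as well as a bare list of documents.
    docs = payload["documents"] if isinstance(payload, dict) else payload
    parsed = []
    for doc in docs:
        if id_field not in doc:
            raise ValueError(f"document is missing required field {id_field!r}")
        parsed.append((doc[id_field], doc.get(text_field, ""), doc.get(score_field)))
    return parsed


wrapped = parse_documents({"documents": [{"id": "doc_1", "text": "...", "score": 0.92}]})
bare = parse_documents([{"id": "doc_1", "text": "...", "score": 0.92}])
remapped = parse_documents(
    [{"doc_id": "doc_1", "content": "...", "relevance": 0.92}],
    id_field="doc_id",
    text_field="content",
    score_field="relevance",
)
print(wrapped == bare == remapped)
```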
+### Custom Python retriever via `adapter.import`
+Use `adapter.import` to load a Python factory callable from your own module without editing retobs internals:
+```yaml
+- type: adapter.import
+  retriever_id: keyword
+  config:
+    factory: retriever:build_retriever
+    k: 10
+```
+Supported factory paths:
+- `package.module:callable`
+- `package.module.callable`
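Both path styles can be resolved with the standard library's `importlib`. The sketch below shows one way to do it and uses `json.dumps` as a stand-in target so it runs anywhere. `resolve_factory` is a hypothetical helper, and retobs's actual loader may behave differently, for example in how it treats nested attributes.

```python
from importlib import import_module


def resolve_factory(path: str):
    """Resolve 'package.module:callable' or 'package.module.callable'."""
    if ":" in path:
        module_name, attr = path.split(":", 1)
    else:
        # Dotted form: everything before the last dot is the module.
        module_name, _, attr = path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"not a valid factory path: {path!r}")
    factory = getattr(import_module(module_name), attr)
    if not callable(factory):
        raise TypeError(f"{path!r} does not point to a callable")
    return factory


colon_form = resolve_factory("json:dumps")
dotted_form = resolve_factory("json.dumps")
print(colon_form is dotted_form)
```

The colon form is unambiguous about where the module name ends, which is why it is checked first.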
+Factory signature:
+```python
+def build_retriever(corpus: dict | None, stage_cfg: dict, **kwargs):
+    ...
+    return retriever_or_reranker, k
+```
+Runnable example: [`examples/custom_retriever/`](examples/custom_retriever/)
+---
+## Custom Dataset Format
+### `queries.jsonl`
+```json
+{"query_id":"q1","text":"What changed in the refund policy?","relevant_doc_ids":{"doc_17":2,"doc_22":1},"temporal_anchor":"2024-01-15T00:00:00"}
+```
+`relevant_doc_ids` can be a list for binary labels or a dict for graded relevance.
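When preparing your own data, it can help to normalize both label forms to a single `{doc_id: grade}` mapping before writing the file. The sketch below does that for one JSONL line. `load_query` is a hypothetical helper, and mapping a binary label to grade 1 is an assumption, not documented retobs behavior.

```python
import json


def load_query(line: str) -> dict:
    """Parse one queries.jsonl line and normalize relevant_doc_ids to {doc_id: grade}."""
    record = json.loads(line)
    labels = record.get("relevant_doc_ids", {})
    if isinstance(labels, list):
        # Assumption: a binary label is treated as grade 1.
        labels = {doc_id: 1 for doc_id in labels}
    record["relevant_doc_ids"] = {str(d): int(g) for d, g in labels.items()}
    return record


graded = load_query(
    '{"query_id":"q1","text":"What changed in the refund policy?",'
    '"relevant_doc_ids":{"doc_17":2,"doc_22":1}}'
)
binary = load_query(
    '{"query_id":"q1","text":"What changed in the refund policy?",'
    '"relevant_doc_ids":["doc_17","doc_22"]}'
)
print(graded["relevant_doc_ids"], binary["relevant_doc_ids"])
```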
+### `corpus.jsonl`
+```json
+{"id":"doc_17","title":"Refund policy update","text":"Refunds are now processed within 7 days.","timestamp":"2024-01-10T00:00:00"}
+```
+### Optional `qrels.jsonl`
+```json
+{"query_id":"q1","doc_id":"doc_17","grade":2}
+```
+`qrels.tsv` in TREC-style format is also supported.
+---
+## LLM-Assisted Labels
+Gold labels are the default and remain the recommended evaluation source.
+For unlabeled datasets, you can opt into LLM-assisted labels:
+```yaml
+labels:
+  mode: pooled_llm_judge   # gold, llm_judge, or pooled_llm_judge
+  judge: gemini            # gemini, openai, or anthropic
+  model: gemini-2.0-flash
+  cache_path: .retobs/llm_judge_cache.db
+```
+---
+## Dashboard Features
+| Feature                  | Description                                                                               |
+| ------------------------ | ----------------------------------------------------------------------------------------- |
+| Stage Attribution        | Before/after metric table for each pipeline pair with BH-corrected significance.          |
+| Tradeoff Explorer        | Latency budget + min quality delta sliders; verdict computed client-side.                 |
+| Experiment Overview      | Headline winner, difficulty buckets, failure-label summary, reproducibility warnings.     |
+| Pipeline Architecture    | Stage-by-stage flow diagram with per-stage quality and latency.                           |
+| Stage Combination Matrix | Compact view of quality, latency, and optional cost-per-1k by pipeline/stage.             |
+| Query Explorer           | Query-level diagnostics with failure labels, missing relevant IDs, and difficulty bucket. |
+| Run Comparison           | Side-by-side metrics with query-ID-aligned paired bootstrap p-values.                     |
+| Recall@K Curves          | Recall trends across K with BEIR reference lines when available.                          |
+| Stage Recall Funnel      | Shows how much candidate recall survives through reranking stages.                        |
+| Latency Breakdown        | P50/P95/P99 plus profiling metrics for compute, network, and retries.                     |
+| Segment Analysis         | NDCG@10 by query metadata such as number of relevant docs.                                |
+---
+## Example Runs
+### BEIR BM25 Baseline
+```bash
+retobs validate --config examples/beir_demo.yaml
+retobs run --config examples/beir_demo.yaml
+retobs serve --db .retobs/beir_demo.db
+```
+### Three-Way NFCorpus Comparison
+```bash
+pip install -e ".[demo,dashboard,dense]"
+retobs validate --config examples/nfcorpus_three_way.yaml
+retobs run --config examples/nfcorpus_three_way.yaml --no-cache
+retobs serve --db .retobs/nfcorpus_three_way.db
+```
+### Temporal Recall Demo
+```bash
+pip install -e ".[demo,dashboard]"
+python examples/temporal_demo/generate_data.py
+retobs run --config examples/temporal_demo/config.yaml --no-cache
+retobs serve --db .retobs/temporal_demo.db
+```
+This demo intentionally includes old and new relevant documents per query so `recall@1` and `temporal_recall@1` diverge when top-ranked hits are stale.
+### RRF Hybrid (BM25 + Dense)
+```bash
+pip install -e ".[demo,dashboard,dense]"
+retobs run --config examples/rrf_hybrid.yaml
+```
+### Dense vs BM25+Cohere Hybrid
+```bash
+pip install -e ".[demo,dashboard,dense,cohere]"
+export COHERE_API_KEY=your-key-here
+retobs run --config examples/hybrid_comparison.yaml
+```
+---
+## CLI Reference
+```bash
+retobs init      --mode MODE --output PATH                Generate starter config and sample data
+retobs validate  --config PATH [--db PATH]                Validate config and dataset before running
+retobs run       --config PATH [--no-cache]               Run a benchmark experiment
+                             [--latency-budget-ms N]      Print verdict against stage latency delta
+retobs serve     --db PATH [--db PATH ...] [--port N]     Start dashboard (repeat --db for multiple SQLite files)
+retobs compare   RUN_ID_1 RUN_ID_2 --db PATH              Compare runs with paired bootstrap tests
+retobs inspect   RUN_ID --query QUERY_ID [--pipeline ID]  Debug per-query retrieval results
+```
+Init modes: `beir`, `custom-jsonl`, `http-endpoint`, `bm25+dense` (includes RRF), `bm25+reranker` (includes ablations).
+---
+## Run The Test Suite
+```bash
+source .venv/bin/activate
+pip install -e ".[demo,dashboard,dense,dev,llm-judge]"
+pytest tests/ -q
+npm --prefix retrieval_observatory/dashboard/ui run build
+python -m compileall retrieval_observatory -q
+```
+---
+## Dashboard Development
+The dashboard UI is **pre-built in the PyPI wheel**, so `retobs serve` works after `pip install` with no Node.js required. When developing from a git clone and editing React sources, rebuild the UI:
+```bash
+cd retrieval_observatory/dashboard/ui
+npm install
+npm run dev      # hot-reloading dev server on :5173 (proxies API to retobs serve)
+npm run build    # rebuild dist/ before python -m build or tagging a release
+```
+Or use `make dashboard-dev` / `make dashboard-build` from the repo root.
+---
+## Optional Dependency Groups
+| Group        | Installs                                | Use for                                                      |
+| ------------ | --------------------------------------- | ------------------------------------------------------------ |
+| `demo`       | beir, datasets, rank-bm25               | Running BEIR datasets with BM25                              |
+| `dashboard`  | fastapi, uvicorn, python-multipart      | Serving the dashboard and accepting uploads                  |
+| `dense`      | sentence-transformers, faiss-cpu, torch | Dense bi-encoder retrieval and local cross-encoder reranking |
+| `dev`        | pytest, pytest-asyncio, coverage, respx | Running tests                                                |
+| `cohere`     | cohere                                  | Cohere reranking                                             |
+| `langchain`  | langchain-core                          | LangChain adapter (programmatic use)                         |
+| `llamaindex` | llama-index-core                        | LlamaIndex adapter (programmatic use)                        |
+| `pgvector`   | asyncpg, pgvector                       | Pgvector adapter                                             |
+| `llm-judge`  | google-generativeai, anthropic, openai  | LLM-assisted relevance judging                               |
+PostgreSQL backend (`asyncpg`) is community-supported and not CI-tested. SQLite is recommended for evaluation workloads.
+```bash
+pip install -e ".[demo,dashboard,dense,dev,llm-judge]"
+```