PyPI - cbrkit - Versions diffs - 1.0.0__tar.gz → 1.2.0__tar.gz - Mend

cbrkit 1.0.0__tar.gz → 1.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (95)

cbrkit-1.0.0/README.md → cbrkit-1.2.0/PKG-INFO RENAMED

@@ -1,3 +1,104 @@
+Metadata-Version: 2.3
+Name: cbrkit
+Version: 1.2.0
+Summary: Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI
+Keywords: cbr,case-based reasoning,api,similarity,nlp,retrieval,cli,tool,library
+Author: Mirko Lenz
+Author-email: Mirko Lenz <mirko@mirkolenz.com>
+Classifier: Development Status :: 4 - Beta
+Classifier: Environment :: Console
+Classifier: Framework :: Pytest
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Natural Language :: English
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Utilities
+Classifier: Typing :: Typed
+Requires-Dist: frozendict>=2,<3
+Requires-Dist: numpy>=2,<3
+Requires-Dist: orjson>=3,<4
+Requires-Dist: polars>=1,<2
+Requires-Dist: pydantic>=2,<3
+Requires-Dist: pyyaml>=6,<7
+Requires-Dist: rtoml>=0.12,<1
+Requires-Dist: scipy>=1,<2
+Requires-Dist: xmltodict>=1,<2
+Requires-Dist: cbrkit[anthropic,api,bm25,chromadb,chunking,cohere,eval,google,graphs,graphviz,instructor,lancedb,levenshtein,nltk,ollama,openai,openai-agents,pandas,pydantic-ai,spacy,sql,timeseries,transformers,voyageai,zvec] ; extra == 'all'
+Requires-Dist: anthropic>=0.40,<1 ; extra == 'anthropic'
+Requires-Dist: cbrkit[cli] ; extra == 'api'
+Requires-Dist: fastapi>=0.100,<1 ; extra == 'api'
+Requires-Dist: pydantic-settings>=2,<3 ; extra == 'api'
+Requires-Dist: python-multipart>=0.0.15,<1 ; extra == 'api'
+Requires-Dist: uvicorn[standard]>=0.30,<1 ; extra == 'api'
+Requires-Dist: fastmcp>=3,<4 ; extra == 'api'
+Requires-Dist: bm25s[core,stem,indexing]>=0.3,<1 ; extra == 'bm25'
+Requires-Dist: chromadb>=1,<2 ; extra == 'chromadb'
+Requires-Dist: chonkie>=1,<2 ; extra == 'chunking'
+Requires-Dist: rich>=13,<15 ; extra == 'cli'
+Requires-Dist: typer>=0.9,<1 ; extra == 'cli'
+Requires-Dist: cohere>=5,<6 ; extra == 'cohere'
+Requires-Dist: ranx>=0.3,<1 ; extra == 'eval'
+Requires-Dist: google-genai>=1,<2 ; extra == 'google'
+Requires-Dist: networkx>=3,<4 ; extra == 'graphs'
+Requires-Dist: rustworkx>=0.15,<1 ; extra == 'graphs'
+Requires-Dist: pygraphviz>=1,<2 ; extra == 'graphviz'
+Requires-Dist: instructor>=1,<2 ; extra == 'instructor'
+Requires-Dist: lancedb>=0.20,<1 ; extra == 'lancedb'
+Requires-Dist: levenshtein>=0.26,<1 ; extra == 'levenshtein'
+Requires-Dist: nltk>=3,<4 ; extra == 'nltk'
+Requires-Dist: ollama>=0.3,<1 ; extra == 'ollama'
+Requires-Dist: openai>=1,<3 ; extra == 'openai'
+Requires-Dist: tiktoken>=0.8,<1 ; extra == 'openai'
+Requires-Dist: openai-agents>=0.2,<1 ; extra == 'openai-agents'
+Requires-Dist: pandas>=2,<4 ; extra == 'pandas'
+Requires-Dist: pydantic-ai-slim>=1,<2 ; extra == 'pydantic-ai'
+Requires-Dist: spacy>=3.8,<4 ; extra == 'spacy'
+Requires-Dist: sqlalchemy>=2,<3 ; extra == 'sql'
+Requires-Dist: minineedle>=3,<4 ; extra == 'timeseries'
+Requires-Dist: sentence-transformers>=4,<6 ; extra == 'transformers'
+Requires-Dist: torch>=2.5,<3 ; extra == 'transformers'
+Requires-Dist: transformers>=4,<6 ; extra == 'transformers'
+Requires-Dist: voyageai>=0.3,<1 ; extra == 'voyageai'
+Requires-Python: >=3.13, <4
+Project-URL: Repository, https://github.com/wi2trier/cbrkit
+Project-URL: Documentation, https://wi2trier.github.io/cbrkit/
+Project-URL: Issues, https://github.com/wi2trier/cbrkit/issues
+Project-URL: Changelog, https://github.com/wi2trier/cbrkit/releases
+Provides-Extra: all
+Provides-Extra: anthropic
+Provides-Extra: api
+Provides-Extra: bm25
+Provides-Extra: chromadb
+Provides-Extra: chunking
+Provides-Extra: cli
+Provides-Extra: cohere
+Provides-Extra: eval
+Provides-Extra: google
+Provides-Extra: graphs
+Provides-Extra: graphviz
+Provides-Extra: instructor
+Provides-Extra: lancedb
+Provides-Extra: levenshtein
+Provides-Extra: nltk
+Provides-Extra: ollama
+Provides-Extra: openai
+Provides-Extra: openai-agents
+Provides-Extra: pandas
+Provides-Extra: pydantic-ai
+Provides-Extra: spacy
+Provides-Extra: sql
+Provides-Extra: timeseries
+Provides-Extra: transformers
+Provides-Extra: voyageai
+Description-Content-Type: text/markdown
 <!-- markdownlint-disable MD033 MD041 -->
 <h1><p align="center">CBRkit</p></h1>
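The metadata header added in this hunk uses the RFC 822 style header format of `PKG-INFO`, so the standard library's `email.parser` can read it. The following is a minimal sketch, not part of the package, that groups `Requires-Dist` entries by extra. The helper name `requirements_by_extra` and the shortened sample text are illustrative assumptions, and the marker handling only covers the simple `extra == '...'` form seen in the listing above.

```python
from collections import defaultdict
from email.parser import Parser

# Shortened sample in the same header format as the PKG-INFO listing above.
PKG_INFO = """\
Metadata-Version: 2.3
Name: cbrkit
Version: 1.2.0
Requires-Dist: numpy>=2,<3
Requires-Dist: rich>=13,<15 ; extra == 'cli'
Requires-Dist: typer>=0.9,<1 ; extra == 'cli'
Requires-Dist: ranx>=0.3,<1 ; extra == 'eval'
Provides-Extra: cli
Provides-Extra: eval
"""


def requirements_by_extra(pkg_info: str) -> dict[str, list[str]]:
    """Group Requires-Dist entries by extra name ("" means always installed)."""
    headers = Parser().parsestr(pkg_info, headersonly=True)
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in headers.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        extra = ""
        # Only the simple marker form `extra == 'name'` is handled here.
        if "extra ==" in marker:
            extra = marker.split("==", 1)[1].strip().strip("'\"")
        grouped[extra].append(requirement.strip())
    return dict(grouped)


grouped = requirements_by_extra(PKG_INFO)
for extra, requirements in grouped.items():
    print(extra or "(core)", requirements)
```

For compound environment markers, a dedicated requirement parser would be the safer choice than this string splitting.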
@@ -36,26 +137,30 @@ To get started, we provide a [demo project](https://github.com/wi2trier/cbrkit-d
 Further examples can be found in our [tests](./tests/test_retrieve.py) and [documentation](https://wi2trier.github.io/cbrkit/).
 The following modules are part of CBRkit:
-- `cbrkit.loaders` and `cbrkit.dumpers`: Functions for loading and exporting cases and queries.
-- `cbrkit.sim`: Similarity functions for common data types and some utility functions such as `cache`, `combine`, `transpose`, etc.
-  - `cbrkit.sim.strings`: String similarity measures (Levenshtein, Jaro, semantic, etc.).
+- `cbrkit.loaders`: Functions for loading cases and queries from various file formats and data sources.
+- `cbrkit.dumpers`: Functions for exporting data to JSON, YAML, CSV, TOML, and Markdown.
+- `cbrkit.sim`: Similarity measures for common data types with utility functions such as `cache`, `combine`, `transpose`, etc.
+  - `cbrkit.sim.strings`: String similarity measures (Levenshtein, Jaro, spaCy, etc.).
   - `cbrkit.sim.numbers`: Numeric similarity measures (linear, exponential, threshold).
-  - `cbrkit.sim.collections`: Similarity measures for collections and sequences (Jaccard, DTW, Smith-Waterman).
+  - `cbrkit.sim.collections`: Similarity measures for collections and sequences (Jaccard, etc.).
   - `cbrkit.sim.embed`: Embedding-based similarity functions with caching support.
-  - `cbrkit.sim.graphs`: Graph similarity algorithms including GED, A*, VF2, and more.
-  - `cbrkit.sim.taxonomy`: Taxonomy-based similarity functions.
+  - `cbrkit.sim.graphs`: Graph similarity algorithms including A\*, VF2, greedy, LAP, and more.
+  - `cbrkit.sim.taxonomy`: Taxonomy-based similarity functions (Wu-Palmer, etc.).
   - `cbrkit.sim.generic`: Generic similarity functions (equality, tables, static).
   - `cbrkit.sim.attribute_value`: Similarity for attribute-value based data.
   - `cbrkit.sim.pooling`: Functions for aggregating multiple similarity values.
   - `cbrkit.sim.aggregator`: Combines multiple local measures into global scores.
-- `cbrkit.retrieval`: Functions for defining and applying retrieval pipelines, includes BM25 retrieval, rerankers, etc.
-- `cbrkit.adapt`: Adaptation generator functions for adapting cases based on a query.
-- `cbrkit.reuse`: Functions for defining and applying reuse pipelines.
+- `cbrkit.adapt`: Adaptation functions for adapting cases based on a query.
+- `cbrkit.retrieval`: Retrieval pipelines with BM25, embedding-based retrieval, re-ranking (Cohere, Voyage AI, Sentence Transformers), and more.
+- `cbrkit.reuse`: Reuse pipelines that apply adaptation and score the results.
+- `cbrkit.revise`: Revision pipelines for assessing and optionally repairing solutions.
+- `cbrkit.retain`: Retention pipelines for storing solved cases back into the casebase.
+- `cbrkit.cycle`: Full CBR cycle orchestration across all four phases.
+- `cbrkit.system`: High-level `System` class for composing all CBR phases into a single object.
+- `cbrkit.synthesis`: LLM-based synthesis for generating insights from cases (RAG), with providers for OpenAI, Anthropic, Cohere, Google, Ollama, and more.
 - `cbrkit.eval`: Evaluation metrics for retrieval results including precision, recall, and custom metrics.
-- `cbrkit.model`: Data models for graphs and results.
-- `cbrkit.cycle`: CBR cycle implementation.
+- `cbrkit.model`: Data models for results and graph structures.
 - `cbrkit.typing`: Generic type definitions for defining custom functions.
-- `cbrkit.synthesis`: Functions for working on a casebase with LLMs to create new insights, e.g., in a RAG context.
 ## Installation
@@ -74,14 +179,12 @@ pip install cbrkit[EXTRA_NAME,...]
 where `EXTRA_NAME` is one of the following:
 - `all`: All optional dependencies
-- `api`: REST API Server
-- `cli`: Command Line Interface (CLI)
-- `eval`: Evaluation tools for common metrics like `precision` and `recall`
-- `graphs`: Graph libraries like `networkx` and `rustworkx`
-- `llm`: Large Language Models (LLM) APIs like Ollama and OpenAI
-- `nlp`: Standalone NLP tools `levenshtein`, `nltk`, `openai`, and `spacy`
-- `timeseries`: Time series similarity measures like `dtw` and `smith_waterman`
-- `transformers`: Advanced NLP tools based on `pytorch` and `transformers`
+- **LLM providers:** `anthropic`, `cohere`, `google`, `ollama`, `openai`, `openai-agents`, `pydantic-ai`, `instructor`, `voyageai`
+- **NLP / text processing:** `bm25`, `chunking`, `levenshtein`, `nltk`, `spacy`
+- **ML / embeddings:** `transformers` (includes `pytorch` and `sentence-transformers`)
+- **Graphs:** `graphs` (`networkx` and `rustworkx`), `graphviz`
+- **Data backends:** `chromadb`, `lancedb`, `pandas`, `sql` (SQLAlchemy), `zvec`
+- **Tools:** `cli` (CLI), `api` (REST API server), `eval` (evaluation metrics), `timeseries` (DTW, Smith-Waterman)
 Alternatively, you can also clone this git repository and install CBRKit and its dependencies via uv: `uv sync --all-extras`
@@ -95,7 +198,8 @@ We provide predefined functions for the following formats:
 - toml
 - xml
 - yaml
-- py (object inside of a python file).
+- txt (plain text)
+- py (object inside of a python file)
 Loading one of those formats can be done via the `file` function:
@@ -104,8 +208,18 @@ import cbrkit
 casebase = cbrkit.loaders.file("path/to/cases.[json,toml,yaml,xml,csv]")
 ```
-Additionally, CBRkit also integrates with `polars` and `pandas` for loading data frames.
-The following example shows how to load cases and queries from a CSV file using `polars`:
+You can also load all files from a directory or use the unified `path` function:
+```python
+# Load all files matching a glob pattern from a directory
+casebase = cbrkit.loaders.directory("path/to/cases/", pattern="*.json")
+# Unified path function: auto-detects whether path is a file or directory
+casebase = cbrkit.loaders.path("path/to/cases.json")  # single file
+casebase = cbrkit.loaders.path("path/to/cases/")      # directory
+```
+Additionally, CBRkit integrates with `polars` and `pandas` for loading data frames:
 ```python
 import polars as pl
@@ -115,6 +229,25 @@ df = pl.read_csv("path/to/cases.csv")
 casebase = cbrkit.loaders.polars(df)
 ```
+For database access, CBRkit provides `sqlite` and `sqlalchemy` loaders (the latter requires the `sql` extra):
+```python
+casebase = cbrkit.loaders.sqlite("path/to/database.db", "SELECT * FROM cases")
+```
+**Tip:** You can validate a loaded casebase against a Pydantic model using `cbrkit.loaders.validate()`:
+```python
+from pydantic import BaseModel
+class Car(BaseModel):
+    price: int
+    year: int
+    model: str
+casebase = cbrkit.loaders.validate(casebase, Car)
+```
 ## Defining Queries
 CBRkit expects the type of the queries to match the type of the cases.
@@ -139,6 +272,29 @@ In case your query collection only contains a single entry, you can use the `sin
 query = cbrkit.helpers.singleton(queries)
 ```
+## Exporting Data
+CBRkit provides functions for exporting data through the `cbrkit.dumpers` module.
+Supported formats include JSON, YAML, CSV, TOML, and Markdown.
+```python
+import cbrkit
+# Export to a file (format is inferred from the extension)
+cbrkit.dumpers.file("output.json", data)
+cbrkit.dumpers.file("output.yaml", data)
+# Export to a directory (one file per entry)
+cbrkit.dumpers.directory("output/", data)
+# Or use the unified path function
+cbrkit.dumpers.path("output.json", data)  # writes a single file
+cbrkit.dumpers.path("output/", data)      # writes a directory
+# Format data as a Markdown code block
+md = cbrkit.dumpers.markdown()(data)
+```
 ## Similarity Measures and Aggregation
 The next step is to define similarity measures for the cases and queries.
@@ -229,6 +385,21 @@ cached_sim = cbrkit.sim.embed.build(
 )
 ```
+#### Collection and Sequence Similarity
+CBRkit provides similarity measures for collections and sequences in `cbrkit.sim.collections`:
+```python
+# Jaccard similarity for sets (requires the `nltk` extra)
+jaccard_sim = cbrkit.sim.collections.jaccard()
+# Optimal sequence mapping using A* search
+seq_sim = cbrkit.sim.collections.mapping(cbrkit.sim.generic.equality())
+```
+Dynamic Time Warping and Smith-Waterman alignment are available with the `timeseries` extra.
+See the [module documentation](https://wi2trier.github.io/cbrkit/cbrkit/sim/collections.html) for the full list.
 #### Taxonomy-Based Similarity
 ```python
@@ -269,20 +440,15 @@ CBRkit provides extensive support for graph similarity through various algorithm
 ```python
 # Using Graph Edit Distance (GED) with A* search
-graph_sim = cbrkit.sim.graphs.astar(
-    node_sim=cbrkit.sim.generic.equality(),
+graph_sim = cbrkit.sim.graphs.astar.build(
+    node_sim_func=cbrkit.sim.generic.equality(),
     node_matcher=lambda n1, n2: n1 == n2,
-    edge_matcher=lambda e1, e2: e1 == e2
+    edge_matcher=lambda e1, e2: e1 == e2,
 )
 ```
-Available graph algorithms include:
-- `astar`: A* search for optimal graph edit distance
-- `vf2`: VF2 algorithm for (sub)graph isomorphism
-- `lap`: Linear assignment problem solver
-- `greedy`: Fast greedy matching
-- `brute_force`: Exhaustive search for small graphs
-- `dfs`: Depth-first search based matching
+Available graph algorithms include `astar`, `vf2`, `greedy`, `lap`, `brute_force`, `dfs`, `dtw`, and `smith_waterman`.
+See the [module documentation](https://wi2trier.github.io/cbrkit/cbrkit/sim/graphs.html) for a full list of algorithms and their parameters.
 ### Global Similarity and Aggregation
@@ -333,9 +499,30 @@ cbrkit.sim.attribute_value(
 )
 ```
+## CBR Cycle Phases
+All four phases of the CBR cycle — retrieval, reuse, revise, and retain — follow the same unified protocol `CbrFunc` (defined in `cbrkit.typing`).
+Each phase function takes a casebase and a query, and returns an updated casebase together with a score map.
+The casebase in the output may differ from the input depending on the phase (e.g., adapted cases in reuse, newly stored cases in retain).
+The score map assigns a floating-point score to each case in the output casebase, with phase-specific semantics:
+- **Retrieval**: Similarity scores between cases and the query.
+- **Reuse**: Quality scores of adapted cases compared to the query.
+- **Revise**: Assessment scores evaluating solution correctness.
+- **Retain**: Fitness scores for retained cases.
+This uniform interface makes it easy to compose phases into pipelines and to swap implementations.
+The phase-specific type aliases `RetrieverFunc`, `ReuserFunc`, `ReviserFunc`, and `RetainerFunc` are provided for clarity but are structurally identical to `CbrFunc`.
+Each phase result has the following attributes:
+- `similarities`: A dictionary containing the scores for each case.
+- `ranking`: A list of case keys sorted by their score.
+- `casebase`: The casebase containing the output cases.
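The unified protocol described above can be pictured with plain Python. The sketch below is not CBRkit's implementation: the aliases `Casebase`, `ScoreMap`, and `PhaseFunc` and the function `toy_retriever` are hypothetical stand-ins that only mirror the stated shape (casebase and query in, casebase and score map out) and show how a ranking can be derived from the scores.

```python
from collections.abc import Callable, Mapping

# Hypothetical aliases mirroring the described shape, not cbrkit.typing itself.
Casebase = Mapping[str, dict]
ScoreMap = Mapping[str, float]
PhaseFunc = Callable[[Casebase, dict], tuple[Casebase, ScoreMap]]


def toy_retriever(casebase: Casebase, query: dict) -> tuple[Casebase, ScoreMap]:
    """Score each case by price closeness; the casebase is returned unchanged."""
    scores = {
        key: 1.0 - min(abs(case["price"] - query["price"]) / 10_000, 1.0)
        for key, case in casebase.items()
    }
    return casebase, scores


casebase = {"a": {"price": 10_000}, "b": {"price": 15_000}, "c": {"price": 30_000}}
query = {"price": 12_000}

phase: PhaseFunc = toy_retriever
out_casebase, similarities = phase(casebase, query)

# A ranking is the list of case keys sorted by descending score.
ranking = sorted(similarities, key=lambda key: similarities[key], reverse=True)
print(ranking)
```

Because every phase has the same shape in this picture, a reuse or revise step could be swapped in for `toy_retriever` without changing the calling code.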
 ## Retrieval
-The final step is to retrieve cases based on the loaded queries.
+The first phase is to retrieve cases based on the loaded queries.
 The `cbrkit.retrieval` module provides utility functions for this purpose.
 You first build a retrieval pipeline by specifying a global similarity function and optionally a limit for the number of retrieved cases.
@@ -439,50 +626,177 @@ An overview of all available adaptation functions can be found in the [module do
 ## Reuse
-The reuse phase applies adaptation functions to retrieved cases. The `cbrkit.reuse` module provides utility functions for this purpose. You first build a reuse pipeline by specifying a global adaptation function:
+The reuse phase applies adaptation functions to retrieved cases and scores the adapted results.
+The `cbrkit.reuse` module provides utility functions for this purpose.
+You build a reuse pipeline by specifying an adaptation function and a similarity function:
 ```python
 reuser = cbrkit.reuse.build(
-    cbrkit.adapt.attribute_value(...),
+    adaptation_func=cbrkit.adapt.attribute_value(...),
+    similarity_func=cbrkit.sim.attribute_value(...),
 )
 ```
-This reuser can then be applied to the retrieval result to adapt cases based on a query:
+This reuser can then be applied to a retrieval result to adapt cases based on a query:
 ```python
-result = cbrkit.reuse.apply_query(retrieval_result, query, reuser)
+result = cbrkit.reuse.apply_result(retrieval_result, reuser)
 ```
-Our result has the following attributes:
-- `adaptations`: A dictionary containing the adapted values for each case.
-- `ranking`: A list of case indices matching the retrieval result.
-- `casebase`: The casebase containing only the adapted cases.
+As with all CBR phases, the result contains `similarities` (quality scores of adapted cases), `ranking`, and `casebase` (containing the adapted cases).
 Multiple reuse pipelines can be combined by passing them as a list or tuple:
 ```python
 reuser1 = cbrkit.reuse.build(...)
 reuser2 = cbrkit.reuse.build(...)
-result = cbrkit.reuse.apply_query(retrieval_result, query, (reuser1, reuser2))
+result = cbrkit.reuse.apply_result(retrieval_result, (reuser1, reuser2))
 ```
 The result structure follows the same pattern as the retrieval results with `final_step` and `steps` attributes.
+## Revise
+The revise phase assesses the quality of solutions produced by the reuse phase and optionally repairs them.
+The `cbrkit.revise` module provides utility functions for this purpose.
+You build a revise pipeline by specifying an assessment function and an optional repair function:
+```python
+reviser = cbrkit.revise.build(
+    assess_func=cbrkit.sim.attribute_value(...),
+    repair_func=some_adaptation_func,  # optional
+)
+```
+The reviser can be applied to a reuse result:
+```python
+result = cbrkit.revise.apply_result(reuse_result, reviser)
+```
+When a `repair_func` is provided, solutions are repaired before assessment.
+The result contains `similarities` with quality assessment scores for each case.
+## Retain
+The retain phase decides whether and how to integrate new cases into the casebase.
+The `cbrkit.retain` module provides utility functions for this purpose.
+You build a retain pipeline by specifying an assessment function and a storage function:
+```python
+retainer = cbrkit.retain.build(
+    assess_func=cbrkit.sim.generic.equality(),
+    storage_func=cbrkit.retain.static(
+        key_func=lambda keys: max(keys, default=-1) + 1,
+        casebase=casebase,
+    ),
+)
+```
+CBRkit provides several built-in storage functions:
+- `static`: Generates keys from a fixed reference casebase to avoid collisions.
+- `indexable`: Keeps an `IndexableFunc`'s index in sync with the casebase.
+You can filter retained cases based on their assessment scores using the `dropout` wrapper:
+```python
+retainer = cbrkit.retain.dropout(
+    retainer_func=cbrkit.retain.build(...),
+    min_similarity=0.5,
+)
+```
+The retainer can be applied to a revise result:
+```python
+result = cbrkit.retain.apply_result(revise_result, retainer)
+```
+The result contains `similarities` with fitness scores and `casebase` with the updated cases.
+## Full CBR Cycle
+The `cbrkit.cycle` module orchestrates all four phases (retrieval, reuse, revise, retain) in a single call.
+This is useful when you want to run the complete CBR cycle without manually chaining the phases.
+```python
+result = cbrkit.cycle.apply_query(
+    casebase,
+    query,
+    retrievers=retriever,
+    reusers=reuser,
+    revisers=reviser,
+    retainers=retainer,
+)
+# Access results from each phase
+retrieval_result = result.retrieval
+reuse_result = result.reuse
+revise_result = result.revise
+retain_result = result.retain
+```
+For multiple queries, use `cbrkit.cycle.apply_queries` or `cbrkit.cycle.apply_batches`.
+## System
+The `cbrkit.system.System` class provides a high-level interface for composing all CBR phases into a single reusable object.
+It is especially useful for integrating CBRkit into applications where the casebase and phase functions are configured once and reused across multiple queries.
+```python
+system = cbrkit.system.System(
+    casebase=casebase,
+    retriever_factory=lambda config: retriever,
+    reuser_factory=lambda config: reuser,
+)
+# Run individual phases
+retrieval_result = system.retrieve(query)
+reuse_result = system.reuse(query)
+# Run the full cycle
+cycle_result = system.cycle(query)
+```
+The `System` class supports optional configuration parameters for each phase factory, allowing you to customize the behavior per query.
 ## Advanced Retrieval
 ### BM25 Retrieval
-CBRkit includes a BM25 retriever for text-based retrieval:
+CBRkit includes a BM25 retriever for sparse text-based retrieval (requires the `bm25` extra).
+The BM25 retriever delegates text tokenization to a `cbrkit.sim.embed.bm25` embedding function:
 ```python
-retriever = cbrkit.retrieval.bm25(
-    key="text_field",  # Field to search in
-    limit=10
+bm25_func = cbrkit.sim.embed.bm25(language="en")
+retriever = cbrkit.retrieval.dropout(
+    cbrkit.retrieval.bm25(conversion_func=bm25_func),
+    limit=10,
+)
+result = cbrkit.retrieval.apply_query(casebase, query, retriever)
+```
+### Embedding-Based Retrieval
+CBRkit supports embedding-based retrieval through vector similarity search.
+The `embed` retriever uses an embedding function with caching and a vector scorer:
+```python
+embed_func = cbrkit.sim.embed.cache(
+    func=cbrkit.sim.embed.sentence_transformers(model="all-MiniLM-L6-v2"),
+    path="embeddings.sqlite3",
+    table="strf/minilm",
+)
+retriever = cbrkit.retrieval.dropout(
+    cbrkit.retrieval.embed(conversion_func=embed_func),
+    limit=10,
 )
 result = cbrkit.retrieval.apply_query(casebase, query, retriever)
 ```
+For persistent storage backends, CBRkit also supports `lancedb`, `chromadb`, and `zvec` retrievers (each requires its respective extra).
+These backends manage index persistence and support hybrid search modes.
 ### Combining Multiple Retrievers
 The `combine` function allows merging results from multiple retrievers:
@@ -492,23 +806,78 @@ retriever1 = cbrkit.retrieval.build(...)
 retriever2 = cbrkit.retrieval.bm25(...)
 combined = cbrkit.retrieval.combine(
-    retrievers=[retriever1, retriever2],
-    aggregator=cbrkit.sim.aggregator(pooling="mean")
+    retriever_funcs=[retriever1, retriever2],
+    aggregator=cbrkit.sim.aggregator(pooling="mean"),
 )
 result = cbrkit.retrieval.apply_query(casebase, query, combined)
 ```
 ### Distributed Processing
-For large-scale retrieval, use the `distribute` wrapper:
+`build` and `distribute` offer two different levels of parallelism.
+`build(sim_func, multiprocessing=True)` parallelizes the **similarity computations** within batches: all (casebase, query) pairs are flattened into individual comparisons and distributed across processes.
+`distribute(retriever, multiprocessing=True)` parallelizes across **batches**: each (casebase, query) pair is passed to the wrapped retriever as a separate process.
+Use `distribute` when you have many independent queries and want to process them in parallel as separate retrieval tasks:
 ```python
 retriever = cbrkit.retrieval.distribute(
     cbrkit.retrieval.build(...),
-    batch_size=1000
+    multiprocessing=True,  # or an integer for a specific number of processes
 )
 ```
+### Re-ranking
+CBRkit supports re-ranking retrieved results using external models.
+Re-rankers take the initial retrieval results and reorder them based on a more expensive model.
+The following re-rankers are available (each requires its respective extra):
+- `cbrkit.retrieval.cohere`: Cohere re-ranking (extra: `cohere`)
+- `cbrkit.retrieval.voyageai`: Voyage AI re-ranking (extra: `voyageai`)
+- `cbrkit.retrieval.sentence_transformers`: Sentence Transformers cross-encoder re-ranking (extra: `transformers`)
+```python
+reranker = cbrkit.retrieval.cohere(model="rerank-v3.5")
+# Use as a second-stage retriever in a sequential pipeline
+retriever = cbrkit.retrieval.build(cbrkit.sim.attribute_value(...))
+result = cbrkit.retrieval.apply_query(casebase, query, (retriever, reranker))
+```
+### Indexed Retrieval
+Some retrievers like `bm25`, `embed`, and `lancedb` support **indexed retrieval**, where the casebase is pre-indexed once and then queried without passing the full casebase each time.
+This is useful for large casebases or when using external search backends.
+To use indexed retrieval, first create a retriever and call its `create_index()` method:
+```python
+from frozendict import frozendict
+bm25_func = cbrkit.sim.embed.bm25(language="en")
+retriever = cbrkit.retrieval.bm25(conversion_func=bm25_func)
+retriever.create_index(frozendict(casebase))
+```
+Then pass an empty casebase (`{}`) to signal that the retriever should use its pre-indexed data:
+```python
+result = cbrkit.retrieval.apply_query({}, query, retriever)
+```
+As a convenience, CBRkit provides `apply_query_indexed` and `apply_queries_indexed` which handle the empty casebase automatically:
+```python
+result = cbrkit.retrieval.apply_query_indexed(query, retriever)
+# or for multiple queries:
+result = cbrkit.retrieval.apply_queries_indexed(queries, retriever)
+```
+If a retriever receives an empty casebase but has not been indexed yet, a `ValueError` is raised with a message to call `index()` first.
+The `System` class also supports indexed retrieval by defaulting the casebase to an empty dict.
+This allows creating a system where all retrievers are pre-indexed and no casebase needs to be provided at query time.
 ## Evaluation
 CBRkit provides evaluation tools through the `cbrkit.eval` module for assessing the quality of retrieval results.
@@ -559,20 +928,26 @@ All of them can be computed at different cutoff points by appending `@k`, e.g.,
 We also offer a function to automatically generate a list of metrics for different cutoff points:
 ```python
-metrics = cbrkit.eval.metrics_at_k(["precision", "recall", "f1"], [1, 5, 10])
+metrics = cbrkit.eval.generate_metrics(["precision", "recall", "f1"], ks=[1, 5, 10])
 ```
 ## Synthesis
 In the context of CBRkit, synthesis refers to creating new insights from the cases that were retrieved in a previous retrieval step, for example in a RAG context. CBRkit builds a synthesizer using the function `cbrkit.synthesis.build` with a `provider` and a `prompt`. A synthesizer maps a `Result` (obtained in the retrieval step) to an LLM output, which can be either a string or structured data. An example can be found in [examples/cars_rag.py](https://github.com/wi2trier/cbrkit/blob/main/examples/cars_rag.py).
-The following **providers** are currently supported if a valid API key is stored the respective environment variable:
+The following **providers** are available in `cbrkit.synthesis.providers` (each requires its respective extra):
-- Anthropic (`ANTHROPIC_API_KEY`)
-- Cohere (`CO_API_KEY`)
-- Google (`GOOGLE_API_KEY`)
-- Ollama
-- OpenAI (`OPENAI_API_KEY`)
+- `openai` / `openai_completions`: OpenAI Completions API (`OPENAI_API_KEY`)
+- `openai_responses`: OpenAI Responses API (`OPENAI_API_KEY`)
+- `openai_agents`: OpenAI Agents framework (`OPENAI_API_KEY`)
+- `anthropic`: Anthropic Claude API (`ANTHROPIC_API_KEY`)
+- `cohere`: Cohere API (`CO_API_KEY`)
+- `google`: Google Generative AI (`GOOGLE_API_KEY`)
+- `ollama`: Ollama (local, no API key needed)
+- `pydantic_ai`: Pydantic AI framework
+- `instructor`: Instructor for structured output
+Providers can be chained using `cbrkit.synthesis.providers.pipe()` and managed as conversations using `cbrkit.synthesis.providers.conversation()`.
 The respective provider class in `cbrkit.synthesis.providers` has to be initialized with the model name and a response type (either `str` or a [Pydantic model](https://docs.pydantic.dev/latest/concepts/models/) for structured output). Further model options like `temperature`, `seed`, `max_tokens`, etc. can also be specified here.
@@ -603,16 +978,15 @@ CBRKit's `transpose` prompt allows to transpose cases and queries before they ar
 ```python
 from cbrkit.typing import JsonEntry
-from cbrkit.dumpers import json_markdown
 def encoder(value) -> dict:
     ...
 baseprompt = cbrkit.synthesis.prompts.default(instructions, encoder=encoder)
 # transform the entries, e.g., by shortening, leaving out irrelevant attributes, etc.
-# In this case, the value of every field is trunctated to 100 characters
+# In this case, the value of every field is truncated to 100 characters
 def shorten(entry: dict) -> JsonEntry:
-    entry = {k: str(v)[:100] for k,v in entry.items()}
-    return json_markdown(entry)
+    entry = {k: str(v)[:100] for k, v in entry.items()}
+    return cbrkit.dumpers.markdown()(entry)
 prompt = cbrkit.synthesis.prompts.transpose(baseprompt, shorten)
 synthesizer = cbrkit.synthesis.build(provider, prompt)
@@ -650,6 +1024,60 @@ response = get_result(batches)
 The complete version of this example can be found under `examples/cars_rag_large.py`.
+## Tips and Common Patterns
+### Parameter Naming Conventions
+CBRkit inspects function signatures to determine their behavior:
+- **Similarity functions** must use `x` (case) and `y` (query) as parameter names.
+- **Adaptation functions** must use `case` and `query` for pair functions, or `casebase` and `query` for map/reduce functions.
+- **Batch functions** accept a list of tuples instead of individual pairs: `f([(x1, y1), (x2, y2), ...])`.
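To make the naming conventions above concrete, here is a minimal sketch of how signature inspection can distinguish a pair function from a batch function using the standard library's `inspect` module. It illustrates the general mechanism only; `expects_pair` and `call_on_batches` are hypothetical helpers, not CBRkit functions.

```python
import inspect
from collections.abc import Callable, Sequence


def expects_pair(func: Callable) -> bool:
    """Return True if the function's parameters are exactly `x` and `y`."""
    return list(inspect.signature(func).parameters) == ["x", "y"]


def price_sim(x: float, y: float) -> float:
    """Pair function: compares one case value with one query value."""
    return 1.0 if x == y else 0.0


def batch_price_sim(batches: Sequence[tuple[float, float]]) -> list[float]:
    """Batch function: receives a list of (case, query) tuples."""
    return [price_sim(x, y) for x, y in batches]


def call_on_batches(func: Callable, batches: Sequence[tuple[float, float]]) -> list[float]:
    """Dispatch on the inspected signature so both styles can be used alike."""
    if expects_pair(func):
        return [func(x, y) for x, y in batches]
    return func(batches)


pairs = [(1.0, 1.0), (1.0, 2.0)]
pair_result = call_on_batches(price_sim, pairs)
batch_result = call_on_batches(batch_price_sim, pairs)
print(pair_result, batch_result)
```

This also shows why renaming `x` and `y` in a custom similarity function can change how it is called.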
+### Filtering with `dropout`
+The `dropout` wrapper is the standard way to add limits and thresholds to any retriever or retainer.
+It supports `limit` (maximum number of results), `min_similarity`, and `max_similarity`:
+```python
+retriever = cbrkit.retrieval.dropout(
+    cbrkit.retrieval.build(sim_func),
+    limit=10,
+    min_similarity=0.3,
+)
+```
+### Composing Multiple Phase Functions
+All CBR phases support sequential composition by passing a tuple of phase functions.
+Each step receives the output casebase of the previous step, enabling patterns like MAC/FAC:
+```python
+result = cbrkit.retrieval.apply_query(casebase, query, (cheap_retriever, expensive_retriever))
+```
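The MAC/FAC pattern mentioned above can be sketched without the library: a cheap first stage narrows the casebase, and a more expensive second stage scores only the survivors. All names here (`run_steps`, `cheap_step`, `expensive_step`) are hypothetical, and word-overlap measures stand in for real similarity functions.

```python
def cheap_step(casebase: dict[int, str], query: str) -> tuple[dict[int, str], dict[int, float]]:
    """First stage: keep only cases that share at least one word with the query."""
    query_words = set(query.split())
    scores = {
        key: float(len(query_words & set(text.split())))
        for key, text in casebase.items()
    }
    kept = {key: casebase[key] for key, score in scores.items() if score > 0}
    return kept, {key: scores[key] for key in kept}


def expensive_step(casebase: dict[int, str], query: str) -> tuple[dict[int, str], dict[int, float]]:
    """Second stage: Jaccard similarity on word sets, run on the reduced casebase."""
    query_words = set(query.split())
    scores = {}
    for key, text in casebase.items():
        words = set(text.split())
        scores[key] = len(words & query_words) / len(words | query_words)
    return casebase, scores


def run_steps(casebase, query, steps):
    """Feed the output casebase of each step into the next one."""
    scores: dict[int, float] = {}
    for step in steps:
        casebase, scores = step(casebase, query)
    return casebase, scores


cases = {1: "red sports car", 2: "blue family car", 3: "green bicycle"}
final_casebase, final_scores = run_steps(cases, "red car", (cheap_step, expensive_step))
best = max(final_scores, key=lambda key: final_scores[key])
print(best, final_scores)
```

The second stage never sees case 3, which is the point of the pattern: the costly measure runs on fewer candidates.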
+### Using `frozendict` for Immutable Casebases
+Several components (e.g., indexed retrieval, retain phase) benefit from immutable casebases.
+Use `frozendict` to prevent accidental mutations:
+```python
+from frozendict import frozendict
+casebase = frozendict(cbrkit.loaders.file("cases.json"))
+```
+### Multiprocessing Support
+The `cbrkit.retrieval.build` function supports multiprocessing to parallelize similarity computations within batches:
+```python
+retriever = cbrkit.retrieval.build(sim_func, multiprocessing=True)
+# or with a specific number of processes:
+retriever = cbrkit.retrieval.build(sim_func, multiprocessing=4)
+```
+To parallelize across batches instead, see [Distributed Processing](#distributed-processing).
 ## Logging
 CBRkit integrates with the `logging` module to provide a unified logging interface.
