PyPI - cbrkit - Versions diffs - 0.26.4__tar.gz → 0.27.0__tar.gz - Mend

cbrkit: 0.26.4 (tar.gz) → 0.27.0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (68)

{cbrkit-0.26.4 → cbrkit-0.27.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: cbrkit
-Version: 0.26.4
+Version: 0.27.0
 Summary: Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI
 Keywords: cbr,case-based reasoning,api,similarity,nlp,retrieval,cli,tool,library
 Author: Mirko Lenz
@@ -39,6 +39,7 @@ Requires-Dist: fastapi>=0.100,<1 ; extra == 'api'
 Requires-Dist: pydantic-settings>=2,<3 ; extra == 'api'
 Requires-Dist: python-multipart>=0.0.15,<1 ; extra == 'api'
 Requires-Dist: uvicorn[standard]>=0.30,<1 ; extra == 'api'
+Requires-Dist: chonkie>=1,<2 ; extra == 'chunking'
 Requires-Dist: typer>=0.9,<1 ; extra == 'cli'
 Requires-Dist: ranx>=0.3,<1 ; extra == 'eval'
 Requires-Dist: networkx>=3,<4 ; extra == 'graphs'
@@ -70,6 +71,7 @@ Project-URL: Issues, https://github.com/wi2trier/cbrkit/issues
 Project-URL: Repository, https://github.com/wi2trier/cbrkit
 Provides-Extra: all
 Provides-Extra: api
+Provides-Extra: chunking
 Provides-Extra: cli
 Provides-Extra: eval
 Provides-Extra: graphs
@@ -119,12 +121,25 @@ Further examples can be found in our [tests](./tests/test_retrieve.py) and [docu
 The following modules are part of CBRkit:
 - `cbrkit.loaders` and `cbrkit.dumpers`: Functions for loading and exporting cases and queries.
-- `cbrkit.sim`: Similarity generator functions for common data types like strings and numbers.
-- `cbrkit.retrieval`: Functions for defining and applying retrieval pipelines.
+- `cbrkit.sim`: Similarity functions for common data types and some utility functions such as `cache`, `combine`, `transpose`, etc.
+  - `cbrkit.sim.strings`: String similarity measures (Levenshtein, Jaro, semantic, etc.).
+  - `cbrkit.sim.numbers`: Numeric similarity measures (linear, exponential, threshold).
+  - `cbrkit.sim.collections`: Similarity measures for collections and sequences (Jaccard, DTW, Smith-Waterman).
+  - `cbrkit.sim.embed`: Embedding-based similarity functions with caching support.
+  - `cbrkit.sim.graphs`: Graph similarity algorithms including GED, A*, VF2, and more.
+  - `cbrkit.sim.taxonomy`: Taxonomy-based similarity functions.
+  - `cbrkit.sim.generic`: Generic similarity functions (equality, tables, static).
+  - `cbrkit.sim.attribute_value`: Similarity for attribute-value based data.
+  - `cbrkit.sim.pooling`: Functions for aggregating multiple similarity values.
+  - `cbrkit.sim.aggregator`: Combines multiple local measures into global scores.
+- `cbrkit.retrieval`: Functions for defining and applying retrieval pipelines, includes BM25 retrieval, rerankers, etc.
 - `cbrkit.adapt`: Adaptation generator functions for adapting cases based on a query.
 - `cbrkit.reuse`: Functions for defining and applying reuse pipelines.
+- `cbrkit.eval`: Evaluation metrics for retrieval results including precision, recall, and custom metrics.
+- `cbrkit.model`: Data models for graphs and results.
+- `cbrkit.cycle`: CBR cycle implementation.
 - `cbrkit.typing`: Generic type definitions for defining custom functions.
-- `cbrkit.synthesis`: Functions for working on a casebase with LLMs to create new insights, e.g. in a RAG context.
+- `cbrkit.synthesis`: Functions for working on a casebase with LLMs to create new insights, e.g., in a RAG context.
 ## Installation
@@ -235,18 +250,125 @@ You need to make sure that the two parameters are named `x` and `y`, otherwise C
 ### Built-in Similarity Measures
-CBRkit also contains a selection of built-in similarity measures for the most common data types in the module `cbrkit.sim`.
+CBRkit contains a comprehensive selection of built-in similarity measures for various data types in the module `cbrkit.sim`.
 They are provided through **generator functions** that allow you to customize the behavior of the built-in measures.
-For example, an spacy-based embedding similarity measure can be obtained as follows:
+#### String Similarity
+```python
+# Semantic similarity is covered by the `cbrkit.sim.embed` module.
+# See below for details.
+# Edit distance measures
+levenshtein_sim = cbrkit.sim.strings.levenshtein()
+jaro_sim = cbrkit.sim.strings.jaro()
+# Exact matching
+equality_sim = cbrkit.sim.generic.equality()
+```
+#### Number Similarity
+```python
+# Linear similarity with optional thresholds
+linear_sim = cbrkit.sim.numbers.linear(max_distance=100)
+# Exponential decay similarity
+exp_sim = cbrkit.sim.numbers.exponential(alpha=0.1)
+# Step functions
+threshold_sim = cbrkit.sim.numbers.threshold(threshold=50)
+```
+#### Embedding-Based Similarity
+```python
+# Build a similarity function with embedding and scorer
+embed_sim = cbrkit.sim.embed.build(
+    conversion_func=cbrkit.sim.embed.sentence_transformers(
+        model="all-MiniLM-L6-v2"
+    ),
+    sim_func=cbrkit.sim.embed.cosine()  # or dot(), angular(), euclidean(), manhattan()
+)
+# Using OpenAI embeddings
+openai_sim = cbrkit.sim.embed.build(
+    conversion_func=cbrkit.sim.embed.openai(
+        model="text-embedding-3-small"
+    ),
+    sim_func=cbrkit.sim.embed.cosine()
+)
+# Caching embeddings for performance
+cached_embed_func = cbrkit.sim.embed.cache(
+    func=cbrkit.sim.embed.sentence_transformers(
+        model="all-MiniLM-L6-v2"
+    ),
+    path="embeddings_cache.npz",
+    autodump=True,
+    autoload=True
+)
+cached_sim = cbrkit.sim.embed.build(
+    conversion_func=cached_embed_func,
+    sim_func=cbrkit.sim.embed.cosine()
+)
+```
+#### Taxonomy-Based Similarity
 ```python
-semantic_similarity = cbrkit.sim.strings.spacy(model="en_core_web_lg")
+# Load taxonomy from file
+taxonomy_sim = cbrkit.sim.taxonomy.build(
+    path="taxonomy.yaml",
+    measure=cbrkit.sim.taxonomy.wu_palmer(),
+)
 ```
-**Please note:** Calling the function `cbrkit.sim.strings.spacy` returns a similarity function itself that has the same signature as the `color_similarity` function defined above.
+#### Utility Functions
+```python
+# Combining multiple similarity functions
+combined_sim = cbrkit.sim.combine(
+    sim_funcs=[sim1, sim2, sim3],
+    aggregator=cbrkit.sim.aggregator(pooling="mean")
+)
+# Caching similarity results
+cached_sim = cbrkit.sim.cache(base_sim_func)
+# Transposing similarity functions
+transposed_sim = cbrkit.sim.transpose(
+    sim_func=number_sim,
+    to_x=lambda s: float(s),
+    to_y=lambda s: float(s)
+)
+```
+**Please note:** Calling these functions returns a similarity function itself that has the signature `sim = f(x, y)`.
 An overview of all available similarity measures can be found in the [module documentation](https://wi2trier.github.io/cbrkit/cbrkit/sim.html).
+### Graph Similarity
+CBRkit provides extensive support for graph similarity through various algorithms:
+```python
+# Using Graph Edit Distance (GED) with A* search
+graph_sim = cbrkit.sim.graphs.astar(
+    node_sim=cbrkit.sim.generic.equality(),
+    node_matcher=lambda n1, n2: n1 == n2,
+    edge_matcher=lambda e1, e2: e1 == e2
+)
+```
+Available graph algorithms include:
+- `astar`: A* search for optimal graph edit distance
+- `vf2`: VF2 algorithm for (sub)graph isomorphism
+- `lap`: Linear assignment problem solver
+- `greedy`: Fast greedy matching
+- `brute_force`: Exhaustive search for small graphs
+- `dfs`: Depth-first search based matching
 ### Global Similarity and Aggregation
 When dealing with cases that are not represented through elementary data types like strings, we need to aggregate individual measures to obtain a global similarity score.
@@ -377,9 +499,8 @@ They are provided through **generator functions** that allow you to customize th
 For example, a number aggregator can be obtained as follows:
 ```python
-# pooling must be a PoolingFunction or one of the provided PoolingNames
-pooling = "mean"
-number_adapter = cbrkit.adapt.numbers.aggregate(pooling)
+# pooling can be a string like "mean", "min", "max", "sum", etc. or a custom PoolingFunction
+number_adapter = cbrkit.adapt.numbers.aggregate(pooling="mean")
 ```
 **Please note:** Calling the function `cbrkit.adapt.numbers.aggregate` returns an adaptation function that takes a collection of values and returns an adapted value.
@@ -433,6 +554,46 @@ result = cbrkit.reuse.apply_query(retrieval_result, query, (reuser1, reuser2))
 The result structure follows the same pattern as the retrieval results with `final_step` and `steps` attributes.
+## Advanced Retrieval
+### BM25 Retrieval
+CBRkit includes a BM25 retriever for text-based retrieval:
+```python
+retriever = cbrkit.retrieval.bm25(
+    key="text_field",  # Field to search in
+    limit=10
+)
+result = cbrkit.retrieval.apply_query(casebase, query, retriever)
+```
+### Combining Multiple Retrievers
+The `combine` function allows merging results from multiple retrievers:
+```python
+retriever1 = cbrkit.retrieval.build(...)
+retriever2 = cbrkit.retrieval.bm25(...)
+combined = cbrkit.retrieval.combine(
+    retrievers=[retriever1, retriever2],
+    aggregator=cbrkit.sim.aggregator(pooling="mean")
+)
+result = cbrkit.retrieval.apply_query(casebase, query, combined)
+```
+### Distributed Processing
+For large-scale retrieval, use the `distribute` wrapper:
+```python
+retriever = cbrkit.retrieval.distribute(
+    cbrkit.retrieval.build(...),
+    batch_size=1000
+)
+```
 ## Evaluation
 CBRkit provides evaluation tools through the `cbrkit.eval` module for assessing the quality of retrieval results.
@@ -518,7 +679,8 @@ response = cbrkit.synthesis.apply_result(retrieval, synthesizer).response
 ### Working with large casebases
-Because the built-in `default` and `document_aware` prompt functions include the entire casebase as context, the LLM input can be quite long when working with a large casebase. Because of this, in this case, we recommend transposing the cases (e.g. truncate every case to a fixed length) and/or apply chunking.
+Because the built-in `default` and `document_aware` prompt functions include the entire casebase as context, the LLM input can be quite long when working with a large casebase.
+Because of this, in this case, we recommend transposing the cases (e.g., truncate every case to a fixed length) and/or apply chunking.
 #### Transposing cases
@@ -531,7 +693,7 @@ from cbrkit.dumpers import json_markdown
 def encoder(value) -> dict:
     ...
 baseprompt = cbrkit.synthesis.prompts.default(instructions, encoder=encoder)
-# transform the entries, e.g. by shortening, leaving out irrelevant attributes, etc.
+# transform the entries, e.g., by shortening, leaving out irrelevant attributes, etc.
 # In this case, the value of every field is trunctated to 100 characters
 def shorten(entry: dict) -> JsonEntry:
     entry = {k: str(v)[:100] for k,v in entry.items()}

{cbrkit-0.26.4 → cbrkit-0.27.0}/README.md RENAMED

@@ -37,12 +37,25 @@ Further examples can be found in our [tests](./tests/test_retrieve.py) and [docu
 The following modules are part of CBRkit:
 - `cbrkit.loaders` and `cbrkit.dumpers`: Functions for loading and exporting cases and queries.
-- `cbrkit.sim`: Similarity generator functions for common data types like strings and numbers.
-- `cbrkit.retrieval`: Functions for defining and applying retrieval pipelines.
+- `cbrkit.sim`: Similarity functions for common data types and some utility functions such as `cache`, `combine`, `transpose`, etc.
+  - `cbrkit.sim.strings`: String similarity measures (Levenshtein, Jaro, semantic, etc.).
+  - `cbrkit.sim.numbers`: Numeric similarity measures (linear, exponential, threshold).
+  - `cbrkit.sim.collections`: Similarity measures for collections and sequences (Jaccard, DTW, Smith-Waterman).
+  - `cbrkit.sim.embed`: Embedding-based similarity functions with caching support.
+  - `cbrkit.sim.graphs`: Graph similarity algorithms including GED, A*, VF2, and more.
+  - `cbrkit.sim.taxonomy`: Taxonomy-based similarity functions.
+  - `cbrkit.sim.generic`: Generic similarity functions (equality, tables, static).
+  - `cbrkit.sim.attribute_value`: Similarity for attribute-value based data.
+  - `cbrkit.sim.pooling`: Functions for aggregating multiple similarity values.
+  - `cbrkit.sim.aggregator`: Combines multiple local measures into global scores.
+- `cbrkit.retrieval`: Functions for defining and applying retrieval pipelines, includes BM25 retrieval, rerankers, etc.
 - `cbrkit.adapt`: Adaptation generator functions for adapting cases based on a query.
 - `cbrkit.reuse`: Functions for defining and applying reuse pipelines.
+- `cbrkit.eval`: Evaluation metrics for retrieval results including precision, recall, and custom metrics.
+- `cbrkit.model`: Data models for graphs and results.
+- `cbrkit.cycle`: CBR cycle implementation.
 - `cbrkit.typing`: Generic type definitions for defining custom functions.
-- `cbrkit.synthesis`: Functions for working on a casebase with LLMs to create new insights, e.g. in a RAG context.
+- `cbrkit.synthesis`: Functions for working on a casebase with LLMs to create new insights, e.g., in a RAG context.
 ## Installation
@@ -153,18 +166,125 @@ You need to make sure that the two parameters are named `x` and `y`, otherwise C
 ### Built-in Similarity Measures
-CBRkit also contains a selection of built-in similarity measures for the most common data types in the module `cbrkit.sim`.
+CBRkit contains a comprehensive selection of built-in similarity measures for various data types in the module `cbrkit.sim`.
 They are provided through **generator functions** that allow you to customize the behavior of the built-in measures.
-For example, an spacy-based embedding similarity measure can be obtained as follows:
+#### String Similarity
+```python
+# Semantic similarity is covered by the `cbrkit.sim.embed` module.
+# See below for details.
+# Edit distance measures
+levenshtein_sim = cbrkit.sim.strings.levenshtein()
+jaro_sim = cbrkit.sim.strings.jaro()
+# Exact matching
+equality_sim = cbrkit.sim.generic.equality()
+```
+#### Number Similarity
+```python
+# Linear similarity with optional thresholds
+linear_sim = cbrkit.sim.numbers.linear(max_distance=100)
+# Exponential decay similarity
+exp_sim = cbrkit.sim.numbers.exponential(alpha=0.1)
+# Step functions
+threshold_sim = cbrkit.sim.numbers.threshold(threshold=50)
+```
+#### Embedding-Based Similarity
+```python
+# Build a similarity function with embedding and scorer
+embed_sim = cbrkit.sim.embed.build(
+    conversion_func=cbrkit.sim.embed.sentence_transformers(
+        model="all-MiniLM-L6-v2"
+    ),
+    sim_func=cbrkit.sim.embed.cosine()  # or dot(), angular(), euclidean(), manhattan()
+)
+# Using OpenAI embeddings
+openai_sim = cbrkit.sim.embed.build(
+    conversion_func=cbrkit.sim.embed.openai(
+        model="text-embedding-3-small"
+    ),
+    sim_func=cbrkit.sim.embed.cosine()
+)
+# Caching embeddings for performance
+cached_embed_func = cbrkit.sim.embed.cache(
+    func=cbrkit.sim.embed.sentence_transformers(
+        model="all-MiniLM-L6-v2"
+    ),
+    path="embeddings_cache.npz",
+    autodump=True,
+    autoload=True
+)
+cached_sim = cbrkit.sim.embed.build(
+    conversion_func=cached_embed_func,
+    sim_func=cbrkit.sim.embed.cosine()
+)
+```
+#### Taxonomy-Based Similarity
 ```python
-semantic_similarity = cbrkit.sim.strings.spacy(model="en_core_web_lg")
+# Load taxonomy from file
+taxonomy_sim = cbrkit.sim.taxonomy.build(
+    path="taxonomy.yaml",
+    measure=cbrkit.sim.taxonomy.wu_palmer(),
+)
 ```
-**Please note:** Calling the function `cbrkit.sim.strings.spacy` returns a similarity function itself that has the same signature as the `color_similarity` function defined above.
+#### Utility Functions
+```python
+# Combining multiple similarity functions
+combined_sim = cbrkit.sim.combine(
+    sim_funcs=[sim1, sim2, sim3],
+    aggregator=cbrkit.sim.aggregator(pooling="mean")
+)
+# Caching similarity results
+cached_sim = cbrkit.sim.cache(base_sim_func)
+# Transposing similarity functions
+transposed_sim = cbrkit.sim.transpose(
+    sim_func=number_sim,
+    to_x=lambda s: float(s),
+    to_y=lambda s: float(s)
+)
+```
+**Please note:** Calling these functions returns a similarity function itself that has the signature `sim = f(x, y)`.
 An overview of all available similarity measures can be found in the [module documentation](https://wi2trier.github.io/cbrkit/cbrkit/sim.html).
+### Graph Similarity
+CBRkit provides extensive support for graph similarity through various algorithms:
+```python
+# Using Graph Edit Distance (GED) with A* search
+graph_sim = cbrkit.sim.graphs.astar(
+    node_sim=cbrkit.sim.generic.equality(),
+    node_matcher=lambda n1, n2: n1 == n2,
+    edge_matcher=lambda e1, e2: e1 == e2
+)
+```
+Available graph algorithms include:
+- `astar`: A* search for optimal graph edit distance
+- `vf2`: VF2 algorithm for (sub)graph isomorphism
+- `lap`: Linear assignment problem solver
+- `greedy`: Fast greedy matching
+- `brute_force`: Exhaustive search for small graphs
+- `dfs`: Depth-first search based matching
 ### Global Similarity and Aggregation
 When dealing with cases that are not represented through elementary data types like strings, we need to aggregate individual measures to obtain a global similarity score.
@@ -295,9 +415,8 @@ They are provided through **generator functions** that allow you to customize th
 For example, a number aggregator can be obtained as follows:
 ```python
-# pooling must be a PoolingFunction or one of the provided PoolingNames
-pooling = "mean"
-number_adapter = cbrkit.adapt.numbers.aggregate(pooling)
+# pooling can be a string like "mean", "min", "max", "sum", etc. or a custom PoolingFunction
+number_adapter = cbrkit.adapt.numbers.aggregate(pooling="mean")
 ```
 **Please note:** Calling the function `cbrkit.adapt.numbers.aggregate` returns an adaptation function that takes a collection of values and returns an adapted value.
@@ -351,6 +470,46 @@ result = cbrkit.reuse.apply_query(retrieval_result, query, (reuser1, reuser2))
 The result structure follows the same pattern as the retrieval results with `final_step` and `steps` attributes.
+## Advanced Retrieval
+### BM25 Retrieval
+CBRkit includes a BM25 retriever for text-based retrieval:
+```python
+retriever = cbrkit.retrieval.bm25(
+    key="text_field",  # Field to search in
+    limit=10
+)
+result = cbrkit.retrieval.apply_query(casebase, query, retriever)
+```
+### Combining Multiple Retrievers
+The `combine` function allows merging results from multiple retrievers:
+```python
+retriever1 = cbrkit.retrieval.build(...)
+retriever2 = cbrkit.retrieval.bm25(...)
+combined = cbrkit.retrieval.combine(
+    retrievers=[retriever1, retriever2],
+    aggregator=cbrkit.sim.aggregator(pooling="mean")
+)
+result = cbrkit.retrieval.apply_query(casebase, query, combined)
+```
+### Distributed Processing
+For large-scale retrieval, use the `distribute` wrapper:
+```python
+retriever = cbrkit.retrieval.distribute(
+    cbrkit.retrieval.build(...),
+    batch_size=1000
+)
+```
 ## Evaluation
 CBRkit provides evaluation tools through the `cbrkit.eval` module for assessing the quality of retrieval results.
@@ -436,7 +595,8 @@ response = cbrkit.synthesis.apply_result(retrieval, synthesizer).response
 ### Working with large casebases
-Because the built-in `default` and `document_aware` prompt functions include the entire casebase as context, the LLM input can be quite long when working with a large casebase. Because of this, in this case, we recommend transposing the cases (e.g. truncate every case to a fixed length) and/or apply chunking.
+Because the built-in `default` and `document_aware` prompt functions include the entire casebase as context, the LLM input can be quite long when working with a large casebase.
+Because of this, in this case, we recommend transposing the cases (e.g., truncate every case to a fixed length) and/or apply chunking.
 #### Transposing cases
@@ -449,7 +609,7 @@ from cbrkit.dumpers import json_markdown
 def encoder(value) -> dict:
     ...
 baseprompt = cbrkit.synthesis.prompts.default(instructions, encoder=encoder)
-# transform the entries, e.g. by shortening, leaving out irrelevant attributes, etc.
+# transform the entries, e.g., by shortening, leaving out irrelevant attributes, etc.
 # In this case, the value of every field is trunctated to 100 characters
 def shorten(entry: dict) -> JsonEntry:
     entry = {k: str(v)[:100] for k,v in entry.items()}

{cbrkit-0.26.4 → cbrkit-0.27.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "cbrkit"
-version = "0.26.4"
+version = "0.27.0"
 description = "Customizable Case-Based Reasoning (CBR) toolkit for Python with a built-in API and CLI"
 authors = [{ name = "Mirko Lenz", email = "mirko@mirkolenz.com" }]
 readme = "README.md"
@@ -58,6 +58,7 @@ api = [
     "python-multipart>=0.0.15,<1",
     "uvicorn[standard]>=0.30,<1",
 ]
+chunking = ["chonkie>=1,<2"]
 cli = ["typer>=0.9,<1"]
 eval = ["ranx>=0.3,<1"]
 graphs = ["networkx>=3,<4", "rustworkx>=0.15,<1"]

{cbrkit-0.26.4 → cbrkit-0.27.0}/src/cbrkit/eval/common.py RENAMED

@@ -244,7 +244,7 @@ def kendall_tau(
         qrel_relevant = {k for k, v in qrels[key].items() if v >= relevance_level}
         sorted_qrel_relevant = sorted(qrel_relevant, key=lambda x: qrels[key][x])
-        sorted_run = sorted(run.keys(), key=lambda x: run[key][x], reverse=True)
+        sorted_run = sorted(run[key].keys(), key=lambda x: run[key][x], reverse=True)
         run_k = sorted_run[: k if k > 0 else len(sorted_run)]
         max_idx = min(len(run_k), len(sorted_qrel_relevant))
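
Note on the hunk above: the one-line change in `kendall_tau` can be reproduced in isolation. The sketch below assumes, based only on the indexing visible in the hunk, that `run` maps each query id to a mapping of case ids to scores; the ids and scores are hypothetical.

```python
# Hypothetical run: query id -> {case id -> score}
run = {"q1": {"a": 0.2, "b": 0.9, "c": 0.5}}
key = "q1"

# 0.27.0 expression: rank the case ids of the current query by descending score.
sorted_run = sorted(run[key].keys(), key=lambda x: run[key][x], reverse=True)

# 0.26.4 expression: iterates over the query ids instead and then tries to
# look each query id up as if it were a case id of the current query.
try:
    sorted(run.keys(), key=lambda x: run[key][x], reverse=True)
    old_error = None
except KeyError as exc:
    old_error = exc

print(sorted_run)
print(repr(old_error))
```

With this data the old expression raises a `KeyError`, because `"q1"` is not a case id. If query ids and case ids happened to overlap, it would instead rank the wrong set of keys without raising.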

{cbrkit-0.26.4 → cbrkit-0.27.0}/src/cbrkit/eval/retrieval.py RENAMED

@@ -1,7 +1,7 @@
 from collections.abc import Sequence
 from typing import Any, Literal
-from ..helpers import round, scale, unpack_float
+from ..helpers import normalize_and_scale, round, unpack_float
 from ..retrieval import Result, ResultStep
 from ..typing import EvalMetricFunc, Float, QueryCaseMatrix
 from .common import DEFAULT_METRICS, compute
@@ -65,12 +65,10 @@ def retrieval_step_to_qrels[Q, C, S: Float](
         min_sim = 0.0
         max_sim = 1.0
-    qrel_factor = max_qrel - min_qrel
     return {
         query: {
             case: round(
-                scale(sim, min_sim, max_sim) * qrel_factor + min_qrel,
+                normalize_and_scale(sim, min_sim, max_sim, min_qrel, max_qrel),
                 round_mode,
             )
             for case, sim in entry.items()
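
Note on the hunk above: the replaced expression and its successor agree only when the similarity range is `[0, 1]`. The sketch below inlines both formulas, using the `scale`, `normalize`, and `normalize_and_scale` bodies that appear in the `helpers.py` part of this diff. The function names `old_mapping` and `new_mapping` and all input values are hypothetical, and it is an assumption that `retrieval_step_to_qrels` can be reached with a similarity range other than `[0, 1]`.

```python
def old_mapping(sim, min_sim, max_sim, min_qrel, max_qrel):
    # 0.26.4: scale(sim, min_sim, max_sim) * qrel_factor + min_qrel
    scaled = sim * (max_sim - min_sim) + min_sim
    return scaled * (max_qrel - min_qrel) + min_qrel


def new_mapping(sim, min_sim, max_sim, min_qrel, max_qrel):
    # 0.27.0: normalize sim to [0, 1], then scale to [min_qrel, max_qrel]
    if max_sim == min_sim:
        normalized = 0.0  # same guard as helpers.normalize
    else:
        normalized = (sim - min_sim) / (max_sim - min_sim)
    return normalized * (max_qrel - min_qrel) + min_qrel


# Similarities already in [0, 1]: both versions agree.
unit_old = old_mapping(0.25, 0.0, 1.0, 1, 5)
unit_new = new_mapping(0.25, 0.0, 1.0, 1, 5)

# Similarities in [2, 4]: only the new version stays inside [1, 5].
wide_old = old_mapping(3.0, 2.0, 4.0, 1, 5)
wide_new = new_mapping(3.0, 2.0, 4.0, 1, 5)

# Degenerate range (all similarities identical): maps to min_qrel.
degenerate = new_mapping(0.7, 0.7, 0.7, 1, 5)

print(unit_old, unit_new, wide_old, wide_new, degenerate)
```

The midpoint of the `[2, 4]` range lands on the midpoint of the qrel range only with the new formula, which suggests the old call to `scale` was applied in the wrong direction.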

{cbrkit-0.26.4 → cbrkit-0.27.0}/src/cbrkit/helpers.py RENAMED

@@ -71,6 +71,8 @@ __all__ = [
     "load_callables_map",
     "load_callables",
     "load_object",
+    "normalize",
+    "normalize_and_scale",
     "log_batch",
     "mp_count",
     "mp_map",
@@ -605,6 +607,30 @@ def scale(value: float, lower: float, upper: float) -> float:
     return value * (upper - lower) + lower
+def normalize(value: float, value_min: float, value_max: float) -> float:
+    """Normalize a value from [value_min, value_max] to [0, 1]."""
+    if value_max == value_min:
+        # Handle edge case where all values are identical
+        return 0.0
+    return (value - value_min) / (value_max - value_min)
+def normalize_and_scale(
+    value: float,
+    value_min: float,
+    value_max: float,
+    target_min: float,
+    target_max: float,
+) -> float:
+    """Normalize a value from [value_min, value_max] to [target_min, target_max]."""
+    # First normalize to [0, 1]
+    normalized = normalize(value, value_min, value_max)
+    # Then scale to target range
+    return scale(normalized, target_min, target_max)
 def load_object(import_name: str) -> Any:
     """Import an object based on a string.

{cbrkit-0.26.4 → cbrkit-0.27.0}/src/cbrkit/retrieval/__init__.py RENAMED

@@ -3,6 +3,9 @@ from ..model import QueryResultStep, Result, ResultStep
 from .apply import apply_batches, apply_queries, apply_query
 from .build import build, combine, distribute, dropout, transpose, transpose_value
+with optional_dependencies():
+    from .build import chunk
 with optional_dependencies():
     from .rerank import cohere
@@ -22,6 +25,7 @@ __all__ = [
     "dropout",
     "distribute",
     "combine",
+    "chunk",
     "apply_batches",
     "apply_queries",
     "apply_query",

{cbrkit-0.26.4 → cbrkit-0.27.0}/src/cbrkit/retrieval/build.py RENAMED

@@ -12,6 +12,7 @@ from ..helpers import (
     mp_count,
     mp_map,
     mp_starmap,
+    optional_dependencies,
     sim_map2ranking,
     unpack_float,
     use_mp,
@@ -315,3 +316,47 @@ class build[K, V, S: Float](RetrieverFunc[K, V, S]):
             similarities[idx][key] = sim
         return similarities
+with optional_dependencies():
+    from chonkie import BaseChunker
+    @dataclass(slots=True, frozen=True)
+    class chunk[S: Float](RetrieverFunc[str, str, S]):
+        """Chunks string cases using the chonkie library before retrieval.
+        This retriever is special in that it returns a different set of cases for each batch
+        it processes, as it splits the original string cases into chunks.
+        Args:
+            retriever_func: The retriever function to be used on the chunked strings.
+            chunker: A BaseChunker instance from the chonkie library.
+        Returns:
+            A retriever function that chunks string cases and retrieves from the chunks.
+        """
+        retriever_func: RetrieverFunc[str, str, S]
+        chunker: BaseChunker
+        @override
+        def __call__(
+            self, batches: Sequence[tuple[Casebase[str, str], str]]
+        ) -> Sequence[SimMap[str, S]]:
+            chunked_batches: list[tuple[Casebase[str, str], str]] = []
+            for casebase, query in batches:
+                chunked_casebase: dict[str, str] = {}
+                for case_key, case_text in casebase.items():
+                    chunks = self.chunker.chunk(case_text)
+                    for i, chunk in enumerate(chunks):
+                        chunk_key = f"{case_key}-chunk{i}"
+                        chunked_casebase[chunk_key] = (
+                            chunk if isinstance(chunk, str) else chunk.text
+                        )
+                chunked_batches.append((chunked_casebase, query))
+            return self.retriever_func(chunked_batches)
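
Note on the hunk above: the key scheme used by the new `chunk` retriever can be illustrated without installing `chonkie`. The sketch below is not the library's implementation; `fixed_width_chunks` is a hypothetical stand-in for a chonkie chunker, and `chunk_casebase` mirrors only the inner loop that builds `chunked_casebase`.

```python
def fixed_width_chunks(text: str, width: int) -> list[str]:
    """Hypothetical stand-in for a chunker: split text every `width` characters."""
    return [text[i : i + width] for i in range(0, len(text), width)]


def chunk_casebase(casebase: dict[str, str], width: int) -> dict[str, str]:
    """Expand each string case into several entries keyed '<case_key>-chunk<i>'."""
    chunked: dict[str, str] = {}
    for case_key, case_text in casebase.items():
        for i, piece in enumerate(fixed_width_chunks(case_text, width)):
            chunked[f"{case_key}-chunk{i}"] = piece
    return chunked


casebase = {"doc1": "abcdefgh", "doc2": "xyz"}
chunked = chunk_casebase(casebase, width=4)
print(chunked)
```

Because the wrapped retriever only sees the chunked casebase, the similarity maps it returns are keyed by chunk keys such as `doc1-chunk0` rather than by the original case keys, which matches the docstring's remark that the retriever returns a different set of cases.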

{cbrkit-0.26.4 → cbrkit-0.27.0}/src/cbrkit/sim/__init__.py RENAMED

@@ -9,8 +9,9 @@ there is also a measure for attribute-value data.
 Additionally, the module contains an aggregator to combine multiple local measures into a global score.
 """
-from . import collections, embed, generic, graphs, numbers, strings, taxonomy
-from .aggregator import PoolingName, aggregator
+from . import collections, embed, generic, graphs, numbers, pooling, strings, taxonomy
+from .aggregator import aggregator
+from .pooling import PoolingName
 from .attribute_value import AttributeValueSim, attribute_value
 from .wrappers import (
     attribute_table,
@@ -40,6 +41,7 @@ __all__ = [
     "graphs",
     "embed",
     "taxonomy",
+    "pooling",
     "aggregator",
     "PoolingName",
     "AttributeValueSim",
