scorebook 0.0.14__tar.gz → 0.0.16__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (113)
  1. {scorebook-0.0.14 → scorebook-0.0.16}/PKG-INFO +32 -24
  2. {scorebook-0.0.14 → scorebook-0.0.16}/README.md +13 -3
  3. {scorebook-0.0.14 → scorebook-0.0.16}/pyproject.toml +20 -17
  4. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/__init__.py +2 -0
  5. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/dashboard/credentials.py +34 -4
  6. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/eval_datasets/eval_dataset.py +2 -2
  7. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/_async/evaluate_async.py +27 -11
  8. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/_sync/evaluate.py +27 -11
  9. scorebook-0.0.16/src/scorebook/metrics/README.md +121 -0
  10. scorebook-0.0.16/src/scorebook/metrics/__init__.py +9 -0
  11. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/metrics/accuracy.py +2 -6
  12. scorebook-0.0.16/src/scorebook/metrics/bertscore.py +50 -0
  13. scorebook-0.0.16/src/scorebook/metrics/bleu.py +82 -0
  14. scorebook-0.0.16/src/scorebook/metrics/core/__init__.py +1 -0
  15. {scorebook-0.0.14/src/scorebook/metrics → scorebook-0.0.16/src/scorebook/metrics/core}/metric_base.py +1 -2
  16. scorebook-0.0.16/src/scorebook/metrics/core/metric_registry.py +195 -0
  17. scorebook-0.0.16/src/scorebook/metrics/exactmatch.py +95 -0
  18. scorebook-0.0.16/src/scorebook/metrics/f1.py +96 -0
  19. scorebook-0.0.16/src/scorebook/metrics/precision.py +94 -0
  20. scorebook-0.0.16/src/scorebook/metrics/recall.py +94 -0
  21. scorebook-0.0.16/src/scorebook/metrics/rouge.py +85 -0
  22. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/score/score_helpers.py +28 -11
  23. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/types.py +2 -2
  24. scorebook-0.0.16/src/scorebook/utils/progress_bars.py +128 -0
  25. scorebook-0.0.16/tutorials/README.md +147 -0
  26. scorebook-0.0.16/tutorials/__init__.py +5 -0
  27. scorebook-0.0.16/tutorials/examples/1-score/1-scoring_model_accuracy.py +47 -0
  28. scorebook-0.0.16/tutorials/examples/1-score/2-scoring_model_bleu.py +46 -0
  29. scorebook-0.0.16/tutorials/examples/1-score/3-scoring_model_f1.py +64 -0
  30. scorebook-0.0.16/tutorials/examples/1-score/4-scoring_model_rouge.py +64 -0
  31. scorebook-0.0.16/tutorials/examples/1-score/5-scoring_model_exact_match.py +84 -0
  32. scorebook-0.0.16/tutorials/examples/1-score/6-scoring_with_bertscore.py +57 -0
  33. scorebook-0.0.16/tutorials/examples/1-score/__init__.py +0 -0
  34. scorebook-0.0.16/tutorials/examples/2-evaluate/1-evaluating_local_models.py +106 -0
  35. scorebook-0.0.16/tutorials/examples/2-evaluate/2-evaluating_local_models_with_batching.py +108 -0
  36. scorebook-0.0.16/tutorials/examples/2-evaluate/3-evaluating_cloud_models.py +109 -0
  37. scorebook-0.0.16/tutorials/examples/2-evaluate/4-evaluating_cloud_models_with_batching.py +170 -0
  38. scorebook-0.0.16/tutorials/examples/2-evaluate/5-hyperparameter_sweeps.py +122 -0
  39. scorebook-0.0.16/tutorials/examples/2-evaluate/6-inference_pipelines.py +141 -0
  40. scorebook-0.0.16/tutorials/examples/3-evaluation_datasets/1-evaluation_datasets_from_files.py +110 -0
  41. scorebook-0.0.16/tutorials/examples/3-evaluation_datasets/2-evaluation_datasets_from_huggingface.py +101 -0
  42. scorebook-0.0.16/tutorials/examples/3-evaluation_datasets/3-evaluation_datasets_from_huggingface_with_yaml_configs.py +110 -0
  43. scorebook-0.0.16/tutorials/examples/3-evaluation_datasets/example_datasets/basic_questions.csv +11 -0
  44. scorebook-0.0.16/tutorials/examples/3-evaluation_datasets/example_datasets/basic_questions.json +42 -0
  45. scorebook-0.0.16/tutorials/examples/3-evaluation_datasets/example_yaml_configs/Cais-MMLU.yaml +19 -0
  46. scorebook-0.0.16/tutorials/examples/3-evaluation_datasets/example_yaml_configs/TIGER-Lab-MMLU-Pro.yaml +18 -0
  47. scorebook-0.0.16/tutorials/examples/4-adaptive_evaluations/1-adaptive_evaluation.py +114 -0
  48. scorebook-0.0.16/tutorials/examples/4-adaptive_evaluations/2-adaptive_dataset_splits.py +106 -0
  49. scorebook-0.0.16/tutorials/examples/5-upload_results/1-uploading_score_results.py +92 -0
  50. scorebook-0.0.16/tutorials/examples/5-upload_results/2-uploading_evaluate_results.py +117 -0
  51. scorebook-0.0.16/tutorials/examples/5-upload_results/3-uploading_your_results.py +153 -0
  52. scorebook-0.0.16/tutorials/examples/6-providers/aws/__init__.py +1 -0
  53. scorebook-0.0.16/tutorials/examples/6-providers/aws/batch_example.py +219 -0
  54. scorebook-0.0.16/tutorials/examples/6-providers/portkey/__init__.py +1 -0
  55. scorebook-0.0.16/tutorials/examples/6-providers/portkey/batch_example.py +120 -0
  56. scorebook-0.0.16/tutorials/examples/6-providers/portkey/messages_example.py +121 -0
  57. scorebook-0.0.16/tutorials/examples/6-providers/vertex/__init__.py +1 -0
  58. scorebook-0.0.16/tutorials/examples/6-providers/vertex/batch_example.py +166 -0
  59. scorebook-0.0.16/tutorials/examples/6-providers/vertex/messages_example.py +142 -0
  60. scorebook-0.0.16/tutorials/examples/__init__.py +0 -0
  61. scorebook-0.0.16/tutorials/notebooks/1-scoring.ipynb +162 -0
  62. scorebook-0.0.16/tutorials/notebooks/2-evaluating.ipynb +316 -0
  63. scorebook-0.0.16/tutorials/notebooks/3.1-adaptive_evaluation_phi.ipynb +354 -0
  64. scorebook-0.0.16/tutorials/notebooks/3.2-adaptive_evaluation_gpt.ipynb +243 -0
  65. scorebook-0.0.16/tutorials/notebooks/4-uploading_results.ipynb +175 -0
  66. scorebook-0.0.16/tutorials/quickstarts/adaptive_evaluations/adaptive_evaluation_openai_demo.ipynb +229 -0
  67. scorebook-0.0.16/tutorials/quickstarts/adaptive_evaluations/adaptive_evaluation_qwen_demo.ipynb +256 -0
  68. scorebook-0.0.16/tutorials/quickstarts/classical_evaluations/classical_evaluation_demo.ipynb +277 -0
  69. scorebook-0.0.16/tutorials/quickstarts/getting_started.ipynb +197 -0
  70. scorebook-0.0.16/tutorials/utils/__init__.py +35 -0
  71. scorebook-0.0.16/tutorials/utils/args_parser.py +132 -0
  72. scorebook-0.0.16/tutorials/utils/output.py +23 -0
  73. scorebook-0.0.16/tutorials/utils/setup.py +98 -0
  74. scorebook-0.0.14/src/scorebook/metrics/__init__.py +0 -1
  75. scorebook-0.0.14/src/scorebook/metrics/metric_registry.py +0 -107
  76. scorebook-0.0.14/src/scorebook/metrics/precision.py +0 -19
  77. scorebook-0.0.14/src/scorebook/utils/progress_bars.py +0 -856
  78. {scorebook-0.0.14 → scorebook-0.0.16}/LICENSE +0 -0
  79. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/cli/__init__.py +0 -0
  80. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/cli/auth.py +0 -0
  81. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/cli/main.py +0 -0
  82. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/dashboard/__init__.py +0 -0
  83. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/dashboard/create_project.py +0 -0
  84. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/dashboard/upload_results.py +0 -0
  85. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/eval_datasets/__init__.py +0 -0
  86. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/__init__.py +0 -0
  87. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/_async/__init__.py +0 -0
  88. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/_sync/__init__.py +0 -0
  89. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/evaluate_helpers.py +0 -0
  90. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/exceptions.py +0 -0
  91. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/inference/__init__.py +0 -0
  92. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/inference/clients/__init__.py +0 -0
  93. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/inference/clients/bedrock.py +0 -0
  94. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/inference/clients/openai.py +0 -0
  95. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/inference/clients/portkey.py +0 -0
  96. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/inference/clients/vertex.py +0 -0
  97. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/inference/inference_pipeline.py +0 -0
  98. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/score/__init__.py +0 -0
  99. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/score/_async/__init__.py +0 -0
  100. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/score/_async/score_async.py +0 -0
  101. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/score/_sync/__init__.py +0 -0
  102. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/score/_sync/score.py +0 -0
  103. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/settings.py +0 -0
  104. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/__init__.py +0 -0
  105. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/async_utils.py +0 -0
  106. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/common_helpers.py +0 -0
  107. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/io_helpers.py +0 -0
  108. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/jinja_helpers.py +0 -0
  109. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/mappers.py +0 -0
  110. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/mock_llm/__init__.py +0 -0
  111. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/mock_llm/data/mock_llm_data.json +0 -0
  112. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/render_template.py +0 -0
  113. {scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/utils/transform_helpers.py +0 -0
{scorebook-0.0.14 → scorebook-0.0.16}/PKG-INFO

@@ -1,43 +1,41 @@
  Metadata-Version: 2.4
  Name: scorebook
- Version: 0.0.14
+ Version: 0.0.16
  Summary: A Python project for LLM evaluation.
  License-File: LICENSE
  Author: Euan Campbell
  Author-email: euan@trismik.com
- Requires-Python: >=3.9, <3.14
+ Requires-Python: >=3.10
  Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.9
  Classifier: Programming Language :: Python :: 3.10
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Programming Language :: Python :: 3.13
- Provides-Extra: bedrock
  Provides-Extra: examples
- Provides-Extra: openai
- Provides-Extra: portkey
- Provides-Extra: vertex
+ Provides-Extra: metrics
+ Provides-Extra: providers
  Requires-Dist: accelerate ; extra == "examples"
- Requires-Dist: boto3 (==1.40.0) ; extra == "bedrock"
+ Requires-Dist: bert-score ; extra == "metrics"
+ Requires-Dist: boto3 (==1.40.0) ; extra == "providers"
  Requires-Dist: datasets (>=3.6.0)
- Requires-Dist: fsspec[gcs] ; extra == "vertex"
- Requires-Dist: google-cloud-storage ; extra == "vertex"
- Requires-Dist: google-genai ; extra == "vertex"
- Requires-Dist: ipywidgets (>=8.0.0)
- Requires-Dist: notebook (>=7.4.5,<8.0.0)
+ Requires-Dist: fsspec[gcs] ; extra == "providers"
+ Requires-Dist: google-cloud-storage ; extra == "providers"
+ Requires-Dist: google-genai ; extra == "providers"
+ Requires-Dist: ipywidgets ; extra == "examples"
+ Requires-Dist: jinja2 (>=3.1.6,<4.0.0)
  Requires-Dist: notebook ; extra == "examples"
- Requires-Dist: openai ; extra == "openai"
- Requires-Dist: pandas ; extra == "vertex"
- Requires-Dist: portkey-ai ; extra == "portkey"
- Requires-Dist: python-dotenv ; extra == "bedrock"
- Requires-Dist: python-dotenv ; extra == "openai"
- Requires-Dist: python-dotenv ; extra == "portkey"
- Requires-Dist: python-dotenv ; extra == "vertex"
+ Requires-Dist: openai ; extra == "providers"
+ Requires-Dist: pandas ; extra == "providers"
+ Requires-Dist: portkey-ai ; extra == "providers"
+ Requires-Dist: python-dotenv (>=1.0.0)
+ Requires-Dist: rouge-score ; extra == "metrics"
+ Requires-Dist: sacrebleu ; extra == "metrics"
+ Requires-Dist: scikit-learn (>=1.0.0) ; extra == "metrics"
  Requires-Dist: torch ; extra == "examples"
  Requires-Dist: torchaudio ; extra == "examples"
  Requires-Dist: torchvision ; extra == "examples"
  Requires-Dist: transformers ; extra == "examples"
- Requires-Dist: trismik (==1.0.2)
+ Requires-Dist: trismik (>=1.0.3)
  Description-Content-Type: text/markdown

  <h1 align="center">Scorebook</h1>
@@ -51,6 +49,9 @@ Description-Content-Type: text/markdown
  <img alt="Documentation" src="https://img.shields.io/badge/docs-Scorebook-blue?style=flat">
  </a>
  <img alt="License" src="https://img.shields.io/badge/license-MIT-green">
+ <a target="_blank" href="https://colab.research.google.com/github/trismik/scorebook/blob/main/tutorials/quickstarts/getting_started.ipynb">
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+ </a>
  </p>

  Scorebook provides a flexible and extensible framework for evaluating models such as large language models (LLMs). Easily evaluate any model using evaluation datasets from Hugging Face such as MMLU-Pro, HellaSwag, and CommonSenseQA, or with data from any other source. Evaluations calculate scores for any number of specified metrics such as accuracy, precision, and recall, as well as any custom defined metrics, including LLM as a judge (LLMaJ).
@@ -251,9 +252,16 @@ results = evaluate(

  ## Metrics

- | Metric | Sync/Async | Aggregate Scores | Item Scores |
- |------------|------------|--------------------------------------------------|-----------------------------------------|
- | `Accuracy` | Sync | `Float`: Percentage of correct outputs | `Boolean`: Exact match between output and label |
+ | Metric | Sync/Async | Aggregate Scores | Item Scores |
+ |--------------|------------|--------------------------------------------------|-----------------------------------------|
+ | `Accuracy` | Sync | `Float`: Percentage of correct outputs | `Boolean`: Exact match between output and label |
+ | `ExactMatch` | Sync | `Float`: Percentage of exact string matches | `Boolean`: Exact match with optional case/whitespace normalization |
+ | `F1` | Sync | `Dict[str, Float]`: F1 scores per averaging method (macro, micro, weighted) | `Boolean`: Exact match between output and label |
+ | `Precision` | Sync | `Dict[str, Float]`: Precision scores per averaging method (macro, micro, weighted) | `Boolean`: Exact match between output and label |
+ | `Recall` | Sync | `Dict[str, Float]`: Recall scores per averaging method (macro, micro, weighted) | `Boolean`: Exact match between output and label |
+ | `BLEU` | Sync | `Float`: Corpus-level BLEU score | `Float`: Sentence-level BLEU score |
+ | `ROUGE` | Sync | `Dict[str, Float]`: Average F1 scores per ROUGE type | `Dict[str, Float]`: F1 scores per ROUGE type |
+ | `BertScore` | Sync | `Dict[str, Float]`: Average precision, recall, and F1 scores | `Dict[str, Float]`: Precision, recall, and F1 scores per item |


  ## Tutorials
{scorebook-0.0.14 → scorebook-0.0.16}/README.md

@@ -9,6 +9,9 @@
  <img alt="Documentation" src="https://img.shields.io/badge/docs-Scorebook-blue?style=flat">
  </a>
  <img alt="License" src="https://img.shields.io/badge/license-MIT-green">
+ <a target="_blank" href="https://colab.research.google.com/github/trismik/scorebook/blob/main/tutorials/quickstarts/getting_started.ipynb">
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+ </a>
  </p>

  Scorebook provides a flexible and extensible framework for evaluating models such as large language models (LLMs). Easily evaluate any model using evaluation datasets from Hugging Face such as MMLU-Pro, HellaSwag, and CommonSenseQA, or with data from any other source. Evaluations calculate scores for any number of specified metrics such as accuracy, precision, and recall, as well as any custom defined metrics, including LLM as a judge (LLMaJ).
@@ -209,9 +212,16 @@ results = evaluate(

  ## Metrics

- | Metric | Sync/Async | Aggregate Scores | Item Scores |
- |------------|------------|--------------------------------------------------|-----------------------------------------|
- | `Accuracy` | Sync | `Float`: Percentage of correct outputs | `Boolean`: Exact match between output and label |
+ | Metric | Sync/Async | Aggregate Scores | Item Scores |
+ |--------------|------------|--------------------------------------------------|-----------------------------------------|
+ | `Accuracy` | Sync | `Float`: Percentage of correct outputs | `Boolean`: Exact match between output and label |
+ | `ExactMatch` | Sync | `Float`: Percentage of exact string matches | `Boolean`: Exact match with optional case/whitespace normalization |
+ | `F1` | Sync | `Dict[str, Float]`: F1 scores per averaging method (macro, micro, weighted) | `Boolean`: Exact match between output and label |
+ | `Precision` | Sync | `Dict[str, Float]`: Precision scores per averaging method (macro, micro, weighted) | `Boolean`: Exact match between output and label |
+ | `Recall` | Sync | `Dict[str, Float]`: Recall scores per averaging method (macro, micro, weighted) | `Boolean`: Exact match between output and label |
+ | `BLEU` | Sync | `Float`: Corpus-level BLEU score | `Float`: Sentence-level BLEU score |
+ | `ROUGE` | Sync | `Dict[str, Float]`: Average F1 scores per ROUGE type | `Dict[str, Float]`: F1 scores per ROUGE type |
+ | `BertScore` | Sync | `Dict[str, Float]`: Average precision, recall, and F1 scores | `Dict[str, Float]`: Precision, recall, and F1 scores per item |


  ## Tutorials
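The expanded table documents a common contract: every metric returns an aggregate score plus a list of per-item scores. Below is a minimal sketch of exercising that contract directly, assuming the static `score()` interface described in the new `src/scorebook/metrics/README.md` further down this diff; the sample outputs, labels, and printed values are illustrative only.

```python
# Illustrative only: calling a metric's score() directly, assuming the
# (aggregate_scores, item_scores) contract documented in the metrics README.
from scorebook.metrics.accuracy import Accuracy

outputs = ["B", "C", "A"]  # hypothetical model outputs
labels = ["B", "C", "D"]   # hypothetical ground-truth labels

aggregate, item_scores = Accuracy.score(outputs, labels)
print(aggregate)    # expected shape: {"accuracy": 0.67} (approximate value)
print(item_scores)  # expected shape: [True, True, False]
```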
{scorebook-0.0.14 → scorebook-0.0.16}/pyproject.toml

@@ -7,23 +7,26 @@ authors = [
      { name = "Marco Basaldella", email = "marco@trismik.com" }
  ]
  readme = "README.md"
- requires-python = ">=3.9, <3.14"
+ requires-python = ">=3.10"
  dependencies = [
      "datasets>=3.6.0",
-     "notebook (>=7.4.5,<8.0.0)",
-     "trismik==1.0.2",
-     "ipywidgets>=8.0.0",
+     "trismik>=1.0.3",
+     "python-dotenv>=1.0.0",
+     "jinja2 (>=3.1.6,<4.0.0)",
  ]

  [project.scripts]
  scorebook = "scorebook.cli.main:main"

  [tool.poetry]
- version = "0.0.14" # base version
- packages = [{ include = "scorebook", from = "src" }]
+ version = "0.0.16" # base version
+ packages = [
+     { include = "scorebook", from = "src" },
+     { include = "tutorials" }
+ ]

  [tool.poetry.dependencies]
- python = ">=3.9,<3.14"
+ python = ">=3.10,<3.14"

  [[tool.poetry.source]]
  name = "testpypi"
@@ -42,17 +45,15 @@ mypy = "^1.15.0"
  autoflake = "^2.3.1"
  toml = "^0.10.2"
  types-pyyaml = "^6.0.12.20250822"
- unasync = {version = "^0.5.0", python = ">=3.9,<4"}
+ unasync = {version = "^0.5.0", python = ">=3.10,<4"}
  tomlkit = "^0.13.2"
  detect-secrets = "^1.5.0"
+ setuptools = "^75.0.0"

  [project.optional-dependencies]
- openai = ["openai", "python-dotenv"]
- portkey = ["portkey-ai", "python-dotenv"]
- bedrock = ["boto3==1.40.0", "python-dotenv"]
- vertex = ["google-genai", "pandas", "google-cloud-storage", "fsspec[gcs]", "python-dotenv"]
- examples = ["transformers", "torch", "torchvision", "torchaudio", "accelerate", "notebook"]
-
+ providers = ["openai", "portkey-ai", "boto3==1.40.0", "google-genai", "pandas", "google-cloud-storage", "fsspec[gcs]"]
+ examples = ["transformers", "torch", "torchvision", "torchaudio", "accelerate", "notebook", "ipywidgets"]
+ metrics = ["sacrebleu", "rouge-score", "scikit-learn>=1.0.0", "bert-score"]

  [build-system]
  requires = ["poetry-core"]
@@ -60,14 +61,16 @@ build-backend = "poetry.core.masonry.api"

  [tool.pytest.ini_options]
  asyncio_default_fixture_loop_scope = "class"
+ testpaths = ["tests/unit"]
  markers = [
-     "unit: Unit tests that use mocks and don't require external dependencies",
+     "unit: Unit tests using only core dependencies (no optional packages)",
+     "metrics: Tests requiring metrics extras (sklearn, sacrebleu, rouge-score, bert-score)",
      "integration: Integration tests that may require network access or external services",
  ]

  [tool.black]
  line-length = 100
- target-version = ['py39']
+ target-version = ['py310']
  include = '\.pyi?$'

  [tool.isort]
@@ -76,7 +79,7 @@ line_length = 100
  multi_line_output = 3

  [tool.mypy]
- python_version = "3.9"
+ python_version = "3.10"
  warn_return_any = true
  warn_unused_configs = true
  disallow_untyped_defs = true
{scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/__init__.py

@@ -16,6 +16,7 @@ from scorebook.eval_datasets.eval_dataset import EvalDataset
  from scorebook.evaluate._async.evaluate_async import evaluate_async
  from scorebook.evaluate._sync.evaluate import evaluate
  from scorebook.inference.inference_pipeline import InferencePipeline
+ from scorebook.metrics.core.metric_registry import scorebook_metric
  from scorebook.score._async.score_async import score_async
  from scorebook.score._sync.score import score
  from scorebook.utils.render_template import render_template
@@ -35,4 +36,5 @@ __all__ = [
      "create_project_async",
      "upload_result",
      "upload_result_async",
+     "scorebook_metric",
  ]
{scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/dashboard/credentials.py

@@ -3,8 +3,10 @@
  import logging
  import os
  import pathlib
+ import warnings
  from typing import Optional

+ from dotenv import load_dotenv
  from trismik import TrismikClient

  from scorebook.settings import TRISMIK_SERVICE_URL
@@ -92,16 +94,44 @@ def validate_token(token: str) -> bool:
      return False


- def login(trismik_api_key: str) -> None:
+ def login(trismik_api_key: Optional[str] = None) -> None:
      """Login to trismik by saving API key locally.

+     If no API key is provided, the function will attempt to read it from the
+     TRISMIK_API_KEY environment variable or .env file (using python-dotenv).
+     Environment variables take precedence over .env file values.
+
      Args:
-         trismik_api_key: The API key to use.
+         trismik_api_key: The API key to use. If not provided, reads from
+             environment or .env file.
      Raises:
-         ValueError: If API key is empty or invalid.
+         ValueError: If API key is empty, not found, or invalid.
+
+     Warns:
+         UserWarning: If an explicit API key is passed but TRISMIK_API_KEY
+             environment variable is also set.
      """
+     # Warn if user passes explicit key but env var is also set
+     if trismik_api_key is not None and os.environ.get("TRISMIK_API_KEY"):
+         warnings.warn(
+             "TRISMIK_API_KEY environment variable is set. The environment variable "
+             "takes precedence over the stored token when calling evaluate(). "
+             "To use the explicitly provided key, unset the TRISMIK_API_KEY "
+             "environment variable.",
+             UserWarning,
+             stacklevel=2,
+         )
+
+     if trismik_api_key is None:
+         # Load from .env file if TRISMIK_API_KEY is not already set in environment
+         load_dotenv()
+         trismik_api_key = os.environ.get("TRISMIK_API_KEY")
+
      if not trismik_api_key:
-         raise ValueError("API key cannot be empty")
+         raise ValueError(
+             "API key cannot be empty. Either pass it as a parameter or "
+             "set the TRISMIK_API_KEY environment variable or .env file."
+         )

      # Validate token
      if not validate_token(trismik_api_key):
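A short sketch of the two login flows this docstring describes, assuming `login` is imported from `scorebook.dashboard.credentials`; the key strings are placeholders.

```python
# Sketch of the new optional-key login() behaviour (placeholder key values).
import os

from scorebook.dashboard.credentials import login

# No argument: login() falls back to TRISMIK_API_KEY from the environment
# or a local .env file (loaded via python-dotenv).
os.environ["TRISMIK_API_KEY"] = "placeholder-key"
login()

# Explicit argument while TRISMIK_API_KEY is also set: the key is still
# validated and stored, but a UserWarning notes that the environment
# variable takes precedence when evaluate() runs.
login(trismik_api_key="another-placeholder-key")
```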
{scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/eval_datasets/eval_dataset.py

@@ -18,8 +18,8 @@ from scorebook.exceptions import (
      DatasetSampleError,
      MissingFieldError,
  )
- from scorebook.metrics.metric_base import MetricBase
- from scorebook.metrics.metric_registry import MetricRegistry
+ from scorebook.metrics.core.metric_base import MetricBase
+ from scorebook.metrics.core.metric_registry import MetricRegistry
  from scorebook.utils.io_helpers import validate_path
  from scorebook.utils.render_template import render_template

{scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/_async/evaluate_async.py

@@ -113,8 +113,6 @@ async def evaluate_async(
      with evaluation_progress_context(
          total_eval_runs=len(eval_run_specs),
          total_items=total_items,
-         dataset_count=len(datasets),
-         hyperparam_count=len(hyperparameter_configs),
          model_display=model_display,
          enabled=show_progress_bars,
      ) as progress_bars:
@@ -151,19 +149,31 @@ async def execute_runs(
      async def worker(
          run: Union[EvalRunSpec, AdaptiveEvalRunSpec]
      ) -> Union[ClassicEvalRunResult, AdaptiveEvalRunResult]:
+         # Create progress callback for adaptive evals
+         on_progress: Optional[Callable[[int, int], None]] = None
+         if progress_bars is not None and isinstance(run, AdaptiveEvalRunSpec):
+
+             def _on_progress(current: int, total: int) -> None:
+                 progress_bars.on_item_progress(current, total)
+
+             on_progress = _on_progress
+
          # Execute run (score_async handles upload internally for classic evals)
          run_result = await execute_run(
-             inference, run, upload_results, experiment_id, project_id, metadata, trismik_client
+             inference,
+             run,
+             upload_results,
+             experiment_id,
+             project_id,
+             metadata,
+             trismik_client,
+             on_progress,
          )

          # Update progress bars with items processed and success status
          if progress_bars is not None:
-             # Classic evals have .items; adaptive evals use max_iterations
-             items_processed = (
-                 len(run.dataset.items)
-                 if isinstance(run, EvalRunSpec)
-                 else evaluation_settings["max_iterations"]
-             )
+             # Classic evals: update items count; Adaptive evals: items already tracked via callback
+             items_processed = len(run.dataset.items) if isinstance(run, EvalRunSpec) else 0
              progress_bars.on_run_completed(items_processed, run_result.run_completed)

          # Update upload progress for classic evals
@@ -195,11 +205,12 @@ async def execute_runs(
  async def execute_run(
      inference: Callable,
      run: Union[EvalRunSpec, AdaptiveEvalRunSpec],
-     upload_results: bool,  # NEW PARAMETER
+     upload_results: bool,
      experiment_id: Optional[str] = None,
      project_id: Optional[str] = None,
      metadata: Optional[Dict[str, Any]] = None,
      trismik_client: Optional[Union[TrismikClient, TrismikAsyncClient]] = None,
+     on_progress: Optional[Callable[[int, int], None]] = None,
  ) -> Union[ClassicEvalRunResult, AdaptiveEvalRunResult]:
      """Execute a single evaluation run."""

@@ -218,6 +229,7 @@ async def execute_run(
              resolved_project_id,
              metadata,
              trismik_client,
+             on_progress,
          )

      else:
@@ -338,6 +350,7 @@ async def execute_adaptive_eval_run(
      project_id: str,
      metadata: Optional[Dict[str, Any]] = None,
      trismik_client: Optional[Union[TrismikClient, TrismikAsyncClient]] = None,
+     on_progress: Optional[Callable[[int, int], None]] = None,
  ) -> AdaptiveEvalRunResult:
      """Execute an adaptive evaluation run."""
      logger.debug("Executing adaptive run for %s", run)
@@ -347,7 +360,7 @@ async def execute_adaptive_eval_run(
          raise ScoreBookError("Trismik client is required for adaptive evaluation")

      adaptive_eval_run_result = await run_adaptive_evaluation(
-         inference, run, experiment_id, project_id, metadata, trismik_client
+         inference, run, experiment_id, project_id, metadata, trismik_client, on_progress
      )
      logger.debug("Adaptive evaluation completed for run %s", adaptive_eval_run_result)

@@ -365,6 +378,7 @@ async def run_adaptive_evaluation(
      project_id: str,
      metadata: Any,
      trismik_client: Union[TrismikClient, TrismikAsyncClient],
+     on_progress: Optional[Callable[[int, int], None]] = None,
  ) -> AdaptiveEvalRunResult:
      """Run an adaptive evaluation using the Trismik API.

@@ -375,6 +389,7 @@
          project_id: Trismik project ID
          metadata: Additional metadata
          trismik_client: Trismik client instance
+         on_progress: Optional callback for progress updates (current, total)
      Returns:
          Results from the adaptive evaluation
      """
@@ -404,6 +419,7 @@
              inference_setup={},
          ),
          item_processor=make_trismik_inference(inference_with_hyperparams),
+         on_progress=on_progress,
          return_dict=False,
      )

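Both the async and sync evaluation paths now thread an optional `(current, total)` callback from the worker down into the adaptive runner. Here is a standalone sketch of that pattern using invented names (`process_items`, `main`) rather than Scorebook internals.

```python
# Illustrative pattern: an optional progress callback threaded into a runner,
# mirroring the on_progress plumbing added above.
from typing import Callable, Optional


def process_items(total: int, on_progress: Optional[Callable[[int, int], None]] = None) -> None:
    """Process `total` items, reporting (current, total) after each one if a callback is given."""
    for current in range(1, total + 1):
        ...  # per-item work would happen here
        if on_progress is not None:
            on_progress(current, total)


def main() -> None:
    def _on_progress(current: int, total: int) -> None:
        print(f"{current}/{total} items processed")

    process_items(3, on_progress=_on_progress)  # prints 1/3, 2/3, 3/3
    process_items(3)                            # no callback, no reporting


main()
```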
{scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/evaluate/_sync/evaluate.py

@@ -112,8 +112,6 @@ def evaluate(
      with evaluation_progress_context(
          total_eval_runs=len(eval_run_specs),
          total_items=total_items,
-         dataset_count=len(datasets),
-         hyperparam_count=len(hyperparameter_configs),
          model_display=model_display,
          enabled=show_progress_bars,
      ) as progress_bars:
@@ -150,19 +148,31 @@ def execute_runs(
      def worker(
          run: Union[EvalRunSpec, AdaptiveEvalRunSpec]
      ) -> Union[ClassicEvalRunResult, AdaptiveEvalRunResult]:
+         # Create progress callback for adaptive evals
+         on_progress: Optional[Callable[[int, int], None]] = None
+         if progress_bars is not None and isinstance(run, AdaptiveEvalRunSpec):
+
+             def _on_progress(current: int, total: int) -> None:
+                 progress_bars.on_item_progress(current, total)
+
+             on_progress = _on_progress
+
          # Execute run (score_async handles upload internally for classic evals)
          run_result = execute_run(
-             inference, run, upload_results, experiment_id, project_id, metadata, trismik_client
+             inference,
+             run,
+             upload_results,
+             experiment_id,
+             project_id,
+             metadata,
+             trismik_client,
+             on_progress,
          )

          # Update progress bars with items processed and success status
          if progress_bars is not None:
-             # Classic evals have .items; adaptive evals use max_iterations
-             items_processed = (
-                 len(run.dataset.items)
-                 if isinstance(run, EvalRunSpec)
-                 else evaluation_settings["max_iterations"]
-             )
+             # Classic evals: update items count; Adaptive evals: items already tracked via callback
+             items_processed = len(run.dataset.items) if isinstance(run, EvalRunSpec) else 0
              progress_bars.on_run_completed(items_processed, run_result.run_completed)

          # Update upload progress for classic evals
@@ -194,11 +204,12 @@ def execute_runs(
  def execute_run(
      inference: Callable,
      run: Union[EvalRunSpec, AdaptiveEvalRunSpec],
-     upload_results: bool,  # NEW PARAMETER
+     upload_results: bool,
      experiment_id: Optional[str] = None,
      project_id: Optional[str] = None,
      metadata: Optional[Dict[str, Any]] = None,
      trismik_client: Optional[Union[TrismikClient, TrismikAsyncClient]] = None,
+     on_progress: Optional[Callable[[int, int], None]] = None,
  ) -> Union[ClassicEvalRunResult, AdaptiveEvalRunResult]:
      """Execute a single evaluation run."""

@@ -217,6 +228,7 @@ def execute_run(
              resolved_project_id,
              metadata,
              trismik_client,
+             on_progress,
          )

      else:
@@ -337,6 +349,7 @@ def execute_adaptive_eval_run(
      project_id: str,
      metadata: Optional[Dict[str, Any]] = None,
      trismik_client: Optional[Union[TrismikClient, TrismikAsyncClient]] = None,
+     on_progress: Optional[Callable[[int, int], None]] = None,
  ) -> AdaptiveEvalRunResult:
      """Execute an adaptive evaluation run."""
      logger.debug("Executing adaptive run for %s", run)
@@ -346,7 +359,7 @@ def execute_adaptive_eval_run(
          raise ScoreBookError("Trismik client is required for adaptive evaluation")

      adaptive_eval_run_result = run_adaptive_evaluation(
-         inference, run, experiment_id, project_id, metadata, trismik_client
+         inference, run, experiment_id, project_id, metadata, trismik_client, on_progress
      )
      logger.debug("Adaptive evaluation completed for run %s", adaptive_eval_run_result)

@@ -364,6 +377,7 @@ def run_adaptive_evaluation(
      project_id: str,
      metadata: Any,
      trismik_client: Union[TrismikClient, TrismikAsyncClient],
+     on_progress: Optional[Callable[[int, int], None]] = None,
  ) -> AdaptiveEvalRunResult:
      """Run an adaptive evaluation using the Trismik API.

@@ -374,6 +388,7 @@
          project_id: Trismik project ID
          metadata: Additional metadata
          trismik_client: Trismik client instance
+         on_progress: Optional callback for progress updates (current, total)
      Returns:
          Results from the adaptive evaluation
      """
@@ -403,6 +418,7 @@ def run_adaptive_evaluation(
              inference_setup={},
          ),
          item_processor=make_trismik_inference(inference_with_hyperparams),
+         on_progress=on_progress,
          return_dict=False,
      )

scorebook-0.0.16/src/scorebook/metrics/README.md

@@ -0,0 +1,121 @@
+ # Adding Metrics to Scorebook
+
+ This guide explains how to add new metrics to Scorebook.
+
+ ## Quick Start
+
+ 1. Create a metric file: `src/scorebook/metrics/yourmetric.py`
+ 2. Implement the metric class
+ 3. Add tests
+ 4. Submit PR for review
+
+ ### Where to Put Tests
+
+ Tests go in one of two directories:
+
+ - **`tests/unit/test_metrics/`** - For fast tests using mocked data. These run on every commit.
+ - **`tests/extended/test_metrics/`** - For tests that require external dependencies, large datasets, or are computationally expensive.
+
+ Most metrics only need unit tests. Use extended tests when your metric relies on external APIs, models, or takes significant time to run.
+
+ See [CONTRIBUTING.md](../../../CONTRIBUTING.md) for instructions on running tests.
+
+ ---
+
+ ## Requirements
+
+ Your metric must:
+
+ - Use the `@scorebook_metric` decorator
+ - Inherit from `MetricBase`
+ - Implement the `score()` static method
+
+ The `score()` method returns a tuple of `(aggregate_scores, item_scores)`:
+
+ - **aggregate_scores**: A `Dict[str, float]` with overall metric values (e.g., `{"accuracy": 0.85}`)
+ - **item_scores**: A `List` of per-item scores. For metrics that produce a single value per item, use `int`, `float`, `bool`, or `str`. For metrics that produce multiple values per item, use a `Dict[str, Union[int, float, bool, str]]` where keys are metric names.
+
+ ---
+
+ ## File Naming
+
+ Metric files must use normalized names (lowercase, no underscores/spaces). This naming convention is required for the registry's lazy loading system to work.
+
+ 1. User requests a metric by name (e.g., `"f1_score"`, `"F1Score"`, or `"f1 score"`)
+ 2. The registry normalizes the input → `"f1score"`
+ 3. The registry imports `scorebook.metrics.f1score`
+ 4. The `@scorebook_metric` decorator registers the class
+
+ **Examples:**
+ - Class: `F1Score` → File: `f1score.py` → User can request: `"f1score"`, `"F1Score"`, `"f1_score"`, `"f1 score"`
+ - Class: `MeanSquaredError` → File: `meansquarederror.py` → User can request: `"MeanSquaredError"`, `"mean_squared_error"`, etc.
+
+ **Collision detection:** Class names that normalize to the same key will raise an error at registration time. For example, `F1Score` and `F1_Score` both normalize to `"f1score"` and cannot coexist.
+
+ ---
+
+ ## Implementation Template
+
+ Create your metric file in `src/scorebook/metrics/yourmetric.py`:
+
+ ```python
+ """Brief description of the metric."""
+
+ from typing import Any, Dict, List, Tuple
+
+ from scorebook.metrics import MetricBase, scorebook_metric
+
+
+ @scorebook_metric
+ class YourMetric(MetricBase):
+     """One-line description of what this metric measures.
+
+     Formula or explanation (e.g., Accuracy = correct / total).
+     """
+
+     def score(outputs: List[Any], labels: List[Any]) -> Tuple[Dict[str, Any], List[Any]]:
+         """Calculate metric score between outputs and labels.
+
+         Args:
+             outputs: A list of model inference outputs.
+             labels: A list of ground truth labels.
+
+         Returns:
+             Tuple containing:
+                 - Aggregate scores dict (e.g., {"your_metric": 0.85})
+                 - List of per-item scores
+
+         Raises:
+             ValueError: If outputs and labels have different lengths.
+         """
+         # Input validation
+         if len(outputs) != len(labels):
+             raise ValueError("Number of outputs must match number of labels")
+
+         if not outputs:
+             return {"your_metric": 0.0}, []
+
+         # Calculate per-item scores
+         item_scores = [calculate_score(out, lab) for out, lab in zip(outputs, labels)]
+
+         # Calculate aggregate score
+         aggregate_score = sum(item_scores) / len(item_scores)
+
+         return {"your_metric": aggregate_score}, item_scores
+ ```
+
+ ---
+
+ ## Documentation
+
+ Each metric should have:
+
+ 1. **Module-level docstring**: Brief description at the top of the file
+ 2. **Class docstring**: What the metric measures, formula, and any limitations
+ 3. **Method docstring**: Args, Returns, and Raises sections
+
+ ---
+
+ ## Example
+
+ See `src/scorebook/metrics/accuracy.py` for a complete reference implementation.
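Following the template above, here is a hypothetical end-to-end metric; the `WithinTolerance` class, its threshold, and the score key are invented for illustration, and only the `MetricBase` / `scorebook_metric` imports come from the package.

```python
"""Fraction of numeric outputs within an absolute tolerance of the label (illustrative)."""

from typing import Any, Dict, List, Tuple

from scorebook.metrics import MetricBase, scorebook_metric


@scorebook_metric
class WithinTolerance(MetricBase):
    """Hypothetical metric: scores 1 for items where |output - label| <= 0.5, else 0."""

    def score(outputs: List[Any], labels: List[Any]) -> Tuple[Dict[str, Any], List[Any]]:
        """Return ({"within_tolerance": fraction}, per-item booleans)."""
        if len(outputs) != len(labels):
            raise ValueError("Number of outputs must match number of labels")
        if not outputs:
            return {"within_tolerance": 0.0}, []

        # Per-item scores: True when the numeric output is close enough to the label
        item_scores = [abs(float(out) - float(lab)) <= 0.5 for out, lab in zip(outputs, labels)]
        return {"within_tolerance": sum(item_scores) / len(item_scores)}, item_scores
```

Per the naming rules above, this file would be saved as `src/scorebook/metrics/withintolerance.py` so the registry can lazily import it from its normalized name.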
scorebook-0.0.16/src/scorebook/metrics/__init__.py

@@ -0,0 +1,9 @@
+ """Metrics for evaluating model predictions."""
+
+ from scorebook.metrics.core.metric_base import MetricBase
+ from scorebook.metrics.core.metric_registry import scorebook_metric
+
+ __all__ = [
+     "MetricBase",
+     "scorebook_metric",
+ ]
{scorebook-0.0.14 → scorebook-0.0.16}/src/scorebook/metrics/accuracy.py

@@ -2,11 +2,10 @@

  from typing import Any, Dict, List, Tuple

- from scorebook.metrics.metric_base import MetricBase
- from scorebook.metrics.metric_registry import MetricRegistry
+ from scorebook.metrics import MetricBase, scorebook_metric


- @MetricRegistry.register()
+ @scorebook_metric
  class Accuracy(MetricBase):
      """Accuracy metric for evaluating model predictions of any type.

@@ -25,9 +24,6 @@ class Accuracy(MetricBase):
          The aggregate accuracy score for all items (correct predictions / total predictions).
          The item scores for each output-label pair (true/false).
          """
-         if len(outputs) != len(labels):
-             raise ValueError("Number of outputs must match number of labels")
-
          if not outputs:  # Handle empty lists
              return {"accuracy": 0.0}, []