finesse-benchmark 0.1.0__tar.gz

Metadata-Version: 2.4
Name: finesse-benchmark
Version: 0.1.0
Summary:
License-File: licence
Author: winter.sci.dev
Author-email: enzoescipy@gmail.com
Requires-Python: >=3.10,<3.15
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: datasets (>=4.3.0,<5.0.0)
Requires-Dist: litellm (>=1.78.7,<2.0.0)
Requires-Dist: matplotlib (>=3.10.7,<4.0.0)
Requires-Dist: pydantic (>=2.12.3,<3.0.0)
Requires-Dist: seaborn (>=0.13.2,<0.14.0)
Requires-Dist: torch (>=2.9.0,<3.0.0)
Requires-Dist: transformers (>=4.57.1,<5.0.0)
Requires-Dist: typer (>=0.20.0,<0.21.0)
Description-Content-Type: text/markdown

# Finesse Benchmark: Evaluating Long-Context Embedders with Semantic Precision

## Introduction

The **Finesse Benchmark** is an evaluation framework for assessing how well long-context embedding models handle semantic understanding and information retention. Unlike traditional benchmarks that rely on superficial metrics, Finesse focuses on **Relative Semantic Similarity (RSS)**—a robust metric that measures how well models distinguish relevant ("memory") chunks from irrelevant ("noise") chunks in long sequences.

### Key Features
- **Modular Evaluation Modes**: Supports `merger_mode` (a sequence-merger paired with a base embedder), `native_mode` (direct long-context embedders such as Snowflake Arctic Embed), and `byok_mode` (Bring Your Own Keys for external APIs via LiteLLM).
- **Dynamic Probe Generation**: Creates synthetic probes from atomic text chunks in the dataset, masking portions to test reconstruction accuracy (illustrated in the sketch just below this section).
- **Top-Down and Bottom-Up Scoring**: Combines **Top-Down (TD)** scoring for contextual coherence (how well the model separates memory from noise) with **Bottom-Up (BU)** scoring for individual chunk integrity (how well each chunk recognizes itself in compositions).
- **Reproducibility and Integrity**: Outputs include self-contained content hashes and optional model hashes for notarization and verification.
- **CLI-Driven Workflow**: Simple commands (`init`, `generate`, `score`, `checksum`) for end-to-end evaluation.
- **Dataset**: Uses [enzoescipy/finesse-benchmark-database](https://huggingface.co/datasets/enzoescipy/finesse-benchmark-database) on Hugging Face, which provides domain-diverse atomic chunks grouped by `string_id`.

Finesse is built with [Pydantic](https://pydantic-docs.helpmanual.io/) for configuration validation, [Typer](https://typer.tiangolo.com/) for the CLI, and [Torch](https://pytorch.org/) for efficient embedding computation.

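As a taste of the probe mechanics described above, here is a minimal illustrative sketch. It is not the package's actual generator; the function name, the `[MASK]` placeholder, and the chunk strings are all hypothetical.

```python
import random

def build_probe(chunks: list[str], mask_ratio: float = 0.15, seed: int = 42) -> tuple[list[str], list[int]]:
    """Illustrative only: hide a fraction of atomic chunks to form a probe,
    so the embedder must reconstruct the sequence's meaning from what remains."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(chunks) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(chunks)), n_masked))
    probe = ["[MASK]" if i in masked_idx else c for i, c in enumerate(chunks)]
    return probe, masked_idx

probe, hidden = build_probe(["chunk-a", "chunk-b", "chunk-c", "chunk-d", "chunk-e"])
print(probe)   # e.g. ['chunk-a', '[MASK]', 'chunk-c', 'chunk-d', 'chunk-e']
print(hidden)  # indices the embedder is evaluated on
```
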
## Installation

Install via pip:

```bash
pip install finesse-benchmark
```

- Requires Python 3.10+ (the package supports 3.10 through 3.14).
- For GPU acceleration: ensure CUDA is installed and set `device: "cuda"` in the config.
- Hugging Face models are downloaded automatically (using the `transformers` cache).

BYOK mode (e.g., OpenAI) relies on LiteLLM, which is installed automatically as a core dependency; to install or upgrade it explicitly:

```bash
pip install litellm
```

## Quick Start

### 1. Initialize Config (Optional)
Generate a default `benchmark.yaml` template:

```bash
finesse init --output benchmark.yaml
```

Edit `benchmark.yaml` to select the mode, models, and probe settings. For BYOK mode, see the dedicated section below.

### 2. Generate Raw Embeddings
Run the evaluation to generate raw probe and synthesis embeddings:

```bash
finesse generate --config benchmark.yaml --output results --samples 5 --seed 42
```

- This saves a `.pt` file (e.g., `embeddings_merger_mode_finesse-benchmark-database.pt`) containing the raw data and config.
- Overrides: use `--dataset-path` for custom HF datasets and `--samples` for more evaluations per length.

### 3. Score the Embeddings
Compute RSS scores from the raw data:

```bash
finesse score --pt-path results/embeddings_merger_mode_finesse-benchmark-database.pt --output results
```

- Outputs `benchmark_results.json` with `average_rss` (the final score) and per-length scores.
- Scores are normalized and scaled (multiplied by 500 for interpretability).

### 4. Verify Integrity (Checksum)
Check the results for tampering and confirm reproducibility:

```bash
finesse checksum --json-path results/benchmark_results.json
```

For full provenance (confirming the model itself is unchanged), also provide the model ID:

```bash
finesse checksum --json-path results/benchmark_results.json --model-path Snowflake/snowflake-arctic-embed-l-v2.0
```

- ✅ Success if the content and model hashes match.
- Only Hugging Face model IDs (e.g., `org/repo`) are accepted for `--model-path`.

## Detailed CLI Reference

All commands use [Typer](https://typer.tiangolo.com/) for intuitive interfaces. Run `finesse --help` for an overview.

### `finesse init`
Generates a commented `benchmark.yaml` template.

**Options**:
- `--output`: Path to save the YAML (default: `benchmark.yaml`).

**Example**:
```bash
finesse init --output my_config.yaml
```

The template includes examples for all modes and is validated against `BenchmarkConfig` before saving.

### `finesse generate`
Generates raw embeddings from the dataset using the specified config.

**Options**:
- `--config` (required): Path to `benchmark.yaml`.
- `--dataset-path`: Override the HF dataset path (default: from config).
- `--output`: Directory for `.pt` files (default: `results`).
- `--samples`: Samples per sequence length (overrides config).
- `--seed`: Random seed for reproducibility (overrides config).

**Output**:
- `.pt` file: A serialized Torch payload containing `config` (dict) and `raw_results` (embeddings per length).

**Example**:
```bash
finesse generate --config benchmark.yaml --output ./my_results --samples 10
```

141
+ ### `finesse score`
142
+ Computes TD/BU scores and final RSS from raw `.pt` data.
143
+
144
+ **Options**:
145
+ - `--pt-path` (required): Path to `.pt` file from `generate`.
146
+ - `--output`: Directory for JSON (default: `results`).
147
+
148
+ **Scoring Logic** (simplified):
149
+ - **TD Score**: Quartile gap between memory and noise similarities (excludes first/last synthesis steps for stability).
150
+ - **BU Score**: Similar gap from individual chunk perspectives.
151
+ - **Final RSS**: `((avg_TD + avg_BU) / 2) - |TD - BU|` per length, averaged across lengths, scaled by 500.
152
+
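A minimal Python sketch of that per-length formula and the cross-length average, with variable names of our own choosing (the quartile-gap computation that produces `avg_TD` and `avg_BU` is internal to the package):

```python
def final_rss(avg_td: float, avg_bu: float) -> float:
    # Balance the two views, then penalize their disagreement.
    return (avg_td + avg_bu) / 2 - abs(avg_td - avg_bu)

def average_rss(per_length: dict[int, tuple[float, float]]) -> float:
    # Average the per-length RSS across sequence lengths, scaled by 500.
    scores = [final_rss(td, bu) for td, bu in per_length.values()]
    return 500 * sum(scores) / len(scores)

# Hypothetical (avg_TD, avg_BU) pairs for lengths 5 and 16:
print(average_rss({5: (0.09, 0.08), 16: (0.10, 0.07)}))  # 32.5
```
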
**Output**:
- `benchmark_results.json`:

  ```json
  {
    "config": {...},
    "average_rss": 42.123456,
    "length_scores": {"5": 40.5, "16": 43.7},
    "content_hash": "sha256:...",
    "model_hash": "sha256:..."
  }
  ```

  `model_hash` is optional and present only for Hugging Face models.

**Example**:
```bash
finesse score --pt-path results/embeddings_byok_mode_finesse-benchmark-database.pt
```

### `finesse checksum`
Verifies JSON integrity via a self-contained hash, with an optional model provenance check.

**Options**:
- `--json-path` (required): Path to `benchmark_results.json`.
- `--model-path`: HF model ID for provenance (e.g., `intfloat/multilingual-e5-base`).

**Verification**:
- Recomputes `content_hash` (excluding the hash itself) and compares it to the stored value.
- With `--model-path`: recomputes `model_hash` from the model files and compares.

**Example**:
```bash
finesse checksum --json-path results/benchmark_results.json --model-path enzoescipy/sequence-merger-tiny
```

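To show the shape of that verification, here is a hedged Python sketch. The CLI's exact canonicalization is internal to finesse-benchmark; this version simply assumes sorted-key JSON with both hash fields stripped before hashing:

```python
import hashlib
import json

def recompute_content_hash(results: dict) -> str:
    # Hash the payload minus the hash fields (assumed canonicalization).
    payload = {k: v for k, v in results.items()
               if k not in ("content_hash", "model_hash")}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

with open("results/benchmark_results.json") as fh:
    results = json.load(fh)

ok = recompute_content_hash(results) == results["content_hash"]
print("content hash matches" if ok else "results were altered")
```
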
## Output Files Explained

- **`.pt` (Raw Embeddings)**: Binary Torch file with:
  - `config`: Full benchmark config as dict.
  - `raw_results`: Dict of `{length: {"probe_embeddings": [...], "synthesis_embeddings": [...], "num_synth_steps": int}}`.
  - Used as input to `score`; enables decoupling of embedding generation (GPU-heavy) from scoring (CPU-friendly).

- **`benchmark_results.json`**: Human-readable results with:
  - `average_rss`: Overall score (higher is better; >40 indicates strong performance).
  - `length_scores`: Per-sequence-length scores (tests scaling).
  - `content_hash`: SHA-256 of config + scores (for tamper-proofing).
  - `model_hash`: SHA-256 of model files (if applicable; verifies the model is unchanged).

Hashes ensure reproducibility: rerun `checksum` on shared results to confirm no alterations.

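To inspect a `.pt` payload before scoring, a minimal sketch (the file path is taken from the Quick Start example; `weights_only=False` is needed because the payload is a plain dict rather than a bare tensor):

```python
import torch

# The .pt file is a Python dict serialized with torch.save, so it can be
# inspected on a CPU-only machine before scoring.
data = torch.load(
    "results/embeddings_merger_mode_finesse-benchmark-database.pt",
    map_location="cpu",
    weights_only=False,
)

print(data["config"]["mode"])  # e.g. "merger_mode"
for length, entry in data["raw_results"].items():
    print(length, len(entry["probe_embeddings"]), entry["num_synth_steps"])
```
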
## Using BYOK Mode (Bring Your Own Keys)

BYOK mode integrates external embedding APIs (e.g., OpenAI, Cohere) via [LiteLLM](https://github.com/BerriAI/litellm) for fair comparison with open models.

### Setup
1. Edit `benchmark.yaml`:

```yaml
mode: "byok_mode"

models:
  byok_embedder:
    provider: "openai" # 'openai', 'cohere', 'google', etc.
    name: "text-embedding-3-large" # Provider-specific model
```

2. Set environment variables (REQUIRED; never hardcode keys in YAML):
   - OpenAI: `export OPENAI_API_KEY="sk-..."`
   - Cohere: `export COHERE_API_KEY="..."`
   - Google: `export GOOGLE_API_KEY="..."` (or Vertex AI credentials).
   - LiteLLM auto-detects the key based on `provider`; see the [LiteLLM docs](https://docs.litellm.ai/docs/providers) for the full list.

   On Windows (PowerShell): `$env:OPENAI_API_KEY="sk-..."`

3. Run as usual (the sketch after these steps shows roughly what happens under the hood):

```bash
finesse generate --config byok_config.yaml
```

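The sketch below shows roughly what `byok_mode` does per probe, under the assumption that it calls LiteLLM's `embedding` API with the provider-specific model name from the YAML; the probe text here is invented for illustration:

```python
import os

from litellm import embedding

# LiteLLM routes the request to the provider named in benchmark.yaml,
# reading the API key from the environment, never from the config file.
assert "OPENAI_API_KEY" in os.environ, "export your key first (see step 2)"

response = embedding(
    model="text-embedding-3-large",  # provider-specific name from the YAML
    input=["A short chunk of probe text."],
)
vector = response.data[0]["embedding"]
print(len(vector))  # embedding dimensionality reported by the provider
```
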
### Notes
- Costs: BYOK incurs API fees; start with a small `--samples` (e.g., 1-5).
- Security: Keys stay in environment variables, so the YAML remains commit-safe.
- Validation: The config validator ensures `byok_embedder` is set for `byok_mode`; other model fields are optional and ignored.
- Example YAML: the `init` template contains a BYOK section to uncomment and customize.

## Configuration Deep Dive (benchmark.yaml)

- **mode**: `"merger_mode"` (default; uses merger + base embedder), `"native_mode"` (direct embedder), or `"byok_mode"`.
- **models**: Mode-specific; unused fields default to `None` (no download).
  - `merger`: Sequence-merger path (e.g., `"enzoescipy/sequence-merger-tiny"`).
  - `base_embedder`/`native_embedder`: Embedder path (e.g., `"intfloat/multilingual-e5-base"`).
- **dataset**: HF path (default: `"enzoescipy/finesse-benchmark-database"`) and split (`"train"`).
- **probe_config**:
  - `mask_ratio`: 0.15 (fraction masked).
  - `sequence_length`: `{min: 5, max: 16}` (probe lengths, counted in chunks).
  - `samples_per_length`: 1 or more (evaluations per length).
- **advanced**: `{batch_size: 8, device: "cuda"}` (optional).
- **seed**: 42 (for reproducibility).

Pydantic ensures type safety; invalid configs raise a `ValueError` on load, as the sketch below illustrates.

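A hedged sketch of that validation step, using a trimmed stand-in for the package's `BenchmarkConfig` that covers only the fields listed above (assumes PyYAML is available, as the CLI itself reads YAML configs):

```python
import yaml
from pydantic import BaseModel, ValidationError

# Hypothetical, trimmed stand-in for BenchmarkConfig; the real model is
# part of finesse-benchmark and validates much more.
class SequenceLength(BaseModel):
    min: int
    max: int

class ProbeConfig(BaseModel):
    mask_ratio: float = 0.15
    sequence_length: SequenceLength
    samples_per_length: int = 1

class MiniConfig(BaseModel):
    mode: str = "merger_mode"
    probe_config: ProbeConfig
    seed: int = 42

raw = yaml.safe_load("""
mode: "merger_mode"
probe_config:
  mask_ratio: 0.15
  sequence_length: {min: 5, max: 16}
  samples_per_length: 2
""")

try:
    cfg = MiniConfig(**raw)
    print(cfg.probe_config.sequence_length.max)  # 16
except ValidationError as err:  # subclasses ValueError, as noted above
    print(err)
```
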
## Development

- Source: `src/finesse_benchmark/`.
- Tests: Run `pytest` (contributions of tests for scoring and hashing are welcome).
- Contributing: Fork and open a PR with docs and tests. New providers and modes are a good focus.

## License

Apache 2.0 License. See [LICENSE](LICENSE) for details.

## Acknowledgments

Built on insights from long-context evaluation research. Thanks to the Hugging Face Transformers and Pydantic teams.