finesse-benchmark 0.1.0__tar.gz
- finesse_benchmark-0.1.0/PKG-INFO +264 -0
- finesse_benchmark-0.1.0/README.md +239 -0
- finesse_benchmark-0.1.0/licence +202 -0
- finesse_benchmark-0.1.0/pyproject.toml +23 -0
- finesse_benchmark-0.1.0/src/finesse_benchmark/__init__.py +5 -0
- finesse_benchmark-0.1.0/src/finesse_benchmark/cli.py +341 -0
- finesse_benchmark-0.1.0/src/finesse_benchmark/config.py +90 -0
- finesse_benchmark-0.1.0/src/finesse_benchmark/evaluator.py +257 -0
- finesse_benchmark-0.1.0/src/finesse_benchmark/scoring.py +146 -0
- finesse_benchmark-0.1.0/src/finesse_benchmark/utils.py +95 -0
@@ -0,0 +1,264 @@
Metadata-Version: 2.4
Name: finesse-benchmark
Version: 0.1.0
Summary:
License-File: licence
Author: winter.sci.dev
Author-email: enzoescipy@gmail.com
Requires-Python: >=3.10,<3.15
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: datasets (>=4.3.0,<5.0.0)
Requires-Dist: litellm (>=1.78.7,<2.0.0)
Requires-Dist: matplotlib (>=3.10.7,<4.0.0)
Requires-Dist: pydantic (>=2.12.3,<3.0.0)
Requires-Dist: seaborn (>=0.13.2,<0.14.0)
Requires-Dist: torch (>=2.9.0,<3.0.0)
Requires-Dist: transformers (>=4.57.1,<5.0.0)
Requires-Dist: typer (>=0.20.0,<0.21.0)
Description-Content-Type: text/markdown
# Finesse Benchmark: Evaluating Long-Context Embedders with Semantic Precision

## Introduction

The **Finesse Benchmark** is an evaluation framework for assessing how well long-context embedding models handle semantic understanding and information retention. Unlike traditional benchmarks that rely on superficial metrics, Finesse focuses on **Relative Semantic Similarity (RSS)**, a robust metric that measures how well models distinguish between relevant ("memory") and irrelevant ("noise") chunks in long sequences.

### Key Features
- **Modular Evaluation Modes**: Supports `merger_mode` (sequence-merger with a base embedder), `native_mode` (direct long-context embedders like Snowflake Arctic Embed), and `byok_mode` (Bring Your Own Keys for external APIs via LiteLLM).
- **Dynamic Probe Generation**: Creates synthetic probes from atomic text chunks in the dataset, masking portions to test reconstruction accuracy.
- **Top-Down and Bottom-Up Scoring**: Combines **Top-Down (TD)** for contextual coherence (how well the model separates memory from noise) and **Bottom-Up (BU)** for individual chunk integrity (how well each chunk recognizes itself in compositions).
- **Reproducibility and Integrity**: Outputs include self-contained content hashes and optional model hashes for notarization and verification.
- **CLI-Driven Workflow**: Simple commands (`init`, `generate`, `score`, `checksum`) for end-to-end evaluation.
- **Dataset**: Uses the [enzoescipy/finesse-benchmark-database](https://huggingface.co/datasets/enzoescipy/finesse-benchmark-database) on Hugging Face, which provides domain-diverse atomic chunks grouped by `string_id`.

Finesse is built with [Pydantic](https://pydantic-docs.helpmanual.io/) for configuration validation, [Typer](https://typer.tiangolo.com/) for the CLI, and [PyTorch](https://pytorch.org/) for efficient embedding computations.
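The grouping by `string_id` can be illustrated with a few toy records. Note that the `text` field name and the record contents below are assumptions for illustration only, not the dataset's actual schema:

```python
from collections import defaultdict

# Toy records mimicking the assumed shape of the dataset: atomic text
# chunks that share a `string_id` belong to the same source document.
records = [
    {"string_id": "doc-a", "text": "chunk 1 of doc a"},
    {"string_id": "doc-b", "text": "chunk 1 of doc b"},
    {"string_id": "doc-a", "text": "chunk 2 of doc a"},
]

def group_by_string_id(rows):
    """Collect each document's atomic chunks in order of appearance."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["string_id"]].append(row["text"])
    return dict(groups)

groups = group_by_string_id(records)
print(len(groups["doc-a"]))  # → 2
```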
## Installation

Install via pip:

```bash
pip install finesse-benchmark
```

- Requires Python 3.10+ (3.10 through 3.14 are supported).
- For GPU acceleration: Ensure CUDA is installed and set `device: "cuda"` in config.
- Hugging Face models are downloaded automatically (via the `transformers` cache).

For BYOK mode (e.g., OpenAI), install additional dependencies:

```bash
pip install litellm
```

## Quick Start

### 1. Initialize Config (Optional)
Generate a default `benchmark.yaml` template:

```bash
finesse init --output benchmark.yaml
```

Edit `benchmark.yaml` to select mode, models, and probe settings. For BYOK mode, see the dedicated section below.

### 2. Generate Raw Embeddings
Run the evaluation to generate raw probe and synthesis embeddings:

```bash
finesse generate --config benchmark.yaml --output results --samples 5 --seed 42
```

- This saves a `.pt` file (e.g., `embeddings_merger_mode_finesse-benchmark-database.pt`) containing raw data and config.
- Overrides: Use `--dataset-path` for custom HF datasets, `--samples` for more evaluations per length.

### 3. Score the Embeddings
Compute RSS scores from the raw data:

```bash
finesse score --pt-path results/embeddings_merger_mode_finesse-benchmark-database.pt --output results
```

- Outputs `benchmark_results.json` with `average_rss` (final score) and per-length scores.
- Scores are normalized and scaled (multiplied by 500 for interpretability).
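As a sketch of consuming this output programmatically, the headline numbers can be pulled straight from the JSON. Field names follow the `benchmark_results.json` schema documented in the CLI reference; the in-memory `sample` below is a stand-in, not real benchmark output:

```python
import json

def load_rss(path):
    """Return the headline RSS score and the per-length breakdown."""
    with open(path) as f:
        results = json.load(f)
    return results["average_rss"], results["length_scores"]

# Stand-in for a results file, matching the documented schema:
sample = {"average_rss": 42.123456, "length_scores": {"5": 40.5, "16": 43.7}}
best_length = max(sample["length_scores"], key=sample["length_scores"].get)
print(best_length)  # → 16
```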
### 4. Verify Integrity (Checksum)
Validate the results for tampering or reproducibility:

```bash
finesse checksum --json-path results/benchmark_results.json
```

For full provenance (model unchanged), provide the model ID:

```bash
finesse checksum --json-path results/benchmark_results.json --model-path Snowflake/snowflake-arctic-embed-l-v2.0
```

- ✅ Success if content and model hashes match.
- Only Hugging Face model IDs (e.g., `org/repo`) are accepted for `--model-path`.

## Detailed CLI Reference

All commands use [Typer](https://typer.tiangolo.com/) for intuitive interfaces. Run `finesse --help` for an overview.

### `finesse init`
Generates a commented `benchmark.yaml` template.

**Options**:
- `--output`: Path to save YAML (default: `benchmark.yaml`).

**Example**:
```bash
finesse init --output my_config.yaml
```

The template includes examples for all modes and validates against `BenchmarkConfig` before saving.

### `finesse generate`
Generates raw embeddings from the dataset using the specified config.

**Options**:
- `--config` (required): Path to `benchmark.yaml`.
- `--dataset-path`: Override HF dataset path (default: from config).
- `--output`: Directory for `.pt` files (default: `results`).
- `--samples`: Samples per sequence length (overrides config).
- `--seed`: Random seed for reproducibility (overrides config).

**Output**:
- `.pt` file: Torch-serialized payload with `config` (dict) and `raw_results` (embeddings per length).

**Example**:
```bash
finesse generate --config benchmark.yaml --output ./my_results --samples 10
```

### `finesse score`
Computes TD/BU scores and final RSS from raw `.pt` data.

**Options**:
- `--pt-path` (required): Path to `.pt` file from `generate`.
- `--output`: Directory for JSON (default: `results`).

**Scoring Logic** (simplified):
- **TD Score**: Quartile gap between memory and noise similarities (excludes first/last synthesis steps for stability).
- **BU Score**: Similar gap from individual chunk perspectives.
- **Final RSS**: `((avg_TD + avg_BU) / 2) - |avg_TD - avg_BU|` per length, averaged across lengths, scaled by 500.
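The aggregation in the last bullet can be sketched in a few lines. This is a simplified illustration of the documented formula only: the quartile-gap computation that produces the per-sample TD/BU values is omitted, and the actual implementation lives in `scoring.py`:

```python
def rss_for_length(td_scores, bu_scores):
    """Combine per-sample TD/BU gap scores for one sequence length,
    penalizing disagreement between the two views."""
    avg_td = sum(td_scores) / len(td_scores)
    avg_bu = sum(bu_scores) / len(bu_scores)
    return (avg_td + avg_bu) / 2 - abs(avg_td - avg_bu)

def final_rss(per_length):
    """Average the per-length scores, then scale by 500 for readability."""
    scores = [rss_for_length(td, bu) for td, bu in per_length.values()]
    return 500 * sum(scores) / len(scores)

# Hypothetical raw gap scores for two sequence lengths:
per_length = {5: ([0.10, 0.12], [0.09, 0.11]), 16: ([0.08], [0.08])}
print(round(final_rss(per_length), 2))  # → 43.75
```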
**Output**:
- `benchmark_results.json`:
```json
{
  "config": {...},
  "average_rss": 42.123456,
  "length_scores": {"5": 40.5, "16": 43.7},
  "content_hash": "sha256:...",
  "model_hash": "sha256:..."
}
```
(`model_hash` is optional and only present for HF models.)

**Example**:
```bash
finesse score --pt-path results/embeddings_byok_mode_finesse-benchmark-database.pt
```

### `finesse checksum`
Verifies JSON integrity via a self-contained hash, with an optional model provenance check.

**Options**:
- `--json-path` (required): Path to `benchmark_results.json`.
- `--model-path`: HF model ID for provenance (e.g., `intfloat/multilingual-e5-base`).

**Verification**:
- Recomputes `content_hash` (excluding the hash itself) and compares.
- For `--model-path`: Recomputes `model_hash` from the model files and compares.
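The self-contained check can be sketched as follows. The exact serialization and field set used by `finesse checksum` are assumptions here (a sorted-keys JSON dump with the hash fields excluded), so treat this as an illustration rather than the tool's actual canonicalization:

```python
import hashlib
import json

def recompute_content_hash(results: dict) -> str:
    """Hash everything except the hash fields themselves over a
    canonical (sorted-keys) JSON serialization."""
    payload = {k: v for k, v in results.items()
               if k not in ("content_hash", "model_hash")}
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return f"sha256:{digest}"

results = {"average_rss": 42.0, "length_scores": {"5": 40.5}}
results["content_hash"] = recompute_content_hash(results)

# Tampering with any scored field now breaks the check:
tampered = dict(results, average_rss=99.9)
print(recompute_content_hash(tampered) == results["content_hash"])  # → False
```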
**Example**:
```bash
finesse checksum --json-path results/benchmark_results.json --model-path enzoescipy/sequence-merger-tiny
```

## Output Files Explained

- **`.pt` (Raw Embeddings)**: Binary Torch file with:
  - `config`: Full benchmark config as dict.
  - `raw_results`: Dict of `{length: {"probe_embeddings": [...], "synthesis_embeddings": [...], "num_synth_steps": int}}`.
  - Used as input to `score`; decouples embedding generation (GPU-heavy) from scoring (CPU-friendly).
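In practice the `.pt` file is opened with `torch.load`; the nested-dict shape it deserializes to can then be walked as below. The payload here is a tiny hand-made stand-in (with made-up embedding values), not real `generate` output:

```python
# Stand-in for: data = torch.load("results/embeddings_merger_mode_....pt")
data = {
    "config": {"mode": "merger_mode", "seed": 42},
    "raw_results": {
        5: {"probe_embeddings": [[0.1, 0.2]],
            "synthesis_embeddings": [[0.3, 0.4]],
            "num_synth_steps": 4},
    },
}

def summarize(pt_payload):
    """Count stored probe/synthesis vectors per sequence length."""
    return {length: (len(e["probe_embeddings"]), len(e["synthesis_embeddings"]))
            for length, e in pt_payload["raw_results"].items()}

print(summarize(data))  # → {5: (1, 1)}
```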
- **`benchmark_results.json`**: Human-readable results with:
  - `average_rss`: Overall score (higher is better; >40 indicates strong performance).
  - `length_scores`: Per-sequence-length scores (tests scaling).
  - `content_hash`: SHA-256 of config + scores (for tamper-proofing).
  - `model_hash`: SHA-256 of model files (if applicable; verifies the model is unchanged).

Hashes ensure reproducibility: rerun `checksum` on shared results to confirm no alterations.

## Using BYOK Mode (Bring Your Own Keys)

BYOK mode integrates external embedding APIs (e.g., OpenAI, Cohere) via [LiteLLM](https://github.com/BerriAI/litellm) for fair comparison with open models.

### Setup
1. Edit `benchmark.yaml`:
```yaml
mode: "byok_mode"

models:
  byok_embedder:
    provider: "openai" # 'openai', 'cohere', 'google', etc.
    name: "text-embedding-3-large" # Provider-specific model
```

2. Set environment variables (REQUIRED; never hardcode keys in YAML):
   - OpenAI: `export OPENAI_API_KEY="sk-..."`
   - Cohere: `export COHERE_API_KEY="..."`
   - Google: `export GOOGLE_API_KEY="..."` (or Vertex AI credentials).
   - LiteLLM auto-detects the key based on `provider`. See the [LiteLLM docs](https://docs.litellm.ai/docs/providers) for the full list.

   On Windows (PowerShell): `$env:OPENAI_API_KEY="sk-..."`

3. Run as usual:
```bash
finesse generate --config byok_config.yaml
```

### Notes
- Costs: BYOK incurs API fees; start with a small `--samples` (e.g., 1-5).
- Security: Keys stay in env vars, so the YAML remains commit-safe.
- Validation: The config validator ensures `byok_embedder` is set for `byok_mode`; other model fields are optional/ignored.
- Example YAML (in the `init` template): Uncomment and customize the BYOK section.

## Configuration Deep Dive (benchmark.yaml)

- **mode**: `"merger_mode"` (default; uses merger + base), `"native_mode"` (direct embedder), `"byok_mode"`.
- **models**: Mode-specific; unused fields default to `None` (no download).
  - `merger`: Sequence-merger path (e.g., `"enzoescipy/sequence-merger-tiny"`).
  - `base_embedder`/`native_embedder`: Embedder path (e.g., `"intfloat/multilingual-e5-base"`).
- **dataset**: HF path (default: `"enzoescipy/finesse-benchmark-database"`), split (`"train"`).
- **probe_config**:
  - `mask_ratio`: 0.15 (fraction masked).
  - `sequence_length`: `{min: 5, max: 16}` (probe lengths, in chunks).
  - `samples_per_length`: 1+ (evaluations per length).
- **advanced**: `{batch_size: 8, device: "cuda"}` (optional).
- **seed**: 42 (reproducibility).

Pydantic ensures type safety; invalid configs raise `ValueError` on load.
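The mode-dependent requirement described above can be sketched with a plain-Python stand-in for the Pydantic validator. Field names mirror the YAML keys shown here; the real `BenchmarkConfig` in `config.py` may differ in detail:

```python
# Which models entry each mode requires (mirrors the table above):
REQUIRED_MODEL = {
    "merger_mode": "merger",
    "native_mode": "native_embedder",
    "byok_mode": "byok_embedder",
}

def validate_config(cfg: dict) -> dict:
    """Reject a config whose mode names a missing models entry."""
    mode = cfg.get("mode")
    if mode not in REQUIRED_MODEL:
        raise ValueError(f"unknown mode: {mode!r}")
    if not cfg.get("models", {}).get(REQUIRED_MODEL[mode]):
        raise ValueError(f"{mode} requires models.{REQUIRED_MODEL[mode]}")
    return cfg

validate_config({"mode": "merger_mode",
                 "models": {"merger": "enzoescipy/sequence-merger-tiny",
                            "base_embedder": "intfloat/multilingual-e5-base"}})
```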
## Development

- Source: `src/finesse_benchmark/`.
- Tests: Run `pytest` (add tests for scoring and hashing).
- Contributing: Fork, then open a PR with docs/tests. Focus on new providers/modes.

## License

Apache 2.0 License. See [licence](licence) for details.

## Acknowledgments

Built on insights from long-context evaluation research. Thanks to the Hugging Face Transformers and Pydantic teams.