ruby-skill-bench 1.0.1 → 1.2.0
- checksums.yaml +4 -4
- data/README.md +299 -23
- data/docs/architecture.md +3 -1
- data/docs/first-eval-guide.md +7 -7
- data/docs/testing-guide.md +1 -1
- data/lib/skill_bench/agent/react_agent/loop_runner.rb +44 -9
- data/lib/skill_bench/agent/react_agent/step.rb +7 -1
- data/lib/skill_bench/agent/react_agent.rb +2 -1
- data/lib/skill_bench/cli/batch_result_printer.rb +45 -0
- data/lib/skill_bench/cli/eval/eval_options.rb +4 -0
- data/lib/skill_bench/cli/help_printer.rb +10 -2
- data/lib/skill_bench/cli/init_command.rb +2 -1
- data/lib/skill_bench/cli/result_printer.rb +1 -1
- data/lib/skill_bench/cli/run_command.rb +47 -9
- data/lib/skill_bench/cli/validate_command.rb +242 -0
- data/lib/skill_bench/cli.rb +3 -0
- data/lib/skill_bench/client.rb +43 -1
- data/lib/skill_bench/clients/all.rb +3 -0
- data/lib/skill_bench/clients/base_client.rb +14 -6
- data/lib/skill_bench/clients/base_url_validator.rb +105 -0
- data/lib/skill_bench/clients/provider_config.rb +34 -1
- data/lib/skill_bench/clients/provider_schemas.rb +4 -0
- data/lib/skill_bench/clients/providers/mistral.rb +47 -0
- data/lib/skill_bench/clients/request_builder.rb +2 -4
- data/lib/skill_bench/clients/response_builder.rb +91 -0
- data/lib/skill_bench/clients/response_error_handler.rb +5 -17
- data/lib/skill_bench/clients/retry_handler.rb +4 -7
- data/lib/skill_bench/commands/init.rb +5 -0
- data/lib/skill_bench/commands/skill_new.rb +3 -1
- data/lib/skill_bench/config/applier.rb +2 -0
- data/lib/skill_bench/config/defaults.rb +2 -0
- data/lib/skill_bench/config/facade_readers.rb +7 -0
- data/lib/skill_bench/config/facade_writers.rb +17 -0
- data/lib/skill_bench/config/json_loader.rb +1 -1
- data/lib/skill_bench/config/store.rb +29 -0
- data/lib/skill_bench/config.rb +18 -0
- data/lib/skill_bench/constants.rb +58 -0
- data/lib/skill_bench/evaluation/runner.rb +20 -3
- data/lib/skill_bench/execution/context_hydrator.rb +66 -15
- data/lib/skill_bench/execution/sandbox.rb +76 -14
- data/lib/skill_bench/judge/judge.rb +4 -0
- data/lib/skill_bench/judge/prompt.rb +42 -6
- data/lib/skill_bench/models/config.rb +32 -0
- data/lib/skill_bench/output_formatter.rb +60 -1
- data/lib/skill_bench/package_verifier.rb +1 -1
- data/lib/skill_bench/rails/skill_templates.rb +19 -5
- data/lib/skill_bench/services/agent_spawner_service.rb +7 -3
- data/lib/skill_bench/services/batch_runner_service.rb +111 -0
- data/lib/skill_bench/services/compare_option_parser.rb +1 -0
- data/lib/skill_bench/services/cost_calculator.rb +91 -0
- data/lib/skill_bench/services/html_formatter.rb +289 -0
- data/lib/skill_bench/services/json_formatter.rb +19 -1
- data/lib/skill_bench/services/junit_formatter.rb +74 -24
- data/lib/skill_bench/services/provider_resolver.rb +5 -2
- data/lib/skill_bench/services/response_cache.rb +130 -0
- data/lib/skill_bench/services/runner_service.rb +88 -4
- data/lib/skill_bench/services/summary_formatter.rb +90 -0
- data/lib/skill_bench/services/template_registry.rb +43 -9
- data/lib/skill_bench/services/trend_recorder_service.rb +29 -2
- data/lib/skill_bench/tools/registry.rb +29 -3
- data/lib/skill_bench/tools/run_command.rb +172 -35
- data/lib/skill_bench/trend_tracker/persistence.rb +27 -10
- data/lib/skill_bench/trend_tracker.rb +5 -5
- data/lib/skill_bench/version.rb +1 -1
- data/lib/skill_bench.rb +3 -3
- metadata +19 -36
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f47976b55f6f8c147adb4ed784ce04ba52ff71f805f8e35d797ba776021641c4
+  data.tar.gz: c2febaadbdeb7e149041661258ce84e41499121445cf726cefece642e60174a4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5ab3082fa715a0776455a88b28e2d990d3bb7e52fbc5f0cc47176e1d44e76cdc430531bc0401abab350cb4c2321d0b625ce78d381094fe404cdedd7e61b27227
+  data.tar.gz: 3d5f67b876457691e003e62a8ba57fc04ac21e002bb5c6a84d1ae7954cf7dd7136f05a658d7ee4d416ab2db19cb0ad90e51dd0d920df9ffa998cb8391db65df5
data/README.md
CHANGED
@@ -30,7 +30,7 @@ See the [Ecosystem Overview](https://github.com/igmarin/agent-mcp-runtime/blob/m
 - **Isolated Git Sandboxes**: Every run operates in a temporary repo. Clean diffs, zero side-effects, 100% reproducibility.
 - **Blind Judging with Dimensions**: LLM judge scores baseline and context independently across 5 canonical dimensions (Correctness, Skill Adherence, Code Quality, Test Coverage, Documentation). Eval authors configure weights and thresholds via `criteria.json`.
 - **Sophisticated ReAct Loop**: Employs a robust `Thought → Tool → Observation` loop to handle complex, multi-step engineering tasks.
-- **Multi-Provider Ecosystem**: Native support for **OpenAI**, **Anthropic**, **Google Gemini**, **Azure OpenAI**, **Ollama**, **Groq**, **DeepSeek**, and **OpenCode**.
+- **Multi-Provider Ecosystem**: Native support for **OpenAI**, **Anthropic**, **Google Gemini**, **Azure OpenAI**, **Ollama**, **Groq**, **DeepSeek**, **Mistral**, and **OpenCode**.
 - **Standardized Intelligence**: Consistent reporting format regardless of the underlying LLM provider.

 ---
@@ -64,11 +64,14 @@ CLI / API → RunnerService → Sandbox + ReAct Agent → LLM Client Layer → P
 | **Ollama** | — | `:ollama` |
 | **Groq** | `SKILL_BENCH_GROQ_API_KEY` | `:groq` |
 | **DeepSeek** | `SKILL_BENCH_DEEPSEEK_API_KEY` | `:deepseek` |
+| **Mistral** | `SKILL_BENCH_MISTRAL_API_KEY` | `:mistral` |
 | **OpenCode** | `SKILL_BENCH_OPENCODE_API_KEY`, `SKILL_BENCH_OPENCODE_BASE_URL` | `:opencode` |

 > **Note:** Environment variables are loaded automatically. You can also configure provider settings in `skill-bench.json` (created by `skill-bench init`).
 >
 > **OpenCode requires a custom `base_url`:** OpenCode does not host a public LLM API. You must provide your own OpenAI-compatible endpoint (e.g. a LiteLLM proxy, self-hosted vLLM, or company gateway) via the `base_url` config key. Without it, the provider will fail with "Base URL not set for Opencode".
+>
+> **Mistral** uses Mistral's OpenAI-compatible chat completions API (default model `mistral-large-latest`). Set `SKILL_BENCH_MISTRAL_API_KEY` and scaffold it with `skill-bench init --mistral`.

 ### Command Allowlist

|
@@ -79,6 +82,7 @@ By default, no shell commands are permitted. You must configure `allowed_command
|
|
|
79
82
|
"provider": "openai",
|
|
80
83
|
"max_execution_time": 30,
|
|
81
84
|
"allowed_commands": ["rspec", "bundle", "ruby", "git"],
|
|
85
|
+
"allow_host_execution": false,
|
|
82
86
|
"config": {
|
|
83
87
|
"api_key": null,
|
|
84
88
|
"model": "gpt-4o"
|
|
@@ -87,6 +91,8 @@ By default, no shell commands are permitted. You must configure `allowed_command
|
|
|
87
91
|
```
|
|
88
92
|
|
|
89
93
|
> **Security:** The agent can only execute commands on this list. Dangerous commands (bash, curl, sudo, etc.) are always blocked regardless of configuration.
|
|
94
|
+
>
|
|
95
|
+
> **Where commands run:** Allowed commands run inside a temporary git **sandbox directory** on the host — a copy of your eval files, not your project. True container isolation (Docker) is **not yet shipped**, so the sandbox directory is the only boundary. Because of this, host execution **fails closed**: it is disabled by default and must be explicitly enabled with `"allow_host_execution": true`. With it disabled (the default), `run_command` refuses to execute and returns an error instead of running un-isolated. Enable it only when you accept that allowed commands run directly on your machine.
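The fail-closed behavior described in the note above can be pictured with a small sketch. The method names `host_execution_allowed?` and `run_or_refuse`, the error text, and the returned hash shape are hypothetical and chosen for illustration; only the `allow_host_execution` key comes from the README, and the gem's real `run_command` may be structured differently.

```ruby
# Hypothetical sketch of a fail-closed gate; names are illustrative,
# not the gem's actual API.
def host_execution_allowed?(config)
  # Only a literal `true` opts in. A missing key, nil, false, or the
  # string "true" all leave host execution disabled.
  config["allow_host_execution"] == true
end

# Runs the given block only when host execution was explicitly enabled;
# otherwise returns an error hash without ever calling the block.
def run_or_refuse(config)
  unless host_execution_allowed?(config)
    return { ok: false, error: "host execution is not enabled" }
  end

  { ok: true, result: yield }
end
```

The point of the strict `== true` comparison in this sketch is that any ambiguous or absent value is treated as a refusal rather than as permission.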
|
|
90
96
|
|
|
91
97
|
### Configuration Hierarchy
|
|
92
98
|
|
|
@@ -137,7 +143,9 @@ skill-bench init --openai
|
|
|
137
143
|
}
|
|
138
144
|
```
|
|
139
145
|
|
|
140
|
-
**Available providers:** `--openai`, `--anthropic`, `--gemini`, `--ollama`, `--azure`, `--groq`, `--deepseek`, `--opencode`
|
|
146
|
+
**Available providers:** `--openai`, `--anthropic`, `--gemini`, `--ollama`, `--azure`, `--groq`, `--deepseek`, `--mistral`, `--opencode`
|
|
147
|
+
|
|
148
|
+
**Zero-config offline path:** `skill-bench init --mock` scaffolds a minimal offline config that needs no API key and no network — `{"provider":"mock","max_execution_time":30}`. Use it to try the full flow (and run the bundled examples) before wiring up a real provider.
|
|
141
149
|
|
|
142
150
|
Use `--force` to overwrite an existing config.
|
|
143
151
|
|
|
@@ -338,7 +346,7 @@ skill-bench run my-first-eval --skill=my-service
|
|
|
338
346
|
3. **Context run** — Agent receives `task.md` + `SKILL.md` as prompt → produces output B
|
|
339
347
|
4. **Blind judging** — LLM judge scores output A and output B independently across the dimensions defined in `criteria.json`
|
|
340
348
|
5. **Delta computation** — Compare scores, compute deltas, apply pass/fail logic
|
|
341
|
-
6. **History recording** — Store result in `.skill-bench-
|
|
349
|
+
6. **History recording** — Store result in `.skill-bench-trends.json` for trend tracking
|
|
342
350
|
|
|
343
351
|
Provider is read from `skill-bench.json` — no `--provider` flag needed.
|
|
344
352
|
|
|
@@ -350,11 +358,54 @@ skill-bench run my-first-eval --skill=skill-a --skill=skill-b
|
|
|
350
358
|
|
|
351
359
|
Both skill contexts are concatenated and sent to the agent. The judge evaluates whether the combined context improves results.
|
|
352
360
|
|
|
353
|
-
**Output Formats:**
|
|
361
|
+
**Output Formats:** `--format human` (default), `json`, `junit`, or `html`.
|
|
362
|
+
|
|
363
|
+
- Human-readable (default) — full delta table, iteration timeline, and a `Tokens: N | Est. Cost: $X.XXXX` line.
|
|
364
|
+
- JSON: `--format json` — machine-readable, including top-level `tokens` and `cost` fields.
|
|
365
|
+
- JUnit XML: `--format junit` — for CI test reporting.
|
|
366
|
+
- HTML: `--format html` — a self-contained, shareable report (styles inlined, no external assets) with the delta table and iteration timeline. Redirect it to a file:
|
|
367
|
+
|
|
368
|
+
```bash
|
|
369
|
+
skill-bench run my-first-eval --skill=my-service --format html > report.html
|
|
370
|
+
```
|
|
371
|
+
|
|
372
|
+
---
|
|
373
|
+
|
|
374
|
+
## Pre-flight Checks: `validate` / `doctor`
|
|
375
|
+
|
|
376
|
+
Before spending tokens on a run, sanity-check your setup. `skill-bench validate` (aliased as `doctor`) runs read-only pre-flight checks — it never runs an eval and never makes a network call:
|
|
377
|
+
|
|
378
|
+
```bash
|
|
379
|
+
skill-bench validate
|
|
380
|
+
# or, identically:
|
|
381
|
+
skill-bench doctor
|
|
382
|
+
```
|
|
383
|
+
|
|
384
|
+
It runs three checks and prints a `PASS` / `FAIL` / `SKIP` line for each:
|
|
354
385
|
|
|
355
|
-
|
|
356
|
-
-
|
|
357
|
-
|
|
386
|
+
1. **criteria** — validates the criteria JSON (default `criteria.json`, override with `--criteria PATH`). Skipped if the default file is absent.
|
|
387
|
+
2. **config** — schema-checks `skill-bench.json` (default, override with `--config PATH`): `provider` is required and must be a known provider, `max_execution_time` must be a positive integer, and `config` (when present) must be an object.
|
|
388
|
+
3. **provider key** — reports whether the configured provider's API key is present (the `mock` provider needs none).
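As a rough illustration of the **config** check in the list above, the sketch below validates the same three rules. It is not the gem's `validate_command.rb`: the method name `config_problems`, the message wording, and the `KNOWN_PROVIDERS` list (assembled from the README's provider names plus `mock`) are assumptions, and it treats `max_execution_time` as optional, which the README does not state either way.

```ruby
require "json"

# Assumed provider names; the gem's real list may use different identifiers.
KNOWN_PROVIDERS = %w[
  openai anthropic gemini azure ollama groq deepseek mistral opencode mock
].freeze

# Hypothetical helper: returns an array of problems; empty means the
# document matches the expected shape.
def config_problems(json_text)
  data = JSON.parse(json_text)
  return ["top level must be an object"] unless data.is_a?(Hash)

  problems = []
  provider = data["provider"]
  if provider.nil?
    problems << "provider is required"
  elsif !KNOWN_PROVIDERS.include?(provider)
    problems << "unknown provider: #{provider}"
  end

  if data.key?("max_execution_time")
    time = data["max_execution_time"]
    unless time.is_a?(Integer) && time.positive?
      problems << "max_execution_time must be a positive integer"
    end
  end

  if data.key?("config") && !data["config"].is_a?(Hash)
    problems << "config must be an object"
  end
  problems
rescue JSON::ParserError => e
  ["invalid JSON: #{e.message}"]
end
```

Like the real command, this sketch only reads its input and makes no network call, so it is cheap to run before an eval.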
|
|
389
|
+
|
|
390
|
+
A passing report exits `0`:
|
|
391
|
+
|
|
392
|
+
```text
|
|
393
|
+
skill-bench validate
|
|
394
|
+
|
|
395
|
+
[PASS] criteria criteria.json is valid
|
|
396
|
+
[PASS] config skill-bench.json matches the expected shape
|
|
397
|
+
[PASS] provider key openai credentials present
|
|
398
|
+
|
|
399
|
+
All checks passed.
|
|
400
|
+
```
|
|
401
|
+
|
|
402
|
+
A failure exits non-zero and names what is wrong:
|
|
403
|
+
|
|
404
|
+
```text
|
|
405
|
+
[FAIL] provider key openai is missing: api_key
|
|
406
|
+
|
|
407
|
+
1 check(s) failed.
|
|
408
|
+
```
|
|
358
409
|
|
|
359
410
|
---
|
|
360
411
|
|
|
@@ -427,6 +478,22 @@ The `--variant` spec supports two forms:
|
|
|
427
478
|
- `pack:<name>` — resolve via registry manifest
|
|
428
479
|
- `/absolute/path` or `relative/path` — use a direct path
|
|
429
480
|
|
|
481
|
+
### Response Caching (opt-in, `--cache`)
|
|
482
|
+
|
|
483
|
+
LLM responses can be cached so identical requests reuse a previous result instead of calling the provider again. Caching is **off by default**. Enable it per run with `--cache`, or set the `SKILL_BENCH_CACHE` environment variable to a truthy value (`1`, `true`, `yes`, or `on`):
|
|
484
|
+
|
|
485
|
+
```bash
|
|
486
|
+
# Per-run flag
|
|
487
|
+
skill-bench run my-first-eval --skill=my-service --cache
|
|
488
|
+
|
|
489
|
+
# Or via the environment
|
|
490
|
+
SKILL_BENCH_CACHE=1 skill-bench run my-first-eval --skill=my-service
|
|
491
|
+
```
|
|
492
|
+
|
|
493
|
+
The cache is in-memory (process-lifetime) and content-addressed: the key is a SHA-256 digest of the provider, model, system prompt, messages, tools, and temperature, so only truly identical requests share an entry. The `mock` and null providers are never cached.
|
|
494
|
+
|
|
495
|
+
This pays off most with `compare`, which runs the skill-less baseline twice with identical inputs — with caching enabled, the repeated baseline reuses the cached response instead of making a second call.
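To make the content-addressed key concrete, here is a minimal sketch under stated assumptions. The names `cache_enabled?`, `cache_key`, and `MemoryCache` are hypothetical, the JSON serialization used before hashing is one possible choice, and the exact-match handling of truthy values is a guess; the gem's `response_cache.rb` may differ on all of these.

```ruby
require "digest"
require "json"

# Truthy values listed in the README for SKILL_BENCH_CACHE.
TRUTHY_VALUES = %w[1 true yes on].freeze

# Hypothetical: caching is on when the flag is passed or the env var is truthy.
def cache_enabled?(flag: false, env: ENV)
  flag || TRUTHY_VALUES.include?(env.fetch("SKILL_BENCH_CACHE", ""))
end

# Hypothetical: digest over the six request fields named in the README.
# Hash key order inside `messages` and `tools` affects the serialization,
# so callers would need to build requests consistently.
def cache_key(provider:, model:, system_prompt:, messages:, tools:, temperature:)
  payload = {
    provider: provider, model: model, system_prompt: system_prompt,
    messages: messages, tools: tools, temperature: temperature
  }
  Digest::SHA256.hexdigest(JSON.generate(payload))
end

# Process-lifetime store: contents disappear when the process exits.
class MemoryCache
  def initialize
    @store = {}
  end

  # Returns the stored value, or runs the block once and remembers it.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end
```

With this shape, two baseline runs that send identical requests produce the same key, so the second `fetch` returns the stored response without invoking the block.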
|
|
496
|
+
|
|
430
497
|
---
|
|
431
498
|
|
|
432
499
|
## File Reference: What Lives on Disk
|
|
@@ -446,6 +513,7 @@ SkillBench creates and manages three files in your project. Understanding them h
|
|
|
446
513
|
"provider": "openai",
|
|
447
514
|
"max_execution_time": 300,
|
|
448
515
|
"allowed_commands": ["rspec", "bundle", "ruby", "git"],
|
|
516
|
+
"allow_host_execution": false,
|
|
449
517
|
"config": {
|
|
450
518
|
"api_key": "sk-...",
|
|
451
519
|
"model": "gpt-4o",
|
|
@@ -458,10 +526,11 @@ SkillBench creates and manages three files in your project. Understanding them h
|
|
|
458
526
|
- Configuration is loaded in this order: **code defaults** → `~/.skill-bench.json` (user-wide) → `./skill-bench.json` (local) → **environment variables**. Later sources override earlier ones.
|
|
459
527
|
- If `api_key` is `null`, SkillBench looks for the matching environment variable (e.g. `SKILL_BENCH_OPENAI_API_KEY`).
|
|
460
528
|
- `allowed_commands` is a **safeguard**, not a convenience. By default the agent cannot run *any* shell command. Add only what your evals need.
|
|
529
|
+
- `allow_host_execution` (default `false`) gates whether `run_command` may run on the host when no container isolation is active. Since container isolation is not yet shipped, leaving it `false` means `run_command` **fails closed** (refuses to execute). Set it to `true` only if you accept that allowed commands run directly on your machine inside the temporary sandbox directory.
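The load order in the notes above (code defaults, then user-wide file, then local file, then environment) amounts to a layered merge in which later layers win. The sketch below shows that idea only; `resolve_config` and its keyword arguments are hypothetical, the default values are made up for the example, and the shallow merge is an assumption about how nested keys are handled.

```ruby
# Hypothetical sketch of layered configuration; not the gem's loader.
# Each layer is a Hash with string keys; later layers override earlier ones.
def resolve_config(defaults:, user: {}, local: {}, env: {})
  [defaults, user, local, env].reduce({}) do |merged, layer|
    merged.merge(layer) # shallow merge: a later layer replaces the whole value
  end
end

# Example layers (values invented for illustration).
resolved = resolve_config(
  defaults: { "provider" => "mock", "max_execution_time" => 30 },
  user:     { "provider" => "openai" },
  local:    { "max_execution_time" => 300 },
  env:      { "provider" => "anthropic" }
)
```

In this example `resolved["provider"]` comes from the environment layer and `resolved["max_execution_time"]` from the local file, matching the "later sources override earlier ones" rule.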
|
|
461
530
|
|
|
462
531
|
---
|
|
463
532
|
|
|
464
|
-
### `.skill-bench-
|
|
533
|
+
### `.skill-bench-trends.json` — Evaluation History (Auto-Generated)
|
|
465
534
|
|
|
466
535
|
**What it is:** A JSON array that records every successful eval run. SkillBench appends to it automatically. It stores the timestamp, eval name, skill names, scores, and deltas so you can track improvement over time.
|
|
467
536
|
|
|
@@ -497,13 +566,13 @@ TREND: baseline ↑ (+2), context ↑ (+7)
|
|
|
497
566
|
|
|
498
567
|
The trend compares the current run against the *previous run of the same eval + skill*. This tells you at a glance whether your latest skill edit made things better or worse.
|
|
499
568
|
|
|
500
|
-
**Pro tip:**
|
|
569
|
+
**Pro tip:** `.skill-bench-trends.json` is git-ignored by default (via the `.skill-bench-trends.json*` line in `.gitignore`). If you want to share trend data with your team, remove that line so the file can be committed.
|
|
501
570
|
|
|
502
571
|
---
|
|
503
572
|
|
|
504
|
-
### `.skill-bench-
|
|
573
|
+
### `.skill-bench-trends.json.bak` — Backup (Auto-Generated)
|
|
505
574
|
|
|
506
|
-
**What it is:** A
|
|
575
|
+
**What it is:** A snapshot of the *previous* good version of `.skill-bench-trends.json`, copied just before each new write. (The first run has no prior version yet, so it creates no `.bak`.) If the main file gets corrupted (e.g. you kill the process mid-write), SkillBench automatically falls back to the `.bak` file.
|
|
507
576
|
|
|
508
577
|
**Who edits it:** Nobody. It is a safety net.
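The backup-then-write and fallback behavior described above can be sketched as follows. This is an illustration under assumptions, not the gem's `persistence.rb`: the method names are hypothetical, file locking is omitted, and the sketch copies the current file without first checking that it is valid.

```ruby
require "json"
require "fileutils"

# Hypothetical: snapshot the previous version, then write the new one.
# On the first write there is no prior file, so no .bak is created.
def write_history(path, entries)
  FileUtils.cp(path, "#{path}.bak") if File.exist?(path)
  File.write(path, JSON.pretty_generate(entries))
end

# Hypothetical: try the main file, then the backup; return [] if neither
# holds a parseable JSON array.
def read_history(path)
  [path, "#{path}.bak"].each do |candidate|
    next unless File.exist?(candidate)

    begin
      data = JSON.parse(File.read(candidate))
      return data if data.is_a?(Array)
    rescue JSON::ParserError
      next # corrupted file: fall through to the backup
    end
  end
  []
end
```

A production version would also want to avoid overwriting a good backup with an already corrupted main file; that check is left out here to keep the sketch short.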
|
|
509
578
|
|
|
@@ -541,7 +610,7 @@ Read the output carefully. Look at **two things:**
|
|
|
541
610
|
### Step 3: Inspect the History
|
|
542
611
|
|
|
543
612
|
```bash
|
|
544
|
-
cat .skill-bench-
|
|
613
|
+
cat .skill-bench-trends.json | jq '.[-1]'
|
|
545
614
|
```
|
|
546
615
|
|
|
547
616
|
This shows the latest entry. Focus on the dimension with the smallest delta — that is where your skill is weakest.
|
|
@@ -729,6 +798,7 @@ These 5 dimensions are **mandatory** in every `criteria.json`. You can add custo
|
|
|
729
798
|
Eval: my-first-eval
|
|
730
799
|
Skill: my-service
|
|
731
800
|
Provider: openai
|
|
801
|
+
Tokens: 18432 | Est. Cost: $0.0934
|
|
732
802
|
═══════════════════════════════════════════════════════
|
|
733
803
|
|
|
734
804
|
=== BASELINE ITERATIONS ===
|
|
@@ -774,8 +844,9 @@ These 5 dimensions are **mandatory** in every `criteria.json`. You can add custo
|
|
|
774
844
|
- **CONTEXT:** The agent's score *with* the skill. This is the "aided" performance.
|
|
775
845
|
- **DELTA:** `CONTEXT - BASELINE`. How much the skill helped.
|
|
776
846
|
- **TOTAL:** Sum of all dimension scores. Max possible is 100.
|
|
777
|
-
- **TREND:** Comparison against the previous run of the same eval + skill (from `.skill-bench-
|
|
847
|
+
- **TREND:** Comparison against the previous run of the same eval + skill (from `.skill-bench-trends.json`). Shows whether scores are improving over time.
|
|
778
848
|
- **VERDICT:** `PASS` only if `CONTEXT >= pass_threshold` AND `DELTA >= minimum_delta`.
|
|
849
|
+
- **Tokens / Est. Cost:** The header shows total tokens used across the run and an estimated USD cost as `Tokens: N | Est. Cost: $X.XXXX`. The cost is approximate — it comes from a built-in per-model price table (`Services::CostCalculator`) and shows `—` when the model isn't in that table. JSON output (`--format json`) exposes the same data as top-level `tokens` and `cost` fields.
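The DELTA and VERDICT rules in the list above reduce to two comparisons. The sketch below restates them in Ruby; the `Verdict` struct, the `verdict` method, and the `"FAIL"` label for the non-passing case are hypothetical names used for illustration, while the rule itself (`CONTEXT >= pass_threshold` and `DELTA >= minimum_delta`) is taken from the README.

```ruby
# Hypothetical value object for one eval result.
Verdict = Struct.new(:baseline, :context, :delta, :status, keyword_init: true)

# DELTA is CONTEXT - BASELINE; PASS requires both conditions to hold.
def verdict(baseline:, context:, pass_threshold:, minimum_delta:)
  delta = context - baseline
  passed = context >= pass_threshold && delta >= minimum_delta
  Verdict.new(
    baseline: baseline, context: context, delta: delta,
    status: passed ? "PASS" : "FAIL"
  )
end
```

Note that a high context score alone is not enough: if the baseline was already high and the delta falls below `minimum_delta`, the result does not pass.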
|
|
779
850
|
|
|
780
851
|
**Iteration timeline:**
|
|
781
852
|
|
|
@@ -827,7 +898,7 @@ Your eval result depends on **both** conditions. Here is every scenario:
|
|
|
827
898
|
|
|
828
899
|
## Reliability & Security
|
|
829
900
|
|
|
830
|
-
- **
|
|
901
|
+
- **Allowlist-Gated Execution**: The agent can only run commands you add to `allowed_commands`; with an empty allowlist it can run nothing. Commands run inside a temporary git sandbox **directory** (a copy of the eval files) on the host — container isolation is not yet shipped, so host execution is **disabled by default** and must be explicitly opted into with `allow_host_execution: true`.
|
|
831
902
|
- **Command Blocklist**: Dangerous commands (`bash`, `sh`, `python`, `curl`, etc.) are always blocked, even if listed in `allowed_commands`.
|
|
832
903
|
- **Path Validation**: Eval paths are validated to prevent directory traversal attacks.
|
|
833
904
|
- **Atomic History Writes**: Benchmark history uses file locking to prevent corruption from concurrent writes.
|
|
@@ -836,7 +907,7 @@ Your eval result depends on **both** conditions. Here is every scenario:
|
|
|
836
907
|
- **Traceability**: Every thought and tool call is logged with full backtrace for post-mortem analysis.
|
|
837
908
|
- **Robust Error Recovery**: Handles provider outages and rate limits gracefully with standardized error logging.
|
|
838
909
|
- **XML-Safe Output**: JUnit XML output is properly escaped to prevent injection attacks.
|
|
839
|
-
- **Test Coverage**:
|
|
910
|
+
- **Test Coverage**: 700+ tests covering core engine, CLI commands, and all provider clients. Run `bundle exec rake test` to see the current count.
|
|
840
911
|
|
|
841
912
|
## Testing
|
|
842
913
|
|
|
@@ -855,21 +926,226 @@ bundle exec ruby -Itest test/integration_test.rb
|
|
|
855
926
|
|
|
856
927
|
**Test Structure:**
|
|
857
928
|
|
|
858
|
-
- `test/
|
|
859
|
-
- `test/agent_eval/` —
|
|
929
|
+
- `test/agent/` — Agent runtime tests
|
|
930
|
+
- `test/agent_eval/` — Agent evaluation tests
|
|
931
|
+
- `test/cli/` — CLI command tests
|
|
860
932
|
- `test/clients/` — Provider client tests
|
|
933
|
+
- `test/evaluator/` — Core evaluation engine tests
|
|
934
|
+
- `test/history_recorder/` — Benchmark history persistence tests
|
|
935
|
+
- `test/models/` — Domain model tests
|
|
936
|
+
- `test/registry/` — Skill/eval registry tests
|
|
937
|
+
- `test/services/` — Service layer tests
|
|
938
|
+
- `test/skills/` — Skill loading tests
|
|
939
|
+
- `test/tools/` — Agent tool tests
|
|
940
|
+
- Plus several top-level `test/*_test.rb` files (e.g. `integration_test.rb`, `evaluation_runner_test.rb`, `trend_tracker_test.rb`).
|
|
941
|
+
|
|
942
|
+
---
|
|
943
|
+
|
|
944
|
+
## Security
|
|
945
|
+
|
|
946
|
+
### Threat Model
|
|
947
|
+
|
|
948
|
+
Ruby Skill Bench is designed with security as a primary concern. The system executes AI agents in isolated environments and must protect against various attack vectors:
|
|
949
|
+
|
|
950
|
+
- **Path Traversal:** Preventing agents from accessing files outside the sandbox
|
|
951
|
+
- **Command Injection:** Preventing execution of arbitrary shell commands
|
|
952
|
+
- **Resource Exhaustion:** Preventing denial-of-service through resource consumption
|
|
953
|
+
- **Information Leakage:** Protecting sensitive data like API keys
|
|
954
|
+
|
|
955
|
+
### Security Features
|
|
956
|
+
|
|
957
|
+
#### Path Traversal Protection
|
|
958
|
+
|
|
959
|
+
- **Symlink Validation:** All symlinks are validated to ensure they don't escape the sandbox
|
|
960
|
+
- **TOCTOU Mitigation:** Path validation is re-checked after directory creation operations
|
|
961
|
+
- **Path Normalization:** All paths are normalized and validated against working directory boundaries
|
|
962
|
+
- **Character Validation:** Paths are validated against strict character patterns
|
|
963
|
+
|
|
964
|
+
#### Command Execution Security
|
|
965
|
+
|
|
966
|
+
- **Command Allowlist:** Only explicitly allowed commands can be executed
|
|
967
|
+
- **Dangerous Commands Blocklist:** Dangerous commands (bash, curl, sudo, etc.) are always blocked
|
|
968
|
+
- **Shell Tokenization:** Commands are tokenized before execution to prevent shell injection
|
|
969
|
+
- **Fail-Closed Host Execution:** Container isolation is not yet active, so commands run on the host inside a temporary sandbox directory. To match this reality, `run_command` refuses to execute unless `allow_host_execution: true` is set; it is **disabled by default**.
|
|
970
|
+
|
|
971
|
+
> **The allowlist is the only real authorization control — and it only checks the base command.** `run_command` authorizes by the first token of the command (`rake`, `find`, `git`, …); it does **not** inspect arguments. Shell tokenization stops metacharacter injection, but it does **not** sandbox what an allowlisted binary can do. Because many common tools are general-purpose execution wrappers, **allowlisting any one of them is equivalent to granting arbitrary host code execution** — for example `rake -e '...'`, `rspec -e`, `make` (arbitrary recipes), `find . -exec ...`, or `git` (hooks, `-c core.fsmonitor=...`, `! ...` aliases). Combined with the fail-closed model above (`run_command` refuses to run on the host unless `allow_host_execution` is explicitly enabled — see `HOST_EXECUTION_REFUSED` in `run_command.rb`), the practical guidance is: **keep `allowed_commands` as minimal as possible — empty for untrusted skills** — and treat every entry as if you were handing the skill a shell.
|
|
972
|
+
>
|
|
973
|
+
> An **optional, default-off** `command_argument_constraints` setting can refuse commands whose arguments contain configured substrings/flags (for example blocking `-e` or `-exec`). It is a defense-in-depth speed bump, **not** a sandbox, and is unset by default; the allowlist remains the control that matters.
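The note above is easier to see with a small example of first-token authorization. The `authorize` method below is a hypothetical sketch, not the code in `run_command.rb`, and the constraint shape it assumes (a hash mapping a command name to forbidden argument substrings) is a guess at how `command_argument_constraints` might be expressed.

```ruby
require "shellwords"

# Hypothetical sketch: authorize by base command, then optionally refuse
# arguments containing configured substrings.
# Returns [true, nil] or [false, "reason"].
def authorize(command, allowed_commands:, argument_constraints: {})
  base, *args = Shellwords.split(command)
  return [false, "empty command"] if base.nil?
  return [false, "not allowlisted: #{base}"] unless allowed_commands.include?(base)

  forbidden = Array(argument_constraints[base])
  hit = args.find { |arg| forbidden.any? { |needle| arg.include?(needle) } }
  return [false, "argument refused: #{hit}"] if hit

  [true, nil]
end

# With only the allowlist, `find ... -exec` is authorized because the
# base command is `find`; arguments are not inspected.
unconstrained = authorize("find . -exec rm {} ;", allowed_commands: ["find"])

# With an assumed constraint on `-exec`, the same command is refused.
constrained = authorize(
  "find . -exec rm {} ;",
  allowed_commands: ["find"],
  argument_constraints: { "find" => ["-exec"] }
)
```

Substring matching of this kind is easy to sidestep with other flags or wrappers, which is why the README calls it a speed bump and treats the allowlist itself as the control that matters.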
|
|
974
|
+
|
|
975
|
+
#### Docker Security Hardening (Planned — Not Yet Active)
|
|
976
|
+
|
|
977
|
+
> **Status:** The container isolation model described below is **planned, not shipped**. No Docker build context is packaged, so containers are never launched today — `run_command` runs on the host gated by the allowlist and `allow_host_execution`. The settings below document the intended hardened model for when container isolation lands.
|
|
978
|
+
|
|
979
|
+
When container isolation is enabled in a future release, containers are intended to launch with hardened security settings:
|
|
980
|
+
|
|
981
|
+
- **Non-root User:** Containers run as a non-root user
|
|
982
|
+
- **Privilege Prevention:** `--security-opt no-new-privileges` prevents privilege escalation
|
|
983
|
+
- **Capability Dropping:** All Linux capabilities are dropped except minimal needed ones
|
|
984
|
+
- **Network Isolation:** `--network none` disables network access
|
|
985
|
+
- **Read-only Root:** Container filesystem is read-only (except for mounted volumes)
|
|
986
|
+
|
|
987
|
+
#### Resource Limits
|
|
988
|
+
|
|
989
|
+
- **File Size Limits:** Individual files in context hydration are limited to 50KB
|
|
990
|
+
- **Total Context Size:** Total context size is limited to 1MB to prevent memory exhaustion
|
|
991
|
+
- **Execution Timeout:** Commands are limited to a configurable timeout (default: 30 seconds)
|
|
992
|
+
- **Max Iterations:** Agent loops are limited to prevent infinite loops
|
|
993
|
+
|
|
994
|
+
### API Key Security
|
|
995
|
+
|
|
996
|
+
- **Environment Variables:** API keys are loaded from environment variables, not hardcoded
|
|
997
|
+
- **Configuration Hierarchy:** Keys can be set in `skill-bench.json` or environment variables
|
|
998
|
+
- **No Logging:** API keys are never logged or exposed in error messages
|
|
999
|
+
- **Provider-Specific Keys:** Each provider uses its own API key configuration
|
|
1000
|
+
|
|
1001
|
+
### Best Practices for Users
|
|
1002
|
+
|
|
1003
|
+
1. **Never Commit API Keys:** Never commit `skill-bench.json` with API keys to version control
|
|
1004
|
+
2. **Use Environment Variables:** Prefer environment variables for sensitive configuration
|
|
1005
|
+
3. **Minimal Command Allowlist:** Only allow commands necessary for your evals
|
|
1006
|
+
4. **Regular Updates:** Keep dependencies updated to patch security vulnerabilities
|
|
1007
|
+
5. **Review Changes:** Review skill files before execution to ensure they don't contain malicious code
|
|
1008
|
+
|
|
1009
|
+
### Reporting Security Issues
|
|
1010
|
+
|
|
1011
|
+
To report a security vulnerability, please follow the process in
|
|
1012
|
+
[SECURITY.md](SECURITY.md). **Do not open a public issue** — use GitHub's
|
|
1013
|
+
private vulnerability reporting (Security tab) or email the maintainer at
|
|
1014
|
+
[ismael.marin@gmail.com](mailto:ismael.marin@gmail.com).
|
|
1015
|
+
|
|
1016
|
+
---
|
|
1017
|
+
|
|
1018
|
+
## Troubleshooting
|
|
1019
|
+
|
|
1020
|
+
### Common Issues and Solutions
|
|
1021
|
+
|
|
1022
|
+
#### Configuration Issues
|
|
1023
|
+
|
|
1024
|
+
**Problem:** "Config load failed, using mock provider"
|
|
1025
|
+
- **Solution:** Ensure your `skill-bench.json` file is properly formatted JSON and contains required fields
|
|
1026
|
+
- **Check:** Verify the file exists in your project root or home directory
|
|
1027
|
+
|
|
1028
|
+
**Problem:** "API Key not set for [Provider]"
|
|
1029
|
+
- **Solution:** Set the appropriate environment variable (e.g., `SKILL_BENCH_OPENAI_API_KEY`) or add it to your `skill-bench.json`
|
|
1030
|
+
- **Check:** Run `env | grep SKILL_BENCH` to verify environment variables are set
|
|
1031
|
+
|
|
1032
|
+
**Problem:** "No allowed commands configured"
|
|
1033
|
+
- **Solution:** Add `allowed_commands` array to your `skill-bench.json` with the commands you want to allow
|
|
1034
|
+
- **Check:** Ensure commands are in the allowlist and not in the dangerous commands list
|
|
1035
|
+
|
|
1036
|
+
#### Execution Issues
|
|
1037
|
+
|
|
1038
|
+
**Problem:** "Command execution timed out"
|
|
1039
|
+
- **Solution:** Increase `max_execution_time` in your `skill-bench.json` or simplify the task
|
|
1040
|
+
- **Check:** Verify the command isn't hanging or waiting for input
|
|
1041
|
+
|
|
1042
|
+
**Problem:** "Command execution refused: no sandbox isolation is active and 'allow_host_execution' is not enabled"
|
|
1043
|
+
- **Cause:** Container isolation is not yet shipped, so commands would run on the host. SkillBench fails closed by default rather than run un-isolated.
|
|
1044
|
+
- **Solution:** Set `"allow_host_execution": true` in `skill-bench.json` to permit allowed commands to run directly on the host (inside the temporary sandbox directory). Enable it only when you accept that trade-off.
|
|
1045
|
+
|
|
1046
|
+
**Problem:** "Context hydration failed"
|
|
1047
|
+
- **Solution:** Verify the source path exists and is a directory
|
|
1048
|
+
- **Check:** Ensure the path is within the base directory and file sizes are under limits
|
|
1049
|
+
|
|
1050
|
+
#### Network Issues
|
|
1051
|
+
|
|
1052
|
+
**Problem:** "Network Error: Connection refused"
|
|
1053
|
+
- **Solution:** Check your internet connection and API provider status
|
|
1054
|
+
- **Check:** Verify the base URL in your configuration is correct
|
|
1055
|
+
|
|
1056
|
+
**Problem:** "API Request failed: 429"
|
|
1057
|
+
- **Solution:** This is a rate limit error. The system will retry automatically
|
|
1058
|
+
- **Check:** Reduce request frequency or check your API quota
|
|
1059
|
+
|
|
1060
|
+
#### Test Failures
|
|
1061
|
+
|
|
1062
|
+
**Problem:** Tests fail with "WebMock::NetConnectNotAllowedError"
|
|
1063
|
+
- **Solution:** This occurs when tests try to make real HTTP requests. Ensure test stubs are properly configured
|
|
1064
|
+
- **Check:** Verify WebMock is properly stubbing the expected URLs
|
|
1065
|
+
|
|
1066
|
+
**Problem:** "E2E sibling repositories not present"
|
|
1067
|
+
- **Solution:** This is expected if you don't have the agent-mcp-runtime repository cloned
|
|
1068
|
+
- **Check:** These tests will be skipped and won't affect the overall test results
|
|
1069
|
+
|
|
1070
|
+
### Debug Mode
|
|
1071
|
+
|
|
1072
|
+
For detailed debugging, you can enable verbose logging:
|
|
1073
|
+
|
|
1074
|
+
```bash
|
|
1075
|
+
# Set environment variable for verbose logging
|
|
1076
|
+
export SKILL_BENCH_DEBUG=true
|
|
1077
|
+
skill-bench run my-eval --skill=my-skill
|
|
1078
|
+
```
|
|
1079
|
+
|
|
1080
|
+
### Getting Help
|
|
1081
|
+
|
|
1082
|
+
If you encounter issues not covered here:
|
|
1083
|
+
|
|
1084
|
+
1. Check the [GitHub Issues](https://github.com/igmarin/ruby-skill-bench/issues) for similar problems
|
|
1085
|
+
2. Create a new issue with detailed information about your environment and the problem
|
|
1086
|
+
3. Include Ruby version, SkillBench version, and error messages
|
|
1087
|
+
4. Provide steps to reproduce the issue
|
|
1088
|
+
|
|
1089
|
+
---
|
|
861
1090
|
|
|
862
1091
|
## CI/CD Integration
|
|
863
1092
|
|
|
864
|
-
|
|
1093
|
+
### Batch Runs
|
|
1094
|
+
|
|
1095
|
+
Run every eval at once instead of one at a time:
|
|
1096
|
+
|
|
1097
|
+
```bash
|
|
1098
|
+
# Every eval under the default evals/ directory
|
|
1099
|
+
skill-bench run --all --skill=my-service
|
|
1100
|
+
|
|
1101
|
+
# Or point at a specific directory
|
|
1102
|
+
skill-bench run --evals-dir path/to/evals --skill=my-service
|
|
1103
|
+
```
|
|
1104
|
+
|
|
1105
|
+
A batch run exits `0` only when **every** eval passes and non-zero if any fail, so the process exit code is itself a CI gate. Two formats are built for batch consumption:
|
|
1106
|
+
|
|
1107
|
+
- `--summary` emits an aggregate JSON gate — `passed` / `failed` / `total` counts, summed `tokens` and `cost`, and the `worst_delta` eval (the smallest context-minus-baseline delta in the batch). Archive it as a single machine-readable artifact:
|
|
1108
|
+
|
|
1109
|
+
```bash
|
|
1110
|
+
skill-bench run --all --skill=my-service --summary
|
|
1111
|
+
```
|
|
1112
|
+
|
|
1113
|
+
- `--format junit` aggregates the batch into one JUnit document with **one `<testcase>` per eval** (a `<failure>` child for each failing eval), so test reporters show per-eval results:
|
|
1114
|
+
|
|
1115
|
+
```bash
|
|
1116
|
+
skill-bench run --all --skill=my-service --format junit > junit.xml
|
|
1117
|
+
```
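As a rough sketch of consuming the `--summary` output in a CI script: the field names `passed`, `failed`, `total`, and `worst_delta` come from the description above, but the exact JSON layout (flat top-level keys, `worst_delta` as an object) and the `gate_from_summary` helper are assumptions for illustration, not the gem's documented schema.

```ruby
require 'json'

# Hypothetical helper: decides pass/fail from a --summary JSON string.
# The key layout is an assumption; inspect real output before relying on it.
def gate_from_summary(json_text)
  summary = JSON.parse(json_text)
  failed = summary.fetch('failed', 0).to_i
  total = summary.fetch('total', 0).to_i
  {
    ok: failed.zero? && total.positive?, # an empty batch is treated as a failure here
    failed: failed,
    total: total,
    worst_delta: summary['worst_delta']
  }
end

# Made-up sample payload, only to exercise the helper.
sample = '{"passed": 2, "failed": 1, "total": 3, "worst_delta": {"eval": "my-eval", "delta": -4}}'
result = gate_from_summary(sample)
puts "#{result[:failed]} of #{result[:total]} evals failed" unless result[:ok]
```

In most pipelines the process exit code is already enough; a script like this is only useful when you want to add your own rules on top of the archived summary.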
|
|
1118
|
+
|
|
1119
|
+
### GitHub Action
|
|
1120
|
+
|
|
1121
|
+
Downstream repos can gate a skill change on every push or PR with the bundled composite action. Add a step that references `igmarin/ruby-skill-bench@v1`:
|
|
1122
|
+
|
|
1123
|
+
```yaml
|
|
1124
|
+
# .github/workflows/skill-bench.yml
|
|
1125
|
+
name: skill-bench
|
|
1126
|
+
on: [pull_request]
|
|
1127
|
+
|
|
1128
|
+
jobs:
|
|
1129
|
+
skill-bench:
|
|
1130
|
+
runs-on: ubuntu-latest
|
|
1131
|
+
steps:
|
|
1132
|
+
- uses: actions/checkout@v4
|
|
1133
|
+
- uses: igmarin/ruby-skill-bench@v1
|
|
1134
|
+
with:
|
|
1135
|
+
evals-dir: evals # directory scanned for evals (default: evals)
|
|
1136
|
+
skill: skills/my-service # skill applied to every eval (default: "")
|
|
1137
|
+
format: junit # human | json | junit | html (default: junit)
|
|
1138
|
+
ruby-version: "3.3" # Ruby for ruby/setup-ruby (default: 3.3)
|
|
1139
|
+
args: --summary # extra flags appended verbatim (e.g. --summary, --pack NAME)
|
|
1140
|
+
```
|
|
1141
|
+
|
|
1142
|
+
The action installs the gem and runs `skill-bench run --all --evals-dir <evals-dir> --format <format>` (adding `--skill` when set and appending `args` verbatim). The run step's exit code is the gate. For a full copy-paste workflow template, see [`examples/ci/`](examples/ci/).
|
|
1143
|
+
|
|
1144
|
+
> The gem's own repository CI (`.github/workflows/ci.yml`) runs the test suite — rubocop, reek, and minitest against Ruby 3.3 and 3.4, on push and pull requests — and is separate from the reusable action above.
|
|
865
1145
|
|
|
866
|
-
|
|
867
|
-
- Tests against Ruby 3.3 and 3.4
|
|
868
|
-
- Executes rubocop, reek, and minitest
|
|
869
|
-
- Outputs JUnit XML for test reporting
|
|
1146
|
+
To preview the machine-readable output locally:
|
|
870
1147
|
|
|
871
1148
|
```bash
|
|
872
|
-
# Run locally with CI output
|
|
873
1149
|
skill-bench run my-eval --skill=my-skill --format json
|
|
874
1150
|
```
|
|
875
1151
|
|
data/docs/architecture.md
CHANGED
|
@@ -172,9 +172,11 @@ project-root/
|
|
|
172
172
|
│ └── my-first-eval/
|
|
173
173
|
│ ├── task.md # Agent prompt
|
|
174
174
|
│ └── criteria.json # Scoring rules
|
|
175
|
-
└── .skill-bench-
|
|
175
|
+
└── .skill-bench-trends.json # Benchmark history (auto-generated)
|
|
176
176
|
```
|
|
177
177
|
|
|
178
|
+
A `.skill-bench-trends.json.bak` file is created automatically as a backup of the trend file.
|
|
179
|
+
|
|
178
180
|
### Skill Discovery
|
|
179
181
|
|
|
180
182
|
Skills are discovered recursively. These are all valid:
|
data/docs/first-eval-guide.md
CHANGED
|
@@ -268,7 +268,7 @@ Provider is read from `skill-bench.json` — no `--provider` flag needed.
|
|
|
268
268
|
2. Agent runs **with** skill context → produces context output
|
|
269
269
|
3. Judge scores both independently → per-dimension scores
|
|
270
270
|
4. Engine computes deltas → applies pass/fail logic
|
|
271
|
-
5. Result is recorded in `.skill-bench-
|
|
271
|
+
5. Result is recorded in `.skill-bench-trends.json` for trend tracking
|
|
272
272
|
|
|
273
273
|
**Run with multiple skills:**
|
|
274
274
|
|
|
@@ -346,7 +346,7 @@ Both skill contexts are concatenated. The judge evaluates whether the combined c
|
|
|
346
346
|
| **BASELINE** | Score without skill (unaided performance). Think: "How well does the AI do on its own?" |
|
|
347
347
|
| **CONTEXT** | Score with skill (aided performance). Think: "How well does the AI do when it reads my skill?" |
|
|
348
348
|
| **DELTA** | Improvement = CONTEXT - BASELINE. Think: "How many points did my skill add?" |
|
|
349
|
-
| **TREND** | Change since the *previous* run of this exact eval + skill. Stored in `.skill-bench-
|
|
349
|
+
| **TREND** | Change since the *previous* run of this exact eval + skill. Stored in `.skill-bench-trends.json`. |
|
|
350
350
|
| **VERDICT** | PASS only if CONTEXT >= threshold AND DELTA >= minimum_delta. Both must be true. |
|
|
351
351
|
| **Iterations** | ReAct loop steps for each run: thought → tools → observation. Helps you understand *how* the agent worked. |
|
|
352
352
|
| **What went well** | Dimensions scoring ≥ 80% of max, with judge reasoning. Strengths of your skill. |
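The DELTA and VERDICT rows above reduce to a small calculation. A minimal sketch, assuming numeric scores; the `verdict` method and the sample threshold values are hypothetical and are not SkillBench's internals:

```ruby
# Hypothetical helper illustrating the table's rules; not the gem's actual code.
def verdict(baseline:, context:, threshold:, minimum_delta:)
  delta = context - baseline # DELTA = CONTEXT - BASELINE
  pass = context >= threshold && delta >= minimum_delta # both conditions must hold
  { delta: delta, verdict: pass ? 'PASS' : 'FAIL' }
end

# High context score but a tiny improvement: fails on the delta condition.
p verdict(baseline: 78, context: 80, threshold: 70, minimum_delta: 5)

# Large improvement but still below the threshold: fails on the threshold condition.
p verdict(baseline: 40, context: 60, threshold: 70, minimum_delta: 5)
```

Both sample calls fail for different reasons, which is the point of requiring the two conditions together.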
|
|
@@ -417,10 +417,10 @@ Your first run probably will not pass. That is normal. Here is how to improve.
|
|
|
417
417
|
|
|
418
418
|
### Use the History File
|
|
419
419
|
|
|
420
|
-
After each run, SkillBench appends to `.skill-bench-
|
|
420
|
+
After each run, SkillBench appends to `.skill-bench-trends.json`. You can read it to track progress:
|
|
421
421
|
|
|
422
422
|
```bash
|
|
423
|
-
cat .skill-bench-
|
|
423
|
+
cat .skill-bench-trends.json | jq '.[-1]'
|
|
424
424
|
```
|
|
425
425
|
|
|
426
426
|
Look at the dimension with the **smallest delta**. That is where your skill is weakest. Open `SKILL.md` and add a concrete rule targeting that dimension.
|
|
@@ -467,7 +467,7 @@ Created by `skill-bench init`. Stores provider, API key, model, timeout, and all
|
|
|
467
467
|
}
|
|
468
468
|
```
|
|
469
469
|
|
|
470
|
-
### `.skill-bench-
|
|
470
|
+
### `.skill-bench-trends.json` — Evaluation History (Auto-Generated)
|
|
471
471
|
|
|
472
472
|
A JSON array recording every successful eval run. SkillBench writes it automatically. It stores timestamps, eval names, skill names, scores, and deltas. This powers the **TREND** line in your output.
|
|
473
473
|
|
|
@@ -487,9 +487,9 @@ A JSON array recording every successful eval run. SkillBench writes it automatic
|
|
|
487
487
|
|
|
488
488
|
**Tip:** Commit this file to git if you want to share trend data with your team.
|
|
489
489
|
|
|
490
|
-
### `.skill-bench-
|
|
490
|
+
### `.skill-bench-trends.json.bak` — Backup (Auto-Generated)
|
|
491
491
|
|
|
492
|
-
A
|
|
492
|
+
A snapshot of the previous good version of the history file, copied just before each new write. If the main file gets corrupted, SkillBench recovers from this backup automatically. You never need to touch it.
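A minimal sketch of the snapshot-before-write pattern described above, assuming the history is a JSON array on disk. The `load_history` and `append_with_backup` helpers are hypothetical and only illustrate the idea; the gem's real implementation may differ (for example, it may copy the file rather than re-serialize it).

```ruby
require 'json'

# Hypothetical helper: read the history, falling back to the .bak snapshot
# when the main file is missing or is not valid JSON.
def load_history(path)
  JSON.parse(File.read(path))
rescue Errno::ENOENT, JSON::ParserError
  backup = "#{path}.bak"
  File.exist?(backup) ? JSON.parse(File.read(backup)) : []
end

# Hypothetical helper: snapshot the last known-good history, then write
# the new array with the entry appended.
def append_with_backup(path, entry)
  history = load_history(path)
  File.write("#{path}.bak", JSON.generate(history)) unless history.empty?
  File.write(path, JSON.pretty_generate(history + [entry]))
end
```

Calling `append_with_backup('.skill-bench-trends.json', entry)` twice leaves the first entry in the `.bak` file and both entries in the main file.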
|
|
493
493
|
|
|
494
494
|
---
|
|
495
495
|
|
data/docs/testing-guide.md
CHANGED
|
@@ -273,7 +273,7 @@ Both must be true. This prevents two failure modes:
|
|
|
273
273
|
TREND: baseline ↑ (+2), context ↑ (+7)
|
|
274
274
|
```
|
|
275
275
|
|
|
276
|
-
This compares the current run against the **previous run of the same eval + skill** (stored in `.skill-bench-
|
|
276
|
+
This compares the current run against the **previous run of the same eval + skill** (stored in `.skill-bench-trends.json`).
|
|
277
277
|
|
|
278
278
|
- `↑` = improved since last run
|
|
279
279
|
- `↓` = regressed since last run
|
|
data/lib/skill_bench/agent/react_agent/loop_runner.rb
CHANGED
|
@@ -16,6 +16,7 @@ module SkillBench
|
|
|
16
16
|
def self.call(initial_prompt, max_iterations, config)
|
|
17
17
|
messages = [{ role: 'user', content: initial_prompt }]
|
|
18
18
|
iterations_log = []
|
|
19
|
+
total_usage = empty_usage
|
|
19
20
|
step_count = 0
|
|
20
21
|
|
|
21
22
|
while step_count < max_iterations
|
|
@@ -24,24 +25,27 @@ module SkillBench
|
|
|
24
25
|
step_result = Step.call(messages, config)
|
|
25
26
|
iteration = step_result[:iteration]
|
|
26
27
|
iterations_log << attach_step_number(iteration, step_count) if iteration
|
|
28
|
+
total_usage = add_usage(total_usage, step_result[:usage])
|
|
27
29
|
|
|
28
30
|
unless step_result[:continue]
|
|
29
31
|
final_result = step_result[:result] || { success: false, response: { error: { message: 'Step returned no result' } } }
|
|
30
|
-
return
|
|
32
|
+
return finalize(final_result, iterations_log, total_usage)
|
|
31
33
|
end
|
|
32
34
|
|
|
33
35
|
messages = step_result[:messages]
|
|
34
36
|
end
|
|
35
37
|
|
|
36
|
-
|
|
38
|
+
finalize(
|
|
37
39
|
{ success: false, response: { error: { message: Agent::ReactAgent::MAX_ITERATIONS_REACHED } } },
|
|
38
|
-
iterations_log
|
|
40
|
+
iterations_log,
|
|
41
|
+
total_usage
|
|
39
42
|
)
|
|
40
43
|
rescue StandardError => e
|
|
41
44
|
SkillBench::ErrorLogger.log_error(e, 'ReactAgent Error')
|
|
42
|
-
|
|
45
|
+
finalize(
|
|
43
46
|
{ success: false, response: { error: { message: e.message } } },
|
|
44
|
-
iterations_log
|
|
47
|
+
iterations_log,
|
|
48
|
+
total_usage
|
|
45
49
|
)
|
|
46
50
|
end
|
|
47
51
|
|
|
@@ -54,14 +58,45 @@ module SkillBench
|
|
|
54
58
|
iteration.merge(step_number: step_count)
|
|
55
59
|
end
|
|
56
60
|
|
|
57
|
-
# Merges the collected iterations into the
|
|
61
|
+
# Merges the collected iterations and accumulated usage into the response.
|
|
58
62
|
#
|
|
59
63
|
# @param result [Hash] The final result hash from the loop.
|
|
60
64
|
# @param iterations_log [Array<Hash>] Collected iteration metadata.
|
|
61
|
-
# @
|
|
62
|
-
|
|
65
|
+
# @param total_usage [Hash] Summed token usage across all iterations.
|
|
66
|
+
# @return [Hash] The result with :iterations and :usage injected into :response.
|
|
67
|
+
def self.finalize(result, iterations_log, total_usage)
|
|
63
68
|
response = result[:response] || {}
|
|
64
|
-
result.merge(response: response.merge(iterations: iterations_log))
|
|
69
|
+
result.merge(response: response.merge(iterations: iterations_log, usage: total_usage))
|
|
70
|
+
end
|
|
71
|
+
|
|
72
|
+
# A zeroed token-usage accumulator.
|
|
73
|
+
#
|
|
74
|
+
# @return [Hash] Usage hash with prompt/completion/total token counts set to zero.
|
|
75
|
+
def self.empty_usage
|
|
76
|
+
{ prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 }
|
|
77
|
+
end
|
|
78
|
+
|
|
79
|
+
# Adds a single step's usage onto a running total.
|
|
80
|
+
#
|
|
81
|
+
# @param total [Hash] The running usage total.
|
|
82
|
+
# @param usage [Hash, nil] A step's usage hash (may be nil or empty).
|
|
83
|
+
# @return [Hash] A new summed usage hash.
|
|
84
|
+
def self.add_usage(total, usage)
|
|
85
|
+
usage ||= {}
|
|
86
|
+
{
|
|
87
|
+
prompt_tokens: total[:prompt_tokens] + token_count(usage, :prompt_tokens),
|
|
88
|
+
completion_tokens: total[:completion_tokens] + token_count(usage, :completion_tokens),
|
|
89
|
+
total_tokens: total[:total_tokens] + token_count(usage, :total_tokens)
|
|
90
|
+
}
|
|
91
|
+
end
|
|
92
|
+
|
|
93
|
+
# Reads a token count from a usage hash, tolerating string keys.
|
|
94
|
+
#
|
|
95
|
+
# @param usage [Hash] The usage hash.
|
|
96
|
+
# @param key [Symbol] The usage key (e.g. :prompt_tokens).
|
|
97
|
+
# @return [Integer] The token count, or zero when absent.
|
|
98
|
+
def self.token_count(usage, key)
|
|
99
|
+
(usage[key] || usage[key.to_s] || 0).to_i
|
|
65
100
|
end
|
|
66
101
|
end
|
|
67
102
|
end
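To see how the accumulator in this hunk behaves, the standalone sketch below copies the `empty_usage`, `add_usage`, and `token_count` logic into a hypothetical `UsageDemo` module (the module name, the `KEYS` constant, and the sample step data are assumptions for illustration) and folds a few step results with symbol keys, string keys, and missing usage:

```ruby
# Hypothetical standalone module mirroring the diff's usage helpers.
module UsageDemo
  KEYS = %i[prompt_tokens completion_tokens total_tokens].freeze

  # A zeroed accumulator, one entry per tracked key.
  def self.empty_usage
    KEYS.to_h { |key| [key, 0] }
  end

  # Reads a count by symbol key, then string key, defaulting to zero.
  def self.token_count(usage, key)
    (usage[key] || usage[key.to_s] || 0).to_i
  end

  # Returns a new hash; neither argument is mutated.
  def self.add_usage(total, usage)
    usage ||= {}
    KEYS.to_h { |key| [key, total[key] + token_count(usage, key)] }
  end
end

steps = [
  { usage: { prompt_tokens: 10, completion_tokens: 5, total_tokens: 15 } },
  { usage: { 'prompt_tokens' => 7, 'completion_tokens' => 3, 'total_tokens' => 10 } },
  { usage: nil } # a step that reported no usage
]

total = steps.reduce(UsageDemo.empty_usage) { |sum, step| UsageDemo.add_usage(sum, step[:usage]) }
puts total[:total_tokens] # prints 25
```

Because `add_usage` builds a fresh hash on every call, the running total can be reassigned inside the loop without aliasing the previous step's data.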
|