ruby-skill-bench 0.1.0

Files changed (119)
  1. checksums.yaml +7 -0
  2. data/LICENSE +21 -0
  3. data/README.md +794 -0
  4. data/bin/skill-bench +15 -0
  5. data/docs/architecture.md +200 -0
  6. data/docs/first-eval-guide.md +522 -0
  7. data/docs/testing-guide.md +361 -0
  8. data/lib/skill_bench/agent/react_agent/loop_runner.rb +69 -0
  9. data/lib/skill_bench/agent/react_agent/step.rb +92 -0
  10. data/lib/skill_bench/agent/react_agent/tool_executor.rb +88 -0
  11. data/lib/skill_bench/agent/react_agent.rb +58 -0
  12. data/lib/skill_bench/agent/runner.rb +108 -0
  13. data/lib/skill_bench/agent/summary.rb +39 -0
  14. data/lib/skill_bench/agent.rb +10 -0
  15. data/lib/skill_bench/cli/eval/eval_command_registry.rb +35 -0
  16. data/lib/skill_bench/cli/eval/eval_commands.rb +112 -0
  17. data/lib/skill_bench/cli/eval/eval_options.rb +75 -0
  18. data/lib/skill_bench/cli/eval_command.rb +40 -0
  19. data/lib/skill_bench/cli/help_printer.rb +47 -0
  20. data/lib/skill_bench/cli/init_command.rb +69 -0
  21. data/lib/skill_bench/cli/result_printer.rb +20 -0
  22. data/lib/skill_bench/cli/run_command.rb +72 -0
  23. data/lib/skill_bench/cli/skill_command.rb +79 -0
  24. data/lib/skill_bench/cli.rb +51 -0
  25. data/lib/skill_bench/client.rb +23 -0
  26. data/lib/skill_bench/clients/all.rb +19 -0
  27. data/lib/skill_bench/clients/base_client.rb +212 -0
  28. data/lib/skill_bench/clients/provider_config.rb +47 -0
  29. data/lib/skill_bench/clients/provider_registry.rb +56 -0
  30. data/lib/skill_bench/clients/provider_schemas.rb +73 -0
  31. data/lib/skill_bench/clients/providers/anthropic.rb +219 -0
  32. data/lib/skill_bench/clients/providers/azure_openai.rb +69 -0
  33. data/lib/skill_bench/clients/providers/deepseek.rb +39 -0
  34. data/lib/skill_bench/clients/providers/gemini.rb +63 -0
  35. data/lib/skill_bench/clients/providers/groq.rb +39 -0
  36. data/lib/skill_bench/clients/providers/null_client.rb +50 -0
  37. data/lib/skill_bench/clients/providers/ollama.rb +63 -0
  38. data/lib/skill_bench/clients/providers/openai.rb +39 -0
  39. data/lib/skill_bench/clients/providers/opencode.rb +56 -0
  40. data/lib/skill_bench/clients/providers/openrouter.rb +40 -0
  41. data/lib/skill_bench/clients/request_builder.rb +43 -0
  42. data/lib/skill_bench/clients/response_error_handler.rb +73 -0
  43. data/lib/skill_bench/clients/response_parser.rb +93 -0
  44. data/lib/skill_bench/clients/retry_handler.rb +78 -0
  45. data/lib/skill_bench/commands/eval_new.rb +89 -0
  46. data/lib/skill_bench/commands/init.rb +39 -0
  47. data/lib/skill_bench/commands/run.rb +21 -0
  48. data/lib/skill_bench/commands/skill_new.rb +115 -0
  49. data/lib/skill_bench/config/applier.rb +67 -0
  50. data/lib/skill_bench/config/defaults.rb +42 -0
  51. data/lib/skill_bench/config/env_overrides.rb +117 -0
  52. data/lib/skill_bench/config/facade_readers.rb +65 -0
  53. data/lib/skill_bench/config/facade_writers.rb +120 -0
  54. data/lib/skill_bench/config/json_loader.rb +84 -0
  55. data/lib/skill_bench/config/store.rb +177 -0
  56. data/lib/skill_bench/config.rb +172 -0
  57. data/lib/skill_bench/criteria.rb +141 -0
  58. data/lib/skill_bench/delta_report.rb +97 -0
  59. data/lib/skill_bench/dimension.rb +69 -0
  60. data/lib/skill_bench/error_logger.rb +35 -0
  61. data/lib/skill_bench/evaluate_command.rb +120 -0
  62. data/lib/skill_bench/evaluation/generator.rb +191 -0
  63. data/lib/skill_bench/evaluation/runner.rb +81 -0
  64. data/lib/skill_bench/evaluation.rb +10 -0
  65. data/lib/skill_bench/execution/context_hydrator.rb +97 -0
  66. data/lib/skill_bench/execution/sandbox.rb +174 -0
  67. data/lib/skill_bench/execution/source_path_resolver.rb +60 -0
  68. data/lib/skill_bench/execution.rb +10 -0
  69. data/lib/skill_bench/history_recorder/history_file.rb +71 -0
  70. data/lib/skill_bench/history_recorder/history_path_resolver.rb +87 -0
  71. data/lib/skill_bench/history_recorder/persistence_service.rb +38 -0
  72. data/lib/skill_bench/history_recorder/summary_service.rb +61 -0
  73. data/lib/skill_bench/history_recorder.rb +40 -0
  74. data/lib/skill_bench/interactive.rb +61 -0
  75. data/lib/skill_bench/judge/judge.rb +72 -0
  76. data/lib/skill_bench/judge/prompt.rb +121 -0
  77. data/lib/skill_bench/judge/response.rb +158 -0
  78. data/lib/skill_bench/judge.rb +10 -0
  79. data/lib/skill_bench/migration/provider_migrator.rb +30 -0
  80. data/lib/skill_bench/models/config.rb +61 -0
  81. data/lib/skill_bench/models/criteria_validator.rb +106 -0
  82. data/lib/skill_bench/models/eval.rb +81 -0
  83. data/lib/skill_bench/models/provider.rb +70 -0
  84. data/lib/skill_bench/models/skill.rb +32 -0
  85. data/lib/skill_bench/output_formatter.rb +132 -0
  86. data/lib/skill_bench/package_verifier.rb +80 -0
  87. data/lib/skill_bench/rails/skill_templates.rb +99 -0
  88. data/lib/skill_bench/runner.rb +89 -0
  89. data/lib/skill_bench/services/delta_table_formatter.rb +72 -0
  90. data/lib/skill_bench/services/feedback_generator.rb +122 -0
  91. data/lib/skill_bench/services/formatting_helpers.rb +45 -0
  92. data/lib/skill_bench/services/iteration_formatter.rb +30 -0
  93. data/lib/skill_bench/services/json_formatter.rb +18 -0
  94. data/lib/skill_bench/services/judge_score_parser_service.rb +66 -0
  95. data/lib/skill_bench/services/junit_formatter.rb +42 -0
  96. data/lib/skill_bench/services/option_parser_service.rb +63 -0
  97. data/lib/skill_bench/services/output_persistence_service.rb +77 -0
  98. data/lib/skill_bench/services/result_printer_service.rb +126 -0
  99. data/lib/skill_bench/services/runner_service.rb +381 -0
  100. data/lib/skill_bench/services/skill_resolver.rb +78 -0
  101. data/lib/skill_bench/services/template_registry/category_data.rb +73 -0
  102. data/lib/skill_bench/services/template_registry.rb +148 -0
  103. data/lib/skill_bench/task/evaluator.rb +94 -0
  104. data/lib/skill_bench/task/file_reader.rb +69 -0
  105. data/lib/skill_bench/task.rb +10 -0
  106. data/lib/skill_bench/tools/argument_parser.rb +20 -0
  107. data/lib/skill_bench/tools/base.rb +73 -0
  108. data/lib/skill_bench/tools/dispatcher.rb +61 -0
  109. data/lib/skill_bench/tools/read_file.rb +66 -0
  110. data/lib/skill_bench/tools/registry.rb +23 -0
  111. data/lib/skill_bench/tools/run_command.rb +89 -0
  112. data/lib/skill_bench/tools/write_file.rb +78 -0
  113. data/lib/skill_bench/tools.rb +33 -0
  114. data/lib/skill_bench/trend_tracker/persistence.rb +69 -0
  115. data/lib/skill_bench/trend_tracker/trend_calculator.rb +60 -0
  116. data/lib/skill_bench/trend_tracker.rb +66 -0
  117. data/lib/skill_bench/version.rb +6 -0
  118. data/lib/skill_bench.rb +103 -0
  119. metadata +247 -0
data/README.md ADDED
@@ -0,0 +1,794 @@
1
+ # Ruby Skill Bench
2
+
3
+ ![Ruby Skill Bench Logo](https://github.com/user-attachments/assets/056d7ca4-8671-41ec-9efb-e323b73fb135)
4
+
5
+
6
+ ![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/igmarin/ruby-skill-bench?utm_source=oss&utm_medium=github&utm_campaign=igmarin%2Fruby-skill-bench&labelColor=171717&color=FF570A&link=https%3A%2F%2Fcoderabbit.ai&label=CodeRabbit+Reviews)
7
+
8
+ *A high-fidelity evaluation engine for benchmarking AI agent skills across any stack (Rails-first, but extensible).*
9
+
10
+ ---
11
+
12
+ ## Features
13
+
14
+ - **Side-by-Side Evaluation**: Quantify the "ROI of Context" by comparing baseline vs. skill-enhanced agent runs.
15
+ - **Isolated Git Sandboxes**: Every run operates in a temporary repo. Clean diffs, zero side-effects, 100% reproducibility.
16
+ - **Blind Judging with Dimensions**: LLM judge scores baseline and context independently across 5 canonical dimensions (Correctness, Skill Adherence, Code Quality, Test Coverage, Documentation). Eval authors configure weights and thresholds via `criteria.json`.
17
+ - **Sophisticated ReAct Loop**: Employs a robust `Thought → Tool → Observation` loop to handle complex, multi-step engineering tasks.
18
+ - **Multi-Provider Ecosystem**: Native support for **OpenAI**, **Anthropic**, **Google Gemini**, **Azure OpenAI**, **Ollama**, **Groq**, **DeepSeek**, and **OpenCode**.
19
+ - **Standardized Intelligence**: Consistent reporting format regardless of the underlying LLM provider.
20
+
21
+ ---
22
+
23
+ ## Architecture Overview
24
+
25
+ The components are decoupled so that the reasoning engine stays agnostic of the execution environment.
26
+
27
+ ```text
28
+ CLI / API → RunnerService → Sandbox + ReAct Agent → LLM Client Layer → Provider
29
+
30
+ EvaluationRunner (baseline + context)
31
+
32
+ Judge (blind scoring)
33
+
34
+ DeltaReport
35
+ ```
36
+
37
+ ---
38
+
39
+ ## Configuration & Orchestration
40
+
41
+ ### Environment Variable Mapping
42
+
43
+ | Provider | Required Env Variables | Registry Key |
44
+ | :--- | :--- | :--- |
45
+ | **OpenAI** | `SKILL_BENCH_OPENAI_API_KEY` | `:openai` |
46
+ | **Anthropic** | `SKILL_BENCH_ANTHROPIC_API_KEY` | `:anthropic` |
47
+ | **Gemini** | `SKILL_BENCH_GEMINI_API_KEY` | `:gemini` |
48
+ | **Azure** | `SKILL_BENCH_AZURE_API_KEY` | `:azure` |
49
+ | **Ollama** | — | `:ollama` |
50
+ | **Groq** | `SKILL_BENCH_GROQ_API_KEY` | `:groq` |
51
+ | **DeepSeek** | `SKILL_BENCH_DEEPSEEK_API_KEY` | `:deepseek` |
52
+ | **OpenCode** | `SKILL_BENCH_OPENCODE_API_KEY`, `SKILL_BENCH_OPENCODE_BASE_URL` | `:opencode` |
53
+
54
+ > **Note:** Environment variables are loaded automatically. You can also configure provider settings in `skill-bench.json` (created by `skill-bench init`).
55
+ >
56
+ > **OpenCode requires a custom `base_url`:** OpenCode does not host a public LLM API. You must provide your own OpenAI-compatible endpoint (e.g. a LiteLLM proxy, self-hosted vLLM, or company gateway) via the `base_url` config key. Without it, the provider will fail with "Base URL not set for Opencode".
57
+
58
+ ### Command Allowlist
59
+
60
+ By default, no shell commands are permitted. You must configure `allowed_commands` in `skill-bench.json`:
61
+
62
+ ```json
63
+ {
64
+ "provider": "openai",
65
+ "max_execution_time": 30,
66
+ "allowed_commands": ["rspec", "bundle", "ruby", "git"],
67
+ "config": {
68
+ "api_key": null,
69
+ "model": "gpt-4o"
70
+ }
71
+ }
72
+ ```
73
+
74
+ > **Security:** The agent can only execute commands on this list. Dangerous commands (bash, curl, sudo, etc.) are always blocked regardless of configuration.
75
+
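A minimal sketch of how an allowlist check like this could work. The `ALWAYS_BLOCKED` constant and the `command_allowed?` helper are hypothetical names used for illustration, not SkillBench's actual implementation, and the denylist here contains only the three commands this README names.

```ruby
require 'shellwords'

# Hypothetical denylist: the README names bash, curl, and sudo; the real list may be longer.
ALWAYS_BLOCKED = %w[bash curl sudo].freeze

# Returns true only when the executable (first word of the command line)
# is on the allowlist and not on the denylist.
def command_allowed?(command_line, allowed_commands)
  executable = Shellwords.split(command_line).first
  return false if executable.nil?
  return false if ALWAYS_BLOCKED.include?(executable)

  allowed_commands.include?(executable)
end

allowed = %w[rspec bundle ruby git]
puts command_allowed?('rspec spec/models', allowed)        # prints true
puts command_allowed?('curl https://example.com', allowed) # prints false
puts command_allowed?('bash -c ls', allowed + ['bash'])    # prints false
```

The last call shows the denylist taking priority: adding `bash` to `allowed_commands` does not make it runnable in this sketch.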
76
+ ### Configuration Hierarchy
77
+
78
+ Configuration is loaded in this order (later sources override earlier ones):
79
+
80
+ 1. **Code defaults** — built-in defaults for provider, model, and timeout
81
+ 2. **Home JSON** — `~/.skill-bench.json` for user-wide settings
82
+ 3. **Local JSON** — `./skill-bench.json` for project-specific settings
83
+ 4. **Environment variables** — provider API keys and models from `ENV`
84
+
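To make the precedence concrete, here is a minimal sketch of a layered merge. The `DEFAULTS` values and the `resolve_config` helper are hypothetical, and the keys are flattened for brevity (the real file nests `api_key` and `model` under `config`). It is an illustration of "later sources win", not SkillBench's actual loader.

```ruby
# Hypothetical built-in defaults, for illustration only.
DEFAULTS = {
  'provider' => 'openai',
  'model' => 'gpt-4o',
  'max_execution_time' => 30
}.freeze

# Merges the four layers in precedence order: defaults, home JSON,
# local JSON, then environment variables. Later layers override earlier ones.
def resolve_config(home:, local:, env: ENV)
  merged = DEFAULTS.merge(home).merge(local)
  api_key = env['SKILL_BENCH_OPENAI_API_KEY']
  merged = merged.merge('api_key' => api_key) if api_key
  merged
end

config = resolve_config(
  home: { 'model' => 'gpt-4o-mini', 'api_key' => 'from-home-file' },
  local: { 'max_execution_time' => 300 },
  env: { 'SKILL_BENCH_OPENAI_API_KEY' => 'from-env' }
)
puts config['model']              # prints gpt-4o-mini (home overrides default)
puts config['max_execution_time'] # prints 300 (local overrides default)
puts config['api_key']            # prints from-env (ENV overrides both files)
```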
85
+ ---
86
+
87
+ ## Getting Started
88
+
89
+ ### Installation
90
+
91
+ ```bash
92
+ gem install ruby-skill-bench
93
+ ```
94
+
95
+ Or add to your `Gemfile`:
96
+
97
+ ```ruby
98
+ gem 'ruby-skill-bench'
99
+ ```
100
+
101
+ ### Usage: The 4-Step Flow
102
+
103
+ Each command creates specific files. Here is exactly what lands on disk after each step.
104
+
105
+ #### 1. Initialize Configuration
106
+
107
+ ```bash
108
+ skill-bench init --openai
109
+ ```
110
+
111
+ **Creates:** `skill-bench.json` (provider configuration)
112
+
113
+ ```json
114
+ {
115
+ "provider": "openai",
116
+ "max_execution_time": 30,
117
+ "allowed_commands": ["rspec", "bundle", "ruby", "git"],
118
+ "config": {
119
+ "api_key": null,
120
+ "model": "gpt-4o"
121
+ }
122
+ }
123
+ ```
124
+
125
+ **Available providers:** `--openai`, `--anthropic`, `--gemini`, `--ollama`, `--azure`, `--groq`, `--deepseek`, `--opencode`
126
+
127
+ Use `--force` to overwrite an existing config.
128
+
129
+ ---
130
+
131
+ #### 2. Create a Skill
132
+
133
+ ```bash
134
+ skill-bench skill new my-service --mode=rails --template=service_object
135
+ ```
136
+
137
+ **Creates:**
138
+
139
+ ```text
140
+ skills/
141
+ └── my-service/
142
+ └── SKILL.md # <- Your skill instructions go here
143
+ ```
144
+
145
+ `SKILL.md` is free-form Markdown. It typically contains:
146
+ - What pattern the skill implements (e.g., "Service Object with `.call`")
147
+ - Hard rules the agent must follow
148
+ - Code examples
149
+ - Response format expectations
150
+
151
+ **Example `SKILL.md`:**
152
+
153
+ ````markdown
154
+ # Service Object Skill
155
+
156
+ ## Pattern
157
+
158
+ All service objects use the `.call` class method and return a standardized hash:
159
+
160
+ ```ruby
161
+ { success: true, response: { data: ... } }
162
+ ```
163
+
164
+ ## Hard Rules
165
+
166
+ 1. Every `.rb` file begins with `# frozen_string_literal: true`
167
+ 2. Every public method has YARD docs (`@param`, `@return`, `@raise`)
168
+ 3. `rescue StandardError` blocks must log backtrace
169
+ ````
170
+
171
+ ---
172
+
173
+ ### Using TemplateRegistry for Rapid Eval Scaffolding
174
+
175
+ For programmatic eval creation, use `SkillBench::Services::TemplateRegistry` to generate scaffolding from pre-built templates. This is ideal for automating eval creation or building tools on top of SkillBench.
176
+
177
+ **Basic Usage:**
178
+
179
+ ```ruby
180
+ require 'skill_bench'
181
+
182
+ # Generate a task template for a CRUD service
183
+ task_content = SkillBench::Services::TemplateRegistry.call(
184
+ :task_md,
185
+ :crud,
186
+ skill_name: "UserCreator"
187
+ )
188
+
189
+ # Generate criteria JSON for an API client
190
+ criteria_content = SkillBench::Services::TemplateRegistry.call(:criteria_json, :api)
191
+
192
+ # Generate skill instructions for a background job
193
+ skill_content = SkillBench::Services::TemplateRegistry.call(
194
+ :skill_md,
195
+ :background_job,
196
+ skill_name: "OrderProcessor"
197
+ )
198
+ ```
199
+
200
+ **Available Template Types:**
201
+
202
+ | Type | Output | Purpose |
203
+ |------|--------|---------|
204
+ | `task_md` | Markdown | Agent prompt with requirements |
205
+ | `criteria_json` | JSON | Scoring rules and dimensions |
206
+ | `skill_md` | Markdown | Skill instructions for the agent |
207
+
208
+ **Supported Categories:**
209
+
210
+ | Category | Use Case |
211
+ |----------|----------|
212
+ | `crud` | Service Objects with Create, Read, Update, Delete |
213
+ | `api` | API clients with authentication and error handling |
214
+ | `background_job` | ActiveJob/Sidekiq workers with retry logic |
215
+ | `controller` | RESTful controllers with strong parameters |
216
+ | `model` | ActiveRecord models with validations |
217
+ | `migration` | Database migrations with indexes |
218
+ | `concern` | ActiveSupport::Concern modules |
219
+ | `policy` | Authorization policies (Pundit-style) |
220
+ | `form_object` | Form objects with validations |
221
+ | `view_component` | ViewComponent components with previews |
222
+
223
+ **Variable Interpolation:**
224
+
225
+ Templates support `{{variable_name}}` syntax for dynamic content:
226
+
227
+ ```ruby
228
+ # Custom variables are interpolated into templates
229
+ task = SkillBench::Services::TemplateRegistry.call(
230
+ :task_md,
231
+ :api,
232
+ skill_name: "PaymentGateway",
233
+ endpoint: "/api/v1/payments"
234
+ )
235
+ ```
236
+
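The `{{variable_name}}` substitution can be pictured with a short sketch. The `interpolate` helper below is hypothetical and is not the registry's real code. Leaving unknown placeholders untouched is an assumption; the README does not say how missing variables are handled.

```ruby
# Replaces each {{name}} placeholder with the matching value from `variables`.
# Placeholders without a matching key are left as-is (an assumption).
def interpolate(template, variables)
  template.gsub(/\{\{(\w+)\}\}/) do |placeholder|
    variables.fetch(Regexp.last_match(1).to_sym, placeholder)
  end
end

template = 'Build {{skill_name}} against {{endpoint}} using {{http_client}}'
puts interpolate(template, { skill_name: 'PaymentGateway', endpoint: '/api/v1/payments' })
# prints: Build PaymentGateway against /api/v1/payments using {{http_client}}
```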
237
+ **Complete Workflow Example:**
238
+
239
+ ```ruby
240
+ require 'fileutils'
241
+ require 'skill_bench'
242
+
243
+ # Define your skill name
244
+ skill_name = "OrderService"
245
+
246
+ # Generate all eval scaffolding
247
+ task_md = SkillBench::Services::TemplateRegistry.call(:task_md, :crud, skill_name: skill_name)
248
+ criteria_json = SkillBench::Services::TemplateRegistry.call(:criteria_json, :crud)
249
+ skill_md = SkillBench::Services::TemplateRegistry.call(:skill_md, :crud, skill_name: skill_name)
250
+
251
+ # Write to disk
252
+ FileUtils.mkdir_p("evals/order-service")
253
+ File.write("evals/order-service/task.md", task_md)
254
+ File.write("evals/order-service/criteria.json", criteria_json)
255
+
256
+ FileUtils.mkdir_p("skills/order-service")
257
+ File.write("skills/order-service/SKILL.md", skill_md)
258
+
259
+ puts "Eval scaffolding created for #{skill_name}!"
260
+ ```
261
+
262
+ > **Note:** `TemplateRegistry.call` behaves as a pure function with no side effects. It returns template strings that you can customize before writing to disk.
263
+
264
+ ---
265
+
266
+ #### 3. Create an Eval
267
+
268
+ You have two options: manual or auto-generated.
269
+
270
+ **Option A — Manual (full control):**
271
+
272
+ ```bash
273
+ skill-bench eval new my-first-eval --runtime=rails
274
+ ```
275
+
276
+ **Creates:**
277
+
278
+ ```text
279
+ evals/
280
+ └── my-first-eval/
281
+ ├── task.md # <- The task description for the agent
282
+ └── criteria.json # <- Scoring rules and dimension weights
283
+ ```
284
+
285
+ **`task.md`** tells the agent what to build. Be specific — the agent receives this as its user prompt.
286
+
287
+ **Example `task.md`:**
288
+
289
+ ```markdown
290
+ Create a `UserRegistrationService` that:
291
+
292
+ 1. Accepts `email` and `password`
293
+ 2. Validates email format with a regex
294
+ 3. Validates password length (minimum 8 characters)
295
+ 4. Returns `{ success: true, response: { user_id: ... } }` on success
296
+ 5. Returns `{ success: false, response: { error: { message: ... } } }` on failure
297
+ 6. Includes YARD documentation for every public method
298
+ 7. Includes RSpec tests that cover both success and failure paths
299
+ ```
300
+
301
+ **`criteria.json`** tells the judge how to score the agent's output. See the [Scoring Engine](#scoring-engine) section for the full format.
302
+
303
+ **Option B — Auto-Generated (from a skill):**
304
+
305
+ ```bash
306
+ skill-bench eval generate my-service --name my-first-eval
307
+ ```
308
+
309
+ Reads `skills/my-service/SKILL.md`, sends it to the LLM, and auto-generates `task.md` + `criteria.json`. The generated eval is immediately validated against the same rules as manual evals.
310
+
311
+ ---
312
+
313
+ #### 4. Run the Eval
314
+
315
+ ```bash
316
+ skill-bench run my-first-eval --skill=my-service
317
+ ```
318
+
319
+ **What happens internally:**
320
+
321
+ 1. **Resolve** — Load eval (`task.md` + `criteria.json`), skill (`SKILL.md`), and provider config
322
+ 2. **Baseline run** — Agent receives `task.md` as a prompt, no skill context → produces output A
323
+ 3. **Context run** — Agent receives `task.md` + `SKILL.md` as prompt → produces output B
324
+ 4. **Blind judging** — LLM judge scores output A and output B independently across the dimensions defined in `criteria.json`
325
+ 5. **Delta computation** — Compare scores, compute deltas, apply pass/fail logic
326
+ 6. **History recording** — Store result in `.skill-bench-history.json` for trend tracking
327
+
328
+ Provider is read from `skill-bench.json` — no `--provider` flag needed.
329
+
330
+ **Run with multiple skills (skill chaining):**
331
+
332
+ ```bash
333
+ skill-bench run my-first-eval --skill=skill-a --skill=skill-b
334
+ ```
335
+
336
+ Both skill contexts are concatenated and sent to the agent. The judge evaluates whether the combined context improves results.
337
+
338
+ **Output Formats:**
339
+
340
+ - Human-readable (default)
341
+ - JSON: `--format json`
342
+ - JUnit XML: `--format junit`
343
+
344
+ ---
345
+
346
+ ## File Reference: What Lives on Disk
347
+
348
+ SkillBench creates and manages three files in your project. Understanding them helps you iterate faster.
349
+
350
+ ### `skill-bench.json` — Your Configuration
351
+
352
+ **What it is:** The config file you create with `skill-bench init`. It tells SkillBench which LLM provider to use, your API key, timeout settings, and which shell commands the agent is allowed to run.
353
+
354
+ **Who edits it:** You. This is the only file SkillBench expects you to write by hand.
355
+
356
+ **Typical contents:**
357
+
358
+ ```json
359
+ {
360
+ "provider": "openai",
361
+ "max_execution_time": 300,
362
+ "allowed_commands": ["rspec", "bundle", "ruby", "git"],
363
+ "config": {
364
+ "api_key": "sk-...",
365
+ "model": "gpt-4o",
366
+ "max_iterations": 25
367
+ }
368
+ }
369
+ ```
370
+
371
+ **Key rules:**
372
+ - Configuration is loaded in this order: **code defaults** → `~/.skill-bench.json` (user-wide) → `./skill-bench.json` (local) → **environment variables**. Later sources override earlier ones.
373
+ - If `api_key` is `null`, SkillBench looks for the matching environment variable (e.g. `SKILL_BENCH_OPENAI_API_KEY`).
374
+ - `allowed_commands` is a **safeguard**, not a convenience. By default the agent cannot run *any* shell command. Add only what your evals need.
375
+
376
+ ---
377
+
378
+ ### `.skill-bench-history.json` — Evaluation History (Auto-Generated)
379
+
380
+ **What it is:** A JSON array that records every successful eval run. SkillBench appends to it automatically. It stores the timestamp, eval name, skill names, scores, and deltas so you can track improvement over time.
381
+
382
+ **Who edits it:** Nobody. SkillBench writes it; you read it. If you delete it, you lose your trend data.
383
+
384
+ **Example entry:**
385
+
386
+ ```json
387
+ [
388
+ {
389
+ "timestamp": "2026-05-12T10:30:00Z",
390
+ "eval_name": "my-first-eval",
391
+ "skill_names": ["my-service"],
392
+ "verdict": true,
393
+ "baseline_total": 32,
394
+ "context_total": 87,
395
+ "deltas": {
396
+ "correctness": 16,
397
+ "skill_adherence": 17,
398
+ "code_quality": 6,
399
+ "test_coverage": 10,
400
+ "documentation": 6
401
+ }
402
+ }
403
+ ]
404
+ ```
405
+
406
+ **Why it matters:** This file powers the **TREND** line you see in human-readable output:
407
+
408
+ ```text
409
+ TREND: baseline ↑ (+2), context ↑ (+7)
410
+ ```
411
+
412
+ The trend compares the current run against the *previous run of the same eval + skill*. This tells you at a glance whether your latest skill edit made things better or worse.
413
+
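As a rough sketch of that comparison, the hypothetical `trend` helper below looks up the most recent earlier entry with the same eval name and skill list, then subtracts totals. It uses the field names from the example entry above; SkillBench's own trend calculation may differ.

```ruby
# Finds the latest earlier entry for the same eval + skills and returns the
# change in totals, or nil when there is no previous run to compare against.
def trend(history, current)
  previous = history.reverse.find do |entry|
    entry['eval_name'] == current['eval_name'] &&
      entry['skill_names'] == current['skill_names']
  end
  return nil unless previous

  {
    baseline: current['baseline_total'] - previous['baseline_total'],
    context: current['context_total'] - previous['context_total']
  }
end

history = [
  { 'eval_name' => 'my-first-eval', 'skill_names' => ['my-service'],
    'baseline_total' => 30, 'context_total' => 80 }
]
current = { 'eval_name' => 'my-first-eval', 'skill_names' => ['my-service'],
            'baseline_total' => 32, 'context_total' => 87 }

change = trend(history, current)
puts "baseline #{change[:baseline]}, context #{change[:context]}"
# prints: baseline 2, context 7
```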
414
+ **Pro tip:** Commit `.skill-bench-history.json` to git if you want to share trend data with your team. Add it to `.gitignore` if you prefer to keep scores private.
415
+
416
+ ---
417
+
418
+ ### `.skill-bench-history.json.bak` — Backup (Auto-Generated)
419
+
420
+ **What it is:** A copy of `.skill-bench-history.json` created every time SkillBench writes a new entry. If the main file gets corrupted (e.g. you kill the process mid-write), SkillBench automatically falls back to the `.bak` file.
421
+
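The fallback behavior can be sketched as follows. The `load_history` helper is hypothetical and only illustrates the idea of trying the main file first and the `.bak` copy second; it is not SkillBench's actual persistence code.

```ruby
require 'json'
require 'tmpdir'

# Tries the main history file, then its .bak copy. Returns the first one that
# parses as a JSON array, or an empty array when neither is usable.
def load_history(path)
  [path, "#{path}.bak"].each do |candidate|
    next unless File.exist?(candidate)

    begin
      data = JSON.parse(File.read(candidate))
      return data if data.is_a?(Array)
    rescue JSON::ParserError
      next # corrupted file, try the next candidate
    end
  end
  []
end

# Demo: a truncated main file falls back to the backup.
Dir.mktmpdir do |dir|
  path = File.join(dir, '.skill-bench-history.json')
  File.write(path, '[{"eval_name": "my-fir') # simulated partial write
  File.write("#{path}.bak", '[{"eval_name": "my-first-eval"}]')
  puts load_history(path).first['eval_name'] # prints my-first-eval
end
```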
422
+ **Who edits it:** Nobody. It is a safety net.
423
+
424
+ **When to care:** Almost never. If you see a "History file corrupted" warning, SkillBench has already recovered from the `.bak` for you.
425
+
426
+ ---
427
+
428
+ ## Iterating on Skills: A Practical Workflow
429
+
430
+ Writing a good skill is rarely a one-shot process. Here is a tested workflow that uses the history file to guide your improvements.
431
+
432
+ ### Step 1: Write a V1 Skill
433
+
434
+ Create a skill and an eval that exercises it:
435
+
436
+ ```bash
437
+ skill-bench skill new my-service --mode=rails --template=service_object
438
+ skill-bench eval new my-first-eval --runtime=rails
439
+ # ... edit SKILL.md, task.md, and criteria.json ...
440
+ ```
441
+
442
+ ### Step 2: Run the Eval (Baseline + Context)
443
+
444
+ ```bash
445
+ skill-bench run my-first-eval --skill=my-service
446
+ ```
447
+
448
+ This executes the full evaluation pipeline: a **baseline run** (agent receives the task without the skill) and a **context run** (agent receives the task with the skill). The two outputs are scored independently by the judge and compared.
449
+
450
+ Read the output carefully. Look at **two things:**
451
+
452
+ 1. **Verdict:** Did it pass? If not, which dimension failed?
453
+ 2. **Delta:** Which dimensions improved the most? Which improved the least?
454
+
455
+ ### Step 3: Inspect the History
456
+
457
+ ```bash
458
+ jq '.[-1]' .skill-bench-history.json
459
+ ```
460
+
461
+ This shows the latest entry. Focus on the dimension with the smallest delta — that is where your skill is weakest.
462
+
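If you prefer to find the weakest dimension programmatically, a small sketch like the one below works on the `deltas` object of a history entry. The `weakest_dimensions` helper is a hypothetical name, and the sample values are copied from the example entry shown earlier.

```ruby
# Returns the names of all dimensions that share the smallest delta.
def weakest_dimensions(deltas)
  smallest = deltas.values.min
  deltas.select { |_name, delta| delta == smallest }.keys
end

deltas = {
  'correctness' => 16,
  'skill_adherence' => 17,
  'code_quality' => 6,
  'test_coverage' => 10,
  'documentation' => 6
}
p weakest_dimensions(deltas) # prints ["code_quality", "documentation"]
```

Returning every tied dimension, rather than just the first, avoids hiding a second weak spot behind an arbitrary ordering.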
463
+ ### Step 4: Edit the Skill
464
+
465
+ Suppose `test_coverage` only improved by `+3`. Open `skills/my-service/SKILL.md` and add a concrete rule:
466
+
467
+ ```markdown
468
+ ## Hard Rules
469
+
470
+ ... existing rules ...
471
+
472
+ 5. Every service must include RSpec tests with at least:
473
+ - One happy-path test
474
+ - One error-path test
475
+ - Use of `let` and `subject` blocks
476
+ ```
477
+
478
+ ### Step 5: Re-run and Compare Trends
479
+
480
+ ```bash
481
+ skill-bench run my-first-eval --skill=my-service
482
+ ```
483
+
484
+ Watch the **TREND** line:
485
+
486
+ ```text
487
+ TREND: baseline → (0), context ↑ (+5)
488
+ ```
489
+
490
+ The context score went up by 5 points compared to the previous run. If the `test_coverage` delta jumped from `+3` to `+8`, your edit worked.
491
+
492
+ ### Step 6: Iterate Until Stable
493
+
494
+ Repeat steps 4-5 until:
495
+ - The eval passes consistently (2-3 runs in a row)
496
+ - Deltas are stable (not swinging wildly)
497
+ - The trend line shows `context → (0)` or small positive deltas
498
+
499
+ ### When to Stop Iterating
500
+
501
+ | Situation | Action |
502
+ |-----------|--------|
503
+ | Context score is ~95+ and deltas are flat | Your skill is mature. Move on. |
504
+ | Context score is stuck below threshold | Your eval task might be too hard, or your skill rules are too vague. Rewrite `task.md` with clearer acceptance criteria. |
505
+ | Baseline score is already high | The task is too easy. Make `task.md` harder so the skill has room to show value. |
506
+ | One dimension is always low | Add a specific rule to `SKILL.md` targeting that dimension. |
507
+
508
+ ---
509
+
510
+ ## Scoring Engine
511
+
512
+ The engine runs every eval **twice** — once without skill context (baseline) and once with skill context — then uses an LLM judge to score both outputs independently across configurable dimensions.
513
+
514
+ ### How It Works (Visual Walkthrough)
515
+
516
+ ```text
517
+ ┌────────────────────────────────────────────────────────────────────────┐
518
+ │ EVALUATION PIPELINE │
519
+ ├────────────────────────────────────────────────────────────────────────┤
520
+ │ │
521
+ │ Step 1: Baseline Run │
522
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
523
+ │ │ task.md │───→│ Agent │───→│ Output A │ │
524
+ │ └─────────────┘ │ (no skill) │ │ (git diff) │ │
525
+ │ └─────────────┘ └─────────────┘ │
526
+ │ │
527
+ │ Step 2: Context Run │
528
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
529
+ │ │ task.md │───→│ Agent │───→│ Output B │ │
530
+ │ │ SKILL.md │───→│ (+ skill) │───→│ (git diff) │ │
531
+ │ └─────────────┘ └─────────────┘ └─────────────┘ │
532
+ │ │
533
+ │ Step 3: Blind Judging (two independent calls) │
534
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
535
+ │ │ Output A │───→│ Judge │───→│ Score A │ │
536
+ │ │ criteria │ │ (baseline) │ │ per dim │ │
537
+ │ └─────────────┘ └─────────────┘ └─────────────┘ │
538
+ │ │
539
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
540
+ │ │ Output B │───→│ Judge │───→│ Score B │ │
541
+ │ │ criteria │ │ (context) │ │ per dim │ │
542
+ │ └─────────────┘ └─────────────┘ └─────────────┘ │
543
+ │ │
544
+ │ Step 4: Verdict │
545
+ │ Delta = Score B - Score A │
546
+ │ Pass if: Score B >= pass_threshold AND Delta >= minimum_delta │
547
+ │ │
548
+ └────────────────────────────────────────────────────────────────────────┘
549
+ ```
550
+
551
+ **Key principle:** The judge never sees both outputs in the same call. This eliminates "halo effect" bias — the judge scores each output on its own merits, not by direct comparison.
552
+
553
+ ### Canonical Dimensions
554
+
555
+ These 5 dimensions are **mandatory** in every `criteria.json`. You can add custom dimensions beyond these, but you cannot remove any of the core 5.
556
+
557
+ | Dimension | Default Description | Typical Weight |
558
+ |-----------|---------------------|----------------|
559
+ | **Correctness** | Does the output fulfill the task requirements? Are all specified behaviors present and correct? | 25-35 |
560
+ | **Skill Adherence** | Did the agent follow the specific patterns, hard gates, and workflows defined in the skill? | 20-30 |
561
+ | **Code Quality** | Is the code clean, well-structured, free of smells, follows SRP, and avoids duplication? | 15-25 |
562
+ | **Test Coverage** | Are there meaningful tests? Do they test the right things? Are they following TDD/best practices? | 10-20 |
563
+ | **Documentation** | Is there adequate YARD documentation, clear intent, and helpful inline comments where needed? | 5-15 |
564
+
565
+ **Why these weights?** Correctness and Skill Adherence are usually the highest because they directly measure "did the agent do the right thing" and "did the skill help." Test Coverage and Documentation are lower because they are supporting qualities.
566
+
567
+ ### `criteria.json` Format
568
+
569
+ ```json
570
+ {
571
+ "context": "Evaluate whether the skill helps build a proper REST API collection",
572
+ "dimensions": [
573
+ { "name": "correctness", "max_score": 30 },
574
+ { "name": "skill_adherence", "max_score": 25 },
575
+ { "name": "code_quality", "max_score": 20 },
576
+ { "name": "test_coverage", "max_score": 15 },
577
+ { "name": "documentation", "max_score": 10 }
578
+ ],
579
+ "pass_threshold": 70,
580
+ "minimum_delta": 10
581
+ }
582
+ ```
583
+
584
+ **Field-by-field breakdown:**
585
+
586
+ | Field | Type | Required | Description |
587
+ |-------|------|----------|-------------|
588
+ | `context` | string | Yes | Human-readable description of what this eval measures. Shown in the judge prompt. |
589
+ | `dimensions` | array | Yes | List of dimension objects. **Must include all 5 canonical dimensions.** Each needs `name` and `max_score`. `max_score` values must sum to exactly 100. |
590
+ | `pass_threshold` | integer | No | Minimum total **context** score (0-100) to pass. Default: 70. |
591
+ | `minimum_delta` | integer | No | Minimum total improvement (context - baseline) required to pass. Default: 10. |
592
+
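The two structural constraints in the table (all five canonical dimensions present, `max_score` values summing to 100) can be checked with a sketch like this. `CORE_DIMENSIONS` and `criteria_errors` are hypothetical names for illustration; the gem's own validator may report errors differently.

```ruby
require 'json'

CORE_DIMENSIONS = %w[correctness skill_adherence code_quality test_coverage documentation].freeze

# Returns a list of human-readable problems; an empty list means the
# criteria satisfy both structural constraints.
def criteria_errors(criteria)
  dimensions = criteria.fetch('dimensions', [])
  names = dimensions.map { |dimension| dimension['name'] }
  errors = []

  missing = CORE_DIMENSIONS - names
  errors << "missing core dimensions: #{missing.join(', ')}" unless missing.empty?

  total = dimensions.sum { |dimension| dimension['max_score'].to_i }
  errors << "max_score values sum to #{total}, expected 100" unless total == 100

  errors
end

criteria = JSON.parse(<<~JSON)
  {
    "context": "Evaluate REST API collection skill",
    "dimensions": [
      { "name": "correctness", "max_score": 30 },
      { "name": "skill_adherence", "max_score": 25 },
      { "name": "code_quality", "max_score": 20 },
      { "name": "test_coverage", "max_score": 15 },
      { "name": "documentation", "max_score": 10 }
    ]
  }
JSON
puts criteria_errors(criteria).empty? # prints true
```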
593
+ **Rules:**
594
+
595
+ 1. **Sum to 100:** `dimensions` `max_score` values must sum to exactly 100. The engine rejects any eval where they don't.
596
+ 2. **All 5 core dimensions required:** You cannot omit `correctness`, `skill_adherence`, `code_quality`, `test_coverage`, or `documentation`.
597
+ 3. **Custom dimensions allowed:** You can add dimensions beyond the core 5. Their `max_score` values still count toward the 100 total.
598
+ 4. **Pass/fail logic:** Both conditions must be true:
599
+ - `context_total >= pass_threshold` (the agent with skill scored high enough)
600
+ - `total_delta >= minimum_delta` (the skill made a meaningful difference)
601
+
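The pass/fail rule above amounts to two comparisons. A minimal sketch, using a hypothetical `verdict` helper with the documented defaults of 70 and 10:

```ruby
# Applies both pass conditions: the context score must reach the threshold
# AND the improvement over baseline must reach the minimum delta.
def verdict(baseline_total:, context_total:, pass_threshold: 70, minimum_delta: 10)
  delta = context_total - baseline_total
  { delta: delta, pass: context_total >= pass_threshold && delta >= minimum_delta }
end

strong = verdict(baseline_total: 32, context_total: 87)
puts "delta #{strong[:delta]}, pass #{strong[:pass]}" # prints: delta 55, pass true

# A high context score is not enough when the baseline was already close.
flat = verdict(baseline_total: 65, context_total: 72)
puts "delta #{flat[:delta]}, pass #{flat[:pass]}" # prints: delta 7, pass false
```

The second case is the "baseline score is already high" situation: the skill-enhanced run clears the threshold, but the skill itself added too little to count.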
602
+ **Example with custom dimension descriptions:**
603
+
604
+ ```json
605
+ {
606
+ "context": "Evaluate REST API collection skill",
607
+ "dimensions": [
608
+ { "name": "correctness", "max_score": 30 },
609
+ { "name": "skill_adherence", "max_score": 25, "description": "Did the agent use the `.call` pattern and return the standardized hash?" },
610
+ { "name": "code_quality", "max_score": 20 },
611
+ { "name": "test_coverage", "max_score": 15 },
612
+ { "name": "documentation", "max_score": 10 }
613
+ ],
614
+ "pass_threshold": 70,
615
+ "minimum_delta": 10
616
+ }
617
+ ```
618
+
619
+ **Example with a custom dimension (6 total, still summing to 100):**
620
+
621
+ ```json
622
+ {
623
+ "context": "Evaluate with performance considerations",
624
+ "dimensions": [
625
+ { "name": "correctness", "max_score": 25 },
626
+ { "name": "skill_adherence", "max_score": 20 },
627
+ { "name": "code_quality", "max_score": 15 },
628
+ { "name": "test_coverage", "max_score": 15 },
629
+ { "name": "documentation", "max_score": 10 },
630
+ { "name": "performance", "max_score": 15, "description": "Is the solution performant? Are N+1 queries avoided?" }
631
+ ],
632
+ "pass_threshold": 70,
633
+ "minimum_delta": 10
634
+ }
635
+ ```
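The rules above can be checked mechanically before running an eval. The following is a minimal sketch, not the engine's actual validator: `criteria_errors` and `CORE_DIMENSIONS` are hypothetical names introduced here, and only the two rules stated above (all core dimensions present, `max_score` values summing to 100) are checked.

```ruby
require "json"

# Core dimension names, taken from rule 2 above.
CORE_DIMENSIONS = %w[correctness skill_adherence code_quality test_coverage documentation].freeze

# Hypothetical helper (not part of skill-bench): returns a list of rule violations.
def criteria_errors(criteria)
  dimensions = criteria.fetch("dimensions", [])
  names = dimensions.map { |d| d["name"] }
  total = dimensions.sum { |d| d["max_score"].to_i }

  errors = []
  missing = CORE_DIMENSIONS - names
  errors << "missing core dimensions: #{missing.join(', ')}" unless missing.empty?
  errors << "max_score values sum to #{total}, expected 100" unless total == 100
  errors
end

# The six-dimension example from above (descriptions omitted for brevity).
criteria = JSON.parse(<<~CRITERIA)
  {
    "dimensions": [
      { "name": "correctness", "max_score": 25 },
      { "name": "skill_adherence", "max_score": 20 },
      { "name": "code_quality", "max_score": 15 },
      { "name": "test_coverage", "max_score": 15 },
      { "name": "documentation", "max_score": 10 },
      { "name": "performance", "max_score": 15 }
    ],
    "pass_threshold": 70,
    "minimum_delta": 10
  }
CRITERIA

errors = criteria_errors(criteria)
puts errors.empty? ? "criteria OK" : errors
```

Custom dimensions such as `performance` pass this check as long as the six `max_score` values still total 100.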
636
+
637
+ ### Understanding the Output
638
+
639
+ **Human-readable format:**
640
+
641
+ ```text
642
+ ═══════════════════════════════════════════════════════
643
+ Eval: my-first-eval
644
+ Skill: my-service
645
+ Provider: openai
646
+ ═══════════════════════════════════════════════════════
647
+
648
+ === BASELINE ITERATIONS ===
649
+ Step 1: Read task → Tool: read_file → Observation: content...
650
+ Step 2: Plan changes → Tool: write_file → Observation: Success...
651
+ Step 3: Run tests → Tool: run_command → Observation: 3 runs, 0 failures
652
+ Step 4: Final answer
653
+
654
+ === CONTEXT ITERATIONS ===
655
+ Step 1: Read task → Tool: read_file → Observation: content...
656
+ Step 2: Apply skill pattern → Tool: write_file, run_command → Observation: Success...
657
+ Step 3: Final answer
658
+
659
+ DIMENSION                BASELINE  CONTEXT   DELTA
660
+ ──────────────────────── ───────── ───────── ───────
661
+ Correctness (30)         12        28        +16
662
+ Skill Adherence (25)     5         22        +17
663
+ Code Quality (20)        10        16        +6
664
+ Test Coverage (15)       3         13        +10
665
+ Documentation (10)       2         8         +6
666
+ ──────────────────────── ───────── ───────── ───────
667
+ TOTAL                    32/100    87/100    +55
668
+
669
+ TREND: baseline ↑ (+2), context ↑ (+7)
670
+ VERDICT: PASS (threshold: 70, minimum delta: 10)
671
+ ═══════════════════════════════════════════════════════
672
+
673
+ === WHAT WENT WELL ===
674
+ Correctness (28/30, baseline: 12/30)
675
+ The agent correctly implemented all required behaviors.
676
+ Skill Adherence (22/25, baseline: 5/25)
677
+ Followed the service object pattern and hard gates.
678
+
679
+ === WHAT WENT WRONG ===
680
+ Test Coverage (13/15, baseline: 3/15)
681
+ Tests exist but edge cases are missing.
682
+ Advice: Are there meaningful tests? Do they test the right things?
683
+ ```
684
+
685
+ **What each column means:**
686
+
687
+ - **BASELINE:** The agent's score *without* the skill. This is the "unaided" performance.
688
+ - **CONTEXT:** The agent's score *with* the skill. This is the "aided" performance.
689
+ - **DELTA:** `CONTEXT - BASELINE`. How much the skill helped.
690
+ - **TOTAL:** Sum of all dimension scores. Max possible is 100.
691
+ - **TREND:** Comparison against the previous run of the same eval + skill (from `.skill-bench-history.json`). Shows whether scores are improving over time.
692
+ - **VERDICT:** `PASS` only if the total `CONTEXT` score is `>= pass_threshold` AND the total `DELTA` is `>= minimum_delta`.
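The arithmetic behind these columns can be reproduced by hand. The sketch below recomputes the sample table from its per-dimension scores; the `SCORES` hash and the `previous` totals are illustrative assumptions (the real layout of `.skill-bench-history.json` is not shown in this document), not the engine's data model.

```ruby
# name => [max_score, baseline, context], copied from the sample output above.
SCORES = {
  "Correctness"     => [30, 12, 28],
  "Skill Adherence" => [25, 5, 22],
  "Code Quality"    => [20, 10, 16],
  "Test Coverage"   => [15, 3, 13],
  "Documentation"   => [10, 2, 8]
}.freeze

# DELTA is CONTEXT - BASELINE for every dimension.
rows = SCORES.map do |name, (max, baseline, context)|
  { label: "#{name} (#{max})", baseline: baseline, context: context, delta: context - baseline }
end

# TOTAL is the sum of the dimension scores in each column.
baseline_total = rows.sum { |r| r[:baseline] }
context_total  = rows.sum { |r| r[:context] }
total_delta    = context_total - baseline_total

rows.each do |r|
  puts format("%-24s %-9d %-9d %+d", r[:label], r[:baseline], r[:context], r[:delta])
end
puts format("%-24s %-9s %-9s %+d", "TOTAL", "#{baseline_total}/100", "#{context_total}/100", total_delta)

# TREND compares against the previous run. These previous totals are assumed
# values, chosen so the result matches the sample's "+2" and "+7".
previous = { baseline: 30, context: 80 }
trend = {
  baseline: baseline_total - previous[:baseline],
  context: context_total - previous[:context]
}
puts format("TREND: baseline %+d, context %+d", trend[:baseline], trend[:context])
```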
693
+
694
+ **Iteration timeline:**
695
+
696
+ Each run (baseline and context) shows the ReAct loop steps the agent took: thinking, calling tools, and observing results. This helps you understand *how* the agent worked through the task. Observations are truncated to keep the output readable. If the timeline is empty, the agent finished in a single LLM call without using tools.
697
+
698
+ **Feedback sections:**
699
+
700
+ - **WHAT WENT WELL** — Dimensions where the context score is ≥ 80% of the max, with the judge's reasoning. These are the strengths of your skill.
701
+ - **WHAT WENT WRONG** — Dimensions where the context score is < 80% of the max, with the judge's reasoning and the baseline score for comparison. These are where your skill needs work.
702
+ - **ADVICE** — Each low-scoring dimension shows its description from `criteria.json` as actionable guidance. If the description is empty, no advice line appears.
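The 80% split described above can be sketched as follows. This is an illustration of the stated rule only: `split_feedback` and `advice_for` are hypothetical helpers, and the scores are made-up values rather than output from a real run. Integer arithmetic is used so that a score of exactly 80% lands on the "went well" side without floating-point rounding.

```ruby
# Hypothetical helper: partitions dimensions into [went_well, went_wrong]
# using the rule "context score >= 80% of max_score".
def split_feedback(dimensions, cutoff_percent: 80)
  dimensions.partition { |d| d[:context] * 100 >= d[:max] * cutoff_percent }
end

# Hypothetical helper: an advice line appears only when a description exists.
def advice_for(dimension)
  description = dimension[:description].to_s.strip
  description.empty? ? nil : "Advice: #{description}"
end

# Made-up scores for illustration.
dimensions = [
  { name: "correctness",   context: 28, max: 30 },
  { name: "code_quality",  context: 16, max: 20 }, # exactly 80%
  { name: "test_coverage", context: 9,  max: 15,
    description: "Are there meaningful tests? Do they test the right things?" },
  { name: "documentation", context: 5,  max: 10 }  # no description, so no advice
]

went_well, went_wrong = split_feedback(dimensions)

puts "=== WHAT WENT WELL ==="
went_well.each { |d| puts "#{d[:name]} (#{d[:context]}/#{d[:max]})" }

puts "=== WHAT WENT WRONG ==="
went_wrong.each do |d|
  puts "#{d[:name]} (#{d[:context]}/#{d[:max]})"
  advice = advice_for(d)
  puts "  #{advice}" if advice
end
```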
703
+
704
+ **Verdict Decision Matrix**
705
+
706
+ Your eval result depends on **both** conditions. Here is every scenario:
707
+
708
+ | Context Score | Delta | Verdict | Why |
709
+ |---------------|-------|---------|-----|
710
+ | 87 | +55 | **PASS** | Context >= 70 **and** delta >= 10. The skill helped a lot. |
711
+ | 87 | -2 | **FAIL** | Context >= 70 **but** delta < 10. The skill made things **worse**. |
712
+ | 65 | +15 | **FAIL** | Context < 70 **even though** delta >= 10. Absolute score too low. |
713
+ | 65 | +5 | **FAIL** | Context < 70 **and** delta < 10. Both conditions failed. |
714
+
715
+ **Negative delta means the skill hurt performance.** If baseline=89 and context=87, your skill confused the agent or added noise. This is the most common "unexpected FAIL" — the skill reads well to humans but backfires with the LLM.
716
+
717
+ **FAIL example — skill made things worse:**
718
+
719
+ ```text
720
+ DIMENSION                BASELINE  CONTEXT   DELTA
721
+ ──────────────────────── ───────── ───────── ───────
722
+ Correctness (30)         28        25        -3
723
+ Skill Adherence (25)     23        22        -1
724
+ Code Quality (20)        18        18        +0
725
+ Test Coverage (15)       12        13        +1
726
+ Documentation (10)       8         9         +1
727
+ ──────────────────────── ───────── ───────── ───────
728
+ TOTAL                    89/100    87/100    -2
729
+
730
+ VERDICT: FAIL (threshold: 70, minimum delta: 10)
731
+ ```
732
+
733
+ **Why this FAILs:** Context score (87) is above the threshold (70), but the delta is **negative** (-2). The agent scored 89 *without* the skill and only 87 *with* it. The skill actively hurt performance. Common causes:
734
+ - Skill is too long or contradictory — the agent ignores the task to follow the skill
735
+ - Skill prescribes patterns that conflict with the task requirements
736
+ - Skill adds boilerplate that the judge penalizes (over-engineering)
737
+
738
+ **Fix:** Remove rules that don't directly improve the dimension with the lowest delta. Shorter skills usually beat longer ones.
739
+
740
+ ---
741
+
742
+ ## Reliability & Security
743
+
744
+ - **Safe-by-Design**: No code execution occurs on the host system; everything happens in the sandbox.
745
+ - **Command Blocklist**: Dangerous commands (`bash`, `sh`, `python`, `curl`, etc.) are always blocked, even if listed in `allowed_commands`.
746
+ - **Path Validation**: Eval paths are validated to prevent directory traversal attacks.
747
+ - **Atomic History Writes**: Benchmark history uses file locking to prevent corruption from concurrent writes.
748
+ - **URL Sanitization**: All provider URL parameters are CGI-escaped to prevent injection.
749
+ - **YAML Safety**: Config loading uses `permitted_classes: []`, so YAML cannot instantiate arbitrary Ruby objects (symbols included).
750
+ - **Traceability**: Every thought and tool call is logged with full backtrace for post-mortem analysis.
751
+ - **Robust Error Recovery**: Handles provider outages and rate limits gracefully with standardized error logging.
752
+ - **XML-Safe Output**: JUnit XML output is properly escaped to prevent injection attacks.
753
+ - **Test Coverage**: 373+ tests covering core engine, CLI commands, and all provider clients.
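Two of the guarantees above, the command blocklist and path validation, can be illustrated with a short sketch. This is an assumption about how such checks are commonly written, not the gem's implementation: `BLOCKED_COMMANDS` holds only the four commands named above (the real list is longer), and `command_allowed?` and `inside_root?` are hypothetical helpers.

```ruby
# Only the commands named in the list above; the real blocklist is longer ("etc.").
BLOCKED_COMMANDS = %w[bash sh python curl].freeze

# Hypothetical helper: the blocklist is checked first, so a blocked command
# is rejected even when it also appears in allowed_commands.
def command_allowed?(command_line, allowed_commands)
  executable = File.basename(command_line.to_s.strip.split(/\s+/).first.to_s)
  return false if executable.empty?
  return false if BLOCKED_COMMANDS.include?(executable)

  allowed_commands.include?(executable)
end

# Hypothetical helper: resolves the candidate against the root and rejects
# anything that ends up outside it (for example via "../").
def inside_root?(root, candidate)
  root_abs = File.expand_path(root)
  target = File.expand_path(candidate, root_abs)
  target == root_abs || target.start_with?(root_abs + File::SEPARATOR)
end

allowed = %w[ruby rake bash] # "bash" is listed, but the blocklist still wins

puts command_allowed?("rake test", allowed)        # allowed and not blocked
puts command_allowed?("bash -c 'ls'", allowed)     # blocked despite being listed
puts inside_root?("/tmp/evals", "my-eval/task.md") # stays inside the root
puts inside_root?("/tmp/evals", "../etc/passwd")   # escapes the root
```

Taking the basename of the executable means `/bin/sh` is treated the same as `sh`. A production check would also need to consider symlinks, which `File.expand_path` does not resolve.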
754
+
755
+ ## Testing
756
+
757
+ The project uses Minitest with WebMock for HTTP stubbing.
758
+
759
+ ```bash
760
+ # Run all tests
761
+ bundle exec rake test
762
+
763
+ # Run with coverage
764
+ bundle exec rake test COVERAGE=true
765
+
766
+ # Run specific test file
767
+ bundle exec ruby -Itest test/integration_test.rb
768
+ ```
769
+
770
+ **Test Structure:**
771
+
772
+ - `test/evaluator/` — Core evaluation engine tests
773
+ - `test/agent_eval/` — CLI, models, and service tests
774
+ - `test/clients/` — Provider client tests
775
+
776
+ ## CI/CD Integration
777
+
778
+ A GitHub Actions workflow is included (`.github/workflows/ci.yml`):
779
+
780
+ - Runs on push and pull requests
781
+ - Tests against Ruby 3.3 and 3.4
782
+ - Executes rubocop, reek, and minitest
783
+ - Outputs JUnit XML for test reporting
784
+
785
+ ```bash
786
+ # Run locally with machine-readable JSON output (e.g. for CI)
787
+ skill-bench run my-eval --skill=my-skill --format json
788
+ ```
789
+
790
+ ---
791
+
792
+ ## License
793
+
794
+ The gem is available as open source under the terms of the [MIT License](LICENSE).