ruby-skill-bench 0.1.0 → 1.0.1
- checksums.yaml +4 -4
- data/README.md +86 -0
- data/lib/skill_bench/cli/compare_command.rb +91 -0
- data/lib/skill_bench/cli/help_printer.rb +9 -1
- data/lib/skill_bench/cli/run_command.rb +6 -4
- data/lib/skill_bench/cli.rb +7 -4
- data/lib/skill_bench/clients/all.rb +1 -0
- data/lib/skill_bench/clients/providers/mock.rb +56 -0
- data/lib/skill_bench/commands/run.rb +6 -2
- data/lib/skill_bench/config/applier.rb +1 -0
- data/lib/skill_bench/config/defaults.rb +1 -0
- data/lib/skill_bench/config/facade_readers.rb +7 -0
- data/lib/skill_bench/config/json_loader.rb +3 -3
- data/lib/skill_bench/config/store.rb +5 -0
- data/lib/skill_bench/config.rb +10 -1
- data/lib/skill_bench/delta_report.rb +20 -0
- data/lib/skill_bench/execution/source_path_resolver.rb +59 -3
- data/lib/skill_bench/registry/pack_resolver.rb +119 -0
- data/lib/skill_bench/services/agent_spawner_service.rb +114 -0
- data/lib/skill_bench/services/compare_option_parser.rb +55 -0
- data/lib/skill_bench/services/comparison_reporter.rb +97 -0
- data/lib/skill_bench/services/comparison_runner.rb +49 -0
- data/lib/skill_bench/services/context_loader_service.rb +42 -0
- data/lib/skill_bench/services/error_response_builder.rb +119 -0
- data/lib/skill_bench/services/eval_resolver.rb +33 -0
- data/lib/skill_bench/services/exit_code_calculator.rb +39 -0
- data/lib/skill_bench/services/judge_params_builder.rb +54 -0
- data/lib/skill_bench/services/manifest_finder.rb +36 -0
- data/lib/skill_bench/services/output_formatter.rb +28 -0
- data/lib/skill_bench/services/prompt_builder_service.rb +98 -0
- data/lib/skill_bench/services/provider_resolver.rb +73 -0
- data/lib/skill_bench/services/runner_service.rb +84 -315
- data/lib/skill_bench/services/skill_resolver.rb +37 -9
- data/lib/skill_bench/services/skill_resolver_service.rb +70 -0
- data/lib/skill_bench/services/source_path_resolver_service.rb +45 -0
- data/lib/skill_bench/services/trend_recorder_service.rb +67 -0
- data/lib/skill_bench/services/variant_parser.rb +32 -0
- data/lib/skill_bench/services/variant_resolver.rb +63 -0
- data/lib/skill_bench/version.rb +1 -1
- metadata +23 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: d3c4edfe40e04251d2e7b758e7c630ee9affaa9e8170ceb0fa379d61bacc81e6
|
|
4
|
+
data.tar.gz: e9ef2eb8ef7a524d607c6e44705df772feec8939a376b516adff032eeeb8b535
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: b92554c769e34205d1c197bd67a9ca2ae61876b83c5429e202c667831100470fa9f1ed48a297ea184855e33e7ac3945fb513909b2344634078b8090750325dc9
|
|
7
|
+
data.tar.gz: 7ae92f1331f2061cccf42a1f27f80cbe41c73d54d0909499900efa84ad3984edada8e7df10b5a018717861234974918cdfac80b5242483df108272093eec8deb
|
data/README.md
CHANGED
|
@@ -7,6 +7,21 @@
|
|
|
7
7
|
|
|
8
8
|
*A high-fidelity evaluation engine for benchmarking AI agent skills across any stack (Rails-first, but extensible).*
|
|
9
9
|
|
|
10
|
+
## Part of the AI Skill Ecosystem
|
|
11
|
+
|
|
12
|
+
This repo is one of 6 in a composable AI skill ecosystem:
|
|
13
|
+
|
|
14
|
+
| Repo | Role |
|
|
15
|
+
|------|------|
|
|
16
|
+
| [`ruby-core-skills`](https://github.com/igmarin/ruby-core-skills) | 15 shared Ruby skills + process discipline |
|
|
17
|
+
| [`rails-agent-skills`](https://github.com/igmarin/rails-agent-skills) | 28 Rails-specific skills + 9 agents |
|
|
18
|
+
| [`hanakai-yaku`](https://github.com/igmarin/hanakai-yaku) | 35 Hanami/dry-rb skills + 10 agents |
|
|
19
|
+
| [`agnostic-planning-skills`](https://github.com/igmarin/agnostic-planning-skills) | 10 planning skills + 4 agents |
|
|
20
|
+
| [`agent-mcp-runtime`](https://github.com/igmarin/agent-mcp-runtime) | Rust CLI runtime (pack resolution, MCP) |
|
|
21
|
+
| [**`ruby-skill-bench`**](https://github.com/igmarin/ruby-skill-bench) | Benchmark/eval engine |
|
|
22
|
+
|
|
23
|
+
See the [Ecosystem Overview](https://github.com/igmarin/agent-mcp-runtime/blob/main/docs/ecosystem.md) for the full architecture.
|
|
24
|
+
|
|
10
25
|
---
|
|
11
26
|
|
|
12
27
|
## Features
|
|
@@ -343,6 +358,77 @@ Both skill contexts are concatenated and sent to the agent. The judge evaluates
|
|
|
343
358
|
|
|
344
359
|
---
|
|
345
360
|
|
|
361
|
+
## Multi-Repo Skill Benchmarking
|
|
362
|
+
|
|
363
|
+
Skills in the ecosystem are split across multiple repos:
|
|
364
|
+
- `ruby-core-skills` — 15 shared Ruby skills (DDD, patterns, process discipline)
|
|
365
|
+
- `rails-agent-skills` — 28 Rails-specific skills
|
|
366
|
+
- `hanakai-yaku` — 35 Hanami/dry-rb skills
|
|
367
|
+
|
|
368
|
+
To benchmark a skill from an external repo, use the `--skill` flag:
|
|
369
|
+
|
|
370
|
+
```bash
|
|
371
|
+
# Benchmark a core skill
|
|
372
|
+
skill-bench run evals/skills/write-yard-docs/basic \
|
|
373
|
+
--skill /path/to/ruby-core-skills/skills/patterns/write-yard-docs
|
|
374
|
+
|
|
375
|
+
# Benchmark a Rails skill
|
|
376
|
+
skill-bench run evals/skills/code-review/pr-review \
|
|
377
|
+
--skill /path/to/rails-agent-skills/skills/code-quality/code-review
|
|
378
|
+
```
|
|
379
|
+
|
|
380
|
+
### Config-Based Multi-Repo Resolution
|
|
381
|
+
|
|
382
|
+
Configure `skill_sources` in `skill-bench.json` to automatically resolve skills across repos without `--skill` every time:
|
|
383
|
+
|
|
384
|
+
```json
|
|
385
|
+
{
|
|
386
|
+
"provider": "openai",
|
|
387
|
+
"model": "gpt-4o",
|
|
388
|
+
"skill_sources": {
|
|
389
|
+
"core": "../ruby-core-skills/skills",
|
|
390
|
+
"rails": "../rails-agent-skills/skills",
|
|
391
|
+
"hanami": "../hanakai-yaku/skills"
|
|
392
|
+
}
|
|
393
|
+
}
|
|
394
|
+
```
|
|
395
|
+
|
|
396
|
+
Each key is a source name (for logging), each value is a path to a `skills/` directory. When a skill is not found locally, SkillBench iterates through `skill_sources` and uses the first match.
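The first-match rule described above can be sketched as follows. This is a minimal illustration rather than SkillBench's own code: the helper name `first_skill_match` is hypothetical, and the `<source>/<category>/<skill>/SKILL.md` layout is an assumption based on the resolver changes further down in this diff.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of the gem): returns [source_name, skill_dir]
# for the first configured source that contains
# <source>/<category>/<skill_name>/SKILL.md, or nil when no source has it.
def first_skill_match(skill_sources, skill_name)
  skill_sources.each do |source_name, source_path|
    next unless Dir.exist?(source_path)

    # Each direct child of the source directory is treated as a category.
    Dir.children(source_path).sort.each do |category|
      candidate = File.join(source_path, category, skill_name)
      return [source_name, candidate] if File.exist?(File.join(candidate, 'SKILL.md'))
    end
  end
  nil
end

# Usage: two throwaway source trees; only the second one provides the skill.
Dir.mktmpdir do |root|
  core  = File.join(root, 'core')
  rails = File.join(root, 'rails')
  FileUtils.mkdir_p(core)
  FileUtils.mkdir_p(File.join(rails, 'code-quality', 'code-review'))
  FileUtils.touch(File.join(rails, 'code-quality', 'code-review', 'SKILL.md'))

  name, path = first_skill_match({ 'core' => core, 'rails' => rails }, 'code-review')
  puts "#{name}: #{path}"
end
```

Because Ruby hashes preserve insertion order, the order of keys in `skill_sources` decides which repo wins when several provide the same skill name.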
|
|
397
|
+
|
|
398
|
+
### Pack-Based Resolution (`--pack`)
|
|
399
|
+
|
|
400
|
+
Resolve skills via the ecosystem registry manifest (from `agent-mcp-runtime`):
|
|
401
|
+
|
|
402
|
+
```bash
|
|
403
|
+
# Run an eval using the Rails pack's version of code-review
|
|
404
|
+
skill-bench run evals/skills/code-review/basic \
|
|
405
|
+
--skill code-review \
|
|
406
|
+
--pack rails
|
|
407
|
+
|
|
408
|
+
# Override the default registry manifest path
|
|
409
|
+
skill-bench run evals/skills/code-review/basic \
|
|
410
|
+
--skill code-review \
|
|
411
|
+
--pack rails \
|
|
412
|
+
--registry-manifest /path/to/registry.json
|
|
413
|
+
```
|
|
414
|
+
|
|
415
|
+
### Variant Comparison (`compare`)
|
|
416
|
+
|
|
417
|
+
Compare the same skill across two pack variants to measure context-dependent performance:
|
|
418
|
+
|
|
419
|
+
```bash
|
|
420
|
+
skill-bench compare code-review \
|
|
421
|
+
--variant-a "pack:rails" \
|
|
422
|
+
--variant-b "pack:hanami" \
|
|
423
|
+
--eval evals/skills/code-review/basic
|
|
424
|
+
```
|
|
425
|
+
|
|
426
|
+
The `--variant` spec supports two forms:
|
|
427
|
+
- `pack:<name>` — resolve via registry manifest
|
|
428
|
+
- `/absolute/path` or `relative/path` — use a direct path
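The two spec forms listed above could be told apart roughly like this. It is a sketch under stated assumptions: `parse_variant_spec` and its return shape (a Hash with `:type` plus `:name` or `:path`) are hypothetical, since the gem's `VariantParser` is not shown in this part of the diff.

```ruby
# Hypothetical parser for the two documented forms, "pack:<name>" and a
# direct path. The returned Hash shape is an assumption for illustration.
def parse_variant_spec(spec)
  raise ArgumentError, 'variant spec must not be empty' if spec.nil? || spec.strip.empty?

  if spec.start_with?('pack:')
    name = spec.delete_prefix('pack:')
    raise ArgumentError, 'pack name must not be empty' if name.empty?

    { type: :pack, name: name }
  else
    # Anything without the "pack:" prefix is treated as a path, absolute or relative.
    { type: :path, path: spec }
  end
end

p parse_variant_spec('pack:rails')
p parse_variant_spec('skills/code-quality/code-review')
```

Treating every non-prefixed spec as a path keeps the rule simple, at the cost of not catching typos such as `pak:rails`; a stricter parser might reject unknown prefixes.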
|
|
429
|
+
|
|
430
|
+
---
|
|
431
|
+
|
|
346
432
|
## File Reference: What Lives on Disk
|
|
347
433
|
|
|
348
434
|
SkillBench creates and manages three files in your project. Understanding them helps you iterate faster.
|
data/lib/skill_bench/cli/compare_command.rb
ADDED
|
@@ -0,0 +1,91 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require_relative '../services/compare_option_parser'
|
|
4
|
+
require_relative '../services/variant_parser'
|
|
5
|
+
require_relative '../services/comparison_runner'
|
|
6
|
+
require_relative '../services/comparison_reporter'
|
|
7
|
+
require_relative '../services/exit_code_calculator'
|
|
8
|
+
|
|
9
|
+
module SkillBench
|
|
10
|
+
module Cli
|
|
11
|
+
# Handles the `skill-bench compare` command.
|
|
12
|
+
# Runs the same eval with two skill variants and reports the comparison.
|
|
13
|
+
class CompareCommand
|
|
14
|
+
# Parses argv and executes the comparison.
|
|
15
|
+
#
|
|
16
|
+
# @param argv [Array<String>] Raw CLI arguments
|
|
17
|
+
# @return [Integer] Exit code
|
|
18
|
+
def self.call(argv)
|
|
19
|
+
new(argv).call
|
|
20
|
+
end
|
|
21
|
+
|
|
22
|
+
# @param argv [Array<String>] Raw CLI arguments
|
|
23
|
+
def initialize(argv)
|
|
24
|
+
@argv = argv
|
|
25
|
+
end
|
|
26
|
+
|
|
27
|
+
# Parses options, runs both variants, and prints a comparison report.
|
|
28
|
+
#
|
|
29
|
+
# @return [Integer] Exit code (0 if both pass, 1 otherwise)
|
|
30
|
+
def call
|
|
31
|
+
options = Services::CompareOptionParser.call(@argv)
|
|
32
|
+
|
|
33
|
+
skill_name = @argv.shift
|
|
34
|
+
return error_missing_skill unless skill_name
|
|
35
|
+
return error_missing_variant_a unless options[:variant_a]
|
|
36
|
+
return error_missing_variant_b unless options[:variant_b]
|
|
37
|
+
return error_missing_eval unless options[:eval]
|
|
38
|
+
|
|
39
|
+
variant_a = Services::VariantParser.call(options[:variant_a])
|
|
40
|
+
variant_b = Services::VariantParser.call(options[:variant_b])
|
|
41
|
+
|
|
42
|
+
puts "--- Running Variant A: #{options[:variant_a]} ---"
|
|
43
|
+
puts "--- Running Variant B: #{options[:variant_b]} ---"
|
|
44
|
+
|
|
45
|
+
results = Services::ComparisonRunner.call(
|
|
46
|
+
variant_a,
|
|
47
|
+
variant_b,
|
|
48
|
+
skill_name,
|
|
49
|
+
options[:eval]
|
|
50
|
+
)
|
|
51
|
+
|
|
52
|
+
Services::ComparisonReporter.call(
|
|
53
|
+
results[:result_a],
|
|
54
|
+
results[:result_b],
|
|
55
|
+
options[:variant_a],
|
|
56
|
+
options[:variant_b]
|
|
57
|
+
)
|
|
58
|
+
|
|
59
|
+
Services::ExitCodeCalculator.call(results[:result_a], results[:result_b])
|
|
60
|
+
rescue SkillBench::HelpRequested
|
|
61
|
+
0
|
|
62
|
+
rescue StandardError => e
|
|
63
|
+
warn "Error: #{e.message}"
|
|
64
|
+
1
|
|
65
|
+
end
|
|
66
|
+
|
|
67
|
+
private
|
|
68
|
+
|
|
69
|
+
def error_missing_skill
|
|
70
|
+
warn 'Error: skill name is required'
|
|
71
|
+
warn 'Usage: skill-bench compare <skill-name> --variant-a <spec> --variant-b <spec> --eval <path>'
|
|
72
|
+
1
|
|
73
|
+
end
|
|
74
|
+
|
|
75
|
+
def error_missing_variant_a
|
|
76
|
+
warn 'Error: --variant-a is required'
|
|
77
|
+
1
|
|
78
|
+
end
|
|
79
|
+
|
|
80
|
+
def error_missing_variant_b
|
|
81
|
+
warn 'Error: --variant-b is required'
|
|
82
|
+
1
|
|
83
|
+
end
|
|
84
|
+
|
|
85
|
+
def error_missing_eval
|
|
86
|
+
warn 'Error: --eval is required'
|
|
87
|
+
1
|
|
88
|
+
end
|
|
89
|
+
end
|
|
90
|
+
end
|
|
91
|
+
end
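The guard clauses and the documented exit rule in `CompareCommand` can be exercised in isolation with a small sketch. Both helper names are hypothetical, and the `:passed` key is an assumed result shape; `ExitCodeCalculator` itself is not shown in this part of the diff.

```ruby
# Hypothetical helper mirroring the guard order in CompareCommand#call:
# returns the first applicable error message, or nil when nothing is missing.
def first_compare_error(skill_name, options)
  return 'skill name is required' unless skill_name

  required = { variant_a: '--variant-a', variant_b: '--variant-b', eval: '--eval' }
  missing = required.find { |key, _flag| !options[key] }
  missing && "#{missing.last} is required"
end

# Hypothetical stand-in for the documented rule "0 if both pass, 1 otherwise".
# The :passed key is an assumption, not taken from the gem.
def comparison_exit_code(result_a, result_b)
  [result_a, result_b].all? { |result| result && result[:passed] } ? 0 : 1
end

puts first_compare_error('code-review', { variant_a: 'pack:rails' })
puts comparison_exit_code({ passed: true }, { passed: false })
```

Checking the options in a fixed order means the user sees one error at a time; collecting all missing flags into a single message would be an alternative design.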
|
data/lib/skill_bench/cli/help_printer.rb
CHANGED
|
@@ -19,11 +19,19 @@ module SkillBench
|
|
|
19
19
|
Providers: #{providers}
|
|
20
20
|
--force Overwrite existing config file
|
|
21
21
|
|
|
22
|
-
run <eval> --skill <name> [--skill <name>] [--format FORMAT]
|
|
22
|
+
run <eval> --skill <name> [--skill <name>] [--format FORMAT] [--pack NAME]
|
|
23
23
|
Run an evaluation
|
|
24
24
|
--skill Skill to use (can be specified multiple times)
|
|
25
|
+
--pack Pack context for registry-based skill resolution
|
|
26
|
+
--registry-manifest PATH Path to registry.json manifest
|
|
25
27
|
--format Output format: human, json, junit (default: human)
|
|
26
28
|
|
|
29
|
+
compare <skill-name> --variant-a SPEC --variant-b SPEC --eval PATH
|
|
30
|
+
Compare the same skill across two pack variants
|
|
31
|
+
--variant-a First variant (e.g., "pack:rails" or "/path/to/skill")
|
|
32
|
+
--variant-b Second variant (e.g., "pack:hanami")
|
|
33
|
+
--eval Path to the eval directory
|
|
34
|
+
|
|
27
35
|
skill new <name> [--mode MODE] [--template TYPE]
|
|
28
36
|
Create a new skill
|
|
29
37
|
--mode simple, advanced, or rails (default: simple)
|
data/lib/skill_bench/cli/run_command.rb
CHANGED
|
@@ -29,7 +29,7 @@ module SkillBench
|
|
|
29
29
|
|
|
30
30
|
eval_name = @argv.shift
|
|
31
31
|
return error_missing_eval unless eval_name
|
|
32
|
-
return error_missing_skill if options[:skill_names].empty?
|
|
32
|
+
return error_missing_skill if options[:skill_names].empty? && !options[:pack]
|
|
33
33
|
|
|
34
34
|
options[:eval_name] = eval_name
|
|
35
35
|
exec_options = options.reject { |key| key == :format }
|
|
@@ -48,6 +48,8 @@ module SkillBench
|
|
|
48
48
|
OptionParser.new do |opts|
|
|
49
49
|
opts.banner = 'Usage: skill-bench run <eval> [options]'
|
|
50
50
|
opts.on('--skill NAME', 'Skill to use (can be specified multiple times)') { |v| options[:skill_names] << v }
|
|
51
|
+
opts.on('--pack NAME', 'Pack context for skill resolution') { |v| options[:pack] = v }
|
|
52
|
+
opts.on('--registry-manifest PATH', 'Path to registry.json manifest') { |v| options[:registry_manifest] = v }
|
|
51
53
|
opts.on('--format FORMAT', 'Output format (human, json, junit)') { |v| options[:format] = v.to_sym }
|
|
52
54
|
opts.on('-h', '--help', 'Prints this help') do
|
|
53
55
|
puts opts
|
|
@@ -58,13 +60,13 @@ module SkillBench
|
|
|
58
60
|
|
|
59
61
|
def error_missing_eval
|
|
60
62
|
warn 'Error: eval name is required'
|
|
61
|
-
warn 'Usage: skill-bench run <eval> --skill <name>'
|
|
63
|
+
warn 'Usage: skill-bench run <eval> [--skill <name>] [--pack <name>]'
|
|
62
64
|
1
|
|
63
65
|
end
|
|
64
66
|
|
|
65
67
|
def error_missing_skill
|
|
66
|
-
warn 'Error: skill name is required'
|
|
67
|
-
warn 'Usage: skill-bench run <eval> --skill <name>'
|
|
68
|
+
warn 'Error: skill name or pack is required'
|
|
69
|
+
warn 'Usage: skill-bench run <eval> --skill <name> [--pack <name>]'
|
|
68
70
|
1
|
|
69
71
|
end
|
|
70
72
|
end
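The new `--pack` and `--registry-manifest` flags and the relaxed skill-or-pack guard can be tried out with a standalone sketch built on Ruby's standard `OptionParser`. The method names `parse_run_args` and `skill_or_pack_given?` are hypothetical; only the flag names and the guard condition come from the diff above.

```ruby
require 'optparse'

# Sketch of the option handling: parse! consumes the recognized flags and
# leaves positional arguments (such as the eval path) in argv.
def parse_run_args(argv)
  options = { skill_names: [] }
  OptionParser.new do |opts|
    opts.on('--skill NAME') { |v| options[:skill_names] << v }
    opts.on('--pack NAME') { |v| options[:pack] = v }
    opts.on('--registry-manifest PATH') { |v| options[:registry_manifest] = v }
    opts.on('--format FORMAT') { |v| options[:format] = v.to_sym }
  end.parse!(argv)
  options
end

# Mirrors the relaxed guard: a run is acceptable with --skill, --pack, or both.
def skill_or_pack_given?(options)
  !options[:skill_names].empty? || !options[:pack].nil?
end

argv = ['--pack', 'rails', '--registry-manifest', 'registry.json', 'evals/skills/code-review/basic']
options = parse_run_args(argv)
p options
p argv # only the positional eval path is left
```

Note that the usage string printed by `error_missing_skill` still shows `--skill <name>` as mandatory, while the guard accepts `--pack` alone.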
|
data/lib/skill_bench/cli.rb
CHANGED
|
@@ -2,6 +2,7 @@
|
|
|
2
2
|
|
|
3
3
|
require_relative 'cli/init_command'
|
|
4
4
|
require_relative 'cli/run_command'
|
|
5
|
+
require_relative 'cli/compare_command'
|
|
5
6
|
require_relative 'cli/skill_command'
|
|
6
7
|
require_relative 'cli/eval_command'
|
|
7
8
|
require_relative 'cli/help_printer'
|
|
@@ -18,6 +19,7 @@ module SkillBench
|
|
|
18
19
|
# @param argv [Array<String>] Raw CLI arguments.
|
|
19
20
|
# @return [Integer] Exit code.
|
|
20
21
|
def self.call(argv)
|
|
22
|
+
Config.reset
|
|
21
23
|
new(argv).call
|
|
22
24
|
end
|
|
23
25
|
|
|
@@ -35,10 +37,11 @@ module SkillBench
|
|
|
35
37
|
|
|
36
38
|
subcommand = @argv.shift
|
|
37
39
|
case subcommand
|
|
38
|
-
when 'init'
|
|
39
|
-
when 'run'
|
|
40
|
-
when '
|
|
41
|
-
when '
|
|
40
|
+
when 'init' then Cli::InitCommand.call(@argv)
|
|
41
|
+
when 'run' then Cli::RunCommand.call(@argv)
|
|
42
|
+
when 'compare' then Cli::CompareCommand.call(@argv)
|
|
43
|
+
when 'skill' then Cli::SkillCommand.call(@argv)
|
|
44
|
+
when 'eval' then Cli::EvalCommand.call(@argv)
|
|
42
45
|
when '-h', '--help', 'help'
|
|
43
46
|
help.call
|
|
44
47
|
else
|
data/lib/skill_bench/clients/providers/mock.rb
ADDED
|
@@ -0,0 +1,56 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require_relative '../provider_registry'
|
|
4
|
+
require 'json'
|
|
5
|
+
|
|
6
|
+
module SkillBench
|
|
7
|
+
module Clients
|
|
8
|
+
module Providers
|
|
9
|
+
# Mock LLM client for testing and local validation.
|
|
10
|
+
class Mock
|
|
11
|
+
SkillBench::Clients::ProviderRegistry.register(:mock, self)
|
|
12
|
+
|
|
13
|
+
# Mock call implementation to simulate LLM responses for test suites.
|
|
14
|
+
#
|
|
15
|
+
# @param system_prompt [String] system prompt instructions.
|
|
16
|
+
# @param messages [Array<Hash>] chat history messages.
|
|
17
|
+
# @param _options [Hash] additional keyword options.
|
|
18
|
+
# @return [Hash] mock response hash.
|
|
19
|
+
def self.call(system_prompt:, messages:, **_options)
|
|
20
|
+
_ = system_prompt
|
|
21
|
+
prompt = messages.first[:content] || messages.first['content'] || ''
|
|
22
|
+
|
|
23
|
+
# Parse dimensions from prompt
|
|
24
|
+
dimensions = {}
|
|
25
|
+
prompt.scan(/-\s+([^:]+):\s+max_score=(\d+)/).each do |name, max_score|
|
|
26
|
+
max = max_score.to_i
|
|
27
|
+
# Give baseline slightly lower score than context to simulate improvement
|
|
28
|
+
is_context = prompt.match?(/## Skill Context\s+\S+/)
|
|
29
|
+
score = is_context ? (max * 0.95).round : (max * 0.8).round
|
|
30
|
+
dimensions[name] = {
|
|
31
|
+
'score' => score,
|
|
32
|
+
'max_score' => max,
|
|
33
|
+
'reasoning' => "Mock evaluation for #{name}"
|
|
34
|
+
}
|
|
35
|
+
end
|
|
36
|
+
|
|
37
|
+
dimensions['correctness'] = { 'score' => 8, 'max_score' => 10, 'reasoning' => 'Mock correctness' } if dimensions.empty?
|
|
38
|
+
|
|
39
|
+
content = {
|
|
40
|
+
'dimensions' => dimensions,
|
|
41
|
+
'overall_reasoning' => 'Mock evaluation overall reasoning'
|
|
42
|
+
}.to_json
|
|
43
|
+
|
|
44
|
+
{
|
|
45
|
+
success: true,
|
|
46
|
+
response: {
|
|
47
|
+
message: {
|
|
48
|
+
content: content
|
|
49
|
+
}
|
|
50
|
+
}
|
|
51
|
+
}
|
|
52
|
+
end
|
|
53
|
+
end
|
|
54
|
+
end
|
|
55
|
+
end
|
|
56
|
+
end
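The scoring rule in the mock provider (95% of `max_score` when the prompt carries a non-empty `## Skill Context` section, 80% otherwise) can be reproduced in a few lines for experimentation. This is a re-implementation sketch: `mock_dimension_scores` is a hypothetical name, and the fallback `correctness` dimension used for prompts without a rubric is left out.

```ruby
# Hypothetical re-implementation of the mock scoring rule.
# Rubric lines are expected in the form "- <name>: max_score=<n>".
def mock_dimension_scores(prompt)
  has_context = prompt.match?(/## Skill Context\s+\S+/)
  ratio = has_context ? 0.95 : 0.8

  prompt.scan(/-\s+([^:]+):\s+max_score=(\d+)/).to_h do |name, max_score|
    [name, (max_score.to_i * ratio).round]
  end
end

rubric = "- correctness: max_score=20\n- style: max_score=100\n"
p mock_dimension_scores(rubric)
p mock_dimension_scores("## Skill Context\nUse YARD tags.\n\n#{rubric}")
```

The sketch evaluates the context check once per prompt; the diff performs the same check inside the loop, which gives the same result for every dimension.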
|
data/lib/skill_bench/commands/run.rb
CHANGED
|
@@ -9,11 +9,15 @@ module SkillBench
|
|
|
9
9
|
# Run an eval with specified skill(s)
|
|
10
10
|
# @param eval_name [String] Name of eval to run (e.g., 'test-eval' or 'evals/test-eval')
|
|
11
11
|
# @param skill_names [Array<String>] Names of skills to use
|
|
12
|
+
# @param pack [String, nil] Optional pack name for registry-based skill resolution
|
|
13
|
+
# @param registry_manifest [String, nil] Optional path to registry.json manifest
|
|
12
14
|
# @return [Hash] Result with pass/fail and score
|
|
13
|
-
def self.run(eval_name:, skill_names:)
|
|
15
|
+
def self.run(eval_name:, skill_names:, pack: nil, registry_manifest: nil)
|
|
14
16
|
Services::RunnerService.call(
|
|
15
17
|
eval_name: eval_name,
|
|
16
|
-
skill_names: skill_names
|
|
18
|
+
skill_names: skill_names,
|
|
19
|
+
pack: pack,
|
|
20
|
+
registry_manifest: registry_manifest
|
|
17
21
|
)
|
|
18
22
|
end
|
|
19
23
|
end
|
data/lib/skill_bench/config/applier.rb
CHANGED
|
@@ -41,6 +41,7 @@ module SkillBench
|
|
|
41
41
|
assign_current_provider
|
|
42
42
|
@store.assign_max_execution_time(@data[:max_execution_time]) if @data.key?(:max_execution_time)
|
|
43
43
|
@store.assign_allowed_commands(@data[:allowed_commands]) if @data.key?(:allowed_commands)
|
|
44
|
+
@store.skill_sources = @data[:skill_sources] if @data.key?(:skill_sources)
|
|
44
45
|
end
|
|
45
46
|
|
|
46
47
|
def apply_provider_values
|
data/lib/skill_bench/config/defaults.rb
CHANGED
|
@@ -19,6 +19,7 @@ module SkillBench
|
|
|
19
19
|
current_llm_provider: :openai,
|
|
20
20
|
max_execution_time: 30,
|
|
21
21
|
allowed_commands: nil,
|
|
22
|
+
skill_sources: {},
|
|
22
23
|
llm_providers_config: {
|
|
23
24
|
openai: { api_key: nil, model: 'gpt-4o' },
|
|
24
25
|
anthropic: { api_key: nil, model: 'claude-sonnet-4-20250514' },
|
data/lib/skill_bench/config/facade_readers.rb
CHANGED
|
@@ -32,6 +32,13 @@ module SkillBench
|
|
|
32
32
|
store.llm_providers_config
|
|
33
33
|
end
|
|
34
34
|
|
|
35
|
+
# Returns skill sources mapping.
|
|
36
|
+
#
|
|
37
|
+
# @return [Hash, nil] skill source name → directory path
|
|
38
|
+
def skill_sources
|
|
39
|
+
store.skill_sources
|
|
40
|
+
end
|
|
41
|
+
|
|
35
42
|
# Returns the API key for the current LLM provider.
|
|
36
43
|
#
|
|
37
44
|
# @return [String, nil] API key for the current provider
|
data/lib/skill_bench/config/json_loader.rb
CHANGED
|
@@ -29,9 +29,9 @@ module SkillBench
|
|
|
29
29
|
data = JSON.parse(File.read(@path), symbolize_names: true)
|
|
30
30
|
return warn_invalid_config unless data.is_a?(Hash)
|
|
31
31
|
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
32
|
+
success_data = data.slice(:current_llm_provider, :max_execution_time, :allowed_commands, :skill_sources).compact
|
|
33
|
+
success_data[:current_llm_provider] ||= data[:provider] if data.key?(:provider)
|
|
34
|
+
success(success_data.merge(providers: normalized_providers(data[:providers])))
|
|
35
35
|
rescue JSON::ParserError => e
|
|
36
36
|
log_parse_error(e)
|
|
37
37
|
failure('Failed to parse config file')
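The whitelist-then-alias step in the loader (keep four known keys, drop `nil` values, then accept `provider` as a fallback for `current_llm_provider`) can be illustrated on its own. `normalize_config` is a hypothetical helper; provider normalization and the gem's success/failure wrappers are deliberately left out.

```ruby
require 'json'

# Hypothetical helper mirroring the key handling in the JSON loader.
# Returns nil when the document is not a JSON object.
def normalize_config(json_text)
  data = JSON.parse(json_text, symbolize_names: true)
  return nil unless data.is_a?(Hash)

  allowed = %i[current_llm_provider max_execution_time allowed_commands skill_sources]
  config = data.slice(*allowed).compact
  # "provider" is only a fallback; an explicit current_llm_provider wins.
  config[:current_llm_provider] ||= data[:provider] if data.key?(:provider)
  config
end

p normalize_config('{"provider": "openai", "model": "gpt-4o", "skill_sources": {"core": "../ruby-core-skills/skills"}}')
```

Keys outside the whitelist, such as `model` in the example, are dropped by `slice` at this stage.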
|
data/lib/skill_bench/config/store.rb
CHANGED
|
@@ -24,6 +24,11 @@ module SkillBench
|
|
|
24
24
|
# @return [Hash, nil] provider configuration by provider name
|
|
25
25
|
attr_accessor :llm_providers_config
|
|
26
26
|
|
|
27
|
+
# Returns skill sources mapping.
|
|
28
|
+
#
|
|
29
|
+
# @return [Hash, nil] skill source name → directory path
|
|
30
|
+
attr_accessor :skill_sources
|
|
31
|
+
|
|
27
32
|
# Initializes a new configuration store with empty provider settings.
|
|
28
33
|
def initialize
|
|
29
34
|
@llm_providers_config = {}
|
data/lib/skill_bench/config.rb
CHANGED
|
@@ -74,7 +74,9 @@ module SkillBench
|
|
|
74
74
|
@store = Config::Store.new
|
|
75
75
|
apply_defaults
|
|
76
76
|
apply_json_config(home_config_path)
|
|
77
|
-
|
|
77
|
+
local_path = Pathname.new(Dir.pwd).join(CONFIG_FILENAME)
|
|
78
|
+
is_workspace_file = File.exist?(File.join(Dir.pwd, 'ruby-skill-bench.gemspec'))
|
|
79
|
+
apply_json_config(local_path) unless defined?(Minitest) && is_workspace_file
|
|
78
80
|
apply_env_overrides
|
|
79
81
|
end
|
|
80
82
|
|
|
@@ -122,6 +124,13 @@ module SkillBench
|
|
|
122
124
|
store.llm_providers_config || {}
|
|
123
125
|
end
|
|
124
126
|
|
|
127
|
+
# Returns skill sources mapping.
|
|
128
|
+
#
|
|
129
|
+
# @return [Hash, nil] skill source name → directory path
|
|
130
|
+
def skill_sources
|
|
131
|
+
store.skill_sources || {}
|
|
132
|
+
end
|
|
133
|
+
|
|
125
134
|
# Returns API key from configuration.
|
|
126
135
|
#
|
|
127
136
|
# @return [String, nil] API key
|
data/lib/skill_bench/delta_report.rb
CHANGED
|
@@ -49,6 +49,26 @@ module SkillBench
|
|
|
49
49
|
{ success: false, response: { error: { message: e.message } } }
|
|
50
50
|
end
|
|
51
51
|
|
|
52
|
+
# Compatibility methods for ComparisonReporter
|
|
53
|
+
|
|
54
|
+
# Returns the list of dimensions from the context run.
|
|
55
|
+
#
|
|
56
|
+
# @return [Array<Object>] List of objects responding to name and score
|
|
57
|
+
def dimensions
|
|
58
|
+
return [] unless context_dimensions
|
|
59
|
+
|
|
60
|
+
context_dimensions.map do |name, dim_hash|
|
|
61
|
+
Struct.new(:name, :score).new(name.to_s, dim_hash[:score] || dim_hash['score'])
|
|
62
|
+
end
|
|
63
|
+
end
|
|
64
|
+
|
|
65
|
+
# Returns the total context score.
|
|
66
|
+
#
|
|
67
|
+
# @return [Numeric, nil]
|
|
68
|
+
def total
|
|
69
|
+
context_total
|
|
70
|
+
end
|
|
71
|
+
|
|
52
72
|
private
|
|
53
73
|
|
|
54
74
|
attr_reader :baseline, :context
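The compatibility `dimensions` method turns a hash of per-dimension results into objects that respond to `name` and `score`, accepting either symbol or string keys. A standalone sketch of that conversion follows; `dimension_rows` is a hypothetical name, and building the Struct once per call rather than once per row is a choice made here, not something the diff does.

```ruby
# Hypothetical standalone version of the conversion.
# context_dimensions maps a dimension name to a Hash with a :score or 'score' key.
def dimension_rows(context_dimensions)
  return [] unless context_dimensions

  row = Struct.new(:name, :score) # built once, reused for every row
  context_dimensions.map do |name, dim_hash|
    row.new(name.to_s, dim_hash[:score] || dim_hash['score'])
  end
end

rows = dimension_rows(correctness: { score: 8 }, 'style' => { 'score' => 4 })
rows.each { |r| puts "#{r.name}: #{r.score}" }
```

Reusing one Struct class keeps the rows of a single call comparable with `==`, which anonymous per-row Struct classes would not be.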
|
data/lib/skill_bench/execution/source_path_resolver.rb
CHANGED
|
@@ -1,5 +1,7 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
|
+
require 'pathname'
|
|
4
|
+
|
|
3
5
|
module SkillBench
|
|
4
6
|
module Execution
|
|
5
7
|
# Resolves the source skill or workflow path for a given evaluation target.
|
|
@@ -8,6 +10,8 @@ module SkillBench
|
|
|
8
10
|
#
|
|
9
11
|
# @param eval_folder_path [String] Relative path to the eval directory.
|
|
10
12
|
# @param skill_path [String, nil] Optional explicit override for the source directory.
|
|
13
|
+
# @param skill_sources [Hash] Optional skill source name → directory path mapping for fallback.
|
|
14
|
+
# When provided and local resolution does not yield an existing path, each source is checked.
|
|
11
15
|
# @return [String, nil] The resolved source path relative to the evaluator repo root, or nil if unmappable.
|
|
12
16
|
# @example Infer a skill source path (NEW format):
|
|
13
17
|
# SkillBench::Execution::SourcePathResolver.call(
|
|
@@ -19,12 +23,57 @@ module SkillBench
|
|
|
19
23
|
# eval_folder_path: 'evals/skills/code-quality/rails-code-review/review-order'
|
|
20
24
|
# )
|
|
21
25
|
# # => "skills/code-quality/rails-code-review"
|
|
22
|
-
def self.call(eval_folder_path:, skill_path: nil)
|
|
26
|
+
def self.call(eval_folder_path:, skill_path: nil, skill_sources: {})
|
|
23
27
|
return skill_path if skill_path && !skill_path.empty?
|
|
24
28
|
|
|
25
|
-
segments = eval_folder_path.to_s
|
|
29
|
+
segments = Pathname.new(eval_folder_path.to_s).each_filename.to_a
|
|
30
|
+
|
|
31
|
+
local = resolve_skills_path(segments) || resolve_workflows_path(segments)
|
|
32
|
+
|
|
33
|
+
unless local.nil? || skill_sources.empty?
|
|
34
|
+
skill_name = extract_skill_name(segments)
|
|
35
|
+
return local unless skill_name
|
|
36
|
+
return local if skill_exists_at?(local)
|
|
37
|
+
|
|
38
|
+
skill_sources.each_value do |source_path|
|
|
39
|
+
candidate = find_skill_in_source(source_path, skill_name)
|
|
40
|
+
return candidate if candidate
|
|
41
|
+
end
|
|
42
|
+
end
|
|
43
|
+
|
|
44
|
+
local
|
|
45
|
+
end
|
|
46
|
+
|
|
47
|
+
# Extracts the skill name from the eval path segments.
|
|
48
|
+
#
|
|
49
|
+
# @param segments [Array<String>] Path segments
|
|
50
|
+
# @return [String, nil] Skill name or nil
|
|
51
|
+
def self.extract_skill_name(segments)
|
|
52
|
+
index = segments.rindex('skills')
|
|
53
|
+
return nil unless index
|
|
54
|
+
|
|
55
|
+
remaining = segments[(index + 1)..]
|
|
56
|
+
return nil if remaining.empty?
|
|
26
57
|
|
|
27
|
-
|
|
58
|
+
remaining[0]
|
|
59
|
+
end
|
|
60
|
+
|
|
61
|
+
# Finds a skill directory within a source path by name.
|
|
62
|
+
#
|
|
63
|
+
# @param source_path [String] Root directory containing skill categories
|
|
64
|
+
# @param skill_name [String] Name of the skill to find
|
|
65
|
+
# @return [String, nil] Path to the skill directory or nil
|
|
66
|
+
def self.find_skill_in_source(source_path, skill_name)
|
|
67
|
+
return nil unless source_path && Dir.exist?(source_path)
|
|
68
|
+
|
|
69
|
+
Dir.glob(File.join(source_path, '*')).each do |entry|
|
|
70
|
+
next unless Dir.exist?(entry)
|
|
71
|
+
|
|
72
|
+
candidate = File.join(entry, skill_name)
|
|
73
|
+
return candidate if Dir.exist?(candidate) && File.exist?(File.join(candidate, 'SKILL.md'))
|
|
74
|
+
end
|
|
75
|
+
|
|
76
|
+
nil
|
|
28
77
|
end
|
|
29
78
|
|
|
30
79
|
private_class_method def self.resolve_skills_path(segments)
|
|
@@ -55,6 +104,13 @@ module SkillBench
|
|
|
55
104
|
workflow_name = segments[index + 1]
|
|
56
105
|
"workflows/#{workflow_name}" if workflow_name
|
|
57
106
|
end
|
|
107
|
+
|
|
108
|
+
private_class_method def self.skill_exists_at?(path)
|
|
109
|
+
return false unless path
|
|
110
|
+
|
|
111
|
+
full_path = path.end_with?('SKILL.md') ? path : File.join(path, 'SKILL.md')
|
|
112
|
+
File.exist?(full_path)
|
|
113
|
+
end
|
|
58
114
|
end
|
|
59
115
|
end
|
|
60
116
|
end
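The segment handling in the resolver (split the eval path with `Pathname#each_filename`, then take the component right after the last `skills` segment) can be checked with a short sketch. `segment_after_skills` is a hypothetical name that follows the same rule as `extract_skill_name` in the diff.

```ruby
require 'pathname'

# Hypothetical helper applying the same rule as extract_skill_name:
# returns the path component that follows the last 'skills' segment, or nil.
def segment_after_skills(eval_folder_path)
  segments = Pathname.new(eval_folder_path.to_s).each_filename.to_a
  index = segments.rindex('skills')
  return nil unless index

  segments[index + 1]
end

puts segment_after_skills('evals/skills/write-yard-docs/basic')
puts segment_after_skills('evals/skills/code-quality/rails-code-review/review-order')
```

For the flat layout used in the README examples this yields the skill name. For the category layout used in the method's own `@example`, the segment after `skills` is the category (`code-quality`) rather than `rails-code-review`, so it may be worth confirming how the `skill_sources` fallback behaves for such paths.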
|