ruby-skill-bench 1.0.1 → 1.1.0
- checksums.yaml +4 -4
- data/README.md +145 -0
- data/lib/skill_bench/agent/react_agent.rb +2 -1
- data/lib/skill_bench/clients/all.rb +1 -0
- data/lib/skill_bench/clients/base_client.rb +2 -5
- data/lib/skill_bench/clients/request_builder.rb +2 -4
- data/lib/skill_bench/clients/response_builder.rb +91 -0
- data/lib/skill_bench/clients/response_error_handler.rb +5 -17
- data/lib/skill_bench/clients/retry_handler.rb +4 -7
- data/lib/skill_bench/constants.rb +58 -0
- data/lib/skill_bench/execution/context_hydrator.rb +16 -6
- data/lib/skill_bench/execution/sandbox.rb +18 -3
- data/lib/skill_bench/tools/run_command.rb +2 -17
- data/lib/skill_bench/version.rb +1 -1
- data/lib/skill_bench.rb +1 -0
- metadata +3 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d2ad524e13bc006a56f0197d07b3ba7b0ce2f99f60b61f0739c3d5bc0d75a687
+  data.tar.gz: a920c473148b52584653acbb1e91cb3973791c09de6c4df994a77c097eabc476
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c1f131af9bcde90e7fc3a7e6bef7f3770edfa4e2826ee19c3aabf5c210d6d3b6e5bdd460778a87f6fdc77b5b99bc17b2225e1b79de31674ad4acfe1bbc89f862
+  data.tar.gz: d8e3791c91242b25779afa3a21c57daeb06995bc4c65b01ca6f378a69491aeec95c5844ea541e5dcaf18b46d8e7f153ffb38ff2bb060ab5dacd653c8c1026bcd
data/README.md
CHANGED
@@ -859,6 +859,151 @@ bundle exec ruby -Itest test/integration_test.rb
 - `test/agent_eval/` — CLI, models, and service tests
 - `test/clients/` — Provider client tests
 
+---
+
+## Security
+
+### Threat Model
+
+Ruby Skill Bench is designed with security as a primary concern. The system executes AI agents in isolated environments and must protect against various attack vectors:
+
+- **Path Traversal:** Preventing agents from accessing files outside the sandbox
+- **Command Injection:** Preventing execution of arbitrary shell commands
+- **Resource Exhaustion:** Preventing denial-of-service through resource consumption
+- **Information Leakage:** Protecting sensitive data like API keys
+
+### Security Features
+
+#### Path Traversal Protection
+
+- **Symlink Validation:** All symlinks are validated to ensure they don't escape the sandbox
+- **TOCTOU Mitigation:** Path validation is re-checked after directory creation operations
+- **Path Normalization:** All paths are normalized and validated against working directory boundaries
+- **Character Validation:** Paths are validated against strict character patterns
+
+#### Command Execution Security
+
+- **Command Allowlist:** Only explicitly allowed commands can be executed
+- **Dangerous Commands Blocklist:** Dangerous commands (bash, curl, sudo, etc.) are always blocked
+- **Shell Tokenization:** Commands are tokenized before execution to prevent shell injection
+- **Docker Isolation:** Commands can be executed in isolated Docker containers with hardened security settings
+
+#### Docker Security Hardening
+
+When Docker is available, containers are launched with hardened security settings:
+
+- **Non-root User:** Containers run as a non-root user
+- **Privilege Prevention:** `--security-opt no-new-privileges` prevents privilege escalation
+- **Capability Dropping:** All Linux capabilities are dropped except minimal needed ones
+- **Network Isolation:** `--network none` disables network access
+- **Read-only Root:** Container filesystem is read-only (except for mounted volumes)
+
+#### Resource Limits
+
+- **File Size Limits:** Individual files in context hydration are limited to 50KB
+- **Total Context Size:** Total context size is limited to 1MB to prevent memory exhaustion
+- **Execution Timeout:** Commands are limited to a configurable timeout (default: 30 seconds)
+- **Max Iterations:** Agent loops are limited to prevent infinite loops
+
+### API Key Security
+
+- **Environment Variables:** API keys are loaded from environment variables, not hardcoded
+- **Configuration Hierarchy:** Keys can be set in `skill-bench.json` or environment variables
+- **No Logging:** API keys are never logged or exposed in error messages
+- **Provider-Specific Keys:** Each provider uses its own API key configuration
+
+### Best Practices for Users
+
+1. **Never Commit API Keys:** Never commit `skill-bench.json` with API keys to version control
+2. **Use Environment Variables:** Prefer environment variables for sensitive configuration
+3. **Minimal Command Allowlist:** Only allow commands necessary for your evals
+4. **Regular Updates:** Keep dependencies updated to patch security vulnerabilities
+5. **Review Changes:** Review skill files before execution to ensure they don't contain malicious code
+
+### Reporting Security Issues
+
+If you discover a security vulnerability:
+
+1. **Do Not Open a Public Issue:** Send a private email to the maintainers
+2. **Provide Details:** Include steps to reproduce and potential impact
+3. **Allow Time for Fix:** Give maintainers time to address the issue before disclosure
+4. **Follow Responsible Disclosure:** Follow responsible disclosure practices
+
+---
+
+## Troubleshooting
+
+### Common Issues and Solutions
+
+#### Configuration Issues
+
+**Problem:** "Config load failed, using mock provider"
+- **Solution:** Ensure your `skill-bench.json` file is properly formatted JSON and contains required fields
+- **Check:** Verify the file exists in your project root or home directory
+
+**Problem:** "API Key not set for [Provider]"
+- **Solution:** Set the appropriate environment variable (e.g., `SKILL_BENCH_OPENAI_API_KEY`) or add it to your `skill-bench.json`
+- **Check:** Run `env | grep SKILL_BENCH` to verify environment variables are set
+
+**Problem:** "No allowed commands configured"
+- **Solution:** Add `allowed_commands` array to your `skill-bench.json` with the commands you want to allow
+- **Check:** Ensure commands are in the allowlist and not in the dangerous commands list
+
+#### Execution Issues
+
+**Problem:** "Command execution timed out"
+- **Solution:** Increase `max_execution_time` in your `skill-bench.json` or simplify the task
+- **Check:** Verify the command isn't hanging or waiting for input
+
+**Problem:** "Docker container failed to start"
+- **Solution:** Ensure Docker is running and you have permissions to run Docker commands
+- **Check:** Run `docker info` to verify Docker daemon is accessible
+
+**Problem:** "Context hydration failed"
+- **Solution:** Verify the source path exists and is a directory
+- **Check:** Ensure the path is within the base directory and file sizes are under limits
+
+#### Network Issues
+
+**Problem:** "Network Error: Connection refused"
+- **Solution:** Check your internet connection and API provider status
+- **Check:** Verify the base URL in your configuration is correct
+
+**Problem:** "API Request failed: 429"
+- **Solution:** This is a rate limit error. The system will retry automatically
+- **Check:** Reduce request frequency or check your API quota
+
+#### Test Failures
+
+**Problem:** Tests fail with "WebMock::NetConnectNotAllowedError"
+- **Solution:** This occurs when tests try to make real HTTP requests. Ensure test stubs are properly configured
+- **Check:** Verify WebMock is properly stubbing the expected URLs
+
+**Problem:** "E2E sibling repositories not present"
+- **Solution:** This is expected if you don't have the agent-mcp-runtime repository cloned
+- **Check:** These tests will be skipped and won't affect the overall test results
+
+### Debug Mode
+
+For detailed debugging, you can enable verbose logging:
+
+```bash
+# Set environment variable for verbose logging
+export SKILL_BENCH_DEBUG=true
+skill-bench run my-eval --skill=my-skill
+```
+
+### Getting Help
+
+If you encounter issues not covered here:
+
+1. Check the [GitHub Issues](https://github.com/igmarin/ruby-skill-bench/issues) for similar problems
+2. Create a new issue with detailed information about your environment and the problem
+3. Include Ruby version, SkillBench version, and error messages
+4. Provide steps to reproduce the issue
+
+---
+
 ## CI/CD Integration
 
 GitHub Actions workflow included (`.github/workflows/ci.yml`):
data/lib/skill_bench/agent/react_agent.rb
CHANGED
@@ -1,5 +1,6 @@
 # frozen_string_literal: true
 
+require_relative '../constants'
 require_relative 'react_agent/step'
 require_relative 'react_agent/loop_runner'
 
@@ -29,7 +30,7 @@ module SkillBench
       def initialize(params)
         @system_prompt = params[:system_prompt]
         @initial_prompt = params[:initial_prompt]
-        @max_iterations = params[:max_iterations] ||
+        @max_iterations = params[:max_iterations] || Constants::ReactAgent::DEFAULT_MAX_ITERATIONS
         @working_dir = params[:working_dir] || Dir.pwd
         @container_id = params[:container_id]
         @client_params = params[:client_params] || {}
data/lib/skill_bench/clients/base_client.rb
CHANGED
@@ -4,6 +4,7 @@ require_relative '../config'
 require_relative 'provider_config'
 require_relative 'response_parser'
 require_relative 'response_error_handler'
+require_relative 'response_builder'
 require_relative 'request_builder'
 require_relative 'retry_handler'
 
@@ -135,7 +136,7 @@ module SkillBench
                   else
                     "#{missing.first} not set for #{@provider_display_name}"
                   end
-
+        ResponseBuilder.error(message: message)
       end
 
       # Extracts the message hash from the provider's specific response body structure.
@@ -182,10 +183,6 @@ module SkillBench
         message = extract_message(parsed)
         return missing_message_response(response, parsed) unless ResponseParser.valid_message?(message)
 
-        success_response(parsed, message)
-      end
-
-      def success_response(parsed, message)
         content = ResponseParser.extract_content(message)
         {
           success: true,
data/lib/skill_bench/clients/request_builder.rb
CHANGED
@@ -1,22 +1,20 @@
 # frozen_string_literal: true
 
 require 'faraday'
+require_relative '../constants'
 
 module SkillBench
   module Clients
     # Builds and executes HTTP requests to LLM provider APIs.
     # Encapsulates Faraday connection setup and request execution.
     class RequestBuilder
-      DEFAULT_OPEN_TIMEOUT = 10
-      DEFAULT_TIMEOUT = 120
-
       # Creates a Faraday connection with JSON middleware.
       #
       # @param base_url [String] The API base URL
       # @param open_timeout [Integer] Connection open timeout in seconds
       # @param timeout [Integer] Request timeout in seconds
       # @return [Faraday::Connection] Configured Faraday connection
-      def self.build_connection(base_url, open_timeout: DEFAULT_OPEN_TIMEOUT, timeout: DEFAULT_TIMEOUT)
+      def self.build_connection(base_url, open_timeout: Constants::HttpClient::DEFAULT_OPEN_TIMEOUT, timeout: Constants::HttpClient::DEFAULT_TIMEOUT)
         Faraday.new(url: base_url) do |f|
           f.request :json
           f.response :json
data/lib/skill_bench/clients/response_builder.rb
ADDED
@@ -0,0 +1,91 @@
+# frozen_string_literal: true
+
+module SkillBench
+  module Clients
+    # Service object for building standardized response hashes.
+    # Eliminates duplication of error response formatting across the codebase.
+    class ResponseBuilder
+      # Builds a standardized error response.
+      #
+      # @param message [String] The error message.
+      # @param status [String] The status identifier (default: 'error').
+      # @return [Hash] Standardized error response hash.
+      def self.error(message:, status: 'error')
+        {
+          success: false,
+          response: { error: { message: message } },
+          result: message,
+          status: status
+        }
+      end
+
+      # Builds a standardized success response.
+      #
+      # @param content [String] The response content.
+      # @param metadata [Hash] Additional metadata to include in response.
+      # @return [Hash] Standardized success response hash.
+      def self.success(content:, metadata: {})
+        {
+          success: true,
+          result: content,
+          response: { content: content }.merge(metadata),
+          status: 'success'
+        }
+      end
+
+      # Builds a standardized API error response.
+      #
+      # @param error_message [String] The API error message.
+      # @param usage [Hash] Token usage information.
+      # @return [Hash] Standardized API error response hash.
+      def self.api_error(error_message:, usage: {})
+        {
+          success: false,
+          result: "API Error: #{error_message}",
+          usage: usage,
+          response: { error: { message: "API Error: #{error_message}" } },
+          status: 'error'
+        }
+      end
+
+      # Builds a standardized network error response.
+      #
+      # @param error_message [String] The network error message.
+      # @return [Hash] Standardized network error response hash.
+      def self.network_error(error_message:)
+        {
+          success: false,
+          response: { error: { message: "Network Error: #{error_message}" } },
+          result: "Network Error: #{error_message}",
+          status: 'error'
+        }
+      end
+
+      # Builds a standardized parsing error response.
+      #
+      # @param error_message [String] The parsing error message.
+      # @return [Hash] Standardized parsing error response hash.
+      def self.parsing_error(error_message:)
+        {
+          success: false,
+          response: { error: { message: "Parsing Error: #{error_message}" } },
+          result: "Parsing Error: #{error_message}",
+          status: 'error'
+        }
+      end
+
+      # Builds a standardized unexpected error response.
+      #
+      # @param error_message [String] The unexpected error message.
+      # @return [Hash] Standardized unexpected error response hash.
+      def self.unexpected_error(error_message:)
+        {
+          success: false,
+          response: { error: { message: "Unexpected Error: #{error_message}" } },
+          result: "Unexpected Error: #{error_message}",
+          status: 'error'
+        }
+      end
+    end
+  end
+end
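Every builder in `ResponseBuilder` returns a hash with the same top-level keys (`success`, `result`, `response`, `status`), so a caller can branch on `:success` without knowing which builder produced the value. The following is a minimal sketch of such a consumer; `summarize` is a hypothetical helper and not part of the gem, and the sample hashes are written by hand to match the shapes shown in the diff.

```ruby
# Hypothetical consumer of the standardized response hashes (not part of the gem).
def summarize(response)
  if response[:success]
    "ok: #{response[:result]}"
  else
    # Error builders nest the message under response -> error -> message.
    "failed (#{response[:status]}): #{response.dig(:response, :error, :message)}"
  end
end

# Hand-written hashes mirroring the shapes of ResponseBuilder.success and
# ResponseBuilder.network_error from the diff.
success = { success: true, result: 'done', response: { content: 'done' }, status: 'success' }
failure = { success: false, response: { error: { message: 'Network Error: timeout' } },
            result: 'Network Error: timeout', status: 'error' }

puts summarize(success)
puts summarize(failure)
```

Because the error message appears both in `:result` and under `response.error.message`, either lookup should yield the same text for the error builders shown.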
data/lib/skill_bench/clients/response_error_handler.rb
CHANGED
@@ -23,14 +23,8 @@ module SkillBench
           error_msg += " - #{detail}"
         end
 
-
-
-          result: error_msg,
-          usage: usage_extractor.call(parsed),
-          response: { error: { message: error_msg } },
-          status: 'error',
-          code: response.status
-        }
+        base_response = ResponseBuilder.api_error(error_message: error_msg, usage: usage_extractor.call(parsed))
+        base_response.merge(code: response.status)
       end
 
       # Creates an error response when the LLM response has no message content.
@@ -41,14 +35,8 @@ module SkillBench
       # @return [Hash] Standardized error response
       def self.missing_message_response(response, parsed, &usage_extractor)
         error_msg = 'LLM response missing message content'
-
-
-          result: error_msg,
-          usage: usage_extractor.call(parsed),
-          response: { error: { message: error_msg } },
-          status: 'error',
-          code: response.status
-        }
+        base_response = ResponseBuilder.error(message: error_msg)
+        base_response.merge(usage: usage_extractor.call(parsed), code: response.status)
       end
 
       # Handles an exception by logging and returning a standardized error response.
@@ -58,7 +46,7 @@ module SkillBench
       # @return [Hash] Standardized error response
       def self.handle_exception(error, type)
         log_error(error)
-
+        ResponseBuilder.error(message: "#{type}: #{error.message}")
       end
 
       # Logs an error message and backtrace to Rails.logger or stderr.
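The handler changes above build a base hash and then attach request-specific fields with `Hash#merge`. A small sketch of that pattern follows; `error_response` is a hypothetical stand-in that copies the shape of `ResponseBuilder.error`, and the `502` code and the usage hash are made-up illustration values.

```ruby
# Hypothetical stand-in with the same shape as ResponseBuilder.error in the diff.
def error_response(message:, status: 'error')
  { success: false, response: { error: { message: message } }, result: message, status: status }
end

base = error_response(message: 'LLM response missing message content')

# Hash#merge returns a new hash, so the base response is left untouched.
enriched = base.merge(usage: { total_tokens: 0 }, code: 502)

puts enriched[:code]
puts base.key?(:code)
```

Since `merge` does not mutate its receiver, the shared builder output stays reusable even when individual call sites add `code` or `usage`.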
data/lib/skill_bench/clients/retry_handler.rb
CHANGED
@@ -2,6 +2,7 @@
 
 require 'faraday'
 require_relative '../error_logger'
+require_relative '../constants'
 
 module SkillBench
   module Clients
@@ -9,10 +10,6 @@ module SkillBench
     # Retries on transient errors (429, 503). Raises permanent errors immediately.
     # Returns the block result on success.
     class RetryHandler
-      RETRYABLE_STATUSES = [429, 503].freeze
-
-      MAX_DELAY = 30 # Maximum delay cap in seconds
-
       # Executes the given block with retry logic.
       #
       # @param max_attempts [Integer] Maximum number of attempts (default: 3).
@@ -21,7 +18,7 @@ module SkillBench
       # @return [Object] The block's return value on success.
       # @raise [Faraday::Error] On non-retryable errors or after exhausting retries.
       # @raise [ArgumentError] if no block is given or max_attempts < 1.
-      def self.call(max_attempts:
+      def self.call(max_attempts: Constants::HttpClient::DEFAULT_MAX_RETRIES, base_delay: Constants::HttpClient::DEFAULT_RETRY_DELAY, &block)
         raise ArgumentError, 'RetryHandler requires a block' unless block
         raise ArgumentError, 'max_attempts must be >= 1' if max_attempts < 1
 
@@ -59,11 +56,11 @@ module SkillBench
       private
 
       def retryable?(status, attempt)
-        RETRYABLE_STATUSES.include?(status) && attempt < @max_attempts
+        Constants::HttpClient::RETRYABLE_STATUSES.include?(status) && attempt < @max_attempts
       end
 
       def compute_delay(attempt)
-        [@base_delay * (2**(attempt - 1)),
+        [@base_delay * (2**(attempt - 1)), Constants::ReactAgent::DEFAULT_MAX_DELAY].min
       end
 
       def extract_status(error)
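The `compute_delay` expression above is exponential backoff with a cap: the delay doubles on each attempt and is clamped to a maximum. A minimal sketch of the same arithmetic, assuming the default values visible in this diff (base delay 1 second, cap 30 seconds); `backoff_delay` is a hypothetical name used only for illustration.

```ruby
# Same formula as compute_delay in the diff, with the defaults inlined:
# base_delay mirrors DEFAULT_RETRY_DELAY (1), max_delay mirrors DEFAULT_MAX_DELAY (30).
def backoff_delay(attempt, base_delay: 1, max_delay: 30)
  [base_delay * (2**(attempt - 1)), max_delay].min
end

delays = (1..7).map { |attempt| backoff_delay(attempt) }
puts delays.inspect # => [1, 2, 4, 8, 16, 30, 30]
```

With the default of three attempts, only the first two delays (1 and 2 seconds) would ever be slept, so the cap matters mainly when a caller raises `max_attempts` or `base_delay`.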
data/lib/skill_bench/constants.rb
ADDED
@@ -0,0 +1,58 @@
+# frozen_string_literal: true
+
+module SkillBench
+  # Centralized configuration constants for the SkillBench system.
+  # This eliminates magic numbers and provides a single source of truth
+  # for configurable values across the codebase.
+  module Constants
+    # ReAct Agent Configuration
+    module ReactAgent
+      DEFAULT_MAX_ITERATIONS = 25
+      DEFAULT_MAX_DELAY = 30 # Maximum delay cap in seconds for retry logic
+    end
+
+    # HTTP Client Configuration
+    module HttpClient
+      DEFAULT_OPEN_TIMEOUT = 10
+      DEFAULT_TIMEOUT = 120
+      DEFAULT_MAX_RETRIES = 3
+      DEFAULT_RETRY_DELAY = 1
+      RETRYABLE_STATUSES = [429, 503].freeze
+    end
+
+    # Context Hydration Configuration
+    module ContextHydration
+      MAX_FILE_SIZE = 50_000 # Maximum file size in bytes
+      MAX_TOTAL_CONTEXT_SIZE = 1_000_000 # Maximum total context size in bytes (1MB)
+      TEXT_EXTENSIONS = %w[.md .rb .json .yml .yaml .txt].freeze
+    end
+
+    # Sandbox Configuration
+    module Sandbox
+      DOCKER_IMAGE_NAME = 'evaluator-sandbox'
+    end
+
+    # Tool Execution Configuration
+    module Tools
+      DANGEROUS_COMMANDS = %w[
+        bash sh zsh fish dash ksh csh tcsh
+        python python3 python2 ruby perl node
+        php lua tcl wish
+        curl wget nc ncat socat
+        eval exec
+        sudo su doas
+        chmod chown mount umount
+        dd mkfs fdisk parted
+        insmod rmmod modprobe
+        systemctl service
+        passwd useradd userdel groupadd groupdel
+      ].freeze
+    end
+
+    # File Path Configuration
+    module FilePath
+      ALLOWED_PATH_PATTERN = %r{\A[a-zA-Z0-9._\-/]+\z}
+      MAX_PATH_LENGTH = 4096
+    end
+  end
+end
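To illustrate what the `FilePath` constants do and do not cover, here is a small sketch that applies the same pattern and length limit. `path_chars_ok?` is a hypothetical helper; how the gem actually combines these constants with path normalization is not shown in this diff.

```ruby
# Pattern and limit copied from Constants::FilePath in the diff;
# the helper itself is an illustration, not the gem's validator.
def path_chars_ok?(path, pattern: %r{\A[a-zA-Z0-9._\-/]+\z}, max_length: 4096)
  path.length <= max_length && path.match?(pattern)
end

puts path_chars_ok?('lib/skill_bench/constants.rb') # => true
puts path_chars_ok?('notes/my file.txt')            # => false (space is not allowed)
puts path_chars_ok?('../etc/passwd')                # => true  (dots and slashes are allowed)
```

The last case shows that a character allowlist alone does not stop traversal, since `.` and `/` are both permitted. A check like this would need to be paired with normalization against the working directory, which the README's Path Traversal Protection list describes as a separate step.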
data/lib/skill_bench/execution/context_hydrator.rb
CHANGED
@@ -2,6 +2,7 @@
 
 require 'pathname'
 require 'cgi'
+require_relative '../constants'
 
 module SkillBench
   module Execution
@@ -10,10 +11,6 @@ module SkillBench
     class ContextHydrator
       # Error message returned when context hydration fails.
       HYDRATION_FAILED = 'Failed to hydrate context from source path'
-      # File extensions considered for context hydration.
-      TEXT_EXTENSIONS = %w[.md .rb .json .yml .yaml .txt].freeze
-      # Maximum file size (in bytes) for files included in context hydration.
-      MAX_FILE_SIZE = 50_000
 
       # Loads and formats source context files.
       #
@@ -50,6 +47,8 @@ module SkillBench
         return missing_path_result unless full_path.exist? && full_path.directory?
 
         context_files = collect_context_files(full_path)
+        return missing_path_result unless validate_total_size?(context_files)
+
         xml_context = build_xml(context_files)
 
         { success: true, response: { context: xml_context } }
@@ -65,12 +64,23 @@ module SkillBench
       end
 
       def collect_context_files(full_path)
-        pattern = full_path.join("*{#{TEXT_EXTENSIONS.join(',')}}").to_s
+        pattern = full_path.join("*{#{Constants::ContextHydration::TEXT_EXTENSIONS.join(',')}}").to_s
         Dir.glob(pattern).reject { |f| File.symlink?(f) }
-           .select { |f| File.size(f) <= MAX_FILE_SIZE }
+           .select { |f| File.size(f) <= Constants::ContextHydration::MAX_FILE_SIZE }
            .sort
       end
 
+      def validate_total_size?(context_files)
+        total_size = context_files.sum { |f| File.size(f) }
+        return true if total_size <= Constants::ContextHydration::MAX_TOTAL_CONTEXT_SIZE
+
+        SkillBench::ErrorLogger.log_error(
+          StandardError.new("Total context size #{total_size} exceeds maximum #{Constants::ContextHydration::MAX_TOTAL_CONTEXT_SIZE}"),
+          'ContextHydrator'
+        )
+        false
+      end
+
       # Builds the XML structure wrapping the contents of the context files.
       #
       # @param context_files [Array<String>] List of absolute paths to context files.
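The new `validate_total_size?` check sums the on-disk sizes of the already filtered files and rejects the whole set when the sum exceeds the limit. A minimal sketch of that logic, run against temporary files; `within_total_size?` is a hypothetical helper that takes the limit as a parameter so that a small value can be used here instead of the 1,000,000-byte constant.

```ruby
require 'tmpdir'

# Hypothetical helper mirroring the size check in validate_total_size?;
# logging is omitted and the limit is a parameter.
def within_total_size?(files, max_total:)
  files.sum { |f| File.size(f) } <= max_total
end

results = Dir.mktmpdir do |dir|
  small = File.join(dir, 'a.md')
  large = File.join(dir, 'b.md')
  File.write(small, 'x' * 10)
  File.write(large, 'y' * 100)
  files = [small, large]

  # 110 bytes in total: fits under 1_000, exceeds 100.
  [within_total_size?(files, max_total: 1_000), within_total_size?(files, max_total: 100)]
end

puts results.inspect # => [true, false]
```

Note that in the diff an oversized set returns `missing_path_result`, so from the caller's side an over-limit directory looks the same as a missing one; only the logged error distinguishes the two cases.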
data/lib/skill_bench/execution/sandbox.rb
CHANGED
@@ -3,6 +3,7 @@
 require 'fileutils'
 require 'tmpdir'
 require 'open3'
+require_relative '../constants'
 
 module SkillBench
   module Execution
@@ -143,18 +144,32 @@ module SkillBench
 
       # Starts a Docker container for isolated command execution.
       # Builds the image only if it does not already exist.
+      # Uses hardened security settings for production safety.
       #
       # @raise [RuntimeError] when the Docker image cannot be built or the container fails to start.
       def start_container
-        image_name =
+        image_name = Constants::Sandbox::DOCKER_IMAGE_NAME
         docker_dir = File.expand_path('docker', __dir__)
 
         # Build image (Docker layer cache handles no-op builds)
         raise "Failed to build Docker image #{image_name}" unless system('docker', 'build', '-t', image_name, docker_dir, '--quiet')
 
-        # Start a detached container
+        # Start a detached container with hardened security settings
+        # --user $(id -u):$(id -g): Runs as non-root user
+        # --security-opt no-new-privileges: Prevents privilege escalation
+        # --cap-drop ALL: Drops all Linux capabilities
+        # --cap-add CHOWN, DAC_OVERRIDE: Adds back minimal capabilities for git operations
+        # --network none: Disables network access for additional isolation
         stdout, stderr, status = Open3.capture3(
-          'docker', 'run', '-d', '--rm',
+          'docker', 'run', '-d', '--rm',
+          '--user', "#{Process.uid}:#{Process.gid}",
+          '--security-opt', 'no-new-privileges',
+          '--cap-drop', 'ALL',
+          '--cap-add', 'CHOWN',
+          '--cap-add', 'DAC_OVERRIDE',
+          '--network', 'none',
+          '-v', "#{@path}:/sandbox:rw",
+          image_name
         )
 
         raise "Failed to start Docker container: #{stderr}" unless status.success?
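The hardened `docker run` invocation above is passed to `Open3.capture3` as separate arguments rather than one shell string, so values such as the sandbox path are never interpreted by a shell. The sketch below only assembles that argument list so it can be inspected; `hardened_docker_args` is a hypothetical helper, `/tmp/example` is a placeholder path, and nothing is executed.

```ruby
# Hypothetical helper: builds the argv used in the diff's docker run call.
# It does not start a container.
def hardened_docker_args(host_path:, image:, uid: Process.uid, gid: Process.gid)
  [
    'docker', 'run', '-d', '--rm',
    '--user', "#{uid}:#{gid}",             # run as the invoking (non-root) user
    '--security-opt', 'no-new-privileges', # block privilege escalation
    '--cap-drop', 'ALL',                   # drop every capability...
    '--cap-add', 'CHOWN',                  # ...then add back the two the diff keeps
    '--cap-add', 'DAC_OVERRIDE',
    '--network', 'none',                   # no network access
    '-v', "#{host_path}:/sandbox:rw",
    image
  ]
end

args = hardened_docker_args(host_path: '/tmp/example', image: 'evaluator-sandbox', uid: 1000, gid: 1000)
puts args.join(' ')
```

One observation from the flags shown: the list in this hunk contains no `--read-only` option, so the "Read-only Root" item in the README is presumably configured elsewhere or not yet reflected in this call.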
data/lib/skill_bench/tools/run_command.rb
CHANGED
@@ -4,27 +4,12 @@ require 'open3'
 require 'timeout'
 require 'shellwords'
 require_relative '../config'
+require_relative '../constants'
 
 module SkillBench
   module Tools
     # Handles executing a shell command within the working directory.
     class RunCommand
-      # Commands that are always blocked even if listed in allowed_commands,
-      # because they can be used to escape the sandbox or execute arbitrary code.
-      DANGEROUS_COMMANDS = %w[
-        bash sh zsh fish dash ksh csh tcsh
-        python python3 python2 ruby perl node
-        php lua tcl wish
-        curl wget nc ncat socat
-        eval exec
-        sudo su doas
-        chmod chown mount umount
-        dd mkfs fdisk parted
-        insmod rmmod modprobe
-        systemctl service
-        passwd useradd userdel groupadd groupdel
-      ].freeze
-
       # @return [Hash] The tool definition for the LLM API.
       def self.definition
         {
@@ -59,7 +44,7 @@ module SkillBench
         return 'Error: Empty command.' if argv.empty?
 
         base_cmd = argv.first
-        return "Error: Command '#{base_cmd}' is blocked for security reasons." if DANGEROUS_COMMANDS.include?(base_cmd)
+        return "Error: Command '#{base_cmd}' is blocked for security reasons." if Constants::Tools::DANGEROUS_COMMANDS.include?(base_cmd)
 
         allowed = SkillBench::Config.allowed_commands
         return 'Error: No allowed commands configured. Set allowed_commands in skill-bench.json or use --mode mock.' if allowed.nil?
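The ordering in this hunk matters: the blocklist is consulted before the allowlist, so a dangerous command stays blocked even when a user lists it in `allowed_commands`. A minimal sketch of that ordering follows, assuming the command is tokenized with `Shellwords.split` (the file requires `shellwords`, but the tokenizing call itself is outside this hunk). `check_command` and its shortened blocklist are hypothetical, and the final allowlist comparison is an assumption about code not shown in the diff.

```ruby
require 'shellwords'

# Hypothetical check; the blocklist here is a shortened illustration of
# Constants::Tools::DANGEROUS_COMMANDS.
def check_command(command, allowed:, blocked: %w[bash sh curl sudo])
  argv = Shellwords.split(command)
  return :empty if argv.empty?

  base_cmd = argv.first
  return :blocked if blocked.include?(base_cmd)      # blocklist wins over allowlist
  return :unconfigured if allowed.nil?
  return :not_allowed unless allowed.include?(base_cmd)

  :ok
end

allowed = %w[ls cat bash]
puts check_command('ls -la', allowed: allowed)        # => ok
puts check_command('bash -c "ls"', allowed: allowed)  # => blocked
puts check_command('rm -rf tmp', allowed: allowed)    # => not_allowed
```

Only the first token is compared, so this kind of check depends on the command being executed as an argument vector. If the string were handed to a shell instead, operators such as `;` or `|` could introduce further commands that the check never sees.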
data/lib/skill_bench/version.rb
CHANGED
data/lib/skill_bench.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby-skill-bench
 version: !ruby/object:Gem::Version
-  version: 1.0
+  version: 1.1.0
 platform: ruby
 authors:
 - Ismael Marin
@@ -147,6 +147,7 @@ files:
 - lib/skill_bench/clients/providers/opencode.rb
 - lib/skill_bench/clients/providers/openrouter.rb
 - lib/skill_bench/clients/request_builder.rb
+- lib/skill_bench/clients/response_builder.rb
 - lib/skill_bench/clients/response_error_handler.rb
 - lib/skill_bench/clients/response_parser.rb
 - lib/skill_bench/clients/retry_handler.rb
@@ -162,6 +163,7 @@ files:
 - lib/skill_bench/config/facade_writers.rb
 - lib/skill_bench/config/json_loader.rb
 - lib/skill_bench/config/store.rb
+- lib/skill_bench/constants.rb
 - lib/skill_bench/criteria.rb
 - lib/skill_bench/delta_report.rb
 - lib/skill_bench/dimension.rb