ruby-skill-bench 1.0.1 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: d3c4edfe40e04251d2e7b758e7c630ee9affaa9e8170ceb0fa379d61bacc81e6
- data.tar.gz: e9ef2eb8ef7a524d607c6e44705df772feec8939a376b516adff032eeeb8b535
+ metadata.gz: d2ad524e13bc006a56f0197d07b3ba7b0ce2f99f60b61f0739c3d5bc0d75a687
+ data.tar.gz: a920c473148b52584653acbb1e91cb3973791c09de6c4df994a77c097eabc476
  SHA512:
- metadata.gz: b92554c769e34205d1c197bd67a9ca2ae61876b83c5429e202c667831100470fa9f1ed48a297ea184855e33e7ac3945fb513909b2344634078b8090750325dc9
- data.tar.gz: 7ae92f1331f2061cccf42a1f27f80cbe41c73d54d0909499900efa84ad3984edada8e7df10b5a018717861234974918cdfac80b5242483df108272093eec8deb
+ metadata.gz: c1f131af9bcde90e7fc3a7e6bef7f3770edfa4e2826ee19c3aabf5c210d6d3b6e5bdd460778a87f6fdc77b5b99bc17b2225e1b79de31674ad4acfe1bbc89f862
+ data.tar.gz: d8e3791c91242b25779afa3a21c57daeb06995bc4c65b01ca6f378a69491aeec95c5844ea541e5dcaf18b46d8e7f153ffb38ff2bb060ab5dacd653c8c1026bcd
data/README.md CHANGED
@@ -859,6 +859,151 @@ bundle exec ruby -Itest test/integration_test.rb
  - `test/agent_eval/` — CLI, models, and service tests
  - `test/clients/` — Provider client tests

+ ---
+
+ ## Security
+
+ ### Threat Model
+
+ Ruby Skill Bench is designed with security as a primary concern. The system executes AI agents in isolated environments and must protect against various attack vectors:
+
+ - **Path Traversal:** Preventing agents from accessing files outside the sandbox
+ - **Command Injection:** Preventing execution of arbitrary shell commands
+ - **Resource Exhaustion:** Preventing denial-of-service through resource consumption
+ - **Information Leakage:** Protecting sensitive data like API keys
+
+ ### Security Features
+
+ #### Path Traversal Protection
+
+ - **Symlink Validation:** All symlinks are validated to ensure they don't escape the sandbox
+ - **TOCTOU Mitigation:** Path validation is re-checked after directory creation operations
+ - **Path Normalization:** All paths are normalized and validated against working directory boundaries
+ - **Character Validation:** Paths are validated against strict character patterns
+
+ #### Command Execution Security
+
+ - **Command Allowlist:** Only explicitly allowed commands can be executed
+ - **Dangerous Commands Blocklist:** Dangerous commands (bash, curl, sudo, etc.) are always blocked
+ - **Shell Tokenization:** Commands are tokenized before execution to prevent shell injection
+ - **Docker Isolation:** Commands can be executed in isolated Docker containers with hardened security settings
+
+ #### Docker Security Hardening
+
+ When Docker is available, containers are launched with hardened security settings:
+
+ - **Non-root User:** Containers run as a non-root user
+ - **Privilege Prevention:** `--security-opt no-new-privileges` prevents privilege escalation
+ - **Capability Dropping:** All Linux capabilities are dropped except minimal needed ones
+ - **Network Isolation:** `--network none` disables network access
+ - **Read-only Root:** Container filesystem is read-only (except for mounted volumes)
+
+ #### Resource Limits
+
+ - **File Size Limits:** Individual files in context hydration are limited to 50KB
+ - **Total Context Size:** Total context size is limited to 1MB to prevent memory exhaustion
+ - **Execution Timeout:** Commands are limited to a configurable timeout (default: 30 seconds)
+ - **Max Iterations:** Agent loops are limited to prevent infinite loops
+
+ ### API Key Security
+
+ - **Environment Variables:** API keys are loaded from environment variables, not hardcoded
+ - **Configuration Hierarchy:** Keys can be set in `skill-bench.json` or environment variables
+ - **No Logging:** API keys are never logged or exposed in error messages
+ - **Provider-Specific Keys:** Each provider uses its own API key configuration
+
+ ### Best Practices for Users
+
+ 1. **Never Commit API Keys:** Never commit `skill-bench.json` with API keys to version control
+ 2. **Use Environment Variables:** Prefer environment variables for sensitive configuration
+ 3. **Minimal Command Allowlist:** Only allow commands necessary for your evals
+ 4. **Regular Updates:** Keep dependencies updated to patch security vulnerabilities
+ 5. **Review Changes:** Review skill files before execution to ensure they don't contain malicious code
+
+ ### Reporting Security Issues
+
+ If you discover a security vulnerability:
+
+ 1. **Do Not Open a Public Issue:** Send a private email to the maintainers
+ 2. **Provide Details:** Include steps to reproduce and potential impact
+ 3. **Allow Time for Fix:** Give maintainers time to address the issue before disclosure
+ 4. **Follow Responsible Disclosure:** Follow responsible disclosure practices
+
+ ---
+
+ ## Troubleshooting
+
+ ### Common Issues and Solutions
+
+ #### Configuration Issues
+
+ **Problem:** "Config load failed, using mock provider"
+ - **Solution:** Ensure your `skill-bench.json` file is properly formatted JSON and contains required fields
+ - **Check:** Verify the file exists in your project root or home directory
+
+ **Problem:** "API Key not set for [Provider]"
+ - **Solution:** Set the appropriate environment variable (e.g., `SKILL_BENCH_OPENAI_API_KEY`) or add it to your `skill-bench.json`
+ - **Check:** Run `env | grep SKILL_BENCH` to verify environment variables are set
+
+ **Problem:** "No allowed commands configured"
+ - **Solution:** Add `allowed_commands` array to your `skill-bench.json` with the commands you want to allow
+ - **Check:** Ensure commands are in the allowlist and not in the dangerous commands list
+
+ #### Execution Issues
+
+ **Problem:** "Command execution timed out"
+ - **Solution:** Increase `max_execution_time` in your `skill-bench.json` or simplify the task
+ - **Check:** Verify the command isn't hanging or waiting for input
+
+ **Problem:** "Docker container failed to start"
+ - **Solution:** Ensure Docker is running and you have permissions to run Docker commands
+ - **Check:** Run `docker info` to verify Docker daemon is accessible
+
+ **Problem:** "Context hydration failed"
+ - **Solution:** Verify the source path exists and is a directory
+ - **Check:** Ensure the path is within the base directory and file sizes are under limits
+
+ #### Network Issues
+
+ **Problem:** "Network Error: Connection refused"
+ - **Solution:** Check your internet connection and API provider status
+ - **Check:** Verify the base URL in your configuration is correct
+
+ **Problem:** "API Request failed: 429"
+ - **Solution:** This is a rate limit error. The system will retry automatically
+ - **Check:** Reduce request frequency or check your API quota
+
+ #### Test Failures
+
+ **Problem:** Tests fail with "WebMock::NetConnectNotAllowedError"
+ - **Solution:** This occurs when tests try to make real HTTP requests. Ensure test stubs are properly configured
+ - **Check:** Verify WebMock is properly stubbing the expected URLs
+
+ **Problem:** "E2E sibling repositories not present"
+ - **Solution:** This is expected if you don't have the agent-mcp-runtime repository cloned
+ - **Check:** These tests will be skipped and won't affect the overall test results
+
+ ### Debug Mode
+
+ For detailed debugging, you can enable verbose logging:
+
+ ```bash
+ # Set environment variable for verbose logging
+ export SKILL_BENCH_DEBUG=true
+ skill-bench run my-eval --skill=my-skill
+ ```
+
+ ### Getting Help
+
+ If you encounter issues not covered here:
+
+ 1. Check the [GitHub Issues](https://github.com/igmarin/ruby-skill-bench/issues) for similar problems
+ 2. Create a new issue with detailed information about your environment and the problem
+ 3. Include Ruby version, SkillBench version, and error messages
+ 4. Provide steps to reproduce the issue
+
+ ---
+
  ## CI/CD Integration

  GitHub Actions workflow included (`.github/workflows/ci.yml`):
@@ -1,5 +1,6 @@
  # frozen_string_literal: true

+ require_relative '../constants'
  require_relative 'react_agent/step'
  require_relative 'react_agent/loop_runner'

@@ -29,7 +30,7 @@ module SkillBench
  def initialize(params)
  @system_prompt = params[:system_prompt]
  @initial_prompt = params[:initial_prompt]
- @max_iterations = params[:max_iterations] || 25
+ @max_iterations = params[:max_iterations] || Constants::ReactAgent::DEFAULT_MAX_ITERATIONS
  @working_dir = params[:working_dir] || Dir.pwd
  @container_id = params[:container_id]
  @client_params = params[:client_params] || {}
@@ -2,6 +2,7 @@

  require_relative 'response_parser'
  require_relative 'response_error_handler'
+ require_relative 'response_builder'
  require_relative 'request_builder'
  require_relative 'retry_handler'
  require_relative 'base_client'
@@ -4,6 +4,7 @@ require_relative '../config'
  require_relative 'provider_config'
  require_relative 'response_parser'
  require_relative 'response_error_handler'
+ require_relative 'response_builder'
  require_relative 'request_builder'
  require_relative 'retry_handler'

@@ -135,7 +136,7 @@ module SkillBench
  else
  "#{missing.first} not set for #{@provider_display_name}"
  end
- { success: false, response: { error: { message: message } }, result: message, status: 'error' }
+ ResponseBuilder.error(message: message)
  end

  # Extracts the message hash from the provider's specific response body structure.
@@ -182,10 +183,6 @@ module SkillBench
  message = extract_message(parsed)
  return missing_message_response(response, parsed) unless ResponseParser.valid_message?(message)

- success_response(parsed, message)
- end
-
- def success_response(parsed, message)
  content = ResponseParser.extract_content(message)
  {
  success: true,
@@ -1,22 +1,20 @@
  # frozen_string_literal: true

  require 'faraday'
+ require_relative '../constants'

  module SkillBench
  module Clients
  # Builds and executes HTTP requests to LLM provider APIs.
  # Encapsulates Faraday connection setup and request execution.
  class RequestBuilder
- DEFAULT_OPEN_TIMEOUT = 10
- DEFAULT_TIMEOUT = 120
-
  # Creates a Faraday connection with JSON middleware.
  #
  # @param base_url [String] The API base URL
  # @param open_timeout [Integer] Connection open timeout in seconds
  # @param timeout [Integer] Request timeout in seconds
  # @return [Faraday::Connection] Configured Faraday connection
- def self.build_connection(base_url, open_timeout: DEFAULT_OPEN_TIMEOUT, timeout: DEFAULT_TIMEOUT)
+ def self.build_connection(base_url, open_timeout: Constants::HttpClient::DEFAULT_OPEN_TIMEOUT, timeout: Constants::HttpClient::DEFAULT_TIMEOUT)
  Faraday.new(url: base_url) do |f|
  f.request :json
  f.response :json
@@ -0,0 +1,91 @@
+ # frozen_string_literal: true
+
+ module SkillBench
+   module Clients
+     # Service object for building standardized response hashes.
+     # Eliminates duplication of error response formatting across the codebase.
+     class ResponseBuilder
+       # Builds a standardized error response.
+       #
+       # @param message [String] The error message.
+       # @param status [String] The status identifier (default: 'error').
+       # @return [Hash] Standardized error response hash.
+       def self.error(message:, status: 'error')
+         {
+           success: false,
+           response: { error: { message: message } },
+           result: message,
+           status: status
+         }
+       end
+
+       # Builds a standardized success response.
+       #
+       # @param content [String] The response content.
+       # @param metadata [Hash] Additional metadata to include in response.
+       # @return [Hash] Standardized success response hash.
+       def self.success(content:, metadata: {})
+         {
+           success: true,
+           result: content,
+           response: { content: content }.merge(metadata),
+           status: 'success'
+         }
+       end
+
+       # Builds a standardized API error response.
+       #
+       # @param error_message [String] The API error message.
+       # @param usage [Hash] Token usage information.
+       # @return [Hash] Standardized API error response hash.
+       def self.api_error(error_message:, usage: {})
+         {
+           success: false,
+           result: "API Error: #{error_message}",
+           usage: usage,
+           response: { error: { message: "API Error: #{error_message}" } },
+           status: 'error'
+         }
+       end
+
+       # Builds a standardized network error response.
+       #
+       # @param error_message [String] The network error message.
+       # @return [Hash] Standardized network error response hash.
+       def self.network_error(error_message:)
+         {
+           success: false,
+           response: { error: { message: "Network Error: #{error_message}" } },
+           result: "Network Error: #{error_message}",
+           status: 'error'
+         }
+       end
+
+       # Builds a standardized parsing error response.
+       #
+       # @param error_message [String] The parsing error message.
+       # @return [Hash] Standardized parsing error response hash.
+       def self.parsing_error(error_message:)
+         {
+           success: false,
+           response: { error: { message: "Parsing Error: #{error_message}" } },
+           result: "Parsing Error: #{error_message}",
+           status: 'error'
+         }
+       end
+
+       # Builds a standardized unexpected error response.
+       #
+       # @param error_message [String] The unexpected error message.
+       # @return [Hash] Standardized unexpected error response hash.
+       def self.unexpected_error(error_message:)
+         {
+           success: false,
+           response: { error: { message: "Unexpected Error: #{error_message}" } },
+           result: "Unexpected Error: #{error_message}",
+           status: 'error'
+         }
+       end
+     end
+   end
+ end
@@ -23,14 +23,8 @@ module SkillBench
  error_msg += " - #{detail}"
  end

- {
- success: false,
- result: error_msg,
- usage: usage_extractor.call(parsed),
- response: { error: { message: error_msg } },
- status: 'error',
- code: response.status
- }
+ base_response = ResponseBuilder.api_error(error_message: error_msg, usage: usage_extractor.call(parsed))
+ base_response.merge(code: response.status)
  end

  # Creates an error response when the LLM response has no message content.
@@ -41,14 +35,8 @@ module SkillBench
  # @return [Hash] Standardized error response
  def self.missing_message_response(response, parsed, &usage_extractor)
  error_msg = 'LLM response missing message content'
- {
- success: false,
- result: error_msg,
- usage: usage_extractor.call(parsed),
- response: { error: { message: error_msg } },
- status: 'error',
- code: response.status
- }
+ base_response = ResponseBuilder.error(message: error_msg)
+ base_response.merge(usage: usage_extractor.call(parsed), code: response.status)
  end

  # Handles an exception by logging and returning a standardized error response.
@@ -58,7 +46,7 @@ module SkillBench
  # @return [Hash] Standardized error response
  def self.handle_exception(error, type)
  log_error(error)
- { success: false, result: "#{type}: #{error.message}", status: 'error' }
+ ResponseBuilder.error(message: "#{type}: #{error.message}")
  end

  # Logs an error message and backtrace to Rails.logger or stderr.
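The three hunks above replace hand-written hashes with `ResponseBuilder` calls plus `merge`. The sketch below is a rough way to check what that does to the returned hashes. `ResponseBuilderSketch` is a hypothetical stand-in that copies only the `error` and `api_error` bodies from the diff, and the `usage` payload and status codes are made-up values.

```ruby
# Hypothetical stand-in that copies two builder bodies from the diff.
module ResponseBuilderSketch
  def self.error(message:, status: 'error')
    { success: false, response: { error: { message: message } }, result: message, status: status }
  end

  def self.api_error(error_message:, usage: {})
    text = "API Error: #{error_message}"
    { success: false, result: text, usage: usage, response: { error: { message: text } }, status: 'error' }
  end
end

usage = { total_tokens: 12 } # made-up usage payload for illustration

# New style for the missing-message case: builder output plus merged extras.
new_style = ResponseBuilderSketch
            .error(message: 'LLM response missing message content')
            .merge(usage: usage, code: 200)

# Old style: the literal hash that the diff removes.
old_style = {
  success: false,
  result: 'LLM response missing message content',
  usage: usage,
  response: { error: { message: 'LLM response missing message content' } },
  status: 'error',
  code: 200
}

# Hash equality ignores key order, so the refactor is value-preserving here.
puts new_style == old_style

# The API error path is different: api_error prepends a prefix to the message.
api = ResponseBuilderSketch.api_error(error_message: 'quota exceeded', usage: usage).merge(code: 429)
puts api[:result]
```

Two differences are visible from the diff itself: `api_error` prepends `API Error: ` to whatever message the caller passes, whereas the removed literal used `error_msg` unchanged, and `handle_exception` now returns a `response` key that the removed one-line hash did not have.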
@@ -2,6 +2,7 @@

  require 'faraday'
  require_relative '../error_logger'
+ require_relative '../constants'

  module SkillBench
  module Clients
@@ -9,10 +10,6 @@ module SkillBench
  # Retries on transient errors (429, 503). Raises permanent errors immediately.
  # Returns the block result on success.
  class RetryHandler
- RETRYABLE_STATUSES = [429, 503].freeze
-
- MAX_DELAY = 30 # Maximum delay cap in seconds
-
  # Executes the given block with retry logic.
  #
  # @param max_attempts [Integer] Maximum number of attempts (default: 3).
@@ -21,7 +18,7 @@ module SkillBench
  # @return [Object] The block's return value on success.
  # @raise [Faraday::Error] On non-retryable errors or after exhausting retries.
  # @raise [ArgumentError] if no block is given or max_attempts < 1.
- def self.call(max_attempts: 3, base_delay: 1, &block)
+ def self.call(max_attempts: Constants::HttpClient::DEFAULT_MAX_RETRIES, base_delay: Constants::HttpClient::DEFAULT_RETRY_DELAY, &block)
  raise ArgumentError, 'RetryHandler requires a block' unless block
  raise ArgumentError, 'max_attempts must be >= 1' if max_attempts < 1

@@ -59,11 +56,11 @@ module SkillBench
  private

  def retryable?(status, attempt)
- RETRYABLE_STATUSES.include?(status) && attempt < @max_attempts
+ Constants::HttpClient::RETRYABLE_STATUSES.include?(status) && attempt < @max_attempts
  end

  def compute_delay(attempt)
- [@base_delay * (2**(attempt - 1)), MAX_DELAY].min
+ [@base_delay * (2**(attempt - 1)), Constants::ReactAgent::DEFAULT_MAX_DELAY].min
  end

  def extract_status(error)
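The `compute_delay` expression in the hunk above is a capped exponential backoff. As a minimal sketch, the same formula can be run on its own. `RETRY_BASE_DELAY` and `RETRY_MAX_DELAY` are local names invented here that mirror the values of `DEFAULT_RETRY_DELAY` and `DEFAULT_MAX_DELAY`, and the attempt range is arbitrary.

```ruby
RETRY_BASE_DELAY = 1  # mirrors Constants::HttpClient::DEFAULT_RETRY_DELAY
RETRY_MAX_DELAY  = 30 # mirrors Constants::ReactAgent::DEFAULT_MAX_DELAY

# Same formula as compute_delay in the diff: double the delay per attempt,
# then clamp it to the cap.
def compute_delay(attempt, base_delay: RETRY_BASE_DELAY, max_delay: RETRY_MAX_DELAY)
  [base_delay * (2**(attempt - 1)), max_delay].min
end

delays = (1..6).map { |attempt| compute_delay(attempt) }
puts delays.inspect # prints [1, 2, 4, 8, 16, 30]
```

With the default base delay of one second, the cap only takes effect from the sixth attempt onward, which is beyond the default `max_attempts` of 3.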
@@ -0,0 +1,58 @@
+ # frozen_string_literal: true
+
+ module SkillBench
+   # Centralized configuration constants for the SkillBench system.
+   # This eliminates magic numbers and provides a single source of truth
+   # for configurable values across the codebase.
+   module Constants
+     # ReAct Agent Configuration
+     module ReactAgent
+       DEFAULT_MAX_ITERATIONS = 25
+       DEFAULT_MAX_DELAY = 30 # Maximum delay cap in seconds for retry logic
+     end
+
+     # HTTP Client Configuration
+     module HttpClient
+       DEFAULT_OPEN_TIMEOUT = 10
+       DEFAULT_TIMEOUT = 120
+       DEFAULT_MAX_RETRIES = 3
+       DEFAULT_RETRY_DELAY = 1
+       RETRYABLE_STATUSES = [429, 503].freeze
+     end
+
+     # Context Hydration Configuration
+     module ContextHydration
+       MAX_FILE_SIZE = 50_000 # Maximum file size in bytes
+       MAX_TOTAL_CONTEXT_SIZE = 1_000_000 # Maximum total context size in bytes (1MB)
+       TEXT_EXTENSIONS = %w[.md .rb .json .yml .yaml .txt].freeze
+     end
+
+     # Sandbox Configuration
+     module Sandbox
+       DOCKER_IMAGE_NAME = 'evaluator-sandbox'
+     end
+
+     # Tool Execution Configuration
+     module Tools
+       DANGEROUS_COMMANDS = %w[
+         bash sh zsh fish dash ksh csh tcsh
+         python python3 python2 ruby perl node
+         php lua tcl wish
+         curl wget nc ncat socat
+         eval exec
+         sudo su doas
+         chmod chown mount umount
+         dd mkfs fdisk parted
+         insmod rmmod modprobe
+         systemctl service
+         passwd useradd userdel groupadd groupdel
+       ].freeze
+     end
+
+     # File Path Configuration
+     module FilePath
+       ALLOWED_PATH_PATTERN = %r{\A[a-zA-Z0-9._\-/]+\z}
+       MAX_PATH_LENGTH = 4096
+     end
+   end
+ end
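The `FilePath` constants above can be exercised in isolation to see which inputs the character pattern accepts. The helper `path_characters_ok?` is hypothetical: the diff does not show where or how the gem applies these two constants, so combining them in one predicate is an assumption made for this sketch.

```ruby
# Values copied from Constants::FilePath in the diff.
ALLOWED_PATH_PATTERN = %r{\A[a-zA-Z0-9._\-/]+\z}
MAX_PATH_LENGTH = 4096

# Hypothetical helper: length check plus character check.
def path_characters_ok?(path)
  path.length <= MAX_PATH_LENGTH && ALLOWED_PATH_PATTERN.match?(path)
end

samples = [
  'lib/skill_bench/constants.rb', # plain relative path
  'notes with spaces.md',         # spaces are outside the character class
  'a;rm',                         # shell metacharacter
  '../outside.txt',               # dots and slashes are allowed characters
  ''                              # empty string fails the + quantifier
]

results = samples.map { |path| [path, path_characters_ok?(path)] }.to_h
results.each { |path, ok| puts "#{path.inspect} => #{ok}" }
```

Because `.` and `/` are both in the character class, a string such as `../outside.txt` passes this check on its own. Rejecting traversal therefore has to come from a separate step, such as the path normalization the README describes.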
@@ -2,6 +2,7 @@

  require 'pathname'
  require 'cgi'
+ require_relative '../constants'

  module SkillBench
  module Execution
@@ -10,10 +11,6 @@ module SkillBench
  class ContextHydrator
  # Error message returned when context hydration fails.
  HYDRATION_FAILED = 'Failed to hydrate context from source path'
- # File extensions considered for context hydration.
- TEXT_EXTENSIONS = %w[.md .rb .json .yml .yaml .txt].freeze
- # Maximum file size (in bytes) for files included in context hydration.
- MAX_FILE_SIZE = 50_000

  # Loads and formats source context files.
  #
@@ -50,6 +47,8 @@ module SkillBench
  return missing_path_result unless full_path.exist? && full_path.directory?

  context_files = collect_context_files(full_path)
+ return missing_path_result unless validate_total_size?(context_files)
+
  xml_context = build_xml(context_files)

  { success: true, response: { context: xml_context } }
@@ -65,12 +64,23 @@ module SkillBench
  end

  def collect_context_files(full_path)
- pattern = full_path.join("*{#{TEXT_EXTENSIONS.join(',')}}").to_s
+ pattern = full_path.join("*{#{Constants::ContextHydration::TEXT_EXTENSIONS.join(',')}}").to_s
  Dir.glob(pattern).reject { |f| File.symlink?(f) }
- .select { |f| File.size(f) <= MAX_FILE_SIZE }
+ .select { |f| File.size(f) <= Constants::ContextHydration::MAX_FILE_SIZE }
  .sort
  end

+ def validate_total_size?(context_files)
+ total_size = context_files.sum { |f| File.size(f) }
+ return true if total_size <= Constants::ContextHydration::MAX_TOTAL_CONTEXT_SIZE
+
+ SkillBench::ErrorLogger.log_error(
+ StandardError.new("Total context size #{total_size} exceeds maximum #{Constants::ContextHydration::MAX_TOTAL_CONTEXT_SIZE}"),
+ 'ContextHydrator'
+ )
+ false
+ end
+
  # Builds the XML structure wrapping the contents of the context files.
  #
  # @param context_files [Array<String>] List of absolute paths to context files.
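The hunks above apply two size gates: a per-file filter inside `collect_context_files`, then a total-size check in `validate_total_size?`. The sketch below reproduces both against a temporary directory. The constants are copied from the diff, while `total_size_ok?` and the file names are hypothetical, and the logging call from the original is left out.

```ruby
require 'tmpdir'
require 'pathname'

# Values copied from Constants::ContextHydration in the diff.
TEXT_EXTENSIONS = %w[.md .rb .json .yml .yaml .txt].freeze
MAX_FILE_SIZE = 50_000
MAX_TOTAL_CONTEXT_SIZE = 1_000_000

# Same pipeline as the diff: glob by extension, drop symlinks,
# drop files over the per-file cap, sort for stable output.
def collect_context_files(dir)
  pattern = Pathname.new(dir).join("*{#{TEXT_EXTENSIONS.join(',')}}").to_s
  Dir.glob(pattern)
     .reject { |f| File.symlink?(f) }
     .select { |f| File.size(f) <= MAX_FILE_SIZE }
     .sort
end

# Hypothetical name for the total-size gate, without the logging side effect.
def total_size_ok?(files)
  files.sum { |f| File.size(f) } <= MAX_TOTAL_CONTEXT_SIZE
end

kept = nil
within_total = nil
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'small.md'), 'x' * 100)    # kept
  File.write(File.join(dir, 'huge.rb'), 'x' * 60_000)  # over the per-file cap
  File.write(File.join(dir, 'image.png'), 'x' * 10)    # extension not listed
  files = collect_context_files(dir)
  kept = files.map { |f| File.basename(f) }
  within_total = total_size_ok?(files)
end

puts kept.inspect
puts within_total
```

Since each surviving file is at most 50,000 bytes, the total limit can only be exceeded when a directory holds more than 20 matching files. When the total gate fails, the diff returns `missing_path_result`, the same result used for a missing source path.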
@@ -3,6 +3,7 @@
  require 'fileutils'
  require 'tmpdir'
  require 'open3'
+ require_relative '../constants'

  module SkillBench
  module Execution
@@ -143,18 +144,32 @@ module SkillBench

  # Starts a Docker container for isolated command execution.
  # Builds the image only if it does not already exist.
+ # Uses hardened security settings for production safety.
  #
  # @raise [RuntimeError] when the Docker image cannot be built or the container fails to start.
  def start_container
- image_name = 'evaluator-sandbox'
+ image_name = Constants::Sandbox::DOCKER_IMAGE_NAME
  docker_dir = File.expand_path('docker', __dir__)

  # Build image (Docker layer cache handles no-op builds)
  raise "Failed to build Docker image #{image_name}" unless system('docker', 'build', '-t', image_name, docker_dir, '--quiet')

- # Start a detached container mounting the sandbox dir to /sandbox
+ # Start a detached container with hardened security settings
+ # --user $(id -u):$(id -g): Runs as non-root user
+ # --security-opt no-new-privileges: Prevents privilege escalation
+ # --cap-drop ALL: Drops all Linux capabilities
+ # --cap-add CHOWN, DAC_OVERRIDE: Adds back minimal capabilities for git operations
+ # --network none: Disables network access for additional isolation
  stdout, stderr, status = Open3.capture3(
- 'docker', 'run', '-d', '--rm', '-v', "#{@path}:/sandbox", image_name
+ 'docker', 'run', '-d', '--rm',
+ '--user', "#{Process.uid}:#{Process.gid}",
+ '--security-opt', 'no-new-privileges',
+ '--cap-drop', 'ALL',
+ '--cap-add', 'CHOWN',
+ '--cap-add', 'DAC_OVERRIDE',
+ '--network', 'none',
+ '-v', "#{@path}:/sandbox:rw",
+ image_name
  )

  raise "Failed to start Docker container: #{stderr}" unless status.success?
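To inspect the hardened `docker run` invocation without a Docker daemon, the argument list from the hunk above can be built as a plain array. `hardened_docker_run_args` is a hypothetical helper written for this sketch, and the path and uid/gid values are placeholders. The flags are copied from the diff.

```ruby
# Hypothetical helper that returns the argv from the diff without running it.
def hardened_docker_run_args(sandbox_path, image_name, uid: Process.uid, gid: Process.gid)
  [
    'docker', 'run', '-d', '--rm',
    '--user', "#{uid}:#{gid}",             # run as the calling user, not root
    '--security-opt', 'no-new-privileges', # block privilege escalation
    '--cap-drop', 'ALL',                   # drop every Linux capability
    '--cap-add', 'CHOWN',                  # then add back two of them
    '--cap-add', 'DAC_OVERRIDE',
    '--network', 'none',                   # no network access
    '-v', "#{sandbox_path}:/sandbox:rw",   # writable sandbox mount
    image_name
  ]
end

args = hardened_docker_run_args('/tmp/sandbox', 'evaluator-sandbox', uid: 1000, gid: 1000)
puts args.join(' ')
```

Passing each argument as a separate element, as the diff does with `Open3.capture3`, keeps the values away from shell interpretation. The list shown in this hunk does not include `--read-only`, so the read-only root filesystem mentioned in the README section is not set by this particular command.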
@@ -4,27 +4,12 @@ require 'open3'
  require 'timeout'
  require 'shellwords'
  require_relative '../config'
+ require_relative '../constants'

  module SkillBench
  module Tools
  # Handles executing a shell command within the working directory.
  class RunCommand
- # Commands that are always blocked even if listed in allowed_commands,
- # because they can be used to escape the sandbox or execute arbitrary code.
- DANGEROUS_COMMANDS = %w[
- bash sh zsh fish dash ksh csh tcsh
- python python3 python2 ruby perl node
- php lua tcl wish
- curl wget nc ncat socat
- eval exec
- sudo su doas
- chmod chown mount umount
- dd mkfs fdisk parted
- insmod rmmod modprobe
- systemctl service
- passwd useradd userdel groupadd groupdel
- ].freeze
-
  # @return [Hash] The tool definition for the LLM API.
  def self.definition
  {
@@ -59,7 +44,7 @@ module SkillBench
  return 'Error: Empty command.' if argv.empty?

  base_cmd = argv.first
- return "Error: Command '#{base_cmd}' is blocked for security reasons." if DANGEROUS_COMMANDS.include?(base_cmd)
+ return "Error: Command '#{base_cmd}' is blocked for security reasons." if Constants::Tools::DANGEROUS_COMMANDS.include?(base_cmd)

  allowed = SkillBench::Config.allowed_commands
  return 'Error: No allowed commands configured. Set allowed_commands in skill-bench.json or use --mode mock.' if allowed.nil?
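The second hunk shows the order of checks: the blocklist is consulted before the allowlist, so a dangerous command stays blocked even if it is listed in `allowed_commands`. The sketch below is a rough reconstruction under stated assumptions: `gate_command` is a hypothetical name, tokenizing with `Shellwords.split` is inferred from the `require 'shellwords'` line, `DANGEROUS` is an abbreviated subset of the real list, and the final allowlist check and its message are guesses because that part of the method is outside the hunk.

```ruby
require 'shellwords'

# Abbreviated subset of Constants::Tools::DANGEROUS_COMMANDS for illustration.
DANGEROUS = %w[bash sh curl wget sudo eval exec].freeze

# Hypothetical gate: returns the tokenized argv when the command may run,
# or an error string otherwise.
def gate_command(command, allowed)
  argv = Shellwords.split(command) # assumption: tokenization via Shellwords
  return 'Error: Empty command.' if argv.empty?

  base_cmd = argv.first
  # Blocklist first: wins even when the command is also in the allowlist.
  return "Error: Command '#{base_cmd}' is blocked for security reasons." if DANGEROUS.include?(base_cmd)
  return 'Error: No allowed commands configured.' if allowed.nil?
  # Assumption: the real allowlist check is not visible in the diff.
  return "Error: Command '#{base_cmd}' is not in the allowlist." unless allowed.include?(base_cmd)

  argv
end

puts gate_command('bash -c "ls"', %w[bash ls]).inspect
puts gate_command('ls -la', %w[ls]).inspect
puts gate_command('ls', nil).inspect
```

Returning the token array rather than the original string means a caller can hand the command to a process API argument by argument, so no shell is involved.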
@@ -2,5 +2,5 @@

  module SkillBench
  # The current gem version.
- VERSION = '1.0.1'
+ VERSION = '1.1.0'
  end
data/lib/skill_bench.rb CHANGED
@@ -8,6 +8,7 @@

  # Core modules
  require_relative 'skill_bench/version'
+ require_relative 'skill_bench/constants'
  require_relative 'skill_bench/dimension'
  require_relative 'skill_bench/criteria'
  require_relative 'skill_bench/delta_report'
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ruby-skill-bench
  version: !ruby/object:Gem::Version
- version: 1.0.1
+ version: 1.1.0
  platform: ruby
  authors:
  - Ismael Marin
@@ -147,6 +147,7 @@ files:
  - lib/skill_bench/clients/providers/opencode.rb
  - lib/skill_bench/clients/providers/openrouter.rb
  - lib/skill_bench/clients/request_builder.rb
+ - lib/skill_bench/clients/response_builder.rb
  - lib/skill_bench/clients/response_error_handler.rb
  - lib/skill_bench/clients/response_parser.rb
  - lib/skill_bench/clients/retry_handler.rb
@@ -162,6 +163,7 @@ files:
  - lib/skill_bench/config/facade_writers.rb
  - lib/skill_bench/config/json_loader.rb
  - lib/skill_bench/config/store.rb
+ - lib/skill_bench/constants.rb
  - lib/skill_bench/criteria.rb
  - lib/skill_bench/delta_report.rb
  - lib/skill_bench/dimension.rb