RubyGems - braintrust - Versions diffs - 0.2.1 → 0.3.0 - Mend

braintrust 0.2.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

checksums.yaml +4 -4
data/README.md +107 -10
data/lib/braintrust/contrib/rails/server/application_controller.rb +34 -0
data/lib/braintrust/contrib/rails/server/engine.rb +72 -0
data/lib/braintrust/contrib/rails/server/eval_controller.rb +36 -0
data/lib/braintrust/contrib/rails/server/generator.rb +43 -0
data/lib/braintrust/contrib/rails/server/health_controller.rb +15 -0
data/lib/braintrust/contrib/rails/server/list_controller.rb +16 -0
data/lib/braintrust/contrib/rails/server/routes.rb +8 -0
data/lib/braintrust/contrib/rails/server.rb +20 -0
data/lib/braintrust/eval/runner.rb +80 -52
data/lib/braintrust/scorer.rb +55 -4
data/lib/braintrust/server/handlers/eval.rb +8 -168
data/lib/braintrust/server/handlers/list.rb +3 -41
data/lib/braintrust/server/rack.rb +2 -0
data/lib/braintrust/server/services/eval_service.rb +214 -0
data/lib/braintrust/server/services/list_service.rb +64 -0
data/lib/braintrust/trace/span_processor.rb +0 -5
data/lib/braintrust/version.rb +1 -1
metadata +11 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 747b190f21c7de342f85390f8a51b17628e23fa2436776989a3ebe637bf9d596
-  data.tar.gz: 1e6c0c59c9ce56d499a04d8424506c56e2c2ad359506a6d5175c7173dc4ab238
+  metadata.gz: c07be3c454a924c5c97c2653136a2b9cdd1098409af16326b1db8676c5c8b0d2
+  data.tar.gz: c1eb75eefdcacebc2c955ae23aa3196d276a76d6ab828cdfb817c7e9168325b3
 SHA512:
-  metadata.gz: 3f652583ec04f5b874e3417db4cc0dff7f43341eeffe686466b8caad5614ed336e8580ac7533ef100726f09cdb264900e0f454edd11328e611513ffc8f77d3cb
-  data.tar.gz: 3316d0cb4ccc77e2d0c0ae48c033b6f5c026237d85c85e75e139434023f713820c3790a98090dac636ae6d44127692279404bcd3ab88b0d50a3de3127d38e3a6
+  metadata.gz: d02058bd5321ed16ea2f785aaeb24f4d4f105c5357c3c7ceb2a8a02c090b69c7187623b23e14d5026bb0cf236e64dddae7025509d7b2d6769bb50f110612120f
+  data.tar.gz: 15627209b382c023c2640e1d2219b6d33b84cb7c67ba1a3b8e3ebbe1aa912d3df832583a1e37b3831699b67ea81f3b4242b67a606dfdd727827e648a6509fea7
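
The hashes above cover the two members of the `.gem` archive, `metadata.gz` and `data.tar.gz`. A minimal sketch of how such a digest can be checked with Ruby's standard library, assuming the member has already been read into memory (the helper name `sha256_matches?` is hypothetical and not part of RubyGems or this gem):

```ruby
require "digest"

# Hypothetical helper: compares the SHA256 of raw bytes with an expected hex digest.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Well-known SHA256 test vector for the three-byte string "abc".
puts sha256_matches?("abc", "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad")
```

For a real check, `bytes` would be something like `File.binread("data.tar.gz")` and the expected value would be the corresponding `+` line from the hunk above.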

data/README.md CHANGED

@@ -259,6 +259,8 @@ Braintrust::Eval.run(
 )
 ```
+See [eval.rb](./examples/eval.rb) for a full example.
 ### Datasets
 Use test cases from a Braintrust dataset:
@@ -287,6 +289,8 @@ Braintrust::Eval.run(
 )
 ```
+See [dataset.rb](./examples/eval/dataset.rb) for a full example.
 ### Scorers
 Use scoring functions defined in Braintrust:
@@ -315,6 +319,46 @@ Braintrust::Eval.run(
 )
 ```
+See [remote_functions.rb](./examples/eval/remote_functions.rb) for a full example.
+#### Scorer metadata
+Scorers can return a Hash with `:score` and `:metadata` to attach structured context to the score. The metadata is logged on the scorer's span and visible in the Braintrust UI for debugging and filtering:
+```ruby
+Braintrust::Scorer.new("translation") do |expected:, output:|
+  common_words = output.downcase.split & expected.downcase.split
+  overlap = common_words.size.to_f / expected.split.size
+  {
+    score: overlap,
+    metadata: {word_overlap: common_words.size, missing_words: expected.downcase.split - output.downcase.split}
+  }
+end
+```
+See [scorer_metadata.rb](./examples/eval/scorer_metadata.rb) for a full example.
+#### Multiple scores from one scorer
+When several scores can be computed together (e.g. in one LLM call), you can return an `Array` of score `Hash` instead of a single value. Each metric appears as a separate score column in the Braintrust UI:
+```ruby
+Braintrust::Scorer.new("summary_quality") do |output:, expected:|
+  words = output.downcase.split
+  key_terms = expected[:key_terms]
+  covered = key_terms.count { |t| words.include?(t) }
+  [
+    {name: "coverage", score: covered.to_f / key_terms.size, metadata: {missing: key_terms - words}},
+    {name: "conciseness", score: words.size <= expected[:max_words] ? 1.0 : 0.0}
+  ]
+end
+```
+`name` and `score` are required, `metadata` is optional.
+See [multi_score.rb](./examples/eval/multi_score.rb) for a full example.
 #### Trace scoring
 Scorers can access the full evaluation trace (all spans generated by the task) by declaring a `trace:` keyword parameter. This is useful for inspecting intermediate LLM calls, validating tool usage, or checking the message thread:
@@ -344,11 +388,15 @@ Braintrust::Eval.run(
 )
 ```
-See examples: [eval.rb](./examples/eval.rb), [dataset.rb](./examples/eval/dataset.rb), [remote_functions.rb](./examples/eval/remote_functions.rb), [trace_scoring.rb](./examples/eval/trace_scoring.rb)
+See [trace_scoring.rb](./examples/eval/trace_scoring.rb) for a full example.
 ### Dev Server
-Run evaluations from the Braintrust web UI against code in your own application. Define evaluators, pass them to the dev server, and start serving:
+Run evaluations from the Braintrust web UI against code in your own application.
+#### Run as a Rack app
+Define evaluators, pass them to the dev server, and start serving:
 ```ruby
 # eval_server.ru
@@ -374,10 +422,21 @@ run Braintrust::Server::Rack.app(
 )
 ```
+Add your Rack server to your Gemfile:
+```ruby
+gem "rack"
+gem "puma" # recommended
+```
+Then start the server:
 ```bash
 bundle exec rackup eval_server.ru -p 8300 -o 0.0.0.0
 ```
+See example: [server/eval.ru](./examples/server/eval.ru)
 **Custom evaluators**
 Evaluators can also be defined as subclasses:
@@ -394,6 +453,51 @@ class FoodClassifier < Braintrust::Eval::Evaluator
 end
 ```
+#### Run as a Rails engine
+Use the Rails engine when your evaluators live inside an existing Rails app and you want to mount the Braintrust eval server into that application.
+Define each evaluator in its own file, for example under `app/evaluators/`:
+```ruby
+# app/evaluators/food_classifier.rb
+class FoodClassifier < Braintrust::Eval::Evaluator
+  def task
+    ->(input:) { classify(input) }
+  end
+  def scorers
+    [Braintrust::Scorer.new("exact_match") { |expected:, output:| output == expected ? 1.0 : 0.0 }]
+  end
+end
+```
+Then generate the Braintrust initializer:
+```bash
+bin/rails generate braintrust:eval_server
+```
+```ruby
+# config/routes.rb
+Rails.application.routes.draw do
+  mount Braintrust::Contrib::Rails::Engine, at: "/braintrust"
+end
+```
+The generator writes `config/initializers/braintrust_server.rb`, where you can review or customize the slug-to-evaluator mapping it discovers from `app/evaluators/**/*.rb` and `evaluators/**/*.rb`.
+See example: [contrib/rails/eval.rb](./examples/contrib/rails/eval.rb)
+**Developing locally**
+If you want to skip authentication on incoming eval requests while developing locally:
+- **For Rack**: Pass `auth: :none` to `Braintrust::Server::Rack.app(...)`
+- **For Rails**: Set `config.auth = :none` in `config/initializers/braintrust_server.rb`
+*NOTE: Setting `:none` disables authentication on incoming requests into your server; executing evals requires a `BRAINTRUST_API_KEY` to fetch resources.*
 **Supported web servers**
 The dev server requires the `rack` gem and a Rack-compatible web server.
@@ -405,14 +509,7 @@ The dev server requires the `rack` gem and a Rack-compatible web server.
 | [Passenger](https://www.phusionpassenger.com/) | 6.x               |                                      |
 | [WEBrick](https://github.com/ruby/webrick)     | Not supported     | Does not support server-sent events. |
-Add your chosen server to your Gemfile:
-```ruby
-gem "rack"
-gem "puma" # recommended
-```
-See example: [server/eval.ru](./examples/server/eval.ru)
+See examples: [server/eval.ru](./examples/server/eval.ru),
 ## Documentation

data/lib/braintrust/contrib/rails/server/application_controller.rb ADDED

@@ -0,0 +1,34 @@
+# frozen_string_literal: true
+module Braintrust
+  module Contrib
+    module Rails
+      module Server
+        class ApplicationController < ActionController::API
+          before_action :authenticate!
+          private
+          def authenticate!
+            auth_result = Engine.auth_strategy.authenticate(request.env)
+            unless auth_result
+              render json: {"error" => "Unauthorized"}, status: :unauthorized
+              return
+            end
+            request.env["braintrust.auth"] = auth_result
+            @braintrust_auth = auth_result
+          end
+          def parse_json_body
+            body = request.body.read
+            return nil if body.nil? || body.empty?
+            JSON.parse(body)
+          rescue JSON::ParserError
+            nil
+          end
+        end
+      end
+    end
+  end
+end
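
The `parse_json_body` helper above treats an empty body and malformed JSON the same way: both yield `nil`, which `EvalController` then turns into a 400 response. A stand-alone sketch of that behavior, assuming any IO-like object in place of the Rails request (passing the IO as an argument is my adaptation, not the controller's signature):

```ruby
require "json"
require "stringio"

# Stand-alone adaptation of the controller helper: reads from any IO-like
# object instead of `request.body`.
def parse_json_body(io)
  body = io.read
  return nil if body.nil? || body.empty?
  JSON.parse(body)
rescue JSON::ParserError
  # Malformed JSON is reported the same way as a missing body.
  nil
end

parsed = parse_json_body(StringIO.new('{"name":"food-classifier"}'))
puts parsed["name"]                                  # prints food-classifier
puts parse_json_body(StringIO.new("")).inspect       # prints nil
puts parse_json_body(StringIO.new("{not json")).inspect # prints nil
```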

data/lib/braintrust/contrib/rails/server/engine.rb ADDED

@@ -0,0 +1,72 @@
+# frozen_string_literal: true
+module Braintrust
+  module Contrib
+    module Rails
+      module Server
+        class Engine < ::Rails::Engine
+          isolate_namespace Braintrust::Contrib::Rails::Server
+          config.evaluators = {}
+          config.auth = :clerk_token
+          # Register the engine's routes file so Rails loads it during initialization.
+          paths["config/routes.rb"] << File.expand_path("routes.rb", __dir__)
+          initializer "braintrust.server.cors" do |app|
+            app.middleware.use Braintrust::Server::Middleware::Cors
+          end
+          # Class-level helpers that read from engine config.
+          def self.evaluators
+            config.evaluators
+          end
+          def self.auth_strategy
+            resolve_auth(config.auth)
+          end
+          def self.list_service
+            Braintrust::Server::Services::List.new(-> { config.evaluators })
+          end
+          # Long-lived so the state cache persists across requests.
+          def self.eval_service
+            @eval_service ||= Braintrust::Server::Services::Eval.new(-> { config.evaluators })
+          end
+          # Support the explicit `|config|` style used by this integration while
+          # still delegating zero-arity DSL blocks to Rails' native implementation.
+          def self.configure(&block)
+            return super if block&.arity == 0
+            yield config if block
+          end
+          def self.resolve_auth(auth)
+            case auth
+            when :none
+              Braintrust::Server::Auth::NoAuth.new
+            when :clerk_token
+              Braintrust::Server::Auth::ClerkToken.new
+            when Symbol, String
+              raise ArgumentError, "Unknown auth strategy #{auth.inspect}. Expected :none, :clerk_token, or an auth object."
+            else
+              auth
+            end
+          end
+          private_class_method :resolve_auth
+          generators do
+            require "braintrust/contrib/rails/server/generator"
+          end
+        end
+      end
+    end
+  end
+end
+require_relative "application_controller"
+require_relative "health_controller"
+require_relative "list_controller"
+require_relative "eval_controller"
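
`resolve_auth` above accepts the symbols `:none` and `:clerk_token`, rejects any other Symbol or String, and passes every other value through as an auth object. Judging by `ApplicationController#authenticate!`, such an object needs an `authenticate(env)` method that returns a truthy value on success. A runnable sketch of that dispatch, assuming stand-in classes (`DemoNoAuth`, `DemoClerkToken`, and `StaticTokenAuth` are hypothetical and exist only so the example runs without the gem):

```ruby
# Hypothetical stand-ins for Braintrust::Server::Auth::NoAuth / ClerkToken.
DemoNoAuth = Class.new
DemoClerkToken = Class.new

# Same branching as Engine.resolve_auth, using the stand-in classes.
def resolve_auth(auth)
  case auth
  when :none then DemoNoAuth.new
  when :clerk_token then DemoClerkToken.new
  when Symbol, String
    raise ArgumentError, "Unknown auth strategy #{auth.inspect}. Expected :none, :clerk_token, or an auth object."
  else
    auth # anything else is assumed to respond to #authenticate(env)
  end
end

# Hypothetical custom strategy: accepts one fixed bearer token.
class StaticTokenAuth
  def initialize(token)
    @token = token
  end

  # Returns a truthy auth context on success, nil otherwise.
  def authenticate(env)
    (env["HTTP_AUTHORIZATION"] == "Bearer #{@token}") ? {"token" => @token} : nil
  end
end

strategy = resolve_auth(StaticTokenAuth.new("s3cret"))
puts strategy.authenticate({"HTTP_AUTHORIZATION" => "Bearer s3cret"}) ? "authorized" : "unauthorized"
puts strategy.authenticate({}) ? "authorized" : "unauthorized"
```

Note that the String `"none"` does not match `when :none`, so it falls into the `Symbol, String` branch and raises.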

data/lib/braintrust/contrib/rails/server/eval_controller.rb ADDED

@@ -0,0 +1,36 @@
+# frozen_string_literal: true
+module Braintrust
+  module Contrib
+    module Rails
+      module Server
+        class EvalController < ApplicationController
+          include ActionController::Live
+          def create
+            body = parse_json_body
+            unless body
+              render json: {"error" => "Invalid JSON body"}, status: :bad_request
+              return
+            end
+            result = Engine.eval_service.validate(body)
+            if result[:error]
+              render json: {"error" => result[:error]}, status: result[:status]
+              return
+            end
+            response.headers["Content-Type"] = "text/event-stream"
+            response.headers["Cache-Control"] = "no-cache"
+            response.headers["Connection"] = "keep-alive"
+            sse = Braintrust::Server::SSEWriter.new { |chunk| response.stream.write(chunk) }
+            Engine.eval_service.stream(result, auth: @braintrust_auth, sse: sse)
+          ensure
+            response.stream.close
+          end
+        end
+      end
+    end
+  end
+end
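
The controller above hands each chunk produced by `Braintrust::Server::SSEWriter` to `response.stream.write`. The writer's implementation is not part of this diff, so the following is only a sketch of the standard `text/event-stream` framing such a writer would have to produce; `MiniSSEWriter` is a hypothetical stand-in that mirrors the `new { |chunk| ... }` and `event(name, data)` shapes seen in the diff:

```ruby
# Hypothetical stand-in for an SSE writer: frames events per the
# text/event-stream format and passes each frame to the given block.
class MiniSSEWriter
  def initialize(&sink)
    @sink = sink
  end

  def event(name, data)
    # Each line of the payload gets its own "data:" field.
    lines = data.to_s.split("\n", -1)
    lines = [""] if lines.empty?
    frame = +"event: #{name}\n"
    lines.each { |line| frame << "data: #{line}\n" }
    frame << "\n" # a blank line terminates the event
    @sink.call(frame)
  end
end

chunks = []
sse = MiniSSEWriter.new { |chunk| chunks << chunk }
sse.event("progress", '{"id":"1"}')
sse.event("done", "")
chunks.each { |c| print c }
```

In the controller the block writes to `response.stream`, and the `ensure` clause closes that stream even when the action returns early with a JSON error.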

data/lib/braintrust/contrib/rails/server/generator.rb ADDED

@@ -0,0 +1,43 @@
+# frozen_string_literal: true
+require "rails/generators"
+module Braintrust
+  module Contrib
+    module Rails
+      module Server
+        module Generators
+          class ServerGenerator < ::Rails::Generators::Base
+            namespace "braintrust:server"
+            source_root File.expand_path("templates", __dir__)
+            def create_initializer
+              @evaluators = discovered_evaluators
+              template "initializer.rb.tt", "config/initializers/braintrust_server.rb"
+            end
+            private
+            def discovered_evaluators
+              evaluator_roots.flat_map do |root|
+                Dir[File.join(destination_root, root, "**/*.rb")].sort.map do |file|
+                  relative_path = file.delete_prefix("#{File.join(destination_root, root)}/").sub(/\.rb\z/, "")
+                  {
+                    class_name: relative_path.split("/").map(&:camelize).join("::"),
+                    slug: relative_path.tr("/", "-").tr("_", "-")
+                  }
+                end
+              end
+            end
+            def evaluator_roots
+              %w[app/evaluators evaluators].select do |root|
+                Dir.exist?(File.join(destination_root, root))
+              end
+            end
+          end
+        end
+      end
+    end
+  end
+end
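
`discovered_evaluators` above derives a class name and a slug from each file path relative to the evaluator root. A sketch of that mapping in plain Ruby, assuming a simplified `camelize` in place of ActiveSupport's (the simplified version only handles lowercase snake_case segments, and `evaluator_entry` is a hypothetical helper name):

```ruby
# Simplified stand-in for ActiveSupport's String#camelize; only handles
# lowercase snake_case segments such as "food_classifier".
def camelize(segment)
  segment.split("_").map(&:capitalize).join
end

# Maps a path relative to the evaluator root to the class name and slug
# the generator would record for it.
def evaluator_entry(relative_path)
  path = relative_path.sub(/\.rb\z/, "")
  {
    class_name: path.split("/").map { |s| camelize(s) }.join("::"),
    slug: path.tr("/", "-").tr("_", "-")
  }
end

%w[food_classifier.rb nlp/summary_quality.rb].each do |file|
  entry = evaluator_entry(file)
  puts "#{entry[:slug]} maps to #{entry[:class_name]}"
end
```

Directory separators and underscores both become hyphens in the slug, so `nlp/summary_quality.rb` and a top-level `nlp_summary_quality.rb` would produce the same slug.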

data/lib/braintrust/contrib/rails/server/health_controller.rb ADDED

@@ -0,0 +1,15 @@
+# frozen_string_literal: true
+module Braintrust
+  module Contrib
+    module Rails
+      module Server
+        class HealthController < ApplicationController
+          def show
+            render json: {"status" => "ok"}
+          end
+        end
+      end
+    end
+  end
+end

data/lib/braintrust/contrib/rails/server/list_controller.rb ADDED

@@ -0,0 +1,16 @@
+# frozen_string_literal: true
+module Braintrust
+  module Contrib
+    module Rails
+      module Server
+        class ListController < ApplicationController
+          def show
+            result = Engine.list_service.call
+            render json: result
+          end
+        end
+      end
+    end
+  end
+end

data/lib/braintrust/contrib/rails/server/routes.rb ADDED

@@ -0,0 +1,8 @@
+# frozen_string_literal: true
+Braintrust::Contrib::Rails::Server::Engine.routes.draw do
+  get "/", to: "health#show"
+  get "/list", to: "list#show"
+  post "/list", to: "list#show"
+  post "/eval", to: "eval#create"
+end

data/lib/braintrust/contrib/rails/server.rb ADDED

@@ -0,0 +1,20 @@
+# frozen_string_literal: true
+begin
+  require "action_controller"
+  require "rails/engine"
+rescue LoadError
+  raise LoadError,
+    "Rails (actionpack + railties) is required for the Braintrust Rails server engine. " \
+    "Add `gem 'rails'` or `gem 'actionpack'` and `gem 'railties'` to your Gemfile."
+end
+require "json"
+require_relative "../../eval"
+require_relative "../../server/sse"
+require_relative "../../server/auth/no_auth"
+require_relative "../../server/auth/clerk_token"
+require_relative "../../server/middleware/cors"
+require_relative "../../server/services/list_service"
+require_relative "../../server/services/eval_service"
+require_relative "server/engine"

data/lib/braintrust/eval/runner.rb CHANGED

@@ -82,11 +82,17 @@ module Braintrust
       # @param case_context [CaseContext] The per-case accumulator
       # @param errors [Queue] Thread-safe error collection queue
       def run_eval_case(case_context, errors)
-        tracer.in_span("eval") do |eval_span|
+        # Each eval case starts its own trace — detach from any ambient span context
+        eval_span = tracer.start_root_span("eval")
+        OpenTelemetry::Trace.with_span(eval_span) do
+          # Set attributes known before task execution
           eval_span.set_attribute("braintrust.parent", eval_context.parent_span_attr) if eval_context.parent_span_attr
-          # Set tags early so they're present even if task fails
+          set_json_attr(eval_span, "braintrust.span_attributes", build_span_attributes("eval"))
+          set_json_attr(eval_span, "braintrust.input_json", {input: case_context.input})
+          set_json_attr(eval_span, "braintrust.expected", case_context.expected) if case_context.expected
+          set_json_attr(eval_span, "braintrust.metadata", case_context.metadata) if case_context.metadata
           eval_span.set_attribute("braintrust.tags", case_context.tags) if case_context.tags
+          eval_span.set_attribute("braintrust.origin", case_context.origin) if case_context.origin
           # Run task
           begin
@@ -94,6 +100,7 @@ module Braintrust
           rescue => e
             # Error already recorded on task span, set eval span status
             eval_span.status = OpenTelemetry::Trace::Status.error(e.message)
+            set_json_attr(eval_span, "braintrust.output_json", {output: nil})
             errors << "Task failed for input '#{case_context.input}': #{e.message}"
             report_progress(eval_span, case_context, error: e.message)
             next
@@ -104,26 +111,21 @@ module Braintrust
           case_context.trace = build_trace(eval_span)
           # Run scorers
-          case_scores = nil
           begin
-            case_scores = run_scorers(case_context)
+            run_scorers(case_context)
           rescue => e
             # Error already recorded on score span, set eval span status
             eval_span.status = OpenTelemetry::Trace::Status.error(e.message)
             errors << "Scorers failed for input '#{case_context.input}': #{e.message}"
           end
-          # Set eval span attributes (after task and scorers complete)
-          set_json_attr(eval_span, "braintrust.span_attributes", build_span_attributes("eval"))
-          set_json_attr(eval_span, "braintrust.input_json", case_context.input)
-          set_json_attr(eval_span, "braintrust.output_json", case_context.output)
-          set_json_attr(eval_span, "braintrust.expected", case_context.expected) if case_context.expected
+          # Set output after task completes
+          set_json_attr(eval_span, "braintrust.output_json", {output: case_context.output})
-          # Set origin for cases from remote sources (already JSON-serialized)
-          eval_span.set_attribute("braintrust.origin", case_context.origin) if case_context.origin
-          report_progress(eval_span, case_context, data: case_context.output, scores: case_scores || {})
+          report_progress(eval_span, case_context, data: case_context.output)
         end
+      ensure
+        eval_span&.finish
       end
       # Run task with OpenTelemetry tracing
@@ -151,43 +153,62 @@ module Braintrust
         end
       end
-      # Run scorers with OpenTelemetry tracing
-      # Creates single score span for all scorers
+      # Run scorers with OpenTelemetry tracing.
+      # Creates one span per scorer, each a direct child of the current (eval) span.
       # @param case_context [CaseContext] The per-case context (output must be populated)
-      # @return [Hash] Scores hash { scorer_name => score_value }
       def run_scorers(case_context)
-        tracer.in_span("score") do |score_span|
+        scorer_kwargs = {
+          input: case_context.input,
+          expected: case_context.expected,
+          output: case_context.output,
+          metadata: case_context.metadata || {},
+          trace: case_context.trace
+        }
+        scorer_input = {
+          input: case_context.input,
+          expected: case_context.expected,
+          output: case_context.output,
+          metadata: case_context.metadata || {}
+        }
+        scorer_error = nil
+        eval_context.scorers.each do |scorer|
+          collect_scores(run_scorer(scorer, scorer_kwargs, scorer_input))
+        rescue => e
+          scorer_error ||= e
+        end
+        raise scorer_error if scorer_error
+      end
+      # Run a single scorer inside its own span.
+      # @param scorer [Scorer] The scorer to run
+      # @param scorer_kwargs [Hash] Keyword arguments for the scorer
+      # @param scorer_input [Hash] Input to log on the span
+      # @return [Array<Hash>] Raw score results from the scorer
+      def run_scorer(scorer, scorer_kwargs, scorer_input)
+        tracer.in_span(scorer.name) do |score_span|
           score_span.set_attribute("braintrust.parent", eval_context.parent_span_attr) if eval_context.parent_span_attr
-          set_json_attr(score_span, "braintrust.span_attributes", build_span_attributes("score"))
-          scorer_kwargs = {
-            input: case_context.input,
-            expected: case_context.expected,
-            output: case_context.output,
-            metadata: case_context.metadata || {},
-            trace: case_context.trace
-          }
-          scores = {}
-          scorer_error = nil
-          eval_context.scorers.each do |scorer|
-            score_value = scorer.call(**scorer_kwargs)
-            scores[scorer.name] = score_value
-            # Collect raw score for summary (thread-safe)
-            collect_score(scorer.name, score_value)
-          rescue => e
-            # Record first error but continue processing other scorers
-            scorer_error ||= e
-            record_span_error(score_span, e, "ScorerError")
-          end
+          set_json_attr(score_span, "braintrust.span_attributes", build_scorer_span_attributes(scorer.name))
+          set_json_attr(score_span, "braintrust.input_json", scorer_input)
-          # Always set scores attribute, even if some scorers failed
-          set_json_attr(score_span, "braintrust.scores", scores)
+          score_results = scorer.call(**scorer_kwargs)
-          # Raise after setting scores so we can see which scorers succeeded
-          raise scorer_error if scorer_error
+          scorer_scores = {}
+          scorer_metadata = {}
+          score_results.each do |s|
+            scorer_scores[s[:name]] = s[:score]
+            scorer_metadata[s[:name]] = s[:metadata] if s[:metadata].is_a?(Hash)
+          end
+          set_json_attr(score_span, "braintrust.output_json", scorer_scores)
+          set_json_attr(score_span, "braintrust.scores", scorer_scores)
+          set_json_attr(score_span, "braintrust.metadata", scorer_metadata) unless scorer_metadata.empty?
-          scores
+          score_results
+        rescue => e
+          record_span_error(score_span, e, "ScorerError")
+          raise
         end
       end
@@ -255,6 +276,16 @@ module Braintrust
         attrs
       end
+      # Build span_attributes for a scorer span.
+      # Each scorer gets its own span with type "score", purpose "scorer", and the scorer's name.
+      # @param scorer_name [String] The scorer name
+      # @return [Hash]
+      def build_scorer_span_attributes(scorer_name)
+        attrs = {type: "score", name: scorer_name, purpose: "scorer"}
+        attrs[:generation] = eval_context.generation if eval_context.generation
+        attrs
+      end
       # Set a span attribute by JSON encoding the value
       # @param span [OpenTelemetry::Trace::Span] The span
       # @param key [String] The attribute key
@@ -263,14 +294,11 @@ module Braintrust
         span.set_attribute(key, JSON.dump(value))
       end
-      # Collect a single score value for summary calculation
-      # @param name [String] Scorer name
-      # @param value [Object] Score value (only Numeric values are collected)
-      def collect_score(name, value)
-        return unless value.is_a?(Numeric)
+      # Collect score results into the summary accumulator (thread-safe).
+      # @param score_results [Array<Hash>] Score results from a scorer
+      def collect_scores(score_results)
         @score_mutex.synchronize do
-          (@scores[name] ||= []) << value
+          score_results.each { |s| (@scores[s[:name]] ||= []) << s[:score] }
         end
       end
     end
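
The new `run_scorers` loop keeps going after a scorer fails, remembers only the first error, and re-raises it once every scorer has had its turn, while `collect_scores` appends each returned score under a mutex. A self-contained sketch of that control flow without any tracing, assuming scorers are plain lambdas that already return `Array<Hash>` (`ScoreAccumulator` and `run_all` are hypothetical names):

```ruby
# Hypothetical accumulator mirroring collect_scores: thread-safe append
# of each score under its name.
class ScoreAccumulator
  attr_reader :scores

  def initialize
    @scores = {}
    @mutex = Mutex.new
  end

  def collect(score_results)
    @mutex.synchronize do
      score_results.each { |s| (@scores[s[:name]] ||= []) << s[:score] }
    end
  end
end

# Runs every scorer, keeps the first error, and raises it at the end.
def run_all(scorers, accumulator, **kwargs)
  first_error = nil
  scorers.each do |scorer|
    accumulator.collect(scorer.call(**kwargs))
  rescue => e
    first_error ||= e
  end
  raise first_error if first_error
end

acc = ScoreAccumulator.new
scorers = [
  ->(output:, expected:) { [{name: "exact", score: (output == expected) ? 1.0 : 0.0}] },
  ->(**_kwargs) { raise "scorer blew up" },
  ->(output:, expected:) { [{name: "length", score: (output.length <= expected.length) ? 1.0 : 0.0}] }
]

begin
  run_all(scorers, acc, output: "apple", expected: "apple")
rescue => e
  puts "first error: #{e.message}"
end
puts acc.scores.keys.join(", ")
```

The scorer after the failing one still contributes its score, which is the point of deferring the `raise`.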

data/lib/braintrust/scorer.rb CHANGED

@@ -40,12 +40,52 @@ module Braintrust
       Block.new(name: name || DEFAULT_NAME, &block)
     end
-    # Included into classes that +include Scorer+. Prepends KeywordFilter
-    # so #call receives only its declared kwargs, and provides a default #name.
+    # Included into classes that +include Scorer+. Prepends KeywordFilter and
+    # ResultNormalizer so #call receives only declared kwargs and always returns
+    # Array<Hash>. Also provides a default #name and #call_parameters.
     module Callable
+      # Normalizes the raw return value of #call into Array<Hash>.
+      # Nested inside Callable because it depends on #name which Callable provides.
+      module ResultNormalizer
+        # @return [Array<Hash>] normalized score hashes with :score, :metadata, :name keys
+        def call(**kwargs)
+          normalize_score_result(super)
+        end
+        private
+        # @param result [Numeric, Hash, Array<Hash>] raw return value from #call
+        # @return [Array<Hash>] one or more score hashes with :score, :metadata, :name keys
+        # @raise [ArgumentError] if any score value is not Numeric
+        def normalize_score_result(result)
+          case result
+          when Array then result.map { |item| normalize_score_item(item) }
+          when Hash then [normalize_score_item(result)]
+          else
+            raise ArgumentError, "#{name}: score must be Numeric, got #{result.inspect}" unless result.is_a?(Numeric)
+            [{score: result, metadata: nil, name: name}]
+          end
+        end
+        # Fills in missing :name from the scorer and validates :score.
+        # @param item [Hash] a score hash with at least a :score key
+        # @return [Hash] the same hash with :name set
+        # @raise [ArgumentError] if :score is not Numeric
+        def normalize_score_item(item)
+          item[:name] ||= name
+          raise ArgumentError, "#{item[:name]}: score must be Numeric, got #{item[:score].inspect}" unless item[:score].is_a?(Numeric)
+          item
+        end
+      end
+      # Infrastructure modules prepended onto every scorer class.
+      # Used both to set up the ancestor chain and to skip past them in
+      # #call_parameters so KeywordFilter sees the real call signature.
+      PREPENDED = [Internal::Callable::KeywordFilter, ResultNormalizer].freeze
       # @param base [Class] the class including Callable
       def self.included(base)
-        base.prepend(Internal::Callable::KeywordFilter)
+        PREPENDED.each { |mod| base.prepend(mod) }
       end
       # Default name derived from the class name (e.g. FuzzyMatch -> "fuzzy_match").
@@ -55,6 +95,17 @@ module Braintrust
         return Scorer::DEFAULT_NAME unless klass
         klass.gsub(/([a-z])([A-Z])/, '\1_\2').downcase
       end
+      # Provides KeywordFilter with the actual call signature of the subclass.
+      # Walks past PREPENDED modules in the ancestor chain so that user-defined
+      # #call keyword params are correctly introspected.
+      # Block overrides this to point directly at @block.parameters.
+      # @return [Array<Array>] parameter list
+      def call_parameters
+        meth = method(:call)
+        meth = meth.super_method while meth.super_method && PREPENDED.include?(meth.owner)
+        meth.parameters
+      end
     end
     # Block-based scorer. Stores a Proc and delegates #call to it.
@@ -75,7 +126,7 @@ module Braintrust
       end
       # @param kwargs [Hash] keyword arguments (filtered by KeywordFilter)
-      # @return [Float, Hash, Array] score result
+      # @return [Array<Hash>] normalized score results
       def call(**kwargs)
         @block.call(**kwargs)
       end
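
The `ResultNormalizer` and `call_parameters` changes above rely on two Ruby mechanisms: a prepended module whose `#call` wraps the class's own `#call` through `super`, and `Method#super_method` to step past that wrapper when inspecting keyword parameters. A reduced sketch of both, assuming a single prepended module and omitting `KeywordFilter` (`DemoCallable`, `ExactMatch`, and `BrokenScorer` are hypothetical names):

```ruby
# Reduced version of the normalizer: wraps #call and coerces its result
# into Array<Hash>.
module ResultNormalizer
  def call(**kwargs)
    normalize_score_result(super)
  end

  private

  def normalize_score_result(result)
    case result
    when Array then result.map { |item| normalize_score_item(item) }
    when Hash then [normalize_score_item(result)]
    else
      raise ArgumentError, "#{name}: score must be Numeric, got #{result.inspect}" unless result.is_a?(Numeric)
      [{score: result, metadata: nil, name: name}]
    end
  end

  def normalize_score_item(item)
    item[:name] ||= name
    raise ArgumentError, "#{item[:name]}: score must be Numeric, got #{item[:score].inspect}" unless item[:score].is_a?(Numeric)
    item
  end
end

PREPENDED = [ResultNormalizer].freeze

# Hypothetical stand-in for Callable: prepends the wrapper on include and
# exposes the user's real #call signature.
module DemoCallable
  def self.included(base)
    PREPENDED.each { |mod| base.prepend(mod) }
  end

  def call_parameters
    meth = method(:call)
    meth = meth.super_method while meth.super_method && PREPENDED.include?(meth.owner)
    meth.parameters
  end
end

class ExactMatch
  include DemoCallable

  def name
    "exact_match"
  end

  def call(output:, expected:)
    (output == expected) ? 1.0 : 0.0
  end
end

class BrokenScorer
  include DemoCallable

  def name
    "broken"
  end

  def call(**)
    nil # not Numeric, so normalization raises
  end
end

scorer = ExactMatch.new
p scorer.method(:call).owner  # the prepended wrapper, not ExactMatch
p scorer.call_parameters      # the keyword list declared by ExactMatch#call
p scorer.call(output: "a", expected: "a").first[:score]
```

Without the `super_method` walk, introspection would only see the wrapper's `**kwargs`, and a keyword filter would have no way to tell which keywords the user's `#call` actually declares.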

data/lib/braintrust/server/handlers/eval.rb CHANGED

@@ -10,38 +10,15 @@ module Braintrust
       class Eval
         def initialize(evaluators)
           @evaluators = evaluators
+          @service = Services::Eval.new(evaluators)
         end
         def call(env)
           body = parse_body(env)
           return error_response(400, "Invalid JSON body") unless body
-          name = body["name"]
-          return error_response(400, "Missing required field: name") unless name
-          evaluator = @evaluators[name]
-          return error_response(404, "Evaluator '#{name}' not found") unless evaluator
-          data = body["data"]
-          return error_response(400, "Missing required field: data") unless data
-          # Validate exactly one data source
-          data_sources = ["data", "dataset_name", "dataset_id"].count { |k| data.key?(k) }
-          return error_response(400, "Exactly one data source required") if data_sources != 1
-          experiment_name = body["experiment_name"]
-          # Resolve data source
-          cases, dataset = resolve_data_source(data)
-          # Resolve remote scorers from request
-          remote_scorer_ids = resolve_remote_scorers(body["scores"])
-          # Resolve parent span context
-          parent = resolve_parent(body["parent"])
-          # Build state from auth context (if present)
-          state = build_state(env)
+          result = @service.validate(body)
+          return error_response(result[:status], result[:error]) if result[:error]
           # The protocol-rack adapter (used by Falcon and any server built on
           # protocol-http) buffers `each`-based bodies through an Enumerable path.
@@ -50,64 +27,7 @@ module Braintrust
           body_class = env.key?("protocol.http.request") ? SSEStreamBody : SSEBody
           sse_body = body_class.new do |sse|
-            # Only pass project/experiment params when state is available
-            run_opts = {
-              on_progress: ->(progress_data) {
-                # Build remote eval protocol events from generic progress data.
-                # Runner provides: id, data/error, scores (optional), origin (optional).
-                # Protocol requires: id, object_type, origin, name, format, output_type, event, data.
-                base = {
-                  "object_type" => "task",
-                  "name" => name,
-                  "format" => "code",
-                  "output_type" => "completion"
-                }
-                base["id"] = progress_data["id"] if progress_data["id"]
-                base["origin"] = progress_data["origin"] if progress_data["origin"]
-                if progress_data.key?("error")
-                  sse.event("progress", JSON.dump(base.merge("event" => "error", "data" => progress_data["error"])))
-                else
-                  sse.event("progress", JSON.dump(base.merge("event" => "json_delta", "data" => JSON.dump(progress_data["data"]))))
-                end
-                # Signal per-cell completion so the UI exits "Streaming..." state
-                # and updates the progress bar immediately.
-                sse.event("progress", JSON.dump(base.merge("event" => "done", "data" => "")))
-              },
-              quiet: true
-            }
-            run_opts[:parent] = parent if parent
-            run_opts[:scorers] = remote_scorer_ids if remote_scorer_ids
-            run_opts[:dataset] = dataset if dataset
-            if state
-              run_opts[:state] = state
-              run_opts[:experiment] = experiment_name if experiment_name
-              run_opts[:project_id] = body["project_id"] if body["project_id"]
-            end
-            result = evaluator.run(cases, **run_opts)
-            # Flush buffered OTLP spans before sending completion events.
-            # The BatchSpanProcessor exports every ~5s; fast evals can finish
-            # before a single export fires, causing the UI to see no results.
-            Braintrust::Trace.flush_spans
-            # Build summary from result scores
-            averaged_scores = {}
-            result.scorer_stats.each do |scorer_name, stats|
-              averaged_scores[scorer_name] = stats.score_mean
-            end
-            sse.event("summary", JSON.dump({
-              "scores" => averaged_scores,
-              "experiment_name" => experiment_name,
-              "experiment_id" => result.experiment_id,
-              "project_id" => result.project_id
-            }))
-            sse.event("done", "")
+            @service.stream(result, auth: env["braintrust.auth"], sse: sse)
           end
           [200, {"content-type" => "text/event-stream", "cache-control" => "no-cache", "connection" => "keep-alive"}, sse_body]
@@ -115,90 +35,6 @@ module Braintrust
         private
-        # Resolve data source from the data field.
-        # Returns [cases, dataset] where exactly one is non-nil.
-        def resolve_data_source(data)
-          if data.key?("data")
-            cases = data["data"].map do |d|
-              {input: d["input"], expected: d["expected"]}
-            end
-            [cases, nil]
-          elsif data.key?("dataset_id")
-            [nil, Braintrust::Dataset::ID.new(id: data["dataset_id"])]
-          elsif data.key?("dataset_name")
-            dataset_opts = {name: data["dataset_name"]}
-            dataset_opts[:project] = data["project_name"] if data["project_name"]
-            [nil, dataset_opts]
-          else
-            [nil, nil]
-          end
-        end
-        # Map request scores array to Scorer::ID structs.
-        # The UI sends function_id as a nested object: {"function_id": "uuid"}.
-        def resolve_remote_scorers(scores)
-          return nil if scores.nil? || scores.empty?
-          scores.map do |s|
-            func_id = s["function_id"]
-            func_id = func_id["function_id"] if func_id.is_a?(Hash)
-            Braintrust::Scorer::ID.new(
-              function_id: func_id,
-              version: s["version"]
-            )
-          end
-        end
-        # Map request parent to symbol-keyed Hash.
-        # Hardcode playground_id to match Java SDK behavior.
-        # Also extracts generation from propagated_event for span_attributes.
-        def resolve_parent(parent)
-          return nil unless parent.is_a?(Hash)
-          object_id = parent["object_id"]
-          return nil unless object_id
-          generation = parent.dig("propagated_event", "span_attributes", "generation")
-          result = {object_type: "playground_id", object_id: object_id}
-          result[:generation] = generation if generation
-          result
-        end
-        # Build State from auth context set by Auth middleware.
-        # Returns nil when no auth context is present (e.g. NoAuth strategy).
-        # Uses an LRU-style cache (max 64 entries) keyed by [api_key, app_url, org_name].
-        def build_state(env)
-          auth = env["braintrust.auth"]
-          return nil unless auth.is_a?(Hash)
-          cache_key = [auth["api_key"], auth["app_url"], auth["org_name"]]
-          @state_mutex ||= Mutex.new
-          @state_cache ||= {}
-          @state_mutex.synchronize do
-            cached = @state_cache[cache_key]
-            return cached if cached
-            state = Braintrust::State.new(
-              api_key: auth["api_key"],
-              org_id: auth["org_id"],
-              org_name: auth["org_name"],
-              app_url: auth["app_url"],
-              api_url: auth["api_url"],
-              enable_tracing: false
-            )
-            # Evict oldest entry if cache is full
-            if @state_cache.size >= 64
-              oldest_key = @state_cache.keys.first
-              @state_cache.delete(oldest_key)
-            end
-            @state_cache[cache_key] = state
-            state
-          end
-        end
         def parse_body(env)
           body = env["rack.input"]&.read
           return nil if body.nil? || body.empty?
@@ -211,6 +47,10 @@ module Braintrust
           [status, {"content-type" => "application/json"},
             [JSON.dump({"error" => message})]]
         end
+        def build_state(env)
+          @service.build_state(env["braintrust.auth"])
+        end
       end
     end
   end
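
Note on the hunk above: the `on_progress` callback removed from this handler (and re-added unchanged in `Services::Eval#stream`) writes two SSE `progress` events per update. The first is either an `error` event or a `json_delta` event, and the second is a `done` event. The sketch below reproduces that mapping as a pure function so the payload shape is easier to see. The helper name `progress_events` and the sample input are assumptions made for illustration; they are not part of the gem.

```ruby
require "json"

# Hypothetical helper (not part of the gem): returns the [event_name, payload]
# pairs that the on_progress callback would write for one progress update.
def progress_events(evaluator_name, progress_data)
  base = {
    "object_type" => "task",
    "name" => evaluator_name,
    "format" => "code",
    "output_type" => "completion"
  }
  # id and origin are copied only when the runner supplied them.
  base["id"] = progress_data["id"] if progress_data["id"]
  base["origin"] = progress_data["origin"] if progress_data["origin"]

  first =
    if progress_data.key?("error")
      base.merge("event" => "error", "data" => progress_data["error"])
    else
      # The data field is itself a JSON string, so it is encoded twice overall.
      base.merge("event" => "json_delta", "data" => JSON.dump(progress_data["data"]))
    end
  done = base.merge("event" => "done", "data" => "")

  [["progress", JSON.dump(first)], ["progress", JSON.dump(done)]]
end

# Made-up sample input for the example.
events = progress_events("my-eval", {"id" => "row-1", "data" => {"answer" => 42}})
events.each { |name, payload| puts "#{name}: #{payload}" }
```

A client reading the stream would therefore need to parse the `data` field of a `json_delta` event a second time to recover the task output.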

data/lib/braintrust/server/handlers/list.rb CHANGED

@@ -23,50 +23,12 @@ module Braintrust
       class List
         def initialize(evaluators)
           @evaluators = evaluators
+          @service = Services::List.new(evaluators)
         end
         def call(_env)
-          result = {}
-          @evaluators.each do |name, evaluator|
-            scores = (evaluator.scorers || []).each_with_index.map do |scorer, i|
-              scorer_name = scorer.respond_to?(:name) ? scorer.name : "score_#{i}"
-              {"name" => scorer_name}
-            end
-            entry = {"scores" => scores}
-            params = serialize_parameters(evaluator.parameters)
-            entry["parameters"] = params if params
-            result[name] = entry
-          end
-          [200, {"content-type" => "application/json"},
-            [JSON.dump(result)]]
-        end
-        private
-        # Convert user-defined parameters to the dev server protocol format.
-        # Wraps in a staticParameters container with "data" typed entries.
-        def serialize_parameters(parameters)
-          return nil unless parameters && !parameters.empty?
-          schema = {}
-          parameters.each do |name, spec|
-            spec = spec.transform_keys(&:to_s) if spec.is_a?(Hash)
-            if spec.is_a?(Hash)
-              schema[name.to_s] = {
-                "type" => "data",
-                "schema" => {"type" => spec["type"] || "string"},
-                "default" => spec["default"],
-                "description" => spec["description"]
-              }
-            end
-          end
-          {
-            "type" => "braintrust.staticParameters",
-            "schema" => schema,
-            "source" => nil
-          }
+          result = @service.call
+          [200, {"content-type" => "application/json"}, [JSON.dump(result)]]
         end
       end
     end
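
Note on the hunk above: the handler now hands `evaluators` to `Services::List`, whose `current_evaluators` method (shown in `list_service.rb` in this diff) accepts either a Hash or any object that responds to `call`. A minimal sketch of that lookup follows. The class name `EvaluatorSource` and the sample registry are hypothetical and exist only to illustrate the pattern.

```ruby
# Hypothetical class mirroring the current_evaluators lookup in the services.
class EvaluatorSource
  def initialize(evaluators)
    @evaluators = evaluators
  end

  # Returns the evaluator map, calling the source first if it is callable.
  def current
    return @evaluators.call if @evaluators.respond_to?(:call)
    @evaluators
  end
end

# A static Hash is returned as-is.
static = EvaluatorSource.new({"qa" => :qa_evaluator})
p static.current.keys

# A lambda is re-evaluated on every call, so later registrations are visible.
registry = {"qa" => :qa_evaluator}
dynamic = EvaluatorSource.new(-> { registry })
registry["summarize"] = :summarize_evaluator
p dynamic.current.keys
```

One plausible reason for supporting a callable is that evaluators registered after the service is constructed still show up in later requests, though the diff itself does not state the motivation.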

data/lib/braintrust/server/rack.rb CHANGED

@@ -15,6 +15,8 @@ require_relative "auth/no_auth"
 require_relative "auth/clerk_token"
 require_relative "middleware/cors"
 require_relative "middleware/auth"
+require_relative "services/list_service"
+require_relative "services/eval_service"
 require_relative "handlers/health"
 require_relative "handlers/list"
 require_relative "handlers/eval"

data/lib/braintrust/server/services/eval_service.rb ADDED

@@ -0,0 +1,214 @@
+# frozen_string_literal: true
+require "json"
+module Braintrust
+  module Server
+    module Services
+      # Framework-agnostic service for running evaluations and streaming SSE results.
+      # Must be long-lived (not per-request) to preserve the @state_cache across requests.
+      class Eval
+        def initialize(evaluators)
+          @evaluators = evaluators
+          @state_mutex = Mutex.new
+          @state_cache = {}
+        end
+        # Validates request body. Returns:
+        #   {error: String, status: Integer} on failure
+        #   {evaluator:, name:, cases:, dataset:, ...} on success
+        def validate(body)
+          name = body["name"]
+          return {error: "Missing required field: name", status: 400} unless name
+          evaluator = current_evaluators[name]
+          return {error: "Evaluator '#{name}' not found", status: 404} unless evaluator
+          data = body["data"]
+          return {error: "Missing required field: data", status: 400} unless data
+          data_sources = ["data", "dataset_name", "dataset_id"].count { |k| data.key?(k) }
+          return {error: "Exactly one data source required", status: 400} if data_sources != 1
+          cases, dataset = resolve_data_source(data)
+          {
+            evaluator: evaluator,
+            name: name,
+            cases: cases,
+            dataset: dataset,
+            experiment_name: body["experiment_name"],
+            remote_scorer_ids: resolve_remote_scorers(body["scores"]),
+            parent: resolve_parent(body["parent"]),
+            project_id: body["project_id"]
+          }
+        end
+        # Runs the validated eval and streams SSE events via the sse writer.
+        # +validated+ is the hash returned by #validate.
+        # +auth+ is the auth context hash (or nil/true for no-auth).
+        # +sse+ is an SSEWriter instance.
+        def stream(validated, auth:, sse:)
+          name = validated[:name]
+          evaluator = validated[:evaluator]
+          cases = validated[:cases]
+          dataset = validated[:dataset]
+          experiment_name = validated[:experiment_name]
+          remote_scorer_ids = validated[:remote_scorer_ids]
+          parent = validated[:parent]
+          project_id = validated[:project_id]
+          state = build_state(auth)
+          # Only pass project/experiment params when state is available
+          run_opts = {
+            on_progress: ->(progress_data) {
+              # Build remote eval protocol events from generic progress data.
+              # Runner provides: id, data/error, scores (optional), origin (optional).
+              # Protocol requires: id, object_type, origin, name, format, output_type, event, data.
+              base = {
+                "object_type" => "task",
+                "name" => name,
+                "format" => "code",
+                "output_type" => "completion"
+              }
+              base["id"] = progress_data["id"] if progress_data["id"]
+              base["origin"] = progress_data["origin"] if progress_data["origin"]
+              if progress_data.key?("error")
+                sse.event("progress", JSON.dump(base.merge("event" => "error", "data" => progress_data["error"])))
+              else
+                sse.event("progress", JSON.dump(base.merge("event" => "json_delta", "data" => JSON.dump(progress_data["data"]))))
+              end
+              # Signal per-cell completion so the UI exits "Streaming..." state
+              # and updates the progress bar immediately.
+              sse.event("progress", JSON.dump(base.merge("event" => "done", "data" => "")))
+            },
+            quiet: true
+          }
+          run_opts[:parent] = parent if parent
+          run_opts[:scorers] = remote_scorer_ids if remote_scorer_ids
+          run_opts[:dataset] = dataset if dataset
+          if state
+            run_opts[:state] = state
+            run_opts[:experiment] = experiment_name if experiment_name
+            run_opts[:project_id] = project_id if project_id
+          end
+          result = evaluator.run(cases, **run_opts)
+          # Flush buffered OTLP spans before sending completion events.
+          # The BatchSpanProcessor exports every ~5s; fast evals can finish
+          # before a single export fires, causing the UI to see no results.
+          Braintrust::Trace.flush_spans
+          # Build summary from result scores
+          averaged_scores = {}
+          result.scorer_stats.each do |scorer_name, stats|
+            averaged_scores[scorer_name] = stats.score_mean
+          end
+          sse.event("summary", JSON.dump({
+            "scores" => averaged_scores,
+            "experiment_name" => experiment_name,
+            "experiment_id" => result.experiment_id,
+            "project_id" => result.project_id
+          }))
+          sse.event("done", "")
+        end
+        # Build State from auth context hash.
+        # Returns nil when auth is not a Hash (e.g. NoAuth returns true).
+        # Uses an LRU-style cache (max 64 entries) keyed by [api_key, app_url, org_name].
+        def build_state(auth)
+          return nil unless auth.is_a?(Hash)
+          cache_key = [auth["api_key"], auth["app_url"], auth["org_name"]]
+          @state_mutex ||= Mutex.new
+          @state_cache ||= {}
+          @state_mutex.synchronize do
+            cached = @state_cache[cache_key]
+            return cached if cached
+            state = Braintrust::State.new(
+              api_key: auth["api_key"],
+              org_id: auth["org_id"],
+              org_name: auth["org_name"],
+              app_url: auth["app_url"],
+              api_url: auth["api_url"],
+              enable_tracing: false
+            )
+            if @state_cache.size >= 64
+              oldest_key = @state_cache.keys.first
+              @state_cache.delete(oldest_key)
+            end
+            @state_cache[cache_key] = state
+            state
+          end
+        end
+        private
+        def current_evaluators
+          return @evaluators.call if @evaluators.respond_to?(:call)
+          @evaluators
+        end
+        # Resolve data source from the data field.
+        # Returns [cases, dataset] where exactly one is non-nil.
+        def resolve_data_source(data)
+          if data.key?("data")
+            cases = data["data"].map do |d|
+              {input: d["input"], expected: d["expected"]}
+            end
+            [cases, nil]
+          elsif data.key?("dataset_id")
+            [nil, Braintrust::Dataset::ID.new(id: data["dataset_id"])]
+          elsif data.key?("dataset_name")
+            dataset_opts = {name: data["dataset_name"]}
+            dataset_opts[:project] = data["project_name"] if data["project_name"]
+            [nil, dataset_opts]
+          else
+            [nil, nil]
+          end
+        end
+        # Map request scores array to Scorer::ID structs.
+        # The UI sends function_id as a nested object: {"function_id": "uuid"}.
+        def resolve_remote_scorers(scores)
+          return nil if scores.nil? || scores.empty?
+          scores.map do |s|
+            func_id = s["function_id"]
+            func_id = func_id["function_id"] if func_id.is_a?(Hash)
+            Braintrust::Scorer::ID.new(
+              function_id: func_id,
+              version: s["version"]
+            )
+          end
+        end
+        # Map request parent to symbol-keyed Hash.
+        # Hardcode playground_id to match Java SDK behavior.
+        # Also extracts generation from propagated_event for span_attributes.
+        def resolve_parent(parent)
+          return nil unless parent.is_a?(Hash)
+          object_id = parent["object_id"]
+          return nil unless object_id
+          generation = parent.dig("propagated_event", "span_attributes", "generation")
+          result = {object_type: "playground_id", object_id: object_id}
+          result[:generation] = generation if generation
+          result
+        end
+      end
+    end
+  end
+end
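
Note on the hunk above: `build_state` describes its cache as "LRU-style", but as written a cache hit returns early without reordering the entry, and eviction removes `@state_cache.keys.first`. Because Ruby hashes preserve insertion order, that is the oldest inserted key, so the behavior is closer to first-in, first-out. The sketch below isolates that logic with a small capacity to make the eviction order visible. `BoundedCache` and its `fetch` method are hypothetical names, and the string values stand in for `Braintrust::State` objects.

```ruby
# Hypothetical stand-in for the @state_cache logic; stores any value.
class BoundedCache
  def initialize(max_size)
    @max_size = max_size
    @mutex = Mutex.new
    @entries = {}
  end

  # Returns the cached value for key, or builds, stores, and returns it.
  def fetch(key)
    @mutex.synchronize do
      cached = @entries[key]
      return cached if cached

      value = yield
      # Ruby hashes keep insertion order, so keys.first is the oldest insert.
      @entries.delete(@entries.keys.first) if @entries.size >= @max_size
      @entries[key] = value
      value
    end
  end

  def keys
    @mutex.synchronize { @entries.keys }
  end
end

cache = BoundedCache.new(2)
cache.fetch(:a) { "state-a" }
cache.fetch(:b) { "state-b" }
cache.fetch(:a) { "unused" }   # hit: does not move :a to the back
cache.fetch(:c) { "state-c" }  # full: evicts :a, the oldest insert
p cache.keys                   # prints [:b, :c]
```

A strict LRU would instead delete and re-insert the key on every hit so that recently used entries move to the end of the hash.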

data/lib/braintrust/server/services/list_service.rb ADDED

@@ -0,0 +1,64 @@
+# frozen_string_literal: true
+require "json"
+module Braintrust
+  module Server
+    module Services
+      # Framework-agnostic service for listing evaluators.
+      # Returns a plain Hash (not a Rack triplet) suitable for JSON.dump.
+      class List
+        def initialize(evaluators)
+          @evaluators = evaluators
+        end
+        def call
+          result = {}
+          current_evaluators.each do |name, evaluator|
+            scores = (evaluator.scorers || []).each_with_index.map do |scorer, i|
+              scorer_name = scorer.respond_to?(:name) ? scorer.name : "score_#{i}"
+              {"name" => scorer_name}
+            end
+            entry = {"scores" => scores}
+            params = serialize_parameters(evaluator.parameters)
+            entry["parameters"] = params if params
+            result[name] = entry
+          end
+          result
+        end
+        private
+        def current_evaluators
+          return @evaluators.call if @evaluators.respond_to?(:call)
+          @evaluators
+        end
+        # Convert user-defined parameters to the dev server protocol format.
+        # Wraps in a staticParameters container with "data" typed entries.
+        def serialize_parameters(parameters)
+          return nil unless parameters && !parameters.empty?
+          schema = {}
+          parameters.each do |name, spec|
+            spec = spec.transform_keys(&:to_s) if spec.is_a?(Hash)
+            if spec.is_a?(Hash)
+              schema[name.to_s] = {
+                "type" => "data",
+                "schema" => {"type" => spec["type"] || "string"},
+                "default" => spec["default"],
+                "description" => spec["description"]
+              }
+            end
+          end
+          {
+            "type" => "braintrust.staticParameters",
+            "schema" => schema,
+            "source" => nil
+          }
+        end
+      end
+    end
+  end
+end
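
Note on the hunk above: `serialize_parameters` wraps each Hash-valued parameter spec in a `"data"` entry, defaults the schema type to `"string"`, and silently skips specs that are not Hashes. The sketch below is a standalone copy of that logic with a sample input, so the resulting JSON shape can be inspected. The parameter names and values are made up for the example.

```ruby
require "json"

# Standalone copy of the serialization logic, for illustration only.
def serialize_parameters(parameters)
  return nil if parameters.nil? || parameters.empty?

  schema = {}
  parameters.each do |name, spec|
    next unless spec.is_a?(Hash) # non-Hash specs are skipped

    spec = spec.transform_keys(&:to_s)
    schema[name.to_s] = {
      "type" => "data",
      "schema" => {"type" => spec["type"] || "string"},
      "default" => spec["default"],
      "description" => spec["description"]
    }
  end

  {"type" => "braintrust.staticParameters", "schema" => schema, "source" => nil}
end

# Hypothetical evaluator parameters; names are invented for the example.
params = {
  temperature: {type: "number", default: 0.2, description: "Sampling temperature"},
  style: {default: "concise"}
}
puts JSON.pretty_generate(serialize_parameters(params))
```

Note that an entry without a `description` still carries the key with a `nil` value, which serializes as JSON `null`.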

data/lib/braintrust/trace/span_processor.rb CHANGED

@@ -80,11 +80,6 @@ module Braintrust
       # Determine if a span should be forwarded to the wrapped processor
       # based on configured filters
       def should_forward_span?(span)
-        # Always keep root spans (spans with no parent)
-        # Check if parent_span_id is the invalid/zero span ID
-        is_root = span.parent_span_id == OpenTelemetry::Trace::INVALID_SPAN_ID
-        return true if is_root
         # If no filters, keep everything
         return true if @filters.empty?

data/lib/braintrust/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Braintrust
-  VERSION = "0.2.1"
+  VERSION = "0.3.0"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: braintrust
 version: !ruby/object:Gem::Version
-  version: 0.2.1
+  version: 0.3.0
 platform: ruby
 authors:
 - Braintrust
@@ -215,6 +215,14 @@ files:
 - lib/braintrust/contrib/openai/patcher.rb
 - lib/braintrust/contrib/patcher.rb
 - lib/braintrust/contrib/rails/railtie.rb
+- lib/braintrust/contrib/rails/server.rb
+- lib/braintrust/contrib/rails/server/application_controller.rb
+- lib/braintrust/contrib/rails/server/engine.rb
+- lib/braintrust/contrib/rails/server/eval_controller.rb
+- lib/braintrust/contrib/rails/server/generator.rb
+- lib/braintrust/contrib/rails/server/health_controller.rb
+- lib/braintrust/contrib/rails/server/list_controller.rb
+- lib/braintrust/contrib/rails/server/routes.rb
 - lib/braintrust/contrib/registry.rb
 - lib/braintrust/contrib/ruby_llm/deprecated.rb
 - lib/braintrust/contrib/ruby_llm/instrumentation/chat.rb
@@ -267,6 +275,8 @@ files:
 - lib/braintrust/server/rack.rb
 - lib/braintrust/server/rack/app.rb
 - lib/braintrust/server/router.rb
+- lib/braintrust/server/services/eval_service.rb
+- lib/braintrust/server/services/list_service.rb
 - lib/braintrust/server/sse.rb
 - lib/braintrust/setup.rb
 - lib/braintrust/state.rb