RubyGems - braintrust - Versions diffs - 0.1.4 → 0.2.0 - Mend

braintrust 0.1.4 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

checksums.yaml +4 -4
data/README.md +71 -2
data/lib/braintrust/api/datasets.rb +10 -0
data/lib/braintrust/api/internal/experiments.rb +1 -1
data/lib/braintrust/dataset.rb +10 -6
data/lib/braintrust/eval/evaluator.rb +72 -0
data/lib/braintrust/eval/functions.rb +44 -10
data/lib/braintrust/eval/runner.rb +55 -13
data/lib/braintrust/eval/scorer.rb +4 -0
data/lib/braintrust/eval.rb +97 -50
data/lib/braintrust/server/auth/clerk_token.rb +68 -0
data/lib/braintrust/server/auth/no_auth.rb +14 -0
data/lib/braintrust/server/handlers/eval.rb +217 -0
data/lib/braintrust/server/handlers/health.rb +16 -0
data/lib/braintrust/server/handlers/list.rb +74 -0
data/lib/braintrust/server/middleware/auth.rb +29 -0
data/lib/braintrust/server/middleware/cors.rb +87 -0
data/lib/braintrust/server/rack/app.rb +38 -0
data/lib/braintrust/server/rack.rb +36 -0
data/lib/braintrust/server/router.rb +37 -0
data/lib/braintrust/server/sse.rb +52 -0
data/lib/braintrust/server.rb +8 -0
data/lib/braintrust/trace/span_exporter.rb +36 -0
data/lib/braintrust/trace.rb +3 -4
data/lib/braintrust/version.rb +1 -1
metadata +15 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: e1a5c8840f707c7b4da95e4ccc8abea32591606d667a309432f2955d5df26eca
-  data.tar.gz: a45e62f34a1d59dd11e1cc46ff8d128a495a45a80fa1ce2026c76b648b58de89
+  metadata.gz: d67e6d0faeb24297af8a5f43ac1bd1ceacff1f37df2610244ae5f81e34c4ae5f
+  data.tar.gz: 489ec68fee424aa8aa1880b73b58f1f26529493d8898cd0ae5876d3b919fcb7c
 SHA512:
-  metadata.gz: 75b71465a80ed2cfd3c6600113dd62357d01e0bd672f2043045f56d7d0223882cc2c5fd9f8927973ae546f99b68971bbe66fd34d66ec6fd62fafd65ca52abcd7
-  data.tar.gz: 06eb21fec07c05755aacd0a214cd16594f47a7033e723a2798df26692a0c15ccd9d5ed614588cc805620d303bb0f1a1c577c66b25c92c6dbbaa40283023cd662
+  metadata.gz: cd876122ad92c5439ff45e975fd84418bfcc7d72d6f9398e48b1ac4c60f09fb96c2b85b46ee1c8de6a75291c0b7d2754ee2fa069f77f8a2f8a4c069132c59d94
+  data.tar.gz: 45d3f80f69ac9725d93aa0db24815da093bfd992b5418f8551c8d25e8caef9299f270a92fa922a4bc4bf3190d9f823a35c7203f9a74bd58daee31869b987f103
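
The digests above can be checked against a downloaded archive with Ruby's standard `digest` library. The following is a minimal sketch; `sha256_matches?` is a hypothetical helper name, and the temporary file stands in for a real `data.tar.gz` extracted from the `.gem` archive.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: compares a file's SHA256 digest with an expected hex string.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Stand-in file; in practice this would be data.tar.gz from the gem archive.
Tempfile.create("data.tar.gz") do |file|
  file.write("example bytes")
  file.flush
  expected = Digest::SHA256.hexdigest("example bytes")
  puts sha256_matches?(file.path, expected)
end
```

The same pattern applies to the SHA512 entries by swapping in `Digest::SHA512`.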

data/README.md CHANGED

@@ -23,6 +23,7 @@ This is the official Ruby SDK for [Braintrust](https://www.braintrust.dev), for
 - [Evals](#evals)
   - [Datasets](#datasets)
   - [Scorers](#scorers)
+  - [Dev Server](#dev-server)
 - [Documentation](#documentation)
 - [Troubleshooting](#troubleshooting)
 - [Contributing](#contributing)
@@ -148,8 +149,8 @@ Braintrust.init(
 The SDK automatically instruments these LLM libraries:
-| Provider  | Gem           | Versions | Integration Name | Examples                                        |
-| --------- | ------------- | -------- | ---------------- | ----------------------------------------------- |
+| Provider  | Gem           | Versions | Integration Name | Examples                                  |
+| --------- | ------------- | -------- | ---------------- | ----------------------------------------- |
 | Anthropic | `anthropic`   | >= 0.3.0 | `:anthropic`     | [Link](./examples/contrib/anthropic.rb)   |
 | OpenAI    | `openai`      | >= 0.1.0 | `:openai`        | [Link](./examples/contrib/openai.rb)      |
 |           | `ruby-openai` | >= 7.0.0 | `:ruby_openai`   | [Link](./examples/contrib/ruby-openai.rb) |
@@ -318,6 +319,74 @@ Braintrust::Eval.run(
 See examples: [eval.rb](./examples/eval.rb), [dataset.rb](./examples/eval/dataset.rb), [remote_functions.rb](./examples/eval/remote_functions.rb)
+### Dev Server
+Run evaluations from the Braintrust web UI against code in your own application. Define evaluators, pass them to the dev server, and start serving:
+```ruby
+# eval_server.ru
+require "braintrust/eval"
+require "braintrust/server"
+# Define evaluators — these can reference your application code (models, services, etc.)
+food_classifier = Braintrust::Eval::Evaluator.new(
+  task: ->(input) { FoodClassifier.classify(input) },
+  scorers: [
+    Braintrust::Eval.scorer("exact_match") { |input, expected, output| output == expected ? 1.0 : 0.0 }
+  ]
+)
+# Initialize Braintrust (requires BRAINTRUST_API_KEY)
+Braintrust.init(blocking_login: true)
+# Start the server
+run Braintrust::Server::Rack.app(
+  evaluators: {
+    "food-classifier" => food_classifier
+  }
+)
+```
+```bash
+bundle exec rackup eval_server.ru -p 8300 -o 0.0.0.0
+```
+**Custom evaluators**
+Evaluators can also be defined as subclasses:
+```ruby
+class FoodClassifier < Braintrust::Eval::Evaluator
+  def task
+    ->(input) { classify(input) }
+  end
+  def scorers
+    [Braintrust::Eval.scorer("exact_match") { |i, e, o| o == e ? 1.0 : 0.0 }]
+  end
+end
+```
+**Supported web servers**
+The dev server requires the `rack` gem and a Rack-compatible web server.
+| Server                                         | Version Supported | Notes                                |
+| ---------------------------------------------- | ----------------- | ------------------------------------ |
+| [Puma](https://puma.io/)                       | 6.x               |                                      |
+| [Falcon](https://socketry.github.io/falcon/)   | 0.x               |                                      |
+| [Passenger](https://www.phusionpassenger.com/) | 6.x               |                                      |
+| [WEBrick](https://github.com/ruby/webrick)     | Not supported     | Does not support server-sent events. |
+Add your chosen server to your Gemfile:
+```ruby
+gem "rack"
+gem "puma" # recommended
+```
+See example: [server/eval.ru](./examples/server/eval.ru)
 ## Documentation
 - [Braintrust Documentation](https://www.braintrust.dev/docs)
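
The README notes that WEBrick is unsupported because it does not support server-sent events. As a rough illustration of what a streaming response involves, the sketch below frames events in the standard `text/event-stream` format and serves them from a Rack-style callable. It does not reproduce the gem's `server/sse.rb`, which is not shown in this section; `sse_frame`, `build_sketch_app`, and the event names `progress` and `done` are assumptions made for illustration.

```ruby
require "json"

# Hypothetical helper: formats one server-sent event frame.
# A frame is an optional "event:" line, a "data:" line, and a blank line.
def sse_frame(data, event: nil)
  frame = +""
  frame << "event: #{event}\n" if event
  frame << "data: #{JSON.dump(data)}\n\n"
  frame
end

# Hypothetical Rack-style app: a callable returning [status, headers, body].
# The body only needs to respond to #each, so an Enumerator can yield frames lazily.
def build_sketch_app
  lambda do |_env|
    body = Enumerator.new do |out|
      out << sse_frame({"id" => "abc", "scores" => {"exact_match" => 1.0}}, event: "progress")
      out << sse_frame({}, event: "done")
    end
    [200, {"content-type" => "text/event-stream", "cache-control" => "no-cache"}, body]
  end
end

_status, _headers, body = build_sketch_app.call({})
body.each { |chunk| print chunk }
```

A server has to flush each frame to the client as it is produced for this to be useful, which is why the choice of web server matters here.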

data/lib/braintrust/api/datasets.rb CHANGED

@@ -82,6 +82,14 @@ module Braintrust
         http_post_json("/v1/dataset/#{id}/insert", {events: events})
       end
+      # Delete a dataset by ID
+      # DELETE /v1/dataset/{id}
+      # @param id [String] Dataset UUID
+      # @return [Hash] Delete response
+      def delete(id:)
+        http_request(:delete, "/v1/dataset/#{id}")
+      end
       # Generate a permalink URL to view a dataset in the Braintrust UI
       # @param id [String] Dataset UUID
       # @return [String] Permalink URL
@@ -150,6 +158,8 @@ module Braintrust
           req["Content-Type"] = "application/json"
           req.body = JSON.dump(payload) if payload
           req
+        when :delete
+          Net::HTTP::Delete.new(uri)
         else
           raise ArgumentError, "Unsupported HTTP method: #{method}"
         end
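
The `:delete` branch added above extends a method-symbol dispatch over `Net::HTTP` request classes. A standalone sketch of that dispatch follows; `build_request` and the example URL are hypothetical, and the `:get` branch is an assumption since the hunk does not show it. No network call is made.

```ruby
require "net/http"
require "uri"
require "json"

# Hypothetical helper: maps a method symbol to a Net::HTTP request object.
def build_request(method, uri, payload = nil)
  case method
  when :get
    Net::HTTP::Get.new(uri)
  when :post
    req = Net::HTTP::Post.new(uri)
    req["Content-Type"] = "application/json"
    req.body = JSON.dump(payload) if payload
    req
  when :delete
    # DELETE carries no body here, so only the URI is needed.
    Net::HTTP::Delete.new(uri)
  else
    raise ArgumentError, "Unsupported HTTP method: #{method}"
  end
end

# Placeholder host and dataset ID, for illustration only.
uri = URI("https://api.example.com/v1/dataset/1234")
req = build_request(:delete, uri)
puts "#{req.method} #{req.path}"
```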

data/lib/braintrust/api/internal/experiments.rb CHANGED

@@ -29,9 +29,9 @@ module Braintrust
           payload = {
             project_id: project_id,
-            name: name,
             ensure_new: ensure_new
           }
+          payload[:name] = name if name
           payload[:tags] = tags if tags
           payload[:metadata] = metadata if metadata
           payload[:dataset_id] = dataset_id if dataset_id
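
The effect of moving `name` out of the hash literal shows up in the serialized JSON: a `nil` name is now omitted instead of being sent as `null`. The sketch below uses a hypothetical `experiment_payload` helper to show the difference; how the server treats either form is not shown in this diff.

```ruby
require "json"

# Hypothetical stand-in for the payload construction in the hunk above.
def experiment_payload(project_id:, ensure_new:, name: nil, tags: nil)
  payload = {project_id: project_id, ensure_new: ensure_new}
  payload[:name] = name if name
  payload[:tags] = tags if tags
  payload
end

# 0.1.4 style: the key is always present, so nil serializes as null.
puts JSON.dump({project_id: "p1", name: nil, ensure_new: true})

# 0.2.0 style: the key is only added when a name is given.
puts JSON.dump(experiment_payload(project_id: "p1", ensure_new: true))
puts JSON.dump(experiment_payload(project_id: "p1", ensure_new: true, name: "baseline"))
```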

data/lib/braintrust/dataset.rb CHANGED

@@ -12,9 +12,9 @@ module Braintrust
   #   dataset = Braintrust::Dataset.new(name: "my-dataset", project: "my-project")
   #   dataset.each { |record| puts record[:input] }
   #
-  # @example With explicit API client
-  #   api = Braintrust::API.new(state: my_state)
-  #   dataset = Braintrust::Dataset.new(name: "my-dataset", project: "my-project", api: api)
+  # @example With explicit state
+  #   state = Braintrust.init(api_key: "...")
+  #   dataset = Braintrust::Dataset.new(name: "my-dataset", project: "my-project", state: state)
   #
   # @example Eager loading for small datasets
   #   records = dataset.fetch_all(limit: 100)
@@ -38,13 +38,13 @@ module Braintrust
     # @param id [String, nil] Dataset UUID (required if name not provided)
     # @param project [String, nil] Project name (required if using name)
     # @param version [String, nil] Optional version to pin to
-    # @param api [API, nil] Braintrust API client (defaults to API.new using global state)
-    def initialize(name: nil, id: nil, project: nil, version: nil, api: nil)
+    # @param state [State, nil] Braintrust state (defaults to global state)
+    def initialize(name: nil, id: nil, project: nil, version: nil, state: nil)
       @name = name
       @provided_id = id
       @project = project
       @version = version
-      @api = api || API.new
+      @api = API.new(state: state)
       @resolved_id = nil
       @metadata = nil
@@ -182,4 +182,8 @@ module Braintrust
       )
     end
   end
+  # Value object wrapping a dataset UUID for resolution by ID.
+  # Used by Eval.run to distinguish dataset-by-ID from dataset-by-name.
+  DatasetId = Struct.new(:id, keyword_init: true)
 end
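
A wrapper struct such as `DatasetId` lets a caller tell an ID apart from a plain name when both are strings. The sketch below shows one way such a dispatch could look; `describe_dataset` is hypothetical and does not reproduce how `Eval.run` resolves datasets, which is outside this hunk.

```ruby
# Same definition as in the diff, repeated here so the sketch is self-contained.
DatasetId = Struct.new(:id, keyword_init: true)

# Hypothetical dispatcher: the struct type, not the string content, selects the branch.
def describe_dataset(dataset)
  case dataset
  when DatasetId then "by id: #{dataset.id}"
  when String then "by name: #{dataset}"
  when Hash then "by options: #{dataset.keys.join(", ")}"
  when nil then "no dataset"
  else raise ArgumentError, "Unsupported dataset: #{dataset.class}"
  end
end

puts describe_dataset(DatasetId.new(id: "0000-1111"))
puts describe_dataset("my-dataset")
```

Because the struct is declared with `keyword_init: true`, it is constructed as `DatasetId.new(id: "...")`.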

data/lib/braintrust/eval/evaluator.rb ADDED

@@ -0,0 +1,72 @@
+# frozen_string_literal: true
+module Braintrust
+  module Eval
+    # Base class for evaluators. Subclass and override #task and #scorers,
+    # or instantiate directly with keyword arguments.
+    #
+    # @example Subclass pattern
+    #   class FoodClassifier < Braintrust::Eval::Evaluator
+    #     def task
+    #       ->(input) { classify(input) }
+    #     end
+    #
+    #     def scorers
+    #       [Braintrust::Eval.scorer("exact_match") { |i, e, o| o == e ? 1.0 : 0.0 }]
+    #     end
+    #   end
+    #
+    # @example Inline pattern
+    #   Braintrust::Eval::Evaluator.new(
+    #     task: ->(input) { input.upcase },
+    #     scorers: [my_scorer]
+    #   )
+    class Evaluator
+      attr_accessor :task, :scorers, :parameters
+      def initialize(task: nil, scorers: [], parameters: {})
+        @task = task
+        @scorers = scorers
+        @parameters = parameters
+      end
+      # Validate that the evaluator has required fields set.
+      # @raise [ArgumentError] if validation fails
+      def validate!
+        raise ArgumentError, "task is required" unless task
+        unless task.respond_to?(:call)
+          raise ArgumentError, "task must be callable (respond to :call)"
+        end
+      end
+      # Run this evaluator against the given cases.
+      # Delegates to Braintrust::Eval.run with the evaluator's task and scorers.
+      #
+      # @param cases [Array] The test cases
+      # @param on_progress [#call, nil] Optional callback fired after each test case
+      # @param quiet [Boolean] If true, suppress result output (default: false)
+      # @param project [String, nil] Project name
+      # @param experiment [String, nil] Experiment name
+      # @param project_id [String, nil] Project UUID (skips project creation)
+      # @param dataset [String, Hash, Dataset, DatasetId, nil] Dataset to fetch
+      # @param scorers [Array, nil] Additional scorers (merged with evaluator's own)
+      # @param parent [Hash, nil] Parent span context
+      # @param state [State, nil] Braintrust state
+      # @param update [Boolean] If true, allow reusing existing experiment (default: false)
+      # @param tracer_provider [TracerProvider, nil] OpenTelemetry tracer provider (defaults to global)
+      # @return [Result]
+      def run(cases, on_progress: nil, quiet: false,
+        project: nil, experiment: nil, project_id: nil,
+        dataset: nil, scorers: nil, parent: nil,
+        state: nil, update: false, tracer_provider: nil)
+        all_scorers = scorers ? self.scorers + scorers : self.scorers
+        Braintrust::Eval.run(
+          task: task, scorers: all_scorers, cases: cases, dataset: dataset,
+          project: project, experiment: experiment, project_id: project_id,
+          parent: parent, on_progress: on_progress, quiet: quiet,
+          state: state, update: update, tracer_provider: tracer_provider
+        )
+      end
+    end
+  end
+end
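
The two usage patterns documented for `Evaluator` (subclass and inline) and the scorer merge in `#run` can be exercised without the gem. In the sketch below, `SketchEvaluator` is a stand-in that mirrors only the constructor, `validate!`, and the merge rule; `merged_scorers` and `UpcaseEvaluator` are hypothetical names, and plain lambdas replace `Braintrust::Eval.scorer`.

```ruby
# Stand-in mirroring parts of the Evaluator class above; it is not the gem's class.
class SketchEvaluator
  attr_accessor :task, :scorers

  def initialize(task: nil, scorers: [])
    @task = task
    @scorers = scorers
  end

  def validate!
    raise ArgumentError, "task is required" unless task
    raise ArgumentError, "task must be callable (respond to :call)" unless task.respond_to?(:call)
  end

  # Same merge rule as Evaluator#run: the evaluator's own scorers first, then extras.
  def merged_scorers(extra = nil)
    extra ? scorers + extra : scorers
  end
end

# Subclass pattern: overriding the readers replaces the instance variables.
class UpcaseEvaluator < SketchEvaluator
  def task
    ->(input) { input.upcase }
  end

  def scorers
    [->(_input, expected, output) { (output == expected) ? 1.0 : 0.0 }]
  end
end

# Inline pattern: pass the task as a keyword argument.
inline = SketchEvaluator.new(task: ->(input) { input.downcase })
inline.validate!

subclassed = UpcaseEvaluator.new
subclassed.validate!
extra = ->(_input, _expected, output) { output.empty? ? 0.0 : 1.0 }
puts subclassed.merged_scorers([extra]).length
```

Since `validate!` goes through the `task` reader, it works for both patterns.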

data/lib/braintrust/eval/functions.rb CHANGED

@@ -56,6 +56,27 @@ module Braintrust
           end
         end
+        # Create a scorer that invokes a remote function by ID
+        # @param id [String] Function UUID
+        # @param version [String, nil] Optional version to pin to
+        # @param state [State, nil] Braintrust state (defaults to global)
+        # @param tracer_provider [TracerProvider, nil] OpenTelemetry tracer provider
+        # @return [Scorer] Scorer object that invokes remote function
+        def scorer_by_id(id:, state: nil, version: nil, tracer_provider: nil)
+          state ||= Braintrust.current_state
+          api = API.new(state: state)
+          api.login
+          function_metadata = api.functions.get(id: id, version: version)
+          function_id = function_metadata["id"]
+          function_name = function_metadata["name"] || id
+          tracer_provider ||= OpenTelemetry.tracer_provider
+          tracer = tracer_provider.tracer("braintrust.functions")
+          build_scorer(function_id: function_id, function_name: function_name, api: api, tracer: tracer)
+        end
         # Create a scorer that invokes a remote function
         # @param project [String] Project name
         # @param slug [String] Function slug
@@ -76,10 +97,21 @@ module Braintrust
           tracer_provider ||= OpenTelemetry.tracer_provider
           tracer = tracer_provider.tracer("braintrust.functions")
-          # Create a scorer that invokes the remote function
-          Scorer.new(slug) do |input, expected, output, metadata|
-            # Create a span for the function invocation
-            tracer.in_span("function: #{slug}") do |span|
+          build_scorer(function_id: function_id, function_name: function_name, api: api, tracer: tracer)
+        end
+        private
+        # Build a Scorer that invokes a remote function
+        # Shared implementation used by both scorer and scorer_by_id
+        # @param function_id [String] Function UUID
+        # @param function_name [String] Function display name
+        # @param api [API] Braintrust API client
+        # @param tracer [OpenTelemetry::Trace::Tracer] Tracer instance
+        # @return [Scorer]
+        def build_scorer(function_id:, function_name:, api:, tracer:)
+          Scorer.new(function_name) do |input, expected, output, metadata|
+            tracer.in_span("function: #{function_name}") do |span|
               scorer_input = {
                 input: input,
                 expected: expected,
@@ -91,14 +123,17 @@ module Braintrust
               span.set_attribute("braintrust.input_json", JSON.dump(scorer_input))
               span.set_attribute("braintrust.function.name", function_name)
               span.set_attribute("braintrust.function.id", function_id)
-              span.set_attribute("braintrust.function.slug", slug)
               begin
-                # Invoke the function via API
-                # The remote scorer receives all scorer arguments
                 result = api.functions.invoke(id: function_id, input: scorer_input)
                 score = case result
+                when Numeric
+                  result.to_f
+                when true
+                  1.0
+                when false
+                  0.0
                 when Hash
                   if result.key?("score")
                     result["score"].to_f
@@ -107,6 +142,8 @@ module Braintrust
                   end
                 when String
                   result.to_f
+                when nil
+                  nil
                 else
                   raise Error, "Unsupported result type: #{result.class}"
                 end
@@ -114,7 +151,6 @@ module Braintrust
                 span.set_attribute("braintrust.output_json", JSON.dump(score))
                 score
               rescue => e
-                # Record exception and set error status
                 span.record_exception(e)
                 span.status = OpenTelemetry::Trace::Status.error(e.message)
                 raise
@@ -123,8 +159,6 @@ module Braintrust
           end
         end
-        private
         # Resolve function ID from project name and slug
         # @param api [API] API client
         # @param project [String] Project name
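
The widened `case` in `build_scorer` normalizes several remote result shapes into a float or `nil`. The sketch below isolates that coercion in a hypothetical `coerce_score` helper. The hunk elides what happens when a Hash has no `"score"` key, so raising in that case is this sketch's assumption, as is the use of `ArgumentError` in place of the gem's `Error` class.

```ruby
# Hypothetical helper isolating the result-to-score coercion shown in the hunk.
def coerce_score(result)
  case result
  when Numeric then result.to_f
  when true then 1.0
  when false then 0.0
  when Hash
    # Assumption: the diff does not show the branch for a missing "score" key.
    raise ArgumentError, "Hash result has no \"score\" key" unless result.key?("score")
    result["score"].to_f
  when String then result.to_f
  when nil then nil
  else
    raise ArgumentError, "Unsupported result type: #{result.class}"
  end
end

[1, true, false, {"score" => "0.5"}, "0.25", nil].each do |result|
  puts "#{result.inspect} -> #{coerce_score(result).inspect}"
end
```

One consequence of the `String` branch is that `String#to_f` returns `0.0` for non-numeric text, so a string such as `"n/a"` becomes a score of zero instead of raising.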

data/lib/braintrust/eval/runner.rb CHANGED

@@ -17,18 +17,21 @@ module Braintrust
       # Maximum parallelism allowed (mirrors Internal::ThreadPool::MAX_PARALLELISM)
       MAX_PARALLELISM = Internal::ThreadPool::MAX_PARALLELISM
-      def initialize(experiment_id:, experiment_name:, project_id:, project_name:,
-        task:, scorers:, api:, tracer_provider: nil)
+      def initialize(task:, scorers:, experiment_id: nil, experiment_name: nil,
+        project_id: nil, project_name: nil, state: nil, tracer_provider: nil,
+        on_progress: nil, parent: nil)
         @experiment_id = experiment_id
         @experiment_name = experiment_name
         @project_id = project_id
         @project_name = project_name
         @task = task
         @scorers = normalize_scorers(scorers)
-        @api = api
+        @state = state
         @tracer_provider = tracer_provider || OpenTelemetry.tracer_provider
         @tracer = @tracer_provider.tracer("braintrust-eval")
-        @parent_attr = "experiment_id:#{experiment_id}"
+        @parent_attr = parent ? "#{parent[:object_type]}:#{parent[:object_id]}" : nil
+        @generation = parent&.dig(:generation)
+        @on_progress = on_progress
         # Mutex for thread-safe score collection
         @score_mutex = Mutex.new
@@ -60,8 +63,10 @@ module Braintrust
         # Calculate duration
         duration = Time.now - start_time
-        # Generate permalink
-        permalink = @api.object_permalink(object_type: "experiment", object_id: experiment_id)
+        # Generate permalink (only when state and experiment are available)
+        permalink = if @state && experiment_id
+          @state.object_permalink(object_type: "experiment", object_id: experiment_id)
+        end
         Result.new(
           experiment_id: experiment_id,
@@ -86,7 +91,7 @@ module Braintrust
       # @param errors [Queue] Thread-safe error collection queue
       def run_case(test_case, errors)
         tracer.in_span("eval") do |eval_span|
-          eval_span.set_attribute("braintrust.parent", parent_attr)
+          eval_span.set_attribute("braintrust.parent", parent_attr) if parent_attr
           # Set tags early so they're present even if task fails
           eval_span.set_attribute("braintrust.tags", test_case.tags) if test_case.tags
@@ -99,12 +104,23 @@ module Braintrust
             # Error already recorded on task span, set eval span status
             eval_span.status = OpenTelemetry::Trace::Status.error(e.message)
             errors << "Task failed for input '#{test_case.input}': #{e.message}"
+            if @on_progress
+              error_progress = {
+                "id" => eval_span.context.hex_span_id,
+                "error" => e.message
+              }
+              if test_case.origin
+                error_progress["origin"] = test_case.origin.is_a?(String) ? JSON.parse(test_case.origin) : test_case.origin
+              end
+              @on_progress.call(error_progress)
+            end
             next
           end
           # Run scorers
+          case_scores = nil
           begin
-            run_scorers(test_case, output)
+            case_scores = run_scorers(test_case, output)
           rescue => e
             # Error already recorded on score span, set eval span status
             eval_span.status = OpenTelemetry::Trace::Status.error(e.message)
@@ -112,13 +128,25 @@ module Braintrust
           end
           # Set eval span attributes (after task and scorers complete)
-          set_json_attr(eval_span, "braintrust.span_attributes", {type: "eval"})
+          set_json_attr(eval_span, "braintrust.span_attributes", build_span_attributes("eval"))
           set_json_attr(eval_span, "braintrust.input_json", test_case.input)
           set_json_attr(eval_span, "braintrust.output_json", output)
           set_json_attr(eval_span, "braintrust.expected", test_case.expected) if test_case.expected
           # Set origin for cases from remote sources (already JSON-serialized)
           eval_span.set_attribute("braintrust.origin", test_case.origin) if test_case.origin
+          if @on_progress
+            progress = {
+              "id" => eval_span.context.hex_span_id,
+              "data" => output,
+              "scores" => case_scores || {}
+            }
+            if test_case.origin
+              progress["origin"] = test_case.origin.is_a?(String) ? JSON.parse(test_case.origin) : test_case.origin
+            end
+            @on_progress.call(progress)
+          end
         end
       end
@@ -128,8 +156,8 @@ module Braintrust
       # @return [Object] Task output
       def run_task(test_case)
         tracer.in_span("task") do |task_span|
-          task_span.set_attribute("braintrust.parent", parent_attr)
-          set_json_attr(task_span, "braintrust.span_attributes", {type: "task"})
+          task_span.set_attribute("braintrust.parent", parent_attr) if parent_attr
+          set_json_attr(task_span, "braintrust.span_attributes", build_span_attributes("task"))
           set_json_attr(task_span, "braintrust.input_json", test_case.input)
           begin
@@ -149,10 +177,11 @@ module Braintrust
       # Creates single score span for all scorers
       # @param test_case [Case] The test case
       # @param output [Object] Task output
+      # @return [Hash] Scores hash { scorer_name => score_value }
       def run_scorers(test_case, output)
         tracer.in_span("score") do |score_span|
-          score_span.set_attribute("braintrust.parent", parent_attr)
-          set_json_attr(score_span, "braintrust.span_attributes", {type: "score"})
+          score_span.set_attribute("braintrust.parent", parent_attr) if parent_attr
+          set_json_attr(score_span, "braintrust.span_attributes", build_span_attributes("score"))
           scores = {}
           scorer_error = nil
@@ -173,6 +202,8 @@ module Braintrust
           # Raise after setting scores so we can see which scorers succeeded
           raise scorer_error if scorer_error
+          scores
         end
       end
@@ -221,6 +252,17 @@ module Braintrust
         span.status = OpenTelemetry::Trace::Status.error(error.message)
       end
+      # Build span_attributes hash with type, and optionally name and generation.
+      # Matches Java SDK behavior of including these on every span.
+      # @param type [String] Span type ("eval", "task", or "score")
+      # @return [Hash]
+      def build_span_attributes(type)
+        attrs = {type: type}
+        attrs[:name] = experiment_name if experiment_name
+        attrs[:generation] = @generation if @generation
+        attrs
+      end
       # Set a span attribute by JSON encoding the value
       # @param span [OpenTelemetry::Trace::Span] The span
       # @param key [String] The attribute key
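
Two pieces of the runner changes are easy to try in isolation: the span attribute hash built by `build_span_attributes`, and the progress hash passed to `on_progress`. The sketch below reimplements both as standalone functions; `progress_event` is a hypothetical name, and the span ID and origin values are made up for illustration.

```ruby
require "json"

# Standalone version of build_span_attributes, with the instance state
# (experiment name, generation) passed in as keyword arguments.
def build_span_attributes(type, experiment_name: nil, generation: nil)
  attrs = {type: type}
  attrs[:name] = experiment_name if experiment_name
  attrs[:generation] = generation if generation
  attrs
end

# Hypothetical helper building the success payload that the runner
# hands to the on_progress callback.
def progress_event(span_id:, output:, scores:, origin: nil)
  event = {"id" => span_id, "data" => output, "scores" => scores || {}}
  if origin
    # Origins from remote sources arrive JSON-serialized, so parse strings back into hashes.
    event["origin"] = origin.is_a?(String) ? JSON.parse(origin) : origin
  end
  event
end

events = []
on_progress = ->(event) { events << event }
on_progress.call(
  progress_event(span_id: "00f067aa0ba902b7", output: "fruit",
    scores: {"exact_match" => 1.0}, origin: '{"object_type":"dataset"}')
)
puts JSON.dump(events.first)
puts JSON.dump(build_span_attributes("eval", experiment_name: "baseline"))
```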

data/lib/braintrust/eval/scorer.rb CHANGED

@@ -105,4 +105,8 @@ module Braintrust
       end
     end
   end
+  # Value object wrapping a remote scorer function UUID.
+  # Used by Eval.run to distinguish remote scorers from local callables.
+  ScorerId = Struct.new(:function_id, :version, keyword_init: true)
 end
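
A mixed scorer list can be split into remote references and local callables by checking for the `ScorerId` struct. The sketch below shows one possible approach; `partition_scorers` is hypothetical, and the actual resolution logic in `Eval.run` is not part of this section.

```ruby
# Same definition as in the diff, repeated here so the sketch is self-contained.
ScorerId = Struct.new(:function_id, :version, keyword_init: true)

# Hypothetical helper: returns [remote_ids, local_scorers].
def partition_scorers(scorers)
  scorers.partition { |scorer| scorer.is_a?(ScorerId) }
end

local = ->(_input, expected, output) { (output == expected) ? 1.0 : 0.0 }
remote, callables = partition_scorers([ScorerId.new(function_id: "fn-123"), local])

puts remote.map(&:function_id).inspect
puts callables.length
```

Members not passed to the constructor default to `nil`, so `version` can be left out when no pinning is needed.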