langsmith-sdk 0.3.2 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: eff170a34b29bf7c78f0f7cc81f3634aefff1e0b33c50cc88d37568e9da2c24a
-   data.tar.gz: 13c343e6387ad70f82c9d1f85f3dd07f5d0a5aa09a6cb2f35b873b28c59520b6
+   metadata.gz: 60f8521d25bfd486266a608d4ef371ca88966598d65efb62f1ee0d16934e7589
+   data.tar.gz: 8fcba710ec2d8d5eb8a017baba74b419cddf1ab927343bd3a17afc2430fe8d95
  SHA512:
-   metadata.gz: f5a525d017355d9a0aa320a5d1dd4f64893404f7856bd640df61d1aa12d868edefc2e8f29891512f88a6a8c838e0869be462edbcf956d5b8656856e73b90f427
-   data.tar.gz: 87476c11dd5789b1460e0ff8d386fa6e843c0c3073253eb2de2f96fb3c8baf6c58aa2b148b51ebc0797a2e206e41398c3a595c331ce586c67f8dbcffc47bf0d2
+   metadata.gz: 13851df9e6cabd412d88cd45919ba0d8168bc54d9f3168de09c4c9373cf385fd44d96d7b26d949969eb0e5bcae99036a5604e9e7283ede390b31dd875f9d7f2c
+   data.tar.gz: 4a46c7321921dfcbcb2d7a7828e13d51d9f4c66984903562f265acca9dfde452f97a4ab375d0a5c0587fb4a6fda536655777494cfae1e13b2697f2d048e398ad
data/CHANGELOG.md CHANGED
@@ -7,6 +7,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
  ## [Unreleased]
 
+ ## [0.4.0] - 2026-02-17
+
+ ### Added
+
+ - Multi-tenant evaluation support with `tenant_id` parameter in `ExperimentRunner`
+ - Context tracking for evaluation root run tenant ID
+ - Tenant ID propagation to dataset, experiment, and feedback API calls
+
+ ### Changed
+
+ - Improved experiment cleanup with ensure block in `ExperimentRunner#run`
+
  ## [0.3.2] - 2026-02-11
 
  ### Fixed
@@ -89,7 +101,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
  - `prompt` - Prompt template rendering
  - `parser` - Output parsing operations
 
- [Unreleased]: https://github.com/felipekb/langsmith-ruby-sdk/compare/v0.3.2...HEAD
+ [Unreleased]: https://github.com/felipekb/langsmith-ruby-sdk/compare/v0.4.0...HEAD
+ [0.4.0]: https://github.com/felipekb/langsmith-ruby-sdk/compare/v0.3.2...v0.4.0
  [0.3.2]: https://github.com/felipekb/langsmith-ruby-sdk/compare/v0.3.1...v0.3.2
  [0.3.1]: https://github.com/felipekb/langsmith-ruby-sdk/compare/v0.3.0...v0.3.1
  [0.3.0]: https://github.com/felipekb/langsmith-ruby-sdk/compare/v0.2.0...v0.3.0
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # LangSmith Ruby SDK
 
- A Ruby SDK for [LangSmith](https://smith.langchain.com/) tracing and observability.
+ A Ruby SDK for [LangSmith](https://smith.langchain.com/) tracing, experiments, and evaluations.
 
  ## Installation
 
@@ -164,6 +164,67 @@ Langsmith.trace("openai_call", run_type: "llm") do |run|
  end
  ```
 
+ ## Evaluations (Datasets + Experiments)
+
+ Run your app against a LangSmith dataset and attach evaluator feedback to each traced example run:
+
+ ```ruby
+ require "langsmith"
+
+ summary = Langsmith::Evaluation.run(
+   dataset_id: "dataset-uuid",
+   experiment_name: "qa-baseline-v1",
+   description: "First baseline on FAQ dataset",
+   metadata: { model: "gpt-4", prompt_version: 3 },
+   evaluators: {
+     correctness: lambda { |outputs:, reference_outputs:, inputs:, run:|
+       predicted = outputs[:answer].to_s.strip.downcase
+       expected = reference_outputs[:answer].to_s.strip.downcase
+
+       {
+         score: predicted == expected ? 1.0 : 0.0,
+         value: predicted,
+         comment: "question=#{inputs[:question]} run_id=#{run[:id]}"
+       }
+     },
+     has_answer: ->(outputs:, **) { outputs[:answer].to_s.empty? ? 0.0 : 1.0 }
+   }
+ ) do |example|
+   # Wrap each dataset example in a trace so feedback can attach to the run.
+   Langsmith.trace("qa_inference", run_type: "chain", inputs: example[:inputs]) do
+     answer = MyApp.answer(example[:inputs][:question])
+     { answer: answer }
+   end
+ end
+
+ pp summary
+ ```
+
+ ### Evaluator Contract
+
+ Each evaluator receives keyword arguments:
+ - `outputs:` your block return value
+ - `reference_outputs:` `example[:outputs]` from the dataset
+ - `inputs:` `example[:inputs]` from the dataset
+ - `run:` the LangSmith run hash for the traced example
+
+ Evaluator return values:
+ - `Numeric` -> used as `score`
+ - `true` / `false` -> converted to `1.0` / `0.0`
+ - `Hash` -> expected keys: `:score`, `:value`, `:comment`
+ - `nil` -> skip feedback creation for that evaluator
+
+ If one evaluator raises, the others still run. If your example block raises, the example is marked failed and the experiment continues.
+
+ ### Evaluation Summary
+
+ `Langsmith::Evaluation.run` returns:
+ - `:experiment_id`
+ - `:total`
+ - `:succeeded`
+ - `:failed`
+ - `:results` (per-example `:example_id`, `:run_id`, `:status`, `:error`, `:feedback`)
+
  ## Examples
 
  See [`examples/LLM_TRACING.md`](examples/LLM_TRACING.md) for comprehensive examples including:
@@ -175,6 +236,7 @@ See [`examples/LLM_TRACING.md`](examples/LLM_TRACING.md) for comprehensive examp
  - Error handling and retries
  - Multi-tenant tracing
  - Per-trace project overrides
+ - Dataset experiments and evaluations (see section above)
 
  ## Development
 
@@ -183,4 +245,3 @@ After checking out the repo, run `bundle install` to install dependencies. Then,
  ## License
 
  The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
-
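Reviewer sketch (not part of the gem): the "Evaluator return values" rules added to the README above can be read as a small normalization step. The helper name `normalize_evaluator_result` and the `ArgumentError` for unsupported types are assumptions made for illustration; the gem's own `normalize_result` is not shown in this diff and may behave differently.

```ruby
# Hypothetical helper mirroring the README's evaluator return-value rules.
# It is NOT the gem's normalize_result, whose body does not appear in this diff.
def normalize_evaluator_result(result)
  case result
  when nil     then nil              # nil -> skip feedback creation
  when true    then { score: 1.0 }   # true  -> 1.0
  when false   then { score: 0.0 }   # false -> 0.0
  when Numeric then { score: result } # Numeric -> used as score
  when Hash    then result.slice(:score, :value, :comment) # keep only the documented keys
  else
    # Assumption: anything else is rejected; the README does not say what happens here.
    raise ArgumentError, "unsupported evaluator result: #{result.class}"
  end
end

p normalize_evaluator_result(true)
p normalize_evaluator_result({ score: 0.5, comment: "partial match" })
```

Under this reading, an evaluator such as `has_answer` in the README example, which returns a bare float, would produce feedback carrying only a `score`.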
data/langsmith.gemspec ADDED
@@ -0,0 +1,38 @@
+ # frozen_string_literal: true
+
+ require_relative "lib/langsmith/version"
+
+ Gem::Specification.new do |spec|
+   spec.name = "langsmith-sdk"
+   spec.version = Langsmith::VERSION
+   spec.authors = ["Felipe Cabezudo"]
+   spec.email = ["felipecabedilo@gmail.com"]
+
+   spec.summary = "Ruby SDK for LangSmith tracing and observability"
+   spec.description = "A Ruby client for LangSmith, providing tracing and observability for LLM applications"
+   spec.homepage = "https://github.com/felipekb/langsmith-ruby-sdk"
+   spec.license = "MIT"
+   spec.required_ruby_version = ">= 3.1.0"
+
+   spec.metadata["allowed_push_host"] = "https://rubygems.org"
+   spec.metadata["homepage_uri"] = spec.homepage
+   spec.metadata["source_code_uri"] = spec.homepage
+   spec.metadata["changelog_uri"] = "#{spec.homepage}/blob/main/CHANGELOG.md"
+   spec.metadata["rubygems_mfa_required"] = "true"
+
+   spec.files = Dir.chdir(__dir__) do
+     `git ls-files -z`.split("\x0").reject do |f|
+       (File.expand_path(f) == __FILE__) ||
+         f.start_with?(*%w[bin/ test/ spec/ features/ .git .github appveyor Gemfile])
+     end
+   end
+   spec.bindir = "exe"
+   spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
+   spec.require_paths = ["lib"]
+
+   # Runtime dependencies
+   spec.add_dependency "concurrent-ruby", ">= 1.1", "< 3.0"
+   spec.add_dependency "faraday", "~> 2.0"
+   spec.add_dependency "faraday-net_http_persistent", "~> 2.0"
+   spec.add_dependency "faraday-retry", "~> 2.0"
+ end
@@ -14,7 +14,9 @@ module Langsmith
      CONTEXT_KEY = :langsmith_run_stack
      EVALUATION_CONTEXT_KEY = :langsmith_evaluation_context
      EVALUATION_ROOT_RUN_ID_KEY = :langsmith_evaluation_root_run_id
-     private_constant :CONTEXT_KEY, :EVALUATION_CONTEXT_KEY, :EVALUATION_ROOT_RUN_ID_KEY
+     EVALUATION_ROOT_RUN_TENANT_ID_KEY = :langsmith_evaluation_root_run_tenant_id
+     private_constant :CONTEXT_KEY, :EVALUATION_CONTEXT_KEY, :EVALUATION_ROOT_RUN_ID_KEY,
+                      :EVALUATION_ROOT_RUN_TENANT_ID_KEY
 
      class << self
        # Returns the current run stack for this thread.
@@ -56,6 +58,7 @@ module Langsmith
          Thread.current[CONTEXT_KEY] = []
          Thread.current[EVALUATION_CONTEXT_KEY] = nil
          Thread.current[EVALUATION_ROOT_RUN_ID_KEY] = nil
+         Thread.current[EVALUATION_ROOT_RUN_TENANT_ID_KEY] = nil
        end
 
        # Check if there's an active trace context
@@ -99,6 +102,19 @@ module Langsmith
          Thread.current[EVALUATION_ROOT_RUN_ID_KEY]
        end
 
+       # Stores the root run tenant ID for the current evaluation example.
+       #
+       # @param tenant_id [String, nil] the root run's tenant ID
+       def set_evaluation_root_run_tenant_id(tenant_id)
+         Thread.current[EVALUATION_ROOT_RUN_TENANT_ID_KEY] = tenant_id
+       end
+
+       # Returns the root run tenant ID for the current evaluation example, or nil.
+       # @return [String, nil]
+       def evaluation_root_run_tenant_id
+         Thread.current[EVALUATION_ROOT_RUN_TENANT_ID_KEY]
+       end
+
        # Execute a block with evaluation context set.
        # Context is cleared in ensure block even if the block raises.
        #
@@ -110,6 +126,7 @@ module Langsmith
        ensure
          Thread.current[EVALUATION_CONTEXT_KEY] = nil
          Thread.current[EVALUATION_ROOT_RUN_ID_KEY] = nil
+         Thread.current[EVALUATION_ROOT_RUN_TENANT_ID_KEY] = nil
        end
      end
    end
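Reviewer sketch (not part of the gem): the hunks above store the root run's tenant ID in `Thread.current` and reset it in an `ensure` block. The standalone module below illustrates that pattern under assumed names (`SketchContext`, `TENANT_KEY`, `with_evaluation` without arguments); it is a minimal approximation, not the gem's `Context` implementation.

```ruby
# Hypothetical stand-in for the context module touched by this diff.
module SketchContext
  TENANT_KEY = :sketch_evaluation_root_run_tenant_id
  private_constant :TENANT_KEY

  class << self
    # Records the tenant ID of the root run for the current example.
    def set_root_run_tenant_id(tenant_id)
      Thread.current[TENANT_KEY] = tenant_id
    end

    # Returns the recorded tenant ID, or nil when none is set.
    def root_run_tenant_id
      Thread.current[TENANT_KEY]
    end

    # Runs the block, then clears the tenant ID even if the block raises,
    # so one example's tenant cannot leak into the next.
    def with_evaluation
      yield
    ensure
      Thread.current[TENANT_KEY] = nil
    end
  end
end

SketchContext.with_evaluation do
  SketchContext.set_root_run_tenant_id("tenant-a")
  puts SketchContext.root_run_tenant_id
end
p SketchContext.root_run_tenant_id
```

Because the value is read inside the block and cleared on exit, a caller has to copy it into a local variable before the block ends, which is what `run_example` does with `run_tenant_id` later in this diff. Note that `Thread#[]` storage in Ruby is fiber-local, so the value is also invisible to other fibers on the same thread.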
@@ -13,13 +13,16 @@ module Langsmith
        # @param description [String, nil] optional experiment description
        # @param metadata [Hash, nil] optional experiment metadata
        # @param evaluators [Hash] map of evaluator key to callable
+       # @param tenant_id [String, nil] tenant ID for dataset/session/feedback API calls
        # @param block [Proc] block that receives each example and produces a result
-       def initialize(dataset_id:, experiment_name:, description: nil, metadata: nil, evaluators: {}, &block)
+       def initialize(dataset_id:, experiment_name:, description: nil, metadata: nil, evaluators: {}, tenant_id: nil,
+                      &block)
          @dataset_id = dataset_id
          @experiment_name = experiment_name
          @description = description
          @metadata = metadata
          @evaluators = evaluators
+         @tenant_id = tenant_id
          @block = block
        end
 
@@ -27,22 +30,22 @@ module Langsmith
        #
        # @return [Hash] summary with :experiment_id, :total, :succeeded, :failed, :results
        def run
-         examples = client.list_examples(dataset_id: @dataset_id)
+         experiment_id = nil
+         examples = client.list_examples(dataset_id: @dataset_id, tenant_id: @tenant_id)
 
          experiment = client.create_experiment(
            name: @experiment_name,
            dataset_id: @dataset_id,
            description: @description,
-           metadata: @metadata
+           metadata: @metadata,
+           tenant_id: @tenant_id
          )
          experiment_id = experiment[:id]
 
          results = examples.map { |example| run_example(example, experiment_id) }
-
-         Langsmith.flush
-         client.close_experiment(experiment_id: experiment_id, end_time: Time.now.utc.iso8601)
-
          build_summary(experiment_id, results)
+       ensure
+         close_experiment(experiment_id) if experiment_id
        end
 
        private
@@ -54,45 +57,53 @@ module Langsmith
        def run_example(example, experiment_id)
          outputs = nil
          run_id = nil
+         run_tenant_id = nil
 
          begin
            Context.with_evaluation(experiment_id: experiment_id, example_id: example[:id]) do
              outputs = @block.call(example)
              run_id = Context.evaluation_root_run_id
+             run_tenant_id = Context.evaluation_root_run_tenant_id
            end
          rescue StandardError => e
            return { example_id: example[:id], run_id: nil, status: :error, error: e.message, feedback: nil }
          end
 
-         feedback = run_evaluators(example, outputs, run_id)
+         feedback = run_evaluators(example, outputs, run_id, run_tenant_id)
          { example_id: example[:id], run_id: run_id, status: :success, error: nil, feedback: feedback }
        rescue StandardError => e
          { example_id: example[:id], run_id: run_id, status: :success, error: e.message, feedback: nil }
        end
 
-       def run_evaluators(example, outputs, run_id)
+       def run_evaluators(example, outputs, run_id, run_tenant_id)
          return nil if @evaluators.empty? || run_id.nil?
 
+         tenant_id = run_tenant_id || @tenant_id
          Langsmith.flush
-         run = fetch_run_with_retry(run_id)
+         run = fetch_run_with_retry(run_id, tenant_id: tenant_id)
 
          @evaluators.each_with_object({}) do |(key, evaluator), feedback|
-           feedback[key] = execute_evaluator(key, evaluator, example, outputs, run_id, run)
+           feedback[key] = execute_evaluator(key, evaluator, example, outputs, run_id, run, tenant_id)
          end
        end
 
        # LangSmith has indexing lag after batch ingest — the run may not be
        # queryable immediately. Retry a few times with a short delay.
-       def fetch_run_with_retry(run_id, retries: 3, delay: 1)
-         client.read_run(run_id: run_id)
+       def fetch_run_with_retry(run_id, tenant_id:, retries: 3, delay: 1)
+         client.read_run(run_id: run_id, tenant_id: tenant_id)
        rescue Client::APIError => e
          raise unless e.status_code == 404 && retries.positive?
 
          sleep(delay)
-         fetch_run_with_retry(run_id, retries: retries - 1, delay: delay)
+         fetch_run_with_retry(run_id, tenant_id: tenant_id, retries: retries - 1, delay: delay)
+       end
+
+       def close_experiment(experiment_id)
+         Langsmith.flush
+         client.close_experiment(experiment_id: experiment_id, end_time: Time.now.utc.iso8601, tenant_id: @tenant_id)
        end
 
-       def execute_evaluator(key, evaluator, example, outputs, run_id, run)
+       def execute_evaluator(key, evaluator, example, outputs, run_id, run, tenant_id)
          result = evaluator.call(
            outputs: outputs,
            reference_outputs: example[:outputs],
@@ -102,7 +113,7 @@ module Langsmith
          return { score: nil, success: true, skipped: true } if result.nil?
 
          normalized = normalize_result(result)
-         client.create_feedback(run_id: run_id, key: key.to_s, **normalized)
+         client.create_feedback(run_id: run_id, key: key.to_s, tenant_id: tenant_id, **normalized)
          normalized.merge(success: true)
        rescue StandardError => e
          { score: nil, success: false, error: e.message }
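Reviewer sketch (not part of the gem): the `run` change above initializes `experiment_id = nil` first and closes the experiment in a method-level `ensure`, guarded by `if experiment_id`. The example below reproduces that shape with a hypothetical `FakeClient` and `run_experiment` method, both invented for illustration, to show that the close call happens when the body raises and is skipped when creation itself fails.

```ruby
# Hypothetical in-memory client; it only records which experiments were closed.
class FakeClient
  attr_reader :closed

  def initialize(fail_on_create: false)
    @fail_on_create = fail_on_create
    @closed = []
  end

  def create_experiment(name:, tenant_id: nil)
    raise "create failed" if @fail_on_create

    { id: "exp-#{name}", tenant_id: tenant_id }
  end

  def close_experiment(experiment_id:, tenant_id: nil)
    @closed << [experiment_id, tenant_id]
  end
end

# Same control flow as the new ExperimentRunner#run, reduced to its skeleton.
def run_experiment(client, name:, tenant_id: nil)
  experiment_id = nil # defined up front so the ensure clause can test it
  experiment_id = client.create_experiment(name: name, tenant_id: tenant_id)[:id]
  yield experiment_id
ensure
  # Close only experiments that were actually created.
  client.close_experiment(experiment_id: experiment_id, tenant_id: tenant_id) if experiment_id
end

client = FakeClient.new
begin
  run_experiment(client, name: "baseline", tenant_id: "tenant-a") { raise "example loop blew up" }
rescue RuntimeError => e
  puts "raised: #{e.message}"
end
p client.closed
```

The guard matters because `list_examples` and `create_experiment` can raise before an ID exists; without it, the cleanup would try to close an experiment that was never created.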
@@ -22,15 +22,17 @@ module Langsmith
      # @param description [String, nil] optional experiment description
      # @param metadata [Hash, nil] optional experiment metadata
      # @param evaluators [Hash] map of evaluator key to callable (see ExperimentRunner)
+     # @param tenant_id [String, nil] tenant ID for dataset/session/feedback API calls
      # @yield [Hash] each dataset example
      # @return [Hash] summary with :experiment_id, :total, :succeeded, :failed, :results
-     def self.run(dataset_id:, experiment_name:, description: nil, metadata: nil, evaluators: {}, &block)
+     def self.run(dataset_id:, experiment_name:, description: nil, metadata: nil, evaluators: {}, tenant_id: nil, &block)
        ExperimentRunner.new(
          dataset_id: dataset_id,
          experiment_name: experiment_name,
          description: description,
          metadata: metadata,
          evaluators: evaluators,
+         tenant_id: tenant_id,
          &block
        ).run
      end
@@ -140,7 +140,10 @@ module Langsmith
      # attach feedback to it later. Only root runs (no parent) register;
      # child runs must not overwrite.
      def register_evaluation_root_run(effective_parent_id)
-       Context.set_evaluation_root_run_id(@run.id) if effective_parent_id.nil? && Context.evaluating?
+       return unless effective_parent_id.nil? && Context.evaluating?
+
+       Context.set_evaluation_root_run_id(@run.id)
+       Context.set_evaluation_root_run_tenant_id(@run.tenant_id)
      end
 
      # Sanitize block results to prevent circular references.
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Langsmith
-   VERSION = "0.3.2"
+   VERSION = "0.4.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: langsmith-sdk
  version: !ruby/object:Gem::Version
-   version: 0.3.2
+   version: 0.4.0
  platform: ruby
  authors:
  - Felipe Cabezudo
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2026-02-11 00:00:00.000000000 Z
+ date: 2026-02-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: concurrent-ruby
@@ -91,6 +91,7 @@ files:
  - examples/complex_agent.rb
  - examples/llm_tracing.rb
  - examples/openai_integration.rb
+ - langsmith.gemspec
  - lib/langsmith.rb
  - lib/langsmith/batch_processor.rb
  - lib/langsmith/client.rb