dspy-evals 1.0.0 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (5)
  1. checksums.yaml +4 -4
  2. data/README.md +149 -185
  3. data/lib/dspy/evals/version.rb +1 -1
  4. data/lib/dspy/evals.rb +106 -33
  5. metadata +7 -10
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2b829eb6518603189ba468cbd12bdfb8cf28569805219e0cf6402912e9f70aea
- data.tar.gz: 3d20d190d99d337d3fc1acec66360eb7abbccab343ced186cc7af0e1072a2f88
+ metadata.gz: d6e4ef06553cd53f974d9813c83807ce43a4222132f6200837a15087722e6483
+ data.tar.gz: 077d72f3f4db1122e749248b29d27206f277cef03954ef58483dc4452328fbac
  SHA512:
- metadata.gz: 89ea669cb6ae3a7f4443a9cab042ece8c5a50f39ededed884bd44d12454c55fad6d3c3a580fc70e6824a12dea44378325fab9e4fc2a66e3483c262d76dbf3e56
- data.tar.gz: 848ddef8ad63facc5a08b0c79c138c7da4283b25207c670a74f328bdc3cafcff1563515b1a1f12b0b264f5b812d746dcd6219354c192d6b99b2b109d097a1055
+ metadata.gz: 73aa76a6904812e98cf7ded086a9642ca9fc696c3e1c017b5fec91a63a4887e54d61b961d79d6cb3d1fda2fc44707007e67849c71e61185af14d7146dc50ef0a
+ data.tar.gz: 4270e116956e4ba8960f958302f817a3236d82922224a8cdfd066d172693602e99dfe2da3a68c4699c91fb3a01d9f0e5da0ad6bec86b3bb4b3a802193b269d41
data/README.md CHANGED
@@ -3,81 +3,97 @@
  [![Gem Version](https://img.shields.io/gem/v/dspy)](https://rubygems.org/gems/dspy)
  [![Total Downloads](https://img.shields.io/gem/dt/dspy)](https://rubygems.org/gems/dspy)
  [![Build Status](https://img.shields.io/github/actions/workflow/status/vicentereig/dspy.rb/ruby.yml?branch=main&label=build)](https://github.com/vicentereig/dspy.rb/actions/workflows/ruby.yml)
- [![Documentation](https://img.shields.io/badge/docs-vicentereig.github.io%2Fdspy.rb-blue)](https://vicentereig.github.io/dspy.rb/)
+ [![Documentation](https://img.shields.io/badge/docs-oss.vicente.services%2Fdspy.rb-blue)](https://oss.vicente.services/dspy.rb/)
+ [![Discord](https://img.shields.io/discord/1161519468141355160?label=discord&logo=discord&logoColor=white)](https://discord.gg/zWBhrMqn)

- > [!NOTE]
- > The core Prompt Engineering Framework is production-ready with
- > comprehensive documentation. I am focusing now on educational content on systematic Prompt Optimization and Context Engineering.
- > Your feedback is invaluable. if you encounter issues, please open an [issue](https://github.com/vicentereig/dspy.rb/issues). If you have suggestions, open a [new thread](https://github.com/vicentereig/dspy.rb/discussions).
- >
- > If you want to contribute, feel free to reach out to me to coordinate efforts: hey at vicente.services
- >
- > And, yes, this is 100% a legit project. :)
+ **Build reliable LLM applications in idiomatic Ruby using composable, type-safe modules.**

+ DSPy.rb is the Ruby port of Stanford's [DSPy](https://dspy.ai). Instead of wrestling with brittle prompt strings, you define typed signatures and let the framework handle the rest. Prompts become functions. LLM calls become predictable.

- **Build reliable LLM applications in idiomatic Ruby using composable, type-safe modules.**
+ ```ruby
+ require 'dspy'

- The Ruby framework for programming with large language models. DSPy.rb brings structured LLM programming to Ruby developers, programmatic Prompt Engineering and Context Engineering.
- Instead of wrestling with prompt strings and parsing responses, you define typed signatures using idiomatic Ruby to compose and decompose AI Worklows and AI Agents.
+ DSPy.configure do |c|
+ c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
+ end

- **Prompts are the just Functions.** Traditional prompting is like writing code with string concatenation: it works until it doesn't. DSPy.rb brings you
- the programming approach pioneered by [dspy.ai](https://dspy.ai/): instead of crafting fragile prompts, you define modular
- signatures and let the framework handle the messy details.
+ class Summarize < DSPy::Signature
+ description "Summarize the given text in one sentence."

- DSPy.rb is an idiomatic Ruby surgical port of Stanford's [DSPy framework](https://github.com/stanfordnlp/dspy). While implementing
- the core concepts of signatures, predictors, and the main optimization algorithms from the original Python library, DSPy.rb embraces Ruby
- conventions and adds Ruby-specific innovations like Sorbet-base Typed system, ReAct loops, and production-ready integrations like non-blocking Open Telemetry Instrumentation.
+ input do
+ const :text, String
+ end

- **What you get?** Ruby LLM applications that actually scale and don't break when you sneeze.
+ output do
+ const :summary, String
+ end
+ end

- Check the [examples](examples/) and take them for a spin!
+ summarizer = DSPy::Predict.new(Summarize)
+ result = summarizer.call(text: "DSPy.rb brings structured LLM programming to Ruby...")
+ puts result.summary
+ ```

- ## Your First DSPy Program
- ### Installation
+ That's it. No prompt templates. No JSON parsing. No prayer-based error handling.

- Add to your Gemfile:
+ ## Installation

  ```ruby
+ # Gemfile
  gem 'dspy'
+ gem 'dspy-openai' # For OpenAI, OpenRouter, or Ollama
+ # gem 'dspy-anthropic' # For Claude
+ # gem 'dspy-gemini' # For Gemini
+ # gem 'dspy-ruby_llm' # For 12+ providers via RubyLLM
  ```

- and
-
  ```bash
  bundle install
  ```

- ### Optional Sibling Gems
-
- DSPy.rb ships multiple gems from this monorepo so you only install what you need. Add these alongside `dspy`:
-
- | Gem | Description | Status |
- | --- | --- | --- |
- | `dspy-schema` | Exposes `DSPy::TypeSystem::SorbetJsonSchema` for downstream reuse. | **Stable** (v1.0.0) |
- | `dspy-code_act` | Think-Code-Observe agents that synthesize and execute Ruby safely. | Preview (0.x) |
- | `dspy-datasets` | Dataset helpers plus Parquet/Polars tooling for richer evaluation corpora. | Preview (0.x) |
- | `dspy-evals` | High-throughput evaluation harness with metrics, callbacks, and regression fixtures. | Preview (0.x) |
- | `dspy-miprov2` | Bayesian optimization + Gaussian Process backend for the MIPROv2 teleprompter. | Preview (0.x) |
- | `dspy-gepa` | `DSPy::Teleprompt::GEPA`, reflection loops, experiment tracking, telemetry adapters. | Preview (mirrors `dspy` version) |
- | `gepa` | GEPA optimizer core (Pareto engine, telemetry, reflective proposer). | Preview (mirrors `dspy` version) |
- | `dspy-o11y` | Core observability APIs: `DSPy::Observability`, async span processor, observation types. | **Stable** (v1.0.0) |
- | `dspy-o11y-langfuse` | Auto-configures DSPy observability to stream spans to Langfuse via OTLP. | **Stable** (v1.0.0) |
+ ## Quick Start

- Set the matching `DSPY_WITH_*` environment variables (see `Gemfile`) to include or exclude each sibling gem when running Bundler locally (for example `DSPY_WITH_GEPA=1` or `DSPY_WITH_O11Y_LANGFUSE=1`). Refer to `docs/core-concepts/dependency-tree.md` for the full dependency map and roadmap.
- ### Your First Reliable Predictor
+ ### Configure Your LLM

  ```ruby
-
- # Configure DSPy globablly to use your fave LLM - you can override this on an instance levle.
+ # OpenAI
  DSPy.configure do |c|
  c.lm = DSPy::LM.new('openai/gpt-4o-mini',
  api_key: ENV['OPENAI_API_KEY'],
- structured_outputs: true) # Enable OpenAI's native JSON mode
+ structured_outputs: true)
+ end
+
+ # Anthropic Claude
+ DSPy.configure do |c|
+ c.lm = DSPy::LM.new('anthropic/claude-sonnet-4-20250514',
+ api_key: ENV['ANTHROPIC_API_KEY'])
  end

- # Define a signature for sentiment classification - instead of writing a full prompt!
+ # Google Gemini
+ DSPy.configure do |c|
+ c.lm = DSPy::LM.new('gemini/gemini-2.5-flash',
+ api_key: ENV['GEMINI_API_KEY'])
+ end
+
+ # Ollama (local, free)
+ DSPy.configure do |c|
+ c.lm = DSPy::LM.new('ollama/llama3.2')
+ end
+
+ # OpenRouter (200+ models)
+ DSPy.configure do |c|
+ c.lm = DSPy::LM.new('openrouter/deepseek/deepseek-chat-v3.1:free',
+ api_key: ENV['OPENROUTER_API_KEY'])
+ end
+ ```
+
+ ### Define a Signature
+
+ Signatures are typed contracts for LLM operations. Define inputs, outputs, and let DSPy handle the prompt:
+
+ ```ruby
  class Classify < DSPy::Signature
- description "Classify sentiment of a given sentence." # sets the goal of the underlying prompt
+ description "Classify sentiment of a given sentence."

  class Sentiment < T::Enum
  enums do
@@ -86,182 +102,130 @@ class Classify < DSPy::Signature
  Neutral = new('neutral')
  end
  end
-
- # Structured Inputs: makes sure you are sending only valid prompt inputs to your model
+
  input do
  const :sentence, String, description: 'The sentence to analyze'
  end

- # Structured Outputs: your predictor will validate the output of the model too.
  output do
- const :sentiment, Sentiment, description: 'The sentiment of the sentence'
- const :confidence, Float, description: 'A number between 0.0 and 1.0'
+ const :sentiment, Sentiment
+ const :confidence, Float
  end
  end

- # Wire it to the simplest prompting technique - a Predictn.
- classify = DSPy::Predict.new(Classify)
- # it may raise an error if you mess the inputs or your LLM messes the outputs.
- result = classify.call(sentence: "This book was super fun to read!")
+ classifier = DSPy::Predict.new(Classify)
+ result = classifier.call(sentence: "This book was super fun to read!")

- puts result.sentiment # => #<Sentiment::Positive>
- puts result.confidence # => 0.85
+ result.sentiment # => #<Sentiment::Positive>
+ result.confidence # => 0.92
  ```

- ### Access to 200+ Models Across 5 Providers
+ ### Chain of Thought

- DSPy.rb provides unified access to major LLM providers with provider-specific optimizations:
+ For complex reasoning, use `ChainOfThought` to get step-by-step explanations:

  ```ruby
- # OpenAI (GPT-4, GPT-4o, GPT-4o-mini, GPT-5, etc.)
- DSPy.configure do |c|
- c.lm = DSPy::LM.new('openai/gpt-4o-mini',
- api_key: ENV['OPENAI_API_KEY'],
- structured_outputs: true) # Native JSON mode
- end
+ solver = DSPy::ChainOfThought.new(MathProblem)
+ result = solver.call(problem: "If a train travels 120km in 2 hours, what's its speed?")

- # Google Gemini (Gemini 1.5 Pro, Flash, Gemini 2.0, etc.)
- DSPy.configure do |c|
- c.lm = DSPy::LM.new('gemini/gemini-2.5-flash',
- api_key: ENV['GEMINI_API_KEY'],
- structured_outputs: true) # Native structured outputs
- end
+ result.reasoning # => "Speed = Distance / Time = 120km / 2h = 60km/h"
+ result.answer # => "60 km/h"
+ ```

- # Anthropic Claude (Claude 3.5, Claude 4, etc.)
- DSPy.configure do |c|
- c.lm = DSPy::LM.new('anthropic/claude-sonnet-4-5-20250929',
- api_key: ENV['ANTHROPIC_API_KEY'],
- structured_outputs: true) # Tool-based extraction (default)
- end
+ ### ReAct Agents

- # Ollama - Run any local model (Llama, Mistral, Gemma, etc.)
- DSPy.configure do |c|
- c.lm = DSPy::LM.new('ollama/llama3.2') # Free, runs locally, no API key needed
- end
+ Build agents that use tools to accomplish tasks:

- # OpenRouter - Access to 200+ models from multiple providers
- DSPy.configure do |c|
- c.lm = DSPy::LM.new('openrouter/deepseek/deepseek-chat-v3.1:free',
- api_key: ENV['OPENROUTER_API_KEY'])
+ ```ruby
+ class SearchTool < DSPy::Tools::Tool
+ tool_name "search"
+ description "Search for information"
+
+ input do
+ const :query, String
+ end
+
+ output do
+ const :results, T::Array[String]
+ end
+
+ def call(query:)
+ # Your search implementation
+ { results: ["Result 1", "Result 2"] }
+ end
  end
+
+ toolset = DSPy::Tools::Toolset.new(tools: [SearchTool.new])
+ agent = DSPy::ReAct.new(signature: ResearchTask, tools: toolset, max_iterations: 5)
+ result = agent.call(question: "What's the latest on Ruby 3.4?")
  ```

- ## What You Get
-
- **Developer Experience:**
- - LLM provider support using official Ruby clients:
- - [OpenAI Ruby](https://github.com/openai/openai-ruby) with vision model support
- - [Anthropic Ruby SDK](https://github.com/anthropics/anthropic-sdk-ruby) with multimodal capabilities
- - [Google Gemini API](https://ai.google.dev/) with native structured outputs
- - [Ollama](https://ollama.com/) via OpenAI compatibility layer for local models
- - **Multimodal Support** - Complete image analysis with DSPy::Image, type-safe bounding boxes, vision-capable models
- - Runtime type checking with [Sorbet](https://sorbet.org/) including T::Enum and union types
- - Type-safe tool definitions for ReAct agents
- - Comprehensive instrumentation and observability
-
- **Core Building Blocks:**
- - **Signatures** - Define input/output schemas using Sorbet types with T::Enum and union type support
- - **Predict** - LLM completion with structured data extraction and multimodal support
- - **Chain of Thought** - Step-by-step reasoning for complex problems with automatic prompt optimization
- - **ReAct** - Tool-using agents with type-safe tool definitions and error recovery
- - **Module Composition** - Combine multiple LLM calls into production-ready workflows
-
- **Optimization & Evaluation:**
- - **Prompt Objects** - Manipulate prompts as first-class objects instead of strings
- - **Typed Examples** - Type-safe training data with automatic validation
- - **Evaluation Framework** - Advanced metrics beyond simple accuracy with error-resilient pipelines
- - **MIPROv2 Optimization** - Advanced Bayesian optimization with Gaussian Processes, multiple optimization strategies, auto-config presets, and storage persistence
-
- **Production Features:**
- - **Reliable JSON Extraction** - Native structured outputs for OpenAI and Gemini, Anthropic tool-based extraction, and automatic strategy selection with fallback
- - **Type-Safe Configuration** - Strategy enums with automatic provider optimization (Strict/Compatible modes)
- - **Smart Retry Logic** - Progressive fallback with exponential backoff for handling transient failures
- - **Zero-Config Langfuse Integration** - Set env vars and get automatic OpenTelemetry traces in Langfuse
- - **Performance Caching** - Schema and capability caching for faster repeated operations
- - **File-based Storage** - Optimization result persistence with versioning
- - **Structured Logging** - JSON and key=value formats with span tracking
-
- ## Recent Achievements
-
- DSPy.rb has rapidly evolved from experimental to production-ready:
-
- ### Foundation
- - ✅ **JSON Parsing Reliability** - Native OpenAI structured outputs with adaptive retry logic and schema-aware fallbacks
- - ✅ **Type-Safe Strategy Configuration** - Provider-optimized strategy selection and enum-backed optimizer presets
- - ✅ **Core Module System** - Predict, ChainOfThought, ReAct with type safety (add `dspy-code_act` for Think-Code-Observe agents)
- - ✅ **Production Observability** - OpenTelemetry, New Relic, and Langfuse integration
- - ✅ **Advanced Optimization** - MIPROv2 with Bayesian optimization, Gaussian Processes, and multi-mode search
-
- ### Recent Advances
- - ✅ **MIPROv2 ADE Integrity (v0.29.1)** - Stratified train/val/test splits, honest precision accounting, and enum-driven `--auto` presets with integration coverage
- - ✅ **Instruction Deduplication (v0.29.1)** - Candidate generation now filters repeated programs so optimization logs highlight unique strategies
- - ✅ **GEPA Teleprompter (v0.29.0)** - Genetic-Pareto reflective prompt evolution with merge proposer scheduling, reflective mutation, and ADE demo parity
- - ✅ **Optimizer Utilities Parity (v0.29.0)** - Bootstrap strategies, dataset summaries, and Layer 3 utilities unlock multi-predictor programs on Ruby
- - ✅ **Observability Hardening (v0.29.0)** - OTLP exporter runs on a single-thread executor preventing frozen SSL contexts without blocking spans
- - ✅ **Documentation Refresh (v0.29.x)** - New GEPA guide plus ADE optimization docs covering presets, stratified splits, and error-handling defaults
-
- **Current Focus Areas:**
-
- ### Production Readiness
- - 🚧 **Production Patterns** - Real-world usage validation and performance optimization
- - 🚧 **Ruby Ecosystem Integration** - Rails integration, Sidekiq compatibility, deployment patterns
-
- ### Community & Adoption
- - 🚧 **Community Examples** - Real-world applications and case studies
- - 🚧 **Contributor Experience** - Making it easier to contribute and extend
- - 🚧 **Performance Benchmarks** - Comparative analysis vs other frameworks
-
- **v1.0 Philosophy:**
- v1.0 will be released after extensive production battle-testing, not after checking off features.
- The API is already stable - v1.0 represents confidence in production reliability backed by real-world validation.
+ ## What's Included
+
+ **Core Modules**: Predict, ChainOfThought, ReAct agents, and composable pipelines.
+
+ **Type Safety**: Sorbet-based runtime validation. Enums, unions, nested structs—all work.
+
+ **Multimodal**: Image analysis with `DSPy::Image` for vision-capable models.

+ **Observability**: Zero-config Langfuse integration via OpenTelemetry. Non-blocking, production-ready.
+
+ **Optimization**: MIPROv2 (Bayesian optimization) and GEPA (genetic evolution) for prompt tuning.
+
+ **Provider Support**: OpenAI, Anthropic, Gemini, Ollama, and OpenRouter via official SDKs.

  ## Documentation

- 📖 **[Complete Documentation Website](https://vicentereig.github.io/dspy.rb/)**
+ **[Full Documentation](https://oss.vicente.services/dspy.rb/)** — Getting started, core concepts, advanced patterns.

- ### LLM-Friendly Documentation
+ **[llms.txt](https://oss.vicente.services/dspy.rb/llms.txt)** LLM-friendly reference for AI assistants.

- For LLMs and AI assistants working with DSPy.rb:
- - **[llms.txt](https://vicentereig.github.io/dspy.rb/llms.txt)** - Concise reference optimized for LLMs
- - **[llms-full.txt](https://vicentereig.github.io/dspy.rb/llms-full.txt)** - Comprehensive API documentation
+ ### Claude Skill

- ### Getting Started
- - **[Installation & Setup](docs/src/getting-started/installation.md)** - Detailed installation and configuration
- - **[Quick Start Guide](docs/src/getting-started/quick-start.md)** - Your first DSPy programs
- - **[Core Concepts](docs/src/getting-started/core-concepts.md)** - Understanding signatures, predictors, and modules
+ A [Claude Skill](https://github.com/vicentereig/dspy-rb-skill) is available to help you build DSPy.rb applications:

- ### Prompt Engineering
- - **[Signatures & Types](docs/src/core-concepts/signatures.md)** - Define typed interfaces for LLM operations
- - **[Predictors](docs/src/core-concepts/predictors.md)** - Predict, ChainOfThought, ReAct, and more
- - **[Modules & Pipelines](docs/src/core-concepts/modules.md)** - Compose complex multi-stage workflows
- - **[Multimodal Support](docs/src/core-concepts/multimodal.md)** - Image analysis with vision-capable models
- - **[Examples & Validation](docs/src/core-concepts/examples.md)** - Type-safe training data
- - **[Rich Types](docs/src/advanced/complex-types.md)** - Sorbet type integration with automatic coercion for structs, enums, and arrays
- - **[Composable Pipelines](docs/src/advanced/pipelines.md)** - Manual module composition patterns
+ ```bash
+ # Claude Code
+ git clone https://github.com/vicentereig/dspy-rb-skill ~/.claude/skills/dspy-rb
+ ```

- ### Prompt Optimization
- - **[Evaluation Framework](docs/src/optimization/evaluation.md)** - Advanced metrics beyond simple accuracy
- - **[Prompt Optimization](docs/src/optimization/prompt-optimization.md)** - Manipulate prompts as objects
- - **[MIPROv2 Optimizer](docs/src/optimization/miprov2.md)** - Advanced Bayesian optimization with Gaussian Processes
- - **[GEPA Optimizer](docs/src/optimization/gepa.md)** *(beta)* - Reflective mutation with optional reflection LMs
+ For Claude.ai Pro/Max, download the [skill ZIP](https://github.com/vicentereig/dspy-rb-skill/archive/refs/heads/main.zip) and upload via Settings > Skills.

- ### Context Engineering
- - **[Tools](docs/src/core-concepts/toolsets.md)** - Tool wieldint agents.
- - **[Agentic Memory](docs/src/core-concepts/memory.md)** - Memory Tools & Agentic Loops
- - **[RAG Patterns](docs/src/advanced/rag.md)** - Manual RAG implementation with external services
+ ## Examples

- ### Production Features
- - **[Observability](docs/src/production/observability.md)** - Zero-config Langfuse integration with a dedicated export worker that never blocks your LLMs
- - **[Storage System](docs/src/production/storage.md)** - Persistence and optimization result storage
- - **[Custom Metrics](docs/src/advanced/custom-metrics.md)** - Proc-based evaluation logic
+ The [examples/](examples/) directory has runnable code for common patterns:

+ - Sentiment classification
+ - ReAct agents with tools
+ - Image analysis
+ - Prompt optimization

+ ```bash
+ bundle exec ruby examples/first_predictor.rb
+ ```
+
+ ## Optional Gems

+ DSPy.rb ships sibling gems for features with heavier dependencies. Add them as needed:

+ | Gem | What it does |
+ | --- | --- |
+ | `dspy-datasets` | Dataset helpers, Parquet/Polars tooling |
+ | `dspy-evals` | Evaluation harness with metrics and callbacks |
+ | `dspy-miprov2` | Bayesian optimization for prompt tuning |
+ | `dspy-gepa` | Genetic-Pareto prompt evolution |
+ | `dspy-o11y-langfuse` | Auto-configure Langfuse tracing |
+ | `dspy-code_act` | Think-Code-Observe agents |
+ | `dspy-deep_search` | Production DeepSearch with Exa |

+ See [the full list](https://oss.vicente.services/dspy.rb/getting-started/installation/) in the docs.

+ ## Contributing

+ Feedback is invaluable. If you encounter issues, [open an issue](https://github.com/vicentereig/dspy.rb/issues). For suggestions, [start a discussion](https://github.com/vicentereig/dspy.rb/discussions).
+
+ Want to contribute code? Reach out: hey at vicente.services

  ## License
- This project is licensed under the MIT License.
+
+ MIT License.
data/lib/dspy/evals/version.rb CHANGED
@@ -2,6 +2,6 @@

  module DSPy
  class Evals
- VERSION = '1.0.0'
+ VERSION = '1.0.2'
  end
  end
data/lib/dspy/evals.rb CHANGED
@@ -1,7 +1,6 @@
  # frozen_string_literal: true

  require 'json'
- require 'polars'
  require 'concurrent'
  require 'sorbet-runtime'
  require_relative 'example'
@@ -111,8 +110,14 @@ module DSPy
  }
  end

- sig { returns(Polars::DataFrame) }
+ if defined?(Polars::DataFrame)
+ sig { returns(Polars::DataFrame) }
+ else
+ sig { returns(T.untyped) }
+ end
  def to_polars
+ ensure_polars!
+
  rows = @results.each_with_index.map do |result, index|
  {
  "index" => index,
@@ -130,6 +135,20 @@ module DSPy

  private

+ POLARS_MISSING_ERROR = <<~MSG
+ Polars is required to export evaluation results. Add `gem 'polars'`
+ (or enable the `dspy-datasets` gem / `DSPY_WITH_DATASETS=1`) before
+ calling `DSPy::Evals::BatchEvaluationResult#to_polars`.
+ MSG
+
+ def ensure_polars!
+ return if defined?(Polars::DataFrame)
+
+ require 'polars'
+ rescue LoadError => e
+ raise LoadError, "#{POLARS_MISSING_ERROR}\n\n#{e.message}"
+ end
+
  def serialize_for_polars(value)
  case value
  when NilClass, TrueClass, FalseClass, Numeric, String
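Review note on the hunks above: `polars` moves from a load-time `require` to a lazy one inside `ensure_polars!`, so the gem is only needed when `to_polars` is actually called. The same pattern can be sketched in isolation as follows. `OptionalDependency` and `ensure_loaded!` are hypothetical names used for illustration only and are not part of dspy-evals.

```ruby
# Hypothetical helper (not part of dspy-evals): load an optional library on
# first use and turn a bare LoadError into an actionable message.
module OptionalDependency
  def self.ensure_loaded!(library, constant_name, hint:)
    # Skip the require when the top-level constant is already defined.
    return true if Object.const_defined?(constant_name)

    require library
    true
  rescue LoadError => e
    # Put the installation hint first and keep the original error text.
    raise LoadError, "#{hint}\n\n#{e.message}"
  end
end

# 'json' ships with Ruby, so this call succeeds.
OptionalDependency.ensure_loaded!('json', :JSON, hint: "Add `gem 'json'`")
```

Deferring the `require` keeps the dependency optional for callers who never export results. The cost is that a missing gem surfaces at call time rather than at boot.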
@@ -172,6 +191,12 @@ module DSPy
  sig { returns(T.nilable(BatchEvaluationResult)) }
  attr_reader :last_batch_result

+ sig { returns(T::Boolean) }
+ attr_reader :export_scores
+
+ sig { returns(String) }
+ attr_reader :score_name
+
  include DSPy::Callbacks

  create_before_callback :call, wrap: false
@@ -208,16 +233,20 @@ module DSPy
  num_threads: T.nilable(Integer),
  max_errors: T.nilable(Integer),
  failure_score: T.nilable(Numeric),
- provide_traceback: T::Boolean
+ provide_traceback: T::Boolean,
+ export_scores: T::Boolean,
+ score_name: String
  ).void
  end
- def initialize(program, metric: nil, num_threads: 1, max_errors: 5, failure_score: 0.0, provide_traceback: true)
+ def initialize(program, metric: nil, num_threads: 1, max_errors: 5, failure_score: 0.0, provide_traceback: true, export_scores: false, score_name: 'evaluation')
  @program = program
  @metric = metric
  @num_threads = num_threads || 1
  @max_errors = max_errors || 5
  @provide_traceback = provide_traceback
  @failure_score = failure_score ? failure_score.to_f : 0.0
+ @export_scores = export_scores
+ @score_name = score_name
  @last_example_result = nil
  @last_batch_result = nil
  end
@@ -225,25 +254,7 @@ module DSPy
  # Evaluate program on a single example
  sig { params(example: T.untyped, trace: T.nilable(T.untyped)).returns(EvaluationResult) }
  def call(example, trace: nil)
- run_callbacks(:before, :call, example: example)
-
- DSPy::Context.with_span(
- operation: 'evaluation.example',
- 'dspy.module' => 'Evaluator',
- 'evaluation.program' => @program.class.name,
- 'evaluation.has_metric' => !@metric.nil?
- ) do
- begin
- perform_call(example, trace: trace)
- rescue => e
- build_error_result(example, e, trace: trace)
- end
- end.then do |result|
- @last_example_result = result
- emit_example_observation(example, result)
- run_callbacks(:after, :call, example: example, result: result)
- result
- end
+ call_with_program(@program, example, trace: trace, track_state: true)
  end

  # Evaluate program on multiple examples
@@ -374,8 +385,9 @@ module DSPy

  futures = batch.map do |item|
  Concurrent::Promises.future_on(executor) do
- [:ok, item[:index], safe_call(item[:example])]
- rescue => e
+ program_for_thread = fork_program_for_thread
+ [:ok, item[:index], safe_call(item[:example], program: program_for_thread, track_state: false)]
+ rescue StandardError => e
  [:error, item[:index], e]
  end
  end
@@ -412,18 +424,18 @@ module DSPy
  results.compact
  end

- def safe_call(example)
- call(example)
- rescue => e
+ def safe_call(example, program: @program, track_state: true)
+ call_with_program(program, example, track_state: track_state)
+ rescue StandardError => e
  build_error_result(example, e)
  end

- def perform_call(example, trace:)
+ def perform_call(example, trace:, program:)
  # Extract input from example - support both hash and object formats
  input_values = extract_input_values(example)

  # Run prediction
- prediction = @program.call(**input_values)
+ prediction = program.call(**input_values)

  # Calculate metrics if provided
  metrics = {}
@@ -440,7 +452,7 @@ module DSPy
  passed = !!metric_result
  metrics[:passed] = passed
  end
- rescue => e
+ rescue StandardError => e
  passed = false
  metrics[:error] = e.message
  metrics[:passed] = false
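Review note: the `rescue => e` to `rescue StandardError => e` changes in this file should not alter behavior, because a bare `rescue` in Ruby already defaults to `StandardError`. The sketch below checks that equivalence. The two method names are made up for the demonstration.

```ruby
# Two illustrative helpers that differ only in how the rescue clause is
# spelled. Each returns the class of the rescued exception.
def bare_rescue
  yield
  :not_raised
rescue => e
  e.class
end

def explicit_rescue
  yield
  :not_raised
rescue StandardError => e
  e.class
end

# ArgumentError descends from StandardError, so both forms rescue it.
puts bare_rescue { raise ArgumentError, 'boom' }
puts explicit_rescue { raise ArgumentError, 'boom' }
```

Exceptions outside the `StandardError` hierarchy, such as `NotImplementedError` or `Interrupt`, pass through both forms untouched. The explicit spelling mainly makes that scope visible to readers and linters.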
@@ -461,6 +473,34 @@ module DSPy
  )
  end

+ def call_with_program(program, example, trace: nil, track_state: true)
+ run_callbacks(:before, :call, example: example)
+
+ DSPy::Context.with_span(
+ operation: 'evaluation.example',
+ 'dspy.module' => 'Evaluator',
+ 'evaluation.program' => program.class.name,
+ 'evaluation.has_metric' => !@metric.nil?
+ ) do
+ begin
+ perform_call(example, trace: trace, program: program)
+ rescue StandardError => e
+ build_error_result(example, e, trace: trace)
+ end
+ end.then do |result|
+ @last_example_result = result if track_state
+ emit_example_observation(example, result)
+ run_callbacks(:after, :call, example: example, result: result)
+ result
+ end
+ end
+
+ def fork_program_for_thread
+ return @program if @program.nil?
+ return @program.dup_for_thread if @program.respond_to?(:dup_for_thread)
+ @program.dup
+ end
+
  def build_error_result(example, error, trace: nil)
  metrics = {
  error: error.message,
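Review note on `fork_program_for_thread`: each worker now evaluates against its own copy of the program instead of the shared instance. The sketch below reproduces the same order of checks with a hypothetical `CountingProgram` and plain `Thread` objects. Both are stand-ins chosen to keep the example free of dependencies. The gem itself uses `Concurrent::Promises`.

```ruby
# Hypothetical stand-in for a program that keeps mutable per-call state.
class CountingProgram
  attr_reader :calls

  def initialize
    @calls = []
  end

  # Plain `dup` is shallow, so give each copy its own array.
  def initialize_copy(source)
    super
    @calls = source.calls.dup
  end

  def call(question:)
    @calls << question
    question.upcase
  end
end

# Same order of checks as the hunk above: nil passes through, a custom
# `dup_for_thread` wins, and `dup` is the fallback.
def fork_for_thread(program)
  return program if program.nil?
  return program.dup_for_thread if program.respond_to?(:dup_for_thread)

  program.dup
end

# Run every question on its own thread, each with a private program copy.
def answer_in_threads(program, questions)
  questions.map do |question|
    Thread.new { fork_for_thread(program).call(question: question) }
  end.map(&:value)
end

shared = CountingProgram.new
p answer_in_threads(shared, %w[alpha beta gamma]) # => ["ALPHA", "BETA", "GAMMA"]
p shared.calls                                    # => []
```

Because `dup` copies only the top-level object, a program holding nested mutable state would still share it across threads unless it defines `dup_for_thread` or `initialize_copy`. That is presumably why the hook is checked first.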
@@ -646,7 +686,12 @@ module DSPy
  score: result.metrics[:score],
  error: result.metrics[:error]
  })
- rescue => e
+
+ # Export score to Langfuse if enabled
+ if @export_scores
+ export_example_score(example, result)
+ end
+ rescue StandardError => e
  DSPy.log('evals.example.observation_error', error: e.message)
  end

@@ -659,10 +704,38 @@ module DSPy
  pass_rate: batch_result.pass_rate,
  score: batch_result.score
  })
- rescue => e
+
+ # Export batch score to Langfuse if enabled
+ if @export_scores
+ export_batch_score(batch_result)
+ end
+ rescue StandardError => e
  DSPy.log('evals.batch.observation_error', error: e.message)
  end

+ def export_example_score(example, result)
+ score_value = result.metrics[:score] || (result.passed ? 1.0 : 0.0)
+ example_id = extract_example_id(example)
+
+ DSPy.score(
+ @score_name,
+ score_value,
+ comment: "Example: #{example_id || 'unknown'}, passed: #{result.passed}"
+ )
+ rescue StandardError => e
+ DSPy.log('evals.score_export_error', error: e.message)
+ end
+
+ def export_batch_score(batch_result)
+ DSPy.score(
+ "#{@score_name}_batch",
+ batch_result.pass_rate,
+ comment: "Batch: #{batch_result.passed_examples}/#{batch_result.total_examples} passed"
+ )
+ rescue StandardError => e
+ DSPy.log('evals.batch_score_export_error', error: e.message)
+ end
+
  def extract_example_id(example)
  if example.respond_to?(:id)
  example.id
@@ -1,29 +1,28 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: dspy-evals
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Vicente Reig Rincón de Arellano
8
- autorequire:
9
8
  bindir: bin
10
9
  cert_chain: []
11
- date: 2025-10-25 00:00:00.000000000 Z
10
+ date: 1980-01-02 00:00:00.000000000 Z
12
11
  dependencies:
13
12
  - !ruby/object:Gem::Dependency
14
13
  name: dspy
15
14
  requirement: !ruby/object:Gem::Requirement
16
15
  requirements:
17
- - - '='
16
+ - - ">="
18
17
  - !ruby/object:Gem::Version
19
- version: 0.30.0
18
+ version: '0.30'
20
19
  type: :runtime
21
20
  prerelease: false
22
21
  version_requirements: !ruby/object:Gem::Requirement
23
22
  requirements:
24
- - - '='
23
+ - - ">="
25
24
  - !ruby/object:Gem::Version
26
- version: 0.30.0
25
+ version: '0.30'
27
26
  - !ruby/object:Gem::Dependency
28
27
  name: concurrent-ruby
29
28
  requirement: !ruby/object:Gem::Requirement
@@ -69,7 +68,6 @@ licenses:
69
68
  - MIT
70
69
  metadata:
71
70
  github_repo: git@github.com:vicentereig/dspy.rb
72
- post_install_message:
73
71
  rdoc_options: []
74
72
  require_paths:
75
73
  - lib
@@ -84,8 +82,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
84
82
  - !ruby/object:Gem::Version
85
83
  version: '0'
86
84
  requirements: []
87
- rubygems_version: 3.0.3.1
88
- signing_key:
85
+ rubygems_version: 3.6.9
89
86
  specification_version: 4
90
87
  summary: Evaluation utilities for DSPy.rb programs.
91
88
  test_files: []
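Review note on the metadata diff: the `dspy` runtime dependency is relaxed from an exact pin (`= 0.30.0`) to a lower bound (`>= 0.30`). The practical effect can be checked with RubyGems' own `Gem::Requirement` class. The version numbers below other than 0.30.0 are arbitrary samples for illustration, not a list of real `dspy` releases.

```ruby
require 'rubygems'

# Requirements as they appear before and after this release.
OLD_REQUIREMENT = Gem::Requirement.new('= 0.30.0')
NEW_REQUIREMENT = Gem::Requirement.new('>= 0.30')

# Sample versions chosen for illustration only.
%w[0.29.9 0.30.0 0.30.1 0.31.0 1.0.0].each do |candidate|
  version = Gem::Version.new(candidate)
  puts format('%-7s old: %-5s new: %s',
              candidate,
              OLD_REQUIREMENT.satisfied_by?(version),
              NEW_REQUIREMENT.satisfied_by?(version))
end
```

With the exact pin, only 0.30.0 resolves. With the lower bound, any later release resolves as well, so Bundler can pair dspy-evals 1.0.2 with newer `dspy` versions without a matching re-release of this gem.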