dspy-evals 1.0.0 → 1.0.2
- checksums.yaml +4 -4
- data/README.md +149 -185
- data/lib/dspy/evals/version.rb +1 -1
- data/lib/dspy/evals.rb +106 -33
- metadata +7 -10
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: d6e4ef06553cd53f974d9813c83807ce43a4222132f6200837a15087722e6483
|
|
4
|
+
data.tar.gz: 077d72f3f4db1122e749248b29d27206f277cef03954ef58483dc4452328fbac
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 73aa76a6904812e98cf7ded086a9642ca9fc696c3e1c017b5fec91a63a4887e54d61b961d79d6cb3d1fda2fc44707007e67849c71e61185af14d7146dc50ef0a
|
|
7
|
+
data.tar.gz: 4270e116956e4ba8960f958302f817a3236d82922224a8cdfd066d172693602e99dfe2da3a68c4699c91fb3a01d9f0e5da0ad6bec86b3bb4b3a802193b269d41
|
data/README.md
CHANGED
|
@@ -3,81 +3,97 @@
|
|
|
3
3
|
[](https://rubygems.org/gems/dspy)
|
|
4
4
|
[](https://rubygems.org/gems/dspy)
|
|
5
5
|
[](https://github.com/vicentereig/dspy.rb/actions/workflows/ruby.yml)
|
|
6
|
-
[](https://oss.vicente.services/dspy.rb/)
|
|
7
|
+
[](https://discord.gg/zWBhrMqn)
|
|
7
8
|
|
|
8
|
-
|
|
9
|
-
> The core Prompt Engineering Framework is production-ready with
|
|
10
|
-
> comprehensive documentation. I am focusing now on educational content on systematic Prompt Optimization and Context Engineering.
|
|
11
|
-
> Your feedback is invaluable. if you encounter issues, please open an [issue](https://github.com/vicentereig/dspy.rb/issues). If you have suggestions, open a [new thread](https://github.com/vicentereig/dspy.rb/discussions).
|
|
12
|
-
>
|
|
13
|
-
> If you want to contribute, feel free to reach out to me to coordinate efforts: hey at vicente.services
|
|
14
|
-
>
|
|
15
|
-
> And, yes, this is 100% a legit project. :)
|
|
9
|
+
**Build reliable LLM applications in idiomatic Ruby using composable, type-safe modules.**
|
|
16
10
|
|
|
11
|
+
DSPy.rb is the Ruby port of Stanford's [DSPy](https://dspy.ai). Instead of wrestling with brittle prompt strings, you define typed signatures and let the framework handle the rest. Prompts become functions. LLM calls become predictable.
|
|
17
12
|
|
|
18
|
-
|
|
13
|
+
```ruby
|
|
14
|
+
require 'dspy'
|
|
19
15
|
|
|
20
|
-
|
|
21
|
-
|
|
16
|
+
DSPy.configure do |c|
|
|
17
|
+
c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
|
|
18
|
+
end
|
|
22
19
|
|
|
23
|
-
|
|
24
|
-
the
|
|
25
|
-
signatures and let the framework handle the messy details.
|
|
20
|
+
class Summarize < DSPy::Signature
|
|
21
|
+
description "Summarize the given text in one sentence."
|
|
26
22
|
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
23
|
+
input do
|
|
24
|
+
const :text, String
|
|
25
|
+
end
|
|
30
26
|
|
|
31
|
-
|
|
27
|
+
output do
|
|
28
|
+
const :summary, String
|
|
29
|
+
end
|
|
30
|
+
end
|
|
32
31
|
|
|
33
|
-
|
|
32
|
+
summarizer = DSPy::Predict.new(Summarize)
|
|
33
|
+
result = summarizer.call(text: "DSPy.rb brings structured LLM programming to Ruby...")
|
|
34
|
+
puts result.summary
|
|
35
|
+
```
|
|
34
36
|
|
|
35
|
-
|
|
36
|
-
### Installation
|
|
37
|
+
That's it. No prompt templates. No JSON parsing. No prayer-based error handling.
|
|
37
38
|
|
|
38
|
-
|
|
39
|
+
## Installation
|
|
39
40
|
|
|
40
41
|
```ruby
|
|
42
|
+
# Gemfile
|
|
41
43
|
gem 'dspy'
|
|
44
|
+
gem 'dspy-openai' # For OpenAI, OpenRouter, or Ollama
|
|
45
|
+
# gem 'dspy-anthropic' # For Claude
|
|
46
|
+
# gem 'dspy-gemini' # For Gemini
|
|
47
|
+
# gem 'dspy-ruby_llm' # For 12+ providers via RubyLLM
|
|
42
48
|
```
|
|
43
49
|
|
|
44
|
-
and
|
|
45
|
-
|
|
46
50
|
```bash
|
|
47
51
|
bundle install
|
|
48
52
|
```
|
|
49
53
|
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
DSPy.rb ships multiple gems from this monorepo so you only install what you need. Add these alongside `dspy`:
|
|
53
|
-
|
|
54
|
-
| Gem | Description | Status |
|
|
55
|
-
| --- | --- | --- |
|
|
56
|
-
| `dspy-schema` | Exposes `DSPy::TypeSystem::SorbetJsonSchema` for downstream reuse. | **Stable** (v1.0.0) |
|
|
57
|
-
| `dspy-code_act` | Think-Code-Observe agents that synthesize and execute Ruby safely. | Preview (0.x) |
|
|
58
|
-
| `dspy-datasets` | Dataset helpers plus Parquet/Polars tooling for richer evaluation corpora. | Preview (0.x) |
|
|
59
|
-
| `dspy-evals` | High-throughput evaluation harness with metrics, callbacks, and regression fixtures. | Preview (0.x) |
|
|
60
|
-
| `dspy-miprov2` | Bayesian optimization + Gaussian Process backend for the MIPROv2 teleprompter. | Preview (0.x) |
|
|
61
|
-
| `dspy-gepa` | `DSPy::Teleprompt::GEPA`, reflection loops, experiment tracking, telemetry adapters. | Preview (mirrors `dspy` version) |
|
|
62
|
-
| `gepa` | GEPA optimizer core (Pareto engine, telemetry, reflective proposer). | Preview (mirrors `dspy` version) |
|
|
63
|
-
| `dspy-o11y` | Core observability APIs: `DSPy::Observability`, async span processor, observation types. | **Stable** (v1.0.0) |
|
|
64
|
-
| `dspy-o11y-langfuse` | Auto-configures DSPy observability to stream spans to Langfuse via OTLP. | **Stable** (v1.0.0) |
|
|
54
|
+
## Quick Start
|
|
65
55
|
|
|
66
|
-
|
|
67
|
-
### Your First Reliable Predictor
|
|
56
|
+
### Configure Your LLM
|
|
68
57
|
|
|
69
58
|
```ruby
|
|
70
|
-
|
|
71
|
-
# Configure DSPy globablly to use your fave LLM - you can override this on an instance levle.
|
|
59
|
+
# OpenAI
|
|
72
60
|
DSPy.configure do |c|
|
|
73
61
|
c.lm = DSPy::LM.new('openai/gpt-4o-mini',
|
|
74
62
|
api_key: ENV['OPENAI_API_KEY'],
|
|
75
|
-
structured_outputs: true)
|
|
63
|
+
structured_outputs: true)
|
|
64
|
+
end
|
|
65
|
+
|
|
66
|
+
# Anthropic Claude
|
|
67
|
+
DSPy.configure do |c|
|
|
68
|
+
c.lm = DSPy::LM.new('anthropic/claude-sonnet-4-20250514',
|
|
69
|
+
api_key: ENV['ANTHROPIC_API_KEY'])
|
|
76
70
|
end
|
|
77
71
|
|
|
78
|
-
#
|
|
72
|
+
# Google Gemini
|
|
73
|
+
DSPy.configure do |c|
|
|
74
|
+
c.lm = DSPy::LM.new('gemini/gemini-2.5-flash',
|
|
75
|
+
api_key: ENV['GEMINI_API_KEY'])
|
|
76
|
+
end
|
|
77
|
+
|
|
78
|
+
# Ollama (local, free)
|
|
79
|
+
DSPy.configure do |c|
|
|
80
|
+
c.lm = DSPy::LM.new('ollama/llama3.2')
|
|
81
|
+
end
|
|
82
|
+
|
|
83
|
+
# OpenRouter (200+ models)
|
|
84
|
+
DSPy.configure do |c|
|
|
85
|
+
c.lm = DSPy::LM.new('openrouter/deepseek/deepseek-chat-v3.1:free',
|
|
86
|
+
api_key: ENV['OPENROUTER_API_KEY'])
|
|
87
|
+
end
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
### Define a Signature
|
|
91
|
+
|
|
92
|
+
Signatures are typed contracts for LLM operations. Define inputs, outputs, and let DSPy handle the prompt:
|
|
93
|
+
|
|
94
|
+
```ruby
|
|
79
95
|
class Classify < DSPy::Signature
|
|
80
|
-
description "Classify sentiment of a given sentence."
|
|
96
|
+
description "Classify sentiment of a given sentence."
|
|
81
97
|
|
|
82
98
|
class Sentiment < T::Enum
|
|
83
99
|
enums do
|
|
@@ -86,182 +102,130 @@ class Classify < DSPy::Signature
|
|
|
86
102
|
Neutral = new('neutral')
|
|
87
103
|
end
|
|
88
104
|
end
|
|
89
|
-
|
|
90
|
-
# Structured Inputs: makes sure you are sending only valid prompt inputs to your model
|
|
105
|
+
|
|
91
106
|
input do
|
|
92
107
|
const :sentence, String, description: 'The sentence to analyze'
|
|
93
108
|
end
|
|
94
109
|
|
|
95
|
-
# Structured Outputs: your predictor will validate the output of the model too.
|
|
96
110
|
output do
|
|
97
|
-
const :sentiment, Sentiment
|
|
98
|
-
const :confidence, Float
|
|
111
|
+
const :sentiment, Sentiment
|
|
112
|
+
const :confidence, Float
|
|
99
113
|
end
|
|
100
114
|
end
|
|
101
115
|
|
|
102
|
-
|
|
103
|
-
|
|
104
|
-
# it may raise an error if you mess the inputs or your LLM messes the outputs.
|
|
105
|
-
result = classify.call(sentence: "This book was super fun to read!")
|
|
116
|
+
classifier = DSPy::Predict.new(Classify)
|
|
117
|
+
result = classifier.call(sentence: "This book was super fun to read!")
|
|
106
118
|
|
|
107
|
-
|
|
108
|
-
|
|
119
|
+
result.sentiment # => #<Sentiment::Positive>
|
|
120
|
+
result.confidence # => 0.92
|
|
109
121
|
```
|
|
110
122
|
|
|
111
|
-
###
|
|
123
|
+
### Chain of Thought
|
|
112
124
|
|
|
113
|
-
|
|
125
|
+
For complex reasoning, use `ChainOfThought` to get step-by-step explanations:
|
|
114
126
|
|
|
115
127
|
```ruby
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
c.lm = DSPy::LM.new('openai/gpt-4o-mini',
|
|
119
|
-
api_key: ENV['OPENAI_API_KEY'],
|
|
120
|
-
structured_outputs: true) # Native JSON mode
|
|
121
|
-
end
|
|
128
|
+
solver = DSPy::ChainOfThought.new(MathProblem)
|
|
129
|
+
result = solver.call(problem: "If a train travels 120km in 2 hours, what's its speed?")
|
|
122
130
|
|
|
123
|
-
#
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
api_key: ENV['GEMINI_API_KEY'],
|
|
127
|
-
structured_outputs: true) # Native structured outputs
|
|
128
|
-
end
|
|
131
|
+
result.reasoning # => "Speed = Distance / Time = 120km / 2h = 60km/h"
|
|
132
|
+
result.answer # => "60 km/h"
|
|
133
|
+
```
|
|
129
134
|
|
|
130
|
-
|
|
131
|
-
DSPy.configure do |c|
|
|
132
|
-
c.lm = DSPy::LM.new('anthropic/claude-sonnet-4-5-20250929',
|
|
133
|
-
api_key: ENV['ANTHROPIC_API_KEY'],
|
|
134
|
-
structured_outputs: true) # Tool-based extraction (default)
|
|
135
|
-
end
|
|
135
|
+
### ReAct Agents
|
|
136
136
|
|
|
137
|
-
|
|
138
|
-
DSPy.configure do |c|
|
|
139
|
-
c.lm = DSPy::LM.new('ollama/llama3.2') # Free, runs locally, no API key needed
|
|
140
|
-
end
|
|
137
|
+
Build agents that use tools to accomplish tasks:
|
|
141
138
|
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
139
|
+
```ruby
|
|
140
|
+
class SearchTool < DSPy::Tools::Tool
|
|
141
|
+
tool_name "search"
|
|
142
|
+
description "Search for information"
|
|
143
|
+
|
|
144
|
+
input do
|
|
145
|
+
const :query, String
|
|
146
|
+
end
|
|
147
|
+
|
|
148
|
+
output do
|
|
149
|
+
const :results, T::Array[String]
|
|
150
|
+
end
|
|
151
|
+
|
|
152
|
+
def call(query:)
|
|
153
|
+
# Your search implementation
|
|
154
|
+
{ results: ["Result 1", "Result 2"] }
|
|
155
|
+
end
|
|
146
156
|
end
|
|
157
|
+
|
|
158
|
+
toolset = DSPy::Tools::Toolset.new(tools: [SearchTool.new])
|
|
159
|
+
agent = DSPy::ReAct.new(signature: ResearchTask, tools: toolset, max_iterations: 5)
|
|
160
|
+
result = agent.call(question: "What's the latest on Ruby 3.4?")
|
|
147
161
|
```
|
|
148
162
|
|
|
149
|
-
## What
|
|
150
|
-
|
|
151
|
-
**
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
- [Ollama](https://ollama.com/) via OpenAI compatibility layer for local models
|
|
157
|
-
- **Multimodal Support** - Complete image analysis with DSPy::Image, type-safe bounding boxes, vision-capable models
|
|
158
|
-
- Runtime type checking with [Sorbet](https://sorbet.org/) including T::Enum and union types
|
|
159
|
-
- Type-safe tool definitions for ReAct agents
|
|
160
|
-
- Comprehensive instrumentation and observability
|
|
161
|
-
|
|
162
|
-
**Core Building Blocks:**
|
|
163
|
-
- **Signatures** - Define input/output schemas using Sorbet types with T::Enum and union type support
|
|
164
|
-
- **Predict** - LLM completion with structured data extraction and multimodal support
|
|
165
|
-
- **Chain of Thought** - Step-by-step reasoning for complex problems with automatic prompt optimization
|
|
166
|
-
- **ReAct** - Tool-using agents with type-safe tool definitions and error recovery
|
|
167
|
-
- **Module Composition** - Combine multiple LLM calls into production-ready workflows
|
|
168
|
-
|
|
169
|
-
**Optimization & Evaluation:**
|
|
170
|
-
- **Prompt Objects** - Manipulate prompts as first-class objects instead of strings
|
|
171
|
-
- **Typed Examples** - Type-safe training data with automatic validation
|
|
172
|
-
- **Evaluation Framework** - Advanced metrics beyond simple accuracy with error-resilient pipelines
|
|
173
|
-
- **MIPROv2 Optimization** - Advanced Bayesian optimization with Gaussian Processes, multiple optimization strategies, auto-config presets, and storage persistence
|
|
174
|
-
|
|
175
|
-
**Production Features:**
|
|
176
|
-
- **Reliable JSON Extraction** - Native structured outputs for OpenAI and Gemini, Anthropic tool-based extraction, and automatic strategy selection with fallback
|
|
177
|
-
- **Type-Safe Configuration** - Strategy enums with automatic provider optimization (Strict/Compatible modes)
|
|
178
|
-
- **Smart Retry Logic** - Progressive fallback with exponential backoff for handling transient failures
|
|
179
|
-
- **Zero-Config Langfuse Integration** - Set env vars and get automatic OpenTelemetry traces in Langfuse
|
|
180
|
-
- **Performance Caching** - Schema and capability caching for faster repeated operations
|
|
181
|
-
- **File-based Storage** - Optimization result persistence with versioning
|
|
182
|
-
- **Structured Logging** - JSON and key=value formats with span tracking
|
|
183
|
-
|
|
184
|
-
## Recent Achievements
|
|
185
|
-
|
|
186
|
-
DSPy.rb has rapidly evolved from experimental to production-ready:
|
|
187
|
-
|
|
188
|
-
### Foundation
|
|
189
|
-
- ✅ **JSON Parsing Reliability** - Native OpenAI structured outputs with adaptive retry logic and schema-aware fallbacks
|
|
190
|
-
- ✅ **Type-Safe Strategy Configuration** - Provider-optimized strategy selection and enum-backed optimizer presets
|
|
191
|
-
- ✅ **Core Module System** - Predict, ChainOfThought, ReAct with type safety (add `dspy-code_act` for Think-Code-Observe agents)
|
|
192
|
-
- ✅ **Production Observability** - OpenTelemetry, New Relic, and Langfuse integration
|
|
193
|
-
- ✅ **Advanced Optimization** - MIPROv2 with Bayesian optimization, Gaussian Processes, and multi-mode search
|
|
194
|
-
|
|
195
|
-
### Recent Advances
|
|
196
|
-
- ✅ **MIPROv2 ADE Integrity (v0.29.1)** - Stratified train/val/test splits, honest precision accounting, and enum-driven `--auto` presets with integration coverage
|
|
197
|
-
- ✅ **Instruction Deduplication (v0.29.1)** - Candidate generation now filters repeated programs so optimization logs highlight unique strategies
|
|
198
|
-
- ✅ **GEPA Teleprompter (v0.29.0)** - Genetic-Pareto reflective prompt evolution with merge proposer scheduling, reflective mutation, and ADE demo parity
|
|
199
|
-
- ✅ **Optimizer Utilities Parity (v0.29.0)** - Bootstrap strategies, dataset summaries, and Layer 3 utilities unlock multi-predictor programs on Ruby
|
|
200
|
-
- ✅ **Observability Hardening (v0.29.0)** - OTLP exporter runs on a single-thread executor preventing frozen SSL contexts without blocking spans
|
|
201
|
-
- ✅ **Documentation Refresh (v0.29.x)** - New GEPA guide plus ADE optimization docs covering presets, stratified splits, and error-handling defaults
|
|
202
|
-
|
|
203
|
-
**Current Focus Areas:**
|
|
204
|
-
|
|
205
|
-
### Production Readiness
|
|
206
|
-
- 🚧 **Production Patterns** - Real-world usage validation and performance optimization
|
|
207
|
-
- 🚧 **Ruby Ecosystem Integration** - Rails integration, Sidekiq compatibility, deployment patterns
|
|
208
|
-
|
|
209
|
-
### Community & Adoption
|
|
210
|
-
- 🚧 **Community Examples** - Real-world applications and case studies
|
|
211
|
-
- 🚧 **Contributor Experience** - Making it easier to contribute and extend
|
|
212
|
-
- 🚧 **Performance Benchmarks** - Comparative analysis vs other frameworks
|
|
213
|
-
|
|
214
|
-
**v1.0 Philosophy:**
|
|
215
|
-
v1.0 will be released after extensive production battle-testing, not after checking off features.
|
|
216
|
-
The API is already stable - v1.0 represents confidence in production reliability backed by real-world validation.
|
|
163
|
+
## What's Included
|
|
164
|
+
|
|
165
|
+
**Core Modules**: Predict, ChainOfThought, ReAct agents, and composable pipelines.
|
|
166
|
+
|
|
167
|
+
**Type Safety**: Sorbet-based runtime validation. Enums, unions, nested structs—all work.
|
|
168
|
+
|
|
169
|
+
**Multimodal**: Image analysis with `DSPy::Image` for vision-capable models.
|
|
217
170
|
|
|
171
|
+
**Observability**: Zero-config Langfuse integration via OpenTelemetry. Non-blocking, production-ready.
|
|
172
|
+
|
|
173
|
+
**Optimization**: MIPROv2 (Bayesian optimization) and GEPA (genetic evolution) for prompt tuning.
|
|
174
|
+
|
|
175
|
+
**Provider Support**: OpenAI, Anthropic, Gemini, Ollama, and OpenRouter via official SDKs.
|
|
218
176
|
|
|
219
177
|
## Documentation
|
|
220
178
|
|
|
221
|
-
|
|
179
|
+
**[Full Documentation](https://oss.vicente.services/dspy.rb/)** — Getting started, core concepts, advanced patterns.
|
|
222
180
|
|
|
223
|
-
|
|
181
|
+
**[llms.txt](https://oss.vicente.services/dspy.rb/llms.txt)** — LLM-friendly reference for AI assistants.
|
|
224
182
|
|
|
225
|
-
|
|
226
|
-
- **[llms.txt](https://vicentereig.github.io/dspy.rb/llms.txt)** - Concise reference optimized for LLMs
|
|
227
|
-
- **[llms-full.txt](https://vicentereig.github.io/dspy.rb/llms-full.txt)** - Comprehensive API documentation
|
|
183
|
+
### Claude Skill
|
|
228
184
|
|
|
229
|
-
|
|
230
|
-
- **[Installation & Setup](docs/src/getting-started/installation.md)** - Detailed installation and configuration
|
|
231
|
-
- **[Quick Start Guide](docs/src/getting-started/quick-start.md)** - Your first DSPy programs
|
|
232
|
-
- **[Core Concepts](docs/src/getting-started/core-concepts.md)** - Understanding signatures, predictors, and modules
|
|
185
|
+
A [Claude Skill](https://github.com/vicentereig/dspy-rb-skill) is available to help you build DSPy.rb applications:
|
|
233
186
|
|
|
234
|
-
|
|
235
|
-
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
- **[Multimodal Support](docs/src/core-concepts/multimodal.md)** - Image analysis with vision-capable models
|
|
239
|
-
- **[Examples & Validation](docs/src/core-concepts/examples.md)** - Type-safe training data
|
|
240
|
-
- **[Rich Types](docs/src/advanced/complex-types.md)** - Sorbet type integration with automatic coercion for structs, enums, and arrays
|
|
241
|
-
- **[Composable Pipelines](docs/src/advanced/pipelines.md)** - Manual module composition patterns
|
|
187
|
+
```bash
|
|
188
|
+
# Claude Code
|
|
189
|
+
git clone https://github.com/vicentereig/dspy-rb-skill ~/.claude/skills/dspy-rb
|
|
190
|
+
```
|
|
242
191
|
|
|
243
|
-
|
|
244
|
-
- **[Evaluation Framework](docs/src/optimization/evaluation.md)** - Advanced metrics beyond simple accuracy
|
|
245
|
-
- **[Prompt Optimization](docs/src/optimization/prompt-optimization.md)** - Manipulate prompts as objects
|
|
246
|
-
- **[MIPROv2 Optimizer](docs/src/optimization/miprov2.md)** - Advanced Bayesian optimization with Gaussian Processes
|
|
247
|
-
- **[GEPA Optimizer](docs/src/optimization/gepa.md)** *(beta)* - Reflective mutation with optional reflection LMs
|
|
192
|
+
For Claude.ai Pro/Max, download the [skill ZIP](https://github.com/vicentereig/dspy-rb-skill/archive/refs/heads/main.zip) and upload via Settings > Skills.
|
|
248
193
|
|
|
249
|
-
|
|
250
|
-
- **[Tools](docs/src/core-concepts/toolsets.md)** - Tool wieldint agents.
|
|
251
|
-
- **[Agentic Memory](docs/src/core-concepts/memory.md)** - Memory Tools & Agentic Loops
|
|
252
|
-
- **[RAG Patterns](docs/src/advanced/rag.md)** - Manual RAG implementation with external services
|
|
194
|
+
## Examples
|
|
253
195
|
|
|
254
|
-
|
|
255
|
-
- **[Observability](docs/src/production/observability.md)** - Zero-config Langfuse integration with a dedicated export worker that never blocks your LLMs
|
|
256
|
-
- **[Storage System](docs/src/production/storage.md)** - Persistence and optimization result storage
|
|
257
|
-
- **[Custom Metrics](docs/src/advanced/custom-metrics.md)** - Proc-based evaluation logic
|
|
196
|
+
The [examples/](examples/) directory has runnable code for common patterns:
|
|
258
197
|
|
|
198
|
+
- Sentiment classification
|
|
199
|
+
- ReAct agents with tools
|
|
200
|
+
- Image analysis
|
|
201
|
+
- Prompt optimization
|
|
259
202
|
|
|
203
|
+
```bash
|
|
204
|
+
bundle exec ruby examples/first_predictor.rb
|
|
205
|
+
```
|
|
206
|
+
|
|
207
|
+
## Optional Gems
|
|
260
208
|
|
|
209
|
+
DSPy.rb ships sibling gems for features with heavier dependencies. Add them as needed:
|
|
261
210
|
|
|
211
|
+
| Gem | What it does |
|
|
212
|
+
| --- | --- |
|
|
213
|
+
| `dspy-datasets` | Dataset helpers, Parquet/Polars tooling |
|
|
214
|
+
| `dspy-evals` | Evaluation harness with metrics and callbacks |
|
|
215
|
+
| `dspy-miprov2` | Bayesian optimization for prompt tuning |
|
|
216
|
+
| `dspy-gepa` | Genetic-Pareto prompt evolution |
|
|
217
|
+
| `dspy-o11y-langfuse` | Auto-configure Langfuse tracing |
|
|
218
|
+
| `dspy-code_act` | Think-Code-Observe agents |
|
|
219
|
+
| `dspy-deep_search` | Production DeepSearch with Exa |
|
|
262
220
|
|
|
221
|
+
See [the full list](https://oss.vicente.services/dspy.rb/getting-started/installation/) in the docs.
|
|
263
222
|
|
|
223
|
+
## Contributing
|
|
264
224
|
|
|
225
|
+
Feedback is invaluable. If you encounter issues, [open an issue](https://github.com/vicentereig/dspy.rb/issues). For suggestions, [start a discussion](https://github.com/vicentereig/dspy.rb/discussions).
|
|
226
|
+
|
|
227
|
+
Want to contribute code? Reach out: hey at vicente.services
|
|
265
228
|
|
|
266
229
|
## License
|
|
267
|
-
|
|
230
|
+
|
|
231
|
+
MIT License.
|
data/lib/dspy/evals/version.rb
CHANGED
data/lib/dspy/evals.rb
CHANGED
|
@@ -1,7 +1,6 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
3
|
require 'json'
|
|
4
|
-
require 'polars'
|
|
5
4
|
require 'concurrent'
|
|
6
5
|
require 'sorbet-runtime'
|
|
7
6
|
require_relative 'example'
|
|
@@ -111,8 +110,14 @@ module DSPy
|
|
|
111
110
|
}
|
|
112
111
|
end
|
|
113
112
|
|
|
114
|
-
|
|
113
|
+
if defined?(Polars::DataFrame)
|
|
114
|
+
sig { returns(Polars::DataFrame) }
|
|
115
|
+
else
|
|
116
|
+
sig { returns(T.untyped) }
|
|
117
|
+
end
|
|
115
118
|
def to_polars
|
|
119
|
+
ensure_polars!
|
|
120
|
+
|
|
116
121
|
rows = @results.each_with_index.map do |result, index|
|
|
117
122
|
{
|
|
118
123
|
"index" => index,
|
|
@@ -130,6 +135,20 @@ module DSPy
|
|
|
130
135
|
|
|
131
136
|
private
|
|
132
137
|
|
|
138
|
+
POLARS_MISSING_ERROR = <<~MSG
|
|
139
|
+
Polars is required to export evaluation results. Add `gem 'polars'`
|
|
140
|
+
(or enable the `dspy-datasets` gem / `DSPY_WITH_DATASETS=1`) before
|
|
141
|
+
calling `DSPy::Evals::BatchEvaluationResult#to_polars`.
|
|
142
|
+
MSG
|
|
143
|
+
|
|
144
|
+
def ensure_polars!
|
|
145
|
+
return if defined?(Polars::DataFrame)
|
|
146
|
+
|
|
147
|
+
require 'polars'
|
|
148
|
+
rescue LoadError => e
|
|
149
|
+
raise LoadError, "#{POLARS_MISSING_ERROR}\n\n#{e.message}"
|
|
150
|
+
end
|
|
151
|
+
|
|
133
152
|
def serialize_for_polars(value)
|
|
134
153
|
case value
|
|
135
154
|
when NilClass, TrueClass, FalseClass, Numeric, String
|
|
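Review note: the hunk above moves `require 'polars'` out of the file header and into `ensure_polars!`, so the dependency is only loaded when an export is requested. The same pattern can be sketched in isolation as follows. This is a minimal illustration, not the gem's code: `OptionalDependency`, `ensure_loaded!`, and the library names are hypothetical, and the sketch leans on `require` being idempotent instead of the `defined?` guard used in the diff.

```ruby
# Hypothetical helper illustrating lazy loading of an optional library.
module OptionalDependency
  MISSING_ERROR = <<~MSG
    An optional library is required for this export.
    Add it to your Gemfile before calling the exporter.
  MSG

  # Loads +library+ on first use. If it cannot be found, re-raises
  # LoadError with installation guidance prepended to the original message.
  def self.ensure_loaded!(library)
    require library
    true
  rescue LoadError => e
    raise LoadError, "#{MISSING_ERROR}\n#{e.message}"
  end
end

# 'json' ships with Ruby, so this succeeds.
OptionalDependency.ensure_loaded!('json')

# A library that is not installed produces the guided error.
begin
  OptionalDependency.ensure_loaded!('a_library_that_does_not_exist')
rescue LoadError => e
  puts e.message
end
```

Keeping the original `LoadError` message at the end of the new one preserves the file name Ruby failed to load, which helps when the wrong gem name was added.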
@@ -172,6 +191,12 @@ module DSPy
|
|
|
172
191
|
sig { returns(T.nilable(BatchEvaluationResult)) }
|
|
173
192
|
attr_reader :last_batch_result
|
|
174
193
|
|
|
194
|
+
sig { returns(T::Boolean) }
|
|
195
|
+
attr_reader :export_scores
|
|
196
|
+
|
|
197
|
+
sig { returns(String) }
|
|
198
|
+
attr_reader :score_name
|
|
199
|
+
|
|
175
200
|
include DSPy::Callbacks
|
|
176
201
|
|
|
177
202
|
create_before_callback :call, wrap: false
|
|
@@ -208,16 +233,20 @@ module DSPy
|
|
|
208
233
|
num_threads: T.nilable(Integer),
|
|
209
234
|
max_errors: T.nilable(Integer),
|
|
210
235
|
failure_score: T.nilable(Numeric),
|
|
211
|
-
provide_traceback: T::Boolean
|
|
236
|
+
provide_traceback: T::Boolean,
|
|
237
|
+
export_scores: T::Boolean,
|
|
238
|
+
score_name: String
|
|
212
239
|
).void
|
|
213
240
|
end
|
|
214
|
-
def initialize(program, metric: nil, num_threads: 1, max_errors: 5, failure_score: 0.0, provide_traceback: true)
|
|
241
|
+
def initialize(program, metric: nil, num_threads: 1, max_errors: 5, failure_score: 0.0, provide_traceback: true, export_scores: false, score_name: 'evaluation')
|
|
215
242
|
@program = program
|
|
216
243
|
@metric = metric
|
|
217
244
|
@num_threads = num_threads || 1
|
|
218
245
|
@max_errors = max_errors || 5
|
|
219
246
|
@provide_traceback = provide_traceback
|
|
220
247
|
@failure_score = failure_score ? failure_score.to_f : 0.0
|
|
248
|
+
@export_scores = export_scores
|
|
249
|
+
@score_name = score_name
|
|
221
250
|
@last_example_result = nil
|
|
222
251
|
@last_batch_result = nil
|
|
223
252
|
end
|
|
@@ -225,25 +254,7 @@ module DSPy
|
|
|
225
254
|
# Evaluate program on a single example
|
|
226
255
|
sig { params(example: T.untyped, trace: T.nilable(T.untyped)).returns(EvaluationResult) }
|
|
227
256
|
def call(example, trace: nil)
|
|
228
|
-
|
|
229
|
-
|
|
230
|
-
DSPy::Context.with_span(
|
|
231
|
-
operation: 'evaluation.example',
|
|
232
|
-
'dspy.module' => 'Evaluator',
|
|
233
|
-
'evaluation.program' => @program.class.name,
|
|
234
|
-
'evaluation.has_metric' => !@metric.nil?
|
|
235
|
-
) do
|
|
236
|
-
begin
|
|
237
|
-
perform_call(example, trace: trace)
|
|
238
|
-
rescue => e
|
|
239
|
-
build_error_result(example, e, trace: trace)
|
|
240
|
-
end
|
|
241
|
-
end.then do |result|
|
|
242
|
-
@last_example_result = result
|
|
243
|
-
emit_example_observation(example, result)
|
|
244
|
-
run_callbacks(:after, :call, example: example, result: result)
|
|
245
|
-
result
|
|
246
|
-
end
|
|
257
|
+
call_with_program(@program, example, trace: trace, track_state: true)
|
|
247
258
|
end
|
|
248
259
|
|
|
249
260
|
# Evaluate program on multiple examples
|
|
@@ -374,8 +385,9 @@ module DSPy
|
|
|
374
385
|
|
|
375
386
|
futures = batch.map do |item|
|
|
376
387
|
Concurrent::Promises.future_on(executor) do
|
|
377
|
-
|
|
378
|
-
|
|
388
|
+
program_for_thread = fork_program_for_thread
|
|
389
|
+
[:ok, item[:index], safe_call(item[:example], program: program_for_thread, track_state: false)]
|
|
390
|
+
rescue StandardError => e
|
|
379
391
|
[:error, item[:index], e]
|
|
380
392
|
end
|
|
381
393
|
end
|
|
@@ -412,18 +424,18 @@ module DSPy
|
|
|
412
424
|
results.compact
|
|
413
425
|
end
|
|
414
426
|
|
|
415
|
-
def safe_call(example)
|
|
416
|
-
|
|
417
|
-
rescue => e
|
|
427
|
+
def safe_call(example, program: @program, track_state: true)
|
|
428
|
+
call_with_program(program, example, track_state: track_state)
|
|
429
|
+
rescue StandardError => e
|
|
418
430
|
build_error_result(example, e)
|
|
419
431
|
end
|
|
420
432
|
|
|
421
|
-
def perform_call(example, trace:)
|
|
433
|
+
def perform_call(example, trace:, program:)
|
|
422
434
|
# Extract input from example - support both hash and object formats
|
|
423
435
|
input_values = extract_input_values(example)
|
|
424
436
|
|
|
425
437
|
# Run prediction
|
|
426
|
-
prediction =
|
|
438
|
+
prediction = program.call(**input_values)
|
|
427
439
|
|
|
428
440
|
# Calculate metrics if provided
|
|
429
441
|
metrics = {}
|
|
@@ -440,7 +452,7 @@ module DSPy
|
|
|
440
452
|
passed = !!metric_result
|
|
441
453
|
metrics[:passed] = passed
|
|
442
454
|
end
|
|
443
|
-
rescue => e
|
|
455
|
+
rescue StandardError => e
|
|
444
456
|
passed = false
|
|
445
457
|
metrics[:error] = e.message
|
|
446
458
|
metrics[:passed] = false
|
|
@@ -461,6 +473,34 @@ module DSPy
|
|
|
461
473
|
)
|
|
462
474
|
end
|
|
463
475
|
|
|
476
|
+
def call_with_program(program, example, trace: nil, track_state: true)
|
|
477
|
+
run_callbacks(:before, :call, example: example)
|
|
478
|
+
|
|
479
|
+
DSPy::Context.with_span(
|
|
480
|
+
operation: 'evaluation.example',
|
|
481
|
+
'dspy.module' => 'Evaluator',
|
|
482
|
+
'evaluation.program' => program.class.name,
|
|
483
|
+
'evaluation.has_metric' => !@metric.nil?
|
|
484
|
+
) do
|
|
485
|
+
begin
|
|
486
|
+
perform_call(example, trace: trace, program: program)
|
|
487
|
+
rescue StandardError => e
|
|
488
|
+
build_error_result(example, e, trace: trace)
|
|
489
|
+
end
|
|
490
|
+
end.then do |result|
|
|
491
|
+
@last_example_result = result if track_state
|
|
492
|
+
emit_example_observation(example, result)
|
|
493
|
+
run_callbacks(:after, :call, example: example, result: result)
|
|
494
|
+
result
|
|
495
|
+
end
|
|
496
|
+
end
|
|
497
|
+
|
|
498
|
+
def fork_program_for_thread
|
|
499
|
+
return @program if @program.nil?
|
|
500
|
+
return @program.dup_for_thread if @program.respond_to?(:dup_for_thread)
|
|
501
|
+
@program.dup
|
|
502
|
+
end
|
|
503
|
+
|
|
464
504
|
def build_error_result(example, error, trace: nil)
|
|
465
505
|
metrics = {
|
|
466
506
|
error: error.message,
|
|
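Review note: `fork_program_for_thread` gives each worker its own copy of the program, preferring a `dup_for_thread` hook and falling back to `dup`. The sketch below shows why that matters for programs with mutable state. It is an illustration under assumptions: `CountingProgram` and `fork_for_thread` are hypothetical names, and plain `Thread` is used here in place of the `Concurrent::Promises` executor from the diff.

```ruby
# Hypothetical program that mutates its own state on every call.
class CountingProgram
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def call(text:)
    @calls += 1
    text.upcase
  end
end

# Mirrors the fallback order in the diff: nil passes through, a custom
# dup_for_thread hook wins when present, otherwise a shallow dup is used.
def fork_for_thread(program)
  return program if program.nil?
  return program.dup_for_thread if program.respond_to?(:dup_for_thread)

  program.dup
end

shared = CountingProgram.new

threads = %w[a b c].map do |text|
  Thread.new do
    local = fork_for_thread(shared) # each thread works on its own copy
    local.call(text: text)
  end
end

results = threads.map(&:value)
puts results.inspect
puts shared.calls # the shared instance was never called
```

Note that `dup` is shallow: instance variables that reference mutable objects (arrays, hashes, clients) are still shared between copies, which is presumably why a program can supply its own `dup_for_thread`.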
@@ -646,7 +686,12 @@ module DSPy
|
|
|
646
686
|
score: result.metrics[:score],
|
|
647
687
|
error: result.metrics[:error]
|
|
648
688
|
})
|
|
649
|
-
|
|
689
|
+
|
|
690
|
+
# Export score to Langfuse if enabled
|
|
691
|
+
if @export_scores
|
|
692
|
+
export_example_score(example, result)
|
|
693
|
+
end
|
|
694
|
+
rescue StandardError => e
|
|
650
695
|
DSPy.log('evals.example.observation_error', error: e.message)
|
|
651
696
|
end
|
|
652
697
|
|
|
@@ -659,10 +704,38 @@ module DSPy
|
|
|
659
704
|
pass_rate: batch_result.pass_rate,
|
|
660
705
|
score: batch_result.score
|
|
661
706
|
})
|
|
662
|
-
|
|
707
|
+
|
|
708
|
+
# Export batch score to Langfuse if enabled
|
|
709
|
+
if @export_scores
|
|
710
|
+
export_batch_score(batch_result)
|
|
711
|
+
end
|
|
712
|
+
rescue StandardError => e
|
|
663
713
|
DSPy.log('evals.batch.observation_error', error: e.message)
|
|
664
714
|
end
|
|
665
715
|
|
|
716
|
+
def export_example_score(example, result)
|
|
717
|
+
score_value = result.metrics[:score] || (result.passed ? 1.0 : 0.0)
|
|
718
|
+
example_id = extract_example_id(example)
|
|
719
|
+
|
|
720
|
+
DSPy.score(
|
|
721
|
+
@score_name,
|
|
722
|
+
score_value,
|
|
723
|
+
comment: "Example: #{example_id || 'unknown'}, passed: #{result.passed}"
|
|
724
|
+
)
|
|
725
|
+
rescue StandardError => e
|
|
726
|
+
DSPy.log('evals.score_export_error', error: e.message)
|
|
727
|
+
end
|
|
728
|
+
|
|
729
|
+
def export_batch_score(batch_result)
|
|
730
|
+
DSPy.score(
|
|
731
|
+
"#{@score_name}_batch",
|
|
732
|
+
batch_result.pass_rate,
|
|
733
|
+
comment: "Batch: #{batch_result.passed_examples}/#{batch_result.total_examples} passed"
|
|
734
|
+
)
|
|
735
|
+
rescue StandardError => e
|
|
736
|
+
DSPy.log('evals.batch_score_export_error', error: e.message)
|
|
737
|
+
end
|
|
738
|
+
|
|
666
739
|
def extract_example_id(example)
|
|
667
740
|
if example.respond_to?(:id)
|
|
668
741
|
example.id
|
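Review note: `export_example_score` derives the exported value with `result.metrics[:score] || (result.passed ? 1.0 : 0.0)`. A small sketch of that fallback, using a hypothetical `Result` struct in place of the gem's `EvaluationResult`:

```ruby
# Hypothetical stand-in for an evaluation result.
Result = Struct.new(:metrics, :passed, keyword_init: true)

# Prefer an explicit numeric score; otherwise map pass/fail to 1.0/0.0.
def score_value_for(result)
  result.metrics[:score] || (result.passed ? 1.0 : 0.0)
end

puts score_value_for(Result.new(metrics: { score: 0.75 }, passed: true))
puts score_value_for(Result.new(metrics: {}, passed: true))
puts score_value_for(Result.new(metrics: {}, passed: false))
puts score_value_for(Result.new(metrics: { score: 0.0 }, passed: true))
```

Because `0.0` is truthy in Ruby, an explicit score of zero is exported as-is and does not fall through to the pass/fail mapping; only a missing (`nil`) score does.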
metadata
CHANGED
|
@@ -1,29 +1,28 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: dspy-evals
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.0.
|
|
4
|
+
version: 1.0.2
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Vicente Reig Rincón de Arellano
|
|
8
|
-
autorequire:
|
|
9
8
|
bindir: bin
|
|
10
9
|
cert_chain: []
|
|
11
|
-
date:
|
|
10
|
+
date: 1980-01-02 00:00:00.000000000 Z
|
|
12
11
|
dependencies:
|
|
13
12
|
- !ruby/object:Gem::Dependency
|
|
14
13
|
name: dspy
|
|
15
14
|
requirement: !ruby/object:Gem::Requirement
|
|
16
15
|
requirements:
|
|
17
|
-
- -
|
|
16
|
+
- - ">="
|
|
18
17
|
- !ruby/object:Gem::Version
|
|
19
|
-
version: 0.30
|
|
18
|
+
version: '0.30'
|
|
20
19
|
type: :runtime
|
|
21
20
|
prerelease: false
|
|
22
21
|
version_requirements: !ruby/object:Gem::Requirement
|
|
23
22
|
requirements:
|
|
24
|
-
- -
|
|
23
|
+
- - ">="
|
|
25
24
|
- !ruby/object:Gem::Version
|
|
26
|
-
version: 0.30
|
|
25
|
+
version: '0.30'
|
|
27
26
|
- !ruby/object:Gem::Dependency
|
|
28
27
|
name: concurrent-ruby
|
|
29
28
|
requirement: !ruby/object:Gem::Requirement
|
|
@@ -69,7 +68,6 @@ licenses:
|
|
|
69
68
|
- MIT
|
|
70
69
|
metadata:
|
|
71
70
|
github_repo: git@github.com:vicentereig/dspy.rb
|
|
72
|
-
post_install_message:
|
|
73
71
|
rdoc_options: []
|
|
74
72
|
require_paths:
|
|
75
73
|
- lib
|
|
@@ -84,8 +82,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
|
84
82
|
- !ruby/object:Gem::Version
|
|
85
83
|
version: '0'
|
|
86
84
|
requirements: []
|
|
87
|
-
rubygems_version: 3.
|
|
88
|
-
signing_key:
|
|
85
|
+
rubygems_version: 3.6.9
|
|
89
86
|
specification_version: 4
|
|
90
87
|
summary: Evaluation utilities for DSPy.rb programs.
|
|
91
88
|
test_files: []
|