dspy-datasets 0.29.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: c8de3f972de17ce584e6f1f8f7eec8084b6d24c3517fd14001d58d12537b98d1
4
+ data.tar.gz: f47577ccf5b0826387bfb991d3f6372f9a41cccef7e1d9f3583030a0b5a4c61e
5
+ SHA512:
6
+ metadata.gz: e02a16d9b3321c2841d052e1c69fa91106cbbbeb8b44394f1c41052b01936a2757cb94b26c0292309effe724477eae12487ce6a9ac85b6bd10c1bd12f13a9798
7
+ data.tar.gz: 9ac56b72949104a5bb5d998768f419283b9a47b00653f54f84e99de492c987cdc548f060faa84855db4a8a54f2c231524f3ebf5269e774d8e10888d3fbdcabbf
data/LICENSE ADDED
@@ -0,0 +1,45 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Vicente Services SL
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
22
+
23
+ This project is a Ruby port of the original Python [DSPy library](https://github.com/stanfordnlp/dspy), which is licensed under the MIT License:
24
+
25
+ MIT License
26
+
27
+ Copyright (c) 2023 Stanford Future Data Systems
28
+
29
+ Permission is hereby granted, free of charge, to any person obtaining a copy
30
+ of this software and associated documentation files (the "Software"), to deal
31
+ in the Software without restriction, including without limitation the rights
32
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
33
+ copies of the Software, and to permit persons to whom the Software is
34
+ furnished to do so, subject to the following conditions:
35
+
36
+ The above copyright notice and this permission notice shall be included in all
37
+ copies or substantial portions of the Software.
38
+
39
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
40
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
41
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
42
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
43
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
44
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
45
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,247 @@
1
+ # DSPy.rb
2
+
3
+ [![Gem Version](https://img.shields.io/gem/v/dspy)](https://rubygems.org/gems/dspy)
4
+ [![Total Downloads](https://img.shields.io/gem/dt/dspy)](https://rubygems.org/gems/dspy)
5
+ [![Build Status](https://img.shields.io/github/actions/workflow/status/vicentereig/dspy.rb/ruby.yml?branch=main&label=build)](https://github.com/vicentereig/dspy.rb/actions/workflows/ruby.yml)
6
+ [![Documentation](https://img.shields.io/badge/docs-vicentereig.github.io%2Fdspy.rb-blue)](https://vicentereig.github.io/dspy.rb/)
7
+
8
+ **Build reliable LLM applications in idiomatic Ruby using composable, type-safe modules.**
9
+
10
+ The Ruby framework for programming with large language models. DSPy.rb brings structured LLM programming to Ruby developers. Instead of wrestling with prompt strings and parsing responses, you define typed signatures in idiomatic Ruby to compose and decompose AI workflows and agents.
11
+
12
+ **Prompts are just functions.** Traditional prompting is like writing code with string concatenation: it works until it doesn't. DSPy.rb brings you
13
+ the programming approach pioneered by [dspy.ai](https://dspy.ai/): instead of crafting fragile prompts, you define modular
14
+ signatures and let the framework handle the messy details.
15
+
16
+ DSPy.rb is a surgical, idiomatic Ruby port of Stanford's [DSPy framework](https://github.com/stanfordnlp/dspy). While implementing
17
+ the core concepts of signatures, predictors, and optimization from the original Python library, DSPy.rb embraces Ruby
18
+ conventions and adds Ruby-specific innovations like CodeAct agents and enhanced production instrumentation.
19
+
20
+ The result? LLM applications that actually scale and don't break when you sneeze.
21
+
22
+ ## Your First DSPy Program
23
+
24
+ ```ruby
25
+ # Define a signature for sentiment classification
26
+ class Classify < DSPy::Signature
27
+ description "Classify sentiment of a given sentence."
28
+
29
+ class Sentiment < T::Enum
30
+ enums do
31
+ Positive = new('positive')
32
+ Negative = new('negative')
33
+ Neutral = new('neutral')
34
+ end
35
+ end
36
+
37
+ input do
38
+ const :sentence, String
39
+ end
40
+
41
+ output do
42
+ const :sentiment, Sentiment
43
+ const :confidence, Float
44
+ end
45
+ end
46
+
47
+ # Configure DSPy with your LLM
48
+ DSPy.configure do |c|
49
+ c.lm = DSPy::LM.new('openai/gpt-4o-mini',
50
+ api_key: ENV['OPENAI_API_KEY'],
51
+ structured_outputs: true) # Enable OpenAI's native JSON mode
52
+ end
53
+
54
+ # Create the predictor and run inference
55
+ classify = DSPy::Predict.new(Classify)
56
+ result = classify.call(sentence: "This book was super fun to read!")
57
+
58
+ puts result.sentiment # => #<Sentiment::Positive>
59
+ puts result.confidence # => 0.85
60
+ ```
61
+
62
+ ### Access to 200+ Models Across 5 Providers
63
+
64
+ DSPy.rb provides unified access to major LLM providers with provider-specific optimizations:
65
+
66
+ ```ruby
67
+ # OpenAI (GPT-4, GPT-4o, GPT-4o-mini, GPT-5, etc.)
68
+ DSPy.configure do |c|
69
+ c.lm = DSPy::LM.new('openai/gpt-4o-mini',
70
+ api_key: ENV['OPENAI_API_KEY'],
71
+ structured_outputs: true) # Native JSON mode
72
+ end
73
+
74
+ # Google Gemini (Gemini 1.5 Pro, Flash, Gemini 2.0, etc.)
75
+ DSPy.configure do |c|
76
+ c.lm = DSPy::LM.new('gemini/gemini-2.5-flash',
77
+ api_key: ENV['GEMINI_API_KEY'],
78
+ structured_outputs: true) # Native structured outputs
79
+ end
80
+
81
+ # Anthropic Claude (Claude 3.5, Claude 4, etc.)
82
+ DSPy.configure do |c|
83
+ c.lm = DSPy::LM.new('anthropic/claude-sonnet-4-5-20250929',
84
+ api_key: ENV['ANTHROPIC_API_KEY'],
85
+ structured_outputs: true) # Tool-based extraction (default)
86
+ end
87
+
88
+ # Ollama - Run any local model (Llama, Mistral, Gemma, etc.)
89
+ DSPy.configure do |c|
90
+ c.lm = DSPy::LM.new('ollama/llama3.2') # Free, runs locally, no API key needed
91
+ end
92
+
93
+ # OpenRouter - Access to 200+ models from multiple providers
94
+ DSPy.configure do |c|
95
+ c.lm = DSPy::LM.new('openrouter/deepseek/deepseek-chat-v3.1:free',
96
+ api_key: ENV['OPENROUTER_API_KEY'])
97
+ end
98
+ ```
99
+
100
+ ## What You Get
101
+
102
+ **Core Building Blocks:**
103
+ - **Signatures** - Define input/output schemas using Sorbet types with T::Enum and union type support
104
+ - **Predict** - LLM completion with structured data extraction and multimodal support
105
+ - **Chain of Thought** - Step-by-step reasoning for complex problems with automatic prompt optimization
106
+ - **ReAct** - Tool-using agents with type-safe tool definitions and error recovery
107
+ - **CodeAct** - Dynamic code execution agents for programming tasks
108
+ - **Module Composition** - Combine multiple LLM calls into production-ready workflows
109
+
110
+ **Optimization & Evaluation:**
111
+ - **Prompt Objects** - Manipulate prompts as first-class objects instead of strings
112
+ - **Typed Examples** - Type-safe training data with automatic validation
113
+ - **Evaluation Framework** - Advanced metrics beyond simple accuracy with error-resilient pipelines
114
+ - **MIPROv2 Optimization** - Advanced Bayesian optimization with Gaussian Processes, multiple optimization strategies, auto-config presets, and storage persistence
115
+
116
+ **Production Features:**
117
+ - **Reliable JSON Extraction** - Native structured outputs for OpenAI and Gemini, Anthropic tool-based extraction, and automatic strategy selection with fallback
118
+ - **Type-Safe Configuration** - Strategy enums with automatic provider optimization (Strict/Compatible modes)
119
+ - **Smart Retry Logic** - Progressive fallback with exponential backoff for handling transient failures
120
+ - **Zero-Config Langfuse Integration** - Set env vars and get automatic OpenTelemetry traces in Langfuse
121
+ - **Performance Caching** - Schema and capability caching for faster repeated operations
122
+ - **File-based Storage** - Optimization result persistence with versioning
123
+ - **Structured Logging** - JSON and key=value formats with span tracking
124
+
125
+ **Developer Experience:**
126
+ - LLM provider support using official Ruby clients:
127
+ - [OpenAI Ruby](https://github.com/openai/openai-ruby) with vision model support
128
+ - [Anthropic Ruby SDK](https://github.com/anthropics/anthropic-sdk-ruby) with multimodal capabilities
129
+ - [Google Gemini API](https://ai.google.dev/) with native structured outputs
130
+ - [Ollama](https://ollama.com/) via OpenAI compatibility layer for local models
131
+ - **Multimodal Support** - Complete image analysis with DSPy::Image, type-safe bounding boxes, vision-capable models
132
+ - Runtime type checking with [Sorbet](https://sorbet.org/) including T::Enum and union types
133
+ - Type-safe tool definitions for ReAct agents
134
+ - Comprehensive instrumentation and observability
135
+
136
+ ## Development Status
137
+
138
+ DSPy.rb is actively developed and approaching stability. The core framework is production-ready with
139
+ comprehensive documentation, but I'm battle-testing features through the 0.x series before committing
140
+ to a stable v1.0 API.
141
+
142
+ Real-world usage feedback is invaluable - if you encounter issues or have suggestions, please open a GitHub issue!
143
+
144
+ ## Documentation
145
+
146
+ 📖 **[Complete Documentation Website](https://vicentereig.github.io/dspy.rb/)**
147
+
148
+ ### LLM-Friendly Documentation
149
+
150
+ For LLMs and AI assistants working with DSPy.rb:
151
+ - **[llms.txt](https://vicentereig.github.io/dspy.rb/llms.txt)** - Concise reference optimized for LLMs
152
+ - **[llms-full.txt](https://vicentereig.github.io/dspy.rb/llms-full.txt)** - Comprehensive API documentation
153
+
154
+ ### Getting Started
155
+ - **[Installation & Setup](docs/src/getting-started/installation.md)** - Detailed installation and configuration
156
+ - **[Quick Start Guide](docs/src/getting-started/quick-start.md)** - Your first DSPy programs
157
+ - **[Core Concepts](docs/src/getting-started/core-concepts.md)** - Understanding signatures, predictors, and modules
158
+
159
+ ### Core Features
160
+ - **[Signatures & Types](docs/src/core-concepts/signatures.md)** - Define typed interfaces for LLM operations
161
+ - **[Predictors](docs/src/core-concepts/predictors.md)** - Predict, ChainOfThought, ReAct, and more
162
+ - **[Modules & Pipelines](docs/src/core-concepts/modules.md)** - Compose complex multi-stage workflows
163
+ - **[Multimodal Support](docs/src/core-concepts/multimodal.md)** - Image analysis with vision-capable models
164
+ - **[Examples & Validation](docs/src/core-concepts/examples.md)** - Type-safe training data
165
+
166
+ ### Optimization
167
+ - **[Evaluation Framework](docs/src/optimization/evaluation.md)** - Advanced metrics beyond simple accuracy
168
+ - **[Prompt Optimization](docs/src/optimization/prompt-optimization.md)** - Manipulate prompts as objects
169
+ - **[MIPROv2 Optimizer](docs/src/optimization/miprov2.md)** - Advanced Bayesian optimization with Gaussian Processes
170
+ - **[GEPA Optimizer](docs/src/optimization/gepa.md)** *(beta)* - Reflective mutation with optional reflection LMs
171
+
172
+ ### Production Features
173
+ - **[Storage System](docs/src/production/storage.md)** - Persistence and optimization result storage
174
+ - **[Observability](docs/src/production/observability.md)** - Zero-config Langfuse integration with a dedicated export worker that never blocks your LLMs
175
+
176
+ ### Advanced Usage
177
+ - **[Complex Types](docs/src/advanced/complex-types.md)** - Sorbet type integration with automatic coercion for structs, enums, and arrays
178
+ - **[Manual Pipelines](docs/src/advanced/pipelines.md)** - Manual module composition patterns
179
+ - **[RAG Patterns](docs/src/advanced/rag.md)** - Manual RAG implementation with external services
180
+ - **[Custom Metrics](docs/src/advanced/custom-metrics.md)** - Proc-based evaluation logic
181
+
182
+ ## Quick Start
183
+
184
+ ### Installation
185
+
186
+ Add to your Gemfile:
187
+
188
+ ```ruby
189
+ gem 'dspy'
190
+ ```
191
+
192
+ Then run:
193
+
194
+ ```bash
195
+ bundle install
196
+ ```
197
+
198
+ ## Recent Achievements
199
+
200
+ DSPy.rb has rapidly evolved from experimental to production-ready:
201
+
202
+ ### Foundation
203
+ - ✅ **JSON Parsing Reliability** - Native OpenAI structured outputs, strategy selection, retry logic
204
+ - ✅ **Type-Safe Strategy Configuration** - Provider-optimized automatic strategy selection
205
+ - ✅ **Core Module System** - Predict, ChainOfThought, ReAct, CodeAct with type safety
206
+ - ✅ **Production Observability** - OpenTelemetry, New Relic, and Langfuse integration
207
+ - ✅ **Advanced Optimization** - MIPROv2 with Bayesian optimization, Gaussian Processes, and multiple strategies
208
+
209
+ ### Recent Advances
210
+ - ✅ **Enhanced Langfuse Integration (v0.25.0)** - Comprehensive OpenTelemetry span reporting with proper input/output, hierarchical nesting, accurate timing, and observation types
211
+ - ✅ **Comprehensive Multimodal Framework** - Complete image analysis with `DSPy::Image`, type-safe bounding boxes, vision model integration
212
+ - ✅ **Advanced Type System** - `T::Enum` integration, union types for agentic workflows, complex type coercion
213
+ - ✅ **Production-Ready Evaluation** - Multi-factor metrics beyond accuracy, error-resilient evaluation pipelines
214
+ - ✅ **Documentation Ecosystem** - `llms.txt` for AI assistants, ADRs, blog articles, comprehensive examples
215
+ - ✅ **API Maturation** - Simplified idiomatic patterns, better error handling, production-proven designs
216
+
217
+ ## Roadmap - Production Battle-Testing Toward v1.0
218
+
219
+ DSPy.rb has transitioned from **feature building** to **production validation**. The core framework is
220
+ feature-complete and stable - now I'm focusing on real-world usage patterns, performance optimization,
221
+ and ecosystem integration.
222
+
223
+ **Current Focus Areas:**
224
+
225
+ ### Production Readiness
226
+ - 🚧 **Production Patterns** - Real-world usage validation and performance optimization
227
+ - 🚧 **Ruby Ecosystem Integration** - Rails integration, Sidekiq compatibility, deployment patterns
228
+ - 🚧 **Scale Testing** - High-volume usage, memory management, connection pooling
229
+ - 🚧 **Error Recovery** - Robust failure handling patterns for production environments
230
+
231
+ ### Ecosystem Expansion
232
+ - 🚧 **Model Context Protocol (MCP)** - Integration with MCP ecosystem
233
+ - 🚧 **Additional Provider Support** - Azure OpenAI, local models beyond Ollama
234
+ - 🚧 **Tool Ecosystem** - Expanded tool integrations for ReAct agents
235
+
236
+ ### Community & Adoption
237
+ - 🚧 **Community Examples** - Real-world applications and case studies
238
+ - 🚧 **Contributor Experience** - Making it easier to contribute and extend
239
+ - 🚧 **Performance Benchmarks** - Comparative analysis vs other frameworks
240
+
241
+ **v1.0 Philosophy:**
242
+ v1.0 will be released after extensive production battle-testing, not after checking off features.
243
+ The API is already stable - v1.0 represents confidence in production reliability backed by real-world validation.
244
+
245
+ ## License
246
+
247
+ This project is licensed under the MIT License.
data/lib/dspy/datasets/ade.rb ADDED
@@ -0,0 +1,26 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ module ADE
6
+ extend self
7
+
8
+ DATASET_ID = 'ade-benchmark-corpus/ade_corpus_v2'
9
+
10
+ def examples(split: 'train', limit: 200, offset: 0, cache_dir: nil)
11
+ dataset = DSPy::Datasets.fetch(DATASET_ID, split: split, cache_dir: cache_dir)
12
+ dataset.rows(limit: limit, offset: offset).map do |row|
13
+ {
14
+ 'text' => row.fetch('text', '').to_s,
15
+ 'label' => row.fetch('label', 0).to_i
16
+ }
17
+ end
18
+ end
19
+
20
+ def fetch_rows(split:, limit:, offset:, cache_dir: nil)
21
+ dataset = DSPy::Datasets.fetch(DATASET_ID, split: split, cache_dir: cache_dir)
22
+ dataset.rows(limit: limit, offset: offset)
23
+ end
24
+ end
25
+ end
26
+ end
data/lib/dspy/datasets/dataset.rb ADDED
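The ADE module above reduces each raw corpus row to a two-key hash with defensive defaults and type coercion. A minimal standalone sketch of that normalization (the `normalize_ade_row` helper name is hypothetical):

```ruby
# Sketch of the per-row normalization ADE.examples applies: missing keys
# fall back to defaults, and values are coerced to String / Integer.
def normalize_ade_row(row)
  {
    'text'  => row.fetch('text', '').to_s,
    'label' => row.fetch('label', 0).to_i
  }
end

raw = { 'text' => 'Patient developed a rash after drug X.', 'label' => '1' }
puts normalize_ade_row(raw).inspect
# => {"text"=>"Patient developed a rash after drug X.", "label"=>1}
```

Because `fetch` supplies defaults, even an empty row yields a well-formed `{'text' => '', 'label' => 0}` example.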
@@ -0,0 +1,45 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ class Dataset
6
+ include Enumerable
7
+
8
+ attr_reader :info, :split
9
+
10
+ def initialize(info:, split:, loader:)
11
+ @info = info
12
+ @split = split
13
+ @loader = loader
14
+ end
15
+
16
+ def each
17
+ return enum_for(:each) unless block_given?
18
+
19
+ @loader.each_row do |row|
20
+ yield row
21
+ end
22
+ end
23
+
24
+ def rows(limit: nil, offset: 0)
25
+ enumerator = each
26
+ enumerator = enumerator.drop(offset) if offset.positive?
27
+ limit ? enumerator.take(limit) : enumerator.to_a
28
+ end
29
+
30
+ def size
31
+ @loader.row_count
32
+ end
33
+
34
+ alias count size
35
+
36
+ def features
37
+ info.features
38
+ end
39
+
40
+ def metadata
41
+ info.metadata
42
+ end
43
+ end
44
+ end
45
+ end
data/lib/dspy/datasets/errors.rb ADDED
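`Dataset#rows` composes Enumerable's `drop`/`take`: skip `offset` rows, then keep at most `limit` rows, or everything that remains when `limit` is nil. The same slicing semantics in isolation (the `slice_rows` helper name is hypothetical):

```ruby
# Same slicing semantics as Dataset#rows, applied to any enumerator.
def slice_rows(enumerator, limit: nil, offset: 0)
  enumerator = enumerator.drop(offset) if offset.positive?
  limit ? enumerator.take(limit) : enumerator.to_a
end

rows = (1..10).each
puts slice_rows(rows, limit: 3, offset: 5).inspect
# => [6, 7, 8]
```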
@@ -0,0 +1,10 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ class DatasetError < StandardError; end
6
+ class DatasetNotFoundError < DatasetError; end
7
+ class InvalidSplitError < DatasetError; end
8
+ class DownloadError < DatasetError; end
9
+ end
10
+ end
data/lib/dspy/datasets/hugging_face/api.rb ADDED
@@ -0,0 +1,236 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+ require 'net/http'
5
+ require 'uri'
6
+ require 'time'
+ require 'sorbet-runtime'
7
+
8
+ module DSPy
9
+ module Datasets
10
+ module HuggingFace
11
+ class APIError < StandardError; end
12
+
13
+ class DatasetSummary < T::Struct
14
+ const :id, String
15
+ const :author, T.nilable(String)
16
+ const :disabled, T::Boolean
17
+ const :gated, T::Boolean
18
+ const :private, T::Boolean
19
+ const :likes, T.nilable(Integer)
20
+ const :downloads, T.nilable(Integer)
21
+ const :tags, T::Array[String]
22
+ const :sha, T.nilable(String)
23
+ const :last_modified, T.nilable(Time)
24
+ const :description, T.nilable(String)
25
+ end
26
+
27
+ class Sibling < T::Struct
28
+ const :rfilename, String
29
+ const :size, T.nilable(Integer)
30
+ end
31
+
32
+ class DatasetDetails < T::Struct
33
+ const :summary, DatasetSummary
34
+ const :card_data, T.nilable(T::Hash[String, T.untyped])
35
+ const :siblings, T::Array[Sibling]
36
+ const :configs, T::Array[T::Hash[String, T.untyped]]
37
+ end
38
+
39
+ class ParquetListing < T::Struct
40
+ const :files, T::Hash[String, T::Hash[String, T::Array[String]]]
41
+ end
42
+
43
+ class Tag < T::Struct
44
+ const :id, String
45
+ const :label, String
46
+ const :type, String
47
+ end
48
+
49
+ class TagsByType < T::Struct
50
+ const :tags, T::Hash[String, T::Array[Tag]]
51
+ end
52
+
53
+ class ListParams < T::Struct
54
+ const :search, T.nilable(String)
55
+ const :author, T.nilable(String)
56
+ const :filter, T.nilable(T::Array[String])
57
+ const :sort, T.nilable(String)
58
+ const :direction, T.nilable(Integer)
59
+ const :limit, T.nilable(Integer)
60
+ const :offset, T.nilable(Integer)
61
+ const :full, T.nilable(T::Boolean)
62
+ end
63
+
64
+ class Client
65
+ extend T::Sig
66
+
67
+ BASE_URL = 'https://huggingface.co'
68
+ DEFAULT_TIMEOUT = 15
69
+
70
+ sig { params(base_url: String, timeout: Integer).void }
71
+ def initialize(base_url: BASE_URL, timeout: DEFAULT_TIMEOUT)
72
+ @base_url = base_url
73
+ @timeout = timeout
74
+ end
75
+
76
+ sig { params(params: ListParams).returns(T::Array[DatasetSummary]) }
77
+ def list_datasets(params = ListParams.new)
78
+ query = build_list_query(params)
79
+ payload = get('/api/datasets', query)
80
+ unless payload.is_a?(Array)
81
+ raise APIError, 'Unexpected response when listing datasets'
82
+ end
83
+
84
+ payload.map { |entry| parse_dataset_summary(entry) }
85
+ end
86
+
87
+ sig { params(repo_id: String, full: T.nilable(T::Boolean), revision: T.nilable(String)).returns(DatasetDetails) }
88
+ def dataset(repo_id, full: nil, revision: nil)
89
+ path = if revision
90
+ "/api/datasets/#{repo_id}/revision/#{revision}"
91
+ else
92
+ "/api/datasets/#{repo_id}"
93
+ end
94
+ query = {}
95
+ query[:full] = full ? 1 : 0 unless full.nil?
96
+ payload = get(path, query)
97
+ DatasetDetails.new(
98
+ summary: parse_dataset_summary(payload),
99
+ card_data: payload['cardData'],
100
+ siblings: Array(payload['siblings']).map { |item| Sibling.new(rfilename: item['rfilename'].to_s, size: item['size']) },
101
+ configs: Array(payload['configs']).map { |config| config }
102
+ )
103
+ end
104
+
105
+ sig { params(repo_id: String).returns(ParquetListing) }
106
+ def dataset_parquet(repo_id)
107
+ payload = get("/api/datasets/#{repo_id}/parquet")
108
+ unless payload.is_a?(Hash)
109
+ raise APIError, 'Unexpected parquet listing response'
110
+ end
111
+
112
+ files = payload.each_with_object({}) do |(config, splits), acc|
113
+ acc[config] = splits.each_with_object({}) do |(split, urls), split_acc|
114
+ split_acc[split] = Array(urls).map(&:to_s)
115
+ end
116
+ end
117
+
118
+ ParquetListing.new(files: files)
119
+ end
120
+
121
+ sig { returns(TagsByType) }
122
+ def dataset_tags_by_type
123
+ payload = get('/api/datasets-tags-by-type')
124
+ unless payload.is_a?(Hash)
125
+ raise APIError, 'Unexpected dataset tags response'
126
+ end
127
+
128
+ tags = payload.each_with_object({}) do |(category, items), acc|
129
+ acc[category] = Array(items).map do |item|
130
+ Tag.new(
131
+ id: item.fetch('id').to_s,
132
+ label: item.fetch('label').to_s,
133
+ type: item.fetch('type').to_s
134
+ )
135
+ end
136
+ end
137
+
138
+ TagsByType.new(tags: tags)
139
+ end
140
+
141
+ private
142
+
143
+ sig { params(path: String, params: T::Hash[Symbol, T.untyped]).returns(T.untyped) }
144
+ def get(path, params = {})
145
+ uri = build_uri(path, params)
146
+ request = Net::HTTP::Get.new(uri)
147
+ request['Accept'] = 'application/json'
148
+
149
+ response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https', read_timeout: @timeout, open_timeout: @timeout) do |http|
150
+ http.request(request)
151
+ end
152
+
153
+ unless response.is_a?(Net::HTTPSuccess)
154
+ raise APIError, "Hugging Face API request failed: #{response.code} #{response.message}"
155
+ end
156
+
157
+ JSON.parse(response.body)
158
+ rescue JSON::ParserError => e
159
+ raise APIError, "Failed to parse Hugging Face API response: #{e.message}"
160
+ end
161
+
162
+ sig { params(path: String, params: T::Hash[Symbol, T.untyped]).returns(URI::HTTPS) }
163
+ def build_uri(path, params)
164
+ uri = URI.join(@base_url, path)
165
+ unless params.empty?
166
+ # Expand repeated filters if present
167
+ query_pairs = params.each_with_object([]) do |(key, value), acc|
168
+ next if value.nil?
169
+
170
+ if key == :filter && value.is_a?(Array)
171
+ value.each { |filter| acc << ["filter", filter.to_s] }
172
+ else
173
+ acc << [key.to_s, format_query_value(value)]
174
+ end
175
+ end
176
+ uri.query = URI.encode_www_form(query_pairs)
177
+ end
178
+ uri
179
+ end
180
+
181
+ sig { params(value: T.untyped).returns(String) }
182
+ def format_query_value(value)
183
+ case value
184
+ when TrueClass, FalseClass
185
+ value ? '1' : '0'
186
+ else
187
+ value.to_s
188
+ end
189
+ end
190
+
191
+ sig { params(payload: T::Hash[String, T.untyped]).returns(DatasetSummary) }
192
+ def parse_dataset_summary(payload)
193
+ DatasetSummary.new(
194
+ id: payload.fetch('id').to_s,
195
+ author: payload['author'],
196
+ disabled: payload.fetch('disabled', false),
197
+ gated: payload.fetch('gated', false),
198
+ private: payload.fetch('private', false),
199
+ likes: payload['likes'],
200
+ downloads: payload['downloads'],
201
+ tags: Array(payload['tags']).map(&:to_s),
202
+ sha: payload['sha'],
203
+ last_modified: parse_time(payload['lastModified']),
204
+ description: payload['description']
205
+ )
206
+ end
207
+
208
+ sig { params(params: ListParams).returns(T::Hash[Symbol, T.untyped]) }
209
+ def build_list_query(params)
210
+ query = {
211
+ search: params.search,
212
+ author: params.author,
213
+ sort: params.sort,
214
+ direction: params.direction,
215
+ limit: params.limit,
216
+ offset: params.offset,
217
+ full: params.full
218
+ }.reject { |_, value| value.nil? }
219
+
220
+ query[:filter] = params.filter if params.filter
221
+
222
+ query
223
+ end
224
+
225
+ sig { params(value: T.untyped).returns(T.nilable(Time)) }
226
+ def parse_time(value)
227
+ return nil unless value
228
+
229
+ Time.parse(value.to_s)
230
+ rescue ArgumentError
231
+ nil
232
+ end
233
+ end
234
+ end
235
+ end
236
+ end
data/lib/dspy/datasets/info.rb ADDED
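The query handling in `build_uri` and `format_query_value` can be illustrated in isolation: nils are skipped, booleans become `'1'`/`'0'`, and an array of filters is expanded into repeated `filter` pairs. A sketch under those same rules (the `encode_query` helper name is hypothetical):

```ruby
require 'uri'

# Mirrors the client's query building: skip nils, encode booleans as
# '1'/'0', and expand a :filter array into repeated `filter` pairs.
def encode_query(params)
  pairs = params.each_with_object([]) do |(key, value), acc|
    next if value.nil?

    if key == :filter && value.is_a?(Array)
      value.each { |f| acc << ['filter', f.to_s] }
    else
      encoded = case value
                when true then '1'
                when false then '0'
                else value.to_s
                end
      acc << [key.to_s, encoded]
    end
  end
  URI.encode_www_form(pairs)
end

puts encode_query(search: 'ade', full: true, filter: %w[a b], author: nil)
# => "search=ade&full=1&filter=a&filter=b"
```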
@@ -0,0 +1,24 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ class DatasetInfo
6
+ attr_reader :id, :name, :provider, :splits, :features, :loader, :loader_options, :metadata
7
+
8
+ def initialize(id:, name:, provider:, splits:, features:, loader:, loader_options:, metadata: {})
9
+ @id = id
10
+ @name = name
11
+ @provider = provider
12
+ @splits = Array(splits).map(&:to_s).freeze
13
+ @features = features.freeze
14
+ @loader = loader
15
+ @loader_options = loader_options.freeze
16
+ @metadata = metadata.freeze
17
+ end
18
+
19
+ def default_split
20
+ @splits.first
21
+ end
22
+ end
23
+ end
24
+ end
data/lib/dspy/datasets/loaders/huggingface_parquet.rb ADDED
@@ -0,0 +1,134 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+ require 'net/http'
5
+ require 'uri'
6
+ require 'fileutils'
7
+ require 'parquet'
8
+
9
+ module DSPy
10
+ module Datasets
11
+ module Loaders
12
+ class HuggingFaceParquet
13
+ BASE_URL = 'https://datasets-server.huggingface.co'
14
+
15
+ def initialize(info, split:, cache_dir:)
16
+ @info = info
17
+ @split = split
18
+ @cache_root = determine_cache_root(cache_dir)
19
+ end
20
+
21
+ def each_row
22
+ return enum_for(:each_row) unless block_given?
23
+
24
+ parquet_files.each do |file|
25
+ table = load_table(file)
26
+ field_names = table.schema.fields.map(&:name)
27
+ table.raw_records.each do |values|
28
+ yield normalized_row(field_names, values)
29
+ end
30
+ end
31
+ end
32
+
33
+ def row_count
34
+ @row_count ||= parquet_files.sum do |file|
35
+ load_table(file).n_rows
36
+ end
37
+ end
38
+
39
+ private
40
+
41
+ attr_reader :info, :split, :cache_root
42
+
43
+ def normalized_row(field_names, values)
44
+ field_names.each_with_index.each_with_object({}) do |(name, index), row|
45
+ row[name] = values[index]
46
+ end
47
+ end
48
+
49
+ def load_table(file)
50
+ Arrow::Table.load(ensure_cached(file))
51
+ end
52
+
53
+ def parquet_files
54
+ @parquet_files ||= begin
55
+ uri = URI("#{BASE_URL}/parquet")
56
+ params = {
57
+ dataset: info.loader_options.fetch(:dataset),
58
+ config: info.loader_options.fetch(:config),
59
+ split: split
60
+ }
61
+ uri.query = URI.encode_www_form(params)
62
+
63
+ response = http_get(uri)
64
+ unless response.is_a?(Net::HTTPSuccess)
65
+ raise DatasetError, "Failed to fetch parquet manifest: #{response.code}"
66
+ end
67
+
68
+ body = JSON.parse(response.body)
69
+ files = body.fetch('parquet_files', [])
70
+ raise DatasetError, "No parquet files available for #{info.id} (#{split})" if files.empty?
71
+
72
+ files
73
+ end
74
+ end
75
+
76
+ def ensure_cached(file)
77
+ FileUtils.mkdir_p(cache_dir)
78
+ path = File.join(cache_dir, file.fetch('filename'))
79
+ return path if File.exist?(path) && File.size?(path)
80
+
81
+ download_file(file.fetch('url'), path)
82
+ path
83
+ end
84
+
85
+ def cache_dir
86
+ @cache_dir ||= File.join(cache_root, split)
87
+ end
88
+
89
+ def determine_cache_root(cache_dir)
90
+ base = if cache_dir
91
+ File.expand_path(cache_dir)
92
+ elsif ENV['DSPY_DATASETS_CACHE']
93
+ File.expand_path(ENV['DSPY_DATASETS_CACHE'])
94
+ else
95
+ File.expand_path('../../../../tmp/dspy_datasets', __dir__)
96
+ end
97
+ File.join(base, sanitized_dataset_id)
98
+ end
99
+
100
+ def sanitized_dataset_id
101
+ info.id.gsub(/[^\w.-]+/, '_')
102
+ end
103
+
104
+ def http_get(uri)
105
+ Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
106
+ request = Net::HTTP::Get.new(uri)
107
+ http.request(request)
108
+ end
109
+ end
110
+
111
+ def download_file(url, destination)
112
+ uri = URI(url)
113
+ Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
114
+ request = Net::HTTP::Get.new(uri)
115
+ http.request(request) do |response|
116
+ unless response.is_a?(Net::HTTPSuccess)
117
+ raise DownloadError, "Failed to download parquet file: #{response.code}"
118
+ end
119
+
120
+ File.open(destination, 'wb') do |file|
121
+ response.read_body do |chunk|
122
+ file.write(chunk)
123
+ end
124
+ end
125
+ end
126
+ end
127
+ rescue
128
+ File.delete(destination) if File.exist?(destination)
129
+ raise
130
+ end
131
+ end
132
+ end
133
+ end
134
+ end
data/lib/dspy/datasets/loaders.rb ADDED
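`determine_cache_root` and `sanitized_dataset_id` together produce the loader's on-disk layout: the repo id is sanitized so its slash can't escape the cache root, then files are namespaced per split. A minimal sketch (the `cache_path` helper name is hypothetical):

```ruby
# On-disk cache layout used by the loader: sanitize the dataset id
# (anything outside [A-Za-z0-9_.-] becomes '_'), then nest by split.
def cache_path(cache_root, dataset_id, split)
  File.join(cache_root, dataset_id.gsub(/[^\w.-]+/, '_'), split)
end

puts cache_path('/tmp/dspy_datasets', 'ade-benchmark-corpus/ade_corpus_v2', 'train')
# => "/tmp/dspy_datasets/ade-benchmark-corpus_ade_corpus_v2/train"
```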
@@ -0,0 +1,19 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ module Loaders
6
+ extend self
7
+
8
+ def build(info, split:, cache_dir:)
9
+ case info.loader
10
+ when :huggingface_parquet
11
+ require_relative 'loaders/huggingface_parquet'
12
+ HuggingFaceParquet.new(info, split: split, cache_dir: cache_dir)
13
+ else
14
+ raise DatasetError, "Unsupported loader: #{info.loader}"
15
+ end
16
+ end
17
+ end
18
+ end
19
+ end
data/lib/dspy/datasets/manifest.rb ADDED
@@ -0,0 +1,40 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'info'
4
+
5
+ module DSPy
6
+ module Datasets
7
+ module Manifest
8
+ extend self
9
+
10
+ def all
11
+ @all ||= [
12
+ DatasetInfo.new(
13
+ id: 'ade-benchmark-corpus/ade_corpus_v2',
14
+ name: 'ADE Corpus V2',
15
+ provider: 'huggingface',
16
+ splits: %w[train],
17
+ features: {
18
+ 'text' => { 'type' => 'string' },
19
+ 'label' => { 'type' => 'int64', 'description' => '0: Not-Related, 1: Related' }
20
+ },
21
+ loader: :huggingface_parquet,
22
+ loader_options: {
23
+ dataset: 'ade-benchmark-corpus/ade_corpus_v2',
24
+ config: 'Ade_corpus_v2_classification'
25
+ },
26
+ metadata: {
27
+ description: 'Adverse drug event classification corpus used in ADE optimization examples.',
28
+ homepage: 'https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2',
29
+ approx_row_count: 23516
30
+ }
31
+ )
32
+ ].freeze
33
+ end
34
+
35
+ def by_id(id)
36
+ all.detect { |dataset| dataset.id == id }
37
+ end
38
+ end
39
+ end
40
+ end
data/lib/dspy/datasets/version.rb ADDED
@@ -0,0 +1,7 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ VERSION = DSPy::VERSION
6
+ end
7
+ end
data/lib/dspy/datasets.rb ADDED
@@ -0,0 +1,53 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'datasets/version'
4
+ require_relative 'datasets/errors'
5
+ require_relative 'datasets/dataset'
6
+ require_relative 'datasets/manifest'
7
+ require_relative 'datasets/loaders'
8
+ require_relative 'datasets/hugging_face/api'
9
+ require_relative 'datasets/ade'
10
+
11
+ module DSPy
12
+ module Datasets
13
+ PaginatedList = Struct.new(:items, :page, :per_page, :total_count, keyword_init: true) do
14
+ def total_pages
15
+ return 0 if per_page.zero?
16
+
17
+ (total_count.to_f / per_page).ceil
18
+ end
19
+ end
20
+
21
+ module_function
22
+
23
+ def list(page: 1, per_page: 20)
24
+ page = [page.to_i, 1].max
25
+ per_page = [per_page.to_i, 1].max
26
+
27
+ all = Manifest.all
28
+ offset = (page - 1) * per_page
29
+ slice = offset >= all.length ? [] : all.slice(offset, per_page) || []
30
+
31
+ PaginatedList.new(
32
+ items: slice,
33
+ page: page,
34
+ per_page: per_page,
35
+ total_count: all.length
36
+ )
37
+ end
38
+
39
+ def fetch(dataset_id, split: nil, cache_dir: nil)
40
+ info = Manifest.by_id(dataset_id)
41
+ raise DatasetNotFoundError, "Unknown dataset: #{dataset_id}" unless info
42
+
43
+ split ||= info.default_split
44
+ split = split.to_s
45
+ unless info.splits.include?(split)
46
+ raise InvalidSplitError, "Invalid split '#{split}' for dataset #{dataset_id} (available: #{info.splits.join(', ')})"
47
+ end
48
+
49
+ loader = Loaders.build(info, split: split, cache_dir: cache_dir)
50
+ Dataset.new(info: info, split: split, loader: loader)
51
+ end
52
+ end
53
+ end
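`DSPy::Datasets.list` clamps `page`/`per_page` to at least 1, slices the manifest, and `PaginatedList#total_pages` is a ceiling division. The arithmetic in a self-contained sketch (the `paginate` helper name is hypothetical):

```ruby
# Pagination arithmetic mirroring DSPy::Datasets.list: clamp inputs,
# slice by offset, and derive total pages via ceiling division.
def paginate(items, page: 1, per_page: 20)
  page = [page.to_i, 1].max
  per_page = [per_page.to_i, 1].max
  offset = (page - 1) * per_page
  slice = offset >= items.length ? [] : items.slice(offset, per_page) || []
  {
    items: slice,
    page: page,
    total_pages: (items.length.to_f / per_page).ceil
  }
end

result = paginate((1..45).to_a, page: 3, per_page: 20)
puts result[:items].inspect  # => [41, 42, 43, 44, 45]
puts result[:total_pages]    # => 3
```

Clamping means a caller asking for `page: 0` or `per_page: -5` still gets a valid first page rather than an error or a division by zero.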
metadata ADDED
@@ -0,0 +1,82 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: dspy-datasets
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.29.1
5
+ platform: ruby
6
+ authors:
7
+ - Vicente Reig Rincón de Arellano
8
+ bindir: bin
9
+ cert_chain: []
10
+ date: 2025-10-20 00:00:00.000000000 Z
11
+ dependencies:
12
+ - !ruby/object:Gem::Dependency
13
+ name: dspy
14
+ requirement: !ruby/object:Gem::Requirement
15
+ requirements:
16
+ - - '='
17
+ - !ruby/object:Gem::Version
18
+ version: 0.29.1
19
+ type: :runtime
20
+ prerelease: false
21
+ version_requirements: !ruby/object:Gem::Requirement
22
+ requirements:
23
+ - - '='
24
+ - !ruby/object:Gem::Version
25
+ version: 0.29.1
26
+ - !ruby/object:Gem::Dependency
27
+ name: red-parquet
28
+ requirement: !ruby/object:Gem::Requirement
29
+ requirements:
30
+ - - "~>"
31
+ - !ruby/object:Gem::Version
32
+ version: '21.0'
33
+ type: :runtime
34
+ prerelease: false
35
+ version_requirements: !ruby/object:Gem::Requirement
36
+ requirements:
37
+ - - "~>"
38
+ - !ruby/object:Gem::Version
39
+ version: '21.0'
40
+ description: DSPy datasets provide prebuilt loaders, caching, and schema metadata
41
+ for benchmark corpora used in DSPy examples and teleprompters.
42
+ email:
43
+ - hey@vicente.services
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - LICENSE
49
+ - README.md
50
+ - lib/dspy/datasets.rb
51
+ - lib/dspy/datasets/ade.rb
52
+ - lib/dspy/datasets/dataset.rb
53
+ - lib/dspy/datasets/errors.rb
54
+ - lib/dspy/datasets/hugging_face/api.rb
55
+ - lib/dspy/datasets/info.rb
56
+ - lib/dspy/datasets/loaders.rb
57
+ - lib/dspy/datasets/loaders/huggingface_parquet.rb
58
+ - lib/dspy/datasets/manifest.rb
59
+ - lib/dspy/datasets/version.rb
60
+ homepage: https://github.com/vicentereig/dspy.rb
61
+ licenses:
62
+ - MIT
63
+ metadata:
64
+ github_repo: git@github.com:vicentereig/dspy.rb
65
+ rdoc_options: []
66
+ require_paths:
67
+ - lib
68
+ required_ruby_version: !ruby/object:Gem::Requirement
69
+ requirements:
70
+ - - ">="
71
+ - !ruby/object:Gem::Version
72
+ version: 3.3.0
73
+ required_rubygems_version: !ruby/object:Gem::Requirement
74
+ requirements:
75
+ - - ">="
76
+ - !ruby/object:Gem::Version
77
+ version: '0'
78
+ requirements: []
79
+ rubygems_version: 3.6.5
80
+ specification_version: 4
81
+ summary: Curated datasets and loaders for DSPy.rb.
82
+ test_files: []