dspy-datasets 0.29.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: c8de3f972de17ce584e6f1f8f7eec8084b6d24c3517fd14001d58d12537b98d1
4
+ data.tar.gz: f47577ccf5b0826387bfb991d3f6372f9a41cccef7e1d9f3583030a0b5a4c61e
5
+ SHA512:
6
+ metadata.gz: e02a16d9b3321c2841d052e1c69fa91106cbbbeb8b44394f1c41052b01936a2757cb94b26c0292309effe724477eae12487ce6a9ac85b6bd10c1bd12f13a9798
7
+ data.tar.gz: 9ac56b72949104a5bb5d998768f419283b9a47b00653f54f84e99de492c987cdc548f060faa84855db4a8a54f2c231524f3ebf5269e774d8e10888d3fbdcabbf
data/LICENSE ADDED
@@ -0,0 +1,45 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Vicente Services SL
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
22
+
23
+ This project is a Ruby port of the original Python [DSPy library](https://github.com/stanfordnlp/dspy), which is licensed under the MIT License:
24
+
25
+ MIT License
26
+
27
+ Copyright (c) 2023 Stanford Future Data Systems
28
+
29
+ Permission is hereby granted, free of charge, to any person obtaining a copy
30
+ of this software and associated documentation files (the "Software"), to deal
31
+ in the Software without restriction, including without limitation the rights
32
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
33
+ copies of the Software, and to permit persons to whom the Software is
34
+ furnished to do so, subject to the following conditions:
35
+
36
+ The above copyright notice and this permission notice shall be included in all
37
+ copies or substantial portions of the Software.
38
+
39
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
40
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
41
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
42
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
43
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
44
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
45
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,247 @@
1
+ # DSPy.rb
2
+
3
+ [![Gem Version](https://img.shields.io/gem/v/dspy)](https://rubygems.org/gems/dspy)
4
+ [![Total Downloads](https://img.shields.io/gem/dt/dspy)](https://rubygems.org/gems/dspy)
5
+ [![Build Status](https://img.shields.io/github/actions/workflow/status/vicentereig/dspy.rb/ruby.yml?branch=main&label=build)](https://github.com/vicentereig/dspy.rb/actions/workflows/ruby.yml)
6
+ [![Documentation](https://img.shields.io/badge/docs-vicentereig.github.io%2Fdspy.rb-blue)](https://vicentereig.github.io/dspy.rb/)
7
+
8
+ **Build reliable LLM applications in idiomatic Ruby using composable, type-safe modules.**
9
+
10
+ The Ruby framework for programming with large language models. DSPy.rb brings structured LLM programming to Ruby developers. Instead of wrestling with prompt strings and parsing responses, you define typed signatures in idiomatic Ruby to compose and decompose AI workflows and agents.
11
+
12
+ **Prompts are just functions.** Traditional prompting is like writing code with string concatenation: it works until it doesn't. DSPy.rb brings you
13
+ the programming approach pioneered by [dspy.ai](https://dspy.ai/): instead of crafting fragile prompts, you define modular
14
+ signatures and let the framework handle the messy details.
15
+
16
+ DSPy.rb is a surgical, idiomatic Ruby port of Stanford's [DSPy framework](https://github.com/stanfordnlp/dspy). While implementing
17
+ the core concepts of signatures, predictors, and optimization from the original Python library, DSPy.rb embraces Ruby
18
+ conventions and adds Ruby-specific innovations like CodeAct agents and enhanced production instrumentation.
19
+
20
+ The result? LLM applications that actually scale and don't break when you sneeze.
21
+
22
+ ## Your First DSPy Program
23
+
24
+ ```ruby
25
+ # Define a signature for sentiment classification
26
+ class Classify < DSPy::Signature
27
+ description "Classify sentiment of a given sentence."
28
+
29
+ class Sentiment < T::Enum
30
+ enums do
31
+ Positive = new('positive')
32
+ Negative = new('negative')
33
+ Neutral = new('neutral')
34
+ end
35
+ end
36
+
37
+ input do
38
+ const :sentence, String
39
+ end
40
+
41
+ output do
42
+ const :sentiment, Sentiment
43
+ const :confidence, Float
44
+ end
45
+ end
46
+
47
+ # Configure DSPy with your LLM
48
+ DSPy.configure do |c|
49
+ c.lm = DSPy::LM.new('openai/gpt-4o-mini',
50
+ api_key: ENV['OPENAI_API_KEY'],
51
+ structured_outputs: true) # Enable OpenAI's native JSON mode
52
+ end
53
+
54
+ # Create the predictor and run inference
55
+ classify = DSPy::Predict.new(Classify)
56
+ result = classify.call(sentence: "This book was super fun to read!")
57
+
58
+ puts result.sentiment # => #<Sentiment::Positive>
59
+ puts result.confidence # => 0.85
60
+ ```
61
+
62
+ ### Access to 200+ Models Across 5 Providers
63
+
64
+ DSPy.rb provides unified access to major LLM providers with provider-specific optimizations:
65
+
66
+ ```ruby
67
+ # OpenAI (GPT-4, GPT-4o, GPT-4o-mini, GPT-5, etc.)
68
+ DSPy.configure do |c|
69
+ c.lm = DSPy::LM.new('openai/gpt-4o-mini',
70
+ api_key: ENV['OPENAI_API_KEY'],
71
+ structured_outputs: true) # Native JSON mode
72
+ end
73
+
74
+ # Google Gemini (Gemini 1.5 Pro, Flash, Gemini 2.0, etc.)
75
+ DSPy.configure do |c|
76
+ c.lm = DSPy::LM.new('gemini/gemini-2.5-flash',
77
+ api_key: ENV['GEMINI_API_KEY'],
78
+ structured_outputs: true) # Native structured outputs
79
+ end
80
+
81
+ # Anthropic Claude (Claude 3.5, Claude 4, etc.)
82
+ DSPy.configure do |c|
83
+ c.lm = DSPy::LM.new('anthropic/claude-sonnet-4-5-20250929',
84
+ api_key: ENV['ANTHROPIC_API_KEY'],
85
+ structured_outputs: true) # Tool-based extraction (default)
86
+ end
87
+
88
+ # Ollama - Run any local model (Llama, Mistral, Gemma, etc.)
89
+ DSPy.configure do |c|
90
+ c.lm = DSPy::LM.new('ollama/llama3.2') # Free, runs locally, no API key needed
91
+ end
92
+
93
+ # OpenRouter - Access to 200+ models from multiple providers
94
+ DSPy.configure do |c|
95
+ c.lm = DSPy::LM.new('openrouter/deepseek/deepseek-chat-v3.1:free',
96
+ api_key: ENV['OPENROUTER_API_KEY'])
97
+ end
98
+ ```
99
+
100
+ ## What You Get
101
+
102
+ **Core Building Blocks:**
103
+ - **Signatures** - Define input/output schemas using Sorbet types with T::Enum and union type support
104
+ - **Predict** - LLM completion with structured data extraction and multimodal support
105
+ - **Chain of Thought** - Step-by-step reasoning for complex problems with automatic prompt optimization
106
+ - **ReAct** - Tool-using agents with type-safe tool definitions and error recovery
107
+ - **CodeAct** - Dynamic code execution agents for programming tasks
108
+ - **Module Composition** - Combine multiple LLM calls into production-ready workflows
109
+
110
+ **Optimization & Evaluation:**
111
+ - **Prompt Objects** - Manipulate prompts as first-class objects instead of strings
112
+ - **Typed Examples** - Type-safe training data with automatic validation
113
+ - **Evaluation Framework** - Advanced metrics beyond simple accuracy with error-resilient pipelines
114
+ - **MIPROv2 Optimization** - Advanced Bayesian optimization with Gaussian Processes, multiple optimization strategies, auto-config presets, and storage persistence
115
+
116
+ **Production Features:**
117
+ - **Reliable JSON Extraction** - Native structured outputs for OpenAI and Gemini, Anthropic tool-based extraction, and automatic strategy selection with fallback
118
+ - **Type-Safe Configuration** - Strategy enums with automatic provider optimization (Strict/Compatible modes)
119
+ - **Smart Retry Logic** - Progressive fallback with exponential backoff for handling transient failures
120
+ - **Zero-Config Langfuse Integration** - Set env vars and get automatic OpenTelemetry traces in Langfuse
121
+ - **Performance Caching** - Schema and capability caching for faster repeated operations
122
+ - **File-based Storage** - Optimization result persistence with versioning
123
+ - **Structured Logging** - JSON and key=value formats with span tracking
124
+
125
+ **Developer Experience:**
126
+ - LLM provider support using official Ruby clients:
127
+ - [OpenAI Ruby](https://github.com/openai/openai-ruby) with vision model support
128
+ - [Anthropic Ruby SDK](https://github.com/anthropics/anthropic-sdk-ruby) with multimodal capabilities
129
+ - [Google Gemini API](https://ai.google.dev/) with native structured outputs
130
+ - [Ollama](https://ollama.com/) via OpenAI compatibility layer for local models
131
+ - **Multimodal Support** - Complete image analysis with DSPy::Image, type-safe bounding boxes, vision-capable models
132
+ - Runtime type checking with [Sorbet](https://sorbet.org/) including T::Enum and union types
133
+ - Type-safe tool definitions for ReAct agents
134
+ - Comprehensive instrumentation and observability
135
+
136
+ ## Development Status
137
+
138
+ DSPy.rb is actively developed and approaching stability. The core framework is production-ready with
139
+ comprehensive documentation, but I'm battle-testing features through the 0.x series before committing
140
+ to a stable v1.0 API.
141
+
142
+ Real-world usage feedback is invaluable - if you encounter issues or have suggestions, please open a GitHub issue!
143
+
144
+ ## Documentation
145
+
146
+ 📖 **[Complete Documentation Website](https://vicentereig.github.io/dspy.rb/)**
147
+
148
+ ### LLM-Friendly Documentation
149
+
150
+ For LLMs and AI assistants working with DSPy.rb:
151
+ - **[llms.txt](https://vicentereig.github.io/dspy.rb/llms.txt)** - Concise reference optimized for LLMs
152
+ - **[llms-full.txt](https://vicentereig.github.io/dspy.rb/llms-full.txt)** - Comprehensive API documentation
153
+
154
+ ### Getting Started
155
+ - **[Installation & Setup](docs/src/getting-started/installation.md)** - Detailed installation and configuration
156
+ - **[Quick Start Guide](docs/src/getting-started/quick-start.md)** - Your first DSPy programs
157
+ - **[Core Concepts](docs/src/getting-started/core-concepts.md)** - Understanding signatures, predictors, and modules
158
+
159
+ ### Core Features
160
+ - **[Signatures & Types](docs/src/core-concepts/signatures.md)** - Define typed interfaces for LLM operations
161
+ - **[Predictors](docs/src/core-concepts/predictors.md)** - Predict, ChainOfThought, ReAct, and more
162
+ - **[Modules & Pipelines](docs/src/core-concepts/modules.md)** - Compose complex multi-stage workflows
163
+ - **[Multimodal Support](docs/src/core-concepts/multimodal.md)** - Image analysis with vision-capable models
164
+ - **[Examples & Validation](docs/src/core-concepts/examples.md)** - Type-safe training data
165
+
166
+ ### Optimization
167
+ - **[Evaluation Framework](docs/src/optimization/evaluation.md)** - Advanced metrics beyond simple accuracy
168
+ - **[Prompt Optimization](docs/src/optimization/prompt-optimization.md)** - Manipulate prompts as objects
169
+ - **[MIPROv2 Optimizer](docs/src/optimization/miprov2.md)** - Advanced Bayesian optimization with Gaussian Processes
170
+ - **[GEPA Optimizer](docs/src/optimization/gepa.md)** *(beta)* - Reflective mutation with optional reflection LMs
171
+
172
+ ### Production Features
173
+ - **[Storage System](docs/src/production/storage.md)** - Persistence and optimization result storage
174
+ - **[Observability](docs/src/production/observability.md)** - Zero-config Langfuse integration with a dedicated export worker that never blocks your LLMs
175
+
176
+ ### Advanced Usage
177
+ - **[Complex Types](docs/src/advanced/complex-types.md)** - Sorbet type integration with automatic coercion for structs, enums, and arrays
178
+ - **[Manual Pipelines](docs/src/advanced/pipelines.md)** - Manual module composition patterns
179
+ - **[RAG Patterns](docs/src/advanced/rag.md)** - Manual RAG implementation with external services
180
+ - **[Custom Metrics](docs/src/advanced/custom-metrics.md)** - Proc-based evaluation logic
181
+
182
+ ## Quick Start
183
+
184
+ ### Installation
185
+
186
+ Add to your Gemfile:
187
+
188
+ ```ruby
189
+ gem 'dspy'
190
+ ```
191
+
192
+ Then run:
193
+
194
+ ```bash
195
+ bundle install
196
+ ```
197
+
198
+ ## Recent Achievements
199
+
200
+ DSPy.rb has rapidly evolved from experimental to production-ready:
201
+
202
+ ### Foundation
203
+ - ✅ **JSON Parsing Reliability** - Native OpenAI structured outputs, strategy selection, retry logic
204
+ - ✅ **Type-Safe Strategy Configuration** - Provider-optimized automatic strategy selection
205
+ - ✅ **Core Module System** - Predict, ChainOfThought, ReAct, CodeAct with type safety
206
+ - ✅ **Production Observability** - OpenTelemetry, New Relic, and Langfuse integration
207
+ - ✅ **Advanced Optimization** - MIPROv2 with Bayesian optimization, Gaussian Processes, and multiple strategies
208
+
209
+ ### Recent Advances
210
+ - ✅ **Enhanced Langfuse Integration (v0.25.0)** - Comprehensive OpenTelemetry span reporting with proper input/output, hierarchical nesting, accurate timing, and observation types
211
+ - ✅ **Comprehensive Multimodal Framework** - Complete image analysis with `DSPy::Image`, type-safe bounding boxes, vision model integration
212
+ - ✅ **Advanced Type System** - `T::Enum` integration, union types for agentic workflows, complex type coercion
213
+ - ✅ **Production-Ready Evaluation** - Multi-factor metrics beyond accuracy, error-resilient evaluation pipelines
214
+ - ✅ **Documentation Ecosystem** - `llms.txt` for AI assistants, ADRs, blog articles, comprehensive examples
215
+ - ✅ **API Maturation** - Simplified idiomatic patterns, better error handling, production-proven designs
216
+
217
+ ## Roadmap - Production Battle-Testing Toward v1.0
218
+
219
+ DSPy.rb has transitioned from **feature building** to **production validation**. The core framework is
220
+ feature-complete and stable - now I'm focusing on real-world usage patterns, performance optimization,
221
+ and ecosystem integration.
222
+
223
+ **Current Focus Areas:**
224
+
225
+ ### Production Readiness
226
+ - 🚧 **Production Patterns** - Real-world usage validation and performance optimization
227
+ - 🚧 **Ruby Ecosystem Integration** - Rails integration, Sidekiq compatibility, deployment patterns
228
+ - 🚧 **Scale Testing** - High-volume usage, memory management, connection pooling
229
+ - 🚧 **Error Recovery** - Robust failure handling patterns for production environments
230
+
231
+ ### Ecosystem Expansion
232
+ - 🚧 **Model Context Protocol (MCP)** - Integration with MCP ecosystem
233
+ - 🚧 **Additional Provider Support** - Azure OpenAI, local models beyond Ollama
234
+ - 🚧 **Tool Ecosystem** - Expanded tool integrations for ReAct agents
235
+
236
+ ### Community & Adoption
237
+ - 🚧 **Community Examples** - Real-world applications and case studies
238
+ - 🚧 **Contributor Experience** - Making it easier to contribute and extend
239
+ - 🚧 **Performance Benchmarks** - Comparative analysis vs other frameworks
240
+
241
+ **v1.0 Philosophy:**
242
+ v1.0 will be released after extensive production battle-testing, not after checking off features.
243
+ The API is already stable - v1.0 represents confidence in production reliability backed by real-world validation.
244
+
245
+ ## License
246
+
247
+ This project is licensed under the MIT License.
data/lib/dspy/datasets/ade.rb ADDED
@@ -0,0 +1,26 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ module ADE
6
+ extend self
7
+
8
+ DATASET_ID = 'ade-benchmark-corpus/ade_corpus_v2'
9
+
10
+ def examples(split: 'train', limit: 200, offset: 0, cache_dir: nil)
11
+ dataset = DSPy::Datasets.fetch(DATASET_ID, split: split, cache_dir: cache_dir)
12
+ dataset.rows(limit: limit, offset: offset).map do |row|
13
+ {
14
+ 'text' => row.fetch('text', '').to_s,
15
+ 'label' => row.fetch('label', 0).to_i
16
+ }
17
+ end
18
+ end
19
+
20
+ def fetch_rows(split:, limit:, offset:, cache_dir: nil)
21
+ dataset = DSPy::Datasets.fetch(DATASET_ID, split: split, cache_dir: cache_dir)
22
+ dataset.rows(limit: limit, offset: offset)
23
+ end
24
+ end
25
+ end
26
+ end
data/lib/dspy/datasets/dataset.rb ADDED
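The ADE module above reduces each raw corpus row to a two-key hash with defensive defaults and type coercion. A minimal standalone sketch of that normalization (the `normalize_ade_row` helper name is hypothetical):

```ruby
# Sketch of the per-row normalization ADE.examples applies: missing keys
# fall back to defaults, and values are coerced to String / Integer.
def normalize_ade_row(row)
  {
    'text'  => row.fetch('text', '').to_s,
    'label' => row.fetch('label', 0).to_i
  }
end

raw = { 'text' => 'Patient developed a rash after drug X.', 'label' => '1' }
puts normalize_ade_row(raw).inspect
# => {"text"=>"Patient developed a rash after drug X.", "label"=>1}
```

Because `fetch` supplies defaults, even an empty row yields a well-formed `{'text' => '', 'label' => 0}` example.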
@@ -0,0 +1,45 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ class Dataset
6
+ include Enumerable
7
+
8
+ attr_reader :info, :split
9
+
10
+ def initialize(info:, split:, loader:)
11
+ @info = info
12
+ @split = split
13
+ @loader = loader
14
+ end
15
+
16
+ def each
17
+ return enum_for(:each) unless block_given?
18
+
19
+ @loader.each_row do |row|
20
+ yield row
21
+ end
22
+ end
23
+
24
+ def rows(limit: nil, offset: 0)
25
+ enumerator = each
26
+ enumerator = enumerator.drop(offset) if offset.positive?
27
+ limit ? enumerator.take(limit) : enumerator.to_a
28
+ end
29
+
30
+ def size
31
+ @loader.row_count
32
+ end
33
+
34
+ alias count size
35
+
36
+ def features
37
+ info.features
38
+ end
39
+
40
+ def metadata
41
+ info.metadata
42
+ end
43
+ end
44
+ end
45
+ end
data/lib/dspy/datasets/errors.rb ADDED
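`Dataset#rows` composes Enumerable's `drop`/`take`: skip `offset` rows, then keep at most `limit` rows, or everything that remains when `limit` is nil. The same slicing semantics in isolation (the `slice_rows` helper name is hypothetical):

```ruby
# Same slicing semantics as Dataset#rows, applied to any enumerator.
def slice_rows(enumerator, limit: nil, offset: 0)
  enumerator = enumerator.drop(offset) if offset.positive?
  limit ? enumerator.take(limit) : enumerator.to_a
end

rows = (1..10).each
puts slice_rows(rows, limit: 3, offset: 5).inspect
# => [6, 7, 8]
```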
@@ -0,0 +1,10 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ class DatasetError < StandardError; end
6
+ class DatasetNotFoundError < DatasetError; end
7
+ class InvalidSplitError < DatasetError; end
8
+ class DownloadError < DatasetError; end
9
+ end
10
+ end
data/lib/dspy/datasets/hugging_face/api.rb ADDED
@@ -0,0 +1,236 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+ require 'net/http'
5
+ require 'uri'
6
+ require 'time'
+ require 'sorbet-runtime'
7
+
8
+ module DSPy
9
+ module Datasets
10
+ module HuggingFace
11
+ class APIError < StandardError; end
12
+
13
+ class DatasetSummary < T::Struct
14
+ const :id, String
15
+ const :author, T.nilable(String)
16
+ const :disabled, T::Boolean
17
+ const :gated, T::Boolean
18
+ const :private, T::Boolean
19
+ const :likes, T.nilable(Integer)
20
+ const :downloads, T.nilable(Integer)
21
+ const :tags, T::Array[String]
22
+ const :sha, T.nilable(String)
23
+ const :last_modified, T.nilable(Time)
24
+ const :description, T.nilable(String)
25
+ end
26
+
27
+ class Sibling < T::Struct
28
+ const :rfilename, String
29
+ const :size, T.nilable(Integer)
30
+ end
31
+
32
+ class DatasetDetails < T::Struct
33
+ const :summary, DatasetSummary
34
+ const :card_data, T.nilable(T::Hash[String, T.untyped])
35
+ const :siblings, T::Array[Sibling]
36
+ const :configs, T::Array[T::Hash[String, T.untyped]]
37
+ end
38
+
39
+ class ParquetListing < T::Struct
40
+ const :files, T::Hash[String, T::Hash[String, T::Array[String]]]
41
+ end
42
+
43
+ class Tag < T::Struct
44
+ const :id, String
45
+ const :label, String
46
+ const :type, String
47
+ end
48
+
49
+ class TagsByType < T::Struct
50
+ const :tags, T::Hash[String, T::Array[Tag]]
51
+ end
52
+
53
+ class ListParams < T::Struct
54
+ const :search, T.nilable(String)
55
+ const :author, T.nilable(String)
56
+ const :filter, T.nilable(T::Array[String])
57
+ const :sort, T.nilable(String)
58
+ const :direction, T.nilable(Integer)
59
+ const :limit, T.nilable(Integer)
60
+ const :offset, T.nilable(Integer)
61
+ const :full, T.nilable(T::Boolean)
62
+ end
63
+
64
+ class Client
65
+ extend T::Sig
66
+
67
+ BASE_URL = 'https://huggingface.co'
68
+ DEFAULT_TIMEOUT = 15
69
+
70
+ sig { params(base_url: String, timeout: Integer).void }
71
+ def initialize(base_url: BASE_URL, timeout: DEFAULT_TIMEOUT)
72
+ @base_url = base_url
73
+ @timeout = timeout
74
+ end
75
+
76
+ sig { params(params: ListParams).returns(T::Array[DatasetSummary]) }
77
+ def list_datasets(params = ListParams.new)
78
+ query = build_list_query(params)
79
+ payload = get('/api/datasets', query)
80
+ unless payload.is_a?(Array)
81
+ raise APIError, 'Unexpected response when listing datasets'
82
+ end
83
+
84
+ payload.map { |entry| parse_dataset_summary(entry) }
85
+ end
86
+
87
+ sig { params(repo_id: String, full: T.nilable(T::Boolean), revision: T.nilable(String)).returns(DatasetDetails) }
88
+ def dataset(repo_id, full: nil, revision: nil)
89
+ path = if revision
90
+ "/api/datasets/#{repo_id}/revision/#{revision}"
91
+ else
92
+ "/api/datasets/#{repo_id}"
93
+ end
94
+ query = {}
95
+ query[:full] = full ? 1 : 0 unless full.nil?
96
+ payload = get(path, query)
97
+ DatasetDetails.new(
98
+ summary: parse_dataset_summary(payload),
99
+ card_data: payload['cardData'],
100
+ siblings: Array(payload['siblings']).map { |item| Sibling.new(rfilename: item['rfilename'].to_s, size: item['size']) },
101
+ configs: Array(payload['configs']).map { |config| config }
102
+ )
103
+ end
104
+
105
+ sig { params(repo_id: String).returns(ParquetListing) }
106
+ def dataset_parquet(repo_id)
107
+ payload = get("/api/datasets/#{repo_id}/parquet")
108
+ unless payload.is_a?(Hash)
109
+ raise APIError, 'Unexpected parquet listing response'
110
+ end
111
+
112
+ files = payload.each_with_object({}) do |(config, splits), acc|
113
+ acc[config] = splits.each_with_object({}) do |(split, urls), split_acc|
114
+ split_acc[split] = Array(urls).map(&:to_s)
115
+ end
116
+ end
117
+
118
+ ParquetListing.new(files: files)
119
+ end
120
+
121
+ sig { returns(TagsByType) }
122
+ def dataset_tags_by_type
123
+ payload = get('/api/datasets-tags-by-type')
124
+ unless payload.is_a?(Hash)
125
+ raise APIError, 'Unexpected dataset tags response'
126
+ end
127
+
128
+ tags = payload.each_with_object({}) do |(category, items), acc|
129
+ acc[category] = Array(items).map do |item|
130
+ Tag.new(
131
+ id: item.fetch('id').to_s,
132
+ label: item.fetch('label').to_s,
133
+ type: item.fetch('type').to_s
134
+ )
135
+ end
136
+ end
137
+
138
+ TagsByType.new(tags: tags)
139
+ end
140
+
141
+ private
142
+
143
+ sig { params(path: String, params: T::Hash[Symbol, T.untyped]).returns(T.untyped) }
144
+ def get(path, params = {})
145
+ uri = build_uri(path, params)
146
+ request = Net::HTTP::Get.new(uri)
147
+ request['Accept'] = 'application/json'
148
+
149
+ response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https', read_timeout: @timeout, open_timeout: @timeout) do |http|
150
+ http.request(request)
151
+ end
152
+
153
+ unless response.is_a?(Net::HTTPSuccess)
154
+ raise APIError, "Hugging Face API request failed: #{response.code} #{response.message}"
155
+ end
156
+
157
+ JSON.parse(response.body)
158
+ rescue JSON::ParserError => e
159
+ raise APIError, "Failed to parse Hugging Face API response: #{e.message}"
160
+ end
161
+
162
+ sig { params(path: String, params: T::Hash[Symbol, T.untyped]).returns(URI::HTTPS) }
163
+ def build_uri(path, params)
164
+ uri = URI.join(@base_url, path)
165
+ unless params.empty?
166
+ # Expand repeated filters if present
167
+ query_pairs = params.each_with_object([]) do |(key, value), acc|
168
+ next if value.nil?
169
+
170
+ if key == :filter && value.is_a?(Array)
171
+ value.each { |filter| acc << ["filter", filter.to_s] }
172
+ else
173
+ acc << [key.to_s, format_query_value(value)]
174
+ end
175
+ end
176
+ uri.query = URI.encode_www_form(query_pairs)
177
+ end
178
+ uri
179
+ end
180
+
181
+ sig { params(value: T.untyped).returns(String) }
182
+ def format_query_value(value)
183
+ case value
184
+ when TrueClass, FalseClass
185
+ value ? '1' : '0'
186
+ else
187
+ value.to_s
188
+ end
189
+ end
190
+
191
+ sig { params(payload: T::Hash[String, T.untyped]).returns(DatasetSummary) }
192
+ def parse_dataset_summary(payload)
193
+ DatasetSummary.new(
194
+ id: payload.fetch('id').to_s,
195
+ author: payload['author'],
196
+ disabled: payload.fetch('disabled', false),
197
+ gated: payload.fetch('gated', false),
198
+ private: payload.fetch('private', false),
199
+ likes: payload['likes'],
200
+ downloads: payload['downloads'],
201
+ tags: Array(payload['tags']).map(&:to_s),
202
+ sha: payload['sha'],
203
+ last_modified: parse_time(payload['lastModified']),
204
+ description: payload['description']
205
+ )
206
+ end
207
+
208
+ sig { params(params: ListParams).returns(T::Hash[Symbol, T.untyped]) }
209
+ def build_list_query(params)
210
+ query = {
211
+ search: params.search,
212
+ author: params.author,
213
+ sort: params.sort,
214
+ direction: params.direction,
215
+ limit: params.limit,
216
+ offset: params.offset,
217
+ full: params.full
218
+ }.reject { |_, value| value.nil? }
219
+
220
+ query[:filter] = params.filter if params.filter
221
+
222
+ query
223
+ end
224
+
225
+ sig { params(value: T.untyped).returns(T.nilable(Time)) }
226
+ def parse_time(value)
227
+ return nil unless value
228
+
229
+ Time.parse(value.to_s)
230
+ rescue ArgumentError
231
+ nil
232
+ end
233
+ end
234
+ end
235
+ end
236
+ end
data/lib/dspy/datasets/info.rb ADDED
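The query handling in `build_uri` and `format_query_value` can be illustrated in isolation: nils are skipped, booleans become `'1'`/`'0'`, and an array of filters is expanded into repeated `filter` pairs. A sketch under those same rules (the `encode_query` helper name is hypothetical):

```ruby
require 'uri'

# Mirrors the client's query building: skip nils, encode booleans as
# '1'/'0', and expand a :filter array into repeated `filter` pairs.
def encode_query(params)
  pairs = params.each_with_object([]) do |(key, value), acc|
    next if value.nil?

    if key == :filter && value.is_a?(Array)
      value.each { |f| acc << ['filter', f.to_s] }
    else
      encoded = case value
                when true then '1'
                when false then '0'
                else value.to_s
                end
      acc << [key.to_s, encoded]
    end
  end
  URI.encode_www_form(pairs)
end

puts encode_query(search: 'ade', full: true, filter: %w[a b], author: nil)
# => "search=ade&full=1&filter=a&filter=b"
```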
@@ -0,0 +1,24 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ class DatasetInfo
6
+ attr_reader :id, :name, :provider, :splits, :features, :loader, :loader_options, :metadata
7
+
8
+ def initialize(id:, name:, provider:, splits:, features:, loader:, loader_options:, metadata: {})
9
+ @id = id
10
+ @name = name
11
+ @provider = provider
12
+ @splits = Array(splits).map(&:to_s).freeze
13
+ @features = features.freeze
14
+ @loader = loader
15
+ @loader_options = loader_options.freeze
16
+ @metadata = metadata.freeze
17
+ end
18
+
19
+ def default_split
20
+ @splits.first
21
+ end
22
+ end
23
+ end
24
+ end
data/lib/dspy/datasets/loaders/huggingface_parquet.rb ADDED
@@ -0,0 +1,134 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+ require 'net/http'
5
+ require 'uri'
6
+ require 'fileutils'
7
+ require 'parquet'
8
+
9
+ module DSPy
10
+ module Datasets
11
+ module Loaders
12
+ class HuggingFaceParquet
13
+ BASE_URL = 'https://datasets-server.huggingface.co'
14
+
15
+ def initialize(info, split:, cache_dir:)
16
+ @info = info
17
+ @split = split
18
+ @cache_root = determine_cache_root(cache_dir)
19
+ end
20
+
21
+ def each_row
22
+ return enum_for(:each_row) unless block_given?
23
+
24
+ parquet_files.each do |file|
25
+ table = load_table(file)
26
+ field_names = table.schema.fields.map(&:name)
27
+ table.raw_records.each do |values|
28
+ yield normalized_row(field_names, values)
29
+ end
30
+ end
31
+ end
32
+
33
+ def row_count
34
+ @row_count ||= parquet_files.sum do |file|
35
+ load_table(file).n_rows
36
+ end
37
+ end
38
+
39
+ private
40
+
41
+ attr_reader :info, :split, :cache_root
42
+
43
+ def normalized_row(field_names, values)
44
+ field_names.each_with_index.each_with_object({}) do |(name, index), row|
45
+ row[name] = values[index]
46
+ end
47
+ end
48
+
49
+ def load_table(file)
50
+ Arrow::Table.load(ensure_cached(file))
51
+ end
52
+
53
+ def parquet_files
54
+ @parquet_files ||= begin
55
+ uri = URI("#{BASE_URL}/parquet")
56
+ params = {
57
+ dataset: info.loader_options.fetch(:dataset),
58
+ config: info.loader_options.fetch(:config),
59
+ split: split
60
+ }
61
+ uri.query = URI.encode_www_form(params)
62
+
63
+ response = http_get(uri)
64
+ unless response.is_a?(Net::HTTPSuccess)
65
+ raise DatasetError, "Failed to fetch parquet manifest: #{response.code}"
66
+ end
67
+
68
+ body = JSON.parse(response.body)
69
+ files = body.fetch('parquet_files', [])
70
+ raise DatasetError, "No parquet files available for #{info.id} (#{split})" if files.empty?
71
+
72
+ files
73
+ end
74
+ end
75
+
76
+ def ensure_cached(file)
77
+ FileUtils.mkdir_p(cache_dir)
78
+ path = File.join(cache_dir, file.fetch('filename'))
79
+ return path if File.exist?(path) && File.size?(path)
80
+
81
+ download_file(file.fetch('url'), path)
82
+ path
83
+ end
84
+
85
+ def cache_dir
86
+ @cache_dir ||= File.join(cache_root, split)
87
+ end
88
+
89
+ def determine_cache_root(cache_dir)
90
+ base = if cache_dir
91
+ File.expand_path(cache_dir)
92
+ elsif ENV['DSPY_DATASETS_CACHE']
93
+ File.expand_path(ENV['DSPY_DATASETS_CACHE'])
94
+ else
95
+ File.expand_path('../../../../tmp/dspy_datasets', __dir__)
96
+ end
97
+ File.join(base, sanitized_dataset_id)
98
+ end
99
+
100
+ def sanitized_dataset_id
101
+ info.id.gsub(/[^\w.-]+/, '_')
102
+ end
103
+
104
+ def http_get(uri)
105
+ Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
106
+ request = Net::HTTP::Get.new(uri)
107
+ http.request(request)
108
+ end
109
+ end
110
+
111
+ def download_file(url, destination)
112
+ uri = URI(url)
113
+ Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
114
+ request = Net::HTTP::Get.new(uri)
115
+ http.request(request) do |response|
116
+ unless response.is_a?(Net::HTTPSuccess)
117
+ raise DownloadError, "Failed to download parquet file: #{response.code}"
118
+ end
119
+
120
+ File.open(destination, 'wb') do |file|
121
+ response.read_body do |chunk|
122
+ file.write(chunk)
123
+ end
124
+ end
125
+ end
126
+ end
127
+ rescue
128
+ File.delete(destination) if File.exist?(destination)
129
+ raise
130
+ end
131
+ end
132
+ end
133
+ end
134
+ end
data/lib/dspy/datasets/loaders.rb ADDED
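`determine_cache_root` and `sanitized_dataset_id` together produce the loader's on-disk layout: the repo id is sanitized so its slash can't escape the cache root, then files are namespaced per split. A minimal sketch (the `cache_path` helper name is hypothetical):

```ruby
# On-disk cache layout used by the loader: sanitize the dataset id
# (anything outside [A-Za-z0-9_.-] becomes '_'), then nest by split.
def cache_path(cache_root, dataset_id, split)
  File.join(cache_root, dataset_id.gsub(/[^\w.-]+/, '_'), split)
end

puts cache_path('/tmp/dspy_datasets', 'ade-benchmark-corpus/ade_corpus_v2', 'train')
# => "/tmp/dspy_datasets/ade-benchmark-corpus_ade_corpus_v2/train"
```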
@@ -0,0 +1,19 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ module Loaders
6
+ extend self
7
+
8
+ def build(info, split:, cache_dir:)
9
+ case info.loader
10
+ when :huggingface_parquet
11
+ require_relative 'loaders/huggingface_parquet'
12
+ HuggingFaceParquet.new(info, split: split, cache_dir: cache_dir)
13
+ else
14
+ raise DatasetError, "Unsupported loader: #{info.loader}"
15
+ end
16
+ end
17
+ end
18
+ end
19
+ end
data/lib/dspy/datasets/manifest.rb ADDED
@@ -0,0 +1,40 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'info'
4
+
5
+ module DSPy
6
+ module Datasets
7
+ module Manifest
8
+ extend self
9
+
10
+ def all
11
+ @all ||= [
12
+ DatasetInfo.new(
13
+ id: 'ade-benchmark-corpus/ade_corpus_v2',
14
+ name: 'ADE Corpus V2',
15
+ provider: 'huggingface',
16
+ splits: %w[train],
17
+ features: {
18
+ 'text' => { 'type' => 'string' },
19
+ 'label' => { 'type' => 'int64', 'description' => '0: Not-Related, 1: Related' }
20
+ },
21
+ loader: :huggingface_parquet,
22
+ loader_options: {
23
+ dataset: 'ade-benchmark-corpus/ade_corpus_v2',
24
+ config: 'Ade_corpus_v2_classification'
25
+ },
26
+ metadata: {
27
+ description: 'Adverse drug event classification corpus used in ADE optimization examples.',
28
+ homepage: 'https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2',
29
+ approx_row_count: 23516
30
+ }
31
+ )
32
+ ].freeze
33
+ end
34
+
35
+ def by_id(id)
36
+ all.detect { |dataset| dataset.id == id }
37
+ end
38
+ end
39
+ end
40
+ end
data/lib/dspy/datasets/version.rb ADDED
@@ -0,0 +1,7 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DSPy
4
+ module Datasets
5
+ VERSION = DSPy::VERSION
6
+ end
7
+ end
data/lib/dspy/datasets.rb ADDED
@@ -0,0 +1,53 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'datasets/version'
4
+ require_relative 'datasets/errors'
5
+ require_relative 'datasets/dataset'
6
+ require_relative 'datasets/manifest'
7
+ require_relative 'datasets/loaders'
8
+ require_relative 'datasets/hugging_face/api'
9
+ require_relative 'datasets/ade'
10
+
11
+ module DSPy
12
+ module Datasets
13
+ PaginatedList = Struct.new(:items, :page, :per_page, :total_count, keyword_init: true) do
14
+ def total_pages
15
+ return 0 if per_page.zero?
16
+
17
+ (total_count.to_f / per_page).ceil
18
+ end
19
+ end
20
+
21
+ module_function
22
+
23
+ def list(page: 1, per_page: 20)
24
+ page = [page.to_i, 1].max
25
+ per_page = [per_page.to_i, 1].max
26
+
27
+ all = Manifest.all
28
+ offset = (page - 1) * per_page
29
+ slice = offset >= all.length ? [] : all.slice(offset, per_page) || []
30
+
31
+ PaginatedList.new(
32
+ items: slice,
33
+ page: page,
34
+ per_page: per_page,
35
+ total_count: all.length
36
+ )
37
+ end
38
+
39
+ def fetch(dataset_id, split: nil, cache_dir: nil)
40
+ info = Manifest.by_id(dataset_id)
41
+ raise DatasetNotFoundError, "Unknown dataset: #{dataset_id}" unless info
42
+
43
+ split ||= info.default_split
44
+ split = split.to_s
45
+ unless info.splits.include?(split)
46
+ raise InvalidSplitError, "Invalid split '#{split}' for dataset #{dataset_id} (available: #{info.splits.join(', ')})"
47
+ end
48
+
49
+ loader = Loaders.build(info, split: split, cache_dir: cache_dir)
50
+ Dataset.new(info: info, split: split, loader: loader)
51
+ end
52
+ end
53
+ end
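`DSPy::Datasets.list` clamps `page`/`per_page` to at least 1, slices the manifest, and `PaginatedList#total_pages` is a ceiling division. The arithmetic in a self-contained sketch (the `paginate` helper name is hypothetical):

```ruby
# Pagination arithmetic mirroring DSPy::Datasets.list: clamp inputs,
# slice by offset, and derive total pages via ceiling division.
def paginate(items, page: 1, per_page: 20)
  page = [page.to_i, 1].max
  per_page = [per_page.to_i, 1].max
  offset = (page - 1) * per_page
  slice = offset >= items.length ? [] : items.slice(offset, per_page) || []
  {
    items: slice,
    page: page,
    total_pages: (items.length.to_f / per_page).ceil
  }
end

result = paginate((1..45).to_a, page: 3, per_page: 20)
puts result[:items].inspect  # => [41, 42, 43, 44, 45]
puts result[:total_pages]    # => 3
```

Clamping means a caller asking for `page: 0` or `per_page: -5` still gets a valid first page rather than an error or a division by zero.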
metadata ADDED
@@ -0,0 +1,82 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: dspy-datasets
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.29.1
5
+ platform: ruby
6
+ authors:
7
+ - Vicente Reig Rincón de Arellano
8
+ bindir: bin
9
+ cert_chain: []
10
+ date: 2025-10-20 00:00:00.000000000 Z
11
+ dependencies:
12
+ - !ruby/object:Gem::Dependency
13
+ name: dspy
14
+ requirement: !ruby/object:Gem::Requirement
15
+ requirements:
16
+ - - '='
17
+ - !ruby/object:Gem::Version
18
+ version: 0.29.1
19
+ type: :runtime
20
+ prerelease: false
21
+ version_requirements: !ruby/object:Gem::Requirement
22
+ requirements:
23
+ - - '='
24
+ - !ruby/object:Gem::Version
25
+ version: 0.29.1
26
+ - !ruby/object:Gem::Dependency
27
+ name: red-parquet
28
+ requirement: !ruby/object:Gem::Requirement
29
+ requirements:
30
+ - - "~>"
31
+ - !ruby/object:Gem::Version
32
+ version: '21.0'
33
+ type: :runtime
34
+ prerelease: false
35
+ version_requirements: !ruby/object:Gem::Requirement
36
+ requirements:
37
+ - - "~>"
38
+ - !ruby/object:Gem::Version
39
+ version: '21.0'
40
+ description: DSPy datasets provide prebuilt loaders, caching, and schema metadata
41
+ for benchmark corpora used in DSPy examples and teleprompters.
42
+ email:
43
+ - hey@vicente.services
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - LICENSE
49
+ - README.md
50
+ - lib/dspy/datasets.rb
51
+ - lib/dspy/datasets/ade.rb
52
+ - lib/dspy/datasets/dataset.rb
53
+ - lib/dspy/datasets/errors.rb
54
+ - lib/dspy/datasets/hugging_face/api.rb
55
+ - lib/dspy/datasets/info.rb
56
+ - lib/dspy/datasets/loaders.rb
57
+ - lib/dspy/datasets/loaders/huggingface_parquet.rb
58
+ - lib/dspy/datasets/manifest.rb
59
+ - lib/dspy/datasets/version.rb
60
+ homepage: https://github.com/vicentereig/dspy.rb
61
+ licenses:
62
+ - MIT
63
+ metadata:
64
+ github_repo: git@github.com:vicentereig/dspy.rb
65
+ rdoc_options: []
66
+ require_paths:
67
+ - lib
68
+ required_ruby_version: !ruby/object:Gem::Requirement
69
+ requirements:
70
+ - - ">="
71
+ - !ruby/object:Gem::Version
72
+ version: 3.3.0
73
+ required_rubygems_version: !ruby/object:Gem::Requirement
74
+ requirements:
75
+ - - ">="
76
+ - !ruby/object:Gem::Version
77
+ version: '0'
78
+ requirements: []
79
+ rubygems_version: 3.6.5
80
+ specification_version: 4
81
+ summary: Curated datasets and loaders for DSPy.rb.
82
+ test_files: []