npm - @every-env/compound-plugin - Versions diffs - 0.3.0 → 0.5.1 - Mend

@every-env/compound-plugin 0.3.0 → 0.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (50)

package/{plugins/compound-engineering → .claude}/commands/release-docs.md +0 -1
package/.claude-plugin/marketplace.json +2 -2
package/.github/workflows/ci.yml +1 -1
package/.github/workflows/deploy-docs.yml +3 -3
package/.github/workflows/publish.yml +37 -0
package/README.md +12 -3
package/docs/index.html +13 -13
package/docs/pages/changelog.html +39 -0
package/docs/plans/2026-02-08-feat-convert-local-md-settings-for-opencode-codex-plan.md +143 -0
package/docs/plans/2026-02-08-feat-simplify-plugin-settings-plan.md +195 -0
package/docs/plans/2026-02-09-refactor-dspy-ruby-skill-update-plan.md +104 -0
package/docs/plans/2026-02-12-feat-add-cursor-cli-target-provider-plan.md +306 -0
package/docs/specs/cursor.md +85 -0
package/package.json +1 -1
package/plugins/compound-engineering/.claude-plugin/plugin.json +2 -2
package/plugins/compound-engineering/CHANGELOG.md +38 -0
package/plugins/compound-engineering/README.md +5 -3
package/plugins/compound-engineering/commands/workflows/brainstorm.md +6 -1
package/plugins/compound-engineering/commands/workflows/compound.md +1 -0
package/plugins/compound-engineering/commands/workflows/review.md +23 -21
package/plugins/compound-engineering/commands/workflows/work.md +29 -15
package/plugins/compound-engineering/skills/dspy-ruby/SKILL.md +539 -396
package/plugins/compound-engineering/skills/dspy-ruby/assets/config-template.rb +159 -331
package/plugins/compound-engineering/skills/dspy-ruby/assets/module-template.rb +210 -236
package/plugins/compound-engineering/skills/dspy-ruby/assets/signature-template.rb +173 -95
package/plugins/compound-engineering/skills/dspy-ruby/references/core-concepts.md +552 -143
package/plugins/compound-engineering/skills/dspy-ruby/references/observability.md +366 -0
package/plugins/compound-engineering/skills/dspy-ruby/references/optimization.md +440 -460
package/plugins/compound-engineering/skills/dspy-ruby/references/providers.md +305 -225
package/plugins/compound-engineering/skills/dspy-ruby/references/toolsets.md +502 -0
package/plugins/compound-engineering/skills/setup/SKILL.md +168 -0
package/src/commands/convert.ts +10 -5
package/src/commands/install.ts +18 -10
package/src/converters/claude-to-codex.ts +7 -2
package/src/converters/claude-to-cursor.ts +166 -0
package/src/converters/claude-to-droid.ts +174 -0
package/src/converters/claude-to-opencode.ts +8 -2
package/src/targets/cursor.ts +48 -0
package/src/targets/droid.ts +50 -0
package/src/targets/index.ts +18 -0
package/src/types/cursor.ts +29 -0
package/src/types/droid.ts +20 -0
package/tests/cli.test.ts +62 -0
package/tests/codex-converter.test.ts +62 -0
package/tests/converter.test.ts +61 -0
package/tests/cursor-converter.test.ts +347 -0
package/tests/cursor-writer.test.ts +137 -0
package/tests/droid-converter.test.ts +277 -0
package/tests/droid-writer.test.ts +100 -0
package/plugins/compound-engineering/commands/technical_review.md +0 -8

package/plugins/compound-engineering/skills/dspy-ruby/references/optimization.md CHANGED

@@ -1,623 +1,603 @@
-# DSPy.rb Testing, Optimization & Observability
+# DSPy.rb Optimization
-## Testing
+## MIPROv2
-DSPy.rb enables standard RSpec testing patterns for LLM logic, making your AI applications testable and maintainable.
+MIPROv2 (Multi-prompt Instruction Proposal with Retrieval Optimization) is the primary instruction tuner in DSPy.rb. It proposes new instructions and few-shot demonstrations per predictor, evaluates them on mini-batches, and retains candidates that improve the metric. It ships as a separate gem to keep the Gaussian Process dependency tree out of apps that do not need it.
-### Basic Testing Setup
+### Installation
 ```ruby
-require 'rspec'
-require 'dspy'
-RSpec.describe EmailClassifier do
-  before do
-    DSPy.configure do |c|
-      c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
-    end
-  end
-  describe '#classify' do
-    it 'classifies technical support emails correctly' do
-      classifier = EmailClassifier.new
-      result = classifier.forward(
-        email_subject: "Can't log in",
-        email_body: "I'm unable to access my account"
-      )
-      expect(result[:category]).to eq('Technical')
-      expect(result[:priority]).to be_in(['High', 'Medium', 'Low'])
-    end
-  end
-end
+# Gemfile
+gem "dspy"
+gem "dspy-miprov2"
 ```
-### Mocking LLM Responses
+Bundler auto-requires `dspy/miprov2`. No additional `require` statement is needed.
+### AutoMode presets
-Test your modules without making actual API calls:
+Use `DSPy::Teleprompt::MIPROv2::AutoMode` for preconfigured optimizers:
 ```ruby
-RSpec.describe MyModule do
-  it 'handles mock responses correctly' do
-    # Create a mock predictor that returns predetermined results
-    mock_predictor = instance_double(DSPy::Predict)
-    allow(mock_predictor).to receive(:forward).and_return({
-      category: 'Technical',
-      priority: 'High',
-      confidence: 0.95
-    })
-    # Inject mock into your module
-    module_instance = MyModule.new
-    module_instance.instance_variable_set(:@predictor, mock_predictor)
-    result = module_instance.forward(input: 'test data')
-    expect(result[:category]).to eq('Technical')
-  end
-end
+light  = DSPy::Teleprompt::MIPROv2::AutoMode.light(metric: metric)   # 6 trials, greedy
+medium = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)  # 12 trials, adaptive
+heavy  = DSPy::Teleprompt::MIPROv2::AutoMode.heavy(metric: metric)   # 18 trials, Bayesian
 ```
-### Testing Type Safety
+| Preset   | Trials | Strategy   | Use case                                            |
+|----------|--------|------------|-----------------------------------------------------|
+| `light`  | 6      | `:greedy`  | Quick wins on small datasets or during prototyping. |
+| `medium` | 12     | `:adaptive`| Balanced exploration vs. runtime for most pilots.   |
+| `heavy`  | 18     | `:bayesian`| Highest accuracy targets or multi-stage programs.   |
+### Manual configuration with dry-configurable
+`DSPy::Teleprompt::MIPROv2` includes `Dry::Configurable`. Configure at the class level (defaults for all instances) or instance level (overrides class defaults).
-Verify that signatures enforce type constraints:
+**Class-level defaults:**
 ```ruby
-RSpec.describe EmailClassificationSignature do
-  it 'validates output types' do
-    predictor = DSPy::Predict.new(EmailClassificationSignature)
-    # This should work
-    result = predictor.forward(
-      email_subject: 'Test',
-      email_body: 'Test body'
-    )
-    expect(result[:category]).to be_a(String)
+DSPy::Teleprompt::MIPROv2.configure do |config|
+  config.optimization_strategy = :bayesian
+  config.num_trials = 30
+  config.bootstrap_sets = 10
+end
+```
-    # Test that invalid types are caught
-    expect {
-      # Simulate LLM returning invalid type
-      predictor.send(:validate_output, { category: 123 })
-    }.to raise_error(DSPy::ValidationError)
-  end
+**Instance-level overrides:**
+```ruby
+optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
+optimizer.configure do |config|
+  config.num_trials = 15
+  config.num_instruction_candidates = 6
+  config.bootstrap_sets = 5
+  config.max_bootstrapped_examples = 4
+  config.max_labeled_examples = 16
+  config.optimization_strategy = :adaptive       # :greedy, :adaptive, :bayesian
+  config.early_stopping_patience = 3
+  config.init_temperature = 1.0
+  config.final_temperature = 0.1
+  config.minibatch_size = nil                     # nil = auto
+  config.auto_seed = 42
 end
 ```
-### Testing Edge Cases
+The `optimization_strategy` setting accepts symbols (`:greedy`, `:adaptive`, `:bayesian`) and coerces them internally to `DSPy::Teleprompt::OptimizationStrategy` T::Enum values.
+The old `config:` constructor parameter is removed. Passing `config:` raises `ArgumentError`.
-Always test boundary conditions and error scenarios:
+### Auto presets via configure
+Instead of `AutoMode`, set the preset through the configure block:
 ```ruby
-RSpec.describe EmailClassifier do
-  it 'handles empty emails' do
-    classifier = EmailClassifier.new
-    result = classifier.forward(
-      email_subject: '',
-      email_body: ''
-    )
-    # Define expected behavior for edge case
-    expect(result[:category]).to eq('General')
-  end
+optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
+optimizer.configure do |config|
+  config.auto_preset = DSPy::Teleprompt::AutoPreset.deserialize("medium")
+end
+```
-  it 'handles very long emails' do
-    long_body = 'word ' * 10000
-    classifier = EmailClassifier.new
+### Compile and inspect
-    expect {
-      classifier.forward(
-        email_subject: 'Test',
-        email_body: long_body
-      )
-    }.not_to raise_error
-  end
+```ruby
+program = DSPy::Predict.new(MySignature)
-  it 'handles special characters' do
-    classifier = EmailClassifier.new
-    result = classifier.forward(
-      email_subject: 'Test <script>alert("xss")</script>',
-      email_body: 'Body with émojis 🎉 and spëcial çharacters'
-    )
+result = optimizer.compile(
+  program,
+  trainset: train_examples,
+  valset: val_examples
+)
-    expect(result[:category]).to be_in(['Technical', 'Billing', 'General'])
-  end
-end
+optimized_program = result.optimized_program
+puts "Best score: #{result.best_score_value}"
 ```
-### Integration Testing
+The `result` object exposes:
+- `optimized_program` -- ready-to-use predictor with updated instruction and demos.
+- `optimization_trace[:trial_logs]` -- per-trial record of instructions, demos, and scores.
+- `metadata[:optimizer]` -- `"MIPROv2"`, useful when persisting experiments from multiple optimizers.
+### Multi-stage programs
+MIPROv2 generates dataset summaries for each predictor and proposes per-stage instructions. For a ReAct agent with `thought_generator` and `observation_processor` predictors, the optimizer handles credit assignment internally. The metric only needs to evaluate the final output.
+### Bootstrap sampling
+During the bootstrap phase MIPROv2:
+1. Generates dataset summaries from the training set.
+2. Bootstraps few-shot demonstrations by running the baseline program.
+3. Proposes candidate instructions grounded in the summaries and bootstrapped examples.
+4. Evaluates each candidate on mini-batches drawn from the validation set.
-Test complete workflows end-to-end:
+Control the bootstrap phase with `bootstrap_sets`, `max_bootstrapped_examples`, and `max_labeled_examples`.
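Annotation, not part of the diffed file: step 4 above scores each candidate on a mini-batch drawn from the validation set. The sketch below shows that idea in plain Ruby under stated assumptions. `minibatch_score`, the hash-shaped examples, and the boolean metric are hypothetical stand-ins, not the `dspy-miprov2` implementation.

```ruby
# Hypothetical stand-ins: `candidate` is any callable mapping an input to a
# prediction, and `metric` is a callable returning true/false per example.
def minibatch_score(candidate, valset, metric, batch_size:, rng: Random.new(42))
  # Draw a random subset of the validation set without replacement.
  batch = valset.sample(batch_size, random: rng)
  passed = batch.count { |example| metric.call(example, candidate.call(example[:input])) }
  # Fraction of the mini-batch that the candidate got right.
  passed.to_f / batch.size
end

valset = (1..20).map { |n| { input: n, expected: n * 2 } }
metric = ->(example, prediction) { prediction == example[:expected] }
doubler = ->(n) { n * 2 }      # always correct on this toy data
always_zero = ->(_n) { 0 }     # always wrong on this toy data

puts minibatch_score(doubler, valset, metric, batch_size: 5)
puts minibatch_score(always_zero, valset, metric, batch_size: 5)
```

A smaller `batch_size` makes each trial cheaper but the resulting score noisier, which is the trade-off the `minibatch_size` setting exposes.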
+### Bayesian optimization
+When `optimization_strategy` is `:bayesian` (or when using the `heavy` preset), MIPROv2 fits a Gaussian Process surrogate over past trial scores to select the next candidate. This replaces random search with informed exploration, reducing the number of trials needed to find high-scoring instructions.
+---
+## GEPA
+GEPA (Genetic-Pareto Reflective Prompt Evolution) is a feedback-driven optimizer. It runs the program on a small batch, collects scores and textual feedback, and asks a reflection LM to rewrite the instruction. Improved candidates are retained on a Pareto frontier.
+### Installation
 ```ruby
-RSpec.describe EmailProcessingPipeline do
-  it 'processes email through complete pipeline' do
-    pipeline = EmailProcessingPipeline.new
+# Gemfile
+gem "dspy"
+gem "dspy-gepa"
+```
-    result = pipeline.forward(
-      email_subject: 'Billing question',
-      email_body: 'How do I update my payment method?'
-    )
+The `dspy-gepa` gem depends on the `gepa` core optimizer gem automatically.
+### Metric contract
+GEPA metrics return `DSPy::Prediction` with both a numeric score and a feedback string. Do not return a plain boolean.
-    # Verify the complete pipeline output
-    expect(result[:classification]).to eq('Billing')
-    expect(result[:priority]).to eq('Medium')
-    expect(result[:suggested_response]).to include('payment')
-    expect(result[:assigned_team]).to eq('billing_support')
+```ruby
+metric = lambda do |example, prediction|
+  expected  = example.expected_values[:label]
+  predicted = prediction.label
+  score = predicted == expected ? 1.0 : 0.0
+  feedback = if score == 1.0
+    "Correct (#{expected}) for: \"#{example.input_values[:text][0..60]}\""
+  else
+    "Misclassified (expected #{expected}, got #{predicted}) for: \"#{example.input_values[:text][0..60]}\""
   end
+  DSPy::Prediction.new(score: score, feedback: feedback)
 end
 ```
-### VCR for Deterministic Tests
+Keep the score in `[0, 1]`. Always include a short feedback message explaining what happened -- GEPA hands this text to the reflection model so it can reason about failures.
+### Feedback maps
-Use VCR to record and replay API responses:
+`feedback_map` targets individual predictors inside a composite module. Each entry receives keyword arguments and returns a `DSPy::Prediction`:
 ```ruby
-require 'vcr'
+feedback_map = {
+  'self' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
+    expected  = module_inputs.expected_values[:label]
+    predicted = predictor_output.label
+    DSPy::Prediction.new(
+      score: predicted == expected ? 1.0 : 0.0,
+      feedback: "Classifier saw \"#{predictor_inputs[:text][0..80]}\" -> #{predicted} (expected #{expected})"
+    )
+  end
+}
+```
-VCR.configure do |config|
-  config.cassette_library_dir = 'spec/vcr_cassettes'
-  config.hook_into :webmock
-  config.filter_sensitive_data('<OPENAI_API_KEY>') { ENV['OPENAI_API_KEY'] }
-end
+For single-predictor programs, key the map with `'self'`. For multi-predictor chains, add entries per component so the reflection LM sees localized context at each step. Omit `feedback_map` entirely if the top-level metric already covers the basics.
-RSpec.describe EmailClassifier do
-  it 'classifies emails consistently', :vcr do
-    VCR.use_cassette('email_classification') do
-      classifier = EmailClassifier.new
-      result = classifier.forward(
-        email_subject: 'Test subject',
-        email_body: 'Test body'
-      )
-      expect(result[:category]).to eq('Technical')
-    end
-  end
-end
+### Configuring the teleprompter
+```ruby
+teleprompter = DSPy::Teleprompt::GEPA.new(
+  metric: metric,
+  reflection_lm: DSPy::ReflectionLM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']),
+  feedback_map: feedback_map,
+  config: {
+    max_metric_calls: 600,
+    minibatch_size: 6,
+    skip_perfect_score: false
+  }
+)
 ```
-## Optimization
+Key configuration knobs:
+| Knob                 | Purpose                                                                                   |
+|----------------------|-------------------------------------------------------------------------------------------|
+| `max_metric_calls`   | Hard budget on evaluation calls. Set to at least the validation set size plus a few minibatches. |
+| `minibatch_size`     | Examples per reflective replay batch. Smaller = cheaper iterations, noisier scores.       |
+| `skip_perfect_score` | Set `true` to stop early when a candidate reaches score `1.0`.                            |
-DSPy.rb provides powerful optimization capabilities to automatically improve your prompts and modules.
+### Minibatch sizing
-### MIPROv2 Optimization
+| Goal                                            | Suggested size | Rationale                                                  |
+|-------------------------------------------------|----------------|------------------------------------------------------------|
+| Explore many candidates within a tight budget   | 3--6           | Cheap iterations, more prompt variants, noisier metrics.   |
+| Stable metrics when each rollout is costly      | 8--12          | Smoother scores, fewer candidates unless budget is raised. |
+| Investigate specific failure modes              | 3--4 then 8+   | Start with breadth, increase once patterns emerge.         |
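Annotation, not part of the diffed file: the knob table asks for `max_metric_calls` of at least the validation set size plus a few minibatches. The arithmetic can be sketched as follows. `minimum_metric_calls` is a hypothetical helper, and the default of three minibatches is an assumed reading of "a few".

```ruby
# Hypothetical helper, not part of dspy-gepa: a lower bound for the
# max_metric_calls budget following the rule of thumb in the knob table.
def minimum_metric_calls(valset_size:, minibatch_size:, minibatches: 3)
  valset_size + minibatch_size * minibatches
end

floor = minimum_metric_calls(valset_size: 100, minibatch_size: 6)
puts floor  # 100 + 6 * 3 = 118
```

With the `max_metric_calls: 600` budget from the configuration example, a 100-example validation set and `minibatch_size: 6` sit comfortably above this floor.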
-MIPROv2 is an advanced multi-prompt optimization technique that uses bootstrap sampling, instruction generation, and Bayesian optimization.
+### Compile and evaluate
 ```ruby
-require 'dspy/mipro'
+program = DSPy::Predict.new(MySignature)
-# Define your module to optimize
-class EmailClassifier < DSPy::Module
-  def initialize
-    super
-    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
-  end
+result = teleprompter.compile(program, trainset: train, valset: val)
+optimized_program = result.optimized_program
-  def forward(input)
-    @predictor.forward(input)
-  end
-end
+test_metrics = evaluate(optimized_program, test)
+```
-# Prepare training data
-training_examples = [
-  {
-    input: { email_subject: "Can't log in", email_body: "Password reset not working" },
-    expected_output: { category: 'Technical', priority: 'High' }
-  },
-  {
-    input: { email_subject: "Billing question", email_body: "How much does premium cost?" },
-    expected_output: { category: 'Billing', priority: 'Medium' }
-  },
-  # Add more examples...
-]
-# Define evaluation metric
-def accuracy_metric(example, prediction)
-  (example[:expected_output][:category] == prediction[:category]) ? 1.0 : 0.0
-end
+The `result` object exposes:
+- `optimized_program` -- predictor with updated instruction and few-shot examples.
+- `best_score_value` -- validation score for the best candidate.
+- `metadata` -- candidate counts, trace hashes, and telemetry IDs.
-# Run optimization
-optimizer = DSPy::MIPROv2.new(
-  metric: method(:accuracy_metric),
-  num_candidates: 10,
-  num_threads: 4
-)
+### Reflection LM
-optimized_module = optimizer.compile(
-  EmailClassifier.new,
-  trainset: training_examples
-)
+Swap `DSPy::ReflectionLM` for any callable object that accepts the reflection prompt hash and returns a string. The default reflection signature extracts the new instruction from triple backticks in the response.
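Annotation, not part of the diffed file: the paragraph above says the new instruction is extracted from triple backticks in the reflection response. A minimal sketch of that kind of extraction in plain Ruby is shown below. `extract_instruction` and `fence_marker` are hypothetical names, and the fallback to the whole response when no fence is present is an assumption, not documented gem behavior.

```ruby
# Build the fence marker programmatically so this example nests cleanly
# inside a Markdown code block.
def fence_marker
  "`" * 3
end

# Hypothetical helper: returns the text inside the first fenced span.
# Falling back to the stripped response when no fence exists is an assumption.
def extract_instruction(response)
  fence = Regexp.escape(fence_marker)
  # Optional language tag and newline after the opening fence, then a lazy
  # capture up to the closing fence; /m lets "." span newlines.
  match = response.match(/#{fence}(?:\w*\n)?(.*?)#{fence}/m)
  match ? match[1].strip : response.strip
end

reply = "Here is a better prompt:\n#{fence_marker}\nClassify the sentiment of the text.\n#{fence_marker}\nHope that helps."
puts extract_instruction(reply)
```

A custom reflection callable only needs to return a string in a shape the extraction step can parse, so wrapping the proposed instruction in a fenced span is the safest choice.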
+### Experiment tracking
-# Use optimized module
-result = optimized_module.forward(
-  email_subject: "New email",
-  email_body: "New email content"
+Plug `GEPA::Logging::ExperimentTracker` into a persistence layer:
+```ruby
+tracker = GEPA::Logging::ExperimentTracker.new
+tracker.with_subscriber { |event| MyModel.create!(payload: event) }
+teleprompter = DSPy::Teleprompt::GEPA.new(
+  metric: metric,
+  reflection_lm: reflection_lm,
+  experiment_tracker: tracker,
+  config: { max_metric_calls: 900 }
 )
 ```
-### Bootstrap Few-Shot Learning
+The tracker emits Pareto update events, merge decisions, and candidate evolution records as JSONL.
-Automatically generate few-shot examples from your training data:
+### Pareto frontier
-```ruby
-require 'dspy/teleprompt'
+GEPA maintains a diverse candidate pool and samples from the Pareto frontier instead of mutating only the top-scoring program. This balances exploration and prevents the search from collapsing onto a single lineage.
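Annotation, not part of the diffed file: to make the Pareto frontier concrete, the sketch below applies a generic dominance filter to per-example scores. It is a textbook Pareto filter written for illustration, not the `gepa` gem's selection logic, and `dominates?`, `pareto_frontier`, and the candidate names are hypothetical.

```ruby
# a dominates b when a is at least as good on every example and strictly
# better on at least one.
def dominates?(a, b)
  pairs = a.zip(b)
  pairs.all? { |x, y| x >= y } && pairs.any? { |x, y| x > y }
end

# Keep every candidate that no other candidate dominates.
def pareto_frontier(candidates)
  candidates.reject do |name, scores|
    candidates.any? do |other_name, other_scores|
      other_name != name && dominates?(other_scores, scores)
    end
  end
end

# Per-example validation scores for three hypothetical prompts.
candidates = {
  "prompt_a" => [1.0, 0.0, 1.0],
  "prompt_b" => [0.0, 1.0, 1.0],
  "prompt_c" => [0.0, 0.0, 1.0]
}

puts pareto_frontier(candidates).keys.inspect
```

Here `prompt_a` and `prompt_b` both survive because each wins on an example the other loses, while `prompt_c` is dropped. Keeping both lineages alive is what stops the search from collapsing onto a single top scorer.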
-# Create a teleprompter for few-shot optimization
-teleprompter = DSPy::BootstrapFewShot.new(
-  metric: method(:accuracy_metric),
-  max_bootstrapped_demos: 5,
-  max_labeled_demos: 3
-)
+Enable the merge proposer after multiple strong lineages emerge:
-# Compile the optimized module
-optimized = teleprompter.compile(
-  MyModule.new,
-  trainset: training_examples
-)
+```ruby
+config: {
+  max_metric_calls: 900,
+  enable_merge_proposer: true
+}
 ```
-### Custom Optimization Metrics
+Premature merges eat budget without meaningful gains. Gate merge on having several validated candidates first.
-Define custom metrics for your specific use case:
+### Advanced options
-```ruby
-def custom_metric(example, prediction)
-  score = 0.0
+- `acceptance_strategy:` -- plug in bespoke Pareto filters or early-stop heuristics.
+- Telemetry spans emit via `GEPA::Telemetry`. Enable global observability with `DSPy.configure { |c| c.observability = true }` to stream spans to an OpenTelemetry exporter.
+---
+## Evaluation Framework
-  # Category accuracy (60% weight)
-  score += 0.6 if example[:expected_output][:category] == prediction[:category]
+`DSPy::Evals` provides batch evaluation of predictors against test datasets with built-in and custom metrics.
-  # Priority accuracy (40% weight)
-  score += 0.4 if example[:expected_output][:priority] == prediction[:priority]
+### Basic usage
-  score
+```ruby
+metric = proc do |example, prediction|
+  prediction.answer == example.expected_values[:answer]
 end
-# Use in optimization
-optimizer = DSPy::MIPROv2.new(
-  metric: method(:custom_metric),
-  num_candidates: 10
+evaluator = DSPy::Evals.new(predictor, metric: metric)
+result = evaluator.evaluate(
+  test_examples,
+  display_table: true,
+  display_progress: true
 )
+puts "Pass rate: #{(result.pass_rate * 100).round(1)}%"
+puts "Passed: #{result.passed_examples}/#{result.total_examples}"
 ```
-### A/B Testing Different Approaches
+### DSPy::Example
-Compare different module implementations:
+Convert raw data into `DSPy::Example` instances before passing to optimizers or evaluators. Each example carries `input_values` and `expected_values`:
 ```ruby
-# Approach A: ChainOfThought
-class ApproachA < DSPy::Module
-  def initialize
-    super
-    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
-  end
-  def forward(input)
-    @predictor.forward(input)
-  end
+examples = rows.map do |row|
+  DSPy::Example.new(
+    input_values: { text: row[:text] },
+    expected_values: { label: row[:label] }
+  )
 end
-# Approach B: ReAct with tools
-class ApproachB < DSPy::Module
-  def initialize
-    super
-    @predictor = DSPy::ReAct.new(
-      EmailClassificationSignature,
-      tools: [KnowledgeBaseTool.new]
-    )
-  end
+train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
+```
-  def forward(input)
-    @predictor.forward(input)
-  end
-end
+Hold back a test set from the optimization loop. Optimizers work on train/val; only the test set proves generalization.
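Annotation, not part of the diffed file: the snippet above calls `split_examples` without defining it. One possible implementation, assuming a seeded shuffle followed by ratio-based slicing, is sketched below. This is a hypothetical helper, not a documented DSPy.rb API.

```ruby
# Hypothetical helper: deterministic train/val/test split.
# Whatever is left after the train and val slices becomes the test set.
def split_examples(examples, train_ratio:, val_ratio:, seed:)
  # A seeded Random makes the shuffle reproducible across runs.
  shuffled = examples.shuffle(random: Random.new(seed))
  train_size = (shuffled.size * train_ratio).round
  val_size = (shuffled.size * val_ratio).round

  train = shuffled.first(train_size)
  val = shuffled[train_size, val_size] || []
  test = shuffled.drop(train_size + val_size)
  [train, val, test]
end

examples = (1..10).to_a
train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
puts [train.size, val.size, test.size].inspect
```

Fixing the seed keeps the held-out test set identical between optimization runs, so scores from different optimizers stay comparable.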
-# Evaluate both approaches
-def evaluate_approach(approach_class, test_set)
-  approach = approach_class.new
-  scores = test_set.map do |example|
-    prediction = approach.forward(example[:input])
-    accuracy_metric(example, prediction)
-  end
-  scores.sum / scores.size
-end
+### Built-in metrics
-approach_a_score = evaluate_approach(ApproachA, test_examples)
-approach_b_score = evaluate_approach(ApproachB, test_examples)
-puts "Approach A accuracy: #{approach_a_score}"
-puts "Approach B accuracy: #{approach_b_score}"
-```
+```ruby
+# Exact match -- prediction must exactly equal expected value
+metric = DSPy::Metrics.exact_match(field: :answer, case_sensitive: true)
-## Observability
+# Contains -- prediction must contain expected substring
+metric = DSPy::Metrics.contains(field: :answer, case_sensitive: false)
-Track your LLM application's performance, token usage, and behavior in production.
+# Numeric difference -- numeric output within tolerance
+metric = DSPy::Metrics.numeric_difference(field: :answer, tolerance: 0.01)
-### OpenTelemetry Integration
+# Composite AND -- all sub-metrics must pass
+metric = DSPy::Metrics.composite_and(
+  DSPy::Metrics.exact_match(field: :answer),
+  DSPy::Metrics.contains(field: :reasoning)
+)
+```
-DSPy.rb automatically integrates with OpenTelemetry when configured:
+### Custom metrics
 ```ruby
-require 'opentelemetry/sdk'
-require 'dspy'
+quality_metric = lambda do |example, prediction|
+  return false unless prediction
-# Configure OpenTelemetry
-OpenTelemetry::SDK.configure do |c|
-  c.service_name = 'my-dspy-app'
-  c.use_all # Use all available instrumentation
+  score = 0.0
+  score += 0.5 if prediction.answer == example.expected_values[:answer]
+  score += 0.3 if prediction.explanation && prediction.explanation.length > 50
+  score += 0.2 if prediction.confidence && prediction.confidence > 0.8
+  score >= 0.7
 end
-# DSPy automatically creates traces for predictions
-predictor = DSPy::Predict.new(MySignature)
-result = predictor.forward(input: 'data')
-# Traces are automatically sent to your OpenTelemetry collector
+evaluator = DSPy::Evals.new(predictor, metric: quality_metric)
 ```
-### Langfuse Integration
+Access prediction fields with dot notation (`prediction.answer`), not hash notation.
+### Observability hooks
-Track detailed LLM execution traces with Langfuse:
+Register callbacks without editing the evaluator:
 ```ruby
-require 'dspy/langfuse'
-# Configure Langfuse
-DSPy.configure do |c|
-  c.lm = DSPy::LM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY'])
-  c.langfuse = {
-    public_key: ENV['LANGFUSE_PUBLIC_KEY'],
-    secret_key: ENV['LANGFUSE_SECRET_KEY'],
-    host: ENV['LANGFUSE_HOST'] || 'https://cloud.langfuse.com'
-  }
+DSPy::Evals.before_example do |payload|
+  example = payload[:example]
+  DSPy.logger.info("Evaluating example #{example.id}") if example.respond_to?(:id)
 end
-# All predictions are automatically traced
-predictor = DSPy::Predict.new(MySignature)
-result = predictor.forward(input: 'data')
-# View detailed traces in Langfuse dashboard
+DSPy::Evals.after_batch do |payload|
+  result = payload[:result]
+  Langfuse.event(
+    name: 'eval.batch',
+    metadata: {
+      total: result.total_examples,
+      passed: result.passed_examples,
+      score: result.score
+    }
+  )
+end
 ```
-### Manual Token Tracking
+Available hooks: `before_example`, `after_example`, `before_batch`, `after_batch`.
-Track token usage without external services:
+### Langfuse score export
+Enable `export_scores: true` to emit `score.create` events for each evaluated example and a batch score at the end:
 ```ruby
-class TokenTracker
-  def initialize
-    @total_tokens = 0
-    @request_count = 0
-  end
+evaluator = DSPy::Evals.new(
+  predictor,
+  metric: metric,
+  export_scores: true,
+  score_name: 'qa_accuracy'   # default: 'evaluation'
+)
-  def track_prediction(predictor, input)
-    start_time = Time.now
-    result = predictor.forward(input)
-    duration = Time.now - start_time
+result = evaluator.evaluate(test_examples)
+# Emits per-example scores + overall batch score via DSPy::Scores::Exporter
+```
-    # Get token usage from response metadata
-    tokens = result.metadata[:usage][:total_tokens] rescue 0
-    @total_tokens += tokens
-    @request_count += 1
+Scores attach to the current trace context automatically and flow to Langfuse asynchronously.
-    puts "Request ##{@request_count}: #{tokens} tokens in #{duration}s"
-    puts "Total tokens used: #{@total_tokens}"
+### Evaluation results
-    result
-  end
-end
+```ruby
+result = evaluator.evaluate(test_examples)
-# Usage
-tracker = TokenTracker.new
-predictor = DSPy::Predict.new(MySignature)
+result.score            # Overall score (0.0 to 1.0)
+result.passed_count     # Examples that passed
+result.failed_count     # Examples that failed
+result.error_count      # Examples that errored
-result = tracker.track_prediction(predictor, { input: 'data' })
+result.results.each do |r|
+  r.passed              # Boolean
+  r.score               # Numeric score
+  r.error               # Error message if the example errored
+end
 ```
-### Custom Logging
-Add detailed logging to your modules:
+### Integration with optimizers
 ```ruby
-class EmailClassifier < DSPy::Module
-  def initialize
-    super
-    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
-    @logger = Logger.new(STDOUT)
-  end
+metric = proc do |example, prediction|
+  expected  = example.expected_values[:answer].to_s.strip.downcase
+  predicted = prediction.answer.to_s.strip.downcase
+  !expected.empty? && predicted.include?(expected)
+end
-  def forward(input)
-    @logger.info "Classifying email: #{input[:email_subject]}"
+optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)
-    start_time = Time.now
-    result = @predictor.forward(input)
-    duration = Time.now - start_time
+result = optimizer.compile(
+  DSPy::Predict.new(QASignature),
+  trainset: train_examples,
+  valset: val_examples
+)
-    @logger.info "Classification: #{result[:category]} (#{duration}s)"
+evaluator = DSPy::Evals.new(result.optimized_program, metric: metric)
+test_result = evaluator.evaluate(test_examples, display_table: true)
+puts "Test accuracy: #{(test_result.pass_rate * 100).round(2)}%"
+```
-    if result[:reasoning]
-      @logger.debug "Reasoning: #{result[:reasoning]}"
-    end
+---
-    result
-  rescue => e
-    @logger.error "Classification failed: #{e.message}"
-    raise
-  end
+## Storage System
+`DSPy::Storage` persists optimization results, tracks history, and manages multiple versions of optimized programs.
+### ProgramStorage (low-level)
+```ruby
+storage = DSPy::Storage::ProgramStorage.new(storage_path: "./dspy_storage")
+# Save
+saved = storage.save_program(
+  result.optimized_program,
+  result,
+  metadata: {
+    signature_class: 'ClassifyText',
+    optimizer: 'MIPROv2',
+    examples_count: examples.size
+  }
+)
+puts "Stored with ID: #{saved.program_id}"
+# Load
+saved = storage.load_program(program_id)
+predictor = saved.program
+score = saved.optimization_result[:best_score_value]
+# List
+storage.list_programs.each do |p|
+  puts "#{p[:program_id]} -- score: #{p[:best_score]} -- saved: #{p[:saved_at]}"
 end
 ```
-### Performance Monitoring
-Monitor latency and performance metrics:
+### StorageManager (recommended)
 ```ruby
-class PerformanceMonitor
-  def initialize
-    @metrics = {
-      total_requests: 0,
-      total_duration: 0.0,
-      errors: 0,
-      success_count: 0
-    }
-  end
+manager = DSPy::Storage::StorageManager.new
-  def monitor_request
-    start_time = Time.now
-    @metrics[:total_requests] += 1
-    begin
-      result = yield
-      @metrics[:success_count] += 1
-      result
-    rescue => e
-      @metrics[:errors] += 1
-      raise
-    ensure
-      duration = Time.now - start_time
-      @metrics[:total_duration] += duration
-      if @metrics[:total_requests] % 10 == 0
-        print_stats
-      end
-    end
-  end
+# Save with tags
+saved = manager.save_optimization_result(
+  result,
+  tags: ['production', 'sentiment-analysis'],
+  description: 'Optimized sentiment classifier v2'
+)
-  def print_stats
-    avg_duration = @metrics[:total_duration] / @metrics[:total_requests]
-    success_rate = @metrics[:success_count].to_f / @metrics[:total_requests]
+# Find programs
+programs = manager.find_programs(
+  optimizer: 'MIPROv2',
+  min_score: 0.85,
+  tags: ['production']
+)
-    puts "\n=== Performance Stats ==="
-    puts "Total requests: #{@metrics[:total_requests]}"
-    puts "Average duration: #{avg_duration.round(3)}s"
-    puts "Success rate: #{(success_rate * 100).round(2)}%"
-    puts "Errors: #{@metrics[:errors]}"
-    puts "========================\n"
-  end
-end
+recent = manager.find_programs(
+  max_age_days: 7,
+  signature_class: 'ClassifyText'
+)
-# Usage
-monitor = PerformanceMonitor.new
-predictor = DSPy::Predict.new(MySignature)
+# Get best program for a signature
+best = manager.get_best_program('ClassifyText')
+predictor = best.program
+```
-result = monitor.monitor_request do
-  predictor.forward(input: 'data')
-end
+Global shorthand:
+```ruby
+DSPy::Storage::StorageManager.save(result, metadata: { version: '2.0' })
+DSPy::Storage::StorageManager.load(program_id)
+DSPy::Storage::StorageManager.best('ClassifyText')
 ```
-### Error Rate Tracking
+### Checkpoints
-Monitor and alert on error rates:
+Create and restore checkpoints during long-running optimizations:
 ```ruby
-class ErrorRateMonitor
-  def initialize(alert_threshold: 0.1)
-    @alert_threshold = alert_threshold
-    @recent_results = []
-    @window_size = 100
-  end
+# Save a checkpoint
+manager.create_checkpoint(
+  current_result,
+  'iteration_50',
+  metadata: { iteration: 50, current_score: 0.87 }
+)
-  def track_result(success:)
-    @recent_results << success
-    @recent_results.shift if @recent_results.size > @window_size
+# Restore
+restored = manager.restore_checkpoint('iteration_50')
+program = restored.program
-    error_rate = calculate_error_rate
-    alert_if_needed(error_rate)
+# Auto-checkpoint every N iterations
+if iteration % 10 == 0
+  manager.create_checkpoint(current_result, "auto_checkpoint_#{iteration}")
+end
+```
-    error_rate
-  end
+### Import and export
-  private
+Share programs between environments:
-  def calculate_error_rate
-    failures = @recent_results.count(false)
-    failures.to_f / @recent_results.size
-  end
+```ruby
+storage = DSPy::Storage::ProgramStorage.new
-  def alert_if_needed(error_rate)
-    if error_rate > @alert_threshold
-      puts "⚠️  ALERT: Error rate #{(error_rate * 100).round(2)}% exceeds threshold!"
-      # Send notification, page oncall, etc.
-    end
-  end
-end
+# Export
+storage.export_programs(['abc123', 'def456'], './export_backup.json')
+# Import
+imported = storage.import_programs('./export_backup.json')
+puts "Imported #{imported.size} programs"
 ```
-## Best Practices
+### Optimization history
-### 1. Start with Tests
+```ruby
+history = manager.get_optimization_history
-Write tests before optimizing:
+history[:summary][:total_programs]
+history[:summary][:avg_score]
-```ruby
-# Define test cases first
-test_cases = [
-  { input: {...}, expected: {...} },
-  # More test cases...
-]
-# Ensure baseline functionality
-test_cases.each do |tc|
-  result = classifier.forward(tc[:input])
-  assert result[:category] == tc[:expected][:category]
+history[:optimizer_stats].each do |optimizer, stats|
+  puts "#{optimizer}: #{stats[:count]} programs, best: #{stats[:best_score]}"
 end
-# Then optimize
-optimized = optimizer.compile(classifier, trainset: test_cases)
+history[:trends][:improvement_percentage]
 ```
-### 2. Use Meaningful Metrics
-Define metrics that align with business goals:
+### Program comparison
 ```ruby
-def business_aligned_metric(example, prediction)
-  # High-priority errors are more costly
-  if example[:expected_output][:priority] == 'High'
-    return prediction[:priority] == 'High' ? 1.0 : 0.0
-  else
-    return prediction[:category] == example[:expected_output][:category] ? 0.8 : 0.0
-  end
-end
+comparison = manager.compare_programs(id_a, id_b)
+comparison[:comparison][:score_difference]
+comparison[:comparison][:better_program]
+comparison[:comparison][:age_difference_hours]
 ```
-### 3. Monitor in Production
-Always track production performance:
+### Storage configuration
 ```ruby
-class ProductionModule < DSPy::Module
-  def initialize
-    super
-    @predictor = DSPy::ChainOfThought.new(MySignature)
-    @monitor = PerformanceMonitor.new
-    @error_tracker = ErrorRateMonitor.new
-  end
+config = DSPy::Storage::StorageManager::StorageConfig.new
+config.storage_path = Rails.root.join('dspy_storage')
+config.auto_save = true
+config.save_intermediate_results = false
+config.max_stored_programs = 100
-  def forward(input)
-    @monitor.monitor_request do
-      result = @predictor.forward(input)
-      @error_tracker.track_result(success: true)
-      result
-    rescue => e
-      @error_tracker.track_result(success: false)
-      raise
-    end
-  end
-end
+manager = DSPy::Storage::StorageManager.new(config: config)
 ```
-### 4. Version Your Modules
+### Cleanup
-Track which version of your module is deployed:
+Remove old programs. Cleanup retains the best-performing and most recent programs, ranked by a weighted score (70% performance, 30% recency):
 ```ruby
-class EmailClassifierV2 < DSPy::Module
-  VERSION = '2.1.0'
+deleted_count = manager.cleanup_old_programs
+```
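The 70/30 weighting described for cleanup can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the `Program` struct, `retention_score`, `programs_to_keep`, and the linear recency decay over `max_age_days` are all hypothetical, chosen only to show how such a weighted ranking might behave.

```ruby
# Hypothetical stand-in for a stored program; not a DSPy.rb class.
Program = Struct.new(:id, :score, :age_days)

# Assumed formula: 70% performance plus 30% recency, where recency
# decays linearly from 1.0 (brand new) to 0.0 (max_age_days or older).
def retention_score(program, max_age_days:)
  recency = 1.0 - [program.age_days.to_f / max_age_days, 1.0].min
  0.7 * program.score + 0.3 * recency
end

# Return the `keep` programs with the highest retention score,
# highest first. Everything else would be a cleanup candidate.
def programs_to_keep(programs, keep:, max_age_days: 30)
  programs.max_by(keep) { |p| retention_score(p, max_age_days: max_age_days) }
end

programs = [
  Program.new('old_strong', 0.90, 30),
  Program.new('new_weak',   0.50, 0),
  Program.new('old_weak',   0.40, 30)
]

kept = programs_to_keep(programs, keep: 2).map(&:id)
puts kept.inspect
```

Under these assumptions a fresh program with a middling score can outrank an older, stronger one, because recency contributes up to 0.3 of the total.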
-  def initialize
-    super
-    @predictor = DSPy::ChainOfThought.new(EmailClassificationSignature)
-  end
+### Storage events
-  def forward(input)
-    result = @predictor.forward(input)
-    result.merge(model_version: VERSION)
-  end
-end
+The storage system emits structured log events for monitoring:
+- `dspy.storage.save_start`, `dspy.storage.save_complete`, `dspy.storage.save_error`
+- `dspy.storage.load_start`, `dspy.storage.load_complete`, `dspy.storage.load_error`
+- `dspy.storage.delete`, `dspy.storage.export`, `dspy.storage.import`, `dspy.storage.cleanup`
+### File layout
+```
+dspy_storage/
+  programs/
+    abc123def456.json
+    789xyz012345.json
+  history.json
 ```
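Given the layout above, stored programs can be enumerated with the Ruby standard library alone. The sketch below assumes that each file under `programs/` is named after its program ID, which the listing suggests but does not confirm; `stored_program_ids` is a hypothetical helper, and the temporary directory only mimics the layout for demonstration.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: list program IDs by reading file names under
# <storage_path>/programs. Assumes one JSON file per program, named <id>.json.
def stored_program_ids(storage_path)
  Dir.glob(File.join(storage_path, 'programs', '*.json'))
     .map { |path| File.basename(path, '.json') }
     .sort
end

# Recreate the documented layout in a temporary directory and list it.
ids = Dir.mktmpdir do |root|
  programs_dir = File.join(root, 'programs')
  FileUtils.mkdir_p(programs_dir)
  File.write(File.join(programs_dir, 'abc123def456.json'), '{}')
  File.write(File.join(programs_dir, '789xyz012345.json'), '{}')
  File.write(File.join(root, 'history.json'), '{}')
  stored_program_ids(root)
end

puts ids.inspect
```

`history.json` is ignored because it sits beside `programs/` rather than inside it.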
+---
+## API rules
+- Call predictors with `.call()`, not `.forward()`.
+- Access prediction fields with dot notation (`result.answer`), not hash notation (`result[:answer]`).
+- GEPA metrics return `DSPy::Prediction.new(score:, feedback:)`, not a boolean.
+- MIPROv2 metrics may return `true`/`false`, a numeric score, or `DSPy::Prediction`.