npm - compound-engineering-pi - Versions diffs - 0.2.3 - Mend

compound-engineering-pi 0.2.3

Files changed (332)

package/plugins/compound-engineering/skills/dspy-ruby/references/optimization.md ADDED

@@ -0,0 +1,603 @@
+# DSPy.rb Optimization
+## MIPROv2
+MIPROv2 (Multiprompt Instruction Proposal Optimizer, version 2) is the primary instruction tuner in DSPy.rb. It proposes new instructions and few-shot demonstrations per predictor, evaluates them on mini-batches, and retains candidates that improve the metric. It ships as a separate gem to keep the Gaussian Process dependency tree out of apps that do not need it.
+### Installation
+```ruby
+# Gemfile
+gem "dspy"
+gem "dspy-miprov2"
+```
+Bundler auto-requires `dspy/miprov2`. No additional `require` statement is needed.
+### AutoMode presets
+Use `DSPy::Teleprompt::MIPROv2::AutoMode` for preconfigured optimizers:
+```ruby
+light  = DSPy::Teleprompt::MIPROv2::AutoMode.light(metric: metric)   # 6 trials, greedy
+medium = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)  # 12 trials, adaptive
+heavy  = DSPy::Teleprompt::MIPROv2::AutoMode.heavy(metric: metric)   # 18 trials, Bayesian
+```
+| Preset   | Trials | Strategy   | Use case                                            |
+|----------|--------|------------|-----------------------------------------------------|
+| `light`  | 6      | `:greedy`  | Quick wins on small datasets or during prototyping. |
+| `medium` | 12     | `:adaptive`| Balanced exploration vs. runtime for most pilots.   |
+| `heavy`  | 18     | `:bayesian`| Highest accuracy targets or multi-stage programs.   |
+### Manual configuration with dry-configurable
+`DSPy::Teleprompt::MIPROv2` includes `Dry::Configurable`. Configure at the class level (defaults for all instances) or instance level (overrides class defaults).
+**Class-level defaults:**
+```ruby
+DSPy::Teleprompt::MIPROv2.configure do |config|
+  config.optimization_strategy = :bayesian
+  config.num_trials = 30
+  config.bootstrap_sets = 10
+end
+```
+**Instance-level overrides:**
+```ruby
+optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
+optimizer.configure do |config|
+  config.num_trials = 15
+  config.num_instruction_candidates = 6
+  config.bootstrap_sets = 5
+  config.max_bootstrapped_examples = 4
+  config.max_labeled_examples = 16
+  config.optimization_strategy = :adaptive       # :greedy, :adaptive, :bayesian
+  config.early_stopping_patience = 3
+  config.init_temperature = 1.0
+  config.final_temperature = 0.1
+  config.minibatch_size = nil                     # nil = auto
+  config.auto_seed = 42
+end
+```
+The `optimization_strategy` setting accepts symbols (`:greedy`, `:adaptive`, `:bayesian`) and coerces them internally to `DSPy::Teleprompt::OptimizationStrategy` T::Enum values.
+The old `config:` constructor parameter is removed. Passing `config:` raises `ArgumentError`.
+### Auto presets via configure
+Instead of `AutoMode`, set the preset through the configure block:
+```ruby
+optimizer = DSPy::Teleprompt::MIPROv2.new(metric: metric)
+optimizer.configure do |config|
+  config.auto_preset = DSPy::Teleprompt::AutoPreset.deserialize("medium")
+end
+```
+### Compile and inspect
+```ruby
+program = DSPy::Predict.new(MySignature)
+result = optimizer.compile(
+  program,
+  trainset: train_examples,
+  valset: val_examples
+)
+optimized_program = result.optimized_program
+puts "Best score: #{result.best_score_value}"
+```
+The `result` object exposes:
+- `optimized_program` -- ready-to-use predictor with updated instruction and demos.
+- `optimization_trace[:trial_logs]` -- per-trial record of instructions, demos, and scores.
+- `metadata[:optimizer]` -- `"MIPROv2"`, useful when persisting experiments from multiple optimizers.
+### Multi-stage programs
+MIPROv2 generates dataset summaries for each predictor and proposes per-stage instructions. For a ReAct agent with `thought_generator` and `observation_processor` predictors, the optimizer handles credit assignment internally. The metric only needs to evaluate the final output.
+### Bootstrap sampling
+During the bootstrap phase MIPROv2:
+1. Generates dataset summaries from the training set.
+2. Bootstraps few-shot demonstrations by running the baseline program.
+3. Proposes candidate instructions grounded in the summaries and bootstrapped examples.
+4. Evaluates each candidate on mini-batches drawn from the validation set.
+Control the bootstrap phase with `bootstrap_sets`, `max_bootstrapped_examples`, and `max_labeled_examples`.
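Step 2 above can be pictured with a plain-Ruby sketch: run the baseline program over shuffled training examples and keep only the input/output pairs the metric accepts, grouped into candidate demo sets. `bootstrap_demo_sets`, `Example`, and the toy `program` lambda are hypothetical stand-ins, not DSPy.rb APIs, and the `sets:` and `max_examples:` arguments only loosely mirror `bootstrap_sets` and `max_bootstrapped_examples`.

```ruby
# Hypothetical stand-ins; DSPy.rb's real bootstrap code is not shown in this document.
Example = Struct.new(:input, :expected)

# Build `sets` candidate demo sets, each holding up to `max_examples`
# input/output pairs that the baseline program got right.
def bootstrap_demo_sets(program, metric, trainset, sets:, max_examples:, seed: 42)
  rng = Random.new(seed)
  Array.new(sets) do
    demos = []
    trainset.shuffle(random: rng).each do |example|
      break if demos.size >= max_examples

      output = program.call(example.input)
      demos << { input: example.input, output: output } if metric.call(example, output)
    end
    demos
  end
end

# Toy baseline: uppercases its input, so it fails the example whose label is "x".
program  = ->(text) { text.upcase }
metric   = ->(example, output) { output == example.expected }
trainset = [
  Example.new("a", "A"), Example.new("b", "B"),
  Example.new("c", "x"), Example.new("d", "D")
]

demo_sets = bootstrap_demo_sets(program, metric, trainset, sets: 3, max_examples: 2)
demo_sets.each_with_index do |demos, index|
  puts "set #{index}: #{demos.map { |demo| demo[:input] }.join(', ')}"
end
```

Each set is drawn from a different shuffle, so the sets differ in which passing examples they contain; failed examples never become demonstrations.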
+### Bayesian optimization
+When `optimization_strategy` is `:bayesian` (or when using the `heavy` preset), MIPROv2 fits a Gaussian Process surrogate over past trial scores to select the next candidate. This replaces random search with informed exploration, which can reduce the number of trials needed to find high-scoring instructions.
+---
+## GEPA
+GEPA (Genetic-Pareto Reflective Prompt Evolution) is a feedback-driven optimizer. It runs the program on a small batch, collects scores and textual feedback, and asks a reflection LM to rewrite the instruction. Improved candidates are retained on a Pareto frontier.
+### Installation
+```ruby
+# Gemfile
+gem "dspy"
+gem "dspy-gepa"
+```
+The `dspy-gepa` gem depends on the `gepa` core optimizer gem automatically.
+### Metric contract
+GEPA metrics return `DSPy::Prediction` with both a numeric score and a feedback string. Do not return a plain boolean.
+```ruby
+metric = lambda do |example, prediction|
+  expected  = example.expected_values[:label]
+  predicted = prediction.label
+  score = predicted == expected ? 1.0 : 0.0
+  feedback = if score == 1.0
+    "Correct (#{expected}) for: \"#{example.input_values[:text][0..60]}\""
+  else
+    "Misclassified (expected #{expected}, got #{predicted}) for: \"#{example.input_values[:text][0..60]}\""
+  end
+  DSPy::Prediction.new(score: score, feedback: feedback)
+end
+```
+Keep the score in `[0, 1]`. Always include a short feedback message explaining what happened -- GEPA hands this text to the reflection model so it can reason about failures.
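The same contract also covers partial credit. The sketch below is a hypothetical metric that scores the fraction of expected tags recovered and says what was missed. `ScoredFeedback` is a plain `Struct` standing in for `DSPy::Prediction` so the code runs without the gem, and the method takes plain arrays rather than example and prediction objects.

```ruby
# Stand-in for DSPy::Prediction so the sketch runs without the gem installed.
ScoredFeedback = Struct.new(:score, :feedback, keyword_init: true)

# Hypothetical partial-credit metric: fraction of expected tags that were predicted.
def tag_metric(expected_tags, predicted_tags)
  hits    = expected_tags & predicted_tags
  missing = expected_tags - predicted_tags
  extra   = predicted_tags - expected_tags
  score   = expected_tags.empty? ? 0.0 : hits.size.to_f / expected_tags.size

  notes = ["Matched #{hits.size}/#{expected_tags.size} expected tags."]
  notes << "Missing: #{missing.join(', ')}." unless missing.empty?
  notes << "Unexpected: #{extra.join(', ')}." unless extra.empty?

  # Clamp defensively so the score always stays within [0, 1].
  ScoredFeedback.new(score: score.clamp(0.0, 1.0), feedback: notes.join(" "))
end

result = tag_metric(%w[billing urgent], %w[billing spam])
puts result.score
puts result.feedback
```

Naming the missing and unexpected items in the feedback gives the reflection model something concrete to act on, which a bare number does not.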
+### Feedback maps
+`feedback_map` targets individual predictors inside a composite module. Each entry receives keyword arguments and returns a `DSPy::Prediction`:
+```ruby
+feedback_map = {
+  'self' => lambda do |predictor_output:, predictor_inputs:, module_inputs:, module_outputs:, captured_trace:|
+    expected  = module_inputs.expected_values[:label]
+    predicted = predictor_output.label
+    DSPy::Prediction.new(
+      score: predicted == expected ? 1.0 : 0.0,
+      feedback: "Classifier saw \"#{predictor_inputs[:text][0..80]}\" -> #{predicted} (expected #{expected})"
+    )
+  end
+}
+```
+For single-predictor programs, key the map with `'self'`. For multi-predictor chains, add one entry per component so the reflection LM sees localized context at each step. Omit `feedback_map` entirely if the top-level metric's feedback is already specific enough.
+### Configuring the teleprompter
+```ruby
+teleprompter = DSPy::Teleprompt::GEPA.new(
+  metric: metric,
+  reflection_lm: DSPy::ReflectionLM.new('openai/gpt-4o-mini', api_key: ENV['OPENAI_API_KEY']),
+  feedback_map: feedback_map,
+  config: {
+    max_metric_calls: 600,
+    minibatch_size: 6,
+    skip_perfect_score: false
+  }
+)
+```
+Key configuration knobs:
+| Knob                 | Purpose                                                                                   |
+|----------------------|-------------------------------------------------------------------------------------------|
+| `max_metric_calls`   | Hard budget on evaluation calls. Set to at least the validation set size plus a few minibatches. |
+| `minibatch_size`     | Examples per reflective replay batch. Smaller = cheaper iterations, noisier scores.       |
+| `skip_perfect_score` | Set `true` to stop early when a candidate reaches score `1.0`.                            |
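As a rough worked example of the `max_metric_calls` guidance, assume every validation example and every minibatch example costs one metric call. That accounting is an assumption, since the gem's exact bookkeeping is not described here, and `minimum_metric_budget` is a hypothetical helper.

```ruby
# Hypothetical helper: lower bound on max_metric_calls under the
# one-call-per-example assumption stated above.
def minimum_metric_budget(valset_size:, minibatch_size:, minibatches:)
  valset_size + minibatch_size * minibatches
end

# One full validation pass over 200 examples plus ten minibatches of 6.
budget = minimum_metric_budget(valset_size: 200, minibatch_size: 6, minibatches: 10)
puts budget # 200 + 6 * 10 = 260
```

Under these assumptions, the `max_metric_calls: 600` used earlier would leave room for roughly two full validation passes on a 200-example set plus some minibatch exploration.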
+### Minibatch sizing
+| Goal                                            | Suggested size | Rationale                                                  |
+|-------------------------------------------------|----------------|------------------------------------------------------------|
+| Explore many candidates within a tight budget   | 3--6           | Cheap iterations, more prompt variants, noisier metrics.   |
+| Stable metrics when each rollout is costly      | 8--12          | Smoother scores, fewer candidates unless budget is raised. |
+| Investigate specific failure modes              | 3--4 then 8+   | Start with breadth, increase once patterns emerge.         |
+### Compile and evaluate
+```ruby
+program = DSPy::Predict.new(MySignature)
+result = teleprompter.compile(program, trainset: train, valset: val)
+optimized_program = result.optimized_program
+test_metrics = evaluate(optimized_program, test)
+```
+The `result` object exposes:
+- `optimized_program` -- predictor with updated instruction and few-shot examples.
+- `best_score_value` -- validation score for the best candidate.
+- `metadata` -- candidate counts, trace hashes, and telemetry IDs.
+### Reflection LM
+Swap `DSPy::ReflectionLM` for any callable object that accepts the reflection prompt hash and returns a string. The default reflection signature extracts the new instruction from triple backticks in the response.
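A minimal sketch of that contract: a callable that takes a prompt hash and returns a string, plus an extractor that pulls the instruction out of the first fenced block. The prompt key `current_instruction`, the canned response, and `extract_instruction` are hypothetical; the gem's own extraction logic may differ.

```ruby
# Hypothetical callable reflection LM plus a fence extractor. Neither mirrors
# DSPy.rb internals; they only illustrate the contract described above.
FENCE = "`" * 3 # built programmatically to keep a literal fence out of this example

# Return the text inside the first fenced block, or the whole stripped
# response when no fence is present.
def extract_instruction(response)
  match = response.match(/#{FENCE}[^\n]*\n(.*?)#{FENCE}/m)
  (match ? match[1] : response).strip
end

# Any object responding to #call can stand in; a real one would query an LLM.
canned_reflection_lm = lambda do |_prompt|
  "Feedback shows sarcasm is missed.\n" \
    "#{FENCE}\nClassify sentiment; treat sarcasm as negative.\n#{FENCE}\n"
end

response = canned_reflection_lm.call({ current_instruction: "Classify sentiment." })
puts extract_instruction(response)
```

A canned callable like this is also handy in tests, where calling a real model would be slow and non-deterministic.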
+### Experiment tracking
+Plug `GEPA::Logging::ExperimentTracker` into a persistence layer:
+```ruby
+tracker = GEPA::Logging::ExperimentTracker.new
+tracker.with_subscriber { |event| MyModel.create!(payload: event) }
+teleprompter = DSPy::Teleprompt::GEPA.new(
+  metric: metric,
+  reflection_lm: reflection_lm,
+  experiment_tracker: tracker,
+  config: { max_metric_calls: 900 }
+)
+```
+The tracker emits Pareto update events, merge decisions, and candidate evolution records as JSONL.
+### Pareto frontier
+GEPA maintains a diverse candidate pool and samples from the Pareto frontier instead of mutating only the top-scoring program. This balances exploration against exploitation and prevents the search from collapsing onto a single lineage.
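The idea of a frontier over per-example scores can be sketched in plain Ruby. Candidate names and scores are invented, `dominates?` and `pareto_frontier` are hypothetical helpers, and GEPA's real bookkeeping may differ.

```ruby
# Hypothetical sketch of Pareto-frontier selection over per-example scores.

# a dominates b when a is at least as good on every example and strictly
# better on at least one.
def dominates?(a, b)
  pairs = a.zip(b)
  pairs.all? { |x, y| x >= y } && pairs.any? { |x, y| x > y }
end

# Keep every candidate that no other candidate dominates.
def pareto_frontier(candidates)
  candidates.reject do |name, scores|
    candidates.any? { |other, other_scores| other != name && dominates?(other_scores, scores) }
  end
end

candidates = {
  "seed"      => [1.0, 0.0, 0.0],
  "variant_a" => [1.0, 1.0, 0.0], # dominates seed
  "variant_b" => [0.0, 0.0, 1.0]  # lower average, but wins an example nobody else does
}

frontier = pareto_frontier(candidates)
puts frontier.keys.inspect

# Sample the next parent from the frontier rather than always taking the top scorer.
parent = frontier.keys.sample(random: Random.new(7))
puts "next parent: #{parent}"
```

Here `variant_b` survives despite its lower average because it is the only candidate that solves the third example, which is the kind of lineage a top-score-only search would discard.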
+Enable the merge proposer after multiple strong lineages emerge:
+```ruby
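+teleprompter = DSPy::Teleprompt::GEPA.new(
+  metric: metric,
+  reflection_lm: reflection_lm,
+  config: {
+    max_metric_calls: 900,
+    enable_merge_proposer: true   # turn on once several lineages are validated
+  }
+)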
+```
+Premature merges consume metric-call budget without meaningful gains. Enable the merge proposer only after several candidates have been validated.
+### Advanced options
+- `acceptance_strategy:` -- plug in bespoke Pareto filters or early-stop heuristics.
+- Telemetry spans emit via `GEPA::Telemetry`. Enable global observability with `DSPy.configure { |c| c.observability = true }` to stream spans to an OpenTelemetry exporter.
+---
+## Evaluation Framework
+`DSPy::Evals` provides batch evaluation of predictors against test datasets with built-in and custom metrics.
+### Basic usage
+```ruby
+metric = proc do |example, prediction|
+  prediction.answer == example.expected_values[:answer]
+end
+evaluator = DSPy::Evals.new(predictor, metric: metric)
+result = evaluator.evaluate(
+  test_examples,
+  display_table: true,
+  display_progress: true
+)
+puts "Pass rate: #{(result.pass_rate * 100).round(1)}%"
+puts "Passed: #{result.passed_examples}/#{result.total_examples}"
+```
+### DSPy::Example
+Convert raw data into `DSPy::Example` instances before passing to optimizers or evaluators. Each example carries `input_values` and `expected_values`:
+```ruby
+examples = rows.map do |row|
+  DSPy::Example.new(
+    input_values: { text: row[:text] },
+    expected_values: { label: row[:label] }
+  )
+end
+train, val, test = split_examples(examples, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
+```
+Hold back a test set from the optimization loop. Optimizers work on train/val; only the held-out test set measures generalization.
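The `split_examples` helper called above is not defined in this document. One possible plain-Ruby version, offered as an assumption about its behavior rather than the library's implementation, shuffles deterministically by seed and slices by ratio:

```ruby
# Hypothetical implementation of the split_examples helper used above;
# treat it as one possible version, not a DSPy.rb API.
def split_examples(examples, train_ratio:, val_ratio:, seed:)
  raise ArgumentError, "ratios must leave room for a test set" if train_ratio + val_ratio >= 1.0

  shuffled  = examples.shuffle(random: Random.new(seed)) # deterministic for a given seed
  train_end = (shuffled.size * train_ratio).round
  val_end   = train_end + (shuffled.size * val_ratio).round

  # Whatever remains after train and val becomes the held-out test set.
  [shuffled[0...train_end], shuffled[train_end...val_end], shuffled[val_end..]]
end

train, val, test = split_examples((1..10).to_a, train_ratio: 0.6, val_ratio: 0.2, seed: 42)
puts "train=#{train.size} val=#{val.size} test=#{test.size}"
```

Fixing the seed keeps the three splits stable across runs, so scores from different optimizer configurations are compared on the same held-out examples.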
+### Built-in metrics
+```ruby
+# Exact match -- prediction must exactly equal expected value
+metric = DSPy::Metrics.exact_match(field: :answer, case_sensitive: true)
+# Contains -- prediction must contain expected substring
+metric = DSPy::Metrics.contains(field: :answer, case_sensitive: false)
+# Numeric difference -- numeric output within tolerance
+metric = DSPy::Metrics.numeric_difference(field: :answer, tolerance: 0.01)
+# Composite AND -- all sub-metrics must pass
+metric = DSPy::Metrics.composite_and(
+  DSPy::Metrics.exact_match(field: :answer),
+  DSPy::Metrics.contains(field: :reasoning)
+)
+```
+### Custom metrics
+```ruby
+quality_metric = lambda do |example, prediction|
+  return false unless prediction
+  score = 0.0
+  score += 0.5 if prediction.answer == example.expected_values[:answer]
+  score += 0.3 if prediction.explanation && prediction.explanation.length > 50
+  score += 0.2 if prediction.confidence && prediction.confidence > 0.8
+  score >= 0.7
+end
+evaluator = DSPy::Evals.new(predictor, metric: quality_metric)
+```
+Access prediction fields with dot notation (`prediction.answer`), not hash notation.
+### Observability hooks
+Register callbacks without editing the evaluator:
+```ruby
+DSPy::Evals.before_example do |payload|
+  example = payload[:example]
+  DSPy.logger.info("Evaluating example #{example.id}") if example.respond_to?(:id)
+end
+DSPy::Evals.after_batch do |payload|
+  result = payload[:result]
+  Langfuse.event(
+    name: 'eval.batch',
+    metadata: {
+      total: result.total_examples,
+      passed: result.passed_examples,
+      score: result.score
+    }
+  )
+end
+```
+Available hooks: `before_example`, `after_example`, `before_batch`, `after_batch`.
+### Langfuse score export
+Enable `export_scores: true` to emit `score.create` events for each evaluated example and a batch score at the end:
+```ruby
+evaluator = DSPy::Evals.new(
+  predictor,
+  metric: metric,
+  export_scores: true,
+  score_name: 'qa_accuracy'   # default: 'evaluation'
+)
+result = evaluator.evaluate(test_examples)
+# Emits per-example scores + overall batch score via DSPy::Scores::Exporter
+```
+Scores attach to the current trace context automatically and flow to Langfuse asynchronously.
+### Evaluation results
+```ruby
+result = evaluator.evaluate(test_examples)
+result.score            # Overall score (0.0 to 1.0)
+result.passed_count     # Examples that passed
+result.failed_count     # Examples that failed
+result.error_count      # Examples that errored
+result.results.each do |r|
+  r.passed              # Boolean
+  r.score               # Numeric score
+  r.error               # Error message if the example errored
+end
+```
+### Integration with optimizers
+```ruby
+metric = proc do |example, prediction|
+  expected  = example.expected_values[:answer].to_s.strip.downcase
+  predicted = prediction.answer.to_s.strip.downcase
+  !expected.empty? && predicted.include?(expected)
+end
+optimizer = DSPy::Teleprompt::MIPROv2::AutoMode.medium(metric: metric)
+result = optimizer.compile(
+  DSPy::Predict.new(QASignature),
+  trainset: train_examples,
+  valset: val_examples
+)
+evaluator = DSPy::Evals.new(result.optimized_program, metric: metric)
+test_result = evaluator.evaluate(test_examples, display_table: true)
+puts "Test accuracy: #{(test_result.pass_rate * 100).round(2)}%"
+```
+---
+## Storage System
+`DSPy::Storage` persists optimization results, tracks history, and manages multiple versions of optimized programs.
+### ProgramStorage (low-level)
+```ruby
+storage = DSPy::Storage::ProgramStorage.new(storage_path: "./dspy_storage")
+# Save
+saved = storage.save_program(
+  result.optimized_program,
+  result,
+  metadata: {
+    signature_class: 'ClassifyText',
+    optimizer: 'MIPROv2',
+    examples_count: examples.size
+  }
+)
+puts "Stored with ID: #{saved.program_id}"
+# Load
+saved = storage.load_program(program_id)
+predictor = saved.program
+score = saved.optimization_result[:best_score_value]
+# List
+storage.list_programs.each do |p|
+  puts "#{p[:program_id]} -- score: #{p[:best_score]} -- saved: #{p[:saved_at]}"
+end
+```
+### StorageManager (recommended)
+```ruby
+manager = DSPy::Storage::StorageManager.new
+# Save with tags
+saved = manager.save_optimization_result(
+  result,
+  tags: ['production', 'sentiment-analysis'],
+  description: 'Optimized sentiment classifier v2'
+)
+# Find programs
+programs = manager.find_programs(
+  optimizer: 'MIPROv2',
+  min_score: 0.85,
+  tags: ['production']
+)
+recent = manager.find_programs(
+  max_age_days: 7,
+  signature_class: 'ClassifyText'
+)
+# Get best program for a signature
+best = manager.get_best_program('ClassifyText')
+predictor = best.program
+```
+Global shorthand:
+```ruby
+DSPy::Storage::StorageManager.save(result, metadata: { version: '2.0' })
+DSPy::Storage::StorageManager.load(program_id)
+DSPy::Storage::StorageManager.best('ClassifyText')
+```
+### Checkpoints
+Create and restore checkpoints during long-running optimizations:
+```ruby
+# Save a checkpoint
+manager.create_checkpoint(
+  current_result,
+  'iteration_50',
+  metadata: { iteration: 50, current_score: 0.87 }
+)
+# Restore
+restored = manager.restore_checkpoint('iteration_50')
+program = restored.program
+# Auto-checkpoint every N iterations
+if iteration % 10 == 0
+  manager.create_checkpoint(current_result, "auto_checkpoint_#{iteration}")
+end
+```
+### Import and export
+Share programs between environments:
+```ruby
+storage = DSPy::Storage::ProgramStorage.new
+# Export
+storage.export_programs(['abc123', 'def456'], './export_backup.json')
+# Import
+imported = storage.import_programs('./export_backup.json')
+puts "Imported #{imported.size} programs"
+```
+### Optimization history
+```ruby
+history = manager.get_optimization_history
+history[:summary][:total_programs]
+history[:summary][:avg_score]
+history[:optimizer_stats].each do |optimizer, stats|
+  puts "#{optimizer}: #{stats[:count]} programs, best: #{stats[:best_score]}"
+end
+history[:trends][:improvement_percentage]
+```
+### Program comparison
+```ruby
+comparison = manager.compare_programs(id_a, id_b)
+comparison[:comparison][:score_difference]
+comparison[:comparison][:better_program]
+comparison[:comparison][:age_difference_hours]
+```
+### Storage configuration
+```ruby
+config = DSPy::Storage::StorageManager::StorageConfig.new
+config.storage_path = Rails.root.join('dspy_storage')
+config.auto_save = true
+config.save_intermediate_results = false
+config.max_stored_programs = 100
+manager = DSPy::Storage::StorageManager.new(config: config)
+```
+### Cleanup
+Remove old programs. Cleanup retains the best performing and most recent programs using a weighted score (70% performance, 30% recency):
+```ruby
+deleted_count = manager.cleanup_old_programs
+```
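One way to picture the 70/30 weighting, assuming scores already lie in `[0, 1]` and recency falls off linearly over a fixed age window. Both are assumptions, since the gem's normalization is not documented here, and `StoredProgram`, `retention_score`, and `programs_to_delete` are hypothetical names.

```ruby
# Hypothetical retention ranking; the struct, the 30-day window, and the linear
# recency normalization are illustrative assumptions, not DSPy.rb internals.
StoredProgram = Struct.new(:id, :best_score, :age_days)

def retention_score(program, max_age_days: 30.0)
  # 1.0 for a brand-new program, 0.0 at or beyond the age window.
  recency = 1.0 - [program.age_days / max_age_days, 1.0].min
  0.7 * program.best_score + 0.3 * recency
end

# Keep the `keep` highest-ranked programs and return the rest as deletable.
def programs_to_delete(programs, keep:)
  programs.sort_by { |program| -retention_score(program) }.drop(keep)
end

programs = [
  StoredProgram.new("old_strong", 0.92, 30),
  StoredProgram.new("new_weak",   0.60, 0),
  StoredProgram.new("old_weak",   0.55, 25)
]

programs_to_delete(programs, keep: 2).each { |program| puts "delete #{program.id}" }
```

With these numbers a fresh but weaker program outranks an old strong one (0.72 against roughly 0.64), so the recency term is worth checking before relying on cleanup to preserve a known-good program.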
+### Storage events
+The storage system emits structured log events for monitoring:
+- `dspy.storage.save_start`, `dspy.storage.save_complete`, `dspy.storage.save_error`
+- `dspy.storage.load_start`, `dspy.storage.load_complete`, `dspy.storage.load_error`
+- `dspy.storage.delete`, `dspy.storage.export`, `dspy.storage.import`, `dspy.storage.cleanup`
+### File layout
+```
+dspy_storage/
+  programs/
+    abc123def456.json
+    789xyz012345.json
+  history.json
+```
+---
+## API rules
+- Call predictors with `.call()`, not `.forward()`.
+- Access prediction fields with dot notation (`result.answer`), not hash notation (`result[:answer]`).
+- GEPA metrics return `DSPy::Prediction.new(score:, feedback:)`, not a boolean.
+- MIPROv2 metrics may return `true`/`false`, a numeric score, or `DSPy::Prediction`.