@synsci/cli-darwin-x64 1.1.73 → 1.1.74
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/llm-as-judge-evaluation/SKILL.md +385 -0
- package/bin/skills/llm-as-judge-evaluation/references/pairwise-comparison.md +95 -0
- package/bin/skills/llm-as-judge-evaluation/references/scoring-rubrics.md +169 -0
- package/bin/skills/model-economics/SKILL.md +238 -0
- package/bin/skills/training-data-pipeline/SKILL.md +427 -0
- package/bin/skills/training-data-pipeline/references/data-quality.md +136 -0
- package/bin/skills/training-data-pipeline/references/frontier-distillation.md +129 -0
- package/bin/skills/training-data-pipeline/references/production-data-formatting.md +126 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
|
@@ -0,0 +1,385 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: llm-as-judge-evaluation
|
|
3
|
+
description: Evaluate LLM outputs using frontier models as judges. Use for pairwise model comparison, quality scoring with custom rubrics, and automated evaluation pipelines. Covers position bias mitigation, statistical significance, and generating preference data for DPO/RLHF.
|
|
4
|
+
version: 1.0.0
|
|
5
|
+
author: Synthetic Sciences
|
|
6
|
+
license: MIT
|
|
7
|
+
tags: [Evaluation, LLM-as-Judge, Pairwise Comparison, Quality Assessment, Rubric Design, Model Comparison, Automated Evaluation]
|
|
8
|
+
dependencies: [openai, anthropic, datasets, numpy]
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
# LLM-as-Judge Evaluation
|
|
12
|
+
|
|
13
|
+
## When to Use This Skill
|
|
14
|
+
|
|
15
|
+
Use LLM-as-Judge evaluation when you need to:
|
|
16
|
+
- **Compare a fine-tuned model vs frontier** — Does the student beat the teacher on your task?
|
|
17
|
+
- **Quality gates before deployment** — Automated go/no-go on model releases
|
|
18
|
+
- **Continuous evaluation** — Monitor production model quality over time
|
|
19
|
+
- **Generate preference data** — Create (chosen, rejected) pairs for DPO/RLHF training
|
|
20
|
+
- **Evaluate without ground truth** — When exact answers don't exist (creative, open-ended tasks)
|
|
21
|
+
|
|
22
|
+
### When NOT to Use
|
|
23
|
+
- Tasks with verifiable answers (math, code execution) — use exact match or unit tests
|
|
24
|
+
- Extremely simple classification — use accuracy/F1 directly
|
|
25
|
+
- Safety evaluation — use dedicated safety benchmarks, not general judges
|
|
26
|
+
|
|
27
|
+
## Pairwise Comparison
|
|
28
|
+
|
|
29
|
+
The most reliable LLM-as-judge method. Show a judge two outputs (A and B) and ask which is better.
|
|
30
|
+
|
|
31
|
+
### Basic Implementation
|
|
32
|
+
|
|
33
|
+
```python
|
|
34
|
+
import openai
|
|
35
|
+
import json
|
|
36
|
+
import random
|
|
37
|
+
|
|
38
|
+
client = openai.OpenAI()
|
|
39
|
+
|
|
40
|
+
PAIRWISE_PROMPT = """You are an expert evaluator. Compare two responses to the same prompt.
|
|
41
|
+
|
|
42
|
+
## Task Context
|
|
43
|
+
{task_description}
|
|
44
|
+
|
|
45
|
+
## User Input
|
|
46
|
+
{user_input}
|
|
47
|
+
|
|
48
|
+
## Response A
|
|
49
|
+
{response_a}
|
|
50
|
+
|
|
51
|
+
## Response B
|
|
52
|
+
{response_b}
|
|
53
|
+
|
|
54
|
+
## Evaluation Criteria
|
|
55
|
+
{criteria}
|
|
56
|
+
|
|
57
|
+
Which response is better? Consider all criteria above.
|
|
58
|
+
Return JSON: {{"winner": "A" or "B" or "tie", "reasoning": "brief explanation"}}"""
|
|
59
|
+
|
|
60
|
+
|
|
61
|
+
def pairwise_compare(user_input, response_a, response_b, task_description, criteria,
                     model="gpt-4o", swap_positions=True):
    """Compare two responses with position bias mitigation.

    Args:
        user_input: The prompt both responses answered.
        response_a: First candidate response.
        response_b: Second candidate response.
        task_description: What the models are supposed to do.
        criteria: Evaluation criteria string shown to the judge.
        model: Judge model name.
        swap_positions: If True, judge a second time with the positions
            swapped and require both verdicts to agree; disagreement is
            collapsed to "tie".

    Returns:
        "A", "B", or "tie".
    """
    def _judge(first, second):
        # Single judge call; returns "A"/"B"/"tie" relative to the order
        # (first, second) as presented in the prompt. Extracted so the
        # forward and swapped passes share one code path instead of two
        # copy-pasted API calls.
        prompt = PAIRWISE_PROMPT.format(
            task_description=task_description,
            user_input=user_input,
            response_a=first,
            response_b=second,
            criteria=criteria,
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0,  # deterministic judging
        )
        return json.loads(resp.choices[0].message.content)["winner"]

    # First comparison: A in position 1, B in position 2.
    results = [_judge(response_a, response_b)]

    if swap_positions:
        # Second comparison with positions swapped to detect position bias;
        # map the swapped verdict back to the original labels.
        raw = _judge(response_b, response_a)
        results.append({"A": "B", "B": "A", "tie": "tie"}[raw])

    # Aggregate: both passes must agree, otherwise the result is inconclusive.
    if len(set(results)) == 1:
        return results[0]
    return "tie"
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
### Running a Full Evaluation
|
|
110
|
+
|
|
111
|
+
```python
|
|
112
|
+
def evaluate_model_pair(eval_set, model_a_fn, model_b_fn, task_description, criteria,
                        judge_model="gpt-4o"):
    """Run pairwise evaluation across an entire eval set.

    Args:
        eval_set: List of {"input": str, "reference": str (optional)}
        model_a_fn: Function(input) -> str (e.g., frontier model)
        model_b_fn: Function(input) -> str (e.g., fine-tuned model)
        task_description: What the models are supposed to do
        criteria: Evaluation criteria string
        judge_model: Which model to use as judge

    Returns:
        (report, details): aggregate win-rate statistics and a per-example
        record of responses and verdicts.

    Raises:
        ValueError: If eval_set is empty (win rates would divide by zero).
    """
    if not eval_set:
        # Guard: an empty eval set would otherwise crash with
        # ZeroDivisionError when computing the rates below.
        raise ValueError("eval_set must contain at least one example")

    results = {"A": 0, "B": 0, "tie": 0}
    details = []

    for i, example in enumerate(eval_set):
        # Generate responses
        response_a = model_a_fn(example["input"])
        response_b = model_b_fn(example["input"])

        # Random assignment to positions (reduces systematic bias)
        if random.random() < 0.5:
            winner = pairwise_compare(
                example["input"], response_a, response_b,
                task_description, criteria, judge_model
            )
        else:
            raw = pairwise_compare(
                example["input"], response_b, response_a,
                task_description, criteria, judge_model
            )
            # Map the verdict back to the original A/B labels.
            winner = {"A": "B", "B": "A", "tie": "tie"}[raw]

        results[winner] += 1
        details.append({
            "input": example["input"],
            "response_a": response_a,
            "response_b": response_b,
            "winner": winner,
        })

        if (i + 1) % 20 == 0:
            print(f"Progress: {i+1}/{len(eval_set)} — A:{results['A']} B:{results['B']} Tie:{results['tie']}")

    total = sum(results.values())
    report = {
        "total_comparisons": total,
        "model_a_wins": results["A"],
        "model_b_wins": results["B"],
        "ties": results["tie"],
        "model_a_win_rate": results["A"] / total,
        "model_b_win_rate": results["B"] / total,
        "tie_rate": results["tie"] / total,
    }
    return report, details
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
## Likert Scoring (1-5 Scale)
|
|
170
|
+
|
|
171
|
+
For absolute quality assessment rather than comparison:
|
|
172
|
+
|
|
173
|
+
```python
|
|
174
|
+
LIKERT_PROMPT = """You are an expert evaluator. Rate this response on a 1-5 scale.
|
|
175
|
+
|
|
176
|
+
## Task Context
|
|
177
|
+
{task_description}
|
|
178
|
+
|
|
179
|
+
## User Input
|
|
180
|
+
{user_input}
|
|
181
|
+
|
|
182
|
+
## Response
|
|
183
|
+
{response}
|
|
184
|
+
|
|
185
|
+
## Scoring Rubric
|
|
186
|
+
{rubric}
|
|
187
|
+
|
|
188
|
+
Rate the response on each dimension. Then provide an overall score.
|
|
189
|
+
Return JSON: {{"scores": {{"dimension_name": score, ...}}, "overall": score, "reasoning": "..."}}"""
|
|
190
|
+
|
|
191
|
+
|
|
192
|
+
def likert_score(user_input, response, task_description, rubric, model="gpt-4o"):
    """Score a single response on a 1-5 Likert scale.

    Fills the Likert judging prompt, asks the judge model for a JSON
    verdict at temperature 0, and returns the parsed result dict.
    """
    fields = {
        "task_description": task_description,
        "user_input": user_input,
        "response": response,
        "rubric": rubric,
    }
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": LIKERT_PROMPT.format(**fields)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(completion.choices[0].message.content)
|
|
207
|
+
```
|
|
208
|
+
|
|
209
|
+
## Custom Rubric Design
|
|
210
|
+
|
|
211
|
+
### Template
|
|
212
|
+
|
|
213
|
+
```python
|
|
214
|
+
RUBRIC_TEMPLATE = """
|
|
215
|
+
Score 1 (Poor): {poor_description}
|
|
216
|
+
Score 2 (Below Average): {below_avg_description}
|
|
217
|
+
Score 3 (Average): {avg_description}
|
|
218
|
+
Score 4 (Good): {good_description}
|
|
219
|
+
Score 5 (Excellent): {excellent_description}
|
|
220
|
+
"""
|
|
221
|
+
|
|
222
|
+
# Example: Code generation rubric
|
|
223
|
+
CODE_RUBRIC = """
|
|
224
|
+
Dimensions:
|
|
225
|
+
1. Correctness (weight: 0.4)
|
|
226
|
+
1: Code has critical bugs, won't run
|
|
227
|
+
2: Runs but produces wrong output in common cases
|
|
228
|
+
3: Correct for common cases, fails on edge cases
|
|
229
|
+
4: Correct for all cases, minor style issues
|
|
230
|
+
5: Correct, clean, handles all edge cases
|
|
231
|
+
|
|
232
|
+
2. Efficiency (weight: 0.2)
|
|
233
|
+
1: Exponential or worse complexity
|
|
234
|
+
2: Unnecessarily slow, obvious optimization missed
|
|
235
|
+
3: Acceptable performance for typical inputs
|
|
236
|
+
4: Well-optimized, good algorithmic choices
|
|
237
|
+
5: Optimal or near-optimal solution
|
|
238
|
+
|
|
239
|
+
3. Readability (weight: 0.2)
|
|
240
|
+
1: Incomprehensible, no structure
|
|
241
|
+
2: Hard to follow, poor naming
|
|
242
|
+
3: Readable with effort, some unclear parts
|
|
243
|
+
4: Clean code, good naming and structure
|
|
244
|
+
5: Exemplary clarity, well-documented
|
|
245
|
+
|
|
246
|
+
4. Completeness (weight: 0.2)
|
|
247
|
+
1: Missing major requirements
|
|
248
|
+
2: Partial implementation
|
|
249
|
+
3: Implements core requirements
|
|
250
|
+
4: Complete with good error handling
|
|
251
|
+
5: Complete with tests, docs, error handling
|
|
252
|
+
"""
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
## Position Bias Mitigation
|
|
256
|
+
|
|
257
|
+
LLM judges tend to prefer whichever response appears first. Always mitigate this:
|
|
258
|
+
|
|
259
|
+
```python
|
|
260
|
+
def mitigated_pairwise(user_input, response_a, response_b, **kwargs):
    """Judge the pair twice — once in each presentation order — and keep
    only a verdict that survives the position swap; anything else is
    reported as a tie."""
    _flip = {"A": "B", "B": "A", "tie": "tie"}

    # Pass 1: A presented first, B second.
    verdict_forward = pairwise_compare(
        user_input, response_a, response_b, swap_positions=False, **kwargs
    )

    # Pass 2: reversed order, then map the label back to the original A/B.
    verdict_reversed = _flip[
        pairwise_compare(user_input, response_b, response_a,
                         swap_positions=False, **kwargs)
    ]

    # Only an order-independent verdict counts; disagreement is inconclusive.
    return verdict_forward if verdict_forward == verdict_reversed else "tie"
|
|
273
|
+
```
|
|
274
|
+
|
|
275
|
+
## Statistical Significance
|
|
276
|
+
|
|
277
|
+
### Bootstrap Confidence Intervals
|
|
278
|
+
|
|
279
|
+
```python
|
|
280
|
+
import numpy as np
|
|
281
|
+
|
|
282
|
+
def bootstrap_win_rate(wins, total, n_bootstrap=10000, ci=0.95, seed=None):
    """Calculate a bootstrap confidence interval for a win rate.

    Uses a parametric bootstrap: resamples win counts from a binomial
    distribution at the observed rate, then reads the interval off the
    percentiles of the resampled rates.

    Args:
        wins: Number of wins observed.
        total: Total number of comparisons (must be > 0).
        n_bootstrap: Number of bootstrap resamples.
        ci: Confidence level (e.g., 0.95 for a 95% interval).
        seed: Optional RNG seed for reproducible intervals. Defaults to
            None (non-deterministic), matching the previous behavior.

    Returns:
        Dict with win_rate, ci_lower, ci_upper, and a "significant" flag
        that is True when the interval excludes 0.5 (i.e., the result is
        distinguishable from a coin flip).

    Raises:
        ValueError: If total is not positive.
    """
    if total <= 0:
        # Guard against ZeroDivisionError on an empty comparison set.
        raise ValueError("total must be a positive number of comparisons")

    # Local Generator instead of the legacy global RNG: seedable, so
    # reported intervals can be reproduced.
    rng = np.random.default_rng(seed)
    win_rate = wins / total
    samples = rng.binomial(total, win_rate, n_bootstrap) / total

    alpha = (1 - ci) / 2
    lower = np.percentile(samples, alpha * 100)
    upper = np.percentile(samples, (1 - alpha) * 100)

    return {
        "win_rate": win_rate,
        "ci_lower": lower,
        "ci_upper": upper,
        "significant": lower > 0.5 or upper < 0.5,  # Significantly different from 50%
    }
|
|
297
|
+
```
|
|
298
|
+
|
|
299
|
+
### Minimum Sample Size
|
|
300
|
+
|
|
301
|
+
| Desired precision | Minimum samples | Notes |
|
|
302
|
+
|-------------------|-----------------|-------|
|
|
303
|
+
| Directional (which is better) | 50-100 | Rough signal |
|
|
304
|
+
| Reliable estimate (+-5%) | 200-400 | Standard evaluation |
|
|
305
|
+
| High confidence (+-2%) | 500-1000 | Production decisions |
|
|
306
|
+
| Publication quality | 1000+ | Statistical rigor |
|
|
307
|
+
|
|
308
|
+
**Rule of thumb**: Use at least 100 examples for deployment decisions, 200+ for reliable win rates.
|
|
309
|
+
|
|
310
|
+
## Generating Preference Data for DPO
|
|
311
|
+
|
|
312
|
+
Convert judge outputs to (chosen, rejected) pairs:
|
|
313
|
+
|
|
314
|
+
```python
|
|
315
|
+
def generate_dpo_pairs(eval_set, model_a_fn, model_b_fn, task_description, criteria,
                       judge_model="gpt-4o"):
    """Generate DPO training pairs from pairwise evaluation.

    Each decisive judge verdict yields one {"prompt", "chosen", "rejected"}
    record; ties carry no preference signal and are dropped.
    """
    pairs = []

    for example in eval_set:
        prompt_text = example["input"]
        candidate_a = model_a_fn(prompt_text)
        candidate_b = model_b_fn(prompt_text)

        verdict = pairwise_compare(
            prompt_text, candidate_a, candidate_b,
            task_description, criteria, judge_model
        )

        if verdict == "tie":
            continue  # Ties provide no preference signal for DPO

        winner_is_a = verdict == "A"
        pairs.append({
            "prompt": prompt_text,
            "chosen": candidate_a if winner_is_a else candidate_b,
            "rejected": candidate_b if winner_is_a else candidate_a,
        })

    print(f"Generated {len(pairs)} DPO pairs from {len(eval_set)} examples "
          f"({len(eval_set) - len(pairs)} ties skipped)")
    return pairs
|
|
344
|
+
```
|
|
345
|
+
|
|
346
|
+
## Multi-Judge Ensemble
|
|
347
|
+
|
|
348
|
+
Use multiple judge models for higher reliability:
|
|
349
|
+
|
|
350
|
+
```python
|
|
351
|
+
def multi_judge_compare(user_input, response_a, response_b, task_description, criteria,
                        judges=None):
    """Use multiple judge models and take majority vote.

    Returns a dict with the winning label, the fraction of judges that
    voted for it, the full vote tally, and per-judge verdicts.
    """
    from collections import Counter

    # Falsy check (not `is None`) so an empty list also falls back to
    # the default judge panel, as before.
    judges = judges or ["gpt-4o", "claude-sonnet-4-5-20250929"]

    votes = [
        pairwise_compare(
            user_input, response_a, response_b,
            task_description, criteria, model=judge
        )
        for judge in judges
    ]

    tally = Counter(votes)
    top_label, top_count = tally.most_common(1)[0]

    return {
        "winner": top_label,
        "confidence": top_count / len(votes),
        "votes": dict(tally),
        "judge_details": list(zip(judges, votes)),
    }
|
|
375
|
+
```
|
|
376
|
+
|
|
377
|
+
## Quick Start Checklist
|
|
378
|
+
|
|
379
|
+
1. **Define criteria**: Write a rubric specific to your task
|
|
380
|
+
2. **Prepare eval set**: 100+ held-out examples with production inputs
|
|
381
|
+
3. **Generate responses**: Run both models on the eval set
|
|
382
|
+
4. **Run pairwise comparison**: With position bias mitigation
|
|
383
|
+
5. **Check significance**: Bootstrap CI on win rate
|
|
384
|
+
6. **Decision gate**: Student wins > 50% -> proceed to deploy
|
|
385
|
+
7. **Save preference data**: Use decisive wins/losses as (chosen, rejected) pairs for DPO training — ties are skipped
|
|
@@ -0,0 +1,95 @@
|
|
|
1
|
+
# Pairwise Comparison Reference
|
|
2
|
+
|
|
3
|
+
## Overview
|
|
4
|
+
|
|
5
|
+
Pairwise comparison is the most reliable LLM-as-judge method. Instead of asking
|
|
6
|
+
"how good is this response?", you ask "which of these two responses is better?" —
|
|
7
|
+
a much easier judgment task that produces more consistent results.
|
|
8
|
+
|
|
9
|
+
## Why Pairwise > Likert
|
|
10
|
+
|
|
11
|
+
| Aspect | Pairwise | Likert (1-5) |
|
|
12
|
+
|--------|----------|--------------|
|
|
13
|
+
| Inter-annotator agreement | High | Low-moderate |
|
|
14
|
+
| Calibration needed | No | Yes (what does "4" mean?) |
|
|
15
|
+
| Position bias | Mitigatable (swap) | N/A (single response) |
|
|
16
|
+
| Sensitivity | High (detects small differences) | Low (coarse scale) |
|
|
17
|
+
| Cost per comparison | 2x (need swap) | 1x |
|
|
18
|
+
| Best for | A/B testing, model selection | Monitoring, thresholds |
|
|
19
|
+
|
|
20
|
+
## Advanced: Chain-of-Thought Judging
|
|
21
|
+
|
|
22
|
+
Better results when the judge explains its reasoning before deciding:
|
|
23
|
+
|
|
24
|
+
```python
|
|
25
|
+
COT_PAIRWISE_PROMPT = """You are an expert evaluator comparing two responses.
|
|
26
|
+
|
|
27
|
+
## Task: {task_description}
|
|
28
|
+
|
|
29
|
+
## Input: {user_input}
|
|
30
|
+
|
|
31
|
+
## Response A
|
|
32
|
+
{response_a}
|
|
33
|
+
|
|
34
|
+
## Response B
|
|
35
|
+
{response_b}
|
|
36
|
+
|
|
37
|
+
## Evaluation Criteria
|
|
38
|
+
{criteria}
|
|
39
|
+
|
|
40
|
+
Think step by step:
|
|
41
|
+
1. Analyze Response A's strengths and weaknesses
|
|
42
|
+
2. Analyze Response B's strengths and weaknesses
|
|
43
|
+
3. Compare on each criterion
|
|
44
|
+
4. Make your final judgment
|
|
45
|
+
|
|
46
|
+
Return JSON:
|
|
47
|
+
{{
|
|
48
|
+
"analysis_a": "strengths and weaknesses of A",
|
|
49
|
+
"analysis_b": "strengths and weaknesses of B",
|
|
50
|
+
"comparison": "criterion-by-criterion comparison",
|
|
51
|
+
"winner": "A" or "B" or "tie",
|
|
52
|
+
"confidence": "high" or "medium" or "low"
|
|
53
|
+
}}"""
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
## Handling Ties
|
|
57
|
+
|
|
58
|
+
Tie rates inform evaluation quality:
|
|
59
|
+
|
|
60
|
+
| Tie Rate | Interpretation | Action |
|
|
61
|
+
|----------|---------------|--------|
|
|
62
|
+
| < 10% | Clear quality difference | Good signal |
|
|
63
|
+
| 10-30% | Models are close | Normal, increase sample size |
|
|
64
|
+
| 30-50% | Very similar quality | May need finer-grained criteria |
|
|
65
|
+
| > 50% | Criteria too vague | Rewrite rubric with specific anchors |
|
|
66
|
+
|
|
67
|
+
## Reference-Based Comparison
|
|
68
|
+
|
|
69
|
+
When you have a ground-truth reference, include it for more accurate judging:
|
|
70
|
+
|
|
71
|
+
```python
|
|
72
|
+
REFERENCE_PAIRWISE_PROMPT = """Compare two responses against a known correct reference.
|
|
73
|
+
|
|
74
|
+
## Input: {user_input}
|
|
75
|
+
|
|
76
|
+
## Reference (ground truth)
|
|
77
|
+
{reference}
|
|
78
|
+
|
|
79
|
+
## Response A
|
|
80
|
+
{response_a}
|
|
81
|
+
|
|
82
|
+
## Response B
|
|
83
|
+
{response_b}
|
|
84
|
+
|
|
85
|
+
Which response is more faithful to the reference while remaining helpful?
|
|
86
|
+
Return JSON: {{"winner": "A" or "B" or "tie", "reasoning": "..."}}"""
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
## Common Pitfalls
|
|
90
|
+
|
|
91
|
+
1. **Length bias**: Judges prefer longer responses. Add "conciseness" to criteria.
|
|
92
|
+
2. **Format bias**: Judges prefer markdown/structured responses. Normalize formatting.
|
|
93
|
+
3. **Sycophancy**: Judges prefer responses that agree with the user. Use neutral criteria.
|
|
94
|
+
4. **Self-preference**: GPT-4 may prefer GPT-4 style. Use Claude as judge for GPT outputs and vice versa.
|
|
95
|
+
5. **Instruction following vs quality**: Separate these in your rubric.
|
|
@@ -0,0 +1,169 @@
|
|
|
1
|
+
# Scoring Rubrics Reference
|
|
2
|
+
|
|
3
|
+
## Overview
|
|
4
|
+
|
|
5
|
+
A good rubric is the difference between noisy and reliable LLM-as-judge evaluation.
|
|
6
|
+
This reference provides rubric templates for common evaluation scenarios.
|
|
7
|
+
|
|
8
|
+
## Rubric Design Principles
|
|
9
|
+
|
|
10
|
+
1. **Specific anchors**: Each score level must describe observable behavior, not vague quality
|
|
11
|
+
2. **Independent dimensions**: Criteria should not overlap (avoid "quality" and "helpfulness")
|
|
12
|
+
3. **Weighted dimensions**: Not all criteria matter equally — assign weights
|
|
13
|
+
4. **Calibration examples**: Include 2-3 example responses with their expected scores
|
|
14
|
+
5. **Task-aligned**: The rubric should match what your users actually care about
|
|
15
|
+
|
|
16
|
+
## Template: General Quality
|
|
17
|
+
|
|
18
|
+
```
|
|
19
|
+
Dimensions (all weighted equally unless specified):
|
|
20
|
+
|
|
21
|
+
1. Accuracy
|
|
22
|
+
1: Contains factual errors or hallucinations
|
|
23
|
+
2: Mostly correct but with notable inaccuracies
|
|
24
|
+
3: Factually correct on main points, minor issues
|
|
25
|
+
4: Accurate and well-supported claims
|
|
26
|
+
5: Perfectly accurate with appropriate caveats
|
|
27
|
+
|
|
28
|
+
2. Relevance
|
|
29
|
+
1: Does not address the user's question
|
|
30
|
+
2: Partially relevant, misses key aspects
|
|
31
|
+
3: Addresses the main question adequately
|
|
32
|
+
4: Comprehensive coverage of the topic
|
|
33
|
+
5: Precisely addresses every aspect of the question
|
|
34
|
+
|
|
35
|
+
3. Clarity
|
|
36
|
+
1: Confusing, poorly organized
|
|
37
|
+
2: Understandable but hard to follow
|
|
38
|
+
3: Clear and logically organized
|
|
39
|
+
4: Well-structured with good flow
|
|
40
|
+
5: Exceptionally clear, easy to scan and understand
|
|
41
|
+
|
|
42
|
+
4. Conciseness
|
|
43
|
+
1: Extremely verbose, buries the answer
|
|
44
|
+
2: Contains significant unnecessary content
|
|
45
|
+
3: Appropriate length for the question
|
|
46
|
+
4: Efficiently communicated
|
|
47
|
+
5: Optimal length — every word serves a purpose
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
## Template: Code Generation
|
|
51
|
+
|
|
52
|
+
```
|
|
53
|
+
Dimensions:
|
|
54
|
+
|
|
55
|
+
1. Correctness (weight: 0.40)
|
|
56
|
+
1: Won't compile/run, fundamental logic errors
|
|
57
|
+
2: Runs but fails on basic test cases
|
|
58
|
+
3: Handles common cases correctly
|
|
59
|
+
4: Handles edge cases, good error handling
|
|
60
|
+
5: Correct, robust, handles all specified requirements
|
|
61
|
+
|
|
62
|
+
2. Code Quality (weight: 0.25)
|
|
63
|
+
1: Unreadable, no structure
|
|
64
|
+
2: Poor naming, minimal structure
|
|
65
|
+
3: Acceptable style, reasonable naming
|
|
66
|
+
4: Clean, well-organized, follows conventions
|
|
67
|
+
5: Exemplary code that teaches best practices
|
|
68
|
+
|
|
69
|
+
3. Efficiency (weight: 0.15)
|
|
70
|
+
1: Exponential complexity or worse
|
|
71
|
+
2: Unnecessarily slow (wrong algorithm choice)
|
|
72
|
+
3: Acceptable for typical input sizes
|
|
73
|
+
4: Well-optimized, appropriate algorithms
|
|
74
|
+
5: Optimal or near-optimal solution
|
|
75
|
+
|
|
76
|
+
4. Completeness (weight: 0.20)
|
|
77
|
+
1: Missing major requirements
|
|
78
|
+
2: Partial implementation, key gaps
|
|
79
|
+
3: Core requirements met
|
|
80
|
+
4: Complete with error handling
|
|
81
|
+
5: Complete with tests, docs, and error handling
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
## Template: Customer Support
|
|
85
|
+
|
|
86
|
+
```
|
|
87
|
+
Dimensions:
|
|
88
|
+
|
|
89
|
+
1. Problem Resolution (weight: 0.40)
|
|
90
|
+
1: Does not address the customer's issue
|
|
91
|
+
2: Acknowledges issue but provides wrong solution
|
|
92
|
+
3: Provides a valid solution that may not be optimal
|
|
93
|
+
4: Provides the best available solution
|
|
94
|
+
5: Resolves issue and proactively prevents related problems
|
|
95
|
+
|
|
96
|
+
2. Tone & Empathy (weight: 0.25)
|
|
97
|
+
1: Rude, dismissive, or robotic
|
|
98
|
+
2: Professional but cold
|
|
99
|
+
3: Friendly and professional
|
|
100
|
+
4: Warm, empathetic, personalized
|
|
101
|
+
5: Exceptional rapport while maintaining professionalism
|
|
102
|
+
|
|
103
|
+
3. Accuracy (weight: 0.20)
|
|
104
|
+
1: Contains incorrect information about products/policies
|
|
105
|
+
2: Mostly correct with some errors
|
|
106
|
+
3: Factually accurate
|
|
107
|
+
4: Accurate with helpful additional context
|
|
108
|
+
5: Perfectly accurate with relevant links/resources
|
|
109
|
+
|
|
110
|
+
4. Efficiency (weight: 0.15)
|
|
111
|
+
1: Requires multiple follow-ups for basic resolution
|
|
112
|
+
2: Could be more direct
|
|
113
|
+
3: Reasonable number of steps to resolution
|
|
114
|
+
4: Efficient resolution path
|
|
115
|
+
5: Resolves in minimum possible interactions
|
|
116
|
+
```
|
|
117
|
+
|
|
118
|
+
## Template: Summarization
|
|
119
|
+
|
|
120
|
+
```
|
|
121
|
+
Dimensions:
|
|
122
|
+
|
|
123
|
+
1. Faithfulness (weight: 0.35)
|
|
124
|
+
1: Contains hallucinated information not in source
|
|
125
|
+
2: Mostly faithful but adds unsupported claims
|
|
126
|
+
3: Faithful to source material
|
|
127
|
+
4: Accurately represents source with proper nuance
|
|
128
|
+
5: Perfectly faithful, captures nuance and caveats
|
|
129
|
+
|
|
130
|
+
2. Coverage (weight: 0.30)
|
|
131
|
+
1: Misses most key points
|
|
132
|
+
2: Captures some key points, misses important ones
|
|
133
|
+
3: Covers main points adequately
|
|
134
|
+
4: Comprehensive coverage of key information
|
|
135
|
+
5: Captures all important points and relationships
|
|
136
|
+
|
|
137
|
+
3. Coherence (weight: 0.20)
|
|
138
|
+
1: Disjointed, hard to follow
|
|
139
|
+
2: Some logical flow issues
|
|
140
|
+
3: Reads smoothly
|
|
141
|
+
4: Well-organized with clear structure
|
|
142
|
+
5: Exemplary narrative flow
|
|
143
|
+
|
|
144
|
+
4. Conciseness (weight: 0.15)
|
|
145
|
+
1: As long as original (no compression)
|
|
146
|
+
2: Minimal compression, includes unnecessary details
|
|
147
|
+
3: Reasonable length reduction
|
|
148
|
+
4: Well-compressed, only essential information
|
|
149
|
+
5: Maximum information density, every word counts
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
## Composite Scoring
|
|
153
|
+
|
|
154
|
+
```python
|
|
155
|
+
def weighted_score(scores, weights):
    """Calculate weighted composite score from dimension scores.

    Args:
        scores: dict of {"dimension": score} (1-5)
        weights: dict of {"dimension": weight} (sums to 1.0)

    Returns:
        The weighted sum of dimension scores, rounded to 2 decimals.

    Raises:
        KeyError: If a dimension in `scores` has no entry in `weights`.
    """
    total = sum(scores[dim] * weights[dim] for dim in scores)
    return round(total, 2)

# Example
scores = {"correctness": 4, "quality": 3, "efficiency": 5, "completeness": 4}
weights = {"correctness": 0.4, "quality": 0.25, "efficiency": 0.15, "completeness": 0.2}
# 4*0.4 + 3*0.25 + 5*0.15 + 4*0.2 = 3.9 (the comment previously said 3.85)
composite = weighted_score(scores, weights)  # 3.9
|
|
169
|
+
```
|