@shakudo/kaji-setup-external 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (411)

package/assets/skills/context-optimization/skills/advanced-evaluation/references/metrics-guide.md ADDED

@@ -0,0 +1,331 @@
+# Metric Selection Guide for LLM Evaluation
+This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.
+## Metric Categories
+### Classification Metrics
+Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).
+#### Precision
+```
+Precision = True Positives / (True Positives + False Positives)
+```
+**Interpretation**: Of all responses the judge said were good, what fraction were actually good?
+**Use when**: False positives are costly (e.g., approving unsafe content)
+```python
+def precision(predictions, ground_truth):
+    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
+    predicted_positives = sum(predictions)
+    return true_positives / predicted_positives if predicted_positives > 0 else 0
+```
+#### Recall
+```
+Recall = True Positives / (True Positives + False Negatives)
+```
+**Interpretation**: Of all actually good responses, what fraction did the judge identify?
+**Use when**: False negatives are costly (e.g., missing good content in filtering)
+```python
+def recall(predictions, ground_truth):
+    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
+    actual_positives = sum(ground_truth)
+    return true_positives / actual_positives if actual_positives > 0 else 0
+```
+#### F1 Score
+```
+F1 = 2 * (Precision * Recall) / (Precision + Recall)
+```
+**Interpretation**: Harmonic mean of precision and recall
+**Use when**: You need a single number balancing both concerns
+```python
+def f1_score(predictions, ground_truth):
+    p = precision(predictions, ground_truth)
+    r = recall(predictions, ground_truth)
+    return 2 * p * r / (p + r) if (p + r) > 0 else 0
+```
+### Agreement Metrics
+Use for comparing automated evaluation with human judgment.
+#### Cohen's Kappa (κ)
+```
+κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
+```
+**Interpretation**: Agreement adjusted for chance
+- κ > 0.8: Almost perfect agreement
+- κ 0.6-0.8: Substantial agreement
+- κ 0.4-0.6: Moderate agreement
+- κ < 0.4: Fair to poor agreement
+**Use for**: Binary or categorical judgments
+```python
+def cohens_kappa(judge1, judge2):
+    from sklearn.metrics import cohen_kappa_score
+    return cohen_kappa_score(judge1, judge2)
+```
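The formula above can also be computed without scikit-learn, which makes the two agreement terms explicit. This is a minimal sketch, not the package's implementation: the name `cohens_kappa_manual` and the sample labels are illustrative assumptions, and returning 1.0 when expected agreement is exactly 1 is a convention chosen for this sketch (the statistic is undefined in that case).

```python
from collections import Counter

def cohens_kappa_manual(judge1, judge2):
    """Cohen's kappa from observed and chance-expected agreement."""
    if len(judge1) != len(judge2) or not judge1:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(judge1)
    # Observed agreement: fraction of items where both judges gave the same label
    observed = sum(1 for a, b in zip(judge1, judge2) if a == b) / n
    # Expected agreement: chance that both pick the same label, given each judge's label frequencies
    counts1, counts2 = Counter(judge1), Counter(judge2)
    expected = sum(counts1[label] * counts2[label] for label in counts1) / (n * n)
    if expected == 1.0:
        # Both judges used one identical label throughout; kappa is undefined.
        # Returning 1.0 is an assumption made for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels: 1 = pass, 0 = fail
judge = [1, 1, 0, 1, 0, 0, 1, 0]
human = [1, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa_manual(judge, human)
print(f"kappa = {kappa:.2f}")
```

Here the judges agree on 6 of 8 items (observed 0.75) while chance alone would give 0.5, so κ is 0.5, which falls in the "moderate" band of the scale above.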
+#### Weighted Kappa
+For ordinal scales where disagreement severity matters:
+```python
+def weighted_kappa(judge1, judge2):
+    from sklearn.metrics import cohen_kappa_score
+    return cohen_kappa_score(judge1, judge2, weights='quadratic')
+```
+**Interpretation**: Penalizes large disagreements more than small ones
+### Correlation Metrics
+Use for ordinal/continuous scores.
+#### Spearman's Rank Correlation (ρ)
+**Interpretation**: Correlation between rankings, not absolute values
+- ρ > 0.9: Very strong correlation
+- ρ 0.7-0.9: Strong correlation
+- ρ 0.5-0.7: Moderate correlation
+- ρ < 0.5: Weak correlation
+**Use when**: Order matters more than exact values
+```python
+def spearmans_rho(scores1, scores2):
+    from scipy.stats import spearmanr
+    rho, p_value = spearmanr(scores1, scores2)
+    return {'rho': rho, 'p_value': p_value}
+```
+#### Kendall's Tau (τ)
+**Interpretation**: Similar to Spearman but based on pairwise concordance
+**Use when**: You have many tied values
+```python
+def kendalls_tau(scores1, scores2):
+    from scipy.stats import kendalltau
+    tau, p_value = kendalltau(scores1, scores2)
+    return {'tau': tau, 'p_value': p_value}
+```
+#### Pearson Correlation (r)
+**Interpretation**: Linear correlation between scores
+**Use when**: Exact score values matter, not just order
+```python
+def pearsons_r(scores1, scores2):
+    from scipy.stats import pearsonr
+    r, p_value = pearsonr(scores1, scores2)
+    return {'r': r, 'p_value': p_value}
+```
+### Pairwise Comparison Metrics
+#### Agreement Rate
+```
+Agreement = (Matching Decisions) / (Total Comparisons)
+```
+**Interpretation**: Simple percentage of agreement
+```python
+def pairwise_agreement(decisions1, decisions2):
+    matches = sum(1 for d1, d2 in zip(decisions1, decisions2) if d1 == d2)
+    return matches / len(decisions1) if decisions1 else 0
+```
+#### Position Consistency
+```
+Consistency = (Consistent across position swaps) / (Total comparisons)
+```
+**Interpretation**: How often does swapping position change the decision?
+```python
+def position_consistency(results):
+    consistent = sum(1 for r in results if r['position_consistent'])
+    return consistent / len(results) if results else 0
+```
+## Selection Decision Tree
+```
+What type of evaluation task?
+│
+├── Binary classification (pass/fail)
+│   └── Use: Precision, Recall, F1, Cohen's κ
+│
+├── Ordinal scale (1-5 rating)
+│   ├── Comparing to human judgments?
+│   │   └── Use: Spearman's ρ, Weighted κ
+│   └── Comparing two automated judges?
+│       └── Use: Kendall's τ, Spearman's ρ
+│
+├── Pairwise preference
+│   └── Use: Agreement rate, Position consistency
+│
+└── Multi-label classification
+    └── Use: Macro-F1, Micro-F1, Per-label metrics
+```
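The multi-label branch of the tree names Macro-F1 and Micro-F1 without defining them. The sketch below shows one common way to compute both, assuming each prediction and each ground-truth entry is a set of labels; the function names and the sample labels are hypothetical and do not come from the package.

```python
def per_label_counts(predictions, ground_truth, labels):
    """Count true positives, false positives, and false negatives per label."""
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for pred, gold in zip(predictions, ground_truth):
        for label in labels:
            if label in pred and label in gold:
                counts[label]["tp"] += 1
            elif label in pred:
                counts[label]["fp"] += 1
            elif label in gold:
                counts[label]["fn"] += 1
    return counts

def f1_from_counts(tp, fp, fn):
    """F1 written as 2TP / (2TP + FP + FN), equivalent to the harmonic mean form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def macro_micro_f1(predictions, ground_truth, labels):
    counts = per_label_counts(predictions, ground_truth, labels)
    per_label = {label: f1_from_counts(**c) for label, c in counts.items()}
    # Macro: average the per-label F1 scores, so every label counts equally
    macro = sum(per_label.values()) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate
    micro = f1_from_counts(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )
    return {"macro_f1": macro, "micro_f1": micro, "per_label": per_label}

# Hypothetical multi-label judgments
preds = [{"safe", "helpful"}, {"safe"}, {"helpful", "concise"}]
gold = [{"safe"}, {"safe", "helpful"}, {"helpful"}]
scores = macro_micro_f1(preds, gold, ["safe", "helpful", "concise"])
print(scores)
```

In this example the rarely correct `concise` label pulls Macro-F1 down to 0.5 while Micro-F1 stays near 0.67, which is why the two are usually reported together with the per-label breakdown.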
+## Metric Selection by Use Case
+### Use Case 1: Validating Automated Evaluation
+**Goal**: Ensure automated evaluation correlates with human judgment
+**Recommended Metrics**:
+1. Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
+2. Secondary: Per-criterion agreement
+3. Diagnostic: Confusion matrix for systematic errors
+```python
+def validate_automated_eval(automated_scores, human_scores, criteria):
+    results = {}
+    # Overall correlation
+    results['overall_spearman'] = spearmans_rho(automated_scores, human_scores)
+    # Per-criterion agreement
+    for criterion in criteria:
+        auto_crit = [s[criterion] for s in automated_scores]
+        human_crit = [s[criterion] for s in human_scores]
+        results[f'{criterion}_spearman'] = spearmans_rho(auto_crit, human_crit)
+    return results
+```
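The diagnostic item in the list above, a confusion matrix, is not covered by `validate_automated_eval`. A minimal sketch for categorical judgments follows; the function name and the sample labels are illustrative assumptions.

```python
from collections import Counter

def confusion_matrix(human_labels, judge_labels):
    """Count how often each (human label, judge label) pair occurs."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have equal length")
    return Counter(zip(human_labels, judge_labels))

# Hypothetical pass/fail judgments
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "pass", "pass", "fail", "pass", "pass"]
matrix = confusion_matrix(human, judge)
for (h, j), count in sorted(matrix.items()):
    print(f"human={h:<4} judge={j:<4} count={count}")
```

Off-diagonal cells that are lopsided, such as a judge that passes items humans failed but never the reverse, point to a systematic error rather than random noise.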
+### Use Case 2: Comparing Two Models
+**Goal**: Determine which model produces better outputs
+**Recommended Metrics**:
+1. Primary: Win rate (from pairwise comparison)
+2. Secondary: Position consistency (bias check)
+3. Diagnostic: Per-criterion breakdown
+```python
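+async def compare_models(model_a_outputs, model_b_outputs, prompts):
+    # Declared async because each comparison is awaited; compare_with_position_swap
+    # is assumed to be an async helper defined elsewhere in the skill.
+    results = []
+    for a, b, p in zip(model_a_outputs, model_b_outputs, prompts):
+        comparison = await compare_with_position_swap(a, b, p)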
+        results.append(comparison)
+    return {
+        'a_wins': sum(1 for r in results if r['winner'] == 'A'),
+        'b_wins': sum(1 for r in results if r['winner'] == 'B'),
+        'ties': sum(1 for r in results if r['winner'] == 'TIE'),
+        'position_consistency': position_consistency(results)
+    }
+```
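`compare_with_position_swap` is called above but not defined in this file. One way it might look is sketched here. Everything in it is an assumption: the extra `judge` parameter, a hypothetical async callable that returns `"FIRST"`, `"SECOND"`, or `"TIE"`, and the two stand-in judges used for the demonstration. Only the returned keys (`winner`, `position_consistent`) are taken from the code above.

```python
import asyncio

async def compare_with_position_swap(response_a, response_b, prompt, judge):
    """Judge the pair twice with positions swapped and reconcile the verdicts."""
    pass1, pass2 = await asyncio.gather(
        judge(prompt, response_a, response_b),  # A in first position
        judge(prompt, response_b, response_a),  # B in first position
    )
    # Translate positional verdicts back to the original A/B labels
    winner1 = {"FIRST": "A", "SECOND": "B", "TIE": "TIE"}[pass1]
    winner2 = {"FIRST": "B", "SECOND": "A", "TIE": "TIE"}[pass2]
    consistent = winner1 == winner2
    return {
        "winner": winner1 if consistent else "TIE",
        "position_consistent": consistent,
    }

async def first_position_judge(prompt, first, second):
    # Stand-in judge with pure position bias: always prefers the first slot
    return "FIRST"

async def longer_is_better_judge(prompt, first, second):
    # Stand-in judge that ignores position and prefers the longer response
    if len(first) == len(second):
        return "TIE"
    return "FIRST" if len(first) > len(second) else "SECOND"

biased = asyncio.run(
    compare_with_position_swap("short", "a longer answer", "q", first_position_judge)
)
stable = asyncio.run(
    compare_with_position_swap("short", "a longer answer", "q", longer_is_better_judge)
)
print(biased)
print(stable)
```

The position-biased judge contradicts itself across the two passes and is reported as an inconsistent tie, while the position-independent judge picks B both times. To keep the three-argument call used in `compare_models`, the judge could be bound in advance, for example with `functools.partial`.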
+### Use Case 3: Quality Monitoring
+**Goal**: Track evaluation quality over time
+**Recommended Metrics**:
+1. Primary: Rolling agreement with human spot-checks
+2. Secondary: Score distribution stability
+3. Diagnostic: Bias indicators (position, length)
+```python
+from collections import deque
+import numpy as np
+
+class QualityMonitor:
+    def __init__(self, window_size=100):
+        self.window = deque(maxlen=window_size)
+    def add_evaluation(self, automated, human_spot_check=None):
+        self.window.append({
+            'automated': automated,
+            'human': human_spot_check,
+            'length': len(automated['response'])
+        })
+    def get_metrics(self):
+        # Filter to evaluations with human spot-checks
+        with_human = [e for e in self.window if e['human'] is not None]
+        if len(with_human) < 10:
+            return {'insufficient_data': True}
+        auto_scores = [e['automated']['score'] for e in with_human]
+        human_scores = [e['human']['score'] for e in with_human]
+        return {
+            'correlation': spearmans_rho(auto_scores, human_scores),
+            'mean_difference': np.mean([a - h for a, h in zip(auto_scores, human_scores)]),
+            'length_correlation': spearmans_rho(
+                [e['length'] for e in self.window],
+                [e['automated']['score'] for e in self.window]
+            )
+        }
+```
+## Interpreting Metric Results
+### Good Evaluation System Indicators
+| Metric | Good | Acceptable | Concerning |
+|--------|------|------------|------------|
+| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
+| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
+| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
+| Length correlation | < 0.2 | 0.2-0.4 | > 0.4 |
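The thresholds in this table can be applied mechanically when generating reports. The sketch below encodes them as data; the metric keys and the `rate_metric` name are hypothetical, and boundary values (for example ρ = 0.8) are assigned to the "acceptable" band, which is one reading of the table.

```python
# metric name: (good cutoff, acceptable cutoff, higher_is_better)
THRESHOLDS = {
    "spearman_rho": (0.8, 0.6, True),
    "cohens_kappa": (0.7, 0.5, True),
    "position_consistency": (0.9, 0.8, True),
    "length_correlation": (0.2, 0.4, False),
}

def rate_metric(name, value):
    """Map a metric value to 'good', 'acceptable', or 'concerning'."""
    good, acceptable, higher_is_better = THRESHOLDS[name]
    if higher_is_better:
        if value > good:
            return "good"
        return "acceptable" if value >= acceptable else "concerning"
    # Lower is better (length correlation)
    if value < good:
        return "good"
    return "acceptable" if value <= acceptable else "concerning"

# Hypothetical measurements
report = {
    "spearman_rho": 0.76,
    "cohens_kappa": 0.74,
    "position_consistency": 0.78,
    "length_correlation": 0.12,
}
for name, value in report.items():
    print(f"{name}: {value} -> {rate_metric(name, value)}")
```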
+### Warning Signs
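+1. **High agreement but low correlation**: Scores may be clustered in a narrow range, which makes rank correlation uninformative; the reverse pattern (high correlation, low agreement) points to calibration issues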
+2. **Low position consistency**: Position bias affecting results
+3. **High length correlation**: Length bias inflating scores
+4. **Per-criterion variance**: Some criteria may be poorly defined
+## Reporting Template
+```markdown
+## Evaluation System Metrics Report
+### Human Agreement
+- Spearman's ρ: 0.82 (p < 0.001)
+- Cohen's κ: 0.74
+- Sample size: 500 evaluations
+### Bias Indicators
+- Position consistency: 91%
+- Length-score correlation: 0.12
+### Per-Criterion Performance
+| Criterion | Spearman's ρ | κ |
+|-----------|--------------|---|
+| Accuracy | 0.88 | 0.79 |
+| Clarity | 0.76 | 0.68 |
+| Completeness | 0.81 | 0.72 |
+### Recommendations
+- All metrics within acceptable ranges
+- Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement
+```

package/assets/skills/context-optimization/skills/advanced-evaluation/scripts/evaluation_example.py ADDED

@@ -0,0 +1,337 @@
+"""
+Advanced Evaluation Example
+This script demonstrates the core evaluation patterns from the advanced-evaluation skill.
+It uses plain Python with hard-coded example judge outputs, so it runs without third-party dependencies or an LLM call.
+"""
+# =============================================================================
+# DIRECT SCORING EXAMPLE
+# =============================================================================
+def direct_scoring_example():
+    """
+    Direct scoring: Rate a single response against defined criteria.
+    Best for objective criteria like accuracy, completeness, instruction following.
+    """
+    # Input
+    prompt = "Explain quantum entanglement to a high school student"
+    response = """
+    Quantum entanglement is like having two magical coins that are connected.
+    When you flip one and it lands on heads, the other instantly shows tails,
+    no matter how far apart they are. Scientists call this "spooky action at a distance."
+    """
+    criteria = [
+        {"name": "Accuracy", "description": "Scientific correctness", "weight": 0.4},
+        {"name": "Clarity", "description": "Understandable for audience", "weight": 0.3},
+        {"name": "Engagement", "description": "Interesting and memorable", "weight": 0.3}
+    ]
+    # System prompt for the evaluator
+    system_prompt = """You are an expert evaluator. Assess the response against each criterion.
+For each criterion:
+1. Find specific evidence in the response
+2. Score according to the rubric (1-5 scale)
+3. Justify your score with evidence
+4. Suggest one specific improvement
+Be objective and consistent. Base scores on explicit evidence."""
+    # User prompt structure
+    user_prompt = f"""## Original Prompt
+{prompt}
+## Response to Evaluate
+{response}
+## Criteria
+1. **Accuracy** (weight: 0.4): Scientific correctness
+2. **Clarity** (weight: 0.3): Understandable for audience
+3. **Engagement** (weight: 0.3): Interesting and memorable
+## Output Format
+Respond with valid JSON:
+{{
+  "scores": [
+    {{
+      "criterion": "Accuracy",
+      "score": 4,
+      "evidence": ["quote or observation"],
+      "justification": "why this score",
+      "improvement": "specific suggestion"
+    }}
+  ],
+  "summary": {{
+    "assessment": "overall quality summary",
+    "strengths": ["strength 1"],
+    "weaknesses": ["weakness 1"]
+  }}
+}}"""
+    # Expected output structure
+    expected_output = {
+        "scores": [
+            {
+                "criterion": "Accuracy",
+                "score": 4,
+                "evidence": ["Correctly uses analogy", "Mentions spooky action at a distance"],
+                "justification": "Core concept is correct, analogy is appropriate",
+                "improvement": "Could mention it's a quantum mechanical phenomenon"
+            },
+            {
+                "criterion": "Clarity",
+                "score": 5,
+                "evidence": ["Simple coin analogy", "No jargon"],
+                "justification": "Appropriate for high school level",
+                "improvement": "None needed"
+            },
+            {
+                "criterion": "Engagement",
+                "score": 4,
+                "evidence": ["Magical coins", "Spooky action quote"],
+                "justification": "Memorable imagery and Einstein quote",
+                "improvement": "Could add a real-world application"
+            }
+        ],
+        "summary": {
+            "assessment": "Good explanation suitable for the target audience",
+            "strengths": ["Clear analogy", "Age-appropriate language"],
+            "weaknesses": ["Could be more comprehensive"]
+        }
+    }
+    # Calculate weighted score
+    total_weight = sum(c["weight"] for c in criteria)
+    weighted_score = sum(
+        s["score"] * next(c["weight"] for c in criteria if c["name"] == s["criterion"])
+        for s in expected_output["scores"]
+    ) / total_weight
+    print(f"Weighted Score: {weighted_score:.2f}/5")
+    return expected_output
+# =============================================================================
+# PAIRWISE COMPARISON WITH POSITION BIAS MITIGATION
+# =============================================================================
+def pairwise_comparison_example():
+    """
+    Pairwise comparison: Compare two responses and select the better one.
+    Includes position swapping to mitigate position bias.
+    Best for subjective preferences like tone, style, persuasiveness.
+    """
+    prompt = "Explain machine learning to a beginner"
+    response_a = """
+    Machine learning is a subset of artificial intelligence that enables
+    systems to learn and improve from experience without being explicitly
+    programmed. It uses statistical techniques to give computers the ability
+    to identify patterns in data.
+    """
+    response_b = """
+    Imagine teaching a dog a new trick. You show the dog what to do, give
+    treats when it's right, and eventually it learns. Machine learning works
+    similarly - we show computers lots of examples, tell them when they're
+    right, and they learn to recognize patterns on their own.
+    """
+    criteria = ["clarity", "accessibility", "accuracy"]
+    # System prompt emphasizing bias awareness
+    system_prompt = """You are an expert evaluator comparing two AI responses.
+CRITICAL INSTRUCTIONS:
+- Do NOT prefer responses because they are longer
+- Do NOT prefer responses based on position (first vs second)
+- Focus ONLY on quality according to the specified criteria
+- Ties are acceptable when responses are genuinely equivalent"""
+    # Builds the user prompt for a single pass; the swapped pass reuses it with the responses reversed
+    def evaluate_pass(first_response, second_response, first_label, second_label):
+        user_prompt = f"""## Original Prompt
+{prompt}
+## Response {first_label}
+{first_response}
+## Response {second_label}
+{second_response}
+## Comparison Criteria
+{', '.join(criteria)}
+## Output Format
+{{
+  "comparison": [
+    {{"criterion": "clarity", "winner": "A|B|TIE", "reasoning": "..."}}
+  ],
+  "result": {{
+    "winner": "A|B|TIE",
+    "confidence": 0.0-1.0,
+    "reasoning": "overall reasoning"
+  }}
+}}"""
+        return user_prompt
+    # Position bias mitigation protocol
+    print("Pass 1: A in first position")
+    pass1_result = {"winner": "B", "confidence": 0.8}
+    print("Pass 2: B in first position (swapped)")
+    pass2_result = {"winner": "A", "confidence": 0.75}  # "A" is the first slot, which holds original response B in this pass
+    # Map pass2 result back (swap labels)
+    def map_winner(winner):
+        return {"A": "B", "B": "A", "TIE": "TIE"}[winner]
+    pass2_mapped = map_winner(pass2_result["winner"])
+    print(f"Pass 2 mapped winner: {pass2_mapped}")
+    # Check consistency
+    consistent = pass1_result["winner"] == pass2_mapped
+    if consistent:
+        final_result = {
+            "winner": pass1_result["winner"],
+            "confidence": (pass1_result["confidence"] + pass2_result["confidence"]) / 2,
+            "position_consistent": True
+        }
+    else:
+        final_result = {
+            "winner": "TIE",
+            "confidence": 0.5,
+            "position_consistent": False,
+            "bias_detected": True
+        }
+    print(f"\nFinal Result: {final_result}")
+    return final_result
+# =============================================================================
+# RUBRIC GENERATION
+# =============================================================================
+def rubric_generation_example():
+    """
+    Generate a domain-specific scoring rubric.
+    Rubrics reduce evaluation variance by 40-60%.
+    """
+    criterion_name = "Code Readability"
+    criterion_description = "How easy the code is to understand and maintain"
+    domain = "software engineering"
+    scale = "1-5"
+    strictness = "balanced"
+    system_prompt = f"""You are an expert in creating evaluation rubrics.
+Create clear, actionable rubrics with distinct boundaries between levels.
+Strictness: {strictness}
+- lenient: Lower bar for passing scores
+- balanced: Fair, typical expectations
+- strict: High standards, critical evaluation"""
+    user_prompt = f"""Create a scoring rubric for:
+**Criterion**: {criterion_name}
+**Description**: {criterion_description}
+**Scale**: {scale}
+**Domain**: {domain}
+Generate:
+1. Clear descriptions for each score level
+2. Specific characteristics that define each level
+3. Brief example text for each level
+4. General scoring guidelines
+5. Edge cases with guidance"""
+    # Expected rubric structure
+    rubric = {
+        "criterion": criterion_name,
+        "scale": {"min": 1, "max": 5},
+        "levels": [
+            {
+                "score": 1,
+                "label": "Poor",
+                "description": "Code is difficult to understand without significant effort",
+                "characteristics": [
+                    "No meaningful variable or function names",
+                    "No comments or documentation",
+                    "Deeply nested or convoluted logic"
+                ],
+                "example": "def f(x): return x[0]*x[1]+x[2]"
+            },
+            {
+                "score": 3,
+                "label": "Adequate",
+                "description": "Code is understandable with some effort",
+                "characteristics": [
+                    "Most variables have meaningful names",
+                    "Basic comments for complex sections",
+                    "Logic is followable but could be cleaner"
+                ],
+                "example": "def calc_total(items): # calculate sum\n    total = 0\n    for i in items: total += i\n    return total"
+            },
+            {
+                "score": 5,
+                "label": "Excellent",
+                "description": "Code is immediately clear and maintainable",
+                "characteristics": [
+                    "All names are descriptive and consistent",
+                    "Comprehensive documentation",
+                    "Clean, modular structure"
+                ],
+                "example": "def calculate_total_price(items: List[Item]) -> Decimal:\n    '''Calculate the total price of all items.'''\n    return sum(item.price for item in items)"
+            }
+        ],
+        "scoring_guidelines": [
+            "Focus on readability, not cleverness",
+            "Consider the intended audience (team skill level)",
+            "Consistency matters more than style preference"
+        ],
+        "edge_cases": [
+            {
+                "situation": "Code uses domain-specific abbreviations",
+                "guidance": "Score based on readability for domain experts, not general audience"
+            },
+            {
+                "situation": "Code is auto-generated",
+                "guidance": "Apply same standards but note in evaluation"
+            }
+        ]
+    }
+    print("Generated Rubric:")
+    for level in rubric["levels"]:
+        print(f"  {level['score']}: {level['label']} - {level['description']}")
+    return rubric
+# =============================================================================
+# MAIN
+# =============================================================================
+if __name__ == "__main__":
+    print("=" * 60)
+    print("DIRECT SCORING EXAMPLE")
+    print("=" * 60)
+    direct_scoring_example()
+    print("\n" + "=" * 60)
+    print("PAIRWISE COMPARISON EXAMPLE")
+    print("=" * 60)
+    pairwise_comparison_example()
+    print("\n" + "=" * 60)
+    print("RUBRIC GENERATION EXAMPLE")
+    print("=" * 60)
+    rubric_generation_example()
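The weighted-score step in `direct_scoring_example` works on a hard-coded dictionary. When the judge returns the JSON described in the prompt's output format, the same calculation might be wrapped as below. This is a sketch under assumptions: the `weighted_score` name and the choice to raise an error when the judge omits a criterion are illustrative and not part of the script.

```python
import json

def weighted_score(judge_json, criteria):
    """Parse the judge's JSON reply and compute the weight-normalized score."""
    parsed = json.loads(judge_json)
    weights = {c["name"]: c["weight"] for c in criteria}
    scores = {s["criterion"]: s["score"] for s in parsed["scores"]}
    # Fail loudly if the judge skipped a criterion instead of silently under-weighting
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"judge omitted criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight

criteria = [
    {"name": "Accuracy", "weight": 0.4},
    {"name": "Clarity", "weight": 0.3},
    {"name": "Engagement", "weight": 0.3},
]
# Hypothetical judge reply, reduced to the fields the calculation needs
reply = json.dumps({"scores": [
    {"criterion": "Accuracy", "score": 4},
    {"criterion": "Clarity", "score": 5},
    {"criterion": "Engagement", "score": 4},
]})
score = weighted_score(reply, criteria)
print(f"Weighted Score: {score:.2f}/5")
```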