npm - @bastani/atomic - Versions diffs - 0.6.8 → 0.7.0-2 - Mend

@bastani/atomic 0.6.8 → 0.7.0-2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (765)

package/.agents/skills/advanced-evaluation/SKILL.md DELETED

@@ -1,404 +0,0 @@
----
-name: advanced-evaluation
-description: This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment. Part of the context engineering skill suite — also activates when the user mentions "context engineering" or "context-engineering" in the context of evaluating LLM output quality.
-metadata:
-  provider: atomic
----
-# Advanced Evaluation
-This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
-**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
-## When to Activate
-Activate this skill when:
-- Building automated evaluation pipelines for LLM outputs
-- Comparing multiple model responses to select the best one
-- Establishing consistent quality standards across evaluation teams
-- Debugging evaluation systems that show inconsistent results
-- Designing A/B tests for prompt or model changes
-- Creating rubrics for human or automated evaluation
-- Analyzing correlation between automated and human judgments
-## Core Concepts
-### The Evaluation Taxonomy
-Select between two primary approaches based on whether ground truth exists:
-**Direct Scoring** — Use when objective criteria exist (factual accuracy, instruction following, toxicity). A single LLM rates one response on a defined scale. Achieves moderate-to-high reliability for well-defined criteria. Watch for score calibration drift and inconsistent scale interpretation.
-**Pairwise Comparison** — Use for subjective preferences (tone, style, persuasiveness). An LLM compares two responses and selects the better one. Achieves higher human-judge agreement than direct scoring for preference tasks (Zheng et al., 2023). Watch for position bias and length bias.
-### The Bias Landscape
-Mitigate these systematic biases in every evaluation system:
-**Position Bias**: First-position responses get preferential treatment. Mitigate by evaluating twice with swapped positions, then apply majority vote or consistency check.
-**Length Bias**: Longer responses score higher regardless of quality. Mitigate by explicitly prompting to ignore length and applying length-normalized scoring.
-**Self-Enhancement Bias**: Models rate their own outputs higher. Mitigate by using different models for generation and evaluation.
-**Verbosity Bias**: Excessive detail scores higher even when unnecessary. Mitigate with criteria-specific rubrics that penalize irrelevant detail.
-**Authority Bias**: Confident tone scores higher regardless of accuracy. Mitigate by requiring evidence citation and adding a fact-checking layer.
-### Metric Selection Framework
-Match metrics to the evaluation task structure:
-| Task Type | Primary Metrics | Secondary Metrics |
-|-----------|-----------------|-------------------|
-| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
-| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
-| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
-| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
-Prioritize systematic disagreement patterns over absolute agreement rates because a judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
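As a rough illustration of why the table lists Cohen's kappa alongside plain agreement, the sketch below computes unweighted kappa for a binary pass/fail judge against human labels. The function name `cohens_kappa` and the sample labels are made up for this example and are not part of the package.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters chose the same label
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight responses
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]
kappa = cohens_kappa(judge, human)
```

In this sample the raw agreement is 0.75, while kappa comes out at 0.5 once chance agreement is discounted.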
-## Evaluation Approaches
-### Direct Scoring Implementation
-Build direct scoring with three components: clear criteria, a calibrated scale, and structured output format.
-**Criteria Definition Pattern**:
-```
-Criterion: [Name]
-Description: [What this criterion measures]
-Weight: [Relative importance, 0-1]
-```
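One way the `Weight` field could be used is sketched below, under the assumption that per-criterion scores are combined as a weighted mean (the skill text does not specify the aggregation). `weighted_overall_score` and the criterion names are hypothetical.

```python
def weighted_overall_score(criterion_scores, weights):
    """Combine per-criterion scores into one score using relative weights."""
    missing = set(weights) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total keeps the result on the original scale
    # even when the weights do not sum to exactly 1.
    weighted_sum = sum(criterion_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical criteria scored on a 1-5 scale
weights = {"accuracy": 0.5, "completeness": 0.25, "conciseness": 0.25}
scores = {"accuracy": 4, "completeness": 5, "conciseness": 3}
overall = weighted_overall_score(scores, weights)
```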
-**Scale Calibration** — Choose scale granularity based on rubric detail:
-- 1-3: Binary with neutral option, lowest cognitive load
-- 1-5: Standard Likert, best balance of granularity and reliability
-- 1-10: Use only with detailed per-level rubrics because calibration is harder
-**Prompt Structure for Direct Scoring**:
-```
-You are an expert evaluator assessing response quality.
-## Task
-Evaluate the following response against each criterion.
-## Original Prompt
-{prompt}
-## Response to Evaluate
-{response}
-## Criteria
-{for each criterion: name, description, weight}
-## Instructions
-For each criterion:
-1. Find specific evidence in the response
-2. Score according to the rubric (1-{max} scale)
-3. Justify your score with evidence
-4. Suggest one specific improvement
-## Output Format
-Respond with structured JSON containing scores, justifications, and summary.
-```
-Always require justification before the score in all scoring prompts because research shows this improves reliability by 15-25% compared to score-first approaches.
-### Pairwise Comparison Implementation
-Apply position bias mitigation in every pairwise evaluation:
-1. First pass: Response A in first position, Response B in second
-2. Second pass: Response B in first position, Response A in second
-3. Consistency check: If passes disagree, return TIE with reduced confidence
-4. Final verdict: Consistent winner with averaged confidence
-**Prompt Structure for Pairwise Comparison**:
-```
-You are an expert evaluator comparing two AI responses.
-## Critical Instructions
-- Do NOT prefer responses because they are longer
-- Do NOT prefer responses based on position (first vs second)
-- Focus ONLY on quality according to the specified criteria
-- Ties are acceptable when responses are genuinely equivalent
-## Original Prompt
-{prompt}
-## Response A
-{response_a}
-## Response B
-{response_b}
-## Comparison Criteria
-{criteria list}
-## Instructions
-1. Analyze each response independently first
-2. Compare them on each criterion
-3. Determine overall winner with confidence level
-## Output Format
-JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
-```
-**Confidence Calibration** — Map confidence to position consistency:
-- Both passes agree: confidence = average of individual confidences
-- Passes disagree: confidence = 0.5, verdict = TIE
-### Rubric Generation
-Generate rubrics to reduce evaluation variance by 40-60% compared to open-ended scoring.
-**Include these rubric components**:
-1. **Level descriptions**: Clear boundaries for each score level
-2. **Characteristics**: Observable features that define each level
-3. **Examples**: Representative text for each level (optional but valuable)
-4. **Edge cases**: Guidance for ambiguous situations
-5. **Scoring guidelines**: General principles for consistent application
-**Set strictness calibration** for the use case:
-- **Lenient**: Lower passing bar, appropriate for encouraging iteration
-- **Balanced**: Typical production expectations
-- **Strict**: High standards for safety-critical or high-stakes evaluation
-Adapt rubrics to the domain — use domain-specific terminology. A code readability rubric mentions variables, functions, and comments. A medical accuracy rubric references clinical terminology and evidence standards.
-## Practical Guidance
-### Evaluation Pipeline Design
-Build production evaluation systems with these layers: Criteria Loader (rubrics + weights) -> Primary Scorer (direct or pairwise) -> Bias Mitigation (position swap, etc.) -> Confidence Scoring (calibration) -> Output (scores + justifications + confidence). See [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) for the full visual layout.
-### Decision Framework: Direct vs. Pairwise
-Apply this decision tree:
-```
-Is there an objective ground truth?
-+-- Yes -> Direct Scoring
-|   Examples: factual accuracy, instruction following, format compliance
-|
-+-- No -> Is it a preference or quality judgment?
-    +-- Yes -> Pairwise Comparison
-    |   Examples: tone, style, persuasiveness, creativity
-    |
-    +-- No -> Consider reference-based evaluation
-        Examples: summarization (compare to source), translation (compare to reference)
-```
-### Scaling Evaluation
-For high-volume evaluation, apply one of these strategies:
-1. **Panel of LLMs (PoLL)**: Use multiple models as judges and aggregate votes to reduce individual model bias. More expensive but more reliable for high-stakes decisions.
-2. **Hierarchical evaluation**: Use a fast cheap model for screening and an expensive model for edge cases. Requires calibration of the screening threshold.
-3. **Human-in-the-loop**: Automate clear cases and route low-confidence decisions to human review. Design feedback loops to improve automated evaluation over time.
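A minimal sketch of how strategies 1 and 3 might be combined: several judges vote, and cases with low agreement are routed to human review. `panel_verdict`, the 0.6 agreement threshold, and the vote labels are assumptions for illustration only.

```python
from collections import Counter


def panel_verdict(votes, min_agreement=0.6):
    """Aggregate judge votes and flag low-agreement cases for human review."""
    if not votes:
        raise ValueError("at least one vote is required")
    winner, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    needs_human = agreement < min_agreement
    return {
        # No automated winner is reported when the panel is too divided
        "winner": None if needs_human else winner,
        "agreement": agreement,
        "needs_human_review": needs_human,
    }


clear = panel_verdict(["A", "A", "B"])    # two of three judges agree
split = panel_verdict(["A", "B", "TIE"])  # no majority, so escalate
```

The threshold is a tuning knob: raising it sends more cases to reviewers in exchange for fewer automated mistakes.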
-## Examples
-### Example 1: Direct Scoring for Accuracy
-**Input**:
-```
-Prompt: "What causes seasons on Earth?"
-Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
-different hemispheres receive more direct sunlight at different times of year."
-Criterion: Factual Accuracy (weight: 1.0)
-Scale: 1-5
-```
-**Output**:
-```json
-{
-  "criterion": "Factual Accuracy",
-  "score": 5,
-  "evidence": [
-    "Correctly identifies axial tilt as primary cause",
-    "Correctly explains differential sunlight by hemisphere",
-    "No factual errors present"
-  ],
-  "justification": "Response accurately explains the cause of seasons with correct
-scientific reasoning. Both the axial tilt and its effect on sunlight distribution
-are correctly described.",
-  "improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
-}
-```
-### Example 2: Pairwise Comparison with Position Swap
-**Input**:
-```
-Prompt: "Explain machine learning to a beginner"
-Response A: [Technical explanation with jargon]
-Response B: [Simple analogy-based explanation]
-Criteria: ["clarity", "accessibility"]
-```
-**First Pass (A first)**:
-```json
-{ "winner": "B", "confidence": 0.8 }
-```
-**Second Pass (B first)**:
-```json
-{ "winner": "A", "confidence": 0.6 }
-```
-(Note: in the second pass the label "A" means the response shown first, which is the original Response B, so the verdict is mapped back to "B".)
-**Mapped Second Pass**:
-```json
-{ "winner": "B", "confidence": 0.6 }
-```
-**Final Result**:
-```json
-{
-  "winner": "B",
-  "confidence": 0.7,
-  "positionConsistency": {
-    "consistent": true,
-    "firstPassWinner": "B",
-    "secondPassWinner": "B"
-  }
-}
-```
-### Example 3: Rubric Generation
-**Input**:
-```
-criterionName: "Code Readability"
-criterionDescription: "How easy the code is to understand and maintain"
-domain: "software engineering"
-scale: "1-5"
-strictness: "balanced"
-```
-**Output** (abbreviated):
-```json
-{
-  "levels": [
-    {
-      "score": 1,
-      "label": "Poor",
-      "description": "Code is difficult to understand without significant effort",
-      "characteristics": [
-        "No meaningful variable or function names",
-        "No comments or documentation",
-        "Deeply nested or convoluted logic"
-      ]
-    },
-    {
-      "score": 3,
-      "label": "Adequate",
-      "description": "Code is understandable with some effort",
-      "characteristics": [
-        "Most variables have meaningful names",
-        "Basic comments present for complex sections",
-        "Logic is followable but could be cleaner"
-      ]
-    },
-    {
-      "score": 5,
-      "label": "Excellent",
-      "description": "Code is immediately clear and maintainable",
-      "characteristics": [
-        "All names are descriptive and consistent",
-        "Comprehensive documentation",
-        "Clean, modular structure"
-      ]
-    }
-  ],
-  "edgeCases": [
-    {
-      "situation": "Code is well-structured but uses domain-specific abbreviations",
-      "guidance": "Score based on readability for domain experts, not general audience"
-    }
-  ]
-}
-```
-## Guidelines
-1. **Always require justification before scores** - Chain-of-thought prompting improves reliability by 15-25%
-2. **Always swap positions in pairwise comparison** - Single-pass comparison is corrupted by position bias
-3. **Match scale granularity to rubric specificity** - Don't use 1-10 without detailed level descriptions
-4. **Separate objective and subjective criteria** - Use direct scoring for objective, pairwise for subjective
-5. **Include confidence scores** - Calibrate to position consistency and evidence strength
-6. **Define edge cases explicitly** - Ambiguous situations cause the most evaluation variance
-7. **Use domain-specific rubrics** - Generic rubrics produce generic (less useful) evaluations
-8. **Validate against human judgments** - Automated evaluation is only valuable if it correlates with human assessment
-9. **Monitor for systematic bias** - Track disagreement patterns by criterion, response type, model
-10. **Design for iteration** - Evaluation systems improve with feedback loops
-## Gotchas
-1. **Scoring without justification**: Scores lack grounding and are difficult to debug. Always require evidence-based justification before the score.
-2. **Single-pass pairwise comparison**: Position bias corrupts results when positions are not swapped. Always evaluate twice with swapped positions and check consistency.
-3. **Overloaded criteria**: Criteria that measure multiple things at once produce unreliable scores. Enforce one criterion = one measurable aspect.
-4. **Missing edge case guidance**: Evaluators handle ambiguous cases inconsistently without explicit instructions. Include edge cases in rubrics with clear resolution rules.
-5. **Ignoring confidence calibration**: High-confidence wrong judgments are worse than low-confidence ones. Calibrate confidence to position consistency and evidence strength.
-6. **Rubric drift**: Rubrics become miscalibrated as quality standards evolve or model capabilities improve. Schedule periodic rubric reviews and re-anchor score levels against fresh human-annotated examples.
-7. **Evaluation prompt sensitivity**: Minor wording changes in evaluation prompts (e.g., reordering instructions, changing phrasing) can cause 10-20% score swings. Version-control evaluation prompts and run regression tests before deploying prompt changes.
-8. **Uncontrolled length bias**: Longer responses systematically score higher even when conciseness is preferred. Add explicit length-neutrality instructions to evaluation prompts and validate with length-controlled test pairs.
-## Integration
-This skill integrates with:
-- **context-fundamentals** - Evaluation prompts require effective context structure
-- **tool-design** - Evaluation tools need proper schemas and error handling
-- **context-optimization** - Evaluation prompts can be optimized for token efficiency
-- **evaluation** (foundational) - This skill extends the foundational evaluation concepts
-## References
-Internal references:
-- [LLM-as-Judge Implementation Patterns](./references/implementation-patterns.md) - Read when: building an evaluation pipeline from scratch or integrating LLM judges into CI/CD
-- [Bias Mitigation Techniques](./references/bias-mitigation.md) - Read when: evaluation results show inconsistent or suspicious scoring patterns
-- [Metric Selection Guide](./references/metrics-guide.md) - Read when: choosing statistical metrics to validate evaluation reliability
-- [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) - Read when: designing the architecture of a multi-stage evaluation system
-External research:
-- [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) - Read when: surveying the state of the art in LLM evaluation
-- [Judging LLM-as-a-Judge (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685) - Read when: understanding position bias and MT-Bench methodology
-- [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https://arxiv.org/abs/2303.16634) - Read when: implementing chain-of-thought evaluation scoring
-- [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https://arxiv.org/abs/2305.17926) - Read when: diagnosing systematic bias in evaluation outputs
-Related skills in this collection:
-- evaluation - Foundational evaluation concepts
-- context-fundamentals - Context structure for evaluation prompts
-- tool-design - Building evaluation tools
----
-## Skill Metadata
-**Created**: 2025-12-24
-**Last Updated**: 2026-03-17
-**Author**: Agent Skills for Context Engineering Contributors
-**Version**: 2.0.0

package/.agents/skills/advanced-evaluation/references/bias-mitigation.md DELETED

@@ -1,288 +0,0 @@
-# Bias Mitigation Techniques for LLM Evaluation
-This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems.
-## Position Bias
-### The Problem
-In pairwise comparison, LLMs systematically prefer responses in certain positions. Research shows:
-- GPT has mild first-position bias (~55% preference for first position in ties)
-- Claude shows similar patterns
-- Smaller models often show stronger bias
-### Mitigation: Position Swapping Protocol
-```python
-async def position_swap_comparison(response_a, response_b, prompt, criteria):
-    # Pass 1: Original order
-    result_ab = await compare(response_a, response_b, prompt, criteria)
-    # Pass 2: Swapped order
-    result_ba = await compare(response_b, response_a, prompt, criteria)
-    # Map the second result back to the original labels (the judge's "A" was response_b, its "B" was response_a)
-    result_ba_mapped = {
-        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
-        'confidence': result_ba['confidence']
-    }
-    # Consistency check
-    if result_ab['winner'] == result_ba_mapped['winner']:
-        return {
-            'winner': result_ab['winner'],
-            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
-            'position_consistent': True
-        }
-    else:
-        # Disagreement indicates position bias was a factor
-        return {
-            'winner': 'TIE',
-            'confidence': 0.5,
-            'position_consistent': False,
-            'bias_detected': True
-        }
-```
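To see the protocol above catch a biased judge, the sketch below runs a condensed version of it against a stand-in judge that always prefers whatever is shown first. `first_position_judge` and `swap_check` are hypothetical names; the real `compare` function is not shown in this file.

```python
import asyncio


async def first_position_judge(first, second, prompt, criteria):
    """Hypothetical stand-in for compare(): always prefers the first slot."""
    return {"winner": "A", "confidence": 0.9}


async def swap_check(compare, response_a, response_b, prompt, criteria):
    """Condensed form of the position swapping protocol."""
    ab = await compare(response_a, response_b, prompt, criteria)
    ba = await compare(response_b, response_a, prompt, criteria)
    # Translate the swapped pass back to the original A/B labels
    ba_winner = {"A": "B", "B": "A", "TIE": "TIE"}[ba["winner"]]
    if ab["winner"] == ba_winner:
        confidence = (ab["confidence"] + ba["confidence"]) / 2
        return {"winner": ab["winner"], "confidence": confidence}
    # The two passes disagree, so position likely drove the verdict
    return {"winner": "TIE", "confidence": 0.5}


result = asyncio.run(
    swap_check(first_position_judge, "answer one", "answer two", "prompt", ["clarity"])
)
```

Because the stand-in judge picks the first slot in both passes, the mapped verdicts disagree and the result collapses to a tie with confidence 0.5.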
-### Alternative: Multiple Shuffles
-For higher reliability, use multiple position orderings:
-```python
-async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
-    results = []
-    for i in range(n_shuffles):
-        if i % 2 == 0:
-            r = await compare(response_a, response_b, prompt, criteria)
-        else:
-            r = await compare(response_b, response_a, prompt, criteria)
-            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
-        results.append(r)
-    # Majority vote (iterating the list rather than a set keeps tie-breaking deterministic)
-    winners = [r['winner'] for r in results]
-    final_winner = max(winners, key=winners.count)
-    agreement = winners.count(final_winner) / len(winners)
-    return {
-        'winner': final_winner,
-        'confidence': agreement,
-        'n_shuffles': n_shuffles
-    }
-```
-## Length Bias
-### The Problem
-LLMs tend to rate longer responses higher, regardless of quality. This manifests as:
-- Verbose responses receiving inflated scores
-- Concise but complete responses penalized
-- Padding and repetition being rewarded
-### Mitigation: Explicit Prompting
-Include anti-length-bias instructions in the prompt:
-```
-CRITICAL EVALUATION GUIDELINES:
-- Do NOT prefer responses because they are longer
-- Concise, complete answers are as valuable as detailed ones
-- Penalize unnecessary verbosity or repetition
-- Focus on information density, not word count
-```
-### Mitigation: Length-Normalized Scoring
-```python
-def length_normalized_score(score, response_length, target_length=500):
-    """Adjust score based on response length."""
-    length_ratio = response_length / target_length
-    if length_ratio > 2.0:
-        # Penalize excessively long responses
-        penalty = (length_ratio - 2.0) * 0.1
-        return max(score - penalty, 1)
-    elif length_ratio < 0.3:
-        # Penalize excessively short responses
-        penalty = (0.3 - length_ratio) * 0.5
-        return max(score - penalty, 1)
-    else:
-        return score
-```
-### Mitigation: Separate Length Criterion
-Make length a separate, explicit criterion so it's not implicitly rewarded:
-```python
-criteria = [
-    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
-    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
-    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3}  # Explicit
-]
-```
-## Self-Enhancement Bias
-### The Problem
-Models rate outputs generated by themselves (or similar models) higher than outputs from different models.
-### Mitigation: Cross-Model Evaluation
-Use a different model family for evaluation than generation:
-```python
-def get_evaluator_model(generator_model):
-    """Select evaluator to avoid self-enhancement bias."""
-    if 'gpt' in generator_model.lower():
-        return 'claude-4-5-sonnet'
-    # Claude and any other generator family fall through to the GPT judge
-    return 'gpt-5.2'
-```
-### Mitigation: Blind Evaluation
-Remove model attribution from responses before evaluation:
-```python
-def anonymize_response(response, model_name):
-    """Remove model-identifying patterns."""
-    patterns = [
-        f"As {model_name}",
-        "I am an AI",
-        "I don't have personal opinions",
-        # Model-specific patterns
-    ]
-    anonymized = response
-    for pattern in patterns:
-        anonymized = anonymized.replace(pattern, "[REDACTED]")
-    return anonymized
-```
-## Verbosity Bias
-### The Problem
-Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect.
-### Mitigation: Relevance-Weighted Scoring
-```python
-async def relevance_weighted_evaluation(response, prompt, criteria):
-    # First, assess relevance of each segment
-    relevance_scores = await assess_relevance(response, prompt)
-    # Weight evaluation by relevance
-    segments = split_into_segments(response)
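-    weighted_scores = []
-    total_relevance = 0.0
-    for segment, relevance in zip(segments, relevance_scores):
-        if relevance > 0.5:  # Only count relevant segments
-            score = await evaluate_segment(segment, prompt, criteria)
-            weighted_scores.append(score * relevance)
-            total_relevance += relevance
-    if total_relevance == 0:
-        return None  # No segment was relevant enough to score
-    # Divide by the summed relevance so the result stays on the original score scale
-    return sum(weighted_scores) / total_relevance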
-```
-### Mitigation: Rubric with Verbosity Penalty
-Include explicit verbosity penalties in rubrics:
-```python
-rubric_levels = [
-    {
-        "score": 5,
-        "description": "Complete and concise. All necessary information, nothing extraneous.",
-        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
-    },
-    {
-        "score": 3,
-        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
-        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
-    },
-    # ... etc
-]
-```
-## Authority Bias
-### The Problem
-Confident, authoritative tone is rated higher regardless of accuracy.
-### Mitigation: Evidence Requirement
-Require explicit evidence for claims:
-```
-For each claim in the response:
-1. Identify whether it's a factual claim
-2. Note if evidence or sources are provided
-3. Score based on verifiability, not confidence
-IMPORTANT: Confident claims without evidence should NOT receive higher scores than
-hedged claims with evidence.
-```
-### Mitigation: Fact-Checking Layer
-Add a fact-checking step before scoring:
-```python
-import asyncio
-async def fact_checked_evaluation(response, prompt, criteria):
-    # Extract claims
-    claims = await extract_claims(response)
-    # Fact-check each claim concurrently
-    fact_check_results = await asyncio.gather(*[
-        verify_claim(claim) for claim in claims
-    ])
-    # Adjust score based on fact-check results (no claims means nothing to penalize)
-    if fact_check_results:
-        accuracy_factor = sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
-    else:
-        accuracy_factor = 1.0
-    base_score = await evaluate(response, prompt, criteria)
-    return base_score * (0.7 + 0.3 * accuracy_factor)  # At least 70% of score
-```
-## Aggregate Bias Detection
-Monitor for systematic biases in production:
-```python
-class BiasMonitor:
-    def __init__(self):
-        self.evaluations = []
-    def record(self, evaluation):
-        self.evaluations.append(evaluation)
-    def detect_position_bias(self):
-        """Detect if first position wins more often than expected."""
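-        if not self.evaluations:
-            return {'bias_detected': False, 'z_score': 0.0}
-        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
-        expected = len(self.evaluations) * 0.5
-        # Binomial std dev under p = 0.5: sqrt(n * 0.25), which equals sqrt(expected * 0.5)
-        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
-        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}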
-    def detect_length_bias(self):
-        """Detect if longer responses score higher."""
-        from scipy.stats import spearmanr
-        lengths = [e['response_length'] for e in self.evaluations]
-        scores = [e['score'] for e in self.evaluations]
-        corr, p_value = spearmanr(lengths, scores)
-        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}
-```
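The z-score in `detect_position_bias` can be checked by hand. The sketch below isolates that arithmetic in a hypothetical helper, `position_bias_z_score`, assuming each comparison is an independent trial with a 50% chance of a first-position win when no bias is present.

```python
import math


def position_bias_z_score(first_position_wins, total_comparisons):
    """Z-score of first-position wins against a fair 50/50 baseline."""
    if total_comparisons <= 0:
        raise ValueError("total_comparisons must be positive")
    expected = total_comparisons * 0.5
    # Binomial standard deviation with p = 0.5: sqrt(n * p * (1 - p))
    std_dev = math.sqrt(total_comparisons * 0.25)
    return (first_position_wins - expected) / std_dev


# Hypothetical tally: 62 first-position wins out of 100 comparisons
z = position_bias_z_score(62, 100)  # (62 - 50) / 5
flagged = abs(z) > 2
```

With 62 wins out of 100 the z-score is 2.4, which crosses the threshold of 2 used by the monitor above.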
-## Summary Table
-| Bias | Primary Mitigation | Secondary Mitigation | Detection Method |
-|------|-------------------|---------------------|------------------|
-| Position | Position swapping | Multiple shuffles | Consistency check |
-| Length | Explicit prompting | Length normalization | Length-score correlation |
-| Self-enhancement | Cross-model evaluation | Anonymization | Model comparison study |
-| Verbosity | Relevance weighting | Rubric penalties | Relevance scoring |
-| Authority | Evidence requirement | Fact-checking layer | Confidence-accuracy correlation |