npm - @bastani/atomic - Versions diffs - 0.5.11-0 → 0.5.12-0 - Mend

@bastani/atomic 0.5.11-0 → 0.5.12-0


Files changed (506)

package/.agents/skills/advanced-evaluation/references/bias-mitigation.md ADDED

@@ -0,0 +1,288 @@
+# Bias Mitigation Techniques for LLM Evaluation
+This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems.
+## Position Bias
+### The Problem
+In pairwise comparison, LLM judges systematically prefer responses in certain positions. Reported patterns include:
+- GPT models show a mild first-position bias (~55% preference for the first position when the responses are otherwise tied)
+- Claude models show a similar pattern
+- Smaller models often show a stronger bias
+### Mitigation: Position Swapping Protocol
+```python
+async def position_swap_comparison(response_a, response_b, prompt, criteria):
+    # Pass 1: Original order
+    result_ab = await compare(response_a, response_b, prompt, criteria)
+    # Pass 2: Swapped order
+    result_ba = await compare(response_b, response_a, prompt, criteria)
+    # Relabel the swapped result: in pass 2, 'A' means response_b and 'B' means response_a
+    result_ba_mapped = {
+        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
+        'confidence': result_ba['confidence']
+    }
+    # Consistency check
+    if result_ab['winner'] == result_ba_mapped['winner']:
+        return {
+            'winner': result_ab['winner'],
+            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
+            'position_consistent': True
+        }
+    else:
+        # Disagreement indicates position bias was a factor
+        return {
+            'winner': 'TIE',
+            'confidence': 0.5,
+            'position_consistent': False,
+            'bias_detected': True
+        }
+```
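The protocol above can be exercised without a real model. The sketch below is a minimal illustration, not part of the package: `biased_compare` is a hypothetical stand-in for the LLM judge that always picks whichever response is shown first, and `swap_check` is a condensed restatement of the swap-and-relabel logic.

```python
import asyncio

# Hypothetical stand-in for an LLM judge: it always prefers whichever
# response is shown first, i.e. it is fully position-biased.
async def biased_compare(first, second, prompt, criteria):
    return {"winner": "A", "confidence": 0.9}

SWAP = {"A": "B", "B": "A", "TIE": "TIE"}

async def swap_check(response_a, response_b, prompt, criteria, compare):
    ab = await compare(response_a, response_b, prompt, criteria)
    ba = await compare(response_b, response_a, prompt, criteria)
    ba_winner = SWAP[ba["winner"]]  # relabel pass 2 in terms of the original A/B
    if ab["winner"] == ba_winner:
        return {"winner": ab["winner"], "position_consistent": True}
    # The two passes disagree, so position influenced the verdict.
    return {"winner": "TIE", "position_consistent": False}

demo_result = asyncio.run(
    swap_check("short answer", "long answer", "Q?", ["quality"], biased_compare)
)
print(demo_result)
```

Because the fake judge picks the first slot in both passes, the relabeled verdicts disagree and the result collapses to a tie with `position_consistent` set to `False`.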
+### Alternative: Multiple Shuffles
+For higher reliability, use multiple position orderings:
+```python
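+from collections import Counter
+async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=4):
+    # Use an even n_shuffles so both orderings are judged equally often
+    swap = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}
+    winners = []
+    for i in range(n_shuffles):
+        if i % 2 == 0:
+            r = await compare(response_a, response_b, prompt, criteria)
+            winners.append(r['winner'])
+        else:
+            r = await compare(response_b, response_a, prompt, criteria)
+            # Relabel so 'A' always refers to response_a
+            winners.append(swap[r['winner']])
+    # Majority vote; a split between the top two outcomes is reported as a TIE
+    counts = Counter(winners).most_common()
+    if len(counts) > 1 and counts[0][1] == counts[1][1]:
+        final_winner = 'TIE'
+    else:
+        final_winner = counts[0][0]
+    agreement = counts[0][1] / len(winners)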
+    return {
+        'winner': final_winner,
+        'confidence': agreement,
+        'n_shuffles': n_shuffles
+    }
+```
+## Length Bias
+### The Problem
+LLMs tend to rate longer responses higher, regardless of quality. This manifests as:
+- Verbose responses receiving inflated scores
+- Concise but complete responses penalized
+- Padding and repetition being rewarded
+### Mitigation: Explicit Prompting
+Include anti-length-bias instructions in the prompt:
+```
+CRITICAL EVALUATION GUIDELINES:
+- Do NOT prefer responses because they are longer
+- Concise, complete answers are as valuable as detailed ones
+- Penalize unnecessary verbosity or repetition
+- Focus on information density, not word count
+```
+### Mitigation: Length-Normalized Scoring
+```python
+def length_normalized_score(score, response_length, target_length=500):
+    """Adjust score based on response length."""
+    length_ratio = response_length / target_length
+    if length_ratio > 2.0:
+        # Penalize excessively long responses
+        penalty = (length_ratio - 2.0) * 0.1
+        return max(score - penalty, 1)
+    elif length_ratio < 0.3:
+        # Penalize excessively short responses
+        penalty = (0.3 - length_ratio) * 0.5
+        return max(score - penalty, 1)
+    else:
+        return score
+```
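A quick worked example helps show how mild these adjustments are on a 1-5 scale. The snippet repeats the same rule in condensed form so it runs on its own; the input scores and lengths are made-up values for illustration.

```python
def length_normalized_score(score, response_length, target_length=500):
    # Same rule as above, repeated so this snippet is self-contained.
    ratio = response_length / target_length
    if ratio > 2.0:
        return max(score - (ratio - 2.0) * 0.1, 1)
    if ratio < 0.3:
        return max(score - (0.3 - ratio) * 0.5, 1)
    return score

long_case = length_normalized_score(4, response_length=1500)   # ratio 3.0, penalty about 0.1
short_case = length_normalized_score(4, response_length=50)    # ratio 0.1, penalty about 0.1
normal_case = length_normalized_score(4, response_length=600)  # ratio 1.2, no penalty

print(long_case, short_case, normal_case)
```

A response three times the target length loses only about a tenth of a point, so the constants `0.1` and `0.5` probably need tuning against real data before this correction has a visible effect.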
+### Mitigation: Separate Length Criterion
+Make length a separate, explicit criterion so it's not implicitly rewarded:
+```python
+criteria = [
+    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
+    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
+    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3}  # Explicit
+]
+```
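To see why an explicit conciseness criterion helps, consider how the weights combine. The sketch below assumes per-criterion scores on a 1-5 scale; `weighted_overall` and the two sample score sets are hypothetical and only illustrate the arithmetic.

```python
criteria = [
    {"name": "Accuracy", "weight": 0.4},
    {"name": "Completeness", "weight": 0.3},
    {"name": "Conciseness", "weight": 0.3},
]

def weighted_overall(per_criterion_scores, criteria):
    """Weighted mean of per-criterion scores, tolerant of weights that do not sum to 1."""
    total_weight = sum(c["weight"] for c in criteria)
    weighted = sum(per_criterion_scores[c["name"]] * c["weight"] for c in criteria)
    return weighted / total_weight

# A padded answer: accurate and complete, but scored low on conciseness.
verbose_answer = {"Accuracy": 5, "Completeness": 5, "Conciseness": 2}
# A tight answer: slightly less complete, but nothing extraneous.
concise_answer = {"Accuracy": 5, "Completeness": 4, "Conciseness": 5}

verbose_total = weighted_overall(verbose_answer, criteria)
concise_total = weighted_overall(concise_answer, criteria)
print(round(verbose_total, 2), round(concise_total, 2))
```

With these weights the concise answer comes out ahead (about 4.7 versus 4.1), whereas a rubric without the conciseness criterion would have favored the padded one.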
+## Self-Enhancement Bias
+### The Problem
+Models rate outputs generated by themselves (or similar models) higher than outputs from different models.
+### Mitigation: Cross-Model Evaluation
+Use a different model family for evaluation than generation:
+```python
+def get_evaluator_model(generator_model):
+    """Select evaluator to avoid self-enhancement bias."""
+    if 'gpt' in generator_model.lower():
+        return 'claude-4-5-sonnet'
+    elif 'claude' in generator_model.lower():
+        return 'gpt-5.2'
+    else:
+        return 'gpt-5.2'  # Default
+```
+### Mitigation: Blind Evaluation
+Remove model attribution from responses before evaluation:
+```python
+def anonymize_response(response, model_name):
+    """Remove model-identifying patterns."""
+    patterns = [
+        f"As {model_name}",
+        "I am an AI",
+        "I don't have personal opinions",
+        # Model-specific patterns
+    ]
+    anonymized = response
+    for pattern in patterns:
+        anonymized = anonymized.replace(pattern, "[REDACTED]")
+    return anonymized
+```
+## Verbosity Bias
+### The Problem
+Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect.
+### Mitigation: Relevance-Weighted Scoring
+```python
+async def relevance_weighted_evaluation(response, prompt, criteria):
+    # First, assess relevance of each segment
+    relevance_scores = await assess_relevance(response, prompt)
+    # Weight evaluation by relevance
+    segments = split_into_segments(response)
+    weighted_sum = 0.0
+    total_relevance = 0.0
+    for segment, relevance in zip(segments, relevance_scores):
+        if relevance > 0.5:  # Only count relevant segments
+            score = await evaluate_segment(segment, prompt, criteria)
+            weighted_sum += score * relevance
+            total_relevance += relevance
+    if total_relevance == 0:
+        return None  # No relevant segments to score; the caller decides how to treat this
+    # Divide by the total weight so the result stays on the original score scale
+    return weighted_sum / total_relevance
+```
+### Mitigation: Rubric with Verbosity Penalty
+Include explicit verbosity penalties in rubrics:
+```python
+rubric_levels = [
+    {
+        "score": 5,
+        "description": "Complete and concise. All necessary information, nothing extraneous.",
+        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
+    },
+    {
+        "score": 3,
+        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
+        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
+    },
+    # ... etc
+]
+```
+## Authority Bias
+### The Problem
+Confident, authoritative tone is rated higher regardless of accuracy.
+### Mitigation: Evidence Requirement
+Require explicit evidence for claims:
+```
+For each claim in the response:
+1. Identify whether it's a factual claim
+2. Note if evidence or sources are provided
+3. Score based on verifiability, not confidence
+IMPORTANT: Confident claims without evidence should NOT receive higher scores than
+hedged claims with evidence.
+```
+### Mitigation: Fact-Checking Layer
+Add a fact-checking step before scoring:
+```python
+import asyncio
+async def fact_checked_evaluation(response, prompt, criteria):
+    # Extract claims
+    claims = await extract_claims(response)
+    # Fact-check each claim concurrently
+    fact_check_results = await asyncio.gather(*[
+        verify_claim(claim) for claim in claims
+    ])
+    # Adjust score based on fact-check results (no claims means nothing to penalize)
+    if fact_check_results:
+        accuracy_factor = sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
+    else:
+        accuracy_factor = 1.0
+    base_score = await evaluate(response, prompt, criteria)
+    return base_score * (0.7 + 0.3 * accuracy_factor)  # Scales between 70% and 100% of the base score
+```
+## Aggregate Bias Detection
+Monitor for systematic biases in production:
+```python
+class BiasMonitor:
+    def __init__(self):
+        self.evaluations = []
+    def record(self, evaluation):
+        self.evaluations.append(evaluation)
+    def detect_position_bias(self):
+        """Detect if first position wins more often than expected."""
+        n = len(self.evaluations)
+        if n == 0:
+            return {'bias_detected': False, 'z_score': 0.0}
+        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
+        expected = n * 0.5
+        # Binomial standard deviation under a fair 50/50 null: sqrt(n * 0.5 * 0.5)
+        z_score = (first_wins - expected) / (n * 0.25) ** 0.5
+        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}
+    def detect_length_bias(self):
+        """Detect if longer responses score higher."""
+        from scipy.stats import spearmanr
+        if len(self.evaluations) < 3:
+            # Too few points for a meaningful rank correlation
+            return {'bias_detected': False, 'correlation': None}
+        lengths = [e['response_length'] for e in self.evaluations]
+        scores = [e['score'] for e in self.evaluations]
+        corr, p_value = spearmanr(lengths, scores)
+        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}
+```
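The z-score test above is sensitive to sample size, which matters for the mild (~55%) first-position preference mentioned earlier. The following sketch isolates the same calculation in a standalone helper; `first_position_z` is a hypothetical name, and the win counts are invented for illustration.

```python
import math

def first_position_z(first_wins, n):
    """z-score of first-position wins against a fair 50/50 null hypothesis."""
    if n == 0:
        return 0.0
    expected = n * 0.5
    std = math.sqrt(n * 0.5 * 0.5)  # binomial standard deviation with p = 0.5
    return (first_wins - expected) / std

z_mild = first_position_z(55, 100)      # 55% first-position wins, 100 comparisons
z_strong = first_position_z(550, 1000)  # same rate, ten times the data

print(round(z_mild, 2), round(z_strong, 2))
```

With 100 comparisons a 55% win rate stays below the `abs(z) > 2` threshold, while the same rate over 1000 comparisons clears it, so the monitor needs a reasonably large window before it can flag a mild bias.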
+## Summary Table
+| Bias | Primary Mitigation | Secondary Mitigation | Detection Method |
+|------|-------------------|---------------------|------------------|
+| Position | Position swapping | Multiple shuffles | Consistency check |
+| Length | Explicit prompting | Length normalization | Length-score correlation |
+| Self-enhancement | Cross-model evaluation | Anonymization | Model comparison study |
+| Verbosity | Relevance weighting | Rubric penalties | Relevance scoring |
+| Authority | Evidence requirement | Fact-checking layer | Confidence-accuracy correlation |

package/.agents/skills/advanced-evaluation/references/evaluation-pipeline.md ADDED

@@ -0,0 +1,43 @@
+# Evaluation Pipeline Diagram
+Visual layout of a production evaluation pipeline.
+```
+┌─────────────────────────────────────────────────┐
+│                 Evaluation Pipeline              │
+├─────────────────────────────────────────────────┤
+│                                                   │
+│  Input: Response + Prompt + Context               │
+│           │                                       │
+│           ▼                                       │
+│  ┌─────────────────────┐                         │
+│  │   Criteria Loader   │ ◄── Rubrics, weights    │
+│  └──────────┬──────────┘                         │
+│             │                                     │
+│             ▼                                     │
+│  ┌─────────────────────┐                         │
+│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
+│  └──────────┬──────────┘                         │
+│             │                                     │
+│             ▼                                     │
+│  ┌─────────────────────┐                         │
+│  │   Bias Mitigation   │ ◄── Position swap, etc. │
+│  └──────────┬──────────┘                         │
+│             │                                     │
+│             ▼                                     │
+│  ┌─────────────────────┐                         │
+│  │ Confidence Scoring  │ ◄── Calibration         │
+│  └──────────┬──────────┘                         │
+│             │                                     │
+│             ▼                                     │
+│  Output: Scores + Justifications + Confidence     │
+│                                                   │
+└─────────────────────────────────────────────────┘
+```
+## Pipeline Stages
+1. **Criteria Loader**: Loads rubrics and criterion weights from configuration
+2. **Primary Scorer**: Applies direct scoring or pairwise comparison
+3. **Bias Mitigation**: Runs position swaps, length normalization, and other debiasing
+4. **Confidence Scoring**: Calibrates confidence based on position consistency and evidence strength
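
The four stages can be sketched as plain functions chained together. This is only an illustrative skeleton under stated assumptions: every function name, the in-memory config, the fixed stub scores, and the toy length penalty are hypothetical, and the LLM call is replaced by a constant.

```python
from dataclasses import dataclass

@dataclass
class PipelineResult:
    scores: dict
    confidence: float

def load_criteria(task_type):
    # Stage 1 (hypothetical config): criterion weights keyed by task type.
    config = {"qa": {"Accuracy": 0.6, "Conciseness": 0.4}}
    return config[task_type]

def primary_score(response, prompt, weights):
    # Stage 2: stand-in for the LLM call; returns fixed raw scores on a 1-5 scale.
    return {name: 4.0 for name in weights}

def mitigate_bias(scores, response, target_length=500):
    # Stage 3: toy length adjustment, floored at the minimum score of 1.
    penalty = 0.5 if len(response) > 2 * target_length else 0.0
    return {name: max(s - penalty, 1.0) for name, s in scores.items()}

def score_confidence(raw_confidence, position_consistent):
    # Stage 4: reduce confidence when position-swapped passes disagreed.
    return raw_confidence if position_consistent else raw_confidence * 0.6

def run_pipeline(response, prompt, task_type):
    weights = load_criteria(task_type)
    scores = mitigate_bias(primary_score(response, prompt, weights), response)
    return PipelineResult(scores, score_confidence(0.9, position_consistent=True))

pipeline_result = run_pipeline("Water boils at 100°C at sea level.", "Boiling point?", "qa")
print(pipeline_result)
```

Keeping each stage a separate function makes it straightforward to swap the stub scorer for a real model call or to add further debiasing steps without touching the rest.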

package/.agents/skills/advanced-evaluation/references/implementation-patterns.md ADDED

@@ -0,0 +1,315 @@
+# LLM-as-Judge Implementation Patterns
+This reference provides detailed implementation patterns for building production-grade LLM evaluation systems.
+## Pattern 1: Structured Evaluation Pipeline
+The most reliable evaluation systems follow a structured pipeline that separates concerns:
+```
+Input Validation → Criteria Loading → Scoring → Bias Mitigation → Output Formatting
+```
+### Input Validation Layer
+Before evaluation begins, validate:
+1. **Response presence**: Non-empty response to evaluate
+2. **Prompt presence**: Original prompt for context
+3. **Criteria validity**: At least one criterion with name and description
+4. **Weight normalization**: Weights sum to 1.0 (or normalize them)
+```python
+def validate_input(response, prompt, criteria):
+    if not response or not response.strip():
+        raise ValueError("Response cannot be empty")
+    if not prompt or not prompt.strip():
+        raise ValueError("Prompt cannot be empty")
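+    if not criteria:
+        raise ValueError("At least one criterion required")
+    for c in criteria:
+        if not c.get('name') or not c.get('description'):
+            raise ValueError("Each criterion needs a name and a description")
+    # Normalize weights so they sum to 1.0
+    total_weight = sum(c.get('weight', 1) for c in criteria)
+    if total_weight <= 0:
+        raise ValueError("Criterion weights must sum to a positive value")
+    for c in criteria:
+        c['weight'] = c.get('weight', 1) / total_weight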
+```
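As a small worked example of the weight normalization step, the sketch below rescales raw weights without mutating the caller's list. `normalize_weights` is a hypothetical helper, not a function from the package, and the sample weights are arbitrary.

```python
def normalize_weights(criteria):
    """Return copies of the criteria with weights rescaled to sum to 1.0."""
    total = sum(c.get("weight", 1) for c in criteria)
    if total <= 0:
        raise ValueError("Criterion weights must sum to a positive number")
    return [{**c, "weight": c.get("weight", 1) / total} for c in criteria]

normalized = normalize_weights([
    {"name": "Accuracy", "weight": 2},
    {"name": "Completeness", "weight": 1},
    {"name": "Conciseness"},  # a missing weight defaults to 1
])
print([c["weight"] for c in normalized])
```

Raw weights of 2, 1, and 1 become 0.5, 0.25, and 0.25. Returning new dictionaries avoids the side effect of rewriting the caller's configuration in place.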
+### Criteria Loading Layer
+Criteria should be loaded from configuration, not hardcoded:
+```python
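+class CriteriaLoader:
+    def __init__(self, rubric_path=None):
+        self.rubrics = self._load_rubrics(rubric_path)
+        # Fallback used by get_criteria; assumes the config may define a 'default' entry
+        self.default_criteria = self.rubrics.get('default', [])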
+    def get_criteria(self, task_type):
+        return self.rubrics.get(task_type, self.default_criteria)
+    def get_rubric(self, criterion_name):
+        return self.rubrics.get(criterion_name, {}).get('levels', [])
+```
+### Scoring Layer
+The scoring layer handles the actual LLM call:
+```python
+async def score_response(response, prompt, criteria, rubric, model):
+    system_prompt = build_system_prompt(criteria, rubric)
+    user_prompt = build_user_prompt(response, prompt, criteria)
+    result = await generate_text(
+        model=model,
+        system=system_prompt,
+        prompt=user_prompt,
+        temperature=0.3  # Lower temperature for consistency
+    )
+    return parse_scores(result.text)
+```
+### Bias Mitigation Layer
+For pairwise comparison, always include position swapping:
+```python
+async def compare_with_bias_mitigation(response_a, response_b, prompt, criteria, model):
+    # First pass: A first
+    pass1 = await compare_pair(response_a, response_b, prompt, criteria, model)
+    # Second pass: B first
+    pass2 = await compare_pair(response_b, response_a, prompt, criteria, model)
+    # Map pass2 winner back
+    pass2_mapped = map_winner(pass2.winner)  # A→B, B→A, TIE→TIE
+    # Check consistency
+    if pass1.winner == pass2_mapped:
+        return {
+            'winner': pass1.winner,
+            'confidence': (pass1.confidence + pass2.confidence) / 2,
+            'consistent': True
+        }
+    else:
+        return {
+            'winner': 'TIE',
+            'confidence': 0.5,
+            'consistent': False
+        }
+```
+## Pattern 2: Hierarchical Evaluation
+For complex evaluations, use a hierarchical approach:
+```
+Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases)
+```
+### Quick Screen Implementation
+```python
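+async def quick_screen(response, prompt, threshold=0.7, model='gpt-5.2'):
+    """Fast, cheap screening for obvious passes/fails."""
+    result = await generate_text(
+        model=model,  # Pass a cheaper model than the one used for detailed evaluation
+        prompt=f"Rate from 0 to 1 how adequately this response addresses the prompt. Reply with only the number.\n\nPrompt: {prompt}\n\nResponse: {response}",
+        temperature=0
+    )
+    try:
+        score = float(result.text.strip())
+    except ValueError:
+        return None, False  # Unparseable output: treat as not passing the screen
+    score = min(max(score, 0.0), 1.0)  # Clamp to the requested range
+    return score, score > threshold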
+```
+### Detailed Evaluation
+```python
+async def detailed_evaluation(response, prompt, criteria):
+    """Full evaluation for borderline or important cases."""
+    result = await generate_text(
+        model='gpt-5.2',  # More capable model
+        system=DETAILED_EVALUATION_PROMPT,
+        prompt=build_detailed_prompt(response, prompt, criteria),
+        temperature=0.3
+    )
+    return parse_detailed_scores(result.text)
+```
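The routing between the three tiers is not spelled out above, so here is one possible sketch. The function names, the lower rejection threshold, and the agreement cutoff are all assumptions chosen for illustration; only the 0.7 pass threshold comes from the quick-screen example.

```python
def route_after_screen(screen_score, low=0.3, high=0.7):
    """Decide the next tier from a 0-1 quick-screen score (thresholds are illustrative)."""
    if screen_score >= high:
        return "accept"
    if screen_score <= low:
        return "reject"
    # Borderline scores are worth the cost of the expensive model.
    return "detailed_evaluation"

def route_after_detailed(judge_agreement, min_agreement=0.8):
    """Escalate to a human when the detailed evaluation is not decisive."""
    return "final" if judge_agreement >= min_agreement else "human_review"

for s in (0.9, 0.5, 0.1):
    print(s, route_after_screen(s))
```

Only the middle band reaches the expensive model, and only indecisive detailed results reach a human, which is where the cost savings of the hierarchy come from.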
+## Pattern 3: Panel of LLM Judges (PoLL)
+For high-stakes evaluation, use multiple models:
+```python
+import asyncio
+import statistics
+async def poll_evaluation(response, prompt, criteria, models):
+    """Aggregate judgments from multiple LLM judges."""
+    results = await asyncio.gather(*[
+        score_with_model(response, prompt, criteria, model)
+        for model in models
+    ])
+    # Aggregate scores
+    aggregated = aggregate_scores(results)
+    # Calculate agreement
+    agreement = calculate_agreement(results)
+    return {
+        'scores': aggregated,
+        'agreement': agreement,
+        'individual_results': results
+    }
+def aggregate_scores(results):
+    """Aggregate scores using median (robust to outliers)."""
+    scores = {}
+    for criterion in results[0]['scores'].keys():
+        criterion_scores = [r['scores'][criterion] for r in results]
+        scores[criterion] = {
+            'score': statistics.median(criterion_scores),
+            'std': statistics.stdev(criterion_scores) if len(criterion_scores) > 1 else 0
+        }
+    return scores
+```
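`calculate_agreement` is referenced above but not defined. One simple way it might work, offered here as an assumption rather than the package's actual implementation, is to report the fraction of criteria on which all judges fall within a fixed tolerance of each other. The `tolerance` parameter and the sample results are hypothetical.

```python
def calculate_agreement(results, tolerance=1.0):
    """Fraction of criteria on which all judges fall within `tolerance` points."""
    criteria = results[0]["scores"].keys()
    within = 0
    for criterion in criteria:
        values = [r["scores"][criterion] for r in results]
        if max(values) - min(values) <= tolerance:
            within += 1
    return within / len(criteria)

sample_results = [
    {"scores": {"Accuracy": 4, "Clarity": 5}},
    {"scores": {"Accuracy": 5, "Clarity": 2}},
    {"scores": {"Accuracy": 4, "Clarity": 4}},
]
panel_agreement = calculate_agreement(sample_results)
print(panel_agreement)
```

Here the judges agree on Accuracy (spread of 1) but not on Clarity (spread of 3), so agreement is 0.5. A chance-corrected statistic would be more rigorous, but this version is easy to interpret and needs no extra dependencies.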
+## Pattern 4: Confidence Calibration
+Confidence scores should be calibrated to actual reliability:
+```python
+def calibrate_confidence(raw_confidence, position_consistent, evidence_count):
+    """Calibrate confidence based on multiple signals."""
+    # Base confidence from model output
+    calibrated = raw_confidence
+    # Position consistency is a strong signal
+    if not position_consistent:
+        calibrated *= 0.6  # Significant reduction
+    # More evidence = higher confidence
+    evidence_factor = min(evidence_count / 3, 1.0)  # Cap at 3 pieces
+    calibrated *= (0.7 + 0.3 * evidence_factor)
+    return min(calibrated, 0.99)  # Never 100% confident
+```
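Two worked cases make the effect of the multipliers concrete. The function is repeated in condensed form so the snippet runs on its own; the input values are invented.

```python
def calibrate_confidence(raw_confidence, position_consistent, evidence_count):
    # Same rule as above, repeated so this snippet is self-contained.
    calibrated = raw_confidence
    if not position_consistent:
        calibrated *= 0.6
    evidence_factor = min(evidence_count / 3, 1.0)
    calibrated *= 0.7 + 0.3 * evidence_factor
    return min(calibrated, 0.99)

# Consistent across position swaps and backed by three pieces of evidence.
strong = calibrate_confidence(0.9, position_consistent=True, evidence_count=3)
# Inconsistent across swaps and no supporting evidence.
weak = calibrate_confidence(0.9, position_consistent=False, evidence_count=0)

print(round(strong, 3), round(weak, 3))
```

The same raw confidence of 0.9 stays at roughly 0.9 in the first case and drops to roughly 0.378 (0.9 × 0.6 × 0.7) in the second, so both signals together can cut reported confidence by more than half.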
+## Pattern 5: Output Formatting
+Always return structured outputs with consistent schemas:
+```python
+from dataclasses import dataclass
+from typing import Any, Dict, List
+@dataclass
+class ScoreResult:
+    criterion: str
+    score: float
+    max_score: float
+    justification: str
+    evidence: List[str]
+    improvement: str
+@dataclass
+class EvaluationResult:
+    success: bool
+    scores: List[ScoreResult]
+    overall_score: float
+    weighted_score: float
+    summary: Dict[str, Any]
+    metadata: Dict[str, Any]
+def format_output(scores, metadata) -> EvaluationResult:
+    """Format evaluation results consistently."""
+    return EvaluationResult(
+        success=True,
+        scores=scores,
+        overall_score=sum(s.score for s in scores) / len(scores) if scores else 0.0,
+        weighted_score=calculate_weighted_score(scores),
+        summary=generate_summary(scores),
+        metadata=metadata
+    )
+```
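One practical benefit of a dataclass schema is that results serialize cleanly for logging or API responses. The sketch below uses the standard library's `dataclasses.asdict` and `json.dumps`; the default values on `evidence` and `improvement` are additions made for this example, and the sample record is invented.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List

@dataclass
class ScoreResult:
    criterion: str
    score: float
    max_score: float
    justification: str
    evidence: List[str] = field(default_factory=list)  # default added for this sketch
    improvement: str = ""                              # default added for this sketch

record = ScoreResult("Accuracy", 4.0, 5.0, "Correct boiling point.", ["100°C at sea level"])
# asdict converts the dataclass (including nested lists) into plain dicts and lists.
payload = json.dumps(asdict(record), ensure_ascii=False)
print(payload)
```

Every consumer then sees the same keys regardless of which judge or pattern produced the result.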
+## Error Handling Patterns
+### Graceful Degradation
+```python
+async def evaluate_with_fallback(response, prompt, criteria):
+    try:
+        return await full_evaluation(response, prompt, criteria)
+    except RateLimitError:
+        # Fall back to simpler evaluation
+        return await simple_evaluation(response, prompt, criteria)
+    except ParseError as e:
+        # Return partial results with error flag
+        return {
+            'success': False,
+            'partial_results': e.partial_data,
+            'error': str(e)
+        }
+```
+### Retry Logic
+```python
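+import asyncio
+async def evaluate_with_retry(response, prompt, criteria, max_retries=3):
+    for attempt in range(max_retries):
+        try:
+            result = await evaluate(response, prompt, criteria)
+            if is_valid_result(result):
+                return result
+        except TransientError:
+            pass  # Fall through to the backoff below
+        # Back off before the next attempt, but do not sleep after the last one
+        if attempt < max_retries - 1:
+            await asyncio.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
+    raise EvaluationError("Max retries exceeded")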
+```
+## Testing Patterns
+### Unit Tests for Parsing
+```python
+import pytest
+def test_score_parsing():
+    raw_output = '{"scores": [{"criterion": "Accuracy", "score": 4}]}'
+    result = parse_scores(raw_output)
+    assert result.scores[0].criterion == "Accuracy"
+    assert result.scores[0].score == 4
+def test_malformed_output():
+    raw_output = 'Invalid JSON'
+    with pytest.raises(ParseError):
+        parse_scores(raw_output)
+```
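The tests above depend on `parse_scores` and `ParseError`, neither of which is shown. A minimal sketch of what they might look like follows. It is an assumption, not the package's code: in particular it returns a list of plain dictionaries, whereas the tests above use attribute access (`result.scores[0].criterion`), so a real implementation would wrap the data in result objects.

```python
import json

class ParseError(Exception):
    """Raised when judge output cannot be parsed; may carry partial data."""
    def __init__(self, message, partial_data=None):
        super().__init__(message)
        self.partial_data = partial_data

def parse_scores(raw_output):
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ParseError(f"Judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "scores" not in data:
        # Valid JSON but the wrong shape: keep what was parsed for debugging.
        raise ParseError("Missing 'scores' field", partial_data=data)
    return data["scores"]

parsed = parse_scores('{"scores": [{"criterion": "Accuracy", "score": 4}]}')
print(parsed)
```

Attaching `partial_data` to the exception is what lets a graceful-degradation handler return partial results instead of discarding the whole response.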
+### Integration Tests with Real API
+```python
+@pytest.mark.integration
+async def test_full_evaluation_pipeline():
+    result = await evaluate(
+        response="Water boils at 100°C at sea level.",
+        prompt="At what temperature does water boil?",
+        criteria=[{"name": "Accuracy", "description": "Factual correctness", "weight": 1}]
+    )
+    assert result.success
+    assert len(result.scores) == 1
+    assert result.scores[0].score >= 4  # Should score high for accurate response
+```
+### Bias Detection Tests
+```python
+async def test_position_bias_mitigation():
+    # Same response in both positions should tie
+    result = await compare(
+        response_a="Same response",
+        response_b="Same response",
+        prompt="Test prompt",
+        criteria=["quality"],
+        swap_positions=True
+    )
+    assert result.winner == "TIE"
+    assert result.consistent
+```