npm - @bastani/atomic - Versions diffs - 0.6.8-0 → 0.7.0-1 - Mend

@bastani/atomic 0.6.8-0 → 0.7.0-1

Files changed (765)

package/.agents/skills/advanced-evaluation/references/evaluation-pipeline.md DELETED

@@ -1,43 +0,0 @@
-# Evaluation Pipeline Diagram
-Visual layout of a production evaluation pipeline.
-```
-┌─────────────────────────────────────────────────┐
-│                 Evaluation Pipeline              │
-├─────────────────────────────────────────────────┤
-│                                                   │
-│  Input: Response + Prompt + Context               │
-│           │                                       │
-│           ▼                                       │
-│  ┌─────────────────────┐                         │
-│  │   Criteria Loader   │ ◄── Rubrics, weights    │
-│  └──────────┬──────────┘                         │
-│             │                                     │
-│             ▼                                     │
-│  ┌─────────────────────┐                         │
-│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
-│  └──────────┬──────────┘                         │
-│             │                                     │
-│             ▼                                     │
-│  ┌─────────────────────┐                         │
-│  │   Bias Mitigation   │ ◄── Position swap, etc. │
-│  └──────────┬──────────┘                         │
-│             │                                     │
-│             ▼                                     │
-│  ┌─────────────────────┐                         │
-│  │ Confidence Scoring  │ ◄── Calibration         │
-│  └──────────┬──────────┘                         │
-│             │                                     │
-│             ▼                                     │
-│  Output: Scores + Justifications + Confidence     │
-│                                                   │
-└─────────────────────────────────────────────────┘
-```
-## Pipeline Stages
-1. **Criteria Loader**: Loads rubrics and criterion weights from configuration
-2. **Primary Scorer**: Applies direct scoring or pairwise comparison
-3. **Bias Mitigation**: Runs position swaps, length normalization, and other debiasing
-4. **Confidence Scoring**: Calibrates confidence based on position consistency and evidence strength
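
Reviewer note: the four stages listed in the deleted diagram can be sketched as one small function. This is a minimal illustration under stated assumptions, not code from the package. The names `load_criteria`, `weighted_score`, and `run_pipeline`, the 5-point scale, and the choice to average two judge passes are all hypothetical. Judge scores are passed in directly rather than produced by an LLM call.

```python
def load_criteria(config):
    """Stage 1 (assumed shape): map criterion name -> weight, normalized to sum to 1.0."""
    total = sum(config.values())
    if total <= 0:
        raise ValueError("Criterion weights must sum to a positive number")
    return {name: weight / total for name, weight in config.items()}


def weighted_score(scores, weights):
    """Stage 2: combine per-criterion scores into one weighted score."""
    return sum(scores[name] * weight for name, weight in weights.items())


def run_pipeline(first_pass, second_pass, config, scale=5.0):
    """Run the four stages on two sets of judge scores for the same response.

    first_pass and second_pass stand in for two judge runs made under
    different presentation orders (an assumption of this sketch).
    """
    weights = load_criteria(config)
    a = weighted_score(first_pass, weights)
    b = weighted_score(second_pass, weights)
    # Stage 3 (assumed rule): average the two passes to dampen order effects.
    score = (a + b) / 2
    # Stage 4 (assumed rule): confidence drops as the two passes disagree.
    confidence = max(0.0, 1.0 - abs(a - b) / scale)
    return {"score": score, "confidence": confidence}


result = run_pipeline(
    first_pass={"accuracy": 4, "clarity": 5, "completeness": 3},
    second_pass={"accuracy": 3, "clarity": 5, "completeness": 3},
    config={"accuracy": 2, "clarity": 1, "completeness": 1},
)
print(result)
```

With these inputs the two passes score 4.0 and 3.5, so the combined score is 3.75 and the half-point disagreement lowers confidence slightly below 1.0.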

package/.agents/skills/advanced-evaluation/references/implementation-patterns.md DELETED

@@ -1,315 +0,0 @@
-# LLM-as-Judge Implementation Patterns
-This reference provides detailed implementation patterns for building production-grade LLM evaluation systems.
-## Pattern 1: Structured Evaluation Pipeline
-The most reliable evaluation systems follow a structured pipeline that separates concerns:
-```
-Input Validation → Criteria Loading → Scoring → Bias Mitigation → Output Formatting
-```
-### Input Validation Layer
-Before evaluation begins, validate:
-1. **Response presence**: Non-empty response to evaluate
-2. **Prompt presence**: Original prompt for context
-3. **Criteria validity**: At least one criterion with name and description
-4. **Weight normalization**: Weights sum to 1.0 (or normalize them)
-```python
-def validate_input(response, prompt, criteria):
-    if not response or not response.strip():
-        raise ValueError("Response cannot be empty")
-    if not prompt or not prompt.strip():
-        raise ValueError("Prompt cannot be empty")
-    if not criteria or len(criteria) == 0:
-        raise ValueError("At least one criterion required")
-    # Normalize weights
-    total_weight = sum(c.get('weight', 1) for c in criteria)
-    for c in criteria:
-        c['weight'] = c.get('weight', 1) / total_weight
-```
-### Criteria Loading Layer
-Criteria should be loaded from configuration, not hardcoded:
-```python
-class CriteriaLoader:
-    def __init__(self, rubric_path=None):
-        self.rubrics = self._load_rubrics(rubric_path)
-    def get_criteria(self, task_type):
-        return self.rubrics.get(task_type, self.default_criteria)
-    def get_rubric(self, criterion_name):
-        return self.rubrics.get(criterion_name, {}).get('levels', [])
-```
-### Scoring Layer
-The scoring layer handles the actual LLM call:
-```python
-async def score_response(response, prompt, criteria, rubric, model):
-    system_prompt = build_system_prompt(criteria, rubric)
-    user_prompt = build_user_prompt(response, prompt, criteria)
-    result = await generate_text(
-        model=model,
-        system=system_prompt,
-        prompt=user_prompt,
-        temperature=0.3  # Lower temperature for consistency
-    )
-    return parse_scores(result.text)
-```
-### Bias Mitigation Layer
-For pairwise comparison, always include position swapping:
-```python
-async def compare_with_bias_mitigation(response_a, response_b, prompt, criteria, model):
-    # First pass: A first
-    pass1 = await compare_pair(response_a, response_b, prompt, criteria, model)
-    # Second pass: B first
-    pass2 = await compare_pair(response_b, response_a, prompt, criteria, model)
-    # Map pass2 winner back
-    pass2_mapped = map_winner(pass2.winner)  # A→B, B→A, TIE→TIE
-    # Check consistency
-    if pass1.winner == pass2_mapped:
-        return {
-            'winner': pass1.winner,
-            'confidence': (pass1.confidence + pass2.confidence) / 2,
-            'consistent': True
-        }
-    else:
-        return {
-            'winner': 'TIE',
-            'confidence': 0.5,
-            'consistent': False
-        }
-```
-## Pattern 2: Hierarchical Evaluation
-For complex evaluations, use a hierarchical approach:
-```
-Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases)
-```
-### Quick Screen Implementation
-```python
-async def quick_screen(response, prompt, threshold=0.7):
-    """Fast, cheap screening for obvious passes/fails."""
-    result = await generate_text(
-        model='gpt-5.2',  # Cheaper model
-        prompt=f"Rate 0-1 if this response adequately addresses the prompt:\n\nPrompt: {prompt}\n\nResponse: {response}",
-        temperature=0
-    )
-    score = float(result.text.strip())
-    return score, score > threshold
-```
-### Detailed Evaluation
-```python
-async def detailed_evaluation(response, prompt, criteria):
-    """Full evaluation for borderline or important cases."""
-    result = await generate_text(
-        model='gpt-5.2',  # More capable model
-        system=DETAILED_EVALUATION_PROMPT,
-        prompt=build_detailed_prompt(response, prompt, criteria),
-        temperature=0.3
-    )
-    return parse_detailed_scores(result.text)
-```
-## Pattern 3: Panel of LLM Judges (PoLL)
-For high-stakes evaluation, use multiple models:
-```python
-async def poll_evaluation(response, prompt, criteria, models):
-    """Aggregate judgments from multiple LLM judges."""
-    results = await asyncio.gather(*[
-        score_with_model(response, prompt, criteria, model)
-        for model in models
-    ])
-    # Aggregate scores
-    aggregated = aggregate_scores(results)
-    # Calculate agreement
-    agreement = calculate_agreement(results)
-    return {
-        'scores': aggregated,
-        'agreement': agreement,
-        'individual_results': results
-    }
-def aggregate_scores(results):
-    """Aggregate scores using median (robust to outliers)."""
-    scores = {}
-    for criterion in results[0]['scores'].keys():
-        criterion_scores = [r['scores'][criterion] for r in results]
-        scores[criterion] = {
-            'score': statistics.median(criterion_scores),
-            'std': statistics.stdev(criterion_scores) if len(criterion_scores) > 1 else 0
-        }
-    return scores
-```
-## Pattern 4: Confidence Calibration
-Confidence scores should be calibrated to actual reliability:
-```python
-def calibrate_confidence(raw_confidence, position_consistent, evidence_count):
-    """Calibrate confidence based on multiple signals."""
-    # Base confidence from model output
-    calibrated = raw_confidence
-    # Position consistency is a strong signal
-    if not position_consistent:
-        calibrated *= 0.6  # Significant reduction
-    # More evidence = higher confidence
-    evidence_factor = min(evidence_count / 3, 1.0)  # Cap at 3 pieces
-    calibrated *= (0.7 + 0.3 * evidence_factor)
-    return min(calibrated, 0.99)  # Never 100% confident
-```
-## Pattern 5: Output Formatting
-Always return structured outputs with consistent schemas:
-```python
-@dataclass
-class ScoreResult:
-    criterion: str
-    score: float
-    max_score: float
-    justification: str
-    evidence: List[str]
-    improvement: str
-@dataclass
-class EvaluationResult:
-    success: bool
-    scores: List[ScoreResult]
-    overall_score: float
-    weighted_score: float
-    summary: Dict[str, Any]
-    metadata: Dict[str, Any]
-def format_output(scores, metadata) -> EvaluationResult:
-    """Format evaluation results consistently."""
-    return EvaluationResult(
-        success=True,
-        scores=scores,
-        overall_score=sum(s.score for s in scores) / len(scores),
-        weighted_score=calculate_weighted_score(scores),
-        summary=generate_summary(scores),
-        metadata=metadata
-    )
-```
-## Error Handling Patterns
-### Graceful Degradation
-```python
-async def evaluate_with_fallback(response, prompt, criteria):
-    try:
-        return await full_evaluation(response, prompt, criteria)
-    except RateLimitError:
-        # Fall back to simpler evaluation
-        return await simple_evaluation(response, prompt, criteria)
-    except ParseError as e:
-        # Return partial results with error flag
-        return {
-            'success': False,
-            'partial_results': e.partial_data,
-            'error': str(e)
-        }
-```
-### Retry Logic
-```python
-async def evaluate_with_retry(response, prompt, criteria, max_retries=3):
-    for attempt in range(max_retries):
-        try:
-            result = await evaluate(response, prompt, criteria)
-            if is_valid_result(result):
-                return result
-        except TransientError:
-            await asyncio.sleep(2 ** attempt)  # Exponential backoff
-    raise EvaluationError("Max retries exceeded")
-```
-## Testing Patterns
-### Unit Tests for Parsing
-```python
-def test_score_parsing():
-    raw_output = '{"scores": [{"criterion": "Accuracy", "score": 4}]}'
-    result = parse_scores(raw_output)
-    assert result.scores[0].criterion == "Accuracy"
-    assert result.scores[0].score == 4
-def test_malformed_output():
-    raw_output = 'Invalid JSON'
-    with pytest.raises(ParseError):
-        parse_scores(raw_output)
-```
-### Integration Tests with Real API
-```python
-@pytest.mark.integration
-async def test_full_evaluation_pipeline():
-    result = await evaluate(
-        response="Water boils at 100°C at sea level.",
-        prompt="At what temperature does water boil?",
-        criteria=[{"name": "Accuracy", "description": "Factual correctness", "weight": 1}]
-    )
-    assert result.success
-    assert len(result.scores) == 1
-    assert result.scores[0].score >= 4  # Should score high for accurate response
-```
-### Bias Detection Tests
-```python
-async def test_position_bias_mitigation():
-    # Same response in both positions should tie
-    result = await compare(
-        response_a="Same response",
-        response_b="Same response",
-        prompt="Test prompt",
-        criteria=["quality"],
-        swap_positions=True
-    )
-    assert result.winner == "TIE"
-    assert result.consistent == True
-```
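
Reviewer note: the deleted bias-mitigation pattern calls `map_winner` without defining it. The sketch below shows one plausible definition, together with the reconciliation step, as plain synchronous code. It is an assumption about the intended behavior, based only on the `A→B, B→A, TIE→TIE` comment in the deleted file. `reconcile` is a hypothetical name, and the verdicts are passed in directly instead of coming from a judge model.

```python
def map_winner(winner):
    """Translate a verdict from the swapped pass back to the original labels.

    In the swapped pass the judge sees B first, so its "A" means original B.
    """
    return {"A": "B", "B": "A", "TIE": "TIE"}[winner]


def reconcile(pass1_winner, pass1_conf, pass2_winner, pass2_conf):
    """Combine an original-order verdict with a swapped-order verdict."""
    pass2_mapped = map_winner(pass2_winner)
    if pass1_winner == pass2_mapped:
        # Both orderings agree: keep the verdict and average the confidences.
        return {
            "winner": pass1_winner,
            "confidence": (pass1_conf + pass2_conf) / 2,
            "consistent": True,
        }
    # The verdict flipped with position: fall back to a tie.
    return {"winner": "TIE", "confidence": 0.5, "consistent": False}


# Judge prefers the same underlying response in both orders.
print(reconcile("A", 0.9, "B", 0.7))
# Judge always picks whichever response is shown first.
print(reconcile("A", 0.9, "A", 0.9))
```

The second call models a judge with pure position bias: it answers "A" in both passes, which maps to different underlying responses, so the result is flagged as inconsistent and reported as a tie.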

package/.agents/skills/advanced-evaluation/references/metrics-guide.md DELETED

@@ -1,331 +0,0 @@
-# Metric Selection Guide for LLM Evaluation
-This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.
-## Metric Categories
-### Classification Metrics
-Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).
-#### Precision
-```
-Precision = True Positives / (True Positives + False Positives)
-```
-**Interpretation**: Of all responses the judge said were good, what fraction were actually good?
-**Use when**: False positives are costly (e.g., approving unsafe content)
-```python
-def precision(predictions, ground_truth):
-    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
-    predicted_positives = sum(predictions)
-    return true_positives / predicted_positives if predicted_positives > 0 else 0
-```
-#### Recall
-```
-Recall = True Positives / (True Positives + False Negatives)
-```
-**Interpretation**: Of all actually good responses, what fraction did the judge identify?
-**Use when**: False negatives are costly (e.g., missing good content in filtering)
-```python
-def recall(predictions, ground_truth):
-    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
-    actual_positives = sum(ground_truth)
-    return true_positives / actual_positives if actual_positives > 0 else 0
-```
-#### F1 Score
-```
-F1 = 2 * (Precision * Recall) / (Precision + Recall)
-```
-**Interpretation**: Harmonic mean of precision and recall
-**Use when**: You need a single number balancing both concerns
-```python
-def f1_score(predictions, ground_truth):
-    p = precision(predictions, ground_truth)
-    r = recall(predictions, ground_truth)
-    return 2 * p * r / (p + r) if (p + r) > 0 else 0
-```
-### Agreement Metrics
-Use for comparing automated evaluation with human judgment.
-#### Cohen's Kappa (κ)
-```
-κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
-```
-**Interpretation**: Agreement adjusted for chance
-- κ > 0.8: Almost perfect agreement
-- κ 0.6-0.8: Substantial agreement
-- κ 0.4-0.6: Moderate agreement
-- κ < 0.4: Fair to poor agreement
-**Use for**: Binary or categorical judgments
-```python
-def cohens_kappa(judge1, judge2):
-    from sklearn.metrics import cohen_kappa_score
-    return cohen_kappa_score(judge1, judge2)
-```
-#### Weighted Kappa
-For ordinal scales where disagreement severity matters:
-```python
-def weighted_kappa(judge1, judge2):
-    from sklearn.metrics import cohen_kappa_score
-    return cohen_kappa_score(judge1, judge2, weights='quadratic')
-```
-**Interpretation**: Penalizes large disagreements more than small ones
-### Correlation Metrics
-Use for ordinal/continuous scores.
-#### Spearman's Rank Correlation (ρ)
-**Interpretation**: Correlation between rankings, not absolute values
-- ρ > 0.9: Very strong correlation
-- ρ 0.7-0.9: Strong correlation
-- ρ 0.5-0.7: Moderate correlation
-- ρ < 0.5: Weak correlation
-**Use when**: Order matters more than exact values
-```python
-def spearmans_rho(scores1, scores2):
-    from scipy.stats import spearmanr
-    rho, p_value = spearmanr(scores1, scores2)
-    return {'rho': rho, 'p_value': p_value}
-```
-#### Kendall's Tau (τ)
-**Interpretation**: Similar to Spearman but based on pairwise concordance
-**Use when**: You have many tied values
-```python
-def kendalls_tau(scores1, scores2):
-    from scipy.stats import kendalltau
-    tau, p_value = kendalltau(scores1, scores2)
-    return {'tau': tau, 'p_value': p_value}
-```
-#### Pearson Correlation (r)
-**Interpretation**: Linear correlation between scores
-**Use when**: Exact score values matter, not just order
-```python
-def pearsons_r(scores1, scores2):
-    from scipy.stats import pearsonr
-    r, p_value = pearsonr(scores1, scores2)
-    return {'r': r, 'p_value': p_value}
-```
-### Pairwise Comparison Metrics
-#### Agreement Rate
-```
-Agreement = (Matching Decisions) / (Total Comparisons)
-```
-**Interpretation**: Simple percentage of agreement
-```python
-def pairwise_agreement(decisions1, decisions2):
-    matches = sum(1 for d1, d2 in zip(decisions1, decisions2) if d1 == d2)
-    return matches / len(decisions1)
-```
-#### Position Consistency
-```
-Consistency = (Consistent across position swaps) / (Total comparisons)
-```
-**Interpretation**: How often does swapping position change the decision?
-```python
-def position_consistency(results):
-    consistent = sum(1 for r in results if r['position_consistent'])
-    return consistent / len(results)
-```
-## Selection Decision Tree
-```
-What type of evaluation task?
-│
-├── Binary classification (pass/fail)
-│   └── Use: Precision, Recall, F1, Cohen's κ
-│
-├── Ordinal scale (1-5 rating)
-│   ├── Comparing to human judgments?
-│   │   └── Use: Spearman's ρ, Weighted κ
-│   └── Comparing two automated judges?
-│       └── Use: Kendall's τ, Spearman's ρ
-│
-├── Pairwise preference
-│   └── Use: Agreement rate, Position consistency
-│
-└── Multi-label classification
-    └── Use: Macro-F1, Micro-F1, Per-label metrics
-```
-## Metric Selection by Use Case
-### Use Case 1: Validating Automated Evaluation
-**Goal**: Ensure automated evaluation correlates with human judgment
-**Recommended Metrics**:
-1. Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
-2. Secondary: Per-criterion agreement
-3. Diagnostic: Confusion matrix for systematic errors
-```python
-def validate_automated_eval(automated_scores, human_scores, criteria):
-    results = {}
-    # Overall correlation
-    results['overall_spearman'] = spearmans_rho(automated_scores, human_scores)
-    # Per-criterion agreement
-    for criterion in criteria:
-        auto_crit = [s[criterion] for s in automated_scores]
-        human_crit = [s[criterion] for s in human_scores]
-        results[f'{criterion}_spearman'] = spearmans_rho(auto_crit, human_crit)
-    return results
-```
-### Use Case 2: Comparing Two Models
-**Goal**: Determine which model produces better outputs
-**Recommended Metrics**:
-1. Primary: Win rate (from pairwise comparison)
-2. Secondary: Position consistency (bias check)
-3. Diagnostic: Per-criterion breakdown
-```python
-async def compare_models(model_a_outputs, model_b_outputs, prompts):
-    results = []
-    for a, b, p in zip(model_a_outputs, model_b_outputs, prompts):
-        comparison = await compare_with_position_swap(a, b, p)
-        results.append(comparison)
-    return {
-        'a_wins': sum(1 for r in results if r['winner'] == 'A'),
-        'b_wins': sum(1 for r in results if r['winner'] == 'B'),
-        'ties': sum(1 for r in results if r['winner'] == 'TIE'),
-        'position_consistency': position_consistency(results)
-    }
-```
-### Use Case 3: Quality Monitoring
-**Goal**: Track evaluation quality over time
-**Recommended Metrics**:
-1. Primary: Rolling agreement with human spot-checks
-2. Secondary: Score distribution stability
-3. Diagnostic: Bias indicators (position, length)
-```python
-class QualityMonitor:
-    def __init__(self, window_size=100):
-        self.window = deque(maxlen=window_size)
-    def add_evaluation(self, automated, human_spot_check=None):
-        self.window.append({
-            'automated': automated,
-            'human': human_spot_check,
-            'length': len(automated['response'])
-        })
-    def get_metrics(self):
-        # Filter to evaluations with human spot-checks
-        with_human = [e for e in self.window if e['human'] is not None]
-        if len(with_human) < 10:
-            return {'insufficient_data': True}
-        auto_scores = [e['automated']['score'] for e in with_human]
-        human_scores = [e['human']['score'] for e in with_human]
-        return {
-            'correlation': spearmans_rho(auto_scores, human_scores),
-            'mean_difference': np.mean([a - h for a, h in zip(auto_scores, human_scores)]),
-            'length_correlation': spearmans_rho(
-                [e['length'] for e in self.window],
-                [e['automated']['score'] for e in self.window]
-            )
-        }
-```
-## Interpreting Metric Results
-### Good Evaluation System Indicators
-| Metric | Good | Acceptable | Concerning |
-|--------|------|------------|------------|
-| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
-| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
-| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
-| Length correlation | < 0.2 | 0.2-0.4 | > 0.4 |
-### Warning Signs
-1. **High agreement but low correlation**: May indicate calibration issues
-2. **Low position consistency**: Position bias affecting results
-3. **High length correlation**: Length bias inflating scores
-4. **Per-criterion variance**: Some criteria may be poorly defined
-## Reporting Template
-```markdown
-## Evaluation System Metrics Report
-### Human Agreement
-- Spearman's ρ: 0.82 (p < 0.001)
-- Cohen's κ: 0.74
-- Sample size: 500 evaluations
-### Bias Indicators
-- Position consistency: 91%
-- Length-score correlation: 0.12
-### Per-Criterion Performance
-| Criterion | Spearman's ρ | κ |
-|-----------|--------------|---|
-| Accuracy | 0.88 | 0.79 |
-| Clarity | 0.76 | 0.68 |
-| Completeness | 0.81 | 0.72 |
-### Recommendations
-- All metrics within acceptable ranges
-- Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement
-```
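
Reviewer note: the deleted metrics guide gives Cohen's κ as (Observed Agreement - Expected Agreement) / (1 - Expected Agreement) and delegates the computation to scikit-learn. The sketch below works the formula out by hand with only the standard library, to show where each term comes from. The function name `cohens_kappa_manual` and the sample labels are made up for illustration. Returning 1.0 when expected agreement is exactly 1 is a convention chosen for this sketch, since κ is undefined in that case.

```python
from collections import Counter


def cohens_kappa_manual(labels1, labels2):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels1) != len(labels2) or not labels1:
        raise ValueError("Need two non-empty label lists of equal length")
    n = len(labels1)

    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n

    # Expected agreement: chance that both raters pick the same label,
    # given each rater's own label frequencies.
    counts1, counts2 = Counter(labels1), Counter(labels2)
    expected = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout (sketch convention).
        return 1.0
    return (observed - expected) / (1 - expected)


judge = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(cohens_kappa_manual(judge, human), 3))
```

Here the raters agree on 8 of 10 items (observed 0.8) and each uses both labels half the time (expected 0.5), so κ comes out at about 0.6, the boundary between "moderate" and "substantial" in the guide's interpretation list.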