omgkit 2.5.1 → 2.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
    "name": "omgkit",
-   "version": "2.5.1",
-   "description": "Omega-Level Development Kit - AI Team System for Claude Code. 23 agents, 54 commands, 72 skills, sprint management.",
+   "version": "2.6.0",
+   "description": "Omega-Level Development Kit - AI Team System for Claude Code. 23 agents, 58 commands, 88 skills, sprint management.",
    "keywords": [
      "claude-code",
      "ai",
@@ -0,0 +1,65 @@
+ ---
+ name: ai-engineering
+ description: Building production AI applications with Foundation Models. Covers prompt engineering, RAG, agents, finetuning, evaluation, and deployment. Use when working with LLMs, building AI features, or architecting AI systems.
+ ---
+
+ # AI Engineering Skills
+
+ Comprehensive skills for building AI applications with Foundation Models.
+
+ ## AI Engineering Stack
+
+ ```
+ ┌────────────────────────────────────────────────────┐
+ │                  APPLICATION LAYER                 │
+ │     Prompt Engineering, RAG, Agents, Guardrails    │
+ ├────────────────────────────────────────────────────┤
+ │                     MODEL LAYER                    │
+ │       Model Selection, Finetuning, Evaluation      │
+ ├────────────────────────────────────────────────────┤
+ │                INFRASTRUCTURE LAYER                │
+ │   Inference Optimization, Caching, Orchestration   │
+ └────────────────────────────────────────────────────┘
+ ```
+
+ ## 12 Core Skills
+
+ | Skill | Description | Guide |
+ |-------|-------------|-------|
+ | Foundation Models | Model architecture, sampling, structured outputs | [foundation-models/](foundation-models/SKILL.md) |
+ | Evaluation Methodology | Metrics, AI-as-judge, comparative evaluation | [evaluation-methodology/](evaluation-methodology/SKILL.md) |
+ | AI System Evaluation | End-to-end evaluation, benchmarks, model selection | [ai-system-evaluation/](ai-system-evaluation/SKILL.md) |
+ | Prompt Engineering | System prompts, few-shot, chain-of-thought, defense | [prompt-engineering/](prompt-engineering/SKILL.md) |
+ | RAG Systems | Chunking, embedding, retrieval, reranking | [rag-systems/](rag-systems/SKILL.md) |
+ | AI Agents | Tool use, planning strategies, memory systems | [ai-agents/](ai-agents/SKILL.md) |
+ | Finetuning | LoRA, QLoRA, PEFT, model merging | [finetuning/](finetuning/SKILL.md) |
+ | Dataset Engineering | Data quality, curation, synthesis, annotation | [dataset-engineering/](dataset-engineering/SKILL.md) |
+ | Inference Optimization | Quantization, batching, caching, speculative decoding | [inference-optimization/](inference-optimization/SKILL.md) |
+ | AI Architecture | Gateway, routing, observability, deployment | [ai-architecture/](ai-architecture/SKILL.md) |
+ | Guardrails & Safety | Input/output guards, PII protection, injection defense | [guardrails-safety/](guardrails-safety/SKILL.md) |
+ | User Feedback | Explicit/implicit signals, feedback loops, A/B testing | [user-feedback/](user-feedback/SKILL.md) |
+
+ ## Development Process
+
+ ```
+ 1. Use Case Evaluation → 2. Model Selection → 3. Evaluation Pipeline
+
+ 4. Prompt Engineering → 5. Context (RAG/Agents) → 6. Finetuning (if needed)
+
+ 7. Inference Optimization → 8. Deployment → 9. Monitoring & Feedback
+ ```
+
+ ## Quick Decision Guide
+
+ | Need | Start With |
+ |------|------------|
+ | Improve output quality | prompt-engineering |
+ | Add external knowledge | rag-systems |
+ | Multi-step reasoning | ai-agents |
+ | Reduce latency/cost | inference-optimization |
+ | Measure quality | evaluation-methodology |
+ | Protect system | guardrails-safety |
+
+ ## Reference
+
+ Based on "AI Engineering" by Chip Huyen (O'Reilly, 2025).
@@ -0,0 +1,157 @@
+ ---
+ name: ai-agents
+ description: Building AI agents - tool use, planning strategies (ReAct, Plan-and-Execute), memory systems, agent evaluation. Use when building autonomous AI systems, tool-augmented apps, or multi-step workflows.
+ ---
+
+ # AI Agents
+
+ Building AI agents with tools and planning.
+
+ ## Agent Architecture
+
+ ```
+ ┌─────────────────────────────────────┐
+ │              AI AGENT               │
+ ├─────────────────────────────────────┤
+ │  ┌──────────┐                       │
+ │  │  BRAIN   │  (Foundation Model)   │
+ │  │ Planning │                       │
+ │  │ Reasoning│                       │
+ │  └────┬─────┘                       │
+ │       │                             │
+ │   ┌───┴───┐                         │
+ │   ↓       ↓                         │
+ │ ┌─────┐ ┌──────┐                    │
+ │ │TOOLS│ │MEMORY│                    │
+ │ └─────┘ └──────┘                    │
+ └─────────────────────────────────────┘
+ ```
+
+ ## Tool Definition
+
+ ```python
+ tools = [{
+     "type": "function",
+     "function": {
+         "name": "search_database",
+         "description": "Search products by query",
+         "parameters": {
+             "type": "object",
+             "properties": {
+                 "query": {"type": "string"},
+                 "category": {"type": "string", "enum": ["electronics", "clothing"]},
+                 "max_price": {"type": "number"}
+             },
+             "required": ["query"]
+         }
+     }
+ }]
+ ```
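+
+ A minimal sketch of the dispatch loop this schema plugs into, assuming the OpenAI Python client (`client.chat.completions.create`); the model name and the local `search_database` implementation are placeholders, so adapt both to your stack.
+
+ ```python
+ import json
+ from openai import OpenAI
+
+ client = OpenAI()
+
+ def run_tool_loop(messages, tools, max_rounds=5):
+     """Let the model call tools until it returns a plain answer."""
+     for _ in range(max_rounds):
+         response = client.chat.completions.create(
+             model="gpt-4o",  # placeholder model name
+             messages=messages,
+             tools=tools,
+         )
+         message = response.choices[0].message
+         if not message.tool_calls:
+             return message.content           # final answer, no tool requested
+         messages.append(message)             # keep the tool request in history
+         for call in message.tool_calls:
+             args = json.loads(call.function.arguments)
+             result = search_database(**args)  # your implementation of the tool above
+             messages.append({
+                 "role": "tool",
+                 "tool_call_id": call.id,
+                 "content": json.dumps(result),
+             })
+     raise RuntimeError("Tool loop did not converge")
+ ```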
+
+ ## Planning Strategies
+
+ ### ReAct (Reasoning + Acting)
+ ```python
+ REACT_PROMPT = """Tools: {tools}
+
+ Format:
+ Thought: [reasoning]
+ Action: [tool_name]
+ Action Input: [JSON]
+ Observation: [result]
+ ... repeat ...
+ Final Answer: [answer]
+
+ Question: {question}
+ Thought:"""
+
+ def react_agent(question, tools, max_steps=10):
+     prompt = REACT_PROMPT.format(tools=tools, question=question)
+
+     for _ in range(max_steps):
+         response = llm.generate(prompt)
+
+         if "Final Answer:" in response:
+             return extract_answer(response)
+
+         action, action_input = parse_action(response)
+         observation = execute(tools[action], action_input)
+         prompt += f"\nObservation: {observation}\nThought:"
+ ```
+
+ ### Plan-and-Execute
+ ```python
+ def plan_and_execute(task, tools):
+     # Step 1: Create plan
+     plan = llm.generate(f"Create step-by-step plan for: {task}")
+     steps = parse_plan(plan)
+
+     # Step 2: Execute each step
+     results = []
+     for step in steps:
+         result = execute_step(step, tools)
+         results.append(result)
+
+     # Step 3: Synthesize
+     return synthesize(task, results)
+ ```
+
+ ### Reflexion (Self-Reflection)
+ ```python
+ def reflexion_agent(task, max_attempts=3):
+     memory = []
+
+     for attempt in range(max_attempts):
+         solution = generate(task, memory)
+         success, feedback = evaluate(solution)
+
+         if success:
+             return solution
+
+         reflection = reflect(task, solution, feedback)
+         memory.append({"solution": solution, "reflection": reflection})
+ ```
+
+ ## Memory Systems
+
+ ```python
+ class AgentMemory:
+     def __init__(self):
+         self.short_term = []         # Recent turns
+         self.long_term = VectorDB()  # Persistent
+
+     def add(self, message):
+         self.short_term.append(message)
+         if len(self.short_term) > 20:
+             self.consolidate()
+
+     def consolidate(self):
+         summary = summarize(self.short_term[:10])
+         self.long_term.add(summary)
+         self.short_term = self.short_term[10:]
+
+     def retrieve(self, query, k=5):
+         return {
+             "recent": self.short_term[-5:],
+             "relevant": self.long_term.search(query, k),
+         }
+ ```
+
+ ## Agent Evaluation
+
+ ```python
+ def evaluate_agent(agent, test_cases):
+     successes, steps, latencies = [], [], []
+     for case in test_cases:              # run each case exactly once
+         start = time.time()
+         answer = agent.run(case["task"])
+         latencies.append(time.time() - start)
+         successes.append(answer == case["expected"])
+         steps.append(agent.step_count)
+     return {
+         "task_success": mean(successes),
+         "avg_steps": mean(steps),
+         "avg_latency": mean(latencies),
+     }
+ ```
+
+ ## Best Practices
+
+ 1. Start with simple tools, add complexity gradually
+ 2. Add reflection for complex tasks
+ 3. Limit max steps to prevent infinite loops
+ 4. Log all agent actions for debugging (see the sketch below)
+ 5. Use evaluation to measure progress
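+
+ A minimal sketch of the action logging from practice 4, using the standard logging module; the event fields are an assumption, so extend them to match your debugging needs.
+
+ ```python
+ import json
+ import logging
+ import time
+
+ logger = logging.getLogger("agent")
+
+ def log_step(step, thought, action, action_input, observation):
+     """Emit one structured record per agent step so runs can be replayed."""
+     logger.info(json.dumps({
+         "ts": time.time(),
+         "step": step,
+         "thought": thought,
+         "action": action,
+         "input": action_input,
+         "observation": str(observation)[:500],  # truncate large tool outputs
+     }))
+ ```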
@@ -0,0 +1,133 @@
+ ---
+ name: ai-architecture
+ description: AI application architecture - gateway, orchestration, model routing, observability, deployment patterns. Use when designing AI systems, scaling applications, or building production infrastructure.
+ ---
+
+ # AI Architecture Skill
+
+ Designing production AI applications.
+
+ ## Reference Architecture
+
+ ```
+ ┌─────────────────────────────────────────────────────┐
+ │        CLIENT LAYER (Web, Mobile, API, CLI)         │
+ └───────────────────────┬─────────────────────────────┘
+
+ ┌───────────────────────┴─────────────────────────────┐
+ │                    GATEWAY LAYER                    │
+ │          Rate Limiter | Auth | Input Guard          │
+ └───────────────────────┬─────────────────────────────┘
+
+ ┌───────────────────────┴─────────────────────────────┐
+ │                 ORCHESTRATION LAYER                 │
+ │   Router | Cache | Context | Agent | Output Guard   │
+ └───────────────────────┬─────────────────────────────┘
+
+ ┌───────────────────────┴─────────────────────────────┐
+ │                     MODEL LAYER                     │
+ │         Primary LLM | Fallback | Specialized        │
+ └───────────────────────┬─────────────────────────────┘
+
+ ┌───────────────────────┴─────────────────────────────┐
+ │                     DATA LAYER                      │
+ │             Vector DB | SQL DB | Cache              │
+ └─────────────────────────────────────────────────────┘
+ ```
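+
+ A minimal, framework-free sketch of the gateway-layer checks (rate limiting plus a crude input guard); the window size, request limit, and blocked patterns are illustrative assumptions.
+
+ ```python
+ import re
+ import time
+ from collections import defaultdict, deque
+
+ RATE_LIMIT = 60  # requests per minute per user (assumption)
+ BLOCKED = [r"ignore (all )?previous instructions", r"system prompt"]  # toy patterns
+
+ _requests = defaultdict(deque)
+
+ def gateway_check(user_id, text):
+     """Return (allowed, reason). Runs before the orchestration layer."""
+     now = time.time()
+     window = _requests[user_id]
+     while window and now - window[0] > 60:
+         window.popleft()                 # drop requests older than the window
+     if len(window) >= RATE_LIMIT:
+         return False, "rate_limited"
+     window.append(now)
+
+     if any(re.search(p, text, re.IGNORECASE) for p in BLOCKED):
+         return False, "input_guard_blocked"
+     return True, "ok"
+ ```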
+
+ ## Model Router
+
+ ```python
+ class ModelRouter:
+     def __init__(self):
+         self.models = {
+             "gpt-4": {"cost": 0.03, "quality": 0.95, "latency": 2.0},
+             "gpt-3.5": {"cost": 0.002, "quality": 0.80, "latency": 0.5},
+         }
+         self.classifier = load_complexity_classifier()
+
+     def route(self, query, constraints):
+         complexity = self.classifier.predict(query)
+
+         if complexity == "simple" and constraints.get("cost_sensitive"):
+             return "gpt-3.5"
+         elif complexity == "complex":
+             return "gpt-4"
+         else:
+             return "gpt-3.5"
+
+     def with_fallback(self, query, primary, fallbacks):
+         for model in [primary] + fallbacks:
+             try:
+                 response = self.call(model, query)
+                 if self.validate(response):
+                     return response
+             except Exception:
+                 continue
+         raise Exception("All models failed")
+ ```
+
+ ## Context Enhancement
+
+ ```python
+ class ContextEnhancer:
+     def enhance(self, query, history):
+         # Retrieve
+         docs = self.retriever.retrieve(query, k=10)
+
+         # Rerank
+         docs = self.rerank(query, docs)[:5]
+
+         # Compress if needed
+         context = self.format(docs)
+         if len(context) > 4000:
+             context = self.summarize(context)
+
+         # Add history
+         history_context = self.format_history(history[-5:])
+
+         return {
+             "retrieved": context,
+             "history": history_context
+         }
+ ```
+
+ ## Observability
+
+ ```python
+ import time
+
+ from opentelemetry import trace
+ from prometheus_client import Counter, Histogram
+
+ REQUESTS = Counter('ai_requests', 'Total', ['model', 'status'])
+ LATENCY = Histogram('ai_latency', 'Latency', ['model'])
+ TOKENS = Counter('ai_tokens', 'Tokens', ['model', 'type'])
+
+ tracer = trace.get_tracer(__name__)
+
+ class ObservableClient:
+     def generate(self, prompt, model):
+         with tracer.start_as_current_span("ai_generate") as span:
+             span.set_attribute("model", model)
+
+             start = time.time()
+             try:
+                 response = self.client.generate(prompt, model)
+
+                 REQUESTS.labels(model=model, status="ok").inc()
+                 LATENCY.labels(model=model).observe(time.time() - start)
+                 TOKENS.labels(model=model, type="in").inc(count(prompt))
+                 TOKENS.labels(model=model, type="out").inc(count(response))
+
+                 return response
+             except Exception:
+                 REQUESTS.labels(model=model, status="error").inc()
+                 raise
+ ```
+
+ ## Best Practices
+
+ 1. Add gateway for rate limiting/auth
+ 2. Use model router for cost optimization
+ 3. Implement fallback chains
+ 4. Add comprehensive observability
+ 5. Cache at multiple levels (see the sketch below)
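+
+ A minimal sketch of the multi-level cache from practice 5: an exact-match layer in front of a semantic layer. The `embed` callable and the 0.95 similarity threshold are assumptions to adapt.
+
+ ```python
+ import hashlib
+
+ import numpy as np
+
+ class MultiLevelCache:
+     def __init__(self, embed, threshold=0.95):
+         self.embed = embed        # e.g. a sentence-transformers encode function
+         self.threshold = threshold
+         self.exact = {}           # hash(prompt) -> response
+         self.vectors, self.responses = [], []
+
+     def get(self, prompt):
+         key = hashlib.sha256(prompt.encode()).hexdigest()
+         if key in self.exact:                      # level 1: exact match
+             return self.exact[key]
+         if self.vectors:                           # level 2: semantic match
+             q = self.embed(prompt)
+             sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
+                     for v in self.vectors]
+             best = int(np.argmax(sims))
+             if sims[best] >= self.threshold:
+                 return self.responses[best]
+         return None
+
+     def put(self, prompt, response):
+         key = hashlib.sha256(prompt.encode()).hexdigest()
+         self.exact[key] = response
+         self.vectors.append(self.embed(prompt))
+         self.responses.append(response)
+ ```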
@@ -0,0 +1,95 @@
+ ---
+ name: ai-system-evaluation
+ description: End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.
+ ---
+
+ # AI System Evaluation
+
+ Evaluating AI systems end-to-end.
+
+ ## Evaluation Criteria
+
+ ### 1. Domain-Specific Capability
+
+ | Domain | Benchmarks |
+ |--------|------------|
+ | Math & Reasoning | GSM-8K, MATH |
+ | Code | HumanEval, MBPP |
+ | Knowledge | MMLU, ARC |
+ | Multi-turn Chat | MT-Bench |
+
+ ### 2. Generation Quality
+
+ | Criterion | Measurement |
+ |-----------|-------------|
+ | Factual Consistency | NLI, SAFE, SelfCheckGPT |
+ | Coherence | AI judge rubric |
+ | Relevance | Semantic similarity |
+ | Fluency | Perplexity |
+
+ ### 3. Cost & Latency
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class PerformanceMetrics:
+     ttft: float        # Time to First Token (seconds)
+     tpot: float        # Time Per Output Token
+     throughput: float  # Tokens/second
+
+     def cost(self, input_tokens, output_tokens, prices):
+         return input_tokens * prices["input"] + output_tokens * prices["output"]
+ ```
+
+ ## Model Selection Workflow
+
+ ```
+ 1. Define Requirements
+    ├── Task type
+    ├── Quality threshold
+    ├── Latency requirements (<2s TTFT)
+    ├── Cost budget
+    └── Deployment constraints
+
+ 2. Filter Options
+    ├── API vs Self-hosted
+    ├── Open source vs Proprietary
+    └── Size constraints
+
+ 3. Benchmark on Your Data
+    ├── Create eval dataset (100+ examples)
+    ├── Run experiments
+    └── Analyze results
+
+ 4. Make Decision
+    └── Balance quality, cost, latency
+ ```
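+
+ A minimal sketch of step 3, assuming a generic `call_model(name, prompt)` helper that returns the output plus a token-usage dict, and an eval set of `{"prompt", "expected"}` records; scoring here is exact match, so swap in a judge or similarity metric for open-ended tasks.
+
+ ```python
+ import time
+ from statistics import mean
+
+ def benchmark(models, eval_set, call_model, prices):
+     """Compare candidate models on your own data: quality, latency, cost."""
+     report = {}
+     for name in models:
+         correct, latencies, cost = [], [], 0.0
+         for example in eval_set:
+             start = time.time()
+             output, usage = call_model(name, example["prompt"])  # hypothetical helper
+             latencies.append(time.time() - start)
+             correct.append(output.strip() == example["expected"])
+             cost += (usage["input_tokens"] * prices[name]["input"]
+                      + usage["output_tokens"] * prices[name]["output"])
+         report[name] = {
+             "accuracy": mean(correct),
+             "avg_latency_s": mean(latencies),
+             "total_cost_usd": round(cost, 4),
+         }
+     return report
+ ```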
+
+ ## Build vs Buy
+
+ | Factor | API | Self-Host |
+ |--------|-----|-----------|
+ | Data Privacy | Less control | Full control |
+ | Performance | Best models | Slightly behind |
+ | Cost at Scale | Expensive | Amortized |
+ | Customization | Limited | Full control |
+ | Maintenance | Zero | Significant |
+
+ ## Public Benchmarks
+
+ | Benchmark | Focus |
+ |-----------|-------|
+ | MMLU | Knowledge (57 subjects) |
+ | HumanEval | Code generation |
+ | GSM-8K | Math reasoning |
+ | TruthfulQA | Factuality |
+ | MT-Bench | Multi-turn chat |
+
+ **Caution**: Benchmarks can be gamed. Data contamination is common. Always evaluate on YOUR data.
+
+ ## Best Practices
+
+ 1. Test on domain-specific data
+ 2. Measure both quality and cost
+ 3. Consider latency requirements
+ 4. Plan for fallback models
+ 5. Re-evaluate periodically
@@ -0,0 +1,135 @@
+ ---
+ name: dataset-engineering
+ description: Building and processing datasets - data quality, curation, deduplication, synthesis, annotation, formatting. Use when creating training data, improving data quality, or generating synthetic data.
+ ---
+
+ # Dataset Engineering Skill
+
+ Building high-quality datasets for AI applications.
+
+ ## Data Quality Dimensions
+
+ | Dimension | Description | Check |
+ |-----------|-------------|-------|
+ | Accuracy | Data is correct | Validation |
+ | Completeness | No missing values | Schema check |
+ | Consistency | No contradictions | Dedup |
+ | Timeliness | Up-to-date | Timestamps |
+ | Relevance | Matches use case | Filtering |
+
+ ## Data Curation Pipeline
+
+ ```python
+ class DataCurationPipeline:
+     def run(self, raw_data):
+         # 1. Inspect
+         self.inspect(raw_data)
+
+         # 2. Deduplicate
+         data = self.deduplicator.dedupe(raw_data)
+
+         # 3. Clean and filter
+         data = self.cleaner.clean(data)
+         data = self.filter.filter(data)
+
+         # 4. Format
+         return self.formatter.format(data)
+ ```
+
+ ## Deduplication
+
+ ```python
+ from datasketch import MinHash, MinHashLSH
+
+ class Deduplicator:
+     def __init__(self, threshold=0.8):
+         self.lsh = MinHashLSH(threshold=threshold, num_perm=128)
+
+     def minhash(self, text):
+         m = MinHash(num_perm=128)
+         for word in text.split():
+             m.update(word.encode('utf8'))
+         return m
+
+     def dedupe(self, docs):
+         unique = []
+         for i, doc in enumerate(docs):
+             mh = self.minhash(doc["text"])
+             if not self.lsh.query(mh):
+                 self.lsh.insert(f"doc_{i}", mh)
+                 unique.append(doc)
+         return unique
+ ```
+
+ ## Data Synthesis
+
+ ### AI-Powered QA Generation
+ ```python
+ import json
+
+ def generate_qa(document, model, n=5):
+     prompt = f"""Generate {n} QA pairs from:
+
+ {document}
+
+ Format: [{{"question": "...", "answer": "..."}}]"""
+
+     return json.loads(model.generate(prompt))
+ ```
+
+ ### Self-Instruct
+ ```python
+ import random
+
+ def self_instruct(seeds, model, n=100):
+     generated = []
+
+     for _ in range(n):
+         samples = random.sample(seeds + generated[-20:], 5)
+         prompt = f"Examples:\n{format(samples)}\n\nNew task:"
+
+         new = model.generate(prompt)
+         if is_valid(new) and is_diverse(new, generated):
+             generated.append(new)
+
+     return generated
+ ```
+
+ ### Data Augmentation
+ ```python
+ def augment_text(text):
+     methods = [
+         lambda t: synonym_replace(t),
+         lambda t: back_translate(t),
+         lambda t: model.rephrase(t)
+     ]
+     return random.choice(methods)(text)
+ ```
+
+ ## Data Formatting
+
+ ### Instruction Format
+ ```python
+ def format_instruction(example):
+     return f"""### Instruction:
+ {example['instruction']}
+
+ ### Input:
+ {example.get('input', '')}
+
+ ### Response:
+ {example['output']}"""
+ ```
+
+ ### Chat Format
+ ```python
+ def format_chat(conversation):
+     return [
+         {"role": turn["role"], "content": turn["content"]}
+         for turn in conversation
+     ]
+ ```
+
+ ## Best Practices
+
+ 1. Inspect data before processing
+ 2. Deduplicate before expensive operations
+ 3. Use multiple synthesis methods
+ 4. Validate synthetic data quality (see the sketch below)
+ 5. Track data lineage
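+
+ A minimal sketch of the validation pass from practice 4: cheap structural checks plus an optional AI-judge score. The thresholds and the `judge` callable are assumptions.
+
+ ```python
+ def validate_synthetic(records, judge=None, min_len=10, max_len=2000, min_score=4):
+     """Keep only records that pass structural checks (and, optionally, a judge)."""
+     kept, rejected = [], []
+     for r in records:
+         ok = (
+             isinstance(r.get("question"), str)
+             and isinstance(r.get("answer"), str)
+             and min_len <= len(r["question"]) + len(r["answer"]) <= max_len
+             and r["question"].strip() != r["answer"].strip()
+         )
+         if ok and judge is not None:
+             ok = judge(r["question"], r["answer"]) >= min_score  # e.g. 1-5 rubric
+         (kept if ok else rejected).append(r)
+     return kept, rejected
+ ```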
@@ -0,0 +1,93 @@
+ ---
+ name: evaluation-methodology
+ description: Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.
+ ---
+
+ # Evaluation Methodology
+
+ Methods for evaluating Foundation Model outputs.
+
+ ## Evaluation Approaches
+
+ ### 1. Exact Evaluation
+
+ | Method | Use Case | Example |
+ |--------|----------|---------|
+ | Exact Match | QA, Math | `"5" == "5"` |
+ | Functional Correctness | Code | Pass test cases |
+ | BLEU/ROUGE | Translation | N-gram overlap |
+ | Semantic Similarity | Open-ended | Embedding cosine |
+
+ ```python
+ # Semantic Similarity: generated and reference are the two strings to compare
+ from sentence_transformers import SentenceTransformer
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ model = SentenceTransformer('all-MiniLM-L6-v2')
+ emb1 = model.encode([generated])
+ emb2 = model.encode([reference])
+ similarity = cosine_similarity(emb1, emb2)[0][0]
+ ```
+
+ ### 2. AI as Judge
+
+ ```python
+ JUDGE_PROMPT = """Rate the response on a scale of 1-5.
+
+ Criteria:
+ - Accuracy: Is information correct?
+ - Helpfulness: Does it address the need?
+ - Clarity: Is it easy to understand?
+
+ Query: {query}
+ Response: {response}
+
+ Return JSON: {"score": N, "reasoning": "..."}"""
+
+ # Multi-judge for reliability
+ judges = ["gpt-4", "claude-3"]
+ scores = [get_score(judge, response) for judge in judges]
+ final_score = sum(scores) / len(scores)
+ ```
+
+ ### 3. Comparative Evaluation (ELO)
+
+ ```python
+ COMPARE_PROMPT = """Compare these responses.
+
+ Query: {query}
+ A: {response_a}
+ B: {response_b}
+
+ Which is better? Return: A, B, or tie"""
+
+ def update_elo(rating_a, rating_b, winner, k=32):
+     expected_a = 1 / (1 + 10**((rating_b - rating_a) / 400))
+     score_a = 1 if winner == "A" else 0 if winner == "B" else 0.5
+     return rating_a + k * (score_a - expected_a)
+ ```
+
+ ## Evaluation Pipeline
+
+ ```
+ 1. Define Criteria (accuracy, helpfulness, safety)
+
+ 2. Create Scoring Rubric with Examples
+
+ 3. Select Methods (exact + AI judge + human)
+
+ 4. Create Evaluation Dataset
+
+ 5. Run Evaluation
+
+ 6. Analyze & Iterate
+ ```
+
+ ## Best Practices
+
+ 1. Use multiple evaluation methods
+ 2. Calibrate AI judges with human data (see the sketch below)
+ 3. Include both automatic and human evaluation
+ 4. Version your evaluation datasets
+ 5. Track metrics over time
+ 6. Test for position bias in comparisons
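+
+ A minimal sketch of the calibration check from practice 2: compare judge scores against human labels on the same examples. `scipy.stats.spearmanr` is one reasonable agreement measure among several, and the 0.7 threshold is an assumption to tune per task.
+
+ ```python
+ from scipy.stats import spearmanr
+
+ def calibrate_judge(judge_scores, human_scores, threshold=0.7):
+     """judge_scores and human_scores are parallel lists for the same examples."""
+     corr, _ = spearmanr(judge_scores, human_scores)
+     exact_agreement = sum(j == h for j, h in zip(judge_scores, human_scores)) / len(human_scores)
+     return {
+         "spearman": corr,
+         "exact_agreement": exact_agreement,
+         "usable": corr >= threshold,
+     }
+ ```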