omgkit 2.5.2 → 2.6.0
- package/package.json +2 -2
- package/plugin/skills/ai-engineering/SKILL.md +65 -0
- package/plugin/skills/ai-engineering/ai-agents/SKILL.md +157 -0
- package/plugin/skills/ai-engineering/ai-architecture/SKILL.md +133 -0
- package/plugin/skills/ai-engineering/ai-system-evaluation/SKILL.md +95 -0
- package/plugin/skills/ai-engineering/dataset-engineering/SKILL.md +135 -0
- package/plugin/skills/ai-engineering/evaluation-methodology/SKILL.md +93 -0
- package/plugin/skills/ai-engineering/finetuning/SKILL.md +133 -0
- package/plugin/skills/ai-engineering/foundation-models/SKILL.md +90 -0
- package/plugin/skills/ai-engineering/guardrails-safety/SKILL.md +153 -0
- package/plugin/skills/ai-engineering/inference-optimization/SKILL.md +150 -0
- package/plugin/skills/ai-engineering/prompt-engineering/SKILL.md +133 -0
- package/plugin/skills/ai-engineering/rag-systems/SKILL.md +137 -0
- package/plugin/skills/ai-engineering/user-feedback/SKILL.md +162 -0
package/package.json
CHANGED
@@ -1,7 +1,7 @@
  {
    "name": "omgkit",
-   "version": "2.5.2",
-   "description": "Omega-Level Development Kit - AI Team System for Claude Code. 23 agents, 58 commands,
+   "version": "2.6.0",
+   "description": "Omega-Level Development Kit - AI Team System for Claude Code. 23 agents, 58 commands, 88 skills, sprint management.",
    "keywords": [
      "claude-code",
      "ai",
package/plugin/skills/ai-engineering/SKILL.md
ADDED
@@ -0,0 +1,65 @@
---
name: ai-engineering
description: Building production AI applications with Foundation Models. Covers prompt engineering, RAG, agents, finetuning, evaluation, and deployment. Use when working with LLMs, building AI features, or architecting AI systems.
---

# AI Engineering Skills

Comprehensive skills for building AI applications with Foundation Models.

## AI Engineering Stack

```
┌─────────────────────────────────────────────────────┐
│                  APPLICATION LAYER                  │
│     Prompt Engineering, RAG, Agents, Guardrails     │
├─────────────────────────────────────────────────────┤
│                     MODEL LAYER                     │
│       Model Selection, Finetuning, Evaluation       │
├─────────────────────────────────────────────────────┤
│                 INFRASTRUCTURE LAYER                │
│    Inference Optimization, Caching, Orchestration   │
└─────────────────────────────────────────────────────┘
```

## 12 Core Skills

| Skill | Description | Guide |
|-------|-------------|-------|
| Foundation Models | Model architecture, sampling, structured outputs | [foundation-models/](foundation-models/SKILL.md) |
| Evaluation Methodology | Metrics, AI-as-judge, comparative evaluation | [evaluation-methodology/](evaluation-methodology/SKILL.md) |
| AI System Evaluation | End-to-end evaluation, benchmarks, model selection | [ai-system-evaluation/](ai-system-evaluation/SKILL.md) |
| Prompt Engineering | System prompts, few-shot, chain-of-thought, defense | [prompt-engineering/](prompt-engineering/SKILL.md) |
| RAG Systems | Chunking, embedding, retrieval, reranking | [rag-systems/](rag-systems/SKILL.md) |
| AI Agents | Tool use, planning strategies, memory systems | [ai-agents/](ai-agents/SKILL.md) |
| Finetuning | LoRA, QLoRA, PEFT, model merging | [finetuning/](finetuning/SKILL.md) |
| Dataset Engineering | Data quality, curation, synthesis, annotation | [dataset-engineering/](dataset-engineering/SKILL.md) |
| Inference Optimization | Quantization, batching, caching, speculative decoding | [inference-optimization/](inference-optimization/SKILL.md) |
| AI Architecture | Gateway, routing, observability, deployment | [ai-architecture/](ai-architecture/SKILL.md) |
| Guardrails & Safety | Input/output guards, PII protection, injection defense | [guardrails-safety/](guardrails-safety/SKILL.md) |
| User Feedback | Explicit/implicit signals, feedback loops, A/B testing | [user-feedback/](user-feedback/SKILL.md) |

## Development Process

```
1. Use Case Evaluation → 2. Model Selection → 3. Evaluation Pipeline
                                ↓
4. Prompt Engineering → 5. Context (RAG/Agents) → 6. Finetuning (if needed)
                                ↓
7. Inference Optimization → 8. Deployment → 9. Monitoring & Feedback
```

## Quick Decision Guide

| Need | Start With |
|------|------------|
| Improve output quality | prompt-engineering |
| Add external knowledge | rag-systems |
| Multi-step reasoning | ai-agents |
| Reduce latency/cost | inference-optimization |
| Measure quality | evaluation-methodology |
| Protect system | guardrails-safety |

## Reference

Based on "AI Engineering" by Chip Huyen (O'Reilly, 2025).
package/plugin/skills/ai-engineering/ai-agents/SKILL.md
ADDED
@@ -0,0 +1,157 @@
---
name: ai-agents
description: Building AI agents - tool use, planning strategies (ReAct, Plan-and-Execute), memory systems, agent evaluation. Use when building autonomous AI systems, tool-augmented apps, or multi-step workflows.
---

# AI Agents

Building AI agents with tools and planning.

## Agent Architecture

```
┌─────────────────────────────────────┐
│              AI AGENT               │
├─────────────────────────────────────┤
│   ┌──────────┐                      │
│   │  BRAIN   │  (Foundation Model)  │
│   │ Planning │                      │
│   │ Reasoning│                      │
│   └────┬─────┘                      │
│        │                            │
│    ┌───┴───┐                        │
│    ↓       ↓                        │
│ ┌─────┐ ┌──────┐                    │
│ │TOOLS│ │MEMORY│                    │
│ └─────┘ └──────┘                    │
└─────────────────────────────────────┘
```

## Tool Definition

```python
tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search products by query",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string", "enum": ["electronics", "clothing"]},
                "max_price": {"type": "number"}
            },
            "required": ["query"]
        }
    }
}]
```
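The schema above only declares the tool's interface; the host application still has to execute whatever call the model requests. A minimal dispatch sketch, assuming the model's reply arrives as an OpenAI-style message dict with a `tool_calls` list and that `search_database` exists locally (both are assumptions for illustration, not part of this package):

```python
import json

def search_database(query, category=None, max_price=None):
    # Placeholder implementation, for illustration only.
    return [{"name": "USB-C cable", "price": 9.99}]

TOOL_IMPLS = {"search_database": search_database}

def run_tool_calls(response_message):
    """Execute every tool call the model requested and return tool-result messages."""
    results = []
    for call in response_message.get("tool_calls", []):
        fn_name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        output = TOOL_IMPLS[fn_name](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results
```

The exact response shape depends on the provider SDK; the point is that tool arguments arrive as a JSON string that must be parsed (and ideally validated) before dispatch.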
## Planning Strategies

### ReAct (Reasoning + Acting)
```python
REACT_PROMPT = """Tools: {tools}

Format:
Thought: [reasoning]
Action: [tool_name]
Action Input: [JSON]
Observation: [result]
... repeat ...
Final Answer: [answer]

Question: {question}
Thought:"""

def react_agent(question, tools, max_steps=10):
    # tools: mapping of tool name -> callable
    prompt = REACT_PROMPT.format(tools=", ".join(tools), question=question)

    for _ in range(max_steps):
        response = llm.generate(prompt)

        if "Final Answer:" in response:
            return extract_answer(response)

        action, action_input = parse_action(response)
        observation = execute(tools[action], action_input)
        # Keep the model's reasoning in context, then append the observation.
        prompt += response + f"\nObservation: {observation}\nThought:"
```
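The loop above leans on `parse_action` and `extract_answer`, which the skill leaves undefined. One possible regex-based sketch of those helpers (an illustration, not the package's own implementation):

```python
import json
import re

def parse_action(response: str):
    """Pull the Action / Action Input pair out of a ReAct-style completion."""
    action_match = re.search(r"Action:\s*(.+)", response)
    input_match = re.search(r"Action Input:\s*(\{.*\})", response, re.DOTALL)
    if not action_match or not input_match:
        raise ValueError(f"Could not parse an action from: {response!r}")
    return action_match.group(1).strip(), json.loads(input_match.group(1))

def extract_answer(response: str) -> str:
    """Return everything after the final 'Final Answer:' marker."""
    return response.split("Final Answer:")[-1].strip()
```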
### Plan-and-Execute
```python
def plan_and_execute(task, tools):
    # Step 1: Create plan
    plan = llm.generate(f"Create step-by-step plan for: {task}")
    steps = parse_plan(plan)

    # Step 2: Execute each step
    results = []
    for step in steps:
        result = execute_step(step, tools)
        results.append(result)

    # Step 3: Synthesize
    return synthesize(task, results)
```

### Reflexion (Self-Reflection)
```python
def reflexion_agent(task, max_attempts=3):
    memory = []

    for attempt in range(max_attempts):
        solution = generate(task, memory)
        success, feedback = evaluate(solution)

        if success:
            return solution

        reflection = reflect(task, solution, feedback)
        memory.append({"solution": solution, "reflection": reflection})

    return solution  # out of attempts: return the last (best-effort) solution
```

## Memory Systems

```python
class AgentMemory:
    def __init__(self):
        self.short_term = []         # Recent turns
        self.long_term = VectorDB()  # Persistent store (VectorDB, summarize: assumed helpers)

    def add(self, message):
        self.short_term.append(message)
        if len(self.short_term) > 20:
            self.consolidate()

    def consolidate(self):
        summary = summarize(self.short_term[:10])
        self.long_term.add(summary)
        self.short_term = self.short_term[10:]

    def retrieve(self, query, k=5):
        return {
            "recent": self.short_term[-5:],
            "relevant": self.long_term.search(query, k),
        }
```
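A short usage sketch (with `VectorDB` and `summarize` still assumed, as in the class above): the agent appends every turn and pulls a blended context before each model call.

```python
memory = AgentMemory()
memory.add({"role": "user", "content": "Find me a laptop under $1000"})
memory.add({"role": "assistant", "content": "Searching the catalog..."})

# Before the next model call, blend recent turns with relevant long-term items.
context = memory.retrieve("laptop recommendation")
prompt_context = context["recent"] + context["relevant"]
```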
## Agent Evaluation

```python
from statistics import mean
import time

def evaluate_agent(agent, test_cases):
    successes, steps, latencies = [], [], []
    for case in test_cases:  # run each case once and record all three metrics
        start = time.time()
        successes.append(agent.run(case["task"]) == case["expected"])
        latencies.append(time.time() - start)
        steps.append(agent.step_count)
    return {
        "task_success": mean(successes),
        "avg_steps": mean(steps),
        "avg_latency": mean(latencies),
    }
```

## Best Practices

1. Start with simple tools, add complexity gradually
2. Add reflection for complex tasks
3. Limit max steps to prevent infinite loops
4. Log all agent actions for debugging
5. Use evaluation to measure progress
package/plugin/skills/ai-engineering/ai-architecture/SKILL.md
ADDED
@@ -0,0 +1,133 @@
---
name: ai-architecture
description: AI application architecture - gateway, orchestration, model routing, observability, deployment patterns. Use when designing AI systems, scaling applications, or building production infrastructure.
---

# AI Architecture Skill

Designing production AI applications.

## Reference Architecture

```
┌─────────────────────────────────────────────────────┐
│        CLIENT LAYER (Web, Mobile, API, CLI)         │
└───────────────────────┬─────────────────────────────┘
                        │
┌───────────────────────┴─────────────────────────────┐
│                    GATEWAY LAYER                    │
│           Rate Limiter | Auth | Input Guard         │
└───────────────────────┬─────────────────────────────┘
                        │
┌───────────────────────┴─────────────────────────────┐
│                 ORCHESTRATION LAYER                 │
│   Router | Cache | Context | Agent | Output Guard   │
└───────────────────────┬─────────────────────────────┘
                        │
┌───────────────────────┴─────────────────────────────┐
│                     MODEL LAYER                     │
│          Primary LLM | Fallback | Specialized       │
└───────────────────────┬─────────────────────────────┘
                        │
┌───────────────────────┴─────────────────────────────┐
│                     DATA LAYER                      │
│              Vector DB | SQL DB | Cache             │
└─────────────────────────────────────────────────────┘
```

## Model Router

```python
class ModelRouter:
    def __init__(self):
        self.models = {
            "gpt-4": {"cost": 0.03, "quality": 0.95, "latency": 2.0},
            "gpt-3.5": {"cost": 0.002, "quality": 0.80, "latency": 0.5},
        }
        self.classifier = load_complexity_classifier()

    def route(self, query, constraints):
        complexity = self.classifier.predict(query)

        if complexity == "simple" and constraints.get("cost_sensitive"):
            return "gpt-3.5"
        elif complexity == "complex":
            return "gpt-4"
        else:
            return "gpt-3.5"

    def with_fallback(self, query, primary, fallbacks):
        for model in [primary] + fallbacks:
            try:
                response = self.call(model, query)
                if self.validate(response):
                    return response
            except Exception:
                continue
        raise Exception("All models failed")
```
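A short usage sketch tying the two methods together (the constraint keys and fallback list are illustrative, not part of the package):

```python
router = ModelRouter()

# Pick a model for this request, then call it with the remaining models as fallbacks.
primary = router.route("Summarize this support ticket", {"cost_sensitive": True})
fallbacks = [m for m in router.models if m != primary]
response = router.with_fallback("Summarize this support ticket", primary, fallbacks)
```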
## Context Enhancement

```python
class ContextEnhancer:
    def enhance(self, query, history):
        # Retrieve
        docs = self.retriever.retrieve(query, k=10)

        # Rerank
        docs = self.rerank(query, docs)[:5]

        # Compress if needed
        context = self.format(docs)
        if len(context) > 4000:
            context = self.summarize(context)

        # Add history
        history_context = self.format_history(history[-5:])

        return {
            "retrieved": context,
            "history": history_context
        }
```

## Observability

```python
import time

from opentelemetry import trace
from prometheus_client import Counter, Histogram

REQUESTS = Counter('ai_requests', 'Total', ['model', 'status'])
LATENCY = Histogram('ai_latency', 'Latency', ['model'])
TOKENS = Counter('ai_tokens', 'Tokens', ['model', 'type'])

tracer = trace.get_tracer(__name__)

class ObservableClient:
    def generate(self, prompt, model):
        with tracer.start_as_current_span("ai_generate") as span:
            span.set_attribute("model", model)

            start = time.time()
            try:
                response = self.client.generate(prompt, model)

                REQUESTS.labels(model=model, status="ok").inc()
                LATENCY.labels(model=model).observe(time.time() - start)
                TOKENS.labels(model=model, type="in").inc(count(prompt))    # count(): token-counting helper
                TOKENS.labels(model=model, type="out").inc(count(response))

                return response
            except Exception:
                REQUESTS.labels(model=model, status="error").inc()
                raise
```

## Best Practices

1. Add gateway for rate limiting/auth
2. Use model router for cost optimization
3. Implement fallback chains
4. Add comprehensive observability
5. Cache at multiple levels (see the response-cache sketch below)
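For practice 5, a minimal exact-match response cache at the orchestration layer might look like the sketch below. It is in-memory and illustrative only; a production setup would typically add a TTL, a shared store such as Redis, and possibly a semantic cache keyed on embeddings.

```python
import hashlib

class ResponseCache:
    """Exact-match prompt cache; one of several possible cache levels."""
    def __init__(self):
        self._store = {}

    def _key(self, model, prompt):
        return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        return self._store.get(self._key(model, prompt))

    def set(self, model, prompt, response):
        self._store[self._key(model, prompt)] = response

def cached_generate(cache, client, model, prompt):
    hit = cache.get(model, prompt)
    if hit is not None:
        return hit
    response = client.generate(prompt, model)  # client as in ObservableClient above (assumed)
    cache.set(model, prompt, response)
    return response
```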
package/plugin/skills/ai-engineering/ai-system-evaluation/SKILL.md
ADDED
@@ -0,0 +1,95 @@
---
name: ai-system-evaluation
description: End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.
---

# AI System Evaluation

Evaluating AI systems end-to-end.

## Evaluation Criteria

### 1. Domain-Specific Capability

| Domain | Benchmarks |
|--------|------------|
| Math & Reasoning | GSM-8K, MATH |
| Code | HumanEval, MBPP |
| Knowledge | MMLU, ARC |
| Multi-turn Chat | MT-Bench |

### 2. Generation Quality

| Criterion | Measurement |
|-----------|-------------|
| Factual Consistency | NLI, SAFE, SelfCheckGPT |
| Coherence | AI judge rubric |
| Relevance | Semantic similarity |
| Fluency | Perplexity |

### 3. Cost & Latency

```python
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    ttft: float        # Time to First Token (seconds)
    tpot: float        # Time Per Output Token (seconds)
    throughput: float  # Tokens/second

    def cost(self, input_tokens, output_tokens, prices):
        return input_tokens * prices["input"] + output_tokens * prices["output"]
```
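A quick worked usage of `cost()` with made-up per-token prices (not real pricing): 500 input tokens and 200 output tokens at $0.01 / $0.03 per 1K tokens come to 500·0.00001 + 200·0.00003 ≈ $0.011.

```python
metrics = PerformanceMetrics(ttft=0.4, tpot=0.02, throughput=50.0)

# Hypothetical per-token prices (i.e. $0.01 / $0.03 per 1K tokens).
prices = {"input": 0.00001, "output": 0.00003}
print(metrics.cost(input_tokens=500, output_tokens=200, prices=prices))  # ≈ 0.011
```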
## Model Selection Workflow

```
1. Define Requirements
   ├── Task type
   ├── Quality threshold
   ├── Latency requirements (<2s TTFT)
   ├── Cost budget
   └── Deployment constraints

2. Filter Options
   ├── API vs Self-hosted
   ├── Open source vs Proprietary
   └── Size constraints

3. Benchmark on Your Data
   ├── Create eval dataset (100+ examples)
   ├── Run experiments
   └── Analyze results

4. Make Decision
   └── Balance quality, cost, latency
```
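Step 3 is the part worth automating. A minimal harness sketch is shown below; `call_model`, the scoring function, and the candidate names are assumptions for illustration, not APIs shipped by this package.

```python
import time
from statistics import mean

def benchmark_models(candidates, eval_set, call_model, score):
    """Run every candidate model over the same eval set and compare.

    candidates: list of model names
    eval_set:   list of {"input": ..., "expected": ...} examples
    call_model: fn(model_name, input) -> output   (assumed helper)
    score:      fn(output, expected) -> float in [0, 1]
    """
    report = {}
    for model in candidates:
        scores, latencies = [], []
        for example in eval_set:
            start = time.time()
            output = call_model(model, example["input"])
            latencies.append(time.time() - start)
            scores.append(score(output, example["expected"]))
        report[model] = {
            "quality": mean(scores),
            "avg_latency_s": mean(latencies),
        }
    return report
```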
## Build vs Buy

| Factor | API | Self-Host |
|--------|-----|-----------|
| Data Privacy | Less control | Full control |
| Performance | Best models | Slightly behind |
| Cost at Scale | Expensive | Amortized |
| Customization | Limited | Full control |
| Maintenance | Zero | Significant |

## Public Benchmarks

| Benchmark | Focus |
|-----------|-------|
| MMLU | Knowledge (57 subjects) |
| HumanEval | Code generation |
| GSM-8K | Math reasoning |
| TruthfulQA | Factuality |
| MT-Bench | Multi-turn chat |

**Caution**: Benchmarks can be gamed. Data contamination is common. Always evaluate on YOUR data.

## Best Practices

1. Test on domain-specific data
2. Measure both quality and cost
3. Consider latency requirements
4. Plan for fallback models
5. Re-evaluate periodically
package/plugin/skills/ai-engineering/dataset-engineering/SKILL.md
ADDED
@@ -0,0 +1,135 @@
---
name: dataset-engineering
description: Building and processing datasets - data quality, curation, deduplication, synthesis, annotation, formatting. Use when creating training data, improving data quality, or generating synthetic data.
---

# Dataset Engineering Skill

Building high-quality datasets for AI applications.

## Data Quality Dimensions

| Dimension | Description | Check |
|-----------|-------------|-------|
| Accuracy | Data is correct | Validation |
| Completeness | No missing values | Schema check |
| Consistency | No contradictions | Dedup |
| Timeliness | Up-to-date | Timestamps |
| Relevance | Matches use case | Filtering |

## Data Curation Pipeline

```python
class DataCurationPipeline:
    def run(self, raw_data):
        # 1. Inspect
        self.inspect(raw_data)

        # 2. Deduplicate
        data = self.deduplicator.dedupe(raw_data)

        # 3. Clean and filter
        data = self.cleaner.clean(data)
        data = self.filter.filter(data)

        # 4. Format
        return self.formatter.format(data)
```

## Deduplication

```python
from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold=0.8):
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)

    def minhash(self, text):
        m = MinHash(num_perm=128)
        for word in text.split():
            m.update(word.encode('utf8'))
        return m

    def dedupe(self, docs):
        unique = []
        for i, doc in enumerate(docs):
            mh = self.minhash(doc["text"])
            if not self.lsh.query(mh):
                self.lsh.insert(f"doc_{i}", mh)
                unique.append(doc)
        return unique
```
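Usage is a single pass over the corpus; documents whose estimated Jaccard similarity to an already-seen document exceeds the threshold are dropped (MinHash is probabilistic, so the cut-off is approximate).

```python
docs = [
    {"text": "The quick brown fox jumps over the lazy dog"},
    {"text": "The quick brown fox jumps over the lazy dog again"},   # near-duplicate
    {"text": "Completely different content about dataset curation"},
]

dedup = Deduplicator(threshold=0.8)
unique_docs = dedup.dedupe(docs)  # near-duplicates above the threshold are dropped
```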
## Data Synthesis

### AI-Powered QA Generation
```python
import json

def generate_qa(document, model, n=5):
    prompt = f"""Generate {n} QA pairs from:

{document}

Format: [{{"question": "...", "answer": "..."}}]"""

    return json.loads(model.generate(prompt))
```
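Synthetic pairs should be filtered before they enter a training set (best practice 4 below). One simple grounding check, sketched here as an assumption rather than part of the package, is to keep only answers whose tokens largely appear in the source document.

```python
def filter_qa(pairs, document, min_overlap=0.6):
    """Keep QA pairs whose answer tokens are mostly grounded in the document."""
    doc_tokens = set(document.lower().split())
    kept = []
    for pair in pairs:
        answer_tokens = set(pair["answer"].lower().split())
        if not answer_tokens:
            continue
        overlap = len(answer_tokens & doc_tokens) / len(answer_tokens)
        if overlap >= min_overlap:
            kept.append(pair)
    return kept
```

Token overlap is a crude proxy; an AI judge (see evaluation-methodology) is the usual next step.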
### Self-Instruct
```python
import random

def self_instruct(seeds, model, n=100):
    generated = []

    for _ in range(n):
        samples = random.sample(seeds + generated[-20:], 5)
        prompt = f"Examples:\n{format_examples(samples)}\n\nNew task:"  # format_examples: prompt-rendering helper

        new = model.generate(prompt)
        if is_valid(new) and is_diverse(new, generated):
            generated.append(new)

    return generated
```

### Data Augmentation
```python
def augment_text(text):
    methods = [
        lambda t: synonym_replace(t),
        lambda t: back_translate(t),
        lambda t: model.rephrase(t)
    ]
    return random.choice(methods)(text)
```

## Data Formatting

### Instruction Format
```python
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example.get('input', '')}

### Response:
{example['output']}"""
```

### Chat Format
```python
def format_chat(conversation):
    return [
        {"role": turn["role"], "content": turn["content"]}
        for turn in conversation
    ]
```

## Best Practices

1. Inspect data before processing
2. Deduplicate before expensive operations
3. Use multiple synthesis methods
4. Validate synthetic data quality
5. Track data lineage
package/plugin/skills/ai-engineering/evaluation-methodology/SKILL.md
ADDED
@@ -0,0 +1,93 @@
---
name: evaluation-methodology
description: Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.
---

# Evaluation Methodology

Methods for evaluating Foundation Model outputs.

## Evaluation Approaches

### 1. Exact Evaluation

| Method | Use Case | Example |
|--------|----------|---------|
| Exact Match | QA, Math | `"5" == "5"` |
| Functional Correctness | Code | Pass test cases |
| BLEU/ROUGE | Translation | N-gram overlap |
| Semantic Similarity | Open-ended | Embedding cosine |

```python
# Semantic Similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = model.encode([generated])
emb2 = model.encode([reference])
similarity = cosine_similarity(emb1, emb2)[0][0]
```
### 2. AI as Judge

```python
JUDGE_PROMPT = """Rate the response on a scale of 1-5.

Criteria:
- Accuracy: Is information correct?
- Helpfulness: Does it address the need?
- Clarity: Is it easy to understand?

Query: {query}
Response: {response}

Return JSON: {{"score": N, "reasoning": "..."}}"""

# Multi-judge for reliability
judges = ["gpt-4", "claude-3"]
scores = [get_score(judge, query, response) for judge in judges]
final_score = sum(scores) / len(scores)
```
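`get_score` is left undefined above. A minimal sketch, assuming an OpenAI-compatible chat client named `client` (an assumption for illustration, not something this package provides), that fills the template, calls the judge model, and parses the JSON verdict:

```python
import json

def get_score(judge_model, query, response):
    """Ask one judge model to rate a response; returns the numeric score."""
    prompt = JUDGE_PROMPT.format(query=query, response=response)
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = json.loads(completion.choices[0].message.content)
    return verdict["score"]
```

In practice a judge may wrap the JSON in prose, so a tolerant parser (or a structured-output constraint) is worth adding.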
### 3. Comparative Evaluation (ELO)

```python
COMPARE_PROMPT = """Compare these responses.

Query: {query}
A: {response_a}
B: {response_b}

Which is better? Return: A, B, or tie"""

def update_elo(rating_a, rating_b, winner, k=32):
    expected_a = 1 / (1 + 10**((rating_b - rating_a) / 400))
    score_a = 1 if winner == "A" else 0 if winner == "B" else 0.5
    return rating_a + k * (score_a - expected_a)
```
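To turn pairwise verdicts into a ranking, run many comparisons and update both models' ratings after each one. A small sketch, assuming a `judge_compare` helper that fills COMPARE_PROMPT and returns "A", "B", or "tie", and a `generate` helper for producing the candidate responses (both hypothetical, not shipped by the package):

```python
def rank_models(model_a, model_b, eval_queries, judge_compare, start=1000):
    ratings = {model_a: start, model_b: start}
    for query in eval_queries:
        response_a = generate(model_a, query)   # assumed generation helper
        response_b = generate(model_b, query)
        winner = judge_compare(query, response_a, response_b)  # "A", "B", or "tie"

        new_a = update_elo(ratings[model_a], ratings[model_b], winner)
        # Symmetric update for B: flip the verdict to B's perspective.
        flipped = "A" if winner == "B" else "B" if winner == "A" else "tie"
        new_b = update_elo(ratings[model_b], ratings[model_a], flipped)
        ratings[model_a], ratings[model_b] = new_a, new_b
    return ratings
```

Swapping the A/B order on half the comparisons is also worth doing, since position bias (best practice 6 below) can otherwise skew the ratings.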
## Evaluation Pipeline

```
1. Define Criteria (accuracy, helpfulness, safety)
              ↓
2. Create Scoring Rubric with Examples
              ↓
3. Select Methods (exact + AI judge + human)
              ↓
4. Create Evaluation Dataset
              ↓
5. Run Evaluation
              ↓
6. Analyze & Iterate
```

## Best Practices

1. Use multiple evaluation methods
2. Calibrate AI judges with human data
3. Include both automatic and human evaluation
4. Version your evaluation datasets
5. Track metrics over time
6. Test for position bias in comparisons