@rune-kit/rune 2.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +357 -0
- package/agents/.gitkeep +0 -0
- package/agents/architect.md +29 -0
- package/agents/asset-creator.md +11 -0
- package/agents/audit.md +11 -0
- package/agents/autopsy.md +11 -0
- package/agents/brainstorm.md +11 -0
- package/agents/browser-pilot.md +11 -0
- package/agents/coder.md +29 -0
- package/agents/completion-gate.md +11 -0
- package/agents/constraint-check.md +11 -0
- package/agents/context-engine.md +11 -0
- package/agents/cook.md +11 -0
- package/agents/db.md +11 -0
- package/agents/debug.md +11 -0
- package/agents/dependency-doctor.md +11 -0
- package/agents/deploy.md +11 -0
- package/agents/design.md +11 -0
- package/agents/docs-seeker.md +11 -0
- package/agents/fix.md +11 -0
- package/agents/hallucination-guard.md +11 -0
- package/agents/incident.md +11 -0
- package/agents/integrity-check.md +11 -0
- package/agents/journal.md +11 -0
- package/agents/launch.md +11 -0
- package/agents/logic-guardian.md +11 -0
- package/agents/marketing.md +11 -0
- package/agents/onboard.md +11 -0
- package/agents/perf.md +11 -0
- package/agents/plan.md +11 -0
- package/agents/preflight.md +11 -0
- package/agents/problem-solver.md +11 -0
- package/agents/rescue.md +11 -0
- package/agents/research.md +11 -0
- package/agents/researcher.md +29 -0
- package/agents/review-intake.md +11 -0
- package/agents/review.md +11 -0
- package/agents/reviewer.md +28 -0
- package/agents/safeguard.md +11 -0
- package/agents/sast.md +11 -0
- package/agents/scanner.md +28 -0
- package/agents/scope-guard.md +11 -0
- package/agents/scout.md +11 -0
- package/agents/sentinel.md +11 -0
- package/agents/sequential-thinking.md +11 -0
- package/agents/session-bridge.md +11 -0
- package/agents/skill-forge.md +11 -0
- package/agents/skill-router.md +11 -0
- package/agents/surgeon.md +11 -0
- package/agents/team.md +11 -0
- package/agents/test.md +11 -0
- package/agents/trend-scout.md +11 -0
- package/agents/verification.md +11 -0
- package/agents/video-creator.md +11 -0
- package/agents/watchdog.md +11 -0
- package/agents/worktree.md +11 -0
- package/commands/.gitkeep +0 -0
- package/commands/rune.md +168 -0
- package/compiler/__tests__/openclaw-adapter.test.js +140 -0
- package/compiler/__tests__/parser.test.js +55 -0
- package/compiler/adapters/antigravity.js +59 -0
- package/compiler/adapters/claude.js +37 -0
- package/compiler/adapters/cursor.js +67 -0
- package/compiler/adapters/generic.js +60 -0
- package/compiler/adapters/index.js +45 -0
- package/compiler/adapters/openclaw.js +150 -0
- package/compiler/adapters/windsurf.js +60 -0
- package/compiler/bin/rune.js +288 -0
- package/compiler/doctor.js +153 -0
- package/compiler/emitter.js +240 -0
- package/compiler/parser.js +208 -0
- package/compiler/transformer.js +69 -0
- package/compiler/transforms/branding.js +27 -0
- package/compiler/transforms/cross-references.js +29 -0
- package/compiler/transforms/frontmatter.js +38 -0
- package/compiler/transforms/hooks.js +68 -0
- package/compiler/transforms/subagents.js +36 -0
- package/compiler/transforms/tool-names.js +60 -0
- package/contexts/dev.md +34 -0
- package/contexts/research.md +43 -0
- package/contexts/review.md +55 -0
- package/extensions/ai-ml/PACK.md +517 -0
- package/extensions/analytics/PACK.md +557 -0
- package/extensions/backend/PACK.md +678 -0
- package/extensions/chrome-ext/PACK.md +995 -0
- package/extensions/content/PACK.md +381 -0
- package/extensions/devops/PACK.md +520 -0
- package/extensions/ecommerce/PACK.md +280 -0
- package/extensions/gamedev/PACK.md +393 -0
- package/extensions/mobile/PACK.md +273 -0
- package/extensions/saas/PACK.md +805 -0
- package/extensions/security/PACK.md +536 -0
- package/extensions/trading/PACK.md +597 -0
- package/extensions/ui/PACK.md +947 -0
- package/package.json +47 -0
- package/skills/.gitkeep +0 -0
- package/skills/adversary/SKILL.md +271 -0
- package/skills/asset-creator/SKILL.md +157 -0
- package/skills/audit/SKILL.md +466 -0
- package/skills/autopsy/SKILL.md +200 -0
- package/skills/ba/SKILL.md +279 -0
- package/skills/brainstorm/SKILL.md +266 -0
- package/skills/browser-pilot/SKILL.md +168 -0
- package/skills/completion-gate/SKILL.md +151 -0
- package/skills/constraint-check/SKILL.md +165 -0
- package/skills/context-engine/SKILL.md +176 -0
- package/skills/cook/SKILL.md +636 -0
- package/skills/db/SKILL.md +256 -0
- package/skills/debug/SKILL.md +240 -0
- package/skills/dependency-doctor/SKILL.md +235 -0
- package/skills/deploy/SKILL.md +174 -0
- package/skills/design/DESIGN-REFERENCE.md +365 -0
- package/skills/design/SKILL.md +462 -0
- package/skills/doc-processor/SKILL.md +254 -0
- package/skills/docs/SKILL.md +336 -0
- package/skills/docs-seeker/SKILL.md +166 -0
- package/skills/fix/SKILL.md +192 -0
- package/skills/git/SKILL.md +285 -0
- package/skills/hallucination-guard/SKILL.md +204 -0
- package/skills/incident/SKILL.md +241 -0
- package/skills/integrity-check/SKILL.md +169 -0
- package/skills/journal/SKILL.md +190 -0
- package/skills/launch/SKILL.md +330 -0
- package/skills/logic-guardian/SKILL.md +240 -0
- package/skills/marketing/SKILL.md +229 -0
- package/skills/mcp-builder/SKILL.md +311 -0
- package/skills/onboard/SKILL.md +298 -0
- package/skills/perf/SKILL.md +297 -0
- package/skills/plan/SKILL.md +520 -0
- package/skills/preflight/SKILL.md +231 -0
- package/skills/problem-solver/SKILL.md +284 -0
- package/skills/rescue/SKILL.md +434 -0
- package/skills/research/SKILL.md +122 -0
- package/skills/review/SKILL.md +354 -0
- package/skills/review-intake/SKILL.md +222 -0
- package/skills/safeguard/SKILL.md +188 -0
- package/skills/sast/SKILL.md +190 -0
- package/skills/scaffold/SKILL.md +276 -0
- package/skills/scope-guard/SKILL.md +150 -0
- package/skills/scout/SKILL.md +232 -0
- package/skills/sentinel/SKILL.md +320 -0
- package/skills/sentinel-env/SKILL.md +226 -0
- package/skills/sequential-thinking/SKILL.md +234 -0
- package/skills/session-bridge/SKILL.md +287 -0
- package/skills/skill-forge/SKILL.md +317 -0
- package/skills/skill-router/SKILL.md +267 -0
- package/skills/surgeon/SKILL.md +203 -0
- package/skills/team/SKILL.md +397 -0
- package/skills/test/SKILL.md +271 -0
- package/skills/trend-scout/SKILL.md +145 -0
- package/skills/verification/SKILL.md +201 -0
- package/skills/video-creator/SKILL.md +201 -0
- package/skills/watchdog/SKILL.md +166 -0
- package/skills/worktree/SKILL.md +140 -0
@@ -0,0 +1,517 @@
---
name: "@rune/ai-ml"
description: AI/ML integration patterns — LLM integration, RAG pipelines, embeddings, and fine-tuning workflows.
metadata:
  author: runedev
  version: "0.2.0"
  layer: L4
  price: "$15"
  target: AI engineers
---

# @rune/ai-ml

## Purpose

AI-powered features fail in predictable ways: LLM calls without retry logic that crash on rate limits, RAG pipelines that retrieve irrelevant chunks because the chunking strategy ignores document structure, embedding search that returns semantic matches with zero keyword overlap, and fine-tuning runs that overfit because the eval set leaked into the training data. This pack codifies production patterns for each — from API client resilience to retrieval quality to model evaluation — so AI features ship with the reliability of traditional software.

## Triggers

- Auto-trigger: when `openai`, `anthropic`, `@langchain`, `pinecone`, `pgvector`, `embedding`, or `llm` is detected in dependencies or code
- `/rune llm-integration` — audit or improve LLM API usage
- `/rune rag-patterns` — build or audit a RAG pipeline
- `/rune embedding-search` — implement or optimize semantic search
- `/rune fine-tuning-guide` — prepare and execute a fine-tuning workflow
- Called by `cook` (L1) when an AI/ML task is detected
- Called by `plan` (L2) when AI architecture decisions are needed

## Skills Included

### llm-integration

LLM integration patterns — API client wrappers, streaming responses, structured output, retry with exponential backoff, model fallback chains, and prompt versioning.

#### Workflow

**Step 1 — Detect LLM usage**
Use Grep to find LLM API calls: `openai.chat`, `anthropic.messages`, `OpenAI(`, `Anthropic(`, `generateText`, `streamText`. Read client initialization and prompt construction to understand model selection, error handling, output parsing, and token management.

**Step 2 — Audit resilience**
Check for: no retry on rate limit (429), no timeout on API calls, unstructured output parsing (regex on LLM text instead of function calling), hardcoded prompts without versioning, no token counting before the request, a missing fallback model chain, and streaming without backpressure handling.

**Step 3 — Emit robust LLM client**
Emit: a typed client wrapper with exponential backoff retry, structured output via a Zod schema plus function calling, streaming with proper error boundaries, token budget management, and a prompt version registry.

#### Example

```typescript
// Robust LLM client — retry with backoff + jitter, structured output, fallback chain
import OpenAI from 'openai';
import { z } from 'zod';

const client = new OpenAI();

const SentimentSchema = z.object({
  sentiment: z.enum(['positive', 'negative', 'neutral']),
  confidence: z.number().min(0).max(1),
  reasoning: z.string(),
});

async function analyzeSentiment(text: string, attempt = 0): Promise<z.infer<typeof SentimentSchema>> {
  const models = ['gpt-4o-mini', 'gpt-4o'] as const; // fallback chain
  const model = attempt >= 2 ? models[1] : models[0];

  try {
    const response = await client.chat.completions.create(
      {
        model,
        messages: [
          { role: 'system', content: 'Analyze sentiment. Return JSON matching the schema.' },
          { role: 'user', content: text },
        ],
        response_format: { type: 'json_object' },
        max_tokens: 200,
      },
      { timeout: 10_000 }, // per-request timeout is a request option, not a body field
    );

    return SentimentSchema.parse(JSON.parse(response.choices[0].message.content!));
  } catch (err) {
    if (err instanceof OpenAI.RateLimitError && attempt < 3) {
      const backoffMs = Math.pow(2, attempt) * 1000 + Math.random() * 250; // jitter
      await new Promise(r => setTimeout(r, backoffMs));
      return analyzeSentiment(text, attempt + 1);
    }
    throw err;
  }
}
```
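
Step 3 also calls for a prompt version registry, which the example omits. A minimal sketch (names and the `name@version` key format are illustrative, not part of the pack's API):

```typescript
// Minimal prompt version registry — pin prompts by name + version so
// production traffic can reference "sentiment@2" explicitly.
const registry = new Map<string, string>();

function registerPrompt(name: string, version: number, template: string): void {
  registry.set(`${name}@${version}`, template);
}

function getPrompt(name: string, version?: number): string {
  if (version !== undefined) {
    const exact = registry.get(`${name}@${version}`);
    if (!exact) throw new Error(`Unknown prompt ${name}@${version}`);
    return exact;
  }
  // No version requested: return the highest registered version.
  let best: { v: number; t: string } | undefined;
  for (const [key, t] of registry) {
    const [n, v] = key.split('@');
    if (n === name && (!best || Number(v) > best.v)) best = { v: Number(v), t };
  }
  if (!best) throw new Error(`Unknown prompt ${name}`);
  return best.t;
}
```

Callers pin a version in production (`getPrompt('sentiment', 1)`) and let evaluation harnesses track the latest.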

---

### rag-patterns

RAG pipeline patterns — document chunking, embedding generation, vector store setup, retrieval strategies, and reranking.

#### Workflow

**Step 1 — Detect RAG components**
Use Grep to find vector store usage: `PineconeClient`, `pgvector`, `Weaviate`, `ChromaClient`, `QdrantClient`. Find embedding calls: `embeddings.create`, `embed()`. Read the ingestion pipeline and retrieval logic to map the full RAG flow.

**Step 2 — Audit retrieval quality**
Check for: fixed-size chunking that splits mid-sentence (context loss), no overlap between chunks (boundary information lost), embeddings generated without metadata (no filtering capability), retrieval without reranking (relevance drops after the top 3), no chunk deduplication, and context window overflow (retrieved chunks exceed the model limit).

**Step 3 — Emit RAG pipeline**
Emit: a recursive text splitter with semantic boundaries, embedding generation with metadata, vector upsert with namespaces, retrieval with reranking, and context window budget management.

#### Example

```typescript
// RAG pipeline — recursive chunking + pgvector + reranking
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from '@langchain/openai';
import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';
import { CohereClient } from 'cohere-ai';

const embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-small' });
const cohere = new CohereClient();
const pgConfig = {
  postgresConnectionOptions: { connectionString: process.env.DATABASE_URL },
  tableName: 'documents',
};

// Ingestion: chunk → embed → store
async function ingestDocument(doc: { content: string; metadata: Record<string, string> }) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ['\n## ', '\n### ', '\n\n', '\n', '. ', ' '],
  });
  const chunks = await splitter.createDocuments([doc.content], [doc.metadata]);

  await PGVectorStore.fromDocuments(chunks, embeddings, pgConfig);
}

// Retrieval: query → vector search → rerank → top-k
async function retrieve(query: string, topK = 5) {
  const store = await PGVectorStore.initialize(embeddings, pgConfig);
  const candidates = await store.similaritySearch(query, topK * 3); // over-retrieve

  // Rerank with Cohere
  const { results } = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: candidates.map(c => c.pageContent),
    topN: topK,
  });

  return results.map(r => candidates[r.index]);
}
```
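
Step 3's context-window budget management is not shown above. A rough sketch, assuming a ~4-characters-per-token estimate (a production pipeline would use a real tokenizer such as tiktoken):

```typescript
// Pack ranked chunks into a token budget — drop lower-ranked chunks
// that would overflow the model's context window.
const CHARS_PER_TOKEN = 4; // crude heuristic; swap in a real tokenizer

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function packChunks(rankedChunks: string[], tokenBudget: number): string[] {
  const packed: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > tokenBudget) continue; // skip chunks that would overflow
    packed.push(chunk);
    used += cost;
  }
  return packed;
}
```

Feeding the reranked chunks through `packChunks` before prompt assembly prevents the token-overflow failure listed under Sharp Edges.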

---

### embedding-search

Embedding-based search — semantic search, hybrid search (BM25 + vector), similarity thresholds, and index optimization.

#### Workflow

**Step 1 — Detect search implementation**
Use Grep to find search code: `similarity_search`, `vector_search`, `fts`, `tsvector`, `BM25`. Read search handlers to understand the query flow, ranking strategy, and result formatting.

**Step 2 — Audit search quality**
Check for: pure vector search without a keyword fallback (misses exact matches), no similarity threshold (returns irrelevant results at low scores), a missing query embedding cache (repeated queries are re-embedded), no hybrid scoring (BM25 for exact matches + vector for semantics), and an unoptimized vector index (HNSW parameters not tuned).

**Step 3 — Emit hybrid search**
Emit: combined BM25 + vector search with reciprocal rank fusion, similarity threshold filtering, a query embedding cache, and HNSW index tuning.

#### Example

```typescript
// Hybrid search — BM25 + vector with reciprocal rank fusion
async function hybridSearch(query: string, limit = 10) {
  const queryVector = JSON.stringify(await getEmbedding(query)); // pgvector literal

  // Parallel: keyword (full-text rank) + semantic (vector)
  const [keywordResults, vectorResults] = await Promise.all([
    db.execute(sql`
      SELECT id, content, ts_rank(search_vector, plainto_tsquery(${query})) AS bm25_score
      FROM documents
      WHERE search_vector @@ plainto_tsquery(${query})
      ORDER BY bm25_score DESC LIMIT ${limit * 2}
    `),
    db.execute(sql`
      SELECT id, content, 1 - (embedding <=> ${queryVector}::vector) AS vector_score
      FROM documents
      ORDER BY embedding <=> ${queryVector}::vector
      LIMIT ${limit * 2}
    `),
  ]);

  // Reciprocal rank fusion (k=60)
  const scores = new Map<string, number>();
  const K = 60;
  keywordResults.forEach((r, i) => scores.set(r.id, (scores.get(r.id) || 0) + 1 / (K + i + 1)));
  vectorResults.forEach((r, i) => scores.set(r.id, (scores.get(r.id) || 0) + 1 / (K + i + 1)));

  return [...scores.entries()]
    .filter(([, score]) => score > 0.01) // threshold: drop low-rank-only matches
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}

// Embedding cache (avoid re-embedding repeated queries)
const embeddingCache = new Map<string, number[]>();
async function getEmbedding(text: string): Promise<number[]> {
  const cached = embeddingCache.get(text);
  if (cached) return cached;
  const { data } = await openai.embeddings.create({ model: 'text-embedding-3-small', input: text });
  embeddingCache.set(text, data[0].embedding);
  return data[0].embedding;
}
```
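
A plain `Map` cache grows without bound in a long-running service. A size-capped LRU variant is a common fix — this sketch relies on `Map` preserving insertion order, and the cap of 10,000 entries is an arbitrary illustration:

```typescript
// LRU-bounded embedding cache — Map preserves insertion order, so the
// first key is always the least recently used.
class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private maxSize: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      const oldest = this.map.keys().next().value as string;
      this.map.delete(oldest);
    }
  }
}

const boundedEmbeddingCache = new LruCache<number[]>(10_000);
```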

---

### fine-tuning-guide

Fine-tuning workflows — dataset preparation, training configuration, evaluation metrics, deployment, and A/B testing.

#### Workflow

**Step 1 — Audit training data**
Use Read to examine the dataset files. Check for: data format (JSONL with a `messages` array), train/eval split (the eval set must not overlap with the training set), sufficient examples (minimum 50, recommended 200+), balanced class distribution, and PII in the training data.

**Step 2 — Prepare and validate dataset**
Emit: a JSONL formatter that validates each example, a train/eval splitter with stratification, a token count estimator (cost preview), and data quality checks (duplicate detection, format validation).

**Step 3 — Execute fine-tuning and evaluate**
Emit: the fine-tune API call with hyperparameters, an evaluation script that compares base vs fine-tuned on the held-out set, and an A/B deployment configuration.

#### Example

```python
# Fine-tuning workflow — prepare, train, evaluate
import json
import openai
from sklearn.model_selection import train_test_split

# Step 1: Prepare JSONL dataset
def prepare_dataset(examples: list[dict], output_prefix: str):
    train, eval_set = train_test_split(examples, test_size=0.2, random_state=42)

    for split_name, split_data in [("train", train), ("eval", eval_set)]:
        path = f"{output_prefix}_{split_name}.jsonl"
        with open(path, "w") as f:
            for ex in split_data:
                f.write(json.dumps({"messages": [
                    {"role": "system", "content": ex["system"]},
                    {"role": "user", "content": ex["input"]},
                    {"role": "assistant", "content": ex["output"]},
                ]}) + "\n")
        print(f"Wrote {len(split_data)} examples to {path}")

# Step 2: Launch fine-tuning
def start_fine_tune(train_file: str, eval_file: str):
    with open(train_file, "rb") as f:
        train_id = openai.files.create(file=f, purpose="fine-tune").id
    with open(eval_file, "rb") as f:
        eval_id = openai.files.create(file=f, purpose="fine-tune").id

    job = openai.fine_tuning.jobs.create(
        training_file=train_id,
        validation_file=eval_id,
        model="gpt-4o-mini-2024-07-18",
        hyperparameters={"n_epochs": 3, "batch_size": "auto", "learning_rate_multiplier": "auto"},
    )
    print(f"Fine-tuning job: {job.id} — status: {job.status}")
    return job

# Step 3: Evaluate base vs fine-tuned (exact-match accuracy on the held-out set)
def evaluate(base_model: str, ft_model: str, eval_set: list[dict]) -> dict:
    results = {"base": {"correct": 0}, "finetuned": {"correct": 0}}
    for ex in eval_set:
        for label, model in [("base", base_model), ("finetuned", ft_model)]:
            response = openai.chat.completions.create(
                model=model, messages=ex["messages"][:2], max_tokens=500,
            )
            if response.choices[0].message.content.strip() == ex["messages"][2]["content"].strip():
                results[label]["correct"] += 1
    for label in results:
        results[label]["accuracy"] = results[label]["correct"] / len(eval_set)
    return results
```

---

### llm-architect

LLM system architecture — model selection, prompt engineering patterns, evaluation frameworks, cost optimization, multi-model routing, and guardrail design.

#### Workflow

**Step 1 — Assess LLM requirements**
Understand the use case: what does the LLM need to do? Classify it into:
- **Generation**: open-ended text (blog, email, creative writing)
- **Extraction**: structured data from unstructured input (JSON from text, entities, classification)
- **Reasoning**: multi-step logic (math, code generation, planning)
- **Conversation**: multi-turn dialogue with memory
- **Agentic**: tool use, function calling, autonomous task execution

For each class, identify: latency requirements (real-time < 2s, async < 30s, batch), accuracy requirements (critical = needs an eval suite, casual = spot check), cost sensitivity (per-call budget), and data sensitivity (PII, HIPAA, can data leave the network?).

**Step 2 — Model selection matrix**
Based on the requirements, recommend a model tier:

| Requirement | Recommended | Fallback |
|------------|-------------|----------|
| Fast + cheap (classification, routing) | Haiku / GPT-4o-mini | Local (Llama 3) |
| Balanced (code, summaries, RAG) | Sonnet / GPT-4o | Haiku with retry |
| Deep reasoning (architecture, math) | Opus / o1 | Sonnet with chain-of-thought |
| On-premise required | Llama 3 / Mistral | Ollama local deployment |
| Multimodal (vision + text) | Sonnet / GPT-4o | Local LLaVA |

Emit: the primary model, the fallback model, estimated cost per 1K calls, and latency p50/p99.

**Step 3 — Prompt architecture**
Design the prompt structure:
- **System prompt**: Role definition, constraints, output format. Keep it under 500 tokens for cost efficiency.
- **Few-shot examples**: 2–3 examples for extraction/classification tasks. The format matches the expected output exactly.
- **Chain-of-thought**: For reasoning tasks, explicitly request step-by-step thinking before the final answer.
- **Structured output**: JSON mode or tool use for extraction. Define the schema with Zod/Pydantic for validation.

**Step 4 — Guardrails and evaluation**
Design safety and quality layers:
- **Input guardrails**: PII detection, prompt injection detection, topic filtering
- **Output guardrails**: Schema validation, hallucination checks, toxicity filtering
- **Evaluation framework**: Define an eval dataset (50+ examples), metrics (accuracy, latency, cost), and a regression threshold (a new prompt must not drop > 2% on any metric)

Save the architecture doc to `.rune/ai/llm-architecture.md`.
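
The input-guardrail layer from Step 4 can be sketched as a cheap heuristic pass that runs before any model call. The patterns below are illustrative only, not an exhaustive injection taxonomy — production systems typically layer a classifier model on top:

```typescript
// Heuristic input guardrail — flag likely prompt-injection attempts
// before the request reaches the model. Patterns are illustrative.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous |prior )?(instructions|prompts)/i,
  /disregard (the |your )?(system |previous )?(prompt|instructions)/i,
  /you are now/i,
  /reveal (the |your )(system prompt|instructions)/i,
];

function checkInput(userInput: string): { ok: boolean; reason?: string } {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(userInput)) {
      return { ok: false, reason: `matched ${pattern.source}` };
    }
  }
  return { ok: true };
}
```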

#### Example

```typescript
// Multi-model router with fallback
interface ModelConfig {
  id: string;
  provider: 'anthropic' | 'openai' | 'local';
  costPer1kTokens: number;
  maxTokens: number;
  latencyP50Ms: number;
}

const MODELS: Record<string, ModelConfig> = {
  fast: {
    id: 'claude-haiku-4-5-20251001',
    provider: 'anthropic',
    costPer1kTokens: 0.001,
    maxTokens: 4096,
    latencyP50Ms: 200,
  },
  balanced: {
    id: 'claude-sonnet-4-6',
    provider: 'anthropic',
    costPer1kTokens: 0.01,
    maxTokens: 8192,
    latencyP50Ms: 800,
  },
  deep: {
    id: 'claude-opus-4-6',
    provider: 'anthropic',
    costPer1kTokens: 0.05,
    maxTokens: 16384,
    latencyP50Ms: 2000,
  },
};

type TaskComplexity = 'trivial' | 'standard' | 'complex';

function selectModel(complexity: TaskComplexity): ModelConfig {
  const map: Record<TaskComplexity, string> = {
    trivial: 'fast',
    standard: 'balanced',
    complex: 'deep',
  };
  return MODELS[map[complexity]];
}

// Prompt architecture template (`role`, `constraints`, and `outputSchema`
// are filled in per use case)
const systemPrompt = `You are a ${role} assistant.

CONSTRAINTS:
- ${constraints.join('\n- ')}

OUTPUT FORMAT:
Return valid JSON matching this schema:
${JSON.stringify(outputSchema, null, 2)}

Do not include explanations outside the JSON.`;

// Guardrail: validate structured output
import { z } from 'zod';

const OutputSchema = z.object({
  classification: z.enum(['positive', 'negative', 'neutral']),
  confidence: z.number().min(0).max(1),
  reasoning: z.string().max(200),
});

function validateOutput(raw: string): z.infer<typeof OutputSchema> {
  const parsed = JSON.parse(raw);
  return OutputSchema.parse(parsed); // throws if invalid
}
```
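
Step 2 asks for an estimated cost per 1K calls. Given a `costPer1kTokens` figure like the ones in the router above, a back-of-envelope helper might look like this — it blends input and output tokens at one rate, whereas real provider pricing charges them separately:

```typescript
// Back-of-envelope cost preview — blends input + output tokens at one rate.
// Real provider pricing separates input and output token rates.
interface CostEstimate {
  perCallUsd: number;
  per1kCallsUsd: number;
}

function estimateCost(
  costPer1kTokens: number,
  avgInputTokens: number,
  avgOutputTokens: number,
): CostEstimate {
  const perCallUsd = ((avgInputTokens + avgOutputTokens) / 1000) * costPer1kTokens;
  return { perCallUsd, per1kCallsUsd: perCallUsd * 1000 };
}
```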

---

### prompt-patterns

Reusable prompt engineering patterns — structured output, chain-of-thought, self-critique, tool use orchestration, and multi-turn memory management.

#### Workflow

**Step 1 — Identify the pattern**
Match the user's task to a proven prompt pattern:
- **Extraction**: Use JSON mode + schema definition + few-shot examples
- **Classification**: Use enum output + confidence score + chain-of-thought
- **Summarization**: Use a structured summary template + length constraint + key point extraction
- **Code generation**: Use a system prompt with language constraints + test-driven output format
- **Agent loop**: Use the ReAct pattern (Thought → Action → Observation → repeat)
- **Self-critique**: Use a generate → critique → revise loop for quality-sensitive output

**Step 2 — Apply the pattern**
Generate the prompt following the selected pattern. Include:
- A system prompt (role + constraints + output format)
- A user message template (input variables marked with `{{variable}}`)
- Few-shot examples (2–3, matching the exact output format)
- A validation schema (Zod/Pydantic for structured output)

**Step 3 — Test harness**
Emit a test file with 5+ test cases that validate the prompt produces correct output for known inputs. Include edge cases: empty input, very long input, ambiguous input, and adversarial input.

#### Example

```typescript
// Pattern: ReAct Agent Loop
const REACT_SYSTEM = `You are an agent that solves tasks using available tools.

For each step, output EXACTLY this JSON format:
{"thought": "reasoning about what to do next",
 "action": "tool_name",
 "action_input": "input for the tool"}

After receiving an observation, continue with the next thought.
When you have the final answer, output:
{"thought": "I have the answer", "final_answer": "the answer"}

Available tools:
{{tools}}`;

// Pattern: Self-Critique Loop (`llm.generate` is a stand-in for your LLM client)
async function generateWithCritique(prompt: string, maxRounds = 2) {
  let output = await llm.generate(prompt);

  for (let i = 0; i < maxRounds; i++) {
    const critique = await llm.generate(
      `Review this output for errors, omissions, and improvements:\n\n${output}\n\n` +
      `List specific issues. If no issues, respond with "APPROVED".`
    );

    if (critique.includes('APPROVED')) break;

    output = await llm.generate(
      `Original output:\n${output}\n\nCritique:\n${critique}\n\n` +
      `Revise the output to address all issues in the critique.`
    );
  }

  return output;
}
```
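
Step 3's test harness can be sketched generically — the harness takes a `generate` function (the real LLM call in staging, or a stub in CI; both names are hypothetical) and a set of cases covering the edge cases listed above:

```typescript
// Minimal prompt test harness — runs cases through a generate function
// and reports failures. `generate` may be the real LLM call or a stub.
interface PromptCase {
  name: string;
  input: string;
  check: (output: string) => boolean; // assertion on the model output
}

async function runPromptTests(
  generate: (input: string) => Promise<string>,
  cases: PromptCase[],
): Promise<{ passed: number; failed: string[] }> {
  const failed: string[] = [];
  for (const c of cases) {
    try {
      const output = await generate(c.input);
      if (!c.check(output)) failed.push(c.name);
    } catch {
      failed.push(c.name); // a thrown error counts as a failure
    }
  }
  return { passed: cases.length - failed.length, failed };
}
```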

---

## Connections

```
Calls → research (L3): look up model documentation and best practices
Calls → docs-seeker (L3): API reference for LLM providers
Calls → verification (L3): validate pipeline correctness
Called By ← cook (L1): when an AI/ML task is detected
Called By ← plan (L2): when AI architecture decisions are needed
Called By ← review (L2): when AI code is under review
```

## Tech Stack Support

| Provider | SDK | Vector Store | Notes |
|----------|-----|--------------|-------|
| OpenAI | openai v4+ | pgvector | Most common; JSON mode + function calling |
| Anthropic | @anthropic-ai/sdk | Pinecone | Tool use + long context |
| Cohere | cohere-ai | Weaviate | Reranking + embed v3 |
| Local (Ollama) | ollama-js | ChromaDB | Self-hosted, privacy-sensitive |

## Constraints

1. MUST implement retry with exponential backoff on all LLM API calls — rate limits are guaranteed at scale.
2. MUST validate LLM output against a schema (Zod/Pydantic) — never trust raw text parsing for structured data.
3. MUST separate training and evaluation datasets — an eval set leaking into training invalidates all metrics.
4. MUST set similarity thresholds on vector search — returning all results regardless of score degrades quality.
5. MUST NOT embed sensitive/PII data without explicit consent — embeddings are not easily deletable from vector stores.

## Sharp Edges

| Failure Mode | Severity | Mitigation |
|---|---|---|
| LLM rate limit (429) crashes the entire request pipeline | HIGH | Exponential backoff retry with jitter; fallback model chain for critical paths |
| RAG retrieves irrelevant chunks due to fixed-size splitting across section boundaries | HIGH | Use a recursive splitter with semantic separators (headings, paragraphs); include metadata for filtering |
| Vector search returns high-similarity results that are factually wrong (semantic ≠ factual) | HIGH | Always rerank with a cross-encoder; include source citations for verification |
| Fine-tuned model overfits to the training format and fails on slightly different inputs | HIGH | Include diverse input formats in the training data; evaluate on out-of-distribution examples |
| Embedding dimension mismatch between index and query model (model upgraded) | CRITICAL | Pin the embedding model version; store the model version in index metadata; re-embed on model change |
| Token budget overflow when stuffing retrieved chunks into the prompt | MEDIUM | Count tokens before assembly; truncate or drop the lowest-ranked chunks to fit the budget |

## Done When

- The LLM client has retry, structured output, streaming, and a fallback chain
- The RAG pipeline ingests, chunks, embeds, stores, retrieves, and reranks correctly
- Hybrid search returns relevant results for both keyword and semantic queries
- The fine-tuning dataset is validated, the model is trained, and the eval shows improvement over base
- All API calls handle rate limits and timeouts gracefully
- A structured report is emitted for each skill invoked

## Cost Profile

~10,000–18,000 tokens per full pack run (all skills). Individual skill: ~2,500–5,000 tokens. Sonnet by default. Use Haiku for code detection scans; escalate to Sonnet for pipeline design and evaluation strategy.