mindforge-cc 10.0.3 → 10.7.0
- package/.mindforge/config.json +25 -2
- package/.mindforge/engine/cross-model-eval.md +74 -0
- package/.mindforge/engine/proactive/signal-detector.md +60 -0
- package/.mindforge/engine/proactive/suggestion-engine.md +100 -0
- package/.mindforge/personas/agent-architect.md +57 -0
- package/.mindforge/personas/agent-evaluator.md +162 -0
- package/.mindforge/personas/agent-memory-designer.md +157 -0
- package/.mindforge/personas/agent-ops-engineer.md +120 -0
- package/.mindforge/personas/agent-orchestrator.md +112 -0
- package/.mindforge/personas/ai-economist.md +57 -0
- package/.mindforge/personas/ai-safety-engineer.md +57 -0
- package/.mindforge/personas/analytics-engineer.md +57 -0
- package/.mindforge/personas/anti-pattern-hunter.md +61 -0
- package/.mindforge/personas/api-gateway-designer.md +132 -0
- package/.mindforge/personas/auth-engineer.md +112 -0
- package/.mindforge/personas/build-engineer.md +57 -0
- package/.mindforge/personas/business-analyst.md +56 -0
- package/.mindforge/personas/cache-architect.md +100 -0
- package/.mindforge/personas/causal-scientist.md +57 -0
- package/.mindforge/personas/cdn-architect.md +118 -0
- package/.mindforge/personas/change-agent.md +104 -0
- package/.mindforge/personas/code-narrator.md +52 -0
- package/.mindforge/personas/codegen-specialist.md +68 -0
- package/.mindforge/personas/communication-architect.md +102 -0
- package/.mindforge/personas/compliance-engineer.md +96 -0
- package/.mindforge/personas/consensus-engineer.md +116 -0
- package/.mindforge/personas/contract-tester.md +60 -192
- package/.mindforge/personas/data-architect.md +108 -0
- package/.mindforge/personas/data-mesh-architect.md +57 -0
- package/.mindforge/personas/data-pipeline-architect.md +120 -0
- package/.mindforge/personas/de-sloppifier.md +60 -0
- package/.mindforge/personas/debt-manager.md +66 -0
- package/.mindforge/personas/decision-architect.md +82 -51
- package/.mindforge/personas/deployment-captain.md +74 -0
- package/.mindforge/personas/design-system-lead.md +112 -0
- package/.mindforge/personas/dmux-orchestrator.md +75 -0
- package/.mindforge/personas/dx-engineer.md +96 -0
- package/.mindforge/personas/ecommerce-engineer.md +57 -0
- package/.mindforge/personas/edge-engineer.md +94 -0
- package/.mindforge/personas/edtech-architect.md +106 -0
- package/.mindforge/personas/embedding-architect.md +57 -0
- package/.mindforge/personas/environment-engineer.md +57 -0
- package/.mindforge/personas/eval-judge.md +55 -0
- package/.mindforge/personas/event-architect.md +102 -0
- package/.mindforge/personas/experiment-designer.md +138 -0
- package/.mindforge/personas/feature-store-engineer.md +57 -0
- package/.mindforge/personas/finops-analyst.md +66 -0
- package/.mindforge/personas/fintech-architect.md +57 -0
- package/.mindforge/personas/flutter-engineer.md +104 -0
- package/.mindforge/personas/gaming-engineer.md +57 -0
- package/.mindforge/personas/graphql-designer.md +73 -0
- package/.mindforge/personas/healthcare-engineer.md +57 -0
- package/.mindforge/personas/hiring-strategist.md +105 -0
- package/.mindforge/personas/hitl-architect.md +165 -0
- package/.mindforge/personas/i18n-architect.md +69 -0
- package/.mindforge/personas/iot-architect.md +105 -0
- package/.mindforge/personas/knowledge-curator.md +139 -0
- package/.mindforge/personas/knowledge-engineer.md +57 -0
- package/.mindforge/personas/lakehouse-architect.md +57 -0
- package/.mindforge/personas/llm-orchestrator.md +57 -0
- package/.mindforge/personas/logistics-architect.md +106 -0
- package/.mindforge/personas/market-analyst.md +53 -0
- package/.mindforge/personas/marketplace-engineer.md +105 -0
- package/.mindforge/personas/mcp-designer.md +54 -0
- package/.mindforge/personas/meeting-designer.md +104 -0
- package/.mindforge/personas/mentorship-lead.md +106 -0
- package/.mindforge/personas/migration-architect.md +57 -0
- package/.mindforge/personas/ml-ops-engineer.md +101 -0
- package/.mindforge/personas/mobile-architect.md +105 -0
- package/.mindforge/personas/mobile-security-engineer.md +106 -0
- package/.mindforge/personas/multi-tenancy-architect.md +71 -0
- package/.mindforge/personas/multimodal-engineer.md +57 -0
- package/.mindforge/personas/offline-specialist.md +105 -0
- package/.mindforge/personas/onboarding-navigator.md +63 -0
- package/.mindforge/personas/payments-engineer.md +135 -0
- package/.mindforge/personas/pipeline-engineer.md +115 -0
- package/.mindforge/personas/platform-engineer.md +97 -0
- package/.mindforge/personas/platform-lead.md +57 -0
- package/.mindforge/personas/privacy-engineer.md +57 -0
- package/.mindforge/personas/product-owner.md +56 -0
- package/.mindforge/personas/productivity-analyst.md +57 -0
- package/.mindforge/personas/prompt-architect.md +101 -0
- package/.mindforge/personas/proofreader.md +53 -0
- package/.mindforge/personas/pwa-architect.md +105 -0
- package/.mindforge/personas/quality-scorer.md +63 -0
- package/.mindforge/personas/react-native-engineer.md +106 -0
- package/.mindforge/personas/resilience-engineer.md +69 -0
- package/.mindforge/personas/rfc-architect.md +64 -0
- package/.mindforge/personas/saga-orchestrator.md +80 -0
- package/.mindforge/personas/secrets-engineer.md +57 -0
- package/.mindforge/personas/skill-smith.md +79 -0
- package/.mindforge/personas/sre-lead.md +107 -0
- package/.mindforge/personas/stream-engineer.md +57 -0
- package/.mindforge/personas/streaming-engineer.md +64 -0
- package/.mindforge/personas/swarm-templates.json +674 -44
- package/.mindforge/personas/system-designer.md +57 -0
- package/.mindforge/personas/team-coach.md +120 -0
- package/.mindforge/personas/tech-lead-coach.md +103 -0
- package/.mindforge/personas/technical-writer-lead.md +111 -0
- package/.mindforge/personas/vibe-checker.md +75 -0
- package/.mindforge/personas/worktree-manager.md +56 -0
- package/.mindforge/personas/zero-trust-engineer.md +113 -0
- package/.mindforge/skills/a11y-testing/SKILL.md +143 -0
- package/.mindforge/skills/agent-evaluation-framework/SKILL.md +227 -0
- package/.mindforge/skills/agent-memory-design/SKILL.md +199 -0
- package/.mindforge/skills/agent-orchestration-patterns/SKILL.md +129 -0
- package/.mindforge/skills/agent-tool-selection/SKILL.md +204 -0
- package/.mindforge/skills/ai-agent-deployment/SKILL.md +176 -0
- package/.mindforge/skills/ai-cost-management/SKILL.md +57 -0
- package/.mindforge/skills/ai-safety-alignment/SKILL.md +53 -0
- package/.mindforge/skills/analytics-instrumentation/SKILL.md +172 -0
- package/.mindforge/skills/api-gateway-patterns/SKILL.md +177 -0
- package/.mindforge/skills/api-marketplace/SKILL.md +56 -0
- package/.mindforge/skills/api-versioning/SKILL.md +100 -0
- package/.mindforge/skills/app-store-deployment/SKILL.md +44 -0
- package/.mindforge/skills/architecture-tradeoff-analysis/SKILL.md +97 -0
- package/.mindforge/skills/audit-logging/SKILL.md +140 -0
- package/.mindforge/skills/auth-patterns/SKILL.md +148 -0
- package/.mindforge/skills/autonomous-agent-harness/SKILL.md +218 -0
- package/.mindforge/skills/autonomous-agents/SKILL.md +59 -0
- package/.mindforge/skills/build-system-optimization/SKILL.md +54 -0
- package/.mindforge/skills/build-vs-buy/SKILL.md +80 -0
- package/.mindforge/skills/bundle-optimization/SKILL.md +174 -0
- package/.mindforge/skills/business-analyst/SKILL.md +82 -0
- package/.mindforge/skills/caching-strategies/SKILL.md +132 -0
- package/.mindforge/skills/capacity-planning/SKILL.md +96 -0
- package/.mindforge/skills/causal-inference/SKILL.md +42 -0
- package/.mindforge/skills/cdn-optimization/SKILL.md +212 -0
- package/.mindforge/skills/change-management/SKILL.md +106 -0
- package/.mindforge/skills/chaos-engineering/SKILL.md +99 -0
- package/.mindforge/skills/ci-cd-pipeline/SKILL.md +118 -0
- package/.mindforge/skills/cli-design/SKILL.md +118 -0
- package/.mindforge/skills/code-generation-patterns/SKILL.md +92 -0
- package/.mindforge/skills/code-review-methodology/SKILL.md +180 -0
- package/.mindforge/skills/code-tour/SKILL.md +145 -0
- package/.mindforge/skills/codebase-onboarding/SKILL.md +95 -0
- package/.mindforge/skills/compliance-as-code/SKILL.md +195 -0
- package/.mindforge/skills/conflict-resolution/SKILL.md +87 -0
- package/.mindforge/skills/connection-pooling/SKILL.md +151 -0
- package/.mindforge/skills/container-security/SKILL.md +151 -0
- package/.mindforge/skills/context-engineering/SKILL.md +114 -0
- package/.mindforge/skills/contract-testing/SKILL.md +85 -0
- package/.mindforge/skills/cost-estimation/SKILL.md +82 -0
- package/.mindforge/skills/cqrs-event-sourcing/SKILL.md +95 -0
- package/.mindforge/skills/cross-platform-testing/SKILL.md +43 -0
- package/.mindforge/skills/data-governance/SKILL.md +42 -0
- package/.mindforge/skills/data-lakehouse/SKILL.md +42 -0
- package/.mindforge/skills/data-mesh/SKILL.md +42 -0
- package/.mindforge/skills/data-modeling/SKILL.md +107 -0
- package/.mindforge/skills/data-pipeline-design/SKILL.md +171 -0
- package/.mindforge/skills/data-privacy-engineering/SKILL.md +42 -0
- package/.mindforge/skills/database-performance/SKILL.md +174 -0
- package/.mindforge/skills/database-sharding-advanced/SKILL.md +206 -0
- package/.mindforge/skills/de-sloppify/SKILL.md +120 -0
- package/.mindforge/skills/defense-in-depth/SKILL.md +84 -0
- package/.mindforge/skills/delegation-patterns/SKILL.md +123 -0
- package/.mindforge/skills/dependency-management/SKILL.md +94 -0
- package/.mindforge/skills/deployment-workflow/SKILL.md +135 -0
- package/.mindforge/skills/design-system/SKILL.md +113 -0
- package/.mindforge/skills/developer-onboarding/SKILL.md +99 -0
- package/.mindforge/skills/developer-productivity-metrics/SKILL.md +59 -0
- package/.mindforge/skills/distributed-consensus/SKILL.md +141 -0
- package/.mindforge/skills/dmux-workflows/SKILL.md +141 -0
- package/.mindforge/skills/dns-architecture/SKILL.md +167 -0
- package/.mindforge/skills/ecommerce-architecture/SKILL.md +41 -0
- package/.mindforge/skills/edge-computing/SKILL.md +91 -0
- package/.mindforge/skills/edtech-platform/SKILL.md +41 -0
- package/.mindforge/skills/email-deliverability/SKILL.md +177 -0
- package/.mindforge/skills/embedding-systems/SKILL.md +55 -0
- package/.mindforge/skills/environment-management/SKILL.md +54 -0
- package/.mindforge/skills/error-handling-architecture/SKILL.md +118 -0
- package/.mindforge/skills/estimation-techniques/SKILL.md +113 -0
- package/.mindforge/skills/eval-harness/SKILL.md +180 -0
- package/.mindforge/skills/event-driven-architecture/SKILL.md +162 -0
- package/.mindforge/skills/experiment-design/SKILL.md +139 -0
- package/.mindforge/skills/experiment-platform/SKILL.md +43 -0
- package/.mindforge/skills/feature-engineering/SKILL.md +42 -0
- package/.mindforge/skills/feature-flag-management/SKILL.md +183 -0
- package/.mindforge/skills/fine-tuning-workflow/SKILL.md +189 -0
- package/.mindforge/skills/fintech-patterns/SKILL.md +41 -0
- package/.mindforge/skills/flutter-architecture/SKILL.md +42 -0
- package/.mindforge/skills/gaming-backend/SKILL.md +41 -0
- package/.mindforge/skills/git-workflow-design/SKILL.md +129 -0
- package/.mindforge/skills/graceful-degradation/SKILL.md +95 -0
- package/.mindforge/skills/graphql-patterns/SKILL.md +243 -0
- package/.mindforge/skills/guardrails-and-safety/SKILL.md +137 -0
- package/.mindforge/skills/healthcare-systems/SKILL.md +40 -0
- package/.mindforge/skills/hiring-engineering/SKILL.md +119 -0
- package/.mindforge/skills/human-in-the-loop-design/SKILL.md +234 -0
- package/.mindforge/skills/i18n-architecture/SKILL.md +147 -0
- package/.mindforge/skills/idempotency-patterns/SKILL.md +84 -0
- package/.mindforge/skills/incident-communication/SKILL.md +96 -0
- package/.mindforge/skills/incident-management/SKILL.md +97 -0
- package/.mindforge/skills/infrastructure-as-code/SKILL.md +98 -0
- package/.mindforge/skills/instinct-clustering/SKILL.md +190 -0
- package/.mindforge/skills/internal-developer-platform/SKILL.md +51 -0
- package/.mindforge/skills/iot-platform/SKILL.md +41 -0
- package/.mindforge/skills/k8s-deployment/SKILL.md +358 -0
- package/.mindforge/skills/knowledge-graphs/SKILL.md +56 -0
- package/.mindforge/skills/knowledge-sharing-systems/SKILL.md +112 -0
- package/.mindforge/skills/llm-cost-optimization/SKILL.md +198 -0
- package/.mindforge/skills/llm-orchestration/SKILL.md +56 -0
- package/.mindforge/skills/load-testing/SKILL.md +84 -0
- package/.mindforge/skills/logistics-optimization/SKILL.md +40 -0
- package/.mindforge/skills/market-researcher/SKILL.md +99 -0
- package/.mindforge/skills/marketplace-trust/SKILL.md +40 -0
- package/.mindforge/skills/mcp-server-patterns/SKILL.md +264 -0
- package/.mindforge/skills/media-streaming/SKILL.md +41 -0
- package/.mindforge/skills/meeting-architecture/SKILL.md +146 -0
- package/.mindforge/skills/mentoring-patterns/SKILL.md +77 -0
- package/.mindforge/skills/microservices-patterns/SKILL.md +83 -0
- package/.mindforge/skills/migration-platform/SKILL.md +61 -0
- package/.mindforge/skills/migration-strategies/SKILL.md +129 -0
- package/.mindforge/skills/ml-feature-store/SKILL.md +56 -0
- package/.mindforge/skills/ml-monitoring/SKILL.md +42 -0
- package/.mindforge/skills/mobile-performance/SKILL.md +44 -0
- package/.mindforge/skills/mobile-security/SKILL.md +45 -0
- package/.mindforge/skills/model-evaluation/SKILL.md +53 -0
- package/.mindforge/skills/monorepo-management/SKILL.md +100 -0
- package/.mindforge/skills/multi-tenancy-patterns/SKILL.md +145 -0
- package/.mindforge/skills/multi-turn-conversation-design/SKILL.md +206 -0
- package/.mindforge/skills/multimodal-ai/SKILL.md +51 -0
- package/.mindforge/skills/mutation-testing/SKILL.md +97 -0
- package/.mindforge/skills/notification-system-design/SKILL.md +168 -0
- package/.mindforge/skills/observability-stack/SKILL.md +136 -0
- package/.mindforge/skills/offline-first-design/SKILL.md +43 -0
- package/.mindforge/skills/on-call-design/SKILL.md +111 -0
- package/.mindforge/skills/pagination-patterns/SKILL.md +230 -0
- package/.mindforge/skills/payment-integration/SKILL.md +176 -0
- package/.mindforge/skills/performance-reviews/SKILL.md +140 -0
- package/.mindforge/skills/platform-observability/SKILL.md +58 -0
- package/.mindforge/skills/platform-reliability/SKILL.md +52 -0
- package/.mindforge/skills/post-incident-learning/SKILL.md +96 -0
- package/.mindforge/skills/product-manager/SKILL.md +104 -0
- package/.mindforge/skills/progressive-web-app/SKILL.md +44 -0
- package/.mindforge/skills/prompt-engineering/SKILL.md +94 -0
- package/.mindforge/skills/proofreader/SKILL.md +158 -0
- package/.mindforge/skills/push-notification-architecture/SKILL.md +45 -0
- package/.mindforge/skills/python-performance/SKILL.md +183 -0
- package/.mindforge/skills/quality-audit/SKILL.md +171 -0
- package/.mindforge/skills/queue-design/SKILL.md +85 -0
- package/.mindforge/skills/rag-architecture/SKILL.md +176 -0
- package/.mindforge/skills/rate-limiting-design/SKILL.md +94 -0
- package/.mindforge/skills/react-native-patterns/SKILL.md +42 -0
- package/.mindforge/skills/react-performance/SKILL.md +229 -0
- package/.mindforge/skills/real-time-analytics/SKILL.md +42 -0
- package/.mindforge/skills/real-time-sync/SKILL.md +83 -0
- package/.mindforge/skills/responsive-native/SKILL.md +44 -0
- package/.mindforge/skills/responsive-patterns/SKILL.md +141 -0
- package/.mindforge/skills/rfc-pipeline/SKILL.md +114 -0
- package/.mindforge/skills/saas-multi-tenant/SKILL.md +41 -0
- package/.mindforge/skills/santa-method/SKILL.md +134 -0
- package/.mindforge/skills/search-implementation/SKILL.md +98 -0
- package/.mindforge/skills/secrets-platform/SKILL.md +56 -0
- package/.mindforge/skills/secrets-rotation/SKILL.md +173 -0
- package/.mindforge/skills/self-serve-infrastructure/SKILL.md +51 -0
- package/.mindforge/skills/serverless-patterns/SKILL.md +119 -0
- package/.mindforge/skills/skill-creator-meta/SKILL.md +146 -0
- package/.mindforge/skills/sprint-retrospective-facilitation/SKILL.md +112 -0
- package/.mindforge/skills/stakeholder-communication/SKILL.md +85 -0
- package/.mindforge/skills/state-management/SKILL.md +104 -0
- package/.mindforge/skills/stream-processing/SKILL.md +43 -0
- package/.mindforge/skills/streaming-architecture/SKILL.md +81 -0
- package/.mindforge/skills/supply-chain-security/SKILL.md +145 -0
- package/.mindforge/skills/synthetic-data-generation/SKILL.md +52 -0
- package/.mindforge/skills/system-design/SKILL.md +88 -0
- package/.mindforge/skills/team-topology-design/SKILL.md +107 -0
- package/.mindforge/skills/technical-debt-management/SKILL.md +86 -0
- package/.mindforge/skills/technical-interview-design/SKILL.md +98 -0
- package/.mindforge/skills/technical-leadership/SKILL.md +75 -0
- package/.mindforge/skills/technical-writing/SKILL.md +237 -0
- package/.mindforge/skills/technology-radar/SKILL.md +88 -0
- package/.mindforge/skills/testing-anti-patterns/SKILL.md +288 -0
- package/.mindforge/skills/tool-design/SKILL.md +138 -0
- package/.mindforge/skills/typescript-advanced/SKILL.md +198 -0
- package/.mindforge/skills/using-git-worktrees/SKILL.md +139 -0
- package/.mindforge/skills/verification-loop/SKILL.md +13 -1
- package/.mindforge/skills/vibe-security/SKILL.md +165 -0
- package/.mindforge/skills/visual-regression-testing/SKILL.md +97 -0
- package/.mindforge/skills/websocket-patterns/SKILL.md +203 -0
- package/.mindforge/skills/writing-plans/SKILL.md +170 -0
- package/.mindforge/skills/writing-skills/SKILL.md +216 -0
- package/.mindforge/skills/zero-trust-architecture/SKILL.md +166 -0
- package/CHANGELOG.md +176 -0
- package/MINDFORGE.md +4 -4
- package/package.json +2 -2
- package/.mindforge/personas/data-privacy-engineer.md +0 -187
@@ -0,0 +1,55 @@
+---
+name: embedding-systems
+version: 1.0.0
+min_mindforge_version: 10.5.0
+status: stable
+triggers: embedding system design, vector database architecture, semantic search implementation, embedding pipeline, similarity algorithm, vector index optimization, vector embedding model selection, vector similarity search, dense vector retrieval, embedding dimension reduction, hybrid search design, vector store scaling
+compose:
+- rag-architecture
+---
+
+# Embedding Systems & Vector Databases
+
+## When this skill activates
+
+This skill activates when designing semantic search systems, implementing vector databases, building dense retrieval pipelines, or optimizing embedding-based similarity search. It applies to any system where AI must find semantically related content (documents, images, code) using vector representations.
+
+## Mandatory actions when this skill is active
+
+### Before writing any code
+
+1. **Select embedding model** — Choose based on modality (text: sentence-transformers, OpenAI ada-002; code: CodeBERT, StarCoder; images: CLIP; multimodal: CLIP, ImageBind) and performance requirements (dimension size vs. accuracy trade-off, encoding speed, cost). Validate model on your domain data before committing.
+2. **Design vector schema** — Define embedding dimensions (384, 768, 1536), metadata fields (timestamps, tags, source IDs), and filtering constraints. Schema changes require full reindexing. Get it right upfront.
+3. **Choose vector database** — Evaluate based on scale (millions vs. billions of vectors), query latency requirements (<100ms for real-time, <1s for batch), indexing strategy (HNSW, IVF, Product Quantization), and ecosystem (Pinecone, Weaviate, Qdrant, Milvus, pgvector). Run benchmarks on your data before selecting.
+4. **Establish similarity metrics** — Select distance function: cosine similarity (most common, normalized), dot product (unnormalized, faster), Euclidean distance (position-sensitive). Different metrics produce different rankings. Test which aligns with human judgment.
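To illustrate why the metric choice in step 4 matters, the following minimal sketch ranks the same toy vectors under cosine similarity, dot product, and Euclidean distance using only the Python standard library. The vectors and the names `query`, `docs`, and `rank` are invented for the example and do not come from the skill file.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2_normalize(a):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    n = norm(a)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in a]

# Hypothetical toy data: two vectors point roughly along the query,
# but one of them has a much larger magnitude.
query = [1.0, 0.0]
docs = {
    "aligned_small": [0.9, 0.1],
    "aligned_large": [10.0, 2.0],
    "off_axis": [0.5, 0.5],
}

def rank(score, reverse):
    """Order document names by score against the query."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=reverse)

by_cosine = rank(cosine, reverse=True)        # higher similarity first
by_dot = rank(dot, reverse=True)              # higher product first
by_euclid = rank(euclidean, reverse=False)    # smaller distance first

# After L2 normalization, dot product gives the same order as cosine.
unit_docs = {name: l2_normalize(vec) for name, vec in docs.items()}
by_dot_normalized = sorted(
    unit_docs, key=lambda name: dot(query, unit_docs[name]), reverse=True
)
```

On this toy data the three metrics produce three different orderings, and `by_dot_normalized` matches `by_cosine`, which is the property the normalization advice in this skill relies on.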
+
+### During implementation
+
+- **Normalize embeddings consistently** — L2-normalize embeddings before storage if using cosine similarity. Indexes that implement cosine as a dot product over unit vectors will otherwise let vector magnitude distort the rankings. Validate normalization: check that ||embedding|| = 1.0 (within floating-point tolerance) for all vectors.
+- **Batch encode for efficiency** — Encode embeddings in batches (32-256 examples) to maximize GPU utilization. Single-example encoding wastes 90%+ of GPU compute. Implement batching with dynamic padding to handle variable-length inputs.
+- **Design hybrid search** — Combine dense retrieval (semantic similarity) with sparse retrieval (BM25, TF-IDF keyword matching). Hybrid search outperforms either alone: semantic search finds conceptually similar content, keyword search ensures exact term matches are included. Fuse rankings with reciprocal rank fusion (RRF).
+- **Implement metadata filtering** — Support filtering by metadata (date ranges, categories, tags) before or after vector search. Pre-filtering (filter then search) guarantees that every result matches the filter, but can be slow or reduce ANN recall when it bypasses or fragments the index. Post-filtering (search then filter) keeps the vector search fast, but can return fewer than k results when the filter is selective. Choose based on selectivity (how many vectors match the filter).
+- **Optimize index parameters** — Tune HNSW parameters (M, efConstruction, efSearch) for your accuracy/latency trade-off. Higher M and efConstruction improve accuracy but slow indexing. Higher efSearch improves query accuracy but slows search. Benchmark with your data.
+- **Handle embedding drift** — Embedding models change (model updates, fine-tuning). When embeddings change, you must reindex all vectors. Implement versioned indexes: maintain old index while building new index, then hot-swap. Monitor query quality after model changes.
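The reciprocal rank fusion step named in the hybrid search bullet can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the constant `k = 60` is a commonly used default rather than a value from the skill file, and the document IDs and result lists are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score ranks first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical semantic results
sparse_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever. Because RRF uses ranks only, the dense and sparse scores never need to be put on a common scale.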
+
+### After implementation
+
+- **Validate recall accuracy** — Measure recall@k: for a set of ground-truth similar pairs, what percentage are in the top-k search results? Target: recall@10 >90%. If lower, increase k, tune index parameters, or use a better embedding model.
+- **Benchmark query latency** — Measure p50, p95, p99 latency under realistic load (queries per second, index size). Target: p95 <100ms for real-time search. If higher, scale horizontally, optimize index, or use approximate search (lower accuracy, higher speed).
+- **Test edge cases** — Query with empty strings, extremely long texts, rare languages, special characters. Ensure system degrades gracefully (return empty results, not crashes). Validate that out-of-domain queries (content very different from training data) still return sensible results.
+- **Monitor index size and cost** — Track storage cost (GB per million vectors), indexing throughput (vectors/second), and query cost ($/1000 queries). Embedding systems can become expensive at scale. Optimize dimension size (use dimension reduction if accuracy is acceptable).
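The recall@k and latency-percentile checks above can be computed with a few lines of standard-library Python. The sketch below assumes ground truth is given as `(query_id, relevant_id)` pairs and search output as a mapping from query ID to a ranked list. Both shapes, the function names, and the nearest-rank percentile definition are assumptions made for the example.

```python
import math

def recall_at_k(ground_truth, results, k=10):
    """Fraction of (query_id, relevant_id) pairs whose relevant_id
    appears in the top-k results for that query."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    hits = sum(
        1
        for query_id, relevant_id in ground_truth
        if relevant_id in results.get(query_id, [])[:k]
    )
    return hits / len(ground_truth)

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical evaluation data.
ground_truth = [("q1", "d1"), ("q2", "d7"), ("q3", "d9"), ("q4", "d2")]
results = {
    "q1": ["d1", "d3"],
    "q2": ["d4", "d7"],
    "q3": ["d5", "d6"],
    "q4": ["d2"],
}
recall = recall_at_k(ground_truth, results, k=2)

latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 140]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

With only ten latency samples, p95 and p99 both collapse onto the slowest request, so a real benchmark needs far more samples before the tail percentiles mean anything.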
+
+## Self-check before task completion
+
+- [ ] Embedding model is selected and validated on domain-specific data
+- [ ] Vector schema (dimensions, metadata, filters) is defined and documented
+- [ ] Vector database is chosen based on benchmarks with realistic data scale and latency
+- [ ] Similarity metric (cosine/dot product/Euclidean) is selected and justified
+- [ ] Embeddings are L2-normalized before storage (if using cosine similarity)
+- [ ] Batch encoding is implemented with dynamic padding for efficiency
+- [ ] Hybrid search combines dense (semantic) and sparse (keyword) retrieval with RRF fusion
+- [ ] Metadata filtering is implemented with pre/post-filtering strategy documented
+- [ ] Recall@10 is validated at >90% on ground-truth similar pairs
+- [ ] Query latency p95 is <100ms under realistic load
+- [ ] Edge cases (empty queries, long texts, out-of-domain) are tested and handled gracefully
+- [ ] Storage cost, indexing throughput, and query cost are measured and acceptable
@@ -0,0 +1,54 @@
+---
+name: environment-management
+version: 1.0.0
+min_mindforge_version: 10.7.0
+status: stable
+triggers: environment management platform, preview environment design, ephemeral environment, environment parity, configuration drift detection, ephemeral environment provisioning, staging environment, environment lifecycle, environment as code, environment cleanup, environment promotion workflow, branch environment
+---
+
+# Skill — Environment Management
+
+## When this skill activates
+
+This skill activates when the user is designing or implementing environment management capabilities. This includes preview environments (per-PR), ephemeral environments, environment parity (dev/staging/prod consistency), configuration drift detection, automated environment provisioning, staging environment design, environment lifecycle management, environment-as-code, automatic cleanup, environment promotion workflows, and branch-based environments.
+
+## Mandatory actions when this skill is active
+
+### Before writing any code
+
+1. Inventory existing environments: count, purpose, cost, utilization, and configuration drift from production.
+2. Define environment types: production, staging, QA, developer sandbox, preview (per-PR), load testing. Establish parity requirements for each.
+3. Assess environment provisioning time: how long from request to usable environment. Target: under 10 minutes.
+4. Identify configuration drift: where do dev/staging/prod differ (versions, feature flags, resource sizes, network topology). Quantify drift percentage.
+5. Establish environment cleanup policies: when to delete ephemeral environments (after PR merge, after N days of inactivity).
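Step 4 asks for a drift percentage but does not define one. One possible definition, sketched below, is the share of configuration keys whose values differ between a baseline (production) and another environment, with an allowlist for differences that are intentional. The flat key/value shape, the `ignore` parameter, and all sample keys are assumptions for illustration.

```python
_MISSING = object()  # sentinel so a key absent on one side counts as drift

def drift_report(baseline, candidate, ignore=()):
    """Compare two flat config mappings.

    Returns (drifted_keys, drift_percent), where drift_percent is the share
    of compared keys whose values differ or that exist on only one side.
    """
    ignored = set(ignore)
    keys = sorted((set(baseline) | set(candidate)) - ignored)
    drifted = [
        key
        for key in keys
        if baseline.get(key, _MISSING) != candidate.get(key, _MISSING)
    ]
    percent = 100.0 * len(drifted) / len(keys) if keys else 0.0
    return drifted, percent

# Hypothetical configuration snapshots.
prod = {
    "postgres_version": "15.4",
    "replicas": 6,
    "feature.new_checkout": True,
    "instance_type": "m5.large",
}
staging = {
    "postgres_version": "14.9",
    "replicas": 2,
    "feature.new_checkout": True,
    "instance_type": "m5.large",
}

all_drift, all_percent = drift_report(prod, staging)
# Replica count is an intentional scale difference, so exclude it.
real_drift, real_percent = drift_report(prod, staging, ignore={"replicas"})
```

Separating intentional differences (scale) from unintentional ones (a database version mismatch) keeps the percentage focused on parity gaps that need action.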
+
+### During implementation
+
+- **Preview Environments (Per-PR):** Automatically create an isolated environment for each pull request. Include: full application stack, seeded test data, unique URL. Preview environment should be ready in under 10 minutes. Delete automatically when PR is merged or closed.
+- **Ephemeral Environments:** Short-lived environments created on-demand and destroyed after use. Use for: testing, demos, training, experiments. Include cost cap ($50-$200) and auto-deletion after 7 days of inactivity.
+- **Environment Parity:** Dev, staging, and prod should be as similar as possible. Use same: container images, Terraform modules, network topology, resource sizes (scale down in non-prod, but maintain same architecture). Differ only in: data (synthetic in non-prod), scale (fewer replicas), and external integrations (use mocks in non-prod).
+- **Environment as Code:** All environments defined via IaC (Terraform, CloudFormation, Pulumi). No manual changes in cloud console. Code should be versioned and reviewed via PRs. Use modules to ensure consistency across environments.
+- **Configuration Drift Detection:** Use Terraform plan, CloudFormation drift detection, or Config Sentinel. Run drift detection daily and alert on any manual changes. Automatically remediate drift by reapplying IaC.
+- **Environment Provisioning:** Self-service provisioning via CLI, API, or portal. Provisioning should: validate inputs, estimate cost, apply IaC, seed data, run smoke tests, return URL. Complete in under 10 minutes.
+- **Environment Lifecycle:** Define stages: provisioning → active → idle → scheduled for deletion → deleted. Idle environments (no activity for 3+ days) should be flagged for review. Auto-delete after 7 days idle (with 24-hour warning).
+- **Environment Promotion:** Promote changes from dev → staging → production. Use GitOps workflow: commit to environment-specific branch triggers deployment. Include smoke tests and rollback on failure.
+- **Staging Environment Design:** Staging should mirror production as closely as possible. Use 20-30% of production scale. Include: same services, same network topology, same feature flags, same monitoring. Differ only in: data volume and external integrations (use mocks or sandbox APIs).
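The lifecycle stages in the Environment Lifecycle bullet can be expressed as a small pure function of the last-activity timestamp. The sketch below takes the 3-day idle and 7-day deletion thresholds from the text. Placing the 24-hour warning at day 6 is one interpretation of "with 24-hour warning", and the names `Stage`, `lifecycle_stage`, and `ready` are hypothetical.

```python
from datetime import datetime, timedelta
from enum import Enum

class Stage(Enum):
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    IDLE = "idle"
    SCHEDULED_FOR_DELETION = "scheduled for deletion"
    DELETED = "deleted"

IDLE_AFTER = timedelta(days=3)      # flag for review
DELETE_AFTER = timedelta(days=7)    # auto-delete
WARN_AFTER = DELETE_AFTER - timedelta(hours=24)  # assumed: warn 24h before deletion

def lifecycle_stage(last_activity, now, ready=True):
    """Map time since last activity to a lifecycle stage."""
    if not ready:
        return Stage.PROVISIONING
    idle_for = now - last_activity
    if idle_for >= DELETE_AFTER:
        return Stage.DELETED
    if idle_for >= WARN_AFTER:
        return Stage.SCHEDULED_FOR_DELETION
    if idle_for >= IDLE_AFTER:
        return Stage.IDLE
    return Stage.ACTIVE

now = datetime(2025, 1, 10, 12, 0)
stage_recent = lifecycle_stage(now - timedelta(days=1), now)
stage_idle = lifecycle_stage(now - timedelta(days=4), now)
stage_warned = lifecycle_stage(now - timedelta(days=6, hours=1), now)
stage_gone = lifecycle_stage(now - timedelta(days=7), now)
```

Keeping the decision a pure function of timestamps makes it easy to run from a scheduled job and to test without touching any cloud resources. The actual deletion and notification calls would live outside this function.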
+
+### After implementation
+
+- Verify preview environments are created automatically for each PR and deleted on merge.
+- Confirm ephemeral environments include cost caps and auto-deletion after 7 days.
+- Validate environment parity: dev/staging/prod use same IaC modules and container images.
+- Ensure configuration drift detection runs daily and alerts on manual changes.
+- Check that environment provisioning completes in under 10 minutes with cost estimates.
+
+## Self-check before task completion
+
+- [ ] Preview environments are created automatically per PR and deleted on merge.
+- [ ] Ephemeral environments include cost caps and auto-deletion after 7 days idle.
+- [ ] Environment parity is maintained: dev/staging/prod use same IaC modules.
+- [ ] Configuration drift detection runs daily and alerts on manual changes.
+- [ ] Environment provisioning is self-service and completes in under 10 minutes.
+- [ ] Environment lifecycle includes idle detection and auto-cleanup after 7 days.
+- [ ] Environment promotion uses GitOps workflow with smoke tests and rollback.
+- [ ] Staging environment mirrors production at 20-30% scale with same architecture.
@@ -0,0 +1,118 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: error-handling-architecture
|
|
3
|
+
version: 1.0.0
|
|
4
|
+
min_mindforge_version: 10.0.8
|
|
5
|
+
status: stable
|
|
6
|
+
triggers: error handling architecture, error hierarchy, error boundary, retry policy, dead letter queue, error reporting, sentry integration, error classification, graceful error, error propagation, error context, structured error
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Error Handling Architecture
|
|
10
|
+
|
|
11
|
+
## When this skill activates
|
|
12
|
+
|
|
13
|
+
This skill activates when designing error hierarchies, implementing error boundaries, configuring retry policies, setting up error reporting (Sentry/Datadog/Bugsnag), building dead letter queue processing, or establishing error handling patterns across a codebase. It applies to both frontend and backend error architecture.
|
|
14
|
+
|
|
15
|
+
## Mandatory actions when this skill is active
|
|
16
|
+
|
|
17
|
+
### Before
|
|
18
|
+
|
|
19
|
+
1. Map the system boundaries where errors cross layers (UI → API → service → DB → external).
|
|
20
|
+
2. Identify the error reporting tool in use or to be adopted (Sentry, Datadog, Bugsnag, custom).
|
|
21
|
+
3. Categorize existing errors: which are retryable, which are permanent, which are user-facing.
|
|
22
|
+
4. Determine SLA for error detection and alerting (time from error to notification).
|
|
23
|
+
5. Review current error handling for anti-patterns (swallowed errors, generic catches, leaked internals).
|
|
24
|
+
|
|
25
|
+
### During
|
|
26
|
+
|
|
27
|
+
**Error Hierarchy Design:**
|
|
28
|
+
```
|
|
29
|
+
BaseError (abstract)
|
|
30
|
+
├── ValidationError — Invalid input (400). User can fix and retry.
|
|
31
|
+
├── NotFoundError — Resource doesn't exist (404). Client should not retry.
|
|
32
|
+
├── AuthenticationError — Identity unknown (401). Redirect to login.
|
|
33
|
+
├── AuthorizationError — Identity known, permission denied (403). Contact admin.
|
|
34
|
+
├── ConflictError — State conflict (409). Client should refresh and retry.
|
|
35
|
+
├── RateLimitError — Too many requests (429). Client must back off.
|
|
36
|
+
├── ExternalServiceError — Upstream dependency failed (502/503). Retry with backoff.
|
|
37
|
+
└── InternalError — Unexpected failure (500). Alert engineers.
|
|
38
|
+
```
|
|
39
|
+
- Every error class carries: `code` (machine-readable), `message` (human-readable), `context` (debugging data), `retryable` (boolean).
|
|
40
|
+
- NEVER use generic `Error` or `Exception` for domain errors. Always use typed errors.
|
|
41
|
+
- Error codes follow a namespace: `AUTH_001`, `PAYMENT_002`, `INVENTORY_003`.
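
The hierarchy and the four required fields could be sketched in TypeScript roughly as follows. This is a minimal illustration, not the package's implementation: the field layout, the `ErrorContext` alias, and the `EXTERNAL_001` code are assumptions, and only two of the eight subclasses are shown.

```typescript
// Hypothetical sketch: class names follow the hierarchy above; everything else is assumed.
type ErrorContext = Record<string, unknown>;

abstract class BaseError extends Error {
  abstract readonly code: string;       // machine-readable, e.g. "AUTH_001"
  abstract readonly retryable: boolean; // may the caller retry automatically?
  readonly context: ErrorContext;       // debugging data, never shown to users

  constructor(message: string, context: ErrorContext = {}) {
    super(message);
    // Restore the prototype chain so `instanceof` also works when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
    this.context = context;
  }
}

class ValidationError extends BaseError {
  readonly code = "VALIDATION_001";
  readonly retryable = false; // the user must fix the input first
}

class ExternalServiceError extends BaseError {
  readonly code = "EXTERNAL_001"; // assumed code, following the namespace convention
  readonly retryable = true;      // retry with backoff
}
```

Making `code` and `retryable` abstract means the compiler rejects any subclass that forgets to declare them.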
|
|
42
|
+
|
|
43
|
+
**Error Boundaries (Defense in Depth):**
|
|
44
|
+
- **React UI**: `ErrorBoundary` components at route level and critical widget level. Show user-friendly fallback, report to Sentry.
|
|
45
|
+
- **API Controllers**: Catch all errors at the handler level. Transform to appropriate HTTP status and structured response.
|
|
46
|
+
- **Service Layer**: Catch errors from dependencies, add context, re-throw as domain errors.
|
|
47
|
+
- **Infrastructure**: Global uncaught exception handler as last resort (log + alert + graceful shutdown).
|
|
48
|
+
- Each boundary: catch, enrich with context, decide (retry/propagate/absorb), report.
|
|
49
|
+
|
|
50
|
+
**Retry Policies:**
|
|
51
|
+
- **Exponential backoff**: `delay = baseDelay * 2^attempt` (e.g., 100ms, 200ms, 400ms, 800ms).
|
|
52
|
+
- **Jitter**: Add random variance to prevent thundering herd (`delay + random(0, delay * 0.1)`).
|
|
53
|
+
- **Max retries**: 3 for fast operations, 5 for critical operations, 0 for non-idempotent writes (unless protected by an idempotency key).
|
|
54
|
+
- **Idempotency keys**: Required for retrying write operations (prevent double-processing).
|
|
55
|
+
- **Circuit breaker**: After N consecutive failures, stop trying for a cooldown period. Prevents cascade.
|
|
56
|
+
- **Retry only retryable errors**: Network timeouts (yes), validation errors (no), 500s (maybe, with limit).
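
The backoff, jitter, and max-retry rules above can be combined in one small helper. The sketch below is illustrative only: `RetryPolicy`, `backoffDelayMs`, and `withRetry` are hypothetical names, and the circuit breaker and idempotency-key handling are left out.

```typescript
// Sketch only: names are illustrative and do not come from any library.
interface RetryPolicy {
  baseDelayMs: number; // delay before the first retry
  maxRetries: number;  // 0 disables retries (e.g. non-idempotent writes)
  jitterRatio: number; // 0.1 adds up to 10% random variance
}

// delay = baseDelay * 2^attempt, plus random(0, delay * jitterRatio).
function backoffDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const delay = policy.baseDelayMs * Math.pow(2, attempt);
  return delay + random() * delay * policy.jitterRatio;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => { setTimeout(resolve, ms); });

async function withRetry<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  policy: RetryPolicy,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on permanent errors, or once the retry budget is spent.
      if (!isRetryable(error) || attempt >= policy.maxRetries) throw error;
      await sleep(backoffDelayMs(attempt, policy));
    }
  }
}
```

Injecting `random` and `sleep` keeps the delay schedule testable without real timers.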
|
|
57
|
+
|
|
58
|
+
**Dead Letter Queues (Async Error Handling):**
|
|
59
|
+
- Messages that fail processing after max retries go to the DLQ.
|
|
60
|
+
- DLQ messages must retain: original message, error details, attempt count, timestamp of each failure.
|
|
61
|
+
- Alerting: notify when DLQ depth exceeds threshold (e.g., >10 messages in 5 minutes).
|
|
62
|
+
- Manual replay: tooling to inspect DLQ messages and replay them after fix is deployed.
|
|
63
|
+
- Poisoned messages: if a message still fails after replay, move it to a permanent failure store and open an investigation ticket.
|
|
64
|
+
- Never silently drop messages. Every message must reach success or explicit human acknowledgment.
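
As a rough illustration of the retention rules above, a consumer might park exhausted messages like this. The envelope shape and the `processWithDlq` helper are assumptions made for the example. A real queue would be asynchronous and broker-specific, and would back off between attempts.

```typescript
// Illustrative types; field names are assumptions, not a specific broker's API.
interface FailureRecord {
  error: string;     // error details for this attempt
  timestamp: string; // ISO-8601 time of the failure
}

interface DeadLetter<T> {
  original: T;               // the untouched original message
  attemptCount: number;
  failures: FailureRecord[]; // one entry per failed attempt
}

function processWithDlq<T>(
  message: T,
  handler: (message: T) => void,
  maxAttempts: number,
  deadLetters: DeadLetter<T>[],
  now: () => Date = () => new Date(),
): boolean {
  const failures: FailureRecord[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(message);
      return true; // processed successfully
    } catch (error) {
      failures.push({
        error: error instanceof Error ? error.message : String(error),
        timestamp: now().toISOString(),
      });
    }
  }
  // Retries exhausted: park the message instead of dropping it.
  deadLetters.push({ original: message, attemptCount: failures.length, failures });
  return false;
}
```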
|
|
65
|
+
|
|
66
|
+
**Error Reporting (Sentry Integration Pattern):**
|
|
67
|
+
- **Breadcrumbs**: Log navigation, API calls, user actions leading up to the error.
|
|
68
|
+
- **Tags**: environment, service, user_id, transaction_id, feature_flag.
|
|
69
|
+
- **Context**: request body (sanitized), response status, database query (no PII).
|
|
70
|
+
- **Grouping**: Configure fingerprinting so related errors group together (not 1000 separate issues).
|
|
71
|
+
- **Alerting**: New issues → Slack. Regression (resolved issue re-opens) → PagerDuty.
|
|
72
|
+
- **Sampling**: 100% for errors, 10-20% for transactions/performance in production.
|
|
73
|
+
- **PII scrubbing**: Strip emails, tokens, passwords before sending to error reporting.
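
The PII scrubbing step could be a pure function applied to the event payload before it leaves the process, for example from a reporting SDK's pre-send hook such as Sentry's `beforeSend` option. The sketch below is hypothetical: the key list and email pattern are illustrative and far from exhaustive, and cyclic objects are not handled.

```typescript
// Hypothetical helper; patterns are examples only and should be tuned per codebase.
const SENSITIVE_KEYS = /pass(word)?|token|secret|authorization|api[_-]?key/i;
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function scrubPii(value: unknown): unknown {
  if (typeof value === "string") {
    return value.replace(EMAIL_PATTERN, "[email]");
  }
  if (Array.isArray(value)) {
    return value.map((item) => scrubPii(item));
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const cleaned: Record<string, unknown> = {};
    for (const key of Object.keys(record)) {
      // Redact by key name first, then scrub whatever remains recursively.
      cleaned[key] = SENSITIVE_KEYS.test(key) ? "[redacted]" : scrubPii(record[key]);
    }
    return cleaned;
  }
  return value; // numbers, booleans, null, undefined pass through
}
```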
|
|
74
|
+
|
|
75
|
+
**Structured Error Responses (API):**
|
|
76
|
+
```json
|
|
77
|
+
{
|
|
78
|
+
"error": {
|
|
79
|
+
"code": "VALIDATION_001",
|
|
80
|
+
"message": "Email address is invalid",
|
|
81
|
+
"details": [
|
|
82
|
+
{ "field": "email", "constraint": "Must be a valid email format" }
|
|
83
|
+
],
|
|
84
|
+
"retryable": false,
|
|
85
|
+
"request_id": "req_abc123"
|
|
86
|
+
}
|
|
87
|
+
}
|
|
88
|
+
```
|
|
89
|
+
- Always include `request_id` for correlation with server logs.
|
|
90
|
+
- Never expose stack traces, internal paths, or database details in API responses.
|
|
91
|
+
- Use `details` array for field-level validation errors.
|
|
92
|
+
- Include `retryable` flag so clients can implement automatic retry logic.
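
A controller-level boundary that produces this response shape might look like the sketch below. `DomainError` is a stand-in for whatever typed errors the codebase defines. The `INTERNAL_001` code and the choice to mark unknown failures as non-retryable are assumptions.

```typescript
// Sketch: `DomainError` and `toErrorResponse` are hypothetical names.
interface FieldDetail { field: string; constraint: string }

interface DomainError {
  code: string;
  message: string;
  status: number; // HTTP status to return
  retryable: boolean;
  details?: FieldDetail[];
}

interface ErrorResponse {
  status: number;
  body: {
    error: {
      code: string;
      message: string;
      details: FieldDetail[];
      retryable: boolean;
      request_id: string;
    };
  };
}

function isDomainError(e: unknown): e is DomainError {
  if (typeof e !== "object" || e === null) return false;
  const c = e as Record<string, unknown>;
  return typeof c.code === "string" && typeof c.message === "string"
    && typeof c.status === "number" && typeof c.retryable === "boolean";
}

function toErrorResponse(error: unknown, requestId: string): ErrorResponse {
  if (isDomainError(error)) {
    return {
      status: error.status,
      body: { error: {
        code: error.code,
        message: error.message,
        details: error.details ? error.details : [],
        retryable: error.retryable,
        request_id: requestId,
      } },
    };
  }
  // Unknown failures become a generic 500: no stack trace or internals leak out.
  return {
    status: 500,
    body: { error: {
      code: "INTERNAL_001",
      message: "Internal server error",
      details: [],
      retryable: false,
      request_id: requestId,
    } },
  };
}
```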
|
|
93
|
+
|
|
94
|
+
**Error Propagation Rules:**
|
|
95
|
+
- Errors propagate UP (from infrastructure to service to controller to client).
|
|
96
|
+
- At each layer: catch, add context specific to that layer, re-throw as appropriate type.
|
|
97
|
+
- NEVER swallow errors silently (`catch (e) {}`). At minimum: log and re-throw.
|
|
98
|
+
- Transform errors at boundaries (internal DB error becomes generic 500 to the client).
|
|
99
|
+
- Preserve the original error as `cause` for debugging (Error Cause chain / `cause` property).
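
The catch, enrich, and re-throw rule with a preserved `cause` could be sketched as follows. `ServiceError` and `loadOrder` are hypothetical names. The `declare` modifier is used so the field is not redefined on runtimes whose `Error` already has `cause`.

```typescript
// Sketch: a service-layer wrapper that keeps the original error as `cause`.
class ServiceError extends Error {
  declare readonly cause?: unknown; // original error, kept for debugging
  readonly context: Record<string, unknown>;

  constructor(message: string, context: Record<string, unknown>, cause?: unknown) {
    super(message);
    // Restore the prototype chain so `instanceof` also works when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "ServiceError";
    this.context = context;
    this.cause = cause;
  }
}

function loadOrder(orderId: string, query: (id: string) => string): string {
  try {
    return query(orderId);
  } catch (error) {
    // Add layer-specific context, then re-throw as a domain error.
    throw new ServiceError("Failed to load order", { orderId }, error);
  }
}
```

The controller boundary can then log the full `cause` chain while returning only a generic message to the client.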
|
|
100
|
+
|
|
101
|
+
### After
|
|
102
|
+
|
|
103
|
+
1. Error hierarchy covers all known failure modes in the system.
|
|
104
|
+
2. Every boundary has explicit error handling (no unhandled promise rejections, no bare throws).
|
|
105
|
+
3. Retry policies are configured with backoff, jitter, and max attempts.
|
|
106
|
+
4. Error reporting captures sufficient context for debugging without exposing PII.
|
|
107
|
+
5. Alerting is configured for new errors, regressions, and DLQ threshold breaches.
|
|
108
|
+
|
|
109
|
+
## Self-check before task completion
|
|
110
|
+
|
|
111
|
+
- [ ] All errors use typed error classes from the hierarchy (no generic `Error` throws).
|
|
112
|
+
- [ ] Error boundaries exist at every system layer (UI, API, service, infrastructure).
|
|
113
|
+
- [ ] Retry policies use exponential backoff with jitter and respect idempotency.
|
|
114
|
+
- [ ] Dead letter queues are configured for async processing with alerting and replay tooling.
|
|
115
|
+
- [ ] Error reporting includes breadcrumbs, tags, and context (without PII).
|
|
116
|
+
- [ ] API error responses are structured with code, message, retryable flag, and request_id.
|
|
117
|
+
- [ ] No errors are silently swallowed anywhere in the codebase.
|
|
118
|
+
- [ ] Error messages are actionable for the audience (user-friendly in UI, detailed in logs).
|
|
@@ -0,0 +1,113 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: estimation-techniques
|
|
3
|
+
version: 1.0.0
|
|
4
|
+
min_mindforge_version: 10.1.0
|
|
5
|
+
status: stable
|
|
6
|
+
triggers: estimation technique, story points, planning poker, reference class forecasting, cone of uncertainty, velocity projection, t-shirt sizing, three-point estimate, estimation accuracy, estimation calibration, effort estimation, time estimation
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Estimation Techniques
|
|
10
|
+
|
|
11
|
+
## When this skill activates
|
|
12
|
+
|
|
13
|
+
This skill activates when the team needs to estimate effort, duration, or complexity
|
|
14
|
+
of work items. It provides multiple estimation techniques, guidance on when to use each,
|
|
15
|
+
and frameworks for improving estimation accuracy over time through calibration and
|
|
16
|
+
reference class forecasting.
|
|
17
|
+
|
|
18
|
+
## Mandatory actions when this skill is active
|
|
19
|
+
|
|
20
|
+
### Before
|
|
21
|
+
|
|
22
|
+
1. **Clarify what is being estimated** — Effort (person-days), duration (calendar time),
|
|
23
|
+
complexity (relative sizing), or cost (dollars)? These are different things.
|
|
24
|
+
2. **Identify the audience** — Who needs this estimate and what decision will it inform?
|
|
25
|
+
(Sprint planning vs. budget approval vs. roadmap planning require different precision.)
|
|
26
|
+
3. **Gather reference data** — Pull historical velocity, past estimates vs. actuals for
|
|
27
|
+
similar work, and team availability for the period.
|
|
28
|
+
|
|
29
|
+
### During
|
|
30
|
+
|
|
31
|
+
4. **Select appropriate technique based on context:**
|
|
32
|
+
|
|
33
|
+
- **Story Points (relative sizing):**
|
|
34
|
+
Compare work items to known reference stories. "Is this bigger or smaller than
|
|
35
|
+
the login feature we built last sprint?" Use Fibonacci sequence (1, 2, 3, 5, 8, 13)
|
|
36
|
+
to force recognition that large items have more uncertainty.
|
|
37
|
+
Best for: Sprint planning, backlog prioritization.
|
|
38
|
+
|
|
39
|
+
- **Planning Poker:**
|
|
40
|
+
Each estimator independently selects a card. Reveal simultaneously. If estimates
|
|
41
|
+
diverge by >2x, the highest and lowest explain their reasoning. Re-estimate.
|
|
42
|
+
Convergence indicates shared understanding. Divergence reveals hidden complexity
|
|
43
|
+
or misunderstanding. Best for: Team alignment on scope understanding.
|
|
44
|
+
|
|
45
|
+
- **Three-Point Estimate:**
|
|
46
|
+
Optimistic (O) + Most Likely (M) + Pessimistic (P).
|
|
47
|
+
Weighted average: (O + 4M + P) / 6.
|
|
48
|
+
Standard deviation: (P - O) / 6.
|
|
49
|
+
Present as range, not single number. Best for: Executive communication,
|
|
50
|
+
project planning with uncertainty bounds.
|
|
51
|
+
|
|
52
|
+
- **T-Shirt Sizing (S / M / L / XL):**
|
|
53
|
+
Rough relative sizing for large backlogs. Fast, low-ceremony. Good for roadmap-
|
|
54
|
+
level planning where precision is not needed. Convert to approximate days only
|
|
55
|
+
when pressed (S=1-2d, M=3-5d, L=1-2w, XL=3-4w — calibrate to your team).
|
|
56
|
+
Best for: Roadmap planning, early-stage estimation.
|
|
57
|
+
|
|
58
|
+
- **Reference Class Forecasting:**
|
|
59
|
+
Do NOT ask "how long do I think this will take?"
|
|
60
|
+
Instead ask "how long did SIMILAR work actually take in the past?"
|
|
61
|
+
Search for completed work of similar scope, technology, and team composition.
|
|
62
|
+
Use the actual duration distribution as the forecast basis.
|
|
63
|
+
Best for: Overcoming optimism bias, project-level estimates.
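
For the three-point technique listed above, the two formulas translate directly into code. This is a minimal sketch: the function name is illustrative, and reporting the range as expected value plus or minus one standard deviation is an assumption, since the width of the band depends on the audience.

```typescript
// Sketch of the three-point (PERT-style) formulas: (O + 4M + P) / 6 and (P - O) / 6.
interface ThreePointEstimate {
  expected: number; // weighted average
  stdDev: number;   // standard deviation
  low: number;      // expected - 1 standard deviation (assumed band)
  high: number;     // expected + 1 standard deviation (assumed band)
}

function threePointEstimate(
  optimistic: number,
  mostLikely: number,
  pessimistic: number,
): ThreePointEstimate {
  const expected = (optimistic + 4 * mostLikely + pessimistic) / 6;
  const stdDev = (pessimistic - optimistic) / 6;
  return { expected, stdDev, low: expected - stdDev, high: expected + stdDev };
}
```

With O = 2, M = 4, and P = 12 days, the weighted average is 5 days and the standard deviation is about 1.67 days.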
|
|
64
|
+
|
|
65
|
+
5. **Apply the Cone of Uncertainty:**
|
|
66
|
+
- Initial concept: 4x range (the actual could be as much as 4x the estimate or as little as 0.25x)
|
|
67
|
+
- After requirements: 2x range
|
|
68
|
+
- After detailed design: 1.5x range
|
|
69
|
+
- After implementation started: 1.25x range
|
|
70
|
+
- Communicate which stage you are at and the corresponding uncertainty range.
|
|
71
|
+
- Never present early-stage estimates as precise numbers.
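
One way to apply the stage multipliers is to turn a point estimate into a range. The sketch below assumes each factor is symmetric (divide for the low end, multiply for the high end), which matches the "4x longer or 0.25x" reading of the first stage. The stage names and function name are illustrative.

```typescript
// Sketch: stage factors come from the list above; treating them as symmetric is an assumption.
type ConeStage = "concept" | "requirements" | "design" | "implementation";

const CONE_FACTOR: Record<ConeStage, number> = {
  concept: 4,
  requirements: 2,
  design: 1.5,
  implementation: 1.25,
};

function coneRange(estimate: number, stage: ConeStage): { low: number; high: number } {
  const factor = CONE_FACTOR[stage];
  return { low: estimate / factor, high: estimate * factor };
}
```

A 20-day estimate at the concept stage would therefore be communicated as roughly 5 to 80 days.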
|
|
72
|
+
|
|
73
|
+
6. **Velocity-based projection:**
|
|
74
|
+
- Average velocity over last 3-5 sprints (discard outliers)
|
|
75
|
+
- Remaining story points / average velocity = estimated sprints remaining
|
|
76
|
+
- Present as range: (remaining / max velocity) to (remaining / min velocity)
|
|
77
|
+
- Account for known disruptions (holidays, planned absences, onboarding)
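
The velocity arithmetic above can be sketched as a small function. The function name is hypothetical, and the sketch deliberately leaves out outlier removal and capacity adjustments for holidays or absences, which would be applied to the velocity list beforehand.

```typescript
// Sketch: projects remaining sprints from recent velocities (points per sprint).
interface SprintProjection {
  expected: number; // remaining / average velocity
  best: number;     // remaining / max velocity
  worst: number;    // remaining / min velocity
}

function projectSprints(remainingPoints: number, recentVelocities: number[]): SprintProjection {
  if (recentVelocities.length === 0 || recentVelocities.some((v) => v <= 0)) {
    throw new Error("need at least one positive velocity");
  }
  const total = recentVelocities.reduce((sum, v) => sum + v, 0);
  const average = total / recentVelocities.length;
  const fastest = Math.max.apply(null, recentVelocities);
  const slowest = Math.min.apply(null, recentVelocities);
  return {
    expected: remainingPoints / average,
    best: remainingPoints / fastest,
    worst: remainingPoints / slowest,
  };
}
```

For 100 remaining points and velocities of 20, 25, and 30, this gives an expected 4 sprints with a range of about 3.3 to 5.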
|
|
78
|
+
|
|
79
|
+
7. **Common estimation pitfalls to avoid:**
|
|
80
|
+
- Anchoring — First number mentioned biases all subsequent estimates
|
|
81
|
+
- Planning fallacy — People consistently underestimate (use reference class data)
|
|
82
|
+
- Scope creep — Estimate what is defined NOW, flag unknowns separately
|
|
83
|
+
- Precision theater — "14.5 days" implies false precision; use ranges
|
|
84
|
+
- Forgetting overhead — Testing, code review, deployment, documentation are work too
|
|
85
|
+
- Hero assumptions — Estimate for average team member, not the fastest
|
|
86
|
+
|
|
87
|
+
8. **Calibration practices:**
|
|
88
|
+
- Track estimate vs. actual for every completed item
|
|
89
|
+
- Review accuracy quarterly — are you consistently over or under?
|
|
90
|
+
- Adjust multiplier based on historical bias (if actuals consistently run 1.5x over estimates,
|
|
91
|
+
multiply estimates by 1.5)
|
|
92
|
+
- Share calibration data with the team to build collective estimation skill
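
A calibration multiplier can be derived from the tracked estimate and actual pairs. The sketch below uses the ratio of total actual to total estimated effort, which is one reasonable choice among several (a median of per-item ratios would be less sensitive to outliers). The names are illustrative.

```typescript
// Sketch: derive a bias multiplier from completed work items.
interface CompletedItem {
  estimate: number; // original estimate, in any consistent unit
  actual: number;   // what the work really took, in the same unit
}

function biasMultiplier(history: CompletedItem[]): number {
  const totalEstimate = history.reduce((sum, item) => sum + item.estimate, 0);
  const totalActual = history.reduce((sum, item) => sum + item.actual, 0);
  if (totalEstimate <= 0) throw new Error("need history with positive estimates");
  // > 1 means the team underestimates; < 1 means it overestimates.
  return totalActual / totalEstimate;
}

function calibrated(estimate: number, history: CompletedItem[]): number {
  return estimate * biasMultiplier(history);
}
```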
|
|
93
|
+
|
|
94
|
+
### After
|
|
95
|
+
|
|
96
|
+
9. **Record the estimate with context** — Document: what was estimated, technique used,
|
|
97
|
+
assumptions made, uncertainty range, and who participated.
|
|
98
|
+
10. **Track actuals** — When work completes, record actual effort/duration alongside the
|
|
99
|
+
estimate for future calibration.
|
|
100
|
+
11. **Retrospect on accuracy** — Periodically review estimation accuracy trends. Celebrate
|
|
101
|
+
improvement in calibration, investigate systematic biases.
|
|
102
|
+
|
|
103
|
+
## Self-check before task completion
|
|
104
|
+
|
|
105
|
+
- [ ] Estimation technique selected intentionally (not just defaulting to one)
|
|
106
|
+
- [ ] Estimate presented as range, not single number
|
|
107
|
+
- [ ] Cone of uncertainty stage identified and communicated
|
|
108
|
+
- [ ] Reference class data consulted where available
|
|
109
|
+
- [ ] Common pitfalls actively guarded against (especially anchoring and planning fallacy)
|
|
110
|
+
- [ ] Assumptions and unknowns documented alongside the estimate
|
|
111
|
+
- [ ] Overhead included (testing, review, deployment, documentation)
|
|
112
|
+
- [ ] Historical calibration data referenced if available
|
|
113
|
+
- [ ] Tracking mechanism in place to compare estimate vs. actual
|
|
@@ -0,0 +1,180 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-harness
|
|
3
|
+
version: 1.0.0
|
|
4
|
+
min_mindforge_version: 10.0.4
|
|
5
|
+
status: stable
|
|
6
|
+
triggers: eval, evaluation, grading, pass at k, rubric, regression eval, capability eval, model judge, deterministic grading, LLM-as-judge, eval score, eval-driven, benchmark eval
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Skill — Eval Harness (Systematic Evaluation Framework)
|
|
10
|
+
|
|
11
|
+
## When this skill activates
|
|
12
|
+
When measuring, scoring, or validating system outputs against defined criteria.
|
|
13
|
+
Use for capability evaluation (can the system do X?), regression evaluation (does
|
|
14
|
+
a change break existing behavior?), or comparative evaluation (is version A better
|
|
15
|
+
than version B?). The eval harness ensures you define success BEFORE implementing,
|
|
16
|
+
not after.
|
|
17
|
+
|
|
18
|
+
Core principle: **Define-before-code** — write evaluation criteria before writing
|
|
19
|
+
the implementation they measure.
|
|
20
|
+
|
|
21
|
+
## Mandatory actions when this skill is active
|
|
22
|
+
|
|
23
|
+
### Before evaluation begins
|
|
24
|
+
|
|
25
|
+
1. **Define the eval type:**
|
|
26
|
+
- **Capability eval**: Can the system perform task X at acceptable quality?
|
|
27
|
+
- **Regression eval**: Does this change preserve existing behavior?
|
|
28
|
+
- **Comparative eval**: Is output A better than output B on criteria C?
|
|
29
|
+
|
|
30
|
+
2. **Write the eval config BEFORE implementation:**
|
|
31
|
+
```
|
|
32
|
+
.mindforge/evals/[eval-name]/
|
|
33
|
+
├── config.json # eval metadata, parameters, thresholds
|
|
34
|
+
├── rubric.md # human-readable success criteria
|
|
35
|
+
├── test-cases.json # input/expected-output pairs
|
|
36
|
+
└── results.jsonl # append-only results log
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
3. **Define success criteria in config.json:**
|
|
40
|
+
```json
|
|
41
|
+
{
|
|
42
|
+
"name": "eval-name",
|
|
43
|
+
"type": "capability" | "regression" | "comparative",
|
|
44
|
+
"version": "1.0.0",
|
|
45
|
+
"created": "ISO-8601",
|
|
46
|
+
"thresholds": {
|
|
47
|
+
"pass_at_1": 0.8,
|
|
48
|
+
"pass_at_5": 0.95,
|
|
49
|
+
"pass_at_10": 0.99
|
|
50
|
+
},
|
|
51
|
+
"grader": "code" | "model" | "human",
|
|
52
|
+
"model_judge_config": {
|
|
53
|
+
"model": "claude-sonnet",
|
|
54
|
+
"rubric_path": "./rubric.md",
|
|
55
|
+
"temperature": 0.0
|
|
56
|
+
},
|
|
57
|
+
"test_case_count": 0,
|
|
58
|
+
"tags": []
|
|
59
|
+
}
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
4. **Write the rubric (rubric.md) with explicit scoring:**
|
|
63
|
+
- Each criterion gets a 1-5 scale with concrete examples at each level
|
|
64
|
+
- Define what a "pass" means (minimum score per criterion)
|
|
65
|
+
- Define what a "fail" looks like with specific examples
|
|
66
|
+
- Include edge cases that should be tested
|
|
67
|
+
|
|
68
|
+
### During evaluation
|
|
69
|
+
|
|
70
|
+
**Three Grader Types:**
|
|
71
|
+
|
|
72
|
+
**1. Code-Based (Deterministic):**
|
|
73
|
+
- Use when outputs have objectively verifiable properties
|
|
74
|
+
- Write assertion functions that return PASS/FAIL with evidence
|
|
75
|
+
- Examples: output matches regex, JSON schema validates, function returns expected value
|
|
76
|
+
- No ambiguity — the grader is a function, not a judgment call
|
|
77
|
+
- Always prefer code-based grading when possible (fastest, most reliable)
|
|
78
|
+
|
|
79
|
+
```typescript
|
|
80
|
+
// Example code grader
|
|
81
|
+
function grade(output: string, expected: TestCase): GradeResult {
|
|
82
|
+
const parsed = JSON.parse(output);
|
|
83
|
+
return {
|
|
84
|
+
pass: parsed.status === expected.status && parsed.count >= expected.minCount,
|
|
85
|
+
evidence: `status=${parsed.status}, count=${parsed.count}`,
|
|
86
|
+
criterion: "structural-correctness"
|
|
87
|
+
};
|
|
88
|
+
}
|
|
89
|
+
```
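
The grader above assumes that `TestCase` and `GradeResult` exist and that the output is always valid JSON. A more defensive variant, with both types spelled out as assumptions, might look like this. A malformed output is graded as a FAIL with evidence rather than crashing the eval run.

```typescript
// Sketch: the two interfaces are assumed shapes inferred from the example above.
interface TestCase {
  status: string;
  minCount: number;
}

interface GradeResult {
  pass: boolean;
  evidence: string;
  criterion: string;
}

function gradeSafely(output: string, expected: TestCase): GradeResult {
  const criterion = "structural-correctness";
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch (error) {
    return { pass: false, evidence: "output is not valid JSON", criterion };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { pass: false, evidence: "output is not a JSON object", criterion };
  }
  const { status, count } = parsed as { status?: unknown; count?: unknown };
  const pass =
    status === expected.status &&
    typeof count === "number" &&
    count >= expected.minCount;
  return { pass, evidence: `status=${String(status)}, count=${String(count)}`, criterion };
}
```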
|
|
90
|
+
|
|
91
|
+
**2. Model-Based (LLM-as-Judge):**
|
|
92
|
+
- Use when outputs require semantic understanding (prose quality, code correctness, reasoning)
|
|
93
|
+
- Always provide the rubric in the judge prompt — never rely on implicit standards
|
|
94
|
+
- Use temperature 0.0 for judge calls (minimizes, but may not fully eliminate, run-to-run variation)
|
|
95
|
+
- Run judge 3x per item and take majority vote (reduces noise)
|
|
96
|
+
- Log the judge's reasoning alongside the score
|
|
97
|
+
|
|
98
|
+
```
|
|
99
|
+
Judge prompt structure:
|
|
100
|
+
1. Task description (what was the system asked to do?)
|
|
101
|
+
2. Rubric (what does good look like? what does bad look like?)
|
|
102
|
+
3. The output to grade
|
|
103
|
+
4. Instruction: score 1-5 per criterion, explain each score, give overall PASS/FAIL
|
|
104
|
+
```
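
The "run the judge 3x and take the majority vote" rule could be wired up roughly as below. The `judge` callback stands in for whatever model call the harness makes. Its signature and the `JudgeVerdict` shape are assumptions for illustration.

```typescript
// Sketch: `judge` is a hypothetical async call that returns one verdict per invocation.
interface JudgeVerdict {
  pass: boolean;
  reasoning: string; // logged alongside the score
}

function majorityVote(verdicts: boolean[]): boolean {
  if (verdicts.length === 0 || verdicts.length % 2 === 0) {
    throw new Error("need an odd, non-zero number of verdicts");
  }
  const passes = verdicts.filter((v) => v).length;
  return passes > verdicts.length / 2;
}

async function judgeWithVote(
  judge: () => Promise<JudgeVerdict>,
  runs: number = 3,
): Promise<{ pass: boolean; reasonings: string[] }> {
  const verdicts: JudgeVerdict[] = [];
  for (let i = 0; i < runs; i++) {
    verdicts.push(await judge());
  }
  return {
    pass: majorityVote(verdicts.map((v) => v.pass)),
    reasonings: verdicts.map((v) => v.reasoning), // keep every run's reasoning
  };
}
```

Requiring an odd number of runs avoids ties, so the vote always yields a definite PASS or FAIL.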
|
|
105
|
+
|
|
106
|
+
**3. Human-Based (Flag for Review):**
|
|
107
|
+
- Use when stakes are too high for automated judgment
|
|
108
|
+
- Generate a review queue with: input, output, rubric, suggested-score
|
|
109
|
+
- Human confirms or overrides the suggested score
|
|
110
|
+
- Track inter-rater reliability if multiple humans review
|
|
111
|
+
|
|
112
|
+
**pass@k Metrics:**
|
|
113
|
+
- Generate n independent outputs (n ≥ k) for each test case
|
|
114
|
+
- **pass@1**: Fraction of test cases where the first output passes
|
|
115
|
+
- **pass@5**: Fraction where at least 1 of 5 outputs passes
|
|
116
|
+
- **pass@10**: Fraction where at least 1 of 10 outputs passes
|
|
117
|
+
- Formula: pass@k = 1 - C(n-c, k) / C(n, k), where n = outputs generated per test case (n ≥ k) and c = outputs that pass; average the result over all test cases
|
|
118
|
+
- Always report pass@1 (baseline) and at least one higher-k metric
|
|
119
|
+
- Use pass@1 for production readiness, pass@k for capability ceiling
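
The pass@k formula above can be evaluated without computing large binomial coefficients by expanding the ratio as a running product. The sketch assumes n is the number of outputs generated for one test case and c is how many of them passed. The function names are illustrative.

```typescript
// Sketch: pass@k = 1 - C(n-c, k) / C(n, k) for a single test case.
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("require 1 <= k <= n and 0 <= c <= n");
  }
  if (n - c < k) return 1; // every size-k subset contains at least one pass
  // C(n-c, k) / C(n, k) equals the product of (1 - k/i) for i in (n-c, n].
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// Aggregate metric: the mean of the per-test-case values.
function meanPassAtK(results: { n: number; c: number }[], k: number): number {
  if (results.length === 0) throw new Error("no results");
  const total = results.reduce((sum, r) => sum + passAtK(r.n, r.c, k), 0);
  return total / results.length;
}
```

For n = 10 outputs with c = 3 passing, this gives pass@1 = 0.3 and pass@5 of about 0.917.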
|
|
120
|
+
|
|
121
|
+
**Result logging (results.jsonl):**
|
|
122
|
+
```json
|
|
123
|
+
{
|
|
124
|
+
"timestamp": "ISO-8601",
|
|
125
|
+
"test_case_id": "tc-001",
|
|
126
|
+
"input": "...",
|
|
127
|
+
"output": "...",
|
|
128
|
+
"grader": "code",
|
|
129
|
+
"scores": {"criterion_a": 4, "criterion_b": 5},
|
|
130
|
+
"pass": true,
|
|
131
|
+
"evidence": "...",
|
|
132
|
+
"latency_ms": 0,
|
|
133
|
+
"model_version": "...",
|
|
134
|
+
"run_id": "uuid"
|
|
135
|
+
}
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
### After evaluation
|
|
139
|
+
|
|
140
|
+
1. **Compute aggregate metrics:**
|
|
141
|
+
- Overall pass rate (pass@1, pass@5, pass@10)
|
|
142
|
+
- Per-criterion score distribution
|
|
143
|
+
- Failure mode clustering (what patterns cause failures?)
|
|
144
|
+
- Comparison to previous run (regression detection)
|
|
145
|
+
|
|
146
|
+
2. **Regression detection logic:**
|
|
147
|
+
- If pass@1 drops > 5% from previous run: FLAG as regression
|
|
148
|
+
- If any previously-passing test case now fails: FLAG as regression
|
|
149
|
+
- If new failure modes appear that didn't exist before: FLAG as regression
|
|
150
|
+
- Regressions block shipping until investigated
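
The three regression rules above could be checked with a comparison like the following. The `RunSummary` shape is an assumption, and the 5% drop is interpreted here as 5 percentage points of pass@1 (0.05 absolute); a relative drop would need a different comparison.

```typescript
// Sketch: compares two run summaries and returns human-readable regression flags.
interface RunSummary {
  passAt1: number;                 // 0..1
  passed: Record<string, boolean>; // test_case_id -> pass/fail
  failureModes: string[];          // labels from failure-mode clustering
}

function detectRegressions(
  previous: RunSummary,
  current: RunSummary,
  maxDrop: number = 0.05, // assumed to mean an absolute drop in pass@1
): string[] {
  const flags: string[] = [];
  if (previous.passAt1 - current.passAt1 > maxDrop) {
    flags.push(`pass@1 dropped from ${previous.passAt1} to ${current.passAt1}`);
  }
  for (const id of Object.keys(previous.passed)) {
    if (previous.passed[id] === true && current.passed[id] === false) {
      flags.push(`test case ${id} previously passed, now fails`);
    }
  }
  for (const mode of current.failureModes) {
    if (previous.failureModes.indexOf(mode) === -1) {
      flags.push(`new failure mode: ${mode}`);
    }
  }
  return flags; // an empty list means no regression was detected
}
```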
|
|
151
|
+
|
|
152
|
+
3. **Store results:**
|
|
153
|
+
- Append to results.jsonl (never overwrite)
|
|
154
|
+
- Update config.json with latest run metadata
|
|
155
|
+
- If regression detected: create `.mindforge/evals/[name]/REGRESSION.md`
|
|
156
|
+
|
|
157
|
+
4. **Report format:**
|
|
158
|
+
```
|
|
159
|
+
## Eval Report: [eval-name]
|
|
160
|
+
- Type: capability | regression | comparative
|
|
161
|
+
- Run: [run-id] at [timestamp]
|
|
162
|
+
- Test cases: N total, P passed, F failed
|
|
163
|
+
- pass@1: X% | pass@5: Y% | pass@10: Z%
|
|
164
|
+
- Threshold: pass@1 >= T% → [MET / NOT MET]
|
|
165
|
+
- Regressions: [none | list]
|
|
166
|
+
- Top failure modes: [list with counts]
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
## Self-check before task completion
|
|
170
|
+
|
|
171
|
+
Before marking a task done when this skill was active:
|
|
172
|
+
|
|
173
|
+
- [ ] Did I define success criteria BEFORE writing implementation code?
|
|
174
|
+
- [ ] Did I choose the appropriate grader type (code > model > human preference)?
|
|
175
|
+
- [ ] Did I track pass@k metrics (at minimum pass@1)?
|
|
176
|
+
- [ ] Did I run regression evals against previous results?
|
|
177
|
+
- [ ] Are results stored in `.mindforge/evals/[name]/results.jsonl`?
|
|
178
|
+
- [ ] If model-based grading: did I use temperature 0.0 and majority vote?
|
|
179
|
+
- [ ] Did I report failure modes, not just pass rates?
|
|
180
|
+
- [ ] Is the rubric explicit enough that another reviewer could grade independently?
|