mindforge-cc 10.0.2 → 10.7.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.mindforge/config.json +73 -2
- package/.mindforge/engine/autonomous/cross-iteration-bridge.md +96 -0
- package/.mindforge/engine/cost-tracking/budget-enforcer.md +68 -0
- package/.mindforge/engine/cost-tracking/router.md +58 -0
- package/.mindforge/engine/cost-tracking/token-ledger.md +77 -0
- package/.mindforge/engine/council/council-protocol.md +96 -0
- package/.mindforge/engine/council/council-templates.md +85 -0
- package/.mindforge/engine/council/synthesis-engine.md +71 -0
- package/.mindforge/engine/cross-model-eval.md +74 -0
- package/.mindforge/engine/instincts/capture-engine.md +63 -0
- package/.mindforge/engine/instincts/instinct-schema.md +76 -0
- package/.mindforge/engine/instincts/promotion-engine.md +77 -0
- package/.mindforge/engine/proactive/signal-detector.md +60 -0
- package/.mindforge/engine/proactive/suggestion-engine.md +100 -0
- package/.mindforge/engine/skills/composition.md +83 -0
- package/.mindforge/engine/skills/loader.md +16 -0
- package/.mindforge/personas/agent-architect.md +57 -0
- package/.mindforge/personas/agent-evaluator.md +162 -0
- package/.mindforge/personas/agent-memory-designer.md +157 -0
- package/.mindforge/personas/agent-ops-engineer.md +120 -0
- package/.mindforge/personas/agent-orchestrator.md +112 -0
- package/.mindforge/personas/ai-economist.md +57 -0
- package/.mindforge/personas/ai-safety-engineer.md +57 -0
- package/.mindforge/personas/analytics-engineer.md +57 -0
- package/.mindforge/personas/anti-pattern-hunter.md +61 -0
- package/.mindforge/personas/api-gateway-designer.md +132 -0
- package/.mindforge/personas/auth-engineer.md +112 -0
- package/.mindforge/personas/build-engineer.md +57 -0
- package/.mindforge/personas/business-analyst.md +56 -0
- package/.mindforge/personas/cache-architect.md +100 -0
- package/.mindforge/personas/causal-scientist.md +57 -0
- package/.mindforge/personas/cdn-architect.md +118 -0
- package/.mindforge/personas/change-agent.md +104 -0
- package/.mindforge/personas/code-narrator.md +52 -0
- package/.mindforge/personas/codegen-specialist.md +68 -0
- package/.mindforge/personas/communication-architect.md +102 -0
- package/.mindforge/personas/compliance-engineer.md +96 -0
- package/.mindforge/personas/consensus-engineer.md +116 -0
- package/.mindforge/personas/contract-tester.md +60 -192
- package/.mindforge/personas/cost-optimizer.md +71 -0
- package/.mindforge/personas/council-architect.md +66 -0
- package/.mindforge/personas/council-critic.md +67 -0
- package/.mindforge/personas/council-pragmatist.md +71 -0
- package/.mindforge/personas/council-skeptic.md +73 -0
- package/.mindforge/personas/data-architect.md +108 -0
- package/.mindforge/personas/data-mesh-architect.md +57 -0
- package/.mindforge/personas/data-pipeline-architect.md +120 -0
- package/.mindforge/personas/de-sloppifier.md +60 -0
- package/.mindforge/personas/debt-manager.md +66 -0
- package/.mindforge/personas/decision-architect.md +82 -51
- package/.mindforge/personas/deployment-captain.md +74 -0
- package/.mindforge/personas/design-system-lead.md +112 -0
- package/.mindforge/personas/dmux-orchestrator.md +75 -0
- package/.mindforge/personas/doc-auditor.md +84 -0
- package/.mindforge/personas/dx-engineer.md +96 -0
- package/.mindforge/personas/ecommerce-engineer.md +57 -0
- package/.mindforge/personas/edge-engineer.md +94 -0
- package/.mindforge/personas/edtech-architect.md +106 -0
- package/.mindforge/personas/embedding-architect.md +57 -0
- package/.mindforge/personas/environment-engineer.md +57 -0
- package/.mindforge/personas/eval-judge.md +55 -0
- package/.mindforge/personas/event-architect.md +102 -0
- package/.mindforge/personas/experiment-designer.md +138 -0
- package/.mindforge/personas/feature-store-engineer.md +57 -0
- package/.mindforge/personas/finops-analyst.md +66 -0
- package/.mindforge/personas/fintech-architect.md +57 -0
- package/.mindforge/personas/flutter-engineer.md +104 -0
- package/.mindforge/personas/gaming-engineer.md +57 -0
- package/.mindforge/personas/graphql-designer.md +73 -0
- package/.mindforge/personas/healthcare-engineer.md +57 -0
- package/.mindforge/personas/hiring-strategist.md +105 -0
- package/.mindforge/personas/hitl-architect.md +165 -0
- package/.mindforge/personas/i18n-architect.md +69 -0
- package/.mindforge/personas/instinct-curator.md +83 -0
- package/.mindforge/personas/iot-architect.md +105 -0
- package/.mindforge/personas/knowledge-curator.md +139 -0
- package/.mindforge/personas/knowledge-engineer.md +57 -0
- package/.mindforge/personas/lakehouse-architect.md +57 -0
- package/.mindforge/personas/llm-orchestrator.md +57 -0
- package/.mindforge/personas/logistics-architect.md +106 -0
- package/.mindforge/personas/market-analyst.md +53 -0
- package/.mindforge/personas/marketplace-engineer.md +105 -0
- package/.mindforge/personas/mcp-designer.md +54 -0
- package/.mindforge/personas/meeting-designer.md +104 -0
- package/.mindforge/personas/mentorship-lead.md +106 -0
- package/.mindforge/personas/migration-architect.md +57 -0
- package/.mindforge/personas/ml-ops-engineer.md +101 -0
- package/.mindforge/personas/mobile-architect.md +105 -0
- package/.mindforge/personas/mobile-security-engineer.md +106 -0
- package/.mindforge/personas/multi-model-bridge.md +86 -0
- package/.mindforge/personas/multi-tenancy-architect.md +71 -0
- package/.mindforge/personas/multimodal-engineer.md +57 -0
- package/.mindforge/personas/offline-specialist.md +105 -0
- package/.mindforge/personas/onboarding-navigator.md +63 -0
- package/.mindforge/personas/payments-engineer.md +135 -0
- package/.mindforge/personas/pipeline-engineer.md +115 -0
- package/.mindforge/personas/platform-engineer.md +97 -0
- package/.mindforge/personas/platform-lead.md +57 -0
- package/.mindforge/personas/privacy-engineer.md +57 -0
- package/.mindforge/personas/product-owner.md +56 -0
- package/.mindforge/personas/productivity-analyst.md +57 -0
- package/.mindforge/personas/prompt-architect.md +101 -0
- package/.mindforge/personas/proofreader.md +53 -0
- package/.mindforge/personas/pwa-architect.md +105 -0
- package/.mindforge/personas/quality-scorer.md +63 -0
- package/.mindforge/personas/react-native-engineer.md +106 -0
- package/.mindforge/personas/resilience-engineer.md +69 -0
- package/.mindforge/personas/rfc-architect.md +64 -0
- package/.mindforge/personas/saga-orchestrator.md +80 -0
- package/.mindforge/personas/secrets-engineer.md +57 -0
- package/.mindforge/personas/skill-smith.md +79 -0
- package/.mindforge/personas/sre-lead.md +107 -0
- package/.mindforge/personas/stream-engineer.md +57 -0
- package/.mindforge/personas/streaming-engineer.md +64 -0
- package/.mindforge/personas/swarm-templates.json +695 -38
- package/.mindforge/personas/system-designer.md +57 -0
- package/.mindforge/personas/team-coach.md +120 -0
- package/.mindforge/personas/tech-lead-coach.md +103 -0
- package/.mindforge/personas/technical-writer-lead.md +111 -0
- package/.mindforge/personas/threat-modeler.md +82 -0
- package/.mindforge/personas/vibe-checker.md +75 -0
- package/.mindforge/personas/worktree-manager.md +56 -0
- package/.mindforge/personas/zero-trust-engineer.md +113 -0
- package/.mindforge/skills/a11y-testing/SKILL.md +143 -0
- package/.mindforge/skills/agent-evaluation-framework/SKILL.md +227 -0
- package/.mindforge/skills/agent-introspection-debugging/SKILL.md +88 -0
- package/.mindforge/skills/agent-loops/SKILL.md +84 -0
- package/.mindforge/skills/agent-memory-design/SKILL.md +199 -0
- package/.mindforge/skills/agent-orchestration-patterns/SKILL.md +129 -0
- package/.mindforge/skills/agent-tool-selection/SKILL.md +204 -0
- package/.mindforge/skills/ai-agent-deployment/SKILL.md +176 -0
- package/.mindforge/skills/ai-cost-management/SKILL.md +57 -0
- package/.mindforge/skills/ai-safety-alignment/SKILL.md +53 -0
- package/.mindforge/skills/analytics-instrumentation/SKILL.md +172 -0
- package/.mindforge/skills/api-gateway-patterns/SKILL.md +177 -0
- package/.mindforge/skills/api-marketplace/SKILL.md +56 -0
- package/.mindforge/skills/api-versioning/SKILL.md +100 -0
- package/.mindforge/skills/app-store-deployment/SKILL.md +44 -0
- package/.mindforge/skills/architecture-tradeoff-analysis/SKILL.md +97 -0
- package/.mindforge/skills/audit-logging/SKILL.md +140 -0
- package/.mindforge/skills/auth-patterns/SKILL.md +148 -0
- package/.mindforge/skills/autonomous-agent-harness/SKILL.md +218 -0
- package/.mindforge/skills/autonomous-agents/SKILL.md +59 -0
- package/.mindforge/skills/autonomous-loops/SKILL.md +105 -0
- package/.mindforge/skills/build-system-optimization/SKILL.md +54 -0
- package/.mindforge/skills/build-vs-buy/SKILL.md +80 -0
- package/.mindforge/skills/bundle-optimization/SKILL.md +174 -0
- package/.mindforge/skills/business-analyst/SKILL.md +82 -0
- package/.mindforge/skills/caching-strategies/SKILL.md +132 -0
- package/.mindforge/skills/capacity-planning/SKILL.md +96 -0
- package/.mindforge/skills/causal-inference/SKILL.md +42 -0
- package/.mindforge/skills/cdn-optimization/SKILL.md +212 -0
- package/.mindforge/skills/change-management/SKILL.md +106 -0
- package/.mindforge/skills/chaos-engineering/SKILL.md +99 -0
- package/.mindforge/skills/ci-cd-pipeline/SKILL.md +118 -0
- package/.mindforge/skills/cli-design/SKILL.md +118 -0
- package/.mindforge/skills/code-generation-patterns/SKILL.md +92 -0
- package/.mindforge/skills/code-review-methodology/SKILL.md +180 -0
- package/.mindforge/skills/code-tour/SKILL.md +145 -0
- package/.mindforge/skills/codebase-onboarding/SKILL.md +95 -0
- package/.mindforge/skills/compliance-as-code/SKILL.md +195 -0
- package/.mindforge/skills/conflict-resolution/SKILL.md +87 -0
- package/.mindforge/skills/connection-pooling/SKILL.md +151 -0
- package/.mindforge/skills/container-security/SKILL.md +151 -0
- package/.mindforge/skills/context-engineering/SKILL.md +114 -0
- package/.mindforge/skills/continuous-learning/SKILL.md +84 -0
- package/.mindforge/skills/contract-testing/SKILL.md +85 -0
- package/.mindforge/skills/cost-aware-routing/SKILL.md +83 -0
- package/.mindforge/skills/cost-estimation/SKILL.md +82 -0
- package/.mindforge/skills/council/SKILL.md +68 -0
- package/.mindforge/skills/cqrs-event-sourcing/SKILL.md +95 -0
- package/.mindforge/skills/cross-platform-testing/SKILL.md +43 -0
- package/.mindforge/skills/data-governance/SKILL.md +42 -0
- package/.mindforge/skills/data-lakehouse/SKILL.md +42 -0
- package/.mindforge/skills/data-mesh/SKILL.md +42 -0
- package/.mindforge/skills/data-modeling/SKILL.md +107 -0
- package/.mindforge/skills/data-pipeline-design/SKILL.md +171 -0
- package/.mindforge/skills/data-privacy-engineering/SKILL.md +42 -0
- package/.mindforge/skills/database-performance/SKILL.md +174 -0
- package/.mindforge/skills/database-sharding-advanced/SKILL.md +206 -0
- package/.mindforge/skills/de-sloppify/SKILL.md +120 -0
- package/.mindforge/skills/defense-in-depth/SKILL.md +84 -0
- package/.mindforge/skills/delegation-patterns/SKILL.md +123 -0
- package/.mindforge/skills/dependency-management/SKILL.md +94 -0
- package/.mindforge/skills/deployment-workflow/SKILL.md +135 -0
- package/.mindforge/skills/design-system/SKILL.md +113 -0
- package/.mindforge/skills/developer-onboarding/SKILL.md +99 -0
- package/.mindforge/skills/developer-productivity-metrics/SKILL.md +59 -0
- package/.mindforge/skills/distributed-consensus/SKILL.md +141 -0
- package/.mindforge/skills/dmux-workflows/SKILL.md +141 -0
- package/.mindforge/skills/dns-architecture/SKILL.md +167 -0
- package/.mindforge/skills/doc-health-audit/SKILL.md +102 -0
- package/.mindforge/skills/ecommerce-architecture/SKILL.md +41 -0
- package/.mindforge/skills/edge-computing/SKILL.md +91 -0
- package/.mindforge/skills/edtech-platform/SKILL.md +41 -0
- package/.mindforge/skills/email-deliverability/SKILL.md +177 -0
- package/.mindforge/skills/embedding-systems/SKILL.md +55 -0
- package/.mindforge/skills/environment-management/SKILL.md +54 -0
- package/.mindforge/skills/error-handling-architecture/SKILL.md +118 -0
- package/.mindforge/skills/estimation-techniques/SKILL.md +113 -0
- package/.mindforge/skills/eval-harness/SKILL.md +180 -0
- package/.mindforge/skills/event-driven-architecture/SKILL.md +162 -0
- package/.mindforge/skills/experiment-design/SKILL.md +139 -0
- package/.mindforge/skills/experiment-platform/SKILL.md +43 -0
- package/.mindforge/skills/feature-engineering/SKILL.md +42 -0
- package/.mindforge/skills/feature-flag-management/SKILL.md +183 -0
- package/.mindforge/skills/fine-tuning-workflow/SKILL.md +189 -0
- package/.mindforge/skills/fintech-patterns/SKILL.md +41 -0
- package/.mindforge/skills/flutter-architecture/SKILL.md +42 -0
- package/.mindforge/skills/gaming-backend/SKILL.md +41 -0
- package/.mindforge/skills/git-workflow-design/SKILL.md +129 -0
- package/.mindforge/skills/graceful-degradation/SKILL.md +95 -0
- package/.mindforge/skills/graphql-patterns/SKILL.md +243 -0
- package/.mindforge/skills/guardrails-and-safety/SKILL.md +137 -0
- package/.mindforge/skills/healthcare-systems/SKILL.md +40 -0
- package/.mindforge/skills/hiring-engineering/SKILL.md +119 -0
- package/.mindforge/skills/human-in-the-loop-design/SKILL.md +234 -0
- package/.mindforge/skills/i18n-architecture/SKILL.md +147 -0
- package/.mindforge/skills/idempotency-patterns/SKILL.md +84 -0
- package/.mindforge/skills/incident-communication/SKILL.md +96 -0
- package/.mindforge/skills/incident-management/SKILL.md +97 -0
- package/.mindforge/skills/infrastructure-as-code/SKILL.md +98 -0
- package/.mindforge/skills/instinct-clustering/SKILL.md +190 -0
- package/.mindforge/skills/internal-developer-platform/SKILL.md +51 -0
- package/.mindforge/skills/iot-platform/SKILL.md +41 -0
- package/.mindforge/skills/k8s-deployment/SKILL.md +358 -0
- package/.mindforge/skills/knowledge-graphs/SKILL.md +56 -0
- package/.mindforge/skills/knowledge-sharing-systems/SKILL.md +112 -0
- package/.mindforge/skills/llm-cost-optimization/SKILL.md +198 -0
- package/.mindforge/skills/llm-orchestration/SKILL.md +56 -0
- package/.mindforge/skills/load-testing/SKILL.md +84 -0
- package/.mindforge/skills/logistics-optimization/SKILL.md +40 -0
- package/.mindforge/skills/market-researcher/SKILL.md +99 -0
- package/.mindforge/skills/marketplace-trust/SKILL.md +40 -0
- package/.mindforge/skills/mcp-server-patterns/SKILL.md +264 -0
- package/.mindforge/skills/media-streaming/SKILL.md +41 -0
- package/.mindforge/skills/meeting-architecture/SKILL.md +146 -0
- package/.mindforge/skills/mentoring-patterns/SKILL.md +77 -0
- package/.mindforge/skills/microservices-patterns/SKILL.md +83 -0
- package/.mindforge/skills/migration-platform/SKILL.md +61 -0
- package/.mindforge/skills/migration-strategies/SKILL.md +129 -0
- package/.mindforge/skills/ml-feature-store/SKILL.md +56 -0
- package/.mindforge/skills/ml-monitoring/SKILL.md +42 -0
- package/.mindforge/skills/mobile-performance/SKILL.md +44 -0
- package/.mindforge/skills/mobile-security/SKILL.md +45 -0
- package/.mindforge/skills/model-evaluation/SKILL.md +53 -0
- package/.mindforge/skills/monorepo-management/SKILL.md +100 -0
- package/.mindforge/skills/multi-llm-consult/SKILL.md +75 -0
- package/.mindforge/skills/multi-tenancy-patterns/SKILL.md +145 -0
- package/.mindforge/skills/multi-turn-conversation-design/SKILL.md +206 -0
- package/.mindforge/skills/multimodal-ai/SKILL.md +51 -0
- package/.mindforge/skills/mutation-testing/SKILL.md +97 -0
- package/.mindforge/skills/notification-system-design/SKILL.md +168 -0
- package/.mindforge/skills/observability-stack/SKILL.md +136 -0
- package/.mindforge/skills/offline-first-design/SKILL.md +43 -0
- package/.mindforge/skills/on-call-design/SKILL.md +111 -0
- package/.mindforge/skills/pagination-patterns/SKILL.md +230 -0
- package/.mindforge/skills/payment-integration/SKILL.md +176 -0
- package/.mindforge/skills/performance-reviews/SKILL.md +140 -0
- package/.mindforge/skills/platform-observability/SKILL.md +58 -0
- package/.mindforge/skills/platform-reliability/SKILL.md +52 -0
- package/.mindforge/skills/post-incident-learning/SKILL.md +96 -0
- package/.mindforge/skills/product-manager/SKILL.md +104 -0
- package/.mindforge/skills/progressive-web-app/SKILL.md +44 -0
- package/.mindforge/skills/prompt-engineering/SKILL.md +94 -0
- package/.mindforge/skills/proofreader/SKILL.md +158 -0
- package/.mindforge/skills/push-notification-architecture/SKILL.md +45 -0
- package/.mindforge/skills/python-performance/SKILL.md +183 -0
- package/.mindforge/skills/quality-audit/SKILL.md +171 -0
- package/.mindforge/skills/queue-design/SKILL.md +85 -0
- package/.mindforge/skills/rag-architecture/SKILL.md +176 -0
- package/.mindforge/skills/rate-limiting-design/SKILL.md +94 -0
- package/.mindforge/skills/react-native-patterns/SKILL.md +42 -0
- package/.mindforge/skills/react-performance/SKILL.md +229 -0
- package/.mindforge/skills/real-time-analytics/SKILL.md +42 -0
- package/.mindforge/skills/real-time-sync/SKILL.md +83 -0
- package/.mindforge/skills/responsive-native/SKILL.md +44 -0
- package/.mindforge/skills/responsive-patterns/SKILL.md +141 -0
- package/.mindforge/skills/rfc-pipeline/SKILL.md +114 -0
- package/.mindforge/skills/saas-multi-tenant/SKILL.md +41 -0
- package/.mindforge/skills/santa-method/SKILL.md +134 -0
- package/.mindforge/skills/search-implementation/SKILL.md +98 -0
- package/.mindforge/skills/secrets-platform/SKILL.md +56 -0
- package/.mindforge/skills/secrets-rotation/SKILL.md +173 -0
- package/.mindforge/skills/self-serve-infrastructure/SKILL.md +51 -0
- package/.mindforge/skills/serverless-patterns/SKILL.md +119 -0
- package/.mindforge/skills/skill-creator-meta/SKILL.md +146 -0
- package/.mindforge/skills/sprint-retrospective-facilitation/SKILL.md +112 -0
- package/.mindforge/skills/stakeholder-communication/SKILL.md +85 -0
- package/.mindforge/skills/state-management/SKILL.md +104 -0
- package/.mindforge/skills/stream-processing/SKILL.md +43 -0
- package/.mindforge/skills/streaming-architecture/SKILL.md +81 -0
- package/.mindforge/skills/supply-chain-security/SKILL.md +145 -0
- package/.mindforge/skills/synthetic-data-generation/SKILL.md +52 -0
- package/.mindforge/skills/system-design/SKILL.md +88 -0
- package/.mindforge/skills/team-topology-design/SKILL.md +107 -0
- package/.mindforge/skills/technical-debt-management/SKILL.md +86 -0
- package/.mindforge/skills/technical-interview-design/SKILL.md +98 -0
- package/.mindforge/skills/technical-leadership/SKILL.md +75 -0
- package/.mindforge/skills/technical-writing/SKILL.md +237 -0
- package/.mindforge/skills/technology-radar/SKILL.md +88 -0
- package/.mindforge/skills/testing-anti-patterns/SKILL.md +288 -0
- package/.mindforge/skills/threat-modeling/SKILL.md +109 -0
- package/.mindforge/skills/tool-design/SKILL.md +138 -0
- package/.mindforge/skills/typescript-advanced/SKILL.md +198 -0
- package/.mindforge/skills/using-git-worktrees/SKILL.md +139 -0
- package/.mindforge/skills/verification-loop/SKILL.md +97 -0
- package/.mindforge/skills/vibe-security/SKILL.md +165 -0
- package/.mindforge/skills/visual-regression-testing/SKILL.md +97 -0
- package/.mindforge/skills/websocket-patterns/SKILL.md +203 -0
- package/.mindforge/skills/writing-plans/SKILL.md +170 -0
- package/.mindforge/skills/writing-skills/SKILL.md +216 -0
- package/.mindforge/skills/zero-trust-architecture/SKILL.md +166 -0
- package/CHANGELOG.md +195 -0
- package/MINDFORGE.md +4 -4
- package/README.md +2 -2
- package/RELEASENOTES.md +66 -0
- package/bin/installer-core.js +1 -1
- package/bin/wizard/theme.js +2 -2
- package/docs/commands-reference.md +18 -1
- package/package.json +2 -2
- package/.mindforge/personas/data-privacy-engineer.md +0 -187
@@ -0,0 +1,180 @@
+---
+name: eval-harness
+version: 1.0.0
+min_mindforge_version: 10.0.4
+status: stable
+triggers: eval, evaluation, grading, pass at k, rubric, regression eval, capability eval, model judge, deterministic grading, LLM-as-judge, eval score, eval-driven, benchmark eval
+---
+
+# Skill — Eval Harness (Systematic Evaluation Framework)
+
+## When this skill activates
+When measuring, scoring, or validating system outputs against defined criteria.
+Use for capability evaluation (can the system do X?), regression evaluation (does
+a change break existing behavior?), or comparative evaluation (is version A better
+than version B?). The eval harness ensures you define success BEFORE implementing,
+not after.
+
+Core principle: **Define-before-code** — write evaluation criteria before writing
+the implementation they measure.
+
+## Mandatory actions when this skill is active
+
+### Before evaluation begins
+
+1. **Define the eval type:**
+   - **Capability eval**: Can the system perform task X at acceptable quality?
+   - **Regression eval**: Does this change preserve existing behavior?
+   - **Comparative eval**: Is output A better than output B on criteria C?
+
+2. **Write the eval config BEFORE implementation:**
+   ```
+   .mindforge/evals/[eval-name]/
+   ├── config.json       # eval metadata, parameters, thresholds
+   ├── rubric.md         # human-readable success criteria
+   ├── test-cases.json   # input/expected-output pairs
+   └── results.jsonl     # append-only results log
+   ```
+
+3. **Define success criteria in config.json:**
+   ```json
+   {
+     "name": "eval-name",
+     "type": "capability" | "regression" | "comparative",
+     "version": "1.0.0",
+     "created": "ISO-8601",
+     "thresholds": {
+       "pass_at_1": 0.8,
+       "pass_at_5": 0.95,
+       "pass_at_10": 0.99
+     },
+     "grader": "code" | "model" | "human",
+     "model_judge_config": {
+       "model": "claude-sonnet",
+       "rubric_path": "./rubric.md",
+       "temperature": 0.0
+     },
+     "test_case_count": 0,
+     "tags": []
+   }
+   ```
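The `"capability" | "regression" | "comparative"` entries in the config above read as a choice between allowed values rather than literal JSON. A minimal sketch of how a loader might check a parsed `config.json` against those choices follows. The names `EvalConfig`, `validateConfig` and `parseConfig` are hypothetical and not part of mindforge-cc, and the rule that thresholds lie in [0, 1] is an assumption drawn from the example values.

```typescript
// Hypothetical types mirroring the config.json fields shown above.
type EvalType = "capability" | "regression" | "comparative";
type GraderType = "code" | "model" | "human";

interface EvalConfig {
  name: string;
  type: EvalType;
  version: string;
  created: string; // ISO-8601 timestamp
  thresholds: { pass_at_1: number; pass_at_5: number; pass_at_10: number };
  grader: GraderType;
  model_judge_config?: { model: string; rubric_path: string; temperature: number };
  test_case_count: number;
  tags: string[];
}

const EVAL_TYPES: string[] = ["capability", "regression", "comparative"];
const GRADER_TYPES: string[] = ["code", "model", "human"];

// Returns a list of problems; an empty list means the checked fields look usable.
function validateConfig(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["config is not an object"];
  const cfg = raw as Record<string, unknown>;
  const problems: string[] = [];

  if (typeof cfg.type !== "string" || EVAL_TYPES.indexOf(cfg.type) === -1) {
    problems.push("type must be one of: " + EVAL_TYPES.join(", "));
  }
  if (typeof cfg.grader !== "string" || GRADER_TYPES.indexOf(cfg.grader) === -1) {
    problems.push("grader must be one of: " + GRADER_TYPES.join(", "));
  }
  // Assumption: every threshold is a fraction between 0 and 1.
  const thresholds = (cfg.thresholds ?? {}) as Record<string, unknown>;
  for (const key of ["pass_at_1", "pass_at_5", "pass_at_10"]) {
    const value = thresholds[key];
    if (typeof value !== "number" || value < 0 || value > 1) {
      problems.push(`thresholds.${key} must be a number in [0, 1]`);
    }
  }
  return problems;
}

// Throws with a readable message instead of letting a bad config reach a run.
// Only the fields checked in validateConfig are actually verified.
function parseConfig(raw: unknown): EvalConfig {
  const problems = validateConfig(raw);
  if (problems.length > 0) throw new Error(problems.join("; "));
  return raw as EvalConfig;
}
```

Calling `parseConfig(JSON.parse(text))` before a run would surface a typo such as `"grader": "llm"` as an error message rather than as an unexpected grading path later on.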
+
+4. **Write the rubric (rubric.md) with explicit scoring:**
+   - Each criterion gets a 1-5 scale with concrete examples at each level
+   - Define what a "pass" means (minimum score per criterion)
+   - Define what a "fail" looks like with specific examples
+   - Include edge cases that should be tested
+
+### During evaluation
+
+**Three Grader Types:**
+
+**1. Code-Based (Deterministic):**
+- Use when outputs have objectively verifiable properties
+- Write assertion functions that return PASS/FAIL with evidence
+- Examples: output matches regex, JSON schema validates, function returns expected value
+- No ambiguity — the grader is a function, not a judgment call
+- Always prefer code-based grading when possible (fastest, most reliable)
+
+```typescript
+// Example code grader
+function grade(output: string, expected: TestCase): GradeResult {
+  const parsed = JSON.parse(output);
+  return {
+    pass: parsed.status === expected.status && parsed.count >= expected.minCount,
+    evidence: `status=${parsed.status}, count=${parsed.count}`,
+    criterion: "structural-correctness"
+  };
+}
+```
+
+**2. Model-Based (LLM-as-Judge):**
+- Use when outputs require semantic understanding (prose quality, code correctness, reasoning)
+- Always provide the rubric in the judge prompt — never rely on implicit standards
+- Use temperature 0.0 for judge calls (determinism)
+- Run judge 3x per item and take majority vote (reduces noise)
+- Log the judge's reasoning alongside the score
+
+```
+Judge prompt structure:
+1. Task description (what was the system asked to do?)
+2. Rubric (what does good look like? what does bad look like?)
+3. The output to grade
+4. Instruction: score 1-5 per criterion, explain each score, give overall PASS/FAIL
+```
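The "run judge 3x and take majority vote" step above can be sketched as follows. This is an illustration under stated assumptions: the `JudgeVerdict` shape, the `Judge` function type and both function names are hypothetical, and a real judge would wrap an LLM API call that is not shown here.

```typescript
// One judge call's result. The shape is an assumption for this sketch.
interface JudgeVerdict {
  pass: boolean;
  scores: Record<string, number>; // criterion name -> score on the 1-5 scale
  reasoning: string;              // kept so it can be logged next to the score
}

// Any async function that grades one output; a real one would call an LLM API.
type Judge = (task: string, rubric: string, output: string) => Promise<JudgeVerdict>;

// Majority decision over an odd number of verdicts.
function tallyVotes(votes: JudgeVerdict[]): { pass: boolean; passVotes: number } {
  if (votes.length === 0 || votes.length % 2 === 0) {
    throw new Error("need an odd, non-zero number of votes to avoid ties");
  }
  const passVotes = votes.filter((vote) => vote.pass).length;
  return { pass: passVotes * 2 > votes.length, passVotes };
}

// Runs the judge `runs` times on the same item and returns the majority result
// together with every individual verdict, so the reasoning can be logged.
async function judgeWithMajority(
  judge: Judge,
  task: string,
  rubric: string,
  output: string,
  runs: number = 3
): Promise<{ pass: boolean; votes: JudgeVerdict[] }> {
  const votes: JudgeVerdict[] = [];
  for (let i = 0; i < runs; i++) {
    votes.push(await judge(task, rubric, output));
  }
  return { pass: tallyVotes(votes).pass, votes };
}
```

Keeping `tallyVotes` separate from the async wrapper makes the voting rule testable without any model call, and returning all verdicts preserves the per-run reasoning the list above asks to log.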
+
+**3. Human-Based (Flag for Review):**
+- Use when stakes are too high for automated judgment
+- Generate a review queue with: input, output, rubric, suggested-score
+- Human confirms or overrides the suggested score
+- Track inter-rater reliability if multiple humans review
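The skill does not say how inter-rater reliability should be tracked. One common choice for two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an assumption about how that could look, not something taken from the package; the function name and the label values are illustrative.

```typescript
// Cohen's kappa for two raters who each labelled the same items, in the same order.
// kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohensKappa(raterA: string[], raterB: string[]): number {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error("both raters must label the same non-empty set of items");
  }
  const n = raterA.length;
  const countsA = new Map<string, number>();
  const countsB = new Map<string, number>();
  let agreements = 0;

  for (let i = 0; i < n; i++) {
    const a = raterA[i];
    const b = raterB[i];
    if (a === b) agreements++;
    countsA.set(a, (countsA.get(a) ?? 0) + 1);
    countsB.set(b, (countsB.get(b) ?? 0) + 1);
  }

  const observed = agreements / n;
  // Chance agreement: probability both raters pick the same label independently.
  let expected = 0;
  countsA.forEach((countA, label) => {
    expected += (countA / n) * ((countsB.get(label) ?? 0) / n);
  });

  if (expected === 1) return 1; // both raters used one identical label throughout
  return (observed - expected) / (1 - expected);
}

// Two reviewers grading four queue items as PASS/FAIL.
const kappa = cohensKappa(
  ["PASS", "PASS", "FAIL", "FAIL"],
  ["PASS", "PASS", "FAIL", "PASS"]
);
```

A value near 1 indicates the rubric is being applied consistently, while a value near 0 suggests the reviewers agree no more often than chance, which usually points back at an underspecified rubric.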
+
+**pass@k Metrics:**
+- Generate k independent outputs for each test case
+- **pass@1**: Fraction of test cases where the first output passes
+- **pass@5**: Fraction where at least 1 of 5 outputs passes
+- **pass@10**: Fraction where at least 1 of 10 outputs passes
+- Formula: pass@k = 1 - C(n-c, k) / C(n, k) where n=total, c=correct
+- Always report pass@1 (baseline) and at least one higher-k metric
+- Use pass@1 for production readiness, pass@k for capability ceiling
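The formula above can be computed without large factorials. The sketch below reads `n` as the number of outputs sampled for one test case and `c` as the number of those that passed, which is the usual reading of this estimator but is an interpretation of the terse "n=total, c=correct". The function names are illustrative.

```typescript
// pass@k for a single test case: n outputs sampled, c of them passed, k <= n.
// Uses C(n-c, k) / C(n, k) = product over i in (n-c, n] of (1 - k / i).
function passAtK(n: number, c: number, k: number): number {
  if (!Number.isInteger(n) || !Number.isInteger(c) || !Number.isInteger(k)) {
    throw new Error("n, c and k must be integers");
  }
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new Error("require 1 <= k <= n and 0 <= c <= n");
  }
  // Fewer than k failures: every possible draw of k outputs contains a pass.
  if (n - c < k) return 1;

  let allFail = 1; // probability that a random draw of k outputs has no pass
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// Aggregate metric: the mean of the per-test-case estimates.
function meanPassAtK(results: { n: number; c: number }[], k: number): number {
  if (results.length === 0) throw new Error("no test cases");
  let sum = 0;
  for (const result of results) {
    sum += passAtK(result.n, result.c, k);
  }
  return sum / results.length;
}
```

For a test case with n = 10 samples of which c = 3 passed, pass@1 comes out as 0.3, the plain pass fraction, and pass@k rises towards 1 as k grows. Note that the estimator needs n at least as large as the largest k reported, so reporting pass@10 means sampling at least 10 outputs per test case.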
+
+**Result logging (results.jsonl):**
+```json
+{
+  "timestamp": "ISO-8601",
+  "test_case_id": "tc-001",
+  "input": "...",
+  "output": "...",
+  "grader": "code",
+  "scores": {"criterion_a": 4, "criterion_b": 5},
+  "pass": true,
+  "evidence": "...",
+  "latency_ms": 0,
+  "model_version": "...",
+  "run_id": "uuid"
+}
+```
+
+### After evaluation
+
+1. **Compute aggregate metrics:**
+   - Overall pass rate (pass@1, pass@5, pass@10)
+   - Per-criterion score distribution
+   - Failure mode clustering (what patterns cause failures?)
+   - Comparison to previous run (regression detection)
+
+2. **Regression detection logic:**
+   - If pass@1 drops > 5% from previous run: FLAG as regression
+   - If any previously-passing test case now fails: FLAG as regression
+   - If new failure modes appear that didn't exist before: FLAG as regression
+   - Regressions block shipping until investigated
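The three regression rules above translate directly into a comparison of two run summaries. In this sketch the `RunSummary` shape and the function name are hypothetical, the "5%" is read as 5 percentage points of pass@1 (the source does not say whether the drop is absolute or relative), and both runs are assumed to cover the same test cases.

```typescript
// Minimal per-run summary; the field names are assumptions for this sketch.
interface RunSummary {
  passAt1: number;          // fraction in [0, 1]
  passedCaseIds: string[];  // test cases that passed in this run
  failureModes: string[];   // labels produced by failure-mode clustering
}

interface RegressionReport {
  regressed: boolean;
  reasons: string[];
}

function detectRegression(
  previous: RunSummary,
  current: RunSummary,
  maxDrop: number = 0.05 // assumption: absolute drop in pass@1
): RegressionReport {
  const reasons: string[] = [];

  // Rule 1: pass@1 dropped by more than the allowed margin.
  const drop = previous.passAt1 - current.passAt1;
  if (drop > maxDrop) {
    reasons.push(`pass@1 dropped by ${(drop * 100).toFixed(1)} points`);
  }

  // Rule 2: a test case that used to pass no longer does.
  const nowPassing = new Set(current.passedCaseIds);
  const newlyFailing = previous.passedCaseIds.filter((id) => !nowPassing.has(id));
  if (newlyFailing.length > 0) {
    reasons.push(`previously passing cases now fail: ${newlyFailing.join(", ")}`);
  }

  // Rule 3: a failure mode appeared that the previous run did not have.
  const knownModes = new Set(previous.failureModes);
  const newModes = current.failureModes.filter((mode) => !knownModes.has(mode));
  if (newModes.length > 0) {
    reasons.push(`new failure modes: ${newModes.join(", ")}`);
  }

  return { regressed: reasons.length > 0, reasons };
}
```

Returning the reasons rather than a bare boolean gives the investigation a starting point. Note that the second rule can fire even when pass@1 improves overall, which matches the skill's "any previously-passing test case" wording.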
+
+3. **Store results:**
+   - Append to results.jsonl (never overwrite)
+   - Update config.json with latest run metadata
+   - If regression detected: create `.mindforge/evals/[name]/REGRESSION.md`
+
+4. **Report format:**
+   ```
+   ## Eval Report: [eval-name]
+   - Type: capability | regression | comparative
+   - Run: [run-id] at [timestamp]
+   - Test cases: N total, P passed, F failed
+   - pass@1: X% | pass@5: Y% | pass@10: Z%
+   - Threshold: pass@1 >= T% → [MET / NOT MET]
+   - Regressions: [none | list]
+   - Top failure modes: [list with counts]
+   ```
+
+## Self-check before task completion
+
+Before marking a task done when this skill was active:
+
+- [ ] Did I define success criteria BEFORE writing implementation code?
+- [ ] Did I choose the appropriate grader type (code > model > human preference)?
+- [ ] Did I track pass@k metrics (at minimum pass@1)?
+- [ ] Did I run regression evals against previous results?
+- [ ] Are results stored in `.mindforge/evals/[name]/results.jsonl`?
+- [ ] If model-based grading: did I use temperature 0.0 and majority vote?
+- [ ] Did I report failure modes, not just pass rates?
+- [ ] Is the rubric explicit enough that another reviewer could grade independently?
@@ -0,0 +1,162 @@
+---
+name: event-driven-architecture
+version: 1.0.0
+min_mindforge_version: 0.1.0
+status: stable
+triggers: event driven architecture, event bus, pub sub pattern, event schema design, ordering guarantee, exactly once delivery, dead letter topic, event sourcing integration, event catalog, event versioning strategy, event replay strategy, event consumer group
+---
+
+# Skill — Event-Driven Architecture
+
+## When this skill activates
+Any task involving event bus design, pub/sub patterns, message ordering,
+delivery guarantees, dead letter handling, or event schema evolution.
+
+## Mandatory actions when this skill is active
+
+### Before writing any code
+1. Classify event types (domain, integration, or command events).
+2. Define delivery guarantees required for each event stream.
+3. Design the event schema with forward/backward compatibility in mind.
+
+### During implementation
+- Make all consumers idempotent (safe to process same event multiple times).
+- Implement dead letter topic handling with alerting.
+- Use partition keys to maintain ordering where required.
+
+### After implementation
+- Register events in the event catalog with schema and owner.
+- Add consumer lag monitoring.
+- Document retry and failure handling in ARCHITECTURE.md.
+
+## Event Types
+
+### Domain Events
+- Facts about what happened in a bounded context.
+- Named in past tense: `OrderPlaced`, `PaymentProcessed`, `UserRegistered`.
+- Owned by the producing domain — consumers must adapt.
+- Immutable once published.
+
+### Integration Events
+- Cross-boundary communication between services.
+- May be transformed from domain events (different schema, less detail).
+- Published on shared event bus (Kafka, SNS, EventBridge).
+
+### Command Events
+- Request for action (not a fact).
+- Named as imperative: `ProcessPayment`, `SendNotification`.
+- Exactly one consumer expected to handle.
+- Requires acknowledgment/response.
+
+## Delivery Guarantees
+
+### At-Most-Once
+- Fire and forget. No retries.
+- Use for: metrics, analytics, non-critical notifications.
+- Risk: message loss on failure.
+
+### At-Least-Once (Recommended Default)
+- Retry until acknowledged.
+- Consumers MUST be idempotent.
+- Use for: most business events.
+- Risk: duplicate processing (mitigated by idempotency).
+
+### Exactly-Once (Expensive)
+- Requires transactional outbox + deduplication.
+- Use for: financial transactions, inventory changes.
+- Implementation: idempotency key + processed event log.
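The "idempotency key + processed event log" idea above can be sketched as a wrapper around a consumer's handler. Everything named here is hypothetical: the `EventEnvelope` shape, the `ProcessedEventLog` class and `makeIdempotent` are illustrations, and the in-memory set stands in for a durable store such as a database table.

```typescript
// Envelope carried by every event; the producer is assumed to set a key that
// is unique per logical event and stays the same across redeliveries.
interface EventEnvelope<T> {
  idempotencyKey: string;
  type: string;
  payload: T;
}

// In-memory stand-in for a durable processed-event log.
class ProcessedEventLog {
  private seen = new Set<string>();

  has(key: string): boolean {
    return this.seen.has(key);
  }

  record(key: string): void {
    this.seen.add(key);
  }
}

// Wraps a handler so that a redelivered event is acknowledged without
// running its side effects a second time.
function makeIdempotent<T>(
  log: ProcessedEventLog,
  handler: (event: EventEnvelope<T>) => void
): (event: EventEnvelope<T>) => "processed" | "duplicate" {
  return (event) => {
    if (log.has(event.idempotencyKey)) return "duplicate";
    handler(event);
    // Recorded only after the handler succeeds, so a failed attempt is retried.
    log.record(event.idempotencyKey);
    return "processed";
  };
}

// Example: the same PaymentProcessed event delivered twice debits once.
let totalDebited = 0;
const paymentLog = new ProcessedEventLog();
const onPayment = makeIdempotent<{ amount: number }>(paymentLog, (event) => {
  totalDebited += event.payload.amount;
});
const payment = { idempotencyKey: "evt-42", type: "PaymentProcessed", payload: { amount: 25 } };
const firstDelivery = onPayment(payment);
const secondDelivery = onPayment(payment);
```

In a real system the side effect and the log write would need to commit in the same transaction. Otherwise a crash between the two lines leaves the effect applied but unrecorded, and the retry repeats it.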
|
|
68
|
+
|
|
69
|
+
## Ordering Guarantees
|
|
70
|
+
|
|
71
|
+
### Per-Partition Ordering
|
|
72
|
+
- Events with the same partition key are ordered.
|
|
73
|
+
- Partition key = entity ID (e.g., order_id, user_id).
|
|
74
|
+
- Different entities may be processed out of order (acceptable).
|
|
75
|
+
|
|
76
|
+
### Global Ordering
|
|
77
|
+
- Extremely expensive — single partition = no parallelism.
|
|
78
|
+
- Almost never needed — design around per-entity ordering instead.
|
|
79
|
+
|
|
80
|
+
### Kafka Partition Key Design
|
|
81
|
+
```
|
|
82
|
+
topic: order-events
|
|
83
|
+
partition_key: order_id
|
|
84
|
+
result: all events for order-123 arrive in sequence
|
|
85
|
+
```
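The key-to-partition mapping can be sketched as a stable hash modulo the partition count. This is an illustration only: `partition_for` is a hypothetical helper, and CRC32 is a stand-in hash chosen for the sketch (Kafka clients use their own partitioner hash).

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("order-123", "OrderPlaced"),
    ("order-456", "OrderPlaced"),
    ("order-123", "PaymentProcessed"),
]
# Every event carrying the same order_id lands on the same partition,
# so its events keep their relative order.
assignments = [(key, name, partition_for(key, 6)) for key, name in events]
```

Because the mapping depends on the partition count, changing the number of partitions moves keys to different partitions, which is worth planning for up front.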
|
|
86
|
+
|
|
87
|
+
## Schema Evolution
|
|
88
|
+
|
|
89
|
+
### Compatibility Modes (Avro/Protobuf)
|
|
90
|
+
- **Backward compatible**: new schema can read old data (add optional fields).
|
|
91
|
+
- **Forward compatible**: old schema can read new data (ignore unknown fields).
|
|
92
|
+
- **Full compatible**: both directions (safest, most restrictive).
|
|
93
|
+
|
|
94
|
+
### Rules for Safe Evolution
|
|
95
|
+
- Adding optional fields: always safe.
|
|
96
|
+
- Removing fields: only if no consumers depend on them.
|
|
97
|
+
- Renaming fields: treat as remove + add (breaking).
|
|
98
|
+
- Changing field types: treat as breaking (a few promotions, such as Avro `int` to `long`, are exceptions).
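The rules above can be sketched as a conservative check between two schema versions. This is not how a schema registry implements compatibility; the schema representation (field name mapped to a `(type, required)` tuple) and the `breaking_changes` helper are assumptions made for the example.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """List changes that the evolution rules treat as unsafe or needing review."""
    problems = []
    for name, (ftype, required) in new.items():
        if name not in old:
            if required:  # adding an optional field is safe, a required one is not
                problems.append(f"added required field: {name}")
        elif old[name][0] != ftype:
            problems.append(f"type changed: {name}")
    for name in old:
        if name not in new:
            # Removal is only safe if no consumer depends on the field,
            # so it is always flagged for review here.
            problems.append(f"removed field: {name}")
    return problems


v1 = {"order_id": ("string", True), "total": ("int", True)}
v2 = {**v1, "coupon": ("string", False)}  # adds an optional field
v3 = {"order_id": ("string", True), "amount": ("int", True)}  # rename of total

safe = breaking_changes(v1, v2)
rename = breaking_changes(v1, v3)
```

A rename shows up as one added field and one removed field, matching the "remove + add" rule.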
|
|
99
|
+
|
|
100
|
+
### Schema Registry
|
|
101
|
+
- Central registry of all event schemas with version history.
|
|
102
|
+
- Validates compatibility before allowing schema updates.
|
|
103
|
+
- Consumers reference the schema by ID (embedded in the message header or payload prefix).
|
|
104
|
+
|
|
105
|
+
## Consumer Groups
|
|
106
|
+
|
|
107
|
+
### Competing Consumers (Scaling Pattern)
|
|
108
|
+
- Multiple instances in the same group share the load.
|
|
109
|
+
- Each message is processed by exactly one instance in the group.
|
|
110
|
+
- Use for: order processing, notification sending.
|
|
111
|
+
- Scale by adding more consumers (up to partition count).
|
|
112
|
+
|
|
113
|
+
### Broadcasting (Fan-Out Pattern)
|
|
114
|
+
- Each consumer group gets every message.
|
|
115
|
+
- Use for: audit logging, cache invalidation, analytics.
|
|
116
|
+
- Different groups process independently at their own pace.
|
|
117
|
+
|
|
118
|
+
## Dead Letter Topics (DLT)
|
|
119
|
+
|
|
120
|
+
### Flow
|
|
121
|
+
```
|
|
122
|
+
message → consumer → FAIL → retry (up to 3 attempts, with backoff) → FAIL → DLT → alert
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
### Requirements
|
|
126
|
+
- Every consumer MUST have a DLT configured.
|
|
127
|
+
- DLT messages retain full context (original message + error + attempt count).
|
|
128
|
+
- Alert on first DLT message (don't silently accumulate).
|
|
129
|
+
- Manual resolution workflow: inspect → fix → replay or discard.
|
|
130
|
+
|
|
131
|
+
### Retry Strategy
|
|
132
|
+
- Attempt 1: immediate.
|
|
133
|
+
- Attempt 2: 1 second delay.
|
|
134
|
+
- Attempt 3: 10 second delay.
|
|
135
|
+
- After 3 failures: route to DLT.
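The retry schedule above can be sketched as a small wrapper around a handler. The names `consume_with_retry`, `RETRY_DELAYS`, and the shape of the dead-letter record are hypothetical; the `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time

RETRY_DELAYS = [0, 1, 10]  # seconds to wait before attempts 1, 2, and 3


def consume_with_retry(message, handler, dead_letters, sleep=time.sleep):
    """Try the handler up to three times, then route the message to the DLT."""
    last_error = None
    for delay in RETRY_DELAYS:
        if delay:
            sleep(delay)
        try:
            handler(message)
            return True
        except Exception as exc:  # broad catch is deliberate in this sketch
            last_error = repr(exc)
    # Keep full context: original message, last error, and attempt count.
    dead_letters.append(
        {"message": message, "error": last_error, "attempts": len(RETRY_DELAYS)}
    )
    return False


def always_fails(message):
    raise ValueError("bad payload")


dlt = []
waits = []  # records requested delays instead of sleeping
ok = consume_with_retry({"order_id": "order-123"}, always_fails, dlt, sleep=waits.append)
```

In a real consumer the DLT append would be a publish to the dead letter topic, followed by the alert the requirements call for.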
|
|
136
|
+
|
|
137
|
+
## Event Catalog
|
|
138
|
+
|
|
139
|
+
Every event in the system must be registered:
|
|
140
|
+
|
|
141
|
+
| Field | Description |
|
|
142
|
+
|-------|-------------|
|
|
143
|
+
| Event name | `OrderPlaced` |
|
|
144
|
+
| Schema version | `v3` |
|
|
145
|
+
| Owner (team) | Order Service team |
|
|
146
|
+
| Producers | order-service |
|
|
147
|
+
| Consumers | notification-svc, analytics-svc, fulfillment-svc |
|
|
148
|
+
| Partition key | order_id |
|
|
149
|
+
| Delivery guarantee | at-least-once |
|
|
150
|
+
| Retention | 7 days |
|
|
151
|
+
|
|
152
|
+
## Self-check before task completion
|
|
153
|
+
|
|
154
|
+
Before marking a task done when this skill was active:
|
|
155
|
+
|
|
156
|
+
- [ ] Did I read the full SKILL.md before starting? (Not just the triggers)
|
|
157
|
+
- [ ] Are all consumers idempotent?
|
|
158
|
+
- [ ] Is ordering guaranteed per entity via partition keys?
|
|
159
|
+
- [ ] Is a dead letter topic configured with alerting?
|
|
160
|
+
- [ ] Are event schemas registered in the catalog?
|
|
161
|
+
- [ ] Is schema evolution backward-compatible?
|
|
162
|
+
- [ ] Are consumer groups configured correctly (competing vs broadcasting)?
|
|
@@ -0,0 +1,139 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: experiment-design
|
|
3
|
+
version: 1.0.0
|
|
4
|
+
min_mindforge_version: 10.0.4
|
|
5
|
+
status: stable
|
|
6
|
+
triggers: experiment design, A/B testing architecture, statistical significance, sample size calculator, guardrail metric, experiment lifecycle, hypothesis testing, control variant, experiment analysis, metric sensitivity, experiment duration, randomization unit
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Skill — Experiment Design (Rigorous A/B Testing Architecture)
|
|
10
|
+
|
|
11
|
+
## When this skill activates
|
|
12
|
+
When designing, planning, or analyzing A/B tests, multivariate experiments, or
|
|
13
|
+
any controlled experiment that requires statistical rigor. Use for feature rollout
|
|
14
|
+
decisions, conversion optimization, pricing tests, or any scenario where you need
|
|
15
|
+
to measure the causal impact of a change.
|
|
16
|
+
|
|
17
|
+
Core principle: **Hypothesis-first** — never launch an experiment without a written
|
|
18
|
+
hypothesis that specifies the expected effect direction, magnitude, and mechanism.
|
|
19
|
+
|
|
20
|
+
## Mandatory actions when this skill is active
|
|
21
|
+
|
|
22
|
+
### Before experiment begins
|
|
23
|
+
|
|
24
|
+
1. **Write the hypothesis in structured format:**
|
|
25
|
+
```
|
|
26
|
+
If we [change X],
|
|
27
|
+
then [metric Y] will [improve/degrade] by [Z amount]
|
|
28
|
+
because [causal mechanism].
|
|
29
|
+
```
|
|
30
|
+
- The hypothesis must be falsifiable
|
|
31
|
+
- The expected effect size must be realistic (based on prior data or industry benchmarks)
|
|
32
|
+
- The causal mechanism must be articulated (not just "it will be better")
|
|
33
|
+
|
|
34
|
+
2. **Calculate sample size:**
|
|
35
|
+
```
|
|
36
|
+
Inputs:
|
|
37
|
+
- Baseline conversion rate (current metric value)
|
|
38
|
+
- Minimum Detectable Effect (MDE): smallest improvement worth detecting
|
|
39
|
+
- Statistical significance level (alpha): typically 0.05
|
|
40
|
+
- Statistical power (1-beta): typically 0.80
|
|
41
|
+
- Number of variants (control + treatments)
|
|
42
|
+
|
|
43
|
+
Output:
|
|
44
|
+
- Required sample size per variant
|
|
45
|
+
- Estimated duration = required_N / daily_traffic_per_variant
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
Rules:
|
|
49
|
+
- MDE should be the smallest PRACTICALLY significant effect (not just statistically significant)
|
|
50
|
+
- If duration > 8 weeks: increase MDE or find a higher-traffic surface
|
|
51
|
+
- Never compromise on power — underpowered experiments waste everyone's time
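As a rough illustration of the calculation, the sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. It assumes exactly two variants and an absolute MDE; the function names and the example traffic figure are hypothetical, and more than two variants would also need a multiple-comparison adjustment that is not shown.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-variant N for a two-sided, two-proportion z-test."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)  # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def duration_days(required_n, daily_traffic_per_variant):
    """Estimated duration = required_N / daily_traffic_per_variant, rounded up."""
    return math.ceil(required_n / daily_traffic_per_variant)


n = sample_size_per_variant(baseline=0.10, mde_abs=0.02)
days = duration_days(n, daily_traffic_per_variant=500)
```

Halving the MDE roughly quadruples the required sample, which is why an overly ambitious MDE is the usual cause of experiments that would run past the 8-week limit.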
|
|
52
|
+
|
|
53
|
+
3. **Define guardrail metrics:**
|
|
54
|
+
```json
|
|
55
|
+
{
|
|
56
|
+
"primary_metric": "conversion_rate",
|
|
57
|
+
"secondary_metrics": ["revenue_per_user", "engagement_time"],
|
|
58
|
+
"guardrail_metrics": [
|
|
59
|
+
{"name": "page_load_time_p95", "threshold": "+200ms", "action": "stop"},
|
|
60
|
+
{"name": "error_rate", "threshold": "+0.5%", "action": "stop"},
|
|
61
|
+
{"name": "revenue_per_session", "threshold": "-2%", "action": "alert"}
|
|
62
|
+
]
|
|
63
|
+
}
|
|
64
|
+
```
|
|
65
|
+
- Guardrails are metrics that MUST NOT degrade beyond threshold
|
|
66
|
+
- Violation of a guardrail = experiment stopped regardless of primary metric
|
|
67
|
+
- Always include: performance, error rate, and revenue as guardrails
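A guardrail check over a config like the one above might look as follows. The threshold grammar (sign, number, `ms` or `%`) is inferred from the three example entries, and treating `%` as a relative change against control is an assumption; the metric values and function names are hypothetical.

```python
import re


def parse_threshold(text):
    """Split a threshold such as '+200ms' or '-2%' into (signed value, unit)."""
    match = re.fullmatch(r"([+-])(\d+(?:\.\d+)?)(ms|%)", text)
    if match is None:
        raise ValueError(f"unsupported threshold: {text}")
    sign, value, unit = match.groups()
    return (1 if sign == "+" else -1) * float(value), unit


def evaluate_guardrails(guardrails, control, treatment):
    """Return (metric, action) pairs for every breached guardrail."""
    actions = []
    for rail in guardrails:
        limit, unit = parse_threshold(rail["threshold"])
        base, new = control[rail["name"]], treatment[rail["name"]]
        # Assumption: '%' means relative change, 'ms' means absolute change.
        delta = (new - base) / base * 100 if unit == "%" else new - base
        breached = delta > limit if limit > 0 else delta < limit
        if breached:
            actions.append((rail["name"], rail["action"]))
    return actions


guardrails = [
    {"name": "page_load_time_p95", "threshold": "+200ms", "action": "stop"},
    {"name": "error_rate", "threshold": "+0.5%", "action": "stop"},
    {"name": "revenue_per_session", "threshold": "-2%", "action": "alert"},
]
control = {"page_load_time_p95": 1200, "error_rate": 0.010, "revenue_per_session": 2.00}
treatment = {"page_load_time_p95": 1450, "error_rate": 0.010, "revenue_per_session": 1.99}
breaches = evaluate_guardrails(guardrails, control, treatment)
```

This compares point estimates only; a production check would usually also require the movement to be statistically distinguishable from noise before stopping an experiment.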
|
|
68
|
+
|
|
69
|
+
4. **Choose randomization unit:**
|
|
70
|
+
- **User-level**: Default for most experiments (consistent experience across sessions)
|
|
71
|
+
- **Session-level**: For UI experiments where cross-session contamination is acceptable
|
|
72
|
+
- **Page-level**: Only for layout experiments with no carryover effects
|
|
73
|
+
- **Device-level**: When logged-out users are significant traffic
|
|
74
|
+
- Rule: the randomization unit must be at least as coarse as the analysis unit (never analyze at user level if randomized at page level)
|
|
75
|
+
|
|
76
|
+
### During experiment
|
|
77
|
+
|
|
78
|
+
1. **Minimum duration rules:**
|
|
79
|
+
- Run for at least 1 full business cycle (typically 7 days minimum)
|
|
80
|
+
- Recommended: 2 full weeks to capture weekday/weekend variation
|
|
81
|
+
- NEVER stop early because results "look significant" (peeking problem)
|
|
82
|
+
- If using sequential testing: define stopping rules BEFORE launch
|
|
83
|
+
|
|
84
|
+
2. **Monitoring protocol:**
|
|
85
|
+
- Check guardrail metrics daily
|
|
86
|
+
- Do NOT check primary metric significance until planned end date
|
|
87
|
+
- If peeking is necessary: use group sequential methods with alpha spending
|
|
88
|
+
- Log any system issues that may contaminate results (outages, bugs, other launches)
|
|
89
|
+
|
|
90
|
+
3. **Sample Ratio Mismatch (SRM) check:**
|
|
91
|
+
- Verify variant assignment is balanced (chi-square test, p < 0.001 = SRM)
|
|
92
|
+
- SRM invalidates the experiment — do not trust results
|
|
93
|
+
- Common causes: bot filtering, redirect failures, bucketing bugs
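For a two-variant experiment the SRM check reduces to a chi-square goodness-of-fit test with one degree of freedom, whose p-value can be computed with the standard library alone. The sketch below assumes two variants and a planned split given by `expected_ratio`; the function names are hypothetical.

```python
import math


def srm_p_value(control_n, treatment_n, expected_ratio=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom)."""
    total = control_n + treatment_n
    expected_control = total * expected_ratio
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_n, treatment_n, expected_ratio=0.5, threshold=0.001):
    """Flag a sample ratio mismatch when p falls below the threshold."""
    return srm_p_value(control_n, treatment_n, expected_ratio) < threshold


balanced = has_srm(5000, 5050)
skewed = has_srm(5000, 5400)
```

With more than two variants the same statistic applies, but the p-value then needs a chi-square distribution with more degrees of freedom, which this shortcut does not cover.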
|
|
94
|
+
|
|
95
|
+
### After experiment (analysis)
|
|
96
|
+
|
|
97
|
+
1. **Statistical analysis checklist:**
|
|
98
|
+
- [ ] Confirm no SRM
|
|
99
|
+
- [ ] Check primary metric: p-value < 0.05 AND confidence interval excludes 0
|
|
100
|
+
- [ ] Check practical significance: is the effect size large enough to matter?
|
|
101
|
+
- [ ] Check guardrail metrics: no violations
|
|
102
|
+
- [ ] Check segment consistency: does the effect hold across key segments?
|
|
103
|
+
- [ ] Check novelty/primacy effects: is the effect stable over time?
|
|
104
|
+
|
|
105
|
+
2. **Decision framework:**
|
|
106
|
+
```
|
|
107
|
+
IF p < 0.05 AND practical significance AND no guardrail violations:
|
|
108
|
+
→ SHIP (roll out to 100%)
|
|
109
|
+
IF p < 0.05 BUT guardrail violation:
|
|
110
|
+
→ ITERATE (fix guardrail issue, re-run)
|
|
111
|
+
IF p >= 0.05 AND confidence interval includes meaningful effects:
|
|
112
|
+
→ EXTEND (underpowered, run longer or increase traffic)
|
|
113
|
+
IF p >= 0.05 AND confidence interval excludes meaningful effects:
|
|
114
|
+
→ KILL (the change doesn't work, move on)
|
|
115
|
+
```
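The four rules can be written as a small function. This is a literal transcription under stated assumptions: a positive effect is an improvement, "practical significance" means the point estimate meets a minimum meaningful effect, and "confidence interval includes meaningful effects" means the upper bound reaches that minimum. The parameter names are hypothetical, and the `UNSPECIFIED` outcome marks a case the framework above does not cover (significant but too small to matter).

```python
def decide(p_value, effect, ci_high, min_effect, guardrail_violation, alpha=0.05):
    """Map analysis results to SHIP / ITERATE / EXTEND / KILL."""
    significant = p_value < alpha
    if significant and guardrail_violation:
        return "ITERATE"  # fix the guardrail issue, then re-run
    if significant and effect >= min_effect:
        return "SHIP"  # roll out to 100%
    if not significant and ci_high >= min_effect:
        return "EXTEND"  # underpowered: a meaningful effect is still plausible
    if not significant:
        return "KILL"  # meaningful effects are ruled out
    return "UNSPECIFIED"  # significant but below the practical threshold


ship = decide(0.01, effect=0.03, ci_high=0.05, min_effect=0.02, guardrail_violation=False)
extend = decide(0.20, effect=0.01, ci_high=0.03, min_effect=0.02, guardrail_violation=False)
```

Writing the rules as code makes the uncovered case visible, so a team can decide in advance how to handle it.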
|
|
116
|
+
|
|
117
|
+
3. **Document the result:**
|
|
118
|
+
```markdown
|
|
119
|
+
## Experiment Result: [name]
|
|
120
|
+
- Hypothesis: [statement]
|
|
121
|
+
- Duration: [days] | Sample: [N per variant]
|
|
122
|
+
- Primary metric: [baseline] → [variant] ([+/-X%], p=[value])
|
|
123
|
+
- Guardrails: [all clear / violations]
|
|
124
|
+
- Decision: SHIP / ITERATE / EXTEND / KILL
|
|
125
|
+
- Learning: [what did we learn about user behavior?]
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
## Self-check before task completion
|
|
129
|
+
|
|
130
|
+
Before marking a task done when this skill was active:
|
|
131
|
+
|
|
132
|
+
- [ ] Did I write a structured hypothesis with expected effect size and mechanism?
|
|
133
|
+
- [ ] Did I calculate required sample size based on MDE and baseline?
|
|
134
|
+
- [ ] Did I define guardrail metrics with explicit thresholds?
|
|
135
|
+
- [ ] Did I choose an appropriate randomization unit?
|
|
136
|
+
- [ ] Did I set minimum duration (>= 1 business cycle)?
|
|
137
|
+
- [ ] Did I plan for the peeking problem (no early stopping without sequential testing)?
|
|
138
|
+
- [ ] Did I document the decision framework (ship/iterate/extend/kill)?
|
|
139
|
+
- [ ] Is the experiment design reproducible by another engineer?
|
|
@@ -0,0 +1,43 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: experiment-platform
|
|
3
|
+
version: 1.0.0
|
|
4
|
+
min_mindforge_version: 10.6.0
|
|
5
|
+
status: stable
|
|
6
|
+
triggers: experiment platform design, experimentation infrastructure, statistical rigor experiment, guardrail metric design, experiment velocity, feature flag experiment, experiment analysis automation, sample size calculation, multi-variant testing, experiment platform lifecycle, experiment review process, sequential testing
|
|
7
|
+
compose: experiment-design
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
# Skill — Experiment Platform
|
|
11
|
+
|
|
12
|
+
## When this skill activates
|
|
13
|
+
This skill activates when building experimentation infrastructure, implementing statistical testing frameworks, or designing A/B testing platforms. Use when organizations need to scale testing velocity while maintaining statistical rigor.
|
|
14
|
+
|
|
15
|
+
## Mandatory actions when this skill is active
|
|
16
|
+
|
|
17
|
+
### Before writing any code
|
|
18
|
+
1. Define experiment framework components: randomization service, exposure logging, metric computation pipeline, and statistical analysis engine
|
|
19
|
+
2. Establish statistical rigor standards: minimum sample size, power (80%+), significance level (5%), minimum detectable effect, and multiple comparison corrections
|
|
20
|
+
3. Design guardrail metrics framework: business health (revenue, retention), user experience (latency, errors), and ecosystem health (partner impact)
|
|
21
|
+
4. Plan experiment lifecycle states (draft, review, running, paused, completed, archived), with transition criteria and approval gates
|
|
22
|
+
|
|
23
|
+
### During implementation
|
|
24
|
+
- Implement consistent randomization using stable hashing (user_id + experiment_id) so that users see the same variant across sessions
|
|
25
|
+
- Build exposure logging capturing: timestamp, user_id, experiment_id, variant, context for accurate sample size and covariate adjustment
|
|
26
|
+
- Create metric computation pipeline with: numerator/denominator structure, winsorization for outliers, delta method for ratios, bootstrap for confidence intervals
|
|
27
|
+
- Design sequential testing capability for early stopping: alpha spending functions, futility boundaries, and minimum runtime requirements
|
|
28
|
+
- Implement stratified analysis for heterogeneous treatment effects: by platform, user segment, geography with interaction effect testing
|
|
29
|
+
- Build automated guardrail checks: alert on significant negative movement in critical metrics with experiment auto-pause capability
|
|
30
|
+
- Create experiment metadata repository: hypothesis, success criteria, related experiments, learnings, and decision outcome for institutional knowledge
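The stable-hashing item in the list above can be sketched as follows. SHA-256 and the `experiment_id:user_id` key format are choices made for this example rather than requirements, and `assign_variant` is a hypothetical name.

```python
import hashlib


def assign_variant(user_id, experiment_id, variants=("control", "treatment")):
    """Deterministically map a user to a variant for one experiment."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the digest as a uniform bucket number.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same user always gets the same variant within an experiment.
first_visit = assign_variant("user-42", "checkout-button-color")
second_visit = assign_variant("user-42", "checkout-button-color")

# Allocation across many users should be close to balanced.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "checkout-button-color")] += 1
```

Including the experiment ID in the hashed key keeps assignments independent across experiments, so a user who lands in treatment for one test is not systematically in treatment for the next.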
|
|
31
|
+
|
|
32
|
+
### After implementation
|
|
33
|
+
- Generate automated experiment scorecards: primary metric movement, guardrail status, statistical significance, practical significance, recommendation
|
|
34
|
+
- Build experiment catalog with search and discovery: hypothesis library, metric glossary, analysis templates, and historical results
|
|
35
|
+
- Create experimentation health dashboard: velocity (experiments/week), quality (statistical power distribution), impact (significant wins), and coverage (features tested)
|
|
36
|
+
- Document statistical methodology: test selection, variance reduction techniques, multiple comparison approach, and sequential testing procedures
|
|
37
|
+
|
|
38
|
+
## Self-check before task completion
|
|
39
|
+
- [ ] Randomization service ensures stable assignment and balanced allocation across variants
|
|
40
|
+
- [ ] Exposure logging captures all necessary context for accurate analysis and debugging
|
|
41
|
+
- [ ] Statistical analysis engine implements proper corrections for multiple comparisons and peeking
|
|
42
|
+
- [ ] Guardrail metrics monitored automatically with alerting and experiment pause capability
|
|
43
|
+
- [ ] Experiment lifecycle enforces minimum runtime and sample size before declaring results
|
|
@@ -0,0 +1,42 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: feature-engineering
|
|
3
|
+
version: 1.0.0
|
|
4
|
+
min_mindforge_version: 10.6.0
|
|
5
|
+
status: stable
|
|
6
|
+
triggers: ML feature engineering workflow, feature selection method, feature transformation, feature importance analysis, automated feature discovery, feature scaling normalization, feature interaction, temporal feature extraction, text feature engineering, categorical encoding strategy, feature validation, domain feature creation
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Skill — Feature Engineering
|
|
10
|
+
|
|
11
|
+
## When this skill activates
|
|
12
|
+
This skill activates when building ML pipelines that require feature creation, transformation, or selection. Use when designing feature stores, implementing automated feature discovery, or optimizing model input representation.
|
|
13
|
+
|
|
14
|
+
## Mandatory actions when this skill is active
|
|
15
|
+
|
|
16
|
+
### Before writing any code
|
|
17
|
+
1. Conduct exploratory data analysis to understand feature distributions, missing patterns, correlations, and domain-specific relationships
|
|
18
|
+
2. Define feature engineering strategy: target encoding risks, temporal leakage prevention, train-test split boundaries, and cross-validation approach
|
|
19
|
+
3. Document business logic for derived features with domain expert validation and interpretability requirements
|
|
20
|
+
4. Establish feature quality metrics: null rates, cardinality, stability over time, and correlation with target variable
|
|
21
|
+
|
|
22
|
+
### During implementation
|
|
23
|
+
- Implement feature transformations within sklearn Pipelines or similar frameworks to prevent train-test leakage
|
|
24
|
+
- Use scaling methods appropriate to the distribution (StandardScaler for roughly normal features, RobustScaler for features with outliers, quantile transformation for non-parametric distributions)
|
|
25
|
+
- Create temporal features with proper lag handling: rolling windows, exponential smoothing, seasonal decomposition, time-since-event
|
|
26
|
+
- Encode categorical variables with strategy matching cardinality (one-hot <10 categories, target encoding >50, embeddings for high-cardinality)
|
|
27
|
+
- Generate interaction features guided by domain knowledge and feature importance: polynomial, ratio, difference, product features
|
|
28
|
+
- Handle missing values explicitly with strategy documented: imputation (mean/median/mode), indicator variables, or model-based imputation
|
|
29
|
+
- Validate feature importance using multiple methods: permutation importance, SHAP values, and univariate tests to identify top contributors
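For the temporal-feature item in the list above, the key detail is that each row's feature may only use observations that were available before that row. A minimal sketch of a lagged rolling mean in plain Python follows; `lagged_rolling_mean` and the `daily_orders` data are hypothetical, and a real pipeline would typically use a dataframe library instead.

```python
def lagged_rolling_mean(values, window):
    """Mean of up to `window` observations strictly before each position."""
    features = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]  # excludes values[i] itself
        features.append(sum(history) / len(history) if history else None)
    return features


daily_orders = [10, 12, 11, 15, 20]
feature = lagged_rolling_mean(daily_orders, window=3)

# Changing the latest observation must not change any feature value,
# because no row is allowed to see its own or future data.
tampered = lagged_rolling_mean([10, 12, 11, 15, 999], window=3)
```

The first row has no history, so its feature is `None`; how to handle that gap (drop, impute, or add an indicator) falls under the missing-value strategy the list calls for.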
|
|
30
|
+
|
|
31
|
+
### After implementation
|
|
32
|
+
- Create feature documentation with schema definitions, transformation logic, expected ranges, and update frequency
|
|
33
|
+
- Build feature monitoring dashboards tracking distribution drift, missing rate changes, and correlation stability over time
|
|
34
|
+
- Generate feature store integration with versioning, metadata tracking, and point-in-time correctness for temporal joins
|
|
35
|
+
- Validate feature pipeline performance: transformation latency, memory usage, and batch vs online serving consistency
|
|
36
|
+
|
|
37
|
+
## Self-check before task completion
|
|
38
|
+
- [ ] All features are computed within transformation pipelines to prevent train-test leakage
|
|
39
|
+
- [ ] Feature importance analysis identifies top 20 contributors with interpretable business meaning
|
|
40
|
+
- [ ] Temporal features respect time boundaries and use only historically available information
|
|
41
|
+
- [ ] Feature documentation includes transformation logic, expected distributions, and monitoring thresholds
|
|
42
|
+
- [ ] Feature validation tests confirm stability across different time periods and data segments
|