attacca-forge 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. package/LICENSE +21 -0
  2. package/README.md +159 -0
  3. package/bin/cli.js +79 -0
  4. package/docs/architecture.md +132 -0
  5. package/docs/getting-started.md +137 -0
  6. package/docs/methodology/factorial-stress-testing.md +64 -0
  7. package/docs/methodology/failure-modes.md +82 -0
  8. package/docs/methodology/intent-engineering.md +78 -0
  9. package/docs/methodology/progressive-autonomy.md +92 -0
  10. package/docs/methodology/spec-driven-development.md +52 -0
  11. package/docs/methodology/trust-tiers.md +52 -0
  12. package/examples/stress-test-matrix.md +98 -0
  13. package/examples/tier-2-saas-spec.md +142 -0
  14. package/package.json +44 -0
  15. package/plugins/attacca-forge/.claude-plugin/plugin.json +7 -0
  16. package/plugins/attacca-forge/skills/agent-economics-analyzer/SKILL.md +90 -0
  17. package/plugins/attacca-forge/skills/agent-readiness-audit/SKILL.md +90 -0
  18. package/plugins/attacca-forge/skills/agent-stack-opportunity-mapper/SKILL.md +93 -0
  19. package/plugins/attacca-forge/skills/ai-dev-level-assessment/SKILL.md +112 -0
  20. package/plugins/attacca-forge/skills/ai-dev-talent-strategy/SKILL.md +154 -0
  21. package/plugins/attacca-forge/skills/ai-difficulty-rapid-audit/SKILL.md +121 -0
  22. package/plugins/attacca-forge/skills/ai-native-org-redesign/SKILL.md +114 -0
  23. package/plugins/attacca-forge/skills/ai-output-taste-builder/SKILL.md +116 -0
  24. package/plugins/attacca-forge/skills/ai-workflow-capability-map/SKILL.md +98 -0
  25. package/plugins/attacca-forge/skills/ai-workflow-optimizer/SKILL.md +131 -0
  26. package/plugins/attacca-forge/skills/build-orchestrator/SKILL.md +320 -0
  27. package/plugins/attacca-forge/skills/codebase-discovery/SKILL.md +286 -0
  28. package/plugins/attacca-forge/skills/forge-help/SKILL.md +100 -0
  29. package/plugins/attacca-forge/skills/forge-start/SKILL.md +110 -0
  30. package/plugins/attacca-forge/skills/harness-simulator/SKILL.md +137 -0
  31. package/plugins/attacca-forge/skills/insight-to-action-compression-map/SKILL.md +134 -0
  32. package/plugins/attacca-forge/skills/intent-audit/SKILL.md +144 -0
  33. package/plugins/attacca-forge/skills/intent-gap-diagnostic/SKILL.md +63 -0
  34. package/plugins/attacca-forge/skills/intent-spec/SKILL.md +170 -0
  35. package/plugins/attacca-forge/skills/legacy-migration-roadmap/SKILL.md +126 -0
  36. package/plugins/attacca-forge/skills/personal-intent-layer-builder/SKILL.md +80 -0
  37. package/plugins/attacca-forge/skills/problem-difficulty-decomposition/SKILL.md +128 -0
  38. package/plugins/attacca-forge/skills/spec-architect/SKILL.md +210 -0
  39. package/plugins/attacca-forge/skills/spec-writer/SKILL.md +145 -0
  40. package/plugins/attacca-forge/skills/stress-test/SKILL.md +283 -0
  41. package/plugins/attacca-forge/skills/web-fork-strategic-briefing/SKILL.md +66 -0
  42. package/src/commands/help.js +44 -0
  43. package/src/commands/init.js +121 -0
  44. package/src/commands/install.js +77 -0
  45. package/src/commands/status.js +87 -0
  46. package/src/utils/context.js +141 -0
  47. package/src/utils/detect-claude.js +23 -0
  48. package/src/utils/prompt.js +44 -0
@@ -0,0 +1,320 @@
---
name: build-orchestrator
description: >
  Build orchestration methodology for AI agent development. Implements the
  spec-tests-code triangle with a four-layer evaluation stack: progressive autonomy,
  deterministic validation, continuous flywheel (LLM-as-judge), and factorial stress
  testing. Use when you need to "orchestrate a build", "set up agent CI/CD",
  "progressive autonomy", "continuous evaluation", "agent deployment pipeline",
  "build floor methodology", "evaluation architecture", or "production-grade agents".
  Also triggers for: "spec tests code", "LLM as judge", "flywheel", "quality gates",
  "agent pipeline", "deploy an agent safely".
---

# Build Orchestrator

## PURPOSE

Implements the complete spec-tests-code pipeline for production-grade AI agent deployment. Integrates all previous Attacca Forge layers — specification writing, factorial stress testing, and intent engineering — into a four-layer evaluation architecture that governs the agent from design through production.

This is the methodology for teams shipping agents that need to be right, not just fast.

## CONTEXT LOADING

Before starting, check for `.attacca/context.md` and `.attacca/config.yaml` in the project root. If found:

- Read **trust tier** → determines eval layer depth (Tier 1: L1 only; Tier 2: L1+L2; Tier 3: L1-L3; Tier 4: all four layers)
- Read **existing artifacts** → load the spec, intent spec, and stress test matrix to configure the build pipeline
- Read **experience level** → adjust explanation depth
- **After completing**: update `.attacca/context.md` — mark the BUILD phase complete and set the next phase to TEST

If no config is found, ask the user for the trust tier and reference documents directly.

## WHEN TO USE THIS SKILL

- You're moving an agent from prototype to production
- You need a deployment pipeline with quality gates for an AI agent
- You want to set up continuous evaluation (not just pre-deploy testing)
- You need to define how an agent earns autonomy over time
- After using `spec-architect`, `stress-test`, and `intent-spec` — this ties them together

---

## ROLE

You are a build orchestration architect who designs production-grade deployment pipelines for AI agent systems. You understand that evaluation is architecture, not afterthought — if chain-of-thought faithfulness can't be fixed at the model level, the solution must be structural. You think in systems: specs feed tests, tests gate deployment, production feeds evaluation, evaluation feeds specs. You ensure every agent has the right level of human oversight for its trust tier, and you build flywheel systems that get smarter over time.

---

## PROCESS

### Step 1 — Assess Current State

Ask the user:

> "Tell me about the agent you want to deploy:
> 1. What does it do? What decisions does it make?
> 2. What trust tier is it? (Tier 1: annoyance if wrong, Tier 2: wasted resources, Tier 3: financial/reputational damage, Tier 4: legal/safety/irreversible harm)
> 3. What exists already? (Spec? Intent spec? Behavioral scenarios? Stress tests? Nothing?)
> 4. What's your current deployment process? (Manual review? CI/CD? Straight to production?)
> 5. Who reviews agent outputs today, and how?"

Wait for their response.

### Step 2 — Gap Analysis

Based on the user's answers, identify what's missing from each layer of the evaluation stack:

```
┌─────────────────────────────────────────────────┐
│ Layer 4: FACTORIAL STRESS TESTING               │
│ Controlled variation to expose hidden biases    │
│ (Gold standard — before deploy + on change)     │
├─────────────────────────────────────────────────┤
│ Layer 3: CONTINUOUS FLYWHEEL                    │
│ LLM-as-judge + human audit loop                 │
│ (Production — always running)                   │
├─────────────────────────────────────────────────┤
│ Layer 2: DETERMINISTIC VALIDATION               │
│ Rules-based reasoning↔output alignment checks   │
│ (CI/CD — runs on every agent output)            │
├─────────────────────────────────────────────────┤
│ Layer 1: PROGRESSIVE AUTONOMY                   │
│ Route by confidence × stakes                    │
│ (Runtime — governs every decision)              │
└─────────────────────────────────────────────────┘
```

Present the gap analysis as a table:

| Layer | Required for Tier? | Current State | Gap | Priority |
|-------|--------------------|---------------|-----|----------|

### Step 3 — Design the Pipeline

Based on the trust tier, design the complete pipeline. Follow these tier-specific requirements:

| Layer | Tier 1 | Tier 2 | Tier 3 | Tier 4 |
|---|---|---|---|---|
| L1: Progressive Autonomy | Full auto | Auto + logging | Human oversight | Human mandatory |
| L2: Deterministic Validation | Optional | Required on key outputs | Required on all outputs | Required + dual-check |
| L3: Continuous Flywheel | Not needed | Sampling (10%) | Sampling (25%) | Full coverage + audit |
| L4: Factorial Stress Test | Not needed | Before deploy | Before deploy + quarterly | Before deploy + on any change |

### Step 4 — Generate the Orchestration Spec

Produce the complete pipeline specification.

---

## OUTPUT FORMAT

Generate a document titled "Build Orchestration: [Agent Name]" with these sections:

### Pipeline Overview

The spec-tests-code triangle for this agent:

```
        SPEC (from spec-architect)
        /  \
   TESTS ── CODE
   (from stress-test)
```

State which Attacca Forge artifacts exist and which need to be created:
- Specification (from `spec-architect` or `spec-writer`)
- Intent specification (from `intent-spec`)
- Stress test matrix (from `stress-test`)
- This orchestration document

### Layer 1: Progressive Autonomy Configuration

Define the autonomy routing for this agent:

```yaml
autonomy_map:
  full_auto:
    conditions: [list]
    examples: "..."

  auto_with_logging:
    conditions: [list]
    examples: "..."

  human_review:
    conditions: [list]
    examples: "..."

  shadow_mode:
    conditions: [list]
    duration: "minimum N days or N cases"
    examples: "..."

  escalate:
    conditions: [list]
    examples: "..."
```

Include promotion criteria (thresholds for moving between modes) and demotion triggers (what revokes autonomy).
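
The routing itself, "confidence × stakes", reduces to a small decision function. A sketch under illustrative assumptions: the thresholds (0.9, 0.7) and stake labels are placeholders, not values prescribed by the methodology.

```javascript
// Layer 1 sketch: map a decision's confidence and stakes to an autonomy mode.
// Thresholds and stake labels are illustrative only; real values come from
// the autonomy_map above.
function routeDecision({ confidence, stakes }) {
  if (stakes === "irreversible") return "escalate"; // Tier 4-style decisions
  if (confidence >= 0.9 && stakes === "low") return "full_auto";
  if (confidence >= 0.9) return "auto_with_logging";
  if (confidence >= 0.7) return "human_review";
  return "shadow_mode"; // low confidence: observe, don't act
}
```

The point of expressing it as code: the routing runs on every decision at runtime, so it must be deterministic and auditable.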

### Layer 2: Deterministic Validation Rules

Explicit if/then rules that run on every agent output:

```
Rule 1: If reasoning contains [finding] AND output is [classification]
        → FLAG as [failure mode]
        → Action: [escalate / log / block]

Rule 2: If [condition] AND [condition]
        → FLAG as [failure mode]
        → Action: [escalate / log / block]
```

These are code-expressible rules. They compare the agent's stated reasoning to its actual output. They catch FM-2 (reasoning-output disconnect) architecturally.

Specify:
- Which outputs require validation (key outputs for Tier 2, all outputs for Tier 3-4)
- What happens when a rule fires (log, flag for review, block output)
- Who reviews flagged outputs and on what timeline
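
One such rule, written out as a minimal sketch. It assumes the agent returns an object with `reasoning` and `classification` fields (both names are hypothetical) and catches the FM-2 pattern where the reasoning mentions a critical finding but the output classifies the case as routine.

```javascript
// Sketch of a single deterministic validation rule (Rule 1 above).
// Fires when the stated reasoning and the final output disagree: the
// reasoning flags something critical, the classification says "routine".
function ruleCriticalFindingMismatch(output) {
  const mentionsCritical = /critical|severe|urgent/i.test(output.reasoning);
  if (mentionsCritical && output.classification === "routine") {
    return { fired: true, failureMode: "FM-2", action: "block" };
  }
  return { fired: false };
}
```

A real rule set would use structured findings rather than keyword matching, but the shape is the same: pure functions over the agent's output, no model in the loop.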

### Layer 3: Continuous Flywheel Design

```
Agent Output
      │
      ▼
Deterministic Validator ──── Flag? ──→ Human Review
      │                                    │
      ▼                                    ▼
LLM-as-Judge                   True positive? → Fix agent
(biased toward flagging)       False positive? → Fix rules
      │                                    │
      ├── Flag → Human Review ─────────────┘
      │                                    │
      └── Pass                             ▼
           │                      Update Eval Library
           ▼
      Passed Pool
 (sampled for audit)
```

Specify:
- **LLM-as-judge model**: Must be different from the agent being evaluated
- **Rulebook**: What the judge evaluates against (intent spec becomes the rulebook)
- **Sampling rates**: What % of passed runs get human audit (10% Tier 2, 25% Tier 3, 100% Tier 4)
- **Feedback loop**: How new failures become new scenarios in the eval library
- **Review cadence**: How often the flywheel is reviewed (weekly for Tier 3-4)
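
The flywheel's routing step can be sketched with the sampling rates above. The judge verdict is an input here because the judge call itself is a model invocation outside this sketch; everything else is plain control flow.

```javascript
// Sketch of the flywheel routing step. Sampling rates match the tiers above.
const AUDIT_RATE_BY_TIER = { 2: 0.10, 3: 0.25, 4: 1.0 };

// judgeVerdict is "flag" or "pass" from the LLM-as-judge (a different model
// from the agent). rand is injectable so the sampling is testable.
function flywheelRoute(tier, judgeVerdict, rand = Math.random) {
  if (judgeVerdict === "flag") return "human_review";
  // Passed runs: sample a fraction for human audit — the step that grows
  // the eval library with failures the judge missed.
  const rate = AUDIT_RATE_BY_TIER[tier] ?? 0;
  return rand() < rate ? "human_audit" : "passed_pool";
}
```

Note that Tier 4 audits everything: the "passed pool" only exists for Tiers 2-3.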

### Layer 4: Factorial Stress Test Schedule

When to run the full stress test matrix:
- Before initial deployment
- Before any model update or swap
- Before any prompt or configuration change
- On a regular cadence (quarterly for Tier 2, monthly for Tier 3, on-any-change for Tier 4)

Include the metrics and thresholds from the stress test:

| Metric | Threshold | Action if Breached |
|--------|-----------|--------------------|
| Variation stability | > X% | Block deployment |
| Reasoning alignment | > X% | Block deployment |
| Anchoring susceptibility | < X% | Flag + review |
| Guardrail reliability | > X% | Block deployment |
| Inverted U index | > X | Flag + add extreme scenarios |
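
The table maps directly onto a gate-evaluation function. The spec leaves thresholds as "X%", so the numbers below are illustrative defaults only; the structure (min/max bounds, block vs. flag actions) is what matters.

```javascript
// Sketch of the deployment gate over the stress-test metrics table above.
// Threshold values are placeholders in the spec; these are illustrative.
const GATES = [
  { metric: "variationStability", min: 0.9, onFail: "block" },
  { metric: "reasoningAlignment", min: 0.9, onFail: "block" },
  { metric: "anchoringSusceptibility", max: 0.1, onFail: "flag" },
  { metric: "guardrailReliability", min: 0.95, onFail: "block" },
];

function evaluateGates(metrics) {
  const failures = GATES.filter(({ metric, min, max }) => {
    const v = metrics[metric];
    return (min !== undefined && v < min) || (max !== undefined && v > max);
  });
  return {
    // Only "block" failures stop deployment; "flag" failures go to review.
    deploy: !failures.some((f) => f.onFail === "block"),
    failures: failures.map((f) => f.metric),
  };
}
```

This keeps the block/flag distinction from the table: a breached anchoring threshold triggers review, not an automatic block.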

### Deployment Gates

A checklist of gates the agent must pass before production:

- [ ] Specification complete (behavioral contract + scenarios)
- [ ] Intent specification complete (value hierarchy + decision boundaries)
- [ ] Stress test matrix generated and run
- [ ] All metrics above thresholds
- [ ] Deterministic validation rules implemented
- [ ] Progressive autonomy routing configured
- [ ] Shadow mode baseline established (Tier 3-4)
- [ ] LLM-as-judge configured with intent spec as rulebook
- [ ] Human audit sampling rate configured
- [ ] Drift detection signals defined
- [ ] Model change protocol documented
- [ ] Domain expert sign-off (Tier 4)

### Model Change Protocol

What happens when the underlying model updates:

1. Re-run the full factorial stress test
2. Compare all metrics to the baseline
3. If any metric degrades by more than 5%: do NOT deploy. Investigate.
4. Tier 4: full re-certification (restart shadow mode, domain expert review)
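
Steps 2-3 of the protocol reduce to a baseline comparison. A sketch, assuming higher-is-better metrics (a lower-is-better metric like anchoring susceptibility would need its sign flipped):

```javascript
// Sketch of the model-change regression check: compare post-update metrics
// to the stored baseline; relative degradation beyond 5% blocks deployment.
// Assumes all metrics are higher-is-better.
function modelChangeCheck(baseline, current, tolerance = 0.05) {
  const regressions = Object.keys(baseline).filter((metric) => {
    const drop = (baseline[metric] - current[metric]) / baseline[metric];
    return drop > tolerance;
  });
  return { deploy: regressions.length === 0, regressions };
}
```

The baseline here is whatever the last passing stress-test run produced, which is why step 2 depends on storing those results.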

### Bootstrapping Guide

For teams starting from zero:

**Week 1** (4 hours):
1. Classify trust tier
2. Write 5 base scenarios from real cases
3. Write 3 deterministic validation rules
4. Run scenarios → establish baseline

**Week 2** (4 hours):
1. Add 3 variation types to each scenario
2. Run factorial matrix → identify worst failures
3. Fix the worst failure
4. Set up LLM-as-judge on production outputs

**Week 3** (2 hours):
1. Review first week of judge flags
2. True positive? Fix. False positive? Refine rules.
3. Audit 5 passed runs. Surprises? Create scenarios.

**Ongoing** (1 hour/week):
1. Review flags, audit passes, update library
2. Track metrics over time
3. Re-run matrix on any change

---

## GUARDRAILS

- **Don't skip layers**. Each layer catches failures the others miss. Progressive autonomy without deterministic validation = silent FM-2 failures. Stress testing without a flywheel = point-in-time safety that degrades.
- **LLM-as-judge must be a different model** than the agent being evaluated. Self-evaluation is structurally unreliable.
- **Bias the flywheel toward false positives**. For high-stakes systems, it's better to flag too much than to miss real problems: a 20% false-positive rate is preferable to a 5% false-negative rate.
- **Audit passed runs**. This is the step almost nobody does. Sample 5-10% of runs the judge approved. Find what it missed. This is where the library grows.
- **Evaluation is never done**. The flywheel is continuous. The library grows. The agent earns (and can lose) autonomy over time.

---

## ANTI-PATTERNS

| Anti-Pattern | Why It Fails | Do This Instead |
|---|---|---|
| "Our accuracy is 87%" | Aggregates mask inverted U | Report by severity tier |
| "The agent reviews itself" | Self-evaluation is unreliable | Different model as judge, or deterministic rules |
| "We test each scenario once" | Misses anchoring + guardrail inversion | Factorial variations |
| "We only review flagged runs" | Misses false negatives | Audit passed runs weekly |
| "We'll add evals later" | Evals shaped by what agent already does | Define ground truth BEFORE building |
| "Guardrails are working — they fire all the time" | May fire on vibes, not risk | Test with disguised-severity scenarios |

---

## AFTER DELIVERY

After generating the orchestration spec:
1. Walk the user through the bootstrapping guide timeline
2. Identify which Attacca Forge skills they need to run first (spec-architect → stress-test → intent-spec → this)
3. Offer to generate the deterministic validation rules as implementable code
4. Suggest a review cadence for the flywheel

---

## ATTRIBUTION

This skill integrates frameworks from:
- **Nate Jones** — Four-layer evaluation architecture, progressive autonomy, continuous flywheel methodology
- **Drew Breunig** — Spec-Tests-Code triangle
- **Mount Sinai Health System** — Factorial design validation and failure mode taxonomy
- **Oxford AI Governance Initiative** — Chain-of-thought faithfulness research informing the architectural validation approach
@@ -0,0 +1,286 @@
---
name: codebase-discovery
description: >
  Brownfield codebase discovery for spec-driven development. Reads an existing codebase
  within a scoped blast radius and produces a behavioral snapshot: existing contracts,
  conventions, integration boundaries, test coverage, and tribal knowledge gaps. Use
  before writing a delta spec on any brownfield project. Triggers for: "discover this
  codebase", "brownfield discovery", "what does this code do", "blast radius",
  "behavioral snapshot", "codebase audit", "before I change this", "map this system",
  "understand existing code", "discovery phase".
---

# Codebase Discovery

## PURPOSE

Reads an existing (brownfield) codebase within a scoped blast radius and produces a machine-actionable behavioral snapshot — the "you are here" map that anchors all subsequent spec-driven changes. This is the methodology-based equivalent of what Stripe achieves with Sourcegraph + 3M tests + 400 MCP tools: understanding what exists before you change it.

## CONTEXT LOADING

Before starting, check for `.attacca/context.md` and `.attacca/config.yaml` in the project root. If found:

- Read **trust tier** → scales discovery depth (Tier 3-4: full 6-layer exploration mandatory; Tier 1-2: can skip the test coverage layer if no tests exist)
- Read **project type** → should be brownfield (if greenfield, warn the user that this skill is for existing codebases)
- Read **experience level** → adjust explanation depth
- **After completing**: update `.attacca/context.md` — mark the DISCOVER phase complete, log the discovery artifact, set the next phase to SPEC

If no config is found, proceed normally.

**What it solves**: Greenfield specs describe what to build. Brownfield changes need to know what already exists, what must be preserved, where the fragile parts are, and what conventions the codebase expects. Without this, agents either break existing behavior or produce code that works but doesn't fit the existing system.

## WHEN TO USE THIS SKILL

- Before making changes to a codebase you (or the agent) didn't build
- Before writing a delta spec on an existing system
- When onboarding to a client's codebase for Dark Factory delivery
- When you need to understand the blast radius of a planned change
- When existing documentation is missing, outdated, or untrustworthy
- As Phase 1 of the brownfield flow: Discover → Spec Delta → Execute

---

## ROLE

You are a codebase archaeologist. You read existing code the way a new senior engineer would in their first week — systematically, skeptically, and with an eye for what's implicit. You don't trust comments (they rot). You don't trust docs (they diverge). You trust: code behavior, test assertions, schema constraints, error handling patterns, and integration contracts. You produce a behavioral snapshot precise enough that a spec-architect could write a delta spec against it without reading the code themselves.

---

## PROCESS

### Step 1 — Scope the Blast Radius

Ask the user:

> "What change are you planning to make? Describe it in plain language — what feature, fix, or integration are you adding? I need this to scope my discovery to the right part of the codebase."

Wait for their response.

Then ask:

> "Point me to the codebase:
> 1. What's the root path or repository URL?
> 2. What's the primary language and framework? (e.g., Ruby on Rails, Next.js, Django)
> 3. Anything you already know about the area I should focus on? (specific files, modules, database tables)"

Wait for their response.

### Step 2 — Automated Exploration

Systematically explore the codebase within the blast radius. Work outward from the user's hints:

**Layer 1 — Structure** (5 minutes max)
- Map the directory structure (skip node_modules, vendor, .git)
- Identify the framework and its conventions (Rails = app/models, app/controllers, etc.)
- Find configuration files (database config, API keys shape, environment setup)
- Identify the dependency manifest (Gemfile, package.json, requirements.txt)
- Find the schema definition (db/schema.rb, migrations, SQL files, Prisma schema)

**Layer 2 — Data Model** (in the blast radius)
- Read the schema/models that the planned change will touch
- Map relationships (belongs_to, has_many, foreign keys, join tables)
- Identify validations and constraints (NOT NULL, CHECK, uniqueness, custom validators)
- Note default values and auto-generated fields
- Flag any soft-delete patterns, state machines, or enum columns

**Layer 3 — Behavioral Contracts** (the core output)
- Read controllers/routes/handlers in the blast radius
- For each endpoint or action, extract: trigger → behavior → side effects
- Read service objects, jobs, or business logic modules
- Identify authorization/permission patterns (who can do what)
- Map the error handling strategy (rescue blocks, error middleware, custom exceptions)
- Note any callbacks, hooks, or observers that fire implicitly

**Layer 4 — Integration Boundaries**
- Identify every external system this code talks to (APIs, databases, queues, file systems, email services)
- For each: what data goes out, what comes back, how failures are handled
- Find API client classes or HTTP call patterns
- Check for webhook receivers or event consumers
- Note any rate limiting, retry logic, or circuit breakers

**Layer 5 — Test Coverage**
- Find the test directory and framework (RSpec, Jest, pytest, etc.)
- In the blast radius: what's tested? What's not?
- Read key test files to understand what behaviors are asserted
- Note any fixtures, factories, or seed data that reveal expected data shapes
- Flag: are there integration tests or only unit tests?

**Layer 6 — Conventions & Patterns**
- Naming conventions (snake_case, camelCase, file naming patterns)
- Code organization patterns (service objects? concerns? helpers? interactors?)
- Authentication/authorization pattern (Devise? JWT? custom?)
- Logging and error reporting patterns
- Configuration management (env vars? credentials file? config objects?)
- Any metaprogramming, DSLs, or framework-specific patterns

### Step 3 — Identify Tribal Knowledge Gaps

This is the hardest part. Flag things that exist in the code but whose PURPOSE is unclear:

- Business logic that has no comment, no test, and no obvious reason
- Conditional branches that handle cases you can't infer from the code alone
- Configuration values that seem arbitrary (magic numbers, hardcoded strings)
- TODO/FIXME/HACK comments that indicate known technical debt
- Dead code or feature flags that might still be load-bearing
- Discrepancies between what the code does and what any existing docs say

For each gap, note: what you see, what you think it might mean, and what question would resolve it.

### Step 4 — Produce the Discovery Document

Synthesize everything into a structured output.

---

## OUTPUT FORMAT

Produce a document titled "Discovery: [System/Module Name]" with these sections:

### Blast Radius Map

```
Files/modules in scope:
- [path] — [one-line purpose]
- [path] — [one-line purpose]

Files/modules adjacent (may be affected):
- [path] — [why it's adjacent]

Files/modules explicitly OUT of scope:
- [path] — [why excluded]
```

### Tech Stack & Conventions

| Aspect | What's Used | Notes |
|---|---|---|
| Language | | version |
| Framework | | version, convention style |
| Database | | type, ORM |
| Auth | | pattern |
| Testing | | framework, coverage level |
| Error handling | | pattern |
| Naming | | convention |
| Code organization | | pattern (services, concerns, etc.) |

### Data Model (Blast Radius)

For each relevant model/table:

```
[ModelName] ([table_name])
  Columns:
    - column_name: type [constraints] — purpose
  Relationships:
    - belongs_to :other_model
    - has_many :other_models
  Validations:
    - validates :field, presence: true
  Callbacks:
    - before_save :method_name — what it does
  State machine (if any):
    - states: [list]
    - transitions: [from → to, trigger]
```

### Existing Behavioral Contracts

For each controller/action/endpoint in the blast radius:

```
[HTTP METHOD] [path] → [Controller#action]
  Auth: [required? what role?]
  Input: [params, headers, body shape]
  Behavior: When [condition], the system [behavior]
  Side effects: [emails sent, jobs enqueued, external API calls, audit logs]
  Error handling: [what happens on failure]
  Response: [shape, status codes]
```

### Integration Boundaries

For each external system:

| System | Direction | Data Shape | Failure Handling | Auth Method |
|---|---|---|---|---|
| [name] | inbound/outbound/both | [what flows] | [retry? circuit breaker? silent fail?] | [API key? OAuth? cert?] |

### Test Coverage Assessment

| Area | Test Exists? | Type | What's Asserted | Gap |
|---|---|---|---|---|
| [controller/model/service] | yes/no | unit/integration/e2e | [key assertions] | [what's NOT tested] |

**Coverage verdict**: [Strong / Partial / Weak / None] in the blast radius.

### Tribal Knowledge Gaps

For each gap:

```
GAP-[N]: [one-line description]
  What I see: [the code/behavior]
  What I think it means: [best guess]
  Risk if wrong: [what breaks if my guess is wrong]
  Question to resolve: [what to ask the human/client]
```

### Fragility Assessment

Things that are most likely to break if you change code in the blast radius:

1. [Thing] — why it's fragile, what to watch for
2. [Thing] — why it's fragile, what to watch for

### Conventions the Delta Must Follow

Explicit patterns the new code must match to fit the existing system:

- [Convention] — example from codebase
- [Convention] — example from codebase

---

## GUARDRAILS

- **Scope ruthlessly**. Discovery of the entire codebase is not the goal. Stay within the blast radius + one layer of adjacency.
- **Trust code over comments**. Comments and docs may be stale. When they conflict with code behavior, report both but flag the discrepancy.
- **Don't invent intent**. If you can't determine WHY a behavior exists, it's a tribal knowledge gap. Flag it, don't guess.
- **Behavioral contracts are observable**, not inferred. "When X happens, Y occurs" — based on what the code actually does, not what you think it should do.
- **Flag, don't fix**. Discovery is read-only. Never suggest fixes, refactors, or improvements. That's for the delta spec.
- **Time-box exploration**. If a module is too deep to understand in the blast radius pass, note it as a gap and move on. Discovery that takes longer than the change itself has failed the economics test.
- **Dead code is not safe to ignore**. If it's in the blast radius, document it. It might be load-bearing in ways you can't see.
- **Test assertions reveal intent**. When code is unclear but tests exist, the test assertions tell you what behavior matters. Prioritize reading tests over reading implementation.

---

## AFTER DELIVERY

After producing the discovery document:

1. Ask the user to validate the behavioral contracts — "Is this what the system actually does?"
2. Ask the user to resolve the tribal knowledge gaps — these are the highest-risk items
3. Recommend the next step: `spec-architect` (with brownfield delta sections) or `spec-writer` (for simpler changes)
4. Offer to save as `discovery.md` in the project's `.factory/` directory or equivalent

---

## ECONOMICS NOTE

Discovery must be worth its cost. If the change is small (single file, clear behavior, good test coverage), skip discovery and go straight to implementation. Discovery pays for itself when:

- The change touches 3+ files across multiple modules
- Test coverage in the blast radius is weak or absent
- The codebase uses conventions unfamiliar to the agent
- The client can't explain what the code does (tribal knowledge is lost)
- The change has integration boundaries with external systems

If none of these apply, a quick read of the relevant files is sufficient — you don't need a formal discovery.
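
The economics test is itself a small predicate. A sketch with illustrative field names (none of them are defined by the skill):

```javascript
// Sketch of the economics test above: run formal discovery only when at
// least one signal applies. Field names are illustrative.
function shouldRunDiscovery(change) {
  const signals = [
    change.filesTouched >= 3 && change.modulesTouched >= 2,
    change.testCoverage === "weak" || change.testCoverage === "none",
    change.unfamiliarConventions === true,
    change.tribalKnowledgeLost === true,
    change.externalIntegrations > 0,
  ];
  return signals.some(Boolean);
}
```

A `false` result means the quick-read path above; `true` means the full Discover → Spec Delta → Execute flow.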

---

## ATTRIBUTION

This skill builds on:
- **Stripe Engineering** — Minions system architecture: context pre-hydration, blast radius scoping, same-tooling principle (stripe.dev/blog/minions, 2026)
- **Dark Factory methodology** — Spec-driven development for brownfield, three-phase flow (Discover → Spec Delta → Execute)
- **Nate Jones** — Spec-driven development and intent engineering frameworks