attacca-forge 0.5.0
- package/LICENSE +21 -0
- package/README.md +159 -0
- package/bin/cli.js +79 -0
- package/docs/architecture.md +132 -0
- package/docs/getting-started.md +137 -0
- package/docs/methodology/factorial-stress-testing.md +64 -0
- package/docs/methodology/failure-modes.md +82 -0
- package/docs/methodology/intent-engineering.md +78 -0
- package/docs/methodology/progressive-autonomy.md +92 -0
- package/docs/methodology/spec-driven-development.md +52 -0
- package/docs/methodology/trust-tiers.md +52 -0
- package/examples/stress-test-matrix.md +98 -0
- package/examples/tier-2-saas-spec.md +142 -0
- package/package.json +44 -0
- package/plugins/attacca-forge/.claude-plugin/plugin.json +7 -0
- package/plugins/attacca-forge/skills/agent-economics-analyzer/SKILL.md +90 -0
- package/plugins/attacca-forge/skills/agent-readiness-audit/SKILL.md +90 -0
- package/plugins/attacca-forge/skills/agent-stack-opportunity-mapper/SKILL.md +93 -0
- package/plugins/attacca-forge/skills/ai-dev-level-assessment/SKILL.md +112 -0
- package/plugins/attacca-forge/skills/ai-dev-talent-strategy/SKILL.md +154 -0
- package/plugins/attacca-forge/skills/ai-difficulty-rapid-audit/SKILL.md +121 -0
- package/plugins/attacca-forge/skills/ai-native-org-redesign/SKILL.md +114 -0
- package/plugins/attacca-forge/skills/ai-output-taste-builder/SKILL.md +116 -0
- package/plugins/attacca-forge/skills/ai-workflow-capability-map/SKILL.md +98 -0
- package/plugins/attacca-forge/skills/ai-workflow-optimizer/SKILL.md +131 -0
- package/plugins/attacca-forge/skills/build-orchestrator/SKILL.md +320 -0
- package/plugins/attacca-forge/skills/codebase-discovery/SKILL.md +286 -0
- package/plugins/attacca-forge/skills/forge-help/SKILL.md +100 -0
- package/plugins/attacca-forge/skills/forge-start/SKILL.md +110 -0
- package/plugins/attacca-forge/skills/harness-simulator/SKILL.md +137 -0
- package/plugins/attacca-forge/skills/insight-to-action-compression-map/SKILL.md +134 -0
- package/plugins/attacca-forge/skills/intent-audit/SKILL.md +144 -0
- package/plugins/attacca-forge/skills/intent-gap-diagnostic/SKILL.md +63 -0
- package/plugins/attacca-forge/skills/intent-spec/SKILL.md +170 -0
- package/plugins/attacca-forge/skills/legacy-migration-roadmap/SKILL.md +126 -0
- package/plugins/attacca-forge/skills/personal-intent-layer-builder/SKILL.md +80 -0
- package/plugins/attacca-forge/skills/problem-difficulty-decomposition/SKILL.md +128 -0
- package/plugins/attacca-forge/skills/spec-architect/SKILL.md +210 -0
- package/plugins/attacca-forge/skills/spec-writer/SKILL.md +145 -0
- package/plugins/attacca-forge/skills/stress-test/SKILL.md +283 -0
- package/plugins/attacca-forge/skills/web-fork-strategic-briefing/SKILL.md +66 -0
- package/src/commands/help.js +44 -0
- package/src/commands/init.js +121 -0
- package/src/commands/install.js +77 -0
- package/src/commands/status.js +87 -0
- package/src/utils/context.js +141 -0
- package/src/utils/detect-claude.js +23 -0
- package/src/utils/prompt.js +44 -0
package/plugins/attacca-forge/skills/build-orchestrator/SKILL.md
@@ -0,0 +1,320 @@
---
name: build-orchestrator
description: >
  Build orchestration methodology for AI agent development. Implements the
  spec-tests-code triangle with a four-layer evaluation stack: progressive autonomy,
  deterministic validation, continuous flywheel (LLM-as-judge), and factorial stress
  testing. Use when you need to "orchestrate a build", "set up agent CI/CD",
  "progressive autonomy", "continuous evaluation", "agent deployment pipeline",
  "build floor methodology", "evaluation architecture", or "production-grade agents".
  Also triggers for: "spec tests code", "LLM as judge", "flywheel", "quality gates",
  "agent pipeline", "deploy an agent safely".
---

# Build Orchestrator

## PURPOSE

Implements the complete spec-tests-code pipeline for production-grade AI agent deployment. Integrates all previous Attacca Forge layers — specification writing, factorial stress testing, and intent engineering — into a four-layer evaluation architecture that governs the agent from design through production.

This is the methodology for teams shipping agents that need to be right, not just fast.

## CONTEXT LOADING

Before starting, check for `.attacca/context.md` and `.attacca/config.yaml` in the project root. If found:

- Read **trust tier** → determines eval layer depth (Tier 1: L1 only, Tier 2: L1+L2, Tier 3: L1-L3, Tier 4: all 4 layers)
- Read **existing artifacts** → load the spec, intent spec, and stress test matrix to configure the build pipeline
- Read **experience level** → adjust explanation depth
- **After completing**: update `.attacca/context.md` — mark the BUILD phase complete, set the next phase to TEST

If no config found, ask for tier and reference documents directly.

## WHEN TO USE THIS SKILL

- You're moving an agent from prototype to production
- You need a deployment pipeline with quality gates for an AI agent
- You want to set up continuous evaluation (not just pre-deploy testing)
- You need to define how an agent earns autonomy over time
- After using `spec-architect`, `stress-test`, and `intent-spec` — this ties them together

---

## ROLE

You are a build orchestration architect who designs production-grade deployment pipelines for AI agent systems. You understand that evaluation is architecture, not afterthought — if chain-of-thought faithfulness can't be fixed at the model level, the solution must be structural. You think in systems: specs feed tests, tests gate deployment, production feeds evaluation, evaluation feeds specs. You ensure every agent has the right level of human oversight for its trust tier, and you build flywheel systems that get smarter over time.

---

## PROCESS

### Step 1 — Assess Current State

Ask the user:

> "Tell me about the agent you want to deploy:
> 1. What does it do? What decisions does it make?
> 2. What trust tier is it? (Tier 1: annoyance if wrong, Tier 2: wasted resources, Tier 3: financial/reputational damage, Tier 4: legal/safety/irreversible harm)
> 3. What exists already? (Spec? Intent spec? Behavioral scenarios? Stress tests? Nothing?)
> 4. What's your current deployment process? (Manual review? CI/CD? Straight to production?)
> 5. Who reviews agent outputs today, and how?"

Wait for their response.

### Step 2 — Gap Analysis

Based on the user's answers, identify what's missing from each layer of the evaluation stack:

```
┌───────────────────────────────────────────────┐
│ Layer 4: FACTORIAL STRESS TESTING             │
│ Controlled variation to expose hidden biases  │
│ (Gold standard — before deploy + on change)   │
├───────────────────────────────────────────────┤
│ Layer 3: CONTINUOUS FLYWHEEL                  │
│ LLM-as-judge + human audit loop               │
│ (Production — always running)                 │
├───────────────────────────────────────────────┤
│ Layer 2: DETERMINISTIC VALIDATION             │
│ Rules-based reasoning↔output alignment checks │
│ (CI/CD — runs on every agent output)          │
├───────────────────────────────────────────────┤
│ Layer 1: PROGRESSIVE AUTONOMY                 │
│ Route by confidence × stakes                  │
│ (Runtime — governs every decision)            │
└───────────────────────────────────────────────┘
```

Present the gap analysis as a table:

| Layer | Required for Tier? | Current State | Gap | Priority |
|-------|--------------------|---------------|-----|----------|

### Step 3 — Design the Pipeline

Based on the trust tier, design the complete pipeline. Follow these tier-specific requirements:

| Layer | Tier 1 | Tier 2 | Tier 3 | Tier 4 |
|---|---|---|---|---|
| L1: Progressive Autonomy | Full auto | Auto + logging | Human oversight | Human mandatory |
| L2: Deterministic Validation | Optional | Required on key outputs | Required on all outputs | Required + dual-check |
| L3: Continuous Flywheel | Not needed | Sampling (10%) | Sampling (25%) | Full coverage + audit |
| L4: Factorial Stress Test | Not needed | Before deploy | Before deploy + quarterly | Before deploy + on any change |

### Step 4 — Generate the Orchestration Spec

Produce the complete pipeline specification.

---

## OUTPUT FORMAT

Generate a document titled "Build Orchestration: [Agent Name]" with these sections:

### Pipeline Overview

The spec-tests-code triangle for this agent:

```
        SPEC   (from spec-architect)
        /  \
   TESTS ── CODE
(from stress-test)
```

State which Attacca Forge artifacts exist and which need to be created:

- Specification (from `spec-architect` or `spec-writer`)
- Intent specification (from `intent-spec`)
- Stress test matrix (from `stress-test`)
- This orchestration document

### Layer 1: Progressive Autonomy Configuration

Define the autonomy routing for this agent:

```yaml
autonomy_map:
  full_auto:
    conditions: [list]
    examples: "..."

  auto_with_logging:
    conditions: [list]
    examples: "..."

  human_review:
    conditions: [list]
    examples: "..."

  shadow_mode:
    conditions: [list]
    duration: "minimum N days or N cases"
    examples: "..."

  escalate:
    conditions: [list]
    examples: "..."
```

Include promotion criteria (thresholds for moving between modes) and demotion triggers (what revokes autonomy).
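A minimal sketch of Layer 1 routing, assuming the agent reports a confidence score in [0, 1] and each decision carries a stakes label. The numeric thresholds and the error-rate demotion bound are placeholders to be tuned per agent, not prescribed values.

```javascript
// Route a decision by confidence × stakes (Layer 1 sketch; thresholds are placeholders).
function routeDecision({ confidence, stakes }) {
  if (stakes === "high") {
    // High stakes always involves a human, regardless of confidence.
    return confidence >= 0.9 ? "human_review" : "escalate";
  }
  if (confidence >= 0.95) return "full_auto";
  if (confidence >= 0.8) return "auto_with_logging";
  return "human_review";
}

// Demotion trigger sketch: revoke autonomy when the recent error rate crosses a bound.
function shouldDemote(recentOutcomes, maxErrorRate = 0.05) {
  const errors = recentOutcomes.filter((o) => !o.correct).length;
  return errors / recentOutcomes.length > maxErrorRate;
}
```
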

### Layer 2: Deterministic Validation Rules

Explicit if/then rules that run on every agent output:

```
Rule 1: If reasoning contains [finding] AND output is [classification]
        → FLAG as [failure mode]
        → Action: [escalate / log / block]

Rule 2: If [condition] AND [condition]
        → FLAG as [failure mode]
        → Action: [escalate / log / block]
```

These are code-expressible rules. They compare the agent's stated reasoning to its actual output. They catch FM-2 (reasoning-output disconnect) architecturally.

Specify:

- Which outputs require validation (key outputs for Tier 2, all outputs for Tier 3-4)
- What happens when a rule fires (log, flag for review, block output)
- Who reviews flagged outputs and on what timeline
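One such rule in code, following the Rule 1 template: flag when the stated reasoning mentions a critical finding but the final classification is benign (FM-2). The signal keywords and the `low_risk` label are hypothetical examples for illustration, not a fixed taxonomy.

```javascript
// Deterministic FM-2 check: does the reasoning contradict the output? (sketch)
function checkReasoningOutputAlignment(run) {
  const criticalSignals = ["severe", "critical", "urgent"]; // hypothetical keyword list
  const mentionsCritical = criticalSignals.some((w) =>
    run.reasoning.toLowerCase().includes(w)
  );
  if (mentionsCritical && run.classification === "low_risk") {
    // Reasoning found something serious but the output downgraded it.
    return { flagged: true, failureMode: "FM-2", action: "escalate" };
  }
  return { flagged: false };
}
```

Because the rule is deterministic, it can run in CI/CD on every output with no model call and no variance between runs.
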

### Layer 3: Continuous Flywheel Design

```
Agent Output
     │
     ▼
Deterministic Validator ──── Flag? ──→ Human Review
     │                                      │
     ▼                                      ▼
LLM-as-Judge                   True positive? → Fix agent
(biased toward flagging)       False positive? → Fix rules
     │                                      │
     ├── Flag → Human Review ───────────────┘
     │                                      │
     └── Pass                               ▼
          │                        Update Eval Library
          ▼
     Passed Pool
  (sampled for audit)
```

Specify:

- **LLM-as-judge model**: Must be different from the agent being evaluated
- **Rulebook**: What the judge evaluates against (the intent spec becomes the rulebook)
- **Sampling rates**: What % of passed runs get human audit (10% Tier 2, 25% Tier 3, 100% Tier 4)
- **Feedback loop**: How new failures become new scenarios in the eval library
- **Review cadence**: How often the flywheel is reviewed (weekly for Tier 3-4)

### Layer 4: Factorial Stress Test Schedule

When to run the full stress test matrix:

- Before initial deployment
- Before any model update or swap
- Before any prompt or configuration change
- On a regular cadence (quarterly for Tiers 2-3 per the tier table above, on any change for Tier 4)

Include the metrics and thresholds from the stress test:

| Metric | Threshold | Action if Threshold Missed |
|--------|-----------|----------------------------|
| Variation stability | > X% | Block deployment |
| Reasoning alignment | > X% | Block deployment |
| Anchoring susceptibility | < X% | Flag + review |
| Guardrail reliability | > X% | Block deployment |
| Inverted U index | > X | Flag + add extreme scenarios |

### Deployment Gates

A checklist of gates the agent must pass before production:

- [ ] Specification complete (behavioral contract + scenarios)
- [ ] Intent specification complete (value hierarchy + decision boundaries)
- [ ] Stress test matrix generated and run
- [ ] All metrics within thresholds
- [ ] Deterministic validation rules implemented
- [ ] Progressive autonomy routing configured
- [ ] Shadow mode baseline established (Tier 3-4)
- [ ] LLM-as-judge configured with the intent spec as rulebook
- [ ] Human audit sampling rate configured
- [ ] Drift detection signals defined
- [ ] Model change protocol documented
- [ ] Domain expert sign-off (Tier 4)

### Model Change Protocol

What happens when the underlying model updates:

1. Re-run the full factorial stress test
2. Compare all metrics to baseline
3. If any metric degrades by more than 5%: do NOT deploy. Investigate.
4. Tier 4: full re-certification (restart shadow mode, domain expert review)
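Steps 2-3 of the protocol can be sketched as a baseline comparison that blocks deployment on more than 5% relative degradation. The metric names and the `higherIsBetter` shape are assumptions for illustration; the flag captures that anchoring susceptibility improves downward while the other metrics improve upward.

```javascript
// Compare post-update metrics to the stored baseline; block on > 5% relative degradation.
function compareToBaseline(baseline, current, maxDegradation = 0.05) {
  const findings = [];
  for (const [name, base] of Object.entries(baseline)) {
    const { value, higherIsBetter } = current[name];
    // Relative degradation: direction depends on whether higher is better.
    const delta = higherIsBetter ? (base - value) / base : (value - base) / base;
    if (delta > maxDegradation) findings.push({ name, degradation: delta });
  }
  return { deploy: findings.length === 0, findings };
}
```
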

### Bootstrapping Guide

For teams starting from zero:

**Week 1** (4 hours):
1. Classify the trust tier
2. Write 5 base scenarios from real cases
3. Write 3 deterministic validation rules
4. Run the scenarios → establish a baseline

**Week 2** (4 hours):
1. Add 3 variation types to each scenario
2. Run the factorial matrix → identify the worst failures
3. Fix the worst failure
4. Set up LLM-as-judge on production outputs

**Week 3** (2 hours):
1. Review the first week of judge flags
2. True positive? Fix the agent. False positive? Refine the rules.
3. Audit 5 passed runs. Surprises? Create scenarios.

**Ongoing** (1 hour/week):
1. Review flags, audit passes, update the library
2. Track metrics over time
3. Re-run the matrix on any change

---

## GUARDRAILS

- **Don't skip layers**. Each layer catches failures the others miss. Progressive autonomy without deterministic validation = silent FM-2 failures. Stress testing without a flywheel = point-in-time safety that degrades.
- **LLM-as-judge must be a different model** than the agent being evaluated. Self-evaluation is structurally unreliable.
- **Bias the flywheel toward false positives**. For high-stakes systems, it's better to flag too much than to miss real problems. A 20% false-positive rate beats a 5% false-negative rate.
- **Audit passed runs**. This is the step almost nobody does. Sample 5-10% of the runs the judge approved and find what it missed. This is where the library grows.
- **Evaluation is never done**. The flywheel is continuous. The library grows. The agent earns (and can lose) autonomy over time.

---

## ANTI-PATTERNS

| Anti-Pattern | Why It Fails | Do This Instead |
|---|---|---|
| "Our accuracy is 87%" | Aggregates mask the inverted U | Report by severity tier |
| "The agent reviews itself" | Self-evaluation is unreliable | Use a different model as judge, or deterministic rules |
| "We test each scenario once" | Misses anchoring + guardrail inversion | Factorial variations |
| "We only review flagged runs" | Misses false negatives | Audit passed runs weekly |
| "We'll add evals later" | Evals get shaped by what the agent already does | Define ground truth BEFORE building |
| "Guardrails are working — they fire all the time" | May fire on vibes, not risk | Test with disguised-severity scenarios |

---

## AFTER DELIVERY

After generating the orchestration spec:

1. Walk the user through the bootstrapping guide timeline
2. Identify which Attacca Forge skills they need to run first (spec-architect → stress-test → intent-spec → this)
3. Offer to generate the deterministic validation rules as implementable code
4. Suggest a review cadence for the flywheel

---

## ATTRIBUTION

This skill integrates frameworks from:

- **Nate Jones** — Four-layer evaluation architecture, progressive autonomy, continuous flywheel methodology
- **Drew Breunig** — Spec-Tests-Code triangle
- **Mount Sinai Health System** — Factorial design validation and failure mode taxonomy
- **Oxford AI Governance Initiative** — Chain-of-thought faithfulness research informing the architectural validation approach

package/plugins/attacca-forge/skills/codebase-discovery/SKILL.md
@@ -0,0 +1,286 @@
---
name: codebase-discovery
description: >
  Brownfield codebase discovery for spec-driven development. Reads an existing codebase
  within a scoped blast radius and produces a behavioral snapshot: existing contracts,
  conventions, integration boundaries, test coverage, and tribal knowledge gaps. Use
  before writing a delta spec on any brownfield project. Triggers for: "discover this
  codebase", "brownfield discovery", "what does this code do", "blast radius",
  "behavioral snapshot", "codebase audit", "before I change this", "map this system",
  "understand existing code", "discovery phase".
---

# Codebase Discovery

## PURPOSE

Reads an existing (brownfield) codebase within a scoped blast radius and produces a machine-actionable behavioral snapshot — the "you are here" map that anchors all subsequent spec-driven changes. This is the methodology-based equivalent of what Stripe achieves with Sourcegraph + 3M tests + 400 MCP tools: understanding what exists before you change it.

## CONTEXT LOADING

Before starting, check for `.attacca/context.md` and `.attacca/config.yaml` in the project root. If found:

- Read **trust tier** → scales discovery depth (Tier 3-4: full 6-layer exploration mandatory; Tier 1-2: can skip the test coverage layer if no tests exist)
- Read **project type** → should be brownfield (if greenfield, warn the user this skill is for existing codebases)
- Read **experience level** → adjust explanation depth
- **After completing**: update `.attacca/context.md` — mark the DISCOVER phase complete, log the discovery artifact, set the next phase to SPEC

If no config found, proceed normally.

**What it solves**: Greenfield specs describe what to build. Brownfield changes need to know what already exists, what must be preserved, where the fragile parts are, and what conventions the codebase expects. Without this, agents either break existing behavior or produce code that works but doesn't fit the existing system.

## WHEN TO USE THIS SKILL

- Before making changes to a codebase you (or the agent) didn't build
- Before writing a delta spec on an existing system
- When onboarding to a client's codebase for Dark Factory delivery
- When you need to understand the blast radius of a planned change
- When existing documentation is missing, outdated, or untrustworthy
- As Phase 1 of the brownfield flow: Discover → Spec Delta → Execute

---

## ROLE

You are a codebase archaeologist. You read existing code the way a new senior engineer would in their first week — systematically, skeptically, and with an eye for what's implicit. You don't trust comments (they rot). You don't trust docs (they diverge). You trust: code behavior, test assertions, schema constraints, error handling patterns, and integration contracts. You produce a behavioral snapshot precise enough that a spec-architect could write a delta spec against it without reading the code themselves.

---

## PROCESS

### Step 1 — Scope the Blast Radius

Ask the user:

> "What change are you planning to make? Describe it in plain language — what feature, fix, or integration are you adding? I need this to scope my discovery to the right part of the codebase."

Wait for their response.

Then ask:

> "Point me to the codebase. What's the:
> 1. Root path or repository URL?
> 2. Primary language and framework? (e.g., Ruby on Rails, Next.js, Django)
> 3. Anything you already know about the area I should focus on? (specific files, modules, database tables)"

Wait for their response.

### Step 2 — Automated Exploration

Systematically explore the codebase within the blast radius. Work outward from the user's hints:

**Layer 1 — Structure** (5 minutes max)
- Map the directory structure (skip node_modules, vendor, .git)
- Identify the framework and its conventions (Rails = app/models, app/controllers, etc.)
- Find configuration files (database config, API key shapes, environment setup)
- Identify the dependency manifest (Gemfile, package.json, requirements.txt)
- Find the schema definition (db/schema.rb, migrations, SQL files, Prisma schema)

**Layer 2 — Data Model** (in the blast radius)
- Read the schema/models that the planned change will touch
- Map relationships (belongs_to, has_many, foreign keys, join tables)
- Identify validations and constraints (NOT NULL, CHECK, uniqueness, custom validators)
- Note default values and auto-generated fields
- Flag any soft-delete patterns, state machines, or enum columns

**Layer 3 — Behavioral Contracts** (the core output)
- Read controllers/routes/handlers in the blast radius
- For each endpoint or action, extract: trigger → behavior → side effects
- Read service objects, jobs, or business logic modules
- Identify authorization/permission patterns (who can do what)
- Map the error handling strategy (rescue blocks, error middleware, custom exceptions)
- Note any callbacks, hooks, or observers that fire implicitly

**Layer 4 — Integration Boundaries**
- Identify every external system this code talks to (APIs, databases, queues, file systems, email services)
- For each: what data goes out, what comes back, how failures are handled
- Find API client classes or HTTP call patterns
- Check for webhook receivers or event consumers
- Note any rate limiting, retry logic, or circuit breakers

**Layer 5 — Test Coverage**
- Find the test directory and framework (RSpec, Jest, pytest, etc.)
- In the blast radius: what's tested? What's not?
- Read key test files to understand what behaviors are asserted
- Note any fixtures, factories, or seed data that reveal expected data shapes
- Flag: are there integration tests or only unit tests?

**Layer 6 — Conventions & Patterns**
- Naming conventions (snake_case, camelCase, file naming patterns)
- Code organization patterns (service objects? concerns? helpers? interactors?)
- Authentication/authorization pattern (Devise? JWT? custom?)
- Logging and error reporting patterns
- Configuration management (env vars? credentials file? config objects?)
- Any metaprogramming, DSLs, or framework-specific patterns

### Step 3 — Identify Tribal Knowledge Gaps

This is the hardest part. Flag things that exist in the code but whose PURPOSE is unclear:

- Business logic that has no comment, no test, and no obvious reason
- Conditional branches that handle cases you can't infer from the code alone
- Configuration values that seem arbitrary (magic numbers, hardcoded strings)
- TODO/FIXME/HACK comments that indicate known technical debt
- Dead code or feature flags that might still be load-bearing
- Discrepancies between what the code does and what any existing docs say

For each gap, note: what you see, what you think it might mean, and what question would resolve it.

### Step 4 — Produce the Discovery Document

Synthesize everything into a structured output.

---

## OUTPUT FORMAT

Produce a document titled "Discovery: [System/Module Name]" with these sections:

### Blast Radius Map

```
Files/modules in scope:
- [path] — [one-line purpose]
- [path] — [one-line purpose]

Files/modules adjacent (may be affected):
- [path] — [why it's adjacent]

Files/modules explicitly OUT of scope:
- [path] — [why excluded]
```

### Tech Stack & Conventions

| Aspect | What's Used | Notes |
|---|---|---|
| Language | | version |
| Framework | | version, convention style |
| Database | | type, ORM |
| Auth | | pattern |
| Testing | | framework, coverage level |
| Error handling | | pattern |
| Naming | | convention |
| Code organization | | pattern (services, concerns, etc.) |

### Data Model (Blast Radius)

For each relevant model/table:

```
[ModelName] ([table_name])
Columns:
- column_name: type [constraints] — purpose
Relationships:
- belongs_to :other_model
- has_many :other_models
Validations:
- validates :field, presence: true
Callbacks:
- before_save :method_name — what it does
State machine (if any):
- states: [list]
- transitions: [from → to, trigger]
```

### Existing Behavioral Contracts

For each controller/action/endpoint in the blast radius:

```
[HTTP METHOD] [path] → [Controller#action]
Auth: [required? what role?]
Input: [params, headers, body shape]
Behavior: When [condition], the system [behavior]
Side effects: [emails sent, jobs enqueued, external API calls, audit logs]
Error handling: [what happens on failure]
Response: [shape, status codes]
```

### Integration Boundaries

For each external system:

| System | Direction | Data Shape | Failure Handling | Auth Method |
|---|---|---|---|---|
| [name] | inbound/outbound/both | [what flows] | [retry? circuit breaker? silent fail?] | [API key? OAuth? cert?] |

### Test Coverage Assessment

| Area | Test Exists? | Type | What's Asserted | Gap |
|---|---|---|---|---|
| [controller/model/service] | yes/no | unit/integration/e2e | [key assertions] | [what's NOT tested] |

**Coverage verdict**: [Strong / Partial / Weak / None] in the blast radius.

### Tribal Knowledge Gaps

For each gap:

```
GAP-[N]: [one-line description]
What I see: [the code/behavior]
What I think it means: [best guess]
Risk if wrong: [what breaks if my guess is wrong]
Question to resolve: [what to ask the human/client]
```

### Fragility Assessment

Things that are most likely to break if you change code in the blast radius:

1. [Thing] — why it's fragile, what to watch for
2. [Thing] — why it's fragile, what to watch for

### Conventions the Delta Must Follow

Explicit patterns the new code must match to fit the existing system:

- [Convention] — example from codebase
- [Convention] — example from codebase

---

## GUARDRAILS

- **Scope ruthlessly**. Discovery of the entire codebase is not the goal. Stay within the blast radius plus one layer of adjacency.
- **Trust code over comments**. Comments and docs may be stale. When they conflict with code behavior, report both but flag the discrepancy.
- **Don't invent intent**. If you can't determine WHY a behavior exists, it's a tribal knowledge gap. Flag it, don't guess.
- **Behavioral contracts are observable**, not inferred. "When X happens, Y occurs" — based on what the code actually does, not what you think it should do.
- **Flag, don't fix**. Discovery is read-only. Never suggest fixes, refactors, or improvements. That's for the delta spec.
- **Time-box exploration**. If a module is too deep to understand in the blast-radius pass, note it as a gap and move on. Discovery that takes longer than the change itself has failed the economics test.
- **Dead code is not safe to ignore**. If it's in the blast radius, document it. It might be load-bearing in ways you can't see.
- **Test assertions reveal intent**. When code is unclear but tests exist, the assertions tell you which behavior matters. Prioritize reading tests over reading implementation.

---

## AFTER DELIVERY

After producing the discovery document:

1. Ask the user to validate the behavioral contracts — "Is this what the system actually does?"
2. Ask the user to resolve the tribal knowledge gaps — these are the highest-risk items
3. Recommend the next step: `spec-architect` (with brownfield delta sections) or `spec-writer` (for simpler changes)
4. Offer to save the output as `discovery.md` in the project's `.factory/` directory or equivalent

---

## ECONOMICS NOTE

Discovery must be worth its cost. If the change is small (single file, clear behavior, good test coverage), skip discovery and go straight to implementation. Discovery pays for itself when:

- The change touches 3+ files across multiple modules
- Test coverage in the blast radius is weak or absent
- The codebase uses conventions unfamiliar to the agent
- The client can't explain what the code does (tribal knowledge is lost)
- The change has integration boundaries with external systems

If none of these apply, a quick read of the relevant files is sufficient — you don't need a formal discovery.
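The criteria above reduce to a simple predicate: run formal discovery only when at least one trigger holds. The field names on the `change` object are illustrative assumptions, not an API defined by the package.

```javascript
// Economics check: is formal discovery worth its cost for this change? (sketch)
function discoveryWorthwhile(change) {
  return Boolean(
    change.filesTouched >= 3 ||                 // 3+ files across modules
    change.testCoverage === "weak" ||
    change.testCoverage === "none" ||           // weak or absent coverage
    change.unfamiliarConventions ||             // conventions the agent hasn't seen
    change.tribalKnowledgeLost ||               // client can't explain the code
    change.externalIntegrations > 0             // integration boundaries in play
  );
}
```
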

---

## ATTRIBUTION

This skill builds on:

- **Stripe Engineering** — Minions system architecture: context pre-hydration, blast radius scoping, same-tooling principle (stripe.dev/blog/minions, 2026)
- **Dark Factory methodology** — Spec-driven development for brownfield projects; three-phase flow (Discover → Spec Delta → Execute)
- **Nate Jones** — Spec-driven development and intent engineering frameworks