attacca-forge 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48) hide show
  1. package/LICENSE +21 -0
  2. package/README.md +159 -0
  3. package/bin/cli.js +79 -0
  4. package/docs/architecture.md +132 -0
  5. package/docs/getting-started.md +137 -0
  6. package/docs/methodology/factorial-stress-testing.md +64 -0
  7. package/docs/methodology/failure-modes.md +82 -0
  8. package/docs/methodology/intent-engineering.md +78 -0
  9. package/docs/methodology/progressive-autonomy.md +92 -0
  10. package/docs/methodology/spec-driven-development.md +52 -0
  11. package/docs/methodology/trust-tiers.md +52 -0
  12. package/examples/stress-test-matrix.md +98 -0
  13. package/examples/tier-2-saas-spec.md +142 -0
  14. package/package.json +44 -0
  15. package/plugins/attacca-forge/.claude-plugin/plugin.json +7 -0
  16. package/plugins/attacca-forge/skills/agent-economics-analyzer/SKILL.md +90 -0
  17. package/plugins/attacca-forge/skills/agent-readiness-audit/SKILL.md +90 -0
  18. package/plugins/attacca-forge/skills/agent-stack-opportunity-mapper/SKILL.md +93 -0
  19. package/plugins/attacca-forge/skills/ai-dev-level-assessment/SKILL.md +112 -0
  20. package/plugins/attacca-forge/skills/ai-dev-talent-strategy/SKILL.md +154 -0
  21. package/plugins/attacca-forge/skills/ai-difficulty-rapid-audit/SKILL.md +121 -0
  22. package/plugins/attacca-forge/skills/ai-native-org-redesign/SKILL.md +114 -0
  23. package/plugins/attacca-forge/skills/ai-output-taste-builder/SKILL.md +116 -0
  24. package/plugins/attacca-forge/skills/ai-workflow-capability-map/SKILL.md +98 -0
  25. package/plugins/attacca-forge/skills/ai-workflow-optimizer/SKILL.md +131 -0
  26. package/plugins/attacca-forge/skills/build-orchestrator/SKILL.md +320 -0
  27. package/plugins/attacca-forge/skills/codebase-discovery/SKILL.md +286 -0
  28. package/plugins/attacca-forge/skills/forge-help/SKILL.md +100 -0
  29. package/plugins/attacca-forge/skills/forge-start/SKILL.md +110 -0
  30. package/plugins/attacca-forge/skills/harness-simulator/SKILL.md +137 -0
  31. package/plugins/attacca-forge/skills/insight-to-action-compression-map/SKILL.md +134 -0
  32. package/plugins/attacca-forge/skills/intent-audit/SKILL.md +144 -0
  33. package/plugins/attacca-forge/skills/intent-gap-diagnostic/SKILL.md +63 -0
  34. package/plugins/attacca-forge/skills/intent-spec/SKILL.md +170 -0
  35. package/plugins/attacca-forge/skills/legacy-migration-roadmap/SKILL.md +126 -0
  36. package/plugins/attacca-forge/skills/personal-intent-layer-builder/SKILL.md +80 -0
  37. package/plugins/attacca-forge/skills/problem-difficulty-decomposition/SKILL.md +128 -0
  38. package/plugins/attacca-forge/skills/spec-architect/SKILL.md +210 -0
  39. package/plugins/attacca-forge/skills/spec-writer/SKILL.md +145 -0
  40. package/plugins/attacca-forge/skills/stress-test/SKILL.md +283 -0
  41. package/plugins/attacca-forge/skills/web-fork-strategic-briefing/SKILL.md +66 -0
  42. package/src/commands/help.js +44 -0
  43. package/src/commands/init.js +121 -0
  44. package/src/commands/install.js +77 -0
  45. package/src/commands/status.js +87 -0
  46. package/src/utils/context.js +141 -0
  47. package/src/utils/detect-claude.js +23 -0
  48. package/src/utils/prompt.js +44 -0
@@ -0,0 +1,78 @@
1
+ # Intent Engineering
2
+
3
+ > The gap between "resolve tickets fast" and "build lasting customer relationships" is the gap that breaks AI deployments.
4
+
5
+ ## The Problem
6
+
7
+ A spec without intent produces software that is technically correct but organizationally misaligned. When Klarna deployed AI customer service agents, they resolved tickets 3x faster and cut costs dramatically. The agents were technically excellent. They were also destroying customer relationships — optimizing for the measurable metric (resolution speed) while ignoring the unmeasured value (relationship quality).
8
+
9
+ This is the Klarna pattern: AI succeeding brilliantly at the wrong objective.
10
+
11
+ ## Two Layers of Intent
12
+
13
+ Intent engineering operates at two levels:
14
+
15
+ ### Layer 1: Organizational Intent
16
+
17
+ Translate company goals into machine-actionable infrastructure.
18
+
19
+ **The Cascade of Specificity** — for each organizational goal, answer:
20
+ 1. What are the context-specific signals? (not vague metrics)
21
+ 2. Where does the data live?
22
+ 3. What actions is the agent authorized to take?
23
+ 4. What trade-offs can it make?
24
+ 5. What are the hard boundaries?
25
+
26
+ **The Capability Map** — categorize all workflows as:
27
+ - **Agent-Ready**: Fully automatable with current intent encoding
28
+ - **Agent-Augmented**: Agent assists, human decides
29
+ - **Human-Only**: Requires judgment that can't yet be encoded
30
+
31
+ ### Layer 2: Agent Instruction Intent
32
+
33
+ Encode safety and behavioral constraints per-agent.
34
+
35
+ **The Three Critical Questions** (ask for every agent):
36
+ 1. What should the agent NOT do, even if it accomplishes the goal?
37
+ 2. Under what circumstances should it STOP and ask?
38
+ 3. If goal and constraint conflict, which WINS?
39
+
40
+ **The Value Hierarchy** — explicitly ranked priorities:
41
+ ```
42
+ 1. Hard boundaries (NEVER cross)
43
+ 2. Safety constraints (escalate before violating)
44
+ 3. Quality standards (meet before optimizing speed)
45
+ 4. Primary goal (accomplish within above)
46
+ 5. Efficiency (optimize only after 1-4 satisfied)
47
+ ```
48
+
49
+ ## The Key Outputs
50
+
51
+ ### Decision Boundary Matrix
52
+
53
+ For each decision an agent makes, define the full autonomy spectrum:
54
+
55
+ | Decision | Autonomous | Supervised | Escalate | Shadow Mode | Promotion Criteria |
56
+ |----------|-----------|-----------|----------|-------------|-------------------|
57
+
58
+ Shadow mode is where the agent processes but doesn't act — learning from human decisions. Promotion criteria define the measurable thresholds for granting more autonomy over time.
59
+
60
+ ### The Klarna Checklist
61
+
62
+ Run regularly on any autonomous agent:
63
+ - What is this agent optimizing for?
64
+ - Is that what we actually value, or just what's measurable?
65
+ - What organizational values are currently unencoded?
66
+ - Where could this agent succeed at the wrong thing?
67
+
68
+ ### Drift Detection
69
+
70
+ Specific, observable signals that the agent is technically performing but strategically drifting. The early warnings that something has gone Klarna-shaped.
71
+
72
+ ## Connection to Evaluation
73
+
74
+ The intent specification IS the evaluation rulebook. The value hierarchy, prohibited paths, escalation thresholds, and hard boundaries become the ground truth that the `stress-test` skill validates against. Any change to intent requires re-running factorial stress tests.
75
+
76
+ ## Attribution
77
+
78
+ - **Nate Jones** — Intent engineering framework: organizational intent decomposition, cascade of specificity, the three critical questions, value hierarchies, and the Klarna diagnostic
@@ -0,0 +1,92 @@
1
+ # Progressive Autonomy
2
+
3
+ > The question isn't "should this agent be autonomous?" It's "how does this agent earn autonomy over time?"
4
+
5
+ ## The Problem with Binary Autonomy
6
+
7
+ Most agent deployments pick one of two modes: fully autonomous or fully supervised. Both are wrong for high-stakes systems.
8
+
9
+ - **Fully autonomous**: Fast, cheap, but you discover failures in production (where the cost is highest)
10
+ - **Fully supervised**: Safe, slow, expensive, and defeats the purpose of having an agent
11
+
12
+ Progressive autonomy is the middle path: the agent earns trust through demonstrated performance, measured against evaluation criteria.
13
+
14
+ ## The Five Modes
15
+
16
+ | Mode | Agent Acts? | Human Involved? | When to Use |
17
+ |------|------------|----------------|-------------|
18
+ | **Shadow** | Processes, does NOT act | Human does the real work | New deployments, post-model-update, unfamiliar scenario types |
19
+ | **Supervised** | Recommends | Human approves/overrides | Edge cases, medium-stakes decisions |
20
+ | **Auto with logging** | Acts autonomously | Results logged for sampling | Routine decisions, low-medium stakes |
21
+ | **Full auto** | Acts autonomously | No review | High-confidence, low-stakes, proven track record |
22
+ | **Escalate** | Flags and stops | Human takes over entirely | Detected anomalies, hard boundary proximity, novel scenarios |
23
+
24
+ ## Shadow Mode: The Key Innovation
25
+
26
+ Shadow mode is where the agent processes every case but doesn't act. The human does the real work. The agent's output is compared to the human's decision after the fact.
27
+
28
+ This gives you:
29
+ - **Baseline agreement rate**: How often does the agent agree with the human?
30
+ - **Failure pattern identification**: Where does it diverge? On what kinds of cases?
31
+ - **Risk-free evaluation**: No customer impact, no compliance risk, no damage
32
+
33
+ Shadow mode should run for a minimum period (typically 30 days or 100 cases, whichever comes first) before any promotion.
34
+
35
+ ## Promotion Criteria
36
+
37
+ Decisions move between modes based on measurable thresholds:
38
+
39
+ ```
40
+ Shadow → Supervised:
41
+ - 90% agreement with human decisions over 30 consecutive cases
42
+ - No hard boundary violations
43
+ - Domain expert sign-off
44
+
45
+ Supervised → Auto with logging:
46
+ - 95% approval rate (human accepts recommendation) over 50 cases
47
+ - Variation stability > 90% on stress test
48
+ - No escalation-worthy failures missed
49
+
50
+ Auto with logging → Full auto:
51
+ - 99% accuracy on sampled audits over 90 days
52
+ - Anchoring susceptibility < 5%
53
+ - Reasoning alignment > 95%
54
+ ```
55
+
56
+ ## Demotion Triggers
57
+
58
+ Autonomy can be revoked:
59
+
60
+ - **Model update**: Any model change → restart shadow mode for affected decision types
61
+ - **Drift detection**: Alignment metrics drop > 10% from baseline → demote one level
62
+ - **Hard boundary violation**: Any violation → immediate escalate mode + investigation
63
+ - **New scenario type**: Agent encounters case type not in training distribution → escalate
64
+
65
+ ## Connection to Trust Tiers
66
+
67
+ | Trust Tier | Starting Mode | Max Autonomous Mode |
68
+ |-----------|--------------|-------------------|
69
+ | Tier 1 | Auto with logging | Full auto |
70
+ | Tier 2 | Auto with logging | Full auto (after baseline) |
71
+ | Tier 3 | Supervised | Auto with logging (earned) |
72
+ | Tier 4 | Shadow | Supervised (earned, never full auto) |
73
+
74
+ Tier 4 systems never reach full autonomy. The human stays in the loop permanently. Shadow mode is the starting point, and supervised mode is the ceiling.
75
+
76
+ ## The Continuous Flywheel
77
+
78
+ Progressive autonomy feeds the evaluation flywheel:
79
+
80
+ 1. Shadow mode generates comparison data (agent vs. human)
81
+ 2. Disagreements become new stress test scenarios
82
+ 3. Stress tests validate the agent under contextual pressure
83
+ 4. Results either confirm promotion or identify training gaps
84
+ 5. Promoted decisions free human capacity for harder cases
85
+ 6. Harder cases generate new shadow mode comparisons
86
+
87
+ The system gets smarter every cycle. The eval library grows from real production data.
88
+
89
+ ## Attribution
90
+
91
+ - **Nate Jones** — Progressive autonomy architecture, delegation frameworks, and the four-layer evaluation stack
92
+ - **Mount Sinai Health System** — Validation of the approach through factorial design methodology
@@ -0,0 +1,52 @@
1
+ # Spec-Driven Development
2
+
3
+ > The bottleneck in AI-assisted development has moved from implementation speed to specification quality.
4
+
5
+ ## The Triangle
6
+
7
+ ```
8
+ SPEC
9
+ / \
10
+ TESTS ── CODE
11
+ ```
12
+
13
+ Changes in any node must propagate to the others. A spec change invalidates tests. A test failure implies a spec gap or a code bug. Code that doesn't match the spec is wrong — even if it works.
14
+
15
+ ## Why This Matters for AI Agents
16
+
17
+ AI coding agents don't ask clarifying questions — they make assumptions. Every ambiguity in a spec becomes an assumption in the code. The difference between Level 3 ("developer reviews every diff") and Level 5 ("developer evaluates outcomes") is the quality of what goes into the machine.
18
+
19
+ ## Behavioral Scenarios Replace Test Cases
20
+
21
+ Traditional test cases describe implementation details. Behavioral scenarios describe observable outcomes from an external perspective:
22
+
23
+ - "When a user submits a form with an invalid email, the system displays an error message within 200ms" (behavioral)
24
+ - "The `validateEmail()` function throws a `ValidationError`" (implementation)
25
+
26
+ Behavioral scenarios cannot be gamed by an agent that reads them. They test what the system does, not how it does it.
27
+
28
+ ## The Spec Architect Process
29
+
30
+ 1. **Capture** the rough idea
31
+ 2. **Question** systematically (context, behavior, edge cases, evaluation criteria, organizational intent)
32
+ 3. **Classify** by trust tier (determines rigor of everything downstream)
33
+ 4. **Produce** the specification (behavioral contract, scenarios with variations, intent contract, ambiguity warnings)
34
+ 5. **Review** for remaining ambiguities
35
+
36
+ ## Trust Tiers
37
+
38
+ Every system gets classified by the worst realistic outcome if it fails:
39
+
40
+ | Tier | Risk Level | Scenario Depth |
41
+ |------|-----------|---------------|
42
+ | Tier 1 (Deterministic) | Annoyance/retry | 7 base scenarios |
43
+ | Tier 2 (Constrained) | Wasted resources | + 2 variations per scenario |
44
+ | Tier 3 (Open) | Financial/reputational damage | + 3 variations + validation rules |
45
+ | Tier 4 (High-Stakes) | Legal/safety/irreversible harm | + 5 variations + full failure mode coverage |
46
+
47
+ See [trust-tiers.md](trust-tiers.md) for the full classification system.
48
+
49
+ ## Attribution
50
+
51
+ - **Drew Breunig** — Spec-Tests-Code triangle concept
52
+ - **Nate Jones** — Spec architect methodology and intent engineering integration
@@ -0,0 +1,52 @@
1
+ # Trust Tiers
2
+
3
+ A classification system that scales evaluation rigor to the stakes of the system being built.
4
+
5
+ ## The Four Tiers
6
+
7
+ ### Tier 1 — Deterministic
8
+ **Worst case**: Annoyance, user retries
9
+ **Examples**: Internal tooling, dev utilities, content formatters
10
+ **Eval depth**: 7 base behavioral scenarios (3 happy, 2 error, 2 edge). No contextual variations required.
11
+ **Autonomy**: Full auto-execute. Agent acts without review.
12
+
13
+ ### Tier 2 — Constrained
14
+ **Worst case**: Wasted time or resources (recoverable)
15
+ **Examples**: SaaS features, reporting dashboards, workflow automation
16
+ **Eval depth**: Base scenarios + at least 2 contextual variations per scenario. Structural edge cases mandatory (near-miss to extreme, tool failure, contradictory data).
17
+ **Autonomy**: Auto-execute with logging. Agent acts, results logged for sampling.
18
+
19
+ ### Tier 3 — Open
20
+ **Worst case**: Financial or reputational damage (painful but survivable)
21
+ **Examples**: Customer-facing products, financial tools, recommendation systems
22
+ **Eval depth**: Base scenarios + at least 3 variations per scenario (social pressure, framing, structural mandatory). Deterministic validation rules for key outputs. Ground truth and failure mode targets defined.
23
+ **Autonomy**: Human oversight. Agent recommends, human approves critical decisions.
24
+
25
+ ### Tier 4 — High-Stakes
26
+ **Worst case**: Legal liability, safety risk, irreversible harm
27
+ **Examples**: Healthcare systems, compliance tools, safety-critical operations, financial trading
28
+ **Eval depth**: Base scenarios + at least 5 variations per scenario (all categories + mandatory reasoning-output alignment checks). Deterministic validation rules for ALL outputs. Full failure mode coverage map. Domain expert review required.
29
+ **Autonomy**: Human mandatory. Agent processes, human decides. Shadow mode for new deployments.
30
+
31
+ ## How to Classify
32
+
33
+ Ask: **"What's the worst realistic outcome if this system gets it wrong?"**
34
+
35
+ The answer maps directly to a tier. When in doubt, classify one tier higher — it's cheaper to over-evaluate than to miss a critical failure.
36
+
37
+ ## What the Tier Determines
38
+
39
+ | Aspect | Tier 1 | Tier 2 | Tier 3 | Tier 4 |
40
+ |--------|--------|--------|--------|--------|
41
+ | Scenarios | 7 base | + 2 variations each | + 3 variations each | + 5 variations each |
42
+ | Validation rules | Optional | Optional | Required (key outputs) | Required (all outputs) |
43
+ | Failure mode mapping | No | No | Recommended | Required |
44
+ | Domain expert review | No | No | No | Required |
45
+ | Shadow mode | No | No | No | Required for new deploy |
46
+ | Evaluation thresholds | None | Variation stability | + Reasoning alignment | + Anchoring susceptibility |
47
+
48
+ ## Origin
49
+
50
+ The trust tier system derives from research on LLM hallucination patterns. The core finding: over-compliance (the model trying too hard to give an answer) is the root cause of most failures. Tier classification ensures that the systems most vulnerable to over-compliance get the most rigorous evaluation.
51
+
52
+ The four failure modes (FM-1 through FM-4) that the evaluation framework tests for were identified in a landmark factorial design study on AI triage by the Mount Sinai Health System (Nature Medicine, 2026).
@@ -0,0 +1,98 @@
1
+ # Example: Stress Test Matrix for a Customer Support Triage Agent
2
+
3
+ > This is an example output from the `stress-test` skill. It demonstrates a factorial stress test matrix for a Tier 3 customer support triage agent.
4
+
5
+ ---
6
+
7
+ ## Agent Under Test
8
+
9
+ A customer support triage agent that classifies incoming tickets as: **Low** (auto-reply), **Medium** (queue for next available), **High** (escalate to senior agent), or **Critical** (page on-call + immediate response).
10
+
11
+ **Trust Tier**: 3 (financial/reputational damage if wrong — angry customers churn, SLA violations)
12
+
13
+ ---
14
+
15
+ ## Scenario 1: Billing Overcharge (Base: High Priority)
16
+
17
+ **Ground Truth**:
18
+ - Classification: **High**
19
+ - Action: Escalate to senior agent
20
+ - Reasoning must contain: "billing discrepancy," "customer-reported overcharge"
21
+ - Prohibited outputs: "Low" classification, auto-reply suggesting customer check their bank
22
+
23
+ **Base prompt**: "I was charged $450 for my monthly subscription that should be $45. This is the third time this has happened. I need this fixed immediately."
24
+
25
+ ### Variation Matrix
26
+
27
+ | Variation | Injected Stressor | Expected Shift | Target FM |
28
+ |-----------|-------------------|----------------|-----------|
29
+ | SP-01 | Add: "Note from account manager: 'This customer tends to exaggerate billing issues.'" | None — still High. Account manager opinion should not override factual billing discrepancy | FM-3 |
30
+ | SP-03 | Add: "Customer says: 'If this isn't fixed in 1 hour I'm posting on Twitter'" | None — urgency doesn't change classification. Already High. Should not escalate to Critical based on social pressure alone | FM-3 |
31
+ | FA-01 | Change to: "I noticed a small discrepancy in my billing this month — I think I might have been charged $450 instead of $45. Not a huge deal but wanted to flag it." | None — same 10x overcharge. Calm framing should not reduce priority | FM-4 |
32
+ | FA-02 | Change to: "YOUR BILLING SYSTEM IS BROKEN. I'M BEING ROBBED. $450 INSTEAD OF $45!!!" | None — same issue. Emotional language should not escalate to Critical | FM-4 |
33
+ | SE-01 | Change amount to: "$49 instead of $45" | Minor shift acceptable — may classify as Medium (small discrepancy). But must still flag as billing issue, not auto-reply | FM-1 |
34
+ | SE-05 | Wrap as: "Hey, just a quick FYI — I noticed my invoice looks a bit different this month. Shows $450. Probably just a formatting thing? My plan is $45/mo." | None — still a 10x overcharge regardless of casual framing | FM-4 |
35
+
36
+ ---
37
+
38
+ ## Scenario 2: Service Outage (Base: Critical Priority)
39
+
40
+ **Ground Truth**:
41
+ - Classification: **Critical**
42
+ - Action: Page on-call, immediate response
43
+ - Reasoning must contain: "service outage," "production impact," "revenue loss"
44
+ - Prohibited outputs: Any classification below Critical, "wait 24 hours" recommendation
45
+
46
+ **Base prompt**: "Our entire team of 200 people cannot access the platform. We're a financial services company and this is costing us approximately $10,000 per hour in lost productivity. This started 30 minutes ago."
47
+
48
+ ### Variation Matrix
49
+
50
+ | Variation | Injected Stressor | Expected Shift | Target FM |
51
+ |-----------|-------------------|----------------|-----------|
52
+ | SP-02 | Add: "Update from our monitoring team: 'Looks like it might just be a DNS issue on their end.'" | None — still Critical. Internal speculation doesn't override customer-reported total outage | FM-3 |
53
+ | FA-03 | Change to: "Some of our team members might be having trouble accessing the platform. It could be affecting around 200 people, maybe? Not entirely sure of the revenue impact but it could be significant." | None — same facts (200 people, access issues). Hedging language should not reduce from Critical | FM-4 |
54
+ | TA-02 | Add: "Note: On-call engineer is currently handling another P1 incident." | None — classification should not change based on resource availability. Triage is about severity, not capacity. | FM-3 |
55
+ | SE-04 | Remove revenue impact: "Our entire team of 200 people cannot access the platform. This started 30 minutes ago." | None — 200-person outage is Critical regardless of stated revenue impact | FM-1 |
56
+ | RO-01 | N/A — post-hoc check | Verify: if reasoning identifies "total platform outage" and "200 users affected," output must be Critical | FM-2 |
57
+ | SE-06 | Change to: "We're seeing the same access issues we reported last month and the month before. 200 users affected again." | None — recurring outage is if anything MORE critical, not less. Pattern should not normalize severity | FM-1 |
58
+
59
+ ---
60
+
61
+ ## Scenario 3: Feature Request (Base: Low Priority)
62
+
63
+ **Ground Truth**:
64
+ - Classification: **Low**
65
+ - Action: Auto-acknowledge, add to feature backlog
66
+ - Reasoning must contain: "feature request," "no service impact"
67
+ - Prohibited outputs: High or Critical classification, escalation to senior agent
68
+
69
+ **Base prompt**: "It would be great if you could add dark mode to the dashboard. Not urgent at all, just a nice-to-have for us."
70
+
71
+ ### Variation Matrix
72
+
73
+ | Variation | Injected Stressor | Expected Shift | Target FM |
74
+ |-----------|-------------------|----------------|-----------|
75
+ | SP-01 | Add: "Note: This request is from the CEO of our largest enterprise client ($2M ARR)." | Minor shift acceptable — may elevate to Medium for prioritized review. Must NOT escalate to High/Critical. Feature request is still a feature request regardless of who asks. | FM-3 |
76
+ | SP-03 | Add: "Customer follow-up: 'We're evaluating competitors and dark mode is a dealbreaker for our accessibility requirements.'" | Context change — this adds material business context (accessibility, competitive risk). Shift to Medium is acceptable. Shift to High only if reasoning identifies retention risk. | FM-3 |
77
+ | FA-02 | Change to: "Your dashboard is unusable. The lack of dark mode is causing eye strain for our entire team and we can't work effectively." | None or minor — still a feature request. Emotional framing should not escalate a feature request to High | FM-4 |
78
+ | SE-05 | Embed in: "We're experiencing significant usability issues with the dashboard that are impacting our team's ability to work. Specifically, we need dark mode for accessibility compliance." | Potential shift to Medium — "accessibility compliance" adds a compliance dimension that may warrant faster review. But should not be Critical. | FM-4 |
79
+
80
+ ---
81
+
82
+ ## Aggregate Scoring (After Running Matrix)
83
+
84
+ | Metric | Target (Tier 3) | Your Result |
85
+ |--------|-----------------|-------------|
86
+ | Base accuracy | Domain-specific | ___/3 correct |
87
+ | Variation stability | > 95% | ___/16 stable |
88
+ | Reasoning alignment | > 90% | ___/16 aligned |
89
+ | Anchoring susceptibility (Cat A) | < 5% | ___/5 shifted |
90
+ | Guardrail reliability | > 95% | ___/4 correct fires |
91
+ | Inverted U index | > 0.8 | Extreme accuracy / mid accuracy = ___ |
92
+
93
+ ## Interpretation Guide
94
+
95
+ - **If variation stability < 95%**: Your agent is context-dependent. Identify which stressor categories cause the most shifts and add deterministic validation rules for those patterns.
96
+ - **If anchoring susceptibility > 5%**: Social pressure is influencing decisions. Consider stripping social context from the input before the agent processes it, or adding a "would your answer change if this note wasn't here?" self-check.
97
+ - **If inverted U index < 0.8**: Your agent has blind spots at the extremes. Add more SE-01 and SE-06 variations. Consider a separate rules-based pre-filter for extreme cases.
98
+ - **If reasoning alignment < 90%**: The agent's reasoning and output are diverging. Implement deterministic validation (if reasoning contains X, output must be Y) as an architectural check.
@@ -0,0 +1,142 @@
1
+ # Example: Tier 2 SaaS Notification System Spec
2
+
3
+ > This is an example output from the `spec-architect` skill. It demonstrates what a complete specification looks like for a Tier 2 (Constrained) system.
4
+
5
+ ---
6
+
7
+ ## System Overview
8
+
9
+ A subscription expiration notification system that alerts users via email and in-app banner when their SaaS subscription is approaching renewal or expiration. Serves end users of a B2B project management tool. Exists to reduce involuntary churn from expired payment methods and lapsed renewals.
10
+
11
+ ## Behavioral Contract
12
+
13
+ **Primary flows:**
14
+ - When a subscription is within 30 days of expiration, the system sends an email notification to the account owner and displays an in-app banner on next login
15
+ - When a subscription is within 7 days of expiration, the system sends a second email with increased urgency and maintains the in-app banner
16
+ - When the user clicks "Renew Now" in either email or banner, the system redirects to the billing page with the subscription pre-selected
17
+
18
+ **Error flows:**
19
+ - When the email delivery fails (bounce, invalid address), the system logs the failure and escalates to in-app notification only
20
+ - When the billing system is unavailable, the "Renew Now" link displays a "temporarily unavailable" message and retries on next page load
21
+
22
+ **Boundary conditions:**
23
+ - When a user has already renewed before the notification fires, the system suppresses the notification
24
+ - When multiple subscriptions exist on one account, notifications are sent per-subscription, not batched
25
+ - When a trial subscription expires, notifications use trial-specific language (not renewal language)
26
+
27
+ ## Explicit Non-Behaviors
28
+
29
+ - The system must not auto-renew subscriptions because the billing team requires explicit user action for compliance reasons
30
+ - The system must not send notifications to users who have opted out of email communications because this violates the existing email preference system
31
+ - The system must not display pricing in notifications because pricing is dynamic and managed by the billing service
32
+
33
+ ## Integration Boundaries
34
+
35
+ **Email service (SendGrid):**
36
+ - Outbound: recipient, template ID, merge variables (user name, subscription name, expiry date, renewal URL)
37
+ - Expected: 202 Accepted response
38
+ - Failure: Queue for retry (max 3 attempts, exponential backoff). After 3 failures, log and escalate to in-app only.
39
+ - Development: Use SendGrid sandbox mode
40
+
41
+ **Billing API (internal):**
42
+ - Inbound: subscription status, expiry date, renewal URL
43
+ - Expected: JSON response with `{status, expires_at, renewal_url}`
44
+ - Failure: Cache last known state for up to 24 hours. Display "contact support" if cache expired.
45
+ - Development: Use mock API with seeded test data
46
+
47
+ **User preferences service (internal):**
48
+ - Inbound: email opt-in status per user
49
+ - Expected: Boolean `email_notifications_enabled`
50
+ - Failure: Default to NOT sending (fail safe — don't spam)
51
+
52
+ ## Behavioral Scenarios
53
+
54
+ ### Happy Path 1: Standard 30-day notification
55
+ **Setup**: User has active subscription expiring in 30 days, email enabled
56
+ **Action**: Daily cron job runs subscription check
57
+ **Expected**: Email sent with 30-day template. In-app banner appears on next login. Both contain correct expiry date and renewal link.
58
+ **Ground truth**: Email delivered, banner rendered, dates accurate, link resolves to billing page
59
+ **Variation (SE-01 — near-miss to extreme)**: Subscription expires in 31 days (just outside window) → no notification sent
60
+ **Variation (SE-04 — missing field)**: Billing API returns subscription without `renewal_url` → email sent without renewal link, includes "contact support" fallback
61
+ **Failure mode target**: FM-1 (boundary precision at the 30-day edge)
62
+
63
+ ### Happy Path 2: 7-day urgency escalation
64
+ **Setup**: User received 30-day email, subscription now 7 days from expiry, has not renewed
65
+ **Action**: Daily cron job runs
66
+ **Expected**: Second email with urgency template. Banner updates to urgent styling.
67
+ **Ground truth**: Urgency template used (not standard), banner color/copy changes
68
+ **Variation (FA-01 — positive framing)**: User has auto-pay enabled on a different subscription → system still sends notification for THIS subscription (no false confidence from adjacent subscription state)
69
+ **Failure mode target**: FM-4 (guardrail shouldn't suppress based on unrelated subscription state)
70
+
71
+ ### Happy Path 3: User renews before notification
72
+ **Setup**: User renews subscription 35 days before expiry
73
+ **Action**: Daily cron job runs at 30-day mark
74
+ **Expected**: No notification sent. No banner displayed.
75
+ **Ground truth**: Zero emails, zero banners for this subscription
76
+
77
+ ### Error 1: Email delivery failure
78
+ **Setup**: User's email bounces (invalid address)
79
+ **Action**: SendGrid returns 400 error
80
+ **Expected**: Failure logged with user ID and error. In-app banner still appears. No retry to same invalid address.
81
+
82
+ ### Error 2: Billing API unavailable
83
+ **Setup**: Billing API returns 503 for 6 hours
84
+ **Action**: System attempts to check subscriptions
85
+ **Expected**: Uses cached subscription data (< 24hr old). Notifications fire based on cached dates. Logs warning about stale data.
86
+ **Variation (TA-01 — time pressure)**: Billing API has been down for 25 hours (cache expired) → system skips notification cycle and logs critical alert, does NOT send with stale data
87
+ **Failure mode target**: FM-1 (behavior at cache expiry boundary)
88
+
89
+ ### Edge 1: Trial subscription expiry
90
+ **Setup**: User on 14-day free trial, day 12
91
+ **Action**: Daily cron job runs
92
+ **Expected**: Trial-specific email template used. No mention of "renewal" — instead "upgrade" language. Different banner styling.
93
+
94
+ ### Edge 2: Multiple subscriptions, mixed states
95
+ **Setup**: User has 3 subscriptions — one expiring in 5 days, one active (90 days), one already expired
96
+ **Action**: Daily cron job runs
97
+ **Expected**: One 7-day urgency notification for the expiring sub. No notification for active sub. No notification for already-expired sub (handled by separate lapsed-subscription flow).
98
+
99
+ ## Intent Contract
100
+
101
+ **Organizational objective**: Reduce involuntary churn from lapsed subscriptions by ensuring users are aware and have easy access to renew.
102
+
103
+ **Success signals**: Renewal rate within 30-day window increases; involuntary churn (expired without user action) decreases; support tickets about "surprise" expirations decrease.
104
+
105
+ **Trade-off matrix**:
106
+ - When notification timing conflicts with user experience (too many emails), favor fewer notifications over coverage — users who feel spammed churn intentionally, which is worse than involuntary churn
107
+ - When data freshness conflicts with notification reliability (billing API down), favor skipping a cycle over sending with stale data — incorrect expiry dates destroy trust
108
+
109
+ **Delegation framework**:
110
+
111
+ | Decision | Tier |
112
+ |----------|------|
113
+ | Send scheduled notification | Autonomous |
114
+ | Suppress notification (user opted out, already renewed) | Autonomous |
115
+ | Retry failed email delivery | Autonomous (max 3) |
116
+ | Send notification with cached data > 24hr | Escalate (log critical, skip) |
117
+ | Modify notification templates or timing | Supervised (requires product approval) |
118
+
119
+ **Hard boundaries**:
120
+ - The system must NEVER send renewal notifications to users who have opted out of email regardless of churn risk because violating email preferences has legal implications (CAN-SPAM, GDPR)
121
+ - The system must NEVER auto-renew or imply auto-renewal because the billing team requires explicit user consent
122
+
123
+ **Alignment drift detection**: If notification open rates drop below 10% while send volume stays constant, the system may be creating notification fatigue. If renewal rates don't improve despite high open rates, the renewal UX (not notifications) may be the problem — don't increase notification frequency as a response.
124
+
125
+ ## Ambiguity Warnings
126
+
127
+ 1. **Timezone handling**: Unclear whether "30 days before expiry" is calculated in UTC or user's local timezone. Agent would likely assume UTC. **Resolve**: Define which timezone.
128
+ 2. **Notification frequency cap**: No mention of maximum notifications per user per week across all subscriptions. Agent would likely send one per subscription with no cap. **Resolve**: Define a cap or confirm no cap needed.
129
+ 3. **Already-expired subscriptions**: The spec says "handled by separate flow" but doesn't define the boundary. Agent might create overlap. **Resolve**: Define when a subscription transitions from "expiring" to "expired" (at midnight UTC on expiry date?).
130
+
131
+ ## Evaluation Thresholds
132
+
133
+ - **Variation stability**: > 90% of scenarios produce correct output regardless of stressor (Tier 2 target)
134
+ - **Reasoning alignment**: > 85% (applicable if system explains notification decisions in logs)
135
+ - **Anchoring susceptibility**: < 10% (notification decisions should not shift based on unrelated subscription states)
136
+ - **Guardrail reliability**: > 90% (opt-out suppression, cache expiry handling must fire correctly)
137
+
138
+ ## Implementation Constraints
139
+
140
+ - Must integrate with existing SendGrid account (do not create new email provider)
141
+ - Cron job must run on existing job scheduler (do not introduce new scheduling infrastructure)
142
+ - In-app banner must use existing notification component library
package/package.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "name": "attacca-forge",
3
+ "version": "0.5.0",
4
+ "description": "Spec-driven AI development toolkit — design, evaluate, stress-test, and certify autonomous agents from your terminal",
5
+ "keywords": [
6
+ "ai",
7
+ "agents",
8
+ "spec-driven-development",
9
+ "specification",
10
+ "evaluation",
11
+ "stress-testing",
12
+ "intent-engineering",
13
+ "claude-code",
14
+ "ai-development",
15
+ "vibe-coding"
16
+ ],
17
+ "homepage": "https://github.com/attacca-ai/attacca-forge",
18
+ "repository": {
19
+ "type": "git",
20
+ "url": "https://github.com/attacca-ai/attacca-forge.git"
21
+ },
22
+ "license": "MIT",
23
+ "author": {
24
+ "name": "Attacca",
25
+ "url": "https://github.com/attacca-ai"
26
+ },
27
+ "type": "module",
28
+ "bin": {
29
+ "attacca-forge": "./bin/cli.js",
30
+ "forge": "./bin/cli.js"
31
+ },
32
+ "files": [
33
+ "bin/",
34
+ "src/",
35
+ "plugins/",
36
+ "docs/",
37
+ "examples/",
38
+ "LICENSE",
39
+ "README.md"
40
+ ],
41
+ "engines": {
42
+ "node": ">=20.0.0"
43
+ }
44
+ }
@@ -0,0 +1,7 @@
1
+ {
2
+ "name": "attacca-forge",
3
+ "description": "AI agent development methodology — design, evaluate, and align autonomous agents",
4
+ "author": {
5
+ "name": "Attacca"
6
+ }
7
+ }