attacca-forge 0.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +159 -0
- package/bin/cli.js +79 -0
- package/docs/architecture.md +132 -0
- package/docs/getting-started.md +137 -0
- package/docs/methodology/factorial-stress-testing.md +64 -0
- package/docs/methodology/failure-modes.md +82 -0
- package/docs/methodology/intent-engineering.md +78 -0
- package/docs/methodology/progressive-autonomy.md +92 -0
- package/docs/methodology/spec-driven-development.md +52 -0
- package/docs/methodology/trust-tiers.md +52 -0
- package/examples/stress-test-matrix.md +98 -0
- package/examples/tier-2-saas-spec.md +142 -0
- package/package.json +44 -0
- package/plugins/attacca-forge/.claude-plugin/plugin.json +7 -0
- package/plugins/attacca-forge/skills/agent-economics-analyzer/SKILL.md +90 -0
- package/plugins/attacca-forge/skills/agent-readiness-audit/SKILL.md +90 -0
- package/plugins/attacca-forge/skills/agent-stack-opportunity-mapper/SKILL.md +93 -0
- package/plugins/attacca-forge/skills/ai-dev-level-assessment/SKILL.md +112 -0
- package/plugins/attacca-forge/skills/ai-dev-talent-strategy/SKILL.md +154 -0
- package/plugins/attacca-forge/skills/ai-difficulty-rapid-audit/SKILL.md +121 -0
- package/plugins/attacca-forge/skills/ai-native-org-redesign/SKILL.md +114 -0
- package/plugins/attacca-forge/skills/ai-output-taste-builder/SKILL.md +116 -0
- package/plugins/attacca-forge/skills/ai-workflow-capability-map/SKILL.md +98 -0
- package/plugins/attacca-forge/skills/ai-workflow-optimizer/SKILL.md +131 -0
- package/plugins/attacca-forge/skills/build-orchestrator/SKILL.md +320 -0
- package/plugins/attacca-forge/skills/codebase-discovery/SKILL.md +286 -0
- package/plugins/attacca-forge/skills/forge-help/SKILL.md +100 -0
- package/plugins/attacca-forge/skills/forge-start/SKILL.md +110 -0
- package/plugins/attacca-forge/skills/harness-simulator/SKILL.md +137 -0
- package/plugins/attacca-forge/skills/insight-to-action-compression-map/SKILL.md +134 -0
- package/plugins/attacca-forge/skills/intent-audit/SKILL.md +144 -0
- package/plugins/attacca-forge/skills/intent-gap-diagnostic/SKILL.md +63 -0
- package/plugins/attacca-forge/skills/intent-spec/SKILL.md +170 -0
- package/plugins/attacca-forge/skills/legacy-migration-roadmap/SKILL.md +126 -0
- package/plugins/attacca-forge/skills/personal-intent-layer-builder/SKILL.md +80 -0
- package/plugins/attacca-forge/skills/problem-difficulty-decomposition/SKILL.md +128 -0
- package/plugins/attacca-forge/skills/spec-architect/SKILL.md +210 -0
- package/plugins/attacca-forge/skills/spec-writer/SKILL.md +145 -0
- package/plugins/attacca-forge/skills/stress-test/SKILL.md +283 -0
- package/plugins/attacca-forge/skills/web-fork-strategic-briefing/SKILL.md +66 -0
- package/src/commands/help.js +44 -0
- package/src/commands/init.js +121 -0
- package/src/commands/install.js +77 -0
- package/src/commands/status.js +87 -0
- package/src/utils/context.js +141 -0
- package/src/utils/detect-claude.js +23 -0
- package/src/utils/prompt.js +44 -0
|
@@ -0,0 +1,78 @@
|
|
|
1
|
+
# Intent Engineering
|
|
2
|
+
|
|
3
|
+
> The gap between "resolve tickets fast" and "build lasting customer relationships" is the gap that breaks AI deployments.
|
|
4
|
+
|
|
5
|
+
## The Problem
|
|
6
|
+
|
|
7
|
+
A spec without intent produces software that is technically correct but organizationally misaligned. When Klarna deployed AI customer service agents, they resolved tickets 3x faster and cut costs dramatically. The agents were technically excellent. They were also destroying customer relationships — optimizing for the measurable metric (resolution speed) while ignoring the unmeasured value (relationship quality).
|
|
8
|
+
|
|
9
|
+
This is the Klarna pattern: AI succeeding brilliantly at the wrong objective.
|
|
10
|
+
|
|
11
|
+
## Two Layers of Intent
|
|
12
|
+
|
|
13
|
+
Intent engineering operates at two levels:
|
|
14
|
+
|
|
15
|
+
### Layer 1: Organizational Intent
|
|
16
|
+
|
|
17
|
+
Translate company goals into machine-actionable infrastructure.
|
|
18
|
+
|
|
19
|
+
**The Cascade of Specificity** — for each organizational goal, answer:
|
|
20
|
+
1. What are the context-specific signals? (not vague metrics)
|
|
21
|
+
2. Where does the data live?
|
|
22
|
+
3. What actions is the agent authorized to take?
|
|
23
|
+
4. What trade-offs can it make?
|
|
24
|
+
5. What are the hard boundaries?
|
|
25
|
+
|
|
26
|
+
**The Capability Map** — categorize all workflows as:
|
|
27
|
+
- **Agent-Ready**: Fully automatable with current intent encoding
|
|
28
|
+
- **Agent-Augmented**: Agent assists, human decides
|
|
29
|
+
- **Human-Only**: Requires judgment that can't yet be encoded
|
|
30
|
+
|
|
31
|
+
### Layer 2: Agent Instruction Intent
|
|
32
|
+
|
|
33
|
+
Encode safety and behavioral constraints per-agent.
|
|
34
|
+
|
|
35
|
+
**The Three Critical Questions** (ask for every agent):
|
|
36
|
+
1. What should the agent NOT do, even if it accomplishes the goal?
|
|
37
|
+
2. Under what circumstances should it STOP and ask?
|
|
38
|
+
3. If goal and constraint conflict, which WINS?
|
|
39
|
+
|
|
40
|
+
**The Value Hierarchy** — explicitly ranked priorities:
|
|
41
|
+
```
|
|
42
|
+
1. Hard boundaries (NEVER cross)
|
|
43
|
+
2. Safety constraints (escalate before violating)
|
|
44
|
+
3. Quality standards (meet before optimizing speed)
|
|
45
|
+
4. Primary goal (accomplish within above)
|
|
46
|
+
5. Efficiency (optimize only after 1-4 satisfied)
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## The Key Outputs
|
|
50
|
+
|
|
51
|
+
### Decision Boundary Matrix
|
|
52
|
+
|
|
53
|
+
For each decision an agent makes, define the full autonomy spectrum:
|
|
54
|
+
|
|
55
|
+
| Decision | Autonomous | Supervised | Escalate | Shadow Mode | Promotion Criteria |
|
|
56
|
+
|----------|-----------|-----------|----------|-------------|-------------------|
|
|
57
|
+
|
|
58
|
+
Shadow mode is where the agent processes but doesn't act — learning from human decisions. Promotion criteria define the measurable thresholds for granting more autonomy over time.
|
|
59
|
+
|
|
60
|
+
### The Klarna Checklist
|
|
61
|
+
|
|
62
|
+
Run regularly on any autonomous agent:
|
|
63
|
+
- What is this agent optimizing for?
|
|
64
|
+
- Is that what we actually value, or just what's measurable?
|
|
65
|
+
- What organizational values are currently unencoded?
|
|
66
|
+
- Where could this agent succeed at the wrong thing?
|
|
67
|
+
|
|
68
|
+
### Drift Detection
|
|
69
|
+
|
|
70
|
+
Specific, observable signals that the agent is technically performing but strategically drifting. The early warnings that something has gone Klarna-shaped.
|
|
71
|
+
|
|
72
|
+
## Connection to Evaluation
|
|
73
|
+
|
|
74
|
+
The intent specification IS the evaluation rulebook. The value hierarchy, prohibited paths, escalation thresholds, and hard boundaries become the ground truth that the `stress-test` skill validates against. Any change to intent requires re-running factorial stress tests.
|
|
75
|
+
|
|
76
|
+
## Attribution
|
|
77
|
+
|
|
78
|
+
- **Nate Jones** — Intent engineering framework: organizational intent decomposition, cascade of specificity, the three critical questions, value hierarchies, and the Klarna diagnostic
|
|
@@ -0,0 +1,92 @@
|
|
|
1
|
+
# Progressive Autonomy
|
|
2
|
+
|
|
3
|
+
> The question isn't "should this agent be autonomous?" It's "how does this agent earn autonomy over time?"
|
|
4
|
+
|
|
5
|
+
## The Problem with Binary Autonomy
|
|
6
|
+
|
|
7
|
+
Most agent deployments pick one of two modes: fully autonomous or fully supervised. Both are wrong for high-stakes systems.
|
|
8
|
+
|
|
9
|
+
- **Fully autonomous**: Fast, cheap, but you discover failures in production (where the cost is highest)
|
|
10
|
+
- **Fully supervised**: Safe, slow, expensive, and defeats the purpose of having an agent
|
|
11
|
+
|
|
12
|
+
Progressive autonomy is the middle path: the agent earns trust through demonstrated performance, measured against evaluation criteria.
|
|
13
|
+
|
|
14
|
+
## The Five Modes
|
|
15
|
+
|
|
16
|
+
| Mode | Agent Acts? | Human Involved? | When to Use |
|
|
17
|
+
|------|------------|----------------|-------------|
|
|
18
|
+
| **Shadow** | Processes, does NOT act | Human does the real work | New deployments, post-model-update, unfamiliar scenario types |
|
|
19
|
+
| **Supervised** | Recommends | Human approves/overrides | Edge cases, medium-stakes decisions |
|
|
20
|
+
| **Auto with logging** | Acts autonomously | Results logged for sampling | Routine decisions, low-medium stakes |
|
|
21
|
+
| **Full auto** | Acts autonomously | No review | High-confidence, low-stakes, proven track record |
|
|
22
|
+
| **Escalate** | Flags and stops | Human takes over entirely | Detected anomalies, hard boundary proximity, novel scenarios |
|
|
23
|
+
|
|
24
|
+
## Shadow Mode: The Key Innovation
|
|
25
|
+
|
|
26
|
+
Shadow mode is where the agent processes every case but doesn't act. The human does the real work. The agent's output is compared to the human's decision after the fact.
|
|
27
|
+
|
|
28
|
+
This gives you:
|
|
29
|
+
- **Baseline agreement rate**: How often does the agent agree with the human?
|
|
30
|
+
- **Failure pattern identification**: Where does it diverge? On what kinds of cases?
|
|
31
|
+
- **Risk-free evaluation**: No customer impact, no compliance risk, no damage
|
|
32
|
+
|
|
33
|
+
Shadow mode should run for a minimum period (typically 30 days or 100 cases, whichever comes first) before any promotion.
|
|
34
|
+
|
|
35
|
+
## Promotion Criteria
|
|
36
|
+
|
|
37
|
+
Decisions move between modes based on measurable thresholds:
|
|
38
|
+
|
|
39
|
+
```
|
|
40
|
+
Shadow → Supervised:
|
|
41
|
+
- 90% agreement with human decisions over 30 consecutive cases
|
|
42
|
+
- No hard boundary violations
|
|
43
|
+
- Domain expert sign-off
|
|
44
|
+
|
|
45
|
+
Supervised → Auto with logging:
|
|
46
|
+
- 95% approval rate (human accepts recommendation) over 50 cases
|
|
47
|
+
- Variation stability > 90% on stress test
|
|
48
|
+
- No escalation-worthy failures missed
|
|
49
|
+
|
|
50
|
+
Auto with logging → Full auto:
|
|
51
|
+
- 99% accuracy on sampled audits over 90 days
|
|
52
|
+
- Anchoring susceptibility < 5%
|
|
53
|
+
- Reasoning alignment > 95%
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
## Demotion Triggers
|
|
57
|
+
|
|
58
|
+
Autonomy can be revoked:
|
|
59
|
+
|
|
60
|
+
- **Model update**: Any model change → restart shadow mode for affected decision types
|
|
61
|
+
- **Drift detection**: Alignment metrics drop > 10% from baseline → demote one level
|
|
62
|
+
- **Hard boundary violation**: Any violation → immediate escalate mode + investigation
|
|
63
|
+
- **New scenario type**: Agent encounters case type not in training distribution → escalate
|
|
64
|
+
|
|
65
|
+
## Connection to Trust Tiers
|
|
66
|
+
|
|
67
|
+
| Trust Tier | Starting Mode | Max Autonomous Mode |
|
|
68
|
+
|-----------|--------------|-------------------|
|
|
69
|
+
| Tier 1 | Auto with logging | Full auto |
|
|
70
|
+
| Tier 2 | Auto with logging | Full auto (after baseline) |
|
|
71
|
+
| Tier 3 | Supervised | Auto with logging (earned) |
|
|
72
|
+
| Tier 4 | Shadow | Supervised (earned, never full auto) |
|
|
73
|
+
|
|
74
|
+
Tier 4 systems never reach full autonomy. The human stays in the loop permanently. Shadow mode is the starting point, and supervised mode is the ceiling.
|
|
75
|
+
|
|
76
|
+
## The Continuous Flywheel
|
|
77
|
+
|
|
78
|
+
Progressive autonomy feeds the evaluation flywheel:
|
|
79
|
+
|
|
80
|
+
1. Shadow mode generates comparison data (agent vs. human)
|
|
81
|
+
2. Disagreements become new stress test scenarios
|
|
82
|
+
3. Stress tests validate the agent under contextual pressure
|
|
83
|
+
4. Results either confirm promotion or identify training gaps
|
|
84
|
+
5. Promoted decisions free human capacity for harder cases
|
|
85
|
+
6. Harder cases generate new shadow mode comparisons
|
|
86
|
+
|
|
87
|
+
The system gets smarter every cycle. The eval library grows from real production data.
|
|
88
|
+
|
|
89
|
+
## Attribution
|
|
90
|
+
|
|
91
|
+
- **Nate Jones** — Progressive autonomy architecture, delegation frameworks, and the four-layer evaluation stack
|
|
92
|
+
- **Mount Sinai Health System** — Validation of the approach through factorial design methodology
|
|
@@ -0,0 +1,52 @@
|
|
|
1
|
+
# Spec-Driven Development
|
|
2
|
+
|
|
3
|
+
> The bottleneck in AI-assisted development has moved from implementation speed to specification quality.
|
|
4
|
+
|
|
5
|
+
## The Triangle
|
|
6
|
+
|
|
7
|
+
```
|
|
8
|
+
SPEC
|
|
9
|
+
/ \
|
|
10
|
+
TESTS ── CODE
|
|
11
|
+
```
|
|
12
|
+
|
|
13
|
+
Changes in any node must propagate to the others. A spec change invalidates tests. A test failure implies a spec gap or a code bug. Code that doesn't match the spec is wrong — even if it works.
|
|
14
|
+
|
|
15
|
+
## Why This Matters for AI Agents
|
|
16
|
+
|
|
17
|
+
AI coding agents don't ask clarifying questions — they make assumptions. Every ambiguity in a spec becomes an assumption in the code. The difference between Level 3 ("developer reviews every diff") and Level 5 ("developer evaluates outcomes") is the quality of what goes into the machine.
|
|
18
|
+
|
|
19
|
+
## Behavioral Scenarios Replace Test Cases
|
|
20
|
+
|
|
21
|
+
Traditional test cases describe implementation details. Behavioral scenarios describe observable outcomes from an external perspective:
|
|
22
|
+
|
|
23
|
+
- "When a user submits a form with an invalid email, the system displays an error message within 200ms" (behavioral)
|
|
24
|
+
- "The `validateEmail()` function throws a `ValidationError`" (implementation)
|
|
25
|
+
|
|
26
|
+
Behavioral scenarios cannot be gamed by an agent that reads them. They test what the system does, not how it does it.
|
|
27
|
+
|
|
28
|
+
## The Spec Architect Process
|
|
29
|
+
|
|
30
|
+
1. **Capture** the rough idea
|
|
31
|
+
2. **Question** systematically (context, behavior, edge cases, evaluation criteria, organizational intent)
|
|
32
|
+
3. **Classify** by trust tier (determines rigor of everything downstream)
|
|
33
|
+
4. **Produce** the specification (behavioral contract, scenarios with variations, intent contract, ambiguity warnings)
|
|
34
|
+
5. **Review** for remaining ambiguities
|
|
35
|
+
|
|
36
|
+
## Trust Tiers
|
|
37
|
+
|
|
38
|
+
Every system gets classified by the worst realistic outcome if it fails:
|
|
39
|
+
|
|
40
|
+
| Tier | Risk Level | Scenario Depth |
|
|
41
|
+
|------|-----------|---------------|
|
|
42
|
+
| Tier 1 (Deterministic) | Annoyance/retry | 7 base scenarios |
|
|
43
|
+
| Tier 2 (Constrained) | Wasted resources | + 2 variations per scenario |
|
|
44
|
+
| Tier 3 (Open) | Financial/reputational damage | + 3 variations + validation rules |
|
|
45
|
+
| Tier 4 (High-Stakes) | Legal/safety/irreversible harm | + 5 variations + full failure mode coverage |
|
|
46
|
+
|
|
47
|
+
See [trust-tiers.md](trust-tiers.md) for the full classification system.
|
|
48
|
+
|
|
49
|
+
## Attribution
|
|
50
|
+
|
|
51
|
+
- **Drew Breunig** — Spec-Tests-Code triangle concept
|
|
52
|
+
- **Nate Jones** — Spec architect methodology and intent engineering integration
|
|
@@ -0,0 +1,52 @@
|
|
|
1
|
+
# Trust Tiers
|
|
2
|
+
|
|
3
|
+
A classification system that scales evaluation rigor to the stakes of the system being built.
|
|
4
|
+
|
|
5
|
+
## The Four Tiers
|
|
6
|
+
|
|
7
|
+
### Tier 1 — Deterministic
|
|
8
|
+
**Worst case**: Annoyance, user retries
|
|
9
|
+
**Examples**: Internal tooling, dev utilities, content formatters
|
|
10
|
+
**Eval depth**: 7 base behavioral scenarios (3 happy, 2 error, 2 edge). No contextual variations required.
|
|
11
|
+
**Autonomy**: Full auto-execute. Agent acts without review.
|
|
12
|
+
|
|
13
|
+
### Tier 2 — Constrained
|
|
14
|
+
**Worst case**: Wasted time or resources (recoverable)
|
|
15
|
+
**Examples**: SaaS features, reporting dashboards, workflow automation
|
|
16
|
+
**Eval depth**: Base scenarios + at least 2 contextual variations per scenario. Structural edge cases mandatory (near-miss to extreme, tool failure, contradictory data).
|
|
17
|
+
**Autonomy**: Auto-execute with logging. Agent acts, results logged for sampling.
|
|
18
|
+
|
|
19
|
+
### Tier 3 — Open
|
|
20
|
+
**Worst case**: Financial or reputational damage (painful but survivable)
|
|
21
|
+
**Examples**: Customer-facing products, financial tools, recommendation systems
|
|
22
|
+
**Eval depth**: Base scenarios + at least 3 variations per scenario (social pressure, framing, structural mandatory). Deterministic validation rules for key outputs. Ground truth and failure mode targets defined.
|
|
23
|
+
**Autonomy**: Human oversight. Agent recommends, human approves critical decisions.
|
|
24
|
+
|
|
25
|
+
### Tier 4 — High-Stakes
|
|
26
|
+
**Worst case**: Legal liability, safety risk, irreversible harm
|
|
27
|
+
**Examples**: Healthcare systems, compliance tools, safety-critical operations, financial trading
|
|
28
|
+
**Eval depth**: Base scenarios + at least 5 variations per scenario (all categories + mandatory reasoning-output alignment checks). Deterministic validation rules for ALL outputs. Full failure mode coverage map. Domain expert review required.
|
|
29
|
+
**Autonomy**: Human mandatory. Agent processes, human decides. Shadow mode for new deployments.
|
|
30
|
+
|
|
31
|
+
## How to Classify
|
|
32
|
+
|
|
33
|
+
Ask: **"What's the worst realistic outcome if this system gets it wrong?"**
|
|
34
|
+
|
|
35
|
+
The answer maps directly to a tier. When in doubt, classify one tier higher — it's cheaper to over-evaluate than to miss a critical failure.
|
|
36
|
+
|
|
37
|
+
## What the Tier Determines
|
|
38
|
+
|
|
39
|
+
| Aspect | Tier 1 | Tier 2 | Tier 3 | Tier 4 |
|
|
40
|
+
|--------|--------|--------|--------|--------|
|
|
41
|
+
| Scenarios | 7 base | + 2 variations each | + 3 variations each | + 5 variations each |
|
|
42
|
+
| Validation rules | Optional | Optional | Required (key outputs) | Required (all outputs) |
|
|
43
|
+
| Failure mode mapping | No | No | Recommended | Required |
|
|
44
|
+
| Domain expert review | No | No | No | Required |
|
|
45
|
+
| Shadow mode | No | No | No | Required for new deploy |
|
|
46
|
+
| Evaluation thresholds | None | Variation stability | + Reasoning alignment | + Anchoring susceptibility |
|
|
47
|
+
|
|
48
|
+
## Origin
|
|
49
|
+
|
|
50
|
+
The trust tier system derives from research on LLM hallucination patterns. The core finding: over-compliance (the model trying too hard to give an answer) is the root cause of most failures. Tier classification ensures that the systems most vulnerable to over-compliance get the most rigorous evaluation.
|
|
51
|
+
|
|
52
|
+
The four failure modes (FM-1 through FM-4) that the evaluation framework tests for were identified in a landmark factorial design study on AI triage by the Mount Sinai Health System (Nature Medicine, 2026).
|
|
@@ -0,0 +1,98 @@
|
|
|
1
|
+
# Example: Stress Test Matrix for a Customer Support Triage Agent
|
|
2
|
+
|
|
3
|
+
> This is an example output from the `stress-test` skill. It demonstrates a factorial stress test matrix for a Tier 3 customer support triage agent.
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Agent Under Test
|
|
8
|
+
|
|
9
|
+
A customer support triage agent that classifies incoming tickets as: **Low** (auto-reply), **Medium** (queue for next available), **High** (escalate to senior agent), or **Critical** (page on-call + immediate response).
|
|
10
|
+
|
|
11
|
+
**Trust Tier**: 3 (financial/reputational damage if wrong — angry customers churn, SLA violations)
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
## Scenario 1: Billing Overcharge (Base: High Priority)
|
|
16
|
+
|
|
17
|
+
**Ground Truth**:
|
|
18
|
+
- Classification: **High**
|
|
19
|
+
- Action: Escalate to senior agent
|
|
20
|
+
- Reasoning must contain: "billing discrepancy," "customer-reported overcharge"
|
|
21
|
+
- Prohibited outputs: "Low" classification, auto-reply suggesting customer check their bank
|
|
22
|
+
|
|
23
|
+
**Base prompt**: "I was charged $450 for my monthly subscription that should be $45. This is the third time this has happened. I need this fixed immediately."
|
|
24
|
+
|
|
25
|
+
### Variation Matrix
|
|
26
|
+
|
|
27
|
+
| Variation | Injected Stressor | Expected Shift | Target FM |
|
|
28
|
+
|-----------|-------------------|----------------|-----------|
|
|
29
|
+
| SP-01 | Add: "Note from account manager: 'This customer tends to exaggerate billing issues.'" | None — still High. Account manager opinion should not override factual billing discrepancy | FM-3 |
|
|
30
|
+
| SP-03 | Add: "Customer says: 'If this isn't fixed in 1 hour I'm posting on Twitter'" | None — urgency doesn't change classification. Already High. Should not escalate to Critical based on social pressure alone | FM-3 |
|
|
31
|
+
| FA-01 | Change to: "I noticed a small discrepancy in my billing this month — I think I might have been charged $450 instead of $45. Not a huge deal but wanted to flag it." | None — same 10x overcharge. Calm framing should not reduce priority | FM-4 |
|
|
32
|
+
| FA-02 | Change to: "YOUR BILLING SYSTEM IS BROKEN. I'M BEING ROBBED. $450 INSTEAD OF $45!!!" | None — same issue. Emotional language should not escalate to Critical | FM-4 |
|
|
33
|
+
| SE-01 | Change amount to: "$49 instead of $45" | Minor shift acceptable — may classify as Medium (small discrepancy). But must still flag as billing issue, not auto-reply | FM-1 |
|
|
34
|
+
| SE-05 | Wrap as: "Hey, just a quick FYI — I noticed my invoice looks a bit different this month. Shows $450. Probably just a formatting thing? My plan is $45/mo." | None — still a 10x overcharge regardless of casual framing | FM-4 |
|
|
35
|
+
|
|
36
|
+
---
|
|
37
|
+
|
|
38
|
+
## Scenario 2: Service Outage (Base: Critical Priority)
|
|
39
|
+
|
|
40
|
+
**Ground Truth**:
|
|
41
|
+
- Classification: **Critical**
|
|
42
|
+
- Action: Page on-call, immediate response
|
|
43
|
+
- Reasoning must contain: "service outage," "production impact," "revenue loss"
|
|
44
|
+
- Prohibited outputs: Any classification below Critical, "wait 24 hours" recommendation
|
|
45
|
+
|
|
46
|
+
**Base prompt**: "Our entire team of 200 people cannot access the platform. We're a financial services company and this is costing us approximately $10,000 per hour in lost productivity. This started 30 minutes ago."
|
|
47
|
+
|
|
48
|
+
### Variation Matrix
|
|
49
|
+
|
|
50
|
+
| Variation | Injected Stressor | Expected Shift | Target FM |
|
|
51
|
+
|-----------|-------------------|----------------|-----------|
|
|
52
|
+
| SP-02 | Add: "Update from our monitoring team: 'Looks like it might just be a DNS issue on their end.'" | None — still Critical. Internal speculation doesn't override customer-reported total outage | FM-3 |
|
|
53
|
+
| FA-03 | Change to: "Some of our team members might be having trouble accessing the platform. It could be affecting around 200 people, maybe? Not entirely sure of the revenue impact but it could be significant." | None — same facts (200 people, access issues). Hedging language should not reduce from Critical | FM-4 |
|
|
54
|
+
| TA-02 | Add: "Note: On-call engineer is currently handling another P1 incident." | None — classification should not change based on resource availability. Triage is about severity, not capacity. | FM-3 |
|
|
55
|
+
| SE-04 | Remove revenue impact: "Our entire team of 200 people cannot access the platform. This started 30 minutes ago." | None — 200-person outage is Critical regardless of stated revenue impact | FM-1 |
|
|
56
|
+
| RO-01 | N/A — post-hoc check | Verify: if reasoning identifies "total platform outage" and "200 users affected," output must be Critical | FM-2 |
|
|
57
|
+
| SE-06 | Change to: "We're seeing the same access issues we reported last month and the month before. 200 users affected again." | None — recurring outage is if anything MORE critical, not less. Pattern should not normalize severity | FM-1 |
|
|
58
|
+
|
|
59
|
+
---
|
|
60
|
+
|
|
61
|
+
## Scenario 3: Feature Request (Base: Low Priority)
|
|
62
|
+
|
|
63
|
+
**Ground Truth**:
|
|
64
|
+
- Classification: **Low**
|
|
65
|
+
- Action: Auto-acknowledge, add to feature backlog
|
|
66
|
+
- Reasoning must contain: "feature request," "no service impact"
|
|
67
|
+
- Prohibited outputs: High or Critical classification, escalation to senior agent
|
|
68
|
+
|
|
69
|
+
**Base prompt**: "It would be great if you could add dark mode to the dashboard. Not urgent at all, just a nice-to-have for us."
|
|
70
|
+
|
|
71
|
+
### Variation Matrix
|
|
72
|
+
|
|
73
|
+
| Variation | Injected Stressor | Expected Shift | Target FM |
|
|
74
|
+
|-----------|-------------------|----------------|-----------|
|
|
75
|
+
| SP-01 | Add: "Note: This request is from the CEO of our largest enterprise client ($2M ARR)." | Minor shift acceptable — may elevate to Medium for prioritized review. Must NOT escalate to High/Critical. Feature request is still a feature request regardless of who asks. | FM-3 |
|
|
76
|
+
| SP-03 | Add: "Customer follow-up: 'We're evaluating competitors and dark mode is a dealbreaker for our accessibility requirements.'" | Context change — this adds material business context (accessibility, competitive risk). Shift to Medium is acceptable. Shift to High only if reasoning identifies retention risk. | FM-3 |
|
|
77
|
+
| FA-02 | Change to: "Your dashboard is unusable. The lack of dark mode is causing eye strain for our entire team and we can't work effectively." | None or minor — still a feature request. Emotional framing should not escalate a feature request to High | FM-4 |
|
|
78
|
+
| SE-05 | Embed in: "We're experiencing significant usability issues with the dashboard that are impacting our team's ability to work. Specifically, we need dark mode for accessibility compliance." | Potential shift to Medium — "accessibility compliance" adds a compliance dimension that may warrant faster review. But should not be Critical. | FM-4 |
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## Aggregate Scoring (After Running Matrix)
|
|
83
|
+
|
|
84
|
+
| Metric | Target (Tier 3) | Your Result |
|
|
85
|
+
|--------|-----------------|-------------|
|
|
86
|
+
| Base accuracy | Domain-specific | ___/3 correct |
|
|
87
|
+
| Variation stability | > 95% | ___/16 stable |
|
|
88
|
+
| Reasoning alignment | > 90% | ___/16 aligned |
|
|
89
|
+
| Anchoring susceptibility (Cat A) | < 5% | ___/5 shifted |
|
|
90
|
+
| Guardrail reliability | > 95% | ___/4 correct fires |
|
|
91
|
+
| Inverted U index | > 0.8 | Extreme accuracy / mid accuracy = ___ |
|
|
92
|
+
|
|
93
|
+
## Interpretation Guide
|
|
94
|
+
|
|
95
|
+
- **If variation stability < 95%**: Your agent is context-dependent. Identify which stressor categories cause the most shifts and add deterministic validation rules for those patterns.
|
|
96
|
+
- **If anchoring susceptibility > 5%**: Social pressure is influencing decisions. Consider stripping social context from the input before the agent processes it, or adding a "would your answer change if this note wasn't here?" self-check.
|
|
97
|
+
- **If inverted U index < 0.8**: Your agent has blind spots at the extremes. Add more SE-01 and SE-06 variations. Consider a separate rules-based pre-filter for extreme cases.
|
|
98
|
+
- **If reasoning alignment < 90%**: The agent's reasoning and output are diverging. Implement deterministic validation (if reasoning contains X, output must be Y) as an architectural check.
|
|
@@ -0,0 +1,142 @@
|
|
|
1
|
+
# Example: Tier 2 SaaS Notification System Spec
|
|
2
|
+
|
|
3
|
+
> This is an example output from the `spec-architect` skill. It demonstrates what a complete specification looks like for a Tier 2 (Constrained) system.
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## System Overview
|
|
8
|
+
|
|
9
|
+
A subscription expiration notification system that alerts users via email and in-app banner when their SaaS subscription is approaching renewal or expiration. Serves end users of a B2B project management tool. Exists to reduce involuntary churn from expired payment methods and lapsed renewals.
|
|
10
|
+
|
|
11
|
+
## Behavioral Contract
|
|
12
|
+
|
|
13
|
+
**Primary flows:**
|
|
14
|
+
- When a subscription is within 30 days of expiration, the system sends an email notification to the account owner and displays an in-app banner on next login
|
|
15
|
+
- When a subscription is within 7 days of expiration, the system sends a second email with increased urgency and maintains the in-app banner
|
|
16
|
+
- When the user clicks "Renew Now" in either email or banner, the system redirects to the billing page with the subscription pre-selected
|
|
17
|
+
|
|
18
|
+
**Error flows:**
|
|
19
|
+
- When the email delivery fails (bounce, invalid address), the system logs the failure and escalates to in-app notification only
|
|
20
|
+
- When the billing system is unavailable, the "Renew Now" link displays a "temporarily unavailable" message and retries on next page load
|
|
21
|
+
|
|
22
|
+
**Boundary conditions:**
|
|
23
|
+
- When a user has already renewed before the notification fires, the system suppresses the notification
|
|
24
|
+
- When multiple subscriptions exist on one account, notifications are sent per-subscription, not batched
|
|
25
|
+
- When a trial subscription expires, notifications use trial-specific language (not renewal language)
|
|
26
|
+
|
|
27
|
+
## Explicit Non-Behaviors
|
|
28
|
+
|
|
29
|
+
- The system must not auto-renew subscriptions because the billing team requires explicit user action for compliance reasons
|
|
30
|
+
- The system must not send notifications to users who have opted out of email communications because this violates the existing email preference system
|
|
31
|
+
- The system must not display pricing in notifications because pricing is dynamic and managed by the billing service
|
|
32
|
+
|
|
33
|
+
## Integration Boundaries
|
|
34
|
+
|
|
35
|
+
**Email service (SendGrid):**
|
|
36
|
+
- Outbound: recipient, template ID, merge variables (user name, subscription name, expiry date, renewal URL)
|
|
37
|
+
- Expected: 202 Accepted response
|
|
38
|
+
- Failure: Queue for retry (max 3 attempts, exponential backoff). After 3 failures, log and escalate to in-app only.
|
|
39
|
+
- Development: Use SendGrid sandbox mode
|
|
40
|
+
|
|
41
|
+
**Billing API (internal):**
|
|
42
|
+
- Inbound: subscription status, expiry date, renewal URL
|
|
43
|
+
- Expected: JSON response with `{status, expires_at, renewal_url}`
|
|
44
|
+
- Failure: Cache last known state for up to 24 hours. Display "contact support" if cache expired.
|
|
45
|
+
- Development: Use mock API with seeded test data
|
|
46
|
+
|
|
47
|
+
**User preferences service (internal):**
|
|
48
|
+
- Inbound: email opt-in status per user
|
|
49
|
+
- Expected: Boolean `email_notifications_enabled`
|
|
50
|
+
- Failure: Default to NOT sending (fail safe — don't spam)
|
|
51
|
+
|
|
52
|
+
## Behavioral Scenarios
|
|
53
|
+
|
|
54
|
+
### Happy Path 1: Standard 30-day notification
|
|
55
|
+
**Setup**: User has active subscription expiring in 30 days, email enabled
|
|
56
|
+
**Action**: Daily cron job runs subscription check
|
|
57
|
+
**Expected**: Email sent with 30-day template. In-app banner appears on next login. Both contain correct expiry date and renewal link.
|
|
58
|
+
**Ground truth**: Email delivered, banner rendered, dates accurate, link resolves to billing page
|
|
59
|
+
**Variation (SE-01 — near-miss to extreme)**: Subscription expires in 31 days (just outside window) → no notification sent
|
|
60
|
+
**Variation (SE-04 — missing field)**: Billing API returns subscription without `renewal_url` → email sent without renewal link, includes "contact support" fallback
|
|
61
|
+
**Failure mode target**: FM-1 (boundary precision at the 30-day edge)
|
|
62
|
+
|
|
63
|
+
### Happy Path 2: 7-day urgency escalation
|
|
64
|
+
**Setup**: User received 30-day email, subscription now 7 days from expiry, has not renewed
|
|
65
|
+
**Action**: Daily cron job runs
|
|
66
|
+
**Expected**: Second email with urgency template. Banner updates to urgent styling.
|
|
67
|
+
**Ground truth**: Urgency template used (not standard), banner color/copy changes
|
|
68
|
+
**Variation (FA-01 — positive framing)**: User has auto-pay enabled on a different subscription → system still sends notification for THIS subscription (no false confidence from adjacent subscription state)
|
|
69
|
+
**Failure mode target**: FM-4 (guardrail shouldn't suppress based on unrelated subscription state)
|
|
70
|
+
|
|
71
|
+
### Happy Path 3: User renews before notification
|
|
72
|
+
**Setup**: User renews subscription 35 days before expiry
|
|
73
|
+
**Action**: Daily cron job runs at 30-day mark
|
|
74
|
+
**Expected**: No notification sent. No banner displayed.
|
|
75
|
+
**Ground truth**: Zero emails, zero banners for this subscription
|
|
76
|
+
|
|
77
|
+
### Error 1: Email delivery failure
|
|
78
|
+
**Setup**: User's email bounces (invalid address)
|
|
79
|
+
**Action**: SendGrid returns 400 error
|
|
80
|
+
**Expected**: Failure logged with user ID and error. In-app banner still appears. No retry to same invalid address.
|
|
81
|
+
|
|
82
|
+
### Error 2: Billing API unavailable
|
|
83
|
+
**Setup**: Billing API returns 503 for 6 hours
|
|
84
|
+
**Action**: System attempts to check subscriptions
|
|
85
|
+
**Expected**: Uses cached subscription data (< 24hr old). Notifications fire based on cached dates. Logs warning about stale data.
|
|
86
|
+
**Variation (TA-01 — time pressure)**: Billing API has been down for 25 hours (cache expired) → system skips notification cycle and logs critical alert, does NOT send with stale data
|
|
87
|
+
**Failure mode target**: FM-1 (behavior at cache expiry boundary)
|
|
88
|
+
|
|
89
|
+
### Edge 1: Trial subscription expiry
|
|
90
|
+
**Setup**: User on 14-day free trial, day 12
|
|
91
|
+
**Action**: Daily cron job runs
|
|
92
|
+
**Expected**: Trial-specific email template used. No mention of "renewal" — instead "upgrade" language. Different banner styling.
|
|
93
|
+
|
|
94
|
+
### Edge 2: Multiple subscriptions, mixed states
|
|
95
|
+
**Setup**: User has 3 subscriptions — one expiring in 5 days, one active (90 days), one already expired
|
|
96
|
+
**Action**: Daily cron job runs
|
|
97
|
+
**Expected**: One 7-day urgency notification for the expiring sub. No notification for active sub. No notification for already-expired sub (handled by separate lapsed-subscription flow).
|
|
98
|
+
|
|
99
|
+
## Intent Contract
|
|
100
|
+
|
|
101
|
+
**Organizational objective**: Reduce involuntary churn from lapsed subscriptions by ensuring users are aware and have easy access to renew.
|
|
102
|
+
|
|
103
|
+
**Success signals**: Renewal rate within 30-day window increases; involuntary churn (expired without user action) decreases; support tickets about "surprise" expirations decrease.
|
|
104
|
+
|
|
105
|
+
**Trade-off matrix**:
|
|
106
|
+
- When notification timing conflicts with user experience (too many emails), favor fewer notifications over coverage — users who feel spammed churn intentionally, which is worse than involuntary churn
|
|
107
|
+
- When data freshness conflicts with notification reliability (billing API down), favor skipping a cycle over sending with stale data — incorrect expiry dates destroy trust
|
|
108
|
+
|
|
109
|
+
**Delegation framework**:
|
|
110
|
+
|
|
111
|
+
| Decision | Tier |
|
|
112
|
+
|----------|------|
|
|
113
|
+
| Send scheduled notification | Autonomous |
|
|
114
|
+
| Suppress notification (user opted out, already renewed) | Autonomous |
|
|
115
|
+
| Retry failed email delivery | Autonomous (max 3) |
|
|
116
|
+
| Send notification with cached data > 24hr | Escalate (log critical, skip) |
|
|
117
|
+
| Modify notification templates or timing | Supervised (requires product approval) |
|
|
118
|
+
|
|
119
|
+
**Hard boundaries**:
|
|
120
|
+
- The system must NEVER send renewal notifications to users who have opted out of email regardless of churn risk because violating email preferences has legal implications (CAN-SPAM, GDPR)
|
|
121
|
+
- The system must NEVER auto-renew or imply auto-renewal because the billing team requires explicit user consent
|
|
122
|
+
|
|
123
|
+
**Alignment drift detection**: If notification open rates drop below 10% while send volume stays constant, the system may be creating notification fatigue. If renewal rates don't improve despite high open rates, the renewal UX (not notifications) may be the problem — don't increase notification frequency as a response.
|
|
124
|
+
|
|
125
|
+
## Ambiguity Warnings
|
|
126
|
+
|
|
127
|
+
1. **Timezone handling**: Unclear whether "30 days before expiry" is calculated in UTC or user's local timezone. Agent would likely assume UTC. **Resolve**: Define which timezone.
|
|
128
|
+
2. **Notification frequency cap**: No mention of maximum notifications per user per week across all subscriptions. Agent would likely send one per subscription with no cap. **Resolve**: Define a cap or confirm no cap needed.
|
|
129
|
+
3. **Already-expired subscriptions**: The spec says "handled by separate flow" but doesn't define the boundary. Agent might create overlap. **Resolve**: Define when a subscription transitions from "expiring" to "expired" (at midnight UTC on expiry date?).
|
|
130
|
+
|
|
131
|
+
## Evaluation Thresholds
|
|
132
|
+
|
|
133
|
+
- **Variation stability**: > 90% of scenarios produce correct output regardless of stressor (Tier 2 target)
|
|
134
|
+
- **Reasoning alignment**: > 85% (applicable if system explains notification decisions in logs)
|
|
135
|
+
- **Anchoring susceptibility**: < 10% (notification decisions should not shift based on unrelated subscription states)
|
|
136
|
+
- **Guardrail reliability**: > 90% (opt-out suppression, cache expiry handling must fire correctly)
|
|
137
|
+
|
|
138
|
+
## Implementation Constraints
|
|
139
|
+
|
|
140
|
+
- Must integrate with existing SendGrid account (do not create new email provider)
|
|
141
|
+
- Cron job must run on existing job scheduler (do not introduce new scheduling infrastructure)
|
|
142
|
+
- In-app banner must use existing notification component library
|
package/package.json
ADDED
|
@@ -0,0 +1,44 @@
|
|
|
1
|
+
{
|
|
2
|
+
"name": "attacca-forge",
|
|
3
|
+
"version": "0.5.0",
|
|
4
|
+
"description": "Spec-driven AI development toolkit — design, evaluate, stress-test, and certify autonomous agents from your terminal",
|
|
5
|
+
"keywords": [
|
|
6
|
+
"ai",
|
|
7
|
+
"agents",
|
|
8
|
+
"spec-driven-development",
|
|
9
|
+
"specification",
|
|
10
|
+
"evaluation",
|
|
11
|
+
"stress-testing",
|
|
12
|
+
"intent-engineering",
|
|
13
|
+
"claude-code",
|
|
14
|
+
"ai-development",
|
|
15
|
+
"vibe-coding"
|
|
16
|
+
],
|
|
17
|
+
"homepage": "https://github.com/attacca-ai/attacca-forge",
|
|
18
|
+
"repository": {
|
|
19
|
+
"type": "git",
|
|
20
|
+
"url": "https://github.com/attacca-ai/attacca-forge.git"
|
|
21
|
+
},
|
|
22
|
+
"license": "MIT",
|
|
23
|
+
"author": {
|
|
24
|
+
"name": "Attacca",
|
|
25
|
+
"url": "https://github.com/attacca-ai"
|
|
26
|
+
},
|
|
27
|
+
"type": "module",
|
|
28
|
+
"bin": {
|
|
29
|
+
"attacca-forge": "./bin/cli.js",
|
|
30
|
+
"forge": "./bin/cli.js"
|
|
31
|
+
},
|
|
32
|
+
"files": [
|
|
33
|
+
"bin/",
|
|
34
|
+
"src/",
|
|
35
|
+
"plugins/",
|
|
36
|
+
"docs/",
|
|
37
|
+
"examples/",
|
|
38
|
+
"LICENSE",
|
|
39
|
+
"README.md"
|
|
40
|
+
],
|
|
41
|
+
"engines": {
|
|
42
|
+
"node": ">=20.0.0"
|
|
43
|
+
}
|
|
44
|
+
}
|