flyee 0.1.0 → 0.2.0
- package/README.md +11 -9
- package/bin/install.js +9 -0
- package/core/skills/cost-tracking/SKILL.md +68 -0
- package/core/skills/doctor/SKILL.md +171 -0
- package/core/skills/hallucination-guard/SKILL.md +110 -0
- package/core/skills/knowledge-persistence/SKILL.md +123 -0
- package/core/skills/quality-gates/SKILL.md +109 -0
- package/core/skills/roadmap-reassessment/SKILL.md +134 -0
- package/core/skills/skill-discovery/SKILL.md +152 -0
- package/core/skills/sprint-validation/SKILL.md +125 -0
- package/core/skills/stuck-detection/SKILL.md +176 -0
- package/core/skills/token-profiles/SKILL.md +150 -0
- package/core/skills/unique-ids/SKILL.md +112 -0
- package/core/templates/RUNTIME.template.md +41 -0
- package/package.json +2 -2
@@ -0,0 +1,134 @@
---
name: roadmap-reassessment
description: Automatically reassess the project roadmap after each phase/sprint completes. Checks if learned information changes the plan, reorders priorities, and adjusts scope. Prevents executing an outdated plan when reality has shifted.
---

# Roadmap Reassessment

> Reassess the plan after each completed phase.

## Problem

Plans are made with incomplete information. As work progresses, the agent learns:
- A dependency is harder than expected
- A feature is simpler than planned
- A new requirement emerged
- A technical constraint was discovered
- An approach was abandoned in favor of a better one

Without reassessment, the agent blindly follows the original plan even when it no longer makes sense.

## When to Trigger

| Trigger | Reassessment Type |
|---------|-------------------|
| Phase/Sprint completed | **Full reassessment** |
| Major blocker found | **Emergency reassessment** |
| User changes requirements | **Scope reassessment** |
| Budget milestone (50%, 75%, 90%) | **Budget reassessment** |

## Reassessment Protocol

### Step 1: Gather Evidence

```markdown
What changed since the plan was made?

1. **Completed work**: What was done? What was learned?
2. **Discovered complexity**: What was harder/easier than expected?
3. **New dependencies**: What new dependencies were discovered?
4. **Abandoned approaches**: What was tried and didn't work?
5. **Cost reality**: How much budget was used vs planned?
```

### Step 2: Evaluate Remaining Plan

For each remaining phase/task, ask:

```markdown
| Phase | Still Valid? | Priority Changed? | Effort Changed? | Action |
|-------|-------------|-------------------|-----------------|--------|
| Phase N+1 | Yes/No | ↑↓→ | ↑↓→ | Keep/Reorder/Split/Remove |
| Phase N+2 | Yes/No | ↑↓→ | ↑↓→ | Keep/Reorder/Split/Remove |
```

### Step 3: Decide

| Decision | When |
|----------|------|
| **Keep** | Plan is still valid. No changes. |
| **Reorder** | Priorities shifted. Move phases up/down. |
| **Split** | A phase is too large. Break it into smaller phases. |
| **Merge** | Two phases are related and should be combined. |
| **Remove** | Phase is no longer needed. |
| **Add** | New phase needed based on discoveries. |
| **Pause** | Need user input before continuing. |

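
As a rough illustration, the decisions in the table above could be applied to an ordered phase list as sketched below. The `Phase` dataclass, the `Decision` enum, and `apply_decision` with its parameters are hypothetical names chosen for this example; they are not part of flyee.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    REORDER = "reorder"
    SPLIT = "split"
    MERGE = "merge"
    REMOVE = "remove"
    ADD = "add"
    PAUSE = "pause"


@dataclass
class Phase:
    name: str  # hypothetical minimal phase record


def apply_decision(phases, decision, index=0, new_index=0, names=()):
    """Return a new phase list; the input list is never mutated."""
    result = list(phases)
    if decision in (Decision.KEEP, Decision.PAUSE):
        return result  # PAUSE leaves the plan untouched until the user answers
    if decision is Decision.REORDER:
        result.insert(new_index, result.pop(index))
    elif decision is Decision.SPLIT:
        # Replace one phase with several smaller ones.
        result[index:index + 1] = [Phase(n) for n in names]
    elif decision is Decision.MERGE:
        # Combine the phase at `index` with the one right after it.
        merged = Phase(" + ".join(p.name for p in result[index:index + 2]))
        result[index:index + 2] = [merged]
    elif decision is Decision.REMOVE:
        del result[index]
    elif decision is Decision.ADD:
        result.insert(index, Phase(names[0]))
    return result
```

Returning a fresh list keeps the previous plan available, which may help when writing the before/after entry for `.flyee/DECISIONS.md`.
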
### Step 4: Update Artifacts

If the plan changed:

1. Update `implementation_plan.md` or equivalent planning doc
2. Update `.flyee/STATE.md` with new phase order
3. Add entry to `.flyee/DECISIONS.md`:
```markdown
## [DATE] Roadmap Reassessment after Phase N

**Trigger:** Phase N completed
**Changes:**
- Phase N+2 moved before N+1 (dependency discovered)
- Phase N+3 removed (handled by Phase N)
- New Phase N+4 added (edge case found)
**Rationale:** [explanation]
```
4. Emit event for Flyee SaaS

### Step 5: Notify

```markdown
📋 **Roadmap Reassessment Complete**

Phase {N} finished. Here's what changed:

| Change | Detail |
|--------|--------|
| ✅ Completed | Phase {N}: {description} |
| 🔄 Reordered | Phase {X} moved up (blocking dependency) |
| ➕ Added | Phase {Y}: {new discovery} |
| 🗑️ Removed | Phase {Z}: {no longer needed} |

Budget: ${used} / ${ceiling} ({percentage}%)
Remaining phases: {count}
Estimated remaining cost: ${projection}

Continue? (yes / review changes / pause)
```

## Budget Reassessment

At cost milestones (50%, 75%, 90%):

```markdown
💰 **Budget Reassessment — {percentage}% Used**

Spent: ${used} on {completed_phases} phases
Remaining: ${remaining} for {remaining_phases} phases
Avg cost per phase: ${avg}
Projected total: ${projection}

| Action | Description |
|--------|-------------|
| Continue | Budget is sufficient |
| Trim scope | Remove low-priority phases to stay in budget |
| Request increase | Ask user for more budget |
| Switch models | Use cheaper models for remaining work |
```

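
The figures in the template above could be derived roughly as follows. This is a sketch under a stated assumption: the skill does not say how `${projection}` is computed, so the example simply extrapolates the average cost of completed phases linearly. `budget_report` and `crossed_milestones` are hypothetical helper names.

```python
MILESTONES = (50, 75, 90)  # percentages listed in this section


def budget_report(used, ceiling, completed_phases, remaining_phases):
    """Project total cost from the average cost of completed phases (assumed linear)."""
    if completed_phases <= 0:
        raise ValueError("need at least one completed phase to project cost")
    avg = used / completed_phases
    projection = used + avg * remaining_phases
    return {
        "percentage": round(100 * used / ceiling, 1),
        "remaining": ceiling - used,
        "avg": avg,
        "projection": projection,
        "within_budget": projection <= ceiling,
    }


def crossed_milestones(previous_pct, current_pct):
    """Milestones passed since the last check, so each one fires only once."""
    return [m for m in MILESTONES if previous_pct < m <= current_pct]
```

For example, `budget_report(6.0, 10.0, 3, 3)` gives an average of 2.0 per phase and a projected total of 12.0, which exceeds the ceiling and would point toward "Trim scope", "Request increase", or "Switch models".
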
## Integration

- **state-machine**: Reads phases from STATE.md
- **cost-tracking**: Provides budget data for budget reassessment
- **quality-gates**: Sprint Gate triggers reassessment
- **knowledge-persistence**: Reassessment decisions go to KNOWLEDGE.md
- **context-budget**: Validates new phases fit in context windows
- **Flyee SaaS**: Reassessment events update project timeline (S-02, S-03)

@@ -0,0 +1,152 @@
---
name: skill-discovery
description: Automatic discovery and suggestion of relevant skills based on the current task context. Modes - auto (load silently), suggest (recommend to user), off (manual only). Replaces static frontmatter-only loading with dynamic detection.
---

# Skill Discovery

> Auto-detect and load relevant skills based on task context.

## Problem

Current approach: skills are loaded via agent frontmatter (`skills: [a, b, c]`). This is static — if the agent lists 10 skills, ALL 10 are loaded regardless of whether they're relevant to the current task.

GSD-2 approach: skills are discovered dynamically based on the task context. A "Fix CSS bug" task loads frontend skills. A "Database migration" task loads backend skills.

## Modes

Configure in `.flyee/config.json`:

```json
{
  "skill_discovery": "suggest"
}
```

| Mode | Behavior |
|------|----------|
| `auto` | Discover and load relevant skills silently |
| `suggest` | Discover skills, show to user, ask before loading |
| `off` | Only load skills explicitly listed in agent frontmatter |

## Discovery Algorithm

### Step 1: Context Analysis

Analyze the user's request for signals:

```
Input: "Fix the login form validation on mobile"

Signals detected:
- "fix" → debugging domain
- "login form" → frontend/UI domain
- "validation" → form patterns
- "mobile" → mobile-responsive/mobile-design
```

### Step 2: Skill Matching

Match signals to skill domains:

```
Signal → Skill Mapping:

frontend/UI/CSS/React → frontend-design, nextjs-react-expert, tailwind-patterns
backend/API/database → api-patterns, database-design, nodejs-best-practices
mobile/iOS/Android → mobile-design
security/auth/OWASP → vulnerability-scanner, red-team-tactics
testing/test/jest → testing-patterns, tdd-workflow, webapp-testing
git/branch/commit → git-workflow
debug/fix/error → systematic-debugging
design/UI/component → design-system-enforcement, atomic-design
deploy/production/CI → deployment-procedures
performance/slow/optimize → performance-profiling, nextjs-react-expert
SEO/search/meta → seo-fundamentals, geo-fundamentals
```

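
A minimal sketch of the keyword lookup described in Steps 1 and 2, assuming plain whole-word matching. `SIGNAL_MAP` copies only a subset of the mapping above, and `match_skills` is a hypothetical name. Note that plain keyword matching would not recognize a phrase like "login form" as a frontend signal; that kind of inference is outside this sketch.

```python
import re

# Subset of the signal → skill mapping; keywords are lowercased for matching.
SIGNAL_MAP = {
    ("frontend", "ui", "css", "react"): [
        "frontend-design", "nextjs-react-expert", "tailwind-patterns",
    ],
    ("mobile", "ios", "android"): ["mobile-design"],
    ("debug", "fix", "error"): ["systematic-debugging"],
    ("testing", "test", "jest"): [
        "testing-patterns", "tdd-workflow", "webapp-testing",
    ],
}


def match_skills(request):
    """Return skills whose keywords appear as whole words in the request."""
    words = set(re.findall(r"[a-z0-9]+", request.lower()))
    matched = []
    for keywords, skills in SIGNAL_MAP.items():
        if words.intersection(keywords):
            for skill in skills:
                if skill not in matched:  # keep first-seen order, no duplicates
                    matched.append(skill)
    return matched
```

With the Step 1 input, `match_skills("Fix the login form validation on mobile")` picks up the "fix" and "mobile" keywords only.
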
### Step 3: File Context

Also check which files are being modified:

```
*.tsx, *.jsx, *.css → frontend-design
*.py → python-patterns
*.ts (backend) → nodejs-best-practices
package.json → project-setup
Dockerfile → server-management
*.test.*, *.spec.* → testing-patterns
```

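
The file rules above map naturally onto glob patterns. The sketch below uses the standard library's `fnmatch.fnmatchcase`; `FILE_RULES` and `skills_for_files` are hypothetical names. The `*.ts (backend)` rule is left out because whether a file is "backend" cannot be decided from its name alone.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# (glob patterns, skill) pairs taken from the table above.
FILE_RULES = [
    (("*.tsx", "*.jsx", "*.css"), "frontend-design"),
    (("*.py",), "python-patterns"),
    (("package.json",), "project-setup"),
    (("Dockerfile",), "server-management"),
    (("*.test.*", "*.spec.*"), "testing-patterns"),
]


def skills_for_files(paths):
    """Return skills suggested by the names of the files being modified."""
    skills = []
    for path in paths:
        name = PurePosixPath(path).name  # match on the file name only
        for patterns, skill in FILE_RULES:
            if any(fnmatchcase(name, p) for p in patterns) and skill not in skills:
                skills.append(skill)
    return skills
```

A file such as `src/Login.test.tsx` matches both the frontend and the testing rule, so one file can contribute more than one skill.
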
### Step 4: Confidence Scoring

Each potential skill gets a confidence score:

| Factor | Weight |
|--------|--------|
| Direct keyword match | 0.4 |
| File extension match | 0.3 |
| Agent frontmatter includes it | 0.2 |
| Recently used (last 3 sessions) | 0.1 |

**Load threshold:** score >= 0.5

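
One way to read the weight table is as a sum over the factors that apply to a skill. The sketch below assumes each factor is binary (present or absent), which the skill does not state; graded factors would be needed to produce in-between scores. The factor keys and function names are hypothetical.

```python
# Weights from the table above; the dictionary keys are illustrative names.
WEIGHTS = {
    "keyword_match": 0.4,
    "file_match": 0.3,
    "in_frontmatter": 0.2,
    "recently_used": 0.1,
}
LOAD_THRESHOLD = 0.5


def confidence(factors):
    """Sum the weights of the factors that are present for one skill."""
    total = sum(w for name, w in WEIGHTS.items() if factors.get(name))
    return round(total, 2)  # rounding hides float noise such as 0.8999...


def should_load(factors):
    """Apply the load threshold (score >= 0.5)."""
    return confidence(factors) >= LOAD_THRESHOLD
```

Under this reading a keyword match alone (0.4) stays below the threshold, while a keyword match plus a file extension match (0.7) clears it.
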
### Step 5: Present/Load

**Mode: auto**
```
🧩 Auto-loaded skills: frontend-design, testing-patterns
```

**Mode: suggest**
```markdown
🧩 **Suggested skills for this task:**

| Skill | Confidence | Reason |
|-------|-----------|--------|
| frontend-design | 0.85 | Keywords: "form", "mobile"; Files: .tsx |
| testing-patterns | 0.65 | Keyword: "validation" |
| mobile-design | 0.55 | Keyword: "mobile" |

Load these skills? (yes / select / no)
```

## Always-Load Skills

Some skills should ALWAYS be loaded regardless of discovery:

```json
{
  "always_load_skills": [
    "clean-code",
    "knowledge-persistence",
    "quality-gates"
  ]
}
```

## Staleness (Future — F-18)

Skills not used for N sessions get deprioritized:

```json
{
  "skill_staleness_sessions": 20
}
```

Usage tracking in `.flyee/skill-usage.json`:
```json
{
  "frontend-design": { "last_used": "2026-03-31", "count": 15 },
  "game-development": { "last_used": "2026-01-15", "count": 2 }
}
```

## Integration

- **intelligent-routing**: Skill discovery runs AFTER agent routing
- **context-budget**: Discovered skills count against context budget
- **knowledge-persistence**: Skill effectiveness feeds into KNOWLEDGE.md
- **cost-tracking**: Track if skill loading improved task outcomes
- **Flyee SaaS**: Skill usage analytics in dashboard

@@ -0,0 +1,125 @@
---
name: sprint-validation
description: Sprint/milestone completion gate. Compares the roadmap's success criteria against actual results before sealing the sprint. Unlike quality-gates (which runs per task), this is the FINAL gate for the entire sprint.
---

# Sprint Validation Gate

> The final gate before marking a sprint/milestone as completed.

## Purpose

Individual tasks can pass their quality gates while the sprint as a whole fails. Examples:
- Each task works in isolation, but they don't work together
- All code is written, but the feature doesn't meet the original requirements
- Tests pass, but the user story isn't fulfilled

Sprint validation catches these by comparing **planned outcomes** against **actual results**.

## When to Trigger

- Before marking any sprint/milestone as "completed"
- After ALL tasks in the sprint have their own Completion Gates passed
- Before the final git commit/merge for the sprint

## Validation Protocol

### Step 1: Gather Success Criteria

Read the sprint planning document for defined success criteria:

```markdown
## Sprint S09 — Blog & CMS

### Success Criteria (from planning)
1. Blog posts can be created, edited, and published
2. CMS admin panel with CRUD operations
3. SEO metadata auto-generated for each post
4. RSS feed available at /feed.xml
5. Performance: LCP < 2.5s on blog pages
```

### Step 2: Verify Each Criterion

For each success criterion, provide **evidence**:

```markdown
## Sprint Validation — S09

| # | Criterion | Status | Evidence |
|---|-----------|--------|----------|
| 1 | Blog CRUD | ✅ | Routes exist: POST/GET/PUT/DELETE /api/posts |
| 2 | CMS admin | ✅ | Page exists: /admin/posts with DataTable |
| 3 | SEO metadata | ✅ | generateMetadata() in app/blog/[slug]/page.tsx |
| 4 | RSS feed | ❌ | /feed.xml returns 404 — NOT IMPLEMENTED |
| 5 | LCP < 2.5s | ⚠️ | Not measurable without deployment |
```

### Step 3: Integration Check

Beyond individual criteria, verify components work together:

```markdown
## Integration Verification

- [ ] E2E flow works: Create post → Edit → Publish → View on blog → Appears in RSS
- [ ] All routes registered in router/navigation
- [ ] Database migrations applied without conflict
- [ ] No import cycles between modules
- [ ] Shared types/interfaces consistent across components
```

### Step 4: Decision

| Result | Action |
|--------|--------|
| All criteria ✅ | Sprint PASS → mark completed |
| 1-2 criteria ❌ or ⚠️ | Sprint PARTIAL → create fix tasks, DON'T mark completed |
| 3+ criteria ❌ | Sprint FAIL → reassess, DON'T mark completed |

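
The decision table could be expressed as a small classifier. This is a sketch, not the skill's own logic: the table leaves some combinations open (for example three or more ⚠️ with no ❌), and the example resolves them by treating anything that is neither all-pass nor three-plus failures as PARTIAL. The status strings and `sprint_result` are hypothetical.

```python
def sprint_result(statuses):
    """Classify a sprint from per-criterion statuses.

    Each status is 'pass' (✅), 'fail' (❌) or 'unverifiable' (⚠️).
    """
    failed = statuses.count("fail")
    unverifiable = statuses.count("unverifiable")
    if failed >= 3:
        return "fail"      # reassess, do not mark completed
    if failed == 0 and unverifiable == 0:
        return "pass"      # mark completed
    return "partial"       # create fix tasks, do not mark completed
```

For the S09 example (three ✅, one ❌, one ⚠️) this yields `"partial"`.
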
### Step 5: Sprint Summary

```markdown
## Sprint S09 — Validation Result: PARTIAL

### Passed (3/5)
- ✅ Blog CRUD — fully functional
- ✅ CMS admin — complete with DataTable
- ✅ SEO metadata — auto-generated

### Failed (1/5)
- ❌ RSS feed — not implemented

### Unverifiable (1/5)
- ⚠️ LCP performance — requires deployment

### Action
- Created fix task: "Implement RSS feed at /feed.xml"
- Sprint remains OPEN until RSS is complete
- LCP deferred to post-deployment verification
```

## Event Emission

```python
bridge.emit_event("sprint.validation", {
    "sprint": "S09",
    "result": "partial",
    "criteria_total": 5,
    "criteria_passed": 3,
    "criteria_failed": 1,
    "criteria_unverifiable": 1,
    "action": "created_fix_tasks",
    "fix_tasks": ["Implement RSS feed"]
})
```

## Integration

- **quality-gates**: Tasks must pass Completion Gate BEFORE sprint validation runs
- **roadmap-reassessment**: Failed sprint triggers reassessment
- **verification-gate**: Sprint validation uses verification commands
- **cost-tracking**: Sprint cost is part of the validation summary
- **knowledge-persistence**: Record what worked/failed in KNOWLEDGE.md
- **task-complete**: Sprint validation BLOCKS `/task-complete` for the sprint itself
- **Flyee SaaS**: Sprint validation results visible in progress dashboard

@@ -0,0 +1,176 @@
---
name: stuck-detection
description: Detects when the agent enters an infinite loop — repeating the same actions, failing the same way, or making zero progress. Uses a sliding-window pattern detector to identify cycles and break them before wasting budget.
---

# Stuck Detection

> Detects and breaks infinite loops in agent execution.

## Problem

Without stuck detection, an agent can:
- Retry the same failing command indefinitely
- Edit → undo → re-edit the same file in a loop
- Make the same search query repeatedly
- Generate the same code that fails the same test

Each iteration burns tokens and budget with zero progress.

## Detection Algorithm

### Sliding Window Pattern Detector

Monitor the last N tool calls (window size = 10) for:

```
Pattern 1: EXACT REPEAT
  Tool calls [i] == Tool calls [i-1] == Tool calls [i-2]
  → Same tool, same arguments, 3+ times in a row

Pattern 2: CYCLE
  Sequence A-B-C-A-B-C detected
  → Agent alternating between the same actions

Pattern 3: ZERO PROGRESS
  Last 5 tool calls produced no file changes
  AND no new information was gathered
  → Agent is "thinking" but not "doing"

Pattern 4: FAIL-RETRY-FAIL
  Same command fails → agent retries → same failure
  → 3+ identical failures without changing approach
```

### Detection Thresholds

| Pattern | Threshold | Confidence |
|---------|-----------|------------|
| Exact repeat | 3 consecutive | HIGH |
| Cycle | 2 full cycles (6+ calls) | HIGH |
| Zero progress | 5 calls, 0 changes | MEDIUM |
| Fail-retry-fail | 3 identical failures | HIGH |

## Response Protocol

### Phase 1: Soft Intervention

On first detection:

```markdown
⚠️ **Stuck Detection Triggered**

Pattern: {pattern_type}
Evidence: {last N tool calls summarized}
Window: {call_count} calls, {time_elapsed}

Attempting recovery: Retry with diagnostic prompt.
```

Inject a **diagnostic prompt** that:
1. Summarizes what was attempted
2. Explains WHY it's stuck
3. Asks the agent to try a **different approach**

### Phase 2: Hard Stop

If soft intervention fails (same pattern recurs within 5 tool calls):

```markdown
🛑 **Stuck Detection — Hard Stop**

The agent has been stuck in a loop for {call_count} calls.
Pattern: {pattern_type}
Estimated wasted cost: ${cost}

Action: Pausing execution.
Next steps:
1. Review the session log
2. Provide manual guidance
3. Resume with `/execute`
```

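
The step from soft intervention to hard stop can be pictured as a small state tracker. The sketch below is illustrative only: `StuckResponder`, its attributes, and the returned action strings are hypothetical, and it assumes the detector reports a pattern name (or `None`) after every tool call.

```python
class StuckResponder:
    """Escalates from a soft intervention to a hard stop."""

    RECUR_WINDOW = 5  # tool calls within which a recurrence counts as failure

    def __init__(self):
        self.last_pattern = None
        self.calls_since_soft = None  # None until a soft intervention happens

    def on_tool_call(self, pattern):
        """Feed the detector output for one call; return an action or None."""
        if self.calls_since_soft is not None:
            self.calls_since_soft += 1
        if pattern is None:
            return None
        recurring = (
            pattern == self.last_pattern
            and self.calls_since_soft is not None
            and self.calls_since_soft <= self.RECUR_WINDOW
        )
        if recurring:
            # Same pattern came back soon after the soft intervention.
            self.last_pattern, self.calls_since_soft = None, None
            return "hard_stop"
        self.last_pattern, self.calls_since_soft = pattern, 0
        return "soft_intervention"
```

A pattern that returns after more than five calls is treated as a fresh detection and gets another soft intervention; that reading of "within 5 tool calls" is an assumption of this example.
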
### Phase 3: Escalation

Emit event for Flyee SaaS:

```json
{
  "event_type": "stuck_detection.triggered",
  "payload": {
    "task_id": "T-001",
    "pattern": "fail_retry_fail",
    "tool_calls_in_loop": 9,
    "wasted_cost_usd": 0.45,
    "intervention": "hard_stop",
    "last_5_calls": [
      "run_command: npm test",
      "replace_file_content: src/utils.ts",
      "run_command: npm test",
      "replace_file_content: src/utils.ts",
      "run_command: npm test"
    ]
  }
}
```

## Implementation

### For Prompt-Based Agents (Flyee approach)

Since flyee-agent runs INSIDE the host runtime (not as a standalone app), stuck detection works through **skill instructions**:

```markdown
## Self-Monitoring Protocol

You MUST monitor your own behavior for loops:

After every 5 tool calls, ask yourself:
1. Am I making progress? (new files, passing tests, new info)
2. Am I repeating the same action? (same edit, same command)
3. Have I changed approach since the last failure?

If the answer to #1 is NO, and either #2 is YES or #3 is NO:
→ STOP. State what you're stuck on. Ask the user for guidance.

DO NOT:
- Retry the same command more than 2 times
- Edit the same file more than 3 times for the same issue
- Run the same failing test more than 2 times without changing the code
```

### For Bridge-Level Detection

The `local_tracker.py` can implement pattern detection on the event log:

```python
def check_stuck(events, window=10):
    """Return the name of the detected loop pattern, or None."""
    recent = events[-window:]

    # Pattern 1: Exact repeat (same action 3 times in a row).
    # The length guard avoids flagging a log with fewer than 3 events.
    last_three = recent[-3:]
    if len(last_three) == 3 and len({e['action'] for e in last_three}) == 1:
        return 'exact_repeat'

    # Pattern 4: Fail-retry-fail (the last 3 failures in the window
    # all come from the same action).
    failures = [e for e in recent if e.get('result') == 'failure']
    if len(failures) >= 3:
        cmds = [f['action'] for f in failures[-3:]]
        if len(set(cmds)) == 1:
            return 'fail_retry_fail'

    return None
```

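
The function above covers Patterns 1 and 4 only. As a sketch of how Patterns 2 (cycle) and 3 (zero progress) might be checked on the same kind of log: `detect_cycle` and `zero_progress` are hypothetical helpers, and the `files_changed` and `new_info` event fields are assumptions, since the event schema is not specified here.

```python
def detect_cycle(actions, min_len=2, max_len=5):
    """Return the cycle length if the tail of `actions` is a sequence
    repeated twice back to back (e.g. A-B-C-A-B-C), else None."""
    for length in range(min_len, max_len + 1):
        if len(actions) < 2 * length:
            break
        tail = actions[-length:]
        previous = actions[-2 * length:-length]
        # Require at least two distinct actions; a run of one action
        # is an exact repeat (Pattern 1), not a cycle.
        if tail == previous and len(set(tail)) > 1:
            return length
    return None


def zero_progress(events, calls=5):
    """True if the last `calls` events changed no files and gathered no new info."""
    recent = events[-calls:]
    if len(recent) < calls:
        return False
    return not any(e.get("files_changed") or e.get("new_info") for e in recent)
```

With the default `max_len=5`, two full cycles always fit inside the 10-call window described earlier.
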
## Integration

- **cost-tracking**: Log wasted cost during stuck periods
- **session-resilience**: Stuck detection can trigger session save
- **knowledge-persistence**: Record what caused the loop in Lessons Learned
- **task-complete**: Block completion if agent was stuck and didn't resolve
- **Flyee SaaS**: Dashboard shows stuck incidents timeline (S-03)

## Exclusions

- **Research tasks**: Repeated searches are normal during research
- **Interactive debugging**: User may ask to retry — that's not a loop
- **Build/test cycles**: Edit → test → edit → test is normal IF the edits are different