@jaguilar87/gaia-ops 3.6.0 → 3.6.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +2 -1
- package/skills/README.md +154 -0
- package/skills/domain/fast-queries/SKILL.md +256 -0
- package/skills/domain/gitops-patterns/SKILL.md +667 -0
- package/skills/domain/terraform-patterns/SKILL.md +444 -0
- package/skills/domain/universal-protocol/SKILL.md +212 -0
- package/skills/standards/anti-patterns/SKILL.md +193 -0
- package/skills/standards/command-execution/SKILL.md +136 -0
- package/skills/standards/output-format/SKILL.md +76 -0
- package/skills/standards/security-tiers/SKILL.md +55 -0
- package/skills/workflow/approval/SKILL.md +393 -0
- package/skills/workflow/execution/SKILL.md +523 -0
- package/skills/workflow/investigation/SKILL.md +236 -0
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@jaguilar87/gaia-ops",
|
|
3
|
-
"version": "3.6.
|
|
3
|
+
"version": "3.6.1",
|
|
4
4
|
"description": "Multi-agent orchestration system for Claude Code - DevOps automation toolkit",
|
|
5
5
|
"main": "index.js",
|
|
6
6
|
"type": "module",
|
|
@@ -42,6 +42,7 @@
|
|
|
42
42
|
"templates/",
|
|
43
43
|
"config/",
|
|
44
44
|
"speckit/",
|
|
45
|
+
"skills/",
|
|
45
46
|
"tests/",
|
|
46
47
|
"README.en.md",
|
|
47
48
|
"README.md",
|
package/skills/README.md
ADDED
|
@@ -0,0 +1,154 @@
|
|
|
1
|
+
# Skills System
|
|
2
|
+
|
|
3
|
+
Skills are on-demand knowledge modules loaded based on context triggers. They reduce token duplication and improve maintainability.
|
|
4
|
+
|
|
5
|
+
## Architecture
|
|
6
|
+
|
|
7
|
+
```
|
|
8
|
+
.claude/skills/
|
|
9
|
+
├── workflow/ # How to work (process patterns)
|
|
10
|
+
│ ├── investigation/
|
|
11
|
+
│ ├── approval/
|
|
12
|
+
│ └── execution/
|
|
13
|
+
└── domain/ # What patterns to use (technical patterns)
|
|
14
|
+
├── terraform-patterns/
|
|
15
|
+
├── gitops-patterns/
|
|
16
|
+
└── universal-protocol/
|
|
17
|
+
```
|
|
18
|
+
|
|
19
|
+
## Skill Categories
|
|
20
|
+
|
|
21
|
+
| Category | Purpose | When Loaded | Example |
|
|
22
|
+
|----------|---------|-------------|---------|
|
|
23
|
+
| **Workflow** | Process/methodology | By workflow phase | investigation-skill: how to investigate before acting |
|
|
24
|
+
| **Domain** | Technical patterns | By keywords in task | terraform-patterns: HCL patterns for this project |
|
|
25
|
+
|
|
26
|
+
## Trigger Mechanism
|
|
27
|
+
|
|
28
|
+
Skills are loaded when:
|
|
29
|
+
1. **Workflow phase changes** (automatic) - investigation → approval → execution
|
|
30
|
+
2. **Task contains trigger keywords** (see `skill-triggers.json`)
|
|
31
|
+
|
|
32
|
+
## Skill Structure
|
|
33
|
+
|
|
34
|
+
Each skill is a directory containing:
|
|
35
|
+
|
|
36
|
+
```
|
|
37
|
+
skill-name/
|
|
38
|
+
└── SKILL.md # Core skill content
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
### SKILL.md Format
|
|
42
|
+
|
|
43
|
+
```markdown
|
|
44
|
+
---
|
|
45
|
+
name: skill-name
|
|
46
|
+
description: Brief description
|
|
47
|
+
triggers: [keyword1, keyword2] # For domain skills
|
|
48
|
+
phase: start|investigation|approval|execution # For workflow skills
|
|
49
|
+
---
|
|
50
|
+
|
|
51
|
+
# Skill Name
|
|
52
|
+
|
|
53
|
+
[Content that agents will read when skill is loaded]
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
## How Skills Work
|
|
57
|
+
|
|
58
|
+
1. **Hook intercepts Task tool call**
|
|
59
|
+
```python
|
|
60
|
+
# pre_tool_use.py
|
|
61
|
+
if is_project_agent:
|
|
62
|
+
skills = skill_loader.load_skills(task_prompt, workflow_phase)
|
|
63
|
+
```
|
|
64
|
+
|
|
65
|
+
2. **skill_loader.py determines which skills to load**
|
|
66
|
+
```python
|
|
67
|
+
# Load workflow skill based on phase
|
|
68
|
+
if phase == "start":
|
|
69
|
+
load("workflow/investigation")
|
|
70
|
+
|
|
71
|
+
# Load domain skills based on keywords
|
|
72
|
+
if "terraform" in prompt:
|
|
73
|
+
load("domain/terraform-patterns")
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
3. **Skills are injected into prompt**
|
|
77
|
+
```
|
|
78
|
+
# Project Context (Auto-Injected)
|
|
79
|
+
{...context...}
|
|
80
|
+
|
|
81
|
+
# Active Skills
|
|
82
|
+
## investigation-skill
|
|
83
|
+
[content of investigation SKILL.md]
|
|
84
|
+
|
|
85
|
+
## terraform-patterns
|
|
86
|
+
[content of terraform-patterns SKILL.md]
|
|
87
|
+
|
|
88
|
+
---
|
|
89
|
+
# User Task
|
|
90
|
+
{original prompt}
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
## Benefits
|
|
94
|
+
|
|
95
|
+
| Metric | Before Skills | After Skills |
|
|
96
|
+
|--------|---------------|--------------|
|
|
97
|
+
| Token duplication | ~6000 tokens repeated in 4 agents | ~1500 tokens in skills, loaded once |
|
|
98
|
+
| Agent size | ~280 lines each | ~180 lines each |
|
|
99
|
+
| Maintenance | Update 4 files | Update 1 skill |
|
|
100
|
+
| Consistency | Can drift | Guaranteed consistent |
|
|
101
|
+
|
|
102
|
+
## Usage Example
|
|
103
|
+
|
|
104
|
+
**User request:** "Create a new VPC in terraform"
|
|
105
|
+
|
|
106
|
+
**Skills loaded:**
|
|
107
|
+
1. `workflow/investigation` (phase: start)
|
|
108
|
+
2. `domain/terraform-patterns` (trigger: "terraform")
|
|
109
|
+
3. `domain/universal-protocol` (auto_load for project agents)
|
|
110
|
+
|
|
111
|
+
**Agent receives:**
|
|
112
|
+
- Full project context (~3000 tokens)
|
|
113
|
+
- Investigation skill (~500 tokens) - how to discover patterns first
|
|
114
|
+
- Terraform patterns skill (~600 tokens) - HCL patterns for this project
|
|
115
|
+
- Universal protocol skill (~400 tokens) - AGENT_STATUS format, Security Tiers
|
|
116
|
+
|
|
117
|
+
**Total:** ~4500 tokens vs ~6000 without skills
|
|
118
|
+
|
|
119
|
+
## Skill Development Guidelines
|
|
120
|
+
|
|
121
|
+
### Do's
|
|
122
|
+
- ✅ Keep skills focused and specific
|
|
123
|
+
- ✅ Use concrete examples
|
|
124
|
+
- ✅ Include decision trees when applicable
|
|
125
|
+
- ✅ Update skills when patterns change
|
|
126
|
+
|
|
127
|
+
### Don'ts
|
|
128
|
+
- ❌ Duplicate information across skills
|
|
129
|
+
- ❌ Make skills too generic (defeats the purpose)
|
|
130
|
+
- ❌ Include project-specific credentials/secrets
|
|
131
|
+
- ❌ Create skills for one-time operations
|
|
132
|
+
|
|
133
|
+
## Testing Skills
|
|
134
|
+
|
|
135
|
+
Test that skills load correctly:
|
|
136
|
+
|
|
137
|
+
```bash
|
|
138
|
+
# Test skill loader with agent and prompt
|
|
139
|
+
python3 .claude/hooks/modules/skills/skill_loader.py \
|
|
140
|
+
--test \
|
|
141
|
+
--prompt "terraform apply vpc" \
|
|
142
|
+
--agent "terraform-architect"
|
|
143
|
+
|
|
144
|
+
# Expected output:
|
|
145
|
+
# Loaded skills:
|
|
146
|
+
# - workflow/investigation (phase: start)
|
|
147
|
+
# - domain/terraform-patterns (trigger: terraform)
|
|
148
|
+
# - domain/universal-protocol (auto_load)
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
## Version History
|
|
152
|
+
|
|
153
|
+
- v1.0 (2026-01-15): Initial skills system with workflow + domain categories
|
|
154
|
+
- v1.1 (2026-01-15): Added universal-protocol skill for all project agents
|
|
@@ -0,0 +1,256 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: fast-queries
|
|
3
|
+
description: Quick diagnostic scripts for instant health checks (<5 sec). Auto-loads for all infrastructure agents.
|
|
4
|
+
triggers: [health, status, check, triage, diagnose, error, issue, problem, failing, trouble, investigate, debug]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
# Fast-Query Diagnostics
|
|
8
|
+
|
|
9
|
+
## Overview
|
|
10
|
+
|
|
11
|
+
You have access to fast diagnostic scripts that provide health status in **<5 seconds**. These scripts only show problems, not everything, making them ideal for quick triage.
|
|
12
|
+
|
|
13
|
+
**CRITICAL**: Always run relevant fast-queries **FIRST** when investigating issues, checking status, or validating changes.
|
|
14
|
+
|
|
15
|
+
## Available Health Checks
|
|
16
|
+
|
|
17
|
+
### 1. All Systems Triage
|
|
18
|
+
```bash
|
|
19
|
+
bash .claude/tools/fast-queries/run_triage.sh [domain]
|
|
20
|
+
```
|
|
21
|
+
|
|
22
|
+
**Domains**: `all`, `gitops`, `terraform`, `cloud`, `appservices`
|
|
23
|
+
|
|
24
|
+
**When to use**:
|
|
25
|
+
- User asks general "what's the status?"
|
|
26
|
+
- Starting any investigation
|
|
27
|
+
- Pre-flight checks before changes
|
|
28
|
+
- Post-deployment validation
|
|
29
|
+
|
|
30
|
+
**Example**:
|
|
31
|
+
```bash
|
|
32
|
+
# Check everything
|
|
33
|
+
bash .claude/tools/fast-queries/run_triage.sh all
|
|
34
|
+
|
|
35
|
+
# Check only Kubernetes
|
|
36
|
+
bash .claude/tools/fast-queries/run_triage.sh gitops
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
---
|
|
40
|
+
|
|
41
|
+
### 2. GitOps/Kubernetes Health
|
|
42
|
+
```bash
|
|
43
|
+
bash .claude/tools/fast-queries/gitops/quicktriage_gitops_operator.sh [namespace]
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
**Output**: Problematic pods, deployments not ready, recent warnings
|
|
47
|
+
|
|
48
|
+
**When to use**:
|
|
49
|
+
- Pod/deployment issues
|
|
50
|
+
- Investigating k8s errors
|
|
51
|
+
- Validating flux reconciliation
|
|
52
|
+
- Checking namespace health
|
|
53
|
+
|
|
54
|
+
**Example**:
|
|
55
|
+
```bash
|
|
56
|
+
# Check specific namespace
|
|
57
|
+
bash .claude/tools/fast-queries/gitops/quicktriage_gitops_operator.sh common
|
|
58
|
+
|
|
59
|
+
# Check all namespaces
|
|
60
|
+
bash .claude/tools/fast-queries/gitops/quicktriage_gitops_operator.sh
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
### 3. Terraform Validation
|
|
66
|
+
```bash
|
|
67
|
+
bash .claude/tools/fast-queries/terraform/quicktriage_terraform_architect.sh [directory]
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
**Output**: ✅/❌ for format, validation, and drift detection
|
|
71
|
+
|
|
72
|
+
**When to use**:
|
|
73
|
+
- Before terraform operations
|
|
74
|
+
- Validating HCL changes
|
|
75
|
+
- Drift detection
|
|
76
|
+
- Pre-commit checks
|
|
77
|
+
|
|
78
|
+
**Example**:
|
|
79
|
+
```bash
|
|
80
|
+
# Check specific terraform directory
|
|
81
|
+
bash .claude/tools/fast-queries/terraform/quicktriage_terraform_architect.sh terraform/environments/prod
|
|
82
|
+
|
|
83
|
+
# Check base terraform directory
|
|
84
|
+
bash .claude/tools/fast-queries/terraform/quicktriage_terraform_architect.sh terraform/
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
---
|
|
88
|
+
|
|
89
|
+
### 4. AWS Resources Check
|
|
90
|
+
```bash
|
|
91
|
+
bash .claude/tools/fast-queries/cloud/aws/quicktriage_aws_troubleshooter.sh
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
**Output**: Status of EKS clusters, RDS, VPC health, recent CloudWatch errors, quota warnings
|
|
95
|
+
|
|
96
|
+
**When to use**:
|
|
97
|
+
- AWS infrastructure issues
|
|
98
|
+
- EKS cluster problems
|
|
99
|
+
- Validating AWS resource state
|
|
100
|
+
- Quota/limit checks
|
|
101
|
+
|
|
102
|
+
**Example**:
|
|
103
|
+
```bash
|
|
104
|
+
bash .claude/tools/fast-queries/cloud/aws/quicktriage_aws_troubleshooter.sh
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
---
|
|
108
|
+
|
|
109
|
+
### 5. GCP Resources Check
|
|
110
|
+
```bash
|
|
111
|
+
bash .claude/tools/fast-queries/cloud/gcp/quicktriage_gcp_troubleshooter.sh [project]
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
**Output**: Status of GKE clusters, Cloud SQL, recent errors, quota warnings
|
|
115
|
+
|
|
116
|
+
**When to use**:
|
|
117
|
+
- GCP infrastructure issues
|
|
118
|
+
- GKE cluster problems
|
|
119
|
+
- Validating GCP resource state
|
|
120
|
+
- Quota/limit checks
|
|
121
|
+
|
|
122
|
+
**Example**:
|
|
123
|
+
```bash
|
|
124
|
+
# Check specific project
|
|
125
|
+
bash .claude/tools/fast-queries/cloud/gcp/quicktriage_gcp_troubleshooter.sh vtr-digital-prod
|
|
126
|
+
|
|
127
|
+
# Check default project
|
|
128
|
+
bash .claude/tools/fast-queries/cloud/gcp/quicktriage_gcp_troubleshooter.sh
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
---
|
|
132
|
+
|
|
133
|
+
## Output Format
|
|
134
|
+
|
|
135
|
+
All scripts follow the same pattern:
|
|
136
|
+
|
|
137
|
+
- ✅ **Green check** = Healthy/OK - No action needed
|
|
138
|
+
- ⚠️ **Yellow warning** = Warning/non-critical issue - Review recommended
|
|
139
|
+
- ❌ **Red X** = Problem detected - Action required
|
|
140
|
+
|
|
141
|
+
### Exit Codes
|
|
142
|
+
- `0` = All healthy
|
|
143
|
+
- `1` = Issues found (warnings or errors)
|
|
144
|
+
- `2` = Script error (missing tools, permissions)
|
|
145
|
+
|
|
146
|
+
## Workflow Integration
|
|
147
|
+
|
|
148
|
+
### Investigation Phase (T0 - Read Only)
|
|
149
|
+
```bash
|
|
150
|
+
# ALWAYS start with fast-queries
|
|
151
|
+
bash .claude/tools/fast-queries/run_triage.sh all
|
|
152
|
+
|
|
153
|
+
# Then deep-dive based on findings
|
|
154
|
+
# If gitops shows errors → kubectl describe pod X
|
|
155
|
+
# If terraform shows drift → terraform plan
|
|
156
|
+
# If cloud shows issues → aws/gcloud describe X
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
### Before T3 Operations (Apply/Deploy)
|
|
160
|
+
```bash
|
|
161
|
+
# Pre-flight check
|
|
162
|
+
bash .claude/tools/fast-queries/run_triage.sh terraform
|
|
163
|
+
|
|
164
|
+
# If ✅ proceed with terraform apply
|
|
165
|
+
# If ❌ fix issues first
|
|
166
|
+
```
|
|
167
|
+
|
|
168
|
+
### After T3 Operations (Validation)
|
|
169
|
+
```bash
|
|
170
|
+
# Post-deployment check
|
|
171
|
+
bash .claude/tools/fast-queries/run_triage.sh gitops
|
|
172
|
+
|
|
173
|
+
# Verify deployment succeeded
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
## Performance Characteristics
|
|
177
|
+
|
|
178
|
+
| Script | Duration | API Calls | Best For |
|
|
179
|
+
|--------|----------|-----------|----------|
|
|
180
|
+
| GitOps | 2-3 sec | ~5 kubectl | Pod health, deployment status |
|
|
181
|
+
| Terraform | 3-4 sec | 0 (local) | Validation, format check |
|
|
182
|
+
| AWS Cloud | 4-5 sec | ~8 AWS | EKS, RDS, VPC health |
|
|
183
|
+
| GCP Cloud | 4-5 sec | ~8 GCP | GKE, Cloud SQL health |
|
|
184
|
+
| All Systems | 8-15 sec | All combined | Full system triage |
|
|
185
|
+
|
|
186
|
+
## Interpreting Results
|
|
187
|
+
|
|
188
|
+
After running fast-queries:
|
|
189
|
+
|
|
190
|
+
1. **✅ All green**: System healthy, proceed with task
|
|
191
|
+
2. **⚠️ Warnings present**: Review warnings, decide if blocking
|
|
192
|
+
3. **❌ Errors found**:
|
|
193
|
+
- Explain findings to user in their language
|
|
194
|
+
- Suggest next steps for investigation
|
|
195
|
+
- Ask if they want deep-dive diagnostics
|
|
196
|
+
|
|
197
|
+
### Example Response Pattern
|
|
198
|
+
|
|
199
|
+
```
|
|
200
|
+
I've run the fast-queries health check:
|
|
201
|
+
|
|
202
|
+
✅ Terraform: All modules valid
|
|
203
|
+
⚠️ GitOps: 2 pods in 'common' namespace restarting frequently
|
|
204
|
+
❌ AWS: EKS cluster 'digital-prod' has nodes in NotReady state
|
|
205
|
+
|
|
206
|
+
The critical issue is the EKS nodes. This is likely causing the pod restarts.
|
|
207
|
+
Would you like me to investigate the EKS node issue in detail?
|
|
208
|
+
```
|
|
209
|
+
|
|
210
|
+
## Cross-Agent Usage
|
|
211
|
+
|
|
212
|
+
These scripts are **shared across all agents**:
|
|
213
|
+
|
|
214
|
+
- **gitops-operator**: Can run cloud and terraform checks
|
|
215
|
+
- **terraform-architect**: Can run cloud and gitops checks
|
|
216
|
+
- **cloud-troubleshooter**: Can run all checks
|
|
217
|
+
- **devops-developer**: Can run all checks for debugging
|
|
218
|
+
|
|
219
|
+
## Limitations
|
|
220
|
+
|
|
221
|
+
- **No write operations**: Fast-queries are read-only (T0)
|
|
222
|
+
- **Snapshot in time**: Results represent current state only
|
|
223
|
+
- **No historical analysis**: For trends, use CloudWatch/Stackdriver
|
|
224
|
+
- **Requires credentials**: AWS/GCP CLI must be configured
|
|
225
|
+
|
|
226
|
+
## Best Practices
|
|
227
|
+
|
|
228
|
+
✅ **Do**:
|
|
229
|
+
- Run fast-queries FIRST before deep investigation
|
|
230
|
+
- Run relevant domain checks (gitops, terraform, cloud)
|
|
231
|
+
- Interpret results and explain to user
|
|
232
|
+
- Use for pre/post validation of changes
|
|
233
|
+
|
|
234
|
+
❌ **Don't**:
|
|
235
|
+
- Skip fast-queries and go straight to detailed commands
|
|
236
|
+
- Run full triage (`all`) when you know the specific domain
|
|
237
|
+
- Use for historical/trend analysis (use monitoring tools instead)
|
|
238
|
+
- Expect fixes (these are diagnostic only)
|
|
239
|
+
|
|
240
|
+
## Integration with Universal Protocol
|
|
241
|
+
|
|
242
|
+
Fast-queries support the investigation phase (PLAN_STATUS: INVESTIGATING):
|
|
243
|
+
|
|
244
|
+
```
|
|
245
|
+
1. User reports issue
|
|
246
|
+
2. Run fast-queries for relevant domain
|
|
247
|
+
3. Analyze results
|
|
248
|
+
4. If issues found → create plan to fix (move to PENDING_APPROVAL)
|
|
249
|
+
5. If all clear → investigate deeper with domain tools
|
|
250
|
+
```
|
|
251
|
+
|
|
252
|
+
---
|
|
253
|
+
|
|
254
|
+
**Last Updated**: 2026-01-15
|
|
255
|
+
**Version**: 1.0
|
|
256
|
+
**Maintained by**: gaia-ops system
|