@jaguilar87/gaia-ops 3.6.0 → 3.6.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@jaguilar87/gaia-ops",
3
- "version": "3.6.0",
3
+ "version": "3.6.1",
4
4
  "description": "Multi-agent orchestration system for Claude Code - DevOps automation toolkit",
5
5
  "main": "index.js",
6
6
  "type": "module",
@@ -42,6 +42,7 @@
42
42
  "templates/",
43
43
  "config/",
44
44
  "speckit/",
45
+ "skills/",
45
46
  "tests/",
46
47
  "README.en.md",
47
48
  "README.md",
@@ -0,0 +1,154 @@
1
+ # Skills System
2
+
3
+ Skills are on-demand knowledge modules loaded based on context triggers. They reduce token duplication and improve maintainability.
4
+
5
+ ## Architecture
6
+
7
+ ```
8
+ .claude/skills/
9
+ ├── workflow/ # How to work (process patterns)
10
+ │ ├── investigation/
11
+ │ ├── approval/
12
+ │ └── execution/
13
+ └── domain/ # What patterns to use (technical patterns)
14
+ ├── terraform-patterns/
15
+ ├── gitops-patterns/
16
+ └── universal-protocol/
17
+ ```
18
+
19
+ ## Skill Categories
20
+
21
+ | Category | Purpose | When Loaded | Example |
22
+ |----------|---------|-------------|---------|
23
+ | **Workflow** | Process/methodology | By workflow phase | investigation-skill: how to investigate before acting |
24
+ | **Domain** | Technical patterns | By keywords in task | terraform-patterns: HCL patterns for this project |
25
+
26
+ ## Trigger Mechanism
27
+
28
+ Skills are loaded when:
29
+ 1. **Workflow phase changes** (automatic) - investigation → approval → execution
30
+ 2. **Task contains trigger keywords** (see `skill-triggers.json`)
31
+
32
+ ## Skill Structure
33
+
34
+ Each skill is a directory containing:
35
+
36
+ ```
37
+ skill-name/
38
+ └── SKILL.md # Core skill content
39
+ ```
40
+
41
+ ### SKILL.md Format
42
+
43
+ ```markdown
44
+ ---
45
+ name: skill-name
46
+ description: Brief description
47
+ triggers: [keyword1, keyword2] # For domain skills
48
+ phase: start|investigation|approval|execution # For workflow skills
49
+ ---
50
+
51
+ # Skill Name
52
+
53
+ [Content that agents will read when skill is loaded]
54
+ ```
55
+
56
+ ## How Skills Work
57
+
58
+ 1. **Hook intercepts Task tool call**
59
+ ```python
60
+ # pre_tool_use.py
61
+ if is_project_agent:
62
+ skills = skill_loader.load_skills(task_prompt, workflow_phase)
63
+ ```
64
+
65
+ 2. **skill_loader.py determines which skills to load**
66
+ ```python
67
+ # Load workflow skill based on phase
68
+ if phase == "start":
69
+ load("workflow/investigation")
70
+
71
+ # Load domain skills based on keywords
72
+ if "terraform" in prompt:
73
+ load("domain/terraform-patterns")
74
+ ```
75
+
76
+ 3. **Skills are injected into prompt**
77
+ ```
78
+ # Project Context (Auto-Injected)
79
+ {...context...}
80
+
81
+ # Active Skills
82
+ ## investigation-skill
83
+ [content of investigation SKILL.md]
84
+
85
+ ## terraform-patterns
86
+ [content of terraform-patterns SKILL.md]
87
+
88
+ ---
89
+ # User Task
90
+ {original prompt}
91
+ ```
92
+
93
+ ## Benefits
94
+
95
+ | Metric | Before Skills | After Skills |
96
+ |--------|---------------|--------------|
97
+ | Token duplication | ~6000 tokens repeated in 4 agents | ~1500 tokens in skills, loaded once |
98
+ | Agent size | ~280 lines each | ~180 lines each |
99
+ | Maintenance | Update 4 files | Update 1 skill |
100
+ | Consistency | Can drift | Guaranteed consistent |
101
+
102
+ ## Usage Example
103
+
104
+ **User request:** "Create a new VPC in terraform"
105
+
106
+ **Skills loaded:**
107
+ 1. `workflow/investigation` (phase: start)
108
+ 2. `domain/terraform-patterns` (trigger: "terraform")
109
+ 3. `domain/universal-protocol` (auto_load for project agents)
110
+
111
+ **Agent receives:**
112
+ - Full project context (~3000 tokens)
113
+ - Investigation skill (~500 tokens) - how to discover patterns first
114
+ - Terraform patterns skill (~600 tokens) - HCL patterns for this project
115
+ - Universal protocol skill (~400 tokens) - AGENT_STATUS format, Security Tiers
116
+
117
+ **Total:** ~4500 tokens vs ~6000 without skills
118
+
119
+ ## Skill Development Guidelines
120
+
121
+ ### Do's
122
+ - ✅ Keep skills focused and specific
123
+ - ✅ Use concrete examples
124
+ - ✅ Include decision trees when applicable
125
+ - ✅ Update skills when patterns change
126
+
127
+ ### Don'ts
128
+ - ❌ Duplicate information across skills
129
+ - ❌ Make skills too generic (defeats the purpose)
130
+ - ❌ Include project-specific credentials/secrets
131
+ - ❌ Create skills for one-time operations
132
+
133
+ ## Testing Skills
134
+
135
+ Test that skills load correctly:
136
+
137
+ ```bash
138
+ # Test skill loader with agent and prompt
139
+ python3 .claude/hooks/modules/skills/skill_loader.py \
140
+ --test \
141
+ --prompt "terraform apply vpc" \
142
+ --agent "terraform-architect"
143
+
144
+ # Expected output:
145
+ # Loaded skills:
146
+ # - workflow/investigation (phase: start)
147
+ # - domain/terraform-patterns (trigger: terraform)
148
+ # - domain/universal-protocol (auto_load)
149
+ ```
150
+
151
+ ## Version History
152
+
153
+ - v1.0 (2026-01-15): Initial skills system with workflow + domain categories
154
+ - v1.1 (2026-01-15): Added universal-protocol skill for all project agents
@@ -0,0 +1,256 @@
1
+ ---
2
+ name: fast-queries
3
+ description: Quick diagnostic scripts for instant health checks (<5 sec). Auto-loads for all infrastructure agents.
4
+ triggers: [health, status, check, triage, diagnose, error, issue, problem, failing, trouble, investigate, debug]
5
+ ---
6
+
7
+ # Fast-Query Diagnostics
8
+
9
+ ## Overview
10
+
11
+ You have access to fast diagnostic scripts that provide health status in **<5 seconds**. These scripts only show problems, not everything, making them ideal for quick triage.
12
+
13
+ **CRITICAL**: Always run relevant fast-queries **FIRST** when investigating issues, checking status, or validating changes.
14
+
15
+ ## Available Health Checks
16
+
17
+ ### 1. All Systems Triage
18
+ ```bash
19
+ bash .claude/tools/fast-queries/run_triage.sh [domain]
20
+ ```
21
+
22
+ **Domains**: `all`, `gitops`, `terraform`, `cloud`, `appservices`
23
+
24
+ **When to use**:
25
+ - User asks general "what's the status?"
26
+ - Starting any investigation
27
+ - Pre-flight checks before changes
28
+ - Post-deployment validation
29
+
30
+ **Example**:
31
+ ```bash
32
+ # Check everything
33
+ bash .claude/tools/fast-queries/run_triage.sh all
34
+
35
+ # Check only Kubernetes
36
+ bash .claude/tools/fast-queries/run_triage.sh gitops
37
+ ```
38
+
39
+ ---
40
+
41
+ ### 2. GitOps/Kubernetes Health
42
+ ```bash
43
+ bash .claude/tools/fast-queries/gitops/quicktriage_gitops_operator.sh [namespace]
44
+ ```
45
+
46
+ **Output**: Problematic pods, deployments not ready, recent warnings
47
+
48
+ **When to use**:
49
+ - Pod/deployment issues
50
+ - Investigating k8s errors
51
+ - Validating flux reconciliation
52
+ - Checking namespace health
53
+
54
+ **Example**:
55
+ ```bash
56
+ # Check specific namespace
57
+ bash .claude/tools/fast-queries/gitops/quicktriage_gitops_operator.sh common
58
+
59
+ # Check all namespaces
60
+ bash .claude/tools/fast-queries/gitops/quicktriage_gitops_operator.sh
61
+ ```
62
+
63
+ ---
64
+
65
+ ### 3. Terraform Validation
66
+ ```bash
67
+ bash .claude/tools/fast-queries/terraform/quicktriage_terraform_architect.sh [directory]
68
+ ```
69
+
70
+ **Output**: ✅/❌ for format, validation, and drift detection
71
+
72
+ **When to use**:
73
+ - Before terraform operations
74
+ - Validating HCL changes
75
+ - Drift detection
76
+ - Pre-commit checks
77
+
78
+ **Example**:
79
+ ```bash
80
+ # Check specific terraform directory
81
+ bash .claude/tools/fast-queries/terraform/quicktriage_terraform_architect.sh terraform/environments/prod
82
+
83
+ # Check base terraform directory
84
+ bash .claude/tools/fast-queries/terraform/quicktriage_terraform_architect.sh terraform/
85
+ ```
86
+
87
+ ---
88
+
89
+ ### 4. AWS Resources Check
90
+ ```bash
91
+ bash .claude/tools/fast-queries/cloud/aws/quicktriage_aws_troubleshooter.sh
92
+ ```
93
+
94
+ **Output**: Status of EKS clusters, RDS, VPC health, recent CloudWatch errors, quota warnings
95
+
96
+ **When to use**:
97
+ - AWS infrastructure issues
98
+ - EKS cluster problems
99
+ - Validating AWS resource state
100
+ - Quota/limit checks
101
+
102
+ **Example**:
103
+ ```bash
104
+ bash .claude/tools/fast-queries/cloud/aws/quicktriage_aws_troubleshooter.sh
105
+ ```
106
+
107
+ ---
108
+
109
+ ### 5. GCP Resources Check
110
+ ```bash
111
+ bash .claude/tools/fast-queries/cloud/gcp/quicktriage_gcp_troubleshooter.sh [project]
112
+ ```
113
+
114
+ **Output**: Status of GKE clusters, Cloud SQL, recent errors, quota warnings
115
+
116
+ **When to use**:
117
+ - GCP infrastructure issues
118
+ - GKE cluster problems
119
+ - Validating GCP resource state
120
+ - Quota/limit checks
121
+
122
+ **Example**:
123
+ ```bash
124
+ # Check specific project
125
+ bash .claude/tools/fast-queries/cloud/gcp/quicktriage_gcp_troubleshooter.sh vtr-digital-prod
126
+
127
+ # Check default project
128
+ bash .claude/tools/fast-queries/cloud/gcp/quicktriage_gcp_troubleshooter.sh
129
+ ```
130
+
131
+ ---
132
+
133
+ ## Output Format
134
+
135
+ All scripts follow the same pattern:
136
+
137
+ - ✅ **Green check** = Healthy/OK - No action needed
138
+ - ⚠️ **Yellow warning** = Warning/non-critical issue - Review recommended
139
+ - ❌ **Red X** = Problem detected - Action required
140
+
141
+ ### Exit Codes
142
+ - `0` = All healthy
143
+ - `1` = Issues found (warnings or errors)
144
+ - `2` = Script error (missing tools, permissions)
145
+
146
+ ## Workflow Integration
147
+
148
+ ### Investigation Phase (T0 - Read Only)
149
+ ```bash
150
+ # ALWAYS start with fast-queries
151
+ bash .claude/tools/fast-queries/run_triage.sh all
152
+
153
+ # Then deep-dive based on findings
154
+ # If gitops shows errors → kubectl describe pod X
155
+ # If terraform shows drift → terraform plan
156
+ # If cloud shows issues → aws/gcloud describe X
157
+ ```
158
+
159
+ ### Before T3 Operations (Apply/Deploy)
160
+ ```bash
161
+ # Pre-flight check
162
+ bash .claude/tools/fast-queries/run_triage.sh terraform
163
+
164
+ # If ✅ proceed with terraform apply
165
+ # If ❌ fix issues first
166
+ ```
167
+
168
+ ### After T3 Operations (Validation)
169
+ ```bash
170
+ # Post-deployment check
171
+ bash .claude/tools/fast-queries/run_triage.sh gitops
172
+
173
+ # Verify deployment succeeded
174
+ ```
175
+
176
+ ## Performance Characteristics
177
+
178
+ | Script | Duration | API Calls | Best For |
179
+ |--------|----------|-----------|----------|
180
+ | GitOps | 2-3 sec | ~5 kubectl | Pod health, deployment status |
181
+ | Terraform | 3-4 sec | 0 (local) | Validation, format check |
182
+ | AWS Cloud | 4-5 sec | ~8 AWS | EKS, RDS, VPC health |
183
+ | GCP Cloud | 4-5 sec | ~8 GCP | GKE, Cloud SQL health |
184
+ | All Systems | 8-15 sec | All combined | Full system triage |
185
+
186
+ ## Interpreting Results
187
+
188
+ After running fast-queries:
189
+
190
+ 1. **✅ All green**: System healthy, proceed with task
191
+ 2. **⚠️ Warnings present**: Review warnings, decide if blocking
192
+ 3. **❌ Errors found**:
193
+ - Explain findings to user in their language
194
+ - Suggest next steps for investigation
195
+ - Ask if they want deep-dive diagnostics
196
+
197
+ ### Example Response Pattern
198
+
199
+ ```
200
+ I've run the fast-queries health check:
201
+
202
+ ✅ Terraform: All modules valid
203
+ ⚠️ GitOps: 2 pods in 'common' namespace restarting frequently
204
+ ❌ AWS: EKS cluster 'digital-prod' has nodes in NotReady state
205
+
206
+ The critical issue is the EKS nodes. This is likely causing the pod restarts.
207
+ Would you like me to investigate the EKS node issue in detail?
208
+ ```
209
+
210
+ ## Cross-Agent Usage
211
+
212
+ These scripts are **shared across all agents**:
213
+
214
+ - **gitops-operator**: Can run cloud and terraform checks
215
+ - **terraform-architect**: Can run cloud and gitops checks
216
+ - **cloud-troubleshooter**: Can run all checks
217
+ - **devops-developer**: Can run all checks for debugging
218
+
219
+ ## Limitations
220
+
221
+ - **No write operations**: Fast-queries are read-only (T0)
222
+ - **Snapshot in time**: Results represent current state only
223
+ - **No historical analysis**: For trends, use CloudWatch/Stackdriver
224
+ - **Requires credentials**: AWS/GCP CLI must be configured
225
+
226
+ ## Best Practices
227
+
228
+ ✅ **Do**:
229
+ - Run fast-queries FIRST before deep investigation
230
+ - Run relevant domain checks (gitops, terraform, cloud)
231
+ - Interpret results and explain to user
232
+ - Use for pre/post validation of changes
233
+
234
+ ❌ **Don't**:
235
+ - Skip fast-queries and go straight to detailed commands
236
+ - Run full triage (`all`) when you know the specific domain
237
+ - Use for historical/trend analysis (use monitoring tools instead)
238
+ - Expect fixes (these are diagnostic only)
239
+
240
+ ## Integration with Universal Protocol
241
+
242
+ Fast-queries support the investigation phase (PLAN_STATUS: INVESTIGATING):
243
+
244
+ ```
245
+ 1. User reports issue
246
+ 2. Run fast-queries for relevant domain
247
+ 3. Analyze results
248
+ 4. If issues found → create plan to fix (move to PENDING_APPROVAL)
249
+ 5. If all clear → investigate deeper with domain tools
250
+ ```
251
+
252
+ ---
253
+
254
+ **Last Updated**: 2026-01-15
255
+ **Version**: 1.0
256
+ **Maintained by**: gaia-ops system