@raishin/vanguard-frontier-agentic 1.3.0 → 1.5.0

Files changed (261)
  1. package/README.md +23 -1
  2. package/agents/kubernetes/README.md +10 -1
  3. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/AGENT.md +12 -0
  4. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/harnesses/claude-code.agent.md +12 -0
  5. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/harnesses/copilot.agent.md +12 -0
  6. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/harnesses/cursor.agent.md +12 -0
  7. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/harnesses/gemini.agent.md +12 -0
  8. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/harnesses/kiro-cli.agent.json +1 -1
  9. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/harnesses/kiro-ide.agent.md +12 -0
  10. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/metadata.json +6 -3
  11. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/references/least-privilege-rbac.yaml +98 -0
  12. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/references/rbac-pre-flight.md +108 -0
  13. package/agents/kubernetes/kubernetes-live-admission-policy-guard-agent/references/refusal-list.md +112 -0
  14. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/AGENT.md +13 -1
  15. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/harnesses/claude-code.agent.md +12 -0
  16. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/harnesses/copilot.agent.md +12 -0
  17. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/harnesses/cursor.agent.md +12 -0
  18. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/harnesses/gemini.agent.md +12 -0
  19. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/harnesses/kiro-cli.agent.json +1 -1
  20. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/harnesses/kiro-ide.agent.md +12 -0
  21. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/metadata.json +6 -3
  22. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/references/least-privilege-rbac.yaml +92 -0
  23. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/references/rbac-pre-flight.md +108 -0
  24. package/agents/kubernetes/kubernetes-live-argocd-sync-guard-agent/references/refusal-list.md +112 -0
  25. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/AGENT.md +13 -1
  26. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/harnesses/claude-code.agent.md +12 -0
  27. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/harnesses/copilot.agent.md +12 -0
  28. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/harnesses/cursor.agent.md +12 -0
  29. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/harnesses/gemini.agent.md +12 -0
  30. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/harnesses/kiro-cli.agent.json +1 -1
  31. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/harnesses/kiro-ide.agent.md +12 -0
  32. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/metadata.json +6 -3
  33. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/references/least-privilege-rbac.yaml +101 -0
  34. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/references/rbac-pre-flight.md +106 -0
  35. package/agents/kubernetes/kubernetes-live-mesh-policy-guard-agent/references/refusal-list.md +102 -0
  36. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/AGENT.md +71 -0
  37. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/harnesses/claude-code.agent.md +54 -0
  38. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/harnesses/codex.toml +38 -0
  39. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/harnesses/copilot.agent.md +54 -0
  40. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/harnesses/cursor.agent.md +54 -0
  41. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/harnesses/gemini.agent.md +54 -0
  42. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/harnesses/kiro-cli.agent.json +5 -0
  43. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/harnesses/kiro-ide.agent.md +54 -0
  44. package/agents/kubernetes/kubernetes-live-network-architecture-mutation-guard-agent/metadata.json +44 -0
  45. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/AGENT.md +14 -2
  46. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/harnesses/claude-code.agent.md +13 -1
  47. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/harnesses/copilot.agent.md +13 -1
  48. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/harnesses/cursor.agent.md +13 -1
  49. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/harnesses/gemini.agent.md +13 -1
  50. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/harnesses/kiro-cli.agent.json +1 -1
  51. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/harnesses/kiro-ide.agent.md +13 -1
  52. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/metadata.json +6 -3
  53. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/references/least-privilege-rbac.yaml +101 -0
  54. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/references/rbac-pre-flight.md +106 -0
  55. package/agents/kubernetes/kubernetes-live-network-policy-guard-agent/references/refusal-list.md +102 -0
  56. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/AGENT.md +12 -0
  57. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/harnesses/claude-code.agent.md +12 -0
  58. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/harnesses/copilot.agent.md +12 -0
  59. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/harnesses/cursor.agent.md +12 -0
  60. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/harnesses/gemini.agent.md +12 -0
  61. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/harnesses/kiro-cli.agent.json +1 -1
  62. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/harnesses/kiro-ide.agent.md +12 -0
  63. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/metadata.json +6 -3
  64. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/references/least-privilege-rbac.yaml +92 -0
  65. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/references/rbac-pre-flight.md +115 -0
  66. package/agents/kubernetes/kubernetes-live-rbac-mutation-guard-agent/references/refusal-list.md +132 -0
  67. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/AGENT.md +15 -3
  68. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/harnesses/claude-code.agent.md +15 -3
  69. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/harnesses/codex.toml +2 -2
  70. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/harnesses/copilot.agent.md +15 -3
  71. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/harnesses/cursor.agent.md +15 -3
  72. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/harnesses/gemini.agent.md +15 -3
  73. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/harnesses/kiro-cli.agent.json +1 -1
  74. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/harnesses/kiro-ide.agent.md +15 -3
  75. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/metadata.json +7 -4
  76. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/references/least-privilege-rbac.yaml +92 -0
  77. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/references/rbac-pre-flight.md +109 -0
  78. package/agents/kubernetes/kubernetes-live-velero-restore-guard-agent/references/refusal-list.md +122 -0
  79. package/agents/kubernetes/kubernetes-network-architecture-review-agent/AGENT.md +65 -0
  80. package/agents/kubernetes/kubernetes-network-architecture-review-agent/harnesses/claude-code.agent.md +48 -0
  81. package/agents/kubernetes/kubernetes-network-architecture-review-agent/harnesses/codex.toml +37 -0
  82. package/agents/kubernetes/kubernetes-network-architecture-review-agent/harnesses/copilot.agent.md +48 -0
  83. package/agents/kubernetes/kubernetes-network-architecture-review-agent/harnesses/cursor.agent.md +48 -0
  84. package/agents/kubernetes/kubernetes-network-architecture-review-agent/harnesses/gemini.agent.md +48 -0
  85. package/agents/kubernetes/kubernetes-network-architecture-review-agent/harnesses/kiro-cli.agent.json +5 -0
  86. package/agents/kubernetes/kubernetes-network-architecture-review-agent/harnesses/kiro-ide.agent.md +48 -0
  87. package/agents/kubernetes/kubernetes-network-architecture-review-agent/metadata.json +44 -0
  88. package/agents/kubernetes/kubernetes-psa-review-agent/metadata.json +2 -1
  89. package/agents/terraform/terraform-reviewer/AGENT.md +2 -1
  90. package/catalog/agents.json +78 -12
  91. package/catalog/install-roles.json +8 -4
  92. package/catalog/skill-manifest.json +521 -422
  93. package/catalog/skills.json +67 -0
  94. package/package.json +23 -4
  95. package/schemas/AGENTS.md +14 -0
  96. package/schemas/agent.frontmatter.schema.json +89 -0
  97. package/schemas/agent.schema.json +8 -0
  98. package/schemas/skill.frontmatter.schema.json +95 -0
  99. package/scripts/apply-skill-allowed-tools.py +142 -0
  100. package/scripts/backfill-skill-metadata.py +410 -0
  101. package/scripts/export-marketplace-agents.mjs +175 -0
  102. package/skills/argocd/argo-rollouts-progressive-delivery-review/SKILL.md +3 -0
  103. package/skills/argocd/argocd-gitops-review/SKILL.md +3 -0
  104. package/skills/aws/aws-agentcore/SKILL.md +3 -0
  105. package/skills/aws/aws-api-edge-delivery-review/SKILL.md +3 -0
  106. package/skills/aws/aws-bedrock-agent-security-governor/SKILL.md +3 -0
  107. package/skills/aws/aws-change-impact-advisor/SKILL.md +3 -0
  108. package/skills/aws/aws-ci-cd-release-engineer/SKILL.md +3 -0
  109. package/skills/aws/aws-compliance-evidence-mapper/SKILL.md +3 -0
  110. package/skills/aws/aws-cost-anomaly-watch-coordinator/SKILL.md +3 -0
  111. package/skills/aws/aws-cost-optimization-governor/SKILL.md +3 -0
  112. package/skills/aws/aws-daily-operations-briefing-coordinator/SKILL.md +3 -0
  113. package/skills/aws/aws-data-protection-backup-steward/SKILL.md +3 -0
  114. package/skills/aws/aws-deployment-hotfix-operator/SKILL.md +3 -0
  115. package/skills/aws/aws-devops-agent-skill-designer/SKILL.md +3 -0
  116. package/skills/aws/aws-dynamodb-data-modeling-performance-review/SKILL.md +3 -0
  117. package/skills/aws/aws-ec2-compute-operations-steward/SKILL.md +3 -0
  118. package/skills/aws/aws-ecs-fargate-platform-operator/SKILL.md +3 -0
  119. package/skills/aws/aws-ecs-service-remediation-operator/SKILL.md +3 -0
  120. package/skills/aws/aws-eks-platform-operator/SKILL.md +3 -0
  121. package/skills/aws/aws-event-driven-architecture-review/SKILL.md +3 -0
  122. package/skills/aws/aws-generative-ai-developer/SKILL.md +3 -0
  123. package/skills/aws/aws-iac-change-safety-review/SKILL.md +3 -0
  124. package/skills/aws/aws-iac-patch-executor/SKILL.md +3 -0
  125. package/skills/aws/aws-iam-least-privilege-review/SKILL.md +3 -0
  126. package/skills/aws/aws-kms-secrets-lifecycle-steward/SKILL.md +3 -0
  127. package/skills/aws/aws-landing-zone-governor/SKILL.md +3 -0
  128. package/skills/aws/aws-live-deployment-guarded-operator/SKILL.md +3 -0
  129. package/skills/aws/aws-live-ecs-rollout-guard/SKILL.md +3 -0
  130. package/skills/aws/aws-live-iac-change-guard/SKILL.md +3 -0
  131. package/skills/aws/aws-live-pipeline-approval-operator/SKILL.md +3 -0
  132. package/skills/aws/aws-live-serverless-release-guard/SKILL.md +3 -0
  133. package/skills/aws/aws-maestro/SKILL.md +3 -0
  134. package/skills/aws/aws-migration-cutover-architect/SKILL.md +3 -0
  135. package/skills/aws/aws-network-architect/SKILL.md +3 -0
  136. package/skills/aws/aws-non-destructive-task-automation-advisor/SKILL.md +3 -0
  137. package/skills/aws/aws-observability-incident-responder/SKILL.md +3 -0
  138. package/skills/aws/aws-pipeline-fix-operator/SKILL.md +3 -0
  139. package/skills/aws/aws-private-ca-issuer-review/SKILL.md +3 -0
  140. package/skills/aws/aws-rds-aurora-performance-investigator/SKILL.md +3 -0
  141. package/skills/aws/aws-resilience-bcdr-review/SKILL.md +3 -0
  142. package/skills/aws/aws-s3-data-perimeter-governor/SKILL.md +3 -0
  143. package/skills/aws/aws-security-posture-hardening/SKILL.md +3 -0
  144. package/skills/aws/aws-serverless-production-readiness/SKILL.md +3 -0
  145. package/skills/aws/aws-serverless-rollout-corrector/SKILL.md +3 -0
  146. package/skills/aws/aws-solution-architect/SKILL.md +3 -0
  147. package/skills/aws/aws-ticket-triage-escalation-coordinator/SKILL.md +3 -0
  148. package/skills/azure/azure-ai-foundry-ops-governor/SKILL.md +3 -0
  149. package/skills/azure/azure-aks-platform-operator/SKILL.md +3 -0
  150. package/skills/azure/azure-app-service-production-readiness/SKILL.md +3 -0
  151. package/skills/azure/azure-cosmosdb-application-developer/SKILL.md +3 -0
  152. package/skills/azure/azure-cosmosdb-performance-investigator/SKILL.md +3 -0
  153. package/skills/azure/azure-cosmosdb-platform-operator/SKILL.md +3 -0
  154. package/skills/azure/azure-cost-estimation-review/SKILL.md +3 -0
  155. package/skills/azure/azure-cost-optimization-governor/SKILL.md +3 -0
  156. package/skills/azure/azure-entra-id-specialist/SKILL.md +3 -0
  157. package/skills/azure/azure-governance-policy-guardrails/SKILL.md +3 -0
  158. package/skills/azure/azure-identity-governance-review/SKILL.md +3 -0
  159. package/skills/azure/azure-key-vault-secret-lifecycle-auditor/SKILL.md +3 -0
  160. package/skills/azure/azure-keyvault-certificate-issuer-review/SKILL.md +3 -0
  161. package/skills/azure/azure-landing-zone-architect/SKILL.md +3 -0
  162. package/skills/azure/azure-live-aks-rollout-guard/SKILL.md +3 -0
  163. package/skills/azure/azure-live-app-service-slot-swap-guard/SKILL.md +3 -0
  164. package/skills/azure/azure-live-arm-deployment-stack-guard/SKILL.md +3 -0
  165. package/skills/azure/azure-live-cost-budget-action-guard/SKILL.md +3 -0
  166. package/skills/azure/azure-live-entra-role-assignment-guard/SKILL.md +3 -0
  167. package/skills/azure/azure-live-keyvault-rotation-purge-guard/SKILL.md +3 -0
  168. package/skills/azure/azure-live-pim-jit-activation-guard/SKILL.md +3 -0
  169. package/skills/azure/azure-maestro/SKILL.md +3 -0
  170. package/skills/azure/azure-migrate-landing-zone-cutover/SKILL.md +3 -0
  171. package/skills/azure/azure-network-topology-review/SKILL.md +3 -0
  172. package/skills/azure/azure-observability-investigator/SKILL.md +3 -0
  173. package/skills/azure/azure-platform-automation-devops/SKILL.md +3 -0
  174. package/skills/azure/azure-private-endpoint-adoption-planner/SKILL.md +3 -0
  175. package/skills/azure/azure-rbac-review/SKILL.md +3 -0
  176. package/skills/azure/azure-resilience-bcdr-review/SKILL.md +3 -0
  177. package/skills/azure/azure-resource-health-incident-triage/SKILL.md +3 -0
  178. package/skills/azure/azure-role-selector/SKILL.md +3 -0
  179. package/skills/azure/azure-security-posture-hardening/SKILL.md +3 -0
  180. package/skills/azure/azure-subscription-resource-organization/SKILL.md +3 -0
  181. package/skills/backstage/backstage-scaffolder-template-review/SKILL.md +3 -0
  182. package/skills/cert-manager/cert-manager-issuer-trust-review/SKILL.md +3 -0
  183. package/skills/cilium/cilium-network-policy-review/SKILL.md +3 -0
  184. package/skills/falco/falco-runtime-threat-rules-review/SKILL.md +3 -0
  185. package/skills/finops/finops-cloud-price-advisor/SKILL.md +3 -0
  186. package/skills/fluxcd/fluxcd-kustomization-helmrelease-review/SKILL.md +3 -0
  187. package/skills/istio/istio-ambient-mesh-review/SKILL.md +3 -0
  188. package/skills/kubernetes/README.md +5 -1
  189. package/skills/kubernetes/external-secrets-operator-review/SKILL.md +3 -0
  190. package/skills/kubernetes/kubecost-chargeback-allocation-review/SKILL.md +3 -0
  191. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/SKILL.md +82 -0
  192. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/metadata.json +33 -0
  193. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/references/least-privilege-rbac.yaml +210 -0
  194. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/references/official-sources.md +41 -0
  195. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/references/permitted-mutations.md +173 -0
  196. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/references/rbac-pre-flight.md +252 -0
  197. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/references/refusal-list.md +313 -0
  198. package/skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/references/rollback-patterns.md +103 -0
  199. package/skills/kubernetes/kubernetes-live-rbac-mutation-guard/SKILL.md +3 -0
  200. package/skills/kubernetes/kubernetes-maestro/SKILL.md +3 -0
  201. package/skills/kubernetes/kubernetes-maestro/references/safety-checklist.md +1 -1
  202. package/skills/kubernetes/kubernetes-maestro/references/workflow-and-output.md +57 -5
  203. package/skills/kubernetes/kubernetes-network-architecture-review/SKILL.md +84 -0
  204. package/skills/kubernetes/kubernetes-network-architecture-review/metadata.json +34 -0
  205. package/skills/kubernetes/kubernetes-network-architecture-review/references/dataplane-and-cni.md +89 -0
  206. package/skills/kubernetes/kubernetes-network-architecture-review/references/dns-and-discovery.md +120 -0
  207. package/skills/kubernetes/kubernetes-network-architecture-review/references/mcp-and-evidence.md +53 -0
  208. package/skills/kubernetes/kubernetes-network-architecture-review/references/multi-cluster-and-egress.md +69 -0
  209. package/skills/kubernetes/kubernetes-network-architecture-review/references/official-sources.md +54 -0
  210. package/skills/kubernetes/kubernetes-network-architecture-review/references/service-gateway-routing.md +108 -0
  211. package/skills/kubernetes/kubernetes-network-architecture-review/references/troubleshooting-playbook.md +100 -0
  212. package/skills/kubernetes/kubernetes-pod-security-admission-review/SKILL.md +3 -0
  213. package/skills/kubernetes/kubernetes-pod-spec-review/SKILL.md +3 -0
  214. package/skills/kubernetes/kubernetes-rbac-review/SKILL.md +3 -0
  215. package/skills/kubernetes/kubernetes-workload-identity-review/SKILL.md +3 -0
  216. package/skills/kyverno/kyverno-policy-review/SKILL.md +3 -0
  217. package/skills/oci/oci-autonomous-database-architect/SKILL.md +3 -0
  218. package/skills/oci/oci-certificates-issuer-review/SKILL.md +3 -0
  219. package/skills/oci/oci-cloud-guard-responder/SKILL.md +3 -0
  220. package/skills/oci/oci-compute-instance-agent-operator/SKILL.md +3 -0
  221. package/skills/oci/oci-compute-platform-operator/SKILL.md +3 -0
  222. package/skills/oci/oci-cost-finops-analyst/SKILL.md +3 -0
  223. package/skills/oci/oci-database-platform-dba/SKILL.md +3 -0
  224. package/skills/oci/oci-dbtools-sql-analyst/SKILL.md +3 -0
  225. package/skills/oci/oci-devops-container-platform-engineer/SKILL.md +3 -0
  226. package/skills/oci/oci-exadata-database-architect/SKILL.md +3 -0
  227. package/skills/oci/oci-exadata-platform-architect/SKILL.md +3 -0
  228. package/skills/oci/oci-fusion-apps-environment-operator/SKILL.md +3 -0
  229. package/skills/oci/oci-goldengate-replication-operator/SKILL.md +3 -0
  230. package/skills/oci/oci-identity-access-governor/SKILL.md +3 -0
  231. package/skills/oci/oci-iot-digital-twin-engineer/SKILL.md +3 -0
  232. package/skills/oci/oci-limits-capacity-planner/SKILL.md +3 -0
  233. package/skills/oci/oci-live-autonomous-db-lifecycle-guard/SKILL.md +3 -0
  234. package/skills/oci/oci-live-cost-budget-runaway-guard/SKILL.md +3 -0
  235. package/skills/oci/oci-live-iam-policy-compartment-guard/SKILL.md +3 -0
  236. package/skills/oci/oci-live-network-security-rule-guard/SKILL.md +3 -0
  237. package/skills/oci/oci-live-oke-rollout-guard/SKILL.md +3 -0
  238. package/skills/oci/oci-live-resource-manager-stack-guard/SKILL.md +3 -0
  239. package/skills/oci/oci-live-vault-key-destruction-guard/SKILL.md +3 -0
  240. package/skills/oci/oci-load-balancer-traffic-engineer/SKILL.md +3 -0
  241. package/skills/oci/oci-maestro/SKILL.md +3 -0
  242. package/skills/oci/oci-migration-cutover-architect/SKILL.md +3 -0
  243. package/skills/oci/oci-multi-cloud-architect/SKILL.md +3 -0
  244. package/skills/oci/oci-mysql-heatwave-ai-specialist/SKILL.md +3 -0
  245. package/skills/oci/oci-network-architect/SKILL.md +3 -0
  246. package/skills/oci/oci-observability-incident-responder/SKILL.md +3 -0
  247. package/skills/oci/oci-recovery-service-operator/SKILL.md +3 -0
  248. package/skills/oci/oci-registry-artifact-governor/SKILL.md +3 -0
  249. package/skills/oci/oci-resource-search-inventory-analyst/SKILL.md +3 -0
  250. package/skills/oci/oci-security-compliance-reviewer/SKILL.md +3 -0
  251. package/skills/oci/oci-solution-architect/SKILL.md +3 -0
  252. package/skills/oci/oci-storage-backup-steward/SKILL.md +3 -0
  253. package/skills/oci/oci-support-incident-coordinator/SKILL.md +3 -0
  254. package/skills/oci/oracle-oci-mcp-grounded-advisor/SKILL.md +3 -0
  255. package/skills/opentelemetry/opentelemetry-collector-config-review/SKILL.md +3 -0
  256. package/skills/prometheus/prometheus-alerting-cardinality-review/SKILL.md +3 -0
  257. package/skills/sigstore/sigstore-cosign-supply-chain-review/SKILL.md +3 -0
  258. package/skills/terraform/terraform-maestro/SKILL.md +3 -0
  259. package/skills/velero/velero-backup-restore-guard/SKILL.md +5 -2
  260. package/skills/velero/velero-backup-restore-guard/references/safety-checklist.md +1 -1
  261. package/skills/velero/velero-backup-restore-guard/references/workflow-and-output.md +17 -8
@@ -0,0 +1,103 @@
+ # Rollback patterns
+
+ Every permitted mutation has a documented rollback verb. The agent must surface the rollback verb in the response **before** executing the mutation, not after.
+
+ The default rollback strategy across this guard is **`kubectl apply -f <baseline.yaml>`**, never `kubectl delete`. Apply re-establishes the prior state; delete removes the resource entirely (and may cascade-delete children).
+
+ ---
+
+ ## Service spec patches
+
+ | Mutation | Rollback verb |
+ |---|---|
+ | `patch internalTrafficPolicy` | `kubectl apply -f /tmp/svc-<name>.before.yaml` |
+ | `patch externalTrafficPolicy` | `kubectl apply -f /tmp/svc-<name>.before.yaml` |
+ | `annotate topology-mode=Auto` | `kubectl annotate svc <name> -n <ns> service.kubernetes.io/topology-mode-` |
+ | `patch trafficDistribution` | `kubectl patch svc <name> -n <ns> --type=json -p='[{"op":"remove","path":"/spec/trafficDistribution"}]'` |
+
+ Post-rollback verification: `kubectl get endpointslice -n <ns> -l kubernetes.io/service-name=<name>` should show populated endpoints; `kubectl get svc <name> -n <ns> -o jsonpath='{.spec}'` should match `/tmp/svc-<name>.before.yaml`.
+
+ ---
+
+ ## CoreDNS Corefile patch
+
+ ```bash
+ # Rollback
+ kubectl apply -f /tmp/coredns.before.yaml
+
+ # Force reload by deleting the oldest pod (rolling restart-equivalent;
+ # the `reload` plugin will pick up the restored Corefile within 30s, but
+ # evicting one pod accelerates recovery)
+ kubectl -n kube-system delete "$(kubectl -n kube-system get pods -l k8s-app=kube-dns --field-selector status.phase=Running --sort-by=.metadata.creationTimestamp -o name | head -n 1)"
+
+ # Verify
+ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50 --since=2m | grep -i "reload"
+ kubectl -n kube-system get pods -l k8s-app=kube-dns
+ ```
+
+ If the rollback `apply -f` succeeds but a CoreDNS pod is in `CrashLoopBackOff` from a previous bad apply, do not delete the ConfigMap — the cluster may be running on the cached previous Corefile. Diagnose with `kubectl describe pod -n kube-system <coredns-pod>` and surface the error to the operator.
+
+ ---
+
+ ## NodeLocal DNSCache install rollback
+
+ NodeLocal DNSCache is **not** trivially reversible — uninstalling it during traffic causes every Pod's DNS to fail until kube-proxy / Cilium iptables-redirect rules are also reverted. Treat the install as a maintenance-window-only operation.
+
+ ```bash
+ # Pre-uninstall: confirm the cluster's kube-proxy mode and iptables-redirect rules
+ kubectl -n kube-system get cm kube-proxy -o jsonpath='{.data.config\.conf}' | grep -i mode
+
+ # Uninstall (drains the redirect rules first because the DaemonSet's preStop hook handles cleanup)
+ kubectl delete -f /tmp/nodelocaldns-rendered.yaml --grace-period=60
+
+ # Verify pods drained cleanly
+ kubectl -n kube-system get pods -l k8s-app=node-local-dns
+ # (should return no resources)
+
+ # Verify DNS still works post-rollback
+ kubectl run --rm -it --restart=Never dns-test --image=busybox:1.36 -- nslookup kubernetes.default
+ ```
+
+ If any Pod's DNS fails immediately after the rollback, the iptables/IPVS redirect rules did not drain. The operator may need to flush conntrack and restart kube-proxy DaemonSet pods (which is itself a kube-system DaemonSet write — outside this guard's scope; refer to platform team).
+
+ ---
+
+ ## Gateway API resource rollback
+
+ Apply-vs-delete distinction matters here. Routes that were just **created** (no prior state) should be rolled back by deletion of that single resource, not `apply -f`. Routes that were **patched** roll back via `apply -f baseline.yaml`.
+
+ ```bash
+ # Patched resource — roll back to baseline
+ kubectl apply -f /tmp/<resource>.before.yaml
+
+ # Newly created resource — delete it (the agent is allowed to delete only its own creations)
+ # NOTE: this guard's RBAC binding does NOT grant `delete` on Gateway API resources.
+ # Newly-created Gateway API resources require operator-confirmed delete via a different principal.
+ kubectl delete <kind>/<name> -n <ns>
+ ```
+
+ For `Gateway` resources, post-rollback verification: `kubectl get gateway <name> -n <ns>` should return either the baseline state or `NotFound`. The associated controller's Pods should not show `Reconcile` errors in their logs.
+
+ ---
+
+ ## ClusterMesh peer Secret rollback
+
+ ```bash
+ # Delete the peer secret
+ kubectl delete secret <peer-cluster-name> -n kube-system
+
+ # Verify ClusterMesh status reflects peer disconnection
+ kubectl exec -n kube-system ds/cilium -- cilium clustermesh status
+ ```
+
+ The kvstore replication state caches remote peer endpoint maps. Per `--clustermesh-cache-ttl` (default `0s` per upstream `docs.cilium.io`), the cache is **never revoked** after disconnect unless explicitly configured. Operators should pre-set a non-zero TTL before peering is established, otherwise rolling back the peer Secret leaves stale `ServiceImports` indefinitely.
+
+ ---
+
+ ## Universal rollback rules
+
+ 1. **Capture before write.** No baseline → no rollback → no mutation. The agent refuses if baseline capture failed.
+ 2. **Apply, don't delete, when in doubt.** Apply re-establishes the prior state idempotently. Delete cascades.
+ 3. **Verify after rollback.** A rollback is not complete until verification confirms the prior state holds.
+ 4. **Surface the rollback command before the mutation.** The user sees the rollback in the response shape **before** they approve the mutation, not after.
+ 5. **The rollback verb is part of the proposal, not a follow-up.** If the agent cannot produce a rollback, it cannot produce a mutation.
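Reviewer note: the "capture before write" and "surface the rollback before the mutation" rules can be sketched as a small shell function. This is an illustrative sketch only, not code from the package: the name `propose_mutation`, its output format, and the example Service `web` in namespace `prod` are all assumptions. It only prints a proposal and never calls `kubectl` itself.

```bash
#!/usr/bin/env bash
# Sketch only: propose_mutation is a hypothetical helper, not part of the package.
# It prints the rollback command before the mutation and refuses when the
# baseline file is missing or empty (no baseline, no rollback, no mutation).
propose_mutation() {
  local baseline="$1"   # path to the YAML captured before the write
  shift                 # remaining arguments form the proposed mutation command
  if [ ! -s "$baseline" ]; then
    echo "REFUSED: baseline capture missing or empty: $baseline" >&2
    return 1
  fi
  echo "Rollback: kubectl apply -f $baseline"
  echo "Mutation: $*"
}

# Example (hypothetical names), run by an operator against a confirmed cluster:
#   kubectl get svc web -n prod -o yaml > /tmp/svc-web.before.yaml
#   propose_mutation /tmp/svc-web.before.yaml kubectl patch svc web -n prod --type=merge -p '...'
```

Printing the rollback line first mirrors rule 4: the operator sees how to undo the change before approving it, and a failed baseline capture stops the proposal entirely.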
@@ -1,9 +1,12 @@
  ---
  name: kubernetes-live-rbac-mutation-guard
  description: Guard live kubectl apply, create, or delete operations on Kubernetes RBAC objects — Roles, ClusterRoles, RoleBindings, ClusterRoleBindings — with privilege-escalation verb detection, scope assessment, current-state diff, and explicit approval before any write. Use only when an intentional RBAC mutation is requested against a confirmed cluster target.
+ allowed-tools: Read Grep Glob WebFetch
  metadata:
  author: "github: Raishin"
  version: "0.1.0"
+ updated: "2026-05-05"
+ category: security
  ---
 
  # Kubernetes Live RBAC Mutation Guard
@@ -1,9 +1,12 @@
  ---
  name: kubernetes-maestro
  description: Route Kubernetes tasks to the narrowest specialist or team of specialists from the catalog. Use when you do not already know the specialist. Not for direct Kubernetes answers; Maestro classifies, dispatches, and synthesizes only. Dispatches single agent for focused tasks, parallel team (max 4) for multi-domain tasks. Never auto-dispatches live-guard agents — requires explicit human confirmation with blast-radius and rollback before routing to any live mutation specialist.
+ allowed-tools: Agent Skill Read Grep Glob
  metadata:
  author: "github: Raishin"
  version: "0.1.0"
+ updated: "2026-05-05"
+ category: ai
  ---
 
  # Kubernetes Maestro — Routing Skill
@@ -58,7 +58,7 @@ kubectl get application -n argocd <app-name> -o yaml | grep -A5 status
 
  ### Network Policy (kubernetes-live-network-policy-guard-agent)
  ```shell
- cilium monitor --type drop -n <namespace> # Cilium: watch for drops
+ kubectl -n kube-system exec ds/cilium -- cilium-dbg monitor --type drop # Cilium: watch for drops (in-pod cilium-dbg)
  hubble observe --namespace <namespace> # Hubble: traffic observation
  kubectl get cnp,ccnp,netpol -n <namespace>
  ```
@@ -14,7 +14,9 @@ Use this reference when classifying a task or selecting the right specialist(s).
14
14
  | IRSA, workload identity, serviceAccountToken, OIDC trust, pod identity, azure workload identity, GKE WI, annotate serviceaccount, projected token, eks.amazonaws.com | kubernetes-workload-identity-review-agent | Workload identity review | No |
15
15
  | Istio, ambient mesh, waypoint, ztunnel, AuthorizationPolicy, PeerAuthentication, mTLS, RequestAuthentication, VirtualService, DestinationRule, HBONE | istio-ambient-mesh-review-agent | Istio mesh review | No |
16
16
  | apply AuthorizationPolicy, apply PeerAuthentication, change mTLS, delete DENY policy, enable PERMISSIVE, istioctl apply | kubernetes-live-mesh-policy-guard-agent | Live mesh policy mutation | YES |
17
- | Cilium, CiliumNetworkPolicy, CiliumClusterwideNetworkPolicy, NetworkPolicy, ClusterMesh, egress gateway, Hubble, L7 policy, toCIDRSet | cilium-network-policy-review-agent | Cilium network policy review | No |
17
+ | CNI choice, kube-proxy, kube-proxy mode, kube-proxy replacement, IPAM, MTU, encapsulation, VXLAN, Geneve, dual-stack, IPv6, Pod CIDR, Service CIDR, EndpointSlices, internalTrafficPolicy, externalTrafficPolicy, topology-aware routing, trafficDistribution, Ingress, Gateway API, GRPCRoute, HTTPRoute, GatewayClass, CoreDNS, NodeLocal DNSCache, ndots, Corefile, Submariner, MCS-API, ClusterMesh topology, ClusterMesh kvstore, conntrack, NodePort path | kubernetes-network-architecture-review-agent | Network architecture review | No |
18
+ | apply Service patch internalTrafficPolicy, apply Service patch externalTrafficPolicy, annotate topology-mode, set trafficDistribution, patch CoreDNS Corefile, install NodeLocal DNSCache, apply Gateway API resource, apply HTTPRoute / GRPCRoute / TLSRoute / ReferenceGrant, create ClusterMesh peer Secret | kubernetes-live-network-architecture-mutation-guard-agent | Live network architecture mutation | YES |
19
+ | Cilium policy, CiliumNetworkPolicy, CiliumClusterwideNetworkPolicy, NetworkPolicy content, ClusterMesh policy, egress gateway policy, Hubble flow filter, L7 policy, toCIDRSet | cilium-network-policy-review-agent | Cilium network policy review | No |
18
20
  | apply CiliumNetworkPolicy, kubectl apply cnp, delete default-deny, change toCIDRSet, egress gateway policy | kubernetes-live-network-policy-guard-agent | Live network policy mutation | YES |
19
21
  | Argo CD, ArgoCD, Application, AppProject, ApplicationSet, sync window, argocd sync, gitops, app of apps | argocd-gitops-review-agent | Argo CD GitOps review | No |
20
22
  | argocd app sync, sync production, delete sync-window, expand AppProject, enable auto-sync, ApplicationSet cluster generator | kubernetes-live-argocd-sync-guard-agent | Live Argo CD sync guard | YES |
@@ -29,11 +31,12 @@ Use this reference when classifying a task or selecting the right specialist(s).
29
31
  | `admission-security` | PSA, PodSecurityAdmission, pod-security label, enforce, audit, warn, restricted, baseline, privileged, PSP migration, Kyverno, ClusterPolicy, PolicyException, mutate, generate, image verify |
30
32
  | `workload-identity` | IRSA, workload identity, serviceAccountToken, OIDC, pod identity, azure workload identity, GKE WI, projected token, bound service account |
31
33
  | `mesh` | Istio, ambient mesh, waypoint, ztunnel, AuthorizationPolicy, PeerAuthentication, mTLS, RequestAuthentication, VirtualService, DestinationRule, Envoy |
32
- | `network-policy` | Cilium, CiliumNetworkPolicy, NetworkPolicy, ClusterMesh, Hubble, egress gateway, L7 policy, CNI |
34
+ | `network-architecture` | CNI choice, dataplane, kube-proxy mode, kube-proxy replacement, IPAM, MTU, encapsulation, dual-stack, IPv6, Pod CIDR, Service CIDR, Service routing surface, EndpointSlices, trafficPolicy, topology-aware routing, trafficDistribution, Ingress, Gateway API, GRPCRoute, HTTPRoute, CoreDNS, NodeLocal DNSCache, ndots, Corefile, multi-cluster topology, ClusterMesh topology and kvstore, Submariner, MCS-API, conntrack, NodePort path |
35
+ | `network-policy` | Cilium policy semantics, CiliumNetworkPolicy, NetworkPolicy content, Hubble flow filter, egress gateway policy, L7 policy, ClusterMesh policy boundary |
33
36
  | `gitops` | Argo CD, ArgoCD, Application, AppProject, ApplicationSet, sync window, app of apps, GitOps, deployment sync |
34
37
  | `observability` | OpenTelemetry, OTEL, otelcol, collector, pipeline, receiver, processor, exporter, Instrumentation CR, TargetAllocator, tracing, metrics, logs |
35
38
  | `pki` | cert-manager, ClusterIssuer, Issuer, CertificateRequest, CertificateRequestPolicy, approver-policy, trust-manager, Bundle, ConfigMapBundle, certificate renewal, TLS cert, SPIFFE, cert-manager webhook |
36
- | `live-guard` | apply RBAC live, apply admission policy live, change mTLS live, apply network policy live, argocd sync production, requires human gate, production mutation |
39
+ | `live-guard` | apply RBAC live, apply admission policy live, change mTLS live, apply network policy live, apply Service patch live, patch CoreDNS Corefile live, install NodeLocal DNSCache live, apply Gateway API resource live, create ClusterMesh peer Secret live, argocd sync production, requires human gate, production mutation |
37
40
 
38
41
  ## Specialist reference
39
42
 
@@ -65,11 +68,20 @@ Use this reference when classifying a task or selecting the right specialist(s).
65
68
  | `istio-ambient-mesh-review-agent` | Istio mesh review | Reviewing Istio ambient mesh waypoint config, AuthorizationPolicy, PeerAuthentication, mTLS mode, VirtualService/DestinationRule, or RequestAuthentication |
66
69
  | `kubernetes-live-mesh-policy-guard-agent` | Live mesh policy mutation | Applying or deleting AuthorizationPolicy or PeerAuthentication, changing mTLS mode, or enabling PERMISSIVE mode in a live cluster — gate required |
67
70
 
71
+ ### Network architecture
72
+
73
+ | Agent | Domain | Use when… |
74
+ |---|---|---|
75
+ | `kubernetes-network-architecture-review-agent` | Network architecture review | Reviewing CNI choice, kube-proxy mode, kube-proxy replacement, IPAM, MTU and encapsulation, dual-stack, Pod / Service CIDR sizing (one-way doors), Service routing surface (EndpointSlices, internalTrafficPolicy / externalTrafficPolicy, topology-aware routing, `trafficDistribution`), Ingress vs Gateway API migration, CoreDNS Corefile, NodeLocal DNSCache architecture, multi-cluster topology (ClusterMesh topology, Submariner, MCS-API, ClusterMesh kvstore behavior), or troubleshooting connectivity at the dataplane / Service / DNS layer. Read-only; delegates NetworkPolicy content review and live mutations. |
76
+ | `kubernetes-live-network-architecture-mutation-guard-agent` | Live network architecture mutation | Applying Service spec patches (`internalTrafficPolicy`, `externalTrafficPolicy`, `topology-mode`, `trafficDistribution`), patching CoreDNS Corefile (resourceName-locked `ConfigMap/coredns`), installing NodeLocal DNSCache, creating Gateway API resources (`Gateway`, `HTTPRoute`, `GRPCRoute`, `TLSRoute`, `ReferenceGrant`), or creating Cilium ClusterMesh peer `Secret` in a live cluster — gate required. **HARD REFUSE** one-way doors: CNI replacement, kube-proxy mode swap, MTU change, Pod / Service CIDR resize, namespace deletion, kube-system DaemonSet/Deployment writes, CRD operations. Cluster-side enforcement via least-privilege ServiceAccount per `docs/least-privilege-rbac.md`; pre-flight `kubectl auth can-i` matrix runs before any mutation. |
77
+
78
+ **Scope boundary with policy / mesh / pod-spec specialists:** the architecture agent owns *design correctness, sizing, and operational traps* in dataplane, Service routing, DNS, and multi-cluster topology. It does NOT review NetworkPolicy content (→ `cilium-network-policy-review-agent`), mesh L7 (→ `istio-ambient-mesh-review-agent`), pod `securityContext` / hostNetwork (→ `kubernetes-pod-spec-review-agent`), or perform live mutations (→ `kubernetes-live-network-policy-guard-agent` / `kubernetes-live-mesh-policy-guard-agent`). When a task spans architecture + policy + mesh, dispatch the team in parallel; the architecture findings (kube-proxy replacement mode, CNI version, MTU, Envoy DaemonSet status) are independent inputs the policy and mesh specialists need.
79
+
68
80
  ### Network policy
69
81
 
70
82
  | Agent | Domain | Use when… |
71
83
  |---|---|---|
72
- | `cilium-network-policy-review-agent` | Cilium network policy review | Reviewing CiliumNetworkPolicy, CiliumClusterwideNetworkPolicy, ClusterMesh config, Hubble observability, or L7 policy rules |
84
+ | `cilium-network-policy-review-agent` | Cilium network policy review | Reviewing CiliumNetworkPolicy, CiliumClusterwideNetworkPolicy, ClusterMesh policy semantics (`policy-default-local-cluster`), Hubble flow filter, or L7 policy rules. Architecture-level ClusterMesh design (topology, kvstore, CIDR overlap) is owned by `kubernetes-network-architecture-review-agent`. |
73
85
  | `kubernetes-live-network-policy-guard-agent` | Live network policy mutation | Applying or deleting CiliumNetworkPolicy, removing default-deny rules, changing toCIDRSet, or modifying egress gateway config in a live cluster — gate required |
74
86
 
75
87
  ### GitOps
@@ -170,7 +182,47 @@ Mode: parallel (2)
170
182
 
171
183
  ---
172
184
 
173
- ### Live-guard gate example
185
+ ### Example 6: Holistic Kubernetes networking review (architecture + policy + mesh)
186
+
187
+ **User request:** "Review our cluster's networking holistically — CNI choice, kube-proxy mode, our CiliumNetworkPolicies, and the Istio ambient mesh."
188
+
189
+ **Routing:**
190
+ ```
191
+ Route: kubernetes-network-architecture-review-agent, cilium-network-policy-review-agent, istio-ambient-mesh-review-agent
192
+ Reason: Task spans network architecture (CNI, kube-proxy mode, dataplane, DNS), Cilium network policy content review, and Istio ambient mesh L7 review — three distinct networking concerns with non-overlapping scopes. Hard-ceiling 4 specialists; this stays under the limit.
193
+ Mode: parallel (3)
194
+ ```
195
+
196
+ `kubernetes-network-architecture-review-agent` reviews CNI choice, kube-proxy mode, IPAM, MTU, Pod / Service CIDR sizing as one-way doors, EndpointSlice topology, CoreDNS / NodeLocal DNSCache architecture, and ClusterMesh topology / kvstore behavior. `cilium-network-policy-review-agent` reviews CiliumNetworkPolicy default-deny posture, toCIDRSet rules, ClusterMesh policy semantics (`policy-default-local-cluster`), and L7 policy prerequisites — the L7 review depends on the architecture finding of whether Cilium kube-proxy replacement and Envoy DaemonSet are in place. `istio-ambient-mesh-review-agent` reviews waypoint configuration, AuthorizationPolicy, PeerAuthentication, and mTLS posture. The three outputs are scope-separated; the synthesizer surfaces architecture findings first because they may invalidate policy and mesh assumptions.
197
+
198
+ **Sequencing note:** Architecture findings (kube-proxy replacement mode, Envoy DaemonSet running, MTU correctness) gate the *interpretation* of policy and mesh outputs but not their dispatch. Run in parallel; in the synthesis step, present architecture posture before policy and mesh — if architecture flags a one-way-door blocker, explicitly mark which policy and mesh recommendations may need to be re-scoped.
199
+
200
+ ---
201
+
202
+ ### Live-guard gate example: network architecture mutation
203
+
204
+ **User request:** "Apply `service.kubernetes.io/topology-mode: Auto` annotation to the `frontend` Service in the `prod` namespace on the prod cluster."
205
+
206
+ **Routing:**
207
+ ```
208
+ Route: kubernetes-live-network-architecture-mutation-guard-agent
209
+ Reason: Patching a Service annotation on a live production cluster is a live network architecture mutation — gate required even though the operation is reversible.
210
+ Mode: live-guard-gate
211
+ ```
212
+
213
+ **STOP — Live-guard gate. Before this dispatch can proceed, you must provide:**
214
+
215
+ 1. **Pre-flight RBAC self-check applied:** Confirm `skills/kubernetes/kubernetes-live-network-architecture-mutation-guard/references/least-privilege-rbac.yaml` is applied to the prod cluster, and that the agent's bound ServiceAccount has been pre-flight-tested with the `kubectl auth can-i` matrix from `references/rbac-pre-flight.md`. Every must-not row must return `no`; every must-be row must return `yes`.
216
+ 2. **Operator principal check:** Confirm your kubeconfig is **not** `cluster-admin` and **not** in `system:masters`. The agent will refuse if `kubectl auth can-i '*' '*' --all-namespaces` returns `yes` for your principal.
217
+ 3. **Blast-radius assessment:** Which workloads currently consume the `frontend` Service? Cross-zone traffic patterns may shift if `topology-mode: Auto` populates hints with insufficient endpoints per zone.
218
+ 4. **Rollback path:** `kubectl annotate svc frontend -n prod service.kubernetes.io/topology-mode-` (the `-` suffix removes the annotation) — confirmed reversible in under 30 seconds.
219
+ 5. **Explicit written confirmation:** Type "I confirm I understand the blast radius and rollback path. Proceed."
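
A minimal sketch of the operator principal check from item 2, assuming a POSIX shell. The function name `operator_is_least_privileged` and the `KUBECTL` override variable are illustrative and not part of the package; only the `kubectl auth can-i '*' '*' --all-namespaces` call comes from the text above. The sketch fails closed: anything other than an explicit `no` is treated as a refusal.

```shell
# Hypothetical helper: succeeds only when the current kubeconfig principal
# is explicitly denied wildcard access. KUBECTL is an assumed override hook
# so the check can be exercised against a stub.
operator_is_least_privileged() {
  _answer=$(${KUBECTL:-kubectl} auth can-i '*' '*' --all-namespaces 2>/dev/null)
  # Fail closed: an empty or unexpected answer (for example, no cluster
  # access) counts as "not verified".
  [ "$_answer" = "no" ]
}

if operator_is_least_privileged; then
  echo "operator principal check passed"
else
  echo "refuse: principal is wildcard-privileged or could not be verified"
fi
```

Requiring an explicit `no` rather than "not `yes`" means an unreachable API server cannot be mistaken for a least-privileged principal.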
220
+
221
+ For irreversible operations (CNI replacement, kube-proxy mode swap, MTU change, Pod / Service CIDR resize, namespace deletion, kube-system DaemonSet writes, CRD operations), the agent **HARD REFUSES** regardless of operator confirmation — these belong to a human-led cutover plan that the architecture review agent (`kubernetes-network-architecture-review-agent`) can produce but no agent in this repo will execute.
222
+
223
+ ---
224
+
225
+ ### Live-guard gate example: RBAC mutation
174
226
 
175
227
  **User request:** "Apply the new ClusterRoleBinding for the payments service account in the prod cluster."
176
228
 
@@ -0,0 +1,84 @@
1
+ ---
2
+ name: kubernetes-network-architecture-review
3
+ description: Use this skill for Kubernetes cluster network architecture review across the dataplane (CNI choice, kube-proxy mode, IPAM, MTU, encapsulation, dual-stack), service routing surface (Service types, EndpointSlices, internalTrafficPolicy/externalTrafficPolicy, topology-aware routing, Ingress, Gateway API), in-cluster DNS (CoreDNS, NodeLocal DNSCache, ndots), multi-cluster topology (ClusterMesh, Submariner, MCS-API design choices), and connectivity observability and troubleshooting. Trigger when the user asks how to choose or change a CNI, why pod-to-pod or pod-to-service traffic fails, whether to migrate from Ingress to Gateway API, why DNS latency is high, how to size Pod or Service CIDRs, or how to design multi-cluster networking. Does NOT review NetworkPolicy content (delegate to cilium-network-policy-review) or perform live mutations (delegate to kubernetes-live-network-policy-guard / kubernetes-live-mesh-policy-guard).
4
+ allowed-tools: Read Grep Glob WebFetch
5
+ metadata:
6
+ author: "github: Raishin"
7
+ version: "0.1.0"
8
+ updated: "2026-05-07"
9
+ category: networking
10
+ ---
11
+
12
+ # Kubernetes Network Architecture Review
13
+
14
+ ## Purpose
15
+
16
+ Review Kubernetes cluster networking *as a system* — the choices that shape every pod's reachability, latency, and blast radius before any policy is written. This skill is about **design correctness, sizing, and operational traps** in the dataplane, service routing, DNS, and multi-cluster surface. Policy correctness is delegated.
17
+
18
+ ## Scope boundary
19
+
20
+ In scope:
21
+
22
+ - CNI selection and dataplane mode (overlay vs native routing, eBPF vs iptables, kube-proxy replacement).
23
+ - IPAM mode and Pod/Service CIDR sizing, including dual-stack and IPv6.
24
+ - MTU and encapsulation overhead, jumbo frames, fragmentation traps.
25
+ - Service surface: ClusterIP / NodePort / LoadBalancer / ExternalName / headless, EndpointSlices, `internalTrafficPolicy`, `externalTrafficPolicy`, topology-aware routing, `sessionAffinity`.
26
+ - Ingress vs Gateway API: GatewayClass selection, role-oriented model (infrastructure / cluster operator / application), HTTPRoute / GRPCRoute / TLSRoute, ReferenceGrant, GAMMA mesh integration.
27
+ - In-cluster DNS: CoreDNS Corefile, NodeLocal DNSCache architecture and risks, `ndots:5` tail-latency trap, ExternalDNS handoff.
28
+ - Multi-cluster networking topology: Cilium ClusterMesh, Submariner, KEP-1645 MCS-API — pick on basis of identity, address-overlap, and policy semantics.
29
+ - Connectivity observability: kube-proxy metrics, conntrack, Hubble flow shape, dropped packets, MTU mismatches.
30
+ - Troubleshooting playbooks (read-only): pod-to-pod, pod-to-Service, pod-to-external, NodePort path, conntrack exhaustion.
31
+
32
+ Out of scope — delegate:
33
+
34
+ - NetworkPolicy / CiliumNetworkPolicy / CiliumClusterwideNetworkPolicy content review → `cilium-network-policy-review`.
35
+ - Istio mesh policy / ambient L7 → `istio-ambient-mesh-review`.
36
+ - Live mutations on policy objects → `kubernetes-live-network-policy-guard`, `kubernetes-live-mesh-policy-guard`.
37
+ - Pod `securityContext`, capabilities, host networking → `kubernetes-pod-spec-review`.
38
+
39
+ If the question is **entirely** within one of these delegated scopes, refuse to answer here and name the owning agent. Do not partial-answer and append a handoff line.
40
+
41
+ ## Lean operating rules
42
+
43
+ - Prefer live cluster evidence (`kubectl get nodes,svc,endpointslices,gateway,gatewayclass,httproute,configmap -A`, `cilium status`, `cilium-dbg bpf`, `hubble observe`, `kube-proxy --metrics-port`, `conntrack -L`) when a Kubernetes MCP server, `kubectl`, and node-shell access are available; otherwise fall back to upstream documentation (kubernetes.io, gateway-api.sigs.k8s.io, docs.cilium.io, coredns.io) and sanitized YAML.
44
+ - Separate confirmed facts from inference. If the CNI version, kube-proxy mode, IPAM mode, MTU, or DNS pod count was not queried, say so.
45
+ - Treat **pod and service CIDR sizing** as architectural one-way doors — changing a Pod CIDR after the cluster is in use generally requires a cluster rebuild on most CNIs. Demand evidence of growth headroom.
46
+ - Treat **kube-proxy mode swap** (iptables ↔ IPVS ↔ nftables, or to Cilium kube-proxy replacement) as a connectivity-affecting rollout — sessions on existing connections may break depending on mode and conntrack handling.
47
+ - Treat **MTU mismatch between underlay and overlay** as a silent failure mode: TCP handshakes succeed, then large payloads stall. Always check whether encapsulation overhead (VXLAN 50B, Geneve about 60B, WireGuard about 60B, IPsec up to roughly 73B) was subtracted from node MTU.
48
+ - Treat **`externalTrafficPolicy: Local`** as a correctness-sensitive choice: it preserves source IP, but a node with no local endpoint drops the traffic it receives (only load balancer health checks steer around such nodes), and uneven pod placement shows up to clients as uneven load.
49
+ - Treat **`internalTrafficPolicy: Local`** the same way for in-cluster traffic — silent black-holing if no local endpoint exists.
50
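+ - Treat **`ndots:5` plus search-path expansion** as the default DNS tail-latency root cause. Any name with fewer than five dots is tried against every search domain first, so each external query pays one negative lookup per search domain (and per address family) before the absolute name resolves. Multiplied by every pod, this is the dominant DNS load on most clusters.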
51
+ - Treat **NodeLocal DNSCache OOM** as a node-wide DNS outage — packet-filtering rules redirect to an unhealthy cache pod until restart. Memory limits and PDB are mandatory.
52
+ - Treat **Ingress Controller annotations** as non-portable — migrating to Gateway API requires a route-by-route rewrite, not a global flip; plan the migration as a controller-by-controller cutover with overlap.
53
+ - Treat **multi-cluster pod CIDR collisions** as the first failure for any cross-cluster scheme — reject any design that does not declare non-overlapping Pod CIDRs (or explicit NAT) before discussing identity or policy.
54
+ - Treat **Cilium ClusterMesh kvstore replication lag** as a silent failure — remote ServiceImports serve stale endpoint maps with no error surfaced. Connections succeed but route to removed/replaced pods after a scale event. Compare endpoint revisions across peers (`cilium-dbg kvstore get`) when ClusterMesh is in scope.
55
+ - Treat **conntrack table exhaustion** on busy nodes and **AWS NAT Gateway port exhaustion** (~55k connections per destination IP) as silent drops — packets disappear without errors at the application layer.
56
+ - Treat **topology-aware routing skew** when zone labels are missing or unevenly populated as a silent traffic concentration — the Auto mode silently falls back to cluster-wide routing or drops endpoints.
57
+ - Treat **any pod egress to `169.254.169.254` (AWS / Azure IMDS)** or **`metadata.google.internal` (GCP)** as a credential-theft vector. Recommend IRSA / Workload Identity / Pod Identity before discussing any egress allow rule. Surface unblocked metadata-service reachability as a HIGH severity finding rather than silently delegating to policy review.
58
+ - **Do not invent CLI flags or commands.** Reference only `kubectl`, `cilium`, `cilium-dbg`, `hubble`, `calicoctl`, `subctl`, `ip`, `conntrack`, `iptables`, `ipvsadm`, `nft`, `coredns`. For anything outside this set, ask the user for the help text or a doc link rather than guess.
59
+ - Refuse user-initiated mutation requests ("just apply this for me") and credential offers (kubeconfig, tokens, peer Secrets). Name the live-mutation delegate; do not proceed.
60
+ - Keep the answer scoped, reversible, least-privilege, and explicit about blockers, unknowns, and which delegate skill owns the next step.
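
The `ndots:5` rule above can be made concrete with a small sketch of the lookup order. It assumes glibc-style resolver behavior (other libc resolvers, such as musl, handle the search list differently); the function name `query_order` and the `default` namespace in the example are illustrative only.

```shell
# Print the names a glibc-style stub resolver would try, in order.
# Usage: query_order <name> <ndots> <search-domain>...
query_order() {
  _name=$1
  _ndots=$2
  shift 2
  case $_name in
    *.) echo "$_name"; return 0 ;;   # trailing dot: already absolute
  esac
  # Count the dots by comparing the length with and without them.
  _nodots=$(printf '%s' "$_name" | tr -d '.')
  _dots=$(( ${#_name} - ${#_nodots} ))
  # Enough dots: the name is tried as-is before the search list.
  if [ "$_dots" -ge "$_ndots" ]; then echo "$_name."; fi
  for _domain in "$@"; do
    echo "$_name.$_domain."
  done
  # Too few dots: the absolute name is only tried last.
  if [ "$_dots" -lt "$_ndots" ]; then echo "$_name."; fi
  return 0
}

# A pod in a hypothetical "default" namespace resolving an external name.
query_order api.example.com 5 \
  default.svc.cluster.local svc.cluster.local cluster.local
```

With the three cluster search domains, the external name is only tried in fourth position. Writing it with a trailing dot (`api.example.com.`) skips the search list entirely.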
61
+
62
+ ## References
63
+
64
+ Load these only when needed:
65
+
66
+ - [Dataplane and CNI choice](references/dataplane-and-cni.md) — use when reviewing CNI selection, kube-proxy mode, IPAM, MTU and encapsulation, dual-stack and IPv6.
67
+ - [Service and Gateway API routing](references/service-gateway-routing.md) — use when reviewing Service types, EndpointSlices, `internalTrafficPolicy` / `externalTrafficPolicy`, topology-aware routing, Ingress vs Gateway API, GatewayClass selection.
68
+ - [DNS and service discovery](references/dns-and-discovery.md) — use when reviewing CoreDNS Corefile, NodeLocal DNSCache, `ndots:5`, autopath, ExternalDNS.
69
+ - [Multi-cluster and egress topology](references/multi-cluster-and-egress.md) — use when reviewing ClusterMesh, Submariner, MCS-API, egress-from-cluster topology, cross-cluster service discovery.
70
+ - [Connectivity troubleshooting playbook](references/troubleshooting-playbook.md) — use when diagnosing pod-to-pod, pod-to-service, pod-to-external failures, intermittent latency, MTU stalls, conntrack exhaustion.
71
+ - [Evidence path and tooling](references/mcp-and-evidence.md) — use when choosing live cluster evidence, identifying CNI/kube-proxy/IPAM/DNS state, or switching to documentation mode.
72
+ - [Official sources](references/official-sources.md) — use when you need the authoritative Kubernetes / Gateway API / Cilium / CoreDNS reference list.
73
+
74
+ ## Response minimum
75
+
76
+ Return, at minimum:
77
+
78
+ - the **scoped target** (CNI, Service, Gateway, DNS, multi-cluster topology, troubleshooting),
79
+ - the **evidence level** — labeled **per finding**, not response-level only: `live evidence` / `documentation-based` / `sanitized user evidence` / `inference`. A response may legitimately mix levels; each finding must carry its own.
80
+ - the **architectural posture findings** (with severity: high / medium / low),
81
+ - the **safest next actions** — keep them reversible; for irreversible changes (CIDR resizing, kube-proxy swap), call out a tested cutover plan,
82
+ - the **rollback or fallback path**,
83
+ - the **delegate handoff** when the next step is policy content, mesh policy, live mutation, or pod-spec — name the skill or agent that owns it,
84
+ - the **assumptions and blockers** that prevent stronger conclusions — if CNI version, kube-proxy mode, IPAM mode, node MTU, or DNS pod count were not confirmed by live evidence, each MUST appear here as an explicit open assumption. This field is not optional.
@@ -0,0 +1,34 @@
1
+ {
2
+ "id": "kubernetes-network-architecture-review",
3
+ "name": "Kubernetes Network Architecture Review",
4
+ "type": "skill",
5
+ "provider": "kubernetes",
6
+ "harnesses": [
7
+ "codex",
8
+ "claude-code",
9
+ "cursor",
10
+ "gemini",
11
+ "kiro",
12
+ "other"
13
+ ],
14
+ "summary": "Review Kubernetes cluster network architecture: CNI and dataplane selection, kube-proxy mode and replacement, IPAM and CIDR sizing, MTU and encapsulation, dual-stack and IPv6, Service surface (EndpointSlices, internalTrafficPolicy, externalTrafficPolicy, topology-aware routing), Ingress to Gateway API migration, CoreDNS and NodeLocal DNSCache, multi-cluster topology, and connectivity observability and troubleshooting. Excludes NetworkPolicy content review and live mutations — those are delegated to cilium-network-policy-review and the live-guard agents.",
15
+ "source_type": "original",
16
+ "official_docs": [
17
+ "https://kubernetes.io/docs/concepts/services-networking/",
18
+ "https://kubernetes.io/docs/reference/networking/virtual-ips/",
19
+ "https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/",
20
+ "https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/",
21
+ "https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/",
22
+ "https://kubernetes.io/docs/concepts/services-networking/dual-stack/",
23
+ "https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/",
24
+ "https://gateway-api.sigs.k8s.io/",
25
+ "https://docs.cilium.io/en/stable/network/concepts/",
26
+ "https://docs.cilium.io/en/stable/network/kube-proxy-replacement/",
27
+ "https://coredns.io/plugins/kubernetes/"
28
+ ],
29
+ "security_notes": "CNI and Pod CIDR are one-way architectural choices on most stacks — resizing requires cluster rebuild. kube-proxy mode swap can break in-flight connections. MTU mismatch between underlay and overlay is a silent payload-stall failure. externalTrafficPolicy: Local preserves source IP but black-holes traffic when no local endpoint exists. NodeLocal DNSCache OOM produces a node-wide DNS outage via stale packet-filter redirect. Multi-cluster pod CIDR collisions break any cross-cluster scheme regardless of policy correctness. ndots:5 plus search path is the dominant cluster DNS load on most installations.",
30
+ "last_verified": "2026-05-07",
31
+ "path": "skills/kubernetes/kubernetes-network-architecture-review",
32
+ "author": "github: Raishin",
33
+ "version": "0.1.0"
34
+ }
@@ -0,0 +1,89 @@
1
+ # Dataplane and CNI
2
+
3
+ ## Step 1 — Identify the dataplane and CNI
4
+
5
+ Capture, in order:
6
+
7
+ 1. **CNI plugin and version** — `kubectl -n kube-system get pods -l k8s-app=cilium -o name` (or `calico-node`, `flannel`, `aws-node`, `azure-cns`, `gke-cni`). Different CNIs have entirely different dataplanes (eBPF vs iptables vs vendor SDN), so every later question depends on this answer.
8
+ 2. **kube-proxy presence and mode**: `kubectl -n kube-system get ds kube-proxy` and the `mode` field of the `KubeProxyConfiguration` in its ConfigMap (`kubectl -n kube-system get cm kube-proxy -o yaml`), or the `--proxy-mode` flag where kube-proxy is configured by flags. On Cilium kube-proxy replacement, the DaemonSet does not exist or is disabled.
9
+ 3. **Routing mode** — encapsulation (VXLAN, Geneve, IPIP) vs native routing. Encapsulation works anywhere but adds 50–60 bytes of overhead and slightly higher CPU; native routing requires the underlay to route Pod CIDRs.
10
+ 4. **IPAM mode** — `cluster-pool` (CNI manages CIDR), `kubernetes` (uses `node.spec.podCIDR`), `aws-eni` / `azure` / `gke` (cloud IPAM, pods get VPC IPs).
11
+ 5. **Pod and Service CIDRs** — `kubectl cluster-info dump | grep -E '(cluster-cidr|service-cluster-ip-range)'` or check kube-controller-manager flags. Once set, these are very hard to change.
12
+ 6. **Node MTU and overlay MTU** — `ip link show | grep mtu`, then the CNI overlay interface (`cilium_vxlan`, `flannel.1`). The overlay MTU should be `node MTU − encapsulation overhead`.
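
As a rough sketch of item 2, the helper below extracts the proxy mode from a `KubeProxyConfiguration` document. The function name `proxy_mode` is hypothetical, and the usage comment assumes a kubeadm-style ConfigMap whose key is `config.conf`; other installers may store the configuration differently. An empty or missing `mode` is reported as the `iptables` default listed in the mode table of this reference.

```shell
# Hypothetical helper: read a KubeProxyConfiguration document on stdin and
# print the proxy mode. An empty or missing `mode` means kube-proxy picks its
# platform default, assumed here to be iptables on Linux.
proxy_mode() {
  _mode=$(sed -n 's/^mode:[[:space:]]*"\{0,1\}\([A-Za-z]*\)"\{0,1\}[[:space:]]*$/\1/p' | head -n 1)
  echo "${_mode:-iptables (default)}"
}

# Usage against a kubeadm-style cluster (assumes the ConfigMap key is
# `config.conf`):
#   kubectl -n kube-system get cm kube-proxy \
#     -o jsonpath='{.data.config\.conf}' | proxy_mode
printf 'kind: KubeProxyConfiguration\nmode: "ipvs"\n' | proxy_mode
```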
13
+
14
+ ## Step 2 — Stress-test the CNI choice against actual requirements
15
+
16
+ Common decision points and the trap each one hides:
17
+
18
+ - **Cilium with kube-proxy replacement** — requires a kernel new enough for the eBPF features used (varies by version; verify with `cilium status` and the version's [system requirements page](https://docs.cilium.io/en/stable/operations/system-requirements/)). On older kernels, parts of KPR fall back to legacy paths and you lose the performance argument.
19
+ - **Calico in BGP mode** — requires the underlay to accept BGP from every node, or a route reflector. In cloud, this often means a peering VM. In a managed cluster, this is usually impossible — the cluster ends up in IPIP encapsulation, defeating the BGP choice.
20
+ - **AWS VPC CNI** — pods get VPC IPs, so subnet sizing and ENI limits per instance type bound pod density. An m5.large can hold ~30 pods because of ENI/IP limits, not memory. This is the dominant pod-density ceiling on EKS.
21
+ - **Azure CNI (legacy)** — pre-allocates pod IPs from the subnet at node join, exhausting the subnet long before the pods exist.
22
+ - **Azure CNI Overlay**: pods get a separate overlay CIDR; nodes still consume VNet IPs. Works at scale, but adopt it as a deliberate IPAM choice, not as an unexamined default.
23
+ - **GKE alias IPs** — pod range is a secondary range on the VPC; sizing is fixed per cluster and resizing requires recreation.
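
The AWS VPC CNI pod-density ceiling above follows from simple arithmetic. The sketch below assumes the default secondary-IP mode (no prefix delegation); the function name `eni_max_pods` is illustrative.

```shell
# Max pods per node with the AWS VPC CNI in secondary-IP mode: each ENI
# gives up one address to the node itself, and two host-network pods
# (aws-node and kube-proxy) do not consume pod IPs.
eni_max_pods() {
  # $1: ENIs per instance, $2: IPv4 addresses per ENI
  echo $(( $1 * ($2 - 1) + 2 ))
}

# m5.large: 3 ENIs with 10 IPv4 addresses each.
eni_max_pods 3 10
```

For an m5.large this prints 29, which is where the "~30 pods" figure comes from.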
24
+
25
+ ## Step 3 — Stress-test kube-proxy mode
26
+
27
+ | Mode | Strengths | Real failures |
28
+ |---|---|---|
29
+ | `iptables` (default) | Simple, broadly tested | Rule count grows linearly with `Services × Endpoints`. On large clusters (10k+ Services) full rule resync becomes a multi-second event; new Service propagation latency rises. |
30
+ | `ipvs` | Hash-based lookup, scales to many Services. Multiple LB algorithms. | Requires `ip_vs` kernel modules loaded on every node. Some session-affinity edge cases differ from iptables. Conntrack still drives source-IP preservation. |
31
+ | `nftables` | Modern netfilter framework; better incremental update than iptables. | Newer; not all distros have stable nftables tooling. Still relatively young in production at scale. |
32
+ | `kernelspace` (Windows) | Native Windows | Windows-only; behavior differs (no init container hostNetwork tricks, etc.). |
33
+ | Cilium kube-proxy replacement | eBPF socket-LB; bypasses iptables entirely; preserves source IP for NodePort without `externalTrafficPolicy: Local` quirks. | Requires Cilium dataplane. Some hostPort and kernel-version edge cases. Verify `cilium status` reports KPR enabled in the expected mode (`Strict`, `Probe`, etc.). |
34
+
35
+ Stress-tests the review must apply:
36
+
37
+ - Migrating from `iptables` to `ipvs` or `nftables` on a running cluster — short connectivity blip during the rollout as kube-proxy rewrites rules. Schedule like a node-by-node rollout, not a config flag flip.
38
+ - Migrating to Cilium KPR — uninstall kube-proxy *after* Cilium reports KPR healthy on every node. Removing kube-proxy first leaves Service VIPs unreachable.
39
+ - Mixed mode during rollout (half the nodes on iptables, half on Cilium KPR): Service traffic to a Pod on a KPR node from a kube-proxy node may follow different return paths and confuse conntrack. Plan a fast rollout window.
40
+
41
+ ## Step 4 — Stress-test IPAM and CIDR sizing
42
+
43
+ A cluster sized 10× too small for its eventual workload count is the most common architectural debt — and it is hard to fix.
44
+
45
+ Sizing stress-tests:
46
+
47
+ - **Pod CIDR size** vs **max nodes × max pods/node** — a `/16` Pod CIDR with a `/24` per node gives 256 nodes max. A `/22` per node gives 64 nodes. Many CNIs allocate a fixed-size block per node, so the right number is `nodes × pods_per_node × headroom`.
48
+ - **Service CIDR size** — Services tend to grow faster than people predict (every Helm chart adds a few). A `/16` is fine; a `/20` is risky in any cluster running a service mesh, since per-namespace mesh control planes add Services.
49
+ - **CIDR collision with on-prem or peer VPC** — pods in `10.0.0.0/8` cannot route to an on-prem `10.x.x.x` system. RFC 1918 collision checks must precede every cluster build.
50
+ - **`100.64.0.0/10` and `198.18.0.0/15`**: the shared address space for carrier-grade NAT (RFC 6598) and the benchmarking range (RFC 2544). Use them if RFC 1918 space is exhausted; cloud providers generally tolerate them.
51
+ - **IPv6 / dual-stack**: IPv6-only (single-stack) clusters are poorly supported by ecosystem tools (registries, observability). Dual-stack is the practical choice; ensure `--service-cluster-ip-range` and `--cluster-cidr` carry both families and that EndpointSlices report both.
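
The node-count and collision checks above reduce to prefix arithmetic. This is a minimal IPv4-only sketch in POSIX shell; the function names `max_nodes`, `ip_to_int`, and `cidr_overlap` are illustrative, and the second pair of CIDRs in the example is made up for demonstration.

```shell
# Per-node blocks that fit in a cluster Pod CIDR.
# $1: cluster prefix length, $2: per-node prefix length.
max_nodes() {
  echo $(( 1 << ($2 - $1) ))
}

# Dotted-quad IPv4 address to integer.
ip_to_int() {
  _saved_ifs=$IFS
  IFS=.
  set -- $1            # split the address into $1..$4
  IFS=$_saved_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print "overlap" or "disjoint" for two IPv4 CIDRs. Two CIDRs overlap
# exactly when they agree on the bits of the shorter prefix.
cidr_overlap() {
  _len1=${1#*/}
  _len2=${2#*/}
  if [ "$_len1" -lt "$_len2" ]; then _len=$_len1; else _len=$_len2; fi
  _mask=$(( (0xFFFFFFFF << (32 - _len)) & 0xFFFFFFFF ))
  _a=$(ip_to_int "${1%/*}")
  _b=$(ip_to_int "${2%/*}")
  if [ $(( _a & _mask )) -eq $(( _b & _mask )) ]; then
    echo overlap
  else
    echo disjoint
  fi
}

max_nodes 16 24                           # /16 cluster, /24 per node
max_nodes 16 22                           # /16 cluster, /22 per node
cidr_overlap 10.0.0.0/8 10.42.0.0/16      # cluster range vs on-prem range
cidr_overlap 10.244.0.0/16 100.64.0.0/10
```

The first two calls reproduce the 256-node and 64-node limits quoted above; the overlap check is the kind of test that should run against every peer VPC and on-prem range before a cluster is built.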
52
+
53
+ ## Step 5 — Stress-test MTU and encapsulation
54
+
55
+ MTU mismatch is a silent failure. The TCP three-way handshake passes (small packets), then the first large response stalls whenever the ICMP messages that Path MTU Discovery (PMTUD) relies on are dropped, typically by a firewall.
56
+
57
+ The arithmetic the review must enforce:
58
+
59
+ - VXLAN: 50 bytes (20 outer IPv4 + 8 UDP + 8 VXLAN + 14 inner Ethernet) → overlay MTU = node MTU − 50.
60
+ - Geneve: 60 bytes typical → overlay MTU = node MTU − 60.
61
+ - IPIP: 20 bytes.
62
+ - WireGuard (Cilium transparent encryption): ~60 bytes plus alignment.
63
+ - IPsec: variable, ~73 bytes worst case.
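
The subtraction can be sketched as a lookup over the overhead values listed above. The function name `overlay_mtu` is illustrative, and the WireGuard and IPsec figures are the approximate values from the list rather than exact constants.

```shell
# Overlay MTU = node MTU minus encapsulation overhead. The byte counts are
# the ones listed above; WireGuard and IPsec are approximate.
overlay_mtu() {
  # $1: node MTU, $2: encapsulation type
  case $2 in
    vxlan)     _overhead=50 ;;
    geneve)    _overhead=60 ;;
    ipip)      _overhead=20 ;;
    wireguard) _overhead=60 ;;
    ipsec)     _overhead=73 ;;
    *) echo "unknown encapsulation: $2" >&2; return 1 ;;
  esac
  echo $(( $1 - _overhead ))
}

overlay_mtu 1500 vxlan
overlay_mtu 1500 geneve
```

Comparing this result with the MTU actually configured on the overlay interface is the check the review should make.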
64
+
65
+ Stress-tests:
66
+
67
+ - Cloud underlay at MTU 1500 → overlay should be 1450 (VXLAN). Some installers default to 1500 on the overlay too, which causes silent stalls or drops on the first full-size packet.
68
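+ - Jumbo frames (MTU 9001) on AWS: supported for traffic that stays inside a VPC, but traffic through an internet gateway or a VPN connection is limited to 1500. A jumbo overlay MTU on any path that crosses such a boundary causes random stalls, so verify the limit for every hop type in use (peering, Transit Gateway, VPN) rather than assuming it.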
69
+ - GKE / EKS / AKS managed nodes — verify the cloud-provider MTU before trusting the CNI's auto-detection. Some installers read `eth0` MTU at startup and miss later changes.
70
+
71
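+ Verification step: `kubectl run mtu-test --rm -it --image=<image-with-iputils-ping> -- ping -M do -s <payload> <peer-pod-ip>`. The `-M do` option is an iputils feature; the BusyBox `ping` in a stock `alpine` image does not accept it. For IPv4, `<payload>` is the MTU under test minus 28 bytes (20 IP + 8 ICMP). The Don't-Fragment bit forces the kernel to fail rather than fragment, exposing the MTU ceiling.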
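
The ICMP payload size for that probe follows from the header sizes. The helper below is a minimal sketch (the function name is hypothetical) that subtracts the IP and ICMP headers from the MTU being tested.

```python
def ping_payload(mtu: int, ipv6: bool = False) -> int:
    """ICMP echo payload size that exactly fills a packet of the given MTU."""
    ip_header = 40 if ipv6 else 20   # fixed IPv6 header vs. IPv4 header without options
    icmp_header = 8                  # ICMP / ICMPv6 echo header
    return mtu - ip_header - icmp_header


if __name__ == "__main__":
    print(ping_payload(1450))             # 1422: should pass on a 1450-byte overlay
    print(ping_payload(1450) + 1)         # 1423: should fail with DF set
    print(ping_payload(1450, ipv6=True))  # 1402
```

Probing with the computed payload and then with one byte more brackets the ceiling: the first should succeed and the second should fail if the overlay MTU is what the review expects.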
72
+
73
+ ## Step 6 — Dual-stack and IPv6
74
+
75
+ - The cluster API server, kube-controller-manager, kube-proxy, kubelet, and the CNI must all be configured for dual-stack. A partial enablement leads to confusing behavior where Services have only IPv4 addresses but Pods have both.
76
+ - `Service.spec.ipFamilies` and `ipFamilyPolicy` (`SingleStack`, `PreferDualStack`, `RequireDualStack`) are the per-Service selectors. A Service that sets neither defaults to `SingleStack` in the cluster's primary family, that is, the first range in `--service-cluster-ip-range`.
77
+ - EndpointSlices report addresses per family — verify both families are populated when `RequireDualStack` is in use.
78
+ - IPv6-only clusters interact poorly with images pulled from IPv4-only registries; verify NAT64/DNS64 or registry mirroring is in place.
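
One quick way to catch a partial enablement is to confirm that each CIDR flag value carries both address families. The sketch below assumes the flag values have already been collected as comma-separated strings; the helper names and example ranges are illustrative only.

```python
import ipaddress


def families(flag_value: str) -> set[int]:
    """IP versions (4 and/or 6) present in a comma-separated CIDR flag value."""
    return {
        ipaddress.ip_network(cidr.strip()).version
        for cidr in flag_value.split(",")
        if cidr.strip()
    }


def is_dual_stack(flag_value: str) -> bool:
    return families(flag_value) == {4, 6}


if __name__ == "__main__":
    # Hypothetical values for --cluster-cidr and --service-cluster-ip-range.
    print(is_dual_stack("10.244.0.0/16,fd00:10:244::/56"))  # both families present
    print(is_dual_stack("10.96.0.0/12"))                    # IPv4 only
```

The same check applies to both `--cluster-cidr` and `--service-cluster-ip-range`; a mismatch between the two is the Services-IPv4-only, Pods-dual-stack symptom described above.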
79
+
80
+ ## Output for this section
81
+
82
+ - CNI and version,
83
+ - kube-proxy mode (or KPR mode),
84
+ - routing mode (encapsulation type or native),
85
+ - IPAM mode and Pod / Service CIDR sizes,
86
+ - node and overlay MTU,
87
+ - dual-stack posture,
88
+ - findings on sizing headroom, mode mismatch, MTU correctness, KPR readiness,
89
+ - one-way-door warnings (CIDR resize, KPR migration) with cutover plan if a change is recommended.
@@ -0,0 +1,120 @@
1
+ # DNS and Service Discovery
2
+
3
+ ## Step 1 — Identify the in-cluster DNS topology
4
+
5
+ Capture:
6
+
7
+ - CoreDNS deployment shape: `kubectl -n kube-system get deploy coredns -o wide`. Replica count, resources, anti-affinity.
8
+ - CoreDNS Corefile: `kubectl -n kube-system get cm coredns -o yaml`. Plugins enabled, forward target, cache TTL, autopath, log/errors.
9
+ - NodeLocal DNSCache presence: `kubectl -n kube-system get ds node-local-dns` (or `nodelocaldns`). Listening IP (typically a link-local address like `169.254.20.10`), upstream target, kube-proxy mode coupling.
10
+ - kubelet `--cluster-dns` flag — does it point at the CoreDNS Service IP, or at the NodeLocal DNSCache link-local IP?
11
+ - A pod's `/etc/resolv.conf`: `kubectl exec <pod> -- cat /etc/resolv.conf`. Note the `search`, `nameserver`, and `options ndots:` lines.
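
Once the pod's `/etc/resolv.conf` has been captured, the three fields the review cares about can be pulled out mechanically. This is a minimal parser sketch; it ignores most `options` and the sample nameserver IP is only an example value.

```python
def parse_resolv_conf(text: str) -> dict:
    """Extract nameservers, search list, and ndots from resolv.conf text."""
    result = {"nameservers": [], "search": [], "ndots": 1}  # resolver default ndots is 1
    for raw in text.splitlines():
        parts = raw.split("#", 1)[0].split()
        if not parts:
            continue
        key, args = parts[0], parts[1:]
        if key == "nameserver" and args:
            result["nameservers"].append(args[0])
        elif key == "search":
            result["search"] = args
        elif key == "options":
            for opt in args:
                if opt.startswith("ndots:"):
                    result["ndots"] = int(opt.split(":", 1)[1])
    return result


if __name__ == "__main__":
    sample = (
        "search default.svc.cluster.local svc.cluster.local cluster.local\n"
        "nameserver 10.96.0.10\n"   # example Service IP, not a required value
        "options ndots:5\n"
    )
    print(parse_resolv_conf(sample))
```

Comparing the parsed `nameservers` entry with the kubelet `--cluster-dns` value shows immediately whether pods are pointed at the CoreDNS Service IP or at a node-local cache address.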
12
+
13
+ ## Step 2 — Stress-test the Corefile
14
+
15
+ A canonical Corefile for a Kubernetes cluster typically uses these plugins in order:
16
+
+ ```
+ .:53 {
+     errors
+     health {
+         lameduck 5s
+     }
+     ready
+     kubernetes cluster.local in-addr.arpa ip6.arpa {
+         pods insecure # consider: pods verified
+         fallthrough in-addr.arpa ip6.arpa
+         ttl 30
+     }
+     prometheus :9153
+     forward . /etc/resolv.conf {
+         max_concurrent 1000
+     }
+     cache 30
+     loop
+     reload
+     loadbalance
+ }
+ ```
37
+
38
+ Stress-tests:
39
+
40
+ - `pods insecure` returns A records for any pod IP without verifying a pod exists in that namespace. `pods verified` validates against pod existence in the same namespace; higher memory cost but a tighter security posture.
41
+ - `forward . /etc/resolv.conf` follows the node's resolv.conf, which may point to the cloud DNS (169.254.169.253 on AWS, 168.63.129.16 on Azure). If pods talk to external services, every cache miss escapes the cluster, so set `max_concurrent` deliberately to bound load on the upstream.
42
+ - Missing `cache` plugin — every query hits upstream, including repeated queries for the same name within the TTL. The cache plugin is required for any cluster with non-trivial DNS load.
43
+ - `cache` TTL cap set too high: the argument is a maximum, so CoreDNS keeps each record for its own TTL, capped at this value. A large cap lets long-TTL records linger after the authoritative source has changed. 30 seconds is a typical compromise.
44
+ - `autopath` plugin — server-side search-path completion. Reduces the `ndots:5` round-trip cost (see Step 4) but requires `pods verified`, costs more memory, and complicates debugging because client-side lookups no longer match what reaches CoreDNS.
45
+ - `loop` plugin — detects forwarding loops at startup. Without it, a forward to the cluster's own resolv.conf can loop until the deployment crashes. Always keep `loop` enabled.
46
+ - `reload` plugin — picks up Corefile changes without a restart. Without it, a ConfigMap edit is not honored until the pod is recreated.
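
The presence checks in this list can be automated against the Corefile text extracted from the ConfigMap. The sketch below is a deliberately simple line-based scan, not a real Corefile parser; it assumes one directive per line with block braces at line ends, and the names are illustrative.

```python
REQUIRED_PLUGINS = ("cache", "loop", "reload")


def corefile_plugins(corefile: str) -> set[str]:
    """Collect the first word of each directive directly inside a server block."""
    depth = 0
    plugins = set()
    for raw in corefile.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line == "}":
            depth -= 1
            continue
        if depth == 1:
            plugins.add(line.split()[0])
        if line.endswith("{"):
            depth += 1
    return plugins


def missing_plugins(corefile: str) -> list[str]:
    present = corefile_plugins(corefile)
    return [p for p in REQUIRED_PLUGINS if p not in present]


if __name__ == "__main__":
    sample = ".:53 {\n    errors\n    forward . /etc/resolv.conf {\n        max_concurrent 1000\n    }\n    loop\n}\n"
    print(missing_plugins(sample))  # ['cache', 'reload']
```

Because the scan only looks at depth 1, options nested inside a plugin block (for example `ttl` under `kubernetes`) are not mistaken for plugins.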
47
+
48
+ ## Step 3 — Stress-test CoreDNS scaling
49
+
50
+ CoreDNS is the single most overlooked source of cluster-wide latency. Defaults from `kubeadm` and many installers are not production-sized.
51
+
52
+ - Replica count — small clusters get 2 replicas by default. A cluster with 1000 pods doing 100 QPS per pod sees 100k QPS; two replicas with stock resources will become a queue.
53
+ - Resource requests / limits — many installers default to a tight `100m CPU / 70Mi memory` request that is fine for 100 pods and a CPU throttle for 10000.
54
+ - Pod anti-affinity — every CoreDNS replica must be on a different node. The default deployments usually have this; verify after migration to a new cluster.
55
+ - PodDisruptionBudget: `minAvailable: 1` is the absolute floor; once the Deployment runs more than two replicas, `maxUnavailable: 1` is stricter and safer during node drains and cluster autoscaler scale-down.
56
+ - `cluster-proportional-autoscaler` ([repository](https://github.com/kubernetes-sigs/cluster-proportional-autoscaler)) is the standard way to scale CoreDNS replicas with cluster size.
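
For a feel of what the autoscaler would produce, the calculation of its linear mode can be reproduced offline. This is a sketch of the documented formula (the larger of the core-based and node-based replica counts, then clamped); the parameter values in the example are placeholders, not recommendations.

```python
import math
from typing import Optional


def linear_replicas(
    cores: int,
    nodes: int,
    cores_per_replica: float,
    nodes_per_replica: float,
    min_replicas: int = 1,
    max_replicas: Optional[int] = None,
) -> int:
    """Replica count as computed by the autoscaler's linear mode."""
    wanted = max(
        math.ceil(cores / cores_per_replica),
        math.ceil(nodes / nodes_per_replica),
    )
    wanted = max(wanted, min_replicas)
    if max_replicas is not None:
        wanted = min(wanted, max_replicas)
    return wanted


if __name__ == "__main__":
    # Hypothetical cluster: 100 nodes with 400 cores in total.
    print(linear_replicas(cores=400, nodes=100, cores_per_replica=256, nodes_per_replica=16))
    # Aggregate DNS load from the example above: 1000 pods at 100 QPS each.
    print(1000 * 100)  # 100000 queries per second
```

Dividing the aggregate QPS by the computed replica count gives the per-replica load to compare against the CPU request, which is the number the review should actually challenge.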
57
+
58
+ ## Step 4 — The `ndots:5` and search-path tail-latency trap
59
+
60
+ The default `dnsPolicy: ClusterFirst` injects an `options ndots:5` line into every pod's `/etc/resolv.conf` along with a search list like:
61
+
62
+ ```
63
+ search default.svc.cluster.local svc.cluster.local cluster.local
64
+ options ndots:5
65
+ ```
66
+
67
+ The `ndots:5` directive means: any name with fewer than 5 dots is treated as relative and tried against every search-list entry first.
68
+
69
+ For a query like `api.example.com` (2 dots, fewer than 5) the resolver issues:
70
+
71
+ 1. `api.example.com.default.svc.cluster.local.` — NXDOMAIN
72
+ 2. `api.example.com.svc.cluster.local.` — NXDOMAIN
73
+ 3. `api.example.com.cluster.local.` — NXDOMAIN
74
+ 4. `api.example.com.` — finally the real query
75
+
76
+ For external services, that is **4× the DNS load** and 3× the chance of dropped UDP packets causing a 5-second timeout.
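
The expansion order can be reproduced with a few lines of Python. This sketch models only the ordering rule described above (search list first when the name has fewer dots than `ndots`, the literal name first otherwise); it does not perform any lookups, and the function name is made up.

```python
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidates(name: str, search: list[str], ndots: int) -> list[str]:
    """Names a stub resolver would try, in order, until one returns an answer."""
    if name.endswith("."):
        return [name]                       # already absolute: no expansion
    expanded = [f"{name}.{domain}." for domain in search]
    literal = f"{name}."
    if name.count(".") >= ndots:
        return [literal] + expanded         # literal name is tried first
    return expanded + [literal]             # search list is tried first


if __name__ == "__main__":
    for n in (5, 1):
        print(f"ndots:{n}")
        for candidate in candidates("api.example.com", SEARCH, n):
            print("  ", candidate)
```

With `ndots:5` the real name is the fourth candidate; with `ndots:1` it is the first, so an external lookup normally completes in a single query.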
77
+
78
+ Mitigations the review must consider:
79
+
80
+ - Lower `ndots` to 1 per pod via `dnsConfig.options` for workloads that overwhelmingly resolve external hostnames. Short Service names still resolve through the search list, and the fully qualified `service.namespace.svc.cluster.local` form is tried as written first because it contains at least one dot.
81
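+ - Use `ExternalName` Services so that `mysvc.namespace.svc.cluster.local` is the canonical reference for an external dependency. Written with a trailing dot, that name is absolute and skips search-list expansion entirely; without the dot it has only 4 dots and is still expanded under `ndots:5`.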
82
+ - Enable `autopath` in CoreDNS — completes the search path server-side, requires `pods verified` and more memory.
83
+ - Pair any of these mitigations with NodeLocal DNSCache so the search-list expansion cost stays on the node.
84
+
85
+ ## Step 5 — NodeLocal DNSCache deep-dive
86
+
87
+ Per the [Kubernetes docs](https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/), NodeLocal DNSCache runs as a DaemonSet listening on a link-local IP. It solves three production problems:
88
+
89
+ 1. **conntrack overhead** — every UDP DNS query allocates a conntrack entry that lives 30 seconds; high-DNS-QPS pods exhaust conntrack tables and trigger random drops elsewhere. NodeLocal DNSCache upgrades to TCP for upstream traffic, where entries are removed on connection close.
90
+ 2. **5-second timeout amplification**: a dropped UDP packet is retried only after the resolver's timeout expires (5 seconds per attempt by default on glibc), and the delay repeats for every search-list candidate that loses a packet. Local cache hits remove most of this.
91
+ 3. **DNAT bypass** — pods talking to the link-local IP skip the iptables/IPVS Service VIP rewrite entirely.
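
The conntrack pressure in item 1 can be estimated with simple arithmetic: if each query creates an entry that lives for the UDP timeout, the steady-state entry count is roughly the query rate times that timeout. The sketch below assumes every query uses a fresh source port (an upper bound) and uses a placeholder table size; both are assumptions to replace with measured values.

```python
def steady_state_udp_entries(node_dns_qps: float, udp_timeout_s: int = 30) -> int:
    """Approximate conntrack entries held by DNS-over-UDP traffic on one node."""
    return int(node_dns_qps * udp_timeout_s)


def table_share(entries: int, conntrack_max: int) -> float:
    """Fraction of the node's conntrack table consumed by those entries."""
    return entries / conntrack_max


if __name__ == "__main__":
    entries = steady_state_udp_entries(node_dns_qps=2000)
    print(entries)                                 # 60000
    # 131072 is a placeholder; read the real limit from nf_conntrack_max on the node.
    print(round(table_share(entries, 131072), 2))
```

A node where DNS alone accounts for a large share of the table is a strong candidate for NodeLocal DNSCache, since the cache's upstream TCP connections release their entries on close.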
92
+
93
+ Operational risks the review must call out:
94
+
95
+ - **OOMKill is a node-wide DNS outage.** When the local cache pod is killed, the iptables rules that redirected DNS traffic to the cache stay in place pointing at an unhealthy pod until the new pod is ready. Set memory limits with headroom (default cache is 10k entries ≈ 30 MB; 100 MB request and 200 MB limit is a safer baseline) and monitor the cache plugin's entry-count metric (`coredns_cache_entries` in current CoreDNS releases).
96
+ - **PodDisruptionBudget** for the DaemonSet — node drains during cluster autoscaler events should not all evict the local cache simultaneously.
97
+ - **kubelet `--cluster-dns` flag must be updated** if the kube-proxy mode is IPVS — see the linked docs for the exact rewrite. iptables mode tolerates either the cache IP or the kube-dns Service IP.
98
+ - **IPv6**: in the cache's Corefile, every IPv6 address written in `IP:port` form must be enclosed in brackets (for example `[fd00::1]:53`).
99
+ - **Cilium kube-proxy replacement** — Cilium's `socketLB` honors the per-pod redirect to the link-local IP only with the right options; verify with Cilium's NodeLocal DNSCache integration page for the version in use.
100
+
101
+ ## Step 6 — ExternalDNS and the boundary
102
+
103
+ `ExternalDNS` runs in the cluster but operates on cloud DNS (Route 53, Cloud DNS, Azure DNS, OCI DNS) — it is the bridge between Kubernetes Service / Ingress / Gateway and externally reachable hostnames.
104
+
105
+ Architecture review checks:
106
+
107
+ - The IAM/role binding ExternalDNS uses must be scoped to the specific hosted zone(s) and the record types it actually creates (typically A, AAAA, CNAME, TXT). A `*` permission is over-scoped.
108
+ - Ownership records (TXT) prevent two clusters fighting over the same hostname. Without them, two ExternalDNS deployments will continually overwrite each other's records — silent ping-pong.
109
+ - TTL: the default may be high (300s); for blue/green or canary cutovers, a low TTL on the routed records is required, set per-resource via the `external-dns.alpha.kubernetes.io/ttl` annotation.
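
For the AWS case, the scoping check in the first bullet can be approximated by scanning the IAM policy document for record-write permissions on wildcard resources. This is a rough sketch under stated assumptions: the policy is already loaded as a dict in standard IAM JSON shape, only `Allow` statements are inspected, and the hosted-zone ID in the example is fictitious.

```python
WRITE_ACTION = "route53:ChangeResourceRecordSets"


def _as_list(value):
    return value if isinstance(value, list) else [value]


def over_scoped_statements(policy: dict) -> list[dict]:
    """Allow statements that grant record writes on every hosted zone."""
    flagged = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        writes = any(a in (WRITE_ACTION, "route53:*", "*") for a in actions)
        wildcard = any(r == "*" or r.endswith("hostedzone/*") for r in resources)
        if writes and wildcard:
            flagged.append(stmt)
    return flagged


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [WRITE_ACTION],
                "Resource": ["arn:aws:route53:::hostedzone/*"],
            }
        ],
    }
    print(len(over_scoped_statements(policy)))  # 1: writes allowed on all zones
```

A scoped policy would list the specific hosted-zone ARN instead of the wildcard; read-only list actions legitimately need a broader resource and are deliberately not flagged here.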
110
+
111
+ This boundary is also where this skill's scope ends — DNS *outside* the cluster is the AWS / Azure / OCI network architect's territory.
112
+
113
+ ## Output for this section
114
+
115
+ - CoreDNS topology, Corefile critical-plugin checks (cache, loop, reload, kubernetes plugin mode),
116
+ - replica count and autoscaler posture,
117
+ - `ndots:5` exposure for the workloads in scope and the per-pod `dnsConfig.options` plan,
118
+ - NodeLocal DNSCache presence, memory headroom, PDB, and OOM exposure,
119
+ - ExternalDNS scope, ownership-TXT posture, TTL hygiene,
120
+ - findings, severity, and the next-step delegate (cloud-network-architect agent for hosted-zone scoping).