@ruaruababa/vibe-kit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (462)
  1. package/CATALOG.md +317 -0
  2. package/README.md +121 -0
  3. package/aliases.json +65 -0
  4. package/bin/vibe.js +2 -0
  5. package/bundles.json +265 -0
  6. package/catalog.json +1560 -0
  7. package/dist/antigravity-skills/bin/cli.js +438 -0
  8. package/dist/antigravity-skills/lib/skill-utils.js +158 -0
  9. package/dist/antigravity-skills/scripts/build-catalog.js +305 -0
  10. package/dist/antigravity-skills/scripts/normalize-frontmatter.js +144 -0
  11. package/dist/antigravity-skills/scripts/validate-skills.js +230 -0
  12. package/dist/bin/vibe.js +2 -0
  13. package/dist/dist/src/cli/index.js +26 -0
  14. package/dist/lib/skill-utils.js +158 -0
  15. package/dist/scripts/build-catalog.js +50 -0
  16. package/dist/scripts/normalize-frontmatter.js +144 -0
  17. package/dist/scripts/validate-skills.js +56 -0
  18. package/dist/src/cli/index.js +146 -0
  19. package/dist/src/types/index.js +13 -0
  20. package/dist/src/utils/fs.js +1 -0
  21. package/package.json +43 -0
  22. package/skills/accessibility-compliance-accessibility-audit/SKILL.md +42 -0
  23. package/skills/accessibility-compliance-accessibility-audit/resources/implementation-playbook.md +502 -0
  24. package/skills/agent-orchestration-improve-agent/SKILL.md +349 -0
  25. package/skills/agent-orchestration-multi-agent-optimize/SKILL.md +239 -0
  26. package/skills/agent-orchestrator/SKILL.md +24 -0
  27. package/skills/ai-engineer/SKILL.md +171 -0
  28. package/skills/airflow-dag-patterns/SKILL.md +41 -0
  29. package/skills/airflow-dag-patterns/resources/implementation-playbook.md +509 -0
  30. package/skills/angular-migration/SKILL.md +428 -0
  31. package/skills/anti-reversing-techniques/SKILL.md +42 -0
  32. package/skills/anti-reversing-techniques/resources/implementation-playbook.md +539 -0
  33. package/skills/api-design-principles/SKILL.md +37 -0
  34. package/skills/api-design-principles/assets/api-design-checklist.md +155 -0
  35. package/skills/api-design-principles/assets/rest-api-template.py +182 -0
  36. package/skills/api-design-principles/references/graphql-schema-design.md +583 -0
  37. package/skills/api-design-principles/references/rest-best-practices.md +408 -0
  38. package/skills/api-design-principles/resources/implementation-playbook.md +513 -0
  39. package/skills/api-documenter/SKILL.md +184 -0
  40. package/skills/api-testing-observability-api-mock/SKILL.md +46 -0
  41. package/skills/api-testing-observability-api-mock/resources/implementation-playbook.md +1327 -0
  42. package/skills/application-performance-performance-optimization/SKILL.md +154 -0
  43. package/skills/architect-review/SKILL.md +174 -0
  44. package/skills/architecture-decision-records/SKILL.md +441 -0
  45. package/skills/architecture-patterns/SKILL.md +37 -0
  46. package/skills/architecture-patterns/resources/implementation-playbook.md +479 -0
  47. package/skills/arm-cortex-expert/SKILL.md +306 -0
  48. package/skills/async-python-patterns/SKILL.md +39 -0
  49. package/skills/async-python-patterns/resources/implementation-playbook.md +678 -0
  50. package/skills/attack-tree-construction/SKILL.md +38 -0
  51. package/skills/attack-tree-construction/resources/implementation-playbook.md +671 -0
  52. package/skills/auth-implementation-patterns/SKILL.md +39 -0
  53. package/skills/auth-implementation-patterns/resources/implementation-playbook.md +618 -0
  54. package/skills/backend-architect/SKILL.md +333 -0
  55. package/skills/backend-development-feature-development/SKILL.md +180 -0
  56. package/skills/backend-security-coder/SKILL.md +156 -0
  57. package/skills/backtesting-frameworks/SKILL.md +39 -0
  58. package/skills/backtesting-frameworks/resources/implementation-playbook.md +647 -0
  59. package/skills/bash-defensive-patterns/SKILL.md +43 -0
  60. package/skills/bash-defensive-patterns/resources/implementation-playbook.md +517 -0
  61. package/skills/bash-pro/SKILL.md +310 -0
  62. package/skills/bats-testing-patterns/SKILL.md +34 -0
  63. package/skills/bats-testing-patterns/resources/implementation-playbook.md +614 -0
  64. package/skills/bazel-build-optimization/SKILL.md +397 -0
  65. package/skills/billing-automation/SKILL.md +42 -0
  66. package/skills/billing-automation/resources/implementation-playbook.md +544 -0
  67. package/skills/binary-analysis-patterns/SKILL.md +450 -0
  68. package/skills/blockchain-developer/SKILL.md +208 -0
  69. package/skills/business-analyst/SKILL.md +182 -0
  70. package/skills/c-pro/SKILL.md +56 -0
  71. package/skills/c4-architecture-c4-architecture/SKILL.md +389 -0
  72. package/skills/c4-code/SKILL.md +244 -0
  73. package/skills/c4-component/SKILL.md +153 -0
  74. package/skills/c4-container/SKILL.md +171 -0
  75. package/skills/c4-context/SKILL.md +150 -0
  76. package/skills/changelog-automation/SKILL.md +38 -0
  77. package/skills/changelog-automation/resources/implementation-playbook.md +538 -0
  78. package/skills/cicd-automation-workflow-automate/SKILL.md +51 -0
  79. package/skills/cicd-automation-workflow-automate/resources/implementation-playbook.md +1333 -0
  80. package/skills/clean-markdown/SKILL.md +23 -0
  81. package/skills/cloud-architect/SKILL.md +135 -0
  82. package/skills/code-documentation-code-explain/SKILL.md +46 -0
  83. package/skills/code-documentation-code-explain/resources/implementation-playbook.md +802 -0
  84. package/skills/code-documentation-doc-generate/SKILL.md +48 -0
  85. package/skills/code-documentation-doc-generate/resources/implementation-playbook.md +640 -0
  86. package/skills/code-refactoring-context-restore/SKILL.md +179 -0
  87. package/skills/code-refactoring-refactor-clean/SKILL.md +51 -0
  88. package/skills/code-refactoring-refactor-clean/resources/implementation-playbook.md +879 -0
  89. package/skills/code-refactoring-tech-debt/SKILL.md +386 -0
  90. package/skills/code-review-ai-ai-review/SKILL.md +450 -0
  91. package/skills/code-review-excellence/SKILL.md +40 -0
  92. package/skills/code-review-excellence/resources/implementation-playbook.md +515 -0
  93. package/skills/code-reviewer/SKILL.md +178 -0
  94. package/skills/codebase-cleanup-deps-audit/SKILL.md +51 -0
  95. package/skills/codebase-cleanup-deps-audit/resources/implementation-playbook.md +766 -0
  96. package/skills/codebase-cleanup-refactor-clean/SKILL.md +51 -0
  97. package/skills/codebase-cleanup-refactor-clean/resources/implementation-playbook.md +879 -0
  98. package/skills/codebase-cleanup-tech-debt/SKILL.md +386 -0
  99. package/skills/competitive-landscape/SKILL.md +34 -0
  100. package/skills/competitive-landscape/resources/implementation-playbook.md +494 -0
  101. package/skills/comprehensive-review-full-review/SKILL.md +146 -0
  102. package/skills/comprehensive-review-pr-enhance/SKILL.md +46 -0
  103. package/skills/comprehensive-review-pr-enhance/resources/implementation-playbook.md +691 -0
  104. package/skills/conductor-implement/SKILL.md +388 -0
  105. package/skills/conductor-manage/SKILL.md +39 -0
  106. package/skills/conductor-manage/resources/implementation-playbook.md +1120 -0
  107. package/skills/conductor-new-track/SKILL.md +433 -0
  108. package/skills/conductor-revert/SKILL.md +372 -0
  109. package/skills/conductor-setup/SKILL.md +426 -0
  110. package/skills/conductor-status/SKILL.md +338 -0
  111. package/skills/conductor-validator/SKILL.md +62 -0
  112. package/skills/content-marketer/SKILL.md +170 -0
  113. package/skills/context-driven-development/SKILL.md +400 -0
  114. package/skills/context-management-context-restore/SKILL.md +179 -0
  115. package/skills/context-management-context-save/SKILL.md +177 -0
  116. package/skills/context-manager/SKILL.md +185 -0
  117. package/skills/cost-optimization/SKILL.md +286 -0
  118. package/skills/cpp-pro/SKILL.md +59 -0
  119. package/skills/cqrs-implementation/SKILL.md +35 -0
  120. package/skills/cqrs-implementation/resources/implementation-playbook.md +540 -0
  121. package/skills/csharp-pro/SKILL.md +59 -0
  122. package/skills/customer-support/SKILL.md +170 -0
  123. package/skills/data-engineer/SKILL.md +224 -0
  124. package/skills/data-engineering-data-driven-feature/SKILL.md +182 -0
  125. package/skills/data-engineering-data-pipeline/SKILL.md +201 -0
  126. package/skills/data-quality-frameworks/SKILL.md +40 -0
  127. package/skills/data-quality-frameworks/resources/implementation-playbook.md +573 -0
  128. package/skills/data-scientist/SKILL.md +199 -0
  129. package/skills/data-storytelling/SKILL.md +465 -0
  130. package/skills/database-admin/SKILL.md +165 -0
  131. package/skills/database-architect/SKILL.md +268 -0
  132. package/skills/database-cloud-optimization-cost-optimize/SKILL.md +44 -0
  133. package/skills/database-cloud-optimization-cost-optimize/resources/implementation-playbook.md +1441 -0
  134. package/skills/database-migration/SKILL.md +436 -0
  135. package/skills/database-migrations-migration-observability/SKILL.md +420 -0
  136. package/skills/database-migrations-sql-migrations/SKILL.md +53 -0
  137. package/skills/database-migrations-sql-migrations/resources/implementation-playbook.md +499 -0
  138. package/skills/database-optimizer/SKILL.md +167 -0
  139. package/skills/dbt-transformation-patterns/SKILL.md +34 -0
  140. package/skills/dbt-transformation-patterns/resources/implementation-playbook.md +547 -0
  141. package/skills/debugger/SKILL.md +49 -0
  142. package/skills/debugging-strategies/SKILL.md +34 -0
  143. package/skills/debugging-strategies/resources/implementation-playbook.md +511 -0
  144. package/skills/debugging-toolkit-smart-debug/SKILL.md +197 -0
  145. package/skills/defi-protocol-templates/SKILL.md +466 -0
  146. package/skills/dependency-management-deps-audit/SKILL.md +44 -0
  147. package/skills/dependency-management-deps-audit/resources/implementation-playbook.md +766 -0
  148. package/skills/dependency-upgrade/SKILL.md +421 -0
  149. package/skills/deployment-engineer/SKILL.md +170 -0
  150. package/skills/deployment-pipeline-design/SKILL.md +371 -0
  151. package/skills/deployment-validation-config-validate/SKILL.md +496 -0
  152. package/skills/devops-troubleshooter/SKILL.md +161 -0
  153. package/skills/distributed-debugging-debug-trace/SKILL.md +44 -0
  154. package/skills/distributed-debugging-debug-trace/resources/implementation-playbook.md +1307 -0
  155. package/skills/distributed-tracing/SKILL.md +450 -0
  156. package/skills/django-pro/SKILL.md +180 -0
  157. package/skills/docs-architect/SKILL.md +98 -0
  158. package/skills/documentation-generation-doc-generate/SKILL.md +48 -0
  159. package/skills/documentation-generation-doc-generate/resources/implementation-playbook.md +640 -0
  160. package/skills/dotnet-architect/SKILL.md +197 -0
  161. package/skills/dotnet-backend-patterns/SKILL.md +37 -0
  162. package/skills/dotnet-backend-patterns/assets/repository-template.cs +523 -0
  163. package/skills/dotnet-backend-patterns/assets/service-template.cs +336 -0
  164. package/skills/dotnet-backend-patterns/references/dapper-patterns.md +544 -0
  165. package/skills/dotnet-backend-patterns/references/ef-core-best-practices.md +355 -0
  166. package/skills/dotnet-backend-patterns/resources/implementation-playbook.md +799 -0
  167. package/skills/dummy-skill/SKILL.md +5 -0
  168. package/skills/dx-optimizer/SKILL.md +83 -0
  169. package/skills/e2e-testing-patterns/SKILL.md +41 -0
  170. package/skills/e2e-testing-patterns/resources/implementation-playbook.md +531 -0
  171. package/skills/elixir-pro/SKILL.md +59 -0
  172. package/skills/embedding-strategies/SKILL.md +491 -0
  173. package/skills/employment-contract-templates/SKILL.md +39 -0
  174. package/skills/employment-contract-templates/resources/implementation-playbook.md +493 -0
  175. package/skills/error-debugging-error-analysis/SKILL.md +47 -0
  176. package/skills/error-debugging-error-analysis/resources/implementation-playbook.md +1143 -0
  177. package/skills/error-debugging-error-trace/SKILL.md +43 -0
  178. package/skills/error-debugging-error-trace/resources/implementation-playbook.md +1361 -0
  179. package/skills/error-debugging-multi-agent-review/SKILL.md +216 -0
  180. package/skills/error-detective/SKILL.md +53 -0
  181. package/skills/error-diagnostics-error-analysis/SKILL.md +47 -0
  182. package/skills/error-diagnostics-error-analysis/resources/implementation-playbook.md +1143 -0
  183. package/skills/error-diagnostics-error-trace/SKILL.md +48 -0
  184. package/skills/error-diagnostics-error-trace/resources/implementation-playbook.md +1371 -0
  185. package/skills/error-diagnostics-smart-debug/SKILL.md +197 -0
  186. package/skills/error-handling-patterns/SKILL.md +35 -0
  187. package/skills/error-handling-patterns/resources/implementation-playbook.md +635 -0
  188. package/skills/event-sourcing-architect/SKILL.md +58 -0
  189. package/skills/event-store-design/SKILL.md +449 -0
  190. package/skills/fastapi-pro/SKILL.md +192 -0
  191. package/skills/fastapi-templates/SKILL.md +32 -0
  192. package/skills/fastapi-templates/resources/implementation-playbook.md +566 -0
  193. package/skills/final-test/SKILL.md +5 -0
  194. package/skills/firmware-analyst/SKILL.md +320 -0
  195. package/skills/flutter-expert/SKILL.md +200 -0
  196. package/skills/framework-migration-code-migrate/SKILL.md +48 -0
  197. package/skills/framework-migration-code-migrate/resources/implementation-playbook.md +1052 -0
  198. package/skills/framework-migration-deps-upgrade/SKILL.md +48 -0
  199. package/skills/framework-migration-deps-upgrade/resources/implementation-playbook.md +755 -0
  200. package/skills/framework-migration-legacy-modernize/SKILL.md +132 -0
  201. package/skills/frontend-developer/SKILL.md +171 -0
  202. package/skills/frontend-mobile-development-component-scaffold/SKILL.md +403 -0
  203. package/skills/frontend-mobile-security-xss-scan/SKILL.md +322 -0
  204. package/skills/frontend-security-coder/SKILL.md +170 -0
  205. package/skills/full-stack-orchestration-full-stack-feature/SKILL.md +135 -0
  206. package/skills/gdpr-data-handling/SKILL.md +33 -0
  207. package/skills/gdpr-data-handling/resources/implementation-playbook.md +615 -0
  208. package/skills/git-advanced-workflows/SKILL.md +412 -0
  209. package/skills/git-pr-workflows-git-workflow/SKILL.md +140 -0
  210. package/skills/git-pr-workflows-onboard/SKILL.md +416 -0
  211. package/skills/git-pr-workflows-pr-enhance/SKILL.md +48 -0
  212. package/skills/git-pr-workflows-pr-enhance/resources/implementation-playbook.md +701 -0
  213. package/skills/github-actions-templates/SKILL.md +345 -0
  214. package/skills/gitlab-ci-patterns/SKILL.md +283 -0
  215. package/skills/gitops-workflow/SKILL.md +303 -0
  216. package/skills/gitops-workflow/references/argocd-setup.md +134 -0
  217. package/skills/gitops-workflow/references/sync-policies.md +131 -0
  218. package/skills/go-concurrency-patterns/SKILL.md +33 -0
  219. package/skills/go-concurrency-patterns/resources/implementation-playbook.md +654 -0
  220. package/skills/godot-gdscript-patterns/SKILL.md +33 -0
  221. package/skills/godot-gdscript-patterns/resources/implementation-playbook.md +804 -0
  222. package/skills/golang-pro/SKILL.md +179 -0
  223. package/skills/grafana-dashboards/SKILL.md +381 -0
  224. package/skills/graphql-architect/SKILL.md +182 -0
  225. package/skills/haskell-pro/SKILL.md +56 -0
  226. package/skills/helm-chart-scaffolding/SKILL.md +34 -0
  227. package/skills/helm-chart-scaffolding/assets/Chart.yaml.template +42 -0
  228. package/skills/helm-chart-scaffolding/assets/values.yaml.template +185 -0
  229. package/skills/helm-chart-scaffolding/references/chart-structure.md +500 -0
  230. package/skills/helm-chart-scaffolding/resources/implementation-playbook.md +543 -0
  231. package/skills/helm-chart-scaffolding/scripts/validate-chart.sh +244 -0
  232. package/skills/hr-pro/SKILL.md +126 -0
  233. package/skills/hybrid-cloud-architect/SKILL.md +168 -0
  234. package/skills/hybrid-cloud-networking/SKILL.md +238 -0
  235. package/skills/hybrid-search-implementation/SKILL.md +32 -0
  236. package/skills/hybrid-search-implementation/resources/implementation-playbook.md +567 -0
  237. package/skills/incident-responder/SKILL.md +213 -0
  238. package/skills/incident-response-incident-response/SKILL.md +168 -0
  239. package/skills/incident-response-smart-fix/SKILL.md +29 -0
  240. package/skills/incident-response-smart-fix/resources/implementation-playbook.md +838 -0
  241. package/skills/incident-runbook-templates/SKILL.md +395 -0
  242. package/skills/ios-developer/SKILL.md +219 -0
  243. package/skills/istio-traffic-management/SKILL.md +337 -0
  244. package/skills/java-pro/SKILL.md +177 -0
  245. package/skills/javascript-pro/SKILL.md +57 -0
  246. package/skills/javascript-testing-patterns/SKILL.md +35 -0
  247. package/skills/javascript-testing-patterns/resources/implementation-playbook.md +1024 -0
  248. package/skills/javascript-typescript-typescript-scaffold/SKILL.md +361 -0
  249. package/skills/julia-pro/SKILL.md +209 -0
  250. package/skills/k8s-manifest-generator/SKILL.md +35 -0
  251. package/skills/k8s-manifest-generator/assets/configmap-template.yaml +296 -0
  252. package/skills/k8s-manifest-generator/assets/deployment-template.yaml +203 -0
  253. package/skills/k8s-manifest-generator/assets/service-template.yaml +171 -0
  254. package/skills/k8s-manifest-generator/references/deployment-spec.md +753 -0
  255. package/skills/k8s-manifest-generator/references/service-spec.md +724 -0
  256. package/skills/k8s-manifest-generator/resources/implementation-playbook.md +510 -0
  257. package/skills/k8s-security-policies/SKILL.md +346 -0
  258. package/skills/k8s-security-policies/assets/network-policy-template.yaml +177 -0
  259. package/skills/k8s-security-policies/references/rbac-patterns.md +187 -0
  260. package/skills/kpi-dashboard-design/SKILL.md +440 -0
  261. package/skills/kubernetes-architect/SKILL.md +170 -0
  262. package/skills/langchain-architecture/SKILL.md +350 -0
  263. package/skills/legacy-modernizer/SKILL.md +53 -0
  264. package/skills/legal-advisor/SKILL.md +70 -0
  265. package/skills/linkerd-patterns/SKILL.md +321 -0
  266. package/skills/llm-application-dev-ai-assistant/SKILL.md +35 -0
  267. package/skills/llm-application-dev-ai-assistant/resources/implementation-playbook.md +1236 -0
  268. package/skills/llm-application-dev-langchain-agent/SKILL.md +246 -0
  269. package/skills/llm-application-dev-prompt-optimize/SKILL.md +37 -0
  270. package/skills/llm-application-dev-prompt-optimize/resources/implementation-playbook.md +591 -0
  271. package/skills/llm-evaluation/SKILL.md +483 -0
  272. package/skills/machine-learning-ops-ml-pipeline/SKILL.md +314 -0
  273. package/skills/malware-analyst/SKILL.md +247 -0
  274. package/skills/market-sizing-analysis/SKILL.md +425 -0
  275. package/skills/market-sizing-analysis/examples/saas-market-sizing.md +349 -0
  276. package/skills/market-sizing-analysis/references/data-sources.md +360 -0
  277. package/skills/memory-forensics/SKILL.md +491 -0
  278. package/skills/memory-safety-patterns/SKILL.md +33 -0
  279. package/skills/memory-safety-patterns/resources/implementation-playbook.md +603 -0
  280. package/skills/mermaid-expert/SKILL.md +59 -0
  281. package/skills/microservices-patterns/SKILL.md +35 -0
  282. package/skills/microservices-patterns/resources/implementation-playbook.md +607 -0
  283. package/skills/minecraft-bukkit-pro/SKILL.md +126 -0
  284. package/skills/ml-engineer/SKILL.md +168 -0
  285. package/skills/ml-pipeline-workflow/SKILL.md +257 -0
  286. package/skills/mlops-engineer/SKILL.md +219 -0
  287. package/skills/mobile-developer/SKILL.md +205 -0
  288. package/skills/mobile-security-coder/SKILL.md +184 -0
  289. package/skills/modern-javascript-patterns/SKILL.md +35 -0
  290. package/skills/modern-javascript-patterns/resources/implementation-playbook.md +910 -0
  291. package/skills/monorepo-architect/SKILL.md +61 -0
  292. package/skills/monorepo-management/SKILL.md +35 -0
  293. package/skills/monorepo-management/resources/implementation-playbook.md +621 -0
  294. package/skills/mtls-configuration/SKILL.md +359 -0
  295. package/skills/multi-cloud-architecture/SKILL.md +189 -0
  296. package/skills/multi-platform-apps-multi-platform/SKILL.md +203 -0
  297. package/skills/network-engineer/SKILL.md +169 -0
  298. package/skills/nextjs-app-router-patterns/SKILL.md +33 -0
  299. package/skills/nextjs-app-router-patterns/resources/implementation-playbook.md +543 -0
  300. package/skills/nft-standards/SKILL.md +395 -0
  301. package/skills/node-expert/SKILL.md +23 -0
  302. package/skills/nodejs-backend-patterns/SKILL.md +35 -0
  303. package/skills/nodejs-backend-patterns/resources/implementation-playbook.md +1019 -0
  304. package/skills/nx-workspace-patterns/SKILL.md +464 -0
  305. package/skills/observability-engineer/SKILL.md +237 -0
  306. package/skills/observability-monitoring-monitor-setup/SKILL.md +48 -0
  307. package/skills/observability-monitoring-monitor-setup/resources/implementation-playbook.md +505 -0
  308. package/skills/observability-monitoring-slo-implement/SKILL.md +43 -0
  309. package/skills/observability-monitoring-slo-implement/resources/implementation-playbook.md +1077 -0
  310. package/skills/on-call-handoff-patterns/SKILL.md +453 -0
  311. package/skills/openapi-spec-generation/SKILL.md +33 -0
  312. package/skills/openapi-spec-generation/resources/implementation-playbook.md +1027 -0
  313. package/skills/payment-integration/SKILL.md +77 -0
  314. package/skills/paypal-integration/SKILL.md +479 -0
  315. package/skills/pci-compliance/SKILL.md +478 -0
  316. package/skills/performance-engineer/SKILL.md +180 -0
  317. package/skills/performance-testing-review-ai-review/SKILL.md +450 -0
  318. package/skills/performance-testing-review-multi-agent-review/SKILL.md +216 -0
  319. package/skills/php-pro/SKILL.md +63 -0
  320. package/skills/posix-shell-pro/SKILL.md +304 -0
  321. package/skills/postgresql/SKILL.md +230 -0
  322. package/skills/postmortem-writing/SKILL.md +386 -0
  323. package/skills/projection-patterns/SKILL.md +33 -0
  324. package/skills/projection-patterns/resources/implementation-playbook.md +501 -0
  325. package/skills/prometheus-configuration/SKILL.md +404 -0
  326. package/skills/prompt-engineer/SKILL.md +272 -0
  327. package/skills/prompt-engineering-patterns/SKILL.md +213 -0
  328. package/skills/prompt-engineering-patterns/assets/few-shot-examples.json +106 -0
  329. package/skills/prompt-engineering-patterns/assets/prompt-template-library.md +246 -0
  330. package/skills/prompt-engineering-patterns/references/chain-of-thought.md +399 -0
  331. package/skills/prompt-engineering-patterns/references/few-shot-learning.md +369 -0
  332. package/skills/prompt-engineering-patterns/references/prompt-optimization.md +414 -0
  333. package/skills/prompt-engineering-patterns/references/prompt-templates.md +470 -0
  334. package/skills/prompt-engineering-patterns/references/system-prompts.md +189 -0
  335. package/skills/prompt-engineering-patterns/scripts/optimize-prompt.py +279 -0
  336. package/skills/protocol-reverse-engineering/SKILL.md +29 -0
  337. package/skills/protocol-reverse-engineering/resources/implementation-playbook.md +509 -0
  338. package/skills/python-development-python-scaffold/SKILL.md +331 -0
  339. package/skills/python-packaging/SKILL.md +36 -0
  340. package/skills/python-packaging/resources/implementation-playbook.md +869 -0
  341. package/skills/python-performance-optimization/SKILL.md +36 -0
  342. package/skills/python-performance-optimization/resources/implementation-playbook.md +868 -0
  343. package/skills/python-pro/SKILL.md +158 -0
  344. package/skills/python-testing-patterns/SKILL.md +37 -0
  345. package/skills/python-testing-patterns/resources/implementation-playbook.md +906 -0
  346. package/skills/quant-analyst/SKILL.md +53 -0
  347. package/skills/rag-implementation/SKILL.md +421 -0
  348. package/skills/react-modernization/SKILL.md +34 -0
  349. package/skills/react-modernization/resources/implementation-playbook.md +512 -0
  350. package/skills/react-native-architecture/SKILL.md +33 -0
  351. package/skills/react-native-architecture/resources/implementation-playbook.md +670 -0
  352. package/skills/react-state-management/SKILL.md +441 -0
  353. package/skills/reference-builder/SKILL.md +188 -0
  354. package/skills/reverse-engineer/SKILL.md +173 -0
  355. package/skills/risk-manager/SKILL.md +61 -0
  356. package/skills/risk-metrics-calculation/SKILL.md +33 -0
  357. package/skills/risk-metrics-calculation/resources/implementation-playbook.md +554 -0
  358. package/skills/ruby-pro/SKILL.md +56 -0
  359. package/skills/rust-async-patterns/SKILL.md +33 -0
  360. package/skills/rust-async-patterns/resources/implementation-playbook.md +516 -0
  361. package/skills/rust-pro/SKILL.md +178 -0
  362. package/skills/saga-orchestration/SKILL.md +496 -0
  363. package/skills/sales-automator/SKILL.md +55 -0
  364. package/skills/sast-configuration/SKILL.md +212 -0
  365. package/skills/scala-pro/SKILL.md +82 -0
  366. package/skills/screen-reader-testing/SKILL.md +33 -0
  367. package/skills/screen-reader-testing/resources/implementation-playbook.md +544 -0
  368. package/skills/search-specialist/SKILL.md +80 -0
  369. package/skills/secrets-management/SKILL.md +364 -0
  370. package/skills/security-auditor/SKILL.md +169 -0
  371. package/skills/security-compliance-compliance-check/SKILL.md +55 -0
  372. package/skills/security-compliance-compliance-check/resources/implementation-playbook.md +963 -0
  373. package/skills/security-requirement-extraction/SKILL.md +33 -0
  374. package/skills/security-requirement-extraction/resources/implementation-playbook.md +676 -0
  375. package/skills/security-scanning-security-dependencies/SKILL.md +43 -0
  376. package/skills/security-scanning-security-dependencies/resources/implementation-playbook.md +544 -0
  377. package/skills/security-scanning-security-hardening/SKILL.md +147 -0
  378. package/skills/security-scanning-security-sast/SKILL.md +495 -0
  379. package/skills/seo-authority-builder/SKILL.md +136 -0
  380. package/skills/seo-cannibalization-detector/SKILL.md +123 -0
  381. package/skills/seo-content-auditor/SKILL.md +83 -0
  382. package/skills/seo-content-planner/SKILL.md +108 -0
  383. package/skills/seo-content-refresher/SKILL.md +118 -0
  384. package/skills/seo-content-writer/SKILL.md +96 -0
  385. package/skills/seo-keyword-strategist/SKILL.md +95 -0
  386. package/skills/seo-meta-optimizer/SKILL.md +92 -0
  387. package/skills/seo-snippet-hunter/SKILL.md +114 -0
  388. package/skills/seo-structure-architect/SKILL.md +108 -0
  389. package/skills/service-mesh-expert/SKILL.md +58 -0
  390. package/skills/service-mesh-observability/SKILL.md +395 -0
  391. package/skills/shellcheck-configuration/SKILL.md +466 -0
  392. package/skills/similarity-search-patterns/SKILL.md +33 -0
  393. package/skills/similarity-search-patterns/resources/implementation-playbook.md +557 -0
  394. package/skills/slo-implementation/SKILL.md +341 -0
  395. package/skills/solidity-security/SKILL.md +34 -0
  396. package/skills/solidity-security/resources/implementation-playbook.md +524 -0
  397. package/skills/spark-optimization/SKILL.md +427 -0
  398. package/skills/sql-optimization-patterns/SKILL.md +35 -0
  399. package/skills/sql-optimization-patterns/resources/implementation-playbook.md +504 -0
  400. package/skills/sql-pro/SKILL.md +173 -0
  401. package/skills/startup-analyst/SKILL.md +328 -0
  402. package/skills/startup-business-analyst-business-case/SKILL.md +487 -0
  403. package/skills/startup-business-analyst-financial-projections/SKILL.md +353 -0
  404. package/skills/startup-business-analyst-market-opportunity/SKILL.md +240 -0
  405. package/skills/startup-financial-modeling/SKILL.md +467 -0
  406. package/skills/startup-metrics-framework/SKILL.md +34 -0
  407. package/skills/startup-metrics-framework/resources/implementation-playbook.md +500 -0
  408. package/skills/stride-analysis-patterns/SKILL.md +33 -0
  409. package/skills/stride-analysis-patterns/resources/implementation-playbook.md +655 -0
  410. package/skills/stripe-integration/SKILL.md +454 -0
  411. package/skills/systems-programming-rust-project/SKILL.md +440 -0
  412. package/skills/tailwind-design-system/SKILL.md +33 -0
  413. package/skills/tailwind-design-system/resources/implementation-playbook.md +665 -0
  414. package/skills/tdd-orchestrator/SKILL.md +205 -0
  415. package/skills/tdd-workflows-tdd-cycle/SKILL.md +221 -0
  416. package/skills/tdd-workflows-tdd-green/SKILL.md +73 -0
  417. package/skills/tdd-workflows-tdd-green/resources/implementation-playbook.md +870 -0
  418. package/skills/tdd-workflows-tdd-red/SKILL.md +164 -0
  419. package/skills/tdd-workflows-tdd-refactor/SKILL.md +187 -0
  420. package/skills/team-collaboration-issue/SKILL.md +37 -0
  421. package/skills/team-collaboration-issue/resources/implementation-playbook.md +640 -0
  422. package/skills/team-collaboration-standup-notes/SKILL.md +44 -0
  423. package/skills/team-collaboration-standup-notes/resources/implementation-playbook.md +768 -0
  424. package/skills/team-composition-analysis/SKILL.md +413 -0
  425. package/skills/temporal-python-pro/SKILL.md +370 -0
  426. package/skills/temporal-python-testing/SKILL.md +170 -0
  427. package/skills/temporal-python-testing/resources/integration-testing.md +455 -0
  428. package/skills/temporal-python-testing/resources/local-setup.md +553 -0
  429. package/skills/temporal-python-testing/resources/replay-testing.md +462 -0
  430. package/skills/temporal-python-testing/resources/unit-testing.md +328 -0
  431. package/skills/terraform-module-library/SKILL.md +261 -0
  432. package/skills/terraform-module-library/references/aws-modules.md +63 -0
  433. package/skills/terraform-specialist/SKILL.md +166 -0
  434. package/skills/test-automator/SKILL.md +224 -0
  435. package/skills/threat-mitigation-mapping/SKILL.md +33 -0
  436. package/skills/threat-mitigation-mapping/resources/implementation-playbook.md +744 -0
  437. package/skills/threat-modeling-expert/SKILL.md +60 -0
  438. package/skills/track-management/SKILL.md +38 -0
  439. package/skills/track-management/resources/implementation-playbook.md +591 -0
  440. package/skills/turborepo-caching/SKILL.md +419 -0
  441. package/skills/tutorial-engineer/SKILL.md +139 -0
  442. package/skills/typescript-advanced-types/SKILL.md +35 -0
  443. package/skills/typescript-advanced-types/resources/implementation-playbook.md +716 -0
  444. package/skills/typescript-pro/SKILL.md +55 -0
  445. package/skills/ui-minimal/SKILL.md +23 -0
  446. package/skills/ui-ux-designer/SKILL.md +209 -0
  447. package/skills/ui-visual-validator/SKILL.md +214 -0
  448. package/skills/unit-testing-test-generate/SKILL.md +319 -0
  449. package/skills/unity-developer/SKILL.md +230 -0
  450. package/skills/unity-ecs-patterns/SKILL.md +33 -0
  451. package/skills/unity-ecs-patterns/resources/implementation-playbook.md +625 -0
  452. package/skills/uv-package-manager/SKILL.md +37 -0
  453. package/skills/uv-package-manager/resources/implementation-playbook.md +830 -0
  454. package/skills/vector-database-engineer/SKILL.md +60 -0
  455. package/skills/vector-index-tuning/SKILL.md +42 -0
  456. package/skills/vector-index-tuning/resources/implementation-playbook.md +507 -0
  457. package/skills/wcag-audit-patterns/SKILL.md +41 -0
  458. package/skills/wcag-audit-patterns/resources/implementation-playbook.md +541 -0
  459. package/skills/web3-testing/SKILL.md +427 -0
  460. package/skills/workflow-orchestration-patterns/SKILL.md +333 -0
  461. package/skills/workflow-patterns/SKILL.md +38 -0
  462. package/skills/workflow-patterns/resources/implementation-playbook.md +621 -0
@@ -0,0 +1,1143 @@
1
+ # Error Analysis and Resolution Implementation Playbook
2
+
3
+ This file contains detailed patterns, checklists, and code samples referenced by the skill.
4
+
5
+ ## Error Detection and Classification
6
+
7
+ ### Error Taxonomy
8
+
9
+ Classify errors into these categories to inform your debugging strategy:
10
+
11
+ **By Severity:**
12
+ - **Critical**: System down, data loss, security breach, complete service unavailability
13
+ - **High**: Major feature broken, significant user impact, data corruption risk
14
+ - **Medium**: Partial feature degradation, workarounds available, performance issues
15
+ - **Low**: Minor bugs, cosmetic issues, edge cases with minimal impact
16
+
17
+ **By Type:**
18
+ - **Runtime Errors**: Exceptions, crashes, segmentation faults, null pointer dereferences
19
+ - **Logic Errors**: Incorrect behavior, wrong calculations, invalid state transitions
20
+ - **Integration Errors**: API failures, network timeouts, external service issues
21
+ - **Performance Errors**: Memory leaks, CPU spikes, slow queries, resource exhaustion
22
+ - **Configuration Errors**: Missing environment variables, invalid settings, version mismatches
23
+ - **Security Errors**: Authentication failures, authorization violations, injection attempts
24
+
25
+ **By Observability:**
26
+ - **Deterministic**: Consistently reproducible with known inputs
27
+ - **Intermittent**: Occurs sporadically, often timing or race condition related
28
+ - **Environmental**: Only happens in specific environments or configurations
29
+ - **Load-dependent**: Appears under high traffic or resource pressure
30
+
31
+ ### Error Detection Strategy
32
+
33
+ Implement multi-layered error detection:
34
+
35
+ 1. **Application-Level Instrumentation**: Use error tracking SDKs (Sentry, DataDog Error Tracking, Rollbar) to automatically capture unhandled exceptions with full context
36
+ 2. **Health Check Endpoints**: Monitor `/health` and `/ready` endpoints to detect service degradation before user impact
37
+ 3. **Synthetic Monitoring**: Run automated tests against production to catch issues proactively
38
+ 4. **Real User Monitoring (RUM)**: Track actual user experience and frontend errors
39
+ 5. **Log Pattern Analysis**: Use SIEM tools to identify error spikes and anomalous patterns
40
+ 6. **APM Thresholds**: Alert on error rate increases, latency spikes, or throughput drops
41
+
42
+ ### Error Aggregation and Pattern Recognition
43
+
44
+ Group related errors to identify systemic issues:
45
+
46
+ - **Fingerprinting**: Group errors by stack trace similarity, error type, and affected code path
47
+ - **Trend Analysis**: Track error frequency over time to detect regressions or emerging issues
48
+ - **Correlation Analysis**: Link errors to deployments, configuration changes, or external events
49
+ - **User Impact Scoring**: Prioritize based on number of affected users and sessions
50
+ - **Geographic/Temporal Patterns**: Identify region-specific or time-based error clusters
51
+
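Fingerprinting as described in the list above can be sketched as hashing the error type together with a normalized view of the top stack frames. This is a minimal sketch under stated assumptions: the `fingerprint` helper, the `(function, file)` frame shape, and the three-frame cutoff are all illustrative and do not reproduce any particular error tracker's grouping algorithm.

```python
import hashlib

def fingerprint(error_type, frames, max_frames=3):
    """Return a stable grouping key for an error.

    frames: list of (function_name, file_name) tuples, innermost first.
    Line numbers are deliberately left out so that unrelated edits to a
    file do not split one bug into many groups.
    """
    parts = [error_type]
    for function_name, file_name in frames[:max_frames]:
        parts.append(f"{file_name}:{function_name}")
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]

# Hypothetical frames, loosely modelled on the traces in this playbook
a = fingerprint("TimeoutException", [("processPayment", "PaymentClient.java")])
b = fingerprint("TimeoutException", [("processPayment", "PaymentClient.java")])
c = fingerprint("NullPointerException", [("findUser", "UserService.java")])
```

Two occurrences of the same failure share a key (`a == b`), while a different error type or code path gets its own group.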
52
+ ## Root Cause Analysis Techniques
53
+
54
+ ### Systematic Investigation Process
55
+
56
+ Follow this structured approach for each error:
57
+
58
+ 1. **Reproduce the Error**: Create minimal reproduction steps. If intermittent, identify triggering conditions
59
+ 2. **Isolate the Failure Point**: Narrow down the exact line of code or component where failure originates
60
+ 3. **Analyze the Call Chain**: Trace backwards from the error to understand how the system reached the failed state
61
+ 4. **Inspect Variable State**: Examine values at the point of failure and preceding steps
62
+ 5. **Review Recent Changes**: Check git history for recent modifications to affected code paths
63
+ 6. **Test Hypotheses**: Form theories about the cause and validate with targeted experiments
64
+
65
+ ### The Five Whys Technique
66
+
67
+ Ask "why" repeatedly to drill down to root causes:
68
+
69
+ ```
70
+ Error: Database connection timeout after 30s
71
+
72
+ Why? The database connection pool was exhausted
73
+ Why? All connections were held by long-running queries
74
+ Why? A new feature introduced N+1 query patterns
75
+ Why? The ORM lazy-loading wasn't properly configured
76
+ Why? Code review didn't catch the performance regression
77
+ ```
78
+
79
+ Root cause: Insufficient code review process for database query patterns.
80
+
81
+ ### Distributed Systems Debugging
82
+
83
+ For errors in microservices and distributed systems:
84
+
85
+ - **Trace the Request Path**: Use correlation IDs to follow requests across service boundaries
86
+ - **Check Service Dependencies**: Identify which upstream/downstream services are involved
87
+ - **Analyze Cascading Failures**: Determine if this is a symptom of a different service's failure
88
+ - **Review Circuit Breaker State**: Check if protective mechanisms are triggered
89
+ - **Examine Message Queues**: Look for backpressure, dead letters, or processing delays
90
+ - **Timeline Reconstruction**: Build a timeline of events across all services using distributed tracing
91
+
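Timeline reconstruction from the list above can be approximated by filtering log events on a correlation ID and sorting them by timestamp. A minimal sketch, assuming each event is a dict with ISO 8601 `timestamp`, `service`, and `correlation_id` keys (a simplified form of the log schema used later in this playbook); `build_timeline` is a hypothetical helper, and clock skew between hosts is ignored.

```python
from datetime import datetime

def build_timeline(events, correlation_id):
    """Return the events for one request, ordered by timestamp across services."""
    selected = [e for e in events if e["correlation_id"] == correlation_id]
    return sorted(selected, key=lambda e: datetime.fromisoformat(e["timestamp"]))

# Illustrative events from two services, deliberately out of order
events = [
    {"timestamp": "2025-10-11T14:23:45.300+00:00", "service": "payment-service",
     "correlation_id": "req-1", "message": "charge failed"},
    {"timestamp": "2025-10-11T14:23:45.100+00:00", "service": "api-gateway",
     "correlation_id": "req-1", "message": "request received"},
    {"timestamp": "2025-10-11T14:23:45.200+00:00", "service": "api-gateway",
     "correlation_id": "req-2", "message": "request received"},
]
timeline = build_timeline(events, "req-1")
```

In practice a distributed tracing backend does this with span parent/child links, which is more robust than wall-clock ordering.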
92
+ ## Stack Trace Analysis
93
+
94
+ ### Interpreting Stack Traces
95
+
96
+ Extract maximum information from stack traces:
97
+
98
+ **Key Elements:**
99
+ - **Error Type**: What kind of exception/error occurred
100
+ - **Error Message**: Contextual information about the failure
101
+ - **Origin Point**: The deepest frame where the error was thrown
102
+ - **Call Chain**: The sequence of function calls leading to the error
103
+ - **Framework vs Application Code**: Distinguish between library and your code
104
+ - **Async Boundaries**: Identify where asynchronous operations break the trace
105
+
106
+ **Analysis Strategy:**
107
+ 1. Start at the frame where the error was thrown (the top of the trace in Java and JavaScript, the bottom in Python)
108
+ 2. Identify the first frame in your application code (not framework/library)
109
+ 3. Examine that frame's context: input parameters, local variables, state
110
+ 4. Trace backwards through calling functions to understand how invalid state was created
111
+ 5. Look for patterns: is this in a loop? Inside a callback? After an async operation?
112
+
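Step 2 of the analysis strategy above (finding the first frame in your own code) can be automated. A minimal sketch using the standard `traceback` module; `first_application_frame` and the `app_marker` substring convention are assumptions made for illustration.

```python
import traceback

def first_application_frame(exc, app_marker):
    """Return the innermost frame whose filename contains app_marker.

    Python tracebacks list the most recent call last, so the frames are
    scanned in reverse to find the application frame closest to the error.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    for frame in reversed(frames):
        if app_marker in frame.filename:
            return frame
    return None

def find_user(users, key):
    return users[key]  # raises KeyError for an unknown key

try:
    find_user({}, "missing")
except KeyError as exc:
    # Use this snippet's own filename as the "application" marker
    marker = find_user.__code__.co_filename
    frame = first_application_frame(exc, marker)
    no_match = first_application_frame(exc, "no-such-path-segment")
```

In a real project the marker would typically be a package directory name, so that frames from installed libraries are skipped.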
113
+ ### Stack Trace Enrichment
114
+
115
+ Modern error tracking tools provide enhanced stack traces:
116
+
117
+ - **Source Code Context**: View surrounding lines of code for each frame
118
+ - **Local Variable Values**: Inspect variable state at each frame (for example, via the Sentry Python SDK's `include_local_variables` option)
119
+ - **Breadcrumbs**: See the sequence of events leading to the error
120
+ - **Release Tracking**: Link errors to specific deployments and commits
121
+ - **Source Maps**: For minified JavaScript, map back to original source
122
+ - **Inline Comments**: Annotate stack frames with contextual information
123
+
124
+ ### Common Stack Trace Patterns
125
+
126
+ **Pattern: Null Pointer Exception Deep in Framework Code**
127
+ ```
128
+ NullPointerException
129
+ at java.util.HashMap.hash(HashMap.java:339)
130
+ at java.util.HashMap.get(HashMap.java:556)
131
+ at com.myapp.service.UserService.findUser(UserService.java:45)
132
+ ```
133
+ Root Cause: The exception surfaces in library code, but the bad input comes from the application. Focus on UserService.java:45 and the key it passes to `HashMap.get`.
134
+
135
+ **Pattern: Timeout After Long Wait**
136
+ ```
137
+ TimeoutException: Operation timed out after 30000ms
138
+ at okhttp3.internal.http2.Http2Stream.waitForIo
139
+ at com.myapp.api.PaymentClient.processPayment(PaymentClient.java:89)
140
+ ```
141
+ Root Cause: External service slow/unresponsive. Need retry logic and circuit breaker.
142
+
143
+ **Pattern: Race Condition in Concurrent Code**
144
+ ```
145
+ ConcurrentModificationException
146
+ at java.util.ArrayList$Itr.checkForComodification
147
+ at com.myapp.processor.BatchProcessor.process(BatchProcessor.java:112)
148
+ ```
149
+ Root Cause: Collection modified while being iterated. Need thread-safe data structures or synchronization.
150
+
151
+ ## Log Aggregation and Pattern Matching
152
+
153
+ ### Structured Logging Implementation
154
+
155
+ Implement JSON-based structured logging for machine-readable logs:
156
+
157
+ **Standard Log Schema:**
158
+ ```json
159
+ {
160
+ "timestamp": "2025-10-11T14:23:45.123Z",
161
+ "level": "ERROR",
162
+ "correlation_id": "req-7f3b2a1c-4d5e-6f7g-8h9i-0j1k2l3m4n5o",
163
+ "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
164
+ "span_id": "00f067aa0ba902b7",
165
+ "service": "payment-service",
166
+ "environment": "production",
167
+ "host": "pod-payment-7d4f8b9c-xk2l9",
168
+ "version": "v2.3.1",
169
+ "error": {
170
+ "type": "PaymentProcessingException",
171
+ "message": "Failed to charge card: Insufficient funds",
172
+ "stack_trace": "...",
173
+ "fingerprint": "payment-insufficient-funds"
174
+ },
175
+ "user": {
176
+ "id": "user-12345",
177
+ "ip": "203.0.113.42",
178
+ "session_id": "sess-abc123"
179
+ },
180
+ "request": {
181
+ "method": "POST",
182
+ "path": "/api/v1/payments/charge",
183
+ "duration_ms": 2547,
184
+ "status_code": 402
185
+ },
186
+ "context": {
187
+ "payment_method": "credit_card",
188
+ "amount": 149.99,
189
+ "currency": "USD",
190
+ "merchant_id": "merchant-789"
191
+ }
192
+ }
193
+ ```
194
+
195
+ **Key Fields to Always Include:**
196
+ - `timestamp`: ISO 8601 format in UTC
197
+ - `level`: ERROR, WARN, INFO, DEBUG, TRACE
198
+ - `correlation_id`: Unique ID for the entire request chain
199
+ - `trace_id` and `span_id`: OpenTelemetry identifiers for distributed tracing
200
+ - `service`: Which microservice generated this log
201
+ - `environment`: dev, staging, production
202
+ - `error.fingerprint`: Stable identifier for grouping similar errors
203
+
204
+ ### Correlation ID Pattern
205
+
206
+ Implement correlation IDs to track requests across distributed systems:
207
+
208
+ **Node.js/Express Middleware:**
209
+ ```javascript
210
+ const { v4: uuidv4 } = require('uuid');
211
+ const { AsyncLocalStorage } = require('node:async_hooks');
212
+ const asyncLocalStorage = new AsyncLocalStorage(); // built into Node.js
213
+ // Middleware to generate/propagate correlation ID
214
+ function correlationIdMiddleware(req, res, next) {
215
+   const correlationId = req.headers['x-correlation-id'] || uuidv4();
216
+   req.correlationId = correlationId;
217
+   res.setHeader('x-correlation-id', correlationId);
218
+
219
+   // Store in async context for access in nested calls
220
+   asyncLocalStorage.run(new Map(), () => {
221
+     asyncLocalStorage.getStore().set('correlationId', correlationId);
222
+     next();
223
+   });
224
+ }
225
+
226
+ // Propagate to downstream services (assumes axios is imported)
227
+ function makeApiCall(url, data) {
228
+   const correlationId = asyncLocalStorage.getStore()?.get('correlationId');
229
+   return axios.post(url, data, {
230
+     headers: {
231
+       'x-correlation-id': correlationId,
232
+       'x-source-service': 'api-gateway'
233
+     }
234
+   });
235
+ }
236
+
237
+ // Include in all log statements
238
+ function log(level, message, context = {}) {
239
+   const correlationId = asyncLocalStorage.getStore()?.get('correlationId');
240
+   console.log(JSON.stringify({
241
+     timestamp: new Date().toISOString(),
242
+     level,
243
+     correlation_id: correlationId,
244
+     message,
245
+     ...context
246
+   }));
247
+ }
248
+ ```
249
+
250
+ **Python/Flask Implementation:**
251
+ ```python
252
+ import json
253
+ import logging
254
+ import uuid
255
+ from datetime import datetime, timezone
256
+ from flask import g, request  # `app` below is your existing Flask instance
257
+ class CorrelationIdFilter(logging.Filter):
258
+     def filter(self, record):
259
+         record.correlation_id = g.get('correlation_id', 'N/A')
260
+         return True
261
+
262
+ @app.before_request
263
+ def setup_correlation_id():
264
+     correlation_id = request.headers.get('X-Correlation-ID', str(uuid.uuid4()))
265
+     g.correlation_id = correlation_id
266
+
267
+ @app.after_request
268
+ def add_correlation_header(response):
269
+     response.headers['X-Correlation-ID'] = g.correlation_id
270
+     return response
271
+
272
+ # Structured logging with correlation ID
273
+ logging.basicConfig(
274
+     format='%(message)s',
275
+     level=logging.INFO
276
+ )
277
+ logger = logging.getLogger(__name__)
278
+ logger.addFilter(CorrelationIdFilter())
279
+
280
+ def log_structured(level, message, **context):
281
+     log_entry = {
282
+         'timestamp': datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z'),
283
+         'level': level,
284
+         'correlation_id': g.correlation_id,
285
+         'service': 'payment-service',
286
+         'message': message,
287
+         **context
288
+     }
289
+     logger.log(getattr(logging, level), json.dumps(log_entry))
290
+ ```
291
+
292
+ ### Log Aggregation Architecture
293
+
294
+ **Centralized Logging Pipeline:**
295
+ 1. **Application**: Outputs structured JSON logs to stdout/stderr
296
+ 2. **Log Shipper**: Fluentd/Fluent Bit/Vector collects logs from containers
297
+ 3. **Log Aggregator**: Elasticsearch/Loki/DataDog receives and indexes logs
298
+ 4. **Visualization**: Kibana/Grafana/DataDog UI for querying and dashboards
299
+ 5. **Alerting**: Trigger alerts on error patterns and thresholds
300
+
301
+ **Log Query Examples (Elasticsearch DSL):**
302
+ ```json
303
+ // Find all errors for a specific correlation ID
304
+ {
305
+ "query": {
306
+ "bool": {
307
+ "must": [
308
+ { "match": { "correlation_id": "req-7f3b2a1c-4d5e-6f7g" }},
309
+ { "term": { "level": "ERROR" }}
310
+ ]
311
+ }
312
+ },
313
+ "sort": [{ "timestamp": "asc" }]
314
+ }
315
+
316
+ // Find error rate spike in last hour
317
+ {
318
+ "query": {
319
+ "bool": {
320
+ "must": [
321
+ { "term": { "level": "ERROR" }},
322
+ { "range": { "timestamp": { "gte": "now-1h" }}}
323
+ ]
324
+ }
325
+ },
326
+ "aggs": {
327
+ "errors_per_minute": {
328
+ "date_histogram": {
329
+ "field": "timestamp",
330
+ "fixed_interval": "1m"
331
+ }
332
+ }
333
+ }
334
+ }
335
+
336
+ // Group errors by fingerprint to find most common issues
337
+ {
338
+ "query": {
339
+ "term": { "level": "ERROR" }
340
+ },
341
+ "aggs": {
342
+ "error_types": {
343
+ "terms": {
344
+ "field": "error.fingerprint",
345
+ "size": 10
346
+ },
347
+ "aggs": {
348
+ "affected_users": {
349
+ "cardinality": { "field": "user.id" }
350
+ }
351
+ }
352
+ }
353
+ }
354
+ }
355
+ ```
356
+
357
+ ### Pattern Detection and Anomaly Recognition
358
+
359
+ Use log analysis to identify patterns:
360
+
361
+ - **Error Rate Spikes**: Compare current error rate to historical baseline (e.g., >3 standard deviations)
362
+ - **New Error Types**: Alert when previously unseen error fingerprints appear
363
+ - **Cascading Failures**: Detect when errors in one service trigger errors in dependent services
364
+ - **User Impact Patterns**: Identify which users/segments are disproportionately affected
365
+ - **Geographic Patterns**: Spot region-specific issues (e.g., CDN problems, data center outages)
366
+ - **Temporal Patterns**: Find time-based issues (e.g., batch jobs, scheduled tasks, time zone bugs)
367
+
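The "more than 3 standard deviations" rule for error rate spikes can be sketched as a simple z-score check. This is an illustrative sketch only: `is_error_spike` is a hypothetical helper, the baseline is assumed to be a list of per-minute error counts, and real traffic often needs seasonality-aware baselines.

```python
from statistics import mean, stdev

def is_error_spike(baseline_counts, current_count, threshold=3.0):
    """Flag current_count if it exceeds the baseline mean by more than
    `threshold` sample standard deviations."""
    if len(baseline_counts) < 2:
        return False  # not enough history to estimate variance
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        # Flat baseline: any increase at all is treated as a spike
        return current_count > mu
    return (current_count - mu) / sigma > threshold

# Illustrative per-minute error counts for the baseline window
history = [4, 6, 5, 7, 5, 6, 4, 5]
spike = is_error_spike(history, 20)
normal = is_error_spike(history, 6)
```

With this baseline, 20 errors in a minute is flagged while 6 is within normal variation.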
368
+ ## Debugging Workflow
369
+
370
+ ### Interactive Debugging
371
+
372
+ For deterministic errors in development:
373
+
374
+ **Debugger Setup:**
375
+ 1. Set breakpoint before the error occurs
376
+ 2. Step through code execution line by line
377
+ 3. Inspect variable values and object state
378
+ 4. Evaluate expressions in the debug console
379
+ 5. Watch for unexpected state changes
380
+ 6. Modify variables to test hypotheses
381
+
382
+ **Modern Debugging Tools:**
383
+ - **VS Code Debugger**: Integrated debugging for JavaScript, Python, Go, Java, C++
384
+ - **Chrome DevTools**: Frontend debugging with network, performance, and memory profiling
385
+ - **pdb/ipdb (Python)**: Interactive debugger with post-mortem analysis
386
+ - **dlv (Go)**: Delve debugger for Go programs
387
+ - **lldb/gdb (C/C++)**: Low-level debuggers for native code; GDB also supports reverse debugging through process record
388
+
389
+ ### Production Debugging
390
+
391
+ For errors in production environments where debuggers aren't available:
392
+
393
+ **Safe Production Debugging Techniques:**
394
+
395
+ 1. **Enhanced Logging**: Add strategic log statements around suspected failure points
396
+ 2. **Feature Flags**: Enable verbose logging for specific users/requests
397
+ 3. **Sampling**: Log detailed context for a percentage of requests
398
+ 4. **APM Transaction Traces**: Use DataDog APM or New Relic to see detailed transaction flows
399
+ 5. **Distributed Tracing**: Leverage OpenTelemetry traces to understand cross-service interactions
400
+ 6. **Profiling**: Use continuous profilers (DataDog Profiler, Pyroscope) to identify hot spots
401
+ 7. **Heap Dumps**: Capture memory snapshots for analysis of memory leaks
402
+ 8. **Traffic Mirroring**: Replay production traffic in staging for safe investigation
403
+
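The sampling technique in the list above is often implemented deterministically, so that every service makes the same decision for a given request. A minimal sketch, assuming a correlation ID is available on each request; `should_log_verbose` is a hypothetical helper rather than part of any logging library.

```python
import hashlib

def should_log_verbose(correlation_id, sample_rate):
    """Deterministically pick a fraction of requests for detailed logging.

    Hashing the correlation ID (rather than calling random()) means every
    service that sees the same ID makes the same decision, so a sampled
    request is logged in detail end to end.
    """
    if sample_rate <= 0:
        return False
    if sample_rate >= 1:
        return True
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

# Roughly 10% of synthetic request IDs should be selected
sampled = sum(should_log_verbose(f"req-{i}", 0.1) for i in range(10000))
```

The same check can sit behind a feature flag so the sample rate can be raised temporarily during an investigation.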
404
+ **Remote Debugging (Use Cautiously):**
405
+ - Attach debugger to running process only in non-critical services
406
+ - Use non-suspending breakpoints (logpoints or tracepoints) that record state without pausing execution
407
+ - Time-box debugging sessions strictly
408
+ - Always have rollback plan ready
409
+
410
+ ### Memory and Performance Debugging
411
+
412
+ **Memory Leak Detection:**
413
+ ```javascript
414
+ // Node.js heap snapshot comparison
415
+ const v8 = require('v8');
416
+ const fs = require('fs');
417
+
418
+ function takeHeapSnapshot(filename) {
419
+ const snapshot = v8.writeHeapSnapshot(filename);
420
+ console.log(`Heap snapshot written to ${snapshot}`);
421
+ }
422
+
423
+ // Take snapshots at intervals
424
+ takeHeapSnapshot('heap-before.heapsnapshot');
425
+ // ... run operations that might leak ...
426
+ takeHeapSnapshot('heap-after.heapsnapshot');
427
+
428
+ // Analyze in Chrome DevTools Memory profiler
429
+ // Look for objects with increasing retained size
430
+ ```
431
+
432
+ **Performance Profiling:**
433
+ ```python
434
+ # Python profiling with cProfile
435
+ import cProfile
436
+ import pstats
437
+ from pstats import SortKey
438
+
439
+ def profile_function():
440
+     profiler = cProfile.Profile()
441
+     profiler.enable()
442
+
443
+     # Your code here
444
+     process_large_dataset()
445
+
446
+     profiler.disable()
447
+
448
+     stats = pstats.Stats(profiler)
449
+     stats.sort_stats(SortKey.CUMULATIVE)
450
+     stats.print_stats(20)  # Top 20 time-consuming functions
451
+ ```
452
+
453
+ ## Error Prevention Strategies
454
+
455
+ ### Input Validation and Type Safety
456
+
457
+ **Defensive Programming:**
458
+ ```typescript
459
+ // TypeScript: Leverage type system for compile-time safety
460
+ interface PaymentRequest {
461
+ amount: number;
462
+ currency: string;
463
+ customerId: string;
464
+ paymentMethodId: string;
465
+ }
466
+
467
+ function processPayment(request: PaymentRequest): PaymentResult {
468
+ // Runtime validation for external inputs
469
+ if (request.amount <= 0) {
470
+ throw new ValidationError('Amount must be positive');
471
+ }
472
+
473
+ if (!['USD', 'EUR', 'GBP'].includes(request.currency)) {
474
+ throw new ValidationError('Unsupported currency');
475
+ }
476
+
477
+ // Use Zod or Yup for complex validation
478
+ const schema = z.object({
479
+ amount: z.number().positive().max(1000000),
480
+ currency: z.enum(['USD', 'EUR', 'GBP']),
481
+ customerId: z.string().uuid(),
482
+ paymentMethodId: z.string().min(1)
483
+ });
484
+
485
+ const validated = schema.parse(request);
486
+
487
+ // Now safe to process
488
+ return chargeCustomer(validated);
489
+ }
490
+ ```
491
+
492
+ **Python Type Hints and Validation:**
493
+ ```python
494
+ from decimal import Decimal
495
+
496
+ from pydantic import BaseModel, Field, validator  # v1-style API; Pydantic v2 prefers field_validator
497
+
498
+ class PaymentRequest(BaseModel):
499
+     amount: Decimal = Field(..., gt=0, le=1000000)
500
+     currency: str
501
+     customer_id: str
502
+     payment_method_id: str
503
+
504
+     @validator('currency')
505
+     def validate_currency(cls, v):
506
+         if v not in ['USD', 'EUR', 'GBP']:
507
+             raise ValueError('Unsupported currency')
508
+         return v
509
+
510
+     @validator('customer_id', 'payment_method_id')
511
+     def validate_ids(cls, v):
512
+         if not v:
513
+             raise ValueError('ID cannot be empty')
514
+         return v
515
+
516
+ def process_payment(request: PaymentRequest) -> PaymentResult:
517
+     # Pydantic validates automatically on instantiation
518
+     # Type hints provide IDE support and static analysis
519
+     return charge_customer(request)
520
+ ```
521
+
522
+ ### Error Boundaries and Graceful Degradation
523
+
524
+ **React Error Boundaries:**
525
+ ```typescript
526
+ import React, { Component, ErrorInfo, ReactNode } from 'react';
527
+ import * as Sentry from '@sentry/react';
528
+
529
+ interface Props {
530
+ children: ReactNode;
531
+ fallback?: ReactNode;
532
+ }
533
+
534
+ interface State {
535
+ hasError: boolean;
536
+ error?: Error;
537
+ }
538
+
539
+ class ErrorBoundary extends Component<Props, State> {
540
+ public state: State = {
541
+ hasError: false
542
+ };
543
+
544
+ public static getDerivedStateFromError(error: Error): State {
545
+ return { hasError: true, error };
546
+ }
547
+
548
+ public componentDidCatch(error: Error, errorInfo: ErrorInfo) {
549
+ // Log to error tracking service
550
+ Sentry.captureException(error, {
551
+ contexts: {
552
+ react: {
553
+ componentStack: errorInfo.componentStack
554
+ }
555
+ }
556
+ });
557
+
558
+ console.error('Uncaught error:', error, errorInfo);
559
+ }
560
+
561
+ public render() {
562
+ if (this.state.hasError) {
563
+ return this.props.fallback || (
564
+ <div role="alert">
565
+ <h2>Something went wrong</h2>
566
+ <details>
567
+ <summary>Error details</summary>
568
+ <pre>{this.state.error?.message}</pre>
569
+ </details>
570
+ </div>
571
+ );
572
+ }
573
+
574
+ return this.props.children;
575
+ }
576
+ }
577
+
578
+ export default ErrorBoundary;
579
+ ```
580
+
581
+ **Circuit Breaker Pattern:**
582
+ ```python
583
+ from datetime import datetime, timedelta
584
+ from enum import Enum
585
+
586
+ class CircuitBreakerOpenError(Exception): pass  # raised while the circuit is open
587
+ class CircuitState(Enum):
588
+     CLOSED = "closed"        # Normal operation
589
+     OPEN = "open"            # Failing, reject requests
590
+     HALF_OPEN = "half_open"  # Testing if service recovered
591
+
592
+ class CircuitBreaker:
593
+     def __init__(self, failure_threshold=5, timeout=60, success_threshold=2):
594
+         self.failure_threshold = failure_threshold
595
+         self.timeout = timeout
596
+         self.success_threshold = success_threshold
597
+         self.failure_count = 0
598
+         self.success_count = 0
599
+         self.last_failure_time = None
600
+         self.state = CircuitState.CLOSED
601
+
602
+     def call(self, func, *args, **kwargs):
603
+         if self.state == CircuitState.OPEN:
604
+             if self._should_attempt_reset():
605
+                 self.state = CircuitState.HALF_OPEN
606
+             else:
607
+                 raise CircuitBreakerOpenError("Circuit breaker is OPEN")
608
+
609
+         try:
610
+             result = func(*args, **kwargs)
611
+             self._on_success()
612
+             return result
613
+         except Exception:
614
+             self._on_failure()
615
+             raise
616
+
617
+     def _on_success(self):
618
+         self.failure_count = 0
619
+         if self.state == CircuitState.HALF_OPEN:
620
+             self.success_count += 1
621
+             if self.success_count >= self.success_threshold:
622
+                 self.state = CircuitState.CLOSED
623
+                 self.success_count = 0
624
+
625
+     def _on_failure(self):
626
+         self.failure_count += 1
627
+         self.last_failure_time = datetime.now()
628
+         if self.state == CircuitState.HALF_OPEN or self.failure_count >= self.failure_threshold:
629
+             self.state, self.success_count = CircuitState.OPEN, 0  # a failed probe re-opens immediately
630
+
631
+     def _should_attempt_reset(self):
632
+         return (datetime.now() - self.last_failure_time) > timedelta(seconds=self.timeout)
633
+
634
+ # Usage
635
+ payment_circuit = CircuitBreaker(failure_threshold=5, timeout=60)
636
+
637
+ def process_payment_with_circuit_breaker(payment_data):
638
+     try:
639
+         result = payment_circuit.call(external_payment_api.charge, payment_data)
640
+         return result
641
+     except CircuitBreakerOpenError:
642
+         # Graceful degradation: queue for later processing
643
+         payment_queue.enqueue(payment_data)
644
+         return {"status": "queued", "message": "Payment will be processed shortly"}
645
+ ```
646
+
647
+ ### Retry Logic with Exponential Backoff
648
+
649
+ ```typescript
650
+ // TypeScript retry implementation
651
+ interface RetryOptions {
652
+ maxAttempts: number;
653
+ baseDelayMs: number;
654
+ maxDelayMs: number;
655
+ exponentialBase: number;
656
+ retryableErrors?: string[];
657
+ }
658
+
659
+ async function retryWithBackoff<T>(
660
+ fn: () => Promise<T>,
661
+ options: RetryOptions = {
662
+ maxAttempts: 3,
663
+ baseDelayMs: 1000,
664
+ maxDelayMs: 30000,
665
+ exponentialBase: 2
666
+ }
667
+ ): Promise<T> {
668
+ let lastError: Error;
669
+
670
+ for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
671
+ try {
672
+ return await fn();
673
+ } catch (error) {
674
+ lastError = error as Error;
675
+
676
+ // Check if error is retryable
677
+ if (options.retryableErrors &&
678
+ !options.retryableErrors.includes(lastError.name)) {
679
+ throw error; // Don't retry non-retryable errors
680
+ }
681
+
682
+ if (attempt < options.maxAttempts - 1) {
683
+ const delay = Math.min(
684
+ options.baseDelayMs * Math.pow(options.exponentialBase, attempt),
685
+ options.maxDelayMs
686
+ );
687
+
688
+ // Add jitter to prevent thundering herd
689
+ const jitter = Math.random() * 0.1 * delay;
690
+ const actualDelay = delay + jitter;
691
+
692
+ console.log(`Attempt ${attempt + 1} failed, retrying in ${actualDelay}ms`);
693
+ await new Promise(resolve => setTimeout(resolve, actualDelay));
694
+ }
695
+ }
696
+ }
697
+
698
+ throw lastError!;
699
+ }
700
+
701
+ // Usage
702
+ const result = await retryWithBackoff(
703
+ () => fetch('https://api.example.com/data'),
704
+ {
705
+ maxAttempts: 3,
706
+ baseDelayMs: 1000,
707
+ maxDelayMs: 10000,
708
+ exponentialBase: 2,
709
+ retryableErrors: ['NetworkError', 'TimeoutError']
710
+ }
711
+ );
712
+ ```
713
+
714
+ ## Monitoring and Alerting Integration
715
+
716
+ ### Modern Observability Stack (2025)
717
+
718
+ **Recommended Architecture:**
719
+ - **Metrics**: Prometheus + Grafana or DataDog
720
+ - **Logs**: Elasticsearch/Loki + Fluentd or DataDog Logs
721
+ - **Traces**: OpenTelemetry + Jaeger/Tempo or DataDog APM
722
+ - **Errors**: Sentry or DataDog Error Tracking
723
+ - **Frontend**: Sentry Browser SDK or DataDog RUM
724
+ - **Synthetics**: DataDog Synthetics or Checkly
725
+
726
+ ### Sentry Integration
727
+
728
+ **Node.js/Express Setup:**
729
+ ```javascript
730
+ const Sentry = require('@sentry/node'); // v7-style API; SDK v8 replaced Sentry.Handlers and Sentry.Integrations
731
+ const { ProfilingIntegration } = require('@sentry/profiling-node');
732
+
733
+ Sentry.init({
734
+ dsn: process.env.SENTRY_DSN,
735
+ environment: process.env.NODE_ENV,
736
+ release: process.env.GIT_COMMIT_SHA,
737
+
738
+ // Performance monitoring
739
+ tracesSampleRate: 0.1, // 10% of transactions
740
+ profilesSampleRate: 0.1,
741
+
742
+ integrations: [
743
+ new ProfilingIntegration(),
744
+ new Sentry.Integrations.Http({ tracing: true }),
745
+ new Sentry.Integrations.Express({ app }),
746
+ ],
747
+
748
+ beforeSend(event, hint) {
749
+ // Scrub sensitive data
750
+ if (event.request) {
751
+ delete event.request.cookies;
752
+ delete event.request.headers?.authorization;
753
+ }
754
+
755
+ // Add custom context
756
+ event.tags = {
757
+ ...event.tags,
758
+ region: process.env.AWS_REGION,
759
+ instance_id: process.env.INSTANCE_ID
760
+ };
761
+
762
+ return event;
763
+ }
764
+ });
765
+
766
+ // Express middleware
767
+ app.use(Sentry.Handlers.requestHandler());
768
+ app.use(Sentry.Handlers.tracingHandler());
769
+
770
+ // Routes here...
771
+
772
+ // Error handler (must be last)
773
+ app.use(Sentry.Handlers.errorHandler());
774
+ // Manual error capture with context
775
+ function processOrder(orderId) {
776
+ let order; // declared outside try so the catch block can read it
777
+ try {
778
+ order = getOrder(orderId);
779
+ chargeCustomer(order);
780
+ } catch (error) {
781
+ Sentry.captureException(error, {
782
+ tags: {
783
+ operation: 'process_order',
784
+ order_id: orderId
785
+ },
786
+ contexts: {
787
+ order: {
788
+ id: orderId,
789
+ status: order?.status,
790
+ amount: order?.amount
791
+ }
792
+ },
793
+ user: {
794
+ id: order?.customerId
795
+ }
796
+ });
797
+ throw error;
798
+ }
799
+ }
800
+ ```
801
+
802
+ ### DataDog APM Integration
803
+
804
+ **Python/Flask Setup:**
805
+ ```python
806
+ from ddtrace import patch_all, tracer
807
+ from flask import Flask, jsonify, request
808
+ import logging
809
+
810
+ # Auto-instrument common libraries
811
+ patch_all()
812
+
813
+ app = Flask(__name__)
814
+
815
+ # patch_all() instruments Flask; the legacy TraceMiddleware is no longer needed.
816
+ # Set the service name with the DD_SERVICE environment variable.
817
+
818
+ # Custom span for detailed tracing
819
+ @app.route('/api/v1/payments/charge', methods=['POST'])
820
+ def charge_payment():
821
+     with tracer.trace('payment.charge', service='payment-service') as span:
822
+         payment_data = request.json
823
+
824
+         # Add custom tags
825
+         span.set_tag('payment.amount', payment_data['amount'])
826
+         span.set_tag('payment.currency', payment_data['currency'])
827
+         span.set_tag('customer.id', payment_data['customer_id'])
828
+
829
+         try:
830
+             result = payment_processor.charge(payment_data)
831
+             span.set_tag('payment.status', 'success')
832
+             return jsonify(result), 200
833
+         except InsufficientFundsError:
834
+             span.set_tag('payment.status', 'insufficient_funds')
835
+             span.set_tag('error', True)
836
+             return jsonify({'error': 'Insufficient funds'}), 402
837
+         except Exception as e:
838
+             span.set_tag('payment.status', 'error')
839
+             span.set_tag('error', True)
840
+             span.set_tag('error.message', str(e))
841
+             raise
842
+ ```
843
+
844
+ ### OpenTelemetry Implementation
845
+
846
+ **Go Service with OpenTelemetry:**
847
+ ```go
848
+ package main
849
+ import (
850
+ "context"
851
+ "fmt"
852
+ "go.opentelemetry.io/otel"
853
+ "go.opentelemetry.io/otel/attribute"
854
+ "go.opentelemetry.io/otel/codes"
855
+ "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
856
+ "go.opentelemetry.io/otel/sdk/resource"
857
+ sdktrace "go.opentelemetry.io/otel/sdk/trace"
858
+ semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
859
+ )
860
+ func initTracer() (*sdktrace.TracerProvider, error) {
861
+ exporter, err := otlptracegrpc.New(
862
+ context.Background(),
863
+ otlptracegrpc.WithEndpoint("otel-collector:4317"),
864
+ otlptracegrpc.WithInsecure(),
865
+ )
866
+ if err != nil {
867
+ return nil, err
868
+ }
869
+
870
+ tp := sdktrace.NewTracerProvider(
871
+ sdktrace.WithBatcher(exporter),
872
+ sdktrace.WithResource(resource.NewWithAttributes(
873
+ semconv.SchemaURL,
874
+ semconv.ServiceNameKey.String("payment-service"),
875
+ semconv.ServiceVersionKey.String("v2.3.1"),
876
+ attribute.String("environment", "production"),
877
+ )),
878
+ )
879
+
880
+ otel.SetTracerProvider(tp)
881
+ return tp, nil
882
+ }
883
+
884
+ func processPayment(ctx context.Context, paymentReq PaymentRequest) error {
885
+ tracer := otel.Tracer("payment-service")
886
+ ctx, span := tracer.Start(ctx, "processPayment")
887
+ defer span.End()
888
+
889
+ // Add attributes
890
+ span.SetAttributes(
891
+ attribute.Float64("payment.amount", paymentReq.Amount),
892
+ attribute.String("payment.currency", paymentReq.Currency),
893
+ attribute.String("customer.id", paymentReq.CustomerID),
894
+ )
895
+
896
+ // Call downstream service
897
+ err := chargeCard(ctx, paymentReq)
898
+ if err != nil {
899
+ span.RecordError(err)
900
+ span.SetStatus(codes.Error, err.Error())
901
+ return err
902
+ }
903
+
904
+ span.SetStatus(codes.Ok, "Payment processed successfully")
905
+ return nil
906
+ }
907
+
908
+ func chargeCard(ctx context.Context, paymentReq PaymentRequest) error {
909
+ tracer := otel.Tracer("payment-service")
910
+ ctx, span := tracer.Start(ctx, "chargeCard")
911
+ defer span.End()
912
+
913
+ // Simulate external API call
914
+ result, err := paymentGateway.Charge(ctx, paymentReq)
915
+ if err != nil {
916
+ return fmt.Errorf("payment gateway error: %w", err)
917
+ }
918
+
919
+ span.SetAttributes(
920
+ attribute.String("transaction.id", result.TransactionID),
921
+ attribute.String("gateway.response_code", result.ResponseCode),
922
+ )
923
+
924
+ return nil
925
+ }
926
+ ```
+
+ ### Alert Configuration
+
+ **Intelligent Alerting Strategy:**
+
+ ```yaml
+ # DataDog Monitor Configuration
+ monitors:
+   - name: "High Error Rate - Payment Service"
+     type: metric
+     query: "avg(last_5m):sum:trace.express.request.errors{service:payment-service} / sum:trace.express.request.hits{service:payment-service} > 0.05"
+     message: |
+       Payment service error rate is {{value}} as a fraction of requests (threshold: 0.05, i.e. 5%)
+
+       This may indicate:
+       - Payment gateway issues
+       - Database connectivity problems
+       - Invalid payment data
+
+       Runbook: https://wiki.company.com/runbooks/payment-errors
+
+       @slack-payments-oncall @pagerduty-payments
+
+     tags:
+       - service:payment-service
+       - severity:high
+
+     options:
+       notify_no_data: true
+       no_data_timeframe: 10
+       # escalation_message is only sent on re-notification, so set an interval (minutes)
+       renotify_interval: 10
+       escalation_message: "Error rate still elevated after 10 minutes"
+
+   - name: "New Error Type Detected"
+     type: log
+     query: "logs(\"level:ERROR service:payment-service\").rollup(\"count\").by(\"error.fingerprint\").last(\"5m\") > 0"
+     message: |
+       New error type detected in payment service: {{error.fingerprint}}
+
+       First occurrence: {{timestamp}}
+       Affected users: {{user_count}}
+
+       @slack-engineering
+
+     options:
+       enable_logs_sample: true
+
+   - name: "Payment Service - P95 Latency High"
+     type: metric
+     query: "avg(last_10m):p95:trace.express.request.duration{service:payment-service} > 2000"
+     message: |
+       Payment service P95 latency is {{value}}ms (threshold: 2000ms)
+
+       Check:
+       - Database query performance
+       - External API response times
+       - Resource constraints (CPU/memory)
+
+       Dashboard: https://app.datadoghq.com/dashboard/payment-service
+
+       @slack-payments-team
+ ```
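
The first monitor compares a ratio (errors divided by hits) against 0.05, so the value it reports is a fraction rather than a percentage. The following is a minimal sketch of that comparison in Go; the function names and the sample counts are hypothetical and are not part of any DataDog API.

```go
package main

import "fmt"

// errorRate returns errors/hits as a fraction in [0, 1].
// It returns 0 when there were no hits, to avoid dividing by zero.
func errorRate(errors, hits float64) float64 {
	if hits == 0 {
		return 0
	}
	return errors / hits
}

// breaches reports whether the rate is strictly above the threshold,
// mirroring the "> 0.05" comparison in the monitor query.
func breaches(rate, threshold float64) bool {
	return rate > threshold
}

func main() {
	// Hypothetical counts for one 5-minute evaluation window.
	rate := errorRate(30, 400)
	fmt.Printf("rate=%.3f (%.1f%%) alert=%v\n", rate, rate*100, breaches(rate, 0.05))
}
```

Because the comparison is strict, a window sitting exactly at 5% does not alert; whether that is the desired behavior is a policy choice for the team owning the monitor.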
+
+ ## Production Incident Response
+
+ ### Incident Response Workflow
+
+ **Phase 1: Detection and Triage (0-5 minutes)**
+ 1. Acknowledge the alert/incident
+ 2. Check incident severity and user impact
+ 3. Assign incident commander
+ 4. Create incident channel (#incident-2025-10-11-payment-errors)
+ 5. Update status page if customer-facing
+
+ **Phase 2: Investigation (5-30 minutes)**
+ 1. Gather observability data:
+    - Error rates from Sentry/DataDog
+    - Traces showing failed requests
+    - Logs around the incident start time
+    - Metrics showing resource usage, latency, throughput
+ 2. Correlate with recent changes:
+    - Recent deployments (check CI/CD pipeline)
+    - Configuration changes
+    - Infrastructure changes
+    - External dependencies status
+ 3. Form initial hypothesis about root cause
+ 4. Document findings in incident log
+
+ **Phase 3: Mitigation (Immediate)**
+ 1. Implement immediate fix based on hypothesis:
+    - Rollback recent deployment
+    - Scale up resources
+    - Disable problematic feature (feature flag)
+    - Failover to backup system
+    - Apply hotfix
+ 2. Verify mitigation worked (error rate decreases)
+ 3. Monitor for 15-30 minutes to ensure stability
+
+ **Phase 4: Recovery and Validation**
+ 1. Verify all systems operational
+ 2. Check data consistency
+ 3. Process queued/failed requests
+ 4. Update status page: incident resolved
+ 5. Notify stakeholders
+
+ **Phase 5: Post-Incident Review**
+ 1. Schedule postmortem within 48 hours
+ 2. Create detailed timeline of events
+ 3. Identify root cause (may differ from initial hypothesis)
+ 4. Document contributing factors
+ 5. Create action items for:
+    - Preventing similar incidents
+    - Improving detection time
+    - Improving mitigation time
+    - Improving communication
+
+ ### Incident Investigation Tools
+
+ **Query Patterns for Common Incidents:**
+
+ ```
+ # Find all errors for a specific time window (Elasticsearch)
+ GET /logs-*/_search
+ {
+   "query": {
+     "bool": {
+       "must": [
+         { "term": { "level": "ERROR" }},
+         { "term": { "service": "payment-service" }},
+         { "range": { "timestamp": {
+           "gte": "2025-10-11T14:00:00Z",
+           "lte": "2025-10-11T14:30:00Z"
+         }}}
+       ]
+     }
+   },
+   "sort": [{ "timestamp": "asc" }],
+   "size": 1000
+ }
+
+ # Find correlation between errors and deployments (DataDog)
+ # Use deployment tracking to overlay deployment markers on error graphs
+ # Query: sum:trace.express.request.errors{service:payment-service} by {version}
+
+ # Identify affected users (Sentry)
+ # Navigate to issue → User Impact tab
+ # Shows: total users affected, new vs returning, geographic distribution
+
+ # Trace specific failed request (OpenTelemetry/Jaeger)
+ # Search by trace_id or correlation_id
+ # Visualize full request path across services
+ # Identify which service/span failed
+ ```
+
+ ### Communication Templates
+
+ **Initial Incident Notification:**
+ ```
+ 🚨 INCIDENT: Payment Processing Errors
+
+ Severity: High
+ Status: Investigating
+ Started: 2025-10-11 14:23 UTC
+ Incident Commander: @jane.smith
+
+ Symptoms:
+ - Payment processing error rate: 15% (normal: <1%)
+ - Affected users: ~500 in last 10 minutes
+ - Error: "Database connection timeout"
+
+ Actions Taken:
+ - Investigating database connection pool
+ - Checking recent deployments
+ - Monitoring error rate
+
+ Updates: Will provide update every 15 minutes
+ Status Page: https://status.company.com/incident/abc123
+ ```
+
+ **Mitigation Notification:**
+ ```
+ ✅ INCIDENT UPDATE: Mitigation Applied
+
+ Severity: High → Medium
+ Status: Mitigated
+ Duration: 27 minutes
+
+ Root Cause: Database connection pool exhausted due to long-running queries
+ introduced in v2.3.1 deployment at 14:00 UTC
+
+ Mitigation: Rolled back to v2.3.0
+
+ Current Status:
+ - Error rate: 0.5% (back to normal)
+ - All systems operational
+ - Processing backlog of queued payments
+
+ Next Steps:
+ - Monitor for 30 minutes
+ - Fix query performance issue
+ - Deploy fixed version with testing
+ - Schedule postmortem
+ ```
+
+ ## Error Analysis Deliverables
+
+ For each error analysis, provide:
+
+ 1. **Error Summary**: What happened, when, impact scope
+ 2. **Root Cause**: The fundamental reason the error occurred
+ 3. **Evidence**: Stack traces, logs, metrics supporting the diagnosis
+ 4. **Immediate Fix**: Code changes to resolve the issue
+ 5. **Testing Strategy**: How to verify the fix works
+ 6. **Preventive Measures**: How to prevent similar errors in the future
+ 7. **Monitoring Recommendations**: What to monitor/alert on going forward
+ 8. **Runbook**: Step-by-step guide for handling similar incidents
+
+ Prioritize actionable recommendations that improve system reliability and reduce MTTR (Mean Time To Resolution) for future incidents.