blockmine 1.24.0 → 1.25.0

Files changed (346)
  1. package/CHANGELOG.md +32 -0
  2. package/README.en.md +427 -0
  3. package/README.md +40 -0
  4. package/backend/cli.js +1 -1
  5. package/backend/src/ai/plugin-assistant-system-prompt.md +664 -5
  6. package/backend/src/api/routes/bots.js +13 -0
  7. package/backend/src/api/routes/servers.js +14 -2
  8. package/backend/src/core/BotProcess.js +98 -2
  9. package/backend/src/core/PluginLoader.js +83 -3
  10. package/backend/src/core/PluginManager.js +75 -5
  11. package/backend/src/core/services/BotLifecycleService.js +186 -2
  12. package/backend/src/server.js +11 -1
  13. package/frontend/dist/assets/browser-ponyfill-DN7pwmHT.js +2 -0
  14. package/frontend/dist/assets/index-LSy71uwm.js +11261 -0
  15. package/frontend/dist/assets/index-SfhKxI4-.css +32 -0
  16. package/frontend/dist/flags/en.svg +32 -0
  17. package/frontend/dist/flags/ru.svg +5 -0
  18. package/frontend/dist/index.html +2 -2
  19. package/frontend/dist/locales/en/admin.json +100 -0
  20. package/frontend/dist/locales/en/api-keys.json +58 -0
  21. package/frontend/dist/locales/en/bots.json +110 -0
  22. package/frontend/dist/locales/en/common.json +47 -0
  23. package/frontend/dist/locales/en/configuration.json +22 -0
  24. package/frontend/dist/locales/en/console.json +10 -0
  25. package/frontend/dist/locales/en/dashboard.json +85 -0
  26. package/frontend/dist/locales/en/dialogs.json +70 -0
  27. package/frontend/dist/locales/en/event-graphs.json +50 -0
  28. package/frontend/dist/locales/en/graph-store.json +70 -0
  29. package/frontend/dist/locales/en/login.json +34 -0
  30. package/frontend/dist/locales/en/management.json +114 -0
  31. package/frontend/dist/locales/en/minecraft-viewer.json +27 -0
  32. package/frontend/dist/locales/en/nodes.json +1077 -0
  33. package/frontend/dist/locales/en/permissions.json +50 -0
  34. package/frontend/dist/locales/en/plugin-detail.json +49 -0
  35. package/frontend/dist/locales/en/plugins.json +110 -0
  36. package/frontend/dist/locales/en/proxies.json +81 -0
  37. package/frontend/dist/locales/en/servers.json +39 -0
  38. package/frontend/dist/locales/en/setup.json +17 -0
  39. package/frontend/dist/locales/en/sidebar.json +27 -0
  40. package/frontend/dist/locales/en/tasks.json +62 -0
  41. package/frontend/dist/locales/en/visual-editor.json +219 -0
  42. package/frontend/dist/locales/en/websocket.json +86 -0
  43. package/frontend/dist/locales/ru/admin.json +100 -0
  44. package/frontend/dist/locales/ru/api-keys.json +58 -0
  45. package/frontend/dist/locales/ru/bots.json +110 -0
  46. package/frontend/dist/locales/ru/common.json +49 -0
  47. package/frontend/dist/locales/ru/configuration.json +22 -0
  48. package/frontend/dist/locales/ru/console.json +10 -0
  49. package/frontend/dist/locales/ru/dashboard.json +85 -0
  50. package/frontend/dist/locales/ru/dialogs.json +70 -0
  51. package/frontend/dist/locales/ru/event-graphs.json +50 -0
  52. package/frontend/dist/locales/ru/graph-store.json +70 -0
  53. package/frontend/dist/locales/ru/login.json +34 -0
  54. package/frontend/dist/locales/ru/management.json +114 -0
  55. package/frontend/dist/locales/ru/minecraft-viewer.json +27 -0
  56. package/frontend/dist/locales/ru/nodes.json +1077 -0
  57. package/frontend/dist/locales/ru/permissions.json +50 -0
  58. package/frontend/dist/locales/ru/plugin-detail.json +49 -0
  59. package/frontend/dist/locales/ru/plugins.json +110 -0
  60. package/frontend/dist/locales/ru/proxies.json +81 -0
  61. package/frontend/dist/locales/ru/servers.json +39 -0
  62. package/frontend/dist/locales/ru/setup.json +17 -0
  63. package/frontend/dist/locales/ru/sidebar.json +27 -0
  64. package/frontend/dist/locales/ru/tasks.json +62 -0
  65. package/frontend/dist/locales/ru/visual-editor.json +221 -0
  66. package/frontend/dist/locales/ru/websocket.json +86 -0
  67. package/frontend/dist/monacoeditorwork/css.worker.bundle.js +7 -7
  68. package/frontend/dist/monacoeditorwork/html.worker.bundle.js +7 -7
  69. package/frontend/dist/monacoeditorwork/json.worker.bundle.js +7 -7
  70. package/frontend/dist/monacoeditorwork/ts.worker.bundle.js +3 -3
  71. package/frontend/package.json +4 -0
  72. package/package.json +1 -1
  73. package/screen/3dviewer.png +0 -0
  74. package/screen/console.png +0 -0
  75. package/screen/dashboard.png +0 -0
  76. package/screen/graph_collabe.png +0 -0
  77. package/screen/graph_live_debug.png +0 -0
  78. package/screen/language_selector.png +0 -0
  79. package/screen/management_command.png +0 -0
  80. package/screen/node_debug_trace.png +0 -0
  81. package/screen/plugin_обзор.png +0 -0
  82. package/screen/websocket.png +0 -0
  83. package/screen/настройки_отдельных_команд_кажду_команлду_можно_настраивать.png +0 -0
  84. package/screen/планировщик_можно_задавать_действия_по_времени.png +0 -0
  85. package/.claude/agents/README.md +0 -469
  86. package/.claude/agents/auth-route-debugger.md +0 -118
  87. package/.claude/agents/auth-route-tester.md +0 -93
  88. package/.claude/agents/auto-error-resolver.md +0 -97
  89. package/.claude/agents/build-optimizer.md +0 -236
  90. package/.claude/agents/code-architect.md +0 -34
  91. package/.claude/agents/code-architecture-reviewer.md +0 -83
  92. package/.claude/agents/code-explorer.md +0 -51
  93. package/.claude/agents/code-refactor-master.md +0 -94
  94. package/.claude/agents/code-reviewer.md +0 -46
  95. package/.claude/agents/cost-optimizer.md +0 -134
  96. package/.claude/agents/deployment-orchestrator.md +0 -113
  97. package/.claude/agents/documentation-architect.md +0 -82
  98. package/.claude/agents/frontend-error-fixer.md +0 -77
  99. package/.claude/agents/iac-code-generator.md +0 -71
  100. package/.claude/agents/incident-responder.md +0 -346
  101. package/.claude/agents/infrastructure-architect.md +0 -31
  102. package/.claude/agents/kubernetes-specialist.md +0 -56
  103. package/.claude/agents/migration-planner.md +0 -181
  104. package/.claude/agents/network-architect.md +0 -196
  105. package/.claude/agents/plan-reviewer.md +0 -52
  106. package/.claude/agents/refactor-planner.md +0 -63
  107. package/.claude/agents/security-scanner.md +0 -102
  108. package/.claude/agents/web-research-specialist.md +0 -78
  109. package/.claude/commands/cost-analysis.md +0 -315
  110. package/.claude/commands/dev-docs-update.md +0 -55
  111. package/.claude/commands/dev-docs.md +0 -51
  112. package/.claude/commands/feature-dev.md +0 -125
  113. package/.claude/commands/incident-debug.md +0 -247
  114. package/.claude/commands/infra-plan.md +0 -81
  115. package/.claude/commands/migration-plan.md +0 -478
  116. package/.claude/commands/route-research-for-testing.md +0 -37
  117. package/.claude/commands/security-review.md +0 -66
  118. package/.claude/hooks/CONFIG.md +0 -448
  119. package/.claude/hooks/README.md +0 -163
  120. package/.claude/hooks/SKILL_ACTIVATION_COMPLETE.md +0 -226
  121. package/.claude/hooks/WINDOWS_HOOKS_README.md +0 -151
  122. package/.claude/hooks/add-skill-activation-banners.ts +0 -132
  123. package/.claude/hooks/comprehensive-skill-test.ts +0 -1315
  124. package/.claude/hooks/error-handling-reminder.sh +0 -12
  125. package/.claude/hooks/error-handling-reminder.ts +0 -222
  126. package/.claude/hooks/k8s-manifest-validator.sh +0 -56
  127. package/.claude/hooks/package-lock.json +0 -556
  128. package/.claude/hooks/package.json +0 -16
  129. package/.claude/hooks/post-tool-use-tracker.ps1 +0 -174
  130. package/.claude/hooks/post-tool-use-tracker.sh +0 -183
  131. package/.claude/hooks/security-policy-check.sh +0 -247
  132. package/.claude/hooks/skill-activation-prompt.ps1 +0 -10
  133. package/.claude/hooks/skill-activation-prompt.sh +0 -10
  134. package/.claude/hooks/skill-activation-prompt.ts +0 -141
  135. package/.claude/hooks/stop-build-check-enhanced.sh +0 -130
  136. package/.claude/hooks/terraform-validator.sh +0 -53
  137. package/.claude/hooks/test-input.json +0 -7
  138. package/.claude/hooks/test-skill-activation.ts +0 -427
  139. package/.claude/hooks/trigger-build-resolver.sh +0 -79
  140. package/.claude/hooks/tsc-check.sh +0 -173
  141. package/.claude/hooks/tsconfig.json +0 -19
  142. package/.claude/settings.json +0 -59
  143. package/.claude/settings.local.json +0 -67
  144. package/.claude/skills/README.md +0 -507
  145. package/.claude/skills/api-engineering/SKILL.md +0 -63
  146. package/.claude/skills/api-engineering/resources/api-versioning.md +0 -88
  147. package/.claude/skills/api-engineering/resources/graphql-patterns.md +0 -106
  148. package/.claude/skills/api-engineering/resources/rate-limiting.md +0 -118
  149. package/.claude/skills/api-engineering/resources/rest-api-design.md +0 -105
  150. package/.claude/skills/backend-dev-guidelines/SKILL.md +0 -306
  151. package/.claude/skills/backend-dev-guidelines/resources/architecture-overview.md +0 -451
  152. package/.claude/skills/backend-dev-guidelines/resources/async-and-errors.md +0 -307
  153. package/.claude/skills/backend-dev-guidelines/resources/complete-examples.md +0 -638
  154. package/.claude/skills/backend-dev-guidelines/resources/configuration.md +0 -275
  155. package/.claude/skills/backend-dev-guidelines/resources/database-patterns.md +0 -224
  156. package/.claude/skills/backend-dev-guidelines/resources/middleware-guide.md +0 -213
  157. package/.claude/skills/backend-dev-guidelines/resources/routing-and-controllers.md +0 -756
  158. package/.claude/skills/backend-dev-guidelines/resources/sentry-and-monitoring.md +0 -336
  159. package/.claude/skills/backend-dev-guidelines/resources/services-and-repositories.md +0 -789
  160. package/.claude/skills/backend-dev-guidelines/resources/testing-guide.md +0 -235
  161. package/.claude/skills/backend-dev-guidelines/resources/validation-patterns.md +0 -754
  162. package/.claude/skills/budget-and-cost-management/SKILL.md +0 -850
  163. package/.claude/skills/build-engineering/SKILL.md +0 -431
  164. package/.claude/skills/build-engineering/resources/artifact-repositories.md +0 -72
  165. package/.claude/skills/build-engineering/resources/build-caching.md +0 -96
  166. package/.claude/skills/build-engineering/resources/build-pipelines.md +0 -105
  167. package/.claude/skills/build-engineering/resources/build-security.md +0 -95
  168. package/.claude/skills/build-engineering/resources/build-systems.md +0 -389
  169. package/.claude/skills/build-engineering/resources/compilation-optimization.md +0 -201
  170. package/.claude/skills/build-engineering/resources/dependency-management.md +0 -73
  171. package/.claude/skills/build-engineering/resources/monorepo-builds.md +0 -110
  172. package/.claude/skills/build-engineering/resources/performance-optimization.md +0 -113
  173. package/.claude/skills/build-engineering/resources/reproducible-builds.md +0 -82
  174. package/.claude/skills/cloud-engineering/SKILL.md +0 -675
  175. package/.claude/skills/cloud-engineering/resources/aws-patterns.md +0 -742
  176. package/.claude/skills/cloud-engineering/resources/azure-patterns.md +0 -714
  177. package/.claude/skills/cloud-engineering/resources/cleared-cloud-environments.md +0 -987
  178. package/.claude/skills/cloud-engineering/resources/cloud-cost-optimization.md +0 -757
  179. package/.claude/skills/cloud-engineering/resources/cloud-networking.md +0 -1058
  180. package/.claude/skills/cloud-engineering/resources/cloud-security-tools.md +0 -1530
  181. package/.claude/skills/cloud-engineering/resources/cloud-security.md +0 -990
  182. package/.claude/skills/cloud-engineering/resources/gcp-patterns.md +0 -758
  183. package/.claude/skills/cloud-engineering/resources/migration-strategies.md +0 -820
  184. package/.claude/skills/cloud-engineering/resources/multi-cloud-strategies.md +0 -670
  185. package/.claude/skills/cloud-engineering/resources/oci-patterns.md +0 -1198
  186. package/.claude/skills/cloud-engineering/resources/serverless-patterns.md +0 -795
  187. package/.claude/skills/cloud-engineering/resources/well-architected-frameworks.md +0 -966
  188. package/.claude/skills/cybersecurity/SKILL.md +0 -409
  189. package/.claude/skills/cybersecurity/resources/security-architecture.md +0 -266
  190. package/.claude/skills/database-engineering/SKILL.md +0 -61
  191. package/.claude/skills/database-engineering/resources/backup-and-recovery.md +0 -72
  192. package/.claude/skills/database-engineering/resources/database-replication.md +0 -63
  193. package/.claude/skills/database-engineering/resources/postgresql-fundamentals.md +0 -70
  194. package/.claude/skills/database-engineering/resources/query-optimization.md +0 -68
  195. package/.claude/skills/devsecops/SKILL.md +0 -374
  196. package/.claude/skills/devsecops/resources/ci-cd-security.md +0 -204
  197. package/.claude/skills/devsecops/resources/compliance-automation.md +0 -530
  198. package/.claude/skills/devsecops/resources/compliance-frameworks.md +0 -2322
  199. package/.claude/skills/devsecops/resources/container-security.md +0 -915
  200. package/.claude/skills/devsecops/resources/cspm-integration.md +0 -1440
  201. package/.claude/skills/devsecops/resources/policy-enforcement.md +0 -619
  202. package/.claude/skills/devsecops/resources/secrets-management.md +0 -755
  203. package/.claude/skills/devsecops/resources/security-monitoring.md +0 -146
  204. package/.claude/skills/devsecops/resources/security-scanning.md +0 -887
  205. package/.claude/skills/devsecops/resources/security-testing.md +0 -203
  206. package/.claude/skills/devsecops/resources/supply-chain-security.md +0 -518
  207. package/.claude/skills/devsecops/resources/vulnerability-management.md +0 -481
  208. package/.claude/skills/devsecops/resources/zero-trust-architecture.md +0 -177
  209. package/.claude/skills/documentation-as-code/SKILL.md +0 -323
  210. package/.claude/skills/documentation-as-code/resources/api-documentation.md +0 -90
  211. package/.claude/skills/documentation-as-code/resources/changelog-management.md +0 -79
  212. package/.claude/skills/documentation-as-code/resources/diagram-generation.md +0 -44
  213. package/.claude/skills/documentation-as-code/resources/docs-as-code-workflow.md +0 -99
  214. package/.claude/skills/documentation-as-code/resources/documentation-automation.md +0 -68
  215. package/.claude/skills/documentation-as-code/resources/documentation-sites.md +0 -79
  216. package/.claude/skills/documentation-as-code/resources/markdown-best-practices.md +0 -162
  217. package/.claude/skills/documentation-as-code/resources/openapi-specification.md +0 -77
  218. package/.claude/skills/documentation-as-code/resources/readme-engineering.md +0 -60
  219. package/.claude/skills/documentation-as-code/resources/technical-writing-guide.md +0 -202
  220. package/.claude/skills/engineering-management/SKILL.md +0 -356
  221. package/.claude/skills/engineering-management/resources/career-ladders.md +0 -609
  222. package/.claude/skills/engineering-management/resources/hiring-and-assessment.md +0 -555
  223. package/.claude/skills/engineering-management/resources/one-on-one-guides.md +0 -609
  224. package/.claude/skills/engineering-management/resources/resource-planning.md +0 -557
  225. package/.claude/skills/engineering-management/resources/team-organization-patterns.md +0 -491
  226. package/.claude/skills/engineering-management/resources/technical-interviews.md +0 -474
  227. package/.claude/skills/engineering-operations-management/SKILL.md +0 -817
  228. package/.claude/skills/error-tracking/SKILL.md +0 -379
  229. package/.claude/skills/frontend-design/SKILL.md +0 -42
  230. package/.claude/skills/frontend-dev-guidelines/SKILL.md +0 -403
  231. package/.claude/skills/frontend-dev-guidelines/resources/common-patterns.md +0 -331
  232. package/.claude/skills/frontend-dev-guidelines/resources/complete-examples.md +0 -872
  233. package/.claude/skills/frontend-dev-guidelines/resources/component-patterns.md +0 -502
  234. package/.claude/skills/frontend-dev-guidelines/resources/data-fetching.md +0 -767
  235. package/.claude/skills/frontend-dev-guidelines/resources/file-organization.md +0 -502
  236. package/.claude/skills/frontend-dev-guidelines/resources/loading-and-error-states.md +0 -501
  237. package/.claude/skills/frontend-dev-guidelines/resources/performance.md +0 -406
  238. package/.claude/skills/frontend-dev-guidelines/resources/routing-guide.md +0 -364
  239. package/.claude/skills/frontend-dev-guidelines/resources/styling-guide.md +0 -428
  240. package/.claude/skills/frontend-dev-guidelines/resources/typescript-standards.md +0 -418
  241. package/.claude/skills/general-it-engineering/SKILL.md +0 -393
  242. package/.claude/skills/general-it-engineering/resources/asset-management.md +0 -712
  243. package/.claude/skills/general-it-engineering/resources/automation-orchestration.md +0 -817
  244. package/.claude/skills/general-it-engineering/resources/business-continuity.md +0 -786
  245. package/.claude/skills/general-it-engineering/resources/change-management.md +0 -715
  246. package/.claude/skills/general-it-engineering/resources/enterprise-monitoring.md +0 -729
  247. package/.claude/skills/general-it-engineering/resources/help-desk-operations.md +0 -738
  248. package/.claude/skills/general-it-engineering/resources/incident-service-management.md +0 -834
  249. package/.claude/skills/general-it-engineering/resources/it-governance.md +0 -753
  250. package/.claude/skills/general-it-engineering/resources/itil-framework.md +0 -503
  251. package/.claude/skills/general-it-engineering/resources/service-management.md +0 -669
  252. package/.claude/skills/infrastructure-architecture/SKILL.md +0 -328
  253. package/.claude/skills/infrastructure-architecture/resources/architecture-decision-records.md +0 -505
  254. package/.claude/skills/infrastructure-architecture/resources/architecture-patterns.md +0 -528
  255. package/.claude/skills/infrastructure-architecture/resources/capacity-planning.md +0 -453
  256. package/.claude/skills/infrastructure-architecture/resources/cleared-environment-architecture.md +0 -773
  257. package/.claude/skills/infrastructure-architecture/resources/cost-architecture.md +0 -499
  258. package/.claude/skills/infrastructure-architecture/resources/data-architecture.md +0 -501
  259. package/.claude/skills/infrastructure-architecture/resources/disaster-recovery.md +0 -535
  260. package/.claude/skills/infrastructure-architecture/resources/migration-architecture.md +0 -512
  261. package/.claude/skills/infrastructure-architecture/resources/multi-region-design.md +0 -608
  262. package/.claude/skills/infrastructure-architecture/resources/reference-architectures.md +0 -562
  263. package/.claude/skills/infrastructure-architecture/resources/security-architecture.md +0 -538
  264. package/.claude/skills/infrastructure-architecture/resources/system-design-principles.md +0 -489
  265. package/.claude/skills/infrastructure-architecture/resources/workload-classification.md +0 -1000
  266. package/.claude/skills/infrastructure-strategy/SKILL.md +0 -924
  267. package/.claude/skills/network-engineering/SKILL.md +0 -385
  268. package/.claude/skills/network-engineering/resources/dns-management.md +0 -738
  269. package/.claude/skills/network-engineering/resources/load-balancing.md +0 -820
  270. package/.claude/skills/network-engineering/resources/network-architecture.md +0 -546
  271. package/.claude/skills/network-engineering/resources/network-security.md +0 -921
  272. package/.claude/skills/network-engineering/resources/network-troubleshooting.md +0 -749
  273. package/.claude/skills/network-engineering/resources/routing-switching.md +0 -373
  274. package/.claude/skills/network-engineering/resources/sdn-networking.md +0 -695
  275. package/.claude/skills/network-engineering/resources/service-mesh-networking.md +0 -777
  276. package/.claude/skills/network-engineering/resources/tcp-ip-protocols.md +0 -444
  277. package/.claude/skills/network-engineering/resources/vpn-connectivity.md +0 -672
  278. package/.claude/skills/node-development/SKILL.md +0 -317
  279. package/.claude/skills/observability-engineering/SKILL.md +0 -101
  280. package/.claude/skills/observability-engineering/resources/apm-tools.md +0 -97
  281. package/.claude/skills/observability-engineering/resources/correlation-strategies.md +0 -87
  282. package/.claude/skills/observability-engineering/resources/distributed-tracing.md +0 -98
  283. package/.claude/skills/observability-engineering/resources/logs-aggregation.md +0 -118
  284. package/.claude/skills/observability-engineering/resources/observability-cost-optimization.md +0 -141
  285. package/.claude/skills/observability-engineering/resources/opentelemetry.md +0 -110
  286. package/.claude/skills/platform-engineering/SKILL.md +0 -555
  287. package/.claude/skills/platform-engineering/resources/architecture-overview.md +0 -600
  288. package/.claude/skills/platform-engineering/resources/container-orchestration.md +0 -916
  289. package/.claude/skills/platform-engineering/resources/cost-optimization.md +0 -634
  290. package/.claude/skills/platform-engineering/resources/developer-platforms.md +0 -670
  291. package/.claude/skills/platform-engineering/resources/gitops-automation.md +0 -650
  292. package/.claude/skills/platform-engineering/resources/infrastructure-as-code.md +0 -778
  293. package/.claude/skills/platform-engineering/resources/infrastructure-standards.md +0 -708
  294. package/.claude/skills/platform-engineering/resources/multi-tenancy.md +0 -602
  295. package/.claude/skills/platform-engineering/resources/platform-security.md +0 -711
  296. package/.claude/skills/platform-engineering/resources/resource-management.md +0 -592
  297. package/.claude/skills/platform-engineering/resources/service-mesh.md +0 -628
  298. package/.claude/skills/release-engineering/SKILL.md +0 -393
  299. package/.claude/skills/release-engineering/resources/artifact-management.md +0 -108
  300. package/.claude/skills/release-engineering/resources/build-optimization.md +0 -84
  301. package/.claude/skills/release-engineering/resources/ci-cd-pipelines.md +0 -411
  302. package/.claude/skills/release-engineering/resources/deployment-strategies.md +0 -197
  303. package/.claude/skills/release-engineering/resources/pipeline-security.md +0 -62
  304. package/.claude/skills/release-engineering/resources/progressive-delivery.md +0 -83
  305. package/.claude/skills/release-engineering/resources/release-automation.md +0 -68
  306. package/.claude/skills/release-engineering/resources/release-orchestration.md +0 -77
  307. package/.claude/skills/release-engineering/resources/rollback-strategies.md +0 -66
  308. package/.claude/skills/release-engineering/resources/versioning-strategies.md +0 -59
  309. package/.claude/skills/route-tester/SKILL.md +0 -392
  310. package/.claude/skills/skill-developer/ADVANCED.md +0 -197
  311. package/.claude/skills/skill-developer/HOOK_MECHANISMS.md +0 -306
  312. package/.claude/skills/skill-developer/PATTERNS_LIBRARY.md +0 -152
  313. package/.claude/skills/skill-developer/SKILL.md +0 -430
  314. package/.claude/skills/skill-developer/SKILL_RULES_REFERENCE.md +0 -315
  315. package/.claude/skills/skill-developer/TRIGGER_TYPES.md +0 -305
  316. package/.claude/skills/skill-developer/TROUBLESHOOTING.md +0 -514
  317. package/.claude/skills/skill-rules.json +0 -2989
  318. package/.claude/skills/sre/SKILL.md +0 -464
  319. package/.claude/skills/sre/resources/alerting-best-practices.md +0 -282
  320. package/.claude/skills/sre/resources/capacity-planning.md +0 -226
  321. package/.claude/skills/sre/resources/chaos-engineering.md +0 -193
  322. package/.claude/skills/sre/resources/disaster-recovery.md +0 -232
  323. package/.claude/skills/sre/resources/incident-management.md +0 -436
  324. package/.claude/skills/sre/resources/observability-stack.md +0 -240
  325. package/.claude/skills/sre/resources/on-call-runbooks.md +0 -167
  326. package/.claude/skills/sre/resources/performance-optimization.md +0 -108
  327. package/.claude/skills/sre/resources/reliability-patterns.md +0 -183
  328. package/.claude/skills/sre/resources/slo-sli-sla.md +0 -464
  329. package/.claude/skills/sre/resources/toil-reduction.md +0 -145
  330. package/.claude/skills/systems-engineering/SKILL.md +0 -648
  331. package/.claude/skills/systems-engineering/resources/automation-patterns.md +0 -771
  332. package/.claude/skills/systems-engineering/resources/configuration-management.md +0 -998
  333. package/.claude/skills/systems-engineering/resources/linux-administration.md +0 -672
  334. package/.claude/skills/systems-engineering/resources/networking-fundamentals.md +0 -982
  335. package/.claude/skills/systems-engineering/resources/performance-tuning.md +0 -871
  336. package/.claude/skills/systems-engineering/resources/powershell-scripting.md +0 -482
  337. package/.claude/skills/systems-engineering/resources/security-hardening.md +0 -739
  338. package/.claude/skills/systems-engineering/resources/shell-scripting.md +0 -915
  339. package/.claude/skills/systems-engineering/resources/storage-management.md +0 -628
  340. package/.claude/skills/systems-engineering/resources/system-monitoring.md +0 -787
  341. package/.claude/skills/systems-engineering/resources/troubleshooting-guide.md +0 -753
  342. package/.claude/skills/systems-engineering/resources/windows-administration.md +0 -738
  343. package/.claude/skills/technical-leadership/SKILL.md +0 -728
  344. package/backend/docs/SECRETS_DOCUMENTATION.md +0 -327
  345. package/frontend/dist/assets/index-BC-NbKXi.css +0 -32
  346. package/frontend/dist/assets/index-DqJXZMHY.js +0 -11266
@@ -1,817 +0,0 @@
1
- # Engineering Operations Management Skill
2
-
3
- **For managers running SRE, platform, and infrastructure teams - focusing on operations, on-call, incidents, and engineering metrics.**
4
-
5
- > This skill helps engineering managers build sustainable operations practices, prevent burnout, run effective incident reviews, and measure what matters. Complements technical SRE skills with people and process management.
6
-
7
- ---
8
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
9
- 🎯 SKILL ACTIVATED: engineering-operations-management
10
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
11
-
12
-
13
- ## When to Use This Skill
14
-
15
- **You're a manager who needs help with:**
16
- - Structuring on-call rotations and compensation
17
- - Preventing and addressing engineer burnout
18
- - Running blameless postmortem reviews
19
- - Negotiating SLOs with product teams
20
- - Measuring engineering productivity (not vanity metrics)
21
- - Managing toil and operational load
22
- - Balancing feature work vs operational excellence
23
- - Building sustainable operations culture
24
-
25
- **This skill does NOT cover:**
26
- - Hands-on incident response (see **sre** skill)
27
- - Technical SLO/SLI implementation (see **sre** skill)
28
- - Monitoring/observability setup (see **platform-engineering** skill)
29
- - Team hiring and career development (see **engineering-management** skill)
30
-
31
- ---
32
-
33
- ## Core Areas
34
-
35
- ### 1. On-Call Management
36
-
37
- **Core principle:** On-call is a necessary operational burden that should be **fair, sustainable, and compensated**.
38
-
39
- #### On-Call Rotation Models
40
-
41
- **Follow-the-Sun (Recommended for global teams):**
42
- ```
43
- Americas Team: 6 AM - 2 PM PST (primary)
44
- EMEA Team: 2 PM - 10 PM PST (primary)
45
- APAC Team: 10 PM - 6 AM PST (primary)
46
-
47
- Pros: No one wakes up at night, balanced load
48
- Cons: Requires global team, handoff complexity
49
- Best for: Teams with 15+ engineers across timezones
50
- ```
51
-
52
- **Weekly Rotation (Most common):**
53
- ```
54
- Week 1: Engineer A (primary), Engineer B (secondary)
55
- Week 2: Engineer C (primary), Engineer D (secondary)
56
- Week 3: Engineer E (primary), Engineer A (secondary)
57
-
58
- Pros: Simple, fair rotation
59
- Cons: Weekend coverage, potential burnout
60
- Best for: Teams with 6-10 engineers in same timezone
61
- ```
62
-
63
- **Tiered Escalation:**
64
- ```
65
- Tier 1: Junior engineers (business hours only)
66
- Tier 2: Senior engineers (24/7 primary)
67
- Tier 3: Staff/Principal (escalation only)
68
-
69
- Pros: Gradual responsibility increase
70
- Cons: Can create "us vs them" dynamic
71
- Best for: Large teams (15+) with clear skill levels
72
- ```
73
-
74
- #### On-Call Compensation Models
75
-
76
- **Option 1: On-Call Stipend**
77
- ```
78
- Primary on-call: $500-$1,000/week
79
- Secondary on-call: $250-$500/week
80
-
81
- Pros: Predictable, simple
82
- Cons: Doesn't account for actual pages
83
- ```
84
-
85
- **Option 2: Pay-per-Page**
86
- ```
87
- Business hours page: $50-$100
88
- After-hours page: $150-$300
89
- Weekend/holiday page: $300-$500
90
-
91
- Pros: Fair - pays for actual disruption
92
- Cons: Can incentivize ignoring issues
93
- ```
94
-
95
- **Option 3: Hybrid (Recommended)**
96
- ```
97
- Base stipend: $500/week
98
- + $100 per after-hours page
99
- + Comp time (1.5x hours worked after-hours)
100
-
101
- Pros: Covers both availability and interruptions
102
- Cons: More complex to administer
103
- ```
104
-
105
- **Comp time policies:**
106
- - For every hour worked after-hours, grant 1.5 hours comp time
107
- - Comp time must be used within 30 days
108
- - Encourage taking comp time day-after major incidents
109
-
110
- #### Burnout Prevention
111
-
112
- **Warning signs:**
113
- - 🚨 Pages > 5 per week for extended period
114
- - 🚨 Engineer mentions exhaustion, stress in 1-on-1s
115
- - 🚨 Quality of work declining
116
- - 🚨 Working late nights/weekends regularly
117
- - 🚨 Cynicism, disengagement
118
-
119
- **Interventions:**
120
- 1. **Immediate:** Rotate off on-call for 2-4 weeks
121
- 2. **Short-term:** Reduce project load, pair with senior engineer
122
- 3. **Long-term:** Fix underlying system issues causing pages
123
-
124
- **Sustainable on-call rules:**
125
- - No engineer on-call more than 1 week per month
126
- - Maximum 2 weeks on-call per quarter
127
- - Mandatory break after high-page-volume week
128
- - No on-call during PTO or major life events
129
-
130
- #### On-Call Scenarios
131
-
132
- **Scenario: "What's a fair after-hours pay model?"**
133
- - **Hybrid model (recommended):**
134
- - Base: $500/week on-call stipend
135
- - Plus: $100-150 per after-hours page
136
- - Plus: 1.5x comp time for hours worked
137
- - **Example calculation:**
138
- - Week stipend: $500
139
- - 3 after-hours pages × $125 = $375
140
- - 4 hours worked × 1.5 = 6 hours comp time
141
- - **Total value:** $875 + 6 hours off
142
-
143
- **Scenario: "Team blamed someone in incident review - how to fix?"**
144
- - **Immediate:** Stop the review, reset the tone
145
- - **Say:** "We don't blame people, we fix systems. Let's focus on what failed, not who."
146
- - **Blameless culture principles:**
147
- - People make reasonable decisions based on information available
148
- - Systems should prevent single points of failure
149
- - Focus on "what" not "who"
150
- - **Follow-up:** Coach the manager running the review on blameless principles
151
-
152
- **Scenario: "During incident, what should I do as manager?"**
153
- - **Monitor:** Watch incident channel, don't interrupt
154
- - **Support:** "What do you need? More people? Communication handled?"
155
- - **Shield:** Handle exec questions, keep pressure off team
156
- - **Don't:** Take over, second-guess, or ask "why" questions mid-incident
157
- - **After:** Thank team, schedule postmortem, ensure comp time taken
158
-
159
- **Scenario: "How do we track incident trends?"**
160
- - **Metrics to track:**
161
- - Incident frequency (per week/month)
162
- - MTTR (mean time to recovery)
163
- - Incidents by service/component
164
- - Incidents by root cause category
165
- - **Look for patterns:**
166
- - Same service failing repeatedly → systemic issue
167
- - MTTR increasing → lack of familiarity or tooling gaps
168
- - Spike in incidents → recent deploy or infrastructure change
169
- - **Action:** Address top 3 incident sources quarterly
170
-
171
- **Scenario: "What incident communication plan do we need?"**
172
- - **During incident:**
173
- - Sev 1: Updates every 30 minutes to execs, status page every 15 min
174
- - Sev 2: Updates every hour to stakeholders
175
- - Sev 3: Update when resolved
176
- - **Channels:**
177
- - Internal: Dedicated Slack #incidents channel
178
- - External: Status page (Statuspage.io, etc.)
179
- - Executives: Email + Slack DM for Sev 1/2
180
- - **Template:**
181
- ```
182
- [SEV 1] API Service Outage
183
- Impact: All users unable to login
184
- Status: Investigating
185
- Next update: 2:30 PM (15 minutes)
186
- ```
187
-
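The update template and cadence above could be filled in programmatically. The following is a sketch under assumptions: `format_update` and `UPDATE_INTERVAL_MIN` are hypothetical names, the intervals are the status-page cadence for Sev 1 and the stakeholder cadence for Sev 2, and Sev 3 is deliberately unsupported because it only gets an update on resolution.

```python
from datetime import datetime, timedelta

# Minutes between updates per severity, taken from the cadence above
# (status page for Sev 1, stakeholders for Sev 2). Sev 3 has no interval.
UPDATE_INTERVAL_MIN = {1: 15, 2: 60}


def format_update(severity, title, impact, status, now):
    """Render the incident update template for a Sev 1 or Sev 2 incident."""
    interval = UPDATE_INTERVAL_MIN[severity]  # KeyError for other severities
    next_update = now + timedelta(minutes=interval)
    # %I is zero-padded ("02:30 PM"), so strip the leading zero.
    clock = next_update.strftime("%I:%M %p").lstrip("0")
    return (
        f"[SEV {severity}] {title}\n"
        f"Impact: {impact}\n"
        f"Status: {status}\n"
        f"Next update: {clock} ({interval} minutes)"
    )


# The date is arbitrary; only the time of day matters for the template.
print(format_update(1, "API Service Outage", "All users unable to login",
                    "Investigating", datetime(2024, 1, 1, 14, 15)))
```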
188
- **Scenario: "What's the right retrospective format?"**
189
- - **Timeline:** Within 48 hours of incident (while fresh)
190
- - **Attendees:** Incident responders + anyone interested (open invitation)
191
- - **Duration:** 45-60 minutes
192
- - **Format:**
193
- 1. Timeline walkthrough (10 min)
194
- 2. What went well (10 min)
195
- 3. What went poorly (15 min)
196
- 4. Action items (15 min) - with owners and due dates
197
- 5. Q&A (10 min)
198
- - **Output:** Written postmortem + action items tracked
199
-
200
- **Scenario: "How do we communicate incidents to executives?"**
201
- - **During:** Brief, factual updates
202
- - "API down, 100% of users affected, team investigating"
203
- - **After:** Business-focused summary
204
- - Revenue impact: "$50K in lost sales"
205
- - User impact: "10K users couldn't check out for 2 hours"
206
- - Prevention: "Adding rate limiting to prevent recurrence"
207
- - **Avoid:** Deep technical details unless asked
208
-
209
- ---
210
-
211
- ### 2. Incident Management for Managers
212
-
213
- **Your role as a manager during incidents:**
214
-
215
- #### During the Incident (DO NOT take over unless critical)
216
-
217
- ```
218
- ✅ DO:
219
- - Monitor incident channel, offer support
220
- - Shield team from external pressure
221
- - Bring in additional engineers if needed
222
- - Coordinate with stakeholders (updates to execs)
223
- - Order food if it's going long
224
- - Take notes for postmortem
225
-
226
- ❌ DON'T:
227
- - Take over incident response (unless you're most qualified)
228
- - Ask "why didn't you..." questions during incident
229
- - Pressure for faster resolution
230
- - Blame individuals
231
- - Second-guess decisions being made
232
- ```
233
-
234
- **Incident Severity Levels (align with team):**
235
-
236
- ```
237
- Sev 1 (Critical):
238
- ├── Complete service outage
239
- ├── Data loss or security breach
240
- ├── Revenue impact > $10K/hour
241
- └── Response: All hands, exec updates every 30 min
242
-
243
- Sev 2 (High):
244
- ├── Major feature degraded
245
- ├── Significant user impact
246
- ├── Revenue impact > $1K/hour
247
- └── Response: On-call + expert, updates every hour
248
-
249
- Sev 3 (Medium):
250
- ├── Minor feature degraded
251
- ├── Limited user impact
252
- └── Response: On-call handles, update when resolved
253
-
254
- Sev 4 (Low):
255
- ├── Internal tooling issue
256
- ├── No user impact
257
- └── Response: Fix during business hours
258
- ```
259
-
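One way to apply these severity levels consistently is to encode them as a function. This is an illustrative sketch, not a standard: `classify_severity` and its parameters are hypothetical, and it assumes that meeting any single criterion of a level is enough to assign that level.

```python
def classify_severity(revenue_loss_per_hour=0.0, complete_outage=False,
                      data_loss_or_breach=False, major_feature_degraded=False,
                      user_impact="limited"):
    """Map incident facts to Sev 1-4 using the thresholds listed above.

    user_impact is one of "none", "limited", "significant" (assumed labels).
    """
    if complete_outage or data_loss_or_breach or revenue_loss_per_hour > 10_000:
        return 1
    if (major_feature_degraded or user_impact == "significant"
            or revenue_loss_per_hour > 1_000):
        return 2
    if user_impact == "none":
        return 4  # internal tooling issue, fix during business hours
    return 3


print(classify_severity(revenue_loss_per_hour=2_500))  # 2
```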
260
- #### After the Incident: Blameless Postmortem
261
-
262
- **Blameless postmortem framework:**
263
-
264
- ```
265
- Postmortem Template:
266
-
267
- ## Incident Summary
268
- - Date/Time: When did it happen?
269
- - Duration: How long?
270
- - Impact: Who was affected? How many users?
271
- - Severity: Sev 1-4
272
-
273
- ## Timeline
274
- - 14:32 - First alert fired
275
- - 14:35 - Engineer A acknowledged, began investigation
276
- - 14:45 - Root cause identified (database connection pool exhausted)
277
- - 15:00 - Mitigation applied (increased pool size)
278
- - 15:15 - Service fully recovered
279
-
280
- ## Root Cause
281
- What actually caused this? (Technical, not "Engineer X did...")
282
-
283
- ## What Went Well
284
- - Alert fired within 2 minutes
285
- - Communication was clear
286
- - Mitigation was applied smoothly
287
-
288
- ## What Went Poorly
289
- - No automated mitigation
290
- - Monitoring didn't catch early warning signs
291
- - On-call engineer not familiar with this service
292
-
293
- ## Action Items
294
- 1. [P0] Add automated connection pool scaling (Owner: Alice, Due: 2 weeks)
295
- 2. [P1] Improve monitoring for connection pool saturation (Owner: Bob, Due: 1 month)
296
- 3. [P2] Add service to on-call training rotation (Owner: Manager, Due: 2 weeks)
297
-
298
- ## Lessons Learned
299
- - Database connection pool defaults are too conservative
300
- - Need better pre-production load testing
301
- ```
302
-
303
- **Blameless postmortem meeting (45-60 min):**
304
-
305
- ```
306
- 1. Introduction (5 min)
307
- └── Remind: This is blameless, focus on systems not people
308
-
309
- 2. Timeline Review (15 min)
310
- └── Walk through what happened, when
311
-
312
- 3. Root Cause Analysis (15 min)
313
- └── "Why did this happen?" (ask "why" 5 times)
314
-
315
- 4. What Went Well / What Went Poorly (10 min)
316
- └── Balanced reflection
317
-
318
- 5. Action Items (10 min)
319
- └── Specific, assigned, with due dates
320
- └── Priority: P0 (this week), P1 (this month), P2 (nice to have)
321
-
322
- 6. Close (5 min)
323
- └── Thank the team, emphasize learning
324
- ```
325
-
326
- **Red flags in postmortems:**
327
- - ❌ Blaming individuals ("Alice should have...")
328
- - ❌ Vague action items ("Improve monitoring")
329
- - ❌ No follow-up on action items
330
- - ❌ Defensive posturing
331
- - ❌ Skipping postmortems for "small" incidents
332
-
333
- **Manager's job:** Enforce blameless culture, track action items, ensure learning.
334
-
335
- ---
336
-
337
- ### 3. SLO Negotiation with Product Teams
338
-
339
- **The tension:** Product wants features fast. SRE/Platform wants stability. You balance both.
340
-
341
- #### Understanding SLOs (Simple Version for Managers)
342
-
343
- ```
344
- SLI (Service Level Indicator):
345
- What you measure (e.g., "API latency p99")
346
-
347
- SLO (Service Level Objective):
348
- Target for reliability (e.g., "API latency p99 < 500ms, 99.9% of the time")
349
-
350
- SLA (Service Level Agreement):
351
- Contractual promise to customers (e.g., "99.95% uptime or we issue a refund")
352
-
353
- Example:
354
- SLI: Request success rate
355
- SLO: 99.9% of requests succeed (internal target)
356
- SLA: 99.5% uptime (customer-facing promise)
357
- ```
358
-
359
- **Error budget concept:**
360
-
361
- ```
362
- SLO: 99.9% availability = 0.1% allowed downtime
363
-
364
- Per month (30 days):
365
- ├── Total time: 43,200 minutes
366
- ├── Allowed downtime: 43.2 minutes
367
- └── Error budget: 43.2 minutes
368
-
369
- If error budget exhausted:
370
- ├── Freeze feature releases
371
- ├── Focus on reliability improvements
372
- └── Pay down tech debt
373
- ```
374
-
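The error budget arithmetic above is simple enough to script. The sketch below uses hypothetical function names and assumes the budget is measured purely in minutes of downtime over a 30-day window, as in the example.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return total_minutes * (1 - slo)


def budget_remaining(slo, downtime_minutes, days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, downtime_minutes=21.6)
print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
# budget: 43.2 min, remaining: 50%

# Per the policy above, an exhausted budget means a feature freeze.
feature_freeze = remaining <= 0
```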
375
- #### SLO Negotiation Framework
376
-
377
- **When product pushes for aggressive feature timeline:**
378
-
379
- ```
380
- Product: "We need to ship this feature in 2 weeks"
381
-
382
- You (as manager):
383
- "Let's check our error budget first. If we have budget, we can move fast.
384
- If we're out of budget, we need to stabilize first."
385
-
386
- Scenario 1: Error budget healthy (50% remaining)
387
- ├── ✅ Green light for feature work
388
- ├── 70% capacity on features
389
- └── 30% on reliability
390
-
391
- Scenario 2: Error budget exhausted (0% remaining)
392
- ├── 🛑 Feature freeze
393
- ├── 100% capacity on reliability
394
- └── Resume features when budget recovers
395
- ```
396
-
397
- **How to set SLOs (practical guide):**
398
-
399
- 1. **Start with current performance:**
400
- - "Our API latency p99 is currently 300ms"
401
- - Don't set SLO at 300ms - give yourself buffer
402
-
403
- 2. **Set realistic target:**
404
- - "Let's set SLO at p99 < 500ms"
405
- - This gives 200ms buffer for growth/issues
406
-
407
- 3. **Align with customer expectation:**
408
- - "Customers complain if latency > 1s"
409
- - SLO should prevent customer pain
410
-
411
- 4. **Review quarterly:**
412
- - Too easy? (Always meeting SLO) → Tighten SLO or invest in features
413
- - Too hard? (Always missing SLO) → Loosen SLO or invest in reliability
414
-
415
- **Common SLOs by service type:**
416
-
417
- ```
418
- API Services:
419
- ├── Availability: 99.9% (43 min downtime/month)
420
- ├── Latency p50: < 100ms
421
- ├── Latency p99: < 500ms
422
- └── Error rate: < 0.1%
423
-
424
- Batch Processing:
425
- ├── Job success rate: 99.5%
426
- ├── Job completion time: < 4 hours
427
- └── Data accuracy: 99.99%
428
-
429
- Data Pipeline:
430
- ├── Data freshness: < 15 min lag
431
- ├── Pipeline availability: 99.9%
432
- └── Data quality: 99.95%
433
- ```
434
-
435
- ---
436
-
437
- ### 4. Engineering Metrics That Matter
438
-
439
- **The problem:** Easy to measure vanity metrics. Hard to measure real productivity.
440
-
441
- #### Vanity Metrics (Avoid)
442
-
443
- ```
444
- ❌ Lines of code written
445
- ❌ Number of commits
446
- ❌ Hours worked
447
- ❌ Number of deploys (without context)
448
- ❌ Ticket velocity (without quality)
449
- ❌ Code coverage % (without context)
450
- ```
451
-
452
- **Why these are bad:**
453
- - Lines of code: Good engineers often delete code
454
- - Number of commits: Encourages small, meaningless commits
455
- - Hours worked: Encourages burnout, not productivity
456
- - Deploys without context: Could be hotfixes for bugs you introduced
457
- - Ticket velocity: Encourages cherry-picking easy tickets
458
- - Code coverage: Engineers can write useless tests just to hit a target %
459
-
460
- #### Metrics That Actually Matter
461
-
462
- **1. DORA Metrics (Use these)**
463
-
464
- ```
465
- Deployment Frequency:
466
- ├── How often do you deploy to production?
467
- ├── Elite: Multiple times per day
468
- ├── High: Daily to weekly
469
- ├── Medium: Weekly to monthly
470
- └── Low: Monthly to every 6 months
471
-
472
- Lead Time for Changes:
473
- ├── How long from commit to production?
474
- ├── Elite: < 1 hour
475
- ├── High: 1 day to 1 week
476
- ├── Medium: 1 week to 1 month
477
- └── Low: 1 month to 6 months
478
-
479
- Time to Restore Service:
480
- ├── How long to recover from incident?
481
- ├── Elite: < 1 hour
482
- ├── High: < 1 day
483
- ├── Medium: 1 day to 1 week
484
- └── Low: > 1 week
485
-
486
- Change Failure Rate:
487
- ├── What % of changes cause incidents?
488
- ├── Elite: 0-15%
489
- ├── High: 16-30%
490
- ├── Medium: 31-45%
491
- └── Low: > 45%
492
- ```
493
-
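To compare a team against these benchmarks, the bands can be turned into lookup functions. This sketch covers two of the four metrics; the function names are hypothetical, and boundary values between bands (for example 15.5%) are assigned to the better tier as an assumption, since the bands above leave them unspecified.

```python
def change_failure_tier(failure_pct):
    """Tier for change failure rate, given as a percentage (0-100)."""
    if failure_pct <= 15:
        return "Elite"
    if failure_pct <= 30:
        return "High"
    if failure_pct <= 45:
        return "Medium"
    return "Low"


def restore_time_tier(hours):
    """Tier for time to restore service, given in hours."""
    if hours < 1:
        return "Elite"
    if hours < 24:
        return "High"
    if hours <= 24 * 7:
        return "Medium"
    return "Low"


print(change_failure_tier(22), restore_time_tier(0.5))  # High Elite
```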
494
- **How to use DORA metrics:**
495
- - Track quarterly, not daily (avoid gaming)
496
- - Trend over time (are we improving?)
497
- - Compare to benchmarks (elite, high, medium, low)
498
- - Use to identify improvement areas
499
-
500
- **2. SRE Metrics**
501
-
502
- ```
503
- Toil Percentage:
504
- ├── What % of engineer time is manual ops work?
505
- ├── Target: < 30% toil
506
- ├── Intervention needed: > 50% toil
507
- └── Measure: Time tracking, surveys
508
-
509
- On-Call Load:
510
- ├── Pages per week per engineer
511
- ├── Target: < 3 pages/week
512
- ├── Intervention: > 5 pages/week
513
- └── Measure: PagerDuty analytics
514
-
515
- SLO Compliance:
516
- ├── Are we meeting our SLOs?
517
- ├── Target: 99%+ SLO compliance
518
- └── Measure: Observability dashboards
519
- ```
520
-
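The toil and on-call thresholds above share the same shape: a target below one value and an intervention trigger above another. A minimal sketch of that check follows; the function names and the intermediate "watch" label are assumptions, not part of the guidance above.

```python
def threshold_status(value, target_below, intervene_above):
    """Classify a metric against a target and an intervention threshold."""
    if value < target_below:
        return "on target"
    if value > intervene_above:
        return "intervention needed"
    return "watch"  # assumed label for the band between the two thresholds


def toil_status(toil_pct):
    return threshold_status(toil_pct, target_below=30, intervene_above=50)


def on_call_status(pages_per_week):
    return threshold_status(pages_per_week, target_below=3, intervene_above=5)


print(toil_status(40))    # watch
print(on_call_status(8))  # intervention needed
```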
521
- **3. Team Health Metrics**
522
-
523
- ```
524
- Engineer Satisfaction:
525
- ├── Quarterly survey (1-10 scale)
526
- ├── Questions: "Satisfied with work?", "Would recommend team?"
527
- ├── Target: 8+ average
528
- └── Red flag: < 6 average or declining trend
529
-
530
- Retention Rate:
531
- ├── % of engineers staying > 1 year
532
- ├── Target: > 85% annual retention
533
- └── Red flag: < 70% retention
534
-
535
- Time to Productivity (New hires):
536
- ├── How long until new hire is productive?
537
- ├── Target: < 90 days
538
- └── Measure: Manager assessment + self-assessment
539
- ```
540
-
541
- **4. Operational Excellence Metrics**
542
-
543
- ```
544
- Incident Trends:
545
- ├── Number of Sev 1/2 incidents per month
546
- ├── Target: Declining or stable
547
- └── Red flag: Increasing trend
548
-
549
- Postmortem Action Item Completion:
550
- ├── % of action items completed on time
551
- ├── Target: > 80% completion
552
- └── Red flag: < 50% completion
553
-
554
- Automated Test Coverage:
555
- ├── % of critical paths covered
556
- ├── Target: > 70% for critical paths
557
- └── Not a vanity metric if focused on high-risk areas
558
- ```
559
-
560
- #### How to Present Metrics to Leadership
561
-
562
- **Dashboard structure:**
563
-
564
- ```
565
- 1. Health at a Glance (Top metrics)
566
- ├── 🟢 SLO Compliance: 99.8% (Target: 99%)
567
- ├── 🟡 Deployment Frequency: 3x/week (Target: Daily)
568
- ├── 🟢 Incident Rate: 2 Sev2 this month (Last month: 4)
569
- └── 🟢 Team Satisfaction: 8.2/10 (Target: 8+)
570
-
571
- 2. DORA Metrics Trend (Quarterly)
572
- [Chart showing improvement over time]
573
-
574
- 3. Focus Areas
575
- ├── ✅ Reduced incident rate by 50% this quarter
576
- ├── 🚧 Working on deployment frequency (automation initiative)
577
- └── ⚠️ Toil still high at 40% - hiring 2 more engineers
578
-
579
- 4. Asks
580
- ├── Budget for observability tooling ($50K)
581
- └── Approval to pause feature work next sprint for reliability
582
- ```
583
-
584
- ---
585
-
586
- ### 5. Balancing Feature Work vs Operational Excellence
587
-
588
- **The eternal tension:** Product wants features. You want stability.
589
-
590
- #### Resource Allocation Models
591
-
592
- **70-20-10 Rule (Recommended):**
593
- ```
594
- 70% Feature Work:
595
- ├── New features product wants
596
- ├── Customer-facing improvements
597
- └── Revenue-generating projects
598
-
599
- 20% Operational Excellence:
600
- ├── Tech debt paydown
601
- ├── Reliability improvements
602
- ├── Monitoring enhancements
603
- └── Automation
604
-
605
- 10% Innovation/Learning:
606
- ├── Explore new technologies
607
- ├── Hackathons
608
- ├── Learning time
609
- └── Experimentation
610
- ```
611
-
612
- **Adjust based on phase:**
613
-
614
- ```
615
- High Growth Phase:
616
- ├── 80% Features
617
- ├── 15% Ops Excellence
618
- └── 5% Innovation
619
-
620
- Stability Phase:
621
- ├── 50% Features
622
- ├── 40% Ops Excellence
623
- └── 10% Innovation
624
-
625
- Crisis Phase (Post-Incidents):
626
- ├── 30% Features
627
- ├── 60% Ops Excellence
628
- └── 10% Innovation
629
- ```
630
-
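The splits above translate directly into capacity per sprint or quarter. The sketch below assumes capacity is counted in engineer-days; `PHASE_SPLIT`, `allocate`, and the phase keys are hypothetical names chosen for illustration.

```python
# (features %, ops excellence %, innovation %) per phase, from the tables above.
PHASE_SPLIT = {
    "default": (70, 20, 10),
    "high_growth": (80, 15, 5),
    "stability": (50, 40, 10),
    "crisis": (30, 60, 10),
}


def allocate(engineer_days, phase="default"):
    """Split available engineer-days across the three buckets for a phase."""
    features, ops, innovation = PHASE_SPLIT[phase]
    return {
        "features": engineer_days * features / 100,
        "ops_excellence": engineer_days * ops / 100,
        "innovation": engineer_days * innovation / 100,
    }


print(allocate(200, "crisis"))
# {'features': 60.0, 'ops_excellence': 120.0, 'innovation': 20.0}
```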
631
- #### Negotiating with Product
632
-
633
- **When product asks for all-feature, no-ops time:**
634
-
635
- ```
636
- Scenario: "We need all engineers on Feature X for Q4"
637
-
638
- Your response framework:
639
- 1. Acknowledge business need
640
- "I understand Feature X is critical for revenue"
641
-
642
- 2. State operational reality
643
- "Our on-call load is high (8 pages/week) and error budget is 80% exhausted"
644
-
645
- 3. Present options
646
- Option A: All-in on features, risk of incidents and burnout
647
- Option B: 70-30 split, sustainable pace, lower incident risk
648
- Option C: Hire 2 more engineers to do both
649
-
650
- 4. Recommend
651
- "I recommend Option B - we'll deliver 70% of Feature X this quarter,
652
- and ensure we don't have outages that impact customers"
653
-
654
- 5. Make it their decision
655
- "What's your preference given these trade-offs?"
656
- ```
657
-
658
- **Using error budgets as negotiation tool:**
659
-
660
- ```
661
- Error budget = objective metric, not subjective
662
-
663
- If product wants to move fast:
664
- ├── Check error budget: 50% remaining?
665
- ├── ✅ Green light: "We have budget, let's ship!"
666
- └── 🛑 Budget exhausted: "We need to stabilize first"
667
-
668
- This removes emotion from discussion. It's data-driven.
669
- ```
670
-
671
- ---
672
-
673
- ### 6. Building Sustainable Operations Culture
674
-
675
- **Culture eats process for breakfast.**
676
-
677
- #### Key Cultural Values
678
-
679
- **1. Blameless Culture**
680
- ```
681
- When incidents happen:
682
- ❌ "Who broke it?" → ✅ "What broke?"
683
- ❌ "Why didn't you..." → ✅ "What can we learn?"
684
- ❌ Hide mistakes → ✅ Share failures openly
685
- ```
686
-
687
- **2. Automate Toil**
688
- ```
689
- Manual work is not a badge of honor.
690
- ├── Track toil percentage
691
- ├── Reward automation, not heroics
692
- └── "If you do it twice, automate it"
693
- ```
694
-
695
- **3. Sustainable On-Call**
696
- ```
697
- On-call is not punishment.
698
- ├── Fair rotation
699
- ├── Compensated fairly
700
- ├── Protected from burnout
701
- └── Escalation is encouraged, not a sign of weakness
702
- ```
703
-
704
- **4. Continuous Improvement**
705
- ```
706
- Every incident is a learning opportunity.
707
- ├── Postmortems are required, not optional
708
- ├── Action items are tracked and completed
709
- └── Celebrate fixes, not just features
710
- ```
711
-
712
- #### Manager Actions to Reinforce Culture
713
-
714
- **1. Lead by Example**
715
- - Participate in on-call rotation (if you're technical)
716
- - Admit your own mistakes publicly
717
- - Take postmortem action items yourself
718
-
719
- **2. Celebrate Operational Wins**
720
- - Shout out engineers who reduce toil
721
- - Highlight reliability improvements in team meetings
722
- - Give "Operational Excellence" awards
723
-
724
- **3. Protect Your Team**
725
- - Say no to unrealistic timelines
726
- - Push back on "just ship it" pressure
727
- - Shield team from org politics
728
-
729
- **4. Invest in Automation**
730
- - Allocate 20% capacity to ops excellence
731
- - Approve tool/platform budgets
732
- - Hire for automation skills
733
-
734
- ---
735
-
736
- ## Quick Reference for Managers
737
-
738
- **On-Call:**
739
- - Rotation: Weekly or follow-the-sun
740
- - Compensation: $500-$1000/week + pay-per-page
741
- - Burnout prevention: Max 1 week/month, comp time after incidents
742
-
743
- **Incidents:**
744
- - Your role: Support, don't take over
745
- - Blameless postmortems: Required for all Sev 1/2
746
- - Action items: Track and ensure completion
747
-
748
- **SLOs:**
749
- - Start with current performance + buffer
750
- - Use error budgets to negotiate with product
751
- - Review quarterly
752
-
753
- **Metrics:**
754
- - Use DORA metrics (deployment freq, lead time, MTTR, change failure rate)
755
- - Avoid vanity metrics (lines of code, commits, hours)
756
- - Track team health (satisfaction, retention)
757
-
758
- **Resource Allocation:**
759
- - 70% features, 20% ops excellence, 10% innovation
760
- - Adjust based on phase (growth vs stability)
761
-
762
- **Culture:**
763
- - Blameless, automate toil, sustainable on-call
764
- - Lead by example, celebrate ops wins
765
- - Protect team from burnout
766
-
767
- ### Culture Building Scenario
768
-
769
- **Scenario: "How do we build a sustainable ops culture?"**
770
- - **Blameless:**
771
- - Never "who broke it?" Always "what broke and how do we prevent it?"
772
- - Share postmortems openly - learn from all incidents
773
- - Reward transparency (issues caught early) over concealment (issues left to fester)
774
- - **Automate toil:**
775
- - Track toil percentage (target < 30%)
776
- - Dedicate 20% time to automation
777
- - Celebrate "we automated ourselves out of that problem"
778
- - **Sustainable on-call:**
779
- - No hero culture - don't celebrate all-nighters
780
- - Enforce comp time and breaks
781
- - Fix systems that cause repeated pages
782
- - **Recognition:**
783
- - Highlight ops wins in all-hands: "Automated X, saved 50 hours/month"
784
- - Incident response recognition: "Great job handling outage calmly"
785
- - Quality over speed: "Prevented incident with thorough testing"
786
-
787
- **Scenario: "How do we prevent hero culture?"**
788
- - **Heroes are a symptom of broken systems**
789
- - **Signs of hero culture:**
790
- - Same engineer always saves the day
791
- - Working nights/weekends is celebrated
792
- - "We need you" used as motivation
793
- - **How to fix:**
794
- - Document hero's knowledge → spread it
795
- - Automate hero's manual tasks
796
- - Create runbooks for common issues
797
- - Rotate responsibilities - don't depend on one person
798
- - **Say:** "I appreciate your dedication, but this is unsustainable. Let's fix the system so you don't need to be a hero."
799
-
800
- ---
801
-
802
- ## Integration with Other Skills
803
-
804
- **This skill works with:**
805
- - **engineering-management** - Hiring, career development, 1-on-1s
806
- - **technical-leadership** - Making technical decisions, risk assessment
807
- - **infrastructure-strategy** - Long-term planning, platform investment
808
- - **budget-and-cost-management** - On-call budgets, tooling costs
809
-
810
- **Technical skills your team uses:**
811
- - **sre** - Hands-on SLO implementation, incident response
812
- - **platform-engineering** - Building internal platforms that reduce toil
813
- - **cybersecurity** - Security incident response, compliance
814
-
815
- ---
816
-
817
- **Remember:** Your job is to build sustainable operations practices that enable long-term success, not short-term heroics. Protect your team from burnout. Measure what matters. Learn from every incident.