@wazir-dev/cli 1.0.0

This diff shows the changes between publicly released package versions as they appear in their public registries. It is provided for informational purposes only.
Files changed (629)
  1. package/AGENTS.md +111 -0
  2. package/CHANGELOG.md +14 -0
  3. package/CONTRIBUTING.md +101 -0
  4. package/LICENSE +21 -0
  5. package/README.md +314 -0
  6. package/assets/composition-engine.mmd +34 -0
  7. package/assets/demo-script.sh +17 -0
  8. package/assets/logo-dark.svg +14 -0
  9. package/assets/logo.svg +14 -0
  10. package/assets/pipeline.mmd +39 -0
  11. package/assets/record-demo.sh +51 -0
  12. package/docs/README.md +51 -0
  13. package/docs/adapters/context-mode.md +60 -0
  14. package/docs/concepts/architecture.md +87 -0
  15. package/docs/concepts/artifact-model.md +60 -0
  16. package/docs/concepts/composition-engine.md +36 -0
  17. package/docs/concepts/indexing-and-recall.md +160 -0
  18. package/docs/concepts/observability.md +41 -0
  19. package/docs/concepts/roles-and-workflows.md +59 -0
  20. package/docs/concepts/terminology-policy.md +27 -0
  21. package/docs/getting-started/01-installation.md +78 -0
  22. package/docs/getting-started/02-first-run.md +102 -0
  23. package/docs/getting-started/03-adding-to-project.md +15 -0
  24. package/docs/getting-started/04-host-setup.md +15 -0
  25. package/docs/guides/ci-integration.md +15 -0
  26. package/docs/guides/creating-skills.md +15 -0
  27. package/docs/guides/expertise-module-authoring.md +15 -0
  28. package/docs/guides/hook-development.md +15 -0
  29. package/docs/guides/memory-and-learnings.md +34 -0
  30. package/docs/guides/multi-host-export.md +15 -0
  31. package/docs/guides/troubleshooting.md +101 -0
  32. package/docs/guides/writing-custom-roles.md +15 -0
  33. package/docs/plans/2026-03-15-cli-pipeline-integration-design.md +592 -0
  34. package/docs/plans/2026-03-15-cli-pipeline-integration-plan.md +598 -0
  35. package/docs/plans/2026-03-15-docs-enforcement-plan.md +238 -0
  36. package/docs/readmes/INDEX.md +99 -0
  37. package/docs/readmes/features/expertise/README.md +171 -0
  38. package/docs/readmes/features/exports/README.md +222 -0
  39. package/docs/readmes/features/hooks/README.md +103 -0
  40. package/docs/readmes/features/hooks/loop-cap-guard.md +133 -0
  41. package/docs/readmes/features/hooks/post-tool-capture.md +121 -0
  42. package/docs/readmes/features/hooks/post-tool-lint.md +130 -0
  43. package/docs/readmes/features/hooks/pre-compact-summary.md +122 -0
  44. package/docs/readmes/features/hooks/pre-tool-capture-route.md +100 -0
  45. package/docs/readmes/features/hooks/protected-path-write-guard.md +128 -0
  46. package/docs/readmes/features/hooks/session-start.md +119 -0
  47. package/docs/readmes/features/hooks/stop-handoff-harvest.md +125 -0
  48. package/docs/readmes/features/roles/README.md +157 -0
  49. package/docs/readmes/features/roles/clarifier.md +152 -0
  50. package/docs/readmes/features/roles/content-author.md +190 -0
  51. package/docs/readmes/features/roles/designer.md +193 -0
  52. package/docs/readmes/features/roles/executor.md +184 -0
  53. package/docs/readmes/features/roles/learner.md +210 -0
  54. package/docs/readmes/features/roles/planner.md +182 -0
  55. package/docs/readmes/features/roles/researcher.md +164 -0
  56. package/docs/readmes/features/roles/reviewer.md +184 -0
  57. package/docs/readmes/features/roles/specifier.md +162 -0
  58. package/docs/readmes/features/roles/verifier.md +215 -0
  59. package/docs/readmes/features/schemas/README.md +178 -0
  60. package/docs/readmes/features/skills/README.md +63 -0
  61. package/docs/readmes/features/skills/brainstorming.md +96 -0
  62. package/docs/readmes/features/skills/debugging.md +148 -0
  63. package/docs/readmes/features/skills/design.md +120 -0
  64. package/docs/readmes/features/skills/prepare-next.md +109 -0
  65. package/docs/readmes/features/skills/run-audit.md +159 -0
  66. package/docs/readmes/features/skills/scan-project.md +109 -0
  67. package/docs/readmes/features/skills/self-audit.md +176 -0
  68. package/docs/readmes/features/skills/tdd.md +137 -0
  69. package/docs/readmes/features/skills/using-skills.md +92 -0
  70. package/docs/readmes/features/skills/verification.md +120 -0
  71. package/docs/readmes/features/skills/writing-plans.md +104 -0
  72. package/docs/readmes/features/tooling/README.md +320 -0
  73. package/docs/readmes/features/workflows/README.md +186 -0
  74. package/docs/readmes/features/workflows/author.md +181 -0
  75. package/docs/readmes/features/workflows/clarify.md +154 -0
  76. package/docs/readmes/features/workflows/design-review.md +171 -0
  77. package/docs/readmes/features/workflows/design.md +169 -0
  78. package/docs/readmes/features/workflows/discover.md +162 -0
  79. package/docs/readmes/features/workflows/execute.md +173 -0
  80. package/docs/readmes/features/workflows/learn.md +167 -0
  81. package/docs/readmes/features/workflows/plan-review.md +165 -0
  82. package/docs/readmes/features/workflows/plan.md +170 -0
  83. package/docs/readmes/features/workflows/prepare-next.md +167 -0
  84. package/docs/readmes/features/workflows/review.md +169 -0
  85. package/docs/readmes/features/workflows/run-audit.md +191 -0
  86. package/docs/readmes/features/workflows/spec-challenge.md +159 -0
  87. package/docs/readmes/features/workflows/specify.md +160 -0
  88. package/docs/readmes/features/workflows/verify.md +177 -0
  89. package/docs/readmes/packages/README.md +50 -0
  90. package/docs/readmes/packages/ajv.md +117 -0
  91. package/docs/readmes/packages/context-mode.md +118 -0
  92. package/docs/readmes/packages/gray-matter.md +116 -0
  93. package/docs/readmes/packages/node-test.md +137 -0
  94. package/docs/readmes/packages/yaml.md +112 -0
  95. package/docs/reference/configuration-reference.md +159 -0
  96. package/docs/reference/expertise-index.md +52 -0
  97. package/docs/reference/git-flow.md +43 -0
  98. package/docs/reference/hooks.md +87 -0
  99. package/docs/reference/host-exports.md +50 -0
  100. package/docs/reference/launch-checklist.md +172 -0
  101. package/docs/reference/marketplace-listings.md +76 -0
  102. package/docs/reference/release-process.md +34 -0
  103. package/docs/reference/roles-reference.md +77 -0
  104. package/docs/reference/skills.md +33 -0
  105. package/docs/reference/templates.md +29 -0
  106. package/docs/reference/tooling-cli.md +94 -0
  107. package/docs/truth-claims.yaml +222 -0
  108. package/expertise/PROGRESS.md +63 -0
  109. package/expertise/README.md +18 -0
  110. package/expertise/antipatterns/PROGRESS.md +56 -0
  111. package/expertise/antipatterns/backend/api-design-antipatterns.md +1271 -0
  112. package/expertise/antipatterns/backend/auth-antipatterns.md +1195 -0
  113. package/expertise/antipatterns/backend/caching-antipatterns.md +622 -0
  114. package/expertise/antipatterns/backend/database-antipatterns.md +1038 -0
  115. package/expertise/antipatterns/backend/index.md +24 -0
  116. package/expertise/antipatterns/backend/microservices-antipatterns.md +850 -0
  117. package/expertise/antipatterns/code/architecture-antipatterns.md +919 -0
  118. package/expertise/antipatterns/code/async-antipatterns.md +622 -0
  119. package/expertise/antipatterns/code/code-smells.md +1186 -0
  120. package/expertise/antipatterns/code/dependency-antipatterns.md +1209 -0
  121. package/expertise/antipatterns/code/error-handling-antipatterns.md +1360 -0
  122. package/expertise/antipatterns/code/index.md +27 -0
  123. package/expertise/antipatterns/code/naming-and-abstraction.md +1118 -0
  124. package/expertise/antipatterns/code/state-management-antipatterns.md +1076 -0
  125. package/expertise/antipatterns/code/testing-antipatterns.md +1053 -0
  126. package/expertise/antipatterns/design/accessibility-antipatterns.md +1136 -0
  127. package/expertise/antipatterns/design/dark-patterns.md +1121 -0
  128. package/expertise/antipatterns/design/index.md +22 -0
  129. package/expertise/antipatterns/design/ui-antipatterns.md +1202 -0
  130. package/expertise/antipatterns/design/ux-antipatterns.md +680 -0
  131. package/expertise/antipatterns/frontend/css-layout-antipatterns.md +691 -0
  132. package/expertise/antipatterns/frontend/flutter-antipatterns.md +1827 -0
  133. package/expertise/antipatterns/frontend/index.md +23 -0
  134. package/expertise/antipatterns/frontend/mobile-antipatterns.md +573 -0
  135. package/expertise/antipatterns/frontend/react-antipatterns.md +1128 -0
  136. package/expertise/antipatterns/frontend/spa-antipatterns.md +1235 -0
  137. package/expertise/antipatterns/index.md +31 -0
  138. package/expertise/antipatterns/performance/index.md +20 -0
  139. package/expertise/antipatterns/performance/performance-antipatterns.md +1013 -0
  140. package/expertise/antipatterns/performance/premature-optimization.md +623 -0
  141. package/expertise/antipatterns/performance/scaling-antipatterns.md +785 -0
  142. package/expertise/antipatterns/process/ai-coding-antipatterns.md +853 -0
  143. package/expertise/antipatterns/process/code-review-antipatterns.md +656 -0
  144. package/expertise/antipatterns/process/deployment-antipatterns.md +920 -0
  145. package/expertise/antipatterns/process/index.md +23 -0
  146. package/expertise/antipatterns/process/technical-debt-antipatterns.md +647 -0
  147. package/expertise/antipatterns/security/index.md +20 -0
  148. package/expertise/antipatterns/security/secrets-antipatterns.md +849 -0
  149. package/expertise/antipatterns/security/security-theater.md +843 -0
  150. package/expertise/antipatterns/security/vulnerability-patterns.md +801 -0
  151. package/expertise/architecture/PROGRESS.md +70 -0
  152. package/expertise/architecture/data/caching-architecture.md +671 -0
  153. package/expertise/architecture/data/data-consistency.md +574 -0
  154. package/expertise/architecture/data/data-modeling.md +536 -0
  155. package/expertise/architecture/data/event-streams-and-queues.md +634 -0
  156. package/expertise/architecture/data/index.md +25 -0
  157. package/expertise/architecture/data/search-architecture.md +663 -0
  158. package/expertise/architecture/data/sql-vs-nosql.md +708 -0
  159. package/expertise/architecture/decisions/architecture-decision-records.md +640 -0
  160. package/expertise/architecture/decisions/build-vs-buy.md +616 -0
  161. package/expertise/architecture/decisions/index.md +23 -0
  162. package/expertise/architecture/decisions/monolith-to-microservices.md +790 -0
  163. package/expertise/architecture/decisions/technology-selection.md +616 -0
  164. package/expertise/architecture/distributed/cap-theorem-and-tradeoffs.md +800 -0
  165. package/expertise/architecture/distributed/circuit-breaker-bulkhead.md +741 -0
  166. package/expertise/architecture/distributed/consensus-and-coordination.md +796 -0
  167. package/expertise/architecture/distributed/distributed-systems-fundamentals.md +564 -0
  168. package/expertise/architecture/distributed/idempotency-and-retry.md +796 -0
  169. package/expertise/architecture/distributed/index.md +25 -0
  170. package/expertise/architecture/distributed/saga-pattern.md +797 -0
  171. package/expertise/architecture/foundations/architectural-thinking.md +460 -0
  172. package/expertise/architecture/foundations/coupling-and-cohesion.md +770 -0
  173. package/expertise/architecture/foundations/design-principles-solid.md +649 -0
  174. package/expertise/architecture/foundations/domain-driven-design.md +719 -0
  175. package/expertise/architecture/foundations/index.md +25 -0
  176. package/expertise/architecture/foundations/separation-of-concerns.md +472 -0
  177. package/expertise/architecture/foundations/twelve-factor-app.md +797 -0
  178. package/expertise/architecture/index.md +34 -0
  179. package/expertise/architecture/integration/api-design-graphql.md +638 -0
  180. package/expertise/architecture/integration/api-design-grpc.md +804 -0
  181. package/expertise/architecture/integration/api-design-rest.md +892 -0
  182. package/expertise/architecture/integration/index.md +25 -0
  183. package/expertise/architecture/integration/third-party-integration.md +795 -0
  184. package/expertise/architecture/integration/webhooks-and-callbacks.md +1152 -0
  185. package/expertise/architecture/integration/websockets-realtime.md +791 -0
  186. package/expertise/architecture/mobile-architecture/index.md +22 -0
  187. package/expertise/architecture/mobile-architecture/mobile-app-architecture.md +780 -0
  188. package/expertise/architecture/mobile-architecture/mobile-backend-for-frontend.md +670 -0
  189. package/expertise/architecture/mobile-architecture/offline-first.md +719 -0
  190. package/expertise/architecture/mobile-architecture/push-and-sync.md +782 -0
  191. package/expertise/architecture/patterns/cqrs-event-sourcing.md +717 -0
  192. package/expertise/architecture/patterns/event-driven.md +797 -0
  193. package/expertise/architecture/patterns/hexagonal-clean-architecture.md +870 -0
  194. package/expertise/architecture/patterns/index.md +27 -0
  195. package/expertise/architecture/patterns/layered-architecture.md +736 -0
  196. package/expertise/architecture/patterns/microservices.md +753 -0
  197. package/expertise/architecture/patterns/modular-monolith.md +692 -0
  198. package/expertise/architecture/patterns/monolith.md +626 -0
  199. package/expertise/architecture/patterns/plugin-architecture.md +735 -0
  200. package/expertise/architecture/patterns/serverless.md +780 -0
  201. package/expertise/architecture/scaling/database-scaling.md +615 -0
  202. package/expertise/architecture/scaling/feature-flags-and-rollouts.md +757 -0
  203. package/expertise/architecture/scaling/horizontal-vs-vertical.md +606 -0
  204. package/expertise/architecture/scaling/index.md +24 -0
  205. package/expertise/architecture/scaling/multi-tenancy.md +800 -0
  206. package/expertise/architecture/scaling/stateless-design.md +787 -0
  207. package/expertise/backend/embedded-firmware.md +625 -0
  208. package/expertise/backend/go.md +853 -0
  209. package/expertise/backend/index.md +24 -0
  210. package/expertise/backend/java-spring.md +448 -0
  211. package/expertise/backend/node-typescript.md +625 -0
  212. package/expertise/backend/python-fastapi.md +724 -0
  213. package/expertise/backend/rust.md +458 -0
  214. package/expertise/backend/solidity.md +711 -0
  215. package/expertise/composition-map.yaml +443 -0
  216. package/expertise/content/foundations/content-modeling.md +395 -0
  217. package/expertise/content/foundations/editorial-standards.md +449 -0
  218. package/expertise/content/foundations/index.md +24 -0
  219. package/expertise/content/foundations/microcopy.md +455 -0
  220. package/expertise/content/foundations/terminology-governance.md +509 -0
  221. package/expertise/content/index.md +34 -0
  222. package/expertise/content/patterns/accessibility-copy.md +518 -0
  223. package/expertise/content/patterns/index.md +24 -0
  224. package/expertise/content/patterns/notification-content.md +433 -0
  225. package/expertise/content/patterns/sample-content.md +486 -0
  226. package/expertise/content/patterns/state-copy.md +439 -0
  227. package/expertise/design/PROGRESS.md +58 -0
  228. package/expertise/design/disciplines/dark-mode-theming.md +577 -0
  229. package/expertise/design/disciplines/design-systems.md +595 -0
  230. package/expertise/design/disciplines/index.md +25 -0
  231. package/expertise/design/disciplines/information-architecture.md +800 -0
  232. package/expertise/design/disciplines/interaction-design.md +788 -0
  233. package/expertise/design/disciplines/responsive-design.md +552 -0
  234. package/expertise/design/disciplines/usability-testing.md +516 -0
  235. package/expertise/design/disciplines/user-research.md +792 -0
  236. package/expertise/design/foundations/accessibility-design.md +796 -0
  237. package/expertise/design/foundations/color-theory.md +797 -0
  238. package/expertise/design/foundations/iconography.md +795 -0
  239. package/expertise/design/foundations/index.md +26 -0
  240. package/expertise/design/foundations/motion-and-animation.md +653 -0
  241. package/expertise/design/foundations/rtl-design.md +585 -0
  242. package/expertise/design/foundations/spacing-and-layout.md +607 -0
  243. package/expertise/design/foundations/typography.md +800 -0
  244. package/expertise/design/foundations/visual-hierarchy.md +761 -0
  245. package/expertise/design/index.md +32 -0
  246. package/expertise/design/patterns/authentication-flows.md +474 -0
  247. package/expertise/design/patterns/content-consumption.md +789 -0
  248. package/expertise/design/patterns/data-display.md +618 -0
  249. package/expertise/design/patterns/e-commerce.md +1494 -0
  250. package/expertise/design/patterns/feedback-and-states.md +642 -0
  251. package/expertise/design/patterns/forms-and-input.md +819 -0
  252. package/expertise/design/patterns/gamification.md +801 -0
  253. package/expertise/design/patterns/index.md +31 -0
  254. package/expertise/design/patterns/microinteractions.md +449 -0
  255. package/expertise/design/patterns/navigation.md +800 -0
  256. package/expertise/design/patterns/notifications.md +705 -0
  257. package/expertise/design/patterns/onboarding.md +700 -0
  258. package/expertise/design/patterns/search-and-filter.md +601 -0
  259. package/expertise/design/patterns/settings-and-preferences.md +768 -0
  260. package/expertise/design/patterns/social-and-community.md +748 -0
  261. package/expertise/design/platforms/desktop-native.md +612 -0
  262. package/expertise/design/platforms/index.md +25 -0
  263. package/expertise/design/platforms/mobile-android.md +825 -0
  264. package/expertise/design/platforms/mobile-cross-platform.md +983 -0
  265. package/expertise/design/platforms/mobile-ios.md +699 -0
  266. package/expertise/design/platforms/tablet.md +794 -0
  267. package/expertise/design/platforms/web-dashboard.md +790 -0
  268. package/expertise/design/platforms/web-responsive.md +550 -0
  269. package/expertise/design/psychology/behavioral-nudges.md +449 -0
  270. package/expertise/design/psychology/cognitive-load.md +1191 -0
  271. package/expertise/design/psychology/error-psychology.md +778 -0
  272. package/expertise/design/psychology/index.md +22 -0
  273. package/expertise/design/psychology/persuasive-design.md +736 -0
  274. package/expertise/design/psychology/user-mental-models.md +623 -0
  275. package/expertise/design/tooling/open-pencil.md +266 -0
  276. package/expertise/frontend/angular.md +1073 -0
  277. package/expertise/frontend/desktop-electron.md +546 -0
  278. package/expertise/frontend/flutter.md +782 -0
  279. package/expertise/frontend/index.md +27 -0
  280. package/expertise/frontend/native-android.md +409 -0
  281. package/expertise/frontend/native-ios.md +490 -0
  282. package/expertise/frontend/react-native.md +1160 -0
  283. package/expertise/frontend/react.md +808 -0
  284. package/expertise/frontend/vue.md +1089 -0
  285. package/expertise/humanize/domain-rules-code.md +79 -0
  286. package/expertise/humanize/domain-rules-content.md +67 -0
  287. package/expertise/humanize/domain-rules-technical-docs.md +56 -0
  288. package/expertise/humanize/index.md +35 -0
  289. package/expertise/humanize/self-audit-checklist.md +87 -0
  290. package/expertise/humanize/sentence-patterns.md +218 -0
  291. package/expertise/humanize/vocabulary-blacklist.md +105 -0
  292. package/expertise/i18n/PROGRESS.md +65 -0
  293. package/expertise/i18n/advanced/accessibility-and-i18n.md +28 -0
  294. package/expertise/i18n/advanced/bidirectional-text-algorithm.md +38 -0
  295. package/expertise/i18n/advanced/complex-scripts.md +30 -0
  296. package/expertise/i18n/advanced/performance-and-i18n.md +27 -0
  297. package/expertise/i18n/advanced/testing-i18n.md +28 -0
  298. package/expertise/i18n/content/content-adaptation.md +23 -0
  299. package/expertise/i18n/content/locale-specific-formatting.md +23 -0
  300. package/expertise/i18n/content/machine-translation-integration.md +28 -0
  301. package/expertise/i18n/content/translation-management.md +29 -0
  302. package/expertise/i18n/foundations/date-time-calendars.md +67 -0
  303. package/expertise/i18n/foundations/i18n-architecture.md +272 -0
  304. package/expertise/i18n/foundations/locale-and-language-tags.md +79 -0
  305. package/expertise/i18n/foundations/numbers-currency-units.md +61 -0
  306. package/expertise/i18n/foundations/pluralization-and-gender.md +109 -0
  307. package/expertise/i18n/foundations/string-externalization.md +236 -0
  308. package/expertise/i18n/foundations/text-direction-bidi.md +241 -0
  309. package/expertise/i18n/foundations/unicode-and-encoding.md +86 -0
  310. package/expertise/i18n/index.md +38 -0
  311. package/expertise/i18n/platform/backend-i18n.md +31 -0
  312. package/expertise/i18n/platform/flutter-i18n.md +148 -0
  313. package/expertise/i18n/platform/native-android-i18n.md +36 -0
  314. package/expertise/i18n/platform/native-ios-i18n.md +36 -0
  315. package/expertise/i18n/platform/react-i18n.md +103 -0
  316. package/expertise/i18n/platform/web-css-i18n.md +81 -0
  317. package/expertise/i18n/rtl/arabic-specific.md +175 -0
  318. package/expertise/i18n/rtl/hebrew-specific.md +149 -0
  319. package/expertise/i18n/rtl/rtl-animations-and-transitions.md +111 -0
  320. package/expertise/i18n/rtl/rtl-forms-and-input.md +161 -0
  321. package/expertise/i18n/rtl/rtl-fundamentals.md +211 -0
  322. package/expertise/i18n/rtl/rtl-icons-and-images.md +181 -0
  323. package/expertise/i18n/rtl/rtl-layout-mirroring.md +252 -0
  324. package/expertise/i18n/rtl/rtl-navigation-and-gestures.md +107 -0
  325. package/expertise/i18n/rtl/rtl-testing-and-qa.md +147 -0
  326. package/expertise/i18n/rtl/rtl-typography.md +160 -0
  327. package/expertise/index.md +113 -0
  328. package/expertise/index.yaml +216 -0
  329. package/expertise/infrastructure/cloud-aws.md +597 -0
  330. package/expertise/infrastructure/cloud-gcp.md +599 -0
  331. package/expertise/infrastructure/cybersecurity.md +816 -0
  332. package/expertise/infrastructure/database-mongodb.md +447 -0
  333. package/expertise/infrastructure/database-postgres.md +400 -0
  334. package/expertise/infrastructure/devops-cicd.md +787 -0
  335. package/expertise/infrastructure/index.md +27 -0
  336. package/expertise/performance/PROGRESS.md +50 -0
  337. package/expertise/performance/backend/api-latency.md +1204 -0
  338. package/expertise/performance/backend/background-jobs.md +506 -0
  339. package/expertise/performance/backend/connection-pooling.md +1209 -0
  340. package/expertise/performance/backend/database-query-optimization.md +515 -0
  341. package/expertise/performance/backend/index.md +23 -0
  342. package/expertise/performance/backend/rate-limiting-and-throttling.md +971 -0
  343. package/expertise/performance/foundations/algorithmic-complexity.md +954 -0
  344. package/expertise/performance/foundations/caching-strategies.md +489 -0
  345. package/expertise/performance/foundations/concurrency-and-parallelism.md +847 -0
  346. package/expertise/performance/foundations/index.md +24 -0
  347. package/expertise/performance/foundations/measuring-and-profiling.md +440 -0
  348. package/expertise/performance/foundations/memory-management.md +964 -0
  349. package/expertise/performance/foundations/performance-budgets.md +1314 -0
  350. package/expertise/performance/index.md +31 -0
  351. package/expertise/performance/infrastructure/auto-scaling.md +1059 -0
  352. package/expertise/performance/infrastructure/cdn-and-edge.md +1081 -0
  353. package/expertise/performance/infrastructure/index.md +22 -0
  354. package/expertise/performance/infrastructure/load-balancing.md +1081 -0
  355. package/expertise/performance/infrastructure/observability.md +1079 -0
  356. package/expertise/performance/mobile/index.md +23 -0
  357. package/expertise/performance/mobile/mobile-animations.md +544 -0
  358. package/expertise/performance/mobile/mobile-memory-battery.md +416 -0
  359. package/expertise/performance/mobile/mobile-network.md +452 -0
  360. package/expertise/performance/mobile/mobile-rendering.md +599 -0
  361. package/expertise/performance/mobile/mobile-startup-time.md +505 -0
  362. package/expertise/performance/platform-specific/flutter-performance.md +647 -0
  363. package/expertise/performance/platform-specific/index.md +22 -0
  364. package/expertise/performance/platform-specific/node-performance.md +1307 -0
  365. package/expertise/performance/platform-specific/postgres-performance.md +1366 -0
  366. package/expertise/performance/platform-specific/react-performance.md +1403 -0
  367. package/expertise/performance/web/bundle-optimization.md +1239 -0
  368. package/expertise/performance/web/image-and-media.md +636 -0
  369. package/expertise/performance/web/index.md +24 -0
  370. package/expertise/performance/web/network-optimization.md +1133 -0
  371. package/expertise/performance/web/rendering-performance.md +1098 -0
  372. package/expertise/performance/web/ssr-and-hydration.md +918 -0
  373. package/expertise/performance/web/web-vitals.md +1374 -0
  374. package/expertise/quality/accessibility.md +985 -0
  375. package/expertise/quality/evidence-based-verification.md +499 -0
  376. package/expertise/quality/index.md +24 -0
  377. package/expertise/quality/ml-model-audit.md +614 -0
  378. package/expertise/quality/performance.md +600 -0
  379. package/expertise/quality/testing-api.md +891 -0
  380. package/expertise/quality/testing-mobile.md +496 -0
  381. package/expertise/quality/testing-web.md +849 -0
  382. package/expertise/security/PROGRESS.md +54 -0
  383. package/expertise/security/agentic-identity.md +540 -0
  384. package/expertise/security/compliance-frameworks.md +601 -0
  385. package/expertise/security/data/data-encryption.md +364 -0
  386. package/expertise/security/data/data-privacy-gdpr.md +692 -0
  387. package/expertise/security/data/database-security.md +1171 -0
  388. package/expertise/security/data/index.md +22 -0
  389. package/expertise/security/data/pii-handling.md +531 -0
  390. package/expertise/security/foundations/authentication.md +1041 -0
  391. package/expertise/security/foundations/authorization.md +603 -0
  392. package/expertise/security/foundations/cryptography.md +1001 -0
  393. package/expertise/security/foundations/index.md +25 -0
  394. package/expertise/security/foundations/owasp-top-10.md +1354 -0
  395. package/expertise/security/foundations/secrets-management.md +1217 -0
  396. package/expertise/security/foundations/secure-sdlc.md +700 -0
  397. package/expertise/security/foundations/supply-chain-security.md +698 -0
  398. package/expertise/security/index.md +31 -0
  399. package/expertise/security/infrastructure/cloud-security-aws.md +1296 -0
  400. package/expertise/security/infrastructure/cloud-security-gcp.md +1376 -0
  401. package/expertise/security/infrastructure/container-security.md +721 -0
  402. package/expertise/security/infrastructure/incident-response.md +1295 -0
  403. package/expertise/security/infrastructure/index.md +24 -0
  404. package/expertise/security/infrastructure/logging-and-monitoring.md +1618 -0
  405. package/expertise/security/infrastructure/network-security.md +1337 -0
  406. package/expertise/security/mobile/index.md +23 -0
  407. package/expertise/security/mobile/mobile-android-security.md +1218 -0
  408. package/expertise/security/mobile/mobile-binary-protection.md +1229 -0
  409. package/expertise/security/mobile/mobile-data-storage.md +1265 -0
  410. package/expertise/security/mobile/mobile-ios-security.md +1401 -0
  411. package/expertise/security/mobile/mobile-network-security.md +1520 -0
  412. package/expertise/security/smart-contract-security.md +594 -0
  413. package/expertise/security/testing/index.md +22 -0
  414. package/expertise/security/testing/penetration-testing.md +1258 -0
  415. package/expertise/security/testing/security-code-review.md +1765 -0
  416. package/expertise/security/testing/threat-modeling.md +1074 -0
  417. package/expertise/security/testing/vulnerability-scanning.md +1062 -0
  418. package/expertise/security/web/api-security.md +586 -0
  419. package/expertise/security/web/cors-and-headers.md +433 -0
  420. package/expertise/security/web/csrf.md +562 -0
  421. package/expertise/security/web/file-upload.md +1477 -0
  422. package/expertise/security/web/index.md +25 -0
  423. package/expertise/security/web/injection.md +1375 -0
  424. package/expertise/security/web/session-management.md +1101 -0
  425. package/expertise/security/web/xss.md +1158 -0
  426. package/exports/README.md +17 -0
  427. package/exports/hosts/claude/.claude/agents/clarifier.md +42 -0
  428. package/exports/hosts/claude/.claude/agents/content-author.md +63 -0
  429. package/exports/hosts/claude/.claude/agents/designer.md +55 -0
  430. package/exports/hosts/claude/.claude/agents/executor.md +55 -0
  431. package/exports/hosts/claude/.claude/agents/learner.md +51 -0
  432. package/exports/hosts/claude/.claude/agents/planner.md +53 -0
  433. package/exports/hosts/claude/.claude/agents/researcher.md +43 -0
  434. package/exports/hosts/claude/.claude/agents/reviewer.md +54 -0
  435. package/exports/hosts/claude/.claude/agents/specifier.md +47 -0
  436. package/exports/hosts/claude/.claude/agents/verifier.md +71 -0
  437. package/exports/hosts/claude/.claude/commands/author.md +42 -0
  438. package/exports/hosts/claude/.claude/commands/clarify.md +38 -0
  439. package/exports/hosts/claude/.claude/commands/design-review.md +46 -0
  440. package/exports/hosts/claude/.claude/commands/design.md +44 -0
  441. package/exports/hosts/claude/.claude/commands/discover.md +37 -0
  442. package/exports/hosts/claude/.claude/commands/execute.md +48 -0
  443. package/exports/hosts/claude/.claude/commands/learn.md +38 -0
  444. package/exports/hosts/claude/.claude/commands/plan-review.md +42 -0
  445. package/exports/hosts/claude/.claude/commands/plan.md +39 -0
  446. package/exports/hosts/claude/.claude/commands/prepare-next.md +37 -0
  447. package/exports/hosts/claude/.claude/commands/review.md +40 -0
  448. package/exports/hosts/claude/.claude/commands/run-audit.md +41 -0
  449. package/exports/hosts/claude/.claude/commands/spec-challenge.md +41 -0
  450. package/exports/hosts/claude/.claude/commands/specify.md +38 -0
  451. package/exports/hosts/claude/.claude/commands/verify.md +37 -0
  452. package/exports/hosts/claude/.claude/settings.json +34 -0
  453. package/exports/hosts/claude/CLAUDE.md +19 -0
  454. package/exports/hosts/claude/export.manifest.json +38 -0
  455. package/exports/hosts/claude/host-package.json +67 -0
  456. package/exports/hosts/codex/AGENTS.md +19 -0
  457. package/exports/hosts/codex/export.manifest.json +38 -0
  458. package/exports/hosts/codex/host-package.json +41 -0
  459. package/exports/hosts/cursor/.cursor/hooks.json +16 -0
  460. package/exports/hosts/cursor/.cursor/rules/wazir-core.mdc +19 -0
  461. package/exports/hosts/cursor/export.manifest.json +38 -0
  462. package/exports/hosts/cursor/host-package.json +42 -0
  463. package/exports/hosts/gemini/GEMINI.md +19 -0
  464. package/exports/hosts/gemini/export.manifest.json +38 -0
  465. package/exports/hosts/gemini/host-package.json +41 -0
  466. package/hooks/README.md +18 -0
  467. package/hooks/definitions/loop_cap_guard.yaml +21 -0
  468. package/hooks/definitions/post_tool_capture.yaml +24 -0
  469. package/hooks/definitions/pre_compact_summary.yaml +19 -0
  470. package/hooks/definitions/pre_tool_capture_route.yaml +19 -0
  471. package/hooks/definitions/protected_path_write_guard.yaml +19 -0
  472. package/hooks/definitions/session_start.yaml +19 -0
  473. package/hooks/definitions/stop_handoff_harvest.yaml +20 -0
  474. package/hooks/loop-cap-guard +17 -0
  475. package/hooks/post-tool-lint +36 -0
  476. package/hooks/protected-path-write-guard +17 -0
  477. package/hooks/session-start +41 -0
  478. package/llms-full.txt +2355 -0
  479. package/llms.txt +43 -0
  480. package/package.json +79 -0
  481. package/roles/README.md +20 -0
  482. package/roles/clarifier.md +42 -0
  483. package/roles/content-author.md +63 -0
  484. package/roles/designer.md +55 -0
  485. package/roles/executor.md +55 -0
  486. package/roles/learner.md +51 -0
  487. package/roles/planner.md +53 -0
  488. package/roles/researcher.md +43 -0
  489. package/roles/reviewer.md +54 -0
  490. package/roles/specifier.md +47 -0
  491. package/roles/verifier.md +71 -0
  492. package/schemas/README.md +24 -0
  493. package/schemas/accepted-learning.schema.json +20 -0
  494. package/schemas/author-artifact.schema.json +156 -0
  495. package/schemas/clarification.schema.json +19 -0
  496. package/schemas/design-artifact.schema.json +80 -0
  497. package/schemas/docs-claim.schema.json +18 -0
  498. package/schemas/export-manifest.schema.json +20 -0
  499. package/schemas/hook.schema.json +67 -0
  500. package/schemas/host-export-package.schema.json +18 -0
  501. package/schemas/implementation-plan.schema.json +19 -0
  502. package/schemas/proposed-learning.schema.json +19 -0
  503. package/schemas/research.schema.json +18 -0
  504. package/schemas/review.schema.json +29 -0
  505. package/schemas/run-manifest.schema.json +18 -0
  506. package/schemas/spec-challenge.schema.json +18 -0
  507. package/schemas/spec.schema.json +20 -0
  508. package/schemas/usage.schema.json +102 -0
  509. package/schemas/verification-proof.schema.json +29 -0
  510. package/schemas/wazir-manifest.schema.json +173 -0
  511. package/skills/README.md +40 -0
  512. package/skills/brainstorming/SKILL.md +77 -0
  513. package/skills/debugging/SKILL.md +50 -0
  514. package/skills/design/SKILL.md +61 -0
  515. package/skills/dispatching-parallel-agents/SKILL.md +128 -0
  516. package/skills/executing-plans/SKILL.md +70 -0
  517. package/skills/finishing-a-development-branch/SKILL.md +169 -0
  518. package/skills/humanize/SKILL.md +123 -0
  519. package/skills/init-pipeline/SKILL.md +124 -0
  520. package/skills/prepare-next/SKILL.md +20 -0
  521. package/skills/receiving-code-review/SKILL.md +123 -0
  522. package/skills/requesting-code-review/SKILL.md +105 -0
  523. package/skills/requesting-code-review/code-reviewer.md +108 -0
  524. package/skills/run-audit/SKILL.md +197 -0
  525. package/skills/scan-project/SKILL.md +41 -0
  526. package/skills/self-audit/SKILL.md +153 -0
  527. package/skills/subagent-driven-development/SKILL.md +154 -0
  528. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +26 -0
  529. package/skills/subagent-driven-development/implementer-prompt.md +102 -0
  530. package/skills/subagent-driven-development/spec-reviewer-prompt.md +61 -0
  531. package/skills/tdd/SKILL.md +23 -0
  532. package/skills/using-git-worktrees/SKILL.md +163 -0
  533. package/skills/using-skills/SKILL.md +95 -0
  534. package/skills/verification/SKILL.md +22 -0
  535. package/skills/wazir/SKILL.md +463 -0
  536. package/skills/writing-plans/SKILL.md +30 -0
  537. package/skills/writing-skills/SKILL.md +157 -0
  538. package/skills/writing-skills/anthropic-best-practices.md +122 -0
  539. package/skills/writing-skills/persuasion-principles.md +50 -0
  540. package/templates/README.md +20 -0
  541. package/templates/artifacts/README.md +10 -0
  542. package/templates/artifacts/accepted-learning.md +19 -0
  543. package/templates/artifacts/accepted-learning.template.json +12 -0
  544. package/templates/artifacts/author.md +74 -0
  545. package/templates/artifacts/author.template.json +19 -0
  546. package/templates/artifacts/clarification.md +21 -0
  547. package/templates/artifacts/clarification.template.json +12 -0
  548. package/templates/artifacts/execute-notes.md +19 -0
  549. package/templates/artifacts/implementation-plan.md +21 -0
  550. package/templates/artifacts/implementation-plan.template.json +11 -0
  551. package/templates/artifacts/learning-proposal.md +19 -0
  552. package/templates/artifacts/next-run-handoff.md +21 -0
  553. package/templates/artifacts/plan-review.md +19 -0
  554. package/templates/artifacts/proposed-learning.template.json +12 -0
  555. package/templates/artifacts/research.md +21 -0
  556. package/templates/artifacts/research.template.json +12 -0
  557. package/templates/artifacts/review-findings.md +19 -0
  558. package/templates/artifacts/review.template.json +11 -0
  559. package/templates/artifacts/run-manifest.template.json +8 -0
  560. package/templates/artifacts/spec-challenge.md +19 -0
  561. package/templates/artifacts/spec-challenge.template.json +11 -0
  562. package/templates/artifacts/spec.md +21 -0
  563. package/templates/artifacts/spec.template.json +12 -0
  564. package/templates/artifacts/verification-proof.md +19 -0
  565. package/templates/artifacts/verification-proof.template.json +11 -0
  566. package/templates/examples/accepted-learning.example.json +14 -0
  567. package/templates/examples/author.example.json +152 -0
  568. package/templates/examples/clarification.example.json +15 -0
  569. package/templates/examples/docs-claim.example.json +8 -0
  570. package/templates/examples/export-manifest.example.json +7 -0
  571. package/templates/examples/host-export-package.example.json +11 -0
  572. package/templates/examples/implementation-plan.example.json +17 -0
  573. package/templates/examples/proposed-learning.example.json +13 -0
  574. package/templates/examples/research.example.json +15 -0
  575. package/templates/examples/research.example.md +6 -0
  576. package/templates/examples/review.example.json +17 -0
  577. package/templates/examples/run-manifest.example.json +9 -0
  578. package/templates/examples/spec-challenge.example.json +14 -0
  579. package/templates/examples/spec.example.json +21 -0
  580. package/templates/examples/verification-proof.example.json +21 -0
  581. package/templates/examples/wazir-manifest.example.yaml +65 -0
  582. package/templates/task-definition-schema.md +99 -0
  583. package/tooling/README.md +20 -0
  584. package/tooling/src/adapters/context-mode.js +50 -0
  585. package/tooling/src/capture/command.js +376 -0
  586. package/tooling/src/capture/store.js +99 -0
  587. package/tooling/src/capture/usage.js +270 -0
  588. package/tooling/src/checks/branches.js +50 -0
  589. package/tooling/src/checks/brand-truth.js +110 -0
  590. package/tooling/src/checks/changelog.js +231 -0
  591. package/tooling/src/checks/command-registry.js +36 -0
  592. package/tooling/src/checks/commits.js +102 -0
  593. package/tooling/src/checks/docs-drift.js +103 -0
  594. package/tooling/src/checks/docs-truth.js +201 -0
  595. package/tooling/src/checks/runtime-surface.js +156 -0
  596. package/tooling/src/cli.js +116 -0
  597. package/tooling/src/command-options.js +56 -0
  598. package/tooling/src/commands/validate.js +320 -0
  599. package/tooling/src/doctor/command.js +91 -0
  600. package/tooling/src/export/command.js +77 -0
  601. package/tooling/src/export/compiler.js +498 -0
  602. package/tooling/src/guards/loop-cap-guard.js +52 -0
  603. package/tooling/src/guards/protected-path-write-guard.js +67 -0
  604. package/tooling/src/index/command.js +152 -0
  605. package/tooling/src/index/storage.js +1061 -0
  606. package/tooling/src/index/summarizers.js +261 -0
  607. package/tooling/src/loaders.js +18 -0
  608. package/tooling/src/project-root.js +22 -0
  609. package/tooling/src/recall/command.js +225 -0
  610. package/tooling/src/schema-validator.js +30 -0
  611. package/tooling/src/state-root.js +40 -0
  612. package/tooling/src/status/command.js +71 -0
  613. package/wazir.manifest.yaml +135 -0
  614. package/workflows/README.md +19 -0
  615. package/workflows/author.md +42 -0
  616. package/workflows/clarify.md +38 -0
  617. package/workflows/design-review.md +46 -0
  618. package/workflows/design.md +44 -0
  619. package/workflows/discover.md +37 -0
  620. package/workflows/execute.md +48 -0
  621. package/workflows/learn.md +38 -0
  622. package/workflows/plan-review.md +42 -0
  623. package/workflows/plan.md +39 -0
  624. package/workflows/prepare-next.md +37 -0
  625. package/workflows/review.md +40 -0
  626. package/workflows/run-audit.md +41 -0
  627. package/workflows/spec-challenge.md +41 -0
  628. package/workflows/specify.md +38 -0
  629. package/workflows/verify.md +37 -0
@@ -0,0 +1,741 @@
1
+ # Circuit Breaker and Bulkhead — Architecture Expertise Module
2
+
3
+ > Circuit Breaker prevents cascading failures by stopping calls to a failing service. Bulkhead isolates failures to prevent one slow service from consuming all resources. Together, they are the primary defense against cascade failures in distributed systems — the most dangerous failure mode in microservices.
4
+
5
+ > **Category:** Distributed
6
+ > **Complexity:** Moderate
7
+ > **Applies when:** Any system making calls to external services, databases, or other microservices — especially when a downstream failure could cascade upstream
8
+
9
+ ---
10
+
11
+ ## What This Is (and What It Isn't)
12
+
13
+ ### Circuit Breaker
14
+
15
+ A circuit breaker is a **state machine that sits between a caller and a callee**, monitoring call outcomes and automatically stopping calls when the callee is failing. The name comes from electrical circuit breakers — when too much current flows (too many failures occur), the breaker trips (opens) to prevent damage (cascading failure). When the situation resolves, the breaker resets (closes) to allow current (calls) to flow again.
16
+
17
+ The pattern was popularized by Michael Nygard in *Release It!* (2007) and brought into mainstream adoption by Netflix's Hystrix library starting in 2011. Netflix's engineering team built Hystrix after experiencing repeated cascade failures where a single degraded backend service would cause the entire Netflix streaming platform to become unresponsive. By 2015, Netflix was executing tens of billions of thread-isolated and hundreds of billions of semaphore-isolated calls through Hystrix every day.
18
+
19
+ A circuit breaker operates in **three states**:
20
+
21
+ | State | Behavior | Transitions to |
22
+ |---|---|---|
23
+ | **Closed** | All calls pass through. Failures are counted in a sliding window. Normal operation. | Open (when failure threshold is exceeded) |
24
+ | **Open** | All calls are immediately rejected without attempting the downstream call. Returns a fallback or error instantly. | Half-Open (after a configured timeout period) |
25
+ | **Half-Open** | A limited number of probe calls are allowed through. The breaker is testing whether the downstream service has recovered. | Closed (if probes succeed) or Open (if probes fail) |
26
+
27
+ The closed-to-open transition is governed by a **sliding window** that tracks recent call outcomes. The window can be count-based (e.g., the last 100 calls) or time-based (e.g., calls in the last 60 seconds). When the failure rate within the window exceeds a configured threshold (e.g., 50%), the circuit opens.
28
+
29
+ ### Bulkhead
30
+
31
+ A bulkhead isolates **resource pools** so that a failure in one downstream dependency cannot exhaust resources needed by other dependencies. The name comes from the watertight compartments in a ship's hull — if one compartment floods, the bulkheads prevent water from spreading to other compartments, keeping the ship afloat.
32
+
33
+ In software, the "water" is threads, connections, memory, or other finite resources. Without bulkheads, a single slow downstream service can consume every thread in your application's thread pool, causing all other endpoints — even those not related to the slow service — to become unresponsive. This is the **thread pool exhaustion** problem, and it is one of the most common causes of total system failure in microservice architectures.
34
+
35
+ Bulkheads come in two primary forms:
36
+
37
+ | Isolation type | Mechanism | Overhead | Use case |
38
+ |---|---|---|---|
39
+ | **Thread pool isolation** | Each dependency gets its own thread pool with a fixed maximum size | Higher (thread context switching, memory for thread stacks) | Long-running or unpredictable calls; need for timeout enforcement |
40
+ | **Semaphore isolation** | Each dependency gets a semaphore with a fixed permit count; calls execute on the caller's thread | Lower (no thread switching overhead) | Fast, predictable calls where timeout enforcement is handled elsewhere |
41
+
42
+ ### What These Patterns Are NOT
43
+
44
+ - **Not retry logic.** A circuit breaker is the **opposite** of retry. Retry says "try again." Circuit breaker says "stop trying — the service is down, and hammering it will only make things worse." The two patterns are complementary: retry handles transient glitches (a single dropped packet), while circuit breaker handles sustained failures (a service that is down and needs time to recover).
45
+
46
+ - **Not a load balancer.** A circuit breaker does not distribute traffic. It controls whether traffic should flow at all. Load balancers and circuit breakers operate at different layers — a load balancer chooses which instance to send a request to, while a circuit breaker decides whether to send the request at all.
47
+
48
+ - **Not a timeout.** Timeouts limit how long you wait for a single call. Circuit breakers track patterns of failure across many calls. A timeout fires after one slow call. A circuit breaker fires after a pattern of failures (which may include timeouts).
49
+
50
+ - **Not rate limiting.** Rate limiting controls the volume of outgoing requests to protect a downstream service from overload. Bulkheads limit concurrency to protect the **caller** from resource exhaustion. Rate limiting is about being a good neighbor; bulkheads are about self-preservation.
51
+
52
+ - **Not a substitute for fixing the root cause.** Circuit breakers and bulkheads are **damage containment** mechanisms. They buy you time. If your downstream service fails repeatedly, the circuit breaker keeps your system alive while you fix the actual problem. A system that runs permanently in open-circuit mode is a system with an unresolved outage.
53
+
54
+ ---
55
+
56
+ ## When to Use It
57
+
58
+ ### Calling external APIs or third-party services
59
+
60
+ Any call to a service you do not control — payment gateways (Stripe, PayPal), identity providers (Auth0, Okta), notification services (Twilio, SendGrid), or data providers — should be wrapped in a circuit breaker. You cannot fix their outages, and you cannot predict when they will fail. A circuit breaker lets your system degrade gracefully when an external dependency goes down.
61
+
62
+ **Example:** An e-commerce platform calling a payment gateway. When the gateway goes down, the circuit breaker opens and the system immediately shows "payment temporarily unavailable" instead of hanging for 30 seconds per request, consuming threads, and eventually taking down the product catalog page because the thread pool is exhausted.
63
+
64
+ ### Inter-service calls in microservice architectures
65
+
66
+ Every synchronous call between microservices should have a circuit breaker. This is not optional. In a system with 50 microservices and complex dependency graphs, a single failing service without circuit breakers will cascade through every upstream caller within minutes.
67
+
68
+ **Netflix's origin story (2011):** Netflix's API gateway made calls to dozens of backend services for each user request — recommendations, watch history, personalization, billing status. When one backend service (e.g., the bookmark service tracking playback position) became slow, the API gateway's thread pool filled up with threads waiting for the slow service. This meant the API gateway could not serve any requests at all — not even requests that had nothing to do with bookmarks. The entire Netflix UI would go blank. Hystrix was built specifically to solve this problem by combining circuit breakers with thread pool isolation (bulkheads).
69
+
70
+ ### Database connection pooling
71
+
72
+ Database connections are finite and expensive resources. When a database becomes slow (due to lock contention, disk I/O saturation, or a long-running query), connection pools fill up with connections waiting for responses. A circuit breaker on database access prevents your application from exhausting its connection pool and becoming entirely unresponsive.
73
+
74
+ ### High-throughput systems where a retry storm could cause total failure
75
+
76
+ In systems handling thousands of requests per second, retrying failed calls without a circuit breaker creates a **retry storm**: each failing request spawns 3-5 retries, so an already-failing downstream service receives up to 4-6x its normal load (the original attempt plus each retry). This prevents recovery and can cascade upstream.
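The amplification is simple arithmetic. The sketch below assumes the worst case, in which every retry of a failing request also fails, and counts the original attempt plus each retry. The function name `amplified_rps` and the traffic figures are illustrative only.

```python
def amplified_rps(base_rps: float, retries: int, failure_fraction: float = 1.0) -> float:
    """Request rate seen by a dependency when failing requests are retried.

    Assumes every retry of a failing request fails as well, so each failing
    request is sent 1 + `retries` times. `failure_fraction` is the share of
    requests that fail on the first attempt.
    """
    return base_rps * (1 + failure_fraction * retries)


# 1,000 requests/second, dependency fully down, 3 retries per request
print(amplified_rps(1000, retries=3))
# Same traffic, but only half of the requests fail
print(amplified_rps(1000, retries=3, failure_fraction=0.5))
```

A circuit breaker removes the multiplier entirely while it is open, because rejected calls never reach the dependency.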
77
+
78
+ **AWS October 2025 outage:** A DNS resolution failure affecting the DynamoDB endpoint in US-EAST-1 caused millions of EC2 instances and Lambda functions to simultaneously retry failed connections. When DNS resolution was briefly restored, the retry storm overwhelmed the DynamoDB control plane, causing DNS to fail again. This oscillation cycle turned a 10-minute DNS issue into a multi-hour outage affecting 113 AWS services and disrupting Venmo, Zoom, and thousands of other applications across 60+ countries.
79
+
80
+ ### Systems with strict latency requirements
81
+
82
+ When your SLA requires 200ms response times but a downstream service occasionally takes 30 seconds, a circuit breaker can fail fast (return an error or fallback in <1ms) instead of waiting. This is especially critical for user-facing services where a hanging request translates to a frustrated user.
83
+
84
+ ---
85
+
86
+ ## When NOT to Use It
87
+
88
+ **This section is deliberately comprehensive because misconfigured circuit breakers cause their own class of outages — and applying the pattern where it is not needed adds complexity without benefit.**
89
+
90
+ ### Single-process monolith with in-memory calls
91
+
92
+ If your "services" are classes or modules within the same process communicating via function calls, there is nothing to circuit-break. In-memory function calls do not fail due to network issues, do not have latency variance, and do not exhaust thread pools. Adding circuit breakers to in-process calls is cargo-cult resilience engineering.
93
+
94
+ **Exception:** If a module within your monolith makes calls to an external system (database, API), a circuit breaker on that external boundary is appropriate. The circuit breaker goes on the boundary, not on internal calls.
95
+
96
+ ### When the downstream service must be reached no matter what
97
+
98
+ Some calls are non-negotiable. A regulatory reporting system that must submit compliance data to a government endpoint cannot substitute a cached response for the real submission. A financial ledger that must record every transaction cannot skip writes. In these cases, you need **queuing with guaranteed delivery** (e.g., a persistent message queue with at-least-once semantics), not circuit breaking.
99
+
100
+ Circuit breakers are about **graceful degradation** — accepting that some functionality is temporarily unavailable. If your business requirements do not allow that functionality to be unavailable under any circumstances, circuit breakers are the wrong tool.
101
+
102
+ ### When you only have 1-2 downstream dependencies
103
+
104
+ Circuit breakers and bulkheads add operational complexity: configuration parameters to tune, monitoring dashboards to build, alerts to set up, and failure modes to test. For a system with one database and one external API, the overhead of a full circuit breaker framework may exceed the benefit. A simple timeout with retry and exponential backoff may suffice.
105
+
106
+ **Rule of thumb:** If you can enumerate all your downstream dependencies on one hand and your team can reason about their failure modes without tooling, start with timeouts and retries. Add circuit breakers when you feel the pain.
107
+
108
+ ### When circuit breaker thresholds are misconfigured — worse than no circuit breaker
109
+
110
+ A circuit breaker configured to open after 2 failures within a sliding window of 5 calls will trip on two failures in five requests. In a system processing 10,000 requests per second, two failures in a second is statistical noise (a 0.02% error rate). A premature circuit opening on a healthy service causes a **self-inflicted outage**: your circuit breaker just cut callers off from a perfectly functional dependency.
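To see how quickly such a configuration misfires, the following sketch computes the chance that a single 5-call window contains at least 2 failures. It assumes failures are independent and uses an assumed 1% background error rate; both are simplifications chosen for illustration, and `prob_window_trips` is a hypothetical helper.

```python
from math import comb


def prob_window_trips(error_rate: float, window: int, failures_to_trip: int) -> float:
    """P(at least `failures_to_trip` failures among `window` independent calls)."""
    below_threshold = sum(
        comb(window, k) * error_rate**k * (1 - error_rate) ** (window - k)
        for k in range(failures_to_trip)
    )
    return 1 - below_threshold


# Threshold from the text: 2 failures in a 5-call window.
# Assumed: the service is healthy, with 1% of calls failing at random.
p = prob_window_trips(error_rate=0.01, window=5, failures_to_trip=2)
print(f"chance a given 5-call window trips: {p:.4%}")

# 10,000 requests/second is 2,000 non-overlapping 5-call windows per second.
print(f"expected trips per second while closed: {p * 10_000 / 5:.2f}")
```

Under these assumptions the chance is roughly 0.1% per window, which at 2,000 windows per second works out to about two trips per second. A closed breaker guarding a healthy service would be expected to open almost immediately.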
111
+
112
+ Misconfigured circuit breakers are a category of production incident all on their own. If you cannot commit to tuning and monitoring your circuit breaker configuration, you may be better off without one.
113
+
114
+ ### Fire-and-forget asynchronous messaging
115
+
116
+ If services communicate exclusively via message queues (Kafka, RabbitMQ, SQS), circuit breakers on the consumer side are usually unnecessary. The queue itself acts as a buffer — when the consumer is slow, messages accumulate in the queue rather than exhausting the producer's resources. The queue is the bulkhead.
117
+
118
+ **Exception:** If a service produces messages by making a synchronous call to a message broker, and the broker itself can become unresponsive, a circuit breaker on the broker connection is still valuable.
119
+
120
+ ### Over-engineering simple CRUD applications
121
+
122
+ A CRUD API that reads from and writes to a single database, serving 100 requests per minute, does not need Resilience4j, a circuit breaker dashboard, and a bulkhead configuration. The operational cost of maintaining that infrastructure exceeds the cost of an occasional database timeout. Use connection pool limits (which your database driver already provides) and move on.
123
+
124
+ ---
125
+
126
+ ## How It Works
127
+
128
+ ### Circuit breaker: the state machine in detail
129
+
130
+ **Closed state (normal operation):**
131
+
132
+ Every call passes through to the downstream service. The circuit breaker records the outcome (success, failure, timeout, or slow call) in a **sliding window**. The window can be:
133
+
134
+ - **Count-based:** Tracks the last N calls (e.g., last 100 calls). Simple to reason about. Best when call volume is steady.
135
+ - **Time-based:** Tracks all calls in the last N seconds (e.g., last 60 seconds). Adapts to varying call volumes. Better for bursty traffic.
136
+
137
+ A **minimum number of calls** threshold prevents premature evaluation. If the window size is 100 but only 3 calls have been made, the failure rate is not meaningful. Most implementations require a minimum (e.g., 10 calls) before the failure rate is evaluated.
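A count-based window with a minimum-calls guard can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular library; the class and parameter names are assumptions, and it is not thread-safe.

```python
from collections import deque


class CountBasedWindow:
    """Tracks the outcomes of the last `size` calls."""

    def __init__(self, size: int = 100, minimum_calls: int = 10):
        # True marks a failure; deque(maxlen=...) discards the oldest outcome
        self._outcomes = deque(maxlen=size)
        self._minimum_calls = minimum_calls

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def failure_rate(self):
        """Failure rate in [0, 1], or None until enough calls are recorded."""
        if len(self._outcomes) < self._minimum_calls:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def should_trip(self, threshold: float = 0.5) -> bool:
        rate = self.failure_rate()
        # Trips at or above the threshold ("open when half of calls fail")
        return rate is not None and rate >= threshold


window = CountBasedWindow(size=100, minimum_calls=10)
for _ in range(3):
    window.record(failed=True)
print(window.should_trip())  # False: 3 calls is below the minimum of 10
```

Without the minimum-calls guard, the three failures above would read as a 100% failure rate and open the circuit.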
138
+
139
+ **Transition to open (the "trip"):**
140
+
141
+ When the failure rate within the sliding window exceeds a configured threshold (e.g., 50%), the circuit opens. Some implementations also track **slow call rate** — if 80% of calls are succeeding but taking 10 seconds instead of 100ms, the service is effectively failing. Resilience4j supports both `failureRateThreshold` and `slowCallRateThreshold`.
142
+
143
+ **Open state (fail fast):**
144
+
145
+ All calls are immediately rejected without touching the downstream service. The circuit breaker returns:
146
+
147
+ - A **fallback response** (cached data, default value, degraded experience)
148
+ - A **specific exception** that callers can handle (e.g., `CallNotPermittedException` in Resilience4j)
149
+
150
+ This state has a configured **timeout** (e.g., 60 seconds). When the timeout expires, the circuit transitions to half-open.
151
+
152
+ **Half-open state (probing for recovery):**
153
+
154
+ A limited number of calls (e.g., 5-10) are allowed through to test whether the downstream service has recovered.
155
+
156
+ - If the permitted calls succeed (their failure rate stays below the threshold): the circuit **closes** and normal operation resumes.
157
+ - If the permitted calls fail: the circuit **re-opens** and the open-state timeout starts over.
158
+
159
+ This probing mechanism prevents a thundering herd: instead of all callers simultaneously hitting a recovering service, only a few probe requests test the waters.
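The three states and their transitions can be sketched as a small class. To stay short, this version is deliberately simplified: it counts consecutive failures instead of using a sliding window, re-opens on the first failed probe, does not cap concurrent probes, and is not thread-safe. All names (`CircuitBreaker`, `CircuitOpenError`, `open_timeout`, `probe_calls`) are illustrative.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the callee."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=60.0,
                 probe_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.probe_calls = probe_calls
        self._clock = clock  # injectable so tests need not sleep
        self.state = State.CLOSED
        self._failures = 0
        self._probe_successes = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.open_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = State.HALF_OPEN  # timeout expired: start probing
            self._probe_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
            self._failures = 0

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._probe_successes += 1
            if self._probe_successes >= self.probe_calls:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0  # a success ends the run of consecutive failures
```

Callers wrap each downstream call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` stands in for any remote call) and handle `CircuitOpenError` by returning a fallback.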
160
+
161
+ ### Bulkhead: resource isolation strategies
162
+
163
+ **Thread pool isolation:**
164
+
165
+ ```
166
+ Service A ──┐
167
+             ├── Thread Pool 1 (max 20 threads) ──── Payment Service
168
+             │
169
+             ├── Thread Pool 2 (max 10 threads) ──── Inventory Service
170
+             │
171
+             └── Thread Pool 3 (max 15 threads) ──── Notification Service
172
+ ```
173
+
174
+ Each downstream dependency gets a dedicated thread pool. When the payment service becomes slow and all 20 threads are occupied waiting for responses, the inventory and notification thread pools are completely unaffected. The maximum blast radius of a payment service failure is 20 blocked threads — not total thread pool exhaustion.
175
+
176
+ Netflix's Hystrix used this approach as its default isolation strategy. The tradeoff is the overhead of thread context switching and the memory cost of maintaining multiple thread pools (each thread stack typically consumes 512KB-1MB).
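A rough Python analogue of thread pool isolation is sketched below using the standard library's `ThreadPoolExecutor`. It is not how Hystrix is implemented; the class name, pool sizes, and error type are assumptions for illustration. Note that Python cannot interrupt a running worker thread, so a timed-out call frees the caller but keeps its slot until the work finishes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout


class BulkheadFullError(Exception):
    """Raised when every thread reserved for a dependency is busy."""


class ThreadPoolBulkhead:
    def __init__(self, name: str, max_threads: int):
        self._pool = ThreadPoolExecutor(max_workers=max_threads,
                                        thread_name_prefix=name)
        # ThreadPoolExecutor queues excess work without bound, so a semaphore
        # is used to reject calls once all workers are occupied.
        self._slots = threading.BoundedSemaphore(max_threads)

    def call(self, fn, *args, timeout: float):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free thread for this dependency")
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._slots.release()
            raise
        # The slot is freed when the work finishes, even if the caller gave up.
        future.add_done_callback(lambda _f: self._slots.release())
        # Raises CallTimeout if the worker is still busy after `timeout` seconds.
        return future.result(timeout=timeout)


# One pool per dependency, sized as in the diagram above
payments = ThreadPoolBulkhead("payments", max_threads=20)
inventory = ThreadPoolBulkhead("inventory", max_threads=10)
```

If the payment service hangs, at most 20 `payments` threads are tied up; calls through `inventory` continue to run on their own pool.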
177
+
178
+ **Semaphore isolation:**
179
+
180
+ Instead of dedicated threads, a semaphore limits the number of concurrent in-flight requests to each dependency. When all permits are taken, additional calls are immediately rejected.
181
+
182
+ - **Advantage:** No thread switching overhead. Calls execute on the caller's thread.
183
+ - **Disadvantage:** Cannot enforce timeouts (since there is no separate thread to interrupt). The call will block the caller's thread until it completes or the underlying socket times out.
184
+ - **Best for:** In-memory or very fast calls where latency is predictable and you only need concurrency limiting.
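Semaphore isolation amounts to a non-blocking permit check around the call. The sketch below uses `threading.BoundedSemaphore`; the class name, the limit of 10, and the `get_stock` function with its placeholder return value are hypothetical.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when all permits for a dependency are in use."""


class SemaphoreBulkhead:
    def __init__(self, max_concurrent: int):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def permit(self):
        # Non-blocking acquire: reject immediately instead of queueing callers
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("concurrency limit reached")
        try:
            yield  # the call runs on the caller's own thread
        finally:
            self._permits.release()


inventory_bulkhead = SemaphoreBulkhead(max_concurrent=10)


def get_stock(sku: str) -> int:
    with inventory_bulkhead.permit():
        return 3  # placeholder for the real (fast) downstream call
```

Because the call runs on the caller's thread, nothing here can cut a slow call short; any timeout has to come from the client library or socket settings.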
185
+
186
+ **Connection pool limits (passive bulkhead):**
187
+
188
+ Database drivers and HTTP client libraries typically support connection pool sizing. Setting `maxConnections=20` on a database pool is a form of bulkhead — it prevents one database from consuming unlimited connections. This is the simplest and most commonly used form of bulkhead, and many teams use it without recognizing it as a pattern.
189
+
190
+ ### Combining circuit breaker with retry
191
+
192
+ The correct composition is **retry inside circuit breaker**:
193
+
194
+ ```
195
+ Request
196
+   └── Circuit Breaker (checks state first)
197
+         └── Retry (attempts the call N times with backoff)
198
+               └── Timeout (limits each individual attempt)
199
+                     └── Actual downstream call
200
+ ```
201
+
202
+ **Why this order matters:**
203
+
204
+ - The circuit breaker checks its state **before** any retry attempt. If the circuit is open, no retries are attempted — the call fails immediately with a fallback.
205
+ - The circuit breaker records one outcome per request: the final result after the retries are exhausted. If retried requests consistently fail, the circuit breaker will eventually trip, stopping all future calls (including their retries) until the downstream service recovers.
206
+ - If the order is reversed (circuit breaker inside retry), the retry logic will keep attempting even when the circuit is open, defeating the entire purpose of the circuit breaker.
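The nesting can be expressed as plain function composition. In this sketch the breaker is outermost, so it sees one outcome per request (the result after retries are exhausted). `SimpleBreaker` is deliberately minimal and never recovers, the per-attempt timeout is modeled as a `timeout` argument passed to the downstream call, and every name is hypothetical.

```python
import time


class CircuitOpenError(Exception):
    pass


class SimpleBreaker:
    """Opens after N consecutive failed requests. No half-open recovery:
    just enough behavior to show the ordering."""

    def __init__(self, max_failures: int):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.max_failures:
            raise CircuitOpenError("circuit is open")  # checked before any attempt
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result


def with_retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    def wrapped():
        for attempt in range(attempts):
            try:
                return fn()
            except TimeoutError:  # retry transient errors only
                if attempt == attempts - 1:
                    raise
                sleep(base_delay * 2 ** attempt)
    return wrapped


def guarded(breaker, downstream, per_attempt_timeout=2.0, sleep=time.sleep):
    def attempt():
        return downstream(timeout=per_attempt_timeout)  # innermost: timeout
    return breaker.call(with_retry(attempt, sleep=sleep))  # outermost: breaker
```

Once the breaker is open, `guarded` raises `CircuitOpenError` before `with_retry` runs, so no attempts reach the downstream service.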
207
+
208
+ **Retry configuration alongside circuit breaker:**
209
+
210
+ | Parameter | Typical value | Rationale |
211
+ |---|---|---|
212
+ | Max retry attempts | 2-3 | More retries amplify load on a failing service |
213
+ | Backoff strategy | Exponential with jitter | Prevents synchronized retries across instances |
214
+ | Retry on | Transient errors only (5xx, timeouts) | Do not retry business logic failures or 4xx client errors, apart from throttling or timeout responses such as 429 and 408 |
215
+ | Circuit breaker failure threshold | 50% | Open circuit when half of calls fail |
216
+ | Circuit breaker sliding window | 100 calls or 60 seconds | Enough data for statistical significance |
217
+ | Half-open permitted calls | 5-10 | Small enough to not overwhelm a recovering service |
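The backoff and retry-on rows of the table might translate into code along these lines. This is a sketch using the "full jitter" variant of exponential backoff; the base delay, cap, and helper names are assumptions, and `attempt` is zero-based.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0, rng=random) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def should_retry(status_code: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only server-side (5xx) errors, and only while attempts remain."""
    attempts_remain = attempt < max_attempts - 1
    return attempts_remain and 500 <= status_code <= 599


for attempt in range(3):
    print(f"attempt {attempt}: wait up to {min(10.0, 0.1 * 2 ** attempt):.1f}s, "
          f"chose {backoff_delay(attempt):.3f}s")
```

Randomizing the whole delay, not just adding a small offset, spreads retries from many instances across the interval so they do not arrive in synchronized waves.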
218
+
219
+ ### Fallback strategies
220
+
221
+ When the circuit is open, what do you return? The answer depends on the use case:
222
+
223
+ | Strategy | Example | When to use |
224
+ |---|---|---|
225
+ | **Cached response** | Return the last known product price from a local cache | Data that is valuable even if slightly stale |
226
+ | **Default value** | Return an empty recommendation list | Feature that enhances but is not essential to the core experience |
227
+ | **Graceful degradation** | Show a generic product image instead of personalized images | Functionality that can be simplified without breaking the user flow |
228
+ | **Queue for later** | Write the event to a local queue for processing when the service recovers | Operations that must eventually succeed but can be delayed |
229
+ | **Honest error** | Show "Payment service temporarily unavailable, please try again in a few minutes" | Operations where no reasonable fallback exists |
230
+ | **Failover to alternate** | Route payment to a secondary payment processor | Critical operations with redundant providers |
231
+
232
+ Netflix exemplified graceful degradation: when the recommendation service was down, they showed a generic "trending now" list instead of personalized recommendations. The user experience was slightly worse, but the platform remained fully functional.
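The cached-response and default-value strategies from the table combine naturally. The sketch below remembers the last good value per key and serves it when the primary call fails; `CachedFallback` and the price example are hypothetical, and a production version would also bound the cache and track staleness.

```python
class CachedFallback:
    """Serve the last known good value when the primary call fails."""

    def __init__(self, fetch, default=None):
        self._fetch = fetch      # the primary call, e.g. wrapped in a circuit breaker
        self._default = default  # used when nothing has been cached yet
        self._cache = {}

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            # Covers downstream errors and "circuit open" rejections alike:
            # return stale data if present, otherwise the default.
            return self._cache.get(key, self._default)
        self._cache[key] = value
        return value


prices = {"sku-1": 19.99}
service_up = True


def fetch_price(sku):
    if not service_up:
        raise ConnectionError("pricing service unavailable")
    return prices[sku]


price_lookup = CachedFallback(fetch_price, default=None)
price_lookup.get("sku-1")         # primary succeeds and warms the cache
service_up = False
print(price_lookup.get("sku-1"))  # served from cache while the service is down
```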
233
+
234
+ ---
235
+
236
+ ## Trade-Offs Matrix
237
+
238
+ | Dimension | Without circuit breaker/bulkhead | With circuit breaker/bulkhead |
239
+ |---|---|---|
240
+ | **Cascade failure risk** | A single slow dependency can take down the entire system | Failures are isolated; other functionality continues |
241
+ | **Recovery time** | Retry storms prevent recovery; outages last hours | Downstream service gets breathing room to recover |
242
+ | **Latency during failures** | Requests hang until timeout (5-30 seconds) | Fail-fast returns in <1ms when circuit is open |
243
+ | **Resource utilization during failures** | Thread pools, connection pools, memory exhausted | Resources are preserved for healthy dependencies |
244
+ | **Configuration complexity** | None | Significant — thresholds, windows, timeouts, fallbacks must be tuned per dependency |
245
+ | **Testing complexity** | Standard integration tests | Must test all three circuit states, fallback behavior, and recovery transitions |
246
+ | **Monitoring requirements** | Basic error rates | Circuit state dashboards, half-open probe success rates, bulkhead utilization metrics |
247
+ | **False positive risk** | None | Misconfigured thresholds can open circuits on healthy services |
248
+ | **Operational overhead** | Low | Moderate — alerting on circuit state changes, tuning thresholds based on observed behavior |
249
+ | **Debugging complexity** | Straightforward — request hits service, service fails | Additional layer to reason about — "was the service actually down or did the circuit open prematurely?" |
250
+ | **Cold start behavior** | N/A | Circuit breaker has no data in sliding window; first few failures can prematurely trip the circuit if minimum call threshold is too low |
251
+ | **Memory overhead** | Baseline | Sliding window storage, per-dependency thread pools, metrics collection |
252
+
253
+ ---
254
+
255
+ ## Evolution Path
256
+
257
+ ### Stage 1: Timeouts and connection pool limits (start here)
258
+
259
+ Before adding circuit breakers, ensure every outgoing call has a **timeout** and every connection pool has a **maximum size**. This is the minimum viable resilience, and it prevents many of the most common cascade failure scenarios.
260
+
261
+ ```
262
+ HTTP client timeout: 5 seconds
263
+ Database connection pool: max 20 connections
264
+ Database query timeout: 10 seconds
265
+ ```
266
+
267
+ Most frameworks provide these out of the box. If you skip this step and jump to circuit breakers, you are building a penthouse on a foundation of sand.
268
+
269
+ ### Stage 2: Circuit breakers on critical external dependencies
270
+
271
+ Add circuit breakers to calls that cross a **trust boundary** — external APIs, third-party services, and any dependency where you cannot control uptime. Start with the dependency that has the worst reliability track record.
272
+
273
+ Configuration at this stage should be conservative:
274
+
275
+ - High failure threshold (60-70%) to avoid false positives
276
+ - Large sliding window (200+ calls) for statistical significance
277
+ - Long wait in the open state before half-open probing (60-120 seconds) to give the service time to recover
278
+ - Simple fallback (error response with retry-after header)
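
The conservative settings above can be sketched as a small count-based breaker. This is a single-threaded illustration rather than a production implementation: the class name, the injected `clock`, the `minimum_calls` guard, and the single-probe half-open rule are all assumptions made for the example.

```python
from collections import deque


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, clock, failure_threshold=0.6, window_size=200,
                 minimum_calls=50, open_seconds=60.0):
        self._clock = clock                          # injected so time can be controlled
        self._threshold = failure_threshold
        self._minimum_calls = minimum_calls
        self._open_seconds = open_seconds
        self._outcomes = deque(maxlen=window_size)   # True means the call failed
        self._opened_at = None
        self.state = "closed"

    def call(self, fn):
        if self.state == "open":
            if self._clock() - self._opened_at < self._open_seconds:
                raise CircuitOpenError("circuit open; retry later")
            self.state = "half_open"                 # wait elapsed: allow one probe
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "half_open":
            # Assumption: a single probe decides. Success closes, failure re-opens.
            if failed:
                self._trip()
            else:
                self._outcomes.clear()
                self.state = "closed"
            return
        self._outcomes.append(failed)
        enough = len(self._outcomes) >= self._minimum_calls
        if enough and sum(self._outcomes) / len(self._outcomes) >= self._threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self._opened_at = self._clock()
        self._outcomes.clear()


def failing():
    raise ConnectionError("downstream unavailable")


now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0], window_size=10, minimum_calls=5)
for _ in range(5):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(lambda: "ok")
    rejected_while_open = False
except CircuitOpenError:
    rejected_while_open = True      # still inside the 60-second wait

now[0] = 61.0                        # wait elapsed: the next call is the probe
breaker.call(lambda: "ok")
state_after_probe = breaker.state
```

The usage at the bottom shrinks the window so the transitions are easy to follow; the constructor defaults mirror the conservative values in the list.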
279
+
280
+ ### Stage 3: Bulkheads for resource isolation
281
+
282
+ When you observe that a single slow dependency can starve resources from other dependencies (thread pool exhaustion, connection pool exhaustion), add bulkheads. Start with semaphore isolation (lower overhead) and graduate to thread pool isolation for dependencies with unpredictable latency.
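
Semaphore isolation can be sketched in a few lines. The names `SemaphoreBulkhead` and `BulkheadFull` are hypothetical; the point is the non-blocking acquire, which rejects excess calls instead of letting them queue behind a slow dependency.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the dependency already has `max_concurrent` calls in flight."""


class SemaphoreBulkhead:
    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately rather than wait for a permit.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFull("bulkhead full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._permits.release()


bulkhead = SemaphoreBulkhead(max_concurrent=1)


def outer():
    # While this call holds the only permit, a nested call is rejected.
    try:
        bulkhead.call(lambda: "inner")
        return "inner ran"
    except BulkheadFull:
        return "inner rejected"


nested_result = bulkhead.call(outer)
after_release = bulkhead.call(lambda: "ok")
```

Because the protected function runs on the caller's thread, this form cannot interrupt a slow call, which is why thread pool isolation suits dependencies with unpredictable latency.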
283
+
284
+ ### Stage 4: Circuit breakers on all inter-service calls
285
+
286
+ In a mature microservice architecture, every synchronous inter-service call should have a circuit breaker. At this stage, invest in:
287
+
288
+ - A circuit breaker dashboard showing the state of every circuit in real-time
289
+ - Alerts on circuit state transitions (closed-to-open is a P2 alert; circuit staying open for >10 minutes is a P1)
290
+ - Runbooks for each circuit explaining what the fallback is and how to manually force-close the circuit
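
The alerting rule in the list above is simple enough to express directly. This sketch assumes state names of `"closed"`, `"open"`, and `"half_open"` and timestamps in seconds; the function names are illustrative.

```python
from typing import Optional

OPEN_TOO_LONG_SECONDS = 10 * 60  # "staying open for >10 minutes"


def transition_alert(previous: str, current: str) -> Optional[str]:
    """Severity for a state transition, or None if no alert is needed."""
    if previous == "closed" and current == "open":
        return "P2"
    return None


def open_duration_alert(opened_at: float, now: float) -> Optional[str]:
    """Escalate when a circuit has stayed open longer than the limit."""
    if now - opened_at > OPEN_TOO_LONG_SECONDS:
        return "P1"
    return None
```

In practice the second check would run on a schedule for every circuit currently in the open state.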
291
+
292
+ ### Stage 5: Service mesh circuit breaking (infrastructure-level)
293
+
294
+ Move circuit breaking from application code to the service mesh (Istio/Envoy). This provides circuit breaking as infrastructure, applied uniformly to all services without code changes.
295
+
296
+ Istio implements circuit breaking through `DestinationRule` resources with two mechanisms:
297
+
298
+ - **Connection pool settings:** `maxConnections`, `http1MaxPendingRequests`, `http2MaxRequests` — limits the number of connections and pending requests to a service.
299
+ - **Outlier detection:** Tracks consecutive 5xx errors per upstream host and ejects hosts that exceed the limit from the load balancing pool for a configurable duration.
300
+
301
+ **Tradeoff:** Service mesh circuit breaking operates at the connection/request level and cannot implement application-level fallbacks. Most mature systems use both: service mesh for connection-level protection and application-level circuit breakers for business-logic fallbacks.
302
+
303
+ ---
304
+
305
+ ## Failure Modes
306
+
307
+ ### Failure Mode 1: Premature circuit opening (false positive)
308
+
309
+ **What happens:** The circuit breaker opens even though the downstream service is healthy. A small burst of timeouts (caused by a momentary network blip or GC pause) triggers the threshold because the sliding window is too small or the failure threshold is too low.
310
+
311
+ **Real-world pattern:** A circuit breaker with `slidingWindowSize=10` and `failureRateThreshold=50%` will open after 5 failures in 10 calls. In a system handling 1,000 req/s, 10 calls represent 10 milliseconds of traffic. A 10ms network hiccup causes the circuit to open, rejecting all subsequent requests to a perfectly healthy service. The team spends 30 minutes investigating a "downstream outage" that never existed.
312
+
313
+ **Prevention:**
314
+ - Set `minimumNumberOfCalls` high enough for statistical significance (minimum 20, preferably 50-100)
315
+ - Use time-based sliding windows for high-throughput services (60+ seconds)
316
+ - Monitor circuit state changes and correlate with actual downstream error rates
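
The arithmetic behind this failure mode, and the effect of a minimum-call guard, can be checked with a short sketch. The function names are illustrative; the numbers come from the example above.

```python
def window_span_seconds(window_size: int, requests_per_second: float) -> float:
    """Wall-clock time covered by a count-based window of `window_size` calls."""
    return window_size / requests_per_second


def should_open(failures: int, calls: int, threshold: float, minimum_calls: int) -> bool:
    """Evaluate the failure rate only once enough calls have been observed."""
    if calls < minimum_calls:
        return False
    return failures / calls >= threshold


# 10 calls at 1,000 req/s cover only 10 ms of traffic.
span = window_span_seconds(window_size=10, requests_per_second=1000)

# Same 5-of-10 burst: trips with a tiny minimum, ignored with a larger one.
tiny_window = should_open(failures=5, calls=10, threshold=0.5, minimum_calls=10)
guarded = should_open(failures=5, calls=10, threshold=0.5, minimum_calls=50)
```

With a minimum of 50 calls, the same 10 ms burst is not enough evidence to open the circuit.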
317
+
318
+ ### Failure Mode 2: Circuit never closing (stuck open)
319
+
320
+ **What happens:** The circuit opens legitimately, the downstream service recovers, but the circuit never closes because the half-open probe calls keep failing. This happens when:
321
+
322
+ - The half-open probe sends a request type that is different from normal traffic (e.g., a health check that hits a different code path)
323
+ - The downstream service recovers for normal load but fails under the specific probe pattern
324
+ - The wait in the open state is too short, so probes are sent before the service has fully recovered
325
+
326
+ **Prevention:**
327
+ - Ensure half-open probes exercise the same code path as normal calls
328
+ - Use a generous `waitDurationInOpenState` (60-120 seconds minimum)
329
+ - Provide a manual circuit close mechanism for operators
330
+
331
+ ### Failure Mode 3: Bulkhead too small (false resource exhaustion)
332
+
333
+ **What happens:** The bulkhead's maximum concurrency is set too low for normal traffic. Under normal load, the semaphore or thread pool is fully utilized, causing legitimate requests to be rejected. The bulkhead is "protecting" the system from its own traffic.
334
+
335
+ **Real-world pattern:** A team configures a thread pool of 5 threads for calls to the recommendation service. Under normal traffic, 8-10 concurrent calls are typical, so at any given moment 3-5 calls are rejected even though the recommendation service is perfectly healthy and responding in <50ms. The team sees a constant stream of "bulkhead full" errors and either increases the pool to 1,000 (defeating the purpose) or removes the bulkhead entirely.
336
+
337
+ **Prevention:**
338
+ - Measure actual concurrency under normal and peak load before configuring bulkhead sizes
339
+ - Set bulkhead limits to 150-200% of observed peak concurrency (enough headroom for normal variance, but still protective against true resource exhaustion)
340
+ - Alert on bulkhead rejection rates above 1% — this indicates either misconfiguration or genuine downstream slowness
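
A small sketch of the sizing and alerting rules above. It takes the 150-200% headroom and the 1% rejection threshold directly from the list; the function names are made up for the example.

```python
import math


def bulkhead_limit_range(peak_concurrency: int) -> tuple[int, int]:
    """(low, high) limits at 150% and 200% of observed peak concurrency."""
    return math.ceil(peak_concurrency * 1.5), peak_concurrency * 2


def rejection_alert(rejected: int, total: int, max_rate: float = 0.01) -> bool:
    """True when the bulkhead rejection rate exceeds `max_rate` (1% by default)."""
    return total > 0 and rejected / total > max_rate


# Peak of 10 concurrent calls, as in the recommendation-service example.
low, high = bulkhead_limit_range(10)
```

For the example above, a peak of 10 concurrent calls suggests a limit between 15 and 20, not the 5 threads the team started with.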
341
+
342
+ ### Failure Mode 4: Missing fallbacks
343
+
344
+ **What happens:** The circuit breaker opens, and the application throws a `CircuitBreakerOpenException` that propagates up the stack as a 500 Internal Server Error. The user sees a generic error page instead of a graceful degradation. The circuit breaker prevented cascade failure at the infrastructure level but failed at the user experience level.
345
+
346
+ **Prevention:**
347
+ - Every circuit breaker must have an explicit fallback strategy (even if the fallback is a well-formatted error message with a retry-after header)
348
+ - Test fallback behavior as part of chaos engineering exercises
349
+ - Fallbacks should be monitored separately — a system running on fallbacks for hours may be technically "up" but functionally degraded
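
A minimal sketch of the "well-formatted error with a retry-after header" fallback. `CircuitOpenError` stands in for whatever exception the breaker library raises, and the response is a plain dictionary rather than a real framework object; both are assumptions.

```python
class CircuitOpenError(Exception):
    """Hypothetical exception raised by the breaker while the circuit is open."""


def with_fallback(call, retry_after_seconds=30):
    """Run `call`; translate an open circuit into a degraded 503 response."""
    try:
        return {"status": 200, "headers": {}, "body": call()}
    except CircuitOpenError:
        return {
            "status": 503,
            "headers": {"Retry-After": str(retry_after_seconds)},
            "body": {"error": "recommendations temporarily unavailable"},
        }


def open_circuit():
    raise CircuitOpenError()


degraded = with_fallback(open_circuit)
healthy = with_fallback(lambda: ["item-1", "item-2"])
```

Only the breaker's own rejection is caught here; any other exception still propagates, so genuine bugs are not silently masked as degradation.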
350
+
351
+ ### Failure Mode 5: Retry storm amplifying through open circuits
352
+
353
+ **What happens:** Service A calls Service B, which calls Service C. Service C goes down. Service B's circuit breaker opens. Service B now responds immediately (circuit is open), but Service A interprets the circuit breaker error as a transient failure and retries. With 3 retries per call, each original request becomes 4 attempts, so every one of the 10 instances of Service A sends up to 4x its normal traffic and Service B handles up to 4x its normal request volume. All of those requests are rejected immediately, but each still consumes resources (parsing, circuit state checks, response serialization).
354
+
355
+ **Prevention:**
356
+ - Return specific error codes for circuit breaker rejections (e.g., HTTP 503 with `Retry-After` header)
357
+ - Upstream callers should NOT retry on circuit-breaker-specific errors
358
+ - Implement a **retry budget** at the server level: allow a maximum percentage (e.g., 10%) of total requests to be retries
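
One way to sketch the last two points. The budget here counts over the process lifetime for brevity, whereas a real one would use a sliding window. Treating a 503 that carries `Retry-After` as "do not retry" is an assumption about how the upstream service signals breaker rejections.

```python
class RetryBudget:
    """Admit retries only while they stay under `ratio` of all requests seen."""

    def __init__(self, ratio=0.10):
        self._ratio = ratio
        self._requests = 0
        self._retries = 0

    def admit(self, is_retry: bool) -> bool:
        self._requests += 1
        if not is_retry:
            return True
        if (self._retries + 1) / self._requests > self._ratio:
            return False          # over budget: reject instead of amplifying load
        self._retries += 1
        return True


def should_retry(status_code: int, headers: dict) -> bool:
    """Retry plain 5xx errors, but not a breaker rejection (503 + Retry-After)."""
    if status_code == 503 and "Retry-After" in headers:
        return False
    return status_code >= 500


budget = RetryBudget(ratio=0.10)
first_nine = [budget.admit(is_retry=False) for _ in range(9)]
first_retry = budget.admit(is_retry=True)    # 1 retry in 10 requests: within budget
second_retry = budget.admit(is_retry=True)   # 2 retries in 11 requests: over budget
```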
359
+
360
+ ### Failure Mode 6: The 2012 Netflix Christmas Eve cascade
361
+
362
+ On December 24, 2012, an AWS developer accidentally deleted key ELB state data in US-EAST-1. This left load balancers improperly configured and degraded performance for the applications behind them. Netflix's streaming service was impacted from 12:30 PM to 10:30 PM PST, a 10-hour outage on Christmas Eve.
363
+
364
+ Although Netflix had Hystrix circuit breakers throughout its stack, the outage showed that circuit breakers cannot protect against infrastructure-level failures below the application layer: the ELB failure sat beneath the layer where Hystrix operated. This led Netflix to invest heavily in **multi-region active-active** architecture as a defense against region-level failures, because circuit breakers only guard calls between services and cannot route around a failing region.
365
+
366
+ ### Failure Mode 7: The AWS 2025 retry storm
367
+
368
+ In October 2025, a DNS resolution failure affecting DynamoDB in US-EAST-1 cascaded into a multi-hour global outage. When DNS resolution was briefly restored, millions of EC2 instances and Lambda functions simultaneously retried failed DynamoDB connections. The retry storm overwhelmed the database control plane, causing DNS to fail again. This oscillation cycle — partial recovery followed by retry-storm-induced re-failure — turned what should have been a 10-minute DNS fix into an outage affecting 113 AWS services across 60+ countries.
369
+
370
+ **Lesson:** Circuit breakers without coordinated backoff across clients are insufficient. The retry storm problem requires **server-side admission control** (rejecting excess load at the server) combined with **client-side circuit breakers** (stopping calls entirely after repeated failures) combined with **jittered exponential backoff** (preventing synchronized retries).
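
The third ingredient, jittered exponential backoff, is small enough to show in full. This sketch uses the "full jitter" variant (a uniform random delay between zero and the exponential ceiling); the base delay, cap, and injectable `rng` are illustrative choices.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                  rng=random.random) -> float:
    """Full-jitter delay in seconds for a zero-based retry `attempt`."""
    ceiling = min(cap, base * (2 ** attempt))
    # Uniform in [0, ceiling): clients spread out instead of retrying in lockstep.
    return rng() * ceiling


delays = [backoff_delay(attempt) for attempt in range(5)]
```

Because every client draws its own random delay, a fleet that failed at the same instant does not come back at the same instant.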
371
+
372
+ ---
373
+
374
+ ## Technology Landscape
375
+
376
+ ### Java / JVM
377
+
378
+ | Library | Status | Notes |
379
+ |---|---|---|
380
+ | **Resilience4j** | Active, recommended | Lightweight, modular (circuit breaker, bulkhead, retry, rate limiter as separate modules). Supports count-based and time-based sliding windows, semaphore and thread pool bulkheads. First-class Spring Boot integration. The successor to Hystrix. |
381
+ | **Netflix Hystrix** | Maintenance mode (since 2018) | The library that popularized the pattern. Netflix recommends Resilience4j for new projects. Still widely deployed in legacy systems. Thread pool isolation was the default and defining feature. |
382
+ | **Spring Cloud Circuit Breaker** | Active | Abstraction layer that supports Resilience4j, Sentinel, and Spring Retry as backends. Use if you want to swap implementations without code changes. |
383
+ | **Sentinel (Alibaba)** | Active | Flow control, circuit breaking, and system adaptive protection. Stronger in rate limiting and traffic shaping than Resilience4j. Popular in the Chinese tech ecosystem. |
384
+
385
+ ### .NET
386
+
387
+ | Library | Status | Notes |
388
+ |---|---|---|
389
+ | **Polly** | Active, recommended | The standard .NET resilience library. Supports circuit breaker, retry, bulkhead, timeout, fallback, and hedging as composable policies. Polly v8+ integrates with `Microsoft.Extensions.Resilience`. |
390
+ | **Microsoft.Extensions.Http.Resilience** | Active | Built on Polly. Provides pre-configured resilience pipelines for `HttpClient` via dependency injection. The recommended approach for .NET 8+ applications. |
391
+
392
+ ### Node.js / JavaScript
393
+
394
+ | Library | Status | Notes |
395
+ |---|---|---|
396
+ | **opossum** | Active, recommended | The most widely used Node.js circuit breaker. Supports fallback, event-based monitoring, and Prometheus/Hystrix-compatible metrics export. Requires Node.js 20+. |
397
+ | **cockatiel** | Active | Provides circuit breaker, retry, timeout, and bulkhead. TypeScript-first. Lighter than opossum. |
398
+ | **brakes** | Maintained | Hystrix-compatible circuit breaker with built-in dashboard support. |
399
+
400
+ ### Go
401
+
402
+ | Library | Status | Notes |
403
+ |---|---|---|
404
+ | **sony/gobreaker** | Active, recommended | Clean, well-tested implementation. Configurable via `ReadyToTrip` callback for custom tripping logic. The most widely adopted Go circuit breaker. |
405
+ | **go-kit/circuitbreaker** | Active | Part of the go-kit microservices framework. Wraps gobreaker and Hystrix-go. Use if you are already in the go-kit ecosystem. |
406
+ | **failsafe-go** | Active | Port of the Java Failsafe library. Provides circuit breaker, retry, timeout, bulkhead, and fallback as composable policies. |
407
+ | **slok/goresilience** | Active | Decorator-based resilience library. Provides circuit breaker, retry, timeout, bulkhead, and chaos injection. |
408
+
409
+ ### Python
410
+
411
+ | Library | Status | Notes |
412
+ |---|---|---|
413
+ | **pybreaker** | Active | Straightforward circuit breaker implementation. Supports custom storage backends (Redis, Elasticsearch) for shared circuit state across multiple instances. |
414
+ | **tenacity** | Active | Primarily a retry library, but can be composed with custom circuit breaker logic. The standard Python retry library. |
415
+
416
+ ### Service Mesh (Infrastructure-Level)
417
+
418
+ | Technology | Circuit breaking mechanism | Notes |
419
+ |---|---|---|
420
+ | **Istio / Envoy** | `DestinationRule` with `connectionPool` settings and `outlierDetection` | Operates at the network proxy level, so no application code changes are needed. Enforces max connections, max pending requests, and ejects unhealthy hosts based on consecutive 5xx errors. Cannot implement application-level fallbacks. |
421
+ | **Linkerd** | Circuit breaking via `ServiceProfile` retry budgets and failure accrual | Lighter than Istio. Circuit breaking is implicit via failure accrual — unhealthy endpoints are removed from load balancing. |
422
+ | **AWS App Mesh** | Envoy-based, similar to Istio circuit breaking | Integrates with AWS service discovery. Configured via `VirtualNode` outlier detection. |
423
+
424
+ ### When to use application-level vs. infrastructure-level
425
+
426
+ | Concern | Application-level (Resilience4j, Polly) | Infrastructure-level (Istio, Envoy) |
427
+ |---|---|---|
428
+ | Fallback behavior | Full control — return cached data, default values, degrade gracefully | Limited — can only reject or route to a different endpoint |
429
+ | Per-endpoint granularity | Can apply different thresholds to different API methods | Applies uniformly to all traffic to a destination |
430
+ | Language/framework coupling | Yes — must be implemented in each service's language | No — applied at the sidecar proxy level, language-agnostic |
431
+ | Operational consistency | Each team configures independently — risk of inconsistency | Uniform configuration across all services |
432
+ | Overhead | In-process, negligible | Sidecar proxy adds 1-3ms latency per hop |
433
+
434
+ **Most mature systems use both:** service mesh for baseline connection-level protection, application-level circuit breakers for business-logic fallbacks.
435
+
436
+ ---
437
+
438
+ ## Decision Tree
439
+
440
+ ```
+ Is the call crossing a network boundary?
+ ├── No (in-process function call)
+ │   └── Do NOT use circuit breaker or bulkhead
+ │       (use standard error handling)
+ │
+ └── Yes
+     ├── Is the downstream service under your team's control?
+     │   ├── Yes
+     │   │   └── Is it a critical path (user request blocks on it)?
+     │   │       ├── Yes → Circuit breaker + bulkhead + fallback
+     │   │       └── No (async, fire-and-forget)
+     │   │           └── Queue-based decoupling (message broker)
+     │   │               └── Circuit breaker on broker connection only
+     │   └── No (external API / third-party service)
+     │       └── Circuit breaker + bulkhead + fallback (mandatory)
+     │           └── You cannot control their uptime
+     │
+     ├── How many downstream dependencies?
+     │   ├── 1-2 dependencies
+     │   │   └── Start with timeouts + connection pool limits
+     │   │       └── Add circuit breaker when you observe cascade risk
+     │   ├── 3-10 dependencies
+     │   │   └── Circuit breakers on each + semaphore bulkheads
+     │   └── 10+ dependencies
+     │       └── Circuit breakers + thread pool bulkheads + service mesh
+     │           └── Invest in circuit breaker dashboard
+     │
+     └── What isolation level do you need?
+         ├── Fast calls (<50ms), predictable latency
+         │   └── Semaphore isolation (lower overhead)
+         ├── Slow calls (>200ms), variable latency
+         │   └── Thread pool isolation (can enforce timeouts)
+         └── Mixed
+             └── Semaphore for fast paths, thread pools for slow paths
+ ```
476
+
477
+ ---
478
+
479
+ ## Implementation Sketch
480
+
481
+ ### Circuit breaker with Resilience4j (Java/Spring Boot)
482
+
483
+ ```java
+ // Configuration (application.yml)
+ // resilience4j.circuitbreaker:
+ //   instances:
+ //     paymentService:
+ //       slidingWindowType: COUNT_BASED
+ //       slidingWindowSize: 100
+ //       minimumNumberOfCalls: 20
+ //       failureRateThreshold: 50
+ //       slowCallRateThreshold: 80
+ //       slowCallDurationThreshold: 2s
+ //       waitDurationInOpenState: 60s
+ //       permittedNumberOfCallsInHalfOpenState: 10
+ //       recordExceptions:
+ //         - java.io.IOException
+ //         - java.util.concurrent.TimeoutException
+ //       ignoreExceptions:
+ //         - com.example.BusinessValidationException
+ //
+ // resilience4j.bulkhead:
+ //   instances:
+ //     paymentService:
+ //       maxConcurrentCalls: 25
+ //       maxWaitDuration: 500ms
+
+ import io.github.resilience4j.bulkhead.Bulkhead;
+ import io.github.resilience4j.bulkhead.BulkheadFullException;
+ import io.github.resilience4j.bulkhead.BulkheadRegistry;
+ import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
+ import io.github.resilience4j.circuitbreaker.CircuitBreaker;
+ import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
+ import io.github.resilience4j.decorators.Decorators;
+ import java.util.List;
+ import java.util.function.Supplier;
+ import org.springframework.stereotype.Service;
+
+ // PaymentClient, PaymentCacheService, PaymentRequest and PaymentResult are
+ // application types assumed to exist elsewhere in the codebase.
+ @Service
+ public class PaymentService {
+
+     private final CircuitBreaker circuitBreaker;
+     private final Bulkhead bulkhead;
+     private final PaymentClient paymentClient;
+     private final PaymentCacheService cache;
+
+     public PaymentService(
+             CircuitBreakerRegistry cbRegistry,
+             BulkheadRegistry bhRegistry,
+             PaymentClient paymentClient,
+             PaymentCacheService cache) {
+         this.circuitBreaker = cbRegistry.circuitBreaker("paymentService");
+         this.bulkhead = bhRegistry.bulkhead("paymentService");
+         this.paymentClient = paymentClient;
+         this.cache = cache;
+     }
+
+     public PaymentResult processPayment(PaymentRequest request) {
+         // Each decorator wraps the previous one, so at call time the order is:
+         // fallback -> circuit breaker -> bulkhead -> actual call
+         Supplier<PaymentResult> decoratedCall = Decorators
+             .ofSupplier(() -> paymentClient.charge(request))
+             .withBulkhead(bulkhead)
+             .withCircuitBreaker(circuitBreaker)
+             .withFallback(List.of(
+                 CallNotPermittedException.class,  // circuit open
+                 BulkheadFullException.class       // bulkhead saturated
+             ), e -> handleFallback(request, e))
+             .decorate();
+
+         return decoratedCall.get();
+     }
+
+     private PaymentResult handleFallback(PaymentRequest request, Throwable e) {
+         // Queue for later processing: payment must eventually succeed
+         cache.queueForRetry(request);
+         return PaymentResult.pending(
+             "Payment queued for processing. You will receive confirmation shortly."
+         );
+     }
+ }
+ ```
551
+
552
+ ### Circuit breaker with Polly (.NET)
553
+
554
+ ```csharp
+ // Program.cs: configure the resilience pipeline
+ // Requires the Microsoft.Extensions.Http.Resilience package (built on Polly v8).
+ using System.Net;
+ using System.Threading.RateLimiting;
+ using Polly;
+ using Polly.CircuitBreaker;
+ using Polly.Timeout;
+
+ var builder = WebApplication.CreateBuilder(args);
+
+ builder.Services.AddHttpClient("PaymentService")
+     .AddResilienceHandler("payment-pipeline", pipeline =>
+     {
+         // Circuit breaker (added first, so it is the outermost strategy)
+         pipeline.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
+         {
+             SamplingDuration = TimeSpan.FromSeconds(60),
+             MinimumThroughput = 20,
+             FailureRatio = 0.5,
+             BreakDuration = TimeSpan.FromSeconds(30),
+             ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
+                 .Handle<HttpRequestException>()      // connection-level failures
+                 .Handle<TimeoutRejectedException>()  // timeouts raised by the inner timeout strategy
+                 .HandleResult(r => r.StatusCode >= HttpStatusCode.InternalServerError) // any 5xx, including 503
+         });
+
+         // Bulkhead (concurrency limiter)
+         pipeline.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
+         {
+             PermitLimit = 25,
+             QueueLimit = 50
+         });
+
+         // Timeout per attempt (innermost)
+         pipeline.AddTimeout(TimeSpan.FromSeconds(5));
+     });
+ ```
582
+
583
+ ### Circuit breaker with opossum (Node.js)
584
+
585
+ ```javascript
+ const express = require('express');
+ const CircuitBreaker = require('opossum');
+
+ const app = express();
+ app.use(express.json());
+
+ // Placeholder metrics client (hypothetical); swap in StatsD, Prometheus, etc.
+ const metrics = { increment: (name) => console.log(name) };
+
+ // The function to protect
+ async function callPaymentService(paymentData) {
+   const response = await fetch('https://api.payment-provider.com/charge', {
+     method: 'POST',
+     headers: { 'Content-Type': 'application/json' },
+     body: JSON.stringify(paymentData),
+     signal: AbortSignal.timeout(5000) // 5s timeout per call
+   });
+   if (!response.ok) throw new Error(`Payment failed: ${response.status}`);
+   return response.json();
+ }
+
+ // Wrap in circuit breaker
+ const breaker = new CircuitBreaker(callPaymentService, {
+   timeout: 5000,                // 5 seconds per call
+   errorThresholdPercentage: 50, // open after 50% failure rate
+   resetTimeout: 30000,          // try half-open after 30 seconds
+   volumeThreshold: 10,          // minimum 10 calls before evaluating
+   rollingCountTimeout: 60000,   // 60-second rolling window
+   rollingCountBuckets: 6        // 6 buckets of 10 seconds each
+ });
+
+ // Fallback: runs when the call fails, times out, or the circuit is open
+ breaker.fallback((paymentData) => ({
+   status: 'pending',
+   message: 'Payment queued for processing',
+   retryAfter: 30
+ }));
+
+ // Monitoring
+ breaker.on('open', () => metrics.increment('circuit.payment.opened'));
+ breaker.on('close', () => metrics.increment('circuit.payment.closed'));
+ breaker.on('halfOpen', () => metrics.increment('circuit.payment.halfOpen'));
+ breaker.on('fallback', () => metrics.increment('circuit.payment.fallback'));
+
+ // Usage: with a fallback registered, fire() resolves with the fallback value
+ app.post('/api/pay', async (req, res) => {
+   const result = await breaker.fire(req.body);
+   res.status(result.status === 'pending' ? 202 : 200).json(result);
+ });
+
+ app.listen(3000);
+ ```
629
+
630
+ ### Circuit breaker with gobreaker (Go)
631
+
632
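+ ```go
+ package main
+
+ import (
+ 	"bytes"
+ 	"errors"
+ 	"fmt"
+ 	"log"
+ 	"net/http"
+ 	"time"
+
+ 	"github.com/sony/gobreaker/v2"
+ )
+
+ // One shared HTTP client with a per-call timeout.
+ var httpClient = &http.Client{Timeout: 5 * time.Second}
+
+ func newPaymentBreaker() *gobreaker.CircuitBreaker[*http.Response] {
+ 	return gobreaker.NewCircuitBreaker[*http.Response](gobreaker.Settings{
+ 		Name:        "payment-service",
+ 		MaxRequests: 5,                // max calls in half-open state
+ 		Interval:    60 * time.Second, // counts are cleared every 60s while closed (fixed window)
+ 		Timeout:     30 * time.Second, // duration in open state before half-open
+
+ 		ReadyToTrip: func(counts gobreaker.Counts) bool {
+ 			// Open circuit when failure rate reaches 50%
+ 			// with at least 10 requests in the window
+ 			if counts.Requests < 10 {
+ 				return false
+ 			}
+ 			failureRate := float64(counts.TotalFailures) / float64(counts.Requests)
+ 			return failureRate >= 0.5
+ 		},
+
+ 		OnStateChange: func(name string, from, to gobreaker.State) {
+ 			// Also a natural place to record a state-change metric.
+ 			log.Printf("circuit breaker %s: %s -> %s", name, from, to)
+ 		},
+ 	})
+ }
+
+ // chargePayment wraps the downstream call in the circuit breaker.
+ func chargePayment(cb *gobreaker.CircuitBreaker[*http.Response], body []byte) (*http.Response, error) {
+ 	resp, err := cb.Execute(func() (*http.Response, error) {
+ 		resp, err := httpClient.Post("https://payment-service/charge", "application/json", bytes.NewReader(body))
+ 		if err != nil {
+ 			return nil, err
+ 		}
+ 		// A 5xx response is not a Go error, so report it as a failure explicitly.
+ 		if resp.StatusCode >= http.StatusInternalServerError {
+ 			resp.Body.Close()
+ 			return nil, fmt.Errorf("payment service returned %d", resp.StatusCode)
+ 		}
+ 		return resp, nil
+ 	})
+ 	if err != nil {
+ 		// Circuit open (or half-open and saturated): serve the fallback here.
+ 		if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
+ 			return nil, fmt.Errorf("payment circuit open, use fallback: %w", err)
+ 		}
+ 		return nil, fmt.Errorf("payment call failed: %w", err)
+ 	}
+ 	return resp, nil
+ }
+
+ func main() {
+ 	cb := newPaymentBreaker()
+ 	resp, err := chargePayment(cb, []byte(`{"amount": 100}`))
+ 	if err != nil {
+ 		log.Printf("charge failed: %v", err)
+ 		return
+ 	}
+ 	defer resp.Body.Close()
+ 	log.Printf("charge status: %s", resp.Status)
+ }
+ ```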
685
+
686
+ ### Istio DestinationRule (service mesh circuit breaking)
687
+
688
+ ```yaml
+ apiVersion: networking.istio.io/v1beta1
+ kind: DestinationRule
+ metadata:
+   name: payment-service-circuit-breaker
+ spec:
+   host: payment-service.production.svc.cluster.local
+   trafficPolicy:
+     connectionPool:
+       tcp:
+         maxConnections: 100           # max TCP connections
+       http:
+         h2UpgradePolicy: DEFAULT
+         http1MaxPendingRequests: 50   # max queued requests
+         http2MaxRequests: 100         # max active requests (HTTP/1.1 and HTTP/2, despite the name)
+         maxRequestsPerConnection: 10  # force connection recycling
+         maxRetries: 3                 # max outstanding retries across the destination
+     outlierDetection:
+       consecutive5xxErrors: 5         # eject after 5 consecutive 5xx
+       interval: 30s                   # evaluation interval
+       baseEjectionTime: 60s           # minimum ejection duration
+       maxEjectionPercent: 50          # never eject more than 50% of hosts
+ ```
711
+
712
+ ---
713
+
714
+ ## Cross-References
715
+
716
+ - **[Microservices](../patterns/microservices.md)** — Circuit breakers are mandatory infrastructure for any microservice architecture. See "Failure Mode 5: Cascade failures" in that module.
717
+ - **[Distributed Systems Fundamentals](./distributed-systems-fundamentals.md)** — The CAP theorem and network partition realities that make circuit breakers necessary.
718
+ - **[Idempotency and Retry](./idempotency-and-retry.md)** — Retry logic is the complement to circuit breaking. Retry handles transient failures; circuit breaker handles sustained failures. Always compose as retry-inside-circuit-breaker.
719
+ - **[Saga Pattern](./saga-pattern.md)** — Long-running distributed transactions use sagas for coordination. Circuit breakers protect individual saga steps from cascade failure.
720
+ - **[Event-Driven Architecture](../patterns/event-driven.md)** — Asynchronous messaging reduces the need for circuit breakers by decoupling services temporally. Consider event-driven alternatives before adding circuit breakers to synchronous call chains.
721
+
722
+ ---
723
+
724
+ ## Sources
725
+
726
+ - [Netflix Hystrix Wiki — How It Works](https://github.com/netflix/hystrix/wiki/how-it-works)
727
+ - [Netflix Hystrix GitHub — Latency and Fault Tolerance Library](https://github.com/Netflix/Hystrix)
728
+ - [Resilience4j CircuitBreaker Documentation](https://resilience4j.readme.io/docs/circuitbreaker)
729
+ - [Microsoft Azure — Bulkhead Pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead)
730
+ - [Microsoft Azure — Circuit Breaker Pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker)
731
+ - [AWS Prescriptive Guidance — Circuit Breaker Pattern](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/circuit-breaker.html)
732
+ - [Istio — Circuit Breaking](https://istio.io/latest/docs/tasks/traffic-management/circuit-breaking/)
733
+ - [Google SRE Book — Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/)
734
+ - [Grab Engineering — Designing Resilient Systems: Circuit Breakers or Retries?](https://engineering.grab.com/designing-resilient-systems-part-1)
735
+ - [Netflix Tech Blog — A Closer Look at the Christmas Eve Outage](https://netflixtechblog.com/a-closer-look-at-the-christmas-eve-outage-d7b409a529ee)
736
+ - [AWS — Summary of the December 24, 2012 Amazon ELB Service Event](https://aws.amazon.com/message/680587/)
737
+ - [Sony gobreaker — Circuit Breaker in Go](https://github.com/sony/gobreaker)
738
+ - [Opossum — Node.js Circuit Breaker](https://github.com/nodeshift/opossum)
739
+ - [Comparing Envoy and Istio Circuit Breaking with Netflix OSS Hystrix](https://blog.christianposta.com/microservices/comparing-envoy-and-istio-circuit-breaking-with-netflix-hystrix/)
740
+ - [Marc Brooker — Fixing Retries with Token Buckets and Circuit Breakers](https://brooker.co.za/blog/2022/02/28/retries.html)
741
+ - [InfoQ — How to Avoid Cascading Failures in Distributed Systems](https://www.infoq.com/articles/anatomy-cascading-failure/)