@wazir-dev/cli 1.0.0

Files changed (629)
  1. package/AGENTS.md +111 -0
  2. package/CHANGELOG.md +14 -0
  3. package/CONTRIBUTING.md +101 -0
  4. package/LICENSE +21 -0
  5. package/README.md +314 -0
  6. package/assets/composition-engine.mmd +34 -0
  7. package/assets/demo-script.sh +17 -0
  8. package/assets/logo-dark.svg +14 -0
  9. package/assets/logo.svg +14 -0
  10. package/assets/pipeline.mmd +39 -0
  11. package/assets/record-demo.sh +51 -0
  12. package/docs/README.md +51 -0
  13. package/docs/adapters/context-mode.md +60 -0
  14. package/docs/concepts/architecture.md +87 -0
  15. package/docs/concepts/artifact-model.md +60 -0
  16. package/docs/concepts/composition-engine.md +36 -0
  17. package/docs/concepts/indexing-and-recall.md +160 -0
  18. package/docs/concepts/observability.md +41 -0
  19. package/docs/concepts/roles-and-workflows.md +59 -0
  20. package/docs/concepts/terminology-policy.md +27 -0
  21. package/docs/getting-started/01-installation.md +78 -0
  22. package/docs/getting-started/02-first-run.md +102 -0
  23. package/docs/getting-started/03-adding-to-project.md +15 -0
  24. package/docs/getting-started/04-host-setup.md +15 -0
  25. package/docs/guides/ci-integration.md +15 -0
  26. package/docs/guides/creating-skills.md +15 -0
  27. package/docs/guides/expertise-module-authoring.md +15 -0
  28. package/docs/guides/hook-development.md +15 -0
  29. package/docs/guides/memory-and-learnings.md +34 -0
  30. package/docs/guides/multi-host-export.md +15 -0
  31. package/docs/guides/troubleshooting.md +101 -0
  32. package/docs/guides/writing-custom-roles.md +15 -0
  33. package/docs/plans/2026-03-15-cli-pipeline-integration-design.md +592 -0
  34. package/docs/plans/2026-03-15-cli-pipeline-integration-plan.md +598 -0
  35. package/docs/plans/2026-03-15-docs-enforcement-plan.md +238 -0
  36. package/docs/readmes/INDEX.md +99 -0
  37. package/docs/readmes/features/expertise/README.md +171 -0
  38. package/docs/readmes/features/exports/README.md +222 -0
  39. package/docs/readmes/features/hooks/README.md +103 -0
  40. package/docs/readmes/features/hooks/loop-cap-guard.md +133 -0
  41. package/docs/readmes/features/hooks/post-tool-capture.md +121 -0
  42. package/docs/readmes/features/hooks/post-tool-lint.md +130 -0
  43. package/docs/readmes/features/hooks/pre-compact-summary.md +122 -0
  44. package/docs/readmes/features/hooks/pre-tool-capture-route.md +100 -0
  45. package/docs/readmes/features/hooks/protected-path-write-guard.md +128 -0
  46. package/docs/readmes/features/hooks/session-start.md +119 -0
  47. package/docs/readmes/features/hooks/stop-handoff-harvest.md +125 -0
  48. package/docs/readmes/features/roles/README.md +157 -0
  49. package/docs/readmes/features/roles/clarifier.md +152 -0
  50. package/docs/readmes/features/roles/content-author.md +190 -0
  51. package/docs/readmes/features/roles/designer.md +193 -0
  52. package/docs/readmes/features/roles/executor.md +184 -0
  53. package/docs/readmes/features/roles/learner.md +210 -0
  54. package/docs/readmes/features/roles/planner.md +182 -0
  55. package/docs/readmes/features/roles/researcher.md +164 -0
  56. package/docs/readmes/features/roles/reviewer.md +184 -0
  57. package/docs/readmes/features/roles/specifier.md +162 -0
  58. package/docs/readmes/features/roles/verifier.md +215 -0
  59. package/docs/readmes/features/schemas/README.md +178 -0
  60. package/docs/readmes/features/skills/README.md +63 -0
  61. package/docs/readmes/features/skills/brainstorming.md +96 -0
  62. package/docs/readmes/features/skills/debugging.md +148 -0
  63. package/docs/readmes/features/skills/design.md +120 -0
  64. package/docs/readmes/features/skills/prepare-next.md +109 -0
  65. package/docs/readmes/features/skills/run-audit.md +159 -0
  66. package/docs/readmes/features/skills/scan-project.md +109 -0
  67. package/docs/readmes/features/skills/self-audit.md +176 -0
  68. package/docs/readmes/features/skills/tdd.md +137 -0
  69. package/docs/readmes/features/skills/using-skills.md +92 -0
  70. package/docs/readmes/features/skills/verification.md +120 -0
  71. package/docs/readmes/features/skills/writing-plans.md +104 -0
  72. package/docs/readmes/features/tooling/README.md +320 -0
  73. package/docs/readmes/features/workflows/README.md +186 -0
  74. package/docs/readmes/features/workflows/author.md +181 -0
  75. package/docs/readmes/features/workflows/clarify.md +154 -0
  76. package/docs/readmes/features/workflows/design-review.md +171 -0
  77. package/docs/readmes/features/workflows/design.md +169 -0
  78. package/docs/readmes/features/workflows/discover.md +162 -0
  79. package/docs/readmes/features/workflows/execute.md +173 -0
  80. package/docs/readmes/features/workflows/learn.md +167 -0
  81. package/docs/readmes/features/workflows/plan-review.md +165 -0
  82. package/docs/readmes/features/workflows/plan.md +170 -0
  83. package/docs/readmes/features/workflows/prepare-next.md +167 -0
  84. package/docs/readmes/features/workflows/review.md +169 -0
  85. package/docs/readmes/features/workflows/run-audit.md +191 -0
  86. package/docs/readmes/features/workflows/spec-challenge.md +159 -0
  87. package/docs/readmes/features/workflows/specify.md +160 -0
  88. package/docs/readmes/features/workflows/verify.md +177 -0
  89. package/docs/readmes/packages/README.md +50 -0
  90. package/docs/readmes/packages/ajv.md +117 -0
  91. package/docs/readmes/packages/context-mode.md +118 -0
  92. package/docs/readmes/packages/gray-matter.md +116 -0
  93. package/docs/readmes/packages/node-test.md +137 -0
  94. package/docs/readmes/packages/yaml.md +112 -0
  95. package/docs/reference/configuration-reference.md +159 -0
  96. package/docs/reference/expertise-index.md +52 -0
  97. package/docs/reference/git-flow.md +43 -0
  98. package/docs/reference/hooks.md +87 -0
  99. package/docs/reference/host-exports.md +50 -0
  100. package/docs/reference/launch-checklist.md +172 -0
  101. package/docs/reference/marketplace-listings.md +76 -0
  102. package/docs/reference/release-process.md +34 -0
  103. package/docs/reference/roles-reference.md +77 -0
  104. package/docs/reference/skills.md +33 -0
  105. package/docs/reference/templates.md +29 -0
  106. package/docs/reference/tooling-cli.md +94 -0
  107. package/docs/truth-claims.yaml +222 -0
  108. package/expertise/PROGRESS.md +63 -0
  109. package/expertise/README.md +18 -0
  110. package/expertise/antipatterns/PROGRESS.md +56 -0
  111. package/expertise/antipatterns/backend/api-design-antipatterns.md +1271 -0
  112. package/expertise/antipatterns/backend/auth-antipatterns.md +1195 -0
  113. package/expertise/antipatterns/backend/caching-antipatterns.md +622 -0
  114. package/expertise/antipatterns/backend/database-antipatterns.md +1038 -0
  115. package/expertise/antipatterns/backend/index.md +24 -0
  116. package/expertise/antipatterns/backend/microservices-antipatterns.md +850 -0
  117. package/expertise/antipatterns/code/architecture-antipatterns.md +919 -0
  118. package/expertise/antipatterns/code/async-antipatterns.md +622 -0
  119. package/expertise/antipatterns/code/code-smells.md +1186 -0
  120. package/expertise/antipatterns/code/dependency-antipatterns.md +1209 -0
  121. package/expertise/antipatterns/code/error-handling-antipatterns.md +1360 -0
  122. package/expertise/antipatterns/code/index.md +27 -0
  123. package/expertise/antipatterns/code/naming-and-abstraction.md +1118 -0
  124. package/expertise/antipatterns/code/state-management-antipatterns.md +1076 -0
  125. package/expertise/antipatterns/code/testing-antipatterns.md +1053 -0
  126. package/expertise/antipatterns/design/accessibility-antipatterns.md +1136 -0
  127. package/expertise/antipatterns/design/dark-patterns.md +1121 -0
  128. package/expertise/antipatterns/design/index.md +22 -0
  129. package/expertise/antipatterns/design/ui-antipatterns.md +1202 -0
  130. package/expertise/antipatterns/design/ux-antipatterns.md +680 -0
  131. package/expertise/antipatterns/frontend/css-layout-antipatterns.md +691 -0
  132. package/expertise/antipatterns/frontend/flutter-antipatterns.md +1827 -0
  133. package/expertise/antipatterns/frontend/index.md +23 -0
  134. package/expertise/antipatterns/frontend/mobile-antipatterns.md +573 -0
  135. package/expertise/antipatterns/frontend/react-antipatterns.md +1128 -0
  136. package/expertise/antipatterns/frontend/spa-antipatterns.md +1235 -0
  137. package/expertise/antipatterns/index.md +31 -0
  138. package/expertise/antipatterns/performance/index.md +20 -0
  139. package/expertise/antipatterns/performance/performance-antipatterns.md +1013 -0
  140. package/expertise/antipatterns/performance/premature-optimization.md +623 -0
  141. package/expertise/antipatterns/performance/scaling-antipatterns.md +785 -0
  142. package/expertise/antipatterns/process/ai-coding-antipatterns.md +853 -0
  143. package/expertise/antipatterns/process/code-review-antipatterns.md +656 -0
  144. package/expertise/antipatterns/process/deployment-antipatterns.md +920 -0
  145. package/expertise/antipatterns/process/index.md +23 -0
  146. package/expertise/antipatterns/process/technical-debt-antipatterns.md +647 -0
  147. package/expertise/antipatterns/security/index.md +20 -0
  148. package/expertise/antipatterns/security/secrets-antipatterns.md +849 -0
  149. package/expertise/antipatterns/security/security-theater.md +843 -0
  150. package/expertise/antipatterns/security/vulnerability-patterns.md +801 -0
  151. package/expertise/architecture/PROGRESS.md +70 -0
  152. package/expertise/architecture/data/caching-architecture.md +671 -0
  153. package/expertise/architecture/data/data-consistency.md +574 -0
  154. package/expertise/architecture/data/data-modeling.md +536 -0
  155. package/expertise/architecture/data/event-streams-and-queues.md +634 -0
  156. package/expertise/architecture/data/index.md +25 -0
  157. package/expertise/architecture/data/search-architecture.md +663 -0
  158. package/expertise/architecture/data/sql-vs-nosql.md +708 -0
  159. package/expertise/architecture/decisions/architecture-decision-records.md +640 -0
  160. package/expertise/architecture/decisions/build-vs-buy.md +616 -0
  161. package/expertise/architecture/decisions/index.md +23 -0
  162. package/expertise/architecture/decisions/monolith-to-microservices.md +790 -0
  163. package/expertise/architecture/decisions/technology-selection.md +616 -0
  164. package/expertise/architecture/distributed/cap-theorem-and-tradeoffs.md +800 -0
  165. package/expertise/architecture/distributed/circuit-breaker-bulkhead.md +741 -0
  166. package/expertise/architecture/distributed/consensus-and-coordination.md +796 -0
  167. package/expertise/architecture/distributed/distributed-systems-fundamentals.md +564 -0
  168. package/expertise/architecture/distributed/idempotency-and-retry.md +796 -0
  169. package/expertise/architecture/distributed/index.md +25 -0
  170. package/expertise/architecture/distributed/saga-pattern.md +797 -0
  171. package/expertise/architecture/foundations/architectural-thinking.md +460 -0
  172. package/expertise/architecture/foundations/coupling-and-cohesion.md +770 -0
  173. package/expertise/architecture/foundations/design-principles-solid.md +649 -0
  174. package/expertise/architecture/foundations/domain-driven-design.md +719 -0
  175. package/expertise/architecture/foundations/index.md +25 -0
  176. package/expertise/architecture/foundations/separation-of-concerns.md +472 -0
  177. package/expertise/architecture/foundations/twelve-factor-app.md +797 -0
  178. package/expertise/architecture/index.md +34 -0
  179. package/expertise/architecture/integration/api-design-graphql.md +638 -0
  180. package/expertise/architecture/integration/api-design-grpc.md +804 -0
  181. package/expertise/architecture/integration/api-design-rest.md +892 -0
  182. package/expertise/architecture/integration/index.md +25 -0
  183. package/expertise/architecture/integration/third-party-integration.md +795 -0
  184. package/expertise/architecture/integration/webhooks-and-callbacks.md +1152 -0
  185. package/expertise/architecture/integration/websockets-realtime.md +791 -0
  186. package/expertise/architecture/mobile-architecture/index.md +22 -0
  187. package/expertise/architecture/mobile-architecture/mobile-app-architecture.md +780 -0
  188. package/expertise/architecture/mobile-architecture/mobile-backend-for-frontend.md +670 -0
  189. package/expertise/architecture/mobile-architecture/offline-first.md +719 -0
  190. package/expertise/architecture/mobile-architecture/push-and-sync.md +782 -0
  191. package/expertise/architecture/patterns/cqrs-event-sourcing.md +717 -0
  192. package/expertise/architecture/patterns/event-driven.md +797 -0
  193. package/expertise/architecture/patterns/hexagonal-clean-architecture.md +870 -0
  194. package/expertise/architecture/patterns/index.md +27 -0
  195. package/expertise/architecture/patterns/layered-architecture.md +736 -0
  196. package/expertise/architecture/patterns/microservices.md +753 -0
  197. package/expertise/architecture/patterns/modular-monolith.md +692 -0
  198. package/expertise/architecture/patterns/monolith.md +626 -0
  199. package/expertise/architecture/patterns/plugin-architecture.md +735 -0
  200. package/expertise/architecture/patterns/serverless.md +780 -0
  201. package/expertise/architecture/scaling/database-scaling.md +615 -0
  202. package/expertise/architecture/scaling/feature-flags-and-rollouts.md +757 -0
  203. package/expertise/architecture/scaling/horizontal-vs-vertical.md +606 -0
  204. package/expertise/architecture/scaling/index.md +24 -0
  205. package/expertise/architecture/scaling/multi-tenancy.md +800 -0
  206. package/expertise/architecture/scaling/stateless-design.md +787 -0
  207. package/expertise/backend/embedded-firmware.md +625 -0
  208. package/expertise/backend/go.md +853 -0
  209. package/expertise/backend/index.md +24 -0
  210. package/expertise/backend/java-spring.md +448 -0
  211. package/expertise/backend/node-typescript.md +625 -0
  212. package/expertise/backend/python-fastapi.md +724 -0
  213. package/expertise/backend/rust.md +458 -0
  214. package/expertise/backend/solidity.md +711 -0
  215. package/expertise/composition-map.yaml +443 -0
  216. package/expertise/content/foundations/content-modeling.md +395 -0
  217. package/expertise/content/foundations/editorial-standards.md +449 -0
  218. package/expertise/content/foundations/index.md +24 -0
  219. package/expertise/content/foundations/microcopy.md +455 -0
  220. package/expertise/content/foundations/terminology-governance.md +509 -0
  221. package/expertise/content/index.md +34 -0
  222. package/expertise/content/patterns/accessibility-copy.md +518 -0
  223. package/expertise/content/patterns/index.md +24 -0
  224. package/expertise/content/patterns/notification-content.md +433 -0
  225. package/expertise/content/patterns/sample-content.md +486 -0
  226. package/expertise/content/patterns/state-copy.md +439 -0
  227. package/expertise/design/PROGRESS.md +58 -0
  228. package/expertise/design/disciplines/dark-mode-theming.md +577 -0
  229. package/expertise/design/disciplines/design-systems.md +595 -0
  230. package/expertise/design/disciplines/index.md +25 -0
  231. package/expertise/design/disciplines/information-architecture.md +800 -0
  232. package/expertise/design/disciplines/interaction-design.md +788 -0
  233. package/expertise/design/disciplines/responsive-design.md +552 -0
  234. package/expertise/design/disciplines/usability-testing.md +516 -0
  235. package/expertise/design/disciplines/user-research.md +792 -0
  236. package/expertise/design/foundations/accessibility-design.md +796 -0
  237. package/expertise/design/foundations/color-theory.md +797 -0
  238. package/expertise/design/foundations/iconography.md +795 -0
  239. package/expertise/design/foundations/index.md +26 -0
  240. package/expertise/design/foundations/motion-and-animation.md +653 -0
  241. package/expertise/design/foundations/rtl-design.md +585 -0
  242. package/expertise/design/foundations/spacing-and-layout.md +607 -0
  243. package/expertise/design/foundations/typography.md +800 -0
  244. package/expertise/design/foundations/visual-hierarchy.md +761 -0
  245. package/expertise/design/index.md +32 -0
  246. package/expertise/design/patterns/authentication-flows.md +474 -0
  247. package/expertise/design/patterns/content-consumption.md +789 -0
  248. package/expertise/design/patterns/data-display.md +618 -0
  249. package/expertise/design/patterns/e-commerce.md +1494 -0
  250. package/expertise/design/patterns/feedback-and-states.md +642 -0
  251. package/expertise/design/patterns/forms-and-input.md +819 -0
  252. package/expertise/design/patterns/gamification.md +801 -0
  253. package/expertise/design/patterns/index.md +31 -0
  254. package/expertise/design/patterns/microinteractions.md +449 -0
  255. package/expertise/design/patterns/navigation.md +800 -0
  256. package/expertise/design/patterns/notifications.md +705 -0
  257. package/expertise/design/patterns/onboarding.md +700 -0
  258. package/expertise/design/patterns/search-and-filter.md +601 -0
  259. package/expertise/design/patterns/settings-and-preferences.md +768 -0
  260. package/expertise/design/patterns/social-and-community.md +748 -0
  261. package/expertise/design/platforms/desktop-native.md +612 -0
  262. package/expertise/design/platforms/index.md +25 -0
  263. package/expertise/design/platforms/mobile-android.md +825 -0
  264. package/expertise/design/platforms/mobile-cross-platform.md +983 -0
  265. package/expertise/design/platforms/mobile-ios.md +699 -0
  266. package/expertise/design/platforms/tablet.md +794 -0
  267. package/expertise/design/platforms/web-dashboard.md +790 -0
  268. package/expertise/design/platforms/web-responsive.md +550 -0
  269. package/expertise/design/psychology/behavioral-nudges.md +449 -0
  270. package/expertise/design/psychology/cognitive-load.md +1191 -0
  271. package/expertise/design/psychology/error-psychology.md +778 -0
  272. package/expertise/design/psychology/index.md +22 -0
  273. package/expertise/design/psychology/persuasive-design.md +736 -0
  274. package/expertise/design/psychology/user-mental-models.md +623 -0
  275. package/expertise/design/tooling/open-pencil.md +266 -0
  276. package/expertise/frontend/angular.md +1073 -0
  277. package/expertise/frontend/desktop-electron.md +546 -0
  278. package/expertise/frontend/flutter.md +782 -0
  279. package/expertise/frontend/index.md +27 -0
  280. package/expertise/frontend/native-android.md +409 -0
  281. package/expertise/frontend/native-ios.md +490 -0
  282. package/expertise/frontend/react-native.md +1160 -0
  283. package/expertise/frontend/react.md +808 -0
  284. package/expertise/frontend/vue.md +1089 -0
  285. package/expertise/humanize/domain-rules-code.md +79 -0
  286. package/expertise/humanize/domain-rules-content.md +67 -0
  287. package/expertise/humanize/domain-rules-technical-docs.md +56 -0
  288. package/expertise/humanize/index.md +35 -0
  289. package/expertise/humanize/self-audit-checklist.md +87 -0
  290. package/expertise/humanize/sentence-patterns.md +218 -0
  291. package/expertise/humanize/vocabulary-blacklist.md +105 -0
  292. package/expertise/i18n/PROGRESS.md +65 -0
  293. package/expertise/i18n/advanced/accessibility-and-i18n.md +28 -0
  294. package/expertise/i18n/advanced/bidirectional-text-algorithm.md +38 -0
  295. package/expertise/i18n/advanced/complex-scripts.md +30 -0
  296. package/expertise/i18n/advanced/performance-and-i18n.md +27 -0
  297. package/expertise/i18n/advanced/testing-i18n.md +28 -0
  298. package/expertise/i18n/content/content-adaptation.md +23 -0
  299. package/expertise/i18n/content/locale-specific-formatting.md +23 -0
  300. package/expertise/i18n/content/machine-translation-integration.md +28 -0
  301. package/expertise/i18n/content/translation-management.md +29 -0
  302. package/expertise/i18n/foundations/date-time-calendars.md +67 -0
  303. package/expertise/i18n/foundations/i18n-architecture.md +272 -0
  304. package/expertise/i18n/foundations/locale-and-language-tags.md +79 -0
  305. package/expertise/i18n/foundations/numbers-currency-units.md +61 -0
  306. package/expertise/i18n/foundations/pluralization-and-gender.md +109 -0
  307. package/expertise/i18n/foundations/string-externalization.md +236 -0
  308. package/expertise/i18n/foundations/text-direction-bidi.md +241 -0
  309. package/expertise/i18n/foundations/unicode-and-encoding.md +86 -0
  310. package/expertise/i18n/index.md +38 -0
  311. package/expertise/i18n/platform/backend-i18n.md +31 -0
  312. package/expertise/i18n/platform/flutter-i18n.md +148 -0
  313. package/expertise/i18n/platform/native-android-i18n.md +36 -0
  314. package/expertise/i18n/platform/native-ios-i18n.md +36 -0
  315. package/expertise/i18n/platform/react-i18n.md +103 -0
  316. package/expertise/i18n/platform/web-css-i18n.md +81 -0
  317. package/expertise/i18n/rtl/arabic-specific.md +175 -0
  318. package/expertise/i18n/rtl/hebrew-specific.md +149 -0
  319. package/expertise/i18n/rtl/rtl-animations-and-transitions.md +111 -0
  320. package/expertise/i18n/rtl/rtl-forms-and-input.md +161 -0
  321. package/expertise/i18n/rtl/rtl-fundamentals.md +211 -0
  322. package/expertise/i18n/rtl/rtl-icons-and-images.md +181 -0
  323. package/expertise/i18n/rtl/rtl-layout-mirroring.md +252 -0
  324. package/expertise/i18n/rtl/rtl-navigation-and-gestures.md +107 -0
  325. package/expertise/i18n/rtl/rtl-testing-and-qa.md +147 -0
  326. package/expertise/i18n/rtl/rtl-typography.md +160 -0
  327. package/expertise/index.md +113 -0
  328. package/expertise/index.yaml +216 -0
  329. package/expertise/infrastructure/cloud-aws.md +597 -0
  330. package/expertise/infrastructure/cloud-gcp.md +599 -0
  331. package/expertise/infrastructure/cybersecurity.md +816 -0
  332. package/expertise/infrastructure/database-mongodb.md +447 -0
  333. package/expertise/infrastructure/database-postgres.md +400 -0
  334. package/expertise/infrastructure/devops-cicd.md +787 -0
  335. package/expertise/infrastructure/index.md +27 -0
  336. package/expertise/performance/PROGRESS.md +50 -0
  337. package/expertise/performance/backend/api-latency.md +1204 -0
  338. package/expertise/performance/backend/background-jobs.md +506 -0
  339. package/expertise/performance/backend/connection-pooling.md +1209 -0
  340. package/expertise/performance/backend/database-query-optimization.md +515 -0
  341. package/expertise/performance/backend/index.md +23 -0
  342. package/expertise/performance/backend/rate-limiting-and-throttling.md +971 -0
  343. package/expertise/performance/foundations/algorithmic-complexity.md +954 -0
  344. package/expertise/performance/foundations/caching-strategies.md +489 -0
  345. package/expertise/performance/foundations/concurrency-and-parallelism.md +847 -0
  346. package/expertise/performance/foundations/index.md +24 -0
  347. package/expertise/performance/foundations/measuring-and-profiling.md +440 -0
  348. package/expertise/performance/foundations/memory-management.md +964 -0
  349. package/expertise/performance/foundations/performance-budgets.md +1314 -0
  350. package/expertise/performance/index.md +31 -0
  351. package/expertise/performance/infrastructure/auto-scaling.md +1059 -0
  352. package/expertise/performance/infrastructure/cdn-and-edge.md +1081 -0
  353. package/expertise/performance/infrastructure/index.md +22 -0
  354. package/expertise/performance/infrastructure/load-balancing.md +1081 -0
  355. package/expertise/performance/infrastructure/observability.md +1079 -0
  356. package/expertise/performance/mobile/index.md +23 -0
  357. package/expertise/performance/mobile/mobile-animations.md +544 -0
  358. package/expertise/performance/mobile/mobile-memory-battery.md +416 -0
  359. package/expertise/performance/mobile/mobile-network.md +452 -0
  360. package/expertise/performance/mobile/mobile-rendering.md +599 -0
  361. package/expertise/performance/mobile/mobile-startup-time.md +505 -0
  362. package/expertise/performance/platform-specific/flutter-performance.md +647 -0
  363. package/expertise/performance/platform-specific/index.md +22 -0
  364. package/expertise/performance/platform-specific/node-performance.md +1307 -0
  365. package/expertise/performance/platform-specific/postgres-performance.md +1366 -0
  366. package/expertise/performance/platform-specific/react-performance.md +1403 -0
  367. package/expertise/performance/web/bundle-optimization.md +1239 -0
  368. package/expertise/performance/web/image-and-media.md +636 -0
  369. package/expertise/performance/web/index.md +24 -0
  370. package/expertise/performance/web/network-optimization.md +1133 -0
  371. package/expertise/performance/web/rendering-performance.md +1098 -0
  372. package/expertise/performance/web/ssr-and-hydration.md +918 -0
  373. package/expertise/performance/web/web-vitals.md +1374 -0
  374. package/expertise/quality/accessibility.md +985 -0
  375. package/expertise/quality/evidence-based-verification.md +499 -0
  376. package/expertise/quality/index.md +24 -0
  377. package/expertise/quality/ml-model-audit.md +614 -0
  378. package/expertise/quality/performance.md +600 -0
  379. package/expertise/quality/testing-api.md +891 -0
  380. package/expertise/quality/testing-mobile.md +496 -0
  381. package/expertise/quality/testing-web.md +849 -0
  382. package/expertise/security/PROGRESS.md +54 -0
  383. package/expertise/security/agentic-identity.md +540 -0
  384. package/expertise/security/compliance-frameworks.md +601 -0
  385. package/expertise/security/data/data-encryption.md +364 -0
  386. package/expertise/security/data/data-privacy-gdpr.md +692 -0
  387. package/expertise/security/data/database-security.md +1171 -0
  388. package/expertise/security/data/index.md +22 -0
  389. package/expertise/security/data/pii-handling.md +531 -0
  390. package/expertise/security/foundations/authentication.md +1041 -0
  391. package/expertise/security/foundations/authorization.md +603 -0
  392. package/expertise/security/foundations/cryptography.md +1001 -0
  393. package/expertise/security/foundations/index.md +25 -0
  394. package/expertise/security/foundations/owasp-top-10.md +1354 -0
  395. package/expertise/security/foundations/secrets-management.md +1217 -0
  396. package/expertise/security/foundations/secure-sdlc.md +700 -0
  397. package/expertise/security/foundations/supply-chain-security.md +698 -0
  398. package/expertise/security/index.md +31 -0
  399. package/expertise/security/infrastructure/cloud-security-aws.md +1296 -0
  400. package/expertise/security/infrastructure/cloud-security-gcp.md +1376 -0
  401. package/expertise/security/infrastructure/container-security.md +721 -0
  402. package/expertise/security/infrastructure/incident-response.md +1295 -0
  403. package/expertise/security/infrastructure/index.md +24 -0
  404. package/expertise/security/infrastructure/logging-and-monitoring.md +1618 -0
  405. package/expertise/security/infrastructure/network-security.md +1337 -0
  406. package/expertise/security/mobile/index.md +23 -0
  407. package/expertise/security/mobile/mobile-android-security.md +1218 -0
  408. package/expertise/security/mobile/mobile-binary-protection.md +1229 -0
  409. package/expertise/security/mobile/mobile-data-storage.md +1265 -0
  410. package/expertise/security/mobile/mobile-ios-security.md +1401 -0
  411. package/expertise/security/mobile/mobile-network-security.md +1520 -0
  412. package/expertise/security/smart-contract-security.md +594 -0
  413. package/expertise/security/testing/index.md +22 -0
  414. package/expertise/security/testing/penetration-testing.md +1258 -0
  415. package/expertise/security/testing/security-code-review.md +1765 -0
  416. package/expertise/security/testing/threat-modeling.md +1074 -0
  417. package/expertise/security/testing/vulnerability-scanning.md +1062 -0
  418. package/expertise/security/web/api-security.md +586 -0
  419. package/expertise/security/web/cors-and-headers.md +433 -0
  420. package/expertise/security/web/csrf.md +562 -0
  421. package/expertise/security/web/file-upload.md +1477 -0
  422. package/expertise/security/web/index.md +25 -0
  423. package/expertise/security/web/injection.md +1375 -0
  424. package/expertise/security/web/session-management.md +1101 -0
  425. package/expertise/security/web/xss.md +1158 -0
  426. package/exports/README.md +17 -0
  427. package/exports/hosts/claude/.claude/agents/clarifier.md +42 -0
  428. package/exports/hosts/claude/.claude/agents/content-author.md +63 -0
  429. package/exports/hosts/claude/.claude/agents/designer.md +55 -0
  430. package/exports/hosts/claude/.claude/agents/executor.md +55 -0
  431. package/exports/hosts/claude/.claude/agents/learner.md +51 -0
  432. package/exports/hosts/claude/.claude/agents/planner.md +53 -0
  433. package/exports/hosts/claude/.claude/agents/researcher.md +43 -0
  434. package/exports/hosts/claude/.claude/agents/reviewer.md +54 -0
  435. package/exports/hosts/claude/.claude/agents/specifier.md +47 -0
  436. package/exports/hosts/claude/.claude/agents/verifier.md +71 -0
  437. package/exports/hosts/claude/.claude/commands/author.md +42 -0
  438. package/exports/hosts/claude/.claude/commands/clarify.md +38 -0
  439. package/exports/hosts/claude/.claude/commands/design-review.md +46 -0
  440. package/exports/hosts/claude/.claude/commands/design.md +44 -0
  441. package/exports/hosts/claude/.claude/commands/discover.md +37 -0
  442. package/exports/hosts/claude/.claude/commands/execute.md +48 -0
  443. package/exports/hosts/claude/.claude/commands/learn.md +38 -0
  444. package/exports/hosts/claude/.claude/commands/plan-review.md +42 -0
  445. package/exports/hosts/claude/.claude/commands/plan.md +39 -0
  446. package/exports/hosts/claude/.claude/commands/prepare-next.md +37 -0
  447. package/exports/hosts/claude/.claude/commands/review.md +40 -0
  448. package/exports/hosts/claude/.claude/commands/run-audit.md +41 -0
  449. package/exports/hosts/claude/.claude/commands/spec-challenge.md +41 -0
  450. package/exports/hosts/claude/.claude/commands/specify.md +38 -0
  451. package/exports/hosts/claude/.claude/commands/verify.md +37 -0
  452. package/exports/hosts/claude/.claude/settings.json +34 -0
  453. package/exports/hosts/claude/CLAUDE.md +19 -0
  454. package/exports/hosts/claude/export.manifest.json +38 -0
  455. package/exports/hosts/claude/host-package.json +67 -0
  456. package/exports/hosts/codex/AGENTS.md +19 -0
  457. package/exports/hosts/codex/export.manifest.json +38 -0
  458. package/exports/hosts/codex/host-package.json +41 -0
  459. package/exports/hosts/cursor/.cursor/hooks.json +16 -0
  460. package/exports/hosts/cursor/.cursor/rules/wazir-core.mdc +19 -0
  461. package/exports/hosts/cursor/export.manifest.json +38 -0
  462. package/exports/hosts/cursor/host-package.json +42 -0
  463. package/exports/hosts/gemini/GEMINI.md +19 -0
  464. package/exports/hosts/gemini/export.manifest.json +38 -0
  465. package/exports/hosts/gemini/host-package.json +41 -0
  466. package/hooks/README.md +18 -0
  467. package/hooks/definitions/loop_cap_guard.yaml +21 -0
  468. package/hooks/definitions/post_tool_capture.yaml +24 -0
  469. package/hooks/definitions/pre_compact_summary.yaml +19 -0
  470. package/hooks/definitions/pre_tool_capture_route.yaml +19 -0
  471. package/hooks/definitions/protected_path_write_guard.yaml +19 -0
  472. package/hooks/definitions/session_start.yaml +19 -0
  473. package/hooks/definitions/stop_handoff_harvest.yaml +20 -0
  474. package/hooks/loop-cap-guard +17 -0
  475. package/hooks/post-tool-lint +36 -0
  476. package/hooks/protected-path-write-guard +17 -0
  477. package/hooks/session-start +41 -0
  478. package/llms-full.txt +2355 -0
  479. package/llms.txt +43 -0
  480. package/package.json +79 -0
  481. package/roles/README.md +20 -0
  482. package/roles/clarifier.md +42 -0
  483. package/roles/content-author.md +63 -0
  484. package/roles/designer.md +55 -0
  485. package/roles/executor.md +55 -0
  486. package/roles/learner.md +51 -0
  487. package/roles/planner.md +53 -0
  488. package/roles/researcher.md +43 -0
  489. package/roles/reviewer.md +54 -0
  490. package/roles/specifier.md +47 -0
  491. package/roles/verifier.md +71 -0
  492. package/schemas/README.md +24 -0
  493. package/schemas/accepted-learning.schema.json +20 -0
  494. package/schemas/author-artifact.schema.json +156 -0
  495. package/schemas/clarification.schema.json +19 -0
  496. package/schemas/design-artifact.schema.json +80 -0
  497. package/schemas/docs-claim.schema.json +18 -0
  498. package/schemas/export-manifest.schema.json +20 -0
  499. package/schemas/hook.schema.json +67 -0
  500. package/schemas/host-export-package.schema.json +18 -0
  501. package/schemas/implementation-plan.schema.json +19 -0
  502. package/schemas/proposed-learning.schema.json +19 -0
  503. package/schemas/research.schema.json +18 -0
  504. package/schemas/review.schema.json +29 -0
  505. package/schemas/run-manifest.schema.json +18 -0
  506. package/schemas/spec-challenge.schema.json +18 -0
  507. package/schemas/spec.schema.json +20 -0
  508. package/schemas/usage.schema.json +102 -0
  509. package/schemas/verification-proof.schema.json +29 -0
  510. package/schemas/wazir-manifest.schema.json +173 -0
  511. package/skills/README.md +40 -0
  512. package/skills/brainstorming/SKILL.md +77 -0
  513. package/skills/debugging/SKILL.md +50 -0
  514. package/skills/design/SKILL.md +61 -0
  515. package/skills/dispatching-parallel-agents/SKILL.md +128 -0
  516. package/skills/executing-plans/SKILL.md +70 -0
  517. package/skills/finishing-a-development-branch/SKILL.md +169 -0
  518. package/skills/humanize/SKILL.md +123 -0
  519. package/skills/init-pipeline/SKILL.md +124 -0
  520. package/skills/prepare-next/SKILL.md +20 -0
  521. package/skills/receiving-code-review/SKILL.md +123 -0
  522. package/skills/requesting-code-review/SKILL.md +105 -0
  523. package/skills/requesting-code-review/code-reviewer.md +108 -0
  524. package/skills/run-audit/SKILL.md +197 -0
  525. package/skills/scan-project/SKILL.md +41 -0
  526. package/skills/self-audit/SKILL.md +153 -0
  527. package/skills/subagent-driven-development/SKILL.md +154 -0
  528. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +26 -0
  529. package/skills/subagent-driven-development/implementer-prompt.md +102 -0
  530. package/skills/subagent-driven-development/spec-reviewer-prompt.md +61 -0
  531. package/skills/tdd/SKILL.md +23 -0
  532. package/skills/using-git-worktrees/SKILL.md +163 -0
  533. package/skills/using-skills/SKILL.md +95 -0
  534. package/skills/verification/SKILL.md +22 -0
  535. package/skills/wazir/SKILL.md +463 -0
  536. package/skills/writing-plans/SKILL.md +30 -0
  537. package/skills/writing-skills/SKILL.md +157 -0
  538. package/skills/writing-skills/anthropic-best-practices.md +122 -0
  539. package/skills/writing-skills/persuasion-principles.md +50 -0
  540. package/templates/README.md +20 -0
  541. package/templates/artifacts/README.md +10 -0
  542. package/templates/artifacts/accepted-learning.md +19 -0
  543. package/templates/artifacts/accepted-learning.template.json +12 -0
  544. package/templates/artifacts/author.md +74 -0
  545. package/templates/artifacts/author.template.json +19 -0
  546. package/templates/artifacts/clarification.md +21 -0
  547. package/templates/artifacts/clarification.template.json +12 -0
  548. package/templates/artifacts/execute-notes.md +19 -0
  549. package/templates/artifacts/implementation-plan.md +21 -0
  550. package/templates/artifacts/implementation-plan.template.json +11 -0
  551. package/templates/artifacts/learning-proposal.md +19 -0
  552. package/templates/artifacts/next-run-handoff.md +21 -0
  553. package/templates/artifacts/plan-review.md +19 -0
  554. package/templates/artifacts/proposed-learning.template.json +12 -0
  555. package/templates/artifacts/research.md +21 -0
  556. package/templates/artifacts/research.template.json +12 -0
  557. package/templates/artifacts/review-findings.md +19 -0
  558. package/templates/artifacts/review.template.json +11 -0
  559. package/templates/artifacts/run-manifest.template.json +8 -0
  560. package/templates/artifacts/spec-challenge.md +19 -0
  561. package/templates/artifacts/spec-challenge.template.json +11 -0
  562. package/templates/artifacts/spec.md +21 -0
  563. package/templates/artifacts/spec.template.json +12 -0
  564. package/templates/artifacts/verification-proof.md +19 -0
  565. package/templates/artifacts/verification-proof.template.json +11 -0
  566. package/templates/examples/accepted-learning.example.json +14 -0
  567. package/templates/examples/author.example.json +152 -0
  568. package/templates/examples/clarification.example.json +15 -0
  569. package/templates/examples/docs-claim.example.json +8 -0
  570. package/templates/examples/export-manifest.example.json +7 -0
  571. package/templates/examples/host-export-package.example.json +11 -0
  572. package/templates/examples/implementation-plan.example.json +17 -0
  573. package/templates/examples/proposed-learning.example.json +13 -0
  574. package/templates/examples/research.example.json +15 -0
  575. package/templates/examples/research.example.md +6 -0
  576. package/templates/examples/review.example.json +17 -0
  577. package/templates/examples/run-manifest.example.json +9 -0
  578. package/templates/examples/spec-challenge.example.json +14 -0
  579. package/templates/examples/spec.example.json +21 -0
  580. package/templates/examples/verification-proof.example.json +21 -0
  581. package/templates/examples/wazir-manifest.example.yaml +65 -0
  582. package/templates/task-definition-schema.md +99 -0
  583. package/tooling/README.md +20 -0
  584. package/tooling/src/adapters/context-mode.js +50 -0
  585. package/tooling/src/capture/command.js +376 -0
  586. package/tooling/src/capture/store.js +99 -0
  587. package/tooling/src/capture/usage.js +270 -0
  588. package/tooling/src/checks/branches.js +50 -0
  589. package/tooling/src/checks/brand-truth.js +110 -0
  590. package/tooling/src/checks/changelog.js +231 -0
  591. package/tooling/src/checks/command-registry.js +36 -0
  592. package/tooling/src/checks/commits.js +102 -0
  593. package/tooling/src/checks/docs-drift.js +103 -0
  594. package/tooling/src/checks/docs-truth.js +201 -0
  595. package/tooling/src/checks/runtime-surface.js +156 -0
  596. package/tooling/src/cli.js +116 -0
  597. package/tooling/src/command-options.js +56 -0
  598. package/tooling/src/commands/validate.js +320 -0
  599. package/tooling/src/doctor/command.js +91 -0
  600. package/tooling/src/export/command.js +77 -0
  601. package/tooling/src/export/compiler.js +498 -0
  602. package/tooling/src/guards/loop-cap-guard.js +52 -0
  603. package/tooling/src/guards/protected-path-write-guard.js +67 -0
  604. package/tooling/src/index/command.js +152 -0
  605. package/tooling/src/index/storage.js +1061 -0
  606. package/tooling/src/index/summarizers.js +261 -0
  607. package/tooling/src/loaders.js +18 -0
  608. package/tooling/src/project-root.js +22 -0
  609. package/tooling/src/recall/command.js +225 -0
  610. package/tooling/src/schema-validator.js +30 -0
  611. package/tooling/src/state-root.js +40 -0
  612. package/tooling/src/status/command.js +71 -0
  613. package/wazir.manifest.yaml +135 -0
  614. package/workflows/README.md +19 -0
  615. package/workflows/author.md +42 -0
  616. package/workflows/clarify.md +38 -0
  617. package/workflows/design-review.md +46 -0
  618. package/workflows/design.md +44 -0
  619. package/workflows/discover.md +37 -0
  620. package/workflows/execute.md +48 -0
  621. package/workflows/learn.md +38 -0
  622. package/workflows/plan-review.md +42 -0
  623. package/workflows/plan.md +39 -0
  624. package/workflows/prepare-next.md +37 -0
  625. package/workflows/review.md +40 -0
  626. package/workflows/run-audit.md +41 -0
  627. package/workflows/spec-challenge.md +41 -0
  628. package/workflows/specify.md +38 -0
  629. package/workflows/verify.md +37 -0
@@ -0,0 +1,796 @@
1
+ # Consensus and Coordination -- Architecture Expertise Module
2
+
3
+ > Consensus protocols enable distributed nodes to agree on a single value despite failures.
4
+ > Coordination services (ZooKeeper, etcd) provide distributed primitives like leader election,
5
+ > distributed locks, and configuration management. Most developers should use existing
6
+ > coordination services rather than implementing consensus protocols directly.
7
+
8
+ > **Category:** Distributed
9
+ > **Complexity:** Expert
10
+ > **Applies when:** Systems needing leader election, distributed locking, configuration consensus, or state machine replication across nodes
11
+
12
+ ---
13
+
14
+ ## What This Is (and What It Isn't)
15
+
16
+ ### The Consensus Problem
17
+
18
+ The consensus problem is deceptively simple to state: given a set of N distributed nodes, get
19
+ them all to agree on a single value, even if some nodes crash. A consensus protocol must satisfy
20
+ three properties:
21
+
22
+ 1. **Agreement:** All non-faulty nodes decide on the same value.
23
+ 2. **Validity:** The decided value was proposed by some node (no "magic" values).
24
+ 3. **Termination:** Every non-faulty node eventually decides.
25
+
26
+ These three properties sound trivial until you add real-world constraints: networks drop and
27
+ reorder messages, nodes crash and restart with stale state, and there is no global clock that
28
+ all participants trust.
29
+
30
+ ### The FLP Impossibility Result
31
+
32
+ In 1985, Fischer, Lynch, and Paterson proved one of the most important results in computer
33
+ science: **no deterministic consensus algorithm can guarantee termination in an asynchronous
34
+ system where even one process may crash** (the "FLP impossibility"). This result won the
35
+ Dijkstra Prize, which recognizes the most influential papers in distributed computing.
36
+
37
+ The intuition behind FLP: in a fully asynchronous system, you cannot distinguish a crashed
38
+ node from a very slow node. Any protocol that waits for a response might wait forever; any
39
+ protocol that proceeds without the response might disagree with that node if it was merely
40
+ slow.
41
+
42
+ **What FLP does NOT say:**
43
+
44
+ - It does NOT say consensus is impossible in practice. It says no *deterministic* algorithm
45
+ can *guarantee* termination in a *fully asynchronous* model.
46
+ - Practical systems circumvent FLP by relaxing the model: using partial synchrony assumptions
47
+ (timeouts), randomization, or failure detectors. Raft and Paxos both rely on eventual leader
48
+ stability (a partial synchrony assumption) to make progress.
49
+
50
+ ### Paxos
51
+
52
+ Leslie Lamport described Paxos allegorically in "The Part-Time Parliament" (first circulated
53
+ in 1989, published in 1998) and restated it in "Paxos Made Simple" (2001). Paxos defines three roles:
54
+
55
+ - **Proposer:** Suggests a value to be agreed upon.
56
+ - **Acceptor:** Votes on proposals. A majority of acceptors constitutes a quorum.
57
+ - **Learner:** Learns the decided value once a quorum of acceptors agrees.
58
+
59
+ The protocol operates in two phases:
60
+
61
+ 1. **Prepare (Phase 1):** The proposer sends a `Prepare(n)` message with a unique proposal
62
+ number `n`. Each acceptor promises not to accept proposals with numbers less than `n` and
63
+ returns any previously accepted value.
64
+ 2. **Accept (Phase 2):** If the proposer receives promises from a majority, it sends
65
+ `Accept(n, v)` where `v` is either the highest-numbered previously accepted value or the
66
+ proposer's own value. Each acceptor accepts if it has not promised to a higher number.
67
+
68
+ **Multi-Paxos** extends single-decree Paxos to decide a sequence of values (a replicated log)
69
+ by reusing the same leader across multiple rounds, amortizing the cost of Phase 1.
70
+
71
+ Paxos is notoriously difficult to understand and implement correctly. Google's Chubby lock
72
+ service and Spanner database use Paxos internally. Azure Storage also uses Paxos for
73
+ consistency across its distributed storage layer.
74
+
75
+ ### Raft
76
+
77
+ Diego Ongaro and John Ousterhout designed Raft in 2014 explicitly to be more understandable
78
+ than Paxos. Raft decomposes consensus into three cleanly separated subproblems:
79
+
80
+ 1. **Leader Election:** Nodes are in one of three states: follower, candidate, or leader.
81
+ A heartbeat mechanism triggers elections. If a follower receives no communication for an
82
+ *election timeout* period, it becomes a candidate and requests votes. A candidate becomes
83
+ leader by receiving votes from a majority of nodes.
84
+
85
+ 2. **Log Replication:** The leader accepts client requests, appends them to its log, and
86
+ replicates entries to followers via `AppendEntries` RPCs. An entry is committed once
87
+ replicated on a majority. Followers apply committed entries to their state machines in
88
+ order.
89
+
90
+ 3. **Safety:** Raft guarantees that if any server has applied a log entry at a given index,
91
+ no other server will apply a different entry at that index. It achieves this by restricting
92
+ which nodes can become leader: only nodes with up-to-date logs win elections.
93
+
94
+ **Key difference from Paxos:** Raft requires log entries to be decided *in order*, which
95
+ simplifies reasoning but means a slow entry blocks subsequent entries. Paxos allows
96
+ out-of-order decisions but requires gap-filling, adding implementation complexity.
97
+
98
+ Heidi Howard's 2020 paper "Paxos vs Raft" showed the algorithms are more similar than
99
+ believed, differing primarily in leader election. Raft's contribution is one of *presentation
100
+ and decomposition*, not algorithmic novelty.
101
+
102
+ **Real-world Raft implementations:** etcd, Consul, TiKV, CockroachDB, Nomad, Kafka (KRaft).
103
+
104
+ ### Coordination Services: The Practical Layer
105
+
106
+ Most developers should never implement Paxos or Raft themselves. Instead, they should use
107
+ coordination services that embed these algorithms and expose higher-level primitives:
108
+
109
+ - **Apache ZooKeeper:** Built at Yahoo for Hadoop. Uses ZAB (closely related to Paxos).
110
+ Hierarchical namespace (znodes), ephemeral nodes, watches, sequential nodes. Powers Kafka
111
+ (legacy), HBase, Solr.
112
+
113
+ - **etcd:** CoreOS/CNCF project. Raft consensus. Flat key-value with MVCC, watch streams,
114
+ leases, distributed locks. The coordination backbone of Kubernetes.
115
+
116
+ - **Consul:** HashiCorp. Raft within datacenter, Serf gossip between datacenters. Service
117
+ discovery, health checking, KV store, service mesh. Strongest multi-DC story.
118
+
119
+ ### What This Is NOT
120
+
121
+ - **Not something most developers should implement.** Implementing Raft or Paxos correctly
122
+ is a multi-year effort. The Raft paper's reference implementation had subtle bugs found
123
+ years later. Use etcd, ZooKeeper, or Consul.
124
+ - **Not a database.** Coordination services store small metadata (configuration, leader
125
+ identity, lock state). etcd recommends keeping total data under 8 GB.
126
+ - **Not a message queue.** Watch streams provide notifications, not high-throughput delivery.
127
+ - **Not Byzantine fault tolerant.** Paxos and Raft assume a crash-fault model (nodes may stop,
128
+ but they do not lie). BFT protocols like PBFT handle malicious nodes at enormous performance cost.
129
+
130
+ ---
131
+
132
+ ## When to Use It
133
+
134
+ ### 1. Leader Election for Single-Writer Architectures
135
+
136
+ When exactly one node must be the "active" processor at any time (single-writer pattern),
137
+ consensus-based leader election ensures that exactly one leader exists and that failover
138
+ happens correctly.
139
+
140
+ **Evidence -- Amazon:** Amazon's Builders' Library documents that leases are their most widely
141
+ used leader election mechanism -- straightforward to implement with built-in fault tolerance.
142
+ DynamoDB provides lease-based locking clients for this purpose.
143
+
144
+ **Evidence -- Kubernetes:** Control plane singletons (scheduler, controller-manager) use
145
+ etcd-backed leader election via lease objects. Replicas watch the lease and acquire on expiry.
146
+
147
+ ### 2. Distributed Locks for Mutual Exclusion
148
+
149
+ When a shared resource (file, external API with rate limits, database migration) must be
150
+ accessed by at most one process at a time, distributed locks provide mutual exclusion across
151
+ nodes.
152
+
153
+ **Evidence -- Apache Curator:** Netflix built Curator for ZooKeeper, providing production-grade
154
+ lock recipes (`InterProcessMutex`, `InterProcessReadWriteLock`) used across Netflix for
155
+ coordinating shared resource access.
156
+
157
+ ### 3. Consistent Configuration Across a Cluster
158
+
159
+ When all nodes in a cluster must agree on configuration values (feature flags, routing tables,
160
+ schema versions), a coordination service provides linearizable reads and writes that guarantee
161
+ all nodes see the same state.
162
+
163
+ **Evidence -- Kubernetes/etcd:** Every Kubernetes cluster stores its desired state in etcd.
164
+ The kube-apiserver reads/writes etcd, controllers watch for changes, guaranteeing all control
165
+ plane components operate on a consistent view of cluster state.
166
+
167
+ ### 4. Service Discovery
168
+
169
+ When services need to find each other dynamically (instances come and go with auto-scaling),
170
+ coordination services provide service registration and health-checked discovery.
171
+
172
+ **Evidence -- Consul:** Services register with the local Consul agent (which participates in
173
+ Raft). Health checks auto-deregister unhealthy instances. DNS and HTTP APIs provide discovery.
174
+
175
+ ### 5. Cluster Membership and Failure Detection
176
+
177
+ When the system needs an authoritative view of which nodes are alive and what roles they play,
178
+ consensus ensures that all nodes agree on the membership list, preventing split-brain scenarios.
179
+
180
+ **Evidence -- CockroachDB:** Uses Raft consensus groups (one per data range) so that even
181
+ during node failures, the system agrees on which replica is authoritative.
182
+
183
+ ---
184
+
185
+ ## When NOT to Use It
186
+
187
+ The coordination tax is real. Consensus protocols add latency, operational complexity, and
188
+ failure modes. Many systems that adopt distributed coordination do not need it.
189
+
190
+ ### 1. Single-Node Systems
191
+
192
+ If your system runs on a single server (or can tolerate being a single server), you do not
193
+ need distributed consensus. A local mutex, a database row lock, or a simple file lock provides
194
+ the same guarantees with zero network overhead.
195
+
196
+ **The test:** If you are not running multiple instances of the process that need to coordinate,
197
+ you do not need distributed coordination. A surprising number of systems that deploy ZooKeeper
198
+ or etcd are actually single-node systems with redundant complexity.
199
+
200
+ ### 2. When a Database Lock Suffices
201
+
202
+ A `SELECT ... FOR UPDATE` in PostgreSQL or a conditional write in DynamoDB provides mutual
203
+ exclusion within the scope of a database transaction. If the resource you are protecting is
204
+ already in the database, use the database's own locking mechanisms.
205
+
206
+ **Evidence -- Shopify:** Shopify uses `pg_advisory_lock` for background job coordination.
207
+ The database is already a dependency; adding etcd would increase operational surface without
208
+ meaningful benefit.
209
+
210
+ ### 3. Distributed Locks Are Frequently Misused -- Use Idempotency Instead
211
+
212
+ This is the most common mistake. Teams reach for distributed locks to prevent duplicate
213
+ processing, when idempotent operations would eliminate the need for coordination entirely.
214
+
215
+ **The pattern to avoid:** "We need a distributed lock so that only one worker processes
216
+ each payment." The correct solution is usually an idempotent payment API with a unique
217
+ idempotency key. If two workers process the same payment, the second call is a no-op.
218
+
219
+ **Evidence -- Stripe:** Every mutating API request accepts an `Idempotency-Key` header.
220
+ Duplicate requests return cached results. No distributed lock needed. More resilient than
221
+ locking because it tolerates retries, partitions, and crashes without coordination overhead.
222
+
223
+ **Evidence -- Payment processor incident:** Double-charges occurred ~1 per 10,000 transactions
224
+ because the distributed lock expired during GC pauses. Replacing the lock with an idempotent
225
+ charge API eliminated the problem entirely.
226
+
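A minimal sketch of the idempotency-key pattern described above. `PaymentService` and its fields are hypothetical names for illustration, not Stripe's API; the in-memory dictionary stands in for a durable store.

```python
from typing import Dict


class PaymentService:
    """Hypothetical charge API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        cached = self._results.get(idempotency_key)
        if cached is not None:
            # Duplicate request: replay the stored result, make no new charge.
            return cached
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-1001", 2500)
retry = service.charge("order-1001", 2500)  # second worker or client retry
```

In a real service the lookup and the insert would need to be atomic, for example through a unique constraint on the key; this single-process sketch does not model that.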
227
+ ### 4. ZooKeeper Complexity Overhead for Small Systems
228
+
229
+ ZooKeeper requires a minimum of 3 nodes (5 recommended for production), a JVM with tuned
230
+ garbage collection, and operational expertise for compaction, snapshots, and leader elections.
231
+ For small systems (under 10 services), this operational tax is rarely justified.
232
+
233
+ **The alternative:** On Kubernetes, the API server's leases and configmaps (backed by etcd)
234
+ provide coordination without a separate cluster. Off Kubernetes, PostgreSQL advisory locks
235
+ often suffice.
236
+
237
+ ### 5. When Eventual Consistency Is Acceptable
238
+
239
+ Consensus provides strong consistency (linearizability), which comes at the cost of latency
240
+ and availability. If your use case tolerates stale reads or temporary disagreement, avoid
241
+ consensus entirely.
242
+
243
+ **Evidence -- DNS-based service discovery:** DNS is eventually consistent (TTL-based), but
244
+ for most discovery use cases, seconds of stale data is acceptable. Route 53 health checks
245
+ provide service discovery without a consensus cluster.
246
+
247
+ ### 6. Cross-Datacenter Consensus (Latency Trap)
248
+
249
+ Consensus requires a majority quorum for every write. If your nodes span datacenters with
250
+ 50-100ms round-trip latency, every write pays 2-3 round trips of that latency (100-300ms
251
+ minimum). Most applications cannot tolerate this.
252
+
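The arithmetic behind these figures can be written down directly. The helper names below are illustrative, and the round-trip count is taken from the text above as an input rather than derived from any particular protocol.

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a cluster: floor(N/2) + 1."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority remains."""
    return nodes - quorum_size(nodes)


def write_latency_ms(rtt_ms: float, round_trips: int) -> float:
    """Lower bound on write latency when each round trip crosses datacenters."""
    return rtt_ms * round_trips


low = write_latency_ms(50, 2)    # best case in the text
high = write_latency_ms(100, 3)  # worst case in the text
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster, which is why odd cluster sizes are the norm.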
253
+ **The alternative:** Consensus *within* a datacenter, async replication *between* datacenters.
254
+ Consul uses Raft intra-DC and gossip (Serf) inter-DC. Spanner uses TrueTime (GPS/atomic
255
+ clocks) to bound clock uncertainty across datacenters.
256
+
257
+ ---
258
+
259
+ ## How It Works
260
+
261
+ ### Raft in Detail: Leader Election
262
+
263
+ 1. All nodes start as **followers**. Each has a randomized **election timeout** (e.g.,
264
+ 150-300ms).
265
+ 2. If a follower receives no heartbeat from a leader before its timeout expires, it increments
266
+ its **term** (a monotonically increasing integer) and transitions to **candidate**.
267
+ 3. The candidate votes for itself and sends `RequestVote` RPCs to all other nodes.
268
+ 4. Each node votes for at most one candidate per term. A candidate wins if it receives votes
269
+ from a majority of nodes.
270
+ 5. The winning candidate becomes **leader** and begins sending periodic heartbeats
271
+ (`AppendEntries` with no entries) to prevent new elections.
272
+ 6. If a candidate's election times out without winning (split vote), a new election begins
273
+ with a higher term. Randomized timeouts make perpetual split votes extremely unlikely.
274
+
275
+ **The term mechanism prevents split brain:** If a stale leader receives a message with a
276
+ higher term number, it immediately steps down to follower. This is the distributed systems
277
+ equivalent of a fencing token -- the term acts as an epoch that monotonically increases with
278
+ each leadership change.
279
+
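The voting and step-down rules above can be sketched as follows. This is a single-process illustration with assumed names (`Node`, `handle_request_vote`); it omits timers, RPC transport, and persistence, and it folds in the "up-to-date log" restriction from the Safety subproblem.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)

    def last_log_position(self) -> Tuple[int, int]:
        """Return (term of last entry, index of last entry); (0, 0) when empty."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def observe_term(self, term: int) -> None:
        """A higher term forces any node, including a stale leader, back to follower."""
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.role = "follower"

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_term: int, last_log_index: int) -> bool:
        self.observe_term(term)
        if term < self.current_term:
            return False  # candidate is from an older term
        already_voted = self.voted_for not in (None, candidate_id)
        # Compare (last term, last index) pairs: the candidate must not be behind.
        up_to_date = (last_log_term, last_log_index) >= self.last_log_position()
        if already_voted or not up_to_date:
            return False
        self.voted_for = candidate_id
        return True


stale_leader = Node("n1", current_term=3, role="leader")
granted = stale_leader.handle_request_vote(4, "n2", 0, 0)
```

After this call the former leader has adopted term 4 and reverted to follower, which is the step-down behavior the term mechanism provides.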
280
+ ### Raft in Detail: Log Replication
281
+
282
+ 1. The leader receives a client request and appends it as a new entry to its log.
283
+ 2. The leader sends `AppendEntries` RPCs to all followers with the new entry.
284
+ 3. Each follower appends the entry to its log and responds with success.
285
+ 4. Once the leader receives acknowledgment from a **majority** (including itself), the entry
286
+ is **committed**. The leader applies it to its state machine and responds to the client.
287
+ 5. Followers learn about committed entries via subsequent heartbeats and apply them to their
288
+ own state machines.
289
+
290
+ **Consistency guarantee:** If two logs contain an entry with the same index and term, then
291
+ (a) the entries store the same command, and (b) all preceding entries are identical. This is
292
+ enforced by the `AppendEntries` consistency check: each RPC includes the index and term of
293
+ the entry immediately preceding the new entries, and followers reject the RPC if they do not
294
+ have a matching entry.
295
+
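A sketch of the follower-side consistency check and the leader-side commit rule, assuming 1-based log indexes as in the Raft paper. The function names are illustrative. Truncating conflicting entries follows the Raft paper rather than the summary above, and the commit helper leaves out Raft's additional requirement that the entry belong to the leader's current term.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply an AppendEntries request to a follower's log.

    Returns False when the follower has no entry at prev_index with prev_term.
    """
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1][0] == entry[0]:
                continue  # same index and term: entry is already present
            del log[index - 1:]  # conflict: drop this entry and everything after it
        log.append(entry)
    return True


def commit_index(match_index: List[int]) -> int:
    """Highest index stored on a majority; match_index includes the leader itself."""
    ranked = sorted(match_index, reverse=True)
    return ranked[len(match_index) // 2]


follower_log = [(1, "a"), (1, "b"), (1, "x")]
ok = append_entries(follower_log, 2, 1, [(2, "c"), (2, "d")])
```

Here the follower's third entry conflicts with the leader's, so it is replaced and the log converges on the leader's version.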
296
+ ### Paxos in Detail: The Two-Phase Protocol
297
+
298
+ **Phase 1 (Prepare):**
299
+ 1. A proposer selects a proposal number `n` (globally unique, monotonically increasing).
300
+ 2. It sends `Prepare(n)` to a majority of acceptors.
301
+ 3. Each acceptor responds with a promise not to accept proposals numbered less than `n`, along
302
+ with the highest-numbered proposal it has already accepted (if any).
303
+
304
+ **Phase 2 (Accept):**
305
+ 1. If the proposer receives promises from a majority, it selects the value: either the value
306
+ from the highest-numbered previously accepted proposal, or (if no acceptor has accepted
307
+ anything) its own proposed value.
308
+ 2. It sends `Accept(n, v)` to the same majority.
309
+ 3. Each acceptor accepts if it has not promised to a higher-numbered proposal.
310
+
311
+ **Key insight:** The "must use highest previously accepted value" rule is what ensures safety.
312
+ It guarantees that once a value is chosen (accepted by a majority), any future proposer will
313
+ discover it in Phase 1 and propose it again, preserving the decision.
314
+
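The two phases can be sketched for single-decree Paxos as below. This is an in-memory illustration with assumed names (`Acceptor`, `propose`); it has no networking, durability, or retry logic, and it contacts every acceptor rather than a chosen majority.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = 0
    accepted_n: int = 0
    accepted_value: Optional[str] = None

    def prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to ignore proposals below n; report any accepted proposal."""
        if n <= self.promised_n:
            return None  # already promised to an equal or higher number
        self.promised_n = n
        return (self.accepted_n, self.accepted_value)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise has been made."""
        if n < self.promised_n:
            return False
        self.promised_n = n
        self.accepted_n = n
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], n: int, own_value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None if no quorum was reached."""
    quorum = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        return None
    # Safety rule: adopt the value of the highest-numbered accepted proposal, if any.
    highest_n, highest_value = max(promises, key=lambda p: p[0])
    value = highest_value if highest_n > 0 else own_value
    accepted = sum(1 for a in acceptors if a.accept(n, value))
    return value if accepted >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "A")
second = propose(acceptors, 2, "B")  # discovers "A" in Phase 1 and re-proposes it
```

The second proposer wanted "B" but ends up confirming "A", which is the safety property the key insight describes.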
315
+ ### Distributed Locks: Fencing Tokens
316
+
317
+ The naive distributed lock pattern -- acquire lock, do work, release lock -- is fundamentally
318
+ broken in distributed systems because of **process pauses** and delays. A lock holder can stall, or
319
+ its requests can be held up in flight, at any moment because of:
320
+
321
+ - Garbage collection stop-the-world pauses (lasting seconds or even minutes in extreme cases)
322
+ - OS process preemption or CPU scheduling
323
+ - Virtual memory page faults
324
+ - Network delays causing request timeouts
325
+
326
+ **The fencing token pattern:**
327
+
328
+ 1. Each lock acquisition returns a **fencing token** -- a monotonically increasing integer.
329
+ 2. When the lock holder writes to the protected resource, it includes the fencing token.
330
+ 3. The resource rejects any write with a token lower than the highest token it has seen.
331
+
332
+ **Worked example:**
333
+ - Process A acquires lock, receives token 33.
334
+ - Process A enters GC pause. Lock lease expires.
335
+ - Process B acquires lock, receives token 34. Writes to resource with token 34.
336
+ - Process A wakes up, tries to write with token 33. Resource rejects it (34 > 33).
337
+
338
+ **Critical insight from Martin Kleppmann:** The lock service alone cannot solve this problem.
339
+ The resource being protected must participate in the fencing protocol. This means the resource
340
+ (database, file system, external API) must understand fencing tokens and enforce monotonicity.
341
+ If the resource does not support fencing, the distributed lock provides only *best-effort*
342
+ mutual exclusion.
343
+
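The resource-side check is small. The sketch below replays the worked example with tokens 33 and 34; `FencedResource` is a hypothetical stand-in for whatever storage system is being protected.

```python
from typing import Optional


class FencedResource:
    """A storage service that enforces fencing tokens on every write."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # writer holds a stale lock; reject
        self.highest_token = token
        self.value = value
        return True


resource = FencedResource()
b_result = resource.write(34, "written by B")  # B acquired the lock after A's lease expired
a_result = resource.write(33, "written by A")  # A wakes from its pause with the old token
```

The rejection happens inside the resource, not the lock service, which is the point of the pattern: the check still holds when the paused process has no idea its lease expired.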
344
+ ### Leader Election: Lease-Based
345
+
346
+ Lease-based leader election is the most common pattern in production systems:
347
+
348
+ 1. A **lease** is a time-bounded lock stored in a coordination service or database.
349
+ 2. The leader periodically **renews** the lease (heartbeat) before it expires.
350
+ 3. If the leader fails to renew (crash, network partition, GC pause), the lease expires.
351
+ 4. Other candidates attempt to acquire the expired lease. Exactly one succeeds (guaranteed
352
+ by the coordination service's linearizability).
353
+
354
+ **Time dependency:** Leases depend on *local elapsed time*, not synchronized wall-clock time.
355
+ The lease holder checks "has T seconds elapsed since my last renewal?" using a monotonic clock.
356
+ This avoids clock synchronization problems but introduces a fundamental tension:
357
+
358
+ - **Short leases** (1-5 seconds): fast failover but risk false positives from GC pauses or
359
+ brief network glitches, causing unnecessary leader thrashing.
360
+ - **Long leases** (15-30 seconds): fewer false positives but slower failover, meaning longer
361
+ write unavailability during real failures.
362
+
363
+ **Amazon's guidance:** Amazon recommends lease durations of 10-30 seconds for most workloads,
364
+ with the leader renewing at one-third of the lease interval.
365
+
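A minimal lease sketch under stated assumptions: `LeaseStore` is an in-memory stand-in for the coordination service, the clock is injectable so expiry can be exercised without sleeping, and `renew_interval` encodes the one-third renewal guidance quoted above. A real deployment would rely on the service's atomic compare-and-set rather than a Python object.

```python
import time
from typing import Callable, Optional


class LeaseStore:
    """In-memory stand-in for the service that holds a single lease."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # monotonic by default, so wall-clock jumps do not matter
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, candidate: str, ttl: float) -> bool:
        """Acquire or renew; succeeds only if the lease is free, expired, or already ours."""
        now = self._clock()
        if self.holder not in (None, candidate) and now < self.expires_at:
            return False
        self.holder = candidate
        self.expires_at = now + ttl
        return True


def renew_interval(ttl: float) -> float:
    """Renew at one-third of the lease duration."""
    return ttl / 3


fake_now = [0.0]
store = LeaseStore(clock=lambda: fake_now[0])
store.try_acquire("node-a", 10.0)             # node-a becomes leader
fake_now[0] = 15.0                            # node-a stalls past its lease
took_over = store.try_acquire("node-b", 10.0)
```

The stalled leader in this example is exactly the case where a fencing token is still needed: node-a may not yet know it has lost the lease.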
366
+ ### Distributed Configuration: Watch-Based Propagation
367
+
368
+ Coordination services propagate configuration changes through watches: a client writes a new
369
+ value, and all nodes watching that key receive a push notification.
370
+
371
+ **etcd** uses gRPC server-side streaming with revision-ordered delivery. Clients detect missed
372
+ updates via revision gaps and recover by reading full state. **ZooKeeper** uses one-time
373
+ callback triggers that must be re-registered after each event, creating a small window for
374
+ missed changes. Curator's `TreeCache` recipe layers continuous watching on top of the
375
+ primitive API to eliminate this gap.
376
+
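Revision-ordered delivery can be illustrated with a toy model. `RevisionedStore` and `Watcher` are hypothetical names, and the pull-based `poll` call is a simplification; this is not etcd's gRPC API, only the idea of resuming from the last revision a client has seen.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (revision, key, value)


class RevisionedStore:
    """Toy key-value store that stamps every write with a global revision."""

    def __init__(self) -> None:
        self.revision = 0
        self.data: Dict[str, str] = {}
        self.history: List[Event] = []

    def put(self, key: str, value: str) -> int:
        self.revision += 1
        self.data[key] = value
        self.history.append((self.revision, key, value))
        return self.revision

    def events_since(self, revision: int, key: str) -> List[Event]:
        """Events for key with a revision greater than the given one, in order."""
        return [e for e in self.history if e[0] > revision and e[1] == key]


class Watcher:
    """Tracks the last revision seen so no event is skipped or delivered twice."""

    def __init__(self, store: RevisionedStore, key: str) -> None:
        self.store = store
        self.key = key
        self.last_revision = 0

    def poll(self) -> List[Event]:
        events = self.store.events_since(self.last_revision, self.key)
        if events:
            self.last_revision = events[-1][0]
        return events


store = RevisionedStore()
watcher = Watcher(store, "feature/flag")
store.put("feature/flag", "on")
store.put("other/key", "x")
store.put("feature/flag", "off")
delivered = watcher.poll()
```

Because the watcher resumes from a revision rather than re-registering a one-shot callback, a change made between two polls is still delivered on the next one.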
377
+ ---
378
+
379
+ ## Trade-Offs Matrix
380
+
381
+ | Dimension | Consensus-Based Coordination | Database-Based Coordination | No Coordination (Idempotent Design) |
382
+ |-----------|-----------------------------|-----------------------------|-------------------------------------|
383
+ | **Consistency** | Linearizable (strongest) | Serializable within transactions | Eventual; relies on idempotency for correctness |
384
+ | **Latency** | 2-10ms within datacenter (Raft round trip) | 1-5ms (local database lock) | 0ms coordination overhead |
385
+ | **Availability** | Requires majority quorum (⌊N/2⌋+1 of N nodes) | Single database is SPOF unless replicated | No coordination SPOF |
386
+ | **Operational cost** | Dedicated cluster (3-5 nodes), monitoring, backups | Already running a database | No additional infrastructure |
387
+ | **Failure detection** | Built-in (heartbeats, lease expiry) | Requires polling or advisory lock timeout | N/A; no leader to detect |
388
+ | **Cross-DC support** | Expensive (100-300ms per write across DCs) | Requires cross-DC database replication | Works naturally across DCs |
389
+ | **Data volume** | Small metadata only (< 8 GB for etcd) | Full database capacity | No coordination data |
390
+ | **Complexity** | High; ZooKeeper/etcd operational expertise required | Low-medium; standard DBA skills | Low; design complexity in idempotency logic |
391
+ | **Failure mode** | Writes unavailable (read-only at best) if quorum is lost; the quorum requirement is what prevents split brain | Deadlocks; lock contention under load | Duplicate processing if idempotency breaks |
392
+ | **Lock granularity** | Coarse-grained (per-resource locks) | Fine-grained (row-level locks) | N/A; no locks |
393
+ | **Throughput** | 10K-50K writes/sec (etcd); not designed for high throughput | 100K+ transactions/sec (PostgreSQL) | Limited only by application throughput |
394
+
395
+ ---
396
+
397
+ ## Evolution Path
398
+
399
+ ### Stage 1: Single Node (No Coordination)
400
+
401
+ The system runs on one server. Coordination is handled by the OS kernel (mutexes, file locks,
402
+ process signals). This is correct and sufficient for many applications.
403
+
404
+ **Move to Stage 2 when:** You need high availability (redundant instances) or horizontal
405
+ scaling beyond a single node.
406
+
407
+ ### Stage 2: Database-Based Coordination
408
+
409
+ Use your existing database for coordination: advisory locks for mutual exclusion, a "leaders"
410
+ table with optimistic locking for leader election, and config tables for shared configuration.
411
+
412
+ **Implementation:** `pg_advisory_lock(key)` for mutual exclusion. `SELECT ... FOR UPDATE
413
+ SKIP LOCKED` for work distribution. Conditional updates for leader election.
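The "conditional update" approach to leader election can be sketched as a lease row that a node may claim only if it already holds it or the previous lease has expired. The sketch below is an in-memory stand-in, not database code: the `leaders` table layout, the `LeaderRow` type, and the use of Unix seconds are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// LeaderRow is an in-memory stand-in for one row of a hypothetical
// "leaders" table with columns (holder, expires_at). Times are Unix seconds.
type LeaderRow struct {
	mu        sync.Mutex
	holder    string
	expiresAt int64
}

// TryAcquire mimics a conditional update such as
//
//	UPDATE leaders SET holder = $1, expires_at = $2 + $3
//	WHERE holder = $1 OR expires_at < $2
//
// and reports whether the caller is the leader afterwards.
func (r *LeaderRow) TryAcquire(node string, now, ttl int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.holder == node || r.expiresAt < now {
		r.holder = node
		r.expiresAt = now + ttl
		return true
	}
	return false
}

func main() {
	row := &LeaderRow{}
	fmt.Println(row.TryAcquire("node-a", 1000, 10)) // node-a claims the empty row
	fmt.Println(row.TryAcquire("node-b", 1005, 10)) // lease still valid: refused
	fmt.Println(row.TryAcquire("node-a", 1008, 10)) // holder renews its lease
	fmt.Println(row.TryAcquire("node-b", 1019, 10)) // lease ran out at 1018: takeover
}
```

In a real database the mutex is replaced by the atomicity of the single `UPDATE`, and the number of affected rows tells the caller whether it won. As with any lease, a paused leader can still believe it holds the row after expiry, so writes made under the lease need their own guard.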
414
+
415
+ **Move to Stage 3 when:** Database contention from coordination queries impacts application
416
+ query performance, or you need sub-second failure detection that database polling cannot
417
+ provide.
418
+
419
+ ### Stage 3: Dedicated Coordination Service
420
+
421
+ Deploy etcd, ZooKeeper, or Consul alongside your application. Use it for leader election,
422
+ distributed locks, service discovery, and configuration management.
423
+
424
+ **Selection guide:**
425
+ - **etcd** if you are already running Kubernetes or need a simple key-value model with
426
+ strong consistency and watch semantics.
427
+ - **Consul** if you need multi-datacenter service discovery, health checking, and service
428
+ mesh capabilities.
429
+ - **ZooKeeper** if you are in the Hadoop/Kafka ecosystem or need hierarchical namespace
430
+ features (ephemeral sequential nodes for distributed queues, barriers).
431
+
432
+ **Move to Stage 4 when:** You need custom consensus behavior (application-level state machine
433
+ replication) that coordination services do not directly support.
434
+
435
+ ### Stage 4: Embedded Consensus Library
436
+
437
+ Embed a Raft library directly into your application for custom replicated state machines.
438
+ Libraries like `hashicorp/raft` (Go), `openraft` (Rust), `Apache Ratis` (Java), and
439
+ `dragonboat` (Go) provide Raft implementations that you integrate into your application.
440
+
441
+ **Evidence -- CockroachDB:** CockroachDB embeds a Raft implementation (derived from etcd's
442
+ Raft library) in its replication layer. Each data range has its own Raft group, providing
443
+ fine-grained replication without a central coordination bottleneck.
444
+
445
+ **Warning:** Appropriate only for database-class infrastructure. The investment to handle
446
+ snapshotting, membership changes, and log compaction is measured in engineer-years.
447
+
448
+ ---
449
+
450
+ ## Failure Modes
451
+
452
+ ### 1. Split Brain from Lock Expiry
453
+
454
+ **What happens:** A distributed lock holder (Process A) experiences a GC pause. The lock's
455
+ lease expires. Process B acquires the lock. Process A wakes up, believes it still holds the
456
+ lock, and writes to the shared resource. Both processes now operate on the resource
457
+ simultaneously.
458
+
459
+ **Real-world incident:** At a major payment processor, two worker processes both believed they
460
+ held the lock on an account, resulting in double-charges approximately once per 10,000
461
+ transactions. The root cause was JVM garbage collection pauses exceeding the lock's TTL.
462
+
463
+ **Mitigation:** Fencing tokens (see "How It Works"). The lock returns a monotonically
464
+ increasing token; the resource rejects stale tokens. Without resource-side enforcement,
465
+ the lock is advisory only.
466
+
467
+ ### 2. GC Pauses Causing False Leader Failover
468
+
469
+ **What happens:** The leader node enters a long GC pause (> lease duration). Followers detect
470
+ the missing heartbeats, trigger an election, and elect a new leader. The old leader wakes up
471
+ and briefly believes it is still leader, issuing conflicting commands.
472
+
473
+ **Real-world incident (Akka Cluster):** A heartbeat was delayed 15,456ms (expected: 1,000ms)
474
+ due to GC pressure. The frozen node took 7 seconds after unfreezing to step down -- during
475
+ which it processed commands while a new leader was also active, violating the Single Writer principle.
476
+
477
+ **Production failover timing:** Deployments target 3-10s for detection; total failover in
478
+ systems like MongoDB is 10-30s. Aggressive timeouts (1-3s) risk false positives; conservative
479
+ timeouts (10-30s) extend write unavailability.
480
+
481
+ **Mitigation:** Raft's term mechanism: the old leader steps down upon receiving any message
482
+ with a higher term. For non-Raft systems, use epoch-based fencing where the storage layer
483
+ rejects operations from stale epochs.
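The term rule described above can be sketched in a few lines: every message carries the sender's term, a node that sees a higher term adopts it and becomes a follower, and messages from lower terms are rejected. This is a minimal illustration only; the `Node` type and `Observe` method are hypothetical names, and a real Raft node tracks far more state.

```go
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Leader
)

// Node holds only the state needed to illustrate term-based step-down.
type Node struct {
	ID   string
	Term uint64
	Role Role
}

// Observe processes the term carried by an incoming message.
// A higher term proves a newer election took place, so the node adopts the
// term and becomes a follower. It returns false for stale (lower-term)
// messages, which the caller should reject.
func (n *Node) Observe(msgTerm uint64) bool {
	if msgTerm > n.Term {
		n.Term = msgTerm
		n.Role = Follower
		return true
	}
	return msgTerm == n.Term
}

func main() {
	// node-1 led term 3, then stalled; the others elected a leader in term 4.
	old := &Node{ID: "node-1", Term: 3, Role: Leader}

	accepted := old.Observe(4) // first heartbeat from the new leader
	fmt.Println(old.ID, "accepted:", accepted, "is follower:", old.Role == Follower, "term:", old.Term)

	// A delayed message from term 3 is now stale and gets rejected.
	fmt.Println("stale message accepted:", old.Observe(3))
}
```

The same comparison, applied by the storage layer to an epoch number attached to each write, is what epoch-based fencing amounts to in non-Raft systems.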
484
+
485
+ ### 3. Martin Kleppmann's Distributed Locking Critique
486
+
487
+ Kleppmann's 2016 analysis identified two fundamental problems with Redlock:
488
+
489
+ **Problem 1 -- No fencing tokens:** Redlock uses a random unique value as a lock identifier,
490
+ but this value is not monotonically increasing. You cannot use it as a fencing token. Keeping
491
+ a counter on a single Redis node is not sufficient because that node may fail. Keeping counters
492
+ on multiple Redis nodes would require... a consensus algorithm to keep them synchronized,
493
+ defeating the purpose.
494
+
495
+ **Problem 2 -- Dangerous timing assumptions:** Redlock assumes bounded network delay and
496
+ bounded process pause time (essentially a synchronous system model). If a clock on one Redis
497
+ node jumps forward (NTP adjustment, VM migration, clock drift), a lock could expire
498
+ prematurely, allowing a second client to acquire it. Both clients now believe they hold the
499
+ lock.
500
+
501
+ **The core insight:** For correctness locks, you need fencing tokens -- the lock service must
502
+ generate monotonically increasing tokens and the resource must enforce them. Redis does not
503
+ provide this. For efficiency locks (preventing duplicate work), Redis `SETNX` with TTL is
504
+ fine -- occasional double-processing wastes work but does not corrupt data.
505
+
506
+ ### 4. Quorum Loss and Read Availability
507
+
508
+ **What happens:** In a 3-node etcd cluster, if 2 nodes go down, the remaining node cannot
509
+ form a quorum. The cluster becomes completely unavailable -- it can serve neither reads nor
510
+ writes (by default). This is a deliberate safety choice: serving reads from a single node
511
+ could return stale data if the node is partitioned.
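The arithmetic behind this is short enough to write out. The sketch below assumes simple majority quorums (N/2+1, integer division), as used elsewhere in this document; the function names are illustrative.

```go
package main

import "fmt"

// Quorum returns the smallest majority of an n-node cluster.
func Quorum(n int) int { return n/2 + 1 }

// FaultTolerance returns how many nodes may fail while a majority survives.
func FaultTolerance(n int) int { return n - Quorum(n) }

// HasQuorum reports whether the surviving nodes can still form a majority.
func HasQuorum(n, alive int) bool { return alive >= Quorum(n) }

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerates=%d\n", n, Quorum(n), FaultTolerance(n))
	}
	// 3-node cluster with 2 nodes down: one survivor is not a majority.
	fmt.Println(HasQuorum(3, 1))
}
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster, which is why odd cluster sizes are the usual recommendation.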
512
+
513
+ **Mitigation:** etcd supports serializable reads (`etcdctl get --consistency=s`) that any single node can serve,
514
+ trading linearizability for availability. For non-critical reads (dashboards, monitoring),
515
+ serializable reads during quorum loss are acceptable. For correctness-critical reads (lock
516
+ state, leader identity), the system must be unavailable rather than inconsistent.
517
+
518
+ ### 5. Watch Notification Gaps (ZooKeeper)
519
+
520
+ **What happens:** ZooKeeper watches are one-time triggers. After a watch fires, the client
521
+ must re-register it. Between the watch firing and re-registration, changes can be missed.
522
+
523
+ **Mitigation:** Use Curator's `CuratorCache` (or the older `PathChildrenCache` / `TreeCache`), which handle re-registration
524
+ and gap detection. Alternatively, use etcd, whose watch API provides continuous streams with
525
+ revision-based ordering, eliminating the gap problem.
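Why revision ordering closes the gap can be shown with a toy model: if every change gets a monotonically increasing revision and the watcher remembers the last revision it processed, it can always ask for "everything after N" and nothing is skipped while it was not listening. The `Log` and `Watcher` types below are an in-memory illustration, not the etcd API; in the real Go client the comparable mechanism is starting a watch with `clientv3.WithRev`.

```go
package main

import "fmt"

// Event is one change, tagged with the revision at which it happened.
type Event struct {
	Revision int64
	Key      string
	Value    string
}

// Log is an append-only history; revision k is stored at index k-1.
type Log struct {
	events []Event
}

// Put records a change and returns its revision.
func (l *Log) Put(key, value string) int64 {
	rev := int64(len(l.events)) + 1
	l.events = append(l.events, Event{Revision: rev, Key: key, Value: value})
	return rev
}

// Since returns a copy of every event with Revision > after.
func (l *Log) Since(after int64) []Event {
	if after < 0 {
		after = 0
	}
	if after >= int64(len(l.events)) {
		return nil
	}
	return append([]Event(nil), l.events[after:]...)
}

// Watcher resumes from the last revision it has processed.
type Watcher struct {
	log     *Log
	lastRev int64
}

// Poll returns all events not yet seen, however long the watcher was away.
func (w *Watcher) Poll() []Event {
	events := w.log.Since(w.lastRev)
	if len(events) > 0 {
		w.lastRev = events[len(events)-1].Revision
	}
	return events
}

func main() {
	history := &Log{}
	w := &Watcher{log: history}

	history.Put("/config/a", "1")
	fmt.Println("first poll:", len(w.Poll()), "event(s)")

	// Two changes arrive while the watcher is not listening.
	history.Put("/config/a", "2")
	history.Put("/config/b", "3")
	for _, ev := range w.Poll() {
		fmt.Println(ev.Revision, ev.Key, ev.Value)
	}
}
```

The model ignores compaction: a real store eventually discards old revisions, so a watcher that falls too far behind has to re-read current state and resume from there.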
526
+
527
+ ### 6. etcd Data Size Limits Under Pressure
528
+
529
+ **What happens:** etcd persists its keyspace in a disk-backed bbolt file, with a default
530
+ quota of 2 GB and a suggested maximum of 8 GB. Under heavy write load with insufficient
531
+ compaction, the database hits its quota and etcd rejects writes (`NOSPACE` alarm), which can leave a Kubernetes cluster unmanageable.
532
+
533
+ **Mitigation:** Configure auto-compaction (`--auto-compaction-retention=1h`), set
534
+ `--quota-backend-bytes`, monitor size via `/metrics`, and run periodic defragmentation.
535
+
536
+ ---
537
+
538
+ ## Technology Landscape
539
+
540
+ ### etcd
541
+
542
+ - **Consensus:** Raft
543
+ - **Data model:** Flat key-value with MVCC (multi-version concurrency control)
544
+ - **Language:** Go
545
+ - **Watch mechanism:** gRPC streaming; continuous, revision-ordered
546
+ - **Locking:** Distributed locks via `clientv3/concurrency` package
547
+ - **Best for:** Kubernetes clusters, systems already in the CNCF ecosystem
548
+ - **Operational notes:** 3 or 5 nodes recommended. Sensitive to disk latency -- use SSDs.
549
+ Monitor `wal_fsync_duration_seconds` and `backend_commit_duration_seconds`.
550
+ - **Throughput:** Approximately 10,000-50,000 writes/sec depending on value size and hardware.
551
+
552
+ ### Apache ZooKeeper
553
+
554
+ - **Consensus:** ZAB (ZooKeeper Atomic Broadcast), closely related to Paxos
555
+ - **Data model:** Hierarchical namespace (znodes) with ephemeral nodes, sequential nodes
556
+ - **Language:** Java
557
+ - **Watch mechanism:** One-time callback triggers; must re-register after each event
558
+ - **Locking:** Recipes via Apache Curator (`InterProcessMutex`, `InterProcessReadWriteLock`)
559
+ - **Best for:** Hadoop/Kafka ecosystem, systems needing hierarchical data and ephemeral nodes
560
+ - **Operational notes:** 3 or 5 nodes. JVM tuning is critical -- G1GC recommended, heap
561
+ size 4-8 GB. Configure `autopurge.snapRetainCount` and `autopurge.purgeInterval`.
562
+ - **Throughput:** Approximately 10,000-20,000 writes/sec; reads are faster and can be served
563
+ by any node (with staleness risk).
564
+
565
+ ### HashiCorp Consul
566
+
567
+ - **Consensus:** Raft (within datacenter), Serf gossip (between datacenters)
568
+ - **Data model:** Flat key-value plus service catalog with health checks
569
+ - **Language:** Go
570
+ - **Watch mechanism:** Blocking queries (long polling) and event system
571
+ - **Locking:** Built-in session-based locks with configurable behavior on session invalidation
572
+ - **Best for:** Multi-datacenter service discovery, service mesh, organizations using
573
+ HashiCorp stack (Vault, Nomad, Terraform)
574
+ - **Operational notes:** 3 or 5 server nodes per datacenter, plus client agents on every
575
+ application node. Lower operational burden than ZooKeeper.
576
+
577
+ ### Redis Redlock -- and Why It Is Controversial
578
+
579
+ Redis is often used for distributed locking via `SET key value NX PX <ttl>` (the atomic form of `SETNX` plus an expiry). The Redlock algorithm
580
+ (proposed by Salvatore Sanfilippo / antirez) extends this across multiple independent Redis
581
+ instances:
582
+
583
+ 1. Get current time.
584
+ 2. Try to acquire the lock on all N independent Redis instances with the same key and random value; at least N/2+1 must succeed.
585
+ 3. If the elapsed time to acquire exceeds the lock TTL, the lock is considered failed.
586
+ 4. If acquired, the effective lock lifetime is TTL minus elapsed acquisition time.
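The bookkeeping in steps 2 to 4 can be sketched as a single function. This is an illustration of the arithmetic only, with no Redis calls; `ValidityMs` is a hypothetical name, durations are plain milliseconds, and the clock-drift allowance that the Redis documentation also subtracts is left out for brevity.

```go
package main

import "fmt"

// ValidityMs applies the steps listed above: the lock counts as held only if
// a majority of the n instances granted it and acquisition took less than the
// TTL. It returns the remaining lifetime in milliseconds and whether the lock
// is considered acquired.
func ValidityMs(n, acquired int, ttlMs, elapsedMs int64) (int64, bool) {
	majority := n/2 + 1
	if acquired < majority || elapsedMs >= ttlMs {
		return 0, false
	}
	return ttlMs - elapsedMs, true
}

func main() {
	// 5 instances, 3 granted the lock, 1.2s spent acquiring a 10s lock.
	remaining, ok := ValidityMs(5, 3, 10000, 1200)
	fmt.Println(remaining, ok)

	// Only 2 of 5 granted: no majority, so the attempt fails.
	remaining, ok = ValidityMs(5, 2, 10000, 100)
	fmt.Println(remaining, ok)
}
```

The remaining lifetime is measured on the client's own clock, which is exactly the timing assumption the critique below targets.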
587
+
588
+ **Why it is controversial:**
589
+
590
+ Martin Kleppmann argued Redlock is unsafe for correctness: it relies on timing assumptions
591
+ (bounded network delay, bounded process pauses, well-behaved clocks) that clock jumps and
592
+ GC pauses violate in practice, and it cannot generate monotonically increasing fencing tokens.
593
+ Antirez responded defending its practical safety, arguing that its timing checks are sufficient. The
594
+ distributed systems community remains divided.
595
+
596
+ **Guidance:** Use single-instance Redis `SETNX` for *efficiency* locks (preventing duplicate
597
+ work where occasional double-processing is harmless). Use etcd or ZooKeeper for *correctness*
598
+ locks (where double-processing causes data corruption or financial loss). Avoid Redlock for
599
+ correctness-critical paths.
600
+
601
+ ---
602
+
603
+ ## Decision Tree
604
+
605
+ ```
+ Do you need distributed coordination?
+ |
+ +-- Is the system single-node or single-instance?
+ |   +-- YES --> Use OS-level primitives (mutex, flock). Stop here.
+ |
+ +-- Can the coordination be replaced with idempotent operations?
+ |   +-- YES --> Design for idempotency. No coordination needed. Stop here.
+ |
+ +-- Is a database already a dependency?
+ |   +-- YES --> Can the database handle the coordination load?
+ |       +-- YES --> Use database locks (advisory locks, SELECT FOR UPDATE). Stop here.
+ |       +-- NO --> Proceed to dedicated coordination service.
+ |
+ +-- What is the primary use case?
+ |   |
+ |   +-- Leader election only?
+ |   |   +-- On Kubernetes? --> Use Kubernetes Lease objects (backed by etcd).
+ |   |   +-- Not on Kubernetes? --> etcd or Consul, whichever matches your stack.
+ |   |
+ |   +-- Service discovery + health checking?
+ |   |   +-- Single datacenter? --> etcd or Consul.
+ |   |   +-- Multi-datacenter? --> Consul (strongest multi-DC story).
+ |   |
+ |   +-- Distributed locks?
+ |   |   +-- For efficiency (prevent duplicate work)?
+ |   |   |   +-- Redis SETNX with TTL. Acceptable if occasional double-processing is harmless.
+ |   |   +-- For correctness (prevent data corruption)?
+ |   |       +-- etcd or ZooKeeper with fencing tokens.
+ |   |       +-- NOT Redis Redlock for correctness-critical paths.
+ |   |
+ |   +-- Configuration management?
+ |   |   +-- Already on Kubernetes? --> ConfigMaps + controller watches.
+ |   |   +-- Need strong consistency? --> etcd or Consul KV.
+ |   |
+ |   +-- Custom replicated state machine?
+ |       +-- Use an embedded Raft library (hashicorp/raft, openraft, dragonboat).
+ |       +-- WARNING: This is a multi-month to multi-year investment.
+ |
+ +-- Are you in the Hadoop/Kafka ecosystem?
+     +-- YES --> ZooKeeper (but note: Kafka is migrating to KRaft, removing ZK dependency).
+     +-- NO --> etcd (if Kubernetes-native) or Consul (if multi-DC or HashiCorp stack).
+ ```
648
+
649
+ ---
650
+
651
+ ## Implementation Sketch
652
+
653
+ ### Leader Election with etcd (Go)
654
+
655
+ ```go
+ package main
+
+ import (
+ 	"context"
+ 	"log"
+ 	"time"
+
+ 	clientv3 "go.etcd.io/etcd/client/v3"
+ 	"go.etcd.io/etcd/client/v3/concurrency"
+ )
+
+ func main() {
+ 	cli, err := clientv3.New(clientv3.Config{
+ 		Endpoints:   []string{"localhost:2379"},
+ 		DialTimeout: 5 * time.Second,
+ 	})
+ 	if err != nil {
+ 		log.Fatalf("connect to etcd: %v", err)
+ 	}
+ 	defer cli.Close()
+
+ 	// Session TTL bounds failover time: with a 10s TTL, a crashed leader's
+ 	// lease expires (and a new leader can be elected) within about 10s.
+ 	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
+ 	if err != nil {
+ 		log.Fatalf("create session: %v", err)
+ 	}
+ 	defer session.Close()
+
+ 	election := concurrency.NewElection(session, "/my-service/leader")
+
+ 	// Campaign blocks until this node wins the election or ctx is cancelled.
+ 	ctx := context.Background()
+ 	if err := election.Campaign(ctx, "node-1"); err != nil {
+ 		log.Fatal(err)
+ 	}
+ 	log.Println("I am the leader!")
+
+ 	// Do leader work. Lease auto-renews. On crash, lease expires -> new election.
+ 	// On graceful shutdown: election.Resign(ctx) for instant failover.
+ 	select {}
+ }
+ ```
691
+
692
+ **Key design points:**
693
+ - Session TTL determines failover time. Shorter = faster failover but more risk of false
694
+ leadership changes during GC pauses.
695
+ - `Campaign()` blocks until this node wins. Use `Observe()` to watch without campaigning.
696
+ - Call `Resign()` on graceful shutdown for instant failover (avoids waiting for lease expiry).
697
+
698
+ ### Distributed Lock with Fencing Token (Pseudocode)
699
+
700
+ ```
+ function acquireLockWithFencing(lockService, resourceClient, lockKey, payload):
+     // Acquire lock; the service returns a monotonically increasing token
+     lock, fencingToken = lockService.acquire(lockKey, ttl=30s)
+
+     if not lock.acquired:
+         return FAILED
+
+     try:
+         // Include fencing token in every write to the protected resource
+         result = resourceClient.write(
+             data=payload,
+             fencingToken=fencingToken
+         )
+         // Resource rejects write if it has seen a higher token
+         if result.rejected:
+             log.warn("Stale lock detected, token rejected")
+             return STALE_LOCK
+         return SUCCESS
+     finally:
+         lockService.release(lockKey, lock)
+
+
+ // Resource-side enforcement (e.g., in a database trigger or middleware):
+ function handleWrite(resourceId, data, fencingToken):
+     currentMax = storage.getMaxToken(resourceId)
+     if fencingToken < currentMax:
+         reject("Stale fencing token")
+         return
+     storage.setMaxToken(resourceId, fencingToken)
+     storage.write(data)
+ ```
731
+
732
+ **Critical implementation detail:** The fencing token check and the data write must be
733
+ atomic on the resource side. If they are separate operations, a race condition between the
734
+ check and the write reintroduces the problem. In a database, use a single conditional statement:
735
+ `UPDATE resource SET data=$data, max_token=$token WHERE id=$id AND max_token <= $token` (`<=` so the current holder can write more than once with its own token).
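A runnable version of the resource-side check, kept in memory for illustration: the comparison and the write happen under one mutex, which plays the role of the single atomic database statement. `FencedStore` and `ErrStaleToken` are hypothetical names, and equal tokens are accepted on the assumption that one lock holder may issue several writes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleToken is returned when a write carries a token older than one
// the store has already accepted.
var ErrStaleToken = errors.New("stale fencing token")

// FencedStore is a toy protected resource that enforces fencing tokens.
type FencedStore struct {
	mu       sync.Mutex
	maxToken uint64
	data     string
}

// Write applies the update only if token is at least the highest token seen.
// The check and the write share one critical section, so no other writer can
// slip in between them.
func (s *FencedStore) Write(token uint64, data string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.maxToken {
		return ErrStaleToken
	}
	s.maxToken = token
	s.data = data
	return nil
}

// Read returns the current value.
func (s *FencedStore) Read() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data
}

func main() {
	s := &FencedStore{}
	fmt.Println(s.Write(33, "from A"))            // A holds the lock with token 33
	fmt.Println(s.Write(34, "from B"))            // A's lease expired; B got token 34
	fmt.Println(s.Write(33, "late write from A")) // A wakes up: rejected
	fmt.Println(s.Read())
}
```

The rejected write is the Process A scenario from "Split Brain from Lock Expiry": the lock service could not stop A from trying, but the resource could refuse it.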
736
+
737
+ ### ZooKeeper Distributed Lock (Java, using Curator)
738
+
739
+ ```java
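+ import java.util.concurrent.TimeUnit;
+ import org.apache.curator.framework.CuratorFramework;
+ import org.apache.curator.framework.CuratorFrameworkFactory;
+ import org.apache.curator.framework.recipes.locks.InterProcessMutex;
+ import org.apache.curator.retry.ExponentialBackoffRetry;
+
+ // Retry policy: 1000 ms base sleep, up to 3 retries.
+ CuratorFramework client = CuratorFrameworkFactory.newClient(
+     "localhost:2181", new ExponentialBackoffRetry(1000, 3));
+ client.start();
+
+ InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
+ // acquire() returns false if the lock was not obtained within 30 seconds.
+ if (lock.acquire(30, TimeUnit.SECONDS)) {
+     try {
+         processSharedResource(); // Critical section: one process at a time
+     } finally {
+         lock.release();
+     }
+ }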
+ ```
753
+
754
+ **How Curator's lock works internally:** Creates an ephemeral sequential znode
755
+ (`/locks/my-resource/lock-0000000001`). The holder is the client with the lowest sequence
756
+ number. Each waiting client watches only the znode immediately preceding its own (avoiding the "herd effect" of
757
+ notifying all waiters). If the holder crashes, ZooKeeper deletes the ephemeral node and
758
+ the next waiter acquires automatically.
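The ordering rule can be sketched without ZooKeeper at all: given the sequence numbers of the znodes under the lock path, the lowest one holds the lock and every other client watches the number just below its own. `Plan` is a hypothetical helper written for this illustration, not part of Curator.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan takes the sequence numbers of the ephemeral sequential znodes under a
// lock path. It returns the holder (lowest number) and, for each waiter, the
// sequence number of the single znode it should watch (its predecessor).
// ok is false when there are no znodes at all.
func Plan(seqs []int) (holder int, watches map[int]int, ok bool) {
	if len(seqs) == 0 {
		return 0, nil, false
	}
	sorted := append([]int(nil), seqs...) // copy so the caller's slice is untouched
	sort.Ints(sorted)
	watches = make(map[int]int, len(sorted)-1)
	for i := 1; i < len(sorted); i++ {
		watches[sorted[i]] = sorted[i-1]
	}
	return sorted[0], watches, true
}

func main() {
	holder, watches, _ := Plan([]int{7, 3, 5})
	fmt.Println("holder:", holder, "watches:", watches)

	// The holder (3) crashes and its ephemeral znode disappears.
	// Only the client watching 3 is notified, and it becomes the holder.
	holder, watches, _ = Plan([]int{7, 5})
	fmt.Println("holder:", holder, "watches:", watches)
}
```

Because each znode has at most one watcher, a release or crash wakes exactly one client rather than every waiter.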
759
+
760
+ ---
761
+
762
+ ## Cross-References
763
+
764
+ - **distributed-systems-fundamentals** -- Network models (synchronous, asynchronous, partially
765
+ synchronous), failure models (crash-stop, crash-recovery, Byzantine), and the fundamental
766
+ impossibility results that constrain consensus protocol design.
767
+ - **cap-theorem-and-tradeoffs** -- The CAP theorem directly governs coordination service
768
+ design: etcd and ZooKeeper choose CP (consistency + partition tolerance), sacrificing
769
+ availability during network partitions. Understanding CAP is prerequisite to understanding
770
+ why coordination services become unavailable when quorum is lost.
771
+ - **idempotency-and-retry** -- The most important alternative to distributed locking. In many
772
+ cases where teams reach for a distributed lock, idempotent operation design eliminates the
773
+ coordination requirement entirely, producing a more resilient system.
774
+ - **data-consistency** -- Linearizability, serializability, and eventual consistency models
775
+ determine which coordination primitives are appropriate. Consensus provides linearizability;
776
+ understanding when weaker consistency suffices prevents over-engineering.
777
+
778
+ ---
779
+
780
+ ## Sources
781
+
782
+ - [Distributed Consensus: Paxos vs. Raft and Modern Implementations](https://dev.to/narendars/distributed-consensus-paxos-vs-raft-and-modern-implementations-2gng)
783
+ - [Paxos vs Raft: Have we reached consensus on distributed consensus? (Heidi Howard, 2020)](https://arxiv.org/abs/2004.05074)
784
+ - [Raft Consensus Algorithm -- Official Site](https://raft.github.io/)
785
+ - [A Brief Tour of FLP Impossibility](https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/)
786
+ - [Leader Election in Distributed Systems -- Amazon Builders' Library](https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/)
787
+ - [Leader Election Pattern -- Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/patterns/leader-election)
788
+ - [How to do distributed locking -- Martin Kleppmann](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html)
789
+ - [Is Redlock safe? -- antirez (Salvatore Sanfilippo)](https://antirez.com/news/101)
790
+ - [Distributed Locks with Redis -- Official Redis Documentation](https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/)
791
+ - [etcd versus other key-value stores](https://etcd.io/docs/v3.3/learning/why/)
792
+ - [In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos](https://medium.com/@karim.albakry/in-depth-comparison-of-distributed-coordination-tools-consul-etcd-zookeeper-and-nacos-a6f8e5d612a6)
793
+ - [Distributed Lock Failure: How Long GC Pauses Break Concurrency](https://systemdr.substack.com/p/distributed-lock-failure-how-long)
794
+ - [Beyond the Lock: Why Fencing Tokens Are Essential](https://levelup.gitconnected.com/beyond-the-lock-why-fencing-tokens-are-essential-5be0857d5a6a)
795
+ - [Akka Cluster split brain failures -- are you ready for it?](https://blog.softwaremill.com/akka-cluster-split-brain-failures-are-you-ready-for-it-d9406b97e099)
796
+ - [Understanding Raft Consensus in Distributed Systems with TiDB](https://www.pingcap.com/article/understanding-raft-consensus-in-distributed-systems-with-tidb/)