@booklib/skills 1.0.0

Files changed (85)
  1. package/LICENSE +21 -0
  2. package/README.md +105 -0
  3. package/animation-at-work/SKILL.md +246 -0
  4. package/animation-at-work/assets/example_asset.txt +1 -0
  5. package/animation-at-work/references/api_reference.md +369 -0
  6. package/animation-at-work/references/review-checklist.md +79 -0
  7. package/animation-at-work/scripts/example.py +1 -0
  8. package/bin/skills.js +85 -0
  9. package/clean-code-reviewer/SKILL.md +292 -0
  10. package/clean-code-reviewer/evals/evals.json +67 -0
  11. package/data-intensive-patterns/SKILL.md +204 -0
  12. package/data-intensive-patterns/assets/example_asset.txt +1 -0
  13. package/data-intensive-patterns/references/api_reference.md +34 -0
  14. package/data-intensive-patterns/references/patterns-catalog.md +551 -0
  15. package/data-intensive-patterns/references/review-checklist.md +193 -0
  16. package/data-intensive-patterns/scripts/example.py +1 -0
  17. package/data-pipelines/SKILL.md +252 -0
  18. package/data-pipelines/assets/example_asset.txt +1 -0
  19. package/data-pipelines/references/api_reference.md +301 -0
  20. package/data-pipelines/references/review-checklist.md +181 -0
  21. package/data-pipelines/scripts/example.py +1 -0
  22. package/design-patterns/SKILL.md +245 -0
  23. package/design-patterns/assets/example_asset.txt +1 -0
  24. package/design-patterns/references/api_reference.md +1 -0
  25. package/design-patterns/references/patterns-catalog.md +726 -0
  26. package/design-patterns/references/review-checklist.md +173 -0
  27. package/design-patterns/scripts/example.py +1 -0
  28. package/domain-driven-design/SKILL.md +221 -0
  29. package/domain-driven-design/assets/example_asset.txt +1 -0
  30. package/domain-driven-design/references/api_reference.md +1 -0
  31. package/domain-driven-design/references/patterns-catalog.md +545 -0
  32. package/domain-driven-design/references/review-checklist.md +158 -0
  33. package/domain-driven-design/scripts/example.py +1 -0
  34. package/effective-java/SKILL.md +195 -0
  35. package/effective-java/assets/example_asset.txt +1 -0
  36. package/effective-java/references/api_reference.md +1 -0
  37. package/effective-java/references/items-catalog.md +955 -0
  38. package/effective-java/references/review-checklist.md +216 -0
  39. package/effective-java/scripts/example.py +1 -0
  40. package/effective-kotlin/SKILL.md +225 -0
  41. package/effective-kotlin/assets/example_asset.txt +1 -0
  42. package/effective-kotlin/references/api_reference.md +1 -0
  43. package/effective-kotlin/references/practices-catalog.md +1228 -0
  44. package/effective-kotlin/references/review-checklist.md +126 -0
  45. package/effective-kotlin/scripts/example.py +1 -0
  46. package/kotlin-in-action/SKILL.md +251 -0
  47. package/kotlin-in-action/assets/example_asset.txt +1 -0
  48. package/kotlin-in-action/references/api_reference.md +1 -0
  49. package/kotlin-in-action/references/practices-catalog.md +436 -0
  50. package/kotlin-in-action/references/review-checklist.md +204 -0
  51. package/kotlin-in-action/scripts/example.py +1 -0
  52. package/lean-startup/SKILL.md +250 -0
  53. package/lean-startup/assets/example_asset.txt +1 -0
  54. package/lean-startup/references/api_reference.md +319 -0
  55. package/lean-startup/references/review-checklist.md +137 -0
  56. package/lean-startup/scripts/example.py +1 -0
  57. package/microservices-patterns/SKILL.md +179 -0
  58. package/microservices-patterns/references/patterns-catalog.md +391 -0
  59. package/microservices-patterns/references/review-checklist.md +169 -0
  60. package/package.json +17 -0
  61. package/refactoring-ui/SKILL.md +236 -0
  62. package/refactoring-ui/assets/example_asset.txt +1 -0
  63. package/refactoring-ui/references/api_reference.md +355 -0
  64. package/refactoring-ui/references/review-checklist.md +114 -0
  65. package/refactoring-ui/scripts/example.py +1 -0
  66. package/storytelling-with-data/SKILL.md +238 -0
  67. package/storytelling-with-data/assets/example_asset.txt +1 -0
  68. package/storytelling-with-data/references/api_reference.md +379 -0
  69. package/storytelling-with-data/references/review-checklist.md +111 -0
  70. package/storytelling-with-data/scripts/example.py +1 -0
  71. package/system-design-interview/SKILL.md +213 -0
  72. package/system-design-interview/assets/example_asset.txt +1 -0
  73. package/system-design-interview/references/api_reference.md +582 -0
  74. package/system-design-interview/references/review-checklist.md +201 -0
  75. package/system-design-interview/scripts/example.py +1 -0
  76. package/using-asyncio-python/SKILL.md +242 -0
  77. package/using-asyncio-python/assets/example_asset.txt +1 -0
  78. package/using-asyncio-python/references/api_reference.md +267 -0
  79. package/using-asyncio-python/references/review-checklist.md +149 -0
  80. package/using-asyncio-python/scripts/example.py +1 -0
  81. package/web-scraping-python/SKILL.md +259 -0
  82. package/web-scraping-python/assets/example_asset.txt +1 -0
  83. package/web-scraping-python/references/api_reference.md +393 -0
  84. package/web-scraping-python/references/review-checklist.md +163 -0
  85. package/web-scraping-python/scripts/example.py +1 -0
@@ -0,0 +1,551 @@
# Data-Intensive Application Patterns Catalog

Comprehensive reference of patterns from Martin Kleppmann's *Designing Data-Intensive Applications*.
Organized by the book's three-part structure. Read the section relevant to the code you're generating.

---

## Table of Contents

1. [Data Models and Query Languages](#data-models-and-query-languages)
2. [Storage Engines and Indexing](#storage-engines-and-indexing)
3. [Encoding and Schema Evolution](#encoding-and-schema-evolution)
4. [Replication](#replication)
5. [Partitioning](#partitioning)
6. [Transactions](#transactions)
7. [Distributed Systems Fundamentals](#distributed-systems-fundamentals)
8. [Consistency and Consensus](#consistency-and-consensus)
9. [Batch Processing](#batch-processing)
10. [Stream Processing](#stream-processing)
11. [Derived Data and Integration](#derived-data-and-integration)

---

## Data Models and Query Languages

### Relational Model
- Tables with rows and columns, enforced schema
- Best for: many-to-many relationships, complex joins, data with strong integrity requirements
- Normalized: reduce redundancy, enforce consistency via foreign keys
- Query language: SQL (declarative)

### Document Model
- Self-contained JSON/BSON documents, flexible schema (schema-on-read)
- Best for: one-to-many relationships, self-contained records, heterogeneous data
- Denormalized: data locality — everything for one entity in one document
- Limitations: poor support for many-to-many; joins are weak or manual
- Examples: MongoDB, CouchDB, RethinkDB

### Graph Model
- Vertices (nodes) and edges (relationships), flexible schema
- Best for: highly interconnected data, variable relationship types, traversal-heavy queries
- Two flavors:
  - **Property graph** (Neo4j): nodes/edges have properties, query with Cypher
  - **Triple store** (RDF): subject-predicate-object triples, query with SPARQL
- Good for social networks, recommendation engines, knowledge graphs, fraud detection

### Choosing a Data Model

| Access Pattern | Best Model |
|---------------|-----------|
| Complex joins, aggregations, reporting | Relational |
| Self-contained documents, flexible schema | Document |
| Highly connected data, graph traversals | Graph |
| Append-only event log | Event store |
| Key-value lookups with high throughput | Key-value (Redis, DynamoDB) |
| Time-series with range scans | Time-series (TimescaleDB, InfluxDB) |

---

## Storage Engines and Indexing

### Log-Structured Storage (LSM-Trees)

How it works:
1. Writes go to an in-memory balanced tree (memtable)
2. When memtable exceeds threshold, flush to disk as a sorted SSTable (Sorted String Table)
3. Background compaction merges SSTables, removing duplicates and deleted entries
4. Reads check memtable first, then SSTables (newest to oldest), aided by Bloom filters

Characteristics:
- **Write-optimized**: sequential writes to disk, no random I/O on write path
- **Compaction strategies**: size-tiered (good for write-heavy) vs leveled (better read amplification)
- **Bloom filters**: probabilistic data structure to quickly check if a key might exist in an SSTable
- **Trade-off**: higher write throughput, but reads may touch multiple SSTables

Implementations: LevelDB, RocksDB, Cassandra, HBase, ScyllaDB
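The write and read paths above can be sketched as a toy store (a minimal sketch — the `LSMStore` class, its flush threshold, and list-based SSTables are inventions of this example, not any real engine's API):

```python
import bisect

class LSMStore:
    """Toy LSM-tree: an in-memory memtable flushed to sorted, immutable 'SSTables'."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}        # write buffer (stands in for a balanced tree)
        self.sstables = []        # flushed segments, newest last; each a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted run (an SSTable), then clear it
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then SSTables newest-to-oldest
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        # Merge all SSTables; newer values win, leaving one deduplicated run
        merged = {}
        for table in self.sstables:
            merged.update(dict(table))
        self.sstables = [sorted(merged.items())]
```

`compact()` plays the role of the background merge: duplicate keys collapse and only the newest value per key survives, which is why reads get cheaper after compaction.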

### Page-Oriented Storage (B-Trees)

How it works:
1. Data organized in fixed-size pages (typically 4KB)
2. Tree structure with branching factor ~several hundred
3. Updates modify pages in place
4. Write-ahead log (WAL) ensures crash recovery

Characteristics:
- **Read-optimized**: O(log n) lookups with predictable performance
- **In-place updates**: overwrites pages on disk
- **WAL**: append-only log written before modifying pages, for crash recovery
- **Latches**: lightweight locks for concurrent access to tree pages
- **Copy-on-write**: some implementations (LMDB) write new pages instead of overwriting

Implementations: PostgreSQL, MySQL InnoDB, SQL Server, Oracle

### Choosing a Storage Engine

| Workload | Recommended Engine | Why |
|----------|-------------------|-----|
| Write-heavy (logging, IoT, events) | LSM-Tree | Sequential writes, high throughput |
| Read-heavy with point lookups | B-Tree | Predictable read latency |
| Mixed OLTP | B-Tree (usually) | Good all-around for transactions |
| Analytical (OLAP) | Column-oriented | Compression, vectorized processing |
| Time-series | LSM-Tree or specialized | Append-heavy, range scan friendly |

### Column-Oriented Storage (OLAP)

- Store values from each column together instead of each row together
- Enables aggressive compression (run-length encoding, bitmap encoding)
- Vectorized processing: operate on columns of compressed data in CPU cache
- **Star schema**: central fact table with dimension tables (snowflake if dimensions are further normalized)
- **Materialized views / data cubes**: pre-computed aggregations for dashboard queries

---

## Encoding and Schema Evolution

### Encoding Formats Comparison

| Format | Schema? | Binary? | Forward Compatible | Backward Compatible | Notes |
|--------|---------|---------|-------------------|--------------------|----|
| JSON | Optional (JSON Schema) | No | Partial | Partial | Human-readable; number precision issues |
| XML | Optional (XSD) | No | Partial | Partial | Verbose; human-readable |
| Protocol Buffers | Required (.proto) | Yes | Yes (new fields with new tags) | Yes (new fields optional) | Field tags for evolution |
| Thrift | Required (.thrift) | Yes | Yes | Yes | Similar to Protobuf, two binary formats |
| Avro | Required (.avsc) | Yes | Yes (writer's schema + reader's schema) | Yes | Schema resolution; great for Hadoop/Kafka |

### Schema Evolution Rules

- **Forward compatibility**: old code can read data written by new code
  - Old readers must ignore unknown new fields rather than fail on them
  - Never remove a field that old code requires
- **Backward compatibility**: new code can read data written by old code
  - New fields must be optional or have defaults
  - Never reuse a deleted field's tag number
- **Full compatibility** (both): needed when readers and writers are updated at different times
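A minimal illustration of both directions with plain JSON (the v1/v2 field split and the default value are hypothetical):

```python
import json

# v1 wrote {"id", "name"}; v2 adds an optional "email" field with a default.
V1_FIELDS = {"id", "name"}
V2_DEFAULTS = {"email": None}

def read_as_v2(raw: str) -> dict:
    """Backward compatibility: new (v2) code reading data written by v1.
    The new field, missing from old records, is filled from its default."""
    record = json.loads(raw)
    for field, default in V2_DEFAULTS.items():
        record.setdefault(field, default)
    return record

def read_as_v1(raw: str) -> dict:
    """Forward compatibility: old (v1) code reading data written by v2.
    Unknown fields are ignored instead of causing a failure."""
    record = json.loads(raw)
    return {k: v for k, v in record.items() if k in V1_FIELDS}
```

Schema-aware formats (Protobuf, Avro) enforce these two behaviors for you; with schemaless JSON the reader has to implement them deliberately, as above.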

### Dataflow Patterns

How data flows between processes determines which compatibility direction matters:

- **Through databases**: writer encodes, reader decodes (potentially much later) — need both directions
- **Through services (REST/RPC)**: request and response each need backward + forward compatibility
  - REST: use content negotiation, versioned URLs, or header-based versioning
  - RPC: treat as cross-service API, version carefully
- **Through async messaging**: producer encodes, consumer decodes — similar to databases
  - Use schema registry (Confluent Schema Registry for Kafka)
  - Avro is ideal: writer's schema stored with each message, reader uses its own schema

---

## Replication

### Single-Leader Replication

One node (leader) accepts writes; followers replicate the leader's write-ahead log.

- **Synchronous followers**: guaranteed up-to-date but write latency increases
- **Asynchronous followers**: leader doesn't wait, risk of data loss if leader fails before replication
- **Semi-synchronous**: one follower is synchronous (guaranteed durability), rest are async

Failover concerns:
- Split-brain: two nodes think they're leader (use fencing tokens, epoch numbers)
- Lost writes: async follower promoted to leader may be missing recent writes
- Replication lag: stale reads from followers

Handling replication lag:
- **Read-after-write consistency**: after a write, read from leader (or wait for follower to catch up)
- **Monotonic reads**: always read from the same replica (session stickiness)
- **Consistent prefix reads**: preserve causal ordering of writes

### Multi-Leader Replication

Multiple nodes accept writes. Each leader replicates to all others.

- **Use case**: multi-datacenter (one leader per datacenter), offline-capable clients, collaborative editing
- **Conflict resolution**:
  - Last-write-wins (LWW): discard concurrent writes arbitrarily — data loss risk
  - Merge values: application-specific merge logic
  - Custom conflict handlers: on-write or on-read resolution
- **Topologies**: all-to-all (most robust), star, circular — star and circular topologies create single points of failure

### Leaderless Replication (Dynamo-style)

Client sends writes to multiple replicas. Reads query multiple replicas.

- **Quorum**: w + r > n (write quorum + read quorum > total replicas) to guarantee overlap
  - Common config: n=3, w=2, r=2
  - Tunable: w=1, r=3 for fast writes; w=3, r=1 for fast reads
- **Read repair**: client detects stale value during read, writes newer value back to stale replica
- **Anti-entropy**: background process compares replicas and fixes differences
- **Sloppy quorum + hinted handoff**: during network partition, write to reachable nodes, hand off later
- **Version vectors**: track causal history to detect concurrent writes vs. sequential
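The quorum arithmetic is worth sketching (function names are invented for this example):

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum must intersect every write quorum."""
    return w + r > n

def read_latest(replies):
    """Pick the winner among (version, value) replies from r replicas.
    With an overlapping quorum, at least one reply carries the latest write."""
    version, value = max(replies)
    return value
```

This is why n=3, w=2, r=2 tolerates one unavailable replica on either path: any two write acks and any two read replies must share at least one node, and version numbers (or version vectors) decide which reply is newest.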

---

## Partitioning

### Partitioning Strategies

**By key range:**
- Keys sorted, each partition owns a contiguous range
- Enables efficient range scans
- Risk: hot spots if writes cluster on one range (e.g., time-based keys → today's partition is hot)
- Mitigation: compound keys (e.g., sensor_id + timestamp)

**By hash of key:**
- Hash function distributes keys uniformly across partitions
- Destroys sort order — range scans require querying all partitions
- More uniform load distribution
- Consistent hashing: minimizes data movement when adding/removing nodes

**Compound/composite partitioning:**
- First part of key determines partition (by hash), rest preserves sort order within partition
- Cassandra approach: partition key (hashed) + clustering columns (sorted within partition)
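Consistent hashing can be sketched in a few lines (the `HashRing` class, vnode count, and MD5 choice are assumptions of this sketch, not a particular database's scheme):

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    # Python's built-in hash() is salted per process, so use a stable digest
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: a key belongs to the first node point
    at or after its hash, wrapping around at the end of the ring."""

    def __init__(self, nodes, vnodes=8):
        # vnodes: virtual points per physical node, for smoother key distribution
        self.ring = sorted((stable_hash(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))
        self.points = [point for point, _ in self.ring]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.points, stable_hash(key)) % len(self.ring)
        return self.ring[i][1]
```

Contrast with `hash(key) % N`: changing N remaps almost every key, while on the ring a newly added node only claims the keys on the arcs its points split — every other key keeps its old owner.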

### Secondary Indexes with Partitioning

**Document-partitioned (local) index:**
- Each partition maintains its own secondary index covering only its data
- Write: update one partition's index
- Read: scatter-gather across all partitions (fan-out) — can be slow

**Term-partitioned (global) index:**
- Secondary index is itself partitioned by the indexed term
- Write: may need to update multiple partitions' indexes (distributed transaction or async)
- Read: query only the relevant index partition — faster reads

### Rebalancing Strategies

- **Fixed number of partitions**: create many more partitions than nodes; reassign whole partitions when nodes join/leave (Riak, Elasticsearch, Couchbase)
- **Dynamic partitioning**: split large partitions, merge small ones (HBase, RethinkDB)
- **Proportional to nodes**: fixed number of partitions per node (Cassandra)
- **Avoid**: hash mod N (reassigns almost everything when N changes)

### Request Routing

- **Client-side**: client learns partition assignment (via ZooKeeper/gossip) and connects directly
- **Routing tier**: proxy/load balancer that knows partition assignment
- **Coordinator node**: any node accepts request, forwards to correct partition owner (Cassandra gossip)

---

## Transactions

### Isolation Levels

**Read Committed:**
- Guarantees: no dirty reads (only see committed data), no dirty writes (only overwrite committed data)
- Implementation: row-level locks for writes; snapshot for reads (return old committed value during write)
- Default in PostgreSQL, SQL Server

**Snapshot Isolation (Repeatable Read / MVCC):**
- Each transaction sees a consistent snapshot of the database from the start of the transaction
- Implementation: MVCC (Multi-Version Concurrency Control) — each write creates a new version; reads see only versions committed before the transaction started
- Prevents: read skew (non-repeatable reads)
- Does NOT prevent: write skew (two transactions read, then both write based on stale reads)

**Serializable:**
- Strongest guarantee — result is as if transactions ran one-at-a-time

Three implementations:
1. **Actual serial execution**: literally run one transaction at a time on a single CPU
   - Viable when transactions are short and fit in memory
   - Use stored procedures to avoid network round-trips
   - Partitioning can enable per-partition serial execution
2. **Two-phase locking (2PL)**: readers block writers, writers block readers
   - Shared locks for reads, exclusive locks for writes
   - Predicate locks or index-range locks prevent phantoms
   - Performance: significant contention, potential deadlocks
3. **Serializable Snapshot Isolation (SSI)**: optimistic — detect conflicts at commit time
   - Based on snapshot isolation + tracking reads and writes
   - Detect: writes that affect prior reads (stale MVCC reads)
   - Abort conflicting transactions at commit
   - Better performance than 2PL for low-contention workloads

### Preventing Write Skew and Phantoms

Write skew: two transactions both read a condition, both decide to act, both write — violating a constraint that should hold across both.

Example: two doctors both check "≥2 doctors on call" → both remove themselves → 0 on call.

Solutions:
- Serializable isolation
- Explicit locking: `SELECT ... FOR UPDATE` to materialize the conflict
- Application-level constraints with saga patterns

Phantoms: a write in one transaction changes the result of a search query in another.

Solutions:
- Predicate locks (lock the search condition itself)
- Index-range locks (practical approximation of predicate locks)
- Materializing conflicts: create a lock table that represents the condition

---

## Distributed Systems Fundamentals

### Unreliable Networks

- Networks are **asynchronous**: no upper bound on message delay
- Packet loss, reordering, duplication are normal
- **Timeout selection**: too short → false positives (unnecessary failover); too long → slow detection
  - Adaptive timeouts based on observed round-trip times
- **Network partitions**: some nodes can't communicate with others

### Unreliable Clocks

- **Time-of-day clocks**: wall clock, can jump (NTP sync, leap seconds) — DO NOT use for durations or ordering
- **Monotonic clocks**: always move forward, for measuring elapsed time within a single node
- **Logical clocks**: Lamport timestamps, vector clocks — capture causal ordering without relying on physical time

Clock issues in distributed systems:
- LWW conflict resolution using timestamps is fundamentally unsafe
- Lease expiration: a process might think its lease is valid when the clock has drifted
- Solution: **fencing tokens** — monotonically increasing tokens; storage rejects stale tokens

### Process Pauses

- GC pauses, VM suspension, disk I/O stalls can freeze a process for seconds
- A process cannot know it was paused — it might act on stale state after resuming
- Solution: lease-based protocols with fencing tokens; always validate at the point of write

### Fencing Tokens

When using distributed locks or leases:
1. Lock service issues a fencing token (monotonically increasing number) with each lock grant
2. Client includes fencing token with every write to the storage service
3. Storage service rejects writes with a token lower than the highest seen
4. Guarantees mutual exclusion even if a client holds a stale lock
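The four steps, sketched (the class names here are invented):

```python
class LockService:
    """Issues a monotonically increasing fencing token with each lock grant."""

    def __init__(self):
        self.next_token = 0

    def acquire(self) -> int:
        token = self.next_token
        self.next_token += 1
        return token

class FencedStorage:
    """Storage service that enforces fencing tokens on every write."""

    def __init__(self):
        self.data = {}
        self.highest_token = -1

    def write(self, token: int, key: str, value) -> bool:
        # Reject any write carrying a token lower than the highest already seen
        if token < self.highest_token:
            return False        # stale lock holder: write refused
        self.highest_token = token
        self.data[key] = value
        return True
```

The point is that the *storage* side does the checking: a client that was paused past its lease expiry still holds an old token, and its late write is simply refused.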

---

## Consistency and Consensus

### Linearizability

Strongest consistency model: operations appear to take effect atomically at some point between invocation and completion. Like a single-copy system.

- **Use cases**: leader election, uniqueness constraints, distributed locks
- **Not needed for**: most application reads, analytics, caching
- **Cost**: slower (requires coordination), reduced availability during network partitions

CAP Theorem (more precisely): if there's a network partition, you must choose between consistency (linearizability) and availability.

### Causal Consistency

Weaker than linearizability but preserves causally related ordering:
- If operation A happened before B, everyone sees A before B
- Concurrent operations can be seen in different orders by different nodes
- Implemented via: version vectors, Lamport timestamps
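A Lamport clock — the simplest logical clock — is small enough to show whole (a generic sketch, not tied to any library):

```python
class LamportClock:
    """Logical clock: a counter that captures happened-before ordering."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event: just advance the counter
        self.time += 1
        return self.time

    def send(self) -> int:
        # Attach the current timestamp to an outgoing message
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Merge rule: jump past the sender's timestamp, so cause < effect
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The guarantee is one-directional: if A happened before B, then timestamp(A) < timestamp(B). Equal or interleaved timestamps on concurrent events are exactly why version vectors exist — they can additionally *detect* concurrency.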

### Total Order Broadcast

Protocol to deliver messages to all nodes in the same order:
- All nodes deliver the same messages in the same sequence
- Equivalent to consensus (and to linearizable compare-and-swap)
- Implementations: ZooKeeper's Zab, Raft's log replication

### Distributed Consensus

Agreement: all nodes decide on the same value. Properties:
- **Uniform agreement**: no two nodes decide differently
- **Integrity**: a node decides at most once
- **Validity**: if a node decides value v, then v was proposed by some node
- **Termination**: every non-crashed node eventually decides

Algorithms: Paxos, Raft, Zab (ZooKeeper), Viewstamped Replication

Practical usage: don't implement consensus yourself. Use:
- **ZooKeeper/etcd**: coordination services providing linearizable key-value store, leader election, distributed locks, group membership, service discovery
- **Raft-based systems**: etcd (Kubernetes), CockroachDB, TiKV

### Two-Phase Commit (2PC) — and Why to Avoid It

Coordinator-based distributed transaction protocol:
1. Prepare phase: coordinator asks all participants to prepare (vote yes/no)
2. Commit phase: if all vote yes, coordinator commits; if any vote no, abort

Problems:
- **Blocking**: if coordinator crashes after prepare, participants are stuck holding locks
- **Single point of failure**: coordinator must be highly available
- **Performance**: high latency, reduced throughput due to lock holding
- **Heterogeneous systems**: XA transactions across different databases are especially fragile

Prefer: sagas with compensating transactions, or single-database transactions with outbox pattern.

---

## Batch Processing

### Unix Philosophy Applied to Data

- Small, focused tools composed via pipes
- Immutable inputs, explicit outputs
- Separate the logic from the wiring

### MapReduce

How it works:
1. **Map**: extract key-value pairs from each input record
2. **Shuffle**: group all values for the same key together (sorted)
3. **Reduce**: process all values for each key, produce output
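The three phases can be written as ordinary functions, with word count as the usual example (a single-process sketch, not Hadoop's API):

```python
from itertools import groupby

def map_phase(records, mapper):
    # Map: emit (key, value) pairs from every input record
    return [pair for record in records for pair in mapper(record)]

def shuffle(pairs):
    # Shuffle: sort by key, then group all values for the same key together
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

def reduce_phase(grouped, reducer):
    # Reduce: one call per key, over all of that key's values
    return {key: reducer(key, values) for key, values in grouped}

def word_count(lines):
    mapped = map_phase(lines, lambda line: [(word, 1) for word in line.split()])
    return reduce_phase(shuffle(mapped), lambda word, counts: sum(counts))
```

In a real cluster, `shuffle` is the expensive step — it moves data across the network and materializes it — which is exactly what the dataflow engines below try to avoid.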

Patterns:
- **Sort-merge join (reduce-side)**: both datasets emit the join key; reducer sees all matching records
- **Broadcast hash join (map-side)**: small dataset loaded into memory in each mapper; no shuffle needed
- **Partitioned hash join (map-side)**: both datasets partitioned the same way; each mapper joins its own partition pair

Limitations of MapReduce:
- Materializes intermediate state to disk between stages (slow)
- Overhead of repeated job startup

### Dataflow Engines (Spark, Flink, Tez)

Improvements over MapReduce:
- Model entire workflow as a directed acyclic graph (DAG) of operators
- No mandatory materialization of intermediate results (can pipeline through memory)
- Operators are generalized (not limited to map and reduce)
- Better fault tolerance: recompute from upstream operator or checkpoint

### Graph Processing (Pregel / BSP)

Bulk Synchronous Parallel model for iterative graph algorithms:
- Each vertex processes messages from its neighbors
- Sends messages to neighbors for the next iteration
- Iterations continue until convergence (no more messages)
- Use for: PageRank, shortest paths, connected components

---

## Stream Processing

### Message Brokers vs. Log-Based Systems

**Traditional message broker** (RabbitMQ, ActiveMQ):
- Messages deleted after acknowledgment
- No long-term history
- Multiple consumers: load balancing (competing consumers) or fan-out (pub/sub)
- Good for: task queues, work distribution

**Log-based message broker** (Kafka, Amazon Kinesis):
- Append-only log, messages retained for configurable period
- Consumers track their position (offset) in the log
- Multiple consumer groups read independently at their own pace
- Replay: reset offset to re-process past messages
- Ordering guaranteed within a partition

### Change Data Capture (CDC)

Capture every write to a database and publish it as an event stream:
- **Implementation**: read the database's replication log (WAL) and convert to events
- **Tools**: Debezium (Kafka Connect), Maxwell, AWS DMS
- **Initial snapshot**: bootstrap new consumer with a full table dump, then switch to streaming
- **Log compaction**: retain only the latest event for each key — bounded storage, full state rebuild

Use CDC to keep derived systems (search indexes, caches, data warehouses) in sync with the primary database.
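Log compaction in miniature (a sketch; the `(key, value)` event shape and the None-as-tombstone convention are assumptions of this example — Kafka marks deletions with null-valued records similarly):

```python
def compact(log):
    """Keep only the latest event per key, in first-seen key order.
    Events are (key, value) pairs; value None marks a deletion (tombstone)."""
    latest = {}
    for key, value in log:
        latest[key] = value          # later events overwrite earlier ones
    # Drop tombstoned keys entirely; emit one surviving event per key
    return [(k, v) for k, v in latest.items() if v is not None]
```

This is why a compacted changelog stays bounded by the number of live keys rather than the number of writes, and why replaying it from the start rebuilds the full current state.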

### Event Sourcing

Store every state change as an immutable event in an append-only log:
- Current state = replay all events from the beginning (or from a snapshot)
- Events are facts — never deleted or modified
- Commands (requests) are validated and may be rejected; events (facts) are always recorded
- **Snapshots**: periodically save materialized state to avoid replaying entire history
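The replay rule is just a fold over the log; a toy bank account illustrates it (the event names are invented):

```python
def apply(state, event):
    """Apply one domain event to the current state (a toy account balance)."""
    kind, amount = event
    if kind == "Deposited":
        return state + amount
    if kind == "Withdrawn":
        return state - amount
    raise ValueError(f"unknown event: {kind}")

def replay(events, snapshot=0):
    """Current state = fold over the immutable event log,
    optionally starting from a saved snapshot instead of the beginning."""
    state = snapshot
    for event in events:
        state = apply(state, event)
    return state
```

The snapshot parameter shows why snapshots are safe: replaying the tail of the log from a saved state must give the same answer as replaying everything — which holds as long as `apply` is deterministic.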

Differences from CDC:
- CDC captures low-level database changes (row inserts/updates/deletes)
- Event sourcing captures high-level domain events (OrderPlaced, PaymentReceived)
- Event sourcing is designed into the application; CDC is applied to existing databases

### Stream Processing Patterns

**Complex Event Processing (CEP):**
- Define patterns over event streams (e.g., "three failed logins within 5 minutes")
- Query is stored, events flow through the query
- Tools: Esper, Apache Flink CEP

**Stream analytics:**
- Continuous aggregation over time windows
- Window types:
  - **Tumbling**: fixed-size, non-overlapping (e.g., every 1 minute)
  - **Hopping**: fixed-size, overlapping (e.g., 5-min window every 1 min)
  - **Sliding**: all events within a fixed duration of each other
  - **Session**: group events by activity with inactivity gap
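Of these, the tumbling window is the simplest to sketch (integer timestamps and the function name are assumptions of the example):

```python
def tumbling_windows(events, size):
    """Assign each (timestamp, value) event to a fixed, non-overlapping window
    and sum per window. `size` is the window length in the same time unit."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size          # start of the window this event falls in
        windows.setdefault(start, []).append(value)
    return {start: sum(values) for start, values in sorted(windows.items())}
```

A hopping window is the same idea with each event assigned to *several* overlapping windows, and a session window replaces the fixed boundary with an inactivity-gap check between consecutive timestamps.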

**Stream joins:**
- **Stream-stream join (window join)**: join two streams within a time window; buffer events from both sides
- **Stream-table join (enrichment)**: enrich stream events with data from a table (maintained locally via CDC or changelog)
- **Table-table join (materialized view)**: both inputs are changelogs; output is an updated materialized view

### Stream Fault Tolerance

- **Microbatching** (Spark Streaming): process stream in small batches; replay batch on failure
- **Checkpointing** (Flink): periodic snapshots of operator state; restore from last checkpoint
- **Idempotent writes**: make sink operations idempotent so replayed messages don't cause duplicates
- **Exactly-once semantics**: achieved via atomic commit of state + output + offset (Kafka transactions), or idempotency keys at the sink
- **Rebuilding state**: if a local state store is lost, rebuild from the changelog or re-process the input stream

---

## Derived Data and Integration

### The Dataflow Paradigm

Think of data systems as a pipeline:
- **System of record** (source of truth): authoritative data store
- **Derived data**: caches, search indexes, materialized views, data warehouse — all derived from the source of truth
- A change to the source of truth triggers updates to all derived views

### Transactional Outbox Pattern

Ensure reliable event publishing alongside database writes:
1. Write business data AND event to an OUTBOX table in the same database transaction
2. A separate process (relay) reads the outbox and publishes events to the message broker
3. After successful publish, mark outbox entry as published

Relay strategies:
- **Polling publisher**: periodically query the outbox table for unpublished events
- **Transaction log tailing**: read the database's WAL to detect outbox inserts (lower latency)

### CQRS (Command Query Responsibility Segregation)

Separate the write model from the read model:
- **Command side**: handles writes, publishes domain events
- **Query side**: subscribes to events, maintains denormalized read-optimized views

Benefits:
- Read model optimized for specific query patterns
- Read and write sides can scale independently
- Can have multiple read models for different access patterns

Trade-offs:
- Eventual consistency between write and read sides
- More infrastructure (event bus, separate read databases)
- Complexity of maintaining derived views

### Lambda Architecture vs. Kappa Architecture

**Lambda**: maintain both a batch layer (reprocess all data periodically) and a speed layer (process new events in real time); merge results.
- Problem: maintaining two codepaths (batch and streaming)

**Kappa**: single stream processing system handles everything; reprocess by replaying the log from the beginning through a new version of the processor.
- Simpler; requires log retention and ability to replay

### End-to-End Correctness

No single component provides exactly-once across the entire pipeline. Instead:
- Use **idempotency keys** at every boundary (producer → broker → consumer → database)
- **Deduplication**: consumers track processed message IDs
- **End-to-end argument**: push correctness guarantees to the application level rather than relying on infrastructure
- **Deterministic processing**: same input always produces same output, enabling safe replay
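The deduplication bullet in code (a sketch — a production consumer would persist the seen-ID set atomically with the sink write, not keep it in memory):

```python
class DedupingConsumer:
    """Consumer that tracks processed message IDs so redelivery is harmless."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()       # must be durable in production (e.g. stored in the sink DB)

    def consume(self, message_id: str, payload) -> bool:
        if message_id in self.seen:
            return False        # duplicate delivery: skip, don't re-apply the effect
        self.handler(payload)
        self.seen.add(message_id)
        return True
```

An at-least-once broker plus an idempotent consumer like this is one standard recipe for effectively-once processing end to end: the idempotency key travels with the message, and every boundary that might retry checks it.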