@booklib/skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (85)
  1. package/LICENSE +21 -0
  2. package/README.md +105 -0
  3. package/animation-at-work/SKILL.md +246 -0
  4. package/animation-at-work/assets/example_asset.txt +1 -0
  5. package/animation-at-work/references/api_reference.md +369 -0
  6. package/animation-at-work/references/review-checklist.md +79 -0
  7. package/animation-at-work/scripts/example.py +1 -0
  8. package/bin/skills.js +85 -0
  9. package/clean-code-reviewer/SKILL.md +292 -0
  10. package/clean-code-reviewer/evals/evals.json +67 -0
  11. package/data-intensive-patterns/SKILL.md +204 -0
  12. package/data-intensive-patterns/assets/example_asset.txt +1 -0
  13. package/data-intensive-patterns/references/api_reference.md +34 -0
  14. package/data-intensive-patterns/references/patterns-catalog.md +551 -0
  15. package/data-intensive-patterns/references/review-checklist.md +193 -0
  16. package/data-intensive-patterns/scripts/example.py +1 -0
  17. package/data-pipelines/SKILL.md +252 -0
  18. package/data-pipelines/assets/example_asset.txt +1 -0
  19. package/data-pipelines/references/api_reference.md +301 -0
  20. package/data-pipelines/references/review-checklist.md +181 -0
  21. package/data-pipelines/scripts/example.py +1 -0
  22. package/design-patterns/SKILL.md +245 -0
  23. package/design-patterns/assets/example_asset.txt +1 -0
  24. package/design-patterns/references/api_reference.md +1 -0
  25. package/design-patterns/references/patterns-catalog.md +726 -0
  26. package/design-patterns/references/review-checklist.md +173 -0
  27. package/design-patterns/scripts/example.py +1 -0
  28. package/domain-driven-design/SKILL.md +221 -0
  29. package/domain-driven-design/assets/example_asset.txt +1 -0
  30. package/domain-driven-design/references/api_reference.md +1 -0
  31. package/domain-driven-design/references/patterns-catalog.md +545 -0
  32. package/domain-driven-design/references/review-checklist.md +158 -0
  33. package/domain-driven-design/scripts/example.py +1 -0
  34. package/effective-java/SKILL.md +195 -0
  35. package/effective-java/assets/example_asset.txt +1 -0
  36. package/effective-java/references/api_reference.md +1 -0
  37. package/effective-java/references/items-catalog.md +955 -0
  38. package/effective-java/references/review-checklist.md +216 -0
  39. package/effective-java/scripts/example.py +1 -0
  40. package/effective-kotlin/SKILL.md +225 -0
  41. package/effective-kotlin/assets/example_asset.txt +1 -0
  42. package/effective-kotlin/references/api_reference.md +1 -0
  43. package/effective-kotlin/references/practices-catalog.md +1228 -0
  44. package/effective-kotlin/references/review-checklist.md +126 -0
  45. package/effective-kotlin/scripts/example.py +1 -0
  46. package/kotlin-in-action/SKILL.md +251 -0
  47. package/kotlin-in-action/assets/example_asset.txt +1 -0
  48. package/kotlin-in-action/references/api_reference.md +1 -0
  49. package/kotlin-in-action/references/practices-catalog.md +436 -0
  50. package/kotlin-in-action/references/review-checklist.md +204 -0
  51. package/kotlin-in-action/scripts/example.py +1 -0
  52. package/lean-startup/SKILL.md +250 -0
  53. package/lean-startup/assets/example_asset.txt +1 -0
  54. package/lean-startup/references/api_reference.md +319 -0
  55. package/lean-startup/references/review-checklist.md +137 -0
  56. package/lean-startup/scripts/example.py +1 -0
  57. package/microservices-patterns/SKILL.md +179 -0
  58. package/microservices-patterns/references/patterns-catalog.md +391 -0
  59. package/microservices-patterns/references/review-checklist.md +169 -0
  60. package/package.json +17 -0
  61. package/refactoring-ui/SKILL.md +236 -0
  62. package/refactoring-ui/assets/example_asset.txt +1 -0
  63. package/refactoring-ui/references/api_reference.md +355 -0
  64. package/refactoring-ui/references/review-checklist.md +114 -0
  65. package/refactoring-ui/scripts/example.py +1 -0
  66. package/storytelling-with-data/SKILL.md +238 -0
  67. package/storytelling-with-data/assets/example_asset.txt +1 -0
  68. package/storytelling-with-data/references/api_reference.md +379 -0
  69. package/storytelling-with-data/references/review-checklist.md +111 -0
  70. package/storytelling-with-data/scripts/example.py +1 -0
  71. package/system-design-interview/SKILL.md +213 -0
  72. package/system-design-interview/assets/example_asset.txt +1 -0
  73. package/system-design-interview/references/api_reference.md +582 -0
  74. package/system-design-interview/references/review-checklist.md +201 -0
  75. package/system-design-interview/scripts/example.py +1 -0
  76. package/using-asyncio-python/SKILL.md +242 -0
  77. package/using-asyncio-python/assets/example_asset.txt +1 -0
  78. package/using-asyncio-python/references/api_reference.md +267 -0
  79. package/using-asyncio-python/references/review-checklist.md +149 -0
  80. package/using-asyncio-python/scripts/example.py +1 -0
  81. package/web-scraping-python/SKILL.md +259 -0
  82. package/web-scraping-python/assets/example_asset.txt +1 -0
  83. package/web-scraping-python/references/api_reference.md +393 -0
  84. package/web-scraping-python/references/review-checklist.md +163 -0
  85. package/web-scraping-python/scripts/example.py +1 -0
@@ -0,0 +1,582 @@
# System Design Interview — Chapter-by-Chapter Reference

Complete catalog of system design concepts, patterns, and techniques from all 16 chapters.

---

## Ch 1: Scale From Zero To Millions of Users

### Single Server Setup
- Web app, database, cache all on one server
- DNS resolves domain to IP; HTTP request/response cycle

### Database
- **Relational (SQL)**: MySQL, PostgreSQL — structured data, joins, ACID
- **NoSQL**: Key-value (Redis, DynamoDB), Document (MongoDB), Column (Cassandra), Graph (Neo4j)
- Choose NoSQL when: super-low latency, unstructured data, massive scale, no relational needs

### Scaling
- **Vertical scaling** (scale up): bigger machine; simple but has hard limits and no failover
- **Horizontal scaling** (scale out): more machines; preferred for large-scale apps

### Load Balancer
- Distributes traffic across web servers
- Users connect to the LB's public IP; servers use private IPs
- Enables: failover (reroute if a server dies), horizontal scaling (add more servers)

### Database Replication
- **Master-slave model**: Master handles writes; slaves handle reads
- Read/write ratio is typically high → multiple read replicas
- Failover: a slave is promoted to master if the master fails; a new slave replaces the old one
- Benefits: performance (parallel reads), reliability (data replicated), availability (failover)

### Cache
- Temporary storage for frequently accessed data (in-memory, much faster than DB)
- **Read-through cache**: Check cache → if miss, read from DB → store in cache → return
- **Considerations**: Use when reads >> writes; expiration policy; consistency; eviction (LRU, LFU, FIFO); single point of failure (multiple cache servers)
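The read-through flow above fits in a few lines. A minimal in-memory sketch, assuming a `load_from_db` callback and a plain dict with TTL stand in for a real cache tier (a production system would use Redis or Memcached):

```python
import time

class ReadThroughCache:
    """Check cache -> on miss, load from DB, store with TTL, return."""
    def __init__(self, load_from_db, ttl_seconds=60):
        self.load_from_db = load_from_db   # fallback loader on cache miss
        self.ttl = ttl_seconds
        self.store = {}                    # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                # cache hit
        value = self.load_from_db(key)     # miss: read from DB
        self.store[key] = (value, time.time() + self.ttl)
        return value

db = {"user:1": "Alice"}
cache = ReadThroughCache(lambda k: db[k])
first = cache.get("user:1")    # hits the DB, then populates the cache
second = cache.get("user:1")   # served from the cache until the TTL expires
```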

### CDN (Content Delivery Network)
- Geographically distributed servers for static content (images, CSS, JS, videos)
- User requests asset → CDN returns cached copy or fetches from origin → caches with TTL
- **Considerations**: Cost (cache only frequently accessed), TTL, CDN fallback, invalidation (API or versioning)

### Stateless Web Tier
- Move session data out of web servers into shared storage (Redis, Memcached, NoSQL)
- Any web server can handle any request → easy horizontal scaling
- Session data stored in a persistent shared data store

### Data Centers
- Multiple data centers for geo-routing (users routed to nearest DC)
- **Challenges**: Traffic redirection (GeoDNS), data synchronization, test/deployment across DCs

### Message Queue
- Durable component for async communication (producers → queue → consumers)
- Decouples components: producer and consumer scale independently
- Use when: tasks are time-consuming, components should be loosely coupled

### Logging, Metrics, Automation
- **Logging**: Per-server or aggregated (centralized tools)
- **Metrics**: Host-level (CPU, memory), aggregated (DB tier performance), business (DAU, retention)
- **Automation**: CI/CD, build automation, testing automation

### Database Sharding
- Split data across multiple databases by shard key
- **Shard key selection**: Choose a key that distributes data evenly
- **Challenges**: Resharding (consistent hashing), celebrity/hotspot problem, join/denormalization
- **Techniques**: Consistent hashing for distribution; denormalize to avoid cross-shard joins

---

## Ch 2: Back-of-the-Envelope Estimation

### Powers of 2
- 2^10 ≈ 1 Thousand (1 KB), 2^20 ≈ 1 Million (1 MB), 2^30 ≈ 1 Billion (1 GB)
- 2^40 ≈ 1 Trillion (1 TB), 2^50 ≈ 1 Quadrillion (1 PB)

### Latency Numbers Every Programmer Should Know
- L1 cache: 0.5 ns
- Branch mispredict: 5 ns
- L2 cache: 7 ns
- Mutex lock/unlock: 100 ns
- Main memory: 100 ns
- Compress 1KB with Zippy: 10,000 ns (10 μs)
- Send 2KB over 1 Gbps network: 20,000 ns (20 μs)
- SSD random read: 150,000 ns (150 μs)
- Read 1MB sequentially from memory: 250,000 ns (250 μs)
- Round trip within same datacenter: 500,000 ns (500 μs)
- Read 1MB sequentially from SSD: 1,000,000 ns (1 ms)
- Disk seek: 10,000,000 ns (10 ms)
- Read 1MB sequentially from disk: 30,000,000 ns (30 ms)
- Packet CA→Netherlands→CA: 150,000,000 ns (150 ms)

### Key Takeaways
- Memory is fast, disk is slow
- Avoid disk seeks if possible
- Simple compression algorithms are fast
- Compress data before sending over network
- Data centers are far; inter-DC round trips are expensive

### Availability Numbers
- 99% = 3.65 days/year downtime
- 99.9% = 8.77 hours/year
- 99.99% = 52.60 minutes/year
- 99.999% = 5.26 minutes/year

### Estimation Tips
- Round and approximate; precision not needed
- Write down assumptions
- Label units clearly
- Common estimates: QPS, peak QPS (2–5× average), storage, bandwidth, cache memory
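A typical estimate can be written out as plain arithmetic. The inputs below (DAU, posts per user, post size) are illustrative assumptions, not figures from the book:

```python
# Back-of-the-envelope estimate for a Twitter-like write path.
dau = 150_000_000              # daily active users (assumed)
posts_per_user_per_day = 2     # assumed
seconds_per_day = 24 * 3600    # 86,400; round to 100,000 when hand-calculating

write_qps = dau * posts_per_user_per_day / seconds_per_day
peak_write_qps = 2 * write_qps           # peak is roughly 2-5x average

avg_post_bytes = 300                     # assumed text-only post size
storage_per_day_gb = dau * posts_per_user_per_day * avg_post_bytes / 1e9

print(f"write QPS ~ {write_qps:,.0f}, peak ~ {peak_write_qps:,.0f}")
print(f"new storage/day ~ {storage_per_day_gb:,.0f} GB")
```

Labeling units and keeping the inputs visible makes the assumptions easy to challenge in an interview.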

---

## Ch 3: A Framework For System Design Interviews

### The 4-Step Process

**Step 1 — Understand the problem and establish design scope (3–10 min)**
- Ask clarifying questions: What features? How many users? Scale trajectory?
- Define functional requirements (what the system does)
- Define non-functional requirements (scale, latency, availability, consistency)
- Make back-of-envelope estimates

**Step 2 — Propose high-level design and get buy-in (10–15 min)**
- Draw component diagram: clients, servers, databases, caches, CDN, load balancers
- Define API endpoints (REST or similar)
- Sketch data flow through the system
- Get agreement before diving deeper

**Step 3 — Design deep dive (10–25 min)**
- Focus on the 2–3 most critical/interesting components
- Discuss trade-offs for each design decision
- Address non-functional requirements (scalability, consistency, availability)

**Step 4 — Wrap up (3–5 min)**
- Summarize the design
- Discuss error handling and edge cases
- Operational considerations: metrics, monitoring, alerts
- Future scaling and improvements

### Dos and Don'ts
- DO: ask for clarification, communicate your approach, suggest multiple approaches, design with the interviewer
- DON'T: jump into a solution, think in silence, ignore non-functional requirements

---

## Ch 4: Design A Rate Limiter

### Algorithms

**Token bucket**
- Bucket with fixed capacity; tokens added at a fixed rate; each request consumes a token
- Pros: easy to implement, memory efficient, allows burst traffic
- Cons: tuning bucket size and refill rate can be challenging
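The token bucket above can be sketched directly. A minimal single-process version, with the clock injectable for testing (capacity and refill rate here are illustrative):

```python
import time

class TokenBucket:
    """Fixed capacity of tokens, refilled at refill_rate tokens/second."""
    def __init__(self, capacity, refill_rate, now=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)   # start full: allows an initial burst
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=4, refill_rate=2)   # burst of 4, 2 req/s sustained
results = [bucket.allow() for _ in range(6)]
# With a real clock, the initial burst passes; later requests are rejected
# until enough tokens have been refilled.
```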

**Leaking bucket**
- Queue with fixed size; requests processed at a fixed rate; overflow rejected
- Pros: memory efficient, stable outflow rate
- Cons: a burst of traffic fills the queue; old requests may starve new ones

**Fixed window counter**
- Divide the timeline into fixed windows; counter per window; reject when counter > threshold
- Pros: memory efficient, simple
- Cons: a spike at window edges can allow 2× the rate

**Sliding window log**
- Keep a timestamp log of each request; count requests in the sliding window
- Pros: very accurate
- Cons: consumes lots of memory (stores all timestamps)

**Sliding window counter**
- Hybrid: fixed window counters + sliding window calculation
- Formula: requests in current window + requests in previous window × overlap percentage
- Pros: smooths traffic spikes, memory efficient
- Cons: approximation (assumes even distribution in the previous window)
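The formula is easiest to see with numbers. A worked example, assuming a 60 s window with a limit of 100 and a request arriving 30 s into the current window (all counts are illustrative):

```python
# Sliding-window-counter estimate at 30 s into a 60 s window.
limit = 100
prev_window_count = 80
curr_window_count = 40
overlap = 1 - 30 / 60     # 50% of the sliding window overlaps the previous one

estimated = curr_window_count + prev_window_count * overlap
allow = estimated <= limit   # 40 + 80 * 0.5 = 80 -> under the limit, allow
```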

### Architecture
- Rate limiting rules stored in configuration (usually on disk, cached in memory)
- Redis used for counters: INCR (increment) and EXPIRE (set TTL)
- Rate limiter middleware sits between the client and API servers
- HTTP 429 (Too Many Requests) returned when the rate is exceeded
- Headers: X-Ratelimit-Remaining, X-Ratelimit-Limit, X-Ratelimit-Retry-After
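The Redis INCR/EXPIRE pattern for a fixed-window counter can be simulated in memory. The sketch below uses a plain dict in place of Redis, and the `rate:{user}:{window}` key layout is an illustrative convention, not from the book:

```python
import time

class FixedWindowLimiter:
    def __init__(self, limit, window_seconds, now=time.time):
        self.limit = limit
        self.window = window_seconds
        self.now = now
        self.counters = {}                 # stands in for Redis keys

    def allow(self, user):
        window_id = int(self.now() // self.window)
        key = f"rate:{user}:{window_id}"   # a real key would EXPIRE with the window
        self.counters[key] = self.counters.get(key, 0) + 1   # INCR
        return self.counters[key] <= self.limit

# Fixed clock so the demo is deterministic.
limiter = FixedWindowLimiter(limit=3, window_seconds=60, now=lambda: 1000.0)
decisions = [limiter.allow("alice") for _ in range(5)]
# The first 3 requests in the window pass; the rest get HTTP 429.
```

With real Redis, INCR and EXPIRE would need to run atomically (e.g. in a Lua script) to avoid the race condition discussed below.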

### Distributed Challenges
- **Race condition**: Use a Redis Lua script or sorted set for atomic operations
- **Synchronization**: Centralized Redis store; or sticky sessions (not recommended)

---

## Ch 5: Design Consistent Hashing

### Problem
- Simple hash (key % N servers) causes massive redistribution when servers are added/removed

### Hash Ring
- Map servers and keys onto a circular hash space (0 to 2^160 - 1 for SHA-1)
- A key is assigned to the first server encountered going clockwise on the ring
- Adding/removing a server only affects keys in the adjacent segment

### Virtual Nodes
- Each real server maps to multiple virtual nodes on the ring
- Benefits: more even distribution, handles heterogeneous servers (more vnodes for powerful servers)
- Trade-off: more vnodes = better balance but more metadata to store

### Benefits
- Minimized key redistribution when servers change
- Horizontal scaling is straightforward
- Mitigates the hotspot problem with virtual nodes
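The ring with virtual nodes can be sketched with a sorted list and binary search. This sketch uses MD5 for brevity where the chapter describes SHA-1's 0 to 2^160 - 1 space; the mechanics are identical, and the vnode count is an illustrative choice:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []                       # sorted (hash, server) points
        for server in servers:
            for i in range(vnodes):
                h = self._hash(f"{server}#vnode{i}")
                self.ring.append((h, server))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        h = self._hash(key)
        # First point clockwise from h; wrap around to the start of the ring.
        idx = bisect.bisect(self.ring, (h,))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["s1", "s2", "s3"])
owner_before = ring.lookup("user:42")
ring_grown = HashRing(["s1", "s2", "s3", "s4"])
# Only keys falling in s4's new segments move; most keys keep their owner.
```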

---

## Ch 6: Design A Key-Value Store

### CAP Theorem
- **Consistency**: All nodes see the same data at the same time
- **Availability**: Every request gets a response
- **Partition tolerance**: System works despite network partitions
- Must choose 2 of 3: CP (consistency + partition tolerance), AP (availability + partition tolerance)
- In real distributed systems, partition tolerance is mandatory → choose between C and A

### Core Techniques

**Data Partitioning**
- Consistent hashing (Ch 5) to distribute data across nodes

**Data Replication**
- Replicate to N nodes (first N unique servers clockwise on the hash ring)
- N = 3 is typical

**Quorum Consensus**
- N = number of replicas, W = write quorum, R = read quorum
- W + R > N guarantees strong consistency (overlap ensures the latest value is read)
- W = 1, R = N → fast write, slow read
- W = N, R = 1 → slow write, fast read
- Typical: N=3, W=2, R=2
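The quorum condition is a one-line check, shown here with the configurations from the list above:

```python
# W + R > N guarantees the read set and write set overlap in at least one
# replica, so a read always sees the latest acknowledged write.
def is_strongly_consistent(n, w, r):
    return w + r > n

assert is_strongly_consistent(3, 2, 2)        # typical N=3, W=2, R=2
assert is_strongly_consistent(3, 1, 3)        # fast write, slow read
assert is_strongly_consistent(3, 3, 1)        # slow write, fast read
assert not is_strongly_consistent(3, 1, 1)    # no overlap: eventual only
```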

**Consistency Models**
- Strong: client always sees the most recent write
- Weak: subsequent reads may not see the most recent write
- Eventual: given enough time, all replicas converge

**Vector Clocks**
- [server, version] pairs to detect conflicts and causality
- Downside: complexity grows with many servers; prune based on a threshold
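The causality check behind vector clocks is short. A sketch using dicts of server → version (server names are illustrative):

```python
# Clock A "happened before" clock B iff every component of A is <= the
# corresponding component of B and at least one is strictly less.
def happened_before(a, b):
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

def in_conflict(a, b):
    # Neither is an ancestor of the other: concurrent writes, must reconcile.
    return not happened_before(a, b) and not happened_before(b, a) and a != b

assert happened_before({"s1": 1}, {"s1": 2})                  # clean ancestry
assert in_conflict({"s1": 2, "s2": 1}, {"s1": 1, "s2": 2})    # siblings
```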

**Handling Failures**
- **Failure detection**: Gossip protocol — each node periodically sends its heartbeat list to random nodes; if no heartbeat for a threshold period, the node is marked down
- **Sloppy quorum**: When not enough healthy nodes, use temporary nodes (hinted handoff)
- **Anti-entropy**: Merkle trees for efficient inconsistency detection and repair
- **Merkle tree**: Hash tree where leaves are hashes of data blocks; only differing branches need sync

### Write Path
- Write request → commit log → memory cache (memtable) → when memtable is full, flush to SSTable on disk

### Read Path
- Check memtable → if not found, check Bloom filter → read from SSTables

---

## Ch 7: Design A Unique ID Generator In Distributed Systems

### Approaches

**Multi-master replication**
- Auto-increment by k (number of servers): server 1 generates 1,3,5...; server 2 generates 2,4,6...
- Cons: hard to scale, IDs don't go up with time across servers, adding/removing servers breaks the scheme

**UUID**
- 128-bit number, extremely low collision probability
- Pros: simple, no coordination, scales independently per server
- Cons: 128 bits is long, not sortable by time, non-numeric

**Ticket server**
- Centralized auto-increment DB (Flickr approach)
- Pros: numeric, easy to implement, works for small/medium scale
- Cons: single point of failure (can use multiple but adds sync complexity)

**Twitter snowflake**
- 64-bit ID: 1 bit sign + 41 bits timestamp + 5 bits datacenter + 5 bits machine + 12 bits sequence
- Timestamp: milliseconds since a custom epoch; sortable by time
- Sequence: reset to 0 every millisecond; 4096 IDs per machine per millisecond
- Pros: 64-bit, time-sorted, distributed, high throughput
- This is the recommended approach for most use cases
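The snowflake layout is just bit packing. A sketch of ID assembly under the 41/5/5/12 split (the epoch and input values are illustrative; a real generator would also handle sequence rollover and clock drift):

```python
# Snowflake-style 64-bit ID: sign bit 0 | 41-bit timestamp | 5-bit datacenter
# | 5-bit machine | 12-bit sequence.
CUSTOM_EPOCH_MS = 1_288_834_974_657   # any fixed epoch works; this is illustrative

def make_id(timestamp_ms, datacenter_id, machine_id, sequence):
    assert 0 <= datacenter_id < 32 and 0 <= machine_id < 32 and 0 <= sequence < 4096
    t = timestamp_ms - CUSTOM_EPOCH_MS
    return (t << 22) | (datacenter_id << 17) | (machine_id << 12) | sequence

a = make_id(1_700_000_000_000, datacenter_id=1, machine_id=3, sequence=0)
b = make_id(1_700_000_000_001, datacenter_id=1, machine_id=3, sequence=0)
assert a < b   # a later millisecond always yields a larger ID: time-sorted
```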

---

## Ch 8: Design A URL Shortener

### API Design
- POST api/v1/data/shorten (longUrl) → shortUrl
- GET api/v1/shortUrl → 301/302 redirect to longUrl

### Redirect
- **301 (Permanent)**: Browser caches; reduces server load; less analytics
- **302 (Temporary)**: Every request hits the server; better for analytics/tracking

### Hash Approaches

**Hash + collision resolution**
- Apply a hash (CRC32, MD5, SHA-1) → take the first 7 characters → check the DB for a collision → append a predefined string if there is one → recheck
- A Bloom filter can speed up collision detection

**Base-62 conversion**
- Map an auto-increment ID to base-62 (0-9, a-z, A-Z)
- 7 characters = 62^7 ≈ 3.5 trillion URLs
- Pros: no collision, short URL length predictable from ID
- Cons: the next URL is predictable; depends on a unique ID generator
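Base-62 conversion is repeated division by 62. A sketch with the 0-9, a-z, A-Z alphabet in that order (alphabet ordering varies between implementations):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n):
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))   # most significant digit first

def from_base62(s):
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

slug = to_base62(11157)   # "2TX": 2*62^2 + 55*62 + 59 = 11157
```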

### URL Shortening Flow
- Input longURL → check if it exists in the DB → if yes, return the existing shortURL → if no, generate a new ID → convert to base-62 → store in the DB → return the shortURL

---

## Ch 9: Design A Web Crawler

### Components
- **Seed URLs**: Starting point; choose by topic locality or domain diversity
- **URL Frontier**: Queue managing which URLs to crawl next; handles politeness and priority
- **HTML Downloader**: Fetches page content; checks robots.txt (cached)
- **Content Parser**: Validates and parses HTML
- **Content Seen?**: Dedup with hash comparison (content fingerprint)
- **URL Seen?**: Bloom filter or hash table to avoid recrawling
- **URL Storage**: Already-visited URLs stored
- **Link Extractor**: Extracts URLs from HTML; converts relative to absolute

### URL Frontier Design
- **Politeness**: Separate queue per host; only one worker per host; download delay between requests
- **Priority**: URL prioritizer ranks URLs by PageRank, update frequency, freshness; feeds into priority queues
- **Freshness**: Recrawl based on update history and page importance
- **Storage**: Mostly on disk with an in-memory buffer for enqueue/dequeue

### Robustness
- Consistent hashing to distribute load across crawlers
- Save crawl state for recovery
- Exception handling for malformed HTML
- Anti-spam: blacklists, content validation

### Problematic Content
- Redundant content: dedup via content hashing
- Spider traps: set a max URL depth
- Data noise: exclude ads, spam, etc.

---

## Ch 10: Design A Notification System

### Notification Types
- **iOS Push**: Provider → APNs (Apple Push Notification Service) → iOS device
- **Android Push**: Provider → FCM (Firebase Cloud Messaging) → Android device
- **SMS**: Provider → SMS service (Twilio, Nexmo) → phone
- **Email**: Provider → email service (Mailchimp, SendGrid) → email

### Contact Info Gathering
- Collect device tokens, phone numbers, email addresses during signup/app install
- Store in a contact_info table linked to user_id

### High-Level Design
- Services (1 to N) → Notification system → Third-party services → Devices
- Components: notification servers, cache, DB, message queues (one per notification type), workers

### Reliability
- **Notification log**: Persist notifications in the DB for retry on failure
- **Deduplication**: Check event_id before sending to avoid duplicates
- **Retry mechanism**: Workers retry failed notifications with exponential backoff
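The backoff schedule workers use can be sketched in a few lines. The base delay, cap, and ±20% jitter below are illustrative choices; real workers would also re-queue the notification between attempts:

```python
import random

def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, attempts=5):
    """Delay before retry n: base * factor**n, capped, with +/-20% jitter.

    Jitter prevents a burst of failures from retrying in lockstep.
    """
    delays = []
    for n in range(attempts):
        d = min(max_delay, base * factor ** n)
        delays.append(d * random.uniform(0.8, 1.2))
    return delays

print(backoff_delays())   # roughly 1, 2, 4, 8, 16 seconds, jittered
```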

### Additional Features
- **Notification template**: Reusable templates with parameters for consistency
- **Rate limiting**: Cap notifications per user to prevent overload
- **Monitoring**: Track queued notification count; set alerts for anomalies
- **Analytics service**: Track open rate, click rate, engagement per notification type
- **User settings**: Opt-in/opt-out per notification channel; settings stored in the DB

---

## Ch 11: Design A News Feed System

### Two Sub-Problems
1. **Feed publishing**: User posts content → system stores and distributes it to friends' feeds
2. **Newsfeed building**: Aggregate friends' posts in reverse chronological order

### Feed Publishing
- Web servers: authentication, rate limiting
- Fanout service: distribute the post to friends' news feeds
- Notification service: inform friends of new content

### Fanout Models

**Fanout on write (push model)**
- Post → immediately write to all friends' caches
- Pros: real-time, fast read (pre-computed)
- Cons: slow for users with many friends (celebrity problem); wasted resources for inactive users

**Fanout on read (pull model)**
- News feed built on-the-fly when the user requests it
- Pros: no wasted writes for inactive users; no celebrity problem
- Cons: slow reads (fetch and merge at read time)

**Hybrid approach (recommended)**
- Push for normal users (fast); pull for celebrities (avoids fan-out explosion)
- Reduces write amplification while keeping reads fast for most users

### Cache Architecture (5 tiers)
- **News Feed**: pre-computed feed per user (feed IDs)
- **Content**: post data, indexed by post ID
- **Social Graph**: follower/following relationships
- **Action**: liked, replied, shared status per post per user
- **Counters**: likes count, replies count, followers count

---

## Ch 12: Design A Chat System

### Communication Protocols
- **Polling**: Client periodically asks the server for new messages; wasteful if there are none
- **Long polling**: Client holds a connection open until the server has a new message or a timeout; the server that receives a message may not be the one holding the recipient's connection
- **WebSocket**: Full-duplex persistent connection; client and server send messages anytime; ideal for chat

### High-Level Design
- **Stateless services**: Login, signup, user profile (behind a load balancer)
- **Stateful service**: Chat servers maintain persistent WebSocket connections
- **Third-party integration**: Push notifications for offline users
- **Service discovery**: Apache Zookeeper recommends the best chat server based on criteria (geo, capacity)

### Storage
- Generic data (user profiles, settings): relational DB with replication/sharding
- Chat history: key-value store (HBase recommended) — write-heavy, sequential access, no random access needed
- **Message table (1-on-1)**: message_id (bigint), message_from (bigint), message_to (bigint), content (text), created_at (timestamp)
- **Message table (group)**: channel_id + message_id as composite key; channel_id is the partition key

### Message ID
- Must be unique and sortable by time within the same group/channel
- Approach: local auto-increment per channel (using the key-value store's increment); or snowflake-like

### Message Flows
- **1-on-1 send**: User A → Chat server 1 → Message sync queue → Chat server 2 → User B
- **Message sync**: Each device has cur_max_message_id; fetch messages where id > cur_max_message_id
- **Small group**: Message copied to each member's message sync queue
- **Large group**: On-demand pull (fanout on read)

### Online Presence
- **Login**: Set status to online in presence servers (via WebSocket)
- **Logout**: Set status to offline
- **Disconnection**: Heartbeat mechanism; if no heartbeat for X seconds → offline
- **Status fanout**: Presence change → publish to a channel → friends subscribed to the channel receive the update
- For large groups, fetch presence only when the user opens the group or manually refreshes

---

## Ch 13: Design A Search Autocomplete System

### Trie Data Structure
- Tree structure where each node stores a character; path from root = prefix
- Nodes store: character, children map, frequency/popularity counter, top-k cached queries
- **Search**: traverse the trie by prefix → collect all descendants → sort by frequency → return top k
- **Optimization**: Cache the top k results at each node to avoid traversal
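The per-node top-k cache can be sketched directly. In this minimal version (class names, k, and the sample queries are illustrative), `insert` updates the cached list at every prefix node, so `suggest` never traverses descendants:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_k = []            # cached [(count, query)] for this prefix

class AutocompleteTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query, count):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Refresh this prefix's cached top-k with the new (count, query).
            entries = [e for e in node.top_k if e[1] != query] + [(count, query)]
            node.top_k = sorted(entries, reverse=True)[:self.k]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in node.top_k]   # O(len(prefix)), no subtree walk

trie = AutocompleteTrie(k=2)
for q, c in [("beer", 10), ("best", 35), ("bet", 29)]:
    trie.insert(q, c)
suggestions = trie.suggest("be")   # top 2 by frequency: best (35), bet (29)
```

This trades extra write work (every insert touches each prefix node) for constant-time reads, which matches the read-heavy autocomplete workload.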

### Data Gathering Service
- **Analytics logs**: Record every search query with a timestamp
- **Aggregators**: Aggregate query frequency (weekly or real-time depending on use case)
- **Workers**: Build/update the trie from aggregated data
- **Trie cache**: In-memory trie for fast lookups; weekly snapshot
- **Trie DB**: Persistent storage — document store (serialize the trie) or key-value (each prefix = key)

### Query Service
- User types → request sent to the query service → trie cache lookup → return top suggestions
- **Optimizations**: AJAX requests (no full page reload), browser caching (autocomplete suggestions cached with TTL ~1 hour), data sampling (log only 1 in N queries to reduce volume)

### Trie Operations
- **Create**: Build from aggregated data (offline, weekly)
- **Update**: Option 1: Rebuild weekly (recommended); Option 2: Update individual nodes in place
- **Delete**: Filter layer removes hateful, violent, explicit, or dangerous suggestions before returning results (don't modify the trie directly)

### Scaling
- **Sharding**: By first character (uneven — 's' is much larger than 'z'); smarter: shard by frequency-based analysis for even distribution
- **Multi-language**: Unicode support in trie nodes; separate tries per language or unified with locale metadata

---

## Ch 14: Design YouTube

### Requirements
- Upload videos, smooth streaming, quality change, low infrastructure cost, mobile + web
- Estimated: 5M DAU, 10% upload daily, average video 300MB → 150TB new storage/day

### Video Uploading Flow
1. User uploads via parallel chunking to original storage (S3-like)
2. Transcoding servers process the video (multiple resolutions, codecs)
3. Transcoded videos stored in transcoded storage
4. CDN caches and serves popular videos
5. Completion queue + handler updates the metadata DB and cache
6. Metadata API servers handle title, description, comments, etc.

### Streaming Protocols
- **MPEG-DASH** (Dynamic Adaptive Streaming over HTTP)
- **Apple HLS** (HTTP Live Streaming)
- **Microsoft Smooth Streaming**
- **Adobe HTTP Dynamic Streaming**
- All use adaptive bitrate: client monitors bandwidth → requests the appropriate quality segment

### Video Transcoding
- **Why**: Compatibility (different devices/codecs), bandwidth adaptation (mobile vs. desktop), multiple resolutions
- **Bitrate types**: Constant bitrate (CBR), Variable bitrate (VBR)
- **DAG model**: Video split into video/audio streams → parallel processing → encoded → merged

### Transcoding Architecture
- **Preprocessor**: Video splitting, DAG generation, cache check
- **DAG scheduler**: Splits the DAG into stages, puts tasks in the task queue
- **Resource manager**: Manages task queue, worker queue, running queue; optimal task-worker assignment
- **Task workers**: Execute encoding tasks (defined in the DAG)
- **Temporary storage**: Metadata in memory, video/audio in blob storage; freed after encoding
- **Encoded video**: Final output sent to the CDN

### System Optimizations
- **Speed**: Parallel uploading (split video into chunks), upload centers close to users, parallelism in the transcoding pipeline, pre-signed upload URLs
- **Safety**: Pre-signed URLs (only authorized users upload), DRM, AES encryption, visual watermarking
- **Cost**: Serve popular videos from the CDN; less popular from high-capacity storage servers; short/unpopular videos encoded on demand; some regions don't need a CDN (serve from origin); partner with ISPs

### Error Handling
- **Recoverable errors**: Retry with exponential backoff (e.g., transcode segment failure)
- **Non-recoverable errors**: Stop the task, return an error code, log for investigation (e.g., malformed video)

---

## Ch 15: Design Google Drive

### Features
- File upload/download, file sync across devices, notifications, reliability, fast sync, low bandwidth, high scalability/availability

### APIs
- **Upload**: Simple upload (small files), Resumable upload (large files — init → get resumable URI → upload chunks → resume on failure)
- **Download**: GET by file path
- **Get revisions**: GET revision history by file path and limit

### High-Level Design
- **Block servers**: Split files into blocks, delta sync (only changed blocks), compression (gzip)
- **Cloud storage**: S3 or equivalent for file blocks
- **Cold storage**: For inactive/archived files (Amazon Glacier)
- **Load balancer**: Distribute requests across API servers
- **API servers**: User/auth management, file metadata CRUD
- **Metadata DB**: File metadata, user info, block info, versioning
- **Metadata cache**: Cache frequently accessed metadata
- **Notification service**: Long polling — client holds a connection, notified of changes; event bus for internal change distribution
- **Offline backup queue**: Queue sync changes for when clients come back online

### Block Server Design
- File split into blocks (e.g., Dropbox max 4MB blocks)
- **Delta sync**: Only sync changed blocks (not the entire file)
- **Deduplication**: Hash each block; skip upload if the hash already exists
- **Compression**: gzip or bzip2 to reduce transfer size
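Delta sync plus dedup reduces to "hash each block, upload only unseen hashes". A sketch under assumptions: fixed-size blocks (4 KB here to keep the demo light, where Dropbox uses 4 MB) and SHA-256 as an illustrative hash choice:

```python
import hashlib

BLOCK_SIZE = 4 * 1024   # demo size; Dropbox uses up to 4 MB per block

def split_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def blocks_to_upload(data, known_hashes):
    """Return (hash, block) pairs the server does not already store."""
    out = []
    for block in split_blocks(data):
        h = hashlib.sha256(block).hexdigest()
        if h not in known_hashes:
            out.append((h, block))
    return out

v1 = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
server = {h for h, _ in blocks_to_upload(v1, set())}   # initial upload: 2 blocks
v2 = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE             # only the 2nd block changed
delta = blocks_to_upload(v2, server)                   # delta sync: 1 block
```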

### Metadata Database Schema
- **User**: name, email, profile_photo
- **Device**: device_id, user_id, last_logged_in
- **Namespace** (workspace/root folder): id, account_id, email
- **File**: id, filename, path, namespace_id, latest_version, is_directory
- **File_version**: id, file_id, device_id, version_number, last_modified
- **Block**: id, file_version_id, block_order, block_hash, s3_object_key

### Sync Flows
- **Upload**: Client → Block servers (delta sync) → Cloud storage; Client → API servers → Metadata DB; notification service informs other clients
- **Download**: Client notified of change (long polling) → Request metadata → Download changed blocks from cloud storage
- **Conflict resolution**: First version wins; the later version is saved as a conflict copy for manual merge

### Failure Handling
- **Load balancer**: Heartbeat monitoring; redirect traffic if one fails
- **Block server**: Other servers pick up unfinished jobs
- **Cloud storage**: S3 multi-region replication
- **API server**: Stateless; LB redirects to healthy instances
- **Metadata cache**: Replica servers; a new server replaces a failed one
- **Metadata DB**: Master-slave; promote a slave if the master fails
- **Notification service**: Long polling → client reconnects to a different server
- **Offline backup queue**: Multiple replicas for durability

---

## Ch 16: The Learning Continues

### Key Takeaways
- Real-world systems are far more complex than interview designs
- Focus on fundamentals: scaling, caching, replication, partitioning, consistency models
- Practice estimation and trade-off analysis regularly
- Learn from real-world systems by studying company engineering blogs: Facebook, Google, Netflix, Uber, Twitter, Airbnb, etc.