@booklib/skills 1.2.0 → 1.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CONTRIBUTING.md +122 -0
- package/README.md +20 -2
- package/ROADMAP.md +36 -0
- package/animation-at-work/evals/evals.json +44 -0
- package/animation-at-work/examples/after.md +64 -0
- package/animation-at-work/examples/before.md +35 -0
- package/animation-at-work/scripts/audit_animations.py +295 -0
- package/bin/skills.js +552 -42
- package/clean-code-reviewer/SKILL.md +109 -1
- package/clean-code-reviewer/evals/evals.json +121 -3
- package/clean-code-reviewer/examples/after.md +48 -0
- package/clean-code-reviewer/examples/before.md +33 -0
- package/clean-code-reviewer/references/api_reference.md +158 -0
- package/clean-code-reviewer/references/practices-catalog.md +282 -0
- package/clean-code-reviewer/references/review-checklist.md +254 -0
- package/clean-code-reviewer/scripts/pre-review.py +206 -0
- package/data-intensive-patterns/evals/evals.json +43 -0
- package/data-intensive-patterns/examples/after.md +61 -0
- package/data-intensive-patterns/examples/before.md +38 -0
- package/data-intensive-patterns/scripts/adr.py +213 -0
- package/data-pipelines/evals/evals.json +45 -0
- package/data-pipelines/examples/after.md +97 -0
- package/data-pipelines/examples/before.md +37 -0
- package/data-pipelines/scripts/new_pipeline.py +444 -0
- package/design-patterns/evals/evals.json +46 -0
- package/design-patterns/examples/after.md +52 -0
- package/design-patterns/examples/before.md +29 -0
- package/design-patterns/scripts/scaffold.py +807 -0
- package/domain-driven-design/SKILL.md +120 -0
- package/domain-driven-design/evals/evals.json +48 -0
- package/domain-driven-design/examples/after.md +80 -0
- package/domain-driven-design/examples/before.md +43 -0
- package/domain-driven-design/scripts/scaffold.py +421 -0
- package/effective-java/evals/evals.json +46 -0
- package/effective-java/examples/after.md +83 -0
- package/effective-java/examples/before.md +37 -0
- package/effective-java/scripts/checkstyle_setup.py +211 -0
- package/effective-kotlin/evals/evals.json +45 -0
- package/effective-kotlin/examples/after.md +36 -0
- package/effective-kotlin/examples/before.md +38 -0
- package/effective-python/evals/evals.json +44 -0
- package/effective-python/examples/after.md +56 -0
- package/effective-python/examples/before.md +40 -0
- package/effective-python/references/api_reference.md +218 -0
- package/effective-python/references/practices-catalog.md +483 -0
- package/effective-python/references/review-checklist.md +190 -0
- package/effective-python/scripts/lint.py +173 -0
- package/kotlin-in-action/evals/evals.json +43 -0
- package/kotlin-in-action/examples/after.md +53 -0
- package/kotlin-in-action/examples/before.md +39 -0
- package/kotlin-in-action/scripts/setup_detekt.py +224 -0
- package/lean-startup/evals/evals.json +43 -0
- package/lean-startup/examples/after.md +80 -0
- package/lean-startup/examples/before.md +34 -0
- package/lean-startup/scripts/new_experiment.py +286 -0
- package/microservices-patterns/SKILL.md +140 -0
- package/microservices-patterns/evals/evals.json +45 -0
- package/microservices-patterns/examples/after.md +69 -0
- package/microservices-patterns/examples/before.md +40 -0
- package/microservices-patterns/scripts/new_service.py +583 -0
- package/package.json +2 -8
- package/refactoring-ui/evals/evals.json +45 -0
- package/refactoring-ui/examples/after.md +85 -0
- package/refactoring-ui/examples/before.md +58 -0
- package/refactoring-ui/scripts/audit_css.py +250 -0
- package/skill-router/SKILL.md +142 -0
- package/skill-router/evals/evals.json +38 -0
- package/skill-router/examples/after.md +63 -0
- package/skill-router/examples/before.md +39 -0
- package/skill-router/references/api_reference.md +24 -0
- package/skill-router/references/routing-heuristics.md +89 -0
- package/skill-router/references/skill-catalog.md +156 -0
- package/skill-router/scripts/route.py +266 -0
- package/storytelling-with-data/evals/evals.json +47 -0
- package/storytelling-with-data/examples/after.md +50 -0
- package/storytelling-with-data/examples/before.md +33 -0
- package/storytelling-with-data/scripts/chart_review.py +301 -0
- package/system-design-interview/evals/evals.json +45 -0
- package/system-design-interview/examples/after.md +94 -0
- package/system-design-interview/examples/before.md +27 -0
- package/system-design-interview/scripts/new_design.py +421 -0
- package/using-asyncio-python/evals/evals.json +43 -0
- package/using-asyncio-python/examples/after.md +68 -0
- package/using-asyncio-python/examples/before.md +39 -0
- package/using-asyncio-python/scripts/check_blocking.py +270 -0
- package/web-scraping-python/evals/evals.json +46 -0
- package/web-scraping-python/examples/after.md +109 -0
- package/web-scraping-python/examples/before.md +40 -0
- package/web-scraping-python/scripts/new_scraper.py +231 -0
- /package/{effective-python-skill → effective-python}/SKILL.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-01-pythonic-thinking.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-02-lists-and-dicts.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-03-functions.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-04-comprehensions-generators.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-05-classes-interfaces.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-06-metaclasses-attributes.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-07-concurrency.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-08-robustness-performance.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-09-testing-debugging.md +0 -0
- /package/{effective-python-skill → effective-python}/ref-10-collaboration.md +0 -0
package/clean-code-reviewer/scripts/pre-review.py
@@ -0,0 +1,206 @@
+#!/usr/bin/env python3
+"""
+pre-review.py — Pre-analysis script for Clean Code reviews.
+Usage: python pre-review.py <file>
+
+Produces a structured report covering file stats, long functions, deep nesting,
+argument count violations, and linter output — ready to feed an agent as context.
+"""
+
+import ast
+import os
+import subprocess
+import sys
+from pathlib import Path
+
+
+def detect_language(path: Path) -> str:
+    return {
+        ".py": "python",
+        ".js": "javascript",
+        ".ts": "typescript",
+        ".java": "java",
+        ".go": "go",
+        ".rb": "ruby",
+        ".rs": "rust",
+    }.get(path.suffix.lower(), "unknown")
+
+
+def count_lines(source: str) -> int:
+    return len(source.splitlines())
+
+
+def measure_nesting_depth(node: ast.AST, depth: int = 0) -> int:
+    nesting_nodes = (
+        ast.If, ast.For, ast.While, ast.With, ast.Try,
+        ast.ExceptHandler, ast.AsyncFor, ast.AsyncWith,
+    )
+    max_depth = depth
+    for child in ast.iter_child_nodes(node):
+        if isinstance(child, nesting_nodes):
+            max_depth = max(max_depth, measure_nesting_depth(child, depth + 1))
+        else:
+            max_depth = max(max_depth, measure_nesting_depth(child, depth))
+    return max_depth
+
+
+def analyze_python_ast(source: str):
+    """Return function/class stats using AST. Returns list of dicts."""
+    try:
+        tree = ast.parse(source)
+    except SyntaxError as exc:
+        return None, f"AST parse failed: {exc}"
+
+    lines = source.splitlines()
+    results = []
+
+    for node in ast.walk(tree):
+        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
+            continue
+
+        kind = "class" if isinstance(node, ast.ClassDef) else "function"
+        start = node.lineno
+        end = node.end_lineno if hasattr(node, "end_lineno") else start
+        length = end - start + 1
+
+        arg_count = 0
+        nesting = 0
+        if kind == "function":
+            args = node.args
+            arg_count = (
+                len(args.args)
+                + len(args.posonlyargs)
+                + len(args.kwonlyargs)
+                + (1 if args.vararg else 0)
+                + (1 if args.kwarg else 0)
+            )
+            # Don't count 'self' / 'cls'
+            first = args.posonlyargs[0].arg if args.posonlyargs else (args.args[0].arg if args.args else None)
+            if first in ("self", "cls"):
+                arg_count = max(0, arg_count - 1)
+            nesting = measure_nesting_depth(node)
+
+        results.append({
+            "kind": kind,
+            "name": node.name,
+            "start": start,
+            "end": end,
+            "length": length,
+            "arg_count": arg_count,
+            "nesting": nesting,
+        })
+
+    return results, None
+
+
+def run_ruff(filepath: Path):
+    """Run ruff on the file; return (output_lines, error_message)."""
+    try:
+        result = subprocess.run(
+            ["ruff", "check", "--output-format", "concise", str(filepath)],
+            capture_output=True, text=True, timeout=30,
+        )
+        output = (result.stdout + result.stderr).strip()
+        return output.splitlines() if output else [], None
+    except FileNotFoundError:
+        return [], "ruff not installed (pip install ruff)"
+    except subprocess.TimeoutExpired:
+        return [], "ruff timed out"
+
+
+def separator(char="-", width=70):
+    return char * width
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: python pre-review.py <file>")
+        sys.exit(1)
+
+    filepath = Path(sys.argv[1])
+    if not filepath.exists():
+        print(f"Error: file not found: {filepath}")
+        sys.exit(1)
+
+    source = filepath.read_text(encoding="utf-8", errors="replace")
+    language = detect_language(filepath)
+    total_lines = count_lines(source)
+    file_size = filepath.stat().st_size
+
+    print(separator("="))
+    print("CLEAN CODE PRE-REVIEW REPORT")
+    print(separator("="))
+    print(f"File     : {filepath}")
+    print(f"Language : {language}")
+    print(f"Size     : {file_size:,} bytes | {total_lines} lines")
+    print()
+
+    # --- AST analysis (Python only) ---
+    if language == "python":
+        print(separator())
+        print("FUNCTION / CLASS ANALYSIS (AST)")
+        print(separator())
+        items, err = analyze_python_ast(source)
+        if err:
+            print(f"  Warning: {err}")
+        elif items:
+            long_fns = [i for i in items if i["kind"] == "function" and i["length"] > 20]
+            deep_fns = [i for i in items if i["kind"] == "function" and i["nesting"] >= 3]
+            many_args = [i for i in items if i["kind"] == "function" and i["arg_count"] > 3]
+
+            print(f"  Total functions : {sum(1 for i in items if i['kind'] == 'function')}")
+            print(f"  Total classes   : {sum(1 for i in items if i['kind'] == 'class')}")
+            print()
+
+            if long_fns:
+                print("  [!] LONG FUNCTIONS (>20 lines) — Clean Code: functions should do one thing")
+                for fn in long_fns:
+                    print(f"      {fn['name']}() lines {fn['start']}-{fn['end']} ({fn['length']} lines)")
+            else:
+                print("  [OK] No functions exceed 20 lines.")
+
+            print()
+            if deep_fns:
+                print("  [!] DEEP NESTING (>=3 levels) — consider early returns or extraction")
+                for fn in deep_fns:
+                    print(f"      {fn['name']}() line {fn['start']} (max nesting: {fn['nesting']})")
+            else:
+                print("  [OK] No functions have excessive nesting depth.")
+
+            print()
+            if many_args:
+                print("  [!] TOO MANY ARGUMENTS (>3) — Clean Code: prefer parameter objects")
+                for fn in many_args:
+                    print(f"      {fn['name']}() line {fn['start']} ({fn['arg_count']} args)")
+            else:
+                print("  [OK] All functions have 3 or fewer arguments.")
+        else:
+            print("  No functions or classes found.")
+        print()
+
+    # --- Linter output ---
+    print(separator())
+    if language == "python":
+        print("RUFF LINTER OUTPUT")
+        print(separator())
+        ruff_lines, ruff_err = run_ruff(filepath)
+        if ruff_err:
+            print(f"  Note: {ruff_err}")
+        elif ruff_lines:
+            for line in ruff_lines:
+                print(f"  {line}")
+        else:
+            print("  [OK] ruff found no issues.")
+    else:
+        print("LINTER")
+        print(separator())
+        print(f"  Automated linting not configured for '{language}'. Run language-specific tools manually.")
+
+    print()
+    print(separator("="))
+    print("END OF PRE-REVIEW REPORT")
+    print(separator("="))
+
+
+if __name__ == "__main__":
+    main()
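The nesting-depth heuristic in the script above can be exercised on its own; a minimal sketch (same recursive walk as `measure_nesting_depth`, with a made-up sample function, not taken from the package's tests):

```python
import ast

# Control-flow nodes add one nesting level; other children inherit the
# parent's depth, mirroring pre-review.py's measure_nesting_depth.
NESTING_NODES = (
    ast.If, ast.For, ast.While, ast.With, ast.Try,
    ast.ExceptHandler, ast.AsyncFor, ast.AsyncWith,
)

def measure_nesting_depth(node, depth=0):
    max_depth = depth
    for child in ast.iter_child_nodes(node):
        bump = 1 if isinstance(child, NESTING_NODES) else 0
        max_depth = max(max_depth, measure_nesting_depth(child, depth + bump))
    return max_depth

sample = """
def deep(x):
    for i in x:          # level 1
        if i > 0:        # level 2
            while i:     # level 3
                i -= 1
"""
fn = ast.parse(sample).body[0]
print(measure_nesting_depth(fn))  # → 3, so the script's >=3 check would flag it
```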
package/data-intensive-patterns/evals/evals.json
@@ -0,0 +1,43 @@
+{
+  "evals": [
+    {
+      "id": "eval-01-synchronous-rest-no-event-log",
+      "prompt": "Review this microservices architecture description and pseudo-code:\n\n```\nSystem: E-commerce order processing\nServices: OrderService, InventoryService, PaymentService, NotificationService\n\nFlow (all synchronous REST over HTTP):\n\n# OrderService.place_order()\nPOST /orders → validates cart\n → GET /inventory/{sku} (blocks on InventoryService response)\n → POST /inventory/reserve (blocks, decrements stock)\n → POST /payments/charge (blocks on PaymentService)\n → POST /notifications/send (blocks on NotificationService)\n → INSERT orders (status='CONFIRMED') into orders_db\n → return 201\n\n# If PaymentService times out:\n# - inventory already reserved, order not yet written\n# - caller gets 504, retries POST /orders\n# - inventory reserved again (double reservation)\n\n# InventoryService\nGET /inventory/{sku} → SELECT stock FROM inventory WHERE sku=?\nPOST /inventory/reserve → UPDATE inventory SET stock=stock-qty WHERE sku=?\n\n# Shared database: orders_db is directly accessed by both OrderService\n# and InventoryService for reporting queries\n```",
+      "expectations": [
+        "Flags the fully synchronous REST chain as a distributed systems anti-pattern: a failure or timeout in any downstream service leaves the system in an inconsistent state (Ch 8: partial failures, timeouts, retries)",
+        "Identifies the double-reservation bug as a direct consequence of no idempotency on POST /inventory/reserve; recommends idempotency keys on all mutating endpoints (Ch 8: idempotent operations everywhere)",
+        "Flags the absence of an event log or write-ahead log: there is no source of truth to replay from if a service crashes mid-flow (Ch 11: event sourcing, log as source of truth)",
+        "Flags shared database access between OrderService and InventoryService as shared mutable state across services, violating service autonomy (Ch 12: shared mutable state across services anti-pattern)",
+        "Recommends replacing the synchronous chain with an event-driven approach: publish an OrderPlaced event, have downstream services react asynchronously (Ch 11: stream processing, event-driven)",
+        "Recommends the transactional outbox pattern to atomically write the order and publish the event in one local transaction (Ch 11: CDC, transactional outbox)",
+        "Notes that NotificationService is particularly ill-suited for the synchronous chain since a notification failure should not roll back a payment"
+      ]
+    },
+    {
+      "id": "eval-02-schema-without-access-patterns",
+      "prompt": "Review this database schema design:\n\n```sql\n-- Proposed schema for a social media analytics platform\n-- Requirements: show user activity feeds, compute engagement scores,\n-- support time-range queries on posts, and generate daily digest emails\n\nCREATE TABLE users (\n  id BIGINT PRIMARY KEY,\n  username VARCHAR(50),\n  email VARCHAR(255),\n  created_at TIMESTAMP\n);\n\nCREATE TABLE posts (\n  id BIGINT PRIMARY KEY,\n  user_id BIGINT REFERENCES users(id),\n  content TEXT,\n  created_at TIMESTAMP\n);\n\nCREATE TABLE events (\n  id BIGINT PRIMARY KEY,\n  post_id BIGINT REFERENCES posts(id),\n  user_id BIGINT REFERENCES users(id),\n  event_type VARCHAR(20), -- 'like', 'comment', 'share', 'view'\n  occurred_at TIMESTAMP\n);\n\n-- All analytics queries will be run against these three tables via JOINs\n-- Indexes: only the primary keys above\n```",
+      "expectations": [
+        "Flags the absence of secondary indexes for the primary access patterns: time-range queries on posts (created_at) and per-user activity (user_id + created_at) will require full table scans (Ch 3: indexing strategies, B-tree vs LSM-tree, DDIA Chapter 3)",
+        "Identifies the schema is designed around normalization, not around read patterns; for analytics (read-heavy) workloads, this forces expensive multi-table JOINs on every query (Ch 2: document model vs relational model, denormalization for read-heavy workloads)",
+        "Flags that the `events` table mixing four different event types in one table with no partitioning will become a hot write target and a slow scan table as it grows (Ch 6: partitioning to spread load)",
+        "Notes there is no derived/materialized view for engagement scores, meaning every score computation re-scans all events; recommends pre-computed aggregates or a materialized view updated via CDC (Ch 11: derived data, CQRS)",
+        "Flags that the schema has no time-based partitioning on the `events` table despite time-range queries being a stated requirement (Ch 6: partitioning by key range for range scan efficiency)",
+        "Recommends separating the OLTP write path from the OLAP analytics read path, using CDC or batch export to feed an analytics store (Ch 10: batch processing, OLTP vs OLAP separation)"
+      ]
+    },
+    {
+      "id": "eval-03-well-designed-event-sourced-system",
+      "prompt": "Review this event-sourced order system design:\n\n```python\n# Event definitions\n@dataclass(frozen=True)\nclass OrderPlaced:\n    order_id: str\n    customer_id: str\n    items: tuple  # immutable list of (sku, qty, price)\n    occurred_at: datetime\n    event_id: str = field(default_factory=lambda: str(uuid4()))\n\n@dataclass(frozen=True)\nclass OrderShipped:\n    order_id: str\n    tracking_number: str\n    occurred_at: datetime\n    event_id: str = field(default_factory=lambda: str(uuid4()))\n\n@dataclass(frozen=True)\nclass OrderCancelled:\n    order_id: str\n    reason: str\n    occurred_at: datetime\n    event_id: str = field(default_factory=lambda: str(uuid4()))\n\n# Append-only event store\nclass EventStore:\n    def append(self, stream_id: str, events: list, expected_version: int) -> None:\n        \"\"\"Append events with optimistic concurrency check.\"\"\"\n        ...\n\n    def load(self, stream_id: str) -> list:\n        \"\"\"Load all events for a stream in order.\"\"\"\n        ...\n\n# Aggregate rebuilt from events\nclass Order:\n    def __init__(self):\n        self.status = None\n        self.items = []\n        self._version = 0\n\n    @classmethod\n    def from_events(cls, events: list) -> 'Order':\n        order = cls()\n        for event in events:\n            order._apply(event)\n        return order\n\n    def _apply(self, event):\n        match event:\n            case OrderPlaced(items=items):\n                self.status = 'placed'\n                self.items = list(items)\n            case OrderShipped():\n                self.status = 'shipped'\n            case OrderCancelled():\n                self.status = 'cancelled'\n        self._version += 1\n\n# Idempotent consumer for search index projection\nclass SearchIndexProjection:\n    def handle(self, event, event_id: str) -> None:\n        if self._already_processed(event_id):\n            return\n        # update search index\n        self._mark_processed(event_id)\n```",
+      "expectations": [
+        "Recognizes this is a well-designed event-sourced system and says so explicitly",
+        "Praises the append-only event store with optimistic concurrency control via `expected_version` preventing lost updates (Ch 7: transaction isolation, write conflicts)",
+        "Praises rebuilding aggregate state from events via `from_events` — the event log is the source of truth (Ch 11: event sourcing, log-centric architecture)",
+        "Praises frozen dataclasses for events ensuring immutability, which is correct for an append-only log (Ch 11: immutable events)",
+        "Praises the idempotent consumer with deduplication by `event_id` in `SearchIndexProjection` making the projection safe to replay (Ch 11: idempotent consumers, exactly-once semantics)",
+        "Praises the separation of the write model (Order aggregate) from the read model (SearchIndexProjection) as CQRS (Ch 11: CQRS, derived data)",
+        "Does NOT manufacture issues to appear thorough; any suggestions are explicitly framed as minor optional improvements",
+        "May suggest snapshotting for long-lived streams as a performance optimization, but frames it as a future concern, not a current violation"
+      ]
+    }
+  ]
+}
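The double-reservation expectation in eval-01 hinges on idempotency keys for mutating endpoints; a minimal illustrative sketch of the idea (in-memory store and hypothetical names, not code from the package):

```python
class InventoryService:
    """Reserve stock at most once per idempotency key (illustrative sketch)."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self._seen = {}  # idempotency_key -> previously returned result

    def reserve(self, idempotency_key, sku, qty):
        # A retried request replays the stored outcome instead of
        # decrementing stock a second time.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        ok = self.stock.get(sku, 0) >= qty
        if ok:
            self.stock[sku] -= qty
        result = {"reserved": ok, "sku": sku, "qty": qty}
        self._seen[idempotency_key] = result
        return result

svc = InventoryService({"widget": 5})
first = svc.reserve("order-42", "widget", 2)
retry = svc.reserve("order-42", "widget", 2)  # caller retried after a 504
print(svc.stock["widget"])  # → 3, not 1: the retry did not double-reserve
```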
package/data-intensive-patterns/examples/after.md
@@ -0,0 +1,61 @@
+# After
+
+An event-driven architecture where writes go through a message log, read models are derived via CDC, read replicas serve analytics, and the command/query paths are separated.
+
+```
+ARCHITECTURE: E-Commerce Platform (Event-Driven + CQRS)
+
+WRITE PATH
+──────────
+[Mobile App / Web Browser]
+        │  REST (commands only: place order, update product)
+        v
+[API Gateway] ──> [Order Command Service] ──> [Orders DB - Postgres]
+                                                       │
+                                           [Debezium CDC connector]
+                                                       │
+                                                       v
+                                           [Kafka: order.events topic]
+                                           (append-only event log,
+                                            partitioned by order_id)
+
+READ PATH (derived models — rebuilt from event log, no dual-writes)
+──────────────────────────────────────────────────────────────────
+[Kafka: order.events]
+        │
+        ├──> [Inventory Consumer] ──> [Inventory Read DB - Postgres replica]
+        │        (product weekly sales view)
+        │
+        ├──> [Search Consumer] ──> [Elasticsearch Index]
+        │        (product search, updated async)
+        │
+        └──> [Analytics Consumer] ──> [BigQuery Streaming Insert]
+                 (append-only fact table,
+                  no load on production DB)
+
+QUERY ENDPOINTS (served from read models, not production write DB)
+──────────────────────────────────────────────────────────────────
+GET /inventory/reorder-candidates  → Inventory Read DB
+GET /search/products?q=...         → Elasticsearch
+GET /reports/revenue?period=...    → BigQuery
+
+ASYNC COORDINATION (replaces synchronous call chain)
+────────────────────────────────────────────────────
+Place order → write Orders DB → CDC → Kafka
+    → Payment Service consumes event → publishes PaymentAuthorized
+    → Notification Service consumes PaymentAuthorized → sends email
+(no synchronous chain; each step is independently retried)
+
+CONSISTENCY MODEL
+─────────────────
+Orders DB   → strongly consistent (single Postgres primary)
+Read models → eventually consistent (seconds of lag, acceptable for reads)
+Analytics   → eventually consistent (minutes of lag, acceptable for reports)
+```
+
+Key improvements:
+- Append-only event log (Kafka) is the single source of truth — derived views are rebuilt from it, never maintained by dual-writes (Ch 11: derived data vs. system of record)
+- CDC via Debezium captures changes from the Orders DB atomically — no risk of writing to DB and failing to publish the event (Ch 11: Change Data Capture)
+- Analytics consumers write to BigQuery directly from Kafka — no SELECT queries on the production Orders DB (Ch 10: separation of OLTP and OLAP)
+- CQRS separates command endpoints (write path) from query endpoints (read path) — each can scale independently
+- The synchronous call chain (place order → payment → notification) is replaced by event-driven coordination — failure of one consumer does not block the order write
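The "no dual-writes" property in the diagram's write path comes from the transactional outbox: the order row and its event are committed in one local transaction, and a CDC connector later ships outbox rows to the log. A minimal sketch with sqlite3 standing in for Postgres (table and topic names are illustrative):

```python
import json
import sqlite3

# One local transaction writes the order AND its outbox event; a CDC
# connector (Debezium in the diagram) would then stream outbox rows to Kafka.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT)")

def place_order(order_id):
    with db:  # single atomic transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, 'CONFIRMED')", (order_id,))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.events", json.dumps({"type": "OrderPlaced", "order_id": order_id})),
        )

place_order("o-1")
events = db.execute("SELECT topic, payload FROM outbox").fetchall()
print(events)  # → [('order.events', '{"type": "OrderPlaced", "order_id": "o-1"}')]
```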
package/data-intensive-patterns/examples/before.md
@@ -0,0 +1,38 @@
+# Before
+
+A text architecture diagram showing an e-commerce platform where every component communicates synchronously via REST with no event log, no read replicas, and no separation of read/write paths.
+
+```
+ARCHITECTURE: E-Commerce Platform (Synchronous REST Only)
+
+[Mobile App]  ──REST──> [API Server]
+[Web Browser] ──REST──> [API Server]
+                             │
+             ┌───────────────┼──────────────────┐
+             │               │                  │
+             v               v                  v
+        [Order DB]      [Product DB]       [User DB]
+        (Postgres)      (Postgres)         (Postgres)
+             │               │
+             v               v
+     [Analytics REST]   [Search REST]      (both query
+      calls Order DB     calls Product      production
+      directly via       DB directly        DBs live)
+      SQL over HTTP
+
+FLOWS:
+  Place order → API → write Order DB → REST call to
+               Inventory Service → REST call to
+               Payment Service → REST call to
+               Notification Service
+               (all synchronous, chain fails if any step fails)
+
+  Dashboard → API → query Order DB, Product DB, User DB
+              in sequence (3 serial DB queries on write path)
+
+  Search → API → query Product DB directly
+           (full table scans, no index service)
+
+  Reports → Analytics service polls Order DB every 5 min
+            via REST (puts load on production DB)
+```
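The "chain fails if any step fails" note compounds across hops: end-to-end availability of a synchronous chain is the product of the per-service availabilities. A quick back-of-envelope with illustrative numbers:

```python
# Four services in a synchronous chain, each 99.5% available:
# the order succeeds only if every hop succeeds.
per_service = 0.995
hops = 4
chain_availability = per_service ** hops
print(f"{chain_availability:.4f}")  # → 0.9801, worse than any single service
```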
@@ -0,0 +1,213 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""
|
|
3
|
+
adr.py - Architecture Decision Record generator for data-intensive systems.
|
|
4
|
+
|
|
5
|
+
Usage:
|
|
6
|
+
python adr.py <decision-title>
|
|
7
|
+
python adr.py # interactive mode
|
|
8
|
+
|
|
9
|
+
Generates:
|
|
10
|
+
adr-NNN-<slug>.md - Numbered ADR file with data-intensive-specific sections
|
|
11
|
+
ADR-INDEX.md - Running index of all ADRs (appended to)
|
|
12
|
+
|
|
13
|
+
The ADR includes standard sections plus four data-intensive-specific sections:
|
|
14
|
+
- Consistency model
|
|
15
|
+
- Failure mode
|
|
16
|
+
- Scalability impact
|
|
17
|
+
- Operability
|
|
18
|
+
|
|
19
|
+
Based on patterns from "Designing Data-Intensive Applications" by Martin Kleppmann.
|
|
20
|
+
"""
|
|
21
|
+
|
|
22
|
+
import argparse
|
|
23
|
+
import datetime
|
|
24
|
+
import pathlib
|
|
25
|
+
import re
|
|
26
|
+
import sys
|
|
27
|
+
|
|
28
|
+
|
|
29
|
+
def slugify(title: str) -> str:
|
|
30
|
+
slug = title.lower()
|
|
31
|
+
slug = re.sub(r"[^a-z0-9]+", "-", slug)
|
|
32
|
+
slug = slug.strip("-")
|
|
33
|
+
return slug
|
|
34
|
+
|
|
35
|
+
|
|
36
|
+
def next_adr_number(adr_dir: pathlib.Path) -> int:
|
|
37
|
+
existing = list(adr_dir.glob("adr-[0-9][0-9][0-9]-*.md"))
|
|
38
|
+
if not existing:
|
|
39
|
+
return 1
|
|
40
|
+
numbers = []
|
|
41
|
+
for p in existing:
|
|
42
|
+
m = re.match(r"adr-(\d{3})-", p.name)
|
|
43
|
+
if m:
|
|
44
|
+
numbers.append(int(m.group(1)))
|
|
45
|
+
return max(numbers) + 1 if numbers else 1
|
|
46
|
+
|
|
47
|
+
|
|
48
|
+
def prompt(question: str, default: str = "") -> str:
|
|
49
|
+
suffix = f" [{default}]" if default else ""
|
|
50
|
+
try:
|
|
51
|
+
answer = input(f"{question}{suffix}: ").strip()
|
|
52
|
+
except (EOFError, KeyboardInterrupt):
|
|
53
|
+
print()
|
|
54
|
+
sys.exit(0)
|
|
55
|
+
return answer if answer else default
|
|
56
|
+
|
|
57
|
+
|
|
58
|
+
def collect_options() -> list[str]:
|
|
59
|
+
options = []
|
|
60
|
+
print("Enter up to 4 considered options (leave blank to stop):")
|
|
61
|
+
for i in range(1, 5):
|
|
62
|
+
opt = prompt(f" Option {i}")
|
|
63
|
+
if not opt:
|
|
64
|
+
break
|
|
65
|
+
options.append(opt)
|
|
66
|
+
return options
|
|
67
|
+
|
|
68
|
+
|
|
69
|
+
def render_adr(
|
|
70
|
+
number: int,
|
|
71
|
+
title: str,
|
|
72
|
+
context: str,
|
|
73
|
+
options: list[str],
|
|
74
|
+
chosen: str,
|
|
75
|
+
consequences: str,
|
|
76
|
+
consistency_model: str,
|
|
77
|
+
failure_mode: str,
|
|
78
|
+
scalability_impact: str,
|
|
79
|
+
operability: str,
|
|
80
|
+
date: str,
|
|
81
|
+
) -> str:
|
|
82
|
+
options_text = "\n".join(f"- {opt}" for opt in options) if options else "- (none listed)"
|
|
83
|
+
return f"""\
|
|
84
|
+
# ADR-{number:03d}: {title}
|
|
85
|
+
|
|
86
|
+
**Date:** {date}
|
|
87
|
+
**Status:** Proposed
|
|
88
|
+
|
|
89
|
+
---
|
|
90
|
+
|
|
91
|
+
## Context
|
|
92
|
+
|
|
93
|
+
{context}
|
|
94
|
+
|
|
95
|
+
## Considered Options
|
|
96
|
+
|
|
97
|
+
{options_text}
|
|
98
|
+
|
|
99
|
+
## Decision
|
|
100
|
+
|
|
101
|
+
{chosen}
|
|
102
|
+
|
|
103
|
+
## Consequences
|
|
104
|
+
|
|
105
|
+
{consequences}
|
|
106
|
+
|
|
107
|
+
---
|
|
108
|
+
|
|
109
|
+
## Data-Intensive Considerations
|
|
110
|
+
|
|
### Consistency Model

> What consistency guarantees does this choice provide?

{consistency_model}

### Failure Mode

> What happens when this component fails?

{failure_mode}

### Scalability Impact

> How does this scale with data volume?

{scalability_impact}

### Operability

> How observable and maintainable is this choice?

{operability}
"""


def append_to_index(index_path: pathlib.Path, number: int, title: str, filename: str, date: str) -> None:
    header = "# ADR Index\n\n| # | Title | Date | File |\n|---|-------|------|------|\n"
    entry = f"| {number:03d} | {title} | {date} | [{filename}]({filename}) |\n"
    if not index_path.exists():
        index_path.write_text(header + entry, encoding="utf-8")
        print(f"Created: {index_path}")
    else:
        content = index_path.read_text(encoding="utf-8")
        index_path.write_text(content + entry, encoding="utf-8")
        print(f"Updated: {index_path}")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Generate an ADR for data-intensive systems."
    )
    parser.add_argument(
        "title",
        nargs="?",
        default="",
        help="Decision title (will prompt if omitted)",
    )
    parser.add_argument(
        "--output-dir",
        default=".",
        help="Directory to write ADR files (default: ./)",
    )
    args = parser.parse_args()

    output_dir = pathlib.Path(args.output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    title = args.title.strip() or prompt("Decision title")
    if not title:
        print("ERROR: A title is required.")
        sys.exit(1)

    print()
    context = prompt("Context (why is this decision needed?)", default="Describe the situation and forces at play.")
    options = collect_options()
    chosen = prompt("Chosen option")
    consequences = prompt("Consequences (trade-offs, risks, next steps)", default="To be determined.")
    print()
    print("-- Data-intensive sections --")
    consistency_model = prompt("Consistency model", default="To be defined.")
    failure_mode = prompt("Failure mode", default="To be defined.")
    scalability_impact = prompt("Scalability impact", default="To be defined.")
    operability = prompt("Operability", default="To be defined.")

    number = next_adr_number(output_dir)
    date = datetime.date.today().isoformat()
    filename = f"adr-{number:03d}-{slugify(title)}.md"
    adr_path = output_dir / filename

    content = render_adr(
        number=number,
        title=title,
        context=context,
        options=options,
        chosen=chosen,
        consequences=consequences,
        consistency_model=consistency_model,
        failure_mode=failure_mode,
        scalability_impact=scalability_impact,
        operability=operability,
        date=date,
    )

    adr_path.write_text(content, encoding="utf-8")
    print(f"\nWrote: {adr_path}")

    append_to_index(output_dir / "ADR-INDEX.md", number, title, filename, date)
    print("\nDone.")


if __name__ == "__main__":
    main()
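The hunk above calls several helpers (`prompt`, `collect_options`, `slugify`, `next_adr_number`) that are defined earlier in `adr.py` and are not part of this diff. For readers of this excerpt, a hypothetical sketch of the two filename-related helpers; the names come from the calls above, but the bodies are assumptions, not the package's actual implementation:

```python
import pathlib
import re


def slugify(title: str) -> str:
    # Lowercase, collapse runs of non-alphanumerics into single hyphens,
    # and trim leading/trailing hyphens.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def next_adr_number(output_dir: pathlib.Path) -> int:
    # One more than the count of existing adr-NNN-*.md files in the directory.
    return len(list(output_dir.glob("adr-[0-9][0-9][0-9]-*.md"))) + 1
```

Under these assumptions, `slugify("Choose Kafka for the Event Log")` yields `choose-kafka-for-the-event-log`, so the script would emit filenames like `adr-001-choose-kafka-for-the-event-log.md`.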
@@ -0,0 +1,45 @@
{
  "evals": [
    {
      "id": "eval-01-etl-no-error-handling-no-idempotency",
      "prompt": "Review this ETL script:\n\n```python\nimport psycopg2\nimport requests\n\nSOURCE_DB = 'postgresql://user:pass@source-host/prod'\nDEST_DB = 'postgresql://user:pass@warehouse-host/warehouse'\n\ndef run():\n    src = psycopg2.connect(SOURCE_DB)\n    dst = psycopg2.connect(DEST_DB)\n\n    rows = src.cursor().execute(\n        'SELECT id, customer_id, amount, created_at FROM orders'\n    ).fetchall()\n\n    for row in rows:\n        order_id, customer_id, amount, created_at = row\n        resp = requests.get(f'https://api.exchange.io/rate?currency=EUR')\n        rate = resp.json()['rate']\n        amount_eur = amount * rate\n\n        dst.cursor().execute(\n            'INSERT INTO orders_eur VALUES (%s, %s, %s, %s)',\n            (order_id, customer_id, amount_eur, created_at)\n        )\n\n    dst.commit()\n    src.close()\n    dst.close()\n\nif __name__ == '__main__':\n    run()\n```",
      "expectations": [
        "Flags the unfiltered extraction `SELECT ... FROM orders` with no timestamp filter as a non-incremental load that will re-process the entire table on every run; recommends incremental extraction using a watermark (Ch 3-4: incremental over full extraction)",
        "Flags the absence of idempotency: re-running the script will insert duplicate rows into `orders_eur`; recommends an INSERT ... ON CONFLICT DO NOTHING or MERGE pattern (Ch 13: idempotency is non-negotiable)",
        "Flags the external API call `requests.get` inside the per-row loop, which issues one HTTP request per order row — an N+1 pattern causing severe performance and rate-limit issues; recommends fetching the exchange rate once before the loop",
        "Flags no error handling anywhere: if the API call fails, the loop crashes mid-run leaving the destination in a partially loaded state with no indication of progress (Ch 13: error handling and retry strategies)",
        "Flags hardcoded credentials in source strings; recommends environment variables or a secrets manager (Ch 13: never hardcode credentials)",
        "Flags no logging of rows processed, errors encountered, or run duration (Ch 12: monitoring and observability)",
        "Flags the absence of a staging table: data is written directly to the production `orders_eur` table without validation (Ch 8: always load to staging first)"
      ]
    },
    {
      "id": "eval-02-mixed-transform-and-load",
      "prompt": "Review this data pipeline script:\n\n```python\nimport pandas as pd\nimport sqlalchemy\n\ndef process_and_load(csv_path: str, db_url: str, table: str):\n    df = pd.read_csv(csv_path)\n\n    # Clean and transform\n    df['email'] = df['email'].str.lower().str.strip()\n    df['revenue'] = df['revenue'].fillna(0)\n    df['signup_date'] = pd.to_datetime(df['signup_date'])\n    df = df[df['revenue'] >= 0]\n    df['revenue_category'] = df['revenue'].apply(\n        lambda x: 'high' if x > 1000 else 'low'\n    )\n    df['country'] = df['country'].str.upper()\n\n    # Enrich with another file\n    regions = pd.read_csv('regions.csv')  # hardcoded path\n    df = df.merge(regions, on='country', how='left')\n\n    # Load directly into the final table\n    engine = sqlalchemy.create_engine(db_url)\n    df.to_sql(table, engine, if_exists='append', index=False)\n    print(f'Loaded {len(df)} rows')\n```",
      "expectations": [
        "Flags that transformation logic and loading logic are combined in a single function, violating separation of concerns; recommends splitting into separate extract, transform, and load functions (Ch 3: ETL pattern design, Ch 11: DAG-based task granularity)",
        "Flags the hardcoded path `'regions.csv'` as a non-configurable dependency that breaks when the file moves; recommends externalizing all paths and inputs as parameters or config (Ch 13: configurable pipelines)",
        "Flags `if_exists='append'` with no deduplication: re-running appends duplicate rows; recommends staging table + MERGE or using a unique constraint with INSERT OR IGNORE (Ch 13: idempotency)",
        "Flags no data validation before loading: there is no check that the merge did not produce unexpected nulls in the region column or that row counts match expectations (Ch 10: validate at boundaries)",
        "Flags no logging beyond a single print statement: recommends structured logging of row counts at each stage, null rates, and merge match rate (Ch 12: monitoring and observability)",
        "Flags absence of data lineage tracking: no pipeline_run_id or audit column to identify which pipeline run produced each row, making debugging and reruns harder to trace (Ch 13: data lineage)",
        "Recommends adding a schema validation step after reading the CSV to catch missing or mistyped columns before transformations run (Ch 10: schema validation at ingestion)"
      ]
    },
    {
      "id": "eval-03-clean-pipeline-with-retry-logging-separation",
      "prompt": "Review this data pipeline implementation:\n\n```python\nimport logging\nimport time\nfrom datetime import datetime, timedelta\nfrom typing import Iterator\nimport psycopg2\nimport psycopg2.extras\n\nlogger = logging.getLogger(__name__)\n\nBATCH_SIZE = 1000\nMAX_RETRIES = 3\nBACKOFF_BASE = 2\n\n\ndef extract(conn, watermark: datetime) -> Iterator[list]:\n    \"\"\"Yield batches of new orders since the watermark.\"\"\"\n    with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cur:\n        cur.execute(\n            'SELECT id, customer_id, amount, created_at '\n            'FROM orders WHERE created_at > %s ORDER BY created_at',\n            (watermark,)\n        )\n        while True:\n            rows = cur.fetchmany(BATCH_SIZE)\n            if not rows:\n                break\n            logger.info('Extracted batch of %d rows', len(rows))\n            yield [dict(r) for r in rows]\n\n\ndef transform(batch: list[dict]) -> list[dict]:\n    \"\"\"Apply business rules: normalize amounts, tag high-value orders.\"\"\"\n    result = []\n    for row in batch:\n        row['amount'] = round(float(row['amount']), 2)\n        row['is_high_value'] = row['amount'] > 500\n        result.append(row)\n    return result\n\n\ndef load(conn, rows: list[dict], run_id: str) -> int:\n    \"\"\"Upsert rows into orders_warehouse; return count of rows loaded.\"\"\"\n    with conn.cursor() as cur:\n        psycopg2.extras.execute_values(\n            cur,\n            '''\n            INSERT INTO orders_warehouse (id, customer_id, amount, is_high_value, created_at, pipeline_run_id)\n            VALUES %s\n            ON CONFLICT (id) DO UPDATE SET\n                amount = EXCLUDED.amount,\n                is_high_value = EXCLUDED.is_high_value,\n                pipeline_run_id = EXCLUDED.pipeline_run_id\n            ''',\n            [(r['id'], r['customer_id'], r['amount'], r['is_high_value'],\n              r['created_at'], run_id) for r in rows]\n        )\n        conn.commit()\n    return len(rows)\n\n\ndef run_with_retry(fn, *args, **kwargs):\n    \"\"\"Retry a function with exponential backoff on transient errors.\"\"\"\n    for attempt in range(1, MAX_RETRIES + 1):\n        try:\n            return fn(*args, **kwargs)\n        except psycopg2.OperationalError as e:\n            if attempt == MAX_RETRIES:\n                raise\n            delay = BACKOFF_BASE ** attempt\n            logger.warning('Attempt %d failed: %s. Retrying in %ds', attempt, e, delay)\n            time.sleep(delay)\n```",
      "expectations": [
        "Recognizes this is a well-designed pipeline and says so explicitly",
        "Praises the clear separation of `extract`, `transform`, and `load` into distinct functions with single responsibilities (Ch 3: ETL pattern, Ch 11: task granularity)",
        "Praises the watermark-based incremental extraction that avoids full-table scans on reruns (Ch 3-4: incremental extraction)",
        "Praises the `ON CONFLICT DO UPDATE` upsert ensuring the pipeline is idempotent and safe to re-run (Ch 13: idempotency is non-negotiable)",
        "Praises the generator-based `extract` function that yields batches, avoiding loading the full result set into memory (Ch 4: streaming extraction, memory efficiency)",
        "Praises the `run_with_retry` wrapper with exponential backoff for transient database errors (Ch 13: error handling and retry strategies)",
        "Praises structured logging at the batch level with row counts for observability (Ch 12: monitoring)",
        "Praises the `pipeline_run_id` column in the load, enabling lineage tracking and debugging of which run produced which rows (Ch 13: data lineage)",
        "Does NOT manufacture issues to appear thorough; any suggestions are framed as minor optional improvements"
      ]
    }
  ]
}
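The idempotency expectations above (an upsert with a conflict target, or staging plus MERGE) can be demonstrated in miniature. A purely illustrative sketch using the stdlib's sqlite3, not part of the package: running the load twice leaves the row count unchanged.

```python
import sqlite3

# Tiny in-memory table standing in for orders_eur.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_eur (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(1, 9.50), (2, 20.00)]
for _ in range(2):  # run the "pipeline" twice to simulate a re-run
    conn.executemany(
        "INSERT INTO orders_eur VALUES (?, ?) ON CONFLICT(id) DO NOTHING",
        rows,
    )

# The second run hits the primary-key conflict and inserts nothing.
count = conn.execute("SELECT COUNT(*) FROM orders_eur").fetchone()[0]
print(count)  # 2
```

Without the `ON CONFLICT(id) DO NOTHING` clause, the second run would either raise a uniqueness error or, absent the primary key, silently duplicate every row, which is exactly what eval-01 and eval-02 expect a reviewer to flag.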