npm - skillstore-cli - Versions diffs - 1.0.0 - Mend

skillstore-cli 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (231)

package/data/shared/framework/skills/databases/references/advanced-data-patterns.md ADDED

@@ -0,0 +1,259 @@
+# Advanced Data Patterns
+## Sharding Strategies
+### Decision Matrix
+| Strategy | Best For | Shard Key Example | Trade-off |
+|---|---|---|---|
+| Hash-based | Even distribution, no range queries needed | `hash(user_id) % N` | Cannot do efficient range scans |
+| Range-based | Time-series, sequential access patterns | `created_date`, `region_id` | Risk of hot spots on recent ranges |
+| Geographic | Data locality, compliance (GDPR, residency) | `country_code`, `region` | Cross-region queries are expensive |
+### Shard Key Selection Criteria
+Choose a shard key that satisfies:
+1. **High cardinality** — enough distinct values to distribute evenly across shards
+2. **Even distribution** — no single value dominates (avoid skewed keys like `status`)
+3. **Query isolation** — most queries include the shard key, avoiding scatter-gather
+4. **Immutability** — the key should not change after insertion (resharding is expensive)
+```sql
+-- Example: hash-based sharding handled by the Citus extension for PostgreSQL
+SELECT create_distributed_table('orders', 'customer_id');
+-- Verify shard distribution
+SELECT nodename, count(*)
+FROM citus_shards
+GROUP BY nodename;
+```
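The `hash(user_id) % N` routing from the decision matrix can also be sketched in application code. This is a minimal illustration, not how Citus assigns shards internally; the function name `shard_for`, the `user-<n>` key format, and the shard count of 4 are assumptions made for the example.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash."""
    # hashlib gives the same digest in every process, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Route 10,000 made-up user IDs across 4 shards and count keys per shard.
distribution = Counter(shard_for(f"user-{i}", 4) for i in range(10_000))
```

With plain modulo routing, changing the shard count remaps most keys, which is one reason the criteria above favor an immutable key and treat resharding as expensive.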
+### Cross-Shard Query Handling
+- **Scatter-gather**: query all shards, merge results — acceptable for analytics, not for OLTP hot paths
+- **Global tables**: replicate small reference tables (countries, currencies) to every shard
+- **Entity groups**: co-locate related tables on the same shard key (orders + order_items both sharded by customer_id)
+### Rebalancing
+- **Citus**: `SELECT rebalance_table_shards('orders');` — moves shards to equalize storage
+- **Vitess**: `Reshard` workflow splits or merges shards with zero downtime
+- **Rule of thumb**: plan rebalancing when any shard exceeds 1.5× average size or 80% disk capacity
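The rule of thumb above can be turned into a small check. This is a sketch under stated assumptions: shard sizes are supplied by the caller as a name-to-gigabytes mapping, every shard sits on a disk of the same capacity, and the function name `shards_needing_rebalance` is hypothetical.

```python
def shards_needing_rebalance(
    sizes_gb: dict[str, float],
    disk_capacity_gb: float,
    size_ratio: float = 1.5,
    disk_ratio: float = 0.8,
) -> list[str]:
    """Return shards above 1.5x the average size or above 80% of disk capacity."""
    average = sum(sizes_gb.values()) / len(sizes_gb)
    return sorted(
        name
        for name, size in sizes_gb.items()
        if size > size_ratio * average or size > disk_ratio * disk_capacity_gb
    )


# Hypothetical cluster: one shard has grown well past the others.
flagged = shards_needing_rebalance(
    {"shard_a": 100, "shard_b": 110, "shard_c": 400}, disk_capacity_gb=1000
)
```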
+---
+## Change Data Capture (CDC)
+### Debezium + PostgreSQL Setup
+1. **Enable logical replication** in `postgresql.conf`:
+```
+wal_level = logical
+max_replication_slots = 4
+max_wal_senders = 4
+```
+2. **Create a publication**:
+```sql
+CREATE PUBLICATION cdc_publication FOR TABLE orders, customers;
+```
+3. **Debezium connector configuration** (Kafka Connect JSON):
+```json
+{
+  "name": "pg-cdc-connector",
+  "config": {
+    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
+    "database.hostname": "db-primary",
+    "database.port": "5432",
+    "database.dbname": "app_db",
+    "database.user": "cdc_user",
+    "plugin.name": "pgoutput",
+    "publication.name": "cdc_publication",
+    "topic.prefix": "app",
+    "table.include.list": "public.orders,public.customers",
+    "snapshot.mode": "initial",
+    "tombstones.on.delete": true,
+    "transforms": "unwrap",
+    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
+  }
+}
+```
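+4. **Event structure**: each raw change event (the envelope, before the `unwrap` transform configured above flattens it to the `after` row state) contains: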
+   - `before`: row state before the change (null for inserts)
+   - `after`: row state after the change (null for deletes)
+   - `source`: metadata (lsn, txId, timestamp, table, schema)
+   - `op`: operation type — `c` (create), `u` (update), `d` (delete), `r` (snapshot read)
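A consumer of these events usually branches on `op`. The sketch below keeps an in-memory copy of a table in sync. It assumes the full envelope is delivered (no flattening transform), that events arrive as already-decoded dictionaries, and that rows have an `id` primary key; `apply_change` is a hypothetical helper, not part of Debezium.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):
        # Creates, updates, and snapshot reads all carry the new row in `after`.
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # Deletes have no `after`; the key comes from `before`.
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unexpected op: {op!r}")


replica: dict = {}
apply_change(replica, {"op": "c", "before": None, "after": {"id": 1, "status": "new"}})
apply_change(replica, {"op": "u", "before": {"id": 1}, "after": {"id": 1, "status": "paid"}})
```

The same dispatch shape fits the cache-invalidation and search-sync use cases: replace the dictionary writes with the downstream call.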
+### CDC Use Cases
+- Sync data to Elasticsearch or analytics warehouse
+- Trigger downstream microservice workflows on data changes
+- Build audit logs without application-level instrumentation
+- Cache invalidation in Redis based on database writes
+---
+## Temporal Tables
+### PostgreSQL (temporal_tables Extension)
+```sql
+-- Install the extension
+CREATE EXTENSION IF NOT EXISTS temporal_tables;
+-- Create the main table
+CREATE TABLE products (
+    id          SERIAL PRIMARY KEY,
+    name        TEXT NOT NULL,
+    price       NUMERIC(10,2) NOT NULL,
+    sys_period  tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, null)
+);
+-- Create the history table (same structure)
+CREATE TABLE products_history (LIKE products);
+-- Attach the trigger
+CREATE TRIGGER products_versioning
+    BEFORE INSERT OR UPDATE OR DELETE ON products
+    FOR EACH ROW EXECUTE FUNCTION versioning(
+        'sys_period', 'products_history', true
+    );
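+-- Query: what was the price on a specific date?
+-- The current version lives in products and older versions in products_history,
+-- so search both; querying the history table alone misses rows that are still current.
+SELECT name, price
+FROM (
+    SELECT * FROM products
+    UNION ALL
+    SELECT * FROM products_history
+) AS all_versions
+WHERE id = 42
+  AND sys_period @> '2025-06-15T00:00:00Z'::timestamptz;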
+```
+### SQL Server (System-Versioned)
+```sql
+CREATE TABLE products (
+    id          INT PRIMARY KEY,
+    name        NVARCHAR(200) NOT NULL,
+    price       DECIMAL(10,2) NOT NULL,
+    valid_from  DATETIME2 GENERATED ALWAYS AS ROW START,
+    valid_to    DATETIME2 GENERATED ALWAYS AS ROW END,
+    PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
+) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.products_history));
+-- Point-in-time query
+SELECT * FROM products
+FOR SYSTEM_TIME AS OF '2025-06-15T00:00:00';
+```
+---
+## Full-Text Search: PostgreSQL vs Elasticsearch
+### Decision Guide
+| Factor | PostgreSQL tsvector/tsquery | Elasticsearch |
+|---|---|---|
+| Data under 10M docs, simple search | Preferred — no extra infra | Overkill |
+| Fuzzy matching, typo tolerance | Limited (trigram + similarity) | Excellent (built-in fuzziness) |
+| Faceted search, aggregations | Weak | Excellent |
+| Relevance tuning | Basic (weights A-D) | Fine-grained (BM25, boosting, function_score) |
+| Operational cost | Low: runs inside your existing DB | Separate cluster to manage |
+| Real-time indexing | Immediate (same transaction) | Near real-time (~1s refresh) |
+### PostgreSQL Full-Text Setup
+```sql
+-- Add a tsvector column with GIN index
+ALTER TABLE articles ADD COLUMN search_vector tsvector
+    GENERATED ALWAYS AS (
+        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
+        setweight(to_tsvector('english', coalesce(body, '')), 'B')
+    ) STORED;
+CREATE INDEX idx_articles_search ON articles USING GIN (search_vector);
+-- Search with ranking
+SELECT title, ts_rank(search_vector, query) AS rank
+FROM articles, plainto_tsquery('english', 'database sharding') AS query
+WHERE search_vector @@ query
+ORDER BY rank DESC
+LIMIT 20;
+```
+---
+## Database-per-Service (Microservices)
+### Pros
+- Independent schema evolution — no cross-team migration coordination
+- Technology freedom — use PostgreSQL for one service, MongoDB for another
+- Fault isolation — one database failure does not cascade
+- Independent scaling of storage per service
+### Cons
+- **Cross-service queries** require API composition or CQRS read models
+- **Distributed transactions** are hard — use Saga pattern instead of 2PC
+- **Data duplication** is inevitable; accept eventual consistency
+- **Operational overhead** — more databases to monitor, back up, patch
+### Data Consistency Patterns
+- **Saga (choreography)**: services emit events; each service reacts and compensates on failure
+- **Saga (orchestration)**: a central coordinator directs the transaction steps
+- **Outbox pattern**: write events to an outbox table in the same transaction, then publish asynchronously
+- **CQRS**: separate command and query models; sync via events for cross-service reads
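The outbox pattern is the easiest of these to show in a few lines. The sketch below uses SQLite from the Python standard library purely so it runs anywhere; the `orders` and `outbox` table layouts, the `order_placed` event name, and the `publish` callback are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
""")


def place_order(order_id: int, total: float) -> None:
    """Write the order and its event in one transaction: both commit or neither."""
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_placed", json.dumps({"order_id": order_id, "total": total})),
        )


def relay_outbox(publish) -> int:
    """Publish pending events in insertion order, marking each one once it is sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because an event is marked only after `publish` returns, a crash between the two steps re-sends it on the next run, so consumers should be idempotent.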
+---
+## Multi-Tenant Data Isolation
+### Strategy Comparison
+| Strategy | Isolation | Cost | Complexity | Best For |
+|---|---|---|---|---|
+| Separate databases | Strongest | Highest | Low code, high ops | Enterprise / regulated |
+| Schema-per-tenant | Strong | Medium | Medium | Mid-market SaaS |
+| Row-level security | Weakest | Lowest | Medium code, low ops | High-volume SMB SaaS |
+### Schema-per-Tenant (PostgreSQL)
+```sql
+-- Create a schema for each tenant on signup
+CREATE SCHEMA tenant_acme;
+CREATE TABLE tenant_acme.orders (LIKE public.orders_template INCLUDING ALL);
+-- Set search_path per connection
+SET search_path TO tenant_acme, public;
+```
+### Row-Level Security (RLS)
+```sql
+-- Enable RLS on the table
+ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
+-- Policy: users see only their tenant's rows
+CREATE POLICY tenant_isolation ON orders
+    USING (tenant_id = current_setting('app.current_tenant')::uuid);
+-- Force RLS even for table owners
+ALTER TABLE orders FORCE ROW LEVEL SECURITY;
+-- Application sets tenant context per request (SET LOCAL only takes effect inside a transaction)
+SET LOCAL app.current_tenant = 'a1b2c3d4-...';
+-- Now all queries are automatically filtered
+SELECT * FROM orders;  -- returns only current tenant's orders
+```
+### Application-Layer Setup (Node.js Example)
+```typescript
+// Middleware: set tenant context on every request
+app.use(async (req, res, next) => {
+  const tenantId = req.headers['x-tenant-id'];
+  if (!tenantId) return res.status(400).json({ error: 'Tenant ID required' });
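+  // Set RLS context. SET cannot take bind parameters, so call set_config() instead;
+  // its third argument (true) makes the setting transaction-local, like SET LOCAL,
+  // so it must run on the same connection and transaction as the request's queries.
+  await db.query(`SELECT set_config('app.current_tenant', $1, true)`, [tenantId]);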
+  next();
+});
+```

package/data/shared/framework/skills/databases/references/query-optimization.md ADDED

@@ -0,0 +1,214 @@
+# Query Optimization
+## EXPLAIN / ANALYZE Interpretation
+### PostgreSQL
+```sql
+EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE user_id = 'abc' AND status = 'open';
+```
+### Key Metrics to Check
+| Metric | Good | Investigate |
+|--------|------|-------------|
+| Seq Scan | Table has < 1,000 rows | Table has > 10,000 rows |
+| Nested Loop | Inner table < 100 rows | Inner table > 1,000 rows |
+| Sort | In-memory | On disk (work_mem too low) |
+| Actual rows vs. estimated | Within 10x | Off by > 100x (stale stats) |
+| Buffers shared hit ratio | > 95% | < 80% (increase shared_buffers) |
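Two of these thresholds are simple arithmetic on numbers that `EXPLAIN (ANALYZE, BUFFERS)` prints: the `shared hit` and `read` buffer counts, and the estimated versus actual row counts. A small sketch, with hypothetical helper names and made-up sample values:

```python
def shared_hit_ratio(shared_hit: int, shared_read: int) -> float:
    """Fraction of buffer accesses served from shared_buffers."""
    total = shared_hit + shared_read
    return shared_hit / total if total else 1.0


def estimate_error(actual_rows: int, estimated_rows: int) -> float:
    """How many times off the planner's row estimate was (always >= 1)."""
    actual = max(actual_rows, 1)
    estimated = max(estimated_rows, 1)
    return max(actual / estimated, estimated / actual)


# Sample values: 950 hits and 50 reads; planner expected 100 rows, got 50,000.
ratio = shared_hit_ratio(950, 50)
error = estimate_error(50_000, 100)
```

Here the hit ratio sits right at the 95% line, while the 500x estimate error is well past the point where running `ANALYZE` is worth trying.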
+### Common EXPLAIN Nodes
+- **Seq Scan**: Full table scan. OK for small tables. Add index for large tables.
+- **Index Scan**: Uses index to find rows, then fetches from table. Good.
+- **Index Only Scan**: All data from index, no table fetch. Best case.
+- **Bitmap Index Scan**: Combines multiple indexes or handles many matches. OK.
+- **Hash Join / Merge Join**: Joining two sets. Hash is better for unequal sizes, Merge for pre-sorted data.
+### When Stats Are Stale
+```sql
+-- Force PostgreSQL to update statistics
+ANALYZE orders;
+-- Check when stats were last updated
+SELECT relname, last_analyze, last_autoanalyze FROM pg_stat_user_tables WHERE relname = 'orders';
+```
+## Common Query Anti-Patterns
+### 1. SELECT * (Fetching All Columns)
+```sql
+-- Bad: fetches all columns including large TEXT/BLOB fields
+SELECT * FROM users WHERE status = 'active';
+-- Good: fetch only what you need
+SELECT id, name, email FROM users WHERE status = 'active';
+```
+### 2. N+1 Queries
+```python
+# Bad: 1 query for orders + N queries for users (one per order)
+orders = db.query("SELECT * FROM orders LIMIT 100")
+for order in orders:
+    user = db.query("SELECT * FROM users WHERE id = %s", order.user_id)
+# Good: 1 query with JOIN or 2 queries with IN
+orders = db.query("""
+    SELECT o.*, u.name as user_name
+    FROM orders o JOIN users u ON o.user_id = u.id
+    LIMIT 100
+""")
+# Also good: batch fetch
+user_ids = [o.user_id for o in orders]
+users = db.query("SELECT * FROM users WHERE id = ANY(%s)", user_ids)
+```
+### 3. Missing Indexes on Foreign Keys
+Every foreign key column should have an index. Without it, JOINs and cascading deletes trigger full table scans.
+### 4. OR vs. UNION
+```sql
+-- Bad: OR often prevents index usage
+SELECT * FROM tasks WHERE assignee_id = 'a' OR creator_id = 'a';
+-- Better: UNION uses both indexes
+SELECT * FROM tasks WHERE assignee_id = 'a'
+UNION
+SELECT * FROM tasks WHERE creator_id = 'a';
+```
+### 5. Functions on Indexed Columns
+```sql
+-- Bad: index on created_at is useless because of DATE()
+SELECT * FROM orders WHERE DATE(created_at) = '2026-03-20';
+-- Good: use range query
+SELECT * FROM orders WHERE created_at >= '2026-03-20' AND created_at < '2026-03-21';
+```
+### 6. OFFSET for Deep Pagination
+```sql
+-- Bad: database reads and discards 10,000 rows
+SELECT * FROM products ORDER BY id LIMIT 25 OFFSET 10000;
+-- Good: cursor-based pagination (keyset)
+SELECT * FROM products WHERE id > 'last-seen-id' ORDER BY id LIMIT 25;
+```
+## Index Optimization Strategies
+### Composite Index Column Order
+Put columns used in equality filters first and, among those, lead with the most selective one (highest cardinality):
+```sql
+-- If user_id has 100,000 unique values and status has 5:
+CREATE INDEX idx_tasks_user_status ON tasks (user_id, status);  -- Good
+CREATE INDEX idx_tasks_status_user ON tasks (status, user_id);  -- Less effective
+```
+### Index for Sorting
+```sql
+-- Query: WHERE status = 'open' ORDER BY created_at DESC
+-- Index must match both filter and sort:
+CREATE INDEX idx_tasks_status_created ON tasks (status, created_at DESC);
+```
+### Partial Indexes for Hot Data
+```sql
+-- 90% of queries are for active records
+CREATE INDEX idx_active_users ON users (email) WHERE is_active = true;
+-- Much smaller index, faster lookups
+```
+### Monitor Unused Indexes
+```sql
+-- PostgreSQL: find indexes with zero scans
+SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid))
+FROM pg_stat_user_indexes
+WHERE idx_scan = 0 AND indexrelname NOT LIKE '%pkey'
+ORDER BY pg_relation_size(indexrelid) DESC;
+```
+## Connection Pooling
+### Why
+Database connections are expensive (TCP handshake, auth, memory allocation). A pool reuses connections.
+### Configuration Guidelines
+| App Type | Pool Size | Formula |
+|----------|-----------|---------|
+| Web server | 10-20 per instance | `connections = (cores * 2) + disk_spindles` |
+| Background workers | 3-5 per worker | Fewer concurrent queries needed |
+| Total max | Check DB limit | PostgreSQL default: 100 connections |
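The formula and the database-wide limit can be checked together. In this sketch the function names are hypothetical, `cores` and `disk_spindles` describe the database server, and the three reserved connections mirror PostgreSQL's default `superuser_reserved_connections`; adjust both defaults to the actual server settings.

```python
def suggested_pool_size(cores: int, disk_spindles: int) -> int:
    """connections = (cores * 2) + disk_spindles"""
    return cores * 2 + disk_spindles


def fits_connection_limit(
    pool_size: int,
    instances: int,
    max_connections: int = 100,
    reserved: int = 3,
) -> bool:
    """True if all app instances together stay under the server's usable limit."""
    return pool_size * instances <= max_connections - reserved


size = suggested_pool_size(cores=4, disk_spindles=1)
ok_with_ten_instances = fits_connection_limit(size, instances=10)
too_many = fits_connection_limit(20, instances=5)
```

When the total does not fit, a pooling proxy in front of the database is the usual way to multiplex many client connections onto fewer server connections.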
+### Tools
+- **Application-level**: HikariCP (Java), Prisma connection pool
+- **Standalone proxy**: PgBouncer, ProxySQL (MySQL)
+### PgBouncer Modes
+| Mode | When |
+|------|------|
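+| Transaction | Recommended for most apps (PgBouncer's own default is session). Releases the connection after each transaction. |
+| Session | App relies on session state such as prepared statements or temp tables. |
+| Statement | High throughput with autocommit only; multi-statement transactions are not allowed. |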
+## Read Replicas and Partitioning
+### Read Replicas
+- Route read queries (reports, search, dashboards) to replicas
+- Keep writes on primary
+- Handle replication lag (data may be 0-5 seconds behind)
+- Use for: analytics queries, search indexes, backup
+### Table Partitioning
+```sql
+-- Partition by range (time-series data)
+CREATE TABLE events (id UUID, created_at TIMESTAMPTZ, payload JSONB)
+PARTITION BY RANGE (created_at);
+CREATE TABLE events_2026_q1 PARTITION OF events
+FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
+```
+**When to partition**: table exceeds 50-100 million rows, or queries always filter by partition key.
+## Migration Strategies
+### Zero-Downtime Migrations (Expand-Contract)
+**Adding a column:**
+1. Add the column as nullable (brief lock, no table rewrite)
+2. Deploy code that writes to the new column (and keeps writing to the old one if the new column replaces it)
+3. Backfill existing rows in batches
+4. Deploy code that reads from the new column
+5. Add NOT NULL constraint (if needed)
+6. Remove any old column usage from code
+**Renaming a column:**
+1. Add new column
+2. Write to both columns
+3. Backfill
+4. Read from new column
+5. Drop old column
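The backfill step in both procedures is where long locks tend to creep in, so it is usually run as many small transactions. The sketch below uses SQLite so it is runnable as-is; the `users` table and the `name` to `full_name` copy are hypothetical stand-ins for the old and new columns.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy name into full_name in small transactions; return rows updated."""
    total = 0
    while True:
        with conn:  # one short transaction per batch keeps locks brief
            cursor = conn.execute(
                """
                UPDATE users SET full_name = name
                WHERE id IN (
                    SELECT id FROM users
                    WHERE full_name IS NULL AND name IS NOT NULL
                    LIMIT ?
                )
                """,
                (batch_size,),
            )
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)", [(f"user {i}",) for i in range(2500)]
)
conn.commit()
updated = backfill_in_batches(conn, batch_size=1000)
```

The `name IS NOT NULL` guard matters: without it, rows whose source value is NULL would match every batch and the loop would never finish.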
+### Migration Best Practices
+- Every migration must have a rollback (down migration)
+- Test migrations on production-size data locally
+- Run migrations in a transaction when possible
+- Set a lock timeout so a migration fails fast instead of queueing behind long-held locks: `SET lock_timeout = '5s'`
+- Never run `ALTER TABLE ... ADD COLUMN ... DEFAULT` on large tables in older PostgreSQL (< 11) — it rewrites the entire table
+## Backup and Recovery
+### Backup Types
+| Type | Frequency | Use Case |
+|------|-----------|----------|
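+| Physical base backup (pg_basebackup) | Daily | Base for point-in-time recovery (WAL is replayed on top of it) |
+| Logical dump (pg_dump) | Daily | Portable snapshot for selective or cross-version restores; cannot serve as a point-in-time recovery base |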
+| WAL archiving | Continuous | Point-in-time recovery to any second |
+| Logical replication | Continuous | Cross-version or selective backup |
+### Testing Recovery
+- Schedule monthly recovery drills
+- Verify backup integrity: restore to a test instance and run validation queries
+- Measure actual restore time against the Recovery Time Objective (RTO), the maximum acceptable time to restore
+- Measure actual data loss against the Recovery Point Objective (RPO), the maximum acceptable data loss

package/data/shared/framework/skills/databases/references/schema-design.md ADDED

@@ -0,0 +1,159 @@
+# Schema Design
+## Normalization
+### 1NF (First Normal Form)
+- Every column contains atomic (indivisible) values
+- No repeating groups or arrays in a single column
+```sql
+-- Bad: tags stored as comma-separated string
+CREATE TABLE products (id INT, name TEXT, tags TEXT); -- tags = "electronics,sale,new"
+-- Good: separate table for tags
+CREATE TABLE products (id INT PRIMARY KEY, name TEXT);
+CREATE TABLE product_tags (product_id INT REFERENCES products(id), tag TEXT);
+```
+### 2NF (Second Normal Form)
+- Must be in 1NF
+- Every non-key column depends on the ENTIRE primary key (no partial dependencies)
+### 3NF (Third Normal Form)
+- Must be in 2NF
+- No transitive dependencies (non-key column depending on another non-key column)
+```sql
+-- Bad: city_name depends on zip_code, not on order_id
+CREATE TABLE orders (id INT, zip_code TEXT, city_name TEXT);
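+-- Good: move the zip_code -> city_name dependency into its own table
+-- (created first so the foreign key can reference it)
+CREATE TABLE zip_codes (zip_code TEXT PRIMARY KEY, city_name TEXT);
+CREATE TABLE orders (id INT PRIMARY KEY, zip_code TEXT REFERENCES zip_codes(zip_code));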
+```
+### BCNF (Boyce-Codd Normal Form)
+- Every determinant is a candidate key
+- Rarely needed in practice — 3NF is sufficient for most applications
+**Practical rule**: Normalize to 3NF, then denormalize specific tables based on measured query performance needs.
+## Denormalization Strategies
+| Strategy | When | Trade-off |
+|----------|------|-----------|
+| Duplicate columns | Frequently joined fields (e.g., user_name in orders) | Stale data risk, update anomalies |
+| Pre-computed aggregates | Dashboard counters, report summaries | Must update on every write |
+| Materialized views | Complex reports queried often | Refresh latency |
+| JSON columns | Flexible metadata, rare querying needs | Harder to index and validate |
+**Rule**: Only denormalize when you have evidence of a performance problem. Premature denormalization causes data inconsistency bugs.
+## Index Design
+### B-tree Index (Default)
+- Good for: equality, range queries, sorting, LIKE 'prefix%'
+- Bad for: LIKE '%suffix', full-text search, very write-heavy tables (every extra index adds write overhead)
+### Composite Index
+```sql
+-- Follows leftmost prefix rule
+CREATE INDEX idx_orders_user_status ON orders (user_id, status);
+-- This index supports:
+-- WHERE user_id = ?                    ✓
+-- WHERE user_id = ? AND status = ?     ✓
+-- WHERE status = ?                     ✗ (needs separate index)
+```
+### Covering Index
+Include all columns needed by the query to avoid table lookups:
+```sql
+CREATE INDEX idx_orders_covering ON orders (user_id, status) INCLUDE (total, created_at);
+```
+### Partial Index (PostgreSQL)
+Index only rows matching a condition — smaller index, faster queries:
+```sql
+CREATE INDEX idx_active_tasks ON tasks (assignee_id, due_date) WHERE status != 'done';
+```
+### Index Guidelines
+- Always index foreign keys
+- Index columns used in WHERE, JOIN, ORDER BY
+- Don't index low-cardinality columns alone (boolean, enum with 3 values)
+- Monitor index usage: `pg_stat_user_indexes` (PostgreSQL)
+- Remove unused indexes — each index slows down INSERT/UPDATE/DELETE
+## SQL vs NoSQL Decision Matrix
+| Factor | Choose SQL | Choose NoSQL |
+|--------|-----------|--------------|
+| Data structure | Well-defined, relational | Flexible, evolving schema |
+| Consistency | Strong ACID required | Eventual consistency OK |
+| Query patterns | Complex joins, aggregations | Simple key-value or document lookups |
+| Scale | Vertical (single node OK) | Horizontal (distributed, high volume) |
+| Transactions | Multi-table transactions | Single-document operations |
+| Examples | PostgreSQL, MySQL | MongoDB, DynamoDB, Redis |
+**Default choice**: PostgreSQL. It covers the large majority of application needs. Only reach for NoSQL when you have a specific reason.
+## Relational Patterns
+### One-to-Many
+```sql
+CREATE TABLE projects (id UUID PRIMARY KEY, name TEXT NOT NULL);
+CREATE TABLE tasks (id UUID PRIMARY KEY, project_id UUID REFERENCES projects(id), title TEXT NOT NULL);
+```
+### Many-to-Many
+```sql
+CREATE TABLE users (id UUID PRIMARY KEY, name TEXT NOT NULL);
+CREATE TABLE roles (id UUID PRIMARY KEY, name TEXT NOT NULL);
+CREATE TABLE user_roles (
+  user_id UUID REFERENCES users(id) ON DELETE CASCADE,
+  role_id UUID REFERENCES roles(id) ON DELETE CASCADE,
+  granted_at TIMESTAMPTZ DEFAULT now(),
+  PRIMARY KEY (user_id, role_id)
+);
+```
+### Polymorphic Associations
+Avoid single foreign key pointing to multiple tables. Use one of:
+- **Separate tables**: `task_comments`, `project_comments` (simplest, best for SQL)
+- **Shared table with type column**: `comments` with `commentable_type` + `commentable_id` (no FK enforcement)
+- **Join table per type**: `task_comments(task_id, comment_id)`, `project_comments(project_id, comment_id)`
+## NoSQL Patterns
+### Document Store (MongoDB, DynamoDB)
+- Embed related data that is always accessed together
+- Reference data that is shared or accessed independently
+- Design for access patterns, not for normalization
+### Key-Value (Redis)
+- Session storage, caching, rate limiting, leaderboards
+- Use key naming conventions: `app:entity:id:field` (e.g., `myapp:user:123:session`)
+- Set a TTL on cache and session keys to prevent unbounded memory growth
+## Data Types Best Practices
+| Data | Type | NOT This |
+|------|------|----------|
+| Primary key | UUID / ULID | AUTO_INCREMENT (unless single-node) |
+| Money | DECIMAL(19,4) or integer cents | FLOAT / DOUBLE |
+| Timestamps | TIMESTAMPTZ (with timezone) | TIMESTAMP (without timezone) |
+| Email/URL | TEXT with CHECK constraint | VARCHAR(255) |
+| Boolean | BOOLEAN | TINYINT / CHAR(1) |
+| Enum | TEXT with CHECK or enum type | Magic numbers |
+| JSON metadata | JSONB (PostgreSQL) | TEXT (can't query or index) |
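The money row is the one that most often bites in practice. A short demonstration of why binary floating point is the wrong type for currency, using only the Python standard library; the amounts are arbitrary:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2

# Decimal arithmetic on string-constructed values is exact, like DECIMAL(19,4).
decimal_total = Decimal("0.10") + Decimal("0.20")

# Integer cents avoid fractions entirely.
cents_total = 10 + 20

float_is_exact = float_total == 0.3
decimal_is_exact = decimal_total == Decimal("0.30")
```

`float_is_exact` comes out false while `decimal_is_exact` is true, which is the same discrepancy a FLOAT or DOUBLE column introduces into sums and comparisons.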
+## Naming Conventions
+- Tables: plural, snake_case (`order_items`, `user_roles`)
+- Columns: singular, snake_case (`created_at`, `user_id`)
+- Primary keys: `id`
+- Foreign keys: `{referenced_table_singular}_id` (`user_id`, `project_id`)
+- Indexes: `idx_{table}_{columns}` (`idx_orders_user_id_status`)
+- Constraints: `chk_{table}_{column}` (`chk_orders_status`)
+- Booleans: `is_` or `has_` prefix (`is_active`, `has_verified_email`)
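Conventions like these are easier to keep when names are generated rather than typed. A minimal sketch of helpers that follow the patterns above; the function names are hypothetical and not part of any migration tool:

```python
def index_name(table: str, columns: list[str]) -> str:
    """idx_{table}_{columns}"""
    return f"idx_{table}_{'_'.join(columns)}"


def check_name(table: str, column: str) -> str:
    """chk_{table}_{column}"""
    return f"chk_{table}_{column}"


def foreign_key_column(referenced_table_singular: str) -> str:
    """{referenced_table_singular}_id"""
    return f"{referenced_table_singular}_id"


example_index = index_name("orders", ["user_id", "status"])
example_check = check_name("orders", "status")
example_fk = foreign_key_column("project")
```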