tech-hub-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (133)
  1. package/LICENSE +21 -0
  2. package/README.md +250 -0
  3. package/bin/cli.js +241 -0
  4. package/bin/copilot.js +182 -0
  5. package/bin/postinstall.js +42 -0
  6. package/package.json +46 -0
  7. package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
  8. package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
  9. package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
  10. package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
  11. package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
  12. package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
  13. package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
  14. package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
  15. package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
  16. package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
  17. package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
  18. package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
  19. package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
  20. package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
  21. package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
  22. package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
  23. package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
  24. package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
  25. package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
  26. package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
  27. package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
  28. package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
  29. package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
  30. package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
  31. package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
  32. package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
  33. package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
  34. package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
  35. package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
  36. package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
  37. package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
  38. package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
  39. package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
  40. package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
  41. package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
  42. package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
  43. package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
  44. package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
  45. package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
  46. package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
  47. package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
  48. package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
  49. package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
  50. package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
  51. package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
  52. package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
  53. package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
  54. package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
  55. package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
  56. package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
  57. package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
  58. package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
  59. package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
  60. package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
  61. package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
  62. package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
  63. package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
  64. package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
  65. package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
  66. package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
  67. package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
  68. package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
  69. package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
  70. package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
  71. package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
  72. package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
  73. package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
  74. package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
  75. package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
  76. package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
  77. package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
  78. package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
  79. package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
  80. package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
  81. package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
  82. package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
  83. package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
  84. package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
  85. package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
  86. package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
  87. package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
  88. package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
  89. package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
  90. package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
  91. package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
  92. package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
  93. package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
  94. package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
  95. package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
  96. package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
  97. package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
  98. package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
  99. package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
  100. package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
  101. package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
  102. package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
  103. package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
  104. package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
  105. package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
  106. package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
  107. package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
  108. package/tech_hub_skills/skills/README.md +336 -0
  109. package/tech_hub_skills/skills/ai-engineer.md +104 -0
  110. package/tech_hub_skills/skills/azure.md +149 -0
  111. package/tech_hub_skills/skills/code-review.md +399 -0
  112. package/tech_hub_skills/skills/compliance-automation.md +747 -0
  113. package/tech_hub_skills/skills/data-engineer.md +113 -0
  114. package/tech_hub_skills/skills/data-governance.md +102 -0
  115. package/tech_hub_skills/skills/data-scientist.md +123 -0
  116. package/tech_hub_skills/skills/devops.md +160 -0
  117. package/tech_hub_skills/skills/docker.md +160 -0
  118. package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
  119. package/tech_hub_skills/skills/finops.md +184 -0
  120. package/tech_hub_skills/skills/ml-engineer.md +115 -0
  121. package/tech_hub_skills/skills/mlops.md +187 -0
  122. package/tech_hub_skills/skills/optimization-advisor.md +329 -0
  123. package/tech_hub_skills/skills/orchestrator.md +497 -0
  124. package/tech_hub_skills/skills/platform-engineer.md +102 -0
  125. package/tech_hub_skills/skills/process-automation.md +226 -0
  126. package/tech_hub_skills/skills/process-changelog.md +184 -0
  127. package/tech_hub_skills/skills/process-documentation.md +484 -0
  128. package/tech_hub_skills/skills/process-kanban.md +324 -0
  129. package/tech_hub_skills/skills/process-versioning.md +214 -0
  130. package/tech_hub_skills/skills/product-designer.md +104 -0
  131. package/tech_hub_skills/skills/project-starter.md +443 -0
  132. package/tech_hub_skills/skills/security-architect.md +135 -0
  133. package/tech_hub_skills/skills/system-design.md +126 -0
@@ -0,0 +1,547 @@
# Skill 5: Performance Optimization

## 🎯 Overview
Master advanced performance tuning for data pipelines: query optimization, partitioning strategies, caching, and cost-efficient compute, targeting up to 10x faster processing at lower cost.

## 🔗 Connections
- **Data Engineer**: Optimize lakehouse and pipelines (de-01, de-02, de-03, de-04)
- **ML Engineer**: Faster feature engineering and training (ml-01, ml-02, ml-03)
- **MLOps**: Optimize model serving latency (mo-04)
- **AI Engineer**: Speed up RAG retrieval (ai-02)
- **Data Scientist**: Faster data exploration (ds-01, ds-02)
- **FinOps**: Reduce compute costs through efficiency (fo-01, fo-06)
- **DevOps**: Infrastructure optimization (do-04, do-08)
- **System Design**: Scalability and caching patterns (sd-03, sd-04)

## 🛠️ Tools Included

### 1. `query_optimizer.py`
Automated query optimization and execution plan analysis.

### 2. `partition_optimizer.py`
Smart partitioning strategies and optimization recommendations.

### 3. `cache_manager.py`
Multi-level caching with cache warming and invalidation.

### 4. `performance_profiler.py`
End-to-end pipeline profiling and bottleneck detection.

### 5. `index_advisor.py`
Index recommendation engine for query acceleration.

## 📊 Performance Optimization Framework

```
Identify  →  Measure  →  Optimize  →  Validate  →  Monitor
    ↓           ↓           ↓            ↓           ↓
Bottleneck   Profile    Partition    Benchmark    Track
Detection    Queries    Cache        Results      Trends
                        Storage      A/B Test     Alert
                        Indexes
```

## 🚀 Quick Start

```python
from query_optimizer import QueryOptimizer
from partition_optimizer import PartitionOptimizer
from cache_manager import CacheManager

# Analyze a slow query
optimizer = QueryOptimizer()

query = """
SELECT customer_id, SUM(amount) AS total
FROM transactions
WHERE event_date >= '2024-01-01'
GROUP BY customer_id
"""

# Get optimization recommendations
analysis = optimizer.analyze(query, table="transactions")

print(f"Current execution time: {analysis.baseline_time}s")
print("\nRecommendations:")
for rec in analysis.recommendations:
    print(f"  - {rec.type}: {rec.description}")
    print(f"    Expected improvement: {rec.speedup}x")
    print(f"    Cost: {rec.cost_impact}")

# Apply optimizations
optimized_query = optimizer.apply_recommendations(query, analysis.recommendations)

# Optimize partitioning
part_optimizer = PartitionOptimizer()
part_analysis = part_optimizer.analyze_table("transactions")

if part_analysis.needs_repartitioning:
    print(f"\nCurrent partitioning: {part_analysis.current_strategy}")
    print(f"Recommended: {part_analysis.recommended_strategy}")
    print(f"Expected speedup: {part_analysis.speedup}x")

    # Repartition the table
    part_optimizer.repartition_table(
        table="transactions",
        partition_by=["event_date"],
        bucket_by=["customer_id"],
        num_buckets=32
    )

# Enable caching
cache = CacheManager()
cache.enable_table_cache(
    table="transactions",
    cache_level="hot",  # hot/warm/cold
    ttl_hours=24
)

# Warm the cache with common queries
cache.warm_cache([
    "SELECT * FROM transactions WHERE event_date = CURRENT_DATE",
    "SELECT customer_id, COUNT(*) FROM transactions GROUP BY customer_id"
])
```

## 📚 Best Practices

### Query Optimization

1. **Predicate Pushdown**
   - Filter early to reduce data scanned
   - Push filters to the data source when possible
   - Use partition pruning effectively
   - Leverage data skipping with statistics
   - Reference: Data Engineer best practices

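The effect of filtering early can be sketched without any engine; this pure-Python model (illustrative only, not one of the packaged tools) counts how many rows leave the scan stage under each plan:

```python
# Illustrative sketch of predicate pushdown: the same query once with the
# filter applied after the scan, and once pushed into the scan. Counters
# track how many rows flow out of the scan stage in each plan.
rows = [{"event_date": f"2024-{m:02d}-15", "amount": m * 10.0} for m in range(1, 13)]

def scan(table, counter):
    for r in table:
        counter[0] += 1          # every row is shipped downstream
        yield r

def scan_with_filter(table, counter, predicate):
    for r in table:
        if predicate(r):
            counter[0] += 1      # only matching rows are shipped downstream
            yield r

keep = lambda r: r["event_date"] >= "2024-07-01"

late = [0]
total_late = sum(r["amount"] for r in scan(rows, late) if keep(r))

early = [0]
total_early = sum(r["amount"] for r in scan_with_filter(rows, early, keep))

# Same answer either way, but the pushed-down plan moves half as many rows
print(total_late == total_early, late[0], early[0])  # True 12 6
```

Real engines apply the same rewrite inside the planner; the payoff grows with the selectivity of the filter.
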
2. **Join Optimization**
   - Use broadcast joins for tables small enough to broadcast (Spark's default `spark.sql.autoBroadcastJoinThreshold` is 10 MB; raise it only as far as executor memory allows)
   - Implement bucketing for large-to-large joins
   - Use sort-merge joins for sorted data
   - Avoid cross joins
   - Reference: System Design sd-03 (Scalability)

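The mechanics behind a broadcast join can be modeled in plain Python: hash the small side once, then stream the large side past it, so no shuffle or sort of the large table is needed (illustrative only):

```python
# Broadcast (hash) join modeled in plain Python: the small table becomes an
# in-memory hash map, and the large table is probed row by row.
segments = [("c1", "gold"), ("c2", "silver")]                     # small dimension table
transactions = [("c1", 100), ("c2", 50), ("c1", 25), ("c3", 70)]  # large fact table

lookup = dict(segments)                    # build side: broadcasted hash map
joined = [
    (customer, amount, lookup[customer])   # probe side: one hash lookup per row
    for customer, amount in transactions
    if customer in lookup                  # inner join drops unmatched rows
]
print(joined)  # [('c1', 100, 'gold'), ('c2', 50, 'silver'), ('c1', 25, 'gold')]
```

This is why broadcast joins only pay off when the build side genuinely fits in memory on every worker.
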
3. **Aggregation Optimization**
   - Perform partial aggregation before the shuffle
   - Use approximate aggregations when exact results are not required
   - Pre-aggregate in the Gold layer
   - Cache frequently aggregated results
   - Reference: Data Engineer de-01 (Lakehouse)

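Partial (map-side) aggregation can likewise be sketched in plain Python: each partition pre-aggregates locally, so only small partial results cross the shuffle boundary instead of every raw row:

```python
# Sketch of partial (map-side) aggregation: one small dict per partition
# crosses the "shuffle" instead of every raw row. Illustrative only.
from collections import Counter

partitions = [
    ["c1", "c2", "c1"],          # rows held by worker 1
    ["c2", "c2", "c3"],          # rows held by worker 2
]

# Map side: aggregate within each partition
partials = [Counter(p) for p in partitions]

# Shuffle + reduce side: merge the small partial results
merged = Counter()
for partial in partials:
    merged.update(partial)

print(dict(merged))  # {'c1': 2, 'c2': 3, 'c3': 1}
```

The network cost drops from the number of rows to the number of distinct keys per partition.
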
4. **Query Plan Analysis**
   - Examine execution plans regularly
   - Identify shuffle-heavy operations
   - Optimize stage boundaries
   - Monitor skew in data distribution
   - Reference: Data Engineer best practices

### Partitioning Strategies

5. **Time-Based Partitioning**
   - Partition by date/datetime for time-series data
   - Use hierarchical partitioning (year/month/day)
   - Balance partition size (target ~1 GB per partition)
   - Monitor partition count (keep it under 10,000)
   - Reference: Data Engineer de-01 (Lakehouse)

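The hierarchical year/month/day layout above takes only a few lines to generate; the path convention is Hive-style and the base URI is illustrative:

```python
# Hive-style hierarchical partition path (year/month/day) for one day of
# data. The base URI and column names here are illustrative only.
from datetime import date

def partition_path(base: str, d: date) -> str:
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

print(partition_path("abfss://lake/transactions", date(2024, 1, 5)))
# abfss://lake/transactions/year=2024/month=01/day=05
```

Zero-padding months and days keeps lexicographic ordering identical to chronological ordering, which is what makes range scans over partition directories cheap.
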
6. **Hash Partitioning**
   - Use for evenly distributed data
   - Choose a partition key with high cardinality
   - Avoid skewed partition keys
   - Consider composite partition keys
   - Reference: Data Engineer best practices

7. **Bucketing**
   - Bucket by join keys
   - Optimize the bucket count (32-128 is typical)
   - Combine with partitioning for best results
   - Sort within buckets for range queries
   - Reference: Data Engineer best practices

### Caching Strategies (System Design Integration)

8. **Multi-Level Caching**
   - L1: Result cache (query results)
   - L2: Disk cache (Delta cache)
   - L3: Table cache (in-memory tables)
   - Cache hot data, tier cold data
   - Reference: System Design sd-04 (Caching Strategies)

9. **Cache Invalidation**
   - Time-based TTL for changing data
   - Event-based invalidation for updates
   - Partial invalidation for partitioned data
   - Monitor cache hit rates
   - Reference: System Design sd-04 (Caching Strategies)

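The TTL approach in the first bullet can be sketched without any infrastructure; this illustrative cache also tracks the hit rate the last bullet asks you to monitor (production caches additionally need size bounds and eviction):

```python
# Minimal sketch of time-based (TTL) invalidation: an entry is served only
# while it is younger than its TTL; stale entries are recomputed.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}   # key -> (value, stored_at)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            self.hits += 1
            return entry[0]
        self.misses += 1               # missing or expired: recompute
        value = compute()
        self._store[key] = (value, now)
        return value

cache = TTLCache(ttl_seconds=60)
cache.get_or_compute("q1", lambda: "result", now=0)    # miss: computed and stored
cache.get_or_compute("q1", lambda: "result", now=30)   # hit: still fresh
cache.get_or_compute("q1", lambda: "result", now=90)   # miss: expired at 60s
print(cache.hits, cache.misses)  # 1 2
```
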
10. **Cache Warming**
    - Pre-load the cache on deployment
    - Schedule cache refreshes for predictable queries
    - Use predictive caching based on access patterns
    - Monitor cache utilization
    - Reference: System Design sd-04 (Caching Strategies)

### Indexing (when applicable)

11. **Covering Indexes**
    - Include every column the query reads so the index alone can answer it
    - Reduce table lookups
    - Balance index size against the benefit
    - Monitor index usage
    - Reference: Database best practices

12. **Composite Indexes**
    - Order columns by selectivity
    - Include filter and sort columns
    - Avoid index duplication
    - Perform regular index maintenance
    - Reference: Database best practices

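The covering-index effect is easy to see with stdlib `sqlite3`: when a composite index contains every column the query touches, the query plan reports a covering-index scan and the base table is never read. The schema here is illustrative:

```python
# Covering index demo with stdlib sqlite3: the index holds all three columns
# the query needs, so EXPLAIN QUERY PLAN reports a COVERING INDEX scan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id TEXT, event_date TEXT, amount REAL)")
conn.execute(
    "CREATE INDEX ix_cust_date_amount ON transactions (customer_id, event_date, amount)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT event_date, amount FROM transactions WHERE customer_id = 'c1'"
).fetchall()
print(plan[0][3])  # e.g. SEARCH transactions USING COVERING INDEX ix_cust_date_amount ...
```

Add a column the index lacks to the SELECT list and the plan falls back to index search plus table lookups.
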
### Storage Optimization

13. **Compression**
    - Use Snappy for a balanced speed/ratio trade-off
    - Use Zstd for better compression ratios
    - Use Gzip for archival data
    - Monitor compression ratios
    - Reference: Data Engineer de-01 (Lakehouse)

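The level trade-off can be felt with stdlib `zlib` (Zstd itself is not in the Python standard library, so DEFLATE levels stand in for the idea): higher levels spend more CPU for smaller output.

```python
# Compression level trade-off on repetitive, column-like data: level 1 is
# fastest with larger output, level 9 is slowest with the smallest output.
import zlib

data = b"customer_id,amount\nc1,100.00\n" * 10_000

fast = zlib.compress(data, level=1)   # fastest, larger
best = zlib.compress(data, level=9)   # slowest, smallest

print(len(data), len(fast), len(best))
```

The same shape of curve applies to Zstd and Gzip levels, which is why hot-path data gets a fast level and archival data a high one.
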
14. **File Sizing**
    - Target ~1 GB files for Parquet/Delta
    - Avoid small files (<128 MB)
    - Use OPTIMIZE for Delta tables
    - Z-ORDER by common filter columns
    - Reference: Data Engineer de-01 (Lakehouse)

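A quick back-of-envelope check of these targets, as a sketch (the table size and file count are hypothetical):

```python
# Ideal file count at ~1 GB per file versus the average size implied by the
# current file count; hypothetical numbers, illustrative only.
import math

def target_file_count(table_size_gb: float, target_file_gb: float = 1.0) -> int:
    return max(1, math.ceil(table_size_gb / target_file_gb))

table_size_gb = 512.0
current_files = 48_000            # a typical small-file problem

ideal = target_file_count(table_size_gb)
avg_file_mb = table_size_gb * 1024 / current_files

print(ideal, round(avg_file_mb, 1))  # 512 files of ~1 GB vs ~10.9 MB today
```

An average file size far below 128 MB, as here, is the signal to schedule compaction (OPTIMIZE on Delta).
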
15. **Columnar Storage**
    - Use Parquet/Delta for analytical workloads
    - Project only the columns you need
    - Leverage column statistics
    - Enable column pruning
    - Reference: Data Engineer best practices

### Compute Optimization (FinOps Integration)

16. **Right-Size Clusters**
    - Profile workload characteristics
    - Match node types to the workload
    - Use memory-optimized nodes for caching
    - Use compute-optimized nodes for CPU-bound work
    - Reference: FinOps fo-06 (Compute Optimization)

17. **Auto-Scaling**
    - Enable cluster auto-scaling
    - Set appropriate min/max node counts
    - Monitor scale-up/down patterns
    - Balance cost against performance
    - Reference: FinOps fo-06 (Compute Optimization)

18. **Spot Instances**
    - Use for batch workloads
    - Implement checkpointing
    - Handle interruptions gracefully
    - Expect 60-90% cost savings over on-demand pricing
    - Reference: FinOps fo-06 (Compute Optimization)

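The checkpointing pattern that makes spot instances safe for batch work can be sketched in plain Python (illustrative only; a real pipeline would checkpoint to durable storage such as ADLS, and batch its checkpoint writes rather than persisting after every item):

```python
# Checkpoint/resume sketch: persist progress after each unit of work so a
# preempted job resumes where it left off instead of starting over.
import json, os, tempfile

def run_batch(items, checkpoint_path, process, fail_after=None):
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))          # resume: skip completed items
    processed = 0
    for item in items:
        if item in done:
            continue
        if fail_after is not None and processed >= fail_after:
            raise RuntimeError("spot instance preempted")  # simulated interruption
        process(item)
        done.add(item)
        processed += 1
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)        # checkpoint after every item (simplified)
    return done

results = []
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    run_batch(["a", "b", "c", "d"], path, results.append, fail_after=2)
except RuntimeError:
    pass                                      # first run is interrupted after 2 items
run_batch(["a", "b", "c", "d"], path, results.append)  # second run resumes at "c"
print(results)  # ['a', 'b', 'c', 'd'] -- each item processed exactly once
```
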
### Azure-Specific Optimizations

19. **Delta Cache (Databricks)**
    - Enable for frequently accessed data
    - Cache hot partitions
    - Monitor cache hit metrics
    - Right-size cache storage
    - Reference: Azure az-04 (Databricks)

20. **Synapse SQL Optimization**
    - Use result set caching
    - Implement materialized views
    - Optimize distribution keys
    - Monitor DWU utilization
    - Reference: Azure az-03 (Synapse Analytics)

21. **Photon Engine (Databricks)**
    - Enable for SQL and DataFrame workloads
    - 2-5x faster for compatible workloads
    - Monitor Photon utilization
    - Run a cost-benefit analysis (Photon consumes DBUs at a higher rate)
    - Reference: Azure az-04 (Databricks)

## 💰 Cost-Performance Trade-offs

### Optimize Query Performance and Cost
```python
from query_optimizer import QueryOptimizer
from cost_analyzer import CostPerformanceAnalyzer

optimizer = QueryOptimizer()
cost_analyzer = CostPerformanceAnalyzer()

# Baseline query
baseline_query = """
SELECT user_id, COUNT(*) as event_count
FROM events
WHERE event_date >= '2024-01-01'
GROUP BY user_id
"""

# Analyze cost and performance
baseline = cost_analyzer.analyze(baseline_query)
print("Baseline:")
print(f"  Execution time: {baseline.execution_time}s")
print(f"  Cost: ${baseline.cost:.4f}")
print(f"  Data scanned: {baseline.data_scanned_gb:.2f} GB")

# Optimization 1: Partition pruning
optimized_v1 = """
SELECT user_id, COUNT(*) as event_count
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'  -- Explicit range
GROUP BY user_id
"""

result_v1 = cost_analyzer.analyze(optimized_v1)
print("\nV1 (Partition pruning):")
print(f"  Execution time: {result_v1.execution_time}s ({baseline.execution_time/result_v1.execution_time:.1f}x faster)")
print(f"  Cost: ${result_v1.cost:.4f} ({(1-result_v1.cost/baseline.cost)*100:.1f}% cheaper)")
print(f"  Data scanned: {result_v1.data_scanned_gb:.2f} GB")

# Optimization 2: Pre-aggregated table
# Create a Gold layer aggregation (assumes an active SparkSession `spark`)
spark.sql("""
CREATE OR REPLACE TABLE gold.daily_user_events
AS
SELECT event_date, user_id, COUNT(*) as event_count
FROM events
GROUP BY event_date, user_id
""")

optimized_v2 = """
SELECT user_id, SUM(event_count) as event_count
FROM gold.daily_user_events
WHERE event_date >= '2024-01-01'
GROUP BY user_id
"""

result_v2 = cost_analyzer.analyze(optimized_v2)
print("\nV2 (Pre-aggregated):")
print(f"  Execution time: {result_v2.execution_time}s ({baseline.execution_time/result_v2.execution_time:.1f}x faster)")
print(f"  Cost: ${result_v2.cost:.4f} ({(1-result_v2.cost/baseline.cost)*100:.1f}% cheaper)")

# Optimization 3: Caching
from cache_manager import CacheManager
cache = CacheManager()
cache.cache_table("gold.daily_user_events")

result_v3 = cost_analyzer.analyze(optimized_v2)  # Same query, now served from cache
print("\nV3 (Cached):")
print(f"  Execution time: {result_v3.execution_time}s ({baseline.execution_time/result_v3.execution_time:.1f}x faster)")
print(f"  Cost: ${result_v3.cost:.4f} ({(1-result_v3.cost/baseline.cost)*100:.1f}% cheaper)")
print(f"  Cache hit: {result_v3.cache_hit}")
```

### Delta Table Optimization
```python
from delta_optimizer import DeltaOptimizer

optimizer = DeltaOptimizer()

# Optimize the table (compact small files)
table_name = "silver.transactions"
metrics = optimizer.optimize_table(
    table=table_name,
    z_order_by=["customer_id", "event_date"]  # Common filter columns
)

print(f"Optimization results for {table_name}:")
print(f"  Files before: {metrics.files_before:,}")
print(f"  Files after: {metrics.files_after:,}")
print(f"  Size before: {metrics.size_before_gb:.2f} GB")
print(f"  Size after: {metrics.size_after_gb:.2f} GB")
print(f"  Compression improvement: {metrics.compression_ratio:.2f}x")

# Query performance comparison
from performance_profiler import PerformanceProfiler
profiler = PerformanceProfiler()

query = """
SELECT customer_id, SUM(amount)
FROM silver.transactions
WHERE event_date >= '2024-01-01'
  AND customer_id IN (SELECT customer_id FROM high_value_customers)
GROUP BY customer_id
"""

before_metrics = profiler.profile_query(query, version="before")
after_metrics = profiler.profile_query(query, version="after")

print("\nQuery performance:")
print(f"  Before optimization: {before_metrics.execution_time:.2f}s")
print(f"  After optimization: {after_metrics.execution_time:.2f}s")
print(f"  Speedup: {before_metrics.execution_time/after_metrics.execution_time:.1f}x")
print(f"  Data skipped: {after_metrics.data_skipped_percentage:.1f}%")

# Cost impact
print("\nCost impact:")
print(f"  Query cost before: ${before_metrics.cost:.4f}")
print(f"  Query cost after: ${after_metrics.cost:.4f}")
print(f"  Monthly savings (1000 queries): ${(before_metrics.cost - after_metrics.cost) * 1000:.2f}")
```

### Adaptive Query Execution
```python
from pyspark.sql import SparkSession

# Enable Adaptive Query Execution (AQE)
spark = SparkSession.builder \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.sql.adaptive.localShuffleReader.enabled", "true") \
    .getOrCreate()

# Query that benefits from AQE
query = """
SELECT t1.customer_id, t1.total_amount, t2.segment
FROM (
    SELECT customer_id, SUM(amount) as total_amount
    FROM transactions
    GROUP BY customer_id
) t1
JOIN customer_segments t2
  ON t1.customer_id = t2.customer_id
"""

# AQE will automatically:
# 1. Dynamically coalesce shuffle partitions
# 2. Convert a sort-merge join to a broadcast join if one side turns out small
# 3. Optimize skewed joins by splitting large partitions

df = spark.sql(query)
df.collect()  # AQE re-plans at runtime, so its effects only appear after execution

# Inspect the final adaptive plan: look for AQEShuffleRead nodes and broadcast
# exchanges in the output, or use the SQL tab of the Spark UI
df.explain(mode="formatted")
```

428
+ ## 📊 Performance Benchmarks
429
+
430
+ ### Common Optimization Impact
431
+
432
+ | Optimization Technique | Typical Speedup | Cost Reduction | Effort |
433
+ |------------------------|-----------------|----------------|--------|
434
+ | Partition pruning | 5-10x | 80-90% | Low |
435
+ | Z-ordering | 2-5x | 50-80% | Low |
436
+ | File compaction | 2-3x | 30-50% | Low |
437
+ | Broadcast joins (small tables) | 10-100x | 90-99% | Low |
438
+ | Caching hot data | 10-50x | 90-95% | Medium |
439
+ | Pre-aggregation (Gold layer) | 10-100x | 90-99% | Medium |
440
+ | Materialized views | 10-100x | 90-99% | Medium |
441
+ | Bucketing | 2-5x | 50-80% | Medium |
442
+ | Adaptive Query Execution | 1.5-3x | 30-60% | Low (config) |
443
+ | Photon engine | 2-5x | -20-0% | Low (enable) |
444
+ | Delta cache | 3-10x | 70-90% | Low (enable) |
445
+ | Column pruning | 2-10x | 50-90% | Low |
446
+ | Compression (Zstd) | 1-2x | 40-60% | Low |
447
+
448
+ ## 📊 Enhanced Metrics & Monitoring
449
+
450
+ | Metric Category | Metric | Target | Tool |
451
+ |-----------------|--------|--------|------|
452
+ | **Query Performance** | Query latency (p95) | <10s | Query history |
453
+ | | Data scanned per query | <10GB | Query metrics |
454
+ | | Shuffle data size | <1GB | Spark UI |
455
+ | **Storage** | File count per partition | <1000 | Delta logs |
456
+ | | Average file size | >128MB | Delta logs |
457
+ | | Compression ratio | >3x | Storage metrics |
458
+ | **Cache** | Cache hit rate | >80% | Cache metrics |
459
+ | | Cache eviction rate | <10% | Cache metrics |
460
+ | | Cache memory utilization | 70-90% | Cluster metrics |
461
+ | **Cost** | Cost per query | <$0.10 | FinOps tracker |
462
+ | | Cost per TB scanned | <$5 | Cost analysis |
463
+ | | Compute utilization | 70-85% | Cluster metrics |
464
+ | **Cluster** | CPU utilization | 60-80% | Azure Monitor |
465
+ | | Memory utilization | 70-85% | Azure Monitor |
466
+ | | Spill to disk | <5% | Spark metrics |
467
+
## 🔄 Performance Optimization Workflow

### End-to-End Optimization Process
```
1. Identify Performance Issues (do-08)

2. Profile Queries and Pipelines (de-05)

3. Analyze Execution Plans

4. Apply Optimizations
   ├── Partitioning (de-01)
   ├── Caching (sd-04)
   ├── Indexing
   └── Query rewriting

5. Benchmark Results

6. A/B Testing

7. Monitor Performance (do-08)

8. Track Cost Impact (fo-01)

9. Continuous Optimization
```

## 🎯 Quick Wins

1. **Enable Delta cache** - 3-10x faster for hot data
2. **Optimize Delta tables (OPTIMIZE + Z-ORDER)** - 2-5x faster queries
3. **Enable Adaptive Query Execution** - 1.5-3x speedup automatically
4. **Partition pruning** - 5-10x faster, 80-90% cost reduction
5. **Use broadcast joins for small tables** - 10-100x faster
6. **Cache frequently accessed tables** - 10-50x faster
7. **Pre-aggregate in the Gold layer** - 10-100x faster analytics
8. **Enable the Photon engine** - 2-5x faster (Databricks)
9. **Column pruning** - Select only needed columns, 2-10x faster
10. **Compress with Zstd** - 40-60% storage savings

## 🔧 Performance Tuning Checklist

### Before Optimization
- [ ] Identify slow queries (>30s execution time)
- [ ] Profile data access patterns
- [ ] Analyze execution plans
- [ ] Measure baseline performance
- [ ] Document current costs

### Storage Optimization
- [ ] Run OPTIMIZE on Delta tables weekly
- [ ] Z-ORDER by common filter columns
- [ ] VACUUM to remove old files (>7 days)
- [ ] Check file sizes (target ~1 GB)
- [ ] Monitor partition count (<10,000)
- [ ] Enable compression (Snappy/Zstd)

### Query Optimization
- [ ] Enable Adaptive Query Execution
- [ ] Use partition pruning
- [ ] Implement broadcast joins where applicable
- [ ] Pre-aggregate in the Gold layer
- [ ] Cache hot tables
- [ ] Project only needed columns
- [ ] Push down filters early

### Cluster Optimization
- [ ] Right-size cluster nodes
- [ ] Enable auto-scaling
- [ ] Use spot instances for batch
- [ ] Monitor CPU/memory utilization
- [ ] Enable Delta cache
- [ ] Enable Photon (if Databricks)

### Monitoring
- [ ] Track query performance trends
- [ ] Monitor cache hit rates
- [ ] Alert on performance degradation
- [ ] Track cost per query
- [ ] Review execution plans regularly

@@ -0,0 +1,112 @@
# dg-01: Data Catalog

## Overview

Build enterprise data catalogs for asset discovery, metadata management, and data classification.

## Key Capabilities

- **Asset Registration**: Automated discovery and registration of data assets
- **Metadata Management**: Technical, business, and operational metadata
- **Data Classification**: Automatic classification (PII, confidential, public)
- **Search & Discovery**: Powerful search capabilities for data consumers
- **Business Glossary**: Standardized business terminology

## Tools & Technologies

- **Azure Purview**: Enterprise data catalog
- **DataHub**: Open-source metadata platform
- **Amundsen**: Lyft's data discovery platform
- **Collibra**: Data governance platform

## Implementation

### 1. Asset Registration

```python
# Automated asset registration (sketch; assumes the azure-purview-catalog
# and azure-identity packages, and your own Purview account endpoint)
from azure.identity import DefaultAzureCredential
from azure.purview.catalog import PurviewCatalogClient

client = PurviewCatalogClient(
    endpoint="https://<account-name>.purview.azure.com",
    credential=DefaultAzureCredential()
)

def register_data_asset(asset_name, asset_type, location):
    """Register a data asset in the catalog"""
    asset = {
        "typeName": asset_type,
        "attributes": {
            "name": asset_name,
            "qualifiedName": f"{location}/{asset_name}",
            "location": location
        }
    }

    return client.entity.create_or_update(entity={"entity": asset})
```

### 2. Metadata Management

```python
# Add business metadata (reuses the catalog client from step 1)
def add_business_metadata(asset_id, owner, description, tags):
    """Enrich an asset with business context"""
    metadata = {
        "businessOwner": owner,
        "description": description,
        "tags": tags,
        "certification": "certified"
    }

    return client.entity.add_business_metadata(
        guid=asset_id,
        business_metadata=metadata
    )
```

### 3. Data Classification

```python
# Automatic classification (sketch; contains_pii and contains_confidential
# are placeholder content scanners, not part of the Purview SDK)
def classify_data(asset_id):
    """Apply automatic classification based on content"""
    classifications = []

    # Scan for PII
    if contains_pii(asset_id):
        classifications.append({"typeName": "PII"})

    # Scan for confidential data
    if contains_confidential(asset_id):
        classifications.append({"typeName": "Confidential"})

    return client.entity.add_classifications(
        guid=asset_id,
        classifications=classifications
    )
```

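The `contains_pii` helper above is left abstract; a minimal content-based check might look like the following sketch. The regexes are illustrative only, and a real scanner (such as Purview's built-in classifiers) is far more robust:

```python
# Minimal regex-based sketch of a PII check over a sample of an asset's
# content. Patterns are illustrative, not production classifiers.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def detect_pii(sample_text: str) -> list[str]:
    """Return the PII categories found in a text sample."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(sample_text))

sample = "Contact jane.doe@example.com or 555-867-5309, SSN 123-45-6789"
print(detect_pii(sample))  # ['email', 'phone', 'ssn']
```

In practice you would sample rows from the asset's storage location, run the scan, and map each hit to a catalog classification.
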
## Best Practices

1. **Automate Discovery** - Use scanners to auto-discover assets
2. **Enrich Metadata** - Add business context, not just technical detail
3. **Clear Ownership** - Every asset needs a business owner
4. **Regular Updates** - Keep metadata fresh and relevant
5. **User Training** - Train users on search capabilities

## Cost Optimization

- Use the Azure Purview Standard tier for fewer than 100k assets
- Schedule scans during off-peak hours
- Use incremental scans instead of full scans
- Archive metadata for unused assets

## Integration

**Connects with:**
- de-01 (Lakehouse): Catalog lakehouse tables
- sa-01 (PII Detection): Auto-classify PII data
- dg-02 (Lineage): Link to lineage tracking
- dg-03 (Quality): Link quality scores

## Quick Win

Start with your top 10 critical datasets: catalog them manually with rich metadata, then expand with automated discovery.