tech-hub-skills 1.2.0 → 1.5.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (198)
  1. package/{LICENSE → .claude/LICENSE} +21 -21
  2. package/.claude/README.md +291 -0
  3. package/.claude/bin/cli.js +266 -0
  4. package/{bin → .claude/bin}/copilot.js +182 -182
  5. package/{bin → .claude/bin}/postinstall.js +42 -42
  6. package/{tech_hub_skills/skills → .claude/commands}/README.md +336 -336
  7. package/{tech_hub_skills/skills → .claude/commands}/ai-engineer.md +104 -104
  8. package/{tech_hub_skills/skills → .claude/commands}/aws.md +143 -143
  9. package/{tech_hub_skills/skills → .claude/commands}/azure.md +149 -149
  10. package/{tech_hub_skills/skills → .claude/commands}/backend-developer.md +108 -108
  11. package/{tech_hub_skills/skills → .claude/commands}/code-review.md +399 -399
  12. package/{tech_hub_skills/skills → .claude/commands}/compliance-automation.md +747 -747
  13. package/{tech_hub_skills/skills → .claude/commands}/compliance-officer.md +108 -108
  14. package/{tech_hub_skills/skills → .claude/commands}/data-engineer.md +113 -113
  15. package/{tech_hub_skills/skills → .claude/commands}/data-governance.md +102 -102
  16. package/{tech_hub_skills/skills → .claude/commands}/data-scientist.md +123 -123
  17. package/{tech_hub_skills/skills → .claude/commands}/database-admin.md +109 -109
  18. package/{tech_hub_skills/skills → .claude/commands}/devops.md +160 -160
  19. package/{tech_hub_skills/skills → .claude/commands}/docker.md +160 -160
  20. package/{tech_hub_skills/skills → .claude/commands}/enterprise-dashboard.md +613 -613
  21. package/{tech_hub_skills/skills → .claude/commands}/finops.md +184 -184
  22. package/{tech_hub_skills/skills → .claude/commands}/frontend-developer.md +108 -108
  23. package/{tech_hub_skills/skills → .claude/commands}/gcp.md +143 -143
  24. package/{tech_hub_skills/skills → .claude/commands}/ml-engineer.md +115 -115
  25. package/{tech_hub_skills/skills → .claude/commands}/mlops.md +187 -187
  26. package/{tech_hub_skills/skills → .claude/commands}/network-engineer.md +109 -109
  27. package/{tech_hub_skills/skills → .claude/commands}/optimization-advisor.md +329 -329
  28. package/{tech_hub_skills/skills → .claude/commands}/orchestrator.md +623 -623
  29. package/{tech_hub_skills/skills → .claude/commands}/platform-engineer.md +102 -102
  30. package/{tech_hub_skills/skills → .claude/commands}/process-automation.md +226 -226
  31. package/{tech_hub_skills/skills → .claude/commands}/process-changelog.md +184 -184
  32. package/{tech_hub_skills/skills → .claude/commands}/process-documentation.md +484 -484
  33. package/{tech_hub_skills/skills → .claude/commands}/process-kanban.md +324 -324
  34. package/{tech_hub_skills/skills → .claude/commands}/process-versioning.md +214 -214
  35. package/{tech_hub_skills/skills → .claude/commands}/product-designer.md +104 -104
  36. package/{tech_hub_skills/skills → .claude/commands}/project-starter.md +443 -443
  37. package/{tech_hub_skills/skills → .claude/commands}/qa-engineer.md +109 -109
  38. package/{tech_hub_skills/skills → .claude/commands}/security-architect.md +135 -135
  39. package/{tech_hub_skills/skills → .claude/commands}/sre.md +109 -109
  40. package/{tech_hub_skills/skills → .claude/commands}/system-design.md +126 -126
  41. package/{tech_hub_skills/skills → .claude/commands}/technical-writer.md +101 -101
  42. package/.claude/package.json +46 -0
  43. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -252
  44. package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_ab_tester.py +356 -0
  45. package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_template_manager.py +274 -0
  46. package/.claude/roles/ai-engineer/skills/01-prompt-engineering/token_cost_estimator.py +324 -0
  47. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -448
  48. package/.claude/roles/ai-engineer/skills/02-rag-pipeline/document_chunker.py +336 -0
  49. package/.claude/roles/ai-engineer/skills/02-rag-pipeline/rag_pipeline.sql +213 -0
  50. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -599
  51. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -735
  52. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -711
  53. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -777
  54. package/{tech_hub_skills → .claude}/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -264
  55. package/{tech_hub_skills → .claude}/roles/azure/skills/02-data-factory/README.md +264 -264
  56. package/{tech_hub_skills → .claude}/roles/azure/skills/03-synapse-analytics/README.md +264 -264
  57. package/{tech_hub_skills → .claude}/roles/azure/skills/04-databricks/README.md +264 -264
  58. package/{tech_hub_skills → .claude}/roles/azure/skills/05-functions/README.md +264 -264
  59. package/{tech_hub_skills → .claude}/roles/azure/skills/06-kubernetes-service/README.md +264 -264
  60. package/{tech_hub_skills → .claude}/roles/azure/skills/07-openai-service/README.md +264 -264
  61. package/{tech_hub_skills → .claude}/roles/azure/skills/08-machine-learning/README.md +264 -264
  62. package/{tech_hub_skills → .claude}/roles/azure/skills/09-storage-adls/README.md +264 -264
  63. package/{tech_hub_skills → .claude}/roles/azure/skills/10-networking/README.md +264 -264
  64. package/{tech_hub_skills → .claude}/roles/azure/skills/11-sql-cosmos/README.md +264 -264
  65. package/{tech_hub_skills → .claude}/roles/azure/skills/12-event-hubs/README.md +264 -264
  66. package/{tech_hub_skills → .claude}/roles/code-review/skills/01-automated-code-review/README.md +394 -394
  67. package/{tech_hub_skills → .claude}/roles/code-review/skills/02-pr-review-workflow/README.md +427 -427
  68. package/{tech_hub_skills → .claude}/roles/code-review/skills/03-code-quality-gates/README.md +518 -518
  69. package/{tech_hub_skills → .claude}/roles/code-review/skills/04-reviewer-assignment/README.md +504 -504
  70. package/{tech_hub_skills → .claude}/roles/code-review/skills/05-review-analytics/README.md +540 -540
  71. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -550
  72. package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/bronze_ingestion.py +337 -0
  73. package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/medallion_queries.sql +300 -0
  74. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -580
  75. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/03-data-quality/README.md +579 -579
  76. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -608
  77. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/05-performance-optimization/README.md +547 -547
  78. package/{tech_hub_skills → .claude}/roles/data-governance/skills/01-data-catalog/README.md +112 -112
  79. package/{tech_hub_skills → .claude}/roles/data-governance/skills/02-data-lineage/README.md +129 -129
  80. package/{tech_hub_skills → .claude}/roles/data-governance/skills/03-data-quality-framework/README.md +182 -182
  81. package/{tech_hub_skills → .claude}/roles/data-governance/skills/04-access-control/README.md +39 -39
  82. package/{tech_hub_skills → .claude}/roles/data-governance/skills/05-master-data-management/README.md +40 -40
  83. package/{tech_hub_skills → .claude}/roles/data-governance/skills/06-compliance-privacy/README.md +46 -46
  84. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/01-eda-automation/README.md +230 -230
  85. package/.claude/roles/data-scientist/skills/01-eda-automation/eda_generator.py +446 -0
  86. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -264
  87. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/03-feature-engineering/README.md +264 -264
  88. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -264
  89. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/05-customer-analytics/README.md +264 -264
  90. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -264
  91. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/07-experimentation/README.md +264 -264
  92. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/08-data-visualization/README.md +264 -264
  93. package/{tech_hub_skills → .claude}/roles/devops/skills/01-cicd-pipeline/README.md +264 -264
  94. package/{tech_hub_skills → .claude}/roles/devops/skills/02-container-orchestration/README.md +264 -264
  95. package/{tech_hub_skills → .claude}/roles/devops/skills/03-infrastructure-as-code/README.md +264 -264
  96. package/{tech_hub_skills → .claude}/roles/devops/skills/04-gitops/README.md +264 -264
  97. package/{tech_hub_skills → .claude}/roles/devops/skills/05-environment-management/README.md +264 -264
  98. package/{tech_hub_skills → .claude}/roles/devops/skills/06-automated-testing/README.md +264 -264
  99. package/{tech_hub_skills → .claude}/roles/devops/skills/07-release-management/README.md +264 -264
  100. package/{tech_hub_skills → .claude}/roles/devops/skills/08-monitoring-alerting/README.md +264 -264
  101. package/{tech_hub_skills → .claude}/roles/devops/skills/09-devsecops/README.md +265 -265
  102. package/{tech_hub_skills → .claude}/roles/finops/skills/01-cost-visibility/README.md +264 -264
  103. package/{tech_hub_skills → .claude}/roles/finops/skills/02-resource-tagging/README.md +264 -264
  104. package/{tech_hub_skills → .claude}/roles/finops/skills/03-budget-management/README.md +264 -264
  105. package/{tech_hub_skills → .claude}/roles/finops/skills/04-reserved-instances/README.md +264 -264
  106. package/{tech_hub_skills → .claude}/roles/finops/skills/05-spot-optimization/README.md +264 -264
  107. package/{tech_hub_skills → .claude}/roles/finops/skills/06-storage-tiering/README.md +264 -264
  108. package/{tech_hub_skills → .claude}/roles/finops/skills/07-compute-rightsizing/README.md +264 -264
  109. package/{tech_hub_skills → .claude}/roles/finops/skills/08-chargeback/README.md +264 -264
  110. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -566
  111. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -655
  112. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/03-model-training/README.md +704 -704
  113. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/04-model-serving/README.md +845 -845
  114. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -874
  115. package/{tech_hub_skills → .claude}/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -264
  116. package/{tech_hub_skills → .claude}/roles/mlops/skills/02-experiment-tracking/README.md +264 -264
  117. package/{tech_hub_skills → .claude}/roles/mlops/skills/03-model-registry/README.md +264 -264
  118. package/{tech_hub_skills → .claude}/roles/mlops/skills/04-feature-store/README.md +264 -264
  119. package/{tech_hub_skills → .claude}/roles/mlops/skills/05-model-deployment/README.md +264 -264
  120. package/{tech_hub_skills → .claude}/roles/mlops/skills/06-model-observability/README.md +264 -264
  121. package/{tech_hub_skills → .claude}/roles/mlops/skills/07-data-versioning/README.md +264 -264
  122. package/{tech_hub_skills → .claude}/roles/mlops/skills/08-ab-testing/README.md +264 -264
  123. package/{tech_hub_skills → .claude}/roles/mlops/skills/09-automated-retraining/README.md +264 -264
  124. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -153
  125. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -57
  126. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -59
  127. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/04-developer-experience/README.md +57 -57
  128. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/05-incident-management/README.md +73 -73
  129. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/06-capacity-management/README.md +59 -59
  130. package/{tech_hub_skills → .claude}/roles/product-designer/skills/01-requirements-discovery/README.md +407 -407
  131. package/{tech_hub_skills → .claude}/roles/product-designer/skills/02-user-research/README.md +382 -382
  132. package/{tech_hub_skills → .claude}/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -437
  133. package/{tech_hub_skills → .claude}/roles/product-designer/skills/04-ux-design/README.md +496 -496
  134. package/{tech_hub_skills → .claude}/roles/product-designer/skills/05-product-market-fit/README.md +376 -376
  135. package/{tech_hub_skills → .claude}/roles/product-designer/skills/06-stakeholder-management/README.md +412 -412
  136. package/{tech_hub_skills → .claude}/roles/security-architect/skills/01-pii-detection/README.md +319 -319
  137. package/{tech_hub_skills → .claude}/roles/security-architect/skills/02-threat-modeling/README.md +264 -264
  138. package/{tech_hub_skills → .claude}/roles/security-architect/skills/03-infrastructure-security/README.md +264 -264
  139. package/{tech_hub_skills → .claude}/roles/security-architect/skills/04-iam/README.md +264 -264
  140. package/{tech_hub_skills → .claude}/roles/security-architect/skills/05-application-security/README.md +264 -264
  141. package/{tech_hub_skills → .claude}/roles/security-architect/skills/06-secrets-management/README.md +264 -264
  142. package/{tech_hub_skills → .claude}/roles/security-architect/skills/07-security-monitoring/README.md +264 -264
  143. package/{tech_hub_skills → .claude}/roles/system-design/skills/01-architecture-patterns/README.md +337 -337
  144. package/{tech_hub_skills → .claude}/roles/system-design/skills/02-requirements-engineering/README.md +264 -264
  145. package/{tech_hub_skills → .claude}/roles/system-design/skills/03-scalability/README.md +264 -264
  146. package/{tech_hub_skills → .claude}/roles/system-design/skills/04-high-availability/README.md +264 -264
  147. package/{tech_hub_skills → .claude}/roles/system-design/skills/05-cost-optimization-design/README.md +264 -264
  148. package/{tech_hub_skills → .claude}/roles/system-design/skills/06-api-design/README.md +264 -264
  149. package/{tech_hub_skills → .claude}/roles/system-design/skills/07-observability-architecture/README.md +264 -264
  150. package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -336
  151. package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/README.md +521 -521
  152. package/.claude/roles/system-design/skills/08-process-automation/ai_prompt_generator.py +744 -0
  153. package/.claude/roles/system-design/skills/08-process-automation/automation_recommender.py +688 -0
  154. package/.claude/roles/system-design/skills/08-process-automation/plan_generator.py +679 -0
  155. package/.claude/roles/system-design/skills/08-process-automation/process_analyzer.py +528 -0
  156. package/.claude/roles/system-design/skills/08-process-automation/process_parser.py +684 -0
  157. package/.claude/roles/system-design/skills/08-process-automation/role_matcher.py +615 -0
  158. package/.claude/skills/README.md +336 -0
  159. package/.claude/skills/ai-engineer.md +104 -0
  160. package/.claude/skills/aws.md +143 -0
  161. package/.claude/skills/azure.md +149 -0
  162. package/.claude/skills/backend-developer.md +108 -0
  163. package/.claude/skills/code-review.md +399 -0
  164. package/.claude/skills/compliance-automation.md +747 -0
  165. package/.claude/skills/compliance-officer.md +108 -0
  166. package/.claude/skills/data-engineer.md +113 -0
  167. package/.claude/skills/data-governance.md +102 -0
  168. package/.claude/skills/data-scientist.md +123 -0
  169. package/.claude/skills/database-admin.md +109 -0
  170. package/.claude/skills/devops.md +160 -0
  171. package/.claude/skills/docker.md +160 -0
  172. package/.claude/skills/enterprise-dashboard.md +613 -0
  173. package/.claude/skills/finops.md +184 -0
  174. package/.claude/skills/frontend-developer.md +108 -0
  175. package/.claude/skills/gcp.md +143 -0
  176. package/.claude/skills/ml-engineer.md +115 -0
  177. package/.claude/skills/mlops.md +187 -0
  178. package/.claude/skills/network-engineer.md +109 -0
  179. package/.claude/skills/optimization-advisor.md +329 -0
  180. package/.claude/skills/orchestrator.md +623 -0
  181. package/.claude/skills/platform-engineer.md +102 -0
  182. package/.claude/skills/process-automation.md +226 -0
  183. package/.claude/skills/process-changelog.md +184 -0
  184. package/.claude/skills/process-documentation.md +484 -0
  185. package/.claude/skills/process-kanban.md +324 -0
  186. package/.claude/skills/process-versioning.md +214 -0
  187. package/.claude/skills/product-designer.md +104 -0
  188. package/.claude/skills/project-starter.md +443 -0
  189. package/.claude/skills/qa-engineer.md +109 -0
  190. package/.claude/skills/security-architect.md +135 -0
  191. package/.claude/skills/sre.md +109 -0
  192. package/.claude/skills/system-design.md +126 -0
  193. package/.claude/skills/technical-writer.md +101 -0
  194. package/.gitattributes +2 -0
  195. package/GITHUB_COPILOT.md +106 -0
  196. package/README.md +192 -291
  197. package/package.json +16 -46
  198. package/bin/cli.js +0 -241
package/{tech_hub_skills → .claude}/roles/data-engineer/skills/05-performance-optimization/README.md
@@ -1,547 +1,547 @@
# Skill 5: Performance Optimization

## 🎯 Overview
Master advanced performance tuning for data pipelines: query optimization, partitioning strategies, caching, and cost-efficient compute, with the goal of up to 10x faster processing at lower cost.

## 🔗 Connections
- **Data Engineer**: Optimize lakehouse and pipelines (de-01, de-02, de-03, de-04)
- **ML Engineer**: Faster feature engineering and training (ml-01, ml-02, ml-03)
- **MLOps**: Optimize model serving latency (mo-04)
- **AI Engineer**: Speed up RAG retrieval (ai-02)
- **Data Scientist**: Faster data exploration (ds-01, ds-02)
- **FinOps**: Reduce compute costs through efficiency (fo-01, fo-06)
- **DevOps**: Infrastructure optimization (do-04, do-08)
- **System Design**: Scalability and caching patterns (sd-03, sd-04)

## 🛠️ Tools Included

### 1. `query_optimizer.py`
Automated query optimization and execution plan analysis.

### 2. `partition_optimizer.py`
Smart partitioning strategies and optimization recommendations.

### 3. `cache_manager.py`
Multi-level caching with cache warming and invalidation.

### 4. `performance_profiler.py`
End-to-end pipeline profiling and bottleneck detection.

### 5. `index_advisor.py`
Index recommendation engine for query acceleration.

## 📊 Performance Optimization Framework

```
Identify   →   Measure   →   Optimize   →   Validate   →   Monitor
    ↓             ↓              ↓              ↓             ↓
Bottleneck    Profile       Partition      Benchmark      Track
Detection     Queries       Cache          Results        Trends
              Storage       Indexes        A/B Test       Alert
```

## 🚀 Quick Start

```python
from query_optimizer import QueryOptimizer
from partition_optimizer import PartitionOptimizer
from cache_manager import CacheManager

# Analyze slow query
optimizer = QueryOptimizer()

query = """
SELECT customer_id, SUM(amount) as total
FROM transactions
WHERE event_date >= '2024-01-01'
GROUP BY customer_id
"""

# Get optimization recommendations
analysis = optimizer.analyze(query, table="transactions")

print(f"Current execution time: {analysis.baseline_time}s")
print("\nRecommendations:")
for rec in analysis.recommendations:
    print(f"  - {rec.type}: {rec.description}")
    print(f"    Expected improvement: {rec.speedup}x")
    print(f"    Cost: {rec.cost_impact}")

# Apply optimizations
optimized_query = optimizer.apply_recommendations(query, analysis.recommendations)

# Optimize partitioning
part_optimizer = PartitionOptimizer()
part_analysis = part_optimizer.analyze_table("transactions")

if part_analysis.needs_repartitioning:
    print(f"\nCurrent partitioning: {part_analysis.current_strategy}")
    print(f"Recommended: {part_analysis.recommended_strategy}")
    print(f"Expected speedup: {part_analysis.speedup}x")

    # Repartition table
    part_optimizer.repartition_table(
        table="transactions",
        partition_by=["event_date"],
        bucket_by=["customer_id"],
        num_buckets=32
    )

# Enable caching
cache = CacheManager()
cache.enable_table_cache(
    table="transactions",
    cache_level="hot",  # hot/warm/cold
    ttl_hours=24
)

# Warm cache with common queries
cache.warm_cache([
    "SELECT * FROM transactions WHERE event_date = CURRENT_DATE",
    "SELECT customer_id, COUNT(*) FROM transactions GROUP BY customer_id"
])
```

## 📚 Best Practices

### Query Optimization

1. **Predicate Pushdown**
   - Filter early to reduce data scanned
   - Push filters to the data source when possible
   - Use partition pruning effectively
   - Leverage data skipping with statistics
   - Reference: Data Engineer best practices

2. **Join Optimization** (see the sketch after this list)
   - Use broadcast joins for small tables (<10GB)
   - Implement bucketing for large-large joins
   - Use sort-merge joins for sorted data
   - Avoid cross joins
   - Reference: System Design sd-03 (Scalability)

3. **Aggregation Optimization**
   - Apply partial aggregation before the shuffle
   - Use approximate aggregations when exact results aren't needed
   - Pre-aggregate in the Gold layer
   - Cache frequently aggregated results
   - Reference: Data Engineer de-01 (Lakehouse)

4. **Query Plan Analysis**
   - Examine execution plans regularly
   - Identify shuffle-heavy operations
   - Optimize stage boundaries
   - Monitor skew in data distribution
   - Reference: Data Engineer best practices

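A minimal PySpark sketch of items 1 and 2; the `transactions` and `customer_segments` tables are hypothetical stand-ins:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

tx = spark.read.table("transactions")             # large fact table (assumed)
segments = spark.read.table("customer_segments")  # small dimension, well under 10GB

result = (
    tx.filter(F.col("event_date") >= "2024-01-01")  # filter early: pushdown + pruning
      .select("customer_id", "amount")              # project only the needed columns
      .join(F.broadcast(segments), "customer_id")   # broadcast the small side, no shuffle
      .groupBy("customer_id")
      .agg(F.sum("amount").alias("total"))
)
result.explain()  # look for PushedFilters and BroadcastHashJoin in the plan
```
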
### Partitioning Strategies

5. **Time-Based Partitioning**
   - Partition by date/datetime for time-series data
   - Use hierarchical partitioning (year/month/day)
   - Balance partition size (target 1GB per partition)
   - Monitor partition count (<10,000 partitions)
   - Reference: Data Engineer de-01 (Lakehouse)

6. **Hash Partitioning**
   - Use for evenly distributed data
   - Choose a partition key with high cardinality
   - Avoid skewed partition keys
   - Consider composite partition keys
   - Reference: Data Engineer best practices

7. **Bucketing** (see the sketch after this list)
   - Bucket by join keys
   - Optimize bucket count (32-128 typical)
   - Combine with partitioning for best results
   - Sort within buckets for range queries
   - Reference: Data Engineer best practices

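A sketch of combining time partitioning with bucketing on the hypothetical `transactions` table; note that Spark requires bucketed writes to go through `saveAsTable`:

```python
# Write a partitioned, bucketed copy optimized for date filters and customer joins
(spark.read.table("transactions")
      .write
      .partitionBy("event_date")    # enables partition pruning on time filters
      .bucketBy(32, "customer_id")  # co-locates join keys; 32-128 buckets is typical
      .sortBy("customer_id")        # sort within buckets for range scans
      .mode("overwrite")
      .saveAsTable("transactions_bucketed"))
```
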
### Caching Strategies (System Design Integration)

8. **Multi-Level Caching**
   - L1: Result cache (query results)
   - L2: Disk cache (Delta cache)
   - L3: Table cache (in-memory tables)
   - Cache hot data, tier cold data
   - Reference: System Design sd-04 (Caching Strategies)

9. **Cache Invalidation**
   - Time-based TTL for changing data
   - Event-based invalidation for updates
   - Partial invalidation for partitioned data
   - Monitor cache hit rates
   - Reference: System Design sd-04 (Caching Strategies)

10. **Cache Warming** (see the sketch after this list)
    - Pre-load the cache on deployment
    - Schedule cache refresh for predictable queries
    - Use predictive caching based on access patterns
    - Monitor cache utilization
    - Reference: System Design sd-04 (Caching Strategies)

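The Spark-level pieces of this pattern in a minimal sketch (table name hypothetical):

```python
# L3 table cache: mark a hot table for executor memory/disk caching (lazy)
df = spark.table("gold.daily_user_events")
df.cache()

# Cache warming: materialize the cache before user queries arrive
df.count()

# Invalidation: drop the cached copy after the underlying data changes
df.unpersist()
```
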
### Indexing (when applicable)

11. **Covering Indexes** (see the sketch after this list)
    - Include every column the query reads so the index alone can answer it
    - Reduce table lookups
    - Balance index size vs. benefit
    - Monitor index usage
    - Reference: Database best practices

12. **Composite Indexes**
    - Order columns by selectivity
    - Include filter and sort columns
    - Avoid index duplication
    - Schedule regular index maintenance
    - Reference: Database best practices

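Delta tables have no secondary indexes, so these items apply to the relational engines in the stack (e.g. Azure SQL). A self-contained SQLite illustration with a hypothetical schema: packing every column the query reads into the index lets the plan skip the base table entirely.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (customer_id INT, event_date TEXT, amount REAL)")

# Composite key ordered by selectivity; amount is added so the index covers the query
con.execute("CREATE INDEX ix_tx ON transactions (customer_id, event_date, amount)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT amount FROM transactions "
    "WHERE customer_id = 42 AND event_date >= '2024-01-01'"
).fetchall()
print(plan)  # expect: SEARCH transactions USING COVERING INDEX ix_tx
```
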
### Storage Optimization

13. **Compression** (see the sketch after this list)
    - Use Snappy for a speed/ratio balance
    - Use Zstd for better compression
    - Use Gzip for archival data
    - Monitor compression ratios
    - Reference: Data Engineer de-01 (Lakehouse)

14. **File Sizing**
    - Target 1GB files for Parquet/Delta
    - Avoid small files (<128MB)
    - Use OPTIMIZE for Delta tables
    - Z-ORDER by common filter columns
    - Reference: Data Engineer de-01 (Lakehouse)

15. **Columnar Storage**
    - Use Parquet/Delta for analytical workloads
    - Project only needed columns
    - Leverage column statistics
    - Enable column pruning
    - Reference: Data Engineer best practices

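A short sketch of the storage levers above; the table name is hypothetical, and `OPTIMIZE ... ZORDER BY` is Delta Lake SQL:

```python
# Compact small files and cluster data by the most common filter columns
spark.sql("OPTIMIZE silver.transactions ZORDER BY (customer_id, event_date)")

# Pick the Parquet codec session-wide: snappy for speed, zstd for ratio
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
```
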
### Compute Optimization (FinOps Integration)

16. **Right-Size Clusters**
    - Profile workload characteristics
    - Match node types to the workload
    - Use memory-optimized nodes for caching
    - Use compute-optimized nodes for CPU-bound jobs
    - Reference: FinOps fo-06 (Compute Optimization)

17. **Auto-Scaling** (see the sketch after this list)
    - Enable cluster auto-scaling
    - Set appropriate min/max nodes
    - Monitor scale-up/down patterns
    - Balance cost vs. performance
    - Reference: FinOps fo-06 (Compute Optimization)

18. **Spot Instances**
    - Use for batch workloads
    - Implement checkpointing
    - Handle interruptions gracefully
    - Expect 60-90% cost savings
    - Reference: FinOps fo-06 (Compute Optimization)

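For illustration, the relevant fields of a Databricks Clusters API payload that combines autoscaling with Azure spot capacity; names and sizes are placeholders to tune per workload:

```python
# Sketch of a Databricks cluster spec: autoscaling workers on spot capacity
cluster_spec = {
    "cluster_name": "batch-etl",                     # placeholder
    "spark_version": "14.3.x-scala2.12",             # placeholder runtime
    "node_type_id": "Standard_E8ds_v5",              # memory-optimized example
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # spot workers, on-demand fallback
        "first_on_demand": 1,                        # keep the driver on-demand
    },
}
```
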
### Azure-Specific Optimizations

19. **Delta Cache (Databricks)** (see the sketch after this list)
    - Enable for frequently accessed data
    - Cache hot partitions
    - Monitor cache hit metrics
    - Right-size cache storage
    - Reference: Azure az-04 (Databricks)

20. **Synapse SQL Optimization**
    - Use result set caching
    - Implement materialized views
    - Optimize distribution keys
    - Monitor DWU utilization
    - Reference: Azure az-03 (Synapse Analytics)

21. **Photon Engine (Databricks)**
    - Enable for SQL and DataFrame workloads
    - 2-5x faster for compatible workloads
    - Monitor Photon utilization
    - Run a cost-benefit analysis per workload
    - Reference: Azure az-04 (Databricks)

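On Databricks runtimes the disk (Delta) cache is toggled with a documented Spark setting; a minimal sketch:

```python
# Enable the Databricks disk cache for the current cluster/session
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```
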
## 💰 Cost-Performance Trade-offs

### Optimize Query Performance and Cost
```python
from query_optimizer import QueryOptimizer
from cost_analyzer import CostPerformanceAnalyzer

optimizer = QueryOptimizer()
cost_analyzer = CostPerformanceAnalyzer()

# Baseline query
baseline_query = """
SELECT user_id, COUNT(*) as event_count
FROM events
WHERE event_date >= '2024-01-01'
GROUP BY user_id
"""

# Analyze cost and performance
baseline = cost_analyzer.analyze(baseline_query)
print("Baseline:")
print(f"  Execution time: {baseline.execution_time}s")
print(f"  Cost: ${baseline.cost:.4f}")
print(f"  Data scanned: {baseline.data_scanned_gb:.2f} GB")

# Optimization 1: Partition pruning
optimized_v1 = """
SELECT user_id, COUNT(*) as event_count
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'  -- Explicit range
GROUP BY user_id
"""

result_v1 = cost_analyzer.analyze(optimized_v1)
print("\nV1 (Partition pruning):")
print(f"  Execution time: {result_v1.execution_time}s ({baseline.execution_time/result_v1.execution_time:.1f}x faster)")
print(f"  Cost: ${result_v1.cost:.4f} ({(1-result_v1.cost/baseline.cost)*100:.1f}% cheaper)")
print(f"  Data scanned: {result_v1.data_scanned_gb:.2f} GB")

# Optimization 2: Pre-aggregated table
# Create a Gold layer aggregation (`spark` is an active SparkSession)
spark.sql("""
CREATE OR REPLACE TABLE gold.daily_user_events
AS
SELECT event_date, user_id, COUNT(*) as event_count
FROM events
GROUP BY event_date, user_id
""")

optimized_v2 = """
SELECT user_id, SUM(event_count) as event_count
FROM gold.daily_user_events
WHERE event_date >= '2024-01-01'
GROUP BY user_id
"""

result_v2 = cost_analyzer.analyze(optimized_v2)
print("\nV2 (Pre-aggregated):")
print(f"  Execution time: {result_v2.execution_time}s ({baseline.execution_time/result_v2.execution_time:.1f}x faster)")
print(f"  Cost: ${result_v2.cost:.4f} ({(1-result_v2.cost/baseline.cost)*100:.1f}% cheaper)")

# Optimization 3: Caching
from cache_manager import CacheManager
cache = CacheManager()
cache.cache_table("gold.daily_user_events")

result_v3 = cost_analyzer.analyze(optimized_v2)  # Same query, cached
print("\nV3 (Cached):")
print(f"  Execution time: {result_v3.execution_time}s ({baseline.execution_time/result_v3.execution_time:.1f}x faster)")
print(f"  Cost: ${result_v3.cost:.4f} ({(1-result_v3.cost/baseline.cost)*100:.1f}% cheaper)")
print(f"  Cache hit: {result_v3.cache_hit}")
```

### Delta Table Optimization
```python
from delta_optimizer import DeltaOptimizer

optimizer = DeltaOptimizer()

# Optimize table (compact small files)
table_name = "silver.transactions"
metrics = optimizer.optimize_table(
    table=table_name,
    z_order_by=["customer_id", "event_date"]  # Common filter columns
)

print(f"Optimization results for {table_name}:")
print(f"  Files before: {metrics.files_before:,}")
print(f"  Files after: {metrics.files_after:,}")
print(f"  Size before: {metrics.size_before_gb:.2f} GB")
print(f"  Size after: {metrics.size_after_gb:.2f} GB")
print(f"  Compression improvement: {metrics.compression_ratio:.2f}x")

# Query performance comparison
from performance_profiler import PerformanceProfiler
profiler = PerformanceProfiler()

query = """
SELECT customer_id, SUM(amount)
FROM silver.transactions
WHERE event_date >= '2024-01-01'
  AND customer_id IN (SELECT customer_id FROM high_value_customers)
GROUP BY customer_id
"""

before_metrics = profiler.profile_query(query, version="before")
after_metrics = profiler.profile_query(query, version="after")

print("\nQuery performance:")
print(f"  Before optimization: {before_metrics.execution_time:.2f}s")
print(f"  After optimization: {after_metrics.execution_time:.2f}s")
print(f"  Speedup: {before_metrics.execution_time/after_metrics.execution_time:.1f}x")
print(f"  Data skipped: {after_metrics.data_skipped_percentage:.1f}%")

# Cost impact
print("\nCost impact:")
print(f"  Query cost before: ${before_metrics.cost:.4f}")
print(f"  Query cost after: ${after_metrics.cost:.4f}")
print(f"  Monthly savings (1000 queries): ${(before_metrics.cost - after_metrics.cost) * 1000:.2f}")
```

### Adaptive Query Execution
```python
from pyspark.sql import SparkSession

# Enable Adaptive Query Execution (AQE)
spark = SparkSession.builder \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.sql.adaptive.localShuffleReader.enabled", "true") \
    .getOrCreate()

# Query with AQE benefits
query = """
SELECT t1.customer_id, t1.total_amount, t2.segment
FROM (
    SELECT customer_id, SUM(amount) as total_amount
    FROM transactions
    GROUP BY customer_id
) t1
JOIN customer_segments t2
    ON t1.customer_id = t2.customer_id
"""

# AQE will automatically:
# 1. Dynamically coalesce shuffle partitions
# 2. Convert sort-merge join to broadcast join if one side is small
# 3. Optimize skewed joins by splitting large partitions

df = spark.sql(query)

# Inspect the plan. df.explain() prints to stdout and returns None, so AQE's
# runtime re-optimizations (coalesced partitions, joins converted to broadcast,
# skew splits) are read from the printed plan and the Spark UI, not from a
# return value.
df.explain(mode="cost")
```

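To confirm that a running session actually picked up these settings, the values can be read back through the standard conf API:

```python
# Read back the AQE-related settings from the active session
for key in (
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.adaptive.skewJoin.enabled",
):
    print(key, "=", spark.conf.get(key))
```
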
## 📊 Performance Benchmarks

### Common Optimization Impact

| Optimization Technique | Typical Speedup | Cost Reduction | Effort |
|------------------------|-----------------|----------------|--------|
| Partition pruning | 5-10x | 80-90% | Low |
| Z-ordering | 2-5x | 50-80% | Low |
| File compaction | 2-3x | 30-50% | Low |
| Broadcast joins (small tables) | 10-100x | 90-99% | Low |
| Caching hot data | 10-50x | 90-95% | Medium |
| Pre-aggregation (Gold layer) | 10-100x | 90-99% | Medium |
| Materialized views | 10-100x | 90-99% | Medium |
| Bucketing | 2-5x | 50-80% | Medium |
| Adaptive Query Execution | 1.5-3x | 30-60% | Low (config) |
| Photon engine | 2-5x | -20% to 0% | Low (enable) |
| Delta cache | 3-10x | 70-90% | Low (enable) |
| Column pruning | 2-10x | 50-90% | Low |
| Compression (Zstd) | 1-2x | 40-60% | Low |

Negative values mean a possible net cost increase: Photon shortens wall-clock time but is billed at a higher rate, so it needs the per-workload cost-benefit check called out in item 21 above.

## 📊 Enhanced Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Query Performance** | Query latency (p95) | <10s | Query history |
| | Data scanned per query | <10GB | Query metrics |
| | Shuffle data size | <1GB | Spark UI |
| **Storage** | File count per partition | <1000 | Delta logs |
| | Average file size | >128MB | Delta logs |
| | Compression ratio | >3x | Storage metrics |
| **Cache** | Cache hit rate | >80% | Cache metrics |
| | Cache eviction rate | <10% | Cache metrics |
| | Cache memory utilization | 70-90% | Cluster metrics |
| **Cost** | Cost per query | <$0.10 | FinOps tracker |
| | Cost per TB scanned | <$5 | Cost analysis |
| | Compute utilization | 70-85% | Cluster metrics |
| **Cluster** | CPU utilization | 60-80% | Azure Monitor |
| | Memory utilization | 70-85% | Azure Monitor |
| | Spill to disk | <5% | Spark metrics |

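Several of the storage targets above can be checked straight from Delta metadata; a sketch using `DESCRIBE DETAIL` (table name hypothetical):

```python
# File-count and file-size check against the targets in the table above
detail = spark.sql("DESCRIBE DETAIL silver.transactions").first()
avg_file_mb = detail["sizeInBytes"] / detail["numFiles"] / 1024**2
print(f"files: {detail['numFiles']:,}, avg size: {avg_file_mb:.0f} MB (target >128 MB)")
```
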
## 🔄 Performance Optimization Workflow

### End-to-End Optimization Process
```
1. Identify Performance Issues (do-08)

2. Profile Queries and Pipelines (de-05)

3. Analyze Execution Plans

4. Apply Optimizations
   ├── Partitioning (de-01)
   ├── Caching (sd-04)
   ├── Indexing
   └── Query rewriting

5. Benchmark Results

6. A/B Testing

7. Monitor Performance (do-08)

8. Track Cost Impact (fo-01)

9. Continuous Optimization
```

## 🎯 Quick Wins

1. **Enable Delta cache** - 3-10x faster for hot data
2. **Optimize Delta tables (OPTIMIZE + Z-ORDER)** - 2-5x faster queries
3. **Enable Adaptive Query Execution** - 1.5-3x speedup automatically
4. **Partition pruning** - 5-10x faster, 80-90% cost reduction
5. **Use broadcast joins for small tables** - 10-100x faster
6. **Cache frequently accessed tables** - 10-50x faster
7. **Pre-aggregate in Gold layer** - 10-100x faster analytics
8. **Enable Photon engine** - 2-5x faster (Databricks)
9. **Column pruning** - Select only needed columns, 2-10x faster
10. **Compress with Zstd** - 40-60% storage savings

## 🔧 Performance Tuning Checklist

### Before Optimization
- [ ] Identify slow queries (>30s execution time)
- [ ] Profile data access patterns
- [ ] Analyze execution plans
- [ ] Measure baseline performance
- [ ] Document current costs

### Storage Optimization
- [ ] Run OPTIMIZE on Delta tables weekly (see the sketch after this checklist)
- [ ] Z-ORDER by common filter columns
- [ ] VACUUM to remove old files (>7 days)
- [ ] Check file sizes (target 1GB)
- [ ] Monitor partition count (<10,000)
- [ ] Enable compression (Snappy/Zstd)

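A minimal sketch of the weekly maintenance job behind the first three items; the table list is a hypothetical placeholder, and 168 hours matches Delta's default 7-day retention:

```python
# Weekly Delta maintenance: compact, cluster, then clean up stale files
for table in ["silver.transactions"]:  # placeholder list of managed tables
    spark.sql(f"OPTIMIZE {table} ZORDER BY (customer_id)")  # compact + cluster hot filter column
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")           # remove files older than 7 days
```
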
### Query Optimization
- [ ] Enable Adaptive Query Execution
- [ ] Use partition pruning
- [ ] Implement broadcast joins where applicable
- [ ] Pre-aggregate in Gold layer
- [ ] Cache hot tables
- [ ] Project only needed columns
- [ ] Push down filters early

### Cluster Optimization
- [ ] Right-size cluster nodes
- [ ] Enable auto-scaling
- [ ] Use spot instances for batch
- [ ] Monitor CPU/memory utilization
- [ ] Enable Delta cache
- [ ] Enable Photon (if Databricks)

### Monitoring
- [ ] Track query performance trends
- [ ] Monitor cache hit rates
- [ ] Alert on performance degradation
- [ ] Track cost per query
- [ ] Review execution plans regularly