tech-hub-skills 1.0.0

Files changed (133)
  1. package/LICENSE +21 -0
  2. package/README.md +250 -0
  3. package/bin/cli.js +241 -0
  4. package/bin/copilot.js +182 -0
  5. package/bin/postinstall.js +42 -0
  6. package/package.json +46 -0
  7. package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
  8. package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
  9. package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
  10. package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
  11. package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
  12. package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
  13. package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
  14. package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
  15. package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
  16. package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
  17. package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
  18. package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
  19. package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
  20. package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
  21. package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
  22. package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
  23. package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
  24. package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
  25. package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
  26. package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
  27. package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
  28. package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
  29. package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
  30. package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
  31. package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
  32. package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
  33. package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
  34. package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
  35. package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
  36. package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
  37. package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
  38. package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
  39. package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
  40. package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
  41. package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
  42. package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
  43. package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
  44. package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
  45. package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
  46. package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
  47. package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
  48. package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
  49. package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
  50. package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
  51. package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
  52. package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
  53. package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
  54. package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
  55. package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
  56. package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
  57. package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
  58. package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
  59. package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
  60. package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
  61. package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
  62. package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
  63. package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
  64. package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
  65. package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
  66. package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
  67. package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
  68. package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
  69. package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
  70. package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
  71. package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
  72. package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
  73. package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
  74. package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
  75. package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
  76. package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
  77. package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
  78. package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
  79. package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
  80. package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
  81. package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
  82. package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
  83. package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
  84. package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
  85. package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
  86. package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
  87. package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
  88. package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
  89. package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
  90. package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
  91. package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
  92. package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
  93. package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
  94. package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
  95. package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
  96. package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
  97. package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
  98. package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
  99. package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
  100. package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
  101. package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
  102. package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
  103. package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
  104. package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
  105. package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
  106. package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
  107. package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
  108. package/tech_hub_skills/skills/README.md +336 -0
  109. package/tech_hub_skills/skills/ai-engineer.md +104 -0
  110. package/tech_hub_skills/skills/azure.md +149 -0
  111. package/tech_hub_skills/skills/code-review.md +399 -0
  112. package/tech_hub_skills/skills/compliance-automation.md +747 -0
  113. package/tech_hub_skills/skills/data-engineer.md +113 -0
  114. package/tech_hub_skills/skills/data-governance.md +102 -0
  115. package/tech_hub_skills/skills/data-scientist.md +123 -0
  116. package/tech_hub_skills/skills/devops.md +160 -0
  117. package/tech_hub_skills/skills/docker.md +160 -0
  118. package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
  119. package/tech_hub_skills/skills/finops.md +184 -0
  120. package/tech_hub_skills/skills/ml-engineer.md +115 -0
  121. package/tech_hub_skills/skills/mlops.md +187 -0
  122. package/tech_hub_skills/skills/optimization-advisor.md +329 -0
  123. package/tech_hub_skills/skills/orchestrator.md +497 -0
  124. package/tech_hub_skills/skills/platform-engineer.md +102 -0
  125. package/tech_hub_skills/skills/process-automation.md +226 -0
  126. package/tech_hub_skills/skills/process-changelog.md +184 -0
  127. package/tech_hub_skills/skills/process-documentation.md +484 -0
  128. package/tech_hub_skills/skills/process-kanban.md +324 -0
  129. package/tech_hub_skills/skills/process-versioning.md +214 -0
  130. package/tech_hub_skills/skills/product-designer.md +104 -0
  131. package/tech_hub_skills/skills/project-starter.md +443 -0
  132. package/tech_hub_skills/skills/security-architect.md +135 -0
  133. package/tech_hub_skills/skills/system-design.md +126 -0
@@ -0,0 +1,550 @@
1
+ # Skill 1: Lakehouse Architecture (Bronze-Silver-Gold)
2
+
3
+ ## 🎯 Overview
4
+ Implement a medallion architecture (Bronze-Silver-Gold) for a scalable, governed data lakehouse.
5
+
6
+ ## 🔗 Connections
7
+ - **All Roles**: Provides clean, structured data foundation for analytics and ML
8
+ - **ML Engineer**: Feeds feature store with Gold layer (ml-02, ml-03)
9
+ - **MLOps**: Data versioning and lineage tracking (mo-02, mo-06)
10
+ - **AI Engineer**: Provides context for RAG systems (ai-02, ai-03)
11
+ - **Data Scientist**: Provides clean data for analysis and modeling (ds-01, ds-02)
12
+ - **Security Architect**: Implements data governance and encryption (sa-01, sa-02, sa-06)
13
+ - **FinOps**: Storage and compute cost optimization (fo-01, fo-05, fo-06)
14
+ - **DevOps**: IaC for data infrastructure, CI/CD for pipelines (do-01, do-04, do-08)
15
+ - **System Design**: Scalability and disaster recovery patterns (sd-03, sd-04, sd-06)
16
+
17
+ ## 🛠️ Tools Included
18
+
19
+ ### 1. `bronze_ingestion.py`
20
+ Raw data ingestion with schema validation and error handling.
21
+
22
+ ### 2. `silver_transformation.py`
23
+ Data cleaning, standardization, and deduplication.
24
+
25
+ ### 3. `gold_aggregation.py`
26
+ Business logic, aggregations, and feature engineering.
27
+
28
+ ### 4. `delta_optimizer.py`
29
+ Delta Lake optimization (vacuum, Z-ordering, compaction).
30
+
31
+ ### 5. `medallion_queries.sql`
32
+ SQL patterns for each layer of the medallion architecture.
33
+
34
+ ## 📊 Architecture
35
+
36
+ ```
37
+ Bronze (Raw)    →    Silver (Cleaned)    →    Gold (Business-Ready)
38
+      ↓                       ↓                          ↓
39
+ Append-only          Deduped/Valid            Aggregated/Featured
40
+ Full history         Schema enforced          Business logic
41
+ ```
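As a rough illustration of the three layers above, the following sketch models them with plain Python lists and dicts instead of Delta tables. The record fields (`id`, `amount`, `region`) and the function names are hypothetical; they are not the API of the package's `bronze_ingestion`, `silver_transformation`, or `gold_aggregation` modules.

```python
from collections import defaultdict


def to_bronze(bronze, raw_records):
    """Append raw records unchanged (append-only, full history)."""
    bronze.extend(raw_records)
    return bronze


def to_silver(bronze):
    """Drop invalid rows and deduplicate on `id`; the last write wins."""
    latest = {}
    for row in bronze:
        # Schema enforcement (assumed rule): id must exist, amount must be numeric.
        if row.get("id") is None or not isinstance(row.get("amount"), (int, float)):
            continue
        latest[row["id"]] = row
    return list(latest.values())


def to_gold(silver):
    """Aggregate cleaned rows into a business-level total per region."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["region"]] += row["amount"]
    return dict(totals)


raw = [
    {"id": 1, "amount": 10.0, "region": "EU"},
    {"id": 1, "amount": 12.0, "region": "EU"},    # later version of id 1
    {"id": 2, "amount": 5.0, "region": "US"},
    {"id": None, "amount": 7.0, "region": "US"},  # invalid: missing id
]
bronze_rows = to_bronze([], raw)
silver_rows = to_silver(bronze_rows)
gold_totals = to_gold(silver_rows)
```

Bronze keeps all four records, including the invalid one, so the raw history can be replayed if the Silver rules change later.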
42
+
43
+ ## 🚀 Quick Start
44
+
45
+ ```python
46
+ from bronze_ingestion import BronzeLoader
47
+ from silver_transformation import SilverTransformer
48
+ from gold_aggregation import GoldAggregator
49
+
50
+ # Bronze: Ingest raw data
51
+ bronze = BronzeLoader(source="salesforce_crm")
52
+ bronze.ingest(path="raw_data.json")
53
+
54
+ # Silver: Clean and standardize
55
+ silver = SilverTransformer()
56
+ silver.transform(bronze_table="bronze.crm_leads")
57
+
58
+ # Gold: Create business views
59
+ gold = GoldAggregator()
60
+ gold.aggregate(silver_table="silver.crm_leads_clean")
61
+ ```
62
+
63
+ ## 📚 Best Practices
64
+
65
+ ### Cost Optimization (FinOps Integration)
66
+
67
+ 1. **Storage Cost Optimization**
68
+ - Implement lifecycle policies (hot → cool → archive)
69
+ - Use compression (Snappy, Zstandard) for Delta tables
70
+ - Partition data by date for efficient pruning
71
+ - Monitor storage growth and set capacity alerts
72
+ - Reference: FinOps fo-06 (Storage Tiering)
73
+
74
+ 2. **Compute Cost Optimization**
75
+ - Use serverless SQL pools for ad-hoc queries
76
+ - Auto-scale Spark clusters based on workload
77
+ - Use spot instances for batch processing
78
+ - Schedule non-critical jobs during off-peak hours
79
+ - Right-size compute based on actual usage patterns
80
+ - Reference: FinOps fo-07 (Compute Rightsizing), fo-01 (Cost Visibility)
81
+
82
+ 3. **Delta Lake Optimization for Cost**
83
+ - Run OPTIMIZE command to compact small files
84
+ - Use Z-Ordering for frequently queried columns
85
+ - VACUUM old versions with appropriate retention
86
+ - Monitor table sizes and file counts
87
+ - Reference: FinOps fo-06 (Storage Tiering), Data Engineer best practices
88
+
89
+ 4. **Query Cost Optimization**
90
+ - Cache frequently accessed tables
91
+ - Use materialized views for complex aggregations
92
+ - Implement data skipping with statistics
93
+ - Monitor query costs and optimize expensive queries
94
+ - Reference: FinOps fo-03 (Budget Management)
95
+
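The Delta Lake maintenance steps listed above (compaction, Z-ordering, and vacuuming) correspond to the `OPTIMIZE ... ZORDER BY` and `VACUUM ... RETAIN` SQL commands. A minimal sketch that only builds those statements is shown below; the helper names are hypothetical, and the table and column names are taken from the examples in this README.

```python
def optimize_sql(table, z_order_by=None):
    """Build a Delta Lake OPTIMIZE statement, optionally with ZORDER BY."""
    sql = f"OPTIMIZE {table}"
    if z_order_by:
        sql += " ZORDER BY (" + ", ".join(z_order_by) + ")"
    return sql


def vacuum_sql(table, retention_hours=168):
    """Build a VACUUM statement; 168 hours matches Delta's default 7-day retention."""
    if retention_hours < 0:
        raise ValueError("retention_hours must be non-negative")
    return f"VACUUM {table} RETAIN {retention_hours} HOURS"


statements = [
    optimize_sql("bronze.events", ["event_date", "event_type"]),
    vacuum_sql("bronze.events"),
]
```

In a Spark session with Delta Lake enabled, each string could be passed to `spark.sql(...)`. Retention below the default usually requires disabling a safety check, so shorter windows should be chosen with care.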
96
+ ### Infrastructure as Code (DevOps Integration)
97
+
98
+ 5. **Deploy with IaC**
99
+ - Use Terraform or Bicep for all infrastructure
100
+ - Version control infrastructure code in Git
101
+ - Implement CI/CD for infrastructure changes
102
+ - Use multiple environments (dev, staging, prod)
103
+ - Reference: DevOps do-03 (IaC), do-04 (GitOps)
104
+
105
+ 6. **Automate Data Pipeline Deployment**
106
+ - Package pipelines as code
107
+ - Use CI/CD for pipeline deployment
108
+ - Implement automated testing for pipelines
109
+ - Blue-green deployments for critical pipelines
110
+ - Reference: DevOps do-01 (CI/CD), do-06 (Automated Testing)
111
+
112
+ 7. **Monitoring & Observability**
113
+ - Enable diagnostic logging for all services
114
+ - Set up Azure Monitor dashboards
115
+ - Alert on pipeline failures and anomalies
116
+ - Track data quality metrics over time
117
+ - Reference: DevOps do-08 (Monitoring & Observability)
118
+
119
+ ### Security & Governance (Security Architect Integration)
120
+
121
+ 8. **Data Governance Framework**
122
+ - Implement a data catalog (Microsoft Purview, formerly Azure Purview)
123
+ - Tag all datasets with business metadata
124
+ - Track data lineage from Bronze to Gold
125
+ - Enforce data quality policies
126
+ - Reference: Security Architect sa-06 (Data Governance)
127
+
128
+ 9. **PII Protection**
129
+ - Detect PII in Bronze layer
130
+ - Mask/encrypt PII in Silver layer
131
+ - Implement row-level security on Gold tables
132
+ - Audit PII access and usage
133
+ - Reference: Security Architect sa-01 (PII Detection)
134
+
135
+ 10. **Access Control**
136
+ - Implement RBAC for all data layers
137
+ - Use managed identities for service authentication
138
+ - Enforce least privilege access
139
+ - Audit all data access
140
+ - Reference: Security Architect sa-04 (IAM)
141
+
142
+ 11. **Encryption**
143
+ - Enable encryption at rest (all storage)
144
+ - Use TLS for data in transit
145
+ - Manage keys with Azure Key Vault
146
+ - Rotate encryption keys regularly
147
+ - Reference: Security Architect sa-04 (Encryption)
148
+
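To make the "detect in Bronze, mask in Silver" step concrete, here is a small sketch that finds email addresses with a regular expression and replaces them with a salted SHA-256 digest. The regex, the `pii_` prefix, and the function names are assumptions for illustration; real PII detection typically covers many more patterns, and hashing is pseudonymization rather than encryption.

```python
import hashlib
import re

# Simplified email pattern (assumption); production detectors are broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_pii(value):
    """Return True if the value is a string containing an email address."""
    return isinstance(value, str) and EMAIL_RE.search(value) is not None


def mask_email(value, salt):
    """Replace each email with a truncated, salted SHA-256 digest."""
    def _digest(match):
        raw = (salt + match.group(0).lower()).encode("utf-8")
        return "pii_" + hashlib.sha256(raw).hexdigest()[:16]
    return EMAIL_RE.sub(_digest, value)


def mask_record(record, salt):
    """Mask email-bearing fields; leave all other fields untouched."""
    return {
        key: mask_email(value, salt) if contains_pii(value) else value
        for key, value in record.items()
    }


masked = mask_record({"id": 1, "note": "contact a@b.com"}, salt="demo-salt")
```

Because the digest is deterministic for a given salt, masked values can still be joined across Silver tables without exposing the original address.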
149
+ ### Data Quality (Data Engineer Integration)
150
+
151
+ 12. **Automated Data Quality Checks**
152
+ - Validate schemas at Bronze ingestion
153
+ - Check data completeness in Silver
154
+ - Monitor data freshness
155
+ - Alert on quality threshold violations
156
+ - Reference: Data Engineer de-03 (Data Quality)
157
+
158
+ 13. **Data Lineage Tracking**
159
+ - Track data transformations end-to-end
160
+ - Document data sources and dependencies
161
+ - Enable impact analysis for changes
162
+ - Integrate with MLOps for model lineage
163
+ - Reference: MLOps mo-07 (Data Versioning), mo-06 (Lineage)
164
+
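The completeness and freshness checks described above can be sketched in a few lines of plain Python. The thresholds (98% completeness, one hour of freshness) mirror the targets in the metrics table of this README; the function names and the row format are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, required_fields):
    """Fraction of rows where every required field is present and non-null."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) is not None for field in required_fields)
        for row in rows
    )
    return complete / len(rows)


def is_fresh(last_loaded, max_age, now=None):
    """True if the last successful load is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return (now - last_loaded) <= max_age


def quality_violations(rows, required_fields, last_loaded, now=None,
                       min_completeness=0.98, max_age=timedelta(hours=1)):
    """Return human-readable messages for every failed check."""
    violations = []
    score = completeness(rows, required_fields)
    if score < min_completeness:
        violations.append(f"completeness {score:.2%} below {min_completeness:.0%}")
    if not is_fresh(last_loaded, max_age, now):
        violations.append(f"data older than {max_age}")
    return violations
```

A pipeline could run `quality_violations` after each Silver load and raise an alert whenever the returned list is non-empty.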
165
+ ### Enterprise Patterns
166
+
167
+ 14. **Multi-Tenancy**
168
+ - Isolate data by tenant/business unit
169
+ - Implement tenant-level security
170
+ - Monitor costs per tenant
171
+ - Scale compute independently per tenant
172
+ - Reference: System Design sd-07 (Multi-Tenant Architecture)
173
+
174
+ 15. **Disaster Recovery**
175
+ - Implement geo-redundant storage
176
+ - Automate backups with retention policies
177
+ - Test recovery procedures regularly
178
+ - Document RPO/RTO targets
179
+ - Reference: System Design sd-04 (HA/DR)
180
+
181
+ 16. **Compliance**
182
+ - Implement GDPR right-to-erasure
183
+ - Maintain audit logs for compliance
184
+ - Data retention policies by regulation
185
+ - Regular compliance audits
186
+ - Reference: Security Architect sa-06 (Compliance)
187
+
188
+ ### Azure-Specific Best Practices
189
+
190
+ 17. **Azure Synapse Analytics**
191
+ - Use dedicated SQL pools for production workloads
192
+ - Implement workload isolation and classification
193
+ - Enable result set caching for frequent queries
194
+ - Monitor and optimize distribution keys
195
+ - Reference: Azure az-03 (Synapse Analytics)
196
+
197
+ 18. **Azure Data Factory**
198
+ - Use managed VNet for secure integration
199
+ - Implement parameterized pipelines
200
+ - Enable git integration for version control
201
+ - Monitor pipeline costs and optimize activities
202
+ - Reference: Azure az-02 (Data Factory)
203
+
204
+ 19. **Delta Lake on Azure**
205
+ - Enable delta cache for hot data
206
+ - Use optimized writes for streaming
207
+ - Implement time travel for debugging
208
+ - Monitor delta table metrics
209
+ - Reference: Data Engineer best practices
210
+
211
+ ### ML/AI Integration
212
+
213
+ 20. **Feature Store Integration**
214
+ - Design Gold layer for feature consumption
215
+ - Implement point-in-time correctness
216
+ - Version features alongside models
217
+ - Monitor feature drift
218
+ - Reference: ML Engineer ml-02 (Feature Engineering)
219
+
220
+ 21. **RAG Knowledge Base**
221
+ - Export Gold tables for RAG indexing
222
+ - Ensure data freshness for AI context
223
+ - Track document versions
224
+ - Monitor data quality for RAG
225
+ - Reference: AI Engineer ai-02 (RAG Pipeline)
226
+
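Point-in-time correctness, mentioned under Feature Store Integration, means a training row only sees feature values that were known at its own timestamp. One way to sketch the lookup, assuming each feature keeps a sorted history of `(timestamp, value)` pairs (a hypothetical layout, not the package's storage format):

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists.

    `history` must be a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # number of entries with ts <= as_of
    return history[idx - 1][1] if idx else None


# Integer timestamps keep the example short; datetimes work the same way.
spend_history = [(1, "low"), (5, "medium"), (9, "high")]
```

Looking up a label observed at time 4 returns the value written at time 1, never the later values, which avoids leaking future information into training data.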
227
+ ## 💰 Cost Optimization Examples
228
+
229
+ ### Storage Lifecycle Management
230
+ ```python
231
+ from delta_optimizer import DeltaOptimizer
232
+ from azure.storage.blob import BlobServiceClient
233
+
234
+ # Implement storage tiering
235
+ optimizer = DeltaOptimizer()
236
+
237
+ # Optimize Delta tables
238
+ optimizer.optimize_table(
239
+ table_name="bronze.events",
240
+ z_order_by=["event_date", "event_type"],
241
+ vacuum_retention_hours=168 # 7 days
242
+ )
243
+
244
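+ # Define lifecycle rules (applied via the storage management API, e.g. azure-mgmt-storage, not via BlobServiceClient)
245
+ blob_client = BlobServiceClient.from_connection_string(conn_str)  # conn_str: storage connection string, assumed to be defined elsewhere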
246
+ lifecycle_policy = {
247
+ "rules": [
248
+ {
249
+ "enabled": True,
250
+ "name": "move-to-cool",
251
+ "type": "Lifecycle",
252
+ "definition": {
253
+ "actions": {
254
+ "baseBlob": {
255
+ "tierToCool": {"daysAfterModificationGreaterThan": 30},
256
+ "tierToArchive": {"daysAfterModificationGreaterThan": 90}
257
+ }
258
+ }
259
+ }
260
+ }
261
+ ]
262
+ }
263
+
264
+ # Monitor storage costs
265
+ from finops_tracker import StorageCostTracker
266
+ cost_tracker = StorageCostTracker()
267
+ monthly_costs = cost_tracker.get_storage_costs(
268
+ resource_group="data-lakehouse",
269
+ period="monthly"
270
+ )
271
+
272
+ print(f"Bronze layer: ${monthly_costs.bronze:.2f}")
273
+ print(f"Silver layer: ${monthly_costs.silver:.2f}")
274
+ print(f"Gold layer: ${monthly_costs.gold:.2f}")
275
+ print(f"Total savings from tiering: ${monthly_costs.savings:.2f}")
276
+ ```
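The lifecycle rule above moves blobs to the cool tier more than 30 days after modification and to archive after more than 90 days. The sketch below reproduces that decision locally, which can help when estimating how much data a policy would move; the function name and tier labels are illustrative assumptions.

```python
from datetime import date


def target_tier(last_modified, today, cool_after_days=30, archive_after_days=90):
    """Return the tier a blob would land in under the 30/90-day rule.

    Comparisons are strict, mirroring `daysAfterModificationGreaterThan`.
    """
    age_days = (today - last_modified).days
    if age_days > archive_after_days:
        return "Archive"
    if age_days > cool_after_days:
        return "Cool"
    return "Hot"
```

For example, a blob last modified on 2024-01-01 is still hot on 2024-01-31 (exactly 30 days old) and becomes a candidate for the cool tier the next day.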
277
+
278
+ ### Compute Cost Optimization
279
+ ```python
280
+ from azure.synapse.spark import SparkPoolManager  # illustrative wrapper; the azure-synapse-spark SDK does not ship a class by this name
281
+
282
+ # Auto-scaling configuration
283
+ spark_pool = SparkPoolManager()
284
+ spark_pool.configure_autoscale(
285
+ pool_name="default-spark-pool",
286
+ min_nodes=2,
287
+ max_nodes=10,
288
+ auto_pause_minutes=15 # Pause when idle
289
+ )
290
+
291
+ # Use spot instances for batch jobs
292
+ spark_pool.submit_batch_job(
293
+ script="bronze_ingestion.py",
294
+ executor_size="Medium",
295
+ executors=5,
296
+ use_spot_instances=True, # 60-90% cost savings
297
+ max_price_per_hour=0.50
298
+ )
299
+
300
+ # Monitor compute costs
301
+ from finops_tracker import ComputeCostTracker
302
+ compute_tracker = ComputeCostTracker()
303
+
304
+ # Get cost breakdown
305
+ costs = compute_tracker.get_compute_costs(
306
+ resource_type="synapse_spark",
307
+ period="daily"
308
+ )
309
+
310
+ # Set budget alerts
311
+ compute_tracker.set_budget_alert(
312
+ pool_name="default-spark-pool",
313
+ daily_budget=100.00,
314
+ alert_threshold=0.8
315
+ )
316
+ ```
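The budget alert in the example above combines a daily budget of 100.00 with an alert threshold of 0.8, so a warning would be expected once spend reaches 80.00. A minimal sketch of that arithmetic, using a hypothetical `budget_status` helper:

```python
def budget_status(spend, daily_budget, alert_threshold=0.8):
    """Classify spend as 'ok', 'warning' (threshold reached), or 'exceeded'."""
    if daily_budget <= 0:
        raise ValueError("daily_budget must be positive")
    ratio = spend / daily_budget
    if ratio >= 1.0:
        return "exceeded"
    if ratio >= alert_threshold:
        return "warning"
    return "ok"
```

Separating "warning" from "exceeded" leaves room to react, for example by pausing non-critical batch jobs, before the budget is actually spent.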
317
+
318
+ ### Query Cost Tracking
319
+ ```python
320
+ from synapse_cost_tracker import QueryCostAnalyzer
321
+
322
+ analyzer = QueryCostAnalyzer()
323
+
324
+ # Track query costs
325
+ @analyzer.track_cost
326
+ def run_gold_aggregation(query: str):
327
+     return spark.sql(query)  # assumes an active SparkSession named `spark`
328
+
329
+ # Generate cost report
330
+ report = analyzer.generate_cost_report(period="weekly")
331
+ print(f"Total query costs: ${report.total_cost:.2f}")
332
+ print(f"Most expensive queries: {report.top_queries}")
333
+ print(f"Cost by user: {report.cost_by_user}")
334
+
335
+ # Optimize expensive queries
336
+ recommendations = analyzer.optimize_queries(
337
+ cost_threshold=10.00 # Queries costing more than $10
338
+ )
339
+ ```
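The README does not show how `@analyzer.track_cost` works internally. One plausible approach, sketched here purely as an assumption, is a decorator that times each call and converts the runtime into cost using a flat hourly rate. The class name, the rate, and the injectable `clock` parameter are all hypothetical.

```python
import functools
import time


class QueryCostTracker:
    """Hypothetical tracker: cost = runtime in hours x a flat hourly rate."""

    def __init__(self, rate_per_hour, clock=time.perf_counter):
        self.rate_per_hour = rate_per_hour
        self.clock = clock  # injectable so tests can supply a fake clock
        self.records = []

    def track_cost(self, func):
        """Decorator that records runtime and estimated cost of each call."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = self.clock()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = self.clock() - start
                self.records.append({
                    "name": func.__name__,
                    "seconds": elapsed,
                    "cost": elapsed / 3600 * self.rate_per_hour,
                })
        return wrapper

    def total_cost(self):
        """Sum of the estimated cost over all recorded calls."""
        return sum(record["cost"] for record in self.records)
```

Recording inside `finally` means failed queries are also counted, which matters because a query that errors out after a long scan has still consumed compute.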
340
+
341
+ ## 🏗️ Infrastructure as Code Examples
342
+
343
+ ### Terraform for Lakehouse
344
+ ```hcl
345
+ # main.tf - Lakehouse infrastructure
346
+ module "lakehouse" {
347
+ source = "./modules/lakehouse"
348
+
349
+ resource_group = "rg-lakehouse-prod"
350
+ location = "eastus"
351
+
352
+ storage_account = {
353
+ name = "lakehouseprod"
354
+ tier = "Standard"
355
+ replication_type = "GRS" # Geo-redundant
356
+ enable_versioning = true
357
+ lifecycle_rules = {
358
+ move_to_cool = 30 # days
359
+ move_to_archive = 90 # days
360
+ }
361
+ }
362
+
363
+ synapse_workspace = {
364
+ name = "synapse-lakehouse-prod"
365
+ sql_admin_username = "sqladmin"
366
+ managed_vnet_enabled = true
367
+
368
+ spark_pools = [{
369
+ name = "default"
370
+ node_size = "Medium"
371
+ min_nodes = 2
372
+ max_nodes = 10
373
+ auto_pause_minutes = 15
374
+ use_spot_instances = true # Cost savings
375
+ }]
376
+
377
+ sql_pools = [{
378
+ name = "gold_analytics"
379
+ sku = "DW100c"
380
+ auto_pause_enabled = true
381
+ auto_pause_delay = 60 # minutes
382
+ }]
383
+ }
384
+
385
+ delta_tables = {
386
+ bronze = ["events", "transactions", "users"]
387
+ silver = ["events_clean", "transactions_validated", "users_enriched"]
388
+ gold = ["daily_metrics", "user_features", "transaction_summary"]
389
+ }
390
+
391
+ tags = {
392
+ Environment = "Production"
393
+ CostCenter = "Data-Platform"
394
+ Owner = "DataEngineering"
395
+ }
396
+ }
397
+
398
+ # monitoring.tf
399
+ resource "azurerm_monitor_diagnostic_setting" "lakehouse" {
400
+ name = "lakehouse-diagnostics"
401
+ target_resource_id = module.lakehouse.storage_account_id
402
+ log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
403
+
404
+ log {
405
+ category = "StorageRead"
406
+ enabled = true
407
+ }
408
+
409
+ log {
410
+ category = "StorageWrite"
411
+ enabled = true
412
+ }
413
+
414
+ metric {
415
+ category = "Transaction"
416
+ enabled = true
417
+ }
418
+ }
419
+
420
+ # alerts.tf
421
+ resource "azurerm_monitor_metric_alert" "storage_cost" {
422
+ name = "lakehouse-storage-cost-alert"
423
+ resource_group_name = var.resource_group
424
+ scopes = [module.lakehouse.storage_account_id]
425
+ description = "Alert when used storage capacity exceeds the threshold (a proxy for storage cost)"
426
+
427
+ criteria {
428
+ metric_namespace = "Microsoft.Storage/storageAccounts"
429
+ metric_name = "UsedCapacity"
430
+ aggregation = "Average"
431
+ operator = "GreaterThan"
432
+ threshold = 5000000000000 # 5TB
433
+ }
434
+
435
+ action {
436
+ action_group_id = azurerm_monitor_action_group.ops_team.id
437
+ }
438
+ }
439
+ ```
440
+
441
+ ### CI/CD Pipeline for Data Pipelines
442
+ ```yaml
443
+ # .github/workflows/deploy-pipeline.yml
444
+ name: Deploy Data Pipeline
445
+
446
+ on:
447
+   push:
448
+     paths:
449
+       - 'pipelines/**'
450
+     branches:
451
+       - main
452
+
453
+ jobs:
454
+   test-and-deploy:
455
+     runs-on: ubuntu-latest
456
+     steps:
457
+       - uses: actions/checkout@v3
458
+
459
+       - name: Set up Python
460
+         uses: actions/setup-python@v4
461
+         with:
462
+           python-version: '3.10'
463
+
464
+       - name: Install dependencies
465
+         run: pip install -r requirements.txt
466
+
467
+       - name: Run unit tests
468
+         run: pytest tests/unit/
469
+
470
+       - name: Run data quality tests
471
+         run: pytest tests/data_quality/
472
+
473
+       - name: Deploy infrastructure (Terraform)
474
+         run: |
475
+           cd terraform
476
+           terraform init
477
+           terraform plan -out=tfplan
478
+           terraform apply tfplan
479
+
480
+       - name: Deploy pipelines to Synapse
481
+         run: |
482
+           python scripts/deploy_pipelines.py \
483
+             --workspace synapse-lakehouse-prod \
484
+             --environment production
485
+
486
+       - name: Run integration tests
487
+         run: pytest tests/integration/
488
+
489
+       - name: Monitor pipeline health
490
+         run: python scripts/monitor_pipelines.py --duration 30m
491
+
492
+       - name: Generate cost report
493
+         run: python scripts/generate_cost_report.py
494
+ ```
495
+
496
+ ## 📊 Enhanced Metrics & Monitoring
497
+
498
+ | Metric Category | Metric | Target | Tool |
499
+ |-----------------|--------|--------|------|
500
+ | **Cost** | Monthly storage cost | <$5000 | FinOps dashboard |
501
+ | | Compute cost per pipeline run | <$50 | Cost Management |
502
+ | | Cost per TB processed | <$10 | Custom tracker |
503
+ | **Performance** | Bronze ingestion latency | <5min | Azure Monitor |
504
+ | | Silver transformation time | <15min | Synapse metrics |
505
+ | | Gold aggregation time | <30min | Spark UI |
506
+ | **Quality** | Schema validation pass rate | >99% | Data quality checks |
507
+ | | Data completeness | >98% | DQ framework |
508
+ | | Data freshness (SLA) | <1 hour | Custom alerts |
509
+ | **Reliability** | Pipeline success rate | >99.5% | Azure Monitor |
510
+ | | Data availability | >99.9% | Health checks |
511
+ | **Security** | PII detection coverage | 100% | Security scans |
512
+ | | Access control violations | 0 | Audit logs |
513
+
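The targets in the table above can be checked automatically once the observed values are collected. The sketch below encodes a few of them and reports which ones are missed; the metric keys and the `failed_targets` helper are hypothetical names, and rates are expressed as fractions rather than percentages.

```python
import operator

# A subset of the targets from the table; keys are illustrative.
TARGETS = {
    "monthly_storage_cost_usd": ("<", 5000),
    "schema_validation_pass_rate": (">", 0.99),
    "pipeline_success_rate": (">", 0.995),
    "access_control_violations": ("==", 0),
}

_OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def failed_targets(observed, targets=TARGETS):
    """Return the names of metrics that are missing or miss their target."""
    failures = []
    for name, (op, target) in targets.items():
        value = observed.get(name)
        if value is None or not _OPS[op](value, target):
            failures.append(name)
    return failures
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a metric that stops reporting is usually a monitoring gap worth surfacing.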
514
+ ## 🔄 Integration Workflow
515
+
516
+ ### End-to-End Data Flow
517
+ ```
518
+ 1. Source Systems (APIs, Databases, Files)
519
+
520
+ 2. Bronze Ingestion (de-01) → Cost Tracking (fo-05)
521
+
522
+ 3. Schema Validation (de-03)
523
+
524
+ 4. PII Detection (sa-01)
525
+
526
+ 5. Silver Transformation (de-01) → Quality Checks (de-03)
527
+
528
+ 6. Gold Aggregation (de-01) → Feature Engineering (ml-02)
529
+
530
+ 7. Serve to downstream (AI, ML, Analytics)
531
+ ├── Feature Store (ml-02)
532
+ ├── RAG Knowledge Base (ai-02)
533
+ ├── Analytics Dashboards (ds-01)
534
+ └── Model Training (ml-03)
535
+
536
+ 8. Monitor (do-08, mo-06) → Optimize Costs (fo-01)
537
+ ```
538
+
539
+ ## 🎯 Quick Wins
540
+
541
+ 1. **Enable storage lifecycle policies** - 40-60% storage cost reduction
542
+ 2. **Implement auto-scaling for Spark** - 30-50% compute cost savings
543
+ 3. **Use spot instances for batch jobs** - 60-90% compute cost savings
544
+ 4. **Set up Delta table optimization** - Faster queries, lower costs
545
+ 5. **Deploy with IaC** - Consistent environments, faster deployments
546
+ 6. **Enable diagnostic logging** - Full observability
547
+ 7. **Implement PII detection** - Compliance and data protection
548
+ 8. **Set up cost alerts** - Prevent budget overruns
549
+ 9. **Use materialized views** - Up to 10x faster queries on repeated aggregations (workload-dependent)
550
+ 10. **Implement data quality checks** - Prevent downstream issues