tech-hub-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (133)
  1. package/LICENSE +21 -0
  2. package/README.md +250 -0
  3. package/bin/cli.js +241 -0
  4. package/bin/copilot.js +182 -0
  5. package/bin/postinstall.js +42 -0
  6. package/package.json +46 -0
  7. package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
  8. package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
  9. package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
  10. package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
  11. package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
  12. package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
  13. package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
  14. package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
  15. package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
  16. package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
  17. package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
  18. package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
  19. package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
  20. package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
  21. package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
  22. package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
  23. package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
  24. package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
  25. package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
  26. package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
  27. package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
  28. package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
  29. package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
  30. package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
  31. package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
  32. package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
  33. package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
  34. package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
  35. package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
  36. package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
  37. package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
  38. package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
  39. package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
  40. package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
  41. package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
  42. package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
  43. package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
  44. package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
  45. package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
  46. package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
  47. package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
  48. package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
  49. package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
  50. package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
  51. package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
  52. package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
  53. package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
  54. package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
  55. package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
  56. package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
  57. package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
  58. package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
  59. package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
  60. package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
  61. package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
  62. package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
  63. package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
  64. package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
  65. package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
  66. package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
  67. package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
  68. package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
  69. package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
  70. package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
  71. package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
  72. package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
  73. package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
  74. package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
  75. package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
  76. package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
  77. package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
  78. package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
  79. package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
  80. package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
  81. package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
  82. package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
  83. package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
  84. package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
  85. package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
  86. package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
  87. package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
  88. package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
  89. package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
  90. package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
  91. package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
  92. package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
  93. package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
  94. package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
  95. package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
  96. package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
  97. package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
  98. package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
  99. package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
  100. package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
  101. package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
  102. package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
  103. package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
  104. package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
  105. package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
  106. package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
  107. package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
  108. package/tech_hub_skills/skills/README.md +336 -0
  109. package/tech_hub_skills/skills/ai-engineer.md +104 -0
  110. package/tech_hub_skills/skills/azure.md +149 -0
  111. package/tech_hub_skills/skills/code-review.md +399 -0
  112. package/tech_hub_skills/skills/compliance-automation.md +747 -0
  113. package/tech_hub_skills/skills/data-engineer.md +113 -0
  114. package/tech_hub_skills/skills/data-governance.md +102 -0
  115. package/tech_hub_skills/skills/data-scientist.md +123 -0
  116. package/tech_hub_skills/skills/devops.md +160 -0
  117. package/tech_hub_skills/skills/docker.md +160 -0
  118. package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
  119. package/tech_hub_skills/skills/finops.md +184 -0
  120. package/tech_hub_skills/skills/ml-engineer.md +115 -0
  121. package/tech_hub_skills/skills/mlops.md +187 -0
  122. package/tech_hub_skills/skills/optimization-advisor.md +329 -0
  123. package/tech_hub_skills/skills/orchestrator.md +497 -0
  124. package/tech_hub_skills/skills/platform-engineer.md +102 -0
  125. package/tech_hub_skills/skills/process-automation.md +226 -0
  126. package/tech_hub_skills/skills/process-changelog.md +184 -0
  127. package/tech_hub_skills/skills/process-documentation.md +484 -0
  128. package/tech_hub_skills/skills/process-kanban.md +324 -0
  129. package/tech_hub_skills/skills/process-versioning.md +214 -0
  130. package/tech_hub_skills/skills/product-designer.md +104 -0
  131. package/tech_hub_skills/skills/project-starter.md +443 -0
  132. package/tech_hub_skills/skills/security-architect.md +135 -0
  133. package/tech_hub_skills/skills/system-design.md +126 -0
@@ -0,0 +1,655 @@
1
+ # Skill 2: Feature Engineering & Feature Store
2
+
3
+ ## 🎯 Overview
4
+ Build scalable feature engineering pipelines with centralized feature stores for consistency across training and serving.
5
+
6
+ ## 🔗 Connections
7
+ - **Data Engineer**: Consumes data pipelines for feature creation (de-01, de-02, de-03)
8
+ - **Data Scientist**: Provides features for experimentation (ds-01, ds-02, ds-05)
9
+ - **MLOps**: Feature versioning and lineage tracking (mo-02, mo-06)
10
+ - **ML Engineer**: Feeds features to training pipelines (ml-01, ml-03)
11
+ - **FinOps**: Optimizes feature computation and storage costs (fo-05, fo-07)
12
+ - **DevOps**: Automates feature pipeline deployment (do-01, do-04)
13
+ - **Security Architect**: Ensures feature-level access controls (sa-02, sa-06)
14
+ - **System Design**: Scalable feature serving architecture (sd-03, sd-05)
15
+
16
+ ## 🛠️ Tools Included
17
+
18
+ ### 1. `feature_store_manager.py`
19
+ Centralized feature store with versioning and lineage.
20
+
21
+ ### 2. `feature_transformer.py`
22
+ Reusable feature transformations with sklearn/pandas patterns.
23
+
24
+ ### 3. `feature_validator.py`
25
+ Data quality checks and feature drift detection.
26
+
27
+ ### 4. `online_feature_server.py`
28
+ Low-latency feature serving for real-time inference.
29
+
30
+ ### 5. `feature_store_config.yaml`
31
+ Configuration templates for feature store infrastructure.
32
+
33
+ ## 🏗️ Feature Store Architecture
34
+
35
+ ```
+ Raw Data → Feature Engineering → Feature Store → Training/Serving
+                     ↓                  ↓                ↓
+            Transformations       Versioning      Low-latency
+            Validation            Lineage         Online/Offline
+            Testing               Reusability     Consistency
+ ```
42
+
43
+ ## 🚀 Quick Start
44
+
45
+ ```python
+ from feature_store_manager import FeatureStore
+ from feature_transformer import FeatureEngineering
+
+ # Initialize feature store
+ store = FeatureStore(
+     name="customer_features",
+     backend="azure_ml"  # or "feast", "tecton"
+ )
+
+ # Define feature transformations
+ engineer = FeatureEngineering()
+
+ # Create feature group
+ features = engineer.create_features(
+     source_table="bronze.customer_events",
+     transformations=[
+         engineer.aggregate("purchases", window="30d", agg=["sum", "count", "mean"]),
+         engineer.categorical_encode("customer_segment"),
+         engineer.time_features("last_activity_date"),
+         engineer.ratio("purchase_amount", "page_views")
+     ]
+ )
+
+ # Register in feature store
+ store.register_features(
+     feature_group="customer_behavior",
+     features=features,
+     version="v1",
+     description="Customer behavior features for churn prediction"
+ )
+
+ # Get features for training
+ # training_entities is not defined here: an entity DataFrame (customer_id plus
+ # an event_timestamp column) that the caller prepares beforehand
+ training_data = store.get_historical_features(
+     feature_refs=["customer_behavior:v1"],
+     entity_df=training_entities,
+     point_in_time="event_timestamp"
+ )
+
+ # Get features for online serving
+ online_features = store.get_online_features(
+     feature_refs=["customer_behavior:v1"],
+     entity_keys={"customer_id": "12345"}
+ )
+ ```
90
+
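The `point_in_time="event_timestamp"` argument in the Quick Start suggests a point-in-time join: each training row should only see feature values computed at or before its own timestamp. The following is a minimal pandas sketch of that idea, not the internals of `FeatureStore`. The function name and the column names (`customer_id`, `event_timestamp`, `feature_timestamp`, `purchase_count_30d`) are assumptions made for illustration.

```python
import pandas as pd


def point_in_time_join(entity_df: pd.DataFrame, feature_df: pd.DataFrame) -> pd.DataFrame:
    """Attach to each entity row the latest feature row at or before its event time."""
    # merge_asof requires both frames to be sorted by their time keys
    entities = entity_df.sort_values("event_timestamp")
    features = feature_df.sort_values("feature_timestamp")
    return pd.merge_asof(
        entities,
        features,
        left_on="event_timestamp",
        right_on="feature_timestamp",
        by="customer_id",
        direction="backward",  # never read feature values from the future
    )


# Hypothetical sample data
entity_df = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "event_timestamp": pd.to_datetime(["2024-01-10", "2024-02-10", "2024-01-05"]),
})
feature_df = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "feature_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-20"]),
    "purchase_count_30d": [3, 7, 5],
})

training_df = point_in_time_join(entity_df, feature_df)
```

Customer `b` has no feature row at or before its event time, so its feature value is left missing rather than filled from a later snapshot, which is the behaviour that guards against leakage.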
91
+ ## 📚 Best Practices
92
+
93
+ ### Feature Engineering Cost Optimization (FinOps Integration)
94
+
95
+ 1. **Optimize Feature Computation Costs**
96
+ - Compute features incrementally rather than with a full refresh
97
+ - Cache frequently used feature transformations
98
+ - Use materialized views for expensive aggregations
99
+ - Schedule batch feature updates during off-peak hours
100
+ - Reference: FinOps fo-07 (AI/ML Cost), fo-05 (Storage)
101
+
102
+ 2. **Storage Cost Optimization**
103
+ - Implement feature lifecycle policies
104
+ - Archive old feature versions to cold storage
105
+ - Compress feature data (Parquet with snappy)
106
+ - Monitor feature store storage costs
107
+ - Delete unused feature groups
108
+ - Reference: FinOps fo-05 (Storage Optimization)
109
+
110
+ 3. **Compute Resource Optimization**
111
+ - Right-size Spark clusters for feature computation
112
+ - Use auto-scaling for variable workloads
113
+ - Optimize Spark jobs (partitioning, broadcast joins)
114
+ - Track cost per feature group
115
+ - Reference: FinOps fo-06 (Compute Optimization)
116
+
117
+ 4. **Online Serving Cost Optimization**
118
+ - Implement feature caching (Redis)
119
+ - Use connection pooling
120
+ - Batch feature requests when possible
121
+ - Monitor serving QPS and costs
122
+ - Auto-scale online store based on traffic
123
+ - Reference: FinOps fo-06, fo-07
124
+
125
+ 5. **Feature Reusability for Cost Savings**
126
+ - Centralize features to avoid duplication
127
+ - Track feature usage across models
128
+ - Deprecate unused features
129
+ - Share features across teams
130
+ - Reference: MLOps mo-02 (Feature Store)
131
+
132
+ ### MLOps Integration for Features
133
+
134
+ 6. **Feature Versioning & Lineage**
135
+ - Version all feature definitions
136
+ - Track feature lineage (source data → transformations → features)
137
+ - Document feature meanings and calculations
138
+ - Maintain backward compatibility
139
+ - Reference: MLOps mo-02 (Feature Store), mo-06 (Lineage)
140
+
141
+ 7. **Feature Monitoring & Drift Detection**
142
+ - Monitor feature distributions in production
143
+ - Detect feature drift relative to the training data
144
+ - Track feature nullability and cardinality
145
+ - Alert on feature quality issues
146
+ - Reference: MLOps mo-04 (Monitoring), mo-05 (Drift Detection)
147
+
148
+ 8. **Training-Serving Consistency**
149
+ - Use the same feature code for training and serving
150
+ - Prevent training-serving skew
151
+ - Implement point-in-time correctness
152
+ - Test feature consistency rigorously
153
+ - Reference: MLOps mo-02, ML Engineer best practices
154
+
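Drift detection against training data (item 7) can be illustrated with a Population Stability Index (PSI) over one numeric feature. This is a sketch under assumptions: the README does not say which drift score `feature_validator.py` computes, so PSI is a stand-in, and the function name and bin count are hypothetical.

```python
import numpy as np


def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a training (expected) and production (actual) sample of one feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges come from the training distribution's quantiles;
    # values outside the training range fall into the first or last bin
    inner_edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))[1:-1]
    expected_counts = np.bincount(
        np.searchsorted(inner_edges, expected, side="right"), minlength=bins
    )
    actual_counts = np.bincount(
        np.searchsorted(inner_edges, actual, side="right"), minlength=bins
    )

    # Floor the fractions so empty bins do not produce log(0)
    eps = 1e-6
    expected_frac = np.clip(expected_counts / len(expected), eps, None)
    actual_frac = np.clip(actual_counts / len(actual), eps, None)

    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))


# Hypothetical usage with synthetic data
rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, 5000)
production_sample = rng.normal(1.0, 1.0, 5000)  # mean shifted by one standard deviation
drift_score = population_stability_index(training_sample, production_sample)
```

A score like this could be compared against an alert threshold; which threshold is appropriate depends on the feature and is not something this sketch settles.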
155
+ ### DevOps Integration for Features
156
+
157
+ 9. **CI/CD for Feature Pipelines**
158
+ - Automate feature pipeline deployment
159
+ - Run feature validation tests before deployment
160
+ - Version control all feature code
161
+ - Implement feature rollback mechanisms
162
+ - Reference: DevOps do-01 (CI/CD), do-06 (Deployment)
163
+
164
+ 10. **Infrastructure as Code**
165
+ - Deploy feature store with Terraform
166
+ - Automate feature pipeline infrastructure
167
+ - Version all infrastructure configs
168
+ - Implement disaster recovery
169
+ - Reference: DevOps do-04 (IaC)
170
+
171
+ 11. **Monitoring & Observability**
172
+ - Instrument feature pipelines with metrics
173
+ - Track feature computation latency
174
+ - Monitor feature serving performance
175
+ - Set up alerts for pipeline failures
176
+ - Reference: DevOps do-08 (Monitoring)
177
+
178
+ ### Data Quality & Validation
179
+
180
+ 12. **Feature Validation**
181
+ - Validate feature schema before registration
182
+ - Check feature distributions and ranges
183
+ - Detect outliers and anomalies
184
+ - Ensure data completeness
185
+ - Reference: Data Engineer de-03 (Data Quality)
186
+
187
+ 13. **Feature Testing**
188
+ - Unit test feature transformations
189
+ - Integration test feature pipelines
190
+ - Test training-serving consistency
191
+ - Validate point-in-time correctness
192
+ - Reference: DevOps do-01 (CI/CD)
193
+
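The validation checks in items 12 and 13 (schema, ranges, completeness) can be sketched as a small rule table applied to a pandas DataFrame. The rule format, `FEATURE_SCHEMA`, and `validate_features` are hypothetical names for illustration and are not taken from `feature_validator.py`.

```python
import pandas as pd

# Hypothetical expectations for two features of a feature group
FEATURE_SCHEMA = {
    "purchase_count_7d": {"dtype": "int64", "min": 0, "max_null_frac": 0.0},
    "total_spend_30d": {"dtype": "float64", "min": 0.0, "max_null_frac": 0.01},
}


def validate_features(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of violations; an empty list means the frame passed."""
    errors = []
    for name, rule in schema.items():
        if name not in df.columns:
            errors.append(f"{name}: missing column")
            continue
        col = df[name]
        # Schema check: exact dtype match
        if str(col.dtype) != rule["dtype"]:
            errors.append(f"{name}: dtype {col.dtype}, expected {rule['dtype']}")
        # Completeness check: share of nulls
        null_frac = col.isna().mean()
        if null_frac > rule["max_null_frac"]:
            errors.append(f"{name}: null fraction {null_frac:.2%} too high")
        # Range check on the non-null values
        if (col.dropna() < rule["min"]).any():
            errors.append(f"{name}: values below minimum {rule['min']}")
    return errors


good = pd.DataFrame({"purchase_count_7d": [0, 3], "total_spend_30d": [0.0, 42.5]})
violations = validate_features(good, FEATURE_SCHEMA)
```

Returning a list instead of raising on the first failure lets a CI step report every violation in one run.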
194
+ ### Security & Compliance
195
+
196
+ 14. **Feature-Level Access Control**
197
+ - Implement RBAC for feature groups
198
+ - Restrict access to sensitive features (PII)
199
+ - Audit feature access logs
200
+ - Encrypt features at rest and in transit
201
+ - Reference: Security Architect sa-02 (IAM), sa-06 (Governance)
202
+
203
+ 15. **PII Handling in Features**
204
+ - Detect and mask PII in features
205
+ - Use feature hashing for sensitive data
206
+ - Implement differential privacy where needed
207
+ - Document data sensitivity levels
208
+ - Reference: Security Architect sa-01 (PII Detection)
209
+
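One way to keep a sensitive identifier usable as a join key without storing it in the clear is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; this is pseudonymization, which is likely what item 15 means by hashing sensitive data, and it is different from the "hashing trick" used for dimensionality reduction. The function name and the key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed hash of a sensitive value (HMAC-SHA256, hex-encoded)."""
    # Normalize first so trivially different spellings map to the same token
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Hypothetical key for illustration; a real key belongs in a secrets manager
key = b"example-key-for-illustration-only"
token = pseudonymize("Jane.Doe@example.com", key)
```

A keyed hash is preferred over a plain SHA-256 here because low-entropy values such as email addresses can otherwise be recovered by hashing guesses.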
210
+ ### Azure-Specific Best Practices
211
+
212
+ 16. **Azure ML Feature Store**
213
+ - Use managed feature store for simplicity
214
+ - Integrate with Azure ML datasets
215
+ - Enable offline and online stores
216
+ - Use Azure Cache for Redis for online serving
217
+ - Reference: Azure az-04 (AI/ML Services)
218
+
219
+ 17. **Azure Synapse for Feature Engineering**
220
+ - Use Synapse Spark for large-scale transformations
221
+ - Implement serverless SQL for feature queries
222
+ - Optimize Synapse costs with auto-pause
223
+ - Reference: Azure az-01 (Data Services)
224
+
225
+ ### Data Engineer Integration
226
+
227
+ 18. **Feature Pipeline Orchestration**
228
+ - Integrate with data pipelines (Databricks, ADF)
229
+ - Schedule feature updates appropriately
230
+ - Handle upstream data dependencies
231
+ - Implement incremental feature updates
232
+ - Reference: Data Engineer de-01 (Pipeline Orchestration)
233
+
234
+ 19. **Data Quality for Features**
235
+ - Validate source data before feature computation
236
+ - Monitor data freshness for features
237
+ - Handle missing data appropriately
238
+ - Track data lineage from source to features
239
+ - Reference: Data Engineer de-03 (Data Quality)
240
+
241
+ 20. **Feature Discovery & Documentation**
242
+ - Maintain feature catalog with descriptions
243
+ - Document feature creation logic
244
+ - Track feature owners and SLAs
245
+ - Enable feature search and discovery
246
+ - Reference: MLOps mo-02, Data Engineer best practices
247
+
248
+ ## 💰 Cost Optimization Examples
249
+
250
+ ### Incremental Feature Computation
251
+ ```python
+ from pyspark.sql import SparkSession
+ from pyspark.sql import functions as F
+
+ from feature_store_manager import FeatureStore
+ from finops_tracker import FeatureCostTracker
+
+ spark = SparkSession.builder.appName("IncrementalFeatures").getOrCreate()
+ store = FeatureStore(name="customer_features")
+ cost_tracker = FeatureCostTracker()
+
+ @cost_tracker.track_feature_cost
+ def compute_features_incremental(last_processed_timestamp):
+     """Compute features only from events newer than the last run (can save 80-95% of compute versus a full refresh)"""
+
+     # Read only new data
+     new_data = spark.read.parquet("bronze.events") \
+         .filter(f"event_timestamp > '{last_processed_timestamp}'")
+
+     # Incremental aggregations over the new events
+     new_features = new_data.groupBy("customer_id").agg(
+         F.count("*").alias("event_count_30d"),
+         F.sum("purchase_amount").alias("total_purchase_30d"),
+         F.avg("session_duration").alias("avg_session_duration")
+     )
+
+     # Merge with existing features
+     # Assumes get_latest_features returns a Spark DataFrame. Columns present on
+     # both sides must still be combined (for example, summed) before writing.
+     existing_features = store.get_latest_features("customer_behavior")
+     updated_features = existing_features.join(
+         new_features,
+         on="customer_id",
+         how="outer"
+     )
+
+     # Write to feature store
+     store.write_features(
+         feature_group="customer_behavior",
+         features=updated_features,
+         mode="overwrite"
+     )
+
+     return updated_features
+
+ # Run incremental update
+ last_run = store.get_last_update_time("customer_behavior")
+ compute_features_incremental(last_run)
+
+ # Cost report
+ report = cost_tracker.generate_report()
+ print(f"Feature computation cost: ${report.compute_cost:.2f}")
+ print(f"Storage cost: ${report.storage_cost:.2f}")
+ print(f"Savings from incremental: ${report.full_refresh_cost - report.compute_cost:.2f}")
+ ```
300
+
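Incremental updates only work directly for aggregates that can be merged: counts and sums add up across batches, while a mean has to be recomputed from them. The pandas sketch below shows that merge step with hypothetical column names (`event_count`, `total_purchase`, `avg_purchase`). It keeps running totals; a true 30-day window would also need to subtract events that have aged out.

```python
import pandas as pd


def merge_incremental(existing: pd.DataFrame, new_events: pd.DataFrame) -> pd.DataFrame:
    """Fold a batch of new events into per-customer running totals."""
    # Aggregate only the new batch
    batch = (
        new_events.groupby("customer_id")
        .agg(event_count=("purchase_amount", "size"), total_purchase=("purchase_amount", "sum"))
        .reset_index()
    )
    # Counts and sums are additive, so stacking and re-summing merges old and new
    combined = (
        pd.concat([existing[["customer_id", "event_count", "total_purchase"]], batch])
        .groupby("customer_id", as_index=False)
        .sum()
    )
    # The mean is derived from the mergeable pieces instead of being averaged directly
    combined["avg_purchase"] = combined["total_purchase"] / combined["event_count"]
    return combined


existing = pd.DataFrame({
    "customer_id": ["a", "b"],
    "event_count": [2, 1],
    "total_purchase": [30.0, 5.0],
})
new_events = pd.DataFrame({
    "customer_id": ["a", "c", "c"],
    "purchase_amount": [10.0, 4.0, 6.0],
})
updated = merge_incremental(existing, new_events)
```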
301
+ ### Feature Store with Cost-Optimized Storage
302
+ ```python
+ from azure.ai.ml.entities import FeatureStore, FeatureSet
+
+ # Illustrative sketch: the dict-style store settings, lifecycle_policy, and tracking
+ # calls below are shorthand and may not match the azure-ai-ml SDK's actual signatures.
+
+ # Create feature store with lifecycle management
+ feature_store = FeatureStore(
+     name="ml-feature-store",
+     description="Centralized feature store with cost optimization",
+     offline_store={
+         "type": "azure_data_lake_gen2",
+         "storage_account": "mlfeaturestore",
+         "container": "features",
+         "format": "parquet",
+         "compression": "snappy",  # 3-5x compression
+         "partition_by": ["year", "month", "day"]  # Partition pruning
+     },
+     online_store={
+         "type": "azure_cache_redis",
+         "tier": "Basic",  # Upgrade to Premium for production
+         "capacity": 1,
+         "ttl": 3600,  # 1 hour cache
+         "enable_clustering": False  # Enable for high throughput
+     }
+ )
+
+ # Define feature set with lifecycle policy
+ feature_set = FeatureSet(
+     name="customer_behavior",
+     version="v1",
+     description="Customer behavior features",
+     entities=["customer_id"],
+     features=[
+         {"name": "purchase_count_7d", "type": "integer"},
+         {"name": "total_spend_30d", "type": "float"},
+         {"name": "avg_session_duration", "type": "float"},
+         {"name": "last_purchase_days_ago", "type": "integer"}
+     ],
+     # Lifecycle management for cost savings
+     lifecycle_policy={
+         "hot_tier_retention_days": 30,  # Recent data in premium storage
+         "cool_tier_retention_days": 90,  # 50% cheaper
+         "archive_tier_retention_days": 365,  # 90% cheaper
+         "delete_after_days": 730  # Compliance requirement
+     }
+ )
+
+ # Track feature usage and costs
+ feature_set.enable_usage_tracking(True)
+ feature_set.enable_cost_tracking(True)
+
+ # Register feature set
+ # ml_client is assumed to be an authenticated azure.ai.ml.MLClient created elsewhere
+ ml_client.feature_sets.create_or_update(feature_set)
+ ```
355
+
356
+ ### Low-Latency Online Feature Serving
357
+ ```python
+ import json
+ import os
+ from datetime import timedelta
+ from functools import lru_cache
+
+ import redis
+
+ from feature_store_manager import FeatureStore
+ from finops_tracker import FeatureCostTracker
+
+ class OptimizedFeatureServer:
+     def __init__(self):
+         # Redis cache for features (low-latency lookups)
+         # max_connections bounds the client's internal connection pool
+         self.cache = redis.Redis(
+             host="ml-features.redis.cache.windows.net",
+             port=6380,
+             password=os.getenv("REDIS_PASSWORD"),
+             ssl=True,
+             decode_responses=True,
+             max_connections=50
+         )
+
+         self.feature_store = FeatureStore(name="customer_features")
+         self.cost_tracker = FeatureCostTracker()
+
+     @lru_cache(maxsize=1000)  # In-memory cache for hot features; entries never expire
+     def get_features(self, customer_id: str, feature_list: tuple) -> dict:
+         """Get features with multi-level caching.
+
+         feature_list must be a tuple because lru_cache requires hashable arguments.
+         """
+
+         cache_key = f"features:{customer_id}:{':'.join(feature_list)}"
+
+         # Level 1: In-memory cache (LRU)
+         # Handled by @lru_cache decorator
+
+         # Level 2: Redis cache
+         cached = self.cache.get(cache_key)
+         if cached:
+             self.cost_tracker.record_cache_hit("redis")
+             return json.loads(cached)
+
+         # Level 3: Feature store (fallback)
+         features = self.feature_store.get_online_features(
+             feature_refs=list(feature_list),
+             entity_keys={"customer_id": customer_id}
+         )
+
+         # Update cache with 1-hour TTL
+         self.cache.setex(
+             cache_key,
+             timedelta(hours=1),
+             json.dumps(features)
+         )
+
+         self.cost_tracker.record_cache_miss()
+         return features
+
+     def batch_get_features(self, customer_ids: list, feature_list: list) -> dict:
+         """Batch feature retrieval for cost efficiency"""
+
+         if not customer_ids:
+             return {"cache_hit_rate": 0.0, "features": {}}
+
+         suffix = ":".join(feature_list)
+
+         # Use a Redis pipeline to fetch all keys in one round trip
+         pipeline = self.cache.pipeline()
+         for cid in customer_ids:
+             pipeline.get(f"features:{cid}:{suffix}")
+         cached_results = pipeline.execute()
+
+         # Decode cache hits and collect the IDs that missed
+         features_by_id = {}
+         missing_ids = []
+         for cid, cached in zip(customer_ids, cached_results):
+             if cached is None:
+                 missing_ids.append(cid)
+             else:
+                 features_by_id[cid] = json.loads(cached)
+
+         if missing_ids:
+             # Bulk fetch from feature store
+             missing_features = self.feature_store.get_online_features_batch(
+                 feature_refs=feature_list,
+                 entity_keys=[{"customer_id": cid} for cid in missing_ids]
+             )
+
+             # Bulk cache update, and add the fetched features to the result
+             pipeline = self.cache.pipeline()
+             for cid, features in zip(missing_ids, missing_features):
+                 features_by_id[cid] = features
+                 pipeline.setex(f"features:{cid}:{suffix}", timedelta(hours=1), json.dumps(features))
+             pipeline.execute()
+
+         return {
+             "cache_hit_rate": 1 - len(missing_ids) / len(customer_ids),
+             "features": features_by_id
+         }
+
+ # Usage
+ server = OptimizedFeatureServer()
+
+ # Single feature request (< 5ms with cache)
+ features = server.get_features(
+     customer_id="12345",
+     feature_list=("customer_behavior:v1",)
+ )
+
+ # Batch request (one pipeline round trip instead of one request per customer)
+ batch_features = server.batch_get_features(
+     customer_ids=["12345", "67890", "11111"],
+     feature_list=["customer_behavior:v1"]
+ )
+ ```
466
+
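`functools.lru_cache` has no expiry and needs hashable arguments, so an in-memory level built on it can keep serving values after the one-hour Redis TTL has passed. One possible alternative is a small in-process cache with its own TTL. The sketch below is an assumption about how that could look; `TTLCache` and `make_key` are hypothetical names, not part of `online_feature_server.py`.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small in-process LRU cache whose entries expire after ttl_seconds."""

    def __init__(self, maxsize: int = 1000, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.maxsize = maxsize
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._items[key] = (self._clock() + self.ttl_seconds, value)
        self._items.move_to_end(key)
        if len(self._items) > self.maxsize:
            self._items.popitem(last=False)  # evict the least recently used entry


def make_key(customer_id: str, feature_list) -> tuple:
    """Build a hashable cache key; a tuple works where a list would raise TypeError."""
    return (customer_id, tuple(feature_list))


cache = TTLCache(maxsize=1000, ttl_seconds=3600.0)
cache.set(make_key("12345", ["customer_behavior:v1"]), {"purchase_count_7d": 3})
```

Matching the in-memory TTL to the Redis TTL (or making it shorter) keeps the first cache level from outliving the second.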
467
+ ### Cost-Optimized Feature Computation with Spark
468
+ ```python
+ from pyspark.sql import SparkSession
+ from pyspark.sql import functions as F
+ from pyspark.sql.window import Window
+
+ from finops_tracker import FeatureCostTracker
+
+ def optimize_spark_feature_computation():
+     """Optimized Spark configuration for feature engineering"""
+
+     # Shuffle partition and executor counts are left to adaptive query execution
+     # and dynamic allocation; "auto" is not a valid value for
+     # spark.sql.shuffle.partitions or spark.executor.instances in open-source Spark
+     spark = SparkSession.builder \
+         .appName("FeatureEngineering") \
+         .config("spark.sql.adaptive.enabled", "true") \
+         .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
+         .config("spark.dynamicAllocation.enabled", "true") \
+         .config("spark.dynamicAllocation.minExecutors", "1") \
+         .config("spark.dynamicAllocation.maxExecutors", "10") \
+         .getOrCreate()
+
+     # Read data with partition pruning
+     events = spark.read.parquet("bronze.customer_events") \
+         .filter(F.col("event_date") >= F.date_sub(F.current_date(), 30))  # Last 30 days only
+
+     # Broadcast small dimension tables
+     customer_dim = spark.read.parquet("gold.customer_dim")
+     broadcast_customers = F.broadcast(customer_dim)  # Avoid shuffle
+
+     # Row-based windows: the current event plus the previous 6 or 29 events per
+     # customer (rows, not calendar days); kept for per-event rolling features
+     window_7d = Window.partitionBy("customer_id").orderBy("event_timestamp").rowsBetween(-6, 0)
+     window_30d = Window.partitionBy("customer_id").orderBy("event_timestamp").rowsBetween(-29, 0)
+
+     # Compute features efficiently
+     # customer_segment is assumed to come from customer_dim; it is kept in the
+     # grouping so that partitionBy("customer_segment") works on write
+     features = events.join(broadcast_customers, "customer_id", "left") \
+         .groupBy("customer_id", "customer_segment") \
+         .agg(
+             # Aggregations
+             F.count("*").alias("event_count_30d"),
+             F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchase_count_30d"),
+             F.sum("purchase_amount").alias("total_spend_30d"),
+             F.avg("session_duration").alias("avg_session_duration"),
+             F.max("event_timestamp").alias("last_activity_timestamp"),
+
+             # Percentiles (approximate for speed)
+             F.expr("percentile_approx(purchase_amount, 0.5)").alias("median_purchase_amount"),
+
+             # Collect list for sequential features
+             F.collect_list(F.struct("event_timestamp", "event_type")).alias("event_sequence")
+         )
+
+     # Write with optimization
+     features.write \
+         .mode("overwrite") \
+         .format("parquet") \
+         .option("compression", "snappy") \
+         .partitionBy("customer_segment") \
+         .save("gold.customer_behavior_features")
+
+     spark.stop()
+
+ # Run with cost tracking
+ cost_tracker = FeatureCostTracker()
+ cost_tracker.track_spark_job(optimize_spark_feature_computation)
+ ```
530
+
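A window defined with `rowsBetween(-6, 0)` covers the last seven rows per customer, which equals seven days only if there is exactly one event per day. For a window measured in time, Spark offers `rangeBetween` over a numeric ordering column; the same idea is easier to show in a small, runnable pandas sketch. The function and column names below are hypothetical.

```python
import pandas as pd


def rolling_purchase_counts(events: pd.DataFrame, window: str = "7D") -> pd.DataFrame:
    """Count each customer's non-null purchase amounts in a trailing time window."""
    # Sort so the grouped rolling result lines up row for row with the frame
    ordered = events.sort_values(["customer_id", "event_timestamp"]).reset_index(drop=True)
    counts = (
        ordered.set_index("event_timestamp")
        .groupby("customer_id")["purchase_amount"]
        .rolling(window)  # time-based window ending at each event
        .count()
    )
    ordered["purchase_count_window"] = counts.to_numpy()
    return ordered


events = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a"],
    "event_timestamp": pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-01", "2024-01-09"]),
    "purchase_amount": [1.0, 2.0, 3.0, 4.0],
})
windowed = rolling_purchase_counts(events)
```

Here the third event of customer `a` (2024-01-09) counts only itself and the 2024-01-03 event, because the 2024-01-01 event is more than seven days old; a row-based window would have counted all three.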
531
+ ## 🚀 CI/CD for Feature Pipelines
532
+
533
+ ### Automated Feature Pipeline
534
+ ```yaml
+ # .github/workflows/feature-pipeline.yml
+ name: Feature Engineering Pipeline
+
+ on:
+   push:
+     paths:
+       - 'features/**'
+       - 'transformations/**'
+     branches:
+       - main
+   schedule:
+     - cron: '0 */6 * * *' # Every 6 hours
+
+ jobs:
+   feature-engineering:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+
+       - name: Azure Login
+         uses: azure/login@v1
+         with:
+           creds: ${{ secrets.AZURE_CREDENTIALS }}
+
+       - name: Unit test feature transformations
+         run: pytest tests/features/
+
+       - name: Validate feature schema
+         run: python scripts/validate_feature_schema.py
+
+       - name: Run feature quality checks
+         run: python scripts/feature_quality_checks.py
+
+       - name: Compute features (incremental)
+         run: |
+           python pipelines/compute_features.py \
+             --mode incremental \
+             --optimize-cost true \
+             --max-cost 50.00
+
+       - name: Validate feature distributions
+         run: python scripts/validate_feature_distributions.py
+
+       - name: Test training-serving consistency
+         run: pytest tests/integration/test_feature_consistency.py
+
+       - name: Register features in feature store
+         if: success()
+         run: python scripts/register_features.py --version auto
+
+       - name: Update online feature store
+         run: python scripts/sync_online_features.py
+
+       - name: Run feature drift detection
+         run: python scripts/detect_feature_drift.py
+
+       - name: Generate feature cost report
+         run: python scripts/feature_cost_report.py
+ ```
594
+
595
+ ## 📊 Metrics & Monitoring
596
+
597
+ | Metric Category | Metric | Target | Tool |
598
+ |-----------------|--------|--------|------|
599
+ | **Computation Costs** | Cost per feature group | <$10 | FinOps tracker |
600
+ | | Monthly feature compute | <$1000 | Azure Cost Management |
601
+ | | Incremental savings | >80% | Cost tracker |
602
+ | | Spark cluster utilization | >75% | Azure Monitor |
603
+ | **Storage Costs** | Feature storage cost | <$500/month | Azure Storage metrics |
604
+ | | Compression ratio | >3x | Parquet metrics |
605
+ | | Archived features | >60% | Lifecycle policy |
606
+ | **Serving Performance** | Online feature latency (p95) | <10ms | App Insights |
607
+ | | Cache hit rate | >90% | Redis metrics |
608
+ | | Feature freshness | <6 hours | Freshness monitor |
609
+ | **Data Quality** | Feature completeness | >99% | Quality checks |
610
+ | | Feature drift score | <0.15 | Drift detector |
611
+ | | Schema validation success | 100% | Validation tests |
612
+ | **Pipeline Reliability** | Feature pipeline success | >99% | Airflow/ADF |
613
+ | | Training-serving skew | <1% | Consistency tests |
614
+
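The training-serving skew target in the table needs a concrete definition before it can be measured. One possible definition, assumed here for illustration, is the fraction of feature values that differ between the offline and online stores for the same entities. The sketch handles numeric features only, and `skew_rate` and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd


def skew_rate(offline: pd.DataFrame, online: pd.DataFrame,
              key: str = "customer_id", rtol: float = 1e-6) -> float:
    """Fraction of compared feature values that differ between the two stores."""
    feature_cols = [c for c in offline.columns if c != key and c in online.columns]
    # Inner join: only entities present in both stores are compared
    joined = offline.merge(online, on=key, suffixes=("_offline", "_online"))
    if joined.empty or not feature_cols:
        return 0.0
    mismatches = 0
    for col in feature_cols:
        a = joined[f"{col}_offline"].to_numpy(dtype=float)
        b = joined[f"{col}_online"].to_numpy(dtype=float)
        # Values count as equal within the tolerance; two NaNs also count as equal
        mismatches += int((~np.isclose(a, b, rtol=rtol, equal_nan=True)).sum())
    return mismatches / (len(joined) * len(feature_cols))


offline = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "total_spend_30d": [10.0, 20.0, 30.0, 40.0],
    "avg_session_duration": [1.5, 2.5, 3.5, 4.5],
})
online = offline.copy()
online.loc[online["customer_id"] == "c", "total_spend_30d"] = 31.0  # one stale value

rate = skew_rate(offline, online)
```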
615
+ ## 🔄 Integration Workflow
616
+
617
+ ### End-to-End Feature Pipeline
618
+ ```
619
+ 1. Data Ingestion (de-01)
620
+
621
+ 2. Data Quality Validation (de-03)
622
+
623
+ 3. PII Detection & Masking (sa-01)
624
+
625
+ 4. Feature Transformation (ml-02)
626
+
627
+ 5. Feature Validation (ml-02)
628
+
629
+ 6. Cost Optimization (fo-05, fo-07)
630
+
631
+ 7. Feature Store Registration (mo-02)
632
+
633
+ 8. Lineage Tracking (mo-06)
634
+
635
+ 9. Online Store Sync (ml-02)
636
+
637
+ 10. Feature Drift Monitoring (mo-05)
638
+
639
+ 11. Training Data Generation (ml-01, ml-03)
640
+
641
+ 12. Real-time Feature Serving (ml-04)
642
+ ```
643
+
644
+ ## 🎯 Quick Wins
645
+
646
+ 1. **Implement incremental feature computation** - 80-95% compute cost reduction
647
+ 2. **Enable feature caching with Redis** - 10x faster online serving
648
+ 3. **Use Parquet with compression** - 3-5x storage cost reduction
649
+ 4. **Centralize features in feature store** - Eliminate duplicate computations
650
+ 5. **Set up feature versioning** - Enable reproducibility and rollback
651
+ 6. **Implement lifecycle policies** - 60-90% storage cost savings
652
+ 7. **Optimize Spark configurations** - 30-50% faster feature computation
653
+ 8. **Enable feature drift monitoring** - Detect data quality issues early
654
+ 9. **Use broadcast joins for lookups** - 5-10x faster Spark joins
655
+ 10. **Implement batch feature serving** - 100x more efficient than single requests