tech-hub-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (133)
  1. package/LICENSE +21 -0
  2. package/README.md +250 -0
  3. package/bin/cli.js +241 -0
  4. package/bin/copilot.js +182 -0
  5. package/bin/postinstall.js +42 -0
  6. package/package.json +46 -0
  7. package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
  8. package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
  9. package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
  10. package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
  11. package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
  12. package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
  13. package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
  14. package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
  15. package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
  16. package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
  17. package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
  18. package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
  19. package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
  20. package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
  21. package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
  22. package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
  23. package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
  24. package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
  25. package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
  26. package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
  27. package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
  28. package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
  29. package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
  30. package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
  31. package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
  32. package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
  33. package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
  34. package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
  35. package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
  36. package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
  37. package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
  38. package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
  39. package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
  40. package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
  41. package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
  42. package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
  43. package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
  44. package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
  45. package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
  46. package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
  47. package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
  48. package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
  49. package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
  50. package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
  51. package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
  52. package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
  53. package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
  54. package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
  55. package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
  56. package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
  57. package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
  58. package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
  59. package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
  60. package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
  61. package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
  62. package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
  63. package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
  64. package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
  65. package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
  66. package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
  67. package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
  68. package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
  69. package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
  70. package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
  71. package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
  72. package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
  73. package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
  74. package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
  75. package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
  76. package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
  77. package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
  78. package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
  79. package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
  80. package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
  81. package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
  82. package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
  83. package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
  84. package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
  85. package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
  86. package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
  87. package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
  88. package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
  89. package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
  90. package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
  91. package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
  92. package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
  93. package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
  94. package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
  95. package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
  96. package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
  97. package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
  98. package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
  99. package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
  100. package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
  101. package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
  102. package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
  103. package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
  104. package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
  105. package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
  106. package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
  107. package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
  108. package/tech_hub_skills/skills/README.md +336 -0
  109. package/tech_hub_skills/skills/ai-engineer.md +104 -0
  110. package/tech_hub_skills/skills/azure.md +149 -0
  111. package/tech_hub_skills/skills/code-review.md +399 -0
  112. package/tech_hub_skills/skills/compliance-automation.md +747 -0
  113. package/tech_hub_skills/skills/data-engineer.md +113 -0
  114. package/tech_hub_skills/skills/data-governance.md +102 -0
  115. package/tech_hub_skills/skills/data-scientist.md +123 -0
  116. package/tech_hub_skills/skills/devops.md +160 -0
  117. package/tech_hub_skills/skills/docker.md +160 -0
  118. package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
  119. package/tech_hub_skills/skills/finops.md +184 -0
  120. package/tech_hub_skills/skills/ml-engineer.md +115 -0
  121. package/tech_hub_skills/skills/mlops.md +187 -0
  122. package/tech_hub_skills/skills/optimization-advisor.md +329 -0
  123. package/tech_hub_skills/skills/orchestrator.md +497 -0
  124. package/tech_hub_skills/skills/platform-engineer.md +102 -0
  125. package/tech_hub_skills/skills/process-automation.md +226 -0
  126. package/tech_hub_skills/skills/process-changelog.md +184 -0
  127. package/tech_hub_skills/skills/process-documentation.md +484 -0
  128. package/tech_hub_skills/skills/process-kanban.md +324 -0
  129. package/tech_hub_skills/skills/process-versioning.md +214 -0
  130. package/tech_hub_skills/skills/product-designer.md +104 -0
  131. package/tech_hub_skills/skills/project-starter.md +443 -0
  132. package/tech_hub_skills/skills/security-architect.md +135 -0
  133. package/tech_hub_skills/skills/system-design.md +126 -0
@@ -0,0 +1,129 @@
# dg-02: Data Lineage

## Overview

Track end-to-end data lineage for impact analysis, root cause analysis, and regulatory compliance.

## Key Capabilities

- **End-to-End Lineage**: From source to consumption
- **Impact Analysis**: Understand downstream impacts
- **Root Cause Analysis**: Trace issues to source
- **Column-Level Lineage**: Field-level tracking
- **Transformation Documentation**: Track data transformations

## Tools & Technologies

- **Azure Purview**: Native lineage tracking
- **OpenLineage**: Open standard for lineage
- **Marquez**: Metadata service for lineage
- **Spline**: Spark lineage tracking

## Implementation

### 1. Lineage Extraction

```python
# Enable Spline lineage tracking for Spark.
# The Spline agent is a JVM library: its JAR must be on the Spark classpath,
# and it is configured through Spark conf rather than a Python API.
from pyspark.sql import SparkSession

def build_spark_with_lineage(app_name="data-pipeline"):
    """Create a Spark session with Spline lineage tracking enabled"""
    return (
        SparkSession.builder
        .appName(app_name)
        # Register the Spline query execution listener
        .config("spark.sql.queryExecutionListeners",
                "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
        # REQUIRED mode fails the job if lineage cannot be captured
        .config("spark.spline.mode", "REQUIRED")
        .config("spark.spline.producer.url", "http://spline-server:9090/producer")
        .getOrCreate()
    )
```

### 2. Column-Level Lineage

```sql
-- Azure Purview automatically tracks column lineage
-- Example transformation with lineage
CREATE VIEW customer_360 AS
SELECT
    c.customer_id,
    c.first_name || ' ' || c.last_name AS full_name,  -- Lineage: derived
    o.total_orders,
    p.total_payments
FROM customers c
LEFT JOIN order_summary o ON c.customer_id = o.customer_id
LEFT JOIN payment_summary p ON c.customer_id = p.customer_id;
```

### 3. Impact Analysis

```python
# Find downstream dependencies via the catalog lineage API.
# NOTE: `client` is assumed to be an initialized catalog client
# (e.g. Azure Purview / Apache Atlas) created elsewhere.
def get_downstream_impact(client, asset_id):
    """Find all downstream assets affected by changes"""
    lineage = client.lineage.get_lineage(
        guid=asset_id,
        direction="OUTPUT",  # follow outputs only, i.e. downstream
        depth=10
    )

    downstream_assets = []
    for entity in lineage['guidEntityMap'].values():
        downstream_assets.append({
            'name': entity['attributes']['name'],
            'type': entity['typeName'],
            'owner': entity.get('attributes', {}).get('owner')
        })

    return downstream_assets
```

### 4. OpenLineage Integration

```python
# Emit lineage events using OpenLineage
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job

def emit_lineage_event(job_name, inputs, outputs):
    """Emit a COMPLETE run event to an OpenLineage backend"""
    client = OpenLineageClient(url="http://lineage-api:5000")

    event = RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid.uuid4())),
        job=Job(namespace="production", name=job_name),
        producer="tech-hub-skills/data-pipeline",  # identifies the emitting integration
        inputs=inputs,
        outputs=outputs
    )

    client.emit(event)
```

## Best Practices

1. **Automate Collection** - Manual lineage doesn't scale
2. **Column-Level Tracking** - For sensitive data, track at field level
3. **Version Control** - Track lineage changes over time
4. **Clear Visualization** - Make lineage easy to understand
5. **Regular Validation** - Verify lineage accuracy

## Cost Optimization

- Use incremental lineage updates
- Archive old lineage data after the retention period
- Cache frequently accessed lineage queries
- Use materialized views for complex lineage
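
Caching frequently accessed lineage queries can be as simple as memoizing the lookup. A minimal sketch, assuming an in-process cache is acceptable (the decorator, TTL value, and placeholder lookup below are illustrative, not part of any catalog API):

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=300):
    """Memoize a function's results for a limited time (simple TTL cache).

    Only positional, hashable arguments are supported in this sketch.
    """
    def decorator(fn):
        cache = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            if args in cache:
                value, stored_at = cache[args]
                if now - stored_at < ttl_seconds:
                    return value  # serve cached lineage without hitting the catalog
            value = fn(*args)
            cache[args] = (value, now)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=600)
def downstream_assets(asset_id):
    # placeholder for a real (and comparatively slow) catalog lineage call
    return {"asset": asset_id, "downstream": []}
```

In production you would more likely put this behind a shared cache (e.g. Redis) so all consumers benefit, but the expiry-on-read pattern is the same.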

## Integration

**Connects with:**
- de-02 (ETL): Track pipeline lineage
- dg-01 (Catalog): Link assets to lineage
- ml-02 (Feature Engineering): Track feature lineage
- ai-02 (RAG): Track document lineage

## Quick Win

Start with one critical data pipeline, manually document its lineage, validate accuracy, then automate extraction.
@@ -0,0 +1,182 @@
# dg-03: Data Quality Framework

## Overview

Implement automated data quality validation, scoring, monitoring, and issue remediation workflows.

## Key Capabilities

- **Quality Rules Definition**: Completeness, accuracy, consistency
- **Automated Validation**: Real-time quality checks
- **Quality Scoring**: Quantifiable quality metrics
- **Quality Monitoring**: Continuous quality tracking
- **Issue Remediation**: Workflows for quality issues

## Tools & Technologies

- **Great Expectations**: Python data validation
- **Soda**: Data quality as code
- **dbt tests**: Quality tests in dbt
- **Azure Data Quality**: Native Azure solution

## Implementation

### 1. Quality Rules with Great Expectations

```python
# Define quality expectations
import great_expectations as gx

def create_quality_suite(context, table_name, batch_request):
    """Create data quality test suite for the batch described by batch_request"""
    suite = context.add_expectation_suite(
        expectation_suite_name=f"{table_name}_quality_suite"
    )

    validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite_name=suite.expectation_suite_name
    )

    # Completeness checks
    validator.expect_column_values_to_not_be_null(column="customer_id")
    validator.expect_column_values_to_not_be_null(column="order_date")

    # Accuracy checks
    validator.expect_column_values_to_be_between(
        column="age",
        min_value=0,
        max_value=120
    )

    # Consistency checks
    validator.expect_column_values_to_match_regex(
        column="email",
        regex=r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    )

    validator.save_expectation_suite()
    return validator
```

### 2. Quality Scoring

```python
# Calculate quality score from a Great Expectations validation result
def calculate_quality_score(validation_results):
    """Calculate overall quality score"""
    total_checks = validation_results.statistics['evaluated_expectations']
    successful_checks = validation_results.statistics['successful_expectations']

    score = (successful_checks / total_checks) * 100

    # Categorize quality
    if score >= 95:
        quality_level = "Excellent"
    elif score >= 85:
        quality_level = "Good"
    elif score >= 70:
        quality_level = "Acceptable"
    else:
        quality_level = "Poor"

    return {
        'score': score,
        'level': quality_level,
        'total_checks': total_checks,
        'passed_checks': successful_checks
    }
```

### 3. Automated Monitoring

```python
# Set up quality monitoring
def setup_quality_monitoring(context, checkpoint_name):
    """Configure automated quality monitoring on a Great Expectations context"""
    checkpoint_config = {
        "name": checkpoint_name,
        "config_version": 1.0,
        "template_name": "default",
        "run_name_template": "%Y%m%d-%H%M%S",
        "validations": [
            {
                "batch_request": {
                    "datasource_name": "production_data",
                    "data_connector_name": "default_inferred_data_connector_name",
                    "data_asset_name": "customers"
                },
                "expectation_suite_name": "customers_quality_suite"
            }
        ],
        "action_list": [
            {
                "name": "store_validation_result",
                "action": {"class_name": "StoreValidationResultAction"}
            },
            {
                "name": "send_slack_notification",
                "action": {
                    "class_name": "SlackNotificationAction",
                    "slack_webhook": "${SLACK_WEBHOOK}",
                    "notify_on": "failure"
                }
            }
        ]
    }

    context.add_checkpoint(**checkpoint_config)
```

### 4. Issue Remediation Workflow

```python
# Create remediation workflow.
# NOTE: `client` is assumed to be a convenience wrapper around the official
# azure-devops SDK (which exposes its API via azure.devops.connection.Connection);
# the wrapper is kept hypothetical here for brevity.
def create_remediation_workflow(client, quality_issues):
    """Create work items for quality issues"""
    for issue in quality_issues:
        work_item = {
            'title': f"Data Quality Issue: {issue['column']}",
            'description': issue['description'],
            'priority': issue['severity'],
            'assigned_to': issue['data_owner'],
            'tags': ['data-quality', issue['table']]
        }

        client.create_work_item(
            project='DataGovernance',
            work_item_type='Bug',
            fields=work_item
        )
```

## Best Practices

1. **Start Simple** - Begin with critical fields, expand coverage
2. **Automate Everything** - Manual checks don't scale
3. **Clear Ownership** - Assign quality issues to data owners
4. **Threshold Alerts** - Alert on quality score drops
5. **Historical Tracking** - Monitor quality trends over time

## Cost Optimization

- Run quality checks incrementally (only new/changed data)
- Use sampling for large datasets
- Cache validation results
- Right-size validation compute
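
Incremental checks can be sketched with a simple watermark filter: only rows newer than the last validated timestamp are handed to the validation suite. A minimal sketch, assuming a timestamp column is available (the column name and watermark store are illustrative):

```python
import pandas as pd

def incremental_batch(df: pd.DataFrame, watermark: pd.Timestamp,
                      ts_col: str = "updated_at") -> pd.DataFrame:
    """Return only rows that changed since the last validation run."""
    return df[df[ts_col] > watermark]

# Usage: validate only the incremental slice, then advance the watermark
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "updated_at": pd.to_datetime(["2025-01-01", "2025-02-01", "2025-03-01"]),
})
batch = incremental_batch(df, pd.Timestamp("2025-01-15"))
new_watermark = batch["updated_at"].max()
```

The watermark itself would normally be persisted (e.g. alongside checkpoint results) so each scheduled run picks up where the previous one stopped.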

## Integration

**Connects with:**
- de-01 (Lakehouse): Validate lakehouse data
- de-03 (Data Quality): Engineering quality checks
- dg-01 (Catalog): Link quality scores to assets
- dg-02 (Lineage): Trace quality issues to source

## Quick Win

Implement completeness checks on 5 critical fields in your most important table. Show before/after quality scores.
@@ -0,0 +1,39 @@
# dg-04: Access Control & Policies

## Overview

Implement role-based access control, column/row-level security, dynamic data masking, and access audit logging.

## Key Capabilities

- **RBAC**: Role-based access control
- **Column-Level Security**: Restrict sensitive columns
- **Row-Level Security**: Filter data by user context
- **Dynamic Data Masking**: Auto-mask sensitive data
- **Access Audit Logging**: Track all data access

## Implementation

```sql
-- Column-level security via a dynamic masking view
-- (CURRENT_USER() / SPLIT_PART() as in Databricks SQL)
CREATE VIEW customer_secure AS
SELECT
    customer_id,
    CASE
        WHEN CURRENT_USER() IN (SELECT user_name FROM admin_users)
            THEN email  -- Show full email to admins
        ELSE CONCAT(LEFT(email, 3), '***@', SPLIT_PART(email, '@', 2))  -- Mask for others
    END AS email,
    first_name,
    last_name
FROM customers;

-- Row-level security (PostgreSQL-style policy)
CREATE POLICY customer_region_policy ON customers
    FOR SELECT
    USING (region = current_setting('app.user_region'));
```
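
The access audit logging capability has no snippet above. A minimal application-side sketch, assuming an append-only log sink (the decorator, dataset names, and in-memory list below are illustrative):

```python
import json
import time
from functools import wraps

AUDIT_LOG = []  # in production: an append-only table or log sink, not a list

def audited(dataset: str):
    """Decorator that records who accessed which dataset, and when."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(user, *args, **kwargs):
            AUDIT_LOG.append(json.dumps({
                "user": user,
                "dataset": dataset,
                "action": fn.__name__,
                "ts": time.time(),
            }))
            return fn(user, *args, **kwargs)
        return wrapper
    return decorator

@audited("customers")
def read_customers(user):
    return ["row1", "row2"]  # placeholder for a real query
```

Platform-native audit logs (e.g. database or cloud provider access logs) should remain the system of record; an application-side trail like this complements them with business context.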

## Integration

**Connects with:** sa-01 (PII Detection), sa-04 (IAM), dg-01 (Catalog)
@@ -0,0 +1,40 @@
# dg-05: Master Data Management

## Overview

Entity resolution, golden record creation, data stewardship, and hierarchy management for critical business entities.

## Key Capabilities

- **Entity Resolution**: Match and merge duplicate entities
- **Golden Record**: Single source of truth
- **Data Stewardship**: Workflows for data quality
- **Cross-Reference**: Link entities across systems
- **Hierarchy Management**: Organizational structures

## Implementation

```python
# Entity resolution with the recordlinkage library
from recordlinkage import Index, Compare

def match_customers(df1, df2):
    """Match customer records across systems"""
    # Blocking: only compare pairs that share a last name
    indexer = Index()
    indexer.block('last_name')
    candidate_pairs = indexer.index(df1, df2)

    # Field-by-field comparison rules
    compare = Compare()
    compare.exact('first_name', 'first_name')
    compare.string('email', 'email', method='jarowinkler', threshold=0.85)
    compare.numeric('age', 'age', method='linear', offset=2)

    features = compare.compute(candidate_pairs, df1, df2)
    # Keep pairs whose combined similarity clears the match threshold
    matches = features[features.sum(axis=1) > 2.5]

    return matches
```
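
Once duplicates are matched, the golden record is assembled with survivorship rules. A minimal sketch, assuming "most recently updated non-null value wins" (the field names and `updated_at` key are illustrative):

```python
def build_golden_record(records):
    """Merge duplicate records into one golden record.

    `records` is a list of dicts, each carrying a sortable `updated_at`
    value; for every field, the newest non-null value survives.
    """
    golden = {}
    # Oldest first, so later (newer) assignments overwrite earlier ones
    for rec in sorted(records, key=lambda r: r["updated_at"]):
        for field, value in rec.items():
            if value is not None:
                golden[field] = value
    return golden

# Usage
duplicates = [
    {"customer_id": 1, "email": "a@old.com", "phone": None, "updated_at": 1},
    {"customer_id": 1, "email": "a@new.com", "phone": "555-0100", "updated_at": 2},
]
golden = build_golden_record(duplicates)
# email survives from the newer record, phone from the only record that has it
```

Real MDM survivorship is usually per-field (e.g. trust the CRM for email but billing for address); that just means swapping the single recency rule for a rule table keyed by field.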

## Integration

**Connects with:** dg-01 (Catalog), dg-03 (Quality), de-02 (ETL)
@@ -0,0 +1,46 @@
# dg-06: Compliance & Privacy

## Overview

GDPR compliance automation, data retention policies, right to be forgotten, consent management, and privacy impact assessments.

## Key Capabilities

- **GDPR Automation**: Automated compliance checks
- **Data Retention**: Automated data lifecycle
- **Right to be Forgotten**: Delete personal data on request
- **Consent Management**: Track user consent
- **Privacy Impact Assessments**: Risk assessment

## Implementation

```python
# Right to be forgotten
def delete_user_data(spark, user_id):
    """Delete all personal data for a user"""
    tables = [
        'customers', 'orders', 'payments',
        'preferences', 'analytics_events'
    ]

    for table in tables:
        # Parameterized SQL (Spark 3.4+) avoids injection via user_id;
        # table names come from the fixed list above, never from input
        spark.sql(
            f"DELETE FROM {table} WHERE user_id = :uid",
            args={"uid": user_id}
        )

    # Log deletion for audit (log_gdpr_deletion defined elsewhere)
    log_gdpr_deletion(user_id, tables)

# Data retention policy
def apply_retention_policy(spark):
    """Delete data past the retention period"""
    spark.sql("""
        DELETE FROM customer_events
        WHERE event_date < DATE_SUB(CURRENT_DATE(), 730)  -- 2 years
    """)
```
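
Consent management is listed above but not illustrated. A minimal in-memory sketch of a consent registry gate (purpose names are illustrative; a production system would back this with an audited table and timestamps of grant/revoke):

```python
from datetime import datetime, timezone

class ConsentRegistry:
    """Track which processing purposes each user has consented to."""

    def __init__(self):
        self._grants = {}  # (user_id, purpose) -> timestamp of grant

    def grant(self, user_id, purpose):
        self._grants[(user_id, purpose)] = datetime.now(timezone.utc)

    def revoke(self, user_id, purpose):
        self._grants.pop((user_id, purpose), None)

    def is_allowed(self, user_id, purpose):
        """Call before any processing that requires this consent purpose."""
        return (user_id, purpose) in self._grants

# Usage: gate a marketing job on consent
registry = ConsentRegistry()
registry.grant("u-42", "marketing_email")
if registry.is_allowed("u-42", "marketing_email"):
    pass  # safe to include user in the campaign
```

Pairing this with dg-04 access policies lets the consent check happen at the data layer rather than in every application.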

## Integration

**Connects with:** sa-01 (PII Detection), dg-01 (Catalog), dg-04 (Access Control)
@@ -0,0 +1,230 @@
# Skill 1: Automated Exploratory Data Analysis (EDA)

## 🎯 Overview
Automated EDA with statistical profiling, visualization, and insight generation.

## 🔗 Connections
- **Data Engineer**: Provides feedback on data quality issues (de-01, de-03)
- **ML Engineer**: Identifies promising features for modeling (ml-01, ml-02)
- **MLOps**: Experiment tracking for EDA findings (mo-01)
- **AI Engineer**: Generates insights for LLM context (ai-02, ai-03)
- **Security Architect**: PII detection in datasets (sa-01)
- **FinOps**: Cost-effective analytics compute (fo-06)
- **DevOps**: Automated reporting pipelines (do-01)

## 🛠️ Tools Included

### 1. `eda_generator.py`
Automated EDA report generation with ydata-profiling.

### 2. `statistical_analyzer.py`
Statistical tests, distributions, and correlations.

### 3. `visualization_suite.py`
Interactive visualizations with Plotly.

### 4. `insight_extractor.py`
Automated insight extraction and anomaly detection.

### 5. `eda_queries.sql`
SQL templates for common analytical queries.

## 📊 Key Outputs
- Automated profiling reports (HTML)
- Statistical summaries
- Correlation matrices
- Distribution plots
- Anomaly detection alerts

## 🚀 Quick Start

```python
import pandas as pd

from eda_generator import EDAGenerator

# Initialize
eda = EDAGenerator()

# Load data
df = pd.read_csv("customer_data.csv")

# Generate comprehensive report
report = eda.generate_report(
    df=df,
    title="Customer Data Analysis",
    output_file="eda_report.html"
)

# Extract key insights
insights = eda.extract_insights(df)
print(insights)
```
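
The anomaly detection that `insight_extractor.py` provides can be approximated with a simple z-score rule. A minimal sketch (the function name is illustrative; a threshold of 3 standard deviations is a common default, not a project setting):

```python
import pandas as pd

def zscore_anomalies(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return series[z.abs() > threshold]

# Usage
values = pd.Series([10, 11, 9, 10, 12, 10, 11, 250])  # 250 is an outlier
outliers = zscore_anomalies(values, threshold=2.0)
```

Z-scores assume roughly normal data; for skewed distributions, a robust variant (median and MAD, or IQR fences) behaves better.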

## 📚 Best Practices

### Data Quality & Security (Cross-Role Integration)

1. **PII Detection Before Analysis**
   - Scan datasets for PII before profiling
   - Mask sensitive data in reports and visualizations
   - Track data lineage for compliance
   - Reference: Security Architect sa-01 (PII Detection)

2. **Data Quality Validation**
   - Validate schema before EDA
   - Check completeness, accuracy, consistency
   - Alert the Data Engineering team on quality issues
   - Reference: Data Engineer de-03 (Data Quality)

3. **Automated Quality Feedback Loop**
   - Generate data quality scorecards
   - Feed insights back to data pipelines
   - Track quality improvements over time
   - Reference: Data Engineer de-01, de-03

### Cost Optimization (FinOps Integration)

4. **Optimize Compute for Analysis**
   - Use appropriate instance sizes for EDA workloads
   - Auto-shutdown notebooks when idle
   - Sample large datasets intelligently
   - Monitor analysis costs per project
   - Reference: FinOps fo-06 (Compute Optimization)

5. **Efficient Data Sampling**
   - Use stratified sampling for large datasets
   - Profile samples before full dataset analysis
   - Cache intermediate results
   - Minimize data movement and storage
   - Reference: FinOps fo-05, Data Engineer de-01

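
Stratified sampling can be sketched with a pandas groupby: sampling the same fraction from each stratum preserves every segment's share of the data (the column name and fraction below are illustrative):

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, strata_col: str, frac: float,
                      seed: int = 42) -> pd.DataFrame:
    """Sample the same fraction from every stratum so group shares are preserved."""
    return (
        df.groupby(strata_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# Usage: 10% sample that keeps the region mix of the full dataset
df = pd.DataFrame({
    "region": ["EU"] * 80 + ["US"] * 20,
    "value": range(100),
})
sample = stratified_sample(df, "region", frac=0.1)
```

Compared with a plain `df.sample(frac=0.1)`, this guarantees rare segments still appear in the sample, which keeps profile statistics per segment meaningful.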
### MLOps Integration

6. **Track EDA Experiments**
   - Log EDA findings in MLflow/Azure ML
   - Version datasets used for analysis
   - Document feature engineering insights
   - Link EDA to downstream model experiments
   - Reference: MLOps mo-01 (Experiment Tracking)

7. **Feature Discovery Documentation**
   - Document promising features for ML
   - Track feature importance from EDA
   - Share insights with the ML Engineering team
   - Maintain a feature catalog
   - Reference: ML Engineer ml-02 (Feature Engineering)

### Automation & Deployment (DevOps Integration)

8. **Automated EDA Pipelines**
   - Schedule regular EDA reports for key datasets
   - Automate anomaly detection and alerting
   - Deploy EDA as part of data pipeline monitoring
   - Version control EDA scripts
   - Reference: DevOps do-01 (CI/CD), do-08 (Monitoring)

9. **Reproducible Analysis**
   - Use containerized environments
   - Pin package versions
   - Document analysis dependencies
   - Enable one-click report regeneration
   - Reference: DevOps do-03 (Containerization)

### AI Integration

10. **LLM-Powered Insights**
    - Use LLMs to generate narrative insights
    - Automate insight extraction from distributions
    - Create natural language data summaries
    - Reference: AI Engineer ai-01, ai-07

## 💰 Cost Optimization Examples

### Compute Cost Tracking
```python
import pandas as pd

from eda_generator import EDAGenerator
# NOTE: finops_tracker is an illustrative internal module, not a public package
from finops_tracker import AnalyticsCostTracker

cost_tracker = AnalyticsCostTracker()

# Track EDA compute costs
@cost_tracker.track_analysis_cost
def run_eda(dataset_path: str):
    eda = EDAGenerator()
    df = pd.read_csv(dataset_path)

    # Smart sampling for large datasets
    if len(df) > 1_000_000:
        df = df.sample(n=100_000, random_state=42)  # Cost savings

    report = eda.generate_report(df)
    return report

# Cost report
report = cost_tracker.monthly_report()
print(f"Total EDA costs: ${report.total_cost:.2f}")
print(f"Cost per analysis: ${report.avg_cost:.2f}")
```

## 🔒 Security Best Practices

### PII Masking in Reports
```python
import pandas as pd

# NOTE: pii_detector is an illustrative internal module, not a public package
from pii_detector import PIIDetector
from eda_generator import EDAGenerator

detector = PIIDetector()
eda = EDAGenerator()

def secure_eda(df: pd.DataFrame):
    # Detect PII columns from a sample of each column's values
    pii_columns = []
    for col in df.columns:
        sample = df[col].astype(str).sample(min(100, len(df)))
        if detector.contains_pii(sample.tolist()):
            pii_columns.append(col)

    # Mask PII before EDA
    df_masked = df.copy()
    for col in pii_columns:
        df_masked[col] = "***MASKED***"

    # Generate report on masked data
    report = eda.generate_report(
        df_masked,
        title="Customer Data Analysis (PII Masked)"
    )

    return report, pii_columns
```

## 🔄 Integration Workflow

### End-to-End EDA Pipeline
```
1. Data Ingestion (de-01)
        ↓
2. PII Detection (sa-01)
        ↓
3. Data Quality Check (de-03)
        ↓
4. Automated EDA (ds-01)
        ↓
5. Track Findings (mo-01)
        ↓
6. Feature Discovery (ml-02)
        ↓
7. Generate Insights (ai-07)
        ↓
8. Share Report (Automated)
        ↓
9. Monitor Costs (fo-06)
```

## 🎯 Quick Wins

1. **Automate PII detection** - Prevent compliance violations in reports
2. **Set up cost tracking** - Monitor analysis compute spending
3. **Enable auto-shutdown** - Stop idle notebooks to save costs
4. **Sample large datasets** - Faster EDA at lower cost
5. **Track EDA experiments** - Link insights to model performance
6. **Automate report generation** - Schedule weekly data profiling