@xdev-asia/xdev-knowledge-mcp 1.0.42 → 1.0.44
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/content/pages/xoa-du-lieu-nguoi-dung.md +68 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/01-bai-1-data-repositories-ingestion.md +198 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/02-bai-2-data-transformation.md +183 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/03-bai-3-data-analysis.md +159 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/04-bai-4-sagemaker-built-in-algorithms.md +186 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/05-bai-5-training-hyperparameter-tuning.md +159 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/06-bai-6-model-evaluation.md +169 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/07-bai-7-model-deployment.md +193 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/08-bai-8-model-monitoring-mlops.md +184 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/09-bai-9-security-cost.md +166 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/10-bai-10-bai-toan-thuong-gap.md +181 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/11-bai-11-cheat-sheet.md +110 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/12-bai-12-chien-luoc-thi.md +113 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/index.md +1 -1
- package/content/series/luyen-thi/luyen-thi-cka/index.md +217 -0
- package/content/series/luyen-thi/luyen-thi-ckad/index.md +199 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/01-bai-1-framing-ml-problems.md +136 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/02-bai-2-gcp-ai-ml-ecosystem.md +160 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/03-bai-3-data-pipeline.md +174 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/04-bai-4-feature-engineering.md +156 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/05-bai-5-vertex-ai-training.md +155 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/06-bai-6-bigquery-ml-tensorflow.md +141 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/07-bai-7-model-deployment.md +134 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/08-bai-8-vertex-ai-pipelines-mlops.md +149 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/09-bai-9-responsible-ai.md +128 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/10-bai-10-cheat-sheet-chien-luoc-thi.md +108 -0
- package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/index.md +1 -1
- package/content/series/luyen-thi/luyen-thi-kcna/index.md +168 -0
- package/data/quizzes/aws-ai-practitioner.json +362 -0
- package/data/quizzes/aws-ml-specialty.json +200 -0
- package/data/quizzes/gcp-ml-engineer.json +200 -0
- package/package.json +1 -1
@@ -0,0 +1,68 @@
+---
+id: 019cb2b9-4dc7-72a5-956a-93f58cbac568
+title: Delete user data
+slug: xoa-du-lieu-nguoi-dung
+excerpt: How to request deletion of your account and personal data on xDev Asia.
+featured_image: null
+template: default
+show_in_header: false
+show_in_footer: true
+sort_order: 12
+meta:
+  meta_title: Delete user data | xDev Asia
+  meta_description: Request deletion of your account and all of your personal data from the xDev Asia platform.
+published_at: '2026-04-05T00:00:00.000000Z'
+---
+
+## Delete user data
+
+At xDev Asia, we respect your privacy and your right to control your personal data. This page explains how to request deletion of your account and all related data.
+
+### Data we store
+
+When you use xDev Asia (website and mobile app), we may store:
+
+- **Account information:** Display name, email address, avatar
+- **Learning history:** Completed lessons, course progress
+- **Bookmark data:** Saved articles and series
+- **Quiz results:** Scores and practice exam history
+
+### How to request data deletion
+
+You can request deletion of all your personal data in either of the following ways:
+
+**Option 1: By email**
+
+Send an email to **<admin@xdev.asia>** with the subject **"Yêu cầu xóa tài khoản"** (account deletion request) and include:
+
+- Your registered email address
+- The display name on your account
+- Confirmation that you want all of your data deleted
+
+**Option 2: Via GitHub**
+
+Create an issue at [github.com/xdev-asia-labs](https://github.com/xdev-asia-labs) titled **"Data Deletion Request"**.
+
+### Processing time
+
+We will process your request within **30 days** of receiving it. Once it is complete, you will receive an email confirming that your data has been deleted.
+
+### Data that is deleted
+
+Once the request is approved, we will delete:
+
+- Your account and personal information
+- Learning history and progress
+- Bookmarks and custom data
+- Quiz results and scores
+
+**Note:** Some data that has been anonymized and aggregated (for example, view statistics) may not be deleted because it can no longer be linked back to you.
+
+### Contact
+
+
+If you have questions about privacy or data processing, please contact:
+
+
+- **Email:** <admin@xdev.asia>
+- **Privacy policy:** [xdev.asia/pages/chinh-sach-quyen-rieng-tu/](/pages/chinh-sach-quyen-rieng-tu/)
@@ -0,0 +1,198 @@
+---
+id: 14a964b2-b4b7-46e5-95b0-7d91d9cacdf5
+title: 'Lesson 1: Data Repositories & Ingestion (S3, Kinesis, Glue)'
+slug: bai-1-data-repositories-ingestion
+description: >-
+  S3 data lake for ML. Kinesis Data Streams/Firehose for streaming ingestion.
+  AWS Glue ETL jobs and Data Catalog. Lake Formation. Data Wrangler.
+  Storage strategy: Parquet, ORC, CSV, JSON.
+duration_minutes: 60
+is_free: true
+video_url: null
+sort_order: 1
+section_title: "Part 1: Data Engineering (20%)"
+course:
+  id: 019c9619-lt02-7002-c002-lt0200000002
+  title: 'AWS Certified Machine Learning - Specialty Exam Prep'
+  slug: luyen-thi-aws-ml-specialty
+---
+
+<div style="text-align: center; margin: 2rem 0;">
+<img src="/storage/uploads/2026/04/aws-mls-bai1-data-ingestion.png" alt="AWS ML Data Repositories & Ingestion" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+<p><em>Data Repositories & Ingestion: S3, Kinesis, Glue, and Lake Formation in the ML pipeline</em></p>
+</div>
+
+<h2 id="overview"><strong>1. Data Engineering in MLS-C01: Overview</strong></h2>
+
+<p>The Data Engineering domain accounts for <strong>20% of the MLS-C01 exam</strong>. You need to know it well; the exam often asks "Which service should be used to ingest/store/transform data for ML?"</p>
+
+<blockquote>
+<p><strong>Exam tip:</strong> Most Data Engineering questions describe a scenario and ask for the right service. Key pattern: batch → S3 + Glue; streaming → Kinesis; structured/SQL → Athena; catalog → Glue Data Catalog.</p>
+</blockquote>
+
+<h2 id="s3-ml"><strong>2. Amazon S3: ML Data Lake</strong></h2>
+
+<p><strong>Amazon S3</strong> is the storage foundation for ML data on AWS. Nearly every ML pipeline starts and ends in S3: training data, model artifacts, predictions.</p>
+
+<h3 id="s3-storage-classes"><strong>2.1. S3 Storage Classes for ML</strong></h3>
+
+<table>
+<thead><tr><th>Storage Class</th><th>Use Case</th><th>Cost</th></tr></thead>
+<tbody>
+<tr><td><strong>S3 Standard</strong></td><td>Active training data, frequent access</td><td>Highest</td></tr>
+<tr><td><strong>S3 Intelligent-Tiering</strong></td><td>Mixed access patterns (automatic tiering)</td><td>Automatically optimized</td></tr>
+<tr><td><strong>S3 Standard-IA</strong></td><td>Backup datasets, infrequent access</td><td>Lower than Standard</td></tr>
+<tr><td><strong>S3 Glacier Instant Retrieval</strong></td><td>Archived datasets, occasional retrieval</td><td>Low</td></tr>
+<tr><td><strong>S3 Glacier Deep Archive</strong></td><td>Long-term compliance archives</td><td>Lowest</td></tr>
+</tbody>
+</table>
+
+<h3 id="s3-file-formats"><strong>2.2. File Formats for ML</strong></h3>
+
+<table>
+<thead><tr><th>Format</th><th>Type</th><th>Best For</th><th>Compression</th></tr></thead>
+<tbody>
+<tr><td><strong>Parquet</strong></td><td>Columnar</td><td>Analytics, large datasets, feature stores</td><td>Excellent</td></tr>
+<tr><td><strong>ORC</strong></td><td>Columnar</td><td>Hive/EMR workloads</td><td>Excellent</td></tr>
+<tr><td><strong>CSV</strong></td><td>Row-based</td><td>Simple, SageMaker training input</td><td>Poor</td></tr>
+<tr><td><strong>JSON</strong></td><td>Semi-structured</td><td>Nested data, APIs</td><td>Poor</td></tr>
+<tr><td><strong>RecordIO</strong></td><td>Binary</td><td>SageMaker Pipe Mode training</td><td>Good</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> When a question asks about <em>performance optimization</em> for large-scale training, the answer is usually to switch to <strong>Parquet</strong> (columnar, compressed) and to use <strong>Pipe Mode</strong> instead of File Mode in SageMaker.</p>
+</blockquote>
+
+<pre><code class="language-text">S3 Data Lake Architecture for ML:
+
+┌─────────────────────────────────────────────────────────┐
+│                    Amazon S3 Buckets                    │
+├──────────────┬──────────────┬──────────────┬────────────┤
+│ Raw Zone     │ Processed    │ Features     │ Models     │
+│ (landing)    │ Zone         │ Zone         │ & Output   │
+│              │              │              │            │
+│ CSV/JSON     │ Parquet/ORC  │ Feature      │ Model      │
+│ original     │ cleaned      │ Store        │ Artifacts  │
+│ data         │ transformed  │ snapshots    │ Predictions│
+└──────────────┴──────────────┴──────────────┴────────────┘
+       ↑              ↑              ↑
+    Kinesis       AWS Glue       SageMaker
+  (streaming)       (ETL)       Processing
+</code></pre>
+
+<h2 id="kinesis"><strong>3. Amazon Kinesis: Streaming Ingestion</strong></h2>
+
+<p>Kinesis is a family of services for <strong>real-time data streaming</strong>. It is an important exam topic; you need to clearly distinguish the 4 services.</p>
+
+<table>
+<thead><tr><th>Service</th><th>Function</th><th>Destination</th><th>ML Use Case</th></tr></thead>
+<tbody>
+<tr><td><strong>Kinesis Data Streams (KDS)</strong></td><td>Custom real-time processing</td><td>Custom consumers</td><td>Real-time feature engineering</td></tr>
+<tr><td><strong>Kinesis Data Firehose</strong></td><td>Managed delivery (no code)</td><td>S3, Redshift, OpenSearch (formerly Elasticsearch), Splunk</td><td>Batch loading to data lake</td></tr>
+<tr><td><strong>Kinesis Data Analytics</strong></td><td>SQL/Flink on streams</td><td>S3, Redshift</td><td>Real-time aggregations, anomaly detection</td></tr>
+<tr><td><strong>Kinesis Video Streams</strong></td><td>Video ingestion</td><td>Rekognition, SageMaker</td><td>Computer vision pipelines</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> A common question: "IoT sensors send data continuously and it must be stored in S3 for ML training without custom code?" → Kinesis <strong>Data Firehose</strong> (managed, no code). "Need real-time processing with custom logic?" → Kinesis <strong>Data Streams</strong>.</p>
+</blockquote>
+
+<h3 id="kinesis-shards"><strong>3.1. KDS Shards & Capacity</strong></h3>
+
+<pre><code class="language-text">Kinesis Data Streams Capacity:
+
+┌─────────────────────────────────────────────┐
+│ Each Shard:                                 │
+│ • Ingest: 1 MB/s OR 1,000 records/s         │
+│ • Read: 2 MB/s                              │
+│ • Retention: 24 h default, up to 365 days   │
+└─────────────────────────────────────────────┘
+
+Stream with N shards:
+• Total ingest: N × 1 MB/s
+• Total read: N × 2 MB/s
+</code></pre>
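The per-shard limits above translate into a simple sizing rule: a stream needs enough shards to satisfy the ingest byte rate, the ingest record rate, and the read rate at the same time. The following is a minimal sketch in Python that assumes only the limits quoted above; the function name `shards_needed` and the workload figures are hypothetical and chosen for illustration.

```python
import math

# Per-shard limits quoted in the capacity summary above.
SHARD_INGEST_MB_PER_S = 1.0
SHARD_INGEST_RECORDS_PER_S = 1000
SHARD_READ_MB_PER_S = 2.0


def shards_needed(ingest_mb_per_s, records_per_s, read_mb_per_s):
    """Return the smallest shard count that satisfies all three limits."""
    by_ingest_bytes = math.ceil(ingest_mb_per_s / SHARD_INGEST_MB_PER_S)
    by_ingest_records = math.ceil(records_per_s / SHARD_INGEST_RECORDS_PER_S)
    by_read = math.ceil(read_mb_per_s / SHARD_READ_MB_PER_S)
    # A stream always has at least one shard.
    return max(1, by_ingest_bytes, by_ingest_records, by_read)


# Hypothetical workload: 5,000 sensors, each sending one 2 KB record per
# second, read by two consumers that share the stream's read throughput.
records_per_s = 5000
ingest_mb_per_s = records_per_s * 2 / 1024   # about 9.77 MB/s
read_mb_per_s = ingest_mb_per_s * 2          # two consumers

print(shards_needed(ingest_mb_per_s, records_per_s, read_mb_per_s))  # prints 10
```

In this example the byte rate, not the record rate, is the binding limit: 5,000 records per second would fit in 5 shards, but about 9.77 MB/s of ingest needs 10.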
+
+<h2 id="glue"><strong>4. AWS Glue: ETL for ML</strong></h2>
+
+<p><strong>AWS Glue</strong> is a fully managed ETL service. In an ML pipeline, Glue is used to <strong>transform and clean data</strong> before it is fed into training.</p>
+
+<h3 id="glue-components"><strong>4.1. Glue Components</strong></h3>
+
+<table>
+<thead><tr><th>Component</th><th>Function</th></tr></thead>
+<tbody>
+<tr><td><strong>Glue Data Catalog</strong></td><td>Central metadata repository: schemas, tables, partitions</td></tr>
+<tr><td><strong>Glue Crawlers</strong></td><td>Auto-discover schemas from S3/RDS/Redshift and populate the Data Catalog</td></tr>
+<tr><td><strong>Glue ETL Jobs</strong></td><td>Spark-based transformation jobs (Python/Scala)</td></tr>
+<tr><td><strong>Glue DataBrew</strong></td><td>No-code visual data preparation (250+ pre-built transforms)</td></tr>
+<tr><td><strong>Glue Studio</strong></td><td>Visual ETL job builder (drag-and-drop)</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> <strong>Glue Data Catalog</strong> is the shared metadata store for Athena, EMR, and Redshift Spectrum. When a question asks about "centralized schema management" → Glue Data Catalog. When it asks about "no-code data cleaning" → Glue DataBrew.</p>
+</blockquote>
+
+<h2 id="lake-formation"><strong>5. AWS Lake Formation</strong></h2>
+
+<p><strong>Lake Formation</strong> builds on S3 and Glue to manage <strong>data lake security and governance</strong>. Key feature: column-level and row-level access control.</p>
+
+<pre><code class="language-text">Lake Formation Architecture:
+
+IAM Users ──→ Lake Formation ──→ S3 Data Lake
+IAM Roles     (Security          (Raw/Processed)
+               & Governance)
+                     ↓
+                Column/Row
+               Level Access
+                  Control
+</code></pre>
+
+<h2 id="cheat-sheet"><strong>6. Cheat Sheet: Data Ingestion Services</strong></h2>
+
+<table>
+<thead><tr><th>Scenario</th><th>Service</th></tr></thead>
+<tbody>
+<tr><td>Streaming → S3 with no code</td><td>Kinesis Data Firehose</td></tr>
+<tr><td>Real-time processing with custom logic</td><td>Kinesis Data Streams</td></tr>
+<tr><td>SQL on streaming data</td><td>Kinesis Data Analytics (Flink)</td></tr>
+<tr><td>Batch ETL Spark-based</td><td>AWS Glue ETL Jobs</td></tr>
+<tr><td>No-code visual data prep</td><td>Glue DataBrew</td></tr>
+<tr><td>Schema discovery from S3</td><td>Glue Crawlers + Data Catalog</td></tr>
+<tr><td>SQL queries on S3</td><td>Amazon Athena</td></tr>
+<tr><td>Data lake governance</td><td>AWS Lake Formation</td></tr>
+<tr><td>Large-scale Spark/Hadoop</td><td>Amazon EMR</td></tr>
+</tbody>
+</table>
+
+<h2 id="practice"><strong>7. Practice Questions</strong></h2>
+
+<p><strong>Q1:</strong> A company wants to ingest IoT sensor data into Amazon S3 for ML training. The data arrives continuously and no custom processing is required. Which service is the MOST cost-effective?</p>
+<ul>
+<li>A) Amazon Kinesis Data Streams with a Lambda consumer</li>
+<li>B) Amazon Kinesis Data Firehose ✓</li>
+<li>C) Amazon EMR with Spark Streaming</li>
+<li>D) AWS Glue ETL jobs on a schedule</li>
+</ul>
+<p><em>Explanation: Kinesis Data Firehose is fully managed and requires no custom code; it directly delivers streaming data to S3, Redshift, or Elasticsearch. Data Streams requires custom consumers, EMR is a heavy lift, and Glue is for batch ETL.</em></p>
+
+<p><strong>Q2:</strong> A data engineer wants to query raw CSV files in S3 using SQL without loading them into a database. Which service should be used?</p>
+<ul>
+<li>A) Amazon RDS</li>
+<li>B) Amazon DynamoDB</li>
+<li>C) Amazon Athena ✓</li>
+<li>D) Amazon Redshift</li>
+</ul>
+<p><em>Explanation: Amazon Athena is serverless and allows SQL queries directly on S3 data without loading. It reads files in-place and supports formats like CSV, Parquet, ORC, JSON.</em></p>
+
+<p><strong>Q3:</strong> Which file format provides the BEST performance for columnar analytics queries on large ML datasets stored in Amazon S3?</p>
+<ul>
+<li>A) CSV</li>
+<li>B) JSON</li>
+<li>C) XML</li>
+<li>D) Apache Parquet ✓</li>
+</ul>
+<p><em>Explanation: Parquet is a columnar format with excellent compression and predicate pushdown support. Columnar formats allow reading only the required columns, dramatically reducing I/O for analytical queries.</em></p>
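The I/O argument in the Q3 explanation can be illustrated with a toy model. The sketch below is not Parquet itself: it ignores compression, row groups, and predicate pushdown, and only counts how many values each layout has to touch to compute one column's mean. The sample records and function names are hypothetical.

```python
# Hypothetical sample records; four fields per record.
rows = [
    {"user_id": 1, "age": 34, "country": "VN", "clicks": 12},
    {"user_id": 2, "age": 29, "country": "SG", "clicks": 7},
    {"user_id": 3, "age": 42, "country": "VN", "clicks": 30},
]


def mean_age_row_layout(rows):
    """Row-oriented (CSV-like): each whole record is read to reach one field."""
    touched = 0
    total = 0
    for row in rows:
        touched += len(row)      # every field in the record is scanned
        total += row["age"]
    return total / len(rows), touched


def to_columns(rows):
    """Column-oriented (Parquet-like): one contiguous list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def mean_age_column_layout(columns):
    ages = columns["age"]        # only the requested column is read
    return sum(ages) / len(ages), len(ages)


print(mean_age_row_layout(rows))                   # (35.0, 12)
print(mean_age_column_layout(to_columns(rows)))    # (35.0, 3)
```

Both layouts give the same answer, but the columnar one touches 3 values instead of 12; the gap grows with the number of columns in the dataset.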
@@ -0,0 +1,183 @@
+---
+id: 621b7555-2901-469d-8b0b-a800506c8212
+title: 'Lesson 2: Data Transformation & Feature Engineering'
+slug: bai-2-data-transformation
+description: >-
+  SageMaker Processing Jobs for data prep. SageMaker Feature Store.
+  Handling missing values, encoding, normalization, scaling.
+  Text preprocessing, imbalanced data techniques.
+duration_minutes: 60
+is_free: true
+video_url: null
+sort_order: 2
+section_title: "Part 1: Data Engineering (20%)"
+course:
+  id: 019c9619-lt02-7002-c002-lt0200000002
+  title: 'AWS Certified Machine Learning - Specialty Exam Prep'
+  slug: luyen-thi-aws-ml-specialty
+---
+
+<div style="text-align: center; margin: 2rem 0;">
+<img src="/storage/uploads/2026/04/aws-mls-bai2-feature-engineering.png" alt="AWS ML Data Transformation Pipeline" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+<p><em>Feature Engineering & Data Transformation: Glue, SageMaker Data Wrangler, and handling missing values</em></p>
+</div>
+
+<h2 id="overview"><strong>1. Data Transformation in the ML Pipeline</strong></h2>
+
+<p>Before a model can be trained, raw data has to go through several transformation steps. This is where the well-known saying <em>"Garbage in, garbage out"</em> applies. The MLS-C01 exam often asks about data processing techniques and the right tools for them.</p>
+
+<h2 id="processing-jobs"><strong>2. SageMaker Processing Jobs</strong></h2>
+
+<p><strong>SageMaker Processing Jobs</strong> is a managed service for running data processing scripts (Python, Spark) on ephemeral compute clusters.</p>
+
+<table>
+<thead><tr><th>Processor Type</th><th>Framework</th><th>Use Case</th></tr></thead>
+<tbody>
+<tr><td><strong>ScriptProcessor</strong></td><td>Custom Docker container</td><td>Any custom script</td></tr>
+<tr><td><strong>SKLearnProcessor</strong></td><td>scikit-learn</td><td>Classic ML preprocessing</td></tr>
+<tr><td><strong>PySparkProcessor</strong></td><td>Apache Spark</td><td>Large-scale distributed processing</td></tr>
+<tr><td><strong>FrameworkProcessor</strong></td><td>TensorFlow/PyTorch</td><td>Deep learning data prep</td></tr>
+</tbody>
+</table>
+
+<pre><code class="language-text">SageMaker Processing Job Flow:
+
+S3 (input data)
+           ↓
+┌─────────────────────┐
+│ Processing Job      │
+│ (compute cluster)   │
+│                     │
+│ - Preprocess data   │
+│ - Feature engineer  │
+│ - Split train/test  │
+└─────────────────────┘
+           ↓
+S3 (output: train/, validation/, test/)
+</code></pre>
+
+<h2 id="missing-values"><strong>3. Handling Missing Values</strong></h2>
+
+<table>
+<thead><tr><th>Strategy</th><th>Method</th><th>When to Use</th></tr></thead>
+<tbody>
+<tr><td><strong>Deletion</strong></td><td>Drop rows/columns</td><td>MCAR, few missing values (&lt;5%)</td></tr>
+<tr><td><strong>Mean/Median Imputation</strong></td><td>Fill with the mean or median</td><td>Numeric, MCAR/MAR</td></tr>
+<tr><td><strong>Mode Imputation</strong></td><td>Fill with the most frequent value</td><td>Categorical</td></tr>
+<tr><td><strong>KNN Imputation</strong></td><td>Use the K nearest neighbors</td><td>Patterns in data, dataset not too large</td></tr>
+<tr><td><strong>Model-based (MICE)</strong></td><td>Multiple imputation</td><td>Complex missingness patterns</td></tr>
+<tr><td><strong>Indicator Feature</strong></td><td>Add an is_missing column</td><td>When missingness carries information</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> Three types of missing data: <strong>MCAR</strong> (Missing Completely At Random): deletion is safe; <strong>MAR</strong> (Missing At Random): imputation is appropriate; <strong>MNAR</strong> (Missing Not At Random): needs an indicator feature or domain knowledge.</p>
+</blockquote>
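Two of the strategies in the table, median imputation and the indicator feature, are often combined. A minimal sketch using only the Python standard library is shown below; the function name and the `ages` column are hypothetical, and a real pipeline would compute the median on the training split only.

```python
from statistics import median


def impute_median_with_indicator(values):
    """Fill None entries with the median of the observed values.

    Returns (filled_values, is_missing_flags).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


ages = [34, None, 29, 41, None, 38]   # hypothetical numeric column
filled, flags = impute_median_with_indicator(ages)
print(filled)   # [34, 36.0, 29, 41, 36.0, 38]
print(flags)    # [0, 1, 0, 0, 1, 0]
```

The `flags` list plays the role of the is_missing column: the model can still learn from the fact that a value was absent, which matters most in the MNAR case.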
+
+<h2 id="encoding"><strong>4. Categorical Encoding</strong></h2>
+
+<table>
+<thead><tr><th>Encoding</th><th>Method</th><th>When to Use</th><th>Issues</th></tr></thead>
+<tbody>
+<tr><td><strong>One-Hot Encoding</strong></td><td>One binary column per category</td><td>Nominal (no order), few categories</td><td>High cardinality → curse of dimensionality</td></tr>
+<tr><td><strong>Label Encoding</strong></td><td>0, 1, 2, 3...</td><td>Ordinal (ordered)</td><td>Implies false order for nominal</td></tr>
+<tr><td><strong>Target Encoding</strong></td><td>Mean of target per category</td><td>High cardinality nominal</td><td>Data leakage risk if not done carefully</td></tr>
+<tr><td><strong>Embeddings</strong></td><td>Dense vector representation</td><td>Text, high cardinality</td><td>Needs enough data to learn</td></tr>
+</tbody>
+</table>
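The first three encodings in the table are small enough to sketch directly. The code below is an illustrative standard-library version, not a library API; the function names and sample values are hypothetical. For target encoding, the mapping should be fitted on training data only, which is the leakage risk the table mentions.

```python
from collections import defaultdict


def one_hot(values):
    """One binary column per category (categories sorted for a stable order)."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def ordinal_encode(values, order):
    """Map each category to its position in an explicit, meaningful order."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def target_encode(categories, targets):
    """Mean target per category. Fit on training data only to limit leakage."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


print(one_hot(["red", "green", "red", "blue"]))
print(ordinal_encode(["S", "L", "M"], order=["S", "M", "L"]))   # [0, 2, 1]
print(target_encode(["a", "b", "a", "b", "a"], [1, 0, 0, 0, 1]))
```

Note how one-hot output width grows with the number of distinct categories, while target encoding always yields a single numeric column.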
+
+<h2 id="scaling"><strong>5. Normalization & Scaling</strong></h2>
+
+<table>
+<thead><tr><th>Technique</th><th>Formula</th><th>Output Range</th><th>Best For</th></tr></thead>
+<tbody>
+<tr><td><strong>Min-Max Normalization</strong></td><td>(x - min) / (max - min)</td><td>[0, 1]</td><td>Neural networks, distance-based</td></tr>
+<tr><td><strong>Standardization (Z-score)</strong></td><td>(x - mean) / std</td><td>Mean=0, SD=1</td><td>Linear models, SVM, PCA</td></tr>
+<tr><td><strong>Robust Scaler</strong></td><td>(x - median) / IQR</td><td>Centered</td><td>Outliers present</td></tr>
+<tr><td><strong>Log Transform</strong></td><td>log(x)</td><td>Compressed</td><td>Skewed distributions</td></tr>
+</tbody>
+</table>
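The four formulas in the table can be written out directly. This is a minimal standard-library sketch under a few stated assumptions: the column is not constant (so no denominator is zero), the population standard deviation is used for the Z-score, quartiles use the "inclusive" method of `statistics.quantiles`, and the log transform is applied to strictly positive values. The sample data is hypothetical.

```python
import math
from statistics import mean, median, pstdev, quantiles


def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def robust(xs):
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


def log_transform(xs):
    return [math.log(x) for x in xs]   # requires x > 0


data = [1, 2, 3, 4, 100]   # hypothetical column with one outlier
print(min_max(data))
print(z_score(data))
print(robust(data))        # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(log_transform(data))
```

With the outlier at 100, min-max squeezes the first four values into roughly the bottom 3% of the range, while the robust scaler keeps them evenly spaced around zero; that is the "Outliers present" case in the table.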
+
+<h2 id="imbalanced"><strong>6. Handling Imbalanced Data</strong></h2>
+
+<p>Class imbalance (e.g., fraud detection: 99% normal, 1% fraud) biases the model toward the majority class.</p>
+
+<table>
+<thead><tr><th>Technique</th><th>Method</th><th>Direction</th></tr></thead>
+<tbody>
+<tr><td><strong>Oversampling</strong></td><td>Duplicate minority class samples</td><td>↑ minority</td></tr>
+<tr><td><strong>SMOTE</strong></td><td>Synthetic Minority Oversampling Technique: generates synthetic samples</td><td>↑ minority</td></tr>
+<tr><td><strong>Undersampling</strong></td><td>Remove majority class samples</td><td>↓ majority</td></tr>
+<tr><td><strong>Class Weights</strong></td><td>Penalize misclassification of minority more</td><td>No data change</td></tr>
+<tr><td><strong>Ensemble Methods</strong></td><td>BalancedBagging, EasyEnsemble</td><td>Algorithm-level</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> Suitable metrics for imbalanced data: <strong>F1 Score, AUC-ROC, Precision-Recall</strong>. Do NOT use Accuracy (misleading). Amazon SageMaker Clarify can detect class imbalance.</p>
+</blockquote>
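A short sketch makes both points concrete: why accuracy misleads on the 99%/1% example, and how class weights can be derived without changing the data. The weight formula used here, `n_samples / (n_classes * class_count)`, is one common "balanced" heuristic and is an assumption of this sketch; the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0   # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


y_true = [0] * 99 + [1]   # 99% normal, 1% fraud
y_pred = [0] * 100        # a "model" that always predicts normal

print(accuracy(y_true, y_pred))          # 0.99
print(f1_score(y_true, y_pred))          # 0.0
print(balanced_class_weights(y_true))    # the fraud class gets weight 50.0
```

The always-negative model scores 99% accuracy yet never finds a single fraud case, which F1 exposes immediately.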
+
+<h2 id="feature-store"><strong>7. SageMaker Feature Store</strong></h2>
+
+<p><strong>SageMaker Feature Store</strong> is a centralized repository to store, share, and reuse ML features.</p>
+
+<pre><code class="language-text">Feature Store Architecture:
+
+         Feature Groups
+┌──────────────────────────────┐
+│  user_features               │
+│  ┌──────┬────────┬────────┐  │
+│  │ id   │ age    │ recency│  │
+│  └──────┴────────┴────────┘  │
+└──────────────────────────────┘
+   ↓ writes             ↑ reads
+┌──────────────────┐ ┌──────────────────┐
+│ Offline Store    │ │ Online Store     │
+│ (S3 - training)  │ │ (DynamoDB -      │
+│ batch reads      │ │ low-latency      │
+│                  │ │ inference)       │
+└──────────────────┘ └──────────────────┘
+</code></pre>
+
+<h2 id="cheat-sheet"><strong>8. Cheat Sheet: Feature Engineering</strong></h2>
+
+<table>
+<thead><tr><th>Problem</th><th>Solution</th></tr></thead>
+<tbody>
+<tr><td>High cardinality categorical</td><td>Target encoding or embeddings</td></tr>
+<tr><td>Missing values (numeric)</td><td>Median imputation + indicator feature</td></tr>
+<tr><td>Skewed distribution</td><td>Log transform or Box-Cox</td></tr>
+<tr><td>Outliers</td><td>Robust Scaler or clip/winsorize</td></tr>
+<tr><td>Imbalanced classes</td><td>SMOTE + class weights + AUC metric</td></tr>
+<tr><td>Reuse features across teams</td><td>SageMaker Feature Store</td></tr>
+</tbody>
+</table>
+
+<h2 id="practice"><strong>9. Practice Questions</strong></h2>
+
+<p><strong>Q1:</strong> A dataset for fraud detection has 98% negative (non-fraud) and 2% positive (fraud) examples. Which metric is MOST appropriate to evaluate the model?</p>
+<ul>
+<li>A) Accuracy</li>
+<li>B) R-squared</li>
+<li>C) AUC-ROC ✓</li>
+<li>D) Mean Absolute Error</li>
+</ul>
+<p><em>Explanation: Accuracy is misleading for imbalanced data (predicting all negative gives 98% accuracy). AUC-ROC measures the model's ability to distinguish classes across all thresholds, making it ideal for imbalanced classification.</em></p>
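AUC-ROC has a useful equivalent reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by brute force over all positive/negative pairs, which is fine for small examples but quadratic in general; the function name and sample scores are hypothetical.

```python
def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.30]   # hypothetical model scores

print(auc_roc(y_true, scores))            # 4 of 6 pairs ranked correctly
print(auc_roc(y_true, [0.5] * 5))         # 0.5: a constant score is no better than chance
```

Unlike accuracy, a model that gives every example the same score gets 0.5 here regardless of how skewed the classes are.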
+
+<p><strong>Q2:</strong> Which technique generates SYNTHETIC samples to address class imbalance?</p>
+<ul>
+<li>A) Random undersampling</li>
+<li>B) SMOTE (Synthetic Minority Oversampling Technique) ✓</li>
+<li>C) Class weighting</li>
+<li>D) Feature scaling</li>
+</ul>
+<p><em>Explanation: SMOTE creates new synthetic samples for the minority class by interpolating between existing minority class examples, rather than just duplicating them.</em></p>
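The interpolation step described in the explanation can be sketched in a few lines. This is a simplified, SMOTE-style illustration rather than a reference implementation: it uses brute-force nearest-neighbor search, Euclidean distance, and a fixed random seed, and the function name and sample points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward near neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen base point
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()   # position along the segment base → neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]   # hypothetical
for point in smote_like(minority, n_new=3):
    print(point)
```

Each synthetic point lies on the line segment between two real minority samples, so it is new data rather than a duplicate, which is the difference from plain oversampling.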
+
+<p><strong>Q3:</strong> A company wants to share engineered features between their training pipeline and real-time inference service. Which SageMaker feature addresses this?</p>
+<ul>
+<li>A) SageMaker Processing Jobs</li>
+<li>B) SageMaker Experiments</li>
+<li>C) SageMaker Feature Store ✓</li>
+<li>D) SageMaker Data Wrangler</li>
+</ul>
+<p><em>Explanation: SageMaker Feature Store provides both an offline store (S3, for batch training) and online store (DynamoDB-backed, for low-latency real-time inference), ensuring feature consistency between training and serving.</em></p>
@@ -0,0 +1,159 @@
+---
+id: 1a81b42d-c09e-43ef-b9f6-3158ca64b6c1
+title: 'Lesson 3: Data Analysis & Visualization'
+slug: bai-3-data-analysis
+description: >-
+  EDA in SageMaker notebooks. Amazon Athena for SQL analytics.
+  Amazon QuickSight for BI dashboards. Detecting data quality issues.
+  Detect class imbalance, outliers, correlations, data drift.
+duration_minutes: 45
+is_free: true
+video_url: null
+sort_order: 3
+section_title: "Part 1: Data Engineering (20%)"
+course:
+  id: 019c9619-lt02-7002-c002-lt0200000002
+  title: 'AWS Certified Machine Learning - Specialty Exam Prep'
+  slug: luyen-thi-aws-ml-specialty
+---
+
+<div style="text-align: center; margin: 2rem 0;">
+<img src="/storage/uploads/2026/04/aws-mls-bai3-eda-data-analysis.png" alt="Exploratory Data Analysis on AWS" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+<p><em>EDA & Data Analysis: descriptive statistics, outlier detection, feature correlation on AWS</em></p>
+</div>
|
|
24
|
+
|
|
25
|
+
<h2 id="eda"><strong>1. Exploratory Data Analysis (EDA)</strong></h2>
|
|
26
|
+
|
|
27
|
+
<p><strong>EDA</strong> là bước phân tích dữ liệu ban đầu để hiểu structure, patterns, và anomalies trước khi modeling. SageMaker cung cấp nhiều tools để thực hiện EDA ở scale lớn.</p>
|
|
28
|
+
|
|
29
|
+
<h2 id="eda-tools"><strong>2. AWS Tools for Data Analysis</strong></h2>
|
|
30
|
+
|
|
31
|
+
<table>
|
|
32
|
+
<thead><tr><th>Tool</th><th>Use Case</th><th>Interface</th></tr></thead>
|
|
33
|
+
<tbody>
|
|
34
|
+
<tr><td><strong>SageMaker Studio Notebooks</strong></td><td>Interactive EDA, Python/R analysis</td><td>JupyterLab-based IDE</td></tr>
|
|
35
|
+
<tr><td><strong>SageMaker Data Wrangler</strong></td><td>Visual data prep, 300+ transforms, auto-insights</td><td>Drag-and-drop GUI</td></tr>
|
|
36
|
+
<tr><td><strong>Amazon Athena</strong></td><td>SQL queries on S3 data</td><td>SQL console</td></tr>
|
|
37
|
+
<tr><td><strong>Amazon QuickSight</strong></td><td>BI dashboards, executive reports</td><td>Visual BI tool</td></tr>
|
|
38
|
+
<tr><td><strong>Amazon Redshift</strong></td><td>Large-scale data warehousing, SQL analytics</td><td>SQL</td></tr>
|
|
39
|
+
<tr><td><strong>AWS Glue DataBrew</strong></td><td>No-code data profiling and cleaning recipes</td><td>Visual tool</td></tr>
|
|
40
|
+
</tbody>
|
|
41
|
+
</table>
|
|
42
|
+
|
|
43
|
+
<blockquote>
|
|
44
|
+
<p><strong>Exam tip:</strong> <strong>Data Wrangler</strong> = visual data prep for ML (generates SageMaker Processing code). <strong>DataBrew</strong> = no-code data prep for data analysts and BI (no ML context). <strong>QuickSight</strong> = BI dashboards for business users, not an ML tool.</p>
|
|
45
|
+
</blockquote>
|
|
46
|
+
|
|
47
|
+
<h2 id="data-quality"><strong>3. Data Quality Issues</strong></h2>
|
|
48
|
+
|
|
49
|
+
<p>The exam frequently asks how to recognize and handle common data quality issues.</p>
|
|
50
|
+
|
|
51
|
+
<table>
|
|
52
|
+
<thead><tr><th>Issue</th><th>Detection Method</th><th>Impact on Model</th></tr></thead>
|
|
53
|
+
<tbody>
|
|
54
|
+
<tr><td><strong>Missing Values</strong></td><td>Null counts, missing rate per column</td><td>Errors, biased results</td></tr>
|
|
55
|
+
<tr><td><strong>Outliers</strong></td><td>Box plots, |Z-score| &gt; 3, IQR method</td><td>Skewed weights, poor generalization</td></tr>
|
|
56
|
+
<tr><td><strong>Class Imbalance</strong></td><td>Class distribution histogram</td><td>Biased toward majority class</td></tr>
|
|
57
|
+
<tr><td><strong>Feature Correlation</strong></td><td>Correlation matrix, VIF score</td><td>Multicollinearity → unstable coefficients</td></tr>
|
|
58
|
+
<tr><td><strong>Data Leakage</strong></td><td>Features with suspiciously high correlation to target</td><td>Over-optimistic eval, fails in production</td></tr>
|
|
59
|
+
<tr><td><strong>Distribution Skew</strong></td><td>Histogram, skewness metric</td><td>Violated model assumptions</td></tr>
|
|
60
|
+
</tbody>
|
|
61
|
+
</table>
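<p>As a rough illustration of the detection methods in the table, here is a minimal sketch using only the Python standard library. The helper names (<code>missing_rate</code>, <code>zscore_outliers</code>, <code>iqr_outliers</code>, <code>skewness</code>, <code>class_shares</code>) and the toy data are hypothetical, and the thresholds (3 for the z-score, 1.5 for the IQR multiplier) are common conventions rather than fixed rules. In a notebook you would more likely use pandas for the same checks.</p>

```python
import statistics
from collections import Counter


def missing_rate(values):
    """Fraction of entries that are None (missing)."""
    return sum(v is None for v in values) / len(values)


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if std > 0 and abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


def skewness(values):
    """Moment-based skewness: positive means a long right tail."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def class_shares(labels):
    """Share of each class, to spot class imbalance."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


# Hypothetical toy data for illustration only.
ages = [34, None, 29, None, 41, 38, None, 52, 47, 30]
amounts = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14,
           16, 12, 13, 15, 14, 16, 13, 14, 15, 500]
labels = ["no_churn"] * 95 + ["churn"] * 5

print("missing rate:", missing_rate(ages))
print("z-score outliers:", zscore_outliers(amounts))
print("IQR outliers:", iqr_outliers(amounts))
print("skewness:", skewness(amounts))
print("class shares:", class_shares(labels))
```

<p>On this toy data both outlier methods single out the value 500, and the class shares make the 95/5 imbalance visible before any model is trained.</p>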
|
|
62
|
+
|
|
63
|
+
<h3 id="data-leakage"><strong>3.1. Data Leakage — Critical Concept</strong></h3>
|
|
64
|
+
|
|
65
|
+
<p><strong>Data leakage</strong> occurs when information from outside the training set leaks into the features, so the model scores well during training and evaluation but fails in production.</p>
|
|
66
|
+
|
|
67
|
+
<pre><code class="language-text">Common Data Leakage Patterns:
|
|
68
|
+
|
|
69
|
+
❌ Target leakage:
|
|
70
|
+
Feature "loan_default_flag" → predicting "credit_risk"
|
|
71
|
+
(feature derived from target)
|
|
72
|
+
|
|
73
|
+
❌ Future data leakage:
|
|
74
|
+
Using tomorrow's stock price to predict today's trade
|
|
75
|
+
|
|
76
|
+
❌ Train/test contamination:
|
|
77
|
+
Scaling data BEFORE splitting (test mean leaks into train)
|
|
78
|
+
|
|
79
|
+
✅ Correct approach:
|
|
80
|
+
Split data FIRST → fit scaler on train only → transform both
|
|
81
|
+
</code></pre>
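<p>One way to screen for the target leakage pattern above is to flag features whose correlation with the target is suspiciously high. The sketch below assumes numeric features and a binary target. The function names, the toy columns, and the 0.95 cutoff are illustrative assumptions, and a flagged feature still needs a manual review of how and when it was recorded.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


def flag_suspicious(features, target, threshold=0.95):
    """Names of features whose |correlation| with the target meets the threshold."""
    return sorted(
        name for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    )


# Hypothetical toy data: one feature is effectively a copy of the target.
target = [0, 1, 0, 1, 1, 0, 0, 1]
features = {
    "loan_default_flag": [0, 1, 0, 1, 1, 0, 0, 1],
    "income_k": [52, 48, 61, 45, 50, 58, 47, 55],
}

suspicious = flag_suspicious(features, target)
print(suspicious)
```

<p>Here only <code>loan_default_flag</code> is flagged, which matches the target leakage example in the list above.</p>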
|
|
82
|
+
|
|
83
|
+
<blockquote>
|
|
84
|
+
<p><strong>Exam tip:</strong> Always <strong>split before transforming</strong>. Call StandardScaler.fit() on the training set only, then call transform() on both train and test. Calling fit and transform on the entire dataset is data leakage.</p>
|
|
85
|
+
</blockquote>
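<p>The split-then-fit order can be sketched as follows. To stay dependency-free, this example uses a small stand-in class, <code>SimpleStandardScaler</code>, and a hand-written <code>train_test_split</code>. Both are hypothetical helpers that mimic the fit/transform pattern of scikit-learn's StandardScaler; they are not the library's own code.</p>

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split off the last test_fraction."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


class SimpleStandardScaler:
    """Stand-in scaler: learns mean and population std in fit()."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


# Hypothetical single numeric feature.
values = [float(v) for v in range(1, 101)]

# 1. Split FIRST.
train, test = train_test_split(values)

# 2. Fit on the training set only, so test statistics never reach the scaler.
scaler = SimpleStandardScaler().fit(train)

# 3. Apply the same fitted scaler to both sets.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print("train mean after scaling:", statistics.fmean(train_scaled))
print("test mean after scaling:", statistics.fmean(test_scaled))
```

<p>The scaled training set has mean close to 0 by construction, while the test set generally does not. That is expected: the test set is transformed with the training statistics, exactly as production data would be.</p>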
|
|
86
|
+
|
|
87
|
+
<h2 id="athena"><strong>4. Amazon Athena</strong></h2>
|
|
88
|
+
|
|
89
|
+
<p>Athena runs SQL queries directly against data in S3 without loading it into a database. <strong>Pay per scan</strong>: you are billed for the amount of data each query scans, so reduce cost by using Parquet and partitioning.</p>
|
|
90
|
+
|
|
91
|
+
<pre><code class="language-text">Cost Optimization Tips:
|
|
92
|
+
┌────────────────────────────────────────────────┐
|
|
93
|
+
│ Partition data by date/region/category: │
|
|
94
|
+
│ s3://bucket/data/year=2024/month=01/ │
|
|
95
|
+
│ → Queries scan only the required partitions │
|
|
96
|
+
│ │
|
|
97
|
+
│ Use columnar formats (Parquet/ORC): │
|
|
98
|
+
│ → Read only needed columns │
|
|
99
|
+
│ │
|
|
100
|
+
│ Compress data (Snappy, Gzip): │
|
|
101
|
+
│ → Reduce scan size → reduce cost │
|
|
102
|
+
└────────────────────────────────────────────────┘
|
|
103
|
+
</code></pre>
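<p>To see why these tips matter, the arithmetic can be sketched as below. All numbers are assumptions for illustration: a rate of 5 USD per TB scanned (check current Athena pricing for your region), a hypothetical 1 TB table split into 12 equal monthly partitions, and a Parquet copy assumed to be a quarter of the raw size with the query reading 2 of 10 equally sized columns.</p>

```python
PRICE_PER_TB_USD = 5.0      # assumed rate, verify against current pricing
BYTES_PER_TB = 1024 ** 4    # assumed TB definition for this sketch


def scan_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Hypothetical table: 1 TB of raw data, 12 equal monthly partitions.
full_table_bytes = 1 * BYTES_PER_TB
one_month_bytes = full_table_bytes / 12

# Assumed effect of compressed Parquet (1/4 the size) and reading 2 of 10 columns.
columnar_bytes = one_month_bytes * 0.25 * (2 / 10)

baseline = scan_cost_usd(full_table_bytes)    # full scan of raw data
partitioned = scan_cost_usd(one_month_bytes)  # partition pruning only
optimized = scan_cost_usd(columnar_bytes)     # pruning + columnar + compression

print(f"full scan:       ${baseline:.4f}")
print(f"one partition:   ${partitioned:.4f}")
print(f"parquet + prune: ${optimized:.4f}")
```

<p>Under these assumptions, partition pruning alone cuts the scanned data to one twelfth, and the columnar, compressed layout reduces it by a further factor of twenty.</p>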
|
|
104
|
+
|
|
105
|
+
<h2 id="quicksight"><strong>5. Amazon QuickSight</strong></h2>
|
|
106
|
+
|
|
107
|
+
<p>QuickSight is a <strong>BI service</strong>, not an ML tool. Its key feature is <strong>SPICE</strong>, an in-memory engine for fast dashboards.</p>
|
|
108
|
+
|
|
109
|
+
<table>
|
|
110
|
+
<thead><tr><th>Feature</th><th>Description</th></tr></thead>
|
|
111
|
+
<tbody>
|
|
112
|
+
<tr><td><strong>SPICE</strong></td><td>Super-fast Parallel In-memory Calculation Engine — cached dataset</td></tr>
|
|
113
|
+
<tr><td><strong>ML Insights</strong></td><td>Built-in anomaly detection and forecasting on dashboards</td></tr>
|
|
114
|
+
<tr><td><strong>Q (NLQ)</strong></td><td>Natural language queries — "show me sales by region last month"</td></tr>
|
|
115
|
+
</tbody>
|
|
116
|
+
</table>
|
|
117
|
+
|
|
118
|
+
<h2 id="cheat-sheet"><strong>6. Cheat Sheet — Analysis Tools</strong></h2>
|
|
119
|
+
|
|
120
|
+
<table>
|
|
121
|
+
<thead><tr><th>Scenario</th><th>Tool</th></tr></thead>
|
|
122
|
+
<tbody>
|
|
123
|
+
<tr><td>Interactive Python EDA on large data</td><td>SageMaker Studio Notebooks</td></tr>
|
|
124
|
+
<tr><td>Visual no-code ML data prep</td><td>SageMaker Data Wrangler</td></tr>
|
|
125
|
+
<tr><td>SQL on S3 data (serverless)</td><td>Amazon Athena</td></tr>
|
|
126
|
+
<tr><td>Business dashboards and reporting</td><td>Amazon QuickSight</td></tr>
|
|
127
|
+
<tr><td>Large data warehouse SQL</td><td>Amazon Redshift</td></tr>
|
|
128
|
+
<tr><td>No-code data profiling recipes</td><td>AWS Glue DataBrew</td></tr>
|
|
129
|
+
</tbody>
|
|
130
|
+
</table>
|
|
131
|
+
|
|
132
|
+
<h2 id="practice"><strong>7. Practice Questions</strong></h2>
|
|
133
|
+
|
|
134
|
+
<p><strong>Q1:</strong> A data scientist standardized features using the mean and standard deviation of the ENTIRE dataset before splitting into train/test sets. What problem does this cause?</p>
|
|
135
|
+
<ul>
|
|
136
|
+
<li>A) Model underfitting</li>
|
|
137
|
+
<li>B) Slow training convergence</li>
|
|
138
|
+
<li>C) Data leakage from test set statistics into training ✓</li>
|
|
139
|
+
<li>D) Class imbalance</li>
|
|
140
|
+
</ul>
|
|
141
|
+
<p><em>Explanation: Fitting a scaler on the entire dataset causes data leakage — the test set statistics (mean, std) influence the training data transformation. Always fit transformers on training data only, then apply the fitted transformer to both train and test sets.</em></p>
|
|
142
|
+
|
|
143
|
+
<p><strong>Q2:</strong> A business analyst needs to create executive dashboards from S3 data with fast interactive visualizations. Which AWS service is BEST suited?</p>
|
|
144
|
+
<ul>
|
|
145
|
+
<li>A) Amazon SageMaker Studio</li>
|
|
146
|
+
<li>B) Amazon Athena</li>
|
|
147
|
+
<li>C) Amazon QuickSight ✓</li>
|
|
148
|
+
<li>D) AWS Glue DataBrew</li>
|
|
149
|
+
</ul>
|
|
150
|
+
<p><em>Explanation: Amazon QuickSight is the AWS BI service designed for business dashboards and visualizations, and its SPICE in-memory engine keeps interactive queries fast. SageMaker Studio is for ML development, Athena is for SQL querying, and DataBrew is for data preparation.</em></p>
|
|
151
|
+
|
|
152
|
+
<p><strong>Q3:</strong> A model trained on customer churn data has 99% training accuracy but performs poorly on production data. Investigation shows "days_since_last_call" is more predictive than expected. What is the MOST likely cause?</p>
|
|
153
|
+
<ul>
|
|
154
|
+
<li>A) Overfitting due to too many features</li>
|
|
155
|
+
<li>B) Underfitting due to low model complexity</li>
|
|
156
|
+
<li>C) Data leakage — the feature is derived from post-churn activity ✓</li>
|
|
157
|
+
<li>D) Class imbalance</li>
|
|
158
|
+
</ul>
|
|
159
|
+
<p><em>Explanation: This is classic target leakage — "days_since_last_call" may reflect churn behavior after the fact (customers call to cancel). This future information isn't available in production, causing the model to fail.</em></p>
|