@xdev-asia/xdev-knowledge-mcp 1.0.42 → 1.0.43
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/01-bai-1-data-repositories-ingestion.md +193 -0
- package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/02-bai-2-data-transformation.md +178 -0
- package/data/quizzes/aws-ai-practitioner.json +362 -0
- package/data/quizzes/aws-ml-specialty.json +200 -0
- package/data/quizzes/gcp-ml-engineer.json +200 -0
- package/package.json +1 -1
@@ -0,0 +1,193 @@
+---
+id: 14a964b2-b4b7-46e5-95b0-7d91d9cacdf5
+title: 'Bài 1: Data Repositories & Ingestion — S3, Kinesis, Glue'
+slug: bai-1-data-repositories-ingestion
+description: >-
+  S3 data lake cho ML. Kinesis Data Streams/Firehose cho streaming ingestion.
+  AWS Glue ETL jobs và Data Catalog. Lake Formation. Data Wrangler.
+  Chiến lược lưu trữ: Parquet, ORC, CSV, JSON.
+duration_minutes: 60
+is_free: true
+video_url: null
+sort_order: 1
+section_title: "Phần 1: Data Engineering (20%)"
+course:
+  id: 019c9619-lt02-7002-c002-lt0200000002
+  title: 'Luyện thi AWS Certified Machine Learning - Specialty'
+  slug: luyen-thi-aws-ml-specialty
+---
+
+<h2 id="overview"><strong>1. Data Engineering Overview in MLS-C01</strong></h2>
+
+<p>The Data Engineering domain accounts for <strong>20% of the MLS-C01 exam</strong>. It is a must-know area: the exam frequently asks "Which service should be used to ingest/store/transform data for ML?"</p>
+
+<blockquote>
+<p><strong>Exam tip:</strong> Most Data Engineering questions present a scenario and ask for the right service. Key patterns: batch → S3 + Glue; streaming → Kinesis; structured/SQL → Athena; catalog → Glue Data Catalog.</p>
+</blockquote>
+
+<h2 id="s3-ml"><strong>2. Amazon S3 — ML Data Lake</strong></h2>
+
+<p><strong>Amazon S3</strong> is the storage foundation for ML data on AWS. Every ML pipeline begins and ends in S3: training data, model artifacts, predictions.</p>
+
+<h3 id="s3-storage-classes"><strong>2.1. S3 Storage Classes for ML</strong></h3>
+
+<table>
+<thead><tr><th>Storage Class</th><th>Use Case</th><th>Cost</th></tr></thead>
+<tbody>
+<tr><td><strong>S3 Standard</strong></td><td>Active training data, frequent access</td><td>Highest</td></tr>
+<tr><td><strong>S3 Intelligent-Tiering</strong></td><td>Mixed access patterns (automatic tiering)</td><td>Auto-optimized</td></tr>
+<tr><td><strong>S3 Standard-IA</strong></td><td>Backup datasets, infrequent access</td><td>Lower than Standard</td></tr>
+<tr><td><strong>S3 Glacier Instant Retrieval</strong></td><td>Archived datasets, occasional retrieval</td><td>Low</td></tr>
+<tr><td><strong>S3 Glacier Deep Archive</strong></td><td>Long-term compliance archives</td><td>Lowest</td></tr>
+</tbody>
+</table>
+
+<h3 id="s3-file-formats"><strong>2.2. File Formats for ML</strong></h3>
+
+<table>
+<thead><tr><th>Format</th><th>Type</th><th>Best For</th><th>Compression</th></tr></thead>
+<tbody>
+<tr><td><strong>Parquet</strong></td><td>Columnar</td><td>Analytics, large datasets, feature stores</td><td>Excellent</td></tr>
+<tr><td><strong>ORC</strong></td><td>Columnar</td><td>Hive/EMR workloads</td><td>Excellent</td></tr>
+<tr><td><strong>CSV</strong></td><td>Row-based</td><td>Simple, SageMaker training input</td><td>Poor</td></tr>
+<tr><td><strong>JSON</strong></td><td>Semi-structured</td><td>Nested data, APIs</td><td>Poor</td></tr>
+<tr><td><strong>RecordIO</strong></td><td>Binary</td><td>SageMaker Pipe Mode training</td><td>Good</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> When a question asks about <em>performance optimization</em> for large-scale training, the answer is usually to switch to <strong>Parquet</strong> (columnar, compressed) and to use <strong>Pipe Mode</strong> instead of File Mode in SageMaker.</p>
+</blockquote>
+
+<pre><code class="language-text">S3 Data Lake Architecture for ML:
+
+┌─────────────────────────────────────────────────────────┐
+│                    Amazon S3 Buckets                    │
+├──────────────┬──────────────┬──────────────┬────────────┤
+│  Raw Zone    │  Processed   │  Features    │  Models    │
+│  (landing)   │  Zone        │  Zone        │  & Output  │
+│              │              │              │            │
+│  CSV/JSON    │  Parquet/ORC │  Feature     │  Model     │
+│  original    │  cleaned     │  Store       │  Artifacts │
+│  data        │  transformed │  snapshots   │ Predictions│
+└──────────────┴──────────────┴──────────────┴────────────┘
+       ↑              ↑              ↑
+    Kinesis        AWS Glue      SageMaker
+  (streaming)       (ETL)        Processing
+</code></pre>
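The column-pruning benefit behind the Parquet exam tip can be sketched quickly. This is an illustration only, assuming pandas is available and using invented column names; it emulates the access pattern on an in-memory CSV, whereas a real Parquet reader would skip the unneeded columns at the storage level.

```python
# Column pruning is the core benefit of columnar formats like Parquet.
# Emulated here on an in-memory CSV with pandas; unlike real Parquet,
# the CSV parser still scans the whole file even when usecols is set.
import io

import pandas as pd

csv_data = "user_id,amount,label\n1,10.5,0\n2,3.2,1\n3,7.7,0\n"
df = pd.read_csv(io.StringIO(csv_data), usecols=["amount"])

print(df.shape)                              # only the 'amount' column is kept
print(round(float(df["amount"].sum()), 1))
```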
+
+<h2 id="kinesis"><strong>3. Amazon Kinesis — Streaming Ingestion</strong></h2>
+
+<p>Kinesis is a family of services for <strong>real-time data streaming</strong>. This is an important exam topic: you need to distinguish the four services clearly.</p>
+
+<table>
+<thead><tr><th>Service</th><th>Function</th><th>Destination</th><th>ML Use Case</th></tr></thead>
+<tbody>
+<tr><td><strong>Kinesis Data Streams (KDS)</strong></td><td>Custom real-time processing</td><td>Custom consumers</td><td>Real-time feature engineering</td></tr>
+<tr><td><strong>Kinesis Data Firehose</strong></td><td>Managed delivery (no code)</td><td>S3, Redshift, OpenSearch, Splunk</td><td>Batch loading to data lake</td></tr>
+<tr><td><strong>Kinesis Data Analytics</strong></td><td>SQL/Flink on streams</td><td>S3, Redshift</td><td>Real-time aggregations, anomaly detection</td></tr>
+<tr><td><strong>Kinesis Video Streams</strong></td><td>Video ingestion</td><td>Rekognition, SageMaker</td><td>Computer vision pipelines</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> A common question: "IoT sensors send data continuously and it must land in S3 for ML training with no custom code" → Kinesis <strong>Data Firehose</strong> (managed, no code). "Real-time processing with custom logic?" → Kinesis <strong>Data Streams</strong>.</p>
+</blockquote>
+
+<h3 id="kinesis-shards"><strong>3.1. KDS Shards & Capacity</strong></h3>
+
+<pre><code class="language-text">Kinesis Data Streams Capacity:
+
+┌─────────────────────────────────────────────┐
+│ Each Shard:                                 │
+│  • Ingest: 1 MB/s OR 1,000 records/s        │
+│  • Read:   2 MB/s                           │
+│  • Retention: 24 h (default) to 365 days    │
+└─────────────────────────────────────────────┘
+
+Stream with N shards:
+  • Total ingest: N × 1 MB/s
+  • Total read:   N × 2 MB/s
+</code></pre>
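The per-shard limits above translate directly into a sizing calculation; ingest is capped by throughput or record rate, whichever binds first. A small sketch (the function name is mine, the limits are from the box above):

```python
# Sizing a Kinesis Data Streams stream from the per-shard ingest limits:
# each shard accepts 1 MB/s OR 1,000 records/s, so the required shard
# count is the larger of the two requirements (at least one shard).
import math

def required_shards(mb_per_s: float, records_per_s: float) -> int:
    by_bytes = math.ceil(mb_per_s / 1.0)            # 1 MB/s ingest per shard
    by_records = math.ceil(records_per_s / 1000.0)  # 1,000 records/s per shard
    return max(by_bytes, by_records, 1)

# Example: 4.5 MB/s of ~2 KB records (about 2,250 records/s)
print(required_shards(4.5, 2250))  # 5 (throughput-bound)
```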
+
+<h2 id="glue"><strong>4. AWS Glue — ETL for ML</strong></h2>
+
+<p><strong>AWS Glue</strong> is a fully managed ETL service. In an ML pipeline, Glue is used to <strong>transform and clean data</strong> before it goes into training.</p>
+
+<h3 id="glue-components"><strong>4.1. Glue Components</strong></h3>
+
+<table>
+<thead><tr><th>Component</th><th>Function</th></tr></thead>
+<tbody>
+<tr><td><strong>Glue Data Catalog</strong></td><td>Central metadata repository — schemas, tables, partitions</td></tr>
+<tr><td><strong>Glue Crawlers</strong></td><td>Auto-discover schemas from S3/RDS/Redshift and populate the Data Catalog</td></tr>
+<tr><td><strong>Glue ETL Jobs</strong></td><td>Spark-based transformation jobs (Python/Scala)</td></tr>
+<tr><td><strong>Glue DataBrew</strong></td><td>No-code visual data preparation (250+ pre-built transforms)</td></tr>
+<tr><td><strong>Glue Studio</strong></td><td>Visual ETL job builder (drag-and-drop)</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> The <strong>Glue Data Catalog</strong> is the shared metadata store for Athena, EMR, and Redshift Spectrum. When a question asks about "centralized schema management" → Glue Data Catalog. When it asks about "no-code data cleaning" → Glue DataBrew.</p>
+</blockquote>
+
+<h2 id="lake-formation"><strong>5. AWS Lake Formation</strong></h2>
+
+<p><strong>Lake Formation</strong> builds on S3 + Glue to manage <strong>data lake security and governance</strong>. Key feature: column-level and row-level access control.</p>
+
+<pre><code class="language-text">Lake Formation Architecture:
+
+IAM Users ──→ Lake Formation ──→ S3 Data Lake
+IAM Roles     (Security          (Raw/Processed)
+              & Governance)
+                   ↓
+              Column/Row
+              Level Access
+              Control
+</code></pre>
+
+<h2 id="cheat-sheet"><strong>6. Cheat Sheet — Data Ingestion Services</strong></h2>
+
+<table>
+<thead><tr><th>Scenario</th><th>Service</th></tr></thead>
+<tbody>
+<tr><td>Streaming → S3 with no code</td><td>Kinesis Data Firehose</td></tr>
+<tr><td>Real-time processing with custom logic</td><td>Kinesis Data Streams</td></tr>
+<tr><td>SQL on streaming data</td><td>Kinesis Data Analytics (Flink)</td></tr>
+<tr><td>Spark-based batch ETL</td><td>AWS Glue ETL Jobs</td></tr>
+<tr><td>No-code visual data prep</td><td>Glue DataBrew</td></tr>
+<tr><td>Schema discovery from S3</td><td>Glue Crawlers + Data Catalog</td></tr>
+<tr><td>SQL queries on S3</td><td>Amazon Athena</td></tr>
+<tr><td>Data lake governance</td><td>AWS Lake Formation</td></tr>
+<tr><td>Large-scale Spark/Hadoop</td><td>Amazon EMR</td></tr>
+</tbody>
+</table>
+
+<h2 id="practice"><strong>7. Practice Questions</strong></h2>
+
+<p><strong>Q1:</strong> A company wants to ingest IoT sensor data into Amazon S3 for ML training. The data arrives continuously and no custom processing is required. Which service is the MOST cost-effective?</p>
+<ul>
+<li>A) Amazon Kinesis Data Streams with a Lambda consumer</li>
+<li>B) Amazon Kinesis Data Firehose ✓</li>
+<li>C) Amazon EMR with Spark Streaming</li>
+<li>D) AWS Glue ETL jobs on a schedule</li>
+</ul>
+<p><em>Explanation: Kinesis Data Firehose is fully managed and requires no custom code — it delivers streaming data directly to S3, Redshift, or OpenSearch Service. Data Streams requires custom consumers, EMR is heavy lift, and Glue is for batch ETL.</em></p>
+
+<p><strong>Q2:</strong> A data engineer wants to query raw CSV files in S3 using SQL without loading them into a database. Which service should be used?</p>
+<ul>
+<li>A) Amazon RDS</li>
+<li>B) Amazon DynamoDB</li>
+<li>C) Amazon Athena ✓</li>
+<li>D) Amazon Redshift</li>
+</ul>
+<p><em>Explanation: Amazon Athena is serverless and allows SQL queries directly on S3 data without loading. It reads files in-place and supports formats like CSV, Parquet, ORC, JSON.</em></p>
+
+<p><strong>Q3:</strong> Which file format provides the BEST performance for columnar analytics queries on large ML datasets stored in Amazon S3?</p>
+<ul>
+<li>A) CSV</li>
+<li>B) JSON</li>
+<li>C) XML</li>
+<li>D) Apache Parquet ✓</li>
+</ul>
+<p><em>Explanation: Parquet is a columnar format with excellent compression and predicate pushdown support. Columnar formats allow reading only the required columns, dramatically reducing I/O for analytical queries.</em></p>
+
@@ -0,0 +1,178 @@
+---
+id: 621b7555-2901-469d-8b0b-a800506c8212
+title: 'Bài 2: Data Transformation & Feature Engineering'
+slug: bai-2-data-transformation
+description: >-
+  SageMaker Processing Jobs cho data prep. SageMaker Feature Store.
+  Xử lý missing values, encoding, normalization, scaling.
+  Text preprocessing, imbalanced data techniques.
+duration_minutes: 60
+is_free: true
+video_url: null
+sort_order: 2
+section_title: "Phần 1: Data Engineering (20%)"
+course:
+  id: 019c9619-lt02-7002-c002-lt0200000002
+  title: 'Luyện thi AWS Certified Machine Learning - Specialty'
+  slug: luyen-thi-aws-ml-specialty
+---
+
+<h2 id="overview"><strong>1. Data Transformation in the ML Pipeline</strong></h2>
+
+<p>Before a model can be trained, raw data must pass through several transformation steps; this is the origin of the famous saying <em>"Garbage in, garbage out"</em>. The MLS-C01 exam probes data-handling techniques and the tools that fit each job.</p>
+
+<h2 id="processing-jobs"><strong>2. SageMaker Processing Jobs</strong></h2>
+
+<p><strong>SageMaker Processing Jobs</strong> is a managed capability for running data processing scripts (Python, Spark) on ephemeral compute clusters.</p>
+
+<table>
+<thead><tr><th>Processor Type</th><th>Framework</th><th>Use Case</th></tr></thead>
+<tbody>
+<tr><td><strong>ScriptProcessor</strong></td><td>Custom Docker container</td><td>Any custom script</td></tr>
+<tr><td><strong>SKLearnProcessor</strong></td><td>scikit-learn</td><td>Classic ML preprocessing</td></tr>
+<tr><td><strong>PySparkProcessor</strong></td><td>Apache Spark</td><td>Large-scale distributed processing</td></tr>
+<tr><td><strong>FrameworkProcessor</strong></td><td>TensorFlow/PyTorch</td><td>Deep learning data prep</td></tr>
+</tbody>
+</table>
+
+<pre><code class="language-text">SageMaker Processing Job Flow:
+
+S3 (input data)
+        ↓
+┌─────────────────────┐
+│   Processing Job    │
+│  (compute cluster)  │
+│                     │
+│  - Preprocess data  │
+│  - Feature engineer │
+│  - Split train/test │
+└─────────────────────┘
+        ↓
+S3 (output: train/, validation/, test/)
+</code></pre>
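The "Split train/test" step of the flow above can be sketched with scikit-learn; this is a hedged illustration on toy data, whereas a real Processing job script would read from and write to the S3 prefixes shown.

```python
# Stratified train/validation/test split, mirroring the last step of the
# processing flow: 80/20 first, then carve validation out of the 80%.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```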
+
+<h2 id="missing-values"><strong>3. Handling Missing Values</strong></h2>
+
+<table>
+<thead><tr><th>Strategy</th><th>Method</th><th>When to Use</th></tr></thead>
+<tbody>
+<tr><td><strong>Deletion</strong></td><td>Drop rows/columns</td><td>MCAR, little missing data (&lt;5%)</td></tr>
+<tr><td><strong>Mean/Median Imputation</strong></td><td>Fill with the mean/median</td><td>Numeric, MCAR/MAR</td></tr>
+<tr><td><strong>Mode Imputation</strong></td><td>Fill with the most frequent value</td><td>Categorical</td></tr>
+<tr><td><strong>KNN Imputation</strong></td><td>Use the K nearest neighbors</td><td>Patterns in data, moderately sized datasets</td></tr>
+<tr><td><strong>Model-based (MICE)</strong></td><td>Multiple imputation</td><td>Complex missingness patterns</td></tr>
+<tr><td><strong>Indicator Feature</strong></td><td>Add an is_missing column</td><td>When missingness itself carries information</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> Three types of missing data: <strong>MCAR</strong> (Missing Completely At Random), where deletion is safe; <strong>MAR</strong> (Missing At Random), where imputation fits; <strong>MNAR</strong> (Missing Not At Random), which needs an indicator feature or domain knowledge.</p>
+</blockquote>
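Two rows of the table combine naturally in practice: median imputation plus an is_missing indicator preserves the signal that a value was absent. A pandas sketch with an invented income column:

```python
# Median imputation + indicator feature: fill the gap, keep the signal.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50.0, np.nan, 70.0, np.nan, 60.0]})

df["income_missing"] = df["income"].isna().astype(int)  # record missingness
df["income"] = df["income"].fillna(df["income"].median())

print(df["income"].tolist())          # [50.0, 60.0, 70.0, 60.0, 60.0]
print(df["income_missing"].tolist())  # [0, 1, 0, 1, 0]
```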
+
+<h2 id="encoding"><strong>4. Categorical Encoding</strong></h2>
+
+<table>
+<thead><tr><th>Encoding</th><th>Method</th><th>When to Use</th><th>Issues</th></tr></thead>
+<tbody>
+<tr><td><strong>One-Hot Encoding</strong></td><td>One binary column per category</td><td>Nominal (no order), few categories</td><td>High cardinality → curse of dimensionality</td></tr>
+<tr><td><strong>Label Encoding</strong></td><td>0, 1, 2, 3...</td><td>Ordinal (ordered)</td><td>Implies a false order for nominal data</td></tr>
+<tr><td><strong>Target Encoding</strong></td><td>Mean of target per category</td><td>High-cardinality nominal</td><td>Data leakage risk if done carelessly</td></tr>
+<tr><td><strong>Embeddings</strong></td><td>Dense vector representation</td><td>Text, high cardinality</td><td>Needs enough data to learn</td></tr>
+</tbody>
+</table>
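The first two rows above can be contrasted in a few lines of pandas: one-hot for a nominal feature, explicit integer codes for an ordinal one (categories invented for illustration).

```python
# One-hot encoding for nominal features vs. integer codes for ordinal ones.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],       # nominal: no inherent order
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
})

onehot = pd.get_dummies(df["color"], prefix="color")  # one column per level
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)          # order-preserving codes

print(sorted(onehot.columns))    # ['color_blue', 'color_red']
print(df["size_code"].tolist())  # [0, 2, 1]
```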
+
+<h2 id="scaling"><strong>5. Normalization & Scaling</strong></h2>
+
+<table>
+<thead><tr><th>Technique</th><th>Formula</th><th>Output Range</th><th>Best For</th></tr></thead>
+<tbody>
+<tr><td><strong>Min-Max Normalization</strong></td><td>(x - min) / (max - min)</td><td>[0, 1]</td><td>Neural networks, distance-based</td></tr>
+<tr><td><strong>Standardization (Z-score)</strong></td><td>(x - mean) / std</td><td>Mean=0, SD=1</td><td>Linear models, SVM, PCA</td></tr>
+<tr><td><strong>Robust Scaler</strong></td><td>(x - median) / IQR</td><td>Centered</td><td>Outliers present</td></tr>
+<tr><td><strong>Log Transform</strong></td><td>log(x)</td><td>Compressed</td><td>Skewed distributions</td></tr>
+</tbody>
+</table>
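The first two formulas from the table applied directly with NumPy, as a quick check that the output ranges in the table hold:

```python
# Min-max normalization and z-score standardization, straight from
# the formulas: (x - min)/(max - min) and (x - mean)/std.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

minmax = (x - x.min()) / (x.max() - x.min())  # rescaled into [0, 1]
zscore = (x - x.mean()) / x.std()             # centered to mean 0, std 1

print(minmax.min(), minmax.max())
print(np.allclose(zscore.mean(), 0.0), np.allclose(zscore.std(), 1.0))
```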
+
+<h2 id="imbalanced"><strong>6. Handling Imbalanced Data</strong></h2>
+
+<p>Class imbalance (e.g., fraud detection: 99% normal, 1% fraud) biases the model toward the majority class.</p>
+
+<table>
+<thead><tr><th>Technique</th><th>Method</th><th>Direction</th></tr></thead>
+<tbody>
+<tr><td><strong>Oversampling</strong></td><td>Duplicate minority class samples</td><td>↑ minority</td></tr>
+<tr><td><strong>SMOTE</strong></td><td>Synthetic Minority Oversampling Technique — generate synthetic samples</td><td>↑ minority</td></tr>
+<tr><td><strong>Undersampling</strong></td><td>Remove majority class samples</td><td>↓ majority</td></tr>
+<tr><td><strong>Class Weights</strong></td><td>Penalize minority misclassification more heavily</td><td>No data change</td></tr>
+<tr><td><strong>Ensemble Methods</strong></td><td>BalancedBagging, EasyEnsemble</td><td>Algorithm-level</td></tr>
+</tbody>
+</table>
+
+<blockquote>
+<p><strong>Exam tip:</strong> Suitable metrics for imbalanced data: <strong>F1 score, AUC-ROC, Precision-Recall</strong>; do NOT use Accuracy (it is misleading). SageMaker Clarify can detect class imbalance.</p>
+</blockquote>
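The "Class Weights" row can be made concrete with scikit-learn's 'balanced' heuristic, which weighs each class by n_samples / (n_classes × class_count); the 98/2 split below mirrors a fraud-style imbalance.

```python
# Balanced class weights for a 98%/2% dataset: the rare class gets a
# proportionally larger penalty when misclassified.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 98 + [1] * 2)  # 98% negative, 2% positive
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y)

print({c: round(float(w), 2) for c, w in zip([0, 1], weights)})
# {0: 0.51, 1: 25.0}
```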
+
+<h2 id="feature-store"><strong>7. SageMaker Feature Store</strong></h2>
+
+<p><strong>SageMaker Feature Store</strong> is a centralized repository for storing, sharing, and reusing ML features.</p>
+
+<pre><code class="language-text">Feature Store Architecture:
+
+         Feature Groups
+┌──────────────────────────────┐
+│  user_features               │
+│  ┌──────┬────────┬────────┐  │
+│  │ id   │ age    │ recency│  │
+│  └──────┴────────┴────────┘  │
+└──────────────────────────────┘
+   ↓ writes            ↑ reads
+┌──────────────────┐ ┌──────────────────┐
+│  Offline Store   │ │  Online Store    │
+│  (S3 - training) │ │  (DynamoDB -     │
+│  batch reads     │ │   low-latency    │
+│                  │ │   inference)     │
+└──────────────────┘ └──────────────────┘
+</code></pre>
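The offline/online split in the diagram can be modeled as one ingest path feeding both an append-only history and a latest-value lookup. This is a toy sketch of the idea only, not the Feature Store API.

```python
# Toy offline/online feature store: training reads the full history,
# inference reads only the latest record per entity id.
offline_store = []  # append-only history (plays the role of S3)
online_store = {}   # latest record per id (plays the low-latency store)

def ingest(record: dict) -> None:
    offline_store.append(record)         # batch training sees everything
    online_store[record["id"]] = record  # inference sees the latest row

ingest({"id": "u1", "age": 30, "recency": 5})
ingest({"id": "u1", "age": 30, "recency": 1})

print(len(offline_store), online_store["u1"]["recency"])  # 2 1
```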
+
+<h2 id="cheat-sheet"><strong>8. Cheat Sheet — Feature Engineering</strong></h2>
+
+<table>
+<thead><tr><th>Problem</th><th>Solution</th></tr></thead>
+<tbody>
+<tr><td>High-cardinality categorical</td><td>Target encoding or embeddings</td></tr>
+<tr><td>Missing values (numeric)</td><td>Median imputation + indicator feature</td></tr>
+<tr><td>Skewed distribution</td><td>Log transform or Box-Cox</td></tr>
+<tr><td>Outliers</td><td>Robust Scaler or clip/winsorize</td></tr>
+<tr><td>Imbalanced classes</td><td>SMOTE + class weights + AUC metric</td></tr>
+<tr><td>Reuse features across teams</td><td>SageMaker Feature Store</td></tr>
+</tbody>
+</table>
+
+<h2 id="practice"><strong>9. Practice Questions</strong></h2>
+
+<p><strong>Q1:</strong> A dataset for fraud detection has 98% negative (non-fraud) and 2% positive (fraud) examples. Which metric is MOST appropriate to evaluate the model?</p>
+<ul>
+<li>A) Accuracy</li>
+<li>B) R-squared</li>
+<li>C) AUC-ROC ✓</li>
+<li>D) Mean Absolute Error</li>
+</ul>
+<p><em>Explanation: Accuracy is misleading for imbalanced data (predicting all negative gives 98% accuracy). AUC-ROC measures the model's ability to distinguish classes across all thresholds, making it ideal for imbalanced classification.</em></p>
+
+<p><strong>Q2:</strong> Which technique generates SYNTHETIC samples to address class imbalance?</p>
+<ul>
+<li>A) Random undersampling</li>
+<li>B) SMOTE (Synthetic Minority Oversampling Technique) ✓</li>
+<li>C) Class weighting</li>
+<li>D) Feature scaling</li>
+</ul>
+<p><em>Explanation: SMOTE creates new synthetic samples for the minority class by interpolating between existing minority class examples, rather than just duplicating them.</em></p>
+
+<p><strong>Q3:</strong> A company wants to share engineered features between their training pipeline and real-time inference service. Which SageMaker feature addresses this?</p>
+<ul>
+<li>A) SageMaker Processing Jobs</li>
+<li>B) SageMaker Experiments</li>
+<li>C) SageMaker Feature Store ✓</li>
+<li>D) SageMaker Data Wrangler</li>
+</ul>
+<p><em>Explanation: SageMaker Feature Store provides both an offline store (S3, for batch training) and an online store (DynamoDB-backed, for low-latency real-time inference), ensuring feature consistency between training and serving.</em></p>
@@ -0,0 +1,362 @@
+{
+  "id": "aws-ai-practitioner",
+  "title": "AWS Certified AI Practitioner (AIF-C01)",
+  "slug": "aws-ai-practitioner",
+  "description": "Practice exam for AWS Certified AI Practitioner — 20 questions covering all 5 domains",
+  "icon": "award",
+  "provider": "AWS",
+  "level": "Foundational",
+  "duration_minutes": 30,
+  "passing_score": 70,
+  "questions_count": 20,
+  "tags": [
+    "AWS",
+    "AI",
+    "Cloud",
+    "Bedrock",
+    "GenAI"
+  ],
+  "series_slug": "luyen-thi-aws-ai-practitioner",
+  "domains": [
+    {
+      "name": "Domain 1: Fundamentals of AI and ML",
+      "weight": 20,
+      "lessons": [
+        {
+          "title": "Bài 1: AI, ML & Deep Learning Concepts",
+          "slug": "01-bai-1-ai-ml-deep-learning-concepts"
+        },
+        {
+          "title": "Bài 2: ML Lifecycle & AWS AI Services",
+          "slug": "02-bai-2-ml-lifecycle-aws-services"
+        }
+      ]
+    },
+    {
+      "name": "Domain 2: Fundamentals of Generative AI",
+      "weight": 24,
+      "lessons": [
+        {
+          "title": "Bài 3: Generative AI & Foundation Models",
+          "slug": "03-bai-3-generative-ai-foundation-models"
+        },
+        {
+          "title": "Bài 4: LLMs, Transformers & Multi-modal",
+          "slug": "04-bai-4-llm-transformers-multimodal"
+        }
+      ]
+    },
+    {
+      "name": "Domain 3: Applications of Foundation Models",
+      "weight": 28,
+      "lessons": [
+        {
+          "title": "Bài 5: Prompt Engineering",
+          "slug": "05-bai-5-prompt-engineering-techniques"
+        },
+        {
+          "title": "Bài 6: RAG & Knowledge Bases",
+          "slug": "06-bai-6-rag-vector-databases-knowledge-bases"
+        },
+        {
+          "title": "Bài 7: Fine-tuning & Model Customization",
+          "slug": "07-bai-7-fine-tuning-model-customization"
+        },
+        {
+          "title": "Bài 8: Amazon Bedrock Deep Dive",
+          "slug": "08-bai-8-amazon-bedrock-deep-dive"
+        }
+      ]
+    },
+    {
+      "name": "Domain 4: Guidelines for Responsible AI",
+      "weight": 14,
+      "lessons": [
+        {
+          "title": "Bài 9: Responsible AI — Fairness & Bias",
+          "slug": "09-bai-9-responsible-ai-fairness-bias-transparency"
+        },
+        {
+          "title": "Bài 10: AWS Responsible AI Tools",
+          "slug": "10-bai-10-aws-responsible-ai-tools"
+        }
+      ]
+    },
+    {
+      "name": "Domain 5: Security, Compliance & Governance",
+      "weight": 14,
+      "lessons": [
+        {
+          "title": "Bài 11: AI Security & Data Privacy",
+          "slug": "11-bai-11-ai-security-data-privacy-compliance"
+        },
+        {
+          "title": "Bài 12: Exam Strategy & Cheat Sheet",
+          "slug": "12-bai-12-exam-strategy-cheat-sheet"
+        }
+      ]
+    }
+  ],
+  "questions": [
+    {
+      "id": 1,
+      "domain": "Domain 2: Fundamentals of Generative AI",
+      "question": "What is a Foundation Model?",
+      "options": [
+        "A model designed for only one specific task",
+        "A large AI model pre-trained on broad data that can be adapted to many downstream tasks",
+        "A model that only processes structured tabular data",
+        "A model trained entirely using Reinforcement Learning"
+      ],
+      "correct": 1,
+      "explanation": "A Foundation Model is a large AI model pre-trained on vast, diverse datasets. It can be adapted to many downstream tasks through fine-tuning, RAG, or prompt engineering."
+    },
+    {
+      "id": 2,
+      "domain": "Domain 3: Applications of Foundation Models",
+      "question": "What is the PRIMARY purpose of Amazon Bedrock?",
+      "options": [
+        "Managing relational databases",
+        "Deploying containers on the cloud",
+        "Accessing and using Foundation Models from multiple providers through a single API",
+        "Monitoring cloud costs"
+      ],
+      "correct": 2,
+      "explanation": "Amazon Bedrock is a fully managed service that provides access to Foundation Models from multiple providers (Anthropic, Meta, Amazon, Mistral, etc.) through a single API for building generative AI applications."
+    },
+    {
+      "id": 3,
+      "domain": "Domain 1: Fundamentals of AI and ML",
+      "question": "How does Supervised Learning differ from Unsupervised Learning?",
+      "options": [
+        "Supervised Learning does not require any data",
+        "Supervised Learning uses labeled data to train the model",
+        "Unsupervised Learning always produces more accurate results",
+        "Supervised Learning can only be used for classification tasks"
+      ],
+      "correct": 1,
+      "explanation": "Supervised Learning uses labeled data (input-output pairs) to train models for classification or regression, while Unsupervised Learning discovers hidden patterns in unlabeled data (e.g., clustering)."
+    },
+    {
+      "id": 4,
+      "domain": "Domain 3: Applications of Foundation Models",
+      "question": "What problem does RAG (Retrieval-Augmented Generation) solve for LLMs?",
+      "options": [
+        "It increases inference speed",
+        "It reduces training costs",
+        "It reduces hallucination by grounding responses in external knowledge sources",
+        "It increases the context window size"
+      ],
+      "correct": 2,
+      "explanation": "RAG combines retrieval of relevant external data with generation, helping LLMs produce more accurate, fact-based answers by grounding responses in retrieved documents rather than relying solely on training knowledge."
+    },
+    {
+      "id": 5,
+      "domain": "Domain 2: Fundamentals of Generative AI",
+      "question": "A customer support chatbot gives inconsistent and overly creative answers to factual questions. Which inference parameter should be adjusted?",
+      "options": [
+        "Increase temperature to 1.0",
+        "Decrease temperature closer to 0",
+        "Increase max tokens",
+        "Increase top-k to 500"
+      ],
+      "correct": 1,
+      "explanation": "Lower temperature values (closer to 0) make the model more deterministic and focused, producing consistent and factual responses. Higher temperature values increase randomness and creativity."
+    },
+    {
+      "id": 6,
+      "domain": "Domain 3: Applications of Foundation Models",
+      "question": "Which prompting technique is MOST effective for improving a model's accuracy on complex mathematical reasoning tasks?",
+      "options": [
+        "Zero-shot prompting",
+        "Negative prompting",
+        "Chain-of-Thought (CoT) prompting",
+        "System prompting"
+      ],
+      "correct": 2,
+      "explanation": "Chain-of-Thought prompting instructs the model to reason step by step before giving a final answer. This significantly improves accuracy on math, logic, and multi-step reasoning tasks."
+    },
+    {
+      "id": 7,
+      "domain": "Domain 3: Applications of Foundation Models",
+      "question": "A company wants to build a Q&A assistant that answers questions from internal documents stored in Amazon S3. The documents are updated weekly. Which approach is MOST suitable?",
+      "options": [
+        "Fine-tune a foundation model on the documents",
+        "Use RAG with Amazon Bedrock Knowledge Bases",
+        "Pre-train a custom model from scratch",
+        "Use zero-shot prompting with a large context window"
+      ],
+      "correct": 1,
+      "explanation": "Amazon Bedrock Knowledge Bases provides managed RAG — it automatically chunks, embeds, and indexes S3 documents, retrieves relevant information per query, and stays current via auto-sync without model retraining."
+    },
+    {
+      "id": 8,
+      "domain": "Domain 2: Fundamentals of Generative AI",
+      "question": "Which Transformer architecture type is BEST suited for text generation tasks such as chatbots and content creation?",
+      "options": [
+        "Encoder-only (e.g., BERT)",
+        "Decoder-only (e.g., GPT, Claude)",
+        "Encoder-Decoder (e.g., T5)",
+        "Convolutional Neural Network (CNN)"
+      ],
+      "correct": 1,
+      "explanation": "Decoder-only architectures (like GPT, Claude, Llama) generate text autoregressively one token at a time and are the basis for most modern chatbots and text generators."
+    },
+    {
+      "id": 9,
+      "domain": "Domain 3: Applications of Foundation Models",
+      "question": "A retail company wants to build an AI assistant that can check inventory, process returns, and answer product questions from their catalog. Which Amazon Bedrock feature should they use?",
+      "options": [
+        "Bedrock Guardrails",
+        "Bedrock Knowledge Bases only",
+        "Bedrock Agents with Action Groups and Knowledge Bases",
+        "Bedrock Model Evaluation"
+      ],
+      "correct": 2,
+      "explanation": "Bedrock Agents can orchestrate multi-step tasks by calling APIs (action groups for inventory/returns) and retrieving information (knowledge bases for product catalog) — combining reasoning with actions."
+    },
+    {
+      "id": 10,
+      "domain": "Domain 3: Applications of Foundation Models",
+      "question": "Which technique allows fine-tuning a large language model while updating only a small fraction of the model's parameters?",
+      "options": [
+        "Full fine-tuning",
+        "LoRA (Low-Rank Adaptation)",
+        "Continued pre-training",
+        "RLHF (Reinforcement Learning from Human Feedback)"
+      ],
|
|
228
|
+
"correct": 1,
|
|
229
|
+
"explanation": "LoRA is a Parameter-Efficient Fine-Tuning (PEFT) technique that adds small trainable adapter matrices while freezing the original model weights — typically updating less than 1% of total parameters, reducing cost significantly."
|
|
230
|
+
},
|
|
231
|
+
{
|
|
232
|
+
"id": 11,
|
|
233
|
+
"domain": "Domain 4: Guidelines for Responsible AI",
|
|
234
|
+
"question": "What is the PRIMARY purpose of Amazon Bedrock Guardrails?",
|
|
235
|
+
"options": [
|
|
236
|
+
"Accelerating model inference",
|
|
237
|
+
"Implementing safety controls such as content filtering, denied topics, and PII detection for AI applications",
|
|
238
|
+
"Compressing model size for deployment",
|
|
239
|
+
"Managing billing and costs"
|
|
240
|
+
],
|
|
241
|
+
"correct": 1,
|
|
242
|
+
"explanation": "Bedrock Guardrails implement safety controls including content filters (hate, violence, sexual), denied topics, word filters, PII detection/redaction, and contextual grounding checks — applied to both model inputs and outputs."
|
|
243
|
+
},
|
|
244
|
+
{
|
|
245
|
+
"id": 12,
|
|
246
|
+
"domain": "Domain 4: Guidelines for Responsible AI",
|
|
247
|
+
"question": "A hiring AI system consistently ranks male candidates higher than equally qualified female candidates. What is the MOST likely cause?",
|
|
248
|
+
"options": [
|
|
249
|
+
"Measurement bias in data collection",
|
|
250
|
+
"Selection bias in the training data reflecting historical hiring patterns",
|
|
251
|
+
"The model's architecture is too complex",
|
|
252
|
+
"The inference temperature is set too high"
|
|
253
|
+
],
|
|
254
|
+
"correct": 1,
|
|
255
|
+
"explanation": "If training data contained historical hiring decisions that favored male candidates, the model would learn and reproduce that selection bias — the training data didn't represent the qualified population fairly."
|
|
256
|
+
},
|
|
257
|
+
{
|
|
258
|
+
"id": 13,
|
|
259
|
+
"domain": "Domain 4: Guidelines for Responsible AI",
|
|
260
|
+
"question": "Which AWS service can detect bias in ML model predictions and provide per-prediction explainability using SHAP values?",
|
|
261
|
+
"options": [
|
|
262
|
+
"Amazon Rekognition",
|
|
263
|
+
"Amazon SageMaker Clarify",
|
|
264
|
+
"Amazon Bedrock Guardrails",
|
|
265
|
+
"Amazon Comprehend"
|
|
266
|
+
],
|
|
267
|
+
"correct": 1,
|
|
268
|
+
"explanation": "SageMaker Clarify provides pre-training bias detection (data analysis), post-training bias detection (prediction analysis across demographic groups), and model explainability through SHAP values."
|
|
269
|
+
},
|
|
270
|
+
{
|
|
271
|
+
"id": 14,
|
|
272
|
+
"domain": "Domain 4: Guidelines for Responsible AI",
|
|
273
|
+
"question": "A document processing application needs human review when AI-extracted data has low confidence. Which AWS service provides this human-in-the-loop capability?",
|
|
274
|
+
"options": [
|
|
275
|
+
"Amazon SageMaker Ground Truth",
|
|
276
|
+
"Amazon Augmented AI (A2I)",
|
|
277
|
+
"Amazon Mechanical Turk directly",
|
|
278
|
+
"Amazon Bedrock Agents"
|
|
279
|
+
],
|
|
280
|
+
"correct": 1,
|
|
281
|
+
"explanation": "Amazon A2I provides human-in-the-loop workflows with built-in integration for Amazon Textract and Rekognition. It automatically triggers human review when AI confidence falls below a defined threshold."
|
|
282
|
+
},
|
|
283
|
+
{
|
|
284
|
+
"id": 15,
|
|
285
|
+
"domain": "Domain 5: Security, Compliance & Governance",
|
|
286
|
+
"question": "A financial services company wants to ensure Amazon Bedrock API calls do NOT traverse the public internet. What should they configure?",
|
|
287
|
+
"options": [
|
|
288
|
+
"AWS Direct Connect only",
|
|
289
|
+
"VPC endpoint (AWS PrivateLink) for Amazon Bedrock",
|
|
290
|
+
"A VPN connection",
|
|
291
|
+
"Amazon CloudFront distribution"
|
|
292
|
+
],
|
|
293
|
+
"correct": 1,
|
|
294
|
+
"explanation": "A VPC interface endpoint (AWS PrivateLink) for Amazon Bedrock allows private connectivity from within a VPC without any traffic going through the public internet."
|
|
295
|
+
},
|
|
296
|
+
{
|
|
297
|
+
"id": 16,
|
|
298
|
+
"domain": "Domain 5: Security, Compliance & Governance",
|
|
299
|
+
"question": "According to the AWS Shared Responsibility Model, who is responsible for ensuring ML training data does not contain bias?",
|
|
300
|
+
"options": [
|
|
301
|
+
"AWS",
|
|
302
|
+
"The foundation model provider",
|
|
303
|
+
"The customer",
|
|
304
|
+
"Both AWS and the customer equally"
|
|
305
|
+
],
|
|
306
|
+
"correct": 2,
|
|
307
|
+
"explanation": "Under the Shared Responsibility Model, customers are responsible for 'security IN the cloud' — including training data quality, bias detection, model selection, IAM, and ethical AI practices."
|
|
308
|
+
},
|
|
309
|
+
{
|
|
310
|
+
"id": 17,
|
|
311
|
+
"domain": "Domain 5: Security, Compliance & Governance",
|
|
312
|
+
"question": "A chatbot must NEVER reveal customer credit card numbers in responses. Which approach provides the STRONGEST guarantee?",
|
|
313
|
+
"options": [
|
|
314
|
+
"Add 'never output credit card numbers' to the system prompt",
|
|
315
|
+
"Fine-tune the model to avoid outputting PII",
|
|
316
|
+
"Use Amazon Bedrock Guardrails with PII filters set to BLOCK",
|
|
317
|
+
"Remove credit card numbers from the knowledge base"
|
|
318
|
+
],
|
|
319
|
+
"correct": 2,
|
|
320
|
+
"explanation": "Bedrock Guardrails with PII filters provide programmatic detection and blocking of credit card numbers — this cannot be bypassed by prompt injection, unlike system prompts which are soft constraints."
|
|
321
|
+
},
|
|
322
|
+
{
|
|
323
|
+
"id": 18,
|
|
324
|
+
"domain": "Domain 5: Security, Compliance & Governance",
|
|
325
|
+
"question": "A company needs to discover which Amazon S3 buckets contain personally identifiable information (PII) before using the data for ML training. Which AWS service should they use?",
|
|
326
|
+
"options": [
|
|
327
|
+
"Amazon Comprehend",
|
|
328
|
+
"Amazon Macie",
|
|
329
|
+
"Amazon Inspector",
|
|
330
|
+
"AWS Config"
|
|
331
|
+
],
|
|
332
|
+
"correct": 1,
|
|
333
|
+
"explanation": "Amazon Macie uses ML to automatically discover and classify sensitive data (including PII) stored in Amazon S3 buckets. Comprehend detects PII in text at runtime, but Macie is designed for S3-level data discovery."
|
|
334
|
+
},
|
|
335
|
+
{
|
|
336
|
+
"id": 19,
|
|
337
|
+
"domain": "Domain 3: Applications of Foundation Models",
|
|
338
|
+
"question": "A company wants to process 50,000 customer reviews overnight using a foundation model for sentiment analysis. Which Amazon Bedrock pricing model is MOST cost-effective?",
|
|
339
|
+
"options": [
|
|
340
|
+
"On-Demand pricing",
|
|
341
|
+
"Provisioned Throughput",
|
|
342
|
+
"Batch Inference",
|
|
343
|
+
"Free tier"
|
|
344
|
+
],
|
|
345
|
+
"correct": 2,
|
|
346
|
+
"explanation": "Batch Inference is designed for large-scale, non-real-time workloads and offers up to 50% cost savings compared to on-demand pricing. Ideal for processing large datasets overnight."
|
|
347
|
+
},
|
|
348
|
+
{
|
|
349
|
+
"id": 20,
|
|
350
|
+
"domain": "Domain 3: Applications of Foundation Models",
|
|
351
|
+
"question": "A non-technical marketing team wants to experiment with generative AI applications without an AWS account or coding skills. Which AWS service should they use?",
|
|
352
|
+
"options": [
|
|
353
|
+
"Amazon SageMaker Canvas",
|
|
354
|
+
"Amazon Bedrock Console",
|
|
355
|
+
"Amazon PartyRock",
|
|
356
|
+
"Amazon Q Business"
|
|
357
|
+
],
|
|
358
|
+
"correct": 2,
|
|
359
|
+
"explanation": "Amazon PartyRock is a free, no-code playground for generative AI that requires no AWS account. Users can build and share GenAI apps with drag-and-drop — ideal for experimentation and learning."
|
|
360
|
+
}
|
|
361
|
+
]
|
|
362
|
+
}
|
|
@@ -0,0 +1,200 @@
{
  "id": "aws-ml-specialty",
  "title": "AWS Certified Machine Learning - Specialty",
  "slug": "aws-ml-specialty",
  "description": "AWS ML Specialty certification exam prep — build, train, and deploy ML on AWS",
  "icon": "award",
  "provider": "AWS",
  "level": "Specialty",
  "duration_minutes": 180,
  "passing_score": 75,
  "questions_count": 15,
  "tags": [
    "AWS",
    "ML",
    "SageMaker"
  ],
  "series_slug": "luyen-thi-aws-ml-specialty",
  "questions": [
    {
      "id": 1,
      "question": "Which SageMaker built-in algorithm is best suited for anomaly detection?",
      "options": [
        "XGBoost",
        "Random Cut Forest",
        "BlazingText",
        "DeepAR"
      ],
      "correct": 1,
      "explanation": "Random Cut Forest (RCF) is an unsupervised SageMaker algorithm specialized in detecting anomalies in streaming or time-series data."
    },
    {
      "id": 2,
      "question": "What is SageMaker Feature Store used for?",
      "options": [
        "Storing trained models",
        "Managing and sharing features across ML teams while ensuring consistency",
        "Monitoring inference endpoints",
        "Managing IAM policies"
      ],
      "correct": 1,
      "explanation": "SageMaker Feature Store is a centralized feature repository that lets ML teams share features, avoid duplicate work, and keep features consistent between training and inference."
    },
    {
      "id": 3,
      "question": "Which mode does SageMaker use to train on multiple instances at once?",
      "options": [
        "Pipe mode",
        "Distributed training mode",
        "File mode",
        "Batch mode"
      ],
      "correct": 1,
      "explanation": "SageMaker supports distributed training, which splits the training workload across multiple instances (data parallelism or model parallelism) to speed it up."
    },
    {
      "id": 4,
      "question": "Which types of drift does SageMaker Model Monitor detect?",
      "options": [
        "Concept drift only",
        "Data quality, model quality, bias drift, and feature attribution drift",
        "Data drift only",
        "Bias drift only"
      ],
      "correct": 1,
      "explanation": "SageMaker Model Monitor detects four types: Data Quality (schema/statistics changes), Model Quality (accuracy degradation), Bias Drift, and Feature Attribution Drift (changes in feature importance)."
    },
    {
      "id": 5,
      "question": "When should you use a SageMaker Inference Pipeline?",
      "options": [
        "When you need to run batch transform",
        "When you need to chain multiple processing steps (preprocessing → model → postprocessing) in a single endpoint",
        "When you need to train multiple models",
        "When you need A/B testing"
      ],
      "correct": 1,
      "explanation": "An Inference Pipeline chains up to 15 containers in a single endpoint — for example: data preprocessing → feature engineering → model prediction → postprocessing."
    },
    {
      "id": 6,
      "question": "Which tasks is BlazingText in SageMaker used for?",
      "options": [
        "Object detection",
        "Word2Vec and text classification",
        "Time series forecasting",
        "Recommender systems"
      ],
      "correct": 1,
      "explanation": "BlazingText is a highly optimized implementation of Word2Vec and supervised text classification, supporting very fast multi-GPU training."
    },
    {
      "id": 7,
      "question": "What is SageMaker Ground Truth used for?",
      "options": [
        "Deploying models to production",
        "Creating labeled datasets with the help of human annotators and active learning",
        "Optimizing hyperparameters",
        "Monitoring training costs"
      ],
      "correct": 1,
      "explanation": "Ground Truth is a data-labeling service that combines human annotators (Amazon Mechanical Turk, a private team, or vendors) with active learning to reduce labeling costs."
    },
    {
      "id": 8,
      "question": "What is Elastic Inference in SageMaker used for?",
      "options": [
        "Increasing storage capacity",
        "Attaching fractional GPU acceleration to instances to reduce inference cost",
        "Automatically scaling the number of models",
        "Compressing models for faster deployment"
      ],
      "correct": 1,
      "explanation": "Elastic Inference attaches low-cost GPU acceleration to SageMaker endpoints or notebook instances — you pay only for the GPU resources you actually use."
    },
    {
      "id": 9,
      "question": "Which strategies help handle an imbalanced dataset?",
      "options": [
        "Using accuracy as the only metric",
        "SMOTE (oversampling), undersampling, class weights, or ensemble methods",
        "Increasing the learning rate",
        "Reducing the number of training epochs"
      ],
      "correct": 1,
      "explanation": "Imbalanced data requires dedicated techniques: SMOTE synthesizes extra samples for the minority class, undersampling reduces the majority class, or class weights can be adjusted in the loss function."
    },
    {
      "id": 10,
      "question": "What is SageMaker Clarify used for?",
      "options": [
        "Hyperparameter optimization",
        "Detecting bias in data and models, and explaining predictions (explainability)",
        "Managing experiments",
        "Building data pipelines"
      ],
      "correct": 1,
      "explanation": "SageMaker Clarify detects bias in data and models and provides feature importance (SHAP values), supporting Responsible AI and regulatory compliance."
    },
    {
      "id": 11,
      "question": "Which problem is DeepAR in SageMaker used for?",
      "options": [
        "Image classification",
        "Time series forecasting",
        "Text summarization",
        "Object detection"
      ],
      "correct": 1,
      "explanation": "DeepAR is an RNN-based algorithm for time series forecasting, especially effective when there are many related time series (it handles the cold-start problem)."
    },
    {
      "id": 12,
      "question": "What is the advantage of a SageMaker Multi-Model Endpoint?",
      "options": [
        "It only supports GPU instances",
        "Hosting many models on a single endpoint, reducing cost when there are many low-traffic models",
        "Faster training",
        "It only supports PyTorch"
      ],
      "correct": 1,
      "explanation": "A Multi-Model Endpoint can host hundreds of models on a single endpoint, loading them on demand — a major cost saving compared to one endpoint per model."
    },
    {
      "id": 13,
      "question": "When should you use SageMaker Batch Transform instead of a Real-time Endpoint?",
      "options": [
        "When you need fast, real-time inference",
        "When you need to run inference over a large dataset without requiring immediate responses",
        "When you need A/B testing",
        "When you need model auto-scaling"
      ],
      "correct": 1,
      "explanation": "Batch Transform fits inference over large volumes of data where real-time responses are not needed — for example, nightly scoring or preprocessing a large dataset."
    },
    {
      "id": 14,
      "question": "What is SageMaker Autopilot?",
      "options": [
        "An automatic model deployment tool",
        "AutoML — automatically analyzes data, tries multiple algorithms, and selects the best model",
        "An endpoint monitoring tool",
        "A distributed training framework"
      ],
      "correct": 1,
      "explanation": "SageMaker Autopilot is an AutoML solution that automatically analyzes data, performs feature engineering, tries multiple algorithms/hyperparameters, and recommends the best model — along with an explanatory notebook."
    },
    {
      "id": 15,
      "question": "What is the benefit of Pipe Mode in SageMaker training?",
      "options": [
        "Increasing model size",
        "Streaming data directly from S3 into the training container, with no need to download everything first",
        "Reducing deployment time",
        "Automatically selecting an algorithm"
      ],
      "correct": 1,
      "explanation": "Pipe Mode streams data from S3 into the training container instead of copying it all first (File Mode) — reducing startup time and disk requirements, especially effective with large datasets."
    }
  ]
}
@@ -0,0 +1,200 @@
{
  "id": "gcp-ml-engineer",
  "title": "Google Cloud Professional ML Engineer",
  "slug": "gcp-ml-engineer",
  "description": "Google Cloud Professional Machine Learning Engineer certification exam prep",
  "icon": "award",
  "provider": "Google Cloud",
  "level": "Professional",
  "duration_minutes": 120,
  "passing_score": 70,
  "questions_count": 15,
  "tags": [
    "GCP",
    "ML",
    "Vertex AI"
  ],
  "series_slug": "luyen-thi-gcp-ml-engineer",
  "questions": [
    {
      "id": 1,
      "question": "Which framework are Vertex AI Pipelines built on?",
      "options": [
        "Apache Spark",
        "Kubeflow Pipelines / TFX",
        "Apache Airflow",
        "Jenkins"
      ],
      "correct": 1,
      "explanation": "Vertex AI Pipelines is based on the Kubeflow Pipelines SDK and TFX (TensorFlow Extended), orchestrating ML workflows on Google Cloud."
    },
    {
      "id": 2,
      "question": "What does BigQuery ML uniquely enable?",
      "options": [
        "Only querying data",
        "Training and deploying ML models directly with SQL inside BigQuery",
        "Only exporting data to CSV",
        "Managing Kubernetes clusters"
      ],
      "correct": 1,
      "explanation": "BigQuery ML (BQML) lets data analysts train ML models with familiar SQL right inside BigQuery — no Python or separate infrastructure required."
    },
    {
      "id": 3,
      "question": "How does Vertex AI Feature Store differ from storing features in an ordinary database?",
      "options": [
        "There is no difference",
        "Low-latency feature serving, guaranteed training-serving consistency, and feature versioning",
        "It only supports structured data",
        "It only works with TensorFlow"
      ],
      "correct": 1,
      "explanation": "Feature Store is purpose-built for ML: it serves features online (low latency) and offline (batch), keeps features consistent between training and serving, and supports time travel and monitoring."
    },
    {
      "id": 4,
      "question": "When should you use AutoML instead of custom training on Vertex AI?",
      "options": [
        "When you need full control over the architecture",
        "When the team has limited ML expertise or needs a baseline model quickly",
        "When the dataset is very large (>1TB)",
        "When you need distributed training"
      ],
      "correct": 1,
      "explanation": "AutoML fits when you need a model quickly, the team has limited ML expertise, or you need a baseline. Use custom training when you need control over the architecture, specialized algorithms, or deep optimization."
    },
    {
      "id": 5,
      "question": "What is Vertex AI Experiments used for?",
      "options": [
        "Deploying models to production",
        "Tracking, comparing, and reproducing ML experiments (hyperparameters, metrics, artifacts)",
        "Creating new datasets",
        "Managing IAM"
      ],
      "correct": 1,
      "explanation": "Vertex AI Experiments provides experiment tracking: it logs hyperparameters, metrics, and model artifacts — letting you compare multiple runs and reproduce results."
    },
    {
      "id": 6,
      "question": "Which main components does TFX (TensorFlow Extended) include?",
      "options": [
        "Only ExampleGen and Trainer",
        "ExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Trainer, Evaluator, Pusher",
        "Only Trainer and Serving",
        "Only Transform and Evaluator"
      ],
      "correct": 1,
      "explanation": "TFX is an end-to-end ML platform comprising: ExampleGen (ingest), StatisticsGen + SchemaGen + ExampleValidator (validation), Transform (feature engineering), Trainer, Tuner, Evaluator, and Pusher (deploy)."
    },
    {
      "id": 7,
      "question": "What does Vertex AI Model Monitoring check?",
      "options": [
        "Only CPU/memory",
        "Training-serving skew and drift (prediction data changing over time)",
        "Only latency",
        "Only cost"
      ],
      "correct": 1,
      "explanation": "Model Monitoring detects training-serving skew (differing feature distributions) and prediction drift (production data drifting from the baseline), triggering alerts when thresholds are exceeded."
    },
    {
      "id": 8,
      "question": "Which deployment strategies does Google Cloud AI Platform Prediction support?",
      "options": [
        "Single-model deployment only",
        "Traffic splitting for A/B testing and canary deployments",
        "Batch prediction only",
        "Edge deployment only"
      ],
      "correct": 1,
      "explanation": "Vertex AI Prediction supports traffic splitting: a percentage of traffic can be routed to different model versions — enabling A/B testing, canary releases, and progressive rollouts."
    },
    {
      "id": 9,
      "question": "What role does Dataflow play in an ML pipeline?",
      "options": [
        "Model training",
        "Large-scale data processing (batch & streaming) for data preprocessing/feature engineering",
        "Model deployment",
        "Model monitoring"
      ],
      "correct": 1,
      "explanation": "Dataflow (based on Apache Beam) processes data at scale: ETL and feature engineering for both batch and streaming — a key preprocessing step in ML pipelines."
    },
    {
      "id": 10,
      "question": "Which problem is Vertex AI Matching Engine used for?",
      "options": [
        "Model training",
        "Nearest neighbor search (vector similarity search) at large scale",
        "Data labeling",
        "Ordinary model serving"
      ],
      "correct": 1,
      "explanation": "Matching Engine is a managed approximate nearest neighbor (ANN) service — used for similarity search, recommendations, and RAG retrieval at the scale of billions of vectors."
    },
    {
      "id": 11,
      "question": "How does Vertex AI Workbench differ from Colab Enterprise?",
      "options": [
        "They are identical",
        "Workbench provides managed JupyterLab instances for production ML, while Colab Enterprise focuses on collaboration and exploration",
        "Workbench only supports R",
        "Colab Enterprise is free only"
      ],
      "correct": 1,
      "explanation": "Workbench offers managed JupyterLab instances deeply integrated with GCP services (BigQuery, GCS) for production ML. Colab Enterprise leans toward collaboration, sharing, and exploration."
    },
    {
      "id": 12,
      "question": "Which techniques shrink a model for deployment on edge devices?",
      "options": [
        "Adding more layers",
        "Quantization, pruning, knowledge distillation",
        "Increasing batch size",
        "Using more GPUs"
      ],
      "correct": 1,
      "explanation": "Model compression: quantization (lower precision: FP32→INT8), pruning (removing unimportant weights/neurons), and knowledge distillation (a teacher model trains a smaller student model)."
    },
    {
      "id": 13,
      "question": "What is Vertex AI GenAI Studio used for?",
      "options": [
        "Only training models from scratch",
        "Prototyping, testing, and tuning Foundation Models (PaLM, Gemini) on Google Cloud",
        "Managing billing",
        "Monitoring the network"
      ],
      "correct": 1,
      "explanation": "GenAI Studio provides a UI and API for experimenting with Foundation Models, prompt design, tuning, and deployment — no deep ML expertise required."
    },
    {
      "id": 14,
      "question": "When data has many missing values, which strategy is appropriate?",
      "options": [
        "Always delete rows with missing values",
        "It depends on context: imputation (mean/median/mode, KNN, model-based), or adding an indicator feature for missingness",
        "Always fill with 0",
        "Ignore it and train directly"
      ],
      "correct": 1,
      "explanation": "Handling missing values depends on the pattern (MCAR/MAR/MNAR): statistical imputation (mean/median), model-based imputation (KNN, MICE), or adding an indicator feature. Deleting rows is appropriate only when few values are missing and they are MCAR."
    },
    {
      "id": 15,
      "question": "What is Continuous Training (CT) in MLOps?",
      "options": [
        "Training a model only once",
        "Automatically retraining the model when a trigger is detected (data drift, a schedule, or performance degradation)",
        "Manually retraining the model every week",
        "Only used for deep learning"
      ],
      "correct": 1,
      "explanation": "Continuous Training automatically kicks off the retraining pipeline when new data arrives (scheduled), data drift exceeds a threshold, or model performance degrades — keeping the model fresh."
    }
  ]
}
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@xdev-asia/xdev-knowledge-mcp",
-  "version": "1.0.42",
+  "version": "1.0.43",
   "description": "MCP Server - The complete xDev.asia knowledge base: 57 series, 1200+ lessons, blog, showcase (AI, Architecture, DevSecOps, Programming)",
   "type": "module",
   "main": "dist/index.js",