@xdev-asia/xdev-knowledge-mcp 1.0.41 → 1.0.43

Files changed (23)
  1. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/01-domain-1-fundamentals-ai-ml/lessons/01-bai-1-ai-ml-deep-learning-concepts.md +287 -0
  2. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/01-domain-1-fundamentals-ai-ml/lessons/02-bai-2-ml-lifecycle-aws-services.md +258 -0
  3. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/02-domain-2-fundamentals-generative-ai/lessons/03-bai-3-generative-ai-foundation-models.md +218 -0
  4. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/02-domain-2-fundamentals-generative-ai/lessons/04-bai-4-llm-transformers-multimodal.md +232 -0
  5. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/03-domain-3-applications-foundation-models/lessons/05-bai-5-prompt-engineering-techniques.md +254 -0
  6. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/03-domain-3-applications-foundation-models/lessons/06-bai-6-rag-vector-databases-knowledge-bases.md +244 -0
  7. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/03-domain-3-applications-foundation-models/lessons/07-bai-7-fine-tuning-model-customization.md +247 -0
  8. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/03-domain-3-applications-foundation-models/lessons/08-bai-8-amazon-bedrock-deep-dive.md +276 -0
  9. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/04-domain-4-responsible-ai/lessons/09-bai-9-responsible-ai-fairness-bias-transparency.md +224 -0
  10. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/04-domain-4-responsible-ai/lessons/10-bai-10-aws-responsible-ai-tools.md +252 -0
  11. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/05-domain-5-security-compliance/lessons/11-bai-11-ai-security-data-privacy-compliance.md +279 -0
  12. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/chapters/05-domain-5-security-compliance/lessons/12-bai-12-exam-strategy-cheat-sheet.md +229 -0
  13. package/content/series/luyen-thi/luyen-thi-aws-ai-practitioner/index.md +257 -0
  14. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/01-bai-1-data-repositories-ingestion.md +193 -0
  15. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/02-bai-2-data-transformation.md +178 -0
  16. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/index.md +240 -0
  17. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/index.md +225 -0
  18. package/data/categories.json +16 -4
  19. package/data/quizzes/aws-ai-practitioner.json +362 -0
  20. package/data/quizzes/aws-ml-specialty.json +200 -0
  21. package/data/quizzes/gcp-ml-engineer.json +200 -0
  22. package/data/quizzes.json +764 -0
  23. package/package.json +1 -1
@@ -0,0 +1,257 @@
1
+ ---
2
+ id: 019c9619-lt01-7001-c001-lt0100000001
3
+ title: "AWS Certified AI Practitioner (AIF-C01) Exam Prep"
4
+ slug: luyen-thi-aws-ai-practitioner
5
+ description: >-
6
+ A comprehensive review path for the AWS Certified AI Practitioner (AIF-C01) exam.
7
+ Covers all 5 domains: AI/ML Fundamentals, Generative AI, Foundation Models,
8
+ Responsible AI, Security & Governance. 12 in-depth lessons plus English-language practice exams.
9
+
10
+ featured_image: images/blog/aws-ai-practitioner-series-banner.png
11
+ level: beginner
12
+ duration_hours: 30
13
+ lesson_count: 12
14
+ price: '0.00'
15
+ is_free: true
16
+ view_count: 0
17
+ average_rating: '0.00'
18
+ review_count: 0
19
+ enrollment_count: 0
20
+ meta: null
21
+ published_at: '2026-04-04T10:00:00.000000Z'
22
+ created_at: '2026-04-04T10:00:00.000000Z'
23
+
24
+ author:
25
+ id: 019c9616-d2b4-713f-9b2c-40e2e92a05cf
26
+ name: Duy Tran
27
+ avatar: avatars/7e8eb5c6-4cac-455b-a701-4060f085d501.jpeg
28
+
29
+ category:
30
+ id: 019c9616-cat9-7009-a009-000000000009
31
+ name: Certification Exam Prep
32
+ slug: luyen-thi
33
+
34
+ tags:
35
+ - name: AWS
36
+ slug: aws
37
+ - name: AI
38
+ slug: ai
39
+ - name: Certification
40
+ slug: chung-chi
41
+ - name: Amazon Bedrock
42
+ slug: amazon-bedrock
43
+ - name: SageMaker
44
+ slug: sagemaker
45
+ - name: Generative AI
46
+ slug: generative-ai
47
+
48
+ quiz_slug: aws-ai-practitioner
49
+
50
+ sections:
51
+ - id: section-01
52
+ title: "Domain 1: Fundamentals of AI and ML (20%)"
53
+ description: AI, ML, and Deep Learning concepts, ML lifecycle, data types, use cases
54
+ sort_order: 1
55
+ lessons:
56
+ - id: 019c9619-lt01-d1-l01
57
+ title: "Lesson 1: AI, ML & Deep Learning - Concepts and Terminology"
58
+ slug: bai-1-ai-ml-deep-learning-concepts
59
+ description: >-
60
+ AI vs ML vs DL. Supervised, Unsupervised, Reinforcement Learning.
61
+ Classification, Regression, Clustering. Neural Networks basics.
62
+ Training, Validation, Test sets. Bias-Variance tradeoff.
63
+ duration_minutes: 60
64
+ is_free: true
65
+ sort_order: 0
66
+ video_url: null
67
+ - id: 019c9619-lt01-d1-l02
68
+ title: "Lesson 2: ML Development Lifecycle & AWS AI Services Overview"
69
+ slug: bai-2-ml-lifecycle-aws-services
70
+ description: >-
71
+ ML pipeline: data collection → feature engineering → training → evaluation → deployment.
72
+ AWS AI/ML service stack. SageMaker, Rekognition, Comprehend, Polly,
73
+ Transcribe, Translate, Textract, Lex, Personalize, Forecast, Kendra.
74
+ duration_minutes: 60
75
+ is_free: true
76
+ sort_order: 1
77
+ video_url: null
78
+
79
+ - id: section-02
80
+ title: "Domain 2: Fundamentals of Generative AI (24%)"
81
+ description: GenAI concepts, Foundation Models, LLMs, Transformer architecture
82
+ sort_order: 2
83
+ lessons:
84
+ - id: 019c9619-lt01-d2-l03
85
+ title: "Lesson 3: Generative AI & Foundation Models"
86
+ slug: bai-3-generative-ai-foundation-models
87
+ description: >-
88
+ What Generative AI is. Foundation Models: pre-training, fine-tuning.
89
+ Types: text-to-text, text-to-image, text-to-code. Tokenization.
90
+ Model parameters, inference, temperature, top-p, top-k.
91
+ duration_minutes: 60
92
+ is_free: true
93
+ sort_order: 0
94
+ video_url: null
95
+ - id: 019c9619-lt01-d2-l04
96
+ title: "Lesson 4: LLMs, Transformers & Multi-modal Models"
97
+ slug: bai-4-llm-transformers-multimodal
98
+ description: >-
99
+ Transformer architecture: attention mechanism, self-attention.
100
+ GPT (decoder-only), BERT (encoder-only), T5 (encoder-decoder).
101
+ Multi-modal models. Hallucination: causes and mitigation.
102
+ Embeddings and vector representations.
103
+ duration_minutes: 60
104
+ is_free: true
105
+ sort_order: 1
106
+ video_url: null
107
+
108
+ - id: section-03
109
+ title: "Domain 3: Applications of Foundation Models (28%)"
110
+ description: Prompt engineering, RAG, fine-tuning, Amazon Bedrock
111
+ sort_order: 3
112
+ lessons:
113
+ - id: 019c9619-lt01-d3-l05
114
+ title: "Lesson 5: Prompt Engineering Techniques"
115
+ slug: bai-5-prompt-engineering
116
+ description: >-
117
+ Zero-shot, Few-shot, Chain-of-Thought prompting.
118
+ System prompts, role-based prompting. Prompt templates.
119
+ Best practices: clarity, specificity, constraints.
120
+ Common pitfalls and how to optimize prompts.
121
+ duration_minutes: 60
122
+ is_free: true
123
+ sort_order: 0
124
+ video_url: null
125
+ - id: 019c9619-lt01-d3-l06
126
+ title: "Lesson 6: RAG - Retrieval-Augmented Generation"
127
+ slug: bai-6-rag-retrieval-augmented-generation
128
+ description: >-
129
+ RAG architecture: indexing, retrieval, generation.
130
+ Vector databases, embeddings, similarity search.
131
+ Amazon Bedrock Knowledge Bases. Chunking strategies.
132
+ RAG vs Fine-tuning: when to use which.
133
+ duration_minutes: 60
134
+ is_free: true
135
+ sort_order: 1
136
+ video_url: null
137
+ - id: 019c9619-lt01-d3-l07
138
+ title: "Lesson 7: Fine-tuning & Model Customization"
139
+ slug: bai-7-fine-tuning-model-customization
140
+ description: >-
141
+ Pre-training vs Fine-tuning vs Prompt Engineering.
142
+ Continued pre-training, instruction tuning.
143
+ PEFT: LoRA, QLoRA. Training data preparation.
144
+ Amazon Bedrock Custom Models, SageMaker JumpStart.
145
+ duration_minutes: 60
146
+ is_free: true
147
+ sort_order: 2
148
+ video_url: null
149
+ - id: 019c9619-lt01-d3-l08
150
+ title: "Lesson 8: Amazon Bedrock - Complete Deep Dive"
151
+ slug: bai-8-amazon-bedrock-deep-dive
152
+ description: >-
153
+ Bedrock architecture, supported models (Claude, Llama, Titan, Mistral).
154
+ Bedrock Agents, Guardrails, Knowledge Bases, Model Evaluation.
155
+ Playgrounds. Bedrock API & SDKs. Pricing models.
156
+ PartyRock for prototyping.
157
+ duration_minutes: 75
158
+ is_free: true
159
+ sort_order: 3
160
+ video_url: null
161
+
162
+ - id: section-04
163
+ title: "Domain 4: Guidelines for Responsible AI (14%)"
164
+ description: Fairness, transparency, explainability, responsible AI practices
165
+ sort_order: 4
166
+ lessons:
167
+ - id: 019c9619-lt01-d4-l09
168
+ title: "Lesson 9: Responsible AI - Fairness, Bias & Transparency"
169
+ slug: bai-9-responsible-ai-fairness-bias
170
+ description: >-
171
+ AWS Responsible AI principles. Types of bias: selection, measurement,
172
+ algorithmic bias. Fairness metrics. Model explainability: SHAP, LIME.
173
+ SageMaker Clarify. AWS AI Service Cards.
174
+ duration_minutes: 50
175
+ is_free: true
176
+ sort_order: 0
177
+ video_url: null
178
+ - id: 019c9619-lt01-d4-l10
179
+ title: "Lesson 10: Human-in-the-Loop & AI Governance"
180
+ slug: bai-10-human-in-the-loop-governance
181
+ description: >-
182
+ Human review workflows. Amazon Augmented AI (A2I).
183
+ Model monitoring and drift detection. Guardrails for Bedrock.
184
+ Content filtering, toxicity detection. Watermarking.
185
+ duration_minutes: 50
186
+ is_free: true
187
+ sort_order: 1
188
+ video_url: null
189
+
190
+ - id: section-05
191
+ title: "Domain 5: Security, Compliance & Governance (14%)"
192
+ description: AI security, data privacy, compliance, exam strategy
193
+ sort_order: 5
194
+ lessons:
195
+ - id: 019c9619-lt01-d5-l11
196
+ title: "Lesson 11: AI Security & Data Privacy on AWS"
197
+ slug: bai-11-ai-security-data-privacy
198
+ description: >-
199
+ IAM for AI services. Data encryption (KMS, at-rest, in-transit).
200
+ VPC configuration for SageMaker. Data privacy: PII detection,
201
+ Amazon Macie. Compliance frameworks: GDPR, HIPAA, SOC.
202
+ Shared responsibility model for AI.
203
+ duration_minutes: 50
204
+ is_free: true
205
+ sort_order: 0
206
+ video_url: null
207
+ - id: 019c9619-lt01-d5-l12
208
+ title: "Lesson 12: Exam Strategy, Cheat Sheet & Mock Exam Guide"
209
+ slug: bai-12-exam-strategy-cheat-sheet
210
+ description: >-
211
+ AIF-C01 exam format: 65 questions, 90 minutes, 700/1000.
212
+ Domain weight strategy. Elimination techniques.
213
+ Complete cheat sheet: services mapping, key concepts.
214
+ Guidance on taking practice exams and evaluating the results.
215
+ duration_minutes: 45
216
+ is_free: true
217
+ sort_order: 1
218
+ video_url: null
219
+
220
+ reviews: []
221
+ quizzes: []
222
+ ---
223
+
224
+ ## Introduction
225
+
226
+ The **AWS Certified AI Practitioner (AIF-C01) Exam Prep** course helps you review systematically and covers all 5 exam domains, from AI/ML fundamentals to GenAI, Amazon Bedrock, Responsible AI, and Security.
227
+
228
+ ### Who should take this course?
229
+
230
+ - Developers, DevOps engineers, and Solution Architects who want an AI certification
231
+ - Business Analysts and Product Managers who want to understand AI on AWS
232
+ - Beginners in AI who want a solid foundation
233
+ - Anyone with basic AI knowledge who wants to validate it with an AWS certification
234
+
235
+ ### AIF-C01 exam structure
236
+
237
+ | Domain | Weight | Lessons |
238
+ |--------|----------|------------|
239
+ | Domain 1: Fundamentals of AI and ML | 20% | Lessons 1–2 |
240
+ | Domain 2: Fundamentals of Generative AI | 24% | Lessons 3–4 |
241
+ | Domain 3: Applications of Foundation Models | 28% | Lessons 5–8 |
242
+ | Domain 4: Guidelines for Responsible AI | 14% | Lessons 9–10 |
243
+ | Domain 5: Security, Compliance & Governance | 14% | Lessons 11–12 |
244
+
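As a rough planning aid, the domain weights in the table above can be turned into a study-time split. The sketch below assumes the 30 hours given by `duration_hours` in the front matter are divided in proportion to exam weight; the helper name `allocate_hours` is hypothetical and not part of the package.

```python
# Domain weights copied from the exam structure table (percent of the exam).
DOMAIN_WEIGHTS = {
    "Domain 1: Fundamentals of AI and ML": 20,
    "Domain 2: Fundamentals of Generative AI": 24,
    "Domain 3: Applications of Foundation Models": 28,
    "Domain 4: Guidelines for Responsible AI": 14,
    "Domain 5: Security, Compliance & Governance": 14,
}


def allocate_hours(total_hours, weights):
    """Split total_hours across domains in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: total_hours * w / total_weight for name, w in weights.items()}


# 30 hours is the course's duration_hours value; adjust it to your own budget.
plan = allocate_hours(30, DOMAIN_WEIGHTS)
for name, hours in plan.items():
    print(f"{name}: {hours:.1f} h")
```

A proportional split is only one possible choice; weaker domains may deserve more time than their weight suggests.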
245
+ - **Questions**: 65 total (50 scored + 15 unscored)
246
+ - **Duration**: 90 minutes
247
+ - **Passing score**: 700/1000
248
+ - **Exam fee**: $100 USD
249
+ - **Exam languages**: English (and several other languages)
250
+ - **Delivery**: Pearson VUE testing center or online proctored
251
+
252
+ ### Study path
253
+
254
+ 1. **Study the theory** in the 12 lessons of this series
255
+ 2. **Take practice exams** with simulated English multiple-choice questions
256
+ 3. **Review** weak domains and retake the practice exam until you score ≥80%
257
+ 4. **Register for the exam** when you feel confident: [aws.amazon.com/certification](https://aws.amazon.com/certification/certified-ai-practitioner/)
@@ -0,0 +1,193 @@
1
+ ---
2
+ id: 14a964b2-b4b7-46e5-95b0-7d91d9cacdf5
3
+ title: 'Lesson 1: Data Repositories & Ingestion - S3, Kinesis, Glue'
4
+ slug: bai-1-data-repositories-ingestion
5
+ description: >-
6
+ S3 data lake for ML. Kinesis Data Streams/Firehose for streaming ingestion.
7
+ AWS Glue ETL jobs and Data Catalog. Lake Formation. Data Wrangler.
8
+ Storage strategy: Parquet, ORC, CSV, JSON.
9
+ duration_minutes: 60
10
+ is_free: true
11
+ video_url: null
12
+ sort_order: 1
13
+ section_title: "Part 1: Data Engineering (20%)"
14
+ course:
15
+ id: 019c9619-lt02-7002-c002-lt0200000002
16
+ title: 'AWS Certified Machine Learning - Specialty Exam Prep'
17
+ slug: luyen-thi-aws-ml-specialty
18
+ ---
19
+
20
+ <h2 id="overview"><strong>1. Data Engineering Overview for MLS-C01</strong></h2>
21
+
22
+ <p>The Data Engineering domain accounts for <strong>20% of the MLS-C01 exam</strong>. You need to know it thoroughly: the exam often asks "Which service should be used to ingest/store/transform data for ML?"</p>
23
+
24
+ <blockquote>
25
+ <p><strong>Exam tip:</strong> Most Data Engineering questions present a scenario and ask for the appropriate service. Key pattern: batch → S3 + Glue; streaming → Kinesis; structured/SQL → Athena; catalog → Glue Data Catalog.</p>
26
+ </blockquote>
27
+
28
+ <h2 id="s3-ml"><strong>2. Amazon S3 — ML Data Lake</strong></h2>
29
+
30
+ <p><strong>Amazon S3</strong> is the storage foundation for ML data on AWS. ML pipelines on AWS typically start and end in S3: training data, model artifacts, predictions.</p>
31
+
32
+ <h3 id="s3-storage-classes"><strong>2.1. S3 Storage Classes for ML</strong></h3>
33
+
34
+ <table>
35
+ <thead><tr><th>Storage Class</th><th>Use Case</th><th>Cost</th></tr></thead>
36
+ <tbody>
37
+ <tr><td><strong>S3 Standard</strong></td><td>Active training data, frequent access</td><td>Highest</td></tr>
38
+ <tr><td><strong>S3 Intelligent-Tiering</strong></td><td>Mixed access patterns (automatic tiering)</td><td>Optimized automatically</td></tr>
39
+ <tr><td><strong>S3 Standard-IA</strong></td><td>Backup datasets, infrequent access</td><td>Lower than Standard</td></tr>
40
+ <tr><td><strong>S3 Glacier Instant Retrieval</strong></td><td>Archived datasets, occasional retrieval</td><td>Low</td></tr>
41
+ <tr><td><strong>S3 Glacier Deep Archive</strong></td><td>Long-term compliance archives</td><td>Lowest</td></tr>
42
+ </tbody>
43
+ </table>
44
+
45
+ <h3 id="s3-file-formats"><strong>2.2. File Formats for ML</strong></h3>
46
+
47
+ <table>
48
+ <thead><tr><th>Format</th><th>Type</th><th>Best For</th><th>Compression</th></tr></thead>
49
+ <tbody>
50
+ <tr><td><strong>Parquet</strong></td><td>Columnar</td><td>Analytics, large datasets, feature stores</td><td>Excellent</td></tr>
51
+ <tr><td><strong>ORC</strong></td><td>Columnar</td><td>Hive/EMR workloads</td><td>Excellent</td></tr>
52
+ <tr><td><strong>CSV</strong></td><td>Row-based</td><td>Simple, SageMaker training input</td><td>Poor</td></tr>
53
+ <tr><td><strong>JSON</strong></td><td>Semi-structured</td><td>Nested data, APIs</td><td>Poor</td></tr>
54
+ <tr><td><strong>RecordIO</strong></td><td>Binary</td><td>SageMaker Pipe Mode training</td><td>Good</td></tr>
55
+ </tbody>
56
+ </table>
57
+
58
+ <blockquote>
59
+ <p><strong>Exam tip:</strong> When a question asks about <em>performance optimization</em> for large-scale training, the answer is usually to switch to <strong>Parquet</strong> (columnar, compressed) and to use <strong>Pipe Mode</strong> instead of File Mode in SageMaker.</p>
60
+ </blockquote>
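To see why a columnar format tends to reduce I/O for analytics, the following plain-Python sketch compares how many values are touched when reading a single field from a row-based layout versus a columnar one. It only illustrates the layout idea on made-up data; it is not Parquet's actual on-disk format, and the function names are hypothetical.

```python
# Toy dataset: 4 records with 3 fields each (made-up values).
rows = [
    {"user_id": 1, "age": 34, "country": "VN"},
    {"user_id": 2, "age": 27, "country": "US"},
    {"user_id": 3, "age": 45, "country": "JP"},
    {"user_id": 4, "age": 31, "country": "VN"},
]

# Row-based layout (CSV-like): all values of one record are stored together.
row_layout = [list(r.values()) for r in rows]

# Columnar layout (Parquet/ORC-like): all values of one field are stored together.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}


def values_scanned_row(layout):
    """Reading one field from a row layout still walks every value of every record."""
    return sum(len(record) for record in layout)


def values_scanned_columnar(layout, column):
    """Reading one field from a columnar layout touches only that column."""
    return len(layout[column])


print(values_scanned_row(row_layout))                 # 12
print(values_scanned_columnar(column_layout, "age"))  # 4
```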
61
+
62
+ <pre><code class="language-text">S3 Data Lake Architecture for ML:
63
+
64
+ ┌─────────────────────────────────────────────────────────┐
65
+ │ Amazon S3 Buckets │
66
+ ├──────────────┬──────────────┬──────────────┬────────────┤
67
+ │ Raw Zone │ Processed │ Features │ Models │
68
+ │ (landing) │ Zone │ Zone │ & Output │
69
+ │ │ │ │ │
70
+ │ CSV/JSON │ Parquet/ORC │ Feature │ Model │
71
+ │ original │ cleaned │ Store │ Artifacts │
72
+ │ data │ transformed │ snapshots │ Predictions│
73
+ └──────────────┴──────────────┴──────────────┴────────────┘
74
+ ↑ ↑ ↑
75
+ Kinesis AWS Glue SageMaker
76
+ (streaming) (ETL) Processing
77
+ </code></pre>
78
+
79
+ <h2 id="kinesis"><strong>3. Amazon Kinesis — Streaming Ingestion</strong></h2>
80
+
81
+ <p>Kinesis is a family of services for <strong>real-time data streaming</strong>. It is an important exam topic: you need to clearly distinguish the 4 services.</p>
82
+
83
+ <table>
84
+ <thead><tr><th>Service</th><th>Function</th><th>Destination</th><th>ML Use Case</th></tr></thead>
85
+ <tbody>
86
+ <tr><td><strong>Kinesis Data Streams (KDS)</strong></td><td>Custom real-time processing</td><td>Custom consumers</td><td>Real-time feature engineering</td></tr>
87
+ <tr><td><strong>Kinesis Data Firehose</strong></td><td>Managed delivery (no code)</td><td>S3, Redshift, ES, Splunk</td><td>Batch loading to data lake</td></tr>
88
+ <tr><td><strong>Kinesis Data Analytics</strong></td><td>SQL/Flink on streams</td><td>S3, Redshift</td><td>Real-time aggregations, anomaly detect</td></tr>
89
+ <tr><td><strong>Kinesis Video Streams</strong></td><td>Video ingestion</td><td>Rekognition, SageMaker</td><td>Computer vision pipelines</td></tr>
90
+ </tbody>
91
+ </table>
92
+
93
+ <blockquote>
94
+ <p><strong>Exam tip:</strong> A common question: "IoT sensors send data continuously, and it must be stored in S3 for ML training without custom code?" → Kinesis <strong>Data Firehose</strong> (managed, no code). "Need real-time processing with custom logic?" → Kinesis <strong>Data Streams</strong>.</p>
95
+ </blockquote>
96
+
97
+ <h3 id="kinesis-shards"><strong>3.1. KDS Shards & Capacity</strong></h3>
98
+
99
+ <pre><code class="language-text">Kinesis Data Streams Capacity:
100
+
101
+ ┌─────────────────────────────────────────────┐
102
+ │ Each Shard: │
103
+ │ • Ingest: 1 MB/s OR 1,000 records/s │
104
+ │ • Read: 2 MB/s │
105
+ │ • Retention: 24 hours (default), extendable up to 365 days │
106
+ └─────────────────────────────────────────────┘
107
+
108
+ Stream with N shards:
109
+ • Total ingest: N × 1 MB/s
110
+ • Total read: N × 2 MB/s
111
+ </code></pre>
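The per-shard limits above translate directly into a sizing calculation. The sketch below is a minimal estimate that assumes provisioned shards and the limits quoted in the box; `required_shards` is a hypothetical helper, not an AWS API.

```python
import math

# Per-shard limits quoted in the capacity box above.
SHARD_INGEST_MB_PER_S = 1.0
SHARD_INGEST_RECORDS_PER_S = 1000
SHARD_READ_MB_PER_S = 2.0


def required_shards(ingest_mb_per_s, records_per_s, read_mb_per_s=0.0):
    """Return the smallest shard count that satisfies every per-shard limit."""
    by_bytes = math.ceil(ingest_mb_per_s / SHARD_INGEST_MB_PER_S)
    by_records = math.ceil(records_per_s / SHARD_INGEST_RECORDS_PER_S)
    by_reads = math.ceil(read_mb_per_s / SHARD_READ_MB_PER_S)
    return max(1, by_bytes, by_records, by_reads)


# Hypothetical workload: 5,000 small records/s totalling 2.5 MB/s.
print(required_shards(ingest_mb_per_s=2.5, records_per_s=5000))  # 5
```

In this example the record-count limit, not the byte limit, decides the shard count, which is why both limits have to be checked.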
112
+
113
+ <h2 id="glue"><strong>4. AWS Glue — ETL for ML</strong></h2>
114
+
115
+ <p><strong>AWS Glue</strong> is a fully managed ETL service. In an ML pipeline, Glue is used to <strong>transform and clean data</strong> before it goes into training.</p>
116
+
117
+ <h3 id="glue-components"><strong>4.1. Glue Components</strong></h3>
118
+
119
+ <table>
120
+ <thead><tr><th>Component</th><th>Function</th></tr></thead>
121
+ <tbody>
122
+ <tr><td><strong>Glue Data Catalog</strong></td><td>Central metadata repository — schemas, tables, partitions</td></tr>
123
+ <tr><td><strong>Glue Crawlers</strong></td><td>Auto-discover schemas from S3/RDS/Redshift and populate the Data Catalog</td></tr>
124
+ <tr><td><strong>Glue ETL Jobs</strong></td><td>Spark-based transformation jobs (Python/Scala)</td></tr>
125
+ <tr><td><strong>Glue DataBrew</strong></td><td>No-code visual data preparation (250+ pre-built transforms)</td></tr>
126
+ <tr><td><strong>Glue Studio</strong></td><td>Visual ETL job builder (drag-and-drop)</td></tr>
127
+ </tbody>
128
+ </table>
129
+
130
+ <blockquote>
131
+ <p><strong>Exam tip:</strong> The <strong>Glue Data Catalog</strong> is the shared metadata store for Athena, EMR, and Redshift Spectrum. When a question asks about "centralized schema management" → Glue Data Catalog. When it asks about "no-code data cleaning" → Glue DataBrew.</p>
132
+ </blockquote>
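To make "schema discovery" concrete, here is a toy schema inference over CSV text using only the standard library. It is a conceptual sketch and does not reproduce how Glue Crawlers actually classify data; the function names, type labels, and sample data are assumptions for illustration.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type (int, then double, then string) that fits every value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Read CSV text with a header row and return {column_name: inferred_type}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {col: infer_type([row[col] for row in rows]) for col in reader.fieldnames}


# Made-up sample file contents.
sample = "sensor_id,temperature,site\n1,21.5,hanoi\n2,19.0,hue\n"
print(infer_schema(sample))
```

The resulting name-to-type mapping is the kind of metadata a catalog stores so that query engines can read the files without redefining the schema.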
133
+
134
+ <h2 id="lake-formation"><strong>5. AWS Lake Formation</strong></h2>
135
+
136
+ <p><strong>Lake Formation</strong> builds on S3 and Glue to manage <strong>data lake security and governance</strong>. Key feature: column-level and row-level access control.</p>
137
+
138
+ <pre><code class="language-text">Lake Formation Architecture:
139
+
140
+ IAM Users ──→ Lake Formation ──→ S3 Data Lake
141
+ IAM Roles (Security (Raw/Processed)
142
+ & Governance)
143
+
144
+ Column/Row
145
+ Level Access
146
+ Control
147
+ </code></pre>
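Column-level and row-level access control can be pictured as two filters applied before data reaches the caller. The sketch below models that idea in plain Python; the grant structure and `apply_grant` are hypothetical and are not the Lake Formation API or its permission model.

```python
# Hypothetical grant: which columns a principal may see, and which rows.
analyst_grant = {
    "columns": ["customer_id", "country", "spend"],
    "row_filter": lambda row: row["country"] == "VN",
}

# Made-up table contents.
table = [
    {"customer_id": 1, "country": "VN", "spend": 120.0, "email": "a@example.com"},
    {"customer_id": 2, "country": "US", "spend": 80.0, "email": "b@example.com"},
    {"customer_id": 3, "country": "VN", "spend": 45.5, "email": "c@example.com"},
]


def apply_grant(rows, grant):
    """Return only the permitted rows, each trimmed to the permitted columns."""
    return [
        {col: row[col] for col in grant["columns"]}
        for row in rows
        if grant["row_filter"](row)
    ]


visible = apply_grant(table, analyst_grant)
print(visible)
```

Here the analyst sees two of the three rows and never sees the `email` column, which is the effect the diagram describes.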
148
+
149
+ <h2 id="cheat-sheet"><strong>6. Cheat Sheet — Data Ingestion Services</strong></h2>
150
+
151
+ <table>
152
+ <thead><tr><th>Scenario</th><th>Service</th></tr></thead>
153
+ <tbody>
154
+ <tr><td>Streaming → S3 with no code</td><td>Kinesis Data Firehose</td></tr>
155
+ <tr><td>Real-time processing with custom logic</td><td>Kinesis Data Streams</td></tr>
156
+ <tr><td>SQL on streaming data</td><td>Kinesis Data Analytics (Flink)</td></tr>
157
+ <tr><td>Batch ETL Spark-based</td><td>AWS Glue ETL Jobs</td></tr>
158
+ <tr><td>No-code visual data prep</td><td>Glue DataBrew</td></tr>
159
+ <tr><td>Schema discovery from S3</td><td>Glue Crawlers + Data Catalog</td></tr>
160
+ <tr><td>SQL queries on S3</td><td>Amazon Athena</td></tr>
161
+ <tr><td>Data lake governance</td><td>AWS Lake Formation</td></tr>
162
+ <tr><td>Large-scale Spark/Hadoop</td><td>Amazon EMR</td></tr>
163
+ </tbody>
164
+ </table>
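For self-testing, the cheat sheet above can be kept as a small lookup table. This is just a study aid sketched under the assumption that scenarios are matched by exact phrase; the keys are paraphrases of the table rows and `pick_service` is a hypothetical helper.

```python
# Scenario phrases mapped to the service named in the cheat sheet above.
CHEAT_SHEET = {
    "streaming to s3, no code": "Kinesis Data Firehose",
    "real-time custom processing": "Kinesis Data Streams",
    "sql on streaming data": "Kinesis Data Analytics (Flink)",
    "batch spark etl": "AWS Glue ETL Jobs",
    "no-code visual data prep": "Glue DataBrew",
    "schema discovery from s3": "Glue Crawlers + Data Catalog",
    "sql queries on s3": "Amazon Athena",
    "data lake governance": "AWS Lake Formation",
    "large-scale spark/hadoop": "Amazon EMR",
}


def pick_service(scenario):
    """Look up a scenario (case-insensitive); return None when it is not in the sheet."""
    return CHEAT_SHEET.get(scenario.strip().lower())


print(pick_service("SQL queries on S3"))
```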
165
+
166
+ <h2 id="practice"><strong>7. Practice Questions</strong></h2>
167
+
168
+ <p><strong>Q1:</strong> A company wants to ingest IoT sensor data into Amazon S3 for ML training. The data arrives continuously and no custom processing is required. Which service is the MOST cost-effective?</p>
169
+ <ul>
170
+ <li>A) Amazon Kinesis Data Streams with a Lambda consumer</li>
171
+ <li>B) Amazon Kinesis Data Firehose ✓</li>
172
+ <li>C) Amazon EMR with Spark Streaming</li>
173
+ <li>D) AWS Glue ETL jobs on a schedule</li>
174
+ </ul>
175
+ <p><em>Explanation: Kinesis Data Firehose is fully managed and requires no custom code — it directly delivers streaming data to S3, Redshift, or Elasticsearch. Data Streams requires custom consumers, EMR is heavy lift, and Glue is for batch ETL.</em></p>
176
+
177
+ <p><strong>Q2:</strong> A data engineer wants to query raw CSV files in S3 using SQL without loading them into a database. Which service should be used?</p>
178
+ <ul>
179
+ <li>A) Amazon RDS</li>
180
+ <li>B) Amazon DynamoDB</li>
181
+ <li>C) Amazon Athena ✓</li>
182
+ <li>D) Amazon Redshift</li>
183
+ </ul>
184
+ <p><em>Explanation: Amazon Athena is serverless and allows SQL queries directly on S3 data without loading. It reads files in-place and supports formats like CSV, Parquet, ORC, JSON.</em></p>
185
+
186
+ <p><strong>Q3:</strong> Which file format provides the BEST performance for columnar analytics queries on large ML datasets stored in Amazon S3?</p>
187
+ <ul>
188
+ <li>A) CSV</li>
189
+ <li>B) JSON</li>
190
+ <li>C) XML</li>
191
+ <li>D) Apache Parquet ✓</li>
192
+ </ul>
193
+ <p><em>Explanation: Parquet is a columnar format with excellent compression and predicate pushdown support. Columnar formats allow reading only the required columns, dramatically reducing I/O for analytical queries.</em></p>
@@ -0,0 +1,178 @@
1
+ ---
2
+ id: 621b7555-2901-469d-8b0b-a800506c8212
3
+ title: 'Lesson 2: Data Transformation & Feature Engineering'
4
+ slug: bai-2-data-transformation
5
+ description: >-
6
+ SageMaker Processing Jobs for data prep. SageMaker Feature Store.
7
+ Handling missing values, encoding, normalization, scaling.
8
+ Text preprocessing, imbalanced data techniques.
9
+ duration_minutes: 60
10
+ is_free: true
11
+ video_url: null
12
+ sort_order: 2
13
+ section_title: "Part 1: Data Engineering (20%)"
14
+ course:
15
+ id: 019c9619-lt02-7002-c002-lt0200000002
16
+ title: 'AWS Certified Machine Learning - Specialty Exam Prep'
17
+ slug: luyen-thi-aws-ml-specialty
18
+ ---
19
+
20
+ <h2 id="overview"><strong>1. Data Transformation in the ML Pipeline</strong></h2>
21
+
22
+ <p>Before a model is trained, raw data must go through several transformation steps. This is where the well-known saying <em>"Garbage in, garbage out"</em> applies. The MLS-C01 exam often asks about data processing techniques and the appropriate tools.</p>
23
+
24
+ <h2 id="processing-jobs"><strong>2. SageMaker Processing Jobs</strong></h2>
25
+
26
+ <p><strong>SageMaker Processing Jobs</strong> is a managed service for running data processing scripts (Python, Spark) on ephemeral compute clusters.</p>
27
+
28
+ <table>
29
+ <thead><tr><th>Processor Type</th><th>Framework</th><th>Use Case</th></tr></thead>
30
+ <tbody>
31
+ <tr><td><strong>ScriptProcessor</strong></td><td>Custom Docker container</td><td>Any custom script</td></tr>
32
+ <tr><td><strong>SKLearnProcessor</strong></td><td>scikit-learn</td><td>Classic ML preprocessing</td></tr>
33
+ <tr><td><strong>PySparkProcessor</strong></td><td>Apache Spark</td><td>Large-scale distributed processing</td></tr>
34
+ <tr><td><strong>FrameworkProcessor</strong></td><td>TensorFlow/PyTorch</td><td>Deep learning data prep</td></tr>
35
+ </tbody>
36
+ </table>
37
+
38
+ <pre><code class="language-text">SageMaker Processing Job Flow:
39
+
40
+ S3 (input data)
41
+
42
+ ┌─────────────────────┐
43
+ │ Processing Job │
44
+ │ (compute cluster) │
45
+ │ │
46
+ │ - Preprocess data │
47
+ │ - Feature engineer │
48
+ │ - Split train/test │
49
+ └─────────────────────┘
50
+
51
+ S3 (output: train/, validation/, test/)
52
+ </code></pre>
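The "split train/test" step in the flow above could look like the following inside a processing script. This is a minimal standard-library sketch; the 70/15/15 ratio, the seed, and the name `split_dataset` are assumptions, and reading from or writing to S3 is left out.

```python
import random


def split_dataset(records, train=0.7, validation=0.15, seed=42):
    """Shuffle records reproducibly and cut them into train/validation/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains after the two cuts
    }


# Stand-in for 100 input records.
splits = split_dataset(range(100))
print({name: len(part) for name, part in splits.items()})
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models trained on the same data.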
53
+
54
+ <h2 id="missing-values"><strong>3. Handling Missing Values</strong></h2>
55
+
56
+ <table>
57
+ <thead><tr><th>Strategy</th><th>Method</th><th>When to Use</th></tr></thead>
58
+ <tbody>
59
+ <tr><td><strong>Deletion</strong></td><td>Drop rows/columns</td><td>MCAR, little missing data (&lt;5%)</td></tr>
60
+ <tr><td><strong>Mean/Median Imputation</strong></td><td>Fill with the mean or median</td><td>Numeric, MCAR/MAR</td></tr>
61
+ <tr><td><strong>Mode Imputation</strong></td><td>Fill with the most frequent value</td><td>Categorical</td></tr>
62
+ <tr><td><strong>KNN Imputation</strong></td><td>Use the K nearest neighbors</td><td>Patterns in data, dataset not too large</td></tr>
63
+ <tr><td><strong>Model-based (MICE)</strong></td><td>Multiple imputation</td><td>Complex missingness patterns</td></tr>
64
+ <tr><td><strong>Indicator Feature</strong></td><td>Add an is_missing column</td><td>When missingness itself is informative</td></tr>
65
+ </tbody>
66
+ </table>
67
+
68
+ <blockquote>
69
+ <p><strong>Exam tip:</strong> Three types of missing data: <strong>MCAR</strong> (Missing Completely At Random), where deletion is safe; <strong>MAR</strong> (Missing At Random), where imputation is appropriate; <strong>MNAR</strong> (Missing Not At Random), which needs an indicator feature or domain knowledge.</p>
70
+ </blockquote>
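Two of the strategies in the table, median imputation and an indicator feature, are often combined. The sketch below shows that combination on a made-up numeric column using only the standard library; `impute_median_with_indicator` is a hypothetical helper name.

```python
import statistics


def impute_median_with_indicator(values):
    """Replace None with the median of observed values and flag imputed positions."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    is_missing = [1 if v is None else 0 for v in values]
    return filled, is_missing


# Made-up ages with two missing entries.
ages = [25, None, 40, 31, None, 58]
filled, flags = impute_median_with_indicator(ages)
print(filled)  # [25, 35.5, 40, 31, 35.5, 58]
print(flags)   # [0, 1, 0, 0, 1, 0]
```

Keeping the `is_missing` flags lets a model still use the fact that a value was absent, which matters when the data is MNAR.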
71
+
72
+ <h2 id="encoding"><strong>4. Categorical Encoding</strong></h2>
73
+
74
+ <table>
75
+ <thead><tr><th>Encoding</th><th>Method</th><th>When to Use</th><th>Issues</th></tr></thead>
76
+ <tbody>
77
+ <tr><td><strong>One-Hot Encoding</strong></td><td>One binary column per category</td><td>Nominal (no order), few categories</td><td>High cardinality → curse of dimensionality</td></tr>
78
+ <tr><td><strong>Label Encoding</strong></td><td>0, 1, 2, 3...</td><td>Ordinal (ordered)</td><td>Implies false order for nominal</td></tr>
79
+ <tr><td><strong>Target Encoding</strong></td><td>Mean of target per category</td><td>High cardinality nominal</td><td>Data leakage risk if not done carefully</td></tr>
80
+ <tr><td><strong>Embeddings</strong></td><td>Dense vector representation</td><td>Text, high cardinality</td><td>Needs enough data to learn</td></tr>
81
+ </tbody>
82
+ </table>
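The first three encodings in the table can be compared side by side on a tiny made-up column. This is a plain-Python sketch for illustration; the data and variable names are assumptions, and a real pipeline would more likely use a library encoder.

```python
from collections import defaultdict

colors = ["red", "green", "blue", "green", "red", "red"]
target = [1, 0, 0, 1, 1, 0]  # hypothetical binary label per row

categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: one integer per category (implies an order, so best for ordinal data).
label_map = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Target encoding: mean of the target per category.
# To limit leakage, compute these means on training data only.
sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(colors, target):
    sums[c] += y
    counts[c] += 1
target_map = {cat: sums[cat] / counts[cat] for cat in categories}
target_encoded = [target_map[c] for c in colors]

print(label_encoded)
print(one_hot[0])
print(target_map)
```

Note how one-hot grows by one column per category, which is the dimensionality issue the table flags for high-cardinality features.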
83
+
84
+ <h2 id="scaling"><strong>5. Normalization & Scaling</strong></h2>
85
+
86
+ <table>
87
+ <thead><tr><th>Technique</th><th>Formula</th><th>Output Range</th><th>Best For</th></tr></thead>
88
+ <tbody>
89
+ <tr><td><strong>Min-Max Normalization</strong></td><td>(x - min) / (max - min)</td><td>[0, 1]</td><td>Neural networks, distance-based</td></tr>
90
+ <tr><td><strong>Standardization (Z-score)</strong></td><td>(x - mean) / std</td><td>Mean=0, SD=1</td><td>Linear models, SVM, PCA</td></tr>
91
+ <tr><td><strong>Robust Scaler</strong></td><td>(x - median) / IQR</td><td>Centered</td><td>Outliers present</td></tr>
92
+ <tr><td><strong>Log Transform</strong></td><td>log(x)</td><td>Compressed</td><td>Skewed distributions</td></tr>
93
+ </tbody>
94
+ </table>
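The four formulas in the table can be written out directly. The sketch below uses only the standard library and a made-up column with one outlier; it assumes the population standard deviation for the Z-score and the "inclusive" quartile method for the IQR, both of which are choices rather than the only valid definitions.

```python
import math
import statistics


def min_max(xs):
    """(x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """(x - mean) / std, using the population standard deviation."""
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def robust(xs):
    """(x - median) / IQR, with quartiles from the inclusive method."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


def log_transform(xs):
    """Natural log of each value; every x must be > 0."""
    return [math.log(x) for x in xs]


incomes = [20, 25, 30, 35, 1000]  # made-up data with one extreme outlier
print(min_max(incomes))
print(robust(incomes))
```

With the outlier present, min-max squeezes the four ordinary values close to 0, while the robust scaler keeps them evenly spread, which is why the table recommends it when outliers are present.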
95
+
96
+ <h2 id="imbalanced"><strong>6. Handling Imbalanced Data</strong></h2>
97
+
98
+ <p>Class imbalance (e.g., fraud detection: 99% normal, 1% fraud) biases the model toward the majority class.</p>
99
+
100
+ <table>
101
+ <thead><tr><th>Technique</th><th>Method</th><th>Direction</th></tr></thead>
102
+ <tbody>
103
+ <tr><td><strong>Oversampling</strong></td><td>Duplicate minority class samples</td><td>↑ minority</td></tr>
104
+ <tr><td><strong>SMOTE</strong></td><td>Synthetic Minority Oversampling Technique — generate synthetic samples</td><td>↑ minority</td></tr>
105
+ <tr><td><strong>Undersampling</strong></td><td>Remove majority class samples</td><td>↓ majority</td></tr>
106
+ <tr><td><strong>Class Weights</strong></td><td>Penalize misclassification of minority more</td><td>No data change</td></tr>
107
+ <tr><td><strong>Ensemble Methods</strong></td><td>BalancedBagging, EasyEnsemble</td><td>Algorithm-level</td></tr>
108
+ </tbody>
109
+ </table>
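The difference between plain oversampling and SMOTE is that SMOTE interpolates between minority samples rather than copying them. The sketch below shows only that core interpolation idea in plain Python; it is a simplified stand-in, not the reference SMOTE implementation, and `smote_like` and its parameters are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Made-up minority-class points with two features each.
fraud = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_like(fraud, n_new=3)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.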
110
+
111
+ <blockquote>
112
+ <p><strong>Exam tip:</strong> Suitable metrics for imbalanced data: <strong>F1 Score, AUC-ROC, Precision-Recall</strong>. Do NOT use Accuracy (misleading). Amazon SageMaker Clarify can detect class imbalance.</p>
113
+ </blockquote>
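A short calculation shows why accuracy misleads here. The sketch below evaluates a model that always predicts the majority class on a hypothetical 98/2 split; the `evaluate` helper is an assumption written for this example.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical dataset: 98 negatives, 2 positives; the model never predicts positive.
y_true = [0] * 98 + [1] * 2
always_negative = [0] * 100
scores = evaluate(y_true, always_negative)
print(scores)
```

The model reaches 98% accuracy while its recall and F1 are both 0, meaning it catches no positive case at all.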
114
+
115
+ <h2 id="feature-store"><strong>7. SageMaker Feature Store</strong></h2>
116
+
117
+ <p><strong>SageMaker Feature Store</strong> is a centralized repository to store, share, and reuse ML features.</p>
118
+
119
+ <pre><code class="language-text">Feature Store Architecture:
120
+
121
+ Feature Groups
122
+ ┌──────────────────────────────┐
123
+ │ user_features │
124
+ │ ┌──────┬────────┬────────┐ │
125
+ │ │ id │ age │ recency│ │
126
+ │ └──────┴────────┴────────┘ │
127
+ └──────────────────────────────┘
128
+ ↓ writes ↑ reads
129
+ ┌──────────────────┐ ┌──────────────────┐
130
+ │ Offline Store │ │ Online Store │
131
+ │ (S3 - training) │ │ (DynamoDB - │
132
+ │ batch reads │ │ low-latency │
133
+ │ │ │ inference) │
134
+ └──────────────────┘ └──────────────────┘
135
+ </code></pre>
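The split between the two stores in the diagram can be modelled with a small in-memory class: every write is appended to an offline history, while the online side keeps only the latest record per id. This is a conceptual sketch; `ToyFeatureGroup` and its methods are hypothetical and are not the SageMaker SDK.

```python
class ToyFeatureGroup:
    """In-memory stand-in for a feature group with an online and an offline store."""

    def __init__(self, name, id_field):
        self.name = name
        self.id_field = id_field
        self.online = {}   # latest record per id, for low-latency lookups
        self.offline = []  # append-only history, for building training sets

    def put_record(self, record, event_time):
        """Append to the offline history and refresh the online copy if newer."""
        stamped = dict(record, event_time=event_time)
        self.offline.append(stamped)
        current = self.online.get(record[self.id_field])
        if current is None or event_time >= current["event_time"]:
            self.online[record[self.id_field]] = stamped

    def get_record(self, record_id):
        """Return the latest record for an id, or None if it was never written."""
        return self.online.get(record_id)


# Field names follow the user_features example in the diagram; values are made up.
users = ToyFeatureGroup("user_features", id_field="id")
users.put_record({"id": 7, "age": 30, "recency": 12}, event_time=1)
users.put_record({"id": 7, "age": 30, "recency": 3}, event_time=2)
print(users.get_record(7))
print(len(users.offline))
```

Training jobs would read the full history from the offline side, while an inference service would call something like `get_record` to fetch the current values, so both use the same feature definitions.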
136
+
137
+ <h2 id="cheat-sheet"><strong>8. Cheat Sheet — Feature Engineering</strong></h2>
138
+
139
+ <table>
140
+ <thead><tr><th>Problem</th><th>Solution</th></tr></thead>
141
+ <tbody>
142
+ <tr><td>High cardinality categorical</td><td>Target encoding or embeddings</td></tr>
143
+ <tr><td>Missing values (numeric)</td><td>Median imputation + indicator feature</td></tr>
144
+ <tr><td>Skewed distribution</td><td>Log transform or Box-Cox</td></tr>
145
+ <tr><td>Outliers</td><td>Robust Scaler or clip/winsorize</td></tr>
146
+ <tr><td>Imbalanced classes</td><td>SMOTE + class weights + AUC metric</td></tr>
147
+ <tr><td>Reuse features across teams</td><td>SageMaker Feature Store</td></tr>
148
+ </tbody>
149
+ </table>
150
+
151
+ <h2 id="practice"><strong>9. Practice Questions</strong></h2>
152
+
153
+ <p><strong>Q1:</strong> A dataset for fraud detection has 98% negative (non-fraud) and 2% positive (fraud) examples. Which metric is MOST appropriate to evaluate the model?</p>
154
+ <ul>
155
+ <li>A) Accuracy</li>
156
+ <li>B) R-squared</li>
157
+ <li>C) AUC-ROC ✓</li>
158
+ <li>D) Mean Absolute Error</li>
159
+ </ul>
160
+ <p><em>Explanation: Accuracy is misleading for imbalanced data (predicting all negative gives 98% accuracy). AUC-ROC measures the model's ability to distinguish classes across all thresholds, making it ideal for imbalanced classification.</em></p>
161
+
162
+ <p><strong>Q2:</strong> Which technique generates SYNTHETIC samples to address class imbalance?</p>
163
+ <ul>
164
+ <li>A) Random undersampling</li>
165
+ <li>B) SMOTE (Synthetic Minority Oversampling Technique) ✓</li>
166
+ <li>C) Class weighting</li>
167
+ <li>D) Feature scaling</li>
168
+ </ul>
169
+ <p><em>Explanation: SMOTE creates new synthetic samples for the minority class by interpolating between existing minority class examples, rather than just duplicating them.</em></p>
170
+
171
+ <p><strong>Q3:</strong> A company wants to share engineered features between their training pipeline and real-time inference service. Which SageMaker feature addresses this?</p>
172
+ <ul>
173
+ <li>A) SageMaker Processing Jobs</li>
174
+ <li>B) SageMaker Experiments</li>
175
+ <li>C) SageMaker Feature Store ✓</li>
176
+ <li>D) SageMaker Data Wrangler</li>
177
+ </ul>
178
+ <p><em>Explanation: SageMaker Feature Store provides both an offline store (S3, for batch training) and online store (DynamoDB-backed, for low-latency real-time inference), ensuring feature consistency between training and serving.</em></p>