@xdev-asia/xdev-knowledge-mcp 1.0.43 → 1.0.44

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. package/content/pages/xoa-du-lieu-nguoi-dung.md +68 -0
  2. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/01-bai-1-data-repositories-ingestion.md +5 -0
  3. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/02-bai-2-data-transformation.md +5 -0
  4. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/03-bai-3-data-analysis.md +159 -0
  5. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/04-bai-4-sagemaker-built-in-algorithms.md +186 -0
  6. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/05-bai-5-training-hyperparameter-tuning.md +159 -0
  7. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/06-bai-6-model-evaluation.md +169 -0
  8. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/07-bai-7-model-deployment.md +193 -0
  9. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/08-bai-8-model-monitoring-mlops.md +184 -0
  10. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/09-bai-9-security-cost.md +166 -0
  11. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/10-bai-10-bai-toan-thuong-gap.md +181 -0
  12. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/11-bai-11-cheat-sheet.md +110 -0
  13. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/12-bai-12-chien-luoc-thi.md +113 -0
  14. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/index.md +1 -1
  15. package/content/series/luyen-thi/luyen-thi-cka/index.md +217 -0
  16. package/content/series/luyen-thi/luyen-thi-ckad/index.md +199 -0
  17. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/01-bai-1-framing-ml-problems.md +136 -0
  18. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/02-bai-2-gcp-ai-ml-ecosystem.md +160 -0
  19. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/03-bai-3-data-pipeline.md +174 -0
  20. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/04-bai-4-feature-engineering.md +156 -0
  21. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/05-bai-5-vertex-ai-training.md +155 -0
  22. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/06-bai-6-bigquery-ml-tensorflow.md +141 -0
  23. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/07-bai-7-model-deployment.md +134 -0
  24. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/08-bai-8-vertex-ai-pipelines-mlops.md +149 -0
  25. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/09-bai-9-responsible-ai.md +128 -0
  26. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/10-bai-10-cheat-sheet-chien-luoc-thi.md +108 -0
  27. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/index.md +1 -1
  28. package/content/series/luyen-thi/luyen-thi-kcna/index.md +168 -0
  29. package/package.json +1 -1
@@ -0,0 +1,68 @@
1
+ ---
2
+ id: 019cb2b9-4dc7-72a5-956a-93f58cbac568
3
+ title: Xóa dữ liệu người dùng
4
+ slug: xoa-du-lieu-nguoi-dung
5
+ excerpt: Hướng dẫn yêu cầu xóa tài khoản và dữ liệu cá nhân trên xDev Asia.
6
+ featured_image: null
7
+ template: default
8
+ show_in_header: false
9
+ show_in_footer: true
10
+ sort_order: 12
11
+ meta:
12
+ meta_title: Xóa dữ liệu người dùng — xDev Asia
13
+ meta_description: Yêu cầu xóa tài khoản và toàn bộ dữ liệu cá nhân của bạn khỏi nền tảng xDev Asia.
14
+ published_at: '2026-04-05T00:00:00.000000Z'
15
+ ---
16
+
17
+ ## Xóa dữ liệu người dùng
18
+
19
+ Tại xDev Asia, chúng tôi tôn trọng quyền riêng tư và quyền kiểm soát dữ liệu cá nhân của bạn. Trang này hướng dẫn cách yêu cầu xóa tài khoản và toàn bộ dữ liệu liên quan.
20
+
21
+ ### Dữ liệu chúng tôi lưu trữ
22
+
23
+ Khi bạn sử dụng xDev Asia (website và ứng dụng di động), chúng tôi có thể lưu trữ:
24
+
25
+ - **Thông tin tài khoản:** Tên hiển thị, địa chỉ email, ảnh đại diện
26
+ - **Lịch sử học tập:** Bài học đã hoàn thành, tiến độ khóa học
27
+ - **Dữ liệu bookmark:** Bài viết và series đã lưu
28
+ - **Kết quả quiz:** Điểm số và lịch sử thi thử
29
+
30
+ ### Cách yêu cầu xóa dữ liệu
31
+
32
+ Bạn có thể yêu cầu xóa toàn bộ dữ liệu cá nhân theo một trong các cách sau:
33
+
34
+ **Cách 1: Qua email**
35
+
36
+ Gửi email đến **<admin@xdev.asia>** với tiêu đề **"Yêu cầu xóa tài khoản"** và nội dung bao gồm:
37
+
38
+ - Địa chỉ email đã đăng ký
39
+ - Tên hiển thị trên tài khoản
40
+ - Xác nhận bạn muốn xóa toàn bộ dữ liệu
41
+
42
+ **Cách 2: Qua GitHub**
43
+
44
+ Tạo một issue tại [github.com/xdev-asia-labs](https://github.com/xdev-asia-labs) với tiêu đề **"Data Deletion Request"**.
45
+
46
+ ### Thời gian xử lý
47
+
48
+ Chúng tôi sẽ xử lý yêu cầu trong vòng **30 ngày** kể từ khi nhận được. Sau khi hoàn tất, bạn sẽ nhận được thông báo qua email xác nhận dữ liệu đã được xóa.
49
+
50
+ ### Dữ liệu được xóa
51
+
52
+ Khi yêu cầu được chấp thuận, chúng tôi sẽ xóa:
53
+
54
+ - Tài khoản và thông tin cá nhân
55
+ - Lịch sử học tập và tiến độ
56
+ - Bookmark và dữ liệu tùy chỉnh
57
+ - Kết quả quiz và điểm số
58
+
59
+ **Lưu ý:** Một số dữ liệu đã được ẩn danh hóa và tổng hợp (ví dụ: thống kê lượt xem) có thể không bị xóa vì chúng không thể được liên kết lại với bạn.
60
+
61
+ ### Liên hệ
62
+
63
+ Nếu bạn có câu hỏi về quyền riêng tư hoặc xử lý dữ liệu, vui lòng liên hệ:
66
+
67
+ - **Email:** <admin@xdev.asia>
68
+ - **Chính sách bảo mật:** [xdev.asia/pages/chinh-sach-quyen-rieng-tu/](/pages/chinh-sach-quyen-rieng-tu/)
@@ -17,6 +17,11 @@ course:
17
17
  slug: luyen-thi-aws-ml-specialty
18
18
  ---
19
19
 
20
+ <div style="text-align: center; margin: 2rem 0;">
21
+ <img src="/storage/uploads/2026/04/aws-mls-bai1-data-ingestion.png" alt="AWS ML Data Repositories & Ingestion" style="max-width: 800px; width: 100%; border-radius: 12px;" />
22
+ <p><em>Data Repositories & Ingestion: S3, Kinesis, Glue và Lake Formation trong ML pipeline</em></p>
23
+ </div>
24
+
20
25
  <h2 id="overview"><strong>1. Tổng quan Data Engineering trong MLS-C01</strong></h2>
21
26
 
22
27
  <p>Domain Data Engineering chiếm <strong>20% đề thi MLS-C01</strong>. Đây là phần bắt buộc phải nắm vững — đề thi thường hỏi "Which service should be used to ingest/store/transform data for ML?"</p>
@@ -17,6 +17,11 @@ course:
17
17
  slug: luyen-thi-aws-ml-specialty
18
18
  ---
19
19
 
20
+ <div style="text-align: center; margin: 2rem 0;">
21
+ <img src="/storage/uploads/2026/04/aws-mls-bai2-feature-engineering.png" alt="AWS ML Data Transformation Pipeline" style="max-width: 800px; width: 100%; border-radius: 12px;" />
22
+ <p><em>Feature Engineering & Data Transformation: Glue, SageMaker Data Wrangler, và xử lý missing values</em></p>
23
+ </div>
24
+
20
25
  <h2 id="overview"><strong>1. Data Transformation trong ML Pipeline</strong></h2>
21
26
 
22
27
  <p>Trước khi train model, raw data phải qua nhiều bước transformation. Đây là nguồn gốc của câu nói nổi tiếng: <em>"Garbage in, garbage out"</em>. Đề thi MLS-C01 thường hỏi kỹ thuật xử lý data và tools phù hợp.</p>
@@ -0,0 +1,159 @@
1
+ ---
2
+ id: 1a81b42d-c09e-43ef-b9f6-3158ca64b6c1
3
+ title: 'Bài 3: Data Analysis & Visualization'
4
+ slug: bai-3-data-analysis
5
+ description: >-
6
+ EDA trên SageMaker notebooks. Amazon Athena cho SQL analytics.
7
+ Amazon QuickSight cho BI dashboards. Phát hiện data quality issues.
8
+ Detect class imbalance, outliers, correlations, data drift.
9
+ duration_minutes: 45
10
+ is_free: true
11
+ video_url: null
12
+ sort_order: 3
13
+ section_title: "Phần 1: Data Engineering (20%)"
14
+ course:
15
+ id: 019c9619-lt02-7002-c002-lt0200000002
16
+ title: 'Luyện thi AWS Certified Machine Learning - Specialty'
17
+ slug: luyen-thi-aws-ml-specialty
18
+ ---
19
+
20
+ <div style="text-align: center; margin: 2rem 0;">
21
+ <img src="/storage/uploads/2026/04/aws-mls-bai3-eda-data-analysis.png" alt="Exploratory Data Analysis trên AWS" style="max-width: 800px; width: 100%; border-radius: 12px;" />
22
+ <p><em>EDA & Data Analysis: thống kê mô tả, phát hiện outliers, feature correlation trên AWS</em></p>
23
+ </div>
24
+
25
+ <h2 id="eda"><strong>1. Exploratory Data Analysis (EDA)</strong></h2>
26
+
27
+ <p><strong>EDA</strong> là bước phân tích dữ liệu ban đầu để hiểu structure, patterns, và anomalies trước khi modeling. SageMaker cung cấp nhiều tools để thực hiện EDA ở scale lớn.</p>
28
+
29
+ <h2 id="eda-tools"><strong>2. AWS Tools cho Data Analysis</strong></h2>
30
+
31
+ <table>
32
+ <thead><tr><th>Tool</th><th>Use Case</th><th>Interface</th></tr></thead>
33
+ <tbody>
34
+ <tr><td><strong>SageMaker Studio Notebooks</strong></td><td>Interactive EDA, Python/R analysis</td><td>JupyterLab-based IDE</td></tr>
35
+ <tr><td><strong>SageMaker Data Wrangler</strong></td><td>Visual data prep, 300+ transforms, auto-insights</td><td>Drag-and-drop GUI</td></tr>
36
+ <tr><td><strong>Amazon Athena</strong></td><td>SQL queries on S3 data</td><td>SQL console</td></tr>
37
+ <tr><td><strong>Amazon QuickSight</strong></td><td>BI dashboards, executive reports</td><td>Visual BI tool</td></tr>
38
+ <tr><td><strong>Amazon Redshift</strong></td><td>Large-scale data warehousing, SQL analytics</td><td>SQL</td></tr>
39
+ <tr><td><strong>AWS Glue DataBrew</strong></td><td>No-code data profiling và cleaning recipes</td><td>Visual tool</td></tr>
40
+ </tbody>
41
+ </table>
42
+
43
+ <blockquote>
44
+ <p><strong>Exam tip:</strong> <strong>Data Wrangler</strong> = visual data prep cho ML (generates SageMaker Processing code). <strong>DataBrew</strong> = data analyst/BI (no ML context). <strong>QuickSight</strong> = BI dashboards for business users, không phải ML.</p>
45
+ </blockquote>
46
+
47
+ <h2 id="data-quality"><strong>3. Data Quality Issues</strong></h2>
48
+
49
+ <p>Đề thi thường hỏi về nhận biết và xử lý các vấn đề chất lượng data phổ biến.</p>
50
+
51
+ <table>
52
+ <thead><tr><th>Issue</th><th>Detection Method</th><th>Impact on Model</th></tr></thead>
53
+ <tbody>
54
+ <tr><td><strong>Missing Values</strong></td><td>Null counts, missing rate per column</td><td>Errors, biased results</td></tr>
55
+ <tr><td><strong>Outliers</strong></td><td>Box plots, Z-score > 3, IQR method</td><td>Skewed weights, poor generalization</td></tr>
56
+ <tr><td><strong>Class Imbalance</strong></td><td>Class distribution histogram</td><td>Biased toward majority class</td></tr>
57
+ <tr><td><strong>Feature Correlation</strong></td><td>Correlation matrix, VIF score</td><td>Multicollinearity → unstable coefficients</td></tr>
58
+ <tr><td><strong>Data Leakage</strong></td><td>Features with suspiciously high correlation to target</td><td>Over-optimistic eval, fails in production</td></tr>
59
+ <tr><td><strong>Distribution Skew</strong></td><td>Histogram, skewness metric</td><td>Violated model assumptions</td></tr>
60
+ </tbody>
61
+ </table>
62
+
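Các detection methods trong bảng (Z-score, IQR) có thể minh họa bằng vài dòng NumPy — một sketch tối giản với dữ liệu giả định, không phải code mẫu của đề thi:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag các điểm có |z-score| vượt ngưỡng."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x):
    """Flag các điểm nằm ngoài [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

data = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 9.9, 85.0])  # một spike rõ rệt
print(np.where(iqr_outliers(data))[0])  # → [6]
```

Lưu ý: với sample nhỏ, chính outlier kéo std lên (masking effect) nên Z-score có thể bỏ sót, trong khi IQR robust hơn.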
63
+ <h3 id="data-leakage"><strong>3.1. Data Leakage — Critical Concept</strong></h3>
64
+
65
+ <p><strong>Data leakage</strong> là khi information từ outside the training set rò rỉ vào features, khiến model có accuracy cao trong training nhưng thất bại khi production.</p>
66
+
67
+ <pre><code class="language-text">Common Data Leakage Patterns:
68
+
69
+ ❌ Target leakage:
70
+ Feature "loan_default_flag" → predicting "credit_risk"
71
+ (feature derived from target)
72
+
73
+ ❌ Future data leakage:
74
+ Using tomorrow's stock price to predict today's trade
75
+
76
+ ❌ Train/test contamination:
77
+ Scaling data BEFORE splitting (test mean leaks into train)
78
+
79
+ ✅ Correct approach:
80
+ Split data FIRST → fit scaler on train only → transform both
81
+ </code></pre>
82
+
83
+ <blockquote>
84
+ <p><strong>Exam tip:</strong> Always <strong>split before transforming</strong>. StandardScaler.fit() chỉ được gọi trên training set. Sau đó transform() trên cả train và test. Fit+transform trên toàn bộ dataset là data leakage.</p>
85
+ </blockquote>
86
+
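Quy tắc "split trước, fit sau" ở trên có thể sketch bằng NumPy thuần (ví dụ tối giản; dữ liệu và tên biến là giả định):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# 1. Split TRƯỚC (80/20) — trước khi tính bất kỳ statistics nào
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# 2. Fit scaler CHỈ trên training split
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# 3. Transform cả hai splits bằng statistics của train
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

# Train split được standardize chính xác; test split chỉ gần zero-mean —
# và đó là điều đúng: statistics của test không hề ảnh hưởng transformation.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0, atol=1e-9))  # → True
```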
87
+ <h2 id="athena"><strong>4. Amazon Athena</strong></h2>
88
+
89
+ <p>Athena cho phép chạy SQL queries directly trên S3 without loading data vào database. <strong>Pay per scan</strong> — tối ưu bằng cách dùng Parquet + partitioning.</p>
90
+
91
+ <pre><code class="language-text">Cost Optimization Tips:
92
+ ┌────────────────────────────────────────────────┐
93
+ │ Partition data by date/region/category: │
94
+ │ s3://bucket/data/year=2024/month=01/ │
95
+ │ → Query chỉ scan the required partitions │
96
+ │ │
97
+ │ Use columnar formats (Parquet/ORC): │
98
+ │ → Read only needed columns │
99
+ │ │
100
+ │ Compress data (Snappy, Gzip): │
101
+ │ → Reduce scan size → reduce cost │
102
+ └────────────────────────────────────────────────┘
103
+ </code></pre>
104
+
105
+ <h2 id="quicksight"><strong>5. Amazon QuickSight</strong></h2>
106
+
107
+ <p>QuickSight là <strong>BI service</strong>, không phải ML tool. Key feature: <strong>SPICE</strong> (in-memory engine) cho fast dashboards.</p>
108
+
109
+ <table>
110
+ <thead><tr><th>Feature</th><th>Description</th></tr></thead>
111
+ <tbody>
112
+ <tr><td><strong>SPICE</strong></td><td>Super-fast Parallel In-memory Calculation Engine — cached dataset</td></tr>
113
+ <tr><td><strong>ML Insights</strong></td><td>Built-in anomaly detection, forecasting trên dashboards</td></tr>
114
+ <tr><td><strong>Q (NLQ)</strong></td><td>Natural language queries — "show me sales by region last month"</td></tr>
115
+ </tbody>
116
+ </table>
117
+
118
+ <h2 id="cheat-sheet"><strong>6. Cheat Sheet — Analysis Tools</strong></h2>
119
+
120
+ <table>
121
+ <thead><tr><th>Scenario</th><th>Tool</th></tr></thead>
122
+ <tbody>
123
+ <tr><td>Interactive Python EDA on large data</td><td>SageMaker Studio Notebooks</td></tr>
124
+ <tr><td>Visual no-code ML data prep</td><td>SageMaker Data Wrangler</td></tr>
125
+ <tr><td>SQL on S3 data (serverless)</td><td>Amazon Athena</td></tr>
126
+ <tr><td>Business dashboards và reporting</td><td>Amazon QuickSight</td></tr>
127
+ <tr><td>Large data warehouse SQL</td><td>Amazon Redshift</td></tr>
128
+ <tr><td>No-code data profiling recipes</td><td>AWS Glue DataBrew</td></tr>
129
+ </tbody>
130
+ </table>
131
+
132
+ <h2 id="practice"><strong>7. Practice Questions</strong></h2>
133
+
134
+ <p><strong>Q1:</strong> A data scientist standardized features using the mean and standard deviation of the ENTIRE dataset before splitting into train/test sets. What problem does this cause?</p>
135
+ <ul>
136
+ <li>A) Model underfitting</li>
137
+ <li>B) Slow training convergence</li>
138
+ <li>C) Data leakage from test set statistics into training ✓</li>
139
+ <li>D) Class imbalance</li>
140
+ </ul>
141
+ <p><em>Explanation: Fitting a scaler on the entire dataset causes data leakage — the test set statistics (mean, std) influence the training data transformation. Always fit transformers on training data only, then apply the fitted transformer to both train and test sets.</em></p>
142
+
143
+ <p><strong>Q2:</strong> A business analyst needs to create executive dashboards from S3 data with fast interactive visualizations. Which AWS service is BEST suited?</p>
144
+ <ul>
145
+ <li>A) Amazon SageMaker Studio</li>
146
+ <li>B) Amazon Athena</li>
147
+ <li>C) Amazon QuickSight ✓</li>
148
+ <li>D) AWS Glue DataBrew</li>
149
+ </ul>
150
+ <p><em>Explanation: Amazon QuickSight is the AWS BI service designed for business dashboards and visualizations with SPICE in-memory engine for fast interactive queries. SageMaker Studio is for ML development, Athena is SQL querying, DataBrew is data preparation.</em></p>
151
+
152
+ <p><strong>Q3:</strong> A model trained on customer churn data has 99% training accuracy but performs poorly on production data. Investigation shows "days_since_last_call" is more predictive than expected. What is the MOST likely cause?</p>
153
+ <ul>
154
+ <li>A) Overfitting due to too many features</li>
155
+ <li>B) Underfitting due to low model complexity</li>
156
+ <li>C) Data leakage — the feature is derived from post-churn activity ✓</li>
157
+ <li>D) Class imbalance</li>
158
+ </ul>
159
+ <p><em>Explanation: This is classic target leakage — "days_since_last_call" may reflect churn behavior after the fact (customers call to cancel). This future information isn't available in production, causing the model to fail.</em></p>
@@ -0,0 +1,186 @@
1
+ ---
2
+ id: 8d704042-9cc5-478e-b198-d80ea70c22c5
3
+ title: 'Bài 4: SageMaker Built-in Algorithms'
4
+ slug: bai-4-sagemaker-built-in-algorithms
5
+ description: >-
6
+ XGBoost, Linear Learner, Random Cut Forest, K-Means, KNN.
7
+ BlazingText, Seq2Seq, DeepAR, Object Detection, Semantic Segmentation.
8
+ Khi nào dùng algorithm nào — decision table chi tiết.
9
+ duration_minutes: 90
10
+ is_free: true
11
+ video_url: null
12
+ sort_order: 4
13
+ section_title: "Phần 2: Modeling (36%)"
14
+ course:
15
+ id: 019c9619-lt02-7002-c002-lt0200000002
16
+ title: 'Luyện thi AWS Certified Machine Learning - Specialty'
17
+ slug: luyen-thi-aws-ml-specialty
18
+ ---
19
+
20
+ <div style="text-align: center; margin: 2rem 0;">
21
+ <img src="/storage/uploads/2026/04/aws-mls-bai4-sagemaker-algorithms.png" alt="SageMaker Built-in Algorithms" style="max-width: 800px; width: 100%; border-radius: 12px;" />
22
+ <p><em>SageMaker Built-in Algorithms: từ XGBoost, Linear Learner đến DeepAR và Image Classification</em></p>
23
+ </div>
24
+
25
+ <h2 id="overview"><strong>1. SageMaker Built-in Algorithms Overview</strong></h2>
26
+
27
+ <p>SageMaker cung cấp 18+ <strong>built-in algorithms</strong> được optimize để chạy distributed trên AWS infrastructure. Đây là topic <strong>cực kỳ quan trọng</strong> trong MLS-C01 — thường chiếm 8-12 câu.</p>
28
+
29
+ <blockquote>
30
+ <p><strong>Exam tip:</strong> Học thuộc bảng "Problem Type → Algorithm". Đề thi luôn cho scenario và hỏi algorithm phù hợp. Key patterns: time series → DeepAR; anomaly → Random Cut Forest; NLP classification → BlazingText; tabular → XGBoost.</p>
31
+ </blockquote>
32
+
33
+ <h2 id="supervised-table"><strong>2. Supervised Learning Algorithms</strong></h2>
34
+
35
+ <table>
36
+ <thead><tr><th>Algorithm</th><th>Problem Type</th><th>Input</th><th>Key Trait</th></tr></thead>
37
+ <tbody>
38
+ <tr><td><strong>XGBoost</strong></td><td>Classification, Regression</td><td>Tabular (CSV/LibSVM)</td><td>Top performer cho tabular data, gradient boosting</td></tr>
39
+ <tr><td><strong>Linear Learner</strong></td><td>Binary/Multiclass classification, Regression</td><td>RecordIO, CSV</td><td>Fast, scalable, regularization built-in</td></tr>
40
+ <tr><td><strong>Factorization Machines</strong></td><td>Binary classification, Regression</td><td>RecordIO-protobuf (sparse)</td><td>Sparse data, recommendation systems, CTR prediction</td></tr>
41
+ <tr><td><strong>KNN (k-Nearest Neighbors)</strong></td><td>Classification, Regression</td><td>RecordIO-protobuf</td><td>Instance-based, no training, lazy learner</td></tr>
42
+ <tr><td><strong>DeepAR</strong></td><td>Time series forecasting</td><td>JSON Lines</td><td>Multiple related time series, probabilistic forecasts</td></tr>
43
+ <tr><td><strong>Object2Vec</strong></td><td>Embeddings</td><td>Paired sequences</td><td>Learn embeddings cho words, products, users</td></tr>
44
+ </tbody>
45
+ </table>
46
+
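DeepAR nhận input dạng JSON Lines — mỗi dòng một time series với field "start", "target", và tùy chọn "cat"/"dynamic_feat". Một sketch tạo input như vậy (số liệu demand là giả định):

```python
import json

# Hai series liên quan (vd: daily demand của 2 products); giá trị bịa ra để minh họa.
series = [
    {"start": "2024-01-01 00:00:00", "target": [12, 15, 14, 20, 18], "cat": [0]},
    {"start": "2024-01-01 00:00:00", "target": [3, 4, 2, 5, 6], "cat": [1]},
]

# JSON Lines: mỗi dòng một JSON object; chuỗi này sẽ được upload lên S3
# làm training channel cho một DeepAR training job.
jsonl = "\n".join(json.dumps(s) for s in series)
print(len(jsonl.splitlines()))  # → 2
```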
47
+ <h2 id="nlp-algorithms"><strong>3. NLP Algorithms</strong></h2>
48
+
49
+ <table>
50
+ <thead><tr><th>Algorithm</th><th>Output</th><th>Use Case</th></tr></thead>
51
+ <tbody>
52
+ <tr><td><strong>BlazingText</strong></td><td>Word vectors hoặc text classification</td><td>Sentiment analysis, spam detection, entity classification</td></tr>
53
+ <tr><td><strong>Seq2Seq</strong></td><td>Sequence → Sequence</td><td>Machine translation, summarization, Q&amp;A</td></tr>
54
+ <tr><td><strong>LDA (Latent Dirichlet Allocation)</strong></td><td>Topics per document</td><td>Topic modeling, document categorization</td></tr>
55
+ <tr><td><strong>NTM (Neural Topic Model)</strong></td><td>Latent representations</td><td>Topic modeling với neural networks</td></tr>
56
+ </tbody>
57
+ </table>
58
+
59
+ <blockquote>
60
+ <p><strong>Exam tip:</strong> <strong>BlazingText</strong> có 2 modes: (1) <code>Word2Vec</code> mode — unsupervised, generates word embeddings; (2) <code>Text Classification</code> mode — supervised, like FastText. Phân biệt rõ khi đọc câu hỏi.</p>
61
+ </blockquote>
62
+
63
+ <h2 id="unsupervised-algorithms"><strong>4. Unsupervised Learning Algorithms</strong></h2>
64
+
65
+ <table>
66
+ <thead><tr><th>Algorithm</th><th>Problem Type</th><th>Use Case</th></tr></thead>
67
+ <tbody>
68
+ <tr><td><strong>K-Means</strong></td><td>Clustering</td><td>Customer segmentation, document grouping</td></tr>
69
+ <tr><td><strong>PCA (Principal Component Analysis)</strong></td><td>Dimensionality reduction</td><td>High-dimensional data, feature compression</td></tr>
70
+ <tr><td><strong>Random Cut Forest (RCF)</strong></td><td>Anomaly detection</td><td>Fraud detection, IoT anomaly, time series anomaly</td></tr>
71
+ <tr><td><strong>IP Insights</strong></td><td>Anomaly detection</td><td>Detect unusual IP-entity relationships, security</td></tr>
72
+ </tbody>
73
+ </table>
74
+
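Bản chất dimensionality reduction của PCA có thể sketch bằng SVD trong NumPy (minh họa tối giản; built-in PCA của SageMaker làm việc này ở scale lớn hơn nhiều):

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 samples trong 5 chiều, nhưng signal thực sự chỉ nằm trong ~2 hướng
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.01 * rng.normal(size=(200, 5))

# Center dữ liệu, rồi lấy top-k right singular vectors làm principal components
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

k = 2
X_reduced = X_centered @ components[:k].T  # project 5-D xuống 2-D

explained = (singular_values**2)[:k].sum() / (singular_values**2).sum()
print(X_reduced.shape, round(explained, 3))  # 2 components giữ gần như toàn bộ variance
```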
75
+ <h2 id="computer-vision"><strong>5. Computer Vision Algorithms</strong></h2>
76
+
77
+ <table>
78
+ <thead><tr><th>Algorithm</th><th>Task</th><th>Output</th></tr></thead>
79
+ <tbody>
80
+ <tr><td><strong>Image Classification</strong></td><td>Multi-class classification</td><td>Class label + confidence</td></tr>
81
+ <tr><td><strong>Object Detection</strong></td><td>Locate + classify objects</td><td>Bounding boxes + labels</td></tr>
82
+ <tr><td><strong>Semantic Segmentation</strong></td><td>Pixel-level classification</td><td>Segmentation mask</td></tr>
83
+ </tbody>
84
+ </table>
85
+
86
+ <h2 id="algorithm-decision"><strong>6. Algorithm Selection Decision Tree</strong></h2>
87
+
88
+ <pre><code class="language-text">What is the problem type?
89
+
90
+ ├── Tabular data, classification/regression?
91
+ │ └── XGBoost (best general choice)
92
+
93
+ ├── Sparse features, recommendation, ad CTR?
94
+ │ └── Factorization Machines
95
+
96
+ ├── Time series forecasting (multiple related series)?
97
+ │ └── DeepAR
98
+
99
+ ├── Anomaly detection on time series / IoT?
100
+ │ └── Random Cut Forest (RCF)
101
+
102
+ ├── Text classification / sentiment?
103
+ │ └── BlazingText (supervised mode)
104
+
105
+ ├── Sequence-to-sequence (translation / summarization)?
106
+ │ └── Seq2Seq
107
+
108
+ ├── Topic modeling?
109
+ │ └── LDA or NTM
110
+
111
+ ├── Clustering?
112
+ │ └── K-Means
113
+
114
+ ├── Dimensionality reduction?
115
+ │ └── PCA
116
+
117
+ └── Image tasks?
118
+ ├── Classification only → Image Classification
119
+ ├── Locate objects → Object Detection
120
+ └── Pixel mask → Semantic Segmentation
121
+ </code></pre>
122
+
123
+ <h2 id="training-modes"><strong>7. Training Input Modes</strong></h2>
124
+
125
+ <table>
126
+ <thead><tr><th>Mode</th><th>How It Works</th><th>Best For</th></tr></thead>
127
+ <tbody>
128
+ <tr><td><strong>File Mode</strong></td><td>Downloads entire dataset to training instance before starting</td><td>Small to medium datasets</td></tr>
129
+ <tr><td><strong>Pipe Mode</strong></td><td>Streams data directly from S3 during training</td><td>Very large datasets — no disk bottleneck</td></tr>
130
+ <tr><td><strong>FastFile Mode</strong></td><td>Exposes S3 objects as local files, streaming on demand — no full download</td><td>Large datasets with many files; training starts immediately</td></tr>
131
+ </tbody>
132
+ </table>
133
+
134
+ <blockquote>
135
+ <p><strong>Exam tip:</strong> Khi đề hỏi "reduce training time for large dataset", đáp án thường là chuyển sang <strong>Pipe Mode</strong> với <strong>RecordIO format</strong>. Pipe Mode không download toàn bộ dataset — stream trực tiếp từ S3.</p>
136
+ </blockquote>
137
+
138
+ <h2 id="cheat-sheet"><strong>8. Cheat Sheet — Quick Reference</strong></h2>
139
+
140
+ <table>
141
+ <thead><tr><th>Keyword in Question</th><th>Algorithm</th></tr></thead>
142
+ <tbody>
143
+ <tr><td>"tabular data", "structured data"</td><td>XGBoost</td></tr>
144
+ <tr><td>"time series", "forecast"</td><td>DeepAR</td></tr>
145
+ <tr><td>"anomaly detection"</td><td>Random Cut Forest</td></tr>
146
+ <tr><td>"recommendation", "sparse features"</td><td>Factorization Machines</td></tr>
147
+ <tr><td>"text classification", "sentiment"</td><td>BlazingText (supervised)</td></tr>
148
+ <tr><td>"word embeddings"</td><td>BlazingText (Word2Vec mode)</td></tr>
149
+ <tr><td>"translation", "summarization"</td><td>Seq2Seq</td></tr>
150
+ <tr><td>"topic modeling"</td><td>LDA or NTM</td></tr>
151
+ <tr><td>"clustering", "segmentation"</td><td>K-Means</td></tr>
152
+ <tr><td>"dimensionality reduction"</td><td>PCA</td></tr>
153
+ <tr><td>"bounding boxes", "object detection"</td><td>Object Detection</td></tr>
154
+ <tr><td>"pixel-level", "segmentation mask"</td><td>Semantic Segmentation</td></tr>
155
+ <tr><td>"IP address anomaly", "fraud login"</td><td>IP Insights</td></tr>
156
+ </tbody>
157
+ </table>
158
+
159
+ <h2 id="practice"><strong>9. Practice Questions</strong></h2>
160
+
161
+ <p><strong>Q1:</strong> A retail company wants to forecast product demand for the next 30 days across 5,000 product categories. Which SageMaker algorithm is BEST suited?</p>
162
+ <ul>
163
+ <li>A) K-Means</li>
164
+ <li>B) Linear Learner</li>
165
+ <li>C) DeepAR ✓</li>
166
+ <li>D) Seq2Seq</li>
167
+ </ul>
168
+ <p><em>Explanation: DeepAR is specifically designed for time series forecasting across multiple related time series. It learns global patterns from all 5,000 series simultaneously, providing probabilistic forecasts. This is exactly the use case it's optimized for.</em></p>
169
+
170
+ <p><strong>Q2:</strong> An IoT system monitors server CPU usage. The team wants to detect unusual spikes automatically. Which SageMaker built-in algorithm should be used?</p>
171
+ <ul>
172
+ <li>A) XGBoost</li>
173
+ <li>B) Random Cut Forest ✓</li>
174
+ <li>C) BlazingText</li>
175
+ <li>D) PCA</li>
176
+ </ul>
177
+ <p><em>Explanation: Random Cut Forest (RCF) is SageMaker's built-in anomaly detection algorithm. It assigns an anomaly score to each data point and works well for time series anomaly detection, such as CPU usage spikes.</em></p>
178
+
179
+ <p><strong>Q3:</strong> A data scientist is training a model on a 500 GB dataset. Training is very slow because downloading data to the training instance takes too long. Which change will MOST improve performance?</p>
180
+ <ul>
181
+ <li>A) Switch from CSV to JSON format</li>
182
+ <li>B) Increase the training instance size</li>
183
+ <li>C) Switch to Pipe Mode with RecordIO-protobuf format ✓</li>
184
+ <li>D) Add more training epochs</li>
185
+ </ul>
186
+ <p><em>Explanation: Pipe Mode streams data directly from S3 during training without downloading it first, eliminating the I/O bottleneck for large datasets. Combined with RecordIO-protobuf format, it dramatically reduces startup time.</em></p>
@@ -0,0 +1,159 @@
1
+ ---
2
+ id: 8a7a5367-e4a4-4796-8aab-68326c1dc574
3
+ title: 'Bài 5: Training & Hyperparameter Tuning'
4
+ slug: bai-5-training-hyperparameter-tuning
5
+ description: >-
6
+ SageMaker Training Jobs: instance types, Pipe Mode vs File Mode.
7
+ Distributed training: data parallelism vs model parallelism.
8
+ Automatic Model Tuning (HPO): Bayesian vs Random vs Grid search.
9
+ Spot Instance Training để giảm chi phí.
10
+ duration_minutes: 60
11
+ is_free: true
12
+ video_url: null
13
+ sort_order: 5
14
+ section_title: "Phần 2: Modeling (36%)"
15
+ course:
16
+ id: 019c9619-lt02-7002-c002-lt0200000002
17
+ title: 'Luyện thi AWS Certified Machine Learning - Specialty'
18
+ slug: luyen-thi-aws-ml-specialty
19
+ ---
20
+
21
+ <div style="text-align: center; margin: 2rem 0;">
22
+ <img src="/storage/uploads/2026/04/aws-mls-bai5-training-hpo.png" alt="SageMaker Training & Hyperparameter Tuning" style="max-width: 800px; width: 100%; border-radius: 12px;" />
23
+ <p><em>SageMaker Training Jobs & Hyperparameter Tuning: distributed training, Spot Instances, và HPO strategies</em></p>
24
+ </div>
25
+
26
+ <h2 id="training-jobs"><strong>1. SageMaker Training Jobs</strong></h2>
27
+
28
+ <p><strong>SageMaker Training Jobs</strong> chạy ML training code trên managed compute infrastructure. Training xảy ra trên ephemeral instances — chỉ tính phí khi chạy.</p>
29
+
30
+ <pre><code class="language-text">Training Job Lifecycle:
31
+
32
+ Submit Job ──→ Provision Instances ──→ Download Data
33
+
34
+ Run Training Code
35
+
36
+ Save Model to S3
37
+
38
+ Terminate Instances
39
+ </code></pre>
40
+
41
+ <h2 id="instance-types"><strong>2. Instance Types cho Training</strong></h2>
42
+
43
+ <table>
44
+ <thead><tr><th>Instance Family</th><th>Hardware</th><th>Best For</th></tr></thead>
45
+ <tbody>
46
+ <tr><td><strong>ml.c5</strong></td><td>CPU optimized</td><td>Tabular ML, XGBoost, sklearn</td></tr>
47
+ <tr><td><strong>ml.m5</strong></td><td>General purpose CPU</td><td>Light training, data processing</td></tr>
48
+ <tr><td><strong>ml.p3</strong></td><td>V100 GPU</td><td>Deep learning training</td></tr>
49
+ <tr><td><strong>ml.p4d</strong></td><td>A100 GPU (8x)</td><td>Large-scale DL, distributed training</td></tr>
50
+ <tr><td><strong>ml.g4dn</strong></td><td>T4 GPU (cost-effective)</td><td>Small-medium DL models</td></tr>
51
+ <tr><td><strong>ml.trn1</strong></td><td>AWS Trainium</td><td>LLM training, cost optimization</td></tr>
52
+ </tbody>
53
+ </table>
54
+
55
+ <h2 id="distributed-training"><strong>3. Distributed Training</strong></h2>
56
+
57
+ <p>Khi model hoặc dataset quá lớn cho một instance, cần <strong>distributed training</strong> trên nhiều instances.</p>
58
+
59
+ <table>
60
+ <thead><tr><th>Strategy</th><th>How It Works</th><th>When to Use</th></tr></thead>
61
+ <tbody>
62
+ <tr><td><strong>Data Parallelism</strong></td><td>Mỗi instance có copy của model, train trên subset của data, sync gradients</td><td>Dataset quá lớn, model vừa vặn trong 1 GPU</td></tr>
63
+ <tr><td><strong>Model Parallelism</strong></td><td>Model split across instances, mỗi instance chứa 1 phần</td><td>Model quá lớn cho 1 GPU (LLMs)</td></tr>
64
+ </tbody>
65
+ </table>
66
+
67
+ <pre><code class="language-text">Data Parallelism:
68
+
69
+ Instance 1 [Full Model] ──→ Train on data shard A ──→ ↓
70
+ Instance 2 [Full Model] ──→ Train on data shard B ──→ ↓ AllReduce
71
+ Instance 3 [Full Model] ──→ Train on data shard C ──→ ↓ (sync gradients)
72
+
73
+ Updated Model Weights
74
+
75
+ Model Parallelism:
76
+
77
+ Instance 1 [Layers 1-4] ──→ forward pass ──→
78
+ Instance 2 [Layers 5-8] ──→ forward pass ──→
79
+ Instance 3 [Layers 9-12] ──→ forward pass ──→ output
80
+ </code></pre>
81
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> SageMaker provides the <strong>SageMaker Distributed</strong> library with 2 modules: (1) <code>smdistributed.dataparallel</code> with optimized AllReduce; (2) <code>smdistributed.modelparallel</code> with automatic pipeline parallelism. When the question mentions "large model training" → model parallelism.</p>
+ </blockquote>
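+
+ <p>The gradient-averaging step at the heart of data parallelism can be sketched in plain Python. Everything here (the 1-parameter linear model, the hand-rolled <code>allreduce_mean</code>) is a toy illustration of the concept, not the <code>smdistributed.dataparallel</code> API:</p>
+

```python
# Toy simulation of data parallelism with AllReduce gradient averaging.
# Each "worker" holds a full copy of the weights, computes a gradient on
# its own data shard, then all workers average their gradients (AllReduce)
# so every copy applies the identical update.

def local_gradient(weights, shard):
    # Mean-squared-error gradient for a 1-parameter linear model y = w * x
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]

def allreduce_mean(grads_per_worker):
    # Element-wise mean across workers: the operation AllReduce performs
    n = len(grads_per_worker)
    return [sum(g[i] for g in grads_per_worker) / n
            for i in range(len(grads_per_worker[0]))]

def train_step(weights, shards, lr=0.01):
    # In real training, each worker computes its gradient in parallel
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = allreduce_mean(grads)  # synchronization point
    return [w - lr * g for w, g in zip(weights, avg)]

# Data generated from y = 2x, split into 3 shards (one per "worker")
shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)], [(5, 10), (6, 12)]]
weights = [0.0]
for _ in range(50):
    weights = train_step(weights, shards)
print(round(weights[0], 3))  # converges very close to 2.0
```

+
+ <p>Because every worker applies the same averaged gradient, all model copies stay identical after each step, which is what AllReduce guarantees in real data-parallel training.</p>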
+
+ <h2 id="hpo"><strong>4. Automatic Model Tuning (HPO)</strong></h2>
+
+ <p><strong>Hyperparameter Optimization (HPO)</strong> automatically finds the best hyperparameters by running multiple training jobs with different configurations.</p>
+
+ <table>
+ <thead><tr><th>Strategy</th><th>How It Works</th><th>Tradeoff</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Random Search</strong></td><td>Randomly samples hyperparameters from the defined ranges</td><td>Fast, good baseline</td></tr>
+ <tr><td><strong>Grid Search</strong></td><td>Tries all combinations</td><td>Exhaustive, expensive, bad for large spaces</td></tr>
+ <tr><td><strong>Bayesian Optimization</strong></td><td>Builds a probabilistic model of the objective and suggests the most promising next config</td><td>Efficient, learns from previous trials (SageMaker default)</td></tr>
+ <tr><td><strong>Hyperband</strong></td><td>Early-stops poorly performing trials</td><td>Resource-efficient, fast</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> SageMaker AMT (Automatic Model Tuning) uses <strong>Bayesian Optimization</strong> by default. It <strong>uses the results</strong> of previous jobs to suggest the next hyperparameter set: an intelligent search, not brute force.</p>
+ </blockquote>
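+
+ <p>Random Search, the simplest strategy in the table, fits in a few lines of plain Python. This is a toy illustration, not the SageMaker tuner API; the <code>objective</code> function standing in for "validation score of a training job" is made up, peaking near <code>lr=0.1</code>, <code>batch_size=64</code>:</p>
+

```python
import random

# Toy random search over a 2-hyperparameter space.
# Each loop iteration stands in for launching one training job.

def objective(lr, batch_size):
    # Made-up validation score: higher is better, best at lr=0.1, batch=64
    return -((lr - 0.1) ** 2) - ((batch_size - 64) / 64) ** 2

random.seed(42)
best_score, best_cfg = float("-inf"), None
for _ in range(100):  # 100 "training jobs"
    cfg = {
        "lr": 10 ** random.uniform(-4, 0),            # log-uniform: 1e-4 .. 1.0
        "batch_size": random.choice([16, 32, 64, 128, 256]),
    }
    score = objective(cfg["lr"], cfg["batch_size"])
    if score > best_score:
        best_score, best_cfg = score, cfg

print(best_cfg)
```

+
+ <p>Bayesian Optimization replaces the blind sampling above with a surrogate model that proposes each next config based on all previous scores, which is why it typically needs far fewer jobs to reach a good result.</p>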
+
+ <h2 id="spot-training"><strong>5. Spot Instance Training</strong></h2>
+
+ <p>SageMaker supports <strong>EC2 Spot Instances</strong> for training jobs, saving up to <strong>90%</strong> compared to On-Demand pricing.</p>
+
+ <table>
+ <thead><tr><th>Feature</th><th>Detail</th></tr></thead>
+ <tbody>
+ <tr><td><strong>MaxWaitTimeInSeconds</strong></td><td>Maximum time to wait for Spot capacity</td></tr>
+ <tr><td><strong>Checkpointing</strong></td><td>Periodically saves the model to S3 so training can resume after an interruption</td></tr>
+ <tr><td><strong>use_spot_instances=True</strong></td><td>Parameter on the SageMaker Estimator</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> When the question asks how to "reduce training costs", the answer is usually <strong>Spot Instances with checkpointing</strong>. Checkpointing matters because it prevents losing progress when the Spot Instance is terminated.</p>
+ </blockquote>
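+
+ <p>In the SageMaker Python SDK this is configured on the Estimator (<code>use_spot_instances=True</code>, <code>max_run</code>, <code>max_wait</code>, <code>checkpoint_s3_uri</code>), and SageMaker syncs the local checkpoint directory to S3 for you. The checkpoint/resume pattern the training script must implement can be illustrated in plain Python, with a temp file standing in for S3:</p>
+

```python
import json
import os
import tempfile

# Toy checkpoint/resume loop showing why Spot training needs checkpointing:
# an interrupted run restarts from the last saved epoch instead of epoch 0.

CKPT = os.path.join(tempfile.gettempdir(), "toy_ckpt.json")

def load_checkpoint():
    # Resume from the last checkpoint if one exists, else start fresh
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "loss": 100.0}

def save_checkpoint(state):
    with open(CKPT, "w") as f:
        json.dump(state, f)

def train(total_epochs, interrupt_at=None):
    state = load_checkpoint()
    while state["epoch"] < total_epochs:
        if interrupt_at is not None and state["epoch"] == interrupt_at:
            return state               # simulate a Spot interruption
        state["epoch"] += 1
        state["loss"] *= 0.9           # pretend each epoch improves the loss
        save_checkpoint(state)         # persist progress every epoch
    return state

if os.path.exists(CKPT):
    os.remove(CKPT)                    # clean slate for the demo
train(10, interrupt_at=6)              # first attempt interrupted at epoch 6
state = train(10)                      # second attempt resumes at 6, not 0
print(state["epoch"])                  # prints 10
```

+
+ <p>Without the <code>save_checkpoint</code> call, the second attempt would restart from epoch 0 and all interrupted work would be wasted, which on a long HPO run can erase the Spot discount entirely.</p>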
+
+ <h2 id="bias-variance"><strong>6. Bias-Variance Tradeoff</strong></h2>
+
+ <table>
+ <thead><tr><th>Issue</th><th>Symptom</th><th>Cause</th><th>Solution</th></tr></thead>
+ <tbody>
+ <tr><td><strong>High Bias (Underfitting)</strong></td><td>High train error, high test error</td><td>Model too simple</td><td>Increase model complexity, add features, reduce regularization</td></tr>
+ <tr><td><strong>High Variance (Overfitting)</strong></td><td>Low train error, high test error</td><td>Model too complex</td><td>More data, dropout, regularization, feature selection</td></tr>
+ <tr><td><strong>Balanced</strong></td><td>Low train error, low test error (close together)</td><td>Good fit</td><td>Deploy the model</td></tr>
+ </tbody>
+ </table>
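+
+ <p>The table reads as a simple decision rule on the pair (train error, test error). A toy sketch of that rule; the 5% tolerance is an arbitrary illustration, not an exam constant:</p>
+

```python
# Rule-of-thumb diagnosis from the bias-variance table:
# high errors everywhere -> underfitting; low train error but a large
# train/test gap -> overfitting; otherwise a reasonable fit.

def diagnose(train_error, test_error, tol=0.05):
    if train_error > tol and test_error > tol:
        return "underfitting (high bias)"
    if train_error <= tol and test_error - train_error > tol:
        return "overfitting (high variance)"
    return "balanced (good fit)"

print(diagnose(0.30, 0.32))  # underfitting (high bias)
print(diagnose(0.02, 0.25))  # overfitting (high variance)
print(diagnose(0.03, 0.05))  # balanced (good fit)
```
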
+
+ <h2 id="practice"><strong>7. Practice Questions</strong></h2>
+
+ <p><strong>Q1:</strong> A company is training a large deep learning model that doesn't fit on a single GPU instance. Which SageMaker distributed training strategy should they use?</p>
+ <ul>
+ <li>A) Data parallelism</li>
+ <li>B) Model parallelism ✓</li>
+ <li>C) Pipeline parallelism only</li>
+ <li>D) Increase batch size</li>
+ </ul>
+ <p><em>Explanation: Model parallelism splits the model itself across multiple GPU instances, allowing training of models too large to fit in a single GPU's memory. Data parallelism keeps a full model copy on each instance, which doesn't help when the model itself is too large.</em></p>
+
+ <p><strong>Q2:</strong> A team wants to minimize the cost of running 500 hyperparameter tuning jobs. Training can tolerate interruptions. What is the MOST cost-effective approach?</p>
+ <ul>
+ <li>A) Use larger instances to run jobs faster</li>
+ <li>B) Use Spot Instances with checkpointing enabled ✓</li>
+ <li>C) Use Grid Search instead of Bayesian Optimization</li>
+ <li>D) Reduce the number of epochs</li>
+ </ul>
+ <p><em>Explanation: Spot Instances can save up to 90% compared to On-Demand pricing. With checkpointing enabled, interrupted jobs save their state to S3 and can resume, making Spot Instances practical for long HPO jobs.</em></p>
+
+ <p><strong>Q3:</strong> A model achieves 95% accuracy on training data but only 62% on the test set. What problem does this indicate?</p>
+ <ul>
+ <li>A) Underfitting / High bias</li>
+ <li>B) Overfitting / High variance ✓</li>
+ <li>C) Data leakage</li>
+ <li>D) Class imbalance</li>
+ </ul>
+ <p><em>Explanation: The large gap between training accuracy (95%) and test accuracy (62%) is a classic sign of overfitting (high variance). The model memorized the training data but fails to generalize. Solutions: more data, regularization (L1/L2, dropout), reduce model complexity.</em></p>