@xdev-asia/xdev-knowledge-mcp 1.0.42 → 1.0.44

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (32)
  1. package/content/pages/xoa-du-lieu-nguoi-dung.md +68 -0
  2. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/01-bai-1-data-repositories-ingestion.md +198 -0
  3. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/02-bai-2-data-transformation.md +183 -0
  4. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/03-bai-3-data-analysis.md +159 -0
  5. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/04-bai-4-sagemaker-built-in-algorithms.md +186 -0
  6. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/05-bai-5-training-hyperparameter-tuning.md +159 -0
  7. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/06-bai-6-model-evaluation.md +169 -0
  8. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/07-bai-7-model-deployment.md +193 -0
  9. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/08-bai-8-model-monitoring-mlops.md +184 -0
  10. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/09-bai-9-security-cost.md +166 -0
  11. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/10-bai-10-bai-toan-thuong-gap.md +181 -0
  12. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/11-bai-11-cheat-sheet.md +110 -0
  13. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/12-bai-12-chien-luoc-thi.md +113 -0
  14. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/index.md +1 -1
  15. package/content/series/luyen-thi/luyen-thi-cka/index.md +217 -0
  16. package/content/series/luyen-thi/luyen-thi-ckad/index.md +199 -0
  17. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/01-bai-1-framing-ml-problems.md +136 -0
  18. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/02-bai-2-gcp-ai-ml-ecosystem.md +160 -0
  19. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/03-bai-3-data-pipeline.md +174 -0
  20. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/04-bai-4-feature-engineering.md +156 -0
  21. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/05-bai-5-vertex-ai-training.md +155 -0
  22. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/06-bai-6-bigquery-ml-tensorflow.md +141 -0
  23. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/07-bai-7-model-deployment.md +134 -0
  24. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/08-bai-8-vertex-ai-pipelines-mlops.md +149 -0
  25. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/09-bai-9-responsible-ai.md +128 -0
  26. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/10-bai-10-cheat-sheet-chien-luoc-thi.md +108 -0
  27. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/index.md +1 -1
  28. package/content/series/luyen-thi/luyen-thi-kcna/index.md +168 -0
  29. package/data/quizzes/aws-ai-practitioner.json +362 -0
  30. package/data/quizzes/aws-ml-specialty.json +200 -0
  31. package/data/quizzes/gcp-ml-engineer.json +200 -0
  32. package/package.json +1 -1
@@ -0,0 +1,174 @@
+ ---
+ id: 019c9619-lt03-l03
+ title: 'Lesson 3: Data Pipeline — Dataflow, Pub/Sub, Dataproc'
+ slug: bai-3-data-pipeline
+ description: >-
+   Apache Beam on Dataflow for batch/streaming ETL.
+   Pub/Sub for event-driven pipelines. Dataproc for Spark.
+   Cloud Composer (Airflow) for orchestration.
+ duration_minutes: 60
+ is_free: true
+ video_url: null
+ sort_order: 3
+ section_title: "Part 2: Data Engineering & Feature Engineering"
+ course:
+   id: 019c9619-lt03-7003-c003-lt0300000003
+   title: 'Google Cloud Professional Machine Learning Engineer Exam Prep'
+   slug: luyen-thi-gcp-ml-engineer
+ ---
+
+ <div style="text-align: center; margin: 2rem 0;">
+ <img src="/storage/uploads/2026/04/gcp-mle-bai3-data-pipeline.png" alt="GCP Data Pipeline Architecture" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+ <p><em>GCP Data Pipeline: Pub/Sub, Dataflow, Dataproc, Cloud Composer, and the data flow for ML</em></p>
+ </div>
+
+ <h2 id="gcp-data-pipeline"><strong>1. GCP Data Pipeline Services</strong></h2>
+
+ <table>
+ <thead><tr><th>Service</th><th>Type</th><th>When to Use</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Pub/Sub</strong></td><td>Managed message queue</td><td>Event streaming, decouple producers/consumers</td></tr>
+ <tr><td><strong>Dataflow</strong></td><td>Managed Apache Beam runner</td><td>Unified batch + streaming ETL</td></tr>
+ <tr><td><strong>Dataproc</strong></td><td>Managed Spark / Hadoop</td><td>Existing Spark/Hadoop workloads, ML at scale</td></tr>
+ <tr><td><strong>Cloud Composer</strong></td><td>Managed Apache Airflow</td><td>Orchestrate multi-step ML workflows</td></tr>
+ <tr><td><strong>Cloud Storage</strong></td><td>Object store</td><td>Raw data landing zone, model artifacts</td></tr>
+ <tr><td><strong>BigQuery</strong></td><td>Data warehouse</td><td>Structured analysis, BigQuery ML</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="pubsub"><strong>2. Pub/Sub — Event Streaming</strong></h2>
+
+ <pre><code class="language-text">Pub/Sub Architecture:
+
+ Data Source → Publisher → [Topic] → Subscription → Subscriber
+ (IoT devices,                       (Pull or      (Dataflow,
+  web clicks,                         Push)         Cloud Functions,
+  logs)                                             BigQuery)
+
+ Key concepts:
+ - Topic: named resource where messages are sent
+ - Subscription: named resource attached to a topic
+ - Publisher: sends messages to a topic
+ - Subscriber: receives messages from a subscription
+ - At-least-once delivery (not exactly-once by default)
+ </code></pre>
+
+ <table>
+ <thead><tr><th>Feature</th><th>Details</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Message retention</strong></td><td>7 days by default (configurable)</td></tr>
+ <tr><td><strong>At-least-once delivery</strong></td><td>Subscribers must be idempotent</td></tr>
+ <tr><td><strong>Exactly-once</strong></td><td>Opt-in exactly-once delivery per subscription (within a single region)</td></tr>
+ <tr><td><strong>Ordering</strong></td><td>Enable message ordering with an ordering key</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> Pub/Sub → Dataflow → BigQuery is an extremely common pipeline pattern on the exam: Pub/Sub ingests, Dataflow transforms, BigQuery stores and analyzes.</p>
+ </blockquote>
+
+ <h2 id="dataflow"><strong>3. Cloud Dataflow — Apache Beam</strong></h2>
+
+ <p>Dataflow is the managed runner for <strong>Apache Beam</strong>, a framework for unified batch and streaming processing. There are no servers to manage.</p>
+
+ <table>
+ <thead><tr><th>Concept</th><th>Description</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Pipeline</strong></td><td>A chain of transform operations</td></tr>
+ <tr><td><strong>PCollection</strong></td><td>Distributed data collection (bounded or unbounded)</td></tr>
+ <tr><td><strong>Transform</strong></td><td>ParDo, GroupByKey, Combine, Flatten, Partition</td></tr>
+ <tr><td><strong>Windowing</strong></td><td>Fixed, Sliding, and Session windows for streaming</td></tr>
+ <tr><td><strong>Watermarks</strong></td><td>Handle late-arriving data in streaming</td></tr>
+ </tbody>
+ </table>
+
+ <pre><code class="language-text">Dataflow Windowing for Streaming ML:
+
+ Event stream: ──●──●──●──────●──●──●──────●──●──
+
+ Fixed Window (1 min):
+ ├─── [W1] ──┤├─── [W2] ──┤├─── [W3] ──┤
+
+ Sliding Window (1 min, slide 30s):
+ ├── [W1] ────┤
+       ├── [W2] ────┤
+             ├── [W3] ────┤
+
+ Session Window (2 min gap):
+ ├── [S1] ──────────┤      ├── [S2] ──┤
+   (user session)          (new session)
+ </code></pre>
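+
+ <p>A minimal Apache Beam sketch of the streaming pattern above: read from Pub/Sub, apply 1-minute fixed windows, count per key, and write to BigQuery. The project, topic, and table names are illustrative assumptions; run with <code>--runner=DataflowRunner</code> to execute on Dataflow.</p>
+
+ <pre><code class="language-python">import apache_beam as beam
+ from apache_beam.options.pipeline_options import PipelineOptions
+ from apache_beam.transforms import window
+
+ options = PipelineOptions(streaming=True)
+
+ with beam.Pipeline(options=options) as p:
+     (
+         p
+         | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
+         | "KeyByEvent" >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
+         | "FixedWindow" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
+         | "CountPerKey" >> beam.CombinePerKey(sum)  # count per event type per window
+         | "ToRow" >> beam.Map(lambda kv: {"event": kv[0], "count": kv[1]})
+         | "WriteBQ" >> beam.io.WriteToBigQuery(
+             "my-project:analytics.event_counts",
+             schema="event:STRING,count:INTEGER",
+             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
+         )
+     )
+ </code></pre>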
+
+ <h2 id="dataproc"><strong>4. Cloud Dataproc — Managed Spark/Hadoop</strong></h2>
+
+ <table>
+ <thead><tr><th>Dataproc Feature</th><th>Details</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Cluster lifecycle</strong></td><td>Clusters create in ~90 seconds; delete after the job — cost efficient</td></tr>
+ <tr><td><strong>Ephemeral clusters</strong></td><td>Spin up → run job → shut down (per-job pricing)</td></tr>
+ <tr><td><strong>Preemptible VMs</strong></td><td>Use for worker nodes to reduce cost 60-80%</td></tr>
+ <tr><td><strong>Component gateway</strong></td><td>Access Jupyter, Zeppelin, Spark UI via browser</td></tr>
+ <tr><td><strong>ML libraries</strong></td><td>Spark MLlib, TensorFlow on Spark (TFoS)</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="composer"><strong>5. Cloud Composer — Workflow Orchestration</strong></h2>
+
+ <p>Cloud Composer is managed Apache Airflow. Use it to orchestrate multi-step ML pipelines covering data ingestion, preprocessing, training, and deployment.</p>
+
+ <pre><code class="language-text">Cloud Composer ML Workflow:
+
+ [DAG: daily_ml_pipeline]
+ Task 1: Extract data from BigQuery
+         ↓
+ Task 2: Run Dataflow preprocessing job
+         ↓
+ Task 3: Submit Vertex AI Training Job
+         ↓
+ Task 4: Evaluate model metrics
+         ↓ (if metrics pass threshold)
+ Task 5: Deploy to Vertex AI Endpoint
+ </code></pre>
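+
+ <p>A hedged Airflow sketch of the same DAG. The callables are stand-in placeholders rather than the dedicated GCP operators Composer ships for BigQuery, Dataflow, and Vertex AI; the <code>ShortCircuitOperator</code> implements the "deploy only if metrics pass" branch.</p>
+
+ <pre><code class="language-python">from datetime import datetime
+
+ from airflow import DAG
+ from airflow.operators.python import PythonOperator, ShortCircuitOperator
+
+ with DAG("daily_ml_pipeline", start_date=datetime(2026, 1, 1),
+          schedule_interval="@daily", catchup=False) as dag:
+     extract = PythonOperator(task_id="extract_bq", python_callable=lambda: None)
+     preprocess = PythonOperator(task_id="dataflow_preprocess", python_callable=lambda: None)
+     train = PythonOperator(task_id="vertex_train", python_callable=lambda: None)
+     # ShortCircuitOperator skips all downstream tasks when the callable returns False
+     evaluate = ShortCircuitOperator(task_id="evaluate_metrics", python_callable=lambda: True)
+     deploy = PythonOperator(task_id="deploy_endpoint", python_callable=lambda: None)
+
+     extract >> preprocess >> train >> evaluate >> deploy
+ </code></pre>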
+
+ <h2 id="decision-guide"><strong>6. Data Pipeline Service Selection</strong></h2>
+
+ <table>
+ <thead><tr><th>Scenario</th><th>Recommended Service</th></tr></thead>
+ <tbody>
+ <tr><td>Real-time event streaming ingestion</td><td>Pub/Sub</td></tr>
+ <tr><td>Unified batch + streaming ETL (no infra mgmt)</td><td>Dataflow (Apache Beam)</td></tr>
+ <tr><td>Migrate existing Spark jobs to GCP</td><td>Dataproc</td></tr>
+ <tr><td>Complex ML DAG orchestration</td><td>Cloud Composer</td></tr>
+ <tr><td>Stream data into BigQuery</td><td>Pub/Sub → Dataflow → BigQuery</td></tr>
+ <tr><td>Serverless data processing (SQL)</td><td>BigQuery (ETL via SQL)</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="practice"><strong>7. Practice Questions</strong></h2>
+
+ <p><strong>Q1:</strong> A company receives millions of IoT sensor events per second from factory equipment. They need to process these events in real time, detect anomalies, and store results in BigQuery. Which pipeline architecture is MOST appropriate?</p>
+ <ul>
+ <li>A) Dataproc → Spark Streaming → BigQuery</li>
+ <li>B) Pub/Sub → Dataflow → BigQuery ✓</li>
+ <li>C) Cloud Functions → Cloud SQL</li>
+ <li>D) Batch upload to Cloud Storage → BigQuery import</li>
+ </ul>
+ <p><em>Explanation: Pub/Sub ingests high-volume streaming events reliably. Dataflow processes the stream in real time using Apache Beam (windowing, transformations, anomaly detection). BigQuery stores the results for analysis. This is the canonical GCP streaming analytics pattern.</em></p>
+
+ <p><strong>Q2:</strong> A data engineering team has an existing Apache Spark job that processes training data for ML models. They want to migrate it to GCP with minimal code changes. Which service should they use?</p>
+ <ul>
+ <li>A) Cloud Dataflow</li>
+ <li>B) Cloud Dataproc ✓</li>
+ <li>C) BigQuery ETL</li>
+ <li>D) Cloud Composer</li>
+ </ul>
+ <p><em>Explanation: Cloud Dataproc supports Apache Spark natively, allowing teams to run existing Spark jobs on GCP with minimal changes. Dataflow uses Apache Beam (a different programming model). Dataproc is the lift-and-shift option for Spark workloads.</em></p>
+
+ <p><strong>Q3:</strong> A team needs to orchestrate a daily ML pipeline that includes data extraction from BigQuery, preprocessing, Vertex AI training, and deployment if accuracy exceeds 90%. Which service handles this workflow orchestration?</p>
+ <ul>
+ <li>A) Vertex AI Pipelines</li>
+ <li>B) Cloud Dataflow</li>
+ <li>C) Cloud Composer ✓</li>
+ <li>D) Pub/Sub triggers</li>
+ </ul>
+ <p><em>Explanation: Cloud Composer (managed Apache Airflow) is designed for complex DAG orchestration across multiple services. It handles scheduling, conditional branching (deploy only if accuracy &gt; 90%), retry logic, and monitoring across heterogeneous services like BigQuery, Dataflow, and Vertex AI.</em></p>
@@ -0,0 +1,156 @@
+ ---
+ id: 019c9619-lt03-l04
+ title: 'Lesson 4: Feature Engineering & Vertex AI Feature Store'
+ slug: bai-4-feature-engineering
+ description: >-
+   Feature engineering techniques. BigQuery for feature computation.
+   Vertex AI Feature Store: online/offline serving.
+   Feature monitoring and training/serving consistency.
+ duration_minutes: 60
+ is_free: true
+ video_url: null
+ sort_order: 4
+ section_title: "Part 2: Data Engineering & Feature Engineering"
+ course:
+   id: 019c9619-lt03-7003-c003-lt0300000003
+   title: 'Google Cloud Professional Machine Learning Engineer Exam Prep'
+   slug: luyen-thi-gcp-ml-engineer
+ ---
+
+ <div style="text-align: center; margin: 2rem 0;">
+ <img src="/storage/uploads/2026/04/gcp-mle-bai4-feature-store.png" alt="Vertex AI Feature Store" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+ <p><em>Feature Engineering & Vertex AI Feature Store: creating, storing, and reusing features for ML</em></p>
+ </div>
+
+ <h2 id="feature-engineering"><strong>1. Feature Engineering Techniques</strong></h2>
+
+ <table>
+ <thead><tr><th>Technique</th><th>When to Use</th><th>Example</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Normalization (Min-Max)</strong></td><td>Bounded range required (0-1)</td><td>Image pixels, probabilities</td></tr>
+ <tr><td><strong>Standardization (Z-score)</strong></td><td>Normal-ish distribution, no bounds</td><td>Customer age, transaction amount</td></tr>
+ <tr><td><strong>Log Transform</strong></td><td>Skewed distributions (price, salary)</td><td>Log(price) for housing</td></tr>
+ <tr><td><strong>One-Hot Encoding</strong></td><td>Nominal categorical (no order)</td><td>Country, brand, color</td></tr>
+ <tr><td><strong>Label Encoding</strong></td><td>Ordinal categorical (has order)</td><td>Low/Medium/High → 0/1/2</td></tr>
+ <tr><td><strong>Feature Crossing</strong></td><td>Capture interaction between features</td><td>city × day_of_week</td></tr>
+ <tr><td><strong>Bucketizing</strong></td><td>Convert continuous to categorical</td><td>Age → age_group</td></tr>
+ <tr><td><strong>Embeddings</strong></td><td>High-cardinality categorical</td><td>UserID, ProductID</td></tr>
+ </tbody>
+ </table>
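+
+ <p>A minimal scikit-learn sketch combining three of these transforms in one preprocessing step; the column names and values are assumptions for illustration.</p>
+
+ <pre><code class="language-python">import numpy as np
+ import pandas as pd
+ from sklearn.compose import ColumnTransformer
+ from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
+
+ df = pd.DataFrame({
+     "price": [120.0, 80.0, 15000.0],   # right-skewed → log transform
+     "age": [23, 41, 35],               # roughly normal → z-score
+     "country": ["VN", "US", "VN"],     # nominal categorical → one-hot
+ })
+
+ preprocess = ColumnTransformer([
+     ("log_price", FunctionTransformer(np.log1p), ["price"]),
+     ("zscore_age", StandardScaler(), ["age"]),
+     ("onehot_country", OneHotEncoder(handle_unknown="ignore"), ["country"]),
+ ])
+
+ X = preprocess.fit_transform(df)  # numeric matrix, ready for a linear model
+ </code></pre>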
+
+ <h2 id="missing-values"><strong>2. Handling Missing Values</strong></h2>
+
+ <table>
+ <thead><tr><th>Strategy</th><th>When</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Mean/Median imputation</strong></td><td>Numerical, low missingness rate</td></tr>
+ <tr><td><strong>Mode imputation</strong></td><td>Categorical features</td></tr>
+ <tr><td><strong>Model-based imputation</strong></td><td>High missingness, complex patterns</td></tr>
+ <tr><td><strong>Indicator variable</strong></td><td>Missingness itself is informative (add an is_missing flag)</td></tr>
+ <tr><td><strong>Drop rows</strong></td><td>Missing target / very few rows affected</td></tr>
+ <tr><td><strong>Drop column</strong></td><td>&gt;80% missing</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="training-serving-skew"><strong>3. Training-Serving Skew</strong></h2>
+
+ <p><strong>Training-serving skew</strong> is a serious problem: features are computed differently between training and serving, so the model performs poorly in production even though its test metrics look good.</p>
+
+ <pre><code class="language-text">Training-Serving Skew Example:
+
+ TRAINING TIME:
+ avg_purchase_last_30d = mean(all purchases in batch)   ← computed over full period
+
+ SERVING TIME:
+ avg_purchase_last_30d = mean(last 5 purchases)         ← computed differently!
+
+ Result: Feature distribution mismatch → poor predictions
+
+ SOLUTION: Vertex AI Feature Store
+ Same feature serving logic used at training AND serving time
+ </code></pre>
+
+ <h2 id="feature-store"><strong>4. Vertex AI Feature Store</strong></h2>
+
+ <table>
+ <thead><tr><th>Component</th><th>Description</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Feature Store</strong></td><td>Centralized repository for ML features</td></tr>
+ <tr><td><strong>Entity Type</strong></td><td>Category of things you track (User, Product)</td></tr>
+ <tr><td><strong>Feature</strong></td><td>Named attribute of an entity (user.avg_spend)</td></tr>
+ <tr><td><strong>Online Store</strong></td><td>Low-latency serving (ms) for real-time predictions</td></tr>
+ <tr><td><strong>Offline Store</strong></td><td>BigQuery-backed, for batch training data retrieval</td></tr>
+ </tbody>
+ </table>
+
+ <pre><code class="language-text">Vertex AI Feature Store Architecture:
+
+ Feature Ingestion (Batch or Streaming)
+                  ↓
+ ┌──── Feature Store ────────────────┐
+ │  Offline Store (BigQuery)         │ ← Training data export
+ │  Online Store (Bigtable-backed)   │ ← Serving (ms latency)
+ └───────────────────────────────────┘
+      ↑  Same features  ↑
+  Training           Inference
+  Pipeline           Endpoint
+ </code></pre>
+
+ <h2 id="bigquery-features"><strong>5. BigQuery for Feature Engineering</strong></h2>
+
+ <p>BigQuery is the best tool on GCP for computing aggregate features from large datasets.</p>
+
+ <table>
+ <thead><tr><th>Feature Pattern</th><th>BigQuery Approach</th></tr></thead>
+ <tbody>
+ <tr><td>Rolling window aggregates</td><td>Window functions: AVG() OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...)</td></tr>
+ <tr><td>User activity counts</td><td>COUNT() GROUP BY user_id</td></tr>
+ <tr><td>Categorical encoding</td><td>CASE WHEN ... or ML.ONE_HOT_ENCODE()</td></tr>
+ <tr><td>Hash embedding (high cardinality)</td><td>FARM_FINGERPRINT() mod N</td></tr>
+ <tr><td>Feature normalization</td><td>ML.STANDARD_SCALER() in BigQuery ML</td></tr>
+ </tbody>
+ </table>
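+
+ <p>A hedged sketch of the rolling-window pattern using the BigQuery Python client; the project, dataset, and column names are assumptions. The SQL itself runs unchanged in the BigQuery console.</p>
+
+ <pre><code class="language-python">from google.cloud import bigquery
+
+ client = bigquery.Client()
+
+ # 30-day rolling average purchase amount per user via a window function
+ query = """
+ SELECT
+   user_id,
+   purchase_ts,
+   AVG(amount) OVER (
+     PARTITION BY user_id
+     ORDER BY UNIX_DATE(DATE(purchase_ts))
+     RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
+   ) AS avg_purchase_last_30d
+ FROM `my-project.sales.purchases`
+ """
+
+ rows = client.query(query).result()  # runs the job and waits for completion
+ </code></pre>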
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> When a question mentions "training-serving consistency" or "feature reuse across multiple models" → <strong>Vertex AI Feature Store</strong>. When it mentions "compute features from BigQuery data at scale" → BigQuery window functions + scheduled queries.</p>
+ </blockquote>
+
+ <h2 id="feature-monitoring"><strong>6. Feature Drift Monitoring</strong></h2>
+
+ <table>
+ <thead><tr><th>Type</th><th>What Changes</th><th>Detection Method</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Feature Skew</strong></td><td>Training vs serving feature distribution differs</td><td>Compare training baseline vs serving stats</td></tr>
+ <tr><td><strong>Feature Drift</strong></td><td>Serving features change over time</td><td>Monitor serving feature distributions daily</td></tr>
+ <tr><td><strong>Label Drift</strong></td><td>Target variable distribution changes</td><td>Track prediction distribution shifts</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="practice"><strong>7. Practice Questions</strong></h2>
+
+ <p><strong>Q1:</strong> A team's ML model has excellent accuracy during testing but performs poorly in production. Investigations reveal that the average purchase feature is calculated differently in training (using historical batch data) vs. serving (using real-time lookups). What is this problem called and how should it be solved?</p>
+ <ul>
+ <li>A) Model drift — retrain the model more frequently</li>
+ <li>B) Training-serving skew — use Vertex AI Feature Store ✓</li>
+ <li>C) Data leakage — remove the purchase feature</li>
+ <li>D) Overfitting — add dropout layers</li>
+ </ul>
+ <p><em>Explanation: Training-serving skew occurs when features are computed differently at training and serving time. Vertex AI Feature Store solves this by providing a single source of truth for feature computation, ensuring the same logic is used for both training data export and online serving.</em></p>
+
+ <p><strong>Q2:</strong> A feature has values ranging from $10 to $10,000,000 with a heavily right-skewed distribution. Which transformation is MOST appropriate before using this feature in a linear model?</p>
+ <ul>
+ <li>A) One-Hot Encoding</li>
+ <li>B) Min-Max Normalization</li>
+ <li>C) Log transformation ✓</li>
+ <li>D) Label Encoding</li>
+ </ul>
+ <p><em>Explanation: Log transformation compresses the scale of highly skewed distributions, making them more normal-like and suitable for linear models. Min-Max normalization would still preserve the skew. One-hot encoding is for categorical data.</em></p>
+
+ <p><strong>Q3:</strong> Which Vertex AI Feature Store store type is optimized for serving features to real-time prediction endpoints with millisecond latency?</p>
+ <ul>
+ <li>A) Offline Store (BigQuery)</li>
+ <li>B) Online Store (Bigtable-backed) ✓</li>
+ <li>C) Feature Catalog</li>
+ <li>D) Cloud Memorystore</li>
+ </ul>
+ <p><em>Explanation: The Online Store in Vertex AI Feature Store is backed by Bigtable and designed for sub-100ms latency lookups, serving fresh feature values to real-time prediction endpoints. The Offline Store uses BigQuery and is for batch training data retrieval.</em></p>
@@ -0,0 +1,155 @@
+ ---
+ id: 019c9619-lt03-l05
+ title: 'Lesson 5: Vertex AI Training — Custom & AutoML'
+ slug: bai-5-vertex-ai-training
+ description: >-
+   Custom training jobs: pre-built containers, custom containers.
+   Distributed training on GPU/TPU. AutoML: Tabular, Image, Text, Video.
+   Training pipeline setup. Hyperparameter tuning service.
+ duration_minutes: 60
+ is_free: true
+ video_url: null
+ sort_order: 5
+ section_title: "Part 3: Model Development on Vertex AI"
+ course:
+   id: 019c9619-lt03-7003-c003-lt0300000003
+   title: 'Google Cloud Professional Machine Learning Engineer Exam Prep'
+   slug: luyen-thi-gcp-ml-engineer
+ ---
+
+ <div style="text-align: center; margin: 2rem 0;">
+ <img src="/storage/uploads/2026/04/gcp-mle-bai5-vertex-training.png" alt="Vertex AI Custom Training" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+ <p><em>Vertex AI Custom Training: training jobs, AutoML, distributed training, and optimization</em></p>
+ </div>
+
+ <h2 id="custom-training"><strong>1. Vertex AI Custom Training</strong></h2>
+
+ <p>Custom training lets you run your own training code on Google Cloud infrastructure. There are two ways to package the code:</p>
+
+ <table>
+ <thead><tr><th>Method</th><th>Description</th><th>When to Use</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Pre-built containers</strong></td><td>GCP-provided containers: TF, PyTorch, Scikit-learn, XGBoost</td><td>Standard ML frameworks, fast setup</td></tr>
+ <tr><td><strong>Custom containers</strong></td><td>Build your own Docker image</td><td>Custom dependencies, special environments</td></tr>
+ </tbody>
+ </table>
+
+ <pre><code class="language-text">Custom Training Job Structure:
+
+ training_package/        (Python package or Docker image)
+ │
+ ├── trainer/
+ │   ├── __init__.py
+ │   ├── task.py    ← entry point (main training script)
+ │   └── model.py   ← model definition
+ │
+ └── setup.py
+
+ Arguments passed via:
+   TRAINING_DATA_URI:   gs://bucket/data/
+   TRAINING_OUTPUT_URI: gs://bucket/model/
+   Hyperparameters:     --learning-rate=0.001
+ </code></pre>
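+
+ <p>A hedged Vertex AI SDK sketch of launching the package above with a pre-built TensorFlow container; the project, bucket, and container tag are assumptions.</p>
+
+ <pre><code class="language-python">from google.cloud import aiplatform
+
+ aiplatform.init(project="my-project", location="us-central1",
+                 staging_bucket="gs://my-bucket")
+
+ job = aiplatform.CustomTrainingJob(
+     display_name="demo-trainer",
+     script_path="trainer/task.py",  # entry point from the layout above
+     container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest",
+ )
+
+ job.run(
+     args=["--learning-rate=0.001"],  # hyperparameters passed to task.py
+     replica_count=1,
+     machine_type="n1-standard-8",
+ )
+ </code></pre>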
+
+ <h2 id="compute-options"><strong>2. Compute Options</strong></h2>
+
+ <table>
+ <thead><tr><th>Hardware</th><th>Best For</th><th>Notes</th></tr></thead>
+ <tbody>
+ <tr><td><strong>CPU</strong></td><td>Scikit-learn, small tabular</td><td>Cheapest, no GPU parallelism</td></tr>
+ <tr><td><strong>GPU (T4, A100, V100)</strong></td><td>Deep learning, NLP, CV</td><td>10-100x faster than CPU for DL</td></tr>
+ <tr><td><strong>TPU v3, v4</strong></td><td>TensorFlow large-scale training</td><td>Google-specific; very fast for TF/JAX</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> TPUs are Google-specific hardware optimized for TensorFlow and JAX; GPUs work with all frameworks. TPUs are most cost-effective for very large TF models, while GPUs are more versatile. The exam may ask for the "most cost-effective option for large-scale TensorFlow training" → TPU.</p>
+ </blockquote>
+
+ <h2 id="distributed-training"><strong>3. Distributed Training on Vertex AI</strong></h2>
+
+ <table>
+ <thead><tr><th>Strategy</th><th>Description</th><th>Use Case</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Data Parallelism</strong></td><td>Split data across workers, same model</td><td>Most DL training scenarios</td></tr>
+ <tr><td><strong>Model Parallelism</strong></td><td>Split model layers across workers</td><td>Model too large for one GPU</td></tr>
+ <tr><td><strong>MirroredStrategy (TF)</strong></td><td>Multi-GPU, single machine</td><td>Single node, multiple GPUs</td></tr>
+ <tr><td><strong>MultiWorkerMirroredStrategy</strong></td><td>Multi-GPU, multi-machine</td><td>Cluster training</td></tr>
+ <tr><td><strong>ParameterServerStrategy</strong></td><td>Async updates via parameter server</td><td>Very large models (legacy)</td></tr>
+ </tbody>
+ </table>
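+
+ <p>A minimal TensorFlow sketch of synchronous data parallelism on one machine. Swapping in <code>MultiWorkerMirroredStrategy</code> extends the same pattern to a cluster (Vertex AI sets <code>TF_CONFIG</code> for each worker).</p>
+
+ <pre><code class="language-python">import tensorflow as tf
+
+ strategy = tf.distribute.MirroredStrategy()  # uses all GPUs on this machine
+
+ with strategy.scope():  # variables created here are mirrored on each replica
+     model = tf.keras.Sequential([
+         tf.keras.layers.Dense(64, activation="relu"),
+         tf.keras.layers.Dense(1),
+     ])
+     model.compile(optimizer="adam", loss="mse")
+
+ # model.fit(...) then all-reduces gradients across replicas every step
+ </code></pre>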
+
+ <h2 id="automl"><strong>4. Vertex AI AutoML</strong></h2>
+
+ <table>
+ <thead><tr><th>AutoML Type</th><th>Input Data</th><th>Supported Tasks</th></tr></thead>
+ <tbody>
+ <tr><td><strong>AutoML Tabular</strong></td><td>CSV, BigQuery table</td><td>Classification, Regression, Forecasting</td></tr>
+ <tr><td><strong>AutoML Image</strong></td><td>JPEG, PNG, BMP</td><td>Classification (single/multi), Object Detection, Segmentation</td></tr>
+ <tr><td><strong>AutoML Text</strong></td><td>Text documents</td><td>Classification, Entity Extraction, Sentiment</td></tr>
+ <tr><td><strong>AutoML Video</strong></td><td>MP4, AVI, MOV</td><td>Classification, Object Detection, Action Recognition</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="hyperparameter-tuning"><strong>5. Vertex AI Hyperparameter Tuning</strong></h2>
+
+ <p>Vertex AI hyperparameter tuning automatically searches for the best hyperparameter combinations.</p>
+
+ <table>
+ <thead><tr><th>Search Algorithm</th><th>Description</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Grid Search</strong></td><td>Exhaustive and expensive; only for small search spaces</td></tr>
+ <tr><td><strong>Random Search</strong></td><td>Random sampling; often beats grid search</td></tr>
+ <tr><td><strong>Bayesian Optimization</strong></td><td>Guided search using a Gaussian process; most efficient</td></tr>
+ </tbody>
+ </table>
+
+ <pre><code class="language-text">HPT Job Setup:
+
+ hyperparameters:
+   - parameter_id: learning_rate
+     type: DOUBLE
+     min_value: 0.0001
+     max_value: 0.1
+     scale: LOG          ← log scale for LR
+
+   - parameter_id: batch_size
+     type: DISCRETE
+     values: [32, 64, 128, 256]
+
+ metric:
+   metric_id: val_accuracy
+   goal: MAXIMIZE
+
+ max_trial_count: 50
+ parallel_trial_count: 5
+ </code></pre>
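+
+ <p>A hedged sketch of the same job with the Vertex AI SDK; the container image and worker pool spec are illustrative assumptions, and the training container is expected to report <code>val_accuracy</code> (e.g. via the hypertune library).</p>
+
+ <pre><code class="language-python">from google.cloud import aiplatform
+ from google.cloud.aiplatform import hyperparameter_tuning as hpt
+
+ aiplatform.init(project="my-project", location="us-central1",
+                 staging_bucket="gs://my-bucket")
+
+ custom_job = aiplatform.CustomJob(
+     display_name="trial-job",
+     worker_pool_specs=[{
+         "machine_spec": {"machine_type": "n1-standard-8"},
+         "replica_count": 1,
+         "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
+     }],
+ )
+
+ hpt_job = aiplatform.HyperparameterTuningJob(
+     display_name="hpt-demo",
+     custom_job=custom_job,
+     metric_spec={"val_accuracy": "maximize"},
+     parameter_spec={
+         "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=0.1, scale="log"),
+         "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128, 256], scale="linear"),
+     },
+     max_trial_count=50,
+     parallel_trial_count=5,  # Bayesian optimization guides trials by default
+ )
+ hpt_job.run()
+ </code></pre>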
+
+ <h2 id="practice"><strong>6. Practice Questions</strong></h2>
+
+ <p><strong>Q1:</strong> A team wants to train a custom TensorFlow model across multiple machines with 8 GPUs each. They want gradients synchronized across all workers without a parameter server. Which TensorFlow distribution strategy should they use?</p>
+ <ul>
+ <li>A) MirroredStrategy</li>
+ <li>B) MultiWorkerMirroredStrategy ✓</li>
+ <li>C) ParameterServerStrategy</li>
+ <li>D) TPUStrategy</li>
+ </ul>
+ <p><em>Explanation: MultiWorkerMirroredStrategy enables synchronous data-parallel training across multiple machines, each with multiple GPUs. MirroredStrategy is single-machine multi-GPU only. ParameterServerStrategy uses asynchronous updates. TPUStrategy is for TPU pods.</em></p>
+
+ <p><strong>Q2:</strong> A company needs to train an image classification model but their team has no deep learning expertise. They have 5,000 labeled product images. Which Vertex AI option requires the LEAST ML expertise?</p>
+ <ul>
+ <li>A) Vertex AI Custom Training with TensorFlow CNN</li>
+ <li>B) Vertex AI AutoML Image Classification ✓</li>
+ <li>C) Dataproc Spark ML</li>
+ <li>D) BigQuery ML</li>
+ </ul>
+ <p><em>Explanation: AutoML Image Classification handles architecture selection, hyperparameter tuning, and training automatically. The team just needs to upload labeled images and specify the task. No code or deep learning expertise is required.</em></p>
+
+ <p><strong>Q3:</strong> Which hyperparameter search strategy is MOST efficient when evaluating expensive-to-train deep learning models with a large search space?</p>
+ <ul>
+ <li>A) Grid Search — tests all combinations</li>
+ <li>B) Random Search — samples uniformly</li>
+ <li>C) Bayesian Optimization — uses past trial results to guide the search ✓</li>
+ <li>D) Manual tuning — expert selects parameters</li>
+ </ul>
+ <p><em>Explanation: Bayesian Optimization builds a probabilistic model of the objective function using Gaussian processes to intelligently select the next hyperparameter configuration to evaluate, based on past trial results. It finds good configurations with far fewer trials than grid or random search.</em></p>
@@ -0,0 +1,141 @@
+ ---
+ id: 019c9619-lt03-l06
+ title: 'Lesson 6: BigQuery ML & TensorFlow on GCP'
+ slug: bai-6-bigquery-ml-tensorflow
+ description: >-
+   BigQuery ML: CREATE MODEL syntax, supported models.
+   TensorFlow Extended (TFX) pipeline components.
+   TF Serving, TFLite. Model optimization techniques.
+ duration_minutes: 60
+ is_free: true
+ video_url: null
+ sort_order: 6
+ section_title: "Part 3: Model Development on Vertex AI"
+ course:
+   id: 019c9619-lt03-7003-c003-lt0300000003
+   title: 'Google Cloud Professional Machine Learning Engineer Exam Prep'
+   slug: luyen-thi-gcp-ml-engineer
+ ---
+
+ <div style="text-align: center; margin: 2rem 0;">
+ <img src="/storage/uploads/2026/04/gcp-mle-bai6-bqml-tfx.png" alt="BigQuery ML & TFX Pipeline" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+ <p><em>BigQuery ML and the TFX pipeline: training models with SQL, model optimization, and production ML pipelines</em></p>
+ </div>
+
+ <h2 id="bigquery-ml"><strong>1. BigQuery ML (BQML)</strong></h2>
+
+ <p>BigQuery ML lets data analysts train and serve ML models with SQL, directly inside BigQuery: no data export, no ML framework expertise required.</p>
+
+ <pre><code class="language-text">BigQuery ML Workflow:
+
+ 1. CREATE MODEL          → train
+ 2. ML.EVALUATE()         → evaluate metrics
+ 3. ML.PREDICT()          → generate predictions
+ 4. ML.EXPLAIN_PREDICT()  → SHAP-based explanations
+ 5. EXPORT MODEL          → export to Cloud Storage (TF SavedModel format)
+ </code></pre>
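+
+ <p>A hedged end-to-end sketch using the BigQuery Python client; the dataset, table, and column names are assumptions. The SQL is the part that matters and runs unchanged in the BigQuery console.</p>
+
+ <pre><code class="language-python">from google.cloud import bigquery
+
+ client = bigquery.Client()
+
+ # 1. CREATE MODEL: train a logistic regression classifier in SQL
+ client.query("""
+ CREATE OR REPLACE MODEL `my-project.demo.churn_model`
+ OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned']) AS
+ SELECT churned, tenure_months, monthly_spend, plan_type
+ FROM `my-project.demo.customers`
+ """).result()  # blocks until training finishes
+
+ # 2. ML.EVALUATE: inspect precision, recall, roc_auc, ...
+ for row in client.query(
+     "SELECT * FROM ML.EVALUATE(MODEL `my-project.demo.churn_model`)"
+ ).result():
+     print(dict(row))
+ </code></pre>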
+
+ <table>
+ <thead><tr><th>Model Type</th><th>BQML Option</th><th>Task</th></tr></thead>
+ <tbody>
+ <tr><td>Linear Regression</td><td>LINEAR_REG</td><td>Regression</td></tr>
+ <tr><td>Logistic Regression</td><td>LOGISTIC_REG</td><td>Binary/Multiclass classification</td></tr>
+ <tr><td>K-Means</td><td>KMEANS</td><td>Clustering</td></tr>
+ <tr><td>XGBoost</td><td>BOOSTED_TREE_CLASSIFIER / BOOSTED_TREE_REGRESSOR</td><td>Tabular classification/regression</td></tr>
+ <tr><td>Random Forest</td><td>RANDOM_FOREST_CLASSIFIER / RANDOM_FOREST_REGRESSOR</td><td>Tabular classification/regression</td></tr>
+ <tr><td>DNN</td><td>DNN_CLASSIFIER / DNN_REGRESSOR</td><td>Complex patterns</td></tr>
+ <tr><td>Wide &amp; Deep</td><td>WIDE_AND_DEEP_CLASSIFIER</td><td>Recommendations (memorization + generalization)</td></tr>
+ <tr><td>AutoML</td><td>AUTOML_CLASSIFIER / AUTOML_REGRESSOR</td><td>Automated model selection</td></tr>
+ <tr><td>Time Series</td><td>ARIMA_PLUS</td><td>Forecasting</td></tr>
+ <tr><td>Matrix Factorization</td><td>MATRIX_FACTORIZATION</td><td>Collaborative filtering</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> BQML ARIMA_PLUS automatically handles seasonality, holiday effects, and trend decomposition. When a question asks you to "forecast using BigQuery data" → ARIMA_PLUS. When it asks for a "recommendation system in BigQuery" → MATRIX_FACTORIZATION.</p>
+ </blockquote>
+
+ <h2 id="tfx"><strong>2. TensorFlow Extended (TFX)</strong></h2>
+
+ <p>TFX is a production ML pipeline library for TensorFlow. It provides standard components for each step of the ML lifecycle.</p>
+
+ <table>
+ <thead><tr><th>TFX Component</th><th>Purpose</th></tr></thead>
+ <tbody>
+ <tr><td><strong>ExampleGen</strong></td><td>Ingest data from CSV, BigQuery, Avro, Parquet</td></tr>
+ <tr><td><strong>StatisticsGen</strong></td><td>Compute statistics over the training data</td></tr>
+ <tr><td><strong>SchemaGen</strong></td><td>Infer a schema from the statistics</td></tr>
+ <tr><td><strong>ExampleValidator</strong></td><td>Detect anomalies: missing values, distribution skew</td></tr>
+ <tr><td><strong>Transform</strong></td><td>Feature engineering (Apache Beam-based)</td></tr>
+ <tr><td><strong>Trainer</strong></td><td>Train the TF model (EvalSpec + TrainSpec)</td></tr>
+ <tr><td><strong>Tuner</strong></td><td>Hyperparameter tuning (KerasTuner)</td></tr>
+ <tr><td><strong>Evaluator</strong></td><td>Evaluate the model against a baseline</td></tr>
+ <tr><td><strong>ModelValidator</strong></td><td>Validate that the model meets quality thresholds</td></tr>
+ <tr><td><strong>Pusher</strong></td><td>Push the model to serving (TF Serving, Vertex AI)</td></tr>
+ </tbody>
+ </table>
+
+ <pre><code class="language-text">TFX Pipeline (simplified):
+
+ ExampleGen → StatisticsGen → SchemaGen → ExampleValidator
+                     ↓
+ Transform (feature engineering)
+                     ↓
+ Trainer (model training)
+                     ↓
+ Evaluator (metrics vs baseline)
+                     ↓ (if pass)
+ Pusher → TF Serving / Vertex AI Endpoint
+ </code></pre>
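+
+ <p>A hedged TFX sketch of the data-validation front half of this pipeline, run locally; the input path and pipeline root are assumptions.</p>
+
+ <pre><code class="language-python">from tfx import v1 as tfx
+
+ example_gen = tfx.components.CsvExampleGen(input_base="data/")
+ statistics_gen = tfx.components.StatisticsGen(
+     examples=example_gen.outputs["examples"])
+ schema_gen = tfx.components.SchemaGen(
+     statistics=statistics_gen.outputs["statistics"])
+ example_validator = tfx.components.ExampleValidator(  # flags anomalies/skew
+     statistics=statistics_gen.outputs["statistics"],
+     schema=schema_gen.outputs["schema"])
+
+ pipeline = tfx.dsl.Pipeline(
+     pipeline_name="demo",
+     pipeline_root="/tmp/tfx-demo",
+     components=[example_gen, statistics_gen, schema_gen, example_validator],
+ )
+ tfx.orchestration.LocalDagRunner().run(pipeline)
+ </code></pre>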
+
+ <h2 id="tf-serving"><strong>3. TF Serving & TFLite</strong></h2>
+
+ <table>
+ <thead><tr><th>Option</th><th>Use Case</th></tr></thead>
+ <tbody>
+ <tr><td><strong>TF Serving</strong></td><td>High-performance serving on servers/cloud (gRPC or REST)</td></tr>
+ <tr><td><strong>TFLite</strong></td><td>Mobile devices, edge devices, microcontrollers</td></tr>
+ <tr><td><strong>TF.js</strong></td><td>Browser-based inference</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="model-optimization"><strong>4. Model Optimization Techniques</strong></h2>
+
+ <table>
+ <thead><tr><th>Technique</th><th>Description</th><th>Trade-off</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Quantization</strong></td><td>Float32 → INT8 weights</td><td>4x smaller, ~2x faster, slight accuracy loss</td></tr>
+ <tr><td><strong>Pruning</strong></td><td>Remove low-weight connections</td><td>Smaller model, preserves accuracy</td></tr>
+ <tr><td><strong>Knowledge Distillation</strong></td><td>Train a small "student" model from a large "teacher"</td><td>Smaller + faster, slight accuracy loss</td></tr>
+ <tr><td><strong>TensorRT</strong></td><td>NVIDIA GPU optimization (layer fusion)</td><td>3-5x inference speedup on NVIDIA GPUs</td></tr>
+ </tbody>
+ </table>
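+
+ <p>A minimal sketch of post-training quantization to TFLite for the mobile case; <code>saved_model/</code> is an assumed path.</p>
+
+ <pre><code class="language-python">import tensorflow as tf
+
+ converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
+ converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables quantization
+ tflite_model = converter.convert()  # weights stored as INT8, roughly 4x smaller
+
+ with open("model_int8.tflite", "wb") as f:
+     f.write(tflite_model)
+ </code></pre>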
+
+ <h2 id="practice"><strong>5. Practice Questions</strong></h2>
+
+ <p><strong>Q1:</strong> A data analyst team needs to build a sales forecasting model on data already in BigQuery. They are comfortable with SQL but have no Python/ML framework experience. Which BigQuery ML model type should they use for time series forecasting?</p>
+ <ul>
+ <li>A) KMEANS</li>
+ <li>B) LOGISTIC_REG</li>
+ <li>C) ARIMA_PLUS ✓</li>
+ <li>D) MATRIX_FACTORIZATION</li>
+ </ul>
+ <p><em>Explanation: BigQuery ML ARIMA_PLUS is designed for time series forecasting and automatically handles seasonality, trend, and holiday effects. It can be trained with a simple CREATE MODEL statement in SQL, requiring no Python expertise.</em></p>
+
+ <p><strong>Q2:</strong> A TFX pipeline detects that the distribution of the "age" feature in new production data differs significantly from the training data distribution. Which TFX component is responsible for detecting this anomaly?</p>
+ <ul>
+ <li>A) StatisticsGen</li>
+ <li>B) SchemaGen</li>
+ <li>C) ExampleValidator ✓</li>
+ <li>D) Transform</li>
+ </ul>
+ <p><em>Explanation: ExampleValidator compares data statistics against the expected schema and flags anomalies, including distribution skew (a significant difference between training and serving data distributions). StatisticsGen computes statistics; SchemaGen creates the schema; Transform does feature engineering.</em></p>
+
+ <p><strong>Q3:</strong> A team needs to deploy a TensorFlow image classification model to mobile devices with limited compute resources. They need to reduce model size by 4x with minimal accuracy loss. Which technique should they apply?</p>
+ <ul>
+ <li>A) Knowledge Distillation</li>
+ <li>B) Model Pruning</li>
+ <li>C) Post-training quantization (INT8) ✓</li>
+ <li>D) TensorRT optimization</li>
+ </ul>
+ <p><em>Explanation: Post-training quantization converts Float32 weights to INT8, reducing model size by approximately 4x and improving inference speed by about 2x, with minimal accuracy loss for most models. TFLite supports INT8 quantization for mobile/edge deployment. TensorRT targets NVIDIA GPUs, not mobile.</em></p>