@xdev-asia/xdev-knowledge-mcp 1.0.43 → 1.0.45

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. package/content/pages/xoa-du-lieu-nguoi-dung.md +68 -0
  2. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/01-bai-1-data-repositories-ingestion.md +5 -0
  3. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/02-bai-2-data-transformation.md +5 -0
  4. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/01-phan-1-data-engineering/lessons/03-bai-3-data-analysis.md +159 -0
  5. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/04-bai-4-sagemaker-built-in-algorithms.md +186 -0
  6. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/05-bai-5-training-hyperparameter-tuning.md +159 -0
  7. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/02-phan-2-modeling/lessons/06-bai-6-model-evaluation.md +169 -0
  8. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/07-bai-7-model-deployment.md +193 -0
  9. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/08-bai-8-model-monitoring-mlops.md +184 -0
  10. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/03-phan-3-implementation-operations/lessons/09-bai-9-security-cost.md +166 -0
  11. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/10-bai-10-bai-toan-thuong-gap.md +181 -0
  12. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/11-bai-11-cheat-sheet.md +110 -0
  13. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/chapters/04-phan-4-on-tap/lessons/12-bai-12-chien-luoc-thi.md +113 -0
  14. package/content/series/luyen-thi/luyen-thi-aws-ml-specialty/index.md +1 -1
  15. package/content/series/luyen-thi/luyen-thi-cka/chapters/01-cluster-architecture/lessons/01-kien-truc-cka-kubeadm.md +133 -0
  16. package/content/series/luyen-thi/luyen-thi-cka/chapters/01-cluster-architecture/lessons/02-cluster-upgrade-kubeadm.md +147 -0
  17. package/content/series/luyen-thi/luyen-thi-cka/chapters/01-cluster-architecture/lessons/03-rbac-cka.md +152 -0
  18. package/content/series/luyen-thi/luyen-thi-cka/chapters/02-workloads-scheduling/lessons/04-deployments-daemonsets-statefulsets.md +186 -0
  19. package/content/series/luyen-thi/luyen-thi-cka/chapters/02-workloads-scheduling/lessons/05-scheduling-taints-affinity.md +163 -0
  20. package/content/series/luyen-thi/luyen-thi-cka/chapters/03-services-networking/lessons/06-services-endpoints-coredns.md +145 -0
  21. package/content/series/luyen-thi/luyen-thi-cka/chapters/03-services-networking/lessons/07-ingress-networkpolicies-cni.md +172 -0
  22. package/content/series/luyen-thi/luyen-thi-cka/chapters/04-storage/lessons/08-persistent-volumes-storageclass.md +159 -0
  23. package/content/series/luyen-thi/luyen-thi-cka/chapters/05-troubleshooting/lessons/09-etcd-backup-restore.md +149 -0
  24. package/content/series/luyen-thi/luyen-thi-cka/chapters/05-troubleshooting/lessons/10-troubleshooting-nodes.md +153 -0
  25. package/content/series/luyen-thi/luyen-thi-cka/chapters/05-troubleshooting/lessons/11-troubleshooting-workloads.md +146 -0
  26. package/content/series/luyen-thi/luyen-thi-cka/chapters/05-troubleshooting/lessons/12-troubleshooting-networking-exam.md +170 -0
  27. package/content/series/luyen-thi/luyen-thi-cka/index.md +217 -0
  28. package/content/series/luyen-thi/luyen-thi-ckad/chapters/01-app-design-build/lessons/01-multi-container-pods.md +146 -0
  29. package/content/series/luyen-thi/luyen-thi-ckad/chapters/01-app-design-build/lessons/02-jobs-cronjobs-resources.md +174 -0
  30. package/content/series/luyen-thi/luyen-thi-ckad/chapters/02-app-deployment/lessons/03-rolling-updates-rollbacks.md +148 -0
  31. package/content/series/luyen-thi/luyen-thi-ckad/chapters/02-app-deployment/lessons/04-helm-kustomize.md +181 -0
  32. package/content/series/luyen-thi/luyen-thi-ckad/chapters/03-app-observability/lessons/05-probes-logging-debugging.md +183 -0
  33. package/content/series/luyen-thi/luyen-thi-ckad/chapters/04-app-environment-config/lessons/06-configmaps-secrets.md +182 -0
  34. package/content/series/luyen-thi/luyen-thi-ckad/chapters/04-app-environment-config/lessons/07-securitycontext-pod-security.md +168 -0
  35. package/content/series/luyen-thi/luyen-thi-ckad/chapters/04-app-environment-config/lessons/08-resources-qos.md +168 -0
  36. package/content/series/luyen-thi/luyen-thi-ckad/chapters/05-services-networking/lessons/09-services-ingress.md +182 -0
  37. package/content/series/luyen-thi/luyen-thi-ckad/chapters/05-services-networking/lessons/10-networkpolicies-exam-strategy.md +236 -0
  38. package/content/series/luyen-thi/luyen-thi-ckad/index.md +199 -0
  39. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/01-bai-1-framing-ml-problems.md +136 -0
  40. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/01-phan-1-problem-framing/lessons/02-bai-2-gcp-ai-ml-ecosystem.md +160 -0
  41. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/03-bai-3-data-pipeline.md +174 -0
  42. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/02-phan-2-data-engineering/lessons/04-bai-4-feature-engineering.md +156 -0
  43. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/05-bai-5-vertex-ai-training.md +155 -0
  44. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/03-phan-3-model-development/lessons/06-bai-6-bigquery-ml-tensorflow.md +141 -0
  45. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/07-bai-7-model-deployment.md +134 -0
  46. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/04-phan-4-deployment-mlops/lessons/08-bai-8-vertex-ai-pipelines-mlops.md +149 -0
  47. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/09-bai-9-responsible-ai.md +128 -0
  48. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/chapters/05-phan-5-responsible-ai/lessons/10-bai-10-cheat-sheet-chien-luoc-thi.md +108 -0
  49. package/content/series/luyen-thi/luyen-thi-gcp-ml-engineer/index.md +1 -1
  50. package/content/series/luyen-thi/luyen-thi-kcna/chapters/01-kubernetes-fundamentals/lessons/01-kien-truc-kubernetes.md +137 -0
  51. package/content/series/luyen-thi/luyen-thi-kcna/chapters/01-kubernetes-fundamentals/lessons/02-pods-workloads-controllers.md +142 -0
  52. package/content/series/luyen-thi/luyen-thi-kcna/chapters/01-kubernetes-fundamentals/lessons/03-services-networking-storage.md +155 -0
  53. package/content/series/luyen-thi/luyen-thi-kcna/chapters/01-kubernetes-fundamentals/lessons/04-rbac-security.md +137 -0
  54. package/content/series/luyen-thi/luyen-thi-kcna/chapters/02-container-orchestration/lessons/05-container-runtimes-oci.md +137 -0
  55. package/content/series/luyen-thi/luyen-thi-kcna/chapters/02-container-orchestration/lessons/06-orchestration-patterns.md +147 -0
  56. package/content/series/luyen-thi/luyen-thi-kcna/chapters/03-cloud-native-architecture/lessons/07-cloud-native-architecture.md +143 -0
  57. package/content/series/luyen-thi/luyen-thi-kcna/chapters/04-observability-delivery/lessons/08-observability.md +143 -0
  58. package/content/series/luyen-thi/luyen-thi-kcna/chapters/04-observability-delivery/lessons/09-helm-gitops-cicd.md +162 -0
  59. package/content/series/luyen-thi/luyen-thi-kcna/index.md +168 -0
  60. package/data/quizzes.json +1059 -0
  61. package/package.json +1 -1
@@ -0,0 +1,136 @@
+ ---
+ id: 019c9619-lt03-l01
+ title: 'Bài 1: Framing ML Problems — Supervised, Unsupervised, RL'
+ slug: bai-1-framing-ml-problems
+ description: >-
+   How to decide whether a problem needs ML at all. Choosing the right
+   model type. Business metrics vs ML metrics. Data availability assessment.
+   Google's ML best practices.
+ duration_minutes: 50
+ is_free: true
+ video_url: null
+ sort_order: 1
+ section_title: "Phần 1: ML Problem Framing & Architecture"
+ course:
+   id: 019c9619-lt03-7003-c003-lt0300000003
+   title: 'Luyện thi Google Cloud Professional Machine Learning Engineer'
+   slug: luyen-thi-gcp-ml-engineer
+ ---
+
+ <div style="text-align: center; margin: 2rem 0;">
+ <img src="/storage/uploads/2026/04/gcp-mle-bai1-problem-framing.png" alt="ML Problem Framing Framework" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+ <p><em>ML Problem Framing: define the problem, choose the model type, and pick metrics following Google's guidance</em></p>
+ </div>
+
+ <h2 id="when-to-use-ml"><strong>1. When Do You Need ML?</strong></h2>
+
+ <p>The Google ML certification frequently asks about <strong>problem framing</strong>: determining whether a problem is a good fit for ML and, if it is, which type of ML to use. This is a core skill for a professional ML Engineer.</p>
+
+ <table>
+ <thead><tr><th>Question to ask</th><th>If "Yes"</th><th>If "No"</th></tr></thead>
+ <tbody>
+ <tr><td>Is there a complex pattern in the data?</td><td>ML can help</td><td>Rules-based logic is enough</td></tr>
+ <tr><td>Is there enough data (with labels)?</td><td>Supervised Learning</td><td>Unsupervised, or collect more data</td></tr>
+ <tr><td>Can the output be defined clearly?</td><td>Supervised ML</td><td>Clarify with stakeholders</td></tr>
+ <tr><td>Does the problem require an agent interacting with an environment?</td><td>Reinforcement Learning</td><td>Supervised/Unsupervised</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="ml-types"><strong>2. ML Types and When to Use Each</strong></h2>
+
+ <pre><code class="language-text">Problem Framing Decision Tree:
+
+ Has labeled training data?
+   YES → Supervised Learning
+         ├── Output is a category? → Classification
+         └── Output is a number?   → Regression
+
+   NO → Has examples, but no labels?
+         YES → Unsupervised Learning
+               ├── Find groups?             → Clustering
+               └── Find patterns/anomalies? → Density estimation
+         NO → Agent in an environment?
+               YES → Reinforcement Learning
+               NO  → Reconsider the problem definition
+ </code></pre>
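As a minimal sketch, the decision tree above can be written as a helper function. The function name and flags below are invented for this lesson, not part of any GCP API:

```python
# Minimal sketch of the problem-framing decision tree above.
# frame_ml_problem and its flags are illustrative, not a GCP API.
def frame_ml_problem(has_labels: bool,
                     output_is_category: bool = True,
                     has_unlabeled_examples: bool = False,
                     agent_in_environment: bool = False) -> str:
    """Map a coarse problem description to a candidate ML approach."""
    if has_labels:
        # Supervised learning: the output type picks the task.
        return "classification" if output_is_category else "regression"
    if has_unlabeled_examples:
        return "unsupervised (clustering / density estimation)"
    if agent_in_environment:
        return "reinforcement learning"
    return "reconsider problem definition"

print(frame_ml_problem(True, output_is_category=False))    # regression
print(frame_ml_problem(False, agent_in_environment=True))  # reinforcement learning
```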
+
+ <table>
+ <thead><tr><th>ML Type</th><th>When to Use</th><th>GCP Services</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Supervised Classification</strong></td><td>Email spam, image labels, churn prediction</td><td>Vertex AI AutoML, BigQuery ML</td></tr>
+ <tr><td><strong>Supervised Regression</strong></td><td>Price prediction, demand forecast</td><td>Vertex AI, BigQuery ML (LINEAR_REG, BOOSTED_TREE_REGRESSOR)</td></tr>
+ <tr><td><strong>Unsupervised Clustering</strong></td><td>Customer segmentation, topic discovery</td><td>BigQuery ML (KMEANS), Vertex AI Custom Training</td></tr>
+ <tr><td><strong>Reinforcement Learning</strong></td><td>Game agents, robotics, ad bidding</td><td>Vertex AI + custom environment</td></tr>
+ <tr><td><strong>Self-supervised</strong></td><td>LLMs, foundation models</td><td>Vertex AI Model Garden</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="business-vs-ml-metrics"><strong>3. Business Metrics vs. ML Metrics</strong></h2>
+
+ <p>One of the most common mistakes is <strong>optimizing the wrong metric</strong>. The ML objective must align with the business objective.</p>
+
+ <table>
+ <thead><tr><th>Business Goal</th><th>Wrong ML Metric</th><th>Correct ML Metric</th></tr></thead>
+ <tbody>
+ <tr><td>Reduce revenue lost to fraud</td><td>Accuracy (99%!)</td><td>Recall (catch as many fraud cases as possible)</td></tr>
+ <tr><td>Filter spam without hurting user experience</td><td>Recall</td><td>Precision (few false positives)</td></tr>
+ <tr><td>Forecast inventory demand</td><td>MSE</td><td>MAPE (scale-independent)</td></tr>
+ <tr><td>Rank products in search</td><td>Accuracy</td><td>NDCG, MRR (ranking metrics)</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> The Professional ML Engineer exam often asks "which metric BEST aligns with the business objective". Fraud or medical diagnosis → Recall. Spam or other precision-critical tasks → Precision. Class imbalance → F1 or AUC-ROC.</p>
+ </blockquote>
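A quick numeric check of the tip above, using assumed counts: on a dataset with a 1% positive rate, a model that never predicts the positive class still scores 99% accuracy while its recall is zero.

```python
# Why accuracy misleads on rare-positive problems (assumed numbers,
# mirroring the 1%-prevalence medical example in this lesson).
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# 1000 cases, 10 positives; a model that always predicts "negative":
tp, fp, tn, fn = 0, 0, 990, 10
print(accuracy(tp, fp, tn, fn))  # 0.99 -> looks great
print(recall(tp, fn))            # 0.0  -> misses every positive case
```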
+
+ <h2 id="data-assessment"><strong>4. Data Availability Assessment</strong></h2>
+
+ <table>
+ <thead><tr><th>Data Situation</th><th>ML Approach</th></tr></thead>
+ <tbody>
+ <tr><td>Plenty of labeled data</td><td>Fully supervised, train from scratch</td></tr>
+ <tr><td>Little labeled data (&lt;1000 examples)</td><td><strong>Transfer Learning</strong> (pre-trained + fine-tune)</td></tr>
+ <tr><td>No labels</td><td>Unsupervised, or collect labels (Vertex AI Data Labeling)</td></tr>
+ <tr><td>Labels are expensive</td><td><strong>Active Learning</strong>: label the most uncertain samples first</td></tr>
+ <tr><td>Imbalanced data</td><td>Oversampling, undersampling, class weights</td></tr>
+ </tbody>
+ </table>
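For the imbalanced-data row, one common recipe is inverse-frequency class weights. A minimal sketch follows; the formula n / (k * count) matches scikit-learn's "balanced" heuristic, but other frameworks use different formulas:

```python
# Inverse-frequency class weights for an imbalanced dataset.
# One common recipe (n / (k * count)); frameworks differ in the exact formula.
from collections import Counter

def class_weights(labels):
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # Rarer classes get proportionally larger weights.
    return {c: n / (k * cnt) for c, cnt in counts.items()}

w = class_weights([0] * 90 + [1] * 10)  # 90/10 class split
print(w[1])  # 5.0 -> the minority class is upweighted
```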
+
+ <h2 id="google-ml-practices"><strong>5. Google's ML Best Practices</strong></h2>
+
+ <ul>
+ <li><strong>Start simple</strong>: Begin with the simplest possible model, then add complexity gradually</li>
+ <li><strong>Establish baseline</strong>: Compare against a heuristic/rules baseline before turning to ML</li>
+ <li><strong>Data quality first</strong>: Data preparation takes up most of an ML project (often ~80% of the time)</li>
+ <li><strong>Reproducibility</strong>: The pipeline must be reproducible given the same data</li>
+ <li><strong>Monitor in production</strong>: Models decay over time, so continuous monitoring is required</li>
+ </ul>
+
+ <h2 id="practice"><strong>6. Practice Questions</strong></h2>
+
+ <p><strong>Q1:</strong> A company wants to identify which of its customers are most likely to cancel their subscription in the next 30 days. They have 3 years of historical customer behavior data with known churn events. Which ML approach should they use?</p>
+ <ul>
+ <li>A) Unsupervised clustering to find customer groups</li>
+ <li>B) Reinforcement learning to optimize retention campaigns</li>
+ <li>C) Supervised binary classification with historical churn labels ✓</li>
+ <li>D) Anomaly detection to find unusual behavior</li>
+ </ul>
+ <p><em>Explanation: This is a classic supervised classification problem (churn = yes/no). Historical data with known outcomes (churned/not churned) provides the labels needed. Clustering would not predict individual churn probability. RL is for sequential decision making, not prediction.</em></p>
+
+ <p><strong>Q2:</strong> A medical imaging ML model achieves 98% accuracy on test data but the business team is unsatisfied. The task is detecting rare cancer cells (1% prevalence). What is the most likely issue?</p>
+ <ul>
+ <li>A) The model is overfitting to training data</li>
+ <li>B) Accuracy is the wrong metric — the model may be predicting "no cancer" for everything ✓</li>
+ <li>C) The model needs more training iterations</li>
+ <li>D) The test dataset is too small</li>
+ </ul>
+ <p><em>Explanation: With 1% prevalence, a model always predicting "no cancer" achieves 99% accuracy but has 0% recall — it misses every cancer case. For rare class problems, Recall (sensitivity) is the critical metric, not accuracy.</em></p>
+
+ <p><strong>Q3:</strong> A startup has 500 labeled product images for a new custom classification task. Which training approach is MOST appropriate?</p>
+ <ul>
+ <li>A) Train a deep learning CNN from scratch on the 500 images</li>
+ <li>B) Use AutoML Tabular on the image metadata</li>
+ <li>C) Use Transfer Learning from a pre-trained image model ✓</li>
+ <li>D) Apply K-Means clustering since the dataset is too small</li>
+ </ul>
+ <p><em>Explanation: With only 500 labeled examples, training from scratch would overfit severely. Transfer Learning reuses features from a model pre-trained on millions of images (e.g., ImageNet), requiring far less data to achieve good accuracy on the new task.</em></p>
@@ -0,0 +1,160 @@
+ ---
+ id: 019c9619-lt03-l02
+ title: 'Bài 2: GCP AI/ML Ecosystem Overview'
+ slug: bai-2-gcp-ai-ml-ecosystem
+ description: >-
+   An overview of the Vertex AI platform. AutoML vs Custom Training.
+   BigQuery ML. Pre-trained APIs (Vision, NLP, Translation).
+   Which service to use when: a decision tree.
+ duration_minutes: 50
+ is_free: true
+ video_url: null
+ sort_order: 2
+ section_title: "Phần 1: ML Problem Framing & Architecture"
+ course:
+   id: 019c9619-lt03-7003-c003-lt0300000003
+   title: 'Luyện thi Google Cloud Professional Machine Learning Engineer'
+   slug: luyen-thi-gcp-ml-engineer
+ ---
+
+ <div style="text-align: center; margin: 2rem 0;">
+ <img src="/storage/uploads/2026/04/gcp-mle-bai2-gcp-ecosystem.png" alt="GCP AI/ML Ecosystem" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+ <p><em>The GCP AI/ML ecosystem: Vertex AI, AutoML, BigQuery ML, pre-trained APIs, and when to use which</em></p>
+ </div>
+
+ <h2 id="gcp-ml-landscape"><strong>1. GCP ML Landscape Overview</strong></h2>
+
+ <pre><code class="language-text">GCP ML Capability Spectrum:
+
+ LOW CODE ◄────────────────────────────────────► HIGH CONTROL
+        │                    │                    │
+        ▼                    ▼                    ▼
+ Pre-trained APIs    Vertex AI AutoML     Custom Training
+ (Vision, NLP,       (no code needed,     (full control,
+  Translation)        you bring data)      you bring code)
+        │                    │                    │
+ No ML expertise     Some domain          ML expertise
+ needed              expertise            required
+
+ BigQuery ML ────── SQL interface for ML on warehouse data
+ </code></pre>
+
+ <h2 id="vertex-ai"><strong>2. Vertex AI — Unified ML Platform</strong></h2>
+
+ <p>Vertex AI is GCP's unified platform for the entire ML lifecycle. Knowing its components well is mandatory for the exam.</p>
+
+ <table>
+ <thead><tr><th>Component</th><th>Purpose</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Vertex AI Workbench</strong></td><td>Managed Jupyter notebooks for data scientists</td></tr>
+ <tr><td><strong>Vertex AI Training</strong></td><td>Custom training jobs (CPUs, GPUs, TPUs)</td></tr>
+ <tr><td><strong>Vertex AI AutoML</strong></td><td>No-code model training (Tabular, Image, Text, Video)</td></tr>
+ <tr><td><strong>Vertex AI Endpoints</strong></td><td>Deploy models for online prediction</td></tr>
+ <tr><td><strong>Vertex AI Batch Prediction</strong></td><td>Asynchronous batch scoring</td></tr>
+ <tr><td><strong>Vertex AI Feature Store</strong></td><td>Serve features consistently across training/serving</td></tr>
+ <tr><td><strong>Vertex AI Pipelines</strong></td><td>Kubeflow Pipelines-based ML workflow orchestration</td></tr>
+ <tr><td><strong>Vertex AI Experiments</strong></td><td>Track runs, compare metrics</td></tr>
+ <tr><td><strong>Vertex AI Model Registry</strong></td><td>Version control for models</td></tr>
+ <tr><td><strong>Vertex AI Model Monitoring</strong></td><td>Detect feature skew and prediction drift</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="automl-vs-custom"><strong>3. AutoML vs. Custom Training</strong></h2>
+
+ <table>
+ <thead><tr><th>Criteria</th><th>AutoML</th><th>Custom Training</th></tr></thead>
+ <tbody>
+ <tr><td>ML expertise needed</td><td>Minimal</td><td>Required</td></tr>
+ <tr><td>Training time</td><td>Hours (automated)</td><td>Variable (you control)</td></tr>
+ <tr><td>Model interpretability</td><td>Limited</td><td>Full control</td></tr>
+ <tr><td>Cost</td><td>Higher per model</td><td>Pay per compute used</td></tr>
+ <tr><td>Best for</td><td>Quick prototypes, standard tasks</td><td>Custom architectures, research</td></tr>
+ <tr><td>Supported data types</td><td>Tabular, Image, Text, Video</td><td>Any (you write the code)</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> Questions that say "team doesn't have ML expertise" or "fastest time to deployment" → AutoML. Questions that say "custom neural architecture" or "full control over the training loop" → Custom Training.</p>
+ </blockquote>
+
+ <h2 id="bigquery-ml"><strong>4. BigQuery ML</strong></h2>
+
+ <p>BigQuery ML lets you train and serve ML models with SQL, without exporting data out of BigQuery.</p>
+
+ <table>
+ <thead><tr><th>Model Type</th><th>SQL Keyword</th><th>Use Case</th></tr></thead>
+ <tbody>
+ <tr><td>Linear Regression</td><td>LINEAR_REG</td><td>Price prediction</td></tr>
+ <tr><td>Logistic Regression</td><td>LOGISTIC_REG</td><td>Classification</td></tr>
+ <tr><td>K-Means Clustering</td><td>KMEANS</td><td>Customer segmentation</td></tr>
+ <tr><td>Boosted Trees (XGBoost)</td><td>BOOSTED_TREE_CLASSIFIER / BOOSTED_TREE_REGRESSOR</td><td>Tabular classification/regression</td></tr>
+ <tr><td>Deep Neural Network</td><td>DNN_CLASSIFIER / DNN_REGRESSOR</td><td>Complex patterns</td></tr>
+ <tr><td>Matrix Factorization</td><td>MATRIX_FACTORIZATION</td><td>Recommendations</td></tr>
+ <tr><td>Imported TF models</td><td>TENSORFLOW</td><td>Custom TF models</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="pre-trained-apis"><strong>5. Pre-trained AI APIs</strong></h2>
+
+ <table>
+ <thead><tr><th>API</th><th>Capabilities</th><th>Use Case</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Cloud Vision API</strong></td><td>Labels, OCR, faces, logos, safe search</td><td>Image analysis without training</td></tr>
+ <tr><td><strong>Cloud Natural Language API</strong></td><td>Entities, sentiment, syntax, categories</td><td>Text analytics</td></tr>
+ <tr><td><strong>Cloud Translation API</strong></td><td>100+ language pairs</td><td>Multi-language content</td></tr>
+ <tr><td><strong>Cloud Speech-to-Text</strong></td><td>Transcription, speaker diarization</td><td>Audio processing</td></tr>
+ <tr><td><strong>Cloud Text-to-Speech</strong></td><td>WaveNet voices, SSML</td><td>Voice UI, accessibility</td></tr>
+ <tr><td><strong>Document AI</strong></td><td>Form parsing, invoice extraction</td><td>Document automation</td></tr>
+ <tr><td><strong>Recommendations AI</strong></td><td>Real-time product recommendations</td><td>E-commerce personalization</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="decision-tree"><strong>6. Service Selection Decision Tree</strong></h2>
+
+ <pre><code class="language-text">WHICH GCP ML SERVICE?
+
+ Do you have LABELED DATA?
+
+ ├── NO → Pre-trained API sufficient for your task (Vision, NLP)?
+ │         YES → Use Pre-trained API
+ │         NO  → Vertex AI Custom Training (unsupervised)
+
+ └── YES → Is your data already IN BigQuery?
+
+           ├── YES → BigQuery ML (SQL-based, fast, no export)
+
+           └── NO → Need rapid prototyping, no ML team?
+
+                     ├── YES → Vertex AI AutoML
+
+                     └── NO → Vertex AI Custom Training
+ </code></pre>
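The selection logic above can be condensed into a small helper. The function and flag names are illustrative only, not a Google API:

```python
# Sketch of the service-selection decision tree above (illustrative names).
def pick_gcp_ml_service(has_labels: bool,
                        pretrained_api_fits: bool = False,
                        data_in_bigquery: bool = False,
                        has_ml_team: bool = True) -> str:
    if not has_labels:
        # No labels: a pre-trained API if it covers the task, else unsupervised.
        return "Pre-trained API" if pretrained_api_fits else "Vertex AI Custom Training"
    if data_in_bigquery:
        return "BigQuery ML"  # SQL-based, no export needed
    return "Vertex AI Custom Training" if has_ml_team else "Vertex AI AutoML"

print(pick_gcp_ml_service(True, data_in_bigquery=True))  # BigQuery ML
print(pick_gcp_ml_service(True, has_ml_team=False))      # Vertex AI AutoML
```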
+
+ <h2 id="practice"><strong>7. Practice Questions</strong></h2>
+
+ <p><strong>Q1:</strong> A data analytics team has petabytes of customer transaction data in BigQuery. They want to build a churn prediction model using their existing SQL skills without data exports. Which approach is BEST?</p>
+ <ul>
+ <li>A) Export to Cloud Storage, then use Vertex AI Custom Training</li>
+ <li>B) Use Cloud Natural Language API</li>
+ <li>C) Use BigQuery ML with CREATE MODEL and model_type='LOGISTIC_REG' ✓</li>
+ <li>D) Use Vertex AI AutoML Tabular</li>
+ </ul>
+ <p><em>Explanation: BigQuery ML allows training classification models directly on BigQuery data using SQL, leveraging existing data infrastructure and skills without exporting data. This is the fastest path when data is already in BigQuery.</em></p>
+
+ <p><strong>Q2:</strong> A small startup needs to add sentiment analysis to customer reviews. They have no ML team and no labeled sentiment data. Which solution requires the LEAST effort?</p>
+ <ul>
+ <li>A) Vertex AI AutoML Text Sentiment</li>
+ <li>B) Train a custom BERT model on Vertex AI</li>
+ <li>C) Cloud Natural Language API sentiment analysis ✓</li>
+ <li>D) BigQuery ML DNN classifier</li>
+ </ul>
+ <p><em>Explanation: Cloud Natural Language API is a pre-trained, fully managed service that requires no training data, no ML expertise, and no infrastructure setup. Just call the API. AutoML requires labeled sentiment examples; custom BERT requires significantly more expertise.</em></p>
+
+ <p><strong>Q3:</strong> Which Vertex AI component should a team use to ensure that feature values used during model training are identical to those served at prediction time?</p>
+ <ul>
+ <li>A) Vertex AI Experiments</li>
+ <li>B) Vertex AI Feature Store ✓</li>
+ <li>C) Vertex AI Model Registry</li>
+ <li>D) Vertex AI Pipelines</li>
+ </ul>
+ <p><em>Explanation: Vertex AI Feature Store provides a centralized repository for storing, serving, and sharing ML features. It ensures training-serving consistency by using the same feature definitions and values for both training and online/batch prediction, preventing training-serving skew.</em></p>
@@ -0,0 +1,174 @@
+ ---
+ id: 019c9619-lt03-l03
+ title: 'Bài 3: Data Pipeline — Dataflow, Pub/Sub, Dataproc'
+ slug: bai-3-data-pipeline
+ description: >-
+   Apache Beam on Dataflow for batch/streaming ETL.
+   Pub/Sub for event-driven pipelines. Dataproc for Spark.
+   Cloud Composer (Airflow) for orchestration.
+ duration_minutes: 60
+ is_free: true
+ video_url: null
+ sort_order: 3
+ section_title: "Phần 2: Data Engineering & Feature Engineering"
+ course:
+   id: 019c9619-lt03-7003-c003-lt0300000003
+   title: 'Luyện thi Google Cloud Professional Machine Learning Engineer'
+   slug: luyen-thi-gcp-ml-engineer
+ ---
+
+ <div style="text-align: center; margin: 2rem 0;">
+ <img src="/storage/uploads/2026/04/gcp-mle-bai3-data-pipeline.png" alt="GCP Data Pipeline Architecture" style="max-width: 800px; width: 100%; border-radius: 12px;" />
+ <p><em>The GCP data pipeline: Pub/Sub, Dataflow, Dataproc, Cloud Composer, and the flow of data for ML</em></p>
+ </div>
+
+ <h2 id="gcp-data-pipeline"><strong>1. GCP Data Pipeline Services</strong></h2>
+
+ <table>
+ <thead><tr><th>Service</th><th>Type</th><th>When to Use</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Pub/Sub</strong></td><td>Managed message queue</td><td>Event streaming, decouple producers/consumers</td></tr>
+ <tr><td><strong>Dataflow</strong></td><td>Managed Apache Beam runner</td><td>Unified batch + streaming ETL</td></tr>
+ <tr><td><strong>Dataproc</strong></td><td>Managed Spark / Hadoop</td><td>Existing Spark/Hadoop workloads, ML at scale</td></tr>
+ <tr><td><strong>Cloud Composer</strong></td><td>Managed Apache Airflow</td><td>Orchestrate multi-step ML workflows</td></tr>
+ <tr><td><strong>Cloud Storage</strong></td><td>Object store</td><td>Raw data landing zone, model artifacts</td></tr>
+ <tr><td><strong>BigQuery</strong></td><td>Data warehouse</td><td>Structured analysis, BigQuery ML</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="pubsub"><strong>2. Pub/Sub — Event Streaming</strong></h2>
+
+ <pre><code class="language-text">Pub/Sub Architecture:
+
+ Data Source → Publisher → [Topic] → Subscription → Subscriber
+ (IoT devices,             (Pull or        (Dataflow,
+  web clicks,               Push)           Cloud Functions,
+  logs)                                     BigQuery)
+
+ Key concepts:
+ - Topic: named resource where messages are sent
+ - Subscription: named resource attached to a topic
+ - Publisher: sends messages to a topic
+ - Subscriber: receives messages from a subscription
+ - At-least-once delivery (not exactly-once by default)
+ </code></pre>
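Because delivery is at-least-once, a subscriber may see the same message twice. A minimal idempotency sketch in plain Python (not the Pub/Sub client library): deduplicate on the message ID before applying side effects.

```python
# Idempotent-subscriber sketch: track processed message IDs so a redelivered
# message is acknowledged without repeating its side effect.
# (Plain Python for illustration, not the google-cloud-pubsub API.)
processed_ids = set()
results = []

def handle(message_id: str, payload: str) -> bool:
    """Process a message once; safely ignore redelivery."""
    if message_id in processed_ids:
        return False              # duplicate delivery: ack, don't reprocess
    processed_ids.add(message_id)
    results.append(payload)       # the real side effect would go here
    return True

handle("m-1", "click")
handle("m-2", "view")
handle("m-1", "click")            # redelivery of m-1
print(results)  # ['click', 'view'] -> the duplicate was dropped
```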
+
+ <table>
+ <thead><tr><th>Feature</th><th>Details</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Message retention</strong></td><td>7 days default (configurable)</td></tr>
+ <tr><td><strong>At-least-once delivery</strong></td><td>Idempotent subscribers needed</td></tr>
+ <tr><td><strong>Exactly-once delivery</strong></td><td>Supported for pull subscriptions within a single region (opt-in per subscription)</td></tr>
+ <tr><td><strong>Ordering</strong></td><td>Enable message ordering with an ordering key</td></tr>
+ </tbody>
+ </table>
+
+ <blockquote>
+ <p><strong>Exam tip:</strong> Pub/Sub → Dataflow → BigQuery is an extremely common pipeline pattern on the exam. Pub/Sub ingests, Dataflow transforms, BigQuery stores and analyzes.</p>
+ </blockquote>
+
+ <h2 id="dataflow"><strong>3. Cloud Dataflow — Apache Beam</strong></h2>
+
+ <p>Dataflow is the managed runner for <strong>Apache Beam</strong>, a framework for unified batch and streaming processing. There are no servers to manage.</p>
+
+ <table>
+ <thead><tr><th>Concept</th><th>Description</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Pipeline</strong></td><td>A sequence of transform operations</td></tr>
+ <tr><td><strong>PCollection</strong></td><td>Distributed data collection (bounded or unbounded)</td></tr>
+ <tr><td><strong>Transform</strong></td><td>ParDo, GroupByKey, Combine, Flatten, Partition</td></tr>
+ <tr><td><strong>Windowing</strong></td><td>Fixed, Sliding, and Session windows for streaming</td></tr>
+ <tr><td><strong>Watermarks</strong></td><td>Handle late-arriving data in streaming</td></tr>
+ </tbody>
+ </table>
+
+ <pre><code class="language-text">Dataflow Windowing for Streaming ML:
+
+ Event stream: ──●──●──●──────●──●──●──────●──●──
+
+ Fixed Window (1 min):
+               ├─── [W1] ──┤├─── [W2] ──┤├─── [W3] ──┤
+
+ Sliding Window (1 min, slide 30s):
+               ├── [W1] ────┤
+                     ├── [W2] ────┤
+                           ├── [W3] ────┤
+
+ Session Window (2 min gap):
+               ├── [S1] ──────────┤      ├── [S2] ──┤
+                  (user session)          (new session)
+ </code></pre>
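The fixed (tumbling) window case in the diagram can be sketched in a few lines of plain Python (not the Beam SDK): each event timestamp is bucketed by the start of its 60-second window.

```python
# Fixed-window assignment sketch: bucket event timestamps into
# non-overlapping 60-second windows keyed by window start time.
from collections import defaultdict

def fixed_window_start(ts_seconds: float, size_seconds: int = 60) -> int:
    """Return the start of the fixed window containing this timestamp."""
    return int(ts_seconds // size_seconds) * size_seconds

windows = defaultdict(list)
for ts in [3, 15, 59, 61, 62, 125]:
    windows[fixed_window_start(ts)].append(ts)

print(dict(windows))  # {0: [3, 15, 59], 60: [61, 62], 120: [125]}
```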
+
+ <h2 id="dataproc"><strong>4. Cloud Dataproc — Managed Spark/Hadoop</strong></h2>
+
+ <table>
+ <thead><tr><th>Dataproc Feature</th><th>Details</th></tr></thead>
+ <tbody>
+ <tr><td><strong>Cluster lifecycle</strong></td><td>Create in 90 seconds, delete after job — cost efficient</td></tr>
+ <tr><td><strong>Ephemeral clusters</strong></td><td>Spin up → run job → shut down (per-job pricing)</td></tr>
+ <tr><td><strong>Preemptible VMs</strong></td><td>Use for worker nodes to reduce cost 60-80%</td></tr>
+ <tr><td><strong>Component gateway</strong></td><td>Access Jupyter, Zeppelin, Spark UI via browser</td></tr>
+ <tr><td><strong>ML libraries</strong></td><td>Spark MLlib, TensorFlow on Spark (TFoS)</td></tr>
+ </tbody>
+ </table>
+
+ <h2 id="composer"><strong>5. Cloud Composer — Workflow Orchestration</strong></h2>
+
+ <p>Cloud Composer is managed Apache Airflow. Use it to orchestrate multi-step ML pipelines spanning data ingestion, preprocessing, training, and deployment.</p>
+
+ <pre><code class="language-text">Cloud Composer ML Workflow:
+
+ [DAG: daily_ml_pipeline]
+ Task 1: Extract data from BigQuery
+         ↓
+ Task 2: Run Dataflow preprocessing job
+         ↓
+ Task 3: Submit Vertex AI Training Job
+         ↓
+ Task 4: Evaluate model metrics
+         ↓ (if metrics pass threshold)
+ Task 5: Deploy to Vertex AI Endpoint
+ </code></pre>
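The conditional edge between Task 4 and Task 5 can be sketched as a plain-Python gate (not Airflow operator code; the metric name "auc" and the threshold are assumptions for illustration):

```python
# Metric-gate sketch: promote the model to deployment only if the
# evaluation metric clears a threshold. Assumed metric name and threshold.
def should_deploy(metrics: dict, metric: str = "auc", threshold: float = 0.9) -> bool:
    """Gate the deploy step on an evaluation metric."""
    return metrics.get(metric, 0.0) >= threshold

print(should_deploy({"auc": 0.93}))  # True  -> run Task 5 (deploy)
print(should_deploy({"auc": 0.81}))  # False -> stop the DAG here
```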
+
+ <h2 id="decision-guide"><strong>6. Data Pipeline Service Selection</strong></h2>
+
+ <table>
+ <thead><tr><th>Scenario</th><th>Recommended Service</th></tr></thead>
+ <tbody>
+ <tr><td>Real-time event streaming ingestion</td><td>Pub/Sub</td></tr>
+ <tr><td>Unified batch + streaming ETL (no infra mgmt)</td><td>Dataflow (Apache Beam)</td></tr>
+ <tr><td>Migrate existing Spark jobs to GCP</td><td>Dataproc</td></tr>
+ <tr><td>Complex ML DAG orchestration</td><td>Cloud Composer</td></tr>
+ <tr><td>Stream data into BigQuery</td><td>Pub/Sub → Dataflow → BigQuery</td></tr>
+ <tr><td>Serverless data processing (SQL)</td><td>BigQuery (ETL via SQL)</td></tr>
+ </tbody>
+ </table>

<h2 id="practice"><strong>7. Practice Questions</strong></h2>

<p><strong>Q1:</strong> A company receives millions of IoT sensor events per second from factory equipment. They need to process these events in real time, detect anomalies, and store results in BigQuery. Which pipeline architecture is MOST appropriate?</p>
<ul>
<li>A) Dataproc → Spark Streaming → BigQuery</li>
<li>B) Pub/Sub → Dataflow → BigQuery ✓</li>
<li>C) Cloud Functions → Cloud SQL</li>
<li>D) Batch upload to Cloud Storage → BigQuery import</li>
</ul>
<p><em>Explanation: Pub/Sub ingests high-volume streaming events reliably. Dataflow processes the stream in real time using Apache Beam (windowing, transformations, anomaly detection). BigQuery stores the results for analysis. This is the canonical GCP streaming analytics pattern.</em></p>

<p><strong>Q2:</strong> A data engineering team has an existing Apache Spark job that processes training data for ML models. They want to migrate it to GCP with minimal code changes. Which service should they use?</p>
<ul>
<li>A) Cloud Dataflow</li>
<li>B) Cloud Dataproc ✓</li>
<li>C) BigQuery ETL</li>
<li>D) Cloud Composer</li>
</ul>
<p><em>Explanation: Cloud Dataproc supports Apache Spark natively, allowing teams to run existing Spark jobs on GCP with minimal changes. Dataflow uses Apache Beam (a different programming model). Dataproc is the lift-and-shift option for Spark workloads.</em></p>

<p><strong>Q3:</strong> A team needs to orchestrate a daily ML pipeline that includes data extraction from BigQuery, preprocessing, Vertex AI training, and deployment if accuracy exceeds 90%. Which service handles this workflow orchestration?</p>
<ul>
<li>A) Vertex AI Pipelines</li>
<li>B) Cloud Dataflow</li>
<li>C) Cloud Composer ✓</li>
<li>D) Pub/Sub triggers</li>
</ul>
<p><em>Explanation: Cloud Composer (managed Apache Airflow) is designed for complex DAG orchestration across multiple services. It handles scheduling, conditional branching (deploy only if accuracy &gt; 90%), retry logic, and monitoring across heterogeneous services like BigQuery, Dataflow, and Vertex AI.</em></p>
---
id: 019c9619-lt03-l04
title: 'Bài 4: Feature Engineering & Vertex AI Feature Store'
slug: bai-4-feature-engineering
description: >-
  Feature engineering techniques. BigQuery for feature computation.
  Vertex AI Feature Store: online/offline serving.
  Feature monitoring and training/serving consistency.
duration_minutes: 60
is_free: true
video_url: null
sort_order: 4
section_title: "Phần 2: Data Engineering & Feature Engineering"
course:
  id: 019c9619-lt03-7003-c003-lt0300000003
  title: 'Luyện thi Google Cloud Professional Machine Learning Engineer'
  slug: luyen-thi-gcp-ml-engineer
---

<div style="text-align: center; margin: 2rem 0;">
  <img src="/storage/uploads/2026/04/gcp-mle-bai4-feature-store.png" alt="Vertex AI Feature Store" style="max-width: 800px; width: 100%; border-radius: 12px;" />
  <p><em>Feature Engineering & Vertex AI Feature Store: create, store, and reuse features for ML</em></p>
</div>

<h2 id="feature-engineering"><strong>1. Feature Engineering Techniques</strong></h2>

<table>
<thead><tr><th>Technique</th><th>When to Use</th><th>Example</th></tr></thead>
<tbody>
<tr><td><strong>Normalization (Min-Max)</strong></td><td>Bounded range required (0-1)</td><td>Image pixels, probabilities</td></tr>
<tr><td><strong>Standardization (Z-score)</strong></td><td>Normal-ish distribution, no bounds</td><td>Customer age, transaction amount</td></tr>
<tr><td><strong>Log Transform</strong></td><td>Skewed distributions (price, salary)</td><td>Log(price) for housing</td></tr>
<tr><td><strong>One-Hot Encoding</strong></td><td>Nominal categorical (no order)</td><td>Country, brand, color</td></tr>
<tr><td><strong>Label Encoding</strong></td><td>Ordinal categorical (has order)</td><td>Low/Medium/High → 0/1/2</td></tr>
<tr><td><strong>Feature Crossing</strong></td><td>Capture interaction between features</td><td>city × day_of_week</td></tr>
<tr><td><strong>Bucketizing</strong></td><td>Convert continuous to categorical</td><td>Age → age_group</td></tr>
<tr><td><strong>Embeddings</strong></td><td>High-cardinality categorical</td><td>UserID, ProductID</td></tr>
</tbody>
</table>
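<p>Four of the transforms in the table can be sketched in a few lines of plain Python. This is for illustration only; real pipelines would use e.g. scikit-learn or the BigQuery ML preprocessing functions instead of these hand-rolled helpers.</p>

```python
import math

# Sketch: min-max normalization, z-score standardization, log transform,
# and one-hot encoding on toy data. Illustrative, not production code.

def min_max(xs):
    """Normalization: rescale values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Standardization: zero mean, unit variance."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def log_transform(xs):
    """Compress right-skewed values such as prices (log1p handles 0)."""
    return [math.log1p(x) for x in xs]

def one_hot(value, vocabulary):
    """Nominal categorical value -> indicator vector."""
    return [1 if value == v else 0 for v in vocabulary]
```
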

<h2 id="missing-values"><strong>2. Handling Missing Values</strong></h2>

<table>
<thead><tr><th>Strategy</th><th>When</th></tr></thead>
<tbody>
<tr><td><strong>Mean/Median imputation</strong></td><td>Numerical, low missingness rate</td></tr>
<tr><td><strong>Mode imputation</strong></td><td>Categorical features</td></tr>
<tr><td><strong>Model-based imputation</strong></td><td>High missingness, complex patterns</td></tr>
<tr><td><strong>Indicator variable</strong></td><td>Missingness itself is informative (add is_missing flag)</td></tr>
<tr><td><strong>Drop rows</strong></td><td>Missing target / very few rows affected</td></tr>
<tr><td><strong>Drop column</strong></td><td>&gt;80% missing</td></tr>
</tbody>
</table>
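<p>Two rows of the table — median imputation and the indicator variable — combine naturally, and can be sketched as follows (plain Python, with <code>None</code> standing in for a missing value; a real pipeline would use e.g. pandas or scikit-learn imputers):</p>

```python
# Sketch: fill missing numerics with the median AND keep an is_missing
# flag, since missingness itself may be informative. Illustrative only.

def impute_with_indicator(values):
    """Return (filled_values, is_missing_flags) for a numeric column."""
    present = sorted(v for v in values if v is not None)
    mid = len(present) // 2
    median = (present[mid] if len(present) % 2
              else (present[mid - 1] + present[mid]) / 2)
    filled = [v if v is not None else median for v in values]
    is_missing = [1 if v is None else 0 for v in values]
    return filled, is_missing
```
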

<h2 id="training-serving-skew"><strong>3. Training-Serving Skew</strong></h2>

<p><strong>Training-serving skew</strong> is a serious problem: features are computed differently between training and serving, so the model performs poorly in production even though test metrics look good.</p>

<pre><code class="language-text">Training-Serving Skew Example:

TRAINING TIME:
  avg_purchase_last_30d = mean(all purchases in batch)  ← computed over full period

SERVING TIME:
  avg_purchase_last_30d = mean(last 5 purchases)        ← computed differently!

Result: Feature distribution mismatch → poor predictions

SOLUTION: Vertex AI Feature Store
  Same feature serving logic used at training AND serving time
</code></pre>
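<p>The fix boils down to one idea: compute the feature with a single function that both the training pipeline and the serving path call. A minimal sketch, with an illustrative feature name matching the example above (a Feature Store productionizes this single-source-of-truth pattern):</p>

```python
from datetime import datetime, timedelta

# Sketch: ONE feature function shared by training and serving, so the
# distributions cannot diverge. Illustrative, not a Feature Store API.

def avg_purchase_last_30d(purchases, as_of):
    """purchases: list of (timestamp, amount) pairs.

    Both the training-data builder and the online endpoint call this
    same function, eliminating training-serving skew by construction.
    """
    cutoff = as_of - timedelta(days=30)
    window = [amt for ts, amt in purchases if cutoff <= ts <= as_of]
    return sum(window) / len(window) if window else 0.0
```
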

<h2 id="feature-store"><strong>4. Vertex AI Feature Store</strong></h2>

<table>
<thead><tr><th>Component</th><th>Description</th></tr></thead>
<tbody>
<tr><td><strong>Feature Store</strong></td><td>Centralized repository for ML features</td></tr>
<tr><td><strong>Entity Type</strong></td><td>Category of things you track (User, Product)</td></tr>
<tr><td><strong>Feature</strong></td><td>Named attribute of an entity (user.avg_spend)</td></tr>
<tr><td><strong>Online Store</strong></td><td>Low-latency serving (ms) for real-time predictions</td></tr>
<tr><td><strong>Offline Store</strong></td><td>BigQuery-backed, for batch training data retrieval</td></tr>
</tbody>
</table>

<pre><code class="language-text">Vertex AI Feature Store Architecture:

Feature Ingestion (Batch or Streaming)
                ↓
┌──── Feature Store ────────────────┐
│  Offline Store (BigQuery)         │ ← Training data export
│  Online Store (Bigtable-backed)   │ ← Serving (ms latency)
└───────────────────────────────────┘
        ↑ Same features ↑
   Training          Inference
   Pipeline          Endpoint
</code></pre>

<h2 id="bigquery-features"><strong>5. BigQuery for Feature Engineering</strong></h2>

<p>BigQuery is the best tool on GCP for computing aggregate features from large datasets.</p>

<table>
<thead><tr><th>Feature Pattern</th><th>BigQuery Approach</th></tr></thead>
<tbody>
<tr><td>Rolling window aggregates</td><td>Window functions: AVG() OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...)</td></tr>
<tr><td>User activity counts</td><td>COUNT(*) ... GROUP BY user_id</td></tr>
<tr><td>Categorical encoding</td><td>CASE WHEN ... or ML.ONE_HOT_ENCODER()</td></tr>
<tr><td>Hash embedding (high cardinality)</td><td>FARM_FINGERPRINT() mod N</td></tr>
<tr><td>Feature normalization</td><td>ML.STANDARD_SCALER() in BigQuery ML</td></tr>
</tbody>
</table>
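<p>The hash-embedding row deserves a concrete sketch. The idea — map a huge ID space to a fixed number of buckets with a stable hash — looks like this in Python. We use <code>hashlib.md5</code> rather than FarmHash, so bucket values differ from BigQuery's <code>FARM_FINGERPRINT</code>, but the technique is the same:</p>

```python
import hashlib

# Sketch: the "hashing trick" for high-cardinality categoricals
# (UserID, ProductID). Deterministic across processes, unlike the
# built-in hash() with randomized seeding.

def hash_bucket(value, num_buckets):
    """Map an arbitrary string ID to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

The resulting bucket index can then feed an embedding table of size <code>num_buckets</code> instead of one row per raw ID.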

<blockquote>
<p><strong>Exam tip:</strong> When a question mentions "training-serving consistency" or "feature reuse across multiple models" → <strong>Vertex AI Feature Store</strong>. When it mentions "compute features from BigQuery data at scale" → BigQuery window functions + scheduled queries.</p>
</blockquote>

<h2 id="feature-monitoring"><strong>6. Feature Drift Monitoring</strong></h2>

<table>
<thead><tr><th>Type</th><th>What Changes</th><th>Detection Method</th></tr></thead>
<tbody>
<tr><td><strong>Feature Skew</strong></td><td>Training vs serving feature distribution differs</td><td>Compare training baseline vs serving stats</td></tr>
<tr><td><strong>Feature Drift</strong></td><td>Serving features change over time</td><td>Monitor serving feature distributions daily</td></tr>
<tr><td><strong>Label Drift</strong></td><td>Target variable distribution changes</td><td>Track prediction distribution shifts</td></tr>
</tbody>
</table>
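<p>A minimal drift check compares serving statistics against a training baseline. The sketch below flags drift when the mean shifts by more than a threshold number of baseline standard deviations; production systems such as Vertex AI Model Monitoring use distribution distances (e.g. PSI or Jensen-Shannon divergence) instead, and the threshold here is an illustrative assumption:</p>

```python
# Sketch: "compare training baseline vs serving stats" from the table
# above, reduced to a mean-shift test. Illustrative only.

def summarize(xs):
    """Return (mean, population std-dev) of a numeric sample."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return mean, std

def drift_detected(baseline, current, threshold=0.5):
    """Flag drift when the mean moves > threshold baseline std-devs."""
    base_mean, base_std = summarize(baseline)
    cur_mean, _ = summarize(current)
    if base_std == 0:
        return cur_mean != base_mean
    return abs(cur_mean - base_mean) / base_std > threshold
```
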

<h2 id="practice"><strong>7. Practice Questions</strong></h2>

<p><strong>Q1:</strong> A team's ML model has excellent accuracy during testing but performs poorly in production. Investigations reveal that the average purchase feature is calculated differently in training (using historical batch data) vs. serving (using real-time lookups). What is this problem called and how should it be solved?</p>
<ul>
<li>A) Model drift — retrain the model more frequently</li>
<li>B) Training-serving skew — use Vertex AI Feature Store ✓</li>
<li>C) Data leakage — remove the purchase feature</li>
<li>D) Overfitting — add dropout layers</li>
</ul>
<p><em>Explanation: Training-serving skew occurs when features are computed differently at training and serving time. Vertex AI Feature Store solves this by providing a single source of truth for feature computation, ensuring the same logic is used for both training data export and online serving.</em></p>

<p><strong>Q2:</strong> A feature has values ranging from $10 to $10,000,000 with a heavily right-skewed distribution. Which transformation is MOST appropriate before using this feature in a linear model?</p>
<ul>
<li>A) One-Hot Encoding</li>
<li>B) Min-Max Normalization</li>
<li>C) Log transformation ✓</li>
<li>D) Label Encoding</li>
</ul>
<p><em>Explanation: Log transformation compresses the scale of highly skewed distributions, making them more normal-like and suitable for linear models. Min-Max normalization would still preserve the skew. One-hot encoding is for categorical data.</em></p>

<p><strong>Q3:</strong> Which Vertex AI Feature Store store type is optimized for serving features to real-time prediction endpoints with millisecond latency?</p>
<ul>
<li>A) Offline Store (BigQuery)</li>
<li>B) Online Store (Bigtable-backed) ✓</li>
<li>C) Feature Catalog</li>
<li>D) Cloud Memorystore</li>
</ul>
<p><em>Explanation: The Online Store in Vertex AI Feature Store is backed by Bigtable and designed for sub-100ms latency lookups, serving fresh feature values to real-time prediction endpoints. The Offline Store uses BigQuery and is for batch training data retrieval.</em></p>