@xdev-asia/xdev-knowledge-mcp 1.0.57 → 1.0.59
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/content/blog/ai/minimax-danh-gia-chi-tiet-nen-tang-ai-full-stack-trung-quoc.md +450 -0
- package/content/blog/ai/nvidia-dli-generative-ai-chung-chi-va-lo-trinh-hoc.md +894 -0
- package/content/metadata/authors/duy-tran.md +2 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/01-deep-learning-foundations/lessons/01-bai-1-pytorch-neural-network-fundamentals.md +790 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/01-deep-learning-foundations/lessons/02-bai-2-transformer-architecture-attention.md +984 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/02-diffusion-models/lessons/01-bai-3-unet-architecture-denoising.md +1111 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/02-diffusion-models/lessons/02-bai-4-ddpm-forward-reverse-diffusion.md +1007 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/02-diffusion-models/lessons/03-bai-5-clip-text-to-image-pipeline.md +1037 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/03-llm-applications-rag/lessons/01-bai-6-llm-inference-pipeline-design.md +929 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/03-llm-applications-rag/lessons/02-bai-7-rag-retrieval-augmented-generation.md +1099 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/03-llm-applications-rag/lessons/03-bai-8-rag-agent-build-evaluate.md +1249 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/04-agentic-ai-customization/lessons/01-bai-9-agentic-ai-multi-agent-systems.md +1357 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/chapters/04-agentic-ai-customization/lessons/02-bai-10-llm-evaluation-lora-fine-tuning.md +1867 -0
- package/content/series/luyen-thi/luyen-thi-nvidia-dli-generative-ai/index.md +237 -0
- package/data/quizzes/nvidia-dli-generative-ai.json +350 -0
- package/data/quizzes.json +14 -0
- package/package.json +1 -1

@@ -0,0 +1,1037 @@
---
id: 019c9619-nv01-p2-l05
title: 'Bài 5: CLIP & Text-to-Image Pipeline'
slug: bai-5-clip-text-to-image-pipeline
description: >-
  CLIP: Contrastive Language-Image Pretraining.
  Text encoding, image encoding, contrastive loss.
  Cross-attention: inject text embeddings into U-Net.
  Full text-to-image pipeline. Latent Diffusion overview.
  Assessment prep: coding exercises & debug challenges.
duration_minutes: 90
is_free: true
video_url: null
sort_order: 5
section_title: "Part 2: Generative AI with Diffusion Models"
course:
  id: 019c9619-nv01-7001-c001-nv0100000001
  title: 'Luyện thi NVIDIA DLI — Generative AI with Diffusion Models & LLMs'
  slug: luyen-thi-nvidia-dli-generative-ai
---

<h2 id="gioi-thieu">1. Introduction: From Class Labels to Text Prompts</h2>

<p>In the previous lesson you implemented <strong>Classifier-Free Guidance (CFG)</strong> with class labels (the digits 0–9). But Stable Diffusion does not use class labels; it uses free-form <strong>text prompts</strong>. So how do we turn "a photo of a cat" into a tensor the U-Net can understand?</p>

<p>The answer is <strong>CLIP (Contrastive Language-Image Pretraining)</strong>, the model that bridges language and images, introduced by OpenAI in 2021. This is the final lesson of the Diffusion Models part: it combines everything you have learned so far to build a <strong>full text-to-image pipeline</strong>.</p>

<blockquote><p><strong>Exam tip:</strong> the DLI assessment <strong>S-FX-14</strong> asks you to combine U-Net, DDPM, CFG, and text conditioning into one complete pipeline. This lesson is the capstone: if you understand each component from Lessons 3–4 and connect them here in Lesson 5, you will finish the assessment much faster.</p></blockquote>

<pre><code class="language-text">
Roadmap: Class Label → Text Prompt Conditioning
════════════════════════════════════════════════

Lesson 3: U-Net backbone         → Denoiser architecture
Lesson 4: DDPM + CFG             → Training & sampling with class labels
Lesson 5: CLIP + Cross-Attention → Text-to-image pipeline  ← YOU ARE HERE
                  │
                  ▼
  ┌──────────────────────────────────────────────────────────┐
  │  "a sunset over mountains"                               │
  │        │                                                 │
  │        ▼                                                 │
  │  ┌──────────┐   ┌──────────────────┐   ┌──────────┐      │
  │  │   CLIP   │──►│ Cross-Attention  │──►│  U-Net   │      │
  │  │ Encoder  │   │ (K, V from text) │   │ Denoise  │      │
  │  └──────────┘   └──────────────────┘   └──────────┘      │
  │                                              │           │
  │                                              ▼           │
  │                                        [ 🖼️ Image ]      │
  └──────────────────────────────────────────────────────────┘
</code></pre>

<figure><img src="/storage/uploads/2026/04/nvidia-dli-bai5-clip-text-to-image.png" alt="CLIP and the Text-to-Image Pipeline: Text Encoder, Cross-Attention, U-Net Denoiser" loading="lazy" /><figcaption>CLIP and the Text-to-Image Pipeline: Text Encoder, Cross-Attention, U-Net Denoiser</figcaption></figure>

<h2 id="clip-architecture">2. CLIP — Contrastive Language-Image Pretraining</h2>

<h3 id="clip-dual-encoder">2.1 Dual-Encoder Architecture</h3>

<p><strong>CLIP</strong> consists of two encoders trained jointly on 400 million (text, image) pairs collected from the internet:</p>

<ul>
<li><strong>Text Encoder</strong>: a Transformer (GPT-like) that maps text to an embedding vector (512-d or 768-d)</li>
<li><strong>Image Encoder</strong>: a ViT (Vision Transformer) or ResNet that maps an image to an embedding of the same dimension</li>
</ul>

<p>The key point: both encoders output embeddings in the <strong>same vector space</strong>. This makes it possible to compare text and images directly via <strong>cosine similarity</strong>.</p>

<pre><code class="language-text">
CLIP Architecture — Dual Encoder
═════════════════════════════════

TEXT BRANCH                          IMAGE BRANCH
───────────                          ────────────

"a photo of   ┌───────────────┐      ┌─────┐   ┌───────────────┐
 a cat"  ──►  │ Text Encoder  │      │ 🖼️  │──►│ Image Encoder │
              │ (Transformer) │      │     │   │ (ViT / ResNet)│
              └───────┬───────┘      └─────┘   └───────┬───────┘
                      │                                │
                      ▼                                ▼
              ┌──────────────┐                 ┌──────────────┐
              │ Text Embed.  │                 │ Image Embed. │
              │  (768-dim)   │                 │  (768-dim)   │
              └──────┬───────┘                 └──────┬───────┘
                     │                                │
                     └────────────┬───────────────────┘
                                  │
                                  ▼
                          ┌───────────────┐
                          │    Cosine     │
                          │  Similarity   │
                          │   sim(t, i)   │
                          └───────────────┘

Training (400M image-text pairs):
┌──────────────────────────────────────────────────────┐
│  Maximize sim(text_i, image_i)   ← matched pairs     │
│  Minimize sim(text_i, image_j)   ← non-matched       │
└──────────────────────────────────────────────────────┘
</code></pre>

<h3 id="contrastive-loss">2.2 Contrastive Loss</h3>

<p>CLIP uses a <strong>symmetric cross-entropy loss</strong> over an N×N similarity matrix. For a batch of N (text, image) pairs:</p>

<pre><code class="language-text">
Contrastive Loss — Similarity Matrix
═════════════════════════════════════

Batch of N = 4 (text, image) pairs:

          image_0   image_1   image_2   image_3
        ┌─────────┬─────────┬─────────┬─────────┐
text_0  │ 0.95 ✓  │  0.12   │  0.08   │  0.03   │
        ├─────────┼─────────┼─────────┼─────────┤
text_1  │  0.10   │ 0.91 ✓  │  0.15   │  0.07   │
        ├─────────┼─────────┼─────────┼─────────┤
text_2  │  0.05   │  0.11   │ 0.93 ✓  │  0.09   │
        ├─────────┼─────────┼─────────┼─────────┤
text_3  │  0.08   │  0.06   │  0.12   │ 0.89 ✓  │
        └─────────┴─────────┴─────────┴─────────┘

Goal: diagonal (✓) → high, everything else → low

Loss = (CE_rows + CE_cols) / 2
     = cross_entropy(logits, labels) in both directions

logits = text_embeds @ image_embeds.T / temperature
labels = [0, 1, 2, ..., N-1]   ← identity matching
</code></pre>

<p>The <strong>temperature parameter</strong> (learnable, initialized around 0.07) controls the sharpness of the distribution: a lower temperature pushes the model to separate positive and negative pairs more aggressively.</p>
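
<p>The loss above can be sketched in a few lines of PyTorch. This is a minimal illustration, not CLIP's actual API: the function name <code>clip_contrastive_loss</code> is ours, and the similarities are divided by the temperature exactly as in the diagram.</p>

```python
# Minimal sketch of CLIP's symmetric contrastive loss (InfoNCE-style).
# Assumes raw embeddings; we L2-normalize so dot products are cosine sims.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_embeds, image_embeds, temperature=0.07):
    text_embeds = F.normalize(text_embeds, dim=-1)
    image_embeds = F.normalize(image_embeds, dim=-1)

    # (N, N) similarity matrix, sharpened by the temperature
    logits = text_embeds @ image_embeds.T / temperature

    # Matched pairs lie on the diagonal: labels = [0, 1, ..., N-1]
    labels = torch.arange(logits.size(0))

    # Symmetric cross-entropy: rows (text→image) and columns (image→text)
    loss_t2i = F.cross_entropy(logits, labels)
    loss_i2t = F.cross_entropy(logits.T, labels)
    return (loss_t2i + loss_i2t) / 2

loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```

<p>With perfectly matched embeddings the diagonal dominates and the loss approaches zero; with random pairs it stays near <code>log N</code>.</p>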

<table>
<thead>
<tr><th>Component</th><th>Details</th><th>In CLIP</th></tr>
</thead>
<tbody>
<tr><td>Text Encoder</td><td>12-layer Transformer, BPE tokenizer</td><td>Max 77 tokens, outputs CLS embedding</td></tr>
<tr><td>Image Encoder</td><td>ViT-B/32 or ViT-L/14</td><td>Splits the image into patches, outputs CLS</td></tr>
<tr><td>Embedding dim</td><td>512 (ViT-B/32) or 768 (ViT-L/14)</td><td>Shared space between text & image</td></tr>
<tr><td>Training data</td><td>400M image-text pairs (WIT dataset)</td><td>Crawled from the internet</td></tr>
<tr><td>Loss function</td><td>Symmetric cross-entropy</td><td>InfoNCE / NT-Xent variant</td></tr>
<tr><td>Temperature</td><td>Learnable scalar τ</td><td>Init ≈ 0.07, learned during training</td></tr>
</tbody>
</table>

<blockquote><p><strong>Exam tip:</strong> CLIP does <strong>not generate</strong> images: it only encodes text and images into a shared space. In a text-to-image pipeline we use only CLIP's <strong>Text Encoder</strong> to produce the conditioning signal for the U-Net; the Image Encoder is not used during generation.</p></blockquote>

<h2 id="using-clip">3. Using CLIP Encodings in Code</h2>

<h3 id="clip-load">3.1 Loading CLIP and Encoding Text</h3>

<pre><code class="language-python">
import torch
import clip
from PIL import Image

# Load pretrained CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# ── Encode text ──
text_prompts = ["a photo of a cat", "a sunset over mountains", "a red car"]
text_tokens = clip.tokenize(text_prompts).to(device)  # (3, 77) — padded to 77 tokens

with torch.no_grad():
    text_embeddings = model.encode_text(text_tokens)  # (3, 512)
    text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)  # L2 normalize

print(f"Text embeddings shape: {text_embeddings.shape}")  # (3, 512)
</code></pre>

<h3 id="clip-image">3.2 Encoding Images and Computing Similarity</h3>

<pre><code class="language-python">
# ── Encode images ──
images = [preprocess(Image.open(f"img_{i}.jpg")).unsqueeze(0) for i in range(3)]
image_batch = torch.cat(images).to(device)  # (3, 3, 224, 224)

with torch.no_grad():
    image_embeddings = model.encode_image(image_batch)  # (3, 512)
    image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)

# ── Cosine similarity ──
similarity = text_embeddings @ image_embeddings.T  # (3, 3)
print(similarity)
# tensor([[ 0.31, 0.05, 0.02],   ← "cat" matches image_0
#         [ 0.04, 0.28, 0.06],   ← "sunset" matches image_1
#         [ 0.03, 0.07, 0.26]])  ← "red car" matches image_2
</code></pre>

<p>The result: the text and image with matching content have the highest similarity. This is the power of the <strong>shared embedding space</strong>: you can search images with text, or the other way around.</p>

<h3 id="clip-for-diffusion">3.3 CLIP for Diffusion Models: Sequence Embeddings</h3>

<p>Important: Stable Diffusion does <strong>not use the CLS embedding</strong> (a single vector). It uses the <strong>sequence of token embeddings</strong> from the CLIP Text Encoder, taken before the final projection layer:</p>

<pre><code class="language-text">
CLS Embedding vs Sequence Embeddings
═════════════════════════════════════

Text:      "a photo of a cat"
Tokenized: [SOS, "a", "photo", "of", "a", "cat", EOS, PAD, PAD, ...]

CLIP Text Encoder output:
┌──────────────────────────────────────────────┐
│ token_0 (SOS)   → [0.12, -0.34, 0.56, ...]   │
│ token_1 ("a")   → [0.08, -0.21, 0.43, ...]   │
│ token_2 ("photo") → [...]                    │
│ token_3 ("of")  → [...]                      │
│ token_4 ("a")   → [...]                      │
│ token_5 ("cat") → [0.91, 0.15, -0.33, ...]   │ ← semantic info
│ token_6 (EOS)   → [0.67, 0.42, -0.18, ...]   │ ← CLS (used by CLIP)
│ ...                                          │
│ token_76 (PAD)  → [0.00, 0.00, 0.00, ...]    │
└──────────────────────────────────────────────┘

Stable Diffusion uses: ALL 77 token embeddings   → (1, 77, 768)
CLIP zero-shot uses:   ONLY the EOS embedding    → (1, 768)
</code></pre>

<table>
<thead>
<tr><th>Use case</th><th>Output</th><th>Shape</th><th>Why</th></tr>
</thead>
<tbody>
<tr><td>CLIP classification</td><td>CLS / EOS token</td><td>(B, 768)</td><td>Global similarity comparison</td></tr>
<tr><td>Stable Diffusion</td><td>Full token sequence</td><td>(B, 77, 768)</td><td>Cross-attention needs per-token info</td></tr>
</tbody>
</table>

<blockquote><p><strong>Exam tip:</strong> if the exam asks "What is the shape of the text conditioning input to the U-Net?", the answer is <strong>(batch, 77, 768)</strong>, NOT (batch, 768). Cross-attention needs a sequence, not a single vector.</p></blockquote>

<h2 id="cross-attention">4. Cross-Attention: Injecting Text Embeddings into the U-Net</h2>

<h3 id="cross-attn-mechanism">4.1 The Cross-Attention Mechanism</h3>

<p>In Lesson 3 the U-Net used <strong>self-attention</strong>: Q, K, and V all come from image features. <strong>Cross-attention</strong> changes the source of K and V:</p>

<pre><code class="language-text">
Self-Attention vs Cross-Attention
═════════════════════════════════

SELF-ATTENTION (inside the U-Net):
───────────────────────────────
Q = W_q · image_features   ← from image
K = W_k · image_features   ← from image
V = W_v · image_features   ← from image

Attention = softmax(Q · K^T / √d) · V

CROSS-ATTENTION (text → image):
────────────────────────────────
Q = W_q · image_features    ← from image (queries)
K = W_k · text_embeddings   ← from CLIP text (keys)
V = W_v · text_embeddings   ← from CLIP text (values)

Attention = softmax(Q_image · K_text^T / √d) · V_text

┌──────────────────────────────────────────────────┐
│ Q shape: (B, H*W, d_model)  ← spatial pixels     │
│ K shape: (B, 77, d_model)   ← text tokens        │
│ V shape: (B, 77, d_model)   ← text tokens        │
│ Score:   (B, H*W, 77)       ← pixel-to-token     │
│ Output:  (B, H*W, d_model)  ← text-aware image   │
└──────────────────────────────────────────────────┘
</code></pre>

<p>Each pixel "looks at" all 77 text tokens and decides which to attend to. Pixels in the cat region attend strongly to the token "cat"; pixels in the sky region attend to "sky".</p>
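
<p>The shape bookkeeping in the box above can be verified with a toy tensor walkthrough. The sizes here (an 8×8 feature map, 77 tokens, a 32-dim head) are illustrative, not tied to any particular model:</p>

```python
# Toy cross-attention: queries from image features, keys/values from text.
import torch
import torch.nn.functional as F

B, HW, S, d = 1, 64, 77, 32          # batch, spatial positions (8x8), text tokens, dim
Q = torch.randn(B, HW, d)            # queries come from image features
K = torch.randn(B, S, d)             # keys come from CLIP text embeddings
V = torch.randn(B, S, d)             # values come from CLIP text embeddings

scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (B, HW, 77): one row per pixel
attn = F.softmax(scores, dim=-1)              # each pixel's distribution over 77 tokens
out = attn @ V                                # (B, HW, d): text-aware image features
```

<p>Each row of <code>attn</code> sums to 1: every pixel spreads a probability distribution over the 77 text tokens, exactly the pixel-to-token score in the diagram.</p>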

<h3 id="cross-attn-unet-block">4.2 Cross-Attention inside a U-Net Block</h3>

<pre><code class="language-text">
U-Net Block with Cross-Attention
════════════════════════════════

Input:     x (image features, shape: B×C×H×W)
Condition: text_emb (CLIP output, shape: B×77×768)
Timestep:  t_emb (timestep embedding, shape: B×d)

┌─────────────────────────────────────────────┐
│               U-Net Block                   │
│                                             │
│  x ──► [ResBlock + t_emb] ──► x'            │
│                                │            │
│                                ▼            │
│                       [Self-Attention]      │
│                         Q,K,V ← x'          │
│                                │            │
│                                ▼            │
│          [Cross-Attention] ◄── text_emb     │
│            Q ← x'                           │
│            K,V ← text_emb                   │
│                                │            │
│                                ▼            │
│                         [FFN / MLP]         │
│                                │            │
│                                ▼            │
│                             output          │
└─────────────────────────────────────────────┘

Order within each block: ResBlock → Self-Attn → Cross-Attn → FFN
</code></pre>

<h3 id="cross-attn-code">4.3 Implementation: CrossAttention Module</h3>

<pre><code class="language-python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """
    Cross-attention: Q from image features, K/V from text embeddings.
    Used in U-Net blocks to inject text conditioning.
    """
    def __init__(self, d_model, context_dim, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Q from image, K/V from text
        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(context_dim, d_model, bias=False)
        self.to_v = nn.Linear(context_dim, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, context):
        """
        Args:
            x: image features (B, H*W, d_model)
            context: text embeddings from CLIP (B, seq_len, context_dim)
        Returns:
            text-conditioned image features (B, H*W, d_model)
        """
        residual = x
        x = self.norm(x)

        B, N, _ = x.shape
        H = self.n_heads
        d = self.d_head

        # Project to Q, K, V
        Q = self.to_q(x).view(B, N, H, d).transpose(1, 2)        # (B, H, N, d)
        K = self.to_k(context).view(B, -1, H, d).transpose(1, 2) # (B, H, S, d)
        V = self.to_v(context).view(B, -1, H, d).transpose(1, 2) # (B, H, S, d)

        # Scaled dot-product attention
        scale = d ** -0.5
        attn = torch.matmul(Q, K.transpose(-2, -1)) * scale      # (B, H, N, S)
        attn = F.softmax(attn, dim=-1)

        # Weighted sum of values
        out = torch.matmul(attn, V)                              # (B, H, N, d)
        out = out.transpose(1, 2).contiguous().view(B, N, H * d) # (B, N, d_model)
        out = self.out_proj(out)

        return out + residual  # residual connection
</code></pre>

<h3 id="unet-block-with-cross">4.4 U-Net Block combining Self-Attention + Cross-Attention</h3>

<pre><code class="language-python">
class TransformerBlock(nn.Module):
    """Single transformer block: Self-Attn → Cross-Attn → FFN"""
    def __init__(self, d_model, context_dim, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_norm = nn.LayerNorm(d_model)

        self.cross_attn = CrossAttention(d_model, context_dim, n_heads)

        self.ffn = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x, context):
        # Self-attention
        norm_x = self.self_attn_norm(x)
        attn_out, _ = self.self_attn(norm_x, norm_x, norm_x)
        x = x + attn_out

        # Cross-attention (inject text)
        x = self.cross_attn(x, context)

        # Feed-forward
        x = x + self.ffn(x)
        return x
</code></pre>

<blockquote><p><strong>Exam tip:</strong> a common mistake in the assessment is taking K, V from the image instead of from the text. If cross-attention draws K, V from image features, the text prompt has no effect and the output looks unconditional. Debug tip: check whether <code>self.to_k</code> and <code>self.to_v</code> receive <strong>context</strong> (text) or <strong>x</strong> (image).</p></blockquote>

<h2 id="full-pipeline">5. Full Text-to-Image Pipeline</h2>

<h3 id="pipeline-overview">5.1 Overview: Combining All the Components</h3>

<pre><code class="language-text">
Full Text-to-Image Pipeline
════════════════════════════

Input: "a golden retriever playing in snow"

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Step 1: TEXT ENCODING                                       │
│  ─────────────────────                                       │
│  prompt ──► CLIP Tokenizer ──► CLIP Text Encoder             │
│                                      │                       │
│                            text_emb (1, 77, 768)             │
│                                      │                       │
│  Step 2: NOISE INITIALIZATION        │                       │
│  ────────────────────────────        │                       │
│  x_T ~ N(0, I)  (pure noise)         │                       │
│  shape: (1, C, H, W)                 │                       │
│        │                             │                       │
│  Step 3: REVERSE DIFFUSION LOOP      │                       │
│  ───────────────────────────────     │                       │
│  for t = T, T-1, ..., 1:             │                       │
│      │                               │                       │
│      ├─► ε̂_uncond = UNet(x_t, t, ∅)          ← unconditional │
│      ├─► ε̂_cond   = UNet(x_t, t, text_emb)   ← conditional   │
│      │                                                       │
│      ├─► ε̂ = ε̂_uncond + w·(ε̂_cond − ε̂_uncond)   ← CFG        │
│      │                                                       │
│      └─► x_{t-1} = denoise_step(x_t, ε̂, t)                   │
│        │                                                     │
│  Step 4: OUTPUT                                              │
│  ──────────────                                              │
│  x_0 = final denoised image                                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
</code></pre>
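
<p>The CFG line in Step 3 can be sanity-checked with plain numbers. <code>cfg_combine</code> is a hypothetical helper, with scalars standing in for the noise tensors:</p>

```python
# CFG combination: epsilon_hat = eps_uncond + w * (eps_cond - eps_uncond).
# w = 0 gives the unconditional prediction, w = 1 the conditional one,
# and w > 1 extrapolates past the conditional prediction toward the text.
def cfg_combine(eps_uncond, eps_cond, w):
    return eps_uncond + w * (eps_cond - eps_uncond)
```

<p>With a typical guidance scale like w = 7.5, the combined prediction lies far beyond the conditional one, which is why high guidance trades diversity for prompt adherence.</p>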

<h3 id="pipeline-code">5.2 Implementation: Text-to-Image Sampling</h3>

<pre><code class="language-python">
@torch.no_grad()
def text_to_image_sample(
    unet, clip_model, prompt, schedule,
    guidance_scale=7.5, image_size=64, channels=3,
    device='cuda'
):
    """
    Complete text-to-image sampling pipeline.
    Combines CLIP encoding + CFG + DDPM reverse diffusion.
    """
    T = len(schedule['betas'])
    betas = schedule['betas'].to(device)
    alphas = schedule['alphas'].to(device)
    alpha_bar = schedule['alpha_bar'].to(device)

    # ── Step 1: Encode text prompt ──
    text_tokens = clip.tokenize([prompt]).to(device)         # (1, 77)
    text_emb = clip_model.encode_text_sequence(text_tokens)  # (1, 77, 768)

    # Null embedding for unconditional path (CFG)
    null_tokens = clip.tokenize([""]).to(device)
    null_emb = clip_model.encode_text_sequence(null_tokens)  # (1, 77, 768)

    # ── Step 2: Start from pure noise ──
    x_t = torch.randn(1, channels, image_size, image_size, device=device)

    # ── Step 3: Reverse diffusion with CFG ──
    for t in reversed(range(T)):
        t_batch = torch.tensor([t], device=device)

        # Conditional & unconditional predictions
        noise_cond = unet(x_t, t_batch, context=text_emb)    # ε̂_cond
        noise_uncond = unet(x_t, t_batch, context=null_emb)  # ε̂_uncond

        # Classifier-Free Guidance
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

        # DDPM denoise step
        alpha_t = alphas[t]
        alpha_bar_t = alpha_bar[t]
        beta_t = betas[t]

        # Predicted x_0
        x_0_pred = (x_t - (1 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
        x_0_pred = x_0_pred.clamp(-1, 1)

        if t > 0:
            alpha_bar_prev = alpha_bar[t - 1]
            # Posterior mean
            coeff1 = beta_t * alpha_bar_prev.sqrt() / (1 - alpha_bar_t)
            coeff2 = (1 - alpha_bar_prev) * alpha_t.sqrt() / (1 - alpha_bar_t)
            mean = coeff1 * x_0_pred + coeff2 * x_t

            # Posterior variance
            sigma = (beta_t * (1 - alpha_bar_prev) / (1 - alpha_bar_t)).sqrt()
            z = torch.randn_like(x_t)
            x_t = mean + sigma * z
        else:
            x_t = x_0_pred  # Final step: no noise

    return x_t  # Generated image (1, C, H, W)
</code></pre>

<h3 id="pipeline-components-table">5.3 Pipeline Components Summary</h3>

<table>
<thead>
<tr><th>Component</th><th>Role</th><th>Input → Output</th><th>Trainable?</th></tr>
</thead>
<tbody>
<tr><td>CLIP Text Encoder</td><td>Encode text → embeddings</td><td>str → (B, 77, 768)</td><td>Frozen (pretrained)</td></tr>
<tr><td>U-Net (with Cross-Attn)</td><td>Predict noise ε̂</td><td>(x_t, t, text_emb) → ε̂</td><td>Yes — main training target</td></tr>
<tr><td>Noise Schedule</td><td>Define β_t, α_t, ᾱ_t</td><td>t → schedule values</td><td>No (fixed)</td></tr>
<tr><td>CFG</td><td>Combine cond/uncond</td><td>(ε̂_cond, ε̂_uncond, w) → ε̂</td><td>No (inference only)</td></tr>
<tr><td>DDPM Sampler</td><td>Denoise step-by-step</td><td>(x_t, ε̂, t) → x_{t-1}</td><td>No (fixed formula)</td></tr>
</tbody>
</table>

<blockquote><p><strong>Exam tip:</strong> in the assessment you receive a code skeleton with the CLIP encoder and the schedule already provided. Your job is to implement the <strong>U-Net forward pass</strong> (with cross-attention) and the <strong>sampling loop</strong> (with CFG). Don't try to rewrite CLIP: it is given to you.</p></blockquote>

<h2 id="latent-diffusion">6. Latent Diffusion — Stable Diffusion Overview</h2>

<h3 id="pixel-space-problem">6.1 The Problem with Pixel-Space Diffusion</h3>

<p>The original DDPM runs diffusion directly in pixel space. For a 256×256 RGB image, every diffusion step processes <strong>196,608 dimensions</strong>. That is very slow and memory-hungry.</p>

<p>The <strong>Latent Diffusion Model (LDM)</strong>, the foundation of Stable Diffusion, solves this by running diffusion in a much smaller <strong>latent space</strong>.</p>

<pre><code class="language-text">
Pixel Space vs Latent Space Diffusion
══════════════════════════════════════

PIXEL SPACE (original DDPM):
────────────────────────
Image: 256 × 256 × 3 = 196,608 dims
U-Net must process a VERY large tensor
✗ Slow   ✗ VRAM-hungry   ✗ 1000 steps

LATENT SPACE (Stable Diffusion):
─────────────────────────────────
Image ──► VAE Encoder ──► Latent: 32 × 32 × 4 = 4,096 dims
                                │
               48× SMALLER      │
                                ▼
                   Diffusion in latent space
                                │
                                ▼
               Latent ──► VAE Decoder ──► Image

┌───────────────────────────────────────────────────────┐
│ Stable Diffusion Architecture:                        │
│                                                       │
│ "a golden retriever"                                  │
│         │                                             │
│         ▼                                             │
│   ┌──────────┐                                        │
│   │   CLIP   │──── text_emb (1, 77, 768)              │
│   │ Encoder  │            │                           │
│   └──────────┘            │                           │
│                           ▼                           │
│ z_T (noise) ──► U-Net (latent) ──► z_0 (latent)       │
│ (1,4,32,32)    + cross-attn       (1,4,32,32)         │
│                                        │              │
│                                        ▼              │
│                                  ┌──────────┐         │
│                                  │   VAE    │         │
│                                  │ Decoder  │         │
│                                  └────┬─────┘         │
│                                       │               │
│                                       ▼               │
│                              Image (1,3,256,256)      │
└───────────────────────────────────────────────────────┘
</code></pre>
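
<p>The dimension counts in the diagram follow from simple arithmetic, assuming the SD-style VAE (8× spatial downsampling, 4 latent channels):</p>

```python
# Compression factor of latent-space diffusion vs pixel-space diffusion.
pixel_dims = 256 * 256 * 3                  # 196,608 values per pixel-space image
latent_dims = (256 // 8) * (256 // 8) * 4   # 32 * 32 * 4 = 4,096 latent values
compression = pixel_dims // latent_dims     # dimensions per diffusion step shrink 48x
```

<p>The same factor holds at 512×512 (786,432 pixel dims vs 64×64×4 = 16,384 latent dims), which is why the speedup is quoted as a constant 48×.</p>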

<h3 id="vae-component">6.2 VAE: Encoder & Decoder</h3>

<table>
<thead>
<tr><th>Component</th><th>Pixel Space</th><th>Latent Space</th><th>Compression</th></tr>
</thead>
<tbody>
<tr><td>Image 256 × 256 × 3</td><td>196,608 dims</td><td>32 × 32 × 4 = 4,096</td><td>48× fewer dims</td></tr>
<tr><td>Image 512 × 512 × 3</td><td>786,432 dims</td><td>64 × 64 × 4 = 16,384</td><td>48× fewer dims</td></tr>
<tr><td>U-Net input</td><td>Full resolution pixels</td><td>Compressed latents</td><td>Much faster</td></tr>
<tr><td>Training cost</td><td>Very high (many GPU-days)</td><td>Much lower</td><td>Feasible on 1 GPU</td></tr>
</tbody>
</table>

<pre><code class="language-python">
# Latent Diffusion — VAE + U-Net operating in latent space
from diffusers import AutoencoderKL

# Load pretrained VAE
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae = vae.to(device).eval()

# ── Encode image → latent ──
with torch.no_grad():
    # image: (B, 3, 256, 256), normalized to [-1, 1]
    latent = vae.encode(image).latent_dist.sample()  # (B, 4, 32, 32)
    latent = latent * 0.18215  # scaling factor (Stable Diffusion convention)

# ── Diffusion happens in latent space ──
# x_T = torch.randn(1, 4, 32, 32)  ← noise in latent space
# ... reverse diffusion loop on latents ...

# ── Decode latent → image ──
with torch.no_grad():
    latent_decoded = latent / 0.18215
    image_out = vae.decode(latent_decoded).sample  # (B, 3, 256, 256)
    image_out = (image_out + 1) / 2  # [-1,1] → [0,1]
</code></pre>
|
|
614
|
+

<h3 id="ddim-scheduler">6.3 DDIM Scheduler: Fewer Steps</h3>

<p>DDPM needs <strong>1000 steps</strong> per image. <strong>DDIM (Denoising Diffusion Implicit Models)</strong> enables <strong>deterministic sampling</strong> with only 20–50 steps by skipping timesteps:</p>

<table>
<thead>
<tr><th>Scheduler</th><th>Steps</th><th>Stochastic?</th><th>Quality</th><th>Speed</th></tr>
</thead>
<tbody>
<tr><td>DDPM</td><td>1000</td><td>Yes (random z each step)</td><td>Good</td><td>Very slow</td></tr>
<tr><td>DDIM</td><td>20–50</td><td>No (deterministic)</td><td>Comparable</td><td>20–50× faster</td></tr>
<tr><td>Euler</td><td>20–30</td><td>Optional</td><td>Good</td><td>Fast</td></tr>
<tr><td>DPM-Solver</td><td>10–25</td><td>Optional</td><td>Very good</td><td>Fastest</td></tr>
</tbody>
</table>

<pre><code class="language-text">
DDPM (1000 steps) vs DDIM (50 steps)
═════════════════════════════════════

DDPM: x_1000 → x_999 → x_998 → ... → x_1 → x_0
      └──────────── 1000 U-Net calls ────────────┘

DDIM: x_1000 → x_980 → x_960 → ... → x_20 → x_0
      └──────────── 50 U-Net calls ──────────────┘
      (skipping 20 steps at a time)

DDIM key insight: non-Markovian — x_{t-k} depends on x_t & x_0 (predicted)
→ No need to visit every intermediate step
→ Same quality, 20× faster
</code></pre>

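The jump sketched above can be written in a few lines: recover the predicted x_0 from the current noisy sample, then re-noise it directly to any earlier timestep. A toy sketch under simplifying assumptions (the deterministic DDIM update with η = 0; function and variable names are illustrative, not from the lesson's codebase):

```python
import torch

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM jump from the current timestep to an earlier one.

    ab_t, ab_prev: cumulative alpha-bar at the current and target timesteps.
    The formula works for ANY earlier target — this is what allows skipping.
    """
    # 1. Recover predicted x_0 from x_t and the predicted noise
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()
    # 2. Re-noise the prediction directly to the target timestep
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps_pred

# Sanity check: with the true noise, a single jump to ab_prev = 1 recovers x_0
x0 = torch.randn(1, 4)
eps = torch.randn(1, 4)
ab_t = torch.tensor(0.5)
x_t = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * eps  # forward diffusion in one shot
x0_rec = ddim_step(x_t, eps, ab_t, torch.tensor(1.0))
print(torch.allclose(x0_rec, x0, atol=1e-5))  # True
```

Because the update only needs ᾱ at the source and target timesteps, nothing forces the target to be t−1, which is exactly why 50 jumps can replace 1000 Markovian steps.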
<blockquote><p><strong>Exam tip:</strong> If the exam asks "Why does Stable Diffusion use 50 steps while DDPM uses 1000?", the answer involves both the <strong>DDIM scheduler</strong> and the <strong>latent space</strong>. The two factors compound: DDIM reduces the number of steps, while the latent space shrinks the size of each step.</p></blockquote>

<h2 id="cheat-sheet">7. Cheat Sheet — Part 2 Summary</h2>

<table>
<thead>
<tr><th>Concept</th><th>Key Formula / Detail</th><th>Exam Focus</th></tr>
</thead>
<tbody>
<tr><td>CLIP Text Encoder</td><td>text → (B, 77, 768) embeddings</td><td>Shape, frozen vs trainable</td></tr>
<tr><td>Contrastive Loss</td><td>CE over N×N similarity matrix</td><td>Matched pairs ↑, non-matched ↓</td></tr>
<tr><td>Cross-Attention Q</td><td>Q = W_q · image_features</td><td>Q from image, NOT text</td></tr>
<tr><td>Cross-Attention K, V</td><td>K = W_k · text_emb, V = W_v · text_emb</td><td>K, V from text, NOT image</td></tr>
<tr><td>Block order in U-Net</td><td>ResBlock → Self-Attn → Cross-Attn → FFN</td><td>Coding order matters</td></tr>
<tr><td>CFG formula</td><td>ε̂ = ε̂_uncond + w·(ε̂_cond − ε̂_uncond)</td><td>w = 7.5 default, 2 forward passes</td></tr>
<tr><td>Latent space (SD)</td><td>256×256×3 → 32×32×4 via VAE</td><td>48× compression, 4-channel latent</td></tr>
<tr><td>DDIM vs DDPM</td><td>50 vs 1000 steps</td><td>Non-Markovian, deterministic</td></tr>
<tr><td>VAE scaling factor</td><td>0.18215</td><td>Multiply after encode, divide before decode</td></tr>
<tr><td>Pipeline order</td><td>Text → CLIP → U-Net(+CFG) → VAE Decode → Image</td><td>End-to-end flow</td></tr>
</tbody>
</table>

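Two rows of the cheat sheet, the CFG formula and the VAE scaling factor, can be sanity-checked with toy tensors and no model at all (a small sketch; the values are illustrative):

```python
import torch

# CFG: ε̂ = ε̂_uncond + w · (ε̂_cond − ε̂_uncond)
eps_uncond = torch.zeros(1, 4)
eps_cond = torch.ones(1, 4)
w = 7.5
eps_hat = eps_uncond + w * (eps_cond - eps_uncond)
print(eps_hat[0, 0].item())  # 7.5 — note that w = 1.0 collapses to eps_cond

# VAE scaling convention: × 0.18215 after encode, ÷ 0.18215 before decode
latent = torch.randn(1, 4, 32, 32)
roundtrip = (latent * 0.18215) / 0.18215
print(torch.allclose(roundtrip, latent))  # True: the two ops must mirror each other
```

Getting either direction of the scaling wrong does not crash anything; it silently shifts the latent distribution the U-Net was trained on, which is why it shows up as washed-out output rather than an error.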
<h2 id="assessment-prep">8. Assessment Prep — DLI S-FX-14 Final Assessment</h2>

<h3 id="assessment-overview">8.1 Assessment Overview</h3>

<p>The DLI assessment <strong>S-FX-14</strong> requires you to complete coding tasks in a JupyterLab environment. You receive a code skeleton with <code># TODO</code> markers and must implement the missing parts.</p>

<table>
<thead>
<tr><th>Section</th><th>Content</th><th>Weight (estimated)</th><th>Suggested time</th></tr>
</thead>
<tbody>
<tr><td>U-Net architecture</td><td>Implement ResBlock, Attention, CrossAttention</td><td>~30%</td><td>25 min</td></tr>
<tr><td>DDPM Training</td><td>Forward diffusion, training loop, loss</td><td>~25%</td><td>20 min</td></tr>
<tr><td>Text conditioning</td><td>CLIP integration, cross-attention wiring</td><td>~25%</td><td>20 min</td></tr>
<tr><td>Sampling pipeline</td><td>Reverse diffusion + CFG sampling</td><td>~20%</td><td>15 min</td></tr>
</tbody>
</table>

<h3 id="common-pitfalls">8.2 Common Pitfalls & Fixes</h3>

<table>
<thead>
<tr><th>Pitfall</th><th>Symptom</th><th>Fix</th></tr>
</thead>
<tbody>
<tr><td>Cross-attn K/V from image</td><td>Text prompt has no effect on output</td><td>Ensure K, V take <code>context</code> (text) and Q takes <code>x</code> (image)</td></tr>
<tr><td>Forgot to L2-normalize CLIP</td><td>Similarity values out of range</td><td>Add <code>/ embed.norm(dim=-1, keepdim=True)</code></td></tr>
<tr><td>CFG guidance_scale = 1.0</td><td>Low-quality images that ignore the prompt</td><td>Use w = 7.5 or the value the task specifies</td></tr>
<tr><td>Wrong shape when reshaping attention</td><td><code>RuntimeError: shape mismatch</code></td><td>Check (B, H, N, d) → (B, N, H*d) ordering</td></tr>
<tr><td>Forgot <code>.no_grad()</code> when sampling</td><td>Out of memory</td><td>Wrap the sampling loop in <code>torch.no_grad()</code></td></tr>
<tr><td>Wrong VAE scaling factor</td><td>Output image washed out or saturated</td><td>Encode: × 0.18215, Decode: ÷ 0.18215</td></tr>
<tr><td>Wrong timestep embedding dim</td><td>Size mismatch in ResBlock</td><td>Verify t_emb dim matches channel dim</td></tr>
</tbody>
</table>

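The L2-normalization pitfall from the table is easy to reproduce with toy embeddings: raw dot products leak vector magnitudes, while normalized ones are bounded cosine similarities (a sketch; the tensors are synthetic stand-ins for CLIP embeddings):

```python
import torch

emb = torch.randn(4, 768) * 10  # un-normalized CLIP-style embeddings

raw_sim = emb @ emb.T  # magnitudes leak into the similarity values
print(raw_sim.diag().min().item() > 1)  # True — self-similarity is huge, not 1

# The fix from the table: L2 normalize along the feature dimension
emb_n = emb / emb.norm(dim=-1, keepdim=True)
cos_sim = emb_n @ emb_n.T
print(torch.allclose(cos_sim.diag(), torch.ones(4), atol=1e-5))  # True: self-sim = 1
print(cos_sim.abs().max().item() <= 1 + 1e-5)                    # True: bounded
```

Without normalization, softmax over the similarity matrix is dominated by whichever embedding happens to have the largest norm, which silently corrupts the contrastive objective.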
<h3 id="strategy">8.3 Assessment Strategy</h3>

<ol>
<li><strong>Read the entire notebook first</strong> (5 minutes) — understand the flow and locate the TODO blocks</li>
<li><strong>Implement in order</strong>: U-Net blocks → forward diffusion → training loop → sampling</li>
<li><strong>Test each part</strong>: run the cell after each TODO to confirm shapes/outputs are correct</li>
<li><strong>Debug shape errors</strong>: add temporary <code>print(tensor.shape)</code> calls</li>
<li><strong>Do not rewrite provided code</strong> — only fill in the TODOs and leave everything else untouched</li>
</ol>

<blockquote><p><strong>Exam tip:</strong> The assessment lets you run code multiple times. <strong>Test incrementally</strong>: implement one TODO → run the cell → verify → move on to the next TODO. Don't implement everything and only then run it — debugging many simultaneous errors is much harder.</p></blockquote>

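Instead of scattering temporary print(tensor.shape) calls, a single forward hook per leaf module logs every output shape in one pass. A small debugging sketch (the helper name log_shapes is ours, not part of the assessment skeleton):

```python
import torch
import torch.nn as nn

def log_shapes(model):
    """Attach hooks that print each leaf module's output shape on the next forward pass."""
    handles = []
    for name, module in model.named_modules():
        if name and len(list(module.children())) == 0:  # leaf modules only
            def hook(mod, inp, out, name=name):
                if isinstance(out, torch.Tensor):
                    print(f"{name:12s} -> {tuple(out.shape)}")
            handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done

# Usage on a toy block: one forward pass shows where a shape goes wrong
block = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 256))
handles = log_shapes(block)
out = block(torch.randn(2, 64, 256))  # logs (2, 64, 512), (2, 64, 512), (2, 64, 256)
for h in handles:
    h.remove()
```

Remember to remove the hooks afterwards; leaving print hooks attached during a sampling loop would flood the output for every timestep.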
<h2 id="practice">9. Practice Questions — Coding Exercises</h2>

<p>The questions below mimic the task format of the DLI assessment. Try to solve each one yourself before looking at the answer.</p>

<p><strong>Q1: Implement CrossAttention module</strong></p>

<p>Complete the <code>CrossAttention</code> module. Q comes from image features, K and V come from text embeddings. Use multi-head attention with a residual connection.</p>

<pre><code class="language-python">
class CrossAttention(nn.Module):
    def __init__(self, d_model=256, context_dim=768, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # TODO: Define to_q, to_k, to_v, out_proj, and norm
        pass

    def forward(self, x, context):
        """
        x: (B, N, d_model) - image features
        context: (B, S, context_dim) - text embeddings
        Returns: (B, N, d_model)
        """
        # TODO: Implement cross-attention
        pass
</code></pre>

<details>
<summary>View Q1 answer</summary>

<pre><code class="language-python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, d_model=256, context_dim=768, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.to_q = nn.Linear(d_model, d_model, bias=False)
        self.to_k = nn.Linear(context_dim, d_model, bias=False)
        self.to_v = nn.Linear(context_dim, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, context):
        residual = x
        x = self.norm(x)
        B, N, _ = x.shape
        H, d = self.n_heads, self.d_head

        Q = self.to_q(x).view(B, N, H, d).transpose(1, 2)         # (B, H, N, d)
        K = self.to_k(context).view(B, -1, H, d).transpose(1, 2)  # (B, H, S, d)
        V = self.to_v(context).view(B, -1, H, d).transpose(1, 2)  # (B, H, S, d)

        scale = d ** -0.5
        attn = torch.matmul(Q, K.transpose(-2, -1)) * scale
        attn = F.softmax(attn, dim=-1)

        out = torch.matmul(attn, V)
        out = out.transpose(1, 2).contiguous().view(B, N, H * d)
        out = self.out_proj(out)
        return out + residual
</code></pre>

<p><em>Key point: <code>to_q</code> takes <code>x</code> (image), while <code>to_k</code> and <code>to_v</code> take <code>context</code> (text). That is the only difference from self-attention.</em></p>
</details>

<p><strong>Q2: Build full text-to-image sampling pipeline</strong></p>

<p>Given a trained U-Net with cross-attention, CLIP text encoder, and DDPM schedule, implement the complete sampling function with Classifier-Free Guidance.</p>

<pre><code class="language-python">
@torch.no_grad()
def sample_text_to_image(unet, clip_encoder, prompt, schedule,
                         guidance_scale=7.5, H=64, W=64, C=3,
                         device='cuda'):
    """
    Generate image from text prompt.
    Args:
        unet: U-Net with cross-attention (takes x_t, t, context)
        clip_encoder: encodes text → (1, 77, 768)
        prompt: string, e.g. "a cat sitting on a chair"
        schedule: dict with 'betas', 'alphas', 'alpha_bar'
        guidance_scale: CFG weight (w)
    Returns: generated image tensor (1, C, H, W)
    """
    # TODO: Implement full pipeline
    # 1. Encode prompt and null prompt with CLIP
    # 2. Initialize x_T as random noise
    # 3. Reverse diffusion loop with CFG
    # 4. Return x_0
    pass
</code></pre>

<details>
<summary>View Q2 answer</summary>

<pre><code class="language-python">
@torch.no_grad()
def sample_text_to_image(unet, clip_encoder, prompt, schedule,
                         guidance_scale=7.5, H=64, W=64, C=3,
                         device='cuda'):
    T = len(schedule['betas'])
    betas = schedule['betas'].to(device)
    alphas = schedule['alphas'].to(device)
    alpha_bar = schedule['alpha_bar'].to(device)

    # 1. Encode text
    text_emb = clip_encoder.encode(prompt)  # (1, 77, 768)
    null_emb = clip_encoder.encode("")      # (1, 77, 768)

    # 2. Initialize noise
    x_t = torch.randn(1, C, H, W, device=device)

    # 3. Reverse diffusion
    for t in reversed(range(T)):
        t_tensor = torch.tensor([t], device=device)

        # CFG: two forward passes
        eps_cond = unet(x_t, t_tensor, context=text_emb)
        eps_uncond = unet(x_t, t_tensor, context=null_emb)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # DDPM reverse step
        ab_t = alpha_bar[t]
        a_t = alphas[t]
        b_t = betas[t]

        x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        x0_pred = x0_pred.clamp(-1, 1)

        if t > 0:
            ab_prev = alpha_bar[t - 1]
            c1 = b_t * ab_prev.sqrt() / (1 - ab_t)
            c2 = (1 - ab_prev) * a_t.sqrt() / (1 - ab_t)
            mean = c1 * x0_pred + c2 * x_t
            sigma = (b_t * (1 - ab_prev) / (1 - ab_t)).sqrt()
            x_t = mean + sigma * torch.randn_like(x_t)
        else:
            x_t = x0_pred

    return x_t
</code></pre>

<p><em>Key points: (1) the null embedding provides the unconditional path for CFG, (2) two U-Net forward passes per step, (3) clamp x0_pred to avoid numerical instability, (4) no noise is added at t = 0.</em></p>
</details>

<p><strong>Q3: Explain why Latent Diffusion uses ~50 steps while DDPM needs 1000</strong></p>

<p>Write a short function that demonstrates the difference between DDPM and DDIM step selection, and explain in comments why DDIM can skip steps.</p>

<pre><code class="language-python">
def compare_schedulers(T=1000, ddim_steps=50):
    """
    Show the difference between DDPM and DDIM timestep selection.
    TODO: Return both timestep sequences and add comments explaining
    why DDIM can skip steps without quality loss.
    """
    # TODO: implement
    pass
</code></pre>

<details>
<summary>View Q3 answer</summary>

<pre><code class="language-python">
def compare_schedulers(T=1000, ddim_steps=50):
    """
    DDPM: must visit every timestep t = T-1, T-2, ..., 1, 0
      → each step is Markovian: x_{t-1} depends ONLY on x_t
      → cannot skip steps without breaking the Markov chain

    DDIM: can skip timesteps using a non-Markovian formulation
      → x_{t-k} = f(x_t, predicted_x_0) — depends on x_t AND predicted x_0
      → the "shortcut" through predicted x_0 allows jumping multiple steps
      → deterministic (no random noise z added at each step)
    """
    # DDPM: all 1000 steps
    ddpm_steps = list(range(T - 1, -1, -1))  # [999, 998, ..., 1, 0]

    # DDIM: evenly spaced subset
    step_size = T // ddim_steps  # 1000 // 50 = 20
    ddim_timesteps = list(range(T - 1, -1, -step_size))  # [999, 979, 959, ..., 19]

    print(f"DDPM: {len(ddpm_steps)} steps")
    print(f"  First 5: {ddpm_steps[:5]}")
    print(f"  Last 5:  {ddpm_steps[-5:]}")

    print(f"\nDDIM: {len(ddim_timesteps)} steps")
    print(f"  First 5: {ddim_timesteps[:5]}")
    print(f"  Last 5:  {ddim_timesteps[-5:]}")

    # Key reason: DDIM uses a non-Markovian update rule:
    #   x_{t-k} = sqrt(ᾱ_{t-k}) * predicted_x0 + sqrt(1 - ᾱ_{t-k}) * direction
    # This formula works for ANY t-k, not just t-1
    # → can skip from t=999 to t=979 directly

    return ddpm_steps, ddim_timesteps
</code></pre>

<p><em>Core insight: DDPM is Markovian (each step depends only on the previous one), while DDIM is non-Markovian (it also depends on the predicted x_0). The non-Markovian formulation allows "jumping" over many steps at once with no significant quality loss.</em></p>
</details>

<p><strong>Q4: Debug — Text prompt has no effect on generated image</strong></p>

<p>The following code generates images, but changing the text prompt does NOT change the output. Find and fix the bug.</p>

<pre><code class="language-python">
class BuggyUNetBlock(nn.Module):
    def __init__(self, d_model=256, context_dim=768, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_q = nn.Linear(d_model, d_model)
        self.cross_attn_k = nn.Linear(d_model, d_model)  # BUG HERE?
        self.cross_attn_v = nn.Linear(d_model, d_model)  # BUG HERE?
        self.cross_attn_out = nn.Linear(d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, context):
        # Self-attention
        norm_x = self.norm1(x)
        sa_out, _ = self.self_attn(norm_x, norm_x, norm_x)
        x = x + sa_out

        # Cross-attention
        norm_x = self.norm2(x)
        Q = self.cross_attn_q(norm_x)
        K = self.cross_attn_k(norm_x)  # ← THIS LINE
        V = self.cross_attn_v(norm_x)  # ← AND THIS LINE
        # ... attention computation ...
        return x
</code></pre>

<details>
<summary>View Q4 answer</summary>

<pre><code class="language-python">
# BUG: cross_attn_k and cross_attn_v take norm_x (image features)
# instead of context (text embeddings).
# This makes "cross-attention" effectively another self-attention,
# so the text prompt has ZERO effect on the output.

# FIX 1: Change Linear input dimensions
self.cross_attn_k = nn.Linear(context_dim, d_model)  # context_dim, not d_model
self.cross_attn_v = nn.Linear(context_dim, d_model)  # context_dim, not d_model

# FIX 2: Pass context instead of norm_x
K = self.cross_attn_k(context)  # ← FIX: use context, not norm_x
V = self.cross_attn_v(context)  # ← FIX: use context, not norm_x
</code></pre>

<p><em>Two bugs: (1) <code>cross_attn_k</code> and <code>cross_attn_v</code> have input dim = d_model instead of context_dim, and (2) K and V are computed from <code>norm_x</code> (image) instead of <code>context</code> (text). As a result, the U-Net ignores text conditioning entirely → the output matches unconditional generation regardless of the prompt.</em></p>
</details>

<p><strong>Q5: Integration test — Assemble working text-to-image system</strong></p>

<p>Given the following pre-built components, write the integration code that connects them into a working text-to-image system and generates one image.</p>

<pre><code class="language-python">
# Pre-built components (already defined):
# - clip_model: has .encode_text(tokens) → (B, 77, 768)
# - unet: has .forward(x_t, t_emb, context) → noise prediction
# - schedule: dict with 'betas', 'alphas', 'alpha_bar' (T=1000)
# - ddpm_reverse_step(x_t, noise_pred, t, schedule) → x_{t-1}

def generate_image(prompt: str, negative_prompt: str = "",
                   guidance_scale: float = 7.5, steps: int = 1000,
                   image_size: int = 64, channels: int = 3):
    """
    TODO: Wire all components together.
    Handle: CLIP encoding, null prompt for CFG, reverse loop, CFG combination.
    Return final image tensor.
    """
    pass
</code></pre>

<details>
<summary>View Q5 answer</summary>

<pre><code class="language-python">
@torch.no_grad()
def generate_image(prompt: str, negative_prompt: str = "",
                   guidance_scale: float = 7.5, steps: int = 1000,
                   image_size: int = 64, channels: int = 3):
    device = next(unet.parameters()).device

    # ── 1. CLIP encode: positive and negative/null prompts ──
    # (assumes a tokenizer such as clip.tokenize is available)
    pos_tokens = clip.tokenize([prompt]).to(device)
    neg_tokens = clip.tokenize([negative_prompt]).to(device)

    pos_emb = clip_model.encode_text(pos_tokens)  # (1, 77, 768)
    neg_emb = clip_model.encode_text(neg_tokens)  # (1, 77, 768)

    # ── 2. Initialize random noise ──
    x_t = torch.randn(1, channels, image_size, image_size, device=device)

    # ── 3. Reverse diffusion with CFG ──
    for t in reversed(range(steps)):
        t_tensor = torch.tensor([t], device=device)

        # Two forward passes for CFG
        noise_pos = unet(x_t, t_tensor, context=pos_emb)
        noise_neg = unet(x_t, t_tensor, context=neg_emb)

        # CFG combination
        noise_guided = noise_neg + guidance_scale * (noise_pos - noise_neg)

        # Denoise step
        x_t = ddpm_reverse_step(x_t, noise_guided, t, schedule)

    # ── 4. Post-process ──
    image = (x_t.clamp(-1, 1) + 1) / 2  # [-1,1] → [0,1]
    return image

# Generate!
img = generate_image("a golden retriever playing in snow", guidance_scale=7.5)
print(f"Output shape: {img.shape}")  # (1, 3, 64, 64)
</code></pre>

<p><em>Integration checklist: (1) encode both the positive and the negative prompt, (2) initialize noise with the right shape, (3) loop backward from T−1 to 0, (4) two forward passes per step for CFG, (5) combine with guidance_scale, (6) call the denoise step, (7) post-process the output. An empty negative prompt "" acts as the unconditional path — this is exactly how Stable Diffusion handles CFG.</em></p>
</details>
|