clusterkit 0.1.0.pre.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (49)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.simplecov +47 -0
  4. data/CHANGELOG.md +35 -0
  5. data/CLAUDE.md +226 -0
  6. data/Cargo.toml +8 -0
  7. data/Gemfile +17 -0
  8. data/IMPLEMENTATION_NOTES.md +143 -0
  9. data/LICENSE.txt +21 -0
  10. data/PYTHON_COMPARISON.md +183 -0
  11. data/README.md +499 -0
  12. data/Rakefile +245 -0
  13. data/clusterkit.gemspec +45 -0
  14. data/docs/KNOWN_ISSUES.md +130 -0
  15. data/docs/RUST_ERROR_HANDLING.md +164 -0
  16. data/docs/TEST_FIXTURES.md +170 -0
  17. data/docs/UMAP_EXPLAINED.md +362 -0
  18. data/docs/UMAP_TROUBLESHOOTING.md +284 -0
  19. data/docs/VERBOSE_OUTPUT.md +84 -0
  20. data/examples/hdbscan_example.rb +147 -0
  21. data/examples/optimal_kmeans_example.rb +96 -0
  22. data/examples/pca_example.rb +114 -0
  23. data/examples/reproducible_umap.rb +99 -0
  24. data/examples/verbose_control.rb +43 -0
  25. data/ext/clusterkit/Cargo.toml +25 -0
  26. data/ext/clusterkit/extconf.rb +4 -0
  27. data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
  28. data/ext/clusterkit/src/clustering.rs +267 -0
  29. data/ext/clusterkit/src/embedder.rs +413 -0
  30. data/ext/clusterkit/src/lib.rs +22 -0
  31. data/ext/clusterkit/src/svd.rs +112 -0
  32. data/ext/clusterkit/src/tests.rs +16 -0
  33. data/ext/clusterkit/src/utils.rs +33 -0
  34. data/lib/clusterkit/clustering/hdbscan.rb +177 -0
  35. data/lib/clusterkit/clustering.rb +213 -0
  36. data/lib/clusterkit/clusterkit.rb +9 -0
  37. data/lib/clusterkit/configuration.rb +24 -0
  38. data/lib/clusterkit/dimensionality/pca.rb +251 -0
  39. data/lib/clusterkit/dimensionality/svd.rb +144 -0
  40. data/lib/clusterkit/dimensionality/umap.rb +311 -0
  41. data/lib/clusterkit/dimensionality.rb +29 -0
  42. data/lib/clusterkit/hdbscan_api_design.rb +142 -0
  43. data/lib/clusterkit/preprocessing.rb +106 -0
  44. data/lib/clusterkit/silence.rb +42 -0
  45. data/lib/clusterkit/utils.rb +51 -0
  46. data/lib/clusterkit/version.rb +5 -0
  47. data/lib/clusterkit.rb +93 -0
  48. data/lib/tasks/visualize.rake +641 -0
  49. metadata +194 -0
+++ data/docs/TEST_FIXTURES.md
@@ -0,0 +1,170 @@
# Test Fixtures for UMAP Testing

## Overview

To avoid the hanging issues that occur when testing UMAP with synthetic random data, we use real embeddings from text models as test fixtures. This ensures our tests are both reliable and realistic.

## Why Real Embeddings?

UMAP's initialization algorithm (`dmap_init`) expects data with manifold structure - the kind of structure that real embeddings naturally have. When given uniform random data, it can fail catastrophically, initializing all points to the same location and causing infinite loops.

Real text embeddings have:
- Natural clustering (semantically similar texts group together)
- Meaningful correlations between dimensions
- Appropriate value ranges (typically [-0.12, 0.12])
- Inherent manifold structure that UMAP is designed to discover

## Generating Fixtures

### Prerequisites

1. Install the development dependencies:
   ```bash
   bundle install --with development
   ```

2. Generate the embedding fixtures:
   ```bash
   rake fixtures:generate_embeddings
   ```

This will create several fixture files in `spec/fixtures/embeddings/`:

- **basic_15.json** - 15 general sentences for basic testing
- **clusters_30.json** - 30 sentences in 3 distinct topic clusters (tech, nature, food)
- **minimal_6.json** - 6 sentences for minimum viable dataset testing
- **large_100.json** - 100 sentences for performance testing

### Fixture Format

Each fixture is a JSON file containing:
```json
{
  "description": "Test embeddings for basic_15",
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "dimension": 384,
  "count": 15,
  "embeddings": [
    [0.123, -0.045, ...],  // 384-dimensional vectors
    ...
  ]
}
```

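As a sketch of how a fixture in this format can be consumed, the plain-Ruby loader below parses the JSON and sanity-checks `count` and `dimension` against the embeddings it describes. The `load_fixture` name is illustrative, not one of the gem's actual spec helpers:

```ruby
require 'json'

# Illustrative loader for the fixture format above (not the gem's real helper).
# It fails fast if the metadata disagrees with the embeddings array.
def load_fixture(path)
  data = JSON.parse(File.read(path))
  embeddings = data.fetch('embeddings')
  raise "count mismatch" unless embeddings.size == data['count']
  unless embeddings.all? { |e| e.size == data['dimension'] }
    raise "dimension mismatch"
  end
  embeddings
end
```

Validating up front like this surfaces truncated or hand-edited fixtures immediately, instead of passing malformed data into UMAP where it would fail much less legibly.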
## Using Fixtures in Tests

The specs automatically use fixtures when available:

```ruby
RSpec.describe "My UMAP test" do
  let(:test_data) do
    if fixtures_available?
      load_embedding_fixture('basic_15')    # Real embeddings
    else
      generate_structured_test_data(15, 30) # Fallback
    end
  end

  it "processes embeddings" do
    umap = ClusterKit::UMAP.new
    result = umap.fit_transform(test_data)
    # Uses real embeddings, avoiding hanging issues
    expect(result.size).to eq(test_data.size)
  end
end
```

### Available Helper Methods

- `fixtures_available?` - Check if any fixtures exist
- `load_embedding_fixture(name)` - Load all embeddings from a fixture
- `load_embedding_subset(name, count)` - Load first N embeddings
- `fixture_metadata(name)` - Get metadata about a fixture
- `generate_structured_test_data(n_points, n_dims)` - Fallback data generator

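For intuition, the fallback generator can sidestep the uniform-noise problem described above by sampling points around a few cluster centers, so the data has some structure for UMAP to find. This body is a hedged sketch of that idea, not the spec helper's actual implementation:

```ruby
# Hedged sketch of a structured fallback generator (illustrative, not the
# real generate_structured_test_data): points are jittered around a few
# cluster centers instead of being uniform random noise.
def generate_structured_test_data(n_points, n_dims, n_clusters: 3, seed: 42)
  rng = Random.new(seed)
  centers = Array.new(n_clusters) { Array.new(n_dims) { rng.rand(-1.0..1.0) } }
  Array.new(n_points) do |i|
    center = centers[i % n_clusters]
    center.map { |c| c + rng.rand(-0.05..0.05) } # small jitter around the center
  end
end

data = generate_structured_test_data(15, 30)
```

Seeding the generator keeps the fallback deterministic across runs, matching the spirit of the fixtures.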
## Listing Available Fixtures

To see what fixtures are available:
```bash
rake fixtures:list
```

Output:
```
Available embedding fixtures:
  basic_15.json: 15 embeddings, 384D
  clusters_30.json: 30 embeddings, 384D
  minimal_6.json: 6 embeddings, 384D
  large_100.json: 100 embeddings, 384D
```

## When to Regenerate Fixtures

Regenerate fixtures when:
- Switching to a different embedding model
- Adding new test scenarios
- Fixtures become corrupted or deleted

The fixtures are deterministic for a given model and input text, so regenerating them should produce functionally equivalent embeddings.

## CI/CD Considerations

For CI environments, you have two options:

1. **Commit fixtures to git** (recommended for small fixtures):
   ```bash
   git add spec/fixtures/embeddings/*.json
   git commit -m "Add embedding test fixtures"
   ```

2. **Generate fixtures in CI**:
   Add to your CI workflow:
   ```yaml
   - name: Generate test fixtures
     run: bundle exec rake fixtures:generate_embeddings
   ```

Note: Option 2 requires red-candle to be available in CI, which will download the embedding model on first use.

## Troubleshooting

### "Fixture file not found" Error
Run `rake fixtures:generate_embeddings` to create the fixtures.

### Tests Still Hanging
Ensure you're using the fixture data, not generating random data. Check that your test includes:
```ruby
if fixtures_available?
  load_embedding_fixture('basic_15')
```

### red-candle Not Found
Install development dependencies:
```bash
bundle install --with development
```

### Model Download Issues
The first run will download the embedding model (~90MB). Ensure you have internet connectivity and sufficient disk space.

## Adding New Fixtures

To add new test scenarios, edit `Rakefile` and add to the `test_cases` hash:

```ruby
test_cases = {
  'my_new_test' => [
    "First test sentence",
    "Second test sentence",
    # ...
  ]
}
```

Then regenerate:
```bash
rake fixtures:generate_embeddings
```

## Performance Note

Using real embeddings makes tests slightly slower than with random data, but the reliability improvement is worth it. The fixtures load quickly from JSON, and the UMAP algorithm actually converges properly instead of hanging.
+++ data/docs/UMAP_EXPLAINED.md
@@ -0,0 +1,362 @@
# UMAP: Dimensionality Reduction for Software Developers

## What is UMAP?

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm that transforms high-dimensional data (e.g., 768-dimensional embeddings) into low-dimensional representations (typically 2D or 3D) while preserving the data's underlying structure. It's particularly effective at maintaining both local neighborhoods and global structure.

## Example: The Sphere Analogy

Consider points in 3D space that form a sphere. You could represent each point with (x, y, z) coordinates, but if you know they lie on a sphere's surface, you could more efficiently represent them with just latitude and longitude - reducing from 3 to 2 dimensions without losing information.

UMAP works similarly: it discovers that your high-dimensional data lies on a lower-dimensional manifold (like the sphere's surface), finds the parameters of that manifold, and maps points to this more efficient coordinate system.

**The key insight**: Just as latitude/longitude preserves relationships between points on Earth's surface, UMAP preserves relationships between points on the discovered manifold. The difference is that UMAP must first discover what shape the manifold is - it might be sphere-like, pretzel-shaped, or something more complex.

```ruby
# Example: 10,000 points on a "sphere" in 100D space
# The data is 100-dimensional, but actually lies on a 2D surface

# Generate data on a hypersphere with noise
points_100d = generate_sphere_surface_in_100d(n_points: 10000)

# UMAP discovers the 2D manifold structure
reducer = ClusterKit::UMAP.new(n_components: 2)
coords_2d = reducer.fit_transform(points_100d)

# coords_2d now contains the "latitude/longitude" equivalent
# for the discovered manifold - a 2D representation that
# preserves the essential structure of the 100D data
```

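The `generate_sphere_surface_in_100d` helper above is hypothetical, not part of clusterkit. One hedged way to realize it in plain Ruby: sample points uniformly on an ordinary sphere (a genuinely 2D surface) via normalized Gaussian draws, then pad the coordinates out to 100 dimensions with low-amplitude noise:

```ruby
# Illustrative stand-in for the hypothetical generate_sphere_surface_in_100d.
# Normalizing three Gaussian samples yields a uniform point on the unit
# sphere; the remaining 97 dimensions carry only small noise.
def generate_sphere_surface_in_100d(n_points:, n_dims: 100, seed: 42)
  rng = Random.new(seed)
  # Box-Muller transform: turn uniform samples into Gaussian samples
  gaussian = lambda do
    Math.sqrt(-2.0 * Math.log(1.0 - rng.rand)) * Math.cos(2.0 * Math::PI * rng.rand)
  end
  Array.new(n_points) do
    g = Array.new(3) { gaussian.call }
    norm = Math.sqrt(g.sum { |x| x * x })
    g.map { |x| x / norm } + Array.new(n_dims - 3) { rng.rand * 0.01 }
  end
end
```

Data built this way is 100-dimensional on paper but intrinsically 2D, which is exactly the situation the analogy describes.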
This is why UMAP is so effective with embeddings: even though embeddings are high-dimensional, they often lie on or near a much lower-dimensional manifold that captures the true relationships in the data.

## Core Algorithm Concept

UMAP operates on the principle that high-dimensional data lies on a lower-dimensional manifold. It constructs a topological representation of the data and then optimizes a low-dimensional layout to match this topology as closely as possible.

The algorithm makes two key assumptions:
1. The data is uniformly distributed on a Riemannian manifold
2. The manifold is locally connected

## How UMAP Works: Algorithmic Steps

### Step 1: Construct k-NN Graph
- For each data point, find its k nearest neighbors (typically k=15-50)
- Build a weighted graph where edge weights represent distances
- Uses approximate nearest neighbor algorithms for efficiency (e.g., RP-trees, NNDescent)

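For small inputs, the Step 1 graph can be built by brute force. This O(N²) Ruby sketch returns, for each point, its k nearest neighbors as `[index, distance]` pairs with Euclidean weights; it only illustrates the structure, since production implementations use approximate search:

```ruby
# Brute-force k-NN graph for illustration (quadratic in the number of
# points; real implementations use approximate nearest neighbor search).
def knn_graph(points, k)
  points.each_index.map do |i|
    candidates = points.each_index.reject { |j| j == i }.map do |j|
      dist = Math.sqrt(points[i].zip(points[j]).sum { |a, b| (a - b)**2 })
      [j, dist]
    end
    candidates.min_by(k) { |(_, d)| d } # the k nearest, as [index, distance]
  end
end
```

The returned adjacency lists (with their distances) are exactly the weighted edges that the next step converts into a fuzzy topological representation.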
### Step 2: Compute Fuzzy Simplicial Set
- Convert the k-NN graph into a fuzzy topological representation
- Apply local scaling based on distance to nearest neighbors
- Create symmetric graph by combining directed edges using fuzzy set union

### Step 3: Initialize Low-Dimensional Embedding
- Generate initial positions in target dimension (usually 2D/3D)
- Can use spectral embedding for better initialization or random placement

### Step 4: Optimize Layout via SGD
- Minimize cross-entropy between high-dimensional and low-dimensional fuzzy representations
- Uses attractive forces for connected points, repulsive forces for non-connected points
- Typically runs for 200-500 iterations with learning rate decay

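The shape of the Step 4 loop can be caricatured in a few lines of Ruby: attract along graph edges, repel from a randomly sampled point (negative sampling), and decay the step size each epoch. This is a toy sketch of the idea only - ClusterKit's Rust backend minimizes the actual cross-entropy objective:

```ruby
# Toy sketch of the SGD layout loop in 2D (illustrative, not ClusterKit's
# implementation). Connected points attract; a random other point repels.
def sgd_layout(points, edges, epochs: 50, lr: 0.1, seed: 1)
  rng = Random.new(seed)
  epochs.times do |epoch|
    alpha = lr * (1.0 - epoch.to_f / epochs) # learning rate decay
    edges.each do |i, j|
      dx = points[j][0] - points[i][0]
      dy = points[j][1] - points[i][1]
      # attractive move along the edge
      points[i][0] += alpha * dx
      points[i][1] += alpha * dy
      points[j][0] -= alpha * dx
      points[j][1] -= alpha * dy
      # small repulsive nudge away from a randomly sampled point
      k = rng.rand(points.size)
      next if k == i
      rx = points[i][0] - points[k][0]
      ry = points[i][1] - points[k][1]
      d2 = rx * rx + ry * ry + 1e-3
      points[i][0] += alpha * 0.01 * rx / d2
      points[i][1] += alpha * 0.01 * ry / d2
    end
  end
  points
end
```

Even this toy version shows the balance the real optimizer strikes: attraction pulls neighbors together while repulsion keeps the layout from collapsing to a point.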
## Technical Classification

**UMAP is:**
- A **non-linear dimensionality reduction technique**
- A **manifold learning algorithm**
- A **graph-based embedding method**

**Algorithm Type:** UMAP doesn't fit traditional ML model categories. It's a transformation algorithm that learns a mapping from high-dimensional to low-dimensional space. Once trained on a dataset, it can transform new points using the learned embedding space.

## Data Requirements

### Dataset Size
- **Minimum viable**: 500 data points
- **Recommended**: 2,000-10,000 points
- **Scales well to**: millions of points

### Computational Complexity
- **Time complexity**: approximately O(N^1.14) for N data points (empirical)
- **Memory complexity**: O(N) with optimizations
- **Typical runtime**: around 30 seconds for 10K points with 100 dimensions

### Input Format
- Dense numerical arrays (in Ruby, arrays of arrays of floats; numpy arrays or tensors in Python)
- All points must have the same dimensionality
- Works best with normalized/standardized features

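Since UMAP works best on standardized features, a per-feature z-score pass over the arrays-of-arrays input format might look like the sketch below. This is a plain-Ruby illustration, not necessarily what clusterkit's preprocessing module does:

```ruby
# Z-score standardization per feature (illustrative sketch): each column is
# shifted to mean 0 and scaled to standard deviation 1. Constant columns
# (std = 0) are mapped to 0.0 to avoid division by zero.
def standardize(rows)
  n = rows.size.to_f
  dims = rows.first.size
  means = (0...dims).map { |d| rows.sum { |r| r[d] } / n }
  stds = (0...dims).map do |d|
    Math.sqrt(rows.sum { |r| (r[d] - means[d])**2 } / n)
  end
  rows.map do |r|
    r.each_with_index.map { |v, d| stds[d] > 0 ? (v - means[d]) / stds[d] : 0.0 }
  end
end
```

After this pass, no single feature dominates the distance calculations in the k-NN step purely because of its scale.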
## When to Use UMAP

### Optimal Use Cases

1. **Embedding Visualization**
   ```
   Input: 10,000 document embeddings (768 dimensions)
   Output: 2D coordinates for plotting
   Purpose: Visualize document clusters and relationships
   ```

2. **Clustering Preprocessing**
   ```
   Input: High-dimensional feature vectors
   Output: 10-50 dimensional representations
   Purpose: Improve clustering algorithm performance and speed
   ```

3. **Anomaly Detection**
   ```
   Input: Normal behavior embeddings
   Output: 2D projection showing outliers
   Purpose: Identify points that don't fit the manifold structure
   ```

4. **Feature Engineering**
   ```
   Input: Raw high-dimensional features
   Output: Lower-dimensional features for downstream ML
   Purpose: Capture non-linear relationships in fewer dimensions
   ```

## Limitations and Alternatives

### UMAP Limitations

1. **Non-deterministic**: Results vary between runs due to:
   - Random initialization
   - Stochastic gradient descent
   - Approximate nearest neighbor search

2. **Distance Distortion**: UMAP preserves topology, not distances
   - Distances in UMAP space don't correspond to original distances
   - Density can be misleading (denser areas might just be artifacts)

3. **Parameter Sensitivity**: Results heavily depend on:
   - `n_neighbors`: Controls local vs global structure balance
   - `min_dist`: Controls cluster tightness
   - `metric`: Distance function choice crucial for certain data types

4. **No Inverse Transform**: Generally cannot reconstruct original data from UMAP coordinates

### When to Use Alternatives

| Scenario | Use Instead | Reason |
|----------|-------------|--------|
| Need exact variance preservation | PCA | PCA preserves maximum variance in linear projections |
| Need deterministic results | PCA, Kernel PCA | These provide reproducible transformations |
| Small dataset (<500 points) | PCA, MDS | UMAP needs sufficient data to learn manifold |
| Need inverse transformation | Autoencoders | Can reconstruct original from embedding |
| Purely categorical data | MCA, FAMD | Designed for categorical/mixed data types |
| Need interpretable dimensions | Factor Analysis, PCA | Dimensions have meaningful interpretations |
| Time series data | DTW + MDS | Respects temporal dependencies |

## Data That Degrades UMAP Performance

### 1. Extreme Sparsity
```ruby
# Problem: 99.9% zeros in data
sparse_data = Array.new(1000) { Array.new(1000) { rand < 0.001 ? 1 : 0 } }
# Solution: Use PCA/SVD first or specialized sparse methods
```

### 2. Curse of Dimensionality
```ruby
# Problem: Dimensions >> samples
data = Array.new(100) { Array.new(10000) { rand } } # 100 samples, 10000 dimensions
# Solution: Apply PCA first to reduce to ~50 dimensions
```

### 3. Multiple Disconnected Manifolds
```ruby
# Problem: Completely separate clusters with no connections
cluster1 = Array.new(500) { Array.new(100) { rand(-3.0..3.0) } }
cluster2 = Array.new(500) { Array.new(100) { rand(-3.0..3.0) + 1000 } } # Far separated
# Result: UMAP may arbitrarily position disconnected components
```

### 4. Pure Noise
```ruby
# Problem: No underlying structure
random_data = Array.new(1000) { Array.new(100) { rand(-3.0..3.0) } }
# Result: Meaningless projection, artificial patterns
```

## Key Parameters

### Essential Parameters
```ruby
umap_model = ClusterKit::UMAP.new(
  n_components: 2,  # Output dimensions
  n_neighbors: 15,  # Number of neighbors (15-50 typical)
  random_seed: 42   # For reproducibility
)
# Note: min_dist and metric parameters may be configurable in future versions
```

### Parameter Effects
- **n_neighbors**:
  - Low (5-15): Preserves local structure, detailed clusters
  - High (50-200): Preserves global structure, broader patterns

- **min_dist**:
  - Near 0: Tight clumps, allows overlapping
  - Near 1: Even distribution, preserves more global structure

- **metric**: Critical for specific data types
  - `euclidean`: Standard for continuous data
  - `cosine`: Text embeddings, directional data
  - `manhattan`: Robust to outliers
  - `hamming`: Binary/categorical features

## Implementation Considerations

### Performance Optimization
```ruby
# For large datasets (>50K points)
umap_model = ClusterKit::UMAP.new(
  n_neighbors: 15,
  n_components: 2,
  random_seed: 42
)
# The Rust backend automatically optimizes for performance
# using efficient algorithms like HNSW for nearest neighbor search
```

### Supervised vs Unsupervised
```ruby
# Unsupervised: Find natural structure
embedding = umap_model.fit_transform(data)

# Note: Supervised UMAP (using labels to guide projection)
# may be available in future versions of clusterkit
```

## Practical Example: Document Embedding Pipeline

```ruby
require 'clusterkit'
require 'candle'

# Typical workflow for document embeddings
documents = load_documents # 10,000 documents

# Initialize embedding model using red-candle's from_pretrained
embedding_model = Candle::EmbeddingModel.from_pretrained(
  "jinaai/jina-embeddings-v2-base-en",
  device: Candle::Device.best # Automatically use GPU if available
)

# Generate embeddings for all documents
# The embedding method returns normalized embeddings by default with "pooled_normalized"
embeddings = documents.map do |doc|
  embedding_model.embedding(doc).to_a # Convert tensor to array
end
# Shape: Array of 10000 arrays, each with 768 floats

# Embeddings are already normalized when using pooled_normalized (default)
# But if you used a different pooling method, normalize like this:
# embeddings_normalized = embeddings.map do |embedding|
#   magnitude = Math.sqrt(embedding.sum { |x| x**2 })
#   embedding.map { |x| x / magnitude }
# end

# Dimensionality reduction
reducer = ClusterKit::UMAP.new(
  n_neighbors: 30,
  n_components: 2,
  random_seed: 42
)

# Fit and transform
coords_2d = reducer.fit_transform(embeddings)

# Can now transform new documents
new_doc = "New document"
new_embedding = embedding_model.embedding(new_doc).to_a
new_coords = reducer.transform([new_embedding])

# Alternative: Using different pooling methods
# cls_embedding = embedding_model.embedding(doc, pooling_method: "cls")
# pooled_embedding = embedding_model.embedding(doc, pooling_method: "pooled")
```

## Common Pitfalls

1. **Using raw distances between UMAP points for similarity**
   - Wrong: `distance = Math.sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)`
   - Right: Use original high-dimensional distances

2. **Not scaling features before UMAP**
   - Features with larger scales dominate distance calculations
   - Always normalize or standardize first

3. **Over-interpreting visual clusters**
   - Clusters in UMAP don't always mean distinct groups
   - Validate with clustering algorithms on original data

4. **Forgetting to normalize embeddings for cosine similarity**
   - Text embeddings typically use cosine distance
   - Normalize vectors before UMAP when working with embeddings

5. **Applying to insufficient data**
   - UMAP needs enough points to learn manifold structure
   - Consider simpler methods for small datasets (< 500 points)

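To act on pitfall 1, compare items in the original embedding space; for text embeddings, cosine similarity is the usual choice. A minimal plain-Ruby version:

```ruby
# Cosine similarity in the original high-dimensional space (illustrative).
# For already-normalized embeddings this reduces to a plain dot product.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end
```

Use the 2D UMAP coordinates only for plotting; similarity judgments should come from a measure like this applied to the original vectors.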
## Complete Ruby Example with Red-Candle

```ruby
require 'candle'
require 'clusterkit'

# Sample documents for clustering
documents = [
  # Technology cluster
  "Machine learning algorithms process data efficiently",
  "Neural networks enable deep learning applications",
  "Artificial intelligence transforms modern software",

  # Food cluster
  "Italian pasta dishes are delicious and varied",
  "Fresh vegetables make healthy salad options",
  "Chocolate desserts satisfy sweet cravings",

  # Sports cluster
  "Basketball players need excellent coordination",
  "Marathon runners train for endurance events",
  "Tennis matches require mental focus"
]

# Initialize embedding model using from_pretrained
# Model type is auto-detected from the model_id
model = Candle::EmbeddingModel.from_pretrained(
  "sentence-transformers/all-MiniLM-L6-v2"
)

# Generate embeddings (already normalized with pooled_normalized)
embeddings = documents.map { |doc| model.embedding(doc).to_a }

# Reduce to 2D for visualization
umap = ClusterKit::UMAP.new(
  n_components: 2,
  n_neighbors: 5, # Small dataset, use fewer neighbors
  random_seed: 42
)

coords_2d = umap.fit_transform(embeddings)

# Display results
documents.each_with_index do |doc, i|
  x, y = coords_2d[i]
  puts "#{doc[0..30]}... => [#{x.round(3)}, #{y.round(3)}]"
end

# The 2D coordinates should show three distinct clusters
# corresponding to technology, food, and sports topics

# Advanced: Save and load UMAP models for reuse
umap.save("models/document_umap.model")
loaded_umap = ClusterKit::UMAP.load("models/document_umap.model")
```

## Summary

UMAP is a powerful non-linear dimensionality reduction algorithm particularly suited for visualizing and preprocessing high-dimensional data like embeddings. It excels at preserving both local and global structure through graph-based manifold learning. While it requires sufficient data and careful parameter tuning, it generally outperforms alternatives like t-SNE in speed and quality for most embedding visualization tasks. The key is understanding that UMAP learns a topological representation, not a distance-preserving projection, and interpreting results accordingly.