clusterkit 0.3.0-arm64-darwin

Files changed (59)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.simplecov +47 -0
  4. data/CHANGELOG.md +35 -0
  5. data/CLAUDE.md +226 -0
  6. data/Cargo.lock +3228 -0
  7. data/Cargo.toml +8 -0
  8. data/Gemfile +17 -0
  9. data/IMPLEMENTATION_NOTES.md +143 -0
  10. data/LICENSE.txt +21 -0
  11. data/PYTHON_COMPARISON.md +183 -0
  12. data/README.md +744 -0
  13. data/Rakefile +259 -0
  14. data/docs/KNOWN_ISSUES.md +130 -0
  15. data/docs/RUST_ERROR_HANDLING.md +164 -0
  16. data/docs/TEST_FIXTURES.md +170 -0
  17. data/docs/UMAP_EXPLAINED.md +362 -0
  18. data/docs/UMAP_TROUBLESHOOTING.md +284 -0
  19. data/docs/VERBOSE_OUTPUT.md +84 -0
  20. data/docs/assets/clusterkit-wide.png +0 -0
  21. data/docs/assets/clusterkit.png +0 -0
  22. data/docs/assets/visualization.png +0 -0
  23. data/examples/hdbscan_example.rb +147 -0
  24. data/examples/optimal_kmeans_example.rb +96 -0
  25. data/examples/pca_example.rb +114 -0
  26. data/examples/reproducible_umap.rb +99 -0
  27. data/examples/verbose_control.rb +43 -0
  28. data/ext/clusterkit/Cargo.toml +26 -0
  29. data/ext/clusterkit/extconf.rb +23 -0
  30. data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +80 -0
  31. data/ext/clusterkit/src/clustering.rs +221 -0
  32. data/ext/clusterkit/src/embedder.rs +349 -0
  33. data/ext/clusterkit/src/hnsw.rs +579 -0
  34. data/ext/clusterkit/src/lib.rs +24 -0
  35. data/ext/clusterkit/src/svd.rs +89 -0
  36. data/ext/clusterkit/src/tests.rs +16 -0
  37. data/ext/clusterkit/src/utils.rs +183 -0
  38. data/lib/clusterkit/3.1/clusterkit.bundle +0 -0
  39. data/lib/clusterkit/3.2/clusterkit.bundle +0 -0
  40. data/lib/clusterkit/3.3/clusterkit.bundle +0 -0
  41. data/lib/clusterkit/3.4/clusterkit.bundle +0 -0
  42. data/lib/clusterkit/clustering/hdbscan.rb +164 -0
  43. data/lib/clusterkit/clustering.rb +194 -0
  44. data/lib/clusterkit/clusterkit.rb +14 -0
  45. data/lib/clusterkit/configuration.rb +24 -0
  46. data/lib/clusterkit/data_validator.rb +132 -0
  47. data/lib/clusterkit/dimensionality/pca.rb +251 -0
  48. data/lib/clusterkit/dimensionality/svd.rb +175 -0
  49. data/lib/clusterkit/dimensionality/umap.rb +282 -0
  50. data/lib/clusterkit/dimensionality.rb +29 -0
  51. data/lib/clusterkit/hdbscan_api_design.rb +142 -0
  52. data/lib/clusterkit/hnsw.rb +251 -0
  53. data/lib/clusterkit/preprocessing.rb +106 -0
  54. data/lib/clusterkit/silence.rb +42 -0
  55. data/lib/clusterkit/utils.rb +51 -0
  56. data/lib/clusterkit/version.rb +5 -0
  57. data/lib/clusterkit.rb +105 -0
  58. data/lib/tasks/visualize.rake +641 -0
  59. metadata +214 -0
data/README.md ADDED
@@ -0,0 +1,744 @@
1
+ <img src="/docs/assets/clusterkit-wide.png" alt="clusterkit" height="80px">
2
+
3
+ A high-performance clustering and dimensionality reduction toolkit for Ruby, powered by best-in-class Rust implementations.
4
+
5
+ ## 🙏 Acknowledgments & Attribution
6
+
7
+ ClusterKit builds upon excellent work from the Rust ecosystem:
8
+
9
+ - **[annembed](https://github.com/jean-pierreBoth/annembed)** - Provides the core UMAP, t-SNE, and other dimensionality reduction algorithms. Created by Jean-Pierre Both.
10
+ - **[hdbscan](https://github.com/tom-whitehead/hdbscan)** - Provides the HDBSCAN density-based clustering implementation. A Rust port of the original HDBSCAN algorithm.
11
+
12
+ This gem would not be possible without these foundational libraries. Please consider starring their repositories if you find ClusterKit useful.
13
+
14
+ ## Features
15
+
16
+ - **Dimensionality Reduction Algorithms**:
17
+ - UMAP (Uniform Manifold Approximation and Projection) - powered by annembed
18
+ - PCA (Principal Component Analysis)
19
+ - SVD (Singular Value Decomposition)
20
+
21
+ - **Advanced Clustering**:
22
+ - K-means clustering with automatic k selection via elbow method
23
+ - HDBSCAN (Hierarchical Density-Based Spatial Clustering) for density-based clustering with noise detection
24
+ - Silhouette scoring for cluster quality evaluation
25
+
26
+ - **High Performance**:
27
+ - Leverages Rust's speed and parallelization
28
+ - Efficient memory usage
29
+ - Support for large datasets
30
+
31
+ - **Easy to Use**:
32
+ - Simple, scikit-learn-like API
33
+ - Consistent interface across algorithms
34
+ - Comprehensive documentation and examples
35
+
36
+ - **Visualization Tools**:
37
+ - Interactive HTML visualizations
38
+ - Comparison of different algorithms
39
+ - Built-in rake tasks for quick experimentation
40
+
41
+ ## API Structure
42
+
43
+ ClusterKit organizes its functionality into clear modules:
44
+
45
+ - **`ClusterKit::Dimensionality`** - All dimensionality reduction algorithms
46
+ - `ClusterKit::Dimensionality::UMAP` - UMAP implementation
47
+ - `ClusterKit::Dimensionality::PCA` - PCA implementation
48
+ - `ClusterKit::Dimensionality::SVD` - SVD implementation
49
+ - **`ClusterKit::Clustering`** - All clustering algorithms
50
+ - `ClusterKit::Clustering::KMeans` - K-means clustering
51
+ - `ClusterKit::Clustering::HDBSCAN` - HDBSCAN clustering
52
+ - **`ClusterKit::Utils`** - Utility functions
53
+ - **`ClusterKit::Preprocessing`** - Data preprocessing tools
54
+
55
+ All user-facing classes are in these modules. Implementation details are kept private.
56
+
57
+ ## Installation
58
+
59
+ Add this line to your application's Gemfile:
60
+
61
+ ```ruby
62
+ gem 'clusterkit'
63
+ ```
64
+
65
+ And then execute:
66
+
67
+ $ bundle install
68
+
69
+ Or install it yourself as:
70
+
71
+ $ gem install clusterkit
72
+
73
+ ### Prerequisites
74
+
75
+ - Ruby 2.7 or higher
76
+ - Rust toolchain (for building from source)
77
+
78
+ ## Quick Start - Interactive Example
79
+
80
+ Copy and paste this **entire block** into IRB to try out the main features (including the srand line for reproducible results):
81
+
82
+ ```ruby
83
+ require 'clusterkit'
84
+
85
+ # Generate sample high-dimensional data with structure
86
+ # This simulates real-world data like text embeddings or image features
87
+ puts "Creating sample data: 100 points in 50 dimensions with 3 clusters"
88
+
89
+ # Use a fixed seed for reproducibility in this example
90
+ # Important: Random data without structure can cause UMAP errors
91
+ srand(42)
92
+
93
+ # Create data with some inherent structure (3 clusters)
94
+ # Using better separated clusters to avoid UMAP convergence issues
95
+ data = []
96
+ 3.times do |cluster|
97
+ # Each cluster has a different center, well-separated
98
+ center = Array.new(50) { rand * 0.1 + cluster * 2.0 }
99
+
100
+ # Add 33 points around each center with controlled noise
101
+ 33.times do
102
+ point = center.map { |c| c + (rand - 0.5) * 0.3 }
103
+ data << point
104
+ end
105
+ end
106
+
107
+ # Add one more point to make it 100
108
+ data << Array.new(50) { rand * 6.0 } # Scale to match cluster range
109
+
110
+ # ============================================================
111
+ # 1. DIMENSIONALITY REDUCTION - Visualize high-dim data in 2D
112
+ # ============================================================
113
+
114
+ puts "\n1. DIMENSIONALITY REDUCTION:"
115
+
116
+ # UMAP - Best for preserving both local and global structure
117
+ puts "Running UMAP..."
118
+ # Note: Using n_neighbors=5 for better stability with varied data
119
+ # Lower n_neighbors helps avoid "isolated point" errors
120
+ umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 5)
121
+ umap_result = umap.fit_transform(data)
122
+ puts " ✓ Reduced to #{umap_result.first.size}D: #{umap_result[0..2].map { |p| p.map { |v| v.round(3) } }}"
123
+
124
+ # PCA - Fast linear reduction, good for finding main variations
125
+ puts "Running PCA..."
126
+ pca = ClusterKit::Dimensionality::PCA.new(n_components: 2)
127
+ pca_result = pca.fit_transform(data)
128
+ puts " ✓ Reduced to #{pca_result.first.size}D: #{pca_result[0..2].map { |p| p.map { |v| v.round(3) } }}"
129
+ puts " ✓ Explained variance: #{(pca.explained_variance_ratio.sum * 100).round(1)}%"
130
+
131
+ # ============================================================
132
+ # 2. CLUSTERING - Find groups in your data
133
+ # ============================================================
134
+
135
+ puts "\n2. CLUSTERING:"
136
+
137
+ # K-means - When you know roughly how many clusters to expect
138
+ puts "Running K-means..."
139
+ # First, find optimal k using elbow method
140
+ elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(umap_result, k_range: 2..6)
141
+ optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)
142
+ puts " ✓ Optimal k detected: #{optimal_k}"
143
+
144
+ kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k)
145
+ kmeans_labels = kmeans.fit_predict(umap_result)
146
+ puts " ✓ Found #{kmeans_labels.uniq.size} clusters"
147
+
148
+ # HDBSCAN - When you don't know the number of clusters and have noise
149
+ puts "Running HDBSCAN..."
150
+ hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5, min_cluster_size: 10)
151
+ hdbscan_labels = hdbscan.fit_predict(umap_result)
152
+ puts " ✓ Found #{hdbscan.n_clusters} clusters"
153
+ puts " ✓ Identified #{hdbscan.n_noise_points} noise points (#{(hdbscan.noise_ratio * 100).round(1)}%)"
154
+
155
+ # ============================================================
156
+ # 3. EVALUATION - How good are the clusters?
157
+ # ============================================================
158
+
159
+ puts "\n3. CLUSTER EVALUATION:"
160
+ silhouette = ClusterKit::Clustering.silhouette_score(umap_result, kmeans_labels)
161
+ puts " K-means silhouette score: #{silhouette.round(3)} (closer to 1 is better)"
162
+
163
+ # Filter noise for HDBSCAN evaluation
164
+ non_noise = hdbscan_labels.each_with_index.select { |l, _| l != -1 }.map(&:last)
165
+ if non_noise.any?
166
+ filtered_data = non_noise.map { |i| umap_result[i] }
167
+ filtered_labels = non_noise.map { |i| hdbscan_labels[i] }
168
+ hdbscan_silhouette = ClusterKit::Clustering.silhouette_score(filtered_data, filtered_labels)
169
+ puts " HDBSCAN silhouette score: #{hdbscan_silhouette.round(3)} (excluding noise)"
170
+ end
171
+
172
+ puts "\n✅ All done! Try visualizing with: rake clusterkit:visualize"
173
+ ```
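The silhouette score printed in step 3 can be illustrated with a small pure-Ruby sketch. This is not ClusterKit's implementation; the helper names `euclidean`, `mean_distance`, and `silhouette_sketch` are hypothetical and exist only for this illustration. The sketch assumes at least two clusters and Euclidean distance. For each point it compares the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`).

```ruby
# Euclidean distance between two equal-length numeric arrays.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean distance from one point to a non-empty list of other points.
def mean_distance(point, others)
  others.sum { |o| euclidean(point, o) } / others.size
end

# Hypothetical helper: mean silhouette over all points.
# Assumes labels contains at least two distinct cluster ids.
def silhouette_sketch(points, labels)
  groups = points.each_index.group_by { |i| labels[i] }
  scores = points.each_index.map do |i|
    own = groups[labels[i]] - [i]
    next 0.0 if own.empty? # a singleton cluster scores 0 by convention

    a = mean_distance(points[i], own.map { |j| points[j] })
    b = groups.reject { |label, _| label == labels[i] }
              .map { |_, idx| mean_distance(points[i], idx.map { |j| points[j] }) }
              .min
    (b - a) / [a, b].max
  end
  scores.sum / scores.size
end

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good   = silhouette_sketch(points, [0, 0, 1, 1]) # labels follow the two tight pairs
bad    = silhouette_sketch(points, [0, 1, 0, 1]) # labels cut across the pairs
puts "well-separated labels: #{good.round(3)}, mismatched labels: #{bad.round(3)}"
```

Labels that follow the geometry score close to 1, while labels that cut across it score below 0, which is the sense in which "closer to 1 is better".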
174
+
175
+ ## Detailed Usage
176
+
177
+ ### API Structure
178
+
179
+ ClusterKit organizes its algorithms into logical modules:
180
+
181
+ - **`ClusterKit::Dimensionality`** - Algorithms for reducing data dimensions
182
+ - `UMAP` - Non-linear manifold learning
183
+ - `PCA` - Principal Component Analysis
184
+ - `SVD` - Singular Value Decomposition
185
+
186
+ - **`ClusterKit::Clustering`** - Algorithms for grouping data
187
+ - `KMeans` - Partition-based clustering
188
+ - `HDBSCAN` - Density-based clustering with noise detection
189
+
190
+ ### Dimensionality Reduction
191
+
192
+ #### UMAP (Uniform Manifold Approximation and Projection)
193
+
194
+ **Important**: UMAP requires data with some structure. Purely random data may fail.
195
+ For testing, create data with patterns (see Quick Start example above).
196
+
197
+ ```ruby
198
+ # Generate sample data with structure (not pure random)
199
+ data = []
200
+ 100.times do |i|
201
+ # Create points that cluster in groups
202
+ group = i / 25 # 4 groups
203
+ base = group * 2.0
204
+ point = Array.new(50) { |j| base + rand * 0.8 + j * 0.01 }
205
+ data << point
206
+ end
207
+
208
+ # Create UMAP instance
209
+ umap = ClusterKit::Dimensionality::UMAP.new(
210
+ n_components: 2, # Target dimensions (default: 2)
211
+ n_neighbors: 5, # Number of neighbors (default: 15, use 5 for small datasets)
212
+ random_seed: 42, # For reproducibility (default: nil for best performance)
213
+ nb_grad_batch: 10, # Gradient descent batches (default: 10, lower = faster)
214
+ nb_sampling_by_edge: 8 # Negative samples per edge (default: 8, lower = faster)
215
+ )
216
+
217
+ # Fit and transform data
218
+ embedded = umap.fit_transform(data)
219
+
220
+ # IMPORTANT: Seed behavior
221
+ # - WITH random_seed: Fully reproducible results using serial processing (slower)
222
+ # - WITHOUT random_seed: Faster parallel processing but non-deterministic results
223
+
224
+ # Or fit once and transform multiple datasets
225
+ # Example: Split your data into training and test sets
226
+ # Note: Using structured data (not pure random) for better UMAP results
227
+ all_data = []
228
+ 200.times do |i|
229
+ # Create data with some structure - points cluster around different regions
230
+ center = (i / 50) * 2.0 # 4 rough groups
231
+ point = Array.new(50) { |j| center + rand * 0.5 + j * 0.01 }
232
+ all_data << point
233
+ end
234
+
235
+ training_data = all_data[0...150] # First 150 samples for training
236
+ test_data = all_data[150..-1] # Last 50 samples for testing
237
+
238
+ umap.fit(training_data)
239
+ test_embedded = umap.transform(test_data)
240
+
241
+ # Save and load fitted models
242
+ umap.save_model("umap_model.bin") # Save the fitted model
243
+ loaded_umap = ClusterKit::Dimensionality::UMAP.load_model("umap_model.bin") # Load it later
244
+ new_data_embedded = loaded_umap.transform(test_data) # Use loaded model for unseen data
245
+
246
+ # Save and load transformed data (useful for caching results)
247
+ ClusterKit::Dimensionality::UMAP.save_data(embedded, "embeddings.json")
248
+ cached_embeddings = ClusterKit::Dimensionality::UMAP.load_data("embeddings.json")
249
+
250
+ # Note: The library automatically adjusts n_neighbors if it's too large for your dataset
251
+ ```
252
+
253
+ #### PCA (Principal Component Analysis)
254
+
255
+ ```ruby
256
+ pca = ClusterKit::Dimensionality::PCA.new(n_components: 2)
257
+ transformed = pca.fit_transform(data)
258
+
259
+ # Access explained variance
260
+ puts "Explained variance ratio: #{pca.explained_variance_ratio}"
261
+ puts "Cumulative explained variance: #{pca.cumulative_explained_variance_ratio}"
262
+
263
+ # Inverse transform to reconstruct original data
264
+ reconstructed = pca.inverse_transform(transformed)
265
+ ```
266
+
267
+
268
+ #### SVD (Singular Value Decomposition)
269
+
270
+ ```ruby
271
+ # Direct SVD decomposition using the class interface
272
+ svd = ClusterKit::Dimensionality::SVD.new(n_components: 10, n_iter: 5)
273
+ u, s, vt = svd.fit_transform(data)
274
+
275
+ # U: left singular vectors (documents in LSA)
276
+ # S: singular values (importance of each component)
277
+ # V^T: right singular vectors (terms in LSA)
278
+
279
+ puts "Shape of U: #{u.size}x#{u.first.size}"
280
+ puts "Singular values: #{s[0..4].map { |v| v.round(2) }}"
281
+ puts "Shape of V^T: #{vt.size}x#{vt.first.size}"
282
+
283
+ # For dimensionality reduction, use U * S
284
+ reduced = u.map do |row|
285
+ row.map.with_index { |val, j| val * s[j] }
286
+ end
287
+
288
+ # Or use the convenience method
289
+ u, s, vt = ClusterKit.svd(data, 10, n_iter: 5)
290
+ ```
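Since the singular values express the importance of each component, they can help decide how many components to keep. The sketch below is an illustration in plain Ruby, not part of ClusterKit: the helper name `components_for_variance` and the 0.9 threshold are assumptions. It relies on the standard fact that, for mean-centered data, the variance captured by a component is proportional to its squared singular value.

```ruby
# Hypothetical helper: smallest number of leading components whose squared
# singular values account for at least `threshold` of the total.
# Assumes singular_values is sorted in descending order, as SVD returns them.
def components_for_variance(singular_values, threshold: 0.9)
  energies = singular_values.map { |s| s * s }
  total = energies.sum
  cumulative = 0.0
  energies.each_with_index do |energy, i|
    cumulative += energy
    return i + 1 if cumulative / total >= threshold
  end
  singular_values.size
end

s = [3.0, 2.0, 1.0] # example singular values (made up for illustration)
puts "Components needed for 90%: #{components_for_variance(s)}"
```

With the example values the squared magnitudes are 9, 4, and 1, so the first component alone covers 9/14 of the total and the first two cover 13/14.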
291
+
292
+
293
+ ### Clustering
294
+
295
+ #### K-means with Automatic K Selection
296
+
297
+ ```ruby
298
+ # Find optimal number of clusters
299
+ elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(data, k_range: 2..10)
300
+ optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)
301
+
302
+ # Cluster with optimal k
303
+ kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k, random_seed: 42)
304
+ labels = kmeans.fit_predict(data)
305
+
306
+ # Access cluster centers
307
+ centers = kmeans.cluster_centers
308
+ ```
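To make the idea behind the elbow method concrete, here is one common heuristic written in plain Ruby. It is a sketch only: ClusterKit's `detect_optimal_k` may use a different rule, and both the helper name `elbow_by_second_difference` and the assumed input shape (a Hash mapping each k to its inertia, with consecutive k values) are assumptions made for this example. The heuristic picks the k where the improvement in inertia drops off most sharply.

```ruby
# Hypothetical helper: pick the k with the largest "bend" in the inertia curve.
# bend(k) = (inertia[k-1] - inertia[k]) - (inertia[k] - inertia[k+1])
def elbow_by_second_difference(inertia_by_k)
  ks = inertia_by_k.keys.sort
  return ks.first if ks.size < 3 # not enough points to measure a bend

  best_k = ks[1]
  best_bend = -Float::INFINITY
  ks.each_cons(3) do |prev_k, k, next_k|
    gain_before = inertia_by_k[prev_k] - inertia_by_k[k]
    gain_after  = inertia_by_k[k] - inertia_by_k[next_k]
    bend = gain_before - gain_after
    if bend > best_bend
      best_bend = bend
      best_k = k
    end
  end
  best_k
end

# Made-up inertia values: a large drop from k=2 to k=3, then little improvement.
scores = { 2 => 500.0, 3 => 120.0, 4 => 100.0, 5 => 90.0, 6 => 85.0 }
puts "Elbow at k = #{elbow_by_second_difference(scores)}"
```

Because the first and last k in the range have no neighbor on one side, this heuristic can never select them, so it helps to scan a range slightly wider than the values you expect.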
309
+
310
+ #### HDBSCAN (Density-Based Clustering)
311
+
312
+ ```ruby
313
+ # HDBSCAN automatically determines the number of clusters
314
+ # and can identify noise points
315
+ hdbscan = ClusterKit::Clustering::HDBSCAN.new(
316
+ min_samples: 5, # Minimum samples in neighborhood
317
+ min_cluster_size: 10, # Minimum cluster size
318
+ metric: 'euclidean' # Distance metric
319
+ )
320
+
321
+ labels = hdbscan.fit_predict(data)
322
+
323
+ # Noise points are labeled as -1
324
+ puts "Clusters found: #{hdbscan.n_clusters}"
325
+ puts "Noise points: #{hdbscan.n_noise_points} (#{(hdbscan.noise_ratio * 100).round(1)}%)"
326
+
327
+ # Access additional HDBSCAN information
328
+ probabilities = hdbscan.probabilities # Cluster membership probabilities
329
+ outlier_scores = hdbscan.outlier_scores # Outlier scores for each point
330
+ ```
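Because noise points share the label -1 with no cluster, it is often convenient to group point indices by cluster before further analysis. The following plain-Ruby sketch works on any label array of the shape described above; the sample labels are made up, and nothing here calls into ClusterKit.

```ruby
# Example labels as returned by a density-based clusterer: -1 marks noise.
labels = [0, 0, -1, 1, 1, 1, -1, 0]

# Map each cluster id to the indices of its member points, skipping noise.
clusters = labels.each_with_index
                 .reject { |label, _| label == -1 }
                 .group_by(&:first)
                 .transform_values { |pairs| pairs.map(&:last) }

# Fraction of points labelled as noise.
noise_ratio = labels.count(-1).to_f / labels.size

clusters.each { |id, members| puts "Cluster #{id}: points #{members.inspect}" }
puts "Noise ratio: #{(noise_ratio * 100).round(1)}%"
```

The index lists can then be used to look up the original records (documents, rows, images) that belong to each cluster.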
331
+
332
+ ### HNSW - Fast Nearest Neighbor Search
333
+
334
+ ClusterKit includes HNSW (Hierarchical Navigable Small World) for fast approximate nearest neighbor search, useful for building recommendation systems, similarity search, and as a building block for other algorithms.
335
+
336
+ Copy and paste this **entire block** into IRB to try HNSW with real embeddings:
337
+
338
+ ```ruby
339
+ require 'clusterkit'
340
+ require 'candle'
341
+
342
+ # Step 1: Initialize the embedding model
343
+ puts "Loading embedding model..."
344
+ embedding_model = Candle::EmbeddingModel.from_pretrained(
345
+ 'sentence-transformers/all-MiniLM-L6-v2',
346
+ device: Candle::Device.best
347
+ )
348
+ puts " ✓ Model loaded: #{embedding_model.model_id}"
349
+
350
+ # Step 2: Create sample documents for semantic search
351
+ documents = [
352
+ "The cat sat on the mat",
353
+ "Dogs are loyal pets that love their owners",
354
+ "Machine learning algorithms can classify text documents",
355
+ "Natural language processing helps computers understand human language",
356
+ "Ruby is a programming language known for its simplicity",
357
+ "Python is popular for data science and machine learning",
358
+ "The weather today is sunny and warm",
359
+ "Climate change affects global weather patterns",
360
+ "Artificial intelligence is transforming many industries",
361
+ "Deep learning models require large amounts of training data",
362
+ "Cats and dogs are common household pets",
363
+ "Software engineering requires problem-solving skills",
364
+ "The ocean contains many different species of fish",
365
+ "Marine biology studies life in aquatic environments",
366
+ "Cooking requires understanding of ingredients and techniques"
367
+ ]
368
+
369
+ puts "\nGenerating embeddings for #{documents.size} documents..."
370
+
371
+ # Step 3: Generate embeddings for all documents
372
+ embeddings = documents.map do |doc|
373
+ embedding_model.embedding(doc).first.to_a
374
+ end
375
+ puts " ✓ Generated embeddings: #{embeddings.first.count} dimensions each"
376
+
377
+ # Step 4: Create HNSW index
378
+ puts "\nBuilding HNSW search index..."
379
+ index = ClusterKit::HNSW.new(
380
+ dim: embeddings.first.count, # 384 dimensions for all-MiniLM-L6-v2
381
+ space: :euclidean,
382
+ m: 16, # Good balance of speed vs accuracy
383
+ ef_construction: 200, # Build quality
384
+ max_elements: documents.size,
385
+ random_seed: 42 # For reproducible results
386
+ )
387
+
388
+ # Step 5: Add all documents to the index
389
+ documents.each_with_index do |doc, i|
390
+ index.add_item(
391
+ embeddings[i],
392
+ label: "doc_#{i}",
393
+ metadata: {
394
+ 'text' => doc,
395
+ 'length' => doc.length,
396
+ 'word_count' => doc.split.size
397
+ }
398
+ )
399
+ end
400
+ puts " ✓ Added #{documents.size} documents to index"
401
+
402
+ # Step 6: Perform semantic searches
403
+ puts "\n" + "="*50
404
+ puts "SEMANTIC SEARCH DEMO"
405
+ puts "="*50
406
+
407
+ queries = [
408
+ "pets and animals",
409
+ "computer programming",
410
+ "weather and environment"
411
+ ]
412
+
413
+ queries.each do |query|
414
+ puts "\nQuery: '#{query}'"
415
+ puts "-" * 30
416
+
417
+ # Generate query embedding
418
+ query_embedding = embedding_model.embedding(query).first.to_a
419
+
420
+ # Search for similar documents
421
+ results = index.search_with_metadata(query_embedding, k: 3)
422
+
423
+ results.each_with_index do |result, i|
424
+ similarity = (1.0 - result[:distance]).round(3) # Rough score only; Euclidean distance can exceed 1, making this negative
425
+ text = result[:metadata]['text']
426
+ puts " #{i+1}. [#{similarity}] #{text}"
427
+ end
428
+ end
429
+
430
+ # Step 7: Demonstrate advanced features
431
+ puts "\n" + "="*50
432
+ puts "ADVANCED FEATURES"
433
+ puts "="*50
434
+
435
+ # Show search quality adjustment
436
+ puts "\nAdjusting search quality (ef parameter):"
437
+ index.set_ef(50) # Lower ef = faster but potentially less accurate
438
+ fast_results = index.search(embeddings[0], k: 3)
439
+ puts " Fast search (ef=50): #{fast_results}"
440
+
441
+ index.set_ef(200) # Higher ef = slower but more accurate
442
+ accurate_results = index.search(embeddings[0], k: 3)
443
+ puts " Accurate search (ef=200): #{accurate_results}"
444
+
445
+ # Show batch operations
446
+ puts "\nBatch search example:"
447
+ query_embeddings = [embeddings[0], embeddings[5], embeddings[10]]
448
+ batch_results = query_embeddings.map { |emb| index.search(emb, k: 2) }
449
+ puts " Found #{batch_results.size} result sets"
450
+
451
+ # Save and load demonstration
452
+ puts "\nSaving and loading index:"
453
+ index.save('demo_index')
454
+ puts " ✓ Index saved to 'demo_index'"
455
+
456
+ loaded_index = ClusterKit::HNSW.load('demo_index')
457
+ test_results = loaded_index.search(embeddings[0], k: 2)
458
+ puts " ✓ Loaded index works: #{test_results}"
459
+
460
+ puts "\n✅ HNSW demo complete!"
461
+ puts "\nTry your own queries by running:"
462
+ puts "query_embedding = embedding_model.embedding('your search query').first.to_a"
463
+ puts "results = index.search_with_metadata(query_embedding, k: 5)"
464
+ ```
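The demo turns a Euclidean distance into a score with `1.0 - distance`, which is only a rough ranking aid. If the embeddings are unit-normalized (an assumption; check your embedding model) and the index returns a plain, non-squared Euclidean distance (also an assumption), the exact cosine similarity can be recovered as `1 - d²/2`. The sketch below demonstrates the identity in plain Ruby; the helper names are hypothetical.

```ruby
# Scale a vector to unit length.
def l2_normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

# Plain (non-squared) Euclidean distance.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# For unit vectors: |a - b|^2 = 2 - 2 * cos(a, b), hence cos = 1 - d^2 / 2.
def cosine_from_euclidean(distance)
  1.0 - (distance * distance) / 2.0
end

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

from_distance = cosine_from_euclidean(euclidean_distance(a, b))
from_dot      = a.zip(b).sum { |x, y| x * y } # direct cosine for unit vectors

puts "cosine via distance: #{from_distance.round(4)}, via dot product: #{from_dot.round(4)}"
```

The result lies in the range -1 to 1, which makes scores comparable across queries in a way that `1.0 - distance` is not.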
465
+
466
+ #### When to Use HNSW
467
+
468
+ HNSW is ideal for:
469
+ - **Recommendation Systems**: Find similar items/users quickly
470
+ - **Semantic Search**: Find documents with similar embeddings
471
+ - **Duplicate Detection**: Identify near-duplicate content
472
+ - **Clustering Support**: As a fast neighbor graph for HDBSCAN
473
+ - **Real-time Applications**: When you need sub-millisecond search times
474
+
475
+ #### Configuration Guidelines
476
+
477
+ ```ruby
478
+ # High recall (>0.95) - Best quality, slower
479
+ ClusterKit::HNSW.new(
480
+ dim: dim,
481
+ m: 32,
482
+ ef_construction: 400
483
+ ).tap { |idx| idx.set_ef(100) }
484
+
485
+ # Balanced (>0.90 recall) - Good quality, fast
486
+ ClusterKit::HNSW.new(
487
+ dim: dim,
488
+ m: 16,
489
+ ef_construction: 200
490
+ ).tap { |idx| idx.set_ef(50) }
491
+
492
+ # Speed optimized (>0.85 recall) - Fastest, acceptable quality
493
+ ClusterKit::HNSW.new(
494
+ dim: dim,
495
+ m: 8,
496
+ ef_construction: 100
497
+ ).tap { |idx| idx.set_ef(20) }
498
+ ```
499
+
500
+ #### Important Notes
501
+
502
+ 1. **Memory Usage**: HNSW keeps the entire index in memory. Estimate: `(num_items * (dim * 4 + m * 16))` bytes
503
+ 2. **Distance Metrics**: Currently only Euclidean distance is fully supported
504
+ 3. **Loading Behavior**: Due to Rust lifetime constraints, loading an index creates a small memory leak (the index metadata persists until program exit). This is typically negligible for most applications.
505
+ 4. **Build Time**: Index construction is O(N * log(N)). For large datasets (>1M items), consider building offline
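
The memory estimate in note 1 is easy to evaluate before building an index. The helper below simply encodes that formula in plain Ruby; the method name `hnsw_memory_estimate_bytes` and the example sizes are hypothetical, and the result is a rough planning figure rather than a measurement.

```ruby
# Rough index size in bytes, using the estimate from the notes above:
#   num_items * (dim * 4 + m * 16)
def hnsw_memory_estimate_bytes(num_items:, dim:, m:)
  num_items * (dim * 4 + m * 16)
end

# Example: one million 384-dimensional vectors with m = 16.
bytes = hnsw_memory_estimate_bytes(num_items: 1_000_000, dim: 384, m: 16)
puts format("Estimated index size: %.2f GiB", bytes / (1024.0**3))
```

The `dim * 4` term dominates for typical embedding sizes, so reducing dimensionality before indexing usually saves more memory than lowering `m`.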
506
+
507
+ #### Example: Semantic Search System
508
+
509
+ ```ruby
510
+ # Build a simple semantic search system
511
+ documents = load_documents()
512
+ embeddings = generate_embeddings(documents) # Use red-candle or similar
513
+
514
+ # Build search index
515
+ search_index = ClusterKit::HNSW.new(
516
+ dim: embeddings.first.size,
517
+ m: 16,
518
+ ef_construction: 200,
519
+ max_elements: documents.size
520
+ )
521
+
522
+ # Add all documents
523
+ documents.each_with_index do |doc, i|
524
+ search_index.add_item(
525
+ embeddings[i],
526
+ label: i,
527
+ metadata: { 'title' => doc[:title], 'url' => doc[:url] } # String keys, matching the lookups in search below
528
+ )
529
+ end
530
+
531
+ # Search function
532
+ def search(query, index, k: 10)
533
+ query_embedding = generate_embedding(query)
534
+ results = index.search_with_metadata(query_embedding, k: k)
535
+
536
+ results.map do |result|
537
+ {
538
+ title: result[:metadata]['title'],
539
+ url: result[:metadata]['url'],
540
+ similarity: 1.0 - result[:distance] # Rough score only; Euclidean distance can exceed 1, making this negative
541
+ }
542
+ end
543
+ end
544
+
545
+ # Save for later use
546
+ search_index.save('document_index')
547
+ ```
548
+
549
+ ### Visualization
550
+
551
+ ClusterKit includes a built-in visualization tool:
552
+
553
+ ```bash
554
+ # Generate interactive visualization
555
+ rake clusterkit:visualize
556
+
557
+ # With options
558
+ rake clusterkit:visualize[output.html,iris,both] # filename, dataset, clustering method
559
+
560
+ # Dataset options: clusters, swiss, iris
561
+ # Clustering options: kmeans, hdbscan, both
562
+ ```
563
+
564
+ This creates an interactive HTML file with:
565
+ - Side-by-side comparison of dimensionality reduction methods
566
+ - Clustering results visualization
567
+ - Performance metrics
568
+ - Interactive Plotly.js charts
569
+
570
+ <img src="/docs/assets/visualization.png" alt="rake clusterkit:visualize">
571
+
572
+
573
+ ## Choosing the Right Algorithm
574
+
575
+ ### Dimensionality Reduction
576
+
577
+ | Algorithm | Best For | Pros | Cons |
578
+ |-----------|----------|------|------|
579
+ | **UMAP** | General purpose, preserving both local and global structure | Fast, scalable, supports transform() | Requires tuning parameters |
580
+ | **PCA** | Linear relationships, feature extraction | Very fast, interpretable, deterministic | Only captures linear relationships |
581
+ | **SVD** | Text analysis (LSA), recommendation systems | Memory efficient, good for sparse data | Only linear relationships |
582
+
583
+ ### Clustering
584
+
585
+ | Algorithm | Best For | Pros | Cons |
586
+ |-----------|----------|------|------|
587
+ | **K-means** | Spherical clusters, known cluster count | Fast, simple, deterministic with seed | Requires knowing k, assumes spherical clusters |
588
+ | **HDBSCAN** | Unknown cluster count, irregular shapes, noise | Finds clusters automatically, handles noise | More complex parameters, slower than k-means |
589
+
590
+ ### Recommended Combinations
591
+
592
+ - **Document Clustering**: UMAP (20D) → HDBSCAN
593
+ - **Image Clustering**: PCA (50D) → K-means
594
+ - **Customer Segmentation**: UMAP (10D) → K-means with elbow method
595
+ - **Anomaly Detection**: UMAP (5D) → HDBSCAN (outliers are noise points)
596
+ - **Visualization**: UMAP (2D) or PCA (2D) → visual inspection
597
+
598
+ ## Advanced Examples
599
+
600
+ ### Document Clustering Pipeline
601
+
602
+ ```ruby
603
+ # Typical NLP workflow: embed → reduce → cluster
604
+ documents = ["text1", "text2", ...] # Your documents
605
+
606
+ # Step 1: Get embeddings (use your favorite embedding model)
607
+ # embeddings = get_embeddings(documents) # e.g., from red-candle
608
+
609
+ # Step 2: Reduce dimensions for better clustering
610
+ umap = ClusterKit::Dimensionality::UMAP.new(n_components: 20, n_neighbors: 10)
611
+ reduced_embeddings = umap.fit_transform(embeddings)
612
+
613
+ # Step 3: Find clusters
614
+ hdbscan = ClusterKit::Clustering::HDBSCAN.new(
615
+ min_samples: 5,
616
+ min_cluster_size: 10
617
+ )
618
+ clusters = hdbscan.fit_predict(reduced_embeddings)
619
+
620
+ # Step 4: Analyze results
621
+ clusters.each_with_index do |cluster_id, doc_idx|
622
+ next if cluster_id == -1 # Skip noise
623
+ puts "Document '#{documents[doc_idx]}' belongs to cluster #{cluster_id}"
624
+ end
625
+ ```
626
+
627
+ ### Model Persistence
628
+
629
+ ```ruby
630
+ # Save trained model
631
+ umap.save_model("model.bin")
631
+
632
+ # Load trained model
633
+ loaded_umap = ClusterKit::Dimensionality::UMAP.load_model("model.bin")
635
+ result = loaded_umap.transform(new_data)
636
+ ```
637
+
638
+ ## Performance Tips
639
+
640
+ 1. **Large Datasets**: Use sampling for initial parameter tuning
641
+ 2. **HDBSCAN**: Reduce to 10-50 dimensions with UMAP first for better results
642
+ 3. **Memory**: Process in batches for very large datasets
643
+ 4. **Speed**: Compile with optimizations: `RUSTFLAGS="-C target-cpu=native" bundle install`
644
+
645
+ ### UMAP Reproducibility vs Performance
646
+
647
+ ClusterKit's UMAP implementation offers two modes:
648
+
649
+ | Mode | Usage | Performance | Reproducibility |
650
+ |------|-------|-------------|-----------------|
651
+ | **Fast (default)** | `UMAP.new()` | Parallel processing, ~25-35% faster | Non-deterministic |
652
+ | **Reproducible** | `UMAP.new(random_seed: 42)` | Serial processing | Fully deterministic |
653
+
654
+ **When to use each mode:**
655
+ - **Production/Analysis**: Use default (no seed) for best performance when exact reproducibility isn't critical
656
+ - **Research/Testing**: Use a seed when you need reproducible results for comparisons or debugging
657
+ - **CI/Testing**: Always use a seed to ensure consistent test results
658
+
659
+ **Note**: The `transform` method is always deterministic once a model is fitted, regardless of seed usage during training.
660
+
661
+
662
+ ## Troubleshooting
663
+
664
+ ### UMAP "isolated point" or "graph not connected" errors
665
+
666
+ These errors occur when UMAP cannot find enough neighbors for some points. Solutions:
667
+
668
+ 1. **Reduce n_neighbors**: Use a smaller value (e.g., 5 instead of 15)
669
+ ```ruby
670
+ umap = ClusterKit::Dimensionality::UMAP.new(n_neighbors: 5)
671
+ ```
672
+
673
+ 2. **Add structure to your data**: Completely random data may not work well
674
+ ```ruby
675
+ # Bad: Pure random data with no structure
676
+ data = Array.new(100) { Array.new(50) { rand } }
677
+
678
+ # Good: Data with clusters or patterns (see Quick Start example)
679
+ # Create clusters with centers and add points around them
680
+ ```
681
+
682
+ 3. **Ensure sufficient data points**: UMAP needs at least n_neighbors + 1 points
683
+
684
+ 4. **Use consistent data generation**: For examples/testing, use a fixed seed
685
+ ```ruby
686
+ srand(42) # Ensures reproducible data generation
687
+ ```
688
+
689
+ Note: Real-world embeddings (from text, images, etc.) typically have inherent structure and work better than random data.
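
The "at least n_neighbors + 1 points" requirement can also be checked up front. The sketch below is a plain-Ruby guard with a hypothetical name, `safe_n_neighbors`; it does not call ClusterKit, and it may be redundant where the library already adjusts the value itself.

```ruby
# Hypothetical guard: cap n_neighbors so that every point can have that many
# neighbors other than itself (requires data.size >= n_neighbors + 1).
def safe_n_neighbors(data, requested)
  raise ArgumentError, "need at least 2 points" if data.size < 2

  [requested, data.size - 1].min
end

tiny_dataset = Array.new(10) { |i| [i.to_f, i * 2.0] } # 10 made-up points
puts "n_neighbors to use: #{safe_n_neighbors(tiny_dataset, 15)}"
```

The returned value can be passed as the `n_neighbors:` argument when constructing the reducer.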
690
+
691
+ ### Memory issues with large datasets
692
+
693
+ - Process in batches for datasets > 100k points
694
+ - Use PCA to reduce dimensions before UMAP
695
+
696
+ ### Installation issues
697
+
698
+ - Ensure Rust is installed: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
699
+ - For M1/M2 Macs, ensure you have the latest Xcode command line tools
700
+ - Clear the build cache if needed: `bundle exec rake clean`
701
+
702
+ ## Development
703
+
704
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests.
705
+
706
+ To install this gem onto your local machine, run `bundle exec rake install`.
707
+
708
+ ## Testing
709
+
710
+ ```bash
711
+ # Run all tests
712
+ bundle exec rspec
713
+
714
+ # Run specific test file
715
+ bundle exec rspec spec/clusterkit/clustering_spec.rb
716
+
717
+ # Run with coverage
718
+ COVERAGE=true bundle exec rspec
719
+ ```
720
+
721
+ ## Contributing
722
+
723
+ Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/clusterkit.
724
+
725
+ ## License
726
+
727
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
728
+
729
+ ## Citation
730
+
731
+ If you use ClusterKit in your research, please cite:
732
+
733
+ ```
734
+ @software{clusterkit,
735
+ author = {Chris Petersen},
736
+ title = {ClusterKit: High-Performance Clustering and Dimensionality Reduction for Ruby},
737
+ year = {2024},
738
+ url = {https://github.com/scientist-labs/clusterkit}
739
+ }
740
+ ```
741
+
742
+ And please also cite the underlying libraries:
743
+ - [annembed](https://github.com/jean-pierreBoth/annembed) for dimensionality reduction algorithms
744
+ - [hdbscan](https://github.com/tom-whitehead/hdbscan) for HDBSCAN clustering