clusterkit 0.1.0.pre.1

Files changed (49)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.simplecov +47 -0
  4. data/CHANGELOG.md +35 -0
  5. data/CLAUDE.md +226 -0
  6. data/Cargo.toml +8 -0
  7. data/Gemfile +17 -0
  8. data/IMPLEMENTATION_NOTES.md +143 -0
  9. data/LICENSE.txt +21 -0
  10. data/PYTHON_COMPARISON.md +183 -0
  11. data/README.md +499 -0
  12. data/Rakefile +245 -0
  13. data/clusterkit.gemspec +45 -0
  14. data/docs/KNOWN_ISSUES.md +130 -0
  15. data/docs/RUST_ERROR_HANDLING.md +164 -0
  16. data/docs/TEST_FIXTURES.md +170 -0
  17. data/docs/UMAP_EXPLAINED.md +362 -0
  18. data/docs/UMAP_TROUBLESHOOTING.md +284 -0
  19. data/docs/VERBOSE_OUTPUT.md +84 -0
  20. data/examples/hdbscan_example.rb +147 -0
  21. data/examples/optimal_kmeans_example.rb +96 -0
  22. data/examples/pca_example.rb +114 -0
  23. data/examples/reproducible_umap.rb +99 -0
  24. data/examples/verbose_control.rb +43 -0
  25. data/ext/clusterkit/Cargo.toml +25 -0
  26. data/ext/clusterkit/extconf.rb +4 -0
  27. data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
  28. data/ext/clusterkit/src/clustering.rs +267 -0
  29. data/ext/clusterkit/src/embedder.rs +413 -0
  30. data/ext/clusterkit/src/lib.rs +22 -0
  31. data/ext/clusterkit/src/svd.rs +112 -0
  32. data/ext/clusterkit/src/tests.rs +16 -0
  33. data/ext/clusterkit/src/utils.rs +33 -0
  34. data/lib/clusterkit/clustering/hdbscan.rb +177 -0
  35. data/lib/clusterkit/clustering.rb +213 -0
  36. data/lib/clusterkit/clusterkit.rb +9 -0
  37. data/lib/clusterkit/configuration.rb +24 -0
  38. data/lib/clusterkit/dimensionality/pca.rb +251 -0
  39. data/lib/clusterkit/dimensionality/svd.rb +144 -0
  40. data/lib/clusterkit/dimensionality/umap.rb +311 -0
  41. data/lib/clusterkit/dimensionality.rb +29 -0
  42. data/lib/clusterkit/hdbscan_api_design.rb +142 -0
  43. data/lib/clusterkit/preprocessing.rb +106 -0
  44. data/lib/clusterkit/silence.rb +42 -0
  45. data/lib/clusterkit/utils.rb +51 -0
  46. data/lib/clusterkit/version.rb +5 -0
  47. data/lib/clusterkit.rb +93 -0
  48. data/lib/tasks/visualize.rake +641 -0
  49. metadata +194 -0
data/README.md ADDED
@@ -0,0 +1,499 @@
1
+ # ClusterKit
2
+
3
+ A high-performance clustering and dimensionality reduction toolkit for Ruby, powered by best-in-class Rust implementations.
4
+
5
+ ## 🙏 Acknowledgments & Attribution
6
+
7
+ ClusterKit builds upon excellent work from the Rust ecosystem:
8
+
9
+ - **[annembed](https://github.com/jean-pierreBoth/annembed)** - Provides the core UMAP, t-SNE, and other dimensionality reduction algorithms. Created by Jean-Pierre Both.
10
+ - **[hdbscan](https://github.com/tom-whitehead/hdbscan)** - Provides the HDBSCAN density-based clustering implementation. A Rust port of the original HDBSCAN algorithm.
11
+
12
+ This gem would not be possible without these foundational libraries. Please consider starring their repositories if you find ClusterKit useful.
13
+
14
+ ## Features
15
+
16
+ - **Dimensionality Reduction Algorithms**:
17
+ - UMAP (Uniform Manifold Approximation and Projection) - powered by annembed
18
+ - PCA (Principal Component Analysis)
19
+ - SVD (Singular Value Decomposition)
20
+
21
+ - **Advanced Clustering**:
22
+ - K-means clustering with automatic k selection via elbow method
23
+ - HDBSCAN (Hierarchical Density-Based Spatial Clustering) for density-based clustering with noise detection
24
+ - Silhouette scoring for cluster quality evaluation
25
+
26
+ - **High Performance**:
27
+ - Leverages Rust's speed and parallelization
28
+ - Efficient memory usage
29
+ - Support for large datasets
30
+
31
+ - **Easy to Use**:
32
+ - Simple, scikit-learn-like API
33
+ - Consistent interface across algorithms
34
+ - Comprehensive documentation and examples
35
+
36
+ - **Visualization Tools**:
37
+ - Interactive HTML visualizations
38
+ - Comparison of different algorithms
39
+ - Built-in rake tasks for quick experimentation
40
+
41
+ ## Installation
42
+
43
+ Add this line to your application's Gemfile:
44
+
45
+ ```ruby
46
+ gem 'clusterkit'
47
+ ```
48
+
49
+ And then execute:
50
+
51
+ $ bundle install
52
+
53
+ Or install it yourself as:
54
+
55
+ $ gem install clusterkit
56
+
57
+ ### Prerequisites
58
+
59
+ - Ruby 2.7 or higher
60
+ - Rust toolchain (for building from source)
61
+
62
+ ## Quick Start - Interactive Example
63
+
64
+ Copy and paste this **entire block** into IRB to try out the main features (including the srand line for reproducible results):
65
+
66
+ ```ruby
67
+ require 'clusterkit'
68
+
69
+ # Generate sample high-dimensional data with structure
70
+ # This simulates real-world data like text embeddings or image features
71
+ puts "Creating sample data: 100 points in 50 dimensions with 3 clusters"
72
+
73
+ # Use a fixed seed for reproducibility in this example
74
+ # Important: Random data without structure can cause UMAP errors
75
+ srand(42)
76
+
77
+ # Create data with some inherent structure (3 clusters)
78
+ # Using better separated clusters to avoid UMAP convergence issues
79
+ data = []
80
+ 3.times do |cluster|
81
+ # Each cluster has a different center, well-separated
82
+ center = Array.new(50) { rand * 0.1 + cluster * 2.0 }
83
+
84
+ # Add 33 points around each center with controlled noise
85
+ 33.times do
86
+ point = center.map { |c| c + (rand - 0.5) * 0.3 }
87
+ data << point
88
+ end
89
+ end
90
+
91
+ # Add one more point to make it 100
92
+ data << Array.new(50) { rand * 6.0 } # Scale to match cluster range
93
+
94
+ # ============================================================
95
+ # 1. DIMENSIONALITY REDUCTION - Visualize high-dim data in 2D
96
+ # ============================================================
97
+
98
+ puts "\n1. DIMENSIONALITY REDUCTION:"
99
+
100
+ # UMAP - Best for preserving both local and global structure
101
+ puts "Running UMAP..."
102
+ # Note: Using n_neighbors=5 for better stability with varied data
103
+ # Lower n_neighbors helps avoid "isolated point" errors
104
+ umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 5)
105
+ umap_result = umap.fit_transform(data)
106
+ puts " ✓ Reduced to #{umap_result.first.size}D: #{umap_result[0..2].map { |p| p.map { |v| v.round(3) } }}"
107
+
108
+ # PCA - Fast linear reduction, good for finding main variations
109
+ puts "Running PCA..."
110
+ pca = ClusterKit::Dimensionality::PCA.new(n_components: 2)
111
+ pca_result = pca.fit_transform(data)
112
+ puts " ✓ Reduced to #{pca_result.first.size}D: #{pca_result[0..2].map { |p| p.map { |v| v.round(3) } }}"
113
+ puts " ✓ Explained variance: #{(pca.explained_variance_ratio.sum * 100).round(1)}%"
114
+
115
+ # ============================================================
116
+ # 2. CLUSTERING - Find groups in your data
117
+ # ============================================================
118
+
119
+ puts "\n2. CLUSTERING:"
120
+
121
+ # K-means - When you know roughly how many clusters to expect
122
+ puts "Running K-means..."
123
+ # First, find optimal k using elbow method
124
+ elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(umap_result, k_range: 2..6)
125
+ optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)
126
+ puts " ✓ Optimal k detected: #{optimal_k}"
127
+
128
+ kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k)
129
+ kmeans_labels = kmeans.fit_predict(umap_result)
130
+ puts " ✓ Found #{kmeans_labels.uniq.size} clusters"
131
+
132
+ # HDBSCAN - When you don't know the number of clusters and have noise
133
+ puts "Running HDBSCAN..."
134
+ hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5, min_cluster_size: 10)
135
+ hdbscan_labels = hdbscan.fit_predict(umap_result)
136
+ puts " ✓ Found #{hdbscan.n_clusters} clusters"
137
+ puts " ✓ Identified #{hdbscan.n_noise_points} noise points (#{(hdbscan.noise_ratio * 100).round(1)}%)"
138
+
139
+ # ============================================================
140
+ # 3. EVALUATION - How good are the clusters?
141
+ # ============================================================
142
+
143
+ puts "\n3. CLUSTER EVALUATION:"
144
+ silhouette = ClusterKit::Clustering.silhouette_score(umap_result, kmeans_labels)
145
+ puts " K-means silhouette score: #{silhouette.round(3)} (closer to 1 is better)"
146
+
147
+ # Filter noise for HDBSCAN evaluation
148
+ non_noise = hdbscan_labels.each_with_index.select { |l, _| l != -1 }.map(&:last)
149
+ if non_noise.any?
150
+ filtered_data = non_noise.map { |i| umap_result[i] }
151
+ filtered_labels = non_noise.map { |i| hdbscan_labels[i] }
152
+ hdbscan_silhouette = ClusterKit::Clustering.silhouette_score(filtered_data, filtered_labels)
153
+ puts " HDBSCAN silhouette score: #{hdbscan_silhouette.round(3)} (excluding noise)"
154
+ end
155
+
156
+ puts "\n✅ All done! Try visualizing with: rake clusterkit:visualize"
157
+ ```
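The silhouette score reported in the evaluation step can be checked by hand on small inputs. The following is a minimal pure-Ruby sketch of the standard definition, not ClusterKit's Rust implementation; the method names `euclidean` and `naive_silhouette` are illustrative and are not part of the gem.

```ruby
# Euclidean distance between two equal-length numeric arrays.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean silhouette coefficient over all points.
# For each point: a = mean distance to the other members of its own cluster,
# b = smallest mean distance to the members of any other cluster,
# s = (b - a) / max(a, b). Points in singleton clusters score 0.
def naive_silhouette(points, labels)
  clusters = points.each_index.group_by { |i| labels[i] }
  raise ArgumentError, "need at least 2 clusters" if clusters.size < 2

  scores = points.each_index.map do |i|
    own = clusters[labels[i]] - [i]
    next 0.0 if own.empty?

    a = own.sum { |j| euclidean(points[i], points[j]) } / own.size
    b = clusters.reject { |label, _| label == labels[i] }
                .map { |_, members| members.sum { |j| euclidean(points[i], points[j]) } / members.size }
                .min
    denom = [a, b].max
    denom.zero? ? 0.0 : (b - a) / denom
  end
  scores.sum / scores.size
end

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
puts naive_silhouette(points, [0, 0, 1, 1]).round(3) # labels match the two groups
puts naive_silhouette(points, [0, 1, 0, 1]).round(3) # labels cut across the groups
```

A labeling that matches the two well-separated groups scores close to 1, while a labeling that cuts across them scores below 0. This version is quadratic in the number of points, so it is only suitable for spot checks.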
158
+
159
+ ## Detailed Usage
160
+
161
+ ### API Structure
162
+
163
+ ClusterKit organizes its algorithms into logical modules:
164
+
165
+ - **`ClusterKit::Dimensionality`** - Algorithms for reducing data dimensions
166
+ - `UMAP` - Non-linear manifold learning
167
+ - `PCA` - Principal Component Analysis
168
+ - `SVD` - Singular Value Decomposition
169
+
170
+ - **`ClusterKit::Clustering`** - Algorithms for grouping data
171
+ - `KMeans` - Partition-based clustering
172
+ - `HDBSCAN` - Density-based clustering with noise detection
173
+
174
+ ### Dimensionality Reduction
175
+
176
+ #### UMAP (Uniform Manifold Approximation and Projection)
177
+
178
+ **Important**: UMAP requires data with some structure. Pure random data may fail.
179
+ For testing, create data with patterns (see Quick Start example above).
180
+
181
+ ```ruby
182
+ # Generate sample data with structure (not pure random)
183
+ data = []
184
+ 100.times do |i|
185
+ # Create points that cluster in groups
186
+ group = i / 25 # 4 groups
187
+ base = group * 2.0
188
+ point = Array.new(50) { |j| base + rand * 0.8 + j * 0.01 }
189
+ data << point
190
+ end
191
+
192
+ # Create UMAP instance
193
+ umap = ClusterKit::Dimensionality::UMAP.new(
194
+ n_components: 2, # Target dimensions (default: 2)
195
+ n_neighbors: 5, # Number of neighbors (default: 15, use 5 for small datasets)
196
+ random_seed: 42, # For reproducibility (default: nil for best performance)
197
+ nb_grad_batch: 10, # Gradient descent batches (default: 10, lower = faster)
198
+ nb_sampling_by_edge: 8 # Negative samples per edge (default: 8, lower = faster)
199
+ )
200
+
201
+ # Fit and transform data
202
+ embedded = umap.fit_transform(data)
203
+
204
+ # IMPORTANT: Seed behavior
205
+ # - WITH random_seed: Fully reproducible results using serial processing (slower)
206
+ # - WITHOUT random_seed: Faster parallel processing but non-deterministic results
207
+
208
+ # Or fit once and transform multiple datasets
209
+ # Example: Split your data into training and test sets
210
+ # Note: Using structured data (not pure random) for better UMAP results
211
+ all_data = []
212
+ 200.times do |i|
213
+ # Create data with some structure - points cluster around different regions
214
+ center = (i / 50) * 2.0 # 4 rough groups
215
+ point = Array.new(50) { |j| center + rand * 0.5 + j * 0.01 }
216
+ all_data << point
217
+ end
218
+
219
+ training_data = all_data[0...150] # First 150 samples for training
220
+ test_data = all_data[150..-1] # Last 50 samples for testing
221
+
222
+ umap.fit(training_data)
223
+ test_embedded = umap.transform(test_data)
224
+
225
+ # Note: The library automatically adjusts n_neighbors if it's too large for your dataset
226
+ ```
227
+
228
+ #### PCA (Principal Component Analysis)
229
+
230
+ ```ruby
231
+ pca = ClusterKit::Dimensionality::PCA.new(n_components: 2)
232
+ transformed = pca.fit_transform(data)
233
+
234
+ # Access explained variance
235
+ puts "Explained variance ratio: #{pca.explained_variance_ratio}"
236
+ puts "Cumulative explained variance: #{pca.cumulative_explained_variance_ratio}"
237
+
238
+ # Inverse transform to reconstruct original data
239
+ reconstructed = pca.inverse_transform(transformed)
240
+ ```
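The explained variance ratios can guide the choice of `n_components`. The sketch below assumes `explained_variance_ratio` returns an array of floats in descending order (the Quick Start sums it, which suggests an array); the helpers `cumulative` and `components_for` are hypothetical and not part of ClusterKit.

```ruby
# Running total of the per-component explained variance ratios.
def cumulative(ratios)
  total = 0.0
  ratios.map { |r| total += r }
end

# Smallest number of leading components whose cumulative ratio reaches
# the threshold, or nil if the threshold is never reached.
def components_for(ratios, threshold)
  idx = cumulative(ratios).index { |c| c >= threshold - 1e-12 }
  idx && idx + 1
end

ratios = [0.6, 0.25, 0.1, 0.05] # illustrative values, not real output
puts components_for(ratios, 0.8)
puts components_for(ratios, 0.9)
```

With real data, `ratios` would come from a fitted PCA instance, for example `pca.explained_variance_ratio`.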
241
+
242
+
243
+ #### SVD (Singular Value Decomposition)
244
+
245
+ ```ruby
246
+ # Direct SVD decomposition using the class interface
247
+ svd = ClusterKit::Dimensionality::SVD.new(n_components: 10, n_iter: 5)
248
+ u, s, vt = svd.fit_transform(data)
249
+
250
+ # U: left singular vectors (documents in LSA)
251
+ # S: singular values (importance of each component)
252
+ # V^T: right singular vectors (terms in LSA)
253
+
254
+ puts "Shape of U: #{u.size}x#{u.first.size}"
255
+ puts "Singular values: #{s[0..4].map { |v| v.round(2) }}"
256
+ puts "Shape of V^T: #{vt.size}x#{vt.first.size}"
257
+
258
+ # For dimensionality reduction, use U * S
259
+ reduced = u.map do |row|
260
+ row.map.with_index { |val, j| val * s[j] }
261
+ end
262
+
263
+ # Or use the convenience method
264
+ u, s, vt = ClusterKit.svd(data, 10, n_iter: 5)
265
+ ```
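Why `U * S` works as the reduced representation: since `A = U * S * V^T` and the rows of `V^T` are orthonormal, `U * S` equals `A * V`, the projection of the data onto the right singular vectors. The plain-Ruby sketch below checks this on a hand-built decomposition. The helpers `scale_columns` and `matmul` are illustrative, and the matrices are made-up values rather than ClusterKit output.

```ruby
# Scale column j of u by singular value s[j] (equivalent to U * diag(S)).
def scale_columns(u, s)
  u.map { |row| row.each_with_index.map { |val, j| val * s[j] } }
end

# Plain-Ruby matrix product of two arrays of rows.
def matmul(a, b)
  cols = b.transpose
  a.map { |row| cols.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# A hand-built decomposition: orthonormal U and V^T, positive singular values.
u  = [[1.0, 0.0], [0.0, 1.0]]
s  = [5.0, 2.0]
vt = [[0.6, 0.8], [-0.8, 0.6]]

a         = matmul(scale_columns(u, s), vt) # reconstruct A = U * diag(S) * V^T
reduced   = scale_columns(u, s)             # U * S
projected = matmul(a, vt.transpose)         # A * V

p a.map { |row| row.map { |v| v.round(6) } }
p reduced
p projected.map { |row| row.map { |v| v.round(6) } }
```

Up to floating-point rounding, `reduced` and `projected` hold the same values. To keep only the first `k` components, take the first `k` entries of each row of `reduced`.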
266
+
267
+
268
+ ### Clustering
269
+
270
+ #### K-means with Automatic K Selection
271
+
272
+ ```ruby
273
+ # Find optimal number of clusters
274
+ elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(data, k_range: 2..10)
275
+ optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)
276
+
277
+ # Cluster with optimal k
278
+ kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k, random_seed: 42)
279
+ labels = kmeans.fit_predict(data)
280
+
281
+ # Access cluster centers
282
+ centers = kmeans.cluster_centers
283
+ ```
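One common way to locate an elbow is to pick the `k` where the inertia curve bends most sharply, that is, where the second difference is largest. The sketch below illustrates that heuristic only. It assumes the scores are a Hash of `k => inertia`, which may not match the structure `elbow_method` returns, and it is not a description of how `detect_optimal_k` works internally. The method name is hypothetical.

```ruby
# scores: Hash mapping k (Integer) to inertia (within-cluster sum of squares).
# Returns the k with the largest second difference, i.e. the point where the
# drop in inertia slows down the most. Needs at least three k values.
def elbow_by_second_difference(scores)
  ks = scores.keys.sort
  raise ArgumentError, "need at least 3 k values" if ks.size < 3

  best = ks.each_cons(3).max_by do |prev_k, k, next_k|
    scores[prev_k] - 2 * scores[k] + scores[next_k]
  end
  best[1] # middle element of the winning (prev, k, next) triple
end

# Illustrative inertia values, not real output.
scores = { 2 => 1000.0, 3 => 400.0, 4 => 350.0, 5 => 320.0, 6 => 300.0 }
puts elbow_by_second_difference(scores)
```

Because the heuristic needs a neighbor on each side, it can never select the smallest or largest `k` in the range, so the range should extend past the expected cluster count.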
284
+
285
+ #### HDBSCAN (Density-Based Clustering)
286
+
287
+ ```ruby
288
+ # HDBSCAN automatically determines the number of clusters
289
+ # and can identify noise points
290
+ hdbscan = ClusterKit::Clustering::HDBSCAN.new(
291
+ min_samples: 5, # Minimum samples in neighborhood
292
+ min_cluster_size: 10, # Minimum cluster size
293
+ metric: 'euclidean' # Distance metric
294
+ )
295
+
296
+ labels = hdbscan.fit_predict(data)
297
+
298
+ # Noise points are labeled as -1
299
+ puts "Clusters found: #{hdbscan.n_clusters}"
300
+ puts "Noise points: #{hdbscan.n_noise_points} (#{(hdbscan.noise_ratio * 100).round(1)}%)"
301
+
302
+ # Access additional HDBSCAN information
303
+ probabilities = hdbscan.probabilities # Cluster membership probabilities
304
+ outlier_scores = hdbscan.outlier_scores # Outlier scores for each point
305
+ ```
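Since noise points are labeled `-1`, cluster sizes and the noise ratio can also be derived directly from the label array. The helper below is a hypothetical pure-Ruby sketch that works on any array of integer labels; it does not call ClusterKit.

```ruby
# Summarize a label array in which -1 marks noise.
# Returns the cluster count, noise count, noise ratio, and per-cluster sizes.
def summarize_labels(labels)
  noise = labels.count(-1)
  sizes = labels.reject { |l| l == -1 }.tally # requires Ruby 2.7+
  {
    n_clusters: sizes.size,
    n_noise: noise,
    noise_ratio: labels.empty? ? 0.0 : noise.to_f / labels.size,
    sizes: sizes
  }
end

summary = summarize_labels([0, 0, 1, -1, 1, 1, -1, 0])
puts "Clusters: #{summary[:n_clusters]}, noise: #{summary[:n_noise]}"
puts "Sizes: #{summary[:sizes]}"
```

Comparing `sizes` against `min_cluster_size` is a quick sanity check when tuning the parameters.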
306
+
307
+ ### Visualization
308
+
309
+ ClusterKit includes a built-in visualization tool:
310
+
311
+ ```bash
312
+ # Generate interactive visualization
313
+ rake clusterkit:visualize
314
+
315
+ # With options
316
+ rake clusterkit:visualize[output.html,iris,both] # filename, dataset, clustering method
317
+
318
+ # Dataset options: clusters, swiss, iris
319
+ # Clustering options: kmeans, hdbscan, both
320
+ ```
321
+
322
+ This creates an interactive HTML file with:
323
+ - Side-by-side comparison of dimensionality reduction methods
324
+ - Clustering results visualization
325
+ - Performance metrics
326
+ - Interactive Plotly.js charts
327
+
328
+ ## Choosing the Right Algorithm
329
+
330
+ ### Dimensionality Reduction
331
+
332
+ | Algorithm | Best For | Pros | Cons |
333
+ |-----------|----------|------|------|
334
+ | **UMAP** | General purpose, preserving both local and global structure | Fast, scalable, supports transform() | Requires tuning parameters |
335
+ | **PCA** | Linear relationships, feature extraction | Very fast, interpretable, deterministic | Only captures linear relationships |
336
+ | **SVD** | Text analysis (LSA), recommendation systems | Memory efficient, good for sparse data | Only linear relationships |
337
+
338
+ ### Clustering
339
+
340
+ | Algorithm | Best For | Pros | Cons |
341
+ |-----------|----------|------|------|
342
+ | **K-means** | Spherical clusters, known cluster count | Fast, simple, deterministic with seed | Requires knowing k, assumes spherical clusters |
343
+ | **HDBSCAN** | Unknown cluster count, irregular shapes, noise | Finds clusters automatically, handles noise | More complex parameters, slower than k-means |
344
+
345
+ ### Recommended Combinations
346
+
347
+ - **Document Clustering**: UMAP (20D) → HDBSCAN
348
+ - **Image Clustering**: PCA (50D) → K-means
349
+ - **Customer Segmentation**: UMAP (10D) → K-means with elbow method
350
+ - **Anomaly Detection**: UMAP (5D) → HDBSCAN (outliers are noise points)
351
+ - **Visualization**: UMAP (2D) or PCA (2D) → visual inspection
352
+
353
+ ## Advanced Examples
354
+
355
+ ### Document Clustering Pipeline
356
+
357
+ ```ruby
358
+ # Typical NLP workflow: embed → reduce → cluster
359
+ documents = ["text1", "text2"] # ... your documents
360
+
361
+ # Step 1: Get embeddings (use your favorite embedding model)
362
+ # embeddings = get_embeddings(documents) # e.g., from red-candle
363
+
364
+ # Step 2: Reduce dimensions for better clustering
365
+ umap = ClusterKit::Dimensionality::UMAP.new(n_components: 20, n_neighbors: 10)
366
+ reduced_embeddings = umap.fit_transform(embeddings)
367
+
368
+ # Step 3: Find clusters
369
+ hdbscan = ClusterKit::Clustering::HDBSCAN.new(
370
+ min_samples: 5,
371
+ min_cluster_size: 10
372
+ )
373
+ clusters = hdbscan.fit_predict(reduced_embeddings)
374
+
375
+ # Step 4: Analyze results
376
+ clusters.each_with_index do |cluster_id, doc_idx|
377
+ next if cluster_id == -1 # Skip noise
378
+ puts "Document '#{documents[doc_idx]}' belongs to cluster #{cluster_id}"
379
+ end
380
+ ```
381
+
382
+ ### Model Persistence
383
+
384
+ ```ruby
385
+ # Save trained model
386
+ umap.save("model.bin")
387
+
388
+ # Load trained model
389
+ loaded_umap = ClusterKit::Dimensionality::UMAP.load("model.bin")
390
+ result = loaded_umap.transform(new_data)
391
+ ```
392
+
393
+ ## Performance Tips
394
+
395
+ 1. **Large Datasets**: Use sampling for initial parameter tuning
396
+ 2. **HDBSCAN**: Reduce to 10-50 dimensions with UMAP first for better results
397
+ 3. **Memory**: Process in batches for very large datasets
398
+ 4. **Speed**: Compile with optimizations: `RUSTFLAGS="-C target-cpu=native" bundle install`
399
+
400
+ ### UMAP Reproducibility vs Performance
401
+
402
+ ClusterKit's UMAP implementation offers two modes:
403
+
404
+ | Mode | Usage | Performance | Reproducibility |
405
+ |------|-------|-------------|-----------------|
406
+ | **Fast (default)** | `UMAP.new()` | Parallel processing, ~25-35% faster | Non-deterministic |
407
+ | **Reproducible** | `UMAP.new(random_seed: 42)` | Serial processing | Fully deterministic |
408
+
409
+ **When to use each mode:**
410
+ - **Production/Analysis**: Use default (no seed) for best performance when exact reproducibility isn't critical
411
+ - **Research/Testing**: Use a seed when you need reproducible results for comparisons or debugging
412
+ - **CI/Testing**: Always use a seed to ensure consistent test results
413
+
414
+ **Note**: The `transform` method is always deterministic once a model is fitted, regardless of seed usage during training.
415
+
416
+
417
+ ## Troubleshooting
418
+
419
+ ### UMAP "isolated point" or "graph not connected" errors
420
+
421
+ These errors occur when UMAP cannot find enough neighbors for some points. Possible solutions:
422
+
423
+ 1. **Reduce n_neighbors**: Use a smaller value (e.g., 5 instead of 15)
424
+ ```ruby
425
+ umap = ClusterKit::Dimensionality::UMAP.new(n_neighbors: 5)
426
+ ```
427
+
428
+ 2. **Add structure to your data**: Completely random data may not work well
429
+ ```ruby
430
+ # Bad: Pure random data with no structure
431
+ data = Array.new(100) { Array.new(50) { rand } }
432
+
433
+ # Good: Data with clusters or patterns (see Quick Start example)
434
+ # Create clusters with centers and add points around them
435
+ ```
436
+
437
+ 3. **Ensure sufficient data points**: UMAP needs at least n_neighbors + 1 points
438
+
439
+ 4. **Use consistent data generation**: For examples/testing, use a fixed seed
440
+ ```ruby
441
+ srand(42) # Ensures reproducible data generation
442
+ ```
443
+
444
+ Note: Real-world embeddings (from text, images, etc.) typically have inherent structure and work better than random data.
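The point-count rule from item 3 can be checked before constructing the UMAP instance. The helper below is a hypothetical pre-check written in plain Ruby, not part of ClusterKit's API. It caps the requested value at one less than the number of points.

```ruby
# Cap n_neighbors so that the dataset has at least n_neighbors + 1 points.
def safe_n_neighbors(n_points, requested)
  raise ArgumentError, "need at least 2 points" if n_points < 2

  [requested, n_points - 1].min
end

puts safe_n_neighbors(100, 15) # plenty of points, request is kept
puts safe_n_neighbors(10, 15)  # too few points, value is capped
```

The result can then be passed as `n_neighbors:` when creating the UMAP instance, for example `safe_n_neighbors(data.size, 15)`.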
445
+
446
+ ### Memory issues with large datasets
447
+
448
+ - Process in batches for datasets > 100k points
449
+ - Use PCA to reduce dimensions before UMAP
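Batch processing can be sketched with `each_slice`. The helper below is a hypothetical wrapper in plain Ruby; the block stands in for whatever per-batch work is needed, such as calling `transform` on an already fitted model. Whether batching is appropriate for fitting, as opposed to transforming, depends on the algorithm and is not covered here.

```ruby
# Apply a block to rows in fixed-size batches and concatenate the results.
# The block receives an array of rows and must return an array of rows.
def transform_in_batches(rows, batch_size: 10_000)
  rows.each_slice(batch_size).flat_map { |batch| yield(batch) }
end

# Stand-in transformation: double every value. With a fitted model this
# block could instead be { |batch| umap.transform(batch) }.
rows = [[1, 2], [3, 4], [5, 6]]
result = transform_in_batches(rows, batch_size: 2) do |batch|
  batch.map { |row| row.map { |v| v * 2 } }
end
p result
```

The output keeps the original row order, because batches are processed and concatenated sequentially.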
450
+
451
+ ### Installation issues
452
+
453
+ - Ensure Rust is installed: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
454
+ - For M1/M2 Macs, ensure you have the latest Xcode command line tools
455
+ - Clear the build cache if needed: `bundle exec rake clean`
456
+
457
+ ## Development
458
+
459
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests.
460
+
461
+ To install this gem onto your local machine, run `bundle exec rake install`.
462
+
463
+ ## Testing
464
+
465
+ ```bash
466
+ # Run all tests
467
+ bundle exec rspec
468
+
469
+ # Run specific test file
470
+ bundle exec rspec spec/clusterkit/clustering_spec.rb
471
+
472
+ # Run with coverage
473
+ COVERAGE=true bundle exec rspec
474
+ ```
475
+
476
+ ## Contributing
477
+
478
+ Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/clusterkit.
479
+
480
+ ## License
481
+
482
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
483
+
484
+ ## Citation
485
+
486
+ If you use ClusterKit in your research, please cite:
487
+
488
+ ```
489
+ @software{clusterkit,
490
+ author = {Chris Petersen},
491
+ title = {ClusterKit: High-Performance Clustering and Dimensionality Reduction for Ruby},
492
+ year = {2024},
493
+ url = {https://github.com/cpetersen/clusterkit}
494
+ }
495
+ ```
496
+
497
+ And please also cite the underlying libraries:
498
+ - [annembed](https://github.com/jean-pierreBoth/annembed) for dimensionality reduction algorithms
499
+ - [hdbscan](https://github.com/tom-whitehead/hdbscan) for HDBSCAN clustering