clusterkit 0.3.0-aarch64-linux
- checksums.yaml +7 -0
- data/.rspec +3 -0
- data/.simplecov +47 -0
- data/CHANGELOG.md +35 -0
- data/CLAUDE.md +226 -0
- data/Cargo.lock +3228 -0
- data/Cargo.toml +8 -0
- data/Gemfile +17 -0
- data/IMPLEMENTATION_NOTES.md +143 -0
- data/LICENSE.txt +21 -0
- data/PYTHON_COMPARISON.md +183 -0
- data/README.md +744 -0
- data/Rakefile +259 -0
- data/docs/KNOWN_ISSUES.md +130 -0
- data/docs/RUST_ERROR_HANDLING.md +164 -0
- data/docs/TEST_FIXTURES.md +170 -0
- data/docs/UMAP_EXPLAINED.md +362 -0
- data/docs/UMAP_TROUBLESHOOTING.md +284 -0
- data/docs/VERBOSE_OUTPUT.md +84 -0
- data/docs/assets/clusterkit-wide.png +0 -0
- data/docs/assets/clusterkit.png +0 -0
- data/docs/assets/visualization.png +0 -0
- data/examples/hdbscan_example.rb +147 -0
- data/examples/optimal_kmeans_example.rb +96 -0
- data/examples/pca_example.rb +114 -0
- data/examples/reproducible_umap.rb +99 -0
- data/examples/verbose_control.rb +43 -0
- data/ext/clusterkit/Cargo.toml +26 -0
- data/ext/clusterkit/extconf.rb +23 -0
- data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +80 -0
- data/ext/clusterkit/src/clustering.rs +221 -0
- data/ext/clusterkit/src/embedder.rs +349 -0
- data/ext/clusterkit/src/hnsw.rs +579 -0
- data/ext/clusterkit/src/lib.rs +24 -0
- data/ext/clusterkit/src/svd.rs +89 -0
- data/ext/clusterkit/src/tests.rs +16 -0
- data/ext/clusterkit/src/utils.rs +183 -0
- data/lib/clusterkit/3.1/clusterkit.so +0 -0
- data/lib/clusterkit/3.2/clusterkit.so +0 -0
- data/lib/clusterkit/3.3/clusterkit.so +0 -0
- data/lib/clusterkit/3.4/clusterkit.so +0 -0
- data/lib/clusterkit/clustering/hdbscan.rb +164 -0
- data/lib/clusterkit/clustering.rb +194 -0
- data/lib/clusterkit/clusterkit.rb +14 -0
- data/lib/clusterkit/configuration.rb +24 -0
- data/lib/clusterkit/data_validator.rb +132 -0
- data/lib/clusterkit/dimensionality/pca.rb +251 -0
- data/lib/clusterkit/dimensionality/svd.rb +175 -0
- data/lib/clusterkit/dimensionality/umap.rb +282 -0
- data/lib/clusterkit/dimensionality.rb +29 -0
- data/lib/clusterkit/hdbscan_api_design.rb +142 -0
- data/lib/clusterkit/hnsw.rb +251 -0
- data/lib/clusterkit/preprocessing.rb +106 -0
- data/lib/clusterkit/silence.rb +42 -0
- data/lib/clusterkit/utils.rb +51 -0
- data/lib/clusterkit/version.rb +5 -0
- data/lib/clusterkit.rb +105 -0
- data/lib/tasks/visualize.rake +641 -0
- metadata +220 -0
data/README.md
ADDED
|
@@ -0,0 +1,744 @@
|
|
|
1
|
+
<img src="/docs/assets/clusterkit-wide.png" alt="clusterkit" height="80px">
|
|
2
|
+
|
|
3
|
+
A high-performance clustering and dimensionality reduction toolkit for Ruby, powered by best-in-class Rust implementations.
|
|
4
|
+
|
|
5
|
+
## 🙏 Acknowledgments & Attribution
|
|
6
|
+
|
|
7
|
+
ClusterKit builds upon excellent work from the Rust ecosystem:
|
|
8
|
+
|
|
9
|
+
- **[annembed](https://github.com/jean-pierreBoth/annembed)** - Provides the core UMAP, t-SNE, and other dimensionality reduction algorithms. Created by Jean-Pierre Both.
|
|
10
|
+
- **[hdbscan](https://github.com/tom-whitehead/hdbscan)** - Provides the HDBSCAN density-based clustering implementation. A Rust port of the original HDBSCAN algorithm.
|
|
11
|
+
|
|
12
|
+
This gem would not be possible without these foundational libraries. Please consider starring their repositories if you find ClusterKit useful.
|
|
13
|
+
|
|
14
|
+
## Features
|
|
15
|
+
|
|
16
|
+
- **Dimensionality Reduction Algorithms**:
  - UMAP (Uniform Manifold Approximation and Projection) - powered by annembed
  - PCA (Principal Component Analysis)
  - SVD (Singular Value Decomposition)

- **Advanced Clustering**:
  - K-means clustering with automatic k selection via elbow method
  - HDBSCAN (Hierarchical Density-Based Spatial Clustering) for density-based clustering with noise detection
  - Silhouette scoring for cluster quality evaluation

- **High Performance**:
  - Leverages Rust's speed and parallelization
  - Efficient memory usage
  - Support for large datasets

- **Easy to Use**:
  - Simple, scikit-learn-like API
  - Consistent interface across algorithms
  - Comprehensive documentation and examples

- **Visualization Tools**:
  - Interactive HTML visualizations
  - Comparison of different algorithms
  - Built-in rake tasks for quick experimentation
|
|
40
|
+
|
|
41
|
+
## API Structure
|
|
42
|
+
|
|
43
|
+
ClusterKit organizes its functionality into clear modules:
|
|
44
|
+
|
|
45
|
+
- **`ClusterKit::Dimensionality`** - All dimensionality reduction algorithms
  - `ClusterKit::Dimensionality::UMAP` - UMAP implementation
  - `ClusterKit::Dimensionality::PCA` - PCA implementation
  - `ClusterKit::Dimensionality::SVD` - SVD implementation
- **`ClusterKit::Clustering`** - All clustering algorithms
  - `ClusterKit::Clustering::KMeans` - K-means clustering
  - `ClusterKit::Clustering::HDBSCAN` - HDBSCAN clustering
- **`ClusterKit::Utils`** - Utility functions
- **`ClusterKit::Preprocessing`** - Data preprocessing tools
|
|
54
|
+
|
|
55
|
+
All user-facing classes are in these modules. Implementation details are kept private.
|
|
56
|
+
|
|
57
|
+
## Installation
|
|
58
|
+
|
|
59
|
+
Add this line to your application's Gemfile:

```ruby
gem 'clusterkit'
```

And then execute:

    $ bundle install

Or install it yourself as:

    $ gem install clusterkit
|
|
72
|
+
|
|
73
|
+
### Prerequisites
|
|
74
|
+
|
|
75
|
+
- Ruby 2.7 or higher
|
|
76
|
+
- Rust toolchain (for building from source)
|
|
77
|
+
|
|
78
|
+
## Quick Start - Interactive Example
|
|
79
|
+
|
|
80
|
+
Copy and paste this **entire block** into IRB to try out the main features (including the srand line for reproducible results):
|
|
81
|
+
|
|
82
|
+
```ruby
require 'clusterkit'

# Generate sample high-dimensional data with structure
# This simulates real-world data like text embeddings or image features
puts "Creating sample data: 100 points in 50 dimensions with 3 clusters"

# Use a fixed seed for reproducibility in this example
# Important: Random data without structure can cause UMAP errors
srand(42)

# Create data with some inherent structure (3 clusters)
# Using better separated clusters to avoid UMAP convergence issues
data = []
3.times do |cluster|
  # Each cluster has a different center, well-separated
  center = Array.new(50) { rand * 0.1 + cluster * 2.0 }

  # Add 33 points around each center with controlled noise
  33.times do
    point = center.map { |c| c + (rand - 0.5) * 0.3 }
    data << point
  end
end

# Add one more point to make it 100
data << Array.new(50) { rand * 6.0 }  # Scale to match cluster range

# ============================================================
# 1. DIMENSIONALITY REDUCTION - Visualize high-dim data in 2D
# ============================================================

puts "\n1. DIMENSIONALITY REDUCTION:"

# UMAP - Best for preserving both local and global structure
puts "Running UMAP..."
# Note: Using n_neighbors=5 for better stability with varied data
# Lower n_neighbors helps avoid "isolated point" errors
umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 5)
umap_result = umap.fit_transform(data)
puts " ✓ Reduced to #{umap_result.first.size}D: #{umap_result[0..2].map { |p| p.map { |v| v.round(3) } }}"

# PCA - Fast linear reduction, good for finding main variations
puts "Running PCA..."
pca = ClusterKit::Dimensionality::PCA.new(n_components: 2)
pca_result = pca.fit_transform(data)
puts " ✓ Reduced to #{pca_result.first.size}D: #{pca_result[0..2].map { |p| p.map { |v| v.round(3) } }}"
puts " ✓ Explained variance: #{(pca.explained_variance_ratio.sum * 100).round(1)}%"

# ============================================================
# 2. CLUSTERING - Find groups in your data
# ============================================================

puts "\n2. CLUSTERING:"

# K-means - When you know roughly how many clusters to expect
puts "Running K-means..."
# First, find optimal k using elbow method
elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(umap_result, k_range: 2..6)
optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)
puts " ✓ Optimal k detected: #{optimal_k}"

kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k)
kmeans_labels = kmeans.fit_predict(umap_result)
puts " ✓ Found #{kmeans_labels.uniq.size} clusters"

# HDBSCAN - When you don't know the number of clusters and have noise
puts "Running HDBSCAN..."
hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5, min_cluster_size: 10)
hdbscan_labels = hdbscan.fit_predict(umap_result)
puts " ✓ Found #{hdbscan.n_clusters} clusters"
puts " ✓ Identified #{hdbscan.n_noise_points} noise points (#{(hdbscan.noise_ratio * 100).round(1)}%)"

# ============================================================
# 3. EVALUATION - How good are the clusters?
# ============================================================

puts "\n3. CLUSTER EVALUATION:"
silhouette = ClusterKit::Clustering.silhouette_score(umap_result, kmeans_labels)
puts " K-means silhouette score: #{silhouette.round(3)} (closer to 1 is better)"

# Filter noise for HDBSCAN evaluation
non_noise = hdbscan_labels.each_with_index.select { |l, _| l != -1 }.map(&:last)
if non_noise.any?
  filtered_data = non_noise.map { |i| umap_result[i] }
  filtered_labels = non_noise.map { |i| hdbscan_labels[i] }
  hdbscan_silhouette = ClusterKit::Clustering.silhouette_score(filtered_data, filtered_labels)
  puts " HDBSCAN silhouette score: #{hdbscan_silhouette.round(3)} (excluding noise)"
end

puts "\n✅ All done! Try visualizing with: rake clusterkit:visualize"
```
|
|
174
|
+
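For intuition about the silhouette scores printed in step 3, the metric can be sketched in plain Ruby. This follows the usual textbook definition and is not ClusterKit's own code; the `euclidean` and `silhouette` helpers and the four sample points are made up for illustration.

```ruby
# Illustrative only: textbook silhouette score, not ClusterKit's implementation.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def silhouette(points, labels)
  clusters = labels.each_index.group_by { |i| labels[i] }
  raise ArgumentError, "need at least two clusters" if clusters.size < 2

  scores = points.each_index.map do |i|
    own = clusters[labels[i]] - [i]
    next 0.0 if own.empty? # convention: a singleton cluster scores 0

    # a: mean distance to the other members of the point's own cluster
    a = own.sum { |j| euclidean(points[i], points[j]) } / own.size
    # b: smallest mean distance to the members of any other cluster
    others = clusters.reject { |label, _| label == labels[i] }
    b = others.map { |_, idx| idx.sum { |j| euclidean(points[i], points[j]) } / idx.size }.min
    (b - a) / [a, b].max
  end
  scores.sum / scores.size
end

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
score = silhouette(points, labels)
puts score.round(3)
```

Tight, well-separated clusters like these score close to 1; overlapping clusters drift toward 0 or below.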
|
|
175
|
+
## Detailed Usage
|
|
176
|
+
|
|
177
|
+
### API Structure
|
|
178
|
+
|
|
179
|
+
ClusterKit organizes its algorithms into logical modules:
|
|
180
|
+
|
|
181
|
+
- **`ClusterKit::Dimensionality`** - Algorithms for reducing data dimensions
  - `UMAP` - Non-linear manifold learning
  - `PCA` - Principal Component Analysis
  - `SVD` - Singular Value Decomposition

- **`ClusterKit::Clustering`** - Algorithms for grouping data
  - `KMeans` - Partition-based clustering
  - `HDBSCAN` - Density-based clustering with noise detection
|
|
189
|
+
|
|
190
|
+
### Dimensionality Reduction
|
|
191
|
+
|
|
192
|
+
#### UMAP (Uniform Manifold Approximation and Projection)
|
|
193
|
+
|
|
194
|
+
**Important**: UMAP requires data with some structure. Pure random data may fail.
|
|
195
|
+
For testing, create data with patterns (see Quick Start example above).
|
|
196
|
+
|
|
197
|
+
```ruby
# Generate sample data with structure (not pure random)
data = []
100.times do |i|
  # Create points that cluster in groups
  group = i / 25  # 4 groups
  base = group * 2.0
  point = Array.new(50) { |j| base + rand * 0.8 + j * 0.01 }
  data << point
end

# Create UMAP instance
umap = ClusterKit::Dimensionality::UMAP.new(
  n_components: 2,        # Target dimensions (default: 2)
  n_neighbors: 5,         # Number of neighbors (default: 15, use 5 for small datasets)
  random_seed: 42,        # For reproducibility (default: nil for best performance)
  nb_grad_batch: 10,      # Gradient descent batches (default: 10, lower = faster)
  nb_sampling_by_edge: 8  # Negative samples per edge (default: 8, lower = faster)
)

# Fit and transform data
embedded = umap.fit_transform(data)

# IMPORTANT: Seed behavior
# - WITH random_seed: Fully reproducible results using serial processing (slower)
# - WITHOUT random_seed: Faster parallel processing but non-deterministic results

# Or fit once and transform multiple datasets
# Example: Split your data into training and test sets
# Note: Using structured data (not pure random) for better UMAP results
all_data = []
200.times do |i|
  # Create data with some structure - points cluster around different regions
  center = (i / 50) * 2.0  # 4 rough groups
  point = Array.new(50) { |j| center + rand * 0.5 + j * 0.01 }
  all_data << point
end

training_data = all_data[0...150]  # First 150 samples for training
test_data = all_data[150..-1]      # Last 50 samples for testing

umap.fit(training_data)
test_embedded = umap.transform(test_data)

# Save and load fitted models
umap.save_model("umap_model.bin")  # Save the fitted model
loaded_umap = ClusterKit::Dimensionality::UMAP.load_model("umap_model.bin")  # Load it later
new_data_embedded = loaded_umap.transform(test_data)  # Use the loaded model on unseen data

# Save and load transformed data (useful for caching results)
ClusterKit::Dimensionality::UMAP.save_data(embedded, "embeddings.json")
cached_embeddings = ClusterKit::Dimensionality::UMAP.load_data("embeddings.json")

# Note: The library automatically adjusts n_neighbors if it's too large for your dataset
```
|
|
252
|
+
|
|
253
|
+
#### PCA (Principal Component Analysis)
|
|
254
|
+
|
|
255
|
+
```ruby
pca = ClusterKit::Dimensionality::PCA.new(n_components: 2)
transformed = pca.fit_transform(data)

# Access explained variance
puts "Explained variance ratio: #{pca.explained_variance_ratio}"
puts "Cumulative explained variance: #{pca.cumulative_explained_variance_ratio}"

# Inverse transform to reconstruct original data
reconstructed = pca.inverse_transform(transformed)
```
|
|
266
|
+
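A common use of the explained-variance ratios is deciding how many components to keep. The sketch below assumes the ratios arrive as a plain array of floats, as the `.sum` call in the Quick Start suggests; the sample numbers and the `components_for` helper are invented for illustration.

```ruby
# Hypothetical ratios, standing in for pca.explained_variance_ratio
ratios = [0.62, 0.21, 0.09, 0.05, 0.03]

# Running total of variance explained by the first 1, 2, 3, ... components
cumulative = ratios.each_with_object([]) { |r, acc| acc << (acc.last || 0.0) + r }

# Smallest number of components whose cumulative ratio reaches the target;
# falls back to all components if the target is never reached
def components_for(cumulative, target)
  idx = cumulative.index { |c| c >= target }
  idx ? idx + 1 : cumulative.size
end

puts cumulative.map { |c| c.round(2) }.inspect
puts components_for(cumulative, 0.90)
```

With these made-up ratios, three components are enough to pass the 90% mark.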
|
|
267
|
+
|
|
268
|
+
#### SVD (Singular Value Decomposition)
|
|
269
|
+
|
|
270
|
+
```ruby
# Direct SVD decomposition using the class interface
svd = ClusterKit::Dimensionality::SVD.new(n_components: 10, n_iter: 5)
u, s, vt = svd.fit_transform(data)

# U: left singular vectors (documents in LSA)
# S: singular values (importance of each component)
# V^T: right singular vectors (terms in LSA)

puts "Shape of U: #{u.size}x#{u.first.size}"
puts "Singular values: #{s[0..4].map { |v| v.round(2) }}"
puts "Shape of V^T: #{vt.size}x#{vt.first.size}"

# For dimensionality reduction, use U * S (scale each column of U by its singular value)
reduced = u.map do |row|
  row.map.with_index { |val, j| val * s[j] }
end

# Or use the convenience method
u, s, vt = ClusterKit.svd(data, 10, n_iter: 5)
```
|
|
291
|
+
|
|
292
|
+
|
|
293
|
+
### Clustering
|
|
294
|
+
|
|
295
|
+
#### K-means with Automatic K Selection
|
|
296
|
+
|
|
297
|
+
```ruby
# Find optimal number of clusters
elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(data, k_range: 2..10)
optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)

# Cluster with optimal k
kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k, random_seed: 42)
labels = kmeans.fit_predict(data)

# Access cluster centers
centers = kmeans.cluster_centers
```
|
|
309
|
+
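How `detect_optimal_k` picks its value is not spelled out here. One common elbow heuristic is to choose the k where the inertia curve bends most sharply, measured by the largest second difference. The sketch below assumes the scores form a Hash from k to inertia, which is a guess about the return shape; the `elbow_k` helper and the sample numbers are hypothetical.

```ruby
# Illustrative elbow heuristic; not necessarily what ClusterKit does internally.
# Assumes scores map each k to its inertia (within-cluster sum of squares).
def elbow_k(scores)
  ks = scores.keys.sort
  return ks.first if ks.size < 3

  # Second difference of inertia at each interior k: a large value means a sharp bend
  bends = ks.each_cons(3).map do |prev_k, k, next_k|
    [k, scores[prev_k] - 2 * scores[k] + scores[next_k]]
  end
  bends.max_by { |_, bend| bend }.first
end

sample = { 2 => 900.0, 3 => 310.0, 4 => 280.0, 5 => 260.0, 6 => 245.0 }
puts elbow_k(sample) # 3: inertia drops sharply up to k=3, then flattens
```

The heuristic can only return interior values of the range, so it is worth scanning a range that extends past the k you expect.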
|
|
310
|
+
#### HDBSCAN (Density-Based Clustering)
|
|
311
|
+
|
|
312
|
+
```ruby
# HDBSCAN automatically determines the number of clusters
# and can identify noise points
hdbscan = ClusterKit::Clustering::HDBSCAN.new(
  min_samples: 5,        # Minimum samples in neighborhood
  min_cluster_size: 10,  # Minimum cluster size
  metric: 'euclidean'    # Distance metric
)

labels = hdbscan.fit_predict(data)

# Noise points are labeled as -1
puts "Clusters found: #{hdbscan.n_clusters}"
puts "Noise points: #{hdbscan.n_noise_points} (#{(hdbscan.noise_ratio * 100).round(1)}%)"

# Access additional HDBSCAN information
probabilities = hdbscan.probabilities    # Cluster membership probabilities
outlier_scores = hdbscan.outlier_scores  # Outlier scores for each point
```
|
|
331
|
+
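Because noise is encoded as `-1` in the returned labels, the same counts can be derived from any label array, for example one loaded from a cache. A minimal sketch in plain Ruby; `summarize_labels` is a hypothetical helper and the labels are sample values.

```ruby
# Summarise a label array that uses -1 for noise, as described above.
def summarize_labels(labels)
  noise = labels.count(-1)
  sizes = labels.reject { |l| l == -1 }.tally # cluster id => member count
  {
    n_clusters: sizes.size,
    n_noise: noise,
    noise_ratio: labels.empty? ? 0.0 : noise.to_f / labels.size,
    cluster_sizes: sizes
  }
end

labels = [0, 0, 1, -1, 1, 1, -1, 0]
summary = summarize_labels(labels)
puts summary
```

`Array#tally` needs Ruby 2.7 or newer, which matches the prerequisite listed above.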
|
|
332
|
+
### HNSW - Fast Nearest Neighbor Search
|
|
333
|
+
|
|
334
|
+
ClusterKit includes HNSW (Hierarchical Navigable Small World) for fast approximate nearest neighbor search, useful for building recommendation systems, similarity search, and as a building block for other algorithms.
|
|
335
|
+
|
|
336
|
+
Copy and paste this **entire block** into IRB to try HNSW with real embeddings:
|
|
337
|
+
|
|
338
|
+
```ruby
require 'clusterkit'
require 'candle'  # provided by the red-candle gem

# Step 1: Initialize the embedding model
puts "Loading embedding model..."
embedding_model = Candle::EmbeddingModel.from_pretrained(
  'sentence-transformers/all-MiniLM-L6-v2',
  device: Candle::Device.best
)
puts " ✓ Model loaded: #{embedding_model.model_id}"

# Step 2: Create sample documents for semantic search
documents = [
  "The cat sat on the mat",
  "Dogs are loyal pets that love their owners",
  "Machine learning algorithms can classify text documents",
  "Natural language processing helps computers understand human language",
  "Ruby is a programming language known for its simplicity",
  "Python is popular for data science and machine learning",
  "The weather today is sunny and warm",
  "Climate change affects global weather patterns",
  "Artificial intelligence is transforming many industries",
  "Deep learning models require large amounts of training data",
  "Cats and dogs are common household pets",
  "Software engineering requires problem-solving skills",
  "The ocean contains many different species of fish",
  "Marine biology studies life in aquatic environments",
  "Cooking requires understanding of ingredients and techniques"
]

puts "\nGenerating embeddings for #{documents.size} documents..."

# Step 3: Generate embeddings for all documents
embeddings = documents.map do |doc|
  embedding_model.embedding(doc).first.to_a
end
puts " ✓ Generated embeddings: #{embeddings.first.count} dimensions each"

# Step 4: Create HNSW index
puts "\nBuilding HNSW search index..."
index = ClusterKit::HNSW.new(
  dim: embeddings.first.count,  # 384 dimensions for all-MiniLM-L6-v2
  space: :euclidean,
  m: 16,                        # Good balance of speed vs accuracy
  ef_construction: 200,         # Build quality
  max_elements: documents.size,
  random_seed: 42               # For reproducible results
)

# Step 5: Add all documents to the index
documents.each_with_index do |doc, i|
  index.add_item(
    embeddings[i],
    label: "doc_#{i}",
    metadata: {
      'text' => doc,
      'length' => doc.length,
      'word_count' => doc.split.size
    }
  )
end
puts " ✓ Added #{documents.size} documents to index"

# Step 6: Perform semantic searches
puts "\n" + "="*50
puts "SEMANTIC SEARCH DEMO"
puts "="*50

queries = [
  "pets and animals",
  "computer programming",
  "weather and environment"
]

queries.each do |query|
  puts "\nQuery: '#{query}'"
  puts "-" * 30

  # Generate query embedding
  query_embedding = embedding_model.embedding(query).first.to_a

  # Search for similar documents
  results = index.search_with_metadata(query_embedding, k: 3)

  results.each_with_index do |result, i|
    # Rough score only: Euclidean distance is unbounded, so this can be negative
    similarity = (1.0 - result[:distance]).round(3)
    text = result[:metadata]['text']
    puts " #{i+1}. [#{similarity}] #{text}"
  end
end

# Step 7: Demonstrate advanced features
puts "\n" + "="*50
puts "ADVANCED FEATURES"
puts "="*50

# Show search quality adjustment
puts "\nAdjusting search quality (ef parameter):"
index.set_ef(50)  # Lower ef = faster but potentially less accurate
fast_results = index.search(embeddings[0], k: 3)
puts " Fast search (ef=50): #{fast_results}"

index.set_ef(200)  # Higher ef = slower but more accurate
accurate_results = index.search(embeddings[0], k: 3)
puts " Accurate search (ef=200): #{accurate_results}"

# Show batch operations
puts "\nBatch search example:"
query_embeddings = [embeddings[0], embeddings[5], embeddings[10]]
batch_results = query_embeddings.map { |emb| index.search(emb, k: 2) }
puts " Found #{batch_results.size} result sets"

# Save and load demonstration
puts "\nSaving and loading index:"
index.save('demo_index')
puts " ✓ Index saved to 'demo_index'"

loaded_index = ClusterKit::HNSW.load('demo_index')
test_results = loaded_index.search(embeddings[0], k: 2)
puts " ✓ Loaded index works: #{test_results}"

puts "\n✅ HNSW demo complete!"
puts "\nTry your own queries by running:"
puts "query_embedding = embedding_model.embedding('your search query').first.to_a"
puts "results = index.search_with_metadata(query_embedding, k: 5)"
```
|
|
465
|
+
|
|
466
|
+
#### When to Use HNSW
|
|
467
|
+
|
|
468
|
+
HNSW is ideal for:
|
|
469
|
+
- **Recommendation Systems**: Find similar items/users quickly
|
|
470
|
+
- **Semantic Search**: Find documents with similar embeddings
|
|
471
|
+
- **Duplicate Detection**: Identify near-duplicate content
|
|
472
|
+
- **Clustering Support**: As a fast neighbor graph for HDBSCAN
|
|
473
|
+
- **Real-time Applications**: When you need sub-millisecond search times
|
|
474
|
+
|
|
475
|
+
#### Configuration Guidelines
|
|
476
|
+
|
|
477
|
+
```ruby
# High recall (>0.95) - Best quality, slower
ClusterKit::HNSW.new(
  dim: dim,
  m: 32,
  ef_construction: 400
).tap { |idx| idx.set_ef(100) }

# Balanced (>0.90 recall) - Good quality, fast
ClusterKit::HNSW.new(
  dim: dim,
  m: 16,
  ef_construction: 200
).tap { |idx| idx.set_ef(50) }

# Speed optimized (>0.85 recall) - Fastest, acceptable quality
ClusterKit::HNSW.new(
  dim: dim,
  m: 8,
  ef_construction: 100
).tap { |idx| idx.set_ef(20) }
```
|
|
499
|
+
|
|
500
|
+
#### Important Notes
|
|
501
|
+
|
|
502
|
+
1. **Memory Usage**: HNSW keeps the entire index in memory. Estimate: `(num_items * (dim * 4 + m * 16))` bytes
|
|
503
|
+
2. **Distance Metrics**: Currently only Euclidean distance is fully supported
|
|
504
|
+
3. **Loading Behavior**: Due to Rust lifetime constraints, loading an index creates a small memory leak (the index metadata persists until program exit). This is typically negligible for most applications.
|
|
505
|
+
4. **Build Time**: Index construction is O(N * log(N)). For large datasets (>1M items), consider building offline
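The memory estimate in note 1 is easy to turn into a quick capacity check. The sketch below applies that formula as written; `hnsw_memory_bytes` is a hypothetical helper, and the result is only as accurate as the estimate itself.

```ruby
# Rough index size from the estimate above: num_items * (dim * 4 + m * 16) bytes
def hnsw_memory_bytes(num_items:, dim:, m:)
  num_items * (dim * 4 + m * 16)
end

bytes = hnsw_memory_bytes(num_items: 1_000_000, dim: 384, m: 16)
puts "#{bytes} bytes (~#{(bytes / 1024.0**3).round(2)} GiB)"
# prints "1792000000 bytes (~1.67 GiB)"
```

For one million 384-dimensional vectors at `m: 16`, the vector data (`dim * 4`) dominates the per-item link overhead (`m * 16`).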
|
|
506
|
+
|
|
507
|
+
#### Example: Semantic Search System
|
|
508
|
+
|
|
509
|
+
```ruby
# Build a simple semantic search system
# load_documents, generate_embeddings and generate_embedding are placeholders
# for your own code (use red-candle or similar to produce the embeddings)
documents = load_documents
embeddings = generate_embeddings(documents)

# Build search index
search_index = ClusterKit::HNSW.new(
  dim: embeddings.first.size,
  m: 16,
  ef_construction: 200,
  max_elements: documents.size
)

# Add all documents
documents.each_with_index do |doc, i|
  search_index.add_item(
    embeddings[i],
    label: i,
    # String keys, matching the lookups in the search function below
    metadata: { 'title' => doc[:title], 'url' => doc[:url] }
  )
end

# Search function
def search(query, index, k: 10)
  query_embedding = generate_embedding(query)
  results = index.search_with_metadata(query_embedding, k: k)

  results.map do |result|
    {
      title: result[:metadata]['title'],
      url: result[:metadata]['url'],
      # Rough score only: Euclidean distance is unbounded, so this can be negative
      similarity: 1.0 - result[:distance]
    }
  end
end

# Save for later use
search_index.save('document_index')
```
|
|
548
|
+
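Both HNSW examples turn a distance into a score with `1.0 - distance`. With Euclidean distance that value is not a true similarity. If the embeddings are unit-normalized (an assumption; check your embedding model) and the index reports plain rather than squared Euclidean distance (also an assumption), cosine similarity can be recovered exactly, because for unit vectors the squared distance equals 2 - 2·cos. The helper names below are hypothetical.

```ruby
# Scale a vector to unit length
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# For unit vectors: |a - b|^2 = 2 - 2 * cos(a, b), so cos = 1 - d^2 / 2
def cosine_from_euclidean(distance)
  1.0 - (distance**2) / 2.0
end

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

direct = a.zip(b).sum { |x, y| x * y }              # dot product of unit vectors
recovered = cosine_from_euclidean(euclidean(a, b))  # same value via distance

puts "direct:    #{direct.round(6)}"
puts "recovered: #{recovered.round(6)}"
```

The recovered score stays within [-1, 1], which makes results easier to compare across queries than the raw `1.0 - distance` value.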
|
|
549
|
+
### Visualization
|
|
550
|
+
|
|
551
|
+
ClusterKit includes a built-in visualization tool:
|
|
552
|
+
|
|
553
|
+
```bash
# Generate interactive visualization
rake clusterkit:visualize

# With options
rake clusterkit:visualize[output.html,iris,both]  # filename, dataset, clustering method

# Dataset options: clusters, swiss, iris
# Clustering options: kmeans, hdbscan, both
```
|
|
563
|
+
|
|
564
|
+
This creates an interactive HTML file with:
|
|
565
|
+
- Side-by-side comparison of dimensionality reduction methods
|
|
566
|
+
- Clustering results visualization
|
|
567
|
+
- Performance metrics
|
|
568
|
+
- Interactive Plotly.js charts
|
|
569
|
+
|
|
570
|
+
<img src="/docs/assets/visualization.png" alt="rake clusterkit:visualize">
|
|
571
|
+
|
|
572
|
+
|
|
573
|
+
## Choosing the Right Algorithm
|
|
574
|
+
|
|
575
|
+
### Dimensionality Reduction
|
|
576
|
+
|
|
577
|
+
| Algorithm | Best For | Pros | Cons |
|-----------|----------|------|------|
| **UMAP** | General purpose, preserving both local and global structure | Fast, scalable, supports transform() | Requires tuning parameters |
| **PCA** | Linear relationships, feature extraction | Very fast, interpretable, deterministic | Only captures linear relationships |
| **SVD** | Text analysis (LSA), recommendation systems | Memory efficient, good for sparse data | Only linear relationships |
|
|
582
|
+
|
|
583
|
+
### Clustering
|
|
584
|
+
|
|
585
|
+
| Algorithm | Best For | Pros | Cons |
|-----------|----------|------|------|
| **K-means** | Spherical clusters, known cluster count | Fast, simple, deterministic with seed | Requires knowing k, assumes spherical clusters |
| **HDBSCAN** | Unknown cluster count, irregular shapes, noise | Finds clusters automatically, handles noise | More complex parameters, slower than k-means |
|
|
589
|
+
|
|
590
|
+
### Recommended Combinations
|
|
591
|
+
|
|
592
|
+
- **Document Clustering**: UMAP (20D) → HDBSCAN
|
|
593
|
+
- **Image Clustering**: PCA (50D) → K-means
|
|
594
|
+
- **Customer Segmentation**: UMAP (10D) → K-means with elbow method
|
|
595
|
+
- **Anomaly Detection**: UMAP (5D) → HDBSCAN (outliers are noise points)
|
|
596
|
+
- **Visualization**: UMAP (2D) or PCA (2D) → visual inspection
|
|
597
|
+
|
|
598
|
+
## Advanced Examples
|
|
599
|
+
|
|
600
|
+
### Document Clustering Pipeline
|
|
601
|
+
|
|
602
|
+
```ruby
# Typical NLP workflow: embed → reduce → cluster
documents = ["text1", "text2", ...]  # Your documents

# Step 1: Get embeddings (use your favorite embedding model)
# embeddings = get_embeddings(documents)  # e.g., from red-candle

# Step 2: Reduce dimensions for better clustering
umap = ClusterKit::Dimensionality::UMAP.new(n_components: 20, n_neighbors: 10)
reduced_embeddings = umap.fit_transform(embeddings)

# Step 3: Find clusters
hdbscan = ClusterKit::Clustering::HDBSCAN.new(
  min_samples: 5,
  min_cluster_size: 10
)
clusters = hdbscan.fit_predict(reduced_embeddings)

# Step 4: Analyze results
clusters.each_with_index do |cluster_id, doc_idx|
  next if cluster_id == -1  # Skip noise
  puts "Document '#{documents[doc_idx]}' belongs to cluster #{cluster_id}"
end
```
|
|
626
|
+
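Step 4 prints one line per document. For reviewing results it is often handier to group documents by cluster first. The sketch below uses only plain Ruby; the `documents` and `clusters` arrays are made-up stand-ins for the pipeline's inputs and for the labels returned by `fit_predict`.

```ruby
# Hypothetical inputs standing in for the pipeline's documents and labels
documents = ["cats purr", "dogs bark", "rust is fast", "ruby is fun", "asdf qwerty"]
clusters  = [0, 0, 1, 1, -1]

grouped = documents.zip(clusters)
                   .reject { |_, id| id == -1 }   # drop noise points
                   .group_by { |_, id| id }       # cluster id => [[doc, id], ...]
                   .transform_values { |pairs| pairs.map(&:first) }

grouped.each do |id, docs|
  puts "Cluster #{id} (#{docs.size} docs): #{docs.join(' | ')}"
end
```

Noise points are dropped here; keep them in a separate list if you also want to inspect outliers.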
|
|
627
|
+
### Model Persistence
|
|
628
|
+
|
|
629
|
+
```ruby
# Save trained model (same methods as in the UMAP usage section)
umap.save_model("model.bin")

# Load trained model
loaded_umap = ClusterKit::Dimensionality::UMAP.load_model("model.bin")
result = loaded_umap.transform(new_data)
```
|
|
637
|
+
|
|
638
|
+
## Performance Tips
|
|
639
|
+
|
|
640
|
+
1. **Large Datasets**: Use sampling for initial parameter tuning
|
|
641
|
+
2. **HDBSCAN**: Reduce to 10-50 dimensions with UMAP first for better results
|
|
642
|
+
3. **Memory**: Process in batches for very large datasets
|
|
643
|
+
4. **Speed**: Compile with optimizations: `RUSTFLAGS="-C target-cpu=native" bundle install`
|
|
644
|
+
|
|
645
|
+
### UMAP Reproducibility vs Performance
|
|
646
|
+
|
|
647
|
+
ClusterKit's UMAP implementation offers two modes:
|
|
648
|
+
|
|
649
|
+
| Mode | Usage | Performance | Reproducibility |
|------|-------|-------------|-----------------|
| **Fast (default)** | `UMAP.new()` | Parallel processing, ~25-35% faster | Non-deterministic |
| **Reproducible** | `UMAP.new(random_seed: 42)` | Serial processing | Fully deterministic |
|
|
653
|
+
|
|
654
|
+
**When to use each mode:**
|
|
655
|
+
- **Production/Analysis**: Use default (no seed) for best performance when exact reproducibility isn't critical
|
|
656
|
+
- **Research/Testing**: Use a seed when you need reproducible results for comparisons or debugging
|
|
657
|
+
- **CI/Testing**: Always use a seed to ensure consistent test results
|
|
658
|
+
|
|
659
|
+
**Note**: The `transform` method is always deterministic once a model is fitted, regardless of seed usage during training.
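
One way to apply this guidance is to choose the mode in a single place. The sketch below is plain Ruby and assumes a hypothetical `umap_options` helper; checking the `CI` environment variable is also an assumption about your setup. Only the `random_seed:` and `n_neighbors:` options come from this README.

```ruby
# Hypothetical helper: builds keyword options for UMAP.new.
# Adds a seed only when reproducible results are requested.
def umap_options(reproducible: ENV.key?("CI"), seed: 42, **extra)
  options = extra.dup
  options[:random_seed] = seed if reproducible
  options
end

fast_opts = umap_options(reproducible: false, n_neighbors: 10)
test_opts = umap_options(reproducible: true, n_neighbors: 10)

# With the gem loaded, the options would be splatted into the constructor:
#   ClusterKit::Dimensionality::UMAP.new(**test_opts)
puts fast_opts.inspect
puts test_opts.inspect
```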

## Troubleshooting

### UMAP "isolated point" or "graph not connected" errors

These errors occur when UMAP cannot find enough neighbors for some points. Possible solutions:

1. **Reduce n_neighbors**: Use a smaller value (e.g., 5 instead of 15)
   ```ruby
   umap = ClusterKit::Dimensionality::UMAP.new(n_neighbors: 5)
   ```

2. **Add structure to your data**: Completely random data may not work well
   ```ruby
   # Bad: Pure random data with no structure
   data = Array.new(100) { Array.new(50) { rand } }

   # Good: Data with clusters or patterns (see Quick Start example)
   # Create clusters with centers and add points around them
   ```

3. **Ensure sufficient data points**: UMAP needs at least n_neighbors + 1 points

4. **Use consistent data generation**: For examples/testing, use a fixed seed
   ```ruby
   srand(42)  # Ensures reproducible data generation
   ```

Note: Real-world embeddings (from text, images, etc.) typically have inherent structure and work better than random data.
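
The "Good" case from item 2 and the size check from item 3 can be sketched in plain Ruby as follows. The helper names `clustered_data` and `enough_points?`, the uniform noise model, and the default `spread` are illustrative assumptions, not part of ClusterKit or its Quick Start.

```ruby
# Generate points scattered uniformly around a few random centers.
# A local Random instance keeps the output repeatable (item 4).
def clustered_data(n_clusters:, points_per_cluster:, dims:, spread: 0.05, seed: 42)
  rng = Random.new(seed)
  centers = Array.new(n_clusters) { Array.new(dims) { rng.rand } }
  centers.flat_map do |center|
    Array.new(points_per_cluster) do
      center.map { |coord| coord + (rng.rand - 0.5) * 2 * spread }
    end
  end
end

# UMAP needs at least n_neighbors + 1 points (item 3).
def enough_points?(data, n_neighbors)
  data.size >= n_neighbors + 1
end

data = clustered_data(n_clusters: 3, points_per_cluster: 30, dims: 50)
n_neighbors = 5
raise ArgumentError, "need at least #{n_neighbors + 1} points" unless enough_points?(data, n_neighbors)

puts "Generated #{data.size} points with #{data.first.size} dimensions"
```

Checking the point count before fitting turns a confusing graph error into an explicit message about the input.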

### Memory issues with large datasets

- Process in batches for datasets > 100k points
- Use PCA to reduce dimensions before UMAP
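
Batching can be sketched with `each_slice`. The example below assumes a model that has already been fitted and that responds to `transform`, returning one output row per input row. `transform_in_batches` and the `FirstTwoDims` stand-in are hypothetical names used so the sketch runs without the gem.

```ruby
# Transform `points` in fixed-size batches and concatenate the results.
# `model` is any object that responds to #transform, such as a fitted UMAP.
def transform_in_batches(model, points, batch_size: 10_000)
  points.each_slice(batch_size).flat_map { |batch| model.transform(batch) }
end

# Stand-in model for illustration: keeps the first two coordinates.
class FirstTwoDims
  def transform(batch)
    batch.map { |point| point.first(2) }
  end
end

points  = Array.new(25) { |i| [i, i * 2, i * 3] }
reduced = transform_in_batches(FirstTwoDims.new, points, batch_size: 10)
puts "Transformed #{reduced.size} points in batches of 10"
```

This only bounds the size of each `transform` call. The input array still lives in memory, so very large datasets may also need to be read from disk in chunks.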

### Installation issues

- Ensure Rust is installed: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
- For M1/M2 Macs, ensure you have the latest Xcode command line tools
- Clear the build cache if needed: `bundle exec rake clean`

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests.

To install this gem onto your local machine, run `bundle exec rake install`.

## Testing

```bash
# Run all tests
bundle exec rspec

# Run specific test file
bundle exec rspec spec/clusterkit/clustering_spec.rb

# Run with coverage
COVERAGE=true bundle exec rspec
```

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/clusterkit.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

## Citation

If you use ClusterKit in your research, please cite:

```
@software{clusterkit,
  author = {Chris Petersen},
  title = {ClusterKit: High-Performance Clustering and Dimensionality Reduction for Ruby},
  year = {2024},
  url = {https://github.com/scientist-labs/clusterkit}
}
```

And please also cite the underlying libraries:
- [annembed](https://github.com/jean-pierreBoth/annembed) for dimensionality reduction algorithms
- [hdbscan](https://github.com/petabi/hdbscan) for HDBSCAN clustering