clusterkit 0.1.0.pre.1

Files changed (49)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.simplecov +47 -0
  4. data/CHANGELOG.md +35 -0
  5. data/CLAUDE.md +226 -0
  6. data/Cargo.toml +8 -0
  7. data/Gemfile +17 -0
  8. data/IMPLEMENTATION_NOTES.md +143 -0
  9. data/LICENSE.txt +21 -0
  10. data/PYTHON_COMPARISON.md +183 -0
  11. data/README.md +499 -0
  12. data/Rakefile +245 -0
  13. data/clusterkit.gemspec +45 -0
  14. data/docs/KNOWN_ISSUES.md +130 -0
  15. data/docs/RUST_ERROR_HANDLING.md +164 -0
  16. data/docs/TEST_FIXTURES.md +170 -0
  17. data/docs/UMAP_EXPLAINED.md +362 -0
  18. data/docs/UMAP_TROUBLESHOOTING.md +284 -0
  19. data/docs/VERBOSE_OUTPUT.md +84 -0
  20. data/examples/hdbscan_example.rb +147 -0
  21. data/examples/optimal_kmeans_example.rb +96 -0
  22. data/examples/pca_example.rb +114 -0
  23. data/examples/reproducible_umap.rb +99 -0
  24. data/examples/verbose_control.rb +43 -0
  25. data/ext/clusterkit/Cargo.toml +25 -0
  26. data/ext/clusterkit/extconf.rb +4 -0
  27. data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
  28. data/ext/clusterkit/src/clustering.rs +267 -0
  29. data/ext/clusterkit/src/embedder.rs +413 -0
  30. data/ext/clusterkit/src/lib.rs +22 -0
  31. data/ext/clusterkit/src/svd.rs +112 -0
  32. data/ext/clusterkit/src/tests.rs +16 -0
  33. data/ext/clusterkit/src/utils.rs +33 -0
  34. data/lib/clusterkit/clustering/hdbscan.rb +177 -0
  35. data/lib/clusterkit/clustering.rb +213 -0
  36. data/lib/clusterkit/clusterkit.rb +9 -0
  37. data/lib/clusterkit/configuration.rb +24 -0
  38. data/lib/clusterkit/dimensionality/pca.rb +251 -0
  39. data/lib/clusterkit/dimensionality/svd.rb +144 -0
  40. data/lib/clusterkit/dimensionality/umap.rb +311 -0
  41. data/lib/clusterkit/dimensionality.rb +29 -0
  42. data/lib/clusterkit/hdbscan_api_design.rb +142 -0
  43. data/lib/clusterkit/preprocessing.rb +106 -0
  44. data/lib/clusterkit/silence.rb +42 -0
  45. data/lib/clusterkit/utils.rb +51 -0
  46. data/lib/clusterkit/version.rb +5 -0
  47. data/lib/clusterkit.rb +93 -0
  48. data/lib/tasks/visualize.rake +641 -0
  49. metadata +194 -0
data/docs/UMAP_TROUBLESHOOTING.md
@@ -0,0 +1,284 @@
# UMAP Troubleshooting Guide

## Known Issues and Solutions

### 1. UMAP Hanging During fit() or fit_transform()

#### Symptoms
- The UMAP algorithm hangs indefinitely during training
- Console output shows: `embedded scales quantiles at 0.05 : 2.00e-1 , 0.5 : 2.00e-1, 0.95 : 2.00e-1, 0.99 : 2.00e-1`
- All quantiles are exactly 0.2, indicating degenerate initialization

#### Important: This Only Affects Test Data
**This issue does NOT occur with real embeddings from text models.** Production usage with embeddings from models like:
- OpenAI's text-embedding-ada-002
- Jina embeddings
- Sentence transformers
- BERT-based models

...works perfectly fine. The hanging only occurs with synthetic random test data.

#### Root Cause
The underlying annembed Rust library's `dmap_init` initialization algorithm expects data with manifold structure (like real embeddings have). When given uniform random data without structure, it can initialize all points to exactly the same location (0.2, 0.2), causing gradient descent to fail.

#### Why Real Embeddings Work
Real text embeddings have:
- Natural clustering (similar texts have similar embeddings)
- Meaningful dimensions that correlate
- Values typically in the range [-0.12, 0.12]
- High dimensionality (384-1536 dimensions)
- The inherent manifold structure that UMAP is designed to find

#### Triggering Conditions
The bug only occurs with:
- Uniform random test data without structure
- Synthetic data with very small variance
- Oversimplified test cases
- Small values of `nb_grad_batch` (< 5) combined with random data
- Small values of `nb_sampling_by_edge` (< 5) combined with random data

#### Workarounds

1. **Use conservative data ranges**
   ```ruby
   # Good: Small values centered near 0
   data = 30.times.map { 10.times.map { rand * 0.02 - 0.01 } }

   # Bad: Large ranges
   data = 30.times.map { 10.times.map { rand * 4.0 - 2.0 } }
   ```

2. **Use default parameters**
   ```ruby
   # Good: Use defaults
   umap = ClusterKit::UMAP.new(n_components: 2, n_neighbors: 15)

   # Risky: Small batch parameters
   umap = ClusterKit::UMAP.new(
     n_components: 2,
     n_neighbors: 15,
     nb_grad_batch: 2,       # Too small - may hang
     nb_sampling_by_edge: 3  # Too small - may hang
   )
   ```

3. **Add structure to your data**
   ```ruby
   # Instead of pure random data, add some structure
   data = 15.times.map do |i|
     30.times.map do |j|
       base = (i.to_f / 15) * 0.01  # Small trend
       noise = rand * 0.01 - 0.005  # Small noise
       base + noise
     end
   end
   ```

### 2. Writing Tests for UMAP

Since UMAP works fine with real embeddings but can hang with random test data, here's how to write better tests:

#### Use Realistic Test Data
```ruby
# GOOD: Generate data with structure similar to real embeddings
def generate_embedding_like_data(n_points, n_dims)
  # Create clusters to simulate semantic grouping
  n_clusters = 3
  points_per_cluster = n_points / n_clusters

  data = []
  n_clusters.times do |c|
    # Each cluster has a different center
    center = Array.new(n_dims) { (c - 1) * 0.05 }

    points_per_cluster.times do
      # Add uniform noise around the center
      point = center.map { |x| x + (rand - 0.5) * 0.02 }
      data << point
    end
  end

  data
end

# Use it in tests
test_data = generate_embedding_like_data(30, 768)  # 30 points, 768 dims like BERT
```

#### Load Real Embeddings for Testing
```ruby
# BETTER: Use actual embeddings from a small model
require 'candle'

def generate_real_test_embeddings
  model = Candle::EmbeddingModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
  texts = [
    "The cat sat on the mat",
    "Dogs are loyal animals",
    "Machine learning is fascinating",
    # ... more test sentences
  ]
  model.embed(texts)
end
```

#### Skip Problematic Test Scenarios
```ruby
# For edge cases that don't reflect real usage
it "handles random noise data", skip: "UMAP not designed for structure-less data" do
  # This test would hang because random noise has no manifold
  random_data = 100.times.map { 768.times.map { rand } }
  # Don't test this - it's not a real use case
end
```

### 3. Performance Tuning with New Parameters

As of version 0.2.0, UMAP supports two new parameters for performance tuning:

#### nb_grad_batch
- **Default**: 10
- **Purpose**: Controls the number of gradient descent batches
- **Trade-off**: Lower values = faster but less accurate
- **Safe range**: 5-15
- **Warning**: Values < 5 may cause hanging with certain data

#### nb_sampling_by_edge
- **Default**: 8
- **Purpose**: Controls the number of negative samples per edge
- **Trade-off**: Lower values = faster but less accurate
- **Safe range**: 5-10
- **Warning**: Values < 5 may cause hanging with certain data

Example usage:
```ruby
# Fast but potentially less accurate
fast_umap = ClusterKit::UMAP.new(
  n_components: 2,
  n_neighbors: 15,
  nb_grad_batch: 5,        # Minimum safe value
  nb_sampling_by_edge: 5   # Minimum safe value
)

# Slower but more accurate
accurate_umap = ClusterKit::UMAP.new(
  n_components: 2,
  n_neighbors: 15,
  nb_grad_batch: 15,       # Higher for better quality
  nb_sampling_by_edge: 10  # Higher for better quality
)
```

### 4. Data Validation Issues

#### NaN or Infinite Values
UMAP will raise an error if your data contains NaN or infinite values:
```ruby
# This will raise ArgumentError
bad_data = [[1.0, Float::NAN], [3.0, 4.0]]
umap.fit_transform(bad_data)
```

#### Inconsistent Row Lengths
All rows must have the same number of features:
```ruby
# This will raise ArgumentError
bad_data = [[1.0, 2.0], [3.0, 4.0, 5.0]]  # Different lengths
umap.fit_transform(bad_data)
```

### 5. Memory Issues with Large Datasets

UMAP builds an HNSW graph, which can be memory-intensive for large datasets.

#### Recommendations:
- For datasets > 100k points, consider sampling
- Monitor memory usage during fit_transform
- Use smaller n_neighbors values for large datasets

### 6. Model Persistence Issues

#### Binary Compatibility
Saved models may not be compatible across different versions of clusterkit. Always test loading saved models after upgrading.

#### File Size
Model files include both the original training data and embeddings, so they can be large:
```ruby
# Check model file size after saving
umap.save("model.bin")
puts "Model size: #{File.size('model.bin') / 1024.0 / 1024.0} MB"
```

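For the sampling recommendation in the memory section above, reproducibly down-sampling the dataset before calling `fit_transform` is often enough. Below is a minimal plain-Ruby sketch with no ClusterKit dependency; `subsample` is an illustrative helper, not part of the gem's API:

```ruby
# Illustrative helper: reproducibly down-sample a dataset before UMAP.
# Not part of the clusterkit API.
def subsample(data, max_points: 100_000, seed: 42)
  return data if data.length <= max_points
  # Array#sample with a seeded RNG draws the same subset every run
  data.sample(max_points, random: Random.new(seed))
end

data = Array.new(250) { |i| [i.to_f, i * 2.0] }  # stand-in for a large dataset
sampled = subsample(data, max_points: 100)
sampled.length  # => 100
```

Because the RNG is seeded, repeated runs operate on the same subset, so embeddings stay comparable across invocations.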
## Debugging Tips

### 1. Enable Verbose Output
The annembed library outputs diagnostic information during training. Watch for:
- "embedded scales quantiles" - should NOT all be the same value
- "initial cross entropy value" - should be a reasonable number (not 0 or infinity)
- "final cross entropy value" - should be lower than the initial value

### 2. Test with Known Working Data
If you're having issues, test with this known working configuration:
```ruby
# Known working test case
test_data = 30.times.map { 10.times.map { rand * 0.5 + 0.25 } }
umap = ClusterKit::UMAP.new(n_components: 2, n_neighbors: 5)
result = umap.fit_transform(test_data)
```

### 3. Check Data Characteristics
```ruby
# Analyze your data before UMAP
data_flat = data.flatten
mean = data_flat.sum / data_flat.length.to_f
puts "Data range: [#{data_flat.min}, #{data_flat.max}]"
puts "Data mean: #{mean}"
puts "Data variance: #{data_flat.map { |x| (x - mean) ** 2 }.sum / data_flat.length}"

# If variance is very small or range is very large, consider normalizing
```

## When to Report a Bug

Report an issue if:
1. UMAP consistently hangs even with conservative data and default parameters
2. You get a panic or segfault (not just a Ruby exception)
3. Results are dramatically different from other UMAP implementations
4. Memory usage is unexpectedly high

Include in your bug report:
- Data characteristics (shape, range, variance)
- Parameters used
- Console output including the diagnostic messages
- Ruby and clusterkit versions

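To collect the data characteristics listed above in one place, a small plain-Ruby helper can be used. This is an illustrative sketch, not part of the clusterkit API:

```ruby
# Illustrative helper for gathering bug-report data characteristics.
# Plain Ruby; `data_report` is not a clusterkit method.
def data_report(data)
  flat = data.flatten
  mean = flat.sum / flat.length.to_f
  {
    shape:    [data.length, data.first.length],  # rows x features
    range:    [flat.min, flat.max],
    mean:     mean,
    variance: flat.sum { |x| (x - mean)**2 } / flat.length
  }
end

report = data_report([[1.0, 2.0], [3.0, 4.0]])
# shape [2, 2], range [1.0, 4.0], mean 2.5, variance 1.25
```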
## Alternative Solutions

If you continue to experience issues:

1. **Use t-SNE instead** (if 2D visualization is the goal)
   ```ruby
   embedder = ClusterKit::Embedder.new(method: :tsne, n_components: 2)
   result = embedder.fit_transform(data)
   ```

2. **Preprocess your data**
   ```ruby
   # Normalize to [0, 1]
   min = data.flatten.min
   max = data.flatten.max
   normalized = data.map { |row| row.map { |x| (x - min) / (max - min) } }
   ```

3. **Use Python UMAP via PyCall** (if stability is critical)
   ```ruby
   require 'pycall'
   umap = PyCall.import_module('umap')
   reducer = umap.UMAP.new
   embedding = reducer.fit_transform(data)
   ```

## References

- [annembed GitHub Issues](https://github.com/jean-pierreBoth/annembed/issues) - For upstream bugs
- [UMAP Algorithm Paper](https://arxiv.org/abs/1802.03426) - Understanding the algorithm
- [Original Python UMAP](https://github.com/lmcinnes/umap) - Reference implementation
data/docs/VERBOSE_OUTPUT.md
@@ -0,0 +1,84 @@
# Controlling Verbose Output

The clusterkit gem provides control over the verbose debug output from the underlying Rust library.

## Default Behavior

By default, clusterkit suppresses the debug output from the Rust library to keep your console clean. This includes messages about quantiles, cross entropy values, and gradient iterations.

## Enabling Verbose Output

There are two ways to enable verbose output:

### 1. Environment Variable

Set the `ANNEMBED_VERBOSE` environment variable:

```bash
ANNEMBED_VERBOSE=true ruby your_script.rb
```

Or in your Ruby code:

```ruby
ENV['ANNEMBED_VERBOSE'] = 'true'
require 'clusterkit'
```

### 2. Configuration API

Use the configuration API for programmatic control:

```ruby
require 'clusterkit'

# Enable verbose output
ClusterKit.configure do |config|
  config.verbose = true
end

# Your UMAP operations will now show debug output
umap = ClusterKit::UMAP.new
umap.fit_transform(data)

# Disable verbose output
ClusterKit.configuration.verbose = false
```

## When to Use Verbose Output

Verbose output is useful for:

- **Debugging convergence issues**: See iteration counts and cross entropy values
- **Understanding performance**: Monitor gradient descent progress
- **Troubleshooting edge cases**: Identify degenerate initializations or disconnected graphs
- **Development and testing**: Verify the algorithm is working correctly

## Example Output

When verbose mode is enabled, you'll see output like:

```
constructed initial space
scales quantile at 0.05 : 1.12e0 , 0.5 : 1.17e0, 0.95 : 1.23e0, 0.99 : 1.23e0
edge weight quantile at 0.05 : 1.87e-1 , 0.5 : 1.99e-1, 0.95 : 2.15e-1, 0.99 : 2.19e-1
perplexity quantile at 0.05 : 4.99e0 , 0.5 : 5.00e0, 0.95 : 5.00e0, 0.99 : 5.00e0
embedded scales quantiles at 0.05 : 1.91e-1 , 0.5 : 2.00e-1, 0.95 : 2.10e-1, 0.99 : 2.10e-1
initial cross entropy value 7.48e1, in time 972µs
gradient iterations sys time(s) 0.00e0 , cpu_time(s) 0.00e0
final cross entropy value 5.85e1
```

## Implementation Details

The gem uses Ruby's standard approach for suppressing output from C/Rust extensions:
- Output streams are temporarily redirected to `/dev/null` (Unix) or `NUL:` (Windows)
- The redirection happens at the file descriptor level to capture C/Rust `printf`/`println!` output
- After the operation completes, streams are restored to their original state

This approach is thread-safe within the context of Ruby's Global VM Lock (GVL) and follows patterns used by popular Ruby gems like Rails/ActiveSupport.
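The bullet points above can be sketched in plain Ruby. This is an illustrative reconstruction assuming a Unix-like system; the gem's actual implementation lives in `lib/clusterkit/silence.rb` and may differ in detail:

```ruby
# Illustrative fd-level silencing (Unix-like systems). `silence_stream`
# is our name for the sketch, not the gem's public API.
def silence_stream(stream)
  original = stream.dup          # duplicate the underlying file descriptor
  stream.reopen(File::NULL)      # point the fd at /dev/null
  stream.sync = true
  yield                          # printf from C/Rust writing to this fd is discarded too
ensure
  stream.reopen(original)        # restore the original destination
  original.close
end

result = silence_stream($stdout) do
  puts "suppressed"              # redirected to /dev/null
  :done
end
puts "visible again: #{result}"
```

Because the redirection happens via `IO#reopen` on the file descriptor rather than by swapping the `$stdout` global, output written directly to fd 1 by native extensions is captured as well.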
data/examples/hdbscan_example.rb
@@ -0,0 +1,147 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

require 'bundler/setup'
require 'clusterkit'

# Helper method for the example
class Array
  def mostly_in_range?(start_idx, end_idx)
    count_in_range = count { |idx| idx >= start_idx && idx <= end_idx }
    count_in_range > length * 0.7
  end
end

# HDBSCAN Example: Document Clustering Pipeline
# ==============================================
# This example demonstrates using HDBSCAN for document clustering,
# which is the primary use case for this implementation.

puts "HDBSCAN Document Clustering Example"
puts "=" * 50

# Simulate document embeddings after UMAP reduction.
# In a real application, you would:
#   1. Embed documents using a model (e.g., BERT, Sentence Transformers)
#   2. Reduce dimensions with UMAP to ~20D
#   3. Apply HDBSCAN clustering

# Generate synthetic "document embeddings" in 20D space.
# These represent documents after UMAP reduction.
def generate_document_embeddings
  embeddings = []

  # Topic 1: Technology articles (30 documents)
  30.times do
    embedding = Array.new(20) { rand(-1.0..1.0) }
    # Add bias to make them cluster
    embedding[0] += 5.0
    embedding[1] += 5.0
    embeddings << embedding
  end

  # Topic 2: Science articles (25 documents)
  25.times do
    embedding = Array.new(20) { rand(-1.0..1.0) }
    # Add different bias
    embedding[2] += 5.0
    embedding[3] += 5.0
    embeddings << embedding
  end

  # Topic 3: Business articles (20 documents)
  20.times do
    embedding = Array.new(20) { rand(-1.0..1.0) }
    # Add another bias
    embedding[4] += 5.0
    embedding[5] += 5.0
    embeddings << embedding
  end

  # Noise: Off-topic or mixed-topic documents (15 documents)
  15.times do
    embedding = Array.new(20) { rand(-3.0..7.0) }
    embeddings << embedding
  end

  embeddings
end

# Generate sample data
embeddings = generate_document_embeddings
puts "Generated #{embeddings.length} document embeddings (20D)"
puts " - 30 technology articles"
puts " - 25 science articles"
puts " - 20 business articles"
puts " - 15 mixed/off-topic articles"

# Apply HDBSCAN clustering
puts "\nApplying HDBSCAN clustering..."
hdbscan = ClusterKit::Clustering::HDBSCAN.new(
  min_samples: 5,        # Minimum neighborhood size for density estimation
  min_cluster_size: 10   # Minimum cluster size (smaller clusters become noise)
)

# Fit the model
hdbscan.fit(embeddings)

# Get results
puts "\nClustering Results:"
puts "-" * 30
puts "Topics found: #{hdbscan.n_clusters}"
puts "Unclustered documents: #{hdbscan.n_noise_points} (#{(hdbscan.noise_ratio * 100).round(1)}%)"

# Analyze each cluster
cluster_indices = hdbscan.cluster_indices
cluster_indices.each do |topic_id, doc_indices|
  puts "\nTopic #{topic_id}:"
  puts " - #{doc_indices.length} documents"

  # In a real application, you would:
  #   1. Extract keywords from documents in this cluster
  #   2. Generate topic descriptions
  #   3. Find representative documents

  # Simulate topic identification based on our synthetic data
  if doc_indices.mostly_in_range?(0, 29)
    puts " - Likely topic: Technology"
  elsif doc_indices.mostly_in_range?(30, 54)
    puts " - Likely topic: Science"
  elsif doc_indices.mostly_in_range?(55, 74)
    puts " - Likely topic: Business"
  else
    puts " - Likely topic: Mixed"
  end
end

# Show noise documents (unclustered)
if hdbscan.n_noise_points > 0
  puts "\nUnclustered documents (noise):"
  puts " - #{hdbscan.n_noise_points} documents"
  puts " - These may be outliers, mixed-topic, or unique documents"
  puts " - Consider manual review or different clustering parameters"
end

# Alternative: Use module-level convenience method
puts "\n" + "=" * 50
puts "Alternative: Module-level method"
puts "-" * 30

result = ClusterKit::Clustering.hdbscan(
  embeddings,
  min_samples: 3,
  min_cluster_size: 8
)

puts "Topics found: #{result[:n_clusters]}"
puts "Noise ratio: #{(result[:noise_ratio] * 100).round(1)}%"

# Practical tips
puts "\n" + "=" * 50
puts "Tips for Document Clustering:"
puts "-" * 30
puts "1. Use UMAP to reduce embeddings to 20-50 dimensions first"
puts "2. Start with min_cluster_size = 10-20 for most document sets"
puts "3. Adjust min_samples based on local density (usually 5-10)"
puts "4. Expect 20-40% noise for diverse document collections"
puts "5. Noise documents often need special handling or re-clustering"
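On tip 5: one simple fallback for noise documents is to attach each one to the nearest cluster centroid. The sketch below is plain Ruby and illustrative only; `assign_noise_to_nearest` is not part of the gem's API, and in practice you might instead re-cluster the noise or leave it unassigned:

```ruby
# Illustrative fallback for HDBSCAN noise points (label -1): attach each
# one to the nearest cluster centroid. Not part of the clusterkit API.
def assign_noise_to_nearest(points, labels)
  clustered = labels.each_index.reject { |i| labels[i] == -1 }
  # Compute one centroid per cluster label
  centroids = clustered.group_by { |i| labels[i] }.transform_values do |idxs|
    dims = points[idxs.first].length
    (0...dims).map { |d| idxs.sum { |i| points[i][d] } / idxs.length.to_f }
  end
  # Relabel noise points by squared distance to the nearest centroid
  labels.each_with_index.map do |label, i|
    next label unless label == -1
    centroids.min_by { |_id, c| c.zip(points[i]).sum { |a, b| (a - b)**2 } }.first
  end
end

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [4.8, 4.9]]
labels = [0, 0, 1, 1, -1]
assign_noise_to_nearest(points, labels)  # => [0, 0, 1, 1, 1]
```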
data/examples/optimal_kmeans_example.rb
@@ -0,0 +1,96 @@
#!/usr/bin/env ruby

require 'bundler/setup'
require 'clusterkit'

# Generate sample data with 3 natural clusters
def generate_sample_data
  data = []

  # Cluster 1: centered around [0, 0]
  30.times do
    data << [rand * 0.5 - 0.25, rand * 0.5 - 0.25]
  end

  # Cluster 2: centered around [3, 3]
  30.times do
    data << [3 + rand * 0.5 - 0.25, 3 + rand * 0.5 - 0.25]
  end

  # Cluster 3: centered around [1.5, -2]
  30.times do
    data << [1.5 + rand * 0.5 - 0.25, -2 + rand * 0.5 - 0.25]
  end

  data
end

puts "ClusterKit Optimal K-means Clustering Example"
puts "=" * 50

# Generate data
data = generate_sample_data
puts "\nGenerated #{data.size} data points with 3 natural clusters"

# Method 1: Manual elbow method and detection
puts "\nMethod 1: Manual elbow method"
puts "-" * 30

elbow_results = ClusterKit::Clustering.elbow_method(data, k_range: 2..8)
puts "Elbow method results:"
elbow_results.sort.each do |k, inertia|
  puts "  k=#{k}: inertia=#{inertia.round(2)}"
end

optimal_k = ClusterKit::Clustering.detect_optimal_k(elbow_results)
puts "\nDetected optimal k: #{optimal_k}"

# Perform K-means with the optimal k
labels, centroids, inertia = ClusterKit::Clustering.kmeans(data, optimal_k)
puts "Final inertia: #{inertia.round(2)}"
puts "Cluster sizes: #{labels.tally.sort.to_h}"

# Method 2: Using optimal_kmeans (all-in-one)
puts "\nMethod 2: Using optimal_kmeans (automatic)"
puts "-" * 30

optimal_k, labels, centroids, inertia = ClusterKit::Clustering.optimal_kmeans(data, k_range: 2..8)
puts "Automatically detected k: #{optimal_k}"
puts "Final inertia: #{inertia.round(2)}"
puts "Cluster sizes: #{labels.tally.sort.to_h}"

# Method 3: Using the KMeans class
puts "\nMethod 3: Using KMeans class"
puts "-" * 30

# First detect the optimal k
elbow_results = ClusterKit::Clustering.elbow_method(data, k_range: 2..8)
optimal_k = ClusterKit::Clustering.detect_optimal_k(elbow_results)

# Create a KMeans instance with the optimal k
kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k, random_seed: 42)
labels = kmeans.fit_predict(data)

puts "K-means with k=#{optimal_k}:"
puts "Inertia: #{kmeans.inertia.round(2)}"
puts "Cluster sizes: #{labels.tally.sort.to_h}"

# Show cluster centers
puts "\nCluster centers:"
kmeans.cluster_centers.each_with_index do |center, i|
  puts "  Cluster #{i}: [#{center[0].round(2)}, #{center[1].round(2)}]"
end

# Calculate silhouette score to validate clustering quality
silhouette = ClusterKit::Clustering.silhouette_score(data, labels)
puts "\nSilhouette score: #{silhouette.round(3)}"
puts "(Higher is better, range is -1 to 1)"

# Custom fallback example
puts "\nCustom fallback example:"
puts "-" * 30
empty_results = {}
default_k = ClusterKit::Clustering.detect_optimal_k(empty_results)
custom_k = ClusterKit::Clustering.detect_optimal_k(empty_results, fallback_k: 5)
puts "Default fallback k: #{default_k}"
puts "Custom fallback k: #{custom_k}"
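The elbow detection used above can be approximated in plain Ruby by picking the k with the largest second difference of inertia (the sharpest bend in the curve). This is an illustrative heuristic with a fallback for too-small inputs, not the gem's actual `detect_optimal_k` implementation:

```ruby
# Illustrative elbow heuristic: pick the k whose inertia curve bends most
# (largest second difference). Not ClusterKit's actual detect_optimal_k logic.
def detect_elbow(elbow_results, fallback_k: 3)
  return fallback_k if elbow_results.size < 3
  ks = elbow_results.keys.sort
  bends = ks[1..-2].map do |k|
    prev_k = ks[ks.index(k) - 1]
    next_k = ks[ks.index(k) + 1]
    drop_before = elbow_results[prev_k] - elbow_results[k]
    drop_after  = elbow_results[k] - elbow_results[next_k]
    [k, drop_before - drop_after]  # second difference at k
  end
  bends.max_by { |_k, bend| bend }.first
end

# Inertia falls sharply until k=3, then flattens:
detect_elbow({ 2 => 100.0, 3 => 30.0, 4 => 25.0, 5 => 22.0 })  # => 3
```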