clusterkit 0.1.0.pre.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.rspec +3 -0
- data/.simplecov +47 -0
- data/CHANGELOG.md +35 -0
- data/CLAUDE.md +226 -0
- data/Cargo.toml +8 -0
- data/Gemfile +17 -0
- data/IMPLEMENTATION_NOTES.md +143 -0
- data/LICENSE.txt +21 -0
- data/PYTHON_COMPARISON.md +183 -0
- data/README.md +499 -0
- data/Rakefile +245 -0
- data/clusterkit.gemspec +45 -0
- data/docs/KNOWN_ISSUES.md +130 -0
- data/docs/RUST_ERROR_HANDLING.md +164 -0
- data/docs/TEST_FIXTURES.md +170 -0
- data/docs/UMAP_EXPLAINED.md +362 -0
- data/docs/UMAP_TROUBLESHOOTING.md +284 -0
- data/docs/VERBOSE_OUTPUT.md +84 -0
- data/examples/hdbscan_example.rb +147 -0
- data/examples/optimal_kmeans_example.rb +96 -0
- data/examples/pca_example.rb +114 -0
- data/examples/reproducible_umap.rb +99 -0
- data/examples/verbose_control.rb +43 -0
- data/ext/clusterkit/Cargo.toml +25 -0
- data/ext/clusterkit/extconf.rb +4 -0
- data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
- data/ext/clusterkit/src/clustering.rs +267 -0
- data/ext/clusterkit/src/embedder.rs +413 -0
- data/ext/clusterkit/src/lib.rs +22 -0
- data/ext/clusterkit/src/svd.rs +112 -0
- data/ext/clusterkit/src/tests.rs +16 -0
- data/ext/clusterkit/src/utils.rs +33 -0
- data/lib/clusterkit/clustering/hdbscan.rb +177 -0
- data/lib/clusterkit/clustering.rb +213 -0
- data/lib/clusterkit/clusterkit.rb +9 -0
- data/lib/clusterkit/configuration.rb +24 -0
- data/lib/clusterkit/dimensionality/pca.rb +251 -0
- data/lib/clusterkit/dimensionality/svd.rb +144 -0
- data/lib/clusterkit/dimensionality/umap.rb +311 -0
- data/lib/clusterkit/dimensionality.rb +29 -0
- data/lib/clusterkit/hdbscan_api_design.rb +142 -0
- data/lib/clusterkit/preprocessing.rb +106 -0
- data/lib/clusterkit/silence.rb +42 -0
- data/lib/clusterkit/utils.rb +51 -0
- data/lib/clusterkit/version.rb +5 -0
- data/lib/clusterkit.rb +93 -0
- data/lib/tasks/visualize.rake +641 -0
- metadata +194 -0
data/docs/UMAP_TROUBLESHOOTING.md
@@ -0,0 +1,284 @@
# UMAP Troubleshooting Guide

## Known Issues and Solutions

### 1. UMAP Hanging During fit() or fit_transform()

#### Symptoms
- The UMAP algorithm hangs indefinitely during training
- Console output shows: `embedded scales quantiles at 0.05 : 2.00e-1 , 0.5 : 2.00e-1, 0.95 : 2.00e-1, 0.99 : 2.00e-1`
- All quantiles are exactly 0.2, indicating degenerate initialization

#### Important: This Only Affects Test Data
**This issue does NOT occur with real embeddings from text models.** Production usage with embeddings from models like:
- OpenAI's text-embedding-ada-002
- Jina embeddings
- Sentence transformers
- BERT-based models

...works perfectly fine. The hanging only occurs with synthetic random test data.

#### Root Cause
The underlying annembed Rust library's `dmap_init` initialization algorithm expects data with manifold structure (like real embeddings have). When given uniform random data without structure, it can initialize all points to exactly the same location (0.2, 0.2), causing gradient descent to fail.

#### Why Real Embeddings Work
Real text embeddings have:
- Natural clustering (similar texts have similar embeddings)
- Meaningful dimensions that correlate
- Values typically in the range [-0.12, 0.12]
- High dimensionality (384-1536 dimensions)
- Inherent manifold structure that UMAP is designed to find

#### Triggering Conditions
The bug only occurs with:
- Uniform random test data without structure
- Synthetic data with very small variance
- Oversimplified test cases
- Small values of `nb_grad_batch` (< 5) combined with random data
- Small values of `nb_sampling_by_edge` (< 5) combined with random data

#### Workarounds

1. **Use conservative data ranges**
   ```ruby
   # Good: small values centered near 0
   data = 30.times.map { 10.times.map { rand * 0.02 - 0.01 } }

   # Bad: large ranges
   data = 30.times.map { 10.times.map { rand * 4.0 - 2.0 } }
   ```

2. **Use default parameters**
   ```ruby
   # Good: use defaults
   umap = ClusterKit::UMAP.new(n_components: 2, n_neighbors: 15)

   # Risky: small batch parameters
   umap = ClusterKit::UMAP.new(
     n_components: 2,
     n_neighbors: 15,
     nb_grad_batch: 2,       # Too small - may hang
     nb_sampling_by_edge: 3  # Too small - may hang
   )
   ```

3. **Add structure to your data**
   ```ruby
   # Instead of pure random data, add some structure
   data = 15.times.map do |i|
     30.times.map do
       base = (i.to_f / 15) * 0.01  # Small trend
       noise = rand * 0.01 - 0.005  # Small noise
       base + noise
     end
   end
   ```

### 2. Writing Tests for UMAP

Since UMAP works fine with real embeddings but can hang with random test data, here's how to write better tests:

#### Use Realistic Test Data
```ruby
# GOOD: Generate data with structure similar to real embeddings
def generate_embedding_like_data(n_points, n_dims)
  # Create clusters to simulate semantic grouping
  n_clusters = 3
  points_per_cluster = n_points / n_clusters

  data = []
  n_clusters.times do |c|
    # Each cluster has a different center
    center = Array.new(n_dims) { (c - 1) * 0.05 }

    points_per_cluster.times do
      # Add uniform noise around the center
      point = center.map { |x| x + (rand - 0.5) * 0.02 }
      data << point
    end
  end

  data
end

# Use it in tests
test_data = generate_embedding_like_data(30, 768)  # 30 points, 768 dims like BERT
```

#### Load Real Embeddings for Testing
```ruby
# BETTER: Use actual embeddings from a small model
require 'candle'

def generate_real_test_embeddings
  model = Candle::EmbeddingModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
  texts = [
    "The cat sat on the mat",
    "Dogs are loyal animals",
    "Machine learning is fascinating",
    # ... more test sentences
  ]
  model.embed(texts)
end
```

#### Skip Problematic Test Scenarios
```ruby
# For edge cases that don't reflect real usage
it "handles random noise data", skip: "UMAP not designed for structure-less data" do
  # This test would hang because random noise has no manifold
  random_data = 100.times.map { 768.times.map { rand } }
  # Don't test this - it's not a real use case
end
```

### 3. Performance Tuning with New Parameters

As of version 0.2.0, UMAP supports two new parameters for performance tuning:

#### nb_grad_batch
- **Default**: 10
- **Purpose**: Controls the number of gradient descent batches
- **Trade-off**: Lower values = faster but less accurate
- **Safe range**: 5-15
- **Warning**: Values < 5 may cause hanging with certain data

#### nb_sampling_by_edge
- **Default**: 8
- **Purpose**: Controls the number of negative samples per edge
- **Trade-off**: Lower values = faster but less accurate
- **Safe range**: 5-10
- **Warning**: Values < 5 may cause hanging with certain data

Example usage:
```ruby
# Fast but potentially less accurate
fast_umap = ClusterKit::UMAP.new(
  n_components: 2,
  n_neighbors: 15,
  nb_grad_batch: 5,       # Minimum safe value
  nb_sampling_by_edge: 5  # Minimum safe value
)

# Slower but more accurate
accurate_umap = ClusterKit::UMAP.new(
  n_components: 2,
  n_neighbors: 15,
  nb_grad_batch: 15,       # Higher for better quality
  nb_sampling_by_edge: 10  # Higher for better quality
)
```

### 4. Data Validation Issues

#### NaN or Infinite Values
UMAP will raise an error if your data contains NaN or infinite values:
```ruby
# This will raise ArgumentError
bad_data = [[1.0, Float::NAN], [3.0, 4.0]]
umap.fit_transform(bad_data)
```

#### Inconsistent Row Lengths
All rows must have the same number of features:
```ruby
# This will raise ArgumentError
bad_data = [[1.0, 2.0], [3.0, 4.0, 5.0]]  # Different lengths
umap.fit_transform(bad_data)
```

### 5. Memory Issues with Large Datasets

UMAP builds an HNSW graph, which can be memory-intensive for large datasets.

#### Recommendations:
- For datasets > 100k points, consider sampling (see the sketch after this list)
- Monitor memory usage during fit_transform
- Use smaller n_neighbors values for large datasets
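A minimal sketch of the sampling approach, using plain Ruby arrays (the cutoff and seed are illustrative):

```ruby
# Fit on a reproducible random subset to keep the HNSW graph small
sample = data.sample(50_000, random: Random.new(42))

umap = ClusterKit::UMAP.new(n_components: 2, n_neighbors: 10)
embedded = umap.fit_transform(sample)
```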

### 6. Model Persistence Issues

#### Binary Compatibility
Saved models may not be compatible across different versions of clusterkit. Always test loading saved models after upgrading.
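One defensive pattern for upgrades is to treat an unreadable model file as a cache miss and re-fit. A sketch, assuming a `load` counterpart to `save` (hypothetical name; check the gem's README for the actual persistence API):

```ruby
begin
  umap = ClusterKit::UMAP.load("model.bin")  # hypothetical counterpart to #save
rescue StandardError => e
  warn "Saved model unreadable after upgrade (#{e.message}); re-fitting"
  umap = ClusterKit::UMAP.new(n_components: 2, n_neighbors: 15)
  umap.fit(training_data)
  umap.save("model.bin")
end
```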

#### File Size
Model files include both the original training data and embeddings, so they can be large:
```ruby
# Check the model file size after saving
umap.save("model.bin")
puts "Model size: #{File.size('model.bin') / 1024.0 / 1024.0} MB"
```

## Debugging Tips

### 1. Enable Verbose Output
The annembed library outputs diagnostic information during training. Watch for:
- "embedded scales quantiles" - should NOT all be the same value
- "initial cross entropy value" - should be a reasonable number (not 0 or infinity)
- "final cross entropy value" - should be lower than the initial value
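Verbose output is suppressed by default; enable it with the configuration API (documented in VERBOSE_OUTPUT.md) while debugging:

```ruby
require 'clusterkit'

# Surface the Rust library's diagnostics for this session
ClusterKit.configure { |config| config.verbose = true }

umap = ClusterKit::UMAP.new(n_components: 2, n_neighbors: 15)
umap.fit_transform(data)  # quantiles and cross entropy values now print
```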

### 2. Test with Known Working Data
If you're having issues, test with this known working configuration:
```ruby
# Known working test case
test_data = 30.times.map { 10.times.map { rand * 0.5 + 0.25 } }
umap = ClusterKit::UMAP.new(n_components: 2, n_neighbors: 5)
result = umap.fit_transform(test_data)
```

### 3. Check Data Characteristics
```ruby
# Analyze your data before UMAP
data_flat = data.flatten
mean = data_flat.sum / data_flat.length.to_f
variance = data_flat.map { |x| (x - mean)**2 }.sum / data_flat.length
puts "Data range: [#{data_flat.min}, #{data_flat.max}]"
puts "Data mean: #{mean}"
puts "Data variance: #{variance}"

# If variance is very small or the range is very large, consider normalizing
```

## When to Report a Bug

Report an issue if:
1. UMAP consistently hangs even with conservative data and default parameters
2. You get a panic or segfault (not just a Ruby exception)
3. Results are dramatically different from other UMAP implementations
4. Memory usage is unexpectedly high

Include in your bug report (the snippet below gathers most of it):
- Data characteristics (shape, range, variance)
- Parameters used
- Console output, including the diagnostic messages
- Ruby and clusterkit versions
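A small helper that collects those details in one hash. It assumes array-of-arrays input and the conventional `ClusterKit::VERSION` constant (defined in lib/clusterkit/version.rb):

```ruby
def bug_report_info(data)
  flat = data.flatten
  mean = flat.sum / flat.length.to_f
  {
    shape:      [data.length, data.first.length],
    range:      [flat.min, flat.max],
    variance:   flat.map { |x| (x - mean)**2 }.sum / flat.length,
    ruby:       RUBY_VERSION,
    clusterkit: ClusterKit::VERSION  # assumed; see version.rb
  }
end

puts bug_report_info(data).inspect
```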

## Alternative Solutions

If you continue to experience issues:

1. **Use t-SNE instead** (if 2D visualization is the goal)
   ```ruby
   embedder = ClusterKit::Embedder.new(method: :tsne, n_components: 2)
   result = embedder.fit_transform(data)
   ```

2. **Preprocess your data**
   ```ruby
   # Normalize to [0, 1]
   min = data.flatten.min
   max = data.flatten.max
   normalized = data.map { |row| row.map { |x| (x - min) / (max - min) } }
   ```

3. **Use Python UMAP via PyCall** (if stability is critical)
   ```ruby
   require 'pycall'
   umap = PyCall.import_module('umap')
   reducer = umap.UMAP.new
   embedding = reducer.fit_transform(data)
   ```

## References

- [annembed GitHub Issues](https://github.com/jean-pierreBoth/annembed/issues) - For upstream bugs
- [UMAP Algorithm Paper](https://arxiv.org/abs/1802.03426) - Understanding the algorithm
- [Original Python UMAP](https://github.com/lmcinnes/umap) - Reference implementation

data/docs/VERBOSE_OUTPUT.md
@@ -0,0 +1,84 @@
# Controlling Verbose Output

The clusterkit gem provides control over the verbose debug output from the underlying Rust library.

## Default Behavior

By default, clusterkit suppresses the debug output from the Rust library to keep your console clean. This includes messages about quantiles, cross entropy values, and gradient iterations.

## Enabling Verbose Output

There are two ways to enable verbose output:

### 1. Environment Variable

Set the `ANNEMBED_VERBOSE` environment variable:

```bash
ANNEMBED_VERBOSE=true ruby your_script.rb
```

Or in your Ruby code:

```ruby
ENV['ANNEMBED_VERBOSE'] = 'true'
require 'clusterkit'
```

### 2. Configuration API

Use the configuration API for programmatic control:

```ruby
require 'clusterkit'

# Enable verbose output
ClusterKit.configure do |config|
  config.verbose = true
end

# Your UMAP operations will now show debug output
umap = ClusterKit::UMAP.new
umap.fit_transform(data)

# Disable verbose output
ClusterKit.configuration.verbose = false
```

## When to Use Verbose Output

Verbose output is useful for:

- **Debugging convergence issues**: See iteration counts and cross entropy values
- **Understanding performance**: Monitor gradient descent progress
- **Troubleshooting edge cases**: Identify degenerate initializations or disconnected graphs
- **Development and testing**: Verify the algorithm is working correctly

## Example Output

When verbose mode is enabled, you'll see output like:

```
constructed initial space

scales quantile at 0.05 : 1.12e0 , 0.5 : 1.17e0, 0.95 : 1.23e0, 0.99 : 1.23e0

edge weight quantile at 0.05 : 1.87e-1 , 0.5 : 1.99e-1, 0.95 : 2.15e-1, 0.99 : 2.19e-1

perplexity quantile at 0.05 : 4.99e0 , 0.5 : 5.00e0, 0.95 : 5.00e0, 0.99 : 5.00e0

embedded scales quantiles at 0.05 : 1.91e-1 , 0.5 : 2.00e-1, 0.95 : 2.10e-1, 0.99 : 2.10e-1

initial cross entropy value 7.48e1, in time 972µs
gradient iterations sys time(s) 0.00e0 , cpu_time(s) 0.00e0
final cross entropy value 5.85e1
```

## Implementation Details

The gem uses Ruby's standard approach for suppressing output from C/Rust extensions:

- Output streams are temporarily redirected to `/dev/null` (Unix) or `NUL:` (Windows)
- The redirection happens at the file descriptor level to capture C/Rust `printf`/`println!` output
- After the operation completes, streams are restored to their original state

This approach is thread-safe within the context of Ruby's Global VM Lock (GVL) and follows patterns used by popular Ruby gems like Rails/ActiveSupport.
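The core of that technique, sketched as a standalone helper (the gem's own lib/clusterkit/silence.rb may differ in details):

```ruby
# Minimal fd-level stdout silencer: Rust's println! writes to fd 1,
# so swapping the Ruby-level $stdout object alone is not enough.
def silence_stdout
  null_device = Gem.win_platform? ? 'NUL' : '/dev/null'
  saved = STDOUT.dup                # duplicate fd 1 so we can restore it
  STDOUT.reopen(null_device, 'w')   # repoint fd 1 at the null device
  yield
ensure
  STDOUT.reopen(saved)              # restore the original fd
  saved.close
end

silence_stdout { umap.fit_transform(data) }  # runs without diagnostics
```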

data/examples/hdbscan_example.rb
@@ -0,0 +1,147 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

require 'bundler/setup'
require 'clusterkit'

# Helper method for the example
class Array
  def mostly_in_range?(start_idx, end_idx)
    count_in_range = count { |idx| idx >= start_idx && idx <= end_idx }
    count_in_range > length * 0.7
  end
end

# HDBSCAN Example: Document Clustering Pipeline
# ==============================================
# This example demonstrates using HDBSCAN for document clustering,
# which is the primary use case for this implementation.

puts "HDBSCAN Document Clustering Example"
puts "=" * 50

# Simulate document embeddings after UMAP reduction
# In a real application, you would:
# 1. Embed documents using a model (e.g., BERT, Sentence Transformers)
# 2. Reduce dimensions with UMAP to ~20D
# 3. Apply HDBSCAN clustering

# Generate synthetic "document embeddings" in 20D space
# These represent documents after UMAP reduction
def generate_document_embeddings
  embeddings = []

  # Topic 1: Technology articles (30 documents)
  30.times do
    embedding = Array.new(20) { rand(-1.0..1.0) }
    # Add bias to make them cluster
    embedding[0] += 5.0
    embedding[1] += 5.0
    embeddings << embedding
  end

  # Topic 2: Science articles (25 documents)
  25.times do
    embedding = Array.new(20) { rand(-1.0..1.0) }
    # Add different bias
    embedding[2] += 5.0
    embedding[3] += 5.0
    embeddings << embedding
  end

  # Topic 3: Business articles (20 documents)
  20.times do
    embedding = Array.new(20) { rand(-1.0..1.0) }
    # Add another bias
    embedding[4] += 5.0
    embedding[5] += 5.0
    embeddings << embedding
  end

  # Noise: Off-topic or mixed-topic documents (15 documents)
  15.times do
    embedding = Array.new(20) { rand(-3.0..7.0) }
    embeddings << embedding
  end

  embeddings
end

# Generate sample data
embeddings = generate_document_embeddings
puts "Generated #{embeddings.length} document embeddings (20D)"
puts "  - 30 technology articles"
puts "  - 25 science articles"
puts "  - 20 business articles"
puts "  - 15 mixed/off-topic articles"

# Apply HDBSCAN clustering
puts "\nApplying HDBSCAN clustering..."
hdbscan = ClusterKit::Clustering::HDBSCAN.new(
  min_samples: 5,       # Minimum neighborhood size for density estimation
  min_cluster_size: 10  # Minimum cluster size (smaller clusters become noise)
)

# Fit the model
hdbscan.fit(embeddings)

# Get results
puts "\nClustering Results:"
puts "-" * 30
puts "Topics found: #{hdbscan.n_clusters}"
puts "Unclustered documents: #{hdbscan.n_noise_points} (#{(hdbscan.noise_ratio * 100).round(1)}%)"

# Analyze each cluster
cluster_indices = hdbscan.cluster_indices
cluster_indices.each do |topic_id, doc_indices|
  puts "\nTopic #{topic_id}:"
  puts "  - #{doc_indices.length} documents"

  # In a real application, you would:
  # 1. Extract keywords from documents in this cluster
  # 2. Generate topic descriptions
  # 3. Find representative documents

  # Simulate topic identification based on our synthetic data
  if doc_indices.mostly_in_range?(0, 29)
    puts "  - Likely topic: Technology"
  elsif doc_indices.mostly_in_range?(30, 54)
    puts "  - Likely topic: Science"
  elsif doc_indices.mostly_in_range?(55, 74)
    puts "  - Likely topic: Business"
  else
    puts "  - Likely topic: Mixed"
  end
end

# Show noise documents (unclustered)
if hdbscan.n_noise_points > 0
  puts "\nUnclustered documents (noise):"
  puts "  - #{hdbscan.n_noise_points} documents"
  puts "  - These may be outliers, mixed-topic, or unique documents"
  puts "  - Consider manual review or different clustering parameters"
end

# Alternative: Use module-level convenience method
puts "\n" + "=" * 50
puts "Alternative: Module-level method"
puts "-" * 30

result = ClusterKit::Clustering.hdbscan(
  embeddings,
  min_samples: 3,
  min_cluster_size: 8
)

puts "Topics found: #{result[:n_clusters]}"
puts "Noise ratio: #{(result[:noise_ratio] * 100).round(1)}%"

# Practical tips
puts "\n" + "=" * 50
puts "Tips for Document Clustering:"
puts "-" * 30
puts "1. Use UMAP to reduce embeddings to 20-50 dimensions first"
puts "2. Start with min_cluster_size = 10-20 for most document sets"
puts "3. Adjust min_samples based on local density (usually 5-10)"
puts "4. Expect 20-40% noise for diverse document collections"
puts "5. Noise documents often need special handling or re-clustering"
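The tips printed at the end of the example assume the UMAP-then-HDBSCAN pipeline described in its opening comments. A compact sketch of that chaining, using the APIs shown in this gem (parameter values are illustrative, and `raw_embeddings` is assumed to be an array of high-dimensional vectors):

```ruby
require 'clusterkit'

# Step 2 of the pipeline: reduce high-dimensional embeddings (e.g. 768D) to ~20D
umap = ClusterKit::UMAP.new(n_components: 20, n_neighbors: 15)
reduced = umap.fit_transform(raw_embeddings)

# Step 3: density-based clustering in the reduced space
hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5, min_cluster_size: 10)
hdbscan.fit(reduced)
puts "#{hdbscan.n_clusters} topics, #{(hdbscan.noise_ratio * 100).round(1)}% noise"
```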

data/examples/optimal_kmeans_example.rb
@@ -0,0 +1,96 @@
#!/usr/bin/env ruby

require 'bundler/setup'
require 'clusterkit'

# Generate sample data with 3 natural clusters
def generate_sample_data
  data = []

  # Cluster 1: centered around [0, 0]
  30.times do
    data << [rand * 0.5 - 0.25, rand * 0.5 - 0.25]
  end

  # Cluster 2: centered around [3, 3]
  30.times do
    data << [3 + rand * 0.5 - 0.25, 3 + rand * 0.5 - 0.25]
  end

  # Cluster 3: centered around [1.5, -2]
  30.times do
    data << [1.5 + rand * 0.5 - 0.25, -2 + rand * 0.5 - 0.25]
  end

  data
end

puts "ClusterKit Optimal K-means Clustering Example"
puts "=" * 50

# Generate data
data = generate_sample_data
puts "\nGenerated #{data.size} data points with 3 natural clusters"

# Method 1: Manual elbow method and detection
puts "\nMethod 1: Manual elbow method"
puts "-" * 30

elbow_results = ClusterKit::Clustering.elbow_method(data, k_range: 2..8)
puts "Elbow method results:"
elbow_results.sort.each do |k, inertia|
  puts "  k=#{k}: inertia=#{inertia.round(2)}"
end

optimal_k = ClusterKit::Clustering.detect_optimal_k(elbow_results)
puts "\nDetected optimal k: #{optimal_k}"

# Perform K-means with optimal k
labels, centroids, inertia = ClusterKit::Clustering.kmeans(data, optimal_k)
puts "Final inertia: #{inertia.round(2)}"
puts "Cluster sizes: #{labels.tally.sort.to_h}"

# Method 2: Using optimal_kmeans (all-in-one)
puts "\nMethod 2: Using optimal_kmeans (automatic)"
puts "-" * 30

optimal_k, labels, centroids, inertia = ClusterKit::Clustering.optimal_kmeans(data, k_range: 2..8)
puts "Automatically detected k: #{optimal_k}"
puts "Final inertia: #{inertia.round(2)}"
puts "Cluster sizes: #{labels.tally.sort.to_h}"

# Method 3: Using KMeans class with detected k
puts "\nMethod 3: Using KMeans class"
puts "-" * 30

# First detect optimal k
elbow_results = ClusterKit::Clustering.elbow_method(data, k_range: 2..8)
optimal_k = ClusterKit::Clustering.detect_optimal_k(elbow_results)

# Create KMeans instance with optimal k
kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k, random_seed: 42)
labels = kmeans.fit_predict(data)

puts "K-means with k=#{optimal_k}:"
puts "Inertia: #{kmeans.inertia.round(2)}"
puts "Cluster sizes: #{labels.tally.sort.to_h}"

# Show cluster centers
puts "\nCluster centers:"
kmeans.cluster_centers.each_with_index do |center, i|
  puts "  Cluster #{i}: [#{center[0].round(2)}, #{center[1].round(2)}]"
end

# Calculate silhouette score to validate clustering quality
silhouette = ClusterKit::Clustering.silhouette_score(data, labels)
puts "\nSilhouette score: #{silhouette.round(3)}"
puts "(Higher is better, range is -1 to 1)"

# Custom fallback example
puts "\nCustom fallback example:"
puts "-" * 30
empty_results = {}
default_k = ClusterKit::Clustering.detect_optimal_k(empty_results)
custom_k = ClusterKit::Clustering.detect_optimal_k(empty_results, fallback_k: 5)
puts "Default fallback k: #{default_k}"
puts "Custom fallback k: #{custom_k}"