RubyGems - clusterkit - Versions diffs - 0.1.0.pre.1 - Mend

clusterkit 0.1.0.pre.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (49)

checksums.yaml +7 -0
data/.rspec +3 -0
data/.simplecov +47 -0
data/CHANGELOG.md +35 -0
data/CLAUDE.md +226 -0
data/Cargo.toml +8 -0
data/Gemfile +17 -0
data/IMPLEMENTATION_NOTES.md +143 -0
data/LICENSE.txt +21 -0
data/PYTHON_COMPARISON.md +183 -0
data/README.md +499 -0
data/Rakefile +245 -0
data/clusterkit.gemspec +45 -0
data/docs/KNOWN_ISSUES.md +130 -0
data/docs/RUST_ERROR_HANDLING.md +164 -0
data/docs/TEST_FIXTURES.md +170 -0
data/docs/UMAP_EXPLAINED.md +362 -0
data/docs/UMAP_TROUBLESHOOTING.md +284 -0
data/docs/VERBOSE_OUTPUT.md +84 -0
data/examples/hdbscan_example.rb +147 -0
data/examples/optimal_kmeans_example.rb +96 -0
data/examples/pca_example.rb +114 -0
data/examples/reproducible_umap.rb +99 -0
data/examples/verbose_control.rb +43 -0
data/ext/clusterkit/Cargo.toml +25 -0
data/ext/clusterkit/extconf.rb +4 -0
data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
data/ext/clusterkit/src/clustering.rs +267 -0
data/ext/clusterkit/src/embedder.rs +413 -0
data/ext/clusterkit/src/lib.rs +22 -0
data/ext/clusterkit/src/svd.rs +112 -0
data/ext/clusterkit/src/tests.rs +16 -0
data/ext/clusterkit/src/utils.rs +33 -0
data/lib/clusterkit/clustering/hdbscan.rb +177 -0
data/lib/clusterkit/clustering.rb +213 -0
data/lib/clusterkit/clusterkit.rb +9 -0
data/lib/clusterkit/configuration.rb +24 -0
data/lib/clusterkit/dimensionality/pca.rb +251 -0
data/lib/clusterkit/dimensionality/svd.rb +144 -0
data/lib/clusterkit/dimensionality/umap.rb +311 -0
data/lib/clusterkit/dimensionality.rb +29 -0
data/lib/clusterkit/hdbscan_api_design.rb +142 -0
data/lib/clusterkit/preprocessing.rb +106 -0
data/lib/clusterkit/silence.rb +42 -0
data/lib/clusterkit/utils.rb +51 -0
data/lib/clusterkit/version.rb +5 -0
data/lib/clusterkit.rb +93 -0
data/lib/tasks/visualize.rake +641 -0
metadata +194 -0

data/Rakefile ADDED

@@ -0,0 +1,245 @@
+require "bundler/gem_tasks"
+require "rake/extensiontask"
+require "rspec/core/rake_task"
+# RSpec test task
+RSpec::Core::RakeTask.new(:spec)
+# Define the Rust extension
+spec = Gem::Specification.load("clusterkit.gemspec")
+Rake::ExtensionTask.new("clusterkit", spec) do |ext|
+  ext.lib_dir = "lib/clusterkit"
+  ext.source_pattern = "*.{rs,toml}"
+  ext.cross_compile = true
+  ext.cross_platform = %w[x86-mingw32 x64-mingw32 x86-linux x86_64-linux x86_64-darwin arm64-darwin]
+end
+task default: [:compile, :spec]
+# Documentation task
+begin
+  require "yard"
+  YARD::Rake::YardocTask.new do |t|
+    t.files = ["lib/**/*.rb"]
+    t.options = ["--no-private", "--readme", "README.md"]
+  end
+rescue LoadError
+  desc "YARD documentation task not available"
+  task :yard do
+    puts "YARD is not available. Please install it with: gem install yard"
+  end
+end
+# Benchmarking task
+desc "Run benchmarks"
+task :benchmark do
+  ruby "test/benchmark/benchmarks.rb"
+end
+# Console task for interactive testing
+desc "Open an interactive console with the gem loaded"
+task :console do
+  require "irb"
+  require "clusterkit"
+  ARGV.clear
+  IRB.start
+end
+# Rust-specific tasks
+namespace :rust do
+  desc "Run cargo fmt"
+  task :fmt do
+    Dir.chdir("ext/clusterkit") do
+      sh "cargo fmt"
+    end
+  end
+  desc "Run cargo clippy"
+  task :clippy do
+    Dir.chdir("ext/clusterkit") do
+      sh "cargo clippy -- -D warnings"
+    end
+  end
+  desc "Run cargo test"
+  task :test do
+    Dir.chdir("ext/clusterkit") do
+      sh "cargo test"
+    end
+  end
+end
+# Coverage task
+desc "Run specs with code coverage"
+task :coverage do
+  ENV['COVERAGE'] = 'true'
+  Rake::Task["spec"].invoke
+end
+# Coverage report task
+desc "Open coverage report in browser"
+task :"coverage:report" => :coverage do
+  if RUBY_PLATFORM =~ /darwin/
+    sh "open coverage/index.html"
+  elsif RUBY_PLATFORM =~ /linux/
+    sh "xdg-open coverage/index.html"
+  else
+    puts "Coverage report generated at coverage/index.html"
+  end
+end
+# CI task that runs all checks
+desc "Run all CI checks"
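+# `coverage` already invokes `spec` with COVERAGE set. Listing `spec` first would
+# run it without coverage, and Rake would then skip the second invocation.
+task ci: ["rust:fmt", "rust:clippy", "compile", "coverage"]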
+# Load custom rake tasks
+Dir.glob('lib/tasks/*.rake').each { |r| load r }
+# Test fixture generation
+namespace :fixtures do
+  desc "Generate real embedding fixtures for tests using red-candle"
+  task :generate_embeddings do
+    begin
+      require 'red-candle'
+      require 'json'
+      require 'fileutils'
+      puts "Loading embedding model..."
+      # Use a small, fast model for generating test embeddings
+      model = Candle::EmbeddingModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
+      # Create fixtures directory
+      fixtures_dir = File.join(__dir__, 'spec', 'fixtures', 'embeddings')
+      FileUtils.mkdir_p(fixtures_dir)
+      # Generate embeddings for different test scenarios
+      test_cases = {
+        # Basic test set - 15 sentences for general testing
+        'basic_15' => [
+          "The quick brown fox jumps over the lazy dog.",
+          "Machine learning is transforming how we process data.",
+          "Ruby is a dynamic programming language.",
+          "Natural language processing enables computers to understand text.",
+          "The weather today is sunny and warm.",
+          "Coffee is a popular morning beverage.",
+          "Books are a gateway to knowledge and imagination.",
+          "The ocean waves crash against the shore.",
+          "Technology continues to evolve rapidly.",
+          "Music has the power to evoke emotions.",
+          "The mountain peak was covered in snow.",
+          "Cooking is both an art and a science.",
+          "Exercise is important for maintaining health.",
+          "The stars shine brightly in the night sky.",
+          "History teaches us valuable lessons."
+        ],
+        # Clustered data - 3 distinct topics for testing clustering
+        'clusters_30' => [
+          # Technology cluster (10 items)
+          "Artificial intelligence is revolutionizing industries.",
+          "Python is widely used for data science.",
+          "Cloud computing provides scalable infrastructure.",
+          "Cybersecurity is crucial for protecting data.",
+          "Blockchain technology enables decentralized systems.",
+          "Quantum computing may solve complex problems.",
+          "APIs enable software integration.",
+          "DevOps practices improve deployment efficiency.",
+          "Microservices architecture promotes modularity.",
+          "Machine learning models require training data.",
+          # Nature cluster (10 items)
+          "The rainforest ecosystem is incredibly diverse.",
+          "Mountains are formed by tectonic activity.",
+          "Coral reefs support marine biodiversity.",
+          "Rivers flow from highlands to the sea.",
+          "Deserts have adapted to water scarcity.",
+          "Forests produce oxygen and absorb carbon.",
+          "The arctic ice is melting due to climate change.",
+          "Volcanoes release molten rock from Earth's interior.",
+          "Wetlands filter water naturally.",
+          "The savanna supports large herbivore populations.",
+          # Food/Cooking cluster (10 items)
+          "Italian cuisine features pasta and tomatoes.",
+          "Sushi is a traditional Japanese dish.",
+          "Baking bread requires yeast for fermentation.",
+          "Spices add flavor and aroma to dishes.",
+          "Vegetarian diets exclude meat products.",
+          "Wine is produced through grape fermentation.",
+          "Chocolate comes from cacao beans.",
+          "Grilling imparts a smoky flavor to food.",
+          "Fresh herbs enhance culinary creations.",
+          "Fermented foods contain beneficial probiotics."
+        ],
+        # Small set for minimum viable dataset testing
+        'minimal_6' => [
+          "Data science involves statistical analysis.",
+          "The sunset painted the sky orange.",
+          "Coffee beans are roasted before brewing.",
+          "Programming requires logical thinking.",
+          "Gardens need water and sunlight.",
+          "Music festivals bring people together."
+        ],
+        # Large set for high-dimensional testing
+        'large_100' => (1..100).map { |i| "This is test sentence number #{i} with some variation in content." }
+      }
+      puts "Generating embeddings for test cases..."
+      test_cases.each do |name, texts|
+        puts "  Generating #{name} (#{texts.length} texts)..."
+        # Generate embeddings
+        # Each embedding is a 1x384 tensor, so we need to extract the array
+        embeddings_array = texts.map { |text| model.embedding(text).to_a.first.to_a }
+        # Save as JSON
+        output_file = File.join(fixtures_dir, "#{name}.json")
+        File.write(output_file, JSON.pretty_generate({
+          'description' => "Test embeddings for #{name}",
+          'model' => 'sentence-transformers/all-MiniLM-L6-v2',
+          'dimension' => embeddings_array.first.length,
+          'count' => embeddings_array.length,
+          'embeddings' => embeddings_array
+        }))
+        puts "    Saved to #{output_file}"
+      end
+      puts "\nEmbedding fixtures generated successfully!"
+      puts "You can now use these in your specs with:"
+      puts '  embeddings = JSON.parse(File.read("spec/fixtures/embeddings/basic_15.json"))["embeddings"]'
+    rescue LoadError
+      puts "Error: red-candle gem not found."
+      puts "Please run: bundle install --with development"
+      exit 1
+    rescue => e
+      puts "Error generating embeddings: #{e.message}"
+      puts e.backtrace.first(5)
+      exit 1
+    end
+  end
+  desc "List available embedding fixtures"
+  task :list do
+    require 'json'
+    fixtures_dir = File.join(__dir__, 'spec', 'fixtures', 'embeddings')
+    if Dir.exist?(fixtures_dir)
+      files = Dir.glob(File.join(fixtures_dir, '*.json'))
+      if files.empty?
+        puts "No embedding fixtures found. Run 'rake fixtures:generate_embeddings' to create them."
+      else
+        puts "Available embedding fixtures:"
+        files.each do |file|
+          data = JSON.parse(File.read(file))
+          basename = File.basename(file)
+          puts "  #{basename}: #{data['count']} embeddings, #{data['dimension']}D"
+        end
+      end
+    else
+      puts "Fixtures directory not found. Run 'rake fixtures:generate_embeddings' to create fixtures."
+    end
+  end
+end
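
The fixture files written by `fixtures:generate_embeddings` carry `description`, `model`, `dimension`, `count`, and `embeddings` keys. A minimal sketch of a spec helper that loads one and checks it against its own metadata follows; the helper name `load_embedding_fixture` and its `fixtures_dir` argument are hypothetical and not part of the gem.

```ruby
require "json"

# Hypothetical spec helper (not part of clusterkit): loads a fixture written by
# `rake fixtures:generate_embeddings` and verifies it against its own metadata.
def load_embedding_fixture(name, fixtures_dir: File.join("spec", "fixtures", "embeddings"))
  path = File.join(fixtures_dir, "#{name}.json")
  raise ArgumentError, "Fixture not found: #{path}" unless File.exist?(path)

  data = JSON.parse(File.read(path))
  embeddings = data.fetch("embeddings")

  # The generator records count and dimension; fail loudly if the file drifted.
  unless embeddings.length == data.fetch("count")
    raise ArgumentError, "#{name}: expected #{data['count']} embeddings, got #{embeddings.length}"
  end
  unless embeddings.all? { |row| row.length == data.fetch("dimension") }
    raise ArgumentError, "#{name}: not every embedding has dimension #{data['dimension']}"
  end

  embeddings
end
```

In a spec this would be called as `load_embedding_fixture("basic_15")`, assuming the fixtures have already been generated.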

data/clusterkit.gemspec ADDED

@@ -0,0 +1,45 @@
+require_relative "lib/clusterkit/version"
+Gem::Specification.new do |spec|
+  spec.name = "clusterkit"
+  spec.version = ClusterKit::VERSION
+  spec.authors = ["Chris Petersen"]
+  spec.email = ["chris@petersen.io"]
+  spec.summary = "High-performance clustering and dimensionality reduction for Ruby"
+  spec.description = "A comprehensive clustering toolkit for Ruby, providing UMAP, PCA, K-means, HDBSCAN and more. Built on top of annembed and hdbscan Rust crates for blazing-fast performance."
+  spec.homepage = "https://github.com/cpetersen/clusterkit"
+  spec.license = "MIT"
+  spec.required_ruby_version = ">= 2.7.0"
+  spec.metadata["homepage_uri"] = spec.homepage
+  spec.metadata["source_code_uri"] = spec.homepage
+  spec.metadata["changelog_uri"] = "#{spec.homepage}/blob/main/CHANGELOG.md"
+  # Specify which files should be added to the gem when it is released.
+  spec.files = Dir.chdir(__dir__) do
+    `git ls-files -z`.split("\x0").reject do |f|
+      (f == __FILE__) || f.match(%r{\A(?:(?:bin|test|spec|features)/|\.(?:git|travis|circleci)|appveyor)})
+    end + Dir["ext/**/*.rs", "ext/**/*.toml"]
+  end
+  spec.bindir = "exe"
+  spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  spec.extensions = ["ext/clusterkit/extconf.rb"]
+  # Runtime dependencies
+  # Numo is optional but recommended for better performance
+  # spec.add_dependency "numo-narray", "~> 0.9"
+  # Development dependencies
+  spec.add_development_dependency "csv"
+  spec.add_development_dependency "rake", "~> 13.0"
+  spec.add_development_dependency "rake-compiler", "~> 1.2"
+  spec.add_development_dependency "rb_sys", "~> 0.9"
+  spec.add_development_dependency "rspec", "~> 3.0"
+  spec.add_development_dependency "simplecov", "~> 0.22"
+  spec.add_development_dependency "yard", "~> 0.9"
+  # For more information and examples about making a new gem, check out our
+  # guide at: https://bundler.io/guides/creating_gem.html
+end

data/docs/KNOWN_ISSUES.md ADDED

@@ -0,0 +1,130 @@
+# Known Issues
+## Summary
+This gem has three main categories of limitations:
+1. **Minimum dataset requirements** - UMAP needs at least 10 data points
+2. **Performance trade-offs** - Reproducible mode (with `random_seed`) runs serially and is ~25-35% slower than parallel mode
+3. **Uncatchable Rust panics** - Some error conditions crash the Ruby process and cannot be rescued
+## Minimum Dataset Size Requirement
+**Limitation**: UMAP requires at least 10 data points to function properly.
+**Reason**: UMAP needs sufficient data to construct a meaningful manifold approximation. With fewer than 10 points, the algorithm cannot create a reliable graph structure.
+**Workaround**:
+- Use PCA for datasets with fewer than 10 points
+- The `transform` method can handle smaller datasets once the model is fitted on adequate training data
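
The workaround above amounts to a size check before choosing a method. A minimal sketch, assuming a hypothetical helper `pick_reducer` that is not part of clusterkit and only returns a symbol naming the method to use:

```ruby
# Hypothetical helper (not part of clusterkit): pick a reduction method from
# the dataset size, using the 10-point UMAP minimum stated above.
UMAP_MIN_POINTS = 10

def pick_reducer(data, min_points: UMAP_MIN_POINTS)
  raise ArgumentError, "data must be a non-empty array of rows" if !data.is_a?(Array) || data.empty?

  data.length >= min_points ? :umap : :pca
end

pick_reducer(Array.new(6) { [0.0, 1.0] })   # => :pca
pick_reducer(Array.new(30) { [0.0, 1.0] })  # => :umap
```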
+## Performance vs Reproducibility Trade-off
+**Design Choice**: When using `random_seed` for reproducibility, UMAP uses serial processing which is approximately 25-35% slower than parallel processing.
+**Recommendation**:
+- For production workloads where speed is critical: omit the `random_seed` parameter
+- For research, testing, or when reproducibility is required: provide a `random_seed` value
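
One way to keep this choice explicit is to build the constructor options in a single place. The sketch below assumes a hypothetical helper `umap_options`; only the `random_seed` and `n_components` keys come from this document.

```ruby
# Hypothetical helper (not part of clusterkit): build UMAP keyword options,
# including random_seed only when reproducibility is requested.
def umap_options(reproducible:, seed: 42, n_components: 2)
  options = { n_components: n_components }
  # Providing a seed selects the slower serial path described above.
  options[:random_seed] = seed if reproducible
  options
end

fast_options = umap_options(reproducible: false)           # no :random_seed key
repeatable_options = umap_options(reproducible: true, seed: 7)
```

The resulting hash could then be splatted into the constructor, for example `ClusterKit::Dimensionality::UMAP.new(**repeatable_options)`.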
+## Rust Panic Conditions (Mostly Fixed)
+**Previous Issue**: The box_size assertion would panic and crash the Ruby process.
+**Current Status**: **FIXED** in `cpetersen/annembed:fix-box-size-panic` branch
+- The `"assertion failed: (*f).abs() <= box_size"` panic has been converted to a catchable error
+- Extreme value ranges are now handled gracefully through normalization
+- NaN/Infinite values are detected and reported with clear error messages
+**Remaining Uncatchable Errors**:
+- Array bounds violations (accessing out-of-bounds indices)
+- Some `.unwrap()` calls on `None` or `Err` values
+- These are much less common in normal usage
+**Best Practices** (still recommended):
+- Normalize your data to a reasonable range (e.g., 0-1) for best performance
+- Remove or handle NaN/Infinite values before processing
+- Use conservative parameters when data quality is uncertain
+**For more details**: See [RUST_ERROR_HANDLING.md](RUST_ERROR_HANDLING.md) for comprehensive documentation of error handling limitations.
+## Best Practices to Avoid Issues
+### Data Preprocessing
+Always preprocess your data before using UMAP:
+```ruby
+# 1. Remove rows containing NaN or infinite values
+data.reject! { |row| row.any? { |v| v.to_f.nan? || v.to_f.infinite? } }
+# 2. Check for extreme outliers (before normalizing, which would hide them)
+data.each do |row|
+  row.each do |val|
+    warn "Warning: Extreme value #{val} detected" if val.abs > 100
+  end
+end
+# 3. Normalize each row to the [0, 1] range (Float range avoids integer division)
+data = data.map do |row|
+  min, max = row.minmax
+  range = (max - min).to_f
+  row.map { |v| range > 0 ? (v - min) / range : 0.5 }
+end
+```
+### Safe Parameter Defaults
+Use conservative parameters when data quality is uncertain:
+```ruby
+# Safer configuration
+umap = ClusterKit::Dimensionality::UMAP.new(
+  n_components: 2,
+  n_neighbors: 5,        # Lower is safer (default: 15)
+  random_seed: 42,       # For reproducibility during debugging
+  nb_grad_batch: 10,     # Default is usually fine
+  nb_sampling_by_edge: 8 # Default is usually fine
+)
+```
+### Error Recovery Strategy
+Since some errors cannot be caught, implement a recovery strategy:
+```ruby
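+require "json"
+def safe_umap_transform(data, options = {})
+  # Save data to a temporary file before processing: if a Rust panic kills the
+  # process, `rescue` never runs, but the file survives for debugging
+  temp_file = "temp_umap_data_#{Time.now.to_i}.json"
+  File.write(temp_file, JSON.dump(data))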
+  begin
+    umap = ClusterKit::Dimensionality::UMAP.new(**options)
+    result = umap.fit_transform(data)
+    File.delete(temp_file) if File.exist?(temp_file)
+    result
+  rescue => e
+    puts "UMAP failed: #{e.message}"
+    puts "Data saved to #{temp_file} for debugging"
+    raise
+  end
+end
+```
+### Alternative for Problematic Data
+If UMAP consistently fails, use PCA as a fallback:
+```ruby
+def reduce_dimensions(data, n_components: 2)
+  umap = ClusterKit::Dimensionality::UMAP.new(n_components: n_components)
+  umap.fit_transform(data)
+rescue => e
+  warn "UMAP failed, falling back to PCA: #{e.message}"
+  pca = ClusterKit::Dimensionality::PCA.new(n_components: n_components)
+  pca.fit_transform(data)
+end
+```

data/docs/RUST_ERROR_HANDLING.md ADDED

@@ -0,0 +1,164 @@
+# Rust Layer Error Handling Documentation
+## Overview
+The clusterkit gem wraps Rust libraries (annembed and hnsw-rs) that report errors in two different ways. Errors returned as `Result` values can be caught and handled gracefully, while panics crash the Ruby process.
+## Error Categories
+### 1. Catchable Errors (Result<T, E> types)
+These errors use Rust's `Result` type and can be caught and converted to Ruby exceptions:
+| Error | Source | Location | Ruby Exception |
+|-------|--------|----------|----------------|
+| Isolated point | annembed | `kgraph_from_hnsw_all` | `ClusterKit::IsolatedPointError` |
+| Graph construction failure | annembed | `kgraph_from_hnsw_all` | `RuntimeError` with message |
+| Embedding failure | annembed | `embedder.embed()` | Generic `RuntimeError` |
+**Example from annembed:**
+```rust
+// This can be caught
+return Err(anyhow!(
+    "kgraph_from_hnsw_all: graph will not be connected, isolated point at layer {} , pos in layer : {}",
+    p_id.0, p_id.1
+));
+```
+**How we handle it in embedder.rs:**
+```rust
+let kgraph = annembed::fromhnsw::kgraph::kgraph_from_hnsw_all(&hnsw, self.n_neighbors)
+    .map_err(|e| Error::new(magnus::exception::runtime_error(), e.to_string()))?;
+```
+### 2. Uncatchable Errors (Panics/Assertions)
+These use Rust's `assert!` or `panic!` macros and CANNOT be caught. They will crash the Ruby process:
+| Error | Source | Location | Trigger Condition |
+|-------|--------|----------|-------------------|
+| ~~Box size assertion~~ | ~~annembed~~ | ~~`set_data_box`~~ | **FIXED in cpetersen/annembed:fix-box-size-panic** |
+| Array bounds | Various | Index operations | Accessing out-of-bounds indices |
+| Unwrap failures | Various | `.unwrap()` calls | Unwrapping `None` or `Err` |
+**Update (2025-08-19):** The box size assertion has been fixed in the `fix-box-size-panic` branch of cpetersen/annembed. It now returns a proper `Result<(), anyhow::Error>` that can be caught and handled gracefully:
+```rust
+// Previously (would panic):
+assert!((*f).abs() <= box_size);
+// Now (returns catchable error):
+if (*f).is_nan() || (*f).is_infinite() {
+    return Err(anyhow!("Data normalization failed..."));
+}
+```
+## Current Mitigation Strategies
+### 1. Ruby Layer Validation
+We validate data before sending to Rust to prevent common panic conditions:
+- Check for NaN and Infinite values
+- Ensure minimum dataset size (10 points)
+- Validate array dimensions consistency
+- Warn about extreme value ranges
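
The four checks above can be sketched in plain Ruby. This is a hypothetical validator, not the gem's actual implementation; the name `validate_umap_input` and the `extreme_threshold` default (borrowed from the `1e6` used in the recommendations later in this document) are assumptions.

```ruby
# Hypothetical validator (not the gem's actual code) covering the four checks
# listed above. Raises ArgumentError on invalid input; returns data otherwise.
def validate_umap_input(data, min_points: 10, extreme_threshold: 1e6)
  unless data.is_a?(Array) && data.all? { |row| row.is_a?(Array) }
    raise ArgumentError, "data must be an array of arrays"
  end
  if data.length < min_points
    raise ArgumentError, "UMAP needs at least #{min_points} points, got #{data.length}"
  end

  dims = data.first.length
  data.each_with_index do |row, i|
    raise ArgumentError, "row #{i} has #{row.length} values, expected #{dims}" unless row.length == dims

    row.each do |v|
      unless v.is_a?(Integer) || v.is_a?(Float)
        raise ArgumentError, "row #{i} contains a non-numeric value"
      end
      raise ArgumentError, "row #{i} contains NaN or Infinite" if v.to_f.nan? || v.to_f.infinite?
    end
  end

  # Extreme ranges only produce a warning, matching the list above.
  max_abs = data.flatten.map(&:abs).max
  warn "Warning: extreme value range (max |v| = #{max_abs})" if max_abs && max_abs > extreme_threshold

  data
end
```

Raising `ArgumentError` before the data reaches Rust keeps these failures rescuable from Ruby.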
+### 2. Parameter Adjustment
+We automatically adjust parameters to avoid error conditions:
+```ruby
+# Automatically reduce n_neighbors if too large for dataset
+adjusted_n_neighbors = [suggested_neighbors, max_neighbors].min
+```
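
How `suggested_neighbors` and `max_neighbors` are computed is not shown here. The sketch below is one plausible reading, assuming a point cannot have more neighbors than there are other points in the dataset; the helper name `adjust_n_neighbors` is hypothetical.

```ruby
# Hypothetical sketch: the gem's real computation of suggested_neighbors and
# max_neighbors is not shown in this document. Assumes the upper bound is the
# number of other points in the dataset (n_points - 1, but at least 1).
def adjust_n_neighbors(requested, n_points)
  max_neighbors = [n_points - 1, 1].max
  adjusted = [requested, max_neighbors].min
  warn "n_neighbors reduced from #{requested} to #{adjusted}" if adjusted < requested
  adjusted
end

adjust_n_neighbors(15, 10)   # => 9
adjust_n_neighbors(15, 100)  # => 15
```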
+### 3. Error Message Enhancement
+When we can catch Rust errors, we provide helpful Ruby-level error messages:
+```ruby
+case error_msg
+when /isolated point/i
+  raise ::ClusterKit::IsolatedPointError, <<~MSG
+    UMAP found isolated points in your data...
+    Solutions:
+    1. Reduce n_neighbors...
+    2. Remove outliers...
+  MSG
+end
+```
+## Previously Uncatchable Panic Conditions (Now Fixed)
+### 1. "assertion failed: (*f).abs() <= box_size" - **FIXED**
+**Location:** `annembed/src/embedder.rs::set_data_box`
+**Previous Issue:** Would panic and crash the Ruby process
+**Current Status:** Fixed in `cpetersen/annembed:fix-box-size-panic` branch
+- Now returns a catchable `anyhow::Error`
+- Detects NaN/Infinite values during normalization
+- Handles constant data (max_max = 0) gracefully
+- Extreme value ranges are normalized successfully
+**User-visible behavior:**
+- Previously: Ruby process would crash with assertion failure
+- Now: Raises a catchable Ruby exception with helpful error message
+## Recommendations for Users
+### To Avoid Crashes:
+1. **Always normalize your data:**
+   ```ruby
+   # Scale to [0, 1] range
+   data = data.map do |row|
+     min, max = row.minmax
+     range = (max - min).to_f  # Float avoids integer division
+     row.map { |v| range > 0 ? (v - min) / range : 0.5 }
+   end
+   ```
+2. **Check for extreme values:**
+   ```ruby
+   data.flatten.each do |val|
+     raise "Extreme value detected" if val.abs > 1e6
+   end
+   ```
+3. **Use conservative parameters for uncertain data:**
+   ```ruby
+   umap = ClusterKit::Dimensionality::UMAP.new(
+     n_neighbors: 5,  # Lower is safer
+     n_components: 2
+   )
+   ```
+## Future Improvements
+### Potential Solutions:
+1. **Modify annembed to use Result instead of assert:**
+   - Would require upstream changes to annembed
+   - Convert `assert!` to `if` checks that return `Err`
+2. **Add panic catching in Rust layer:**
+   - Use `std::panic::catch_unwind` (limited effectiveness)
+   - May not work for all panic types
+3. **Pre-validation in Rust:**
+   - Add more checks before calling annembed functions
+   - Predict and prevent panic conditions
+### Current Limitations:
+- Cannot catch Rust panics from Ruby
+- Some numerical instabilities are hard to predict
+- Trade-off between performance and safety checks
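
Because a panic cannot be rescued inside the Ruby process, one Ruby-side mitigation is to run the risky call in a child process. The sketch below is not part of clusterkit: `run_isolated` is a hypothetical helper, it relies on `Process.fork` (unavailable on Windows and JRuby), and it assumes the block's result can be serialized with `Marshal`.

```ruby
# Hypothetical sketch (not part of clusterkit): run a block in a forked child
# so that a process-killing failure only takes down the child.
# Requires Process.fork, which is unavailable on Windows and JRuby.
def run_isolated
  reader, writer = IO.pipe
  pid = Process.fork do
    reader.close
    writer.binmode
    writer.write(Marshal.dump(yield))
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end
  writer.close
  reader.binmode
  payload = reader.read # read before waiting so a large result cannot block the child
  reader.close
  _, status = Process.wait2(pid)
  raise "child process crashed (#{status.inspect})" unless status.success?

  Marshal.load(payload)
end
```

A call such as `run_isolated { umap.fit_transform(data) }` would then raise a rescuable `RuntimeError` in the parent if the child dies. The trade-off is that any fitted model state stays in the child, so this suits one-shot `fit_transform` calls better than fit-then-transform workflows.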
+## Testing Error Handling
+The test suite mocks Rust errors to verify our error handling logic works correctly. However, actual panic conditions cannot be tested without crashing the test process.
+See `spec/clusterkit/error_handling_spec.rb` for error handling tests.