clusterkit 0.1.0.pre.1
- checksums.yaml +7 -0
- data/.rspec +3 -0
- data/.simplecov +47 -0
- data/CHANGELOG.md +35 -0
- data/CLAUDE.md +226 -0
- data/Cargo.toml +8 -0
- data/Gemfile +17 -0
- data/IMPLEMENTATION_NOTES.md +143 -0
- data/LICENSE.txt +21 -0
- data/PYTHON_COMPARISON.md +183 -0
- data/README.md +499 -0
- data/Rakefile +245 -0
- data/clusterkit.gemspec +45 -0
- data/docs/KNOWN_ISSUES.md +130 -0
- data/docs/RUST_ERROR_HANDLING.md +164 -0
- data/docs/TEST_FIXTURES.md +170 -0
- data/docs/UMAP_EXPLAINED.md +362 -0
- data/docs/UMAP_TROUBLESHOOTING.md +284 -0
- data/docs/VERBOSE_OUTPUT.md +84 -0
- data/examples/hdbscan_example.rb +147 -0
- data/examples/optimal_kmeans_example.rb +96 -0
- data/examples/pca_example.rb +114 -0
- data/examples/reproducible_umap.rb +99 -0
- data/examples/verbose_control.rb +43 -0
- data/ext/clusterkit/Cargo.toml +25 -0
- data/ext/clusterkit/extconf.rb +4 -0
- data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
- data/ext/clusterkit/src/clustering.rs +267 -0
- data/ext/clusterkit/src/embedder.rs +413 -0
- data/ext/clusterkit/src/lib.rs +22 -0
- data/ext/clusterkit/src/svd.rs +112 -0
- data/ext/clusterkit/src/tests.rs +16 -0
- data/ext/clusterkit/src/utils.rs +33 -0
- data/lib/clusterkit/clustering/hdbscan.rb +177 -0
- data/lib/clusterkit/clustering.rb +213 -0
- data/lib/clusterkit/clusterkit.rb +9 -0
- data/lib/clusterkit/configuration.rb +24 -0
- data/lib/clusterkit/dimensionality/pca.rb +251 -0
- data/lib/clusterkit/dimensionality/svd.rb +144 -0
- data/lib/clusterkit/dimensionality/umap.rb +311 -0
- data/lib/clusterkit/dimensionality.rb +29 -0
- data/lib/clusterkit/hdbscan_api_design.rb +142 -0
- data/lib/clusterkit/preprocessing.rb +106 -0
- data/lib/clusterkit/silence.rb +42 -0
- data/lib/clusterkit/utils.rb +51 -0
- data/lib/clusterkit/version.rb +5 -0
- data/lib/clusterkit.rb +93 -0
- data/lib/tasks/visualize.rake +641 -0
- metadata +194 -0
@@ -0,0 +1,170 @@
# Test Fixtures for UMAP Testing

## Overview

To avoid the hanging issues that occur when testing UMAP with synthetic random data, we use real embeddings from text models as test fixtures. This ensures our tests are both reliable and realistic.

## Why Real Embeddings?

UMAP's initialization algorithm (`dmap_init`) expects data with manifold structure - the kind of structure that real embeddings naturally have. When given uniform random data, it can fail catastrophically, initializing all points to the same location and causing infinite loops.

Real text embeddings have:
- Natural clustering (semantically similar texts group together)
- Meaningful correlations between dimensions
- Appropriate value ranges (typically [-0.12, 0.12])
- Inherent manifold structure that UMAP is designed to discover
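When fixtures are unavailable, fallback data should still have this kind of cluster structure rather than uniform noise. A minimal sketch of such a generator in plain Ruby (the name and parameters are illustrative, not the gem's actual helper):

```ruby
# Generate n_points vectors of n_dims floats, grouped around a few
# random centroids, so the data has cluster structure instead of
# uniform noise (which can break UMAP's initialization).
def generate_clustered_test_data(n_points, n_dims, n_clusters: 3, seed: 42)
  rng = Random.new(seed)
  centroids = Array.new(n_clusters) { Array.new(n_dims) { rng.rand(-0.1..0.1) } }
  Array.new(n_points) do |i|
    center = centroids[i % n_clusters]
    center.map { |c| c + rng.rand(-0.02..0.02) } # small noise around the centroid
  end
end

data = generate_clustered_test_data(15, 30)
```

Points sampled this way cluster around a handful of centroids within the typical embedding value range, which gives UMAP's initialization enough structure to behave.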
## Generating Fixtures

### Prerequisites

1. Install the development dependencies:
   ```bash
   bundle install --with development
   ```

2. Generate the embedding fixtures:
   ```bash
   rake fixtures:generate_embeddings
   ```

This will create several fixture files in `spec/fixtures/embeddings/`:

- **basic_15.json** - 15 general sentences for basic testing
- **clusters_30.json** - 30 sentences in 3 distinct topic clusters (tech, nature, food)
- **minimal_6.json** - 6 sentences for minimum viable dataset testing
- **large_100.json** - 100 sentences for performance testing

### Fixture Format

Each fixture is a JSON file containing:
```json
{
  "description": "Test embeddings for basic_15",
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "dimension": 384,
  "count": 15,
  "embeddings": [
    [0.123, -0.045, ...], // 384-dimensional vectors
    ...
  ]
}
```
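Because each fixture records its own `count` and `dimension`, a loader can sanity-check the payload against the metadata before tests use it. A minimal sketch in plain Ruby (the helper name is illustrative; the specs themselves use `load_embedding_fixture`):

```ruby
require 'json'
require 'tempfile'

# Load a fixture and verify its metadata matches its payload.
def load_and_check_fixture(path)
  fixture = JSON.parse(File.read(path))
  embeddings = fixture['embeddings']
  raise "count mismatch" unless embeddings.length == fixture['count']
  raise "dimension mismatch" unless embeddings.all? { |e| e.length == fixture['dimension'] }
  embeddings
end

# Example with a tiny in-memory fixture standing in for basic_15.json
file = Tempfile.new(['basic_15', '.json'])
file.write({ 'description' => 'demo', 'model' => 'demo-model',
             'dimension' => 3, 'count' => 2,
             'embeddings' => [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]] }.to_json)
file.close
embeddings = load_and_check_fixture(file.path)
```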
## Using Fixtures in Tests

The specs automatically use fixtures when available:

```ruby
RSpec.describe "My UMAP test" do
  let(:test_data) do
    if fixtures_available?
      load_embedding_fixture('basic_15')    # Real embeddings
    else
      generate_structured_test_data(15, 30) # Fallback
    end
  end

  it "processes embeddings" do
    umap = ClusterKit::UMAP.new
    result = umap.fit_transform(test_data)
    # Test will use real embeddings, avoiding hanging issues
  end
end
```

### Available Helper Methods

- `fixtures_available?` - Check if any fixtures exist
- `load_embedding_fixture(name)` - Load all embeddings from a fixture
- `load_embedding_subset(name, count)` - Load first N embeddings
- `fixture_metadata(name)` - Get metadata about a fixture
- `generate_structured_test_data(n_points, n_dims)` - Fallback data generator

## Listing Available Fixtures

To see what fixtures are available:
```bash
rake fixtures:list
```

Output:
```
Available embedding fixtures:
  basic_15.json: 15 embeddings, 384D
  clusters_30.json: 30 embeddings, 384D
  minimal_6.json: 6 embeddings, 384D
  large_100.json: 100 embeddings, 384D
```

## When to Regenerate Fixtures

Regenerate fixtures when:
- Switching to a different embedding model
- Adding new test scenarios
- Fixtures become corrupted or deleted

The fixtures are deterministic for a given model and input text, so regenerating them should produce functionally equivalent embeddings.

## CI/CD Considerations

For CI environments, you have two options:

1. **Commit fixtures to git** (Recommended for small fixtures):
   ```bash
   git add spec/fixtures/embeddings/*.json
   git commit -m "Add embedding test fixtures"
   ```

2. **Generate fixtures in CI**:
   Add to your CI workflow:
   ```yaml
   - name: Generate test fixtures
     run: bundle exec rake fixtures:generate_embeddings
   ```

Note: Option 2 requires red-candle to be available in CI, which will download the embedding model on first use.

## Troubleshooting

### "Fixture file not found" Error
Run `rake fixtures:generate_embeddings` to create the fixtures.

### Tests Still Hanging
Ensure you're using the fixture data, not generating random data. Check that your test includes:
```ruby
if fixtures_available?
  load_embedding_fixture('basic_15')
```

### red-candle Not Found
Install development dependencies:
```bash
bundle install --with development
```

### Model Download Issues
The first run will download the embedding model (~90MB). Ensure you have internet connectivity and sufficient disk space.

## Adding New Fixtures

To add new test scenarios, edit `Rakefile` and add to the `test_cases` hash:

```ruby
test_cases = {
  'my_new_test' => [
    "First test sentence",
    "Second test sentence",
    # ...
  ]
}
```

Then regenerate:
```bash
rake fixtures:generate_embeddings
```

## Performance Note

Using real embeddings makes tests slightly slower than random data, but the reliability improvement is worth it. The fixtures are loaded from JSON, which is fast, and the UMAP algorithm actually converges properly instead of hanging.
@@ -0,0 +1,362 @@
# UMAP: Dimensionality Reduction for Software Developers

## What is UMAP?

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm that transforms high-dimensional data (e.g., 768-dimensional embeddings) into low-dimensional representations (typically 2D or 3D) while preserving the data's underlying structure. It's particularly effective at maintaining both local neighborhoods and global structure.

## Example: The Sphere Analogy

Consider points in 3D space that form a sphere. You could represent each point with (x, y, z) coordinates, but if you know they lie on a sphere's surface, you could more efficiently represent them with just latitude and longitude - reducing from 3 to 2 dimensions without losing information.

UMAP works similarly: it discovers that your high-dimensional data lies on a lower-dimensional manifold (like the sphere's surface), finds the parameters of that manifold, and maps points to this more efficient coordinate system.

**The key insight**: Just as latitude/longitude preserves relationships between points on Earth's surface, UMAP preserves relationships between points on the discovered manifold. The difference is that UMAP must first discover what shape the manifold is - it might be sphere-like, pretzel-shaped, or something more complex.

```ruby
# Example: 10,000 points on a "sphere" in 100D space
# The data is 100-dimensional, but actually lies on a 2D surface

# Generate data on a hypersphere with noise
# (generate_sphere_surface_in_100d is an illustrative placeholder)
points_100d = generate_sphere_surface_in_100d(n_points: 10000)

# UMAP discovers the 2D manifold structure
reducer = ClusterKit::UMAP.new(n_components: 2)
coords_2d = reducer.fit_transform(points_100d)

# coords_2d now contains the "latitude/longitude" equivalent
# for the discovered manifold - a 2D representation that
# preserves the essential structure of the 100D data
```

This is why UMAP is so effective with embeddings: even though embeddings are high-dimensional, they often lie on or near a much lower-dimensional manifold that captures the true relationships in the data.

## Core Algorithm Concept

UMAP operates on the principle that high-dimensional data lies on a lower-dimensional manifold. It constructs a topological representation of the data and then optimizes a low-dimensional layout to match this topology as closely as possible.

The algorithm makes two key assumptions:
1. The data is uniformly distributed on a Riemannian manifold
2. The manifold is locally connected

## How UMAP Works: Algorithmic Steps

### Step 1: Construct k-NN Graph
- For each data point, find its k nearest neighbors (typically k=15-50)
- Build a weighted graph where edge weights represent distances
- Uses approximate nearest neighbor algorithms for efficiency (e.g., RP-trees, NNDescent)

### Step 2: Compute Fuzzy Simplicial Set
- Convert the k-NN graph into a fuzzy topological representation
- Apply local scaling based on distance to nearest neighbors
- Create a symmetric graph by combining directed edges using fuzzy set union

### Step 3: Initialize Low-Dimensional Embedding
- Generate initial positions in the target dimension (usually 2D/3D)
- Can use spectral embedding for better initialization, or random placement

### Step 4: Optimize Layout via SGD
- Minimize cross-entropy between the high-dimensional and low-dimensional fuzzy representations
- Uses attractive forces for connected points, repulsive forces for non-connected points
- Typically runs for 200-500 iterations with learning rate decay
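Steps 1 and 2 can be illustrated in plain Ruby on a small dataset: build an exact k-NN graph, turn each point's neighbor distances into fuzzy membership weights, and symmetrize with a fuzzy union. This is a toy sketch of the idea only - it uses brute-force search and a simplified bandwidth (mean neighbor distance) where real UMAP binary-searches sigma:

```ruby
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Step 1: exact k-NN graph (real implementations use approximate search).
# Returns, for each point i, an array of k [neighbor_index, distance] pairs.
def knn_graph(points, k)
  points.each_index.map do |i|
    points.each_index.reject { |j| j == i }
          .map { |j| [j, euclidean(points[i], points[j])] }
          .min_by(k) { |(_, d)| d }
  end
end

# Step 2: fuzzy membership weights, symmetrized via fuzzy set union.
# rho_i is the distance to the nearest neighbor; sigma_i is simplified
# here to the mean neighbor distance.
def fuzzy_weights(points, k)
  directed = Hash.new(0.0)
  knn_graph(points, k).each_with_index do |nbrs, i|
    rho   = nbrs.map { |(_, d)| d }.min
    sigma = nbrs.sum { |(_, d)| d } / nbrs.length
    nbrs.each { |(j, d)| directed[[i, j]] = Math.exp(-[d - rho, 0.0].max / sigma) }
  end
  sym = {}
  directed.each_key do |(i, j)|
    a = directed[[i, j]]
    b = directed[[j, i]]
    sym[[i, j].sort] = a + b - a * b # fuzzy union of the two directions
  end
  sym
end

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
weights = fuzzy_weights(points, 2)
# weights maps undirected edges [i, j] to membership strengths in (0, 1]
```

Step 4 then lays out low-dimensional points so that strongly weighted pairs end up close together and unweighted pairs are pushed apart.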
## Technical Classification

**UMAP is:**
- A **non-linear dimensionality reduction technique**
- A **manifold learning algorithm**
- A **graph-based embedding method**

**Algorithm Type:** UMAP doesn't fit traditional ML model categories. It's a transformation algorithm that learns a mapping from high-dimensional to low-dimensional space. Once trained on a dataset, it can transform new points using the learned embedding space.

## Data Requirements

### Dataset Size
- **Minimum viable**: 500 data points
- **Recommended**: 2,000-10,000 points
- **Scales well to**: Millions of points

### Computational Complexity
- **Time complexity**: approximately O(N^1.14) for N data points
- **Memory complexity**: O(N) with optimizations
- **Typical runtime**: ~30 seconds for 10K points with 100 dimensions

### Input Format
- Dense numerical arrays (in Ruby, nested arrays of Floats)
- All points must have the same dimensionality
- Works best with normalized/standardized features
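These input requirements are easy to verify before calling `fit_transform`; a small defensive check in plain Ruby (the helper name is illustrative, not part of the gem's API):

```ruby
# Validate a dataset against UMAP's input requirements:
# dense, numeric, and rectangular (every row the same length).
def validate_umap_input!(data)
  raise ArgumentError, "data must be a non-empty array of rows" if !data.is_a?(Array) || data.empty?
  data.each_with_index do |row, i|
    raise ArgumentError, "row #{i} is not an array" unless row.is_a?(Array)
    raise ArgumentError, "row #{i} contains non-numeric values" unless row.all? { |v| v.is_a?(Numeric) }
  end
  dim = data.first.length
  data.each_with_index do |row, i|
    raise ArgumentError, "row #{i} has #{row.length} dims, expected #{dim}" unless row.length == dim
  end
  data
end

validate_umap_input!([[0.1, 0.2], [0.3, 0.4]]) # passes and returns the data
```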
## When to Use UMAP

### Optimal Use Cases

1. **Embedding Visualization**
   ```
   Input: 10,000 document embeddings (768 dimensions)
   Output: 2D coordinates for plotting
   Purpose: Visualize document clusters and relationships
   ```

2. **Clustering Preprocessing**
   ```
   Input: High-dimensional feature vectors
   Output: 10-50 dimensional representations
   Purpose: Improve clustering algorithm performance and speed
   ```

3. **Anomaly Detection**
   ```
   Input: Normal behavior embeddings
   Output: 2D projection showing outliers
   Purpose: Identify points that don't fit the manifold structure
   ```

4. **Feature Engineering**
   ```
   Input: Raw high-dimensional features
   Output: Lower-dimensional features for downstream ML
   Purpose: Capture non-linear relationships in fewer dimensions
   ```

## Limitations and Alternatives

### UMAP Limitations

1. **Non-deterministic**: Results vary between runs due to:
   - Random initialization
   - Stochastic gradient descent
   - Approximate nearest neighbor search

2. **Distance Distortion**: UMAP preserves topology, not distances
   - Distances in UMAP space don't correspond to original distances
   - Density can be misleading (denser areas might just be artifacts)

3. **Parameter Sensitivity**: Results heavily depend on:
   - `n_neighbors`: Controls local vs global structure balance
   - `min_dist`: Controls cluster tightness
   - `metric`: Distance function choice is crucial for certain data types

4. **No Inverse Transform**: Generally cannot reconstruct original data from UMAP coordinates

### When to Use Alternatives

| Scenario | Use Instead | Reason |
|----------|-------------|--------|
| Need exact variance preservation | PCA | PCA preserves maximum variance in linear projections |
| Need deterministic results | PCA, Kernel PCA | These provide reproducible transformations |
| Small dataset (<500 points) | PCA, MDS | UMAP needs sufficient data to learn the manifold |
| Need inverse transformation | Autoencoders | Can reconstruct original from embedding |
| Purely categorical data | MCA, FAMD | Designed for categorical/mixed data types |
| Need interpretable dimensions | Factor Analysis, PCA | Dimensions have meaningful interpretations |
| Time series data | DTW + MDS | Respects temporal dependencies |

## Data That Degrades UMAP Performance

### 1. Extreme Sparsity
```ruby
# Problem: 99.9% zeros in data
sparse_data = Array.new(1000) { Array.new(1000) { rand < 0.001 ? 1 : 0 } }
# Solution: Use PCA/SVD first or specialized sparse methods
```

### 2. Curse of Dimensionality
```ruby
# Problem: Dimensions >> samples
data = Array.new(100) { Array.new(10000) { rand } } # 100 samples, 10000 dimensions
# Solution: Apply PCA first to reduce to ~50 dimensions
```

### 3. Multiple Disconnected Manifolds
```ruby
# Problem: Completely separate clusters with no connections
cluster1 = Array.new(500) { Array.new(100) { rand(-3.0..3.0) } }
cluster2 = Array.new(500) { Array.new(100) { rand(-3.0..3.0) + 1000 } } # Far separated
# Result: UMAP may arbitrarily position disconnected components
```

### 4. Pure Noise
```ruby
# Problem: No underlying structure
random_data = Array.new(1000) { Array.new(100) { rand(-3.0..3.0) } }
# Result: Meaningless projection, artificial patterns
```

## Key Parameters

### Essential Parameters
```ruby
umap_model = ClusterKit::UMAP.new(
  n_components: 2,  # Output dimensions
  n_neighbors: 15,  # Number of neighbors (15-50 typical)
  random_seed: 42   # For reproducibility
)
# Note: min_dist and metric parameters may be configurable in future versions
```

### Parameter Effects
- **n_neighbors**:
  - Low (5-15): Preserves local structure, detailed clusters
  - High (50-200): Preserves global structure, broader patterns

- **min_dist**:
  - Near 0: Tight clumps, allows overlapping
  - Near 1: Even distribution, preserves more global structure

- **metric**: Critical for specific data types
  - `euclidean`: Standard for continuous data
  - `cosine`: Text embeddings, directional data
  - `manhattan`: Robust to outliers
  - `hamming`: Binary/categorical features
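A useful fact when only a euclidean metric is available: for unit-normalized vectors, squared euclidean distance equals `2 * (1 - cosine similarity)`, so normalizing embeddings first effectively gives you cosine geometry. A quick check in plain Ruby:

```ruby
# Verify that for unit vectors, ||a - b||^2 == 2 * (1 - cos(a, b)).
def normalize(v)
  mag = Math.sqrt(v.sum { |x| x**2 })
  v.map { |x| x / mag }
end

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
end

def sq_euclidean(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

a = normalize([1.0, 2.0, 3.0])
b = normalize([-2.0, 0.5, 1.0])
lhs = sq_euclidean(a, b)
rhs = 2 * (1 - cosine_similarity(a, b))
# lhs and rhs agree to floating-point precision
```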
## Implementation Considerations

### Performance Optimization
```ruby
# For large datasets (>50K points)
umap_model = ClusterKit::UMAP.new(
  n_neighbors: 15,
  n_components: 2,
  random_seed: 42
)
# The Rust backend automatically optimizes for performance
# using efficient algorithms like HNSW for nearest neighbor search
```

### Supervised vs Unsupervised
```ruby
# Unsupervised: Find natural structure
embedding = umap_model.fit_transform(data)

# Note: Supervised UMAP (using labels to guide the projection)
# may be available in future versions of clusterkit
```

## Practical Example: Document Embedding Pipeline

```ruby
require 'clusterkit'
require 'candle'

# Typical workflow for document embeddings
documents = load_documents # 10,000 documents

# Initialize embedding model using red-candle's from_pretrained
embedding_model = Candle::EmbeddingModel.from_pretrained(
  "jinaai/jina-embeddings-v2-base-en",
  device: Candle::Device.best # Automatically use GPU if available
)

# Generate embeddings for all documents
# The embedding method returns normalized embeddings by default with "pooled_normalized"
embeddings = documents.map do |doc|
  embedding_model.embedding(doc).to_a # Convert tensor to array
end
# Shape: Array of 10000 arrays, each with 768 floats

# Embeddings are already normalized when using pooled_normalized (default)
# But if you used a different pooling method, normalize like this:
# embeddings_normalized = embeddings.map do |embedding|
#   magnitude = Math.sqrt(embedding.sum { |x| x**2 })
#   embedding.map { |x| x / magnitude }
# end

# Dimensionality reduction
reducer = ClusterKit::UMAP.new(
  n_neighbors: 30,
  n_components: 2,
  random_seed: 42
)

# Fit and transform
coords_2d = reducer.fit_transform(embeddings)

# Can now transform new documents
new_doc = "New document"
new_embedding = embedding_model.embedding(new_doc).to_a
new_coords = reducer.transform([new_embedding])

# Alternative: Using different pooling methods
# cls_embedding = embedding_model.embedding(doc, pooling_method: "cls")
# pooled_embedding = embedding_model.embedding(doc, pooling_method: "pooled")
```

## Common Pitfalls

1. **Using raw distances between UMAP points for similarity**
   - Wrong: `distance = Math.sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)`
   - Right: Use original high-dimensional distances

2. **Not scaling features before UMAP**
   - Features with larger scales dominate distance calculations
   - Always normalize or standardize first

3. **Over-interpreting visual clusters**
   - Clusters in UMAP don't always mean distinct groups
   - Validate with clustering algorithms on original data

4. **Forgetting to normalize embeddings for cosine similarity**
   - Text embeddings typically use cosine distance
   - Normalize vectors before UMAP when working with embeddings

5. **Applying to insufficient data**
   - UMAP needs enough points to learn manifold structure
   - Consider simpler methods for small datasets (< 500 points)
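Pitfall 2 is cheap to avoid: z-score each feature column before fitting so no single feature dominates the distance calculations. A minimal standardizer in plain Ruby:

```ruby
# Z-score standardization: each column ends up with mean 0 and unit variance.
def standardize(data)
  n = data.length.to_f
  dims = data.first.length
  means = Array.new(dims) { |j| data.sum { |row| row[j] } / n }
  stds = Array.new(dims) do |j|
    var = data.sum { |row| (row[j] - means[j])**2 } / n
    var.zero? ? 1.0 : Math.sqrt(var) # guard against constant columns
  end
  data.map { |row| row.each_with_index.map { |v, j| (v - means[j]) / stds[j] } }
end

raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(raw)
# Both columns of `scaled` now contribute equally to euclidean distances.
```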
## Complete Ruby Example with Red-Candle

```ruby
require 'candle'
require 'clusterkit'

# Sample documents for clustering
documents = [
  # Technology cluster
  "Machine learning algorithms process data efficiently",
  "Neural networks enable deep learning applications",
  "Artificial intelligence transforms modern software",

  # Food cluster
  "Italian pasta dishes are delicious and varied",
  "Fresh vegetables make healthy salad options",
  "Chocolate desserts satisfy sweet cravings",

  # Sports cluster
  "Basketball players need excellent coordination",
  "Marathon runners train for endurance events",
  "Tennis matches require mental focus"
]

# Initialize embedding model using from_pretrained
# Model type is auto-detected from the model_id
model = Candle::EmbeddingModel.from_pretrained(
  "sentence-transformers/all-MiniLM-L6-v2"
)

# Generate embeddings (already normalized with pooled_normalized)
embeddings = documents.map { |doc| model.embedding(doc).to_a }

# Reduce to 2D for visualization
umap = ClusterKit::UMAP.new(
  n_components: 2,
  n_neighbors: 5, # Small dataset, use fewer neighbors
  random_seed: 42
)

coords_2d = umap.fit_transform(embeddings)

# Display results
documents.each_with_index do |doc, i|
  x, y = coords_2d[i]
  puts "#{doc[0..30]}... => [#{x.round(3)}, #{y.round(3)}]"
end

# The 2D coordinates should show three distinct clusters
# corresponding to technology, food, and sports topics

# Advanced: Save and load UMAP models for reuse
umap.save("models/document_umap.model")
loaded_umap = ClusterKit::UMAP.load("models/document_umap.model")
```

## Summary

UMAP is a powerful non-linear dimensionality reduction algorithm particularly suited for visualizing and preprocessing high-dimensional data like embeddings. It excels at preserving both local and global structure through graph-based manifold learning. While it requires sufficient data and careful parameter tuning, it generally outperforms alternatives like t-SNE in speed and quality for most embedding visualization tasks. The key is understanding that UMAP learns a topological representation, not a distance-preserving projection, and interpreting results accordingly.