clusterkit 0.1.0.pre.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (49)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.simplecov +47 -0
  4. data/CHANGELOG.md +35 -0
  5. data/CLAUDE.md +226 -0
  6. data/Cargo.toml +8 -0
  7. data/Gemfile +17 -0
  8. data/IMPLEMENTATION_NOTES.md +143 -0
  9. data/LICENSE.txt +21 -0
  10. data/PYTHON_COMPARISON.md +183 -0
  11. data/README.md +499 -0
  12. data/Rakefile +245 -0
  13. data/clusterkit.gemspec +45 -0
  14. data/docs/KNOWN_ISSUES.md +130 -0
  15. data/docs/RUST_ERROR_HANDLING.md +164 -0
  16. data/docs/TEST_FIXTURES.md +170 -0
  17. data/docs/UMAP_EXPLAINED.md +362 -0
  18. data/docs/UMAP_TROUBLESHOOTING.md +284 -0
  19. data/docs/VERBOSE_OUTPUT.md +84 -0
  20. data/examples/hdbscan_example.rb +147 -0
  21. data/examples/optimal_kmeans_example.rb +96 -0
  22. data/examples/pca_example.rb +114 -0
  23. data/examples/reproducible_umap.rb +99 -0
  24. data/examples/verbose_control.rb +43 -0
  25. data/ext/clusterkit/Cargo.toml +25 -0
  26. data/ext/clusterkit/extconf.rb +4 -0
  27. data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
  28. data/ext/clusterkit/src/clustering.rs +267 -0
  29. data/ext/clusterkit/src/embedder.rs +413 -0
  30. data/ext/clusterkit/src/lib.rs +22 -0
  31. data/ext/clusterkit/src/svd.rs +112 -0
  32. data/ext/clusterkit/src/tests.rs +16 -0
  33. data/ext/clusterkit/src/utils.rs +33 -0
  34. data/lib/clusterkit/clustering/hdbscan.rb +177 -0
  35. data/lib/clusterkit/clustering.rb +213 -0
  36. data/lib/clusterkit/clusterkit.rb +9 -0
  37. data/lib/clusterkit/configuration.rb +24 -0
  38. data/lib/clusterkit/dimensionality/pca.rb +251 -0
  39. data/lib/clusterkit/dimensionality/svd.rb +144 -0
  40. data/lib/clusterkit/dimensionality/umap.rb +311 -0
  41. data/lib/clusterkit/dimensionality.rb +29 -0
  42. data/lib/clusterkit/hdbscan_api_design.rb +142 -0
  43. data/lib/clusterkit/preprocessing.rb +106 -0
  44. data/lib/clusterkit/silence.rb +42 -0
  45. data/lib/clusterkit/utils.rb +51 -0
  46. data/lib/clusterkit/version.rb +5 -0
  47. data/lib/clusterkit.rb +93 -0
  48. data/lib/tasks/visualize.rake +641 -0
  49. metadata +194 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: fd025da9b7f5c97e370d05fb1062484cb99b0aaaa4a7c310eb27df78336c91b4
4
+ data.tar.gz: 7665cf847930bc47cb04adc1f466ba09ea33c14da5f242434b08493047945f91
5
+ SHA512:
6
+ metadata.gz: a8db6d4738ad99a20887aef90398d4163bdd8bc47bbdb8dda74496adc105602051999e50d2bc6003b4981d16b8121151d71cf55618691262b4a092f3c46a2545
7
+ data.tar.gz: 1a6e43a00d19d7fdaf35be6deffa5215e994a1f013b140aa23ac9bbb9d982facde94d710a894cba39e1fbd39bccd7c020b86a99e9d32d2aa256f717844623e5e
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --require spec_helper
2
+ --format documentation
3
+ --color
data/.simplecov ADDED
@@ -0,0 +1,47 @@
1
+ # frozen_string_literal: true
2
+
3
+ SimpleCov.configure do
4
+ # Add custom groups
5
+ add_group 'Core', 'lib/annembed/embedder'
6
+ add_group 'UMAP', 'lib/annembed/umap'
7
+ add_group 'Utils', 'lib/annembed/utils'
8
+ add_group 'Configuration', 'lib/annembed/config'
9
+
10
+ # Track branches as well as lines
11
+ enable_coverage :branch
12
+
13
+ # Set thresholds (temporarily disabled to diagnose issues)
14
+ # minimum_coverage line: 50, branch: 40
15
+
16
+ # Don't refuse to run tests if coverage drops (during development)
17
+ # refuse_coverage_drop
18
+
19
+ # Maximum coverage drop allowed
20
+ maximum_coverage_drop 5
21
+
22
+ # Configure output directory
23
+ coverage_dir 'coverage'
24
+
25
+ # Include all library files in the report, even ones never loaded during the run
26
+ track_files 'lib/**/*.rb'
27
+
28
+ # Custom filters
29
+ add_filter do |source_file|
30
+ # Skip version file
31
+ source_file.filename.include?('version.rb')
32
+ end
33
+
34
+ # Use the HTML formatter for the coverage report
35
+ SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter.new([
36
+ SimpleCov::Formatter::HTMLFormatter,
37
+ ])
38
+
39
+ # Set project name
40
+ command_name 'RSpec'
41
+
42
+ # Merge results from multiple test runs
43
+ use_merging true
44
+
45
+ # Set result cache timeout (in seconds)
46
+ merge_timeout 3600
47
+ end
data/CHANGELOG.md ADDED
@@ -0,0 +1,35 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [Unreleased]
9
+
10
+ ### Added
11
+ - Clean, scikit-learn-like API for UMAP
12
+ - `fit(data)` - Train the model
13
+ - `transform(data)` - Transform new data
14
+ - `fit_transform(data)` - Train and transform in one step
15
+ - `fitted?` - Check if model is trained
16
+ - `save(path)` - Save trained model
17
+ - `load(path)` - Load trained model
18
+ - Model persistence with save/load functionality
19
+ - Data export/import utilities for caching results
20
+ - Comprehensive test suite for UMAP interface
21
+ - Detailed README with practical examples
22
+
23
+ ### Changed
24
+ - Complete API redesign to follow ML library conventions
25
+ - Removed confusing `save_embeddings`/`load_embeddings` methods
26
+ - Separated model operations from data caching concerns
27
+
28
+ ### Fixed
29
+ - Intermittent test failures with boundary assertions
30
+ - Data normalization issues with extreme values
31
+
32
+ ## [0.1.0] - TBD
33
+
34
+ ### Added
35
+ - Initial release with basic embedding functionality
data/CLAUDE.md ADDED
@@ -0,0 +1,226 @@
1
+ # CLAUDE.md - clusterkit Project Guide
2
+
3
+ ## Project Vision
4
+ clusterkit brings high-performance dimensionality reduction and embedding algorithms to Ruby by wrapping the annembed Rust crate. This gem is part of the ruby-nlp ecosystem, which aims to provide Ruby developers with native machine learning and NLP capabilities through best-in-breed Rust implementations.
5
+
6
+ ## Core Principles
7
+
8
+ ### 1. Ruby-First Design
9
+ - Provide an idiomatic Ruby API that feels natural to Ruby developers
10
+ - Follow Ruby naming conventions (snake_case methods, proper use of symbols)
11
+ - Support Ruby's duck typing while maintaining type safety at the Rust boundary
12
+ - Integrate seamlessly with Ruby's data science ecosystem
13
+
14
+ ### 2. Performance Without Compromise
15
+ - Leverage Rust's performance for compute-intensive operations
16
+ - Use Magnus for zero-copy data transfer where possible
17
+ - Enable parallelization by default
18
+ - Provide progress feedback for long-running operations
19
+
20
+ ### 3. Ecosystem Integration
21
+ - Primary support for Numo::NArray (the NumPy of Ruby)
22
+ - Work well with other ruby-nlp gems (lancelot, red-candle)
23
+ - Support common Ruby data formats and visualization tools
24
+ - Play nice with Jupyter notebooks (iruby)
25
+
26
+ ## Technical Guidelines
27
+
28
+ ### Magnus Best Practices
29
+
30
+ 1. **Memory Management**
31
+ ```rust
32
+ // Good: Let Magnus handle Ruby object lifecycle
33
+ let array: RArray = data.try_convert()?;
34
+
35
+ // Avoid: Manual memory management
36
+ // Don't try to manually free Ruby objects
37
+ ```
38
+
39
+ 2. **Error Handling**
40
+ ```rust
41
+ // Always wrap errors properly
42
+ use magnus::{exception, Error};
43
+
44
+ fn risky_operation() -> Result<RArray, Error> {
45
+ annembed_call()
46
+ .map_err(|e| Error::new(exception::runtime_error(), e.to_string()))
47
+ }
48
+ ```
49
+
50
+ 3. **Type Conversions**
51
+ ```rust
52
+ // Define clear conversion traits
53
+ impl TryFrom<Value> for EmbedConfig {
54
+ type Error = Error;
55
+ // Robust conversion with good error messages
56
+ }
57
+ ```
58
+
59
+ ### Ruby API Design
60
+
61
+ 1. **Method Naming**
62
+ - Use Ruby conventions: `fit_transform`, not `fitTransform`
63
+ - Predicates end with `?`: `converged?`, `fitted?`
64
+ - Dangerous methods end with `!`: `normalize!`
65
+
66
+ 2. **Parameter Handling**
67
+ ```ruby
68
+ # Good: Use keyword arguments with defaults
69
+ def initialize(method: :umap, n_components: 2, **options)
70
+
71
+ # Avoid: Positional arguments for configuration
72
+ def initialize(method, n_components, min_dist, spread, ...)
73
+ ```
74
+
75
+ 3. **Return Values**
76
+ - Return Ruby arrays for small results
77
+ - Return Numo::NArray for large matrices
78
+ - Support multiple return formats via options
79
+
80
+ ### Performance Considerations
81
+
82
+ 1. **Data Transfer**
83
+ - Minimize copying between Ruby and Rust
84
+ - Use view/slice operations when possible
85
+ - Support streaming for large datasets
86
+
87
+ 2. **Threading**
88
+ - Respect Ruby's GVL (Global VM Lock)
89
+ - Release GVL for long-running Rust operations
90
+ - Use Rust's parallelization, not Ruby threads
91
+
92
+ 3. **Memory Usage**
93
+ - Provide memory estimates for large operations
94
+ - Support out-of-core processing for huge datasets
95
+ - Clear progress indication for long operations
96
+
97
+ ## Code Style Guidelines
98
+
99
+ ### Rust Side
100
+ - Follow Rust standard style (rustfmt)
101
+ - Comprehensive error types with context
102
+ - Document all public functions
103
+ - Use type aliases for clarity
104
+
105
+ ### Ruby Side
106
+ - Follow Ruby Style Guide
107
+ - Use YARD documentation format
108
+ - Provide type signatures where helpful
109
+ - Include usage examples in docs
110
+
111
+ ## Testing Philosophy
112
+
113
+ 1. **Comprehensive Coverage**
114
+ - Unit tests for all public methods
115
+ - Integration tests with real datasets
116
+ - Performance benchmarks
117
+ - Memory leak tests
118
+
119
+ 2. **Test Data**
120
+ - Use standard ML datasets (Iris, MNIST samples)
121
+ - Generate synthetic data for edge cases
122
+ - Test with various Ruby object types
123
+
124
+ 3. **Platform Testing**
125
+ - Test on multiple Ruby versions
126
+ - Test on different operating systems
127
+ - Verify precompiled gem distribution
128
+
129
+ ## Documentation Standards
130
+
131
+ 1. **README**
132
+ - Clear installation instructions
133
+ - Quick start example that works
134
+ - Link to full documentation
135
+ - Performance comparisons
136
+
137
+ 2. **API Documentation**
138
+ - Every public method documented
139
+ - Parameter types and ranges specified
140
+ - Return values clearly described
141
+ - Usage examples for complex methods
142
+
143
+ 3. **Tutorials**
144
+ - Jupyter notebook examples
145
+ - Common use case walkthroughs
146
+ - Integration examples with other gems
147
+
148
+ ## Common Patterns
149
+
150
+ ### Configuration Objects
151
+ ```ruby
152
+ # Prefer configuration objects over many parameters
153
+ config = Annembed::Config.new(
154
+ method: :umap,
155
+ n_neighbors: 15,
156
+ min_dist: 0.1
157
+ )
158
+ embedder = Annembed::Embedder.new(config)
159
+ ```
160
+
161
+ ### Progress Callbacks
162
+ ```ruby
163
+ # Support progress monitoring
164
+ embedder.on_progress do |iteration, total|
165
+ puts "Progress: #{iteration}/#{total}"
166
+ end
167
+ ```
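The callback pattern above could be wired up roughly as follows. This is a minimal sketch under stated assumptions: `ProgressDemo` and its `run` method are hypothetical names used for illustration and are not part of clusterkit or annembed.

```ruby
# Hypothetical illustration of an on_progress hook; not the gem's implementation.
class ProgressDemo
  def initialize(total)
    @total = total
    @callback = nil
  end

  # Store the block so it can be invoked once per iteration.
  def on_progress(&block)
    @callback = block
    self
  end

  # Stand-in for a long-running operation; returns the number of iterations run.
  def run
    1.upto(@total) do |iteration|
      # Real per-iteration work would happen here.
      @callback&.call(iteration, @total)
    end
    @total
  end
end

demo = ProgressDemo.new(3)
demo.on_progress { |iteration, total| puts "Progress: #{iteration}/#{total}" }
demo.run
```

Using the safe-navigation call (`&.`) keeps the callback optional, so callers that never register a block pay no cost.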
168
+
169
+ ### Flexible Input/Output
170
+ ```ruby
171
+ # Accept multiple input formats
172
+ embedder.fit_transform(data) # Array, NArray, or CSV path
173
+
174
+ # Support different output formats
175
+ result = embedder.transform(data, output: :array) # Ruby Array
176
+ result = embedder.transform(data, output: :narray) # Numo::NArray
177
+ ```
178
+
179
+ ## Development Workflow
180
+
181
+ 1. **Branch Strategy**
182
+ - `main` - stable release
183
+ - `develop` - integration branch
184
+ - `feature/*` - new features
185
+ - `fix/*` - bug fixes
186
+
187
+ 2. **Release Process**
188
+ - Version bump in version.rb
189
+ - Update CHANGELOG.md
190
+ - Run full test suite
191
+ - Build precompiled gems
192
+ - Tag release
193
+ - Push to RubyGems
194
+
195
+ 3. **Continuous Integration**
196
+ - Run tests on each push
197
+ - Build gems for multiple platforms
198
+ - Check documentation building
199
+ - Performance regression tests
200
+
201
+ ## Future Considerations
202
+
203
+ 1. **GPU Support**
204
+ - Monitor annembed for GPU features
205
+ - Plan bindings if GPU support is added
206
+ - Consider alternative GPU libraries
207
+
208
+ 2. **Web Integration**
209
+ - Consider Rails integration
210
+ - WebAssembly compilation?
211
+ - REST API wrapper?
212
+
213
+ 3. **Visualization**
214
+ - Built-in plotting helpers?
215
+ - Export to common formats
216
+ - Interactive visualizations?
217
+
218
+ ## Getting Help
219
+
220
+ When implementing new features:
221
+ 1. Check existing patterns in lancelot and red-candle
222
+ 2. Consult annembed documentation
223
+ 3. Ask in ruby-nlp discussions
224
+ 4. Profile before optimizing
225
+
226
+ Remember: The goal is to make advanced embedding algorithms accessible and performant for Ruby developers while maintaining the simplicity and elegance that makes Ruby special.
data/Cargo.toml ADDED
@@ -0,0 +1,8 @@
1
+ [workspace]
2
+ members = ["ext/clusterkit"]
3
+ resolver = "2"
4
+
5
+ [profile.release]
6
+ opt-level = 3
7
+ lto = true
8
+ codegen-units = 1
data/Gemfile ADDED
@@ -0,0 +1,17 @@
1
+ source "https://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in clusterkit.gemspec
4
+ gemspec
5
+
6
+ # Test-only dependencies
7
+ group :test do
8
+ # Optional: For comparing with Python implementations
9
+ gem "pycall", "~> 1.4", require: false
10
+ end
11
+
12
+ # Development dependencies for generating test fixtures
13
+ group :development do
14
+ # For generating real embeddings to use as test fixtures
15
+ # This avoids the hanging issues with random test data
16
+ gem "red-candle", "~> 1.0", require: false
17
+ end
data/IMPLEMENTATION_NOTES.md ADDED
@@ -0,0 +1,143 @@
1
+ # Implementation Notes for annembed-ruby
2
+
3
+ ## Architecture Overview
4
+
5
+ The gem follows a three-layer architecture:
6
+
7
+ 1. **Ruby Layer** (`lib/annembed/`): User-facing API with Ruby idioms
8
+ 2. **Magnus Bridge** (`ext/annembed_ruby/src/`): Type conversions and bindings
9
+ 3. **annembed Core**: The underlying Rust embedding library
10
+
11
+ ## Key Implementation Challenges
12
+
13
+ ### 1. Data Type Conversions
14
+
15
+ The main challenge is efficiently converting between Ruby and Rust types:
16
+
17
+ ```
18
+ Ruby Array/Numo::NArray ↔ ndarray::Array2<f32> ↔ annembed matrices
19
+ ```
20
+
21
+ Consider using views instead of copies where possible.
22
+
23
+ ### 2. Memory Management
24
+
25
+ - Magnus handles Ruby GC integration
26
+ - Need to be careful with large matrices
27
+ - Consider streaming for datasets > available RAM
28
+
29
+ ### 3. Progress Reporting
30
+
31
+ annembed operations can be long-running. Options:
32
+ 1. Polling from Ruby side (requires thread)
33
+ 2. Callback from Rust (needs GVL management)
34
+ 3. Async/await pattern (complex but clean)
35
+
36
+ ### 4. Error Handling
37
+
38
+ Map annembed errors to Ruby exceptions:
39
+ - `annembed::Error` → `Annembed::Error`
40
+ - Panic → `RuntimeError` (avoid panics!)
41
+ - Invalid input → `ArgumentError`
42
+
43
+ ## annembed API Mapping
44
+
45
+ ### Core Types
46
+ - `EmbedderT` - Main embedding trait
47
+ - `HnswParams` - HNSW graph parameters
48
+ - `EmbedderParams` - Algorithm-specific params
49
+ - `GraphProjection` - For graph-based methods
50
+
51
+ ### Key Functions
52
+ ```rust
53
+ // Main embedding function
54
+ pub fn get_embedder(
55
+ data: &Array2<f32>,
56
+ params: EmbedderParams,
57
+ hnsw_params: HnswParams
58
+ ) -> Result<Box<dyn EmbedderT>>
59
+
60
+ // The embedder trait
61
+ pub trait EmbedderT {
62
+ fn embed(&self) -> Result<Array2<f32>>;
63
+ fn get_hnsw(&self) -> &Hnsw<f32, DistL2>;
64
+ }
65
+ ```
66
+
67
+ ## Performance Considerations
68
+
69
+ 1. **Parallelization**: annembed uses rayon, respect Ruby's thread settings
70
+ 2. **BLAS Backend**: Allow users to choose (OpenBLAS vs MKL)
71
+ 3. **Large Datasets**: Implement chunking for > 1M points
72
+ 4. **GPU Future**: Design API to allow GPU backend later
73
+
74
+ ## Testing Strategy
75
+
76
+ ### Unit Tests
77
+ - Type conversions
78
+ - Configuration parsing
79
+ - Error handling
80
+
81
+ ### Integration Tests
82
+ - Small datasets (Iris)
83
+ - Medium datasets (MNIST subset)
84
+ - Large datasets (if CI allows)
85
+
86
+ ### Benchmarks
87
+ - vs Python UMAP
88
+ - vs pure Ruby implementations
89
+ - Memory usage profiling
90
+
91
+ ## Platform Support
92
+
93
+ ### Priorities
94
+ 1. Linux x86_64 (most servers)
95
+ 2. macOS arm64 (M1/M2 developers)
96
+ 3. macOS x86_64 (Intel Macs)
97
+ 4. Windows x86_64 (if feasible)
98
+
99
+ ### Build Considerations
100
+ - Use rake-compiler-dock for cross-compilation
101
+ - Static link BLAS when possible
102
+ - Provide clear instructions for source builds
103
+
104
+ ## Future Enhancements
105
+
106
+ 1. **Incremental Learning**: Add new points to existing embedding
107
+ 2. **Custom Metrics**: Allow user-defined distance functions
108
+ 3. **Supervised Embedding**: Use labels to guide embedding
109
+ 4. **GPU Support**: If annembed adds it
110
+ 5. **Visualization**: Built-in plotting helpers?
111
+
112
+ ## Debugging Tips
113
+
114
+ 1. Use `RUST_BACKTRACE=1` for better errors
115
+ 2. Add logging with `env_logger` in Rust
116
+ 3. Use `rb_sys` debug mode for development
117
+ 4. Memory debugging with valgrind/ASAN
118
+
119
+ ## Code Organization
120
+
121
+ Keep related functionality together:
122
+ - `embedder.rs`: All embedding algorithms
123
+ - `utils.rs`: Dimension estimation, hubness
124
+ - `svd.rs`: Matrix decomposition
125
+ - `conversions.rs`: Type conversion helpers
126
+
127
+ ## Release Checklist
128
+
129
+ 1. [ ] Update version in `version.rb`
130
+ 2. [ ] Update CHANGELOG.md
131
+ 3. [ ] Run full test suite on all platforms
132
+ 4. [ ] Build precompiled gems
133
+ 5. [ ] Test gem installation from .gem file
134
+ 6. [ ] Tag release in git
135
+ 7. [ ] Push to RubyGems.org
136
+ 8. [ ] Update documentation
137
+
138
+ ## References
139
+
140
+ - annembed docs: https://docs.rs/annembed/
141
+ - Magnus guide: https://github.com/matsadler/magnus
142
+ - UMAP paper: https://arxiv.org/abs/1802.03426
143
+ - t-SNE paper: https://lvdmaaten.github.io/tsne/
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2024 Your Name
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/PYTHON_COMPARISON.md ADDED
@@ -0,0 +1,183 @@
1
+ # ClusterKit vs Python UMAP/HDBSCAN: A Comparison
2
+
3
+ ## Overview
4
+
5
+ ClusterKit and Python's UMAP/HDBSCAN implementations share the same algorithmic foundations but take different approaches to error handling and data validation. This document outlines these differences to help users understand what to expect from each implementation.
6
+
7
+ ## Philosophical Approaches
8
+
9
+ ### Python UMAP/HDBSCAN
10
+ Python's implementations prioritize **algorithmic robustness**, attempting to produce results even in edge cases. This approach:
11
+ - Maximizes compatibility with diverse datasets
12
+ - Minimizes interruptions to analysis workflows
13
+ - Trusts users to interpret results appropriately
14
+ - Follows the scikit-learn pattern of "fit on anything"
15
+
16
+ ### Ruby ClusterKit
17
+ ClusterKit prioritizes **guided analysis**, providing feedback when data characteristics may lead to suboptimal results. This approach:
18
+ - Helps users understand their data's suitability for the algorithm
19
+ - Provides actionable suggestions when issues arise
20
+ - Encourages best practices in dimensionality reduction
21
+ - Aims to prevent misinterpretation of results
22
+
23
+ ## Behavioral Differences
24
+
25
+ ### Small Datasets
26
+
27
+ #### Scenario: 3 data points with n_neighbors=15
28
+
29
+ **Python UMAP:**
30
+ - Automatically adjusts n_neighbors silently
31
+ - Returns a 2D embedding of the 3 points
32
+ - The resulting triangle's shape depends on random initialization
33
+
34
+ **Ruby ClusterKit:**
35
+ - Explicitly adjusts n_neighbors with optional warning
36
+ - Currently experiences performance issues on very small datasets (being addressed)
37
+ - Provides context about why adjustment was needed
38
+
39
+ ### Data Quality Issues
40
+
41
+ #### Scenario: Random data without inherent structure
42
+
43
+ **Python UMAP:**
44
+ - Processes the data without warnings
45
+ - Returns an embedding that may show apparent "clusters"
46
+ - These patterns are typically artifacts of the algorithm rather than real structure
47
+
48
+ **Ruby ClusterKit:**
49
+ - May raise `IsolatedPointError` with explanation
50
+ - Suggests that the data lacks structure suitable for manifold learning
51
+ - Recommends alternatives like PCA for unstructured data
52
+
53
+ #### Scenario: Extreme outliers (points 1000x farther than main data)
54
+
55
+ **Python UMAP:**
56
+ - Embeds outliers far from main cluster
57
+ - Main cluster may be compressed to accommodate outlier scale
58
+ - Visualization scale dominated by outliers
59
+
60
+ **Ruby ClusterKit:**
61
+ - May raise `IsolatedPointError`
62
+ - Explains impact of outliers on manifold learning
63
+ - Suggests preprocessing steps like outlier removal or normalization
64
+
65
+ ### Invalid Data
66
+
67
+ #### Scenario: NaN or Infinite values
68
+
69
+ **Both implementations** reject invalid numerical data, but with different messaging:
70
+
71
+ **Python:** `"Input contains NaN"` or `"Input contains infinity"`
72
+
73
+ **Ruby:** `"Element at position [5, 2] is NaN or Infinite"`
74
+
75
+ The Ruby version provides specific location information to aid debugging.
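A position-reporting check of this kind might look like the sketch below. It assumes the input is a plain nested Ruby Array; `validate_finite!` is a hypothetical helper name, and the gem's actual validation code may differ.

```ruby
# Hypothetical helper, not taken from the gem: scans a 2D Array row by row and
# raises on the first NaN or Infinite element, reporting its [row, column] position.
def validate_finite!(data)
  data.each_with_index do |row, i|
    row.each_with_index do |value, j|
      next if Float(value).finite?

      raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
    end
  end
  data
end

begin
  validate_finite!([[1.0, 2.0], [3.0, Float::NAN]])
rescue ArgumentError => e
  puts e.message
end
```

Reporting the first offending index, rather than only the fact that a bad value exists, lets the caller go straight to the affected row.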
76
+
77
+ ### Edge Cases
78
+
79
+ #### Scenario: Single data point
80
+
81
+ **Python UMAP:**
82
+ - Returns `[[0, 0]]` or similar default position
83
+ - No error or warning about meaningless result
84
+
85
+ **Ruby ClusterKit:**
86
+ - Raises error explaining that manifold learning requires multiple points
87
+ - Suggests minimum data requirements
88
+
89
+ #### Scenario: Empty dataset
90
+
91
+ **Both implementations** appropriately reject empty input with clear error messages.
92
+
93
+ ## Parameter Handling
94
+
95
+ ### Auto-adjustment
96
+
97
+ **Python UMAP:**
98
+ - Silently adjusts parameters when necessary
99
+ - May issue warnings through Python's warning system
100
+ - Adjustments not always visible in normal workflow
101
+
102
+ **Ruby ClusterKit:**
103
+ - Adjusts parameters when needed
104
+ - Optional verbose mode shows adjustments
105
+ - Explains why adjustments were made
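The adjust-and-explain behaviour described above can be sketched as follows. The function name, the `verbose:` keyword, and the rule of capping at one less than the sample count are all assumptions made for illustration; they are not confirmed by the gem's source.

```ruby
# Hypothetical sketch: cap n_neighbors below the sample count and, when
# verbose is set, explain the adjustment on stderr.
def adjust_n_neighbors(n_neighbors, n_samples, verbose: false)
  return n_neighbors if n_neighbors < n_samples

  adjusted = [n_samples - 1, 1].max
  if verbose
    warn "n_neighbors (#{n_neighbors}) is too large for #{n_samples} samples; " \
         "using #{adjusted} instead"
  end
  adjusted
end

adjust_n_neighbors(15, 3, verbose: true) # => 2
adjust_n_neighbors(5, 100)               # => 5
```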
106
+
107
+ ### Validation
108
+
109
+ **Python UMAP:**
110
+ - Validates parameters against mathematical constraints
111
+ - Generic `ValueError` for invalid parameters
112
+
113
+ **Ruby ClusterKit:**
114
+ - Validates parameters with context-aware messages
115
+ - `InvalidParameterError` with specific guidance
116
+ - Suggests valid parameter ranges based on data
117
+
118
+ ## Error Messages
119
+
120
+ ### Python Style
121
+ Focuses on technical accuracy:
122
+ ```
123
+ ValueError: n_neighbors must be less than or equal to the number of samples
124
+ ```
125
+
126
+ ### Ruby Style
127
+ Focuses on user guidance:
128
+ ```
129
+ The n_neighbors parameter (15) is too large for your dataset size (10).
130
+
131
+ UMAP needs n_neighbors to be less than the number of samples.
132
+ Suggested value: 5
133
+
134
+ Example: UMAP.new(n_neighbors: 5)
135
+ ```
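A guidance-style message like the one above could be assembled with a small formatter. In this sketch the method name is hypothetical, and the suggestion rule (half the sample count, with a floor of 2) is an assumption chosen only because it reproduces the value in the example; the gem's real heuristic is not shown in this diff.

```ruby
# Hypothetical formatter: neither the name nor the suggestion rule comes from the gem.
def n_neighbors_message(n_neighbors, n_samples)
  suggested = [n_samples / 2, 2].max # assumed heuristic, for illustration only
  <<~MSG
    The n_neighbors parameter (#{n_neighbors}) is too large for your dataset size (#{n_samples}).

    UMAP needs n_neighbors to be less than the number of samples.
    Suggested value: #{suggested}

    Example: UMAP.new(n_neighbors: #{suggested})
  MSG
end

puts n_neighbors_message(15, 10)
```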
136
+
137
+ ## Performance Characteristics
138
+
139
+ ### Python
140
+ - Mature optimization over many years
141
+ - Handles edge cases without hanging
142
+ - Extensive numerical stability improvements
143
+
144
+ ### Ruby
145
+ - Newer implementation via Rust bindings
146
+ - Excellent performance on standard datasets
147
+ - Some edge cases still being optimized
148
+
149
+ ## When Each Approach Shines
150
+
151
+ ### Python UMAP/HDBSCAN is ideal when:
152
+ - Working with well-understood data pipelines
153
+ - Requiring maximum compatibility
154
+ - Integrating with existing Python ML workflows
155
+ - Batch processing diverse datasets
156
+
157
+ ### Ruby ClusterKit is ideal when:
158
+ - Learning dimensionality reduction techniques
159
+ - Working with new or unfamiliar datasets
160
+ - Needing clear feedback about data issues
161
+ - Prioritizing interpretable results
162
+
163
+ ## Convergence Behavior
164
+
165
+ ### Python
166
+ - Continues optimization even with poor convergence
167
+ - Returns best result found within iteration limit
168
+ - May produce suboptimal embeddings silently
169
+
170
+ ### Ruby
171
+ - Raises `ConvergenceError` with explanation
172
+ - Suggests parameter adjustments to improve convergence
173
+ - Helps users understand why convergence failed
174
+
175
+ ## Summary
176
+
177
+ Both implementations are valuable tools for dimensionality reduction and clustering. Python's approach offers battle-tested robustness and broad compatibility, making it excellent for production pipelines and experienced practitioners. ClusterKit's approach provides more guidance and education, making it particularly valuable for exploratory analysis and users who want to understand their results deeply.
178
+
179
+ The choice between them often depends on your specific needs:
180
+ - Choose Python when you need maximum compatibility and robustness
181
+ - Choose ClusterKit when you value clear feedback and guided analysis
182
+
183
+ Both approaches reflect thoughtful design decisions optimized for their respective user communities and use cases.