clusterkit 0.1.0.pre.1
- checksums.yaml +7 -0
- data/.rspec +3 -0
- data/.simplecov +47 -0
- data/CHANGELOG.md +35 -0
- data/CLAUDE.md +226 -0
- data/Cargo.toml +8 -0
- data/Gemfile +17 -0
- data/IMPLEMENTATION_NOTES.md +143 -0
- data/LICENSE.txt +21 -0
- data/PYTHON_COMPARISON.md +183 -0
- data/README.md +499 -0
- data/Rakefile +245 -0
- data/clusterkit.gemspec +45 -0
- data/docs/KNOWN_ISSUES.md +130 -0
- data/docs/RUST_ERROR_HANDLING.md +164 -0
- data/docs/TEST_FIXTURES.md +170 -0
- data/docs/UMAP_EXPLAINED.md +362 -0
- data/docs/UMAP_TROUBLESHOOTING.md +284 -0
- data/docs/VERBOSE_OUTPUT.md +84 -0
- data/examples/hdbscan_example.rb +147 -0
- data/examples/optimal_kmeans_example.rb +96 -0
- data/examples/pca_example.rb +114 -0
- data/examples/reproducible_umap.rb +99 -0
- data/examples/verbose_control.rb +43 -0
- data/ext/clusterkit/Cargo.toml +25 -0
- data/ext/clusterkit/extconf.rb +4 -0
- data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +115 -0
- data/ext/clusterkit/src/clustering.rs +267 -0
- data/ext/clusterkit/src/embedder.rs +413 -0
- data/ext/clusterkit/src/lib.rs +22 -0
- data/ext/clusterkit/src/svd.rs +112 -0
- data/ext/clusterkit/src/tests.rs +16 -0
- data/ext/clusterkit/src/utils.rs +33 -0
- data/lib/clusterkit/clustering/hdbscan.rb +177 -0
- data/lib/clusterkit/clustering.rb +213 -0
- data/lib/clusterkit/clusterkit.rb +9 -0
- data/lib/clusterkit/configuration.rb +24 -0
- data/lib/clusterkit/dimensionality/pca.rb +251 -0
- data/lib/clusterkit/dimensionality/svd.rb +144 -0
- data/lib/clusterkit/dimensionality/umap.rb +311 -0
- data/lib/clusterkit/dimensionality.rb +29 -0
- data/lib/clusterkit/hdbscan_api_design.rb +142 -0
- data/lib/clusterkit/preprocessing.rb +106 -0
- data/lib/clusterkit/silence.rb +42 -0
- data/lib/clusterkit/utils.rb +51 -0
- data/lib/clusterkit/version.rb +5 -0
- data/lib/clusterkit.rb +93 -0
- data/lib/tasks/visualize.rake +641 -0
- metadata +194 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: fd025da9b7f5c97e370d05fb1062484cb99b0aaaa4a7c310eb27df78336c91b4
|
4
|
+
data.tar.gz: 7665cf847930bc47cb04adc1f466ba09ea33c14da5f242434b08493047945f91
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: a8db6d4738ad99a20887aef90398d4163bdd8bc47bbdb8dda74496adc105602051999e50d2bc6003b4981d16b8121151d71cf55618691262b4a092f3c46a2545
|
7
|
+
data.tar.gz: 1a6e43a00d19d7fdaf35be6deffa5215e994a1f013b140aa23ac9bbb9d982facde94d710a894cba39e1fbd39bccd7c020b86a99e9d32d2aa256f717844623e5e
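The digests listed above can be compared against locally downloaded copies of `metadata.gz` and `data.tar.gz`. The following is a minimal sketch using Ruby's standard `digest` library; the helper name `digest_matches?` and the example path are illustrative assumptions, not part of RubyGems tooling.

```ruby
require "digest"

# Hypothetical helper (not part of RubyGems): compare a file's digest with an
# expected hex string. `algorithm` is :sha256 or :sha512, mirroring the two
# sections of checksums.yaml.
def digest_matches?(path, expected_hex, algorithm: :sha256)
  klass = { sha256: Digest::SHA256, sha512: Digest::SHA512 }.fetch(algorithm)
  klass.file(path).hexdigest == expected_hex.strip.downcase
end
```

Usage would look like `digest_matches?("data.tar.gz", "7665cf84...")`, with the full 64-character SHA256 value from the listing in place of the shortened string.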
|
data/.rspec
ADDED
data/.simplecov
ADDED
@@ -0,0 +1,47 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
SimpleCov.configure do
|
4
|
+
# Add custom groups
|
5
|
+
add_group 'Core', 'lib/annembed/embedder'
|
6
|
+
add_group 'UMAP', 'lib/annembed/umap'
|
7
|
+
add_group 'Utils', 'lib/annembed/utils'
|
8
|
+
add_group 'Configuration', 'lib/annembed/config'
|
9
|
+
|
10
|
+
# Track branches as well as lines
|
11
|
+
enable_coverage :branch
|
12
|
+
|
13
|
+
# Set thresholds (temporarily disabled to diagnose issues)
|
14
|
+
# minimum_coverage line: 50, branch: 40
|
15
|
+
|
16
|
+
# Don't refuse to run tests if coverage drops (during development)
|
17
|
+
# refuse_coverage_drop
|
18
|
+
|
19
|
+
# Maximum coverage drop allowed
|
20
|
+
maximum_coverage_drop 5
|
21
|
+
|
22
|
+
# Configure output directory
|
23
|
+
coverage_dir 'coverage'
|
24
|
+
|
25
|
+
# Also track library files that are never loaded during the test run
|
26
|
+
track_files 'lib/**/*.rb'
|
27
|
+
|
28
|
+
# Custom filters
|
29
|
+
add_filter do |source_file|
|
30
|
+
# Skip version file
|
31
|
+
source_file.filename.include?('version.rb')
|
32
|
+
end
|
33
|
+
|
34
|
+
# Use the HTML formatter for the coverage report
|
35
|
+
SimpleCov.formatter = SimpleCov::Formatter::MultiFormatter.new([
|
36
|
+
SimpleCov::Formatter::HTMLFormatter,
|
37
|
+
])
|
38
|
+
|
39
|
+
# Name this test run (used when merging results)
|
40
|
+
command_name 'RSpec'
|
41
|
+
|
42
|
+
# Merge results from multiple test runs
|
43
|
+
use_merging true
|
44
|
+
|
45
|
+
# Set result cache timeout (in seconds)
|
46
|
+
merge_timeout 3600
|
47
|
+
end
|
data/CHANGELOG.md
ADDED
@@ -0,0 +1,35 @@
|
|
1
|
+
# Changelog
|
2
|
+
|
3
|
+
All notable changes to this project will be documented in this file.
|
4
|
+
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
7
|
+
|
8
|
+
## [Unreleased]
|
9
|
+
|
10
|
+
### Added
|
11
|
+
- Clean, scikit-learn-like API for UMAP
|
12
|
+
- `fit(data)` - Train the model
|
13
|
+
- `transform(data)` - Transform new data
|
14
|
+
- `fit_transform(data)` - Train and transform in one step
|
15
|
+
- `fitted?` - Check if model is trained
|
16
|
+
- `save(path)` - Save trained model
|
17
|
+
- `load(path)` - Load trained model
|
18
|
+
- Model persistence with save/load functionality
|
19
|
+
- Data export/import utilities for caching results
|
20
|
+
- Comprehensive test suite for UMAP interface
|
21
|
+
- Detailed README with practical examples
|
22
|
+
|
23
|
+
### Changed
|
24
|
+
- Complete API redesign to follow ML library conventions
|
25
|
+
- Removed confusing `save_embeddings`/`load_embeddings` methods
|
26
|
+
- Separated model operations from data caching concerns
|
27
|
+
|
28
|
+
### Fixed
|
29
|
+
- Intermittent test failures with boundary assertions
|
30
|
+
- Data normalization issues with extreme values
|
31
|
+
|
32
|
+
## [0.1.0] - TBD
|
33
|
+
|
34
|
+
### Added
|
35
|
+
- Initial release with basic embedding functionality
|
data/CLAUDE.md
ADDED
@@ -0,0 +1,226 @@
|
|
1
|
+
# CLAUDE.md - clusterkit Project Guide
|
2
|
+
|
3
|
+
## Project Vision
|
4
|
+
clusterkit brings high-performance dimensionality reduction and embedding algorithms to Ruby by wrapping the annembed Rust crate. This gem is part of the ruby-nlp ecosystem, which aims to provide Ruby developers with native machine learning and NLP capabilities through best-in-breed Rust implementations.
|
5
|
+
|
6
|
+
## Core Principles
|
7
|
+
|
8
|
+
### 1. Ruby-First Design
|
9
|
+
- Provide an idiomatic Ruby API that feels natural to Ruby developers
|
10
|
+
- Follow Ruby naming conventions (snake_case methods, proper use of symbols)
|
11
|
+
- Support Ruby's duck typing while maintaining type safety at the Rust boundary
|
12
|
+
- Integrate seamlessly with Ruby's data science ecosystem
|
13
|
+
|
14
|
+
### 2. Performance Without Compromise
|
15
|
+
- Leverage Rust's performance for compute-intensive operations
|
16
|
+
- Use Magnus for zero-copy data transfer where possible
|
17
|
+
- Enable parallelization by default
|
18
|
+
- Provide progress feedback for long-running operations
|
19
|
+
|
20
|
+
### 3. Ecosystem Integration
|
21
|
+
- Primary support for Numo::NArray (the NumPy of Ruby)
|
22
|
+
- Work well with other ruby-nlp gems (lancelot, red-candle)
|
23
|
+
- Support common Ruby data formats and visualization tools
|
24
|
+
- Play nice with Jupyter notebooks (iruby)
|
25
|
+
|
26
|
+
## Technical Guidelines
|
27
|
+
|
28
|
+
### Magnus Best Practices
|
29
|
+
|
30
|
+
1. **Memory Management**
|
31
|
+
```rust
|
32
|
+
// Good: Let Magnus handle Ruby object lifecycle
|
33
|
+
let array: RArray = data.try_convert()?;
|
34
|
+
|
35
|
+
// Avoid: Manual memory management
|
36
|
+
// Don't try to manually free Ruby objects
|
37
|
+
```
|
38
|
+
|
39
|
+
2. **Error Handling**
|
40
|
+
```rust
|
41
|
+
// Always wrap errors properly
|
42
|
+
use magnus::{exception, Error, RArray};
|
43
|
+
|
44
|
+
fn risky_operation() -> Result<RArray, Error> {
|
45
|
+
annembed_call()
|
46
|
+
.map_err(|e| Error::new(exception::runtime_error(), e.to_string()))
|
47
|
+
}
|
48
|
+
```
|
49
|
+
|
50
|
+
3. **Type Conversions**
|
51
|
+
```rust
|
52
|
+
// Define clear conversion traits
|
53
|
+
impl TryFrom<Value> for EmbedConfig {
|
54
|
+
type Error = Error;
|
55
|
+
// Robust conversion with good error messages
|
56
|
+
}
|
57
|
+
```
|
58
|
+
|
59
|
+
### Ruby API Design
|
60
|
+
|
61
|
+
1. **Method Naming**
|
62
|
+
- Use Ruby conventions: `fit_transform`, not `fitTransform`
|
63
|
+
- Predicates end with `?`: `converged?`, `fitted?`
|
64
|
+
- Dangerous methods end with `!`: `normalize!`
|
65
|
+
|
66
|
+
2. **Parameter Handling**
|
67
|
+
```ruby
|
68
|
+
# Good: Use keyword arguments with defaults
|
69
|
+
def initialize(method: :umap, n_components: 2, **options)
|
70
|
+
|
71
|
+
# Avoid: Positional arguments for configuration
|
72
|
+
def initialize(method, n_components, min_dist, spread, ...)
|
73
|
+
```
|
74
|
+
|
75
|
+
3. **Return Values**
|
76
|
+
- Return Ruby arrays for small results
|
77
|
+
- Return Numo::NArray for large matrices
|
78
|
+
- Support multiple return formats via options
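The naming and parameter conventions in this section can be illustrated with a small, self-contained class. This is only a sketch: `ToyScaler` and its methods are hypothetical and do not exist in clusterkit; they merely show keyword arguments with defaults, a `?` predicate, and a `!` method that mutates its argument.

```ruby
# Hypothetical example class; not part of clusterkit.
class ToyScaler
  # Keyword arguments with defaults instead of positional configuration.
  def initialize(lower: 0.0, upper: 1.0)
    @lower = lower
    @upper = upper
    @min = nil
    @max = nil
  end

  # Predicate methods end with `?`.
  def fitted?
    !@min.nil?
  end

  # Learn the value range; returns self so calls can be chained.
  def fit(values)
    raise ArgumentError, "values must not be empty" if values.empty?

    @min, @max = values.minmax
    self
  end

  # Non-mutating version: returns a new Array rescaled to [lower, upper].
  def transform(values)
    raise "call fit first" unless fitted?

    span = @max - @min
    values.map do |v|
      span.zero? ? @lower : @lower + (v - @min) * (@upper - @lower) / span.to_f
    end
  end

  # "Dangerous" version ends with `!` and modifies its argument in place.
  def transform!(values)
    values.replace(transform(values))
  end
end
```

A call such as `ToyScaler.new(upper: 10.0).fit(data).transform(data)` leaves `data` untouched, while `transform!(data)` overwrites it.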
|
79
|
+
|
80
|
+
### Performance Considerations
|
81
|
+
|
82
|
+
1. **Data Transfer**
|
83
|
+
- Minimize copying between Ruby and Rust
|
84
|
+
- Use view/slice operations when possible
|
85
|
+
- Support streaming for large datasets
|
86
|
+
|
87
|
+
2. **Threading**
|
88
|
+
- Respect Ruby's GVL (Global VM Lock)
|
89
|
+
- Release GVL for long-running Rust operations
|
90
|
+
- Use Rust's parallelization, not Ruby threads
|
91
|
+
|
92
|
+
3. **Memory Usage**
|
93
|
+
- Provide memory estimates for large operations
|
94
|
+
- Support out-of-core processing for huge datasets
|
95
|
+
- Clear progress indication for long operations
|
96
|
+
|
97
|
+
## Code Style Guidelines
|
98
|
+
|
99
|
+
### Rust Side
|
100
|
+
- Follow Rust standard style (rustfmt)
|
101
|
+
- Comprehensive error types with context
|
102
|
+
- Document all public functions
|
103
|
+
- Use type aliases for clarity
|
104
|
+
|
105
|
+
### Ruby Side
|
106
|
+
- Follow Ruby Style Guide
|
107
|
+
- Use YARD documentation format
|
108
|
+
- Provide type signatures where helpful
|
109
|
+
- Include usage examples in docs
|
110
|
+
|
111
|
+
## Testing Philosophy
|
112
|
+
|
113
|
+
1. **Comprehensive Coverage**
|
114
|
+
- Unit tests for all public methods
|
115
|
+
- Integration tests with real datasets
|
116
|
+
- Performance benchmarks
|
117
|
+
- Memory leak tests
|
118
|
+
|
119
|
+
2. **Test Data**
|
120
|
+
- Use standard ML datasets (Iris, MNIST samples)
|
121
|
+
- Generate synthetic data for edge cases
|
122
|
+
- Test with various Ruby object types
|
123
|
+
|
124
|
+
3. **Platform Testing**
|
125
|
+
- Test on multiple Ruby versions
|
126
|
+
- Test on different operating systems
|
127
|
+
- Verify precompiled gem distribution
|
128
|
+
|
129
|
+
## Documentation Standards
|
130
|
+
|
131
|
+
1. **README**
|
132
|
+
- Clear installation instructions
|
133
|
+
- Quick start example that works
|
134
|
+
- Link to full documentation
|
135
|
+
- Performance comparisons
|
136
|
+
|
137
|
+
2. **API Documentation**
|
138
|
+
- Every public method documented
|
139
|
+
- Parameter types and ranges specified
|
140
|
+
- Return values clearly described
|
141
|
+
- Usage examples for complex methods
|
142
|
+
|
143
|
+
3. **Tutorials**
|
144
|
+
- Jupyter notebook examples
|
145
|
+
- Common use case walkthroughs
|
146
|
+
- Integration examples with other gems
|
147
|
+
|
148
|
+
## Common Patterns
|
149
|
+
|
150
|
+
### Configuration Objects
|
151
|
+
```ruby
|
152
|
+
# Prefer configuration objects over many parameters
|
153
|
+
config = Annembed::Config.new(
|
154
|
+
method: :umap,
|
155
|
+
n_neighbors: 15,
|
156
|
+
min_dist: 0.1
|
157
|
+
)
|
158
|
+
embedder = Annembed::Embedder.new(config)
|
159
|
+
```
|
160
|
+
|
161
|
+
### Progress Callbacks
|
162
|
+
```ruby
|
163
|
+
# Support progress monitoring
|
164
|
+
embedder.on_progress do |iteration, total|
|
165
|
+
puts "Progress: #{iteration}/#{total}"
|
166
|
+
end
|
167
|
+
```
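One way the callback registration above could be wired up on the Ruby side is sketched below. `ToyEmbedder`, its `n_epochs:` option, and `run` are assumptions made for illustration; the real optimization loop lives in Rust and would need GVL handling that this pure-Ruby sketch does not attempt.

```ruby
# Hypothetical stand-in that only demonstrates callback wiring;
# not the clusterkit embedder.
class ToyEmbedder
  def initialize(n_epochs: 5)
    @n_epochs = n_epochs
    @progress_callbacks = []
  end

  # Register a block to be called after each iteration; returns self.
  def on_progress(&block)
    raise ArgumentError, "block required" unless block

    @progress_callbacks << block
    self
  end

  # Placeholder loop: invokes every registered callback once per iteration.
  def run
    1.upto(@n_epochs) do |iteration|
      # ... one optimization step would happen here ...
      @progress_callbacks.each { |cb| cb.call(iteration, @n_epochs) }
    end
    self
  end
end
```

Storing the blocks in an array allows several observers (a progress bar and a logger, for example) to be attached to the same run.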
|
168
|
+
|
169
|
+
### Flexible Input/Output
|
170
|
+
```ruby
|
171
|
+
# Accept multiple input formats
|
172
|
+
embedder.fit_transform(data) # Array, NArray, or CSV path
|
173
|
+
|
174
|
+
# Support different output formats
|
175
|
+
result = embedder.transform(data, output: :array) # Ruby Array
|
176
|
+
result = embedder.transform(data, output: :narray) # Numo::NArray
|
177
|
+
```
|
178
|
+
|
179
|
+
## Development Workflow
|
180
|
+
|
181
|
+
1. **Branch Strategy**
|
182
|
+
- `main` - stable release
|
183
|
+
- `develop` - integration branch
|
184
|
+
- `feature/*` - new features
|
185
|
+
- `fix/*` - bug fixes
|
186
|
+
|
187
|
+
2. **Release Process**
|
188
|
+
- Version bump in version.rb
|
189
|
+
- Update CHANGELOG.md
|
190
|
+
- Run full test suite
|
191
|
+
- Build precompiled gems
|
192
|
+
- Tag release
|
193
|
+
- Push to RubyGems
|
194
|
+
|
195
|
+
3. **Continuous Integration**
|
196
|
+
- Run tests on each push
|
197
|
+
- Build gems for multiple platforms
|
198
|
+
- Check documentation building
|
199
|
+
- Performance regression tests
|
200
|
+
|
201
|
+
## Future Considerations
|
202
|
+
|
203
|
+
1. **GPU Support**
|
204
|
+
- Monitor annembed for GPU features
|
205
|
+
- Plan bindings if GPU support is added
|
206
|
+
- Consider alternative GPU libraries
|
207
|
+
|
208
|
+
2. **Web Integration**
|
209
|
+
- Consider Rails integration
|
210
|
+
- WebAssembly compilation?
|
211
|
+
- REST API wrapper?
|
212
|
+
|
213
|
+
3. **Visualization**
|
214
|
+
- Built-in plotting helpers?
|
215
|
+
- Export to common formats
|
216
|
+
- Interactive visualizations?
|
217
|
+
|
218
|
+
## Getting Help
|
219
|
+
|
220
|
+
When implementing new features:
|
221
|
+
1. Check existing patterns in lancelot and red-candle
|
222
|
+
2. Consult annembed documentation
|
223
|
+
3. Ask in ruby-nlp discussions
|
224
|
+
4. Profile before optimizing
|
225
|
+
|
226
|
+
Remember: The goal is to make advanced embedding algorithms accessible and performant for Ruby developers while maintaining the simplicity and elegance that makes Ruby special.
|
data/Cargo.toml
ADDED
data/Gemfile
ADDED
@@ -0,0 +1,17 @@
|
|
1
|
+
source "https://rubygems.org"
|
2
|
+
|
3
|
+
# Specify your gem's dependencies in clusterkit.gemspec
|
4
|
+
gemspec
|
5
|
+
|
6
|
+
# Test-only dependencies
|
7
|
+
group :test do
|
8
|
+
# Optional: For comparing with Python implementations
|
9
|
+
gem "pycall", "~> 1.4", require: false
|
10
|
+
end
|
11
|
+
|
12
|
+
# Development dependencies for generating test fixtures
|
13
|
+
group :development do
|
14
|
+
# For generating real embeddings to use as test fixtures
|
15
|
+
# This avoids the hanging issues with random test data
|
16
|
+
gem "red-candle", "~> 1.0", require: false
|
17
|
+
end
|
data/IMPLEMENTATION_NOTES.md
ADDED
@@ -0,0 +1,143 @@
|
|
1
|
+
# Implementation Notes for annembed-ruby
|
2
|
+
|
3
|
+
## Architecture Overview
|
4
|
+
|
5
|
+
The gem follows a three-layer architecture:
|
6
|
+
|
7
|
+
1. **Ruby Layer** (`lib/annembed/`): User-facing API with Ruby idioms
|
8
|
+
2. **Magnus Bridge** (`ext/annembed_ruby/src/`): Type conversions and bindings
|
9
|
+
3. **annembed Core**: The underlying Rust embedding library
|
10
|
+
|
11
|
+
## Key Implementation Challenges
|
12
|
+
|
13
|
+
### 1. Data Type Conversions
|
14
|
+
|
15
|
+
The main challenge is efficiently converting between Ruby and Rust types:
|
16
|
+
|
17
|
+
```
|
18
|
+
Ruby Array/Numo::NArray ↔ ndarray::Array2<f32> ↔ annembed matrices
|
19
|
+
```
|
20
|
+
|
21
|
+
Consider using views instead of copies where possible.
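As an illustration of the Ruby half of this conversion, the sketch below flattens a nested Array into a row-major buffer plus its shape, which is the form a constructor such as ndarray's `Array2::from_shape_vec` expects. The helper name `to_row_major` is hypothetical, and this version copies the data rather than creating a view.

```ruby
# Hypothetical helper (not from the gem): flatten a rectangular Array of
# Arrays into a row-major Float buffer and return it with its shape.
def to_row_major(rows)
  unless rows.is_a?(Array) && !rows.empty? && rows.all?(Array)
    raise ArgumentError, "data must be a non-empty Array of Arrays"
  end

  n_cols = rows.first.length
  raise ArgumentError, "rows must not be empty" if n_cols.zero?

  # Reject ragged input before it reaches the native extension.
  rows.each_with_index do |row, i|
    next if row.length == n_cols

    raise ArgumentError, "row #{i} has #{row.length} values, expected #{n_cols}"
  end

  flat = rows.flat_map { |row| row.map { |v| Float(v) } }
  [flat, rows.length, n_cols]
end
```

Validating the shape in Ruby keeps the error message close to the user's data; the Rust side then only needs to trust `rows * cols == flat.length`.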
|
22
|
+
|
23
|
+
### 2. Memory Management
|
24
|
+
|
25
|
+
- Magnus handles Ruby GC integration
|
26
|
+
- Need to be careful with large matrices
|
27
|
+
- Consider streaming for datasets > available RAM
|
28
|
+
|
29
|
+
### 3. Progress Reporting
|
30
|
+
|
31
|
+
annembed operations can be long-running. Options:
|
32
|
+
1. Polling from Ruby side (requires thread)
|
33
|
+
2. Callback from Rust (needs GVL management)
|
34
|
+
3. Async/await pattern (complex but clean)
|
35
|
+
|
36
|
+
### 4. Error Handling
|
37
|
+
|
38
|
+
Map annembed errors to Ruby exceptions:
|
39
|
+
- `annembed::Error` → `Annembed::Error`
|
40
|
+
- Panic → `RuntimeError` (avoid panics!)
|
41
|
+
- Invalid input → `ArgumentError`
|
42
|
+
|
43
|
+
## annembed API Mapping
|
44
|
+
|
45
|
+
### Core Types
|
46
|
+
- `EmbedderT` - Main embedding trait
|
47
|
+
- `HnswParams` - HNSW graph parameters
|
48
|
+
- `EmbedderParams` - Algorithm-specific params
|
49
|
+
- `GraphProjection` - For graph-based methods
|
50
|
+
|
51
|
+
### Key Functions
|
52
|
+
```rust
|
53
|
+
// Main embedding function
|
54
|
+
pub fn get_embedder(
|
55
|
+
data: &Array2<f32>,
|
56
|
+
params: EmbedderParams,
|
57
|
+
hnsw_params: HnswParams
|
58
|
+
) -> Result<Box<dyn EmbedderT>>
|
59
|
+
|
60
|
+
// The embedder trait
|
61
|
+
pub trait EmbedderT {
|
62
|
+
fn embed(&self) -> Result<Array2<f32>>;
|
63
|
+
fn get_hnsw(&self) -> &Hnsw<f32, DistL2>;
|
64
|
+
}
|
65
|
+
```
|
66
|
+
|
67
|
+
## Performance Considerations
|
68
|
+
|
69
|
+
1. **Parallelization**: annembed uses rayon; respect Ruby's thread settings
|
70
|
+
2. **BLAS Backend**: Allow users to choose (OpenBLAS vs MKL)
|
71
|
+
3. **Large Datasets**: Implement chunking for > 1M points
|
72
|
+
4. **GPU Future**: Design API to allow GPU backend later
|
73
|
+
|
74
|
+
## Testing Strategy
|
75
|
+
|
76
|
+
### Unit Tests
|
77
|
+
- Type conversions
|
78
|
+
- Configuration parsing
|
79
|
+
- Error handling
|
80
|
+
|
81
|
+
### Integration Tests
|
82
|
+
- Small datasets (Iris)
|
83
|
+
- Medium datasets (MNIST subset)
|
84
|
+
- Large datasets (if CI allows)
|
85
|
+
|
86
|
+
### Benchmarks
|
87
|
+
- vs Python UMAP
|
88
|
+
- vs pure Ruby implementations
|
89
|
+
- Memory usage profiling
|
90
|
+
|
91
|
+
## Platform Support
|
92
|
+
|
93
|
+
### Priorities
|
94
|
+
1. Linux x86_64 (most servers)
|
95
|
+
2. macOS arm64 (M1/M2 developers)
|
96
|
+
3. macOS x86_64 (Intel Macs)
|
97
|
+
4. Windows x86_64 (if feasible)
|
98
|
+
|
99
|
+
### Build Considerations
|
100
|
+
- Use rake-compiler-dock for cross-compilation
|
101
|
+
- Statically link BLAS when possible
|
102
|
+
- Provide clear instructions for source builds
|
103
|
+
|
104
|
+
## Future Enhancements
|
105
|
+
|
106
|
+
1. **Incremental Learning**: Add new points to existing embedding
|
107
|
+
2. **Custom Metrics**: Allow user-defined distance functions
|
108
|
+
3. **Supervised Embedding**: Use labels to guide embedding
|
109
|
+
4. **GPU Support**: If annembed adds it
|
110
|
+
5. **Visualization**: Built-in plotting helpers?
|
111
|
+
|
112
|
+
## Debugging Tips
|
113
|
+
|
114
|
+
1. Use `RUST_BACKTRACE=1` for better errors
|
115
|
+
2. Add logging with `env_logger` in Rust
|
116
|
+
3. Use `rb_sys` debug mode for development
|
117
|
+
4. Memory debugging with valgrind/ASAN
|
118
|
+
|
119
|
+
## Code Organization
|
120
|
+
|
121
|
+
Keep related functionality together:
|
122
|
+
- `embedder.rs`: All embedding algorithms
|
123
|
+
- `utils.rs`: Dimension estimation, hubness
|
124
|
+
- `svd.rs`: Matrix decomposition
|
125
|
+
- `conversions.rs`: Type conversion helpers
|
126
|
+
|
127
|
+
## Release Checklist
|
128
|
+
|
129
|
+
1. [ ] Update version in `version.rb`
|
130
|
+
2. [ ] Update CHANGELOG.md
|
131
|
+
3. [ ] Run full test suite on all platforms
|
132
|
+
4. [ ] Build precompiled gems
|
133
|
+
5. [ ] Test gem installation from .gem file
|
134
|
+
6. [ ] Tag release in git
|
135
|
+
7. [ ] Push to RubyGems.org
|
136
|
+
8. [ ] Update documentation
|
137
|
+
|
138
|
+
## References
|
139
|
+
|
140
|
+
- annembed docs: https://docs.rs/annembed/
|
141
|
+
- Magnus guide: https://github.com/matsadler/magnus
|
142
|
+
- UMAP paper: https://arxiv.org/abs/1802.03426
|
143
|
+
- t-SNE paper: https://lvdmaaten.github.io/tsne/
|
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
The MIT License (MIT)
|
2
|
+
|
3
|
+
Copyright (c) 2024 Your Name
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
7
|
+
in the Software without restriction, including without limitation the rights
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
10
|
+
furnished to do so, subject to the following conditions:
|
11
|
+
|
12
|
+
The above copyright notice and this permission notice shall be included in
|
13
|
+
all copies or substantial portions of the Software.
|
14
|
+
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
21
|
+
THE SOFTWARE.
|
data/PYTHON_COMPARISON.md
ADDED
@@ -0,0 +1,183 @@
|
|
1
|
+
# ClusterKit vs Python UMAP/HDBSCAN: A Comparison
|
2
|
+
|
3
|
+
## Overview
|
4
|
+
|
5
|
+
ClusterKit and Python's UMAP/HDBSCAN implementations share the same algorithmic foundations but take different approaches to error handling and data validation. This document outlines these differences to help users understand what to expect from each implementation.
|
6
|
+
|
7
|
+
## Philosophical Approaches
|
8
|
+
|
9
|
+
### Python UMAP/HDBSCAN
|
10
|
+
Python's implementations prioritize **algorithmic robustness**, attempting to produce results even in edge cases. This approach:
|
11
|
+
- Maximizes compatibility with diverse datasets
|
12
|
+
- Minimizes interruptions to analysis workflows
|
13
|
+
- Trusts users to interpret results appropriately
|
14
|
+
- Follows the scikit-learn pattern of "fit on anything"
|
15
|
+
|
16
|
+
### Ruby ClusterKit
|
17
|
+
ClusterKit prioritizes **guided analysis**, providing feedback when data characteristics may lead to suboptimal results. This approach:
|
18
|
+
- Helps users understand their data's suitability for the algorithm
|
19
|
+
- Provides actionable suggestions when issues arise
|
20
|
+
- Encourages best practices in dimensionality reduction
|
21
|
+
- Aims to prevent misinterpretation of results
|
22
|
+
|
23
|
+
## Behavioral Differences
|
24
|
+
|
25
|
+
### Small Datasets
|
26
|
+
|
27
|
+
#### Scenario: 3 data points with n_neighbors=15
|
28
|
+
|
29
|
+
**Python UMAP:**
|
30
|
+
- Automatically adjusts n_neighbors silently
|
31
|
+
- Returns a 2D embedding of the 3 points
|
32
|
+
- The resulting triangle's shape depends on random initialization
|
33
|
+
|
34
|
+
**Ruby ClusterKit:**
|
35
|
+
- Explicitly adjusts n_neighbors with optional warning
|
36
|
+
- Currently experiences performance issues on very small datasets (being addressed)
|
37
|
+
- Provides context about why adjustment was needed
|
38
|
+
|
39
|
+
### Data Quality Issues
|
40
|
+
|
41
|
+
#### Scenario: Random data without inherent structure
|
42
|
+
|
43
|
+
**Python UMAP:**
|
44
|
+
- Processes the data without warnings
|
45
|
+
- Returns an embedding that may show apparent "clusters"
|
46
|
+
- These patterns are typically artifacts of the algorithm rather than real structure
|
47
|
+
|
48
|
+
**Ruby ClusterKit:**
|
49
|
+
- May raise `IsolatedPointError` with explanation
|
50
|
+
- Suggests that the data lacks structure suitable for manifold learning
|
51
|
+
- Recommends alternatives like PCA for unstructured data
|
52
|
+
|
53
|
+
#### Scenario: Extreme outliers (points 1000x farther than main data)
|
54
|
+
|
55
|
+
**Python UMAP:**
|
56
|
+
- Embeds outliers far from main cluster
|
57
|
+
- Main cluster may be compressed to accommodate outlier scale
|
58
|
+
- Visualization scale dominated by outliers
|
59
|
+
|
60
|
+
**Ruby ClusterKit:**
|
61
|
+
- May raise `IsolatedPointError`
|
62
|
+
- Explains impact of outliers on manifold learning
|
63
|
+
- Suggests preprocessing steps like outlier removal or normalization
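As one example of the normalization step mentioned above, the sketch below standardizes each column to zero mean and unit variance in plain Ruby. The helper `standardize_columns` is hypothetical and is not ClusterKit's preprocessing API; note also that standardization rescales outliers but does not remove them.

```ruby
# Hypothetical helper: z-score each column of a rectangular Array of Arrays.
def standardize_columns(rows)
  scaled = rows.transpose.map do |col|
    mean = col.sum.to_f / col.length
    variance = col.sum { |v| (v - mean)**2 } / col.length
    std = Math.sqrt(variance)
    # A constant column has zero spread; map it to 0.0 instead of dividing by 0.
    col.map { |v| std.zero? ? 0.0 : (v - mean) / std }
  end
  scaled.transpose
end
```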
|
64
|
+
|
65
|
+
### Invalid Data
|
66
|
+
|
67
|
+
#### Scenario: NaN or Infinite values
|
68
|
+
|
69
|
+
**Both implementations** reject invalid numerical data, but with different messaging:
|
70
|
+
|
71
|
+
**Python:** `"Input contains NaN"` or `"Input contains infinity"`
|
72
|
+
|
73
|
+
**Ruby:** `"Element at position [5, 2] is NaN or Infinite"`
|
74
|
+
|
75
|
+
The Ruby version provides specific location information to aid debugging.
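A position-aware check of this kind could be sketched as follows. The method names are assumptions for illustration, and the message format simply copies the Ruby example quoted above; the gem's actual validation code may differ.

```ruby
# Hypothetical validator: find the first NaN or infinite entry by [row, column].
def first_non_finite(data)
  data.each_with_index do |row, i|
    row.each_with_index do |value, j|
      return [i, j] unless value.to_f.finite?
    end
  end
  nil
end

# Raises with the offending position, or returns the data unchanged.
def validate_finite!(data)
  position = first_non_finite(data)
  return data if position.nil?

  raise ArgumentError, "Element at position #{position.inspect} is NaN or Infinite"
end
```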
|
76
|
+
|
77
|
+
### Edge Cases
|
78
|
+
|
79
|
+
#### Scenario: Single data point
|
80
|
+
|
81
|
+
**Python UMAP:**
|
82
|
+
- Returns `[[0, 0]]` or similar default position
|
83
|
+
- No error or warning about meaningless result
|
84
|
+
|
85
|
+
**Ruby ClusterKit:**
|
86
|
+
- Raises error explaining that manifold learning requires multiple points
|
87
|
+
- Suggests minimum data requirements
|
88
|
+
|
89
|
+
#### Scenario: Empty dataset
|
90
|
+
|
91
|
+
**Both implementations** appropriately reject empty input with clear error messages.
|
92
|
+
|
93
|
+
## Parameter Handling
|
94
|
+
|
95
|
+
### Auto-adjustment
|
96
|
+
|
97
|
+
**Python UMAP:**
|
98
|
+
- Silently adjusts parameters when necessary
|
99
|
+
- May issue warnings through Python's warning system
|
100
|
+
- Adjustments not always visible in normal workflow
|
101
|
+
|
102
|
+
**Ruby ClusterKit:**
|
103
|
+
- Adjusts parameters when needed
|
104
|
+
- Optional verbose mode shows adjustments
|
105
|
+
- Explains why adjustments were made
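The adjust-and-explain behaviour described for ClusterKit might look roughly like the sketch below. The helper name, the `verbose:` flag, and the clamping rule (at most `n_samples - 1` neighbors) are assumptions for illustration rather than the gem's documented logic.

```ruby
# Hypothetical helper: clamp n_neighbors to what the dataset can support.
# Returns the value to use; explains the change only when verbose is true.
def adjust_n_neighbors(n_neighbors, n_samples, verbose: false)
  raise ArgumentError, "need at least 2 samples, got #{n_samples}" if n_samples < 2

  max_allowed = n_samples - 1 # assumed rule: a point cannot be its own neighbor
  return n_neighbors if n_neighbors <= max_allowed

  if verbose
    warn "n_neighbors=#{n_neighbors} exceeds the #{max_allowed} other points " \
         "available in a dataset of #{n_samples}; using #{max_allowed} instead"
  end
  max_allowed
end
```

Writing the explanation with `warn` sends it to standard error, so it does not mix with results printed to standard output.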
|
106
|
+
|
107
|
+
### Validation
|
108
|
+
|
109
|
+
**Python UMAP:**
|
110
|
+
- Validates parameters against mathematical constraints
|
111
|
+
- Generic `ValueError` for invalid parameters
|
112
|
+
|
113
|
+
**Ruby ClusterKit:**
|
114
|
+
- Validates parameters with context-aware messages
|
115
|
+
- `InvalidParameterError` with specific guidance
|
116
|
+
- Suggests valid parameter ranges based on data
|
117
|
+
|
118
|
+
## Error Messages
|
119
|
+
|
120
|
+
### Python Style
|
121
|
+
Focuses on technical accuracy:
|
122
|
+
```
|
123
|
+
ValueError: n_neighbors must be less than or equal to the number of samples
|
124
|
+
```
|
125
|
+
|
126
|
+
### Ruby Style
|
127
|
+
Focuses on user guidance:
|
128
|
+
```
|
129
|
+
The n_neighbors parameter (15) is too large for your dataset size (10).
|
130
|
+
|
131
|
+
UMAP needs n_neighbors to be less than the number of samples.
|
132
|
+
Suggested value: 5
|
133
|
+
|
134
|
+
Example: UMAP.new(n_neighbors: 5)
|
135
|
+
```
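A message in this style could be assembled as in the following sketch. The error class defined here is a local stand-in, and the "half the sample count" suggestion is an assumed heuristic chosen only because it reproduces the numbers in the example above.

```ruby
# Hypothetical local error class, standing in for the gem's own.
class InvalidParameterError < ArgumentError; end

# Returns n_neighbors when valid; otherwise raises with guidance.
def check_n_neighbors!(n_neighbors, n_samples)
  return n_neighbors if n_neighbors < n_samples

  suggested = [n_samples / 2, 1].max # assumed heuristic, not the gem's rule
  raise InvalidParameterError, <<~MSG
    The n_neighbors parameter (#{n_neighbors}) is too large for your dataset size (#{n_samples}).

    UMAP needs n_neighbors to be less than the number of samples.
    Suggested value: #{suggested}

    Example: UMAP.new(n_neighbors: #{suggested})
  MSG
end
```

Subclassing `ArgumentError` keeps existing `rescue ArgumentError` handlers working while still letting callers rescue the more specific class.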
|
136
|
+
|
137
|
+
## Performance Characteristics
|
138
|
+
|
139
|
+
### Python
|
140
|
+
- Mature optimization over many years
|
141
|
+
- Handles edge cases without hanging
|
142
|
+
- Extensive numerical stability improvements
|
143
|
+
|
144
|
+
### Ruby
|
145
|
+
- Newer implementation via Rust bindings
|
146
|
+
- Excellent performance on standard datasets
|
147
|
+
- Some edge cases still being optimized
|
148
|
+
|
149
|
+
## When Each Approach Shines
|
150
|
+
|
151
|
+
### Python UMAP/HDBSCAN is ideal when:
|
152
|
+
- Working with well-understood data pipelines
|
153
|
+
- Requiring maximum compatibility
|
154
|
+
- Integrating with existing Python ML workflows
|
155
|
+
- Batch processing diverse datasets
|
156
|
+
|
157
|
+
### Ruby ClusterKit is ideal when:
|
158
|
+
- Learning dimensionality reduction techniques
|
159
|
+
- Working with new or unfamiliar datasets
|
160
|
+
- Needing clear feedback about data issues
|
161
|
+
- Prioritizing interpretable results
|
162
|
+
|
163
|
+
## Convergence Behavior
|
164
|
+
|
165
|
+
### Python
|
166
|
+
- Continues optimization even with poor convergence
|
167
|
+
- Returns best result found within iteration limit
|
168
|
+
- May produce suboptimal embeddings silently
|
169
|
+
|
170
|
+
### Ruby
|
171
|
+
- Raises `ConvergenceError` with explanation
|
172
|
+
- Suggests parameter adjustments to improve convergence
|
173
|
+
- Helps users understand why convergence failed
|
174
|
+
|
175
|
+
## Summary
|
176
|
+
|
177
|
+
Both implementations are valuable tools for dimensionality reduction and clustering. Python's approach offers battle-tested robustness and broad compatibility, making it excellent for production pipelines and experienced practitioners. ClusterKit's approach provides more guidance and education, making it particularly valuable for exploratory analysis and users who want to understand their results deeply.
|
178
|
+
|
179
|
+
The choice between them often depends on your specific needs:
|
180
|
+
- Choose Python when you need maximum compatibility and robustness
|
181
|
+
- Choose ClusterKit when you value clear feedback and guided analysis
|
182
|
+
|
183
|
+
Both approaches reflect thoughtful design decisions optimized for their respective user communities and use cases.
|