clusterkit 0.3.0-x86_64-linux
- checksums.yaml +7 -0
- data/.rspec +3 -0
- data/.simplecov +47 -0
- data/CHANGELOG.md +35 -0
- data/CLAUDE.md +226 -0
- data/Cargo.lock +3228 -0
- data/Cargo.toml +8 -0
- data/Gemfile +17 -0
- data/IMPLEMENTATION_NOTES.md +143 -0
- data/LICENSE.txt +21 -0
- data/PYTHON_COMPARISON.md +183 -0
- data/README.md +744 -0
- data/Rakefile +259 -0
- data/docs/KNOWN_ISSUES.md +130 -0
- data/docs/RUST_ERROR_HANDLING.md +164 -0
- data/docs/TEST_FIXTURES.md +170 -0
- data/docs/UMAP_EXPLAINED.md +362 -0
- data/docs/UMAP_TROUBLESHOOTING.md +284 -0
- data/docs/VERBOSE_OUTPUT.md +84 -0
- data/docs/assets/clusterkit-wide.png +0 -0
- data/docs/assets/clusterkit.png +0 -0
- data/docs/assets/visualization.png +0 -0
- data/examples/hdbscan_example.rb +147 -0
- data/examples/optimal_kmeans_example.rb +96 -0
- data/examples/pca_example.rb +114 -0
- data/examples/reproducible_umap.rb +99 -0
- data/examples/verbose_control.rb +43 -0
- data/ext/clusterkit/Cargo.toml +26 -0
- data/ext/clusterkit/extconf.rb +23 -0
- data/ext/clusterkit/src/clustering/hdbscan_wrapper.rs +80 -0
- data/ext/clusterkit/src/clustering.rs +221 -0
- data/ext/clusterkit/src/embedder.rs +349 -0
- data/ext/clusterkit/src/hnsw.rs +579 -0
- data/ext/clusterkit/src/lib.rs +24 -0
- data/ext/clusterkit/src/svd.rs +89 -0
- data/ext/clusterkit/src/tests.rs +16 -0
- data/ext/clusterkit/src/utils.rs +183 -0
- data/lib/clusterkit/3.1/clusterkit.so +0 -0
- data/lib/clusterkit/3.2/clusterkit.so +0 -0
- data/lib/clusterkit/3.3/clusterkit.so +0 -0
- data/lib/clusterkit/3.4/clusterkit.so +0 -0
- data/lib/clusterkit/clustering/hdbscan.rb +164 -0
- data/lib/clusterkit/clustering.rb +194 -0
- data/lib/clusterkit/clusterkit.rb +14 -0
- data/lib/clusterkit/configuration.rb +24 -0
- data/lib/clusterkit/data_validator.rb +132 -0
- data/lib/clusterkit/dimensionality/pca.rb +251 -0
- data/lib/clusterkit/dimensionality/svd.rb +175 -0
- data/lib/clusterkit/dimensionality/umap.rb +282 -0
- data/lib/clusterkit/dimensionality.rb +29 -0
- data/lib/clusterkit/hdbscan_api_design.rb +142 -0
- data/lib/clusterkit/hnsw.rb +251 -0
- data/lib/clusterkit/preprocessing.rb +106 -0
- data/lib/clusterkit/silence.rb +42 -0
- data/lib/clusterkit/utils.rb +51 -0
- data/lib/clusterkit/version.rb +5 -0
- data/lib/clusterkit.rb +105 -0
- data/lib/tasks/visualize.rake +641 -0
- metadata +220 -0
data/Cargo.toml
ADDED
data/Gemfile
ADDED
@@ -0,0 +1,17 @@
+source "https://rubygems.org"
+
+# Specify your gem's dependencies in annembed-ruby.gemspec
+gemspec
+
+# Test-only dependencies
+group :test do
+  # Optional: For comparing with Python implementations
+  gem "pycall", "~> 1.4", require: false
+end
+
+# Development dependencies for generating test fixtures
+group :development do
+  # For generating real embeddings to use as test fixtures
+  # This avoids the hanging issues with random test data
+  gem "red-candle", "~> 1.0", require: false
+end
data/IMPLEMENTATION_NOTES.md
ADDED
@@ -0,0 +1,143 @@
+# Implementation Notes for annembed-ruby
+
+## Architecture Overview
+
+The gem follows a three-layer architecture:
+
+1. **Ruby Layer** (`lib/annembed/`): User-facing API with Ruby idioms
+2. **Magnus Bridge** (`ext/annembed_ruby/src/`): Type conversions and bindings
+3. **annembed Core**: The underlying Rust embedding library
+
+## Key Implementation Challenges
+
+### 1. Data Type Conversions
+
+The main challenge is efficiently converting between Ruby and Rust types:
+
+```
+Ruby Array/Numo::NArray ↔ ndarray::Array2<f32> ↔ annembed matrices
+```
+
+Consider using views instead of copies where possible.
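Annotation (not part of the diffed file): one way to picture the Ruby end of the conversion chain above is to flatten a nested Array into a row-major buffer plus its shape before it crosses into native code. This is only a sketch. The helper name `flatten_row_major` and the returned hash layout are assumptions made for illustration, not the gem's API.

```ruby
# Hypothetical helper: turn a rectangular nested Array into a flat,
# row-major Array of Floats together with its dimensions.
def flatten_row_major(rows)
  unless rows.is_a?(Array) && !rows.empty?
    raise ArgumentError, "data must be a non-empty Array of rows"
  end

  n_cols = nil
  flat = []
  rows.each_with_index do |row, i|
    raise ArgumentError, "row #{i} is not an Array" unless row.is_a?(Array)

    n_cols ||= row.length # the first row fixes the expected width
    unless row.length == n_cols
      raise ArgumentError, "row #{i} has #{row.length} columns, expected #{n_cols}"
    end

    # Kernel#Float raises on values that cannot be converted
    row.each { |value| flat << Float(value) }
  end

  { data: flat, rows: rows.length, cols: n_cols }
end

matrix = flatten_row_major([[1, 2, 3], [4, 5, 6]])
puts matrix.inspect
```

A flat buffer with explicit dimensions is the form that ndarray's `Array2::from_shape_vec` accepts on the Rust side, which is why the sketch returns both pieces.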
+
+### 2. Memory Management
+
+- Magnus handles Ruby GC integration
+- Need to be careful with large matrices
+- Consider streaming for datasets > available RAM
+
+### 3. Progress Reporting
+
+annembed operations can be long-running. Options:
+1. Polling from Ruby side (requires thread)
+2. Callback from Rust (needs GVL management)
+3. Async/await pattern (complex but clean)
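Annotation (not part of the diffed file): option 1 above, polling from the Ruby side, could look roughly like the following. The worker here only sleeps as a stand-in for real work, and `run_with_polling` is a hypothetical name. A real native call would also have to release the GVL for the polling thread to get a turn.

```ruby
# Hypothetical sketch of Ruby-side polling: a worker thread advances a
# shared counter while the caller samples it at a fixed interval.
def run_with_polling(total_steps, poll_interval: 0.01)
  done = 0
  lock = Mutex.new

  worker = Thread.new do
    total_steps.times do
      sleep 0.001 # placeholder for one unit of real work
      lock.synchronize { done += 1 }
    end
  end

  # Thread#join(limit) returns nil when the limit expires before the
  # thread finishes, so the loop body runs once per poll interval.
  until worker.join(poll_interval)
    snapshot = lock.synchronize { done }
    yield snapshot
  end

  lock.synchronize { done }
end

final = run_with_polling(50) { |n| puts "progress: #{n}/50" }
puts "finished #{final} steps"
```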
+
+### 4. Error Handling
+
+Map annembed errors to Ruby exceptions:
+- `annembed::Error` → `Annembed::Error`
+- Panic → `RuntimeError` (avoid panics!)
+- Invalid input → `ArgumentError`
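Annotation (not part of the diffed file): the mapping listed above can be sketched on the Ruby side as a small lookup. The module name `AnnembedSketch` and the `kind` symbols are assumptions for illustration. The notes do not say how the native layer reports the failure category.

```ruby
# Hypothetical stand-in for the gem's namespace and base error class.
module AnnembedSketch
  class Error < StandardError; end

  # Build the Ruby exception for a failure category reported by the
  # native layer. The symbols used here are invented for this sketch.
  def self.exception_for(kind, message)
    case kind
    when :library       then Error.new(message)         # annembed::Error
    when :invalid_input then ArgumentError.new(message) # bad user input
    else RuntimeError.new(message)                      # panics, anything else
    end
  end
end

begin
  raise AnnembedSketch.exception_for(:invalid_input, "data must be two-dimensional")
rescue ArgumentError => e
  puts "rescued #{e.class}: #{e.message}"
end
```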
+
+## annembed API Mapping
+
+### Core Types
+- `EmbedderT` - Main embedding trait
+- `HnswParams` - HNSW graph parameters
+- `EmbedderParams` - Algorithm-specific params
+- `GraphProjection` - For graph-based methods
+
+### Key Functions
+```rust
+// Main embedding function
+pub fn get_embedder(
+    data: &Array2<f32>,
+    params: EmbedderParams,
+    hnsw_params: HnswParams
+) -> Result<Box<dyn EmbedderT>>
+
+// The embedder trait
+pub trait EmbedderT {
+    fn embed(&self) -> Result<Array2<f32>>;
+    fn get_hnsw(&self) -> &Hnsw<f32, DistL2>;
+}
+```
+
+## Performance Considerations
+
+1. **Parallelization**: annembed uses rayon, respect Ruby's thread settings
+2. **BLAS Backend**: Allow users to choose (OpenBLAS vs MKL)
+3. **Large Datasets**: Implement chunking for > 1M points
+4. **GPU Future**: Design API to allow GPU backend later
+
+## Testing Strategy
+
+### Unit Tests
+- Type conversions
+- Configuration parsing
+- Error handling
+
+### Integration Tests
+- Small datasets (Iris)
+- Medium datasets (MNIST subset)
+- Large datasets (if CI allows)
+
+### Benchmarks
+- vs Python UMAP
+- vs pure Ruby implementations
+- Memory usage profiling
+
+## Platform Support
+
+### Priorities
+1. Linux x86_64 (most servers)
+2. macOS arm64 (M1/M2 developers)
+3. macOS x86_64 (Intel Macs)
+4. Windows x86_64 (if feasible)
+
+### Build Considerations
+- Use rake-compiler-dock for cross-compilation
+- Static link BLAS when possible
+- Provide clear instructions for source builds
+
+## Future Enhancements
+
+1. **Incremental Learning**: Add new points to existing embedding
+2. **Custom Metrics**: Allow user-defined distance functions
+3. **Supervised Embedding**: Use labels to guide embedding
+4. **GPU Support**: If annembed adds it
+5. **Visualization**: Built-in plotting helpers?
+
+## Debugging Tips
+
+1. Use `RUST_BACKTRACE=1` for better errors
+2. Add logging with `env_logger` in Rust
+3. Use `rb_sys` debug mode for development
+4. Memory debugging with valgrind/ASAN
+
+## Code Organization
+
+Keep related functionality together:
+- `embedder.rs`: All embedding algorithms
+- `utils.rs`: Dimension estimation, hubness
+- `svd.rs`: Matrix decomposition
+- `conversions.rs`: Type conversion helpers
+
+## Release Checklist
+
+1. [ ] Update version in `version.rb`
+2. [ ] Update CHANGELOG.md
+3. [ ] Run full test suite on all platforms
+4. [ ] Build precompiled gems
+5. [ ] Test gem installation from .gem file
+6. [ ] Tag release in git
+7. [ ] Push to RubyGems.org
+8. [ ] Update documentation
+
+## References
+
+- annembed docs: https://docs.rs/annembed/
+- Magnus guide: https://github.com/matsadler/magnus
+- UMAP paper: https://arxiv.org/abs/1802.03426
+- t-SNE paper: https://lvdmaaten.github.io/tsne/
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (c) 2024 Your Name
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
data/PYTHON_COMPARISON.md
ADDED
@@ -0,0 +1,183 @@
+# ClusterKit vs Python UMAP/HDBSCAN: A Comparison
+
+## Overview
+
+ClusterKit and Python's UMAP/HDBSCAN implementations share the same algorithmic foundations but take different approaches to error handling and data validation. This document outlines these differences to help users understand what to expect from each implementation.
+
+## Philosophical Approaches
+
+### Python UMAP/HDBSCAN
+Python's implementations prioritize **algorithmic robustness**, attempting to produce results even in edge cases. This approach:
+- Maximizes compatibility with diverse datasets
+- Minimizes interruptions to analysis workflows
+- Trusts users to interpret results appropriately
+- Follows the scikit-learn pattern of "fit on anything"
+
+### Ruby ClusterKit
+ClusterKit prioritizes **guided analysis**, providing feedback when data characteristics may lead to suboptimal results. This approach:
+- Helps users understand their data's suitability for the algorithm
+- Provides actionable suggestions when issues arise
+- Encourages best practices in dimensionality reduction
+- Aims to prevent misinterpretation of results
+
+## Behavioral Differences
+
+### Small Datasets
+
+#### Scenario: 3 data points with n_neighbors=15
+
+**Python UMAP:**
+- Automatically adjusts n_neighbors silently
+- Returns a 2D embedding of the 3 points
+- The resulting triangle's shape depends on random initialization
+
+**Ruby ClusterKit:**
+- Explicitly adjusts n_neighbors with optional warning
+- Currently experiences performance issues on very small datasets (being addressed)
+- Provides context about why adjustment was needed
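Annotation (not part of the diffed file): the "adjust with optional warning" behavior described above might be sketched as follows. The function name and the cap of `n_samples - 1` are assumptions, chosen to agree with the document's statement that n_neighbors must be less than the number of samples. The gem's actual rule is not shown in this diff.

```ruby
# Hypothetical sketch: clamp n_neighbors to what the dataset can support
# and optionally explain the change on stderr.
def adjust_n_neighbors(n_neighbors, n_samples, verbose: false)
  raise ArgumentError, "need at least 2 samples, got #{n_samples}" if n_samples < 2

  max_allowed = n_samples - 1 # assumption: strictly fewer than n_samples
  return n_neighbors if n_neighbors <= max_allowed

  if verbose
    warn "n_neighbors=#{n_neighbors} is too large for #{n_samples} samples; " \
         "using #{max_allowed} instead"
  end
  max_allowed
end

puts adjust_n_neighbors(15, 3, verbose: true)
```

For the scenario above (3 points, n_neighbors=15) this sketch falls back to 2 neighbors and says why when `verbose` is set.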
+
+### Data Quality Issues
+
+#### Scenario: Random data without inherent structure
+
+**Python UMAP:**
+- Processes the data without warnings
+- Returns an embedding that may show apparent "clusters"
+- These patterns are typically artifacts of the algorithm rather than real structure
+
+**Ruby ClusterKit:**
+- May raise `IsolatedPointError` with explanation
+- Suggests that the data lacks structure suitable for manifold learning
+- Recommends alternatives like PCA for unstructured data
+
+#### Scenario: Extreme outliers (points 1000x farther than main data)
+
+**Python UMAP:**
+- Embeds outliers far from main cluster
+- Main cluster may be compressed to accommodate outlier scale
+- Visualization scale dominated by outliers
+
+**Ruby ClusterKit:**
+- May raise `IsolatedPointError`
+- Explains impact of outliers on manifold learning
+- Suggests preprocessing steps like outlier removal or normalization
+
+### Invalid Data
+
+#### Scenario: NaN or Infinite values
+
+**Both implementations** reject invalid numerical data, but with different messaging:
+
+**Python:** `"Input contains NaN"` or `"Input contains infinity"`
+
+**Ruby:** `"Element at position [5, 2] is NaN or Infinite"`
+
+The Ruby version provides specific location information to aid debugging.
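Annotation (not part of the diffed file): a position-reporting check of the kind described above can be sketched in a few lines. `validate_finite!` is a hypothetical name and `ArgumentError` is a stand-in exception class. Only the message format is taken from the document.

```ruby
# Hypothetical sketch: scan a nested Array and report the first cell that
# is NaN or infinite, including its [row, column] position.
def validate_finite!(rows)
  rows.each_with_index do |row, i|
    row.each_with_index do |value, j|
      next if Float(value).finite?

      raise ArgumentError, "Element at position [#{i}, #{j}] is NaN or Infinite"
    end
  end
  true
end

data = Array.new(6) { [0.0, 1.0, 2.0] }
data[5][2] = Float::NAN

begin
  validate_finite!(data)
rescue ArgumentError => e
  puts e.message
end
```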
+
+### Edge Cases
+
+#### Scenario: Single data point
+
+**Python UMAP:**
+- Returns `[[0, 0]]` or similar default position
+- No error or warning about meaningless result
+
+**Ruby ClusterKit:**
+- Raises error explaining that manifold learning requires multiple points
+- Suggests minimum data requirements
+
+#### Scenario: Empty dataset
+
+**Both implementations** appropriately reject empty input with clear error messages.
+
+## Parameter Handling
+
+### Auto-adjustment
+
+**Python UMAP:**
+- Silently adjusts parameters when necessary
+- May issue warnings through Python's warning system
+- Adjustments not always visible in normal workflow
+
+**Ruby ClusterKit:**
+- Adjusts parameters when needed
+- Optional verbose mode shows adjustments
+- Explains why adjustments were made
+
+### Validation
+
+**Python UMAP:**
+- Validates parameters against mathematical constraints
+- Generic `ValueError` for invalid parameters
+
+**Ruby ClusterKit:**
+- Validates parameters with context-aware messages
+- `InvalidParameterError` with specific guidance
+- Suggests valid parameter ranges based on data
+
+## Error Messages
+
+### Python Style
+Focuses on technical accuracy:
+```
+ValueError: n_neighbors must be less than or equal to the number of samples
+```
+
+### Ruby Style
+Focuses on user guidance:
+```
+The n_neighbors parameter (15) is too large for your dataset size (10).
+
+UMAP needs n_neighbors to be less than the number of samples.
+Suggested value: 5
+
+Example: UMAP.new(n_neighbors: 5)
+```
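Annotation (not part of the diffed file): a guidance-style message like the Ruby example above could be assembled as shown here. The helper name is hypothetical, and the "half the dataset size" rule for the suggested value is a guess that happens to reproduce the single example given (10 samples, suggestion 5). The document does not state how the gem derives its suggestion.

```ruby
# Hypothetical sketch: build a multi-line, guidance-oriented error message.
def neighbors_guidance(n_neighbors, n_samples)
  # Assumption: suggest half the dataset size, never less than 1.
  suggested = [n_samples / 2, 1].max

  <<~MSG
    The n_neighbors parameter (#{n_neighbors}) is too large for your dataset size (#{n_samples}).

    UMAP needs n_neighbors to be less than the number of samples.
    Suggested value: #{suggested}

    Example: UMAP.new(n_neighbors: #{suggested})
  MSG
end

puts neighbors_guidance(15, 10)
```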
+
+## Performance Characteristics
+
+### Python
+- Mature optimization over many years
+- Handles edge cases without hanging
+- Extensive numerical stability improvements
+
+### Ruby
+- Newer implementation via Rust bindings
+- Excellent performance on standard datasets
+- Some edge cases still being optimized
+
+## When Each Approach Shines
+
+### Python UMAP/HDBSCAN is ideal when:
+- Working with well-understood data pipelines
+- Requiring maximum compatibility
+- Integrating with existing Python ML workflows
+- Batch processing diverse datasets
+
+### Ruby ClusterKit is ideal when:
+- Learning dimensionality reduction techniques
+- Working with new or unfamiliar datasets
+- Needing clear feedback about data issues
+- Prioritizing interpretable results
+
+## Convergence Behavior
+
+### Python
+- Continues optimization even with poor convergence
+- Returns best result found within iteration limit
+- May produce suboptimal embeddings silently
+
+### Ruby
+- Raises `ConvergenceError` with explanation
+- Suggests parameter adjustments to improve convergence
+- Helps users understand why convergence failed
+
+## Summary
+
+Both implementations are valuable tools for dimensionality reduction and clustering. Python's approach offers battle-tested robustness and broad compatibility, making it excellent for production pipelines and experienced practitioners. ClusterKit's approach provides more guidance and education, making it particularly valuable for exploratory analysis and users who want to understand their results deeply.
+
+The choice between them often depends on your specific needs:
+- Choose Python when you need maximum compatibility and robustness
+- Choose ClusterKit when you value clear feedback and guided analysis
+
+Both approaches reflect thoughtful design decisions optimized for their respective user communities and use cases.