@ruvector/rvdna 0.1.0 → 0.1.1
- package/README.md +304 -0
- package/package.json +1 -1
package/README.md ADDED
@@ -0,0 +1,304 @@

# @ruvector/rvdna

**DNA analysis in JavaScript.** Encode sequences, translate proteins, search genomes by similarity, and read the `.rvdna` AI-native file format, all from Node.js or the browser.

Built on Rust via NAPI-RS for native speed. Falls back to pure JavaScript when native bindings aren't available.

```bash
npm install @ruvector/rvdna
```

## What It Does

| Function | What It Does | Native Required? |
|---|---|---|
| `encode2bit(seq)` | Pack DNA into 2-bit bytes (4 bases per byte) | No (JS fallback) |
| `decode2bit(buf, len)` | Unpack 2-bit bytes back to DNA string | No (JS fallback) |
| `translateDna(seq)` | Translate DNA to protein amino acids | No (JS fallback) |
| `cosineSimilarity(a, b)` | Cosine similarity between two vectors | No (JS fallback) |
| `fastaToRvdna(seq, opts)` | Convert FASTA to `.rvdna` binary format | Yes |
| `readRvdna(buf)` | Parse a `.rvdna` file from a Buffer | Yes |
| `isNativeAvailable()` | Check if native Rust bindings are loaded | No |

## Quick Start

```js
const { encode2bit, decode2bit, translateDna, cosineSimilarity } = require('@ruvector/rvdna');

// Encode DNA to compact 2-bit format (4 bases per byte)
const packed = encode2bit('ACGTACGTACGT');
console.log(packed); // <Buffer 1b 1b 1b>

// Decode it back (lossless round-trip for A/C/G/T input)
const dna = decode2bit(packed, 12);
console.log(dna); // 'ACGTACGTACGT'

// Translate DNA to protein (standard genetic code)
// Codons: ATG GCC ATT GTA ATG
const protein = translateDna('ATGGCCATTGTAATG');
console.log(protein); // 'MAIVM'

// Compare two k-mer vectors
const sim = cosineSimilarity([1, 2, 3], [1, 2, 3]);
console.log(sim); // 1 (identical)
```

## API Reference

### `encode2bit(sequence: string): Buffer`

Packs a DNA string into 2-bit bytes. Each byte holds 4 bases: A=00, C=01, G=10, T=11. Ambiguous bases (N) map to A.

```js
encode2bit('ACGT') // <Buffer 1b> (one byte for 4 bases)
encode2bit('AAAA') // <Buffer 00>
encode2bit('TTTT') // <Buffer ff>
```
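
For readers curious how the packing works, here is a minimal pure-JavaScript sketch of the scheme described above. It is not the package's implementation: the name `packBases` is hypothetical, the bit order (first base in the two most significant bits) is inferred from the documented result for `'ACGT'`, and zero-padding the final byte is an assumption.

```js
// Hypothetical helper, not part of @ruvector/rvdna.
const BASE_CODES = { A: 0b00, C: 0b01, G: 0b10, T: 0b11 };

function packBases(sequence) {
  const bytes = new Uint8Array(Math.ceil(sequence.length / 4));
  for (let i = 0; i < sequence.length; i++) {
    // Unknown or ambiguous bases (such as N) fall back to A (00).
    const code = BASE_CODES[sequence[i].toUpperCase()] ?? 0b00;
    // The first base of each group of four lands in the top two bits.
    const shift = 6 - 2 * (i % 4);
    bytes[i >> 2] |= code << shift;
  }
  return bytes;
}

console.log(packBases('ACGT'));  // one byte: 0x1b
console.log(packBases('TTTT'));  // one byte: 0xff
console.log(packBases('ACGTA')); // two bytes: 0x1b, then 0x00 (A plus padding)
```

Because N shares the code for A, a sequence containing ambiguous bases will not survive a round trip unchanged under this scheme.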

### `decode2bit(buffer: Buffer, length: number): string`

Decodes 2-bit packed bytes back to a DNA string. You must pass the original sequence length since the last byte may have padding.

```js
decode2bit(Buffer.from([0x1b]), 4) // 'ACGT'
```
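
The sketch below shows why the length argument matters. It is an illustration under the same assumptions as the encoding described above (A=00, C=01, G=10, T=11, first base in the most significant bits), not the package's code, and `unpackBases` is a hypothetical name. The same two bytes decode to different sequences depending on the length supplied, because padding bits are indistinguishable from A.

```js
// Hypothetical helper, not part of @ruvector/rvdna.
const BASES = 'ACGT'; // string index equals the 2-bit code: A=0, C=1, G=2, T=3

function unpackBases(bytes, length) {
  if (length > bytes.length * 4) {
    throw new RangeError('length exceeds the number of packed bases');
  }
  let sequence = '';
  for (let i = 0; i < length; i++) {
    // Read two bits at a time, starting from the top of each byte.
    const shift = 6 - 2 * (i % 4);
    sequence += BASES[(bytes[i >> 2] >> shift) & 0b11];
  }
  return sequence;
}

const packed = Uint8Array.of(0x1b, 0x00);
console.log(unpackBases(packed, 5)); // 'ACGTA'
console.log(unpackBases(packed, 8)); // 'ACGTAAAA'
```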

### `translateDna(sequence: string): string`

Translates a DNA string to a protein amino acid string using the standard genetic code. Stops at the first stop codon (TAA, TAG, TGA).

```js
translateDna('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')
// 'MAIVMGR' (stops at the TGA stop codon)
```
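
As a rough illustration of codon-by-codon translation, the sketch below uses the standard genetic code laid out as a 64-character lookup string. It is not the package's implementation: `translate` is a hypothetical name, and the handling of unknown bases (emitting `X`) and of trailing partial codons (ignoring them) are assumptions.

```js
// Hypothetical helper, not part of @ruvector/rvdna.
// Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
// (each position cycles through T, C, A, G). '*' marks a stop codon.
const AMINO_ACIDS =
  'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
const BASE_INDEX = { T: 0, C: 1, A: 2, G: 3 };

function translate(sequence) {
  const dna = sequence.toUpperCase();
  let protein = '';
  // Walk complete codons only; a trailing 1 or 2 bases are ignored.
  for (let i = 0; i + 3 <= dna.length; i += 3) {
    const a = BASE_INDEX[dna[i]];
    const b = BASE_INDEX[dna[i + 1]];
    const c = BASE_INDEX[dna[i + 2]];
    if (a === undefined || b === undefined || c === undefined) {
      protein += 'X'; // assumption: codons with unknown bases become 'X'
      continue;
    }
    const aminoAcid = AMINO_ACIDS[16 * a + 4 * b + c];
    if (aminoAcid === '*') break; // TAA, TAG, or TGA
    protein += aminoAcid;
  }
  return protein;
}

console.log(translate('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')); // 'MAIVMGR'
console.log(translate('ATGTAA')); // 'M'
```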

### `cosineSimilarity(a: number[], b: number[]): number`

Returns cosine similarity between two numeric arrays. Result is between -1 and 1.

```js
cosineSimilarity([1, 0, 0], [0, 1, 0]) // 0 (orthogonal)
cosineSimilarity([1, 2, 3], [2, 4, 6]) // 1 (parallel)
```
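
Cosine similarity is the dot product of the two vectors divided by the product of their lengths: `(a · b) / (|a| * |b|)`. The sketch below is a plain-JavaScript version of that formula for illustration. The name `cosine` is hypothetical, and returning 0 for a zero vector and throwing on mismatched lengths are assumptions rather than documented behavior of the package.

```js
// Hypothetical helper, not part of @ruvector/rvdna.
function cosine(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: a zero vector has no direction, so report 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

console.log(cosine([1, 0, 0], [0, 1, 0]));    // 0
console.log(cosine([1, 2, 3], [2, 4, 6]));    // 1
console.log(cosine([1, 2, 3], [-1, -2, -3])); // -1
```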

### `fastaToRvdna(sequence: string, options?: RvdnaOptions): Buffer`

Converts a raw DNA sequence to the `.rvdna` binary format with pre-computed k-mer vectors. **Requires native bindings.**

```js
const { fastaToRvdna, isNativeAvailable } = require('@ruvector/rvdna');

if (isNativeAvailable()) {
  const rvdna = fastaToRvdna('ACGTACGT...', { k: 11, dims: 512, blockSize: 500 });
  require('fs').writeFileSync('output.rvdna', rvdna);
}
```

| Option | Default | Description |
|---|---|---|
| `k` | 11 | K-mer size for vector encoding |
| `dims` | 512 | Vector dimensions per block |
| `blockSize` | 500 | Bases per vector block |
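
To give a feel for how the three options could interact, here is a sketch of one common way to turn a block of DNA into a fixed-size vector: hash each k-mer into one of `dims` buckets and normalize the counts. This is an assumption for illustration only. How the package actually computes its vectors is not described here, and `fnv1a`, `kmerVector`, and `blockVectors` are hypothetical names.

```js
// Illustrative only: the real .rvdna vector encoding is not specified here.
// 32-bit FNV-1a hash of a string.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hash every k-mer in one block into a `dims`-long count vector, then L2-normalize.
function kmerVector(block, k, dims) {
  const vector = new Float32Array(dims);
  for (let i = 0; i + k <= block.length; i++) {
    vector[fnv1a(block.slice(i, i + k)) % dims] += 1;
  }
  const norm = Math.hypot(...vector);
  return norm === 0 ? vector : vector.map((x) => x / norm);
}

// Split a sequence into blocks of `blockSize` bases and embed each one.
// K-mers that straddle a block boundary are skipped in this sketch.
function blockVectors(sequence, { k = 11, dims = 512, blockSize = 500 } = {}) {
  const vectors = [];
  for (let start = 0; start < sequence.length; start += blockSize) {
    vectors.push(kmerVector(sequence.slice(start, start + blockSize), k, dims));
  }
  return vectors;
}

const vectors = blockVectors('ACGT'.repeat(300)); // 1,200 bases
console.log(vectors.length);    // 3 blocks: 500 + 500 + 200 bases
console.log(vectors[0].length); // 512
```

Under this scheme, a larger `k` makes each k-mer more specific, a larger `dims` reduces hash collisions, and a smaller `blockSize` yields more vectors per sequence.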

### `readRvdna(buffer: Buffer): RvdnaFile`

Parses a `.rvdna` file. Returns the decoded sequence, k-mer vectors, variants, metadata, and file statistics. **Requires native bindings.**

```js
const fs = require('fs');
const { readRvdna } = require('@ruvector/rvdna');

const file = readRvdna(fs.readFileSync('sample.rvdna'));

console.log(file.sequenceLength);        // 430
console.log(file.sequence.slice(0, 20)); // 'ATGGTGCATCTGACTCCTGA'
console.log(file.kmerVectors.length);    // number of vector blocks
console.log(file.stats.bitsPerBase);     // ~3.2
console.log(file.stats.compressionRatio); // vs raw FASTA
```

**RvdnaFile fields:**

| Field | Type | Description |
|---|---|---|
| `version` | `number` | Format version |
| `sequenceLength` | `number` | Number of bases |
| `sequence` | `string` | Decoded DNA string |
| `kmerVectors` | `Array` | Pre-computed k-mer vector blocks |
| `variants` | `Array \| null` | Variant positions with genotype likelihoods |
| `metadata` | `Record \| null` | Key-value metadata |
| `stats.totalSize` | `number` | File size in bytes |
| `stats.bitsPerBase` | `number` | Storage efficiency |
| `stats.compressionRatio` | `number` | Compression vs raw |

## The `.rvdna` File Format

Traditional genomic formats (FASTA, FASTQ, BAM) store raw sequences. Every time an AI model needs that data, it re-encodes everything from scratch: vectors, attention matrices, features. This takes 30-120 seconds per file.

`.rvdna` stores the sequence **and** pre-computed AI features together. Open the file and everything is ready, with no re-encoding.

```
.rvdna file layout:

[Magic: "RVDNA\x01\x00\x00"]     8 bytes (file identifier)
[Header]                         64 bytes (version, flags, offsets)
[Section 0: Sequence]            2-bit packed DNA (4 bases/byte)
[Section 1: K-mer Vectors]       HNSW-ready embeddings
[Section 2: Attention Weights]   Sparse COO matrices
[Section 3: Variant Tensor]      f16 genotype likelihoods
[Section 4: Protein Embeddings]  GNN features + contact graphs
[Section 5: Epigenomic Tracks]   Methylation + clock data
[Section 6: Metadata]            JSON provenance + checksums
```
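
Since the layout begins with a fixed 8-byte magic value, a file can be sanity-checked in pure JavaScript before it is handed to the native parser. The sketch below only compares those first 8 bytes against the value listed in the layout. `hasRvdnaMagic` is a hypothetical helper, not a package export, and it makes no attempt to interpret the header, whose field layout is not given here.

```js
// Hypothetical helper, not part of @ruvector/rvdna.
// "RVDNA" followed by the bytes 0x01 0x00 0x00, as listed in the layout.
const RVDNA_MAGIC = Uint8Array.of(0x52, 0x56, 0x44, 0x4e, 0x41, 0x01, 0x00, 0x00);

// Accepts any Uint8Array, which includes Node.js Buffers.
function hasRvdnaMagic(bytes) {
  if (bytes.length < RVDNA_MAGIC.length) return false;
  return RVDNA_MAGIC.every((value, i) => bytes[i] === value);
}

const good = Uint8Array.of(0x52, 0x56, 0x44, 0x4e, 0x41, 0x01, 0x00, 0x00, 0xff);
const fasta = new TextEncoder().encode('>chr1\nACGT\n');
console.log(hasRvdnaMagic(good));  // true
console.log(hasRvdnaMagic(fasta)); // false
```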

### Format Comparison

| | FASTA | FASTQ | BAM | CRAM | **.rvdna** |
|---|---|---|---|---|---|
| **Encoding** | ASCII (1 char/base) | ASCII + Phred | Binary + ref | Ref-compressed | 2-bit packed |
| **Bits per base** | 8 | 16 | 2-4 | 0.5-2 | **3.2** (seq only) |
| **Random access** | Scan from start | Scan from start | Index ~10 us | Decode ~50 us | **mmap <1 us** |
| **AI features included** | No | No | No | No | **Yes** |
| **Vector search ready** | No | No | No | No | **HNSW built-in** |
| **Zero-copy mmap** | No | No | Partial | No | **Full** |
| **Single file** | Yes | Yes | Needs .bai | Needs .crai | **Yes** |

## Platform Support

Native NAPI-RS bindings are available for these platforms:

| Platform | Architecture | Package |
|---|---|---|
| Linux | x64 (glibc) | `@ruvector/rvdna-linux-x64-gnu` |
| Linux | ARM64 (glibc) | `@ruvector/rvdna-linux-arm64-gnu` |
| macOS | x64 (Intel) | `@ruvector/rvdna-darwin-x64` |
| macOS | ARM64 (Apple Silicon) | `@ruvector/rvdna-darwin-arm64` |
| Windows | x64 | `@ruvector/rvdna-win32-x64-msvc` |

These install automatically as optional dependencies. On unsupported platforms, basic functions (`encode2bit`, `decode2bit`, `translateDna`, `cosineSimilarity`) still work via pure JavaScript fallbacks.
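
To check ahead of time whether a given machine is covered, the table can be expressed as a lookup keyed on Node's `process.platform` and `process.arch`. This is a sketch, not the package's loader: `nativePackageFor` is a hypothetical name, and the function does not distinguish glibc from musl on Linux.

```js
// Hypothetical helper, not the package's actual loader.
const NATIVE_PACKAGES = {
  'linux-x64': '@ruvector/rvdna-linux-x64-gnu',
  'linux-arm64': '@ruvector/rvdna-linux-arm64-gnu',
  'darwin-x64': '@ruvector/rvdna-darwin-x64',
  'darwin-arm64': '@ruvector/rvdna-darwin-arm64',
  'win32-x64': '@ruvector/rvdna-win32-x64-msvc',
};

// Returns the optional dependency name for a platform, or null if none is listed.
// Note: the Linux builds target glibc; this sketch does not detect musl systems.
function nativePackageFor(platform = process.platform, arch = process.arch) {
  return NATIVE_PACKAGES[`${platform}-${arch}`] ?? null;
}

console.log(nativePackageFor('darwin', 'arm64'));  // '@ruvector/rvdna-darwin-arm64'
console.log(nativePackageFor('linux', 'riscv64')); // null
console.log(nativePackageFor()); // depends on the machine running this
```

At runtime, `isNativeAvailable()` remains the authoritative check, since it reports whether the bindings actually loaded.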

## WASM (WebAssembly)

rvDNA can run entirely in the browser via WebAssembly. No server needed, no data leaves the user's device.

### Browser Setup

```bash
# Build from the Rust source
cd examples/dna
wasm-pack build --target web --release
```

This produces a `pkg/` directory with `.wasm` and `.js` glue code.

### Using in HTML

```html
<script type="module">
  import init, { encode2bit, translateDna } from './pkg/rvdna.js';

  await init(); // Load the WASM module

  // Encode DNA
  const packed = encode2bit('ACGTACGTACGT');
  console.log('Packed bytes:', packed);

  // Translate to protein (codons: ATG GCC ATT GTA ATG)
  const protein = translateDna('ATGGCCATTGTAATG');
  console.log('Protein:', protein); // 'MAIVM'
</script>
```

### Using with Bundlers (Webpack, Vite)

```bash
# For bundler targets
wasm-pack build --target bundler --release
```

```js
// In your app
import { encode2bit, translateDna } from '@ruvector/rvdna-wasm';

const packed = encode2bit('ACGTACGT');
const protein = translateDna('ATGGCCATT');
```

### WASM Features

| Feature | Status | Description |
|---|---|---|
| 2-bit encode/decode | Available | Pack/unpack DNA sequences |
| Protein translation | Available | Standard genetic code |
| Cosine similarity | Available | Vector comparison |
| `.rvdna` read/write | Planned | Full format support in browser |
| HNSW search | Planned | K-mer similarity search |
| Variant calling | Planned | Client-side mutation detection |

**Target WASM binary size:** <2 MB gzipped

### Privacy

WASM runs entirely client-side. DNA data never leaves the browser. This makes it suitable for:
- Clinical genomics dashboards
- Patient-facing genetic reports
- Educational tools
- Offline/edge analysis on devices with no internet

## TypeScript

Full TypeScript definitions are included. Import types directly:

```ts
import {
  encode2bit,
  decode2bit,
  translateDna,
  cosineSimilarity,
  fastaToRvdna,
  readRvdna,
  isNativeAvailable,
  RvdnaOptions,
  RvdnaFile,
} from '@ruvector/rvdna';
```

## Speed

The native (Rust) backend handles these operations on real human gene data:

| Operation | Time | What It Does |
|---|---|---|
| Single SNP call | **155 ns** | Bayesian genotyping at one position |
| Protein translation (1 kb) | **23 ns** | DNA to amino acids |
| K-mer vector (1 kb) | **591 us** | Full pipeline with HNSW indexing |
| Complete analysis (5 genes) | **12 ms** | All stages including `.rvdna` output |

### vs Traditional Tools

| Task | Traditional Tool | Their Time | rvDNA | Speedup |
|---|---|---|---|---|
| K-mer counting | Jellyfish | 15-30 min | 2-5 sec | **180-900x** |
| Sequence similarity | BLAST | 1-5 min | 5-50 ms | **1,200-60,000x** |
| Variant calling | GATK | 30-90 min | 3-10 min | **3-30x** |
| Methylation age | R/Bioconductor | 5-15 min | 0.1-0.5 sec | **600-9,000x** |
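
The speedup ranges in this table appear to pair the extremes of each time range: the low end divides the fastest traditional time by the slowest rvDNA time, and the high end does the reverse. The sketch below reproduces that arithmetic from the table's own figures. `speedupRange` is a hypothetical helper, and this reading of the ranges is an inference from the numbers rather than a stated methodology.

```js
// Low end:  fastest traditional time divided by slowest rvDNA time.
// High end: slowest traditional time divided by fastest rvDNA time.
// Both ranges must use the same unit.
function speedupRange([oldMin, oldMax], [newMin, newMax]) {
  return [oldMin / newMax, oldMax / newMin];
}

// K-mer counting, in seconds: 15-30 min vs 2-5 sec
console.log(speedupRange([900, 1800], [2, 5]));         // [180, 900]
// Sequence similarity, in milliseconds: 1-5 min vs 5-50 ms
console.log(speedupRange([60000, 300000], [5, 50]));    // [1200, 60000]
// Variant calling, in minutes: 30-90 min vs 3-10 min
console.log(speedupRange([30, 90], [3, 10]));           // [3, 30]
// Methylation age, in milliseconds: 5-15 min vs 0.1-0.5 sec
console.log(speedupRange([300000, 900000], [100, 500])); // [600, 9000]
```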

## Rust Crate

The full Rust crate with all algorithms is available on crates.io:

```toml
[dependencies]
rvdna = "0.1"
```

See the [Rust documentation](https://docs.rs/rvdna) for the complete API including Smith-Waterman alignment, Horvath clock, CYP2D6 pharmacogenomics, and more.

## Links

- [GitHub](https://github.com/ruvnet/ruvector/tree/main/examples/dna) - Source code
- [crates.io](https://crates.io/crates/rvdna) - Rust crate
- [RuVector](https://github.com/ruvnet/ruvector) - Parent vector computing platform

## License

MIT

package/package.json CHANGED
@@ -1,6 +1,6 @@
{
"name": "@ruvector/rvdna",
- "version": "0.1.0",
+ "version": "0.1.1",
"description": "rvDNA — AI-native genomic analysis and the .rvdna file format. Variant calling, protein prediction, and HNSW vector search powered by Rust via NAPI-RS.",
"main": "index.js",
"types": "index.d.ts",