databonk 0.0.2 → 0.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (55)
  1. package/README.md +116 -111
  2. package/build/release.d.ts +719 -0
  3. package/build/release.js +774 -0
  4. package/build/release.wasm +0 -0
  5. package/build/release.wasm.map +1 -0
  6. package/build/release.wat +22633 -0
  7. package/dist/dataframe.d.ts +82 -0
  8. package/dist/dataframe.d.ts.map +1 -0
  9. package/dist/dataframe.js +318 -0
  10. package/dist/dataframe.js.map +1 -0
  11. package/dist/index.d.ts +42 -19
  12. package/dist/index.d.ts.map +1 -1
  13. package/dist/index.js +37 -6166
  14. package/dist/index.js.map +1 -1
  15. package/dist/loader.d.ts +86 -0
  16. package/dist/loader.d.ts.map +1 -0
  17. package/dist/loader.js +147 -0
  18. package/dist/loader.js.map +1 -0
  19. package/dist/shared-memory.d.ts +64 -0
  20. package/dist/shared-memory.d.ts.map +1 -0
  21. package/dist/shared-memory.js +113 -0
  22. package/dist/shared-memory.js.map +1 -0
  23. package/package.json +30 -56
  24. package/dist/core/column.d.ts +0 -55
  25. package/dist/core/column.d.ts.map +0 -1
  26. package/dist/core/dataframe.d.ts +0 -70
  27. package/dist/core/dataframe.d.ts.map +0 -1
  28. package/dist/core/index-cache.d.ts +0 -44
  29. package/dist/core/index-cache.d.ts.map +0 -1
  30. package/dist/index.esm.js +0 -6153
  31. package/dist/index.esm.js.map +0 -1
  32. package/dist/io/csv.d.ts +0 -23
  33. package/dist/io/csv.d.ts.map +0 -1
  34. package/dist/operations/aggregation.d.ts +0 -23
  35. package/dist/operations/aggregation.d.ts.map +0 -1
  36. package/dist/operations/derive.d.ts +0 -38
  37. package/dist/operations/derive.d.ts.map +0 -1
  38. package/dist/operations/groupby.d.ts +0 -36
  39. package/dist/operations/groupby.d.ts.map +0 -1
  40. package/dist/operations/join.d.ts +0 -22
  41. package/dist/operations/join.d.ts.map +0 -1
  42. package/dist/operations/reshape.d.ts +0 -17
  43. package/dist/operations/reshape.d.ts.map +0 -1
  44. package/dist/utils/aggregation-engine.d.ts +0 -84
  45. package/dist/utils/aggregation-engine.d.ts.map +0 -1
  46. package/dist/utils/bitset.d.ts +0 -30
  47. package/dist/utils/bitset.d.ts.map +0 -1
  48. package/dist/utils/hash.d.ts +0 -79
  49. package/dist/utils/hash.d.ts.map +0 -1
  50. package/dist/utils/performance.d.ts +0 -44
  51. package/dist/utils/performance.d.ts.map +0 -1
  52. package/dist/utils/types.d.ts +0 -7
  53. package/dist/utils/types.d.ts.map +0 -1
  54. package/dist/validation/schema.d.ts +0 -73
  55. package/dist/validation/schema.d.ts.map +0 -1
package/README.md CHANGED
@@ -1,161 +1,166 @@
- # Databonk.js
+ # Databonk
 
- A lightweight, fast data frame library for JavaScript and TypeScript with built-in schema validation.
+ **WASM-powered DataFrame library with SIMD acceleration**
 
- ## Features
+ Databonk is a high-performance columnar DataFrame library built with AssemblyScript and WebAssembly, featuring SIMD-optimized operations and optional SharedArrayBuffer support for zero-copy data access.
 
- - **Lightweight**: Minimal dependencies, tree-shakeable modules
- - **Fast**: Columnar storage using TypedArrays for performance
- - **Simple**: Clean API for common data operations
- - **Flexible**: Works with regular arrays, TypedArrays, or Apache Arrow
- - **Schema Validation**: Built-in Zod integration for data validation
- - **Type Safe**: Full TypeScript support with inferred types
+ ## Key Features
+
+ - **14x faster** than JavaScript for aggregations (sum, mean, min, max)
+ - **SIMD acceleration** with 4-way parallel computation
+ - **Zero-copy access** to column data via SharedArrayBuffer
+ - **Full TypeScript support** with comprehensive type definitions
+ - **Memory efficient** columnar storage design
+ - **Fluent API** for method chaining
 
  ## Installation
 
  ```bash
- npm install databonk zod
+ npm install databonk
  ```
 
  ## Quick Start
 
- ```javascript
- import { DataFrame, SchemaValidator, CommonSchemas } from 'databonk';
-
- // Create a DataFrame
- const df = DataFrame.from({
-   name: ['Alice', 'Bob', 'Charlie'],
-   age: [25, 30, 35],
-   city: ['NYC', 'LA', 'Chicago']
- });
+ ```typescript
+ import { loadDatabonk, DatabonkDataFrame } from 'databonk';
 
- // Basic operations
- const adults = df.filter(row => row.age >= 30);
- const avgAge = df.column('age').mean();
- const grouped = df.groupBy(['city']).agg({ avgAge: 'mean' });
+ // Load the WASM module
+ const module = await loadDatabonk();
 
- // Schema validation
- const result = df.validate(CommonSchemas.person);
- console.log(`Valid rows: ${result.validRows}/${result.totalRows}`);
- ```
+ // Create a DataFrame from typed arrays
+ const df = await DatabonkDataFrame.fromTypedArrays(module, [
+   { name: 'id', data: new Int32Array([1, 2, 3, 4, 5]) },
+   { name: 'value', data: new Float32Array([10.5, 20.5, 30.5, 40.5, 50.5]) },
+ ]);
 
- ## Core Features
+ // Aggregations
+ console.log('Sum:', df.sum('value'));   // 152.5
+ console.log('Mean:', df.mean('value')); // 30.5
+ console.log('Min:', df.min('value'));   // 10.5
+ console.log('Max:', df.max('value'));   // 50.5
+ console.log('Rows:', df.rowCount);      // 5
 
- ### Data Operations
- - **Filtering & Selection**: Powerful row/column filtering with predicate functions
- - **Joins**: Inner, left, right, and outer joins with multiple keys
- - **Aggregations**: Sum, mean, count, min, max, std, variance with group-by support
- - **Reshaping**: Pivot, melt, transpose operations for data transformation
- - **Sorting**: Multi-column sorting with custom comparators
+ // Clean up when done
+ df.free();
+ ```
 
- ### Schema Validation
- - **Built-in Schemas**: Common patterns for users, products, transactions, coordinates
- - **Custom Validation**: Define your own schemas with Zod
- - **Data Cleaning**: Filter valid/invalid rows, transform data types
- - **Error Reporting**: Detailed validation errors with row/column information
+ ## Performance
 
- ### I/O Support
- - **CSV**: Read/write CSV files with automatic type inference
- - **Apache Arrow**: Optional integration for columnar data exchange
- - **Streaming**: Memory-efficient processing of large datasets
+ Benchmarks on 1 million rows (Float32):
 
- ## Examples
+ | Operation | WASM SIMD | JavaScript | Speedup |
+ |-----------|-----------|------------|---------|
+ | Sum       | ~0.3ms    | ~4.2ms     | **14x** |
+ | Min       | ~0.4ms    | ~4.8ms     | **12x** |
+ | Max       | ~0.4ms    | ~4.8ms     | **12x** |
+ | Mean      | ~0.3ms    | ~5.0ms     | **16x** |
 
- ### Schema Validation
+ ## API Overview
 
- ```javascript
- import { DataFrame, SchemaValidator } from 'databonk';
- import { z } from 'zod';
+ ### Module Loading
 
- // Define a custom schema
- const userSchema = SchemaValidator.define({
-   name: z.string().min(1),
-   age: z.number().int().min(0).max(150),
-   email: z.string().email(),
-   role: z.enum(['admin', 'user', 'guest'])
+ ```typescript
+ const module = await loadDatabonk({
+   wasmPath: './build/release.wasm', // Optional: custom WASM path
+   sharedMemory: true,               // Optional: enable SharedArrayBuffer
+   initialMemory: 256,               // Optional: initial memory pages (16MB default)
+   maximumMemory: 16384,             // Optional: max memory pages (1GB default)
  });
+ ```
+
+ ### DataFrame Creation
 
- const userData = [
-   { name: 'Alice', age: 25, email: 'alice@example.com', role: 'admin' },
-   { name: '', age: -5, email: 'invalid', role: 'unknown' } // Invalid
- ];
+ ```typescript
+ const df = await DatabonkDataFrame.fromTypedArrays(module, [
+   { name: 'int_col', data: new Int32Array([1, 2, 3]) },
+   { name: 'float_col', data: new Float32Array([1.5, 2.5, 3.5]) },
+   { name: 'double_col', data: new Float64Array([1.1, 2.2, 3.3]) },
+ ]);
+ ```
 
- const df = DataFrame.fromRows(userData);
+ ### Aggregations
 
- // Validate data
- const validation = df.validate(userSchema);
- console.log(`Errors: ${validation.errors.length}`);
+ ```typescript
+ df.sum('column');   // Sum of values
+ df.mean('column');  // Average
+ df.min('column');   // Minimum
+ df.max('column');   // Maximum
+ df.count('column'); // Count of values
+ ```
 
- // Filter valid rows
- const validUsers = df.filterValid(userSchema);
+ ### Column Arithmetic
 
- // Transform data with type coercion
- const cleanData = df.validateAndTransform(userSchema);
+ ```typescript
+ df.add('a', 'b', 'sum')           // sum = a + b
+   .sub('a', 'b', 'diff')          // diff = a - b
+   .scalarMul('a', 2.5, 'scaled'); // scaled = a * 2.5
  ```
 
- ### Advanced Data Operations
+ ### GroupBy
 
- ```javascript
- // Join operations
- const sales = DataFrame.fromRows([
-   { product_id: 1, quantity: 100, region: 'North' },
-   { product_id: 2, quantity: 150, region: 'South' }
- ]);
+ ```typescript
+ const grouped = df.groupBy('category', 256) // maxKey parameter
+   .sum('value'); // or .mean('value')
+ ```
 
- const products = DataFrame.fromRows([
-   { product_id: 1, name: 'Widget', price: 10.99 },
-   { product_id: 2, name: 'Gadget', price: 15.99 }
- ]);
+ ### Inner Join
+
+ ```typescript
+ const result = left.innerJoin(right, 'left_key', 'right_key');
+ ```
 
- const joined = sales.join(products, 'product_id', 'inner');
+ ### Zero-Copy Column Access
 
- // Group by with multiple aggregations
- const summary = joined
-   .groupBy(['region'])
-   .agg({
-     quantity: ['sum', 'mean'],
-     price: 'mean'
-   });
+ ```typescript
+ const view = df.getColumnView('value');
+ if (view) {
+   console.log(view.get(0));    // First value
+   console.log([...view]);      // Iterate
+   console.log(view.toArray()); // Copy to regular array
+ }
+ ```
 
- // Add calculated columns
- const withRevenue = joined.withColumn('revenue',
-   row => row.quantity * row.price
- );
+ ### Memory Management
 
- // Pivot tables
- const pivot = sales.pivot(['region'], 'product_id', 'quantity', 'sum');
+ ```typescript
+ df.free(); // Always free DataFrames when done
  ```
 
- ## Docker Development
+ ## Documentation
 
- ```bash
- # Build and start development environment
- make docker-dev
+ - [API Reference](./docs/api.md) - Full API documentation
+ - [Examples](./docs/examples.md) - Detailed code examples
 
- # Run tests in Docker
- make docker-test
+ ## Supported Column Types
 
- # Open shell in container
- make docker-shell
- ```
+ | Type    | TypedArray     | Use Case                       |
+ |---------|----------------|--------------------------------|
+ | Int32   | `Int32Array`   | Integer keys, IDs, counts      |
+ | Float32 | `Float32Array` | Standard floating-point values |
+ | Float64 | `Float64Array` | High-precision values          |
+
+ ## Current Limitations
+
+ - GroupBy currently supports single value column aggregation
+ - Join keys must be Int32 values
+ - String columns are supported for storage but not for operations
 
 
  ## Development
 
  ```bash
- # Local development
+ # Install dependencies
  npm install
- npm run build
+
+ # Build WASM module
+ npm run asbuild
+
+ # Run tests
  npm test
 
- # With Docker
- make setup
- make dev
+ # Run benchmarks
+ npm run benchmark
  ```
 
- ## Performance
+ ## License
 
- Databonk.js is designed for small to medium datasets (up to ~1M rows) with:
- - **Memory efficient**: Columnar storage with TypedArrays
- - **Fast operations**: Optimized algorithms for joins, aggregations
- - **Minimal overhead**: Zero-copy operations where possible
- - **Tree-shakeable**: Only import what you use
+ MIT
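
To make the storage model behind the README changes above concrete: the 0.0.4 aggregations (`sum`, `mean`) reduce one contiguous `TypedArray` per column rather than iterating an array of row objects. The sketch below reproduces that columnar reduction in plain JavaScript on the Quick Start's `value` column; it is an editor's illustration only, not the package's WASM/SIMD code path, and `sum` here is a local helper, not a databonk API.

```javascript
// Columnar storage: one contiguous TypedArray per column, as in the
// Quick Start's { name: 'value', data: ... } column. Float64Array is
// used here so the reductions below are exact.
const value = new Float64Array([10.5, 20.5, 30.5, 40.5, 50.5]);

// Scalar reduction over the column. The WASM build additionally
// processes four lanes at a time with SIMD, which this sketch omits.
function sum(col) {
  let acc = 0;
  for (let i = 0; i < col.length; i++) acc += col[i];
  return acc;
}

const total = sum(value);
const mean = total / value.length;
console.log(total, mean); // 152.5 30.5 -- matches the Quick Start output
```

The tight loop over a contiguous buffer is what makes the columnar layout fast even before SIMD: no per-row object allocation, and predictable, cache-friendly memory access.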