@databricks/zerobus-ingest-sdk 0.0.1 → 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/Cargo.lock +2233 -0
- package/Cargo.toml +46 -0
- package/LICENSE +69 -0
- package/README.md +1220 -0
- package/build.rs +5 -0
- package/index.d.ts +387 -0
- package/index.js +318 -0
- package/package.json +88 -6
- package/schemas/air_quality.proto +10 -0
- package/schemas/air_quality_descriptor.pb +9 -0
- package/src/headers_provider.ts +82 -0
- package/src/lib.rs +815 -0
- package/utils/descriptor.ts +103 -0
- package/zerobus-sdk-ts.linux-x64-gnu.node +0 -0
package/README.md
ADDED
|
@@ -0,0 +1,1220 @@
|
|
|
1
|
+
# Databricks Zerobus Ingest SDK for TypeScript
|
|
2
|
+
|
|
3
|
+
[Public Preview](https://docs.databricks.com/release-notes/release-types.html): This SDK is supported for production use cases and is available to all customers. Databricks is actively working on stabilizing the Zerobus Ingest SDK for TypeScript. Minor version updates may include backwards-incompatible changes.
|
|
4
|
+
|
|
5
|
+
We are keen to hear your feedback on this SDK. Please [file issues](https://github.com/databricks/zerobus-sdk-ts/issues), and we will address them.
|
|
6
|
+
|
|
7
|
+
The Databricks Zerobus Ingest SDK for TypeScript provides a high-performance client for ingesting data directly into Databricks Delta tables using the Zerobus streaming protocol. This SDK wraps the high-performance [Rust SDK](https://github.com/databricks/zerobus-sdk-rs) using native bindings for optimal performance.

See also the [SDK for Rust](https://github.com/databricks/zerobus-sdk-rs), the [SDK for Python](https://github.com/databricks/zerobus-sdk-py), the [SDK for Java](https://github.com/databricks/zerobus-sdk-java), and the [SDK for Go](https://github.com/databricks/zerobus-sdk-go).
|
|
8
|
+
|
|
9
|
+
## Table of Contents
|
|
10
|
+
|
|
11
|
+
- [Features](#features)
|
|
12
|
+
- [Requirements](#requirements)
|
|
13
|
+
- [Quick Start User Guide](#quick-start-user-guide)
|
|
14
|
+
- [Prerequisites](#prerequisites)
|
|
15
|
+
- [Installation](#installation)
|
|
16
|
+
- [Choose Your Serialization Format](#choose-your-serialization-format)
|
|
17
|
+
- [Option 1: Using JSON (Quick Start)](#option-1-using-json-quick-start)
|
|
18
|
+
- [Option 2: Using Protocol Buffers (Default, Recommended)](#option-2-using-protocol-buffers-default-recommended)
|
|
19
|
+
- [Usage Examples](#usage-examples)
|
|
20
|
+
- [Authentication](#authentication)
|
|
21
|
+
- [Configuration](#configuration)
|
|
22
|
+
- [Descriptor Utilities](#descriptor-utilities)
|
|
23
|
+
- [Error Handling](#error-handling)
|
|
24
|
+
- [API Reference](#api-reference)
|
|
25
|
+
- [Best Practices](#best-practices)
|
|
26
|
+
- [Platform Support](#platform-support)
|
|
27
|
+
- [Architecture](#architecture)
|
|
28
|
+
- [Contributing](#contributing)
|
|
29
|
+
- [Related Projects](#related-projects)
|
|
30
|
+
|
|
31
|
+
## Features
|
|
32
|
+
|
|
33
|
+
- **High-throughput ingestion**: Optimized for high-volume data ingestion with native Rust implementation
|
|
34
|
+
- **Automatic recovery**: Built-in retry and recovery mechanisms for transient failures
|
|
35
|
+
- **Flexible configuration**: Customizable stream behavior and timeouts
|
|
36
|
+
- **Multiple serialization formats**: Support for JSON and Protocol Buffers
|
|
37
|
+
- **Type widening**: Accepts high-level types (plain objects, protobuf messages) or low-level types (strings, buffers); the SDK handles serialization automatically
|
|
38
|
+
- **Batch ingestion**: Ingest multiple records with a single acknowledgment for higher throughput
|
|
39
|
+
- **OAuth 2.0 authentication**: Secure authentication with client credentials
|
|
40
|
+
- **TypeScript support**: Full type definitions for excellent IDE support
|
|
41
|
+
- **Cross-platform**: Supports Linux, macOS, and Windows
|
|
42
|
+
|
|
43
|
+
## Requirements
|
|
44
|
+
|
|
45
|
+
### Runtime Requirements
|
|
46
|
+
|
|
47
|
+
- **Node.js**: >= 16
|
|
48
|
+
- **Databricks workspace** with Zerobus access enabled
|
|
49
|
+
|
|
50
|
+
### Build Requirements
|
|
51
|
+
|
|
52
|
+
- **Rust toolchain**: 1.70 or higher - [Install Rust](https://rustup.rs/)
|
|
53
|
+
- **Cargo**: Included with Rust
|
|
54
|
+
|
|
55
|
+
### Dependencies
|
|
56
|
+
|
|
57
|
+
These will be installed automatically:
|
|
58
|
+
|
|
59
|
+
```json
|
|
60
|
+
{
|
|
61
|
+
"@napi-rs/cli": "^2.18.4",
|
|
62
|
+
"napi-build": "^0.3.3"
|
|
63
|
+
}
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
## Quick Start User Guide
|
|
67
|
+
|
|
68
|
+
### Prerequisites
|
|
69
|
+
|
|
70
|
+
Before using the SDK, you'll need the following:
|
|
71
|
+
|
|
72
|
+
#### 1. Workspace URL and Workspace ID
|
|
73
|
+
|
|
74
|
+
After logging into your Databricks workspace, look at the browser URL:
|
|
75
|
+
|
|
76
|
+
```
|
|
77
|
+
https://<databricks-instance>.cloud.databricks.com/?o=<workspace-id>
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
- **Workspace URL**: The part before `/?o=` → `https://<databricks-instance>.cloud.databricks.com`
|
|
81
|
+
- **Workspace ID**: The part after `?o=` → `<workspace-id>`
|
|
82
|
+
- **Zerobus Endpoint**: `https://<workspace-id>.zerobus.<region>.cloud.databricks.com`
|
|
83
|
+
|
|
84
|
+
> **Note:** The examples above show AWS endpoints (`.cloud.databricks.com`). For Azure deployments, the workspace URL will be `https://<databricks-instance>.azuredatabricks.net` and Zerobus endpoint will use `.azuredatabricks.net`.
|
|
85
|
+
|
|
86
|
+
Example:
|
|
87
|
+
- Full URL: `https://dbc-a1b2c3d4-e5f6.cloud.databricks.com/?o=1234567890123456`
|
|
88
|
+
- Workspace URL: `https://dbc-a1b2c3d4-e5f6.cloud.databricks.com`
|
|
89
|
+
- Workspace ID: `1234567890123456`
|
|
90
|
+
- Zerobus Endpoint: `https://1234567890123456.zerobus.us-west-2.cloud.databricks.com`
|
|
91
|
+
|
|
92
|
+
#### 2. Create a Delta Table
|
|
93
|
+
|
|
94
|
+
Create a table using Databricks SQL:
|
|
95
|
+
|
|
96
|
+
```sql
|
|
97
|
+
CREATE TABLE <catalog_name>.default.air_quality (
|
|
98
|
+
device_name STRING,
|
|
99
|
+
temp INT,
|
|
100
|
+
humidity BIGINT
|
|
101
|
+
)
|
|
102
|
+
USING DELTA;
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
Replace `<catalog_name>` with your catalog name (e.g., `main`).
|
|
106
|
+
|
|
107
|
+
#### 3. Create a Service Principal
|
|
108
|
+
|
|
109
|
+
1. Navigate to **Settings > Identity and Access** in your Databricks workspace
|
|
110
|
+
2. Click **Service principals** and create a new service principal
|
|
111
|
+
3. Generate a new secret for the service principal and save it securely
|
|
112
|
+
4. Grant the following permissions:
|
|
113
|
+
- `USE_CATALOG` on the catalog (e.g., `main`)
|
|
114
|
+
- `USE_SCHEMA` on the schema (e.g., `default`)
|
|
115
|
+
- `MODIFY` and `SELECT` on the table (e.g., `air_quality`)
|
|
116
|
+
|
|
117
|
+
Grant permissions using SQL:
|
|
118
|
+
|
|
119
|
+
```sql
|
|
120
|
+
-- Grant catalog permission
|
|
121
|
+
GRANT USE CATALOG ON CATALOG <catalog_name> TO `<service-principal-application-id>`;
|
|
122
|
+
|
|
123
|
+
-- Grant schema permission
|
|
124
|
+
GRANT USE SCHEMA ON SCHEMA <catalog_name>.default TO `<service-principal-application-id>`;
|
|
125
|
+
|
|
126
|
+
-- Grant table permissions
|
|
127
|
+
GRANT SELECT, MODIFY ON TABLE <catalog_name>.default.air_quality TO `<service-principal-application-id>`;
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
### Installation
|
|
131
|
+
|
|
132
|
+
#### Prerequisites
|
|
133
|
+
|
|
134
|
+
Before installing the SDK, ensure you have the required tools:
|
|
135
|
+
|
|
136
|
+
**1. Node.js >= 16**
|
|
137
|
+
|
|
138
|
+
Check if Node.js is installed:
|
|
139
|
+
```bash
|
|
140
|
+
node --version
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
If not installed, download from [nodejs.org](https://nodejs.org/).
|
|
144
|
+
|
|
145
|
+
**2. Rust Toolchain (1.70+)**
|
|
146
|
+
|
|
147
|
+
The SDK requires Rust to compile the native addon. Install using `rustup` (the official Rust installer):
|
|
148
|
+
|
|
149
|
+
**On Linux and macOS:**
|
|
150
|
+
```bash
|
|
151
|
+
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
|
|
152
|
+
```
|
|
153
|
+
|
|
154
|
+
Follow the prompts (typically just press Enter to accept defaults).
|
|
155
|
+
|
|
156
|
+
**On Windows:**
|
|
157
|
+
|
|
158
|
+
Download and run the installer from [rustup.rs](https://rustup.rs/), or use:
|
|
159
|
+
```powershell
|
|
160
|
+
# Using winget
|
|
161
|
+
winget install Rustlang.Rustup
|
|
162
|
+
|
|
163
|
+
# Or download from https://rustup.rs/
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
**Verify Installation:**
|
|
167
|
+
```bash
|
|
168
|
+
rustc --version
|
|
169
|
+
cargo --version
|
|
170
|
+
```
|
|
171
|
+
|
|
172
|
+
You should see version 1.70 or higher. If the commands aren't found, restart your terminal or add Rust to your PATH:
|
|
173
|
+
```bash
|
|
174
|
+
# Linux/macOS
|
|
175
|
+
source $HOME/.cargo/env
|
|
176
|
+
|
|
177
|
+
# Windows (PowerShell)
|
|
178
|
+
# Restart your terminal
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
**Additional Platform Requirements:**
|
|
182
|
+
|
|
183
|
+
- **Linux**: Build essentials
|
|
184
|
+
```bash
|
|
185
|
+
# Ubuntu/Debian
|
|
186
|
+
sudo apt-get install build-essential
|
|
187
|
+
|
|
188
|
+
# CentOS/RHEL
|
|
189
|
+
sudo yum groupinstall "Development Tools"
|
|
190
|
+
```
|
|
191
|
+
|
|
192
|
+
- **macOS**: Xcode Command Line Tools
|
|
193
|
+
```bash
|
|
194
|
+
xcode-select --install
|
|
195
|
+
```
|
|
196
|
+
|
|
197
|
+
- **Windows**: Visual Studio Build Tools
|
|
198
|
+
- Install [Visual Studio Build Tools](https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2022)
|
|
199
|
+
- During installation, select "Desktop development with C++"
|
|
200
|
+
|
|
201
|
+
#### Installation Steps
|
|
202
|
+
|
|
203
|
+
**Note for macOS users**: Pre-built binaries are not available. The package will automatically build from source during `npm install`. Ensure you have Rust toolchain and Xcode Command Line Tools installed (see prerequisites above).
|
|
204
|
+
|
|
205
|
+
1. Extract the SDK package:
|
|
206
|
+
```bash
|
|
207
|
+
unzip zerobus-sdk-ts.zip
|
|
208
|
+
cd zerobus-sdk-ts
|
|
209
|
+
```
|
|
210
|
+
|
|
211
|
+
2. Install dependencies:
|
|
212
|
+
```bash
|
|
213
|
+
npm install
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
3. Build the native addon:
|
|
217
|
+
```bash
|
|
218
|
+
npm run build
|
|
219
|
+
```
|
|
220
|
+
|
|
221
|
+
This will compile the Rust code into a native Node.js addon (`.node` file) for your platform.
|
|
222
|
+
|
|
223
|
+
4. Verify the build:
|
|
224
|
+
```bash
|
|
225
|
+
# You should see a .node file
|
|
226
|
+
ls -la *.node
|
|
227
|
+
```
|
|
228
|
+
|
|
229
|
+
5. The SDK is now ready to use! You can:
|
|
230
|
+
- Use it directly in this directory for examples
|
|
231
|
+
- Link it globally: `npm link`
|
|
232
|
+
- Or copy it into your project's `node_modules`
|
|
233
|
+
|
|
234
|
+
**Troubleshooting:**
|
|
235
|
+
|
|
236
|
+
- **"rustc: command not found"**: Restart your terminal after installing Rust
|
|
237
|
+
- **Build fails on Windows**: Ensure Visual Studio Build Tools are installed with C++ support
|
|
238
|
+
- **Build fails on Linux**: Install build-essential or equivalent package
|
|
239
|
+
- **Permission errors**: Don't use `sudo` with npm/cargo commands
|
|
240
|
+
|
|
241
|
+
### Choose Your Serialization Format
|
|
242
|
+
|
|
243
|
+
The SDK supports two serialization formats. **Protocol Buffers is the default** and recommended for production use:
|
|
244
|
+
|
|
245
|
+
- **Protocol Buffers (Default)** - Strongly-typed schemas, efficient binary encoding, better performance. This is the default format.
|
|
246
|
+
- **JSON** - Simple, no schema compilation needed. Good for getting started quickly or when schema flexibility is needed.
|
|
247
|
+
|
|
248
|
+
> **Note:** If you don't specify `recordType`, the SDK will use Protocol Buffers by default. To use JSON, explicitly set `recordType: RecordType.Json`.
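For example, both of the following option objects are valid; the first falls back to Protocol Buffers because `recordType` is omitted:

```typescript
import { RecordType, StreamConfigurationOptions } from '@databricks/zerobus-ingest-sdk';

// Defaults to RecordType.Proto when recordType is omitted.
const protoOptions: StreamConfigurationOptions = {};

// JSON must be requested explicitly.
const jsonOptions: StreamConfigurationOptions = { recordType: RecordType.Json };
```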
|
|
249
|
+
|
|
250
|
+
### Option 1: Using JSON (Quick Start)
|
|
251
|
+
|
|
252
|
+
JSON mode is the simplest way to get started. You don't need to define or compile protobuf schemas, but you must explicitly specify `RecordType.Json`.
|
|
253
|
+
|
|
254
|
+
```typescript
|
|
255
|
+
import { ZerobusSdk, RecordType } from '@databricks/zerobus-ingest-sdk';
|
|
256
|
+
|
|
257
|
+
// Configuration
|
|
258
|
+
// For AWS:
|
|
259
|
+
const zerobusEndpoint = '<workspace-id>.zerobus.<region>.cloud.databricks.com';
|
|
260
|
+
const workspaceUrl = 'https://<workspace-name>.cloud.databricks.com';
|
|
261
|
+
// For Azure:
|
|
262
|
+
// const zerobusEndpoint = '<workspace-id>.zerobus.<region>.azuredatabricks.net';
|
|
263
|
+
// const workspaceUrl = 'https://<workspace-name>.azuredatabricks.net';
|
|
264
|
+
|
|
265
|
+
const tableName = 'main.default.air_quality';
|
|
266
|
+
const clientId = process.env.DATABRICKS_CLIENT_ID!;
|
|
267
|
+
const clientSecret = process.env.DATABRICKS_CLIENT_SECRET!;
|
|
268
|
+
|
|
269
|
+
// Initialize SDK
|
|
270
|
+
const sdk = new ZerobusSdk(zerobusEndpoint, workspaceUrl);
|
|
271
|
+
|
|
272
|
+
// Configure table properties (no descriptor needed for JSON)
|
|
273
|
+
const tableProperties = { tableName };
|
|
274
|
+
|
|
275
|
+
// Configure stream with JSON record type
|
|
276
|
+
const options = {
|
|
277
|
+
recordType: RecordType.Json, // JSON encoding
|
|
278
|
+
maxInflightRequests: 1000,
|
|
279
|
+
recovery: true
|
|
280
|
+
};
|
|
281
|
+
|
|
282
|
+
// Create stream
|
|
283
|
+
const stream = await sdk.createStream(
|
|
284
|
+
tableProperties,
|
|
285
|
+
clientId,
|
|
286
|
+
clientSecret,
|
|
287
|
+
options
|
|
288
|
+
);
|
|
289
|
+
|
|
290
|
+
try {
|
|
291
|
+
let lastAckPromise;
|
|
292
|
+
|
|
293
|
+
// Send all records
|
|
294
|
+
for (let i = 0; i < 100; i++) {
|
|
295
|
+
// Create JSON record
|
|
296
|
+
const record = {
|
|
297
|
+
device_name: `sensor-${i % 10}`,
|
|
298
|
+
temp: 20 + (i % 15),
|
|
299
|
+
humidity: 50 + (i % 40)
|
|
300
|
+
};
|
|
301
|
+
|
|
302
|
+
// JSON supports 2 types:
|
|
303
|
+
// 1. object (high-level) - SDK auto-stringifies
|
|
304
|
+
lastAckPromise = stream.ingestRecord(record);
|
|
305
|
+
// 2. string (low-level) - pre-serialized JSON
|
|
306
|
+
// lastAckPromise = stream.ingestRecord(JSON.stringify(record));
|
|
307
|
+
}
|
|
308
|
+
|
|
309
|
+
console.log('All records sent. Waiting for last acknowledgment...');
|
|
310
|
+
|
|
311
|
+
// Wait for the last record's acknowledgment
|
|
312
|
+
const lastOffset = await lastAckPromise;
|
|
313
|
+
console.log(`Last record offset: ${lastOffset}`);
|
|
314
|
+
|
|
315
|
+
// Flush to ensure all records are acknowledged
|
|
316
|
+
await stream.flush();
|
|
317
|
+
console.log('Successfully ingested 100 records!');
|
|
318
|
+
} finally {
|
|
319
|
+
// Always close the stream
|
|
320
|
+
await stream.close();
|
|
321
|
+
}
|
|
322
|
+
```
|
|
323
|
+
|
|
324
|
+
### Option 2: Using Protocol Buffers (Default, Recommended)
|
|
325
|
+
|
|
326
|
+
Protocol Buffers is the default serialization format and provides efficient binary encoding with schema validation. This is recommended for production use. This section covers the complete setup process.
|
|
327
|
+
|
|
328
|
+
#### Prerequisites
|
|
329
|
+
|
|
330
|
+
Before starting, ensure you have:
|
|
331
|
+
|
|
332
|
+
1. **Protocol Buffer Compiler (`protoc`)** - Required for generating descriptor files
|
|
333
|
+
2. **protobufjs** and **protobufjs-cli** - Already included in package.json devDependencies
|
|
334
|
+
|
|
335
|
+
#### Step 1: Install Protocol Buffer Compiler
|
|
336
|
+
|
|
337
|
+
**Linux:**
|
|
338
|
+
|
|
339
|
+
```bash
|
|
340
|
+
# Ubuntu/Debian
|
|
341
|
+
sudo apt-get update && sudo apt-get install -y protobuf-compiler
|
|
342
|
+
|
|
343
|
+
# CentOS/RHEL
|
|
344
|
+
sudo yum install -y protobuf-compiler
|
|
345
|
+
|
|
346
|
+
# Alpine
|
|
347
|
+
apk add protobuf
|
|
348
|
+
```
|
|
349
|
+
|
|
350
|
+
**macOS:**
|
|
351
|
+
|
|
352
|
+
```bash
|
|
353
|
+
brew install protobuf
|
|
354
|
+
```
|
|
355
|
+
|
|
356
|
+
**Windows:**
|
|
357
|
+
|
|
358
|
+
```powershell
|
|
359
|
+
# Using Chocolatey
|
|
360
|
+
choco install protoc
|
|
361
|
+
|
|
362
|
+
# Or download from: https://github.com/protocolbuffers/protobuf/releases
|
|
363
|
+
```
|
|
364
|
+
|
|
365
|
+
**Verify Installation:**
|
|
366
|
+
|
|
367
|
+
```bash
|
|
368
|
+
protoc --version
|
|
369
|
+
# Should show: libprotoc 3.x.x or higher
|
|
370
|
+
```
|
|
371
|
+
|
|
372
|
+
#### Step 2: Define Your Protocol Buffer Schema
|
|
373
|
+
|
|
374
|
+
The SDK includes an example schema at `schemas/air_quality.proto`:
|
|
375
|
+
|
|
376
|
+
```protobuf
|
|
377
|
+
syntax = "proto2";
|
|
378
|
+
|
|
379
|
+
package examples;
|
|
380
|
+
|
|
381
|
+
// Example message representing air quality sensor data
|
|
382
|
+
message AirQuality {
|
|
383
|
+
optional string device_name = 1;
|
|
384
|
+
optional int32 temp = 2;
|
|
385
|
+
optional int64 humidity = 3;
|
|
386
|
+
}
|
|
387
|
+
```
|
|
388
|
+
|
|
389
|
+
#### Step 3: Generate TypeScript Code
|
|
390
|
+
|
|
391
|
+
Generate TypeScript code from your proto schema:
|
|
392
|
+
|
|
393
|
+
```bash
|
|
394
|
+
npm run build:proto
|
|
395
|
+
```
|
|
396
|
+
|
|
397
|
+
This runs:
|
|
398
|
+
```bash
|
|
399
|
+
pbjs -t static-module -w commonjs -o examples/generated/air_quality.js schemas/air_quality.proto
|
|
400
|
+
pbts -o examples/generated/air_quality.d.ts examples/generated/air_quality.js
|
|
401
|
+
```
|
|
402
|
+
|
|
403
|
+
**Output:**
|
|
404
|
+
- `examples/generated/air_quality.js` - JavaScript protobuf code
|
|
405
|
+
- `examples/generated/air_quality.d.ts` - TypeScript type definitions
|
|
406
|
+
|
|
407
|
+
#### Step 4: Generate Descriptor File for Databricks
|
|
408
|
+
|
|
409
|
+
Databricks requires descriptor metadata about your protobuf schema.
|
|
410
|
+
|
|
411
|
+
**Generate Binary Descriptor:**
|
|
412
|
+
|
|
413
|
+
```bash
|
|
414
|
+
protoc --descriptor_set_out=schemas/air_quality_descriptor.pb \
|
|
415
|
+
--include_imports \
|
|
416
|
+
schemas/air_quality.proto
|
|
417
|
+
```
|
|
418
|
+
|
|
419
|
+
**Important flags:**
|
|
420
|
+
- `--descriptor_set_out` - Output path for the binary descriptor
|
|
421
|
+
- `--include_imports` - Include all imported proto files (required)
|
|
422
|
+
|
|
423
|
+
That's it! The SDK will automatically extract the message descriptor from this file.
|
|
424
|
+
|
|
425
|
+
#### Step 5: Use in Your Code
|
|
426
|
+
|
|
427
|
+
```typescript
|
|
428
|
+
import { ZerobusSdk, RecordType } from '@databricks/zerobus-ingest-sdk';
|
|
429
|
+
import * as airQuality from './examples/generated/air_quality';
|
|
430
|
+
import { loadDescriptorProto } from '@databricks/zerobus-ingest-sdk/utils/descriptor';
|
|
431
|
+
|
|
432
|
+
// Configuration
|
|
433
|
+
const zerobusEndpoint = '<workspace-id>.zerobus.<region>.cloud.databricks.com';
|
|
434
|
+
const workspaceUrl = 'https://<workspace-name>.cloud.databricks.com';
|
|
435
|
+
const tableName = 'main.default.air_quality';
|
|
436
|
+
const clientId = process.env.DATABRICKS_CLIENT_ID!;
|
|
437
|
+
const clientSecret = process.env.DATABRICKS_CLIENT_SECRET!;
|
|
438
|
+
|
|
439
|
+
// Load and extract the descriptor for your specific message
|
|
440
|
+
const descriptorBase64 = loadDescriptorProto({
|
|
441
|
+
descriptorPath: 'schemas/air_quality_descriptor.pb',
|
|
442
|
+
protoFileName: 'air_quality.proto',
|
|
443
|
+
messageName: 'AirQuality'
|
|
444
|
+
});
|
|
445
|
+
|
|
446
|
+
// Initialize SDK
|
|
447
|
+
const sdk = new ZerobusSdk(zerobusEndpoint, workspaceUrl);
|
|
448
|
+
|
|
449
|
+
// Configure table properties with protobuf descriptor
|
|
450
|
+
const tableProperties = {
|
|
451
|
+
tableName,
|
|
452
|
+
descriptorProto: descriptorBase64 // Required for Protocol Buffers
|
|
453
|
+
};
|
|
454
|
+
|
|
455
|
+
// Configure stream with Protocol Buffers record type
|
|
456
|
+
const options = {
|
|
457
|
+
recordType: RecordType.Proto, // Protocol Buffers encoding
|
|
458
|
+
maxInflightRequests: 1000,
|
|
459
|
+
recovery: true
|
|
460
|
+
};
|
|
461
|
+
|
|
462
|
+
// Create stream
|
|
463
|
+
const stream = await sdk.createStream(tableProperties, clientId, clientSecret, options);
|
|
464
|
+
|
|
465
|
+
try {
|
|
466
|
+
const AirQuality = airQuality.examples.AirQuality;
|
|
467
|
+
let lastAckPromise;
|
|
468
|
+
|
|
469
|
+
// Send all records
|
|
470
|
+
for (let i = 0; i < 100; i++) {
|
|
471
|
+
const record = AirQuality.create({
|
|
472
|
+
device_name: `sensor-${i}`,
|
|
473
|
+
temp: 20 + i,
|
|
474
|
+
humidity: 50 + i
|
|
475
|
+
});
|
|
476
|
+
|
|
477
|
+
// Protobuf supports 2 types:
|
|
478
|
+
// 1. Message object (high-level) - SDK calls .encode().finish()
|
|
479
|
+
lastAckPromise = stream.ingestRecord(record);
|
|
480
|
+
// 2. Buffer (low-level) - pre-serialized bytes
|
|
481
|
+
// const buffer = Buffer.from(AirQuality.encode(record).finish());
|
|
482
|
+
// lastAckPromise = stream.ingestRecord(buffer);
|
|
483
|
+
}
|
|
484
|
+
|
|
485
|
+
console.log('All records sent. Waiting for last acknowledgment...');
|
|
486
|
+
|
|
487
|
+
// Wait for the last record's acknowledgment
|
|
488
|
+
const lastOffset = await lastAckPromise;
|
|
489
|
+
console.log(`Last record offset: ${lastOffset}`);
|
|
490
|
+
|
|
491
|
+
// Flush to ensure all records are acknowledged
|
|
492
|
+
await stream.flush();
|
|
493
|
+
console.log('Successfully ingested 100 records!');
|
|
494
|
+
} finally {
|
|
495
|
+
await stream.close();
|
|
496
|
+
}
|
|
497
|
+
```
|
|
498
|
+
|
|
499
|
+
#### Type Mapping: Delta ↔ Protocol Buffers
|
|
500
|
+
|
|
501
|
+
When creating your proto schema, use these type mappings:
|
|
502
|
+
|
|
503
|
+
| Delta Type | Proto2 Type | Notes |
|
|
504
|
+
|-----------|-------------|-------|
|
|
505
|
+
| STRING, VARCHAR | string | |
|
|
506
|
+
| INT, SMALLINT, SHORT | int32 | |
|
|
507
|
+
| BIGINT, LONG | int64 | |
|
|
508
|
+
| FLOAT | float | |
|
|
509
|
+
| DOUBLE | double | |
|
|
510
|
+
| BOOLEAN | bool | |
|
|
511
|
+
| BINARY | bytes | |
|
|
512
|
+
| DATE | int32 | Days since epoch |
|
|
513
|
+
| TIMESTAMP | int64 | Microseconds since epoch |
|
|
514
|
+
| ARRAY\<type\> | repeated type | Use repeated field |
|
|
515
|
+
| MAP\<key, value\> | map\<key, value\> | Use map field |
|
|
516
|
+
| STRUCT\<fields\> | message | Define nested message |
|
|
517
|
+
|
|
518
|
+
**Example: Complex Schema**
|
|
519
|
+
|
|
520
|
+
```protobuf
|
|
521
|
+
syntax = "proto2";
|
|
522
|
+
|
|
523
|
+
package examples;
|
|
524
|
+
|
|
525
|
+
message ComplexRecord {
|
|
526
|
+
optional string id = 1;
|
|
527
|
+
optional int64 timestamp = 2;
|
|
528
|
+
repeated string tags = 3; // ARRAY<STRING>
|
|
529
|
+
map<string, int32> metrics = 4; // MAP<STRING, INT>
|
|
530
|
+
optional NestedData nested = 5; // STRUCT
|
|
531
|
+
}
|
|
532
|
+
|
|
533
|
+
message NestedData {
|
|
534
|
+
optional string field1 = 1;
|
|
535
|
+
optional double field2 = 2;
|
|
536
|
+
}
|
|
537
|
+
```
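For DATE and TIMESTAMP columns, the proto fields carry epoch-relative integers (days and microseconds, per the table above). Below is a minimal conversion sketch from a JavaScript `Date`; the helper and field names are illustrative, not part of the SDK:

```typescript
// Illustrative helpers for populating DATE (int32) and TIMESTAMP (int64) columns.
function toEpochDays(date: Date): number {
  // DATE -> int32: whole days since 1970-01-01 (UTC)
  return Math.floor(date.getTime() / 86_400_000);
}

function toEpochMicros(date: Date): number {
  // TIMESTAMP -> int64: microseconds since the epoch.
  // A plain JS number stays exact here well past the year 2200 (< 2^53 microseconds).
  return date.getTime() * 1_000;
}

// Example with hypothetical `event_date` (int32) and `event_time` (int64) proto fields:
const fields = {
  event_date: toEpochDays(new Date()),
  event_time: toEpochMicros(new Date()),
};
```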
|
|
538
|
+
|
|
539
|
+
#### Using Your Own Schema
|
|
540
|
+
|
|
541
|
+
1. **Create your proto file:**
|
|
542
|
+
```bash
|
|
543
|
+
cat > schemas/my_schema.proto << 'EOF'
|
|
544
|
+
syntax = "proto2";
|
|
545
|
+
|
|
546
|
+
package my_schema;
|
|
547
|
+
|
|
548
|
+
message MyMessage {
|
|
549
|
+
optional string field1 = 1;
|
|
550
|
+
optional int32 field2 = 2;
|
|
551
|
+
}
|
|
552
|
+
EOF
|
|
553
|
+
```
|
|
554
|
+
|
|
555
|
+
2. **Add build script to package.json:**
|
|
556
|
+
```json
|
|
557
|
+
{
|
|
558
|
+
"scripts": {
|
|
559
|
+
"build:proto:myschema": "pbjs -t static-module -w commonjs -o examples/generated/my_schema.js schemas/my_schema.proto && pbts -o examples/generated/my_schema.d.ts examples/generated/my_schema.js"
|
|
560
|
+
}
|
|
561
|
+
}
|
|
562
|
+
```
|
|
563
|
+
|
|
564
|
+
3. **Generate code and descriptor:**
|
|
565
|
+
```bash
|
|
566
|
+
npm run build:proto:myschema
|
|
567
|
+
protoc --descriptor_set_out=schemas/my_schema_descriptor.pb --include_imports schemas/my_schema.proto
|
|
568
|
+
```
|
|
569
|
+
|
|
570
|
+
4. **Load descriptor in your code:**
|
|
571
|
+
```typescript
|
|
572
|
+
import { loadDescriptorProto } from '@databricks/zerobus-ingest-sdk/utils/descriptor';
|
|
573
|
+
const descriptorBase64 = loadDescriptorProto({
|
|
574
|
+
descriptorPath: 'schemas/my_schema_descriptor.pb',
|
|
575
|
+
protoFileName: 'my_schema.proto',
|
|
576
|
+
messageName: 'MyMessage'
|
|
577
|
+
});
|
|
578
|
+
```
|
|
579
|
+
|
|
580
|
+
#### Troubleshooting Protocol Buffers
|
|
581
|
+
|
|
582
|
+
**"protoc: command not found"**
|
|
583
|
+
- Install `protoc` (see Step 1 above)
|
|
584
|
+
|
|
585
|
+
**"Cannot find module './generated/air_quality'"**
|
|
586
|
+
- Run `npm run build:proto` to generate TypeScript code
|
|
587
|
+
|
|
588
|
+
**"Descriptor file not found"**
|
|
589
|
+
- Generate the descriptor file using the commands in Step 4
|
|
590
|
+
|
|
591
|
+
**"Invalid descriptor"**
|
|
592
|
+
- Ensure you used `--include_imports` flag when generating the descriptor
|
|
593
|
+
- Verify the `.pb` file was created: `ls -lh schemas/*.pb`
|
|
594
|
+
- Check that `protoFileName` and `messageName` match your proto file
|
|
595
|
+
- Make sure you're using `loadDescriptorProto()` from the utils
|
|
596
|
+
|
|
597
|
+
**Build fails on proto generation**
|
|
598
|
+
- Ensure protobufjs is installed: `npm install --save-dev protobufjs protobufjs-cli`
|
|
599
|
+
|
|
600
|
+
#### Quick Reference
|
|
601
|
+
|
|
602
|
+
Complete setup from scratch:
|
|
603
|
+
```bash
|
|
604
|
+
# Install dependencies and build SDK
|
|
605
|
+
npm install
|
|
606
|
+
npm run build
|
|
607
|
+
|
|
608
|
+
# Setup Protocol Buffers
|
|
609
|
+
npm run build:proto
|
|
610
|
+
protoc --descriptor_set_out=schemas/air_quality_descriptor.pb --include_imports schemas/air_quality.proto
|
|
611
|
+
|
|
612
|
+
# Run example
|
|
613
|
+
npx tsx examples/proto.ts
|
|
614
|
+
```
|
|
615
|
+
|
|
616
|
+
#### Why Two Steps (TypeScript + Descriptor)?
|
|
617
|
+
|
|
618
|
+
1. **TypeScript Code Generation** (`npm run build:proto`):
|
|
619
|
+
- Creates JavaScript/TypeScript code for your application
|
|
620
|
+
- Provides type-safe message creation and encoding
|
|
621
|
+
- Used in your application code
|
|
622
|
+
|
|
623
|
+
2. **Descriptor File Generation** (`protoc --descriptor_set_out`):
|
|
624
|
+
- Creates metadata about your schema for Databricks
|
|
625
|
+
- Required by Zerobus service for schema validation
|
|
626
|
+
- Uploaded as base64 string when creating a stream
|
|
627
|
+
|
|
628
|
+
Both are necessary for Protocol Buffers ingestion!
|
|
629
|
+
|
|
630
|
+
## Usage Examples
|
|
631
|
+
|
|
632
|
+
See the `examples/` directory for complete, runnable examples. See [examples/README.md](examples/README.md) for detailed instructions.
|
|
633
|
+
|
|
634
|
+
### Running Examples
|
|
635
|
+
|
|
636
|
+
```bash
|
|
637
|
+
# Set environment variables
|
|
638
|
+
export ZEROBUS_SERVER_ENDPOINT="<workspace-id>.zerobus.<region>.cloud.databricks.com"
|
|
639
|
+
export DATABRICKS_WORKSPACE_URL="https://<workspace-name>.cloud.databricks.com"
|
|
640
|
+
export DATABRICKS_CLIENT_ID="your-client-id"
|
|
641
|
+
export DATABRICKS_CLIENT_SECRET="your-client-secret"
|
|
642
|
+
export ZEROBUS_TABLE_NAME="main.default.air_quality"
|
|
643
|
+
|
|
644
|
+
# Run JSON example
|
|
645
|
+
npx tsx examples/json.ts
|
|
646
|
+
|
|
647
|
+
# For Protocol Buffers, generate TypeScript code and descriptor
|
|
648
|
+
npm run build:proto
|
|
649
|
+
protoc --descriptor_set_out=schemas/air_quality_descriptor.pb --include_imports schemas/air_quality.proto
|
|
650
|
+
|
|
651
|
+
# Run Protocol Buffers example
|
|
652
|
+
npx tsx examples/proto.ts
|
|
653
|
+
```
|
|
654
|
+
|
|
655
|
+
### Batch Ingestion
|
|
656
|
+
|
|
657
|
+
For higher throughput, use batch ingestion to send multiple records with a single acknowledgment:
|
|
658
|
+
|
|
659
|
+
#### Protocol Buffers
|
|
660
|
+
|
|
661
|
+
```typescript
|
|
662
|
+
const records = Array.from({ length: 1000 }, (_, i) =>
|
|
663
|
+
AirQuality.create({ device_name: `sensor-${i}`, temp: 20 + i, humidity: 50 + i })
|
|
664
|
+
);
|
|
665
|
+
|
|
666
|
+
// Protobuf Type 1: Message objects (high-level) - SDK auto-serializes
|
|
667
|
+
const offsetId = await stream.ingestRecords(records);
|
|
668
|
+
|
|
669
|
+
// Protobuf Type 2: Buffers (low-level) - pre-serialized bytes
|
|
670
|
+
// const buffers = records.map(r => Buffer.from(AirQuality.encode(r).finish()));
|
|
671
|
+
// const offsetId = await stream.ingestRecords(buffers);
|
|
672
|
+
|
|
673
|
+
if (offsetId !== null) {
|
|
674
|
+
console.log(`Batch acknowledged at offset ${offsetId}`);
|
|
675
|
+
}
|
|
676
|
+
```
|
|
677
|
+
|
|
678
|
+
#### JSON
|
|
679
|
+
|
|
680
|
+
```typescript
|
|
681
|
+
const records = Array.from({ length: 1000 }, (_, i) => ({
|
|
682
|
+
device_name: `sensor-${i}`,
|
|
683
|
+
temp: 20 + i,
|
|
684
|
+
humidity: 50 + i
|
|
685
|
+
}));
|
|
686
|
+
|
|
687
|
+
// JSON Type 1: objects (high-level) - SDK auto-stringifies
|
|
688
|
+
const offsetId = await stream.ingestRecords(records);
|
|
689
|
+
|
|
690
|
+
// JSON Type 2: strings (low-level) - pre-serialized JSON
|
|
691
|
+
// const jsonRecords = records.map(r => JSON.stringify(r));
|
|
692
|
+
// const offsetId = await stream.ingestRecords(jsonRecords);
|
|
693
|
+
```
|
|
694
|
+
|
|
695
|
+
**Type Widening Support:**
|
|
696
|
+
- JSON mode: Accept `object[]` (auto-stringify) or `string[]` (pre-stringified)
|
|
697
|
+
- Proto mode: Accept protobuf messages with `.encode()` method (auto-serialize) or `Buffer[]` (pre-serialized)
|
|
698
|
+
- Mixed high-level and low-level types are supported in the same batch (see the sketch below)
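A minimal sketch of a mixed batch in JSON mode (assuming `stream` is an open JSON-mode stream for the `air_quality` table from the examples above):

```typescript
// One pre-serialized string alongside plain objects in the same batch (JSON mode).
const preSerialized = JSON.stringify({ device_name: 'sensor-0', temp: 21, humidity: 55 });

const offsetId = await stream.ingestRecords([
  preSerialized,                                        // string (pre-serialized)
  { device_name: 'sensor-1', temp: 22, humidity: 56 },  // object (auto-stringified)
  { device_name: 'sensor-2', temp: 23, humidity: 57 },
]);
```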
|
|
699
|
+
|
|
700
|
+
**Best Practices**:
|
|
701
|
+
- Batch size: 100-1,000 records for optimal throughput/latency balance
|
|
702
|
+
- Empty batches return `null` (no error, no offset)
|
|
703
|
+
- Use `recreateStream()` for recovery - it automatically handles unacknowledged batches
|
|
704
|
+
|
|
705
|
+
**Examples:**
|
|
706
|
+
Both `json.ts` and `proto.ts` examples demonstrate batch ingestion.
|
|
707
|
+
|
|
708
|
+
## Authentication
|
|
709
|
+
|
|
710
|
+
The SDK uses OAuth 2.0 Client Credentials for authentication:
|
|
711
|
+
|
|
712
|
+
```typescript
|
|
713
|
+
import { ZerobusSdk } from '@databricks/zerobus-ingest-sdk';
|
|
714
|
+
|
|
715
|
+
const sdk = new ZerobusSdk(zerobusEndpoint, workspaceUrl);
|
|
716
|
+
|
|
717
|
+
// Create stream with OAuth authentication
|
|
718
|
+
const stream = await sdk.createStream(
|
|
719
|
+
tableProperties,
|
|
720
|
+
clientId,
|
|
721
|
+
clientSecret,
|
|
722
|
+
options
|
|
723
|
+
);
|
|
724
|
+
```
|
|
725
|
+
|
|
726
|
+
The SDK automatically fetches access tokens and includes these headers:
|
|
727
|
+
- `"authorization": "Bearer <oauth_token>"` - Obtained via OAuth 2.0 Client Credentials flow
|
|
728
|
+
- `"x-databricks-zerobus-table-name": "<table_name>"` - The fully qualified table name
|
|
729
|
+
|
|
730
|
+
### Custom Authentication
|
|
731
|
+
|
|
732
|
+
Beyond OAuth, you can use custom headers for Personal Access Tokens (PAT) or other auth methods:
|
|
733
|
+
|
|
734
|
+
```typescript
|
|
735
|
+
import { ZerobusSdk } from '@databricks/zerobus-ingest-sdk';
|
|
736
|
+
import { HeadersProvider } from '@databricks/zerobus-ingest-sdk/src/headers_provider';
|
|
737
|
+
|
|
738
|
+
class CustomHeadersProvider implements HeadersProvider {
|
|
739
|
+
async getHeaders(): Promise<Array<[string, string]>> {
|
|
740
|
+
return [
|
|
741
|
+
["authorization", `Bearer ${myToken}`],
|
|
742
|
+
["x-databricks-zerobus-table-name", tableName]
|
|
743
|
+
];
|
|
744
|
+
}
|
|
745
|
+
}
|
|
746
|
+
|
|
747
|
+
const headersProvider = new CustomHeadersProvider();
|
|
748
|
+
const stream = await sdk.createStream(
|
|
749
|
+
tableProperties,
|
|
750
|
+
'', // client_id (ignored when headers_provider is provided)
|
|
751
|
+
'', // client_secret (ignored when headers_provider is provided)
|
|
752
|
+
options,
|
|
753
|
+
{ getHeadersCallback: headersProvider.getHeaders.bind(headersProvider) }
|
|
754
|
+
);
|
|
755
|
+
```
|
|
756
|
+
|
|
757
|
+
**Note:** Custom authentication is integrated into the main `createStream()` method. See the API Reference for details.
|
|
758
|
+
|
|
759
|
+
## Configuration
|
|
760
|
+
|
|
761
|
+
### Stream Configuration Options
|
|
762
|
+
|
|
763
|
+
| Option | Default | Description |
|
|
764
|
+
|--------|---------|-------------|
|
|
765
|
+
| `recordType` | `RecordType.Proto` | Serialization format: `RecordType.Json` or `RecordType.Proto` |
|
|
766
|
+
| `maxInflightRequests` | 10,000 | Maximum number of unacknowledged requests |
|
|
767
|
+
| `recovery` | true | Enable automatic stream recovery |
|
|
768
|
+
| `recoveryTimeoutMs` | 15,000 | Timeout for recovery operations (ms) |
|
|
769
|
+
| `recoveryBackoffMs` | 2,000 | Delay between recovery attempts (ms) |
|
|
770
|
+
| `recoveryRetries` | 4 | Maximum number of recovery attempts |
|
|
771
|
+
| `flushTimeoutMs` | 300,000 | Timeout for flush operations (ms) |
|
|
772
|
+
| `serverLackOfAckTimeoutMs` | 60,000 | Server acknowledgment timeout (ms) |
|
|
773
|
+
|
|
774
|
+
### Example Configuration
|
|
775
|
+
|
|
776
|
+
```typescript
|
|
777
|
+
import { StreamConfigurationOptions, RecordType } from '@databricks/zerobus-ingest-sdk';
|
|
778
|
+
|
|
779
|
+
const options: StreamConfigurationOptions = {
|
|
780
|
+
recordType: RecordType.Json, // JSON encoding
|
|
781
|
+
maxInflightRequests: 10000,
|
|
782
|
+
recovery: true,
|
|
783
|
+
recoveryTimeoutMs: 20000,
|
|
784
|
+
recoveryBackoffMs: 2000,
|
|
785
|
+
recoveryRetries: 4
|
|
786
|
+
};
|
|
787
|
+
|
|
788
|
+
const stream = await sdk.createStream(
|
|
789
|
+
tableProperties,
|
|
790
|
+
clientId,
|
|
791
|
+
clientSecret,
|
|
792
|
+
options
|
|
793
|
+
);
|
|
794
|
+
```
|
|
795
|
+
|
|
796
|
+
## Descriptor Utilities
|
|
797
|
+
|
|
798
|
+
The SDK provides a helper function to extract Protocol Buffer descriptors from FileDescriptorSets.
|
|
799
|
+
|
|
800
|
+
### loadDescriptorProto()
|
|
801
|
+
|
|
802
|
+
Extracts a specific message descriptor from a FileDescriptorSet:
|
|
803
|
+
|
|
804
|
+
```typescript
|
|
805
|
+
import { loadDescriptorProto } from '@databricks/zerobus-ingest-sdk/utils/descriptor';
|
|
806
|
+
|
|
807
|
+
const descriptorBase64 = loadDescriptorProto({
|
|
808
|
+
descriptorPath: 'schemas/my_schema_descriptor.pb',
|
|
809
|
+
protoFileName: 'my_schema.proto', // Name of your .proto file
|
|
810
|
+
messageName: 'MyMessage' // The specific message to use
|
|
811
|
+
});
|
|
812
|
+
```
|
|
813
|
+
|
|
814
|
+
**Parameters:**
|
|
815
|
+
- `descriptorPath`: Path to the `.pb` file generated by `protoc --descriptor_set_out`
|
|
816
|
+
- `protoFileName`: Name of the proto file (e.g., `"air_quality.proto"`)
|
|
817
|
+
- `messageName`: Name of the message type to extract (e.g., `"AirQuality"`)
|
|
818
|
+
|
|
819
|
+
**Why use this utility?**
|
|
820
|
+
- Extracts the specific message descriptor you need
|
|
821
|
+
- No manual base64 conversion required
|
|
822
|
+
- Clear error messages if the file or message isn't found
|
|
823
|
+
- Flexible for complex schemas with multiple messages or imports
|
|
824
|
+
|
|
825
|
+
**Example with multiple messages:**
|
|
826
|
+
```typescript
|
|
827
|
+
// Your proto file has: Order, OrderItem, Customer
|
|
828
|
+
// You want to ingest Orders:
|
|
829
|
+
const descriptorBase64 = loadDescriptorProto({
|
|
830
|
+
descriptorPath: 'schemas/orders_descriptor.pb',
|
|
831
|
+
protoFileName: 'orders.proto',
|
|
832
|
+
messageName: 'Order' // Explicitly choose Order
|
|
833
|
+
});
|
|
834
|
+
```
|
|
835
|
+
|
|
836
|
+
## Error Handling
|
|
837
|
+
|
|
838
|
+
The SDK includes automatic recovery for transient failures (enabled by default with `recovery: true`). For permanent failures, use `recreateStream()` to automatically recover all unacknowledged batches. Always use try/finally blocks to ensure streams are properly closed:
|
|
839
|
+
|
|
840
|
+
```typescript
|
|
841
|
+
try {
|
|
842
|
+
const offset = await stream.ingestRecord(JSON.stringify(record));
|
|
843
|
+
console.log(`Success: offset ${offset}`);
|
|
844
|
+
} catch (error) {
|
|
845
|
+
console.error('Ingestion failed:', error);
|
|
846
|
+
|
|
847
|
+
// When stream fails, close it first
|
|
848
|
+
await stream.close();
|
|
849
|
+
console.log('Stream closed after error');
|
|
850
|
+
|
|
851
|
+
// Optional: Inspect what needs recovery (must be called on closed stream)
|
|
852
|
+
const unackedBatches = await stream.getUnackedBatches();
|
|
853
|
+
console.log(`Batches to recover: ${unackedBatches.length}`);
|
|
854
|
+
|
|
855
|
+
// Recommended recovery approach: Use recreateStream()
|
|
856
|
+
// This method:
|
|
857
|
+
// 1. Gets all unacknowledged batches from the failed stream
|
|
858
|
+
// 2. Creates a new stream with the same configuration
|
|
859
|
+
// 3. Re-ingests all unacknowledged batches automatically
|
|
860
|
+
// 4. Returns the new stream ready for continued use
|
|
861
|
+
const newStream = await sdk.recreateStream(stream);
|
|
862
|
+
console.log(`Stream recreated with ${unackedBatches.length} batches re-ingested`);
|
|
863
|
+
|
|
864
|
+
// Continue using newStream for further ingestion
|
|
865
|
+
try {
|
|
866
|
+
// Continue ingesting...
|
|
867
|
+
} finally {
|
|
868
|
+
await newStream.close();
|
|
869
|
+
}
|
|
870
|
+
}
|
|
871
|
+
```
|
|
872
|
+
|
|
873
|
+
**Best Practices:**
|
|
874
|
+
- **Rely on automatic recovery** (default): The SDK will automatically retry transient failures
|
|
875
|
+
- **Use `recreateStream()` for permanent failures**: Automatically recovers all unacknowledged batches
|
|
876
|
+
- **Use `getUnackedRecords()` for inspection only**: Primarily for debugging or understanding failed records
|
|
877
|
+
- Always close streams in a `finally` block to ensure proper cleanup
|
|
878
|
+
|
|
879
|
+
## API Reference
|
|
880
|
+
|
|
881
|
+
### ZerobusSdk
|
|
882
|
+
|
|
883
|
+
Main entry point for the SDK.
|
|
884
|
+
|
|
885
|
+
**Constructor:**
|
|
886
|
+
|
|
887
|
+
```typescript
|
|
888
|
+
new ZerobusSdk(zerobusEndpoint: string, unityCatalogUrl: string)
|
|
889
|
+
```
|
|
890
|
+
|
|
891
|
+
**Parameters:**
|
|
892
|
+
- `zerobusEndpoint` (string) - The Zerobus gRPC endpoint (e.g., `<workspace-id>.zerobus.<region>.cloud.databricks.com` for AWS, or `<workspace-id>.zerobus.<region>.azuredatabricks.net` for Azure)
|
|
893
|
+
- `unityCatalogUrl` (string) - The Unity Catalog endpoint (your workspace URL)
|
|
894
|
+
|
|
895
|
+
**Methods:**
|
|
896
|
+
|
|
897
|
+
```typescript
|
|
898
|
+
async createStream(
|
|
899
|
+
tableProperties: TableProperties,
|
|
900
|
+
clientId: string,
|
|
901
|
+
clientSecret: string,
|
|
902
|
+
options?: StreamConfigurationOptions
|
|
903
|
+
): Promise<ZerobusStream>
|
|
904
|
+
```
|
|
905
|
+
|
|
906
|
+
Creates a new ingestion stream using OAuth 2.0 Client Credentials authentication.
|
|
907
|
+
|
|
908
|
+
Automatically includes these headers:
|
|
909
|
+
- `"authorization": "Bearer <oauth_token>"` (fetched via OAuth 2.0 Client Credentials flow)
|
|
910
|
+
- `"x-databricks-zerobus-table-name": "<table_name>"`
|
|
911
|
+
|
|
912
|
+
Returns a `ZerobusStream` instance.
|
|
913
|
+
|
|
914
|
+
---
|
|
915
|
+
|
|
916
|
+
```typescript
|
|
917
|
+
async recreateStream(stream: ZerobusStream): Promise<ZerobusStream>
|
|
918
|
+
```
|
|
919
|
+
|
|
920
|
+
Recreates a stream with the same configuration and automatically re-ingests all unacknowledged batches.
|
|
921
|
+
|
|
922
|
+
This method is the **recommended approach** for recovering from stream failures. It:
|
|
923
|
+
1. Retrieves all unacknowledged batches from the failed stream
|
|
924
|
+
2. Creates a new stream with identical configuration (same table, auth, options)
|
|
925
|
+
3. Re-ingests all unacknowledged batches in their original order
|
|
926
|
+
4. Returns the new stream ready for continued ingestion
|
|
927
|
+
|
|
928
|
+
**Parameters:**
|
|
929
|
+
- `stream` - The failed or closed stream to recreate
|
|
930
|
+
|
|
931
|
+
**Returns:** Promise resolving to a new `ZerobusStream` with all unacknowledged batches re-ingested
|
|
932
|
+
|
|
933
|
+
**Example:**
|
|
934
|
+
```typescript
|
|
935
|
+
try {
|
|
936
|
+
await stream.ingestRecords(batch);
|
|
937
|
+
} catch (error) {
|
|
938
|
+
await stream.close();
|
|
939
|
+
// Automatically recreate stream and recover all unacked batches
|
|
940
|
+
const newStream = await sdk.recreateStream(stream);
|
|
941
|
+
// Continue ingesting with newStream
|
|
942
|
+
}
|
|
943
|
+
```
|
|
944
|
+
|
|
945
|
+
**Note:** This method preserves batch structure and re-ingests batches atomically. For debugging, you can inspect what was recovered using `getUnackedBatches()` after closing the stream.
|
|
946
|
+
|
|
947
|
+
---
|
|
948
|
+
|
|
949
|
+
### ZerobusStream
|
|
950
|
+
|
|
951
|
+
Represents an active ingestion stream.
|
|
952
|
+
|
|
953
|
+
**Methods:**
|
|
954
|
+
|
|
955
|
+
```typescript
|
|
956
|
+
async ingestRecord(payload: Buffer | string | object): Promise<bigint>
|
|
957
|
+
```
|
|
958
|
+
|
|
959
|
+
Ingests a single record. This method **blocks** until the record is sent to the SDK's internal landing zone, then returns a Promise for the server acknowledgment. This allows you to send many records without waiting for individual acknowledgments.
|
|
960
|
+
|
|
961
|
+
**Parameters:**
|
|
962
|
+
- `payload` - Record data. The SDK supports 4 input types for flexibility:
|
|
963
|
+
- **JSON Mode** (`RecordType.Json`):
|
|
964
|
+
- **Type 1 - object** (high-level): Plain JavaScript object - SDK auto-stringifies with `JSON.stringify()`
|
|
965
|
+
- **Type 2 - string** (low-level): Pre-serialized JSON string
|
|
966
|
+
- **Protocol Buffers Mode** (`RecordType.Proto`):
|
|
967
|
+
- **Type 3 - Message** (high-level): Protobuf message object - SDK calls `.encode().finish()` automatically
|
|
968
|
+
- **Type 4 - Buffer** (low-level): Pre-serialized protobuf bytes
|
|
969
|
+
|
|
970
|
+
**All 4 Type Examples:**
|
|
971
|
+
```typescript
|
|
972
|
+
// JSON Type 1: object (high-level) - SDK auto-stringifies
|
|
973
|
+
await stream.ingestRecord({ device: 'sensor-1', temp: 25 });
|
|
974
|
+
|
|
975
|
+
// JSON Type 2: string (low-level) - pre-serialized
|
|
976
|
+
await stream.ingestRecord(JSON.stringify({ device: 'sensor-1', temp: 25 }));
|
|
977
|
+
|
|
978
|
+
// Protobuf Type 3: Message object (high-level) - SDK auto-serializes
|
|
979
|
+
const message = MyMessage.create({ device: 'sensor-1', temp: 25 });
|
|
980
|
+
await stream.ingestRecord(message);
|
|
981
|
+
|
|
982
|
+
// Protobuf Type 4: Buffer (low-level) - pre-serialized bytes
|
|
983
|
+
const buffer = Buffer.from(MyMessage.encode(message).finish());
|
|
984
|
+
await stream.ingestRecord(buffer);
|
|
985
|
+
```
|
|
986
|
+
|
|
987
|
+
**Note:** The SDK automatically detects protobufjs message objects by checking if the constructor has a static `.encode()` method. This works seamlessly with messages created via `MyMessage.create()` or `new MyMessage()`.
|
|
988
|
+
|
|
989
|
+
**Returns:** Promise resolving to the offset ID when the server acknowledges the record
|
|
990
|
+
|
|
991
|
+
---
|
|
992
|
+
|
|
993
|
+
```typescript
|
|
994
|
+
async ingestRecords(payloads: Array<Buffer | string | object>): Promise<bigint | null>
|
|
995
|
+
```
|
|
996
|
+
|
|
997
|
+
Ingests multiple records as a batch. All records in a batch are acknowledged together atomically. This method **blocks** until all records are sent to the SDK's internal landing zone, then returns a Promise for the server acknowledgment.
|
|
998
|
+
|
|
999
|
+
**Parameters:**
|
|
1000
|
+
- `payloads` - Array of record data. Supports the same 4 types as `ingestRecord()`:
|
|
1001
|
+
- **JSON Mode**: Array of **objects** (Type 1) or **strings** (Type 2)
|
|
1002
|
+
- **Proto Mode**: Array of **Message objects** (Type 3) or **Buffers** (Type 4)
|
|
1003
|
+
- Mixed types within the same array are supported
|
|
1004
|
+
|
|
1005
|
+
**All 4 Type Examples:**
|
|
1006
|
+
```typescript
|
|
1007
|
+
// JSON Type 1: objects (high-level) - SDK auto-stringifies
|
|
1008
|
+
await stream.ingestRecords([
|
|
1009
|
+
{ device: 'sensor-1', temp: 25 },
|
|
1010
|
+
{ device: 'sensor-2', temp: 26 }
|
|
1011
|
+
]);
|
|
1012
|
+
|
|
1013
|
+
// JSON Type 2: strings (low-level) - pre-serialized
|
|
1014
|
+
await stream.ingestRecords([
|
|
1015
|
+
JSON.stringify({ device: 'sensor-1', temp: 25 }),
|
|
1016
|
+
JSON.stringify({ device: 'sensor-2', temp: 26 })
|
|
1017
|
+
]);
|
|
1018
|
+
|
|
1019
|
+
// Protobuf Type 3: Message objects (high-level) - SDK auto-serializes
|
|
1020
|
+
await stream.ingestRecords([
|
|
1021
|
+
MyMessage.create({ device: 'sensor-1', temp: 25 }),
|
|
1022
|
+
MyMessage.create({ device: 'sensor-2', temp: 26 })
|
|
1023
|
+
]);
|
|
1024
|
+
|
|
1025
|
+
// Protobuf Type 4: Buffers (low-level) - pre-serialized bytes
|
|
1026
|
+
const buffers = [
|
|
1027
|
+
Buffer.from(MyMessage.encode(msg1).finish()),
|
|
1028
|
+
Buffer.from(MyMessage.encode(msg2).finish())
|
|
1029
|
+
];
|
|
1030
|
+
await stream.ingestRecords(buffers);
|
|
1031
|
+
```
|
|
1032
|
+
|
|
1033
|
+
**Returns:** Promise resolving to:
|
|
1034
|
+
- `bigint` - Offset ID when the server acknowledges the entire batch
|
|
1035
|
+
- `null` - If the batch was empty (no records sent)
|
|
1036
|
+
|
|
1037
|
+
**Best Practices:**
|
|
1038
|
+
- Batch size: 100-1,000 records for optimal throughput/latency balance
|
|
1039
|
+
- Empty batches are allowed and return `null`
|
|
1040
|
+
|
|
1041
|
+
---
|
|
1042
|
+
|
|
1043
|
+
```typescript
|
|
1044
|
+
async flush(): Promise<void>
|
|
1045
|
+
```
|
|
1046
|
+
|
|
1047
|
+
Flushes all pending records and waits for acknowledgments.
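For long-running streams it can be useful to flush periodically so the unacknowledged backlog stays bounded. A minimal sketch, assuming `stream` and `records` from the earlier examples:

```typescript
// Queue records without awaiting each acknowledgment, then flush in intervals.
for (let i = 0; i < records.length; i++) {
  void stream.ingestRecord(records[i]);  // acknowledgment resolved later
  if ((i + 1) % 10_000 === 0) {
    await stream.flush();                // wait until everything queued so far is acknowledged
  }
}
await stream.flush();                    // final barrier before close()
```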
|
|
1048
|
+
|
|
1049
|
+
```typescript
|
|
1050
|
+
async close(): Promise<void>
|
|
1051
|
+
```
|
|
1052
|
+
|
|
1053
|
+
Closes the stream gracefully, flushing all pending data. **Always call this in a finally block!**
|
|
1054
|
+
|
|
1055
|
+
```typescript
|
|
1056
|
+
async getUnackedRecords(): Promise<Buffer[]>
|
|
1057
|
+
```
|
|
1058
|
+
|
|
1059
|
+
Returns unacknowledged record payloads as a flat array for inspection purposes.
|
|
1060
|
+
|
|
1061
|
+
**Important:** Can only be called on **closed streams**. Call `stream.close()` first, or this will throw an error.
|
|
1062
|
+
|
|
1063
|
+
**Returns:** Array of Buffer containing the raw record payloads
|
|
1064
|
+
|
|
1065
|
+
**Use case:** For inspecting unacknowledged individual records when using `ingestRecord()`. **Note:** This method is primarily for debugging and inspection. For recovery, use `recreateStream()` (recommended) or automatic recovery (default).
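For example, assuming a JSON-mode stream that has just failed:

```typescript
// Inspect individual unacknowledged records (the stream must be closed first).
await stream.close();
const unacked = await stream.getUnackedRecords();
console.log(`Unacknowledged records: ${unacked.length}`);
for (const payload of unacked) {
  // In JSON mode each Buffer holds one serialized JSON record.
  console.log(payload.toString('utf8'));
}
```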
|
|
1066
|
+
|
|
1067
|
+
---
|
|
1068
|
+
|
|
1069
|
+
```typescript
|
|
1070
|
+
async getUnackedBatches(): Promise<Buffer[][]>
|
|
1071
|
+
```
|
|
1072
|
+
|
|
1073
|
+
Returns unacknowledged records grouped by their original batches for inspection purposes.
|
|
1074
|
+
|
|
1075
|
+
**Important:** Can only be called on **closed streams**. Call `stream.close()` first, or this will throw an error.
|
|
1076
|
+
|
|
1077
|
+
**Returns:** Array of arrays, where each inner array represents a batch of records as Buffers
|
|
1078
|
+
|
|
1079
|
+
**Use case:** For inspecting unacknowledged batches when using `ingestRecords()`. Preserves the original batch structure. **Note:** This method is primarily for debugging and inspection. For recovery, use `recreateStream()` (recommended) or automatic recovery (default).
|
|
1080
|
+
|
|
1081
|
+
**Example:**
|
|
1082
|
+
```typescript
|
|
1083
|
+
try {
|
|
1084
|
+
await stream.ingestRecords(batch1);
|
|
1085
|
+
await stream.ingestRecords(batch2);
|
|
1086
|
+
// ... error occurs
|
|
1087
|
+
} catch (error) {
|
|
1088
|
+
await stream.close();
|
|
1089
|
+
const unackedBatches = await stream.getUnackedBatches();
|
|
1090
|
+
// unackedBatches[0] contains records from batch1 (if not acked)
|
|
1091
|
+
// unackedBatches[1] contains records from batch2 (if not acked)
|
|
1092
|
+
|
|
1093
|
+
// Re-ingest with new stream
|
|
1094
|
+
for (const batch of unackedBatches) {
|
|
1095
|
+
await newStream.ingestRecords(batch);
|
|
1096
|
+
}
|
|
1097
|
+
}
|
|
1098
|
+
```
|
|
1099
|
+
|
|
1100
|
+
---
|
|
1101
|
+
|
|
1102
|
+
### TableProperties
|
|
1103
|
+
|
|
1104
|
+
Configuration for the target table.
|
|
1105
|
+
|
|
1106
|
+
**Interface:**
|
|
1107
|
+
|
|
1108
|
+
```typescript
|
|
1109
|
+
interface TableProperties {
|
|
1110
|
+
tableName: string; // Fully qualified table name (e.g., "catalog.schema.table")
|
|
1111
|
+
descriptorProto?: string; // Base64-encoded protobuf descriptor (required for Protocol Buffers)
|
|
1112
|
+
}
|
|
1113
|
+
```
|
|
1114
|
+
|
|
1115
|
+
**Examples:**
|
|
1116
|
+
|
|
1117
|
+
```typescript
|
|
1118
|
+
// JSON mode
|
|
1119
|
+
const tableProperties = { tableName: 'main.default.air_quality' };
|
|
1120
|
+
|
|
1121
|
+
// Protocol Buffers mode
|
|
1122
|
+
const tableProperties = {
|
|
1123
|
+
tableName: 'main.default.air_quality',
|
|
1124
|
+
descriptorProto: descriptorBase64 // Required for protobuf
|
|
1125
|
+
};
|
|
1126
|
+
```
|
|
1127
|
+
|
|
1128
|
+
---
|
|
1129
|
+
|
|
1130
|
+
### StreamConfigurationOptions
|
|
1131
|
+
|
|
1132
|
+
Configuration options for stream behavior.
|
|
1133
|
+
|
|
1134
|
+
**Interface:**
|
|
1135
|
+
|
|
1136
|
+
```typescript
|
|
1137
|
+
interface StreamConfigurationOptions {
|
|
1138
|
+
recordType?: RecordType; // RecordType.Json or RecordType.Proto. Default: RecordType.Proto
|
|
1139
|
+
maxInflightRequests?: number; // Default: 10,000
|
|
1140
|
+
recovery?: boolean; // Default: true
|
|
1141
|
+
recoveryTimeoutMs?: number; // Default: 15,000
|
|
1142
|
+
recoveryBackoffMs?: number; // Default: 2,000
|
|
1143
|
+
recoveryRetries?: number; // Default: 4
|
|
1144
|
+
flushTimeoutMs?: number; // Default: 300,000
|
|
1145
|
+
serverLackOfAckTimeoutMs?: number; // Default: 60,000
|
|
1146
|
+
}
|
|
1147
|
+
|
|
1148
|
+
enum RecordType {
|
|
1149
|
+
Json = 0, // JSON encoding
|
|
1150
|
+
Proto = 1 // Protocol Buffers encoding
|
|
1151
|
+
}
|
|
1152
|
+
```
|
|
1153
|
+
|
|
1154
|
+
## Best Practices
|
|
1155
|
+
|
|
1156
|
+
1. **Reuse SDK instances**: Create one `ZerobusSdk` instance per application
|
|
1157
|
+
2. **Stream lifecycle**: Always close streams in a `finally` block to ensure all records are flushed
|
|
1158
|
+
3. **Batch size**: Adjust `maxInflightRequests` based on your throughput requirements (default: 10,000)
|
|
1159
|
+
4. **Error handling**: The stream handles errors internally with automatic retry. Only use `recreateStream()` for persistent failures after internal retries are exhausted.
|
|
1160
|
+
5. **Use Protocol Buffers for production**: Protocol Buffers (the default) provides better performance and schema validation. Use JSON only when you need schema flexibility or for quick prototyping.
|
|
1161
|
+
6. **Store credentials securely**: Use environment variables, never hardcode credentials
|
|
1162
|
+
7. **Use batch ingestion**: For high-throughput scenarios, use `ingestRecords()` instead of individual `ingestRecord()` calls
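Putting several of these together, the shape of a typical production ingest path looks like this condensed sketch (`descriptorBase64` and `batchOfMessages` are assumed from the earlier Protocol Buffers examples):

```typescript
import { ZerobusSdk } from '@databricks/zerobus-ingest-sdk';

// One SDK instance per application, credentials from the environment.
const sdk = new ZerobusSdk(process.env.ZEROBUS_SERVER_ENDPOINT!, process.env.DATABRICKS_WORKSPACE_URL!);

const stream = await sdk.createStream(
  { tableName: process.env.ZEROBUS_TABLE_NAME!, descriptorProto: descriptorBase64 },
  process.env.DATABRICKS_CLIENT_ID!,
  process.env.DATABRICKS_CLIENT_SECRET!,
  { maxInflightRequests: 10_000, recovery: true }
);

try {
  await stream.ingestRecords(batchOfMessages); // batch instead of per-record calls
  await stream.flush();
} finally {
  await stream.close();                        // always runs, even on error
}
```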
|
|
1163
|
+
|
|
1164
|
+
## Platform Support
|
|
1165
|
+
|
|
1166
|
+
The SDK supports all platforms where Node.js and Rust are available.
|
|
1167
|
+
|
|
1168
|
+
### Pre-built Binaries
|
|
1169
|
+
|
|
1170
|
+
Pre-built native binaries are available for:
|
|
1171
|
+
|
|
1172
|
+
- **Linux**: x64, ARM64
|
|
1173
|
+
- **Windows**: x64
|
|
1174
|
+
|
|
1175
|
+
### Build from Source
|
|
1176
|
+
|
|
1177
|
+
**macOS users**: Pre-built binaries are not available for macOS. The package will automatically build from source during `npm install`, which requires:
|
|
1178
|
+
|
|
1179
|
+
- **Rust toolchain** (1.70+): Install via `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
|
|
1180
|
+
- **Xcode Command Line Tools**: Install via `xcode-select --install`
|
|
1181
|
+
|
|
1182
|
+
The build process happens automatically during installation and typically takes 2-3 minutes.
|
|
1183
|
+
|
|
1184
|
+
## Architecture
|
|
1185
|
+
|
|
1186
|
+
This SDK wraps the high-performance [Rust Zerobus SDK](https://github.com/databricks/zerobus-sdk-rs) using [NAPI-RS](https://napi.rs):
|
|
1187
|
+
|
|
1188
|
+
```
|
|
1189
|
+
┌─────────────────────────────┐
|
|
1190
|
+
│ TypeScript Application │
|
|
1191
|
+
└─────────────┬───────────────┘
|
|
1192
|
+
│ (NAPI-RS bindings)
|
|
1193
|
+
┌─────────────▼───────────────┐
|
|
1194
|
+
│ Rust Zerobus SDK │
|
|
1195
|
+
│ - gRPC communication │
|
|
1196
|
+
│ - OAuth authentication │
|
|
1197
|
+
│ - Stream management │
|
|
1198
|
+
└─────────────┬───────────────┘
|
|
1199
|
+
│ (gRPC/TLS)
|
|
1200
|
+
┌─────────────▼───────────────┐
|
|
1201
|
+
│ Databricks Zerobus Service│
|
|
1202
|
+
└─────────────────────────────┘
|
|
1203
|
+
```
|
|
1204
|
+
|
|
1205
|
+
**Benefits:**
|
|
1206
|
+
- **Zero-copy data transfer** between JavaScript and Rust
|
|
1207
|
+
- **Native async/await support** - Rust futures become JavaScript Promises
|
|
1208
|
+
- **Automatic memory management** - No manual cleanup required
|
|
1209
|
+
- **Type safety** - Compile-time checks on both sides
|
|
1210
|
+
|
|
1211
|
+
## Contributing
|
|
1212
|
+
|
|
1213
|
+
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details.
|
|
1214
|
+
|
|
1215
|
+
## Related Projects
|
|
1216
|
+
|
|
1217
|
+
- [Zerobus Rust SDK](https://github.com/databricks/zerobus-sdk-rs) - The underlying Rust implementation
|
|
1218
|
+
- [Zerobus Python SDK](https://github.com/databricks/zerobus-sdk-py) - Python SDK for Zerobus
|
|
1219
|
+
- [Zerobus Java SDK](https://github.com/databricks/zerobus-sdk-java) - Java SDK for Zerobus
|
|
1220
|
+
- [NAPI-RS](https://napi.rs) - Rust/Node.js binding framework
|