@batchactions/distributed 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
MIT License

Copyright (c) 2025 @batchactions contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED
# @batchactions/distributed

Distributed parallel batch processing for [@batchactions/core](https://www.npmjs.com/package/@batchactions/core). Fan out N workers (AWS Lambda, Cloud Functions, ECS tasks, etc.) to process batches concurrently with atomic claiming, crash recovery, and exactly-once completion.

## When to Use This

Use `@batchactions/distributed` when:

- You need to import **hundreds of thousands or millions of records** and a single process is too slow.
- You are running in **serverless** environments (AWS Lambda, Google Cloud Functions) and want to parallelize across multiple invocations.
- You need **crash resilience**: if a worker dies, another worker picks up its batch.

For simpler scenarios (under 100k records, single server), `@batchactions/core` alone is sufficient. Use `processChunk()` for serverless with time limits, or `maxConcurrentBatches` for in-process parallelism.

## Installation

```bash
npm install @batchactions/distributed
```

**Peer dependencies:**

- `@batchactions/core` >= 0.4.0

You also need a `DistributedStateStore` implementation. The official one is [`@batchactions/state-sequelize`](https://www.npmjs.com/package/@batchactions/state-sequelize):

```bash
npm install @batchactions/state-sequelize sequelize pg
```
## How It Works

A two-phase processing model:

```
Phase 1: PREPARE (single orchestrator)
┌─────────────────────────────────┐
│ Stream source file              │
│ Validate & materialize records  │
│ Create batch boundaries         │
│ Save everything to StateStore   │
└──────────┬──────────────────────┘
           │
           ▼
  { jobId, totalBatches }
           │
    ┌──────┼──────┐
    ▼      ▼      ▼
Phase 2: PROCESS (N parallel workers)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Worker 1   │ │   Worker 2   │ │   Worker N   │
│  claimBatch  │ │  claimBatch  │ │  claimBatch  │
│   process    │ │   process    │ │   process    │
│  next batch  │ │  next batch  │ │  next batch  │
│     ...      │ │     ...      │ │     ...      │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       │                │                │
       └────────────────┼────────────────┘
                        ▼
                tryFinalizeJob()
                 (exactly-once)
```

1. **Prepare** (orchestrator): Streams the source file, validates field names, materializes all records in the `StateStore`, and registers batch boundaries. Returns a `PrepareResult` (`{ jobId, totalRecords, totalBatches }`).

2. **Process** (workers): Each worker calls `processWorkerBatch()` in a loop. The method atomically claims the next available batch (no two workers get the same batch), loads its records, runs the full validation + hooks + duplicate-check + processor pipeline, and marks the batch as completed or failed. When the last batch finishes, `tryFinalizeJob()` transitions the job to COMPLETED (exactly once).
## Quick Start

### Orchestrator (Phase 1)

```typescript
import { DistributedImport } from '@batchactions/distributed';
import { CsvParser } from '@batchactions/import';
import { UrlSource } from '@batchactions/core';
import { SequelizeStateStore } from '@batchactions/state-sequelize';
import { Sequelize } from 'sequelize';

const sequelize = new Sequelize(process.env.DATABASE_URL!);
const stateStore = new SequelizeStateStore(sequelize);
await stateStore.initialize();

const di = new DistributedImport({
  schema: {
    fields: [
      { name: 'email', type: 'email', required: true },
      { name: 'name', type: 'string', required: true },
      { name: 'role', type: 'string', required: false, defaultValue: 'user' },
    ],
  },
  batchSize: 500,
  stateStore,
  continueOnError: true,
});

// Phase 1: Prepare
const source = new UrlSource('https://storage.example.com/users.csv');
const { jobId, totalBatches, totalRecords } = await di.prepare(source, new CsvParser());

console.log(`Job ${jobId}: ${totalRecords} records in ${totalBatches} batches`);

// Fan out: send { jobId } to N workers via SQS, SNS, EventBridge, etc.
// (assumes an initialized `sqs` client and queue URL)
await sqs.sendMessage({
  QueueUrl: WORKER_QUEUE_URL,
  MessageBody: JSON.stringify({ jobId }),
});
```
### Worker (Phase 2)

```typescript
import { DistributedImport } from '@batchactions/distributed';
import { SequelizeStateStore } from '@batchactions/state-sequelize';
import { Sequelize } from 'sequelize';
import type { SQSEvent, Context } from 'aws-lambda';

// Lambda handler
export async function handler(event: SQSEvent, context: Context) {
  const { jobId } = JSON.parse(event.Records[0].body);
  const workerId = context.awsRequestId;

  const sequelize = new Sequelize(process.env.DATABASE_URL!);
  const stateStore = new SequelizeStateStore(sequelize);
  await stateStore.initialize();

  const di = new DistributedImport({
    schema: {
      fields: [
        { name: 'email', type: 'email', required: true },
        { name: 'name', type: 'string', required: true },
        { name: 'role', type: 'string', required: false, defaultValue: 'user' },
      ],
    },
    batchSize: 500,
    stateStore,
    continueOnError: true,
  });

  // Process batches until none remain (`db` is your application's database client)
  while (true) {
    const result = await di.processWorkerBatch(jobId, async (record) => {
      await db.query(
        'INSERT INTO users (email, name, role) VALUES ($1, $2, $3)',
        [record.email, record.name, record.role],
      );
    }, workerId);

    if (!result.claimed) {
      console.log('No more batches to process');
      break;
    }

    console.log(`Batch ${result.batchIndex}: ${result.processedCount} processed, ${result.failedCount} failed`);

    if (result.jobComplete) {
      console.log('Job finalized by this worker!');
      break;
    }
  }
}
```
## Configuration

### `DistributedImportConfig`

| Property | Type | Default | Description |
|---|---|---|---|
| `schema` | `SchemaDefinition` | required | Field definitions and validation rules |
| `batchSize` | `number` | `100` | Records per batch |
| `continueOnError` | `boolean` | `true` | Continue when records fail validation or processing |
| `stateStore` | `StateStore` | required | Must implement `DistributedStateStore` (e.g. `SequelizeStateStore`) |
| `maxRetries` | `number` | `0` | Retry attempts for processor failures (exponential backoff) |
| `retryDelayMs` | `number` | `1000` | Base delay in ms between retry attempts |
| `hooks` | `JobHooks` | -- | Lifecycle hooks (`beforeValidate`, `afterValidate`, `beforeProcess`, `afterProcess`) |
| `duplicateChecker` | `DuplicateChecker` | -- | External duplicate detection |
| `staleBatchTimeoutMs` | `number` | `900000` | Timeout in ms before stale batches are reclaimed (default 15 minutes) |
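As an illustration of how `maxRetries` and `retryDelayMs` interact, the delays below assume the common `base * 2^attempt` convention; the exact backoff formula is an implementation detail, not a documented contract:

```typescript
// Hypothetical sketch of exponential backoff driven by the two retry settings.
// retryDelays(maxRetries, retryDelayMs) yields the wait before each retry attempt.
function retryDelays(maxRetries: number, retryDelayMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => retryDelayMs * 2 ** attempt);
}

console.log(retryDelays(3, 1000)); // [ 1000, 2000, 4000 ]
```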
## API Reference

### `DistributedImport`

| Method | Description |
|---|---|
| `prepare(source, parser)` | Phase 1: stream the source, materialize records, create batches. Returns `PrepareResult`. |
| `processWorkerBatch(jobId, processor, workerId)` | Phase 2: claim the next batch, process its records, finalize the job if it was the last. Returns `DistributedBatchResult`. |
| `on(event, handler)` | Subscribe to a domain event. |
| `onAny(handler)` | Subscribe to all events. |
| `offAny(handler)` | Unsubscribe a wildcard handler. |

### `PrepareResult`

| Field | Type | Description |
|---|---|---|
| `jobId` | `string` | Unique job identifier. Pass this to workers. |
| `totalRecords` | `number` | Total records found in the source. |
| `totalBatches` | `number` | Number of batches created. |

### `DistributedBatchResult`

| Field | Type | Description |
|---|---|---|
| `claimed` | `boolean` | Whether a batch was successfully claimed. `false` means no batches remain. |
| `batchId` | `string?` | ID of the batch that was processed. |
| `batchIndex` | `number?` | Index of the batch that was processed. |
| `processedCount` | `number` | Records successfully processed in this batch. |
| `failedCount` | `number` | Records that failed in this batch. |
| `jobComplete` | `boolean` | `true` if this worker finalized the entire job. |
| `jobId` | `string` | The job identifier. |
210
+
211
+ If a worker crashes or times out, its claimed batch becomes "stale". The next `processWorkerBatch()` call automatically reclaims stale batches (based on `staleBatchTimeoutMs`) before claiming new ones.
212
+
213
+ **Requirements:**
214
+
215
+ - Your **processor callback must be idempotent**. If a batch is re-processed after a crash, records may be sent to the processor again.
216
+ - Use `ON CONFLICT DO NOTHING` / `INSERT ... IGNORE` or similar patterns in your database writes.
217
+
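For Postgres, an idempotent processor in the spirit of the bullets above might look like this (the `db` client shape and the `users` table are illustrative, not part of this package):

```typescript
// Illustrative idempotent processor: re-delivery of the same record after a
// crash is harmless, because the conflicting insert is silently skipped.
const UPSERT_SQL =
  'INSERT INTO users (email, name, role) VALUES ($1, $2, $3) ' +
  'ON CONFLICT (email) DO NOTHING';

// `db` stands in for any Postgres client exposing query(sql, params).
async function idempotentProcessor(
  db: { query: (sql: string, params: unknown[]) => Promise<unknown> },
  record: { email: string; name: string; role: string },
): Promise<void> {
  await db.query(UPSERT_SQL, [record.email, record.name, record.role]);
}
```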
You can also manually reclaim stale batches:

```typescript
import { isDistributedStateStore } from '@batchactions/distributed';

if (isDistributedStateStore(stateStore)) {
  const reclaimed = await stateStore.reclaimStaleBatches(jobId, 60_000); // 1 min timeout
  console.log(`Reclaimed ${reclaimed} stale batches`);
}
```
## Events

Each worker has its own local event bus. Subscribe to events for logging, metrics, or progress tracking:

```typescript
di.on('batch:claimed', (e) => {
  console.log(`Worker claimed batch ${e.batchIndex} of job ${e.jobId}`);
});

di.on('record:failed', (e) => {
  console.error(`Record ${e.recordIndex} failed: ${e.error}`);
});

di.on('import:completed', (e) => {
  // Only emitted by the worker that finalizes the job
  console.log(`Job complete! ${e.summary.processed} processed, ${e.summary.failed} failed`);
});

// Forward all events (e.g. to CloudWatch, Datadog)
di.onAny((event) => {
  metrics.emit(event.type, event);
});
```

**Note:** `import:completed` is emitted only by the worker that finalizes the job (exactly once).
## Architecture

```
@batchactions/distributed
├── DistributedImport.ts           # Facade (composition root)
├── PrepareDistributedImport.ts    # Phase 1 use case
├── ProcessDistributedBatch.ts     # Phase 2 use case
└── index.ts                       # Public API

Depends on:
└── @batchactions/core
    ├── DistributedStateStore      # Port interface (extended StateStore)
    ├── BatchReservation           # Domain types
    ├── SchemaValidator            # Validation pipeline
    └── EventBus                   # Event system

Implemented by:
└── @batchactions/state-sequelize
    └── SequelizeStateStore        # Concrete DistributedStateStore
        ├── bulkimport_jobs        # Job state table
        ├── bulkimport_records     # Record data table
        └── bulkimport_batches     # Batch metadata table (distributed)
```
## Implementing a Custom `DistributedStateStore`

If you don't use Sequelize, you can implement the `DistributedStateStore` interface:

```typescript
import type { DistributedStateStore, ClaimBatchResult, DistributedJobStatus, ProcessedRecord } from '@batchactions/distributed';

class MyDistributedStore implements DistributedStateStore {
  // ... all StateStore methods, plus:

  async claimBatch(jobId: string, workerId: string): Promise<ClaimBatchResult> {
    // Atomic: find the first PENDING batch, set it to PROCESSING with workerId
    // Use SELECT FOR UPDATE SKIP LOCKED or similar
  }

  async releaseBatch(jobId: string, batchId: string, workerId: string): Promise<void> {
    // Reset the batch to PENDING (only if claimed by this worker)
  }

  async reclaimStaleBatches(jobId: string, timeoutMs: number): Promise<number> {
    // Find PROCESSING batches with claimedAt older than the timeout, reset to PENDING
  }

  async saveBatchRecords(jobId: string, batchId: string, records: readonly ProcessedRecord[]): Promise<void> {
    // Bulk insert records for a batch
  }

  async getBatchRecords(jobId: string, batchId: string): Promise<readonly ProcessedRecord[]> {
    // Load all records for a batch
  }

  async getDistributedStatus(jobId: string): Promise<DistributedJobStatus> {
    // Aggregate: count batches by status
  }

  async tryFinalizeJob(jobId: string): Promise<boolean> {
    // Atomic: if all batches are terminal, set the job to COMPLETED/FAILED
    // Return true if THIS call finalized it (exactly-once)
  }
}
```
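For reference, the exactly-once behaviour of `tryFinalizeJob()` can come from a single conditional `UPDATE`. This Postgres-flavoured sketch assumes job and batch tables shaped roughly like the ones `@batchactions/state-sequelize` creates; the column names (`finalized_at`, `claimedAt`-style timestamps) are illustrative assumptions:

```typescript
// Only one concurrent caller can flip the job row out of PROCESSING, because
// the WHERE clause makes the status transition conditional. Column names here
// are illustrative, not the actual state-sequelize schema.
const FINALIZE_SQL = `
  UPDATE bulkimport_jobs
     SET status = 'COMPLETED', finalized_at = NOW()
   WHERE id = $1
     AND status = 'PROCESSING'
     AND NOT EXISTS (
       SELECT 1 FROM bulkimport_batches
        WHERE job_id = $1
          AND status NOT IN ('COMPLETED', 'FAILED')
     )`;

async function tryFinalizeJob(
  db: { query: (sql: string, params: unknown[]) => Promise<{ rowCount: number }> },
  jobId: string,
): Promise<boolean> {
  const result = await db.query(FINALIZE_SQL, [jobId]);
  return result.rowCount === 1; // true only for the call that actually updated the row
}
```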
## Requirements

- Node.js >= 20.0.0
- `@batchactions/core` >= 0.4.0
- A `DistributedStateStore` implementation (e.g. `@batchactions/state-sequelize` >= 0.1.2)

## License

[MIT](../../LICENSE)