@thru/replay 0.1.36
- package/README.md +143 -0
- package/dist/index.cjs +1196 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +490 -0
- package/dist/index.d.ts +490 -0
- package/dist/index.mjs +1170 -0
- package/dist/index.mjs.map +1 -0
- package/package.json +45 -0
package/README.md
ADDED
|
@@ -0,0 +1,143 @@
# @thru/etl-replay

High-throughput historical replay for the thru-net blockchain stack. This package backfills blocks, transactions, and events via the `QueryService` (`List*` RPCs) and then hands off to the realtime `StreamingService` (`Stream*` RPCs) without gaps or duplicates. It powers ETL and analytics sinks that need a single ordered feed even when the node is millions of slots behind tip.

```
            ┌─────────────┐     paginated history     ┌──────────────┐
Chain RPC ─►│ List* APIs  ├──────────────────────────►│ Backfill loop│
            └─────────────┘                           └──────┬───────┘
                                                  (BUFFERING+BACKFILLING)
            ┌─────────────┐     live async stream     ┌──────▼───────┐
Chain RPC ─►│ Stream* APIs├──────────────────────────►│   LivePump   │
            └─────────────┘                           └──────┬───────┘
                                                        (SWITCHING)
                                deduped, ordered       ┌─────▼──────┐
                                async iterable         │ReplayStream│
                                                       └─────┬──────┘
                                                             │ (STREAMING)
                                                             ▼
                                                      Async consumer
```

## Capabilities

- Gapless replay for **blocks, transactions, and events** with resource-specific factories (`createBlockReplay`, `createTransactionReplay`, `createEventReplay`).
- **Four-phase state machine** (`BUFFERING → BACKFILLING → SWITCHING → STREAMING`) that deterministically merges historical and live data.
- **Safety margin & overlap management:** configurable `safetyMargin` keeps a guard band between historical slots and the earliest slot seen on the live stream so the switchover never emits future data twice.
- **Per-item deduplication** via customizable `extractKey` functions so multiple transactions/events in one slot are preserved while duplicates caused by overlap or reconnects are discarded.
- **Automatic live stream retries:** `ReplayStream` reconnects with the latest emitted slot, drains buffered data, and resumes transparently after errors or server-side EOF.
- **Structured metrics and logging:** `getMetrics()` exposes counts for emitted backfill vs live records, buffered overlap, and discarded duplicates, while pluggable `ReplayLogger` implementations (default `NOOP_LOGGER`, optional console logger) keep observability consistent across deployments.
- **ConnectRPC client wrapper (`ChainClient`)** that centralizes TLS, headers, interceptors, and transport reuse for both query and streaming services.
- **Deterministic test harness** (`SimulatedChain`, `SimulatedTransactionSource`) plus Vitest specs to validate deduplication, switching, and reconnect logic.

## Architecture Overview

| Layer | Responsibility | Key Files |
| --- | --- | --- |
| Entry Points | Resource-specific factories configure pagination, filters, and live subscribers for each data type. | `src/replay/block-replay.ts`, `src/replay/transaction-replay.ts`, `src/replay/event-replay.ts` |
| Replay State Machine | Coordinates backfill/livestream phases, metrics, retries, and dedup. | `src/replay-stream.ts` |
| Live Ingestion | Buffers live data, exposes overlap bounds, and feeds an async queue once streaming. | `src/live-pump.ts`, `src/async-queue.ts` |
| Deduplication | Slot/key-aware buffer that keeps the overlap window sorted and bounded. | `src/dedup-buffer.ts` |
| Connectivity | ConnectRPC wiring for Query/Streaming services, header interceptors, and transport configuration. | `src/chain-client.ts` |
| Testing Utilities | In-memory block/transaction sources that emulate pagination and streaming semantics. | `src/testing/*.ts` |

### Replay Lifecycle

1. **BUFFERING** – `LivePump` subscribes to `Stream*` immediately, buffering every item in a sorted dedup buffer and tracking the min/max slot observed.
2. **BACKFILLING** – `ReplayStream` pages through `List*` RPCs (default `orderBy = "slot asc"`). Each item is sorted, deduped against the last emitted slot+key, yielded to consumers, and used to advance `currentSlot`. After each page we prune buffered live items `<= currentSlot` so memory use stays proportional to the safety margin.
3. **SWITCHING** – When `currentSlot >= maxStreamSlot - safetyMargin` (or the server signals no more history), we invoke `livePump.enableStreaming(currentSlot)`, discard overlap, drain remaining buffered data in ascending order, and mark the pump as streaming-only.
4. **STREAMING** – The replay now awaits `livePump.next()` forever, emitting live data as soon as the async queue resolves. Failures trigger `safeClose` and a resubscription at `currentSlot`, immediately enabling streaming mode so reconnects do not block.
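
The handover decision in step 3 can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the package's implementation: `Phase`, `HandoverInput`, `shouldSwitch`, and `nextPhase` are hypothetical names, and treating "nothing seen on the live stream yet" as "keep backfilling" is an assumption.

```ts
// Hypothetical sketch of the BACKFILLING -> SWITCHING decision.
// Names and shapes are illustrative; the real ReplayStream internals may differ.
type Phase = "BUFFERING" | "BACKFILLING" | "SWITCHING" | "STREAMING";

interface HandoverInput {
  currentSlot: bigint;          // last slot emitted from backfill
  maxStreamSlot: bigint | null; // highest slot buffered from the live stream, if any
  safetyMargin: bigint;         // guard band, e.g. 32n for blocks
  historyExhausted: boolean;    // the server signalled that no more pages exist
}

function shouldSwitch(input: HandoverInput): boolean {
  if (input.historyExhausted) return true;
  // Assumption: without any live data there is no threshold to compare against.
  if (input.maxStreamSlot === null) return false;
  return input.currentSlot >= input.maxStreamSlot - input.safetyMargin;
}

function nextPhase(phase: Phase, input: HandoverInput): Phase {
  if (phase === "BUFFERING") return "BACKFILLING";
  if (phase === "BACKFILLING") return shouldSwitch(input) ? "SWITCHING" : "BACKFILLING";
  // SWITCHING drains the buffer once; STREAMING is terminal.
  return "STREAMING";
}
```

With `maxStreamSlot = 200n` and `safetyMargin = 64n`, the sketch keeps backfilling until `currentSlot` reaches `136n`.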

### Core Data Structures

- **`ReplayStream<T>`** – generic async iterable that accepts `fetchBackfill`, `subscribeLive`, `extractSlot`, `extractKey`, and `safetyMargin`. It also exposes metrics and optional `resubscribeOnEnd` control.
- **`LivePump<T>`** – wraps any async iterable, buffering until `enableStreaming()` is called. It records `minSlot()`/`maxSlot()` to guide the handover threshold, and enforces an `emitFloor` so late-arriving historical slots from the live stream are dropped quietly.
- **`DedupBuffer<T>`** – multi-map keyed by slot + user-provided key, with binary search insertion, `discardUpTo`, `drainAbove`, and `drainAll` helpers. This lets transaction/event replays keep multiple records per slot while still pruning overlap aggressively.
- **`AsyncQueue<T>`** – minimal async iterator queue that handles back-pressure and clean shutdown/failure propagation between the live pump and replay consumer.
- **`ChainClient`** – lazily builds a Connect transport (HTTP/2 by default), handles API keys/user agents via interceptors, and exposes typed wrappers for `list/stream` RPC pairs plus `getHeight`.
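
To make the slot + key buffering idea concrete, here is a simplified stand-in for a dedup buffer. It is a sketch only: `SketchDedupBuffer` and `Entry` are hypothetical names, the method semantics are assumptions inferred from the descriptions above, and the real `DedupBuffer<T>` may be structured differently.

```ts
// Illustrative only: a simplified slot+key buffer, not the package's DedupBuffer.
interface Entry<T> { slot: bigint; key: string; value: T }

class SketchDedupBuffer<T> {
  private items: Entry<T>[] = [];
  private seen = new Set<string>();
  private readonly extractSlot: (item: T) => bigint;
  private readonly extractKey: (item: T) => string;

  constructor(extractSlot: (item: T) => bigint, extractKey: (item: T) => string) {
    this.extractSlot = extractSlot;
    this.extractKey = extractKey;
  }

  /** Inserts in slot order; returns false when the slot+key pair is already buffered. */
  insert(value: T): boolean {
    const slot = this.extractSlot(value);
    const key = this.extractKey(value);
    const id = `${slot}:${key}`;
    if (this.seen.has(id)) return false;
    this.seen.add(id);
    // Binary search for the first index whose slot is greater than `slot`,
    // so items that share a slot keep their arrival order.
    let lo = 0;
    let hi = this.items.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.items[mid].slot <= slot) lo = mid + 1;
      else hi = mid;
    }
    this.items.splice(lo, 0, { slot, key, value });
    return true;
  }

  /** Drops every buffered item with slot <= `slot`; returns how many were dropped. */
  discardUpTo(slot: bigint): number {
    let n = 0;
    while (n < this.items.length && this.items[n].slot <= slot) n++;
    for (const it of this.items.splice(0, n)) this.seen.delete(`${it.slot}:${it.key}`);
    return n;
  }

  /** Removes and returns the items with slot > `slot`, in ascending slot order. */
  drainAbove(slot: bigint): T[] {
    this.discardUpTo(slot);
    const out = this.items.map((it) => it.value);
    this.items = [];
    this.seen.clear();
    return out;
  }

  get size(): number {
    return this.items.length;
  }
}
```

Keying on the pair rather than on the slot alone is what allows several transactions or events from one slot to coexist while a repeated delivery of the same record is rejected.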

## Operational Behavior & Configuration

| Option | Location | Purpose |
| --- | --- | --- |
| `startSlot` | All replay factories | First slot to include in the backfill; also the minimum slot for the live subscriber. |
| `safetyMargin` | `ReplayStream` (`32n` for blocks, `64n` for tx/events by default) | Buffer of slots that must exist between the latest backfill slot and the earliest live slot before switching. |
| `pageSize` | Resource factories | Number of records to request per `List*` page. |
| `filter` | Resource factories | CEL expression merged with the internally generated `slot >= uint(startSlot)` predicate to ensure consistent ordering/resume behavior. |
| `view`, `minConsensus`, `returnEvents` | Block/tx factories | Mirror Thru RPC query flags so callers can trade fidelity for throughput. |
| `resubscribeOnEnd` | `ReplayStream` | If `false`, the iterable ends when the server closes the live stream instead of reconnecting. |
| `logger` | Any factory | Plug in structured logging (e.g., `createConsoleLogger("Blocks")`). |
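
As a rough illustration of how a caller-supplied `filter` might be combined with the generated slot predicate, the sketch below joins the two as CEL strings. `mergeSlotFilter` is a hypothetical helper written for this example; the package's actual merge logic (including any handling of filter parameters) may differ, and the field names in the usage note are made up.

```ts
// Hypothetical helper: combines the slot predicate with an optional CEL filter.
// This is an assumption about the merge shape, not the package's real code.
function mergeSlotFilter(startSlot: bigint, userFilter?: string): string {
  const slotPredicate = `slot >= uint(${startSlot})`;
  const trimmed = userFilter?.trim();
  if (!trimmed) return slotPredicate;
  // Parenthesize the caller's expression so an `||` inside it cannot
  // bypass the slot predicate through operator precedence.
  return `${slotPredicate} && (${trimmed})`;
}
```

For instance, `mergeSlotFilter(5n, "a == 1 || b == 2")` yields `slot >= uint(5) && (a == 1 || b == 2)`, where `a` and `b` are placeholder field names.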

`ReplayStream` automatically:

- Tracks `metrics.bufferedItems`, `emittedBackfill`, `emittedLive`, and `discardedDuplicates`. The metrics snapshot is immutable so callers can periodically poll without worrying about concurrent mutation.
- Deduplicates both during backfill and streaming via `extractKey`. Blocks default to slot-based keys; transactions prefer the signature (fallback to slot+blockOffset); events use `eventId` or slot+callIdx.
- Retries live streams after any error/EOF with a fixed delay rather than exponential backoff (currently constant `RETRY_DELAY_MS = 1000`, with no cap on the number of retries). Ordering is preserved because the new `LivePump` starts in streaming mode with the previous `currentSlot` as its emit floor.
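
The "prefer the signature, fall back to slot+blockOffset" rule for transaction keys could look roughly like the following. This is a sketch under assumptions: `TxLike`, its field names, and the hex encoding are invented for illustration and do not reflect the package's generated protobuf types.

```ts
// Assumed shape for illustration; the real transaction message may differ.
interface TxLike {
  slot: bigint;
  blockOffset: number;
  signature?: Uint8Array;
}

// Encodes bytes as lowercase hex so the key is a stable string.
function toHex(bytes: Uint8Array): string {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Prefers the signature; falls back to slot + position within the block.
function txKey(tx: TxLike): string {
  if (tx.signature && tx.signature.length > 0) return toHex(tx.signature);
  return `${tx.slot}:${tx.blockOffset}`;
}
```

Whatever key is chosen needs to be stable across the `List*` and `Stream*` responses for the same record, otherwise overlap duplicates would not be detected.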

## Usage

```ts
import {
  ChainClient,
  createBlockReplay,
  createConsoleLogger,
} from "@thru/etl-replay";

const client = new ChainClient({
  baseUrl: "https://rpc.thru.net",
  apiKey: process.env.THRU_API_KEY,
  userAgent: "etl-replay-demo",
});

const blockReplay = createBlockReplay({
  client,
  startSlot: 1_000_000n,
  safetyMargin: 64n,
  pageSize: 256,
  logger: createConsoleLogger("BlockReplay"),
  filter: undefined, // optional CEL filter merged with slot predicate
});

for await (const block of blockReplay) {
  // Persist, transform, or forward each block.
  console.log("slot", block.header?.slot?.toString());
}
```

Switching to transactions or events only changes the factory import plus any resource-specific options. `ReplayStream` itself is generic, so advanced integrations can wire custom fetch/subscription functions (e.g., for account data) as long as they abide by the `ReplayConfig` contract.

## Building, Testing, and Regenerating Protos

```bash
pnpm install                 # install dependencies
pnpm run build               # tsup -> dist/index.{cjs,mjs,d.ts}
pnpm test                    # vitest, uses simulated sources

# When upstream proto definitions change
pnpm run protobufs:pull      # copies repo-wide proto/ into this package
pnpm run protobufs:generate  # buf generate -> src/proto/
```

- The package ships dual entry points (`dist/index.mjs` + `dist/index.cjs`) generated by `tsup` and targets Node.js ≥ 18.
- Generated files live under `src/gen/` and are kept out of version control elsewhere in the monorepo; avoid manual edits.
- Scripts assume workspace-relative `proto/` roots; adjust `protobufs:pull` if the directory layout changes.

## Limitations & Future Considerations

- **Single-chain, single-resource instances:** each replay handles one RPC resource (blocks, transactions, or events). Multi-resource ETL must run multiple iterables side-by-side and coordinate downstream ordering.
- **In-memory buffering:** overlap data is kept in process memory; extremely wide safety margins or multi-million slot gaps can increase memory pressure even though `discardBufferedUpTo` keeps it bounded to roughly the safety window. Persisted buffers/checkpointing are not implemented.
- **No batching/parallelization on the consumer side:** the async iterator yields one item at a time. Downstream batching must be implemented by the caller to avoid per-record I/O overhead.
- **Deterministic ordering requires the backend to honor `orderBy = "slot asc"` and CEL slot predicates.** Misconfigured RPC nodes that return unsorted pages will still be sorted locally, but cursor semantics (and thus throughput) degrade.
- **Retry policy is fixed (`1s` delay, infinite retries).** Environments that need exponential backoff or max retry counts should wrap the iterable and stop when necessary.
- **Filtering is limited to CEL expressions accepted by the Thru RPC API.** Compound filters are merged as string expressions; callers must avoid conflicting parameter names.
- **No built-in metrics export.** `getMetrics()` exposes counters, but exporting them to Prometheus/StatsD/etc. is left to the host application.
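
Since consumer-side batching is left to the caller, one possible approach is a small generic wrapper around any async iterable. The `batch` helper below is a hypothetical example and not part of the package; it only groups by count.

```ts
// Hypothetical caller-side helper: groups items from any async iterable
// into arrays of up to `size` elements.
async function* batch<T>(source: AsyncIterable<T>, size: number): AsyncGenerator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let pending: T[] = [];
  for await (const item of source) {
    pending.push(item);
    if (pending.length >= size) {
      yield pending;
      pending = [];
    }
  }
  // Flush the tail if the source ever ends (e.g. resubscribeOnEnd is false).
  if (pending.length > 0) yield pending;
}
```

A replay could then be consumed as `for await (const blocks of batch(blockReplay, 100)) { ... }`. Note that a count-only batcher can hold a partial batch for a long time once the replay reaches the live tip, so a production version would likely also flush on a timer.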

## Repository Reference

- `REPLAY_GUIDE.md` – deep dive through every module (recommended read for contributors).
- `REPLAY_ISSUES.md` – historical correctness issues and the fixes applied (handy for regression context).
- `DEV_PLAN.md` – original development milestones; useful for understanding remaining roadmap items.
- `scripts/` – helper scripts for running the replay against staging/mainnet endpoints.
- `dist/` – build output from `tsup`.

With this README plus the in-repo guides, you should have everything you need to operate, extend, or debug the replay pipeline with confidence.