@harperfast/harper-pro 5.0.16 → 5.0.18

Files changed (49)
  1. package/core/DESIGN.md +32 -0
  2. package/core/bin/copyDb.ts +19 -0
  3. package/core/resources/RecordEncoder.ts +15 -12
  4. package/core/resources/RocksTransactionLogStore.ts +51 -0
  5. package/core/resources/Table.ts +17 -15
  6. package/core/resources/auditStore.ts +97 -5
  7. package/core/resources/databases.ts +67 -7
  8. package/core/resources/replayLogs.ts +36 -3
  9. package/core/resources/replayLogsGuards.ts +42 -0
  10. package/core/resources/transactionBroadcast.ts +121 -66
  11. package/dist/cloneNode/cloneNode.js +13 -8
  12. package/dist/cloneNode/cloneNode.js.map +1 -1
  13. package/dist/core/bin/copyDb.js +16 -0
  14. package/dist/core/bin/copyDb.js.map +1 -1
  15. package/dist/core/resources/RecordEncoder.js +1 -1
  16. package/dist/core/resources/RecordEncoder.js.map +1 -1
  17. package/dist/core/resources/RocksTransactionLogStore.js.map +1 -1
  18. package/dist/core/resources/Table.js +17 -17
  19. package/dist/core/resources/Table.js.map +1 -1
  20. package/dist/core/resources/auditStore.js +82 -5
  21. package/dist/core/resources/auditStore.js.map +1 -1
  22. package/dist/core/resources/databases.js +68 -5
  23. package/dist/core/resources/databases.js.map +1 -1
  24. package/dist/core/resources/replayLogs.js +26 -2
  25. package/dist/core/resources/replayLogs.js.map +1 -1
  26. package/dist/core/resources/replayLogsGuards.js +43 -0
  27. package/dist/core/resources/replayLogsGuards.js.map +1 -0
  28. package/dist/core/resources/transactionBroadcast.js +129 -71
  29. package/dist/core/resources/transactionBroadcast.js.map +1 -1
  30. package/dist/replication/replicationConnection.js +174 -48
  31. package/dist/replication/replicationConnection.js.map +1 -1
  32. package/dist/replication/replicator.js +11 -2
  33. package/dist/replication/replicator.js.map +1 -1
  34. package/dist/replication/subscriptionManager.js +11 -1
  35. package/dist/replication/subscriptionManager.js.map +1 -1
  36. package/npm-shrinkwrap.json +2 -2
  37. package/package.json +1 -1
  38. package/replication/replicationConnection.ts +176 -55
  39. package/replication/replicator.ts +11 -2
  40. package/replication/subscriptionManager.ts +11 -1
  41. package/studio/web/assets/{index-pr02wSIB.js → index-Tv7e9k8K.js} +5 -5
  42. package/studio/web/assets/{index-pr02wSIB.js.map → index-Tv7e9k8K.js.map} +1 -1
  43. package/studio/web/assets/{index.lazy-CorGZz3L.js → index.lazy-De4JGuec.js} +2 -2
  44. package/studio/web/assets/{index.lazy-CorGZz3L.js.map → index.lazy-De4JGuec.js.map} +1 -1
  45. package/studio/web/assets/{profile-SSvkzt9H.js → profile-voeNsl4C.js} +2 -2
  46. package/studio/web/assets/{profile-SSvkzt9H.js.map → profile-voeNsl4C.js.map} +1 -1
  47. package/studio/web/assets/{status-Xk93QrPQ.js → status-110CCE-v.js} +2 -2
  48. package/studio/web/assets/{status-Xk93QrPQ.js.map → status-110CCE-v.js.map} +1 -1
  49. package/studio/web/index.html +1 -1
package/core/DESIGN.md CHANGED
@@ -31,3 +31,35 @@ The mitigations live in three places:
31
31
  - `LMDBTransaction.abort` and `DatabaseTransaction.abort` walk all writes and run the same cleanup unconditionally (regardless of `skipped`), since nothing was committed. `DatabaseTransaction.commit` adds an explicit reject handler so a `Promise.all` failure on `completions` (e.g. a blob save errored) aborts the underlying transaction instead of leaking it _and_ the blob files.
32
32
 
33
33
  When adding a new commit-handler early-return path: reset `write.skipped = false` at the top of the handler if you don't already, then set `write.skipped = true` immediately before the `return`. Decide first whether the audit log will reference the blob (via `auditRecordToStore`) — if it does, leave `skipped` unset. `cleanupOrphans` is the periodic safety net; don't rely on it for transactional correctness.
34
+
35
+ ## Schema migration and `runIndexing` internals (`databases.ts`)
36
+
37
+ When `table()` is called with an attribute newly marked `indexed: true` (or with any change that requires re-building the secondary index), `runIndexing` is launched asynchronously and `Table.indexingOperation` is set to its promise. While running:
38
+
39
+ **In-flight state tracking (persisted to `attributesDbi`):**
40
+
41
+ - `attribute.indexingPID = process.pid` — set at migration start; cleared on clean completion. On restart with a different PID, `indexingPID !== process.pid` triggers a re-migration.
42
+ - `attribute.lastIndexedKey` — updated every 100 records as a resumable checkpoint. Cleared on clean completion; preserved on error so a retry starts from this key.
43
+ - `attribute.indexingFailed = true` — set if any record's `index.put` errors during the backfill. `table()` checks this flag: a fresh call in the same or a new process re-triggers the backfill from `lastIndexedKey`.
44
+ - `dbi.isIndexing = true` — in-memory flag on the index dbi. Prevents `searchByIndex` from serving partial results (returns 503 "not indexed yet" instead). Cleared only when backfill completes cleanly.
45
+
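Illustrative aside (not part of the package): the in-flight flags above feed a single "should the backfill run again?" decision in `table()`. The sketch below restates that decision as a pure function. The `AttributeDescriptor` interface and the `needsBackfill` name are hypothetical; only the field names come from the design notes, and the PID and restart number are passed in rather than read from `process.pid` or `workerData`.

```typescript
// Hypothetical descriptor shape: only the fields named in DESIGN.md.
interface AttributeDescriptor {
	indexingPID?: number;
	indexingFailed?: boolean;
	lastIndexedKey?: unknown;
	restartNumber?: number;
}

// Returns true when the backfill should be (re)started for this attribute.
function needsBackfill(
	descriptor: AttributeDescriptor,
	changed: boolean,
	currentPid: number,
	currentRestartNumber?: number
): boolean {
	// the schema changed, so the index has to be rebuilt
	if (changed) return true;
	// a previous run recorded put errors and left the failure marker behind
	if (descriptor.indexingFailed) return true;
	// another process started the migration and never cleared its PID
	if (descriptor.indexingPID && descriptor.indexingPID !== currentPid) return true;
	// the descriptor predates the current restart
	return (
		descriptor.restartNumber !== undefined &&
		currentRestartNumber !== undefined &&
		descriptor.restartNumber < currentRestartNumber
	);
}
```

A descriptor whose `indexingPID` equals the current PID and has no failure flag is treated as "this process is already indexing", so no second backfill is started.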
46
+ **`isIndexing` propagation across `resetDatabases()` calls:**
47
+ When `signalSchemaChange('schema-change')` fires at the start of `runIndexing`, `syncSchemaMetadata` calls `resetDatabases()` which re-opens all tables via `table()`. This creates a _new_ dbi object and assigns it to `Table.indices[attribute.name]`. The condition `if (attributeDescriptor?.indexingPID) dbi.isIndexing = true` (just before `indices[name] = dbi` in the migration-detection block) ensures any dbi created while a migration is in progress also has `isIndexing = true`. Without this, a concurrent `resetDatabases()` would replace the in-progress dbi with a fresh one where `isIndexing` is false, allowing queries to read partial index results.
48
+
49
+ **Error handling:**
50
+
51
+ - Per-record sync errors: caught by the inner try-catch. Set `hadIndexingErrors = true`.
52
+ - Per-record async rejections (`index.put` returning a rejected Promise): caught by the `when()` error handler. Set `hadIndexingErrors = true`.
53
+ - The final `await lastResolution` is wrapped in its own try-catch because if the very last put in the loop was rejected, an unguarded `await lastResolution` would throw past the `hadIndexingErrors` check to the outer catch, silently bypassing the error path.
54
+ - On any error: `indexingFailed = true` is persisted; `indexingPID`, `isIndexing`, and `lastIndexedKey` are kept. This leaves the index in 503 "incomplete" state rather than silently serving partial results.
55
+
56
+ **`Object.defineProperty(attribute, 'dbi', ...)` must use `configurable: true`:**
57
+ `attribute.dbi` is defined as a non-enumerable property (to prevent serialization to `attributesDbi`). It is defined with `configurable: true` so it can be re-assigned if the attribute participates in a retry cycle in the same process.
58
+
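Illustrative aside (not part of the package): the `configurable: true` requirement follows from standard `Object.defineProperty` semantics. The sketch below uses plain objects as stand-ins for the attribute and the dbi handles, and `JSON.stringify` as a stand-in for whatever serializer writes to `attributesDbi`; those stand-ins are assumptions made for the example.

```typescript
// Stand-ins for an attribute record and two index handles (hypothetical shapes).
const attribute: { name: string; dbi?: unknown } = { name: 'category' };
const firstDbi = { isIndexing: true };
const secondDbi = { isIndexing: true };

// First indexing attempt: attach the handle without making it enumerable.
Object.defineProperty(attribute, 'dbi', { value: firstDbi, configurable: true, enumerable: false });

// Non-enumerable properties are skipped by JSON.stringify, so the handle is not persisted.
const serialized = JSON.stringify(attribute);

// Retry in the same process: redefining works because the property is configurable.
Object.defineProperty(attribute, 'dbi', { value: secondDbi, configurable: true, enumerable: false });

// For contrast: with the default (configurable: false), a second definition with a
// different value is rejected with a TypeError.
const locked: { name: string } = { name: 'category' };
Object.defineProperty(locked, 'dbi', { value: firstDbi });
let redefineError: unknown;
try {
	Object.defineProperty(locked, 'dbi', { value: secondDbi });
} catch (error) {
	redefineError = error;
}
```

The `locked` case mirrors the pre-change call `Object.defineProperty(attribute, 'dbi', { value: dbi })`, which is why a same-process retry needs the configurable form.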
59
+ ## Audit-store `'committed'` notification batching (`transactionBroadcast.ts`)
60
+
61
+ The cross-thread subscription path (default `crossThreads`) drives every `Table.subscribe()` consumer. When the database's audit store emits `'committed'`, we walk the audit log via a reusable iterator and dispatch matching records to subscribers. Three properties of this path are easy to break and worth knowing about before changing it:
62
+
63
+ - **`databaseSubscriptions.activeCount`** is the count of live `Subscription` instances on a database. It is incremented at the end of `addSubscription` (after the Subscription is created, so the `scope: 'full-database'` early-return path correctly skips counting) and decremented in `Subscription.end()`. `notifyFromTransactionData` short-circuits when this is zero — the reusable rocksdb iterator stays put and resumes from its position the next time a subscriber arrives. Without this short-circuit, an idle database with no subscribers still pays the audit-log iteration cost on every commit during replication backlog catch-up.
64
+ - **`notifyScheduled` + `setImmediate`** in the `'committed'` listener defers the iteration off the commit microtask. Multiple `'committed'` events that land in the same event-loop turn collapse into one notify pass. `notifyScheduled` stays set for the entire drain — including across yield-and-resume turns — so a re-entry from a new `'committed'` event cannot spawn a second concurrent notify on the same iterator.
65
+ - **Batched yielding** in `notifyFromTransactionData` (`NOTIFY_BATCH_SIZE`) is gated by `allowYield`. The `'committed'` path passes `allowYield = true`; the `listenToCommits` (same-thread `aftercommit`) path does not, because that path holds an inter-thread `'thread-local-writes'` lock that must not span event-loop turns. `subscribersWithTxns` is carried across yields via `subscriptions.pendingTxnSubscribers` so the `end_txn` signal fires exactly once when the iterator truly drains. When `activeCount` drops to zero mid-yield, the next continuation drops the carry-over to avoid invoking ended subscribers' listeners.
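Illustrative aside (not part of the package): the `notifyScheduled` coalescing described above can be sketched in a few lines. `createCoalescedNotifier`, `drain`, and `schedule` are hypothetical names; `schedule` stands in for `setImmediate`, and `drain` stands in for the audit-log walk. This simplified version drains synchronously, so it does not model the yield-and-resume behavior or the `activeCount` short-circuit.

```typescript
type Schedule = (task: () => void) => void;

// Returns a listener for a 'committed'-style event. However many events arrive
// before the scheduled task runs, only one drain pass is queued.
function createCoalescedNotifier(drain: () => void, schedule: Schedule): () => void {
	let notifyScheduled = false;
	return function onCommitted(): void {
		// a pass is already pending or running; this event is covered by it
		if (notifyScheduled) return;
		notifyScheduled = true;
		schedule(() => {
			try {
				drain();
			} finally {
				// cleared only once the pass has finished
				notifyScheduled = false;
			}
		});
	};
}
```

In Node.js the listener could be created as `createCoalescedNotifier(walkAuditLog, setImmediate)`, where `walkAuditLog` is again a placeholder. Passing the scheduler in keeps the sketch free of runtime-specific globals and makes the coalescing easy to exercise with a manual queue.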
package/core/bin/copyDb.ts CHANGED
@@ -385,6 +385,16 @@ async function copyDbToRocks(sourceRootStore, sourceDatabase: string, targetPath
385
385
  name: INTERNAL_DBIS_NAME,
386
386
  });
387
387
 
388
+ const STRUCTURES_KEY = Symbol.for('structures');
389
+ const copyStructures = (sourceDbi, storeName: string) => {
390
+ const buffer = sourceDbi.getBinary?.(STRUCTURES_KEY);
391
+ if (buffer) {
392
+ targetRootStore.putSync([STRUCTURES_KEY, storeName], asBinary(buffer));
393
+ }
394
+ };
395
+
396
+ copyStructures(sourceDbisDb, INTERNAL_DBIS_NAME);
397
+
388
398
  let written;
389
399
  let outstandingWrites = 0;
390
400
  const transaction = sourceDbisDb.useReadTransaction();
@@ -414,6 +424,8 @@ async function copyDbToRocks(sourceRootStore, sourceDatabase: string, targetPath
414
424
  existingEncoder.getStructures = tempEncoder.getStructures;
415
425
  }
416
426
 
427
+ copyStructures(sourceDbi, key);
428
+
417
429
  console.log('migrating', key, 'from', sourceDatabase, 'to RocksDB');
418
430
  await copyDbiToRocks(sourceDbi, targetDbi, isPrimary, transaction);
419
431
  }
@@ -457,6 +469,10 @@ async function copyDbToRocks(sourceRootStore, sourceDatabase: string, targetPath
457
469
  } of sourceDbi.getRange({ start, transaction, versions: true })) {
458
470
  try {
459
471
  start = key;
472
+ if (typeof key === 'symbol') {
473
+ skippedRecord++;
474
+ continue;
475
+ }
460
476
  if (value == null) {
461
477
  skippedRecord++;
462
478
  continue;
@@ -497,6 +513,9 @@ async function copyDbToRocks(sourceRootStore, sourceDatabase: string, targetPath
497
513
  for (const { key, value } of sourceDbi.getRange({ start, transaction })) {
498
514
  try {
499
515
  start = key;
516
+ if (typeof key === 'symbol') {
517
+ continue;
518
+ }
500
519
  written = targetDbi.put(key, value);
501
520
  recordsCopied++;
502
521
  if (transaction.openTimer) transaction.openTimer = 0;
package/core/resources/RecordEncoder.ts CHANGED
@@ -195,20 +195,23 @@ export class RecordEncoder extends Encoder {
195
195
  const superGetStructures = this.getStructures;
196
196
  this.saveStructures = function (structures, isCompatible): boolean | undefined {
197
197
  if (this.isRocksDB) {
198
- return this.rootStore.transactionSync((txn) => {
199
- const sharedStructuresKey = [Symbol.for('structures'), this.name];
200
- const existingStructuresBuffer = txn.getBinarySync(sharedStructuresKey);
201
- const existingStructures = existingStructuresBuffer ? this.decode(existingStructuresBuffer) : undefined;
202
- if (typeof isCompatible == 'function') {
203
- if (!isCompatible(existingStructures)) {
198
+ return this.rootStore.transactionSync(
199
+ (txn) => {
200
+ const sharedStructuresKey = [Symbol.for('structures'), this.name];
201
+ const existingStructuresBuffer = txn.getBinarySync(sharedStructuresKey);
202
+ const existingStructures = existingStructuresBuffer ? this.decode(existingStructuresBuffer) : undefined;
203
+ if (typeof isCompatible == 'function') {
204
+ if (!isCompatible(existingStructures)) {
205
+ return false;
206
+ }
207
+ } else if (existingStructures && existingStructures.length !== isCompatible) {
204
208
  return false;
205
209
  }
206
- } else if (existingStructures && existingStructures.length !== isCompatible) {
207
- return false;
208
- }
209
- txn.putSync(sharedStructuresKey, structures);
210
- this.structureUpdate = structures;
211
- });
210
+ txn.putSync(sharedStructuresKey, structures);
211
+ this.structureUpdate = structures;
212
+ },
213
+ { retryOnBusy: true }
214
+ );
212
215
  } else {
213
216
  const result = superSaveStructures.call(this, structures, isCompatible);
214
217
  this.structureUpdate = structures;
package/core/resources/RocksTransactionLogStore.ts CHANGED
@@ -5,6 +5,7 @@ import { Decoder, readAuditEntry, ENTRY_DATAVIEW, AuditRecord, createAuditEntry
5
5
  import { isMainThread } from 'node:worker_threads';
6
6
  import { EventEmitter } from 'node:events';
7
7
  import { asBinary } from 'lmdb';
8
+ import * as harperLogger from '../utility/logging/harper_logger.ts';
8
9
 
9
10
  if (!process.env.HARPER_NO_FLUSH_ON_EXIT && isMainThread) {
10
11
  // we want to be able to test log replay
@@ -288,6 +289,7 @@ export class RocksTransactionLogStore extends EventEmitter {
288
289
  iterable.iterate = () => aggregateIterator;
289
290
  }
290
291
  const mappedAggregateIterable = iterable.map(({ timestamp, data, endTxn }: TransactionEntry) => {
292
+ <<<<<<< HEAD
291
293
  const decoder = new Decoder(data.buffer, data.byteOffset, data.byteLength);
292
294
  data.dataView = decoder;
293
295
  // This represents the data that shouldn't be transferred for replication
@@ -311,6 +313,55 @@ export class RocksTransactionLogStore extends EventEmitter {
311
313
  auditRecord.previousVersion = previousVersion;
312
314
  auditRecord.structureVersion = structureVersion & 0x00ffffff;
313
315
  return auditRecord;
316
+ =======
317
+ // Per-entry try/catch: a corrupt rocks prelude (first 4-16 bytes) would otherwise
318
+ // throw a raw `RangeError: Offset is outside the bounds of the DataView` out
319
+ // through `iterable.map`, escape the for-of consumer, and land as an
320
+ // uncaughtException on a later tick — stalling outgoing replication at the
321
+ // failing offset on every catch-up attempt. On error, yield a sentinel record
322
+ // with the timestamp preserved so iteration advances past the bad entry;
323
+ // downstream consumers already skip records with no `tableId`/`type`.
324
+ try {
325
+ const decoder = new Decoder(data.buffer, data.byteOffset, data.byteLength);
326
+ (data as any).dataView = decoder;
327
+ // This represents the data that shouldn't be transferred for replication
328
+ let structureVersion = decoder.getUint32(0);
329
+ let position = 4;
330
+ let previousResidencyId: number;
331
+ let previousVersion: number;
332
+ if (structureVersion & HAS_PREVIOUS_RESIDENCY_ID) {
333
+ previousResidencyId = decoder.getUint32(position);
334
+ position += 4;
335
+ }
336
+ if (structureVersion & HAS_PREVIOUS_VERSION) {
337
+ // does previous residency id and version actually require separate flags?
338
+ previousVersion = decoder.getFloat64(position);
339
+ position += 8;
340
+ }
341
+ const auditRecord = readAuditEntry(data, position, undefined);
342
+ auditRecord.version = timestamp;
343
+ auditRecord.endTxn = endTxn;
344
+ auditRecord.previousResidencyId = previousResidencyId;
345
+ auditRecord.previousVersion = previousVersion;
346
+ auditRecord.structureVersion = structureVersion & 0x00ffffff;
347
+ return auditRecord;
348
+ } catch (error) {
349
+ harperLogger.error('Failed to decode rocks transaction log entry; skipping', error, {
350
+ timestamp,
351
+ byteLength: data?.byteLength,
352
+ });
353
+ return {
354
+ version: timestamp,
355
+ endTxn,
356
+ type: undefined,
357
+ tableId: undefined,
358
+ recordId: undefined,
359
+ getValue: () => undefined,
360
+ getBinaryValue: () => undefined,
361
+ getBinaryRecordId: () => undefined,
362
+ } as unknown as AuditRecord;
363
+ }
364
+ >>>>>>> b84fbbd (fix: skip corrupt audit entries during iteration instead of throwing)
314
365
  });
315
366
  // Add methods to the mapped iterable if we have an aggregate iterator
316
367
  if (aggregateIterator?.addLog) {
package/core/resources/Table.ts CHANGED
@@ -805,23 +805,23 @@ export function makeTable(options) {
805
805
  /**
806
806
  * Set TTL expiration for records in this table. On retrieval, record timestamps are checked for expiration.
807
807
  * This also informs the scheduling for record eviction.
808
- * @param expirationTime Time in seconds until records expire (are stale)
809
- * @param evictionTime Time in seconds until records are evicted (removed)
808
+ * @param opts Time in seconds until records expire, or an options object with `expiration`, `eviction`,
809
+ * and `scanInterval` (all in seconds, all optional). Number form preserves any previously configured
810
+ * eviction/scanInterval; object form replaces all three.
810
811
  */
811
- static setTTLExpiration(expiration: number | { expiration: number; eviction?: number; scanInterval?: number }) {
812
- // we set up a timer to remove expired entries. we only want the timer/reaper to run in one thread,
813
- // so we use the first one
814
- if (typeof expiration === 'number') {
815
- expirationMs = expiration * 1000;
816
- if (!evictionMs) evictionMs = 0; // by default, no extra time for eviction
817
- } else if (expiration && typeof expiration === 'object') {
818
- // an object with expiration times/options specified
819
- expirationMs = expiration.expiration * 1000;
820
- evictionMs = (expiration.eviction || 0) * 1000;
821
- cleanupInterval = expiration.scanInterval * 1000;
822
- } else throw new Error('Invalid expiration value type');
812
+ static setTTLExpiration(opts: number | { expiration?: number; eviction?: number; scanInterval?: number }) {
813
+ if (opts == null || (typeof opts !== 'number' && typeof opts !== 'object'))
814
+ throw new Error('Invalid expiration value type');
815
+ if (typeof opts === 'number') {
816
+ expirationMs = opts * 1000;
817
+ } else {
818
+ // `??` so an explicit 0 is treated as the user's chosen value, not as "missing"
819
+ expirationMs = (opts.expiration ?? 0) * 1000;
820
+ evictionMs = (opts.eviction ?? 0) * 1000;
821
+ cleanupInterval = (opts.scanInterval ?? 0) * 1000;
822
+ }
823
823
  if (expirationMs < 0) throw new Error('Expiration can not be negative');
824
- // default to one quarter of the total eviction time, and make sure it fits into a 32-bit signed integer
824
+ // default to one quarter of the total expiration+eviction window
825
825
  cleanupInterval = cleanupInterval || (expirationMs + evictionMs) / 4;
826
826
  scheduleCleanup();
827
827
  }
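Illustrative aside (not part of the package): the number-form versus object-form behavior of `setTTLExpiration` in this hunk can be restated as a pure function. `resolveTtl`, `TtlOptions`, and `TtlState` are hypothetical names; the real method mutates closure variables (`expirationMs`, `evictionMs`, `cleanupInterval`) and then calls `scheduleCleanup()`. The sketch assumes the previous state always holds numbers.

```typescript
interface TtlOptions {
	expiration?: number;
	eviction?: number;
	scanInterval?: number;
}
interface TtlState {
	expirationMs: number;
	evictionMs: number;
	cleanupInterval: number;
}

// Computes the next TTL state from the previous one; all inputs are in seconds.
function resolveTtl(opts: number | TtlOptions, previous: TtlState): TtlState {
	if (opts == null || (typeof opts !== 'number' && typeof opts !== 'object'))
		throw new Error('Invalid expiration value type');
	let { expirationMs, evictionMs, cleanupInterval } = previous;
	if (typeof opts === 'number') {
		// number form: eviction and scan interval carry over from the previous state
		expirationMs = opts * 1000;
	} else {
		// object form: all three are replaced; `??` keeps an explicit 0 as a chosen value
		expirationMs = (opts.expiration ?? 0) * 1000;
		evictionMs = (opts.eviction ?? 0) * 1000;
		cleanupInterval = (opts.scanInterval ?? 0) * 1000;
	}
	if (expirationMs < 0) throw new Error('Expiration can not be negative');
	// a zero interval falls back to one quarter of the expiration+eviction window
	cleanupInterval = cleanupInterval || (expirationMs + evictionMs) / 4;
	return { expirationMs, evictionMs, cleanupInterval };
}
```

One consequence worth noticing: after an object-form call has set a scan interval (explicitly or by the one-quarter default), a later number-form call keeps that interval even though the expiration has changed.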
@@ -4245,6 +4245,8 @@ export function makeTable(options) {
4245
4245
  Boolean(invalidated),
4246
4246
  auditRecord
4247
4247
  );
4248
+ // arm the eviction scanner, mirroring the .put() path
4249
+ if (sourceContext.expiresAt) scheduleCleanup();
4248
4250
  } else if (existingEntry) {
4249
4251
  logger.trace?.(
4250
4252
  `Deleting resolved record from source with id: ${id}, timestamp: ${new Date(txnTime).toISOString()}`
package/core/resources/auditStore.ts CHANGED
@@ -49,7 +49,15 @@ export type AuditRecord = {
49
49
  previousNodeId?: number;
50
50
  previousAdditionalAuditRefs?: Array<{ version: number; nodeId: number }>;
51
51
  endTxn?: boolean;
52
+ <<<<<<< HEAD
52
53
  structureVersion?: number;
54
+ =======
55
+ getBinaryRecordId?: any;
56
+ <<<<<<< HEAD
57
+ corrupt?: boolean;
58
+ >>>>>>> b84fbbd (fix: skip corrupt audit entries during iteration instead of throwing)
59
+ =======
60
+ >>>>>>> 6b6192c (test: cover lmdb keyEncoder and rocks-prelude paths; drop unused corrupt flag)
53
61
  };
54
62
 
55
63
  const ENTRY_HEADER = Buffer.alloc(2816); // this is sized to be large enough for the maximum key size (1976) plus large usernames. We may want to consider some limits on usernames to ensure this all fits
@@ -73,6 +81,16 @@ export const transactionKeyEncoder = {
73
81
  if (buffer[start] === 66) {
74
82
  const dataView =
75
83
  buffer.dataView || (buffer.dataView = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength));
84
+ // Without this bounds check, a truncated key buffer escapes as RangeError up
85
+ // through lmdb-js's iterator and lands as an uncaughtException on a later tick,
86
+ // stalling outgoing replication for the affected (peer, db) pair.
87
+ if (start + 8 > buffer.byteLength) {
88
+ harperLogger.warn('Audit key buffer too short for float64 read; returning NaN sentinel', {
89
+ start,
90
+ byteLength: buffer.byteLength,
91
+ });
92
+ return NaN;
93
+ }
76
94
  return dataView.getFloat64(start);
77
95
  } else {
78
96
  return readKey(buffer, start, end);
@@ -439,6 +457,15 @@ export function readAuditEntry(buffer: Uint8Array, start = 0, end = undefined):
439
457
  const nodeId = decoder.readInt();
440
458
  const tableId = decoder.readInt();
441
459
  let length = decoder.readInt();
460
+ // A corrupt length field (e.g., a 0xff-prefixed uint32) would otherwise push
461
+ // decoder.position hundreds of megabytes past the buffer; the next readFloat64
462
+ // then throws with the bogus position in the message. Failing fast here keeps
463
+ // the throw inside this try/catch so we surface a sentinel instead.
464
+ if (length < 0 || decoder.position + length > buffer.byteLength) {
465
+ throw new RangeError(
466
+ `Audit entry recordId length ${length} exceeds remaining buffer (position ${decoder.position}, byteLength ${buffer.byteLength})`
467
+ );
468
+ }
442
469
  const recordIdStart = decoder.position;
443
470
  const recordIdEnd = (decoder.position += length);
444
471
  // TODO: Once we support multiple format versions, we can conditionally read the version (and the previousResidencyId)
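Illustrative aside (not part of the package): the fail-fast length check added in this hunk can be shown on a simplified layout. The sketch assumes a hypothetical format of a 4-byte big-endian length followed by that many bytes; the real audit entry reads its lengths with `decoder.readInt()`, whose encoding is not shown in this diff. `readLengthPrefixed` and `tryReadLengthPrefixed` are made-up names.

```typescript
interface Field {
	bytes: Uint8Array;
	next: number; // position of the byte following this field
}

// Reads one length-prefixed field, refusing lengths that point past the buffer.
function readLengthPrefixed(buffer: Uint8Array, position: number): Field {
	if (position + 4 > buffer.byteLength) {
		throw new RangeError(`No room for a length prefix at position ${position}`);
	}
	const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
	const length = view.getUint32(position); // big-endian by default
	const start = position + 4;
	// Validate before advancing, so a corrupt length is reported with the real position
	if (start + length > buffer.byteLength) {
		throw new RangeError(
			`Field length ${length} exceeds remaining buffer (position ${start}, byteLength ${buffer.byteLength})`
		);
	}
	return { bytes: buffer.subarray(start, start + length), next: start + length };
}

// Converts the RangeError into an undefined sentinel so an iterating caller can skip the entry.
function tryReadLengthPrefixed(buffer: Uint8Array, position: number): Field | undefined {
	try {
		return readLengthPrefixed(buffer, position);
	} catch (error) {
		if (error instanceof RangeError) return undefined;
		throw error;
	}
}
```

Checking the length before moving the read position is what keeps the failure inside the decoder's own try/catch, rather than surfacing later from an unrelated read.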
@@ -469,6 +496,11 @@ export function readAuditEntry(buffer: Uint8Array, start = 0, end = undefined):
469
496
  }
470
497
  }
471
498
  length = decoder.readInt();
499
+ if (length < 0 || decoder.position + length > buffer.byteLength) {
500
+ throw new RangeError(
501
+ `Audit entry username length ${length} exceeds remaining buffer (position ${decoder.position}, byteLength ${buffer.byteLength})`
502
+ );
503
+ }
472
504
  const usernameStart = decoder.position;
473
505
  const usernameEnd = (decoder.position += length);
474
506
  let value: any;
@@ -477,8 +509,17 @@ export function readAuditEntry(buffer: Uint8Array, start = 0, end = undefined):
477
509
  tableId,
478
510
  nodeId,
479
511
  get recordId() {
480
- // use a subarray to protect against the underlying buffer being modified
481
- return readKey(buffer.subarray(0, recordIdEnd), recordIdStart, recordIdEnd);
512
+ // The recordId is decoded lazily and lives outside readAuditEntry's try/catch,
513
+ // so a corrupt recordId region would otherwise escape as an uncaught RangeError
514
+ // on property access. Catch and return undefined; callers already treat missing
515
+ // recordId as a skip-eligible entry.
516
+ try {
517
+ // use a subarray to protect against the underlying buffer being modified
518
+ return readKey(buffer.subarray(0, recordIdEnd), recordIdStart, recordIdEnd);
519
+ } catch (error) {
520
+ harperLogger.warn('Failed to decode audit recordId; treating as corrupt', error);
521
+ return undefined;
522
+ }
482
523
  },
483
524
  getBinaryRecordId() {
484
525
  return buffer.subarray(recordIdStart, recordIdEnd);
@@ -486,9 +527,14 @@ export function readAuditEntry(buffer: Uint8Array, start = 0, end = undefined):
486
527
  version,
487
528
  previousVersion,
488
529
  get user() {
489
- return usernameEnd > usernameStart
490
- ? readKey(buffer.subarray(0, usernameEnd), usernameStart, usernameEnd)
491
- : undefined;
530
+ try {
531
+ return usernameEnd > usernameStart
532
+ ? readKey(buffer.subarray(0, usernameEnd), usernameStart, usernameEnd)
533
+ : undefined;
534
+ } catch (error) {
535
+ harperLogger.warn('Failed to decode audit username; treating as corrupt', error);
536
+ return undefined;
537
+ }
492
538
  },
493
539
  get encoded() {
494
540
  return start ? buffer.subarray(start, end) : buffer;
@@ -523,10 +569,56 @@ export function readAuditEntry(buffer: Uint8Array, start = 0, end = undefined):
523
569
  };
524
570
  } catch (error) {
525
571
  harperLogger.error('Reading audit entry error', error, buffer);
572
+ <<<<<<< HEAD
526
573
  return {};
574
+ =======
575
+ return createCorruptAuditSentinel(buffer, start, end);
576
+ >>>>>>> b84fbbd (fix: skip corrupt audit entries during iteration instead of throwing)
527
577
  }
528
578
  }
529
579
 
580
+ /**
581
+ * Build a structurally complete audit record for an entry that failed to decode. The fields
582
+ * mirror the happy-path shape so downstream consumers that access (e.g.) `getValue` or the
583
+ * `recordId` getter don't blow up with a `TypeError: not a function` / `undefined.is(...)`
584
+ * after the header decode already failed. Consumers identify these by the undefined
585
+ * `tableId`/`type` (the same signal lmdb has produced from this catch since before this
586
+ * change) and skip them — `classifyAuditEntryForReplay` calls them out as `corrupt-header`,
587
+ * and the dispatch loops in Table.ts / transactionBroadcast.ts filter via tableId guards.
588
+ */
589
+ function createCorruptAuditSentinel(buffer: Uint8Array, start: number, end: number | undefined): AuditRecord {
590
+ return {
591
+ type: undefined,
592
+ tableId: undefined,
593
+ nodeId: undefined,
594
+ recordId: undefined,
595
+ version: undefined,
596
+ previousVersion: undefined,
597
+ user: undefined,
598
+ extendedType: undefined,
599
+ residencyId: undefined,
600
+ previousResidencyId: undefined,
601
+ expiresAt: undefined,
602
+ originatingOperation: undefined,
603
+ previousAdditionalAuditRefs: undefined,
604
+ get encoded() {
605
+ return start ? buffer.subarray(start, end) : buffer;
606
+ },
607
+ get size() {
608
+ return start !== undefined && end !== undefined ? end - start : buffer.byteLength;
609
+ },
610
+ getBinaryRecordId() {
611
+ return undefined;
612
+ },
613
+ getValue() {
614
+ return undefined;
615
+ },
616
+ getBinaryValue() {
617
+ return undefined;
618
+ },
619
+ } as any;
620
+ }
621
+
530
622
  export class Decoder extends DataView<ArrayBufferLike> {
531
623
  position = 0;
532
624
  readInt() {
package/core/resources/databases.ts CHANGED
@@ -1063,6 +1063,7 @@ export function table<TableResourceType>(tableDefinition: TableDefinition): Tabl
1063
1063
  const dbi = openIndex(dbiKey, rootStore, attribute);
1064
1064
  if (
1065
1065
  changed ||
1066
+ attributeDescriptor.indexingFailed ||
1066
1067
  (attributeDescriptor.indexingPID && attributeDescriptor.indexingPID !== process.pid) ||
1067
1068
  attributeDescriptor.restartNumber < workerData?.restartNumber
1068
1069
  ) {
@@ -1071,6 +1072,7 @@ export function table<TableResourceType>(tableDefinition: TableDefinition): Tabl
1071
1072
  attributeDescriptor = attributesDbi.getSync(dbiKey);
1072
1073
  if (
1073
1074
  changed ||
1075
+ attributeDescriptor.indexingFailed ||
1074
1076
  (attributeDescriptor.indexingPID && attributeDescriptor.indexingPID !== process.pid) ||
1075
1077
  attributeDescriptor.restartNumber < workerData?.restartNumber
1076
1078
  ) {
@@ -1084,14 +1086,20 @@ export function table<TableResourceType>(tableDefinition: TableDefinition): Tabl
1084
1086
  if (hasExistingData) {
1085
1087
  attribute.lastIndexedKey = attributeDescriptor?.lastIndexedKey ?? undefined;
1086
1088
  attribute.indexingPID = process.pid;
1089
+ delete attribute.indexingFailed; // clear failure flag for the new run
1087
1090
  dbi.isIndexing = true;
1088
- Object.defineProperty(attribute, 'dbi', { value: dbi });
1091
+ Object.defineProperty(attribute, 'dbi', { value: dbi, configurable: true, enumerable: false });
1089
1092
  // we only set indexing nulls to true if new or reindexing, we can't have partial indexing of null
1090
1093
  attributesToIndex.push(attribute);
1091
1094
  }
1092
1095
  }
1093
1096
  attributesDbi.put(dbiKey, attribute);
1094
1097
  }
1098
+ // If a migration is in progress (indexingPID set), any newly opened dbi must also
1099
+ // reflect isIndexing = true. A resetDatabases() during an active runIndexing creates
1100
+ // a new dbi object; without this, queries could use the new dbi (isIndexing = false)
1101
+ // and return incomplete results while the backfill is still running.
1102
+ if (attributeDescriptor?.indexingPID) dbi.isIndexing = true;
1095
1103
  if (attributeDescriptor?.indexNulls && attribute.indexNulls === undefined) attribute.indexNulls = true;
1096
1104
  dbi.indexNulls = attribute.indexNulls;
1097
1105
  indices[attribute.name] = dbi;
@@ -1162,6 +1170,7 @@ async function runIndexing(Table, attributes, indicesToRemove) {
1162
1170
  lastResolution = index.drop();
1163
1171
  }
1164
1172
  let interrupted;
1173
+ let hadIndexingErrors = false;
1165
1174
  const attributeErrorReported = {};
1166
1175
  let indexed = 0;
1167
1176
  const attributesLength = attributes.length;
@@ -1215,6 +1224,7 @@ async function runIndexing(Table, attributes, indicesToRemove) {
1215
1224
  }
1216
1225
  }
1217
1226
  } catch (error) {
1227
+ hadIndexingErrors = true;
1218
1228
  if (!attributeErrorReported[property]) {
1219
1229
  // just report an indexing error once per attribute so we don't spam the logs
1220
1230
  attributeErrorReported[property] = true;
@@ -1227,6 +1237,7 @@ async function runIndexing(Table, attributes, indicesToRemove) {
1227
1237
  () => outstanding--,
1228
1238
  (error) => {
1229
1239
  outstanding--;
1240
+ hadIndexingErrors = true;
1230
1241
  logger.error(error);
1231
1242
  }
1232
1243
  );
@@ -1244,20 +1255,69 @@ async function runIndexing(Table, attributes, indicesToRemove) {
1244
1255
  if (outstanding > MAX_OUTSTANDING_INDEXING) await lastResolution;
1245
1256
  else if (outstanding > MIN_OUTSTANDING_INDEXING) await new Promise((resolve) => setImmediate(resolve)); // yield event turn, don't want to use all computation
  }
+ }
+ // Await the last pending put. If it rejects, that is also an indexing error.
+ // Note: the when() calls above already attach rejection handlers to each record's
+ // last-put promise; this try-catch specifically handles the case where lastResolution
+ // itself rejects (i.e. the very last put in the loop failed) which would otherwise
+ // throw past the hadIndexingErrors check to the outer catch. The broader issue of
+ // unhandled rejections from non-last puts in multi-value attributes is pre-existing
+ // and out of scope for this fix.
+ try {
+ await lastResolution;
+ } catch (error) {
+ hadIndexingErrors = true;
+ logger.error(error);
+ }
+ // Yield one more event turn so any queued when() error callbacks (which fire as
+ // microtasks when their tracked promise settles) have a chance to set hadIndexingErrors
+ // before we decide whether to mark indexing as complete.
+ await new Promise((resolve) => setImmediate(resolve));
+ if (hadIndexingErrors) {
+ // Some records failed to index. Persist the failure marker in the descriptor so
+ // the next call to table() (including after a restart with a fresh PID) re-triggers
+ // the backfill from the last checkpoint. Do NOT clear indexingPID or isIndexing —
+ // leave the index in its incomplete state so queries return 503 "not indexed yet"
+ // rather than silently returning partial results. This is the key fix for the
+ // serent-canopy issue #135 fingerprint: a completed migration with transient errors
+ // (e.g. ERR_BUSY from RocksDB under load) leaving gaps while appearing successful.
+ for (const attribute of attributes) {
+ attribute.indexingFailed = true;
+ // Preserve lastIndexedKey so the retry resumes from the last checkpoint.
+ lastResolution = Table.dbisDB.put(attribute.key, attribute);
+ // Keep isIndexing = true on both the attribute.dbi and the currently-active dbi
+ // in Table.indices (which may differ if resetDatabases() ran during this pass).
+ attribute.dbi.isIndexing = true;
+ const activeDbi = Table.indices[attribute.name];
+ if (activeDbi) activeDbi.isIndexing = true;
+ }
+ await lastResolution;
+ logger.warn(
+ `Indexing of ${Table.tableName} encountered errors on some records - index will remain incomplete. ` +
+ `On next restart the migration will be retried from the last checkpoint (indexingFailed=true). ` +
+ `Affected attributes: ${attributes.map((a) => a.name).join(', ')}`
+ );
+ } else {
  // update the attributes to indicate that we are finished
  for (const attribute of attributes) {
  delete attribute.lastIndexedKey;
  delete attribute.indexingPID;
+ delete attribute.indexingFailed;
  attribute.dbi.isIndexing = false;
+ // Also clear isIndexing on the currently-active dbi in Table.indices, which may
+ // differ from attribute.dbi if a resetDatabases() call during this migration
+ // opened a new dbi and registered it there.
+ const activeDbi = Table.indices[attribute.name];
+ if (activeDbi) activeDbi.isIndexing = false;
  lastResolution = Table.dbisDB.put(attribute.key, attribute);
  }
+ await lastResolution;
+ // now notify all the threads that we are done and the index is ready to use
+ await signalling.signalSchemaChange(
+ new SchemaEventMsg(process.pid, 'indexing-finished', Table.databaseName, Table.tableName)
+ );
+ logger.info(`Finished indexing ${Table.tableName} attributes`, attributes);
  }
- await lastResolution;
- // now notify all the threads that we are done and the index is ready to use
- await signalling.signalSchemaChange(
- new SchemaEventMsg(process.pid, 'indexing-finished', Table.databaseName, Table.tableName)
- );
- logger.info(`Finished indexing ${Table.tableName} attributes`, attributes);
  } catch (error) {
  logger.error('Error in indexing', error);
  }
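The control flow this hunk adds (collect failures from every pending put, await the last one defensively, yield one event turn, then decide between "complete" and "failed") can be sketched in isolation. This is only an illustration under stated assumptions: `fakePut`, `yieldEventTurn` and `backfill` are hypothetical names that do not exist in Harper, and `setTimeout(..., 0)` stands in for the `setImmediate` call used in the hunk.

```typescript
// Hypothetical stand-in for one index write: rejects for keys listed in `failing`.
function fakePut(key: string, failing: Set<string>): Promise<void> {
  return failing.has(key) ? Promise.reject(new Error(`ERR_BUSY writing ${key}`)) : Promise.resolve();
}

// Yield one macrotask turn; promise callbacks that are already queued run before it resolves.
function yieldEventTurn(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

type BackfillResult = { completed: boolean; failedKeys: string[] };

async function backfill(keys: string[], failing: Set<string>): Promise<BackfillResult> {
  let hadErrors = false;
  const failedKeys: string[] = [];
  let lastResolution: Promise<void> = Promise.resolve();
  for (const key of keys) {
    lastResolution = fakePut(key, failing);
    // Attach a rejection handler to every put so a failure is recorded, not lost.
    lastResolution.catch(() => {
      hadErrors = true;
      failedKeys.push(key);
    });
  }
  try {
    await lastResolution; // the final put may itself reject
  } catch {
    hadErrors = true;
  }
  // Give any remaining rejection callbacks a chance to set the flag before deciding.
  await yieldEventTurn();
  // Only report completion when no put failed; otherwise the caller should keep
  // the "still indexing" state and retry from its checkpoint.
  return { completed: !hadErrors, failedKeys };
}
```

With `backfill(['a', 'b', 'c'], new Set(['b']))` the result has `completed === false` and `failedKeys` equal to `['b']`, which corresponds to the branch that sets `indexingFailed` instead of clearing `isIndexing`.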
@@ -6,6 +6,7 @@ import { DatabaseTransaction } from './DatabaseTransaction.ts';
  import { RocksTransactionLogStore } from './RocksTransactionLogStore.ts';
  import { isMainThread } from 'node:worker_threads';
  import { RequestTarget } from './RequestTarget.ts';
+ import { classifyAuditEntryForReplay } from './replayLogsGuards.ts';
  
  let warnedReplayHappening = false;
  export function replayLogs(rootStore: RocksDatabase, tables: any): Promise<void> {
@@ -24,11 +25,26 @@ export function replayLogs(rootStore: RocksDatabase, tables: any): Promise<void>
  let transaction: DatabaseTransaction;
  let lastTimestamp = 0;
  let writes = 0;
+ let skipped = 0;
  const txnLog: RocksTransactionLogStore = rootStore.auditStore;
  for (const auditRecord of txnLog.getRange({ startFromLastFlushed: true, readUncommitted: true })) {
- const { type, tableId, nodeId, recordId, version, residencyId, expiresAt, originatingOperation, username } =
- auditRecord;
+ const {
+ type,
+ tableId,
+ nodeId,
+ recordId,
+ version,
+ residencyId,
+ expiresAt,
+ originatingOperation,
+ username,
+ extendedType,
+ } = auditRecord;
  try {
+ if (classifyAuditEntryForReplay(extendedType, tableId, true) === 'corrupt-header') {
+ skipped++;
+ continue;
+ }
  const Table = tableById.get(tableId);
  if (!Table) continue;
  const context: Context = { nodeId, alreadyLogged: true, version, expiresAt, user: { name: username } };
@@ -42,7 +58,22 @@ export function replayLogs(rootStore: RocksDatabase, tables: any): Promise<void>
  warnedReplayHappening = true;
  console.warn('Harper was not properly shutdown, replaying transaction logs to synchronize database');
  }
- const record = auditRecord.getValue(primaryStore);
+ let record: any;
+ try {
+ record = auditRecord.getValue(primaryStore);
+ } catch {
+ // msgpack/structure decode failed for this entry's value. Skip rather than
+ // fall through to a guaranteed downstream crash, and intentionally drop the
+ // error: every corrupt entry would otherwise log a stack trace per iteration
+ // (millions of these were observed in prod). The total skip count is logged
+ // once at the end of replay.
+ skipped++;
+ continue;
+ }
+ if (classifyAuditEntryForReplay(extendedType, tableId, record !== undefined) === 'missing-record') {
+ skipped++;
+ continue;
+ }
  if (lastTimestamp !== version) {
  lastTimestamp = version;
  try {
@@ -127,6 +158,8 @@ export function replayLogs(rootStore: RocksDatabase, tables: any): Promise<void>
  logger.error('Error committing replay transaction', error);
  }
  if (writes > 0) logger.warn(`Replayed ${writes} records in ${rootStore.databaseName} database`);
+ if (skipped > 0)
+ logger.warn(`Skipped ${skipped} unrecoverable audit entries in ${rootStore.databaseName} database during replay`);
  // we never actually release the lock because we only want to ever run one time
  // rootStore.unlock('replayLogs');
  });
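The replay hunks above follow a "skip, count, and log once" pattern: entries with a broken header or an undecodable body are skipped without logging, and a single aggregate warning is emitted after the loop. A minimal sketch of that pattern follows. It is not Harper's code: the `Entry` shape, `decode` (JSON standing in for the msgpack decode) and `replay` are hypothetical names chosen for illustration.

```typescript
// Hypothetical log entry: a missing tableId models a header that failed to decode.
type Entry = { tableId?: number; payload?: string };

// Stand-in decoder; JSON.parse throws on malformed input, much as a msgpack decode might.
function decode(payload: string): unknown {
  return JSON.parse(payload);
}

function replay(
  entries: Entry[],
  apply: (tableId: number, record: unknown) => void,
  warn: (message: string) => void
): { writes: number; skipped: number } {
  let writes = 0;
  let skipped = 0;
  for (const entry of entries) {
    if (entry.tableId === undefined) {
      skipped++; // corrupt header: nothing to replay against
      continue;
    }
    let record: unknown;
    try {
      record = decode(entry.payload ?? '');
    } catch {
      skipped++; // drop the error on purpose; only the aggregate count is reported
      continue;
    }
    apply(entry.tableId, record);
    writes++;
  }
  // One warning for the whole run instead of one stack trace per bad entry.
  if (skipped > 0) warn(`Skipped ${skipped} unrecoverable entries during replay`);
  return { writes, skipped };
}
```

The trade-off is that the individual decode errors are lost, which the source comment accepts because the per-entry stack traces were themselves the operational problem.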
@@ -0,0 +1,42 @@
+ // Pure helpers for replayLogs (no Harper module dependencies, so unit tests can load
+ // them without bootstrapping the full Resource / RocksDB / DatabaseTransaction graph).
+ //
+ // Background: a node that crashed unclean re-runs replayLogs against the unflushed audit
+ // log on next boot. If any audit entry is corrupt or missing its record body, the loop
+ // hits a TypeError inside Table.validate() ("Cannot read properties of undefined
+ // (reading 'cacheKey')") and the per-iteration catch swallows it — but the loop keeps
+ // running over potentially millions of entries, pinning CPU. These guards classify each
+ // entry up front so the loop can skip cleanly.
+
+ // Mirrors `HAS_RECORD` (16) | `HAS_PARTIAL_RECORD` (32) from auditStore.ts — the action
+ // bits the writer sets when an entry carries (or should carry) a record body. Redeclared
+ // here so this module stays free of the Harper module graph for unit testing; a lock
+ // test pins the value against auditStore so silent drift is caught.
+ export const RECORD_BEARING_FLAGS = 16 | 32;
+
+ /**
+ * Decide whether an audit entry pulled from the unflushed log is safe to replay.
+ * Returns `null` if the entry should be replayed, or a short reason string if it should
+ * be skipped (the loop logs the aggregate skip count once at the end).
+ *
+ * Operates on the raw integer `action` field rather than the decoded type string: when
+ * `readAuditEntry` catches a header decode error it returns `{}`, so both `action` and
+ * `tableId` are `undefined` — the same signal — and matching the record-bearing flags
+ * directly against the action mirrors how the writer set them in `auditStore.ts`.
+ *
+ * @param action `auditRecord.extendedType` — the variable-length action field with
+ * the event type in the low nibble and HAS_* flags above it
+ * @param tableId `auditRecord.tableId`
+ * @param hasRecord `true` if `auditRecord.getValue(...)` produced a non-undefined value
+ */
+ export function classifyAuditEntryForReplay(
+ action: number | undefined,
+ tableId: number | undefined,
+ hasRecord: boolean
+ ): 'corrupt-header' | 'missing-record' | null {
+ if (action === undefined || tableId === undefined) return 'corrupt-header';
+ // If the action advertises a record body but the decoded record is undefined, the
+ // downstream write path will crash inside validate() on the first attribute deref.
+ if ((action & RECORD_BEARING_FLAGS) !== 0 && !hasRecord) return 'missing-record';
+ return null;
+ }
+ }
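To make the three outcomes concrete, the snippet below repeats the constant and the classifier from the new file so that it runs on its own, then evaluates a few inputs. The low-nibble event-type values `1` and `2` and the table id `5` are arbitrary placeholders for illustration, not documented Harper values. Only the flag bits 16 and 32 come from the source.

```typescript
// Copied from the hunk above so the example is self-contained.
const RECORD_BEARING_FLAGS = 16 | 32;

function classifyAuditEntryForReplay(
  action: number | undefined,
  tableId: number | undefined,
  hasRecord: boolean
): 'corrupt-header' | 'missing-record' | null {
  if (action === undefined || tableId === undefined) return 'corrupt-header';
  if ((action & RECORD_BEARING_FLAGS) !== 0 && !hasRecord) return 'missing-record';
  return null;
}

// Placeholder actions: an arbitrary event type in the low nibble, flags above it.
const withFullRecord = 1 | 16; // HAS_RECORD bit set
const withPartialRecord = 1 | 32; // HAS_PARTIAL_RECORD bit set
const withoutRecordFlags = 2; // neither record-bearing bit set

const outcomes = {
  headerUndecodable: classifyAuditEntryForReplay(undefined, undefined, true), // 'corrupt-header'
  tableIdMissing: classifyAuditEntryForReplay(withFullRecord, undefined, true), // 'corrupt-header'
  bodyMissing: classifyAuditEntryForReplay(withFullRecord, 5, false), // 'missing-record'
  partialBodyMissing: classifyAuditEntryForReplay(withPartialRecord, 5, false), // 'missing-record'
  bodyPresent: classifyAuditEntryForReplay(withFullRecord, 5, true), // null
  noBodyExpected: classifyAuditEntryForReplay(withoutRecordFlags, 5, false), // null
};
console.log(outcomes);
```

Note that the header check runs first, so `replayLogs` can call the classifier with `hasRecord = true` before the body is decoded and test only for `'corrupt-header'`, then call it again with the real value once `getValue` has returned.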