pipeai 0.2.1 → 0.8.0

package/README.md CHANGED
@@ -46,7 +46,7 @@ type Ctx = {
46
46
  db: Database;
47
47
  };
48
48
 
49
- const assistant = new Agent<Ctx>({
49
+ const assistant = new Agent<Ctx, string>({
50
50
  id: "assistant",
51
51
  model: openai("gpt-4o"),
52
52
  system: "You are a helpful assistant.",
@@ -84,7 +84,7 @@ const classificationSchema = z.object({
84
84
  summary: z.string(),
85
85
  });
86
86
 
87
- const classifier = new Agent<Ctx>({
87
+ const classifier = new Agent<Ctx, { title: string; body: string }>({
88
88
  id: "classifier",
89
89
  input: z.object({ title: z.string(), body: z.string() }),
90
90
  output: Output.object({ schema: classificationSchema }),
@@ -102,7 +102,7 @@ result.output; // { priority: "high", category: "bug", summary: "..." }
102
102
  Most config fields accept a static value or a `(ctx, input) => value` function:
103
103
 
104
104
  ```ts
105
- const agent = new Agent<Ctx>({
105
+ const agent = new Agent<Ctx, string>({
106
106
  id: "adaptive",
107
107
  model: (ctx) => ctx.isPremium ? openai("gpt-4o") : openai("gpt-4o-mini"),
108
108
  system: (ctx) => `You assist ${ctx.userName}. Role: ${ctx.role}.`,
@@ -120,7 +120,7 @@ const agent = new Agent<Ctx>({
120
120
  Same callback names as AI SDK v6, extended with `ctx`, `input`, and `writer`. The AI SDK event payload is available as `result`. When the agent runs inside a streaming workflow, `writer` is available for writing metadata or custom stream parts:
121
121
 
122
122
  ```ts
123
- const agent = new Agent<Ctx>({
123
+ const agent = new Agent<Ctx, string>({
124
124
  id: "monitored",
125
125
  model: openai("gpt-4o"),
126
126
  prompt: (ctx, input) => input,
@@ -146,6 +146,7 @@ const agent = new Agent<Ctx>({
146
146
  | `description` | `string` | Agent description (used by `asTool()` for tool description). |
147
147
  | `input` | `ZodType` | Input schema. Required for `asTool()`. Infers `TInput`. |
148
148
  | `output` | `Output` | AI SDK Output (e.g. `Output.object({ schema })`). Infers `TOutput`. |
149
+ | `validateOutput` | `ZodType<TOutput>` | Optional runtime guard. Validates the structured `output` after the SDK parses it (distinct from `tool.outputSchema`). Catches SDK-side parse drift. |
149
150
  | `model` | `Resolvable` | Language model. Static or `(ctx, input) => model`. |
150
151
  | `system` | `Resolvable` | System prompt. |
151
152
  | `prompt` | `Resolvable` | String prompt. Mutually exclusive with `messages`. |
@@ -153,7 +154,7 @@ const agent = new Agent<Ctx>({
153
154
  | `tools` | `Resolvable` | Tool map. Supports `Tool`, `ToolProvider`, and `agent.asTool()`. |
154
155
  | `activeTools` | `Resolvable` | Subset of tool names to enable. |
155
156
  | `toolChoice` | `Resolvable` | Tool choice strategy. Static or `(ctx, input) => toolChoice`. |
156
- | `stopWhen` | `Resolvable` | Condition for stopping the tool loop. Static or `(ctx, input) => condition`. |
157
+ | `stopWhen` | `StopCondition` &#124; `StopCondition[]` | Condition(s) for stopping the tool loop. **Static only** — not a `Resolvable`. A bare function is ambiguous with the resolver form, so dynamic stop conditions require building the agent per call. |
157
158
  | `onStepFinish`| `({ result, ctx, input, writer? })`| Called after each step. `writer` available in streaming workflows. |
158
159
  | `onFinish` | `({ result, ctx, input, writer? })`| Called when all steps complete. |
159
160
  | `onError` | `({ error, ctx, input, writer? })` | Called on error. |
@@ -164,7 +165,7 @@ const agent = new Agent<Ctx>({
164
165
  `asTool()` compiles an agent into a standard AI SDK `Tool`. The parent agent's LLM tool loop handles routing — no dedicated router needed.
165
166
 
166
167
  ```ts
167
- const codingAgent = new Agent<Ctx>({
168
+ const codingAgent = new Agent<Ctx, { task: string; language?: string }>({
168
169
  id: "coding",
169
170
  description: "Writes and modifies code.",
170
171
  input: z.object({
@@ -176,7 +177,7 @@ const codingAgent = new Agent<Ctx>({
176
177
  tools: { writeFile, readFile },
177
178
  });
178
179
 
179
- const qaAgent = new Agent<Ctx>({
180
+ const qaAgent = new Agent<Ctx, { question: string }>({
180
181
  id: "qa",
181
182
  description: "Answers technical questions.",
182
183
  input: z.object({ question: z.string() }),
@@ -186,7 +187,7 @@ const qaAgent = new Agent<Ctx>({
186
187
  });
187
188
 
188
189
  // Parent agent uses sub-agents as tools
189
- const orchestrator = new Agent<Ctx>({
190
+ const orchestrator = new Agent<Ctx, string>({
190
191
  id: "orchestrator",
191
192
  model: openai("gpt-4o"),
192
193
  system: "Delegate work to the right specialist.",
@@ -221,7 +222,7 @@ codingAgent.asTool(ctx, {
221
222
  `asTool(ctx)` bakes the context in at call time. `asToolProvider()` defers context resolution — the tool is created with the correct context when another agent's tool resolution runs:
222
223
 
223
224
  ```ts
224
- const orchestrator = new Agent<Ctx>({
225
+ const orchestrator = new Agent<Ctx, string>({
225
226
  id: "orchestrator",
226
227
  model: openai("gpt-4o"),
227
228
  system: "Delegate work to the right specialist.",
@@ -267,7 +268,7 @@ const cancelOrder = define({
267
268
  });
268
269
 
269
270
  // Mix with plain AI SDK tools freely
270
- const agent = new Agent<Ctx>({
271
+ const agent = new Agent<Ctx, string>({
271
272
  id: "support",
272
273
  model: openai("gpt-4o"),
273
274
  prompt: (ctx, input) => input,
@@ -302,15 +303,24 @@ const pipeline = Workflow.create<Ctx>()
302
303
 
303
304
  ```ts
304
305
  // Non-streaming — calls agent.generate() at each step
305
- const { output } = await pipeline.generate(ctx, initialInput);
306
+ const result = await pipeline.generate(ctx, initialInput);
307
+ if (result.status === "complete") {
308
+ console.log(result.output);
309
+ } else {
310
+ // result.status === "suspended" — see Human-in-the-Loop section
311
+ await db.saveSnapshot(result.snapshot);
312
+ }
306
313
 
307
314
  // Streaming — calls agent.stream() at each step, merges into a single ReadableStream
308
315
  const { stream, output } = pipeline.stream(ctx, initialInput);
309
316
  return new Response(stream);
310
317
 
311
- const finalOutput = await output; // resolves when pipeline completes
318
+ // output resolves to WorkflowResult<T>; it never rejects on suspension
319
+ const final = await output;
312
320
  ```
313
321
 
322
+ The return value is a `WorkflowResult<T>` discriminated union — either `{ status: "complete", output, warnings }` or `{ status: "suspended", snapshot, warnings }`. `warnings` is always present on both branches (`readonly WorkflowWarning[]`, possibly empty).
323
+
314
324
  ### Nested workflows
315
325
 
316
326
  Workflows can be passed as steps into other workflows. The nested workflow's steps execute within the parent's runtime state — streams merge naturally, and errors propagate to the parent's `catch()`:
@@ -484,13 +494,15 @@ const pipeline = Workflow.create<Ctx>()
484
494
  .step("combine", ({ input }) => input.join("\n\n"));
485
495
  ```
486
496
 
487
- Concurrent processing with batched parallelism:
497
+ Concurrent processing with bounded parallelism:
488
498
 
489
499
  ```ts
490
- // Process 3 items at a time
500
+ // Up to 3 items run simultaneously; the next launches as soon as one finishes.
491
501
  .foreach(summarizer, { concurrency: 3 })
492
502
  ```
493
503
 
504
+ `concurrency` is the **maximum number of items in flight at any moment** — backed by a semaphore. There's no lockstep batching: a slow item never blocks a finished slot from picking up the next pending one.
505
+
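The "no lockstep batching" behavior described above can be sketched with a small worker-pool mapper. This is an illustration only, not pipeai's implementation: `mapWithLimit` and its parameters are hypothetical names, and a pool of workers is just one common way to get semaphore-like behavior.

```ts
// Hypothetical helper: at most `limit` calls to `fn` are in flight at once.
// Results keep input order regardless of completion order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each worker claims the next pending index as soon as it is free,
  // so a slow item never holds back the other slots.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With a batching approach, a batch of three would wait for its slowest member before starting the next three. Here a free worker moves on immediately, which is the property the `concurrency` option is documented to have.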
494
506
  Works with nested workflows too:
495
507
 
496
508
  ```ts
@@ -503,8 +515,77 @@ const pipeline = Workflow.create<Ctx>()
503
515
  .foreach(processItem, { concurrency: 5 });
504
516
  ```
505
517
 
518
+ #### Per-item error recovery via `onError`
519
+
520
+ By default a single item's failure aborts the whole `foreach`. Pass an `onError` handler to recover individual items — return a substitute value, return `Workflow.SKIP` to drop the failed index from the output array, or rethrow to abort the step (the throw is catchable by a downstream `.catch()`):
521
+
522
+ ```ts
523
+ import { Workflow } from "pipeai";
524
+
525
+ const pipeline = Workflow.create<Ctx>()
526
+ .step("fetch", async ({ ctx }) => ctx.db.urls.getAll())
527
+ .foreach(scraper, {
528
+ concurrency: 5,
529
+ onError: ({ error, item, index }) => {
530
+ // Substitute a placeholder
531
+ if (isTransient(error)) return { url: item, body: "" };
532
+ // Or drop the item entirely (output array is shortened)
533
+ return Workflow.SKIP;
534
+ },
535
+ });
536
+ ```
537
+
538
+ `onError` is invoked sequentially in **index order** after all items settle, so its observable order is deterministic regardless of completion timing. In-flight siblings are never cancelled by another item's failure.
539
+
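A rough sketch of the "settle everything, then recover in index order" behavior, under stated assumptions: `SKIP` is a local stand-in for `Workflow.SKIP`, `settleThenRecover` is a hypothetical name, and the concurrency limit is left out for brevity. It is not pipeai's code.

```ts
// Local stand-in for a skip sentinel (assumption, not the library's value).
const SKIP = Symbol("skip");

type ItemErrorHandler<T, R> = (info: {
  error: unknown;
  item: T;
  index: number;
}) => R | typeof SKIP;

async function settleThenRecover<T, R>(
  items: readonly T[],
  run: (item: T) => Promise<R>,
  onError: ItemErrorHandler<T, R>,
): Promise<R[]> {
  // 1. Let every item finish; one failure never cancels its siblings.
  const settled = await Promise.allSettled(items.map((item) => run(item)));

  // 2. Walk results in index order, so the handler sees failures in a
  //    deterministic order no matter which item failed first.
  const output: R[] = [];
  settled.forEach((result, index) => {
    if (result.status === "fulfilled") {
      output.push(result.value);
      return;
    }
    // A throw from onError propagates and aborts the whole call.
    const recovered = onError({ error: result.reason, item: items[index] as T, index });
    if (recovered !== SKIP) output.push(recovered as R); // SKIP shortens the array
  });
  return output;
}
```

Because the handler runs only after all items settle, it can be written without worrying about interleaving with still-running items.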
506
540
  **Type safety:** `foreach()` uses `ElementOf<TOutput>` to extract the array element type. If the previous step doesn't produce an array, the call is rejected at compile time.
507
541
 
542
+ ### Fan-out via `parallel()`
543
+
544
+ `parallel()` runs several branches against the **same input** concurrently and collects their results. Two type-overload forms — record (keyed by name) and tuple (positional):
545
+
546
+ ```ts
547
+ // Record form — returns { researcher: ResearchOutput, critic: CriticOutput }
548
+ const pipeline = Workflow.create<Ctx, string>()
549
+ .step("classify", classifier)
550
+ .parallel({ researcher, critic });
551
+
552
+ // Tuple form — returns [ResearchOutput, CriticOutput]
553
+ const pipeline = Workflow.create<Ctx, string>()
554
+ .step("classify", classifier)
555
+ .parallel([researcher, critic] as const);
556
+ ```
557
+
558
+ The same input (`state.output`) is fed to each branch. Default concurrency is `min(branches.length, 5)` — most users want fan-out, but the cap protects against rate-limit pressure. Pass `concurrency: Infinity` (or `branches.length`) to opt out.
559
+
560
+ ```ts
561
+ .parallel({ a, b, c, d, e, f, g, h }, { concurrency: 3 }) // explicit override
562
+ .parallel({ a, b, c, d, e, f, g, h }, { concurrency: Infinity }) // full fan-out
563
+ ```
564
+
565
+ **Generate mode only.** Streams aren't threaded through to branches — interleaving multiple agent streams into one writer is out of scope.
566
+
567
+ #### Per-branch error handling
568
+
569
+ ```ts
570
+ .parallel({ a, b }, {
571
+ onError: ({ error, key, ctx }) => {
572
+ if (key === "a") return "fallback-a"; // substitute
573
+ if (key === "b") return Workflow.SKIP; // record form: undefined slot
574
+ throw error; // rethrow to abort the parallel
575
+ },
576
+ })
577
+ ```
578
+
579
+ `onError` is **bypassed** on the suspension path — if any branch hits a nested gate, the marker reaches the caller without onError running. Non-suspension errors flow through onError in branch order.
580
+
581
+ #### Suspension under parallel
582
+
583
+ Gates inside parallel branches throw `NestedGateUnsupportedError`, same as `foreach` concurrent. The lowest-index suspending branch wins the marker; others contribute to `siblingSuspensions`. Multi-branch suspension semantics are finalized in F0.6 alongside `cancelOnFirstSuspend` — until then, all branches run to completion (or sibling-failure) before the marker reaches the caller.
584
+
585
+ > **Rate-limit hazard:** `parallel`'s default `min(N, 5)` assumes ≥5 RPS headroom on your model provider. Symptoms of overflow: 429s and stair-stepped latency.
586
+
587
+ > **Concurrent ctx-mutation hazard:** branches share the `ctx` object by reference. Treat `ctx` as immutable inside parallel branches.
588
+
508
589
  ### Conditional loops via `repeat()`
509
590
 
510
591
  `repeat()` runs an agent or workflow in a loop until a condition is met. The body's output feeds back as input — same type in, same type out:
@@ -594,11 +675,11 @@ const { stream, output } = pipeline.stream(ctx, initialInput, {
594
675
  | `.step(id, fn)` | Transform the output. `fn` receives `{ ctx, input }` and returns the new output. |
595
676
  | `.branch([...cases])` | Predicate routing. First `when` match wins; case without `when` is default. |
596
677
  | `.branch({ select, agents })` | Key routing. `select` returns a key, runs the matching agent. |
597
- | `.foreach(target, opts?)` | Map each array element through an agent or workflow. `opts.concurrency` controls parallelism (default: 1). |
678
+ | `.foreach(target, opts?)` | Map each array element through an agent or workflow. `opts.concurrency` is the max items in flight (default: 1). `opts.onError` recovers per-item failures; return `Workflow.SKIP` to drop an index. |
598
679
  | `.repeat(target, opts)` | Loop an agent or workflow. Use `{ until }` or `{ while }` (mutually exclusive). `maxIterations` defaults to 10. |
599
- | `.gate(id, opts?)` | Human-in-the-loop suspension point. Throws `WorkflowSuspended` with a serializable snapshot. Resume via `loadState(gateId, snapshot)`. |
600
- | `.catch(id, fn)` | Handle errors. `fn` receives `{ error, ctx, lastOutput, stepId }` and returns a recovery value. |
601
- | `.finally(id, fn)` | Always runs. `fn` receives `{ ctx }`. |
680
+ | `.gate(id, opts?)` | Human-in-the-loop suspension point. Returns a result with `status: "suspended"` carrying a serializable snapshot. Resume via `loadState(gateId, snapshot)`. |
681
+ | `.catch(id, fn)` | Handle errors. `fn` receives `{ error, ctx, lastOutput, stepId }` and returns a recovery value. Bypassed on suspension. |
682
+ | `.finally(id, fn)` | Always runs — including after a gate suspends. `fn` receives `{ ctx }`. Throwing finallys no longer abort subsequent ones; errors aggregate into `AggregateError` on the completion path and into `result.warnings` on the suspension path. |
602
683
 
603
684
  ### Output flow
604
685
 
@@ -623,10 +704,12 @@ Auto-extraction priority for `step()` with an agent:
623
704
 
624
705
  `gate()` suspends a workflow at a designated point, producing a JSON-serializable snapshot. The consumer persists the snapshot, collects human input out-of-band (HTTP, WebSocket, CLI, queue — any transport), then resumes the workflow from where it left off.
625
706
 
707
+ > **0.4.0 breaking change:** suspension is a return value, not a thrown error. `generate()` and `stream()` resolve with `WorkflowResult<T>` — a discriminated union of `{ status: "complete", output, warnings }` and `{ status: "suspended", snapshot, warnings }`. `WorkflowSuspended` has been removed. See [Migration from 0.3.x](#migration-from-03x).
708
+
626
709
  ### Basic gate
627
710
 
628
711
  ```ts
629
- import { Workflow, WorkflowSuspended } from "pipeai";
712
+ import { Workflow } from "pipeai";
630
713
 
631
714
  const pipeline = Workflow.create<Ctx>()
632
715
  .step(draftAgent)
@@ -636,23 +719,27 @@ const pipeline = Workflow.create<Ctx>()
636
719
  .step(publishAgent);
637
720
 
638
721
  // Run — suspends at gate
639
- try {
640
- await pipeline.generate(ctx, input);
641
- } catch (e) {
642
- if (e instanceof WorkflowSuspended) {
643
- await db.saveSnapshot(e.snapshot);
644
- return res.status(202).json(e.snapshot.gatePayload);
645
- }
722
+ const result = await pipeline.generate(ctx, input);
723
+ if (result.status === "suspended") {
724
+ await db.saveSnapshot(result.snapshot);
725
+ return res.status(202).json(result.snapshot.gatePayload);
646
726
  }
727
+ // result.status === "complete" here — TS narrows `output` automatically
728
+ return res.json({ output: result.output });
647
729
 
648
730
  // Resume — load state, pass gate ID + snapshot to generate or stream
649
731
  const snapshot = await db.loadSnapshot(id);
650
732
  const resumed = pipeline.loadState("review", snapshot);
651
- const { output } = await resumed.generate(ctx, humanResponse);
733
+ const resumeResult = await resumed.generate(ctx, humanResponse);
734
+ if (resumeResult.status === "complete") {
735
+ console.log(resumeResult.output);
736
+ }
652
737
  ```
653
738
 
654
739
  The `snapshot` is plain JSON — it survives `JSON.parse(JSON.stringify())`, database storage, and process restarts. The workflow definition (code) stays in the process; only the data is serialized.
655
740
 
741
+ `result.warnings` is **always** present on both branches — an array of non-fatal errors (a throwing `.finally()`, a misbehaving observer). It's `readonly WorkflowWarning[]`, never `undefined`. If you don't care about non-fatal failures, ignore it.
742
+
656
743
  ### Resuming with streaming
657
744
 
658
745
  For chat applications where the client reconnects and needs a live stream for the remaining steps:
@@ -667,21 +754,23 @@ The previous stream is gone — the library only streams forward from the resume
667
754
 
668
755
  ### Streaming suspension
669
756
 
670
- When `stream()` hits a gate, the stream closes cleanly (partial content from steps before the gate is delivered). The `output` promise rejects with `WorkflowSuspended`:
757
+ When `stream()` hits a gate, the stream closes cleanly (partial content from steps before the gate is delivered). The `output` Promise **resolves** with `{ status: "suspended", snapshot, warnings }` — it does **not** reject:
671
758
 
672
759
  ```ts
673
760
  const { stream, output } = pipeline.stream(ctx, input);
674
761
  pipeStreamToResponse(res, stream); // partial content delivered normally
675
762
 
676
- try {
677
- await output;
678
- } catch (e) {
679
- if (e instanceof WorkflowSuspended) {
680
- await db.saveSnapshot(e.snapshot);
681
- }
763
+ const result = await output;
764
+ if (result.status === "suspended") {
765
+ await db.saveSnapshot(result.snapshot);
682
766
  }
767
+ // Real errors (a step throws something other than a gate suspension) still
768
+ // reject the output Promise — keep your try/catch for those, but
769
+ // `WorkflowStreamOptions.onError` is NOT invoked for suspension.
683
770
  ```
684
771
 
772
+ > **Stream-mode dead-air warning:** the stream stays open while `.finally()` bodies run after a gate suspends. Long-running cleanup work causes proportional dead air. If your HTTP read timeout is shorter than your worst-case finally I/O, the connection can disconnect spuriously.
773
+
685
774
  ### Schema validation
686
775
 
687
776
  Add a `schema` to validate the human response at runtime. The schema uses a structural type — any object with a `.parse()` method works (Zod, Valibot, ArkType, etc.):
@@ -716,23 +805,25 @@ const pipeline = Workflow.create<Ctx>()
716
805
  .step("publish", ({ input }) => `published: ${input}`);
717
806
 
718
807
  // First gate
719
- let snapshot: WorkflowSnapshot;
720
- try { await pipeline.generate(ctx, input); }
721
- catch (e) { snapshot = (e as WorkflowSuspended).snapshot; }
808
+ const r1 = await pipeline.generate(ctx, input);
809
+ if (r1.status !== "suspended") throw new Error("expected suspension at review");
810
+ let snapshot = r1.snapshot;
722
811
 
723
812
  // Second gate
724
813
  const resumed1 = pipeline.loadState("review", snapshot);
725
- try { await resumed1.generate(ctx, "first approval"); }
726
- catch (e) { snapshot = (e as WorkflowSuspended).snapshot; }
814
+ const r2 = await resumed1.generate(ctx, "first approval");
815
+ if (r2.status !== "suspended") throw new Error("expected suspension at final-approval");
816
+ snapshot = r2.snapshot;
727
817
 
728
818
  // Complete
729
819
  const resumed2 = pipeline.loadState("final-approval", snapshot);
730
- const { output } = await resumed2.generate(ctx, "final approval");
820
+ const r3 = await resumed2.generate(ctx, "final approval");
821
+ if (r3.status === "complete") console.log(r3.output);
731
822
  ```
732
823
 
733
- ### Merging pre-gate output with response
824
+ ### Manual merge at the call site
734
825
 
735
- The `snapshot.output` field contains the pre-gate output. Use it to merge with the human response:
826
+ The `snapshot.output` field contains the pre-gate output. Merge it with the human response at the call site:
736
827
 
737
828
  ```ts
738
829
  // The step after the gate needs both the draft and the approval
@@ -743,6 +834,8 @@ await resumed.generate(ctx, {
743
834
  });
744
835
  ```
745
836
 
837
+ For automatic merging without exposing `snapshot.output` to the caller, see the `merge` option below.
838
+
746
839
  ### Injecting updated context on resume
747
840
 
748
841
  `ctx` is provided fresh on every `generate()`/`stream()` call — never serialized. Use it to inject updated chat history, refreshed auth tokens, or new database connections:
@@ -770,39 +863,378 @@ const pipeline = Workflow.create<Ctx>()
770
863
  .step(publishAgent);
771
864
  ```
772
865
 
773
- ### Merging pre-gate output with response
866
+ ### Merging pre-gate output with response via `merge`
867
+
868
+ Use `merge` to combine the pre-gate output with the human response into a single value for the next step. Without `merge`, only the human response is forwarded.
774
869
 
775
- Use `merge` to combine the pre-gate output with the human response into a single value for the next step. Without `merge`, only the human response is forwarded:
870
+ `merge` may return any shape; its return type becomes the input type of the next step. The gate's third generic `TMerged` is inferred from the merge return type, so downstream steps type-check against the merged shape:
776
871
 
777
872
  ```ts
778
873
  const pipeline = Workflow.create<Ctx>()
779
874
  .step(draftAgent)
780
875
  .gate("review", {
876
+ schema: approvalSchema,
781
877
  merge: ({ priorOutput, response }) => ({
782
- draft: priorOutput,
783
- approval: response,
878
+ draft: priorOutput, // pre-gate output (TOutput)
879
+ approval: response, // validated human response (TResponse)
784
880
  }),
785
881
  })
786
882
  .step("publish", ({ input }) => {
787
- // input is { draft, approval }
883
+ // input is { draft, approval } — the TMerged shape
788
884
  });
789
885
  ```
790
886
 
791
887
  ### Snapshot shape
792
888
 
889
+ As of 0.5.0, `WorkflowSnapshot` is a discriminated union with three variants — gate snapshots emitted by `.gate()`, checkpoint snapshots emitted by `onCheckpoint`, and the legacy v1 form from 0.4.0 (accepted for one release via the shim):
890
+
793
891
  ```ts
794
- interface WorkflowSnapshot {
795
- version: 1;
892
+ interface GateSnapshot {
893
+ version: 2;
894
+ kind: "gate";
796
895
  resumeFromIndex: number; // step index of the gate
797
896
  output: unknown; // pre-gate output
798
897
  gateId: string; // gate identifier
799
898
  gatePayload: unknown; // data for the human
800
899
  }
900
+
901
+ interface CheckpointSnapshot {
902
+ version: 2;
903
+ kind: "checkpoint";
904
+ resumeFromIndex: number; // index of the NEXT step to run
905
+ output: unknown; // output as of the checkpoint
906
+ stepShapeHash: string; // SHA-256 hex of the workflow's structural shape
907
+ }
908
+
909
+ // Legacy v1 — only accepted by loadState() during one release. Migrate via migrateSnapshot().
910
+ interface LegacyGateSnapshotV1 {
911
+ version: 1;
912
+ kind?: undefined;
913
+ resumeFromIndex: number;
914
+ output: unknown;
915
+ gateId: string;
916
+ gatePayload: unknown;
917
+ }
918
+
919
+ type WorkflowSnapshot = GateSnapshot | CheckpointSnapshot | LegacyGateSnapshotV1;
801
920
  ```
802
921
 
922
+ `WorkflowResult<T>` narrows the suspended-branch `snapshot` to `GateSnapshot` specifically — only gates suspend, so the union widening doesn't pollute the suspended-state API.
923
+
924
+ > **Rolling-deploy hazard:** A 0.4.0 process receiving a 0.5.0-persisted v2 gate snapshot rejects via the strict `version === 1` check. Drain in-flight snapshots before cutover, ship a 0.4.x forward-compat patch ahead, or version-tag storage keys.
925
+
926
+ > **Long-lived storage:** For Redis-without-TTL / S3 / Postgres, call `migrateSnapshot(legacy)` before v0.8.0+ drops v1 acceptance.
927
+
928
+ ## Step-level checkpointing via `onCheckpoint`
929
+
930
+ Pass `onCheckpoint` in `RunOptions` to receive a v2 checkpoint snapshot after each successful step body. Use this to persist progress so a crashed/restarted process can resume where it left off — no human-in-the-loop required.
931
+
932
+ ```ts
933
+ import { Workflow, type CheckpointSnapshot } from "pipeai";
934
+
935
+ const pipeline = Workflow.create<Ctx, string>()
936
+ .step("classify", classifier)
937
+ .step("summarize", summarizer)
938
+ .step("publish", publisher);
939
+
940
+ let lastSnapshot: CheckpointSnapshot | null = null;
941
+ const result = await pipeline.generate(ctx, "input", {
942
+ onCheckpoint: async (snap) => {
943
+ lastSnapshot = snap;
944
+ await db.write({ key: "run:42", snapshot: snap });
945
+ },
946
+ checkpointEvery: 5, // every 5 executable steps
947
+ });
948
+
949
+ // On restart, resume from the last persisted snapshot:
950
+ const stored = await db.read("run:42");
951
+ const resumed = pipeline.resumeFrom(stored);
952
+ const final = await resumed.generate(ctx); // no response arg — state is seeded
953
+ ```
954
+
955
+ ### Cadence
956
+
957
+ - `checkpointEvery: N` — fire every N executable steps. Defaults to `max(1, ceil(executableCount / 4))` — 4 checkpoints across the run, floor of every step on tiny pipelines.
958
+ - `checkpointWhen({ stepIndex, stepId, ctx }) => boolean` — predicate variant. Mutually exclusive with `checkpointEvery`.
959
+ - `.catch()` and `.finally()` nodes are NOT counted as executable, so adding cleanup doesn't surprise you with extra checkpoints.
960
+
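The default cadence formula is easy to check by hand. The sketch below computes it for a few pipeline sizes. `defaultCheckpointEvery` and `checkpointSteps` are hypothetical helper names, and the firing rule (checkpoint after every N-th executable step, counting from 1) is an assumption made for illustration.

```ts
// Default cadence as documented: max(1, ceil(executableCount / 4)).
function defaultCheckpointEvery(executableCount: number): number {
  return Math.max(1, Math.ceil(executableCount / 4));
}

// Assumed firing rule: checkpoint after every N-th executable step.
function checkpointSteps(
  executableCount: number,
  every: number = defaultCheckpointEvery(executableCount),
): number[] {
  const fired: number[] = [];
  for (let step = 1; step <= executableCount; step++) {
    if (step % every === 0) fired.push(step);
  }
  return fired;
}

console.log(defaultCheckpointEvery(3));  // 1
console.log(defaultCheckpointEvery(8));  // 2
console.log(defaultCheckpointEvery(10)); // 3
console.log(checkpointSteps(8));         // [ 2, 4, 6, 8 ]
```

Under that assumed rule, any pipeline of four or fewer executable steps checkpoints after every step, which matches the "floor of every step on tiny pipelines" wording.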
961
+ ### Timeout via `AbortSignal`
962
+
963
+ ```ts
964
+ const result = await pipeline.generate(ctx, input, {
965
+ onCheckpoint: async (snap, { signal }) => {
966
+ await fetch("/persist", { method: "POST", body: JSON.stringify(snap), signal });
967
+ },
968
+ checkpointTimeout: 500, // ms — AbortSignal fires, CheckpointTimeoutError raised
969
+ });
970
+ ```
971
+
972
+ A timed-out `onCheckpoint` raises `CheckpointTimeoutError`, which (like any `onCheckpoint` throw) bypasses `.catch()` and reaches the caller bare. `.finally()` still runs; any finally errors get a `console.warn`.
973
+
974
+ ### `stepShapeHash` and `resumeFrom`
975
+
976
+ Each checkpoint snapshot carries a SHA-256 of the workflow's structural shape (index + type + id + recursive nested workflow shapes). `resumeFrom` verifies the hash matches before continuing:
977
+
978
+ ```ts
979
+ const resumed = pipeline.resumeFrom(snapshot); // throws on shape mismatch
980
+ const resumed = pipeline.resumeFrom(snapshot, { skipShapeCheck: true }); // override
981
+ ```
982
+
983
+ Common shape changes that invalidate snapshots: insertion, removal, reorder, type-swap with same id, nested-workflow refactor. **Agent identity is NOT in the hash** — two checkpoints from runs that used different agent configs (same agent id) hash identically. Version your agents by content if resume-trust matters.
984
+
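To make the "structure only" point concrete, here is a minimal sketch of a shape hash. The `NodeShape` type, the serialization format, and the function names are assumptions; the README only states that index, type, id, and nested shapes feed a SHA-256. It uses Node's built-in `node:crypto` module.

```ts
import { createHash } from "node:crypto";

// Hypothetical shape description: structure only, no agent configuration.
type NodeShape = {
  type: string;
  id: string;
  children?: NodeShape[]; // steps of a nested workflow
};

// Serialize index + type + id recursively. The real serialization format
// is not documented here; this one is an assumption.
function shapeString(nodes: readonly NodeShape[]): string {
  return nodes
    .map((n, i) => `${i}:${n.type}:${n.id}[${n.children ? shapeString(n.children) : ""}]`)
    .join("|");
}

function shapeHash(nodes: readonly NodeShape[]): string {
  return createHash("sha256").update(shapeString(nodes)).digest("hex");
}

const original: NodeShape[] = [
  { type: "step", id: "classify" },
  { type: "step", id: "summarize" },
];
const reordered: NodeShape[] = [
  { type: "step", id: "summarize" },
  { type: "step", id: "classify" },
];

console.log(shapeHash(original) === shapeHash(original));  // true
console.log(shapeHash(original) === shapeHash(reordered)); // false
```

Nothing about the agent behind `classify` enters the string being hashed, so swapping its model or prompt would leave the hash unchanged. That is the gap the agent-identity caveat above warns about.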
985
+ ### Stream-mode caveats
986
+
987
+ - Each `onCheckpoint` fire pauses the stream writer while it awaits — for chunky checkpoints, prefer larger cadence.
988
+ - Per-checkpoint `JSON.stringify` cost grows with `state.output`; the example above uses `checkpointEvery: 5` to amortize.
989
+ - Serializing consumers should leave `freezeSnapshots: false` — `JSON.stringify` already copies.
990
+
991
+ ### Memoization
992
+
993
+ `stepShapeHash` is memoized per terminal-workflow instance. **Build pipelines once at module load and call `generate()` many times** to amortize. Per-request construction defeats memoization.
994
+
995
+ ### `.catch()` placed before `resumeFromIndex` is dead
996
+
997
+ After a checkpoint-resume, any `.catch()` nodes BEFORE the resume index never fire (they're skipped along with all earlier steps). Place catches at the end of the workflow or strategically late.
998
+
999
+ ### Gate-vs-checkpoint resume asymmetry
1000
+
1001
+ Gate snapshots use a reorder-tolerant id-scan fallback in `loadState`. Checkpoint snapshots use `stepShapeHash`, which is reorder-strict. A workflow with both has two different resume semantics — when in doubt, bump a workflow version id and route old snapshots to old code.
1002
+
1003
+ ### Catastrophic combos
1004
+
1005
+ `validateRunOptions` throws synchronously on:
1006
+ - `checkpointEvery` and `checkpointWhen` both set (mutually exclusive)
1007
+ - `checkpointEvery` not a positive integer
1008
+ - `checkpointTimeout` not a finite positive number
1009
+ - `freezeSnapshots: true + checkpointEvery: 1` on a workflow of 8+ steps (catastrophic perf — pass `"iAcceptThePerformanceCost"` to bypass)
1010
+
1011
+ And warns once on `freezeSnapshots: true + cadence <= 2` (suspicious but legal).
1012
+
803
1013
  ### Limitations
804
1014
 
805
- Gates inside nested workflows, `foreach()`, and `repeat()` are not yet supported — a descriptive error is thrown at runtime. Gates at the top level of a workflow work in all cases.
1015
+ Gates inside nested workflows, `foreach()`, and `repeat()` are not yet supported — `NestedGateUnsupportedError` is thrown at runtime. Gates at the top level of a workflow work in all cases.
1016
+
1017
+ ```ts
1018
+ import { NestedGateUnsupportedError } from "pipeai";
1019
+
1020
+ try {
1021
+ await pipeline.generate(ctx, input);
1022
+ } catch (e) {
1023
+ if (e instanceof NestedGateUnsupportedError) {
1024
+ console.log(`gate "${e.gateId}" in nested workflow "${e.workflowId}"`);
1025
+ // e.siblingErrors — non-gate rejections from concurrent foreach siblings
1026
+ // e.siblingSuspensions — other items in concurrent foreach that also suspended
1027
+ }
1028
+ }
1029
+ ```
1030
+
1031
+ > **Middleware-wrapping caveat:** `NestedGateUnsupportedError` `instanceof` is only stable when caught close to the call site. App-specific error wrappers that re-throw as their own types defeat the check. Preserve `cause` if you wrap.
1032
+
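One way to keep the check working through wrappers is to walk the `cause` chain instead of testing only the outermost error. The sketch below uses a stand-in error class. `GateLikeError`, `findInCauseChain`, and `wrap` are hypothetical names, not part of pipeai.

```ts
// Stand-in for a library error class (hypothetical).
class GateLikeError extends Error {
  constructor(public readonly gateId: string) {
    super(`gate "${gateId}" is not supported here`);
    this.name = "GateLikeError";
    // Keeps instanceof working when compiled to ES5 targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Walk the `cause` chain until an instance of `ctor` is found.
function findInCauseChain<E extends Error>(
  error: unknown,
  ctor: new (...args: never[]) => E,
): E | undefined {
  let current: unknown = error;
  const seen = new Set<unknown>(); // guards against cyclic cause chains
  while (current !== undefined && current !== null && !seen.has(current)) {
    if (current instanceof ctor) return current;
    seen.add(current);
    current = (current as { cause?: unknown }).cause;
  }
  return undefined;
}

// Middleware-style wrapper that preserves the original error as `cause`.
function wrap(original: unknown): Error {
  return Object.assign(new Error("request failed"), { cause: original });
}

const wrapped = wrap(wrap(new GateLikeError("review")));
console.log(wrapped instanceof GateLikeError);                 // false
console.log(findInCauseChain(wrapped, GateLikeError)?.gateId); // review
```

A wrapper that drops `cause` leaves nothing for the walk to find, which is why preserving it matters.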
1033
+ > **Foreach concurrency hazard:** a nested gate inside concurrent `foreach` waits for siblings to complete — sibling LLM calls bill, sibling DB writes commit. Either use `concurrency: 1` or move the gate above the `foreach`. Sibling-side non-gate errors are preserved in `result.warnings` (`source: "foreach-sibling"`) and attached to the marker via `siblingErrors`. The lowest-index suspending item wins the marker; the rest contribute to `siblingSuspensions`.
1034
+
1035
+ ### Snapshot immutability (opt-in)
1036
+
1037
+ By default snapshots and `result.warnings` are mutable. Pass `freezeSnapshots: true` in `RunOptions` to recursively `Object.freeze` them — useful when you serialize through an in-memory queue and want to catch accidental mutation:
1038
+
1039
+ ```ts
1040
+ const result = await pipeline.generate(ctx, input, { freezeSnapshots: true });
1041
+ ```
1042
+
1043
+ The same flag governs gate snapshots, F1's checkpoint snapshots (when shipped), and the warnings array. **For serializing consumers, leave it `false`** — `JSON.stringify` already copies, and freezing every step is wasted work. `runOptions` does **not** propagate into nested workflows.
1044
+
1045
+ Caveat: `Object.freeze(new Map())` doesn't prevent `.set()`. Maps and Sets inside payloads lose immutability.
1046
+
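The Map caveat is plain JavaScript behavior and can be reproduced without the library. In this standalone sketch, the `payload` object is made up for illustration.

```ts
const payload = Object.freeze({
  tags: Object.freeze(new Map<string, number>()),
  counts: Object.freeze([1, 2, 3]),
});

// A frozen Map still accepts writes: its entries live in internal slots,
// not in own properties, so Object.freeze has nothing to lock.
payload.tags.set("retries", 1);
console.log(payload.tags.size); // 1

// A frozen array, by contrast, rejects mutation.
let arrayWriteFailed = false;
try {
  (payload.counts as number[]).push(4);
} catch {
  arrayWriteFailed = true; // push throws a TypeError on a frozen array
}
console.log(arrayWriteFailed); // true
```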
1047
+ ## Observability via `Workflow.create({ observability })`
1048
+
1049
+ Pass an `observability` object to `Workflow.create()` to receive lifecycle events for every node in the workflow:
1050
+
1051
+ ```ts
1052
+ import { Workflow, type WorkflowObservability } from "pipeai";
1053
+
1054
+ const obs: WorkflowObservability = {
1055
+ onStepStart: ({ stepId, type, ctx, input }) => {
1056
+ console.log(`step ${stepId} (${type}) starting`);
1057
+ },
1058
+ onStepFinish: ({ stepId, type, output, durationMs, suspended }) => {
1059
+ console.log(`step ${stepId} (${type}) finished in ${durationMs}ms, suspended=${suspended}`);
1060
+ },
1061
+ onStepError: ({ stepId, type, error, durationMs }) => {
1062
+ console.error(`step ${stepId} (${type}) threw after ${durationMs}ms`, error);
1063
+ },
1064
+ };
1065
+
1066
+ const pipeline = Workflow.create<Ctx, string>({ observability: obs })
1067
+ .step("classify", classifier)
1068
+ .step("respond", responder);
1069
+ ```
1070
+
1071
+ The hooks are threaded through every builder return, so any chain following `Workflow.create({ observability })` keeps the same hooks. `ResumedWorkflow` (gate resume via `loadState`) and `CheckpointResumedWorkflow` (checkpoint resume via `resumeFrom`) also inherit them, so events fire on resumed runs without re-wiring.
1072
+
1073
+ ### Per-node firing rules
1074
+
1075
+ | Node | `onStepStart` | `onStepFinish` (`suspended`) | `onStepError` |
1076
+ |---|---|---|---|
1077
+ | step / nested / branch / foreach / parallel / repeat | always | when body returns (`false`) | on body throw |
1078
+ | gate (suspends) | always | `suspended: true` | never |
1079
+ | gate (cond false → skip) | always | `suspended: false` | never |
1080
+ | catch | only when `pendingError` set | when `catchFn` returns | when `catchFn` throws |
1081
+ | finally | always (runs even after suspension) | always (`suspended: false`) | when body throws |
1082
+
1083
+ Skip-checked nodes (suspension or error state already set on entry) emit **nothing** — `.finally()` is the exception.
1084
+
1085
+ ### Per-item events for `foreach` and `parallel`
1086
+
1087
+ `foreach` and `parallel` ALSO fire per-item events:
1088
+
1089
+ ```ts
1090
+ const obs: WorkflowObservability = {
1091
+ onItemStart: ({ stepId, type, itemIndex, input }) => { /* ... */ },
1092
+ onItemFinish: ({ stepId, type, itemIndex, output, durationMs }) => { /* ... */ },
1093
+ onItemError: ({ stepId, type, itemIndex, error, durationMs }) => { /* ... */ },
1094
+ };
1095
+ ```
1096
+
1097
+ - For `foreach`: `itemIndex` is the item's numeric index.
1098
+ - For `parallel` record form: `itemIndex` is the branch's string key.
1099
+ - For `parallel` tuple form: `itemIndex` is the branch's numeric index.
1100
+ - `repeat` does **NOT** emit per-item events. Its iteration count is data-dependent, so per-item events would mislead.
1101
+
1102
+ ### Error semantics inside hooks
1103
+
1104
+ - Errors thrown inside `onStepStart`, `onStepFinish`, `onItemStart`, `onItemFinish`, `onItemError` are captured into `result.warnings` with the matching `source` tag and mirrored to `console.error`. The workflow continues.
1105
+ - Errors thrown inside `onStepError` on the normal path cause the ORIGINAL step error to reach the caller with `error.cause = obsError`. The `instanceof` of the original error is preserved.
1106
+ - `onCheckpoint` failures fire `onStepError({ stepId: CHECKPOINT_STEP_ID, type: "step", ... })`.
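
The first rule above (hook errors become warnings and the run continues) can be pictured with a small wrapper. This is a minimal sketch under assumptions: `callHookSafely`, the `HookWarning` shape, and the `"onStepStart"` source string are hypothetical names for illustration, not pipeai's internals.

```typescript
// Assumed warning shape for this sketch; the real entries in
// result.warnings may carry different fields.
type HookWarning = { source: string; error: unknown };

// Hypothetical wrapper: run a hook, capture any throw as a warning,
// mirror it to console.error, and let the caller continue.
function callHookSafely<A>(
  source: string,
  hook: ((arg: A) => void) | undefined,
  arg: A,
  warnings: HookWarning[],
): void {
  if (!hook) return;
  try {
    hook(arg);
  } catch (error) {
    warnings.push({ source, error });
    console.error(`[observability] ${source} threw`, error);
  }
}

const warnings: HookWarning[] = [];
const failingHook = (_arg: { stepId: string }): void => {
  throw new Error("metrics backend unavailable");
};

callHookSafely("onStepStart", failingHook, { stepId: "classify" }, warnings);

// Execution reaches this line: the hook failure did not abort the caller.
const continued = true;
```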
1107
+
1108
+ ### Concurrent-run-safe OTel pattern
1109
+
1110
+ Don't key observability state on `ctx` alone — concurrent runs share it. Use a per-`runId` key:
1111
+
1112
+ ```ts
1113
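+ import { trace, SpanStatusCode } from "@opentelemetry/api";
+ import { Workflow } from "pipeai";
+
+ // Tracer name is illustrative.
+ const tracer = trace.getTracer("pipeai-example");
+
+ type Ctx = { userId: string; runId: string };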
1114
+ const spans = new Map<string, ReturnType<typeof tracer.startSpan>>();
1115
+
1116
+ const pipeline = Workflow.create<Ctx>({
1117
+ observability: {
1118
+ onStepStart: ({ stepId, type, ctx }) => {
1119
+ const c = ctx as Ctx;
1120
+ spans.set(`${c.runId}:${stepId}`, tracer.startSpan(`${type}:${stepId}`, {
1121
+ attributes: { userId: c.userId },
1122
+ }));
1123
+ },
1124
+ onStepFinish: ({ stepId, ctx, durationMs, suspended }) => {
1125
+ const c = ctx as Ctx; const key = `${c.runId}:${stepId}`;
1126
+ const span = spans.get(key);
1127
+ span?.setAttribute("duration_ms", durationMs);
1128
+ span?.setAttribute("suspended", suspended);
1129
+ span?.end(); spans.delete(key);
1130
+ },
1131
+ onStepError: ({ stepId, ctx, error }) => {
1132
+ const c = ctx as Ctx; const key = `${c.runId}:${stepId}`;
1133
+ const span = spans.get(key);
1134
+ span?.recordException(error as Error);
1135
+ span?.setStatus({ code: SpanStatusCode.ERROR });
1136
+ span?.end(); spans.delete(key);
1137
+ },
1138
+ },
1139
+ }).step(classifier).step(supportAgent);
1140
+ ```
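
To see why the key needs the run id, the following sketch replaces spans with a hypothetical `FakeSpan` record and compares a step-only key with a `runId:stepId` key when two runs enter the same step concurrently. All names here are illustrative.

```typescript
type RunCtx = { userId: string; runId: string };

// Stand-in for a span: it only remembers which run opened it.
type FakeSpan = { openedBy: string };

const byStepOnly = new Map<string, FakeSpan>();
const byRunAndStep = new Map<string, FakeSpan>();

function recordStepStart(ctx: RunCtx, stepId: string): void {
  byStepOnly.set(stepId, { openedBy: ctx.runId });
  byRunAndStep.set(`${ctx.runId}:${stepId}`, { openedBy: ctx.runId });
}

// Two runs for the same user enter "classify" before either finishes.
recordStepStart({ userId: "u1", runId: "run-a" }, "classify");
recordStepStart({ userId: "u1", runId: "run-b" }, "classify");

// The step-only map kept a single entry: run-b overwrote run-a's span.
const stepOnlyCount = byStepOnly.size;
// The composite key keeps one entry per run.
const compositeCount = byRunAndStep.size;
```

With the step-only key, the span for `run-a` is lost and would never be ended; the composite key keeps both runs separate.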
1141
+
1142
+ ## Graph patterns
1143
+
1144
+ The existing combinators compose into common workflow graph shapes — no new primitives needed.
1145
+
1146
+ ### Cycles via `repeat(subWorkflow, { until })`
1147
+
1148
+ Re-run a sub-workflow until a predicate is satisfied:
1149
+
1150
+ ```ts
1151
+ const cycle = Workflow.create<Ctx, Plan>().step(executor).step(critic);
1152
+
1153
+ const agent = Workflow.create<Ctx, string>()
1154
+ .step(planner)
1155
+ .repeat(cycle, { until: ({ output }) => output.satisfied, maxIterations: 5 });
1156
+ ```
1157
+
1158
+ `repeat` runs its body as a sub-workflow; the body's output feeds back as input.
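
The feedback loop can be sketched in plain TypeScript. This is a simplified, synchronous model under assumptions: `repeatUntil` is a hypothetical helper, and returning the last output when `maxIterations` is reached is a guess at the behavior, not something the README states.

```typescript
type RepeatOptions<T> = {
  until: (args: { output: T }) => boolean;
  maxIterations: number;
};

// Hypothetical model of repeat: each iteration's output becomes the next input.
function repeatUntil<T>(body: (input: T) => T, input: T, opts: RepeatOptions<T>): T {
  let current = input;
  for (let i = 0; i < opts.maxIterations; i++) {
    current = body(current);
    if (opts.until({ output: current })) break;
  }
  // Assumption: when the cap is hit, the last output is returned as-is.
  return current;
}

type Draft = { text: string; satisfied: boolean };

let iterations = 0;
const improve = (draft: Draft): Draft => {
  iterations++;
  const text = draft.text + "!";
  return { text, satisfied: text.length >= 5 };
};

const finalDraft = repeatUntil(improve, { text: "ok", satisfied: false }, {
  until: ({ output }) => output.satisfied,
  maxIterations: 5,
});
```

Because the output is fed back in, the body's input and output types have to match, which is why `cycle` above is typed on `Plan` rather than on the outer workflow's `string` input.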
1159
+
1160
+ ### Multi-path branching with rejoin via `.branch(...).step(...)`
1161
+
1162
+ The first step AFTER a `branch` is the rejoin point — the chosen branch's output flows in regardless of which case fired:
1163
+
1164
+ ```ts
1165
+ const pipeline = Workflow.create<Ctx>()
1166
+ .step("classify", classifier)
1167
+ .branch({
1168
+ select: ({ input }) => input as "bug" | "feature",
1169
+ agents: { bug: bugAgent, feature: featureAgent },
1170
+ })
1171
+ .step("persist", ({ input, ctx }) => db.save(ctx.userId, input));
1172
+ ```
1173
+
1174
+ ### Fan-out / fan-in via `.parallel({...}).step(...)`
1175
+
1176
+ `parallel` produces a record/tuple; the next step consumes the combined shape:
1177
+
1178
+ ```ts
1179
+ const pipeline = Workflow.create<Ctx, string>()
1180
+ .step("init", ({ input }) => input)
1181
+ .parallel({ researcher, critic })
1182
+ .step("synthesize", ({ input }) => `${input.researcher} + ${input.critic}`);
1183
+ ```
1184
+
1185
+ Pair with the [rate-limit and ctx-mutation hazards](#fan-out-via-parallel) above.
1186
+
1187
+ ### Self-recursion is NOT supported
1188
+
1189
+ ```ts
1190
+ // Doesn't work — `recur` is undefined at evaluation.
1191
+ let recur;
1192
+ recur = Workflow.create<Ctx, string>()
1193
+ .step(executor)
1194
+ .repeat(recur, { until: () => false }); // ← recur is undefined here
1195
+ ```
1196
+
1197
+ A future `repeat(thunk)` overload (F4.5 candidate) could enable this — the cycle guard inside `stepShapeHash` is already prepared for it.
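
The underlying issue is ordinary JavaScript evaluation order and can be shown without pipeai. In the sketch below, `EagerRef` and `LazyRef` are hypothetical types: the eager form captures the variable's value before the assignment happens, while a thunk defers the lookup until it is called.

```typescript
type EagerRef = { name: string; body: EagerRef | undefined };
type LazyRef = { name: string; body: () => LazyRef };

// Eager: the right-hand side is evaluated first, while `eager` is still
// undefined, so the object captures undefined rather than itself.
let eager: EagerRef | undefined;
eager = { name: "loop", body: eager };

// Lazy: the arrow function reads `lazy` only when invoked, after assignment.
const lazy: LazyRef = { name: "loop", body: () => lazy };

const eagerBody = eager.body;             // undefined
const lazyPointsToSelf = lazy.body() === lazy;
```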
1198
+
1199
+ ## Migration from 0.3.x
1200
+
1201
+ 0.4.0 makes suspension a return value instead of a thrown error, plus seven smaller behavior changes. The full list:
1202
+
1203
+ 1. **`.finally()` runs after a gate suspends.** Code that assumed `finally` ran only on completion must now check `result.status === "complete"`.
1204
+ 2. **Nested-workflow `.finally()` bodies run before `NestedGateUnsupportedError` fires.** Inner finallys see `state.suspension` truthy while running — don't branch on it. Side-effecting inner finallys execute on a path the user perceives as a thrown error.
1205
+ 3. **A throwing `.finally()` no longer aborts subsequent `.finally()` bodies.** All finallys run; their errors accumulate.
1206
+ 4. **`WorkflowSuspended` is deleted.** Migrate `try / catch (e instanceof WorkflowSuspended)` → `if (result.status === "suspended")`.
1207
+ 5. **`WorkflowResult<T>` shape changed.** `const { output } = await pipeline.generate(...)` is now a strict-mode compile error. Use `if (result.status !== "complete") throw …; const { output } = result`.
1208
+ 6. **`stream()` on suspension closes cleanly.** `WorkflowStreamOptions.onError` is **not** invoked for suspension — discriminate via the resolved `output` Promise. Real errors still flow through `onError`. F0 emits a one-time `console.warn` per process when a gate fires in stream mode with `onError` set.
1209
+ 7. **Any `.finally()` body that throws on the completion path produces `AggregateError`.** The contract is stable once any finally is added, and it applies to the single-error case too.
1210
+ 8. **Duplicate `(type, id)` pairs in the same workflow throw at builder finalization.** `foreach(agentX).foreach(agentX)` and back-to-back default-id `branch(...)` callers must pass an explicit `{ id }`. The same applies to `step(agent, { id })` when reusing an agent in two steps.
1211
+
1212
+ Before:
1213
+
1214
+ ```ts
1215
+ import { WorkflowSuspended } from "pipeai";
1216
+ try {
1217
+ const { output } = await pipeline.generate(ctx, input);
1218
+ return output;
1219
+ } catch (e) {
1220
+ if (e instanceof WorkflowSuspended) {
1221
+ await db.saveSnapshot(e.snapshot);
1222
+ return null;
1223
+ }
1224
+ throw e;
1225
+ }
1226
+ ```
1227
+
1228
+ After:
1229
+
1230
+ ```ts
1231
+ const result = await pipeline.generate(ctx, input);
1232
+ if (result.status === "suspended") {
1233
+ await db.saveSnapshot(result.snapshot);
1234
+ return null;
1235
+ }
1236
+ return result.output;
1237
+ ```
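
The "After" form works because the result is a union discriminated on `status`. The sketch below models that with a hypothetical `SketchResult<T>` type built only from the fields shown above (`status`, `output`, `snapshot`); the real `WorkflowResult<T>` may carry more fields or more statuses.

```typescript
// Assumed shape for illustration; not copied from pipeai's type definitions.
type SketchResult<T> =
  | { status: "complete"; output: T }
  | { status: "suspended"; snapshot: unknown };

// Narrow on status before touching output, as migration item 5 recommends.
function unwrap<T>(result: SketchResult<T>): T {
  if (result.status !== "complete") {
    throw new Error(`workflow did not complete: ${result.status}`);
  }
  // Here TypeScript has narrowed result to the "complete" variant.
  return result.output;
}

const done: SketchResult<string> = { status: "complete", output: "hello" };
const paused: SketchResult<string> = { status: "suspended", snapshot: { step: "review" } };

const value = unwrap(done);

let suspendedMessage = "";
try {
  unwrap(paused);
} catch (e) {
  suspendedMessage = (e as Error).message;
}
```

Destructuring `output` directly from the union fails to type-check because the `"suspended"` variant has no `output` property, which is the compile error item 5 describes.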
806
1238
 
807
1239
  ## Full Example
808
1240