pipeai 0.3.0 → 0.8.0

package/README.md CHANGED
@@ -46,7 +46,7 @@ type Ctx = {
46
46
  db: Database;
47
47
  };
48
48
 
49
- const assistant = new Agent<Ctx>({
49
+ const assistant = new Agent<Ctx, string>({
50
50
  id: "assistant",
51
51
  model: openai("gpt-4o"),
52
52
  system: "You are a helpful assistant.",
@@ -84,7 +84,7 @@ const classificationSchema = z.object({
84
84
  summary: z.string(),
85
85
  });
86
86
 
87
- const classifier = new Agent<Ctx>({
87
+ const classifier = new Agent<Ctx, { title: string; body: string }>({
88
88
  id: "classifier",
89
89
  input: z.object({ title: z.string(), body: z.string() }),
90
90
  output: Output.object({ schema: classificationSchema }),
@@ -102,7 +102,7 @@ result.output; // { priority: "high", category: "bug", summary: "..." }
102
102
  Most config fields accept a static value or a `(ctx, input) => value` function:
103
103
 
104
104
  ```ts
105
- const agent = new Agent<Ctx>({
105
+ const agent = new Agent<Ctx, string>({
106
106
  id: "adaptive",
107
107
  model: (ctx) => ctx.isPremium ? openai("gpt-4o") : openai("gpt-4o-mini"),
108
108
  system: (ctx) => `You assist ${ctx.userName}. Role: ${ctx.role}.`,
@@ -120,7 +120,7 @@ const agent = new Agent<Ctx>({
120
120
  Same callback names as AI SDK v6, extended with `ctx`, `input`, and `writer`. The AI SDK event payload is available as `result`. When the agent runs inside a streaming workflow, `writer` is available for writing metadata or custom stream parts:
121
121
 
122
122
  ```ts
123
- const agent = new Agent<Ctx>({
123
+ const agent = new Agent<Ctx, string>({
124
124
  id: "monitored",
125
125
  model: openai("gpt-4o"),
126
126
  prompt: (ctx, input) => input,
@@ -146,6 +146,7 @@ const agent = new Agent<Ctx>({
146
146
  | `description` | `string` | Agent description (used by `asTool()` for tool description). |
147
147
  | `input` | `ZodType` | Input schema. Required for `asTool()`. Infers `TInput`. |
148
148
  | `output` | `Output` | AI SDK Output (e.g. `Output.object({ schema })`). Infers `TOutput`. |
149
+ | `validateOutput` | `ZodType<TOutput>` | Optional runtime guard. Validates the structured `output` after the SDK parses it (distinct from `tool.outputSchema`). Catches SDK-side parse drift. |
149
150
  | `model` | `Resolvable` | Language model. Static or `(ctx, input) => model`. |
150
151
  | `system` | `Resolvable` | System prompt. |
151
152
  | `prompt` | `Resolvable` | String prompt. Mutually exclusive with `messages`. |
@@ -153,7 +154,7 @@ const agent = new Agent<Ctx>({
153
154
  | `tools` | `Resolvable` | Tool map. Supports `Tool`, `ToolProvider`, and `agent.asTool()`. |
154
155
  | `activeTools` | `Resolvable` | Subset of tool names to enable. |
155
156
  | `toolChoice` | `Resolvable` | Tool choice strategy. Static or `(ctx, input) => toolChoice`. |
156
- | `stopWhen` | `Resolvable` | Condition for stopping the tool loop. Static or `(ctx, input) => condition`. |
157
+ | `stopWhen` | `StopCondition` &#124; `StopCondition[]` | Condition(s) for stopping the tool loop. **Static only** — not a `Resolvable`. A bare function is ambiguous with the resolver form, so dynamic stop conditions require building the agent per call. |
157
158
  | `onStepFinish`| `({ result, ctx, input, writer? })`| Called after each step. `writer` available in streaming workflows. |
158
159
  | `onFinish` | `({ result, ctx, input, writer? })`| Called when all steps complete. |
159
160
  | `onError` | `({ error, ctx, input, writer? })` | Called on error. |
@@ -164,7 +165,7 @@ const agent = new Agent<Ctx>({
164
165
  `asTool()` compiles an agent into a standard AI SDK `Tool`. The parent agent's LLM tool loop handles routing — no dedicated router needed.
165
166
 
166
167
  ```ts
167
- const codingAgent = new Agent<Ctx>({
168
+ const codingAgent = new Agent<Ctx, { task: string; language?: string }>({
168
169
  id: "coding",
169
170
  description: "Writes and modifies code.",
170
171
  input: z.object({
@@ -176,7 +177,7 @@ const codingAgent = new Agent<Ctx>({
176
177
  tools: { writeFile, readFile },
177
178
  });
178
179
 
179
- const qaAgent = new Agent<Ctx>({
180
+ const qaAgent = new Agent<Ctx, { question: string }>({
180
181
  id: "qa",
181
182
  description: "Answers technical questions.",
182
183
  input: z.object({ question: z.string() }),
@@ -186,7 +187,7 @@ const qaAgent = new Agent<Ctx>({
186
187
  });
187
188
 
188
189
  // Parent agent uses sub-agents as tools
189
- const orchestrator = new Agent<Ctx>({
190
+ const orchestrator = new Agent<Ctx, string>({
190
191
  id: "orchestrator",
191
192
  model: openai("gpt-4o"),
192
193
  system: "Delegate work to the right specialist.",
@@ -221,7 +222,7 @@ codingAgent.asTool(ctx, {
221
222
  `asTool(ctx)` bakes the context in at call time. `asToolProvider()` defers context resolution — the tool is created with the correct context when another agent's tool resolution runs:
222
223
 
223
224
  ```ts
224
- const orchestrator = new Agent<Ctx>({
225
+ const orchestrator = new Agent<Ctx, string>({
225
226
  id: "orchestrator",
226
227
  model: openai("gpt-4o"),
227
228
  system: "Delegate work to the right specialist.",
@@ -267,7 +268,7 @@ const cancelOrder = define({
267
268
  });
268
269
 
269
270
  // Mix with plain AI SDK tools freely
270
- const agent = new Agent<Ctx>({
271
+ const agent = new Agent<Ctx, string>({
271
272
  id: "support",
272
273
  model: openai("gpt-4o"),
273
274
  prompt: (ctx, input) => input,
@@ -302,15 +303,24 @@ const pipeline = Workflow.create<Ctx>()
302
303
 
303
304
  ```ts
304
305
  // Non-streaming — calls agent.generate() at each step
305
- const { output } = await pipeline.generate(ctx, initialInput);
306
+ const result = await pipeline.generate(ctx, initialInput);
307
+ if (result.status === "complete") {
308
+ console.log(result.output);
309
+ } else {
310
+ // result.status === "suspended" — see Human-in-the-Loop section
311
+ await db.saveSnapshot(result.snapshot);
312
+ }
306
313
 
307
314
  // Streaming — calls agent.stream() at each step, merges into a single ReadableStream
308
315
  const { stream, output } = pipeline.stream(ctx, initialInput);
309
316
  return new Response(stream);
310
317
 
311
- const finalOutput = await output; // resolves when pipeline completes
318
+ // output resolves to WorkflowResult<T>; it never rejects on suspension
319
+ const final = await output;
312
320
  ```
313
321
 
322
+ The return value is a `WorkflowResult<T>` discriminated union — either `{ status: "complete", output, warnings }` or `{ status: "suspended", snapshot, warnings }`. `warnings` is always present on both branches (`readonly WorkflowWarning[]`, possibly empty).
323
+
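The union described above can be handled with ordinary TypeScript narrowing. The following is a minimal, self-contained sketch rather than the library's own type definitions: the `WorkflowWarning` fields and the trimmed-down `GateSnapshotLike` are assumptions made for illustration, and `summarize` is a hypothetical helper.

```ts
// Hypothetical stand-ins; the real pipeai types may carry more fields.
type WorkflowWarning = { source: string; error: unknown };
type GateSnapshotLike = { gateId: string; gatePayload: unknown };

type WorkflowResult<T> =
  | { status: "complete"; output: T; warnings: readonly WorkflowWarning[] }
  | { status: "suspended"; snapshot: GateSnapshotLike; warnings: readonly WorkflowWarning[] };

// Narrow on `status`; `warnings` is readable on both branches.
function summarize<T>(result: WorkflowResult<T>): string {
  switch (result.status) {
    case "complete":
      return `complete with ${result.warnings.length} warning(s)`;
    case "suspended":
      return `suspended at gate "${result.snapshot.gateId}"`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a variant is added.
      const unreachable: never = result;
      return unreachable;
    }
  }
}
```

Switching on `status` with a `never` default means a future third variant would surface as a compile error instead of a silently unhandled case.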
314
324
  ### Nested workflows
315
325
 
316
326
  Workflows can be passed as steps into other workflows. The nested workflow's steps execute within the parent's runtime state — streams merge naturally, and errors propagate to the parent's `catch()`:
@@ -529,6 +539,53 @@ const pipeline = Workflow.create<Ctx>()
529
539
 
530
540
  **Type safety:** `foreach()` uses `ElementOf<TOutput>` to extract the array element type. If the previous step doesn't produce an array, the call is rejected at compile time.
531
541
 
542
+ ### Fan-out via `parallel()`
543
+
544
+ `parallel()` runs several branches against the **same input** concurrently and collects their results. It has two type-overload forms: record (keyed by name) and tuple (positional):
545
+
546
+ ```ts
547
+ // Record form — returns { researcher: ResearchOutput, critic: CriticOutput }
548
+ const pipeline = Workflow.create<Ctx, string>()
549
+ .step("classify", classifier)
550
+ .parallel({ researcher, critic });
551
+
552
+ // Tuple form — returns [ResearchOutput, CriticOutput]
553
+ const pipeline = Workflow.create<Ctx, string>()
554
+ .step("classify", classifier)
555
+ .parallel([researcher, critic] as const);
556
+ ```
557
+
558
+ The same input (`state.output`) is fed to each branch. Default concurrency is `min(branches.length, 5)` — most users want fan-out, but the cap protects against rate-limit pressure. Pass `concurrency: Infinity` (or `branches.length`) to opt out.
559
+
560
+ ```ts
561
+ .parallel({ a, b, c, d, e, f, g, h }, { concurrency: 3 }) // explicit override
562
+ .parallel({ a, b, c, d, e, f, g, h }, { concurrency: Infinity }) // full fan-out
563
+ ```
564
+
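To make the concurrency cap concrete, here is a rough sketch of how a default of `min(branches.length, 5)` and a bounded fan-out could behave. It does not reproduce pipeai's internals: `resolveConcurrency` and `runLimited` are hypothetical helpers written for this illustration only.

```ts
// Hypothetical helper: resolve the effective limit the way the text describes.
function resolveConcurrency(branchCount: number, requested?: number): number {
  const limit = requested ?? Math.min(branchCount, 5); // default cap of 5
  return Math.min(limit, branchCount); // Infinity collapses to "all branches"
}

// Hypothetical helper: run tasks with at most `limit` in flight, preserving result order.
async function runLimited<T>(
  tasks: ReadonlyArray<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // claim a slot synchronously before awaiting
      const task = tasks[index];
      if (task !== undefined) results[index] = await task();
    }
  }

  const workers = Array.from({ length: resolveConcurrency(tasks.length, limit) }, () => worker());
  await Promise.all(workers); // rejects on the first failing task
  return results;
}
```

With eight branches and no override, the sketch resolves to five workers; passing `Infinity` resolves to eight, which matches the opt-out described above.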
565
+ **Generate mode only.** Streams aren't threaded through to branches — interleaving multiple agent streams into one writer is out of scope.
566
+
567
+ #### Per-branch error handling
568
+
569
+ ```ts
570
+ .parallel({ a, b }, {
571
+ onError: ({ error, key, ctx }) => {
572
+ if (key === "a") return "fallback-a"; // substitute
573
+ if (key === "b") return Workflow.SKIP; // record form: undefined slot
574
+ throw error; // rethrow to abort the parallel
575
+ },
576
+ })
577
+ ```
578
+
579
+ `onError` is **bypassed** on the suspension path — if any branch hits a nested gate, the marker reaches the caller without onError running. Non-suspension errors flow through onError in branch order.
580
+
581
+ #### Suspension under parallel
582
+
583
+ Gates inside parallel branches throw `NestedGateUnsupportedError`, same as `foreach` concurrent. The lowest-index suspending branch wins the marker; others contribute to `siblingSuspensions`. Multi-branch suspension semantics are finalized in F0.6 alongside `cancelOnFirstSuspend` — until then, all branches run to completion (or sibling-failure) before the marker reaches the caller.
584
+
585
+ > **Rate-limit hazard:** `parallel`'s default `min(N, 5)` assumes ≥5 RPS headroom on your model provider. Symptoms of overflow: 429s and stair-stepped latency.
586
+
587
+ > **Concurrent ctx-mutation hazard:** branches share the `ctx` object by reference. Treat `ctx` as immutable inside parallel branches.
588
+
532
589
  ### Conditional loops via `repeat()`
533
590
 
534
591
  `repeat()` runs an agent or workflow in a loop until a condition is met. The body's output feeds back as input — same type in, same type out:
@@ -620,9 +677,9 @@ const { stream, output } = pipeline.stream(ctx, initialInput, {
620
677
  | `.branch({ select, agents })` | Key routing. `select` returns a key, runs the matching agent. |
621
678
  | `.foreach(target, opts?)` | Map each array element through an agent or workflow. `opts.concurrency` is the max items in flight (default: 1). `opts.onError` recovers per-item failures; return `Workflow.SKIP` to drop an index. |
622
679
  | `.repeat(target, opts)` | Loop an agent or workflow. Use `{ until }` or `{ while }` (mutually exclusive). `maxIterations` defaults to 10. |
623
- | `.gate(id, opts?)` | Human-in-the-loop suspension point. Throws `WorkflowSuspended` with a serializable snapshot. Resume via `loadState(gateId, snapshot)`. |
624
- | `.catch(id, fn)` | Handle errors. `fn` receives `{ error, ctx, lastOutput, stepId }` and returns a recovery value. |
625
- | `.finally(id, fn)` | Always runs. `fn` receives `{ ctx }`. |
680
+ | `.gate(id, opts?)` | Human-in-the-loop suspension point. Returns a result with `status: "suspended"` carrying a serializable snapshot. Resume via `loadState(gateId, snapshot)`. |
681
+ | `.catch(id, fn)` | Handle errors. `fn` receives `{ error, ctx, lastOutput, stepId }` and returns a recovery value. Bypassed on suspension. |
682
+ | `.finally(id, fn)` | Always runs — including after a gate suspends. `fn` receives `{ ctx }`. Throwing finallys no longer abort subsequent ones; errors aggregate into `AggregateError` on the completion path and into `result.warnings` on the suspension path. |
626
683
 
627
684
  ### Output flow
628
685
 
@@ -647,10 +704,12 @@ Auto-extraction priority for `step()` with an agent:
647
704
 
648
705
  `gate()` suspends a workflow at a designated point, producing a JSON-serializable snapshot. The consumer persists the snapshot, collects human input out-of-band (HTTP, WebSocket, CLI, queue — any transport), then resumes the workflow from where it left off.
649
706
 
707
+ > **0.4.0 breaking change:** suspension is a return value, not a thrown error. `generate()` and `stream()` resolve with `WorkflowResult<T>` — a discriminated union of `{ status: "complete", output, warnings }` and `{ status: "suspended", snapshot, warnings }`. `WorkflowSuspended` has been removed. See [Migration from 0.3.x](#migration-from-03x).
708
+
650
709
  ### Basic gate
651
710
 
652
711
  ```ts
653
- import { Workflow, WorkflowSuspended } from "pipeai";
712
+ import { Workflow } from "pipeai";
654
713
 
655
714
  const pipeline = Workflow.create<Ctx>()
656
715
  .step(draftAgent)
@@ -660,23 +719,27 @@ const pipeline = Workflow.create<Ctx>()
660
719
  .step(publishAgent);
661
720
 
662
721
  // Run — suspends at gate
663
- try {
664
- await pipeline.generate(ctx, input);
665
- } catch (e) {
666
- if (e instanceof WorkflowSuspended) {
667
- await db.saveSnapshot(e.snapshot);
668
- return res.status(202).json(e.snapshot.gatePayload);
669
- }
722
+ const result = await pipeline.generate(ctx, input);
723
+ if (result.status === "suspended") {
724
+ await db.saveSnapshot(result.snapshot);
725
+ return res.status(202).json(result.snapshot.gatePayload);
670
726
  }
727
+ // result.status === "complete" here — TS narrows `output` automatically
728
+ return res.json({ output: result.output });
671
729
 
672
730
  // Resume — load state, pass gate ID + snapshot to generate or stream
673
731
  const snapshot = await db.loadSnapshot(id);
674
732
  const resumed = pipeline.loadState("review", snapshot);
675
- const { output } = await resumed.generate(ctx, humanResponse);
733
+ const resumeResult = await resumed.generate(ctx, humanResponse);
734
+ if (resumeResult.status === "complete") {
735
+ console.log(resumeResult.output);
736
+ }
676
737
  ```
677
738
 
678
739
  The `snapshot` is plain JSON — it survives `JSON.parse(JSON.stringify())`, database storage, and process restarts. The workflow definition (code) stays in the process; only the data is serialized.
679
740
 
741
+ `result.warnings` is **always** present on both branches — an array of non-fatal errors (a throwing `.finally()`, a misbehaving observer). It's `readonly WorkflowWarning[]`, never `undefined`. If you don't care about non-fatal failures, ignore it.
742
+
680
743
  ### Resuming with streaming
681
744
 
682
745
  For chat applications where the client reconnects and needs a live stream for the remaining steps:
@@ -691,21 +754,23 @@ The previous stream is gone — the library only streams forward from the resume
691
754
 
692
755
  ### Streaming suspension
693
756
 
694
- When `stream()` hits a gate, the stream closes cleanly (partial content from steps before the gate is delivered). The `output` promise rejects with `WorkflowSuspended`:
757
+ When `stream()` hits a gate, the stream closes cleanly (partial content from steps before the gate is delivered). The `output` Promise **resolves** with `{ status: "suspended", snapshot, warnings }` — it does **not** reject:
695
758
 
696
759
  ```ts
697
760
  const { stream, output } = pipeline.stream(ctx, input);
698
761
  pipeStreamToResponse(res, stream); // partial content delivered normally
699
762
 
700
- try {
701
- await output;
702
- } catch (e) {
703
- if (e instanceof WorkflowSuspended) {
704
- await db.saveSnapshot(e.snapshot);
705
- }
763
+ const result = await output;
764
+ if (result.status === "suspended") {
765
+ await db.saveSnapshot(result.snapshot);
706
766
  }
767
+ // Real errors (a step throws something other than a gate suspension) still
768
+ // reject the output Promise — keep your try/catch for those, but
769
+ // `WorkflowStreamOptions.onError` is NOT invoked for suspension.
707
770
  ```
708
771
 
772
+ > **Stream-mode dead-air warning:** the stream stays open while `.finally()` bodies run after a gate suspends. Long-running cleanup work causes proportional dead air. If your HTTP read timeout is shorter than your worst-case finally I/O, the connection can disconnect spuriously.
773
+
709
774
  ### Schema validation
710
775
 
711
776
  Add a `schema` to validate the human response at runtime. The schema uses a structural type — any object with a `.parse()` method works (Zod, Valibot, ArkType, etc.):
@@ -740,23 +805,25 @@ const pipeline = Workflow.create<Ctx>()
740
805
  .step("publish", ({ input }) => `published: ${input}`);
741
806
 
742
807
  // First gate
743
- let snapshot: WorkflowSnapshot;
744
- try { await pipeline.generate(ctx, input); }
745
- catch (e) { snapshot = (e as WorkflowSuspended).snapshot; }
808
+ const r1 = await pipeline.generate(ctx, input);
809
+ if (r1.status !== "suspended") throw new Error("expected suspension at review");
810
+ let snapshot = r1.snapshot;
746
811
 
747
812
  // Second gate
748
813
  const resumed1 = pipeline.loadState("review", snapshot);
749
- try { await resumed1.generate(ctx, "first approval"); }
750
- catch (e) { snapshot = (e as WorkflowSuspended).snapshot; }
814
+ const r2 = await resumed1.generate(ctx, "first approval");
815
+ if (r2.status !== "suspended") throw new Error("expected suspension at final-approval");
816
+ snapshot = r2.snapshot;
751
817
 
752
818
  // Complete
753
819
  const resumed2 = pipeline.loadState("final-approval", snapshot);
754
- const { output } = await resumed2.generate(ctx, "final approval");
820
+ const r3 = await resumed2.generate(ctx, "final approval");
821
+ if (r3.status === "complete") console.log(r3.output);
755
822
  ```
756
823
 
757
- ### Merging pre-gate output with response
824
+ ### Manual merge at the call site
758
825
 
759
- The `snapshot.output` field contains the pre-gate output. Use it to merge with the human response:
826
+ The `snapshot.output` field contains the pre-gate output. Merge it with the human response at the call site:
760
827
 
761
828
  ```ts
762
829
  // The step after the gate needs both the draft and the approval
@@ -767,6 +834,8 @@ await resumed.generate(ctx, {
767
834
  });
768
835
  ```
769
836
 
837
+ For automatic merging without exposing `snapshot.output` to the caller, see the `merge` option below.
838
+
770
839
  ### Injecting updated context on resume
771
840
 
772
841
  `ctx` is provided fresh on every `generate()`/`stream()` call — never serialized. Use it to inject updated chat history, refreshed auth tokens, or new database connections:
@@ -794,39 +863,378 @@ const pipeline = Workflow.create<Ctx>()
794
863
  .step(publishAgent);
795
864
  ```
796
865
 
797
- ### Merging pre-gate output with response
866
+ ### Merging pre-gate output with response via `merge`
867
+
868
+ Use `merge` to combine the pre-gate output with the human response into a single value for the next step. Without `merge`, only the human response is forwarded.
798
869
 
799
- Use `merge` to combine the pre-gate output with the human response into a single value for the next step. Without `merge`, only the human response is forwarded:
870
+ `merge` may return any shape; its return type becomes the input type of the next step. The gate's third generic `TMerged` is inferred from the merge return type, so downstream steps type-check against the merged shape:
800
871
 
801
872
  ```ts
802
873
  const pipeline = Workflow.create<Ctx>()
803
874
  .step(draftAgent)
804
875
  .gate("review", {
876
+ schema: approvalSchema,
805
877
  merge: ({ priorOutput, response }) => ({
806
- draft: priorOutput,
807
- approval: response,
878
+ draft: priorOutput, // pre-gate output (TOutput)
879
+ approval: response, // validated human response (TResponse)
808
880
  }),
809
881
  })
810
882
  .step("publish", ({ input }) => {
811
- // input is { draft, approval }
883
+ // input is { draft, approval } — the TMerged shape
812
884
  });
813
885
  ```
814
886
 
815
887
  ### Snapshot shape
816
888
 
889
+ As of 0.5.0, `WorkflowSnapshot` is a discriminated union with three variants — gate snapshots emitted by `.gate()`, checkpoint snapshots emitted by `onCheckpoint`, and the legacy v1 form from 0.4.0 (accepted for one release via the shim):
890
+
817
891
  ```ts
818
- interface WorkflowSnapshot {
819
- version: 1;
892
+ interface GateSnapshot {
893
+ version: 2;
894
+ kind: "gate";
820
895
  resumeFromIndex: number; // step index of the gate
821
896
  output: unknown; // pre-gate output
822
897
  gateId: string; // gate identifier
823
898
  gatePayload: unknown; // data for the human
824
899
  }
900
+
901
+ interface CheckpointSnapshot {
902
+ version: 2;
903
+ kind: "checkpoint";
904
+ resumeFromIndex: number; // index of the NEXT step to run
905
+ output: unknown; // output as of the checkpoint
906
+ stepShapeHash: string; // SHA-256 hex of the workflow's structural shape
907
+ }
908
+
909
+ // Legacy v1 — only accepted by loadState() during one release. Migrate via migrateSnapshot().
910
+ interface LegacyGateSnapshotV1 {
911
+ version: 1;
912
+ kind?: undefined;
913
+ resumeFromIndex: number;
914
+ output: unknown;
915
+ gateId: string;
916
+ gatePayload: unknown;
917
+ }
918
+
919
+ type WorkflowSnapshot = GateSnapshot | CheckpointSnapshot | LegacyGateSnapshotV1;
920
+ ```
921
+
922
+ `WorkflowResult<T>` narrows the suspended-branch `snapshot` to `GateSnapshot` specifically — only gates suspend, so the union widening doesn't pollute the suspended-state API.
923
+
924
+ > **Rolling-deploy hazard:** A 0.4.0 process receiving a 0.5.0-persisted v2 gate snapshot rejects via the strict `version === 1` check. Drain in-flight snapshots before cutover, ship a 0.4.x forward-compat patch ahead, or version-tag storage keys.
925
+
926
+ > **Long-lived storage:** For Redis-without-TTL / S3 / Postgres, call `migrateSnapshot(legacy)` before v0.8.0+ drops v1 acceptance.
927
+
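One way to act on the hazards above is to route snapshots by variant and to tag storage keys with the snapshot version. The sketch below repeats the three interfaces from the README so that it is self-contained. `routeSnapshot`, `storageKey`, and the key format are hypothetical, and nothing here stands in for the library's `migrateSnapshot`.

```ts
interface GateSnapshot {
  version: 2; kind: "gate";
  resumeFromIndex: number; output: unknown; gateId: string; gatePayload: unknown;
}
interface CheckpointSnapshot {
  version: 2; kind: "checkpoint";
  resumeFromIndex: number; output: unknown; stepShapeHash: string;
}
interface LegacyGateSnapshotV1 {
  version: 1; kind?: undefined;
  resumeFromIndex: number; output: unknown; gateId: string; gatePayload: unknown;
}
type WorkflowSnapshot = GateSnapshot | CheckpointSnapshot | LegacyGateSnapshotV1;

// Hypothetical router: `version` separates legacy from v2, `kind` separates the v2 variants.
function routeSnapshot(snapshot: WorkflowSnapshot): "gate" | "checkpoint" | "legacy" {
  if (snapshot.version === 1) return "legacy";
  return snapshot.kind; // narrowed to "gate" | "checkpoint"
}

// Hypothetical key scheme: a version tag keeps old and new processes from reading each other's data.
function storageKey(runId: string, snapshot: WorkflowSnapshot): string {
  return `snapshot:v${snapshot.version}:${runId}`;
}
```

Under this assumed scheme, a process that only understands v1 would never look up a `snapshot:v2:` key during a rolling deploy, so it cannot hit the strict version check on a snapshot it does not understand.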
928
+ ## Step-level checkpointing via `onCheckpoint`
929
+
930
+ Pass `onCheckpoint` in `RunOptions` to receive a v2 checkpoint snapshot after each successful step body. Use this to persist progress so a crashed/restarted process can resume where it left off — no human-in-the-loop required.
931
+
932
+ ```ts
933
+ import { Workflow, type CheckpointSnapshot } from "pipeai";
934
+
935
+ const pipeline = Workflow.create<Ctx, string>()
936
+ .step("classify", classifier)
937
+ .step("summarize", summarizer)
938
+ .step("publish", publisher);
939
+
940
+ let lastSnapshot: CheckpointSnapshot | null = null;
941
+ const result = await pipeline.generate(ctx, "input", {
942
+ onCheckpoint: async (snap) => {
943
+ lastSnapshot = snap;
944
+ await db.write({ key: "run:42", snapshot: snap });
945
+ },
946
+ checkpointEvery: 5, // every 5 executable steps
947
+ });
948
+
949
+ // On restart, resume from the last persisted snapshot:
950
+ const stored = await db.read("run:42");
951
+ const resumed = pipeline.resumeFrom(stored);
952
+ const final = await resumed.generate(ctx); // no response arg — state is seeded
953
+ ```
954
+
955
+ ### Cadence
956
+
957
+ - `checkpointEvery: N` — fire every N executable steps. Defaults to `max(1, ceil(executableCount / 4))` — 4 checkpoints across the run, floor of every step on tiny pipelines.
958
+ - `checkpointWhen({ stepIndex, stepId, ctx }) => boolean` — predicate variant. Mutually exclusive with `checkpointEvery`.
959
+ - `.catch()` and `.finally()` nodes are NOT counted as executable, so adding cleanup doesn't surprise you with extra checkpoints.
960
+
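The default cadence formula can be worked through with a small sketch. `defaultCheckpointEvery` mirrors the stated expression `max(1, ceil(executableCount / 4))`. `checkpointSteps` additionally assumes that a checkpoint fires after every N-th executable step, counted from 1; that counting convention is a guess for illustration, not something the README states.

```ts
// Mirrors the documented default: max(1, ceil(executableCount / 4)).
function defaultCheckpointEvery(executableCount: number): number {
  return Math.max(1, Math.ceil(executableCount / 4));
}

// Assumption: a checkpoint fires after steps N, 2N, 3N, ... (1-based count of executable steps).
function checkpointSteps(
  executableCount: number,
  every: number = defaultCheckpointEvery(executableCount),
): number[] {
  const fired: number[] = [];
  for (let step = 1; step <= executableCount; step++) {
    if (step % every === 0) fired.push(step);
  }
  return fired;
}
```

Under that assumption, an 8-step pipeline checkpoints after steps 2, 4, 6, and 8, while a 3-step pipeline falls back to a checkpoint after every step.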
961
+ ### Timeout via `AbortSignal`
962
+
963
+ ```ts
964
+ const result = await pipeline.generate(ctx, input, {
965
+ onCheckpoint: async (snap, { signal }) => {
966
+ await fetch("/persist", { method: "POST", body: JSON.stringify(snap), signal });
967
+ },
968
+ checkpointTimeout: 500, // ms — AbortSignal fires, CheckpointTimeoutError raised
969
+ });
970
+ ```
971
+
972
+ A timed-out `onCheckpoint` raises `CheckpointTimeoutError`, which (like any `onCheckpoint` throw) bypasses `.catch()` and reaches the caller bare. `.finally()` still runs; any finally errors get a `console.warn`.
973
+
974
+ ### `stepShapeHash` and `resumeFrom`
975
+
976
+ Each checkpoint snapshot carries a SHA-256 of the workflow's structural shape (index + type + id + recursive nested workflow shapes). `resumeFrom` verifies the hash matches before continuing:
977
+
978
+ ```ts
979
+ const resumed = pipeline.resumeFrom(snapshot); // throws on shape mismatch
980
+ const resumed = pipeline.resumeFrom(snapshot, { skipShapeCheck: true }); // override
825
981
  ```
826
982
 
983
+ Common shape changes that invalidate snapshots: insertion, removal, reorder, type-swap with same id, nested-workflow refactor. **Agent identity is NOT in the hash** — two checkpoints from runs that used different agent configs (same agent id) hash identically. Version your agents by content if resume-trust matters.
984
+
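To illustrate why reordering invalidates a checkpoint while an agent config change does not, the sketch below builds a canonical shape string from index, type, id, and nested shapes. This is not pipeai's algorithm: `ShapeNode` and `shapeString` are hypothetical, and the SHA-256 step is omitted to keep the example dependency-free (a digest of a string like this would play the role of `stepShapeHash`).

```ts
// Hypothetical structural description of a workflow node.
type ShapeNode = { type: string; id: string; nested?: readonly ShapeNode[] };

// Canonical string over index + type + id + nested shapes; anything else is ignored.
function shapeString(nodes: readonly ShapeNode[]): string {
  return nodes
    .map((node, index) => {
      const nested = node.nested ? `(${shapeString(node.nested)})` : "";
      return `${index}:${node.type}:${node.id}${nested}`;
    })
    .join("|");
}

// Extra fields such as a model name are not part of the shape.
type StepWithConfig = ShapeNode & { model?: string };

const original: StepWithConfig[] = [
  { type: "step", id: "classify", model: "gpt-4o" },
  { type: "step", id: "summarize", model: "gpt-4o" },
];
const reordered: StepWithConfig[] = [original[1]!, original[0]!];
const reconfigured: StepWithConfig[] = original.map((s) => ({ ...s, model: "gpt-4o-mini" }));
```

Here `shapeString(original)` differs from `shapeString(reordered)` but equals `shapeString(reconfigured)`, which is the blind spot the paragraph above warns about.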
985
+ ### Stream-mode caveats
986
+
987
+ - Each `onCheckpoint` fire pauses the stream writer while it awaits — for chunky checkpoints, prefer larger cadence.
988
+ - Per-checkpoint `JSON.stringify` cost grows with `state.output`; the example above uses `checkpointEvery: 5` to amortize.
989
+ - Serializing consumers should leave `freezeSnapshots: false` — `JSON.stringify` already copies.
990
+
991
+ ### Memoization
992
+
993
+ `stepShapeHash` is memoized per terminal-workflow instance. **Build pipelines once at module load and call `generate()` many times** to amortize. Per-request construction defeats memoization.
994
+
995
+ ### `.catch()` placed before `resumeFromIndex` is dead
996
+
997
+ After a checkpoint-resume, any `.catch()` nodes BEFORE the resume index never fire (they're skipped along with all earlier steps). Place catches at the end of the workflow or strategically late.
998
+
999
+ ### Gate-vs-checkpoint resume asymmetry
1000
+
1001
+ Gate snapshots use a reorder-tolerant id-scan fallback in `loadState`. Checkpoint snapshots use `stepShapeHash`, which is reorder-strict. A workflow with both has two different resume semantics — when in doubt, bump a workflow version id and route old snapshots to old code.
1002
+
1003
+ ### Catastrophic combos
1004
+
1005
+ `validateRunOptions` throws synchronously on:
1006
+ - `checkpointEvery` and `checkpointWhen` both set (mutually exclusive)
1007
+ - `checkpointEvery` not a positive integer
1008
+ - `checkpointTimeout` not a finite positive number
1009
+ - `freezeSnapshots: true + checkpointEvery: 1` on a workflow of 8+ steps (catastrophic perf — pass `"iAcceptThePerformanceCost"` to bypass)
1010
+
1011
+ And warns once on `freezeSnapshots: true + cadence <= 2` (suspicious but legal).
1012
+
827
1013
  ### Limitations
828
1014
 
829
- Gates inside nested workflows, `foreach()`, and `repeat()` are not yet supported — a descriptive error is thrown at runtime. Gates at the top level of a workflow work in all cases.
1015
+ Gates inside nested workflows, `foreach()`, and `repeat()` are not yet supported — `NestedGateUnsupportedError` is thrown at runtime. Gates at the top level of a workflow work in all cases.
1016
+
1017
+ ```ts
1018
+ import { NestedGateUnsupportedError } from "pipeai";
1019
+
1020
+ try {
1021
+ await pipeline.generate(ctx, input);
1022
+ } catch (e) {
1023
+ if (e instanceof NestedGateUnsupportedError) {
1024
+ console.log(`gate "${e.gateId}" in nested workflow "${e.workflowId}"`);
1025
+ // e.siblingErrors — non-gate rejections from concurrent foreach siblings
1026
+ // e.siblingSuspensions — other items in concurrent foreach that also suspended
1027
+ }
1028
+ }
1029
+ ```
1030
+
1031
+ > **Middleware-wrapping caveat:** `NestedGateUnsupportedError` `instanceof` is only stable when caught close to the call site. App-specific error wrappers that re-throw as their own types defeat the check. Preserve `cause` if you wrap.
1032
+
1033
+ > **Foreach concurrency hazard:** a nested gate inside concurrent `foreach` waits for siblings to complete — sibling LLM calls bill, sibling DB writes commit. Either use `concurrency: 1` or move the gate above the `foreach`. Sibling-side non-gate errors are preserved in `result.warnings` (`source: "foreach-sibling"`) and attached to the marker via `siblingErrors`. The lowest-index suspending item wins the marker; the rest contribute to `siblingSuspensions`.
1034
+
1035
+ ### Snapshot immutability (opt-in)
1036
+
1037
+ By default snapshots and `result.warnings` are mutable. Pass `freezeSnapshots: true` in `RunOptions` to recursively `Object.freeze` them — useful when you serialize through an in-memory queue and want to catch accidental mutation:
1038
+
1039
+ ```ts
1040
+ const result = await pipeline.generate(ctx, input, { freezeSnapshots: true });
1041
+ ```
1042
+
1043
+ The same flag governs gate snapshots, F1's checkpoint snapshots (when shipped), and the warnings array. **For serializing consumers, leave it `false`** — `JSON.stringify` already copies, and freezing every step is wasted work. `runOptions` does **not** propagate into nested workflows.
1044
+
1045
+ Caveat: `Object.freeze(new Map())` doesn't prevent `.set()`. Maps and Sets inside payloads lose immutability.
1046
+
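The Map caveat is easy to reproduce in isolation. The `deepFreeze` helper below is a hypothetical stand-in for a recursive `Object.freeze`, not the library's implementation. It shows that plain nested objects end up frozen while a `Map` inside the payload can still be mutated.

```ts
// Hypothetical recursive freeze over own enumerable string keys.
function deepFreeze<T>(value: T): T {
  if (value !== null && typeof value === "object" && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const key of Object.keys(value)) {
      deepFreeze((value as Record<string, unknown>)[key]);
    }
  }
  return value;
}

const payload = deepFreeze({
  draft: { title: "v1" },
  tags: new Map<string, number>(),
});

// Freezing only locks properties; a Map stores entries in internal slots,
// so .set() still succeeds on a frozen Map.
payload.tags.set("edited", 1);
```

If payloads need real immutability for collections, one option is to convert Maps and Sets to plain arrays or objects before they reach the snapshot.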
1047
+ ## Observability via `Workflow.create({ observability })`
1048
+
1049
+ Pass an `observability` object to `Workflow.create()` to receive lifecycle events for every node in the workflow:
1050
+
1051
+ ```ts
1052
+ import { Workflow, type WorkflowObservability } from "pipeai";
1053
+
1054
+ const obs: WorkflowObservability = {
1055
+ onStepStart: ({ stepId, type, ctx, input }) => {
1056
+ console.log(`step ${stepId} (${type}) starting`);
1057
+ },
1058
+ onStepFinish: ({ stepId, type, output, durationMs, suspended }) => {
1059
+ console.log(`step ${stepId} (${type}) finished in ${durationMs}ms, suspended=${suspended}`);
1060
+ },
1061
+ onStepError: ({ stepId, type, error, durationMs }) => {
1062
+ console.error(`step ${stepId} (${type}) threw after ${durationMs}ms`, error);
1063
+ },
1064
+ };
1065
+
1066
+ const pipeline = Workflow.create<Ctx, string>({ observability: obs })
1067
+ .step("classify", classifier)
1068
+ .step("respond", responder);
1069
+ ```
1070
+
1071
+ The hooks are threaded through every builder return, so any chain following `Workflow.create({ observability })` keeps the same hooks. `ResumedWorkflow` (gate resume via `loadState`) and `CheckpointResumedWorkflow` (checkpoint resume via `resumeFrom`) ALSO inherit it — events fire on resumed runs without re-wiring.
1072
+
1073
+ ### Per-node firing rules
1074
+
1075
+ | Node | `onStepStart` | `onStepFinish` (`suspended`) | `onStepError` |
1076
+ |---|---|---|---|
1077
+ | step / nested / branch / foreach / parallel / repeat | always | when body returns (`false`) | on body throw |
1078
+ | gate (suspends) | always | `suspended: true` | never |
1079
+ | gate (cond false → skip) | always | `suspended: false` | never |
1080
+ | catch | only when `pendingError` set | when `catchFn` returns | when `catchFn` throws |
1081
+ | finally | always (runs even after suspension) | always (`suspended: false`) | when body throws |
1082
+
1083
+ Skip-checked nodes (suspension or error state already set on entry) emit **nothing** — `.finally()` is the exception.
1084
+
1085
+ ### Per-item events for `foreach` and `parallel`
1086
+
1087
+ `foreach` and `parallel` ALSO fire per-item events:
1088
+
1089
+ ```ts
1090
+ const obs: WorkflowObservability = {
1091
+ onItemStart: ({ stepId, type, itemIndex, input }) => { /* ... */ },
1092
+ onItemFinish: ({ stepId, type, itemIndex, output, durationMs }) => { /* ... */ },
1093
+ onItemError: ({ stepId, type, itemIndex, error, durationMs }) => { /* ... */ },
1094
+ };
1095
+ ```
1096
+
1097
+ - For `foreach`: `itemIndex` is the item's numeric index.
1098
+ - For `parallel` record form: `itemIndex` is the branch's string key.
1099
+ - For `parallel` tuple form: `itemIndex` is the branch's numeric index.
1100
+ - `repeat` does **NOT** emit per-item events. Its iteration count is data-dependent, so per-item events would mislead.
1101
+
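Because `itemIndex` is a number for `foreach` and tuple-form `parallel` but a string for record-form `parallel`, a handler that builds metric or span names should accept both. A minimal sketch, assuming only the payload fields listed above; `ItemEvent` and `itemLabel` are hypothetical names rather than pipeai exports, and the `type` values shown in the comments are assumed:

```ts
// Hypothetical local type: only the fields named in the hooks above.
type ItemEvent = {
  stepId: string;
  type: string; // assumed to be e.g. "foreach" or "parallel"
  itemIndex: number | string;
};

// Build a label such as "foreach:enrich[2]" (numeric index)
// or "parallel:fanout.critic" (string key from the record form).
function itemLabel({ stepId, type, itemIndex }: ItemEvent): string {
  return typeof itemIndex === "number"
    ? `${type}:${stepId}[${itemIndex}]`
    : `${type}:${stepId}.${itemIndex}`;
}
```

A label like this could also serve as the per-item part of a span key, alongside a per-run identifier.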
1102
+ ### Error semantics inside hooks
1103
+
1104
+ - Errors thrown inside `onStepStart`, `onStepFinish`, `onItemStart`, `onItemFinish`, `onItemError` are captured into `result.warnings` with the matching `source` tag and mirrored to `console.error`. The workflow continues.
1105
+ - Errors thrown inside `onStepError` on the normal path cause the ORIGINAL step error to reach the caller with `error.cause = obsError`. The original error keeps its class, so `instanceof` checks against it still pass.
1106
+ - `onCheckpoint` failures fire `onStepError({ stepId: CHECKPOINT_STEP_ID, type: "step", ... })`.
1107
+
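The `onStepError` rule can be pictured as follows. This is an illustration of the described behavior in plain TypeScript, not pipeai's actual implementation; `runErrorHook` and `RateLimitError` are hypothetical names:

```ts
class RateLimitError extends Error {
  constructor(message: string) {
    super(message);
    // Keeps `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Run an error hook. If the hook itself throws, keep the ORIGINAL error
// object and attach the hook failure as its `cause`.
function runErrorHook(original: Error, hook: (error: Error) => void): Error {
  try {
    hook(original);
  } catch (obsError) {
    (original as Error & { cause?: unknown }).cause = obsError;
  }
  return original;
}

const obsError = new Error("exporter offline");
const surfaced = runErrorHook(new RateLimitError("429"), () => {
  throw obsError; // simulate a failing observability hook
});
```

Under this model a caller can still branch on `surfaced instanceof RateLimitError` and inspect `cause` to see what went wrong inside the hook.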
1108
+ ### Concurrent-run-safe OTel pattern
1109
+
1110
+ Don't key observability state on `ctx` alone — concurrent runs share it. Use a per-`runId` key:
1111
+
1112
+ ```ts
1113
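+ import { SpanStatusCode, trace } from "@opentelemetry/api";
+
+ const tracer = trace.getTracer("pipeai-workflows"); // tracer name is an arbitrary example
+ type Ctx = { userId: string; runId: string };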
1114
+ const spans = new Map<string, ReturnType<typeof tracer.startSpan>>();
1115
+
1116
+ const pipeline = Workflow.create<Ctx>({
1117
+ observability: {
1118
+ onStepStart: ({ stepId, type, ctx }) => {
1119
+ const c = ctx as Ctx;
1120
+ spans.set(`${c.runId}:${stepId}`, tracer.startSpan(`${type}:${stepId}`, {
1121
+ attributes: { userId: c.userId },
1122
+ }));
1123
+ },
1124
+ onStepFinish: ({ stepId, ctx, durationMs, suspended }) => {
1125
+ const c = ctx as Ctx; const key = `${c.runId}:${stepId}`;
1126
+ const span = spans.get(key);
1127
+ span?.setAttribute("duration_ms", durationMs);
1128
+ span?.setAttribute("suspended", suspended);
1129
+ span?.end(); spans.delete(key);
1130
+ },
1131
+ onStepError: ({ stepId, ctx, error }) => {
1132
+ const c = ctx as Ctx; const key = `${c.runId}:${stepId}`;
1133
+ const span = spans.get(key);
1134
+ span?.recordException(error as Error);
1135
+ span?.setStatus({ code: SpanStatusCode.ERROR });
1136
+ span?.end(); spans.delete(key);
1137
+ },
1138
+ },
1139
+ }).step(classifier).step(supportAgent);
1140
+ ```
1141
+
1142
+ ## Graph patterns
1143
+
1144
+ The existing combinators compose into common workflow graph shapes — no new primitives needed.
1145
+
1146
+ ### Cycles via `repeat(subWorkflow, { until })`
1147
+
1148
+ Re-run a sub-workflow until a predicate is satisfied:
1149
+
1150
+ ```ts
1151
+ const cycle = Workflow.create<Ctx, Plan>().step(executor).step(critic);
1152
+
1153
+ const agent = Workflow.create<Ctx, string>()
1154
+ .step(planner)
1155
+ .repeat(cycle, { until: ({ output }) => output.satisfied, maxIterations: 5 });
1156
+ ```
1157
+
1158
+ `repeat` runs its body as a sub-workflow; the body's output feeds back as input.
1159
+
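The feedback loop can be sketched without the library. The helper below is a simplified, synchronous model of the data flow only; `repeatUntil` is a hypothetical name, and returning the last output when `maxIterations` is reached is an assumption, since the behavior at the cap is not specified here:

```ts
type RepeatOptions<T> = {
  until: (state: { output: T }) => boolean;
  maxIterations: number;
};

function repeatUntil<T>(body: (input: T) => T, input: T, opts: RepeatOptions<T>): T {
  let current = input;
  for (let i = 0; i < opts.maxIterations; i++) {
    current = body(current); // this pass's output is the next pass's input
    if (opts.until({ output: current })) break;
  }
  return current; // assumption: the last output is returned at the cap
}

// Toy body: "refine" a numeric draft until it reaches 3.
const refined = repeatUntil((draft: number) => draft + 1, 0, {
  until: ({ output }) => output >= 3,
  maxIterations: 5,
});
```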
1160
+ ### Multi-path branching with rejoin via `.branch(...).step(...)`
1161
+
1162
+ The first step AFTER a `branch` is the rejoin point — the chosen branch's output flows in regardless of which case fired:
1163
+
1164
+ ```ts
1165
+ const pipeline = Workflow.create<Ctx>()
1166
+ .step("classify", classifier)
1167
+ .branch({
1168
+ select: ({ input }) => input as "bug" | "feature",
1169
+ agents: { bug: bugAgent, feature: featureAgent },
1170
+ })
1171
+ .step("persist", ({ input, ctx }) => db.save(ctx.userId, input));
1172
+ ```
1173
+
1174
+ ### Fan-out / fan-in via `.parallel({...}).step(...)`
1175
+
1176
+ `parallel` produces a record/tuple; the next step consumes the combined shape:
1177
+
1178
+ ```ts
1179
+ const pipeline = Workflow.create<Ctx, string>()
1180
+ .step("init", ({ input }) => input)
1181
+ .parallel({ researcher, critic })
1182
+ .step("synthesize", ({ input }) => `${input.researcher} + ${input.critic}`);
1183
+ ```
1184
+
1185
+ Pair with the [rate-limit and ctx-mutation hazards](#fan-out-via-parallel) above.
1186
+
1187
+ ### Self-recursion is NOT supported
1188
+
1189
+ ```ts
1190
+ // Doesn't work — `recur` is undefined at evaluation.
1191
+ let recur;
1192
+ recur = Workflow.create<Ctx, string>()
1193
+ .step(executor)
1194
+ .repeat(recur, { until: () => false }); // ← recur is undefined here
1195
+ ```
1196
+
1197
+ A future `repeat(thunk)` overload (F4.5 candidate) could enable this — the cycle guard inside `stepShapeHash` is already prepared for it.
1198
+
1199
+ ## Migration from 0.3.x
1200
+
1201
+ 0.4.0 makes suspension a return value instead of a thrown error, plus seven smaller behavior changes. The full list:
1202
+
1203
+ 1. **`.finally()` runs after a gate suspends.** Code that assumed `finally` ran only on completion must now check `result.status === "complete"`.
1204
+ 2. **Nested-workflow `.finally()` bodies run before `NestedGateUnsupportedError` fires.** Inner finallys see `state.suspension` truthy while running — don't branch on it. Side-effecting inner finallys execute on a path the user perceives as a thrown error.
1205
+ 3. **A throwing `.finally()` no longer aborts subsequent `.finally()` bodies.** All finallys run; their errors accumulate.
1206
+ 4. **`WorkflowSuspended` is deleted.** Migrate `try / catch (e instanceof WorkflowSuspended)` → `if (result.status === "suspended")`.
1207
+ 5. **`WorkflowResult<T>` shape changed.** `const { output } = await pipeline.generate(...)` is now a strict-mode compile error. Use `if (result.status !== "complete") throw …; const { output } = result`.
1208
+ 6. **`stream()` on suspension closes cleanly.** `WorkflowStreamOptions.onError` is **not** invoked for suspension — discriminate via the resolved `output` Promise. Real errors still flow through `onError`. F0 emits a one-time `console.warn` per process when a gate fires in stream mode with `onError` set.
1209
+ 7. **A throwing `.finally()` body on the completion path always produces `AggregateError`.** The contract is stable once any finally is added, and it holds even when only a single error was thrown.
1210
+ 8. **Duplicate `(type, id)` pairs in the same workflow throw at builder finalization.** `foreach(agentX).foreach(agentX)` and back-to-back default-id `branch(...)` calls must pass an explicit `{ id }`. The same applies to `step(agent, { id })` when reusing an agent in two steps.
1211
+
1212
+ Before:
1213
+
1214
+ ```ts
1215
+ import { WorkflowSuspended } from "pipeai";
1216
+ try {
1217
+ const { output } = await pipeline.generate(ctx, input);
1218
+ return output;
1219
+ } catch (e) {
1220
+ if (e instanceof WorkflowSuspended) {
1221
+ await db.saveSnapshot(e.snapshot);
1222
+ return null;
1223
+ }
1224
+ throw e;
1225
+ }
1226
+ ```
1227
+
1228
+ After:
1229
+
1230
+ ```ts
1231
+ const result = await pipeline.generate(ctx, input);
1232
+ if (result.status === "suspended") {
1233
+ await db.saveSnapshot(result.snapshot);
1234
+ return null;
1235
+ }
1236
+ return result.output;
1237
+ ```
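When many call sites only care about the completed case, the narrowing can live in one helper. A sketch, assuming a result shape inferred from the snippets above (`status`, `output`, `snapshot`); the local `Result` type and `unwrapComplete` are hypothetical and stand in for the library's `WorkflowResult<T>`:

```ts
// Hypothetical stand-in for WorkflowResult<T>; the real type may carry more fields.
type Result<T> =
  | { status: "complete"; output: T }
  | { status: "suspended"; snapshot: unknown };

// Return the output of a completed run, or throw for any other status.
function unwrapComplete<T>(result: Result<T>): T {
  if (result.status !== "complete") {
    throw new Error(`workflow did not complete (status: ${result.status})`);
  }
  return result.output; // narrowed to the "complete" variant here
}
```

Call sites that need to persist a snapshot should keep the explicit `status === "suspended"` branch instead of using a helper like this.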
830
1238
 
831
1239
  ## Full Example
832
1240