
pipeai 0.2.1 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

package/README.md CHANGED

@@ -46,7 +46,7 @@ type Ctx = {
   db: Database;
 };
-const assistant = new Agent<Ctx>({
+const assistant = new Agent<Ctx, string>({
   id: "assistant",
   model: openai("gpt-4o"),
   system: "You are a helpful assistant.",
@@ -84,7 +84,7 @@ const classificationSchema = z.object({
   summary: z.string(),
 });
-const classifier = new Agent<Ctx>({
+const classifier = new Agent<Ctx, { title: string; body: string }>({
   id: "classifier",
   input: z.object({ title: z.string(), body: z.string() }),
   output: Output.object({ schema: classificationSchema }),
@@ -102,7 +102,7 @@ result.output; // { priority: "high", category: "bug", summary: "..." }
 Most config fields accept a static value or a `(ctx, input) => value` function:
 ```ts
-const agent = new Agent<Ctx>({
+const agent = new Agent<Ctx, string>({
   id: "adaptive",
   model: (ctx) => ctx.isPremium ? openai("gpt-4o") : openai("gpt-4o-mini"),
   system: (ctx) => `You assist ${ctx.userName}. Role: ${ctx.role}.`,
@@ -120,7 +120,7 @@ const agent = new Agent<Ctx>({
 Same callback names as AI SDK v6, extended with `ctx`, `input`, and `writer`. The AI SDK event payload is available as `result`. When the agent runs inside a streaming workflow, `writer` is available for writing metadata or custom stream parts:
 ```ts
-const agent = new Agent<Ctx>({
+const agent = new Agent<Ctx, string>({
   id: "monitored",
   model: openai("gpt-4o"),
   prompt: (ctx, input) => input,
@@ -146,6 +146,7 @@ const agent = new Agent<Ctx>({
 | `description` | `string`                  | Agent description (used by `asTool()` for tool description).      |
 | `input`       | `ZodType`                 | Input schema. Required for `asTool()`. Infers `TInput`.           |
 | `output`      | `Output`                  | AI SDK Output (e.g. `Output.object({ schema })`). Infers `TOutput`. |
+| `validateOutput` | `ZodType<TOutput>`     | Optional runtime guard. Validates the structured `output` after the SDK parses it (distinct from `tool.outputSchema`). Catches SDK-side parse drift. |
 | `model`       | `Resolvable`              | Language model. Static or `(ctx, input) => model`.                |
 | `system`      | `Resolvable`              | System prompt.                                                    |
 | `prompt`      | `Resolvable`              | String prompt. Mutually exclusive with `messages`.                |
@@ -153,7 +154,7 @@ const agent = new Agent<Ctx>({
 | `tools`       | `Resolvable`              | Tool map. Supports `Tool`, `ToolProvider`, and `agent.asTool()`.  |
 | `activeTools` | `Resolvable`              | Subset of tool names to enable.                                   |
 | `toolChoice`  | `Resolvable`              | Tool choice strategy. Static or `(ctx, input) => toolChoice`.     |
-| `stopWhen`    | `Resolvable`              | Condition for stopping the tool loop. Static or `(ctx, input) => condition`. |
+| `stopWhen`    | `StopCondition` &#124; `StopCondition[]` | Condition(s) for stopping the tool loop. **Static only** — not a `Resolvable`. A bare function is ambiguous with the resolver form, so dynamic stop conditions require building the agent per call. |
 | `onStepFinish`| `({ result, ctx, input, writer? })`| Called after each step. `writer` available in streaming workflows. |
 | `onFinish`    | `({ result, ctx, input, writer? })`| Called when all steps complete.                                   |
 | `onError`     | `({ error, ctx, input, writer? })` | Called on error.                                                  |
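The `stopWhen` row above says a bare function is ambiguous with the resolver form. The sketch below illustrates why, using simplified stand-in types: `Resolvable`, `StopCondition`, and `resolve` here are hypothetical local definitions, not the signatures exported by pipeai or the AI SDK.

```typescript
// Simplified stand-ins: neither type is imported from pipeai or the AI SDK.
type Resolvable<T, Ctx, In> = T | ((ctx: Ctx, input: In) => T);
type StopCondition = (state: { steps: number }) => boolean;

// A typical resolver: call the value if it is a function, otherwise return it.
function resolve<T, Ctx, In>(value: Resolvable<T, Ctx, In>, ctx: Ctx, input: In): T {
  return typeof value === "function"
    ? (value as (ctx: Ctx, input: In) => T)(ctx, input)
    : value;
}

const stopAfterThree: StopCondition = (state) => state.steps >= 3;

// The resolver cannot tell a stop condition from a resolver function, so it
// calls the condition with (ctx, input) and hands back its boolean result
// instead of the condition itself.
const resolved = resolve<StopCondition, { steps: number }, string>(
  stopAfterThree,
  { steps: 5 },
  "hello",
);

console.log(typeof resolved); // "boolean", even though the static type says StopCondition
```

Under these assumptions, the only unambiguous options are to accept static values or to wrap dynamic ones in a distinct shape, which matches the "static only" choice described in the table.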
@@ -164,7 +165,7 @@ const agent = new Agent<Ctx>({
 `asTool()` compiles an agent into a standard AI SDK `Tool`. The parent agent's LLM tool loop handles routing — no dedicated router needed.
 ```ts
-const codingAgent = new Agent<Ctx>({
+const codingAgent = new Agent<Ctx, { task: string; language?: string }>({
   id: "coding",
   description: "Writes and modifies code.",
   input: z.object({
@@ -176,7 +177,7 @@ const codingAgent = new Agent<Ctx>({
   tools: { writeFile, readFile },
 });
-const qaAgent = new Agent<Ctx>({
+const qaAgent = new Agent<Ctx, { question: string }>({
   id: "qa",
   description: "Answers technical questions.",
   input: z.object({ question: z.string() }),
@@ -186,7 +187,7 @@ const qaAgent = new Agent<Ctx>({
 });
 // Parent agent uses sub-agents as tools
-const orchestrator = new Agent<Ctx>({
+const orchestrator = new Agent<Ctx, string>({
   id: "orchestrator",
   model: openai("gpt-4o"),
   system: "Delegate work to the right specialist.",
@@ -221,7 +222,7 @@ codingAgent.asTool(ctx, {
 `asTool(ctx)` bakes the context in at call time. `asToolProvider()` defers context resolution — the tool is created with the correct context when another agent's tool resolution runs:
 ```ts
-const orchestrator = new Agent<Ctx>({
+const orchestrator = new Agent<Ctx, string>({
   id: "orchestrator",
   model: openai("gpt-4o"),
   system: "Delegate work to the right specialist.",
@@ -267,7 +268,7 @@ const cancelOrder = define({
 });
 // Mix with plain AI SDK tools freely
-const agent = new Agent<Ctx>({
+const agent = new Agent<Ctx, string>({
   id: "support",
   model: openai("gpt-4o"),
   prompt: (ctx, input) => input,
@@ -302,15 +303,24 @@ const pipeline = Workflow.create<Ctx>()
 ```ts
 // Non-streaming — calls agent.generate() at each step
-const { output } = await pipeline.generate(ctx, initialInput);
+const result = await pipeline.generate(ctx, initialInput);
+if (result.status === "complete") {
+  console.log(result.output);
+} else {
+  // result.status === "suspended" — see Human-in-the-Loop section
+  await db.saveSnapshot(result.snapshot);
+}
 // Streaming — calls agent.stream() at each step, merges into a single ReadableStream
 const { stream, output } = pipeline.stream(ctx, initialInput);
 return new Response(stream);
-const finalOutput = await output;  // resolves when pipeline completes
+// output resolves to WorkflowResult<T> — never rejects on suspension
+const final = await output;
 ```
+The return value is a `WorkflowResult<T>` discriminated union — either `{ status: "complete", output, warnings }` or `{ status: "suspended", snapshot, warnings }`. `warnings` is always present on both branches (`readonly WorkflowWarning[]`, possibly empty).
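A minimal sketch of how a caller might narrow such a union exhaustively. The types below are local mirrors of the shapes described in the README text, not imports from pipeai; the `error` field on `WorkflowWarning` and the reduced `Snapshot` type are assumptions made for illustration.

```typescript
// Local mirrors of the documented shapes; field names beyond `source` are assumed.
type WorkflowWarning = { source: string; error: unknown };
type Snapshot = { gateId: string; gatePayload: unknown };

type WorkflowResult<T> =
  | { status: "complete"; output: T; warnings: readonly WorkflowWarning[] }
  | { status: "suspended"; snapshot: Snapshot; warnings: readonly WorkflowWarning[] };

// Switching on `status` lets the compiler narrow each branch, and the `never`
// assignment fails to compile if a new variant is added but not handled.
function summarize<T>(result: WorkflowResult<T>): string {
  switch (result.status) {
    case "complete":
      return `done: ${JSON.stringify(result.output)}`;
    case "suspended":
      return `waiting at gate ${result.snapshot.gateId}`;
    default: {
      const unreachable: never = result;
      return unreachable;
    }
  }
}

console.log(summarize({ status: "complete", output: 42, warnings: [] }));
console.log(
  summarize<number>({
    status: "suspended",
    snapshot: { gateId: "review", gatePayload: null },
    warnings: [],
  }),
);
```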
 ### Nested workflows
 Workflows can be passed as steps into other workflows. The nested workflow's steps execute within the parent's runtime state — streams merge naturally, and errors propagate to the parent's `catch()`:
@@ -484,13 +494,15 @@ const pipeline = Workflow.create<Ctx>()
   .step("combine", ({ input }) => input.join("\n\n"));
 ```
-Concurrent processing with batched parallelism:
+Concurrent processing with bounded parallelism:
 ```ts
-// Process 3 items at a time
+// Up to 3 items run simultaneously; the next launches as soon as one finishes.
 .foreach(summarizer, { concurrency: 3 })
 ```
+`concurrency` is the **maximum number of items in flight at any moment** — backed by a semaphore. There's no lockstep batching: a slow item never blocks a finished slot from picking up the next pending one.
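The difference from lockstep batching can be sketched with a small worker pool. This is an illustration of bounded parallelism in general, not pipeai's semaphore implementation; `mapWithLimit` is a hypothetical helper.

```typescript
// Runs `fn` over `items` with at most `limit` calls in flight at once.
// Each worker pulls the next pending index as soon as its current item
// finishes, so a slow item never holds back a free slot.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two workers share an index
      results[index] = await fn(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results stay in input order regardless of completion order
}
```

With a limit of 2 and one slow item, the second worker keeps draining the remaining items while the first is still busy, which is the behaviour the paragraph above describes.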
 Works with nested workflows too:
 ```ts
@@ -503,8 +515,77 @@ const pipeline = Workflow.create<Ctx>()
   .foreach(processItem, { concurrency: 5 });
 ```
+#### Per-item error recovery via `onError`
+By default a single item's failure aborts the whole `foreach`. Pass an `onError` handler to recover individual items — return a substitute value, return `Workflow.SKIP` to drop the failed index from the output array, or rethrow to abort the step (the throw is catchable by a downstream `.catch()`):
+```ts
+import { Workflow } from "pipeai";
+const pipeline = Workflow.create<Ctx>()
+  .step("fetch", async ({ ctx }) => ctx.db.urls.getAll())
+  .foreach(scraper, {
+    concurrency: 5,
+    onError: ({ error, item, index }) => {
+      // Substitute a placeholder
+      if (isTransient(error)) return { url: item, body: "" };
+      // Or drop the item entirely (output array is shortened)
+      return Workflow.SKIP;
+    },
+  });
+```
+`onError` is invoked sequentially in **index order** after all items settle, so its observable order is deterministic regardless of completion timing. In-flight siblings are never cancelled by another item's failure.
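One way to picture the "settle first, then recover in index order" behaviour is the sketch below. It works on already-settled results and uses a local `SKIP` symbol as a stand-in for `Workflow.SKIP`; `recoverInIndexOrder` and the `Settled` type are hypothetical and not part of pipeai.

```typescript
const SKIP = Symbol("skip"); // stand-in for Workflow.SKIP

type Settled<R> =
  | { status: "fulfilled"; value: R }
  | { status: "rejected"; reason: unknown };

// Walks the settled results from index 0 upward, so the handler is called
// in a deterministic order no matter which item finished first.
function recoverInIndexOrder<T, R>(
  items: readonly T[],
  settled: readonly Settled<R>[],
  onError: (args: { error: unknown; item: T; index: number }) => R | typeof SKIP,
): R[] {
  const out: R[] = [];
  settled.forEach((result, index) => {
    if (result.status === "fulfilled") {
      out.push(result.value);
      return;
    }
    const recovered = onError({ error: result.reason, item: items[index], index });
    if (recovered !== SKIP) out.push(recovered); // SKIP shortens the output array
  });
  return out;
}
```

A handler that throws would propagate out of `recoverInIndexOrder`, which corresponds to the "rethrow to abort the step" option.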
 **Type safety:** `foreach()` uses `ElementOf<TOutput>` to extract the array element type. If the previous step doesn't produce an array, the call is rejected at compile time.
+### Fan-out via `parallel()`
+`parallel()` runs several branches against the **same input** concurrently and collects their results. It has two overloads: a record form (keyed by name) and a tuple form (positional):
+```ts
+// Record form: returns { researcher: ResearchOutput, critic: CriticOutput }
+const recordPipeline = Workflow.create<Ctx, string>()
+  .step("classify", classifier)
+  .parallel({ researcher, critic });
+// Tuple form: returns [ResearchOutput, CriticOutput]
+const tuplePipeline = Workflow.create<Ctx, string>()
+  .step("classify", classifier)
+  .parallel([researcher, critic] as const);
+```
+The same input (`state.output`) is fed to each branch. Default concurrency is `min(branches.length, 5)` — most users want fan-out, but the cap protects against rate-limit pressure. Pass `concurrency: Infinity` (or `branches.length`) to opt out.
+```ts
+.parallel({ a, b, c, d, e, f, g, h }, { concurrency: 3 })     // explicit override
+.parallel({ a, b, c, d, e, f, g, h }, { concurrency: Infinity })  // full fan-out
+```
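The default and override rules above reduce to a small piece of arithmetic. The helper below is a hypothetical restatement of that rule for illustration, not a function exported by pipeai.

```typescript
// Default cap described above: min(branch count, 5). An explicit value
// overrides it, and is itself clamped to the number of branches because
// extra slots would have nothing to run.
function effectiveConcurrency(branchCount: number, requested?: number): number {
  const limit = requested ?? Math.min(branchCount, 5);
  return Math.min(limit, branchCount);
}

console.log(effectiveConcurrency(8));           // 5 (default cap)
console.log(effectiveConcurrency(8, 3));        // 3 (explicit override)
console.log(effectiveConcurrency(8, Infinity)); // 8 (full fan-out)
console.log(effectiveConcurrency(2));           // 2 (fewer branches than the cap)
```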
+**Generate mode only.** Streams aren't threaded through to branches — interleaving multiple agent streams into one writer is out of scope.
+#### Per-branch error handling
+```ts
+.parallel({ a, b }, {
+  onError: ({ error, key, ctx }) => {
+    if (key === "a") return "fallback-a";   // substitute
+    if (key === "b") return Workflow.SKIP;  // record form: undefined slot
+    throw error;                            // rethrow to abort the parallel
+  },
+})
+```
+`onError` is **bypassed** on the suspension path: if any branch hits a nested gate, the marker reaches the caller without `onError` running. Non-suspension errors flow through `onError` in branch order.
+#### Suspension under parallel
+Gates inside parallel branches throw `NestedGateUnsupportedError`, the same as in a concurrent `foreach`. The lowest-index suspending branch wins the marker; the others contribute to `siblingSuspensions`. Multi-branch suspension semantics are finalized in F0.6 alongside `cancelOnFirstSuspend`; until then, all branches run to completion (or sibling-failure) before the marker reaches the caller.
+> **Rate-limit hazard:** `parallel`'s default `min(N, 5)` assumes ≥5 RPS headroom on your model provider. Symptoms of overflow: 429s and stair-stepped latency.
+> **Concurrent ctx-mutation hazard:** branches share the `ctx` object by reference. Treat `ctx` as immutable inside parallel branches.
 ### Conditional loops via `repeat()`
 `repeat()` runs an agent or workflow in a loop until a condition is met. The body's output feeds back as input — same type in, same type out:
@@ -594,11 +675,11 @@ const { stream, output } = pipeline.stream(ctx, initialInput, {
 | `.step(id, fn)`           | Transform the output. `fn` receives `{ ctx, input }` and returns the new output. |
 | `.branch([...cases])`     | Predicate routing. First `when` match wins; case without `when` is default. |
 | `.branch({ select, agents })` | Key routing. `select` returns a key, runs the matching agent.          |
-| `.foreach(target, opts?)` | Map each array element through an agent or workflow. `opts.concurrency` controls parallelism (default: 1). |
+| `.foreach(target, opts?)` | Map each array element through an agent or workflow. `opts.concurrency` is the max items in flight (default: 1). `opts.onError` recovers per-item failures; return `Workflow.SKIP` to drop an index. |
 | `.repeat(target, opts)`   | Loop an agent or workflow. Use `{ until }` or `{ while }` (mutually exclusive). `maxIterations` defaults to 10. |
-| `.gate(id, opts?)`        | Human-in-the-loop suspension point. Throws `WorkflowSuspended` with a serializable snapshot. Resume via `loadState(gateId, snapshot)`. |
-| `.catch(id, fn)`          | Handle errors. `fn` receives `{ error, ctx, lastOutput, stepId }` and returns a recovery value. |
-| `.finally(id, fn)`        | Always runs. `fn` receives `{ ctx }`.                                      |
+| `.gate(id, opts?)`        | Human-in-the-loop suspension point. Returns a result with `status: "suspended"` carrying a serializable snapshot. Resume via `loadState(gateId, snapshot)`. |
+| `.catch(id, fn)`          | Handle errors. `fn` receives `{ error, ctx, lastOutput, stepId }` and returns a recovery value. Bypassed on suspension. |
+| `.finally(id, fn)`        | Always runs, including after a gate suspends. `fn` receives `{ ctx }`. A throwing `.finally()` no longer aborts subsequent ones; errors aggregate into `AggregateError` on the completion path and into `result.warnings` on the suspension path. |
 ### Output flow
@@ -623,10 +704,12 @@ Auto-extraction priority for `step()` with an agent:
 `gate()` suspends a workflow at a designated point, producing a JSON-serializable snapshot. The consumer persists the snapshot, collects human input out-of-band (HTTP, WebSocket, CLI, queue — any transport), then resumes the workflow from where it left off.
+> **0.4.0 breaking change:** suspension is a return value, not a thrown error. `generate()` and `stream()` resolve with `WorkflowResult<T>` — a discriminated union of `{ status: "complete", output, warnings }` and `{ status: "suspended", snapshot, warnings }`. `WorkflowSuspended` has been removed. See [Migration from 0.3.x](#migration-from-03x).
 ### Basic gate
 ```ts
-import { Workflow, WorkflowSuspended } from "pipeai";
+import { Workflow } from "pipeai";
 const pipeline = Workflow.create<Ctx>()
   .step(draftAgent)
@@ -636,23 +719,27 @@ const pipeline = Workflow.create<Ctx>()
   .step(publishAgent);
 // Run — suspends at gate
-try {
-  await pipeline.generate(ctx, input);
-} catch (e) {
-  if (e instanceof WorkflowSuspended) {
-    await db.saveSnapshot(e.snapshot);
-    return res.status(202).json(e.snapshot.gatePayload);
-  }
+const result = await pipeline.generate(ctx, input);
+if (result.status === "suspended") {
+  await db.saveSnapshot(result.snapshot);
+  return res.status(202).json(result.snapshot.gatePayload);
 }
+// result.status === "complete" here — TS narrows `output` automatically
+return res.json({ output: result.output });
 // Resume — load state, pass gate ID + snapshot to generate or stream
 const snapshot = await db.loadSnapshot(id);
 const resumed = pipeline.loadState("review", snapshot);
-const { output } = await resumed.generate(ctx, humanResponse);
+const resumeResult = await resumed.generate(ctx, humanResponse);
+if (resumeResult.status === "complete") {
+  console.log(resumeResult.output);
+}
 ```
 The `snapshot` is plain JSON — it survives `JSON.parse(JSON.stringify())`, database storage, and process restarts. The workflow definition (code) stays in the process; only the data is serialized.
+`result.warnings` is **always** present on both branches — an array of non-fatal errors (a throwing `.finally()`, a misbehaving observer). It's `readonly WorkflowWarning[]`, never `undefined`. If you don't care about non-fatal failures, ignore it.
 ### Resuming with streaming
 For chat applications where the client reconnects and needs a live stream for the remaining steps:
@@ -667,21 +754,23 @@ The previous stream is gone — the library only streams forward from the resume
 ### Streaming suspension
-When `stream()` hits a gate, the stream closes cleanly (partial content from steps before the gate is delivered). The `output` promise rejects with `WorkflowSuspended`:
+When `stream()` hits a gate, the stream closes cleanly (partial content from steps before the gate is delivered). The `output` Promise **resolves** with `{ status: "suspended", snapshot, warnings }` — it does **not** reject:
 ```ts
 const { stream, output } = pipeline.stream(ctx, input);
 pipeStreamToResponse(res, stream); // partial content delivered normally
-try {
-  await output;
-} catch (e) {
-  if (e instanceof WorkflowSuspended) {
-    await db.saveSnapshot(e.snapshot);
-  }
+const result = await output;
+if (result.status === "suspended") {
+  await db.saveSnapshot(result.snapshot);
 }
+// Real errors (a step throws something other than a gate suspension) still
+// reject the output Promise — keep your try/catch for those, but
+// `WorkflowStreamOptions.onError` is NOT invoked for suspension.
 ```
+> **Stream-mode dead-air warning:** the stream stays open while `.finally()` bodies run after a gate suspends. Long-running cleanup work causes proportional dead air. If your HTTP read timeout is shorter than your worst-case `.finally()` I/O, the connection can drop spuriously.
 ### Schema validation
 Add a `schema` to validate the human response at runtime. The schema uses a structural type — any object with a `.parse()` method works (Zod, Valibot, ArkType, etc.):
@@ -716,23 +805,25 @@ const pipeline = Workflow.create<Ctx>()
   .step("publish", ({ input }) => `published: ${input}`);
 // First gate
-let snapshot: WorkflowSnapshot;
-try { await pipeline.generate(ctx, input); }
-catch (e) { snapshot = (e as WorkflowSuspended).snapshot; }
+const r1 = await pipeline.generate(ctx, input);
+if (r1.status !== "suspended") throw new Error("expected suspension at review");
+let snapshot = r1.snapshot;
 // Second gate
 const resumed1 = pipeline.loadState("review", snapshot);
-try { await resumed1.generate(ctx, "first approval"); }
-catch (e) { snapshot = (e as WorkflowSuspended).snapshot; }
+const r2 = await resumed1.generate(ctx, "first approval");
+if (r2.status !== "suspended") throw new Error("expected suspension at final-approval");
+snapshot = r2.snapshot;
 // Complete
 const resumed2 = pipeline.loadState("final-approval", snapshot);
-const { output } = await resumed2.generate(ctx, "final approval");
+const r3 = await resumed2.generate(ctx, "final approval");
+if (r3.status === "complete") console.log(r3.output);
 ```
-### Merging pre-gate output with response
+### Manual merge at the call site
-The `snapshot.output` field contains the pre-gate output. Use it to merge with the human response:
+The `snapshot.output` field contains the pre-gate output. Merge it with the human response at the call site:
 ```ts
 // The step after the gate needs both the draft and the approval
@@ -743,6 +834,8 @@ await resumed.generate(ctx, {
 });
 ```
+For automatic merging without exposing `snapshot.output` to the caller, see the `merge` option below.
 ### Injecting updated context on resume
 `ctx` is provided fresh on every `generate()`/`stream()` call — never serialized. Use it to inject updated chat history, refreshed auth tokens, or new database connections:
@@ -770,39 +863,378 @@ const pipeline = Workflow.create<Ctx>()
   .step(publishAgent);
 ```
-### Merging pre-gate output with response
+### Merging pre-gate output with response via `merge`
+Use `merge` to combine the pre-gate output with the human response into a single value for the next step. Without `merge`, only the human response is forwarded.
-Use `merge` to combine the pre-gate output with the human response into a single value for the next step. Without `merge`, only the human response is forwarded:
+`merge` may return any shape — its return type becomes the input type of the next step. The gate's third generic `TMerged` is inferred from the merge return type, so downstream steps type-check against the merged shape:
 ```ts
 const pipeline = Workflow.create<Ctx>()
   .step(draftAgent)
   .gate("review", {
+    schema: approvalSchema,
     merge: ({ priorOutput, response }) => ({
-      draft: priorOutput,
-      approval: response,
+      draft: priorOutput,        // pre-gate output (TOutput)
+      approval: response,        // validated human response (TResponse)
     }),
   })
   .step("publish", ({ input }) => {
-    // input is { draft, approval }
+    // input is { draft, approval } — the TMerged shape
   });
 ```
 ### Snapshot shape
+As of 0.5.0, `WorkflowSnapshot` is a discriminated union with three variants — gate snapshots emitted by `.gate()`, checkpoint snapshots emitted by `onCheckpoint`, and the legacy v1 form from 0.4.0 (accepted for one release via the shim):
 ```ts
-interface WorkflowSnapshot {
-  version: 1;
+interface GateSnapshot {
+  version: 2;
+  kind: "gate";
   resumeFromIndex: number;  // step index of the gate
   output: unknown;          // pre-gate output
   gateId: string;           // gate identifier
   gatePayload: unknown;     // data for the human
 }
+interface CheckpointSnapshot {
+  version: 2;
+  kind: "checkpoint";
+  resumeFromIndex: number;  // index of the NEXT step to run
+  output: unknown;          // output as of the checkpoint
+  stepShapeHash: string;    // SHA-256 hex of the workflow's structural shape
+}
+// Legacy v1 — only accepted by loadState() during one release. Migrate via migrateSnapshot().
+interface LegacyGateSnapshotV1 {
+  version: 1;
+  kind?: undefined;
+  resumeFromIndex: number;
+  output: unknown;
+  gateId: string;
+  gatePayload: unknown;
+}
+type WorkflowSnapshot = GateSnapshot | CheckpointSnapshot | LegacyGateSnapshotV1;
 ```
+`WorkflowResult<T>` narrows the suspended-branch `snapshot` to `GateSnapshot` specifically — only gates suspend, so the union widening doesn't pollute the suspended-state API.
+> **Rolling-deploy hazard:** A 0.4.0 process receiving a 0.5.0-persisted v2 gate snapshot rejects via the strict `version === 1` check. Drain in-flight snapshots before cutover, ship a 0.4.x forward-compat patch ahead, or version-tag storage keys.
+> **Long-lived storage:** For Redis-without-TTL / S3 / Postgres, call `migrateSnapshot(legacy)` before v0.8.0+ drops v1 acceptance.
+## Step-level checkpointing via `onCheckpoint`
+Pass `onCheckpoint` in `RunOptions` to receive a v2 checkpoint snapshot after each successful step body. Use this to persist progress so a crashed/restarted process can resume where it left off — no human-in-the-loop required.
+```ts
+import { Workflow, type CheckpointSnapshot } from "pipeai";
+const pipeline = Workflow.create<Ctx, string>()
+  .step("classify", classifier)
+  .step("summarize", summarizer)
+  .step("publish", publisher);
+let lastSnapshot: CheckpointSnapshot | null = null;
+const result = await pipeline.generate(ctx, "input", {
+  onCheckpoint: async (snap) => {
+    lastSnapshot = snap;
+    await db.write({ key: "run:42", snapshot: snap });
+  },
+  checkpointEvery: 5,    // every 5 executable steps
+});
+// On restart, resume from the last persisted snapshot:
+const stored = await db.read("run:42");
+const resumed = pipeline.resumeFrom(stored);
+const final = await resumed.generate(ctx);   // no response arg — state is seeded
+```
+### Cadence
+- `checkpointEvery: N` — fire every N executable steps. Defaults to `max(1, ceil(executableCount / 4))` — 4 checkpoints across the run, floor of every step on tiny pipelines.
+- `checkpointWhen({ stepIndex, stepId, ctx }) => boolean` — predicate variant. Mutually exclusive with `checkpointEvery`.
+- `.catch()` and `.finally()` nodes are NOT counted as executable, so adding cleanup doesn't surprise you with extra checkpoints.
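The default cadence formula can be worked through for a few pipeline sizes. The helpers below are hypothetical; in particular, the assumption that checkpoints fire after executable steps N, 2N, 3N and so on (counting from 1) is mine, since the text above only states the interval.

```typescript
// Default interval as documented: max(1, ceil(executableCount / 4)).
function defaultCheckpointEvery(executableCount: number): number {
  return Math.max(1, Math.ceil(executableCount / 4));
}

// ASSUMPTION: a checkpoint fires after every `every`-th executable step,
// counted from 1. Returns the step counts at which a checkpoint would fire.
function checkpointPositions(
  executableCount: number,
  every: number = defaultCheckpointEvery(executableCount),
): number[] {
  const positions: number[] = [];
  for (let done = every; done <= executableCount; done += every) {
    positions.push(done);
  }
  return positions;
}

console.log(defaultCheckpointEvery(3), checkpointPositions(3));   // 1 [ 1, 2, 3 ]
console.log(defaultCheckpointEvery(8), checkpointPositions(8));   // 2 [ 2, 4, 6, 8 ]
console.log(defaultCheckpointEvery(10), checkpointPositions(10)); // 3 [ 3, 6, 9 ]
```

Under this reading, the rounding in `ceil` means some pipeline lengths yield three checkpoints rather than four, as the 10-step case shows.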
+### Timeout via `AbortSignal`
+```ts
+const result = await pipeline.generate(ctx, input, {
+  onCheckpoint: async (snap, { signal }) => {
+    await fetch("/persist", { method: "POST", body: JSON.stringify(snap), signal });
+  },
+  checkpointTimeout: 500,   // ms — AbortSignal fires, CheckpointTimeoutError raised
+});
+```
+A timed-out `onCheckpoint` raises `CheckpointTimeoutError`, which (like any `onCheckpoint` throw) bypasses `.catch()` and reaches the caller bare. `.finally()` still runs; any finally errors get a `console.warn`.
+### `stepShapeHash` and `resumeFrom`
+Each checkpoint snapshot carries a SHA-256 of the workflow's structural shape (index + type + id + recursive nested workflow shapes). `resumeFrom` verifies the hash matches before continuing:
+```ts
+const resumed = pipeline.resumeFrom(snapshot);                              // throws on shape mismatch
+const unchecked = pipeline.resumeFrom(snapshot, { skipShapeCheck: true });  // override: skips the hash comparison
+```
+Common shape changes that invalidate snapshots: insertion, removal, reorder, type-swap with same id, nested-workflow refactor. **Agent identity is NOT in the hash** — two checkpoints from runs that used different agent configs (same agent id) hash identically. Version your agents by content if resume-trust matters.
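A rough sketch of what a structural hash of this kind might look like, using Node's built-in `crypto` module. The canonical encoding and the `StepShape` type are assumptions for illustration; pipeai's actual serialization is not shown in this diff.

```typescript
import { createHash } from "node:crypto";

// Only structural fields take part; agent configuration is deliberately absent.
type StepShape = { type: string; id: string; nested?: StepShape[] };

// Encode index + type + id, recursing into nested workflow shapes.
function canonical(steps: readonly StepShape[]): unknown[] {
  return steps.map((step, index) => [
    index,
    step.type,
    step.id,
    step.nested ? canonical(step.nested) : null,
  ]);
}

function shapeHash(steps: readonly StepShape[]): string {
  return createHash("sha256").update(JSON.stringify(canonical(steps))).digest("hex");
}

const original: StepShape[] = [
  { type: "step", id: "classify" },
  { type: "step", id: "publish" },
];
const reordered: StepShape[] = [original[1], original[0]];

console.log(shapeHash(original) === shapeHash(reordered)); // false: reordering changes the hash
```

Because the encoding contains no agent configuration, two workflows whose agents share an id but differ in model or prompt would produce the same hash here too, which mirrors the caveat above.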
+### Stream-mode caveats
+- Each `onCheckpoint` fire pauses the stream writer while it awaits — for chunky checkpoints, prefer larger cadence.
+- Per-checkpoint `JSON.stringify` cost grows with `state.output`; the example above uses `checkpointEvery: 5` to amortize.
+- Serializing consumers should leave `freezeSnapshots: false` — `JSON.stringify` already copies.
+### Memoization
+`stepShapeHash` is memoized per terminal-workflow instance. **Build pipelines once at module load and call `generate()` many times** to amortize. Per-request construction defeats memoization.
+### `.catch()` placed before `resumeFromIndex` is dead
+After a checkpoint-resume, any `.catch()` nodes BEFORE the resume index never fire (they're skipped along with all earlier steps). Place catches at the end of the workflow or strategically late.
+### Gate-vs-checkpoint resume asymmetry
+Gate snapshots use a reorder-tolerant id-scan fallback in `loadState`. Checkpoint snapshots use `stepShapeHash`, which is reorder-strict. A workflow with both has two different resume semantics — when in doubt, bump a workflow version id and route old snapshots to old code.
+### Catastrophic combos
+`validateRunOptions` throws synchronously on:
+- `checkpointEvery` and `checkpointWhen` both set (mutually exclusive)
+- `checkpointEvery` not a positive integer
+- `checkpointTimeout` not a finite positive number
+- `freezeSnapshots: true + checkpointEvery: 1` on a workflow of 8+ steps (catastrophic perf — pass `"iAcceptThePerformanceCost"` to bypass)
+And warns once on `freezeSnapshots: true + cadence <= 2` (suspicious but legal).
 ### Limitations
-Gates inside nested workflows, `foreach()`, and `repeat()` are not yet supported — a descriptive error is thrown at runtime. Gates at the top level of a workflow work in all cases.
+Gates inside nested workflows, `foreach()`, and `repeat()` are not yet supported — `NestedGateUnsupportedError` is thrown at runtime. Gates at the top level of a workflow work in all cases.
+```ts
+import { NestedGateUnsupportedError } from "pipeai";
+try {
+  await pipeline.generate(ctx, input);
+} catch (e) {
+  if (e instanceof NestedGateUnsupportedError) {
+    console.log(`gate "${e.gateId}" in nested workflow "${e.workflowId}"`);
+    // e.siblingErrors — non-gate rejections from concurrent foreach siblings
+    // e.siblingSuspensions — other items in concurrent foreach that also suspended
+  }
+}
+```
+> **Middleware-wrapping caveat:** `NestedGateUnsupportedError` `instanceof` is only stable when caught close to the call site. App-specific error wrappers that re-throw as their own types defeat the check. Preserve `cause` if you wrap.
+> **Foreach concurrency hazard:** a nested gate inside concurrent `foreach` waits for siblings to complete — sibling LLM calls bill, sibling DB writes commit. Either use `concurrency: 1` or move the gate above the `foreach`. Sibling-side non-gate errors are preserved in `result.warnings` (`source: "foreach-sibling"`) and attached to the marker via `siblingErrors`. The lowest-index suspending item wins the marker; the rest contribute to `siblingSuspensions`.
+### Snapshot immutability (opt-in)
+By default snapshots and `result.warnings` are mutable. Pass `freezeSnapshots: true` in `RunOptions` to recursively `Object.freeze` them — useful when you serialize through an in-memory queue and want to catch accidental mutation:
+```ts
+const result = await pipeline.generate(ctx, input, { freezeSnapshots: true });
+```
+The same flag governs gate snapshots, F1's checkpoint snapshots (when shipped), and the warnings array. **For serializing consumers, leave it `false`** — `JSON.stringify` already copies, and freezing every step is wasted work. `runOptions` does **not** propagate into nested workflows.
+Caveat: `Object.freeze(new Map())` doesn't prevent `.set()`. Maps and Sets inside payloads lose immutability.
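The caveat is a property of JavaScript itself and can be reproduced with a generic recursive freeze. `deepFreeze` below is a hypothetical helper written for this illustration, not the routine pipeai uses.

```typescript
// Recursively freezes plain objects and arrays. Map and Set entries live in
// internal slots, which Object.freeze does not touch.
function deepFreeze<T>(value: T): T {
  if (typeof value === "object" && value !== null && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const child of Object.values(value)) deepFreeze(child);
  }
  return value;
}

const payload = deepFreeze({ tags: ["a"], seen: new Map<string, number>() });

try {
  payload.tags.push("b"); // frozen array: push throws a TypeError
} catch (err) {
  console.log("array mutation blocked:", err instanceof TypeError);
}

payload.seen.set("x", 1); // succeeds despite the freeze
console.log("map size after set:", payload.seen.size);
```

If payloads need to be tamper-evident, converting Maps and Sets to plain arrays or objects before they enter a snapshot avoids the gap.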
+## Observability via `Workflow.create({ observability })`
+Pass an `observability` object to `Workflow.create()` to receive lifecycle events for every node in the workflow:
+```ts
+import { Workflow, type WorkflowObservability } from "pipeai";
+const obs: WorkflowObservability = {
+  onStepStart: ({ stepId, type, ctx, input }) => {
+    console.log(`step ${stepId} (${type}) starting`);
+  },
+  onStepFinish: ({ stepId, type, output, durationMs, suspended }) => {
+    console.log(`step ${stepId} (${type}) finished in ${durationMs}ms, suspended=${suspended}`);
+  },
+  onStepError: ({ stepId, type, error, durationMs }) => {
+    console.error(`step ${stepId} (${type}) threw after ${durationMs}ms`, error);
+  },
+};
+const pipeline = Workflow.create<Ctx, string>({ observability: obs })
+  .step("classify", classifier)
+  .step("respond", responder);
+```
+The hooks are threaded through every builder return, so any chain following `Workflow.create({ observability })` keeps the same hooks. `ResumedWorkflow` (gate resume via `loadState`) and `CheckpointResumedWorkflow` (checkpoint resume via `resumeFrom`) ALSO inherit it — events fire on resumed runs without re-wiring.
+### Per-node firing rules
+| Node | `onStepStart` | `onStepFinish` (`suspended`) | `onStepError` |
+|---|---|---|---|
+| step / nested / branch / foreach / parallel / repeat | always | when body returns (`false`) | on body throw |
+| gate (suspends) | always | `suspended: true` | never |
+| gate (cond false → skip) | always | `suspended: false` | never |
+| catch | only when `pendingError` set | when `catchFn` returns | when `catchFn` throws |
+| finally | always (runs even after suspension) | always (`suspended: false`) | when body throws |
+Skip-checked nodes (suspension or error state already set on entry) emit **nothing** — `.finally()` is the exception.
+### Per-item events for `foreach` and `parallel`
+`foreach` and `parallel` ALSO fire per-item events:
+```ts
+const obs: WorkflowObservability = {
+  onItemStart: ({ stepId, type, itemIndex, input }) => { /* ... */ },
+  onItemFinish: ({ stepId, type, itemIndex, output, durationMs }) => { /* ... */ },
+  onItemError: ({ stepId, type, itemIndex, error, durationMs }) => { /* ... */ },
+};
+```
+- For `foreach`: `itemIndex` is the item's numeric index.
+- For `parallel` record form: `itemIndex` is the branch's string key.
+- For `parallel` tuple form: `itemIndex` is the branch's numeric index.
+- `repeat` does **NOT** emit per-item events. Its iteration count is data-dependent — per-item would mislead.
+### Error semantics inside hooks
+- Errors thrown inside `onStepStart`, `onStepFinish`, `onItemStart`, `onItemFinish`, `onItemError` are captured into `result.warnings` with the matching `source` tag and mirrored to `console.error`. The workflow continues.
+- Errors thrown inside `onStepError` on the normal path cause the ORIGINAL step error to reach the caller with `error.cause = obsError`. The `instanceof` of the original error is preserved.
+- `onCheckpoint` failures fire `onStepError({ stepId: CHECKPOINT_STEP_ID, type: "step", ... })`.
+### Concurrent-run-safe OTel pattern
+Don't key observability state on `ctx` alone — concurrent runs share it. Use a per-`runId` key:
+```ts
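+import { trace, SpanStatusCode } from "@opentelemetry/api";
+// Tracer name is illustrative; use your own service's name.
+const tracer = trace.getTracer("pipeai-workflow");
+type Ctx = { userId: string; runId: string };
+// One open span per (run, step), keyed by `${runId}:${stepId}`.
+const spans = new Map<string, ReturnType<typeof tracer.startSpan>>();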
+const pipeline = Workflow.create<Ctx>({
+  observability: {
+    onStepStart: ({ stepId, type, ctx }) => {
+      const c = ctx as Ctx;
+      spans.set(`${c.runId}:${stepId}`, tracer.startSpan(`${type}:${stepId}`, {
+        attributes: { userId: c.userId },
+      }));
+    },
+    onStepFinish: ({ stepId, ctx, durationMs, suspended }) => {
+      const c = ctx as Ctx; const key = `${c.runId}:${stepId}`;
+      const span = spans.get(key);
+      span?.setAttribute("duration_ms", durationMs);
+      span?.setAttribute("suspended", suspended);
+      span?.end(); spans.delete(key);
+    },
+    onStepError: ({ stepId, ctx, error }) => {
+      const c = ctx as Ctx; const key = `${c.runId}:${stepId}`;
+      const span = spans.get(key);
+      span?.recordException(error as Error);
+      span?.setStatus({ code: SpanStatusCode.ERROR });
+      span?.end(); spans.delete(key);
+    },
+  },
+}).step(classifier).step(supportAgent);
+```
+## Graph patterns
+The existing combinators compose into common workflow graph shapes — no new primitives needed.
+### Cycles via `repeat(subWorkflow, { until })`
+Re-run a sub-workflow until a predicate is satisfied:
+```ts
+const cycle = Workflow.create<Ctx, Plan>().step(executor).step(critic);
+const agent = Workflow.create<Ctx, string>()
+  .step(planner)
+  .repeat(cycle, { until: ({ output }) => output.satisfied, maxIterations: 5 });
+```
+`repeat` runs its body as a sub-workflow. Each iteration's output becomes the next iteration's input.
+### Multi-path branching with rejoin via `.branch(...).step(...)`
+The first step AFTER a `branch` is the rejoin point — the chosen branch's output flows in regardless of which case fired:
+```ts
+const pipeline = Workflow.create<Ctx>()
+  .step("classify", classifier)
+  .branch({
+    select: ({ input }) => input as "bug" | "feature",
+    agents: { bug: bugAgent, feature: featureAgent },
+  })
+  .step("persist", ({ input, ctx }) => db.save(ctx.userId, input));
+```
+### Fan-out / fan-in via `.parallel({...}).step(...)`
+`parallel` produces a record/tuple; the next step consumes the combined shape:
+```ts
+const pipeline = Workflow.create<Ctx, string>()
+  .step("init", ({ input }) => input)
+  .parallel({ researcher, critic })
+  .step("synthesize", ({ input }) => `${input.researcher} + ${input.critic}`);
+```
+Before fanning out, review the [rate-limit and ctx-mutation hazards](#fan-out-via-parallel) above.
+### Self-recursion is NOT supported
+```ts
+// Doesn't work — `recur` is undefined at evaluation.
+let recur;
+recur = Workflow.create<Ctx, string>()
+  .step(executor)
+  .repeat(recur, { until: () => false });   // ← recur is undefined here
+```
+A future `repeat(thunk)` overload (F4.5 candidate) could enable this — the cycle guard inside `stepShapeHash` is already prepared for it.
+## Migration from 0.3.x
+0.4.0 makes suspension a return value instead of a thrown error and adds seven smaller behavior changes. The full list:
+1. **`.finally()` runs after a gate suspends.** Code that assumed `finally` ran only on completion must now check `result.status === "complete"`.
+2. **Nested-workflow `.finally()` bodies run before `NestedGateUnsupportedError` fires.** Inner finallys see `state.suspension` truthy while running — don't branch on it. Side-effecting inner finallys execute on a path the user perceives as a thrown error.
+3. **A throwing `.finally()` no longer aborts subsequent `.finally()` bodies.** All finallys run; their errors accumulate.
+4. **`WorkflowSuspended` is deleted.** Migrate `try / catch (e instanceof WorkflowSuspended)` → `if (result.status === "suspended")`.
+5. **`WorkflowResult<T>` shape changed.** `const { output } = await pipeline.generate(...)` is now a strict-mode compile error. Use `if (result.status !== "complete") throw …; const { output } = result`.
+6. **`stream()` on suspension closes cleanly.** `WorkflowStreamOptions.onError` is **not** invoked for suspension — discriminate via the resolved `output` Promise. Real errors still flow through `onError`. F0 emits a one-time `console.warn` per process when a gate fires in stream mode with `onError` set.
+7. **Any `.finally()` body that throws on the completion path produces an `AggregateError`.** This holds even when only a single finally throws, so the contract stays stable once any finally is added.
+8. **Duplicate `(type, id)` pairs in the same workflow throw at builder finalization.** `foreach(agentX).foreach(agentX)` and back-to-back default-id `branch(...)` calls must pass an explicit `{ id }`. The same applies to `step(agent, { id })` when reusing an agent in two steps.
+Before:
+```ts
+import { WorkflowSuspended } from "pipeai";
+try {
+  const { output } = await pipeline.generate(ctx, input);
+  return output;
+} catch (e) {
+  if (e instanceof WorkflowSuspended) {
+    await db.saveSnapshot(e.snapshot);
+    return null;
+  }
+  throw e;
+}
+```
+After:
+```ts
+const result = await pipeline.generate(ctx, input);
+if (result.status === "suspended") {
+  await db.saveSnapshot(result.snapshot);
+  return null;
+}
+return result.output;
+```
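Change 5 means callers narrow on `status` before reading `output`. A small helper can centralize that check. This is a sketch under assumptions: `WorkflowResultLike` is a local stand-in that mirrors only the `status`, `output`, and `snapshot` fields used above (pipeai's real `WorkflowResult<T>` may carry more), and `outputOrNull` is a hypothetical name.

```ts
// Local stand-in for the result union; not imported from pipeai.
type WorkflowResultLike<T> =
  | { status: "complete"; output: T }
  | { status: "suspended"; snapshot: unknown };

// Returns the output on completion. On suspension it hands the snapshot to
// the caller-supplied callback and returns null, like the "After" example.
function outputOrNull<T>(
  result: WorkflowResultLike<T>,
  onSuspended: (snapshot: unknown) => void,
): T | null {
  if (result.status === "suspended") {
    onSuspended(result.snapshot);
    return null;
  }
  return result.output; // narrowed to the "complete" variant here
}
```

The callback is synchronous to keep the sketch short; a caller that persists snapshots asynchronously would collect the snapshot here and await the save afterwards.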
 ## Full Example