@docshield/didactic 0.1.1 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -5,7 +5,9 @@
  [![npm version](https://img.shields.io/npm/v/@docshield/didactic.svg)](https://www.npmjs.com/package/@docshield/didactic)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- Eval and optimization framework for LLM workflows.
+ **Eval** your LLM workflows by comparing actual outputs against expected results with smart comparators that handle real-world variations. **Optimize** prompts automatically through iterative self-improvement—the system analyzes its own mistakes and rewrites prompts to boost accuracy.
+
+ Use it to test extraction- and classification-based AI workflows, catch regressions, and improve performance.
 
  ## Installation
 
@@ -18,7 +20,14 @@ Requires Node.js >= 18.0.0
  ## Quick Start
 
  ```typescript
- import { didactic, within, oneOf, exact } from '@docshield/didactic';
+ import {
+   didactic,
+   within,
+   oneOf,
+   exact,
+   unordered,
+   numeric,
+ } from '@docshield/didactic';
 
  const result = await didactic.eval({
    executor: didactic.endpoint('https://api.example.com/extract'),
@@ -26,18 +35,63 @@ const result = await didactic.eval({
      premium: within({ tolerance: 0.05 }),
      policyType: oneOf(['claims-made', 'occurrence']),
      carrier: exact,
+     // Nested comparators for arrays
+     coverages: unordered({
+       type: exact,
+       limit: numeric,
+     }),
    },
    testCases: [
      {
        input: { emailId: 'email-123' },
-       expected: { premium: 12500, policyType: 'claims-made', carrier: 'Acme Insurance' },
+       expected: {
+         premium: 12500,
+         policyType: 'claims-made',
+         carrier: 'Acme Insurance',
+         coverages: [
+           { type: 'liability', limit: 1000000 },
+           { type: 'property', limit: 500000 },
+         ],
+       },
      },
    ],
  });
 
- console.log(`${result.passed}/${result.total} passed (${result.accuracy * 100}% field accuracy)`);
+ console.log(
+   `${result.passed}/${result.total} passed (${result.accuracy * 100}% field accuracy)`
+ );
+ ```
+
+ ## Examples
+
+ ### Eval - Invoice Parser
+
+ Real-world invoice extraction using Anthropic's Claude with structured outputs. Tests field accuracy across vendor names, line items, and payment terms.
+
+ ```bash
+ # Set your API key
+ export ANTHROPIC_API_KEY=your_key_here
+
+ # Run the example
+ npm run example:eval:invoice-parser
+ ```
+
+ Shows how to use the `numeric`, `name`, `exact`, `unordered()`, and `llmCompare` comparators for financial data extraction with nested comparator structures.
+
+ ### Optimizer - Expense Categorizer
+
+ Iteratively feeds eval failures back into an optimization loop that self-improves the prompt. Runs evals until it reaches the target success rate or runs out of budget.
+
+ ```bash
+ # Set your API key
+ export ANTHROPIC_API_KEY=your_key_here
+
+ # Run the example
+ npm run example:optimizer:expense-categorizer
  ```
 
+ Shows how to use Didactic to self-heal failures and improve the prompt so it performs better across the test set.
+
  ---
 
  ## Core Concepts
@@ -45,65 +99,20 @@ console.log(`${result.passed}/${result.total} passed (${result.accuracy * 100}%
  Didactic has three core components:
 
  1. **[Executors](#executors)** — Abstraction for running your LLM workflow (local function or HTTP endpoint)
- 2. **[Comparators](#comparators)** — Functions to compare the executor's output against your test case's expected output.
- 3. **[Optimization](#didacticoptimizeevalconfig-optimizeconfig)** — Iterative prompt improvement loop to hit a target success rates
+ 2. **[Comparators](#comparators)** — Nested structure matching your data shape, with per-field comparison logic and `unordered()` for arrays
+ 3. **[Optimization](#didacticoptimizeevalconfig-optimizeconfig)** — Iterative prompt improvement loop to hit a target success rate
 
- **How they work together:** Your executor runs each test case's input through your LLM workflow, returning output that matches your test case's expected output shape. Comparators then evaluate each field of the output against expected values, producing pass/fail results.
+ **How they work together:** Your executor runs each test case's input through your LLM workflow, returning output that matches your test case's expected output shape. Comparators then evaluate each field of the output against expected values, using nested structures that mirror your data shape. For arrays, use `unordered()` to match by similarity rather than index position.
 
  In optimization mode, these results feed into an LLM that analyzes failures and generates improved system prompts—repeating until your target success rate or iteration/cost limit is reached.
 
-
  #### Eval Flow
 
  ![Eval Flow](docs/diagram-1.svg)
 
  #### Optimize Flow
 
- ```mermaid
- flowchart TB
-     subgraph Config ["Config"]
-         IP[Initial Prompt]
-         TARGET[targetSuccessRate]
-         LIMITS[maxIterations / maxCost]
-     end
-
-     IP --> EVAL
-
-     subgraph Loop ["Optimization Loop"]
-         EVAL[Run Eval] --> CHECK{Target reached?}
-         CHECK -->|Yes| SUCCESS[Return optimized prompt]
-         CHECK -->|No| LIMIT{Limits exceeded?}
-         LIMIT -->|Yes| BEST[Return best prompt]
-         LIMIT -->|No| FAIL[Extract failures]
-         FAIL --> PATCH[Generate patches]
-         PATCH --> MERGE[Merge patches]
-         MERGE --> UPDATE[New Prompt]
-         UPDATE --> EVAL
-     end
-
-     TARGET --> CHECK
-     LIMITS --> LIMIT
-
-     SUCCESS --> OUT[OptimizeResult]
-     BEST --> OUT
-
-     linkStyle default stroke:#FFFFFF
-     style Config fill:#343434,stroke:#6D88B4,color:#FFFFFF
-     style Loop fill:#343434,stroke:#6D88B4,color:#FFFFFF
-     style IP fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style TARGET fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style LIMITS fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style EVAL fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style FAIL fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style PATCH fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style MERGE fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style UPDATE fill:#BFD7FF,stroke:#6D88B4,color:#000B33
-     style CHECK fill:#FFEDE0,stroke:#6D88B4,color:#000B33
-     style LIMIT fill:#FFEDE0,stroke:#6D88B4,color:#000B33
-     style SUCCESS fill:#CDF1E6,stroke:#6D88B4,color:#000B33
-     style BEST fill:#CDF1E6,stroke:#6D88B4,color:#000B33
-     style OUT fill:#CDF1E6,stroke:#6D88B4,color:#000B33
- ```
+ ![Optimize Flow](docs/diagram-2.svg)
 
  ---
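The optimize flow described above (run an eval, stop at the target or a limit, otherwise patch and merge into a new prompt) can be sketched as a plain control loop. All names here (`runEval`, `generatePatches`, `mergePatches`) and the result shape are illustrative stand-ins, not the library's actual internals:

```typescript
// Hypothetical sketch of the optimize loop; not the library's implementation.
type EvalOutcome = { successRate: number; cost: number; failures: string[] };

interface OptimizeInputs {
  systemPrompt: string;
  targetSuccessRate: number;
  maxIterations: number;
  maxCost?: number;
  runEval: (prompt: string) => EvalOutcome;
  generatePatches: (failures: string[]) => string[];
  mergePatches: (prompt: string, patches: string[]) => string;
}

function optimize(cfg: OptimizeInputs): { prompt: string; successRate: number } {
  let prompt = cfg.systemPrompt;
  let best = { prompt, successRate: -1 };
  let spent = 0;

  for (let i = 0; i < cfg.maxIterations; i++) {
    const result = cfg.runEval(prompt);
    spent += result.cost;
    if (result.successRate > best.successRate) {
      best = { prompt, successRate: result.successRate };
    }
    // Target reached? Return the optimized prompt.
    if (result.successRate >= cfg.targetSuccessRate) return best;
    // Cost limit exceeded? Return the best prompt seen so far.
    if (cfg.maxCost !== undefined && spent >= cfg.maxCost) return best;
    // Otherwise: extract failures, generate patches, merge into a new prompt.
    prompt = cfg.mergePatches(prompt, cfg.generatePatches(result.failures));
  }
  // Iteration limit exhausted: return the best prompt seen so far.
  return best;
}
```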
 
@@ -119,18 +128,18 @@ const result = await didactic.eval(config);
 
  #### EvalConfig
 
- | Property | Type | Kind | Required | Default | Description |
- |----------|------|------|----------|---------|-------------|
- | `executor` | `Executor<TInput, TOutput>` | Object | **Yes** | — | Function that executes your LLM workflow. Receives input and optional system prompt, returns structured output. |
- | `testCases` | `TestCase<TInput, TOutput>[]` | Array | **Yes** | — | Array of `{ input, expected }` pairs. Each test case runs through the executor and compares output to expected. |
- | `comparators` | `ComparatorsConfig` | Object/Function | **One of** | — | Comparator(s) for expected vs. actual output. Pass an object of field names to comparators for field-level comparison of objects or list of objects. Or pass a single comparator function for uniform comparison across the entire output (primitives, lists of primitives). |
- | `comparatorOverride` | `Comparator<TOutput>` | Function | **One of** | — | Custom whole-object comparison function. Use when you need complete control over comparison logic and want to bypass field-level matching. |
- | `systemPrompt` | `string` | Primitive | No | — | System prompt passed to the executor. Required if using optimization. |
- | `perTestThreshold` | `number` | Primitive | No | `1.0` | Minimum field pass rate for a test case to pass (0.0–1.0). At default 1.0, all fields must pass. Set to 0.8 to pass if 80% of fields match. |
- | `unorderedList` | `boolean` | Primitive | No | `false` | Enable Hungarian matching for array comparison. When true, arrays are matched by similarity rather than index position. Example: `output = [1, 2, 3, 4], expected = [4, 3, 2, 1]` - you would set `unorderedList` to `true` if you consider this a pass. Use `unorderedList` when your expected output is an array of things and you want to compare the items in the array by similarity rather than index position. Works for object and primitive arrays. |
- | `rateLimitBatch` | `number` | Primitive | No | — | Number of test cases to run concurrently. Use with `rateLimitPause` for rate-limited APIs. |
- | `rateLimitPause` | `number` | Primitive | No | — | Seconds to wait between batches. Pairs with `rateLimitBatch`. |
- | `optimize` | `OptimizeConfig` | Object | No | — | Inline optimization config. When provided, triggers optimization mode instead of single eval. |
+ | Property | Type | Kind | Required | Default | Description |
+ | --- | --- | --- | --- | --- | --- |
+ | `executor` | `Executor<TInput, TOutput>` | Object | **Yes** | — | Function that executes your LLM workflow. Receives input and optional system prompt, returns structured output. |
+ | `testCases` | `TestCase<TInput, TOutput>[]` | Array | **Yes** | — | Array of `{ input, expected }` pairs. Each test case runs through the executor and compares output to expected. |
+ | `comparators` | `ComparatorsConfig` | Object/Function | No | `exact` | Nested comparator structure matching your data shape. Can be a single comparator function (e.g., `exact`), or a nested object with per-field comparators. Use the `unordered()` wrapper for arrays that should match by similarity rather than index. |
+ | `comparatorOverride` | `Comparator<TOutput>` | Function | No | — | Custom whole-object comparison function. Use when you need complete control over comparison logic and want to bypass field-level matching. |
+ | `llmConfig` | `LLMConfig` | Object | No | — | Default LLM configuration for LLM-based comparators (e.g., `llmCompare`). Provides `apiKey` and optional `provider` so you don't repeat them in each comparator call. |
+ | `systemPrompt` | `string` | Primitive | No | — | System prompt passed to the executor. Required if using optimization. |
+ | `perTestThreshold` | `number` | Primitive | No | `1.0` | Minimum field pass rate for a test case to pass (0.0–1.0). At default 1.0, all fields must pass. Set to 0.8 to pass if 80% of fields match. |
+ | `rateLimitBatch` | `number` | Primitive | No | — | Number of test cases to run concurrently. Use with `rateLimitPause` for rate-limited APIs. |
+ | `rateLimitPause` | `number` | Primitive | No | — | Seconds to wait between batches. Pairs with `rateLimitBatch`. |
+ | `optimize` | `OptimizeConfig` | Object | No | — | Inline optimization config. When provided, triggers optimization mode instead of single eval. |
 
  ---
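The `perTestThreshold` arithmetic from the EvalConfig table above can be illustrated with a small sketch. This is illustrative only, not the library's actual implementation:

```typescript
// Hypothetical sketch: a test case passes when its field pass rate meets
// perTestThreshold (default 1.0, i.e. every field must match).
function testCasePasses(fieldResults: boolean[], perTestThreshold = 1.0): boolean {
  if (fieldResults.length === 0) return true;
  const correct = fieldResults.filter(Boolean).length;
  return correct / fieldResults.length >= perTestThreshold;
}
```

For example, with 4 of 5 fields correct, the default threshold of 1.0 fails the case, while `perTestThreshold: 0.8` passes it.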
 
@@ -155,21 +164,23 @@ const config = {
      storeLogs: true,
      thinking: true,
    },
- }
+ };
  ```
 
  #### OptimizeConfig
 
- | Property | Type | Required | Default | Description |
- |----------|------|----------|---------|-------------|
- | `systemPrompt` | `string` | **Yes** | — | Initial system prompt to optimize. This is the starting point that the optimizer will iteratively improve. |
- | `targetSuccessRate` | `number` | **Yes** | — | Target success rate to achieve (0.0–1.0). Optimization stops when this rate is reached. |
- | `apiKey` | `string` | **Yes** | — | API key for the LLM provider used by the optimizer (not your workflow's LLM). |
- | `provider` | `LLMProviders` | **Yes** | — | LLM provider the optimizer uses to analyze failures and generate improved prompts. |
- | `maxIterations` | `number` | No | `5` | Maximum optimization iterations before stopping, even if target not reached. |
- | `maxCost` | `number` | No | — | Maximum cost budget in dollars. Optimization stops if cumulative cost exceeds this. |
- | `storeLogs` | `boolean \| string` | No | — | Save optimization logs. `true` uses default path (`./didactic-logs/optimize_<timestamp>/summary.md`), or provide custom summary path. |
- | `thinking` | `boolean` | No | — | Enable extended thinking mode for deeper analysis (provider must support it). |
+ | Property | Type | Required | Default | Description |
+ | --- | --- | --- | --- | --- |
+ | `systemPrompt` | `string` | **Yes** | — | Initial system prompt to optimize. This is the starting point that the optimizer will iteratively improve. |
+ | `targetSuccessRate` | `number` | **Yes** | — | Target success rate to achieve (0.0–1.0). Optimization stops when this rate is reached. |
+ | `apiKey` | `string` | **Yes** | — | API key for the LLM provider used by the optimizer (not your workflow's LLM). |
+ | `provider` | `LLMProviders` | **Yes** | — | LLM provider the optimizer uses to analyze failures and generate improved prompts. |
+ | `maxIterations` | `number` | No | `5` | Maximum optimization iterations before stopping, even if target not reached. |
+ | `maxCost` | `number` | No | — | Maximum cost budget in dollars. Optimization stops if cumulative cost exceeds this. |
+ | `storeLogs` | `boolean \| string` | No | — | Save optimization logs. `true` uses default path (`./didactic-logs/optimize_<timestamp>/summary.md`), or provide custom summary path. |
+ | `thinking` | `boolean` | No | — | Enable extended thinking mode for deeper analysis (provider must support it). |
+ | `patchSystemPrompt` | `string` | No | [`DEFAULT_PATCH_SYSTEM_PROMPT`](src/optimizer/prompts.ts) | Custom system prompt for patch generation. Completely replaces the default prompt that analyzes failures and suggests improvements. |
+ | `mergeSystemPrompt` | `string` | No | [`DEFAULT_MERGE_SYSTEM_PROMPT`](src/optimizer/prompts.ts) | Custom system prompt for merging patches. Completely replaces the default prompt that combines multiple patches into a coherent system prompt. |
 
  ---
 
@@ -178,11 +189,13 @@ const config = {
  Executors abstract your LLM workflow from the evaluation harness. Whether your workflow runs locally, calls a remote API, or orchestrates Temporal activities, executors provide a consistent interface: take input + optional system prompt, return expected output.
 
  This separation enables:
+
  - **Swap execution strategies** — Switch between local/remote without changing tests
  - **Dynamic prompt injection** — System prompts flow through for optimization
  - **Cost tracking** — Aggregate execution costs across test runs
 
  didactic provides two built-in executors:
+
  - `endpoint` for calling a remote API
  - `fn` for calling a local function
 
@@ -192,7 +205,6 @@ You may want to provide a `mapAdditionalContext` function to extract metadata fr
 
  Note: If you do not provide a `mapResponse` function, the executor assumes its raw response is the output you want to compare against `expected`.
 
-
  ### `endpoint(url, config?)`
 
  Create an executor that calls an HTTP endpoint. The executor sends input + systemPrompt as the request body and expects structured JSON back.
@@ -211,14 +223,14 @@ const executor = endpoint('https://api.example.com/workflow', {
 
  #### EndpointConfig
 
- | Property | Type | Required | Default | Description |
- |----------|------|----------|---------|-------------|
- | `method` | `'POST' \| 'GET'` | No | `'POST'` | HTTP method for the request. |
- | `headers` | `Record<string, string>` | No | `{}` | Headers to include (auth tokens, content-type overrides, etc). |
- | `mapResponse` | `(response: any) => TOutput` | No | — | Transform the raw response to your expected output shape. Use when your API wraps results. |
- | `mapAdditionalContext` | `(response: any) => unknown` | No | — | Extract metadata (logs, debug info) from response for inspection. |
- | `mapCost` | `(response: any) => number` | No | — | Extract execution cost from response (e.g., token counts in headers). |
- | `timeout` | `number` | No | `30000` | Request timeout in milliseconds. |
+ | Property | Type | Required | Default | Description |
+ | --- | --- | --- | --- | --- |
+ | `method` | `'POST' \| 'GET'` | No | `'POST'` | HTTP method for the request. |
+ | `headers` | `Record<string, string>` | No | `{}` | Headers to include (auth tokens, content-type overrides, etc). |
+ | `mapResponse` | `(response: any) => TOutput` | No | — | Transform the raw response to your expected output shape. Use when your API wraps results. |
+ | `mapAdditionalContext` | `(response: any) => unknown` | No | — | Extract metadata (logs, debug info) from response for inspection. |
+ | `mapCost` | `(response: any) => number` | No | — | Extract execution cost from response (e.g., token counts in headers). |
+ | `timeout` | `number` | No | `30000` | Request timeout in milliseconds. |
 
  ---
 
@@ -234,19 +246,24 @@ const executor = fn({
      return await myLLMCall(input, systemPrompt);
    },
    mapResponse: (result) => result.output,
-   mapCost: (result) => result.usage.input_tokens * 0.000003 + result.usage.output_tokens * 0.000015,
-   mapAdditionalContext: (result) => ({ model: result.model, finishReason: result.stop_reason }),
+   mapCost: (result) =>
+     result.usage.input_tokens * 0.000003 +
+     result.usage.output_tokens * 0.000015,
+   mapAdditionalContext: (result) => ({
+     model: result.model,
+     finishReason: result.stop_reason,
+   }),
  });
  ```
 
  #### FnConfig
 
- | Property | Type | Required | Default | Description |
- |----------|------|----------|---------|-------------|
- | `fn` | `(input: TInput, systemPrompt?: string) => Promise<TRaw>` | **Yes** | — | Async function that executes your workflow. Receives test input and optional system prompt. |
- | `mapResponse` | `(result: TRaw) => TOutput` | No | — | Transform raw result from fn into the expected output shape to compare. Without this, raw result is used directly. |
- | `mapAdditionalContext` | `(result: TRaw) => unknown` | No | — | Map additional context about the run to pass to the optimizer prompt. |
- | `mapCost` | `(result: TRaw) => number` | No | — | Extract cost from the result (if your function tracks it). Used to track the total cost of the runs. |
+ | Property | Type | Required | Default | Description |
+ | --- | --- | --- | --- | --- |
+ | `fn` | `(input: TInput, systemPrompt?: string) => Promise<TRaw>` | **Yes** | — | Async function that executes your workflow. Receives test input and optional system prompt. |
+ | `mapResponse` | `(result: TRaw) => TOutput` | No | — | Transform raw result from fn into the expected output shape to compare. Without this, raw result is used directly. |
+ | `mapAdditionalContext` | `(result: TRaw) => unknown` | No | — | Map additional context about the run to pass to the optimizer prompt. |
+ | `mapCost` | `(result: TRaw) => number` | No | — | Extract cost from the result (if your function tracks it). Used to track the total cost of the runs. |
 
  ---
 
@@ -276,6 +293,7 @@ const executor = fn({
  ```
 
  Without `mapResponse`:
+
  - **endpoint**: uses the raw JSON response as output
  - **fn**: uses the function's return value directly as output
 
@@ -338,56 +356,79 @@ const executor = fn({
 
  Comparators bridge the gap between messy LLM output and semantic correctness. Rather than requiring exact string matches, comparators handle real-world data variations—currency formatting, date formats, name suffixes, numeric tolerance—while maintaining semantic accuracy.
 
- Each comparator returns a `passed` boolean and a `similarity` score (0.0–1.0). The pass/fail determines test results, while similarity enables Hungarian matching for unordered array comparison.
+ **Nested structure:** Comparators mirror your data shape. Use objects to define per-field comparators, and `unordered()` to wrap arrays that should match by similarity rather than index position.
+
+ Each comparator returns a `passed` boolean and a `similarity` score (0.0–1.0). The pass/fail determines test results, while similarity enables Hungarian matching for `unordered()` arrays.
 
  ### `comparators` vs `comparatorOverride`
 
- Use **`comparators`** for standard comparison. It accepts either:
+ Use **`comparators`** for standard comparison. It accepts:
 
  **1. A single comparator function** — Applied uniformly across the output:
 
  ```typescript
- // Clean syntax for primitives, arrays, or simple objects
+ // Clean syntax for primitives or arrays
  const result = await didactic.eval({
    executor: myNumberExtractor,
-   comparators: exact, // Single comparator, no need for { '': exact }
+   comparators: exact, // Single comparator for root-level output
    testCases: [
      { input: 'twenty-three', expected: 23 },
      { input: 'one hundred', expected: 100 },
    ],
  });
 
- // Works with arrays too
+ // For unordered arrays, use the unordered() wrapper
  const result = await didactic.eval({
    executor: myListExtractor,
-   comparators: exact,
-   unorderedList: true, // Enable Hungarian matching for unordered arrays
-   testCases: [
-     { input: 'numbers', expected: [1, 2, 3, 4] },
-   ],
+   comparators: unordered(exact), // Match by similarity, not index
+   testCases: [{ input: 'numbers', expected: [1, 2, 3, 4] }],
  });
  ```
 
- **2. A field mapping object** — Different comparators per field:
+ **2. A nested object structure** — Mirrors your data shape with per-field comparators:
 
  ```typescript
  const result = await didactic.eval({
    executor: myExecutor,
    comparators: {
-     premium: within({ tolerance: 0.05 }), // 5% tolerance for numbers
-     carrier: exact, // Exact string match
-     effectiveDate: date, // Flexible date parsing
+     premium: within({ tolerance: 0.05 }), // 5% tolerance for numbers
+     carrier: exact, // Exact string match
+     effectiveDate: date, // Flexible date parsing
+     // Use unordered() for arrays that can be in any order
+     lineItems: unordered({
+       description: name,
+       amount: numeric,
+     }),
    },
    testCases: [
      {
        input: { emailId: 'email-123' },
-       expected: { premium: 12500, carrier: 'Acme Insurance', effectiveDate: '2024-01-15' },
+       expected: {
+         premium: 12500,
+         carrier: 'Acme Insurance',
+         effectiveDate: '2024-01-15',
+         lineItems: [
+           { description: 'Service Fee', amount: 100 },
+           { description: 'Tax', amount: 25 },
+         ],
+       },
      },
    ],
  });
  ```
 
+ **3. Optional (defaults to `exact`)** — If omitted, `exact` is used for the entire output:
+
+ ```typescript
+ // No comparators needed for simple exact matching
+ const result = await didactic.eval({
+   executor: myExecutor,
+   testCases: [{ input: 'hello', expected: 'hello' }],
+ });
+ ```
+
  Use **`comparatorOverride`** when you need:
+
  - Complete control over comparison logic
  - Custom cross-field validation
  - Whole-object semantic comparison that doesn't map to individual fields
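The similarity-driven array matching that `unordered()` relies on can be illustrated with a simplified greedy pairing. The docs say the library uses the Hungarian algorithm; this sketch captures the idea (match items by best similarity, not index) but is not the actual implementation:

```typescript
// Simplified greedy sketch of unordered() array matching; illustrative only.
type Similarity<T> = (expected: T, actual: T) => number; // 0.0-1.0

function unorderedPassRate<T>(expected: T[], actual: T[], sim: Similarity<T>): number {
  const used = new Set<number>();
  let matched = 0;
  for (const e of expected) {
    let bestIdx = -1;
    let bestSim = 0;
    actual.forEach((a, i) => {
      if (used.has(i)) return;
      const s = sim(e, a);
      if (s > bestSim) {
        bestSim = s;
        bestIdx = i;
      }
    });
    // Count a match only on full similarity (a real implementation would use
    // the wrapped comparator's pass/fail here).
    if (bestIdx >= 0 && bestSim === 1) {
      used.add(bestIdx);
      matched++;
    }
  }
  return expected.length === 0 ? 1 : matched / expected.length;
}
```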
@@ -409,41 +450,83 @@ const result = await didactic.eval({
 
  ### Built-in Comparators
 
- | Comparator | Signature | Description |
- |------------|-----------|-------------|
- | `exact` | `(expected, actual)` | Deep equality with cycle detection. Default when no comparator specified. |
- | `within` | `({ tolerance, mode? })` | Numeric tolerance. `mode: 'percentage'` (default) or `'absolute'`. |
- | `oneOf` | `(allowedValues)` | Enum validation. Passes if actual equals expected AND both are in the allowed set. |
- | `contains` | `(substring)` | String contains check. Passes if actual includes the substring. |
- | `presence` | `(expected, actual)` | Existence check. Passes if expected is absent, or if actual has any value when expected does. |
- | `numeric` | `(expected, actual)` | Numeric comparison after stripping currency symbols, commas, accounting notation. |
- | `numeric.nullable` | `(expected, actual)` | Same as `numeric`, but treats null/undefined/empty as 0. |
- | `date` | `(expected, actual)` | Date comparison after normalizing formats (ISO, US MM/DD, EU DD/MM, written). |
- | `name` | `(expected, actual)` | Name comparison with case normalization, suffix removal (Inc, LLC), fuzzy matching. |
- | `custom` | `({ compare })` | User-defined logic. `compare(expected, actual, context?) => boolean`. Context provides access to parent objects for cross-field logic. |
+ | Comparator | Usage | Description |
+ | --- | --- | --- |
+ | `exact` | `exact` | Deep equality with cycle detection. Default when no comparator specified. |
+ | `within` | `within({ tolerance, mode? })` | Numeric tolerance. `mode: 'percentage'` (default) or `'absolute'`. |
+ | `oneOf` | `oneOf(allowedValues)` | Enum validation. Passes if actual equals expected AND both are in the allowed set. |
+ | `contains` | `contains(substring)` | String contains check. Passes if actual includes the substring. |
+ | `presence` | `presence` | Existence check. Passes if expected is absent, or if actual has any value when expected does. |
+ | `numeric` | `numeric` | Numeric comparison after stripping currency symbols, commas, accounting notation. |
+ | `numeric.nullable` | `numeric.nullable` | Same as `numeric`, but treats null/undefined/empty as 0. |
+ | `date` | `date` | Date comparison after normalizing formats (ISO, US MM/DD, EU DD/MM, written). |
+ | `name` | `name` | Name comparison with case normalization, suffix removal (Inc, LLC), fuzzy matching. |
+ | `unordered` | `unordered(comparator)` or `unordered({ fields })` | Wrapper for arrays that should match by similarity (Hungarian algorithm) rather than index. Pass a comparator for primitives or nested config for objects. |
+ | `llmCompare` | `llmCompare({ systemPrompt?, apiKey?, provider? })` | LLM-based semantic comparison. Uses `llmConfig` from eval config if `apiKey` not provided. Returns rationale and tracks cost. |
+ | `custom` | `custom({ compare })` | User-defined logic. `compare(expected, actual, context?) => boolean`. Context provides access to parent objects for cross-field logic. |
 
  ### Examples
 
  ```typescript
- import { within, oneOf, exact, contains, presence, numeric, date, name, custom } from '@docshield/didactic';
-
- const comparators = {
-   premium: within({ tolerance: 0.05 }), // 5% tolerance
-   deductible: within({ tolerance: 100, mode: 'absolute' }), // $100 tolerance
-   policyType: oneOf(['claims-made', 'occurrence', 'entity']),
-   carrier: exact,
-   notes: contains('approved'),
-   entityName: name,
-   effectiveDate: date,
-   amount: numeric,
-   optionalField: presence,
-   customField: custom({
-     compare: (expected, actual, context) => {
-       // Access sibling fields via context.actualParent
-       return actual.toLowerCase() === expected.toLowerCase();
-     },
-   }),
- };
+ import {
+   didactic,
+   within,
+   oneOf,
+   exact,
+   contains,
+   presence,
+   numeric,
+   date,
+   name,
+   unordered,
+   llmCompare,
+   custom,
+   LLMProviders,
+ } from '@docshield/didactic';
+
+ const result = await didactic.eval({
+   executor: myInvoiceParser,
+   testCases: [...],
+   // LLM config for all llmCompare calls (no need to repeat apiKey)
+   llmConfig: {
+     apiKey: process.env.ANTHROPIC_API_KEY,
+     provider: LLMProviders.anthropic_claude_haiku,
+   },
+   comparators: {
+     premium: within({ tolerance: 0.05 }), // 5% tolerance
+     deductible: within({ tolerance: 100, mode: 'absolute' }), // $100 tolerance
+     policyType: oneOf(['claims-made', 'occurrence', 'entity']),
+     carrier: exact,
+     notes: contains('approved'),
+     entityName: name,
+     effectiveDate: date,
+     amount: numeric,
+     optionalField: presence,
+
+     // Unordered array of objects with nested comparators
+     lineItems: unordered({
+       description: llmCompare({
+         // Uses llmConfig.apiKey from above!
+         systemPrompt: 'Compare line item descriptions semantically.',
+       }),
+       quantity: exact,
+       price: numeric,
+     }),
+
+     // LLM-based comparison for flexible semantic matching
+     companyName: llmCompare({
+       systemPrompt:
+         'Compare company names considering abbreviations and legal suffixes.',
+     }),
+
+     customField: custom({
+       compare: (expected, actual, context) => {
+         // Access sibling fields via context.actualParent
+         return actual.toLowerCase() === expected.toLowerCase();
+       },
+     }),
+   },
+ });
  ```
 
  ---
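The behavior the comparator table describes for `within` and `numeric` can be sketched directly. The parsing rules here (currency symbols stripped, `(1,234)` read as -1234) follow the prose above but may not match the library edge case for edge case:

```typescript
// Hedged sketch of within() and numeric comparison; not the library's code.
function withinCheck(
  expected: number,
  actual: number,
  opts: { tolerance: number; mode?: 'percentage' | 'absolute' },
): boolean {
  const limit =
    (opts.mode ?? 'percentage') === 'percentage'
      ? Math.abs(expected) * opts.tolerance // tolerance as a fraction of expected
      : opts.tolerance; // tolerance as an absolute amount
  return Math.abs(actual - expected) <= limit;
}

function parseNumeric(value: string | number): number {
  if (typeof value === 'number') return value;
  let s = value.trim();
  // Accounting notation: (1,234) means -1234.
  const negative = s.startsWith('(') && s.endsWith(')');
  if (negative) s = s.slice(1, -1);
  // Strip currency symbols, thousands separators, and whitespace.
  s = s.replace(/[$€£,\s]/g, '');
  const n = Number(s);
  return negative ? -n : n;
}
```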
@@ -456,13 +539,13 @@ Supported LLM providers for the optimizer:
  import { LLMProviders } from '@docshield/didactic';
  ```
 
- | Value | Description |
- |-------|-------------|
- | `LLMProviders.anthropic_claude_opus` | Claude Opus 4.5 — Most capable, highest cost |
+ | Value | Description |
+ | --- | --- |
+ | `LLMProviders.anthropic_claude_opus` | Claude Opus 4.5 — Most capable, highest cost |
  | `LLMProviders.anthropic_claude_sonnet` | Claude Sonnet 4.5 — Balanced performance/cost |
- | `LLMProviders.anthropic_claude_haiku` | Claude Haiku 4.5 — Fastest, lowest cost |
- | `LLMProviders.openai_gpt5` | GPT-5.2 — OpenAI flagship |
- | `LLMProviders.openai_gpt5_mini` | GPT-5 Mini — OpenAI lightweight |
+ | `LLMProviders.anthropic_claude_haiku` | Claude Haiku 4.5 — Fastest, lowest cost |
+ | `LLMProviders.openai_gpt5` | GPT-5.2 — OpenAI flagship |
+ | `LLMProviders.openai_gpt5_mini` | GPT-5 Mini — OpenAI lightweight |
 
  ---
 
@@ -472,60 +555,62 @@ import { LLMProviders } from '@docshield/didactic';
472
555
 
473
556
  Returned by `didactic.eval()` when no optimization is configured.
474
557
 
475
- | Property | Type | Description |
476
- |----------|------|-------------|
477
- | `systemPrompt` | `string \| undefined` | System prompt that was used for this eval run. |
478
- | `testCases` | `TestCaseResult[]` | Detailed results for each test case. Inspect for field-level failure details. |
479
- | `passed` | `number` | Count of test cases that passed (met `perTestThreshold`). |
480
- | `total` | `number` | Total number of test cases run. |
481
- | `successRate` | `number` | Pass rate (0.0–1.0). `passed / total`. |
482
- | `correctFields` | `number` | Total correct fields across all test cases. |
483
- | `totalFields` | `number` | Total fields evaluated across all test cases. |
484
- | `accuracy` | `number` | Field-level accuracy (0.0–1.0). `correctFields / totalFields`. |
485
- | `cost` | `number` | Total execution cost aggregated from executor results. |
558
+ | Property | Type | Description |
559
+ | ---------------- | --------------------- | ----------------------------------------------------------------------------- |
560
+ | `systemPrompt` | `string \| undefined` | System prompt that was used for this eval run. |
561
+ | `testCases` | `TestCaseResult[]` | Detailed results for each test case. Inspect for field-level failure details. |
562
+ | `passed` | `number` | Count of test cases that passed (met `perTestThreshold`). |
563
+ | `total` | `number` | Total number of test cases run. |
564
+ | `successRate` | `number` | Pass rate (0.0–1.0). `passed / total`. |
565
+ | `correctFields` | `number` | Total correct fields across all test cases. |
566
+ | `totalFields` | `number` | Total fields evaluated across all test cases. |
567
+ | `accuracy` | `number` | Field-level accuracy (0.0–1.0). `correctFields / totalFields`. |
568
+ | `cost` | `number` | Total execution cost aggregated from executor results. |
569
+ | `comparatorCost` | `number` | Total cost from LLM-based comparators (e.g., `llmCompare`). |
486
570
 
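`successRate` and `accuracy` answer different questions; a quick illustration with hand-built numbers (not output from a real run):

```typescript
// Minimal shape mirroring the EvalResult metrics documented above,
// filled with hand-built sample data.
interface EvalSummary {
  passed: number;
  total: number;
  successRate: number;
  correctFields: number;
  totalFields: number;
  accuracy: number;
}

const result: EvalSummary = {
  passed: 8,
  total: 10,
  successRate: 8 / 10,
  correctFields: 45,
  totalFields: 50,
  accuracy: 45 / 50,
};

// successRate is pass/fail at the test-case level; accuracy counts
// individual fields, so a run can have high field accuracy but a low
// success rate when small failures are spread across many test cases.
console.log(`success rate: ${(result.successRate * 100).toFixed(1)}%`);
console.log(`field accuracy: ${(result.accuracy * 100).toFixed(1)}%`);
```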
487
571
  ### TestCaseResult
488
572
 
489
573
  Per-test-case detail, accessible via `EvalResult.testCases`.
490
574
 
491
- | Property | Type | Description |
492
- |----------|------|-------------|
493
- | `input` | `TInput` | The input that was passed to the executor. |
494
- | `expected` | `TOutput` | The expected output from the test case. |
495
- | `actual` | `TOutput \| undefined` | Actual output returned by executor. Undefined if execution failed. |
496
- | `passed` | `boolean` | Whether this test case passed (met `perTestThreshold`). |
497
- | `fields` | `Record<string, FieldResult>` | Per-field comparison results. Key is field path (e.g., `"address.city"`). |
498
- | `passedFields` | `number` | Count of fields that passed comparison. |
499
- | `totalFields` | `number` | Total fields compared. |
500
- | `passRate` | `number` | Field pass rate for this test case (0.0–1.0). |
501
- | `cost` | `number \| undefined` | Execution cost for this test case, if reported by executor. |
502
- | `additionalContext` | `unknown \| undefined` | Extra context extracted by executor (logs, debug info). |
503
- | `error` | `string \| undefined` | Error message if executor threw an exception. |
575
+ | Property | Type | Description |
576
+ | ------------------- | ----------------------------- | ------------------------------------------------------------------------- |
577
+ | `input` | `TInput` | The input that was passed to the executor. |
578
+ | `expected` | `TOutput` | The expected output from the test case. |
579
+ | `actual` | `TOutput \| undefined` | Actual output returned by executor. Undefined if execution failed. |
580
+ | `passed` | `boolean` | Whether this test case passed (met `perTestThreshold`). |
581
+ | `fields` | `Record<string, FieldResult>` | Per-field comparison results. Key is field path (e.g., `"address.city"`). |
582
+ | `passedFields` | `number` | Count of fields that passed comparison. |
583
+ | `totalFields` | `number` | Total fields compared. |
584
+ | `passRate` | `number` | Field pass rate for this test case (0.0–1.0). |
585
+ | `cost` | `number \| undefined` | Execution cost for this test case, if reported by executor. |
586
+ | `comparatorCost` | `number \| undefined` | Total cost from LLM-based comparators in this test case. |
587
+ | `additionalContext` | `unknown \| undefined` | Extra context extracted by executor (logs, debug info). |
588
+ | `error` | `string \| undefined` | Error message if executor threw an exception. |
504
589
 
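A common debugging pattern is to walk `testCases` and collect every failing field; a sketch over hand-built data (the field paths are illustrative):

```typescript
// Minimal slice of the TestCaseResult shape documented above.
interface FieldResultLite {
  passed: boolean;
  expected: unknown;
  actual: unknown;
}
interface TestCaseResultLite {
  passed: boolean;
  fields: Record<string, FieldResultLite>;
}

// Hand-built sample; real data comes from EvalResult.testCases.
const testCases: TestCaseResultLite[] = [
  {
    passed: false,
    fields: {
      premium: { passed: true, expected: 12500, actual: 12500 },
      'address.city': { passed: false, expected: 'Austin', actual: 'Dallas' },
    },
  },
];

// Collect "field path -> (expected, actual)" for every failing field.
const failures = testCases.flatMap((tc) =>
  Object.entries(tc.fields)
    .filter(([, f]) => !f.passed)
    .map(([path, f]) => ({ path, expected: f.expected, actual: f.actual })),
);

console.log(failures);
```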
505
590
  ### OptimizeResult
506
591
 
507
592
  Returned by `didactic.optimize()` or `didactic.eval()` with optimization configured.
508
593
 
509
- | Property | Type | Description |
510
- |----------|------|-------------|
511
- | `success` | `boolean` | Whether the target success rate was achieved. |
512
- | `finalPrompt` | `string` | The final optimized system prompt. Use this in production. |
513
- | `iterations` | `IterationResult[]` | Results from each optimization iteration. Inspect to see how the prompt evolved. |
514
- | `totalCost` | `number` | Total cost across all iterations (optimizer + executor costs). |
515
- | `logFolder` | `string \| undefined` | Folder path where optimization logs were written (only when `storeLogs` is enabled). |
594
+ | Property | Type | Description |
595
+ | ------------- | --------------------- | ------------------------------------------------------------------------------------ |
596
+ | `success` | `boolean` | Whether the target success rate was achieved. |
597
+ | `finalPrompt` | `string` | The final optimized system prompt. Use this in production. |
598
+ | `iterations` | `IterationResult[]` | Results from each optimization iteration. Inspect to see how the prompt evolved. |
599
+ | `totalCost` | `number` | Total cost across all iterations (optimizer + executor costs). |
600
+ | `logFolder` | `string \| undefined` | Folder path where optimization logs were written (only when `storeLogs` is enabled). |
516
601
 
517
602
  ### IterationResult
518
603
 
519
604
  Per-iteration detail, accessible via `OptimizeResult.iterations`.
520
605
 
521
- | Property | Type | Description |
522
- |----------|------|-------------|
523
- | `iteration` | `number` | Iteration number (1-indexed). |
524
- | `systemPrompt` | `string` | System prompt used for this iteration. |
525
- | `passed` | `number` | Test cases passed in this iteration. |
526
- | `total` | `number` | Total test cases in this iteration. |
527
- | `testCases` | `TestCaseResult[]` | Detailed test case results for this iteration. |
528
- | `cost` | `number` | Cost for this iteration. |
606
+ | Property | Type | Description |
607
+ | -------------- | ------------------ | ---------------------------------------------- |
608
+ | `iteration` | `number` | Iteration number (1-indexed). |
609
+ | `systemPrompt` | `string` | System prompt used for this iteration. |
610
+ | `passed` | `number` | Test cases passed in this iteration. |
611
+ | `total` | `number` | Total test cases in this iteration. |
612
+ | `testCases` | `TestCaseResult[]` | Detailed test case results for this iteration. |
613
+ | `cost` | `number` | Cost for this iteration. |
529
614
 
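To see how a run progressed, fold over `iterations`; a sketch with hand-built data:

```typescript
// Minimal slice of the IterationResult shape documented above.
interface IterationLite {
  iteration: number;
  passed: number;
  total: number;
  cost: number;
}

// Hand-built sample; real data comes from OptimizeResult.iterations.
const iterations: IterationLite[] = [
  { iteration: 1, passed: 5, total: 10, cost: 0.12 },
  { iteration: 2, passed: 7, total: 10, cost: 0.11 },
  { iteration: 3, passed: 9, total: 10, cost: 0.13 },
];

// Pick the iteration with the highest pass rate and sum total spend.
const best = iterations.reduce((a, b) =>
  b.passed / b.total > a.passed / a.total ? b : a,
);
const totalCost = iterations.reduce((sum, it) => sum + it.cost, 0);

console.log(`best iteration: #${best.iteration} (${best.passed}/${best.total})`);
console.log(`total cost: $${totalCost.toFixed(2)}`);
```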
530
615
  ---
531
616
 
@@ -535,12 +620,12 @@ When `storeLogs` is enabled in `OptimizeConfig`, four files are written to the l
535
620
 
536
621
  **Default path:** `./didactic-logs/optimize_<timestamp>/`
537
622
 
538
- | File | Description |
539
- |------|-------------|
540
- | `summary.md` | Human-readable report with configuration, metrics, and iteration progress |
541
- | `prompts.md` | All system prompts used in each iteration |
542
- | `rawData.json` | Complete iteration data for programmatic analysis |
543
- | `bestRun.json` | Detailed results from the best-performing iteration |
623
+ | File | Description |
624
+ | -------------- | ------------------------------------------------------------------------- |
625
+ | `summary.md` | Human-readable report with configuration, metrics, and iteration progress |
626
+ | `prompts.md` | All system prompts used in each iteration |
627
+ | `rawData.json` | Complete iteration data for programmatic analysis |
628
+ | `bestRun.json` | Detailed results from the best-performing iteration |
544
629
 
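`rawData.json` can also be consumed programmatically. A sketch that checks the outcome against a hand-built payload mirroring the `summary` and `best` shapes (a real run would read the file from `OptimizeResult.logFolder` instead):

```typescript
// Hand-built stand-in for a rawData.json payload; only the fields
// used below are included.
const raw = JSON.parse(`{
  "summary": { "startRate": 0.5, "endRate": 0.9, "targetMet": true },
  "best": { "iteration": 3, "successRate": 0.9 }
}`);

// How much the optimizer improved the success rate, in points.
const improvement = raw.summary.endRate - raw.summary.startRate;

console.log(
  `target met: ${raw.summary.targetMet}, ` +
    `improved ${(improvement * 100).toFixed(0)} points, ` +
    `best was iteration ${raw.best.iteration}`,
);
```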
545
630
  ### rawData.json
546
631
 
@@ -549,17 +634,17 @@ Contains the complete optimization run data for programmatic analysis:
549
634
  ```typescript
550
635
  interface OptimizationReport {
551
636
  metadata: {
552
- timestamp: string; // ISO timestamp
553
- model: string; // LLM model used
554
- provider: string; // Provider (anthropic, openai, etc)
555
- thinking: boolean; // Extended thinking enabled
556
- targetSuccessRate: number; // Target (0.0-1.0)
557
- maxIterations: number | null; // Max iterations or null
558
- maxCost: number | null; // Max cost budget or null
559
- testCaseCount: number; // Number of test cases
560
- perTestThreshold: number; // Per-test threshold (default 1.0)
561
- rateLimitBatch?: number; // Batch size for rate limiting
562
- rateLimitPause?: number; // Pause seconds between batches
637
+ timestamp: string; // ISO timestamp
638
+ model: string; // LLM model used
639
+ provider: string; // Provider (anthropic, openai, etc)
640
+ thinking: boolean; // Extended thinking enabled
641
+ targetSuccessRate: number; // Target (0.0-1.0)
642
+ maxIterations: number | null; // Max iterations or null
643
+ maxCost: number | null; // Max cost budget or null
644
+ testCaseCount: number; // Number of test cases
645
+ perTestThreshold: number; // Per-test threshold (default 1.0)
646
+ rateLimitBatch?: number; // Batch size for rate limiting
647
+ rateLimitPause?: number; // Pause seconds between batches
563
648
  };
564
649
  summary: {
565
650
  totalIterations: number;
@@ -567,16 +652,16 @@ interface OptimizationReport {
567
652
  totalCost: number;
568
653
  totalInputTokens: number;
569
654
  totalOutputTokens: number;
570
- startRate: number; // Success rate at start
571
- endRate: number; // Success rate at end
655
+ startRate: number; // Success rate at start
656
+ endRate: number; // Success rate at end
572
657
  targetMet: boolean;
573
658
  };
574
659
  best: {
575
- iteration: number; // Which iteration was best
576
- successRate: number; // Success rate (0.0-1.0)
577
- passed: number; // Number of passing tests
578
- total: number; // Total tests
579
- fieldAccuracy: number; // Field-level accuracy
660
+ iteration: number; // Which iteration was best
661
+ successRate: number; // Success rate (0.0-1.0)
662
+ passed: number; // Number of passing tests
663
+ total: number; // Total tests
664
+ fieldAccuracy: number; // Field-level accuracy
580
665
  };
581
666
  iterations: Array<{
582
667
  iteration: number;
@@ -586,8 +671,8 @@ interface OptimizationReport {
586
671
  correctFields: number;
587
672
  totalFields: number;
588
673
  fieldAccuracy: number;
589
- cost: number; // Cost for this iteration
590
- cumulativeCost: number; // Total cost so far
674
+ cost: number; // Cost for this iteration
675
+ cumulativeCost: number; // Total cost so far
591
676
  durationMs: number;
592
677
  inputTokens: number;
593
678
  outputTokens: number;
@@ -596,7 +681,10 @@ interface OptimizationReport {
596
681
  input: unknown;
597
682
  expected: unknown;
598
683
  actual: unknown;
599
- fields: Record<string, { expected: unknown; actual: unknown; passed: boolean }>;
684
+ fields: Record<
685
+ string,
686
+ { expected: unknown; actual: unknown; passed: boolean }
687
+ >;
600
688
  }>;
601
689
  }>;
602
690
  }
@@ -609,7 +697,7 @@ Contains detailed results from the best-performing iteration, with test results
609
697
  ```typescript
610
698
  interface BestRunReport {
611
699
  metadata: {
612
- iteration: number; // Which iteration was best
700
+ iteration: number; // Which iteration was best
613
701
  model: string;
614
702
  provider: string;
615
703
  thinking: boolean;
@@ -619,38 +707,41 @@ interface BestRunReport {
619
707
  rateLimitPause?: number;
620
708
  };
621
709
  results: {
622
- successRate: number; // Overall success rate
623
- passed: number; // Passed tests
624
- total: number; // Total tests
625
- fieldAccuracy: number; // Field-level accuracy
710
+ successRate: number; // Overall success rate
711
+ passed: number; // Passed tests
712
+ total: number; // Total tests
713
+ fieldAccuracy: number; // Field-level accuracy
626
714
  correctFields: number;
627
715
  totalFields: number;
628
716
  };
629
717
  cost: {
630
- iteration: number; // Cost for this iteration
631
- cumulative: number; // Total cumulative cost
718
+ iteration: number; // Cost for this iteration
719
+ cumulative: number; // Total cumulative cost
632
720
  };
633
721
  timing: {
634
722
  durationMs: number;
635
723
  inputTokens: number;
636
724
  outputTokens: number;
637
725
  };
638
- failures: Array<{ // Tests that didnt meet the configured perTestThreshold
726
+ failures: Array<{
727
+ // Tests that didn't meet the configured perTestThreshold
639
728
  testIndex: number;
640
729
  input: unknown;
641
730
  expected: unknown;
642
731
  actual: unknown;
643
732
  failedFields: Record<string, { expected: unknown; actual: unknown }>;
644
733
  }>;
645
- partialFailures: Array<{ // Tests that passed but have some failing fields
734
+ partialFailures: Array<{
735
+ // Tests that passed overall but have some failing fields
646
736
  testIndex: number;
647
- passRate: number; // Percentage of fields passing
737
+ passRate: number; // Percentage of fields passing
648
738
  input: unknown;
649
739
  expected: unknown;
650
740
  actual: unknown;
651
741
  failedFields: Record<string, { expected: unknown; actual: unknown }>;
652
742
  }>;
653
- successes: Array<{ // Tests with 100% field accuracy
743
+ successes: Array<{
744
+ // Tests with 100% field accuracy
654
745
  testIndex: number;
655
746
  input: unknown;
656
747
  expected: unknown;
@@ -666,10 +757,22 @@ interface BestRunReport {
666
757
  ```typescript
667
758
  // Namespace
668
759
  import { didactic } from '@docshield/didactic';
669
- import didactic from '@docshield/didactic'; // default export
760
+ import didactic from '@docshield/didactic'; // default export
670
761
 
671
762
  // Comparators
672
- import { exact, within, oneOf, contains, presence, numeric, date, name, custom } from '@docshield/didactic';
763
+ import {
764
+ exact,
765
+ within,
766
+ oneOf,
767
+ contains,
768
+ presence,
769
+ numeric,
770
+ date,
771
+ name,
772
+ unordered,
773
+ llmCompare,
774
+ custom,
775
+ } from '@docshield/didactic';
673
776
 
674
777
  // Executors
675
778
  import { endpoint, fn } from '@docshield/didactic';
@@ -679,22 +782,24 @@ import { evaluate, optimize } from '@docshield/didactic';
679
782
 
680
783
  // Types
681
784
  import type {
785
+ // Creating custom comparators
682
786
  Comparator,
683
- ComparatorMap,
684
787
  ComparatorResult,
685
788
  ComparatorContext,
789
+ // Creating custom executors
686
790
  Executor,
687
791
  ExecutorResult,
792
+ // Main API types
688
793
  TestCase,
689
794
  EvalConfig,
690
795
  EvalResult,
691
- TestCaseResult,
692
- FieldResult,
693
796
  OptimizeConfig,
694
797
  OptimizeResult,
695
- IterationResult,
798
+ // Executor configs
696
799
  EndpointConfig,
697
800
  FnConfig,
801
+ // LLM configuration
802
+ LLMConfig,
698
803
  } from '@docshield/didactic';
699
804
 
700
805
  // Enum