npm - @talonic/docs - Versions diffs - 0.20.10 → 0.20.11 - Mend

@talonic/docs 0.20.10 → 0.20.11

Files changed (2)

package/dist/content.js +1017 -2
package/package.json +1 -1

package/dist/content.js CHANGED

@@ -542,6 +542,14 @@ var sections = [
           }
         ]
       },
+      {
+        type: "paragraph",
+        text: 'Understanding the relationship between these concepts is key to getting the most from the platform. When you upload documents, the extraction pipeline discovers every data point and feeds them into the **Field Registry**. The registry uses AI embeddings to cluster semantically similar fields \u2014 so "Vendor Name", "Supplier Name", and "Company Name" are recognized as the same concept. Over time, frequently occurring fields are promoted to higher tiers, and the platform synthesizes master extraction instructions that encode the best way to extract each field.'
+      },
+      {
+        type: "paragraph",
+        text: "The **Schema** layer sits on top of the registry and defines what output you need. You can use auto-generated schemas that the platform creates for each document type, or build custom template schemas by selecting specific fields from the registry. When a schema is applied to documents in a **Job**, the 4-phase pipeline fills every cell \u2014 starting with free graph lookups and falling back to AI agents for the remainder. The result is a structured grid where each row is a document and each column is a field."
+      },
       {
         type: "callout",
         variant: "info",
@@ -617,6 +625,14 @@ var sections = [
         type: "paragraph",
         text: "The pipeline is designed to be **progressive** \u2014 results appear as each phase completes rather than waiting for the entire job to finish. Phase 1 (graph resolve) fills ~30% of cells instantly and for free. Phase 2 (AI extraction) fills the remaining gaps. Phases 3 and 4 handle re-resolution and transformation. You can start reviewing early results while later phases are still running."
       },
+      {
+        type: "paragraph",
+        text: "Use the platform flow as a mental model when planning your workflow. For small, ad-hoc extractions you can go from upload to results in minutes \u2014 upload a few documents, pick an auto-generated schema, and run a job. For production workloads, invest time in the **Define schema** step: map fields to the registry, add reference tables for code lookups, and set format constraints. The upfront effort pays off because every subsequent job reuses the same schema and benefits from the growing knowledge graph."
+      },
+      {
+        type: "paragraph",
+        text: "After results are delivered, the feedback loop closes automatically. Corrections you make during the **Review** stage feed back into the Field Registry, improving future extractions. The platform tracks telemetry across runs \u2014 strategy distribution, capture hit rate, and resolve rate \u2014 so you can monitor how extraction quality improves over time as the knowledge graph accumulates more data."
+      },
       {
         type: "callout",
         variant: "info",
@@ -679,6 +695,14 @@ var sections = [
         title: "Sidebar Navigation",
         caption: "The sidebar provides access to all sections. Click the collapse button to save space. Press Cmd+K for global search."
       },
+      {
+        type: "paragraph",
+        text: "For teams processing documents at scale, the recommended approach is to start with a small representative sample. Upload 5-10 documents of the same type, let the platform extract and classify them, then review the auto-generated schema. This lets you validate the output structure before committing to a large batch. Once the schema looks right, you can upload hundreds or thousands of documents and the knowledge graph will handle an increasing share of cells through instant graph matches."
+      },
+      {
+        type: "paragraph",
+        text: "The platform includes powerful keyboard shortcuts for fast navigation. Press `Cmd+K` (or `Ctrl+K` on Windows) to open **Omnisearch**, which lets you find documents, schemas, jobs, and fields from anywhere. Press `Cmd+I` to open the **AI Agent** for natural language queries about your workspace. The sidebar can be collapsed to give more screen real estate when reviewing extraction results."
+      },
       {
         type: "callout",
         text: "The fastest path to results: upload documents in **Sources**, then go to **Structuring &rarr; Runs &rarr; New** to create your first extraction job."
@@ -735,6 +759,18 @@ var sections2 = [
         type: "paragraph",
         text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language."
       },
+      {
+        type: "paragraph",
+        text: "The agent is context-aware, meaning it automatically knows which page you are on and what data is visible. If you open the agent from a document detail page, it already has that document in scope and can answer questions about its extracted fields, processing status, or classification without you needing to specify which document you mean."
+      },
+      {
+        type: "paragraph",
+        text: "The agent classifies every user message as either a **question** (answered with information) or a **command** (triggers an action). Questions are handled instantly with read-only access, while commands go through the impact-level system to ensure safety. The agent streams its responses in real time, so you can see reasoning unfold as it queries your workspace data."
+      },
+      {
+        type: "paragraph",
+        text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page."
+      },
       { type: "heading", level: 3, id: "agent-capabilities", text: "What the Agent Can Do" },
       {
         type: "paragraph",
@@ -799,6 +835,14 @@ var sections2 = [
       {
         question: "Can the AI agent modify my data?",
         answer: "The agent operates workshop-first: schema changes create drafts, not live versions. Higher-impact operations require progressively more explicit confirmation."
+      },
+      {
+        question: "Is the AI agent context-aware?",
+        answer: "Yes. The agent automatically knows which page you are on and what data is visible. If you open it from a document detail page, it already has that document in scope and can answer questions about its fields, processing status, or classification."
+      },
+      {
+        question: "Can the AI agent access external systems or the internet?",
+        answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform."
       }
     ],
     mentions: [
@@ -846,6 +890,18 @@ var sections2 = [
           }
         ]
       },
+      {
+        type: "paragraph",
+        text: "The `read` impact level covers the vast majority of agent interactions. Searching documents, inspecting extraction results, browsing the field registry, and checking job status all execute instantly with no side effects. These read operations give you a fast way to explore your workspace without navigating through multiple pages."
+      },
+      {
+        type: "paragraph",
+        text: "The `draft_mutation` level is used when the agent creates or modifies schemas. Because all schema changes go through the workshop system, the agent can freely draft schemas without risk \u2014 nothing goes live until you explicitly review and publish. This makes the agent especially useful for rapid schema prototyping: describe the fields you need in plain language, and the agent creates a draft you can refine."
+      },
+      {
+        type: "paragraph",
+        text: 'The `live_mutation` and `irreversible` levels provide escalating safety gates for operations that affect production data. A `live_mutation` \u2014 such as triggering a job run or publishing a schema \u2014 presents a confirmation dialog that you must accept. An `irreversible` action \u2014 such as deleting a source or purging documents \u2014 requires you to type a confirmation keyword (e.g., "DELETE") to proceed, preventing accidental data loss.'
+      },
       {
         type: "callout",
         text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready."
@@ -863,6 +919,10 @@ var sections2 = [
       {
         question: "Does the AI agent make changes directly to live data?",
         answer: "No. The agent operates workshop-first. Schema changes create drafts, and live mutations require explicit user confirmation before executing."
+      },
+      {
+        question: "What happens when I ask the agent to delete something?",
+        answer: 'Deletion is classified as an irreversible action. The agent will ask you to type a confirmation keyword (e.g., "DELETE") before proceeding. This prevents accidental data loss from casual or ambiguous requests.'
       }
     ],
     mentions: ["impact levels", "draft mutation", "live mutation", "workshop-first"]
@@ -877,6 +937,18 @@ var sections2 = [
       {
         type: "paragraph",
         text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction. The agent input field lets you type any question directly from the dashboard."
+      },
+      {
+        type: "paragraph",
+        text: "The dashboard provides a workspace-level overview that helps you understand the health of your data pipeline at a glance. You can see document processing statistics, recent activity across sources, and the current state of your field registry. Key metrics like **capture rate**, **resolve rate**, and **synthesize rate** from the telemetry system are surfaced so you can spot trends without drilling into individual jobs."
+      },
+      {
+        type: "paragraph",
+        text: "Suggested prompts are dynamically generated based on what the platform detects in your workspace. If you have new document types that lack schemas, the dashboard suggests creating one. If a job run recently completed, it suggests reviewing the results. If field registry confirmations are pending, it prompts you to review them. This makes the dashboard a natural starting point for your workflow each session."
+      },
+      {
+        type: "paragraph",
+        text: "Every conversation with the agent is preserved in your session history, accessible from the dashboard. You can revisit previous questions and their answers, which is useful for auditing decisions or recalling how you configured a particular schema. The conversation history also provides continuity \u2014 if you asked the agent to analyze extraction quality last week, you can pick up where you left off."
       }
     ],
     related: [
@@ -891,6 +963,10 @@ var sections2 = [
       {
         question: "Do the suggested prompts change based on workspace state?",
         answer: "Yes. Prompts adapt dynamically based on active runs, schema creation opportunities, document types waiting for extraction, and other workspace activity."
+      },
+      {
+        question: "Can I revisit previous conversations with the agent?",
+        answer: "Yes. Every conversation is preserved in your session history, accessible from the dashboard. You can revisit previous questions, recall how you configured a schema, or pick up where you left off in a previous analysis."
       }
     ],
     mentions: ["dashboard", "suggested prompts", "workspace state", "agent input"]
@@ -923,6 +999,10 @@ var sections3 = [
       {
         type: "paragraph",
         text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. Processing runs asynchronously so you can continue working."
+      },
+      {
+        type: "paragraph",
+        text: "When uploading folders or ZIP archives, the original directory structure is preserved as a `source_file_path` metadata field on each document (e.g., `contracts/2026/lease.pdf`). This field is available for filtering, export, and schema mapping \u2014 just like any AI-extracted field. It provides a natural way to organize and trace documents back to their original location in your file system."
       }
     ],
     related: [
@@ -938,6 +1018,10 @@ var sections3 = [
       {
         question: "Does Talonic detect duplicate uploads?",
         answer: "Yes. Files are deduplicated via SHA-256 hashing. Uploading the same file twice will not create duplicates."
+      },
+      {
+        question: "What happens when I upload a folder or ZIP archive?",
+        answer: "ZIP archives are unpacked recursively and each file is processed individually. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering and export."
       }
     ],
     mentions: [
@@ -956,6 +1040,10 @@ var sections3 = [
     seoTitle: "Supported File Formats \u2014 Talonic Docs",
     description: "Talonic supports 25+ file types across four processing paths: text fast-path, AI vision, OCR, and recursive archive unpacking. From PDF to XLSX to images.",
     content: [
+      {
+        type: "paragraph",
+        text: "Talonic supports 25+ file types across four distinct processing paths. Each path is optimized for its file category \u2014 text files are read directly with zero latency, while complex document formats go through OCR to produce high-quality Markdown. The processing path is selected automatically based on the file extension."
+      },
       {
         type: "param-table",
         title: "File processing paths",
@@ -981,6 +1069,23 @@ var sections3 = [
             description: "ZIP \u2014 unpacked and each file processed individually."
           }
         ]
+      },
+      {
+        type: "paragraph",
+        text: "The **OCR path** uses Mistral Document AI as the primary engine, with a Talonic API fallback if the primary service is unavailable. OCR converts documents to structured Markdown, preserving tables, headings, and layout information. For PDF files that exceed the configured chunk size (default 25 pages), the system automatically splits the document into page chunks, processes them in parallel, and merges the results \u2014 so even large documents are handled efficiently."
+      },
+      {
+        type: "paragraph",
+        text: `Image files follow the **AI Vision** path, where they are sent directly to the AI model for multimodal extraction. This means the AI "sees" the image and extracts data visually \u2014 useful for photos of receipts, scanned handwritten notes, or diagrams. If an image was previously OCR'd and produced meaningful Markdown (more than 100 characters), the system uses the Markdown extraction path instead, which enables richer quality metrics.`
+      },
+      {
+        type: "paragraph",
+        text: "The **text fast-path** is the most efficient route: files like CSV, JSON, and plain text are read directly into memory with no external API call. This means they process almost instantly and incur no OCR cost. Email files (EML, MSG) are parsed to extract both the message body and any attachments, with each attachment processed as a separate document."
+      },
+      {
+        type: "callout",
+        variant: "info",
+        text: "The processing path is selected automatically based on the file extension \u2014 you do not need to configure anything. If a file type is not recognized, the platform will attempt OCR as a fallback before marking it as unsupported."
       }
     ],
     related: [
@@ -995,6 +1100,10 @@ var sections3 = [
       {
         question: "How does Talonic handle image files?",
         answer: "Image files (PNG, JPG, JPEG, GIF, WEBP) are sent to AI for multimodal visual extraction."
+      },
+      {
+        question: "How does Talonic handle large PDF files?",
+        answer: "PDF files that exceed the configured chunk size (default 25 pages) are automatically split into page chunks, processed in parallel, and merged. This ensures even large documents are handled efficiently without timeouts."
       }
     ],
     mentions: ["OCR", "AI vision", "text fast-path", "file formats", "PDF", "DOCX", "ZIP"]
@@ -1061,6 +1170,10 @@ var sections3 = [
       {
         question: "When is a document ready to use in jobs?",
         answer: "Documents are marked complete after AI extraction finishes. You can start using them in jobs immediately without waiting for further processing."
+      },
+      {
+        question: "What happens if OCR or extraction fails on a document?",
+        answer: "The platform automatically retries failed extractions (configurable, default 1 retry). If all retries fail, the document is marked as extraction_failed with a terminal status. OCR failures follow a separate retry path with fallback from Document AI to Talonic API to local parsers."
       }
     ],
     mentions: [
@@ -1087,6 +1200,19 @@ var sections3 = [
       {
         type: "paragraph",
         text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata. Unresolvable documents are assigned "Unclassified Document".'
+      },
+      {
+        type: "paragraph",
+        text: `Classification is verified in a two-step process. First, **Document AI OCR** produces an annotation with a free-text type label during the OCR pass. Then, a **type resolution** step verifies that label against the actual document content. If the label and content disagree \u2014 for example, a German *Arbeitsvertrag* incorrectly labelled as "Service Agreement" \u2014 the system trusts the content and resolves the correct canonical type. This ensures accurate classification regardless of the OCR engine's labelling bias.`
+      },
+      {
+        type: "paragraph",
+        text: "Document types drive several downstream features. The platform auto-generates a **schema** for each document type, pre-populated with fields discovered from documents of that type. **Routing rules** can be configured per document type to automatically assign schemas or trigger jobs when new documents arrive. The **Field Registry** tracks which fields appear in which document types, building a cross-type knowledge graph over time."
+      },
+      {
+        type: "callout",
+        variant: "info",
+        text: "You never need to create document types manually. The ontology is built into the platform and types are assigned automatically during classification. If you disagree with a classification, the AI agent can help you understand why a type was chosen and how the content signals were interpreted."
       }
     ],
     related: [
@@ -1102,6 +1228,10 @@ var sections3 = [
       {
         question: "Does document classification work in non-English languages?",
         answer: "Yes. The classifier works across all languages. For example, a German Arbeitsvertrag and an English Employment Contract map to the same canonical type."
+      },
+      {
+        question: "What happens if a document cannot be classified?",
+        answer: 'Unresolvable documents are assigned the "Unclassified Document" type. They can still be processed and extracted \u2014 the platform simply cannot map them to a specific canonical type in the 529-type ontology.'
       }
     ],
     mentions: [
@@ -1147,6 +1277,23 @@ var sections3 = [
             description: "View or download the source document."
           }
         ]
+      },
+      {
+        type: "paragraph",
+        text: "The **Raw Extraction** tab is the most detailed view, showing every field the AI discovered along with its confidence score and the source text that the value was extracted from. Each field displays a tier badge (Tier 1 green, Tier 2 amber, Tier 3 gray) indicating how well-established that field is across your document corpus. Synthetic metadata fields like `filename` and `source_file_path` appear here too, with full confidence (1.0)."
+      },
+      {
+        type: "paragraph",
+        text: "The **Resolved Data** tab shows how raw extracted fields map to your canonical field registry. Fields that matched automatically (similarity >= 0.80) display their canonical name and cluster. Fields in the confirm band (0.50-0.79) are flagged for review. This view helps you understand how the platform is normalizing field names across different document types and formats."
+      },
+      {
+        type: "paragraph",
+        text: "The **Processing Log** tab provides a stage-by-stage timeline of how the document was processed, including per-stage timing. You can see exactly how long OCR, classification, and extraction took, which is useful for diagnosing slow processing or understanding why a document was classified a particular way. The **Original File** tab lets you view or download the source file, so you can always compare the AI's extraction against the original document."
+      },
+      {
+        type: "callout",
+        variant: "info",
+        text: "You can open the **AI Agent** (`Cmd+I`) from any document detail page. The agent automatically has the current document in scope and can answer questions about its fields, classification, or processing status without you needing to specify which document you mean."
       }
     ],
     related: [
@@ -1162,6 +1309,10 @@ var sections3 = [
       {
         question: "How can I see the confidence score of an extracted field?",
         answer: "Open the document detail page and navigate to the Raw Extraction tab. Each field displays its confidence score alongside the extracted value and source text."
+      },
+      {
+        question: "What do the tier badges on fields mean?",
+        answer: "Tier badges indicate how well-established a field is across your document corpus. Tier 1 (green) are universal core fields, Tier 2 (amber) are established promoted fields, and Tier 3 (gray) are newly discovered emerging fields."
       }
     ],
     mentions: [
@@ -1182,6 +1333,23 @@ var sections3 = [
       {
         type: "paragraph",
         text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents &rarr; Routing**."
+      },
+      {
+        type: "paragraph",
+        text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides."
+      },
+      {
+        type: "paragraph",
+        text: 'Routing rules are especially useful for high-volume ingestion pipelines. If you connect a Google Drive folder that receives hundreds of invoices per week, a routing rule can automatically assign your "Invoice" schema and trigger extraction \u2014 turning what would be manual work into a fully automated pipeline. Combined with **delivery bindings**, this creates an end-to-end flow from document upload to structured output with zero manual intervention.'
+      },
+      {
+        type: "paragraph",
+        text: "You can review rule execution history from the routing page to see which rules fired, which documents they matched, and what actions were taken. This audit trail helps you verify that your routing configuration is working as expected and diagnose cases where documents were not routed correctly."
+      },
+      {
+        type: "callout",
+        variant: "info",
+        text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules."
       }
     ],
     related: [
@@ -1197,6 +1365,10 @@ var sections3 = [
       {
         question: "Where do I manage routing rules?",
         answer: "Navigate to Documents > Routing to create and manage routing rules for your workspace."
+      },
+      {
+        question: "Can routing rules fully automate my document processing pipeline?",
+        answer: "Yes. By combining routing rules with source connectors and delivery bindings, you can create a fully automated pipeline: documents arrive from a connected source, routing rules assign schemas and trigger extraction jobs, and delivery bindings push approved results to downstream systems."
       }
     ],
     mentions: ["routing rules", "auto-assign", "schema assignment", "document workflows"]
@@ -1272,6 +1444,14 @@ var sections3 = [
         type: "paragraph",
         text: "Google and Microsoft connectors share a single OAuth client each. OAuth tokens are encrypted at rest using `aes-256-gcm`. Each source card includes a **Batch Processing** toggle to defer extraction at 50% cost."
       },
+      {
+        type: "paragraph",
+        text: "OAuth-based connectors (Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion) use a consent-based flow where you authorize Talonic to access specific resources. For Microsoft connectors, Teams requires extended scopes that need tenant-admin consent. If a connector's OAuth credentials are revoked or expire, the source enters a disconnected state \u2014 reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents."
+      },
+      {
+        type: "paragraph",
+        text: "Credential-based connectors (SQL, Amazon S3, Azure Blob) authenticate with access keys or connection strings rather than OAuth. SQL connections support PostgreSQL, MySQL, and MSSQL, with a built-in read-only safety layer that prevents accidental writes. S3-compatible storage like MinIO and Cloudflare R2 also works through the S3 connector. All credentials are encrypted at rest before being stored."
+      },
       {
         type: "callout",
         text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled."
@@ -1290,6 +1470,10 @@ var sections3 = [
       {
         question: "How are OAuth tokens stored?",
         answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET)."
+      },
+      {
+        question: "What happens if a connector loses its credentials or authorization?",
+        answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration."
       }
     ],
     mentions: [
@@ -1331,6 +1515,18 @@ var sections4 = [
         id: "field-registry-table",
         title: "Field Registry \u2014 Registry Table",
         caption: "Fields are organized by tier with occurrence counts, data types, and master instruction status."
+      },
+      {
+        type: "paragraph",
+        text: "The registry grows automatically as documents are processed. During extraction, AI discovers fields from each document and resolves them against existing registry entries using **three-band matching** (exact name match, cluster member match, then semantic embedding similarity). New fields that don't match anything create a Tier 3 entry. Frequently occurring fields are promoted to higher tiers, so the registry naturally converges on a stable set of canonical fields over time."
+      },
+      {
+        type: "paragraph",
+        text: "Each registry entry tracks its **occurrence count** (how many documents contain this field), **data type** (string, number, date, etc.), **synonyms** (alternate names discovered across documents), and **master instruction** (an AI-synthesized extraction directive). The registry also maintains two embedding vectors per field: one for resolution matching and one for graph visualization, ensuring that each concern uses the most appropriate representation."
+      },
+      {
+        type: "paragraph",
+        text: "The registry is the foundation for several downstream features. **Jobs** use registry fields to pre-fill schema values via lookup cascades before resorting to LLM extraction. **Semantic clusters** group related registry fields together. **Generated schemas** are auto-built from registry fields that appear in a given document type. Understanding the registry is key to understanding how Talonic reduces extraction cost and improves accuracy over time."
       }
     ],
     related: [
@@ -1346,6 +1542,10 @@ var sections4 = [
       {
         question: "How does the Field Registry grow?",
         answer: "As documents are processed, AI discovers new fields and resolves them against existing registry entries. New fields create Tier 3 entries; frequently occurring fields are promoted to higher tiers."
+      },
+      {
+        question: "How does the Field Registry reduce extraction cost?",
+        answer: "The registry enables lookup-based resolution during job runs. When a field already exists in the registry with sufficient data, its value can be resolved via graph lookup instead of an AI call. Approximately 30% of cells are filled this way \u2014 instantly and at no cost."
       }
     ],
     mentions: [
@@ -1387,6 +1587,18 @@ var sections4 = [
           }
         ]
       },
+      {
+        type: "paragraph",
+        text: "**Tier 1** fields are the most reliable and cost-efficient. During job runs, Tier 1 fields can often be resolved via lookup tables or registry transfer without any AI call, meaning they cost nothing to extract. These are fields like `invoice_number`, `date`, or `total_amount` that appear universally across document types and have well-established extraction patterns."
+      },
+      {
+        type: "paragraph",
+        text: "**Tier 2** fields are promoted from Tier 3 after meeting frequency thresholds \u2014 specifically, 5 occurrences or a 10% occurrence rate across your documents. Once promoted, these fields gain a synthesized master instruction and become candidates for lookup-based resolution. Promotion is evaluated automatically after every batch resolution run, so fields graduate without manual intervention as your document corpus grows."
+      },
+      {
+        type: "paragraph",
+        text: "**Tier 3** fields are newly discovered and may require a full Claude API call to extract during job runs, making them the most expensive tier. As more documents are processed and a Tier 3 field appears consistently, it is automatically promoted. You can also manually adjust a field's tier from the registry detail page if you know a field is stable enough to promote early."
+      },
       {
         type: "callout",
         text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray."
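The Tier 3 to Tier 2 promotion rule quoted in this hunk (5 occurrences or a 10% occurrence rate) can be sketched as below. This is only an illustration: `shouldPromoteToTier2`, its parameters, and the constant names are hypothetical, and treating both thresholds as inclusive is an assumption, since the text does not say. The criteria for promotion to Tier 1 are not stated, so they are left out.

```javascript
// Thresholds taken from the Tier 2 paragraph; the names are hypothetical.
const MIN_OCCURRENCES = 5;
const MIN_OCCURRENCE_RATE = 0.1;

// Returns true when a Tier 3 field meets either frequency threshold.
// Assumption: both comparisons are inclusive.
function shouldPromoteToTier2(occurrences, totalDocuments) {
  if (totalDocuments <= 0) return false;
  const rate = occurrences / totalDocuments;
  return occurrences >= MIN_OCCURRENCES || rate >= MIN_OCCURRENCE_RATE;
}
```

With these thresholds, a field seen in 5 of 1,000 documents qualifies on count alone, while a field seen in 3 of 20 documents qualifies on rate.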
@@ -1404,6 +1616,10 @@ var sections4 = [
       {
         question: "How are fields promoted between tiers?",
         answer: "Fields are promoted automatically based on frequency thresholds. As more documents are processed and a field appears consistently, it moves from Tier 3 to Tier 2 and eventually to Tier 1."
+      },
+      {
+        question: "Can I manually change a field's tier?",
+        answer: "Yes. You can manually adjust a field's tier from the registry detail page. This is useful when you know a field is stable enough to promote early, or when you want to demote a field that was promoted prematurely."
       }
     ],
     mentions: ["tier system", "Tier 1", "Tier 2", "Tier 3", "field promotion", "quality signal"]
@@ -1418,6 +1634,23 @@ var sections4 = [
       {
         type: "paragraph",
         text: 'Fields with similar meanings are automatically grouped using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together. You can manually merge or split clusters from the Field Map view.'
+      },
+      {
+        type: "paragraph",
+        text: "Clustering uses the same three-band similarity model as field resolution. Fields with similarity >= 0.80 are automatically grouped into the same cluster. Fields with similarity from 0.50 up to (but not including) 0.80 are flagged as potential cluster candidates for manual confirmation. Fields below 0.50 similarity are kept separate. This graduated approach prevents false merges while still surfacing useful grouping suggestions."
+      },
+      {
+        type: "paragraph",
+        text: 'From the **Field Map** view, you can manually **merge** two clusters when you know they represent the same concept (e.g., merging a "Ship To Address" cluster with a "Delivery Address" cluster). You can also **split** a field out of a cluster if it was incorrectly grouped. These manual adjustments are permanent and improve the resolution model for all future documents \u2014 the system learns from your corrections.'
+      },
+      {
+        type: "paragraph",
+        text: 'Semantic clusters serve a practical purpose beyond organization. When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has a field called "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call. This is one of the key mechanisms that reduces extraction cost as your registry matures.'
+      },
+      {
+        type: "callout",
+        variant: "info",
+        text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs."
       }
     ],
     related: [
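The three-band model described in this hunk amounts to a small classification step over a similarity score. A minimal sketch follows, assuming the score is already computed from the embeddings. The function name and band labels are hypothetical; only the 0.80 and 0.50 cut-offs come from the text.

```javascript
// Cut-offs from the clustering paragraph; the names are hypothetical.
const AUTO_GROUP_THRESHOLD = 0.8;
const CONFIRM_THRESHOLD = 0.5;

// Maps a similarity score in [0, 1] to one of three bands.
function classifySimilarity(similarity) {
  if (similarity >= AUTO_GROUP_THRESHOLD) return "auto";   // grouped automatically
  if (similarity >= CONFIRM_THRESHOLD) return "confirm";   // flagged for manual confirmation
  return "separate";                                       // kept as a separate field
}
```

Keeping the thresholds as named constants makes the boundary behavior (0.80 is "auto", 0.50 is "confirm") explicit rather than implied by a range like "0.50-0.79".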
@@ -1433,6 +1666,10 @@ var sections4 = [
       {
         question: "Can I manually adjust semantic clusters?",
         answer: "Yes. You can manually merge or split clusters from the Field Map view in the Field Registry."
+      },
+      {
+        question: "How do semantic clusters reduce extraction cost?",
+        answer: 'When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call.'
       }
     ],
     mentions: [
@@ -1478,6 +1715,10 @@ var sections4 = [
         type: "paragraph",
         text: "Resolution runs concurrently across documents. Each document's fields are resolved in an isolated transaction to prevent lock contention. Occurrence rates are updated after each transaction commits, keeping the registry eventually consistent without blocking concurrent ingestion."
       },
+      {
+        type: "paragraph",
+        text: "After resolution completes, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This chain ensures that newly promoted fields immediately appear in auto-generated schemas. The resolution process also feeds into the **job pipeline** \u2014 during Phase 1 of a job run, the system uses a 3-tier lookup cascade (string normalization, token fuzzy matching, then AI fallback) to fill 60-80% of cells without a full LLM call, dramatically reducing cost."
+      },
       {
         type: "callout",
         text: "Pending confirmations from the confirm band appear in **Resolution &rarr; Pending Confirmations**. Accept to merge into an existing cluster, or reject to create a new field."
@@ -1496,6 +1737,10 @@ var sections4 = [
       {
         question: "Where can I review pending field confirmations?",
         answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge into an existing cluster, or reject to create a new field."
+      },
+      {
+        question: "What happens after resolution completes?",
+        answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas."
       }
     ],
     mentions: [
@@ -1517,6 +1762,18 @@ var sections4 = [
         type: "paragraph",
         text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs."
       },
+      {
+        type: "paragraph",
+        text: 'Master instructions are synthesized by analyzing the extraction patterns across all documents where a field appears. The AI examines how the field was successfully extracted \u2014 including the source text, confidence scores, and document context \u2014 and distills a concise directive that captures the best extraction approach. For example, a master instruction for "invoice_date" might specify: "Look for the date near the invoice number, typically in the header area. Prefer the issue date over due date. Format as ISO 8601."'
+      },
+      {
+        type: "paragraph",
+        text: "Master instructions fire automatically during **Phase 2** of job runs, when the AI agent extracts values for fields that could not be resolved via lookup. The instruction is injected into the AI prompt alongside the document content, giving the model specific guidance for that field. This is why master instructions improve accuracy: they encode domain-specific knowledge that the base model would otherwise lack."
+      },
+      {
+        type: "paragraph",
+        text: `You can view and edit master instructions from the field detail page in the registry. Editing an instruction overrides the AI-synthesized version, which is useful when you have domain expertise the AI hasn't captured. The **"Synthesize All"** button in the Field Registry triggers the full pipeline \u2014 embedding, resolution, and synthesis \u2014 for all qualifying fields in a single operation.`
+      },
       {
         type: "callout",
         text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed &rarr; resolve &rarr; synthesize.'
@@ -1535,6 +1792,10 @@ var sections4 = [
       {
         question: "How do I generate master instructions?",
         answer: 'Click "Synthesize All" in the Field Registry. This runs the combined pipeline: embed, resolve, and synthesize instructions for all qualifying fields.'
+      },
+      {
+        question: "Can I manually edit a master instruction?",
+        answer: "Yes. You can view and edit master instructions from the field detail page in the registry. Editing overrides the AI-synthesized version, which is useful when you have domain expertise the AI has not captured."
       }
     ],
     mentions: [
@@ -1562,6 +1823,18 @@ var sections5 = [
       {
         type: "paragraph",
         text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed."
+      },
+      {
+        type: "paragraph",
+        text: "Behind the scenes, the generation engine scans the **Field Registry** for every field that has been promoted to Tier 1 (core) or Tier 2 (established) within a given document type. It assembles these fields into a schema definition, assigns data types based on observed extraction patterns, and attaches the AI-synthesized **master instruction** for each field. The entire process is automatic \u2014 no manual curation is required."
+      },
+      {
+        type: "paragraph",
+        text: "Generated schemas are most useful as a starting point for understanding what Talonic has discovered about your documents. Review the generated schema for a document type to see which fields the system has identified, then use that knowledge to build a **User Template** containing only the fields you actually need. You can also use the diff view to monitor how your field landscape evolves over time as new documents are processed and new fields are promoted."
+      },
+      {
+        type: "callout",
+        text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry."
       }
     ],
     related: [
@@ -1577,6 +1850,10 @@ var sections5 = [
       {
         question: "How are generated schemas updated?",
         answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed."
+      },
+      {
+        question: "Can I run an extraction job using a generated schema?",
+        answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version."
       }
     ],
     mentions: ["generated schemas", "AI-generated", "versioning", "schema diff"]
@@ -1606,6 +1883,18 @@ var sections5 = [
       {
         type: "paragraph",
         text: "Most teams start by importing an existing spreadsheet or CSV as a template baseline, then refine field types and add extraction instructions. Once you publish a version, it becomes immutable and available for job execution \u2014 any further changes happen in a new **Workshop** draft, keeping your production schema stable while you iterate."
+      },
+      {
+        type: "paragraph",
+        text: "When adding fields, take advantage of the automatic registry matching system. Fields with names that match existing registry entries are linked instantly, inheriting the AI-synthesized extraction instruction. For fields that do not match, write a clear **manual instruction** describing exactly what the AI should extract from the document. Well-written instructions are the single biggest lever for extraction accuracy."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, keep templates focused on a single document type or closely related group of types. A template with 10-20 well-defined fields will produce higher accuracy than one with 50+ fields spanning unrelated domains. If you need different field sets for different document types, create separate templates and run targeted jobs for each."
+      },
+      {
+        type: "callout",
+        text: "You can import templates from Excel, CSV, or JSON files using the **Import from file** option. Column headers become field names, and data types are inferred automatically. This is the fastest way to bootstrap a template from an existing spreadsheet."
       }
     ],
     related: [
@@ -1621,6 +1910,10 @@ var sections5 = [
       {
         question: "What is the difference between generated schemas and user templates?",
         answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields. User templates are custom-defined output structures where you choose exactly which fields to include and how to map them."
+      },
+      {
+        question: "Can I update a published template?",
+        answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing."
       }
     ],
     mentions: ["user templates", "schema creation", "field mapping", "reference tables", "publish"]
@@ -1686,6 +1979,14 @@ var sections5 = [
         type: "paragraph",
         text: "When configuring a field, start with the basics \u2014 name, type, and registry mapping \u2014 then layer on advanced features as needed. For example, add a **format constraint** to enforce a date pattern, attach a **reference table** for code lookups, or define **capture submoves** to control the exact extraction sequence. Features compose independently, so you can mix and match without conflicts."
       },
+      {
+        type: "paragraph",
+        text: "The **modifier pipeline** runs in a fixed order during Phase 4 of the extraction pipeline: format transforms first (converting dates or numbers to your target format), then alias mapping (replacing values using a lookup), and finally max_length truncation. Constraint evaluation happens after all modifiers have been applied, so constraints validate the final transformed value, not the raw extraction."
+      },
+      {
+        type: "paragraph",
+        text: 'For best results, use **manual instructions** sparingly and only for fields that the registry cannot match. A well-written instruction should describe the field in plain language, specify where in the document to look, and note any formatting expectations. Avoid vague instructions like "extract the value" \u2014 instead, write something like "Extract the net payment amount from the invoice summary section, excluding VAT."'
+      },
       {
         type: "callout",
         text: "For the complete JSON Schema specification with all features, see the [Full Schema Reference](/docs/platform/schema-features) in the Platform Guide."
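The fixed modifier order described in this hunk (format, then alias, then max_length, with constraints evaluated last) can be illustrated with a short sketch. The field shape used here (`format`, `alias`, `maxLength`, `constraint`) is hypothetical and is not the platform's schema format; only the ordering comes from the text.

```javascript
// Applies modifiers in the documented order and validates the final value.
// `field` is a hypothetical shape: { format?: fn, alias?: Map, maxLength?: number, constraint?: RegExp }.
function applyModifiers(rawValue, field) {
  let value = String(rawValue);

  // 1. Format transform (for example, date or number conversion).
  if (typeof field.format === "function") value = field.format(value);

  // 2. Alias mapping: replace the value when a lookup entry exists.
  if (field.alias instanceof Map && field.alias.has(value)) value = field.alias.get(value);

  // 3. max_length truncation.
  if (typeof field.maxLength === "number") value = value.slice(0, field.maxLength);

  // Constraints run last, so they see the transformed value, not the raw extraction.
  const valid = field.constraint ? field.constraint.test(value) : true;
  return { value, valid };
}
```

For example, a raw value of `" Germany "` with a trimming format, an alias of `Germany` to `DEU`, a `maxLength` of 2, and a constraint of `/^[A-Z]{2}$/` ends up as `DE` and passes validation, even though the raw string would have failed it.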
@@ -1704,6 +2005,10 @@ var sections5 = [
       {
         question: "Can I override AI extraction instructions with my own?",
         answer: "Yes. Use the Manual instruction feature on a schema field. User-written instructions override the AI-synthesized master instruction from the field registry."
+      },
+      {
+        question: "In what order are modifiers applied to extracted values?",
+        answer: "Modifiers run in a fixed order: format (date/number conversion) first, then alias (value mapping), then max_length (truncation). Constraints are evaluated after all modifiers complete."
       }
     ],
     mentions: [
@@ -1752,6 +2057,22 @@ var sections5 = [
       {
         type: "paragraph",
         text: "When you add a field to a template, the system automatically attempts to match it against the **Field Registry**. Exact name matches are applied instantly, while semantic and composite matches appear as suggestions for your confirmation. If no match is found, the field is marked **Unmapped** and you should provide a manual extraction instruction so the AI knows how to extract that value from your documents."
+      },
+      {
+        type: "paragraph",
+        text: "The matching engine uses a three-band resolution process under the hood. First, it checks for an exact name match against canonical registry field names and their synonyms. If no exact match is found, it computes embedding similarity between your field name and every registry field and surfaces semantic matches with a similarity of at least 0.5. Matches at or above 0.8 are auto-accepted; those from 0.5 up to 0.8 require your confirmation."
+      },
+      {
+        type: "paragraph",
+        text: "Matched fields inherit the registry's AI-synthesized **master instruction**, which tells the extraction pipeline exactly how to locate and extract that value from documents. This is why matching matters \u2014 a well-matched field leverages all the intelligence the system has built up from processing your document corpus. Unmapped fields rely solely on your manual instruction, so they may need a few correction cycles before reaching the same accuracy."
+      },
+      {
+        type: "paragraph",
+        text: "You can trigger a **Rematch** on all fields at any time from the template editor. This is useful after the registry has grown \u2014 fields that were previously unmapped may now find matches as new extractions contribute to the registry. For best results, use descriptive field names that reflect the actual data (e.g., `contract_start_date` rather than `field_1`)."
+      },
+      {
+        type: "callout",
+        text: "Field matching is read-only against the registry \u2014 it never creates new registry entries. If no match exists, the field stays unmapped until you provide a manual instruction or new documents introduce the field into the registry."
       }
     ],
     related: [
@@ -1767,6 +2088,10 @@ var sections5 = [
       {
         question: "What happens when a field is unmapped?",
         answer: "Unmapped fields have no registry match. They require manual extraction instructions to guide the AI on how to extract the value from documents."
+      },
+      {
+        question: "Can I re-run field matching after adding more documents?",
+        answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows."
       }
     ],
     mentions: ["field matching", "exact match", "semantic match", "composite", "unmapped"]
@@ -1807,6 +2132,14 @@ var sections5 = [
         type: "paragraph",
         text: "To set up a reference table, upload a CSV or manually enter key-value pairs where the **key** is the code you want in your output and the **value** is the human-readable label found in documents. During extraction, the system tries each tier in order \u2014 most values resolve instantly at Tier 1, so keeping your labels clean and consistent dramatically improves both speed and accuracy."
       },
+      {
+        type: "paragraph",
+        text: "Reference tables are used in two pipeline stages. In **Phase 1**, the lookup cascade runs as part of the resolve step, mapping extracted labels to codes without any AI calls (cascade Tier 1, exact normalization, and Tier 2, fuzzy matching). In **Phase 3**, the cascade runs again on values produced by Phase 2's AI extraction, normalizing free-text AI output to your canonical codes. This two-pass approach ensures maximum code coverage across the entire pipeline."
+      },
+      {
+        type: "paragraph",
+        text: 'For best results, include common variations and abbreviations as separate value entries all pointing to the same key. For example, if your code is `US`, add values for "United States", "USA", "U.S.A.", and "United States of America". The more variations you cover, the more values resolve at Tier 1 (highest confidence) without falling through to fuzzy or AI matching.'
+      },
       {
         type: "callout",
         text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run."
@@ -1825,6 +2158,10 @@ var sections5 = [
       {
         question: "How accurate are reference table lookups?",
         answer: "A properly loaded reference table produces 90-100% accurate results within a single run. The cascade provides confidence scores: 0.95 for exact normalization, ~0.70 for fuzzy, and 0.50 for AI fallback."
+      },
+      {
+        question: "How should I format my reference table CSV?",
+        answer: "Use two columns: the first column is the key (output code) and the second is the value (human-readable label). Include common variations and abbreviations as separate rows pointing to the same key for maximum Tier 1 hit rate."
       }
     ],
     mentions: [
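The lookup cascade described in this hunk (exact normalization, then fuzzy matching, then AI fallback) can be sketched roughly as follows. Everything here is an assumption for illustration: the normalization rule, the use of token-set Jaccard overlap for the fuzzy tier, and the 0.5 fuzzy cut-off are not taken from the platform. Only the tier order and the 0.95 and 0.70 confidence values come from the text. The AI fallback is not implemented; the sketch just reports that it would be needed.

```javascript
// Lowercase and collapse punctuation/whitespace (hypothetical normalization rule).
const normalizeLabel = (s) => s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

// Token-set Jaccard overlap, standing in for "token fuzzy matching".
function jaccard(tokensA, tokensB) {
  const a = new Set(tokensA);
  const b = new Set(tokensB);
  let shared = 0;
  for (const t of a) if (b.has(t)) shared++;
  return shared / (a.size + b.size - shared);
}

// rows: [key, value] pairs, where key is the output code and value is the label.
function buildLookup(rows, fuzzyThreshold = 0.5) {
  const entries = rows.map(([key, label]) => ({ key, norm: normalizeLabel(label) }));
  const exact = new Map(entries.map((e) => [e.norm, e.key]));

  return function lookup(label) {
    const norm = normalizeLabel(label);
    // Tier 1: exact match after normalization.
    if (exact.has(norm)) return { key: exact.get(norm), tier: 1, confidence: 0.95 };
    // Tier 2: best fuzzy match above the (assumed) threshold.
    let best = { key: null, score: 0 };
    for (const e of entries) {
      const score = jaccard(norm.split(" "), e.norm.split(" "));
      if (score > best.score) best = { key: e.key, score };
    }
    if (best.score >= fuzzyThreshold) return { key: best.key, tier: 2, confidence: 0.7 };
    // Tier 3: would be handed to the AI fallback, which this sketch does not model.
    return { key: null, tier: 3, confidence: 0 };
  };
}
```

The sketch also shows why adding label variations as separate rows helps: each extra row is another Tier 1 entry, so fewer inputs drop to the fuzzy or AI tiers.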
@@ -1849,6 +2186,18 @@ var sections5 = [
       {
         type: "paragraph",
         text: "Start by editing fields in the **Workshop** draft, then use **Test Extraction** to compare draft results against the live version before publishing. The **Version History** timeline lets you review diff summaries between any two versions, making it easy to trace when a field was added, renamed, or removed and understand the impact on downstream jobs."
+      },
+      {
+        type: "paragraph",
+        text: "The versioning system is append-only \u2014 every time you publish a draft, it creates a new immutable version and the previous version is preserved in the timeline. This means you can always go back and review the exact schema that was used for any historical job. The diff view highlights added fields, removed fields, type changes, and updated instructions, giving you a clear picture of how your schema evolved."
+      },
+      {
+        type: "paragraph",
+        text: "Use the workshop system to iterate safely on your schema without disrupting production jobs. A common workflow is to add a new field in the Workshop, run a **Test Extraction** on a few documents to verify it produces correct values, then publish when satisfied. If a downstream integration depends on a specific field, the breaking change detection will warn you before you accidentally remove or rename it."
+      },
+      {
+        type: "callout",
+        text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing."
       }
     ],
     related: [
@@ -1864,6 +2213,10 @@ var sections5 = [
       {
         question: "What are breaking changes in a schema?",
         answer: "Breaking changes include field removals and type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts."
+      },
+      {
+        question: "Can I revert to a previous schema version?",
+        answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed."
       }
     ],
     mentions: ["versioning", "drafts", "workshop", "live version", "breaking changes"]
@@ -1882,6 +2235,18 @@ var sections5 = [
       {
         type: "paragraph",
         text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
+      },
+      {
+        type: "paragraph",
+        text: "Test extractions apply the same schema features as production jobs, including reference table lookups, format constraints, and modifiers, so the results closely preview what a full job would produce. Under the hood, the test uses a simplified single-call extraction mode rather than the full 4-phase pipeline, which makes it faster. This gives you a reliable preview without the cost of a full pipeline run."
+      },
+      {
+        type: "paragraph",
+        text: 'For best results, select 3-5 representative documents that cover the variety in your corpus \u2014 include at least one "clean" document and one with unusual formatting or missing fields. This gives you confidence that your schema handles both typical and edge-case documents correctly. Run the test after every significant change to a field instruction, reference table, or format constraint.'
+      },
+      {
+        type: "callout",
+        text: "Test extractions do not affect your live data and are not billed any differently from production jobs. They are designed for rapid iteration \u2014 run as many tests as you need before publishing."
       }
     ],
     related: [
@@ -1897,6 +2262,10 @@ var sections5 = [
       {
         question: "Do I need to publish a draft before testing it?",
         answer: "No. Test extraction runs against the unpublished draft, comparing its output to the current live version so you can verify changes before publishing."
+      },
+      {
+        question: "How many documents should I use for a test extraction?",
+        answer: "Select 3-5 representative documents that cover the variety in your corpus. Include documents with different layouts, data completeness levels, and edge cases to get a reliable preview of how your schema changes perform."
       }
     ],
     mentions: ["test extraction", "draft comparison", "side-by-side", "preview"]
@@ -1951,6 +2320,18 @@ var sections5 = [
       {
         type: "paragraph",
         text: "When working with international data, configure the dialect to match your downstream system requirements. For example, set **number_locale** to `fr-FR` for European comma-decimal formatting, switch the **delimiter** to semicolon for CSV compatibility, and choose **UTF-8-BOM** encoding if your data will be opened in Excel. Creating a shared dialect and reusing it across schemas ensures consistent formatting across all your exports."
+      },
+      {
+        type: "paragraph",
+        text: "Dialect settings are applied during Phase 4 of the extraction pipeline and during CSV/XLSX export. The dialect does not affect how values are stored internally \u2014 it only controls the serialization format when data leaves the platform. This means you can change a dialect at any time without re-running extractions; the new format applies to all future exports and deliveries."
+      },
+      {
+        type: "paragraph",
+        text: 'For best results, create a shared dialect for each downstream system or regional office you deliver to, and name it descriptively (e.g., "SAP Europe" or "US Accounting"). Avoid defining dialects inline on individual schemas unless you have a one-off formatting requirement. Shared dialects reduce maintenance burden and ensure consistency when you add new schemas later.'
+      },
+      {
+        type: "callout",
+        text: "If your CSV files show garbled special characters (accents, umlauts, CJK text), switch the encoding to **UTF-8-BOM**. The BOM (byte order mark) tells Excel to interpret the file as UTF-8 instead of the system default encoding."
       }
     ],
     related: [
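The point that a dialect only controls serialization can be illustrated with a small export sketch. The dialect object and its key names (`numberLocale`, `delimiter`, `encoding`) are hypothetical stand-ins for the settings named in the text, and the quoting rule follows common CSV practice rather than anything the platform documents. Grouping separators are disabled to keep the output simple.

```javascript
// Hypothetical dialect object modeled on the settings named in the docs.
const europeDialect = { numberLocale: "fr-FR", delimiter: ";", encoding: "UTF-8-BOM" };

// Serializes rows (arrays of strings/numbers) without changing the stored values.
function toCsv(rows, dialect) {
  const numberFormat = new Intl.NumberFormat(dialect.numberLocale, {
    useGrouping: false,
    minimumFractionDigits: 2,
    maximumFractionDigits: 2,
  });
  const formatCell = (cell) => {
    const text = typeof cell === "number" ? numberFormat.format(cell) : String(cell);
    // Quote cells that contain the delimiter, a quote, or a line break.
    const needsQuotes = /["\r\n]/.test(text) || text.includes(dialect.delimiter);
    return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const body = rows.map((row) => row.map(formatCell).join(dialect.delimiter)).join("\r\n");
  // A leading BOM lets Excel detect UTF-8.
  return (dialect.encoding === "UTF-8-BOM" ? "\uFEFF" : "") + body;
}
```

Because the rows keep their original numeric values, switching to a different dialect object and calling `toCsv` again is enough to change the output format; nothing needs to be re-extracted.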
@@ -1966,6 +2347,10 @@ var sections5 = [
       {
         question: "Can I share a dialect across multiple schemas?",
         answer: "Yes. A dialect can be shared across schemas or defined inline for a specific schema. Configure them in the Schema > Delivery tab."
+      },
+      {
+        question: "Do I need to re-run extractions when I change a dialect?",
+        answer: "No. Dialects only affect output serialization (exports and deliveries), not how values are stored internally. Changing a dialect takes effect immediately on future exports without re-processing."
       }
     ],
     mentions: [
@@ -2018,6 +2403,14 @@ var sections5 = [
         type: "paragraph",
         text: 'Use bypass strategies for fields whose values are known ahead of time or can be derived without reading the document. For example, set a **constant** of `"USD"` for a currency field that is always the same, or use a **generator** to produce a deterministic ID for each row. Fields with bypass strategies skip the AI extraction phase entirely, reducing processing time and credit usage.'
       },
+      {
+        type: "paragraph",
+        text: "The **reference** bypass strategy is particularly powerful for enrichment fields. Define a `key_expression` that references another field in the schema (e.g., the supplier name), and the system will automatically look up the corresponding code from your reference table without any AI involvement. This is ideal for mapping extracted entity names to internal system identifiers, ERP codes, or classification labels."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, audit your schema for fields that never vary across documents \u2014 these are prime candidates for the **constant** strategy. Fields like currency, data source, or processing batch can be set once and never require AI extraction. This reduces per-document processing cost and improves job completion time, especially on large runs with hundreds of documents."
+      },
       {
         type: "callout",
         text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net. Strategy values are normalized via generator mappings in Phase 4 of the pipeline."
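The bypass behavior described in this hunk can be sketched as a small dispatch function. The config shape (`field.bypass`, `value`, `table`, `generate`) is hypothetical; only the strategy names, `key_expression`, and the generator fall-through to LLM extraction come from the text. The sketch assumes `key_expression` is a plain field name, and what happens when a reference lookup misses is not documented, so it simply returns `null` there.

```javascript
// Decides how a cell is filled. Returns { source: "llm" } when AI extraction is needed.
function resolveBypass(field, row, referenceTables) {
  const bypass = field.bypass;
  if (!bypass) return { source: "llm" };

  switch (bypass.strategy) {
    case "constant":
      // Same value for every row, no document reading required.
      return { value: bypass.value, source: "constant" };
    case "reference": {
      // Look up another field's value in a label-to-code table (a Map in this sketch).
      const label = row[bypass.key_expression];
      const table = referenceTables[bypass.table];
      const code = table ? table.get(label) : undefined;
      return { value: code ?? null, source: "reference" };
    }
    case "generator": {
      // Documented safety net: a generator that yields nothing falls through to the LLM.
      const value = bypass.generate(row);
      return value == null ? { source: "llm" } : { value, source: "generator" };
    }
    default:
      return { source: "llm" };
  }
}
```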
@@ -2036,6 +2429,10 @@ var sections5 = [
       {
         question: "What happens when a generator bypass fails?",
         answer: "When a generator strategy fails to produce a value, the field falls through to LLM extraction as a safety net, ensuring the cell is still filled."
+      },
+      {
+        question: "Do bypass strategies reduce extraction costs?",
+        answer: "Yes. Fields with bypass strategies skip the AI extraction phase entirely, which reduces both processing time and credit usage. Use constant or reference strategies for fields that do not require document reading."
       }
     ],
     mentions: [
@@ -2081,6 +2478,18 @@ var sections5 = [
       {
         type: "paragraph",
         text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax. The editor provides a live test input so you can verify the pattern before saving."
+      },
+      {
+        type: "paragraph",
+        text: "Format constraints are especially useful for fields with strict formatting requirements in downstream systems. For example, a purchase order number that must follow the pattern `PO-\\d{6}` or a date that must match `\\d{4}-\\d{2}-\\d{2}`. By catching format violations at extraction time, you avoid importing malformed data into your ERP, accounting, or analytics systems."
+      },
+      {
+        type: "paragraph",
+        text: 'Choose the mismatch behavior based on your data quality requirements. Use **empty** (the default) when you prefer no data over bad data \u2014 the downstream system will see a blank cell. Use **flag** when you want to review mismatches manually before deciding \u2014 flagged cells appear with an amber dot in the results grid. Use **constant** when your downstream system needs a specific sentinel value like `"N/A"` or `"INVALID"` to trigger its own error handling.'
+      },
+      {
+        type: "callout",
+        text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching."
       }
     ],
     related: [
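A format constraint with the three mismatch behaviors named in this hunk could be evaluated along these lines. This is a sketch under assumptions: the constraint shape (`pattern`, `onMismatch`, `constant`) is hypothetical, the leading `(?i)` is translated into the JavaScript `i` flag by hand, and input longer than 1,000 characters is treated as a mismatch because the text does not say whether such input is rejected or truncated. The nested-quantifier check is omitted.

```javascript
const MAX_INPUT_LENGTH = 1000;

// Translates a leading "(?i)" into the JavaScript "i" flag (assumed handling).
function compilePattern(pattern) {
  if (pattern.startsWith("(?i)")) return new RegExp(pattern.slice(4), "i");
  return new RegExp(pattern);
}

// Returns the value to store plus a flag for the review grid.
function applyFormatConstraint(value, constraint) {
  const regex = compilePattern(constraint.pattern);
  const matches = value.length <= MAX_INPUT_LENGTH && regex.test(value);
  if (matches) return { value, flagged: false };

  switch (constraint.onMismatch ?? "empty") {
    case "flag":
      return { value, flagged: true };                       // keep the value for manual review
    case "constant":
      return { value: constraint.constant, flagged: false }; // sentinel such as "INVALID"
    default:
      return { value: "", flagged: false };                  // "empty": prefer no data over bad data
  }
}
```

Note that the sketch anchors nothing on its own: a pattern like `PO-\d{6}` would also match inside a longer string, so the examples in the tests use `^` and `$` explicitly.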
@@ -2096,6 +2505,10 @@ var sections5 = [
       {
         question: "Are original values preserved when format constraints clear a cell?",
         answer: "Yes. Original values are always preserved for audit in the original_extractions table, regardless of the mismatch behavior applied."
+      },
+      {
+        question: "Can I use case-insensitive regex patterns?",
+        answer: "Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax with inline flags."
       }
     ],
     mentions: [
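The format-constraint behavior described in this hunk can be sketched in a few lines. This is a minimal illustration, not Talonic's implementation: `compilePattern`, `applyFormatConstraint`, and the `onMismatch` and `constant` option names are hypothetical. Three behaviors are assumed rather than documented: the pattern must match the whole value, over-long input is truncated to 1,000 characters, and a leading `(?i)` is translated into the JavaScript `i` flag. The nested-quantifier rejection mentioned in the callout is omitted.

```javascript
// Hypothetical sketch; none of these names come from the Talonic codebase.
const MAX_INPUT_LENGTH = 1000; // input cap described in the callout

function compilePattern(pattern) {
  // Assumption: a leading (?i) is mapped onto the RegExp "i" flag.
  const caseInsensitive = pattern.startsWith("(?i)");
  const body = caseInsensitive ? pattern.slice(4) : pattern;
  // Assumption: the constraint must match the entire value.
  return new RegExp(`^(?:${body})$`, caseInsensitive ? "i" : "");
}

function applyFormatConstraint(value, constraint) {
  const { pattern, onMismatch = "empty", constant = null } = constraint;
  // Assumption: the cap is enforced by truncating the tested input.
  const input = String(value).slice(0, MAX_INPUT_LENGTH);
  if (compilePattern(pattern).test(input)) {
    return { value, flagged: false };
  }
  switch (onMismatch) {
    case "flag": // keep the value, mark the cell for review
      return { value, flagged: true };
    case "constant": // substitute the configured sentinel
      return { value: constant, flagged: false };
    default: // "empty": clear the cell
      return { value: null, flagged: false };
  }
}
```

With the `PO-\d{6}` pattern, a value such as `"PO-12"` would come back as `null` under **empty**, unchanged but flagged under **flag**, and as the sentinel under **constant**.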
@@ -2124,6 +2537,18 @@ var sections6 = [
       {
         type: "paragraph",
         text: "Navigate to **Structuring &rarr; Runs &rarr; New**. Select your template and documents, then click Start. Results appear progressively as each phase completes."
+      },
+      {
+        type: "paragraph",
+        text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents silent data loss where registry-based lookups would return empty results for unresolved documents."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, select documents of the same type or closely related types for a single job. The schema you choose should match the document content \u2014 using an invoice schema on contract documents will produce poor results. Start with a small batch of 5-10 documents to validate your schema, review the output, apply corrections, and then scale up to larger runs once you are confident in the extraction quality."
+      },
+      {
+        type: "callout",
+        text: "You do not need to wait for the entire job to finish. Because results appear as each pipeline phase completes, you can begin reviewing Phase 1 results while Phase 2 is still running."
       }
     ],
     related: [
@@ -2139,6 +2564,10 @@ var sections6 = [
       {
         question: "What does an extraction job produce?",
         answer: "A job produces a structured grid where rows represent documents and columns represent schema fields. Each cell contains an extracted value with confidence and provenance metadata."
+      },
+      {
+        question: "How many documents can I include in a single job?",
+        answer: "Phase 2 supports up to 2,000 documents per job, and Phase 4 supports up to 1,000. For best results, start with smaller batches to validate your schema before scaling up."
       }
     ],
     mentions: ["extraction job", "structured grid", "progressive results", "template selection"]
@@ -2158,11 +2587,23 @@ var sections6 = [
         type: "paragraph",
         text: "Each phase builds on the previous one, progressively filling the output grid. **Phase 1** resolves ~30% of cells instantly using graph matches and lookups. **Phase 2** deploys an AI agent to fill remaining gaps. **Phase 3** runs cross-field validation checks, and **Phase 4** performs targeted re-reads for empty or low-confidence cells. You can monitor fill rate in real time as each phase completes."
       },
+      {
+        type: "paragraph",
+        text: "The pipeline is designed around a key principle: use the cheapest, fastest method first and escalate to AI only when necessary. Phase 1 fills cells using deterministic lookups at zero AI cost. Phase 2 uses AI only for cells that Phase 1 could not resolve. Phase 3 re-runs lookups on Phase 2 output to normalize AI-generated values to canonical codes. Phase 4 performs targeted re-reads with full grid context for the remaining gaps. This cascading approach minimizes both cost and latency."
+      },
+      {
+        type: "paragraph",
+        text: "The grid is flushed to the database after each phase, enabling progressive rendering in the UI. You can watch cells fill in real time and begin reviewing results before the job finishes. The phase timeline on the job detail page shows which phase is currently active, how long each phase took, and the cumulative fill rate at each stage."
+      },
       {
         type: "ui-excerpt",
         id: "job-detail-phase-timeline",
         title: "Job Detail \u2014 Phase Timeline",
         caption: "The phase timeline shows progress through the pipeline. Each dot represents a stage, highlighted when active."
+      },
+      {
+        type: "callout",
+        text: "Phase order is fixed: Phase 1 &rarr; 2 &rarr; 3 &rarr; 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs."
       }
     ],
     related: [
@@ -2178,6 +2619,10 @@ var sections6 = [
       {
         question: "Can I see results before all phases complete?",
         answer: "Yes. Results are visible as each phase completes. The fill rate increases progressively through the pipeline."
+      },
+      {
+        question: "Why does the pipeline use multiple phases instead of a single AI call?",
+        answer: "The cascading design minimizes cost and latency. Phase 1 fills cells with deterministic lookups at zero AI cost. Only remaining gaps go to the AI agent in Phase 2, and Phase 4 targets specific empty cells with full context. This is significantly cheaper and faster than sending everything to AI."
       }
     ],
     mentions: ["4-phase pipeline", "fill rate", "progressive rendering", "phase timeline"]
@@ -2227,6 +2672,18 @@ var sections6 = [
       {
         type: "paragraph",
         text: "Values are normalized during transfer: dates &rarr; `YYYY/MM/DD`, numbers &rarr; 2 decimal places, strings &rarr; trim + collapse spaces."
+      },
+      {
+        type: "paragraph",
+        text: "Phase 1 is where most of the pipeline's cost savings come from. Because it relies entirely on pre-computed graph matches and deterministic lookups, it fills a large portion of the grid without any AI calls. The confidence scores assigned during this phase are typically high (0.7-0.95) because they are derived from verified registry matches rather than AI inference. These high-confidence cells are then protected by the confidence gate, meaning later phases cannot overwrite them."
+      },
+      {
+        type: "paragraph",
+        text: "The resolution strategies execute in a fixed order: registry transfer first, then raw extraction mapping, then the 3-tier lookup cascade, and finally deterministic compute (formulas like `Total = Unit Price x Quantity`). Each strategy only attempts to fill cells that are still empty after the previous strategy ran. This ordering ensures that the highest-confidence method always gets priority."
+      },
+      {
+        type: "callout",
+        text: "Phase 1 fill rates improve over time as your Field Registry grows. The more documents you process, the richer the registry becomes, and the more cells Phase 1 can resolve without AI \u2014 reducing both cost and latency for every subsequent job."
       }
     ],
     related: [
@@ -2242,6 +2699,10 @@ var sections6 = [
       {
         question: "What percentage of cells does Phase 1 fill?",
         answer: "Phase 1 typically fills approximately 30% of cells in seconds, using graph matches and lookups without any AI calls."
+      },
+      {
+        question: "Does Phase 1 performance improve over time?",
+        answer: "Yes. As your Field Registry grows from processing more documents, Phase 1 can resolve a higher percentage of cells through graph matches. Mature registries often see Phase 1 fill rates of 60-80%."
       }
     ],
     mentions: [
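The fixed strategy ordering in Phase 1 can be illustrated with a short sketch. Everything here is hypothetical (`resolvePhase1`, the strategy signatures, and the field names are made up for illustration), and only two of the four documented strategies are shown. The point is the rule that each strategy only sees cells that are still empty.

```javascript
// Hypothetical sketch of "each strategy only fills cells that are still empty".
function resolvePhase1(schemaFields, strategies, context) {
  const row = Object.fromEntries(schemaFields.map((f) => [f, null]));
  for (const strategy of strategies) {
    const emptyFields = schemaFields.filter((f) => row[f] === null);
    if (emptyFields.length === 0) break; // nothing left to resolve
    const proposals = strategy(emptyFields, row, context);
    for (const field of emptyFields) {
      if (proposals[field] !== undefined) row[field] = proposals[field];
    }
  }
  return row;
}

// Illustrative strategy 1: copy values already known to the registry.
function registryTransfer(emptyFields, row, context) {
  const out = {};
  for (const field of emptyFields) {
    if (field in context.registry) out[field] = context.registry[field];
  }
  return out;
}

// Illustrative last strategy: deterministic compute (Total = Unit Price x Quantity).
function deterministicCompute(emptyFields, row) {
  const out = {};
  if (emptyFields.includes("total") && row.unit_price !== null && row.quantity !== null) {
    out.total = row.unit_price * row.quantity;
  }
  return out;
}
```

Because earlier strategies run first and later ones never see filled cells, a total that already came from the registry is not recomputed.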
@@ -2300,6 +2761,14 @@ var sections6 = [
           }
         ]
       },
+      {
+        type: "paragraph",
+        text: "Phase 2 processes documents with grouped extraction calls \u2014 schema fields are divided into batches of up to 10 fields per call to balance extraction quality with throughput. For each document, the agent sends the document text along with the schema field definitions and any already-resolved values from Phase 1 as context. This context-aware approach means the AI can use related values (like a contract start date) to more accurately extract dependent values (like the end date)."
+      },
+      {
+        type: "paragraph",
+        text: "For fields backed by a **reference table**, Phase 2 includes the table's codes and labels directly in the extraction prompt so the AI picks canonical codes rather than free-text labels. This tight integration between reference tables and AI extraction produces cleaner output that requires fewer corrections. Reference tables with fewer than 50 entries are included in the prompt in full; larger tables are handled by the Phase 3 lookup cascade instead."
+      },
       {
         type: "callout",
         variant: "warning",
@@ -2319,6 +2788,10 @@ var sections6 = [
       {
         question: "Can the agent skip a field with manual instructions?",
         answer: "No. Fields with manual instructions always use the extract strategy. Human-written instructions are treated as authoritative and never skipped."
+      },
+      {
+        question: "How many fields does the agent process per AI call?",
+        answer: "Schema fields are grouped into batches of up to 10 fields per extraction call. This balances extraction quality with throughput \u2014 smaller groups help the AI focus on each field without losing recall."
       }
     ],
     mentions: [
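The grouping rules described for Phase 2 can be sketched as follows. This is an illustration under assumptions, not the platform's code: `batchFields`, `buildExtractionCalls`, and the shape of the field and call objects are hypothetical. Only the batch size of 10 and the 50-entry reference table cutoff come from the text.

```javascript
// Hypothetical sketch of Phase 2 call grouping.
const MAX_FIELDS_PER_CALL = 10; // batch size stated in the docs
const MAX_INLINE_REF_ENTRIES = 50; // tables with fewer entries go into the prompt

function batchFields(fields, size = MAX_FIELDS_PER_CALL) {
  const batches = [];
  for (let i = 0; i < fields.length; i += size) {
    batches.push(fields.slice(i, i + size));
  }
  return batches;
}

function buildExtractionCalls(documentText, fields, resolvedValues) {
  return batchFields(fields).map((batch) => ({
    documentText,
    resolvedValues, // Phase 1 values passed along as context
    fields: batch.map((field) => {
      const table = field.referenceTable ?? [];
      const inline = table.length > 0 && table.length < MAX_INLINE_REF_ENTRIES;
      // Larger tables are left to the Phase 3 lookup cascade.
      return { key: field.key, referenceEntries: inline ? table : null };
    }),
  }));
}
```

A schema with 23 fields would produce three calls under this sketch: two with 10 fields and one with 3.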
@@ -2375,6 +2848,18 @@ var sections6 = [
             description: "Field with >80% registry occurrence rate is empty in this document."
           }
         ]
+      },
+      {
+        type: "paragraph",
+        text: 'Phase 3 also re-runs the lookup cascade (reference table resolution) on values that Phase 2 produced. This is important because AI-extracted values often use natural language labels (e.g., "Frame Agreement") rather than the canonical codes your reference table expects (e.g., `std_master`). The Phase 3 lookup normalizes these labels to codes, improving consistency across your output without requiring manual corrections.'
+      },
+      {
+        type: "paragraph",
+        text: "Validation flags are designed to surface the most impactful issues first. The **low_confidence_outlier** flag is particularly useful \u2014 it highlights cells where the system is uncertain in an otherwise high-confidence row, pointing you to the exact cells most likely to contain errors. For large runs with hundreds of documents, filtering by flags and reviewing those cells first can reduce your review time by 80% or more."
+      },
+      {
+        type: "callout",
+        text: "Validation flags never modify cell values. They are purely informational annotations that help you prioritize review. The actual cell value and confidence score remain unchanged by Phase 3 flagging."
       }
     ],
     related: [
@@ -2390,6 +2875,10 @@ var sections6 = [
       {
         question: "What types of validation flags exist?",
         answer: "Five types: date_sanity (date inconsistencies), amount_mismatch (total discrepancies), lookup_failed (no reference match), low_confidence_outlier (low confidence cells), and unexpected_empty (missing high-frequency fields)."
+      },
+      {
+        question: "Does Phase 3 modify any cell values?",
+        answer: "Yes, but only through lookup normalization: Phase 3 re-runs the reference table lookup cascade to convert AI-extracted labels to canonical codes. The validation flags themselves are purely informational and never modify values."
       }
     ],
     mentions: [
@@ -2415,6 +2904,14 @@ var sections6 = [
         type: "paragraph",
         text: "Because Phase 4 has access to the full grid context \u2014 all values already resolved in earlier phases \u2014 it can use surrounding data as clues. For example, if a contract start date was resolved in Phase 1 but the end date is still empty, Phase 4 re-reads the document knowing the start date, which helps the AI locate the corresponding end date more accurately."
       },
+      {
+        type: "paragraph",
+        text: "Phase 4 also applies deterministic transforms to all cell values: ISO code normalization, date format standardization, and unit conversion. Format constraints (regex patterns defined on schema fields) are evaluated at this stage. If a value fails its format constraint, the configured mismatch behavior applies: the cell is cleared, flagged with an amber dot, or replaced with a constant. Original values are always preserved in the `original_extractions` table for audit purposes."
+      },
+      {
+        type: "paragraph",
+        text: "Expect Phase 4 to fill 5-15% of remaining empty cells, depending on document complexity and schema coverage. The phase is most effective for fields that require cross-referencing multiple sections of a document or interpreting values in the context of other extracted data. It is less effective for fields that are genuinely absent from the source document \u2014 those will remain empty with an `unresolved` provenance type."
+      },
       {
         type: "callout",
         text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected."
@@ -2433,6 +2930,10 @@ var sections6 = [
       {
         question: "Can Phase 4 overwrite high-confidence values?",
         answer: "No. Phase 4 respects the confidence gate \u2014 it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from earlier phases are permanently protected."
+      },
+      {
+        question: "What else happens in Phase 4 besides gap filling?",
+        answer: "Phase 4 also applies deterministic transforms (ISO codes, dates, units), evaluates format constraints (regex validation), and runs the modifier pipeline (format, alias, max_length). Original values are preserved for audit."
       }
     ],
     mentions: ["Phase 4", "re-read", "gap filling", "confidence gate", "targeted extraction"]
@@ -2457,6 +2958,18 @@ var sections6 = [
       {
         type: "paragraph",
         text: "Start your review by switching to the **Flagged** filter to focus on cells that need attention \u2014 these are values with validation warnings, low confidence, or format mismatches. Click any cell to see its full provenance, including which phase produced it and the reasoning trace. Once you are satisfied, export via **CSV** \u2014 choose the clean export for downstream systems or the full export with metadata for auditing."
+      },
+      {
+        type: "paragraph",
+        text: "The colored dots on each cell are your quickest visual indicator of data quality. Blue dots indicate graph matches from Phase 1 (highest reliability), purple dots indicate computed values, teal dots indicate agent transfers, indigo dots indicate AI extractions, and amber dots indicate lookup results or format flags. A grid dominated by blue and purple dots typically requires minimal review, while one with many indigo and amber dots may need more attention."
+      },
+      {
+        type: "paragraph",
+        text: "For large jobs with hundreds of documents, use a systematic review workflow: first address all **Flagged** rows, then spot-check a random sample of **Clean** rows to build confidence in the overall quality. If you find recurring errors in a specific field, consider updating the schema field's instruction or reference table, then run a new job \u2014 corrections you apply also feed back as training signals for future runs."
+      },
+      {
+        type: "callout",
+        text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus."
       }
     ],
     related: [
@@ -2472,6 +2985,10 @@ var sections6 = [
       {
         question: "Can I export extraction results?",
         answer: "Yes. Use CSV export from the job detail page. You can export clean data only or full data with metadata including confidence scores and resolution types."
+      },
+      {
+        question: "What is the most efficient way to review a large extraction run?",
+        answer: "Start with the Flagged filter to address cells with validation warnings, low confidence, or format mismatches. Then spot-check a random sample of Clean rows. Focus corrections on recurring field-level patterns rather than individual cells."
       }
     ],
     mentions: [
@@ -2528,6 +3045,14 @@ var sections6 = [
           }
         ]
       },
+      {
+        type: "paragraph",
+        text: "Confidence scores follow predictable patterns by resolution type. Graph matches from Phase 1 typically score 0.7-0.95 because they are derived from verified registry data. Reference table lookups score 0.95 for exact normalization matches, ~0.70 for fuzzy matches, and 0.50 for AI fallback. Agent-derived values from Phase 2 generally score 0.5-0.9 depending on the clarity of the source document and the specificity of the extraction instruction."
+      },
+      {
+        type: "paragraph",
+        text: "Use confidence scores to set your review threshold. Cells above 0.8 are generally reliable and can be trusted without manual verification for most use cases. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. The full CSV export lets you filter and sort by confidence, so you can review low-confidence cells in batches."
+      },
       {
         type: "callout",
         variant: "warning",
@@ -2547,6 +3072,10 @@ var sections6 = [
       {
         question: "What is the confidence gate?",
         answer: "The confidence gate prevents any later pipeline phase from overwriting a cell that was filled with confidence >= 0.7. This protects high-quality lookup results from lower-confidence agent extractions."
+      },
+      {
+        question: "What confidence threshold should I use for manual review?",
+        answer: "Cells above 0.8 are generally reliable. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. Use the CSV export to filter by confidence for efficient batch review."
       }
     ],
     mentions: [
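The confidence gate and the suggested review thresholds can be expressed as two small functions. This is a sketch under assumptions: `canWrite`, `writeCell`, and `reviewBucket` are hypothetical names, the cell shape is invented, and the handling of values exactly at 0.8 and 0.5 is a guess, since the text only gives ranges.

```javascript
// Hypothetical sketch of the confidence gate and review triage.
const GATE_THRESHOLD = 0.7; // from the FAQ: cells filled with confidence >= 0.7 are protected

// A later phase may write only to empty cells or cells below the gate threshold.
function canWrite(cell) {
  return cell.value === null || cell.confidence < GATE_THRESHOLD;
}

function writeCell(cell, candidate) {
  return canWrite(cell) ? { ...candidate } : cell;
}

// Assumption: boundary values (exactly 0.8 or 0.5) fall into the quick-check bucket.
function reviewBucket(confidence) {
  if (confidence > 0.8) return "trust";
  if (confidence >= 0.5) return "quick-check";
  return "manual-review";
}
```

Under this sketch, a Phase 1 graph match at 0.9 would survive a Phase 2 candidate untouched, while a 0.6 fuzzy lookup could still be replaced.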
@@ -2571,6 +3100,18 @@ var sections6 = [
       {
         type: "paragraph",
         text: "When correcting a value, consider using **all_similar** propagation if the same mistake appears across multiple documents \u2014 for example, a reference table code that was consistently mapped to the wrong label. This applies your fix to every document in the run that matched the same way, saving you from correcting each cell individually. The system learns from these corrections, so the same error is less likely to recur in future jobs."
+      },
+      {
+        type: "paragraph",
+        text: "Corrections create a full audit trail: the original extracted value, the corrected value, who made the change, and when. This audit log is preserved even after subsequent jobs are run, giving you a complete history of manual interventions. When you export results with the full metadata option, correction history is included so downstream systems can distinguish between AI-extracted and human-corrected values."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, correct the root cause rather than individual symptoms. If a field consistently produces wrong values, update the schema field's **manual instruction** or **reference table** rather than correcting cells one by one. If a reference table code is missing, add it to the table \u2014 future runs will pick it up automatically at Tier 1 confidence (0.95). Corrections are most valuable as a feedback mechanism when they inform schema improvements."
+      },
+      {
+        type: "callout",
+        text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected."
       }
     ],
     related: [
@@ -2586,6 +3127,10 @@ var sections6 = [
       {
         question: "Do corrections improve future extractions?",
         answer: "Yes. Corrections feed back as training signals for future runs, helping the system improve accuracy over time."
+      },
+      {
+        question: "Is there an audit trail for corrections?",
+        answer: "Yes. Every correction logs the original value, the corrected value, the user who made the change, and the timestamp. This audit history is preserved and included in full metadata CSV exports."
       }
     ],
     mentions: [
@@ -2639,6 +3184,18 @@ var sections7 = [
       {
         type: "paragraph",
         text: "Most link keys are auto-classified by name patterns. Remaining ambiguous fields are classified by AI. High-frequency entities (>30% of documents) are automatically excluded from case formation."
+      },
+      {
+        type: "paragraph",
+        text: "Behind the scenes, the classification engine applies rule-based heuristics first \u2014 field names like `company_name` or `invoice_number` are recognized instantly. When heuristics are inconclusive, an AI classifier examines the field's extracted values and schema context to determine the correct category. This two-tier approach keeps classification fast for the common case while handling ambiguous fields gracefully."
+      },
+      {
+        type: "paragraph",
+        text: "Use link keys whenever your documents share identifying information that should connect them. For best results, ensure your field names follow clear naming conventions \u2014 this maximizes the hit rate of the automatic classifier and minimizes the need for manual overrides."
+      },
+      {
+        type: "callout",
+        text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest."
       }
     ],
     related: [
@@ -2654,6 +3211,10 @@ var sections7 = [
       {
         question: "Why are high-frequency entities excluded from case formation?",
         answer: "Entities appearing in more than 30% of documents are too common to be meaningful connections. They are automatically excluded to prevent overly large, uninformative cases."
+      },
+      {
+        question: "Can I manually classify a field as a link key?",
+        answer: "Yes. Navigate to the Field Registry and change any field's link key category. Manual classifications take precedence over automatic ones and persist across future jobs."
       }
     ],
     mentions: [
@@ -2674,6 +3235,22 @@ var sections7 = [
       {
         type: "paragraph",
         text: 'After extraction, the linking pipeline runs automatically: extracts link key values, normalizes them (lowercasing, stripping suffixes like "Ltd", "Inc"), and builds a bipartite graph of documents &harr; entities.'
+      },
+      {
+        type: "paragraph",
+        text: 'The normalization step is critical for accurate linking. Values like "ACME Corp.", "Acme Corporation", and "acme corp" are all reduced to the same canonical form so they resolve to a single entity node. This prevents duplicate entities from fragmenting your cases and ensures documents that reference the same real-world entity are correctly connected.'
+      },
+      {
+        type: "paragraph",
+        text: "The resulting bipartite graph has two node types: documents and entities. An edge connects a document to an entity whenever the document contains that entity's value in a link key field. Connected components in this graph become the foundation for case formation \u2014 documents that share entities end up in the same case."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, ensure your source documents contain consistent identifiers. The pipeline handles minor variations automatically, but wildly inconsistent naming (e.g., abbreviations vs. full legal names) may require manual link key tuning in the Field Registry."
+      },
+      {
+        type: "callout",
+        text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered."
       }
     ],
     related: [
@@ -2689,6 +3266,10 @@ var sections7 = [
       {
         question: "When does entity linking run?",
         answer: "Entity linking runs automatically after document extraction. It processes link key values and builds connections without manual intervention."
+      },
+      {
+        question: "What normalization does entity linking apply?",
+        answer: "Values are lowercased, common suffixes (Ltd, Inc, Corp, etc.) are stripped, and whitespace is normalized. This ensures minor naming variations resolve to the same entity."
       }
     ],
     mentions: [
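Normalization and case formation over the bipartite graph can be sketched with a union-find pass. This is an illustration, not the linking pipeline itself: `normalizeEntity`, `formCases`, the suffix list, and the document shape are hypothetical, and the exclusion of high-frequency entities is left out for brevity.

```javascript
// Hypothetical suffix list; the docs name only Ltd, Inc, and Corp explicitly.
const SUFFIXES = ["ltd", "inc", "corp", "corporation", "llc", "gmbh"];

function normalizeEntity(raw) {
  const words = raw
    .toLowerCase()
    .replace(/[.,]/g, " ") // drop punctuation such as the dot in "Corp."
    .split(/\s+/) // collapse whitespace
    .filter(Boolean);
  while (words.length > 1 && SUFFIXES.includes(words[words.length - 1])) {
    words.pop();
  }
  return words.join(" ");
}

// Documents that share a normalized entity end up in the same case
// (a connected component of the document/entity graph).
function formCases(documents) {
  const parent = new Map(documents.map((d) => [d.id, d.id]));
  const find = (x) => {
    while (parent.get(x) !== x) x = parent.get(x);
    return x;
  };
  const firstDocForEntity = new Map();
  for (const doc of documents) {
    for (const raw of doc.linkValues) {
      const entity = normalizeEntity(raw);
      if (!firstDocForEntity.has(entity)) {
        firstDocForEntity.set(entity, doc.id);
      } else {
        parent.set(find(doc.id), find(firstDocForEntity.get(entity)));
      }
    }
  }
  const cases = new Map();
  for (const doc of documents) {
    const root = find(doc.id);
    if (!cases.has(root)) cases.set(root, []);
    cases.get(root).push(doc.id);
  }
  return [...cases.values()];
}
```

Because the components are built from a running map of entities, new documents could in principle be merged into existing components without a rebuild, which is in the spirit of the incremental behavior the callout describes.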
@@ -2780,6 +3361,22 @@ var sections7 = [
       {
         type: "paragraph",
         text: "The Document Graph provides a visual D3-force layout of the bipartite graph. Toggle between graph and list views from the Cases page. Case templates are auto-discovered after 3+ cases form \u2014 they identify recurring document type patterns."
+      },
+      {
+        type: "paragraph",
+        text: "In the graph view, document nodes and entity nodes are rendered with distinct visual styles. Edges represent link key connections, and tightly connected clusters naturally pull together through force simulation. Hovering over a node highlights its connections, making it easy to trace how documents relate through shared entities."
+      },
+      {
+        type: "paragraph",
+        text: 'Case templates capture recurring patterns \u2014 for example, "Invoice + Purchase Order + Contract" might emerge as a common template after enough cases form. Templates include a **match threshold** that controls how closely a case must match the expected document type set. Use templates to monitor completeness: if a case is missing a document type that the template expects, an anomaly is raised.'
+      },
+      {
+        type: "paragraph",
+        text: "Most teams use the graph view during initial workspace setup to verify that linking is producing sensible clusters. Once you are confident in your link key configuration, the list view is more practical for day-to-day case review and triage."
+      },
+      {
+        type: "callout",
+        text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern."
       }
     ],
     related: [
@@ -2794,6 +3391,10 @@ var sections7 = [
       {
         question: "What are case templates?",
         answer: "Case templates are auto-discovered after 3 or more cases form. They identify recurring document type patterns, helping you understand common document relationships in your workspace."
+      },
+      {
+        question: "Can I switch between graph and list views?",
+        answer: "Yes. Toggle between the visual D3-force graph and a traditional list view from the Cases page. Both views show the same underlying data \u2014 choose whichever suits your workflow."
       }
     ],
     mentions: ["document graph", "D3-force layout", "bipartite graph", "case templates"]
@@ -2843,6 +3444,18 @@ var sections7 = [
       {
         type: "paragraph",
         text: "Anomalies appear in the **Anomalies** tab of the case detail page (Advanced mode). Each anomaly card shows severity, affected fields, and a dismiss button. Dismissed anomalies are hidden by default but visible via the **show dismissed** toggle."
+      },
+      {
+        type: "paragraph",
+        text: "The detection engine runs automatically after case formation and whenever case membership changes (documents added, removed, or cases merged). Each detector operates independently \u2014 a single case can trigger multiple anomaly types simultaneously. Anomaly counts are displayed as badges in the case header for quick triage."
+      },
+      {
+        type: "paragraph",
+        text: "Use anomaly detection to surface data quality issues that would otherwise require manual comparison across documents. For best results, configure case templates so the **Missing Document Type** detector (D4) can flag incomplete cases. Most teams find that D2 (Field Conflict) and D3 (Duplicate Key Divergence) catch the highest-value issues in procurement and financial workflows."
+      },
+      {
+        type: "callout",
+        text: "Viewing anomalies requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but are not displayed on the case detail page."
       }
     ],
     related: [
@@ -2854,6 +3467,14 @@ var sections7 = [
       {
         question: "What anomalies does Talonic detect?",
         answer: "Five structural patterns: validation clusters, field conflicts, duplicate key divergence, missing document types, and value reuse. Each is surfaced as a dismissable card on the case detail page."
+      },
+      {
+        question: "Do anomalies update automatically when cases change?",
+        answer: "Yes. The detection engine re-runs whenever case membership changes \u2014 documents added or removed, cases merged or split. Anomaly badges in the case header update in real time."
+      },
+      {
+        question: "Can I dismiss anomalies?",
+        answer: "Yes. Each anomaly card includes a dismiss button. Dismissed anomalies are hidden by default but can be revealed using the show dismissed toggle on the Anomalies tab."
       }
     ],
     mentions: ["anomaly detection", "validation cluster", "field conflict", "duplicate key divergence", "value reuse"]
@@ -2886,6 +3507,14 @@ var sections7 = [
         type: "paragraph",
         text: "**Domain packs** extend validation with industry-specific rules. The freight domain pack includes DOT number state detection and MC number validation. Additional packs can be added to `domain-packs/` without modifying the core engine."
       },
+      {
+        type: "paragraph",
+        text: "Validation runs automatically after extraction and linking complete. Each field value is checked against every applicable validator \u2014 a single field can trigger multiple rules. Results are displayed as colored badges in the **Evidence** tab: green for pass, red for fail, and amber for warnings. You can filter by status, document, category, or free-text search."
+      },
+      {
+        type: "paragraph",
+        text: "The checksum validator (S7) uses a parameterized factory pattern \u2014 it accepts a checksum algorithm name and applies the corresponding verification logic. Supported algorithms include Luhn (credit card numbers), ABA (bank routing numbers), IBAN (international bank accounts), and ISBN (book identifiers). For best results, ensure your schema fields are typed correctly so the engine knows which checksum to apply."
+      },
       {
         type: "callout",
         text: "Evidence validation results are stored in a separate `evidence_validation_results` table keyed by (document_id, entity_id, field_key) \u2014 not in the extraction or linking tables."
@@ -2904,6 +3533,10 @@ var sections7 = [
       {
         question: "What are domain packs?",
         answer: "Domain packs add industry-specific validation rules. For example, the freight domain pack validates DOT numbers and MC numbers. New packs can be added without modifying the core engine."
+      },
+      {
+        question: "How are evidence validation results displayed?",
+        answer: "Results appear as colored badges in the Evidence tab of the case detail page. Green indicates pass, red indicates fail, and amber indicates a warning. Use the filter bar to narrow results by status, document, or category."
       }
     ],
     mentions: ["evidence validation", "structural validators", "checksum", "Luhn", "IBAN", "domain packs", "freight"]
@@ -2930,6 +3563,18 @@ var sections8 = [
       {
         type: "paragraph",
         text: "Navigate to **Data Products &rarr; Dataset Templates** to manage templates. Each template is linked to a user schema and can be versioned independently. When creating a new job, select a template instead of configuring the output from scratch."
+      },
+      {
+        type: "paragraph",
+        text: "Templates support column mappings that rename, reorder, or exclude fields from the output. Default transforms \u2014 such as date formatting, currency normalization, or unit conversion \u2014 are applied automatically during assembly. This means every data product built from the same template produces structurally identical output regardless of who runs it or when."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, create one template per downstream consumer. If your finance team and operations team need different column subsets from the same schema, define two templates rather than manually reconfiguring each export. Most teams version their templates alongside schema changes to maintain backward compatibility with existing integrations."
+      },
+      {
+        type: "callout",
+        text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction."
       }
     ],
     related: [
@@ -2945,6 +3590,10 @@ var sections8 = [
       {
         question: "How do dataset templates relate to schemas?",
         answer: "Each dataset template is linked to a user schema and can be versioned independently. When creating a new job, you can select a template instead of configuring output from scratch."
+      },
+      {
+        question: "Can I version dataset templates?",
+        answer: "Yes. Each template is versioned independently from the schema it references. This lets you evolve your output format over time without affecting existing data products built from earlier versions."
       }
     ],
     mentions: [
@@ -2964,11 +3613,19 @@ var sections8 = [
     content: [
       {
         type: "paragraph",
-        text: "An assembly combines documents from one or more sources into a single structured dataset based on a template. Assemblies track their constituent documents, source counts, and processing status."
+        text: "An assembly combines documents from one or more sources into a single structured dataset based on a template. Assemblies track their constituent documents, source counts, and processing status."
+      },
+      {
+        type: "paragraph",
+        text: "Navigate to **Data Products &rarr; Assemblies** to view and create assemblies. Each assembly shows its document count, linked schema, processing status, and the date it was created."
+      },
+      {
+        type: "paragraph",
+        text: "When you create an assembly, you select a dataset template and one or more document sources. The system pulls all matching documents, applies the template's column mappings and transforms, and produces a single structured output. The assembly tracks which documents contributed to each row, giving you full traceability from output back to source."
       },
       {
         type: "paragraph",
-        text: "Navigate to **Data Products &rarr; Assemblies** to view and create assemblies. Each assembly shows its document count, linked schema, processing status, and the date it was created."
+        text: "Use assemblies whenever you need a repeatable, auditable output for downstream systems or stakeholders. Most teams create one assembly per reporting period or delivery cycle. Because assemblies reference a template, you can regenerate the same output shape from different document sets without reconfiguring columns or transforms each time."
       },
       {
         type: "callout",
@@ -2988,6 +3645,10 @@ var sections8 = [
       {
         question: "Why should I use assemblies for production data?",
         answer: "Assemblies provide a single audit trail from source documents through extraction, resolution, and validation to the final output, making them the recommended approach for production datasets."
+      },
+      {
+        question: "Can an assembly pull from multiple sources?",
+        answer: "Yes. An assembly can combine documents from any number of sources \u2014 uploaded files, connected drives, email attachments, and more \u2014 into a single structured dataset."
       }
     ],
     mentions: [
@@ -3033,6 +3694,18 @@ var sections8 = [
       {
         type: "paragraph",
         text: "ID rules are persisted before generating IDs. Navigate to a data product detail page and use **Apply ID Rules** to generate or **Regenerate IDs** to refresh."
+      },
+      {
+        type: "paragraph",
+        text: 'Resolution maps normalize field values before they become part of the ID. For example, a resolution map can collapse "ACME Corp", "ACME Corporation", and "Acme" into a single canonical value "ACME". This prevents duplicate IDs for rows that refer to the same real-world entity under different names.'
+      },
+      {
+        type: "paragraph",
+        text: 'For best results, choose source fields with high uniqueness \u2014 contract numbers or invoice IDs work well, while generic fields like "status" do not. When your documents contain multiple candidate identifiers, configure a fallback chain so the dispenser always has a value to work with. Most teams use the primary reference number as the source field and the document name as the first fallback.'
+      },
+      {
+        type: "callout",
+        text: "ID generation is deterministic \u2014 running **Regenerate IDs** with the same rules and data always produces the same output. This makes ID dispensers safe to re-run without breaking downstream references."
       }
     ],
     related: [
@@ -3044,6 +3717,14 @@ var sections8 = [
       {
         question: "How do ID dispensers handle missing field values?",
         answer: "When the source field is empty, the dispenser tries each field in the fallback chain in order. If all are empty, it generates a prefix-less sequential ID."
+      },
+      {
+        question: "What is a resolution map?",
+        answer: 'A resolution map is a key-value lookup that normalizes field values before ID generation. For example, it can collapse "ACME Corp" and "ACME Corporation" into "ACME" to prevent duplicate IDs for the same entity.'
+      },
+      {
+        question: "Can I regenerate IDs without losing data?",
+        answer: "Yes. Regenerating IDs only updates the ID column \u2014 all other data product values remain unchanged. The operation is deterministic, so the same rules and data always produce the same IDs."
       }
     ],
     mentions: ["ID dispenser", "unique identifiers", "fallback chain", "resolution map"]
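The interaction between the resolution map and the fallback chain described above can be sketched like this. The function name `pickIdSource`, the row shape, and the field names are assumptions made for illustration; the actual ID format produced by the dispenser is not shown in this diff and is not modeled here.

```javascript
// Illustrative resolution map: variant spellings -> canonical value.
const resolutionMap = new Map([
  ["ACME Corp", "ACME"],
  ["ACME Corporation", "ACME"],
  ["Acme", "ACME"],
]);

// Treat undefined, null, and whitespace-only strings as empty.
function isEmpty(value) {
  return value === undefined || value === null || String(value).trim() === "";
}

// Returns the normalized value an ID would be built from, or null when the
// source field and every fallback field are empty.
function pickIdSource(row, sourceField, fallbackChain) {
  for (const field of [sourceField, ...fallbackChain]) {
    const raw = row[field];
    if (isEmpty(raw)) continue;
    const trimmed = String(raw).trim();
    // Normalize through the resolution map before the value reaches the ID.
    return resolutionMap.has(trimmed) ? resolutionMap.get(trimmed) : trimmed;
  }
  return null;
}
```

Because the function is pure, the same row and rules always yield the same value, which mirrors the deterministic behavior the callout describes. The `null` result corresponds to the case where, per the FAQ, the dispenser falls back to a prefix-less sequential ID.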
@@ -3102,6 +3783,10 @@ var sections8 = [
       {
         question: "Does CSV export preserve leading zeros?",
         answer: "Yes. All CSV exports preserve leading zeros and long numbers \u2014 values are never coerced to numeric types."
+      },
+      {
+        question: "What is auto-resolve singles?",
+        answer: "Auto-resolve singles automatically accepts fields that have only one candidate value, removing them from the manual review queue. Combined with auto-review, this significantly reduces the volume of items requiring human attention."
       }
     ],
     mentions: ["share token", "delivery website", "CSV export", "auto-review", "auto-resolve"]
@@ -3124,6 +3809,22 @@ var sections9 = [
       {
         type: "paragraph",
         text: "Schema-level quality rules run during Phase 3 of every job. Rule types: field format, value range, cross-field consistency, and AI-proposed coherence rules. Rules can be AI-proposed after a job completes, then reviewed and approved before activation."
+      },
+      {
+        type: "paragraph",
+        text: "**Field format** checks verify that values match an expected pattern (e.g., dates in ISO format, phone numbers with country codes). **Value range** checks ensure numeric or date values fall within acceptable bounds. **Cross-field consistency** checks compare two or more fields on the same record \u2014 for example, verifying that a start date precedes an end date."
+      },
+      {
+        type: "paragraph",
+        text: "AI-proposed coherence rules are generated by analyzing patterns in completed job results. The system identifies relationships that hold across most records and proposes them as candidate rules. You review each proposal in the validation settings before it becomes active \u2014 no AI-generated rule runs without explicit approval."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, start with a small set of high-confidence rules and expand over time. Most teams begin with field format checks for critical identifiers (invoice numbers, dates, amounts) and add cross-field consistency rules as they learn their data patterns. Validation failures do not block extraction \u2014 they flag records for review."
+      },
+      {
+        type: "callout",
+        text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently."
       }
     ],
     related: [
@@ -3139,6 +3840,10 @@ var sections9 = [
       {
         question: "Can AI suggest validation rules?",
         answer: "Yes. After a job completes, AI can propose coherence rules based on the data. You review and approve these rules before they are activated."
+      },
+      {
+        question: "Do validation failures block extraction?",
+        answer: "No. Validation checks flag records for review but do not prevent extraction from completing. Failed records appear in the Approval Queue for manual inspection."
       }
     ],
     mentions: [
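The three hand-written rule types described above (field format, value range, cross-field consistency) can be illustrated with a small sketch. The rule objects, field names, and the `validateRecord` function are hypothetical and do not reflect the platform's actual rule schema. Note that failures only flag the record, in line with the statement that validation does not block extraction.

```javascript
// Hypothetical rules; each `check` returns true when the record passes.
const rules = [
  // Field format: start_date must look like an ISO date (YYYY-MM-DD).
  { name: "start_date format", check: (r) => /^\d{4}-\d{2}-\d{2}$/.test(r.start_date) },
  // Value range: amount must fall within acceptable bounds.
  { name: "amount range", check: (r) => r.amount >= 0 && r.amount <= 1000000 },
  // Cross-field consistency: start date must precede end date.
  // ISO date strings sort correctly under plain string comparison.
  { name: "start before end", check: (r) => r.start_date < r.end_date },
];

function validateRecord(record) {
  const failures = rules
    .filter((rule) => !rule.check(record))
    .map((rule) => rule.name);
  // The record is returned unchanged; failures only mark it for review.
  return { record, flagged: failures.length > 0, failures };
}
```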
@@ -3158,6 +3863,22 @@ var sections9 = [
       {
         type: "paragraph",
         text: "Manually-created reference datasets with known-correct values. Create from **Validation &rarr; Golden Samples**. Benchmark runs compare extraction results against golden samples for per-field accuracy scoring with AI judge verdicts."
+      },
+      {
+        type: "paragraph",
+        text: "To create a golden sample, select a document and manually enter the correct value for each field. The system stores these known-correct values as the ground truth baseline. When you run a benchmark, the extraction pipeline processes the same document independently, and the results are compared field by field against your golden sample."
+      },
+      {
+        type: "paragraph",
+        text: 'Benchmark scoring uses an AI judge to evaluate each field comparison. The judge accounts for semantic equivalence \u2014 for example, "United States" and "US" may be scored as a match depending on the field type. Per-field accuracy scores let you identify exactly which fields are underperforming and need schema or instruction tuning.'
+      },
+      {
+        type: "paragraph",
+        text: "For best results, create golden samples from a representative mix of document types and complexity levels. Most teams maintain 5-10 golden samples per schema and re-run benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time."
+      },
+      {
+        type: "callout",
+        text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed."
       }
     ],
     related: [
@@ -3173,6 +3894,10 @@ var sections9 = [
       {
         question: "How do benchmark runs work?",
         answer: "Benchmark runs compare extraction results against golden samples, producing per-field accuracy scores with AI judge verdicts to measure extraction quality."
+      },
+      {
+        question: "How many golden samples should I create?",
+        answer: "Most teams maintain 5-10 golden samples per schema, covering a representative mix of document types and complexity levels. Re-run benchmarks after schema changes or model upgrades to track quality trends."
       }
     ],
     mentions: ["golden samples", "ground truth", "benchmark runs", "accuracy scoring", "AI judge"]
@@ -3188,6 +3913,18 @@ var sections9 = [
         type: "paragraph",
         text: "Threshold-based rules for auto-approving or flagging results. Configure per schema with criteria: minimum confidence, validation pass rate, field coverage. Results meeting all thresholds are auto-approved; others go to the manual review queue."
       },
+      {
+        type: "paragraph",
+        text: "Each criterion acts as an independent gate. **Minimum confidence** sets the lowest acceptable extraction confidence score. **Validation pass rate** requires a minimum percentage of validation checks to pass. **Field coverage** ensures that a minimum percentage of schema fields have non-empty values. A result must clear all three gates to be auto-approved."
+      },
+      {
+        type: "paragraph",
+        text: "Start with conservative thresholds \u2014 high confidence, high pass rate, high coverage \u2014 and loosen them as you gain trust in your extraction pipeline. Most teams begin with 90% confidence, 95% validation pass rate, and 80% field coverage, then adjust based on the volume of false positives in the approval queue."
+      },
+      {
+        type: "paragraph",
+        text: "Approval gates integrate directly with the delivery pipeline. When a result passes all gates, a `result.approved` signal is emitted automatically. Bind this signal to a destination to create a fully automated flow from document upload through extraction, validation, approval, and delivery \u2014 no manual steps required for high-confidence results."
+      },
       {
         type: "callout",
         text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems."
@@ -3206,6 +3943,10 @@ var sections9 = [
       {
         question: "How do approval gates connect to delivery?",
         answer: "Bind a result.approved signal to a delivery destination to only ship approved rows to your downstream systems. This ensures only quality-checked data is delivered."
+      },
+      {
+        question: "What thresholds should I start with?",
+        answer: "Most teams start with 90% confidence, 95% validation pass rate, and 80% field coverage. Adjust based on the volume of false positives in the approval queue \u2014 loosen thresholds as you gain trust in your pipeline."
       }
     ],
     mentions: [
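The "all three gates must pass" logic described above can be sketched as follows. The threshold values are the starting points quoted in the text (90% confidence, 95% validation pass rate, 80% field coverage); the property names and the `evaluateGates` function are assumptions for illustration only.

```javascript
// Starting thresholds quoted in the docs text; property names are hypothetical.
const defaultThresholds = {
  minConfidence: 0.9,
  minValidationPassRate: 0.95,
  minFieldCoverage: 0.8,
};

function evaluateGates(result, thresholds = defaultThresholds) {
  // Each criterion is an independent gate.
  const gates = {
    confidence: result.confidence >= thresholds.minConfidence,
    validationPassRate: result.validationPassRate >= thresholds.minValidationPassRate,
    fieldCoverage: result.fieldCoverage >= thresholds.minFieldCoverage,
  };
  // A result is auto-approved only if every gate passes.
  const approved = Object.values(gates).every(Boolean);
  return { approved, gates, route: approved ? "auto-approved" : "manual-review" };
}
```

Returning the per-gate booleans alongside the decision makes it easy to see which threshold sent a result to manual review, which helps when loosening thresholds over time.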
@@ -3230,6 +3971,22 @@ var sections9 = [
       {
         type: "paragraph",
         text: 'Filter the queue by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect the extracted values, provenance trails, and validation check results before approving or rejecting.'
+      },
+      {
+        type: "paragraph",
+        text: "The review detail view shows the extracted values alongside the source document, with provenance trails tracing each value back to its origin in the text. Validation check results are displayed inline \u2014 you can see exactly which rules passed and which failed before making your decision. Batch actions are available for approving or rejecting multiple items at once."
+      },
+      {
+        type: "paragraph",
+        text: "When you approve a result, a `result.approved` signal is emitted to the delivery pipeline. When you reject a result, a `result.rejected` signal fires instead. This event-driven design lets you build automated workflows that respond to review decisions \u2014 for example, routing approved records to a webhook and rejected records to a notification channel."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, review flagged items first \u2014 these are records where at least one validation check failed, making them the most likely to contain errors. Most teams assign a daily review cadence and use confidence range filters to prioritize low-confidence items that need the most attention."
+      },
+      {
+        type: "callout",
+        text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click."
       }
     ],
     related: [
@@ -3245,6 +4002,10 @@ var sections9 = [
       {
         question: "How do I review items in the Approval Queue?",
         answer: 'Filter by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect extracted values, provenance trails, and validation check results before approving or rejecting.'
+      },
+      {
+        question: "Can I batch approve or reject items?",
+        answer: "Yes. Select multiple items in the queue and use the batch action buttons to approve or reject them all at once. Each item emits the appropriate delivery signal individually."
       }
     ],
     mentions: [
@@ -3309,6 +4070,18 @@ var sections10 = [
       {
         type: "paragraph",
         text: "Every attempt is logged in `delivery_items`. Terminal failures (retry exhausted or permanent 4xx) write a `delivery_dead_letter` row, which is replayable. The outbox, history, DLQ, and catalog are all accessible via the [`/v1/delivery/*` API](/docs)."
+      },
+      {
+        type: "paragraph",
+        text: "The four registries \u2014 signals, deliverables, serializers, and connectors \u2014 are fully orthogonal. Adding a new destination type does not require changes to the signal or serializer code. This composable design means you can mix any supported signal with any compatible serializer and connector without custom integration work."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, start with a webhook destination to verify your binding configuration end-to-end. Once the payload shape and delivery cadence match your expectations, expand to file-based destinations (S3, SFTP) or spreadsheet destinations (Google Sheets). Most teams create separate bindings for different downstream consumers rather than routing all events to a single destination."
+      },
+      {
+        type: "callout",
+        text: "Delivery is at-least-once with deterministic idempotency keys. Receivers should use the `X-Talonic-Idempotency-Key` header (or equivalent metadata for file-based connectors) to deduplicate on their end."
       }
     ],
     related: [
@@ -3324,6 +4097,10 @@ var sections10 = [
       {
         question: "What happens when a delivery fails?",
         answer: "Failed deliveries retry with a backoff ladder. Terminal failures (retry exhausted or permanent 4xx) are written to the dead-letter queue (DLQ), which is fully replayable."
+      },
+      {
+        question: "What serialization formats are supported?",
+        answer: "Ten formats: json, ndjson, csv, csv_file, xlsx, rows, graph, raw, md, and txt. Each serializer declares which deliverable shapes it supports, and the compatibility triangle validates the combination at binding creation time."
       }
     ],
     mentions: [
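Since delivery is at-least-once, a receiver may see the same delivery more than once. The sketch below shows one way a receiver could deduplicate on the `X-Talonic-Idempotency-Key` header mentioned in the callout. Everything else is an assumption: the `handleDelivery` function, the return values, and the in-memory `Set` (a real receiver would use durable storage).

```javascript
// Keys already processed. In-memory only, for illustration.
const seenKeys = new Set();

// `headers` is assumed to be a plain object with lowercase header names,
// which is how Node's http module exposes incoming headers.
function handleDelivery(headers, payload, processPayload) {
  const key = headers["x-talonic-idempotency-key"];
  if (!key) return { status: "rejected", reason: "missing idempotency key" };
  if (seenKeys.has(key)) return { status: "duplicate" };
  processPayload(payload);
  // Record the key only after processing succeeds, so a crash mid-processing
  // lets a later redelivery be handled again.
  seenKeys.add(key);
  return { status: "processed" };
}
```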
@@ -3382,6 +4159,22 @@ var sections10 = [
             description: "Slice 2+. Structured data as email attachment."
           }
         ]
+      },
+      {
+        type: "paragraph",
+        text: "Each destination stores its connector type, configuration (URL, bucket, folder path), and optional authentication credentials. Webhook destinations support HMAC-SHA256 signing via a **signing secret** \u2014 every payload includes a signature header so your receiver can verify authenticity. File-based destinations (S3, SFTP, Google Drive) support configurable filename templates with token substitution for binding ID, timestamp, and idempotency key."
+      },
+      {
+        type: "paragraph",
+        text: "A single destination can back multiple bindings. For example, one S3 bucket destination can receive both `document.extracted` and `result.approved` events through separate bindings, each with its own serializer and field map. This keeps your destination inventory small while supporting diverse routing requirements."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, always run a live-ping test after creating a destination. The test exercises the full transport envelope \u2014 SSRF validation, payload cap, and authentication \u2014 with a tiny test payload, so you catch configuration errors before real events start flowing. OAuth-based destinations (Google Drive, Google Sheets) require connecting your account first via the OAuth flow in the dashboard."
+      },
+      {
+        type: "callout",
+        text: "Destinations can be disabled without deleting them. Set **is_active** to false and no bindings will route events to the destination until you re-enable it."
       }
     ],
     related: [
@@ -3397,6 +4190,10 @@ var sections10 = [
       {
         question: "How do I test a destination?",
         answer: "Every destination supports a live-ping test via POST /v1/delivery/destinations/:id/test that exercises the full transport envelope with a tiny test payload."
+      },
+      {
+        question: "Can one destination serve multiple bindings?",
+        answer: "Yes. A single destination can back any number of bindings, each with its own signal filter, serializer, and field map. This lets you route different event types to the same endpoint with different payload shapes."
       }
     ],
     mentions: [
@@ -3423,6 +4220,22 @@ var sections10 = [
       {
         type: "paragraph",
         text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. Optional `delivery_policy` overrides the default retry ladder (6 attempts at `5s, 30s, 2min, 10min, 1h`) and timeout."
+      },
+      {
+        type: "paragraph",
+        text: "The compatibility triangle is enforced on every create and update. The backend checks that your chosen serializer supports the deliverable resolver's output shape, and that the connector accepts the serializer's format. If any predicate fails, the binding is rejected with a descriptive error \u2014 you never end up with a binding that cannot deliver."
+      },
+      {
+        type: "paragraph",
+        text: 'Use `field_map` to tailor the payload for each downstream consumer. **Rename** rules map internal field names to the receiver\'s expected names. **Drop** rules exclude fields the receiver does not need. **Static** rules inject constant values (e.g., a `source: "talonic"` tag) into every payload. These three operations compose in order: drop first, then rename, then static injection.'
+      },
+      {
+        type: "paragraph",
+        text: "For best results, create one binding per downstream consumer per event type. This gives you independent control over payload shape, retry policy, and serialization format for each integration point. Most teams start with a `document.extracted` binding to a webhook and expand to run-level and approval signals as their pipeline matures."
+      },
+      {
+        type: "callout",
+        text: "The binding editor in the dashboard walks you through the compatibility triangle step by step \u2014 only showing serializers and deliverables that are compatible with your chosen signal and destination."
       }
     ],
     related: [
@@ -3438,6 +4251,10 @@ var sections10 = [
       {
         question: "Can I customize the delivery payload?",
         answer: "Yes. Use field_map to rename, drop, or add static fields without custom code. Use delivery_policy to override the default retry ladder and timeout."
+      },
+      {
+        question: "What is the compatibility triangle?",
+        answer: "The compatibility triangle validates that the signal, deliverable resolver, serializer, and connector all form a compatible combination. The backend enforces this on every binding create and update to prevent misconfigured delivery routes."
       }
     ],
     mentions: [
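The `field_map` ordering described above (drop first, then rename, then static injection) can be sketched as follows. The shape of the `fieldMap` object (`drop` array, `rename` object, `static` object) is an assumption for illustration; the real binding schema is not shown in this diff.

```javascript
// Applies a hypothetical field_map in the documented order:
// 1. drop, 2. rename, 3. static injection.
function applyFieldMap(payload, fieldMap) {
  const { drop = [], rename = {}, static: staticFields = {} } = fieldMap;
  const out = {};
  for (const [key, value] of Object.entries(payload)) {
    // 1. Drop rules are matched against the original (internal) field name.
    if (drop.includes(key)) continue;
    // 2. Rename rules map the internal name to the receiver's expected name.
    const target = Object.hasOwn(rename, key) ? rename[key] : key;
    out[target] = value;
  }
  // 3. Static rules inject constant values last.
  return { ...out, ...staticFields };
}
```

For example, `applyFieldMap({ vendor_name: "ACME", internal_id: 7 }, { drop: ["internal_id"], rename: { vendor_name: "supplier" }, static: { source: "talonic" } })` yields an object with `supplier` and `source` only. Because dropping happens before renaming, drop rules in this sketch refer to the internal names rather than the renamed ones.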
@@ -3520,6 +4337,22 @@ var sections10 = [
             description: "Fired after a terminal delivery failure."
           }
         ]
+      },
+      {
+        type: "paragraph",
+        text: "Signals are typed events emitted by the platform when meaningful state changes occur. Document-level signals fire on extraction success or failure. Run-level signals fire when a job completes across dataspace, structuring, resolution, or extraction runs. Result-level signals fire when a reviewer approves, rejects, or flags a record."
+      },
+      {
+        type: "paragraph",
+        text: "The two `delivery.item.*` entries are **meta-signals** \u2014 they fire when a delivery itself succeeds or fails. Use them for self-monitoring: bind `delivery.item.failed` to a notification webhook to receive alerts when deliveries break. The poller includes built-in loop prevention so a failed meta-signal delivery does not emit another meta-signal."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, use the catalog API to populate dropdown menus and configuration forms rather than hardcoding signal or deliverable lists. The catalog always reflects the running registry contents, so new signal types and deliverables appear automatically as the platform evolves."
+      },
+      {
+        type: "callout",
+        text: "The catalog API exposes four endpoints: `/v1/delivery/catalog/signals`, `/v1/delivery/catalog/deliverables`, `/v1/delivery/catalog/serializers`, and `/v1/delivery/catalog/connectors`. Each returns the full registry for that category."
       }
     ],
     related: [
@@ -3535,6 +4368,10 @@ var sections10 = [
       {
         question: "How do I discover available signals and deliverables?",
         answer: "Use the catalog API at /v1/delivery/catalog/* which exposes the four registries (signals, deliverables, serializers, connectors) that drive the binding picker."
+      },
+      {
+        question: "What are meta-signals?",
+        answer: "Meta-signals (delivery.item.completed and delivery.item.failed) fire when a delivery attempt itself succeeds or fails. Use them for self-monitoring \u2014 for example, binding delivery.item.failed to a notification webhook for delivery failure alerts."
       }
     ],
     mentions: [
@@ -3555,6 +4392,22 @@ var sections10 = [
       {
         type: "paragraph",
         text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies. Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt with a fresh idempotency key. Nothing in history is ever mutated; the log is strictly append-only."
+      },
+      {
+        type: "paragraph",
+        text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration."
+      },
+      {
+        type: "paragraph",
+        text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error, fix the destination configuration, and replay the delivery with a single click or API call."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, monitor the DLQ regularly and set up a `delivery.item.failed` meta-signal binding to receive alerts when deliveries fail terminally. Most teams configure a notification webhook for this signal so they are notified immediately rather than discovering failures during a manual review. Request and response bodies older than the configured retention period are automatically cleaned up, but row metadata (status, error code, duration) is retained indefinitely."
+      },
+      {
+        type: "callout",
+        text: "Replay is safe to run multiple times. Each replay enqueues a new attempt with a fresh idempotency key and never mutates earlier history, so previous delivery records stay intact for auditing."
       }
     ],
     related: [
@@ -3570,6 +4423,10 @@ var sections10 = [
       {
         question: "What is the dead letter queue (DLQ)?",
         answer: "Terminal failures (retry ladder exhausted or permanent 4xx) escalate to /v1/delivery/dlq. DLQ entries are fully replayable \u2014 replay enqueues a fresh attempt with a new idempotency key."
+      },
+      {
+        question: "How long are request and response bodies retained?",
+        answer: "Request and response bodies are cleaned up after the configured retention period (default 30 days). Row metadata \u2014 status, HTTP code, error code, and duration \u2014 is retained indefinitely for audit purposes."
       }
     ],
     mentions: [
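The retry and dead-letter behavior described above can be sketched as a small decision function. The ladder values come from the default quoted in the bindings text (6 attempts at `5s, 30s, 2min, 10min, 1h`). Two points are assumptions: that the five delays sit between the six attempts, and that every 4xx status counts as permanent (the text only names 401 and 403 as examples). The function name and return shape are hypothetical.

```javascript
// Default ladder from the docs text, in milliseconds: 5s, 30s, 2min, 10min, 1h.
const LADDER_MS = [5000, 30000, 120000, 600000, 3600000];
// Assumption: five delays between attempts means six attempts in total.
const MAX_ATTEMPTS = LADDER_MS.length + 1;

// Decides what happens after attempt `attemptNumber` (1-based) returned `httpStatus`.
function nextStep(attemptNumber, httpStatus) {
  if (httpStatus >= 200 && httpStatus < 300) return { action: "done" };
  // Assumption: all 4xx responses are treated as permanent failures.
  if (httpStatus >= 400 && httpStatus < 500) {
    return { action: "dead_letter", reason: "permanent 4xx" };
  }
  if (attemptNumber >= MAX_ATTEMPTS) {
    return { action: "dead_letter", reason: "retry exhausted" };
  }
  return { action: "retry", delayMs: LADDER_MS[attemptNumber - 1] };
}
```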
@@ -3604,6 +4461,10 @@ var sections11 = [
         type: "paragraph",
         text: "Dialects ensure consistency across all your structured output. When your downstream systems expect dates in `YYYY-MM-DD` format, numbers with `.` as the decimal separator, and CSVs delimited by `;`, you configure this once in the shared dialect rather than repeating it in every schema."
       },
+      {
+        type: "paragraph",
+        text: "Most teams configure their shared dialect during initial workspace setup and rarely change it afterward. If your organization operates across regions with different formatting conventions, create separate workspaces with region-specific dialects rather than overriding at the schema level. This keeps the configuration clean and avoids inconsistencies in delivered data."
+      },
       {
         type: "list",
         ordered: false,
@@ -3674,6 +4535,10 @@ var sections11 = [
         type: "paragraph",
         text: "The lookup convention follows a `key` / `value` structure where the `key` is the output code and the `value` is the human-readable label. During extraction, the platform maps FROM labels found in documents TO the canonical codes defined in the reference primitive. This ensures consistent, machine-readable output regardless of how values appear in source documents."
       },
+      {
+        type: "paragraph",
+        text: "For best results, keep reference primitives focused on a single domain \u2014 for example, one primitive for country codes, another for currency codes, and another for product categories. This makes each primitive reusable across multiple schemas and simplifies maintenance. When updating a primitive, test the new version against a few sample documents before updating the version reference in production schemas."
+      },
       {
         type: "callout",
         variant: "info",
@@ -3741,6 +4606,10 @@ var sections11 = [
         type: "paragraph",
         text: "Change review is particularly important for workspaces that feed downstream systems through delivery bindings. A small change to a schema field mapping or a reference primitive value can ripple through to every document processed after that point. The review process creates a checkpoint where a second pair of eyes can verify the change before it goes live."
       },
+      {
+        type: "paragraph",
+        text: "Most teams enable change review as soon as their workspace transitions from development to production. During the initial setup phase, you can leave it disabled for faster iteration. Once your schemas, dialects, and reference primitives are stable and data is flowing to downstream systems, enable change review to protect against accidental modifications that could disrupt live pipelines."
+      },
       {
         type: "list",
         ordered: false,
@@ -3807,6 +4676,14 @@ var sections12 = [
         type: "paragraph",
         text: "Omnisearch is designed to be the single entry point for finding anything in the platform. Rather than navigating to specific pages to search within them, Omnisearch queries a **materialized values index** that aggregates data across all your content. Results are grouped by category so you can quickly distinguish between a document match and a field name match."
       },
+      {
+        type: "paragraph",
+        text: "The materialized values index is rebuilt automatically whenever documents are processed or schemas change, so search results are always current. There is no manual reindex step \u2014 new documents become searchable as soon as extraction completes. This makes Omnisearch reliable even during high-volume ingestion periods."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, use Omnisearch as your primary navigation tool. Instead of browsing through document lists or clicking through the sidebar, press `Cmd+K` and type what you are looking for \u2014 whether it is a specific invoice number, a field name, or a schema title. Most users find that Omnisearch is faster than manual navigation for any task beyond browsing the most recent documents."
+      },
       {
         type: "callout",
         variant: "info",
@@ -3883,6 +4760,10 @@ var sections12 = [
       {
         type: "paragraph",
         text: "Filter state is encoded in the URL query string using dynamic SQL generation on the backend. This means you can bookmark filtered views, share them with teammates via a link, or save them as **presets** for one-click access to commonly used queries."
+      },
+      {
+        type: "paragraph",
+        text: 'For best results, save your most common filter combinations as presets. Most teams create presets for categories like "high-value invoices this quarter," "documents missing key fields," or "recently failed extractions." Presets appear as one-click buttons on the Documents page, eliminating the need to rebuild complex filter conditions from scratch each time.'
       }
     ],
     related: [
@@ -3937,6 +4818,19 @@ var sections13 = [
         type: "paragraph",
         text: "Manage API keys from **Settings &rarr; API Keys**. Keys are prefixed with `tlnc_` and passed via `Authorization: Bearer`. Keys are SHA-256 hashed \u2014 the full key is only shown once at creation."
       },
+      {
+        type: "paragraph",
+        text: "Each API key is assigned one or more scopes that control what operations it can perform. Scopes follow the principle of least privilege \u2014 create a key with only the scopes your integration needs. For example, a read-only dashboard integration only needs the `read` scope, while an automated ingestion pipeline needs `extract` and `read`."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, create separate API keys for each integration or service that connects to your Talonic workspace. This makes it easy to rotate or revoke a single key without disrupting other integrations. Most teams maintain one key for their ingestion pipeline, one for their BI dashboard, and one for webhook-based automations."
+      },
+      {
+        type: "callout",
+        variant: "warning",
+        text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated."
+      },
       {
         type: "param-table",
         title: "API key scopes",
@@ -3972,6 +4866,10 @@ var sections13 = [
       {
         question: "What scopes are available for API keys?",
         answer: "Three scopes: extract (use extraction API), read (read documents, extractions, schemas, jobs), and write (create and modify resources)."
+      },
+      {
+        question: "Can I have multiple API keys?",
+        answer: "Yes. You can create as many API keys as needed. Best practice is to create separate keys for each integration so you can rotate or revoke them independently without disrupting other services."
       }
     ],
     mentions: ["API keys", "tlnc_", "SHA-256", "Bearer token", "scopes"]
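Reviewer note on the API key hunk above: a minimal sketch of how a client might attach a key and reason about scopes. Only the `tlnc_` prefix, the `Authorization: Bearer` scheme, and the three scope names (`extract`, `read`, `write`) come from the text in this diff. The helper names `authHeaders` and `missingScopes` and the sample key are hypothetical.

```javascript
// Hypothetical helper: builds request headers for an API key.
// The `tlnc_` prefix and Bearer scheme are taken from the docs text.
function authHeaders(apiKey) {
  if (typeof apiKey !== "string" || !apiKey.startsWith("tlnc_")) {
    throw new TypeError("expected an API key starting with tlnc_");
  }
  return { Authorization: `Bearer ${apiKey}` };
}

// Hypothetical least-privilege check: which required scopes does a key lack?
// Scope names mirror the three listed in the FAQ answer.
const KNOWN_SCOPES = ["extract", "read", "write"];

function missingScopes(granted, required) {
  for (const scope of [...granted, ...required]) {
    if (!KNOWN_SCOPES.includes(scope)) {
      throw new RangeError(`unknown scope: ${scope}`);
    }
  }
  return required.filter((scope) => !granted.includes(scope));
}

// Placeholder key for illustration only, not a real credential.
const headers = authHeaders("tlnc_example");
// An ingestion pipeline needs `extract` and `read`; a read-only key falls short.
const gaps = missingScopes(["read"], ["extract", "read"]);
```

Keeping the scope check on the client side is only a convenience for failing early. The platform remains the authority on what a key may do.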
@@ -3983,6 +4881,27 @@ var sections13 = [
     seoTitle: "Public REST API Overview \u2014 Talonic Docs",
     description: "Full REST API with 20+ namespaces: extract, documents, extractions, schemas, jobs, sources, delivery, linking, matching, batches, cases, quality, and more. Cursor pagination.",
     content: [
+      {
+        type: "paragraph",
+        text: "Talonic exposes a comprehensive REST API with 20+ namespaces covering every aspect of the platform \u2014 from document extraction and schema management to delivery, matching, and quality benchmarking. All endpoints use JSON request and response bodies with cursor-based pagination for list operations."
+      },
+      {
+        type: "paragraph",
+        text: "The API follows standard REST conventions. Authenticate with a `tlnc_` API key via the `Authorization: Bearer` header. Most resources support full CRUD operations, and long-running tasks like matching runs and batch inference are handled asynchronously with polling endpoints for status and progress."
+      },
+      {
+        type: "paragraph",
+        text: "Use the public API to build automated ingestion pipelines, integrate extraction results into downstream systems, or orchestrate complex workflows that combine multiple platform features. The API mirrors every action available in the web interface, so anything you can do manually can be fully automated."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, start with the `/v1/extract` endpoint for document ingestion, then use `/v1/documents` and `/v1/extractions` to retrieve results. As your integration matures, explore delivery bindings, matching configurations, and batch processing to build a fully automated data pipeline."
+      },
+      {
+        type: "callout",
+        variant: "info",
+        text: "See the full [API Documentation](/docs) for detailed endpoint specifications, request/response examples, and authentication guides. The API reference is organized by namespace and includes every parameter, status code, and error response."
+      },
       {
         type: "param-table",
         title: "API namespaces",
@@ -4103,6 +5022,10 @@ var sections13 = [
       {
         question: "Where can I find detailed API documentation?",
         answer: "See the full API Documentation at /docs for complete endpoint documentation with request/response examples, parameter descriptions, and authentication details."
+      },
+      {
+        question: "How does pagination work in the API?",
+        answer: "List endpoints use cursor-based pagination. Each response includes a cursor token that you pass as a query parameter to fetch the next page. This approach is more reliable than offset-based pagination when documents are being added or removed concurrently."
       }
     ],
     mentions: [
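Reviewer note on the pagination FAQ above: a minimal sketch of a cursor loop. The diff only says that each response includes a cursor token passed back as a query parameter. The field names `items` and `nextCursor`, the `fetchPage` callback, and the in-memory endpoint are assumptions made so the example runs without a network.

```javascript
// Collects every item from a cursor-paginated list endpoint.
// `fetchPage(cursor)` is a caller-supplied async function. The response
// fields `items` and `nextCursor` are assumed names, not the documented shape.
async function collectAll(fetchPage) {
  const all = [];
  let cursor;
  do {
    const page = await fetchPage(cursor);
    all.push(...page.items);
    cursor = page.nextCursor; // falsy on the last page
  } while (cursor);
  return all;
}

// In-memory stand-in for a real list endpoint (hypothetical data).
const fakePages = {
  start: { items: [1, 2], nextCursor: "a" },
  a: { items: [3], nextCursor: null },
};
const fakeFetchPage = async (cursor) => fakePages[cursor ?? "start"];
```

In a real client, `fetchPage` would issue the HTTP request and append the cursor as a query parameter. The loop itself does not change.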
@@ -4128,6 +5051,14 @@ var sections13 = [
         type: "paragraph",
         text: "The webhook connector is configured as a **delivery destination**. Bind any of the signal types below to a webhook destination to receive real-time notifications. See `/v1/delivery/catalog/signals` for the exhaustive list."
       },
+      {
+        type: "paragraph",
+        text: "When a webhook fires, the platform constructs the payload from the signal data, signs it with your destination's HMAC-SHA256 signing secret, and delivers it via HTTPS POST. Each delivery includes an idempotency key in the headers so your receiver can safely deduplicate retries. Failed deliveries follow an exponential backoff schedule, and terminal failures are routed to the dead-letter queue for manual replay."
+      },
+      {
+        type: "paragraph",
+        text: "Use webhooks when your downstream system needs to react immediately to platform events \u2014 for example, triggering an ERP import when a document is extracted, or notifying a Slack channel when a reviewer rejects a record. For bulk or periodic data transfers, consider using the SFTP, S3, or cloud storage delivery connectors instead."
+      },
       {
         type: "param-table",
         title: "Delivery signal types (webhook-compatible)",
@@ -4203,6 +5134,10 @@ var sections13 = [
       {
         question: "What happens when a webhook delivery fails?",
         answer: "Failed webhook deliveries retry with exponential backoff. Terminal failures (retry exhausted or permanent 4xx) escalate to the dead-letter queue for manual replay."
+      },
+      {
+        question: "How do I verify webhook signatures?",
+        answer: "Each webhook payload is signed with HMAC-SHA256 using the signing secret from your delivery destination configuration. Compute the HMAC of the raw request body and compare it to the signature header to verify authenticity. This ensures the payload was sent by Talonic and was not tampered with in transit."
       }
     ],
     mentions: [
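Reviewer note on the signature FAQ above: a minimal sketch of the verification step in Node.js. The diff states that the raw request body is signed with HMAC-SHA256 using the destination's signing secret. It does not name the signature header or the digest encoding, so the hex encoding and the function names below are assumptions.

```javascript
const crypto = require("node:crypto");

// Computes the HMAC-SHA256 of the raw body. Hex output is an assumption;
// the diff does not say how the signature is encoded.
function signBody(rawBody, secret) {
  return crypto.createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Compares the received signature with the expected one in constant time.
function verifySignature(rawBody, secret, receivedSignature) {
  const expected = Buffer.from(signBody(rawBody, secret), "utf8");
  const received = Buffer.from(String(receivedSignature), "utf8");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return (
    expected.length === received.length &&
    crypto.timingSafeEqual(expected, received)
  );
}
```

The HMAC must be computed over the raw bytes as received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the signature.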
@@ -4262,6 +5197,10 @@ var sections14 = [
         type: "paragraph",
         text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. Manage from the Team page."
       },
+      {
+        type: "paragraph",
+        text: "When a team member is removed, their access is revoked immediately but their past actions \u2014 edits, uploads, approvals, and review decisions \u2014 remain in the audit trail. This preserves data integrity and compliance history. Removed users can be re-added later through the same domain matching process if needed."
+      },
       {
         type: "callout",
         variant: "info",
@@ -4329,6 +5268,14 @@ var sections14 = [
         type: "paragraph",
         text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events."
       },
+      {
+        type: "paragraph",
+        text: "Behind the scenes, every LLM and OCR call is logged with full detail \u2014 the model used, input and output token counts, latency, and computed cost. This data powers both the per-feature breakdown and the individual call log. The system tracks costs across extraction, OCR, batch inference, matching AI resolution, and quality passes so you always know where your spend is going."
+      },
+      {
+        type: "paragraph",
+        text: "Most teams review the daily cost chart weekly to establish a usage baseline. Unexpected spikes usually correlate with large document uploads or batch completions. For organizations managing multiple workspaces, the **Master view** provides a single pane of glass showing per-customer breakdowns and platform-wide aggregates \u2014 accessible only to platform administrators."
+      },
       {
         type: "param-table",
         title: "Usage views",
@@ -4404,6 +5351,14 @@ var sections14 = [
         type: "paragraph",
         text: "The Admin Panel is the central hub for platform-wide operations. **Customer management** lets you create, view, and delete organizations. **User management** provides a cross-tenant view of all platform users with the ability to remove accounts. The **data clear & rebuild** function wipes all data for a specific customer and reprocesses from scratch \u2014 useful during onboarding or after significant schema changes."
       },
+      {
+        type: "paragraph",
+        text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, limit Admin Panel access to a small group of trusted platform operators. Use the **master registry** view to audit field definitions and schemas across tenants \u2014 this is particularly useful when standardizing extraction configurations or troubleshooting cross-tenant data quality issues."
+      },
       {
         type: "list",
         ordered: false,
@@ -4463,6 +5418,18 @@ var sections14 = [
         type: "paragraph",
         text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows."
       },
+      {
+        type: "paragraph",
+        text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused."
+      },
+      {
+        type: "paragraph",
+        text: "The most frequently used shortcut is **Omnisearch** (`Cmd+K` / `Ctrl+K`), which opens a global search overlay that queries documents, extracted values, field names, schemas, and sources simultaneously. Power users rely on it to navigate the platform faster than clicking through the sidebar."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, build muscle memory around the three core shortcuts. Use `Cmd+K` to find anything, `Cmd+J` to upload a document on the fly, and `Escape` to dismiss any overlay or modal. These three actions cover the most common interruptions during a review or configuration session."
+      },
       {
         type: "param-table",
         title: "Shortcuts",
@@ -4533,6 +5500,14 @@ var sections15 = [
         type: "paragraph",
         text: "Under the hood, batch inference leverages the provider's native batch API (Anthropic Message Batches or AWS Bedrock invocation jobs). Documents accumulate in a queue and are submitted together, allowing the provider to schedule processing during off-peak capacity. This is why the cost reduction is possible without any loss in extraction quality."
       },
+      {
+        type: "paragraph",
+        text: "Batch mode is best suited for backlog ingestion, periodic bulk uploads, and any scenario where results are not needed in real time. Most teams use batch mode for overnight processing of large document volumes and reserve real-time processing for time-sensitive documents that need immediate attention."
+      },
+      {
+        type: "paragraph",
+        text: "When batch results arrive, they pass through the same post-processing pipeline as real-time extractions \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation. The only difference is that LLM-based quality passes (field estimation, verification, cross-reference enrichment) are skipped in batch mode to preserve the cost savings."
+      },
       {
         type: "list",
         ordered: false,
@@ -4609,6 +5584,10 @@ var sections15 = [
       {
         type: "paragraph",
         text: "While waiting for batch results, documents show a status of `batch_queued`. Once the provider returns results, the platform applies them through the same post-processing pipeline as real-time extraction \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation."
+      },
+      {
+        type: "paragraph",
+        text: "You can also enable batch mode on a per-source basis. When a source connection has the batch processing toggle enabled, all documents ingested through that source are automatically routed to the batch queue. This is ideal for source connections that handle non-urgent, high-volume ingestion \u2014 such as a shared drive that collects documents overnight."
       }
     ],
     related: [
@@ -4658,6 +5637,14 @@ var sections15 = [
         type: "paragraph",
         text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents and the batch transitions to **completed** status."
       },
+      {
+        type: "paragraph",
+        text: "The batch detail view shows individual items within a batch, including which documents are included, their current processing state, and any errors that occurred. Use this view to verify that a specific document was included in the expected batch and to troubleshoot items that failed to parse."
+      },
+      {
+        type: "paragraph",
+        text: "The platform includes built-in crash recovery for batch processing. If the application restarts while a batch is in a transient `processing` state, the recovery logic automatically reverts it to `submitted` so the next polling cycle can retry. This means batch jobs are resilient to infrastructure disruptions without requiring manual intervention."
+      },
       {
         type: "param-table",
         title: "Batch statuses",
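Reviewer note on the crash-recovery paragraph above: a minimal sketch of the described behavior, where a batch left in the transient `processing` state is reverted to `submitted` so the next polling cycle retries it. The batch shape (`id`, `status`) and the function name are assumptions. The real recovery logic is not shown in this diff.

```javascript
// Reverts batches stuck in `processing` to `submitted`.
// Other statuses are left untouched. Returns new objects rather than
// mutating the input, so callers can diff before and after.
function recoverBatches(batches) {
  return batches.map((batch) =>
    batch.status === "processing" ? { ...batch, status: "submitted" } : batch
  );
}

// Hypothetical state at application restart.
const recovered = recoverBatches([
  { id: "b1", status: "processing" },
  { id: "b2", status: "completed" },
  { id: "b3", status: "submitted" },
]);
```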
@@ -4737,6 +5724,14 @@ var sections16 = [
         type: "paragraph",
         text: 'Reference data is the foundation of the matching system. It represents your "ground truth" \u2014 the known records you want to match extracted document data against. Common examples include customer lists, product catalogs, vendor registries, and contract databases.'
       },
+      {
+        type: "paragraph",
+        text: "When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. Each dataset is versioned independently, so you can update your reference data without affecting in-progress matching configurations. A single dataset can be shared across multiple schemas and matching configurations."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, ensure your reference data is clean and deduplicated before uploading. Include all columns that you plan to match against \u2014 such as names, identifiers, dates, and amounts. Most teams refresh their reference data periodically by re-uploading from their source system or by using the SQL import option to pull directly from a connected database."
+      },
       {
         type: "callout",
         variant: "info",
@@ -4830,6 +5825,10 @@ var sections16 = [
         type: "paragraph",
         text: "Each field comparison carries a **weight** that determines how much it contributes to the overall confidence score. Set high weights on fields that are strong identifiers (like reference numbers or unique IDs) and lower weights on fields that are common or prone to variation (like names or descriptions). The weighted aggregate produces a final score between 0% and 100%."
       },
+      {
+        type: "paragraph",
+        text: "Most teams start with AI strategy generation and then fine-tune weights based on initial results. A common pattern is to set a high weight on a unique identifier field (like a PO number) with `exact` strategy, combined with lower-weighted `fuzzy` matches on name and description fields as supporting evidence. Review the first batch of results to calibrate thresholds before running at scale."
+      },
       {
         type: "callout",
         variant: "info",
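Reviewer note on the weighting paragraphs above: a worked sketch of how weighted field scores could combine into a single confidence value. The diff says only that a weighted aggregate produces a score between 0% and 100%. The normalized weighted mean used here, the field names, and the sample weights are assumptions for illustration.

```javascript
// Combines per-field scores (each in [0, 1]) into a percentage using a
// normalized weighted mean. This formula is an assumed aggregate, not
// one confirmed by the docs.
function weightedConfidence(comparisons) {
  const totalWeight = comparisons.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) return 0;
  const weighted = comparisons.reduce((sum, c) => sum + c.weight * c.score, 0);
  return (weighted / totalWeight) * 100;
}

// High weight on a unique identifier with an exact match, lower weight
// on a fuzzy name match as supporting evidence.
const confidence = weightedConfidence([
  { field: "po_number", strategy: "exact", weight: 3, score: 1 },
  { field: "vendor_name", strategy: "fuzzy", weight: 1, score: 0.5 },
]);
// (3 * 1 + 1 * 0.5) / 4 = 0.875, so confidence is 87.5
```

With these weights, a perfect identifier match keeps the score high even when the name only partially matches, which is the effect the paragraph describes.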
@@ -4884,6 +5883,14 @@ var sections16 = [
         type: "paragraph",
         text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based search with a Haiku LLM resolver attempts to improve low-confidence results."
       },
+      {
+        type: "paragraph",
+        text: "Matching runs are processed asynchronously via a dedicated job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining."
+      },
+      {
+        type: "paragraph",
+        text: "For best results, start with a manual run to establish a baseline, then use a smart run if many documents have low-confidence matches. Smart runs take longer because the AI resolver evaluates each ambiguous candidate, but they can significantly improve match quality for data with inconsistent formatting, abbreviations, or multilingual content."
+      },
       {
         type: "list",
         ordered: true,
@@ -4941,6 +5948,14 @@ var sections16 = [
         type: "paragraph",
         text: "The evidence view is designed to make match decisions transparent. For each candidate, you can see exactly which fields matched, what strategy was used, the individual field score, and the actual values that were compared. This makes it straightforward to verify correct matches and investigate false positives."
       },
+      {
+        type: "paragraph",
+        text: "Approved matches flow downstream into delivery pipelines, where they can be included in structured exports alongside extraction data. Rejected matches are excluded from future consideration for that document, which helps the system learn from your decisions when running subsequent matching passes."
+      },
+      {
+        type: "paragraph",
+        text: "When reviewing results, focus on documents where the top candidate has a confidence score between 50% and 85% \u2014 these are the borderline cases that benefit most from human judgment. High-confidence matches (above 85%) are usually correct, while very low scores (below 30%) typically indicate no valid match exists in the reference data."
+      },
       {
         type: "param-table",
         title: "Result fields",