@talonic/docs 0.20.10 → 0.20.11

Files changed (2)
  1. package/dist/content.js +1017 -2
  2. package/package.json +1 -1
package/dist/content.js CHANGED
@@ -542,6 +542,14 @@ var sections = [
  }
  ]
  },
+ {
+ type: "paragraph",
+ text: 'Understanding the relationship between these concepts is key to getting the most from the platform. When you upload documents, the extraction pipeline discovers every data point and feeds them into the **Field Registry**. The registry uses AI embeddings to cluster semantically similar fields \u2014 so "Vendor Name", "Supplier Name", and "Company Name" are recognized as the same concept. Over time, frequently occurring fields are promoted to higher tiers, and the platform synthesizes master extraction instructions that encode the best way to extract each field.'
+ },
+ {
+ type: "paragraph",
+ text: "The **Schema** layer sits on top of the registry and defines what output you need. You can use auto-generated schemas that the platform creates for each document type, or build custom template schemas by selecting specific fields from the registry. When a schema is applied to documents in a **Job**, the 4-phase pipeline fills every cell \u2014 starting with free graph lookups and falling back to AI agents for the remainder. The result is a structured grid where each row is a document and each column is a field."
+ },
  {
  type: "callout",
  variant: "info",
@@ -617,6 +625,14 @@ var sections = [
  type: "paragraph",
  text: "The pipeline is designed to be **progressive** \u2014 results appear as each phase completes rather than waiting for the entire job to finish. Phase 1 (graph resolve) fills ~30% of cells instantly and for free. Phase 2 (AI extraction) fills the remaining gaps. Phases 3 and 4 handle re-resolution and transformation. You can start reviewing early results while later phases are still running."
  },
+ {
+ type: "paragraph",
+ text: "Use the platform flow as a mental model when planning your workflow. For small, ad-hoc extractions you can go from upload to results in minutes \u2014 upload a few documents, pick an auto-generated schema, and run a job. For production workloads, invest time in the **Define schema** step: map fields to the registry, add reference tables for code lookups, and set format constraints. The upfront effort pays off because every subsequent job reuses the same schema and benefits from the growing knowledge graph."
+ },
+ {
+ type: "paragraph",
+ text: "After results are delivered, the feedback loop closes automatically. Corrections you make during the **Review** stage feed back into the Field Registry, improving future extractions. The platform tracks telemetry across runs \u2014 strategy distribution, capture hit rate, and resolve rate \u2014 so you can monitor how extraction quality improves over time as the knowledge graph accumulates more data."
+ },
  {
  type: "callout",
  variant: "info",
@@ -679,6 +695,14 @@ var sections = [
  title: "Sidebar Navigation",
  caption: "The sidebar provides access to all sections. Click the collapse button to save space. Press Cmd+K for global search."
  },
+ {
+ type: "paragraph",
+ text: "For teams processing documents at scale, the recommended approach is to start with a small representative sample. Upload 5-10 documents of the same type, let the platform extract and classify them, then review the auto-generated schema. This lets you validate the output structure before committing to a large batch. Once the schema looks right, you can upload hundreds or thousands of documents and the knowledge graph will handle an increasing share of cells through instant graph matches."
+ },
+ {
+ type: "paragraph",
+ text: "The platform includes powerful keyboard shortcuts for fast navigation. Press `Cmd+K` (or `Ctrl+K` on Windows) to open **Omnisearch**, which lets you find documents, schemas, jobs, and fields from anywhere. Press `Cmd+I` to open the **AI Agent** for natural language queries about your workspace. The sidebar can be collapsed to give more screen real estate when reviewing extraction results."
+ },
  {
  type: "callout",
  text: "The fastest path to results: upload documents in **Sources**, then go to **Structuring → Runs → New** to create your first extraction job."
@@ -735,6 +759,18 @@ var sections2 = [
  type: "paragraph",
  text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language."
  },
+ {
+ type: "paragraph",
+ text: "The agent is context-aware, meaning it automatically knows which page you are on and what data is visible. If you open the agent from a document detail page, it already has that document in scope and can answer questions about its extracted fields, processing status, or classification without you needing to specify which document you mean."
+ },
+ {
+ type: "paragraph",
+ text: "The agent classifies every user message as either a **question** (answered with information) or a **command** (triggers an action). Questions are handled instantly with read-only access, while commands go through the impact-level system to ensure safety. The agent streams its responses in real time, so you can see reasoning unfold as it queries your workspace data."
+ },
+ {
+ type: "paragraph",
+ text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page."
+ },
  { type: "heading", level: 3, id: "agent-capabilities", text: "What the Agent Can Do" },
  {
  type: "paragraph",
@@ -799,6 +835,14 @@ var sections2 = [
  {
  question: "Can the AI agent modify my data?",
  answer: "The agent operates workshop-first: schema changes create drafts, not live versions. Higher-impact operations require progressively more explicit confirmation."
+ },
+ {
+ question: "Is the AI agent context-aware?",
+ answer: "Yes. The agent automatically knows which page you are on and what data is visible. If you open it from a document detail page, it already has that document in scope and can answer questions about its fields, processing status, or classification."
+ },
+ {
+ question: "Can the AI agent access external systems or the internet?",
+ answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform."
  }
  ],
  mentions: [
@@ -846,6 +890,18 @@ var sections2 = [
  }
  ]
  },
+ {
+ type: "paragraph",
+ text: "The `read` impact level covers the vast majority of agent interactions. Searching documents, inspecting extraction results, browsing the field registry, and checking job status all execute instantly with no side effects. These read operations give you a fast way to explore your workspace without navigating through multiple pages."
+ },
+ {
+ type: "paragraph",
+ text: "The `draft_mutation` level is used when the agent creates or modifies schemas. Because all schema changes go through the workshop system, the agent can freely draft schemas without risk \u2014 nothing goes live until you explicitly review and publish. This makes the agent especially useful for rapid schema prototyping: describe the fields you need in plain language, and the agent creates a draft you can refine."
+ },
+ {
+ type: "paragraph",
+ text: 'The `live_mutation` and `irreversible` levels provide escalating safety gates for operations that affect production data. A `live_mutation` \u2014 such as triggering a job run or publishing a schema \u2014 presents a confirmation dialog that you must accept. An `irreversible` action \u2014 such as deleting a source or purging documents \u2014 requires you to type a confirmation keyword (e.g., "DELETE") to proceed, preventing accidental data loss.'
+ },
  {
  type: "callout",
  text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready."
@@ -863,6 +919,10 @@ var sections2 = [
  {
  question: "Does the AI agent make changes directly to live data?",
  answer: "No. The agent operates workshop-first. Schema changes create drafts, and live mutations require explicit user confirmation before executing."
+ },
+ {
+ question: "What happens when I ask the agent to delete something?",
+ answer: 'Deletion is classified as an irreversible action. The agent will ask you to type a confirmation keyword (e.g., "DELETE") before proceeding. This prevents accidental data loss from casual or ambiguous requests.'
  }
  ],
  mentions: ["impact levels", "draft mutation", "live mutation", "workshop-first"]
@@ -877,6 +937,18 @@ var sections2 = [
  {
  type: "paragraph",
  text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction. The agent input field lets you type any question directly from the dashboard."
+ },
+ {
+ type: "paragraph",
+ text: "The dashboard provides a workspace-level overview that helps you understand the health of your data pipeline at a glance. You can see document processing statistics, recent activity across sources, and the current state of your field registry. Key metrics like **capture rate**, **resolve rate**, and **synthesize rate** from the telemetry system are surfaced so you can spot trends without drilling into individual jobs."
+ },
+ {
+ type: "paragraph",
+ text: "Suggested prompts are dynamically generated based on what the platform detects in your workspace. If you have new document types that lack schemas, the dashboard suggests creating one. If a job run recently completed, it suggests reviewing the results. If field registry confirmations are pending, it prompts you to review them. This makes the dashboard a natural starting point for your workflow each session."
+ },
+ {
+ type: "paragraph",
+ text: "Every conversation with the agent is preserved in your session history, accessible from the dashboard. You can revisit previous questions and their answers, which is useful for auditing decisions or recalling how you configured a particular schema. The conversation history also provides continuity \u2014 if you asked the agent to analyze extraction quality last week, you can pick up where you left off."
  }
  ],
  related: [
@@ -891,6 +963,10 @@ var sections2 = [
  {
  question: "Do the suggested prompts change based on workspace state?",
  answer: "Yes. Prompts adapt dynamically based on active runs, schema creation opportunities, document types waiting for extraction, and other workspace activity."
+ },
+ {
+ question: "Can I revisit previous conversations with the agent?",
+ answer: "Yes. Every conversation is preserved in your session history, accessible from the dashboard. You can revisit previous questions, recall how you configured a schema, or pick up where you left off in a previous analysis."
  }
  ],
  mentions: ["dashboard", "suggested prompts", "workspace state", "agent input"]
@@ -923,6 +999,10 @@ var sections3 = [
  {
  type: "paragraph",
  text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. Processing runs asynchronously so you can continue working."
+ },
+ {
+ type: "paragraph",
+ text: "When uploading folders or ZIP archives, the original directory structure is preserved as a `source_file_path` metadata field on each document (e.g., `contracts/2026/lease.pdf`). This field is available for filtering, export, and schema mapping \u2014 just like any AI-extracted field. It provides a natural way to organize and trace documents back to their original location in your file system."
  }
  ],
  related: [
@@ -938,6 +1018,10 @@ var sections3 = [
  {
  question: "Does Talonic detect duplicate uploads?",
  answer: "Yes. Files are deduplicated via SHA-256 hashing. Uploading the same file twice will not create duplicates."
+ },
+ {
+ question: "What happens when I upload a folder or ZIP archive?",
+ answer: "ZIP archives are unpacked recursively and each file is processed individually. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering and export."
  }
  ],
  mentions: [
@@ -956,6 +1040,10 @@ var sections3 = [
  seoTitle: "Supported File Formats \u2014 Talonic Docs",
  description: "Talonic supports 25+ file types across four processing paths: text fast-path, AI vision, OCR, and recursive archive unpacking. From PDF to XLSX to images.",
  content: [
+ {
+ type: "paragraph",
+ text: "Talonic supports 25+ file types across four distinct processing paths. Each path is optimized for its file category \u2014 text files are read directly with zero latency, while complex document formats go through OCR to produce high-quality Markdown. The processing path is selected automatically based on the file extension."
+ },
  {
  type: "param-table",
  title: "File processing paths",
@@ -981,6 +1069,23 @@ var sections3 = [
  description: "ZIP \u2014 unpacked and each file processed individually."
  }
  ]
+ },
+ {
+ type: "paragraph",
+ text: "The **OCR path** uses Mistral Document AI as the primary engine, with a Talonic API fallback if the primary service is unavailable. OCR converts documents to structured Markdown, preserving tables, headings, and layout information. For PDF files that exceed the configured chunk size (default 25 pages), the system automatically splits the document into page chunks, processes them in parallel, and merges the results \u2014 so even large documents are handled efficiently."
+ },
+ {
+ type: "paragraph",
+ text: `Image files follow the **AI Vision** path, where they are sent directly to the AI model for multimodal extraction. This means the AI "sees" the image and extracts data visually \u2014 useful for photos of receipts, scanned handwritten notes, or diagrams. If an image was previously OCR'd and produced meaningful Markdown (more than 100 characters), the system uses the Markdown extraction path instead, which enables richer quality metrics.`
+ },
+ {
+ type: "paragraph",
+ text: "The **text fast-path** is the most efficient route: files like CSV, JSON, and plain text are read directly into memory with no external API call. This means they process almost instantly and incur no OCR cost. Email files (EML, MSG) are parsed to extract both the message body and any attachments, with each attachment processed as a separate document."
+ },
+ {
+ type: "callout",
+ variant: "info",
+ text: "The processing path is selected automatically based on the file extension \u2014 you do not need to configure anything. If a file type is not recognized, the platform will attempt OCR as a fallback before marking it as unsupported."
  }
  ],
  related: [
@@ -995,6 +1100,10 @@ var sections3 = [
  {
  question: "How does Talonic handle image files?",
  answer: "Image files (PNG, JPG, JPEG, GIF, WEBP) are sent to AI for multimodal visual extraction."
+ },
+ {
+ question: "How does Talonic handle large PDF files?",
+ answer: "PDF files that exceed the configured chunk size (default 25 pages) are automatically split into page chunks, processed in parallel, and merged. This ensures even large documents are handled efficiently without timeouts."
  }
  ],
  mentions: ["OCR", "AI vision", "text fast-path", "file formats", "PDF", "DOCX", "ZIP"]
@@ -1061,6 +1170,10 @@ var sections3 = [
  {
  question: "When is a document ready to use in jobs?",
  answer: "Documents are marked complete after AI extraction finishes. You can start using them in jobs immediately without waiting for further processing."
+ },
+ {
+ question: "What happens if OCR or extraction fails on a document?",
+ answer: "The platform automatically retries failed extractions (configurable, default 1 retry). If all retries fail, the document is marked as extraction_failed with a terminal status. OCR failures follow a separate retry path with fallback from Document AI to Talonic API to local parsers."
  }
  ],
  mentions: [
@@ -1087,6 +1200,19 @@ var sections3 = [
  {
  type: "paragraph",
  text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata. Unresolvable documents are assigned "Unclassified Document".'
+ },
+ {
+ type: "paragraph",
+ text: `Classification is verified in a two-step process. First, **Document AI OCR** produces an annotation with a free-text type label during the OCR pass. Then, a **type resolution** step verifies that label against the actual document content. If the label and content disagree \u2014 for example, a German *Arbeitsvertrag* incorrectly labelled as "Service Agreement" \u2014 the system trusts the content and resolves the correct canonical type. This ensures accurate classification regardless of the OCR engine's labelling bias.`
+ },
+ {
+ type: "paragraph",
+ text: "Document types drive several downstream features. The platform auto-generates a **schema** for each document type, pre-populated with fields discovered from documents of that type. **Routing rules** can be configured per document type to automatically assign schemas or trigger jobs when new documents arrive. The **Field Registry** tracks which fields appear in which document types, building a cross-type knowledge graph over time."
+ },
+ {
+ type: "callout",
+ variant: "info",
+ text: "You never need to create document types manually. The ontology is built into the platform and types are assigned automatically during classification. If you disagree with a classification, the AI agent can help you understand why a type was chosen and how the content signals were interpreted."
  }
  ],
  related: [
@@ -1102,6 +1228,10 @@ var sections3 = [
  {
  question: "Does document classification work in non-English languages?",
  answer: "Yes. The classifier works across all languages. For example, a German Arbeitsvertrag and an English Employment Contract map to the same canonical type."
+ },
+ {
+ question: "What happens if a document cannot be classified?",
+ answer: 'Unresolvable documents are assigned the "Unclassified Document" type. They can still be processed and extracted \u2014 the platform simply cannot map them to a specific canonical type in the 529-type ontology.'
  }
  ],
  mentions: [
@@ -1147,6 +1277,23 @@ var sections3 = [
  description: "View or download the source document."
  }
  ]
+ },
+ {
+ type: "paragraph",
+ text: "The **Raw Extraction** tab is the most detailed view, showing every field the AI discovered along with its confidence score and the source text that the value was extracted from. Each field displays a tier badge (Tier 1 green, Tier 2 amber, Tier 3 gray) indicating how well-established that field is across your document corpus. Synthetic metadata fields like `filename` and `source_file_path` appear here too, with full confidence (1.0)."
+ },
+ {
+ type: "paragraph",
+ text: "The **Resolved Data** tab shows how raw extracted fields map to your canonical field registry. Fields that matched automatically (similarity >= 0.80) display their canonical name and cluster. Fields in the confirm band (0.50-0.79) are flagged for review. This view helps you understand how the platform is normalizing field names across different document types and formats."
+ },
+ {
+ type: "paragraph",
+ text: "The **Processing Log** tab provides a stage-by-stage timeline of how the document was processed, including per-stage timing. You can see exactly how long OCR, classification, and extraction took, which is useful for diagnosing slow processing or understanding why a document was classified a particular way. The **Original File** tab lets you view or download the source file, so you can always compare the AI's extraction against the original document."
+ },
+ {
+ type: "callout",
+ variant: "info",
+ text: "You can open the **AI Agent** (`Cmd+I`) from any document detail page. The agent automatically has the current document in scope and can answer questions about its fields, classification, or processing status without you needing to specify which document you mean."
  }
  ],
  related: [
@@ -1162,6 +1309,10 @@ var sections3 = [
  {
  question: "How can I see the confidence score of an extracted field?",
  answer: "Open the document detail page and navigate to the Raw Extraction tab. Each field displays its confidence score alongside the extracted value and source text."
+ },
+ {
+ question: "What do the tier badges on fields mean?",
+ answer: "Tier badges indicate how well-established a field is across your document corpus. Tier 1 (green) are universal core fields, Tier 2 (amber) are established promoted fields, and Tier 3 (gray) are newly discovered emerging fields."
  }
  ],
  mentions: [
@@ -1182,6 +1333,23 @@ var sections3 = [
  {
  type: "paragraph",
  text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents → Routing**."
+ },
+ {
+ type: "paragraph",
+ text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides."
+ },
+ {
+ type: "paragraph",
+ text: 'Routing rules are especially useful for high-volume ingestion pipelines. If you connect a Google Drive folder that receives hundreds of invoices per week, a routing rule can automatically assign your "Invoice" schema and trigger extraction \u2014 turning what would be manual work into a fully automated pipeline. Combined with **delivery bindings**, this creates an end-to-end flow from document upload to structured output with zero manual intervention.'
+ },
+ {
+ type: "paragraph",
+ text: "You can review rule execution history from the routing page to see which rules fired, which documents they matched, and what actions were taken. This audit trail helps you verify that your routing configuration is working as expected and diagnose cases where documents were not routed correctly."
+ },
+ {
+ type: "callout",
+ variant: "info",
+ text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules."
  }
  ],
  related: [
@@ -1197,6 +1365,10 @@ var sections3 = [
  {
  question: "Where do I manage routing rules?",
  answer: "Navigate to Documents > Routing to create and manage routing rules for your workspace."
+ },
+ {
+ question: "Can routing rules fully automate my document processing pipeline?",
+ answer: "Yes. By combining routing rules with source connectors and delivery bindings, you can create a fully automated pipeline: documents arrive from a connected source, routing rules assign schemas and trigger extraction jobs, and delivery bindings push approved results to downstream systems."
  }
  ],
  mentions: ["routing rules", "auto-assign", "schema assignment", "document workflows"]
@@ -1272,6 +1444,14 @@ var sections3 = [
  type: "paragraph",
  text: "Google and Microsoft connectors share a single OAuth client each. OAuth tokens are encrypted at rest using `aes-256-gcm`. Each source card includes a **Batch Processing** toggle to defer extraction at 50% cost."
  },
+ {
+ type: "paragraph",
+ text: "OAuth-based connectors (Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion) use a consent-based flow where you authorize Talonic to access specific resources. For Microsoft connectors, Teams requires extended scopes that need tenant-admin consent. If a connector's OAuth credentials are revoked or expire, the source enters a disconnected state \u2014 reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents."
+ },
+ {
+ type: "paragraph",
+ text: "Credential-based connectors (SQL, Amazon S3, Azure Blob) authenticate with access keys or connection strings rather than OAuth. SQL connections support PostgreSQL, MySQL, and MSSQL, with a built-in read-only safety layer that prevents accidental writes. S3-compatible storage like MinIO and Cloudflare R2 also works through the S3 connector. All credentials are encrypted at rest before being stored."
+ },
  {
  type: "callout",
  text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled."
@@ -1290,6 +1470,10 @@ var sections3 = [
  {
  question: "How are OAuth tokens stored?",
  answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET)."
+ },
+ {
+ question: "What happens if a connector loses its credentials or authorization?",
+ answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration."
  }
  ],
  mentions: [
@@ -1331,6 +1515,18 @@ var sections4 = [
  id: "field-registry-table",
  title: "Field Registry \u2014 Registry Table",
  caption: "Fields are organized by tier with occurrence counts, data types, and master instruction status."
+ },
+ {
+ type: "paragraph",
+ text: "The registry grows automatically as documents are processed. During extraction, AI discovers fields from each document and resolves them against existing registry entries using **three-band matching** (exact name match, cluster member match, then semantic embedding similarity). New fields that don't match anything create a Tier 3 entry. Frequently occurring fields are promoted to higher tiers, so the registry naturally converges on a stable set of canonical fields over time."
+ },
+ {
+ type: "paragraph",
+ text: "Each registry entry tracks its **occurrence count** (how many documents contain this field), **data type** (string, number, date, etc.), **synonyms** (alternate names discovered across documents), and **master instruction** (an AI-synthesized extraction directive). The registry also maintains two embedding vectors per field: one for resolution matching and one for graph visualization, ensuring that each concern uses the most appropriate representation."
+ },
+ {
+ type: "paragraph",
+ text: "The registry is the foundation for several downstream features. **Jobs** use registry fields to pre-fill schema values via lookup cascades before resorting to LLM extraction. **Semantic clusters** group related registry fields together. **Generated schemas** are auto-built from registry fields that appear in a given document type. Understanding the registry is key to understanding how Talonic reduces extraction cost and improves accuracy over time."
  }
  ],
  related: [
@@ -1346,6 +1542,10 @@ var sections4 = [
  {
  question: "How does the Field Registry grow?",
  answer: "As documents are processed, AI discovers new fields and resolves them against existing registry entries. New fields create Tier 3 entries; frequently occurring fields are promoted to higher tiers."
+ },
+ {
+ question: "How does the Field Registry reduce extraction cost?",
+ answer: "The registry enables lookup-based resolution during job runs. When a field already exists in the registry with sufficient data, its value can be resolved via graph lookup instead of an AI call. Approximately 30% of cells are filled this way \u2014 instantly and at no cost."
  }
  ],
  mentions: [
@@ -1387,6 +1587,18 @@ var sections4 = [
1387
1587
  }
1388
1588
  ]
1389
1589
  },
1590
+ {
1591
+ type: "paragraph",
1592
+ text: "**Tier 1** fields are the most reliable and cost-efficient. During job runs, Tier 1 fields can often be resolved via lookup tables or registry transfer without any AI call, meaning they cost nothing to extract. These are fields like `invoice_number`, `date`, or `total_amount` that appear universally across document types and have well-established extraction patterns."
1593
+ },
1594
+ {
1595
+ type: "paragraph",
1596
+ text: "**Tier 2** fields are promoted from Tier 3 after meeting frequency thresholds \u2014 specifically, 5 occurrences or a 10% occurrence rate across your documents. Once promoted, these fields gain a synthesized master instruction and become candidates for lookup-based resolution. Promotion is evaluated automatically after every batch resolution run, so fields graduate without manual intervention as your document corpus grows."
1597
+ },
1598
+ {
1599
+ type: "paragraph",
1600
+ text: "**Tier 3** fields are newly discovered and may require a full Claude API call to extract during job runs, making them the most expensive tier. As more documents are processed and a Tier 3 field appears consistently, it is automatically promoted. You can also manually adjust a field's tier from the registry detail page if you know a field is stable enough to promote early."
1601
+ },
1390
1602
  {
1391
1603
  type: "callout",
1392
1604
  text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray."
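The Tier 3 to Tier 2 promotion rule described in this hunk (5 occurrences, or a 10% occurrence rate) can be sketched as a small predicate. This is only an illustration: the function name, parameter names, and the handling of an empty corpus are assumptions, not part of the package.

```javascript
// Hypothetical sketch of the promotion check. The two thresholds come from
// the Tier 2 paragraph; every identifier here is illustrative only.
const MIN_OCCURRENCES = 5;
const MIN_OCCURRENCE_RATE = 0.1;

function qualifiesForTier2(occurrences, totalDocuments) {
  // Assumption: with no documents there is nothing to promote.
  if (totalDocuments <= 0) return false;
  const rate = occurrences / totalDocuments;
  // Either condition is enough ("5 occurrences or a 10% occurrence rate").
  return occurrences >= MIN_OCCURRENCES || rate >= MIN_OCCURRENCE_RATE;
}
```

With these thresholds, a field seen twice in a 10-document corpus qualifies through the rate condition, while a field seen three times in 100 documents does not.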
@@ -1404,6 +1616,10 @@ var sections4 = [
1404
1616
  {
1405
1617
  question: "How are fields promoted between tiers?",
1406
1618
  answer: "Fields are promoted automatically based on frequency thresholds. As more documents are processed and a field appears consistently, it moves from Tier 3 to Tier 2 and eventually to Tier 1."
1619
+ },
1620
+ {
1621
+ question: "Can I manually change a field's tier?",
1622
+ answer: "Yes. You can manually adjust a field's tier from the registry detail page. This is useful when you know a field is stable enough to promote early, or when you want to demote a field that was promoted prematurely."
1407
1623
  }
1408
1624
  ],
1409
1625
  mentions: ["tier system", "Tier 1", "Tier 2", "Tier 3", "field promotion", "quality signal"]
@@ -1418,6 +1634,23 @@ var sections4 = [
1418
1634
  {
1419
1635
  type: "paragraph",
1420
1636
  text: 'Fields with similar meanings are automatically grouped using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together. You can manually merge or split clusters from the Field Map view.'
1637
+ },
1638
+ {
1639
+ type: "paragraph",
1640
+ text: "Clustering uses the same three-band similarity model as field resolution. Fields with similarity >= 0.80 are automatically grouped into the same cluster. Fields in the 0.50-0.79 range are flagged as potential cluster candidates for manual confirmation. Fields below 0.50 similarity are kept separate. This graduated approach prevents false merges while still surfacing useful grouping suggestions."
1641
+ },
1642
+ {
1643
+ type: "paragraph",
1644
+ text: 'From the **Field Map** view, you can manually **merge** two clusters when you know they represent the same concept (e.g., merging a "Ship To Address" cluster with a "Delivery Address" cluster). You can also **split** a field out of a cluster if it was incorrectly grouped. These manual adjustments are permanent and improve the resolution model for all future documents \u2014 the system learns from your corrections.'
1645
+ },
1646
+ {
1647
+ type: "paragraph",
1648
+ text: 'Semantic clusters serve a practical purpose beyond organization. When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has a field called "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call. This is one of the key mechanisms that reduces extraction cost as your registry matures.'
1649
+ },
1650
+ {
1651
+ type: "callout",
1652
+ variant: "info",
1653
+ text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs."
1421
1654
  }
1422
1655
  ],
1423
1656
  related: [
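The three similarity bands described in this hunk can be expressed as a simple classifier over an embedding similarity score. The sketch below assumes cosine similarity between embedding vectors; the band labels and function names are hypothetical, and only the 0.80 and 0.50 cut-offs come from the text.

```javascript
// Hypothetical sketch: cosine similarity between two embedding vectors,
// followed by the three-band decision. Only the thresholds are from the docs.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // assumption: zero vectors never match
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function clusterBand(similarity) {
  if (similarity >= 0.8) return "auto";    // grouped automatically
  if (similarity >= 0.5) return "confirm"; // surfaced for manual confirmation
  return "separate";                       // kept as distinct fields
}
```

Keeping the band decision separate from the similarity computation makes the thresholds easy to inspect on their own, for example `clusterBand(0.79)` falls in the confirm band.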
@@ -1433,6 +1666,10 @@ var sections4 = [
1433
1666
  {
1434
1667
  question: "Can I manually adjust semantic clusters?",
1435
1668
  answer: "Yes. You can manually merge or split clusters from the Field Map view in the Field Registry."
1669
+ },
1670
+ {
1671
+ question: "How do semantic clusters reduce extraction cost?",
1672
+ answer: 'When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call.'
1436
1673
  }
1437
1674
  ],
1438
1675
  mentions: [
@@ -1478,6 +1715,10 @@ var sections4 = [
1478
1715
  type: "paragraph",
1479
1716
  text: "Resolution runs concurrently across documents. Each document's fields are resolved in an isolated transaction to prevent lock contention. Occurrence rates are updated after each transaction commits, keeping the registry eventually consistent without blocking concurrent ingestion."
1480
1717
  },
1718
+ {
1719
+ type: "paragraph",
1720
+ text: "After resolution completes, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This chain ensures that newly promoted fields immediately appear in auto-generated schemas. The resolution process also feeds into the **job pipeline** \u2014 during Phase 1 of a job run, the system uses a 3-tier lookup cascade (string normalization, token fuzzy matching, then AI fallback) to fill 60-80% of cells without a full LLM call, dramatically reducing cost."
1721
+ },
1481
1722
  {
1482
1723
  type: "callout",
1483
1724
  text: "Pending confirmations from the confirm band appear in **Resolution → Pending Confirmations**. Accept to merge into an existing cluster, or reject to create a new field."
@@ -1496,6 +1737,10 @@ var sections4 = [
1496
1737
  {
1497
1738
  question: "Where can I review pending field confirmations?",
1498
1739
  answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge into an existing cluster, or reject to create a new field."
1740
+ },
1741
+ {
1742
+ question: "What happens after resolution completes?",
1743
+ answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas."
1499
1744
  }
1500
1745
  ],
1501
1746
  mentions: [
@@ -1517,6 +1762,18 @@ var sections4 = [
1517
1762
  type: "paragraph",
1518
1763
  text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs."
1519
1764
  },
1765
+ {
1766
+ type: "paragraph",
1767
+ text: 'Master instructions are synthesized by analyzing the extraction patterns across all documents where a field appears. The AI examines how the field was successfully extracted \u2014 including the source text, confidence scores, and document context \u2014 and distills a concise directive that captures the best extraction approach. For example, a master instruction for "invoice_date" might specify: "Look for the date near the invoice number, typically in the header area. Prefer the issue date over due date. Format as ISO 8601."'
1768
+ },
1769
+ {
1770
+ type: "paragraph",
1771
+ text: "Master instructions fire automatically during **Phase 2** of job runs, when the AI agent extracts values for fields that could not be resolved via lookup. The instruction is injected into the AI prompt alongside the document content, giving the model specific guidance for that field. This is why master instructions improve accuracy: they encode domain-specific knowledge that the base model would otherwise lack."
1772
+ },
1773
+ {
1774
+ type: "paragraph",
1775
+ text: `You can view and edit master instructions from the field detail page in the registry. Editing an instruction overrides the AI-synthesized version, which is useful when you have domain expertise the AI hasn't captured. The **"Synthesize All"** button in the Field Registry triggers the full pipeline \u2014 embedding, resolution, and synthesis \u2014 for all qualifying fields in a single operation.`
1776
+ },
1520
1777
  {
1521
1778
  type: "callout",
1522
1779
  text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed → resolve → synthesize.'
@@ -1535,6 +1792,10 @@ var sections4 = [
1535
1792
  {
1536
1793
  question: "How do I generate master instructions?",
1537
1794
  answer: 'Click "Synthesize All" in the Field Registry. This runs the combined pipeline: embed, resolve, and synthesize instructions for all qualifying fields.'
1795
+ },
1796
+ {
1797
+ question: "Can I manually edit a master instruction?",
1798
+ answer: "Yes. You can view and edit master instructions from the field detail page in the registry. Editing overrides the AI-synthesized version, which is useful when you have domain expertise the AI has not captured."
1538
1799
  }
1539
1800
  ],
1540
1801
  mentions: [
@@ -1562,6 +1823,18 @@ var sections5 = [
1562
1823
  {
1563
1824
  type: "paragraph",
1564
1825
  text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed."
1826
+ },
1827
+ {
1828
+ type: "paragraph",
1829
+ text: "Behind the scenes, the generation engine scans the **Field Registry** for every field that has been promoted to Tier 1 (core) or Tier 2 (established) within a given document type. It assembles these fields into a schema definition, assigns data types based on observed extraction patterns, and attaches the AI-synthesized **master instruction** for each field. The entire process is automatic \u2014 no manual curation is required."
1830
+ },
1831
+ {
1832
+ type: "paragraph",
1833
+ text: "Generated schemas are most useful as a starting point for understanding what Talonic has discovered about your documents. Review the generated schema for a document type to see which fields the system has identified, then use that knowledge to build a **User Template** containing only the fields you actually need. You can also use the diff view to monitor how your field landscape evolves over time as new documents are processed and new fields are promoted."
1834
+ },
1835
+ {
1836
+ type: "callout",
1837
+ text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry."
1565
1838
  }
1566
1839
  ],
1567
1840
  related: [
@@ -1577,6 +1850,10 @@ var sections5 = [
1577
1850
  {
1578
1851
  question: "How are generated schemas updated?",
1579
1852
  answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed."
1853
+ },
1854
+ {
1855
+ question: "Can I run an extraction job using a generated schema?",
1856
+ answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version."
1580
1857
  }
1581
1858
  ],
1582
1859
  mentions: ["generated schemas", "AI-generated", "versioning", "schema diff"]
@@ -1606,6 +1883,18 @@ var sections5 = [
1606
1883
  {
1607
1884
  type: "paragraph",
1608
1885
  text: "Most teams start by importing an existing spreadsheet or CSV as a template baseline, then refine field types and add extraction instructions. Once you publish a version, it becomes immutable and available for job execution \u2014 any further changes happen in a new **Workshop** draft, keeping your production schema stable while you iterate."
1886
+ },
1887
+ {
1888
+ type: "paragraph",
1889
+ text: "When adding fields, take advantage of the automatic registry matching system. Fields with names that match existing registry entries are linked instantly, inheriting the AI-synthesized extraction instruction. For fields that do not match, write a clear **manual instruction** describing exactly what the AI should extract from the document. Well-written instructions are the single biggest lever for extraction accuracy."
1890
+ },
1891
+ {
1892
+ type: "paragraph",
1893
+ text: "For best results, keep templates focused on a single document type or closely related group of types. A template with 10-20 well-defined fields will produce higher accuracy than one with 50+ fields spanning unrelated domains. If you need different field sets for different document types, create separate templates and run targeted jobs for each."
1894
+ },
1895
+ {
1896
+ type: "callout",
1897
+ text: "You can import templates from Excel, CSV, or JSON files using the **Import from file** option. Column headers become field names, and data types are inferred automatically. This is the fastest way to bootstrap a template from an existing spreadsheet."
1609
1898
  }
1610
1899
  ],
1611
1900
  related: [
@@ -1621,6 +1910,10 @@ var sections5 = [
1621
1910
  {
1622
1911
  question: "What is the difference between generated schemas and user templates?",
1623
1912
  answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields. User templates are custom-defined output structures where you choose exactly which fields to include and how to map them."
1913
+ },
1914
+ {
1915
+ question: "Can I update a published template?",
1916
+ answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing."
1624
1917
  }
1625
1918
  ],
1626
1919
  mentions: ["user templates", "schema creation", "field mapping", "reference tables", "publish"]
@@ -1686,6 +1979,14 @@ var sections5 = [
1686
1979
  type: "paragraph",
1687
1980
  text: "When configuring a field, start with the basics \u2014 name, type, and registry mapping \u2014 then layer on advanced features as needed. For example, add a **format constraint** to enforce a date pattern, attach a **reference table** for code lookups, or define **capture submoves** to control the exact extraction sequence. Features compose independently, so you can mix and match without conflicts."
1688
1981
  },
1982
+ {
1983
+ type: "paragraph",
1984
+ text: "The **modifier pipeline** runs in a fixed order during Phase 4 of the extraction pipeline: format transforms first (converting dates or numbers to your target format), then alias mapping (replacing values using a lookup), and finally max_length truncation. Constraint evaluation happens after all modifiers have been applied, so constraints validate the final transformed value, not the raw extraction."
1985
+ },
1986
+ {
1987
+ type: "paragraph",
1988
+ text: 'For best results, use **manual instructions** sparingly and only for fields that the registry cannot match. A well-written instruction should describe the field in plain language, specify where in the document to look, and note any formatting expectations. Avoid vague instructions like "extract the value" \u2014 instead, write something like "Extract the net payment amount from the invoice summary section, excluding VAT."'
1989
+ },
1689
1990
  {
1690
1991
  type: "callout",
1691
1992
  text: "For the complete JSON Schema specification with all features, see the [Full Schema Reference](/docs/platform/schema-features) in the Platform Guide."
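The fixed modifier order described in this hunk (format, then alias, then max_length, with constraints evaluated last) can be sketched as follows. The field shape (`format`, `alias`, `maxLength`, `constraint`) and the helper `toIsoDate` are assumptions made for illustration and do not reflect the platform's actual schema keys.

```javascript
// Hypothetical sketch of the Phase 4 modifier order. Property names are
// illustrative; only the ordering is taken from the documentation text.
function toIsoDate(value) {
  // Converts dd/mm/yyyy to yyyy-mm-dd; other inputs pass through unchanged.
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(value);
  return m ? `${m[3]}-${m[2]}-${m[1]}` : value;
}

function applyModifiers(rawValue, field) {
  let value = String(rawValue);
  // 1. Format transform
  if (typeof field.format === "function") value = field.format(value);
  // 2. Alias mapping (exact-match lookup)
  if (field.alias && Object.prototype.hasOwnProperty.call(field.alias, value)) {
    value = field.alias[value];
  }
  // 3. max_length truncation
  if (Number.isInteger(field.maxLength)) value = value.slice(0, field.maxLength);
  // Constraints see the final transformed value, not the raw extraction.
  const valid = field.constraint ? field.constraint.test(value) : true;
  return { value, valid };
}
```

For example, a raw value of `31/12/2024` with `format: toIsoDate` and an ISO date constraint passes validation, because the constraint is checked against `2024-12-31` rather than the raw string.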
@@ -1704,6 +2005,10 @@ var sections5 = [
1704
2005
  {
1705
2006
  question: "Can I override AI extraction instructions with my own?",
1706
2007
  answer: "Yes. Use the Manual instruction feature on a schema field. User-written instructions override the AI-synthesized master instruction from the field registry."
2008
+ },
2009
+ {
2010
+ question: "In what order are modifiers applied to extracted values?",
2011
+ answer: "Modifiers run in a fixed order: format (date/number conversion) first, then alias (value mapping), then max_length (truncation). Constraints are evaluated after all modifiers complete."
1707
2012
  }
1708
2013
  ],
1709
2014
  mentions: [
@@ -1752,6 +2057,22 @@ var sections5 = [
1752
2057
  {
1753
2058
  type: "paragraph",
1754
2059
  text: "When you add a field to a template, the system automatically attempts to match it against the **Field Registry**. Exact name matches are applied instantly, while semantic and composite matches appear as suggestions for your confirmation. If no match is found, the field is marked **Unmapped** and you should provide a manual extraction instruction so the AI knows how to extract that value from your documents."
2060
+ },
2061
+ {
2062
+ type: "paragraph",
2063
+ text: "The matching engine uses a three-band resolution process under the hood. First, it checks for an exact name match against canonical registry field names and their synonyms. If no exact match is found, it computes embedding similarity between your field name and every registry field, surfacing semantic matches with a similarity of 0.50 or higher. Matches at 0.80 or above are auto-accepted; those from 0.50 up to (but not including) 0.80 require your confirmation."
2064
+ },
2065
+ {
2066
+ type: "paragraph",
2067
+ text: "Matched fields inherit the registry's AI-synthesized **master instruction**, which tells the extraction pipeline exactly how to locate and extract that value from documents. This is why matching matters \u2014 a well-matched field leverages all the intelligence the system has built up from processing your document corpus. Unmapped fields rely solely on your manual instruction, so they may need a few correction cycles before reaching the same accuracy."
2068
+ },
2069
+ {
2070
+ type: "paragraph",
2071
+ text: "You can trigger a **Rematch** on all fields at any time from the template editor. This is useful after the registry has grown \u2014 fields that were previously unmapped may now find matches as new extractions contribute to the registry. For best results, use descriptive field names that reflect the actual data (e.g., `contract_start_date` rather than `field_1`)."
2072
+ },
2073
+ {
2074
+ type: "callout",
2075
+ text: "Field matching is read-only against the registry \u2014 it never creates new registry entries. If no match exists, the field stays unmapped until you provide a manual instruction or new documents introduce the field into the registry."
1755
2076
  }
1756
2077
  ],
1757
2078
  related: [
@@ -1767,6 +2088,10 @@ var sections5 = [
1767
2088
  {
1768
2089
  question: "What happens when a field is unmapped?",
1769
2090
  answer: "Unmapped fields have no registry match. They require manual extraction instructions to guide the AI on how to extract the value from documents."
2091
+ },
2092
+ {
2093
+ question: "Can I re-run field matching after adding more documents?",
2094
+ answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows."
1770
2095
  }
1771
2096
  ],
1772
2097
  mentions: ["field matching", "exact match", "semantic match", "composite", "unmapped"]
@@ -1807,6 +2132,14 @@ var sections5 = [
1807
2132
  type: "paragraph",
1808
2133
  text: "To set up a reference table, upload a CSV or manually enter key-value pairs where the **key** is the code you want in your output and the **value** is the human-readable label found in documents. During extraction, the system tries each tier in order \u2014 most values resolve instantly at Tier 1, so keeping your labels clean and consistent dramatically improves both speed and accuracy."
1809
2134
  },
2135
+ {
2136
+ type: "paragraph",
2137
+ text: "Reference tables are used in two pipeline stages. In **Phase 1**, the lookup cascade runs as part of the resolve step, mapping extracted labels to codes using only the cascade's first two tiers (Tier 1 exact normalization and Tier 2 fuzzy matching), which require no AI calls. In **Phase 3**, the cascade runs again on values produced by Phase 2's AI extraction, normalizing free-text AI output to your canonical codes. This two-pass approach maximizes code coverage across the entire pipeline."
2138
+ },
2139
+ {
2140
+ type: "paragraph",
2141
+ text: 'For best results, include common variations and abbreviations as separate value entries all pointing to the same key. For example, if your code is `US`, add values for "United States", "USA", "U.S.A.", and "United States of America". The more variations you cover, the more values resolve at Tier 1 (highest confidence) without falling through to fuzzy or AI matching.'
2142
+ },
1810
2143
  {
1811
2144
  type: "callout",
1812
2145
  text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run."
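The reference table cascade discussed in this hunk can be sketched as below. The tier order (normalization, fuzzy matching, AI fallback) and the 0.95 and ~0.70 confidence values appear elsewhere in these docs; the normalization rule, the Jaccard token overlap, the 0.5 fuzzy threshold, and all identifiers are assumptions chosen for illustration. The AI fallback is deliberately left unimplemented.

```javascript
// Hypothetical sketch of a lookup cascade over { key, value } rows, where
// key is the output code and value is the label found in documents.
function normalize(s) {
  return s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function lookupCode(label, table, fuzzyThreshold = 0.5) {
  const target = normalize(label);

  // Tier 1: exact match after string normalization.
  for (const { key, value } of table) {
    if (normalize(value) === target) return { code: key, tier: 1, confidence: 0.95 };
  }

  // Tier 2: token overlap (Jaccard index; an assumed stand-in for the
  // platform's token fuzzy matching).
  const targetTokens = new Set(target.split(" ").filter(Boolean));
  let best = null;
  for (const { key, value } of table) {
    const tokens = new Set(normalize(value).split(" ").filter(Boolean));
    const shared = [...tokens].filter((t) => targetTokens.has(t)).length;
    const union = new Set([...tokens, ...targetTokens]).size;
    const score = union === 0 ? 0 : shared / union;
    if (score >= fuzzyThreshold && (best === null || score > best.score)) {
      best = { key, score };
    }
  }
  if (best) return { code: best.key, tier: 2, confidence: 0.7 };

  // Tier 3: would hand off to an AI fallback; not modeled in this sketch.
  return { code: null, tier: 3, confidence: null };
}
```

In this sketch an input such as `USA` does not match a stored label of `U.S.A.` at either tier, which illustrates why listing common variations as separate rows raises the Tier 1 hit rate.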
@@ -1825,6 +2158,10 @@ var sections5 = [
1825
2158
  {
1826
2159
  question: "How accurate are reference table lookups?",
1827
2160
  answer: "A properly loaded reference table produces 90-100% accurate results within a single run. The cascade provides confidence scores: 0.95 for exact normalization, ~0.70 for fuzzy, and 0.50 for AI fallback."
2161
+ },
2162
+ {
2163
+ question: "How should I format my reference table CSV?",
2164
+ answer: "Use two columns: the first column is the key (output code) and the second is the value (human-readable label). Include common variations and abbreviations as separate rows pointing to the same key for maximum Tier 1 hit rate."
1828
2165
  }
1829
2166
  ],
1830
2167
  mentions: [
@@ -1849,6 +2186,18 @@ var sections5 = [
1849
2186
  {
1850
2187
  type: "paragraph",
1851
2188
  text: "Start by editing fields in the **Workshop** draft, then use **Test Extraction** to compare draft results against the live version before publishing. The **Version History** timeline lets you review diff summaries between any two versions, making it easy to trace when a field was added, renamed, or removed and understand the impact on downstream jobs."
2189
+ },
2190
+ {
2191
+ type: "paragraph",
2192
+ text: "The versioning system is append-only \u2014 every time you publish a draft, it creates a new immutable version and the previous version is preserved in the timeline. This means you can always go back and review the exact schema that was used for any historical job. The diff view highlights added fields, removed fields, type changes, and updated instructions, giving you a clear picture of how your schema evolved."
2193
+ },
2194
+ {
2195
+ type: "paragraph",
2196
+ text: "Use the workshop system to iterate safely on your schema without disrupting production jobs. A common workflow is to add a new field in the Workshop, run a **Test Extraction** on a few documents to verify it produces correct values, then publish when satisfied. If a downstream integration depends on a specific field, the breaking change detection will warn you before you accidentally remove or rename it."
2197
+ },
2198
+ {
2199
+ type: "callout",
2200
+ text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing."
1852
2201
  }
1853
2202
  ],
1854
2203
  related: [
@@ -1864,6 +2213,10 @@ var sections5 = [
1864
2213
  {
1865
2214
  question: "What are breaking changes in a schema?",
1866
2215
  answer: "Breaking changes include field removals and type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts."
2216
+ },
2217
+ {
2218
+ question: "Can I revert to a previous schema version?",
2219
+ answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed."
1867
2220
  }
1868
2221
  ],
1869
2222
  mentions: ["versioning", "drafts", "workshop", "live version", "breaking changes"]
@@ -1882,6 +2235,18 @@ var sections5 = [
1882
2235
  {
1883
2236
  type: "paragraph",
1884
2237
  text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
2238
+ },
2239
+ {
2240
+ type: "paragraph",
2241
+ text: "Test extractions apply the same schema features as production jobs, including reference table lookups, format constraints, and modifiers, so the results closely preview what a full job would produce. Under the hood, the test uses a simplified single-call extraction mode rather than the full 4-phase pipeline, which makes it faster and avoids the cost of a full pipeline run."
2242
+ },
2243
+ {
2244
+ type: "paragraph",
2245
+ text: 'For best results, select 3-5 representative documents that cover the variety in your corpus \u2014 include at least one "clean" document and one with unusual formatting or missing fields. This gives you confidence that your schema handles both typical and edge-case documents correctly. Run the test after every significant change to a field instruction, reference table, or format constraint.'
2246
+ },
2247
+ {
2248
+ type: "callout",
2249
+ text: "Test extractions do not affect your live data or consume production job credits differently. They are designed for rapid iteration \u2014 run as many tests as you need before publishing."
1885
2250
  }
1886
2251
  ],
1887
2252
  related: [
@@ -1897,6 +2262,10 @@ var sections5 = [
1897
2262
  {
1898
2263
  question: "Do I need to publish a draft before testing it?",
1899
2264
  answer: "No. Test extraction runs against the unpublished draft, comparing its output to the current live version so you can verify changes before publishing."
2265
+ },
2266
+ {
2267
+ question: "How many documents should I use for a test extraction?",
2268
+ answer: "Select 3-5 representative documents that cover the variety in your corpus. Include documents with different layouts, data completeness levels, and edge cases to get a reliable preview of how your schema changes perform."
1900
2269
  }
1901
2270
  ],
1902
2271
  mentions: ["test extraction", "draft comparison", "side-by-side", "preview"]
@@ -1951,6 +2320,18 @@ var sections5 = [
1951
2320
  {
1952
2321
  type: "paragraph",
1953
2322
  text: "When working with international data, configure the dialect to match your downstream system requirements. For example, set **number_locale** to `fr-FR` for European comma-decimal formatting, switch the **delimiter** to semicolon for CSV compatibility, and choose **UTF-8-BOM** encoding if your data will be opened in Excel. Creating a shared dialect and reusing it across schemas ensures consistent formatting across all your exports."
2323
+ },
2324
+ {
2325
+ type: "paragraph",
2326
+ text: "Dialect settings are applied during Phase 4 of the extraction pipeline and during CSV/XLSX export. The dialect does not affect how values are stored internally \u2014 it only controls the serialization format when data leaves the platform. This means you can change a dialect at any time without re-running extractions; the new format applies to all future exports and deliveries."
2327
+ },
2328
+ {
2329
+ type: "paragraph",
2330
+ text: 'For best results, create a shared dialect for each downstream system or regional office you deliver to, and name it descriptively (e.g., "SAP Europe" or "US Accounting"). Avoid defining dialects inline on individual schemas unless you have a one-off formatting requirement. Shared dialects reduce maintenance burden and ensure consistency when you add new schemas later.'
2331
+ },
2332
+ {
2333
+ type: "callout",
2334
+ text: "If your CSV files show garbled special characters (accents, umlauts, CJK text), switch the encoding to **UTF-8-BOM**. The BOM (byte order mark) tells Excel to interpret the file as UTF-8 instead of the system default encoding."
1954
2335
  }
1955
2336
  ],
1956
2337
  related: [
@@ -1966,6 +2347,10 @@ var sections5 = [
1966
2347
  {
1967
2348
  question: "Can I share a dialect across multiple schemas?",
1968
2349
  answer: "Yes. A dialect can be shared across schemas or defined inline for a specific schema. Configure them in the Schema > Delivery tab."
2350
+ },
2351
+ {
2352
+ question: "Do I need to re-run extractions when I change a dialect?",
2353
+ answer: "No. Dialects only affect output serialization (exports and deliveries), not how values are stored internally. Changing a dialect takes effect immediately on future exports without re-processing."
1969
2354
  }
1970
2355
  ],
1971
2356
  mentions: [
@@ -2018,6 +2403,14 @@ var sections5 = [
2018
2403
  type: "paragraph",
2019
2404
  text: 'Use bypass strategies for fields whose values are known ahead of time or can be derived without reading the document. For example, set a **constant** of `"USD"` for a currency field that is always the same, or use a **generator** to produce a deterministic ID for each row. Fields with bypass strategies skip the AI extraction phase entirely, reducing processing time and credit usage.'
2020
2405
  },
2406
+ {
2407
+ type: "paragraph",
2408
+ text: "The **reference** bypass strategy is particularly powerful for enrichment fields. Define a `key_expression` that references another field in the schema (e.g., the supplier name), and the system will automatically look up the corresponding code from your reference table without any AI involvement. This is ideal for mapping extracted entity names to internal system identifiers, ERP codes, or classification labels."
2409
+ },
2410
+ {
2411
+ type: "paragraph",
2412
+ text: "For best results, audit your schema for fields that never vary across documents \u2014 these are prime candidates for the **constant** strategy. Fields like currency, data source, or processing batch can be set once and never require AI extraction. This reduces per-document processing cost and improves job completion time, especially on large runs with hundreds of documents."
2413
+ },
2021
2414
  {
2022
2415
  type: "callout",
2023
2416
  text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net. Strategy values are normalized via generator mappings in Phase 4 of the pipeline."
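The bypass strategies described in this hunk (constant, reference, generator) can be sketched as a resolver that runs before any AI extraction. The strategy object shape, the `table` property, and the treatment of `key_expression` as a plain field name are assumptions; the docs only state that a failed generator falls through to LLM extraction, so the behavior on a reference miss is also an assumption.

```javascript
// Hypothetical sketch of bypass resolution. `resolved: false` means the
// field would continue to normal (AI) extraction.
function resolveBypass(field, row, referenceTables) {
  const strategy = field.strategy;
  if (!strategy) return { resolved: false };

  switch (strategy.type) {
    case "constant":
      // Same value for every document, e.g. "USD".
      return { resolved: true, value: strategy.value };

    case "reference": {
      // Assumption: key_expression names another field in the same row,
      // and the table maps that field's value to a code.
      const lookupKey = row[strategy.key_expression];
      const table = referenceTables[strategy.table] || {};
      if (Object.prototype.hasOwnProperty.call(table, lookupKey)) {
        return { resolved: true, value: table[lookupKey] };
      }
      return { resolved: false }; // miss behavior is not specified in the docs
    }

    case "generator": {
      const value = strategy.generate(row);
      // A generator that yields nothing falls through to LLM extraction.
      return value == null ? { resolved: false } : { resolved: true, value };
    }

    default:
      return { resolved: false };
  }
}
```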
@@ -2036,6 +2429,10 @@ var sections5 = [
2036
2429
  {
2037
2430
  question: "What happens when a generator bypass fails?",
2038
2431
  answer: "When a generator strategy fails to produce a value, the field falls through to LLM extraction as a safety net, ensuring the cell is still filled."
2432
+ },
2433
+ {
2434
+ question: "Do bypass strategies reduce extraction costs?",
2435
+ answer: "Yes. Fields with bypass strategies skip the AI extraction phase entirely, which reduces both processing time and credit usage. Use constant or reference strategies for fields that do not require document reading."
2039
2436
  }
2040
2437
  ],
2041
2438
  mentions: [
@@ -2081,6 +2478,18 @@ var sections5 = [
2081
2478
  {
2082
2479
  type: "paragraph",
2083
2480
  text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax. The editor provides a live test input so you can verify the pattern before saving."
2481
+ },
2482
+ {
2483
+ type: "paragraph",
2484
+ text: "Format constraints are especially useful for fields with strict formatting requirements in downstream systems. For example, a purchase order number that must follow the pattern `PO-\\d{6}` or a date that must match `\\d{4}-\\d{2}-\\d{2}`. By catching format violations at extraction time, you avoid importing malformed data into your ERP, accounting, or analytics systems."
2485
+ },
2486
+ {
2487
+ type: "paragraph",
2488
+ text: 'Choose the mismatch behavior based on your data quality requirements. Use **empty** (the default) when you prefer no data over bad data \u2014 the downstream system will see a blank cell. Use **flag** when you want to review mismatches manually before deciding \u2014 flagged cells appear with an amber dot in the results grid. Use **constant** when your downstream system needs a specific sentinel value like `"N/A"` or `"INVALID"` to trigger its own error handling.'
2489
+ },
2490
+ {
2491
+ type: "callout",
2492
+ text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching."
2084
2493
  }
2085
2494
  ],
2086
2495
  related: [
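Reviewer sketch for the format-constraint text in this hunk: a minimal illustration of how a pattern and its mismatch behavior (empty, flag, constant) might be evaluated. The names `compilePattern`, `applyFormatConstraint`, and `MAX_INPUT_LENGTH` are hypothetical. Translating a leading `(?i)` into the JavaScript `i` flag, anchoring the pattern to the whole value, and truncating input at the cap are all assumptions, not the platform's confirmed behavior. ReDoS checks such as nested-quantifier rejection are left out.

```javascript
// Hypothetical sketch, not the platform's actual evaluator.
const MAX_INPUT_LENGTH = 1000; // input cap mentioned in the callout

function compilePattern(pattern) {
  // Assumption: a leading (?i) is translated to the JavaScript "i" flag.
  const caseInsensitive = pattern.startsWith("(?i)");
  const source = caseInsensitive ? pattern.slice(4) : pattern;
  // Assumption: the whole value must match, so the pattern is anchored.
  return new RegExp(`^(?:${source})$`, caseInsensitive ? "i" : "");
}

function applyFormatConstraint(value, pattern, behavior = "empty", constant = "N/A") {
  // Assumption: over-long input is truncated to the cap before matching.
  const input = String(value).slice(0, MAX_INPUT_LENGTH);
  if (compilePattern(pattern).test(input)) {
    return { value, flagged: false };
  }
  switch (behavior) {
    case "flag": // keep the value, mark the cell for review
      return { value, flagged: true };
    case "constant": // replace with a sentinel value
      return { value: constant, flagged: false };
    default: // "empty": clear the cell
      return { value: "", flagged: false };
  }
}
```

For instance, `applyFormatConstraint("PO-12", "PO-\\d{6}", "constant", "INVALID")` would return the sentinel, while the same call with the default behavior would clear the cell.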
@@ -2096,6 +2505,10 @@ var sections5 = [
2096
2505
  {
2097
2506
  question: "Are original values preserved when format constraints clear a cell?",
2098
2507
  answer: "Yes. Original values are always preserved for audit in the original_extractions table, regardless of the mismatch behavior applied."
2508
+ },
2509
+ {
2510
+ question: "Can I use case-insensitive regex patterns?",
2511
+ answer: "Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax, extended with inline flags."
2099
2512
  }
2100
2513
  ],
2101
2514
  mentions: [
@@ -2124,6 +2537,18 @@ var sections6 = [
2124
2537
  {
2125
2538
  type: "paragraph",
2126
2539
  text: "Navigate to **Structuring → Runs → New**. Select your template and documents, then click Start. Results appear progressively as each phase completes."
2540
+ },
2541
+ {
2542
+ type: "paragraph",
2543
+ text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents the silent data loss that would occur if registry-based lookups returned empty results for unresolved documents."
2544
+ },
2545
+ {
2546
+ type: "paragraph",
2547
+ text: "For best results, select documents of the same type or closely related types for a single job. The schema you choose should match the document content \u2014 using an invoice schema on contract documents will produce poor results. Start with a small batch of 5-10 documents to validate your schema, review the output, apply corrections, and then scale up to larger runs once you are confident in the extraction quality."
2548
+ },
2549
+ {
2550
+ type: "callout",
2551
+ text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running."
2127
2552
  }
2128
2553
  ],
2129
2554
  related: [
@@ -2139,6 +2564,10 @@ var sections6 = [
2139
2564
  {
2140
2565
  question: "What does an extraction job produce?",
2141
2566
  answer: "A job produces a structured grid where rows represent documents and columns represent schema fields. Each cell contains an extracted value with confidence and provenance metadata."
2567
+ },
2568
+ {
2569
+ question: "How many documents can I include in a single job?",
2570
+ answer: "Phase 2 supports up to 2,000 documents per job, and Phase 4 supports up to 1,000. For best results, start with smaller batches to validate your schema before scaling up."
2142
2571
  }
2143
2572
  ],
2144
2573
  mentions: ["extraction job", "structured grid", "progressive results", "template selection"]
@@ -2158,11 +2587,23 @@ var sections6 = [
2158
2587
  type: "paragraph",
2159
2588
  text: "Each phase builds on the previous one, progressively filling the output grid. **Phase 1** resolves ~30% of cells instantly using graph matches and lookups. **Phase 2** deploys an AI agent to fill remaining gaps. **Phase 3** runs cross-field validation checks, and **Phase 4** performs targeted re-reads for empty or low-confidence cells. You can monitor fill rate in real time as each phase completes."
2160
2589
  },
2590
+ {
2591
+ type: "paragraph",
2592
+ text: "The pipeline is designed around a key principle: use the cheapest, fastest method first and escalate to AI only when necessary. Phase 1 fills cells using deterministic lookups at zero AI cost. Phase 2 uses AI only for cells that Phase 1 could not resolve. Phase 3 re-runs lookups on Phase 2 output to normalize AI-generated values to canonical codes. Phase 4 performs targeted re-reads with full grid context for the remaining gaps. This cascading approach minimizes both cost and latency."
2593
+ },
2594
+ {
2595
+ type: "paragraph",
2596
+ text: "The grid is flushed to the database after each phase, enabling progressive rendering in the UI. You can watch cells fill in real time and begin reviewing results before the job finishes. The phase timeline on the job detail page shows which phase is currently active, how long each phase took, and the cumulative fill rate at each stage."
2597
+ },
2161
2598
  {
2162
2599
  type: "ui-excerpt",
2163
2600
  id: "job-detail-phase-timeline",
2164
2601
  title: "Job Detail \u2014 Phase Timeline",
2165
2602
  caption: "The phase timeline shows progress through the pipeline. Each dot represents a stage, highlighted when active."
2603
+ },
2604
+ {
2605
+ type: "callout",
2606
+ text: "Phase order is fixed: Phase 1 → 2 → 3 → 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs."
2166
2607
  }
2167
2608
  ],
2168
2609
  related: [
@@ -2178,6 +2619,10 @@ var sections6 = [
2178
2619
  {
2179
2620
  question: "Can I see results before all phases complete?",
2180
2621
  answer: "Yes. Results are visible as each phase completes. The fill rate increases progressively through the pipeline."
2622
+ },
2623
+ {
2624
+ question: "Why does the pipeline use multiple phases instead of a single AI call?",
2625
+ answer: "The cascading design minimizes cost and latency. Phase 1 fills cells with deterministic lookups at zero AI cost. Only remaining gaps go to the AI agent in Phase 2, and Phase 4 targets specific empty cells with full context. This is significantly cheaper and faster than sending everything to AI."
2181
2626
  }
2182
2627
  ],
2183
2628
  mentions: ["4-phase pipeline", "fill rate", "progressive rendering", "phase timeline"]
@@ -2227,6 +2672,18 @@ var sections6 = [
2227
2672
  {
2228
2673
  type: "paragraph",
2229
2674
  text: "Values are normalized during transfer: dates → `YYYY/MM/DD`, numbers → 2 decimal places, strings → trim + collapse spaces."
2675
+ },
2676
+ {
2677
+ type: "paragraph",
2678
+ text: "Phase 1 is the workhorse of cost efficiency. Because it relies entirely on pre-computed graph matches and deterministic lookups, it fills a large portion of the grid at near-zero cost. The confidence scores assigned during this phase are typically high (0.7-0.95) because they are derived from verified registry matches rather than AI inference. These high-confidence cells are then protected by the confidence gate, meaning later phases cannot overwrite them."
2679
+ },
2680
+ {
2681
+ type: "paragraph",
2682
+ text: "The resolution strategies execute in a fixed order: registry transfer first, then raw extraction mapping, then the 3-tier lookup cascade, and finally deterministic compute (formulas like `Total = Unit Price × Quantity`). Each strategy only attempts to fill cells that are still empty after the previous strategy ran. This ordering ensures that the highest-confidence method always gets priority."
2683
+ },
2684
+ {
2685
+ type: "callout",
2686
+ text: "Phase 1 fill rates improve over time as your Field Registry grows. The more documents you process, the richer the registry becomes, and the more cells Phase 1 can resolve without AI \u2014 reducing both cost and latency for every subsequent job."
2230
2687
  }
2231
2688
  ],
2232
2689
  related: [
@@ -2242,6 +2699,10 @@ var sections6 = [
2242
2699
  {
2243
2700
  question: "What percentage of cells does Phase 1 fill?",
2244
2701
  answer: "Phase 1 typically fills approximately 30% of cells in seconds, using graph matches and lookups without any AI calls."
2702
+ },
2703
+ {
2704
+ question: "Does Phase 1 performance improve over time?",
2705
+ answer: "Yes. As your Field Registry grows from processing more documents, Phase 1 can resolve a higher percentage of cells through graph matches. Mature registries often see Phase 1 fill rates of 60-80%."
2245
2706
  }
2246
2707
  ],
2247
2708
  mentions: [
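Reviewer sketch for the Phase 1 ordering described in this hunk: strategies run in a fixed order and each one only touches cells that are still empty. The function `runCascade`, the strategy names, and the sample resolvers are hypothetical stand-ins chosen for illustration; the real strategies are not shown in the source.

```javascript
// Hypothetical sketch of a "fill only what is still empty" cascade.
function runCascade(row, strategies) {
  const filled = { ...row };
  for (const { name, resolve } of strategies) {
    for (const field of Object.keys(filled)) {
      if (filled[field] !== null) continue; // earlier strategies keep priority
      const value = resolve(field, filled);
      if (value !== undefined && value !== null) {
        filled[field] = { value, source: name };
      }
    }
  }
  return filled;
}

// Illustrative resolvers, listed in the fixed order from the text.
const strategies = [
  { name: "registry_transfer", resolve: (f) => ({ unit_price: 12.5 })[f] },
  { name: "raw_mapping", resolve: (f) => ({ unit_price: 99, quantity: 4 })[f] },
  { name: "lookup_cascade", resolve: () => undefined },
  {
    name: "compute",
    // Total = Unit Price × Quantity, only when both inputs are present.
    resolve: (f, row) =>
      f === "total" && row.unit_price && row.quantity
        ? row.unit_price.value * row.quantity.value
        : undefined,
  },
];

const result = runCascade({ unit_price: null, quantity: null, total: null }, strategies);
```

In this run the `raw_mapping` value for `unit_price` is ignored because the registry transfer already filled that cell, which is the priority behavior the paragraph describes.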
@@ -2300,6 +2761,14 @@ var sections6 = [
2300
2761
  }
2301
2762
  ]
2302
2763
  },
2764
+ {
2765
+ type: "paragraph",
2766
+ text: "Phase 2 processes documents with grouped extraction calls \u2014 schema fields are divided into batches of up to 10 fields per call to balance extraction quality with throughput. For each document, the agent sends the document text along with the schema field definitions and any already-resolved values from Phase 1 as context. This context-aware approach means the AI can use related values (like a contract start date) to more accurately extract dependent values (like the end date)."
2767
+ },
2768
+ {
2769
+ type: "paragraph",
2770
+ text: "For fields backed by a **reference table**, Phase 2 includes the table's codes and labels directly in the extraction prompt so the AI picks canonical codes rather than free-text labels. This tight integration between reference tables and AI extraction produces cleaner output that requires fewer corrections. Fields whose reference table has fewer than 50 entries get the full table in the prompt; larger tables are handled by the Phase 3 lookup cascade instead."
2771
+ },
2303
2772
  {
2304
2773
  type: "callout",
2305
2774
  variant: "warning",
@@ -2319,6 +2788,10 @@ var sections6 = [
2319
2788
  {
2320
2789
  question: "Can the agent skip a field with manual instructions?",
2321
2790
  answer: "No. Fields with manual instructions always use the extract strategy. Human-written instructions are treated as authoritative and never skipped."
2791
+ },
2792
+ {
2793
+ question: "How many fields does the agent process per AI call?",
2794
+ answer: "Schema fields are grouped into batches of up to 10 fields per extraction call. This balances extraction quality with throughput \u2014 smaller groups help the AI focus on each field without losing recall."
2322
2795
  }
2323
2796
  ],
2324
2797
  mentions: [
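Reviewer sketch for the Phase 2 batching described in this hunk: schema fields are split into groups of up to 10, and each group is sent with the document text and the values already resolved in Phase 1. The names `batchFields`, `buildRequests`, and the request shape are hypothetical; only the batch size of 10 comes from the text.

```javascript
// Hypothetical sketch of grouping schema fields into extraction calls.
const MAX_FIELDS_PER_CALL = 10; // batch size stated in the docs

function batchFields(fields, size = MAX_FIELDS_PER_CALL) {
  const batches = [];
  for (let i = 0; i < fields.length; i += size) {
    batches.push(fields.slice(i, i + size));
  }
  return batches;
}

// One request per batch; Phase 1 values travel along as context.
function buildRequests(documentText, fields, resolvedValues) {
  return batchFields(fields).map((batch) => ({
    documentText,
    fields: batch,
    context: resolvedValues,
  }));
}
```

A schema with 23 unresolved fields would therefore produce three calls carrying 10, 10, and 3 fields.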
@@ -2375,6 +2848,18 @@ var sections6 = [
2375
2848
  description: "Field with >80% registry occurrence rate is empty in this document."
2376
2849
  }
2377
2850
  ]
2851
+ },
2852
+ {
2853
+ type: "paragraph",
2854
+ text: 'Phase 3 also re-runs the lookup cascade (reference table resolution) on values that Phase 2 produced. This is important because AI-extracted values often use natural language labels (e.g., "Frame Agreement") rather than the canonical codes your reference table expects (e.g., `std_master`). The Phase 3 lookup normalizes these labels to codes, improving consistency across your output without requiring manual corrections.'
2855
+ },
2856
+ {
2857
+ type: "paragraph",
2858
+ text: "Validation flags are designed to surface the most impactful issues first. The **low_confidence_outlier** flag is particularly useful \u2014 it highlights cells where the system is uncertain in an otherwise high-confidence row, pointing you to the exact cells most likely to contain errors. For large runs with hundreds of documents, filtering by flags and reviewing those cells first can reduce your review time by 80% or more."
2859
+ },
2860
+ {
2861
+ type: "callout",
2862
+ text: "Validation flags never modify cell values. They are purely informational annotations that help you prioritize review. The actual cell value and confidence score remain unchanged by Phase 3 flagging."
2378
2863
  }
2379
2864
  ],
2380
2865
  related: [
@@ -2390,6 +2875,10 @@ var sections6 = [
2390
2875
  {
2391
2876
  question: "What types of validation flags exist?",
2392
2877
  answer: "Five types: date_sanity (date inconsistencies), amount_mismatch (total discrepancies), lookup_failed (no reference match), low_confidence_outlier (low confidence cells), and unexpected_empty (missing high-frequency fields)."
2878
+ },
2879
+ {
2880
+ question: "Does Phase 3 modify any cell values?",
2881
+ answer: "Phase 3 re-runs the reference table lookup cascade to normalize AI-extracted labels to canonical codes. The validation flags themselves are purely informational and do not modify values."
2393
2882
  }
2394
2883
  ],
2395
2884
  mentions: [
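Reviewer sketch for the Phase 3 lookup described in this hunk: an AI-extracted label such as "Frame Agreement" is normalized to its canonical code, and a miss is annotated without changing the value. The function `resolveToCode` and the table shape are hypothetical. Only the exact-match tier (confidence 0.95) is shown; the fuzzy and AI fallback tiers are omitted.

```javascript
// Hypothetical sketch of the exact-match tier of a reference table lookup.
const normalize = (s) => s.trim().toLowerCase().replace(/\s+/g, " ");

function resolveToCode(cell, table) {
  const key = normalize(cell.value);
  const hit = table.find(
    (entry) => normalize(entry.label) === key || normalize(entry.code) === key
  );
  if (hit) {
    // Exact normalization match: swap the label for the canonical code.
    return { ...cell, value: hit.code, confidence: 0.95 };
  }
  // No match: annotate only; value and confidence stay untouched.
  return { ...cell, flags: [...(cell.flags ?? []), "lookup_failed"] };
}

const contractTypes = [{ code: "std_master", label: "Frame Agreement" }];
```

Keeping the miss path free of value changes mirrors the statement that validation flags are informational only.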
@@ -2415,6 +2904,14 @@ var sections6 = [
2415
2904
  type: "paragraph",
2416
2905
  text: "Because Phase 4 has access to the full grid context \u2014 all values already resolved in earlier phases \u2014 it can use surrounding data as clues. For example, if a contract start date was resolved in Phase 1 but the end date is still empty, Phase 4 re-reads the document knowing the start date, which helps the AI locate the corresponding end date more accurately."
2417
2906
  },
2907
+ {
2908
+ type: "paragraph",
2909
+ text: "Phase 4 also applies deterministic transforms to all cell values: ISO code normalization, date format standardization, and unit conversion. Format constraints (regex patterns defined on schema fields) are evaluated at this stage. If a value fails its format constraint, the configured mismatch behavior applies \u2014 the cell is either cleared, flagged with an amber dot, or replaced with a constant. Original values are always preserved in the `original_extractions` table for audit purposes."
2910
+ },
2911
+ {
2912
+ type: "paragraph",
2913
+ text: "Expect Phase 4 to fill 5-15% of remaining empty cells, depending on document complexity and schema coverage. The phase is most effective for fields that require cross-referencing multiple sections of a document or interpreting values in the context of other extracted data. It is less effective for fields that are genuinely absent from the source document \u2014 those will remain empty with an `unresolved` provenance type."
2914
+ },
2418
2915
  {
2419
2916
  type: "callout",
2420
2917
  text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected."
@@ -2433,6 +2930,10 @@ var sections6 = [
2433
2930
  {
2434
2931
  question: "Can Phase 4 overwrite high-confidence values?",
2435
2932
  answer: "No. Phase 4 respects the confidence gate \u2014 it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from earlier phases are permanently protected."
2933
+ },
2934
+ {
2935
+ question: "What else happens in Phase 4 besides gap filling?",
2936
+ answer: "Phase 4 also applies deterministic transforms (ISO codes, dates, units), evaluates format constraints (regex validation), and runs the modifier pipeline (format, alias, max_length). Original values are preserved for audit."
2436
2937
  }
2437
2938
  ],
2438
2939
  mentions: ["Phase 4", "re-read", "gap filling", "confidence gate", "targeted extraction"]
@@ -2457,6 +2958,18 @@ var sections6 = [
2457
2958
  {
2458
2959
  type: "paragraph",
2459
2960
  text: "Start your review by switching to the **Flagged** filter to focus on cells that need attention \u2014 these are values with validation warnings, low confidence, or format mismatches. Click any cell to see its full provenance, including which phase produced it and the reasoning trace. Once you are satisfied, export via **CSV** \u2014 choose the clean export for downstream systems or the full export with metadata for auditing."
2961
+ },
2962
+ {
2963
+ type: "paragraph",
2964
+ text: "The colored dots on each cell are your quickest visual indicator of data quality. Blue dots indicate graph matches from Phase 1 (highest reliability), purple dots indicate computed values, teal dots indicate agent transfers, indigo dots indicate AI extractions, and amber dots indicate lookup results or format flags. A grid dominated by blue and purple dots typically requires minimal review, while one with many indigo and amber dots may need more attention."
2965
+ },
2966
+ {
2967
+ type: "paragraph",
2968
+ text: "For large jobs with hundreds of documents, use a systematic review workflow: first address all **Flagged** rows, then spot-check a random sample of **Clean** rows to build confidence in the overall quality. If you find recurring errors in a specific field, consider updating the schema field's instruction or reference table, then run a new job \u2014 corrections you apply also feed back as training signals for future runs."
2969
+ },
2970
+ {
2971
+ type: "callout",
2972
+ text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus."
2460
2973
  }
2461
2974
  ],
2462
2975
  related: [
@@ -2472,6 +2985,10 @@ var sections6 = [
2472
2985
  {
2473
2986
  question: "Can I export extraction results?",
2474
2987
  answer: "Yes. Use CSV export from the job detail page. You can export clean data only or full data with metadata including confidence scores and resolution types."
2988
+ },
2989
+ {
2990
+ question: "What is the most efficient way to review a large extraction run?",
2991
+ answer: "Start with the Flagged filter to address cells with validation warnings, low confidence, or format mismatches. Then spot-check a random sample of Clean rows. Focus corrections on recurring field-level patterns rather than individual cells."
2475
2992
  }
2476
2993
  ],
2477
2994
  mentions: [
@@ -2528,6 +3045,14 @@ var sections6 = [
2528
3045
  }
2529
3046
  ]
2530
3047
  },
3048
+ {
3049
+ type: "paragraph",
3050
+ text: "Confidence scores follow predictable patterns by resolution type. Graph matches from Phase 1 typically score 0.7-0.95 because they are derived from verified registry data. Reference table lookups score 0.95 for exact normalization matches, ~0.70 for fuzzy matches, and 0.50 for AI fallback. Agent-derived values from Phase 2 generally score 0.5-0.9 depending on the clarity of the source document and the specificity of the extraction instruction."
3051
+ },
3052
+ {
3053
+ type: "paragraph",
3054
+ text: "Use confidence scores to set your review threshold. Cells above 0.8 are generally reliable and can be trusted without manual verification for most use cases. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. You can use the full CSV export to filter and sort by confidence, so you can batch-review low-confidence cells."
3055
+ },
2531
3056
  {
2532
3057
  type: "callout",
2533
3058
  variant: "warning",
@@ -2547,6 +3072,10 @@ var sections6 = [
2547
3072
  {
2548
3073
  question: "What is the confidence gate?",
2549
3074
  answer: "The confidence gate prevents any later pipeline phase from overwriting a cell that was filled with confidence >= 0.7. This protects high-quality lookup results from lower-confidence agent extractions."
3075
+ },
3076
+ {
3077
+ question: "What confidence threshold should I use for manual review?",
3078
+ answer: "Cells above 0.8 are generally reliable. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. Use the CSV export to filter by confidence for efficient batch review."
2550
3079
  }
2551
3080
  ],
2552
3081
  mentions: [
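Reviewer sketch for the confidence rules in this hunk: the gate that protects cells filled at confidence 0.7 or higher, and the suggested review bands. The names `canOverwrite` and `reviewBucket` and the bucket labels are hypothetical. Placing exactly 0.5 and exactly 0.8 in the "quick check" band is an assumption, since the text does not say which side those boundaries fall on.

```javascript
// Hypothetical sketch of the confidence gate and review triage.
const GATE_THRESHOLD = 0.7; // cells at or above this are protected

// A later phase may write only to empty cells or cells below the gate.
function canOverwrite(cell) {
  return cell.value === null || cell.confidence < GATE_THRESHOLD;
}

// Review bands from the docs; boundary handling is an assumption.
function reviewBucket(confidence) {
  if (confidence > 0.8) return "trusted";
  if (confidence >= 0.5) return "quick_check";
  return "manual_review";
}
```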
@@ -2571,6 +3100,18 @@ var sections6 = [
2571
3100
  {
2572
3101
  type: "paragraph",
2573
3102
  text: "When correcting a value, consider using **all_similar** propagation if the same mistake appears across multiple documents \u2014 for example, a reference table code that was consistently mapped to the wrong label. This applies your fix to every document in the run that matched the same way, saving you from correcting each cell individually. The system learns from these corrections, so the same error is less likely to recur in future jobs."
3103
+ },
3104
+ {
3105
+ type: "paragraph",
3106
+ text: "Corrections create a full audit trail: the original extracted value, the corrected value, who made the change, and when. This audit log is preserved even after subsequent jobs are run, giving you a complete history of manual interventions. When you export results with the full metadata option, correction history is included so downstream systems can distinguish between AI-extracted and human-corrected values."
3107
+ },
3108
+ {
3109
+ type: "paragraph",
3110
+ text: "For best results, correct the root cause rather than individual symptoms. If a field consistently produces wrong values, update the schema field's **manual instruction** or **reference table** rather than correcting cells one by one. If a reference table code is missing, add it to the table \u2014 future runs will pick it up automatically at Tier 1 confidence (0.95). Corrections are most valuable as a feedback mechanism when they inform schema improvements."
3111
+ },
3112
+ {
3113
+ type: "callout",
3114
+ text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected."
2574
3115
  }
2575
3116
  ],
2576
3117
  related: [
@@ -2586,6 +3127,10 @@ var sections6 = [
2586
3127
  {
2587
3128
  question: "Do corrections improve future extractions?",
2588
3129
  answer: "Yes. Corrections feed back as training signals for future runs, helping the system improve accuracy over time."
3130
+ },
3131
+ {
3132
+ question: "Is there an audit trail for corrections?",
3133
+ answer: "Yes. Every correction logs the original value, the corrected value, the user who made the change, and the timestamp. This audit history is preserved and included in full metadata CSV exports."
2589
3134
  }
2590
3135
  ],
2591
3136
  mentions: [
@@ -2639,6 +3184,18 @@ var sections7 = [
2639
3184
  {
2640
3185
  type: "paragraph",
2641
3186
  text: "Most link keys are auto-classified by name patterns. Remaining ambiguous fields are classified by AI. High-frequency entities (>30% of documents) are automatically excluded from case formation."
3187
+ },
3188
+ {
3189
+ type: "paragraph",
3190
+ text: "Behind the scenes, the classification engine applies rule-based heuristics first \u2014 field names like `company_name` or `invoice_number` are recognized instantly. When heuristics are inconclusive, an AI classifier examines the field's extracted values and schema context to determine the correct category. This two-tier approach keeps classification fast for the common case while handling ambiguous fields gracefully."
3191
+ },
3192
+ {
3193
+ type: "paragraph",
3194
+ text: "Use link keys whenever your documents share identifying information that should connect them. For best results, ensure your field names follow clear naming conventions \u2014 this maximizes the hit rate of the automatic classifier and minimizes the need for manual overrides."
3195
+ },
3196
+ {
3197
+ type: "callout",
3198
+ text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest."
2642
3199
  }
2643
3200
  ],
2644
3201
  related: [
@@ -2654,6 +3211,10 @@ var sections7 = [
2654
3211
  {
2655
3212
  question: "Why are high-frequency entities excluded from case formation?",
2656
3213
  answer: "Entities appearing in more than 30% of documents are too common to be meaningful connections. They are automatically excluded to prevent overly large, uninformative cases."
3214
+ },
3215
+ {
3216
+ question: "Can I manually classify a field as a link key?",
3217
+ answer: "Yes. Navigate to the Field Registry and change any field's link key category. Manual classifications take precedence over automatic ones and persist across future jobs."
2657
3218
  }
2658
3219
  ],
2659
3220
  mentions: [
@@ -2674,6 +3235,22 @@ var sections7 = [
2674
3235
  {
2675
3236
  type: "paragraph",
2676
3237
  text: 'After extraction, the linking pipeline runs automatically: extracts link key values, normalizes them (lowercasing, stripping suffixes like "Ltd", "Inc"), and builds a bipartite graph of documents ↔ entities.'
3238
+ },
3239
+ {
3240
+ type: "paragraph",
3241
+ text: 'The normalization step is critical for accurate linking. Values like "ACME Corp.", "Acme Corporation", and "acme corp" are all reduced to the same canonical form so they resolve to a single entity node. This prevents duplicate entities from fragmenting your cases and ensures documents that reference the same real-world entity are correctly connected.'
3242
+ },
3243
+ {
3244
+ type: "paragraph",
3245
+ text: "The resulting bipartite graph has two node types: documents and entities. An edge connects a document to an entity whenever the document contains that entity's value in a link key field. Connected components in this graph become the foundation for case formation \u2014 documents that share entities end up in the same case."
3246
+ },
3247
+ {
3248
+ type: "paragraph",
3249
+ text: "For best results, ensure your source documents contain consistent identifiers. The pipeline handles minor variations automatically, but highly inconsistent naming (e.g., abbreviations vs. full legal names) may require manual link key tuning in the Field Registry."
3250
+ },
3251
+ {
3252
+ type: "callout",
3253
+ text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered."
2677
3254
  }
2678
3255
  ],
2679
3256
  related: [
@@ -2689,6 +3266,10 @@ var sections7 = [
2689
3266
  {
2690
3267
  question: "When does entity linking run?",
2691
3268
  answer: "Entity linking runs automatically after document extraction. It processes link key values and builds connections without manual intervention."
3269
+ },
3270
+ {
3271
+ question: "What normalization does entity linking apply?",
3272
+ answer: "Values are lowercased, common suffixes (Ltd, Inc, Corp, etc.) are stripped, and whitespace is normalized. This ensures minor naming variations resolve to the same entity."
2692
3273
  }
2693
3274
  ],
2694
3275
  mentions: [
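Reviewer sketch for the linking steps in this hunk: link key values are normalized, documents are connected through shared entities, and connected components become cases. The names `normalizeEntity` and `formCases` are hypothetical. The suffix list beyond "Ltd", "Inc", and "Corp" is an assumption, and the exclusion of high-frequency entities (over 30% of documents) is not modeled.

```javascript
// Hypothetical sketch of normalization plus connected-component grouping.
const SUFFIXES = new Set(["ltd", "inc", "corp", "corporation", "llc", "gmbh"]);

function normalizeEntity(raw) {
  const tokens = raw
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // drop punctuation
    .split(/\s+/)
    .filter(Boolean);
  // Strip trailing legal suffixes, but never the whole name.
  while (tokens.length > 1 && SUFFIXES.has(tokens[tokens.length - 1])) tokens.pop();
  return tokens.join(" ");
}

function formCases(documents) {
  // Union-find over document ids; an entity shared by two documents links them.
  const parent = new Map(documents.map((d) => [d.id, d.id]));
  const find = (x) => {
    while (parent.get(x) !== x) x = parent.get(x);
    return x;
  };
  const firstDocForEntity = new Map();
  for (const doc of documents) {
    for (const raw of doc.entities) {
      const entity = normalizeEntity(raw);
      if (firstDocForEntity.has(entity)) {
        parent.set(find(doc.id), find(firstDocForEntity.get(entity)));
      } else {
        firstDocForEntity.set(entity, doc.id);
      }
    }
  }
  const cases = new Map();
  for (const doc of documents) {
    const root = find(doc.id);
    cases.set(root, [...(cases.get(root) ?? []), doc.id]);
  }
  return [...cases.values()];
}
```

The entity nodes of the bipartite graph are implicit here: two documents land in the same case whenever a chain of shared normalized entities connects them.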
@@ -2780,6 +3361,22 @@ var sections7 = [
2780
3361
  {
2781
3362
  type: "paragraph",
2782
3363
  text: "The Document Graph provides a visual D3-force layout of the bipartite graph. Toggle between graph and list views from the Cases page. Case templates are auto-discovered after 3+ cases form \u2014 they identify recurring document type patterns."
3364
+ },
3365
+ {
3366
+ type: "paragraph",
3367
+ text: "In the graph view, document nodes and entity nodes are rendered with distinct visual styles. Edges represent link key connections, and tightly connected clusters naturally pull together through force simulation. Hovering over a node highlights its connections, making it easy to trace how documents relate through shared entities."
3368
+ },
3369
+ {
3370
+ type: "paragraph",
3371
+ text: 'Case templates capture recurring patterns \u2014 for example, "Invoice + Purchase Order + Contract" might emerge as a common template after enough cases form. Templates include a **match threshold** that controls how closely a case must match the expected document type set. Use templates to monitor completeness: if a case is missing a document type that the template expects, an anomaly is raised.'
3372
+ },
3373
+ {
3374
+ type: "paragraph",
3375
+ text: "Most teams use the graph view during initial workspace setup to verify that linking is producing sensible clusters. Once you are confident in your link key configuration, the list view is more practical for day-to-day case review and triage."
3376
+ },
3377
+ {
3378
+ type: "callout",
3379
+ text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern."
2783
3380
  }
2784
3381
  ],
2785
3382
  related: [
@@ -2794,6 +3391,10 @@ var sections7 = [
2794
3391
  {
2795
3392
  question: "What are case templates?",
2796
3393
  answer: "Case templates are auto-discovered after 3 or more cases form. They identify recurring document type patterns, helping you understand common document relationships in your workspace."
3394
+ },
3395
+ {
3396
+ question: "Can I switch between graph and list views?",
3397
+ answer: "Yes. Toggle between the visual D3-force graph and a traditional list view from the Cases page. Both views show the same underlying data \u2014 choose whichever suits your workflow."
2797
3398
  }
2798
3399
  ],
2799
3400
  mentions: ["document graph", "D3-force layout", "bipartite graph", "case templates"]
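Reviewer sketch for the template discovery described in this hunk: cases are grouped by their set of document types, a template is proposed once at least 3 cases share a pattern, and a case can then be checked for missing types. The names `discoverTemplates` and `missingDocumentTypes` are hypothetical, and the match threshold mentioned in the text is not modeled.

```javascript
// Hypothetical sketch of pattern-based template discovery.
const MIN_CASES_FOR_TEMPLATE = 3; // threshold stated in the docs

// Order-independent key for a set of document types.
const patternKey = (types) => [...new Set(types)].sort().join(" + ");

function discoverTemplates(cases) {
  const counts = new Map();
  for (const c of cases) {
    const key = patternKey(c.documentTypes);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts]
    .filter(([, n]) => n >= MIN_CASES_FOR_TEMPLATE)
    .map(([key, n]) => ({ documentTypes: key.split(" + "), caseCount: n }));
}

// Types the template expects that the case does not contain.
function missingDocumentTypes(caseTypes, template) {
  const present = new Set(caseTypes);
  return template.documentTypes.filter((t) => !present.has(t));
}
```

A non-empty result from `missingDocumentTypes` corresponds to the situation where the text says an anomaly is raised.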
@@ -2843,6 +3444,18 @@ var sections7 = [
2843
3444
  {
2844
3445
  type: "paragraph",
2845
3446
  text: "Anomalies appear in the **Anomalies** tab of the case detail page (Advanced mode). Each anomaly card shows severity, affected fields, and a dismiss button. Dismissed anomalies are hidden by default but visible via the **show dismissed** toggle."
3447
+ },
3448
+ {
3449
+ type: "paragraph",
3450
+ text: "The detection engine runs automatically after case formation and whenever case membership changes (documents added or removed, or cases merged). Each detector operates independently \u2014 a single case can trigger multiple anomaly types simultaneously. Anomaly counts are displayed as badges in the case header for quick triage."
3451
+ },
3452
+ {
3453
+ type: "paragraph",
3454
+ text: "Use anomaly detection to surface data quality issues that would otherwise require manual comparison across documents. For best results, configure case templates so the **Missing Document Type** detector (D4) can flag incomplete cases. Most teams find that D2 (Field Conflict) and D3 (Duplicate Key Divergence) catch the highest-value issues in procurement and financial workflows."
3455
+ },
3456
+ {
3457
+ type: "callout",
3458
+ text: "Anomaly detection requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but not displayed in the case detail page."
2846
3459
  }
2847
3460
  ],
2848
3461
  related: [
@@ -2854,6 +3467,14 @@ var sections7 = [
2854
3467
  {
2855
3468
  question: "What anomalies does Talonic detect?",
2856
3469
  answer: "Five structural patterns: validation clusters, field conflicts, duplicate key divergence, missing document types, and value reuse. Each is surfaced as a dismissable card on the case detail page."
3470
+ },
3471
+ {
3472
+ question: "Do anomalies update automatically when cases change?",
3473
+ answer: "Yes. The detection engine re-runs whenever case membership changes \u2014 documents added or removed, cases merged or split. Anomaly badges in the case header update in real time."
3474
+ },
3475
+ {
3476
+ question: "Can I dismiss anomalies?",
3477
+ answer: "Yes. Each anomaly card includes a dismiss button. Dismissed anomalies are hidden by default but can be revealed using the show dismissed toggle on the Anomalies tab."
2857
3478
  }
2858
3479
  ],
2859
3480
  mentions: ["anomaly detection", "validation cluster", "field conflict", "duplicate key divergence", "value reuse"]
@@ -2886,6 +3507,14 @@ var sections7 = [
2886
3507
  type: "paragraph",
2887
3508
  text: "**Domain packs** extend validation with industry-specific rules. The freight domain pack includes DOT number state detection and MC number validation. Additional packs can be added to `domain-packs/` without modifying the core engine."
2888
3509
  },
3510
+ {
3511
+ type: "paragraph",
3512
+ text: "Validation runs automatically after extraction and linking complete. Each field value is checked against every applicable validator \u2014 a single field can trigger multiple rules. Results are displayed as colored badges in the **Evidence** tab: green for pass, red for fail, and amber for warnings. You can filter by status, document, category, or free-text search."
3513
+ },
3514
+ {
3515
+ type: "paragraph",
3516
+ text: "The checksum validator (S7) uses a parameterized factory pattern \u2014 it accepts a checksum algorithm name and applies the corresponding verification logic. Supported algorithms include Luhn (credit card numbers), ABA (bank routing numbers), IBAN (international bank accounts), and ISBN (book identifiers). For best results, ensure your schema fields are typed correctly so the engine knows which checksum to apply."
3517
+ },
2889
3518
  {
2890
3519
  type: "callout",
2891
3520
  text: "Evidence validation results are stored in a separate `evidence_validation_results` table keyed by (document_id, entity_id, field_key) \u2014 not in the extraction or linking tables."
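Reviewer note on the hunk above: the parameterized factory pattern described for the checksum validator can be sketched roughly as follows. This is an illustration, not the engine's code. `checksumAlgorithms` and `makeChecksumValidator` are hypothetical names, the pass/fail return value is an assumption, and only the Luhn algorithm is implemented.

```javascript
// Illustrative only: `checksumAlgorithms` and `makeChecksumValidator` are
// hypothetical names, and only the Luhn algorithm is implemented here.
const checksumAlgorithms = new Map([
  [
    "luhn",
    (value) => {
      const digits = String(value).replace(/[\s-]/g, "");
      if (!/^\d+$/.test(digits)) return false;
      let sum = 0;
      // Walk from the rightmost digit, doubling every second digit.
      for (let i = 0; i < digits.length; i++) {
        let d = Number(digits[digits.length - 1 - i]);
        if (i % 2 === 1) {
          d *= 2;
          if (d > 9) d -= 9;
        }
        sum += d;
      }
      return sum % 10 === 0;
    },
  ],
]);

// Factory: takes an algorithm name and returns a validator for that algorithm.
function makeChecksumValidator(algorithmName) {
  const check = checksumAlgorithms.get(algorithmName);
  if (!check) {
    throw new Error(`Unsupported checksum algorithm: ${algorithmName}`);
  }
  return (value) => (check(value) ? "pass" : "fail");
}

const validateLuhn = makeChecksumValidator("luhn");
console.log(validateLuhn("79927398713"));
```

Adding ABA, IBAN, or ISBN support in this sketch would mean registering another entry in the map; the factory itself would not change.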
@@ -2904,6 +3533,10 @@ var sections7 = [
2904
3533
  {
2905
3534
  question: "What are domain packs?",
2906
3535
  answer: "Domain packs add industry-specific validation rules. For example, the freight domain pack validates DOT numbers and MC numbers. New packs can be added without modifying the core engine."
3536
+ },
3537
+ {
3538
+ question: "How are evidence validation results displayed?",
3539
+ answer: "Results appear as colored badges in the Evidence tab of the case detail page. Green indicates pass, red indicates fail, and amber indicates a warning. Use the filter bar to narrow results by status, document, or category."
2907
3540
  }
2908
3541
  ],
2909
3542
  mentions: ["evidence validation", "structural validators", "checksum", "Luhn", "IBAN", "domain packs", "freight"]
@@ -2930,6 +3563,18 @@ var sections8 = [
2930
3563
  {
2931
3564
  type: "paragraph",
2932
3565
  text: "Navigate to **Data Products → Dataset Templates** to manage templates. Each template is linked to a user schema and can be versioned independently. When creating a new job, select a template instead of configuring the output from scratch."
3566
+ },
3567
+ {
3568
+ type: "paragraph",
3569
+ text: "Templates support column mappings that rename, reorder, or exclude fields from the output. Default transforms \u2014 such as date formatting, currency normalization, or unit conversion \u2014 are applied automatically during assembly. This means every data product built from the same template produces structurally identical output regardless of who runs it or when."
3570
+ },
3571
+ {
3572
+ type: "paragraph",
3573
+ text: "For best results, create one template per downstream consumer. If your finance team and operations team need different column subsets from the same schema, define two templates rather than manually reconfiguring each export. Most teams version their templates alongside schema changes to maintain backward compatibility with existing integrations."
3574
+ },
3575
+ {
3576
+ type: "callout",
3577
+ text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction."
2933
3578
  }
2934
3579
  ],
2935
3580
  related: [
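Reviewer note on the hunk above: the rename, reorder, and exclude behaviour of template column mappings could look something like this. The mapping shape (`from`, optional `to`) and the function name `applyColumnMappings` are assumptions made for illustration; the diff does not show the real template format.

```javascript
// Hypothetical mapping shape: output column order follows the array order,
// `to` renames a field, and fields that are not listed are excluded.
function applyColumnMappings(row, mappings) {
  const out = {};
  for (const { from, to } of mappings) {
    out[to ?? from] = row[from] ?? null;
  }
  return out;
}

const mappings = [
  { from: "invoice_number", to: "InvoiceNo" }, // rename and move to first column
  { from: "vendor_name", to: "Vendor" },
  { from: "total" },                           // keep the original name
];

const row = {
  vendor_name: "ACME",
  total: "100.00",
  invoice_number: "007",
  internal_note: "not for export",             // excluded: no mapping lists it
};

console.log(applyColumnMappings(row, mappings));
```

Because the mapping lives in the template rather than in each export, two runs over different rows produce the same column set in the same order.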
@@ -2945,6 +3590,10 @@ var sections8 = [
2945
3590
  {
2946
3591
  question: "How do dataset templates relate to schemas?",
2947
3592
  answer: "Each dataset template is linked to a user schema and can be versioned independently. When creating a new job, you can select a template instead of configuring output from scratch."
3593
+ },
3594
+ {
3595
+ question: "Can I version dataset templates?",
3596
+ answer: "Yes. Each template is versioned independently from the schema it references. This lets you evolve your output format over time without affecting existing data products built from earlier versions."
2948
3597
  }
2949
3598
  ],
2950
3599
  mentions: [
@@ -2964,11 +3613,19 @@ var sections8 = [
2964
3613
  content: [
2965
3614
  {
2966
3615
  type: "paragraph",
2967
- text: "An assembly combines documents from one or more sources into a single structured dataset based on a template. Assemblies track their constituent documents, source counts, and processing status."
3616
+ text: "An assembly combines documents from one or more sources into a single structured dataset based on a template. Assemblies track their constituent documents, source counts, and processing status."
3617
+ },
3618
+ {
3619
+ type: "paragraph",
3620
+ text: "Navigate to **Data Products → Assemblies** to view and create assemblies. Each assembly shows its document count, linked schema, processing status, and the date it was created."
3621
+ },
3622
+ {
3623
+ type: "paragraph",
3624
+ text: "When you create an assembly, you select a dataset template and one or more document sources. The system pulls all matching documents, applies the template's column mappings and transforms, and produces a single structured output. The assembly tracks which documents contributed to each row, giving you full traceability from output back to source."
2968
3625
  },
2969
3626
  {
2970
3627
  type: "paragraph",
2971
- text: "Navigate to **Data Products → Assemblies** to view and create assemblies. Each assembly shows its document count, linked schema, processing status, and the date it was created."
3628
+ text: "Use assemblies whenever you need a repeatable, auditable output for downstream systems or stakeholders. Most teams create one assembly per reporting period or delivery cycle. Because assemblies reference a template, you can regenerate the same output shape from different document sets without reconfiguring columns or transforms each time."
2972
3629
  },
2973
3630
  {
2974
3631
  type: "callout",
@@ -2988,6 +3645,10 @@ var sections8 = [
2988
3645
  {
2989
3646
  question: "Why should I use assemblies for production data?",
2990
3647
  answer: "Assemblies provide a single audit trail from source documents through extraction, resolution, and validation to the final output, making them the recommended approach for production datasets."
3648
+ },
3649
+ {
3650
+ question: "Can an assembly pull from multiple sources?",
3651
+ answer: "Yes. An assembly can combine documents from any number of sources \u2014 uploaded files, connected drives, email attachments, and more \u2014 into a single structured dataset."
2991
3652
  }
2992
3653
  ],
2993
3654
  mentions: [
@@ -3033,6 +3694,18 @@ var sections8 = [
3033
3694
  {
3034
3695
  type: "paragraph",
3035
3696
  text: "ID rules are persisted before generating IDs. Navigate to a data product detail page and use **Apply ID Rules** to generate or **Regenerate IDs** to refresh."
3697
+ },
3698
+ {
3699
+ type: "paragraph",
3700
+ text: 'Resolution maps normalize field values before they become part of the ID. For example, a resolution map can collapse "ACME Corp", "ACME Corporation", and "Acme" into a single canonical value "ACME". This prevents duplicate IDs for rows that refer to the same real-world entity under different names.'
3701
+ },
3702
+ {
3703
+ type: "paragraph",
3704
+ text: 'For best results, choose source fields with high uniqueness \u2014 contract numbers or invoice IDs work well, while generic fields like "status" do not. When your documents contain multiple candidate identifiers, configure a fallback chain so the dispenser always has a value to work with. Most teams use the primary reference number as the source field and the document name as the first fallback.'
3705
+ },
3706
+ {
3707
+ type: "callout",
3708
+ text: "ID generation is deterministic \u2014 running **Regenerate IDs** with the same rules and data always produces the same output. This makes ID dispensers safe to re-run without breaking downstream references."
3036
3709
  }
3037
3710
  ],
3038
3711
  related: [
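Reviewer note on the hunk above: a minimal sketch of how a resolution map and a fallback chain might combine when choosing the value that feeds an ID. The names `resolutionMap`, `resolveValue`, and `pickIdSource`, and the field names in the usage line, are hypothetical. The sketch returns `null` when every candidate is empty and leaves the sequential-ID behaviour to the caller.

```javascript
// Hypothetical resolution map: collapses name variants into one canonical value.
const resolutionMap = new Map([
  ["ACME Corp", "ACME"],
  ["ACME Corporation", "ACME"],
  ["Acme", "ACME"],
]);

function resolveValue(value) {
  return resolutionMap.get(value) ?? value;
}

// Try the source field first, then each fallback in order.
function pickIdSource(row, sourceField, fallbackChain) {
  for (const field of [sourceField, ...fallbackChain]) {
    const value = row[field];
    if (value !== undefined && value !== null && String(value).trim() !== "") {
      return resolveValue(String(value).trim());
    }
  }
  return null; // every candidate was empty; the caller decides what happens next
}

console.log(
  pickIdSource({ reference_number: "", document_name: "ACME Corporation" },
    "reference_number", ["document_name"])
);
```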
@@ -3044,6 +3717,14 @@ var sections8 = [
3044
3717
  {
3045
3718
  question: "How do ID dispensers handle missing field values?",
3046
3719
  answer: "When the source field is empty, the dispenser tries each field in the fallback chain in order. If all are empty, it generates a prefix-less sequential ID."
3720
+ },
3721
+ {
3722
+ question: "What is a resolution map?",
3723
+ answer: 'A resolution map is a key-value lookup that normalizes field values before ID generation. For example, it can collapse "ACME Corp" and "ACME Corporation" into "ACME" to prevent duplicate IDs for the same entity.'
3724
+ },
3725
+ {
3726
+ question: "Can I regenerate IDs without losing data?",
3727
+ answer: "Yes. Regenerating IDs only updates the ID column \u2014 all other data product values remain unchanged. The operation is deterministic, so the same rules and data always produce the same IDs."
3047
3728
  }
3048
3729
  ],
3049
3730
  mentions: ["ID dispenser", "unique identifiers", "fallback chain", "resolution map"]
@@ -3102,6 +3783,10 @@ var sections8 = [
3102
3783
  {
3103
3784
  question: "Does CSV export preserve leading zeros?",
3104
3785
  answer: "Yes. All CSV exports preserve leading zeros and long numbers \u2014 values are never coerced to numeric types."
3786
+ },
3787
+ {
3788
+ question: "What is auto-resolve singles?",
3789
+ answer: "Auto-resolve singles automatically accepts fields that have only one candidate value, removing them from the manual review queue. Combined with auto-review, this significantly reduces the volume of items requiring human attention."
3105
3790
  }
3106
3791
  ],
3107
3792
  mentions: ["share token", "delivery website", "CSV export", "auto-review", "auto-resolve"]
@@ -3124,6 +3809,22 @@ var sections9 = [
3124
3809
  {
3125
3810
  type: "paragraph",
3126
3811
  text: "Schema-level quality rules run during Phase 3 of every job. Rule types: field format, value range, cross-field consistency, and AI-proposed coherence rules. Rules can be AI-proposed after a job completes, then reviewed and approved before activation."
3812
+ },
3813
+ {
3814
+ type: "paragraph",
3815
+ text: "**Field format** checks verify that values match an expected pattern (e.g., dates in ISO format, phone numbers with country codes). **Value range** checks ensure numeric or date values fall within acceptable bounds. **Cross-field consistency** checks compare two or more fields on the same record \u2014 for example, verifying that a start date precedes an end date."
3816
+ },
3817
+ {
3818
+ type: "paragraph",
3819
+ text: "AI-proposed coherence rules are generated by analyzing patterns in completed job results. The system identifies relationships that hold across most records and proposes them as candidate rules. You review each proposal in the validation settings before it becomes active \u2014 no AI-generated rule runs without explicit approval."
3820
+ },
3821
+ {
3822
+ type: "paragraph",
3823
+ text: "For best results, start with a small set of high-confidence rules and expand over time. Most teams begin with field format checks for critical identifiers (invoice numbers, dates, amounts) and add cross-field consistency rules as they learn their data patterns. Validation failures do not block extraction \u2014 they flag records for review."
3824
+ },
3825
+ {
3826
+ type: "callout",
3827
+ text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently."
3127
3828
  }
3128
3829
  ],
3129
3830
  related: [
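Reviewer note on the hunk above: the three hand-written rule types (field format, value range, cross-field consistency) can be pictured as simple predicates over a record. The rule shape, field names, and bounds below are assumptions for illustration only. As the text states, a failure flags the record for review rather than blocking extraction.

```javascript
// Hypothetical rule shapes; each check returns true when the record passes.
const rules = [
  {
    type: "field_format",
    name: "invoice_date is ISO (YYYY-MM-DD)",
    check: (r) => /^\d{4}-\d{2}-\d{2}$/.test(r.invoice_date),
  },
  {
    type: "value_range",
    name: "amount between 0 and 1,000,000",
    check: (r) => Number(r.amount) >= 0 && Number(r.amount) <= 1000000,
  },
  {
    type: "cross_field",
    name: "start_date precedes end_date",
    // ISO dates compare correctly as plain strings.
    check: (r) => r.start_date < r.end_date,
  },
];

function runChecks(record) {
  const results = rules.map(({ type, name, check }) => ({
    type,
    name,
    passed: Boolean(check(record)),
  }));
  // Failures do not stop processing; they only mark the record for review.
  return { results, flagged: results.some((r) => !r.passed) };
}

console.log(
  runChecks({
    invoice_date: "2024-03-01",
    amount: "250.00",
    start_date: "2024-03-01",
    end_date: "2024-02-01",
  })
);
```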
@@ -3139,6 +3840,10 @@ var sections9 = [
3139
3840
  {
3140
3841
  question: "Can AI suggest validation rules?",
3141
3842
  answer: "Yes. After a job completes, AI can propose coherence rules based on the data. You review and approve these rules before they are activated."
3843
+ },
3844
+ {
3845
+ question: "Do validation failures block extraction?",
3846
+ answer: "No. Validation checks flag records for review but do not prevent extraction from completing. Failed records appear in the Approval Queue for manual inspection."
3142
3847
  }
3143
3848
  ],
3144
3849
  mentions: [
@@ -3158,6 +3863,22 @@ var sections9 = [
3158
3863
  {
3159
3864
  type: "paragraph",
3160
3865
  text: "Manually-created reference datasets with known-correct values. Create from **Validation → Golden Samples**. Benchmark runs compare extraction results against golden samples for per-field accuracy scoring with AI judge verdicts."
3866
+ },
3867
+ {
3868
+ type: "paragraph",
3869
+ text: "To create a golden sample, select a document and manually enter the correct value for each field. The system stores these known-correct values as the ground truth baseline. When you run a benchmark, the extraction pipeline processes the same document independently, and the results are compared field by field against your golden sample."
3870
+ },
3871
+ {
3872
+ type: "paragraph",
3873
+ text: 'Benchmark scoring uses an AI judge to evaluate each field comparison. The judge accounts for semantic equivalence \u2014 for example, "United States" and "US" may be scored as a match depending on the field type. Per-field accuracy scores let you identify exactly which fields are underperforming and need schema or instruction tuning.'
3874
+ },
3875
+ {
3876
+ type: "paragraph",
3877
+ text: "For best results, create golden samples from a representative mix of document types and complexity levels. Most teams maintain 5-10 golden samples per schema and re-run benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time."
3878
+ },
3879
+ {
3880
+ type: "callout",
3881
+ text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed."
3161
3882
  }
3162
3883
  ],
3163
3884
  related: [
@@ -3173,6 +3894,10 @@ var sections9 = [
3173
3894
  {
3174
3895
  question: "How do benchmark runs work?",
3175
3896
  answer: "Benchmark runs compare extraction results against golden samples, producing per-field accuracy scores with AI judge verdicts to measure extraction quality."
3897
+ },
3898
+ {
3899
+ question: "How many golden samples should I create?",
3900
+ answer: "Most teams maintain 5-10 golden samples per schema, covering a representative mix of document types and complexity levels. Re-run benchmarks after schema changes or model upgrades to track quality trends."
3176
3901
  }
3177
3902
  ],
3178
3903
  mentions: ["golden samples", "ground truth", "benchmark runs", "accuracy scoring", "AI judge"]
@@ -3188,6 +3913,18 @@ var sections9 = [
3188
3913
  type: "paragraph",
3189
3914
  text: "Threshold-based rules for auto-approving or flagging results. Configure per schema with criteria: minimum confidence, validation pass rate, field coverage. Results meeting all thresholds are auto-approved; others go to the manual review queue."
3190
3915
  },
3916
+ {
3917
+ type: "paragraph",
3918
+ text: "Each criterion acts as an independent gate. **Minimum confidence** sets the lowest acceptable extraction confidence score. **Validation pass rate** requires a minimum percentage of validation checks to pass. **Field coverage** ensures that a minimum percentage of schema fields have non-empty values. A result must clear all three gates to be auto-approved."
3919
+ },
3920
+ {
3921
+ type: "paragraph",
3922
+ text: "Start with conservative thresholds \u2014 high confidence, high pass rate, high coverage \u2014 and loosen them as you gain trust in your extraction pipeline. Most teams begin with 90% confidence, 95% validation pass rate, and 80% field coverage, then adjust based on the volume of false positives in the approval queue."
3923
+ },
3924
+ {
3925
+ type: "paragraph",
3926
+ text: "Approval gates integrate directly with the delivery pipeline. When a result passes all gates, a `result.approved` signal is emitted automatically. Bind this signal to a destination to create a fully automated flow from document upload through extraction, validation, approval, and delivery \u2014 no manual steps required for high-confidence results."
3927
+ },
3191
3928
  {
3192
3929
  type: "callout",
3193
3930
  text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems."
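Reviewer note on the hunk above: the "all three gates must pass" rule can be sketched as below, using the starting thresholds quoted in the text (90% confidence, 95% validation pass rate, 80% field coverage). The property names and the 0 to 1 scale are assumptions, not the platform's actual configuration keys.

```javascript
// Hypothetical threshold names; values mirror the starting points in the text.
const thresholds = {
  minConfidence: 0.9,
  minValidationPassRate: 0.95,
  minFieldCoverage: 0.8,
};

// Each criterion is an independent gate; a result is auto-approved only
// when every gate passes, otherwise it goes to manual review.
function evaluateGates(result, limits) {
  const gates = {
    confidence: result.confidence >= limits.minConfidence,
    validationPassRate: result.validationPassRate >= limits.minValidationPassRate,
    fieldCoverage: result.fieldCoverage >= limits.minFieldCoverage,
  };
  const approved = Object.values(gates).every(Boolean);
  return { approved, queue: approved ? "auto_approved" : "manual_review", gates };
}

console.log(
  evaluateGates({ confidence: 0.97, validationPassRate: 0.92, fieldCoverage: 0.85 }, thresholds)
);
```

Returning the per-gate outcome alongside the decision makes it easier to see which threshold to loosen when too many good results land in manual review.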
@@ -3206,6 +3943,10 @@ var sections9 = [
3206
3943
  {
3207
3944
  question: "How do approval gates connect to delivery?",
3208
3945
  answer: "Bind a result.approved signal to a delivery destination to only ship approved rows to your downstream systems. This ensures only quality-checked data is delivered."
3946
+ },
3947
+ {
3948
+ question: "What thresholds should I start with?",
3949
+ answer: "Most teams start with 90% confidence, 95% validation pass rate, and 80% field coverage. Adjust based on the volume of false positives in the approval queue \u2014 loosen thresholds as you gain trust in your pipeline."
3209
3950
  }
3210
3951
  ],
3211
3952
  mentions: [
@@ -3230,6 +3971,22 @@ var sections9 = [
3230
3971
  {
3231
3972
  type: "paragraph",
3232
3973
  text: 'Filter the queue by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect the extracted values, provenance trails, and validation check results before approving or rejecting.'
3974
+ },
3975
+ {
3976
+ type: "paragraph",
3977
+ text: "The review detail view shows the extracted values alongside the source document, with provenance trails tracing each value back to its origin in the text. Validation check results are displayed inline \u2014 you can see exactly which rules passed and which failed before making your decision. Batch actions are available for approving or rejecting multiple items at once."
3978
+ },
3979
+ {
3980
+ type: "paragraph",
3981
+ text: "When you approve a result, a `result.approved` signal is emitted to the delivery pipeline. When you reject a result, a `result.rejected` signal fires instead. This event-driven design lets you build automated workflows that respond to review decisions \u2014 for example, routing approved records to a webhook and rejected records to a notification channel."
3982
+ },
3983
+ {
3984
+ type: "paragraph",
3985
+ text: "For best results, review flagged items first \u2014 these are records where at least one validation check failed, making them the most likely to contain errors. Most teams set a daily review cadence and use confidence range filters to prioritize the low-confidence items that need the most attention."

3986
+ },
3987
+ {
3988
+ type: "callout",
3989
+ text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click."
3233
3990
  }
3234
3991
  ],
3235
3992
  related: [
@@ -3245,6 +4002,10 @@ var sections9 = [
3245
4002
  {
3246
4003
  question: "How do I review items in the Approval Queue?",
3247
4004
  answer: 'Filter by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect extracted values, provenance trails, and validation check results before approving or rejecting.'
4005
+ },
4006
+ {
4007
+ question: "Can I batch approve or reject items?",
4008
+ answer: "Yes. Select multiple items in the queue and use the batch action buttons to approve or reject them all at once. Each item emits the appropriate delivery signal individually."
3248
4009
  }
3249
4010
  ],
3250
4011
  mentions: [
@@ -3309,6 +4070,18 @@ var sections10 = [
3309
4070
  {
3310
4071
  type: "paragraph",
3311
4072
  text: "Every attempt is logged in `delivery_items`. Terminal failures (retry exhausted or permanent 4xx) write a `delivery_dead_letter` row, which is replayable. The outbox, history, DLQ, and catalog are all accessible via the [`/v1/delivery/*` API](/docs)."
4073
+ },
4074
+ {
4075
+ type: "paragraph",
4076
+ text: "The four registries \u2014 signals, deliverables, serializers, and connectors \u2014 are fully orthogonal. Adding a new destination type does not require changes to the signal or serializer code. This composable design means you can mix any supported signal with any compatible serializer and connector without custom integration work."
4077
+ },
4078
+ {
4079
+ type: "paragraph",
4080
+ text: "For best results, start with a webhook destination to verify your binding configuration end-to-end. Once the payload shape and delivery cadence match your expectations, expand to file-based destinations (S3, SFTP) or spreadsheet destinations (Google Sheets). Most teams create separate bindings for different downstream consumers rather than routing all events to a single destination."
4081
+ },
4082
+ {
4083
+ type: "callout",
4084
+ text: "Delivery is at-least-once with deterministic idempotency keys. Receivers should use the `X-Talonic-Idempotency-Key` header (or equivalent metadata for file-based connectors) to deduplicate on their end."
3312
4085
  }
3313
4086
  ],
3314
4087
  related: [
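Reviewer note on the hunk above: since delivery is described as at-least-once, a receiver needs to deduplicate on the `X-Talonic-Idempotency-Key` header. A minimal receiver-side sketch follows, assuming header names arrive lowercased (as Node's HTTP server provides them) and using an in-memory set where a real receiver would use durable storage. `handleDelivery` and its return shape are hypothetical.

```javascript
// In-memory dedup store; a production receiver would persist seen keys.
const seenKeys = new Set();

// `headers` is assumed to use lowercased names.
function handleDelivery(headers, body, processPayload) {
  const key = headers["x-talonic-idempotency-key"];
  if (!key) return { status: 400, processed: false };
  if (seenKeys.has(key)) {
    // Duplicate delivery: acknowledge it so the sender stops retrying,
    // but do not process the payload a second time.
    return { status: 200, processed: false };
  }
  processPayload(body);
  seenKeys.add(key); // record the key only after processing succeeded
  return { status: 200, processed: true };
}

const received = [];
const headers = { "x-talonic-idempotency-key": "example-key-1" };
console.log(handleDelivery(headers, { id: 1 }, (b) => received.push(b)));
console.log(handleDelivery(headers, { id: 1 }, (b) => received.push(b)));
```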
@@ -3324,6 +4097,10 @@ var sections10 = [
3324
4097
  {
3325
4098
  question: "What happens when a delivery fails?",
3326
4099
  answer: "Failed deliveries retry with a backoff ladder. Terminal failures (retry exhausted or permanent 4xx) are written to the dead-letter queue (DLQ), which is fully replayable."
4100
+ },
4101
+ {
4102
+ question: "What serialization formats are supported?",
4103
+ answer: "Ten formats: json, ndjson, csv, csv_file, xlsx, rows, graph, raw, md, and txt. Each serializer declares which deliverable shapes it supports, and the compatibility triangle validates the combination at binding creation time."
3327
4104
  }
3328
4105
  ],
3329
4106
  mentions: [
@@ -3382,6 +4159,22 @@ var sections10 = [
3382
4159
  description: "Slice 2+. Structured data as email attachment."
3383
4160
  }
3384
4161
  ]
4162
+ },
4163
+ {
4164
+ type: "paragraph",
4165
+ text: "Each destination stores its connector type, configuration (URL, bucket, folder path), and optional authentication credentials. Webhook destinations support HMAC-SHA256 signing via a **signing secret** \u2014 every payload includes a signature header so your receiver can verify authenticity. File-based destinations (S3, SFTP, Google Drive) support configurable filename templates with token substitution for binding ID, timestamp, and idempotency key."
4166
+ },
4167
+ {
4168
+ type: "paragraph",
4169
+ text: "A single destination can back multiple bindings. For example, one S3 bucket destination can receive both `document.extracted` and `result.approved` events through separate bindings, each with its own serializer and field map. This keeps your destination inventory small while supporting diverse routing requirements."
4170
+ },
4171
+ {
4172
+ type: "paragraph",
4173
+ text: "For best results, always run a live-ping test after creating a destination. The test exercises the full transport envelope \u2014 SSRF validation, payload cap, and authentication \u2014 with a tiny test payload, so you catch configuration errors before real events start flowing. OAuth-based destinations (Google Drive, Google Sheets) require connecting your account first via the OAuth flow in the dashboard."
4174
+ },
4175
+ {
4176
+ type: "callout",
4177
+ text: "Destinations can be disabled without deleting them. Set **is_active** to false and no bindings will route events to the destination until you re-enable it."
3385
4178
  }
3386
4179
  ],
3387
4180
  related: [
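Reviewer note on the hunk above: the text says webhook payloads carry an HMAC-SHA256 signature header but does not name the header, the encoding, or the exact bytes that are signed. The sketch below therefore assumes a hex-encoded signature computed over the raw request body; check those details against the actual API reference before relying on it. `verifySignature` is a hypothetical helper.

```javascript
const nodeCrypto = require("node:crypto");

// Assumptions: the signature is hex-encoded and covers the raw request body.
function verifySignature(rawBody, signatureHex, signingSecret) {
  const expected = nodeCrypto
    .createHmac("sha256", signingSecret)
    .update(rawBody)
    .digest();
  const receivedSig = Buffer.from(String(signatureHex), "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    receivedSig.length === expected.length &&
    nodeCrypto.timingSafeEqual(receivedSig, expected)
  );
}

// Usage with a locally computed signature (placeholder secret and body).
const exampleSecret = "example-signing-secret";
const exampleBody = JSON.stringify({ event: "document.extracted" });
const exampleSignature = nodeCrypto
  .createHmac("sha256", exampleSecret)
  .update(exampleBody)
  .digest("hex");

console.log(verifySignature(exampleBody, exampleSignature, exampleSecret));
```

Verifying against the raw body, before any JSON parsing, avoids mismatches caused by re-serialization.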
@@ -3397,6 +4190,10 @@ var sections10 = [
3397
4190
  {
3398
4191
  question: "How do I test a destination?",
3399
4192
  answer: "Every destination supports a live-ping test via POST /v1/delivery/destinations/:id/test that exercises the full transport envelope with a tiny test payload."
4193
+ },
4194
+ {
4195
+ question: "Can one destination serve multiple bindings?",
4196
+ answer: "Yes. A single destination can back any number of bindings, each with its own signal filter, serializer, and field map. This lets you route different event types to the same endpoint with different payload shapes."
3400
4197
  }
3401
4198
  ],
3402
4199
  mentions: [
@@ -3423,6 +4220,22 @@ var sections10 = [
3423
4220
  {
3424
4221
  type: "paragraph",
3425
4222
  text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. Optional `delivery_policy` overrides the default retry ladder (6 attempts at `5s, 30s, 2min, 10min, 1h`) and timeout."
4223
+ },
4224
+ {
4225
+ type: "paragraph",
4226
+ text: "The compatibility triangle is enforced on every create and update. The backend checks that your chosen serializer supports the deliverable resolver's output shape, and that the connector accepts the serializer's format. If any predicate fails, the binding is rejected with a descriptive error \u2014 you never end up with a binding that cannot deliver."
4227
+ },
4228
+ {
4229
+ type: "paragraph",
4230
+ text: 'Use `field_map` to tailor the payload for each downstream consumer. **Rename** rules map internal field names to the receiver\'s expected names. **Drop** rules exclude fields the receiver does not need. **Static** rules inject constant values (e.g., a `source: "talonic"` tag) into every payload. These three operations compose in order: drop first, then rename, then static injection.'
4231
+ },
4232
+ {
4233
+ type: "paragraph",
4234
+ text: "For best results, create one binding per downstream consumer per event type. This gives you independent control over payload shape, retry policy, and serialization format for each integration point. Most teams start with a `document.extracted` binding to a webhook and expand to run-level and approval signals as their pipeline matures."
4235
+ },
4236
+ {
4237
+ type: "callout",
4238
+ text: "The binding editor in the dashboard walks you through the compatibility triangle step by step \u2014 only showing serializers and deliverables that are compatible with your chosen signal and destination."
3426
4239
  }
3427
4240
  ],
3428
4241
  related: [
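Reviewer note on the hunk above: the stated composition order for `field_map` (drop first, then rename, then static injection) can be illustrated as follows. The object shape with `drop`, `rename`, and `static` keys is an assumption; the diff only names the three rule kinds.

```javascript
// Hypothetical field_map shape: { drop: [...], rename: {...}, static: {...} }.
function applyFieldMap(payload, fieldMap) {
  const { drop = [], rename = {}, static: statics = {} } = fieldMap;
  const out = {};
  for (const [key, value] of Object.entries(payload)) {
    if (drop.includes(key)) continue; // 1. drop
    const renamed = Object.prototype.hasOwnProperty.call(rename, key)
      ? rename[key]
      : key;
    out[renamed] = value; // 2. rename
  }
  return { ...out, ...statics }; // 3. static injection
}

console.log(
  applyFieldMap(
    { vendor_name: "ACME", total: "100.00", internal_score: 0.42 },
    {
      drop: ["internal_score"],
      rename: { vendor_name: "supplier" },
      static: { source: "talonic" },
    }
  )
);
```

Because drop runs before rename in this sketch, drop rules refer to the internal field names, not the receiver's names.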
@@ -3438,6 +4251,10 @@ var sections10 = [
3438
4251
  {
3439
4252
  question: "Can I customize the delivery payload?",
3440
4253
  answer: "Yes. Use field_map to rename, drop, or add static fields without custom code. Use delivery_policy to override the default retry ladder and timeout."
4254
+ },
4255
+ {
4256
+ question: "What is the compatibility triangle?",
4257
+ answer: "The compatibility triangle validates that the signal, deliverable resolver, serializer, and connector all form a compatible combination. The backend enforces this on every binding create and update to prevent misconfigured delivery routes."
3441
4258
  }
3442
4259
  ],
3443
4260
  mentions: [
@@ -3520,6 +4337,22 @@ var sections10 = [
3520
4337
  description: "Fired after a terminal delivery failure."
3521
4338
  }
3522
4339
  ]
4340
+ },
4341
+ {
4342
+ type: "paragraph",
4343
+ text: "Signals are typed events emitted by the platform when meaningful state changes occur. Document-level signals fire on extraction success or failure. Run-level signals fire when a job completes across dataspace, structuring, resolution, or extraction runs. Result-level signals fire when a reviewer approves, rejects, or flags a record."
4344
+ },
4345
+ {
4346
+ type: "paragraph",
4347
+ text: "The two `delivery.item.*` entries are **meta-signals** \u2014 they fire when a delivery itself succeeds or fails. Use them for self-monitoring: bind `delivery.item.failed` to a notification webhook to receive alerts when deliveries break. The poller includes built-in loop prevention so a failed meta-signal delivery does not emit another meta-signal."
4348
+ },
4349
+ {
4350
+ type: "paragraph",
4351
+ text: "For best results, use the catalog API to populate dropdown menus and configuration forms rather than hardcoding signal or deliverable lists. The catalog always reflects the running registry contents, so new signal types and deliverables appear automatically as the platform evolves."
4352
+ },
4353
+ {
4354
+ type: "callout",
4355
+ text: "The catalog API exposes four endpoints: `/v1/delivery/catalog/signals`, `/v1/delivery/catalog/deliverables`, `/v1/delivery/catalog/serializers`, and `/v1/delivery/catalog/connectors`. Each returns the full registry for that category."
3523
4356
  }
3524
4357
  ],
3525
4358
  related: [
@@ -3535,6 +4368,10 @@ var sections10 = [
3535
4368
  {
3536
4369
  question: "How do I discover available signals and deliverables?",
3537
4370
  answer: "Use the catalog API at /v1/delivery/catalog/* which exposes the four registries (signals, deliverables, serializers, connectors) that drive the binding picker."
4371
+ },
4372
+ {
4373
+ question: "What are meta-signals?",
4374
+ answer: "Meta-signals (delivery.item.completed and delivery.item.failed) fire when a delivery attempt itself succeeds or fails. Use them for self-monitoring \u2014 for example, binding delivery.item.failed to a notification webhook for delivery failure alerts."
3538
4375
  }
3539
4376
  ],
3540
4377
  mentions: [
@@ -3555,6 +4392,22 @@ var sections10 = [
3555
4392
  {
3556
4393
  type: "paragraph",
3557
4394
  text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies. Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt with a fresh idempotency key. Nothing in history is ever mutated; the log is strictly append-only."
4395
+ },
4396
+ {
4397
+ type: "paragraph",
4398
+ text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration."
4399
+ },
4400
+ {
4401
+ type: "paragraph",
4402
+ text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error, fix the destination configuration, and replay the delivery with a single click or API call."
4403
+ },
4404
+ {
4405
+ type: "paragraph",
4406
+ text: "For best results, monitor the DLQ regularly and set up a `delivery.item.failed` meta-signal binding to receive alerts when deliveries fail terminally. Most teams configure a notification webhook for this signal so they are notified immediately rather than discovering failures during a manual review. Request and response bodies older than the configured retention period are automatically cleaned up, but row metadata (status, error code, duration) is retained indefinitely."
4407
+ },
4408
+ {
4409
+ type: "callout",
4410
+ text: "Retries of a delivery reuse its deterministic idempotency key, so receivers that deduplicate on the key will not process the same delivery twice. A replay is enqueued as a new attempt with a fresh key, so receivers treat it as a new delivery."
3558
4411
  }
3559
4412
  ],
3560
4413
  related: [
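Reviewer note on the hunk above: the path from a failed attempt to either a retry or the dead letter queue can be sketched with the default ladder quoted elsewhere in this diff (6 attempts at 5s, 30s, 2min, 10min, 1h). Two assumptions are made here and are not confirmed by the source: the first attempt is immediate, with the five delays separating the six attempts, and every 4xx except 408 and 429 counts as permanent.

```javascript
// Default ladder from the text, in seconds. Assumption: attempt 1 is
// immediate and the five delays separate attempts 1 through 6.
const RETRY_DELAYS_SECONDS = [5, 30, 120, 600, 3600];

// Assumption: 4xx responses are permanent except 408 and 429.
function isPermanentFailure(httpStatus) {
  return (
    httpStatus >= 400 &&
    httpStatus < 500 &&
    httpStatus !== 408 &&
    httpStatus !== 429
  );
}

// Decide what happens after attempt `attemptNumber` (1-based) finished.
function nextStep(attemptNumber, httpStatus) {
  if (httpStatus >= 200 && httpStatus < 300) return { action: "done" };
  if (isPermanentFailure(httpStatus)) {
    return { action: "dead_letter", reason: "permanent_4xx" };
  }
  const delaySeconds = RETRY_DELAYS_SECONDS[attemptNumber - 1];
  if (delaySeconds === undefined) {
    return { action: "dead_letter", reason: "retries_exhausted" };
  }
  return { action: "retry", delaySeconds };
}

console.log(nextStep(1, 503));
console.log(nextStep(6, 503));
console.log(nextStep(1, 401));
```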
@@ -3570,6 +4423,10 @@ var sections10 = [
3570
4423
  {
3571
4424
  question: "What is the dead letter queue (DLQ)?",
3572
4425
  answer: "Terminal failures (retry ladder exhausted or permanent 4xx) escalate to /v1/delivery/dlq. DLQ entries are fully replayable \u2014 replay enqueues a fresh attempt with a new idempotency key."
4426
+ },
4427
+ {
4428
+ question: "How long are request and response bodies retained?",
4429
+ answer: "Request and response bodies are cleaned up after the configured retention period (default 30 days). Row metadata \u2014 status, HTTP code, error code, and duration \u2014 is retained indefinitely for audit purposes."
3573
4430
  }
3574
4431
  ],
3575
4432
  mentions: [
@@ -3604,6 +4461,10 @@ var sections11 = [
3604
4461
  type: "paragraph",
3605
4462
  text: "Dialects ensure consistency across all your structured output. When your downstream systems expect dates in `YYYY-MM-DD` format, numbers with `.` as the decimal separator, and CSVs delimited by `;`, you configure this once in the shared dialect rather than repeating it in every schema."
3606
4463
  },
4464
+ {
4465
+ type: "paragraph",
4466
+ text: "Most teams configure their shared dialect during initial workspace setup and rarely change it afterward. If your organization operates across regions with different formatting conventions, create separate workspaces with region-specific dialects rather than overriding at the schema level. This keeps the configuration clean and avoids inconsistencies in delivered data."
4467
+ },
3607
4468
  {
3608
4469
  type: "list",
3609
4470
  ordered: false,
@@ -3674,6 +4535,10 @@ var sections11 = [
3674
4535
  type: "paragraph",
3675
4536
  text: "The lookup convention follows a `key` / `value` structure where the `key` is the output code and the `value` is the human-readable label. During extraction, the platform maps FROM labels found in documents TO the canonical codes defined in the reference primitive. This ensures consistent, machine-readable output regardless of how values appear in source documents."
3676
4537
  },
4538
+ {
4539
+ type: "paragraph",
4540
+ text: "For best results, keep reference primitives focused on a single domain \u2014 for example, one primitive for country codes, another for currency codes, and another for product categories. This makes each primitive reusable across multiple schemas and simplifies maintenance. When updating a primitive, test the new version against a few sample documents before updating the version reference in production schemas."
4541
+ },
3677
4542
  {
3678
4543
  type: "callout",
3679
4544
  variant: "info",
@@ -3741,6 +4606,10 @@ var sections11 = [
3741
4606
  type: "paragraph",
3742
4607
  text: "Change review is particularly important for workspaces that feed downstream systems through delivery bindings. A small change to a schema field mapping or a reference primitive value can ripple through to every document processed after that point. The review process creates a checkpoint where a second pair of eyes can verify the change before it goes live."
3743
4608
  },
4609
+ {
4610
+ type: "paragraph",
4611
+ text: "Most teams enable change review as soon as their workspace transitions from development to production. During the initial setup phase, you can leave it disabled for faster iteration. Once your schemas, dialects, and reference primitives are stable and data is flowing to downstream systems, enable change review to protect against accidental modifications that could disrupt live pipelines."
4612
+ },
3744
4613
  {
3745
4614
  type: "list",
3746
4615
  ordered: false,
@@ -3807,6 +4676,14 @@ var sections12 = [
3807
4676
  type: "paragraph",
3808
4677
  text: "Omnisearch is designed to be the single entry point for finding anything in the platform. Rather than navigating to specific pages to search within them, Omnisearch queries a **materialized values index** that aggregates data across all your content. Results are grouped by category so you can quickly distinguish between a document match and a field name match."
3809
4678
  },
4679
+ {
4680
+ type: "paragraph",
4681
+ text: "The materialized values index is rebuilt automatically whenever documents are processed or schemas change, so search results are always current. There is no manual reindex step \u2014 new documents become searchable as soon as extraction completes. This makes Omnisearch reliable even during high-volume ingestion periods."
4682
+ },
4683
+ {
4684
+ type: "paragraph",
4685
+ text: "For best results, use Omnisearch as your primary navigation tool. Instead of browsing through document lists or clicking through the sidebar, press `Cmd+K` and type what you are looking for \u2014 whether it is a specific invoice number, a field name, or a schema title. Most users find that Omnisearch is faster than manual navigation for any task beyond browsing the most recent documents."
4686
+ },
3810
4687
  {
3811
4688
  type: "callout",
3812
4689
  variant: "info",
@@ -3883,6 +4760,10 @@ var sections12 = [
3883
4760
  {
3884
4761
  type: "paragraph",
3885
4762
  text: "Filter state is encoded in the URL query string; the backend generates the corresponding SQL dynamically. Because the filter lives in the URL, you can bookmark filtered views, share them with teammates via a link, or save them as **presets** for one-click access to commonly used queries."
4763
+ },
4764
+ {
4765
+ type: "paragraph",
4766
+ text: 'For best results, save your most common filter combinations as presets. Most teams create presets for categories like "high-value invoices this quarter," "documents missing key fields," or "recently failed extractions." Presets appear as one-click buttons on the Documents page, eliminating the need to rebuild complex filter conditions from scratch each time.'
3886
4767
  }
3887
4768
  ],
3888
4769
  related: [
@@ -3937,6 +4818,19 @@ var sections13 = [
3937
4818
  type: "paragraph",
3938
4819
  text: "Manage API keys from **Settings → API Keys**. Keys are prefixed with `tlnc_` and passed via `Authorization: Bearer`. Keys are SHA-256 hashed \u2014 the full key is only shown once at creation."
3939
4820
  },
4821
+ {
4822
+ type: "paragraph",
4823
+ text: "Each API key is assigned one or more scopes that control what operations it can perform. Scopes follow the principle of least privilege \u2014 create a key with only the scopes your integration needs. For example, a read-only dashboard integration only needs the `read` scope, while an automated ingestion pipeline needs `extract` and `read`."
4824
+ },
4825
+ {
4826
+ type: "paragraph",
4827
+ text: "For best results, create separate API keys for each integration or service that connects to your Talonic workspace. This makes it easy to rotate or revoke a single key without disrupting other integrations. Most teams maintain one key for their ingestion pipeline, one for their BI dashboard, and one for webhook-based automations."
4828
+ },
4829
+ {
4830
+ type: "callout",
4831
+ variant: "warning",
4832
+ text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated."
4833
+ },
3940
4834
  {
3941
4835
  type: "param-table",
3942
4836
  title: "API key scopes",
@@ -3972,6 +4866,10 @@ var sections13 = [
3972
4866
  {
3973
4867
  question: "What scopes are available for API keys?",
3974
4868
  answer: "Three scopes: extract (use extraction API), read (read documents, extractions, schemas, jobs), and write (create and modify resources)."
4869
+ },
4870
+ {
4871
+ question: "Can I have multiple API keys?",
4872
+ answer: "Yes. You can create as many API keys as needed. Best practice is to create separate keys for each integration so you can rotate or revoke them independently without disrupting other services."
3975
4873
  }
3976
4874
  ],
3977
4875
  mentions: ["API keys", "tlnc_", "SHA-256", "Bearer token", "scopes"]
@@ -3983,6 +4881,27 @@ var sections13 = [
3983
4881
  seoTitle: "Public REST API Overview \u2014 Talonic Docs",
3984
4882
  description: "Full REST API with 20+ namespaces: extract, documents, extractions, schemas, jobs, sources, delivery, linking, matching, batches, cases, quality, and more. Cursor pagination.",
3985
4883
  content: [
4884
+ {
4885
+ type: "paragraph",
4886
+ text: "Talonic exposes a comprehensive REST API with 20+ namespaces covering every aspect of the platform \u2014 from document extraction and schema management to delivery, matching, and quality benchmarking. All endpoints use JSON request and response bodies with cursor-based pagination for list operations."
4887
+ },
4888
+ {
4889
+ type: "paragraph",
4890
+ text: "The API follows standard REST conventions. Authenticate with a `tlnc_` API key via the `Authorization: Bearer` header. Most resources support full CRUD operations, and long-running tasks like matching runs and batch inference are handled asynchronously with polling endpoints for status and progress."
4891
+ },
4892
+ {
4893
+ type: "paragraph",
4894
+ text: "Use the public API to build automated ingestion pipelines, integrate extraction results into downstream systems, or orchestrate complex workflows that combine multiple platform features. The API mirrors every action available in the web interface, so anything you can do manually can be fully automated."
4895
+ },
4896
+ {
4897
+ type: "paragraph",
4898
+ text: "For best results, start with the `/v1/extract` endpoint for document ingestion, then use `/v1/documents` and `/v1/extractions` to retrieve results. As your integration matures, explore delivery bindings, matching configurations, and batch processing to build a fully automated data pipeline."
4899
+ },
4900
+ {
4901
+ type: "callout",
4902
+ variant: "info",
4903
+ text: "See the full [API Documentation](/docs) for detailed endpoint specifications, request/response examples, and authentication guides. The API reference is organized by namespace and includes every parameter, status code, and error response."
4904
+ },
3986
4905
  {
3987
4906
  type: "param-table",
3988
4907
  title: "API namespaces",
@@ -4103,6 +5022,10 @@ var sections13 = [
4103
5022
  {
4104
5023
  question: "Where can I find detailed API documentation?",
4105
5024
  answer: "See the full API Documentation at /docs for complete endpoint documentation with request/response examples, parameter descriptions, and authentication details."
5025
+ },
5026
+ {
5027
+ question: "How does pagination work in the API?",
5028
+ answer: "List endpoints use cursor-based pagination. Each response includes a cursor token that you pass as a query parameter to fetch the next page. This approach is more reliable than offset-based pagination when documents are being added or removed concurrently."
4106
5029
  }
4107
5030
  ],
4108
5031
  mentions: [
@@ -4128,6 +5051,14 @@ var sections13 = [
4128
5051
  type: "paragraph",
4129
5052
  text: "The webhook connector is configured as a **delivery destination**. Bind any of the signal types below to a webhook destination to receive real-time notifications. See `/v1/delivery/catalog/signals` for the exhaustive list."
4130
5053
  },
5054
+ {
5055
+ type: "paragraph",
5056
+ text: "When a webhook fires, the platform constructs the payload from the signal data, signs it with your destination's HMAC-SHA256 signing secret, and delivers it via HTTPS POST. Each delivery includes an idempotency key in the headers so your receiver can safely deduplicate retries. Failed deliveries follow an exponential backoff schedule, and terminal failures are routed to the dead-letter queue for manual replay."
5057
+ },
5058
+ {
5059
+ type: "paragraph",
5060
+ text: "Use webhooks when your downstream system needs to react immediately to platform events \u2014 for example, triggering an ERP import when a document is extracted, or notifying a Slack channel when a reviewer rejects a record. For bulk or periodic data transfers, consider using the SFTP, S3, or cloud storage delivery connectors instead."
5061
+ },
4131
5062
  {
4132
5063
  type: "param-table",
4133
5064
  title: "Delivery signal types (webhook-compatible)",
@@ -4203,6 +5134,10 @@ var sections13 = [
4203
5134
  {
4204
5135
  question: "What happens when a webhook delivery fails?",
4205
5136
  answer: "Failed webhook deliveries retry with exponential backoff. Terminal failures (retry exhausted or permanent 4xx) escalate to the dead-letter queue for manual replay."
5137
+ },
5138
+ {
5139
+ question: "How do I verify webhook signatures?",
5140
+ answer: "Each webhook payload is signed with HMAC-SHA256 using the signing secret from your delivery destination configuration. Compute the HMAC of the raw request body and compare it to the signature header to verify authenticity. This ensures the payload was sent by Talonic and was not tampered with in transit."
4206
5141
  }
4207
5142
  ],
4208
5143
  mentions: [
@@ -4262,6 +5197,10 @@ var sections14 = [
4262
5197
  type: "paragraph",
4263
5198
  text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. Manage from the Team page."
4264
5199
  },
5200
+ {
5201
+ type: "paragraph",
5202
+ text: "When a team member is removed, their access is revoked immediately but their past actions \u2014 edits, uploads, approvals, and review decisions \u2014 remain in the audit trail. This preserves data integrity and compliance history. Removed users can be re-added later through the same domain matching process if needed."
5203
+ },
4265
5204
  {
4266
5205
  type: "callout",
4267
5206
  variant: "info",
@@ -4329,6 +5268,14 @@ var sections14 = [
4329
5268
  type: "paragraph",
4330
5269
  text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events."
4331
5270
  },
5271
+ {
5272
+ type: "paragraph",
5273
+ text: "Behind the scenes, every LLM and OCR call is logged with full detail \u2014 the model used, input and output token counts, latency, and computed cost. This data powers both the per-feature breakdown and the individual call log. The system tracks costs across extraction, OCR, batch inference, matching AI resolution, and quality passes so you always know where your spend is going."
5274
+ },
5275
+ {
5276
+ type: "paragraph",
5277
+ text: "Most teams review the daily cost chart weekly to establish a usage baseline. Unexpected spikes usually correlate with large document uploads or batch completions. For organizations managing multiple workspaces, the **Master view** provides a single pane of glass showing per-customer breakdowns and platform-wide aggregates \u2014 accessible only to platform administrators."
5278
+ },
4332
5279
  {
4333
5280
  type: "param-table",
4334
5281
  title: "Usage views",
@@ -4404,6 +5351,14 @@ var sections14 = [
4404
5351
  type: "paragraph",
4405
5352
  text: "The Admin Panel is the central hub for platform-wide operations. **Customer management** lets you create, view, and delete organizations. **User management** provides a cross-tenant view of all platform users with the ability to remove accounts. The **data clear & rebuild** function wipes all data for a specific customer and reprocesses from scratch \u2014 useful during onboarding or after significant schema changes."
4406
5353
  },
5354
+ {
5355
+ type: "paragraph",
5356
+ text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs."
5357
+ },
5358
+ {
5359
+ type: "paragraph",
5360
+ text: "For best results, limit Admin Panel access to a small group of trusted platform operators. Use the **master registry** view to audit field definitions and schemas across tenants \u2014 this is particularly useful when standardizing extraction configurations or troubleshooting cross-tenant data quality issues."
5361
+ },
4407
5362
  {
4408
5363
  type: "list",
4409
5364
  ordered: false,
@@ -4463,6 +5418,18 @@ var sections14 = [
4463
5418
  type: "paragraph",
4464
5419
  text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows."
4465
5420
  },
5421
+ {
5422
+ type: "paragraph",
5423
+ text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused."
5424
+ },
5425
+ {
5426
+ type: "paragraph",
5427
+ text: "The most frequently used shortcut is **Omnisearch** (`Cmd+K` / `Ctrl+K`), which opens a global search overlay that queries documents, extracted values, field names, schemas, and sources simultaneously. Power users rely on it to navigate the platform faster than clicking through the sidebar."
5428
+ },
5429
+ {
5430
+ type: "paragraph",
5431
+ text: "For best results, build muscle memory around the three core shortcuts. Use `Cmd+K` to find anything, `Cmd+J` to upload a document on the fly, and `Escape` to dismiss any overlay or modal. These three actions cover the most common interruptions during a review or configuration session."
5432
+ },
4466
5433
  {
4467
5434
  type: "param-table",
4468
5435
  title: "Shortcuts",
@@ -4533,6 +5500,14 @@ var sections15 = [
4533
5500
  type: "paragraph",
4534
5501
  text: "Under the hood, batch inference leverages the provider's native batch API (Anthropic Message Batches or AWS Bedrock invocation jobs). Documents accumulate in a queue and are submitted together, allowing the provider to schedule processing during off-peak capacity. This is why the cost reduction is possible without any loss in extraction quality."
4535
5502
  },
5503
+ {
5504
+ type: "paragraph",
5505
+ text: "Batch mode is best suited for backlog ingestion, periodic bulk uploads, and any scenario where results are not needed in real time. Most teams use batch mode for overnight processing of large document volumes and reserve real-time processing for time-sensitive documents that need immediate attention."
5506
+ },
5507
+ {
5508
+ type: "paragraph",
5509
+ text: "When batch results arrive, they pass through the same post-processing pipeline as real-time extractions \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation. The only difference is that LLM-based quality passes (field estimation, verification, cross-reference enrichment) are skipped in batch mode to preserve the cost savings."
5510
+ },
4536
5511
  {
4537
5512
  type: "list",
4538
5513
  ordered: false,
@@ -4609,6 +5584,10 @@ var sections15 = [
4609
5584
  {
4610
5585
  type: "paragraph",
4611
5586
  text: "While waiting for batch results, documents show a status of `batch_queued`. Once the provider returns results, the platform applies them through the same post-processing pipeline as real-time extraction \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation."
5587
+ },
5588
+ {
5589
+ type: "paragraph",
5590
+ text: "You can also enable batch mode on a per-source basis. When a source connection has the batch processing toggle enabled, all documents ingested through that source are automatically routed to the batch queue. This is ideal for source connections that handle non-urgent, high-volume ingestion \u2014 such as a shared drive that collects documents overnight."
4612
5591
  }
4613
5592
  ],
4614
5593
  related: [
@@ -4658,6 +5637,14 @@ var sections15 = [
4658
5637
  type: "paragraph",
4659
5638
  text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents and the batch transitions to **completed** status."
4660
5639
  },
5640
+ {
5641
+ type: "paragraph",
5642
+ text: "The batch detail view shows individual items within a batch, including which documents are included, their current processing state, and any errors that occurred. Use this view to verify that a specific document was included in the expected batch and to troubleshoot items that failed to parse."
5643
+ },
5644
+ {
5645
+ type: "paragraph",
5646
+ text: "The platform includes built-in crash recovery for batch processing. If the application restarts while a batch is in a transient `processing` state, the recovery logic automatically reverts it to `submitted` so the next polling cycle can retry. This means batch jobs are resilient to infrastructure disruptions without requiring manual intervention."
5647
+ },
4661
5648
  {
4662
5649
  type: "param-table",
4663
5650
  title: "Batch statuses",
@@ -4737,6 +5724,14 @@ var sections16 = [
4737
5724
  type: "paragraph",
4738
5725
  text: 'Reference data is the foundation of the matching system. It represents your "ground truth" \u2014 the known records you want to match extracted document data against. Common examples include customer lists, product catalogs, vendor registries, and contract databases.'
4739
5726
  },
5727
+ {
5728
+ type: "paragraph",
5729
+ text: "When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. Each dataset is versioned independently, so you can update your reference data without affecting in-progress matching configurations. A single dataset can be shared across multiple schemas and matching configurations."
5730
+ },
5731
+ {
5732
+ type: "paragraph",
5733
+ text: "For best results, ensure your reference data is clean and deduplicated before uploading. Include all columns that you plan to match against \u2014 such as names, identifiers, dates, and amounts. Most teams refresh their reference data periodically by re-uploading from their source system or by using the SQL import option to pull directly from a connected database."
5734
+ },
4740
5735
  {
4741
5736
  type: "callout",
4742
5737
  variant: "info",
@@ -4830,6 +5825,10 @@ var sections16 = [
4830
5825
  type: "paragraph",
4831
5826
  text: "Each field comparison carries a **weight** that determines how much it contributes to the overall confidence score. Set high weights on fields that are strong identifiers (like reference numbers or unique IDs) and lower weights on fields that are common or prone to variation (like names or descriptions). The weighted aggregate produces a final score between 0% and 100%."
4832
5827
  },
5828
+ {
5829
+ type: "paragraph",
5830
+ text: "Most teams start with AI strategy generation and then fine-tune weights based on initial results. A common pattern is to set a high weight on a unique identifier field (like a PO number) with `exact` strategy, combined with lower-weighted `fuzzy` matches on name and description fields as supporting evidence. Review the first batch of results to calibrate thresholds before running at scale."
5831
+ },
4833
5832
  {
4834
5833
  type: "callout",
4835
5834
  variant: "info",
@@ -4884,6 +5883,14 @@ var sections16 = [
4884
5883
  type: "paragraph",
4885
5884
  text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based search with a Haiku LLM resolver attempts to improve low-confidence results."
4886
5885
  },
5886
+ {
5887
+ type: "paragraph",
5888
+ text: "Matching runs are processed asynchronously via a dedicated job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining."
5889
+ },
5890
+ {
5891
+ type: "paragraph",
5892
+ text: "For best results, start with a manual run to establish a baseline, then use a smart run if many documents have low-confidence matches. Smart runs take longer because the AI resolver evaluates each ambiguous candidate, but they can significantly improve match quality for data with inconsistent formatting, abbreviations, or multilingual content."
5893
+ },
4887
5894
  {
4888
5895
  type: "list",
4889
5896
  ordered: true,
@@ -4941,6 +5948,14 @@ var sections16 = [
4941
5948
  type: "paragraph",
4942
5949
  text: "The evidence view is designed to make match decisions transparent. For each candidate, you can see exactly which fields matched, what strategy was used, the individual field score, and the actual values that were compared. This makes it straightforward to verify correct matches and investigate false positives."
4943
5950
  },
5951
+ {
5952
+ type: "paragraph",
5953
+ text: "Approved matches flow downstream into delivery pipelines, where they can be included in structured exports alongside extraction data. Rejected matches are excluded from future consideration for that document, which helps the system learn from your decisions when running subsequent matching passes."
5954
+ },
5955
+ {
5956
+ type: "paragraph",
5957
+ text: "When reviewing results, focus on documents where the top candidate has a confidence score between 50% and 85% \u2014 these are the borderline cases that benefit most from human judgment. High-confidence matches (above 85%) are usually correct, while very low scores (below 30%) typically indicate no valid match exists in the reference data."
5958
+ },
4944
5959
  {
4945
5960
  type: "param-table",
4946
5961
  title: "Result fields",