@talonic/docs 0.20.12 → 0.20.14

package/dist/content.js CHANGED
@@ -422,15 +422,19 @@ var sections = [
  content: [
  {
  type: "paragraph",
- text: "Transform unstructured documents into structured, validated data \u2014 with per-cell provenance and AI reasoning traces."
+ text: "Transform unstructured documents into structured, validated data \u2014 with per-cell provenance and AI reasoning traces. Whether you are processing invoices, contracts, purchase orders, or any other document type, Talonic automatically discovers every data point and maps it to a unified knowledge graph. The platform handles the entire pipeline from OCR and classification through to structured output delivery, so you can focus on defining what data you need rather than building extraction logic."
  },
  {
  type: "paragraph",
- text: "**Supported Formats:** 25+ file types. **Resolution:** 4-phase pipeline. **Instant Matches:** ~30% of cells (free)."
+ text: "**Supported Formats:** 25+ file types. **Resolution:** 4-phase pipeline. **Instant Matches:** ~30% of cells (free). The platform ingests PDF, DOCX, XLSX, images (PNG, JPG), HTML, JSON, CSV, email formats (EML, MSG), and ZIP archives. ZIP files are unpacked automatically and each contained document is processed individually through the same pipeline."
  },
  {
  type: "paragraph",
- text: "Talonic is an **agentic data structuring platform**. It ingests documents of any type, discovers every data point inside them, builds a knowledge graph of canonical fields, and deploys AI agents to fill structured output schemas. Every cell in the output carries provenance metadata \u2014 which pipeline phase filled it, the confidence score, and an AI reasoning trace linking back to the source document."
+ text: "Talonic is an **agentic data structuring platform**. It ingests documents of any type, discovers every data point inside them, builds a knowledge graph of canonical fields, and deploys AI agents to fill structured output schemas. Every cell in the output carries provenance metadata \u2014 which pipeline phase filled it, the confidence score, and an AI reasoning trace linking back to the source document. The knowledge graph grows with every document you process \u2014 fields are clustered semantically, promoted through tiers, and enriched with master extraction instructions. This accumulated intelligence means extraction accuracy and speed improve over time without any manual tuning."
+ },
+ {
+ type: "paragraph",
+ text: "The platform is organized around three sections accessible from the sidebar: **Sources** (where documents enter the system), **Structuring** (where you define schemas, run extraction jobs, and review results), and **Outputs** (where approved data is delivered to downstream systems). Most teams start by uploading a small sample of documents, reviewing the auto-generated schemas, and then creating targeted extraction jobs. As the knowledge graph matures, an increasing share of cells are resolved instantly through graph lookup rather than AI extraction."
  },
  {
  type: "list",
@@ -440,13 +444,19 @@ var sections = [
  "**4-phase extraction pipeline** \u2014 resolve from the knowledge graph, extract with AI agents, re-resolve, then transform and validate.",
  "**~30% instant matches** \u2014 cells filled from graph lookup are free and instant, reducing both cost and latency.",
  "**Per-cell provenance** \u2014 every value traces back to its source with confidence scores and reasoning.",
- "**Batch mode** \u2014 process large backlogs at 50% cost with a 48-hour delivery window."
+ "**Batch mode** \u2014 process large backlogs at 50% cost with a 48-hour delivery window.",
+ "**10 source connectors** \u2014 Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion, SQL databases, Amazon S3, and Azure Blob Storage.",
+ "**6 delivery connectors** \u2014 webhook (HMAC-SHA256), SFTP, Amazon S3, Azure Blob, Google Drive, and OneDrive for pushing results downstream."
  ]
  },
+ {
+ type: "paragraph",
+ text: "The extraction pipeline uses a cascading design that minimizes both cost and latency. Phase 1 fills cells using deterministic graph lookups at zero AI cost. Phase 2 deploys AI agents only for cells that Phase 1 could not resolve. Phase 3 re-runs lookups on AI output to normalize values to canonical codes. Phase 4 applies transforms, validates constraints, and fills remaining gaps with full grid context. This approach ensures that the cheapest, fastest method always runs first."
+ },
  {
  type: "callout",
  variant: "info",
- text: "Talonic uses Anthropic Claude for intelligent extraction and reasoning. The platform handles OCR, classification, field discovery, and schema generation automatically \u2014 you provide documents and define what output you need."
+ text: "Talonic uses Anthropic Claude for intelligent extraction and reasoning. The platform handles OCR, classification, field discovery, and schema generation automatically \u2014 you provide documents and define what output you need. Document AI (Mistral) handles OCR by default, with automatic fallback to the Talonic API for unsupported formats."
  }
  ],
  related: [
@@ -457,19 +467,19 @@ var sections = [
  faq: [
  {
  question: "What is the Talonic Platform?",
- answer: "Talonic is an agentic data structuring platform that ingests unstructured documents, discovers fields, builds a knowledge graph, and produces structured output with per-cell provenance and confidence scores."
+ answer: "Talonic is an agentic data structuring platform that ingests unstructured documents, discovers fields, builds a knowledge graph, and produces structured output with per-cell provenance and confidence scores. It supports 25+ file formats and uses a 4-phase extraction pipeline that combines deterministic graph lookups with AI agents to maximize accuracy while minimizing cost."
  },
  {
  question: "How many file formats does Talonic support?",
- answer: "Talonic supports 25+ file types including PDF, DOCX, XLSX, images (PNG, JPG), plain text, HTML, JSON, CSV, email formats (EML, MSG), and ZIP archives."
+ answer: "Talonic supports 25+ file types including PDF, DOCX, XLSX, images (PNG, JPG), plain text, HTML, JSON, CSV, email formats (EML, MSG), and ZIP archives. ZIP files are unpacked automatically and each contained document is processed individually through the full extraction pipeline."
  },
  {
  question: 'What does "per-cell provenance" mean?',
- answer: "Every cell in the structured output carries metadata about which pipeline phase filled it, a confidence score, an AI reasoning trace, and references back to the source document. This makes every value auditable and explainable."
+ answer: "Every cell in the structured output carries metadata about which pipeline phase filled it, a confidence score, an AI reasoning trace, and references back to the source document. This makes every value auditable and explainable. You can hover any cell to see its confidence score and click it to expand the full provenance panel with reasoning details."
  },
  {
  question: "How much do instant graph matches cost?",
- answer: "Graph matches (approximately 30% of cells) are free. They are filled from the knowledge graph through deterministic lookup, so no LLM call is needed. Only cells that require AI extraction incur cost."
+ answer: "Graph matches (approximately 30% of cells) are free. They are filled from the knowledge graph through deterministic lookup, so no LLM call is needed. Only cells that require AI extraction incur cost. As your knowledge graph grows from processing more documents, the percentage of free graph matches increases, further reducing per-job costs."
  }
  ],
  mentions: [
@@ -564,11 +574,11 @@ var sections = [
  faq: [
  {
  question: "What is the Field Registry?",
- answer: "The Field Registry is a unified knowledge graph of all canonical fields discovered across your documents, organized by tier, clustered semantically, and enriched with master extraction instructions."
+ answer: "The Field Registry is a unified knowledge graph of all canonical fields discovered across your documents, organized by tier, clustered semantically, and enriched with master extraction instructions. Fields progress through three tiers as they mature: Tier 3 (emerging, newly discovered), Tier 2 (established, promoted after repeated occurrence), and Tier 1 (universal, core fields present across most document types). Each tier transition triggers instruction synthesis so the platform learns the optimal way to extract that field."
  },
  {
  question: "What is provenance in Talonic?",
- answer: "Provenance is per-cell metadata that tracks which pipeline phase filled the value, the confidence score, an AI reasoning trace, and source references back to the original document."
+ answer: "Provenance is per-cell metadata that tracks which pipeline phase filled the value, the confidence score, an AI reasoning trace, and source references back to the original document. You can inspect provenance by hovering any cell in the job results grid to see its confidence score, then clicking to expand the full provenance panel. The panel shows which strategy resolved the value, the raw source text it was derived from, and the AI reasoning chain when applicable."
  },
  {
  question: "How do Cases form?",
@@ -727,11 +737,11 @@ var sections = [
  faq: [
  {
  question: "What is the fastest way to get started with Talonic?",
- answer: "Upload documents in Sources, then go to Structuring > Runs > New to create your first extraction job. Results appear progressively as each phase completes."
+ answer: "Upload documents in Sources, then go to Structuring > Runs > New to create your first extraction job. Results appear progressively as each phase completes. For a single document, use the quick extract shortcut (Cmd+J / Ctrl+J) to upload and process from any page without navigating to Sources first. Most users see their first structured output within two to three minutes of uploading."
  },
  {
  question: "How is the Talonic platform organized?",
- answer: "The platform is organized into three primary sections: Sources (document ingest), Structuring (processing & validation), and Outputs (delivery to downstream systems)."
+ answer: "The platform is organized into three primary sections: Sources (document ingest), Structuring (processing & validation), and Outputs (delivery to downstream systems). Sources handles all document ingestion \u2014 manual uploads, cloud connectors, email inboxes, and API ingestion. Structuring is where you define schemas, run extraction jobs, review results, and approve output. Outputs manages delivery bindings that push approved data to webhooks, SFTP, cloud storage, and other downstream systems."
  },
  {
  question: "Do I need to define a schema before processing documents?",
@@ -757,7 +767,7 @@ var sections2 = [
  content: [
  {
  type: "paragraph",
- text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language."
+ text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language. It serves as a conversational interface to your entire workspace, eliminating the need to navigate through multiple pages to find information or perform common operations."
  },
  {
  type: "paragraph",
@@ -769,7 +779,7 @@ var sections2 = [
  },
  {
  type: "paragraph",
- text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page."
+ text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only (Viewer) access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page. The agent also cannot modify field registry entries directly \u2014 those changes flow through the resolution pipeline."
  },
  { type: "heading", level: 3, id: "agent-capabilities", text: "What the Agent Can Do" },
  {
@@ -842,7 +852,11 @@ var sections2 = [
  },
  {
  question: "Can the AI agent access external systems or the internet?",
- answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform."
+ answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform. All data the agent references comes from your documents, schemas, field registry, and job results."
+ },
+ {
+ question: "What are good questions to ask the agent?",
+ answer: 'Try questions like "Show me all invoices processed this week", "What fields does my Invoice schema have?", "Create a schema for purchase orders with vendor name, PO number, and total amount", or "Why was this document classified as a Service Agreement?" The agent handles both read-only queries and schema creation commands.'
  }
  ],
  mentions: [
@@ -862,7 +876,7 @@ var sections2 = [
  content: [
  {
  type: "paragraph",
- text: "Every agent action is classified into an impact level that determines how it executes. Higher-impact operations require progressively more explicit confirmation."
+ text: "Every agent action is classified into an impact level that determines how it executes. Higher-impact operations require progressively more explicit confirmation. This graduated safety model ensures that the agent can be used freely for exploration and analysis while preventing accidental modifications to live data."
  },
  {
  type: "param-table",
@@ -902,9 +916,29 @@ var sections2 = [
  type: "paragraph",
  text: 'The `live_mutation` and `irreversible` levels provide escalating safety gates for operations that affect production data. A `live_mutation` \u2014 such as triggering a job run or publishing a schema \u2014 presents a confirmation dialog that you must accept. An `irreversible` action \u2014 such as deleting a source or purging documents \u2014 requires you to type a confirmation keyword (e.g., "DELETE") to proceed, preventing accidental data loss.'
  },
+ {
+ type: "heading",
+ level: 3,
+ id: "impact-examples",
+ text: "Common Actions by Impact Level"
+ },
+ {
+ type: "list",
+ ordered: false,
+ items: [
+ '**read** \u2014 "Show me the fields in this document", "What schemas do I have?", "How many invoices were processed today?"',
+ `**draft_mutation** \u2014 "Create a schema for invoices with these fields", "Add a 'due date' field to my schema draft"`,
+ '**live_mutation** \u2014 "Publish this schema draft", "Run extraction on these 50 documents"',
+ '**irreversible** \u2014 "Delete this source and all its documents", "Purge all data for this document type"'
+ ]
+ },
+ {
+ type: "paragraph",
+ text: "The impact level system also respects your team role. Team members with the Viewer role can only trigger `read` level actions through the agent \u2014 attempting commands that would modify data will be rejected with a clear permissions error. Members, Admins, and Owners can trigger higher impact levels according to their role permissions."
+ },
  {
  type: "callout",
- text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready."
+ text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready. This means you can experiment freely with schema designs through the agent without any risk to your production configuration."
  }
  ],
  related: [
@@ -936,7 +970,7 @@ var sections2 = [
  content: [
  {
  type: "paragraph",
- text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction. The agent input field lets you type any question directly from the dashboard."
+ text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction, and pending field confirmations. The agent input field lets you type any question directly from the dashboard, making it the natural starting point for each session."
  },
  {
  type: "paragraph",
@@ -949,6 +983,21 @@ var sections2 = [
  {
  type: "paragraph",
  text: "Every conversation with the agent is preserved in your session history, accessible from the dashboard. You can revisit previous questions and their answers, which is useful for auditing decisions or recalling how you configured a particular schema. The conversation history also provides continuity \u2014 if you asked the agent to analyze extraction quality last week, you can pick up where you left off."
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "dashboard-metrics",
+ text: "Dashboard Metrics"
+ },
+ {
+ type: "paragraph",
+ text: "The dashboard surfaces key telemetry metrics from your workspace. **Capture rate** measures the percentage of schema fields that were successfully extracted across your latest job runs. **Resolve rate** tracks how many extracted fields were resolved against the registry without AI intervention. **Synthesize rate** shows how many registry fields have master instructions. Together, these three metrics give you a quick health check on your extraction pipeline \u2014 a high resolve rate means your registry is mature and extraction costs are low."
+ },
+ {
+ type: "callout",
+ variant: "info",
+ text: 'Try asking the agent questions like "What is my capture rate?", "Which document types need schemas?", or "Show me recent extraction failures" directly from the dashboard. The suggested prompts adapt to your workspace state, but you can always type any question.'
  }
  ],
  related: [
@@ -984,11 +1033,11 @@ var sections3 = [
  content: [
  {
  type: "paragraph",
- text: "Sources are the entry point for all data. Every document belongs to a source \u2014 whether uploaded manually, synced from Google Drive, or ingested via API."
+ text: "Sources are the entry point for all data in Talonic. Every document belongs to a source \u2014 whether uploaded manually, synced from Google Drive, or ingested via the public API. A source acts as a container that groups related documents together and tracks their ingestion method, processing status, and connection credentials. You can create multiple sources to organize documents by department, client, or workflow."
  },
  {
  type: "paragraph",
- text: "The Sources page provides a drag-and-drop upload interface. You can upload individual files, multiple files, or entire folders. ZIP archives are unpacked recursively. When uploading folders, the original file path is preserved as a data field (`source_file_path`) on each document \u2014 available for downstream processing and export."
+ text: "The Sources page provides a drag-and-drop upload interface. You can upload individual files, multiple files, or entire folders. ZIP archives are unpacked recursively \u2014 nested ZIPs are also extracted, so deeply packaged archives are handled automatically. When uploading folders, the original file path is preserved as a data field (`source_file_path`) on each document \u2014 available for downstream processing and export."
  },
  {
  type: "ui-excerpt",
@@ -998,11 +1047,37 @@ var sections3 = [
  },
  {
  type: "paragraph",
- text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. Processing runs asynchronously so you can continue working."
+ text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. The hash is computed from the raw file bytes before any processing begins, so identical files are caught regardless of filename or upload method. Processing runs asynchronously so you can continue working while documents flow through OCR, classification, and extraction in the background."
  },
  {
  type: "paragraph",
  text: "When uploading folders or ZIP archives, the original directory structure is preserved as a `source_file_path` metadata field on each document (e.g., `contracts/2026/lease.pdf`). This field is available for filtering, export, and schema mapping \u2014 just like any AI-extracted field. It provides a natural way to organize and trace documents back to their original location in your file system."
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "upload-workflow",
+ text: "Upload Workflow"
+ },
+ {
+ type: "list",
+ ordered: true,
+ items: [
+ "Navigate to the **Sources** page from the sidebar.",
+ "Click **New Source** to create a container, or select an existing source.",
+ "Drag files, folders, or ZIP archives onto the upload area.",
+ "Monitor the progress indicator \u2014 each file shows its current stage (uploading, OCR, classifying, extracting).",
+ "Once processing completes, documents appear in the source list with their extracted fields and classification."
+ ]
+ },
+ {
+ type: "paragraph",
+ text: "Large uploads (100+ files) are throttled to avoid overloading the pipeline, but you can monitor progress from the source page where each document shows its current processing stage. For very large ingestion jobs, consider using **batch processing mode** which defers AI extraction to run at 50% cost. You can also combine manual uploads with source connectors \u2014 for example, uploading a backlog of historical files manually while connecting Google Drive for ongoing ingestion."
+ },
+ {
+ type: "callout",
+ variant: "info",
+ text: "Use the **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) to upload a single document from any page without navigating to Sources first. This opens a streamlined upload interface that processes the document immediately and shows results inline."
  }
  ],
  related: [
@@ -1013,15 +1088,19 @@ var sections3 = [
  faq: [
  {
  question: "How do I upload documents to Talonic?",
- answer: "Drag files or folders onto the Sources page upload area. You can upload individual files, multiple files, entire folders, or ZIP archives that are unpacked recursively."
+ answer: "Drag files or folders onto the Sources page upload area. You can upload individual files, multiple files, entire folders, or ZIP archives that are unpacked recursively. You can also use the quick extract shortcut (Cmd+J / Ctrl+J) to upload a single file from any page."
  },
  {
  question: "Does Talonic detect duplicate uploads?",
- answer: "Yes. Files are deduplicated via SHA-256 hashing. Uploading the same file twice will not create duplicates."
+ answer: "Yes. Files are deduplicated via SHA-256 hashing computed from the raw file bytes. Uploading the same file twice will not create duplicates, regardless of filename or upload method."
  },
  {
  question: "What happens when I upload a folder or ZIP archive?",
- answer: "ZIP archives are unpacked recursively and each file is processed individually. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering and export."
+ answer: "ZIP archives are unpacked recursively and each file is processed individually. Nested ZIPs are also extracted. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering, schema mapping, and export."
+ },
+ {
+ question: "Can I upload large batches of documents?",
+ answer: "Yes. Large uploads (100+ files) are automatically throttled to prevent pipeline overload. Each document processes independently through OCR, classification, and extraction. For very large batches, consider batch processing mode which defers extraction at 50% cost."
  }
  ],
  mentions: [
@@ -1082,6 +1161,10 @@ var sections3 = [
  type: "paragraph",
  text: "The **text fast-path** is the most efficient route: files like CSV, JSON, and plain text are read directly into memory with no external API call. This means they process almost instantly and incur no OCR cost. Email files (EML, MSG) are parsed to extract both the message body and any attachments, with each attachment processed as a separate document."
  },
+ {
+ type: "paragraph",
+ text: "Email files receive special handling. EML files are parsed to extract both the message body and any attachments, with each attachment processed as a separate document linked back to the parent email. MSG files (Microsoft Outlook format) follow the OCR path and similarly extract embedded attachments. This means a single email file can produce multiple documents \u2014 the email body and each of its attachments \u2014 all processed independently through the full pipeline."
+ },
  {
  type: "callout",
  variant: "info",
@@ -1099,7 +1182,7 @@ var sections3 = [
  },
  {
  question: "How does Talonic handle image files?",
- answer: "Image files (PNG, JPG, JPEG, GIF, WEBP) are sent to AI for multimodal visual extraction."
+ answer: "Image files (PNG, JPG, JPEG, GIF, WEBP) are sent to AI for multimodal visual extraction. The AI model sees the image directly and extracts data visually, which is useful for photos of receipts, scanned handwritten notes, or diagrams. If an image was previously OCR'd and produced meaningful Markdown (more than 100 characters), the system uses the Markdown extraction path instead, which enables richer quality metrics and confidence scoring."
  },
  {
  question: "How does Talonic handle large PDF files?",
@@ -1117,7 +1200,7 @@ var sections3 = [
  content: [
  {
  type: "paragraph",
- text: "When a document is uploaded, it flows through a multi-stage pipeline:"
+ text: "When a document is uploaded, it flows through a multi-stage pipeline that transforms raw files into structured, queryable data. Each stage runs automatically and its progress is visible in the document's processing log. The pipeline handles everything from OCR conversion to AI-powered field extraction, with built-in retry logic for resilience."
  },
  {
  type: "ui-excerpt",
@@ -1125,9 +1208,29 @@ var sections3 = [
  title: "Document \u2014 Processing Log",
  caption: "Every document shows a structured processing log with per-stage timing."
  },
+ {
+ type: "heading",
+ level: 3,
+ id: "processing-stages",
+ text: "Processing Stages"
+ },
+ {
+ type: "list",
+ ordered: true,
+ items: [
+ "**Document AI OCR** \u2014 Converts files to Markdown and produces structured annotations (type, sensitivity, PII categories, jurisdiction) in a single pass.",
+ "**Classification** \u2014 Verifies the annotation against the 529-type document ontology. If the label and content disagree, AI resolves the correct type from the actual text.",
+ "**Triage** \u2014 Tags the document with sensitivity level, department, jurisdiction, and compliance signals derived from the OCR annotations.",
+ "**AI Data Field Capture** \u2014 Extracts every data point from the document content using Claude, producing fields with confidence scores and source text provenance."
+ ]
+ },
  {
  type: "paragraph",
- text: "**Document AI OCR** converts files to Markdown and produces structured annotations (type, sensitivity, PII categories, jurisdiction) in a single pass. **Classification** verifies the annotation against the 529-type document ontology \u2014 if the label and content disagree, AI resolves the correct type from the actual text. **AI Data Field Capture** extracts every data point. Large documents are automatically chunked and processed in parallel."
+ text: "**Document AI OCR** converts files to Markdown and produces structured annotations (type, sensitivity, PII categories, jurisdiction) in a single pass. **Classification** verifies the annotation against the 529-type document ontology \u2014 if the label and content disagree, AI resolves the correct type from the actual text. **AI Data Field Capture** extracts every data point. Large documents (PDFs exceeding 25 pages) are automatically split into page chunks, processed in parallel, and merged \u2014 so even lengthy documents complete efficiently without timeouts."
+ },
+ {
+ type: "paragraph",
+ text: "The extraction stage uses a chunk-first approach by default: document sections are labelled inline and sent to Claude in a single call. The response combines JSON metadata with CSV-formatted fields, capturing confidence scores, source text, and chunk coverage for every extracted value. Optional quality passes \u2014 field count estimation, verification, and cross-reference enrichment \u2014 run as lightweight Haiku calls to catch extraction gaps. Six quality metrics are computed post-extraction: confidence per chunk, noise detection, semantic dissimilarity, text coverage, connection awareness, and consistency awareness."
1131
1234
  },
1132
1235
  {
1133
1236
  type: "paragraph",
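The large-document behaviour described in this hunk (split into page chunks, process in parallel, merge) can be sketched as follows. This is a minimal illustration, not the platform's implementation: the 25-page threshold comes from the text above, while `PAGES_PER_CHUNK`, `planPageChunks`, `extractInParallel`, and the `extractChunk` callback are hypothetical names and values chosen for the example.

```javascript
// Threshold taken from the docs text: PDFs exceeding 25 pages are split.
const PAGE_THRESHOLD = 25;
// Assumption: the actual chunk size is not documented; 25 is illustrative.
const PAGES_PER_CHUNK = 25;

// Returns 1-based inclusive page ranges covering the whole document.
function planPageChunks(pageCount, pagesPerChunk = PAGES_PER_CHUNK) {
  if (pageCount <= PAGE_THRESHOLD) {
    return [{ start: 1, end: pageCount }];
  }
  const chunks = [];
  for (let start = 1; start <= pageCount; start += pagesPerChunk) {
    chunks.push({ start, end: Math.min(start + pagesPerChunk - 1, pageCount) });
  }
  return chunks;
}

// Runs a caller-supplied extractor on every chunk concurrently and merges
// the per-chunk field arrays into one list, preserving chunk order.
async function extractInParallel(pageCount, extractChunk) {
  const chunks = planPageChunks(pageCount);
  const perChunk = await Promise.all(chunks.map((chunk) => extractChunk(chunk)));
  return perChunk.flat();
}
```

Because `Promise.all` preserves input order, the merged result lists fields in page order even when chunks finish out of order.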
@@ -1152,9 +1255,13 @@ var sections3 = [
1152
1255
  type: "paragraph",
1153
1256
  text: "Both fields appear in the **Raw Extraction** tab with full confidence (1.0) and are available for schema mapping, job resolution, filtering, and export \u2014 just like any AI-extracted field."
1154
1257
  },
1258
+ {
1259
+ type: "paragraph",
1260
+ text: "If OCR or extraction fails, the platform automatically retries using a fallback chain. OCR failures fall back from Document AI to the Talonic API and then to local parsers. Extraction parse failures are retried in realtime mode (never as new batches). Each retry stage is logged in the processing log so you can see exactly what happened. Documents that exhaust all retries are marked with a terminal status (`extraction_failed` or `ocr_failed`) and remain visible in the source for manual review."
1261
+ },
1155
1262
  {
1156
1263
  type: "callout",
1157
- text: "Documents are marked **complete** after AI extraction finishes. You can start using them in jobs immediately \u2014 no need to wait for further processing."
1264
+ text: "Documents are marked **complete** after AI extraction finishes. You can start using them in jobs immediately \u2014 no need to wait for field resolution, which runs separately and enriches the registry in the background."
1158
1265
  }
1159
1266
  ],
1160
1267
  related: [
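The OCR fallback chain added in this hunk (Document AI, then the Talonic API, then local parsers, with every attempt logged) could look roughly like the sketch below. Only the stage order and the `ocr_failed` terminal status come from the text; the stage objects, the `runOcrWithFallback` function, and the shape of the returned value are assumptions made for illustration.

```javascript
// Tries each OCR stage in order and returns the first successful result.
// Every attempt is appended to a log, mirroring the per-stage processing log.
async function runOcrWithFallback(stages, file) {
  const log = [];
  for (const stage of stages) {
    try {
      const markdown = await stage.run(file);
      log.push({ stage: stage.name, ok: true });
      return { ok: true, markdown, log };
    } catch (err) {
      // Record the failure and fall through to the next stage in the chain.
      log.push({ stage: stage.name, ok: false, error: String(err.message) });
    }
  }
  // All stages exhausted: terminal status named in the docs text.
  return { ok: false, status: "ocr_failed", log };
}
```

A caller would pass the stages in fallback order, for example `[documentAi, talonicApi, localParser]`, where each entry is a hypothetical `{ name, run }` object.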
@@ -1199,7 +1306,7 @@ var sections3 = [
1199
1306
  },
1200
1307
  {
1201
1308
  type: "paragraph",
1202
- text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata. Unresolvable documents are assigned "Unclassified Document".'
1309
+ text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata including category, subcategory, and typical fields. Unresolvable documents are assigned "Unclassified Document" \u2014 they can still be processed and extracted, but the platform cannot map them to a specific type in the ontology.'
1203
1310
  },
1204
1311
  {
1205
1312
  type: "paragraph",
@@ -1209,6 +1316,23 @@ var sections3 = [
1209
1316
  type: "paragraph",
1210
1317
  text: "Document types drive several downstream features. The platform auto-generates a **schema** for each document type, pre-populated with fields discovered from documents of that type. **Routing rules** can be configured per document type to automatically assign schemas or trigger jobs when new documents arrive. The **Field Registry** tracks which fields appear in which document types, building a cross-type knowledge graph over time."
1211
1318
  },
1319
+ {
1320
+ type: "heading",
1321
+ level: 3,
1322
+ id: "ontology-examples",
1323
+ text: "Example Ontology Types"
1324
+ },
1325
+ {
1326
+ type: "list",
1327
+ ordered: false,
1328
+ items: [
1329
+ "**Financial** \u2014 Invoice, Purchase Order, Credit Note, Bank Statement, Tax Return",
1330
+ "**Legal** \u2014 Employment Contract, Non-Disclosure Agreement, Lease Agreement, Power of Attorney",
1331
+ "**Logistics** \u2014 Bill of Lading (Ocean), Commercial Invoice, Packing List, Certificate of Origin",
1332
+ "**Healthcare** \u2014 Medical Record, Lab Report, Insurance Claim, Prescription",
1333
+ "**Corporate** \u2014 Articles of Incorporation, Board Resolution, Annual Report, Meeting Minutes"
1334
+ ]
1335
+ },
1212
1336
  {
1213
1337
  type: "callout",
1214
1338
  variant: "info",
@@ -1251,7 +1375,7 @@ var sections3 = [
1251
1375
  content: [
1252
1376
  {
1253
1377
  type: "paragraph",
1254
- text: "Click any document to see its detail page with four views:"
1378
+ text: "Click any document to see its detail page with four views. The detail page is where you inspect individual extraction results, verify field accuracy, and debug processing issues. It serves as the single source of truth for everything the platform knows about a document \u2014 from raw AI output to resolved canonical fields."
1255
1379
  },
1256
1380
  {
1257
1381
  type: "param-table",
@@ -1290,6 +1414,10 @@ var sections3 = [
1290
1414
  type: "paragraph",
1291
1415
  text: "The **Processing Log** tab provides a stage-by-stage timeline of how the document was processed, including per-stage timing. You can see exactly how long OCR, classification, and extraction took, which is useful for diagnosing slow processing or understanding why a document was classified a particular way. The **Original File** tab lets you view or download the source file, so you can always compare the AI's extraction against the original document."
1292
1416
  },
1417
+ {
1418
+ type: "paragraph",
1419
+ text: "When reviewing extraction quality, start with the **Raw Extraction** tab to check that the AI found all expected fields. Compare the confidence scores across fields \u2014 values above 0.90 are typically reliable, while values below 0.70 may warrant manual verification. If a field is missing, check the **Original File** tab to confirm the information exists in the source document. The **Processing Log** tab can reveal whether the document was split into chunks, which sometimes causes fields near chunk boundaries to be missed."
1420
+ },
1293
1421
  {
1294
1422
  type: "callout",
1295
1423
  variant: "info",
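The review guidance added in this hunk (values above 0.90 are typically reliable, values below 0.70 may warrant manual verification) lends itself to a small triage helper. The sketch below assumes each extracted field is an object with `name` and `confidence` properties; that shape, the function name, and the middle `spotCheck` bucket are illustrative and not part of the documented output format.

```javascript
// Buckets extracted fields by confidence using the thresholds from the text.
function triageByConfidence(fields) {
  const buckets = { reliable: [], spotCheck: [], verify: [] };
  for (const field of fields) {
    if (field.confidence > 0.9) {
      buckets.reliable.push(field.name); // typically reliable
    } else if (field.confidence < 0.7) {
      buckets.verify.push(field.name); // may warrant manual verification
    } else {
      buckets.spotCheck.push(field.name); // assumption: unnamed middle range
    }
  }
  return buckets;
}
```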
@@ -1332,11 +1460,11 @@ var sections3 = [
1332
1460
  content: [
1333
1461
  {
1334
1462
  type: "paragraph",
1335
- text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents → Routing**."
1463
+ text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents → Routing**. Routing is the bridge between document ingestion and structured output \u2014 it eliminates the manual step of selecting a schema and starting a job for each new document."
1336
1464
  },
1337
1465
  {
1338
1466
  type: "paragraph",
1339
- text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides."
1467
+ text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides. If multiple rules match the same document type, the highest-priority rule wins."
1340
1468
  },
1341
1469
  {
1342
1470
  type: "paragraph",
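The tie-breaking rule added in this hunk (when several rules match a document type, the highest-priority rule wins) can be illustrated with a short selector. The rule shape `{ documentType, priority, actions }` and the convention that a larger number means higher priority are assumptions; the text does not specify how priority is encoded.

```javascript
// Returns the winning rule for a document type, or null when nothing matches.
// Assumption: a larger `priority` value means the rule takes precedence.
function selectRoutingRule(rules, documentType) {
  const matches = rules.filter((rule) => rule.documentType === documentType);
  if (matches.length === 0) return null;
  return matches.reduce((best, rule) => (rule.priority > best.priority ? rule : best));
}
```

With this shape, a general rule can sit at a low priority and a more specific override at a higher one, matching the layering described in the paragraph.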
@@ -1346,10 +1474,25 @@ var sections3 = [
1346
1474
  type: "paragraph",
1347
1475
  text: "You can review rule execution history from the routing page to see which rules fired, which documents they matched, and what actions were taken. This audit trail helps you verify that your routing configuration is working as expected and diagnose cases where documents were not routed correctly."
1348
1476
  },
1477
+ {
1478
+ type: "heading",
1479
+ level: 3,
1480
+ id: "routing-actions",
1481
+ text: "Available Routing Actions"
1482
+ },
1483
+ {
1484
+ type: "list",
1485
+ ordered: false,
1486
+ items: [
1487
+ "**Assign schema** \u2014 Automatically link a user schema to matching documents, so they are ready for extraction without manual configuration.",
1488
+ "**Trigger job** \u2014 Create and start a job run as soon as a document of the matching type completes processing.",
1489
+ "**Tag workflow** \u2014 Apply a workflow tag that can be used for filtering, reporting, or downstream delivery bindings."
1490
+ ]
1491
+ },
1349
1492
  {
1350
1493
  type: "callout",
1351
1494
  variant: "info",
1352
- text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules."
1495
+ text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules. Review the execution history regularly to ensure documents are being routed as expected."
1353
1496
  }
1354
1497
  ],
1355
1498
  related: [
@@ -1368,7 +1511,7 @@ var sections3 = [
1368
1511
  },
1369
1512
  {
1370
1513
  question: "Can routing rules fully automate my document processing pipeline?",
1371
- answer: "Yes. By combining routing rules with source connectors and delivery bindings, you can create a fully automated pipeline: documents arrive from a connected source, routing rules assign schemas and trigger extraction jobs, and delivery bindings push approved results to downstream systems."
1514
+ answer: "Yes. By combining routing rules with source connectors and delivery bindings, you can create a fully automated pipeline: documents arrive from a connected source, routing rules assign schemas and trigger extraction jobs, and delivery bindings push approved results to downstream systems. For example, a Google Drive folder receiving weekly invoices can be connected as a source with a routing rule that auto-assigns your Invoice schema and triggers extraction. A delivery binding then pushes approved results to your ERP via webhook \u2014 zero manual steps required."
1372
1515
  }
1373
1516
  ],
1374
1517
  mentions: ["routing rules", "auto-assign", "schema assignment", "document workflows"]
@@ -1382,7 +1525,7 @@ var sections3 = [
1382
1525
  content: [
1383
1526
  {
1384
1527
  type: "paragraph",
1385
- text: "Beyond manual upload and API ingestion, Talonic connects to external systems to automatically ingest documents. Each connector authenticates via OAuth or credentials and syncs documents into a source."
1528
+ text: "Beyond manual upload and API ingestion, Talonic connects to external systems to automatically ingest documents. Each connector authenticates via OAuth or credentials and syncs documents into a source. Connectors turn Talonic into a continuous ingestion pipeline \u2014 once configured, new files arriving in a connected folder or inbox are available for processing without manual intervention."
1386
1529
  },
1387
1530
  {
1388
1531
  type: "param-table",
@@ -1452,9 +1595,31 @@ var sections3 = [
1452
1595
  type: "paragraph",
1453
1596
  text: "Credential-based connectors (SQL, Amazon S3, Azure Blob) authenticate with access keys or connection strings rather than OAuth. SQL connections support PostgreSQL, MySQL, and MSSQL, with a built-in read-only safety layer that prevents accidental writes. S3-compatible storage like MinIO and Cloudflare R2 also works through the S3 connector. All credentials are encrypted at rest before being stored."
1454
1597
  },
1598
+ {
1599
+ type: "heading",
1600
+ level: 3,
1601
+ id: "connector-setup",
1602
+ text: "Setting Up a Connector"
1603
+ },
1604
+ {
1605
+ type: "list",
1606
+ ordered: true,
1607
+ items: [
1608
+ "Navigate to **Sources** and click **New Source**.",
1609
+ "Select the connector type from the dropdown (e.g., Google Drive, SharePoint, S3).",
1610
+ "For OAuth connectors, complete the authorization flow \u2014 you will be redirected to the provider to grant access.",
1611
+ "For credential-based connectors, enter the required credentials (access key, connection string, or API key).",
1612
+ "Browse the connected system to select specific folders, mailboxes, buckets, or tables to import.",
1613
+ "Optionally enable **Batch Processing** to defer extraction at 50% of the normal cost."
1614
+ ]
1615
+ },
1616
+ {
1617
+ type: "paragraph",
1618
+ text: "Email connectors (Gmail and Outlook) ingest attachments from messages rather than the messages themselves. Gmail supports query passthrough so you can use standard Gmail search syntax to filter which messages are scanned for attachments. Outlook supports date range filtering and an option to include email bodies as documents. Microsoft Teams ingests meeting transcripts and channel attachments, with configurable surface filters for channels, chats, and meetings."
1619
+ },
1455
1620
  {
1456
1621
  type: "callout",
1457
- text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled."
1622
+ text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled. Microsoft Teams requires tenant-admin consent for privileged scopes like `ChannelMessage.Read.All`."
1458
1623
  }
1459
1624
  ],
1460
1625
  related: [
@@ -1465,15 +1630,19 @@ var sections3 = [
1465
1630
  faq: [
1466
1631
  {
1467
1632
  question: "What external sources can Talonic connect to?",
1468
- answer: "Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion, SQL databases (MSSQL/PostgreSQL), Amazon S3, and Azure Blob Storage."
1633
+ answer: "Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion, SQL databases (MSSQL/PostgreSQL/MySQL), Amazon S3 (and S3-compatible storage like MinIO and Cloudflare R2), and Azure Blob Storage. Each connector authenticates via OAuth or credentials."
1469
1634
  },
1470
1635
  {
1471
1636
  question: "How are OAuth tokens stored?",
1472
- answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET)."
1637
+ answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET). Tokens are decrypted only when making API calls to the connected service."
1473
1638
  },
1474
1639
  {
1475
1640
  question: "What happens if a connector loses its credentials or authorization?",
1476
- answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration."
1641
+ answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration. No documents are deleted during disconnection."
1642
+ },
1643
+ {
1644
+ question: "Does the SQL connector support write operations?",
1645
+ answer: "No. SQL connections have a built-in read-only safety layer. A two-layer defense ensures no writes: an AST parser rejects anything that is not a single SELECT statement, and per-transaction read-only mode is enforced at the database level. MSSQL connections additionally reject accounts with elevated privileges."
1477
1646
  }
1478
1647
  ],
1479
1648
  mentions: [
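The token-storage answer in this hunk names AES-256-GCM and a `SOURCE_ENCRYPTION_KEY` that falls back to `JWT_SECRET`. The sketch below shows that pattern with Node's built-in `crypto` module. The key derivation (a SHA-256 hash of the secret), the `iv | tag | ciphertext` layout, and all function names are assumptions for illustration; the source does not describe how the key is derived or how the ciphertext is stored.

```javascript
const crypto = require("crypto");

// Picks the secret as described in the docs: SOURCE_ENCRYPTION_KEY first,
// then JWT_SECRET. Hashing to 32 bytes is an assumption made here so the
// example has a valid AES-256 key; a real system may use a proper KDF.
function resolveKey(env) {
  const secret = env.SOURCE_ENCRYPTION_KEY || env.JWT_SECRET;
  if (!secret) throw new Error("no encryption secret configured");
  return crypto.createHash("sha256").update(secret).digest();
}

// Encrypts a token and packs iv (12 bytes), auth tag (16 bytes), and
// ciphertext into one base64 string. The layout is illustrative.
function encryptToken(plaintext, key) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

// Reverses encryptToken; throws if the key is wrong or the data was altered.
function decryptToken(encoded, key) {
  const raw = Buffer.from(encoded, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

GCM is an authenticated mode, so decryption fails loudly on a wrong key or tampered ciphertext rather than returning garbage.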
@@ -1504,7 +1673,7 @@ var sections4 = [
1504
1673
  content: [
1505
1674
  {
1506
1675
  type: "paragraph",
1507
- text: "The Field Registry is the heart of Talonic's intelligence. As documents are processed, AI discovers fields and resolves them into a unified knowledge graph that grows smarter with every document."
1676
+ text: "The Field Registry is the heart of Talonic's intelligence. As documents are processed, AI discovers fields and resolves them into a unified knowledge graph that grows smarter with every document. The registry is what makes Talonic a learning system \u2014 each new document contributes to a shared understanding of field names, types, and extraction patterns that benefits all future processing."
1508
1677
  },
1509
1678
  {
1510
1679
  type: "paragraph",
@@ -1527,6 +1696,10 @@ var sections4 = [
1527
1696
  {
1528
1697
  type: "paragraph",
1529
1698
  text: "The registry is the foundation for several downstream features. **Jobs** use registry fields to pre-fill schema values via lookup cascades before resorting to LLM extraction. **Semantic clusters** group related registry fields together. **Generated schemas** are auto-built from registry fields that appear in a given document type. Understanding the registry is key to understanding how Talonic reduces extraction cost and improves accuracy over time."
1699
+ },
1700
+ {
1701
+ type: "paragraph",
1702
+ text: "Each registry field maintains two separate embedding vectors: one optimized for **resolution matching** (based on the canonical name and synonyms) and one for **graph visualization** (based on name, type, and instruction). This dual-embedding approach ensures that each concern uses the most appropriate representation. The resolution embedding is what powers the three-band matching during document processing, while the visualization embedding drives the Field Map clustering view."
1530
1703
  }
1531
1704
  ],
1532
1705
  related: [
@@ -1565,7 +1738,7 @@ var sections4 = [
1565
1738
  content: [
1566
1739
  {
1567
1740
  type: "paragraph",
1568
- text: "Fields are organized into three tiers based on how frequently they appear:"
1741
+ text: "Fields are organized into three tiers based on how frequently they appear across your document corpus. The tier system is Talonic's primary quality signal \u2014 it tells you at a glance how well-established and reliable a field is. Tiers also directly affect extraction cost: higher-tier fields are cheaper to extract because they can be resolved via lookup rather than AI calls."
1569
1742
  },
1570
1743
  {
1571
1744
  type: "param-table",
@@ -1599,9 +1772,19 @@ var sections4 = [
1599
1772
  type: "paragraph",
1600
1773
  text: "**Tier 3** fields are newly discovered and may require a full Claude API call to extract during job runs, making them the most expensive tier. As more documents are processed and a Tier 3 field appears consistently, it is automatically promoted. You can also manually adjust a field's tier from the registry detail page if you know a field is stable enough to promote early."
1601
1774
  },
1775
+ {
1776
+ type: "heading",
1777
+ level: 3,
1778
+ id: "promotion-thresholds",
1779
+ text: "Promotion Thresholds"
1780
+ },
1781
+ {
1782
+ type: "paragraph",
1783
+ text: "Promotion from Tier 3 to Tier 2 requires meeting at least one of two thresholds: **5 occurrences** (the field appears in at least 5 documents) or a **10% occurrence rate** (the field appears in at least 10% of all documents in your workspace). Promotion is evaluated automatically after every batch resolution run, so fields graduate without manual intervention as your document corpus grows. Once promoted to Tier 2, a field gains a synthesized master instruction and becomes eligible for lookup-based resolution in job runs."
1784
+ },
1602
1785
  {
1603
1786
  type: "callout",
1604
- text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray."
1787
+ text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray. You can see tier badges on the Field Registry page, in document detail views, on schema fields, and in job result grids."
1605
1788
  }
1606
1789
  ],
1607
1790
  related: [
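The promotion thresholds added in this hunk (at least 5 occurrences, or at least a 10% occurrence rate) reduce to a simple predicate. The sketch below is illustrative: the function name and arguments are hypothetical, and integer arithmetic is used for the rate check purely to avoid floating-point edge cases.

```javascript
const MIN_OCCURRENCES = 5; // from the docs text
const MIN_RATE_PERCENT = 10; // from the docs text

// True when a Tier 3 field meets either promotion threshold.
function qualifiesForTier2(occurrences, totalDocuments) {
  if (occurrences >= MIN_OCCURRENCES) return true;
  // occurrences / totalDocuments >= 10%, written without division.
  return totalDocuments > 0 && occurrences * 100 >= totalDocuments * MIN_RATE_PERCENT;
}
```

In a small workspace the rate threshold dominates (2 of 20 documents qualifies), while in a large one the absolute count does (5 of 10,000 qualifies).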
@@ -1633,7 +1816,7 @@ var sections4 = [
1633
1816
  content: [
1634
1817
  {
1635
1818
  type: "paragraph",
1636
- text: 'Fields with similar meanings are automatically grouped using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together. You can manually merge or split clusters from the Field Map view.'
1819
+ text: 'Fields with similar meanings are automatically grouped into semantic clusters using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together because they represent the same underlying concept. You can manually merge or split clusters from the Field Map view. Clusters are a key mechanism for cross-document field normalization \u2014 they allow the platform to recognize that different document types use different names for the same data point.'
1637
1820
  },
1638
1821
  {
1639
1822
  type: "paragraph",
@@ -1647,10 +1830,25 @@ var sections4 = [
1647
1830
  type: "paragraph",
1648
1831
  text: 'Semantic clusters serve a practical purpose beyond organization. When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has a field called "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call. This is one of the key mechanisms that reduces extraction cost as your registry matures.'
1649
1832
  },
1833
+ {
1834
+ type: "heading",
1835
+ level: 3,
1836
+ id: "cluster-operations",
1837
+ text: "Cluster Operations"
1838
+ },
1839
+ {
1840
+ type: "list",
1841
+ ordered: false,
1842
+ items: [
1843
+ "**Merge** \u2014 Combine two clusters that represent the same concept. All fields from both clusters are unified under a single canonical entry.",
1844
+ "**Split** \u2014 Remove a field from a cluster if it was incorrectly grouped. The split field becomes its own independent cluster.",
1845
+ "**Inspect** \u2014 View all fields in a cluster, their source document types, and occurrence counts to understand why they were grouped."
1846
+ ]
1847
+ },
1650
1848
  {
1651
1849
  type: "callout",
1652
1850
  variant: "info",
1653
- text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs."
1851
+ text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs. Conversely, merging clusters that should be together improves resolution accuracy across your entire corpus."
1654
1852
  }
1655
1853
  ],
1656
1854
  related: [
@@ -1689,7 +1887,7 @@ var sections4 = [
1689
1887
  content: [
1690
1888
  {
1691
1889
  type: "paragraph",
1692
- text: "When a document is processed, each extracted field is resolved against the registry using a three-band matching model. The bands determine whether a match is accepted automatically, flagged for confirmation, or treated as a new field."
1890
+ text: "When a document is processed, each extracted field is resolved against the registry using a three-band matching model. The bands determine whether a match is accepted automatically, flagged for confirmation, or treated as a new field. Resolution is the core mechanism that turns raw, document-specific field names into canonical registry entries \u2014 building a unified knowledge graph across all your documents."
1693
1891
  },
1694
1892
  {
1695
1893
  type: "param-table",
@@ -1711,9 +1909,19 @@ var sections4 = [
1711
1909
  }
1712
1910
  ]
1713
1911
  },
1912
+ {
1913
+ type: "heading",
1914
+ level: 3,
1915
+ id: "resolution-process",
1916
+ text: "How Resolution Works"
1917
+ },
1714
1918
  {
1715
1919
  type: "paragraph",
1716
- text: "Resolution runs concurrently across documents. Each document's fields are resolved in an isolated transaction to prevent lock contention. Occurrence rates are updated after each transaction commits, keeping the registry eventually consistent without blocking concurrent ingestion."
1920
+ text: "Resolution follows a strict three-step matching order that is never skipped. First, the system checks for an **exact name match** against existing registry entries. If no exact match is found, it checks for a **cluster member match** \u2014 whether the field name matches any synonym in an existing semantic cluster. Finally, it computes **semantic embedding similarity** using AI embeddings to find conceptually similar fields. This graduated approach prioritizes fast, deterministic matches before falling back to more expensive similarity comparisons."
1921
+ },
1922
+ {
1923
+ type: "paragraph",
1924
+ text: "Resolution runs concurrently across documents. Each document's fields are resolved in an isolated transaction to prevent lock contention. Occurrence counts are updated atomically in the same SQL transaction using upserts with deadlock retry logic. This keeps the registry eventually consistent without blocking concurrent ingestion, even when hundreds of documents are being processed simultaneously."
1717
1925
  },
1718
1926
  {
1719
1927
  type: "paragraph",
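The matching order described in this hunk (exact name match, then cluster member match, then semantic embedding similarity) can be sketched as below. The registry entry shape `{ canonicalName, synonyms, embedding }`, the case-insensitive comparison, and the use of cosine similarity are all assumptions for illustration; the source does not specify the normalization or the similarity metric.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Resolves one extracted field against the registry in the documented order.
// Returns null only when the registry is empty.
function resolveField(field, registry) {
  const name = field.name.trim().toLowerCase();

  // Step 1: exact name match against canonical registry names.
  const exact = registry.find((entry) => entry.canonicalName.toLowerCase() === name);
  if (exact) return { entry: exact, method: "exact", score: 1 };

  // Step 2: cluster member match against known synonyms.
  const member = registry.find((entry) =>
    entry.synonyms.some((synonym) => synonym.toLowerCase() === name)
  );
  if (member) return { entry: member, method: "cluster", score: 1 };

  // Step 3: fall back to the most similar entry by embedding.
  let best = null;
  for (const entry of registry) {
    const score = cosineSimilarity(field.embedding, entry.embedding);
    if (best === null || score > best.score) {
      best = { entry, method: "embedding", score };
    }
  }
  return best;
}
```

The cheap string checks run first, so the embedding comparison is only reached for names the registry has never seen.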
@@ -1732,15 +1940,19 @@ var sections4 = [
1732
1940
  faq: [
1733
1941
  {
1734
1942
  question: "How does field resolution work in Talonic?",
1735
- answer: "Each extracted field is matched against the registry using three bands: auto (>=0.80 similarity, auto-linked), confirm (0.50-0.79, flagged for review), and new (<0.50, creates a new Tier 3 field)."
1943
+ answer: "Each extracted field is matched against the registry in three steps, in strict order: exact name match, cluster member match, then semantic embedding similarity. The result falls into one of three bands: auto (>=0.80, auto-linked), confirm (0.50-0.79, flagged for review), or new (<0.50, creates a new Tier 3 field). The matching order is never skipped."
1736
1944
  },
1737
1945
  {
1738
1946
  question: "Where can I review pending field confirmations?",
1739
- answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge into an existing cluster, or reject to create a new field."
1947
+ answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge the field into an existing cluster, or reject to create a new independent field. Processing confirmations promptly improves resolution accuracy for future documents."
1740
1948
  },
1741
1949
  {
1742
1950
  question: "What happens after resolution completes?",
1743
- answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas."
1951
+ answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas. The chain is atomic \u2014 it never breaks midway."
1952
+ },
1953
+ {
1954
+ question: "How does resolution reduce extraction cost during job runs?",
1955
+ answer: "During job runs, the system uses a 3-tier lookup cascade \u2014 string normalization, token fuzzy matching, then AI fallback \u2014 to fill 60-80% of cells without a full LLM call. Fields that are well-established in the registry with high occurrence counts are the most likely to resolve via lookup, making Tier 1 and Tier 2 fields essentially free to extract."
1744
1956
  }
1745
1957
  ],
1746
1958
  mentions: [
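The band boundaries quoted in the first FAQ answer of this hunk (auto at 0.80 and above, confirm from 0.50 up to 0.80, new below 0.50) map directly onto a classifier. The function name and string labels below are illustrative; only the thresholds come from the text.

```javascript
// Maps a similarity score to the resolution band described in the FAQ.
function classifyMatch(similarity) {
  if (similarity >= 0.8) return "auto"; // auto-linked to the existing field
  if (similarity >= 0.5) return "confirm"; // flagged for review
  return "new"; // creates a new Tier 3 field
}
```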
@@ -1760,7 +1972,7 @@ var sections4 = [
1760
1972
  content: [
1761
1973
  {
1762
1974
  type: "paragraph",
1763
- text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs."
1975
+ text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs. They encode domain-specific knowledge about where a field typically appears in a document, what format it takes, and how to disambiguate it from similar fields."
1764
1976
  },
1765
1977
  {
1766
1978
  type: "paragraph",
@@ -1774,9 +1986,26 @@ var sections4 = [
1774
1986
  type: "paragraph",
1775
1987
  text: `You can view and edit master instructions from the field detail page in the registry. Editing an instruction overrides the AI-synthesized version, which is useful when you have domain expertise the AI hasn't captured. The **"Synthesize All"** button in the Field Registry triggers the full pipeline \u2014 embedding, resolution, and synthesis \u2014 for all qualifying fields in a single operation.`
1776
1988
  },
1989
+ {
1990
+ type: "heading",
1991
+ level: 3,
1992
+ id: "instruction-lifecycle",
1993
+ text: "Instruction Lifecycle"
1994
+ },
1995
+ {
1996
+ type: "list",
1997
+ ordered: true,
1998
+ items: [
1999
+ "A field is discovered and added to the registry as Tier 3 with no instruction.",
2000
+ "As more documents are processed, the field accumulates occurrences and extraction examples.",
2001
+ "When the field is promoted to Tier 2, the platform synthesizes a master instruction by analyzing all extraction patterns for that field.",
2002
+ "The instruction is injected into AI prompts during future job runs, improving extraction accuracy.",
2003
+ "You can manually edit the instruction at any time from the field detail page to incorporate domain expertise."
2004
+ ]
2005
+ },
1777
2006
  {
1778
2007
  type: "callout",
1779
- text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed &rarr; resolve &rarr; synthesize.'
2008
+ text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed &rarr; resolve &rarr; synthesize. The operation processes all fields that meet the synthesis criteria in a single batch.'
1780
2009
  }
1781
2010
  ],
1782
2011
  related: [
@@ -1818,11 +2047,11 @@ var sections5 = [
1818
2047
  content: [
1819
2048
  {
1820
2049
  type: "paragraph",
1821
- text: "Schemas define the structure of your output data. There are two types: AI-generated schemas created per document type, and user templates you define yourself."
2050
+ text: "Schemas define the structure of your output data. There are two types: AI-generated schemas created per document type, and user templates you define yourself. Generated schemas give you an automatic, always-up-to-date view of what the platform has discovered about each document type, while user templates let you define exactly which fields you need for a specific downstream use case."
1822
2051
  },
1823
2052
  {
1824
2053
  type: "paragraph",
1825
- text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed."
2054
+ text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed. The diff view highlights added fields, removed fields, type changes, and updated instructions so you can track how your field landscape evolves over time."
1826
2055
  },
1827
2056
  {
1828
2057
  type: "paragraph",
@@ -1832,9 +2061,13 @@ var sections5 = [
1832
2061
  type: "paragraph",
1833
2062
  text: "Generated schemas are most useful as a starting point for understanding what Talonic has discovered about your documents. Review the generated schema for a document type to see which fields the system has identified, then use that knowledge to build a **User Template** containing only the fields you actually need. You can also use the diff view to monitor how your field landscape evolves over time as new documents are processed and new fields are promoted."
1834
2063
  },
2064
+ {
2065
+ type: "paragraph",
2066
+ text: "The tier system determines which fields appear in generated schemas. **Tier 1** (core) fields are the most frequently occurring and reliably extracted data points \u2014 they appear in nearly every document of the type. **Tier 2** (established) fields occur in a significant portion of documents and have been validated through repeated extraction. **Tier 3** (emerging) fields are too new or infrequent to be included in generated schemas, but they may be promoted as more documents are processed and their occurrence rate crosses the promotion threshold."
2067
+ },
1835
2068
  {
1836
2069
  type: "callout",
1837
- text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry."
2070
+ text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry. Generated schemas serve as a discovery tool to understand what the platform has found in your documents."
1838
2071
  }
1839
2072
  ],
1840
2073
  related: [
@@ -1845,15 +2078,15 @@ var sections5 = [
1845
2078
  faq: [
1846
2079
  {
1847
2080
  question: "What are generated schemas?",
1848
- answer: "Generated schemas are AI-created output definitions for each document type, containing all Tier 1 and Tier 2 fields found in that type. They are versioned and support diffing between versions."
2081
+ answer: "Generated schemas are AI-created output definitions for each document type, containing all Tier 1 and Tier 2 fields found in that type. They are versioned and support diffing between versions. Each field in the schema includes data type information, the AI-synthesized master instruction, and occurrence statistics."
1849
2082
  },
1850
2083
  {
1851
2084
  question: "How are generated schemas updated?",
1852
- answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed."
2085
+ answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed. The versioning system is append-only, so every previous version is preserved in the timeline for reference."
1853
2086
  },
1854
2087
  {
1855
2088
  question: "Can I run an extraction job using a generated schema?",
1856
- answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version."
2089
+ answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version. Generated schemas are designed as a discovery tool \u2014 use them to understand what the platform has found, then build a focused template for your specific output needs."
1857
2090
  }
1858
2091
  ],
1859
2092
  mentions: ["generated schemas", "AI-generated", "versioning", "schema diff"]
@@ -1867,7 +2100,7 @@ var sections5 = [
1867
2100
  content: [
1868
2101
  {
1869
2102
  type: "paragraph",
1870
- text: "User templates are the primary way to define your output structure. Navigate to **Structuring &rarr; Schemas** to create one."
2103
+ text: "User templates are the primary way to define your output structure. Navigate to **Structuring &rarr; Schemas** to create one. Templates give you complete control over which fields appear in your output, how they are extracted, and what validation rules apply. Unlike generated schemas, templates are executable \u2014 once published, they can be used to run extraction jobs."
1871
2104
  },
1872
2105
  {
1873
2106
  type: "list",
@@ -1905,15 +2138,15 @@ var sections5 = [
1905
2138
  faq: [
1906
2139
  {
1907
2140
  question: "How do I create a user template?",
1908
- answer: "Navigate to Structuring > Schemas, create a template with a name and description, add fields with data types and instructions, map to the registry, add reference tables, and publish."
2141
+ answer: "Navigate to Structuring > Schemas, create a template with a name and description, add fields with data types and instructions, map to the registry, add reference tables, and publish. You can also import from Excel, CSV, or JSON to bootstrap a template from an existing spreadsheet \u2014 column headers become field names and data types are inferred automatically."
1909
2142
  },
1910
2143
  {
1911
2144
  question: "What is the difference between generated schemas and user templates?",
1912
- answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields. User templates are custom-defined output structures where you choose exactly which fields to include and how to map them."
2145
+ answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields \u2014 they are read-only and cannot run jobs. User templates are custom-defined output structures where you choose exactly which fields to include, how to map them to the registry, and what validation rules apply. Only published user templates can be used for extraction jobs."
1913
2146
  },
1914
2147
  {
1915
2148
  question: "Can I update a published template?",
1916
- answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing."
2149
+ answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing. This append-only versioning ensures that historical job results always reference the exact schema that produced them."
1917
2150
  }
1918
2151
  ],
1919
2152
  mentions: ["user templates", "schema creation", "field mapping", "reference tables", "publish"]
@@ -1927,7 +2160,7 @@ var sections5 = [
1927
2160
  content: [
1928
2161
  {
1929
2162
  type: "paragraph",
1930
- text: "Every field in a template supports advanced features beyond the basic name and type. These features control how values are extracted, validated, transformed, and delivered."
2163
+ text: "Every field in a template supports advanced features beyond the basic name and type. These features control how values are extracted, validated, transformed, and delivered. You can layer features independently \u2014 for example, a single field can have a format constraint, a reference table for code lookup, modifiers for post-processing, and an output name remap for delivery. Features compose without conflicts, giving you fine-grained control over every aspect of the extraction and output pipeline."
1931
2164
  },
1932
2165
  {
1933
2166
  type: "param-table",
@@ -1979,6 +2212,20 @@ var sections5 = [
1979
2212
  type: "paragraph",
1980
2213
  text: "When configuring a field, start with the basics \u2014 name, type, and registry mapping \u2014 then layer on advanced features as needed. For example, add a **format constraint** to enforce a date pattern, attach a **reference table** for code lookups, or define **capture submoves** to control the exact extraction sequence. Features compose independently, so you can mix and match without conflicts."
1981
2214
  },
2215
+ {
2216
+ type: "list",
2217
+ ordered: false,
2218
+ items: [
2219
+ "**Format constraint** \u2014 Regex validation with configurable mismatch behavior (clear, flag, or replace).",
2220
+ "**Modifiers** \u2014 Post-processing pipeline: format (date/number conversion), alias (value mapping), max_length (truncation).",
2221
+ "**Constraints** \u2014 Validation rules: required, enum, date-format, length, cross-field expressions.",
2222
+ "**Bypass strategy** \u2014 Skip AI extraction: constant value, deterministic ID generator, or reference table lookup.",
2223
+ "**Reference table** \u2014 Key-value pairs for code mapping with a 3-tier lookup cascade (normalization, fuzzy, AI).",
2224
+ "**Manual instruction** \u2014 User-written extraction directive that overrides the AI-synthesized master instruction.",
2225
+ "**Capture submoves** \u2014 Ordered extraction sequence: match (field matching), compute (calculation), reason (LLM inference).",
2226
+ "**Output name** \u2014 Remap the field name in delivery and export output without changing the internal schema name."
2227
+ ]
2228
+ },
1982
2229
  {
1983
2230
  type: "paragraph",
1984
2231
  text: "The **modifier pipeline** runs in a fixed order during Phase 4 of the extraction pipeline: format transforms first (converting dates or numbers to your target format), then alias mapping (replacing values using a lookup), and finally max_length truncation. Constraint evaluation happens after all modifiers have been applied, so constraints validate the final transformed value, not the raw extraction."
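
The ordering described above can be sketched as follows. This is a minimal illustration, not Talonic's implementation: `applyModifiers`, `checkConstraints`, and the shape of the `modifiers` object are hypothetical, and the `format` function is a stand-in for a real date or number conversion.

```javascript
// Hypothetical sketch: format -> alias -> max_length, then constraints
// are evaluated on the final transformed value (not the raw extraction).
function applyModifiers(rawValue, modifiers = {}) {
  let value = rawValue;
  // 1. Format transform (assumed here to be a caller-supplied function).
  if (typeof modifiers.format === "function") {
    value = modifiers.format(value);
  }
  // 2. Alias mapping: replace the value when a lookup entry exists.
  if (modifiers.alias && Object.prototype.hasOwnProperty.call(modifiers.alias, value)) {
    value = modifiers.alias[value];
  }
  // 3. max_length truncation always runs last.
  if (Number.isInteger(modifiers.max_length)) {
    value = String(value).slice(0, modifiers.max_length);
  }
  return value;
}

// Returns the names of the constraints that the value violates.
function checkConstraints(value, constraints = {}) {
  const errors = [];
  const isEmpty = value === null || value === undefined || value === "";
  if (constraints.required && isEmpty) errors.push("required");
  if (Array.isArray(constraints.enum) && !constraints.enum.includes(value)) errors.push("enum");
  return errors;
}

const modifiers = {
  format: (v) => v.trim().toUpperCase(),
  alias: { "UNITED STATES": "US" },
  max_length: 2,
};
const rawValue = "  united states ";
const finalValue = applyModifiers(rawValue, modifiers);
const finalErrors = checkConstraints(finalValue, { required: true, enum: ["US", "DE", "GB"] });
const rawErrors = checkConstraints(rawValue, { required: true, enum: ["US", "DE", "GB"] });
```

Because constraints see only the transformed value, `finalErrors` is empty here even though the raw extraction would have failed the `enum` check.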
@@ -2083,15 +2330,15 @@ var sections5 = [
2083
2330
  faq: [
2084
2331
  {
2085
2332
  question: "How does field matching work in Talonic schemas?",
2086
- answer: "Schema fields are matched to the registry using four types: Exact (direct name match), Semantic (AI finds equivalent field), Composite (multiple fields combine), and Unmapped (no match, needs manual instructions)."
2333
+ answer: "Schema fields are matched to the registry using a three-band resolution process. First, the field name is matched exactly against canonical names and synonyms. Next, embedding similarity identifies semantic matches (auto-accepted above 0.8, flagged for confirmation between 0.5 and 0.8). Four match types result: Exact (direct name match), Semantic (AI finds equivalent field), Composite (multiple fields combine), and Unmapped (no match, needs manual instructions)."
2087
2334
  },
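
The resolution bands in this answer can be sketched as a small classifier. The 0.8 and 0.5 thresholds come from the text. Everything else is an assumption made for illustration: the `matchField` name, the registry shape, case-insensitive name comparison, and treating scores below 0.5 as unmapped. Composite matches are not modeled.

```javascript
const AUTO_ACCEPT = 0.8; // above this: semantic match accepted automatically
const CONFIRM_MIN = 0.5; // between this and AUTO_ACCEPT: user confirms

const normalizeName = (name) => name.trim().toLowerCase();

// registry: [{ canonical, synonyms }]; similarityTo: (a, b) => score in [0, 1].
function matchField(fieldName, registry, similarityTo) {
  const wanted = normalizeName(fieldName);
  // Step 1: exact match against canonical names and synonyms.
  for (const entry of registry) {
    const names = [entry.canonical, ...(entry.synonyms ?? [])].map(normalizeName);
    if (names.includes(wanted)) {
      return { type: "exact", field: entry.canonical, action: "accept" };
    }
  }
  // Step 2: pick the registry field with the highest similarity score.
  let best = null;
  for (const entry of registry) {
    const score = similarityTo(fieldName, entry.canonical);
    if (!best || score > best.score) best = { field: entry.canonical, score };
  }
  if (best && best.score > AUTO_ACCEPT) {
    return { type: "semantic", field: best.field, action: "accept" };
  }
  if (best && best.score >= CONFIRM_MIN) {
    return { type: "semantic", field: best.field, action: "confirm" };
  }
  // Assumed: anything weaker is unmapped and needs a manual instruction.
  return { type: "unmapped", field: null, action: "manual_instruction" };
}

const registry = [
  { canonical: "invoice_number", synonyms: ["invoice_no", "inv_number"] },
  { canonical: "total_amount", synonyms: [] },
];
// Stand-in for an embedding model: fixed scores for the demo inputs.
const fakeScores = { "amount_due|total_amount": 0.86, "grand_sum|total_amount": 0.62 };
const fakeSimilarity = (a, b) => fakeScores[`${a}|${b}`] ?? 0.1;

const exactMatch = matchField("Invoice_No", registry, fakeSimilarity);
const autoMatch = matchField("amount_due", registry, fakeSimilarity);
const confirmMatch = matchField("grand_sum", registry, fakeSimilarity);
const noMatch = matchField("warranty_period", registry, fakeSimilarity);
```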
2088
2335
  {
2089
2336
  question: "What happens when a field is unmapped?",
2090
- answer: "Unmapped fields have no registry match. They require manual extraction instructions to guide the AI on how to extract the value from documents."
2337
+ answer: "Unmapped fields have no registry match and do not inherit a master extraction instruction. They require manual extraction instructions to guide the AI on how to extract the value from documents. Write clear, specific instructions describing where in the document to look and what formatting to expect for best results."
2091
2338
  },
2092
2339
  {
2093
2340
  question: "Can I re-run field matching after adding more documents?",
2094
- answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows."
2341
+ answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows through processing additional documents. For best results, use descriptive field names that reflect the actual data rather than generic labels."
2095
2342
  }
2096
2343
  ],
2097
2344
  mentions: ["field matching", "exact match", "semantic match", "composite", "unmapped"]
@@ -2105,7 +2352,7 @@ var sections5 = [
2105
2352
  content: [
2106
2353
  {
2107
2354
  type: "paragraph",
2108
- text: "Reference tables map human-readable values to system codes. Each table is a list of key-value pairs where `key` = output code and `value` = label. During extraction, a 3-tier lookup cascade runs:"
2355
+ text: 'Reference tables map human-readable values to system codes. Each table is a list of key-value pairs where `key` = output code and `value` = label. For example, a country reference table might map "United States" to `US`, "Germany" to `DE`, and "United Kingdom" to `GB`. During extraction, a 3-tier lookup cascade runs automatically against the table to normalize extracted values to your canonical codes:'
2109
2356
  },
2110
2357
  {
2111
2358
  type: "param-table",
@@ -2136,13 +2383,17 @@ var sections5 = [
2136
2383
  type: "paragraph",
2137
2384
  text: "Reference tables are used in two pipeline stages. In **Phase 1**, the lookup cascade runs as part of the resolve step, mapping extracted labels to codes without any AI calls (Tier 1 and Tier 2). In **Phase 3**, the cascade runs again on values produced by Phase 2's AI extraction, normalizing free-text AI output to your canonical codes. This two-pass approach ensures maximum code coverage across the entire pipeline."
2138
2385
  },
2386
+ {
2387
+ type: "paragraph",
2388
+ text: 'For example, consider a "Contract Type" field with a reference table mapping codes to labels: `std_master` = "Master Agreement", `std_service` = "Service Agreement", `std_nda` = "Non-Disclosure Agreement". When the AI extracts "Frame Agreement" from a document, the Phase 3 lookup cascade normalizes it: Tier 1 finds no exact match, Tier 2 fuzzy matching scores "Frame Agreement" against "Master Agreement" at ~0.65 (below the threshold), so Tier 3 AI fallback maps it to `std_master` at 0.50 confidence. Adding "Frame Agreement" as a synonym pointing to `std_master` would promote this to a Tier 1 match (0.95 confidence) in future runs.'
2389
+ },
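
The "Contract Type" walkthrough above can be reproduced with a small sketch of the cascade. The 0.95 and 0.50 confidences come from the text. The fuzzy threshold of 0.8, the Levenshtein-based score, the `lookup` function, and the stubbed AI fallback are assumptions for illustration, so the fuzzy score will not equal the ~0.65 quoted above.

```javascript
const FUZZY_THRESHOLD = 0.8; // assumed; the text does not state the threshold

const normalize = (s) => s.trim().toLowerCase().replace(/\s+/g, " ");

// Classic single-row Levenshtein edit distance.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// table: [{ key: outputCode, value: label }]; aiFallback: (label) => code or null.
function lookup(label, table, aiFallback) {
  const wanted = normalize(label);
  // Tier 1: normalized exact match.
  const exact = table.find((e) => normalize(e.value) === wanted);
  if (exact) return { key: exact.key, tier: 1, confidence: 0.95 };
  // Tier 2: best fuzzy match, accepted only at or above the threshold.
  let best = null;
  for (const e of table) {
    const score = similarity(wanted, normalize(e.value));
    if (!best || score > best.score) best = { key: e.key, score };
  }
  if (best && best.score >= FUZZY_THRESHOLD) {
    return { key: best.key, tier: 2, confidence: best.score };
  }
  // Tier 3: AI fallback, stubbed by the caller in this sketch.
  const aiKey = aiFallback(label);
  if (aiKey) return { key: aiKey, tier: 3, confidence: 0.5 };
  return { key: null, tier: null, confidence: 0, flag: "lookup_failed" };
}

const contractTypes = [
  { key: "std_master", value: "Master Agreement" },
  { key: "std_service", value: "Service Agreement" },
  { key: "std_nda", value: "Non-Disclosure Agreement" },
];
const stubAi = (label) => (/frame/i.test(label) ? "std_master" : null);

const before = lookup("Frame Agreement", contractTypes, stubAi);
const withSynonym = [...contractTypes, { key: "std_master", value: "Frame Agreement" }];
const after = lookup("Frame Agreement", withSynonym, stubAi);
```

In this sketch `before` resolves at Tier 3 and `after` at Tier 1, which mirrors the effect of adding the synonym entry.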
2139
2390
  {
2140
2391
  type: "paragraph",
2141
2392
  text: 'For best results, include common variations and abbreviations as separate value entries all pointing to the same key. For example, if your code is `US`, add values for "United States", "USA", "U.S.A.", and "United States of America". The more variations you cover, the more values resolve at Tier 1 (highest confidence) without falling through to fuzzy or AI matching.'
2142
2393
  },
2143
2394
  {
2144
2395
  type: "callout",
2145
- text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run."
2396
+ text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run. Review the `lookup_failed` validation flag in Phase 3 results to identify values that could not be mapped \u2014 these are candidates for adding new entries to your table."
2146
2397
  }
2147
2398
  ],
2148
2399
  related: [
@@ -2197,7 +2448,7 @@ var sections5 = [
2197
2448
  },
2198
2449
  {
2199
2450
  type: "callout",
2200
- text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing."
2451
+ text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing. Always run a **Test Extraction** on representative documents before publishing a draft that includes breaking changes."
2201
2452
  }
2202
2453
  ],
2203
2454
  related: [
@@ -2208,15 +2459,15 @@ var sections5 = [
2208
2459
  faq: [
2209
2460
  {
2210
2461
  question: "How does schema versioning work?",
2211
- answer: "Templates use a workshop system: Live (published, read-only), Workshop (mutable draft), and Version History (timeline with diffs). Breaking changes like field removals or type changes are detected on promotion."
2462
+ answer: "Templates use a workshop system with three states: Live (published, read-only), Workshop (mutable draft), and Version History (timeline with diffs). Breaking changes like field removals or type changes are detected on promotion. Every published version is immutable, creating a complete audit trail of how your schema evolved over time. The diff view highlights added fields, removed fields, type changes, and updated instructions between any two versions."
2212
2463
  },
2213
2464
  {
2214
2465
  question: "What are breaking changes in a schema?",
2215
- answer: "Breaking changes include field removals and type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts."
2466
+ answer: "Breaking changes include field removals and data type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts. If a downstream delivery binding depends on a specific field, the warning helps you assess the impact before committing the change. Always run a Test Extraction on representative documents before publishing a draft that includes breaking changes."
2216
2467
  },
2217
2468
  {
2218
2469
  question: "Can I revert to a previous schema version?",
2219
- answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed."
2470
+ answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed. This design ensures that every historical job result always references the exact schema version that produced it. For safe iteration, always use the Workshop draft to test changes via Test Extraction before publishing a new version."
2220
2471
  }
2221
2472
  ],
2222
2473
  mentions: ["versioning", "drafts", "workshop", "live version", "breaking changes"]
@@ -2230,23 +2481,27 @@ var sections5 = [
2230
2481
  content: [
2231
2482
  {
2232
2483
  type: "paragraph",
2233
- text: "Before publishing a draft, run a test extraction to compare draft vs. live results side-by-side. Select a few documents, run the test, and see exactly how your changes affect output."
2484
+ text: "Before publishing a draft, run a test extraction to compare draft vs. live results side-by-side. Select a few documents, run the test, and see exactly how your changes affect output. This is the safest way to validate schema changes \u2014 you can iterate on field instructions, reference tables, and format constraints without affecting production jobs or published data."
2234
2485
  },
2235
2486
  {
2236
2487
  type: "paragraph",
2237
- text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
2488
+ text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. Cells that improved show in green, cells that regressed show in red, and unchanged cells are neutral. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
2238
2489
  },
2239
2490
  {
2240
2491
  type: "paragraph",
2241
- text: "Test extractions run through the same 4-phase pipeline as production jobs, so the results you see are identical to what a full job would produce. The test uses a simplified single-call extraction mode under the hood, which is faster but still applies all schema features including reference table lookups, format constraints, and modifiers. This gives you a reliable preview without the cost of a full pipeline run."
2492
+ text: "Test extractions run through the same extraction pipeline as production jobs, so the results you see are representative of what a full job would produce. The test uses a simplified single-call extraction mode under the hood, which is faster but still applies all schema features including reference table lookups, format constraints, modifiers, and bypass strategies. This gives you a reliable preview without the cost and time of a full 4-phase pipeline run."
2242
2493
  },
2243
2494
  {
2244
2495
  type: "paragraph",
2245
2496
  text: 'For best results, select 3-5 representative documents that cover the variety in your corpus \u2014 include at least one "clean" document and one with unusual formatting or missing fields. This gives you confidence that your schema handles both typical and edge-case documents correctly. Run the test after every significant change to a field instruction, reference table, or format constraint.'
2246
2497
  },
2498
+ {
2499
+ type: "paragraph",
2500
+ text: "A typical iteration workflow looks like this: add or modify a field in the Workshop draft, run a test extraction on your sample documents, review the comparison grid to check that the new field produces correct values, adjust the instruction if needed, re-test, and publish when satisfied. This tight feedback loop is the fastest way to refine extraction accuracy without impacting production jobs or consuming unnecessary credits."
2501
+ },
2247
2502
  {
2248
2503
  type: "callout",
2249
- text: "Test extractions do not affect your live data or consume production job credits differently. They are designed for rapid iteration \u2014 run as many tests as you need before publishing."
2504
+ text: "Test extractions do not affect your live data or consume production job credits differently. They are designed for rapid iteration \u2014 run as many tests as you need before publishing. Results are temporary and do not appear in your job history."
2250
2505
  }
2251
2506
  ],
2252
2507
  related: [
@@ -2279,7 +2534,7 @@ var sections5 = [
2279
2534
  content: [
2280
2535
  {
2281
2536
  type: "paragraph",
2282
- text: "Dialects define the output format for structured data. They control how values are serialized when delivered or exported. A dialect can be shared across schemas or defined inline for a specific schema. Configure dialects in the **Schema &rarr; Delivery** tab."
2537
+ text: "Dialects define the output format for structured data. They control how values are serialized when delivered or exported \u2014 everything from date formatting and number locale to CSV delimiters and character encoding. A dialect can be shared across schemas or defined inline for a specific schema. Configure dialects in the **Schema &rarr; Delivery** tab. Shared dialects ensure consistent formatting across all your exports without duplicating configuration on every schema."
2283
2538
  },
2284
2539
  {
2285
2540
  type: "param-table",
@@ -2317,6 +2572,10 @@ var sections5 = [
2317
2572
  }
2318
2573
  ]
2319
2574
  },
2575
+ {
2576
+ type: "paragraph",
2577
+ text: 'For example, to configure date formatting for a European accounting system: set `date_format` to `DD.MM.YYYY` so dates render as `15.03.2025` instead of the default `YYYY/MM/DD`. Pair this with `number_locale: "de-DE"` for comma-decimal formatting (`1.234,56`) and `delimiter: ";"` so CSV files open correctly in Excel on European locale machines. Save this configuration as a shared dialect named "EU Accounting" and attach it to every schema that feeds into that system \u2014 all future exports and deliveries will use consistent formatting without per-schema configuration.'
2578
+ },
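
To make the "EU Accounting" example concrete, the sketch below serializes one row with those settings. The setting names `date_format`, `number_locale`, and `delimiter` come from the text. The dialect object shape, the `toCsvRow` helper, and the two-decimal number formatting are assumptions, and the number formatting relies on the runtime shipping `de-DE` locale data for `Intl.NumberFormat`.

```javascript
// Hypothetical dialect object mirroring the settings named in the text.
const euAccounting = {
  name: "EU Accounting",
  date_format: "DD.MM.YYYY",
  number_locale: "de-DE",
  delimiter: ";",
};

// Fills DD, MM and YYYY tokens from a Date, using UTC to avoid timezone drift.
function formatDate(date, pattern) {
  const dd = String(date.getUTCDate()).padStart(2, "0");
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  const yyyy = String(date.getUTCFullYear());
  return pattern.replace("DD", dd).replace("MM", mm).replace("YYYY", yyyy);
}

function formatNumber(value, locale) {
  return new Intl.NumberFormat(locale, {
    minimumFractionDigits: 2,
    maximumFractionDigits: 2,
  }).format(value);
}

// Serializes one row; cells containing the delimiter, quotes or newlines are quoted.
function toCsvRow(values, dialect) {
  return values
    .map((v) => {
      if (v instanceof Date) return formatDate(v, dialect.date_format);
      if (typeof v === "number") return formatNumber(v, dialect.number_locale);
      return String(v);
    })
    .map((cell) =>
      cell.includes(dialect.delimiter) || /["\n]/.test(cell)
        ? `"${cell.replace(/"/g, '""')}"`
        : cell
    )
    .join(dialect.delimiter);
}

const csvRow = toCsvRow(["INV-001", new Date(Date.UTC(2025, 2, 15)), 1234.56], euAccounting);
```

The semicolon delimiter matters here because the formatted amount contains a comma, which would otherwise need quoting in a comma-separated file.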
2320
2579
  {
2321
2580
  type: "paragraph",
2322
2581
  text: "When working with international data, configure the dialect to match your downstream system requirements. For example, set **number_locale** to `fr-FR` for European comma-decimal formatting, switch the **delimiter** to semicolon for CSV compatibility, and choose **UTF-8-BOM** encoding if your data will be opened in Excel. Creating a shared dialect and reusing it across schemas ensures consistent formatting across all your exports."
@@ -2371,7 +2630,7 @@ var sections5 = [
2371
2630
  content: [
2372
2631
  {
2373
2632
  type: "paragraph",
2374
- text: "Bypass strategies determine how a schema field is populated when it should not go through LLM extraction. Each strategy provides a deterministic value without consuming AI credits."
2633
+ text: "Bypass strategies determine how a schema field is populated when it should not go through LLM extraction. Each strategy provides a deterministic value without consuming AI credits. This is useful for fields whose values are known ahead of time, can be derived from other fields, or should be looked up from reference data rather than extracted from the document text."
2375
2634
  },
2376
2635
  {
2377
2636
  type: "param-table",
@@ -2403,6 +2662,17 @@ var sections5 = [
2403
2662
  type: "paragraph",
2404
2663
  text: 'Use bypass strategies for fields whose values are known ahead of time or can be derived without reading the document. For example, set a **constant** of `"USD"` for a currency field that is always the same, or use a **generator** to produce a deterministic ID for each row. Fields with bypass strategies skip the AI extraction phase entirely, reducing processing time and credit usage.'
2405
2664
  },
2665
+ {
2666
+ type: "list",
2667
+ ordered: false,
2668
+ items: [
2669
+ "**none** \u2014 Use when a field should always be blank. Useful for placeholder columns in your output that will be populated by a downstream system.",
2670
+ '**constant** \u2014 Use when the value never varies across documents (e.g., currency `"USD"`, data source `"talonic"`, processing status `"pending"`).',
2671
+ "**generator (deterministic-id)** \u2014 Use when you need a unique, reproducible identifier for each row. Produces a hash-based ID from entity attributes.",
2672
+ "**generator (context-fallback)** \u2014 Use when the value can be derived from other fields in the schema without reading the document.",
2673
+ "**reference** \u2014 Use when the value should be looked up from a reference table using a `key_expression` that references another schema field (e.g., map supplier name to ERP vendor code)."
2674
+ ]
2675
+ },
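
The strategies in this list can be pictured as a dispatcher that either produces a value or signals that the field still needs LLM extraction. This is a sketch under stated assumptions: the `strategy` object shape, the `resolveBypass` and `deterministicId` names, the SHA-256 hash, and treating `key_expression` as a plain field name are all hypothetical. The fallthrough to LLM extraction when a generator yields nothing follows the callout in this section.

```javascript
const { createHash } = require("crypto");

// Reproducible ID: the same attribute values always hash to the same string.
function deterministicId(attributes) {
  return createHash("sha256").update(JSON.stringify(attributes)).digest("hex").slice(0, 16);
}

// row: values of other schema fields; referenceTables: { name: [{ key, value }] }.
function resolveBypass(strategy, row, referenceTables = {}) {
  switch (strategy.type) {
    case "none":
      return { value: "", needsLlm: false }; // field stays blank
    case "constant":
      return { value: strategy.value, needsLlm: false };
    case "generator": {
      let value = null;
      if (strategy.generator === "deterministic-id") {
        const attrs = strategy.fields.map((f) => row[f]);
        if (attrs.every((v) => v !== undefined && v !== null)) value = deterministicId(attrs);
      } else if (strategy.generator === "context-fallback") {
        value = row[strategy.source] ?? null;
      }
      // A generator that yields nothing falls through to LLM extraction.
      return value === null ? { value: null, needsLlm: true } : { value, needsLlm: false };
    }
    case "reference": {
      // Simplification: key_expression is treated as the name of another field.
      const label = row[strategy.key_expression];
      const table = referenceTables[strategy.table] ?? [];
      const hit = table.find((entry) => entry.value === label);
      return { value: hit ? hit.key : null, needsLlm: false };
    }
    default:
      throw new Error(`Unknown bypass strategy: ${strategy.type}`);
  }
}

const row = { supplier_name: "Acme GmbH", invoice_number: "INV-001" };
const tables = { vendors: [{ key: "V-1001", value: "Acme GmbH" }] };

const currency = resolveBypass({ type: "constant", value: "USD" }, row);
const rowId = resolveBypass(
  { type: "generator", generator: "deterministic-id", fields: ["supplier_name", "invoice_number"] },
  row
);
const vendorCode = resolveBypass(
  { type: "reference", table: "vendors", key_expression: "supplier_name" },
  row,
  tables
);
const missing = resolveBypass(
  { type: "generator", generator: "context-fallback", source: "po_number" },
  row
);
```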
2406
2676
  {
2407
2677
  type: "paragraph",
2408
2678
  text: "The **reference** bypass strategy is particularly powerful for enrichment fields. Define a `key_expression` that references another field in the schema (e.g., the supplier name), and the system will automatically look up the corresponding code from your reference table without any AI involvement. This is ideal for mapping extracted entity names to internal system identifiers, ERP codes, or classification labels."
@@ -2413,7 +2683,7 @@ var sections5 = [
2413
2683
  },
2414
2684
  {
2415
2685
  type: "callout",
2416
- text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net. Strategy values are normalized via generator mappings in Phase 4 of the pipeline."
2686
+ text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net \u2014 your data is never left incomplete due to a bypass misconfiguration. Strategy values are normalized via generator mappings in Phase 4 of the pipeline. Bypass strategies execute during Phase 1, before any AI calls are made."
2417
2687
  }
2418
2688
  ],
2419
2689
  related: [
@@ -2452,7 +2722,7 @@ var sections5 = [
2452
2722
  content: [
2453
2723
  {
2454
2724
  type: "paragraph",
2455
- text: "Format constraints apply regex-based validation to schema fields. They are evaluated post-extraction in Phase 4 of the pipeline, after all transforms have been applied. Original values are preserved for audit in `original_extractions`."
2725
+ text: "Format constraints apply regex-based validation to schema fields. They are evaluated post-extraction in Phase 4 of the pipeline, after all transforms have been applied. Original values are preserved for audit in `original_extractions`. This means you can always review what the AI originally extracted before the constraint was applied, giving you full visibility into the extraction pipeline even when values are cleared or replaced."
2456
2726
  },
2457
2727
  {
2458
2728
  type: "param-table",
@@ -2477,7 +2747,7 @@ var sections5 = [
2477
2747
  },
2478
2748
  {
2479
2749
  type: "paragraph",
2480
- text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax. The editor provides a live test input so you can verify the pattern before saving."
2750
+ text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax with support for inline flags like `(?i)` for case-insensitive matching. The editor provides a live test input so you can verify the pattern against sample values before saving. This immediate feedback loop helps you catch overly strict or overly permissive patterns before they affect real extraction runs."
2481
2751
  },
2482
2752
  {
2483
2753
  type: "paragraph",
@@ -2489,7 +2759,7 @@ var sections5 = [
2489
2759
  },
2490
2760
  {
2491
2761
  type: "callout",
2492
- text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching."
2762
+ text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching. Format constraints support standard JavaScript regex syntax, so you can use character classes, alternation, and lookahead assertions for complex validation patterns."
2493
2763
  }
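
A format constraint with the safeguards named in this callout might look like the sketch below. The 1,000-character cap, the nested-quantifier rejection, the `(?i)` flag, and the `empty`, `flag`, and `constant` behaviors come from the text. The function names, the `on_mismatch` and `replacement` keys, the deliberately naive nested-quantifier check, and the reading of the cap as truncation are assumptions. JavaScript's built-in `RegExp` does not accept a bare leading `(?i)`, so the sketch translates it into the `i` flag.

```javascript
const MAX_INPUT_LENGTH = 1000;
// Naive check: a group containing + or * that is itself quantified, e.g. (a+)+
const NESTED_QUANTIFIER = /\([^()]*[+*][^()]*\)[+*{]/;

function compilePattern(pattern) {
  let flags = "";
  let source = pattern;
  if (source.startsWith("(?i)")) {
    flags = "i"; // translate the inline flag into a RegExp flag
    source = source.slice(4);
  }
  if (NESTED_QUANTIFIER.test(source)) {
    throw new Error("Pattern rejected: nested quantifiers");
  }
  return new RegExp(source, flags);
}

// Returns the delivered value plus the original, which is kept for audit.
function applyFormatConstraint(value, constraint) {
  const regex = compilePattern(constraint.pattern);
  const input = String(value).slice(0, MAX_INPUT_LENGTH); // assumed: cap means truncate
  if (regex.test(input)) return { value, flagged: false, original: value };
  switch (constraint.on_mismatch ?? "empty") {
    case "flag":
      return { value, flagged: true, original: value };
    case "constant":
      return { value: constraint.replacement, flagged: false, original: value };
    default: // "empty": clear the cell
      return { value: "", flagged: false, original: value };
  }
}

const invoicePattern = "(?i)^inv-\\d{4}$";
const okResult = applyFormatConstraint("INV-0042", { pattern: invoicePattern });
const cleared = applyFormatConstraint("12345", { pattern: invoicePattern });
const flaggedResult = applyFormatConstraint("12345", { pattern: invoicePattern, on_mismatch: "flag" });
const replaced = applyFormatConstraint("12345", {
  pattern: invoicePattern,
  on_mismatch: "constant",
  replacement: "INVALID",
});
```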
2494
2764
  ],
2495
2765
  related: [
@@ -2500,15 +2770,15 @@ var sections5 = [
2500
2770
  faq: [
2501
2771
  {
2502
2772
  question: "What are format constraints?",
2503
- answer: "Format constraints apply regex-based validation to schema fields, evaluated post-extraction in Phase 4. Mismatch behaviors: empty (clear), flag (amber dot), or constant (replace with a fixed value)."
2773
+ answer: 'Format constraints apply regex-based validation to schema fields, evaluated post-extraction in Phase 4 after all transforms have been applied. Mismatch behaviors: empty (clear the cell, the default), flag (keep the value but show an amber dot in the results grid), or constant (replace with a fixed value like "INVALID" or "N/A"). The constraint validates the final transformed value, not the raw extraction.'
2504
2774
  },
2505
2775
  {
2506
2776
  question: "Are original values preserved when format constraints clear a cell?",
2507
- answer: "Yes. Original values are always preserved for audit in the original_extractions table, regardless of the mismatch behavior applied."
2777
+ answer: "Yes. Original values are always preserved for audit in the original_extractions table, regardless of the mismatch behavior applied. This means you can always review what the AI originally extracted before the constraint was applied, giving you full visibility into the extraction pipeline."
2508
2778
  },
2509
2779
  {
2510
2780
  question: "Can I use case-insensitive regex patterns?",
2511
- answer: "Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax with inline flags."
2781
+ answer: "Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax including character classes, alternation, and lookahead assertions. ReDoS protection is built in \u2014 nested quantifiers are rejected and input is capped at 1,000 characters."
2512
2782
  }
2513
2783
  ],
2514
2784
  mentions: [
@@ -2532,23 +2802,27 @@ var sections6 = [
2532
2802
  content: [
2533
2803
  {
2534
2804
  type: "paragraph",
2535
- text: "Extraction jobs are the core of the platform \u2014 where schemas meet documents and AI agents produce structured data. A job produces a grid: rows = documents, columns = schema fields."
2805
+ text: "Extraction jobs are the core of the platform \u2014 where schemas meet documents and AI agents produce structured data. A job produces a grid: rows = documents, columns = schema fields. Each cell in the grid contains an extracted value along with metadata including a confidence score, the resolution type, the pipeline phase that produced it, and an AI reasoning trace explaining how the value was derived from the source document."
2536
2806
  },
2537
2807
  {
2538
2808
  type: "paragraph",
2539
- text: "Navigate to **Structuring &rarr; Runs &rarr; New**. Select your template and documents, then click Start. Results appear progressively as each phase completes."
2809
+ text: "Navigate to **Structuring &rarr; Runs &rarr; New**. Select your template and documents, then click Start. Results appear progressively as each phase completes. You can choose between three extraction modes: **pipeline** (full 4-phase extraction, the default), **simple** (single AI call, faster but less thorough), or **field registry** (no AI, deterministic strategies only \u2014 useful for benchmarking registry coverage)."
2540
2810
  },
2541
2811
  {
2542
2812
  type: "paragraph",
2543
- text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents silent data loss where registry-based lookups would return empty results for unresolved documents."
2813
+ text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents silent data loss where registry-based lookups would return empty results for unresolved documents. The pre-flight resolution runs with a concurrency of 3 and failures are non-fatal \u2014 Phase 1 will proceed with whatever is resolved."
2544
2814
  },
2545
2815
  {
2546
2816
  type: "paragraph",
2547
2817
  text: "For best results, select documents of the same type or closely related types for a single job. The schema you choose should match the document content \u2014 using an invoice schema on contract documents will produce poor results. Start with a small batch of 5-10 documents to validate your schema, review the output, apply corrections, and then scale up to larger runs once you are confident in the extraction quality."
2548
2818
  },
2819
+ {
2820
+ type: "paragraph",
2821
+ text: "The platform supports scaling caps to ensure reliable processing: Phase 2 extraction handles up to 2,000 documents per job, and Phase 4 transforms support up to 1,000 documents. Grid results are flushed to the database in batches of 200 documents per phase. For very large document collections, consider splitting into multiple jobs by document type for optimal results and easier review."
2822
+ },
2549
2823
  {
2550
2824
  type: "callout",
2551
- text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running."
2825
+ text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running. The phase timeline on the job detail page shows which phase is active and the cumulative fill rate at each stage."
2552
2826
  }
2553
2827
  ],
2554
2828
  related: [
@@ -2581,7 +2855,7 @@ var sections6 = [
2581
2855
  content: [
2582
2856
  {
2583
2857
  type: "paragraph",
2584
- text: "Every job runs through four phases. Each fills more cells in the output grid, reducing the problem space for the next. Results are visible as each phase completes."
2858
+ text: "Every job runs through four phases. Each fills more cells in the output grid, reducing the problem space for the next. Results are visible as each phase completes. The grid is the single source of truth during execution and is flushed to the database after each phase, enabling progressive rendering in the UI. A confidence gate protects high-quality values from being overwritten by later phases \u2014 once a cell is filled with high confidence, it is permanently locked."
2585
2859
  },
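The confidence gate described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the platform's implementation: the threshold of 0.8, the `{ value, confidence }` cell shape, and the names `isEmpty` and `applyCandidate` are all hypothetical, and the rule that an upgrade must carry a higher confidence than the value it replaces is an interpretation of "upgrade".

```javascript
// Hypothetical threshold: the docs do not state the real cutoff.
const GATE_THRESHOLD = 0.8;

// Assumed cell shape: { value, confidence }, or null for an empty cell.
function isEmpty(cell) {
  return cell == null || cell.value == null || cell.value === "";
}

// Decide what a later phase is allowed to write into a cell.
function applyCandidate(cell, candidate, threshold = GATE_THRESHOLD) {
  if (isEmpty(cell)) return candidate;            // empty cells can always be filled
  if (cell.confidence >= threshold) return cell;  // locked: at or above the gate
  // Below the gate: accept only a genuine upgrade (assumed rule).
  return candidate.confidence > cell.confidence ? candidate : cell;
}
```

Keeping the gate as a single pure function means every phase can share the same rule instead of reimplementing it.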
2586
2860
  {
2587
2861
  type: "paragraph",
@@ -2603,7 +2877,7 @@ var sections6 = [
2603
2877
  },
2604
2878
  {
2605
2879
  type: "callout",
2606
- text: "Phase order is fixed: Phase 1 &rarr; 2 &rarr; 3 &rarr; 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs."
2880
+ text: "Phase order is fixed: Phase 1 &rarr; 2 &rarr; 3 &rarr; 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs. The confidence gate is the single most important pipeline rule \u2014 once a cell is filled with a high-confidence value, no later phase can overwrite it with a lower-confidence result."
2607
2881
  }
2608
2882
  ],
2609
2883
  related: [
@@ -2614,11 +2888,11 @@ var sections6 = [
2614
2888
  faq: [
2615
2889
  {
2616
2890
  question: "What are the four phases of the extraction pipeline?",
2617
- answer: "Phase 1: Resolve (graph matches, ~30% of cells), Phase 2: Agent (AI strategies), Phase 3: Validation (cross-field checks), and Phase 4: Re-read (targeted gap filling)."
2891
+ answer: "Phase 1: Resolve (graph matches and deterministic lookups, fills 30-80% of cells depending on registry maturity). Phase 2: Agent (AI extraction for remaining gaps, grouped into batches of 10 fields per call). Phase 3: Validation (cross-field checks and reference table re-normalization of AI output). Phase 4: Re-read (targeted gap filling with full grid context, plus deterministic transforms and format constraint evaluation)."
2618
2892
  },
2619
2893
  {
2620
2894
  question: "Can I see results before all phases complete?",
2621
- answer: "Yes. Results are visible as each phase completes. The fill rate increases progressively through the pipeline."
2895
+ answer: "Yes. The grid is flushed to the database after each phase, enabling progressive rendering in the UI. You can watch cells fill in real time and begin reviewing Phase 1 results while Phase 2 is still running. The phase timeline on the job detail page shows which phase is active and the cumulative fill rate at each stage."
2622
2896
  },
2623
2897
  {
2624
2898
  question: "Why does the pipeline use multiple phases instead of a single AI call?",
@@ -2636,7 +2910,7 @@ var sections6 = [
2636
2910
  content: [
2637
2911
  {
2638
2912
  type: "paragraph",
2639
- text: "The fastest phase (~30% of cells in seconds). For each document x each schema field, the system checks if the cell can be filled from existing extracted data. **No AI calls** (except rare Haiku fallback for ambiguous lookups)."
2913
+ text: "The fastest phase (~30% of cells in seconds). For each document x each schema field, the system checks if the cell can be filled from existing extracted data. **No AI calls** (except rare Haiku fallback for ambiguous lookups). Phase 1 is the workhorse of cost efficiency \u2014 it fills a large portion of the grid using pre-computed graph matches and deterministic lookups at near-zero cost. As your Field Registry grows from processing more documents, Phase 1 fill rates steadily improve."
2640
2914
  },
2641
2915
  {
2642
2916
  type: "param-table",
@@ -2681,6 +2955,10 @@ var sections6 = [
2681
2955
  type: "paragraph",
2682
2956
  text: "The resolution strategies execute in a fixed order: registry transfer first, then raw extraction mapping, then the 3-tier lookup cascade, and finally deterministic compute (formulas like `Total = Unit Price x Quantity`). Each strategy only attempts to fill cells that are still empty after the previous strategy ran. This ordering ensures that the highest-confidence method always gets priority."
2683
2957
  },
2958
+ {
2959
+ type: "paragraph",
2960
+ text: `For example, consider an invoice with a "Vendor Name" field. The system first checks the Field Registry for a direct transfer \u2014 if "Vendor Name" was extracted from a previous document and promoted to Tier 1, it resolves instantly at 0.85+ confidence. If no registry match exists, the raw extraction mapping looks for a semantically equivalent field in the document's extracted data (e.g., "supplier_name"). If that also misses, the 3-tier lookup cascade checks the reference table: exact normalization first (0.95), then fuzzy token overlap (~0.70), then AI fallback (0.50). Only if all four strategies fail does the cell pass to Phase 2 for AI extraction.`
2961
+ },
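The 3-tier lookup cascade from the example above might look roughly like the following. The confidence values (0.95, 0.70, 0.50) come from the text. Everything else is an assumption made for illustration: the normalization rules, the use of Jaccard overlap as the "fuzzy token overlap" measure, the 0.5 fuzzy cutoff, the `{ label, code }` reference table shape, and the `aiFallback` callback.

```javascript
// Assumed normalization: lowercase, trim, collapse whitespace.
function normalize(s) {
  return s.toLowerCase().trim().replace(/\s+/g, " ");
}

// Jaccard overlap between the token sets of two strings (assumed fuzzy measure).
function tokenOverlap(a, b) {
  const ta = new Set(normalize(a).split(" "));
  const tb = new Set(normalize(b).split(" "));
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}

// referenceTable: array of { label, code } (hypothetical shape).
// aiFallback: optional function (raw, table) => code or null (hypothetical).
function lookupCascade(raw, referenceTable, aiFallback = null, fuzzyMin = 0.5) {
  const key = normalize(raw);

  // Tier 1: exact match after normalization.
  const exact = referenceTable.find((row) => normalize(row.label) === key);
  if (exact) return { code: exact.code, confidence: 0.95, tier: "exact" };

  // Tier 2: best fuzzy token overlap above the cutoff.
  let best = null;
  for (const row of referenceTable) {
    const score = tokenOverlap(raw, row.label);
    if (score >= fuzzyMin && (best === null || score > best.score)) best = { row, score };
  }
  if (best) return { code: best.row.code, confidence: 0.7, tier: "fuzzy" };

  // Tier 3: AI fallback, only if one was supplied.
  if (aiFallback) {
    const code = aiFallback(raw, referenceTable);
    if (code != null) return { code, confidence: 0.5, tier: "ai" };
  }
  return null; // unresolved: the cell would move on to the next strategy or phase
}
```

Each tier runs only when the previous one misses, which mirrors the "highest-confidence method gets priority" ordering described for Phase 1.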
2684
2962
  {
2685
2963
  type: "callout",
2686
2964
  text: "Phase 1 fill rates improve over time as your Field Registry grows. The more documents you process, the richer the registry becomes, and the more cells Phase 1 can resolve without AI \u2014 reducing both cost and latency for every subsequent job."
@@ -2723,7 +3001,7 @@ var sections6 = [
2723
3001
  content: [
2724
3002
  {
2725
3003
  type: "paragraph",
2726
- text: "An AI agent reviews the grid's gap patterns and produces a typed strategy:"
3004
+ text: "An AI agent reviews the grid's gap patterns and produces a typed strategy for each remaining empty cell. The agent uses Anthropic Claude Sonnet to analyze the source document alongside the schema field definitions, any already-resolved values from Phase 1, and reference table codes when available. This context-aware approach allows the AI to use related extracted values as clues for finding dependent data points."
2727
3005
  },
2728
3006
  {
2729
3007
  type: "paragraph",
@@ -2763,7 +3041,7 @@ var sections6 = [
2763
3041
  },
2764
3042
  {
2765
3043
  type: "paragraph",
2766
- text: "Phase 2 processes documents with grouped extraction calls \u2014 schema fields are divided into batches of up to 10 fields per call to balance extraction quality with throughput. For each document, the agent sends the document text along with the schema field definitions and any already-resolved values from Phase 1 as context. This context-aware approach means the AI can use related values (like a contract start date) to more accurately extract dependent values (like the end date)."
3044
+ text: 'Phase 2 processes documents with grouped extraction calls \u2014 schema fields are divided into batches of up to 10 fields per call to balance extraction quality with throughput. For each document, the agent sends the document text along with the schema field definitions and any already-resolved values from Phase 1 as context. This context-aware approach means the AI can use related values (like a contract start date) to more accurately extract dependent values (like the end date). For example, if Phase 1 resolved "Contract Start Date" to 2025-01-15 via a registry transfer, and the "Contract End Date" cell is still empty, the agent receives the start date as context and can search the document for a corresponding end date with higher precision \u2014 producing a more accurate result than extracting the end date in isolation.'
2767
3045
  },
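The grouped extraction calls described above can be sketched as a simple chunking step. The batch size of 10 comes from the text. The function names and the payload shape (`documentText`, `fields`, `context`) are hypothetical and do not reflect the platform's actual request format.

```javascript
// Split schema fields into batches of at most `batchSize` fields.
function batchFields(fields, batchSize = 10) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < fields.length; i += batchSize) {
    batches.push(fields.slice(i, i + batchSize));
  }
  return batches;
}

// Build one (hypothetical) request payload per batch. Values already resolved
// in Phase 1 travel with every call so the model can use them as context.
function buildExtractionCalls(documentText, emptyFields, resolvedValues) {
  return batchFields(emptyFields).map((fields) => ({
    documentText,
    fields,
    context: resolvedValues,
  }));
}
```

For a document with 23 unresolved fields this produces three calls, with 10, 10, and 3 fields respectively.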
2768
3046
  {
2769
3047
  type: "paragraph",
@@ -2812,7 +3090,7 @@ var sections6 = [
2812
3090
  content: [
2813
3091
  {
2814
3092
  type: "paragraph",
2815
- text: "Cross-field sanity checks. Flags are **informational only** \u2014 they never block output but help you prioritize review:"
3093
+ text: "Cross-field sanity checks and re-resolution. Phase 3 performs two critical tasks: it re-runs the reference table lookup cascade on values produced by Phase 2 to normalize free-text AI output to your canonical codes, and it runs informational validation checks across related fields. Flags are **informational only** \u2014 they never block output but help you prioritize review:"
2816
3094
  },
2817
3095
  {
2818
3096
  type: "paragraph",
@@ -2857,6 +3135,10 @@ var sections6 = [
2857
3135
  type: "paragraph",
2858
3136
  text: "Validation flags are designed to surface the most impactful issues first. The **low_confidence_outlier** flag is particularly useful \u2014 it highlights cells where the system is uncertain in an otherwise high-confidence row, pointing you to the exact cells most likely to contain errors. For large runs with hundreds of documents, filtering by flags and reviewing those cells first can reduce your review time by 80% or more."
2859
3137
  },
3138
+ {
3139
+ type: "paragraph",
3140
+ text: "What gets flagged and why depends on cross-field relationships, not just individual values. A **date_sanity** flag fires when temporal fields contradict each other \u2014 for example, a contract end date that falls before the start date, or a signature date after the effective date. An **amount_mismatch** flag fires when a computed total deviates more than 20% from the product of its component values (e.g., monthly rent times term length versus total contract value). The **unexpected_empty** flag fires when a field that appears in over 80% of documents in your registry is missing from this particular document, suggesting the AI may have missed it rather than it being genuinely absent."
3141
+ },
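The three flags described above can be sketched as a single row-level check. The 20% and 80% thresholds come from the text. The field names (`start_date`, `end_date`, `monthly_rent`, `term_months`, `total_value`), the `fieldFrequency` map, and the choice to measure deviation relative to the expected product are assumptions made for illustration.

```javascript
// row: plain object of extracted values (hypothetical field names).
// fieldFrequency: { fieldName: share of registry documents containing it, 0..1 }.
// Returns a list of flags and never modifies the row.
function validateRow(row, fieldFrequency = {}) {
  const flags = [];

  // date_sanity: the end date must not fall before the start date.
  if (row.start_date && row.end_date &&
      new Date(row.end_date) < new Date(row.start_date)) {
    flags.push("date_sanity");
  }

  // amount_mismatch: total deviates more than 20% from rent x term.
  const { monthly_rent, term_months, total_value } = row;
  if ([monthly_rent, term_months, total_value].every(Number.isFinite)) {
    const expected = monthly_rent * term_months;
    if (expected !== 0 &&
        Math.abs(total_value - expected) / Math.abs(expected) > 0.2) {
      flags.push("amount_mismatch");
    }
  }

  // unexpected_empty: a field seen in over 80% of documents is missing here.
  for (const [field, share] of Object.entries(fieldFrequency)) {
    const value = row[field];
    if (share > 0.8 && (value === undefined || value === null || value === "")) {
      flags.push("unexpected_empty:" + field);
    }
  }
  return flags;
}
```

Because the function only returns annotations, it matches the rule that validation flags are informational and leave cell values and confidence scores untouched.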
2860
3142
  {
2861
3143
  type: "callout",
2862
3144
  text: "Validation flags never modify cell values. They are purely informational annotations that help you prioritize review. The actual cell value and confidence score remain unchanged by Phase 3 flagging."
@@ -2898,7 +3180,7 @@ var sections6 = [
2898
3180
  content: [
2899
3181
  {
2900
3182
  type: "paragraph",
2901
- text: "Context-aware gap filling. For each empty cell or low-confidence value, AI re-reads the original document with the field instruction and full grid context. This focused approach often finds values missed in earlier phases."
3183
+ text: "Context-aware gap filling and deterministic transforms. Phase 4 serves two purposes: for each empty cell or low-confidence value, AI re-reads the original document with the field instruction and full grid context to find values missed in earlier phases. It also applies deterministic transforms to all cell values \u2014 ISO code normalization, date format standardization, unit conversion \u2014 and evaluates format constraints (regex patterns) with configurable mismatch behaviors. The modifier pipeline runs in a fixed order: format transforms first, then alias mapping, then max_length truncation."
2902
3184
  },
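The fixed modifier order described above (format, then alias, then max_length, with constraint evaluation afterwards) can be sketched as follows. The field configuration shape (`format`, `aliases`, `maxLength`, `pattern`, `onMismatch`, `replacement`) is hypothetical, values are assumed to be strings, and the `original` property only stands in for the audit copy the text says is kept in the `original_extractions` table.

```javascript
// Apply modifiers in the fixed order: format -> alias -> max_length.
function applyModifiers(value, field) {
  let out = value;
  if (field.format) out = field.format(out);
  if (field.aliases && Object.prototype.hasOwnProperty.call(field.aliases, out)) {
    out = field.aliases[out];
  }
  if (Number.isInteger(field.maxLength)) out = out.slice(0, field.maxLength);
  return out;
}

// Evaluate the format constraint after all modifiers have run.
// field.pattern should be a RegExp without the global flag, so that
// test() does not carry state between calls.
function evaluateCell(value, field) {
  const modified = applyModifiers(value, field);
  const result = { original: value, value: modified, flagged: false };
  if (field.pattern && !field.pattern.test(modified)) {
    if (field.onMismatch === "clear") result.value = null;
    else if (field.onMismatch === "replace") result.value = field.replacement ?? null;
    else result.flagged = true; // default behavior here: flag and keep the value
  }
  return result;
}

// Example configuration (hypothetical): normalize a country to an ISO code.
const countryField = {
  format: (s) => s.trim().toUpperCase(),
  aliases: { GERMANY: "DE" },
  maxLength: 3,
  pattern: /^[A-Z]{2}$/,
  onMismatch: "clear",
};
```

The order matters in this example: the alias lookup only succeeds because the format step has already trimmed and uppercased the input.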
2903
3185
  {
2904
3186
  type: "paragraph",
@@ -2914,7 +3196,7 @@ var sections6 = [
2914
3196
  },
2915
3197
  {
2916
3198
  type: "callout",
2917
- text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected."
3199
+ text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected. Original values are always preserved in the `original_extractions` table for audit, regardless of whether format constraints clear, flag, or replace them."
2918
3200
  }
2919
3201
  ],
2920
3202
  related: [
@@ -2925,15 +3207,15 @@ var sections6 = [
2925
3207
  faq: [
2926
3208
  {
2927
3209
  question: "What does Phase 4 Re-read do?",
2928
- answer: "Phase 4 performs context-aware gap filling by re-reading the original document with field instructions and full grid context for each empty or low-confidence cell."
3210
+ answer: "Phase 4 performs context-aware gap filling by re-reading the original document with field instructions and full grid context for each empty or low-confidence cell. Because it has access to all values resolved in earlier phases, it can use surrounding data as clues \u2014 for example, using a resolved start date to locate the corresponding end date more accurately."
2929
3211
  },
2930
3212
  {
2931
3213
  question: "Can Phase 4 overwrite high-confidence values?",
2932
- answer: "No. Phase 4 respects the confidence gate \u2014 it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from earlier phases are permanently protected."
3214
+ answer: "No. Phase 4 respects the confidence gate \u2014 it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from earlier phases are permanently protected. This is the single most important pipeline rule, ensuring that reliable lookup results are never replaced by lower-confidence AI extractions."
2933
3215
  },
2934
3216
  {
2935
3217
  question: "What else happens in Phase 4 besides gap filling?",
2936
- answer: "Phase 4 also applies deterministic transforms (ISO codes, dates, units), evaluates format constraints (regex validation), and runs the modifier pipeline (format, alias, max_length). Original values are preserved for audit."
3218
+ answer: "Phase 4 also applies deterministic transforms (ISO codes, dates, units), evaluates format constraints (regex validation), and runs the modifier pipeline in a fixed order: format transforms first, then alias mapping, then max_length truncation. Constraint evaluation happens after all modifiers. Original values are always preserved in the original_extractions table for audit, regardless of whether constraints clear, flag, or replace them."
2937
3219
  }
2938
3220
  ],
2939
3221
  mentions: ["Phase 4", "re-read", "gap filling", "confidence gate", "targeted extraction"]
@@ -2953,7 +3235,7 @@ var sections6 = [
2953
3235
  },
2954
3236
  {
2955
3237
  type: "paragraph",
2956
- text: "The job detail page provides: a **progress bar** with fill rate, a **phase timeline**, the **strategy panel** (agent actions), a **filter bar** (Show All / Clean / Flagged), and **CSV export** (clean or full with metadata)."
3238
+ text: "The job detail page provides: a **progress bar** with fill rate, a **phase timeline**, the **strategy panel** (agent actions), a **filter bar** (Show All / Clean / Flagged), and **CSV export** (clean or full with metadata). The strategy yield breakdown shows how cells were distributed across resolution methods \u2014 registry transfer, raw extraction mapping, lookup cascade, deterministic compute, LLM extract, and bypass \u2014 giving you a clear picture of pipeline efficiency for each run."
2957
3239
  },
2958
3240
  {
2959
3241
  type: "paragraph",
@@ -2969,7 +3251,7 @@ var sections6 = [
2969
3251
  },
2970
3252
  {
2971
3253
  type: "callout",
2972
- text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus."
3254
+ text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus. The clean export omits metadata and includes only the extracted values, ready for direct import into downstream systems."
2973
3255
  }
2974
3256
  ],
2975
3257
  related: [
@@ -2980,15 +3262,15 @@ var sections6 = [
2980
3262
  faq: [
2981
3263
  {
2982
3264
  question: "What do the colored dots in the results grid mean?",
2983
- answer: "Each dot indicates how a cell was resolved: blue = graph match, purple = computed, teal = agent transfer, indigo = agent extract, amber = lookup."
3265
+ answer: "Each dot indicates how a cell was resolved: blue = graph match (Phase 1 registry transfer, highest reliability), purple = computed (deterministic formula), teal = agent transfer (copy from equivalent field), indigo = agent extract (AI read from document), amber = lookup result or format flag. A grid dominated by blue and purple dots typically requires minimal review."
2984
3266
  },
2985
3267
  {
2986
3268
  question: "Can I export extraction results?",
2987
- answer: "Yes. Use CSV export from the job detail page. You can export clean data only or full data with metadata including confidence scores and resolution types."
3269
+ answer: "Yes. Use CSV export from the job detail page. The clean export includes only extracted values, ready for direct import into downstream systems. The full export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace \u2014 useful for audit trails or analyzing extraction performance across your document corpus."
2988
3270
  },
2989
3271
  {
2990
3272
  question: "What is the most efficient way to review a large extraction run?",
2991
- answer: "Start with the Flagged filter to address cells with validation warnings, low confidence, or format mismatches. Then spot-check a random sample of Clean rows. Focus corrections on recurring field-level patterns rather than individual cells."
3273
+ answer: "Start with the Flagged filter to address cells with validation warnings, low confidence, or format mismatches. Then spot-check a random sample of Clean rows. Focus corrections on recurring field-level patterns rather than individual cells. If you find a field that is consistently wrong, update its manual instruction or reference table in the schema rather than correcting cells one by one \u2014 this improves future runs as well."
2992
3274
  }
2993
3275
  ],
2994
3276
  mentions: [
@@ -3008,7 +3290,7 @@ var sections6 = [
3008
3290
  content: [
3009
3291
  {
3010
3292
  type: "paragraph",
3011
- text: "Every cell carries detailed provenance. Hover a cell for confidence; click for full detail."
3293
+ text: "Every cell carries detailed provenance metadata that makes every extracted value auditable and explainable. Hover a cell for a quick confidence score glance; click it to expand the full provenance panel showing the resolution type, pipeline phase, reasoning trace, and source document references. This transparency is essential for building trust in automated extraction \u2014 you can always understand exactly how and why the platform produced a specific value."
3012
3294
  },
3013
3295
  {
3014
3296
  type: "paragraph",
@@ -3049,6 +3331,17 @@ var sections6 = [
3049
3331
  type: "paragraph",
3050
3332
  text: "Confidence scores follow predictable patterns by resolution type. Graph matches from Phase 1 typically score 0.7-0.95 because they are derived from verified registry data. Reference table lookups score 0.95 for exact normalization matches, ~0.70 for fuzzy matches, and 0.50 for AI fallback. Agent-derived values from Phase 2 generally score 0.5-0.9 depending on the clarity of the source document and the specificity of the extraction instruction."
3051
3333
  },
3334
+ {
3335
+ type: "list",
3336
+ ordered: false,
3337
+ items: [
3338
+ "**0.90-0.95** \u2014 Tier 1 lookup or exact registry transfer. Highest reliability; safe to trust without review in most workflows.",
3339
+ "**0.70-0.89** \u2014 Strong graph match or fuzzy registry transfer. Generally reliable; spot-check a sample to validate.",
3340
+ "**0.50-0.69** \u2014 AI extraction or fuzzy lookup result. Review recommended; the system found a plausible value but certainty is moderate.",
3341
+ "**0.30-0.49** \u2014 Low-confidence AI extraction. The source document was ambiguous or the field instruction was vague. Always review manually.",
3342
+ "**Below 0.30** \u2014 Very low confidence. The value is likely a best guess. Consider updating the schema instruction or adding a reference table to improve future runs."
3343
+ ]
3344
+ },
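The confidence bands listed above can be turned into a small lookup for triaging an export. The band boundaries come from the list. The labels, the `confidenceBand` name, and the decision to treat any score of 0.90 or higher as the top band are assumptions.

```javascript
// Bands in descending order; the first band whose minimum is met wins.
const CONFIDENCE_BANDS = [
  { min: 0.9, label: "trusted", action: "safe to trust without review in most workflows" },
  { min: 0.7, label: "reliable", action: "spot-check a sample" },
  { min: 0.5, label: "moderate", action: "review recommended" },
  { min: 0.3, label: "low", action: "always review manually" },
  { min: 0.0, label: "very_low", action: "consider updating the schema instruction or adding a reference table" },
];

function confidenceBand(score) {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError("confidence score must be between 0 and 1");
  }
  return CONFIDENCE_BANDS.find((band) => score >= band.min);
}
```

Applied to the confidence column of the full CSV export, this gives a quick way to group cells by the review effort they need.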
3052
3345
  {
3053
3346
  type: "paragraph",
3054
3347
  text: "Use confidence scores to set your review threshold. Cells above 0.8 are generally reliable and can be trusted without manual verification for most use cases. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. You can use the full CSV export to filter and sort by confidence, making it easy to batch-review low-confidence cells efficiently."
@@ -3095,7 +3388,7 @@ var sections6 = [
3095
3388
  content: [
3096
3389
  {
3097
3390
  type: "paragraph",
3098
- text: "Click any cell to edit its value. Corrections are logged with the original value, timestamp, and user. Choose a propagation scope: `this_document_only` or `all_similar` (same field + method + source field across all documents). Corrections feed back as training signals for future runs."
3391
+ text: "Click any cell to edit its value. Corrections are logged with the original value, timestamp, and user. Choose a propagation scope: `this_document_only` or `all_similar` (same field + method + source field across all documents). Corrections feed back as training signals for future runs, helping the system learn from your edits and improve accuracy over time. When you correct a value, the system records both the original AI-extracted value and your correction, creating a complete audit trail that is preserved even after subsequent jobs run."
3099
3392
  },
3100
3393
  {
3101
3394
  type: "paragraph",
@@ -3111,7 +3404,7 @@ var sections6 = [
3111
3404
  },
3112
3405
  {
3113
3406
  type: "callout",
3114
- text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected."
3407
+ text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected. For recurring field-level errors, consider updating the schema instruction or reference table rather than correcting cells individually across multiple runs."
3115
3408
  }
3116
3409
  ],
3117
3410
  related: [
@@ -3122,15 +3415,15 @@ var sections6 = [
3122
3415
  faq: [
3123
3416
  {
3124
3417
  question: "How do I correct an extracted value?",
3125
- answer: "Click any cell in the results grid to edit its value. Choose propagation scope: this_document_only (single cell) or all_similar (same field + method across all documents)."
3418
+ answer: "Click any cell in the results grid to edit its value. Choose propagation scope: this_document_only (single cell) or all_similar (same field + method across all documents). When using all_similar, the system shows a preview count of how many cells will be affected before you confirm \u2014 always verify this count to avoid unintended bulk changes."
3126
3419
  },
3127
3420
  {
3128
3421
  question: "Do corrections improve future extractions?",
3129
- answer: "Yes. Corrections feed back as training signals for future runs, helping the system learn from your corrections and improve accuracy over time."
3422
+ answer: "Yes. Corrections feed back as training signals for future runs, helping the system learn from your corrections and improve accuracy over time. For maximum impact, correct the root cause rather than individual symptoms \u2014 update the schema field instruction or reference table so that future runs resolve correctly without manual intervention."
3130
3423
  },
3131
3424
  {
3132
3425
  question: "Is there an audit trail for corrections?",
3133
- answer: "Yes. Every correction logs the original value, the corrected value, the user who made the change, and the timestamp. This audit history is preserved and included in full metadata CSV exports."
3426
+ answer: "Yes. Every correction logs the original value, the corrected value, the user who made the change, and the timestamp. This audit history is preserved even after subsequent jobs run and is included in full metadata CSV exports. Downstream systems can use this data to distinguish between AI-extracted and human-corrected values."
3134
3427
  }
3135
3428
  ],
3136
3429
  mentions: [
@@ -3193,9 +3486,25 @@ var sections7 = [
3193
3486
  type: "paragraph",
3194
3487
  text: "Use link keys whenever your documents share identifying information that should connect them. For best results, ensure your field names follow clear naming conventions \u2014 this maximizes the hit rate of the automatic classifier and minimizes the need for manual overrides."
3195
3488
  },
3489
+ {
3490
+ type: "paragraph",
3491
+ text: 'High-frequency entity exclusion is an important safeguard. If an entity value appears in more than 30% of all documents \u2014 for example, a generic department name like "Operations" or a common currency code like "USD" \u2014 it is automatically excluded from case formation. Without this filter, a single high-frequency value would pull most documents into one enormous case, making the grouping meaningless. The 30% threshold strikes a balance between connecting genuinely related documents and avoiding over-connection from generic values.'
3492
+ },
3493
+ {
3494
+ type: "list",
3495
+ items: [
3496
+ "Identity: company names, supplier names, person names \u2014 connects documents referencing the same party",
3497
+ "Transaction: contract numbers, PO numbers, invoice numbers \u2014 connects documents in the same transaction chain",
3498
+ "Reference: project codes, cost centers, shared IDs \u2014 connects documents under the same organizational grouping",
3499
+ "Auto-classified by field name patterns (e.g., company_name, invoice_number)",
3500
+ "AI classifier handles ambiguous fields that heuristics cannot resolve",
3501
+ "High-frequency entities (>30% of documents) excluded automatically",
3502
+ "Manual overrides available in the Field Registry"
3503
+ ]
3504
+ },
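The high-frequency exclusion rule above can be sketched as a document-frequency count. The 30% threshold comes from the text. The `{ id, entities }` document shape and the function name are hypothetical, and entity values are assumed to be normalized already.

```javascript
// docs: array of { id, entities: string[] } (assumed shape).
// Returns the set of entity values that appear in more than `maxShare`
// of all documents and should be ignored during case formation.
function highFrequencyEntities(docs, maxShare = 0.3) {
  const counts = new Map();
  for (const doc of docs) {
    // Count each entity once per document, even if it is repeated inside it.
    for (const entity of new Set(doc.entities)) {
      counts.set(entity, (counts.get(entity) ?? 0) + 1);
    }
  }
  const excluded = new Set();
  for (const [entity, n] of counts) {
    if (n / docs.length > maxShare) excluded.add(entity);
  }
  return excluded;
}
```

The comparison is strict, so a value that appears in exactly 30% of documents is still kept, following the "more than 30%" wording.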
3196
3505
  {
3197
3506
  type: "callout",
3198
- text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest."
3507
+ text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest. Manual overrides in the Field Registry take precedence over automatic classifications and persist across future jobs."
3199
3508
  }
3200
3509
  ],
3201
3510
  related: [
@@ -3248,9 +3557,24 @@ var sections7 = [
3248
3557
  type: "paragraph",
3249
3558
  text: "For best results, ensure your source documents contain consistent identifiers. The pipeline handles minor variations automatically, but wildly inconsistent naming (e.g., abbreviations vs. full legal names) may require manual link key tuning in the Field Registry."
3250
3559
  },
3560
+ {
3561
+ type: "paragraph",
3562
+ text: 'A typical entity linking workflow looks like this: you upload a batch of invoices, contracts, and purchase orders. The pipeline extracts link key values \u2014 vendor names, PO numbers, contract references \u2014 normalizes them, and builds the graph. An invoice referencing "ACME Corp" and a contract referencing "Acme Corporation" both resolve to the same entity node after normalization, so the two documents become connected. If a purchase order also references the same vendor, all three documents end up in the same case.'
3563
+ },
3564
+ {
3565
+ type: "list",
3566
+ items: [
3567
+ "Runs automatically after document extraction \u2014 no manual trigger required",
3568
+ "Normalizes values: lowercasing, suffix stripping (Ltd, Inc, Corp, GmbH), whitespace normalization",
3569
+ "Builds a bipartite graph: document nodes connected to entity nodes via link key edges",
3570
+ "Connected components in the graph become the basis for case formation",
3571
+ "Incremental: new documents extend the existing graph rather than rebuilding it",
3572
+ "Handles minor naming variations automatically; wildly inconsistent names may need manual tuning"
3573
+ ]
3574
+ },
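The linking steps listed above (normalize values, build a document/entity graph, take connected components as cases) can be sketched with a union-find structure. This is an illustration only: the `{ id, linkValues }` shape is hypothetical, union-find is a swapped-in technique rather than the platform's confirmed method, and "corporation" is added to the suffix list here so that the "ACME Corp" / "Acme Corporation" example resolves as described.

```javascript
// Suffixes from the text (Ltd, Inc, Corp, GmbH) plus "corporation" (assumption).
const LEGAL_SUFFIX = /\b(ltd|inc|corp|corporation|gmbh)\.?$/;

function normalizeEntity(name) {
  return name
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim()
    .replace(LEGAL_SUFFIX, "")
    .replace(/[\s,]+$/, "");
}

// docs: array of { id, linkValues: string[] } (assumed shape).
// Returns an array of cases, each a list of document ids.
function formCases(docs) {
  const parent = new Map();
  const find = (x) => {
    if (!parent.has(x)) parent.set(x, x);
    while (parent.get(x) !== x) {
      parent.set(x, parent.get(parent.get(x))); // path halving
      x = parent.get(x);
    }
    return x;
  };
  const union = (a, b) => parent.set(find(a), find(b));

  // Bipartite edges: each document node links to its entity nodes.
  for (const doc of docs) {
    find("doc:" + doc.id);
    for (const value of doc.linkValues) {
      union("doc:" + doc.id, "entity:" + normalizeEntity(value));
    }
  }

  // Documents that share a root belong to the same connected component.
  const cases = new Map();
  for (const doc of docs) {
    const root = find("doc:" + doc.id);
    if (!cases.has(root)) cases.set(root, []);
    cases.get(root).push(doc.id);
  }
  return [...cases.values()];
}
```

Union-find also suits the incremental behavior described above: adding a new document only requires a few more `union` calls against the existing structure, not a rebuild.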
3251
3575
  {
3252
3576
  type: "callout",
3253
- text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered."
3577
+ text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered. This means your cases stay up to date without manual intervention as new documents flow into the workspace."
3254
3578
  }
3255
3579
  ],
3256
3580
  related: [
@@ -3321,11 +3645,30 @@ var sections7 = [
3321
3645
  },
3322
3646
  {
3323
3647
  type: "paragraph",
3324
- text: "Cases display an AI-generated **case label** as the primary title and include anomaly count badges in the header. Evidence and Timeline tabs support export to MD, CSV, and JSON."
3648
+ text: 'Cases display an AI-generated **case label** as the primary title and include anomaly count badges in the header. The label is generated by analyzing the documents and entities in the case to produce a human-readable summary \u2014 for example, "ACME Corp Invoice #4521 &rarr; PO #8890". You can rename a case manually if the AI label does not capture the right context. Evidence and Timeline tabs support export to MD, CSV, and JSON for offline review or compliance reporting.'
3325
3649
  },
3326
3650
  {
3327
3651
  type: "paragraph",
3328
- text: "Additional case operations: **merge** multiple cases into one, **split** a case into separate groups, **pin** or remove documents from a case, and **confirm** or **reject** individual linking edges."
3652
+ text: "Additional case operations: **merge** multiple cases into one, **split** a case into separate groups, **pin** or remove documents from a case, and **confirm** or **reject** individual linking edges. Merging is useful when two cases refer to the same real-world transaction but were not connected by the linking pipeline \u2014 for example, when a vendor uses slightly different names across documents. Splitting lets you break apart a case that was over-connected by a high-frequency entity value. Edge confirmation and rejection feed back into the linking model, improving future case formation accuracy."
3653
+ },
3654
+ {
3655
+ type: "paragraph",
3656
+ text: "Cases follow a lifecycle: **discovered** when the linking engine first identifies a cluster, **confirmed** when a reviewer validates the grouping, **active** during ongoing work, and **resolved** when all documents have been reviewed and processed. The lifecycle status is visible on the cases list page and can be updated from the case detail header. Filtering by lifecycle status makes it easy to focus on cases that need attention."
3657
+ },
3658
+ {
3659
+ type: "list",
3660
+ items: [
3661
+ "AI-generated case labels with manual rename option",
3662
+ "Four tabs: Overview, Anomalies (Advanced mode), Evidence, and Timeline",
3663
+ "Merge, split, pin, remove, confirm, and reject operations",
3664
+ "Lifecycle tracking: discovered &rarr; confirmed &rarr; active &rarr; resolved",
3665
+ "Anomaly count badges in the case header for quick triage",
3666
+ "Export Evidence and Timeline to MD, CSV, or JSON"
3667
+ ]
3668
+ },
3669
+ {
3670
+ type: "callout",
3671
+ text: "Case formation runs automatically after entity linking completes. You do not need to create cases manually \u2014 the system discovers them from the document-entity graph. Use merge, split, and edge operations to refine cases when the automatic grouping needs adjustment."
3329
3672
  }
3330
3673
  ],
3331
3674
  related: [
@@ -3374,9 +3717,24 @@ var sections7 = [
3374
3717
  type: "paragraph",
3375
3718
  text: "Most teams use the graph view during initial workspace setup to verify that linking is producing sensible clusters. Once you are confident in your link key configuration, the list view is more practical for day-to-day case review and triage."
3376
3719
  },
3720
+ {
3721
+ type: "paragraph",
3722
+ text: "The graph view is particularly useful during onboarding. When you first upload documents to a workspace, the graph gives you immediate visual feedback on whether your link key configuration is producing sensible clusters. If you see one massive cluster with everything connected, a high-frequency entity value may be acting as a bridge \u2014 check the Field Registry and exclude or reclassify the offending field. If you see many disconnected single-document nodes, your documents may lack shared identifiers, or the normalization rules may need adjustment."
3723
+ },
3724
+ {
3725
+ type: "list",
3726
+ items: [
3727
+ "D3-force layout with distinct visual styles for document and entity nodes",
3728
+ "Hover to highlight connections and trace document-entity relationships",
3729
+ "Toggle between graph view and list view from the Cases page",
3730
+ "Case templates auto-discovered after 3+ cases share the same document type pattern",
3731
+ "Templates include a match threshold controlling how closely a case must match",
3732
+ "Missing document type anomalies raised when a case does not match its template"
3733
+ ]
3734
+ },
3377
3735
  {
3378
3736
  type: "callout",
3379
- text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern."
3737
+ text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern. You can also trigger template discovery manually from the API via POST /cases/templates/discover."
3380
3738
  }
3381
3739
  ],
3382
3740
  related: [
@@ -3453,9 +3811,23 @@ var sections7 = [
3453
3811
  type: "paragraph",
3454
3812
  text: "Use anomaly detection to surface data quality issues that would otherwise require manual comparison across documents. For best results, configure case templates so the **Missing Document Type** detector (D4) can flag incomplete cases. Most teams find that D2 (Field Conflict) and D3 (Duplicate Key Divergence) catch the highest-value issues in procurement and financial workflows."
3455
3813
  },
3814
+ {
3815
+ type: "paragraph",
3816
+ text: "A typical workflow starts on the cases list page, where anomaly count badges give you an at-a-glance view of which cases need attention. Click into a case with anomalies, switch to the **Anomalies** tab, and use the severity filter pills to focus on critical issues first. Each anomaly card explains the affected fields and the specific violation detected. Dismiss false positives with the dismiss button \u2014 they remain accessible via the **show dismissed** toggle if you need to revisit them later."
3817
+ },
3818
+ {
3819
+ type: "list",
3820
+ items: [
3821
+ "D1 \u2014 Validation Cluster: multiple validation failures concentrated in the same document or field group",
3822
+ "D2 \u2014 Field Conflict: contradictory values for the same field across documents in a case",
3823
+ "D3 \u2014 Duplicate Key Divergence: shared link key but differing values on fields that should match",
3824
+ "D4 \u2014 Missing Document Type: case template expects a document type that is absent",
3825
+ "D5 \u2014 Value Reuse: identical values across unrelated fields, suggesting copy-paste or extraction errors"
3826
+ ]
3827
+ },
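As one illustration of the detectors listed above, a D2-style field conflict check could look like the sketch below. The function name `detectFieldConflicts` and the `{ id, fields }` document shape are assumptions made for this example. The comparison is exact string equality, which is likely simpler than what the platform actually does.

```javascript
// Flag fields that carry more than one distinct non-empty value
// across the documents of a single case.
function detectFieldConflicts(caseDocuments) {
  const valuesByField = new Map();
  for (const doc of caseDocuments) {
    for (const [field, value] of Object.entries(doc.fields)) {
      if (value == null || value === "") continue; // empty values are not conflicts
      if (!valuesByField.has(field)) valuesByField.set(field, new Map());
      const docsByValue = valuesByField.get(field);
      const key = String(value);
      if (!docsByValue.has(key)) docsByValue.set(key, []);
      docsByValue.get(key).push(doc.id);
    }
  }

  const anomalies = [];
  for (const [field, docsByValue] of valuesByField) {
    if (docsByValue.size > 1) {
      anomalies.push({
        detector: "D2",
        field,
        values: Object.fromEntries(docsByValue), // value -> document ids
      });
    }
  }
  return anomalies;
}
```

Run against an invoice and a purchase order with mismatched totals, this reports one anomaly on the amount field and nothing for fields that agree.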
3456
3828
  {
3457
3829
  type: "callout",
3458
- text: "Anomaly detection requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but not displayed in the case detail page."
3830
+ text: "Anomaly detection requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but not displayed in the case detail page. Toggle Advanced mode from the sidebar to access the full anomaly workflow."
3459
3831
  }
3460
3832
  ],
3461
3833
  related: [
@@ -3466,11 +3838,11 @@ var sections7 = [
3466
3838
  faq: [
3467
3839
  {
3468
3840
  question: "What anomalies does Talonic detect?",
3469
- answer: "Five structural patterns: validation clusters, field conflicts, duplicate key divergence, missing document types, and value reuse. Each is surfaced as a dismissable card on the case detail page."
3841
+ answer: "Five structural patterns: validation clusters (D1), field conflicts (D2), duplicate key divergence (D3), missing document types (D4), and value reuse (D5). Each is surfaced as a dismissable card on the case detail page. D2 and D3 are the highest-value detectors for procurement and financial workflows \u2014 they catch contradictory values across related documents, such as mismatched amounts between an invoice and its corresponding purchase order."
3470
3842
  },
3471
3843
  {
3472
3844
  question: "Do anomalies update automatically when cases change?",
3473
- answer: "Yes. The detection engine re-runs whenever case membership changes \u2014 documents added or removed, cases merged or split. Anomaly badges in the case header update in real time."
3845
+ answer: "Yes. The detection engine re-runs whenever case membership changes \u2014 documents added or removed, cases merged or split. Anomaly badges in the case header update in real time. Each detector operates independently, so a single case can trigger multiple anomaly types simultaneously. This continuous re-evaluation ensures that anomalies stay current as your document corpus evolves."
3474
3846
  },
3475
3847
  {
3476
3848
  question: "Can I dismiss anomalies?",
@@ -3515,9 +3887,26 @@ var sections7 = [
3515
3887
  type: "paragraph",
3516
3888
  text: "The checksum validator (S7) uses a parameterized factory pattern \u2014 it accepts a checksum algorithm name and applies the corresponding verification logic. Supported algorithms include Luhn (credit card numbers), ABA (bank routing numbers), IBAN (international bank accounts), and ISBN (book identifiers). For best results, ensure your schema fields are typed correctly so the engine knows which checksum to apply."
3517
3889
  },
3890
+ {
3891
+ type: "paragraph",
3892
+ text: "A typical evidence validation workflow starts automatically after extraction and linking. You navigate to a case, open the **Evidence** tab, and immediately see colored badges next to each field value. Red badges indicate failures that need attention \u2014 click a badge to see which validator fired and what the expected format or value was. Use the filter bar to narrow results by status (pass/fail/warning), by document, by category, or by free-text search. Group-by-document collapsible sections let you review one document at a time within a case."
3893
+ },
3894
+ {
3895
+ type: "list",
3896
+ items: [
3897
+ "S1 \u2014 Free-text spillover: unstructured text leaked from adjacent content",
3898
+ "S2 \u2014 Empty value: required field is blank or whitespace-only",
3899
+ "S3 \u2014 Email/URL misclassification: value looks like an email or URL in the wrong field type",
3900
+ "S4 \u2014 Name in URL field: person or company name extracted into a URL-typed field",
3901
+ "S5 \u2014 Alpha in numeric field: alphabetic characters in a numeric-only field",
3902
+ "S6 \u2014 Cross-field duplicate: identical value in multiple unrelated fields on the same document",
3903
+ "S7 \u2014 Checksum validation: Luhn, ABA, IBAN, ISBN verification via parameterized factory",
3904
+ "Domain packs: industry-specific rules (e.g., freight: DOT numbers, MC numbers)"
3905
+ ]
3906
+ },
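The parameterized factory described for S7 might be sketched like this. Only Luhn and ABA are shown. `makeChecksumValidator` and the `ALGORITHMS` table are hypothetical names, and stripping spaces and hyphens before checking is an assumption about input handling.

```javascript
// Checksum routines keyed by algorithm name; each takes an array of digits.
const ALGORITHMS = {
  // Luhn: from the right, double every second digit, subtract 9 if above 9.
  luhn(digits) {
    let sum = 0;
    for (let i = 0; i < digits.length; i++) {
      let digit = digits[digits.length - 1 - i];
      if (i % 2 === 1) {
        digit *= 2;
        if (digit > 9) digit -= 9;
      }
      sum += digit;
    }
    return digits.length > 1 && sum % 10 === 0;
  },
  // ABA routing number: 9 digits, weights 3-7-1 repeated, sum divisible by 10.
  aba(digits) {
    if (digits.length !== 9) return false;
    const weights = [3, 7, 1, 3, 7, 1, 3, 7, 1];
    const sum = digits.reduce((acc, digit, i) => acc + digit * weights[i], 0);
    return sum % 10 === 0;
  },
};

// Factory: accepts an algorithm name and returns a validator function.
function makeChecksumValidator(algorithm) {
  if (!Object.prototype.hasOwnProperty.call(ALGORITHMS, algorithm)) {
    throw new Error(`Unsupported checksum algorithm: ${algorithm}`);
  }
  const check = ALGORITHMS[algorithm];
  return (value) => {
    const cleaned = String(value).replace(/[\s-]/g, "");
    if (!/^\d+$/.test(cleaned)) return false; // non-digits cannot pass
    return check([...cleaned].map(Number));
  };
}
```

A caller would pick the validator from the field's schema type, for example `makeChecksumValidator("luhn")` for a card number field.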
3518
3907
  {
3519
3908
  type: "callout",
3520
- text: "Evidence validation results are stored in a separate `evidence_validation_results` table keyed by (document_id, entity_id, field_key) \u2014 not in the extraction or linking tables."
3909
+ text: "Evidence validation results are stored separately from extraction and linking data. This means you can re-run validation independently without re-extracting documents. Results are keyed by (document_id, entity_id, field_key) for precise field-level tracking."
3521
3910
  }
3522
3911
  ],
3523
3912
  related: [
@@ -3572,9 +3961,24 @@ var sections8 = [
3572
3961
  type: "paragraph",
3573
3962
  text: "For best results, create one template per downstream consumer. If your finance team and operations team need different column subsets from the same schema, define two templates rather than manually reconfiguring each export. Most teams version their templates alongside schema changes to maintain backward compatibility with existing integrations."
3574
3963
  },
3964
+ {
3965
+ type: "paragraph",
3966
+ text: "To create a dataset template, navigate to **Data Products &rarr; Dataset Templates** and click **New Template**. Select the user schema that defines the field set, then configure column mappings to rename, reorder, or exclude fields from the output. Add default transforms \u2014 such as date formatting to ISO 8601, currency normalization to a base currency, or unit conversion \u2014 that run automatically during assembly. Save the template and it becomes available to any team member when creating a new job or assembly."
3967
+ },
3968
+ {
3969
+ type: "list",
3970
+ items: [
3971
+ "Linked to a user schema \u2014 fields are inherited automatically as the schema evolves",
3972
+ "Column mappings: rename, reorder, or exclude fields from the final output",
3973
+ "Default transforms: date formatting, currency normalization, unit conversion",
3974
+ "Independent versioning: evolve the template without affecting existing data products",
3975
+ "Workspace-scoped: any team member can create, edit, or use any template",
3976
+ "One template per downstream consumer is the recommended pattern"
3977
+ ]
3978
+ },
3575
3979
  {
3576
3980
  type: "callout",
3577
- text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction."
3981
+ text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction. Version your templates alongside schema changes to maintain backward compatibility with existing integrations and downstream consumers."
3578
3982
  }
3579
3983
  ],
3580
3984
  related: [
@@ -3627,9 +4031,24 @@ var sections8 = [
3627
4031
  type: "paragraph",
3628
4032
  text: "Use assemblies whenever you need a repeatable, auditable output for downstream systems or stakeholders. Most teams create one assembly per reporting period or delivery cycle. Because assemblies reference a template, you can regenerate the same output shape from different document sets without reconfiguring columns or transforms each time."
3629
4033
  },
4034
+ {
4035
+ type: "paragraph",
4036
+ text: "Assemblies also support incremental updates. When new documents arrive in a source that is already part of an assembly, you can regenerate the assembly to include them without reconfiguring anything. The system re-applies the template, pulls the updated document set, and produces a fresh output. Previous assembly versions are retained for comparison, so you can track how your dataset evolves over successive runs."
4037
+ },
4038
+ {
4039
+ type: "list",
4040
+ items: [
4041
+ "Select a dataset template and one or more document sources to create an assembly",
4042
+ "Column mappings and transforms from the template are applied automatically",
4043
+ "Full traceability from every output row back to its source document",
4044
+ "Incremental updates \u2014 regenerate to include newly arrived documents",
4045
+ "Previous assembly versions retained for comparison and auditing",
4046
+ "Export the assembled dataset as CSV with leading zero preservation"
4047
+ ]
4048
+ },
3630
4049
  {
3631
4050
  type: "callout",
3632
- text: "Assemblies are the recommended way to produce production datasets. They provide a single audit trail from source documents through extraction, resolution, and validation to the final output."
4051
+ text: "Assemblies are the recommended way to produce production datasets. They provide a single audit trail from source documents through extraction, resolution, and validation to the final output. If your workflow requires repeatable, auditable deliverables, assemblies eliminate the need for manual export configuration on every run."
3633
4052
  }
3634
4053
  ],
3635
4054
  related: [
@@ -3644,11 +4063,11 @@ var sections8 = [
3644
4063
  },
3645
4064
  {
3646
4065
  question: "Why should I use assemblies for production data?",
3647
- answer: "Assemblies provide a single audit trail from source documents through extraction, resolution, and validation to the final output, making them the recommended approach for production datasets."
4066
+ answer: "Assemblies provide a single audit trail from source documents through extraction, resolution, and validation to the final output, making them the recommended approach for production datasets. Unlike ad-hoc exports, assemblies are versioned and reproducible \u2014 you can regenerate the same output shape from different document sets without reconfiguring columns or transforms. Previous versions are retained automatically, so you can compare outputs across time periods and demonstrate compliance with audit requirements."
3648
4067
  },
3649
4068
  {
3650
4069
  question: "Can an assembly pull from multiple sources?",
3651
- answer: "Yes. An assembly can combine documents from any number of sources \u2014 uploaded files, connected drives, email attachments, and more \u2014 into a single structured dataset."
4070
+ answer: "Yes. An assembly can combine documents from any number of sources \u2014 uploaded files, connected drives, email attachments, and more \u2014 into a single structured dataset. This is particularly useful for cross-functional reporting where data arrives through different channels. For example, you can combine invoices from a Google Drive connector, purchase orders uploaded manually, and contracts ingested via the API into a single unified procurement dataset."
3652
4071
  }
3653
4072
  ],
3654
4073
  mentions: [
@@ -3693,7 +4112,22 @@ var sections8 = [
3693
4112
  },
3694
4113
  {
3695
4114
  type: "paragraph",
3696
- text: "ID rules are persisted before generating IDs. Navigate to a data product detail page and use **Apply ID Rules** to generate or **Regenerate IDs** to refresh."
4115
+ text: "ID rules are persisted before generating IDs. Navigate to a data product detail page and use **Apply ID Rules** to generate or **Regenerate IDs** to refresh. The generation process evaluates each row against the configured rules: it reads the source field value, applies the resolution map if one exists, prepends the prefix, and writes the resulting ID. If the source field is empty, the dispenser walks the fallback chain in order until it finds a non-empty value. If all fields in the chain are empty, a prefix-less sequential ID is assigned so no row is left without an identifier."
4116
+ },
4117
+ {
4118
+ type: "paragraph",
4119
+ text: "A typical workflow starts by choosing a high-cardinality field as the source \u2014 contract numbers, invoice IDs, or purchase order references work well because they are unique per document. Next, configure a fallback chain with one or two alternative fields (e.g., document name, then upload date) so the dispenser always has a value to work with. Finally, add a resolution map if your source data contains variant spellings of the same entity. The map normalizes these variants before they become part of the ID, preventing duplicate IDs for rows that refer to the same real-world record."
4120
+ },
4121
+ {
4122
+ type: "list",
4123
+ items: [
4124
+ "Source field: the primary field used to derive each row ID",
4125
+ "Fallback chain: ordered list of alternative fields tried when the source is empty",
4126
+ "Resolution map: key-value lookup that normalizes values before ID generation",
4127
+ "Prefix: optional string prepended to every generated ID for namespacing",
4128
+ "Deterministic: same rules + same data always produces the same IDs",
4129
+ "Non-destructive: regenerating IDs only updates the ID column, all other values remain unchanged"
4130
+ ]
3697
4131
  },
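The generation steps described above (source field, fallback chain, resolution map, prefix, sequential fallback) can be sketched as below. `generateIds` and the rule property names are hypothetical, and two details are assumptions: the resolution map is applied to fallback values as well as to the source field, and the prefix-less sequential ID is a plain counter starting at 1.

```javascript
// Derive a deterministic ID for every row from the configured rules.
function generateIds(rows, rules) {
  const { sourceField, fallbackChain = [], resolutionMap = {}, prefix = "" } = rules;
  let sequence = 0;

  return rows.map((row) => {
    // Walk the source field first, then the fallback chain in order.
    let value = "";
    for (const field of [sourceField, ...fallbackChain]) {
      const candidate = row[field] == null ? "" : String(row[field]).trim();
      if (candidate !== "") {
        value = candidate;
        break;
      }
    }

    // Every field empty: assign a prefix-less sequential ID.
    if (value === "") {
      sequence += 1;
      return { ...row, id: String(sequence) };
    }

    // Normalize variant spellings before the value becomes part of the ID.
    const resolved = Object.prototype.hasOwnProperty.call(resolutionMap, value)
      ? resolutionMap[value]
      : value;
    return { ...row, id: `${prefix}${resolved}` };
  });
}
```

The function returns new row objects and touches only the `id` property, which mirrors the non-destructive behaviour listed above. Calling it twice with the same rows and rules yields the same IDs.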
3698
4132
  {
3699
4133
  type: "paragraph",
@@ -3738,7 +4172,7 @@ var sections8 = [
3738
4172
  content: [
3739
4173
  {
3740
4174
  type: "paragraph",
3741
- text: "Each data product can generate a **share token** \u2014 a public URL that grants read access without authentication. The delivery website renders three toggle views:"
4175
+ text: "Each data product can generate a **share token** \u2014 a public URL that grants read access without authentication. Share tokens are ideal for distributing finalized datasets to external stakeholders, auditors, or downstream teams that do not have Talonic accounts. The token is scoped to a single data product and can be revoked at any time from the data product detail page without affecting other shared links. The delivery website renders three toggle views:"
3742
4176
  },
3743
4177
  {
3744
4178
  type: "param-table",
@@ -3762,11 +4196,30 @@ var sections8 = [
3762
4196
  },
3763
4197
  {
3764
4198
  type: "paragraph",
3765
- text: "The delivery website includes the Talonic logo, per-run selection, and **CSV export** with leading zero and long number preservation (values are not coerced to numbers)."
4199
+ text: "The delivery website includes the Talonic logo, per-run selection, and **CSV export** with leading zero and long number preservation (values are not coerced to numbers). When multiple runs exist for the same data product, the delivery website lets viewers switch between runs using a dropdown selector, making it easy to compare outputs across time periods or pipeline configurations. CSV downloads preserve the exact cell values shown in the active view \u2014 including leading zeros on codes like ZIP codes and account numbers \u2014 so recipients can open the file in Excel or Google Sheets without data loss."
4200
+ },
4201
+ {
4202
+ type: "paragraph",
4203
+ text: "**Auto-review** and **auto-resolve singles** are available to streamline the approval process: auto-review uses an LLM to propose approve/reject decisions, and auto-resolve singles automatically accepts fields with only one candidate value. Together, these features can reduce the manual review burden by 60-80% on typical workloads. Auto-review examines each pending field against the extraction context and proposes a decision with a confidence indicator, while auto-resolve singles handles the common case where a field has exactly one candidate \u2014 no ambiguity to resolve, so automatic acceptance is safe."
3766
4204
  },
3767
4205
  {
3768
4206
  type: "paragraph",
3769
- text: "**Auto-review** and **auto-resolve singles** are available to streamline the approval process: auto-review uses LLM to propose approve/reject decisions, and auto-resolve singles automatically accepts fields with only one candidate value."
4207
+ text: "To set up sharing, navigate to the data product detail page and click **Generate Share Link**. The system creates a unique token and displays the public URL. You can copy this URL and send it to anyone \u2014 no Talonic login is required to view the delivery website. If you need to revoke access, delete the share token from the same page. The data product itself is unaffected; only the public URL stops working."
4208
+ },
4209
+ {
4210
+ type: "list",
4211
+ items: [
4212
+ "Three toggle views: Structured Data (raw extraction), Resolved (post-normalization), and Data Product (final assembled output)",
4213
+ "Per-run selector for comparing outputs across pipeline runs or time periods",
4214
+ "CSV export with leading zero and long number preservation \u2014 values are never coerced to numeric types",
4215
+ "Auto-review with LLM-proposed approve/reject decisions for pending fields",
4216
+ "Auto-resolve singles for fields with exactly one candidate value",
4217
+ "Share tokens are revocable and scoped to a single data product"
4218
+ ]
4219
+ },
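One way to get the "never coerced to numeric types" behaviour on export is to treat every cell as text from start to finish. The sketch below assumes cell values are already stored as strings. `toCsv` is a hypothetical helper, not the delivery website's exporter, and its quoting follows the usual CSV convention of doubling embedded quotes.

```javascript
// Serialize rows to CSV without ever passing a cell through Number().
function toCsv(columns, rows) {
  const escapeCell = (cell) => {
    const text = cell == null ? "" : String(cell);
    // Quote cells containing delimiters, quotes, or line breaks.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = [columns.map(escapeCell).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => escapeCell(row[column])).join(","));
  }
  return lines.join("\r\n");
}
```

Keeping values as strings matters for long identifiers too: a 20-digit account number held as a JavaScript number would lose precision before it ever reached the file.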
4220
+ {
4221
+ type: "callout",
4222
+ text: "Share tokens grant read-only access. Recipients can view and export data but cannot modify the data product, run new jobs, or access any other workspace resources. Revoke a token at any time from the data product detail page."
3770
4223
  }
3771
4224
  ],
3772
4225
  related: [
@@ -3822,9 +4275,24 @@ var sections9 = [
3822
4275
  type: "paragraph",
3823
4276
  text: "For best results, start with a small set of high-confidence rules and expand over time. Most teams begin with field format checks for critical identifiers (invoice numbers, dates, amounts) and add cross-field consistency rules as they learn their data patterns. Validation failures do not block extraction \u2014 they flag records for review."
3824
4277
  },
4278
+ {
4279
+ type: "paragraph",
4280
+ text: "A typical setup workflow looks like this: run your first job, review the extraction results, and identify fields where errors are common. Navigate to **Validation &rarr; Checks** and create rules for those fields \u2014 a date format check on invoice_date, a value range check on total_amount, a cross-field consistency check that start_date precedes end_date. On subsequent jobs, Phase 3 evaluates every record against your active rules and flags failures for review in the Approval Queue."
4281
+ },
4282
+ {
4283
+ type: "list",
4284
+ items: [
4285
+ "Field format: verify values match expected patterns (ISO dates, phone numbers with country codes, email addresses)",
4286
+ "Value range: ensure numeric or date values fall within acceptable bounds",
4287
+ "Cross-field consistency: compare two or more fields on the same record (e.g., start date before end date)",
4288
+ "AI-proposed coherence rules: generated from patterns in completed job results, require explicit approval",
4289
+ "Schema-scoped: rules on one schema do not affect other schemas",
4290
+ "Non-blocking: validation failures flag records for review but do not prevent extraction"
4291
+ ]
4292
+ },
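The three rule types from the workflow above (field format, value range, cross-field consistency) could be modelled as simple predicates. This is a sketch under assumptions: the rule object shape, `validateRecord`, and the 0 to 1,000,000 amount range are invented for illustration, and the field names come from the example in the text.

```javascript
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Hypothetical rule set for an invoice schema.
const invoiceRules = [
  {
    name: "invoice_date is an ISO date",
    check: (record) => ISO_DATE.test(record.invoice_date ?? ""),
  },
  {
    name: "total_amount within range",
    check: (record) =>
      typeof record.total_amount === "number" &&
      record.total_amount >= 0 &&
      record.total_amount <= 1_000_000, // assumed bounds
  },
  {
    name: "start_date precedes end_date",
    check: (record) =>
      ISO_DATE.test(record.start_date ?? "") &&
      ISO_DATE.test(record.end_date ?? "") &&
      record.start_date < record.end_date, // ISO dates sort lexicographically
  },
];

// Non-blocking: the record is always returned, failures only flag it for review.
function validateRecord(record, activeRules) {
  const failures = activeRules
    .filter((rule) => !rule.check(record))
    .map((rule) => rule.name);
  return { record, failures, flagged: failures.length > 0 };
}
```

Because the result carries the original record alongside the failure list, extraction output is never discarded. A flagged record simply goes to review.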
3825
4293
  {
3826
4294
  type: "callout",
3827
- text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently."
4295
+ text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently. Start with a few high-confidence rules and expand as you learn your data patterns."
3828
4296
  }
3829
4297
  ],
3830
4298
  related: [
@@ -3876,9 +4344,25 @@ var sections9 = [
3876
4344
  type: "paragraph",
3877
4345
  text: "For best results, create golden samples from a representative mix of document types and complexity levels. Most teams maintain 5-10 golden samples per schema and re-run benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time."
3878
4346
  },
4347
+ {
4348
+ type: "paragraph",
4349
+ text: "A typical benchmarking workflow starts after a schema change or model upgrade. Navigate to **Validation &rarr; Golden Samples**, select the samples you want to benchmark, and click **Run Benchmark**. The system re-extracts each document independently and compares every field against your known-correct values. The results page shows a per-field accuracy matrix with pass/fail indicators and AI judge verdicts explaining each comparison. Use this data to pinpoint fields that need schema instruction tuning or additional extraction context."
4350
+ },
4351
+ {
4352
+ type: "list",
4353
+ items: [
4354
+ "Create golden samples by selecting a document and entering known-correct values for each field",
4355
+ "Benchmark runs compare extraction results field-by-field against the golden sample baseline",
4356
+ 'AI judge evaluates semantic equivalence (e.g., "United States" matches "US" for country fields)',
4357
+ "Per-field accuracy scores identify exactly which fields are underperforming",
4358
+ "Maintain 5-10 golden samples per schema for representative coverage",
4359
+ "Re-run benchmarks after schema changes, instruction updates, or model upgrades",
4360
+ "Benchmark results are stored historically for tracking quality trends over time"
4361
+ ]
4362
+ },
3879
4363
  {
3880
4364
  type: "callout",
3881
- text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed."
4365
+ text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed. This separation ensures that ground truth data remains a pure measurement tool without introducing bias into the extraction pipeline."
3882
4366
  }
3883
4367
  ],
3884
4368
  related: [
@@ -3925,9 +4409,25 @@ var sections9 = [
3925
4409
  type: "paragraph",
3926
4410
  text: "Approval gates integrate directly with the delivery pipeline. When a result passes all gates, a `result.approved` signal is emitted automatically. Bind this signal to a destination to create a fully automated flow from document upload through extraction, validation, approval, and delivery \u2014 no manual steps required for high-confidence results."
3927
4411
  },
4412
+ {
4413
+ type: "paragraph",
4414
+ text: "A practical example: you configure an approval gate on your invoice schema with 90% confidence, 95% validation pass rate, and 80% field coverage. An invoice extraction scores 96% confidence, passes all validation checks, and has all fields populated. It clears all three gates and is auto-approved \u2014 a `result.approved` signal fires immediately. A second invoice scores 85% confidence. It fails the confidence gate and is routed to the Approval Queue for manual review. This two-track approach lets high-quality results flow through instantly while ensuring borderline cases get human attention."
4415
+ },
4416
+ {
4417
+ type: "list",
4418
+ items: [
4419
+ "Minimum confidence: lowest acceptable extraction confidence score",
4420
+ "Validation pass rate: minimum percentage of validation checks that must pass",
4421
+ "Field coverage: minimum percentage of schema fields with non-empty values",
4422
+ "All three gates must be cleared for auto-approval \u2014 any failure routes to manual review",
4423
+ "Configured per schema for independent tuning per document type",
4424
+ "Emits result.approved signal on auto-approval for delivery pipeline integration",
4425
+ "Start conservative (high thresholds) and loosen as pipeline trust builds"
4426
+ ]
4427
+ },
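The worked example above (90% confidence, 95% validation pass rate, 80% field coverage) reduces to three threshold comparisons. In this sketch the function name, the property names, and the use of fractions between 0 and 1 are assumptions. Whether a score exactly equal to a threshold passes is also assumed here (it does).

```javascript
// Evaluate one extraction result against a schema's approval gates.
function evaluateGates(result, gates) {
  const failed = [];
  if (result.confidence < gates.minConfidence) failed.push("confidence");
  if (result.validationPassRate < gates.minValidationPassRate) {
    failed.push("validation_pass_rate");
  }
  if (result.fieldCoverage < gates.minFieldCoverage) failed.push("field_coverage");

  // All three gates must be cleared; any failure routes to manual review.
  return failed.length === 0
    ? { decision: "auto_approved", signal: "result.approved", failed }
    : { decision: "manual_review", signal: null, failed };
}

// Gates from the invoice example in the text.
const invoiceGates = {
  minConfidence: 0.9,
  minValidationPassRate: 0.95,
  minFieldCoverage: 0.8,
};
```

With these gates, the first invoice from the example (`confidence: 0.96`, pass rate and coverage of `1`) is auto-approved, while the second (`confidence: 0.85`) fails only the confidence gate and goes to the Approval Queue.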
3928
4428
  {
3929
4429
  type: "callout",
3930
- text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems."
4430
+ text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems. Combined with webhooks, this creates a fully automated flow from document upload through extraction, validation, approval, and delivery with zero manual steps for high-confidence results."
3931
4431
  }
3932
4432
  ],
3933
4433
  related: [
@@ -3984,9 +4484,26 @@ var sections9 = [
3984
4484
  type: "paragraph",
3985
4485
  text: "For best results, review flagged items first \u2014 these are records where at least one validation check failed, making them the most likely to contain errors. Most teams assign a daily review cadence and use confidence range filters to prioritize low-confidence items that need the most attention."
3986
4486
  },
4487
+ {
4488
+ type: "paragraph",
4489
+ text: "The Approval Queue integrates with the broader delivery pipeline through event-driven signals. Every approve or reject action \u2014 whether manual or via batch operations \u2014 emits the corresponding signal immediately. This means downstream systems receive real-time notifications of review decisions. You can bind these signals to different destinations: approved records to a webhook that triggers an ERP import, rejected records to a Slack notification for the data operations team. The event-driven design decouples review decisions from downstream processing, making the system easy to extend."
4490
+ },
4491
+ {
4492
+ type: "list",
4493
+ items: [
4494
+ "Navigate to Review &rarr; Approval Queue to see all pending items",
4495
+ "Filter by status (pending, flagged), schema, or confidence range",
4496
+ "Review detail view shows extracted values alongside the source document",
4497
+ "Provenance trails trace each value back to its origin in the document text",
4498
+ "Inline validation check results show which rules passed and which failed",
4499
+ "Batch approve or reject multiple items at once",
4500
+ "LLM auto-review proposes decisions for pending items with one-click accept or override",
4501
+ "result.approved and result.rejected signals emitted for delivery pipeline integration"
4502
+ ]
4503
+ },
3987
4504
  {
3988
4505
  type: "callout",
3989
- text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click."
4506
+ text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click. Auto-review is especially effective for high-volume workloads where most items are straightforward \u2014 it lets reviewers focus their attention on the genuinely ambiguous cases."
3990
4507
  }
3991
4508
  ],
3992
4509
  related: [
@@ -4030,11 +4547,11 @@ var sections10 = [
4030
4547
  content: [
4031
4548
  {
4032
4549
  type: "paragraph",
4033
- text: "Push extracted, resolved, and reviewed data to any downstream system. Delivery is a typed, at-least-once pipeline with idempotency keys on the wire, append-only history, and a dead-letter queue for terminal failures."
4550
+ text: "Push extracted, resolved, and reviewed data to any downstream system. Delivery is a typed, at-least-once pipeline with idempotency keys on the wire, append-only history, and a dead-letter queue for terminal failures. The system is fully configurable without code changes \u2014 signals, deliverable resolvers, serializers, and connectors are four orthogonal registries that compose independently, so adding a new destination type never requires changes to the signal or serializer code."
4034
4551
  },
4035
4552
  {
4036
4553
  type: "paragraph",
4037
- text: "Every delivery flows through a five-stage pipeline:"
4554
+ text: "Every delivery flows through a five-stage pipeline. Producers are stateless \u2014 they only publish typed events into an outbox and never interact with destinations or bindings directly. A background poller drains the outbox every 5 seconds (configurable via `delivery.poll_interval_ms`), claiming up to 50 rows per tick using `FOR UPDATE SKIP LOCKED` for safe multi-instance operation. When the BullMQ queue depth exceeds the backpressure threshold (default 10,000), the poller pauses until the queue drains, preventing memory exhaustion under burst load. Matched events are enqueued as delivery jobs processed by workers (default concurrency: 10):"
4038
4555
  },
4039
4556
  {
4040
4557
  type: "param-table",
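The poller behaviour described in the added paragraph (claim up to 50 rows per tick, pause when queue depth exceeds the backpressure threshold) can be sketched as a single decision function. This is a minimal illustration, not the platform's code: the function name `rowsToClaim` and its option names are hypothetical, and only the defaults (50 rows, 10,000 queue depth) come from the text.

```javascript
// Hypothetical sketch of one poller tick's claim decision.
// Defaults mirror the documented values; names are assumptions.
function rowsToClaim(queueDepth, pendingRows, options = {}) {
  const { batchSize = 50, backpressureThreshold = 10000 } = options;

  // Backpressure: when the queue depth exceeds the threshold,
  // the poller pauses and claims nothing on this tick.
  if (queueDepth > backpressureThreshold) {
    return 0;
  }

  // Otherwise claim at most one batch of outbox rows.
  return Math.min(batchSize, pendingRows);
}
```

With the defaults, a tick that sees a queue depth of 10,001 claims nothing, while a tick at or below 10,000 claims up to 50 pending rows.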
@@ -4122,7 +4639,7 @@ var sections10 = [
4122
4639
  content: [
4123
4640
  {
4124
4641
  type: "paragraph",
4125
- text: "A destination is a connector + configuration + optional credentials. Slice-1 ships the webhook connector; S3, Google Sheets, Drive, SFTP, and Email arrive in later slices. Use **Delivery &rarr; Destinations** to manage them from the dashboard, or `POST /v1/delivery/destinations` via the API. Every destination supports a live-ping `POST /v1/delivery/destinations/:id/test` that exercises the full transport envelope with a tiny test payload."
4642
+ text: "A destination is a connector + configuration + optional credentials. Seven connectors are live: webhook (HMAC-SHA256), SFTP, Amazon S3, Azure Blob Storage, Google Drive, OneDrive, and Google Sheets. Use **Delivery &rarr; Destinations** to manage them from the dashboard, or `POST /v1/delivery/destinations` via the API. Every destination supports a live-ping `POST /v1/delivery/destinations/:id/test` that exercises the full transport envelope with a tiny test payload. File-based destinations use lightweight probes (list directory, head bucket, get container properties) so the test never creates artifacts in your target storage."
4126
4643
  },
4127
4644
  {
4128
4645
  type: "param-table",
@@ -4168,6 +4685,10 @@ var sections10 = [
4168
4685
  type: "paragraph",
4169
4686
  text: "A single destination can back multiple bindings. For example, one S3 bucket destination can receive both `document.extracted` and `result.approved` events through separate bindings, each with its own serializer and field map. This keeps your destination inventory small while supporting diverse routing requirements."
4170
4687
  },
4688
+ {
4689
+ type: "paragraph",
4690
+ text: 'For example, to set up a webhook destination via the API: `POST /v1/delivery/destinations` with a body containing `name`, `type: "webhook"`, `config: { url: "https://ops.example.com/talonic" }`, and optionally `auth_config`, `signing_secret`, and `payload_cap_bytes`. The response returns the destination ID, which you then reference when creating a binding. After creation, call `POST /v1/delivery/destinations/:id/test` to verify the connection end-to-end before routing live events to it.'
4691
+ },
4171
4692
  {
4172
4693
  type: "paragraph",
4173
4694
  text: "For best results, always run a live-ping test after creating a destination. The test exercises the full transport envelope \u2014 SSRF validation, payload cap, and authentication \u2014 with a tiny test payload, so you catch configuration errors before real events start flowing. OAuth-based destinations (Google Drive, Google Sheets) require connecting your account first via the OAuth flow in the dashboard."
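The webhook destination request described in the added paragraph can be sketched as follows. This only builds the request; it does not send it. The endpoint paths and the body fields (`name`, `type`, `config.url`, `signing_secret`, `payload_cap_bytes`) come from the text, while the helper names, the argument shape, and the `Content-Type` header are assumptions for illustration.

```javascript
// Hypothetical helper: assembles the POST /v1/delivery/destinations request.
// Optional fields are only included when the caller supplies them.
function buildWebhookDestinationRequest({ name, url, signingSecret, payloadCapBytes }) {
  const body = { name, type: "webhook", config: { url } };
  if (signingSecret !== undefined) body.signing_secret = signingSecret;
  if (payloadCapBytes !== undefined) body.payload_cap_bytes = payloadCapBytes;

  return {
    method: "POST",
    path: "/v1/delivery/destinations",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

// Hypothetical helper: the follow-up live-ping test for a created destination.
function buildDestinationTestRequest(destinationId) {
  return {
    method: "POST",
    path: `/v1/delivery/destinations/${encodeURIComponent(destinationId)}/test`,
  };
}
```

A caller would send the first request, read the destination ID from the response, and then issue the test request before creating any binding.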
@@ -4215,15 +4736,15 @@ var sections10 = [
4215
4736
  content: [
4216
4737
  {
4217
4738
  type: "paragraph",
4218
- text: "A binding is the routing rule: it joins a **signal filter** (which events?) to a **deliverable type** (what payload shape?) to a **destination** (ship where?) via a **serializer** (encoded how?). On create, the backend validates all four pieces form a compatible triangle \u2014 the serializer must support the resolver's shape, and the connector must support the serializer format."
4739
+ text: "A binding is the routing rule: it joins a **signal filter** (which events?) to a **deliverable type** (what payload shape?) to a **destination** (ship where?) via a **serializer** (encoded how?). On create, the backend validates all four pieces form a compatible triangle \u2014 the serializer must support the resolver's shape, and the connector must support the serializer format. This six-predicate validation ensures you never end up with a misconfigured binding that cannot deliver."
4219
4740
  },
4220
4741
  {
4221
4742
  type: "paragraph",
4222
- text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. Optional `delivery_policy` overrides the default retry ladder (6 attempts at `5s, 30s, 2min, 10min, 1h`) and timeout."
4743
+ text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. The three operations compose in order: drop excluded fields first, then rename remaining fields, then inject static values. Optional `delivery_policy` overrides the default retry ladder (7 attempts over ~10 hours with exponential backoff: 0s, 30s, 2min, 8min, 30min, 2h, 8h) and maximum attempts."
4223
4744
  },
4224
4745
  {
4225
4746
  type: "paragraph",
4226
- text: "The compatibility triangle is enforced on every create and update. The backend checks that your chosen serializer supports the deliverable resolver's output shape, and that the connector accepts the serializer's format. If any predicate fails, the binding is rejected with a descriptive error \u2014 you never end up with a binding that cannot deliver."
4747
+ text: "The compatibility triangle is enforced on every create and update via six predicates. The backend checks that: (1) the `signal_filter` is well-formed with a known event type and valid match values, (2) the `deliverable_type` resolves to a registered resolver, (3) the `serializer_format` resolves to a registered serializer, (4) the serializer supports the resolver's output shape, (5) the connector's supported serializer list includes the chosen format, and (6) the resolver's compatible signals include the signal filter's event type. If any predicate fails, the binding is rejected with a descriptive error \u2014 you never end up with a binding that cannot deliver."
4227
4748
  },
4228
4749
  {
4229
4750
  type: "paragraph",
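The `field_map` ordering described in this hunk (drop first, then rename, then inject static values) can be sketched as below. The shape of the map (`drop` as an array, `rename` and `static` as objects) is an assumption made for illustration; the text only names the three rule kinds and their order.

```javascript
// Hypothetical sketch of applying a field_map to a flat payload.
// Assumed shape: { drop: string[], rename: {old: new}, static: {key: value} }.
function applyFieldMap(payload, fieldMap = {}) {
  const { drop = [], rename = {}, static: staticValues = {} } = fieldMap;
  const dropped = new Set(drop);
  const result = {};

  for (const [key, value] of Object.entries(payload)) {
    // Step 1: drop excluded fields.
    if (dropped.has(key)) continue;
    // Step 2: rename the fields that remain.
    const target = Object.prototype.hasOwnProperty.call(rename, key) ? rename[key] : key;
    result[target] = value;
  }

  // Step 3: inject static values last, so they win over payload fields.
  return { ...result, ...staticValues };
}
```

Because dropping runs first, a rename rule for a dropped field has no effect, and because static values are applied last they overwrite any payload field with the same name.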
@@ -4275,7 +4796,7 @@ var sections10 = [
4275
4796
  content: [
4276
4797
  {
4277
4798
  type: "paragraph",
4278
- text: "The catalog API (`/v1/delivery/catalog/*`) exposes the four registries that drive the binding picker. Use it to populate dropdowns rather than hardcoding lists \u2014 it always reflects the running registry contents."
4799
+ text: "The catalog API (`/v1/delivery/catalog/*`) exposes the four registries that drive the binding picker: signals, deliverables, serializers, and connectors. Use it to populate dropdowns rather than hardcoding lists \u2014 it always reflects the running registry contents. When new signal types or deliverable resolvers are added to the platform, they appear automatically in the catalog without any configuration changes on your end."
4279
4800
  },
4280
4801
  {
4281
4802
  type: "param-table",
@@ -4340,7 +4861,7 @@ var sections10 = [
4340
4861
  },
4341
4862
  {
4342
4863
  type: "paragraph",
4343
- text: "Signals are typed events emitted by the platform when meaningful state changes occur. Document-level signals fire on extraction success or failure. Run-level signals fire when a job completes across dataspace, structuring, resolution, or extraction runs. Result-level signals fire when a reviewer approves, rejects, or flags a record."
4864
+ text: "Signals are typed events emitted by the platform when meaningful state changes occur. They fall into four categories. **Document signals** (`document.extracted`, `document.extraction_failed`) fire on extraction success or failure for individual documents. **Run signals** (`run.dataspace.completed`, `run.structuring.completed`, `run.resolution.completed`, `run.extraction.completed`) fire when a job run completes across the four pipeline domains. **Result signals** (`result.approved`, `result.rejected`, `result.flagged`) fire when a reviewer takes action on a record. **Meta-signals** (`delivery.item.completed`, `delivery.item.failed`) fire when a delivery attempt itself succeeds or fails, enabling self-monitoring workflows."
4344
4865
  },
4345
4866
  {
4346
4867
  type: "paragraph",
@@ -4391,15 +4912,15 @@ var sections10 = [
4391
4912
  content: [
4392
4913
  {
4393
4914
  type: "paragraph",
4394
- text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies. Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt with a fresh idempotency key. Nothing in history is ever mutated; the log is strictly append-only."
4915
+ text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies (truncated to 10 KB each). Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt while preserving the deterministic idempotency key, so receivers that deduplicate on the key will not process the same delivery twice even after multiple replays. Nothing in history is ever mutated; the log is strictly append-only."
4395
4916
  },
4396
4917
  {
4397
4918
  type: "paragraph",
4398
- text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration."
4919
+ text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, HTTP status code, error code, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration. Item and DLQ IDs are UUIDs, while event IDs are sequential integers for efficient ordering."
4399
4920
  },
4400
4921
  {
4401
4922
  type: "paragraph",
4402
- text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error, fix the destination configuration, and replay the delivery with a single click or API call."
4923
+ text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted (default: 7 attempts over ~10 hours) or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error details, fix the destination configuration, and replay the delivery with a single click or API call. Destinations returning authentication errors are automatically disabled to prevent further failed attempts until the credentials are updated."
4403
4924
  },
4404
4925
  {
4405
4926
  type: "paragraph",
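The default retry ladder quoted in this release (0s, 30s, 2min, 8min, 30min, 2h, 8h) can be turned into a cumulative schedule to see when a delivery would reach the DLQ. This sketch assumes each delay is measured from the previous attempt, which the text does not state; the constant and function names are hypothetical.

```javascript
// Documented default delays, expressed in seconds.
const DEFAULT_RETRY_DELAYS_SECONDS = [0, 30, 120, 480, 1800, 7200, 28800];

// Hypothetical helper: returns one entry per attempt with the time
// elapsed since the first attempt, assuming delays are consecutive.
function retrySchedule(delays = DEFAULT_RETRY_DELAYS_SECONDS) {
  let elapsed = 0;
  return delays.map((delay, index) => {
    elapsed += delay;
    return { attempt: index + 1, delaySeconds: delay, elapsedSeconds: elapsed };
  });
}
```

Under that assumption the seventh attempt lands 38,430 seconds (about 10.7 hours) after the first, which is consistent with the "~10 hours" figure in the text.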
@@ -4418,15 +4939,15 @@ var sections10 = [
4418
4939
  faq: [
4419
4940
  {
4420
4941
  question: "How is delivery history tracked?",
4421
- answer: "Every delivery attempt writes a row to /v1/delivery/items with status, HTTP code, error code, and request/response bodies. The log is strictly append-only \u2014 nothing is ever mutated."
4942
+ answer: "Every delivery attempt writes a row to /v1/delivery/items with status, HTTP code, error code, and request/response bodies (truncated to 10 KB each). The log is strictly append-only \u2014 nothing is ever mutated. You can filter items by binding_id, destination_id, or status to narrow results when debugging a specific integration."
4422
4943
  },
4423
4944
  {
4424
4945
  question: "What is the dead letter queue (DLQ)?",
4425
- answer: "Terminal failures (retry ladder exhausted or permanent 4xx) escalate to /v1/delivery/dlq. DLQ entries are fully replayable \u2014 replay enqueues a fresh attempt with a new idempotency key."
4946
+ answer: "Terminal failures (retry ladder exhausted or permanent 4xx) escalate to /v1/delivery/dlq. DLQ entries are fully replayable \u2014 replay enqueues a fresh attempt while preserving the deterministic idempotency key, so receivers that deduplicate on the key will not process the same delivery twice. Destinations returning authentication errors are automatically disabled to prevent further failed attempts."
4426
4947
  },
4427
4948
  {
4428
4949
  question: "How long are request and response bodies retained?",
4429
- answer: "Request and response bodies are cleaned up after the configured retention period (default 30 days). Row metadata \u2014 status, HTTP code, error code, and duration \u2014 is retained indefinitely for audit purposes."
4950
+ answer: "Request and response bodies are cleaned up after the configured retention period (default 30 days) by a daily cleanup job that runs at 03:00 server time. Row metadata \u2014 status, HTTP code, error code, and duration \u2014 is retained indefinitely for audit purposes. Configure the retention period via the delivery.item_body_retention_days setting in pipeline.yaml."
4430
4951
  }
4431
4952
  ],
4432
4953
  mentions: [
@@ -4499,7 +5020,7 @@ var sections11 = [
4499
5020
  },
4500
5021
  {
4501
5022
  question: "When should I use a shared dialect vs an inline dialect?",
4502
- answer: "Use shared dialects for workspace-wide defaults that apply to most schemas. Use inline dialects only when a specific schema needs different formatting \u2014 for example, a schema that outputs dates in a different format for a particular downstream system."
5023
+ answer: "Use shared dialects for workspace-wide defaults that apply to most schemas. Use inline dialects only when a specific schema needs different formatting \u2014 for example, a schema that outputs dates in DD/MM/YYYY for a European ERP while the rest of your workspace uses YYYY-MM-DD. Inline overrides apply only to that one schema, so they do not affect any other output. If you find yourself overriding the same setting in multiple schemas, consider updating the shared dialect instead."
4503
5024
  },
4504
5025
  {
4505
5026
  question: "Do shared dialects affect the extraction process?",
@@ -4571,7 +5092,7 @@ var sections11 = [
4571
5092
  },
4572
5093
  {
4573
5094
  question: "How does the lookup cascade work?",
4574
- answer: "The platform tries three tiers: first, exact string normalization (whitespace and case normalization). If that fails, token-based fuzzy matching. If the fuzzy match is below the confidence threshold, a Haiku LLM call resolves the ambiguity."
5095
+ answer: "The platform tries three tiers in sequence. First, exact string normalization strips whitespace and normalizes casing to find a direct match. If no exact match is found, token-based fuzzy matching compares individual tokens against all reference values and scores similarity. If the best fuzzy match falls below the confidence threshold, a Haiku LLM call evaluates the ambiguous value in context against the top candidates and selects the most likely match. This three-tier approach balances speed and accuracy \u2014 most lookups resolve in the first two tiers without any LLM cost."
4575
5096
  },
4576
5097
  {
4577
5098
  question: "What happens when I update a reference primitive?",
@@ -4616,7 +5137,10 @@ var sections11 = [
4616
5137
  items: [
4617
5138
  "**Schema changes** \u2014 field additions, removals, mapping updates, and format constraint modifications.",
4618
5139
  "**Shared dialect changes** \u2014 date format, number locale, delimiter, and encoding updates.",
4619
- "**Reference primitive changes** \u2014 new versions of lookup tables and key-value modifications."
5140
+ "**Reference primitive changes** \u2014 new versions of lookup tables and key-value modifications.",
5141
+ "**Delivery binding changes** \u2014 modifications to outbound delivery destinations, field maps, or signal filters.",
5142
+ "**Routing rule changes** \u2014 additions or modifications to document routing rules that assign schemas automatically.",
5143
+ "**Format constraint changes** \u2014 regex pattern updates or fallback behavior modifications on schema fields."
4620
5144
  ]
4621
5145
  },
4622
5146
  {
@@ -4697,7 +5221,9 @@ var sections12 = [
4697
5221
  "**Extracted values** \u2014 finds specific data points across all processed documents.",
4698
5222
  "**Field names** \u2014 searches the Field Registry for canonical field definitions.",
4699
5223
  "**Schema names** \u2014 locates generated and template schemas by title.",
4700
- "**Sources** \u2014 matches source connection names and configurations."
5224
+ "**Sources** \u2014 matches source connection names and configurations.",
5225
+ "**Matching configurations** \u2014 finds matching configs and reference datasets by name.",
5226
+ "**Delivery bindings** \u2014 locates delivery pipeline bindings and destination configurations."
4701
5227
  ]
4702
5228
  }
4703
5229
  ],
@@ -4764,6 +5290,10 @@ var sections12 = [
4764
5290
  {
4765
5291
  type: "paragraph",
4766
5292
  text: 'For best results, save your most common filter combinations as presets. Most teams create presets for categories like "high-value invoices this quarter," "documents missing key fields," or "recently failed extractions." Presets appear as one-click buttons on the Documents page, eliminating the need to rebuild complex filter conditions from scratch each time.'
5293
+ },
5294
+ {
5295
+ type: "paragraph",
5296
+ text: 'For example, to find all high-value invoices from a specific vendor, build a filter with `vendor_name eq "Acme Corp"` AND `document_type eq "Invoice"` AND `total_amount gt 5000`. The field autocomplete ensures you are filtering on valid extracted fields, and the materialized index returns results instantly even across thousands of documents. Save this as a preset called "Acme high-value invoices" for one-click access when you need to review that vendor\'s billing history.'
4767
5297
  }
4768
5298
  ],
4769
5299
  related: [
@@ -4826,10 +5356,25 @@ var sections13 = [
4826
5356
  type: "paragraph",
4827
5357
  text: "For best results, create separate API keys for each integration or service that connects to your Talonic workspace. This makes it easy to rotate or revoke a single key without disrupting other integrations. Most teams maintain one key for their ingestion pipeline, one for their BI dashboard, and one for webhook-based automations."
4828
5358
  },
5359
+ {
5360
+ type: "paragraph",
5361
+ text: "API keys are SHA-256 hashed at rest, which means the platform never stores the plaintext key after creation. This is a deliberate security measure \u2014 even if the database were compromised, the hashed keys cannot be reversed to their original values. When your integration sends a request, the platform hashes the incoming key and compares it against the stored hash. This design follows the same pattern used by GitHub, Stripe, and other major API providers."
5362
+ },
5363
+ {
5364
+ type: "list",
5365
+ items: [
5366
+ "Prefixed with tlnc_ for easy identification in logs and configuration files",
5367
+ "Passed via the Authorization: Bearer header on every API request",
5368
+ "SHA-256 hashed at rest \u2014 the full key is only shown once at creation",
5369
+ "Three scopes: extract (ingestion), read (query), write (create/modify)",
5370
+ "Create separate keys per integration for independent rotation and revocation",
5371
+ "No limit on the number of keys per workspace"
5372
+ ]
5373
+ },
4829
5374
  {
4830
5375
  type: "callout",
4831
5376
  variant: "warning",
4832
- text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated."
5377
+ text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated. Store API keys in a secrets manager, not in source code or environment files checked into version control."
4833
5378
  },
4834
5379
  {
4835
5380
  type: "param-table",
@@ -4870,6 +5415,10 @@ var sections13 = [
4870
5415
  {
4871
5416
  question: "Can I have multiple API keys?",
4872
5417
  answer: "Yes. You can create as many API keys as needed. Best practice is to create separate keys for each integration so you can rotate or revoke them independently without disrupting other services."
5418
+ },
5419
+ {
5420
+ question: "What are best practices for API key management?",
5421
+ answer: "Store keys in a secrets manager rather than source code or environment files checked into version control. Create one key per integration so each can be rotated independently. Use the narrowest scope possible \u2014 a read-only dashboard needs only the read scope, not extract or write. Rotate keys on a regular schedule and immediately revoke any key that may have been exposed. Monitor API usage per key to detect anomalies early."
4873
5422
  }
4874
5423
  ],
4875
5424
  mentions: ["API keys", "tlnc_", "SHA-256", "Bearer token", "scopes"]
@@ -4897,6 +5446,26 @@ var sections13 = [
4897
5446
  type: "paragraph",
4898
5447
  text: "For best results, start with the `/v1/extract` endpoint for document ingestion, then use `/v1/documents` and `/v1/extractions` to retrieve results. As your integration matures, explore delivery bindings, matching configurations, and batch processing to build a fully automated data pipeline."
4899
5448
  },
5449
+ {
5450
+ type: "paragraph",
5451
+ text: "Long-running operations like matching runs, batch inference, and resolution runs follow an asynchronous pattern. You submit a request to start the operation and receive a run ID. Poll the status endpoint with that run ID to track progress \u2014 the response includes a percentage-complete indicator and a terminal status when the operation finishes. This pattern keeps HTTP connections short-lived while giving you full visibility into background processing."
5452
+ },
5453
+ {
5454
+ type: "paragraph",
5455
+ text: "Error responses follow a consistent structure across all namespaces. Every error includes a typed error code from the ErrorCode enum, a human-readable message, and the HTTP status code. Common error codes include ENTITY_NOT_FOUND (404), VALIDATION_ERROR (400), and RATE_LIMIT_EXCEEDED (429). Rate limiting is applied per API key with configurable thresholds \u2014 the response headers include X-RateLimit-Remaining and X-RateLimit-Reset so your integration can adapt proactively."
5456
+ },
5457
+ {
5458
+ type: "list",
5459
+ items: [
5460
+ "20+ namespaces covering extraction, documents, schemas, jobs, delivery, linking, cases, quality, and more",
5461
+ "JSON request and response bodies with cursor-based pagination for list operations",
5462
+ "Authenticate with a tlnc_ API key via the Authorization: Bearer header",
5463
+ "Asynchronous operations with polling endpoints for status and progress",
5464
+ "Typed error codes with consistent response structure across all endpoints",
5465
+ "Rate limiting with X-RateLimit-Remaining and X-RateLimit-Reset headers",
5466
+ "Batch processing mode at 50% cost with 48-hour delivery SLA"
5467
+ ]
5468
+ },
4900
5469
  {
4901
5470
  type: "callout",
4902
5471
  variant: "info",
@@ -5045,7 +5614,7 @@ var sections13 = [
5045
5614
  content: [
5046
5615
  {
5047
5616
  type: "paragraph",
5048
- text: "Webhooks push real-time notifications when events occur. All payloads are HMAC-SHA256 signed. Failed deliveries retry with exponential backoff."
5617
+ text: "Webhooks push real-time notifications when events occur in your Talonic workspace. All payloads are HMAC-SHA256 signed using the signing secret configured on your delivery destination, ensuring authenticity and tamper detection. Failed deliveries retry with exponential backoff \u2014 the platform makes progressively spaced attempts before routing terminal failures to the dead-letter queue (DLQ) for manual replay."
5049
5618
  },
5050
5619
  {
5051
5620
  type: "paragraph",
@@ -5059,6 +5628,27 @@ var sections13 = [
5059
5628
  type: "paragraph",
5060
5629
  text: "Use webhooks when your downstream system needs to react immediately to platform events \u2014 for example, triggering an ERP import when a document is extracted, or notifying a Slack channel when a reviewer rejects a record. For bulk or periodic data transfers, consider using the SFTP, S3, or cloud storage delivery connectors instead."
5061
5630
  },
5631
+ {
5632
+ type: "paragraph",
5633
+ text: "To set up a webhook, create a delivery destination of type **webhook** with your endpoint URL and a signing secret. Then create a binding that maps one or more signal types to the destination. When an event fires, the platform constructs the payload, signs it, and delivers it to your endpoint. You can bind multiple signal types to the same destination or spread them across different destinations for different downstream systems."
5634
+ },
5635
+ {
5636
+ type: "list",
5637
+ items: [
5638
+ "HMAC-SHA256 signed payloads for authenticity and tamper detection",
5639
+ "Idempotency key in headers for safe deduplication on retries",
5640
+ "Exponential backoff on delivery failure with configurable retry limits",
5641
+ "Dead-letter queue (DLQ) for terminal failures with manual replay",
5642
+ "Bind any signal type to a webhook destination via delivery bindings",
5643
+ "11 signal types covering document, run, result, and delivery lifecycle events",
5644
+ "Meta-signals (delivery.item.completed/failed) are not re-delivered to avoid loops"
5645
+ ]
5646
+ },
5647
+ {
5648
+ type: "callout",
5649
+ variant: "warning",
5650
+ text: "Your webhook endpoint must respond with a 2xx status code within 30 seconds. Non-2xx responses or timeouts trigger the retry schedule. Permanent client errors (4xx except 429) are treated as terminal failures and routed directly to the DLQ without further retries."
5651
+ },
5062
5652
  {
5063
5653
  type: "param-table",
5064
5654
  title: "Delivery signal types (webhook-compatible)",
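The response handling stated in the warning callout (2xx succeeds, timeouts and other non-2xx responses retry, 4xx except 429 goes straight to the DLQ) can be sketched as a small classifier. The function name and the outcome labels are hypothetical; only the status-code rules come from the text.

```javascript
// Hypothetical sketch of how a webhook response maps to a delivery outcome.
// Labels "delivered", "retry", and "dead_letter" are illustrative names.
function classifyWebhookResponse(status, timedOut = false) {
  // No response within the 30-second window: follow the retry schedule.
  if (timedOut) return "retry";
  // Any 2xx response counts as a successful delivery.
  if (status >= 200 && status < 300) return "delivered";
  // 429 is the one 4xx code that is retried.
  if (status === 429) return "retry";
  // Other 4xx responses are permanent client errors: terminal failure.
  if (status >= 400 && status < 500) return "dead_letter";
  // Everything else (for example 5xx) is treated as a non-2xx retry.
  return "retry";
}
```

For a receiver, the practical consequence is to return a 2xx quickly and to reserve 4xx codes other than 429 for requests that should never be retried.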
@@ -5161,11 +5751,11 @@ var sections14 = [
5161
5751
  content: [
5162
5752
  {
5163
5753
  type: "paragraph",
5164
- text: "Organizations support role-based access control:"
5754
+ text: "Organizations support role-based access control that governs who can view, create, edit, and manage resources across your workspace. Access control is enforced at the API level, so permissions apply consistently whether users interact through the web interface, the AI agent, or the public API."
5165
5755
  },
5166
5756
  {
5167
5757
  type: "paragraph",
5168
- text: "Every user in your organization is assigned one of four roles that determine what they can see and do. Roles are hierarchical \u2014 each level includes all permissions of the levels below it. Choose the most restrictive role that still lets a team member do their job."
5758
+ text: "Every user in your organization is assigned one of four roles that determine what they can see and do. Roles are hierarchical \u2014 each level includes all permissions of the levels below it. Choose the most restrictive role that still lets a team member do their job. For most team members working on data review and extraction, the **Member** role provides everything they need."
5169
5759
  },
5170
5760
  {
5171
5761
  type: "param-table",
@@ -5195,7 +5785,7 @@ var sections14 = [
5195
5785
  },
5196
5786
  {
5197
5787
  type: "paragraph",
5198
- text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. Manage from the Team page."
5788
+ text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. This means anyone who signs up with an email from your company domain is automatically associated with your organization, but cannot access any data until an Admin or Owner explicitly approves them. Manage all team members and pending requests from the Team page in the sidebar."
5199
5789
  },
5200
5790
  {
5201
5791
  type: "paragraph",
@@ -5206,6 +5796,16 @@ var sections14 = [
5206
5796
  variant: "info",
5207
5797
  text: "Domain matching streamlines onboarding for larger teams. When a new user signs up with an email address matching your organization's domain (e.g., `@yourcompany.com`), they are automatically associated with your org in a **pending** state. An admin must approve them before they gain access."
5208
5798
  },
5799
+ {
5800
+ type: "list",
5801
+ ordered: false,
5802
+ items: [
5803
+ "**Viewer** \u2014 read-only access to documents, extraction results, schemas, and reports. Cannot create, edit, or delete any resources.",
5804
+ "**Member** \u2014 full CRUD access to documents, schemas, jobs, matching configurations, and delivery bindings. Cannot manage team members or workspace settings.",
5805
+ "**Admin** \u2014 all Member permissions plus team management (approve/reject members, change roles), workspace settings (shared dialects, reference primitives, change review), and routing rules.",
5806
+ "**Owner** \u2014 all Admin permissions plus billing management, API key generation and revocation, organization-level settings, and the ability to transfer ownership."
5807
+ ]
5808
+ },
5209
5809
  {
5210
5810
  type: "list",
5211
5811
  ordered: true,
@@ -5262,11 +5862,11 @@ var sections14 = [
5262
5862
  content: [
5263
5863
  {
5264
5864
  type: "paragraph",
5265
- text: "The Usage & Registry page replaces the legacy credits view with a comprehensive cost breakdown. It shows per-feature cost (extraction, OCR, batch, matching), a daily cost chart, and a full call log with model, tokens, and cost per request. The **Master view** (admin only) shows per-customer breakdowns and platform-wide statistics."
5865
+ text: "The Usage & Registry page replaces the legacy credits view with a comprehensive cost breakdown. It shows per-feature cost (extraction, OCR, batch, matching), a daily cost chart, and a full call log with model, tokens, and cost per request. The **Master view** (admin only) shows per-customer breakdowns and platform-wide statistics. Navigate to **Usage** from the sidebar to access all views."
5266
5866
  },
5267
5867
  {
5268
5868
  type: "paragraph",
5269
- text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events."
5869
+ text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. If OCR is a significant portion, check whether you are processing image-heavy documents that could benefit from better source quality. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events or batch completions."
5270
5870
  },
5271
5871
  {
5272
5872
  type: "paragraph",
@@ -5302,10 +5902,26 @@ var sections14 = [
5302
5902
  }
5303
5903
  ]
5304
5904
  },
5905
+ {
5906
+ type: "heading",
5907
+ level: 3,
5908
+ id: "cost-optimization",
5909
+ text: "Cost Optimization Tips"
5910
+ },
5911
+ {
5912
+ type: "list",
5913
+ ordered: false,
5914
+ items: [
5915
+ "**Use batch mode** for non-urgent documents \u2014 extraction runs at 50% cost with a 48-hour delivery window.",
5916
+ "**Build your field registry** \u2014 as more fields reach Tier 1 and Tier 2, extraction costs drop because values are resolved via lookup instead of AI calls.",
5917
+ "**Review the per-feature breakdown** weekly to identify which operations dominate your spend.",
5918
+ "**Leverage routing rules** to automatically assign schemas, reducing the number of manual job runs and re-extractions."
5919
+ ]
5920
+ },
5305
5921
  {
5306
5922
  type: "callout",
5307
5923
  variant: "info",
5308
- text: "The call log records every LLM and OCR call with full detail \u2014 model name, input/output token counts, latency, and cost. Use it to audit individual extractions or investigate unexpected cost increases."
5924
+ text: "The call log records every LLM and OCR call with full detail \u2014 model name, input/output token counts, latency, and cost. Use it to audit individual extractions or investigate unexpected cost increases. Each entry links back to the specific document and job that triggered the call."
5309
5925
  }
5310
5926
  ],
5311
5927
  related: [
@@ -5324,7 +5940,7 @@ var sections14 = [
5324
5940
  },
5325
5941
  {
5326
5942
  question: "How can I reduce my usage costs?",
5327
- answer: "Use batch mode for non-urgent documents to cut extraction costs by 50%. Review the per-feature breakdown to identify your highest-cost operations, and use the daily cost chart to spot and investigate usage spikes."
5943
+ answer: "Use batch mode for non-urgent documents to cut extraction costs by 50%. Review the per-feature breakdown to identify your highest-cost operations, and use the daily cost chart to spot and investigate usage spikes. Additionally, invest in building your Field Registry \u2014 as more fields reach Tier 1 and Tier 2, values are resolved via deterministic lookup instead of LLM calls, which reduces per-document extraction cost over time. Leverage routing rules to assign schemas automatically, which avoids manual re-extractions and wasted processing."
5328
5944
  }
5329
5945
  ],
5330
5946
  mentions: [
@@ -5353,7 +5969,7 @@ var sections14 = [
5353
5969
  },
5354
5970
  {
5355
5971
  type: "paragraph",
5356
- text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs."
5972
+ text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs. Usage data includes per-feature breakdowns (extraction, OCR, batch, matching) and daily cost trends across all tenants."
5357
5973
  },
5358
5974
  {
5359
5975
  type: "paragraph",
@@ -5383,19 +5999,19 @@ var sections14 = [
5383
5999
  faq: [
5384
6000
  {
5385
6001
  question: "What features does the Admin Panel provide?",
5386
- answer: "Customer management, user management, usage statistics, data clear & rebuild, and cross-tenant master registry view. Accessible from the user menu for admins and superadmins."
6002
+ answer: "Customer management (create, list, delete organizations), user management (cross-tenant user view with account removal), usage statistics (platform-wide cost and volume aggregates), data clear and rebuild (wipe and reprocess all data for a customer), and cross-tenant master registry view (audit field definitions and schemas across tenants)."
5387
6003
  },
5388
6004
  {
5389
6005
  question: "Who can access the Admin Panel?",
5390
- answer: "The Admin Panel is accessible only to users with admin or superadmin roles, via the user menu in the platform navigation."
6006
+ answer: "The Admin Panel is accessible only to users with admin or superadmin roles, via the user menu in the platform navigation. Regular Members and Viewers do not see the Admin Panel option."
5391
6007
  },
5392
6008
  {
5393
6009
  question: "What does the data clear operation do?",
5394
- answer: "Data clear wipes all documents, extractions, jobs, results, and related data for a specific customer. It is irreversible and intended for full reprocessing scenarios during onboarding or after major schema changes."
6010
+ answer: "Data clear wipes all documents, extractions, jobs, results, and related data for a specific customer. It is irreversible and intended for full reprocessing scenarios during onboarding or after major schema changes. Always confirm with the customer before executing this operation."
5395
6011
  },
5396
6012
  {
5397
6013
  question: "Can I view usage across all customers?",
5398
- answer: "Yes. The Admin Panel includes a master registry view that shows cross-tenant usage statistics, per-customer cost breakdowns, and platform-wide aggregates."
6014
+ answer: "Yes. The Admin Panel includes a master registry view that shows cross-tenant usage statistics, per-customer cost breakdowns, and platform-wide aggregates. This is useful for identifying high-usage tenants, tracking platform growth, and forecasting infrastructure needs."
5399
6015
  }
5400
6016
  ],
5401
6017
  mentions: [
@@ -5416,11 +6032,11 @@ var sections14 = [
5416
6032
  content: [
5417
6033
  {
5418
6034
  type: "paragraph",
5419
- text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows."
6035
+ text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows. Whether you are reviewing extraction results, configuring schemas, or browsing the field registry, the same shortcuts are always available."
5420
6036
  },
5421
6037
  {
5422
6038
  type: "paragraph",
5423
- text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused."
6039
+ text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused. On macOS, shortcuts use the Command key (`Cmd`); on Windows and Linux, they use `Ctrl`."
5424
6040
  },
5425
6041
  {
5426
6042
  type: "paragraph",
@@ -5444,6 +6060,11 @@ var sections14 = [
5444
6060
  type: "global",
5445
6061
  description: "Quick extract \u2014 upload and process a document."
5446
6062
  },
6063
+ {
6064
+ name: "\u2318I / Ctrl+I",
6065
+ type: "global",
6066
+ description: "Open the AI Agent from any page."
6067
+ },
5447
6068
  {
5448
6069
  name: "Escape",
5449
6070
  type: "global",
@@ -5451,10 +6072,14 @@ var sections14 = [
5451
6072
  }
5452
6073
  ]
5453
6074
  },
6075
+ {
6076
+ type: "paragraph",
6077
+ text: "The **AI Agent** can be opened from any page using `Cmd+I` (`Ctrl+I` on Windows and Linux). The agent provides a conversational interface for inspecting data, building schemas, and analyzing extraction quality \u2014 all without navigating away from your current page. Together with Omnisearch and quick extract, these three shortcuts cover the most common workflow interruptions."
6078
+ },
5454
6079
  {
5455
6080
  type: "callout",
5456
6081
  variant: "info",
5457
- text: "The **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) is the fastest way to upload a single document. It opens a streamlined upload interface that lets you drag a file and start processing immediately."
6082
+ text: "The **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) is the fastest way to upload a single document. It opens a streamlined upload interface that lets you drag a file and start processing immediately. Use it when you receive a document via email or chat and want instant extraction results."
5458
6083
  }
5459
6084
  ],
5460
6085
  related: [
@@ -5464,15 +6089,15 @@ var sections14 = [
5464
6089
  faq: [
5465
6090
  {
5466
6091
  question: "What keyboard shortcuts are available?",
5467
- answer: "Cmd+K / Ctrl+K for Omnisearch, Cmd+J / Ctrl+J for quick extract (upload and process), and Escape to close overlays, modals, and search."
6092
+ answer: "Four global shortcuts: Cmd+K / Ctrl+K for Omnisearch, Cmd+J / Ctrl+J for quick extract (upload and process), Cmd+I / Ctrl+I to open the AI Agent, and Escape to close overlays, modals, and search. These work from any page in the platform."
5468
6093
  },
5469
6094
  {
5470
6095
  question: "What does the quick extract shortcut do?",
5471
- answer: "Cmd+J / Ctrl+J opens the quick extract interface, allowing you to upload and process a document directly from any page."
6096
+ answer: "Cmd+J / Ctrl+J opens the quick extract interface, allowing you to upload and process a document directly from any page. It provides a streamlined drag-and-drop area that immediately processes the uploaded file and displays extraction results. This is the fastest path from receiving a document to seeing structured data \u2014 ideal for one-off documents that arrive via email or chat and need immediate attention without navigating to the upload page."
5472
6097
  },
5473
6098
  {
5474
6099
  question: "Do shortcuts work inside modals or overlays?",
5475
- answer: "The Escape shortcut works inside any modal or overlay to close it. Omnisearch (Cmd+K) works globally, even when other overlays are open. Quick extract (Cmd+J) is available from the main interface."
6100
+ answer: "The Escape shortcut works inside any modal or overlay to close it. Omnisearch (Cmd+K) works globally, even when other overlays are open. The AI Agent shortcut (Cmd+I) also works from any context. Quick extract (Cmd+J) is available from the main interface."
5476
6101
  }
5477
6102
  ],
5478
6103
  mentions: ["keyboard shortcuts", "Cmd+K", "Cmd+J", "Escape", "quick extract"]
@@ -5490,7 +6115,7 @@ var sections15 = [
5490
6115
  content: [
5491
6116
  {
5492
6117
  type: "paragraph",
5493
- text: "Documents can be processed in **batch mode** at 50% cost with a 48-hour delivery window. Toggle batch mode on the upload screen or set it via the API. Batch processing is ideal for large backlog ingestion where real-time results are not required."
6118
+ text: "Documents can be processed in **batch mode** at 50% cost with a 48-hour delivery window. Toggle batch mode on the upload screen or set `processing_mode=batch` via the API. Batch processing is ideal for large backlog ingestion where real-time results are not required. The cost reduction comes from the provider's native batch API, which schedules processing during off-peak capacity \u2014 there is no loss in extraction quality because the same Claude model and prompts are used as in real-time mode."
5494
6119
  },
5495
6120
  {
5496
6121
  type: "callout",
@@ -5554,7 +6179,7 @@ var sections15 = [
5554
6179
  content: [
5555
6180
  {
5556
6181
  type: "paragraph",
5557
- text: 'Set `processing_mode=batch` on upload (API) or toggle the "Batch" switch in the upload UI. Stage 1 (OCR + classification) runs immediately so documents appear in your library right away. Stage 2 (Claude extraction) is deferred to the provider\'s batch API for asynchronous processing.'
6182
+ text: 'Set `processing_mode=batch` on upload (API) or toggle the "Batch" switch in the upload UI. Stage 1 (OCR + classification) runs immediately so documents appear in your library right away with their type classification and triage metadata. Stage 2 (Claude extraction) is deferred to the provider\'s batch API for asynchronous processing. While waiting for batch results, documents show a status of `batch_queued` in your library. The system requires a minimum of 100 items per batch \u2014 if fewer documents are uploaded in batch mode, the system falls back to real-time processing with a warning.'
5558
6183
  },
5559
6184
  {
5560
6185
  type: "paragraph",
@@ -5583,11 +6208,22 @@ var sections15 = [
5583
6208
  },
5584
6209
  {
5585
6210
  type: "paragraph",
5586
- text: "While waiting for batch results, documents show a status of `batch_queued`. Once the provider returns results, the platform applies them through the same post-processing pipeline as real-time extraction \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation."
6211
+ text: "While waiting for batch results, documents show a status of `batch_queued` in your library. Once the provider returns results, the platform applies them through the same post-processing pipeline as real-time extraction \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation. If a batch extraction fails to parse, the affected document is retried through the real-time extraction path rather than as a new batch, ensuring the original 48-hour SLA is maintained."
5587
6212
  },
5588
6213
  {
5589
6214
  type: "paragraph",
5590
6215
  text: "You can also enable batch mode on a per-source basis. When a source connection has the batch processing toggle enabled, all documents ingested through that source are automatically routed to the batch queue. This is ideal for source connections that handle non-urgent, high-volume ingestion \u2014 such as a shared drive that collects documents overnight."
6216
+ },
6217
+ {
6218
+ type: "list",
6219
+ ordered: false,
6220
+ items: [
6221
+ "**Included in batch:** Stage 2 Claude extraction, markdown pre-processing, field parsing, quality metrics computation, extraction metadata, and all post-processing that does not require LLM calls.",
6222
+ "**Excluded from batch:** LLM-based quality passes (field estimation, verification, cross-reference enrichment) are skipped to preserve cost savings.",
6223
+ "**Excluded from batch:** Image-only documents (PNG, JPG) are automatically routed to real-time processing because the batch payload is text-only.",
6224
+ "**Fallback behavior:** Parse failures in batch mode are retried through the real-time extraction path \u2014 never as a new batch \u2014 to maintain the 48-hour SLA.",
6225
+ "**Minimum threshold:** Batches require at least 100 items (a provider requirement). Uploads below this threshold fall back to real-time processing with a warning."
6226
+ ]
5591
6227
  }
5592
6228
  ],
5593
6229
  related: [
@@ -5631,16 +6267,20 @@ var sections15 = [
5631
6267
  content: [
5632
6268
  {
5633
6269
  type: "paragraph",
5634
- text: "The Batches page at `/sources/batches` shows the status of all batch jobs. Each batch progresses through three states: **accumulating** (items collecting), **submitted** (sent to provider), and **completed** (results applied). The page live-syncs with the provider for real-time status updates."
6270
+ text: "The Batches page at `/sources/batches` shows the status of all batch jobs with real-time updates. Each batch progresses through three states: **accumulating** (items collecting in the queue), **submitted** (sent to the provider's batch API), and **completed** (results received and applied to the corresponding documents). The page live-syncs with the provider so you can monitor progress without manual refreshing. Click any batch to see the detail view with individual items, their processing state, and any errors."
5635
6271
  },
5636
6272
  {
5637
6273
  type: "paragraph",
5638
- text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents and the batch transitions to **completed** status."
6274
+ text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached, whichever comes first. These intervals are configurable in the pipeline settings. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents \u2014 including field resolution, linking, triage, and delivery events \u2014 and the batch transitions to **completed** status."
5639
6275
  },
5640
6276
  {
5641
6277
  type: "paragraph",
5642
6278
  text: "The batch detail view shows individual items within a batch, including which documents are included, their current processing state, and any errors that occurred. Use this view to verify that a specific document was included in the expected batch and to troubleshoot items that failed to parse."
5643
6279
  },
6280
+ {
6281
+ type: "paragraph",
6282
+ text: "For example, after uploading 500 invoices in batch mode, navigate to `/sources/batches` to check progress. You will see a batch in **accumulating** status collecting items until the 15-minute timer fires. Once submitted, the status changes to **submitted** and the platform polls the provider hourly. Click the batch row to see each document's individual state \u2014 if 3 items show parse errors, those documents were automatically retried via the real-time path while the remaining 497 completed normally. When the batch transitions to **completed**, all results have been applied and documents are ready for review."
6283
+ },
5644
6284
  {
5645
6285
  type: "paragraph",
5646
6286
  text: "The platform includes built-in crash recovery for batch processing. If the application restarts while a batch is in a transient `processing` state, the recovery logic automatically reverts it to `submitted` so the next polling cycle can retry. This means batch jobs are resilient to infrastructure disruptions without requiring manual intervention."
@@ -5718,11 +6358,11 @@ var sections16 = [
5718
6358
  content: [
5719
6359
  {
5720
6360
  type: "paragraph",
5721
- text: "Upload CSV or Excel files as lookup tables. These reference datasets are used by the matching engine and by reference strategies in schemas. Each reference dataset is versioned and can be shared across multiple schemas."
6361
+ text: 'Upload CSV or Excel files as lookup tables for the matching engine and schema reference strategies. These reference datasets represent your "ground truth" \u2014 the known records you want to match extracted document data against. Each reference dataset is versioned independently and can be shared across multiple schemas and matching configurations without duplication.'
5722
6362
  },
5723
6363
  {
5724
6364
  type: "paragraph",
5725
- text: 'Reference data is the foundation of the matching system. It represents your "ground truth" \u2014 the known records you want to match extracted document data against. Common examples include customer lists, product catalogs, vendor registries, and contract databases.'
6365
+ text: "Reference data is the foundation of the matching system. Common examples include customer lists, product catalogs, vendor registries, contract databases, and supplier directories. When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. You can also import reference data directly from a SQL database connection using `POST /matching/reference-data/from-sql`, which streams rows asynchronously in batches of 500 from your connected MSSQL or PostgreSQL database."
5726
6366
  },
5727
6367
  {
5728
6368
  type: "paragraph",
@@ -5760,7 +6400,7 @@ var sections16 = [
5760
6400
  },
5761
6401
  {
5762
6402
  question: "How is reference data used?",
5763
- answer: "Reference datasets are used by the matching engine for field-to-field comparisons and by reference strategies in schemas for code mapping and value resolution."
6403
+ answer: "Reference datasets serve two purposes. First, the matching engine uses them for field-to-field comparisons \u2014 comparing extracted document values against reference rows using weighted strategies (exact, fuzzy, date_range, numeric_range). Second, reference strategies in schemas use them for code mapping and value resolution, translating labels found in documents into canonical codes defined in the reference dataset."
5764
6404
  },
5765
6405
  {
5766
6406
  question: "Can I import reference data from a database?",
@@ -5789,7 +6429,7 @@ var sections16 = [
5789
6429
  content: [
5790
6430
  {
5791
6431
  type: "paragraph",
5792
- text: "Define field-to-field comparisons between extracted data and reference datasets. Each comparison uses a weighted strategy to score matches:"
6432
+ text: "Define field-to-field comparisons between extracted data and reference datasets. Each comparison uses a weighted strategy to score matches. A matching configuration specifies which fields to compare, which strategy to use for each comparison, and the relative weight that determines how much each field contributes to the overall confidence score:"
5793
6433
  },
5794
6434
  {
5795
6435
  type: "param-table",
@@ -5833,6 +6473,16 @@ var sections16 = [
5833
6473
  type: "callout",
5834
6474
  variant: "info",
5835
6475
  text: "Use **AI strategy generation** when setting up matching for the first time. The platform analyzes your schema fields and reference data columns, then suggests which fields to compare and which strategy to use for each. You can review and adjust the suggestions before saving."
6476
+ },
6477
+ {
6478
+ type: "list",
6479
+ ordered: false,
6480
+ items: [
6481
+ "**exact** \u2014 case-insensitive string comparison. Best for unique identifiers like PO numbers, invoice IDs, and reference codes where values should match character for character apart from letter case.",
6482
+ "**fuzzy** \u2014 token-based similarity with a configurable threshold (0-100%). Handles misspellings, abbreviations, and word reordering. Ideal for company names, addresses, and descriptions.",
6483
+ "**date_range** \u2014 matches dates within a configurable tolerance window (e.g., +/- 7 days). Useful when documents report dates with slight offsets, such as invoice date vs. received date.",
6484
+ "**numeric_range** \u2014 matches numbers within a percentage or absolute tolerance. Handles rounding differences in amounts, quantities, and prices across systems."
6485
+ ]
5836
6486
  }
5837
6487
  ],
5838
6488
  related: [
@@ -5877,15 +6527,15 @@ var sections16 = [
5877
6527
  content: [
5878
6528
  {
5879
6529
  type: "paragraph",
5880
- text: "Execute a matching run against a reference dataset. Matching runs are processed asynchronously via BullMQ. You can monitor progress from the matching page and cancel running jobs if needed."
6530
+ text: "Execute a matching run against a reference dataset. You can monitor progress from the matching page and cancel running jobs if needed \u2014 partial results from documents already processed are preserved."
5881
6531
  },
5882
6532
  {
5883
6533
  type: "paragraph",
5884
- text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based search with a Haiku LLM resolver attempts to improve low-confidence results."
6534
+ text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based similarity search identifies promising candidates for each low-confidence document, and a Haiku LLM resolver evaluates each candidate in context to improve match quality."
5885
6535
  },
5886
6536
  {
5887
6537
  type: "paragraph",
5888
- text: "Matching runs are processed asynchronously via a dedicated job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining."
6538
+ text: "Matching runs are processed asynchronously via a dedicated BullMQ job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining. You can also trigger an AI resolution pass on a completed run using the `POST /matching/runs/:id/ai-resolve` endpoint to upgrade specific low-confidence results without re-running the entire job."
5889
6539
  },
5890
6540
  {
5891
6541
  type: "paragraph",
@@ -5942,7 +6592,7 @@ var sections16 = [
5942
6592
  content: [
5943
6593
  {
5944
6594
  type: "paragraph",
5945
- text: "Results are presented per document with the top 5 match candidates. Each candidate includes a confidence score and field-level evidence showing which comparisons contributed to the match and how each field scored."
6595
+ text: "Results are presented per document with the top 5 match candidates ranked by weighted confidence score. Each candidate includes a confidence score (0-100%) and field-level evidence showing which comparisons contributed to the match, the strategy used for each field (exact, fuzzy, date_range, or numeric_range), the individual field score, and the actual values that were compared on both sides. This transparency makes it straightforward to verify correct matches and investigate false positives."
5946
6596
  },
5947
6597
  {
5948
6598
  type: "paragraph",
@@ -5981,6 +6631,10 @@ var sections16 = [
5981
6631
  type: "callout",
5982
6632
  variant: "info",
5983
6633
  text: "You can **approve or reject** individual match results. Approved matches can be used downstream in delivery pipelines. Rejected matches are excluded from future consideration for that document."
6634
+ },
6635
+ {
6636
+ type: "paragraph",
6637
+ text: 'Consider a practical example: you receive an invoice from "Acme Corp" with a total of $12,450 dated 2025-03-15. The matching engine returns the top candidate as "ACME Corporation" in your reference data with a confidence score of 87%. The evidence view shows the vendor name scored 92% via fuzzy match (handling "Corp" vs "Corporation"), the amount scored 100% via exact match, and the date scored 78% via date_range because the reference shows a PO date of 2025-03-10 \u2014 within the 7-day tolerance. You can quickly verify the match is correct and approve it, sending the linked record downstream.'
5984
6638
  }
5985
6639
  ],
5986
6640
  related: [