@talonic/docs 0.20.9 → 0.20.11
- package/dist/content.js +1167 -46
- package/package.json +1 -1
package/dist/content.js
CHANGED
@@ -542,6 +542,14 @@ var sections = [
 }
 ]
 },
+{
+type: "paragraph",
+text: 'Understanding the relationship between these concepts is key to getting the most from the platform. When you upload documents, the extraction pipeline discovers every data point and feeds them into the **Field Registry**. The registry uses AI embeddings to cluster semantically similar fields \u2014 so "Vendor Name", "Supplier Name", and "Company Name" are recognized as the same concept. Over time, frequently occurring fields are promoted to higher tiers, and the platform synthesizes master extraction instructions that encode the best way to extract each field.'
+},
+{
+type: "paragraph",
+text: "The **Schema** layer sits on top of the registry and defines what output you need. You can use auto-generated schemas that the platform creates for each document type, or build custom template schemas by selecting specific fields from the registry. When a schema is applied to documents in a **Job**, the 4-phase pipeline fills every cell \u2014 starting with free graph lookups and falling back to AI agents for the remainder. The result is a structured grid where each row is a document and each column is a field."
+},
 {
 type: "callout",
 variant: "info",
@@ -617,6 +625,14 @@ var sections = [
 type: "paragraph",
 text: "The pipeline is designed to be **progressive** \u2014 results appear as each phase completes rather than waiting for the entire job to finish. Phase 1 (graph resolve) fills ~30% of cells instantly and for free. Phase 2 (AI extraction) fills the remaining gaps. Phases 3 and 4 handle re-resolution and transformation. You can start reviewing early results while later phases are still running."
 },
+{
+type: "paragraph",
+text: "Use the platform flow as a mental model when planning your workflow. For small, ad-hoc extractions you can go from upload to results in minutes \u2014 upload a few documents, pick an auto-generated schema, and run a job. For production workloads, invest time in the **Define schema** step: map fields to the registry, add reference tables for code lookups, and set format constraints. The upfront effort pays off because every subsequent job reuses the same schema and benefits from the growing knowledge graph."
+},
+{
+type: "paragraph",
+text: "After results are delivered, the feedback loop closes automatically. Corrections you make during the **Review** stage feed back into the Field Registry, improving future extractions. The platform tracks telemetry across runs \u2014 strategy distribution, capture hit rate, and resolve rate \u2014 so you can monitor how extraction quality improves over time as the knowledge graph accumulates more data."
+},
 {
 type: "callout",
 variant: "info",
@@ -679,6 +695,14 @@ var sections = [
 title: "Sidebar Navigation",
 caption: "The sidebar provides access to all sections. Click the collapse button to save space. Press Cmd+K for global search."
 },
+{
+type: "paragraph",
+text: "For teams processing documents at scale, the recommended approach is to start with a small representative sample. Upload 5-10 documents of the same type, let the platform extract and classify them, then review the auto-generated schema. This lets you validate the output structure before committing to a large batch. Once the schema looks right, you can upload hundreds or thousands of documents and the knowledge graph will handle an increasing share of cells through instant graph matches."
+},
+{
+type: "paragraph",
+text: "The platform includes powerful keyboard shortcuts for fast navigation. Press `Cmd+K` (or `Ctrl+K` on Windows) to open **Omnisearch**, which lets you find documents, schemas, jobs, and fields from anywhere. Press `Cmd+I` to open the **AI Agent** for natural language queries about your workspace. The sidebar can be collapsed to give more screen real estate when reviewing extraction results."
+},
 {
 type: "callout",
 text: "The fastest path to results: upload documents in **Sources**, then go to **Structuring → Runs → New** to create your first extraction job."
@@ -735,6 +759,18 @@ var sections2 = [
 type: "paragraph",
 text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language."
 },
+{
+type: "paragraph",
+text: "The agent is context-aware, meaning it automatically knows which page you are on and what data is visible. If you open the agent from a document detail page, it already has that document in scope and can answer questions about its extracted fields, processing status, or classification without you needing to specify which document you mean."
+},
+{
+type: "paragraph",
+text: "The agent classifies every user message as either a **question** (answered with information) or a **command** (triggers an action). Questions are handled instantly with read-only access, while commands go through the impact-level system to ensure safety. The agent streams its responses in real time, so you can see reasoning unfold as it queries your workspace data."
+},
+{
+type: "paragraph",
+text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page."
+},
 { type: "heading", level: 3, id: "agent-capabilities", text: "What the Agent Can Do" },
 {
 type: "paragraph",
@@ -799,6 +835,14 @@ var sections2 = [
 {
 question: "Can the AI agent modify my data?",
 answer: "The agent operates workshop-first: schema changes create drafts, not live versions. Higher-impact operations require progressively more explicit confirmation."
+},
+{
+question: "Is the AI agent context-aware?",
+answer: "Yes. The agent automatically knows which page you are on and what data is visible. If you open it from a document detail page, it already has that document in scope and can answer questions about its fields, processing status, or classification."
+},
+{
+question: "Can the AI agent access external systems or the internet?",
+answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform."
 }
 ],
 mentions: [
@@ -846,6 +890,18 @@ var sections2 = [
 }
 ]
 },
+{
+type: "paragraph",
+text: "The `read` impact level covers the vast majority of agent interactions. Searching documents, inspecting extraction results, browsing the field registry, and checking job status all execute instantly with no side effects. These read operations give you a fast way to explore your workspace without navigating through multiple pages."
+},
+{
+type: "paragraph",
+text: "The `draft_mutation` level is used when the agent creates or modifies schemas. Because all schema changes go through the workshop system, the agent can freely draft schemas without risk \u2014 nothing goes live until you explicitly review and publish. This makes the agent especially useful for rapid schema prototyping: describe the fields you need in plain language, and the agent creates a draft you can refine."
+},
+{
+type: "paragraph",
+text: 'The `live_mutation` and `irreversible` levels provide escalating safety gates for operations that affect production data. A `live_mutation` \u2014 such as triggering a job run or publishing a schema \u2014 presents a confirmation dialog that you must accept. An `irreversible` action \u2014 such as deleting a source or purging documents \u2014 requires you to type a confirmation keyword (e.g., "DELETE") to proceed, preventing accidental data loss.'
+},
 {
 type: "callout",
 text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready."
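The four impact levels described in the hunk above form an escalating gate. A minimal sketch of that gating logic follows. The names `GATES` and `canProceed` and the shape of the confirmation object are hypothetical and do not come from the package. Treating `draft_mutation` as needing no confirmation is an inference from the text, not a documented rule.

```javascript
// Hypothetical sketch: none of these identifiers come from @talonic/docs.
// Maps each impact level named in the docs to the kind of gate it implies.
const GATES = new Map([
  ["read", "none"],            // executes instantly, no side effects
  ["draft_mutation", "none"],  // assumed: only a draft is created, nothing goes live
  ["live_mutation", "dialog"], // user must accept a confirmation dialog
  ["irreversible", "keyword"], // user must type a confirmation keyword
]);

// Returns true when the action may run given what the user has confirmed.
function canProceed(impactLevel, confirmation = {}, keyword = "DELETE") {
  const { accepted = false, typed = "" } = confirmation;
  const gate = GATES.get(impactLevel);
  if (gate === undefined) {
    throw new Error(`Unknown impact level: ${impactLevel}`);
  }
  if (gate === "none") return true;
  if (gate === "dialog") return accepted === true;
  return typed === keyword; // gate === "keyword": exact, case-sensitive match
}
```

For example, `canProceed("irreversible", { typed: "delete" })` is rejected in this sketch because the keyword comparison is case-sensitive. Whether the real platform is that strict is not stated in the docs.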
@@ -863,6 +919,10 @@ var sections2 = [
 {
 question: "Does the AI agent make changes directly to live data?",
 answer: "No. The agent operates workshop-first. Schema changes create drafts, and live mutations require explicit user confirmation before executing."
+},
+{
+question: "What happens when I ask the agent to delete something?",
+answer: 'Deletion is classified as an irreversible action. The agent will ask you to type a confirmation keyword (e.g., "DELETE") before proceeding. This prevents accidental data loss from casual or ambiguous requests.'
 }
 ],
 mentions: ["impact levels", "draft mutation", "live mutation", "workshop-first"]
@@ -877,6 +937,18 @@ var sections2 = [
 {
 type: "paragraph",
 text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction. The agent input field lets you type any question directly from the dashboard."
+},
+{
+type: "paragraph",
+text: "The dashboard provides a workspace-level overview that helps you understand the health of your data pipeline at a glance. You can see document processing statistics, recent activity across sources, and the current state of your field registry. Key metrics like **capture rate**, **resolve rate**, and **synthesize rate** from the telemetry system are surfaced so you can spot trends without drilling into individual jobs."
+},
+{
+type: "paragraph",
+text: "Suggested prompts are dynamically generated based on what the platform detects in your workspace. If you have new document types that lack schemas, the dashboard suggests creating one. If a job run recently completed, it suggests reviewing the results. If field registry confirmations are pending, it prompts you to review them. This makes the dashboard a natural starting point for your workflow each session."
+},
+{
+type: "paragraph",
+text: "Every conversation with the agent is preserved in your session history, accessible from the dashboard. You can revisit previous questions and their answers, which is useful for auditing decisions or recalling how you configured a particular schema. The conversation history also provides continuity \u2014 if you asked the agent to analyze extraction quality last week, you can pick up where you left off."
 }
 ],
 related: [
@@ -891,6 +963,10 @@ var sections2 = [
 {
 question: "Do the suggested prompts change based on workspace state?",
 answer: "Yes. Prompts adapt dynamically based on active runs, schema creation opportunities, document types waiting for extraction, and other workspace activity."
+},
+{
+question: "Can I revisit previous conversations with the agent?",
+answer: "Yes. Every conversation is preserved in your session history, accessible from the dashboard. You can revisit previous questions, recall how you configured a schema, or pick up where you left off in a previous analysis."
 }
 ],
 mentions: ["dashboard", "suggested prompts", "workspace state", "agent input"]
@@ -923,6 +999,10 @@ var sections3 = [
 {
 type: "paragraph",
 text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. Processing runs asynchronously so you can continue working."
+},
+{
+type: "paragraph",
+text: "When uploading folders or ZIP archives, the original directory structure is preserved as a `source_file_path` metadata field on each document (e.g., `contracts/2026/lease.pdf`). This field is available for filtering, export, and schema mapping \u2014 just like any AI-extracted field. It provides a natural way to organize and trace documents back to their original location in your file system."
 }
 ],
 related: [
@@ -938,6 +1018,10 @@ var sections3 = [
 {
 question: "Does Talonic detect duplicate uploads?",
 answer: "Yes. Files are deduplicated via SHA-256 hashing. Uploading the same file twice will not create duplicates."
+},
+{
+question: "What happens when I upload a folder or ZIP archive?",
+answer: "ZIP archives are unpacked recursively and each file is processed individually. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering and export."
 }
 ],
 mentions: [
@@ -956,6 +1040,10 @@ var sections3 = [
 seoTitle: "Supported File Formats \u2014 Talonic Docs",
 description: "Talonic supports 25+ file types across four processing paths: text fast-path, AI vision, OCR, and recursive archive unpacking. From PDF to XLSX to images.",
 content: [
+{
+type: "paragraph",
+text: "Talonic supports 25+ file types across four distinct processing paths. Each path is optimized for its file category \u2014 text files are read directly with zero latency, while complex document formats go through OCR to produce high-quality Markdown. The processing path is selected automatically based on the file extension."
+},
 {
 type: "param-table",
 title: "File processing paths",
@@ -981,6 +1069,23 @@ var sections3 = [
 description: "ZIP \u2014 unpacked and each file processed individually."
 }
 ]
+},
+{
+type: "paragraph",
+text: "The **OCR path** uses Mistral Document AI as the primary engine, with a Talonic API fallback if the primary service is unavailable. OCR converts documents to structured Markdown, preserving tables, headings, and layout information. For PDF files that exceed the configured chunk size (default 25 pages), the system automatically splits the document into page chunks, processes them in parallel, and merges the results \u2014 so even large documents are handled efficiently."
+},
+{
+type: "paragraph",
+text: `Image files follow the **AI Vision** path, where they are sent directly to the AI model for multimodal extraction. This means the AI "sees" the image and extracts data visually \u2014 useful for photos of receipts, scanned handwritten notes, or diagrams. If an image was previously OCR'd and produced meaningful Markdown (more than 100 characters), the system uses the Markdown extraction path instead, which enables richer quality metrics.`
+},
+{
+type: "paragraph",
+text: "The **text fast-path** is the most efficient route: files like CSV, JSON, and plain text are read directly into memory with no external API call. This means they process almost instantly and incur no OCR cost. Email files (EML, MSG) are parsed to extract both the message body and any attachments, with each attachment processed as a separate document."
+},
+{
+type: "callout",
+variant: "info",
+text: "The processing path is selected automatically based on the file extension \u2014 you do not need to configure anything. If a file type is not recognized, the platform will attempt OCR as a fallback before marking it as unsupported."
 }
 ],
 related: [
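The OCR paragraph in the hunk above describes splitting large PDFs into page chunks, processing them in parallel, and merging the results. A minimal sketch of that split-and-merge pattern follows. `pageChunks`, `ocrDocument`, and the caller-supplied `ocrChunk` callback are hypothetical names; only the 25-page default comes from the text. How the real service merges chunks is not documented, so joining with blank lines is an assumption.

```javascript
// Hypothetical sketch: illustrates the chunking idea, not Talonic's implementation.

// Split a document of `totalPages` pages into inclusive 1-based page ranges.
function pageChunks(totalPages, chunkSize = 25) {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize - 1, totalPages) });
  }
  return chunks;
}

// Run the caller-supplied async `ocrChunk` on every range in parallel.
// Promise.all preserves input order, so the merged Markdown stays in page order.
async function ocrDocument(totalPages, ocrChunk, chunkSize = 25) {
  const ranges = pageChunks(totalPages, chunkSize);
  const parts = await Promise.all(ranges.map((range) => ocrChunk(range)));
  return parts.join("\n\n"); // assumed merge strategy
}
```

With the default chunk size, a 60-page PDF yields the ranges 1-25, 26-50, and 51-60, while a document of 25 pages or fewer stays in a single chunk.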
@@ -995,6 +1100,10 @@ var sections3 = [
 {
 question: "How does Talonic handle image files?",
 answer: "Image files (PNG, JPG, JPEG, GIF, WEBP) are sent to AI for multimodal visual extraction."
+},
+{
+question: "How does Talonic handle large PDF files?",
+answer: "PDF files that exceed the configured chunk size (default 25 pages) are automatically split into page chunks, processed in parallel, and merged. This ensures even large documents are handled efficiently without timeouts."
 }
 ],
 mentions: ["OCR", "AI vision", "text fast-path", "file formats", "PDF", "DOCX", "ZIP"]
@@ -1061,6 +1170,10 @@ var sections3 = [
 {
 question: "When is a document ready to use in jobs?",
 answer: "Documents are marked complete after AI extraction finishes. You can start using them in jobs immediately without waiting for further processing."
+},
+{
+question: "What happens if OCR or extraction fails on a document?",
+answer: "The platform automatically retries failed extractions (configurable, default 1 retry). If all retries fail, the document is marked as extraction_failed with a terminal status. OCR failures follow a separate retry path with fallback from Document AI to Talonic API to local parsers."
 }
 ],
 mentions: [
@@ -1087,6 +1200,19 @@ var sections3 = [
 {
 type: "paragraph",
 text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata. Unresolvable documents are assigned "Unclassified Document".'
+},
+{
+type: "paragraph",
+text: `Classification is verified in a two-step process. First, **Document AI OCR** produces an annotation with a free-text type label during the OCR pass. Then, a **type resolution** step verifies that label against the actual document content. If the label and content disagree \u2014 for example, a German *Arbeitsvertrag* incorrectly labelled as "Service Agreement" \u2014 the system trusts the content and resolves the correct canonical type. This ensures accurate classification regardless of the OCR engine's labelling bias.`
+},
+{
+type: "paragraph",
+text: "Document types drive several downstream features. The platform auto-generates a **schema** for each document type, pre-populated with fields discovered from documents of that type. **Routing rules** can be configured per document type to automatically assign schemas or trigger jobs when new documents arrive. The **Field Registry** tracks which fields appear in which document types, building a cross-type knowledge graph over time."
+},
+{
+type: "callout",
+variant: "info",
+text: "You never need to create document types manually. The ontology is built into the platform and types are assigned automatically during classification. If you disagree with a classification, the AI agent can help you understand why a type was chosen and how the content signals were interpreted."
 }
 ],
 related: [
@@ -1102,6 +1228,10 @@ var sections3 = [
 {
 question: "Does document classification work in non-English languages?",
 answer: "Yes. The classifier works across all languages. For example, a German Arbeitsvertrag and an English Employment Contract map to the same canonical type."
+},
+{
+question: "What happens if a document cannot be classified?",
+answer: 'Unresolvable documents are assigned the "Unclassified Document" type. They can still be processed and extracted \u2014 the platform simply cannot map them to a specific canonical type in the 529-type ontology.'
 }
 ],
 mentions: [
@@ -1147,6 +1277,23 @@ var sections3 = [
 description: "View or download the source document."
 }
 ]
+},
+{
+type: "paragraph",
+text: "The **Raw Extraction** tab is the most detailed view, showing every field the AI discovered along with its confidence score and the source text that the value was extracted from. Each field displays a tier badge (Tier 1 green, Tier 2 amber, Tier 3 gray) indicating how well-established that field is across your document corpus. Synthetic metadata fields like `filename` and `source_file_path` appear here too, with full confidence (1.0)."
+},
+{
+type: "paragraph",
+text: "The **Resolved Data** tab shows how raw extracted fields map to your canonical field registry. Fields that matched automatically (similarity >= 0.80) display their canonical name and cluster. Fields in the confirm band (0.50-0.79) are flagged for review. This view helps you understand how the platform is normalizing field names across different document types and formats."
+},
+{
+type: "paragraph",
+text: "The **Processing Log** tab provides a stage-by-stage timeline of how the document was processed, including per-stage timing. You can see exactly how long OCR, classification, and extraction took, which is useful for diagnosing slow processing or understanding why a document was classified a particular way. The **Original File** tab lets you view or download the source file, so you can always compare the AI's extraction against the original document."
+},
+{
+type: "callout",
+variant: "info",
+text: "You can open the **AI Agent** (`Cmd+I`) from any document detail page. The agent automatically has the current document in scope and can answer questions about its fields, classification, or processing status without you needing to specify which document you mean."
 }
 ],
 related: [
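The Resolved Data paragraph in the hunk above gives two similarity thresholds: 0.80 and above matches automatically, and 0.50 to 0.79 is flagged for review. A minimal sketch of that banding follows. The function name `resolutionBand` and the band labels are hypothetical. Treating scores below 0.50 as unmatched is an assumption, since the paragraph does not say what happens there.

```javascript
// Hypothetical sketch: the thresholds come from the docs text, the names do not.
function resolutionBand(similarity, thresholds = {}) {
  const { auto = 0.8, confirm = 0.5 } = thresholds;
  if (similarity >= auto) return "auto_match";      // shown with canonical name and cluster
  if (similarity >= confirm) return "needs_review"; // the "confirm band"
  return "no_match";                                // assumed: below the confirm band
}
```

The docs write the confirm band as 0.50-0.79. This sketch treats it as the half-open interval from 0.50 up to, but not including, 0.80, so a score such as 0.795 still lands in review rather than falling between bands.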
@@ -1162,6 +1309,10 @@ var sections3 = [
 {
 question: "How can I see the confidence score of an extracted field?",
 answer: "Open the document detail page and navigate to the Raw Extraction tab. Each field displays its confidence score alongside the extracted value and source text."
+},
+{
+question: "What do the tier badges on fields mean?",
+answer: "Tier badges indicate how well-established a field is across your document corpus. Tier 1 (green) are universal core fields, Tier 2 (amber) are established promoted fields, and Tier 3 (gray) are newly discovered emerging fields."
 }
 ],
 mentions: [
@@ -1182,6 +1333,23 @@ var sections3 = [
 {
 type: "paragraph",
 text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents → Routing**."
+},
+{
+type: "paragraph",
+text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides."
+},
+{
+type: "paragraph",
+text: 'Routing rules are especially useful for high-volume ingestion pipelines. If you connect a Google Drive folder that receives hundreds of invoices per week, a routing rule can automatically assign your "Invoice" schema and trigger extraction \u2014 turning what would be manual work into a fully automated pipeline. Combined with **delivery bindings**, this creates an end-to-end flow from document upload to structured output with zero manual intervention.'
+},
+{
+type: "paragraph",
+text: "You can review rule execution history from the routing page to see which rules fired, which documents they matched, and what actions were taken. This audit trail helps you verify that your routing configuration is working as expected and diagnose cases where documents were not routed correctly."
+},
+{
+type: "callout",
+variant: "info",
+text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules."
 }
 ],
 related: [
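The routing paragraphs in the hunk above say rules are evaluated in priority order, so that general rules can be layered with specific overrides. The docs do not spell out the exact semantics, so the sketch below rests on three labeled assumptions: a lower `priority` number is evaluated first, the first matching rule wins, and a `"*"` document type acts as a catch-all. The rule shape and the name `selectRule` are hypothetical.

```javascript
// Hypothetical sketch of priority-ordered rule selection.
// Assumptions (not from the docs): lower priority number = evaluated first,
// first match wins, and documentType "*" matches any type.
function selectRule(rules, documentType) {
  const ordered = [...rules].sort((a, b) => a.priority - b.priority); // copy, then sort
  const match = ordered.find(
    (rule) => rule.documentType === documentType || rule.documentType === "*"
  );
  return match ?? null; // null when no rule applies
}

const exampleRules = [
  { name: "catch-all", documentType: "*", priority: 100, actions: ["tag:triage"] },
  { name: "invoices", documentType: "Invoice", priority: 10, actions: ["assign-schema", "create-job"] },
];
```

Under these assumptions, `selectRule(exampleRules, "Invoice")` picks the specific `invoices` rule even though the catch-all is listed first, and any other document type falls through to `catch-all`.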
@@ -1197,6 +1365,10 @@ var sections3 = [
 {
 question: "Where do I manage routing rules?",
 answer: "Navigate to Documents > Routing to create and manage routing rules for your workspace."
+},
+{
+question: "Can routing rules fully automate my document processing pipeline?",
+answer: "Yes. By combining routing rules with source connectors and delivery bindings, you can create a fully automated pipeline: documents arrive from a connected source, routing rules assign schemas and trigger extraction jobs, and delivery bindings push approved results to downstream systems."
 }
 ],
 mentions: ["routing rules", "auto-assign", "schema assignment", "document workflows"]
@@ -1272,6 +1444,14 @@ var sections3 = [
 type: "paragraph",
 text: "Google and Microsoft connectors share a single OAuth client each. OAuth tokens are encrypted at rest using `aes-256-gcm`. Each source card includes a **Batch Processing** toggle to defer extraction at 50% cost."
 },
+{
+type: "paragraph",
+text: "OAuth-based connectors (Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion) use a consent-based flow where you authorize Talonic to access specific resources. For Microsoft connectors, Teams requires extended scopes that need tenant-admin consent. If a connector's OAuth credentials are revoked or expire, the source enters a disconnected state \u2014 reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents."
+},
+{
+type: "paragraph",
+text: "Credential-based connectors (SQL, Amazon S3, Azure Blob) authenticate with access keys or connection strings rather than OAuth. SQL connections support PostgreSQL, MySQL, and MSSQL, with a built-in read-only safety layer that prevents accidental writes. S3-compatible storage like MinIO and Cloudflare R2 also works through the S3 connector. All credentials are encrypted at rest before being stored."
+},
 {
 type: "callout",
 text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled."
@@ -1290,6 +1470,10 @@ var sections3 = [
 {
 question: "How are OAuth tokens stored?",
 answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET)."
+},
+{
+question: "What happens if a connector loses its credentials or authorization?",
+answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration."
 }
 ],
 mentions: [
@@ -1331,6 +1515,18 @@ var sections4 = [
 id: "field-registry-table",
 title: "Field Registry \u2014 Registry Table",
 caption: "Fields are organized by tier with occurrence counts, data types, and master instruction status."
+},
+{
+type: "paragraph",
+text: "The registry grows automatically as documents are processed. During extraction, AI discovers fields from each document and resolves them against existing registry entries using **three-band matching** (exact name match, cluster member match, then semantic embedding similarity). New fields that don't match anything create a Tier 3 entry. Frequently occurring fields are promoted to higher tiers, so the registry naturally converges on a stable set of canonical fields over time."
+},
+{
+type: "paragraph",
+text: "Each registry entry tracks its **occurrence count** (how many documents contain this field), **data type** (string, number, date, etc.), **synonyms** (alternate names discovered across documents), and **master instruction** (an AI-synthesized extraction directive). The registry also maintains two embedding vectors per field: one for resolution matching and one for graph visualization, ensuring that each concern uses the most appropriate representation."
+},
+{
+type: "paragraph",
+text: "The registry is the foundation for several downstream features. **Jobs** use registry fields to pre-fill schema values via lookup cascades before resorting to LLM extraction. **Semantic clusters** group related registry fields together. **Generated schemas** are auto-built from registry fields that appear in a given document type. Understanding the registry is key to understanding how Talonic reduces extraction cost and improves accuracy over time."
 }
 ],
 related: [
@@ -1346,6 +1542,10 @@ var sections4 = [
 {
 question: "How does the Field Registry grow?",
 answer: "As documents are processed, AI discovers new fields and resolves them against existing registry entries. New fields create Tier 3 entries; frequently occurring fields are promoted to higher tiers."
+},
+{
+question: "How does the Field Registry reduce extraction cost?",
+answer: "The registry enables lookup-based resolution during job runs. When a field already exists in the registry with sufficient data, its value can be resolved via graph lookup instead of an AI call. Approximately 30% of cells are filled this way \u2014 instantly and at no cost."
 }
 ],
 mentions: [
@@ -1387,6 +1587,18 @@ var sections4 = [
|
|
|
1387
1587
|
}
|
|
1388
1588
|
]
|
|
1389
1589
|
},
|
|
1590
|
+
{
|
|
1591
|
+
type: "paragraph",
|
|
1592
|
+
text: "**Tier 1** fields are the most reliable and cost-efficient. During job runs, Tier 1 fields can often be resolved via lookup tables or registry transfer without any AI call, meaning they cost nothing to extract. These are fields like `invoice_number`, `date`, or `total_amount` that appear universally across document types and have well-established extraction patterns."
|
|
1593
|
+
},
|
|
1594
|
+
{
|
|
1595
|
+
type: "paragraph",
|
|
1596
|
+
text: "**Tier 2** fields are promoted from Tier 3 after meeting frequency thresholds \u2014 specifically, 5 occurrences or a 10% occurrence rate across your documents. Once promoted, these fields gain a synthesized master instruction and become candidates for lookup-based resolution. Promotion is evaluated automatically after every batch resolution run, so fields graduate without manual intervention as your document corpus grows."
|
|
1597
|
+
},
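As a rough sketch of the promotion rule above (5 occurrences or a 10% occurrence rate), the check could look like this. The field shape and the `shouldPromoteToTier2` name are assumptions made for illustration; only the two thresholds come from the text.

```javascript
// Thresholds taken from the paragraph above.
const MIN_OCCURRENCES = 5;
const MIN_OCCURRENCE_RATE = 0.1;

// `field` is a hypothetical registry entry: { tier, occurrenceCount }.
// `totalDocuments` is the number of processed documents in the corpus.
function shouldPromoteToTier2(field, totalDocuments) {
  if (field.tier !== 3) return false; // only Tier 3 fields are candidates
  const rate = totalDocuments > 0 ? field.occurrenceCount / totalDocuments : 0;
  return field.occurrenceCount >= MIN_OCCURRENCES || rate >= MIN_OCCURRENCE_RATE;
}
```

Because the two conditions are joined with "or", a field in a small corpus can be promoted on rate alone, while a field in a large corpus can be promoted on absolute count alone.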
|
|
1598
|
+
{
|
|
1599
|
+
type: "paragraph",
|
|
1600
|
+
text: "**Tier 3** fields are newly discovered and may require a full Claude API call to extract during job runs, making them the most expensive tier. As more documents are processed and a Tier 3 field appears consistently, it is automatically promoted. You can also manually adjust a field's tier from the registry detail page if you know a field is stable enough to promote early."
|
|
1601
|
+
},
|
|
1390
1602
|
{
|
|
1391
1603
|
type: "callout",
|
|
1392
1604
|
text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray."
|
|
@@ -1404,6 +1616,10 @@ var sections4 = [
|
|
|
1404
1616
|
{
|
|
1405
1617
|
question: "How are fields promoted between tiers?",
|
|
1406
1618
|
answer: "Fields are promoted automatically based on frequency thresholds. As more documents are processed and a field appears consistently, it moves from Tier 3 to Tier 2 and eventually to Tier 1."
|
|
1619
|
+
},
|
|
1620
|
+
{
|
|
1621
|
+
question: "Can I manually change a field's tier?",
|
|
1622
|
+
answer: "Yes. You can manually adjust a field's tier from the registry detail page. This is useful when you know a field is stable enough to promote early, or when you want to demote a field that was promoted prematurely."
|
|
1407
1623
|
}
|
|
1408
1624
|
],
|
|
1409
1625
|
mentions: ["tier system", "Tier 1", "Tier 2", "Tier 3", "field promotion", "quality signal"]
|
|
@@ -1418,6 +1634,23 @@ var sections4 = [
|
|
|
1418
1634
|
{
|
|
1419
1635
|
type: "paragraph",
|
|
1420
1636
|
text: 'Fields with similar meanings are automatically grouped using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together. You can manually merge or split clusters from the Field Map view.'
|
|
1637
|
+
},
|
|
1638
|
+
{
|
|
1639
|
+
type: "paragraph",
|
|
1640
|
+
text: "Clustering uses the same three-band similarity model as field resolution. Fields with similarity >= 0.80 are automatically grouped into the same cluster. Fields in the 0.50-0.79 range are flagged as potential cluster candidates for manual confirmation. Fields below 0.50 similarity are kept separate. This graduated approach prevents false merges while still surfacing useful grouping suggestions."
|
|
1641
|
+
},
|
|
1642
|
+
{
|
|
1643
|
+
type: "paragraph",
|
|
1644
|
+
text: 'From the **Field Map** view, you can manually **merge** two clusters when you know they represent the same concept (e.g., merging a "Ship To Address" cluster with a "Delivery Address" cluster). You can also **split** a field out of a cluster if it was incorrectly grouped. These manual adjustments are permanent and improve the resolution model for all future documents \u2014 the system learns from your corrections.'
|
|
1645
|
+
},
|
|
1646
|
+
{
|
|
1647
|
+
type: "paragraph",
|
|
1648
|
+
text: 'Semantic clusters serve a practical purpose beyond organization. When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has a field called "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call. This is one of the key mechanisms that reduces extraction cost as your registry matures.'
|
|
1649
|
+
},
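The value transfer described above can be illustrated with a small sketch. The `clusterOf` index and the `transferValues` helper are hypothetical, and the real resolution engine presumably does more (confidence scores, conflict handling); this only shows how a shared cluster lets "Supplier Name" satisfy a schema field called "Vendor Name".

```javascript
// Hypothetical cluster index: every known field name maps to a cluster id.
const clusterOf = new Map([
  ["vendor name", "c-party"],
  ["supplier name", "c-party"],
  ["company name", "c-party"],
  ["invoice date", "c-date"],
]);

const key = (name) => name.trim().toLowerCase();

// schemaFields: array of field names the schema expects.
// documentFields: object of { fieldName: value } discovered in one document.
function transferValues(schemaFields, documentFields) {
  // Index the document's values by cluster id (first value per cluster wins).
  const byCluster = new Map();
  for (const [name, value] of Object.entries(documentFields)) {
    const cluster = clusterOf.get(key(name));
    if (cluster !== undefined && !byCluster.has(cluster)) byCluster.set(cluster, value);
  }

  const resolved = {};
  const unresolved = []; // these would fall through to later pipeline phases
  for (const field of schemaFields) {
    const cluster = clusterOf.get(key(field));
    if (cluster !== undefined && byCluster.has(cluster)) {
      resolved[field] = byCluster.get(cluster);
    } else {
      unresolved.push(field);
    }
  }
  return { resolved, unresolved };
}
```

The sketch also shows why an incorrect cluster is costly: any field wrongly placed in `c-party` would have its value copied into "Vendor Name" without further checks.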
|
|
1650
|
+
{
|
|
1651
|
+
type: "callout",
|
|
1652
|
+
variant: "info",
|
|
1653
|
+
text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs."
|
|
1421
1654
|
}
|
|
1422
1655
|
],
|
|
1423
1656
|
related: [
|
|
@@ -1433,6 +1666,10 @@ var sections4 = [
|
|
|
1433
1666
|
{
|
|
1434
1667
|
question: "Can I manually adjust semantic clusters?",
|
|
1435
1668
|
answer: "Yes. You can manually merge or split clusters from the Field Map view in the Field Registry."
|
|
1669
|
+
},
|
|
1670
|
+
{
|
|
1671
|
+
question: "How do semantic clusters reduce extraction cost?",
|
|
1672
|
+
answer: 'When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call.'
|
|
1436
1673
|
}
|
|
1437
1674
|
],
|
|
1438
1675
|
mentions: [
|
|
@@ -1478,6 +1715,10 @@ var sections4 = [
|
|
|
1478
1715
|
type: "paragraph",
|
|
1479
1716
|
text: "Resolution runs concurrently across documents. Each document's fields are resolved in an isolated transaction to prevent lock contention. Occurrence rates are updated after each transaction commits, keeping the registry eventually consistent without blocking concurrent ingestion."
|
|
1480
1717
|
},
|
|
1718
|
+
{
|
|
1719
|
+
type: "paragraph",
|
|
1720
|
+
text: "After resolution completes, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This chain ensures that newly promoted fields immediately appear in auto-generated schemas. The resolution process also feeds into the **job pipeline** \u2014 during Phase 1 of a job run, the system uses a 3-tier lookup cascade (string normalization, token fuzzy matching, then AI fallback) to fill 60-80% of cells without a full LLM call, dramatically reducing cost."
|
|
1721
|
+
},
|
|
1481
1722
|
{
|
|
1482
1723
|
type: "callout",
|
|
1483
1724
|
text: "Pending confirmations from the confirm band appear in **Resolution → Pending Confirmations**. Accept to merge into an existing cluster, or reject to create a new field."
|
|
@@ -1496,6 +1737,10 @@ var sections4 = [
|
|
|
1496
1737
|
{
|
|
1497
1738
|
question: "Where can I review pending field confirmations?",
|
|
1498
1739
|
answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge into an existing cluster, or reject to create a new field."
|
|
1740
|
+
},
|
|
1741
|
+
{
|
|
1742
|
+
question: "What happens after resolution completes?",
|
|
1743
|
+
answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas."
|
|
1499
1744
|
}
|
|
1500
1745
|
],
|
|
1501
1746
|
mentions: [
|
|
@@ -1517,6 +1762,18 @@ var sections4 = [
|
|
|
1517
1762
|
type: "paragraph",
|
|
1518
1763
|
text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs."
|
|
1519
1764
|
},
|
|
1765
|
+
{
|
|
1766
|
+
type: "paragraph",
|
|
1767
|
+
text: 'Master instructions are synthesized by analyzing the extraction patterns across all documents where a field appears. The AI examines how the field was successfully extracted \u2014 including the source text, confidence scores, and document context \u2014 and distills a concise directive that captures the best extraction approach. For example, a master instruction for "invoice_date" might specify: "Look for the date near the invoice number, typically in the header area. Prefer the issue date over due date. Format as ISO 8601."'
|
|
1768
|
+
},
|
|
1769
|
+
{
|
|
1770
|
+
type: "paragraph",
|
|
1771
|
+
text: "Master instructions fire automatically during **Phase 2** of job runs, when the AI agent extracts values for fields that could not be resolved via lookup. The instruction is injected into the AI prompt alongside the document content, giving the model specific guidance for that field. This is why master instructions improve accuracy: they encode domain-specific knowledge that the base model would otherwise lack."
|
|
1772
|
+
},
|
|
1773
|
+
{
|
|
1774
|
+
type: "paragraph",
|
|
1775
|
+
text: `You can view and edit master instructions from the field detail page in the registry. Editing an instruction overrides the AI-synthesized version, which is useful when you have domain expertise the AI hasn't captured. The **"Synthesize All"** button in the Field Registry triggers the full pipeline \u2014 embedding, resolution, and synthesis \u2014 for all qualifying fields in a single operation.`
|
|
1776
|
+
},
|
|
1520
1777
|
{
|
|
1521
1778
|
type: "callout",
|
|
1522
1779
|
text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed → resolve → synthesize.'
|
|
@@ -1535,6 +1792,10 @@ var sections4 = [
|
|
|
1535
1792
|
{
|
|
1536
1793
|
question: "How do I generate master instructions?",
|
|
1537
1794
|
answer: 'Click "Synthesize All" in the Field Registry. This runs the combined pipeline: embed, resolve, and synthesize instructions for all qualifying fields.'
|
|
1795
|
+
},
|
|
1796
|
+
{
|
|
1797
|
+
question: "Can I manually edit a master instruction?",
|
|
1798
|
+
answer: "Yes. You can view and edit master instructions from the field detail page in the registry. Editing overrides the AI-synthesized version, which is useful when you have domain expertise the AI has not captured."
|
|
1538
1799
|
}
|
|
1539
1800
|
],
|
|
1540
1801
|
mentions: [
|
|
@@ -1562,6 +1823,18 @@ var sections5 = [
|
|
|
1562
1823
|
{
|
|
1563
1824
|
type: "paragraph",
|
|
1564
1825
|
text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed."
|
|
1826
|
+
},
|
|
1827
|
+
{
|
|
1828
|
+
type: "paragraph",
|
|
1829
|
+
text: "Behind the scenes, the generation engine scans the **Field Registry** for every field that has been promoted to Tier 1 (core) or Tier 2 (established) within a given document type. It assembles these fields into a schema definition, assigns data types based on observed extraction patterns, and attaches the AI-synthesized **master instruction** for each field. The entire process is automatic \u2014 no manual curation is required."
|
|
1830
|
+
},
|
|
1831
|
+
{
|
|
1832
|
+
type: "paragraph",
|
|
1833
|
+
text: "Generated schemas are most useful as a starting point for understanding what Talonic has discovered about your documents. Review the generated schema for a document type to see which fields the system has identified, then use that knowledge to build a **User Template** containing only the fields you actually need. You can also use the diff view to monitor how your field landscape evolves over time as new documents are processed and new fields are promoted."
|
|
1834
|
+
},
|
|
1835
|
+
{
|
|
1836
|
+
type: "callout",
|
|
1837
|
+
text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry."
|
|
1565
1838
|
}
|
|
1566
1839
|
],
|
|
1567
1840
|
related: [
|
|
@@ -1577,6 +1850,10 @@ var sections5 = [
|
|
|
1577
1850
|
{
|
|
1578
1851
|
question: "How are generated schemas updated?",
|
|
1579
1852
|
answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed."
|
|
1853
|
+
},
|
|
1854
|
+
{
|
|
1855
|
+
question: "Can I run an extraction job using a generated schema?",
|
|
1856
|
+
answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version."
|
|
1580
1857
|
}
|
|
1581
1858
|
],
|
|
1582
1859
|
mentions: ["generated schemas", "AI-generated", "versioning", "schema diff"]
|
|
@@ -1602,6 +1879,22 @@ var sections5 = [
|
|
|
1602
1879
|
"**Add reference tables** \u2014 For code fields (e.g., country name → ISO code), upload key-value pairs.",
|
|
1603
1880
|
"**Publish** \u2014 Create an immutable version snapshot ready for job execution."
|
|
1604
1881
|
]
|
|
1882
|
+
},
|
|
1883
|
+
{
|
|
1884
|
+
type: "paragraph",
|
|
1885
|
+
text: "Most teams start by importing an existing spreadsheet or CSV as a template baseline, then refine field types and add extraction instructions. Once you publish a version, it becomes immutable and available for job execution \u2014 any further changes happen in a new **Workshop** draft, keeping your production schema stable while you iterate."
|
|
1886
|
+
},
|
|
1887
|
+
{
|
|
1888
|
+
type: "paragraph",
|
|
1889
|
+
text: "When adding fields, take advantage of the automatic registry matching system. Fields with names that match existing registry entries are linked instantly, inheriting the AI-synthesized extraction instruction. For fields that do not match, write a clear **manual instruction** describing exactly what the AI should extract from the document. Well-written instructions are the single biggest lever for extraction accuracy."
|
|
1890
|
+
},
|
|
1891
|
+
{
|
|
1892
|
+
type: "paragraph",
|
|
1893
|
+
text: "For best results, keep templates focused on a single document type or closely related group of types. A template with 10-20 well-defined fields will produce higher accuracy than one with 50+ fields spanning unrelated domains. If you need different field sets for different document types, create separate templates and run targeted jobs for each."
|
|
1894
|
+
},
|
|
1895
|
+
{
|
|
1896
|
+
type: "callout",
|
|
1897
|
+
text: "You can import templates from Excel, CSV, or JSON files using the **Import from file** option. Column headers become field names, and data types are inferred automatically. This is the fastest way to bootstrap a template from an existing spreadsheet."
|
|
1605
1898
|
}
|
|
1606
1899
|
],
|
|
1607
1900
|
related: [
|
|
@@ -1617,6 +1910,10 @@ var sections5 = [
|
|
|
1617
1910
|
{
|
|
1618
1911
|
question: "What is the difference between generated schemas and user templates?",
|
|
1619
1912
|
answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields. User templates are custom-defined output structures where you choose exactly which fields to include and how to map them."
|
|
1913
|
+
},
|
|
1914
|
+
{
|
|
1915
|
+
question: "Can I update a published template?",
|
|
1916
|
+
answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing."
|
|
1620
1917
|
}
|
|
1621
1918
|
],
|
|
1622
1919
|
mentions: ["user templates", "schema creation", "field mapping", "reference tables", "publish"]
|
|
@@ -1678,6 +1975,18 @@ var sections5 = [
|
|
|
1678
1975
|
}
|
|
1679
1976
|
]
|
|
1680
1977
|
},
|
|
1978
|
+
{
|
|
1979
|
+
type: "paragraph",
|
|
1980
|
+
text: "When configuring a field, start with the basics \u2014 name, type, and registry mapping \u2014 then layer on advanced features as needed. For example, add a **format constraint** to enforce a date pattern, attach a **reference table** for code lookups, or define **capture submoves** to control the exact extraction sequence. Features compose independently, so you can mix and match without conflicts."
|
|
1981
|
+
},
|
|
1982
|
+
{
|
|
1983
|
+
type: "paragraph",
|
|
1984
|
+
text: "The **modifier pipeline** runs in a fixed order during Phase 4 of the extraction pipeline: format transforms first (converting dates or numbers to your target format), then alias mapping (replacing values using a lookup), and finally max_length truncation. Constraint evaluation happens after all modifiers have been applied, so constraints validate the final transformed value, not the raw extraction."
|
|
1985
|
+
},
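The fixed modifier order can be sketched as a small function. The field shape here is an assumption made for illustration (in particular, `format` as a function and `constraints` as named predicates); only the ordering of format, alias, max_length, and then constraint evaluation comes from the text.

```javascript
// Applies modifiers in the documented order, then evaluates constraints
// against the final transformed value.
function applyModifiers(rawValue, field) {
  let value = rawValue;

  // 1. Format transform (hypothetically modelled as a function).
  if (typeof field.format === "function") value = field.format(value);

  // 2. Alias mapping: replace the value if the lookup contains it.
  if (field.alias && Object.prototype.hasOwnProperty.call(field.alias, value)) {
    value = field.alias[value];
  }

  // 3. max_length truncation.
  if (typeof field.max_length === "number") {
    value = String(value).slice(0, field.max_length);
  }

  // Constraints see the transformed value, not the raw extraction.
  const violations = (field.constraints || [])
    .filter((c) => !c.test(value))
    .map((c) => c.name);

  return { value, violations };
}

// Hypothetical field definition used to exercise the pipeline.
const countryField = {
  format: (v) => String(v).trim().toUpperCase(),
  alias: { GERMANY: "DE", "UNITED STATES": "US" },
  max_length: 2,
  constraints: [{ name: "iso2", test: (v) => /^[A-Z]{2}$/.test(v) }],
};
```

With this definition, a raw value of `" Germany "` is upper-cased, mapped to `DE`, left untouched by truncation, and then passes the `iso2` constraint. The same constraint applied to the raw string would have failed, which is why evaluation order matters.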
|
|
1986
|
+
{
|
|
1987
|
+
type: "paragraph",
|
|
1988
|
+
text: 'For best results, use **manual instructions** sparingly and only for fields that the registry cannot match. A well-written instruction should describe the field in plain language, specify where in the document to look, and note any formatting expectations. Avoid vague instructions like "extract the value" \u2014 instead, write something like "Extract the net payment amount from the invoice summary section, excluding VAT."'
|
|
1989
|
+
},
|
|
1681
1990
|
{
|
|
1682
1991
|
type: "callout",
|
|
1683
1992
|
text: "For the complete JSON Schema specification with all features, see the [Full Schema Reference](/docs/platform/schema-features) in the Platform Guide."
|
|
@@ -1696,6 +2005,10 @@ var sections5 = [
|
|
|
1696
2005
|
{
|
|
1697
2006
|
question: "Can I override AI extraction instructions with my own?",
|
|
1698
2007
|
answer: "Yes. Use the Manual instruction feature on a schema field. User-written instructions override the AI-synthesized master instruction from the field registry."
|
|
2008
|
+
},
|
|
2009
|
+
{
|
|
2010
|
+
question: "In what order are modifiers applied to extracted values?",
|
|
2011
|
+
answer: "Modifiers run in a fixed order: format (date/number conversion) first, then alias (value mapping), then max_length (truncation). Constraints are evaluated after all modifiers complete."
|
|
1699
2012
|
}
|
|
1700
2013
|
],
|
|
1701
2014
|
mentions: [
|
|
@@ -1740,6 +2053,26 @@ var sections5 = [
|
|
|
1740
2053
|
description: "No match found \u2014 needs manual extraction instructions."
|
|
1741
2054
|
}
|
|
1742
2055
|
]
|
|
2056
|
+
},
|
|
2057
|
+
{
|
|
2058
|
+
type: "paragraph",
|
|
2059
|
+
text: "When you add a field to a template, the system automatically attempts to match it against the **Field Registry**. Exact name matches are applied instantly, while semantic and composite matches appear as suggestions for your confirmation. If no match is found, the field is marked **Unmapped** and you should provide a manual extraction instruction so the AI knows how to extract that value from your documents."
|
|
2060
|
+
},
|
|
2061
|
+
{
|
|
2062
|
+
type: "paragraph",
|
|
2063
|
+
text: "The matching engine uses a three-band resolution process under the hood. First, it checks for an exact name match against canonical registry field names and their synonyms. If no exact match is found, it computes embedding similarity between your field name and every registry field, surfacing semantic matches with a similarity of 0.5 or higher. Matches at 0.8 or higher are auto-accepted; those from 0.5 up to 0.8 require your confirmation."
|
|
2064
|
+
},
|
|
2065
|
+
{
|
|
2066
|
+
type: "paragraph",
|
|
2067
|
+
text: "Matched fields inherit the registry's AI-synthesized **master instruction**, which tells the extraction pipeline exactly how to locate and extract that value from documents. This is why matching matters \u2014 a well-matched field leverages all the intelligence the system has built up from processing your document corpus. Unmapped fields rely solely on your manual instruction, so they may need a few correction cycles before reaching the same accuracy."
|
|
2068
|
+
},
|
|
2069
|
+
{
|
|
2070
|
+
type: "paragraph",
|
|
2071
|
+
text: "You can trigger a **Rematch** on all fields at any time from the template editor. This is useful after the registry has grown \u2014 fields that were previously unmapped may now find matches as new extractions contribute to the registry. For best results, use descriptive field names that reflect the actual data (e.g., `contract_start_date` rather than `field_1`)."
|
|
2072
|
+
},
|
|
2073
|
+
{
|
|
2074
|
+
type: "callout",
|
|
2075
|
+
text: "Field matching is read-only against the registry \u2014 it never creates new registry entries. If no match exists, the field stays unmapped until you provide a manual instruction or new documents introduce the field into the registry."
|
|
1743
2076
|
}
|
|
1744
2077
|
],
|
|
1745
2078
|
related: [
|
|
@@ -1755,6 +2088,10 @@ var sections5 = [
|
|
|
1755
2088
|
{
|
|
1756
2089
|
question: "What happens when a field is unmapped?",
|
|
1757
2090
|
answer: "Unmapped fields have no registry match. They require manual extraction instructions to guide the AI on how to extract the value from documents."
|
|
2091
|
+
},
|
|
2092
|
+
{
|
|
2093
|
+
question: "Can I re-run field matching after adding more documents?",
|
|
2094
|
+
answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows."
|
|
1758
2095
|
}
|
|
1759
2096
|
],
|
|
1760
2097
|
mentions: ["field matching", "exact match", "semantic match", "composite", "unmapped"]
|
|
@@ -1791,6 +2128,18 @@ var sections5 = [
|
|
|
1791
2128
|
}
|
|
1792
2129
|
]
|
|
1793
2130
|
},
|
|
2131
|
+
{
|
|
2132
|
+
type: "paragraph",
|
|
2133
|
+
text: "To set up a reference table, upload a CSV or manually enter key-value pairs where the **key** is the code you want in your output and the **value** is the human-readable label found in documents. During extraction, the system tries each tier of the lookup cascade in order (string normalization, token fuzzy matching, then AI fallback). Most values resolve instantly at Tier 1 of the cascade, so keeping your labels clean and consistent improves both speed and accuracy."
|
|
2134
|
+
},
|
|
2135
|
+
{
|
|
2136
|
+
type: "paragraph",
|
|
2137
|
+
text: "Reference tables are used in two pipeline stages. In **Phase 1**, the lookup cascade runs as part of the resolve step, mapping extracted labels to codes without any AI calls (Tier 1 and Tier 2). In **Phase 3**, the cascade runs again on values produced by Phase 2's AI extraction, normalizing free-text AI output to your canonical codes. This two-pass approach ensures maximum code coverage across the entire pipeline."
|
|
2138
|
+
},
|
|
2139
|
+
{
|
|
2140
|
+
type: "paragraph",
|
|
2141
|
+
text: 'For best results, include common variations and abbreviations as separate value entries all pointing to the same key. For example, if your code is `US`, add values for "United States", "USA", "U.S.A.", and "United States of America". The more variations you cover, the more values resolve at Tier 1 (highest confidence) without falling through to fuzzy or AI matching.'
|
|
2142
|
+
},
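A minimal sketch of the first lookup tier (exact match after normalization) shows why adding label variations helps. The `buildLookup` and `lookupCode` helpers and the normalization rule are assumptions made for illustration; the 0.95 confidence figure is the one these docs quote for exact normalization. Fuzzy and AI fallback tiers are not modelled here.

```javascript
// Lower-case, collapse punctuation and whitespace runs into single spaces.
const normalize = (s) => s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

// rows: array of [key, value] pairs, i.e. [outputCode, humanReadableLabel].
function buildLookup(rows) {
  const index = new Map();
  for (const [code, label] of rows) index.set(normalize(label), code);
  return index;
}

// Returns the code for an extracted label, or null so that the caller can
// fall through to the next tier of the cascade.
function lookupCode(index, extractedLabel) {
  const code = index.get(normalize(extractedLabel));
  return code === undefined ? null : { code, confidence: 0.95 };
}

const countries = buildLookup([
  ["US", "United States"],
  ["US", "USA"],
  ["US", "U.S.A."],
  ["US", "United States of America"],
  ["DE", "Germany"],
]);
```

Each extra row is one more string that resolves without fuzzy matching or an AI call, and normalization means minor punctuation or spacing differences do not need their own rows.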
|
|
1794
2143
|
{
|
|
1795
2144
|
type: "callout",
|
|
1796
2145
|
text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run."
|
|
@@ -1809,6 +2158,10 @@ var sections5 = [
|
|
|
1809
2158
|
{
|
|
1810
2159
|
question: "How accurate are reference table lookups?",
|
|
1811
2160
|
answer: "A properly loaded reference table produces 90-100% accurate results within a single run. The cascade provides confidence scores: 0.95 for exact normalization, ~0.70 for fuzzy, and 0.50 for AI fallback."
|
|
2161
|
+
},
|
|
2162
|
+
{
|
|
2163
|
+
question: "How should I format my reference table CSV?",
|
|
2164
|
+
answer: "Use two columns: the first column is the key (output code) and the second is the value (human-readable label). Include common variations and abbreviations as separate rows pointing to the same key for maximum Tier 1 hit rate."
|
|
1812
2165
|
}
|
|
1813
2166
|
],
|
|
1814
2167
|
mentions: [
|
|
@@ -1829,21 +2182,41 @@ var sections5 = [
|
|
|
1829
2182
|
{
|
|
1830
2183
|
type: "paragraph",
|
|
1831
2184
|
text: "Templates support a workshop system: **Live** (current published version, read-only), **Workshop** (mutable draft for editing), and **Version History** (timeline with diff summaries). When promoting a draft, the system detects breaking changes (field removals, type changes) and warns you."
|
|
1832
|
-
}
|
|
1833
|
-
|
|
1834
|
-
|
|
1835
|
-
|
|
1836
|
-
|
|
1837
|
-
{
|
|
1838
|
-
|
|
1839
|
-
|
|
1840
|
-
|
|
1841
|
-
|
|
1842
|
-
|
|
2185
|
+
},
|
|
2186
|
+
{
|
|
2187
|
+
type: "paragraph",
|
|
2188
|
+
text: "Start by editing fields in the **Workshop** draft, then use **Test Extraction** to compare draft results against the live version before publishing. The **Version History** timeline lets you review diff summaries between any two versions, making it easy to trace when a field was added, renamed, or removed and understand the impact on downstream jobs."
|
|
2189
|
+
},
|
|
2190
|
+
{
|
|
2191
|
+
type: "paragraph",
|
|
2192
|
+
text: "The versioning system is append-only \u2014 every time you publish a draft, it creates a new immutable version and the previous version is preserved in the timeline. This means you can always go back and review the exact schema that was used for any historical job. The diff view highlights added fields, removed fields, type changes, and updated instructions, giving you a clear picture of how your schema evolved."
|
|
2193
|
+
},
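The diff and breaking-change check described above can be approximated as follows. This is a sketch under assumed data shapes (a version as `{ fields: [{ name, type }] }`); the `diffSchemaVersions` name is hypothetical, and instruction changes are left out for brevity. Renames appear as one removal plus one addition, since the sketch keys fields by name.

```javascript
// Compares two schema versions and flags breaking changes
// (field removals and type changes, as listed in the docs).
function diffSchemaVersions(live, draft) {
  const liveByName = new Map(live.fields.map((f) => [f.name, f]));
  const draftByName = new Map(draft.fields.map((f) => [f.name, f]));

  const added = [];
  const removed = [];
  const typeChanged = [];

  for (const name of draftByName.keys()) {
    if (!liveByName.has(name)) added.push(name);
  }
  for (const [name, liveField] of liveByName) {
    const draftField = draftByName.get(name);
    if (draftField === undefined) removed.push(name);
    else if (draftField.type !== liveField.type) typeChanged.push(name);
  }

  return {
    added,
    removed,
    typeChanged,
    breaking: removed.length > 0 || typeChanged.length > 0,
  };
}
```

Adding a field is treated as non-breaking here, which matches the docs' list of breaking changes: only removals and type changes set the `breaking` flag.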
|
|
2194
|
+
{
|
|
2195
|
+
type: "paragraph",
|
|
2196
|
+
text: "Use the workshop system to iterate safely on your schema without disrupting production jobs. A common workflow is to add a new field in the Workshop, run a **Test Extraction** on a few documents to verify it produces correct values, then publish when satisfied. If a downstream integration depends on a specific field, the breaking change detection will warn you before you accidentally remove or rename it."
|
|
2197
|
+
},
|
|
2198
|
+
{
|
|
2199
|
+
type: "callout",
|
|
2200
|
+
text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing."
|
|
2201
|
+
}
|
|
2202
|
+
],
|
|
2203
|
+
related: [
|
|
2204
|
+
{ label: "User Templates", slug: "user-templates" },
|
|
2205
|
+
{ label: "Test Extraction", slug: "test-extraction" },
|
|
2206
|
+
{ label: "Agent Impact Levels", slug: "agent-impact" }
|
|
2207
|
+
],
|
|
2208
|
+
faq: [
|
|
2209
|
+
{
|
|
2210
|
+
question: "How does schema versioning work?",
|
|
2211
|
+
answer: "Templates use a workshop system: Live (published, read-only), Workshop (mutable draft), and Version History (timeline with diffs). Breaking changes like field removals or type changes are detected on promotion."
|
|
1843
2212
|
},
|
|
1844
2213
|
{
|
|
1845
2214
|
question: "What are breaking changes in a schema?",
|
|
1846
2215
|
answer: "Breaking changes include field removals and type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts."
|
|
2216
|
+
},
|
|
2217
|
+
{
|
|
2218
|
+
question: "Can I revert to a previous schema version?",
|
|
2219
|
+
answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed."
|
|
1847
2220
|
}
|
|
1848
2221
|
],
|
|
1849
2222
|
mentions: ["versioning", "drafts", "workshop", "live version", "breaking changes"]
|
|
@@ -1858,6 +2231,22 @@ var sections5 = [
|
|
|
1858
2231
|
{
|
|
1859
2232
|
type: "paragraph",
|
|
1860
2233
|
text: "Before publishing a draft, run a test extraction to compare draft vs. live results side-by-side. Select a few documents, run the test, and see exactly how your changes affect output."
|
|
2234
|
+
},
|
|
2235
|
+
{
|
|
2236
|
+
type: "paragraph",
|
|
2237
|
+
text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
|
|
2238
|
+
},
|
|
2239
|
+
{
|
|
2240
|
+
type: "paragraph",
|
|
2241
|
+
text: "Test extractions apply the same schema features as production jobs, including reference table lookups, format constraints, and modifiers, so the results are a reliable preview of what a full job would produce. Under the hood, the test uses a simplified single-call extraction mode rather than a full pipeline run, which makes it faster and cheaper while still exercising every schema feature you have configured."
|
|
2242
|
+
},
|
|
2243
|
+
{
|
|
2244
|
+
type: "paragraph",
|
|
2245
|
+
text: 'For best results, select 3-5 representative documents that cover the variety in your corpus \u2014 include at least one "clean" document and one with unusual formatting or missing fields. This gives you confidence that your schema handles both typical and edge-case documents correctly. Run the test after every significant change to a field instruction, reference table, or format constraint.'
|
|
2246
|
+
},
|
|
2247
|
+
{
|
|
2248
|
+
type: "callout",
|
|
2249
|
+
text: "Test extractions do not affect your live data, and they consume credits no differently from production jobs. They are designed for rapid iteration, so run as many tests as you need before publishing."
|
|
1861
2250
|
}
|
|
1862
2251
|
],
|
|
1863
2252
|
related: [
|
|
@@ -1873,6 +2262,10 @@ var sections5 = [
|
|
|
1873
2262
|
{
|
|
1874
2263
|
question: "Do I need to publish a draft before testing it?",
|
|
1875
2264
|
answer: "No. Test extraction runs against the unpublished draft, comparing its output to the current live version so you can verify changes before publishing."
|
|
2265
|
+
},
|
|
2266
|
+
{
|
|
2267
|
+
question: "How many documents should I use for a test extraction?",
|
|
2268
|
+
answer: "Select 3-5 representative documents that cover the variety in your corpus. Include documents with different layouts, data completeness levels, and edge cases to get a reliable preview of how your schema changes perform."
|
|
1876
2269
|
}
|
|
1877
2270
|
],
|
|
1878
2271
|
mentions: ["test extraction", "draft comparison", "side-by-side", "preview"]
|
|
@@ -1923,6 +2316,22 @@ var sections5 = [
|
|
|
1923
2316
|
description: "Output file encoding: UTF-8 (default), UTF-8-BOM, ISO-8859-1, etc."
|
|
1924
2317
|
}
|
|
1925
2318
|
]
|
|
2319
|
+
},
|
|
2320
|
+
{
|
|
2321
|
+
type: "paragraph",
|
|
2322
|
+
text: "When working with international data, configure the dialect to match your downstream system requirements. For example, set **number_locale** to `fr-FR` for European comma-decimal formatting, switch the **delimiter** to a semicolon so that decimal commas do not clash with the field separator, and choose **UTF-8-BOM** encoding if your data will be opened in Excel. Creating a shared dialect and reusing it across schemas keeps formatting consistent across all your exports."
|
|
2323
|
+
},
|
|
2324
|
+
{
|
|
2325
|
+
type: "paragraph",
|
|
2326
|
+
text: "Dialect settings are applied during Phase 4 of the extraction pipeline and during CSV/XLSX export. The dialect does not affect how values are stored internally \u2014 it only controls the serialization format when data leaves the platform. This means you can change a dialect at any time without re-running extractions; the new format applies to all future exports and deliveries."
|
|
2327
|
+
},
|
|
2328
|
+
{
|
|
2329
|
+
type: "paragraph",
|
|
2330
|
+
text: 'For best results, create a shared dialect for each downstream system or regional office you deliver to, and name it descriptively (e.g., "SAP Europe" or "US Accounting"). Avoid defining dialects inline on individual schemas unless you have a one-off formatting requirement. Shared dialects reduce maintenance burden and ensure consistency when you add new schemas later.'
|
|
2331
|
+
},
|
|
2332
|
+
{
|
|
2333
|
+
type: "callout",
|
|
2334
|
+
text: "If your CSV files show garbled special characters (accents, umlauts, CJK text), switch the encoding to **UTF-8-BOM**. The BOM (byte order mark) tells Excel to interpret the file as UTF-8 instead of the system default encoding."
|
|
1926
2335
|
}
|
|
1927
2336
|
],
|
|
1928
2337
|
related: [
|
|
@@ -1938,6 +2347,10 @@ var sections5 = [
|
|
|
1938
2347
|
{
|
|
1939
2348
|
question: "Can I share a dialect across multiple schemas?",
|
|
1940
2349
|
answer: "Yes. A dialect can be shared across schemas or defined inline for a specific schema. Configure them in the Schema > Delivery tab."
|
|
2350
|
+
},
|
|
2351
|
+
{
|
|
2352
|
+
question: "Do I need to re-run extractions when I change a dialect?",
|
|
2353
|
+
answer: "No. Dialects only affect output serialization (exports and deliveries), not how values are stored internally. Changing a dialect takes effect immediately on future exports without re-processing."
|
|
1941
2354
|
}
|
|
1942
2355
|
],
|
|
1943
2356
|
mentions: [
|
|
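The dialect behavior described in this section could be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: `serializeCsv` and the shape of the `dialect` object are hypothetical, and only the setting names `number_locale` and `delimiter` and the UTF-8-BOM encoding come from the text. It assumes a Node.js runtime with full ICU data.

```javascript
// Hypothetical helper: serialize rows to CSV text according to a dialect.
// Stored values are never changed; only the output text is formatted.
function serializeCsv(rows, dialect) {
  const numberFormat = new Intl.NumberFormat(dialect.number_locale, {
    useGrouping: false, // keep the example free of locale-specific group separators
    minimumFractionDigits: 2,
    maximumFractionDigits: 2,
  });

  const formatCell = (value) => {
    const text = typeof value === "number" ? numberFormat.format(value) : String(value);
    // Quote cells that contain the delimiter, a quote, or a line break.
    const needsQuotes = text.includes(dialect.delimiter) || /["\r\n]/.test(text);
    return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
  };

  const body = rows.map((row) => row.map(formatCell).join(dialect.delimiter)).join("\r\n");
  // A leading U+FEFF becomes the UTF-8 byte order mark once the text is written as UTF-8.
  return dialect.encoding === "UTF-8-BOM" ? "\uFEFF" + body : body;
}

// Example dialect, named after the "SAP Europe" suggestion in the text.
const sapEurope = { number_locale: "fr-FR", delimiter: ";", encoding: "UTF-8-BOM" };
const csv = serializeCsv([["Supplier", "Total"], ["Müller GmbH", 1234.5]], sapEurope);
```

Because the formatting happens only at serialization time, the same stored rows can be exported again with a different dialect object without re-running any extraction.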
@@ -1986,6 +2399,18 @@ var sections5 = [
|
|
|
1986
2399
|
}
|
|
1987
2400
|
]
|
|
1988
2401
|
},
|
|
2402
|
+
{
|
|
2403
|
+
type: "paragraph",
|
|
2404
|
+
text: 'Use bypass strategies for fields whose values are known ahead of time or can be derived without reading the document. For example, set a **constant** of `"USD"` for a currency field that is always the same, or use a **generator** to produce a deterministic ID for each row. Fields with bypass strategies skip the AI extraction phase entirely, reducing processing time and credit usage.'
|
|
2405
|
+
},
|
|
2406
|
+
{
|
|
2407
|
+
type: "paragraph",
|
|
2408
|
+
text: "The **reference** bypass strategy is particularly powerful for enrichment fields. Define a `key_expression` that references another field in the schema (e.g., the supplier name), and the system will automatically look up the corresponding code from your reference table without any AI involvement. This is ideal for mapping extracted entity names to internal system identifiers, ERP codes, or classification labels."
|
|
2409
|
+
},
|
|
2410
|
+
{
|
|
2411
|
+
type: "paragraph",
|
|
2412
|
+
text: "For best results, audit your schema for fields that never vary across documents \u2014 these are prime candidates for the **constant** strategy. Fields like currency, data source, or processing batch can be set once and never require AI extraction. This reduces per-document processing cost and improves job completion time, especially on large runs with hundreds of documents."
|
|
2413
|
+
},
|
|
1989
2414
|
{
|
|
1990
2415
|
type: "callout",
|
|
1991
2416
|
text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net. Strategy values are normalized via generator mappings in Phase 4 of the pipeline."
|
|
@@ -2004,6 +2429,10 @@ var sections5 = [
|
|
|
2004
2429
|
{
|
|
2005
2430
|
question: "What happens when a generator bypass fails?",
|
|
2006
2431
|
answer: "When a generator strategy fails to produce a value, the field falls through to LLM extraction as a safety net, ensuring the cell is still filled."
|
|
2432
|
+
},
|
|
2433
|
+
{
|
|
2434
|
+
question: "Do bypass strategies reduce extraction costs?",
|
|
2435
|
+
answer: "Yes. Fields with bypass strategies skip the AI extraction phase entirely, which reduces both processing time and credit usage. Use constant or reference strategies for fields that do not require document reading."
|
|
2007
2436
|
}
|
|
2008
2437
|
],
|
|
2009
2438
|
mentions: [
|
|
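As a rough sketch of how the three bypass strategies might be dispatched: the function name, the `strategy` object shape, and the `table` property are hypothetical, and `key_expression` is simplified to the name of another schema field. Only the strategy names and the generator fall-through behavior come from the text; what happens on a missed reference lookup is not stated there, so the sketch simply reports it as unresolved.

```javascript
// Hypothetical resolver for the constant, generator, and reference strategies.
// Returns { value, source } when the bypass succeeds, or null to signal that
// the field was not resolved by a bypass.
function resolveBypass(field, row, referenceTables) {
  const { strategy } = field;

  if (strategy.type === "constant") {
    return { value: strategy.value, source: "constant" };
  }

  if (strategy.type === "generator") {
    const value = strategy.generate(row);
    // A generator that produces nothing falls through to LLM extraction.
    return value == null || value === "" ? null : { value, source: "generator" };
  }

  if (strategy.type === "reference") {
    // Assumption: key_expression names another field in the same row.
    const key = row[strategy.key_expression];
    const table = referenceTables[strategy.table] || {};
    return Object.prototype.hasOwnProperty.call(table, key)
      ? { value: table[key], source: "reference" }
      : null; // assumption: a missed lookup is treated as unresolved
  }

  return null; // no bypass configured
}

const tables = { supplier_codes: { "Acme Corp": "SUP-001" } };
const sampleRow = { supplier_name: "Acme Corp" };

const currency = resolveBypass({ strategy: { type: "constant", value: "USD" } }, sampleRow, tables);
const supplierCode = resolveBypass(
  { strategy: { type: "reference", table: "supplier_codes", key_expression: "supplier_name" } },
  sampleRow,
  tables
);
```

Neither call reads the document text, which is why such fields can skip the AI extraction phase.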
@@ -2049,6 +2478,18 @@ var sections5 = [
|
|
|
2049
2478
|
{
|
|
2050
2479
|
type: "paragraph",
|
|
2051
2480
|
text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax. The editor provides a live test input so you can verify the pattern before saving."
|
|
2481
|
+
},
|
|
2482
|
+
{
|
|
2483
|
+
type: "paragraph",
|
|
2484
|
+
text: "Format constraints are especially useful for fields with strict formatting requirements in downstream systems. For example, a purchase order number might have to follow the pattern `PO-\\d{6}`, or a date might have to match `\\d{4}-\\d{2}-\\d{2}`. By catching format violations at extraction time, you avoid importing malformed data into your ERP, accounting, or analytics systems."
|
|
2485
|
+
},
|
|
2486
|
+
{
|
|
2487
|
+
type: "paragraph",
|
|
2488
|
+
text: 'Choose the mismatch behavior based on your data quality requirements. Use **empty** (the default) when you prefer no data over bad data \u2014 the downstream system will see a blank cell. Use **flag** when you want to review mismatches manually before deciding \u2014 flagged cells appear with an amber dot in the results grid. Use **constant** when your downstream system needs a specific sentinel value like `"N/A"` or `"INVALID"` to trigger its own error handling.'
|
|
2489
|
+
},
|
|
2490
|
+
{
|
|
2491
|
+
type: "callout",
|
|
2492
|
+
text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching."
|
|
2052
2493
|
}
|
|
2053
2494
|
],
|
|
2054
2495
|
related: [
|
|
@@ -2064,6 +2505,10 @@ var sections5 = [
|
|
|
2064
2505
|
{
|
|
2065
2506
|
question: "Are original values preserved when format constraints clear a cell?",
|
|
2066
2507
|
answer: "Yes. Original values are always preserved for audit in the original_extractions table, regardless of the mismatch behavior applied."
|
|
2508
|
+
},
|
|
2509
|
+
{
|
|
2510
|
+
question: "Can I use case-insensitive regex patterns?",
|
|
2511
|
+
answer: "Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax with inline flags."
|
|
2067
2512
|
}
|
|
2068
2513
|
],
|
|
2069
2514
|
mentions: [
|
|
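The three mismatch behaviors can be illustrated with a small sketch. Everything here is hypothetical apart from the behavior names and the 1,000-character cap: the function name, the returned object shape, the choice to anchor the pattern to the whole value, and the choice to truncate over-long input rather than reject it are all assumptions. The `(?i)` inline flag and the nested-quantifier check are not reproduced.

```javascript
const MAX_INPUT_LENGTH = 1000; // input cap mentioned in the callout

// Hypothetical evaluator for a format constraint on one extracted value.
// behavior is "empty" (the default), "flag", or "constant".
function applyFormatConstraint(value, pattern, behavior = "empty", constant = "N/A") {
  const input = String(value).slice(0, MAX_INPUT_LENGTH);
  // Assumption: the whole value must match, not just a substring.
  const matches = new RegExp(`^(?:${pattern})$`).test(input);
  if (matches) return { value, flagged: false, original: value };

  switch (behavior) {
    case "flag":
      // Keep the value but mark the cell for manual review.
      return { value, flagged: true, original: value };
    case "constant":
      // Replace the value with a sentinel for the downstream system.
      return { value: constant, flagged: false, original: value };
    default:
      // "empty": prefer no data over bad data.
      return { value: "", flagged: false, original: value };
  }
}

const valid = applyFormatConstraint("PO-123456", "PO-\\d{6}");
const cleared = applyFormatConstraint("PO-12", "PO-\\d{6}");
const flagged = applyFormatConstraint("PO-12", "PO-\\d{6}", "flag");
const sentinel = applyFormatConstraint("PO-12", "PO-\\d{6}", "constant", "INVALID");
```

In every branch the `original` property keeps the extracted value, mirroring the statement that original values are preserved for audit regardless of the mismatch behavior.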
@@ -2092,6 +2537,18 @@ var sections6 = [
|
|
|
2092
2537
|
{
|
|
2093
2538
|
type: "paragraph",
|
|
2094
2539
|
text: "Navigate to **Structuring → Runs → New**. Select your template and documents, then click Start. Results appear progressively as each phase completes."
|
|
2540
|
+
},
|
|
2541
|
+
{
|
|
2542
|
+
type: "paragraph",
|
|
2543
|
+
text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents silent data loss where registry-based lookups would return empty results for unresolved documents."
|
|
2544
|
+
},
|
|
2545
|
+
{
|
|
2546
|
+
type: "paragraph",
|
|
2547
|
+
text: "For best results, select documents of the same type or closely related types for a single job. The schema you choose should match the document content \u2014 using an invoice schema on contract documents will produce poor results. Start with a small batch of 5-10 documents to validate your schema, review the output, apply corrections, and then scale up to larger runs once you are confident in the extraction quality."
|
|
2548
|
+
},
|
|
2549
|
+
{
|
|
2550
|
+
type: "callout",
|
|
2551
|
+
text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running."
|
|
2095
2552
|
}
|
|
2096
2553
|
],
|
|
2097
2554
|
related: [
|
|
@@ -2107,6 +2564,10 @@ var sections6 = [
|
|
|
2107
2564
|
{
|
|
2108
2565
|
question: "What does an extraction job produce?",
|
|
2109
2566
|
answer: "A job produces a structured grid where rows represent documents and columns represent schema fields. Each cell contains an extracted value with confidence and provenance metadata."
|
|
2567
|
+
},
|
|
2568
|
+
{
|
|
2569
|
+
question: "How many documents can I include in a single job?",
|
|
2570
|
+
answer: "Phase 2 supports up to 2,000 documents per job, and Phase 4 supports up to 1,000. For best results, start with smaller batches to validate your schema before scaling up."
|
|
2110
2571
|
}
|
|
2111
2572
|
],
|
|
2112
2573
|
mentions: ["extraction job", "structured grid", "progressive results", "template selection"]
|
|
@@ -2122,11 +2583,27 @@ var sections6 = [
|
|
|
2122
2583
|
type: "paragraph",
|
|
2123
2584
|
text: "Every job runs through four phases. Each fills more cells in the output grid, reducing the problem space for the next. Results are visible as each phase completes."
|
|
2124
2585
|
},
|
|
2586
|
+
{
|
|
2587
|
+
type: "paragraph",
|
|
2588
|
+
text: "Each phase builds on the previous one, progressively filling the output grid. **Phase 1** resolves ~30% of cells instantly using graph matches and lookups. **Phase 2** deploys an AI agent to fill remaining gaps. **Phase 3** runs cross-field validation checks, and **Phase 4** performs targeted re-reads for empty or low-confidence cells. You can monitor fill rate in real time as each phase completes."
|
|
2589
|
+
},
|
|
2590
|
+
{
|
|
2591
|
+
type: "paragraph",
|
|
2592
|
+
text: "The pipeline is designed around a key principle: use the cheapest, fastest method first and escalate to AI only when necessary. Phase 1 fills cells using deterministic lookups at zero AI cost. Phase 2 uses AI only for cells that Phase 1 could not resolve. Phase 3 re-runs lookups on Phase 2 output to normalize AI-generated values to canonical codes. Phase 4 performs targeted re-reads with full grid context for the remaining gaps. This cascading approach minimizes both cost and latency."
|
|
2593
|
+
},
|
|
2594
|
+
{
|
|
2595
|
+
type: "paragraph",
|
|
2596
|
+
text: "The grid is flushed to the database after each phase, enabling progressive rendering in the UI. You can watch cells fill in real time and begin reviewing results before the job finishes. The phase timeline on the job detail page shows which phase is currently active, how long each phase took, and the cumulative fill rate at each stage."
|
|
2597
|
+
},
|
|
2125
2598
|
{
|
|
2126
2599
|
type: "ui-excerpt",
|
|
2127
2600
|
id: "job-detail-phase-timeline",
|
|
2128
2601
|
title: "Job Detail \u2014 Phase Timeline",
|
|
2129
2602
|
caption: "The phase timeline shows progress through the pipeline. Each dot represents a stage, highlighted when active."
|
|
2603
|
+
},
|
|
2604
|
+
{
|
|
2605
|
+
type: "callout",
|
|
2606
|
+
text: "Phase order is fixed: Phase 1 → 2 → 3 → 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs."
|
|
2130
2607
|
}
|
|
2131
2608
|
],
|
|
2132
2609
|
related: [
|
|
@@ -2142,6 +2619,10 @@ var sections6 = [
|
|
|
2142
2619
|
{
|
|
2143
2620
|
question: "Can I see results before all phases complete?",
|
|
2144
2621
|
answer: "Yes. Results are visible as each phase completes. The fill rate increases progressively through the pipeline."
|
|
2622
|
+
},
|
|
2623
|
+
{
|
|
2624
|
+
question: "Why does the pipeline use multiple phases instead of a single AI call?",
|
|
2625
|
+
answer: "The cascading design minimizes cost and latency. Phase 1 fills cells with deterministic lookups at zero AI cost. Only remaining gaps go to the AI agent in Phase 2, and Phase 4 targets specific empty cells with full context. This is significantly cheaper and faster than sending everything to AI."
|
|
2145
2626
|
}
|
|
2146
2627
|
],
|
|
2147
2628
|
mentions: ["4-phase pipeline", "fill rate", "progressive rendering", "phase timeline"]
|
|
@@ -2191,6 +2672,18 @@ var sections6 = [
|
|
|
2191
2672
|
{
|
|
2192
2673
|
type: "paragraph",
|
|
2193
2674
|
text: "Values are normalized during transfer: dates → `YYYY/MM/DD`, numbers → 2 decimal places, strings → trim + collapse spaces."
|
|
2675
|
+
},
|
|
2676
|
+
{
|
|
2677
|
+
type: "paragraph",
|
|
2678
|
+
text: "Phase 1 is the workhorse of cost efficiency. Because it relies entirely on pre-computed graph matches and deterministic lookups, it fills a large portion of the grid at near-zero cost. The confidence scores assigned during this phase are typically high (0.7-0.95) because they are derived from verified registry matches rather than AI inference. These high-confidence cells are then protected by the confidence gate, meaning later phases cannot overwrite them."
|
|
2679
|
+
},
|
|
2680
|
+
{
|
|
2681
|
+
type: "paragraph",
|
|
2682
|
+
text: "The resolution strategies execute in a fixed order: registry transfer first, then raw extraction mapping, then the 3-tier lookup cascade, and finally deterministic compute (formulas like `Total = Unit Price x Quantity`). Each strategy only attempts to fill cells that are still empty after the previous strategy ran. This ordering ensures that the highest-confidence method always gets priority."
|
|
2683
|
+
},
|
|
2684
|
+
{
|
|
2685
|
+
type: "callout",
|
|
2686
|
+
text: "Phase 1 fill rates improve over time as your Field Registry grows. The more documents you process, the richer the registry becomes, and the more cells Phase 1 can resolve without AI \u2014 reducing both cost and latency for every subsequent job."
|
|
2194
2687
|
}
|
|
2195
2688
|
],
|
|
2196
2689
|
related: [
|
|
@@ -2206,6 +2699,10 @@ var sections6 = [
|
|
|
2206
2699
|
{
|
|
2207
2700
|
question: "What percentage of cells does Phase 1 fill?",
|
|
2208
2701
|
answer: "Phase 1 typically fills approximately 30% of cells in seconds, using graph matches and lookups without any AI calls."
|
|
2702
|
+
},
|
|
2703
|
+
{
|
|
2704
|
+
question: "Does Phase 1 performance improve over time?",
|
|
2705
|
+
answer: "Yes. As your Field Registry grows from processing more documents, Phase 1 can resolve a higher percentage of cells through graph matches. Mature registries often see Phase 1 fill rates of 60-80%."
|
|
2209
2706
|
}
|
|
2210
2707
|
],
|
|
2211
2708
|
mentions: [
|
|
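The fixed strategy order in Phase 1 can be pictured with a short sketch. The order of the four strategies comes from the text; the `runCascade` function, the strategy object shape, and the stand-in resolver bodies are hypothetical and only illustrate the rule that each strategy fills cells left empty by the previous one.

```javascript
// Hypothetical cascade runner: strategies run in a fixed order, and each one
// is only consulted for cells that are still empty.
function runCascade(row, fieldNames, strategies) {
  const cells = {};
  for (const strategy of strategies) {
    for (const field of fieldNames) {
      if (cells[field] !== undefined) continue; // filled by an earlier strategy
      const result = strategy.resolve(field, row, cells);
      if (result !== undefined) {
        cells[field] = { value: result, resolvedBy: strategy.name };
      }
    }
  }
  return cells;
}

// Order taken from the text; the resolvers are simple stand-ins.
const strategies = [
  { name: "registry_transfer", resolve: (f, row) => row.registry[f] },
  { name: "raw_extraction_mapping", resolve: (f, row) => row.raw[f] },
  { name: "lookup_cascade", resolve: (f, row) => row.lookups[f] },
  {
    name: "deterministic_compute",
    // Total = Unit Price x Quantity, computed from cells filled earlier.
    resolve: (f, row, cells) =>
      f === "total" && cells.unit_price && cells.quantity
        ? cells.unit_price.value * cells.quantity.value
        : undefined,
  },
];

const exampleRow = {
  registry: { unit_price: 12.5 },
  raw: { unit_price: 99, quantity: 4 },
  lookups: {},
};
const cells = runCascade(exampleRow, ["unit_price", "quantity", "total"], strategies);
```

Here `unit_price` keeps the registry value even though the raw extraction offers a different one, because the earlier strategy in the order gets priority.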
@@ -2228,6 +2725,10 @@ var sections6 = [
|
|
|
2228
2725
|
type: "paragraph",
|
|
2229
2726
|
text: "An AI agent reviews the grid's gap patterns and produces a typed strategy:"
|
|
2230
2727
|
},
|
|
2728
|
+
{
|
|
2729
|
+
type: "paragraph",
|
|
2730
|
+
text: "The agent analyzes which cells are still empty after Phase 1 and selects the most efficient action for each. Simple calculations like `Total = Unit Price x Quantity` use the **compute** strategy without any AI calls, while truly missing data triggers an **extract** that re-reads the source document with targeted instructions. You can inspect the agent's chosen strategy for every field on the **Strategy Panel** of the job detail page."
|
|
2731
|
+
},
|
|
2231
2732
|
{
|
|
2232
2733
|
type: "ui-excerpt",
|
|
2233
2734
|
id: "job-detail-agent-strategy",
|
|
@@ -2260,6 +2761,14 @@ var sections6 = [
|
|
|
2260
2761
|
}
|
|
2261
2762
|
]
|
|
2262
2763
|
},
|
|
2764
|
+
{
|
|
2765
|
+
type: "paragraph",
|
|
2766
|
+
text: "Phase 2 processes documents with grouped extraction calls \u2014 schema fields are divided into batches of up to 10 fields per call to balance extraction quality with throughput. For each document, the agent sends the document text along with the schema field definitions and any already-resolved values from Phase 1 as context. This context-aware approach means the AI can use related values (like a contract start date) to more accurately extract dependent values (like the end date)."
|
|
2767
|
+
},
|
|
2768
|
+
{
|
|
2769
|
+
type: "paragraph",
|
|
2770
|
+
text: "For fields backed by a **reference table**, Phase 2 includes the table's codes and labels directly in the extraction prompt so the AI picks canonical codes rather than free-text labels. This tight integration between reference tables and AI extraction produces cleaner output that requires fewer corrections. Fields whose reference table has fewer than 50 entries get the full table in the prompt; larger tables are handled by the Phase 3 lookup cascade instead."
|
|
2771
|
+
},
|
|
2263
2772
|
{
|
|
2264
2773
|
type: "callout",
|
|
2265
2774
|
variant: "warning",
|
|
@@ -2279,6 +2788,10 @@ var sections6 = [
|
|
|
2279
2788
|
{
|
|
2280
2789
|
question: "Can the agent skip a field with manual instructions?",
|
|
2281
2790
|
answer: "No. Fields with manual instructions always use the extract strategy. Human-written instructions are treated as authoritative and never skipped."
|
|
2791
|
+
},
|
|
2792
|
+
{
|
|
2793
|
+
question: "How many fields does the agent process per AI call?",
|
|
2794
|
+
answer: "Schema fields are grouped into batches of up to 10 fields per extraction call. This balances extraction quality with throughput \u2014 smaller groups help the AI focus on each field without losing recall."
|
|
2282
2795
|
}
|
|
2283
2796
|
],
|
|
2284
2797
|
mentions: [
|
|
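The batching rule for Phase 2 can be sketched as below. The batch size of 10 and the 50-entry limit for inline reference tables come from the text; the function names and the request object shape are hypothetical and do not describe the real extraction call.

```javascript
const MAX_FIELDS_PER_CALL = 10; // batch size stated in the text
const MAX_INLINE_REFERENCE_ENTRIES = 50; // smaller tables go into the prompt

// Split the still-empty schema fields into groups of at most 10.
function batchFields(fields, size = MAX_FIELDS_PER_CALL) {
  const batches = [];
  for (let i = 0; i < fields.length; i += size) {
    batches.push(fields.slice(i, i + size));
  }
  return batches;
}

// Hypothetical request builder for one extraction call.
function buildExtractionRequest(documentText, batch, resolvedValues) {
  return {
    documentText,
    context: resolvedValues, // values already resolved in Phase 1
    fields: batch.map((field) => ({
      name: field.name,
      instruction: field.instruction,
      // Only reference tables with fewer than 50 entries are included inline.
      referenceEntries:
        field.referenceTable && field.referenceTable.length < MAX_INLINE_REFERENCE_ENTRIES
          ? field.referenceTable
          : undefined,
    })),
  };
}

const emptyFields = Array.from({ length: 23 }, (_, i) => ({ name: `field_${i + 1}` }));
const batches = batchFields(emptyFields);
```

With 23 empty fields this yields three calls per document, the last one carrying the remaining three fields.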
@@ -2301,6 +2814,10 @@ var sections6 = [
|
|
|
2301
2814
|
type: "paragraph",
|
|
2302
2815
|
text: "Cross-field sanity checks. Flags are **informational only** \u2014 they never block output but help you prioritize review:"
|
|
2303
2816
|
},
|
|
2817
|
+
{
|
|
2818
|
+
type: "paragraph",
|
|
2819
|
+
text: "After Phase 3 completes, flagged cells appear with warning indicators in the results grid. Use the **Flagged** filter to see only rows that need attention \u2014 for example, an **amount_mismatch** flag suggests you double-check a total that does not align with its component values. Addressing flagged cells first is the most efficient way to review large extraction runs."
|
|
2820
|
+
},
|
|
2304
2821
|
{
|
|
2305
2822
|
type: "param-table",
|
|
2306
2823
|
title: "Validation flags",
|
|
@@ -2331,6 +2848,18 @@ var sections6 = [
|
|
|
2331
2848
|
description: "Field with >80% registry occurrence rate is empty in this document."
|
|
2332
2849
|
}
|
|
2333
2850
|
]
|
|
2851
|
+
},
|
|
2852
|
+
{
|
|
2853
|
+
type: "paragraph",
|
|
2854
|
+
text: 'Phase 3 also re-runs the lookup cascade (reference table resolution) on values that Phase 2 produced. This is important because AI-extracted values often use natural language labels (e.g., "Frame Agreement") rather than the canonical codes your reference table expects (e.g., `std_master`). The Phase 3 lookup normalizes these labels to codes, improving consistency across your output without requiring manual corrections.'
|
|
2855
|
+
},
|
|
2856
|
+
{
|
|
2857
|
+
type: "paragraph",
|
|
2858
|
+
text: "Validation flags are designed to surface the most impactful issues first. The **low_confidence_outlier** flag is particularly useful \u2014 it highlights cells where the system is uncertain in an otherwise high-confidence row, pointing you to the exact cells most likely to contain errors. For large runs with hundreds of documents, filtering by flags and reviewing those cells first can reduce your review time by 80% or more."
|
|
2859
|
+
},
|
|
2860
|
+
{
|
|
2861
|
+
type: "callout",
|
|
2862
|
+
text: "Validation flags never modify cell values. They are purely informational annotations that help you prioritize review. The actual cell value and confidence score remain unchanged by Phase 3 flagging."
|
|
2334
2863
|
}
|
|
2335
2864
|
],
|
|
2336
2865
|
related: [
|
|
@@ -2346,6 +2875,10 @@ var sections6 = [
|
|
|
2346
2875
|
{
|
|
2347
2876
|
question: "What types of validation flags exist?",
|
|
2348
2877
|
answer: "Five types: date_sanity (date inconsistencies), amount_mismatch (total discrepancies), lookup_failed (no reference match), low_confidence_outlier (low confidence cells), and unexpected_empty (missing high-frequency fields)."
|
|
2878
|
+
},
|
|
2879
|
+
{
|
|
2880
|
+
question: "Does Phase 3 modify any cell values?",
|
|
2881
|
+
answer: "Phase 3 re-runs the reference table lookup cascade to normalize AI-extracted labels to canonical codes. The validation flags themselves are purely informational and do not modify values."
|
|
2349
2882
|
}
|
|
2350
2883
|
],
|
|
2351
2884
|
mentions: [
|
|
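Two of the Phase 3 flags lend themselves to a small sketch. The flag names and the 80% occurrence threshold come from the text; the function, the row shape, the field names, and the 0.01 tolerance are assumptions made for illustration. The function returns annotations only and leaves the row untouched, matching the statement that flags never modify values.

```javascript
// Hypothetical flagging pass over one row of cells.
function collectFlags(row, occurrenceRates) {
  const flags = [];

  // amount_mismatch: the total does not align with its component values.
  const { unit_price, quantity, total } = row;
  if (unit_price?.value != null && quantity?.value != null && total?.value != null) {
    const expected = unit_price.value * quantity.value;
    // Assumption: a tolerance of 0.01 absorbs rounding differences.
    if (Math.abs(expected - total.value) > 0.01) {
      flags.push({ field: "total", type: "amount_mismatch" });
    }
  }

  // unexpected_empty: a field seen in more than 80% of documents is empty here.
  for (const [field, rate] of Object.entries(occurrenceRates)) {
    const cell = row[field];
    const isEmpty = cell == null || cell.value == null || cell.value === "";
    if (rate > 0.8 && isEmpty) {
      flags.push({ field, type: "unexpected_empty" });
    }
  }

  return flags;
}

const invoiceRow = {
  unit_price: { value: 10 },
  quantity: { value: 3 },
  total: { value: 35 },
  supplier_name: { value: "" },
};
const rowFlags = collectFlags(invoiceRow, { supplier_name: 0.92, po_number: 0.4 });
```

Filtering the grid by a list like `rowFlags` is the kind of triage the Flagged filter supports: the row above would surface for its total and for its missing supplier name, while the rarely present `po_number` is not reported.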
@@ -2367,6 +2900,18 @@ var sections6 = [
|
|
|
2367
2900
|
type: "paragraph",
|
|
2368
2901
|
text: "Context-aware gap filling. For each empty cell or low-confidence value, AI re-reads the original document with the field instruction and full grid context. This focused approach often finds values missed in earlier phases."
|
|
2369
2902
|
},
|
|
2903
|
+
{
|
|
2904
|
+
type: "paragraph",
|
|
2905
|
+
text: "Because Phase 4 has access to the full grid context \u2014 all values already resolved in earlier phases \u2014 it can use surrounding data as clues. For example, if a contract start date was resolved in Phase 1 but the end date is still empty, Phase 4 re-reads the document knowing the start date, which helps the AI locate the corresponding end date more accurately."
|
|
2906
|
+
},
|
|
2907
|
+
{
|
|
2908
|
+
type: "paragraph",
|
|
2909
|
+
text: "Phase 4 also applies deterministic transforms to all cell values: ISO code normalization, date format standardization, and unit conversion. Format constraints (regex patterns defined on schema fields) are evaluated at this stage. If a value fails its format constraint, the configured mismatch behavior kicks in \u2014 the cell is either cleared, flagged with an amber dot, or replaced with a constant. Original values are always preserved in the `original_extractions` table for audit purposes."
|
|
2910
|
+
},
|
|
2911
|
+
{
|
|
2912
|
+
type: "paragraph",
|
|
2913
|
+
text: "Expect Phase 4 to fill 5-15% of remaining empty cells, depending on document complexity and schema coverage. The phase is most effective for fields that require cross-referencing multiple sections of a document or interpreting values in the context of other extracted data. It is less effective for fields that are genuinely absent from the source document \u2014 those will remain empty with an `unresolved` provenance type."
|
|
2914
|
+
},
|
|
2370
2915
|
{
|
|
2371
2916
|
type: "callout",
|
|
2372
2917
|
text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected."
|
|
@@ -2385,6 +2930,10 @@ var sections6 = [
|
|
|
2385
2930
|
{
|
|
2386
2931
|
question: "Can Phase 4 overwrite high-confidence values?",
|
|
2387
2932
|
answer: "No. Phase 4 respects the confidence gate \u2014 it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from earlier phases are permanently protected."
|
|
2933
|
+
},
|
|
2934
|
+
{
|
|
2935
|
+
question: "What else happens in Phase 4 besides gap filling?",
|
|
2936
|
+
answer: "Phase 4 also applies deterministic transforms (ISO codes, dates, units), evaluates format constraints (regex validation), and runs the modifier pipeline (format, alias, max_length). Original values are preserved for audit."
|
|
2388
2937
|
}
|
|
2389
2938
|
],
|
|
2390
2939
|
mentions: ["Phase 4", "re-read", "gap filling", "confidence gate", "targeted extraction"]
|
|
@@ -2405,6 +2954,22 @@ var sections6 = [
|
|
|
2405
2954
|
{
|
|
2406
2955
|
type: "paragraph",
|
|
2407
2956
|
text: "The job detail page provides: a **progress bar** with fill rate, a **phase timeline**, the **strategy panel** (agent actions), a **filter bar** (Show All / Clean / Flagged), and **CSV export** (clean or full with metadata)."
|
|
2957
|
+
},
|
|
2958
|
+
{
|
|
2959
|
+
type: "paragraph",
|
|
2960
|
+
text: "Start your review by switching to the **Flagged** filter to focus on cells that need attention \u2014 these are values with validation warnings, low confidence, or format mismatches. Click any cell to see its full provenance, including which phase produced it and the reasoning trace. Once you are satisfied, export via **CSV** \u2014 choose the clean export for downstream systems or the full export with metadata for auditing."
|
|
2961
|
+
},
|
|
2962
|
+
{
|
|
2963
|
+
type: "paragraph",
|
|
2964
|
+
text: "The colored dots on each cell are your quickest visual indicator of data quality. Blue dots indicate graph matches from Phase 1 (highest reliability), purple dots indicate computed values, teal dots indicate agent transfers, indigo dots indicate AI extractions, and amber dots indicate lookup results or format flags. A grid dominated by blue and purple dots typically requires minimal review, while one with many indigo and amber dots may need more attention."
|
|
2965
|
+
},
|
|
2966
|
+
{
|
|
2967
|
+
type: "paragraph",
|
|
2968
|
+
text: "For large jobs with hundreds of documents, use a systematic review workflow: first address all **Flagged** rows, then spot-check a random sample of **Clean** rows to build confidence in the overall quality. If you find recurring errors in a specific field, consider updating the schema field's instruction or reference table, then run a new job \u2014 corrections you apply also feed back as training signals for future runs."
|
|
2969
|
+
},
|
|
2970
|
+
{
|
|
2971
|
+
type: "callout",
|
|
2972
|
+
text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus."
|
|
2408
2973
|
}
|
|
2409
2974
|
],
|
|
2410
2975
|
related: [
|
|
@@ -2420,6 +2985,10 @@ var sections6 = [
|
|
|
2420
2985
|
{
|
|
2421
2986
|
question: "Can I export extraction results?",
|
|
2422
2987
|
answer: "Yes. Use CSV export from the job detail page. You can export clean data only or full data with metadata including confidence scores and resolution types."
|
|
2988
|
+
},
|
|
2989
|
+
{
|
|
2990
|
+
question: "What is the most efficient way to review a large extraction run?",
|
|
2991
|
+
answer: "Start with the Flagged filter to address cells with validation warnings, low confidence, or format mismatches. Then spot-check a random sample of Clean rows. Focus corrections on recurring field-level patterns rather than individual cells."
|
|
2423
2992
|
}
|
|
2424
2993
|
],
|
|
2425
2994
|
mentions: [
|
|
@@ -2441,6 +3010,10 @@ var sections6 = [
|
|
|
2441
3010
|
type: "paragraph",
|
|
2442
3011
|
text: "Every cell carries detailed provenance. Hover a cell for confidence; click for full detail."
|
|
2443
3012
|
},
|
|
3013
|
+
{
|
|
3014
|
+
type: "paragraph",
|
|
3015
|
+
text: "When reviewing results, hover over any cell to see its **confidence score** at a glance. Click the cell to expand the full provenance panel, which shows the **resolution type**, the **phase** that produced the value, and a human-readable **reasoning trace** explaining how the value was derived. Use this information to quickly identify which cells need manual review and which can be trusted as-is."
|
|
3016
|
+
},
|
|
2444
3017
|
{
|
|
2445
3018
|
type: "param-table",
|
|
2446
3019
|
title: "Cell provenance fields",
|
|
@@ -2472,6 +3045,14 @@ var sections6 = [
|
|
|
2472
3045
|
}
|
|
2473
3046
|
]
|
|
2474
3047
|
},
|
|
3048
|
+
{
|
|
3049
|
+
type: "paragraph",
|
|
3050
|
+
text: "Confidence scores follow predictable patterns by resolution type. Graph matches from Phase 1 typically score 0.7-0.95 because they are derived from verified registry data. Reference table lookups score 0.95 for exact normalization matches, ~0.70 for fuzzy matches, and 0.50 for AI fallback. Agent-derived values from Phase 2 generally score 0.5-0.9 depending on the clarity of the source document and the specificity of the extraction instruction."
|
|
3051
|
+
},
|
|
3052
|
+
{
|
|
3053
|
+
type: "paragraph",
|
|
3054
|
+
text: "Use confidence scores to set your review threshold. Cells above 0.8 are generally reliable and can be trusted without manual verification for most use cases. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. You can use the full CSV export to filter and sort by confidence, making it easy to batch-review low-confidence cells efficiently."
|
|
3055
|
+
},
|
|
2475
3056
|
{
|
|
2476
3057
|
type: "callout",
|
|
2477
3058
|
variant: "warning",
|
|
@@ -2491,6 +3072,10 @@ var sections6 = [
|
|
|
2491
3072
|
{
|
|
2492
3073
|
question: "What is the confidence gate?",
|
|
2493
3074
|
answer: "The confidence gate prevents any later pipeline phase from overwriting a cell that was filled with confidence >= 0.7. This protects high-quality lookup results from lower-confidence agent extractions."
|
|
3075
|
+
},
|
|
3076
|
+
{
|
|
3077
|
+
question: "What confidence threshold should I use for manual review?",
|
|
3078
|
+
answer: "Cells above 0.8 are generally reliable. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. Use the CSV export to filter by confidence for efficient batch review."
|
|
2494
3079
|
}
|
|
2495
3080
|
],
|
|
2496
3081
|
mentions: [
|
|
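The confidence gate and the suggested review thresholds reduce to a few comparisons. In this sketch the 0.7 gate and the 0.8 and 0.5 review boundaries come from the text, while the function names, the cell shape, and the bucket labels are hypothetical.

```javascript
const CONFIDENCE_GATE = 0.7; // threshold stated in the FAQ

// A later phase may write a cell only if it is empty or below the gate.
function canOverwrite(cell) {
  return (
    cell == null ||
    cell.value == null ||
    cell.value === "" ||
    cell.confidence < CONFIDENCE_GATE
  );
}

// Review buckets following the guidance in the text:
// above 0.8 is trusted, 0.5 to 0.8 gets a quick check, below 0.5 is reviewed manually.
function reviewBucket(confidence) {
  if (confidence > 0.8) return "trusted";
  if (confidence >= 0.5) return "quick_check";
  return "manual_review";
}

const graphMatch = { value: "SUP-001", confidence: 0.95 };
const weakExtraction = { value: "Acme", confidence: 0.55 };
```

A cell such as `graphMatch` is protected from later phases, whereas `weakExtraction` can still be upgraded and lands in the quick-check bucket.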
@@ -2511,6 +3096,22 @@ var sections6 = [
|
|
|
2511
3096
|
{
|
|
2512
3097
|
type: "paragraph",
|
|
2513
3098
|
text: "Click any cell to edit its value. Corrections are logged with the original value, timestamp, and user. Choose a propagation scope: `this_document_only` or `all_similar` (same field + method + source field across all documents). Corrections feed back as training signals for future runs."
|
|
3099
|
+
},
|
|
3100
|
+
{
|
|
3101
|
+
type: "paragraph",
|
|
3102
|
+
text: "When correcting a value, consider using **all_similar** propagation if the same mistake appears across multiple documents \u2014 for example, a reference table code that was consistently mapped to the wrong label. This applies your fix to every document in the run that matched the same way, saving you from correcting each cell individually. The system learns from these corrections, so the same error is less likely to recur in future jobs."
|
|
3103
|
+
},
|
|
3104
|
+
{
|
|
3105
|
+
type: "paragraph",
|
|
3106
|
+
text: "Corrections create a full audit trail: the original extracted value, the corrected value, who made the change, and when. This audit log is preserved even after subsequent jobs are run, giving you a complete history of manual interventions. When you export results with the full metadata option, correction history is included so downstream systems can distinguish between AI-extracted and human-corrected values."
|
|
3107
|
+
},
|
|
3108
|
+
{
|
|
3109
|
+
type: "paragraph",
|
|
3110
|
+
text: "For best results, correct the root cause rather than individual symptoms. If a field consistently produces wrong values, update the schema field's **manual instruction** or **reference table** rather than correcting cells one by one. If a reference table code is missing, add it to the table \u2014 future runs will pick it up automatically at Tier 1 confidence (0.95). Corrections are most valuable as a feedback mechanism when they inform schema improvements."
|
|
3111
|
+
},
|
|
3112
|
+
{
|
|
3113
|
+
type: "callout",
|
|
3114
|
+
text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected."
|
|
2514
3115
|
}
|
|
2515
3116
|
],
|
|
2516
3117
|
related: [
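The two propagation scopes described in this hunk can be sketched as a simple matching rule. This is a minimal illustration, not the platform's implementation: the cell shape (`documentId`, `field`, `method`, `sourceField`) and the function names are assumptions made for the example. Only the scope names and the "same field + method + source field" rule come from the text above.

```javascript
// Hypothetical cell shape: { documentId, field, method, sourceField, value }.
// These property names are illustrative and not taken from the Talonic API.
function matchesScope(cell, target, scope) {
  if (scope === "this_document_only") {
    return cell.documentId === target.documentId && cell.field === target.field;
  }
  if (scope === "all_similar") {
    // Same field + method + source field, across all documents in the run.
    return (
      cell.field === target.field &&
      cell.method === target.method &&
      cell.sourceField === target.sourceField
    );
  }
  throw new Error(`Unknown propagation scope: ${scope}`);
}

// Count the cells a correction would touch, similar in spirit to the preview count.
function previewCorrection(cells, target, scope) {
  return cells.filter((cell) => matchesScope(cell, target, scope)).length;
}

const cells = [
  { documentId: "d1", field: "country", method: "reference_table", sourceField: "country_code", value: "Germany" },
  { documentId: "d2", field: "country", method: "reference_table", sourceField: "country_code", value: "Germany" },
  { documentId: "d3", field: "country", method: "ai_extraction", sourceField: "address", value: "Germany" },
];

console.log(previewCorrection(cells, cells[0], "all_similar")); // 2
console.log(previewCorrection(cells, cells[0], "this_document_only")); // 1
```

The third cell is skipped under `all_similar` because it was produced by a different method, even though the field and value look the same.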
|
|
@@ -2526,6 +3127,10 @@ var sections6 = [
|
|
|
2526
3127
|
{
|
|
2527
3128
|
question: "Do corrections improve future extractions?",
|
|
2528
3129
|
answer: "Yes. Corrections feed back as training signals for future runs, helping the system learn from your corrections and improve accuracy over time."
|
|
3130
|
+
},
|
|
3131
|
+
{
|
|
3132
|
+
question: "Is there an audit trail for corrections?",
|
|
3133
|
+
answer: "Yes. Every correction logs the original value, the corrected value, the user who made the change, and the timestamp. This audit history is preserved and included in full metadata CSV exports."
|
|
2529
3134
|
}
|
|
2530
3135
|
],
|
|
2531
3136
|
mentions: [
|
|
@@ -2579,6 +3184,18 @@ var sections7 = [
|
|
|
2579
3184
|
{
|
|
2580
3185
|
type: "paragraph",
|
|
2581
3186
|
text: "Most link keys are auto-classified by name patterns. Remaining ambiguous fields are classified by AI. High-frequency entities (>30% of documents) are automatically excluded from case formation."
|
|
3187
|
+
},
|
|
3188
|
+
{
|
|
3189
|
+
type: "paragraph",
|
|
3190
|
+
text: "Behind the scenes, the classification engine applies rule-based heuristics first \u2014 field names like `company_name` or `invoice_number` are recognized instantly. When heuristics are inconclusive, an AI classifier examines the field's extracted values and schema context to determine the correct category. This two-tier approach keeps classification fast for the common case while handling ambiguous fields gracefully."
|
|
3191
|
+
},
|
|
3192
|
+
{
|
|
3193
|
+
type: "paragraph",
|
|
3194
|
+
text: "Use link keys whenever your documents share identifying information that should connect them. For best results, ensure your field names follow clear naming conventions \u2014 this maximizes the hit rate of the automatic classifier and minimizes the need for manual overrides."
|
|
3195
|
+
},
|
|
3196
|
+
{
|
|
3197
|
+
type: "callout",
|
|
3198
|
+
text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest."
|
|
2582
3199
|
}
|
|
2583
3200
|
],
|
|
2584
3201
|
related: [
|
|
@@ -2594,6 +3211,10 @@ var sections7 = [
|
|
|
2594
3211
|
{
|
|
2595
3212
|
question: "Why are high-frequency entities excluded from case formation?",
|
|
2596
3213
|
answer: "Entities appearing in more than 30% of documents are too common to be meaningful connections. They are automatically excluded to prevent overly large, uninformative cases."
|
|
3214
|
+
},
|
|
3215
|
+
{
|
|
3216
|
+
question: "Can I manually classify a field as a link key?",
|
|
3217
|
+
answer: "Yes. Navigate to the Field Registry and change any field's link key category. Manual classifications take precedence over automatic ones and persist across future jobs."
|
|
2597
3218
|
}
|
|
2598
3219
|
],
|
|
2599
3220
|
mentions: [
|
|
@@ -2614,6 +3235,22 @@ var sections7 = [
|
|
|
2614
3235
|
{
|
|
2615
3236
|
type: "paragraph",
|
|
2616
3237
|
text: 'After extraction, the linking pipeline runs automatically: it extracts link key values, normalizes them (lowercasing, stripping suffixes like "Ltd", "Inc"), and builds a bipartite graph of documents ↔ entities.'
|
|
3238
|
+
},
|
|
3239
|
+
{
|
|
3240
|
+
type: "paragraph",
|
|
3241
|
+
text: 'The normalization step is critical for accurate linking. Values like "ACME Corp.", "Acme Corporation", and "acme corp" are all reduced to the same canonical form so they resolve to a single entity node. This prevents duplicate entities from fragmenting your cases and ensures documents that reference the same real-world entity are correctly connected.'
|
|
3242
|
+
},
|
|
3243
|
+
{
|
|
3244
|
+
type: "paragraph",
|
|
3245
|
+
text: "The resulting bipartite graph has two node types: documents and entities. An edge connects a document to an entity whenever the document contains that entity's value in a link key field. Connected components in this graph become the foundation for case formation \u2014 documents that share entities end up in the same case."
|
|
3246
|
+
},
|
|
3247
|
+
{
|
|
3248
|
+
type: "paragraph",
|
|
3249
|
+
text: "For best results, ensure your source documents contain consistent identifiers. The pipeline handles minor variations automatically, but wildly inconsistent naming (e.g., abbreviations vs. full legal names) may require manual link key tuning in the Field Registry."
|
|
3250
|
+
},
|
|
3251
|
+
{
|
|
3252
|
+
type: "callout",
|
|
3253
|
+
text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered."
|
|
2617
3254
|
}
|
|
2618
3255
|
],
|
|
2619
3256
|
related: [
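The case formation step described in this hunk (connected components of the document/entity graph, with very common entities left out) can be sketched with a union-find over documents. This is only a sketch under stated assumptions: the input shape, the `formCases` name, and the choice to keep single-document components are all hypothetical. The 30% cutoff is the figure quoted elsewhere in these docs.

```javascript
// links: array of [documentId, entityKey] edges of the bipartite graph.
// maxEntityShare: entities linked to more than this share of documents are ignored.
function formCases(links, maxEntityShare = 0.3) {
  const docs = new Set(links.map(([doc]) => doc));
  const docsByEntity = new Map();
  for (const [doc, entity] of links) {
    if (!docsByEntity.has(entity)) docsByEntity.set(entity, new Set());
    docsByEntity.get(entity).add(doc);
  }

  // Union-find over documents: two documents sharing an entity get merged.
  const parent = new Map([...docs].map((d) => [d, d]));
  const find = (d) => {
    while (parent.get(d) !== d) {
      parent.set(d, parent.get(parent.get(d))); // path halving
      d = parent.get(d);
    }
    return d;
  };
  const union = (a, b) => parent.set(find(a), find(b));

  for (const members of docsByEntity.values()) {
    if (members.size / docs.size > maxEntityShare) continue; // too common to be meaningful
    const [first, ...rest] = members;
    for (const other of rest) union(first, other);
  }

  // Group documents by their component root.
  const cases = new Map();
  for (const d of docs) {
    const root = find(d);
    if (!cases.has(root)) cases.set(root, []);
    cases.get(root).push(d);
  }
  return [...cases.values()]
    .map((members) => members.sort())
    .sort((a, b) => a[0].localeCompare(b[0]));
}

const docIds = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"];
const links = [
  ...docIds.map((d) => [d, "bigco"]), // appears in every document, so it is excluded
  ["d1", "po-1"], ["d2", "po-1"],
  ["d2", "inv-7"], ["d3", "inv-7"],
  ["d4", "po-2"], ["d5", "po-2"],
];

console.log(formCases(links));
// [["d1","d2","d3"], ["d4","d5"], ["d6"], ["d7"], ["d8"]]
```

Without the frequency cutoff, `bigco` would pull all eight documents into one uninformative case, which is the situation the exclusion rule is meant to avoid.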
|
|
@@ -2629,6 +3266,10 @@ var sections7 = [
|
|
|
2629
3266
|
{
|
|
2630
3267
|
question: "When does entity linking run?",
|
|
2631
3268
|
answer: "Entity linking runs automatically after document extraction. It processes link key values and builds connections without manual intervention."
|
|
3269
|
+
},
|
|
3270
|
+
{
|
|
3271
|
+
question: "What normalization does entity linking apply?",
|
|
3272
|
+
answer: "Values are lowercased, common suffixes (Ltd, Inc, Corp, etc.) are stripped, and whitespace is normalized. This ensures minor naming variations resolve to the same entity."
|
|
2632
3273
|
}
|
|
2633
3274
|
],
|
|
2634
3275
|
mentions: [
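The normalization rules listed in the FAQ answer above (lowercasing, suffix stripping, whitespace normalization) can be illustrated with a short function. The suffix list here is an assumption: the docs name "Ltd", "Inc", and "Corp" and then say "etc.", so the remaining entries and the punctuation handling are guesses made for the example.

```javascript
// Assumed suffix list; only ltd, inc, and corp are named in the docs.
const COMPANY_SUFFIXES = new Set(["ltd", "inc", "corp", "corporation", "llc", "gmbh", "co"]);

function normalizeEntity(raw) {
  const tokens = raw
    .toLowerCase()
    .replace(/[.,]/g, " ") // drop punctuation such as the dot in "Corp."
    .split(/\s+/)
    .filter(Boolean); // collapses repeated whitespace
  // Strip trailing suffix tokens, but never reduce the name to nothing.
  while (tokens.length > 1 && COMPANY_SUFFIXES.has(tokens[tokens.length - 1])) {
    tokens.pop();
  }
  return tokens.join(" ");
}

for (const name of ["ACME Corp.", "Acme Corporation", "acme corp", "  Globex   Ltd "]) {
  console.log(JSON.stringify(name), "->", JSON.stringify(normalizeEntity(name)));
}
```

The first three inputs all reduce to `"acme"`, so they would resolve to a single entity node.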
|
|
@@ -2720,6 +3361,22 @@ var sections7 = [
|
|
|
2720
3361
|
{
|
|
2721
3362
|
type: "paragraph",
|
|
2722
3363
|
text: "The Document Graph provides a visual D3-force layout of the bipartite graph. Toggle between graph and list views from the Cases page. Case templates are auto-discovered after 3+ cases form \u2014 they identify recurring document type patterns."
|
|
3364
|
+
},
|
|
3365
|
+
{
|
|
3366
|
+
type: "paragraph",
|
|
3367
|
+
text: "In the graph view, document nodes and entity nodes are rendered with distinct visual styles. Edges represent link key connections, and tightly connected clusters naturally pull together through force simulation. Hovering over a node highlights its connections, making it easy to trace how documents relate through shared entities."
|
|
3368
|
+
},
|
|
3369
|
+
{
|
|
3370
|
+
type: "paragraph",
|
|
3371
|
+
text: 'Case templates capture recurring patterns \u2014 for example, "Invoice + Purchase Order + Contract" might emerge as a common template after enough cases form. Templates include a **match threshold** that controls how closely a case must match the expected document type set. Use templates to monitor completeness: if a case is missing a document type that the template expects, an anomaly is raised.'
|
|
3372
|
+
},
|
|
3373
|
+
{
|
|
3374
|
+
type: "paragraph",
|
|
3375
|
+
text: "Most teams use the graph view during initial workspace setup to verify that linking is producing sensible clusters. Once you are confident in your link key configuration, the list view is more practical for day-to-day case review and triage."
|
|
3376
|
+
},
|
|
3377
|
+
{
|
|
3378
|
+
type: "callout",
|
|
3379
|
+
text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern."
|
|
2723
3380
|
}
|
|
2724
3381
|
],
|
|
2725
3382
|
related: [
|
|
@@ -2734,6 +3391,10 @@ var sections7 = [
|
|
|
2734
3391
|
{
|
|
2735
3392
|
question: "What are case templates?",
|
|
2736
3393
|
answer: "Case templates are auto-discovered after 3 or more cases form. They identify recurring document type patterns, helping you understand common document relationships in your workspace."
|
|
3394
|
+
},
|
|
3395
|
+
{
|
|
3396
|
+
question: "Can I switch between graph and list views?",
|
|
3397
|
+
answer: "Yes. Toggle between the visual D3-force graph and a traditional list view from the Cases page. Both views show the same underlying data \u2014 choose whichever suits your workflow."
|
|
2737
3398
|
}
|
|
2738
3399
|
],
|
|
2739
3400
|
mentions: ["document graph", "D3-force layout", "bipartite graph", "case templates"]
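Template discovery, as described in this section, amounts to counting how many cases share the same set of document types and proposing the patterns that reach the threshold of 3. The sketch below assumes a hypothetical case shape (`id`, `documentTypes`) and treats a pattern as an unordered, de-duplicated set of types. The real matching logic, including the match threshold, is not documented here.

```javascript
// cases: array of { id, documentTypes: string[] } (hypothetical shape).
function proposeTemplates(cases, minCases = 3) {
  const counts = new Map();
  for (const c of cases) {
    // Build an order-insensitive, duplicate-insensitive key for the pattern.
    const key = [...new Set(c.documentTypes)].sort().join(" + ");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts]
    .filter(([, caseCount]) => caseCount >= minCases)
    .map(([pattern, caseCount]) => ({ pattern, caseCount }));
}

const observedCases = [
  { id: "c1", documentTypes: ["Invoice", "Purchase Order", "Contract"] },
  { id: "c2", documentTypes: ["Contract", "Invoice", "Purchase Order"] },
  { id: "c3", documentTypes: ["Purchase Order", "Invoice", "Contract", "Invoice"] },
  { id: "c4", documentTypes: ["Invoice", "Receipt"] },
  { id: "c5", documentTypes: ["Invoice", "Receipt"] },
];

console.log(proposeTemplates(observedCases));
// [{ pattern: "Contract + Invoice + Purchase Order", caseCount: 3 }]
```

The "Invoice + Receipt" pattern is not proposed because only two cases share it.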
|
|
@@ -2783,6 +3444,18 @@ var sections7 = [
|
|
|
2783
3444
|
{
|
|
2784
3445
|
type: "paragraph",
|
|
2785
3446
|
text: "Anomalies appear in the **Anomalies** tab of the case detail page (Advanced mode). Each anomaly card shows severity, affected fields, and a dismiss button. Dismissed anomalies are hidden by default but visible via the **show dismissed** toggle."
|
|
3447
|
+
},
|
|
3448
|
+
{
|
|
3449
|
+
type: "paragraph",
|
|
3450
|
+
text: "The detection engine runs automatically after case formation and whenever case membership changes (documents added, removed, or cases merged). Each detector operates independently \u2014 a single case can trigger multiple anomaly types simultaneously. Anomaly counts are displayed as badges in the case header for quick triage."
|
|
3451
|
+
},
|
|
3452
|
+
{
|
|
3453
|
+
type: "paragraph",
|
|
3454
|
+
text: "Use anomaly detection to surface data quality issues that would otherwise require manual comparison across documents. For best results, configure case templates so the **Missing Document Type** detector (D4) can flag incomplete cases. Most teams find that D2 (Field Conflict) and D3 (Duplicate Key Divergence) catch the highest-value issues in procurement and financial workflows."
|
|
3455
|
+
},
|
|
3456
|
+
{
|
|
3457
|
+
type: "callout",
|
|
3458
|
+
text: "Anomaly detection requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but not displayed in the case detail page."
|
|
2786
3459
|
}
|
|
2787
3460
|
],
|
|
2788
3461
|
related: [
|
|
@@ -2794,6 +3467,14 @@ var sections7 = [
|
|
|
2794
3467
|
{
|
|
2795
3468
|
question: "What anomalies does Talonic detect?",
|
|
2796
3469
|
answer: "Five structural patterns: validation clusters, field conflicts, duplicate key divergence, missing document types, and value reuse. Each is surfaced as a dismissable card on the case detail page."
|
|
3470
|
+
},
|
|
3471
|
+
{
|
|
3472
|
+
question: "Do anomalies update automatically when cases change?",
|
|
3473
|
+
answer: "Yes. The detection engine re-runs whenever case membership changes \u2014 documents added or removed, cases merged or split. Anomaly badges in the case header update in real time."
|
|
3474
|
+
},
|
|
3475
|
+
{
|
|
3476
|
+
question: "Can I dismiss anomalies?",
|
|
3477
|
+
answer: "Yes. Each anomaly card includes a dismiss button. Dismissed anomalies are hidden by default but can be revealed using the show dismissed toggle on the Anomalies tab."
|
|
2797
3478
|
}
|
|
2798
3479
|
],
|
|
2799
3480
|
mentions: ["anomaly detection", "validation cluster", "field conflict", "duplicate key divergence", "value reuse"]
|
|
@@ -2826,6 +3507,14 @@ var sections7 = [
|
|
|
2826
3507
|
type: "paragraph",
|
|
2827
3508
|
text: "**Domain packs** extend validation with industry-specific rules. The freight domain pack includes DOT number state detection and MC number validation. Additional packs can be added to `domain-packs/` without modifying the core engine."
|
|
2828
3509
|
},
|
|
3510
|
+
{
|
|
3511
|
+
type: "paragraph",
|
|
3512
|
+
text: "Validation runs automatically after extraction and linking complete. Each field value is checked against every applicable validator \u2014 a single field can trigger multiple rules. Results are displayed as colored badges in the **Evidence** tab: green for pass, red for fail, and amber for warnings. You can filter by status, document, category, or free-text search."
|
|
3513
|
+
},
|
|
3514
|
+
{
|
|
3515
|
+
type: "paragraph",
|
|
3516
|
+
text: "The checksum validator (S7) uses a parameterized factory pattern \u2014 it accepts a checksum algorithm name and applies the corresponding verification logic. Supported algorithms include Luhn (credit card numbers), ABA (bank routing numbers), IBAN (international bank accounts), and ISBN (book identifiers). For best results, ensure your schema fields are typed correctly so the engine knows which checksum to apply."
|
|
3517
|
+
},
|
|
2829
3518
|
{
|
|
2830
3519
|
type: "callout",
|
|
2831
3520
|
text: "Evidence validation results are stored in a separate `evidence_validation_results` table keyed by (document_id, entity_id, field_key) \u2014 not in the extraction or linking tables."
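The parameterized factory pattern mentioned for the checksum validator (S7) can be illustrated as follows. This is a sketch, not the engine's code: `makeChecksumValidator` and the result shape are hypothetical names, and only two of the four listed algorithms (Luhn and ABA) are implemented here to keep the example short.

```javascript
// Luhn: double every second digit from the right, sum the digits, check mod 10.
function luhn(value) {
  const digits = value.replace(/[\s-]/g, "");
  if (!/^\d{2,}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// ABA routing numbers: nine digits weighted 3, 7, 1 repeating, sum divisible by 10.
function aba(value) {
  if (!/^\d{9}$/.test(value)) return false;
  const d = [...value].map(Number);
  const sum =
    3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8]);
  return sum % 10 === 0;
}

const ALGORITHMS = new Map([
  ["luhn", luhn],
  ["aba", aba],
]);

// Factory: pick the verification logic by algorithm name (hypothetical API).
function makeChecksumValidator(algorithm) {
  const check = ALGORITHMS.get(algorithm);
  if (!check) throw new Error(`Unsupported checksum algorithm: ${algorithm}`);
  return (value) => ({ algorithm, passed: check(String(value)) });
}

const validateCard = makeChecksumValidator("luhn");
const validateRouting = makeChecksumValidator("aba");

console.log(validateCard("4111 1111 1111 1111")); // { algorithm: "luhn", passed: true }
console.log(validateCard("79927398710")); // { algorithm: "luhn", passed: false }
console.log(validateRouting("021000021")); // { algorithm: "aba", passed: true }
```

Keeping the algorithms in a lookup table means a new checksum type only needs one new entry, which is the main appeal of the factory approach.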
|
|
@@ -2844,6 +3533,10 @@ var sections7 = [
|
|
|
2844
3533
|
{
|
|
2845
3534
|
question: "What are domain packs?",
|
|
2846
3535
|
answer: "Domain packs add industry-specific validation rules. For example, the freight domain pack validates DOT numbers and MC numbers. New packs can be added without modifying the core engine."
|
|
3536
|
+
},
|
|
3537
|
+
{
|
|
3538
|
+
question: "How are evidence validation results displayed?",
|
|
3539
|
+
answer: "Results appear as colored badges in the Evidence tab of the case detail page. Green indicates pass, red indicates fail, and amber indicates a warning. Use the filter bar to narrow results by status, document, or category."
|
|
2847
3540
|
}
|
|
2848
3541
|
],
|
|
2849
3542
|
mentions: ["evidence validation", "structural validators", "checksum", "Luhn", "IBAN", "domain packs", "freight"]
|
|
@@ -2870,6 +3563,18 @@ var sections8 = [
|
|
|
2870
3563
|
{
|
|
2871
3564
|
type: "paragraph",
|
|
2872
3565
|
text: "Navigate to **Data Products → Dataset Templates** to manage templates. Each template is linked to a user schema and can be versioned independently. When creating a new job, select a template instead of configuring the output from scratch."
|
|
3566
|
+
},
|
|
3567
|
+
{
|
|
3568
|
+
type: "paragraph",
|
|
3569
|
+
text: "Templates support column mappings that rename, reorder, or exclude fields from the output. Default transforms \u2014 such as date formatting, currency normalization, or unit conversion \u2014 are applied automatically during assembly. This means every data product built from the same template produces structurally identical output regardless of who runs it or when."
|
|
3570
|
+
},
|
|
3571
|
+
{
|
|
3572
|
+
type: "paragraph",
|
|
3573
|
+
text: "For best results, create one template per downstream consumer. If your finance team and operations team need different column subsets from the same schema, define two templates rather than manually reconfiguring each export. Most teams version their templates alongside schema changes to maintain backward compatibility with existing integrations."
|
|
3574
|
+
},
|
|
3575
|
+
{
|
|
3576
|
+
type: "callout",
|
|
3577
|
+
text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction."
|
|
2873
3578
|
}
|
|
2874
3579
|
],
|
|
2875
3580
|
related: [
|
|
@@ -2885,6 +3590,10 @@ var sections8 = [
|
|
|
2885
3590
|
{
|
|
2886
3591
|
question: "How do dataset templates relate to schemas?",
|
|
2887
3592
|
answer: "Each dataset template is linked to a user schema and can be versioned independently. When creating a new job, you can select a template instead of configuring output from scratch."
|
|
3593
|
+
},
|
|
3594
|
+
{
|
|
3595
|
+
question: "Can I version dataset templates?",
|
|
3596
|
+
answer: "Yes. Each template is versioned independently from the schema it references. This lets you evolve your output format over time without affecting existing data products built from earlier versions."
|
|
2888
3597
|
}
|
|
2889
3598
|
],
|
|
2890
3599
|
mentions: [
|
|
@@ -2910,6 +3619,14 @@ var sections8 = [
|
|
|
2910
3619
|
type: "paragraph",
|
|
2911
3620
|
text: "Navigate to **Data Products → Assemblies** to view and create assemblies. Each assembly shows its document count, linked schema, processing status, and the date it was created."
|
|
2912
3621
|
},
|
|
3622
|
+
{
|
|
3623
|
+
type: "paragraph",
|
|
3624
|
+
text: "When you create an assembly, you select a dataset template and one or more document sources. The system pulls all matching documents, applies the template's column mappings and transforms, and produces a single structured output. The assembly tracks which documents contributed to each row, giving you full traceability from output back to source."
|
|
3625
|
+
},
|
|
3626
|
+
{
|
|
3627
|
+
type: "paragraph",
|
|
3628
|
+
text: "Use assemblies whenever you need a repeatable, auditable output for downstream systems or stakeholders. Most teams create one assembly per reporting period or delivery cycle. Because assemblies reference a template, you can regenerate the same output shape from different document sets without reconfiguring columns or transforms each time."
|
|
3629
|
+
},
|
|
2913
3630
|
{
|
|
2914
3631
|
type: "callout",
|
|
2915
3632
|
text: "Assemblies are the recommended way to produce production datasets. They provide a single audit trail from source documents through extraction, resolution, and validation to the final output."
|
|
@@ -2928,6 +3645,10 @@ var sections8 = [
|
|
|
2928
3645
|
{
|
|
2929
3646
|
question: "Why should I use assemblies for production data?",
|
|
2930
3647
|
answer: "Assemblies provide a single audit trail from source documents through extraction, resolution, and validation to the final output, making them the recommended approach for production datasets."
|
|
3648
|
+
},
|
|
3649
|
+
{
|
|
3650
|
+
question: "Can an assembly pull from multiple sources?",
|
|
3651
|
+
answer: "Yes. An assembly can combine documents from any number of sources \u2014 uploaded files, connected drives, email attachments, and more \u2014 into a single structured dataset."
|
|
2931
3652
|
}
|
|
2932
3653
|
],
|
|
2933
3654
|
mentions: [
|
|
@@ -2973,6 +3694,18 @@ var sections8 = [
|
|
|
2973
3694
|
{
|
|
2974
3695
|
type: "paragraph",
|
|
2975
3696
|
text: "ID rules are persisted before generating IDs. Navigate to a data product detail page and use **Apply ID Rules** to generate or **Regenerate IDs** to refresh."
|
|
3697
|
+
},
|
|
3698
|
+
{
|
|
3699
|
+
type: "paragraph",
|
|
3700
|
+
text: 'Resolution maps normalize field values before they become part of the ID. For example, a resolution map can collapse "ACME Corp", "ACME Corporation", and "Acme" into a single canonical value "ACME". This prevents duplicate IDs for rows that refer to the same real-world entity under different names.'
|
|
3701
|
+
},
|
|
3702
|
+
{
|
|
3703
|
+
type: "paragraph",
|
|
3704
|
+
text: 'For best results, choose source fields with high uniqueness \u2014 contract numbers or invoice IDs work well, while generic fields like "status" do not. When your documents contain multiple candidate identifiers, configure a fallback chain so the dispenser always has a value to work with. Most teams use the primary reference number as the source field and the document name as the first fallback.'
|
|
3705
|
+
},
|
|
3706
|
+
{
|
|
3707
|
+
type: "callout",
|
|
3708
|
+
text: "ID generation is deterministic \u2014 running **Regenerate IDs** with the same rules and data always produces the same output. This makes ID dispensers safe to re-run without breaking downstream references."
|
|
2976
3709
|
}
|
|
2977
3710
|
],
|
|
2978
3711
|
related: [
|
|
@@ -2984,6 +3717,14 @@ var sections8 = [
|
|
|
2984
3717
|
{
|
|
2985
3718
|
question: "How do ID dispensers handle missing field values?",
|
|
2986
3719
|
answer: "When the source field is empty, the dispenser tries each field in the fallback chain in order. If all are empty, it generates a prefix-less sequential ID."
|
|
3720
|
+
},
|
|
3721
|
+
{
|
|
3722
|
+
question: "What is a resolution map?",
|
|
3723
|
+
answer: 'A resolution map is a key-value lookup that normalizes field values before ID generation. For example, it can collapse "ACME Corp" and "ACME Corporation" into "ACME" to prevent duplicate IDs for the same entity.'
|
|
3724
|
+
},
|
|
3725
|
+
{
|
|
3726
|
+
question: "Can I regenerate IDs without losing data?",
|
|
3727
|
+
answer: "Yes. Regenerating IDs only updates the ID column \u2014 all other data product values remain unchanged. The operation is deterministic, so the same rules and data always produce the same IDs."
|
|
2987
3728
|
}
|
|
2988
3729
|
],
|
|
2989
3730
|
mentions: ["ID dispenser", "unique identifiers", "fallback chain", "resolution map"]
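How a source field, a fallback chain, and a resolution map might combine before ID generation is sketched below. The function name, the argument order, and the row shape are assumptions made for illustration. The sketch deliberately stops at picking the value and does not invent an ID format, since the docs do not specify one here.

```javascript
// row: plain object of field values.
// resolutionMap: Map of raw value -> canonical value (a key-value lookup).
function resolveIdSource(row, sourceField, fallbackChain, resolutionMap = new Map()) {
  for (const field of [sourceField, ...fallbackChain]) {
    const raw = row[field];
    if (raw === undefined || raw === null || String(raw).trim() === "") continue;
    const value = String(raw).trim();
    // Normalize through the resolution map when an entry exists.
    return { field, value: resolutionMap.get(value) ?? value };
  }
  // Every candidate was empty. Per the FAQ, the dispenser then falls back
  // to a prefix-less sequential ID, which is outside this sketch.
  return null;
}

const resolutionMap = new Map([
  ["ACME Corp", "ACME"],
  ["ACME Corporation", "ACME"],
  ["Acme", "ACME"],
]);

console.log(resolveIdSource({ vendor: "ACME Corporation" }, "vendor", ["document_name"], resolutionMap));
// { field: "vendor", value: "ACME" }
console.log(resolveIdSource({ vendor: "", document_name: "invoice-17.pdf" }, "vendor", ["document_name"], resolutionMap));
// { field: "document_name", value: "invoice-17.pdf" }
```

Because the function is a pure lookup over its inputs, running it twice with the same rules and data gives the same result, which is consistent with the deterministic behavior described for **Regenerate IDs**.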
|
|
@@ -3042,6 +3783,10 @@ var sections8 = [
|
|
|
3042
3783
|
{
|
|
3043
3784
|
question: "Does CSV export preserve leading zeros?",
|
|
3044
3785
|
answer: "Yes. All CSV exports preserve leading zeros and long numbers \u2014 values are never coerced to numeric types."
|
|
3786
|
+
},
|
|
3787
|
+
{
|
|
3788
|
+
question: "What is auto-resolve singles?",
|
|
3789
|
+
answer: "Auto-resolve singles automatically accepts fields that have only one candidate value, removing them from the manual review queue. Combined with auto-review, this significantly reduces the volume of items requiring human attention."
|
|
3045
3790
|
}
|
|
3046
3791
|
],
|
|
3047
3792
|
mentions: ["share token", "delivery website", "CSV export", "auto-review", "auto-resolve"]
|
|
@@ -3064,6 +3809,22 @@ var sections9 = [
|
|
|
3064
3809
|
{
|
|
3065
3810
|
type: "paragraph",
|
|
3066
3811
|
text: "Schema-level quality rules run during Phase 3 of every job. Rule types: field format, value range, cross-field consistency, and AI-proposed coherence rules. Rules can be AI-proposed after a job completes, then reviewed and approved before activation."
|
|
3812
|
+
},
|
|
3813
|
+
{
|
|
3814
|
+
type: "paragraph",
|
|
3815
|
+
text: "**Field format** checks verify that values match an expected pattern (e.g., dates in ISO format, phone numbers with country codes). **Value range** checks ensure numeric or date values fall within acceptable bounds. **Cross-field consistency** checks compare two or more fields on the same record \u2014 for example, verifying that a start date precedes an end date."
|
|
3816
|
+
},
|
|
3817
|
+
{
|
|
3818
|
+
type: "paragraph",
|
|
3819
|
+
text: "AI-proposed coherence rules are generated by analyzing patterns in completed job results. The system identifies relationships that hold across most records and proposes them as candidate rules. You review each proposal in the validation settings before it becomes active \u2014 no AI-generated rule runs without explicit approval."
|
|
3820
|
+
},
|
|
3821
|
+
{
|
|
3822
|
+
type: "paragraph",
|
|
3823
|
+
text: "For best results, start with a small set of high-confidence rules and expand over time. Most teams begin with field format checks for critical identifiers (invoice numbers, dates, amounts) and add cross-field consistency rules as they learn their data patterns. Validation failures do not block extraction \u2014 they flag records for review."
|
|
3824
|
+
},
|
|
3825
|
+
{
|
|
3826
|
+
type: "callout",
|
|
3827
|
+
text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently."
|
|
3067
3828
|
}
|
|
3068
3829
|
],
|
|
3069
3830
|
related: [
|
|
@@ -3079,6 +3840,10 @@ var sections9 = [
|
|
|
3079
3840
|
{
|
|
3080
3841
|
question: "Can AI suggest validation rules?",
|
|
3081
3842
|
answer: "Yes. After a job completes, AI can propose coherence rules based on the data. You review and approve these rules before they are activated."
|
|
3843
|
+
},
|
|
3844
|
+
{
|
|
3845
|
+
question: "Do validation failures block extraction?",
|
|
3846
|
+
answer: "No. Validation checks flag records for review but do not prevent extraction from completing. Failed records appear in the Approval Queue for manual inspection."
|
|
3082
3847
|
}
|
|
3083
3848
|
],
|
|
3084
3849
|
mentions: [
|
|
@@ -3098,6 +3863,22 @@ var sections9 = [
|
|
|
3098
3863
|
{
|
|
3099
3864
|
type: "paragraph",
|
|
3100
3865
|
text: "Golden samples are manually created reference datasets with known-correct values. Create them from **Validation → Golden Samples**. Benchmark runs compare extraction results against golden samples to produce per-field accuracy scores with AI judge verdicts."
|
|
3866
|
+
},
|
|
3867
|
+
{
|
|
3868
|
+
type: "paragraph",
|
|
3869
|
+
text: "To create a golden sample, select a document and manually enter the correct value for each field. The system stores these known-correct values as the ground truth baseline. When you run a benchmark, the extraction pipeline processes the same document independently, and the results are compared field by field against your golden sample."
|
|
3870
|
+
},
|
|
3871
|
+
{
|
|
3872
|
+
type: "paragraph",
|
|
3873
|
+
text: 'Benchmark scoring uses an AI judge to evaluate each field comparison. The judge accounts for semantic equivalence \u2014 for example, "United States" and "US" may be scored as a match depending on the field type. Per-field accuracy scores let you identify exactly which fields are underperforming and need schema or instruction tuning.'
|
|
3874
|
+
},
|
|
3875
|
+
{
|
|
3876
|
+
type: "paragraph",
|
|
3877
|
+
text: "For best results, create golden samples from a representative mix of document types and complexity levels. Most teams maintain 5-10 golden samples per schema and re-run benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time."
|
|
3878
|
+
},
|
|
3879
|
+
{
|
|
3880
|
+
type: "callout",
|
|
3881
|
+
text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed."
|
|
3101
3882
|
}
|
|
3102
3883
|
],
|
|
3103
3884
|
related: [
|
|
@@ -3113,6 +3894,10 @@ var sections9 = [
|
|
|
3113
3894
|
{
|
|
3114
3895
|
question: "How do benchmark runs work?",
|
|
3115
3896
|
answer: "Benchmark runs compare extraction results against golden samples, producing per-field accuracy scores with AI judge verdicts to measure extraction quality."
|
|
3897
|
+
},
|
|
3898
|
+
{
|
|
3899
|
+
question: "How many golden samples should I create?",
|
|
3900
|
+
answer: "Most teams maintain 5-10 golden samples per schema, covering a representative mix of document types and complexity levels. Re-run benchmarks after schema changes or model upgrades to track quality trends."
|
|
3116
3901
|
}
|
|
3117
3902
|
],
|
|
3118
3903
|
mentions: ["golden samples", "ground truth", "benchmark runs", "accuracy scoring", "AI judge"]
|
|
@@ -3128,6 +3913,18 @@ var sections9 = [
|
|
|
3128
3913
|
type: "paragraph",
|
|
3129
3914
|
text: "Approval gates are threshold-based rules for auto-approving or flagging results. Configure them per schema with three criteria: minimum confidence, validation pass rate, and field coverage. Results meeting all thresholds are auto-approved; others go to the manual review queue."
|
|
3130
3915
|
},
|
|
3916
|
+
{
|
|
3917
|
+
type: "paragraph",
|
|
3918
|
+
text: "Each criterion acts as an independent gate. **Minimum confidence** sets the lowest acceptable extraction confidence score. **Validation pass rate** requires a minimum percentage of validation checks to pass. **Field coverage** ensures that a minimum percentage of schema fields have non-empty values. A result must clear all three gates to be auto-approved."
|
|
3919
|
+
},
|
|
3920
|
+
{
|
|
3921
|
+
type: "paragraph",
|
|
3922
|
+
text: "Start with conservative thresholds \u2014 high confidence, high pass rate, high coverage \u2014 and loosen them as you gain trust in your extraction pipeline. Most teams begin with 90% confidence, 95% validation pass rate, and 80% field coverage, then adjust based on the volume of false positives in the approval queue."
|
|
3923
|
+
},
|
|
3924
|
+
{
|
|
3925
|
+
type: "paragraph",
|
|
3926
|
+
text: "Approval gates integrate directly with the delivery pipeline. When a result passes all gates, a `result.approved` signal is emitted automatically. Bind this signal to a destination to create a fully automated flow from document upload through extraction, validation, approval, and delivery \u2014 no manual steps required for high-confidence results."
|
|
3927
|
+
},
|
|
3131
3928
|
{
|
|
3132
3929
|
type: "callout",
|
|
3133
3930
|
text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems."
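The three independent gates described in this hunk can be sketched as a single evaluation function. Everything about the data shape (`confidence`, `validationChecks`, `schemaFields`, `values`) is an assumption for the example. The default thresholds are the starting values the text attributes to most teams, and the handling of a result with zero validation checks is a guess.

```javascript
// Starting thresholds quoted in the docs: 90% confidence, 95% pass rate, 80% coverage.
const DEFAULT_THRESHOLDS = {
  minConfidence: 0.9,
  minValidationPassRate: 0.95,
  minFieldCoverage: 0.8,
};

// result: { confidence, validationChecks: [{ passed }], schemaFields: string[], values: {} }
// (hypothetical shape, not taken from the Talonic API)
function evaluateGates(result, thresholds = DEFAULT_THRESHOLDS) {
  const checks = result.validationChecks;
  // Assumption: a result with no validation checks counts as fully passing.
  const passRate =
    checks.length === 0 ? 1 : checks.filter((c) => c.passed).length / checks.length;

  const filled = result.schemaFields.filter((f) => {
    const v = result.values[f];
    return v !== undefined && v !== null && String(v).trim() !== "";
  }).length;
  const coverage = result.schemaFields.length === 0 ? 0 : filled / result.schemaFields.length;

  const failedGates = [];
  if (result.confidence < thresholds.minConfidence) failedGates.push("confidence");
  if (passRate < thresholds.minValidationPassRate) failedGates.push("validationPassRate");
  if (coverage < thresholds.minFieldCoverage) failedGates.push("fieldCoverage");

  // A result must clear every gate to be auto-approved.
  return { approved: failedGates.length === 0, failedGates };
}

const schemaFields = ["invoice_number", "date", "amount", "vendor", "currency"];
const strong = {
  confidence: 0.97,
  validationChecks: [{ passed: true }, { passed: true }, { passed: true }, { passed: true }],
  schemaFields,
  values: { invoice_number: "INV-1", date: "2024-01-31", amount: "120.00", vendor: "ACME", currency: "EUR" },
};
const weak = {
  confidence: 0.97,
  validationChecks: [{ passed: true }, { passed: true }, { passed: true }, { passed: false }],
  schemaFields,
  values: { invoice_number: "INV-2", date: "", amount: "75.50", vendor: "ACME" },
};

console.log(evaluateGates(strong)); // { approved: true, failedGates: [] }
console.log(evaluateGates(weak)); // { approved: false, failedGates: ["validationPassRate", "fieldCoverage"] }
```

Returning the list of failed gates, rather than a bare boolean, makes it easier to see why a result landed in the manual review queue.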
|
|
@@ -3146,6 +3943,10 @@ var sections9 = [
|
|
|
3146
3943
|
{
|
|
3147
3944
|
question: "How do approval gates connect to delivery?",
|
|
3148
3945
|
answer: "Bind a result.approved signal to a delivery destination to only ship approved rows to your downstream systems. This ensures only quality-checked data is delivered."
|
|
3946
|
+
},
|
|
3947
|
+
{
|
|
3948
|
+
question: "What thresholds should I start with?",
|
|
3949
|
+
answer: "Most teams start with 90% confidence, 95% validation pass rate, and 80% field coverage. Adjust based on the volume of false positives in the approval queue \u2014 loosen thresholds as you gain trust in your pipeline."
|
|
3149
3950
|
}
|
|
3150
3951
|
],
|
|
3151
3952
|
mentions: [
|
|
@@ -3165,11 +3966,27 @@ var sections9 = [
|
|
|
3165
3966
|
content: [
|
|
3166
3967
|
{
|
|
3167
3968
|
type: "paragraph",
|
|
3168
|
-
text: "Results that do not meet auto-approval thresholds are routed to the Approval Queue. Navigate to **Review → Approval Queue** to see all pending items. Each row shows the source document, schema, confidence score, and whether the result has been flagged by a validation rule."
|
|
3969
|
+
text: "Results that do not meet auto-approval thresholds are routed to the Approval Queue. Navigate to **Review → Approval Queue** to see all pending items. Each row shows the source document, schema, confidence score, and whether the result has been flagged by a validation rule."
|
|
3970
|
+
},
|
|
3971
|
+
{
|
|
3972
|
+
type: "paragraph",
|
|
3973
|
+
text: 'Filter the queue by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect the extracted values, provenance trails, and validation check results before approving or rejecting.'
|
|
3974
|
+
},
|
|
3975
|
+
{
|
|
3976
|
+
type: "paragraph",
|
|
3977
|
+
text: "The review detail view shows the extracted values alongside the source document, with provenance trails tracing each value back to its origin in the text. Validation check results are displayed inline \u2014 you can see exactly which rules passed and which failed before making your decision. Batch actions are available for approving or rejecting multiple items at once."
|
|
3169
3978
|
},
|
|
3170
3979
|
{
|
|
3171
3980
|
type: "paragraph",
|
|
3172
|
-
text:
|
|
3981
|
+
text: "When you approve a result, a `result.approved` signal is emitted to the delivery pipeline. When you reject a result, a `result.rejected` signal fires instead. This event-driven design lets you build automated workflows that respond to review decisions \u2014 for example, routing approved records to a webhook and rejected records to a notification channel."
|
|
3982
|
+
},
|
|
3983
|
+
{
|
|
3984
|
+
type: "paragraph",
|
|
3985
|
+
text: "For best results, review flagged items first \u2014 these are records where at least one validation check failed, making them the most likely to contain errors. Most teams assign a daily review cadence and use confidence range filters to prioritize low-confidence items that need the most attention."
|
|
3986
|
+
},
|
|
3987
|
+
{
|
|
3988
|
+
type: "callout",
|
|
3989
|
+
text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click."
|
|
3173
3990
|
}
|
|
3174
3991
|
],
|
|
3175
3992
|
related: [
|
|
@@ -3185,6 +4002,10 @@ var sections9 = [
|
|
|
3185
4002
|
{
|
|
3186
4003
|
question: "How do I review items in the Approval Queue?",
|
|
3187
4004
|
answer: 'Filter by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect extracted values, provenance trails, and validation check results before approving or rejecting.'
|
|
4005
|
+
},
|
|
4006
|
+
{
|
|
4007
|
+
question: "Can I batch approve or reject items?",
|
|
4008
|
+
answer: "Yes. Select multiple items in the queue and use the batch action buttons to approve or reject them all at once. Each item emits the appropriate delivery signal individually."
|
|
3188
4009
|
}
|
|
3189
4010
|
],
|
|
3190
4011
|
mentions: [
|
|
@@ -3249,6 +4070,18 @@ var sections10 = [
|
|
|
3249
4070
|
{
|
|
3250
4071
|
type: "paragraph",
|
|
3251
4072
|
text: "Every attempt is logged in `delivery_items`. Terminal failures (retry exhausted or permanent 4xx) write a `delivery_dead_letter` row, which is replayable. The outbox, history, DLQ, and catalog are all accessible via the [`/v1/delivery/*` API](/docs)."
|
|
4073
|
+
},
|
|
4074
|
+
{
|
|
4075
|
+
type: "paragraph",
|
|
4076
|
+
text: "The four registries \u2014 signals, deliverables, serializers, and connectors \u2014 are fully orthogonal. Adding a new destination type does not require changes to the signal or serializer code. This composable design means you can mix any supported signal with any compatible serializer and connector without custom integration work."
|
|
4077
|
+
},
|
|
4078
|
+
{
|
|
4079
|
+
type: "paragraph",
|
|
4080
|
+
text: "For best results, start with a webhook destination to verify your binding configuration end-to-end. Once the payload shape and delivery cadence match your expectations, expand to file-based destinations (S3, SFTP) or spreadsheet destinations (Google Sheets). Most teams create separate bindings for different downstream consumers rather than routing all events to a single destination."
|
|
4081
|
+
},
|
|
4082
|
+
{
|
|
4083
|
+
type: "callout",
|
|
4084
|
+
text: "Delivery is at-least-once with deterministic idempotency keys. Receivers should use the `X-Talonic-Idempotency-Key` header (or equivalent metadata for file-based connectors) to deduplicate on their end."
|
|
3252
4085
|
}
|
|
3253
4086
|
],
|
|
3254
4087
|
related: [
|
|
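Editor's sketch for the at-least-once callout in the hunk above: a receiver can deduplicate on the `X-Talonic-Idempotency-Key` header named in the docs text. Everything else here is an assumption made for illustration: the `handleDelivery` and `processPayload` names, the returned status objects, and the in-memory `Set` (a real receiver would more likely use a durable store with an expiry).

```javascript
// Receiver-side deduplication sketch. Only the header name comes from the docs
// text; the function names and the in-memory store are hypothetical.
const seenKeys = new Set(); // assumption: swap for a durable store in production

function handleDelivery(headers, body, processPayload) {
  // Node lowercases incoming HTTP header names, so look up the lowercase form.
  const key = headers["x-talonic-idempotency-key"];
  if (!key) {
    return { status: 400, handled: false }; // no key: cannot deduplicate safely
  }
  if (seenKeys.has(key)) {
    return { status: 200, handled: false }; // duplicate: acknowledge, do not reprocess
  }
  processPayload(body);
  seenKeys.add(key); // record the key only after processing succeeded
  return { status: 200, handled: true };
}
```

Recording the key after processing means a payload whose handler throws is processed again on the next retry, which matches at-least-once semantics.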
@@ -3264,6 +4097,10 @@ var sections10 = [
|
|
|
3264
4097
|
{
|
|
3265
4098
|
question: "What happens when a delivery fails?",
|
|
3266
4099
|
answer: "Failed deliveries retry with a backoff ladder. Terminal failures (retry exhausted or permanent 4xx) are written to the dead-letter queue (DLQ), which is fully replayable."
|
|
4100
|
+
},
|
|
4101
|
+
{
|
|
4102
|
+
question: "What serialization formats are supported?",
|
|
4103
|
+
answer: "Ten formats: json, ndjson, csv, csv_file, xlsx, rows, graph, raw, md, and txt. Each serializer declares which deliverable shapes it supports, and the compatibility triangle validates the combination at binding creation time."
|
|
3267
4104
|
}
|
|
3268
4105
|
],
|
|
3269
4106
|
mentions: [
|
|
@@ -3322,6 +4159,22 @@ var sections10 = [
|
|
|
3322
4159
|
description: "Slice 2+. Structured data as email attachment."
|
|
3323
4160
|
}
|
|
3324
4161
|
]
|
|
4162
|
+
},
|
|
4163
|
+
{
|
|
4164
|
+
type: "paragraph",
|
|
4165
|
+
text: "Each destination stores its connector type, configuration (URL, bucket, folder path), and optional authentication credentials. Webhook destinations support HMAC-SHA256 signing via a **signing secret** \u2014 every payload includes a signature header so your receiver can verify authenticity. File-based destinations (S3, SFTP, Google Drive) support configurable filename templates with token substitution for binding ID, timestamp, and idempotency key."
|
|
4166
|
+
},
|
|
4167
|
+
{
|
|
4168
|
+
type: "paragraph",
|
|
4169
|
+
text: "A single destination can back multiple bindings. For example, one S3 bucket destination can receive both `document.extracted` and `result.approved` events through separate bindings, each with its own serializer and field map. This keeps your destination inventory small while supporting diverse routing requirements."
|
|
4170
|
+
},
|
|
4171
|
+
{
|
|
4172
|
+
type: "paragraph",
|
|
4173
|
+
text: "For best results, always run a live-ping test after creating a destination. The test exercises the full transport envelope \u2014 SSRF validation, payload cap, and authentication \u2014 with a tiny test payload, so you catch configuration errors before real events start flowing. OAuth-based destinations (Google Drive, Google Sheets) require connecting your account first via the OAuth flow in the dashboard."
|
|
4174
|
+
},
|
|
4175
|
+
{
|
|
4176
|
+
type: "callout",
|
|
4177
|
+
text: "Destinations can be disabled without deleting them. Set **is_active** to false and no bindings will route events to the destination until you re-enable it."
|
|
3325
4178
|
}
|
|
3326
4179
|
],
|
|
3327
4180
|
related: [
|
|
@@ -3337,6 +4190,10 @@ var sections10 = [
|
|
|
3337
4190
|
{
|
|
3338
4191
|
question: "How do I test a destination?",
|
|
3339
4192
|
answer: "Every destination supports a live-ping test via POST /v1/delivery/destinations/:id/test that exercises the full transport envelope with a tiny test payload."
|
|
4193
|
+
},
|
|
4194
|
+
{
|
|
4195
|
+
question: "Can one destination serve multiple bindings?",
|
|
4196
|
+
answer: "Yes. A single destination can back any number of bindings, each with its own signal filter, serializer, and field map. This lets you route different event types to the same endpoint with different payload shapes."
|
|
3340
4197
|
}
|
|
3341
4198
|
],
|
|
3342
4199
|
mentions: [
|
|
@@ -3363,6 +4220,22 @@ var sections10 = [
|
|
|
3363
4220
|
{
|
|
3364
4221
|
type: "paragraph",
|
|
3365
4222
|
text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. Optional `delivery_policy` overrides the default retry ladder (6 attempts at `5s, 30s, 2min, 10min, 1h`) and timeout."
|
|
4223
|
+
},
|
|
4224
|
+
{
|
|
4225
|
+
type: "paragraph",
|
|
4226
|
+
text: "The compatibility triangle is enforced on every create and update. The backend checks that your chosen serializer supports the deliverable resolver's output shape, and that the connector accepts the serializer's format. If any predicate fails, the binding is rejected with a descriptive error \u2014 you never end up with a binding that cannot deliver."
|
|
4227
|
+
},
|
|
4228
|
+
{
|
|
4229
|
+
type: "paragraph",
|
|
4230
|
+
text: 'Use `field_map` to tailor the payload for each downstream consumer. **Rename** rules map internal field names to the receiver\'s expected names. **Drop** rules exclude fields the receiver does not need. **Static** rules inject constant values (e.g., a `source: "talonic"` tag) into every payload. These three operations compose in order: drop first, then rename, then static injection.'
|
|
4231
|
+
},
|
|
4232
|
+
{
|
|
4233
|
+
type: "paragraph",
|
|
4234
|
+
text: "For best results, create one binding per downstream consumer per event type. This gives you independent control over payload shape, retry policy, and serialization format for each integration point. Most teams start with a `document.extracted` binding to a webhook and expand to run-level and approval signals as their pipeline matures."
|
|
4235
|
+
},
|
|
4236
|
+
{
|
|
4237
|
+
type: "callout",
|
|
4238
|
+
text: "The binding editor in the dashboard walks you through the compatibility triangle step by step \u2014 only showing serializers and deliverables that are compatible with your chosen signal and destination."
|
|
3366
4239
|
}
|
|
3367
4240
|
],
|
|
3368
4241
|
related: [
|
|
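Editor's sketch for the `field_map` paragraph in the hunk above, which states that the three operations compose in order: drop, then rename, then static injection. The rule shape used here (`{ drop: [...], rename: {...}, static: {...} }`) and the `applyFieldMap` name are assumptions for illustration; the actual `field_map` schema is not shown in this diff.

```javascript
// Illustrates the documented order: drop first, then rename, then static injection.
// The fieldMap shape is hypothetical, not the platform's real schema.
function applyFieldMap(payload, fieldMap = {}) {
  const { drop = [], rename = {}, static: staticFields = {} } = fieldMap;
  const dropped = new Set(drop);
  const result = {};
  for (const [name, value] of Object.entries(payload)) {
    if (dropped.has(name)) continue; // 1. drop
    const hasRule = Object.prototype.hasOwnProperty.call(rename, name);
    result[hasRule ? rename[name] : name] = value; // 2. rename
  }
  return { ...result, ...staticFields }; // 3. static values are injected last
}
```

Because static values are applied last, a static key in this sketch overrides any payload field that ends up with the same name.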
@@ -3378,6 +4251,10 @@ var sections10 = [
|
|
|
3378
4251
|
{
|
|
3379
4252
|
question: "Can I customize the delivery payload?",
|
|
3380
4253
|
answer: "Yes. Use field_map to rename, drop, or add static fields without custom code. Use delivery_policy to override the default retry ladder and timeout."
|
|
4254
|
+
},
|
|
4255
|
+
{
|
|
4256
|
+
question: "What is the compatibility triangle?",
|
|
4257
|
+
answer: "The compatibility triangle validates that the signal, deliverable resolver, serializer, and connector all form a compatible combination. The backend enforces this on every binding create and update to prevent misconfigured delivery routes."
|
|
3381
4258
|
}
|
|
3382
4259
|
],
|
|
3383
4260
|
mentions: [
|
|
@@ -3460,6 +4337,22 @@ var sections10 = [
|
|
|
3460
4337
|
description: "Fired after a terminal delivery failure."
|
|
3461
4338
|
}
|
|
3462
4339
|
]
|
|
4340
|
+
},
|
|
4341
|
+
{
|
|
4342
|
+
type: "paragraph",
|
|
4343
|
+
text: "Signals are typed events emitted by the platform when meaningful state changes occur. Document-level signals fire on extraction success or failure. Run-level signals fire when a job completes across dataspace, structuring, resolution, or extraction runs. Result-level signals fire when a reviewer approves, rejects, or flags a record."
|
|
4344
|
+
},
|
|
4345
|
+
{
|
|
4346
|
+
type: "paragraph",
|
|
4347
|
+
text: "The two `delivery.item.*` entries are **meta-signals** \u2014 they fire when a delivery itself succeeds or fails. Use them for self-monitoring: bind `delivery.item.failed` to a notification webhook to receive alerts when deliveries break. The poller includes built-in loop prevention so a failed meta-signal delivery does not emit another meta-signal."
|
|
4348
|
+
},
|
|
4349
|
+
{
|
|
4350
|
+
type: "paragraph",
|
|
4351
|
+
text: "For best results, use the catalog API to populate dropdown menus and configuration forms rather than hardcoding signal or deliverable lists. The catalog always reflects the running registry contents, so new signal types and deliverables appear automatically as the platform evolves."
|
|
4352
|
+
},
|
|
4353
|
+
{
|
|
4354
|
+
type: "callout",
|
|
4355
|
+
text: "The catalog API exposes four endpoints: `/v1/delivery/catalog/signals`, `/v1/delivery/catalog/deliverables`, `/v1/delivery/catalog/serializers`, and `/v1/delivery/catalog/connectors`. Each returns the full registry for that category."
|
|
3463
4356
|
}
|
|
3464
4357
|
],
|
|
3465
4358
|
related: [
|
|
@@ -3475,6 +4368,10 @@ var sections10 = [
|
|
|
3475
4368
|
{
|
|
3476
4369
|
question: "How do I discover available signals and deliverables?",
|
|
3477
4370
|
answer: "Use the catalog API at /v1/delivery/catalog/* which exposes the four registries (signals, deliverables, serializers, connectors) that drive the binding picker."
|
|
4371
|
+
},
|
|
4372
|
+
{
|
|
4373
|
+
question: "What are meta-signals?",
|
|
4374
|
+
answer: "Meta-signals (delivery.item.completed and delivery.item.failed) fire when a delivery attempt itself succeeds or fails. Use them for self-monitoring \u2014 for example, binding delivery.item.failed to a notification webhook for delivery failure alerts."
|
|
3478
4375
|
}
|
|
3479
4376
|
],
|
|
3480
4377
|
mentions: [
|
|
@@ -3495,6 +4392,22 @@ var sections10 = [
|
|
|
3495
4392
|
{
|
|
3496
4393
|
type: "paragraph",
|
|
3497
4394
|
text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies. Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt with a fresh idempotency key. Nothing in history is ever mutated; the log is strictly append-only."
|
|
4395
|
+
},
|
|
4396
|
+
{
|
|
4397
|
+
type: "paragraph",
|
|
4398
|
+
text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration."
|
|
4399
|
+
},
|
|
4400
|
+
{
|
|
4401
|
+
type: "paragraph",
|
|
4402
|
+
text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error, fix the destination configuration, and replay the delivery with a single click or API call."
|
|
4403
|
+
},
|
|
4404
|
+
{
|
|
4405
|
+
type: "paragraph",
|
|
4406
|
+
text: "For best results, monitor the DLQ regularly and set up a `delivery.item.failed` meta-signal binding to receive alerts when deliveries fail terminally. Most teams configure a notification webhook for this signal so they are notified immediately rather than discovering failures during a manual review. Request and response bodies older than the configured retention period are automatically cleaned up, but row metadata (status, error code, duration) is retained indefinitely."
|
|
4407
|
+
},
|
|
4408
|
+
{
|
|
4409
|
+
type: "callout",
|
|
4410
|
+
text: "Replay is safe to run multiple times. The idempotency key is deterministic \u2014 receivers that deduplicate on the key will not process the same delivery twice, even after multiple replays."
|
|
3498
4411
|
}
|
|
3499
4412
|
],
|
|
3500
4413
|
related: [
|
|
@@ -3510,6 +4423,10 @@ var sections10 = [
|
|
|
3510
4423
|
{
|
|
3511
4424
|
question: "What is the dead letter queue (DLQ)?",
|
|
3512
4425
|
answer: "Terminal failures (retry ladder exhausted or permanent 4xx) escalate to /v1/delivery/dlq. DLQ entries are fully replayable \u2014 replay enqueues a fresh attempt with a new idempotency key."
|
|
4426
|
+
},
|
|
4427
|
+
{
|
|
4428
|
+
question: "How long are request and response bodies retained?",
|
|
4429
|
+
answer: "Request and response bodies are cleaned up after the configured retention period (default 30 days). Row metadata \u2014 status, HTTP code, error code, and duration \u2014 is retained indefinitely for audit purposes."
|
|
3513
4430
|
}
|
|
3514
4431
|
],
|
|
3515
4432
|
mentions: [
|
|
@@ -3544,6 +4461,10 @@ var sections11 = [
|
|
|
3544
4461
|
type: "paragraph",
|
|
3545
4462
|
text: "Dialects ensure consistency across all your structured output. When your downstream systems expect dates in `YYYY-MM-DD` format, numbers with `.` as the decimal separator, and CSVs delimited by `;`, you configure this once in the shared dialect rather than repeating it in every schema."
|
|
3546
4463
|
},
|
|
4464
|
+
{
|
|
4465
|
+
type: "paragraph",
|
|
4466
|
+
text: "Most teams configure their shared dialect during initial workspace setup and rarely change it afterward. If your organization operates across regions with different formatting conventions, create separate workspaces with region-specific dialects rather than overriding at the schema level. This keeps the configuration clean and avoids inconsistencies in delivered data."
|
|
4467
|
+
},
|
|
3547
4468
|
{
|
|
3548
4469
|
type: "list",
|
|
3549
4470
|
ordered: false,
|
|
@@ -3614,6 +4535,10 @@ var sections11 = [
|
|
|
3614
4535
|
type: "paragraph",
|
|
3615
4536
|
text: "The lookup convention follows a `key` / `value` structure where the `key` is the output code and the `value` is the human-readable label. During extraction, the platform maps FROM labels found in documents TO the canonical codes defined in the reference primitive. This ensures consistent, machine-readable output regardless of how values appear in source documents."
|
|
3616
4537
|
},
|
|
4538
|
+
{
|
|
4539
|
+
type: "paragraph",
|
|
4540
|
+
text: "For best results, keep reference primitives focused on a single domain \u2014 for example, one primitive for country codes, another for currency codes, and another for product categories. This makes each primitive reusable across multiple schemas and simplifies maintenance. When updating a primitive, test the new version against a few sample documents before updating the version reference in production schemas."
|
|
4541
|
+
},
|
|
3617
4542
|
{
|
|
3618
4543
|
type: "callout",
|
|
3619
4544
|
variant: "info",
|
|
@@ -3681,6 +4606,10 @@ var sections11 = [
|
|
|
3681
4606
|
type: "paragraph",
|
|
3682
4607
|
text: "Change review is particularly important for workspaces that feed downstream systems through delivery bindings. A small change to a schema field mapping or a reference primitive value can ripple through to every document processed after that point. The review process creates a checkpoint where a second pair of eyes can verify the change before it goes live."
|
|
3683
4608
|
},
|
|
4609
|
+
{
|
|
4610
|
+
type: "paragraph",
|
|
4611
|
+
text: "Most teams enable change review as soon as their workspace transitions from development to production. During the initial setup phase, you can leave it disabled for faster iteration. Once your schemas, dialects, and reference primitives are stable and data is flowing to downstream systems, enable change review to protect against accidental modifications that could disrupt live pipelines."
|
|
4612
|
+
},
|
|
3684
4613
|
{
|
|
3685
4614
|
type: "list",
|
|
3686
4615
|
ordered: false,
|
|
@@ -3747,6 +4676,14 @@ var sections12 = [
|
|
|
3747
4676
|
type: "paragraph",
|
|
3748
4677
|
text: "Omnisearch is designed to be the single entry point for finding anything in the platform. Rather than navigating to specific pages to search within them, Omnisearch queries a **materialized values index** that aggregates data across all your content. Results are grouped by category so you can quickly distinguish between a document match and a field name match."
|
|
3749
4678
|
},
|
|
4679
|
+
{
|
|
4680
|
+
type: "paragraph",
|
|
4681
|
+
text: "The materialized values index is rebuilt automatically whenever documents are processed or schemas change, so search results are always current. There is no manual reindex step \u2014 new documents become searchable as soon as extraction completes. This makes Omnisearch reliable even during high-volume ingestion periods."
|
|
4682
|
+
},
|
|
4683
|
+
{
|
|
4684
|
+
type: "paragraph",
|
|
4685
|
+
text: "For best results, use Omnisearch as your primary navigation tool. Instead of browsing through document lists or clicking through the sidebar, press `Cmd+K` and type what you are looking for \u2014 whether it is a specific invoice number, a field name, or a schema title. Most users find that Omnisearch is faster than manual navigation for any task beyond browsing the most recent documents."
|
|
4686
|
+
},
|
|
3750
4687
|
{
|
|
3751
4688
|
type: "callout",
|
|
3752
4689
|
variant: "info",
|
|
@@ -3823,6 +4760,10 @@ var sections12 = [
|
|
|
3823
4760
|
{
|
|
3824
4761
|
type: "paragraph",
|
|
3825
4762
|
text: "Filter state is encoded in the URL query string using dynamic SQL generation on the backend. This means you can bookmark filtered views, share them with teammates via a link, or save them as **presets** for one-click access to commonly used queries."
|
|
4763
|
+
},
|
|
4764
|
+
{
|
|
4765
|
+
type: "paragraph",
|
|
4766
|
+
text: 'For best results, save your most common filter combinations as presets. Most teams create presets for categories like "high-value invoices this quarter," "documents missing key fields," or "recently failed extractions." Presets appear as one-click buttons on the Documents page, eliminating the need to rebuild complex filter conditions from scratch each time.'
|
|
3826
4767
|
}
|
|
3827
4768
|
],
|
|
3828
4769
|
related: [
|
|
@@ -3877,6 +4818,19 @@ var sections13 = [
|
|
|
3877
4818
|
type: "paragraph",
|
|
3878
4819
|
text: "Manage API keys from **Settings → API Keys**. Keys are prefixed with `tlnc_` and passed via `Authorization: Bearer`. Keys are SHA-256 hashed \u2014 the full key is only shown once at creation."
|
|
3879
4820
|
},
|
|
4821
|
+
{
|
|
4822
|
+
type: "paragraph",
|
|
4823
|
+
text: "Each API key is assigned one or more scopes that control what operations it can perform. Scopes follow the principle of least privilege \u2014 create a key with only the scopes your integration needs. For example, a read-only dashboard integration only needs the `read` scope, while an automated ingestion pipeline needs `extract` and `read`."
|
|
4824
|
+
},
|
|
4825
|
+
{
|
|
4826
|
+
type: "paragraph",
|
|
4827
|
+
text: "For best results, create separate API keys for each integration or service that connects to your Talonic workspace. This makes it easy to rotate or revoke a single key without disrupting other integrations. Most teams maintain one key for their ingestion pipeline, one for their BI dashboard, and one for webhook-based automations."
|
|
4828
|
+
},
|
|
4829
|
+
{
|
|
4830
|
+
type: "callout",
|
|
4831
|
+
variant: "warning",
|
|
4832
|
+
text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated."
|
|
4833
|
+
},
|
|
3880
4834
|
{
|
|
3881
4835
|
type: "param-table",
|
|
3882
4836
|
title: "API key scopes",
|
|
@@ -3912,6 +4866,10 @@ var sections13 = [
|
|
|
3912
4866
|
{
|
|
3913
4867
|
question: "What scopes are available for API keys?",
|
|
3914
4868
|
answer: "Three scopes: extract (use extraction API), read (read documents, extractions, schemas, jobs), and write (create and modify resources)."
|
|
4869
|
+
},
|
|
4870
|
+
{
|
|
4871
|
+
question: "Can I have multiple API keys?",
|
|
4872
|
+
answer: "Yes. You can create as many API keys as needed. Best practice is to create separate keys for each integration so you can rotate or revoke them independently without disrupting other services."
|
|
3915
4873
|
}
|
|
3916
4874
|
],
|
|
3917
4875
|
mentions: ["API keys", "tlnc_", "SHA-256", "Bearer token", "scopes"]
|
|
@@ -3923,6 +4881,27 @@ var sections13 = [
|
|
|
3923
4881
|
seoTitle: "Public REST API Overview \u2014 Talonic Docs",
|
|
3924
4882
|
description: "Full REST API with 20+ namespaces: extract, documents, extractions, schemas, jobs, sources, delivery, linking, matching, batches, cases, quality, and more. Cursor pagination.",
|
|
3925
4883
|
content: [
|
|
4884
|
+
{
|
|
4885
|
+
type: "paragraph",
|
|
4886
|
+
text: "Talonic exposes a comprehensive REST API with 20+ namespaces covering every aspect of the platform \u2014 from document extraction and schema management to delivery, matching, and quality benchmarking. All endpoints use JSON request and response bodies with cursor-based pagination for list operations."
|
|
4887
|
+
},
|
|
4888
|
+
{
|
|
4889
|
+
type: "paragraph",
|
|
4890
|
+
text: "The API follows standard REST conventions. Authenticate with a `tlnc_` API key via the `Authorization: Bearer` header. Most resources support full CRUD operations, and long-running tasks like matching runs and batch inference are handled asynchronously with polling endpoints for status and progress."
|
|
4891
|
+
},
|
|
4892
|
+
{
|
|
4893
|
+
type: "paragraph",
|
|
4894
|
+
text: "Use the public API to build automated ingestion pipelines, integrate extraction results into downstream systems, or orchestrate complex workflows that combine multiple platform features. The API mirrors every action available in the web interface, so anything you can do manually can be fully automated."
|
|
4895
|
+
},
|
|
4896
|
+
{
|
|
4897
|
+
type: "paragraph",
|
|
4898
|
+
text: "For best results, start with the `/v1/extract` endpoint for document ingestion, then use `/v1/documents` and `/v1/extractions` to retrieve results. As your integration matures, explore delivery bindings, matching configurations, and batch processing to build a fully automated data pipeline."
|
|
4899
|
+
},
|
|
4900
|
+
{
|
|
4901
|
+
type: "callout",
|
|
4902
|
+
variant: "info",
|
|
4903
|
+
text: "See the full [API Documentation](/docs) for detailed endpoint specifications, request/response examples, and authentication guides. The API reference is organized by namespace and includes every parameter, status code, and error response."
|
|
4904
|
+
},
|
|
3926
4905
|
{
|
|
3927
4906
|
type: "param-table",
|
|
3928
4907
|
title: "API namespaces",
|
|
@@ -4043,6 +5022,10 @@ var sections13 = [
|
|
|
4043
5022
|
{
|
|
4044
5023
|
question: "Where can I find detailed API documentation?",
|
|
4045
5024
|
answer: "See the full API Documentation at /docs for complete endpoint documentation with request/response examples, parameter descriptions, and authentication details."
|
|
5025
|
+
},
|
|
5026
|
+
{
|
|
5027
|
+
question: "How does pagination work in the API?",
|
|
5028
|
+
answer: "List endpoints use cursor-based pagination. Each response includes a cursor token that you pass as a query parameter to fetch the next page. This approach is more reliable than offset-based pagination when documents are being added or removed concurrently."
|
|
4046
5029
|
}
|
|
4047
5030
|
],
|
|
4048
5031
|
mentions: [
|
|
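Editor's sketch for the pagination answer in the hunk above: list endpoints return a cursor token that is passed back to fetch the next page. The diff does not name the response fields or the query parameter, so `items`, `nextCursor`, and the caller-supplied `fetchPage` function are all hypothetical.

```javascript
// Cursor-pagination loop. `fetchPage(cursor)` is supplied by the caller and is
// assumed to resolve to { items, nextCursor }; those field names are illustrative.
async function* listAll(fetchPage) {
  let cursor; // undefined requests the first page
  do {
    const page = await fetchPage(cursor);
    yield* page.items;
    cursor = page.nextCursor; // a missing or empty cursor ends the loop
  } while (cursor);
}
```

A caller would consume it with `for await (const doc of listAll(fetchPage)) { ... }`, where `fetchPage` wraps the HTTP request for one page.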
@@ -4068,6 +5051,14 @@ var sections13 = [
|
|
|
4068
5051
|
type: "paragraph",
|
|
4069
5052
|
text: "The webhook connector is configured as a **delivery destination**. Bind any of the signal types below to a webhook destination to receive real-time notifications. See `/v1/delivery/catalog/signals` for the exhaustive list."
|
|
4070
5053
|
},
|
|
5054
|
+
{
|
|
5055
|
+
type: "paragraph",
|
|
5056
|
+
text: "When a webhook fires, the platform constructs the payload from the signal data, signs it with your destination's HMAC-SHA256 signing secret, and delivers it via HTTPS POST. Each delivery includes an idempotency key in the headers so your receiver can safely deduplicate retries. Failed deliveries follow an exponential backoff schedule, and terminal failures are routed to the dead-letter queue for manual replay."
|
|
5057
|
+
},
|
|
5058
|
+
{
|
|
5059
|
+
type: "paragraph",
|
|
5060
|
+
text: "Use webhooks when your downstream system needs to react immediately to platform events \u2014 for example, triggering an ERP import when a document is extracted, or notifying a Slack channel when a reviewer rejects a record. For bulk or periodic data transfers, consider using the SFTP, S3, or cloud storage delivery connectors instead."
|
|
5061
|
+
},
|
|
4071
5062
|
{
|
|
4072
5063
|
type: "param-table",
|
|
4073
5064
|
title: "Delivery signal types (webhook-compatible)",
|
|
@@ -4143,6 +5134,10 @@ var sections13 = [
|
|
|
4143
5134
|
{
|
|
4144
5135
|
question: "What happens when a webhook delivery fails?",
|
|
4145
5136
|
answer: "Failed webhook deliveries retry with exponential backoff. Terminal failures (retry exhausted or permanent 4xx) escalate to the dead-letter queue for manual replay."
|
|
5137
|
+
},
|
|
5138
|
+
{
|
|
5139
|
+
question: "How do I verify webhook signatures?",
|
|
5140
|
+
answer: "Each webhook payload is signed with HMAC-SHA256 using the signing secret from your delivery destination configuration. Compute the HMAC of the raw request body and compare it to the signature header to verify authenticity. This ensures the payload was sent by Talonic and was not tampered with in transit."
|
|
4146
5141
|
}
|
|
4147
5142
|
],
|
|
4148
5143
|
mentions: [
|
|
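Editor's sketch for the signature-verification answer in the hunk above: compute the HMAC-SHA256 of the raw request body with the destination's signing secret and compare it to the received signature. The diff does not name the signature header or its encoding, so this sketch assumes a hex-encoded value that the caller has already read from the request. `verifySignature` is a hypothetical helper name.

```javascript
const crypto = require("crypto");

// HMAC-SHA256 check over the raw request body.
// Assumptions: the signature is hex-encoded, and the caller passes it in directly
// because the header name is not given in the text above.
function verifySignature(rawBody, signatureHex, signingSecret) {
  if (typeof signatureHex !== "string") return false;
  const expected = crypto.createHmac("sha256", signingSecret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}
```

The check must run on the raw bytes as received. Re-serializing a parsed JSON body can change whitespace or key order and invalidate the signature.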
@@ -4202,6 +5197,10 @@ var sections14 = [
|
|
|
4202
5197
|
type: "paragraph",
|
|
4203
5198
|
text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. Manage from the Team page."
|
|
4204
5199
|
},
|
|
5200
|
+
{
|
|
5201
|
+
type: "paragraph",
|
|
5202
|
+
text: "When a team member is removed, their access is revoked immediately but their past actions \u2014 edits, uploads, approvals, and review decisions \u2014 remain in the audit trail. This preserves data integrity and compliance history. Removed users can be re-added later through the same domain matching process if needed."
|
|
5203
|
+
},
|
|
4205
5204
|
{
|
|
4206
5205
|
type: "callout",
|
|
4207
5206
|
variant: "info",
|
|
@@ -4269,6 +5268,14 @@ var sections14 = [
|
|
|
4269
5268
|
type: "paragraph",
|
|
4270
5269
|
text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events."
|
|
4271
5270
|
},
|
|
5271
|
+
{
|
|
5272
|
+
type: "paragraph",
|
|
5273
|
+
text: "Behind the scenes, every LLM and OCR call is logged with full detail \u2014 the model used, input and output token counts, latency, and computed cost. This data powers both the per-feature breakdown and the individual call log. The system tracks costs across extraction, OCR, batch inference, matching AI resolution, and quality passes so you always know where your spend is going."
|
|
5274
|
+
},
|
|
5275
|
+
{
|
|
5276
|
+
type: "paragraph",
|
|
5277
|
+
text: "Most teams review the daily cost chart weekly to establish a usage baseline. Unexpected spikes usually correlate with large document uploads or batch completions. For organizations managing multiple workspaces, the **Master view** provides a single pane of glass showing per-customer breakdowns and platform-wide aggregates \u2014 accessible only to platform administrators."
|
|
5278
|
+
},
|
|
4272
5279
|
{
|
|
4273
5280
|
type: "param-table",
|
|
4274
5281
|
title: "Usage views",
|
|
@@ -4344,6 +5351,14 @@ var sections14 = [
|
|
|
4344
5351
|
type: "paragraph",
|
|
4345
5352
|
text: "The Admin Panel is the central hub for platform-wide operations. **Customer management** lets you create, view, and delete organizations. **User management** provides a cross-tenant view of all platform users with the ability to remove accounts. The **data clear & rebuild** function wipes all data for a specific customer and reprocesses from scratch \u2014 useful during onboarding or after significant schema changes."
|
|
4346
5353
|
},
|
|
5354
|
+
{
|
|
5355
|
+
type: "paragraph",
|
|
5356
|
+
text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs."
|
|
5357
|
+
},
|
|
5358
|
+
{
|
|
5359
|
+
type: "paragraph",
|
|
5360
|
+
text: "For best results, limit Admin Panel access to a small group of trusted platform operators. Use the **master registry** view to audit field definitions and schemas across tenants \u2014 this is particularly useful when standardizing extraction configurations or troubleshooting cross-tenant data quality issues."
|
|
5361
|
+
},
|
|
4347
5362
|
{
|
|
4348
5363
|
type: "list",
|
|
4349
5364
|
ordered: false,
|
|
@@ -4403,6 +5418,18 @@ var sections14 = [
|
|
|
4403
5418
|
type: "paragraph",
|
|
4404
5419
|
text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows."
|
|
4405
5420
|
},
|
|
5421
|
+
{
|
|
5422
|
+
type: "paragraph",
|
|
5423
|
+
text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused."
|
|
5424
|
+
},
|
|
5425
|
+
{
|
|
5426
|
+
type: "paragraph",
|
|
5427
|
+
text: "The most frequently used shortcut is **Omnisearch** (`Cmd+K` / `Ctrl+K`), which opens a global search overlay that queries documents, extracted values, field names, schemas, and sources simultaneously. Power users rely on it to navigate the platform faster than clicking through the sidebar."
|
|
5428
|
+
},
|
|
5429
|
+
{
|
|
5430
|
+
type: "paragraph",
|
|
5431
|
+
text: "For best results, build muscle memory around the three core shortcuts. Use `Cmd+K` to find anything, `Cmd+J` to upload a document on the fly, and `Escape` to dismiss any overlay or modal. These three actions cover the most common interruptions during a review or configuration session."
|
|
5432
|
+
},
|
|
4406
5433
|
{
|
|
4407
5434
|
type: "param-table",
|
|
4408
5435
|
title: "Shortcuts",
|
|
@@ -4473,6 +5500,14 @@ var sections15 = [
|
|
|
4473
5500
|
type: "paragraph",
|
|
4474
5501
|
text: "Under the hood, batch inference leverages the provider's native batch API (Anthropic Message Batches or AWS Bedrock invocation jobs). Documents accumulate in a queue and are submitted together, allowing the provider to schedule processing during off-peak capacity. This is why the cost reduction is possible without any loss in extraction quality."
|
|
4475
5502
|
},
|
|
5503
|
+
{
|
|
5504
|
+
type: "paragraph",
|
|
5505
|
+
text: "Batch mode is best suited for backlog ingestion, periodic bulk uploads, and any scenario where results are not needed in real time. Most teams use batch mode for overnight processing of large document volumes and reserve real-time processing for time-sensitive documents that need immediate attention."
|
|
5506
|
+
},
|
|
5507
|
+
{
|
|
5508
|
+
type: "paragraph",
|
|
5509
|
+
text: "When batch results arrive, they pass through the same post-processing pipeline as real-time extractions \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation. The only difference is that LLM-based quality passes (field estimation, verification, cross-reference enrichment) are skipped in batch mode to preserve the cost savings."
|
|
5510
|
+
},
|
|
4476
5511
|
{
|
|
4477
5512
|
type: "list",
|
|
4478
5513
|
ordered: false,
|
|
@@ -4549,6 +5584,10 @@ var sections15 = [
|
|
|
4549
5584
|
{
|
|
4550
5585
|
type: "paragraph",
|
|
4551
5586
|
text: "While waiting for batch results, documents show a status of `batch_queued`. Once the provider returns results, the platform applies them through the same post-processing pipeline as real-time extraction \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation."
|
|
5587
|
+
},
|
|
5588
|
+
{
|
|
5589
|
+
type: "paragraph",
|
|
5590
|
+
text: "You can also enable batch mode on a per-source basis. When a source connection has the batch processing toggle enabled, all documents ingested through that source are automatically routed to the batch queue. This is ideal for source connections that handle non-urgent, high-volume ingestion \u2014 such as a shared drive that collects documents overnight."
|
|
4552
5591
|
}
|
|
4553
5592
|
],
|
|
4554
5593
|
related: [
|
|
@@ -4598,6 +5637,14 @@ var sections15 = [
|
|
|
4598
5637
|
type: "paragraph",
|
|
4599
5638
|
text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents and the batch transitions to **completed** status."
|
|
4600
5639
|
},
|
|
5640
|
+
{
|
|
5641
|
+
type: "paragraph",
|
|
5642
|
+
text: "The batch detail view shows individual items within a batch, including which documents are included, their current processing state, and any errors that occurred. Use this view to verify that a specific document was included in the expected batch and to troubleshoot items that failed to parse."
|
|
5643
|
+
},
|
|
5644
|
+
{
|
|
5645
|
+
type: "paragraph",
|
|
5646
|
+
text: "The platform includes built-in crash recovery for batch processing. If the application restarts while a batch is in a transient `processing` state, the recovery logic automatically reverts it to `submitted` so the next polling cycle can retry. This means batch jobs are resilient to infrastructure disruptions without requiring manual intervention."
|
|
5647
|
+
},
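The crash-recovery rule described in the added paragraph can be sketched as a pure function. This is only an illustration: the batch record shape (`id`, `status`) and the function name `recoverBatches` are assumptions, and the platform's actual recovery code is not part of this diff.

```javascript
// Hypothetical sketch of the recovery rule: any batch left in the transient
// "processing" state after a restart is reverted to "submitted" so that the
// next polling cycle picks it up again. The record shape is an assumption.
function recoverBatches(batches) {
  return batches.map((batch) =>
    batch.status === "processing" ? { ...batch, status: "submitted" } : batch
  );
}

const afterRestart = recoverBatches([
  { id: "b1", status: "processing" },
  { id: "b2", status: "completed" },
]);
console.log(afterRestart.map((b) => `${b.id}:${b.status}`).join(", "));
```

Returning new objects instead of mutating the input keeps the rule easy to test in isolation.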
|
|
4601
5648
|
{
|
|
4602
5649
|
type: "param-table",
|
|
4603
5650
|
title: "Batch statuses",
|
|
@@ -4677,6 +5724,14 @@ var sections16 = [
|
|
|
4677
5724
|
type: "paragraph",
|
|
4678
5725
|
text: 'Reference data is the foundation of the matching system. It represents your "ground truth" \u2014 the known records you want to match extracted document data against. Common examples include customer lists, product catalogs, vendor registries, and contract databases.'
|
|
4679
5726
|
},
|
|
5727
|
+
{
|
|
5728
|
+
type: "paragraph",
|
|
5729
|
+
text: "When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. Each dataset is versioned independently, so you can update your reference data without affecting in-progress matching configurations. A single dataset can be shared across multiple schemas and matching configurations."
|
|
5730
|
+
},
|
|
5731
|
+
{
|
|
5732
|
+
type: "paragraph",
|
|
5733
|
+
text: "For best results, ensure your reference data is clean and deduplicated before uploading. Include all columns that you plan to match against \u2014 such as names, identifiers, dates, and amounts. Most teams refresh their reference data periodically by re-uploading from their source system or by using the SQL import option to pull directly from a connected database."
|
|
5734
|
+
},
|
|
4680
5735
|
{
|
|
4681
5736
|
type: "callout",
|
|
4682
5737
|
variant: "info",
|
|
@@ -4770,6 +5825,10 @@ var sections16 = [
|
|
|
4770
5825
|
type: "paragraph",
|
|
4771
5826
|
text: "Each field comparison carries a **weight** that determines how much it contributes to the overall confidence score. Set high weights on fields that are strong identifiers (like reference numbers or unique IDs) and lower weights on fields that are common or prone to variation (like names or descriptions). The weighted aggregate produces a final score between 0% and 100%."
|
|
4772
5827
|
},
|
|
5828
|
+
{
|
|
5829
|
+
type: "paragraph",
|
|
5830
|
+
text: "Most teams start with AI strategy generation and then fine-tune weights based on initial results. A common pattern is to set a high weight on a unique identifier field (like a PO number) with `exact` strategy, combined with lower-weighted `fuzzy` matches on name and description fields as supporting evidence. Review the first batch of results to calibrate thresholds before running at scale."
|
|
5831
|
+
},
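To make the weighting pattern concrete, here is a minimal sketch that assumes the aggregate is a plain weighted average of per-field scores. That formula, the `weight` and `score` property names, and the sample values are all assumptions for illustration; the text only states that the weighted aggregate yields a score between 0% and 100%.

```javascript
// Assumption: confidence = sum(weight * score) / sum(weight), as a percentage.
// Each comparison carries a weight and a per-field score in [0, 1].
function aggregateConfidence(comparisons) {
  const totalWeight = comparisons.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = comparisons.reduce((sum, c) => sum + c.weight * c.score, 0);
  return (weighted / totalWeight) * 100;
}

// A heavily weighted exact match on a PO number, supported by weaker fuzzy matches.
const comparisons = [
  { field: "po_number", strategy: "exact", weight: 5, score: 1.0 },
  { field: "vendor_name", strategy: "fuzzy", weight: 2, score: 0.5 },
  { field: "description", strategy: "fuzzy", weight: 1, score: 0.0 },
];
console.log(aggregateConfidence(comparisons)); // 75
```

Under this assumed formula, the exact identifier match dominates: even with a poor description match, the candidate still lands at 75%.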
|
|
4773
5832
|
{
|
|
4774
5833
|
type: "callout",
|
|
4775
5834
|
variant: "info",
|
|
@@ -4824,6 +5883,14 @@ var sections16 = [
|
|
|
4824
5883
|
type: "paragraph",
|
|
4825
5884
|
text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based search with a Haiku LLM resolver attempts to improve low-confidence results."
|
|
4826
5885
|
},
|
|
5886
|
+
{
|
|
5887
|
+
type: "paragraph",
|
|
5888
|
+
text: "Matching runs are processed asynchronously via a dedicated job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining."
|
|
5889
|
+
},
|
|
5890
|
+
{
|
|
5891
|
+
type: "paragraph",
|
|
5892
|
+
text: "For best results, start with a manual run to establish a baseline, then use a smart run if many documents have low-confidence matches. Smart runs take longer because the AI resolver evaluates each ambiguous candidate, but they can significantly improve match quality for data with inconsistent formatting, abbreviations, or multilingual content."
|
|
5893
|
+
},
|
|
4827
5894
|
{
|
|
4828
5895
|
type: "list",
|
|
4829
5896
|
ordered: true,
|
|
@@ -4881,6 +5948,14 @@ var sections16 = [
|
|
|
4881
5948
|
type: "paragraph",
|
|
4882
5949
|
text: "The evidence view is designed to make match decisions transparent. For each candidate, you can see exactly which fields matched, what strategy was used, the individual field score, and the actual values that were compared. This makes it straightforward to verify correct matches and investigate false positives."
|
|
4883
5950
|
},
|
|
5951
|
+
{
|
|
5952
|
+
type: "paragraph",
|
|
5953
|
+
text: "Approved matches flow downstream into delivery pipelines, where they can be included in structured exports alongside extraction data. Rejected matches are excluded from future consideration for that document, which helps the system learn from your decisions when running subsequent matching passes."
|
|
5954
|
+
},
|
|
5955
|
+
{
|
|
5956
|
+
type: "paragraph",
|
|
5957
|
+
text: "When reviewing results, focus on documents where the top candidate has a confidence score between 50% and 85% \u2014 these are the borderline cases that benefit most from human judgment. High-confidence matches (above 85%) are usually correct, while very low scores (below 30%) typically indicate no valid match exists in the reference data."
|
|
5958
|
+
},
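The review guidance above can be turned into a small triage helper. The 85%, 50%, and 30% cut-offs come from the paragraph; the bucket names and the handling of the 30-50% range (which the text does not describe) are assumptions.

```javascript
// Buckets a top-candidate confidence score (0-100) for review triage.
// Bucket names are hypothetical; the 30-50 range is labelled "low" as an
// assumption because the guidance does not cover it.
function triageMatch(confidencePct) {
  if (confidencePct > 85) return "likely_correct";
  if (confidencePct >= 50) return "needs_review";
  if (confidencePct < 30) return "likely_no_match";
  return "low";
}

console.log([92, 70, 40, 12].map(triageMatch));
```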
|
|
4884
5959
|
{
|
|
4885
5960
|
type: "param-table",
|
|
4886
5961
|
title: "Result fields",
|
|
@@ -7085,6 +8160,7 @@ var sections22 = [
|
|
|
7085
8160
|
}
|
|
7086
8161
|
}`
|
|
7087
8162
|
},
|
|
8163
|
+
{ type: "paragraph", text: "Most integrations call `POST /v1/jobs` immediately after defining or updating a schema via the schemas API. Once created, poll `GET /v1/jobs/:id` every 2-5 seconds and watch for `status` transitioning to `complete`. Pair with `GET /v1/jobs/:id/results` to retrieve the structured output rows as soon as the job finishes." },
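A minimal sketch of the polling loop described above. The HTTP call is injected as `getJob` so that no base URL, auth header, or client library is assumed; `pollJob`, its options, and the `"running"` status in the demo are hypothetical names, and only the `complete` status comes from the text.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls until the job reports status "complete", or gives up after maxAttempts.
// getJob(jobId) is a caller-supplied async function that wraps GET /v1/jobs/:id.
async function pollJob(getJob, jobId, options = {}) {
  const { intervalMs = 3000, maxAttempts = 200, sleep = defaultSleep } = options;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob(jobId);
    if (job.status === "complete") return job;
    await sleep(intervalMs); // the text suggests polling every 2-5 seconds
  }
  throw new Error(`Job ${jobId} did not complete after ${maxAttempts} polls`);
}

// Demo with a fake getJob that completes on the third poll.
const demoStatuses = ["running", "running", "complete"];
let demoCalls = 0;
const demoGetJob = async () => ({ status: demoStatuses[demoCalls++] });
pollJob(demoGetJob, "job_demo", { intervalMs: 0 }).then((job) => console.log(job.status));
```

In a real integration, `getJob` would issue the authenticated request and parse the JSON body; once `pollJob` resolves, the results endpoint can be called for the structured rows.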
|
|
7088
8164
|
{ type: "heading", level: 2, id: "create-job-errors", text: "Errors" },
|
|
7089
8165
|
{
|
|
7090
8166
|
type: "param-table",
|
|
@@ -7172,6 +8248,7 @@ var sections22 = [
|
|
|
7172
8248
|
}
|
|
7173
8249
|
}`
|
|
7174
8250
|
},
|
|
8251
|
+
{ type: "paragraph", text: "This endpoint is typically used in a polling loop after `POST /v1/jobs`. Watch `current_phase` to track pipeline progression through `phase_1_resolve`, `phase_2_execute`, `phase_3_resolve`, and `phase_4_transform`. The `grid_stats.fill_rate` value increases as each phase completes, giving you a real-time quality signal before the job reaches `complete` status." },
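As a rough illustration of tracking progress from a job response, the helper below maps `current_phase` to a phase number and surfaces `grid_stats.fill_rate`. The phase identifiers are taken from the paragraph; the function name and the returned object shape are assumptions.

```javascript
const PHASES = ["phase_1_resolve", "phase_2_execute", "phase_3_resolve", "phase_4_transform"];

// Summarizes where a job is in the pipeline. Returns phaseNumber null when
// current_phase is missing or not one of the four documented phase names.
function describeProgress(job) {
  const index = PHASES.indexOf(job.current_phase);
  return {
    phaseNumber: index === -1 ? null : index + 1,
    totalPhases: PHASES.length,
    fillRate: job.grid_stats ? job.grid_stats.fill_rate : undefined,
  };
}

console.log(describeProgress({ current_phase: "phase_2_execute", grid_stats: { fill_rate: 0.42 } }));
```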
|
|
7175
8252
|
{ type: "heading", level: 2, id: "get-job-errors", text: "Errors" },
|
|
7176
8253
|
{
|
|
7177
8254
|
type: "param-table",
|
|
@@ -7250,6 +8327,7 @@ var sections22 = [
|
|
|
7250
8327
|
}
|
|
7251
8328
|
}`
|
|
7252
8329
|
},
|
|
8330
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/jobs/:id/results` to retrieve any rows that were processed before cancellation. The `completed_documents` field in the response tells you how many documents finished before the job was stopped, so you can decide whether the partial results are usable or if a new job is needed." },
|
|
7253
8331
|
{ type: "heading", level: 2, id: "cancel-job-errors", text: "Errors" },
|
|
7254
8332
|
{
|
|
7255
8333
|
type: "param-table",
|
|
@@ -7354,6 +8432,7 @@ var sections22 = [
|
|
|
7354
8432
|
}
|
|
7355
8433
|
}`
|
|
7356
8434
|
},
|
|
8435
|
+
{ type: "paragraph", text: "Returns one row per document with field values keyed by the schema field names you defined. Rows with `validation_flags` containing entries like `missing_required_field:<name>` or `format_mismatch:<name>` indicate Phase 4 detected data quality issues. Use the `confidence` score to prioritize which rows need manual review -- values below 0.8 typically warrant inspection." },
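One way to apply this guidance is to build a review queue from the result rows. The sketch assumes each row exposes a numeric `confidence` and a `validation_flags` array, as named in the paragraph; the `document_id` property and the function name are hypothetical.

```javascript
// Selects rows that carry validation flags or fall below the confidence
// threshold, ordered so the least confident rows are reviewed first.
function rowsNeedingReview(rows, threshold = 0.8) {
  return rows
    .filter((row) => {
      const flagged = Array.isArray(row.validation_flags) && row.validation_flags.length > 0;
      return flagged || row.confidence < threshold;
    })
    .sort((a, b) => a.confidence - b.confidence);
}

const sampleRows = [
  { document_id: "d1", confidence: 0.95, validation_flags: [] },
  { document_id: "d2", confidence: 0.62, validation_flags: [] },
  { document_id: "d3", confidence: 0.91, validation_flags: ["missing_required_field:total"] },
];
console.log(rowsNeedingReview(sampleRows).map((r) => r.document_id));
```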
|
|
7357
8436
|
{ type: "heading", level: 2, id: "get-job-results-errors", text: "Errors" },
|
|
7358
8437
|
{
|
|
7359
8438
|
type: "param-table",
|
|
@@ -8002,7 +9081,8 @@ var sections24 = [
|
|
|
8002
9081
|
{ name: "404", type: "not_found", description: "No field with the given fieldId exists for your workspace." },
|
|
8003
9082
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
8004
9083
|
]
|
|
8005
|
-
}
|
|
9084
|
+
},
|
|
9085
|
+
{ type: "paragraph", text: 'Pair this endpoint with the **field autocomplete** endpoint to build a two-step filter UI: first let the user select a field via `GET /fields/autocomplete`, then populate a dropdown with that field\'s distinct values from this endpoint. The `totalDistinct` count is useful for showing "N of M values" pagination hints.' }
|
|
8006
9086
|
],
|
|
8007
9087
|
related: [
|
|
8008
9088
|
{ label: "Field Autocomplete", slug: "field-autocomplete" },
|
|
@@ -8104,7 +9184,8 @@ var sections24 = [
|
|
|
8104
9184
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
8105
9185
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
8106
9186
|
]
|
|
8107
|
-
}
|
|
9187
|
+
},
|
|
9188
|
+
{ type: "paragraph", text: "Build conditions programmatically by first calling `GET /fields/autocomplete` to resolve field IDs, then `GET /fields/:fieldId/values` to populate value pickers. The response includes the full `fieldValues` map per document, so you can render result tables without additional per-document fetches." }
|
|
8108
9189
|
],
|
|
8109
9190
|
related: [
|
|
8110
9191
|
{ label: "Field Autocomplete", slug: "field-autocomplete" },
|
|
@@ -8188,7 +9269,8 @@ var sections24 = [
|
|
|
8188
9269
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
8189
9270
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
8190
9271
|
]
|
|
8191
|
-
}
|
|
9272
|
+
},
|
|
9273
|
+
{ type: "paragraph", text: "Returns categorized results across **documents**, **fieldMatches**, **sources**, **schemas**, and **fields** in a single call, so a Cmd+K palette can render grouped sections without multiple requests. Use the `limit` parameter to cap results per category and keep response times fast for interactive search." }
|
|
8192
9274
|
],
|
|
8193
9275
|
related: [
|
|
8194
9276
|
{ label: "Filter Documents", slug: "filter-documents" },
|
|
@@ -8336,7 +9418,8 @@ var sections24 = [
|
|
|
8336
9418
|
{ name: "404", type: "not_found", description: "No saved filter with this ID exists." },
|
|
8337
9419
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
8338
9420
|
]
|
|
8339
|
-
}
|
|
9421
|
+
},
|
|
9422
|
+
{ type: "paragraph", text: "Saved filters work as reusable presets for `POST /v1/documents/filter`. Store a complex combination of conditions, search text, and sort order once, then load it by ID and pass the saved `conditions` array directly to the filter endpoint. All team members in the organization can list and use saved filters." }
|
|
8340
9423
|
],
|
|
8341
9424
|
related: [
|
|
8342
9425
|
{ label: "Filter Documents", slug: "filter-documents" }
|
|
@@ -8354,7 +9437,8 @@ var sections24 = [
|
|
|
8354
9437
|
seoTitle: "Document Counts \u2014 Talonic Docs",
|
|
8355
9438
|
description: "Query document counts grouped by filter conditions and source connections. Useful for building faceted navigation and dashboard widgets.",
|
|
8356
9439
|
content: [
|
|
8357
|
-
{ type: "paragraph", text: "The document counts endpoint returns aggregate counts of documents matching filter conditions, grouped by source connection. Use it to power faceted navigation UIs and dashboard summary widgets without fetching full document lists." }
|
|
9440
|
+
{ type: "paragraph", text: "The document counts endpoint returns aggregate counts of documents matching filter conditions, grouped by source connection. Use it to power faceted navigation UIs and dashboard summary widgets without fetching full document lists." },
|
|
9441
|
+
{ type: "paragraph", text: "Call this before `POST /v1/documents/filter` to preview how many documents match your conditions without fetching the full result set. The response groups counts by **source connection**, so you can render per-source facets in a sidebar alongside the main document list." }
|
|
8358
9442
|
],
|
|
8359
9443
|
related: [
|
|
8360
9444
|
{ label: "Filter Documents", slug: "filter-documents" }
|
|
@@ -8405,7 +9489,8 @@ var sections24 = [
|
|
|
8405
9489
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
8406
9490
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
8407
9491
|
]
|
|
8408
|
-
}
|
|
9492
|
+
},
|
|
9493
|
+
{ type: "paragraph", text: "In a typical bulk-ingestion workflow, call `POST /v1/extract` for each file, wait for all `document.extraction.completed` webhooks, then trigger `POST /filter/materialize` once. After the backfill completes, all filter and omnisearch endpoints will reflect the newly ingested data." }
|
|
8409
9494
|
],
|
|
8410
9495
|
related: [
|
|
8411
9496
|
{ label: "Filter Documents", slug: "filter-documents" }
|
|
@@ -8423,7 +9508,8 @@ var sections24 = [
|
|
|
8423
9508
|
seoTitle: "Materialized Index \u2014 Talonic Docs",
|
|
8424
9509
|
description: "The materialized field value index powers fast filter queries. Rebuild it after bulk document ingestion using the materialize endpoint.",
|
|
8425
9510
|
content: [
|
|
8426
|
-
{ type: "paragraph", text: "The materialized index pre-computes and stores extracted field values for every document, enabling sub-second filter queries even on large workspaces. After bulk ingestion or schema changes, trigger a rebuild via the materialize endpoint to ensure the index stays current." }
|
|
9511
|
+
{ type: "paragraph", text: "The materialized index pre-computes and stores extracted field values for every document, enabling sub-second filter queries even on large workspaces. After bulk ingestion or schema changes, trigger a rebuild via the materialize endpoint to ensure the index stays current." },
|
|
9512
|
+
{ type: "paragraph", text: "Both `POST /v1/documents/filter` and `GET /v1/search` read from this materialized index. If filter results appear stale or miss recently processed documents, call `POST /filter/materialize` to rebuild. For single-document uploads the index updates automatically, so manual rebuilds are only needed after bulk operations." }
|
|
8427
9513
|
],
|
|
8428
9514
|
related: [
|
|
8429
9515
|
{ label: "Materialize", slug: "document-counts" }
|
|
@@ -8757,6 +9843,7 @@ var sections26 = [
|
|
|
8757
9843
|
}
|
|
8758
9844
|
}`
|
|
8759
9845
|
},
|
|
9846
|
+
{ type: "paragraph", text: "After creating a resolution, call `POST /v1/resolutions/{id}/execute` to start the pipeline. The typical workflow is: create a job via `POST /v1/jobs`, wait for `complete` status, then create and execute a resolution against the job's `source_run_id`. Each resolution captures its own **policy and dialect snapshots**, so you can re-run with different configurations without affecting previous results." },
|
|
8760
9847
|
{ type: "heading", level: 2, id: "create-resolution-errors", text: "Errors" },
|
|
8761
9848
|
{
|
|
8762
9849
|
type: "param-table",
|
|
@@ -8830,6 +9917,7 @@ var sections26 = [
|
|
|
8830
9917
|
}
|
|
8831
9918
|
}`
|
|
8832
9919
|
},
|
|
9920
|
+
{ type: "paragraph", text: "This endpoint is typically used in a polling loop after calling `POST /v1/resolutions/{id}/execute`. Watch for `status` transitioning from `running` to `completed` or `failed`. Once `completed`, use `GET /v1/resolutions/{id}/results` to inspect per-field resolved values and the `resolution_step` that produced each canonical mapping." },
|
|
8833
9921
|
{ type: "heading", level: 2, id: "get-resolution-errors", text: "Errors" },
|
|
8834
9922
|
{
|
|
8835
9923
|
type: "param-table",
|
|
@@ -8904,6 +9992,7 @@ var sections26 = [
|
|
|
8904
9992
|
]
|
|
8905
9993
|
}`
|
|
8906
9994
|
},
|
|
9995
|
+
{ type: "paragraph", text: "Use the `resolution_step` field to understand how each value was normalized: `lookup` indicates a direct reference table match, `transfer` means the value was carried from the field registry, and `compute` means a deterministic formula produced the result. Fields where `resolved_value` is `null` were not matched by any strategy and retain their raw extracted value -- consider adding those values to your lookup tables for future runs." },
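The following sketch shows one way to summarize a results payload by `resolution_step` and collect the fields whose `resolved_value` is `null`. The `field` property and the sample values are made up for illustration; only `resolution_step` and `resolved_value` are named in the text.

```javascript
// Counts resolved fields per resolution_step and collects unresolved ones
// (resolved_value === null) as candidates for new lookup-table entries.
function summarizeResolution(results) {
  const byStep = {};
  const unresolved = [];
  for (const result of results) {
    if (result.resolved_value === null) {
      unresolved.push(result);
      continue;
    }
    byStep[result.resolution_step] = (byStep[result.resolution_step] || 0) + 1;
  }
  return { byStep, unresolved };
}

const sampleResults = [
  { field: "currency", resolution_step: "lookup", resolved_value: "EUR" },
  { field: "country", resolution_step: "lookup", resolved_value: "DE" },
  { field: "net_total", resolution_step: "compute", resolved_value: 120.5 },
  { field: "incoterm", resolution_step: null, resolved_value: null },
];
console.log(summarizeResolution(sampleResults));
```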
|
|
8907
9996
|
{ type: "heading", level: 2, id: "get-resolution-results-errors", text: "Errors" },
|
|
8908
9997
|
{
|
|
8909
9998
|
type: "param-table",
|
|
@@ -8976,6 +10065,7 @@ var sections26 = [
|
|
|
8976
10065
|
}
|
|
8977
10066
|
}`
|
|
8978
10067
|
},
|
|
10068
|
+
{ type: "paragraph", text: 'The standard workflow is `POST /v1/resolutions` to create, then `POST /v1/resolutions/{id}/execute` to start processing. The endpoint returns immediately with `status: "running"` -- poll `GET /v1/resolutions/{id}` to detect completion. Deterministic lookups complete in seconds; runs that trigger the LLM fallback for ambiguous values take 1-5 minutes depending on the number of unresolved fields.' },
|
|
8979
10069
|
{ type: "heading", level: 2, id: "execute-resolution-errors", text: "Errors" },
|
|
8980
10070
|
{
|
|
8981
10071
|
type: "param-table",
|
|
@@ -9029,6 +10119,7 @@ var sections26 = [
|
|
|
9029
10119
|
"deleted": true
|
|
9030
10120
|
}`
|
|
9031
10121
|
},
|
|
10122
|
+
{ type: "paragraph", text: "Common usage is to delete `failed` resolution runs before retrying with a new `POST /v1/resolutions`. The source job run and its extracted data are completely unaffected by this operation, so you can safely clean up resolution experiments without losing upstream results." },
|
|
9032
10123
|
{ type: "heading", level: 2, id: "cancel-resolution-errors", text: "Errors" },
|
|
9033
10124
|
{
|
|
9034
10125
|
type: "param-table",
|
|
@@ -9856,6 +10947,7 @@ var sections28 = [
|
|
|
9856
10947
|
]
|
|
9857
10948
|
}`
|
|
9858
10949
|
},
|
|
10950
|
+
{ type: "paragraph", text: 'Most integrations start with `GET /v1/jobs/runs/{runId}/nshot/summary` to check the overall `agreement_rate`, then drill into this endpoint to find `status: "red"` and `status: "yellow"` comparisons that need attention. Use the `override` and `judgement` fields to track which comparisons have already been reviewed and which still need a decision.' },
|
|
9859
10951
|
{ type: "heading", level: 2, id: "nshot-list-shots-errors", text: "Errors" },
|
|
9860
10952
|
{
|
|
9861
10953
|
type: "param-table",
|
|
@@ -9953,6 +11045,7 @@ var sections28 = [
|
|
|
9953
11045
|
}
|
|
9954
11046
|
}`
|
|
9955
11047
|
},
|
|
11048
|
+
{ type: "paragraph", text: "This endpoint is typically called after identifying a problematic comparison in the `GET /v1/jobs/runs/{runId}/nshot/comparisons` list. Inspect the `values` array to see what each shot extracted, then check the `judgement` object for the LLM's recommendation. If `judgement.accepted` is `null`, you can submit a decision via `POST /v1/jobs/runs/{runId}/nshot/judge-decision`." },
|
|
9956
11049
|
{ type: "heading", level: 2, id: "nshot-compare-errors", text: "Errors" },
|
|
9957
11050
|
{
|
|
9958
11051
|
type: "param-table",
|
|
@@ -10055,6 +11148,7 @@ var sections28 = [
|
|
|
10055
11148
|
}
|
|
10056
11149
|
}`
|
|
10057
11150
|
},
|
|
11151
|
+
{ type: "paragraph", text: "Use this when the **majority value** from the shots is incorrect and a different shot produced the right extraction. The `override` record captures a full audit trail including `from_value`, `to_value`, and `overridden_at`. Pair with the judge decision endpoint when you want to accept LLM recommendations programmatically instead of selecting shots manually." },
|
|
10058
11152
|
{ type: "heading", level: 2, id: "nshot-select-errors", text: "Errors" },
|
|
10059
11153
|
{
|
|
10060
11154
|
type: "param-table",
|
|
@@ -10161,6 +11255,7 @@ var sections28 = [
|
|
|
10161
11255
|
}
|
|
10162
11256
|
}`
|
|
10163
11257
|
},
|
|
11258
|
+
{ type: "paragraph", text: 'This endpoint fits into a review workflow where you batch-process LLM judge recommendations. Retrieve comparisons with `judgement.accepted: null` from the comparisons list, then iterate through them calling this endpoint with `accepted: true` or `false`. Accepted decisions automatically create an override with `actor_id: "judge"`, so no separate override call is needed.' },
|
|
10164
11259
|
{ type: "heading", level: 2, id: "nshot-judge-decision-errors", text: "Errors" },
|
|
10165
11260
|
{
|
|
10166
11261
|
type: "param-table",
|
|
@@ -10324,7 +11419,8 @@ var sections29 = [
|
|
|
10324
11419
|
{ name: "404", type: "not_found", description: "No schema class with this ID exists for your organization." },
|
|
10325
11420
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10326
11421
|
]
|
|
10327
|
-
}
|
|
11422
|
+
},
|
|
11423
|
+
{ type: "paragraph", text: "Use this endpoint to inspect a class before reviewing its pending diffs via `GET /v1/schema-graph/diffs?schema_class_id={id}`. The `current_version_id` tells you which version is live; follow the `links.versions` URL to compare previous snapshots and understand how the class has evolved." }
|
|
10328
11424
|
],
|
|
10329
11425
|
related: [
|
|
10330
11426
|
{ label: "List Classes", slug: "list-schema-graph-classes" },
|
|
@@ -10398,7 +11494,8 @@ var sections29 = [
|
|
|
10398
11494
|
{ name: "404", type: "not_found", description: "No schema class with this ID exists for your organization." },
|
|
10399
11495
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10400
11496
|
]
|
|
10401
|
-
}
|
|
11497
|
+
},
|
|
11498
|
+
{ type: "paragraph", text: "Compare two versions by fetching them individually with `GET /v1/schema-graph/classes/{id}/versions/{version}` and diffing their `json_schema` and `field_ids` arrays. This is useful for auditing how a class evolved after a diff was approved, or for building a changelog UI that shows added and removed fields per version." }
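A changelog view of the kind mentioned above could start from a simple set difference over the two versions' `field_ids` arrays. The version objects below are trimmed, hypothetical examples; comparing `json_schema` would need a deeper, schema-aware diff that is not attempted here.

```javascript
// Computes which field IDs were added or removed between two class versions.
function diffFieldIds(olderVersion, newerVersion) {
  const oldIds = new Set(olderVersion.field_ids);
  const newIds = new Set(newerVersion.field_ids);
  return {
    added: newerVersion.field_ids.filter((id) => !oldIds.has(id)),
    removed: olderVersion.field_ids.filter((id) => !newIds.has(id)),
  };
}

const v1 = { version: 1, field_ids: ["f_vendor", "f_total", "f_due_date"] };
const v2 = { version: 2, field_ids: ["f_vendor", "f_total", "f_iban"] };
console.log(diffFieldIds(v1, v2));
```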
|
|
10402
11499
|
],
|
|
10403
11500
|
related: [
|
|
10404
11501
|
{ label: "Get Version", slug: "get-class-version" },
|
|
@@ -10466,7 +11563,8 @@ var sections29 = [
|
|
|
10466
11563
|
{ name: "404", type: "not_found", description: "No schema class with this ID exists for your organization, or the requested version number does not exist." },
|
|
10467
11564
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10468
11565
|
]
|
|
10469
|
-
}
|
|
11566
|
+
},
|
|
11567
|
+
{ type: "paragraph", text: "The `json_schema` object is a valid JSON Schema you can use directly for client-side validation or code generation. The `field_ids` array maps each schema property back to its field registry entry, so you can cross-reference with `GET /v1/fields/{id}` for extraction instructions and occurrence statistics." }
|
|
10470
11568
|
],
|
|
10471
11569
|
related: [
|
|
10472
11570
|
{ label: "List Versions", slug: "list-class-versions" }
|
|
@@ -10551,7 +11649,8 @@ var sections29 = [
|
|
|
10551
11649
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
10552
11650
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10553
11651
|
]
|
|
10554
|
-
}
|
|
11652
|
+
},
|
|
11653
|
+
{ type: "paragraph", text: "Filter by `status=pending` to build a review queue for ontology changes. Inspect the `added_fields`, `removed_fields`, and `type_changes` arrays to assess impact, then call `POST /v1/schema-graph/diffs/{id}/approve` or `/reject` to action each diff. The `classification` field (`additive` vs `breaking`) helps prioritize which diffs need careful review." }
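For a review queue, one reasonable ordering is to surface `breaking` diffs before `additive` ones. The sketch assumes the list was already fetched with `status=pending`; the `id` values and the rule that unknown classifications sort last are assumptions.

```javascript
// Orders diffs so that "breaking" changes are reviewed before "additive" ones.
// Unknown classifications sort last (assumption). The input is not mutated.
const CLASSIFICATION_RANK = { breaking: 0, additive: 1 };

function prioritizeDiffs(diffs) {
  const rank = (diff) =>
    diff.classification in CLASSIFICATION_RANK ? CLASSIFICATION_RANK[diff.classification] : 2;
  return [...diffs].sort((a, b) => rank(a) - rank(b));
}

const pendingDiffs = [
  { id: "diff_1", classification: "additive", added_fields: ["iban"], removed_fields: [] },
  { id: "diff_2", classification: "breaking", added_fields: [], removed_fields: ["due_date"] },
];
console.log(prioritizeDiffs(pendingDiffs).map((d) => d.id));
```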
|
|
10555
11654
|
],
|
|
10556
11655
|
related: [
|
|
10557
11656
|
{ label: "Approve Diff", slug: "approve-diff" },
|
|
@@ -10606,7 +11705,8 @@ var sections29 = [
|
|
|
10606
11705
|
{ name: "404", type: "not_found", description: "No diff with this ID exists for your organization." },
|
|
10607
11706
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10608
11707
|
]
|
|
10609
|
-
}
|
|
11708
|
+
},
|
|
11709
|
+
{ type: "paragraph", text: "Approval is synchronous -- the new version is created and `current_version_id` is updated in the same request. After approving, call `GET /v1/schema-graph/classes/{id}/versions` to verify the new version appeared, or fetch the updated class via `GET /v1/schema-graph/classes/{id}` to confirm `current_version_id` advanced." }
|
|
10610
11710
|
],
|
|
10611
11711
|
related: [
|
|
10612
11712
|
{ label: "List Diffs", slug: "list-schema-graph-diffs" },
|
|
@@ -10660,7 +11760,8 @@ var sections29 = [
|
|
|
10660
11760
|
{ name: "404", type: "not_found", description: "No diff with this ID exists for your organization." },
|
|
10661
11761
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10662
11762
|
]
|
|
10663
|
-
}
|
|
11763
|
+
},
|
|
11764
|
+
{ type: "paragraph", text: 'Rejected diffs are retained for audit and appear with `review_status: "rejected"` when listing diffs. If the same field changes are needed later, the platform generates a new diff automatically during the next extraction cycle. Use rejection to discard noisy or incorrect field discoveries without advancing the class version.' }
|
|
10664
11765
|
],
|
|
10665
11766
|
related: [
|
|
10666
11767
|
{ label: "List Diffs", slug: "list-schema-graph-diffs" },
|
|
@@ -10735,7 +11836,8 @@ var sections29 = [
|
|
|
10735
11836
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
10736
11837
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10737
11838
|
]
|
|
10738
|
-
}
|
|
11839
|
+
},
|
|
11840
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/schema-graph/classes` to build relationship maps between document types. High-weight edges (e.g. `weight > 0.7`) indicate strong field overlap -- useful for identifying document types that should share user schema fields or be linked in cases. Feed the results directly into `GET /v1/schema-graph/visualize` for a D3-ready graph payload." }
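As an illustration of the relationship-map idea, the sketch below keeps only high-weight edges and groups them into a neighbor map. It assumes each edge has `source`, `target`, and `weight` properties and treats edges as undirected; the class IDs are made up.

```javascript
// Builds a map from class ID to strongly related class IDs, keeping only
// edges whose weight exceeds minWeight. Edges are treated as undirected,
// which is an assumption about how overlap should be read.
function strongNeighbors(edges, minWeight = 0.7) {
  const neighbors = new Map();
  const link = (from, to) => {
    if (!neighbors.has(from)) neighbors.set(from, []);
    neighbors.get(from).push(to);
  };
  for (const edge of edges) {
    if (edge.weight > minWeight) {
      link(edge.source, edge.target);
      link(edge.target, edge.source);
    }
  }
  return neighbors;
}

const sampleEdges = [
  { source: "invoice", target: "credit_note", weight: 0.82 },
  { source: "invoice", target: "contract", weight: 0.35 },
];
console.log(strongNeighbors(sampleEdges));
```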
|
|
10739
11841
|
],
|
|
10740
11842
|
related: [
|
|
10741
11843
|
{ label: "List Classes", slug: "list-schema-graph-classes" },
|
|
@@ -10799,7 +11901,8 @@ var sections29 = [
|
|
|
10799
11901
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
10800
11902
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10801
11903
|
]
|
|
10802
|
-
}
|
|
11904
|
+
},
|
|
11905
|
+
{ type: "paragraph", text: 'Use this endpoint to audit how variant document type labels resolve to canonical classes. For example, if documents labelled "Bill" and "Tax Invoice" both map to the **Invoice** class, they will share the same `schema_class_id`. This is useful for understanding classification behavior and debugging misclassified documents.' }
|
|
10803
11906
|
],
|
|
10804
11907
|
related: [
|
|
10805
11908
|
{ label: "List Classes", slug: "list-schema-graph-classes" }
|
|
@@ -10882,7 +11985,8 @@ var sections29 = [
|
|
|
10882
11985
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
10883
11986
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
10884
11987
|
]
|
|
10885
|
-
}
|
|
11988
|
+
},
|
|
11989
|
+
{ type: "paragraph", text: "The response is structured for direct consumption by D3.js force simulations, Cytoscape, or vis.js -- edge `source` and `target` fields reference node `id` values. Filter nodes client-side by `status` to exclude archived classes, and use edge `weight` to control link distance or opacity in your graph layout." }
|
|
10886
11990
|
],
|
|
10887
11991
|
related: [
|
|
10888
11992
|
{ label: "Edges", slug: "list-schema-graph-edges" },
|
|
@@ -11088,7 +12192,8 @@ var sections30 = [
|
|
|
11088
12192
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
11089
12193
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11090
12194
|
]
|
|
11091
|
-
}
|
|
12195
|
+
},
|
|
12196
|
+
{ type: "paragraph", text: "Newly created checks are automatically **active** and evaluate against all future structuring results. Pair this with `POST /v1/structuring/gates` and `POST /v1/structuring/gates/{id}/rules` to wire checks into an approval gate -- checks flag individual fields, while gate rules aggregate check outcomes to decide whether a result needs manual approval." }
|
|
11092
12197
|
],
|
|
11093
12198
|
related: [
|
|
11094
12199
|
{ label: "List Checks", slug: "list-structuring-checks" },
|
|
@@ -11172,7 +12277,8 @@ var sections30 = [
|
|
|
11172
12277
|
{ name: "404", type: "not_found", description: "Validation check not found or does not belong to your organization." },
|
|
11173
12278
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11174
12279
|
]
|
|
11175
|
-
}
|
|
12280
|
+
},
|
|
12281
|
+
{ type: "paragraph", text: "Use **PUT** to adjust thresholds (e.g. widening a `value_range` config) or to deactivate a check by setting `is_active` to `false` without deleting it. Historical check outcomes referencing this check are preserved regardless of updates or soft-deletion, so audit trails remain intact." }
|
|
11176
12282
|
],
|
|
11177
12283
|
related: [
|
|
11178
12284
|
{ label: "List Checks", slug: "list-structuring-checks" },
|
|
@@ -11275,7 +12381,8 @@ var sections30 = [
|
|
|
11275
12381
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
11276
12382
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11277
12383
|
]
|
|
11278
|
-
}
|
|
12384
|
+
},
|
|
12385
|
+
{ type: "paragraph", text: "Each gate embeds its active `rules` array, so you can inspect thresholds without a separate fetch. Use the `schema_id` query parameter to find which gates apply to a specific schema, then follow the `links.rules` URL to manage individual rules via `POST /v1/structuring/gates/{id}/rules`." }
|
|
11279
12386
|
],
|
|
11280
12387
|
related: [
|
|
11281
12388
|
{ label: "Create Gate", slug: "create-structuring-gate" },
|
|
@@ -11367,7 +12474,8 @@ var sections30 = [
|
|
|
11367
12474
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
11368
12475
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11369
12476
|
]
|
|
11370
|
-
}
|
|
12477
|
+
},
|
|
12478
|
+
{ type: "paragraph", text: "A gate starts with an empty `rules` array, which means all results auto-approve until you add rules. The typical setup sequence is: create the gate, then immediately call `POST /v1/structuring/gates/{id}/rules` to add a `min_confidence` rule with your desired threshold. Optionally set `destination_id` to route approved results directly to a delivery destination." }
|
|
11371
12479
|
],
|
|
11372
12480
|
related: [
|
|
11373
12481
|
{ label: "List Gates", slug: "list-structuring-gates" },
|
|
@@ -11459,7 +12567,8 @@ var sections30 = [
|
|
|
11459
12567
|
{ name: "404", type: "not_found", description: "Approval gate not found or does not belong to your organization." },
|
|
11460
12568
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11461
12569
|
]
|
|
11462
|
-
}
|
|
12570
|
+
},
|
|
12571
|
+
{ type: "paragraph", text: "The `rules` array is only populated on **GET** responses. After a **PUT** update, re-fetch with GET to confirm the current rule set. **DELETE** soft-deletes the gate by setting `is_active` to `false` -- pending approval items already queued by this gate remain in the queue and can still be actioned manually." }
|
|
11463
12572
|
],
|
|
11464
12573
|
related: [
|
|
11465
12574
|
{ label: "List Gates", slug: "list-structuring-gates" },
|
|
@@ -11584,7 +12693,8 @@ var sections30 = [
|
|
|
11584
12693
|
{ name: "404", type: "not_found", description: "Approval gate or rule not found, or gate does not belong to your organization." },
|
|
11585
12694
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11586
12695
|
]
|
|
11587
|
-
}
|
|
12696
|
+
},
|
|
12697
|
+
{ type: "paragraph", text: "Rules are evaluated in `sort_order` ascending -- if any rule fails, the result is flagged for manual approval via `GET /v1/structuring/approvals/pending`. A common setup is a `min_confidence` rule at `sort_order: 0` followed by a `validation_pass_rate` rule at `sort_order: 1`, so confidence is checked before validation pass rate." }
|
|
11588
12698
|
],
|
|
11589
12699
|
related: [
|
|
11590
12700
|
{ label: "Create Gate", slug: "create-structuring-gate" },
|
|
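The rule-ordering behavior described in the hunk above can be modeled locally when reasoning about a gate configuration. This is a hypothetical sketch, not the platform's implementation: `sort_order`, `min_confidence`, and `validation_pass_rate` are named in the paragraph, while the `rule_type` and `threshold` property names and the shape of the `metrics` object are assumptions made for illustration.

```javascript
// Hypothetical local model of gate evaluation; property names other than
// sort_order are assumptions.
function evaluateGate(rules, metrics) {
  // Copy before sorting so the caller's array is not reordered.
  const ordered = [...rules].sort((a, b) => a.sort_order - b.sort_order);
  for (const rule of ordered) {
    const value = metrics[rule.rule_type];
    // A missing metric is treated as a failure, which is the conservative choice.
    if (value === undefined || value < rule.threshold) {
      return { needsApproval: true, failedRule: rule.rule_type };
    }
  }
  // No rules, or every rule passed: nothing is flagged.
  return { needsApproval: false, failedRule: null };
}
```

With an empty `rules` array the function reports that no approval is needed, which matches the documented behavior that a gate without rules auto-approves every result.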
@@ -11658,7 +12768,8 @@ var sections30 = [
|
|
|
11658
12768
|
{ name: "404", type: "not_found", description: "Structuring result not found or does not belong to your organization." },
|
|
11659
12769
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11660
12770
|
]
|
|
11661
|
-
}
|
|
12771
|
+
},
|
|
12772
|
+
{ type: "paragraph", text: 'Check outcomes are computed automatically when a structuring result is produced -- no manual trigger is needed. Use the embedded `check.name` and `check.severity` fields to render pass/fail badges in a review UI. Failed checks with `severity: "error"` are the ones that cause results to appear in `GET /v1/structuring/approvals/pending`.' }
|
|
11662
12773
|
],
|
|
11663
12774
|
related: [
|
|
11664
12775
|
{ label: "List Checks", slug: "list-structuring-checks" },
|
|
@@ -11729,7 +12840,8 @@ var sections30 = [
|
|
|
11729
12840
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
11730
12841
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11731
12842
|
]
|
|
11732
|
-
}
|
|
12843
|
+
},
|
|
12844
|
+
{ type: "paragraph", text: "Each item links a structuring result to the specific check that failed, so a single result can appear multiple times if it failed multiple checks. Use the `result_id` to group items by result, then call `POST /v1/structuring/approvals/{id}/approve` or `/reject` to action each one. Approving a result clears all its pending items and triggers the gate's `on_approve` action." }
|
|
11733
12845
|
],
|
|
11734
12846
|
related: [
|
|
11735
12847
|
{ label: "Approve / Reject", slug: "approve-reject-result" },
|
|
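Because a single result can appear several times in the pending list, the grouping step mentioned in the hunk above might look like the sketch below. Only `result_id` is taken from the paragraph; the `check_name` property in the usage note is a hypothetical field used for illustration.

```javascript
// Group pending approval items by the result they belong to.
// Returns a Map of result_id -> array of items, preserving input order.
function groupByResult(items) {
  const groups = new Map();
  for (const item of items) {
    const bucket = groups.get(item.result_id) ?? [];
    bucket.push(item);
    groups.set(item.result_id, bucket);
  }
  return groups;
}
```

A review UI could then render one row per key and list the failed checks inside it, for example `groupByResult(items).get("res_1").map((i) => i.check_name)`, so that each result is approved or rejected once rather than once per failed check.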
@@ -11796,7 +12908,8 @@ var sections30 = [
|
|
|
11796
12908
|
{ name: "404", type: "not_found", description: "Structuring result or approval gate not found, or they do not belong to your organization." },
|
|
11797
12909
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11798
12910
|
]
|
|
11799
|
-
}
|
|
12911
|
+
},
|
|
12912
|
+
{ type: "paragraph", text: "After approving, the gate's `on_approve` action fires -- typically emitting a delivery signal. To batch-approve an entire run's results at once, iterate the pending approvals from `GET /v1/structuring/approvals/pending`, approve each individually, then call `POST /v1/structuring/delivery/{runId}` to trigger delivery for the full run." }
|
|
11800
12913
|
],
|
|
11801
12914
|
related: [
|
|
11802
12915
|
{ label: "Pending Approvals", slug: "pending-approvals" },
|
|
@@ -11853,7 +12966,8 @@ var sections30 = [
|
|
|
11853
12966
|
{ name: "404", type: "not_found", description: "Job run not found or does not belong to your organization." },
|
|
11854
12967
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
11855
12968
|
]
|
|
11856
|
-
}
|
|
12969
|
+
},
|
|
12970
|
+
{ type: "paragraph", text: "This endpoint is idempotent per result -- each approved result generates a deterministic idempotency key, so calling it multiple times on the same run does not produce duplicate deliveries. The typical workflow is: approve results individually via `POST /v1/structuring/approvals/{id}/approve`, then trigger delivery for the entire run once all reviews are complete." }
|
|
11857
12971
|
],
|
|
11858
12972
|
related: [
|
|
11859
12973
|
{ label: "Approve / Reject", slug: "approve-reject-result" }
|
|
@@ -16644,7 +17758,8 @@ var sections39 = [
|
|
|
16644
17758
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
16645
17759
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
16646
17760
|
]
|
|
16647
|
-
}
|
|
17761
|
+
},
|
|
17762
|
+
{ type: "paragraph", text: "Most integrations poll this endpoint on a schedule to detect new items, then call `GET /v1/review/:id` to fetch full detail before rendering a review UI. Pair with `GET /v1/review/stats` to monitor queue depth and set alerting thresholds on the **pending** count." }
|
|
16648
17763
|
],
|
|
16649
17764
|
related: [
|
|
16650
17765
|
{ label: "Review Stats", slug: "review-stats" },
|
|
@@ -16701,7 +17816,8 @@ var sections39 = [
|
|
|
16701
17816
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
16702
17817
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
16703
17818
|
]
|
|
16704
|
-
}
|
|
17819
|
+
},
|
|
17820
|
+
{ type: "paragraph", text: "This endpoint is typically called on a dashboard polling loop to drive queue-depth indicators. Pair it with `GET /v1/review` filtered by `status=pending` to fetch the actual items once the **pending** count crosses your alerting threshold." }
|
|
16705
17821
|
],
|
|
16706
17822
|
related: [
|
|
16707
17823
|
{ label: "List Review Items", slug: "list-review-items" },
|
|
@@ -16782,7 +17898,8 @@ var sections39 = [
|
|
|
16782
17898
|
{ name: "404", type: "not_found", description: "Review record not found or does not belong to your organization." },
|
|
16783
17899
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
16784
17900
|
]
|
|
16785
|
-
}
|
|
17901
|
+
},
|
|
17902
|
+
{ type: "paragraph", text: "Use the `low_confidence_fields` array to highlight problematic cells in your review UI before calling `POST /v1/review/:id/action`. The `field_decisions` object persists per-field overrides, so you can build interfaces where reviewers correct individual values and submit the decision in one step." }
|
|
16786
17903
|
],
|
|
16787
17904
|
related: [
|
|
16788
17905
|
{ label: "Review Action", slug: "review-action" },
|
|
@@ -16866,7 +17983,8 @@ var sections39 = [
|
|
|
16866
17983
|
{ name: "404", type: "not_found", description: "Review record not found or does not belong to your organization." },
|
|
16867
17984
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
16868
17985
|
]
|
|
16869
|
-
}
|
|
17986
|
+
},
|
|
17987
|
+
{ type: "paragraph", text: "After approval, the record's `status` changes to `approved` and a delivery signal is emitted into the outbox. Pair this with `POST /v1/review/batch` when you need to clear multiple items sharing similar characteristics, or call `GET /v1/review/stats` afterward to verify the **pending** count dropped as expected." }
|
|
16870
17988
|
],
|
|
16871
17989
|
related: [
|
|
16872
17990
|
{ label: "Get Review Item", slug: "get-review-item" },
|
|
@@ -16936,7 +18054,8 @@ var sections39 = [
|
|
|
16936
18054
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
16937
18055
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
16938
18056
|
]
|
|
16939
|
-
}
|
|
18057
|
+
},
|
|
18058
|
+
{ type: "paragraph", text: "A common pattern is to first call `GET /v1/review?status=pending` to collect IDs, filter client-side by `overall_confidence` above a safe threshold, then batch-approve those IDs here. Check the `results` array for per-item outcomes -- items that were already actioned return an error status but do not block the rest of the batch." }
|
|
16940
18059
|
],
|
|
16941
18060
|
related: [
|
|
16942
18061
|
{ label: "Review Action", slug: "review-action" },
|
|
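The client-side filtering step from the hunk above, and the inspection of per-item outcomes afterward, can be sketched as two small helpers. `overall_confidence`, `status`, and `results` are named in the added paragraph; the `id` property, the use of `>=` for the threshold comparison, and the literal `"error"` status value on each outcome are assumptions for illustration.

```javascript
// Pick the IDs of pending items whose confidence meets the threshold.
// Item shape ({ id, status, overall_confidence }) is assumed.
function selectSafeIds(items, threshold) {
  return items
    .filter((item) => item.status === "pending" && item.overall_confidence >= threshold)
    .map((item) => item.id);
}

// From a batch response's `results` array, return the outcomes that failed.
// The "error" status literal is an assumption.
function failedOutcomes(results) {
  return results.filter((outcome) => outcome.status === "error");
}
```

Keeping the selection logic separate from the HTTP call makes the threshold easy to unit test, and checking `failedOutcomes` after the batch call surfaces items that were already actioned without treating the whole batch as failed.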
@@ -17018,7 +18137,8 @@ var sections39 = [
|
|
|
17018
18137
|
{ name: "404", type: "not_found", description: "Review record not found or does not belong to your organization." },
|
|
17019
18138
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
17020
18139
|
]
|
|
17021
|
-
}
|
|
18140
|
+
},
|
|
18141
|
+
{ type: "paragraph", text: "Assignments are typically made after listing pending items via `GET /v1/review` and distributing them across team members. The `assigned_to` field appears in list and detail responses, so downstream tools can filter by assignee to build per-reviewer queues. Pass `null` as `user_id` to return an item to the unassigned pool." }
|
|
17022
18142
|
],
|
|
17023
18143
|
related: [
|
|
17024
18144
|
{ label: "Get Review Item", slug: "get-review-item" },
|
|
@@ -18658,7 +19778,8 @@ var sections43 = [
|
|
|
18658
19778
|
{ name: "401", type: "unauthorized", description: "Missing or invalid API key." },
|
|
18659
19779
|
{ name: "429", type: "rate_limited", description: "Too many requests. Retry after the period indicated in the Retry-After header." }
|
|
18660
19780
|
]
|
|
18661
|
-
}
|
|
19781
|
+
},
|
|
19782
|
+
{ type: "paragraph", text: "Most integrations use registry query as a lookup layer after ingestion is complete. Call `POST /v1/extract` to ingest documents, wait for the `document.extraction.completed` webhook, then query the registry by field values to retrieve structured data across your entire corpus. Pair with `GET /v1/fields` to discover available canonical field names before building `where` conditions." }
|
|
18662
19783
|
],
|
|
18663
19784
|
related: [
|
|
18664
19785
|
{ label: "Field Registry", slug: "field-registry" },
|