@talonic/docs 0.20.10 → 0.20.12
- package/dist/content.js +1202 -6
- package/package.json +1 -1
- package/dist/tailwind-preset.d.cts +0 -45
- package/dist/tailwind-preset.d.ts +0 -45
package/dist/content.js
CHANGED
@@ -542,6 +542,14 @@ var sections = [
 }
 ]
 },
+{
+type: "paragraph",
+text: 'Understanding the relationship between these concepts is key to getting the most from the platform. When you upload documents, the extraction pipeline discovers every data point and feeds them into the **Field Registry**. The registry uses AI embeddings to cluster semantically similar fields \u2014 so "Vendor Name", "Supplier Name", and "Company Name" are recognized as the same concept. Over time, frequently occurring fields are promoted to higher tiers, and the platform synthesizes master extraction instructions that encode the best way to extract each field.'
+},
+{
+type: "paragraph",
+text: "The **Schema** layer sits on top of the registry and defines what output you need. You can use auto-generated schemas that the platform creates for each document type, or build custom template schemas by selecting specific fields from the registry. When a schema is applied to documents in a **Job**, the 4-phase pipeline fills every cell \u2014 starting with free graph lookups and falling back to AI agents for the remainder. The result is a structured grid where each row is a document and each column is a field."
+},
 {
 type: "callout",
 variant: "info",
@@ -617,6 +625,14 @@ var sections = [
 type: "paragraph",
 text: "The pipeline is designed to be **progressive** \u2014 results appear as each phase completes rather than waiting for the entire job to finish. Phase 1 (graph resolve) fills ~30% of cells instantly and for free. Phase 2 (AI extraction) fills the remaining gaps. Phases 3 and 4 handle re-resolution and transformation. You can start reviewing early results while later phases are still running."
 },
+{
+type: "paragraph",
+text: "Use the platform flow as a mental model when planning your workflow. For small, ad-hoc extractions you can go from upload to results in minutes \u2014 upload a few documents, pick an auto-generated schema, and run a job. For production workloads, invest time in the **Define schema** step: map fields to the registry, add reference tables for code lookups, and set format constraints. The upfront effort pays off because every subsequent job reuses the same schema and benefits from the growing knowledge graph."
+},
+{
+type: "paragraph",
+text: "After results are delivered, the feedback loop closes automatically. Corrections you make during the **Review** stage feed back into the Field Registry, improving future extractions. The platform tracks telemetry across runs \u2014 strategy distribution, capture hit rate, and resolve rate \u2014 so you can monitor how extraction quality improves over time as the knowledge graph accumulates more data."
+},
 {
 type: "callout",
 variant: "info",
@@ -679,6 +695,14 @@ var sections = [
 title: "Sidebar Navigation",
 caption: "The sidebar provides access to all sections. Click the collapse button to save space. Press Cmd+K for global search."
 },
+{
+type: "paragraph",
+text: "For teams processing documents at scale, the recommended approach is to start with a small representative sample. Upload 5-10 documents of the same type, let the platform extract and classify them, then review the auto-generated schema. This lets you validate the output structure before committing to a large batch. Once the schema looks right, you can upload hundreds or thousands of documents and the knowledge graph will handle an increasing share of cells through instant graph matches."
+},
+{
+type: "paragraph",
+text: "The platform includes powerful keyboard shortcuts for fast navigation. Press `Cmd+K` (or `Ctrl+K` on Windows) to open **Omnisearch**, which lets you find documents, schemas, jobs, and fields from anywhere. Press `Cmd+I` to open the **AI Agent** for natural language queries about your workspace. The sidebar can be collapsed to give more screen real estate when reviewing extraction results."
+},
 {
 type: "callout",
 text: "The fastest path to results: upload documents in **Sources**, then go to **Structuring → Runs → New** to create your first extraction job."
@@ -735,6 +759,18 @@ var sections2 = [
 type: "paragraph",
 text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language."
 },
+{
+type: "paragraph",
+text: "The agent is context-aware, meaning it automatically knows which page you are on and what data is visible. If you open the agent from a document detail page, it already has that document in scope and can answer questions about its extracted fields, processing status, or classification without you needing to specify which document you mean."
+},
+{
+type: "paragraph",
+text: "The agent classifies every user message as either a **question** (answered with information) or a **command** (triggers an action). Questions are handled instantly with read-only access, while commands go through the impact-level system to ensure safety. The agent streams its responses in real time, so you can see reasoning unfold as it queries your workspace data."
+},
+{
+type: "paragraph",
+text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page."
+},
 { type: "heading", level: 3, id: "agent-capabilities", text: "What the Agent Can Do" },
 {
 type: "paragraph",
@@ -799,6 +835,14 @@ var sections2 = [
 {
 question: "Can the AI agent modify my data?",
 answer: "The agent operates workshop-first: schema changes create drafts, not live versions. Higher-impact operations require progressively more explicit confirmation."
+},
+{
+question: "Is the AI agent context-aware?",
+answer: "Yes. The agent automatically knows which page you are on and what data is visible. If you open it from a document detail page, it already has that document in scope and can answer questions about its fields, processing status, or classification."
+},
+{
+question: "Can the AI agent access external systems or the internet?",
+answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform."
 }
 ],
 mentions: [
@@ -846,6 +890,18 @@ var sections2 = [
 }
 ]
 },
+{
+type: "paragraph",
+text: "The `read` impact level covers the vast majority of agent interactions. Searching documents, inspecting extraction results, browsing the field registry, and checking job status all execute instantly with no side effects. These read operations give you a fast way to explore your workspace without navigating through multiple pages."
+},
+{
+type: "paragraph",
+text: "The `draft_mutation` level is used when the agent creates or modifies schemas. Because all schema changes go through the workshop system, the agent can freely draft schemas without risk \u2014 nothing goes live until you explicitly review and publish. This makes the agent especially useful for rapid schema prototyping: describe the fields you need in plain language, and the agent creates a draft you can refine."
+},
+{
+type: "paragraph",
+text: 'The `live_mutation` and `irreversible` levels provide escalating safety gates for operations that affect production data. A `live_mutation` \u2014 such as triggering a job run or publishing a schema \u2014 presents a confirmation dialog that you must accept. An `irreversible` action \u2014 such as deleting a source or purging documents \u2014 requires you to type a confirmation keyword (e.g., "DELETE") to proceed, preventing accidental data loss.'
+},
 {
 type: "callout",
 text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready."
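Reviewer note on the hunk above: the four impact levels read as an escalating gate. The following is a minimal sketch of that idea, not Talonic's implementation. The level names and the "DELETE" example come from the added docs text; the `GATES` table, the `isAllowed` function, and the shape of `userInput` are hypothetical names introduced only for illustration.

```javascript
// Hypothetical gate table keyed by the impact levels named in the docs text.
// A Map is used so that only the four known levels can ever match.
const GATES = new Map([
  ["read", { confirm: "none" }],
  ["draft_mutation", { confirm: "none" }], // drafts only, nothing goes live
  ["live_mutation", { confirm: "dialog" }], // user must accept a dialog
  ["irreversible", { confirm: "keyword", keyword: "DELETE" }], // keyword is an assumption
]);

// Returns true when the action may proceed given what the user has confirmed.
function isAllowed(level, userInput = {}) {
  const gate = GATES.get(level);
  if (!gate) throw new Error(`Unknown impact level: ${level}`);
  if (gate.confirm === "none") return true;
  if (gate.confirm === "dialog") return userInput.accepted === true;
  // Keyword gate: the typed text must match exactly, including case.
  return userInput.typed === gate.keyword;
}
```

Under these assumptions, `isAllowed("irreversible", { typed: "delete" })` is rejected because the keyword comparison is case-sensitive, which is one possible reading of "type a confirmation keyword".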
@@ -863,6 +919,10 @@ var sections2 = [
 {
 question: "Does the AI agent make changes directly to live data?",
 answer: "No. The agent operates workshop-first. Schema changes create drafts, and live mutations require explicit user confirmation before executing."
+},
+{
+question: "What happens when I ask the agent to delete something?",
+answer: 'Deletion is classified as an irreversible action. The agent will ask you to type a confirmation keyword (e.g., "DELETE") before proceeding. This prevents accidental data loss from casual or ambiguous requests.'
 }
 ],
 mentions: ["impact levels", "draft mutation", "live mutation", "workshop-first"]
@@ -877,6 +937,18 @@ var sections2 = [
 {
 type: "paragraph",
 text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction. The agent input field lets you type any question directly from the dashboard."
+},
+{
+type: "paragraph",
+text: "The dashboard provides a workspace-level overview that helps you understand the health of your data pipeline at a glance. You can see document processing statistics, recent activity across sources, and the current state of your field registry. Key metrics like **capture rate**, **resolve rate**, and **synthesize rate** from the telemetry system are surfaced so you can spot trends without drilling into individual jobs."
+},
+{
+type: "paragraph",
+text: "Suggested prompts are dynamically generated based on what the platform detects in your workspace. If you have new document types that lack schemas, the dashboard suggests creating one. If a job run recently completed, it suggests reviewing the results. If field registry confirmations are pending, it prompts you to review them. This makes the dashboard a natural starting point for your workflow each session."
+},
+{
+type: "paragraph",
+text: "Every conversation with the agent is preserved in your session history, accessible from the dashboard. You can revisit previous questions and their answers, which is useful for auditing decisions or recalling how you configured a particular schema. The conversation history also provides continuity \u2014 if you asked the agent to analyze extraction quality last week, you can pick up where you left off."
 }
 ],
 related: [
@@ -891,6 +963,10 @@ var sections2 = [
 {
 question: "Do the suggested prompts change based on workspace state?",
 answer: "Yes. Prompts adapt dynamically based on active runs, schema creation opportunities, document types waiting for extraction, and other workspace activity."
+},
+{
+question: "Can I revisit previous conversations with the agent?",
+answer: "Yes. Every conversation is preserved in your session history, accessible from the dashboard. You can revisit previous questions, recall how you configured a schema, or pick up where you left off in a previous analysis."
 }
 ],
 mentions: ["dashboard", "suggested prompts", "workspace state", "agent input"]
@@ -923,6 +999,10 @@ var sections3 = [
 {
 type: "paragraph",
 text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. Processing runs asynchronously so you can continue working."
+},
+{
+type: "paragraph",
+text: "When uploading folders or ZIP archives, the original directory structure is preserved as a `source_file_path` metadata field on each document (e.g., `contracts/2026/lease.pdf`). This field is available for filtering, export, and schema mapping \u2014 just like any AI-extracted field. It provides a natural way to organize and trace documents back to their original location in your file system."
 }
 ],
 related: [
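Reviewer note on the hunk above: the context line says files are deduplicated via SHA-256 hashing. A minimal sketch of content-hash deduplication follows, using Node's built-in `crypto.createHash`. The `UploadStore` class, its in-memory `Map`, and the returned object shape are hypothetical and stand in for whatever persistence the platform actually uses.

```javascript
const { createHash } = require("crypto");

// Identical bytes always produce the same digest, so the digest can act as a dedup key.
function sha256Hex(buffer) {
  return createHash("sha256").update(buffer).digest("hex");
}

// Hypothetical in-memory store; a real system would persist hashes in a database.
class UploadStore {
  constructor() {
    this.byHash = new Map();
  }

  // Adds a file unless the same content was already uploaded.
  add(name, buffer) {
    const hash = sha256Hex(buffer);
    const existing = this.byHash.get(hash);
    if (existing) return { created: false, document: existing };
    const document = { name, hash };
    this.byHash.set(hash, document);
    return { created: true, document };
  }
}
```

Note that the key is the file content, not the file name: in this sketch, uploading the same bytes under a different name returns the first document rather than creating a second one.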
@@ -938,6 +1018,10 @@ var sections3 = [
 {
 question: "Does Talonic detect duplicate uploads?",
 answer: "Yes. Files are deduplicated via SHA-256 hashing. Uploading the same file twice will not create duplicates."
+},
+{
+question: "What happens when I upload a folder or ZIP archive?",
+answer: "ZIP archives are unpacked recursively and each file is processed individually. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering and export."
 }
 ],
 mentions: [
@@ -956,6 +1040,10 @@ var sections3 = [
 seoTitle: "Supported File Formats \u2014 Talonic Docs",
 description: "Talonic supports 25+ file types across four processing paths: text fast-path, AI vision, OCR, and recursive archive unpacking. From PDF to XLSX to images.",
 content: [
+{
+type: "paragraph",
+text: "Talonic supports 25+ file types across four distinct processing paths. Each path is optimized for its file category \u2014 text files are read directly with zero latency, while complex document formats go through OCR to produce high-quality Markdown. The processing path is selected automatically based on the file extension."
+},
 {
 type: "param-table",
 title: "File processing paths",
@@ -981,6 +1069,23 @@ var sections3 = [
 description: "ZIP \u2014 unpacked and each file processed individually."
 }
 ]
+},
+{
+type: "paragraph",
+text: "The **OCR path** uses Mistral Document AI as the primary engine, with a Talonic API fallback if the primary service is unavailable. OCR converts documents to structured Markdown, preserving tables, headings, and layout information. For PDF files that exceed the configured chunk size (default 25 pages), the system automatically splits the document into page chunks, processes them in parallel, and merges the results \u2014 so even large documents are handled efficiently."
+},
+{
+type: "paragraph",
+text: `Image files follow the **AI Vision** path, where they are sent directly to the AI model for multimodal extraction. This means the AI "sees" the image and extracts data visually \u2014 useful for photos of receipts, scanned handwritten notes, or diagrams. If an image was previously OCR'd and produced meaningful Markdown (more than 100 characters), the system uses the Markdown extraction path instead, which enables richer quality metrics.`
+},
+{
+type: "paragraph",
+text: "The **text fast-path** is the most efficient route: files like CSV, JSON, and plain text are read directly into memory with no external API call. This means they process almost instantly and incur no OCR cost. Email files (EML, MSG) are parsed to extract both the message body and any attachments, with each attachment processed as a separate document."
+},
+{
+type: "callout",
+variant: "info",
+text: "The processing path is selected automatically based on the file extension \u2014 you do not need to configure anything. If a file type is not recognized, the platform will attempt OCR as a fallback before marking it as unsupported."
 }
 ],
 related: [
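Reviewer note on the hunk above: the OCR paragraph describes splitting large PDFs into page chunks (default 25 pages), processing them in parallel, and merging the results. The sketch below illustrates that split-and-merge pattern only. `chunkPages`, `ocrInChunks`, and the caller-supplied `ocrChunk` callback are hypothetical names, and joining chunk output with blank lines is an assumption about how merging might work.

```javascript
// Split a page count into contiguous 1-based, inclusive page ranges.
function chunkPages(pageCount, chunkSize = 25) {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks = [];
  for (let start = 1; start <= pageCount; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize - 1, pageCount) });
  }
  return chunks;
}

// Run a caller-supplied async OCR function on every chunk in parallel.
// Promise.all keeps results in input order, so the merged text stays in page order.
async function ocrInChunks(pageCount, ocrChunk, chunkSize = 25) {
  const chunks = chunkPages(pageCount, chunkSize);
  const parts = await Promise.all(chunks.map((chunk) => ocrChunk(chunk)));
  return parts.join("\n\n");
}
```

A document at or below the chunk size yields a single range, so the same code path covers both small and large files.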
@@ -995,6 +1100,10 @@ var sections3 = [
 {
 question: "How does Talonic handle image files?",
 answer: "Image files (PNG, JPG, JPEG, GIF, WEBP) are sent to AI for multimodal visual extraction."
+},
+{
+question: "How does Talonic handle large PDF files?",
+answer: "PDF files that exceed the configured chunk size (default 25 pages) are automatically split into page chunks, processed in parallel, and merged. This ensures even large documents are handled efficiently without timeouts."
 }
 ],
 mentions: ["OCR", "AI vision", "text fast-path", "file formats", "PDF", "DOCX", "ZIP"]
@@ -1061,6 +1170,10 @@ var sections3 = [
 {
 question: "When is a document ready to use in jobs?",
 answer: "Documents are marked complete after AI extraction finishes. You can start using them in jobs immediately without waiting for further processing."
+},
+{
+question: "What happens if OCR or extraction fails on a document?",
+answer: "The platform automatically retries failed extractions (configurable, default 1 retry). If all retries fail, the document is marked as extraction_failed with a terminal status. OCR failures follow a separate retry path with fallback from Document AI to Talonic API to local parsers."
 }
 ],
 mentions: [
@@ -1087,6 +1200,19 @@ var sections3 = [
 {
 type: "paragraph",
 text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata. Unresolvable documents are assigned "Unclassified Document".'
+},
+{
+type: "paragraph",
+text: `Classification is verified in a two-step process. First, **Document AI OCR** produces an annotation with a free-text type label during the OCR pass. Then, a **type resolution** step verifies that label against the actual document content. If the label and content disagree \u2014 for example, a German *Arbeitsvertrag* incorrectly labelled as "Service Agreement" \u2014 the system trusts the content and resolves the correct canonical type. This ensures accurate classification regardless of the OCR engine's labelling bias.`
+},
+{
+type: "paragraph",
+text: "Document types drive several downstream features. The platform auto-generates a **schema** for each document type, pre-populated with fields discovered from documents of that type. **Routing rules** can be configured per document type to automatically assign schemas or trigger jobs when new documents arrive. The **Field Registry** tracks which fields appear in which document types, building a cross-type knowledge graph over time."
+},
+{
+type: "callout",
+variant: "info",
+text: "You never need to create document types manually. The ontology is built into the platform and types are assigned automatically during classification. If you disagree with a classification, the AI agent can help you understand why a type was chosen and how the content signals were interpreted."
 }
 ],
 related: [
@@ -1102,6 +1228,10 @@ var sections3 = [
 {
 question: "Does document classification work in non-English languages?",
 answer: "Yes. The classifier works across all languages. For example, a German Arbeitsvertrag and an English Employment Contract map to the same canonical type."
+},
+{
+question: "What happens if a document cannot be classified?",
+answer: 'Unresolvable documents are assigned the "Unclassified Document" type. They can still be processed and extracted \u2014 the platform simply cannot map them to a specific canonical type in the 529-type ontology.'
 }
 ],
 mentions: [
@@ -1147,6 +1277,23 @@ var sections3 = [
 description: "View or download the source document."
 }
 ]
+},
+{
+type: "paragraph",
+text: "The **Raw Extraction** tab is the most detailed view, showing every field the AI discovered along with its confidence score and the source text that the value was extracted from. Each field displays a tier badge (Tier 1 green, Tier 2 amber, Tier 3 gray) indicating how well-established that field is across your document corpus. Synthetic metadata fields like `filename` and `source_file_path` appear here too, with full confidence (1.0)."
+},
+{
+type: "paragraph",
+text: "The **Resolved Data** tab shows how raw extracted fields map to your canonical field registry. Fields that matched automatically (similarity >= 0.80) display their canonical name and cluster. Fields in the confirm band (0.50-0.79) are flagged for review. This view helps you understand how the platform is normalizing field names across different document types and formats."
+},
+{
+type: "paragraph",
+text: "The **Processing Log** tab provides a stage-by-stage timeline of how the document was processed, including per-stage timing. You can see exactly how long OCR, classification, and extraction took, which is useful for diagnosing slow processing or understanding why a document was classified a particular way. The **Original File** tab lets you view or download the source file, so you can always compare the AI's extraction against the original document."
+},
+{
+type: "callout",
+variant: "info",
+text: "You can open the **AI Agent** (`Cmd+I`) from any document detail page. The agent automatically has the current document in scope and can answer questions about its fields, classification, or processing status without you needing to specify which document you mean."
 }
 ],
 related: [
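Reviewer note on the hunk above: the Resolved Data paragraph gives two similarity thresholds (automatic match at 0.80 and above, confirm band from 0.50 to 0.79). A minimal sketch of that banding follows. The function name `resolutionBand` and the band labels are hypothetical; the text does not say what happens below 0.50 or between 0.79 and 0.80, so this sketch assumes "no match" and "confirm" respectively.

```javascript
// Thresholds taken from the docs text; the labels are illustrative.
const AUTO_MATCH_MIN = 0.8;
const CONFIRM_MIN = 0.5;

// Classify a similarity score into a resolution band.
function resolutionBand(similarity) {
  if (typeof similarity !== "number" || Number.isNaN(similarity)) {
    throw new TypeError("similarity must be a number");
  }
  if (similarity >= AUTO_MATCH_MIN) return "auto_match";
  if (similarity >= CONFIRM_MIN) return "confirm"; // flagged for human review
  return "no_match"; // assumption: treated as a new field
}
```

Comparing against the lower bound of each band avoids a gap for scores such as 0.795 that fall between the two ranges as written.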
@@ -1162,6 +1309,10 @@ var sections3 = [
 {
 question: "How can I see the confidence score of an extracted field?",
 answer: "Open the document detail page and navigate to the Raw Extraction tab. Each field displays its confidence score alongside the extracted value and source text."
+},
+{
+question: "What do the tier badges on fields mean?",
+answer: "Tier badges indicate how well-established a field is across your document corpus. Tier 1 (green) are universal core fields, Tier 2 (amber) are established promoted fields, and Tier 3 (gray) are newly discovered emerging fields."
 }
 ],
 mentions: [
@@ -1182,6 +1333,23 @@ var sections3 = [
 {
 type: "paragraph",
 text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents → Routing**."
+},
+{
+type: "paragraph",
+text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides."
+},
+{
+type: "paragraph",
+text: 'Routing rules are especially useful for high-volume ingestion pipelines. If you connect a Google Drive folder that receives hundreds of invoices per week, a routing rule can automatically assign your "Invoice" schema and trigger extraction \u2014 turning what would be manual work into a fully automated pipeline. Combined with **delivery bindings**, this creates an end-to-end flow from document upload to structured output with zero manual intervention.'
+},
+{
+type: "paragraph",
+text: "You can review rule execution history from the routing page to see which rules fired, which documents they matched, and what actions were taken. This audit trail helps you verify that your routing configuration is working as expected and diagnose cases where documents were not routed correctly."
+},
+{
+type: "callout",
+variant: "info",
+text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules."
 }
 ],
 related: [
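Reviewer note on the hunk above: the added text says each routing rule pairs a document type with one or more actions, and that rules are evaluated in priority order. The sketch below shows one way such evaluation could look. The rule shape (`documentType`, `priority`, `actions`), the action strings, and the convention that a lower `priority` number runs first are all assumptions; the docs do not specify any of them.

```javascript
// Return the rules that match a document, in evaluation order.
// Assumption: a lower priority number is evaluated first.
function matchRules(rules, document) {
  return rules
    .filter((rule) => rule.documentType === document.type) // filter() copies, so sort() does not mutate the input
    .sort((a, b) => a.priority - b.priority);
}

// Flatten the matched rules into the ordered list of actions to execute.
function actionsFor(rules, document) {
  return matchRules(rules, document).flatMap((rule) => rule.actions);
}

// Hypothetical configuration for illustration only.
const exampleRules = [
  { documentType: "Invoice", priority: 2, actions: ["trigger_job"] },
  { documentType: "Invoice", priority: 1, actions: ["assign_schema:Invoice"] },
  { documentType: "Contract", priority: 1, actions: ["tag:legal-review"] },
];
```

With this configuration, `actionsFor(exampleRules, { type: "Invoice" })` assigns the schema before triggering the job, regardless of the order in which the rules were declared.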
@@ -1197,6 +1365,10 @@ var sections3 = [
 {
 question: "Where do I manage routing rules?",
 answer: "Navigate to Documents > Routing to create and manage routing rules for your workspace."
+},
+{
+question: "Can routing rules fully automate my document processing pipeline?",
+answer: "Yes. By combining routing rules with source connectors and delivery bindings, you can create a fully automated pipeline: documents arrive from a connected source, routing rules assign schemas and trigger extraction jobs, and delivery bindings push approved results to downstream systems."
 }
 ],
 mentions: ["routing rules", "auto-assign", "schema assignment", "document workflows"]
@@ -1272,6 +1444,14 @@ var sections3 = [
 type: "paragraph",
 text: "Google and Microsoft connectors share a single OAuth client each. OAuth tokens are encrypted at rest using `aes-256-gcm`. Each source card includes a **Batch Processing** toggle to defer extraction at 50% cost."
 },
+{
+type: "paragraph",
+text: "OAuth-based connectors (Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion) use a consent-based flow where you authorize Talonic to access specific resources. For Microsoft connectors, Teams requires extended scopes that need tenant-admin consent. If a connector's OAuth credentials are revoked or expire, the source enters a disconnected state \u2014 reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents."
+},
+{
+type: "paragraph",
+text: "Credential-based connectors (SQL, Amazon S3, Azure Blob) authenticate with access keys or connection strings rather than OAuth. SQL connections support PostgreSQL, MySQL, and MSSQL, with a built-in read-only safety layer that prevents accidental writes. S3-compatible storage like MinIO and Cloudflare R2 also works through the S3 connector. All credentials are encrypted at rest before being stored."
+},
 {
 type: "callout",
 text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled."
@@ -1290,6 +1470,10 @@ var sections3 = [
 {
 question: "How are OAuth tokens stored?",
 answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET)."
+},
+{
+question: "What happens if a connector loses its credentials or authorization?",
+answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration."
 }
 ],
 mentions: [
@@ -1331,6 +1515,18 @@ var sections4 = [
 id: "field-registry-table",
 title: "Field Registry \u2014 Registry Table",
 caption: "Fields are organized by tier with occurrence counts, data types, and master instruction status."
+},
+{
+type: "paragraph",
+text: "The registry grows automatically as documents are processed. During extraction, AI discovers fields from each document and resolves them against existing registry entries using **three-band matching** (exact name match, cluster member match, then semantic embedding similarity). New fields that don't match anything create a Tier 3 entry. Frequently occurring fields are promoted to higher tiers, so the registry naturally converges on a stable set of canonical fields over time."
+},
+{
+type: "paragraph",
+text: "Each registry entry tracks its **occurrence count** (how many documents contain this field), **data type** (string, number, date, etc.), **synonyms** (alternate names discovered across documents), and **master instruction** (an AI-synthesized extraction directive). The registry also maintains two embedding vectors per field: one for resolution matching and one for graph visualization, ensuring that each concern uses the most appropriate representation."
+},
+{
+type: "paragraph",
+text: "The registry is the foundation for several downstream features. **Jobs** use registry fields to pre-fill schema values via lookup cascades before resorting to LLM extraction. **Semantic clusters** group related registry fields together. **Generated schemas** are auto-built from registry fields that appear in a given document type. Understanding the registry is key to understanding how Talonic reduces extraction cost and improves accuracy over time."
 }
 ],
 related: [
@@ -1346,6 +1542,10 @@ var sections4 = [
|
|
|
1346
1542
|
{
|
|
1347
1543
|
question: "How does the Field Registry grow?",
|
|
1348
1544
|
answer: "As documents are processed, AI discovers new fields and resolves them against existing registry entries. New fields create Tier 3 entries; frequently occurring fields are promoted to higher tiers."
|
|
1545
|
+
},
|
|
1546
|
+
{
|
|
1547
|
+
question: "How does the Field Registry reduce extraction cost?",
|
|
1548
|
+
answer: "The registry enables lookup-based resolution during job runs. When a field already exists in the registry with sufficient data, its value can be resolved via graph lookup instead of an AI call. Approximately 30% of cells are filled this way \u2014 instantly and at no cost."
|
|
1349
1549
|
}
|
|
1350
1550
|
],
|
|
1351
1551
|
mentions: [
|
|
@@ -1387,6 +1587,18 @@ var sections4 = [
|
|
|
1387
1587
|
}
|
|
1388
1588
|
]
|
|
1389
1589
|
},
|
|
1590
|
+
{
|
|
1591
|
+
type: "paragraph",
|
|
1592
|
+
text: "**Tier 1** fields are the most reliable and cost-efficient. During job runs, Tier 1 fields can often be resolved via lookup tables or registry transfer without any AI call, meaning they cost nothing to extract. These are fields like `invoice_number`, `date`, or `total_amount` that appear universally across document types and have well-established extraction patterns."
|
|
1593
|
+
},
|
|
1594
|
+
{
|
|
1595
|
+
type: "paragraph",
|
|
1596
|
+
text: "**Tier 2** fields are promoted from Tier 3 after meeting frequency thresholds \u2014 specifically, 5 occurrences or a 10% occurrence rate across your documents. Once promoted, these fields gain a synthesized master instruction and become candidates for lookup-based resolution. Promotion is evaluated automatically after every batch resolution run, so fields graduate without manual intervention as your document corpus grows."
|
|
1597
|
+
},
|
|
1598
|
+
{
|
|
1599
|
+
type: "paragraph",
|
|
1600
|
+
text: "**Tier 3** fields are newly discovered and may require a full Claude API call to extract during job runs, making them the most expensive tier. As more documents are processed and a Tier 3 field appears consistently, it is automatically promoted. You can also manually adjust a field's tier from the registry detail page if you know a field is stable enough to promote early."
|
|
1601
|
+
},
|
|
1390
1602
|
{
|
|
1391
1603
|
type: "callout",
|
|
1392
1604
|
text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray."
|
|
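The Tier 3 to Tier 2 promotion rule described in this hunk (5 occurrences or a 10% occurrence rate) can be sketched as a small predicate. This is a minimal illustration, not Talonic's implementation: the function name, the field object shape, and the choice to treat both thresholds as inclusive are assumptions made for the example.

```javascript
// Hypothetical sketch of the Tier 3 -> Tier 2 promotion check.
// Threshold values come from the docs text; everything else is assumed.
const MIN_OCCURRENCES = 5;         // "5 occurrences"
const MIN_OCCURRENCE_RATE = 0.10;  // "10% occurrence rate across your documents"

// field: { tier: number, occurrenceCount: number } (assumed shape)
// totalDocuments: number of processed documents in the corpus
function shouldPromoteToTier2(field, totalDocuments) {
  // Only Tier 3 fields are candidates for this promotion step.
  if (field.tier !== 3) return false;
  const rate = totalDocuments > 0 ? field.occurrenceCount / totalDocuments : 0;
  // Either condition is sufficient (the docs say "or").
  return field.occurrenceCount >= MIN_OCCURRENCES || rate >= MIN_OCCURRENCE_RATE;
}
```

With these assumptions, a field seen in 2 of 10 documents qualifies through the rate condition, while a field seen in 2 of 100 documents does not qualify yet.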
@@ -1404,6 +1616,10 @@ var sections4 = [
|
|
|
1404
1616
|
{
|
|
1405
1617
|
question: "How are fields promoted between tiers?",
|
|
1406
1618
|
answer: "Fields are promoted automatically based on frequency thresholds. As more documents are processed and a field appears consistently, it moves from Tier 3 to Tier 2 and eventually to Tier 1."
|
|
1619
|
+
},
|
|
1620
|
+
{
|
|
1621
|
+
question: "Can I manually change a field's tier?",
|
|
1622
|
+
answer: "Yes. You can manually adjust a field's tier from the registry detail page. This is useful when you know a field is stable enough to promote early, or when you want to demote a field that was promoted prematurely."
|
|
1407
1623
|
}
|
|
1408
1624
|
],
|
|
1409
1625
|
mentions: ["tier system", "Tier 1", "Tier 2", "Tier 3", "field promotion", "quality signal"]
|
|
@@ -1418,6 +1634,23 @@ var sections4 = [
|
|
|
1418
1634
|
{
|
|
1419
1635
|
type: "paragraph",
|
|
1420
1636
|
text: 'Fields with similar meanings are automatically grouped using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together. You can manually merge or split clusters from the Field Map view.'
|
|
1637
|
+
},
|
|
1638
|
+
{
|
|
1639
|
+
type: "paragraph",
|
|
1640
|
+
text: "Clustering uses the same three-band similarity model as field resolution. Fields with similarity >= 0.80 are automatically grouped into the same cluster. Fields in the 0.50-0.79 range are flagged as potential cluster candidates for manual confirmation. Fields below 0.50 similarity are kept separate. This graduated approach prevents false merges while still surfacing useful grouping suggestions."
|
|
1641
|
+
},
|
|
1642
|
+
{
|
|
1643
|
+
type: "paragraph",
|
|
1644
|
+
text: 'From the **Field Map** view, you can manually **merge** two clusters when you know they represent the same concept (e.g., merging a "Ship To Address" cluster with a "Delivery Address" cluster). You can also **split** a field out of a cluster if it was incorrectly grouped. These manual adjustments are permanent and improve the resolution model for all future documents \u2014 the system learns from your corrections.'
|
|
1645
|
+
},
|
|
1646
|
+
{
|
|
1647
|
+
type: "paragraph",
|
|
1648
|
+
text: 'Semantic clusters serve a practical purpose beyond organization. When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has a field called "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call. This is one of the key mechanisms that reduces extraction cost as your registry matures.'
|
|
1649
|
+
},
|
|
1650
|
+
{
|
|
1651
|
+
type: "callout",
|
|
1652
|
+
variant: "info",
|
|
1653
|
+
text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs."
|
|
1421
1654
|
}
|
|
1422
1655
|
],
|
|
1423
1656
|
related: [
|
|
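The graduated similarity bands described in this hunk can be sketched as follows. The thresholds (0.80 and 0.50) come from the docs text. The use of cosine similarity, the function names, and the band labels are assumptions for illustration only. The docs leave scores between 0.79 and 0.80 unspecified, so this sketch treats everything from 0.50 up to, but not including, 0.80 as the confirm band.

```javascript
// Hypothetical sketch: classify a pair of field embeddings into a band.
// Cosine similarity is an assumed metric; the docs only say "similarity".
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns "auto" (grouped automatically), "confirm" (flagged for manual
// confirmation), or "separate" (kept apart). Labels are illustrative.
function classifySimilarity(similarity) {
  if (similarity >= 0.80) return "auto";
  if (similarity >= 0.50) return "confirm";
  return "separate";
}
```

Keeping the band check separate from the similarity metric means the thresholds can be tested on their own, whatever metric the platform actually uses.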
@@ -1433,6 +1666,10 @@ var sections4 = [
|
|
|
1433
1666
|
{
|
|
1434
1667
|
question: "Can I manually adjust semantic clusters?",
|
|
1435
1668
|
answer: "Yes. You can manually merge or split clusters from the Field Map view in the Field Registry."
|
|
1669
|
+
},
|
|
1670
|
+
{
|
|
1671
|
+
question: "How do semantic clusters reduce extraction cost?",
|
|
1672
|
+
answer: 'When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call.'
|
|
1436
1673
|
}
|
|
1437
1674
|
],
|
|
1438
1675
|
mentions: [
|
|
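The cluster-based value transfer described in this hunk ("Supplier Name" in the document, "Vendor Name" in the schema) might look roughly like the sketch below. The data structures and the function name are hypothetical; the docs do not describe how cluster membership is stored.

```javascript
// Hypothetical sketch: fill a schema field from a document field that
// belongs to the same semantic cluster, without any AI call.
// clusterByField maps a field name to a cluster id (assumed structure).
function transferByCluster(schemaField, documentFields, clusterByField) {
  // A direct name match needs no cluster lookup.
  if (Object.prototype.hasOwnProperty.call(documentFields, schemaField)) {
    return { value: documentFields[schemaField], source: schemaField };
  }
  const targetCluster = clusterByField.get(schemaField);
  if (targetCluster === undefined) return null;
  // Otherwise take the first document field that shares the cluster.
  for (const [name, value] of Object.entries(documentFields)) {
    if (clusterByField.get(name) === targetCluster) {
      return { value, source: name };
    }
  }
  return null; // nothing to transfer; the cell would go to later phases
}
```

This also shows why the docs recommend splitting wrongly grouped fields early: any field in the same cluster is treated as an acceptable source for the value.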
@@ -1478,6 +1715,10 @@ var sections4 = [
|
|
|
1478
1715
|
type: "paragraph",
|
|
1479
1716
|
text: "Resolution runs concurrently across documents. Each document's fields are resolved in an isolated transaction to prevent lock contention. Occurrence rates are updated after each transaction commits, keeping the registry eventually consistent without blocking concurrent ingestion."
|
|
1480
1717
|
},
|
|
1718
|
+
{
|
|
1719
|
+
type: "paragraph",
|
|
1720
|
+
text: "After resolution completes, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This chain ensures that newly promoted fields immediately appear in auto-generated schemas. The resolution process also feeds into the **job pipeline** \u2014 during Phase 1 of a job run, the system uses a 3-tier lookup cascade (string normalization, token fuzzy matching, then AI fallback) to fill 60-80% of cells without a full LLM call, dramatically reducing cost."
|
|
1721
|
+
},
|
|
1481
1722
|
{
|
|
1482
1723
|
type: "callout",
|
|
1483
1724
|
text: "Pending confirmations from the confirm band appear in **Resolution → Pending Confirmations**. Accept to merge into an existing cluster, or reject to create a new field."
|
|
@@ -1496,6 +1737,10 @@ var sections4 = [
|
|
|
1496
1737
|
{
|
|
1497
1738
|
question: "Where can I review pending field confirmations?",
|
|
1498
1739
|
answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge into an existing cluster, or reject to create a new field."
|
|
1740
|
+
},
|
|
1741
|
+
{
|
|
1742
|
+
question: "What happens after resolution completes?",
|
|
1743
|
+
answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas."
|
|
1499
1744
|
}
|
|
1500
1745
|
],
|
|
1501
1746
|
mentions: [
|
|
@@ -1517,6 +1762,18 @@ var sections4 = [
|
|
|
1517
1762
|
type: "paragraph",
|
|
1518
1763
|
text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs."
|
|
1519
1764
|
},
|
|
1765
|
+
{
|
|
1766
|
+
type: "paragraph",
|
|
1767
|
+
text: 'Master instructions are synthesized by analyzing the extraction patterns across all documents where a field appears. The AI examines how the field was successfully extracted \u2014 including the source text, confidence scores, and document context \u2014 and distills a concise directive that captures the best extraction approach. For example, a master instruction for "invoice_date" might specify: "Look for the date near the invoice number, typically in the header area. Prefer the issue date over due date. Format as ISO 8601."'
|
|
1768
|
+
},
|
|
1769
|
+
{
|
|
1770
|
+
type: "paragraph",
|
|
1771
|
+
text: "Master instructions fire automatically during **Phase 2** of job runs, when the AI agent extracts values for fields that could not be resolved via lookup. The instruction is injected into the AI prompt alongside the document content, giving the model specific guidance for that field. This is why master instructions improve accuracy: they encode domain-specific knowledge that the base model would otherwise lack."
|
|
1772
|
+
},
|
|
1773
|
+
{
|
|
1774
|
+
type: "paragraph",
|
|
1775
|
+
text: `You can view and edit master instructions from the field detail page in the registry. Editing an instruction overrides the AI-synthesized version, which is useful when you have domain expertise the AI hasn't captured. The **"Synthesize All"** button in the Field Registry triggers the full pipeline \u2014 embedding, resolution, and synthesis \u2014 for all qualifying fields in a single operation.`
|
|
1776
|
+
},
|
|
1520
1777
|
{
|
|
1521
1778
|
type: "callout",
|
|
1522
1779
|
text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed → resolve → synthesize.'
|
|
@@ -1535,6 +1792,10 @@ var sections4 = [
|
|
|
1535
1792
|
{
|
|
1536
1793
|
question: "How do I generate master instructions?",
|
|
1537
1794
|
answer: 'Click "Synthesize All" in the Field Registry. This runs the combined pipeline: embed, resolve, and synthesize instructions for all qualifying fields.'
|
|
1795
|
+
},
|
|
1796
|
+
{
|
|
1797
|
+
question: "Can I manually edit a master instruction?",
|
|
1798
|
+
answer: "Yes. You can view and edit master instructions from the field detail page in the registry. Editing overrides the AI-synthesized version, which is useful when you have domain expertise the AI has not captured."
|
|
1538
1799
|
}
|
|
1539
1800
|
],
|
|
1540
1801
|
mentions: [
|
|
@@ -1562,6 +1823,18 @@ var sections5 = [
|
|
|
1562
1823
|
{
|
|
1563
1824
|
type: "paragraph",
|
|
1564
1825
|
text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed."
|
|
1826
|
+
},
|
|
1827
|
+
{
|
|
1828
|
+
type: "paragraph",
|
|
1829
|
+
text: "Behind the scenes, the generation engine scans the **Field Registry** for every field that has been promoted to Tier 1 (core) or Tier 2 (established) within a given document type. It assembles these fields into a schema definition, assigns data types based on observed extraction patterns, and attaches the AI-synthesized **master instruction** for each field. The entire process is automatic \u2014 no manual curation is required."
|
|
1830
|
+
},
|
|
1831
|
+
{
|
|
1832
|
+
type: "paragraph",
|
|
1833
|
+
text: "Generated schemas are most useful as a starting point for understanding what Talonic has discovered about your documents. Review the generated schema for a document type to see which fields the system has identified, then use that knowledge to build a **User Template** containing only the fields you actually need. You can also use the diff view to monitor how your field landscape evolves over time as new documents are processed and new fields are promoted."
|
|
1834
|
+
},
|
|
1835
|
+
{
|
|
1836
|
+
type: "callout",
|
|
1837
|
+
text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry."
|
|
1565
1838
|
}
|
|
1566
1839
|
],
|
|
1567
1840
|
related: [
|
|
@@ -1577,6 +1850,10 @@ var sections5 = [
|
|
|
1577
1850
|
{
|
|
1578
1851
|
question: "How are generated schemas updated?",
|
|
1579
1852
|
answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed."
|
|
1853
|
+
},
|
|
1854
|
+
{
|
|
1855
|
+
question: "Can I run an extraction job using a generated schema?",
|
|
1856
|
+
answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version."
|
|
1580
1857
|
}
|
|
1581
1858
|
],
|
|
1582
1859
|
mentions: ["generated schemas", "AI-generated", "versioning", "schema diff"]
|
|
@@ -1606,6 +1883,18 @@ var sections5 = [
|
|
|
1606
1883
|
{
|
|
1607
1884
|
type: "paragraph",
|
|
1608
1885
|
text: "Most teams start by importing an existing spreadsheet or CSV as a template baseline, then refine field types and add extraction instructions. Once you publish a version, it becomes immutable and available for job execution \u2014 any further changes happen in a new **Workshop** draft, keeping your production schema stable while you iterate."
|
|
1886
|
+
},
|
|
1887
|
+
{
|
|
1888
|
+
type: "paragraph",
|
|
1889
|
+
text: "When adding fields, take advantage of the automatic registry matching system. Fields with names that match existing registry entries are linked instantly, inheriting the AI-synthesized extraction instruction. For fields that do not match, write a clear **manual instruction** describing exactly what the AI should extract from the document. Well-written instructions are the single biggest lever for extraction accuracy."
|
|
1890
|
+
},
|
|
1891
|
+
{
|
|
1892
|
+
type: "paragraph",
|
|
1893
|
+
text: "For best results, keep templates focused on a single document type or closely related group of types. A template with 10-20 well-defined fields will produce higher accuracy than one with 50+ fields spanning unrelated domains. If you need different field sets for different document types, create separate templates and run targeted jobs for each."
|
|
1894
|
+
},
|
|
1895
|
+
{
|
|
1896
|
+
type: "callout",
|
|
1897
|
+
text: "You can import templates from Excel, CSV, or JSON files using the **Import from file** option. Column headers become field names, and data types are inferred automatically. This is the fastest way to bootstrap a template from an existing spreadsheet."
|
|
1609
1898
|
}
|
|
1610
1899
|
],
|
|
1611
1900
|
related: [
|
|
@@ -1621,6 +1910,10 @@ var sections5 = [
|
|
|
1621
1910
|
{
|
|
1622
1911
|
question: "What is the difference between generated schemas and user templates?",
|
|
1623
1912
|
answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields. User templates are custom-defined output structures where you choose exactly which fields to include and how to map them."
|
|
1913
|
+
},
|
|
1914
|
+
{
|
|
1915
|
+
question: "Can I update a published template?",
|
|
1916
|
+
answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing."
|
|
1624
1917
|
}
|
|
1625
1918
|
],
|
|
1626
1919
|
mentions: ["user templates", "schema creation", "field mapping", "reference tables", "publish"]
|
|
@@ -1686,6 +1979,14 @@ var sections5 = [
|
|
|
1686
1979
|
type: "paragraph",
|
|
1687
1980
|
text: "When configuring a field, start with the basics \u2014 name, type, and registry mapping \u2014 then layer on advanced features as needed. For example, add a **format constraint** to enforce a date pattern, attach a **reference table** for code lookups, or define **capture submoves** to control the exact extraction sequence. Features compose independently, so you can mix and match without conflicts."
|
|
1688
1981
|
},
|
|
1982
|
+
{
|
|
1983
|
+
type: "paragraph",
|
|
1984
|
+
text: "The **modifier pipeline** runs in a fixed order during Phase 4 of the extraction pipeline: format transforms first (converting dates or numbers to your target format), then alias mapping (replacing values using a lookup), and finally max_length truncation. Constraint evaluation happens after all modifiers have been applied, so constraints validate the final transformed value, not the raw extraction."
|
|
1985
|
+
},
|
|
1986
|
+
{
|
|
1987
|
+
type: "paragraph",
|
|
1988
|
+
text: 'For best results, use **manual instructions** sparingly and only for fields that the registry cannot match. A well-written instruction should describe the field in plain language, specify where in the document to look, and note any formatting expectations. Avoid vague instructions like "extract the value" \u2014 instead, write something like "Extract the net payment amount from the invoice summary section, excluding VAT."'
|
|
1989
|
+
},
|
|
1689
1990
|
{
|
|
1690
1991
|
type: "callout",
|
|
1691
1992
|
text: "For the complete JSON Schema specification with all features, see the [Full Schema Reference](/docs/platform/schema-features) in the Platform Guide."
|
|
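The fixed modifier order described in this hunk (format, then alias, then max_length, with constraints evaluated last) can be sketched as one function. The field configuration shape and all property names here are assumptions for illustration, and the `format` function is a simple stand-in for the date and number conversions the docs mention.

```javascript
// Hypothetical sketch of the modifier order: format -> alias -> max_length,
// followed by constraint evaluation on the final transformed value.
function applyModifiers(rawValue, field) {
  let value = rawValue;
  // 1. Format transform (stand-in for date/number conversion).
  if (typeof field.format === "function") value = field.format(value);
  // 2. Alias mapping: replace the value if the lookup has an entry for it.
  if (field.alias && Object.prototype.hasOwnProperty.call(field.alias, value)) {
    value = field.alias[value];
  }
  // 3. max_length truncation.
  if (typeof field.maxLength === "number" && typeof value === "string") {
    value = value.slice(0, field.maxLength);
  }
  // Constraints see the transformed value, never the raw extraction.
  const violations = (field.constraints || [])
    .filter((constraint) => !constraint.test(value))
    .map((constraint) => constraint.name);
  return { value, violations };
}

// Illustrative field configuration (all names are made up for the example).
const countryField = {
  format: (v) => v.trim().toUpperCase(),
  alias: { "UNITED STATES": "US" },
  maxLength: 2,
  constraints: [{ name: "two_letter_code", test: (v) => /^[A-Z]{2}$/.test(v) }],
};
```

In this example the raw value `"  United States "` would fail the two-letter constraint, but the constraint only runs after the value has been formatted and aliased to `"US"`, so it passes.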
@@ -1704,6 +2005,10 @@ var sections5 = [
|
|
|
1704
2005
|
{
|
|
1705
2006
|
question: "Can I override AI extraction instructions with my own?",
|
|
1706
2007
|
answer: "Yes. Use the Manual instruction feature on a schema field. User-written instructions override the AI-synthesized master instruction from the field registry."
|
|
2008
|
+
},
|
|
2009
|
+
{
|
|
2010
|
+
question: "In what order are modifiers applied to extracted values?",
|
|
2011
|
+
answer: "Modifiers run in a fixed order: format (date/number conversion) first, then alias (value mapping), then max_length (truncation). Constraints are evaluated after all modifiers complete."
|
|
1707
2012
|
}
|
|
1708
2013
|
],
|
|
1709
2014
|
mentions: [
|
|
@@ -1752,6 +2057,22 @@ var sections5 = [
|
|
|
1752
2057
|
{
|
|
1753
2058
|
type: "paragraph",
|
|
1754
2059
|
text: "When you add a field to a template, the system automatically attempts to match it against the **Field Registry**. Exact name matches are applied instantly, while semantic and composite matches appear as suggestions for your confirmation. If no match is found, the field is marked **Unmapped** and you should provide a manual extraction instruction so the AI knows how to extract that value from your documents."
|
|
2060
|
+
},
|
|
2061
|
+
{
|
|
2062
|
+
type: "paragraph",
|
|
2063
|
+
text: "The matching engine uses a three-band resolution process under the hood. First, it checks for an exact name match against canonical registry field names and their synonyms. If no exact match is found, it computes embedding similarity between your field name and every registry field, surfacing semantic matches above a 0.5 confidence threshold. Matches above 0.8 are auto-accepted; those between 0.5 and 0.8 require your confirmation."
|
|
2064
|
+
},
|
|
2065
|
+
{
|
|
2066
|
+
type: "paragraph",
|
|
2067
|
+
text: "Matched fields inherit the registry's AI-synthesized **master instruction**, which tells the extraction pipeline exactly how to locate and extract that value from documents. This is why matching matters \u2014 a well-matched field leverages all the intelligence the system has built up from processing your document corpus. Unmapped fields rely solely on your manual instruction, so they may need a few correction cycles before reaching the same accuracy."
|
|
2068
|
+
},
|
|
2069
|
+
{
|
|
2070
|
+
type: "paragraph",
|
|
2071
|
+
text: "You can trigger a **Rematch** on all fields at any time from the template editor. This is useful after the registry has grown \u2014 fields that were previously unmapped may now find matches as new extractions contribute to the registry. For best results, use descriptive field names that reflect the actual data (e.g., `contract_start_date` rather than `field_1`)."
|
|
2072
|
+
},
|
|
2073
|
+
{
|
|
2074
|
+
type: "callout",
|
|
2075
|
+
text: "Field matching is read-only against the registry \u2014 it never creates new registry entries. If no match exists, the field stays unmapped until you provide a manual instruction or new documents introduce the field into the registry."
|
|
1755
2076
|
}
|
|
1756
2077
|
],
|
|
1757
2078
|
related: [
|
|
@@ -1767,6 +2088,10 @@ var sections5 = [
|
|
|
1767
2088
|
{
|
|
1768
2089
|
question: "What happens when a field is unmapped?",
|
|
1769
2090
|
answer: "Unmapped fields have no registry match. They require manual extraction instructions to guide the AI on how to extract the value from documents."
|
|
2091
|
+
},
|
|
2092
|
+
{
|
|
2093
|
+
question: "Can I re-run field matching after adding more documents?",
|
|
2094
|
+
answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows."
|
|
1770
2095
|
}
|
|
1771
2096
|
],
|
|
1772
2097
|
mentions: ["field matching", "exact match", "semantic match", "composite", "unmapped"]
|
|
@@ -1807,6 +2132,14 @@ var sections5 = [
|
|
|
1807
2132
|
type: "paragraph",
|
|
1808
2133
|
text: "To set up a reference table, upload a CSV or manually enter key-value pairs where the **key** is the code you want in your output and the **value** is the human-readable label found in documents. During extraction, the system tries each tier in order \u2014 most values resolve instantly at Tier 1, so keeping your labels clean and consistent dramatically improves both speed and accuracy."
|
|
1809
2134
|
},
|
|
2135
|
+
{
|
|
2136
|
+
type: "paragraph",
|
|
2137
|
+
text: "Reference tables are used in two pipeline stages. In **Phase 1**, the lookup cascade runs as part of the resolve step, mapping extracted labels to codes without any AI calls (Tier 1 and Tier 2). In **Phase 3**, the cascade runs again on values produced by Phase 2's AI extraction, normalizing free-text AI output to your canonical codes. This two-pass approach ensures maximum code coverage across the entire pipeline."
|
|
2138
|
+
},
|
|
2139
|
+
{
|
|
2140
|
+
type: "paragraph",
|
|
2141
|
+
text: 'For best results, include common variations and abbreviations as separate value entries all pointing to the same key. For example, if your code is `US`, add values for "United States", "USA", "U.S.A.", and "United States of America". The more variations you cover, the more values resolve at Tier 1 (highest confidence) without falling through to fuzzy or AI matching.'
|
|
2142
|
+
},
|
|
1810
2143
|
{
|
|
1811
2144
|
type: "callout",
|
|
1812
2145
|
text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run."
|
|
@@ -1825,6 +2158,10 @@ var sections5 = [
|
|
|
1825
2158
|
{
|
|
1826
2159
|
question: "How accurate are reference table lookups?",
|
|
1827
2160
|
answer: "A properly loaded reference table produces 90-100% accurate results within a single run. The cascade provides confidence scores: 0.95 for exact normalization, ~0.70 for fuzzy, and 0.50 for AI fallback."
|
|
2161
|
+
},
|
|
2162
|
+
{
|
|
2163
|
+
question: "How should I format my reference table CSV?",
|
|
2164
|
+
answer: "Use two columns: the first column is the key (output code) and the second is the value (human-readable label). Include common variations and abbreviations as separate rows pointing to the same key for maximum Tier 1 hit rate."
|
|
1828
2165
|
}
|
|
1829
2166
|
],
|
|
1830
2167
|
mentions: [
|
|
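The reference table cascade described in the hunks above (exact normalization, then fuzzy matching, then AI fallback) can be sketched as below. The confidence values 0.95, 0.70, and 0.50 come from the FAQ text, with 0.70 standing in for the approximate fuzzy score. The normalization rules, the token-overlap metric, the 0.5 fuzzy cutoff, and the optional `aiFallback` callback are all assumptions made for this example.

```javascript
// Hypothetical sketch of a three-tier lookup cascade over [key, label] rows.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // drop punctuation, keep letters/digits
    .replace(/\s+/g, " ")
    .trim();
}

// Jaccard overlap of whitespace-separated tokens (assumed fuzzy metric).
function tokenOverlap(a, b) {
  const setA = new Set(a.split(" "));
  const setB = new Set(b.split(" "));
  let shared = 0;
  for (const token of setA) if (setB.has(token)) shared++;
  return shared / (setA.size + setB.size - shared);
}

function lookup(label, rows, aiFallback) {
  const wanted = normalize(label);
  // Tier 1: exact match after normalization.
  for (const [key, value] of rows) {
    if (normalize(value) === wanted) return { key, tier: 1, confidence: 0.95 };
  }
  // Tier 2: best token overlap at or above an assumed cutoff of 0.5.
  let best = null;
  for (const [key, value] of rows) {
    const score = tokenOverlap(wanted, normalize(value));
    if (score >= 0.5 && (best === null || score > best.score)) best = { key, score };
  }
  if (best !== null) return { key: best.key, tier: 2, confidence: 0.70 };
  // Tier 3: AI fallback, modelled here as an optional callback.
  const aiKey = typeof aiFallback === "function" ? aiFallback(label, rows) : null;
  return aiKey != null ? { key: aiKey, tier: 3, confidence: 0.50 } : null;
}

// Rows follow the documented layout: key (output code), value (label).
const countryRows = [
  ["US", "United States"],
  ["US", "USA"],
  ["DE", "Germany"],
];
```

Under these assumptions `lookup("U.S.A.", countryRows)` resolves at Tier 1 because normalization strips the periods, which illustrates why adding common variations as extra rows raises the Tier 1 hit rate.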
@@ -1849,6 +2186,18 @@ var sections5 = [
|
|
|
1849
2186
|
{
|
|
1850
2187
|
type: "paragraph",
|
|
1851
2188
|
text: "Start by editing fields in the **Workshop** draft, then use **Test Extraction** to compare draft results against the live version before publishing. The **Version History** timeline lets you review diff summaries between any two versions, making it easy to trace when a field was added, renamed, or removed and understand the impact on downstream jobs."
|
|
2189
|
+
},
|
|
2190
|
+
{
|
|
2191
|
+
type: "paragraph",
|
|
2192
|
+
text: "The versioning system is append-only \u2014 every time you publish a draft, it creates a new immutable version and the previous version is preserved in the timeline. This means you can always go back and review the exact schema that was used for any historical job. The diff view highlights added fields, removed fields, type changes, and updated instructions, giving you a clear picture of how your schema evolved."
|
|
2193
|
+
},
|
|
2194
|
+
{
|
|
2195
|
+
type: "paragraph",
|
|
2196
|
+
text: "Use the workshop system to iterate safely on your schema without disrupting production jobs. A common workflow is to add a new field in the Workshop, run a **Test Extraction** on a few documents to verify it produces correct values, then publish when satisfied. If a downstream integration depends on a specific field, the breaking change detection will warn you before you accidentally remove or rename it."
|
|
2197
|
+
},
|
|
2198
|
+
{
|
|
2199
|
+
type: "callout",
|
|
2200
|
+
text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing."
|
|
1852
2201
|
}
|
|
1853
2202
|
],
|
|
1854
2203
|
related: [
|
|
@@ -1864,6 +2213,10 @@ var sections5 = [
|
|
|
1864
2213
|
{
|
|
1865
2214
|
question: "What are breaking changes in a schema?",
|
|
1866
2215
|
answer: "Breaking changes include field removals and type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts."
|
|
2216
|
+
},
|
|
2217
|
+
{
|
|
2218
|
+
question: "Can I revert to a previous schema version?",
|
|
2219
|
+
answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed."
|
|
1867
2220
|
}
|
|
1868
2221
|
],
|
|
1869
2222
|
mentions: ["versioning", "drafts", "workshop", "live version", "breaking changes"]
|
|
@@ -1882,6 +2235,18 @@ var sections5 = [
|
|
|
1882
2235
|
{
|
|
1883
2236
|
type: "paragraph",
|
|
1884
2237
|
text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
|
|
2238
|
+
},
|
|
2239
|
+
{
|
|
2240
|
+
type: "paragraph",
|
|
2241
|
+
text: "Test extractions apply the same schema features as the 4-phase production pipeline, including reference table lookups, format constraints, and modifiers, so the results closely preview what a full job would produce. Under the hood, the test uses a simplified single-call extraction mode, which is faster than a full pipeline run. This gives you a reliable preview without the cost of a full pipeline run."
|
|
2242
|
+
},
|
|
2243
|
+
{
|
|
2244
|
+
type: "paragraph",
|
|
2245
|
+
text: 'For best results, select 3-5 representative documents that cover the variety in your corpus \u2014 include at least one "clean" document and one with unusual formatting or missing fields. This gives you confidence that your schema handles both typical and edge-case documents correctly. Run the test after every significant change to a field instruction, reference table, or format constraint.'
|
|
2246
|
+
},
|
|
2247
|
+
{
|
|
2248
|
+
type: "callout",
|
|
2249
|
+
text: "Test extractions do not affect your live data, and their credit usage is no different from that of production jobs. They are designed for rapid iteration \u2014 run as many tests as you need before publishing."
|
|
1885
2250
|
}
|
|
1886
2251
|
],
|
|
1887
2252
|
related: [
|
|
@@ -1897,6 +2262,10 @@ var sections5 = [
|
|
|
1897
2262
|
{
|
|
1898
2263
|
question: "Do I need to publish a draft before testing it?",
|
|
1899
2264
|
answer: "No. Test extraction runs against the unpublished draft, comparing its output to the current live version so you can verify changes before publishing."
|
|
2265
|
+
},
|
|
2266
|
+
{
|
|
2267
|
+
question: "How many documents should I use for a test extraction?",
|
|
2268
|
+
answer: "Select 3-5 representative documents that cover the variety in your corpus. Include documents with different layouts, data completeness levels, and edge cases to get a reliable preview of how your schema changes perform."
|
|
1900
2269
|
}
|
|
1901
2270
|
],
|
|
1902
2271
|
mentions: ["test extraction", "draft comparison", "side-by-side", "preview"]
|
|
@@ -1951,6 +2320,18 @@ var sections5 = [
|
|
|
1951
2320
|
{
|
|
1952
2321
|
type: "paragraph",
|
|
1953
2322
|
text: "When working with international data, configure the dialect to match your downstream system requirements. For example, set **number_locale** to `fr-FR` for European comma-decimal formatting, switch the **delimiter** to semicolon for CSV compatibility, and choose **UTF-8-BOM** encoding if your data will be opened in Excel. Creating a shared dialect and reusing it across schemas ensures consistent formatting across all your exports."
|
|
2323
|
+
},
|
|
2324
|
+
{
|
|
2325
|
+
type: "paragraph",
|
|
2326
|
+
text: "Dialect settings are applied during Phase 4 of the extraction pipeline and during CSV/XLSX export. The dialect does not affect how values are stored internally \u2014 it only controls the serialization format when data leaves the platform. This means you can change a dialect at any time without re-running extractions; the new format applies to all future exports and deliveries."
|
|
2327
|
+
},
|
|
2328
|
+
{
|
|
2329
|
+
type: "paragraph",
|
|
2330
|
+
text: 'For best results, create a shared dialect for each downstream system or regional office you deliver to, and name it descriptively (e.g., "SAP Europe" or "US Accounting"). Avoid defining dialects inline on individual schemas unless you have a one-off formatting requirement. Shared dialects reduce maintenance burden and ensure consistency when you add new schemas later.'
|
|
2331
|
+
},
|
|
2332
|
+
{
|
|
2333
|
+
type: "callout",
|
|
2334
|
+
text: "If your CSV files show garbled special characters (accents, umlauts, CJK text), switch the encoding to **UTF-8-BOM**. The BOM (byte order mark) tells Excel to interpret the file as UTF-8 instead of the system default encoding."
|
|
1954
2335
|
}
|
|
1955
2336
|
],
|
|
1956
2337
|
related: [
|
|
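The dialect settings described in this hunk (semicolon delimiter, `fr-FR` number locale, UTF-8-BOM encoding) can be illustrated with a small serializer. This is a sketch under stated assumptions, not the platform's exporter: the dialect object's property names and both function names are hypothetical. `Intl.NumberFormat` is the standard JavaScript API, and its output for `fr-FR` depends on the runtime shipping full locale data.

```javascript
// Hypothetical dialect object; property names are assumed for the example.
const europeDialect = { delimiter: ";", numberLocale: "fr-FR", encoding: "UTF-8-BOM" };

// Format one cell according to the dialect, quoting it when necessary.
function formatCell(value, dialect) {
  let text;
  if (typeof value === "number") {
    // Locale-aware decimal separator; grouping disabled to keep cells simple.
    text = new Intl.NumberFormat(dialect.numberLocale, { useGrouping: false }).format(value);
  } else {
    text = value == null ? "" : String(value);
  }
  // Quote cells containing the delimiter, a double quote, or a line break.
  const needsQuotes = text.includes(dialect.delimiter) || /["\r\n]/.test(text);
  return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize a header and rows. Stored values are untouched; only the
// output text depends on the dialect.
function toCsv(header, rows, dialect) {
  const lines = [header, ...rows].map((row) =>
    row.map((cell) => formatCell(cell, dialect)).join(dialect.delimiter)
  );
  // A leading U+FEFF becomes the bytes EF BB BF when written as UTF-8.
  const bom = dialect.encoding === "UTF-8-BOM" ? "\uFEFF" : "";
  return bom + lines.join("\r\n") + "\r\n";
}
```

Because the serializer takes the dialect as an argument, the same stored rows can be exported again under a different dialect without re-running extraction, which matches the behavior the docs describe.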
@@ -1966,6 +2347,10 @@ var sections5 = [
|
|
|
1966
2347
|
{
|
|
1967
2348
|
question: "Can I share a dialect across multiple schemas?",
|
|
1968
2349
|
answer: "Yes. A dialect can be shared across schemas or defined inline for a specific schema. Configure them in the Schema > Delivery tab."
|
|
2350
|
+
},
|
|
2351
|
+
{
|
|
2352
|
+
question: "Do I need to re-run extractions when I change a dialect?",
|
|
2353
|
+
answer: "No. Dialects only affect output serialization (exports and deliveries), not how values are stored internally. Changing a dialect takes effect immediately on future exports without re-processing."
|
|
1969
2354
|
}
|
|
1970
2355
|
],
|
|
1971
2356
|
mentions: [
|
|
@@ -2018,6 +2403,14 @@ var sections5 = [
|
|
|
2018
2403
|
type: "paragraph",
|
|
2019
2404
|
text: 'Use bypass strategies for fields whose values are known ahead of time or can be derived without reading the document. For example, set a **constant** of `"USD"` for a currency field that is always the same, or use a **generator** to produce a deterministic ID for each row. Fields with bypass strategies skip the AI extraction phase entirely, reducing processing time and credit usage.'
|
|
2020
2405
|
},
|
|
2406
|
+
{
|
|
2407
|
+
type: "paragraph",
|
|
2408
|
+
text: "The **reference** bypass strategy is particularly powerful for enrichment fields. Define a `key_expression` that references another field in the schema (e.g., the supplier name), and the system will automatically look up the corresponding code from your reference table without any AI involvement. This is ideal for mapping extracted entity names to internal system identifiers, ERP codes, or classification labels."
|
|
2409
|
+
},
|
|
2410
|
+
{
|
|
2411
|
+
type: "paragraph",
|
|
2412
|
+
text: "For best results, audit your schema for fields that never vary across documents \u2014 these are prime candidates for the **constant** strategy. Fields like currency, data source, or processing batch can be set once and never require AI extraction. This reduces per-document processing cost and improves job completion time, especially on large runs with hundreds of documents."
|
|
2413
|
+
},
|
|
2021
2414
|
{
|
|
2022
2415
|
type: "callout",
|
|
2023
2416
|
text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net. Strategy values are normalized via generator mappings in Phase 4 of the pipeline."
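The three bypass strategies discussed in this hunk (**constant**, **reference**, **generator**) can be pictured as a small resolver that runs before any AI call. This is a sketch under assumptions: `resolveBypass`, the strategy object shape, and the sample table are hypothetical, `key_expression` is treated as a plain field name, and returning `null` on a reference miss is a guess (the text only states the fall-through for generators).

```javascript
// Hypothetical reference table: normalized supplier name -> ERP code.
const supplierCodes = new Map([
  ["acme gmbh", "V-1001"],
  ["globex ltd", "V-1002"],
]);

// Resolve one field without AI. Returns { value, source }, or null when the
// field should fall through to LLM extraction.
function resolveBypass(field, row) {
  const { strategy } = field;
  if (strategy.type === "constant") {
    return { value: strategy.value, source: "constant" };
  }
  if (strategy.type === "reference") {
    // Assumption: key_expression names another field in the same row.
    const key = String(row[strategy.key_expression] ?? "").trim().toLowerCase();
    const code = strategy.table.get(key);
    return code === undefined ? null : { value: code, source: "reference" };
  }
  if (strategy.type === "generator") {
    const value = strategy.generate(row);
    // A generator that produces nothing falls through to LLM extraction.
    return value == null ? null : { value, source: "generator" };
  }
  return null;
}

const sampleRow = { doc_id: 7, supplier_name: " ACME GmbH " };
const currency = resolveBypass({ strategy: { type: "constant", value: "USD" } }, sampleRow);
const erpCode = resolveBypass(
  { strategy: { type: "reference", key_expression: "supplier_name", table: supplierCodes } },
  sampleRow
);
const rowId = resolveBypass(
  { strategy: { type: "generator", generate: (row) => `ROW-${row.doc_id}` } },
  sampleRow
);
```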
|
|
@@ -2036,6 +2429,10 @@ var sections5 = [
|
|
|
2036
2429
|
{
|
|
2037
2430
|
question: "What happens when a generator bypass fails?",
|
|
2038
2431
|
answer: "When a generator strategy fails to produce a value, the field falls through to LLM extraction as a safety net, ensuring the cell is still filled."
|
|
2432
|
+
},
|
|
2433
|
+
{
|
|
2434
|
+
question: "Do bypass strategies reduce extraction costs?",
|
|
2435
|
+
answer: "Yes. Fields with bypass strategies skip the AI extraction phase entirely, which reduces both processing time and credit usage. Use constant or reference strategies for fields that do not require document reading."
|
|
2039
2436
|
}
|
|
2040
2437
|
],
|
|
2041
2438
|
mentions: [
|
|
@@ -2081,6 +2478,18 @@ var sections5 = [
|
|
|
2081
2478
|
{
|
|
2082
2479
|
type: "paragraph",
|
|
2083
2480
|
text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax. The editor provides a live test input so you can verify the pattern before saving."
|
|
2481
|
+
},
|
|
2482
|
+
{
|
|
2483
|
+
type: "paragraph",
|
|
2484
|
+
text: "Format constraints are especially useful for fields with strict formatting requirements in downstream systems. For example, a purchase order number that must follow the pattern `PO-\\d{6}` or a date that must match `\\d{4}-\\d{2}-\\d{2}`. By catching format violations at extraction time, you avoid importing malformed data into your ERP, accounting, or analytics systems."
|
|
2485
|
+
},
|
|
2486
|
+
{
|
|
2487
|
+
type: "paragraph",
|
|
2488
|
+
text: 'Choose the mismatch behavior based on your data quality requirements. Use **empty** (the default) when you prefer no data over bad data \u2014 the downstream system will see a blank cell. Use **flag** when you want to review mismatches manually before deciding \u2014 flagged cells appear with an amber dot in the results grid. Use **constant** when your downstream system needs a specific sentinel value like `"N/A"` or `"INVALID"` to trigger its own error handling.'
|
|
2489
|
+
},
|
|
2490
|
+
{
|
|
2491
|
+
type: "callout",
|
|
2492
|
+
text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching."
|
|
2084
2493
|
}
|
|
2085
2494
|
],
|
|
2086
2495
|
related: [
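The format-constraint behavior described in this hunk (regex pattern, **empty** / **flag** / **constant** mismatch handling, a 1,000-character cap, rejection of nested quantifiers, and the `(?i)` flag) could be sketched as below. Everything here is illustrative: the function names, the whole-value anchoring, treating over-length input as a mismatch, and the crude nested-quantifier check are assumptions, not the evaluator's actual rules.

```javascript
const MAX_INPUT_LENGTH = 1000;

// Crude heuristic: a group that contains + or * and is itself quantified,
// e.g. (a+)+ or (\d*)*. The real ReDoS check is not documented here.
const NESTED_QUANTIFIER = /\([^()]*[+*][^()]*\)[+*{]/;

function compileConstraint(pattern) {
  if (NESTED_QUANTIFIER.test(pattern)) {
    throw new Error("pattern rejected: nested quantifier");
  }
  // The RegExp constructor takes flags as a separate argument, so a leading
  // (?i) is translated into the "i" flag.
  const caseInsensitive = pattern.startsWith("(?i)");
  const source = caseInsensitive ? pattern.slice(4) : pattern;
  // Assumption: the whole value must match the pattern.
  return new RegExp(`^(?:${source})$`, caseInsensitive ? "i" : "");
}

function applyConstraint(value, constraint) {
  const { pattern, onMismatch = "empty", constant = null } = constraint;
  const regex = compileConstraint(pattern);
  const ok = value.length <= MAX_INPUT_LENGTH && regex.test(value);
  if (ok) return { value, flagged: false };
  if (onMismatch === "flag") return { value, flagged: true }; // keep value, mark for review
  if (onMismatch === "constant") return { value: constant, flagged: false };
  return { value: "", flagged: false }; // "empty" is the default
}

const poPattern = "PO-\\d{6}";
const valid = applyConstraint("PO-123456", { pattern: poPattern });
const cleared = applyConstraint("PO-12", { pattern: poPattern });
const sentinel = applyConstraint("PO-12", { pattern: poPattern, onMismatch: "constant", constant: "N/A" });
```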
|
|
@@ -2096,6 +2505,10 @@ var sections5 = [
|
|
|
2096
2505
|
{
|
|
2097
2506
|
question: "Are original values preserved when format constraints clear a cell?",
|
|
2098
2507
|
answer: "Yes. Original values are always preserved for audit in the original_extractions table, regardless of the mismatch behavior applied."
|
|
2508
|
+
},
|
|
2509
|
+
{
|
|
2510
|
+
question: "Can I use case-insensitive regex patterns?",
|
|
2511
|
+
answer: "Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax with inline flags."
|
|
2099
2512
|
}
|
|
2100
2513
|
],
|
|
2101
2514
|
mentions: [
|
|
@@ -2124,6 +2537,18 @@ var sections6 = [
|
|
|
2124
2537
|
{
|
|
2125
2538
|
type: "paragraph",
|
|
2126
2539
|
text: "Navigate to **Structuring → Runs → New**. Select your template and documents, then click Start. Results appear progressively as each phase completes."
|
|
2540
|
+
},
|
|
2541
|
+
{
|
|
2542
|
+
type: "paragraph",
|
|
2543
|
+
text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents silent data loss where registry-based lookups would return empty results for unresolved documents."
|
|
2544
|
+
},
|
|
2545
|
+
{
|
|
2546
|
+
type: "paragraph",
|
|
2547
|
+
text: "For best results, select documents of the same type or closely related types for a single job. The schema you choose should match the document content \u2014 using an invoice schema on contract documents will produce poor results. Start with a small batch of 5-10 documents to validate your schema, review the output, apply corrections, and then scale up to larger runs once you are confident in the extraction quality."
|
|
2548
|
+
},
|
|
2549
|
+
{
|
|
2550
|
+
type: "callout",
|
|
2551
|
+
text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running."
|
|
2127
2552
|
}
|
|
2128
2553
|
],
|
|
2129
2554
|
related: [
|
|
@@ -2139,6 +2564,10 @@ var sections6 = [
|
|
|
2139
2564
|
{
|
|
2140
2565
|
question: "What does an extraction job produce?",
|
|
2141
2566
|
answer: "A job produces a structured grid where rows represent documents and columns represent schema fields. Each cell contains an extracted value with confidence and provenance metadata."
|
|
2567
|
+
},
|
|
2568
|
+
{
|
|
2569
|
+
question: "How many documents can I include in a single job?",
|
|
2570
|
+
answer: "Phase 2 supports up to 2,000 documents per job, and Phase 4 supports up to 1,000. For best results, start with smaller batches to validate your schema before scaling up."
|
|
2142
2571
|
}
|
|
2143
2572
|
],
|
|
2144
2573
|
mentions: ["extraction job", "structured grid", "progressive results", "template selection"]
|
|
@@ -2158,11 +2587,23 @@ var sections6 = [
|
|
|
2158
2587
|
type: "paragraph",
|
|
2159
2588
|
text: "Each phase builds on the previous one, progressively filling the output grid. **Phase 1** resolves ~30% of cells instantly using graph matches and lookups. **Phase 2** deploys an AI agent to fill remaining gaps. **Phase 3** runs cross-field validation checks, and **Phase 4** performs targeted re-reads for empty or low-confidence cells. You can monitor fill rate in real time as each phase completes."
|
|
2160
2589
|
},
|
|
2590
|
+
{
|
|
2591
|
+
type: "paragraph",
|
|
2592
|
+
text: "The pipeline is designed around a key principle: use the cheapest, fastest method first and escalate to AI only when necessary. Phase 1 fills cells using deterministic lookups at zero AI cost. Phase 2 uses AI only for cells that Phase 1 could not resolve. Phase 3 re-runs lookups on Phase 2 output to normalize AI-generated values to canonical codes. Phase 4 performs targeted re-reads with full grid context for the remaining gaps. This cascading approach minimizes both cost and latency."
|
|
2593
|
+
},
|
|
2594
|
+
{
|
|
2595
|
+
type: "paragraph",
|
|
2596
|
+
text: "The grid is flushed to the database after each phase, enabling progressive rendering in the UI. You can watch cells fill in real time and begin reviewing results before the job finishes. The phase timeline on the job detail page shows which phase is currently active, how long each phase took, and the cumulative fill rate at each stage."
|
|
2597
|
+
},
|
|
2161
2598
|
{
|
|
2162
2599
|
type: "ui-excerpt",
|
|
2163
2600
|
id: "job-detail-phase-timeline",
|
|
2164
2601
|
title: "Job Detail \u2014 Phase Timeline",
|
|
2165
2602
|
caption: "The phase timeline shows progress through the pipeline. Each dot represents a stage, highlighted when active."
|
|
2603
|
+
},
|
|
2604
|
+
{
|
|
2605
|
+
type: "callout",
|
|
2606
|
+
text: "Phase order is fixed: Phase 1 → 2 → 3 → 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs."
|
|
2166
2607
|
}
|
|
2167
2608
|
],
|
|
2168
2609
|
related: [
|
|
@@ -2178,6 +2619,10 @@ var sections6 = [
|
|
|
2178
2619
|
{
|
|
2179
2620
|
question: "Can I see results before all phases complete?",
|
|
2180
2621
|
answer: "Yes. Results are visible as each phase completes. The fill rate increases progressively through the pipeline."
|
|
2622
|
+
},
|
|
2623
|
+
{
|
|
2624
|
+
question: "Why does the pipeline use multiple phases instead of a single AI call?",
|
|
2625
|
+
answer: "The cascading design minimizes cost and latency. Phase 1 fills cells with deterministic lookups at zero AI cost. Only remaining gaps go to the AI agent in Phase 2, and Phase 4 targets specific empty cells with full context. This is significantly cheaper and faster than sending everything to AI."
|
|
2181
2626
|
}
|
|
2182
2627
|
],
|
|
2183
2628
|
mentions: ["4-phase pipeline", "fill rate", "progressive rendering", "phase timeline"]
|
|
@@ -2227,6 +2672,18 @@ var sections6 = [
|
|
|
2227
2672
|
{
|
|
2228
2673
|
type: "paragraph",
|
|
2229
2674
|
text: "Values are normalized during transfer: dates → `YYYY/MM/DD`, numbers → 2 decimal places, strings → trim + collapse spaces."
|
|
2675
|
+
},
|
|
2676
|
+
{
|
|
2677
|
+
type: "paragraph",
|
|
2678
|
+
text: "Phase 1 is the workhorse of cost efficiency. Because it relies entirely on pre-computed graph matches and deterministic lookups, it fills a large portion of the grid at near-zero cost. The confidence scores assigned during this phase are typically high (0.7-0.95) because they are derived from verified registry matches rather than AI inference. These high-confidence cells are then protected by the confidence gate, meaning later phases cannot overwrite them."
|
|
2679
|
+
},
|
|
2680
|
+
{
|
|
2681
|
+
type: "paragraph",
|
|
2682
|
+
text: "The resolution strategies execute in a fixed order: registry transfer first, then raw extraction mapping, then the 3-tier lookup cascade, and finally deterministic compute (formulas like `Total = Unit Price x Quantity`). Each strategy only attempts to fill cells that are still empty after the previous strategy ran. This ordering ensures that the highest-confidence method always gets priority."
|
|
2683
|
+
},
|
|
2684
|
+
{
|
|
2685
|
+
type: "callout",
|
|
2686
|
+
text: "Phase 1 fill rates improve over time as your Field Registry grows. The more documents you process, the richer the registry becomes, and the more cells Phase 1 can resolve without AI \u2014 reducing both cost and latency for every subsequent job."
|
|
2230
2687
|
}
|
|
2231
2688
|
],
|
|
2232
2689
|
related: [
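The Phase 1 ordering described in this hunk (registry transfer, then raw extraction mapping, then the lookup cascade, then deterministic compute, each touching only cells that are still empty) can be sketched as a loop over an ordered strategy list. The strategy bodies and the `doc` shape below are hypothetical stand-ins; only the order and the "fill empty cells only" rule come from the text.

```javascript
// Ordered strategy list. Each run() returns a value or undefined.
const strategies = [
  { name: "registry_transfer", run: (field, doc) => doc.registry?.[field] },
  { name: "raw_mapping", run: (field, doc) => doc.raw?.[field] },
  { name: "lookup_cascade", run: (field, doc) => doc.lookups?.[field] },
  {
    name: "compute",
    // Example formula from the text: Total = Unit Price x Quantity.
    run: (field, doc, row) =>
      field === "total" && row.unit_price != null && row.quantity != null
        ? row.unit_price.value * row.quantity.value
        : undefined,
  },
];

function runPhase1(fields, doc) {
  const row = {};
  for (const strategy of strategies) {
    for (const field of fields) {
      if (row[field] != null) continue; // earlier strategies keep priority
      const value = strategy.run(field, doc, row);
      if (value !== undefined) row[field] = { value, resolvedBy: strategy.name };
    }
  }
  return row;
}

const phase1Row = runPhase1(["unit_price", "quantity", "total"], {
  registry: { unit_price: 4 },
  raw: { unit_price: 99, quantity: 3 },
});
```

In this sample, the registry value for `unit_price` wins over the raw mapping because it was filled first, and `total` is only computed once both inputs exist.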
|
|
@@ -2242,6 +2699,10 @@ var sections6 = [
|
|
|
2242
2699
|
{
|
|
2243
2700
|
question: "What percentage of cells does Phase 1 fill?",
|
|
2244
2701
|
answer: "Phase 1 typically fills approximately 30% of cells in seconds, using graph matches and lookups without any AI calls."
|
|
2702
|
+
},
|
|
2703
|
+
{
|
|
2704
|
+
question: "Does Phase 1 performance improve over time?",
|
|
2705
|
+
answer: "Yes. As your Field Registry grows from processing more documents, Phase 1 can resolve a higher percentage of cells through graph matches. Mature registries often see Phase 1 fill rates of 60-80%."
|
|
2245
2706
|
}
|
|
2246
2707
|
],
|
|
2247
2708
|
mentions: [
|
|
@@ -2300,6 +2761,14 @@ var sections6 = [
|
|
|
2300
2761
|
}
|
|
2301
2762
|
]
|
|
2302
2763
|
},
|
|
2764
|
+
{
|
|
2765
|
+
type: "paragraph",
|
|
2766
|
+
text: "Phase 2 processes documents with grouped extraction calls \u2014 schema fields are divided into batches of up to 10 fields per call to balance extraction quality with throughput. For each document, the agent sends the document text along with the schema field definitions and any already-resolved values from Phase 1 as context. This context-aware approach means the AI can use related values (like a contract start date) to more accurately extract dependent values (like the end date)."
|
|
2767
|
+
},
|
|
2768
|
+
{
|
|
2769
|
+
type: "paragraph",
|
|
2770
|
+
text: "For fields backed by a **reference table**, Phase 2 includes the table's codes and labels directly in the extraction prompt so the AI picks canonical codes rather than free-text labels. This tight integration between reference tables and AI extraction produces cleaner output that requires fewer corrections. Fields whose reference table has fewer than 50 entries get the full table in the prompt; larger tables are handled by the Phase 3 lookup cascade instead."
|
|
2771
|
+
},
|
|
2303
2772
|
{
|
|
2304
2773
|
type: "callout",
|
|
2305
2774
|
variant: "warning",
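The Phase 2 grouping described in this hunk (batches of up to 10 fields per call, Phase 1 values passed as context, reference tables with fewer than 50 entries inlined) could be planned roughly as follows. `planPhase2Calls` and the call payload shape are hypothetical; the two numeric limits are the ones stated in the text.

```javascript
const MAX_FIELDS_PER_CALL = 10;
const MAX_INLINE_REFERENCE_ENTRIES = 50; // tables with fewer entries are inlined

// Build one extraction call per batch of unresolved fields.
function planPhase2Calls(documentText, schemaFields, resolved) {
  // Only fields that Phase 1 could not resolve go to the AI agent.
  const pending = schemaFields.filter((field) => !(field.name in resolved));
  const calls = [];
  for (let i = 0; i < pending.length; i += MAX_FIELDS_PER_CALL) {
    const batch = pending.slice(i, i + MAX_FIELDS_PER_CALL);
    calls.push({
      documentText,
      context: resolved, // already-resolved values help with dependent fields
      fields: batch.map((field) => ({
        name: field.name,
        referenceEntries:
          field.referenceTable && field.referenceTable.length < MAX_INLINE_REFERENCE_ENTRIES
            ? field.referenceTable
            : undefined, // larger tables are left to the later lookup cascade
      })),
    });
  }
  return calls;
}

const demoFields = Array.from({ length: 25 }, (_, i) => ({ name: `f${i}` }));
demoFields[2].referenceTable = [{ code: "std_master", label: "Frame Agreement" }];
demoFields[3].referenceTable = Array.from({ length: 60 }, (_, i) => ({ code: `c${i}`, label: `L${i}` }));
const demoCalls = planPhase2Calls("...document text...", demoFields, { f0: "2024-01-01", f1: "ACME" });
```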
|
|
@@ -2319,6 +2788,10 @@ var sections6 = [
|
|
|
2319
2788
|
{
|
|
2320
2789
|
question: "Can the agent skip a field with manual instructions?",
|
|
2321
2790
|
answer: "No. Fields with manual instructions always use the extract strategy. Human-written instructions are treated as authoritative and never skipped."
|
|
2791
|
+
},
|
|
2792
|
+
{
|
|
2793
|
+
question: "How many fields does the agent process per AI call?",
|
|
2794
|
+
answer: "Schema fields are grouped into batches of up to 10 fields per extraction call. This balances extraction quality with throughput \u2014 smaller groups help the AI focus on each field without losing recall."
|
|
2322
2795
|
}
|
|
2323
2796
|
],
|
|
2324
2797
|
mentions: [
|
|
@@ -2375,6 +2848,18 @@ var sections6 = [
|
|
|
2375
2848
|
description: "Field with >80% registry occurrence rate is empty in this document."
|
|
2376
2849
|
}
|
|
2377
2850
|
]
|
|
2851
|
+
},
|
|
2852
|
+
{
|
|
2853
|
+
type: "paragraph",
|
|
2854
|
+
text: 'Phase 3 also re-runs the lookup cascade (reference table resolution) on values that Phase 2 produced. This is important because AI-extracted values often use natural language labels (e.g., "Frame Agreement") rather than the canonical codes your reference table expects (e.g., `std_master`). The Phase 3 lookup normalizes these labels to codes, improving consistency across your output without requiring manual corrections.'
|
|
2855
|
+
},
|
|
2856
|
+
{
|
|
2857
|
+
type: "paragraph",
|
|
2858
|
+
text: "Validation flags are designed to surface the most impactful issues first. The **low_confidence_outlier** flag is particularly useful \u2014 it highlights cells where the system is uncertain in an otherwise high-confidence row, pointing you to the exact cells most likely to contain errors. For large runs with hundreds of documents, filtering by flags and reviewing those cells first can reduce your review time by 80% or more."
|
|
2859
|
+
},
|
|
2860
|
+
{
|
|
2861
|
+
type: "callout",
|
|
2862
|
+
text: "Validation flags never modify cell values. They are purely informational annotations that help you prioritize review. The actual cell value and confidence score remain unchanged by Phase 3 flagging."
|
|
2378
2863
|
}
|
|
2379
2864
|
],
|
|
2380
2865
|
related: [
|
|
@@ -2390,6 +2875,10 @@ var sections6 = [
|
|
|
2390
2875
|
{
|
|
2391
2876
|
question: "What types of validation flags exist?",
|
|
2392
2877
|
answer: "Five types: date_sanity (date inconsistencies), amount_mismatch (total discrepancies), lookup_failed (no reference match), low_confidence_outlier (low confidence cells), and unexpected_empty (missing high-frequency fields)."
|
|
2878
|
+
},
|
|
2879
|
+
{
|
|
2880
|
+
question: "Does Phase 3 modify any cell values?",
|
|
2881
|
+
answer: "Phase 3 re-runs the reference table lookup cascade to normalize AI-extracted labels to canonical codes. The validation flags themselves are purely informational and do not modify values."
|
|
2393
2882
|
}
|
|
2394
2883
|
],
|
|
2395
2884
|
mentions: [
|
|
@@ -2415,6 +2904,14 @@ var sections6 = [
|
|
|
2415
2904
|
type: "paragraph",
|
|
2416
2905
|
text: "Because Phase 4 has access to the full grid context \u2014 all values already resolved in earlier phases \u2014 it can use surrounding data as clues. For example, if a contract start date was resolved in Phase 1 but the end date is still empty, Phase 4 re-reads the document knowing the start date, which helps the AI locate the corresponding end date more accurately."
|
|
2417
2906
|
},
|
|
2907
|
+
{
|
|
2908
|
+
type: "paragraph",
|
|
2909
|
+
text: "Phase 4 also applies deterministic transforms to all cell values: ISO code normalization, date format standardization, and unit conversion. Format constraints (regex patterns defined on schema fields) are evaluated at this stage. If a value fails its format constraint, the configured mismatch behavior kicks in \u2014 the cell is either cleared, flagged with an amber dot, or replaced with a constant. Original values are always preserved in the `original_extractions` table for audit purposes."
|
|
2910
|
+
},
|
|
2911
|
+
{
|
|
2912
|
+
type: "paragraph",
|
|
2913
|
+
text: "Expect Phase 4 to fill 5-15% of remaining empty cells, depending on document complexity and schema coverage. The phase is most effective for fields that require cross-referencing multiple sections of a document or interpreting values in the context of other extracted data. It is less effective for fields that are genuinely absent from the source document \u2014 those will remain empty with an `unresolved` provenance type."
|
|
2914
|
+
},
|
|
2418
2915
|
{
|
|
2419
2916
|
type: "callout",
|
|
2420
2917
|
text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected."
|
|
@@ -2433,6 +2930,10 @@ var sections6 = [
|
|
|
2433
2930
|
{
|
|
2434
2931
|
question: "Can Phase 4 overwrite high-confidence values?",
|
|
2435
2932
|
answer: "No. Phase 4 respects the confidence gate \u2014 it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from earlier phases are permanently protected."
|
|
2933
|
+
},
|
|
2934
|
+
{
|
|
2935
|
+
question: "What else happens in Phase 4 besides gap filling?",
|
|
2936
|
+
answer: "Phase 4 also applies deterministic transforms (ISO codes, dates, units), evaluates format constraints (regex validation), and runs the modifier pipeline (format, alias, max_length). Original values are preserved for audit."
|
|
2436
2937
|
}
|
|
2437
2938
|
],
|
|
2438
2939
|
mentions: ["Phase 4", "re-read", "gap filling", "confidence gate", "targeted extraction"]
|
|
@@ -2457,6 +2958,18 @@ var sections6 = [
|
|
|
2457
2958
|
{
|
|
2458
2959
|
type: "paragraph",
|
|
2459
2960
|
text: "Start your review by switching to the **Flagged** filter to focus on cells that need attention \u2014 these are values with validation warnings, low confidence, or format mismatches. Click any cell to see its full provenance, including which phase produced it and the reasoning trace. Once you are satisfied, export via **CSV** \u2014 choose the clean export for downstream systems or the full export with metadata for auditing."
|
|
2961
|
+
},
|
|
2962
|
+
{
|
|
2963
|
+
type: "paragraph",
|
|
2964
|
+
text: "The colored dots on each cell are your quickest visual indicator of data quality. Blue dots indicate graph matches from Phase 1 (highest reliability), purple dots indicate computed values, teal dots indicate agent transfers, indigo dots indicate AI extractions, and amber dots indicate lookup results or format flags. A grid dominated by blue and purple dots typically requires minimal review, while one with many indigo and amber dots may need more attention."
|
|
2965
|
+
},
|
|
2966
|
+
{
|
|
2967
|
+
type: "paragraph",
|
|
2968
|
+
text: "For large jobs with hundreds of documents, use a systematic review workflow: first address all **Flagged** rows, then spot-check a random sample of **Clean** rows to build confidence in the overall quality. If you find recurring errors in a specific field, consider updating the schema field's instruction or reference table, then run a new job \u2014 corrections you apply also feed back as training signals for future runs."
|
|
2969
|
+
},
|
|
2970
|
+
{
|
|
2971
|
+
type: "callout",
|
|
2972
|
+
text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus."
|
|
2460
2973
|
}
|
|
2461
2974
|
],
|
|
2462
2975
|
related: [
|
|
@@ -2472,6 +2985,10 @@ var sections6 = [
|
|
|
2472
2985
|
{
|
|
2473
2986
|
question: "Can I export extraction results?",
|
|
2474
2987
|
answer: "Yes. Use CSV export from the job detail page. You can export clean data only or full data with metadata including confidence scores and resolution types."
|
|
2988
|
+
},
|
|
2989
|
+
{
|
|
2990
|
+
question: "What is the most efficient way to review a large extraction run?",
|
|
2991
|
+
answer: "Start with the Flagged filter to address cells with validation warnings, low confidence, or format mismatches. Then spot-check a random sample of Clean rows. Focus corrections on recurring field-level patterns rather than individual cells."
|
|
2475
2992
|
}
|
|
2476
2993
|
],
|
|
2477
2994
|
mentions: [
|
|
@@ -2528,6 +3045,14 @@ var sections6 = [
|
|
|
2528
3045
|
}
|
|
2529
3046
|
]
|
|
2530
3047
|
},
|
|
3048
|
+
{
|
|
3049
|
+
type: "paragraph",
|
|
3050
|
+
text: "Confidence scores follow predictable patterns by resolution type. Graph matches from Phase 1 typically score 0.7-0.95 because they are derived from verified registry data. Reference table lookups score 0.95 for exact normalization matches, ~0.70 for fuzzy matches, and 0.50 for AI fallback. Agent-derived values from Phase 2 generally score 0.5-0.9 depending on the clarity of the source document and the specificity of the extraction instruction."
|
|
3051
|
+
},
|
|
3052
|
+
{
|
|
3053
|
+
type: "paragraph",
|
|
3054
|
+
text: "Use confidence scores to set your review threshold. Cells above 0.8 are generally reliable and can be trusted without manual verification for most use cases. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. You can use the full CSV export to filter and sort by confidence, making it easy to batch-review low-confidence cells efficiently."
|
|
3055
|
+
},
|
|
2531
3056
|
{
|
|
2532
3057
|
type: "callout",
|
|
2533
3058
|
variant: "warning",
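The review thresholds in this hunk (above 0.8, between 0.5 and 0.8, below 0.5) and the confidence gate (cells filled at confidence >= 0.7 are not overwritten by later phases) translate into two small predicates. The function names and the treatment of the exact boundary values 0.5 and 0.8 are assumptions for illustration.

```javascript
const CONFIDENCE_GATE = 0.7;

// A later phase may write to a cell only if it is empty or below the gate.
function canOverwrite(cell) {
  const isEmpty = cell.value == null || cell.value === "";
  return isEmpty || cell.confidence < CONFIDENCE_GATE;
}

// Bucket a cell for manual review based on its confidence score.
function reviewBucket(confidence) {
  if (confidence > 0.8) return "trusted";
  if (confidence >= 0.5) return "quick_check";
  return "manual_review";
}

// Group cells (for example, rows parsed from the full CSV export) by bucket.
function triage(cells) {
  const buckets = { trusted: [], quick_check: [], manual_review: [] };
  for (const cell of cells) buckets[reviewBucket(cell.confidence)].push(cell);
  return buckets;
}

const triaged = triage([
  { field: "vendor", value: "ACME", confidence: 0.95 },
  { field: "end_date", value: "2025-01-01", confidence: 0.7 },
  { field: "total", value: "100.00", confidence: 0.3 },
]);
```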
|
|
@@ -2547,6 +3072,10 @@ var sections6 = [
|
|
|
2547
3072
|
{
|
|
2548
3073
|
question: "What is the confidence gate?",
|
|
2549
3074
|
answer: "The confidence gate prevents any later pipeline phase from overwriting a cell that was filled with confidence >= 0.7. This protects high-quality lookup results from lower-confidence agent extractions."
|
|
3075
|
+
},
|
|
3076
|
+
{
|
|
3077
|
+
question: "What confidence threshold should I use for manual review?",
|
|
3078
|
+
answer: "Cells above 0.8 are generally reliable. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. Use the CSV export to filter by confidence for efficient batch review."
|
|
2550
3079
|
}
|
|
2551
3080
|
],
|
|
2552
3081
|
mentions: [
|
|
@@ -2571,6 +3100,18 @@ var sections6 = [
|
|
|
2571
3100
|
{
|
|
2572
3101
|
type: "paragraph",
|
|
2573
3102
|
text: "When correcting a value, consider using **all_similar** propagation if the same mistake appears across multiple documents \u2014 for example, a reference table code that was consistently mapped to the wrong label. This applies your fix to every document in the run that matched the same way, saving you from correcting each cell individually. The system learns from these corrections, so the same error is less likely to recur in future jobs."
|
|
3103
|
+
},
|
|
3104
|
+
{
|
|
3105
|
+
type: "paragraph",
|
|
3106
|
+
text: "Corrections create a full audit trail: the original extracted value, the corrected value, who made the change, and when. This audit log is preserved even after subsequent jobs are run, giving you a complete history of manual interventions. When you export results with the full metadata option, correction history is included so downstream systems can distinguish between AI-extracted and human-corrected values."
|
|
3107
|
+
},
|
|
3108
|
+
{
|
|
3109
|
+
type: "paragraph",
|
|
3110
|
+
text: "For best results, correct the root cause rather than individual symptoms. If a field consistently produces wrong values, update the schema field's **manual instruction** or **reference table** rather than correcting cells one by one. If a reference table code is missing, add it to the table \u2014 future runs will pick it up automatically at Tier 1 confidence (0.95). Corrections are most valuable as a feedback mechanism when they inform schema improvements."
|
|
3111
|
+
},
|
|
3112
|
+
{
|
|
3113
|
+
type: "callout",
|
|
3114
|
+
text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected."
|
|
2574
3115
|
}
|
|
2575
3116
|
],
|
|
2576
3117
|
related: [
|
|
@@ -2586,6 +3127,10 @@ var sections6 = [
|
|
|
2586
3127
|
{
|
|
2587
3128
|
question: "Do corrections improve future extractions?",
|
|
2588
3129
|
answer: "Yes. Corrections feed back as training signals for future runs, helping the system learn from your corrections and improve accuracy over time."
|
|
3130
|
+
},
|
|
3131
|
+
{
|
|
3132
|
+
question: "Is there an audit trail for corrections?",
|
|
3133
|
+
answer: "Yes. Every correction logs the original value, the corrected value, the user who made the change, and the timestamp. This audit history is preserved and included in full metadata CSV exports."
|
|
2589
3134
|
}
|
|
2590
3135
|
],
|
|
2591
3136
|
mentions: [
|
|
@@ -2639,6 +3184,18 @@ var sections7 = [
|
|
|
2639
3184
|
{
|
|
2640
3185
|
type: "paragraph",
|
|
2641
3186
|
text: "Most link keys are auto-classified by name patterns. Remaining ambiguous fields are classified by AI. High-frequency entities (>30% of documents) are automatically excluded from case formation."
|
|
3187
|
+
},
|
|
3188
|
+
{
|
|
3189
|
+
type: "paragraph",
|
|
3190
|
+
text: "Behind the scenes, the classification engine applies rule-based heuristics first \u2014 field names like `company_name` or `invoice_number` are recognized instantly. When heuristics are inconclusive, an AI classifier examines the field's extracted values and schema context to determine the correct category. This two-tier approach keeps classification fast for the common case while handling ambiguous fields gracefully."
|
|
3191
|
+
},
|
|
3192
|
+
{
|
|
3193
|
+
type: "paragraph",
|
|
3194
|
+
text: "Use link keys whenever your documents share identifying information that should connect them. For best results, ensure your field names follow clear naming conventions \u2014 this maximizes the hit rate of the automatic classifier and minimizes the need for manual overrides."
|
|
3195
|
+
},
|
|
3196
|
+
{
|
|
3197
|
+
type: "callout",
|
|
3198
|
+
text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest."
|
|
2642
3199
|
}
|
|
2643
3200
|
],
|
|
2644
3201
|
related: [
|
|
@@ -2654,6 +3211,10 @@ var sections7 = [
|
|
|
2654
3211
|
{
|
|
2655
3212
|
question: "Why are high-frequency entities excluded from case formation?",
|
|
2656
3213
|
answer: "Entities appearing in more than 30% of documents are too common to be meaningful connections. They are automatically excluded to prevent overly large, uninformative cases."
|
|
3214
|
+
},
|
|
3215
|
+
{
|
|
3216
|
+
question: "Can I manually classify a field as a link key?",
|
|
3217
|
+
answer: "Yes. Navigate to the Field Registry and change any field's link key category. Manual classifications take precedence over automatic ones and persist across future jobs."
|
|
2657
3218
|
}
|
|
2658
3219
|
],
|
|
2659
3220
|
mentions: [
|
|
@@ -2674,6 +3235,22 @@ var sections7 = [
|
|
|
2674
3235
|
{
|
|
2675
3236
|
type: "paragraph",
|
|
2676
3237
|
text: 'After extraction, the linking pipeline runs automatically: it extracts link key values, normalizes them (lowercasing, stripping suffixes like "Ltd", "Inc"), and builds a bipartite graph of documents ↔ entities.'
|
|
3238
|
+
},
|
|
3239
|
+
{
|
|
3240
|
+
type: "paragraph",
|
|
3241
|
+
text: 'The normalization step is critical for accurate linking. Values like "ACME Corp.", "Acme Corporation", and "acme corp" are all reduced to the same canonical form so they resolve to a single entity node. This prevents duplicate entities from fragmenting your cases and ensures documents that reference the same real-world entity are correctly connected.'
|
|
3242
|
+
},
|
|
3243
|
+
{
|
|
3244
|
+
type: "paragraph",
|
|
3245
|
+
text: "The resulting bipartite graph has two node types: documents and entities. An edge connects a document to an entity whenever the document contains that entity's value in a link key field. Connected components in this graph become the foundation for case formation \u2014 documents that share entities end up in the same case."
|
|
3246
|
+
},
|
|
3247
|
+
{
|
|
3248
|
+
type: "paragraph",
|
|
3249
|
+
text: "For best results, ensure your source documents contain consistent identifiers. The pipeline handles minor variations automatically, but wildly inconsistent naming (e.g., abbreviations vs. full legal names) may require manual link key tuning in the Field Registry."
|
|
3250
|
+
},
|
|
3251
|
+
{
|
|
3252
|
+
type: "callout",
|
|
3253
|
+
text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered."
|
|
2677
3254
|
}
|
|
2678
3255
|
],
|
|
2679
3256
|
related: [
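The linking steps described in this hunk (normalize link key values, build a bipartite document/entity graph, treat connected components as cases) can be sketched with a union-find structure. The suffix list, the punctuation stripping, and the function names are assumptions; the exclusion of high-frequency entities mentioned elsewhere in the docs is not modeled here.

```javascript
// Assumed suffix list; the docs only name "Ltd", "Inc", "Corp" and "etc.".
const SUFFIXES = /\b(ltd|inc|corp|corporation|gmbh|llc)\b\.?/g;

function normalizeEntity(raw) {
  return raw
    .toLowerCase()
    .replace(SUFFIXES, "")
    .replace(/[.,]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// documents: [{ id, linkValues: [string, ...] }] -> array of cases (doc id lists)
function formCases(documents) {
  const parent = new Map();
  const add = (x) => {
    if (!parent.has(x)) parent.set(x, x);
  };
  const find = (x) => {
    while (parent.get(x) !== x) {
      parent.set(x, parent.get(parent.get(x))); // path halving
      x = parent.get(x);
    }
    return x;
  };
  const union = (a, b) => {
    add(a);
    add(b);
    parent.set(find(a), find(b));
  };

  for (const doc of documents) {
    const docNode = `doc:${doc.id}`;
    add(docNode);
    for (const raw of doc.linkValues) {
      const key = normalizeEntity(raw);
      if (!key) continue; // nothing left after normalization
      union(docNode, `entity:${key}`); // edge: document <-> entity
    }
  }

  // Documents sharing a root belong to the same connected component.
  const cases = new Map();
  for (const doc of documents) {
    const root = find(`doc:${doc.id}`);
    if (!cases.has(root)) cases.set(root, []);
    cases.get(root).push(doc.id);
  }
  return [...cases.values()];
}

const demoCases = formCases([
  { id: "A", linkValues: ["ACME Corp."] },
  { id: "B", linkValues: ["Acme Corporation", "PO-1"] },
  { id: "C", linkValues: ["Globex Ltd"] },
  { id: "D", linkValues: ["PO-1"] },
]);
```

In the sample, documents A and B connect through the normalized entity `acme`, and D joins them through the shared `po-1` value, while C stays in its own case.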
|
|
@@ -2689,6 +3266,10 @@ var sections7 = [
|
|
|
2689
3266
|
{
|
|
2690
3267
|
question: "When does entity linking run?",
|
|
2691
3268
|
answer: "Entity linking runs automatically after document extraction. It processes link key values and builds connections without manual intervention."
|
|
3269
|
+
},
|
|
3270
|
+
{
|
|
3271
|
+
question: "What normalization does entity linking apply?",
|
|
3272
|
+
answer: "Values are lowercased, common suffixes (Ltd, Inc, Corp, etc.) are stripped, and whitespace is normalized. This ensures minor naming variations resolve to the same entity."
|
|
2692
3273
|
}
|
|
2693
3274
|
],
|
|
2694
3275
|
mentions: [
|
|
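The normalization rules named in the FAQ answer above (lowercasing, suffix stripping, whitespace normalization) could look roughly like this. The suffix list and the punctuation handling are assumptions: the source names only "Ltd, Inc, Corp, etc.", so `COMPANY_SUFFIXES` and `normalizeLinkKey` are hypothetical.

```javascript
// Hypothetical suffix list; the documentation only names "Ltd, Inc, Corp, etc."
const COMPANY_SUFFIXES = ["ltd", "inc", "corp", "corporation", "llc", "gmbh"];

// Sketch of link key normalization: lowercase, normalize whitespace,
// and strip trailing company suffixes.
function normalizeLinkKey(value) {
  const tokens = String(value)
    .toLowerCase()
    .replace(/[.,]/g, " ") // assumption: dots and commas are treated as spacing
    .split(/\s+/)
    .filter(Boolean);

  // Drop trailing suffix tokens, but always keep at least one token.
  while (
    tokens.length > 1 &&
    COMPANY_SUFFIXES.includes(tokens[tokens.length - 1])
  ) {
    tokens.pop();
  }
  return tokens.join(" ");
}
```

Under these assumptions, "ACME Corp.", "Acme Corporation", and "acme corp" all reduce to the same key, so they would resolve to a single entity node.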
@@ -2780,6 +3361,22 @@ var sections7 = [
|
|
|
2780
3361
|
{
|
|
2781
3362
|
type: "paragraph",
|
|
2782
3363
|
text: "The Document Graph provides a visual D3-force layout of the bipartite graph. Toggle between graph and list views from the Cases page. Case templates are auto-discovered after 3+ cases form \u2014 they identify recurring document type patterns."
|
|
3364
|
+
},
|
|
3365
|
+
{
|
|
3366
|
+
type: "paragraph",
|
|
3367
|
+
text: "In the graph view, document nodes and entity nodes are rendered with distinct visual styles. Edges represent link key connections, and tightly connected clusters naturally pull together through force simulation. Hovering over a node highlights its connections, making it easy to trace how documents relate through shared entities."
|
|
3368
|
+
},
|
|
3369
|
+
{
|
|
3370
|
+
type: "paragraph",
|
|
3371
|
+
text: 'Case templates capture recurring patterns \u2014 for example, "Invoice + Purchase Order + Contract" might emerge as a common template after enough cases form. Templates include a **match threshold** that controls how closely a case must match the expected document type set. Use templates to monitor completeness: if a case is missing a document type that the template expects, an anomaly is raised.'
|
|
3372
|
+
},
|
|
3373
|
+
{
|
|
3374
|
+
type: "paragraph",
|
|
3375
|
+
text: "Most teams use the graph view during initial workspace setup to verify that linking is producing sensible clusters. Once you are confident in your link key configuration, the list view is more practical for day-to-day case review and triage."
|
|
3376
|
+
},
|
|
3377
|
+
{
|
|
3378
|
+
type: "callout",
|
|
3379
|
+
text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern."
|
|
2783
3380
|
}
|
|
2784
3381
|
],
|
|
2785
3382
|
related: [
|
|
@@ -2794,6 +3391,10 @@ var sections7 = [
|
|
|
2794
3391
|
{
|
|
2795
3392
|
question: "What are case templates?",
|
|
2796
3393
|
answer: "Case templates are auto-discovered after 3 or more cases form. They identify recurring document type patterns, helping you understand common document relationships in your workspace."
|
|
3394
|
+
},
|
|
3395
|
+
{
|
|
3396
|
+
question: "Can I switch between graph and list views?",
|
|
3397
|
+
answer: "Yes. Toggle between the visual D3-force graph and a traditional list view from the Cases page. Both views show the same underlying data \u2014 choose whichever suits your workflow."
|
|
2797
3398
|
}
|
|
2798
3399
|
],
|
|
2799
3400
|
mentions: ["document graph", "D3-force layout", "bipartite graph", "case templates"]
|
|
@@ -2843,6 +3444,18 @@ var sections7 = [
|
|
|
2843
3444
|
{
|
|
2844
3445
|
type: "paragraph",
|
|
2845
3446
|
text: "Anomalies appear in the **Anomalies** tab of the case detail page (Advanced mode). Each anomaly card shows severity, affected fields, and a dismiss button. Dismissed anomalies are hidden by default but visible via the **show dismissed** toggle."
|
|
3447
|
+
},
|
|
3448
|
+
{
|
|
3449
|
+
type: "paragraph",
|
|
3450
|
+
text: "The detection engine runs automatically after case formation and whenever case membership changes (documents added or removed, or cases merged). Each detector operates independently \u2014 a single case can trigger multiple anomaly types simultaneously. Anomaly counts are displayed as badges in the case header for quick triage."
|
|
3451
|
+
},
|
|
3452
|
+
{
|
|
3453
|
+
type: "paragraph",
|
|
3454
|
+
text: "Use anomaly detection to surface data quality issues that would otherwise require manual comparison across documents. For best results, configure case templates so the **Missing Document Type** detector (D4) can flag incomplete cases. Most teams find that D2 (Field Conflict) and D3 (Duplicate Key Divergence) catch the highest-value issues in procurement and financial workflows."
|
|
3455
|
+
},
|
|
3456
|
+
{
|
|
3457
|
+
type: "callout",
|
|
3458
|
+
text: "Viewing anomalies requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but are not displayed on the case detail page."
|
|
2846
3459
|
}
|
|
2847
3460
|
],
|
|
2848
3461
|
related: [
|
|
@@ -2854,6 +3467,14 @@ var sections7 = [
|
|
|
2854
3467
|
{
|
|
2855
3468
|
question: "What anomalies does Talonic detect?",
|
|
2856
3469
|
answer: "Five structural patterns: validation clusters, field conflicts, duplicate key divergence, missing document types, and value reuse. Each is surfaced as a dismissable card on the case detail page."
|
|
3470
|
+
},
|
|
3471
|
+
{
|
|
3472
|
+
question: "Do anomalies update automatically when cases change?",
|
|
3473
|
+
answer: "Yes. The detection engine re-runs whenever case membership changes \u2014 documents added or removed, cases merged or split. Anomaly badges in the case header update in real time."
|
|
3474
|
+
},
|
|
3475
|
+
{
|
|
3476
|
+
question: "Can I dismiss anomalies?",
|
|
3477
|
+
answer: "Yes. Each anomaly card includes a dismiss button. Dismissed anomalies are hidden by default but can be revealed using the show dismissed toggle on the Anomalies tab."
|
|
2857
3478
|
}
|
|
2858
3479
|
],
|
|
2859
3480
|
mentions: ["anomaly detection", "validation cluster", "field conflict", "duplicate key divergence", "value reuse"]
|
|
@@ -2886,6 +3507,14 @@ var sections7 = [
|
|
|
2886
3507
|
type: "paragraph",
|
|
2887
3508
|
text: "**Domain packs** extend validation with industry-specific rules. The freight domain pack includes DOT number state detection and MC number validation. Additional packs can be added to `domain-packs/` without modifying the core engine."
|
|
2888
3509
|
},
|
|
3510
|
+
{
|
|
3511
|
+
type: "paragraph",
|
|
3512
|
+
text: "Validation runs automatically after extraction and linking complete. Each field value is checked against every applicable validator \u2014 a single field can trigger multiple rules. Results are displayed as colored badges in the **Evidence** tab: green for pass, red for fail, and amber for warnings. You can filter by status, document, category, or free-text search."
|
|
3513
|
+
},
|
|
3514
|
+
{
|
|
3515
|
+
type: "paragraph",
|
|
3516
|
+
text: "The checksum validator (S7) uses a parameterized factory pattern \u2014 it accepts a checksum algorithm name and applies the corresponding verification logic. Supported algorithms include Luhn (credit card numbers), ABA (bank routing numbers), IBAN (international bank accounts), and ISBN (book identifiers). For best results, ensure your schema fields are typed correctly so the engine knows which checksum to apply."
|
|
3517
|
+
},
|
|
2889
3518
|
{
|
|
2890
3519
|
type: "callout",
|
|
2891
3520
|
text: "Evidence validation results are stored in a separate `evidence_validation_results` table keyed by (document_id, entity_id, field_key) \u2014 not in the extraction or linking tables."
|
|
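The parameterized factory pattern described for the checksum validator can be illustrated with a short sketch. Only Luhn and ABA are shown, and the names `makeChecksumValidator` and `CHECKSUMS` are hypothetical; the actual S7 validator's interface is not shown in this diff.

```javascript
// Sketch of a parameterized checksum factory. Names are hypothetical.
const CHECKSUMS = new Map([
  [
    "luhn",
    (digits) => {
      // Walk from the rightmost digit, doubling every second digit.
      let sum = 0;
      for (let i = 0; i < digits.length; i++) {
        let d = Number(digits[digits.length - 1 - i]);
        if (i % 2 === 1) {
          d *= 2;
          if (d > 9) d -= 9;
        }
        sum += d;
      }
      return sum % 10 === 0;
    },
  ],
  [
    "aba",
    (digits) => {
      // ABA routing numbers are 9 digits with weights 3, 7, 1 repeating.
      if (digits.length !== 9) return false;
      const d = [...digits].map(Number);
      const sum =
        3 * (d[0] + d[3] + d[6]) +
        7 * (d[1] + d[4] + d[7]) +
        (d[2] + d[5] + d[8]);
      return sum % 10 === 0;
    },
  ],
]);

// Returns a validator function for the named algorithm.
function makeChecksumValidator(algorithm) {
  const check = CHECKSUMS.get(algorithm);
  if (!check) throw new Error(`Unsupported checksum algorithm: ${algorithm}`);
  return (value) => {
    // Assumption: spaces and hyphens are formatting and can be ignored.
    const digits = String(value).replace(/[\s-]/g, "");
    return /^\d+$/.test(digits) && check(digits);
  };
}
```

Keeping the algorithms in a lookup table means a new checksum can be registered without touching the factory, which is the main appeal of this pattern.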
@@ -2904,6 +3533,10 @@ var sections7 = [
|
|
|
2904
3533
|
{
|
|
2905
3534
|
question: "What are domain packs?",
|
|
2906
3535
|
answer: "Domain packs add industry-specific validation rules. For example, the freight domain pack validates DOT numbers and MC numbers. New packs can be added without modifying the core engine."
|
|
3536
|
+
},
|
|
3537
|
+
{
|
|
3538
|
+
question: "How are evidence validation results displayed?",
|
|
3539
|
+
answer: "Results appear as colored badges in the Evidence tab of the case detail page. Green indicates pass, red indicates fail, and amber indicates a warning. Use the filter bar to narrow results by status, document, or category."
|
|
2907
3540
|
}
|
|
2908
3541
|
],
|
|
2909
3542
|
mentions: ["evidence validation", "structural validators", "checksum", "Luhn", "IBAN", "domain packs", "freight"]
|
|
@@ -2930,6 +3563,18 @@ var sections8 = [
|
|
|
2930
3563
|
{
|
|
2931
3564
|
type: "paragraph",
|
|
2932
3565
|
text: "Navigate to **Data Products → Dataset Templates** to manage templates. Each template is linked to a user schema and can be versioned independently. When creating a new job, select a template instead of configuring the output from scratch."
|
|
3566
|
+
},
|
|
3567
|
+
{
|
|
3568
|
+
type: "paragraph",
|
|
3569
|
+
text: "Templates support column mappings that rename, reorder, or exclude fields from the output. Default transforms \u2014 such as date formatting, currency normalization, or unit conversion \u2014 are applied automatically during assembly. This means every data product built from the same template produces structurally identical output regardless of who runs it or when."
|
|
3570
|
+
},
|
|
3571
|
+
{
|
|
3572
|
+
type: "paragraph",
|
|
3573
|
+
text: "For best results, create one template per downstream consumer. If your finance team and operations team need different column subsets from the same schema, define two templates rather than manually reconfiguring each export. Most teams version their templates alongside schema changes to maintain backward compatibility with existing integrations."
|
|
3574
|
+
},
|
|
3575
|
+
{
|
|
3576
|
+
type: "callout",
|
|
3577
|
+
text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction."
|
|
2933
3578
|
}
|
|
2934
3579
|
],
|
|
2935
3580
|
related: [
|
|
@@ -2945,6 +3590,10 @@ var sections8 = [
|
|
|
2945
3590
|
{
|
|
2946
3591
|
question: "How do dataset templates relate to schemas?",
|
|
2947
3592
|
answer: "Each dataset template is linked to a user schema and can be versioned independently. When creating a new job, you can select a template instead of configuring output from scratch."
|
|
3593
|
+
},
|
|
3594
|
+
{
|
|
3595
|
+
question: "Can I version dataset templates?",
|
|
3596
|
+
answer: "Yes. Each template is versioned independently from the schema it references. This lets you evolve your output format over time without affecting existing data products built from earlier versions."
|
|
2948
3597
|
}
|
|
2949
3598
|
],
|
|
2950
3599
|
mentions: [
|
|
@@ -2970,6 +3619,14 @@ var sections8 = [
|
|
|
2970
3619
|
type: "paragraph",
|
|
2971
3620
|
text: "Navigate to **Data Products → Assemblies** to view and create assemblies. Each assembly shows its document count, linked schema, processing status, and the date it was created."
|
|
2972
3621
|
},
|
|
3622
|
+
{
|
|
3623
|
+
type: "paragraph",
|
|
3624
|
+
text: "When you create an assembly, you select a dataset template and one or more document sources. The system pulls all matching documents, applies the template's column mappings and transforms, and produces a single structured output. The assembly tracks which documents contributed to each row, giving you full traceability from output back to source."
|
|
3625
|
+
},
|
|
3626
|
+
{
|
|
3627
|
+
type: "paragraph",
|
|
3628
|
+
text: "Use assemblies whenever you need a repeatable, auditable output for downstream systems or stakeholders. Most teams create one assembly per reporting period or delivery cycle. Because assemblies reference a template, you can regenerate the same output shape from different document sets without reconfiguring columns or transforms each time."
|
|
3629
|
+
},
|
|
2973
3630
|
{
|
|
2974
3631
|
type: "callout",
|
|
2975
3632
|
text: "Assemblies are the recommended way to produce production datasets. They provide a single audit trail from source documents through extraction, resolution, and validation to the final output."
|
|
@@ -2988,6 +3645,10 @@ var sections8 = [
|
|
|
2988
3645
|
{
|
|
2989
3646
|
question: "Why should I use assemblies for production data?",
|
|
2990
3647
|
answer: "Assemblies provide a single audit trail from source documents through extraction, resolution, and validation to the final output, making them the recommended approach for production datasets."
|
|
3648
|
+
},
|
|
3649
|
+
{
|
|
3650
|
+
question: "Can an assembly pull from multiple sources?",
|
|
3651
|
+
answer: "Yes. An assembly can combine documents from any number of sources \u2014 uploaded files, connected drives, email attachments, and more \u2014 into a single structured dataset."
|
|
2991
3652
|
}
|
|
2992
3653
|
],
|
|
2993
3654
|
mentions: [
|
|
@@ -3033,6 +3694,18 @@ var sections8 = [
|
|
|
3033
3694
|
{
|
|
3034
3695
|
type: "paragraph",
|
|
3035
3696
|
text: "ID rules are persisted before IDs are generated. Navigate to a data product detail page and use **Apply ID Rules** to generate IDs or **Regenerate IDs** to refresh them."
|
|
3697
|
+
},
|
|
3698
|
+
{
|
|
3699
|
+
type: "paragraph",
|
|
3700
|
+
text: 'Resolution maps normalize field values before they become part of the ID. For example, a resolution map can collapse "ACME Corp", "ACME Corporation", and "Acme" into a single canonical value "ACME". This prevents duplicate IDs for rows that refer to the same real-world entity under different names.'
|
|
3701
|
+
},
|
|
3702
|
+
{
|
|
3703
|
+
type: "paragraph",
|
|
3704
|
+
text: 'For best results, choose source fields with high uniqueness \u2014 contract numbers or invoice IDs work well, while generic fields like "status" do not. When your documents contain multiple candidate identifiers, configure a fallback chain so the dispenser always has a value to work with. Most teams use the primary reference number as the source field and the document name as the first fallback.'
|
|
3705
|
+
},
|
|
3706
|
+
{
|
|
3707
|
+
type: "callout",
|
|
3708
|
+
text: "ID generation is deterministic \u2014 running **Regenerate IDs** with the same rules and data always produces the same output. This makes ID dispensers safe to re-run without breaking downstream references."
|
|
3036
3709
|
}
|
|
3037
3710
|
],
|
|
3038
3711
|
related: [
|
|
@@ -3044,6 +3717,14 @@ var sections8 = [
|
|
|
3044
3717
|
{
|
|
3045
3718
|
question: "How do ID dispensers handle missing field values?",
|
|
3046
3719
|
answer: "When the source field is empty, the dispenser tries each field in the fallback chain in order. If all are empty, it generates a prefix-less sequential ID."
|
|
3720
|
+
},
|
|
3721
|
+
{
|
|
3722
|
+
question: "What is a resolution map?",
|
|
3723
|
+
answer: 'A resolution map is a key-value lookup that normalizes field values before ID generation. For example, it can collapse "ACME Corp" and "ACME Corporation" into "ACME" to prevent duplicate IDs for the same entity.'
|
|
3724
|
+
},
|
|
3725
|
+
{
|
|
3726
|
+
question: "Can I regenerate IDs without losing data?",
|
|
3727
|
+
answer: "Yes. Regenerating IDs only updates the ID column \u2014 all other data product values remain unchanged. The operation is deterministic, so the same rules and data always produce the same IDs."
|
|
3047
3728
|
}
|
|
3048
3729
|
],
|
|
3049
3730
|
mentions: ["ID dispenser", "unique identifiers", "fallback chain", "resolution map"]
|
|
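The interplay of source field, fallback chain, and resolution map described above can be sketched as a single lookup. The rule shape (`sourceField`, `fallbackChain`, `resolutionMap`) and the function name are assumptions for illustration; the sketch only selects the value an ID would be built from and does not attempt to reproduce the actual ID format.

```javascript
// Sketch: pick the value an ID would be derived from.
// The rule shape and function name are hypothetical.
function resolveIdSource(row, rule) {
  const candidates = [rule.sourceField, ...(rule.fallbackChain ?? [])];
  for (const field of candidates) {
    const raw = row[field];
    if (raw === undefined || raw === null) continue;
    const value = String(raw).trim();
    if (value === "") continue; // empty values fall through to the next field
    // Normalize through the resolution map when an entry exists.
    return rule.resolutionMap?.get(value) ?? value;
  }
  // All candidates were empty; per the FAQ, the dispenser would then
  // fall back to a sequential ID, which is outside this sketch.
  return null;
}

const exampleRule = {
  sourceField: "vendor_name",
  fallbackChain: ["document_name"],
  resolutionMap: new Map([
    ["ACME Corp", "ACME"],
    ["ACME Corporation", "ACME"],
  ]),
};
```

Because the function depends only on the row and the rule, calling it twice with the same inputs yields the same value, which is the property that makes regeneration safe to repeat.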
@@ -3102,6 +3783,10 @@ var sections8 = [
|
|
|
3102
3783
|
{
|
|
3103
3784
|
question: "Does CSV export preserve leading zeros?",
|
|
3104
3785
|
answer: "Yes. All CSV exports preserve leading zeros and long numbers \u2014 values are never coerced to numeric types."
|
|
3786
|
+
},
|
|
3787
|
+
{
|
|
3788
|
+
question: "What is auto-resolve singles?",
|
|
3789
|
+
answer: "Auto-resolve singles automatically accepts fields that have only one candidate value, removing them from the manual review queue. Combined with auto-review, this significantly reduces the volume of items requiring human attention."
|
|
3105
3790
|
}
|
|
3106
3791
|
],
|
|
3107
3792
|
mentions: ["share token", "delivery website", "CSV export", "auto-review", "auto-resolve"]
|
|
@@ -3124,6 +3809,22 @@ var sections9 = [
|
|
|
3124
3809
|
{
|
|
3125
3810
|
type: "paragraph",
|
|
3126
3811
|
text: "Schema-level quality rules run during Phase 3 of every job. Rule types: field format, value range, cross-field consistency, and AI-proposed coherence rules. Rules can be AI-proposed after a job completes, then reviewed and approved before activation."
|
|
3812
|
+
},
|
|
3813
|
+
{
|
|
3814
|
+
type: "paragraph",
|
|
3815
|
+
text: "**Field format** checks verify that values match an expected pattern (e.g., dates in ISO format, phone numbers with country codes). **Value range** checks ensure numeric or date values fall within acceptable bounds. **Cross-field consistency** checks compare two or more fields on the same record \u2014 for example, verifying that a start date precedes an end date."
|
|
3816
|
+
},
|
|
3817
|
+
{
|
|
3818
|
+
type: "paragraph",
|
|
3819
|
+
text: "AI-proposed coherence rules are generated by analyzing patterns in completed job results. The system identifies relationships that hold across most records and proposes them as candidate rules. You review each proposal in the validation settings before it becomes active \u2014 no AI-generated rule runs without explicit approval."
|
|
3820
|
+
},
|
|
3821
|
+
{
|
|
3822
|
+
type: "paragraph",
|
|
3823
|
+
text: "For best results, start with a small set of high-confidence rules and expand over time. Most teams begin with field format checks for critical identifiers (invoice numbers, dates, amounts) and add cross-field consistency rules as they learn their data patterns. Validation failures do not block extraction \u2014 they flag records for review."
|
|
3824
|
+
},
|
|
3825
|
+
{
|
|
3826
|
+
type: "callout",
|
|
3827
|
+
text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently."
|
|
3127
3828
|
}
|
|
3128
3829
|
],
|
|
3129
3830
|
related: [
|
|
@@ -3139,6 +3840,10 @@ var sections9 = [
|
|
|
3139
3840
|
{
|
|
3140
3841
|
question: "Can AI suggest validation rules?",
|
|
3141
3842
|
answer: "Yes. After a job completes, AI can propose coherence rules based on the data. You review and approve these rules before they are activated."
|
|
3843
|
+
},
|
|
3844
|
+
{
|
|
3845
|
+
question: "Do validation failures block extraction?",
|
|
3846
|
+
answer: "No. Validation checks flag records for review but do not prevent extraction from completing. Failed records appear in the Approval Queue for manual inspection."
|
|
3142
3847
|
}
|
|
3143
3848
|
],
|
|
3144
3849
|
mentions: [
|
|
@@ -3158,6 +3863,22 @@ var sections9 = [
|
|
|
3158
3863
|
{
|
|
3159
3864
|
type: "paragraph",
|
|
3160
3865
|
text: "Golden samples are manually created reference datasets with known-correct values. Create them from **Validation → Golden Samples**. Benchmark runs compare extraction results against golden samples to produce per-field accuracy scores with AI judge verdicts."
|
|
3866
|
+
},
|
|
3867
|
+
{
|
|
3868
|
+
type: "paragraph",
|
|
3869
|
+
text: "To create a golden sample, select a document and manually enter the correct value for each field. The system stores these known-correct values as the ground truth baseline. When you run a benchmark, the extraction pipeline processes the same document independently, and the results are compared field by field against your golden sample."
|
|
3870
|
+
},
|
|
3871
|
+
{
|
|
3872
|
+
type: "paragraph",
|
|
3873
|
+
text: 'Benchmark scoring uses an AI judge to evaluate each field comparison. The judge accounts for semantic equivalence \u2014 for example, "United States" and "US" may be scored as a match depending on the field type. Per-field accuracy scores let you identify exactly which fields are underperforming and need schema or instruction tuning.'
|
|
3874
|
+
},
|
|
3875
|
+
{
|
|
3876
|
+
type: "paragraph",
|
|
3877
|
+
text: "For best results, create golden samples from a representative mix of document types and complexity levels. Most teams maintain 5-10 golden samples per schema and re-run benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time."
|
|
3878
|
+
},
|
|
3879
|
+
{
|
|
3880
|
+
type: "callout",
|
|
3881
|
+
text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed."
|
|
3161
3882
|
}
|
|
3162
3883
|
],
|
|
3163
3884
|
related: [
|
|
@@ -3173,6 +3894,10 @@ var sections9 = [
|
|
|
3173
3894
|
{
|
|
3174
3895
|
question: "How do benchmark runs work?",
|
|
3175
3896
|
answer: "Benchmark runs compare extraction results against golden samples, producing per-field accuracy scores with AI judge verdicts to measure extraction quality."
|
|
3897
|
+
},
|
|
3898
|
+
{
|
|
3899
|
+
question: "How many golden samples should I create?",
|
|
3900
|
+
answer: "Most teams maintain 5-10 golden samples per schema, covering a representative mix of document types and complexity levels. Re-run benchmarks after schema changes or model upgrades to track quality trends."
|
|
3176
3901
|
}
|
|
3177
3902
|
],
|
|
3178
3903
|
mentions: ["golden samples", "ground truth", "benchmark runs", "accuracy scoring", "AI judge"]
|
|
@@ -3188,6 +3913,18 @@ var sections9 = [
|
|
|
3188
3913
|
type: "paragraph",
|
|
3189
3914
|
text: "Approval gates are threshold-based rules for auto-approving or flagging results. Configure them per schema with three criteria: minimum confidence, validation pass rate, and field coverage. Results meeting all thresholds are auto-approved; others go to the manual review queue."
|
|
3190
3915
|
},
|
|
3916
|
+
{
|
|
3917
|
+
type: "paragraph",
|
|
3918
|
+
text: "Each criterion acts as an independent gate. **Minimum confidence** sets the lowest acceptable extraction confidence score. **Validation pass rate** requires a minimum percentage of validation checks to pass. **Field coverage** ensures that a minimum percentage of schema fields have non-empty values. A result must clear all three gates to be auto-approved."
|
|
3919
|
+
},
|
|
3920
|
+
{
|
|
3921
|
+
type: "paragraph",
|
|
3922
|
+
text: "Start with conservative thresholds \u2014 high confidence, high pass rate, high coverage \u2014 and loosen them as you gain trust in your extraction pipeline. Most teams begin with 90% confidence, 95% validation pass rate, and 80% field coverage, then adjust based on the volume of false positives in the approval queue."
|
|
3923
|
+
},
|
|
3924
|
+
{
|
|
3925
|
+
type: "paragraph",
|
|
3926
|
+
text: "Approval gates integrate directly with the delivery pipeline. When a result passes all gates, a `result.approved` signal is emitted automatically. Bind this signal to a destination to create a fully automated flow from document upload through extraction, validation, approval, and delivery \u2014 no manual steps required for high-confidence results."
|
|
3927
|
+
},
|
|
3191
3928
|
{
|
|
3192
3929
|
type: "callout",
|
|
3193
3930
|
text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems."
|
|
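The three-gate logic described in this hunk can be expressed as a small predicate. The property names (`confidence`, `validationPassRate`, `fieldCoverage`) and the function name are assumptions; the default values are the starting thresholds the text suggests (90%, 95%, 80%), written as fractions.

```javascript
// Starting thresholds suggested in the text, expressed as fractions.
const DEFAULT_THRESHOLDS = {
  minConfidence: 0.9,
  minValidationPassRate: 0.95,
  minFieldCoverage: 0.8,
};

// Sketch: a result is auto-approved only if it clears every gate.
// Property and function names are hypothetical.
function evaluateGates(result, thresholds = DEFAULT_THRESHOLDS) {
  const failed = [];
  if (result.confidence < thresholds.minConfidence) {
    failed.push("confidence");
  }
  if (result.validationPassRate < thresholds.minValidationPassRate) {
    failed.push("validation_pass_rate");
  }
  if (result.fieldCoverage < thresholds.minFieldCoverage) {
    failed.push("field_coverage");
  }
  return { approved: failed.length === 0, failed };
}
```

Returning the list of failed gates alongside the verdict makes it easier to see which threshold sent a result to manual review.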
@@ -3206,6 +3943,10 @@ var sections9 = [
|
|
|
3206
3943
|
{
|
|
3207
3944
|
question: "How do approval gates connect to delivery?",
|
|
3208
3945
|
answer: "Bind a result.approved signal to a delivery destination to only ship approved rows to your downstream systems. This ensures only quality-checked data is delivered."
|
|
3946
|
+
},
|
|
3947
|
+
{
|
|
3948
|
+
question: "What thresholds should I start with?",
|
|
3949
|
+
answer: "Most teams start with 90% confidence, 95% validation pass rate, and 80% field coverage. Adjust based on the volume of false positives in the approval queue \u2014 loosen thresholds as you gain trust in your pipeline."
|
|
3209
3950
|
}
|
|
3210
3951
|
],
|
|
3211
3952
|
mentions: [
|
|
@@ -3230,6 +3971,22 @@ var sections9 = [
|
|
|
3230
3971
|
{
|
|
3231
3972
|
type: "paragraph",
|
|
3232
3973
|
text: 'Filter the queue by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect the extracted values, provenance trails, and validation check results before approving or rejecting.'
|
|
3974
|
+
},
|
|
3975
|
+
{
|
|
3976
|
+
type: "paragraph",
|
|
3977
|
+
text: "The review detail view shows the extracted values alongside the source document, with provenance trails tracing each value back to its origin in the text. Validation check results are displayed inline \u2014 you can see exactly which rules passed and which failed before making your decision. Batch actions are available for approving or rejecting multiple items at once."
|
|
3978
|
+
},
|
|
3979
|
+
{
|
|
3980
|
+
type: "paragraph",
|
|
3981
|
+
text: "When you approve a result, a `result.approved` signal is emitted to the delivery pipeline. When you reject a result, a `result.rejected` signal fires instead. This event-driven design lets you build automated workflows that respond to review decisions \u2014 for example, routing approved records to a webhook and rejected records to a notification channel."
|
|
3982
|
+
},
|
|
3983
|
+
{
|
|
3984
|
+
type: "paragraph",
|
|
3985
|
+
text: "For best results, review flagged items first \u2014 these are records where at least one validation check failed, making them the most likely to contain errors. Most teams assign a daily review cadence and use confidence range filters to prioritize low-confidence items that need the most attention."
|
|
3986
|
+
},
|
|
3987
|
+
{
|
|
3988
|
+
type: "callout",
|
|
3989
|
+
text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click."
|
|
3233
3990
|
}
|
|
3234
3991
|
],
|
|
3235
3992
|
related: [
|
|
@@ -3245,6 +4002,10 @@ var sections9 = [
|
|
|
3245
4002
|
{
|
|
3246
4003
|
question: "How do I review items in the Approval Queue?",
|
|
3247
4004
|
answer: 'Filter by status (pending, flagged), schema, or confidence range. Click "Review" on any row to inspect extracted values, provenance trails, and validation check results before approving or rejecting.'
|
|
4005
|
+
},
|
|
4006
|
+
{
|
|
4007
|
+
question: "Can I batch approve or reject items?",
|
|
4008
|
+
answer: "Yes. Select multiple items in the queue and use the batch action buttons to approve or reject them all at once. Each item emits the appropriate delivery signal individually."
|
|
3248
4009
|
}
|
|
3249
4010
|
],
|
|
3250
4011
|
mentions: [
|
|
@@ -3309,6 +4070,18 @@ var sections10 = [
|
|
|
3309
4070
|
{
|
|
3310
4071
|
type: "paragraph",
|
|
3311
4072
|
text: "Every attempt is logged in `delivery_items`. Terminal failures (retry exhausted or permanent 4xx) write a `delivery_dead_letter` row, which is replayable. The outbox, history, DLQ, and catalog are all accessible via the [`/v1/delivery/*` API](/docs)."
|
|
4073
|
+
},
|
|
4074
|
+
{
|
|
4075
|
+
type: "paragraph",
|
|
4076
|
+
text: "The four registries \u2014 signals, deliverables, serializers, and connectors \u2014 are fully orthogonal. Adding a new destination type does not require changes to the signal or serializer code. This composable design means you can mix any supported signal with any compatible serializer and connector without custom integration work."
|
|
4077
|
+
},
|
|
4078
|
+
{
|
|
4079
|
+
type: "paragraph",
|
|
4080
|
+
text: "For best results, start with a webhook destination to verify your binding configuration end-to-end. Once the payload shape and delivery cadence match your expectations, expand to file-based destinations (S3, SFTP) or spreadsheet destinations (Google Sheets). Most teams create separate bindings for different downstream consumers rather than routing all events to a single destination."
|
|
4081
|
+
},
|
|
4082
|
+
{
|
|
4083
|
+
type: "callout",
|
|
4084
|
+
text: "Delivery is at-least-once with deterministic idempotency keys. Receivers should use the `X-Talonic-Idempotency-Key` header (or equivalent metadata for file-based connectors) to deduplicate on their end."
|
|
3312
4085
|
}
|
|
3313
4086
|
],
|
|
3314
4087
|
related: [
|
|
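Because delivery is at-least-once, a receiver may see the same event more than once. A minimal sketch of receiver-side deduplication on the `X-Talonic-Idempotency-Key` header follows. The handler shape, the status codes, and the in-memory `Set` are assumptions for illustration; a real receiver would typically persist seen keys in a durable store. The sketch also assumes header names arrive lowercased, as Node's `http` module provides them.

```javascript
// Sketch: wrap a payload processor so repeated deliveries are ignored.
// The handler signature and return shape are hypothetical.
function makeDeduplicatingHandler(processPayload, seenKeys = new Set()) {
  return function handleDelivery(headers, payload) {
    const key = headers["x-talonic-idempotency-key"];
    if (!key) {
      // Assumption: reject deliveries that carry no idempotency key.
      return { status: 400, processed: false };
    }
    if (seenKeys.has(key)) {
      // Duplicate: acknowledge so the sender stops retrying, but do no work.
      return { status: 200, processed: false };
    }
    processPayload(payload);
    // Record the key only after processing succeeds, so a failed attempt
    // can be retried by the sender.
    seenKeys.add(key);
    return { status: 200, processed: true };
  };
}
```

Acknowledging duplicates with a success status matters here: responding with an error would only trigger further retries of an event that has already been handled.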
@@ -3324,6 +4097,10 @@ var sections10 = [
|
|
|
3324
4097
|
{
|
|
3325
4098
|
question: "What happens when a delivery fails?",
|
|
3326
4099
|
answer: "Failed deliveries retry with a backoff ladder. Terminal failures (retry exhausted or permanent 4xx) are written to the dead-letter queue (DLQ), which is fully replayable."
|
|
4100
|
+
},
|
|
4101
|
+
{
|
|
4102
|
+
question: "What serialization formats are supported?",
|
|
4103
|
+
answer: "Ten formats: json, ndjson, csv, csv_file, xlsx, rows, graph, raw, md, and txt. Each serializer declares which deliverable shapes it supports, and the compatibility triangle validates the combination at binding creation time."
|
|
3327
4104
|
}
|
|
3328
4105
|
],
|
|
3329
4106
|
mentions: [
|
|
@@ -3382,6 +4159,22 @@ var sections10 = [
|
|
|
3382
4159
|
description: "Slice 2+. Structured data as email attachment."
|
|
3383
4160
|
}
|
|
3384
4161
|
]
|
|
4162
|
+
},
|
|
4163
|
+
{
|
|
4164
|
+
type: "paragraph",
|
|
4165
|
+
text: "Each destination stores its connector type, configuration (URL, bucket, folder path), and optional authentication credentials. Webhook destinations support HMAC-SHA256 signing via a **signing secret** \u2014 every payload includes a signature header so your receiver can verify authenticity. File-based destinations (S3, SFTP, Google Drive) support configurable filename templates with token substitution for binding ID, timestamp, and idempotency key."
|
|
4166
|
+
},
|
|
4167
|
+
{
|
|
4168
|
+
type: "paragraph",
|
|
4169
|
+
text: "A single destination can back multiple bindings. For example, one S3 bucket destination can receive both `document.extracted` and `result.approved` events through separate bindings, each with its own serializer and field map. This keeps your destination inventory small while supporting diverse routing requirements."
|
|
4170
|
+
},
|
|
4171
|
+
{
|
|
4172
|
+
type: "paragraph",
|
|
4173
|
+
text: "For best results, always run a live-ping test after creating a destination. The test exercises the full transport envelope \u2014 SSRF validation, payload cap, and authentication \u2014 with a tiny test payload, so you catch configuration errors before real events start flowing. OAuth-based destinations (Google Drive, Google Sheets) require connecting your account first via the OAuth flow in the dashboard."
|
|
4174
|
+
},
|
|
4175
|
+
{
|
|
4176
|
+
type: "callout",
|
|
4177
|
+
text: "Destinations can be disabled without deleting them. Set **is_active** to false and no bindings will route events to the destination until you re-enable it."
|
|
3385
4178
|
}
|
|
3386
4179
|
],
|
|
3387
4180
|
related: [
|
|
@@ -3397,6 +4190,10 @@ var sections10 = [
|
|
|
3397
4190
|
{
|
|
3398
4191
|
question: "How do I test a destination?",
|
|
3399
4192
|
answer: "Every destination supports a live-ping test via POST /v1/delivery/destinations/:id/test that exercises the full transport envelope with a tiny test payload."
|
|
4193
|
+
},
|
|
4194
|
+
{
|
|
4195
|
+
question: "Can one destination serve multiple bindings?",
|
|
4196
|
+
answer: "Yes. A single destination can back any number of bindings, each with its own signal filter, serializer, and field map. This lets you route different event types to the same endpoint with different payload shapes."
|
|
3400
4197
|
}
|
|
3401
4198
|
],
|
|
3402
4199
|
mentions: [
|
|
@@ -3423,6 +4220,22 @@ var sections10 = [
|
|
|
3423
4220
|
{
|
|
3424
4221
|
type: "paragraph",
|
|
3425
4222
|
text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. Optional `delivery_policy` overrides the default retry ladder (6 attempts at `5s, 30s, 2min, 10min, 1h`) and timeout."
|
|
4223
|
+
},
|
|
4224
|
+
{
|
|
4225
|
+
type: "paragraph",
|
|
4226
|
+
text: "The compatibility triangle is enforced on every create and update. The backend checks that your chosen serializer supports the deliverable resolver's output shape, and that the connector accepts the serializer's format. If any predicate fails, the binding is rejected with a descriptive error \u2014 you never end up with a binding that cannot deliver."
|
|
4227
|
+
},
|
|
4228
|
+
{
|
|
4229
|
+
type: "paragraph",
|
|
4230
|
+
text: 'Use `field_map` to tailor the payload for each downstream consumer. **Rename** rules map internal field names to the receiver\'s expected names. **Drop** rules exclude fields the receiver does not need. **Static** rules inject constant values (e.g., a `source: "talonic"` tag) into every payload. These three operations compose in order: drop first, then rename, then static injection.'
|
|
4231
|
+
},
|
|
4232
|
+
{
|
|
4233
|
+
type: "paragraph",
|
|
4234
|
+
text: "For best results, create one binding per downstream consumer per event type. This gives you independent control over payload shape, retry policy, and serialization format for each integration point. Most teams start with a `document.extracted` binding to a webhook and expand to run-level and approval signals as their pipeline matures."
|
|
4235
|
+
},
|
|
4236
|
+
{
|
|
4237
|
+
type: "callout",
|
|
4238
|
+
text: "The binding editor in the dashboard walks you through the compatibility triangle step by step \u2014 only showing serializers and deliverables that are compatible with your chosen signal and destination."
|
|
3426
4239
|
}
|
|
3427
4240
|
],
|
|
3428
4241
|
related: [
|
|
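The `field_map` paragraph added in this hunk states that the three rule types compose in a fixed order: drop first, then rename, then static injection. A minimal sketch of that ordering follows. The rule shape (`drop`, `rename`, `static` keys) and the `applyFieldMap` helper are assumptions made for illustration, not the documented Talonic schema.

```javascript
// Hypothetical helper: applies field_map rules in the order the docs describe
// (drop -> rename -> static). The rule object shape is an assumption.
function applyFieldMap(payload, fieldMap = {}) {
  const { drop = [], rename = {}, static: staticFields = {} } = fieldMap;
  const dropped = new Set(drop);
  const out = {};

  for (const [key, value] of Object.entries(payload)) {
    // 1. Drop: skip fields the receiver does not need.
    if (dropped.has(key)) continue;
    // 2. Rename: map the internal name to the receiver's name, if a rule exists.
    const hasRule = Object.prototype.hasOwnProperty.call(rename, key);
    out[hasRule ? rename[key] : key] = value;
  }

  // 3. Static: inject constant values last, so they win on a name clash.
  return { ...out, ...staticFields };
}

const shaped = applyFieldMap(
  { vendor_name: "Acme", total: 120.5, internal_id: "x1" },
  {
    drop: ["internal_id"],
    rename: { vendor_name: "supplier" },
    static: { source: "talonic" },
  }
);
console.log(shaped);
```

Because drop runs before rename in this sketch, a field listed in both rule sets is removed rather than renamed.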
@@ -3438,6 +4251,10 @@ var sections10 = [
|
|
|
3438
4251
|
{
|
|
3439
4252
|
question: "Can I customize the delivery payload?",
|
|
3440
4253
|
answer: "Yes. Use field_map to rename, drop, or add static fields without custom code. Use delivery_policy to override the default retry ladder and timeout."
|
|
4254
|
+
},
|
|
4255
|
+
{
|
|
4256
|
+
question: "What is the compatibility triangle?",
|
|
4257
|
+
answer: "The compatibility triangle validates that the signal, deliverable resolver, serializer, and connector all form a compatible combination. The backend enforces this on every binding create and update to prevent misconfigured delivery routes."
|
|
3441
4258
|
}
|
|
3442
4259
|
],
|
|
3443
4260
|
mentions: [
|
|
@@ -3520,6 +4337,22 @@ var sections10 = [
|
|
|
3520
4337
|
description: "Fired after a terminal delivery failure."
|
|
3521
4338
|
}
|
|
3522
4339
|
]
|
|
4340
|
+
},
|
|
4341
|
+
{
|
|
4342
|
+
type: "paragraph",
|
|
4343
|
+
text: "Signals are typed events emitted by the platform when meaningful state changes occur. Document-level signals fire on extraction success or failure. Run-level signals fire when a job completes across dataspace, structuring, resolution, or extraction runs. Result-level signals fire when a reviewer approves, rejects, or flags a record."
|
|
4344
|
+
},
|
|
4345
|
+
{
|
|
4346
|
+
type: "paragraph",
|
|
4347
|
+
text: "The two `delivery.item.*` entries are **meta-signals** \u2014 they fire when a delivery itself succeeds or fails. Use them for self-monitoring: bind `delivery.item.failed` to a notification webhook to receive alerts when deliveries break. The poller includes built-in loop prevention so a failed meta-signal delivery does not emit another meta-signal."
|
|
4348
|
+
},
|
|
4349
|
+
{
|
|
4350
|
+
type: "paragraph",
|
|
4351
|
+
text: "For best results, use the catalog API to populate dropdown menus and configuration forms rather than hardcoding signal or deliverable lists. The catalog always reflects the running registry contents, so new signal types and deliverables appear automatically as the platform evolves."
|
|
4352
|
+
},
|
|
4353
|
+
{
|
|
4354
|
+
type: "callout",
|
|
4355
|
+
text: "The catalog API exposes four endpoints: `/v1/delivery/catalog/signals`, `/v1/delivery/catalog/deliverables`, `/v1/delivery/catalog/serializers`, and `/v1/delivery/catalog/connectors`. Each returns the full registry for that category."
|
|
3523
4356
|
}
|
|
3524
4357
|
],
|
|
3525
4358
|
related: [
|
|
@@ -3535,6 +4368,10 @@ var sections10 = [
|
|
|
3535
4368
|
{
|
|
3536
4369
|
question: "How do I discover available signals and deliverables?",
|
|
3537
4370
|
answer: "Use the catalog API at /v1/delivery/catalog/* which exposes the four registries (signals, deliverables, serializers, connectors) that drive the binding picker."
|
|
4371
|
+
},
|
|
4372
|
+
{
|
|
4373
|
+
question: "What are meta-signals?",
|
|
4374
|
+
answer: "Meta-signals (delivery.item.completed and delivery.item.failed) fire when a delivery attempt itself succeeds or fails. Use them for self-monitoring \u2014 for example, binding delivery.item.failed to a notification webhook for delivery failure alerts."
|
|
3538
4375
|
}
|
|
3539
4376
|
],
|
|
3540
4377
|
mentions: [
|
|
@@ -3555,6 +4392,22 @@ var sections10 = [
|
|
|
3555
4392
|
{
|
|
3556
4393
|
type: "paragraph",
|
|
3557
4394
|
text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies. Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt with a fresh idempotency key. Nothing in history is ever mutated; the log is strictly append-only."
|
|
4395
|
+
},
|
|
4396
|
+
{
|
|
4397
|
+
type: "paragraph",
|
|
4398
|
+
text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration."
|
|
4399
|
+
},
|
|
4400
|
+
{
|
|
4401
|
+
type: "paragraph",
|
|
4402
|
+
text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error, fix the destination configuration, and replay the delivery with a single click or API call."
|
|
4403
|
+
},
|
|
4404
|
+
{
|
|
4405
|
+
type: "paragraph",
|
|
4406
|
+
text: "For best results, monitor the DLQ regularly and set up a `delivery.item.failed` meta-signal binding to receive alerts when deliveries fail terminally. Most teams configure a notification webhook for this signal so they are notified immediately rather than discovering failures during a manual review. Request and response bodies older than the configured retention period are automatically cleaned up, but row metadata (status, error code, duration) is retained indefinitely."
|
|
4407
|
+
},
|
|
4408
|
+
{
|
|
4409
|
+
type: "callout",
|
|
4410
|
+
text: "Replay is safe to run multiple times. The idempotency key is deterministic \u2014 receivers that deduplicate on the key will not process the same delivery twice, even after multiple replays."
|
|
3558
4411
|
}
|
|
3559
4412
|
],
|
|
3560
4413
|
related: [
|
|
@@ -3570,6 +4423,10 @@ var sections10 = [
|
|
|
3570
4423
|
{
|
|
3571
4424
|
question: "What is the dead letter queue (DLQ)?",
|
|
3572
4425
|
answer: "Terminal failures (retry ladder exhausted or permanent 4xx) escalate to /v1/delivery/dlq. DLQ entries are fully replayable \u2014 replay enqueues a fresh attempt with a new idempotency key."
|
|
4426
|
+
},
|
|
4427
|
+
{
|
|
4428
|
+
question: "How long are request and response bodies retained?",
|
|
4429
|
+
answer: "Request and response bodies are cleaned up after the configured retention period (default 30 days). Row metadata \u2014 status, HTTP code, error code, and duration \u2014 is retained indefinitely for audit purposes."
|
|
3573
4430
|
}
|
|
3574
4431
|
],
|
|
3575
4432
|
mentions: [
|
|
@@ -3604,6 +4461,10 @@ var sections11 = [
|
|
|
3604
4461
|
type: "paragraph",
|
|
3605
4462
|
text: "Dialects ensure consistency across all your structured output. When your downstream systems expect dates in `YYYY-MM-DD` format, numbers with `.` as the decimal separator, and CSVs delimited by `;`, you configure this once in the shared dialect rather than repeating it in every schema."
|
|
3606
4463
|
},
|
|
4464
|
+
{
|
|
4465
|
+
type: "paragraph",
|
|
4466
|
+
text: "Most teams configure their shared dialect during initial workspace setup and rarely change it afterward. If your organization operates across regions with different formatting conventions, create separate workspaces with region-specific dialects rather than overriding at the schema level. This keeps the configuration clean and avoids inconsistencies in delivered data."
|
|
4467
|
+
},
|
|
3607
4468
|
{
|
|
3608
4469
|
type: "list",
|
|
3609
4470
|
ordered: false,
|
|
@@ -3674,6 +4535,10 @@ var sections11 = [
|
|
|
3674
4535
|
type: "paragraph",
|
|
3675
4536
|
text: "The lookup convention follows a `key` / `value` structure where the `key` is the output code and the `value` is the human-readable label. During extraction, the platform maps FROM labels found in documents TO the canonical codes defined in the reference primitive. This ensures consistent, machine-readable output regardless of how values appear in source documents."
|
|
3676
4537
|
},
|
|
4538
|
+
{
|
|
4539
|
+
type: "paragraph",
|
|
4540
|
+
text: "For best results, keep reference primitives focused on a single domain \u2014 for example, one primitive for country codes, another for currency codes, and another for product categories. This makes each primitive reusable across multiple schemas and simplifies maintenance. When updating a primitive, test the new version against a few sample documents before updating the version reference in production schemas."
|
|
4541
|
+
},
|
|
3677
4542
|
{
|
|
3678
4543
|
type: "callout",
|
|
3679
4544
|
variant: "info",
|
|
@@ -3741,6 +4606,10 @@ var sections11 = [
|
|
|
3741
4606
|
type: "paragraph",
|
|
3742
4607
|
text: "Change review is particularly important for workspaces that feed downstream systems through delivery bindings. A small change to a schema field mapping or a reference primitive value can ripple through to every document processed after that point. The review process creates a checkpoint where a second pair of eyes can verify the change before it goes live."
|
|
3743
4608
|
},
|
|
4609
|
+
{
|
|
4610
|
+
type: "paragraph",
|
|
4611
|
+
text: "Most teams enable change review as soon as their workspace transitions from development to production. During the initial setup phase, you can leave it disabled for faster iteration. Once your schemas, dialects, and reference primitives are stable and data is flowing to downstream systems, enable change review to protect against accidental modifications that could disrupt live pipelines."
|
|
4612
|
+
},
|
|
3744
4613
|
{
|
|
3745
4614
|
type: "list",
|
|
3746
4615
|
ordered: false,
|
|
@@ -3807,6 +4676,14 @@ var sections12 = [
|
|
|
3807
4676
|
type: "paragraph",
|
|
3808
4677
|
text: "Omnisearch is designed to be the single entry point for finding anything in the platform. Rather than navigating to specific pages to search within them, Omnisearch queries a **materialized values index** that aggregates data across all your content. Results are grouped by category so you can quickly distinguish between a document match and a field name match."
|
|
3809
4678
|
},
|
|
4679
|
+
{
|
|
4680
|
+
type: "paragraph",
|
|
4681
|
+
text: "The materialized values index is rebuilt automatically whenever documents are processed or schemas change, so search results are always current. There is no manual reindex step \u2014 new documents become searchable as soon as extraction completes. This makes Omnisearch reliable even during high-volume ingestion periods."
|
|
4682
|
+
},
|
|
4683
|
+
{
|
|
4684
|
+
type: "paragraph",
|
|
4685
|
+
text: "For best results, use Omnisearch as your primary navigation tool. Instead of browsing through document lists or clicking through the sidebar, press `Cmd+K` and type what you are looking for \u2014 whether it is a specific invoice number, a field name, or a schema title. Most users find that Omnisearch is faster than manual navigation for any task beyond browsing the most recent documents."
|
|
4686
|
+
},
|
|
3810
4687
|
{
|
|
3811
4688
|
type: "callout",
|
|
3812
4689
|
variant: "info",
|
|
@@ -3883,6 +4760,10 @@ var sections12 = [
|
|
|
3883
4760
|
{
|
|
3884
4761
|
type: "paragraph",
|
|
3885
4762
|
text: "Filter state is encoded in the URL query string using dynamic SQL generation on the backend. This means you can bookmark filtered views, share them with teammates via a link, or save them as **presets** for one-click access to commonly used queries."
|
|
4763
|
+
},
|
|
4764
|
+
{
|
|
4765
|
+
type: "paragraph",
|
|
4766
|
+
text: 'For best results, save your most common filter combinations as presets. Most teams create presets for categories like "high-value invoices this quarter," "documents missing key fields," or "recently failed extractions." Presets appear as one-click buttons on the Documents page, eliminating the need to rebuild complex filter conditions from scratch each time.'
|
|
3886
4767
|
}
|
|
3887
4768
|
],
|
|
3888
4769
|
related: [
|
|
@@ -3937,6 +4818,19 @@ var sections13 = [
|
|
|
3937
4818
|
type: "paragraph",
|
|
3938
4819
|
text: "Manage API keys from **Settings → API Keys**. Keys are prefixed with `tlnc_` and passed via `Authorization: Bearer`. Keys are SHA-256 hashed \u2014 the full key is only shown once at creation."
|
|
3939
4820
|
},
|
|
4821
|
+
{
|
|
4822
|
+
type: "paragraph",
|
|
4823
|
+
text: "Each API key is assigned one or more scopes that control what operations it can perform. Scopes follow the principle of least privilege \u2014 create a key with only the scopes your integration needs. For example, a read-only dashboard integration only needs the `read` scope, while an automated ingestion pipeline needs `extract` and `read`."
|
|
4824
|
+
},
|
|
4825
|
+
{
|
|
4826
|
+
type: "paragraph",
|
|
4827
|
+
text: "For best results, create separate API keys for each integration or service that connects to your Talonic workspace. This makes it easy to rotate or revoke a single key without disrupting other integrations. Most teams maintain one key for their ingestion pipeline, one for their BI dashboard, and one for webhook-based automations."
|
|
4828
|
+
},
|
|
4829
|
+
{
|
|
4830
|
+
type: "callout",
|
|
4831
|
+
variant: "warning",
|
|
4832
|
+
text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated."
|
|
4833
|
+
},
|
|
3940
4834
|
{
|
|
3941
4835
|
type: "param-table",
|
|
3942
4836
|
title: "API key scopes",
|
|
@@ -3972,6 +4866,10 @@ var sections13 = [
|
|
|
3972
4866
|
{
|
|
3973
4867
|
question: "What scopes are available for API keys?",
|
|
3974
4868
|
answer: "Three scopes: extract (use extraction API), read (read documents, extractions, schemas, jobs), and write (create and modify resources)."
|
|
4869
|
+
},
|
|
4870
|
+
{
|
|
4871
|
+
question: "Can I have multiple API keys?",
|
|
4872
|
+
answer: "Yes. You can create as many API keys as needed. Best practice is to create separate keys for each integration so you can rotate or revoke them independently without disrupting other services."
|
|
3975
4873
|
}
|
|
3976
4874
|
],
|
|
3977
4875
|
mentions: ["API keys", "tlnc_", "SHA-256", "Bearer token", "scopes"]
|
|
@@ -3983,6 +4881,27 @@ var sections13 = [
|
|
|
3983
4881
|
seoTitle: "Public REST API Overview \u2014 Talonic Docs",
|
|
3984
4882
|
description: "Full REST API with 20+ namespaces: extract, documents, extractions, schemas, jobs, sources, delivery, linking, matching, batches, cases, quality, and more. Cursor pagination.",
|
|
3985
4883
|
content: [
|
|
4884
|
+
{
|
|
4885
|
+
type: "paragraph",
|
|
4886
|
+
text: "Talonic exposes a comprehensive REST API with 20+ namespaces covering every aspect of the platform \u2014 from document extraction and schema management to delivery, matching, and quality benchmarking. All endpoints use JSON request and response bodies with cursor-based pagination for list operations."
|
|
4887
|
+
},
|
|
4888
|
+
{
|
|
4889
|
+
type: "paragraph",
|
|
4890
|
+
text: "The API follows standard REST conventions. Authenticate with a `tlnc_` API key via the `Authorization: Bearer` header. Most resources support full CRUD operations, and long-running tasks like matching runs and batch inference are handled asynchronously with polling endpoints for status and progress."
|
|
4891
|
+
},
|
|
4892
|
+
{
|
|
4893
|
+
type: "paragraph",
|
|
4894
|
+
text: "Use the public API to build automated ingestion pipelines, integrate extraction results into downstream systems, or orchestrate complex workflows that combine multiple platform features. The API mirrors every action available in the web interface, so anything you can do manually can be fully automated."
|
|
4895
|
+
},
|
|
4896
|
+
{
|
|
4897
|
+
type: "paragraph",
|
|
4898
|
+
text: "For best results, start with the `/v1/extract` endpoint for document ingestion, then use `/v1/documents` and `/v1/extractions` to retrieve results. As your integration matures, explore delivery bindings, matching configurations, and batch processing to build a fully automated data pipeline."
|
|
4899
|
+
},
|
|
4900
|
+
{
|
|
4901
|
+
type: "callout",
|
|
4902
|
+
variant: "info",
|
|
4903
|
+
text: "See the full [API Documentation](/docs) for detailed endpoint specifications, request/response examples, and authentication guides. The API reference is organized by namespace and includes every parameter, status code, and error response."
|
|
4904
|
+
},
|
|
3986
4905
|
{
|
|
3987
4906
|
type: "param-table",
|
|
3988
4907
|
title: "API namespaces",
|
|
@@ -4103,6 +5022,10 @@ var sections13 = [
|
|
|
4103
5022
|
{
|
|
4104
5023
|
question: "Where can I find detailed API documentation?",
|
|
4105
5024
|
answer: "See the full API Documentation at /docs for complete endpoint documentation with request/response examples, parameter descriptions, and authentication details."
|
|
5025
|
+
},
|
|
5026
|
+
{
|
|
5027
|
+
question: "How does pagination work in the API?",
|
|
5028
|
+
answer: "List endpoints use cursor-based pagination. Each response includes a cursor token that you pass as a query parameter to fetch the next page. This approach is more reliable than offset-based pagination when documents are being added or removed concurrently."
|
|
4106
5029
|
}
|
|
4107
5030
|
],
|
|
4108
5031
|
mentions: [
|
|
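The pagination FAQ added in this hunk says each list response carries a cursor token that is passed back to fetch the next page. A rough sketch of a client loop is shown here. The response field names `data` and `next_cursor`, and the `fetchPage` callback, are assumed for illustration; the diff does not name the actual fields or query parameter.

```javascript
// Walks a cursor-paginated list until no further cursor is returned.
// `fetchPage(cursor)` is supplied by the caller; `data` and `next_cursor`
// are assumed response field names, not confirmed by the source.
async function* listAll(fetchPage) {
  let cursor;
  do {
    const page = await fetchPage(cursor);
    yield* page.data;
    cursor = page.next_cursor;
  } while (cursor);
}

// In-memory stand-in for the HTTP call, so the sketch runs offline.
const pages = {
  start: { data: ["doc_1", "doc_2"], next_cursor: "c2" },
  c2: { data: ["doc_3"], next_cursor: null },
};
const fakeFetch = async (cursor) => pages[cursor ?? "start"];

async function collect() {
  const ids = [];
  for await (const id of listAll(fakeFetch)) ids.push(id);
  return ids;
}
```

In a real integration, `fetchPage` would issue the authenticated request and append the cursor to the query string.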
@@ -4128,6 +5051,14 @@ var sections13 = [
|
|
|
4128
5051
|
type: "paragraph",
|
|
4129
5052
|
text: "The webhook connector is configured as a **delivery destination**. Bind any of the signal types below to a webhook destination to receive real-time notifications. See `/v1/delivery/catalog/signals` for the exhaustive list."
|
|
4130
5053
|
},
|
|
5054
|
+
{
|
|
5055
|
+
type: "paragraph",
|
|
5056
|
+
text: "When a webhook fires, the platform constructs the payload from the signal data, signs it with your destination's HMAC-SHA256 signing secret, and delivers it via HTTPS POST. Each delivery includes an idempotency key in the headers so your receiver can safely deduplicate retries. Failed deliveries follow an exponential backoff schedule, and terminal failures are routed to the dead-letter queue for manual replay."
|
|
5057
|
+
},
|
|
5058
|
+
{
|
|
5059
|
+
type: "paragraph",
|
|
5060
|
+
text: "Use webhooks when your downstream system needs to react immediately to platform events \u2014 for example, triggering an ERP import when a document is extracted, or notifying a Slack channel when a reviewer rejects a record. For bulk or periodic data transfers, consider using the SFTP, S3, or cloud storage delivery connectors instead."
|
|
5061
|
+
},
|
|
4131
5062
|
{
|
|
4132
5063
|
type: "param-table",
|
|
4133
5064
|
title: "Delivery signal types (webhook-compatible)",
|
|
@@ -4203,6 +5134,10 @@ var sections13 = [
|
|
|
4203
5134
|
{
|
|
4204
5135
|
question: "What happens when a webhook delivery fails?",
|
|
4205
5136
|
answer: "Failed webhook deliveries retry with exponential backoff. Terminal failures (retry exhausted or permanent 4xx) escalate to the dead-letter queue for manual replay."
|
|
5137
|
+
},
|
|
5138
|
+
{
|
|
5139
|
+
question: "How do I verify webhook signatures?",
|
|
5140
|
+
answer: "Each webhook payload is signed with HMAC-SHA256 using the signing secret from your delivery destination configuration. Compute the HMAC of the raw request body and compare it to the signature header to verify authenticity. This ensures the payload was sent by Talonic and was not tampered with in transit."
|
|
4206
5141
|
}
|
|
4207
5142
|
],
|
|
4208
5143
|
mentions: [
|
|
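The signature FAQ added in this hunk describes computing an HMAC-SHA256 over the raw request body and comparing it to the signature header. One way a receiver might do that in Node.js is sketched below. The hex encoding of the signature and the example secret are assumptions; the diff does not state the header name or the encoding Talonic uses.

```javascript
const nodeCrypto = require("node:crypto");

// Verifies an HMAC-SHA256 signature computed over the raw request body.
// Assumption: the signature is hex-encoded.
function verifySignature(rawBody, signatureHex, secret) {
  const expected = nodeCrypto
    .createHmac("sha256", secret)
    .update(rawBody)
    .digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws when lengths differ, so compare lengths first.
  return (
    received.length === expected.length &&
    nodeCrypto.timingSafeEqual(received, expected)
  );
}

// Example values only; a real secret comes from the destination configuration.
const secret = "example-signing-secret";
const body = '{"signal":"document.extracted"}';
const goodSig = nodeCrypto
  .createHmac("sha256", secret)
  .update(body)
  .digest("hex");

const accepted = verifySignature(body, goodSig, secret);
const rejected = verifySignature(body + " ", goodSig, secret);
console.log(accepted, rejected);
```

The check has to run on the raw bytes as received, since re-serializing parsed JSON can change whitespace or key order and invalidate the signature.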
@@ -4262,6 +5197,10 @@ var sections14 = [
|
|
|
4262
5197
|
type: "paragraph",
|
|
4263
5198
|
text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. Manage from the Team page."
|
|
4264
5199
|
},
|
|
5200
|
+
{
|
|
5201
|
+
type: "paragraph",
|
|
5202
|
+
text: "When a team member is removed, their access is revoked immediately but their past actions \u2014 edits, uploads, approvals, and review decisions \u2014 remain in the audit trail. This preserves data integrity and compliance history. Removed users can be re-added later through the same domain matching process if needed."
|
|
5203
|
+
},
|
|
4265
5204
|
{
|
|
4266
5205
|
type: "callout",
|
|
4267
5206
|
variant: "info",
|
|
@@ -4329,6 +5268,14 @@ var sections14 = [
|
|
|
4329
5268
|
type: "paragraph",
|
|
4330
5269
|
text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events."
|
|
4331
5270
|
},
|
|
5271
|
+
{
|
|
5272
|
+
type: "paragraph",
|
|
5273
|
+
text: "Behind the scenes, every LLM and OCR call is logged with full detail \u2014 the model used, input and output token counts, latency, and computed cost. This data powers both the per-feature breakdown and the individual call log. The system tracks costs across extraction, OCR, batch inference, matching AI resolution, and quality passes so you always know where your spend is going."
|
|
5274
|
+
},
|
|
5275
|
+
{
|
|
5276
|
+
type: "paragraph",
|
|
5277
|
+
text: "Most teams review the daily cost chart weekly to establish a usage baseline. Unexpected spikes usually correlate with large document uploads or batch completions. For organizations managing multiple workspaces, the **Master view** provides a single pane of glass showing per-customer breakdowns and platform-wide aggregates \u2014 accessible only to platform administrators."
|
|
5278
|
+
},
|
|
4332
5279
|
{
|
|
4333
5280
|
type: "param-table",
|
|
4334
5281
|
title: "Usage views",
|
|
@@ -4404,6 +5351,14 @@ var sections14 = [
|
|
|
4404
5351
|
type: "paragraph",
|
|
4405
5352
|
text: "The Admin Panel is the central hub for platform-wide operations. **Customer management** lets you create, view, and delete organizations. **User management** provides a cross-tenant view of all platform users with the ability to remove accounts. The **data clear & rebuild** function wipes all data for a specific customer and reprocesses from scratch \u2014 useful during onboarding or after significant schema changes."
|
|
4406
5353
|
},
|
|
5354
|
+
{
|
|
5355
|
+
type: "paragraph",
|
|
5356
|
+
text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs."
|
|
5357
|
+
},
|
|
5358
|
+
{
|
|
5359
|
+
type: "paragraph",
|
|
5360
|
+
text: "For best results, limit Admin Panel access to a small group of trusted platform operators. Use the **master registry** view to audit field definitions and schemas across tenants \u2014 this is particularly useful when standardizing extraction configurations or troubleshooting cross-tenant data quality issues."
|
|
5361
|
+
},
|
|
4407
5362
|
{
|
|
4408
5363
|
type: "list",
|
|
4409
5364
|
ordered: false,
|
|
@@ -4463,6 +5418,18 @@ var sections14 = [
|
|
|
4463
5418
|
type: "paragraph",
|
|
4464
5419
|
text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows."
|
|
4465
5420
|
},
|
|
5421
|
+
{
|
|
5422
|
+
type: "paragraph",
|
|
5423
|
+
text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused."
|
|
5424
|
+
},
|
|
5425
|
+
{
|
|
5426
|
+
type: "paragraph",
|
|
5427
|
+
text: "The most frequently used shortcut is **Omnisearch** (`Cmd+K` / `Ctrl+K`), which opens a global search overlay that queries documents, extracted values, field names, schemas, and sources simultaneously. Power users rely on it to navigate the platform faster than clicking through the sidebar."
|
|
5428
|
+
},
|
|
5429
|
+
{
|
|
5430
|
+
type: "paragraph",
|
|
5431
|
+
text: "For best results, build muscle memory around the three core shortcuts. Use `Cmd+K` to find anything, `Cmd+J` to upload a document on the fly, and `Escape` to dismiss any overlay or modal. These three actions cover the most common interruptions during a review or configuration session."
|
|
5432
|
+
},
|
|
4466
5433
|
{
|
|
4467
5434
|
type: "param-table",
|
|
4468
5435
|
title: "Shortcuts",
|
|
@@ -4533,6 +5500,14 @@ var sections15 = [
|
|
|
4533
5500
|
type: "paragraph",
|
|
4534
5501
|
text: "Under the hood, batch inference leverages the provider's native batch API (Anthropic Message Batches or AWS Bedrock invocation jobs). Documents accumulate in a queue and are submitted together, allowing the provider to schedule processing during off-peak capacity. This is why the cost reduction is possible without any loss in extraction quality."
|
|
4535
5502
|
},
|
|
5503
|
+
{
|
|
5504
|
+
type: "paragraph",
|
|
5505
|
+
text: "Batch mode is best suited for backlog ingestion, periodic bulk uploads, and any scenario where results are not needed in real time. Most teams use batch mode for overnight processing of large document volumes and reserve real-time processing for time-sensitive documents that need immediate attention."
|
|
5506
|
+
},
|
|
5507
|
+
{
|
|
5508
|
+
type: "paragraph",
|
|
5509
|
+
text: "When batch results arrive, they pass through the same post-processing pipeline as real-time extractions \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation. The only difference is that LLM-based quality passes (field estimation, verification, cross-reference enrichment) are skipped in batch mode to preserve the cost savings."
|
|
5510
|
+
},
|
|
4536
5511
|
{
|
|
4537
5512
|
type: "list",
|
|
4538
5513
|
ordered: false,
|
|
@@ -4609,6 +5584,10 @@ var sections15 = [
|
|
|
4609
5584
|
{
|
|
4610
5585
|
type: "paragraph",
|
|
4611
5586
|
text: "While waiting for batch results, documents show a status of `batch_queued`. Once the provider returns results, the platform applies them through the same post-processing pipeline as real-time extraction \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation."
|
|
5587
|
+
},
|
|
5588
|
+
{
|
|
5589
|
+
type: "paragraph",
|
|
5590
|
+
text: "You can also enable batch mode on a per-source basis. When a source connection has the batch processing toggle enabled, all documents ingested through that source are automatically routed to the batch queue. This is ideal for source connections that handle non-urgent, high-volume ingestion \u2014 such as a shared drive that collects documents overnight."
|
|
4612
5591
|
}
|
|
4613
5592
|
],
|
|
4614
5593
|
related: [
|
|
@@ -4658,6 +5637,14 @@ var sections15 = [
|
|
|
4658
5637
|
type: "paragraph",
|
|
4659
5638
|
text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents and the batch transitions to **completed** status."
|
|
4660
5639
|
},
|
|
5640
|
+
{
|
|
5641
|
+
type: "paragraph",
|
|
5642
|
+
text: "The batch detail view shows individual items within a batch, including which documents are included, their current processing state, and any errors that occurred. Use this view to verify that a specific document was included in the expected batch and to troubleshoot items that failed to parse."
|
|
5643
|
+
},
|
|
5644
|
+
{
|
|
5645
|
+
type: "paragraph",
|
|
5646
|
+
text: "The platform includes built-in crash recovery for batch processing. If the application restarts while a batch is in a transient `processing` state, the recovery logic automatically reverts it to `submitted` so the next polling cycle can retry. This means batch jobs are resilient to infrastructure disruptions without requiring manual intervention."
|
|
5647
|
+
},
|
|
4661
5648
|
{
|
|
4662
5649
|
type: "param-table",
|
|
4663
5650
|
title: "Batch statuses",
|
|
@@ -4737,6 +5724,14 @@ var sections16 = [
|
|
|
4737
5724
|
type: "paragraph",
|
|
4738
5725
|
text: 'Reference data is the foundation of the matching system. It represents your "ground truth" \u2014 the known records you want to match extracted document data against. Common examples include customer lists, product catalogs, vendor registries, and contract databases.'
|
|
4739
5726
|
},
|
|
5727
|
+
{
|
|
5728
|
+
type: "paragraph",
|
|
5729
|
+
text: "When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. Each dataset is versioned independently, so you can update your reference data without affecting in-progress matching configurations. A single dataset can be shared across multiple schemas and matching configurations."
|
|
5730
|
+
},
|
|
5731
|
+
{
|
|
5732
|
+
type: "paragraph",
|
|
5733
|
+
text: "For best results, ensure your reference data is clean and deduplicated before uploading. Include all columns that you plan to match against \u2014 such as names, identifiers, dates, and amounts. Most teams refresh their reference data periodically by re-uploading from their source system or by using the SQL import option to pull directly from a connected database."
|
|
5734
|
+
},
|
|
4740
5735
|
{
|
|
4741
5736
|
type: "callout",
|
|
4742
5737
|
variant: "info",
|
|
@@ -4830,6 +5825,10 @@ var sections16 = [
|
|
|
4830
5825
|
type: "paragraph",
|
|
4831
5826
|
text: "Each field comparison carries a **weight** that determines how much it contributes to the overall confidence score. Set high weights on fields that are strong identifiers (like reference numbers or unique IDs) and lower weights on fields that are common or prone to variation (like names or descriptions). The weighted aggregate produces a final score between 0% and 100%."
|
|
4832
5827
|
},
|
|
5828
|
+
{
|
|
5829
|
+
type: "paragraph",
|
|
5830
|
+
text: "Most teams start with AI strategy generation and then fine-tune weights based on initial results. A common pattern is to set a high weight on a unique identifier field (like a PO number) with `exact` strategy, combined with lower-weighted `fuzzy` matches on name and description fields as supporting evidence. Review the first batch of results to calibrate thresholds before running at scale."
|
|
5831
|
+
},
|
|
4833
5832
|
{
|
|
4834
5833
|
type: "callout",
|
|
4835
5834
|
variant: "info",
|
|
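The paragraphs in this hunk describe per-field weights feeding a weighted aggregate between 0% and 100%. A plain weighted average is one way to picture this, sketched below. The `{ score, weight }` input shape and the `aggregateConfidence` helper are assumed for illustration; the diff does not give the platform's actual formula.

```javascript
// Weighted average of per-field scores (each in [0, 1]), as a percentage.
// This is an illustrative formula, not the platform's confirmed aggregation.
function aggregateConfidence(comparisons) {
  const totalWeight = comparisons.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = comparisons.reduce((sum, c) => sum + c.score * c.weight, 0);
  return (weighted / totalWeight) * 100;
}

// A strong identifier with a high weight, plus lower-weighted fuzzy fields.
const confidence = aggregateConfidence([
  { field: "po_number", strategy: "exact", score: 1.0, weight: 5 },
  { field: "vendor_name", strategy: "fuzzy", score: 0.8, weight: 2 },
  { field: "description", strategy: "fuzzy", score: 0.5, weight: 1 },
]);
console.log(confidence.toFixed(2));
```

With these example weights the exact PO-number match dominates the result, which is the pattern the added paragraph recommends.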
@@ -4884,6 +5883,14 @@ var sections16 = [
|
|
|
4884
5883
|
type: "paragraph",
|
|
4885
5884
|
text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based search with a Haiku LLM resolver attempts to improve low-confidence results."
|
|
4886
5885
|
},
|
|
5886
|
+
{
|
|
5887
|
+
type: "paragraph",
|
|
5888
|
+
text: "Matching runs are processed asynchronously via a dedicated job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining."
|
|
5889
|
+
},
|
|
5890
|
+
{
|
|
5891
|
+
type: "paragraph",
|
|
5892
|
+
text: "For best results, start with a manual run to establish a baseline, then use a smart run if many documents have low-confidence matches. Smart runs take longer because the AI resolver evaluates each ambiguous candidate, but they can significantly improve match quality for data with inconsistent formatting, abbreviations, or multilingual content."
|
|
5893
|
+
},
|
|
4887
5894
|
{
|
|
4888
5895
|
type: "list",
|
|
4889
5896
|
ordered: true,
|
|
@@ -4941,6 +5948,14 @@ var sections16 = [
|
|
|
4941
5948
|
type: "paragraph",
|
|
4942
5949
|
text: "The evidence view is designed to make match decisions transparent. For each candidate, you can see exactly which fields matched, what strategy was used, the individual field score, and the actual values that were compared. This makes it straightforward to verify correct matches and investigate false positives."
|
|
4943
5950
|
},
|
|
5951
|
+
{
|
|
5952
|
+
type: "paragraph",
|
|
5953
|
+
text: "Approved matches flow downstream into delivery pipelines, where they can be included in structured exports alongside extraction data. Rejected matches are excluded from future consideration for that document, which helps the system learn from your decisions when running subsequent matching passes."
|
|
5954
|
+
},
|
|
5955
|
+
{
|
|
5956
|
+
type: "paragraph",
|
|
5957
|
+
text: "When reviewing results, focus on documents where the top candidate has a confidence score between 50% and 85% \u2014 these are the borderline cases that benefit most from human judgment. High-confidence matches (above 85%) are usually correct, while very low scores (below 30%) typically indicate no valid match exists in the reference data."
|
|
5958
|
+
},
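The review guidance above can be sketched as a small triage function for sorting results into a review queue. The `triageBand` helper and its band names are hypothetical, and the text does not characterize scores between 30% and 50%, so that range is returned as `unspecified` rather than guessed.

```javascript
// Hypothetical triage helper based on the thresholds described above.
// `score` is the top candidate's confidence as a percentage (0..100).
function triageBand(score) {
  if (score > 85) return "likely_correct";    // above 85%: usually correct
  if (score >= 50) return "needs_review";     // 50% to 85%: borderline cases
  if (score < 30) return "likely_no_match";   // below 30%: probably no valid match
  return "unspecified";                       // 30% to 50%: not covered by the guidance
}

console.log([92, 70, 40, 12].map(triageBand));
```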
|
|
4944
5959
|
{
|
|
4945
5960
|
type: "param-table",
|
|
4946
5961
|
title: "Result fields",
|
|
@@ -5013,7 +6028,11 @@ var sections17 = [
|
|
|
5013
6028
|
description: "Overview of the Talonic API for extracting structured, schema-validated data from any document with a single API call using HTTPS and JSON.",
|
|
5014
6029
|
content: [
|
|
5015
6030
|
{ type: "paragraph", text: "Extract any document into schema-validated data with a single API call." },
|
|
5016
|
-
{ type: "paragraph", text: "**Base URL:** `https://api.talonic.com` | **Protocol:** HTTPS + JSON | **Auth:** `Bearer tlnc_...`" }
|
|
6031
|
+
{ type: "paragraph", text: "**Base URL:** `https://api.talonic.com` | **Protocol:** HTTPS + JSON | **Auth:** `Bearer tlnc_...`" },
|
|
6032
|
+
{ type: "paragraph", text: "Most integrations start with `POST /v1/extract` to submit a document and receive structured fields back. A typical workflow is: create an API key, upload a file with an optional schema, and consume the JSON response with per-field confidence scores and cost headers." },
|
|
6033
|
+
{ type: "paragraph", text: "The API supports three extraction modes: **auto-detect** (no schema, discovers all fields), **schema-driven** (returns exactly the fields you define), and **query** (filter previously extracted data without re-processing). Every response includes a `request_id` for tracing and support." },
|
|
6034
|
+
{ type: "paragraph", text: "Pair the extract endpoint with `GET /v1/documents` and `GET /v1/extractions` to manage your document library and retrieve results later. Webhook callbacks via `extraction.complete` events eliminate the need for polling on async extractions." },
|
|
6035
|
+
{ type: "callout", text: "All API keys use the `tlnc_` prefix. Create and rotate keys from **Settings \u2192 API Keys** in the dashboard. Keys carry scopes (`extract`, `read`, `write`, `billing`) that control endpoint access." }
|
|
5017
6036
|
],
|
|
5018
6037
|
related: [
|
|
5019
6038
|
{ label: "Authentication", slug: "authentication" },
|
|
@@ -5069,7 +6088,11 @@ var sections17 = [
|
|
|
5069
6088
|
description: "The base URL for all Talonic API endpoints. All requests must use HTTPS and are relative to the v1 base path.",
|
|
5070
6089
|
content: [
|
|
5071
6090
|
{ type: "paragraph", text: "All endpoints are relative to the base URL below. All requests must use HTTPS." },
|
|
5072
|
-
{ type: "code", language: "bash", code: "https://api.talonic.com/v1" }
|
|
6091
|
+
{ type: "code", language: "bash", code: "https://api.talonic.com/v1" },
|
|
6092
|
+
{ type: "paragraph", text: "Most integrations set this as a constant in their HTTP client configuration. A typical request URL looks like `https://api.talonic.com/v1/extract` or `https://api.talonic.com/v1/documents`. All paths in this reference are relative to the `/v1` prefix." },
|
|
6093
|
+
{ type: "paragraph", text: "The API uses standard JSON request and response bodies with `Content-Type: application/json`, except for file uploads which use `multipart/form-data`. Responses include standard HTTP status codes and rate limit headers on every call." },
|
|
6094
|
+
{ type: "paragraph", text: "There is no versioning in the URL beyond `/v1`. Breaking changes will be communicated in advance and introduced under a new version prefix. Non-breaking additions (new fields, new endpoints) are shipped continuously." },
|
|
6095
|
+
{ type: "callout", text: "Plain HTTP requests are rejected. Always use `https://` in your base URL configuration to ensure encrypted transport." }
|
|
5073
6096
|
],
|
|
5074
6097
|
related: [
|
|
5075
6098
|
{ label: "Authentication", slug: "authentication" }
|
|
@@ -5189,6 +6212,9 @@ X-Talonic-Cells-Resolved-AI: 5` },
|
|
|
5189
6212
|
description: "All list endpoints use cursor-based pagination with cursor, limit, and order parameters. Responses include next_cursor and has_more for iteration.",
|
|
5190
6213
|
content: [
|
|
5191
6214
|
{ type: "paragraph", text: "All list endpoints use cursor-based pagination. Pass a `cursor` token from the previous response to fetch the next page." },
|
|
6215
|
+
{ type: "paragraph", text: "Most integrations call list endpoints after bulk ingestion to iterate through results. A typical workflow is to fetch the first page with a `limit`, then loop using `pagination.next_cursor` until `has_more` is `false`." },
|
|
6216
|
+
{ type: "paragraph", text: "The response always includes a `pagination` object with `total`, `limit`, `has_more`, and `next_cursor`. The `total` field reflects the full count of matching items, not just the current page. Use `order` to control sort direction by `created_at`." },
|
|
6217
|
+
{ type: "paragraph", text: "Pair pagination with query filters (e.g. `status`, `after`, `before`, `search`) on endpoints like `GET /v1/documents` and `GET /v1/extractions` to narrow results before paginating. Note that cursors are opaque and short-lived \u2014 do not persist or parse them." },
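The pagination loop described above can be sketched as follows. The `pagination.next_cursor` and `has_more` fields come from the text; the `fetchAllPages` helper, the caller-supplied `fetchPage` function, and the assumption that each page carries its items in a `data` array are illustrative only.

```javascript
// Collects every item from a cursor-paginated list endpoint.
// `fetchPage` is a hypothetical caller-supplied function that performs the
// HTTP request and resolves to the parsed JSON body for one page.
async function fetchAllPages(fetchPage, limit = 50) {
  const items = [];
  let cursor = undefined; // the first request carries no cursor
  for (;;) {
    const page = await fetchPage({ limit, cursor });
    items.push(...page.data); // `data` as the item array is an assumption
    if (!page.pagination.has_more) break;
    cursor = page.pagination.next_cursor; // opaque token: pass through unchanged
  }
  return items;
}
```

Keeping the HTTP call behind `fetchPage` means the cursor stays in memory for the duration of one loop, which fits the advice not to persist or parse it.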
|
|
5192
6218
|
{
|
|
5193
6219
|
type: "param-table",
|
|
5194
6220
|
title: "Request parameters",
|
|
@@ -5276,6 +6302,9 @@ print(f"Fetched {len(all_documents)} documents")`
|
|
|
5276
6302
|
description: "Use the Idempotency-Key header to safely retry POST requests without creating duplicate extractions. Keys are valid for 24 hours.",
|
|
5277
6303
|
content: [
|
|
5278
6304
|
{ type: "paragraph", text: "Pass an `Idempotency-Key` header on POST requests to safely retry without creating duplicate work. If a request with the same key has already been processed, the API returns the cached response." },
|
|
6305
|
+
{ type: "paragraph", text: "Most integrations use idempotency keys when calling `POST /v1/extract` to guard against network timeouts or duplicate submissions. A typical workflow is to generate a UUID per logical operation, attach it as the `Idempotency-Key` header, and retry the same request on failure without risk of double-processing." },
|
|
6306
|
+
{ type: "paragraph", text: "The cached response is stored for **24 hours** and is scoped to your API key. A duplicate request within that window returns the original response body and HTTP status immediately, with no additional credit cost. After 24 hours the key expires and can be reused for a new request." },
|
|
6307
|
+
{ type: "paragraph", text: "Pair idempotency with webhook callbacks (`webhook_url` option) for robust async workflows. Note that reusing a key with different request parameters will still return the first request's cached result \u2014 always generate a fresh key for each distinct operation." },
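A minimal sketch of the retry pattern described above: one key is generated per logical operation and reused on every attempt. The `sendWithIdempotency` helper and the caller-supplied `send` function are hypothetical; only the `Idempotency-Key` header name comes from the text.

```javascript
const { randomUUID } = require("node:crypto");

// Retries `send` using one idempotency key for the whole logical operation.
// `send` is a hypothetical caller-supplied function that performs the POST
// with the given headers, resolving on success and rejecting on failure.
async function sendWithIdempotency(send, maxAttempts = 3) {
  const headers = { "Idempotency-Key": randomUUID() }; // generated once
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send(headers); // same key on every attempt
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

A production version would normally wait between attempts and retry only on timeouts or transient errors; that is left out here to keep the key handling visible.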
|
|
5279
6308
|
{
|
|
5280
6309
|
type: "param-table",
|
|
5281
6310
|
title: "Idempotency details",
|
|
@@ -5759,6 +6788,7 @@ X-Talonic-Cells-Resolved-AI: 5`
|
|
|
5759
6788
|
seoTitle: "Extract Options \u2014 Talonic Docs",
|
|
5760
6789
|
description: "Configure extraction options including output format, strict mode, async processing, webhook callbacks, raw text inclusion, page ranges, and language hints.",
|
|
5761
6790
|
content: [
|
|
6791
|
+
{ type: "paragraph", text: "Pass these options as fields in the `options` JSON object on `POST /v1/extract` to control extraction behavior. Options let you switch between sync and async mode, include raw text, restrict page ranges, and configure webhook delivery." },
|
|
5762
6792
|
{
|
|
5763
6793
|
type: "param-table",
|
|
5764
6794
|
params: [
|
|
@@ -5770,7 +6800,11 @@ X-Talonic-Cells-Resolved-AI: 5`
|
|
|
5770
6800
|
{ name: "page_range", type: "string", description: 'Pages to extract from. E.g. "1-5", "1,3,7-10". PDF only.' },
|
|
5771
6801
|
{ name: "language_hint", type: "string", description: "ISO 639-1 language code hint. Improves extraction for non-English documents." }
|
|
5772
6802
|
]
|
|
5773
|
-
}
|
|
6803
|
+
},
|
|
6804
|
+
{ type: "paragraph", text: "Most integrations use `strict: true` (default) to receive only the schema-defined fields. Set `strict: false` when you want the AI to also return additional fields it discovers beyond your schema. The `async` and `webhook_url` options work well together: set `webhook_url` to receive the result as a callback instead of polling." },
|
|
6805
|
+
{ type: "paragraph", text: 'The `page_range` option accepts comma-separated page numbers and ranges (e.g. `"1-5"`, `"1,3,7-10"`) and applies only to PDF files. Use `language_hint` with an ISO 639-1 code (e.g. `"de"`, `"ja"`) to improve extraction accuracy for non-English documents, especially when the OCR needs guidance on character sets.' },
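To make the `page_range` syntax concrete, the sketch below expands a value such as `"1,3,7-10"` into the pages it selects. This is a client-side illustration only: the `expandPageRange` helper is hypothetical, and how the API itself validates or normalizes the value is not described here.

```javascript
// Expands a page_range string such as "1,3,7-10" into a sorted list of
// unique page numbers. Throws on segments that are not "N" or "N-M".
function expandPageRange(pageRange) {
  const pages = new Set();
  for (const part of pageRange.split(",")) {
    const match = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part);
    if (!match) throw new Error(`Invalid page range segment: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range segment: "${part}"`);
    }
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return [...pages].sort((a, b) => a - b);
}

console.log(expandPageRange("1,3,7-10"));
```

Checking the value locally before submitting a request can catch typos such as a reversed range early.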
|
|
6806
|
+
{ type: "paragraph", text: "Pair `include_raw_text: true` with schema-driven extraction when your downstream system needs both structured data and the original text for audit or display purposes. Note that setting `webhook_url` implicitly enables async behavior \u2014 the response will be `202 Accepted` regardless of the `async` flag." },
|
|
6807
|
+
{ type: "callout", text: 'The `format` option controls the output shape of the `data` field. Use `"json"` (default) for programmatic consumption. CSV format is available on the `GET /v1/extractions/:id/data` endpoint instead.' }
|
|
5774
6808
|
],
|
|
5775
6809
|
related: [
|
|
5776
6810
|
{ label: "POST /v1/extract", slug: "post-extract" },
|
|
@@ -6064,6 +7098,10 @@ var sections19 = [
|
|
|
6064
7098
|
}
|
|
6065
7099
|
}`
|
|
6066
7100
|
},
|
|
7101
|
+
{ type: "paragraph", text: "Most integrations call this endpoint after receiving an `extraction.complete` webhook or after polling a document's status until it reaches `completed`. A typical workflow is to extract a document via `POST /v1/extract`, store the returned `document.id`, then fetch full metadata here when needed." },
|
|
7102
|
+
{ type: "paragraph", text: "The response includes the current `status` field, which is `completed` when extraction has finished, `processing` while it is in progress, or `error` if something went wrong. Use `latest_extraction_id` to navigate directly to the extraction result via `GET /v1/extractions/:id`." },
|
|
7103
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/documents/:id/markdown` to retrieve the raw OCR text, or with `GET /v1/extractions/:id/data` for just the structured field values. Note that the `triage` object is only populated after ingestion completes and may be `null` for documents still in processing." },
|
|
7104
|
+
{ type: "callout", variant: "info", text: "The `links.dashboard` URL opens the document directly in the Talonic platform UI, which is useful for sharing with team members who need to review or correct extractions." },
|
|
6067
7105
|
{ type: "heading", level: 2, id: "get-document-errors", text: "Errors" },
|
|
6068
7106
|
{
|
|
6069
7107
|
type: "param-table",
|
|
@@ -6119,6 +7157,9 @@ var sections19 = [
|
|
|
6119
7157
|
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
|
|
6120
7158
|
}`
|
|
6121
7159
|
},
|
|
7160
|
+
{ type: "paragraph", text: "Most integrations call this endpoint as part of a cleanup workflow after data has been exported or when a document was uploaded in error. A typical pattern is to list documents with `GET /v1/documents`, identify candidates for deletion, then call this endpoint for each one." },
|
|
7161
|
+
{ type: "paragraph", text: "The response includes a `deleted` field set to `true` and the `id` of the removed document. There is no soft-delete mechanism \u2014 the original file, OCR markdown, and all extraction results are permanently purged from storage." },
|
|
7162
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/documents/:id` beforehand to verify you are deleting the correct resource. Note that if the document participated in entity linking or cases, those links are removed and affected cases may be recomputed during the next backfill cycle." },
|
|
6122
7163
|
{ type: "heading", level: 2, id: "delete-document-errors", text: "Errors" },
|
|
6123
7164
|
{
|
|
6124
7165
|
type: "param-table",
|
|
@@ -6378,6 +7419,9 @@ var sections20 = [
|
|
|
6378
7419
|
"due_date": "2024-03-15"
|
|
6379
7420
|
}`
|
|
6380
7421
|
},
|
|
7422
|
+
{ type: "paragraph", text: "Most integrations call this endpoint to feed extraction output into downstream systems (CRMs, ERPs, data warehouses) that only need the raw key-value data. A typical workflow is to extract a document, then call this endpoint with the `extraction_id` from the response to get a clean data payload without metadata overhead." },
|
|
7423
|
+
{ type: "paragraph", text: "The response is a flat JSON object where each key is a field name and each value is the extracted value, typed according to the schema (strings, numbers, dates, arrays). Use `?format=csv` to download the same data as a CSV file with field names as headers \u2014 the `Content-Disposition` header provides a suggested filename." },
|
|
7424
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/extractions/:id` when you also need confidence scores, locked field status, or processing metadata. Note that the response shape matches the schema used during extraction \u2014 if no schema was provided, auto-discovered field names are used as keys." },
|
|
6381
7425
|
{ type: "heading", level: 2, id: "get-extraction-fields-errors", text: "Errors" },
|
|
6382
7426
|
{
|
|
6383
7427
|
type: "param-table",
|
|
@@ -6701,6 +7745,9 @@ var sections21 = [
|
|
|
6701
7745
|
}
|
|
6702
7746
|
}`
|
|
6703
7747
|
},
|
|
7748
|
+
{ type: "paragraph", text: "Most integrations call this endpoint before running an extraction to verify the schema definition is correct, or after an update to confirm the new version was applied. A typical workflow is to create a schema with `POST /v1/schemas`, store the returned `id`, then fetch it here whenever you need the current definition." },
|
|
7749
|
+
{ type: "paragraph", text: "The response includes the full `definition` object in normalized JSON Schema format, along with the `version` number and `field_count`. Use the `links.extractions` URL to list all extractions that used this schema, and `links.dashboard` to open it in the platform UI." },
|
|
7750
|
+
{ type: "paragraph", text: "Pair this with `PUT /v1/schemas/:id` to update the definition, or pass the `id` as `schema_id` on `POST /v1/extract` to run schema-driven extraction. Note that both UUID and `SCH-` prefixed short IDs are accepted as the `:id` parameter." },
|
|
6704
7751
|
{ type: "heading", level: 2, id: "get-schema-errors", text: "Errors" },
|
|
6705
7752
|
{
|
|
6706
7753
|
type: "param-table",
|
|
@@ -6898,6 +7945,10 @@ var sections21 = [
|
|
|
6898
7945
|
}
|
|
6899
7946
|
}`
|
|
6900
7947
|
},
|
|
7948
|
+
{ type: "paragraph", text: "Most integrations call this endpoint when extraction requirements evolve \u2014 for example, adding a new field to an invoice schema or renaming an existing one. A typical workflow is to fetch the current schema with `GET /v1/schemas/:id`, modify the `definition`, then send the updated payload here." },
|
|
7949
|
+
{ type: "paragraph", text: "The response includes the updated `definition`, `field_count`, and `version` number. The `updated_at` timestamp reflects when the change was applied. All body parameters are optional \u2014 send only `name`, `definition`, or `description` to update that field without touching the others." },
|
|
7950
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/extractions?schema_id=:id` to review historical extractions that used previous versions. Note that schema versioning is append-only internally, so you can always compare before-and-after definitions through the dashboard." },
|
|
7951
|
+
{ type: "callout", variant: "info", text: "Schema updates do not retroactively change existing extractions. If you need to re-extract documents with the new schema, call `POST /v1/extract` with `document_id` and the updated `schema_id`." },
|
|
6901
7952
|
{ type: "heading", level: 2, id: "update-schema-errors", text: "Errors" },
|
|
6902
7953
|
{
|
|
6903
7954
|
type: "param-table",
|
|
@@ -6953,6 +8004,9 @@ var sections21 = [
|
|
|
6953
8004
|
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
|
|
6954
8005
|
}`
|
|
6955
8006
|
},
|
|
8007
|
+
{ type: "paragraph", text: "Most integrations call this endpoint during cleanup when a schema is no longer needed, or when consolidating duplicate schemas. A typical workflow is to list schemas with `GET /v1/schemas`, identify obsolete ones, then delete them individually by `id`." },
|
|
8008
|
+
{ type: "paragraph", text: "The response confirms deletion with `deleted: true` and the `id` of the removed schema. All extraction results that used this schema remain intact and queryable via `GET /v1/extractions` \u2014 only the schema definition itself is removed from the system." },
|
|
8009
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/schemas/:id` beforehand to review the schema before removing it. Note that deletion is permanent with no undo \u2014 if you need the same structure later, you must recreate it with `POST /v1/schemas`." },
|
|
6956
8010
|
{ type: "heading", level: 2, id: "delete-schema-errors", text: "Errors" },
|
|
6957
8011
|
{
|
|
6958
8012
|
type: "param-table",
|
|
@@ -7530,6 +8584,9 @@ var sections23 = [
|
|
|
7530
8584
|
description: "Create a new input source and receive a source-scoped API key. The key is only shown once in the creation response \u2014 store it securely.",
|
|
7531
8585
|
content: [
|
|
7532
8586
|
{ type: "paragraph", text: "Create a new source to start ingesting documents. The response includes a **source-scoped API key** (`tlnc_sk_*`) that authenticates uploads to this source's endpoint. This key is shown only once \u2014 store it securely immediately after creation." },
|
|
8587
|
+
{ type: "paragraph", text: "The typical workflow is: create a source, store the returned `api_key` securely, then use it to authenticate document uploads to the source's `endpoint` URL. Optionally pass a `default_schema_id` to automatically apply an extraction schema to all documents ingested through this source." },
|
|
8588
|
+
{ type: "paragraph", text: "The response returns the source with `status: active`, `document_count: 0`, and the one-time `api_key` field. The `endpoint` URL is the path for `POST` document uploads. The `links` object includes URLs for the source detail, document list, and dashboard view." },
|
|
8589
|
+
{ type: "paragraph", text: "Store the `api_key` immediately \u2014 it cannot be retrieved again. If lost, delete the source and create a new one. The source type defaults to `api` (programmatic ingestion); use `upload` for manual file uploads or `connector` for third-party integrations like Google Drive or SharePoint." },
|
|
7533
8590
|
{ type: "callout", variant: "warning", text: "The `api_key` is only returned in the creation response. It cannot be retrieved later. If you lose it, delete the source and create a new one." },
|
|
7534
8591
|
{
|
|
7535
8592
|
type: "endpoint",
|
|
@@ -7615,6 +8672,10 @@ var sections23 = [
|
|
|
7615
8672
|
description: "Get source details, update a source name, or delete a source. Documents are retained but unlinked when a source is deleted.",
|
|
7616
8673
|
content: [
|
|
7617
8674
|
{ type: "paragraph", text: "Manage an individual source with GET, PATCH, and DELETE operations on the same path. Retrieve source details, update its name, or permanently delete it. When a source is deleted, its documents are **retained** but unlinked from the source." },
|
|
8675
|
+
{ type: "paragraph", text: "Use `GET` to inspect a source's current status, document count, and default schema assignment. Use `PATCH` to rename a source. Use `DELETE` when a source is no longer needed \u2014 this immediately invalidates the source-scoped API key, so any integration using it will start receiving `401` errors." },
|
|
8676
|
+
{ type: "paragraph", text: "The `GET` response includes `document_count`, `default_schema` (with its `id` if set), and the `endpoint` URL for document ingestion. The `status` field shows the current state \u2014 `active` for API sources, or sync status values for connector-based sources (Google Drive, SharePoint, etc.)." },
|
|
8677
|
+
{ type: "paragraph", text: "Deleting a source retains all its documents in your workspace \u2014 they remain accessible via the documents API and any existing extractions are preserved. Only the source-to-document link is removed. Pair `GET /v1/sources/:id` with `GET /v1/sources/:id/documents` to see documents belonging to a specific source." },
|
|
8678
|
+
{ type: "callout", variant: "info", text: "Deleting a source immediately invalidates its API key. Any integration using that key will receive `401` errors. Documents are retained but unlinked from the source." },
|
|
7618
8679
|
{
|
|
7619
8680
|
type: "endpoint",
|
|
7620
8681
|
method: "GET",
|
|
@@ -8522,7 +9583,11 @@ var sections25 = [
|
|
|
8522
9583
|
"`extraction.complete` \u2014 Extraction finished successfully. Payload includes the full extraction result.",
|
|
8523
9584
|
"`extraction.failed` \u2014 Extraction failed. Payload includes the error details.",
|
|
8524
9585
|
"`document.ingested` \u2014 A new document has been processed and is ready for extraction."
|
|
8525
|
-
] }
|
|
9586
|
+
] },
|
|
9587
|
+
{ type: "paragraph", text: "Most integrations subscribe to `extraction.complete` to trigger downstream processing (e.g. writing structured data to a database or notifying a user). A typical workflow is to pass `webhook_url` on `POST /v1/extract`, then handle the callback payload in your server without polling." },
|
|
9588
|
+
{ type: "paragraph", text: "The `extraction.complete` payload includes the `extraction_id`, `document_id`, `schema_id`, `status`, and `confidence` score. Use the `extraction_id` to fetch the full result via `GET /v1/extractions/:id` if the payload does not contain all the fields you need." },
|
|
9589
|
+
{ type: "paragraph", text: "Pair event handling with [Signature Verification](webhook-security) to ensure payloads are authentic. Note that `extraction.failed` events include an `error` field with a machine-readable code and human-readable message \u2014 use this to decide whether to retry via `POST /v1/extract` with `document_id`." },
|
|
9590
|
+
{ type: "callout", text: "Webhook URLs must be HTTPS endpoints. HTTP URLs are rejected at configuration time to ensure payload confidentiality in transit." }
|
|
8526
9591
|
],
|
|
8527
9592
|
related: [
|
|
8528
9593
|
{ label: "Signature Verification", slug: "webhook-security" },
|
|
@@ -8640,7 +9705,11 @@ echo -n '{"event":"extraction.complete","delivery_id":"dlv_test123","timestamp":
|
|
|
8640
9705
|
"3rd retry \u2014 30 minutes",
|
|
8641
9706
|
"4th retry (final) \u2014 4 hours"
|
|
8642
9707
|
] },
|
|
8643
|
-
{ type: "paragraph", text: "After 4 failed attempts, the delivery is marked as failed. You can check delivery status and replay events from the dashboard." }
|
|
9708
|
+
{ type: "paragraph", text: "After 4 failed attempts, the delivery is marked as failed. You can check delivery status and replay events from the dashboard." },
|
|
9709
|
+
{ type: "paragraph", text: "Most integrations rely on the default retry schedule and only intervene when a delivery reaches the failed state. A typical debugging workflow is to check the delivery history in the dashboard, identify the HTTP status or timeout that caused the failure, then fix the endpoint and replay the event." },
|
|
9710
|
+
{ type: "paragraph", text: "Your endpoint must return a `2xx` status code within **30 seconds** to be considered successful. Non-`2xx` responses (including `3xx` redirects) and timeouts trigger retries. The `X-Talonic-Delivery-Id` header remains the same across retries, so use it for idempotent processing on your end." },
|
|
9711
|
+
{ type: "paragraph", text: "Pair retry awareness with [Signature Verification](webhook-security) to reject spoofed payloads early. Note that the total retry window spans approximately **4.5 hours** from the initial attempt \u2014 if your endpoint is down longer than that, use the dashboard replay feature to re-send missed events." },
|
|
9712
|
+
{ type: "callout", text: "If your endpoint consistently fails, check for firewall rules blocking Talonic IPs, TLS certificate issues, or response timeouts exceeding 30 seconds. The dashboard delivery log shows the HTTP status and error for each attempt." }
|
|
8644
9713
|
],
|
|
8645
9714
|
related: [
|
|
8646
9715
|
{ label: "Webhook Events", slug: "webhook-events" },
|
|
@@ -8658,7 +9727,11 @@ echo -n '{"event":"extraction.complete","delivery_id":"dlv_test123","timestamp":
|
|
|
8658
9727
|
seoTitle: "Webhook Delivery Format \u2014 Talonic Docs",
|
|
8659
9728
|
description: "Webhook delivery format details including POST request structure, JSON body format, and standard headers for event type, signature, delivery ID, and timestamp.",
|
|
8660
9729
|
content: [
|
|
8661
|
-
{ type: "paragraph", text: "Webhooks are delivered as `POST` requests with a JSON body. Configure webhook URLs per-source or per-extraction via the `webhook_url` option on the extract endpoint." }
|
|
9730
|
+
{ type: "paragraph", text: "Webhooks are delivered as `POST` requests with a JSON body. Configure webhook URLs per-source or per-extraction via the `webhook_url` option on the extract endpoint." },
|
|
9731
|
+
{ type: "paragraph", text: "Most integrations configure a single webhook endpoint that handles all event types, using the `X-Talonic-Event` header to route internally. A typical setup is to pass `webhook_url` on `POST /v1/extract` calls, or configure a default URL in the dashboard for all extractions from a specific source." },
|
|
9732
|
+
{ type: "paragraph", text: "Each delivery includes four standard headers: `X-Talonic-Event` (event type), `X-Talonic-Signature` (HMAC-SHA256 for verification), `X-Talonic-Delivery-Id` (unique ID for idempotency), and `X-Talonic-Timestamp` (Unix timestamp). Your endpoint must return a `2xx` status within **30 seconds** or the delivery is considered failed." },
|
|
9733
|
+
{ type: "paragraph", text: "Pair webhook delivery with the [Signature Verification](webhook-security) guide to authenticate incoming payloads. Note that failed deliveries are retried with exponential backoff up to 4 times \u2014 see [Retry Policy](webhook-retry) for the schedule." },
|
|
9734
|
+
{ type: "callout", text: "Use the `X-Talonic-Delivery-Id` header to deduplicate webhook deliveries on your end. Retries reuse the same delivery ID, so you can safely discard duplicates." }
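The routing and deduplication described above can be sketched as a single handler function. The header names come from the text; the `handleDelivery` helper, the in-memory `Set`, and the lowercased header keys (as Node's `http` module provides them) are assumptions made for this example. Signature verification is assumed to have happened before this function is called.

```javascript
// In-memory record of processed delivery IDs. A real service would likely use
// a shared store with expiry; the Set keeps this sketch self-contained.
const seenDeliveries = new Set();

// `headers` is assumed to use lowercased header names.
// `handlers` maps event types (e.g. "extraction.complete") to functions.
function handleDelivery(headers, body, handlers) {
  const deliveryId = headers["x-talonic-delivery-id"];
  if (seenDeliveries.has(deliveryId)) return "duplicate"; // retry of a handled delivery
  const handler = handlers[headers["x-talonic-event"]];
  if (!handler) return "ignored"; // event type this service does not handle
  handler(body);
  seenDeliveries.add(deliveryId); // mark only after the handler succeeds
  return "processed";
}
```

Recording the ID only after the handler returns means a delivery whose handler throws is processed again on the next retry instead of being silently dropped.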
|
|
8662
9735
|
],
|
|
8663
9736
|
related: [
|
|
8664
9737
|
{ label: "Webhook Events", slug: "webhook-events" },
|
|
@@ -9214,6 +10287,9 @@ var sections27 = [
|
|
|
9214
10287
|
description: "Classify link keys into categories (identity, transaction, reference) using AI. Runs asynchronously on ambiguous fields.",
|
|
9215
10288
|
content: [
|
|
9216
10289
|
{ type: "paragraph", text: "When new fields are extracted, some may not be automatically classified as link keys. The classify endpoint runs AI-powered classification on ambiguous fields to determine whether they are **identity**, **transaction**, or **reference** link keys. This is useful after onboarding new document types or when the field registry grows." },
|
|
10290
|
+
{ type: "paragraph", text: "Call this endpoint after uploading a new batch of documents or after adding a new document type to your workspace. The endpoint returns immediately with the count of fields that were classified \u2014 any graph rebuilding happens asynchronously via a triggered backfill." },
|
|
10291
|
+
{ type: "paragraph", text: "The response includes a `classified` count (number of fields newly assigned a category) and a `backfillTriggered` boolean. When `backfillTriggered` is `true`, entity links across all documents are being rebuilt in the background. Poll the **Backfill** progress endpoint to monitor completion." },
|
|
10292
|
+
{ type: "paragraph", text: "Only fields with a `null` category are evaluated \u2014 already-classified link keys are not re-assessed. To verify which fields were classified, call the **Link Keys** endpoint before and after. If no ambiguous fields remain, `classified` returns `0` and no backfill is triggered." },
|
|
9217
10293
|
{ type: "callout", variant: "info", text: "Classification uses a two-pass approach: rule-based heuristics handle obvious cases (e.g. fields named `invoice_number`), then an LLM call classifies the remaining ambiguous fields. A backfill is automatically triggered when new link keys are identified." },
|
|
9218
10294
|
{
|
|
9219
10295
|
type: "endpoint",
|
|
@@ -9268,6 +10344,9 @@ var sections27 = [
|
|
|
9268
10344
|
description: "Get all entity links for a specific document showing entity values, types, link keys, and linked document IDs.",
|
|
9269
10345
|
content: [
|
|
9270
10346
|
{ type: "paragraph", text: "Retrieve all entity links discovered for a specific document. Each link represents a shared field value \u2014 such as a customer ID or PO number \u2014 that connects this document to others in the workspace. Use this endpoint to understand how a document relates to the rest of your corpus." },
|
|
10347
|
+
{ type: "paragraph", text: "Call this endpoint when building a document detail view or when you need to trace the relationships of a single document before exploring the broader graph. Pass the document UUID as a path parameter \u2014 the endpoint returns all entity links regardless of link key category." },
|
|
10348
|
+
{ type: "paragraph", text: "Each entry in the response includes the **entity_value** (the raw shared value), the **field_key** (which field it was extracted from), and the **link_key_category** (`identity`, `transaction`, or `reference`). Documents with no extracted field values matching other documents return an empty `data` array." },
|
|
10349
|
+
{ type: "paragraph", text: "Use this alongside the **Full Graph** subgraph endpoint to progressively explore the linking graph. Start here for a flat list of connections, then call the subgraph endpoint with `depth=2` to expand outward from the document and discover second-degree relationships." },
|
|
9271
10350
|
{ type: "callout", variant: "info", text: "The `document_count` field on each entity indicates how many documents share that value. A high count on an identity entity (e.g. a vendor ID appearing in 50+ documents) is expected, while a high count on a transaction entity may indicate a data quality issue." },
|
|
9272
10351
|
{
|
|
9273
10352
|
type: "endpoint",
|
|
@@ -9582,6 +10661,10 @@ var sections27 = [
|
|
|
9582
10661
|
description: "List and retrieve cases \u2014 automatically created groups of 2+ related documents linked through shared field values with narrative summaries.",
|
|
9583
10662
|
content: [
|
|
9584
10663
|
{ type: "paragraph", text: "Cases are automatically created groups of two or more documents that are connected through shared **transaction** or **reference** entity values. For example, an invoice, a purchase order, and a delivery note sharing the same PO number form a case. Cases provide a high-level view of document relationships without needing to navigate the full graph." },
|
|
10664
|
+
{ type: "paragraph", text: "Use this endpoint to retrieve all cases in your workspace for building case lists, dashboards, or approval queues. Cases are returned most recent first, ordered by the earliest document timestamp in each case. Each case includes a `document_count` and a stable `case_key` that you can use for subsequent detail lookups." },
|
|
10665
|
+
{ type: "paragraph", text: "The response includes a `links.self` URL for each case that points to the case detail endpoint. The `label` field contains an auto-generated human-readable name when available, or `null` for cases that have not yet been labelled. The `created_at` field reflects the timestamp of the earliest document in the group." },
|
|
10666
|
+
{ type: "callout", variant: "info", text: "Each document belongs to at most one case. Documents linked only through identity entities (e.g. shared vendor ID) appear as entity groups in the full graph but are not returned by this endpoint." },
|
|
10667
|
+
{ type: "paragraph", text: "Pair this endpoint with **Case Graph** to visualize individual cases, or with **Document-Case Map** for a flat document-to-case lookup. Cases are rebuilt automatically during backfill \u2014 if you have recently reclassified link keys, trigger a backfill first to ensure case assignments are up to date." },
|
|
9585
10668
|
{ type: "list", ordered: false, items: [
|
|
9586
10669
|
"Each case has a deterministic **case key** (hex hash of its document IDs)",
|
|
9587
10670
|
"Cases are created by the linking pipeline during backfill or real-time processing",
|
|
@@ -9656,6 +10739,10 @@ var sections27 = [
|
|
|
9656
10739
|
description: "Retrieve the D3-compatible graph visualization for a single case, showing document nodes and entity edges within the case boundary.",
|
|
9657
10740
|
content: [
|
|
9658
10741
|
{ type: "paragraph", text: "Retrieve the graph structure for a single case, formatted for **D3.js** or similar graph visualization libraries. The response contains only the nodes and edges within the case boundary, making it suitable for rendering focused relationship diagrams." },
|
|
10742
|
+
{ type: "paragraph", text: "The typical workflow is to first list cases via the **Cases** endpoint, then call this endpoint with a specific `case_key` to fetch the renderable graph. This is the primary endpoint for building case-level visualizations in custom UIs or embedded dashboards." },
|
|
10743
|
+
{ type: "paragraph", text: "The response includes both **document nodes** (with filename and inferred document type) and **entity nodes** (with the shared value and link key category). Edges always connect a document to an entity \u2014 never document-to-document directly. Node IDs are stable across requests, so you can preserve force-layout positions between refreshes." },
|
|
10744
|
+
{ type: "callout", variant: "info", text: "The case graph is a strict subset of the full workspace graph. Only entities that contributed to forming the case are included \u2014 high-frequency entities excluded from BFS do not appear." },
|
|
10745
|
+
{ type: "paragraph", text: "Pair this endpoint with **Document Links** to enrich each node with additional entity metadata, or with **Full Graph** when you need cross-case visibility. The graph structure mirrors the full graph format, so the same rendering code works for both." },
|
|
9659
10746
|
{
|
|
9660
10747
|
type: "endpoint",
|
|
9661
10748
|
method: "GET",
|
|
@@ -9723,6 +10810,9 @@ var sections27 = [
|
|
|
9723
10810
|
description: "Get the mapping of document IDs to their assigned case keys.",
|
|
9724
10811
|
content: [
|
|
9725
10812
|
{ type: "paragraph", text: "The document-case map provides a flat lookup from document ID to case assignment. Use it to quickly determine which case a document belongs to, or to identify documents that are not part of any case. Documents in **entity groups** (linked only through identity entities) are included with `is_case: false`." },
|
|
10813
|
+
{ type: "paragraph", text: "Call this endpoint when you need to enrich a document list with case membership \u2014 for example, to display a case badge next to each document in a table view. The response is a flat object keyed by document UUID, so lookups are O(1) without client-side joins." },
|
|
10814
|
+
{ type: "paragraph", text: "Each entry includes a `case_key` (the deterministic hex hash identifying the case), a `document_count` (total documents in that case or entity group), and an `is_case` boolean. When `is_case` is `false`, the `case_key` is an empty string \u2014 the document is linked via identity entities only." },
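The badge use case described above can be sketched as a plain lookup. This is a minimal illustration, assuming the map has already been fetched and parsed into an object keyed by document ID; the function name `caseBadge` and the badge wording are hypothetical, and only the `case_key`, `document_count`, and `is_case` fields come from the text.

```javascript
// Hypothetical helper: turns one entry of the document-case map into a label.
// Assumes `caseMap` is the parsed flat object keyed by document UUID.
function caseBadge(caseMap, documentId) {
  // Documents with no entity links are omitted from the map entirely.
  if (!Object.prototype.hasOwnProperty.call(caseMap, documentId)) {
    return "unlinked";
  }
  const entry = caseMap[documentId];
  // is_case: false means the document is linked via identity entities only.
  if (!entry.is_case) {
    return "entity group";
  }
  // Shorten the hex case key for display; the 8-character cut is arbitrary.
  return `case ${entry.case_key.slice(0, 8)} (${entry.document_count} docs)`;
}
```

Because the map is keyed by document ID, each row of a table view needs one property lookup and no client-side join.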
|
|
10815
|
+
{ type: "paragraph", text: "This endpoint pairs well with the **Cases** list endpoint. Use the map for bulk lookups across your document set, and the Cases endpoint when you need case-level metadata like labels or timestamps. Documents with no entity links at all are omitted from the map entirely." },
|
|
9726
10816
|
{ type: "callout", variant: "info", text: "Documents with `is_case: false` are linked to other documents only through identity entities (e.g. same vendor). They appear in the map but do not form a case. Documents with no links at all are not included in the map." },
|
|
9727
10817
|
{
|
|
9728
10818
|
type: "endpoint",
|
|
@@ -12058,6 +13148,9 @@ var sections31 = [
|
|
|
12058
13148
|
description: "Get metric trends over time for a schema. Returns time-series telemetry data across recent runs for tracking quality changes.",
|
|
12059
13149
|
content: [
|
|
12060
13150
|
{ type: "paragraph", text: "Track how structuring metrics evolve over successive runs for a schema. This endpoint returns a **time-series** of telemetry snapshots, allowing you to detect quality improvements, regressions, or shifts in strategy distribution as your field registry matures." },
|
|
13151
|
+
{ type: "paragraph", text: "Call this endpoint after several extraction runs to build trend charts or to detect regressions. The default window returns the 10 most recent runs \u2014 use the `window` query parameter to expand up to 50 runs for longer-term analysis." },
|
|
13152
|
+
{ type: "paragraph", text: "Each snapshot in the `data` array contains the same metrics as the **Schema Summary** \u2014 `capture_hit_rate`, `synthesize_rate`, `strategy_distribution`, and `tier_funnel` \u2014 plus a `created_at` timestamp and `run_id`. The array is ordered by most recent run first." },
|
|
13153
|
+
{ type: "paragraph", text: "Compare the trend data with the **Schema Fields** endpoint to pinpoint which specific fields are driving changes. A sudden spike in `synthesize_rate` across runs may indicate a new document type that the field registry has not yet learned, while a steady decrease signals healthy registry maturation." },
|
|
12061
13154
|
{ type: "callout", variant: "info", text: "A rising `capture_hit_rate` over time indicates the field registry is learning from extractions and resolving more fields deterministically, reducing LLM costs." },
|
|
12062
13155
|
{
|
|
12063
13156
|
type: "endpoint",
|
|
@@ -12162,6 +13255,9 @@ var sections31 = [
|
|
|
12162
13255
|
description: "Get per-field structuring metrics for a schema including field-level state distribution, capture rates, and strategy breakdown.",
|
|
12163
13256
|
content: [
|
|
12164
13257
|
{ type: "paragraph", text: "Drill down to **individual field performance** within a schema. This endpoint returns per-field capture rates, synthesis rates, the most common strategy used, and the distribution of cell states (filled, empty, skipped). Use it to identify underperforming fields that may need instruction tuning or manual review." },
|
|
13258
|
+
{ type: "paragraph", text: "Call this endpoint after reviewing the **Schema Summary** to investigate which fields are driving low capture rates or high synthesis costs. The field-level breakdown reveals whether issues are concentrated in a few problematic fields or spread evenly across the schema." },
|
|
13259
|
+
{ type: "paragraph", text: "Each entry in the `data` array includes the `field_name`, `capture_rate` and `synthesize_rate` (both 0-1 fractions), the dominant `strategy` (one of `transfer`, `extract`, `compute`, `skip`), and a `state_distribution` object with `filled`, `empty`, and `skipped` counts. Fields with a `strategy` of `extract` are LLM-dependent and contribute most to cost." },
|
|
13260
|
+
{ type: "paragraph", text: "Pair this with the **Schema Trend** endpoint to track how individual field performance changes across runs. Fields that remain stuck on `extract` strategy after multiple runs are strong candidates for adding explicit instructions or seeding the field registry with example values." },
|
|
12165
13261
|
{ type: "callout", variant: "info", text: "Fields with a high `synthesize_rate` and low `capture_rate` are candidates for field registry enrichment or instruction refinement to reduce LLM dependency." },
|
|
12166
13262
|
{
|
|
12167
13263
|
type: "endpoint",
|
|
@@ -12243,6 +13339,9 @@ var sections31 = [
|
|
|
12243
13339
|
description: "Get aggregate structuring metrics for a single job run including strategy distribution, tier funnel, and capture hit rate.",
|
|
12244
13340
|
content: [
|
|
12245
13341
|
{ type: "paragraph", text: "Retrieve structuring telemetry for a **specific job run** rather than the latest run for a schema. Use this when you need to inspect the performance of a particular execution, compare two runs side by side, or debug a run that produced unexpected results." },
|
|
13342
|
+
{ type: "paragraph", text: "The typical workflow is to list runs from your jobs pipeline, then call this endpoint with the run UUID to inspect its metrics. This is especially useful when a run produces unexpected accuracy \u2014 the telemetry reveals whether the issue is in capture (registry gaps), synthesis (LLM errors), or strategy selection." },
|
|
13343
|
+
{ type: "paragraph", text: "The response includes `capture_hit_rate`, `synthesize_rate`, `strategy_distribution`, and `tier_funnel` \u2014 identical in shape to the **Schema Summary**. The `schema_id` field identifies which schema was used, allowing you to cross-reference with field-level telemetry. Runs that are still `pending` or `running` return a `404` until they complete." },
|
|
13344
|
+
{ type: "paragraph", text: "To compare two runs, call this endpoint twice with different run IDs and diff the `strategy_distribution` and `tier_funnel` values. Pair with the **Schema Trend** endpoint when you need the full historical view rather than a point-in-time comparison." },
|
|
12246
13345
|
{ type: "callout", variant: "info", text: "The response shape is identical to the Schema Summary endpoint. The only difference is that this endpoint targets a specific run by ID instead of returning the latest run for a schema." },
|
|
12247
13346
|
{
|
|
12248
13347
|
type: "endpoint",
|
|
@@ -12392,6 +13491,9 @@ var sections32 = [
|
|
|
12392
13491
|
description: "Get a ground-truth dataset's details with its expected values, or delete the dataset. Supports GET (read scope) and DELETE (write scope) on the same path.",
|
|
12393
13492
|
content: [
|
|
12394
13493
|
{ type: "paragraph", text: "Retrieve the full details of a ground-truth dataset including all expected value entries, or permanently delete the dataset. The GET response includes every document-field pair with the expected value, which you can use to audit the benchmark data before running a validation." },
|
|
13494
|
+
{ type: "paragraph", text: "Call GET before starting a validation run to verify that expected values are correct and complete. The `values` array contains every document-field pair with its `expected_value`, `document_id`, and `field_name` \u2014 review these to ensure the benchmark data reflects your current extraction requirements." },
|
|
13495
|
+
{ type: "paragraph", text: "The response includes `entry_count` for a quick size check and `user_schema_id` to confirm schema scope. The `values` array entries each have their own UUID (`id`) and `created_at` timestamp. If the dataset is unscoped (`user_schema_id: null`), it can validate fields across any schema." },
|
|
13496
|
+
{ type: "paragraph", text: "Use DELETE only when the dataset is no longer relevant. Existing validation runs that referenced this dataset are retained with their results intact, but you cannot create new runs against a deleted dataset. To update individual entries, delete and recreate the dataset with corrected values." },
|
|
12395
13497
|
{ type: "callout", variant: "warning", text: "Deleting a ground-truth dataset also removes all associated expected value entries. Existing validation runs that used this dataset are retained but can no longer be re-run." },
|
|
12396
13498
|
{
|
|
12397
13499
|
type: "endpoint",
|
|
@@ -12653,6 +13755,10 @@ var sections32 = [
|
|
|
12653
13755
|
description: "Get validation run detail with accuracy summary or delete a run. Supports GET (read scope) and DELETE (write scope) on the same path.",
|
|
12654
13756
|
content: [
|
|
12655
13757
|
{ type: "paragraph", text: "Retrieve the full details of a validation run, including its status, accuracy score, and total comparisons, or permanently delete a run and its associated results. Use GET to poll a run's status until it reaches `completed`, then fetch the detailed results." },
|
|
13758
|
+
{ type: "paragraph", text: "After creating a validation run, poll this endpoint until the `status` field transitions from `pending` or `running` to `completed` or `failed`. Once completed, the `accuracy` field contains the overall score (0-1) and `total_comparisons` shows how many field-level comparisons were made." },
|
|
13759
|
+
{ type: "paragraph", text: "The response includes `links.results` which points directly to the per-field results endpoint. Once the run reaches `completed` status, follow this link to retrieve the granular comparison data including match types, similarity scores, and LLM judge verdicts." },
|
|
13760
|
+
{ type: "callout", variant: "warning", text: "Deleting a validation run permanently removes all per-field results. The ground-truth dataset and the original job run are not affected. Use DELETE only when you want to clean up outdated or erroneous runs." },
|
|
13761
|
+
{ type: "paragraph", text: "Pair this endpoint with **Create Validation Run** for the create-then-poll workflow, or with **List Validation Runs** to find specific runs by recency. Comparing the `accuracy` values of multiple runs against the same ground-truth dataset is the primary way to track extraction quality over time." },
|
|
12656
13762
|
{
|
|
12657
13763
|
type: "endpoint",
|
|
12658
13764
|
method: "GET",
|
|
@@ -12904,6 +14010,9 @@ var sections33 = [
|
|
|
12904
14010
|
description: "Get credit transaction history including purchases, deductions, and adjustments with page-based pagination.",
|
|
12905
14011
|
content: [
|
|
12906
14012
|
{ type: "paragraph", text: "Retrieve a chronological log of every credit transaction on your account. Transactions include **purchases** (positive amounts), **consumption deductions** (negative amounts), **bonuses**, and **manual adjustments**. Use this to audit spending and reconcile usage." },
|
|
14013
|
+
{ type: "paragraph", text: "Call this endpoint to build a transaction ledger view or to reconcile credit changes over a billing period. The response uses page-based pagination \u2014 pass `page` and `limit` query parameters to navigate through large transaction histories. The default page size is 20 with a maximum of 100." },
|
|
14014
|
+
{ type: "paragraph", text: "Each transaction includes an `amount` (negative for deductions, positive for purchases), a `type` field (`consumption`, `purchase`, `bonus`, or `adjustment`), and an `operation_type` that identifies the pipeline operation responsible. The `total` field in the response gives the full count for pagination math." },
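The pagination math mentioned above can be sketched as follows. This is an illustration only: the function name `pageRequests` is hypothetical, and it assumes `page` is 1-indexed and that an out-of-range `limit` should be clamped client-side, neither of which the text confirms. The default of 20 and the maximum of 100 come from the text.

```javascript
// Hypothetical helper: derives the page/limit pairs needed to read a full
// transaction history, given the `total` count from a first response.
function pageRequests(total, limit = 20) {
  // Clamp to the documented maximum page size of 100 (minimum of 1 assumed).
  const size = Math.min(Math.max(1, limit), 100);
  const pages = Math.ceil(total / size);
  // Assumption: pages are numbered starting at 1.
  return Array.from({ length: pages }, (_, i) => ({ page: i + 1, limit: size }));
}
```

For example, a `total` of 45 with the default page size yields three requests, the last of which returns a partial page.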
|
|
14015
|
+
{ type: "paragraph", text: "Use this alongside the **Balance** endpoint to understand how your balance arrived at its current value. For aggregate cost analysis by operation type and model, the **Usage Summary** endpoint provides a more efficient grouped view without per-transaction detail." },
|
|
12907
14016
|
{ type: "callout", variant: "info", text: "Transactions are ordered by most recent first. Each entry includes the `operation_type` that triggered it (e.g. `extraction`, `manual`), making it easy to trace costs back to specific pipeline operations." },
|
|
12908
14017
|
{
|
|
12909
14018
|
type: "endpoint",
|
|
@@ -12986,6 +14095,9 @@ var sections33 = [
|
|
|
12986
14095
|
description: "Get aggregate credit usage summary broken down by operation type and model for a configurable time period.",
|
|
12987
14096
|
content: [
|
|
12988
14097
|
{ type: "paragraph", text: "Get a high-level view of your API usage grouped by **operation type** and **model**. This endpoint aggregates call counts, token consumption, and estimated costs over a configurable lookback period. Use it to understand which operations drive your spending." },
|
|
14098
|
+
{ type: "paragraph", text: "Call this endpoint to build cost dashboards or to identify which pipeline operations consume the most credits. The default lookback is 30 days \u2014 pass the `days` query parameter to adjust. Each row in the `stats` array represents a unique combination of `operation_type` and `model`." },
|
|
14099
|
+
{ type: "paragraph", text: "The response includes `call_count`, `total_input_tokens`, `total_output_tokens`, `total_cache_read_tokens`, and `total_cost_usd` per grouping. Note that token-based operations (e.g. `extraction` via Claude) report full token breakdowns, while page-based operations (e.g. `document_ai_ocr`) report zero tokens since cost is calculated from pages processed." },
|
|
14100
|
+
{ type: "paragraph", text: "Pair with **Daily Usage** for time-series analysis of the same period, or with **Usage Log** to drill into individual requests behind a high-cost grouping. The `period_days` field in the response confirms the actual lookback window applied." },
|
|
12989
14101
|
{ type: "callout", variant: "info", text: "Cost estimates include all token classes: input tokens, output tokens, cache creation tokens, and cache read tokens. Each is priced at the model-specific rate." },
|
|
12990
14102
|
{
|
|
12991
14103
|
type: "endpoint",
|
|
@@ -13074,6 +14186,10 @@ var sections33 = [
|
|
|
13074
14186
|
description: "Get per-day credit usage breakdown for the specified period (default last 30 days) with call counts and token totals per day.",
|
|
13075
14187
|
content: [
|
|
13076
14188
|
{ type: "paragraph", text: "Get a per-day breakdown of API usage over a configurable period. Each entry includes the total number of API calls, input/output token counts, and estimated cost for that calendar date. Use this for usage trend analysis and daily cost monitoring." },
|
|
14189
|
+
{ type: "paragraph", text: "Call this endpoint to populate daily usage charts or to set up alerting on cost spikes. The default lookback is 30 days \u2014 use the `days` query parameter to widen or narrow the window. Days with zero API calls are omitted from the response array." },
|
|
14190
|
+
{ type: "paragraph", text: "Each entry contains a `date` (YYYY-MM-DD in UTC), `calls` (total API calls), `input_tokens`, `output_tokens`, and `cost_usd`. All timestamps are UTC \u2014 a call made at 23:59 UTC on a given date appears under that UTC date, not the caller's local date." },
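Because days with zero API calls are omitted, a chart that expects one point per day needs the gaps filled in on the client. The sketch below is one way to do that, assuming the entries have the `date`, `calls`, `input_tokens`, `output_tokens`, and `cost_usd` fields described above; the function name `fillMissingDays` and its start and end arguments are hypothetical.

```javascript
// Hypothetical helper: inserts zero-valued entries for UTC dates that are
// missing from the daily usage array. Dates are YYYY-MM-DD strings in UTC.
function fillMissingDays(entries, startDate, endDate) {
  const byDate = new Map(entries.map((entry) => [entry.date, entry]));
  const filled = [];
  // Parse at midnight UTC so local time zones never shift the date.
  const cursor = new Date(`${startDate}T00:00:00Z`);
  const end = new Date(`${endDate}T00:00:00Z`);
  while (cursor <= end) {
    const date = cursor.toISOString().slice(0, 10);
    filled.push(
      byDate.get(date) ??
        { date, calls: 0, input_tokens: 0, output_tokens: 0, cost_usd: 0 }
    );
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return filled;
}
```

The output keeps the ascending date order of the input, so it can be passed to a time-series chart without further sorting.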
|
|
14191
|
+
{ type: "callout", variant: "info", text: "Daily usage is ordered by date ascending, making it ready for time-series charting without client-side sorting. Pair with the **Usage Summary** endpoint for operation-level breakdowns within the same period." },
|
|
14192
|
+
{ type: "paragraph", text: "Combine this endpoint with **Balance** to correlate daily burn against remaining runway. If you notice a cost spike on a specific date, drill into the **Usage Log** to identify the individual requests responsible." },
|
|
13077
14193
|
{
|
|
13078
14194
|
type: "endpoint",
|
|
13079
14195
|
method: "GET",
|
|
@@ -13343,6 +14459,9 @@ var sections34 = [
|
|
|
13343
14459
|
description: "List all tools available to the embedded agent including their impact level (read/write) and descriptions for discovering agent capabilities.",
|
|
13344
14460
|
content: [
|
|
13345
14461
|
{ type: "paragraph", text: "Discover all tools available to the embedded AI agent. Each tool declares its **impact level** \u2014 whether it performs a read-only operation or a mutation \u2014 so you can build permission-aware integrations. Use this endpoint to dynamically generate tool descriptions for external AI agents or to audit available capabilities." },
|
|
14462
|
+
{ type: "paragraph", text: "Call this endpoint at startup to populate your integration's tool registry, or periodically to detect newly added capabilities. The response includes every tool the agent can invoke, with a stable `name` identifier, a human-readable `description`, and the `impact` classification." },
|
|
14463
|
+
{ type: "paragraph", text: "The `totalCount` field gives the total number of tools available. Each tool's `impact` field follows a four-level severity scale: `read`, `draft_mutation`, `live_mutation`, and `irreversible`. Use these levels to build confirmation gates \u2014 for example, auto-approve `read` tools but require user confirmation for `live_mutation` and above." },
|
|
14464
|
+
{ type: "paragraph", text: "Pair this with the **Workspace Context** endpoint to give your external AI agent both situational awareness (context) and available actions (tools). The tool names returned here are stable identifiers that can be referenced in custom orchestration logic or permission policies." },
|
|
13346
14465
|
{ type: "callout", variant: "info", text: "Impact levels follow a severity scale: `read` (no side effects), `draft_mutation` (creates drafts only), `live_mutation` (modifies live data), and `irreversible` (permanent changes like deletion). Use these to implement confirmation gates in your integration." },
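A confirmation gate over the four impact levels might look like the sketch below. The level names and their order come from the text; the function name `requiresConfirmation`, the default gate, and the choice to treat unknown levels as requiring confirmation are assumptions made for illustration.

```javascript
// Impact levels in ascending order of severity, as listed in the text.
const IMPACT_LEVELS = ["read", "draft_mutation", "live_mutation", "irreversible"];

// Hypothetical helper: returns true when a tool at `impact` should ask the
// user before running, given the lowest level that needs confirmation.
function requiresConfirmation(impact, gateAt = "live_mutation") {
  const gate = IMPACT_LEVELS.indexOf(gateAt);
  if (gate === -1) {
    throw new Error(`unknown gate level: ${gateAt}`);
  }
  const level = IMPACT_LEVELS.indexOf(impact);
  // Assumption: fail closed, so an unrecognized impact level always prompts.
  if (level === -1) {
    return true;
  }
  return level >= gate;
}
```

With the default gate, `read` and `draft_mutation` tools run without a prompt, while `live_mutation` and `irreversible` tools wait for the user.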
|
|
13347
14466
|
{
|
|
13348
14467
|
type: "endpoint",
|
|
@@ -13521,6 +14640,9 @@ var sections35 = [
|
|
|
13521
14640
|
description: "Create a matching configuration with field mappings, comparison strategies (exact, fuzzy, date_range, numeric_range), and per-field weights that sum to 1.0.",
|
|
13522
14641
|
content: [
|
|
13523
14642
|
{ type: "paragraph", text: "Create a matching configuration that defines how documents are compared against a reference dataset. Each field mapping specifies a source field (from extracted documents), a target column (in the reference data), a comparison strategy, and a relative weight." },
|
|
14643
|
+
{ type: "paragraph", text: "The typical workflow is: upload reference data via `POST /v1/matching/reference-data`, create a config with field mappings, then trigger a run via `POST /v1/matching/configs/:id/run`. For complex datasets, use `POST /v1/matching/strategies/generate` first to get AI-recommended mappings and weights." },
|
|
14644
|
+
{ type: "paragraph", text: "The response returns the config with the saved `field_mappings`, `threshold` (defaults to 0.85), and `links.runs` URL for triggering runs. The `reference_data_id` is fixed at creation \u2014 to match against a different dataset, create a new config." },
|
|
14645
|
+
{ type: "paragraph", text: "Choose strategies carefully: use `exact` for standardized codes and IDs, `fuzzy` for names with potential typos, `date_range` for dates with tolerance, and `numeric_range` for amounts with rounding differences. Weights must sum to 1.0 \u2014 fields with higher weights have more influence on the overall confidence score." },
|
|
13524
14646
|
{ type: "callout", variant: "info", text: "Field weights should sum to 1.0. The overall confidence score for a match is the weighted sum of per-field scores. Use the **generate strategy** endpoint to get AI-recommended mappings if you are unsure which fields and weights to use." },
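The weighted-sum scoring described above can be sketched in a few lines. This is an illustration, not the platform's implementation: the function name `weightedConfidence`, the input shape, and the use of `>=` for the threshold comparison are assumptions. The 0.85 default threshold and the requirement that weights sum to 1.0 come from the text.

```javascript
// Hypothetical helper: combines per-field scores into one confidence value.
// fieldScores: [{ field, weight, score }] with each score between 0 and 1.
function weightedConfidence(fieldScores, threshold = 0.85) {
  const weightSum = fieldScores.reduce((sum, f) => sum + f.weight, 0);
  // Allow a tiny tolerance for floating-point rounding in the weights.
  if (Math.abs(weightSum - 1) > 1e-9) {
    throw new Error(`weights sum to ${weightSum}, expected 1.0`);
  }
  const confidence = fieldScores.reduce((sum, f) => sum + f.weight * f.score, 0);
  // Assumption: a score equal to the threshold counts as a match.
  return { confidence, isMatch: confidence >= threshold };
}
```

For instance, an exact match on a field weighted 0.5, a 0.8 fuzzy score on a field weighted 0.3, and a 0.5 score on a field weighted 0.2 combine to roughly 0.84, which falls just under the default threshold.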
|
|
13525
14647
|
{
|
|
13526
14648
|
type: "list",
|
|
@@ -13642,6 +14764,10 @@ var sections35 = [
|
|
|
13642
14764
|
description: "Get matching configuration details, update field mappings and weights, or delete a configuration. Deleting a config does not remove past run results.",
|
|
13643
14765
|
content: [
|
|
13644
14766
|
{ type: "paragraph", text: "Retrieve, update, or delete a matching configuration. Updates to field mappings and thresholds take effect on the next run \u2014 they do not retroactively change past results. Deleting a config removes the configuration but preserves all historical run results for audit purposes." },
|
|
14767
|
+
{ type: "paragraph", text: "Use `GET` to inspect the current field mappings, threshold, and targeting mode before running a match. Use `PUT` to adjust weights, swap strategies, or change the threshold \u2014 a common pattern is to lower the threshold after reviewing low-confidence results, then re-run to capture more matches." },
|
|
14768
|
+
{ type: "paragraph", text: "The `PUT` response returns the full updated config. The `reference_data_id` cannot be changed after creation \u2014 to match against a different dataset, create a new config. The `links.runs` URL provides a convenient shortcut to trigger a new run with the updated config." },
|
|
14769
|
+
{ type: "paragraph", text: "Deleting a config is safe for audit \u2014 all historical run results, including per-document evidence and confidence scores, are preserved. Pair config updates with the generate strategy endpoint to get AI-recommended adjustments based on your reference dataset." },
|
|
14770
|
+
{ type: "callout", variant: "info", text: "Past run results are immutable. Updating field mappings or thresholds only affects future runs \u2014 re-run matching after config changes to see the updated results." },
|
|
13645
14771
|
{
|
|
13646
14772
|
type: "endpoint",
|
|
13647
14773
|
method: "GET",
|
|
@@ -13920,6 +15046,9 @@ var sections35 = [
|
|
|
13920
15046
|
description: "Get the status, progress, and summary of a matching run. Status progresses from queued to running to completed or failed.",
|
|
13921
15047
|
content: [
|
|
13922
15048
|
{ type: "paragraph", text: "Retrieve the current state of a matching run. Poll this endpoint while `status` is `queued`, `running`, or `ai_resolving` to track progress. Once `completed`, the response includes the top 50 results by confidence. Use the results endpoint for full paginated access." },
|
|
15049
|
+
{ type: "paragraph", text: "Poll this endpoint after triggering a run via `POST /v1/matching/configs/:id/run`. A typical polling pattern is to check every 5-10 seconds while `status` is `queued`, `running`, or `ai_resolving`. Use `GET /v1/matching/runs/:id/progress` for lighter-weight progress updates during long runs." },
|
|
15050
|
+
{ type: "paragraph", text: "Once completed, the response includes `rows_processed`, `rows_matched`, and `avg_confidence` at the run level, plus a `results` array with the top 50 matches by confidence. Each result includes `document_id`, `matched_reference_row_id`, `confidence` score, review `status` (`pending`, `approved`, `rejected`), and per-field `evidence` breakdown." },
|
|
15051
|
+
{ type: "paragraph", text: "For the full result set beyond the top 50, use `GET /v1/matching/runs/:id/results` with pagination. Use `POST /v1/matching/runs/:runId/results/:resultId/review` to approve or reject individual matches. If `status` is `ai_resolving`, the run is using Claude Haiku to disambiguate borderline matches \u2014 this phase adds latency but can significantly improve accuracy on ambiguous rows." },
|
|
13923
15052
|
{ type: "callout", variant: "info", text: "The `ai_resolving` status indicates that the run has finished standard matching and is now running an AI resolution pass on low-confidence rows. This pass uses Claude Haiku to disambiguate borderline matches." },
|
|
13924
15053
|
{
|
|
13925
15054
|
type: "endpoint",
|
|
@@ -14022,6 +15151,9 @@ var sections35 = [
|
|
|
14022
15151
|
description: "Retrieve matching results for a completed run. Returns the top 5 candidates per document with weighted confidence scores and per-field evidence breakdowns.",
|
|
14023
15152
|
content: [
|
|
14024
15153
|
{ type: "paragraph", text: "Retrieve the full paginated results for a completed matching run. Each result represents a document matched (or unmatched) against the reference dataset, with a weighted confidence score and per-field evidence breakdown showing how each field contributed to the overall score." },
|
|
15154
|
+
{ type: "paragraph", text: "Use this endpoint after a run completes to review all matches. Filter by `status=pending` to see matches awaiting review, or `status=approved` to see confirmed matches. Paginate with `page` and `limit` \u2014 the run detail endpoint only shows the top 50 results, while this endpoint provides full access." },
|
|
15155
|
+
{ type: "paragraph", text: "Each result includes a per-field `evidence` object showing the strategy used and individual score for each field mapping. A `null` `matched_reference_row_id` means no reference row scored above the configured threshold for that document. The `confidence` score is the weighted sum of per-field scores using the weights from the matching config." },
|
|
15156
|
+
{ type: "paragraph", text: "Use `POST /v1/matching/runs/:runId/results/:resultId/review` to approve or reject individual matches programmatically. Pair with the config detail endpoint to understand which field mappings and thresholds produced these results. Re-run matching with adjusted weights or a lower threshold to capture more matches." },
|
|
14025
15157
|
{ type: "callout", variant: "info", text: "Results with `status: pending` have not been reviewed. Use `POST /v1/matching/runs/:runId/results/:resultId/review` to approve or reject individual matches. Approved matches can be used downstream for data enrichment and reconciliation workflows." },
|
|
14026
15158
|
{
|
|
14027
15159
|
type: "endpoint",
|
|
@@ -14316,6 +15448,9 @@ var sections36 = [
|
|
|
14316
15448
|
description: "Create a delivery destination with connector type, transport config, and authentication. Supported types: webhook, sftp, s3, azure_blob, google_drive, onedrive.",
|
|
14317
15449
|
content: [
|
|
14318
15450
|
{ type: "paragraph", text: "Create a new delivery destination by specifying the connector type, transport configuration, and optional authentication. The `config` and `auth_config` schemas vary by destination type \u2014 see the catalog endpoint for connector capabilities." },
|
|
15451
|
+
{ type: "paragraph", text: "The typical workflow is: create a destination first, then create one or more bindings that route signals to it. Call `GET /v1/delivery/catalog/connectors` to see which connector types are available and what `config` and `auth_config` schemas each expects." },
|
|
15452
|
+
{ type: "paragraph", text: "The response returns the created destination with `is_active: true` and `last_delivery_at: null`. Auth credentials are never echoed back \u2014 use the `has_auth_config` and `has_signing_secret` booleans to confirm they were stored. After creation, use `POST /v1/delivery/destinations/:id/test` to verify connectivity before setting up bindings." },
|
|
15453
|
+
{ type: "paragraph", text: "For webhook destinations, include a `signing_secret` in `auth_config` to enable HMAC-SHA256 request signing. For file-drop destinations (S3, SFTP, Azure Blob), set `payload_cap_bytes` if you need to override the global 5 MiB cap. OAuth destinations (Google Drive, OneDrive) require completing the OAuth flow first." },
|
|
14319
15454
|
{ type: "callout", variant: "info", text: "OAuth-based destinations (google_drive, onedrive) require completing an OAuth flow before creating the destination. Use the OAuth start endpoint to initiate the flow and obtain tokens." },
|
|
14320
15455
|
{
|
|
14321
15456
|
type: "endpoint",
|
|
@@ -14418,6 +15553,9 @@ var sections36 = [
|
|
|
14418
15553
|
description: "Get destination details, update config, delete a destination, or send a test payload to verify connectivity. Auth credentials are always redacted in responses.",
|
|
14419
15554
|
content: [
|
|
14420
15555
|
{ type: "paragraph", text: "Manage a single destination: retrieve its current config, update transport settings or credentials, delete it, or test connectivity. The **test** endpoint probes the destination without delivering real data \u2014 file-drop connectors (S3, SFTP, Azure Blob) verify bucket/container reachability without writing any objects." },
|
|
15556
|
+
{ type: "paragraph", text: "Use `GET` to inspect current config and delivery status. Use `PUT` to rotate credentials or change the target URL/bucket. Use `POST /test` after updating credentials to verify the new config works before live traffic flows through it. Use `DELETE` only when permanently removing a destination." },
|
|
15557
|
+
{ type: "paragraph", text: "The `GET` response includes `last_delivery_at` and `last_delivery_status` to show the most recent delivery attempt. The `is_active` flag indicates whether the destination is enabled \u2014 destinations are automatically disabled on `auth_failed` or `ssrf_blocked` errors. The test endpoint returns `success`, `durationMs`, and an optional `message` describing what was probed." },
|
|
15558
|
+
{ type: "paragraph", text: "If a destination becomes inactive due to auth failure, fix the credentials via `PUT`, then call the test endpoint to verify. The destination will be re-enabled automatically on a successful update. Prefer disabling (`is_active: false` via `PUT`) over deleting when you want to pause delivery but keep the history." },
|
|
14421
15559
|
{ type: "callout", variant: "warning", text: "Deleting a destination cascades to all its bindings, delivery items, and DLQ entries. This is irreversible. Disable the destination (`is_active: false`) instead if you want to preserve history." },
|
|
14422
15560
|
{
|
|
14423
15561
|
type: "endpoint",
|
|
@@ -14706,6 +15844,9 @@ var sections36 = [
|
|
|
14706
15844
|
description: "Create a delivery binding that routes domain signals through a deliverable resolver and serializer to a destination. Includes field mapping and retry policy configuration.",
|
|
14707
15845
|
content: [
|
|
14708
15846
|
{ type: "paragraph", text: "Create a binding that wires a domain event to a destination. The **compatibility triangle** is validated on creation: the signal event type must be compatible with the deliverable resolver, the serializer must support the deliverable shape, and the connector must support the serializer format." },
|
|
15847
|
+
{ type: "paragraph", text: "The typical workflow is: query the catalog endpoints top-down (signals, then deliverables, then serializers, then connectors), pick compatible values, and create the binding. A single event can fan out to multiple bindings \u2014 create separate bindings for each destination or output format you need." },
|
|
15848
|
+
{ type: "paragraph", text: "The response returns the binding with `is_active: true` and `last_status: null`. The `field_map` controls payload projection: use `static` to inject fixed values, `drop` to remove fields, and key-value pairs to rename fields. The `delivery_policy` defaults to 7 attempts with exponential backoff over ~10 hours if omitted." },
|
|
15849
|
+
{ type: "paragraph", text: "After creation, the binding is immediately live \u2014 the next matching signal will trigger delivery. Use `POST /v1/delivery/bindings/:id/preview` (internal) to dry-run the resolve-project-serialize pipeline. Monitor delivery health via the history and DLQ endpoints." },
|
|
14709
15850
|
{ type: "callout", variant: "info", text: "Use the catalog endpoints (`/v1/delivery/catalog/*`) to discover valid combinations before creating a binding. The catalog lists all available signals, deliverables, serializers, and connectors with their compatibility constraints." },
|
|
14710
15851
|
{
|
|
14711
15852
|
type: "endpoint",
|
|
@@ -14808,6 +15949,10 @@ var sections36 = [
|
|
|
14808
15949
|
description: "Get binding details, update signal filters or field maps, delete a binding, or preview the resolved payload for a binding without sending it.",
|
|
14809
15950
|
content: [
|
|
14810
15951
|
{ type: "paragraph", text: "Manage a single delivery binding: retrieve its configuration, update the signal filter or field map, delete it, or preview the payload it would produce. Updates re-validate the compatibility triangle. Deleting a binding stops future routing but allows in-flight deliveries to complete." },
|
|
15952
|
+
{ type: "paragraph", text: "Use `GET` to inspect the current binding config and `last_status`. Use `PUT` to adjust the signal filter, field map, or retry policy \u2014 changes take effect on the next matching event. Use `DELETE` when the binding is no longer needed; in-flight deliveries already in the job queue will still complete." },
|
|
15953
|
+
{ type: "paragraph", text: "The `PUT` response returns the full updated binding. The compatibility triangle is re-validated on every update \u2014 if you change the `signal_filter.event_type` or `serializer_format`, the system verifies the new combination is still valid. The preview endpoint (`POST /preview`) walks the resolve-project-serialize pipeline with a synthetic signal and returns the wire output without delivering." },
|
|
15954
|
+
{ type: "paragraph", text: "Pair updates with the delivery history endpoint to verify the binding is producing expected results. If `last_status` shows `failed`, check the DLQ for error details before adjusting the binding config." },
|
|
15955
|
+
{ type: "callout", variant: "info", text: "The public API preview endpoint currently returns a stub response. The internal preview endpoint is fully functional and walks the full resolve, project, and serialize pipeline with structural fallback." },
|
|
14811
15956
|
{
|
|
14812
15957
|
type: "endpoint",
|
|
14813
15958
|
method: "GET",
|
|
@@ -15016,6 +16161,9 @@ var sections36 = [
|
|
|
15016
16161
|
description: "View delivery attempt history with status, HTTP codes, and timing. Get detail for a single item or replay a failed delivery attempt.",
|
|
15017
16162
|
content: [
|
|
15018
16163
|
{ type: "paragraph", text: "The delivery history tracks every attempt to deliver a payload to a destination. Each attempt is recorded as a **delivery item** with status, timing, HTTP response code, and optional request/response bodies. Use this endpoint to audit delivery performance and debug failures." },
|
|
16164
|
+
{ type: "paragraph", text: "Query items by `binding_id` or `destination_id` to narrow results to a specific delivery path. Filter by `status` to find failures (`failed`) or in-progress attempts (`in_flight`). Use `GET /v1/delivery/items/:id` to inspect the full request and response bodies for a single attempt." },
|
|
16165
|
+
{ type: "paragraph", text: "Each item includes an `idempotency_key` (deterministic SHA-256 of binding ID and event ID) that is sent on the wire so receivers can deduplicate. The `attempt` field is 1-indexed \u2014 multiple items with the same `event_id` and `binding_id` represent retries of the same delivery. Status values are `in_flight`, `succeeded`, or `failed`." },
|
|
16166
|
+
{ type: "paragraph", text: "Use `POST /v1/delivery/items/:id/replay` to re-enqueue a specific attempt with a fresh attempt number but the same idempotency key. For terminal failures, check the DLQ endpoint instead \u2014 items that exhausted all retries are moved there automatically. Pair history inspection with binding and destination detail to diagnose delivery issues end-to-end." },
|
|
15019
16167
|
{ type: "callout", variant: "info", text: "Request and response bodies are truncated to 10 KB and retained for a configurable period (default 30 days). After the retention period, bodies are nulled but metadata (status, HTTP code, duration, error code) is preserved indefinitely." },
|
|
15020
16168
|
{
|
|
15021
16169
|
type: "endpoint",
|
|
@@ -15674,6 +16822,9 @@ var sections37 = [
|
|
|
15674
16822
|
description: "Get detailed information for a single extraction batch including item counts, provider, status, and timing. Shows per-item breakdown when the batch is completed.",
|
|
15675
16823
|
content: [
|
|
15676
16824
|
{ type: "paragraph", text: "Retrieve the full batch record including per-item status. Poll this endpoint while `status` is `submitted` to track progress. Once `completed`, each item shows its individual outcome and processing timestamp." },
|
|
16825
|
+
{ type: "paragraph", text: "Use this endpoint to monitor a batch after submission. Poll periodically while `status` is `submitted`; results typically arrive within 24 hours. Once `status` changes to `completed`, `failed`, or `cancelled`, polling can stop. Use the sync endpoint to force an immediate provider check instead of waiting for the hourly poll." },
|
|
16826
|
+
{ type: "paragraph", text: "The response includes `items` \u2014 an array of per-document results. Each item has a `status` (`pending`, `processing`, `completed`, or `failed`), the associated `document_id` and `document_filename`, and a `processed_at` timestamp. The `custom_id` field shows the provider-assigned identifier used when submitting to Anthropic or Bedrock." },
|
|
16827
|
+
{ type: "paragraph", text: "Failed items are automatically retried via **realtime** extraction, never re-batched, to preserve the 48-hour SLA. Check the `errored_count` and `expired_count` fields at the batch level, and individual `items[].error_message` for per-document failure details. Pair with `GET /v1/documents/:id` to check the final extraction status of any document in the batch." },
|
|
15677
16828
|
{ type: "callout", variant: "info", text: "Items that fail extraction in the batch are retried via **realtime** extraction (never re-batched) to preserve the original 48-hour SLA. Check `items[].status` for per-document outcomes." },
|
|
15678
16829
|
{
|
|
15679
16830
|
type: "endpoint",
|
|
@@ -15772,6 +16923,9 @@ var sections37 = [
|
|
|
15772
16923
|
description: "Force a sync with the provider to check for batch results. Useful when you do not want to wait for the hourly automatic poll.",
|
|
15773
16924
|
content: [
|
|
15774
16925
|
{ type: "paragraph", text: "Force an immediate check with the batch provider (Anthropic or Bedrock) for results. By default, batches are polled automatically every hour. Use this endpoint when you need results sooner or want to verify the current provider-side status." },
|
|
16926
|
+
{ type: "paragraph", text: "Call sync when you need results before the next hourly poll. A typical pattern is to submit documents in batch mode, wait a few hours, then call sync to check if results are ready. If the batch is still processing, the response reflects the current provider-side status without changing anything." },
|
|
16927
|
+
{ type: "paragraph", text: "The response returns the full batch object with updated counts. If results are ready, `status` transitions to `completed` and `succeeded_count`, `errored_count`, and `expired_count` are populated. If the batch is still processing on the provider side, `status` remains `submitted` and counts stay at zero." },
|
|
16928
|
+
{ type: "paragraph", text: "Syncing an `accumulating` batch has no effect since it has not been submitted to the provider yet. Syncing a `completed` or `cancelled` batch is safe and returns the existing data unchanged. Pair with `GET /v1/batches/:id` to inspect per-item results after the sync completes." },
|
|
15775
16929
|
{
|
|
15776
16930
|
type: "endpoint",
|
|
15777
16931
|
method: "POST",
|
|
@@ -15849,6 +17003,9 @@ var sections37 = [
|
|
|
15849
17003
|
description: "Cancel an in-progress extraction batch. Only batches in accumulating or submitted status can be cancelled. Completed batches cannot be rolled back.",
|
|
15850
17004
|
content: [
|
|
15851
17005
|
{ type: "paragraph", text: "Cancel a batch that is still `accumulating` or `submitted`. Cancellation sends a stop request to the provider if the batch was already submitted. Documents in the cancelled batch revert to `batch_queued` status and can be resubmitted or processed via realtime extraction." },
|
|
17006
|
+
{ type: "paragraph", text: "Use cancellation when you need to abort a batch \u2014 for example, if documents were submitted with an incorrect schema or you need results faster via realtime extraction. Cancel as early as possible; items already processed by the provider before the cancellation lands may still have their results applied." },
|
|
17007
|
+
{ type: "paragraph", text: "The response returns the batch with `status: cancelled`. The `succeeded_count` may be non-zero if some items were processed before cancellation took effect. Documents revert to `batch_queued` status and can be re-processed by updating their `processing_mode` to `realtime` or by including them in a new batch." },
|
|
17008
|
+
{ type: "paragraph", text: "Only batches in `accumulating` or `submitted` status can be cancelled \u2014 calling cancel on a `completed`, `failed`, or already `cancelled` batch returns `400`. Pair with `GET /v1/batches/:id` after cancellation to inspect which items were processed before the stop request landed." },
|
|
15852
17009
|
{
|
|
15853
17010
|
type: "endpoint",
|
|
15854
17011
|
method: "POST",
|
|
@@ -16017,6 +17174,9 @@ var sections38 = [
|
|
|
16017
17174
|
description: "Retrieve a case by its key (e.g. CASE-001) including linked documents, shared entities, AI-generated narration, label, and anomaly count.",
|
|
16018
17175
|
content: [
|
|
16019
17176
|
{ type: "paragraph", text: "Retrieve the full detail of a case including its documents, AI-generated narrative summary, and anomaly count. The narrative is generated by Claude and summarizes the relationships between documents in the case." },
|
|
17177
|
+
{ type: "paragraph", text: "Call this endpoint after listing cases to drill into a specific case. The typical workflow is to list cases with filters, then fetch detail for cases that need review. The response includes the full document list and anomaly count, so you can assess case health in a single call." },
|
|
17178
|
+
{ type: "paragraph", text: "The response includes `documents` (array of document objects with `id`, `filename`, `document_type`, and `created_at`), a `narrative` string (or `null` if narration has not been triggered), and `anomaly_count`. The `links` object provides convenience URLs for the case itself and its documents list." },
|
|
17179
|
+
{ type: "paragraph", text: "Pair with `POST /v1/cases/:key/narrate` to generate narratives, and `GET /v1/cases/:key/evidence` to inspect the field-level linking data. If `anomaly_count` is non-zero, fetch the anomalies endpoint to see which structural issues were detected." },
|
|
16020
17180
|
{ type: "callout", variant: "info", text: "The `narrative` field is generated on demand via `POST /v1/cases/:key/narrate`. It will be `null` until narration is triggered for this case." },
|
|
16021
17181
|
{
|
|
16022
17182
|
type: "endpoint",
|
|
@@ -16222,6 +17382,9 @@ var sections38 = [
|
|
|
16222
17382
|
description: "List evidence items within a case. Filter by validation status, source document, category, or free-text search across evidence fields.",
|
|
16223
17383
|
content: [
|
|
16224
17384
|
{ type: "paragraph", text: "Evidence items are the extracted field values from documents in a case, annotated with validation status and confidence scores. Use evidence to audit the data quality within a case and understand which fields link documents together." },
|
|
17385
|
+
{ type: "paragraph", text: "Use this endpoint after fetching case detail to inspect the field-level data that forms the case. A typical workflow is to filter by `status=invalid` to surface extraction issues, or by `document_id` to audit a specific document's contribution to the case." },
|
|
17386
|
+
{ type: "paragraph", text: "Each evidence item includes a `field_key`, extracted `value`, validation `status` (`valid`, `invalid`, or `pending`), the source `document_id`, an optional `category` (e.g. `identity`, `financial`), and a `confidence` score between 0 and 1. The confidence score reflects extraction certainty and is independent of the validation outcome." },
|
|
17387
|
+
{ type: "paragraph", text: "Combine evidence with the anomalies endpoint to get a complete quality picture. Evidence shows individual field values; anomalies show structural patterns across multiple evidence items (e.g. conflicting values for the same field). Use the `search` parameter for free-text queries across all evidence fields." },
|
|
16225
17388
|
{ type: "callout", variant: "info", text: "Evidence is produced by the evidence validation engine, which runs rule-based validators (structural checks, checksum validation, domain packs) against extracted values. Each evidence item records the validation outcome for a specific field on a specific document." },
|
|
16226
17389
|
{
|
|
16227
17390
|
type: "endpoint",
|
|
@@ -16485,6 +17648,9 @@ var sections38 = [
|
|
|
16485
17648
|
description: "Pin or remove documents within a case. Pinned documents are highlighted in the case view and preserved during case operations.",
|
|
16486
17649
|
content: [
|
|
16487
17650
|
{ type: "paragraph", text: "Manage document membership within a case. **Pin** a document to mark it as important \u2014 pinned documents are highlighted in the UI and preserved during split operations. **Remove** a document to detach it from the case entirely." },
|
|
17651
|
+
{ type: "paragraph", text: "Use pinning to flag key documents during case review \u2014 for example, pin the primary invoice in a multi-document case so it stays visible. Use removal when a document was incorrectly linked and should not belong to this case. Both operations are immediate and do not require a recompute." },
|
|
17652
|
+
{ type: "paragraph", text: 'Pin and remove both return `{ "success": true }` on success and `404` if the case or document is not found. The pin status is reflected in the case detail response from `GET /v1/cases/:key`.' },
|
|
17653
|
+
{ type: "paragraph", text: "Pinned documents are preserved in the original partition during split operations \u2014 they always stay with the case they are pinned to. If you plan to split a case, pin the anchor documents first. Removed documents may reappear in the case after a recompute if linking edges still connect them." },
|
|
16488
17654
|
{ type: "callout", variant: "info", text: "Removing a document from a case does not delete the document itself. The document remains in your workspace and may be re-linked into a case during the next recompute cycle if linking edges still exist." },
|
|
16489
17655
|
{
|
|
16490
17656
|
type: "endpoint",
|
|
@@ -17147,6 +18313,9 @@ var sections40 = [
|
|
|
17147
18313
|
description: "List all ground truth datasets used for benchmarking extraction accuracy. Each dataset contains manually verified entries that serve as the gold standard.",
|
|
17148
18314
|
content: [
|
|
17149
18315
|
{ type: "paragraph", text: "Ground truth datasets contain manually verified data entries that serve as the gold standard for measuring extraction accuracy. Create datasets, add entries, then run benchmarks against extraction results." },
|
|
18316
|
+
{ type: "paragraph", text: "Use this endpoint to see all available datasets before creating a benchmark run. A typical workflow is to list datasets, select the one covering the document type you want to evaluate, then pass its `id` to `POST /v1/quality/benchmarks` to start a run." },
|
|
18317
|
+
{ type: "paragraph", text: "Each dataset includes a `name`, optional `description`, `user_schema_id` (if scoped to a schema), `document_count` (number of verified entries), and a `links.self` URL for the detail endpoint. Datasets are returned in descending creation order with cursor-based pagination." },
|
|
18318
|
+
{ type: "paragraph", text: "Create separate datasets for different document types or schema versions to track accuracy independently. Pair with the benchmark endpoints to measure extraction quality over time \u2014 run benchmarks after schema changes or pipeline updates to detect regressions." },
|
|
17150
18319
|
{ type: "list", ordered: false, items: [
|
|
17151
18320
|
"Each dataset contains verified entries mapping documents to expected field values",
|
|
17152
18321
|
"Datasets can be scoped to a specific user schema via `user_schema_id`",
|
|
@@ -17241,6 +18410,10 @@ var sections40 = [
|
|
|
17241
18410
|
description: "Create a new ground truth dataset linked to a schema. The dataset defines the expected extraction output used for accuracy benchmarking.",
|
|
17242
18411
|
content: [
|
|
17243
18412
|
{ type: "paragraph", text: "Create an empty ground truth dataset that you can populate with verified entries. Datasets serve as the baseline for benchmark runs that measure extraction accuracy. After creating a dataset, add entries individually or import them in bulk via CSV." },
|
|
18413
|
+
{ type: "paragraph", text: "The typical workflow is: create the dataset, then populate it using `POST /v1/quality/ground-truth/:id/entries` for individual entries or `POST /v1/quality/ground-truth/:id/entries/import-csv` for bulk import. Once populated, create a benchmark run with `POST /v1/quality/benchmarks`." },
|
|
18414
|
+
{ type: "paragraph", text: "The response returns the dataset with `document_count: 0` since it is initially empty. The `user_schema_id` is `null` unless you associate it with a schema. The `links.self` URL points to the detail endpoint where you can retrieve entries or delete the dataset." },
|
|
18415
|
+
{ type: "paragraph", text: "For best results, aim for at least 30-50 entries per dataset. Linking a dataset to a `user_schema_id` ensures ground truth field names align with your extraction schema, producing more meaningful benchmark comparisons." },
|
|
18416
|
+
{ type: "callout", variant: "info", text: "Field keys in `expected_data` entries should match the field names used in your extraction schema. Unmatched fields are stored but ignored during benchmark comparison." },
|
|
17244
18417
|
{
|
|
17245
18418
|
type: "endpoint",
|
|
17246
18419
|
method: "POST",
|
|
@@ -17315,6 +18488,9 @@ var sections40 = [
|
|
|
17315
18488
|
description: "Retrieve a ground truth dataset by ID with metadata and entry count, or delete it permanently. Deleting a dataset does not remove associated benchmark results.",
|
|
17316
18489
|
content: [
|
|
17317
18490
|
{ type: "paragraph", text: "Retrieve a dataset with its metadata and sample entries, or delete it permanently. The GET response includes a `samples` array with the actual ground truth entries, allowing you to inspect the expected values for each document." },
|
|
18491
|
+
{ type: "paragraph", text: "Use `GET` to inspect the dataset contents before running a benchmark. The `samples` array contains all ground truth entries with their `document_id`, `expected_data` (key-value map of verified field values), and optional `notes`. This lets you verify the dataset is correctly populated." },
|
|
18492
|
+
{ type: "paragraph", text: "The `document_count` field shows how many entries exist. For large datasets, the `samples` array may produce a sizable response. The `user_schema_id` indicates whether the dataset is scoped to a specific extraction schema, which improves benchmark accuracy by ensuring field name alignment." },
|
|
18493
|
+
{ type: "paragraph", text: "Use `DELETE` when a dataset is outdated or no longer needed. Benchmark results that referenced this dataset are preserved for historical tracking \u2014 the benchmark retains the `dataset_id` even after the dataset itself is removed. Create a new dataset with updated entries rather than modifying existing ones." },
|
|
17318
18494
|
{ type: "callout", variant: "warning", text: "Deleting a dataset is permanent. However, benchmark results that used this dataset are retained for historical reference. The benchmark still shows the `dataset_id`, but the dataset itself is no longer retrievable." },
|
|
17319
18495
|
{
|
|
17320
18496
|
type: "endpoint",
|
|
@@ -17580,6 +18756,9 @@ var sections40 = [
|
|
|
17580
18756
|
description: "List benchmark runs that compare extraction results against ground truth datasets. Each run produces per-field accuracy metrics.",
|
|
17581
18757
|
content: [
|
|
17582
18758
|
{ type: "paragraph", text: "Benchmark runs compare your extraction output against ground truth datasets to produce per-field accuracy scores. Each run evaluates every document in the dataset and produces an `accuracy_overall` score along with per-field breakdowns. Use benchmarks to track extraction quality over time and measure the impact of schema or pipeline changes." },
|
|
18759
|
+
{ type: "paragraph", text: "Use this endpoint to see all benchmark runs and their accuracy scores. A typical workflow is to list benchmarks after making schema or pipeline changes, then compare the latest run against previous ones using `GET /v1/quality/benchmarks/compare` to measure improvement or detect regressions." },
|
|
18760
|
+
{ type: "paragraph", text: "Each benchmark includes `status` (`queued`, `running`, `completed`, or `failed`), `accuracy_overall` (a score from 0 to 1, `null` until the run completes), `accuracy_by_field` (per-field breakdown), and `documents_processed`/`documents_total` for progress tracking. The `accuracy_delta` and `compared_to_run_id` fields support cross-run comparisons." },
|
|
18761
|
+
{ type: "paragraph", text: "Run benchmarks regularly after extraction pipeline changes. Pair with `GET /v1/quality/benchmarks/:id/results` for per-document drill-down showing which fields matched and which diverged. Use the compare endpoint to track accuracy trends across multiple runs." },
|
|
17583
18762
|
{
|
|
17584
18763
|
type: "endpoint",
|
|
17585
18764
|
method: "GET",
|
|
@@ -17689,6 +18868,9 @@ var sections40 = [
|
|
|
17689
18868
|
description: "Start a benchmark run that compares a job run output against a ground truth dataset. Produces per-field accuracy scores and overall metrics.",
|
|
17690
18869
|
content: [
|
|
17691
18870
|
{ type: "paragraph", text: "Start a new benchmark run that evaluates your current extraction output against a ground truth dataset. The benchmark compares each document in the dataset entry-by-entry and field-by-field, producing an overall accuracy score and per-field breakdowns." },
|
|
18871
|
+
{ type: "paragraph", text: "The typical workflow is: create a benchmark after making extraction pipeline changes, poll `GET /v1/quality/benchmarks/:id` until `status` is `completed`, then inspect results. Run multiple benchmarks against the same dataset over time to track accuracy trends." },
|
|
18872
|
+
{ type: "paragraph", text: "The response returns the benchmark with `status: queued`, `accuracy_overall: null`, and `documents_processed: 0`. The `documents_total` field reflects how many entries are in the dataset. Poll the detail endpoint to check `status` and `documents_processed` for progress. Once completed, `accuracy_overall` and `accuracy_by_field` are populated." },
|
|
18873
|
+
{ type: "paragraph", text: "Multiple benchmarks can run in parallel against different datasets. Use `GET /v1/quality/benchmarks/compare` after completion to compare two runs side by side. The `dataset_id` is fixed at creation \u2014 to benchmark against a different dataset, create a new run." },
|
|
17692
18874
|
{ type: "callout", variant: "info", text: "Benchmark runs are asynchronous. The endpoint returns immediately with status `queued`. Poll the benchmark detail endpoint or list benchmarks to check when the run completes." },
|
|
17693
18875
|
{
|
|
17694
18876
|
type: "endpoint",
|
|
@@ -18018,6 +19200,9 @@ var sections41 = [
|
|
|
18018
19200
|
description: "Create a new routing rule with conditions on document properties and actions to apply when matched. Conditions can match document type, source, and other metadata.",
|
|
18019
19201
|
content: [
|
|
18020
19202
|
{ type: "paragraph", text: 'Create a rule that automatically applies actions to incoming documents based on their metadata. Conditions define what to match (e.g. document type equals "invoice"), and actions define what to do (e.g. assign the finance schema). Rules are evaluated on every `document_classified` event.' },
|
|
19203
|
+
{ type: "paragraph", text: 'The typical workflow is: create rules ordered by specificity \u2014 put narrow, high-priority rules first (e.g. "contracts from vendor X") and broader catch-all rules last. New rules are active immediately upon creation, so the next classified document will be evaluated against them.' },
|
|
19204
|
+
{ type: "paragraph", text: "The response returns the rule with `is_active: true`, a `trigger_type` of `document_classified`, and the assigned `priority` (defaults to 100 if omitted). The `action_type` is resolved from the `actions` object. Use the reorder endpoint after creation to adjust the priority relative to existing rules." },
|
|
19205
|
+
{ type: "paragraph", text: "Pair with `GET /v1/routing-rules` to verify the full priority chain after creating a rule. Use `source_connection_id` to scope rules to documents from a specific source \u2014 documents from other sources will skip the rule entirely. To test a rule before going live, create it and immediately disable it via `PATCH` with `is_active: false`." },
|
|
18021
19206
|
{ type: "callout", variant: "info", text: "New rules are created with `is_active: true` by default. If you want to test a rule before activating it, create it, then immediately disable it via `PATCH /v1/routing-rules/:id` with `is_active: false`." },
|
|
18022
19207
|
{
|
|
18023
19208
|
type: "endpoint",
|
|
@@ -18120,6 +19305,10 @@ var sections41 = [
|
|
|
18120
19305
|
description: "Retrieve, update, or delete a routing rule by ID. Update conditions, actions, priority, or enabled state. Deleting a rule does not affect previously routed documents.",
|
|
18121
19306
|
content: [
|
|
18122
19307
|
{ type: "paragraph", text: "Retrieve, update, or delete a single routing rule. Updates take effect immediately \u2014 the next `document_classified` event will use the updated rule. Deleting a rule does not retroactively affect documents that were already routed by it." },
|
|
19308
|
+
{ type: "paragraph", text: "Use `GET` to inspect a rule's conditions, actions, and priority. Use `PATCH` to adjust conditions, change the schema assignment, toggle `is_active`, or update the priority. Use `DELETE` when a rule is no longer needed \u2014 previously routed documents are not affected." },
|
|
19309
|
+
{ type: "paragraph", text: "The `PATCH` response returns the full updated rule including the new `updated_at` timestamp. All fields are optional \u2014 only include fields you want to change. The `is_active` toggle lets you temporarily disable a rule without deleting it, which is useful for testing or during maintenance windows." },
|
|
19310
|
+
{ type: "paragraph", text: "After updating priority via `PATCH`, use `GET /v1/routing-rules` to verify the full evaluation order. For bulk priority changes, prefer the `POST /v1/routing-rules/reorder` endpoint instead of patching individual rules. To replace a rule, create the replacement first, then delete the original." },
|
|
19311
|
+
{ type: "callout", variant: "info", text: "Rule changes only affect future `document_classified` events. Documents already routed by a previous version of the rule retain their assigned schema and routing actions." },
|
|
18123
19312
|
{
|
|
18124
19313
|
type: "endpoint",
|
|
18125
19314
|
method: "GET",
|
|
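The added paragraphs in this hunk say that every `PATCH` field is optional and that only changed fields should be sent. A minimal sketch of that idea follows. It assumes the rule object exposes `conditions`, `actions`, `priority`, and `is_active` as top-level keys; only `priority` and `is_active` are named as fields in the diff, and the other two key names are guesses. `buildPatchBody` is a hypothetical helper, not part of the package.

```javascript
// Hypothetical helper: build a PATCH body that contains only changed fields.
// Assumed field names: conditions, actions (guessed); priority, is_active
// (named in the docs text).
const PATCHABLE_KEYS = ["conditions", "actions", "priority", "is_active"];

function buildPatchBody(current, desired) {
  const body = {};
  for (const key of PATCHABLE_KEYS) {
    // Skip fields the caller did not ask to change.
    if (!(key in desired)) continue;
    // Compare by JSON serialization; this is sensitive to object key order,
    // which is acceptable for a sketch but may send a redundant field.
    if (JSON.stringify(current[key]) !== JSON.stringify(desired[key])) {
      body[key] = desired[key];
    }
  }
  return body;
}

// Example: temporarily disable a rule without touching anything else.
const current = { priority: 3, is_active: true, conditions: { doc_type: "invoice" } };
const patchBody = buildPatchBody(current, { priority: 3, is_active: false });
console.log(JSON.stringify(patchBody));
```

The resulting object would be sent as the JSON body of `PATCH /v1/routing-rules/:id`. An empty object means nothing changed and the request can be skipped.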
@@ -18301,6 +19490,9 @@ var sections41 = [
|
|
|
18301
19490
|
description: "Reorder routing rules by providing an ordered array of rule IDs. Priority values are reassigned sequentially based on the new order.",
|
|
18302
19491
|
content: [
|
|
18303
19492
|
{ type: "paragraph", text: "Reassign priority values for all routing rules at once. Pass an ordered array of rule IDs \u2014 the first ID receives priority 1, the second receives priority 2, and so on. This is the recommended way to change evaluation order after initial creation." },
|
|
19493
|
+
{ type: "paragraph", text: "Use this endpoint when you need to rearrange the evaluation order of multiple rules at once \u2014 for example, when promoting a new rule to the top of the chain or inserting a rule between two existing ones. This is more reliable than patching individual rule priorities, which can create gaps or collisions." },
|
|
19494
|
+
{ type: "paragraph", text: "The response returns a `reordered` array with each rule's `id` and new `priority` value. Priority 1 is evaluated first. The reorder takes effect immediately \u2014 the next `document_classified` event uses the new priority sequence." },
|
|
19495
|
+
{ type: "paragraph", text: "List all rules first via `GET /v1/routing-rules` to get the current IDs and order, then construct the reordered array. Include both active and inactive rules in the array to maintain a consistent priority sequence. Omitting any rule ID results in a validation error." },
|
|
18304
19496
|
{ type: "callout", variant: "warning", text: "All active rule IDs must be included in the `rule_ids` array. Omitting any rule returns a validation error. Inactive rules should also be included to maintain a consistent priority sequence." },
|
|
18305
19497
|
{
|
|
18306
19498
|
type: "endpoint",
|
|
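The reorder paragraphs in this hunk describe listing all rules, then sending a complete ordered `rule_ids` array. The sketch below builds that array when promoting one rule to the top of the chain. It assumes the list response can be reduced to an array of objects with `id` and `priority`; that shape is not confirmed by the diff. `buildReorderBody` is a hypothetical helper.

```javascript
// Hypothetical helper: build the body for POST /v1/routing-rules/reorder
// that moves one rule to priority 1 and keeps every other rule in its
// current relative order. Assumes each rule has `id` and `priority`.
function buildReorderBody(rules, promoteId) {
  // Start from the current evaluation order (priority 1 is evaluated first).
  const ids = [...rules]
    .sort((a, b) => a.priority - b.priority)
    .map((rule) => rule.id);

  if (!ids.includes(promoteId)) {
    throw new Error(`unknown rule id: ${promoteId}`);
  }

  // Every rule stays in the array, active or inactive, so no ID is omitted.
  return { rule_ids: [promoteId, ...ids.filter((id) => id !== promoteId)] };
}

// Example with made-up IDs: promote the lowest-priority rule to the top.
const listed = [
  { id: "rule_b", priority: 2, is_active: true },
  { id: "rule_a", priority: 1, is_active: true },
  { id: "rule_c", priority: 3, is_active: false },
];
console.log(JSON.stringify(buildReorderBody(listed, "rule_c")));
```

Building the array from the full list, rather than by hand, is one way to avoid the validation error the docs mention for omitted IDs.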
@@ -18806,6 +19998,10 @@ var sections44 = [
|
|
|
18806
19998
|
description: "All Talonic API errors return a consistent JSON envelope with a machine-readable code, human-readable message, HTTP status, retryable flag, request ID, and timestamp.",
|
|
18807
19999
|
content: [
|
|
18808
20000
|
{ type: "paragraph", text: "All errors return a consistent JSON envelope. The `retryable` field tells you whether the request can be retried with the same parameters." },
|
|
20001
|
+
{ type: "paragraph", text: "Most integrations parse the `code` field for programmatic error handling and display the `message` field to users. A typical error handler checks `retryable` first \u2014 if `true`, queue the request for retry with exponential backoff; if `false`, surface the `message` to the caller and stop." },
|
|
20002
|
+
{ type: "paragraph", text: "The `request_id` field (prefixed with `req_`) uniquely identifies the failed request and is essential for debugging with Talonic support. The `path` field confirms which endpoint produced the error, and `timestamp` records when it occurred in ISO 8601 format." },
|
|
20003
|
+
{ type: "paragraph", text: "Pair error handling with the [Error Codes](error-codes) reference to map each `code` value to the correct remediation action. Note that `statusCode` always matches the HTTP response status, so you can use either for branching logic in your client." },
|
|
20004
|
+
{ type: "callout", text: "Always log the `request_id` from error responses. When contacting support, include it for faster resolution \u2014 it links directly to the server-side request trace." },
|
|
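The added error-handling paragraphs describe checking `retryable` first, retrying with exponential backoff, and otherwise surfacing `message`. A minimal sketch of that decision follows. The field names `retryable`, `code`, `message`, and `request_id` come from the docs text. The backoff base, cap, and attempt limit are illustrative assumptions, and `planErrorHandling` is a hypothetical helper.

```javascript
// Hypothetical helper: decide what to do with a parsed error envelope.
// baseDelayMs, maxDelayMs and maxAttempts are illustrative defaults,
// not values recommended by the API documentation.
function planErrorHandling(errorBody, attempt, opts = {}) {
  const { baseDelayMs = 500, maxDelayMs = 30000, maxAttempts = 5 } = opts;

  // Keep the request ID in every outcome so callers can log it.
  const requestId = errorBody.request_id;

  if (errorBody.retryable === true && attempt < maxAttempts) {
    // Exponential backoff: base * 2^attempt, capped at maxDelayMs.
    const delayMs = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
    return { action: "retry", delayMs, requestId };
  }

  // Not retryable, or retries exhausted: surface the message and stop.
  return {
    action: "fail",
    code: errorBody.code,
    message: errorBody.message,
    requestId,
  };
}

// Example with a made-up envelope; attempt counts start at 0.
const plan = planErrorHandling(
  { code: "example_code", message: "Example message", retryable: true, request_id: "req_example" },
  2
);
console.log(JSON.stringify(plan));
```

Returning a plain decision object keeps the policy separate from the HTTP client, so the same function can drive a queue, a loop with a timer, or a test.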
18809
20005
|
{
|
|
18810
20006
|
type: "code",
|
|
18811
20007
|
title: "Error response envelope",
|