@talonic/docs 0.20.11 → 0.20.13
- package/dist/content.js +882 -155
- package/package.json +1 -1
package/dist/content.js
CHANGED
|
@@ -422,15 +422,19 @@ var sections = [
|
|
|
422
422
|
content: [
|
|
423
423
|
{
|
|
424
424
|
type: "paragraph",
|
|
425
|
-
text: "Transform unstructured documents into structured, validated data \u2014 with per-cell provenance and AI reasoning traces."
|
|
425
|
+
text: "Transform unstructured documents into structured, validated data \u2014 with per-cell provenance and AI reasoning traces. Whether you are processing invoices, contracts, purchase orders, or any other document type, Talonic automatically discovers every data point and maps it to a unified knowledge graph. The platform handles the entire pipeline from OCR and classification through to structured output delivery, so you can focus on defining what data you need rather than building extraction logic."
|
|
426
426
|
},
|
|
427
427
|
{
|
|
428
428
|
type: "paragraph",
|
|
429
|
-
text: "**Supported Formats:** 25+ file types. **Resolution:** 4-phase pipeline. **Instant Matches:** ~30% of cells (free)."
|
|
429
|
+
text: "**Supported Formats:** 25+ file types. **Resolution:** 4-phase pipeline. **Instant Matches:** ~30% of cells (free). The platform ingests PDF, DOCX, XLSX, images (PNG, JPG), HTML, JSON, CSV, email formats (EML, MSG), and ZIP archives. ZIP files are unpacked automatically and each contained document is processed individually through the same pipeline."
|
|
430
430
|
},
|
|
431
431
|
{
|
|
432
432
|
type: "paragraph",
|
|
433
|
-
text: "Talonic is an **agentic data structuring platform**. It ingests documents of any type, discovers every data point inside them, builds a knowledge graph of canonical fields, and deploys AI agents to fill structured output schemas. Every cell in the output carries provenance metadata \u2014 which pipeline phase filled it, the confidence score, and an AI reasoning trace linking back to the source document."
|
|
433
|
+
text: "Talonic is an **agentic data structuring platform**. It ingests documents of any type, discovers every data point inside them, builds a knowledge graph of canonical fields, and deploys AI agents to fill structured output schemas. Every cell in the output carries provenance metadata \u2014 which pipeline phase filled it, the confidence score, and an AI reasoning trace linking back to the source document. The knowledge graph grows with every document you process \u2014 fields are clustered semantically, promoted through tiers, and enriched with master extraction instructions. This accumulated intelligence means extraction accuracy and speed improve over time without any manual tuning."
|
|
434
|
+
},
|
|
435
|
+
{
|
|
436
|
+
type: "paragraph",
|
|
437
|
+
text: "The platform is organized around three sections accessible from the sidebar: **Sources** (where documents enter the system), **Structuring** (where you define schemas, run extraction jobs, and review results), and **Outputs** (where approved data is delivered to downstream systems). Most teams start by uploading a small sample of documents, reviewing the auto-generated schemas, and then creating targeted extraction jobs. As the knowledge graph matures, an increasing share of cells are resolved instantly through graph lookup rather than AI extraction."
|
|
434
438
|
},
|
|
435
439
|
{
|
|
436
440
|
type: "list",
|
|
@@ -440,13 +444,19 @@ var sections = [
|
|
|
440
444
|
"**4-phase extraction pipeline** \u2014 resolve from the knowledge graph, extract with AI agents, re-resolve, then transform and validate.",
|
|
441
445
|
"**~30% instant matches** \u2014 cells filled from graph lookup are free and instant, reducing both cost and latency.",
|
|
442
446
|
"**Per-cell provenance** \u2014 every value traces back to its source with confidence scores and reasoning.",
|
|
443
|
-
"**Batch mode** \u2014 process large backlogs at 50% cost with a 48-hour delivery window."
|
|
447
|
+
"**Batch mode** \u2014 process large backlogs at 50% cost with a 48-hour delivery window.",
|
|
448
|
+
"**10 source connectors** \u2014 Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion, SQL databases, Amazon S3, and Azure Blob Storage.",
|
|
449
|
+
"**6 delivery connectors** \u2014 webhook (HMAC-SHA256), SFTP, Amazon S3, Azure Blob, Google Drive, and OneDrive for pushing results downstream."
|
|
444
450
|
]
|
|
445
451
|
},
|
|
452
|
+
{
|
|
453
|
+
type: "paragraph",
|
|
454
|
+
text: "The extraction pipeline uses a cascading design that minimizes both cost and latency. Phase 1 fills cells using deterministic graph lookups at zero AI cost. Phase 2 deploys AI agents only for cells that Phase 1 could not resolve. Phase 3 re-runs lookups on AI output to normalize values to canonical codes. Phase 4 applies transforms, validates constraints, and fills remaining gaps with full grid context. This approach ensures that the cheapest, fastest method always runs first."
|
|
455
|
+
},
|
|
446
456
|
{
|
|
447
457
|
type: "callout",
|
|
448
458
|
variant: "info",
|
|
449
|
-
text: "Talonic uses Anthropic Claude for intelligent extraction and reasoning. The platform handles OCR, classification, field discovery, and schema generation automatically \u2014 you provide documents and define what output you need."
|
|
459
|
+
text: "Talonic uses Anthropic Claude for intelligent extraction and reasoning. The platform handles OCR, classification, field discovery, and schema generation automatically \u2014 you provide documents and define what output you need. Document AI (Mistral) handles OCR by default, with automatic fallback to the Talonic API for unsupported formats."
|
|
450
460
|
}
|
|
451
461
|
],
|
|
452
462
|
related: [
|
|
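The delivery connector list in this hunk mentions webhooks signed with HMAC-SHA256. As a rough illustration of how a receiver might check such a signature, here is a minimal sketch using Node's built-in `crypto` module. The diff does not say how the signature is encoded or which header carries it, so the hex encoding and the function names `signPayload` and `verifySignature` are assumptions made for this example, not Talonic's documented API.

```javascript
const crypto = require("node:crypto");

// Compute a hex-encoded HMAC-SHA256 over the raw request body.
// Assumption: the sender signs the exact raw bytes with a shared secret.
function signPayload(secret, rawBody) {
  return crypto.createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Compare the received signature with the expected one in constant time.
function verifySignature(secret, rawBody, receivedHex) {
  const expected = Buffer.from(signPayload(secret, rawBody), "hex");
  const received = Buffer.from(String(receivedHex), "hex");
  // timingSafeEqual throws when lengths differ, so check the length first.
  if (expected.length !== received.length) return false;
  return crypto.timingSafeEqual(expected, received);
}
```

Verifying against the raw body, before any JSON parsing, matters because re-serializing a parsed payload can change its bytes and invalidate the signature.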
@@ -457,19 +467,19 @@ var sections = [
|
|
|
457
467
|
faq: [
|
|
458
468
|
{
|
|
459
469
|
question: "What is the Talonic Platform?",
|
|
460
|
-
answer: "Talonic is an agentic data structuring platform that ingests unstructured documents, discovers fields, builds a knowledge graph, and produces structured output with per-cell provenance and confidence scores."
|
|
470
|
+
answer: "Talonic is an agentic data structuring platform that ingests unstructured documents, discovers fields, builds a knowledge graph, and produces structured output with per-cell provenance and confidence scores. It supports 25+ file formats and uses a 4-phase extraction pipeline that combines deterministic graph lookups with AI agents to maximize accuracy while minimizing cost."
|
|
461
471
|
},
|
|
462
472
|
{
|
|
463
473
|
question: "How many file formats does Talonic support?",
|
|
464
|
-
answer: "Talonic supports 25+ file types including PDF, DOCX, XLSX, images (PNG, JPG), plain text, HTML, JSON, CSV, email formats (EML, MSG), and ZIP archives."
|
|
474
|
+
answer: "Talonic supports 25+ file types including PDF, DOCX, XLSX, images (PNG, JPG), plain text, HTML, JSON, CSV, email formats (EML, MSG), and ZIP archives. ZIP files are unpacked automatically and each contained document is processed individually through the full extraction pipeline."
|
|
465
475
|
},
|
|
466
476
|
{
|
|
467
477
|
question: 'What does "per-cell provenance" mean?',
|
|
468
|
-
answer: "Every cell in the structured output carries metadata about which pipeline phase filled it, a confidence score, an AI reasoning trace, and references back to the source document. This makes every value auditable and explainable."
|
|
478
|
+
answer: "Every cell in the structured output carries metadata about which pipeline phase filled it, a confidence score, an AI reasoning trace, and references back to the source document. This makes every value auditable and explainable. You can hover any cell to see its confidence score and click it to expand the full provenance panel with reasoning details."
|
|
469
479
|
},
|
|
470
480
|
{
|
|
471
481
|
question: "How much do instant graph matches cost?",
|
|
472
|
-
answer: "Graph matches (approximately 30% of cells) are free. They are filled from the knowledge graph through deterministic lookup, so no LLM call is needed. Only cells that require AI extraction incur cost."
|
|
482
|
+
answer: "Graph matches (approximately 30% of cells) are free. They are filled from the knowledge graph through deterministic lookup, so no LLM call is needed. Only cells that require AI extraction incur cost. As your knowledge graph grows from processing more documents, the percentage of free graph matches increases, further reducing per-job costs."
|
|
473
483
|
}
|
|
474
484
|
],
|
|
475
485
|
mentions: [
|
|
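The cascading pipeline described in the hunks above (graph lookup first, AI only for unresolved cells, then re-resolution and validation) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the platform's implementation: `runCascade`, the `graph` and `canonicalCodes` maps, and the `aiExtract` callback are all hypothetical stand-ins, and Phase 4 is reduced to a trivial non-empty check.

```javascript
// Hypothetical sketch of a cheapest-first cascade.
// graph: Map of field -> known value (stands in for the knowledge graph).
// aiExtract: function(field) -> raw value or undefined (stands in for an AI agent).
// canonicalCodes: Map of raw value -> canonical code (stands in for re-resolution).
function runCascade(fields, graph, aiExtract, canonicalCodes) {
  const cells = {};
  let aiCalls = 0;
  for (const field of fields) {
    // Phase 1: deterministic lookup, no AI call needed.
    if (graph.has(field)) {
      cells[field] = { value: graph.get(field), phase: 1 };
      continue;
    }
    // Phase 2: call the AI stand-in only for cells Phase 1 left empty.
    aiCalls += 1;
    const raw = aiExtract(field);
    if (raw === undefined || raw === null) {
      cells[field] = { value: null, phase: null };
      continue;
    }
    // Phase 3: re-run a lookup on the AI output to normalize it.
    const key = String(raw).trim();
    if (canonicalCodes.has(key)) {
      cells[field] = { value: canonicalCodes.get(key), phase: 3 };
    } else {
      cells[field] = { value: key, phase: 2 };
    }
  }
  // Phase 4 (simplified): validate every cell; here only a non-empty check.
  for (const cell of Object.values(cells)) {
    cell.valid = cell.value !== null && cell.value !== "";
  }
  return { cells, aiCalls };
}
```

Recording the phase on each cell mirrors the per-cell provenance idea: a reader of the output can tell which values came from a lookup and which required an AI call.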
@@ -757,7 +767,7 @@ var sections2 = [
|
|
|
757
767
|
content: [
|
|
758
768
|
{
|
|
759
769
|
type: "paragraph",
|
|
760
|
-
text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language."
|
|
770
|
+
text: "Talonic includes an embedded AI agent accessible from every page via `Cmd+I` (`Ctrl+I` on Windows). The agent understands your workspace context and can inspect schemas, search documents, analyze extraction quality, explore cases, and build schemas \u2014 all through natural language. It serves as a conversational interface to your entire workspace, eliminating the need to navigate through multiple pages to find information or perform common operations."
|
|
761
771
|
},
|
|
762
772
|
{
|
|
763
773
|
type: "paragraph",
|
|
@@ -769,7 +779,7 @@ var sections2 = [
|
|
|
769
779
|
},
|
|
770
780
|
{
|
|
771
781
|
type: "paragraph",
|
|
772
|
-
text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page."
|
|
782
|
+
text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only (Viewer) access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page. The agent also cannot modify field registry entries directly \u2014 those changes flow through the resolution pipeline."
|
|
773
783
|
},
|
|
774
784
|
{ type: "heading", level: 3, id: "agent-capabilities", text: "What the Agent Can Do" },
|
|
775
785
|
{
|
|
@@ -842,7 +852,11 @@ var sections2 = [
|
|
|
842
852
|
},
|
|
843
853
|
{
|
|
844
854
|
question: "Can the AI agent access external systems or the internet?",
|
|
845
|
-
answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform."
|
|
855
|
+
answer: "No. The agent only works with data already in your Talonic workspace. It cannot browse the internet, call external APIs, or access systems outside the platform. All data the agent references comes from your documents, schemas, field registry, and job results."
|
|
856
|
+
},
|
|
857
|
+
{
|
|
858
|
+
question: "What are good questions to ask the agent?",
|
|
859
|
+
answer: 'Try questions like "Show me all invoices processed this week", "What fields does my Invoice schema have?", "Create a schema for purchase orders with vendor name, PO number, and total amount", or "Why was this document classified as a Service Agreement?" The agent handles both read-only queries and schema creation commands.'
|
|
846
860
|
}
|
|
847
861
|
],
|
|
848
862
|
mentions: [
|
|
@@ -862,7 +876,7 @@ var sections2 = [
|
|
|
862
876
|
content: [
|
|
863
877
|
{
|
|
864
878
|
type: "paragraph",
|
|
865
|
-
text: "Every agent action is classified into an impact level that determines how it executes. Higher-impact operations require progressively more explicit confirmation."
|
|
879
|
+
text: "Every agent action is classified into an impact level that determines how it executes. Higher-impact operations require progressively more explicit confirmation. This graduated safety model ensures that the agent can be used freely for exploration and analysis while preventing accidental modifications to live data."
|
|
866
880
|
},
|
|
867
881
|
{
|
|
868
882
|
type: "param-table",
|
|
@@ -902,9 +916,29 @@ var sections2 = [
|
|
|
902
916
|
type: "paragraph",
|
|
903
917
|
text: 'The `live_mutation` and `irreversible` levels provide escalating safety gates for operations that affect production data. A `live_mutation` \u2014 such as triggering a job run or publishing a schema \u2014 presents a confirmation dialog that you must accept. An `irreversible` action \u2014 such as deleting a source or purging documents \u2014 requires you to type a confirmation keyword (e.g., "DELETE") to proceed, preventing accidental data loss.'
|
|
904
918
|
},
|
|
919
|
+
{
|
|
920
|
+
type: "heading",
|
|
921
|
+
level: 3,
|
|
922
|
+
id: "impact-examples",
|
|
923
|
+
text: "Common Actions by Impact Level"
|
|
924
|
+
},
|
|
925
|
+
{
|
|
926
|
+
type: "list",
|
|
927
|
+
ordered: false,
|
|
928
|
+
items: [
|
|
929
|
+
'**read** \u2014 "Show me the fields in this document", "What schemas do I have?", "How many invoices were processed today?"',
|
|
930
|
+
`**draft_mutation** \u2014 "Create a schema for invoices with these fields", "Add a 'due date' field to my schema draft"`,
|
|
931
|
+
'**live_mutation** \u2014 "Publish this schema draft", "Run extraction on these 50 documents"',
|
|
932
|
+
'**irreversible** \u2014 "Delete this source and all its documents", "Purge all data for this document type"'
|
|
933
|
+
]
|
|
934
|
+
},
|
|
935
|
+
{
|
|
936
|
+
type: "paragraph",
|
|
937
|
+
text: "The impact level system also respects your team role. Team members with the Viewer role can only trigger `read` level actions through the agent \u2014 attempting commands that would modify data will be rejected with a clear permissions error. Members, Admins, and Owners can trigger higher impact levels according to their role permissions."
|
|
938
|
+
},
|
|
905
939
|
{
|
|
906
940
|
type: "callout",
|
|
907
|
-
text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready."
|
|
941
|
+
text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready. This means you can experiment freely with schema designs through the agent without any risk to your production configuration."
|
|
908
942
|
}
|
|
909
943
|
],
|
|
910
944
|
related: [
|
|
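The graduated safety model in this hunk combines two checks: a role ceiling and a confirmation step that escalates with impact. A minimal sketch of that gating logic follows. Only the Viewer-to-`read` limit, the confirmation dialog for `live_mutation`, and the typed keyword for `irreversible` come from the text; the ceilings for Member, Admin, and Owner, the `checkAction` name, and the shape of the `confirmation` object are assumptions for illustration.

```javascript
// Impact levels in ascending order, as named in the documentation text.
const LEVELS = ["read", "draft_mutation", "live_mutation", "irreversible"];

// Role ceilings. Only Viewer -> read is stated in the text;
// the other three entries are assumptions for this sketch.
const MAX_LEVEL_BY_ROLE = {
  Viewer: "read",
  Member: "live_mutation",
  Admin: "irreversible",
  Owner: "irreversible",
};

// Decide whether an action may run, given the role and any confirmation supplied.
function checkAction(role, level, confirmation = {}) {
  const requested = LEVELS.indexOf(level);
  const ceiling = LEVELS.indexOf(MAX_LEVEL_BY_ROLE[role]);
  if (requested === -1 || ceiling === -1) {
    return { allowed: false, reason: "unknown role or level" };
  }
  if (requested > ceiling) {
    return { allowed: false, reason: "permission denied" };
  }
  // live_mutation: the user must accept a confirmation dialog.
  if (level === "live_mutation" && confirmation.accepted !== true) {
    return { allowed: false, reason: "confirmation required" };
  }
  // irreversible: the user must type the confirmation keyword exactly.
  if (level === "irreversible" && confirmation.keyword !== "DELETE") {
    return { allowed: false, reason: "keyword required" };
  }
  return { allowed: true, reason: "ok" };
}
```

Checking the role before the confirmation means a Viewer is rejected with a permissions error immediately, rather than being shown a dialog for an action they could never complete.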
@@ -936,7 +970,7 @@ var sections2 = [
|
|
|
936
970
|
content: [
|
|
937
971
|
{
|
|
938
972
|
type: "paragraph",
|
|
939
|
-
text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction. The agent input field lets you type any question directly from the dashboard."
|
|
973
|
+
text: "The home page (click the Talonic logo) shows smart suggested prompts based on your workspace state. Prompts adapt to what is happening: active runs, schema creation opportunities, document types waiting for extraction, and pending field confirmations. The agent input field lets you type any question directly from the dashboard, making it the natural starting point for each session."
|
|
940
974
|
},
|
|
941
975
|
{
|
|
942
976
|
type: "paragraph",
|
|
@@ -949,6 +983,21 @@ var sections2 = [
|
|
|
949
983
|
{
|
|
950
984
|
type: "paragraph",
|
|
951
985
|
text: "Every conversation with the agent is preserved in your session history, accessible from the dashboard. You can revisit previous questions and their answers, which is useful for auditing decisions or recalling how you configured a particular schema. The conversation history also provides continuity \u2014 if you asked the agent to analyze extraction quality last week, you can pick up where you left off."
|
|
986
|
+
},
|
|
987
|
+
{
|
|
988
|
+
type: "heading",
|
|
989
|
+
level: 3,
|
|
990
|
+
id: "dashboard-metrics",
|
|
991
|
+
text: "Dashboard Metrics"
|
|
992
|
+
},
|
|
993
|
+
{
|
|
994
|
+
type: "paragraph",
|
|
995
|
+
text: "The dashboard surfaces key telemetry metrics from your workspace. **Capture rate** measures the percentage of schema fields that were successfully extracted across your latest job runs. **Resolve rate** tracks how many extracted fields were resolved against the registry without AI intervention. **Synthesize rate** shows how many registry fields have master instructions. Together, these three metrics give you a quick health check on your extraction pipeline \u2014 a high resolve rate means your registry is mature and extraction costs are low."
|
|
996
|
+
},
|
|
997
|
+
{
|
|
998
|
+
type: "callout",
|
|
999
|
+
variant: "info",
|
|
1000
|
+
text: 'Try asking the agent questions like "What is my capture rate?", "Which document types need schemas?", or "Show me recent extraction failures" directly from the dashboard. The suggested prompts adapt to your workspace state, but you can always type any question.'
|
|
952
1001
|
}
|
|
953
1002
|
],
|
|
954
1003
|
related: [
|
|
@@ -984,11 +1033,11 @@ var sections3 = [
|
|
|
984
1033
|
content: [
|
|
985
1034
|
{
|
|
986
1035
|
type: "paragraph",
|
|
987
|
-
text: "Sources are the entry point for all data. Every document belongs to a source \u2014 whether uploaded manually, synced from Google Drive, or ingested via API."
|
|
1036
|
+
text: "Sources are the entry point for all data in Talonic. Every document belongs to a source \u2014 whether uploaded manually, synced from Google Drive, or ingested via the public API. A source acts as a container that groups related documents together and tracks their ingestion method, processing status, and connection credentials. You can create multiple sources to organize documents by department, client, or workflow."
|
|
988
1037
|
},
|
|
989
1038
|
{
|
|
990
1039
|
type: "paragraph",
|
|
991
|
-
text: "The Sources page provides a drag-and-drop upload interface. You can upload individual files, multiple files, or entire folders. ZIP archives are unpacked recursively. When uploading folders, the original file path is preserved as a data field (`source_file_path`) on each document \u2014 available for downstream processing and export."
|
|
1040
|
+
text: "The Sources page provides a drag-and-drop upload interface. You can upload individual files, multiple files, or entire folders. ZIP archives are unpacked recursively \u2014 nested ZIPs are also extracted, so deeply packaged archives are handled automatically. When uploading folders, the original file path is preserved as a data field (`source_file_path`) on each document \u2014 available for downstream processing and export."
|
|
992
1041
|
},
|
|
993
1042
|
{
|
|
994
1043
|
type: "ui-excerpt",
|
|
@@ -998,11 +1047,37 @@ var sections3 = [
|
|
|
998
1047
|
},
|
|
999
1048
|
{
|
|
1000
1049
|
type: "paragraph",
|
|
1001
|
-
text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. Processing runs asynchronously so you can continue working."
|
|
1050
|
+
text: "Files are deduplicated via SHA-256 hashing \u2014 uploading the same file twice won't create duplicates. The hash is computed from the raw file bytes before any processing begins, so identical files are caught regardless of filename or upload method. Processing runs asynchronously so you can continue working while documents flow through OCR, classification, and extraction in the background."
|
|
1002
1051
|
},
|
|
1003
1052
|
{
|
|
1004
1053
|
type: "paragraph",
|
|
1005
1054
|
text: "When uploading folders or ZIP archives, the original directory structure is preserved as a `source_file_path` metadata field on each document (e.g., `contracts/2026/lease.pdf`). This field is available for filtering, export, and schema mapping \u2014 just like any AI-extracted field. It provides a natural way to organize and trace documents back to their original location in your file system."
|
|
1055
|
+
},
|
|
1056
|
+
{
|
|
1057
|
+
type: "heading",
|
|
1058
|
+
level: 3,
|
|
1059
|
+
id: "upload-workflow",
|
|
1060
|
+
text: "Upload Workflow"
|
|
1061
|
+
},
|
|
1062
|
+
{
|
|
1063
|
+
type: "list",
|
|
1064
|
+
ordered: true,
|
|
1065
|
+
items: [
|
|
1066
|
+
"Navigate to the **Sources** page from the sidebar.",
|
|
1067
|
+
"Click **New Source** to create a container, or select an existing source.",
|
|
1068
|
+
"Drag files, folders, or ZIP archives onto the upload area.",
|
|
1069
|
+
"Monitor the progress indicator \u2014 each file shows its current stage (uploading, OCR, classifying, extracting).",
|
|
1070
|
+
"Once processing completes, documents appear in the source list with their extracted fields and classification."
|
|
1071
|
+
]
|
|
1072
|
+
},
|
|
1073
|
+
{
|
|
1074
|
+
type: "paragraph",
|
|
1075
|
+
text: "Large uploads (100+ files) are throttled to avoid overloading the pipeline, but you can monitor progress from the source page where each document shows its current processing stage. For very large ingestion jobs, consider using **batch processing mode** which defers AI extraction to run at 50% cost. You can also combine manual uploads with source connectors \u2014 for example, uploading a backlog of historical files manually while connecting Google Drive for ongoing ingestion."
|
|
1076
|
+
},
|
|
1077
|
+
{
|
|
1078
|
+
type: "callout",
|
|
1079
|
+
variant: "info",
|
|
1080
|
+
text: "Use the **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) to upload a single document from any page without navigating to Sources first. This opens a streamlined upload interface that processes the document immediately and shows results inline."
|
|
1006
1081
|
}
|
|
1007
1082
|
],
|
|
1008
1083
|
related: [
|
|
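The deduplication behavior described in this hunk (a SHA-256 hash over the raw file bytes, independent of filename or upload method) can be illustrated with Node's built-in `crypto` module. This is a minimal sketch of the general technique, not Talonic's code: `contentHash` and `DedupIndex` are hypothetical names, and an in-memory `Map` stands in for whatever store the platform actually uses.

```javascript
const crypto = require("node:crypto");

// Hash the raw bytes only; the filename never enters the digest.
function contentHash(bytes) {
  return crypto.createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical in-memory index keyed by content hash.
class DedupIndex {
  constructor() {
    this.seen = new Map(); // hash -> name of the first upload with that content
  }

  // Register an upload and report whether identical bytes were seen before.
  add(name, bytes) {
    const hash = contentHash(bytes);
    if (this.seen.has(hash)) {
      return { duplicate: true, hash, firstName: this.seen.get(hash) };
    }
    this.seen.set(hash, name);
    return { duplicate: false, hash, firstName: name };
  }
}
```

Because the key is derived from content alone, renaming a file or uploading it through a different route yields the same hash and is caught as a duplicate, while a single changed byte produces a different hash.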
@@ -1013,15 +1088,19 @@ var sections3 = [
|
|
|
1013
1088
|
faq: [
|
|
1014
1089
|
{
|
|
1015
1090
|
question: "How do I upload documents to Talonic?",
|
|
1016
|
-
answer: "Drag files or folders onto the Sources page upload area. You can upload individual files, multiple files, entire folders, or ZIP archives that are unpacked recursively."
|
|
1091
|
+
answer: "Drag files or folders onto the Sources page upload area. You can upload individual files, multiple files, entire folders, or ZIP archives that are unpacked recursively. You can also use the quick extract shortcut (Cmd+J / Ctrl+J) to upload a single file from any page."
|
|
1017
1092
|
},
|
|
1018
1093
|
{
|
|
1019
1094
|
question: "Does Talonic detect duplicate uploads?",
|
|
1020
|
-
answer: "Yes. Files are deduplicated via SHA-256 hashing. Uploading the same file twice will not create duplicates."
|
|
1095
|
+
answer: "Yes. Files are deduplicated via SHA-256 hashing computed from the raw file bytes. Uploading the same file twice will not create duplicates, regardless of filename or upload method."
|
|
1021
1096
|
},
|
|
1022
1097
|
{
|
|
1023
1098
|
question: "What happens when I upload a folder or ZIP archive?",
|
|
1024
|
-
answer: "ZIP archives are unpacked recursively and each file is processed individually. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering and export."
|
|
1099
|
+
answer: "ZIP archives are unpacked recursively and each file is processed individually. Nested ZIPs are also extracted. Folders preserve the original directory structure as a source_file_path metadata field on each document, available for filtering, schema mapping, and export."
|
|
1100
|
+
},
|
|
1101
|
+
{
|
|
1102
|
+
question: "Can I upload large batches of documents?",
|
|
1103
|
+
answer: "Yes. Large uploads (100+ files) are automatically throttled to prevent pipeline overload. Each document processes independently through OCR, classification, and extraction. For very large batches, consider batch processing mode which defers extraction at 50% cost."
|
|
1025
1104
|
}
|
|
1026
1105
|
],
|
|
1027
1106
|
mentions: [
|
|
@@ -1082,6 +1161,10 @@ var sections3 = [
|
|
|
1082
1161
|
type: "paragraph",
|
|
1083
1162
|
text: "The **text fast-path** is the most efficient route: files like CSV, JSON, and plain text are read directly into memory with no external API call. This means they process almost instantly and incur no OCR cost. Email files (EML, MSG) are parsed to extract both the message body and any attachments, with each attachment processed as a separate document."
|
|
1084
1163
|
},
|
|
1164
|
+
{
|
|
1165
|
+
type: "paragraph",
|
|
1166
|
+
text: "Email files receive special handling. EML files are parsed to extract both the message body and any attachments, with each attachment processed as a separate document linked back to the parent email. MSG files (Microsoft Outlook format) follow the OCR path and similarly extract embedded attachments. This means a single email file can produce multiple documents \u2014 the email body and each of its attachments \u2014 all processed independently through the full pipeline."
|
|
1167
|
+
},
|
|
1085
1168
|
{
|
|
1086
1169
|
type: "callout",
|
|
1087
1170
|
variant: "info",
|
|
@@ -1117,7 +1200,7 @@ var sections3 = [
|
|
|
1117
1200
|
content: [
|
|
1118
1201
|
{
|
|
1119
1202
|
type: "paragraph",
|
|
1120
|
-
text: "When a document is uploaded, it flows through a multi-stage pipeline
|
|
1203
|
+
text: "When a document is uploaded, it flows through a multi-stage pipeline that transforms raw files into structured, queryable data. Each stage runs automatically and its progress is visible in the document's processing log. The pipeline handles everything from OCR conversion to AI-powered field extraction, with built-in retry logic for resilience."
|
|
1121
1204
|
},
|
|
1122
1205
|
{
|
|
1123
1206
|
type: "ui-excerpt",
|
|
@@ -1125,9 +1208,29 @@ var sections3 = [
|
|
|
1125
1208
|
title: "Document \u2014 Processing Log",
|
|
1126
1209
|
caption: "Every document shows a structured processing log with per-stage timing."
|
|
1127
1210
|
},
|
|
1211
|
+
{
|
|
1212
|
+
type: "heading",
|
|
1213
|
+
level: 3,
|
|
1214
|
+
id: "processing-stages",
|
|
1215
|
+
text: "Processing Stages"
|
|
1216
|
+
},
|
|
1217
|
+
{
|
|
1218
|
+
type: "list",
|
|
1219
|
+
ordered: true,
|
|
1220
|
+
items: [
|
|
1221
|
+
"**Document AI OCR** \u2014 Converts files to Markdown and produces structured annotations (type, sensitivity, PII categories, jurisdiction) in a single pass.",
|
|
1222
|
+
"**Classification** \u2014 Verifies the annotation against the 529-type document ontology. If the label and content disagree, AI resolves the correct type from the actual text.",
|
|
1223
|
+
"**Triage** \u2014 Tags the document with sensitivity level, department, jurisdiction, and compliance signals derived from the OCR annotations.",
|
|
1224
|
+
"**AI Data Field Capture** \u2014 Extracts every data point from the document content using Claude, producing fields with confidence scores and source text provenance."
|
|
1225
|
+
]
|
|
1226
|
+
},
|
|
1227
|
+
{
|
|
1228
|
+
type: "paragraph",
|
|
1229
|
+
text: "**Document AI OCR** converts files to Markdown and produces structured annotations (type, sensitivity, PII categories, jurisdiction) in a single pass. **Classification** verifies the annotation against the 529-type document ontology \u2014 if the label and content disagree, AI resolves the correct type from the actual text. **AI Data Field Capture** extracts every data point. Large documents (PDFs exceeding 25 pages) are automatically split into page chunks, processed in parallel, and merged \u2014 so even lengthy documents complete efficiently without timeouts."
|
|
1230
|
+
},
|
|
1128
1231
|
{
|
|
1129
1232
|
type: "paragraph",
|
|
1130
|
-
text: "
|
|
1233
|
+
text: "The extraction stage uses a chunk-first approach by default: document sections are labelled inline and sent to Claude in a single call. The response combines JSON metadata with CSV-formatted fields, capturing confidence scores, source text, and chunk coverage for every extracted value. Optional quality passes \u2014 field count estimation, verification, and cross-reference enrichment \u2014 run as lightweight Haiku calls to catch extraction gaps. Six quality metrics are computed post-extraction: confidence per chunk, noise detection, semantic dissimilarity, text coverage, connection awareness, and consistency awareness."
|
|
1131
1234
|
},
|
|
1132
1235
|
{
|
|
1133
1236
|
type: "paragraph",
|
|
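The text in this hunk says PDFs exceeding 25 pages are split into page chunks, processed in parallel, and merged. The following sketch shows one way that split-and-merge pattern can look. The diff gives the 25-page threshold but not the chunk size, so using 25 pages per chunk is an assumption, as are the function names `pageChunks` and `processInChunks` and the `processChunk` callback.

```javascript
// Split a 1-based page range into consecutive chunks.
// Assumption: chunk size equals the 25-page threshold mentioned in the text.
function pageChunks(pageCount, chunkSize = 25) {
  const chunks = [];
  for (let start = 1; start <= pageCount; start += chunkSize) {
    chunks.push({ start, end: Math.min(start + chunkSize - 1, pageCount) });
  }
  return chunks;
}

// Process all chunks concurrently and merge the results in page order.
// processChunk is a hypothetical async callback returning an array of fields.
async function processInChunks(pageCount, processChunk, chunkSize = 25) {
  const chunks = pageChunks(pageCount, chunkSize);
  // Promise.all preserves input order, so the merged output stays in page order.
  const results = await Promise.all(chunks.map((chunk) => processChunk(chunk)));
  return results.flat();
}
```

A document at or under the chunk size yields a single chunk, so the same code path covers both short and long documents.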
@@ -1152,9 +1255,13 @@ var sections3 = [
|
|
|
1152
1255
|
type: "paragraph",
|
|
1153
1256
|
text: "Both fields appear in the **Raw Extraction** tab with full confidence (1.0) and are available for schema mapping, job resolution, filtering, and export \u2014 just like any AI-extracted field."
|
|
1154
1257
|
},
|
|
1258
|
+
{
|
|
1259
|
+
type: "paragraph",
|
|
1260
|
+
text: "If OCR or extraction fails, the platform automatically retries using a fallback chain. OCR failures cascade from Document AI to the Talonic API to local parsers. Extraction parse failures retry in realtime mode (never as new batches). Each retry stage is logged in the processing log so you can see exactly what happened. Documents that exhaust all retries are marked with a terminal status (`extraction_failed` or `ocr_failed`) and remain visible in the source for manual review."
|
|
1261
|
+
},
|
|
1155
1262
|
{
|
|
1156
1263
|
type: "callout",
|
|
1157
|
-
text: "Documents are marked **complete** after AI extraction finishes. You can start using them in jobs immediately \u2014 no need to wait for
|
|
1264
|
+
text: "Documents are marked **complete** after AI extraction finishes. You can start using them in jobs immediately \u2014 no need to wait for field resolution, which runs separately and enriches the registry in the background."
|
|
1158
1265
|
}
|
|
1159
1266
|
],
|
|
1160
1267
|
related: [
|
|
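The retry paragraph in this hunk describes an OCR fallback chain (Document AI, then the Talonic API, then local parsers) in which every attempt is logged and exhausted documents end in a terminal status. A minimal sketch of that pattern follows. The `ocr_failed` status comes from the text; the `runOcrWithFallback` name, the provider objects, the `"ok"` status, and the log entry shape are assumptions for illustration.

```javascript
// Try each OCR provider in order and stop at the first success.
// providers: array of { name, run } where run(file) returns text or throws.
function runOcrWithFallback(file, providers) {
  const log = [];
  for (const { name, run } of providers) {
    try {
      const text = run(file);
      log.push({ stage: name, ok: true });
      return { status: "ok", text, log };
    } catch (err) {
      // Record the failure and fall through to the next provider.
      const message = err instanceof Error ? err.message : String(err);
      log.push({ stage: name, ok: false, error: message });
    }
  }
  // Every provider failed: mark the document with the terminal status.
  return { status: "ocr_failed", text: null, log };
}
```

Keeping the log alongside the result means a failed document still carries a full record of which stages were attempted and why each one failed.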
@@ -1199,7 +1306,7 @@ var sections3 = [
|
|
|
1199
1306
|
},
|
|
1200
1307
|
{
|
|
1201
1308
|
type: "paragraph",
|
|
1202
|
-
text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata. Unresolvable documents are assigned "Unclassified Document".'
|
|
1309
|
+
text: 'Documents sharing the same ontology type are automatically merged into one document type. When a new canonical type appears, it is auto-created with ontology metadata including category, subcategory, and typical fields. Unresolvable documents are assigned "Unclassified Document" \u2014 they can still be processed and extracted, but the platform cannot map them to a specific type in the ontology.'
|
|
1203
1310
|
},
|
|
1204
1311
|
{
|
|
1205
1312
|
type: "paragraph",
|
|
@@ -1209,6 +1316,23 @@ var sections3 = [
|
|
|
1209
1316
|
type: "paragraph",
|
|
1210
1317
|
text: "Document types drive several downstream features. The platform auto-generates a **schema** for each document type, pre-populated with fields discovered from documents of that type. **Routing rules** can be configured per document type to automatically assign schemas or trigger jobs when new documents arrive. The **Field Registry** tracks which fields appear in which document types, building a cross-type knowledge graph over time."
|
|
1211
1318
|
},
|
|
1319
|
+
{
|
|
1320
|
+
type: "heading",
|
|
1321
|
+
level: 3,
|
|
1322
|
+
id: "ontology-examples",
|
|
1323
|
+
text: "Example Ontology Types"
|
|
1324
|
+
},
|
|
1325
|
+
{
|
|
1326
|
+
type: "list",
|
|
1327
|
+
ordered: false,
|
|
1328
|
+
items: [
|
|
1329
|
+
"**Financial** \u2014 Invoice, Purchase Order, Credit Note, Bank Statement, Tax Return",
|
|
1330
|
+
"**Legal** \u2014 Employment Contract, Non-Disclosure Agreement, Lease Agreement, Power of Attorney",
|
|
1331
|
+
"**Logistics** \u2014 Bill of Lading (Ocean), Commercial Invoice, Packing List, Certificate of Origin",
|
|
1332
|
+
"**Healthcare** \u2014 Medical Record, Lab Report, Insurance Claim, Prescription",
|
|
1333
|
+
"**Corporate** \u2014 Articles of Incorporation, Board Resolution, Annual Report, Meeting Minutes"
|
|
1334
|
+
]
|
|
1335
|
+
},
|
|
1212
1336
|
{
|
|
1213
1337
|
type: "callout",
|
|
1214
1338
|
variant: "info",
|
|
@@ -1251,7 +1375,7 @@ var sections3 = [
|
|
|
1251
1375
|
content: [
|
|
1252
1376
|
{
|
|
1253
1377
|
type: "paragraph",
|
|
1254
|
-
text: "Click any document to see its detail page with four views
|
|
1378
|
+
text: "Click any document to see its detail page with four views. The detail page is where you inspect individual extraction results, verify field accuracy, and debug processing issues. It serves as the single source of truth for everything the platform knows about a document \u2014 from raw AI output to resolved canonical fields."
|
|
1255
1379
|
},
|
|
1256
1380
|
{
|
|
1257
1381
|
type: "param-table",
|
|
@@ -1290,6 +1414,10 @@ var sections3 = [
|
|
|
1290
1414
|
type: "paragraph",
|
|
1291
1415
|
text: "The **Processing Log** tab provides a stage-by-stage timeline of how the document was processed, including per-stage timing. You can see exactly how long OCR, classification, and extraction took, which is useful for diagnosing slow processing or understanding why a document was classified a particular way. The **Original File** tab lets you view or download the source file, so you can always compare the AI's extraction against the original document."
|
|
1292
1416
|
},
|
|
1417
|
+
{
|
|
1418
|
+
type: "paragraph",
|
|
1419
|
+
text: "When reviewing extraction quality, start with the **Raw Extraction** tab to check that the AI found all expected fields. Compare the confidence scores across fields \u2014 values above 0.90 are typically reliable, while values below 0.70 may warrant manual verification. If a field is missing, check the **Original File** tab to confirm the information exists in the source document. The **Processing Log** tab can reveal whether the document was split into chunks, which sometimes causes fields near chunk boundaries to be missed."
|
|
1420
|
+
},
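The review guidance in the paragraph above can be sketched as a small triage helper. This is only an illustration: the `{ name, confidence }` field shape and the `triageByConfidence` name are hypothetical, and the middle "spot check" bucket is an assumption, since the text only names the 0.90 and 0.70 cut-offs.

```javascript
// Thresholds taken from the documentation text; everything else is hypothetical.
const RELIABLE_ABOVE = 0.90; // "values above 0.90 are typically reliable"
const VERIFY_BELOW = 0.70;   // "values below 0.70 may warrant manual verification"

// Groups field names into three review buckets by confidence score.
function triageByConfidence(fields) {
  const buckets = { reliable: [], spotCheck: [], verify: [] };
  for (const field of fields) {
    if (field.confidence > RELIABLE_ABOVE) {
      buckets.reliable.push(field.name);
    } else if (field.confidence < VERIFY_BELOW) {
      buckets.verify.push(field.name);
    } else {
      // Assumption: the docs do not say how to treat the 0.70-0.90 range.
      buckets.spotCheck.push(field.name);
    }
  }
  return buckets;
}
```

A reviewer could start with the `verify` bucket and compare those fields against the Original File tab first.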
|
|
1293
1421
|
{
|
|
1294
1422
|
type: "callout",
|
|
1295
1423
|
variant: "info",
|
|
@@ -1332,11 +1460,11 @@ var sections3 = [
|
|
|
1332
1460
|
content: [
|
|
1333
1461
|
{
|
|
1334
1462
|
type: "paragraph",
|
|
1335
|
-
text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents → Routing**."
|
|
1463
|
+
text: "Routing rules automatically assign actions to documents based on their type. Configure rules to auto-assign schemas, trigger jobs, or route documents to specific workflows. Manage rules from **Documents → Routing**. Routing is the bridge between document ingestion and structured output \u2014 it eliminates the manual step of selecting a schema and starting a job for each new document."
|
|
1336
1464
|
},
|
|
1337
1465
|
{
|
|
1338
1466
|
type: "paragraph",
|
|
1339
|
-
text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides."
|
|
1467
|
+
text: "Each routing rule specifies a **document type** as the trigger condition and one or more **actions** to execute when a document of that type is processed. Actions include assigning a specific user schema, automatically creating a job run, or tagging the document for a particular workflow. Rules are evaluated in priority order, so you can layer general rules with more specific overrides. If multiple rules match the same document type, the highest-priority rule wins."
|
|
1340
1468
|
},
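As a rough sketch of the "highest-priority rule wins" behavior described above: the rule shape (`documentType`, `priority`, `actions`) and the convention that a larger `priority` number means higher priority are assumptions made for illustration, not the platform's documented data model.

```javascript
// Hypothetical rule shape: { documentType, priority, actions }.
// Assumption: a larger `priority` value means the rule is evaluated first.
function selectRoutingRule(rules, documentType) {
  const matches = rules.filter((rule) => rule.documentType === documentType);
  if (matches.length === 0) {
    return null; // no rule configured for this document type
  }
  // Keep the matching rule with the highest priority.
  return matches.reduce((best, rule) => (rule.priority > best.priority ? rule : best));
}
```

With a general Invoice rule at priority 1 and a more specific override at priority 10, the override is the one returned.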
|
|
1341
1469
|
{
|
|
1342
1470
|
type: "paragraph",
|
|
@@ -1346,10 +1474,25 @@ var sections3 = [
|
|
|
1346
1474
|
type: "paragraph",
|
|
1347
1475
|
text: "You can review rule execution history from the routing page to see which rules fired, which documents they matched, and what actions were taken. This audit trail helps you verify that your routing configuration is working as expected and diagnose cases where documents were not routed correctly."
|
|
1348
1476
|
},
|
|
1477
|
+
{
|
|
1478
|
+
type: "heading",
|
|
1479
|
+
level: 3,
|
|
1480
|
+
id: "routing-actions",
|
|
1481
|
+
text: "Available Routing Actions"
|
|
1482
|
+
},
|
|
1483
|
+
{
|
|
1484
|
+
type: "list",
|
|
1485
|
+
ordered: false,
|
|
1486
|
+
items: [
|
|
1487
|
+
"**Assign schema** \u2014 Automatically link a user schema to matching documents, so they are ready for extraction without manual configuration.",
|
|
1488
|
+
"**Trigger job** \u2014 Create and start a job run as soon as a document of the matching type completes processing.",
|
|
1489
|
+
"**Tag workflow** \u2014 Apply a workflow tag that can be used for filtering, reporting, or downstream delivery bindings."
|
|
1490
|
+
]
|
|
1491
|
+
},
|
|
1349
1492
|
{
|
|
1350
1493
|
type: "callout",
|
|
1351
1494
|
variant: "info",
|
|
1352
|
-
text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules."
|
|
1495
|
+
text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules. Review the execution history regularly to ensure documents are being routed as expected."
|
|
1353
1496
|
}
|
|
1354
1497
|
],
|
|
1355
1498
|
related: [
|
|
@@ -1382,7 +1525,7 @@ var sections3 = [
|
|
|
1382
1525
|
content: [
|
|
1383
1526
|
{
|
|
1384
1527
|
type: "paragraph",
|
|
1385
|
-
text: "Beyond manual upload and API ingestion, Talonic connects to external systems to automatically ingest documents. Each connector authenticates via OAuth or credentials and syncs documents into a source."
|
|
1528
|
+
text: "Beyond manual upload and API ingestion, Talonic connects to external systems to automatically ingest documents. Each connector authenticates via OAuth or credentials and syncs documents into a source. Connectors turn Talonic into a continuous ingestion pipeline \u2014 once configured, new files arriving in a connected folder or inbox are available for processing without manual intervention."
|
|
1386
1529
|
},
|
|
1387
1530
|
{
|
|
1388
1531
|
type: "param-table",
|
|
@@ -1452,9 +1595,31 @@ var sections3 = [
|
|
|
1452
1595
|
type: "paragraph",
|
|
1453
1596
|
text: "Credential-based connectors (SQL, Amazon S3, Azure Blob) authenticate with access keys or connection strings rather than OAuth. SQL connections support PostgreSQL, MySQL, and MSSQL, with a built-in read-only safety layer that prevents accidental writes. S3-compatible storage like MinIO and Cloudflare R2 also works through the S3 connector. All credentials are encrypted at rest before being stored."
|
|
1454
1597
|
},
|
|
1598
|
+
{
|
|
1599
|
+
type: "heading",
|
|
1600
|
+
level: 3,
|
|
1601
|
+
id: "connector-setup",
|
|
1602
|
+
text: "Setting Up a Connector"
|
|
1603
|
+
},
|
|
1604
|
+
{
|
|
1605
|
+
type: "list",
|
|
1606
|
+
ordered: true,
|
|
1607
|
+
items: [
|
|
1608
|
+
"Navigate to **Sources** and click **New Source**.",
|
|
1609
|
+
"Select the connector type from the dropdown (e.g., Google Drive, SharePoint, S3).",
|
|
1610
|
+
"For OAuth connectors, complete the authorization flow \u2014 you will be redirected to the provider to grant access.",
|
|
1611
|
+
"For credential-based connectors, enter the required credentials (access key, connection string, or API key).",
|
|
1612
|
+
"Browse the connected system to select specific folders, mailboxes, buckets, or tables to import.",
|
|
1613
|
+
"Optionally enable **Batch Processing** to defer extraction at 50% cost."
|
|
1614
|
+
]
|
|
1615
|
+
},
|
|
1616
|
+
{
|
|
1617
|
+
type: "paragraph",
|
|
1618
|
+
text: "Email connectors (Gmail and Outlook) ingest attachments from messages rather than the messages themselves. Gmail supports query passthrough so you can use standard Gmail search syntax to filter which messages are scanned for attachments. Outlook supports date range filtering and an option to include email bodies as documents. Microsoft Teams ingests meeting transcripts and channel attachments, with configurable surface filters for channels, chats, and meetings."
|
|
1619
|
+
},
|
|
1455
1620
|
{
|
|
1456
1621
|
type: "callout",
|
|
1457
|
-
text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled."
|
|
1622
|
+
text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled. Microsoft Teams requires tenant-admin consent for privileged scopes like `ChannelMessage.Read.All`."
|
|
1458
1623
|
}
|
|
1459
1624
|
],
|
|
1460
1625
|
related: [
|
|
@@ -1465,15 +1630,19 @@ var sections3 = [
|
|
|
1465
1630
|
faq: [
|
|
1466
1631
|
{
|
|
1467
1632
|
question: "What external sources can Talonic connect to?",
|
|
1468
|
-
answer: "Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion, SQL databases (MSSQL/PostgreSQL), Amazon S3, and Azure Blob Storage."
|
|
1633
|
+
answer: "Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion, SQL databases (MSSQL/PostgreSQL/MySQL), Amazon S3 (and S3-compatible storage like MinIO and Cloudflare R2), and Azure Blob Storage. Each connector authenticates via OAuth or credentials."
|
|
1469
1634
|
},
|
|
1470
1635
|
{
|
|
1471
1636
|
question: "How are OAuth tokens stored?",
|
|
1472
|
-
answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET)."
|
|
1637
|
+
answer: "OAuth access and refresh tokens are encrypted at rest using AES-256-GCM. The encryption key is SOURCE_ENCRYPTION_KEY (falls back to JWT_SECRET). Tokens are decrypted only when making API calls to the connected service."
|
|
1473
1638
|
},
|
|
1474
1639
|
{
|
|
1475
1640
|
question: "What happens if a connector loses its credentials or authorization?",
|
|
1476
|
-
answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration."
|
|
1641
|
+
answer: "If OAuth credentials are revoked or expire, the source enters a disconnected state. Reconnecting via the source settings page automatically refreshes the credentials without losing your existing documents or configuration. No documents are deleted during disconnection."
|
|
1642
|
+
},
|
|
1643
|
+
{
|
|
1644
|
+
question: "Does the SQL connector support write operations?",
|
|
1645
|
+
answer: "No. SQL connections have a built-in read-only safety layer. A two-layer defense ensures no writes: an AST parser rejects anything that is not a single SELECT statement, and per-transaction read-only mode is enforced at the database level. MSSQL connections additionally reject accounts with elevated privileges."
|
|
1477
1646
|
}
|
|
1478
1647
|
],
|
|
1479
1648
|
mentions: [
|
|
@@ -1504,7 +1673,7 @@ var sections4 = [
|
|
|
1504
1673
|
content: [
|
|
1505
1674
|
{
|
|
1506
1675
|
type: "paragraph",
|
|
1507
|
-
text: "The Field Registry is the heart of Talonic's intelligence. As documents are processed, AI discovers fields and resolves them into a unified knowledge graph that grows smarter with every document."
|
|
1676
|
+
text: "The Field Registry is the heart of Talonic's intelligence. As documents are processed, AI discovers fields and resolves them into a unified knowledge graph that grows smarter with every document. The registry is what makes Talonic a learning system \u2014 each new document contributes to a shared understanding of field names, types, and extraction patterns that benefits all future processing."
|
|
1508
1677
|
},
|
|
1509
1678
|
{
|
|
1510
1679
|
type: "paragraph",
|
|
@@ -1527,6 +1696,10 @@ var sections4 = [
|
|
|
1527
1696
|
{
|
|
1528
1697
|
type: "paragraph",
|
|
1529
1698
|
text: "The registry is the foundation for several downstream features. **Jobs** use registry fields to pre-fill schema values via lookup cascades before resorting to LLM extraction. **Semantic clusters** group related registry fields together. **Generated schemas** are auto-built from registry fields that appear in a given document type. Understanding the registry is key to understanding how Talonic reduces extraction cost and improves accuracy over time."
|
|
1699
|
+
},
|
|
1700
|
+
{
|
|
1701
|
+
type: "paragraph",
|
|
1702
|
+
text: "Each registry field maintains two separate embedding vectors: one optimized for **resolution matching** (based on the canonical name and synonyms) and one for **graph visualization** (based on name, type, and instruction). This dual-embedding approach ensures that each concern uses the most appropriate representation. The resolution embedding is what powers the three-band matching during document processing, while the visualization embedding drives the Field Map clustering view."
|
|
1530
1703
|
}
|
|
1531
1704
|
],
|
|
1532
1705
|
related: [
|
|
@@ -1565,7 +1738,7 @@ var sections4 = [
|
|
|
1565
1738
|
content: [
|
|
1566
1739
|
{
|
|
1567
1740
|
type: "paragraph",
|
|
1568
|
-
text: "Fields are organized into three tiers based on how frequently they appear:"
|
|
1741
|
+
text: "Fields are organized into three tiers based on how frequently they appear across your document corpus. The tier system is Talonic's primary quality signal \u2014 it tells you at a glance how well-established and reliable a field is. Tiers also directly affect extraction cost: higher-tier fields are cheaper to extract because they can be resolved via lookup rather than AI calls."
|
|
1569
1742
|
},
|
|
1570
1743
|
{
|
|
1571
1744
|
type: "param-table",
|
|
@@ -1599,9 +1772,19 @@ var sections4 = [
|
|
|
1599
1772
|
type: "paragraph",
|
|
1600
1773
|
text: "**Tier 3** fields are newly discovered and may require a full Claude API call to extract during job runs, making them the most expensive tier. As more documents are processed and a Tier 3 field appears consistently, it is automatically promoted. You can also manually adjust a field's tier from the registry detail page if you know a field is stable enough to promote early."
|
|
1601
1774
|
},
|
|
1775
|
+
{
|
|
1776
|
+
type: "heading",
|
|
1777
|
+
level: 3,
|
|
1778
|
+
id: "promotion-thresholds",
|
|
1779
|
+
text: "Promotion Thresholds"
|
|
1780
|
+
},
|
|
1781
|
+
{
|
|
1782
|
+
type: "paragraph",
|
|
1783
|
+
text: "Promotion from Tier 3 to Tier 2 requires meeting one of two thresholds: **5 occurrences** (the field appears in at least 5 documents) or a **10% occurrence rate** (the field appears in at least 10% of all documents in your workspace). Promotion is evaluated automatically after every batch resolution run, so fields graduate without manual intervention as your document corpus grows. Once promoted to Tier 2, a field gains a synthesized master instruction and becomes eligible for lookup-based resolution in job runs."
|
|
1784
|
+
},
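The two promotion thresholds above amount to a simple either/or check. A minimal sketch, assuming the platform counts one occurrence per document; the function and constant names are hypothetical:

```javascript
// Thresholds from the documentation text above.
const MIN_OCCURRENCES = 5;      // "at least 5 documents"
const MIN_RATE_PERCENT = 10;    // "at least 10% of all documents in your workspace"

// Returns true when a Tier 3 field meets either promotion threshold.
function qualifiesForTier2(occurrences, totalDocuments) {
  if (occurrences >= MIN_OCCURRENCES) {
    return true;
  }
  // Integer comparison avoids floating-point rounding at exactly 10%.
  return totalDocuments > 0 && occurrences * 100 >= totalDocuments * MIN_RATE_PERCENT;
}
```

In a small workspace of 20 documents, a field seen twice already meets the 10% rate, while in a workspace of 1,000 documents the 5-occurrence threshold is the one that applies first.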
|
|
1602
1785
|
{
|
|
1603
1786
|
type: "callout",
|
|
1604
|
-
text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray."
|
|
1787
|
+
text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray. You can see tier badges on the Field Registry page, in document detail views, on schema fields, and in job result grids."
|
|
1605
1788
|
}
|
|
1606
1789
|
],
|
|
1607
1790
|
related: [
|
|
@@ -1633,7 +1816,7 @@ var sections4 = [
|
|
|
1633
1816
|
content: [
|
|
1634
1817
|
{
|
|
1635
1818
|
type: "paragraph",
|
|
1636
|
-
text: 'Fields with similar meanings are automatically grouped using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together. You can manually merge or split clusters from the Field Map view.'
|
|
1819
|
+
text: 'Fields with similar meanings are automatically grouped into semantic clusters using AI embeddings. For example, "Vendor Name", "Supplier Name", and "Company Name" cluster together because they represent the same underlying concept. You can manually merge or split clusters from the Field Map view. Clusters are a key mechanism for cross-document field normalization \u2014 they allow the platform to recognize that different document types use different names for the same data point.'
|
|
1637
1820
|
},
|
|
1638
1821
|
{
|
|
1639
1822
|
type: "paragraph",
|
|
@@ -1647,10 +1830,25 @@ var sections4 = [
|
|
|
1647
1830
|
type: "paragraph",
|
|
1648
1831
|
text: 'Semantic clusters serve a practical purpose beyond organization. When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has a field called "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call. This is one of the key mechanisms that reduces extraction cost as your registry matures.'
|
|
1649
1832
|
},
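The value transfer described above can be pictured roughly as follows. This is a simplified sketch: representing a cluster as a plain array of field names, and the `transferViaCluster` name, are assumptions for illustration only.

```javascript
// Hypothetical data shapes:
//   documentValues: { "Supplier Name": "Acme GmbH", ... }
//   clusters: [["Vendor Name", "Supplier Name", "Company Name"], ...]
function transferViaCluster(schemaField, documentValues, clusters) {
  const has = (key) => Object.prototype.hasOwnProperty.call(documentValues, key);

  // Direct hit: the document already uses the schema's field name.
  if (has(schemaField)) {
    return documentValues[schemaField];
  }
  // Otherwise look for a sibling field in the same semantic cluster.
  const cluster = clusters.find((members) => members.includes(schemaField));
  if (!cluster) {
    return undefined;
  }
  for (const member of cluster) {
    if (has(member)) {
      return documentValues[member];
    }
  }
  return undefined; // nothing to transfer; a later phase would have to fill the cell
}
```

This also shows why an incorrect grouping matters: if an unrelated field sits in the cluster, its value would be transferred just as readily.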
|
|
1833
|
+
{
|
|
1834
|
+
type: "heading",
|
|
1835
|
+
level: 3,
|
|
1836
|
+
id: "cluster-operations",
|
|
1837
|
+
text: "Cluster Operations"
|
|
1838
|
+
},
|
|
1839
|
+
{
|
|
1840
|
+
type: "list",
|
|
1841
|
+
ordered: false,
|
|
1842
|
+
items: [
|
|
1843
|
+
"**Merge** \u2014 Combine two clusters that represent the same concept. All fields from both clusters are unified under a single canonical entry.",
|
|
1844
|
+
"**Split** \u2014 Remove a field from a cluster if it was incorrectly grouped. The split field becomes its own independent cluster.",
|
|
1845
|
+
"**Inspect** \u2014 View all fields in a cluster, their source document types, and occurrence counts to understand why they were grouped."
|
|
1846
|
+
]
|
|
1847
|
+
},
|
|
1650
1848
|
{
|
|
1651
1849
|
type: "callout",
|
|
1652
1850
|
variant: "info",
|
|
1653
|
-
text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs."
|
|
1851
|
+
text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs. Conversely, merging clusters that should be together improves resolution accuracy across your entire corpus."
|
|
1654
1852
|
}
|
|
1655
1853
|
],
|
|
1656
1854
|
related: [
|
|
@@ -1689,7 +1887,7 @@ var sections4 = [
|
|
|
1689
1887
|
content: [
|
|
1690
1888
|
{
|
|
1691
1889
|
type: "paragraph",
|
|
1692
|
-
text: "When a document is processed, each extracted field is resolved against the registry using a three-band matching model. The bands determine whether a match is accepted automatically, flagged for confirmation, or treated as a new field."
|
|
1890
|
+
text: "When a document is processed, each extracted field is resolved against the registry using a three-band matching model. The bands determine whether a match is accepted automatically, flagged for confirmation, or treated as a new field. Resolution is the core mechanism that turns raw, document-specific field names into canonical registry entries \u2014 building a unified knowledge graph across all your documents."
|
|
1693
1891
|
},
|
|
1694
1892
|
{
|
|
1695
1893
|
type: "param-table",
|
|
@@ -1711,9 +1909,19 @@ var sections4 = [
|
|
|
1711
1909
|
}
|
|
1712
1910
|
]
|
|
1713
1911
|
},
|
|
1912
|
+
{
|
|
1913
|
+
type: "heading",
|
|
1914
|
+
level: 3,
|
|
1915
|
+
id: "resolution-process",
|
|
1916
|
+
text: "How Resolution Works"
|
|
1917
|
+
},
|
|
1714
1918
|
{
|
|
1715
1919
|
type: "paragraph",
|
|
1716
|
-
text: "Resolution
|
|
1920
|
+
text: "Resolution follows a strict three-band order that is never skipped. First, the system checks for an **exact name match** against existing registry entries. If no exact match is found, it checks for a **cluster member match** \u2014 whether the field name matches any synonym in an existing semantic cluster. Finally, it computes **semantic embedding similarity** using AI embeddings to find conceptually similar fields. This graduated approach prioritizes fast, deterministic matches before falling back to more expensive similarity comparisons."
|
|
1921
|
+
},
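The ordering described above, combined with the band thresholds quoted in this section's FAQ (auto at 0.80 and above, confirm from 0.50 to 0.79, new below 0.50), can be sketched like this. The registry entry shape, the function names, the use of cosine similarity, and treating exact and cluster matches as the auto band are all assumptions made for illustration.

```javascript
// Maps a similarity score to a band, using the thresholds quoted in the FAQ.
function bandForScore(score) {
  if (score >= 0.80) return "auto";
  if (score >= 0.50) return "confirm";
  return "new";
}

// Plain cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical registry entry: { canonicalName, synonyms, embedding }.
function resolveField(name, embedding, registry) {
  const normalized = name.trim().toLowerCase();

  // Step 1: exact name match against existing registry entries.
  for (const entry of registry) {
    if (entry.canonicalName.toLowerCase() === normalized) {
      return { entry, via: "exact", band: "auto" };
    }
  }
  // Step 2: cluster member (synonym) match.
  for (const entry of registry) {
    if (entry.synonyms.some((s) => s.toLowerCase() === normalized)) {
      return { entry, via: "cluster", band: "auto" };
    }
  }
  // Step 3: embedding similarity, the most expensive comparison.
  let best = null;
  let bestScore = -Infinity;
  for (const entry of registry) {
    const score = cosineSimilarity(embedding, entry.embedding);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  const band = best ? bandForScore(bestScore) : "new";
  return {
    entry: band === "new" ? null : best,
    via: "embedding",
    band,
    score: best ? bestScore : null,
  };
}
```

The cheap, deterministic checks return early, so the similarity loop only runs for names the registry has not seen before.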
|
|
1922
|
+
{
|
|
1923
|
+
type: "paragraph",
|
|
1924
|
+
text: "Resolution runs concurrently across documents. Each document's fields are resolved in an isolated transaction to prevent lock contention. Occurrence counts are updated atomically in the same SQL transaction using upserts with deadlock retry logic. This keeps the registry eventually consistent without blocking concurrent ingestion, even when hundreds of documents are being processed simultaneously."
|
|
1717
1925
|
},
|
|
1718
1926
|
{
|
|
1719
1927
|
type: "paragraph",
|
|
@@ -1732,15 +1940,19 @@ var sections4 = [
|
|
|
1732
1940
|
faq: [
|
|
1733
1941
|
{
|
|
1734
1942
|
question: "How does field resolution work in Talonic?",
|
|
1735
|
-
answer: "Each extracted field is matched against the registry using three bands: auto (>=0.80
|
|
1943
|
+
answer: "Each extracted field is matched against the registry using three bands in strict order: exact name match, cluster member match, then semantic embedding similarity. Results fall into auto (>=0.80, auto-linked), confirm (0.50-0.79, flagged for review), or new (<0.50, creates a new Tier 3 field). The three-band order is never skipped."
|
|
1736
1944
|
},
|
|
1737
1945
|
{
|
|
1738
1946
|
question: "Where can I review pending field confirmations?",
|
|
1739
|
-
answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge into an existing cluster, or reject to create a new field."
|
|
1947
|
+
answer: "Navigate to Resolution > Pending Confirmations to review fields in the confirm band. Accept to merge the field into an existing cluster, or reject to create a new independent field. Processing confirmations promptly improves resolution accuracy for future documents."
|
|
1740
1948
|
},
|
|
1741
1949
|
{
|
|
1742
1950
|
question: "What happens after resolution completes?",
|
|
1743
|
-
answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas."
|
|
1951
|
+
answer: "After resolution, the platform evaluates tier promotions and regenerates affected schemas in a fixed chain: resolve, then promote, then regenerate. This ensures that newly promoted fields immediately appear in auto-generated schemas. The chain is atomic \u2014 it never breaks midway."
|
|
1952
|
+
},
|
|
1953
|
+
{
|
|
1954
|
+
question: "How does resolution reduce extraction cost during job runs?",
|
|
1955
|
+
answer: "During job runs, the system uses a 3-tier lookup cascade \u2014 string normalization, token fuzzy matching, then AI fallback \u2014 to fill 60-80% of cells without a full LLM call. Fields that are well-established in the registry with high occurrence counts are the most likely to resolve via lookup, making Tier 1 and Tier 2 fields essentially free to extract."
|
|
1744
1956
|
}
|
|
1745
1957
|
],
|
|
1746
1958
|
mentions: [
|
|
@@ -1760,7 +1972,7 @@ var sections4 = [
|
|
|
1760
1972
|
content: [
|
|
1761
1973
|
{
|
|
1762
1974
|
type: "paragraph",
|
|
1763
|
-
text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs."
|
|
1975
|
+
text: "As the same field is extracted from many documents, AI synthesizes a **master instruction** \u2014 a reusable directive that captures the best way to extract that field. Master instructions improve accuracy over time and are automatically used when running jobs. They encode domain-specific knowledge about where a field typically appears in a document, what format it takes, and how to disambiguate it from similar fields."
|
|
1764
1976
|
},
|
|
1765
1977
|
{
|
|
1766
1978
|
type: "paragraph",
|
|
@@ -1774,9 +1986,26 @@ var sections4 = [
|
|
|
1774
1986
|
type: "paragraph",
|
|
1775
1987
|
text: `You can view and edit master instructions from the field detail page in the registry. Editing an instruction overrides the AI-synthesized version, which is useful when you have domain expertise the AI hasn't captured. The **"Synthesize All"** button in the Field Registry triggers the full pipeline \u2014 embedding, resolution, and synthesis \u2014 for all qualifying fields in a single operation.`
|
|
1776
1988
|
},
|
|
1989
|
+
{
|
|
1990
|
+
type: "heading",
|
|
1991
|
+
level: 3,
|
|
1992
|
+
id: "instruction-lifecycle",
|
|
1993
|
+
text: "Instruction Lifecycle"
|
|
1994
|
+
},
|
|
1995
|
+
{
|
|
1996
|
+
type: "list",
|
|
1997
|
+
ordered: true,
|
|
1998
|
+
items: [
|
|
1999
|
+
"A field is discovered and added to the registry as Tier 3 with no instruction.",
|
|
2000
|
+
"As more documents are processed, the field accumulates occurrences and extraction examples.",
|
|
2001
|
+
"When the field is promoted to Tier 2, the platform synthesizes a master instruction by analyzing all extraction patterns for that field.",
|
|
2002
|
+
"The instruction is injected into AI prompts during future job runs, improving extraction accuracy.",
|
|
2003
|
+
"You can manually edit the instruction at any time from the field detail page to incorporate domain expertise."
|
|
2004
|
+
]
|
|
2005
|
+
},
|
|
1777
2006
|
{
|
|
1778
2007
|
type: "callout",
|
|
1779
|
-
text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed → resolve → synthesize.'
|
|
2008
|
+
text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed → resolve → synthesize. The operation processes all fields that meet the synthesis criteria in a single batch.'
|
|
1780
2009
|
}
|
|
1781
2010
|
],
|
|
1782
2011
|
related: [
|
|
@@ -1818,11 +2047,11 @@ var sections5 = [
|
|
|
1818
2047
|
content: [
|
|
1819
2048
|
{
|
|
1820
2049
|
type: "paragraph",
|
|
1821
|
-
text: "Schemas define the structure of your output data. There are two types: AI-generated schemas created per document type, and user templates you define yourself."
|
|
2050
|
+
text: "Schemas define the structure of your output data. There are two types: AI-generated schemas created per document type, and user templates you define yourself. Generated schemas give you an automatic, always-up-to-date view of what the platform has discovered about each document type, while user templates let you define exactly which fields you need for a specific downstream use case."
|
|
1822
2051
|
},
|
|
1823
2052
|
{
|
|
1824
2053
|
type: "paragraph",
|
|
1825
|
-
text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed."
|
|
2054
|
+
text: "For each document type, Talonic generates a schema containing all Tier 1 and Tier 2 fields with occurrences in that type. Generated schemas are versioned \u2014 new versions are created when the registry changes. You can diff any two versions to see what changed. The diff view highlights added fields, removed fields, type changes, and updated instructions so you can track how your field landscape evolves over time."
|
|
1826
2055
|
},
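The selection rule in the paragraph above (Tier 1 and Tier 2 fields with occurrences in the given type) can be illustrated with a short filter. The field shape, including the `occurrencesByType` map, is hypothetical and only meant to make the rule concrete.

```javascript
// Hypothetical registry field: { name, tier, occurrencesByType: { [docType]: count } }.
function fieldsForGeneratedSchema(registryFields, documentType) {
  return registryFields
    .filter((field) => field.tier === 1 || field.tier === 2)
    .filter((field) => (field.occurrencesByType[documentType] || 0) > 0)
    .map((field) => field.name);
}
```

Tier 3 fields are excluded until they are promoted, at which point a new schema version would pick them up.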
|
|
1827
2056
|
{
|
|
1828
2057
|
type: "paragraph",
|
|
@@ -1832,9 +2061,13 @@ var sections5 = [
|
|
|
1832
2061
|
type: "paragraph",
|
|
1833
2062
|
text: "Generated schemas are most useful as a starting point for understanding what Talonic has discovered about your documents. Review the generated schema for a document type to see which fields the system has identified, then use that knowledge to build a **User Template** containing only the fields you actually need. You can also use the diff view to monitor how your field landscape evolves over time as new documents are processed and new fields are promoted."
|
|
1834
2063
|
},
|
|
2064
|
+
{
|
|
2065
|
+
type: "paragraph",
|
|
2066
|
+
text: "The tier system determines which fields appear in generated schemas. **Tier 1** (core) fields are the most frequently occurring and reliably extracted data points \u2014 they appear in nearly every document of the type. **Tier 2** (established) fields occur in a significant portion of documents and have been validated through repeated extraction. **Tier 3** (emerging) fields are too new or infrequent to be included in generated schemas, but they may be promoted as more documents are processed and their occurrence rate crosses the promotion threshold."
|
|
2067
|
+
},
|
|
1835
2068
|
{
|
|
1836
2069
|
type: "callout",
|
|
1837
|
-
text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry."
|
|
2070
|
+
text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry. Generated schemas serve as a discovery tool to understand what the platform has found in your documents."
|
|
1838
2071
|
}
|
|
1839
2072
|
],
|
|
1840
2073
|
related: [
|
|
@@ -1845,15 +2078,15 @@ var sections5 = [
|
|
|
1845
2078
|
faq: [
|
|
1846
2079
|
{
|
|
1847
2080
|
question: "What are generated schemas?",
|
|
1848
|
-
answer: "Generated schemas are AI-created output definitions for each document type, containing all Tier 1 and Tier 2 fields found in that type. They are versioned and support diffing between versions."
|
|
2081
|
+
answer: "Generated schemas are AI-created output definitions for each document type, containing all Tier 1 and Tier 2 fields found in that type. They are versioned and support diffing between versions. Each field in the schema includes data type information, the AI-synthesized master instruction, and occurrence statistics."
|
|
1849
2082
|
},
|
|
1850
2083
|
{
|
|
1851
2084
|
question: "How are generated schemas updated?",
|
|
1852
|
-
answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed."
|
|
2085
|
+
answer: "New versions are created automatically when the Field Registry changes (new fields promoted, clusters merged). You can diff any two versions to see what changed. The versioning system is append-only, so every previous version is preserved in the timeline for reference."
|
|
1853
2086
|
},
|
|
1854
2087
|
{
|
|
1855
2088
|
question: "Can I run an extraction job using a generated schema?",
|
|
1856
|
-
answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version."
|
|
2089
|
+
answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version. Generated schemas are designed as a discovery tool \u2014 use them to understand what the platform has found, then build a focused template for your specific output needs."
|
|
1857
2090
|
}
|
|
1858
2091
|
],
|
|
1859
2092
|
mentions: ["generated schemas", "AI-generated", "versioning", "schema diff"]
|
|
@@ -1867,7 +2100,7 @@ var sections5 = [
|
|
|
1867
2100
|
content: [
|
|
1868
2101
|
{
|
|
1869
2102
|
type: "paragraph",
|
|
1870
|
-
text: "User templates are the primary way to define your output structure. Navigate to **Structuring → Schemas** to create one."
|
|
2103
|
+
text: "User templates are the primary way to define your output structure. Navigate to **Structuring → Schemas** to create one. Templates give you complete control over which fields appear in your output, how they are extracted, and what validation rules apply. Unlike generated schemas, templates are executable \u2014 once published, they can be used to run extraction jobs."
|
|
1871
2104
|
},
|
|
1872
2105
|
{
|
|
1873
2106
|
type: "list",
|
|
@@ -1905,15 +2138,15 @@ var sections5 = [
|
|
|
1905
2138
|
faq: [
|
|
1906
2139
|
{
|
|
1907
2140
|
question: "How do I create a user template?",
|
|
1908
|
-
answer: "Navigate to Structuring > Schemas, create a template with a name and description, add fields with data types and instructions, map to the registry, add reference tables, and publish."
|
|
2141
|
+
answer: "Navigate to Structuring > Schemas, create a template with a name and description, add fields with data types and instructions, map to the registry, add reference tables, and publish. You can also import from Excel, CSV, or JSON to bootstrap a template from an existing spreadsheet \u2014 column headers become field names and data types are inferred automatically."
|
|
1909
2142
|
},
|
|
1910
2143
|
{
|
|
1911
2144
|
question: "What is the difference between generated schemas and user templates?",
|
|
1912
|
-
answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields. User templates are custom-defined output structures where you choose exactly which fields to include
|
|
2145
|
+
answer: "Generated schemas are AI-created per document type with all Tier 1/2 fields \u2014 they are read-only and cannot run jobs. User templates are custom-defined output structures where you choose exactly which fields to include, how to map them to the registry, and what validation rules apply. Only published user templates can be used for extraction jobs."
|
|
1913
2146
|
},
|
|
1914
2147
|
{
|
|
1915
2148
|
question: "Can I update a published template?",
|
|
1916
|
-
answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing."
|
|
2149
|
+
answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing. This append-only versioning ensures that historical job results always reference the exact schema that produced them."
|
|
1917
2150
|
}
|
|
1918
2151
|
],
|
|
1919
2152
|
mentions: ["user templates", "schema creation", "field mapping", "reference tables", "publish"]
|
|
@@ -1927,7 +2160,7 @@ var sections5 = [
|
|
|
1927
2160
|
content: [
|
|
1928
2161
|
{
|
|
1929
2162
|
type: "paragraph",
|
|
1930
|
-
text: "Every field in a template supports advanced features beyond the basic name and type. These features control how values are extracted, validated, transformed, and delivered."
|
|
2163
|
+
text: "Every field in a template supports advanced features beyond the basic name and type. These features control how values are extracted, validated, transformed, and delivered. You can layer features independently \u2014 for example, a single field can have a format constraint, a reference table for code lookup, modifiers for post-processing, and an output name remap for delivery. Features compose without conflicts, giving you fine-grained control over every aspect of the extraction and output pipeline."
|
|
1931
2164
|
},
|
|
1932
2165
|
{
|
|
1933
2166
|
type: "param-table",
|
|
@@ -2083,15 +2316,15 @@ var sections5 = [
|
|
|
2083
2316
|
faq: [
|
|
2084
2317
|
{
|
|
2085
2318
|
question: "How does field matching work in Talonic schemas?",
|
|
2086
|
-
answer: "Schema fields are matched to the registry using
|
|
2319
|
+
answer: "Schema fields are matched to the registry using a three-band resolution process. First, exact name matching against canonical names and synonyms. Then, embedding similarity for semantic matches (auto-accept above 0.8, confirm between 0.5 and 0.8). Four match types result: Exact (direct name match), Semantic (AI finds equivalent field), Composite (multiple fields combine), and Unmapped (no match, needs manual instructions)."
|
|
2087
2320
|
},
|
|
2088
2321
|
{
|
|
2089
2322
|
question: "What happens when a field is unmapped?",
|
|
2090
|
-
answer: "Unmapped fields have no registry match. They require manual extraction instructions to guide the AI on how to extract the value from documents."
|
|
2323
|
+
answer: "Unmapped fields have no registry match and do not inherit a master extraction instruction. They require manual extraction instructions to guide the AI on how to extract the value from documents. Write clear, specific instructions describing where in the document to look and what formatting to expect for best results."
|
|
2091
2324
|
},
|
|
2092
2325
|
{
|
|
2093
2326
|
question: "Can I re-run field matching after adding more documents?",
|
|
2094
|
-
answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows."
|
|
2327
|
+
answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows through processing additional documents. For best results, use descriptive field names that reflect the actual data rather than generic labels."
|
|
2095
2328
|
}
|
|
2096
2329
|
],
|
|
2097
2330
|
mentions: ["field matching", "exact match", "semantic match", "composite", "unmapped"]
|
|
@@ -2105,7 +2338,7 @@ var sections5 = [
|
|
|
2105
2338
|
content: [
|
|
2106
2339
|
{
|
|
2107
2340
|
type: "paragraph",
|
|
2108
|
-
text:
|
|
2341
|
+
text: 'Reference tables map human-readable values to system codes. Each table is a list of key-value pairs where `key` = output code and `value` = label. For example, a country reference table might map "United States" to `US`, "Germany" to `DE`, and "United Kingdom" to `GB`. During extraction, a 3-tier lookup cascade runs automatically against the table to normalize extracted values to your canonical codes:'
|
|
2109
2342
|
},
|
|
2110
2343
|
{
|
|
2111
2344
|
type: "param-table",
|
|
@@ -2142,7 +2375,7 @@ var sections5 = [
|
|
|
2142
2375
|
},
|
|
2143
2376
|
{
|
|
2144
2377
|
type: "callout",
|
|
2145
|
-
text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run."
|
|
2378
|
+
text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run. Review the lookup_failed validation flag in Phase 3 results to identify values that could not be mapped \u2014 these are candidates for adding new entries to your table."
|
|
2146
2379
|
}
|
|
2147
2380
|
],
|
|
2148
2381
|
related: [
|
|
@@ -2197,7 +2430,7 @@ var sections5 = [
|
|
|
2197
2430
|
},
|
|
2198
2431
|
{
|
|
2199
2432
|
type: "callout",
|
|
2200
|
-
text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing."
|
|
2433
|
+
text: "Breaking changes include field removals and type changes. The system surfaces these warnings at publish time so you can assess the impact on active delivery bindings and downstream systems before committing. Always run a **Test Extraction** on representative documents before publishing a draft that includes breaking changes."
|
|
2201
2434
|
}
|
|
2202
2435
|
],
|
|
2203
2436
|
related: [
|
|
@@ -2208,15 +2441,15 @@ var sections5 = [
|
|
|
2208
2441
|
faq: [
|
|
2209
2442
|
{
|
|
2210
2443
|
question: "How does schema versioning work?",
|
|
2211
|
-
answer: "Templates use a workshop system: Live (published, read-only), Workshop (mutable draft), and Version History (timeline with diffs). Breaking changes like field removals or type changes are detected on promotion."
|
|
2444
|
+
answer: "Templates use a workshop system with three states: Live (published, read-only), Workshop (mutable draft), and Version History (timeline with diffs). Breaking changes like field removals or type changes are detected on promotion. Every published version is immutable, creating a complete audit trail of how your schema evolved over time."
|
|
2212
2445
|
},
|
|
2213
2446
|
{
|
|
2214
2447
|
question: "What are breaking changes in a schema?",
|
|
2215
|
-
answer: "Breaking changes include field removals and type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts."
|
|
2448
|
+
answer: "Breaking changes include field removals and data type changes. The system detects and warns about these when promoting a draft to live, helping you avoid unintended downstream impacts. If a downstream delivery binding depends on a specific field, the warning helps you assess the impact before committing the change."
|
|
2216
2449
|
},
|
|
2217
2450
|
{
|
|
2218
2451
|
question: "Can I revert to a previous schema version?",
|
|
2219
|
-
answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed."
|
|
2452
|
+
answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed. This design ensures that every historical job result always references the exact schema version that produced it."
|
|
2220
2453
|
}
|
|
2221
2454
|
],
|
|
2222
2455
|
mentions: ["versioning", "drafts", "workshop", "live version", "breaking changes"]
|
|
@@ -2230,23 +2463,27 @@ var sections5 = [
|
|
|
2230
2463
|
content: [
|
|
2231
2464
|
{
|
|
2232
2465
|
type: "paragraph",
|
|
2233
|
-
text: "Before publishing a draft, run a test extraction to compare draft vs. live results side-by-side. Select a few documents, run the test, and see exactly how your changes affect output."
|
|
2466
|
+
text: "Before publishing a draft, run a test extraction to compare draft vs. live results side-by-side. Select a few documents, run the test, and see exactly how your changes affect output. This is the safest way to validate schema changes \u2014 you can iterate on field instructions, reference tables, and format constraints without affecting production jobs or published data."
|
|
2234
2467
|
},
|
|
2235
2468
|
{
|
|
2236
2469
|
type: "paragraph",
|
|
2237
|
-
text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
|
|
2470
|
+
text: "After running a test, you will see a comparison grid highlighting cells that changed between the draft and live versions. Focus on fields you modified \u2014 new fields, updated instructions, or changed reference tables \u2014 to verify they produce the expected values. Cells that improved show in green, cells that regressed show in red, and unchanged cells are neutral. This workflow catches regressions before they reach production, so you can iterate on your schema with confidence."
|
|
2238
2471
|
},
|
|
2239
2472
|
{
|
|
2240
2473
|
type: "paragraph",
|
|
2241
|
-
text: "Test extractions run through the same
|
|
2474
|
+
text: "Test extractions run through the same extraction pipeline as production jobs, so the results you see are representative of what a full job would produce. The test uses a simplified single-call extraction mode under the hood, which is faster but still applies all schema features including reference table lookups, format constraints, modifiers, and bypass strategies. This gives you a reliable preview without the cost and time of a full 4-phase pipeline run."
|
|
2242
2475
|
},
|
|
2243
2476
|
{
|
|
2244
2477
|
type: "paragraph",
|
|
2245
2478
|
text: 'For best results, select 3-5 representative documents that cover the variety in your corpus \u2014 include at least one "clean" document and one with unusual formatting or missing fields. This gives you confidence that your schema handles both typical and edge-case documents correctly. Run the test after every significant change to a field instruction, reference table, or format constraint.'
|
|
2246
2479
|
},
|
|
2480
|
+
{
|
|
2481
|
+
type: "paragraph",
|
|
2482
|
+
text: "A typical iteration workflow looks like this: add or modify a field in the Workshop draft, run a test extraction on your sample documents, review the comparison grid to check that the new field produces correct values, adjust the instruction if needed, re-test, and publish when satisfied. This tight feedback loop is the fastest way to refine extraction accuracy without impacting production jobs or consuming unnecessary credits."
|
|
2483
|
+
},
|
|
2247
2484
|
{
|
|
2248
2485
|
type: "callout",
|
|
2249
|
-
text: "Test extractions do not affect your live data or consume production job credits differently. They are designed for rapid iteration \u2014 run as many tests as you need before publishing."
|
|
2486
|
+
text: "Test extractions do not affect your live data or consume production job credits differently. They are designed for rapid iteration \u2014 run as many tests as you need before publishing. Results are temporary and do not appear in your job history."
|
|
2250
2487
|
}
|
|
2251
2488
|
],
|
|
2252
2489
|
related: [
|
|
@@ -2279,7 +2516,7 @@ var sections5 = [
|
|
|
2279
2516
|
content: [
|
|
2280
2517
|
{
|
|
2281
2518
|
type: "paragraph",
|
|
2282
|
-
text: "Dialects define the output format for structured data. They control how values are serialized when delivered or exported. A dialect can be shared across schemas or defined inline for a specific schema. Configure dialects in the **Schema → Delivery** tab."
|
|
2519
|
+
text: "Dialects define the output format for structured data. They control how values are serialized when delivered or exported \u2014 everything from date formatting and number locale to CSV delimiters and character encoding. A dialect can be shared across schemas or defined inline for a specific schema. Configure dialects in the **Schema → Delivery** tab. Shared dialects ensure consistent formatting across all your exports without duplicating configuration on every schema."
|
|
2283
2520
|
},
|
|
2284
2521
|
{
|
|
2285
2522
|
type: "param-table",
|
|
@@ -2371,7 +2608,7 @@ var sections5 = [
|
|
|
2371
2608
|
content: [
|
|
2372
2609
|
{
|
|
2373
2610
|
type: "paragraph",
|
|
2374
|
-
text: "Bypass strategies determine how a schema field is populated when it should not go through LLM extraction. Each strategy provides a deterministic value without consuming AI credits."
|
|
2611
|
+
text: "Bypass strategies determine how a schema field is populated when it should not go through LLM extraction. Each strategy provides a deterministic value without consuming AI credits. This is useful for fields whose values are known ahead of time, can be derived from other fields, or should be looked up from reference data rather than extracted from the document text."
|
|
2375
2612
|
},
|
|
2376
2613
|
{
|
|
2377
2614
|
type: "param-table",
|
|
@@ -2413,7 +2650,7 @@ var sections5 = [
|
|
|
2413
2650
|
},
|
|
2414
2651
|
{
|
|
2415
2652
|
type: "callout",
|
|
2416
|
-
text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net. Strategy values are normalized via generator mappings in Phase 4 of the pipeline."
|
|
2653
|
+
text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net \u2014 your data is never left incomplete due to a bypass misconfiguration. Strategy values are normalized via generator mappings in Phase 4 of the pipeline. Bypass strategies execute during Phase 1, before any AI calls are made."
|
|
2417
2654
|
}
|
|
2418
2655
|
],
|
|
2419
2656
|
related: [
|
|
@@ -2452,7 +2689,7 @@ var sections5 = [
|
|
|
2452
2689
|
content: [
|
|
2453
2690
|
{
|
|
2454
2691
|
type: "paragraph",
|
|
2455
|
-
text: "Format constraints apply regex-based validation to schema fields. They are evaluated post-extraction in Phase 4 of the pipeline, after all transforms have been applied. Original values are preserved for audit in `original_extractions`."
|
|
2692
|
+
text: "Format constraints apply regex-based validation to schema fields. They are evaluated post-extraction in Phase 4 of the pipeline, after all transforms have been applied. Original values are preserved for audit in `original_extractions`. This means you can always review what the AI originally extracted before the constraint was applied, giving you full visibility into the extraction pipeline even when values are cleared or replaced."
|
|
2456
2693
|
},
|
|
2457
2694
|
{
|
|
2458
2695
|
type: "param-table",
|
|
@@ -2477,7 +2714,7 @@ var sections5 = [
|
|
|
2477
2714
|
},
|
|
2478
2715
|
{
|
|
2479
2716
|
type: "paragraph",
|
|
2480
|
-
text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax. The editor provides a live test input so you can verify the pattern before saving."
|
|
2717
|
+
text: "Define format constraints in the schema field editor. The pattern uses standard regex syntax with support for inline flags like `(?i)` for case-insensitive matching. The editor provides a live test input so you can verify the pattern against sample values before saving. This immediate feedback loop helps you catch overly strict or overly permissive patterns before they affect real extraction runs."
|
|
2481
2718
|
},
|
|
2482
2719
|
{
|
|
2483
2720
|
type: "paragraph",
|
|
@@ -2489,7 +2726,7 @@ var sections5 = [
|
|
|
2489
2726
|
},
|
|
2490
2727
|
{
|
|
2491
2728
|
type: "callout",
|
|
2492
|
-
text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching."
|
|
2729
|
+
text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching. Format constraints support standard JavaScript regex syntax, so you can use character classes, alternation, and lookahead assertions for complex validation patterns."
|
|
2493
2730
|
}
|
|
2494
2731
|
],
|
|
2495
2732
|
related: [
|
|
@@ -2532,23 +2769,27 @@ var sections6 = [
|
|
|
2532
2769
|
content: [
|
|
2533
2770
|
{
|
|
2534
2771
|
type: "paragraph",
|
|
2535
|
-
text: "Extraction jobs are the core of the platform \u2014 where schemas meet documents and AI agents produce structured data. A job produces a grid: rows = documents, columns = schema fields."
|
|
2772
|
+
text: "Extraction jobs are the core of the platform \u2014 where schemas meet documents and AI agents produce structured data. A job produces a grid: rows = documents, columns = schema fields. Each cell in the grid contains an extracted value along with metadata including a confidence score, the resolution type, the pipeline phase that produced it, and an AI reasoning trace explaining how the value was derived from the source document."
|
|
2536
2773
|
},
|
|
2537
2774
|
{
|
|
2538
2775
|
type: "paragraph",
|
|
2539
|
-
text: "Navigate to **Structuring → Runs → New**. Select your template and documents, then click Start. Results appear progressively as each phase completes."
|
|
2776
|
+
text: "Navigate to **Structuring → Runs → New**. Select your template and documents, then click Start. Results appear progressively as each phase completes. You can choose between three extraction modes: **pipeline** (full 4-phase extraction, the default), **simple** (single AI call, faster but less thorough), or **field registry** (no AI, deterministic strategies only \u2014 useful for benchmarking registry coverage)."
|
|
2540
2777
|
},
|
|
2541
2778
|
{
|
|
2542
2779
|
type: "paragraph",
|
|
2543
|
-
text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents silent data loss where registry-based lookups would return empty results for unresolved documents."
|
|
2780
|
+
text: "When you start a job, the platform runs a pre-flight check to ensure all selected documents have completed their field resolution step. If any document was uploaded recently and has not yet been resolved against the Field Registry, the system automatically resolves it before entering Phase 1. This lazy resolution gate prevents silent data loss where registry-based lookups would return empty results for unresolved documents. The pre-flight resolution runs with a concurrency of 3 and failures are non-fatal \u2014 Phase 1 will proceed with whatever is resolved."
|
|
2544
2781
|
},
|
|
2545
2782
|
{
|
|
2546
2783
|
type: "paragraph",
|
|
2547
2784
|
text: "For best results, select documents of the same type or closely related types for a single job. The schema you choose should match the document content \u2014 using an invoice schema on contract documents will produce poor results. Start with a small batch of 5-10 documents to validate your schema, review the output, apply corrections, and then scale up to larger runs once you are confident in the extraction quality."
|
|
2548
2785
|
},
|
|
2786
|
+
{
|
|
2787
|
+
type: "paragraph",
|
|
2788
|
+
text: "The platform supports scaling caps to ensure reliable processing: Phase 2 extraction handles up to 2,000 documents per job, and Phase 4 transforms support up to 1,000 documents. Grid results are flushed to the database in batches of 200 documents per phase. For very large document collections, consider splitting into multiple jobs by document type for optimal results and easier review."
|
|
2789
|
+
},
|
|
2549
2790
|
{
|
|
2550
2791
|
type: "callout",
|
|
2551
|
-
text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running."
|
|
2792
|
+
text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running. The phase timeline on the job detail page shows which phase is active and the cumulative fill rate at each stage."
|
|
2552
2793
|
}
|
|
2553
2794
|
],
|
|
2554
2795
|
related: [
|
|
@@ -2581,7 +2822,7 @@ var sections6 = [
|
|
|
2581
2822
|
content: [
|
|
2582
2823
|
{
|
|
2583
2824
|
type: "paragraph",
|
|
2584
|
-
text: "Every job runs through four phases. Each fills more cells in the output grid, reducing the problem space for the next. Results are visible as each phase completes."
|
|
2825
|
+
text: "Every job runs through four phases. Each fills more cells in the output grid, reducing the problem space for the next. Results are visible as each phase completes. The grid is the single source of truth during execution and is flushed to the database after each phase, enabling progressive rendering in the UI. A confidence gate protects high-quality values from being overwritten by later phases \u2014 once a cell is filled with high confidence, it is permanently locked."
|
|
2585
2826
|
},
|
|
2586
2827
|
{
|
|
2587
2828
|
type: "paragraph",
|
|
@@ -2603,7 +2844,7 @@ var sections6 = [
|
|
|
2603
2844
|
},
|
|
2604
2845
|
{
|
|
2605
2846
|
type: "callout",
|
|
2606
|
-
text: "Phase order is fixed: Phase 1 → 2 → 3 → 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs."
|
|
2847
|
+
text: "Phase order is fixed: Phase 1 → 2 → 3 → 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs. The confidence gate is the single most important pipeline rule \u2014 once a cell is filled with a high-confidence value, no later phase can overwrite it with a lower-confidence result."
|
|
2607
2848
|
}
|
|
2608
2849
|
],
|
|
2609
2850
|
related: [
|
|
@@ -2636,7 +2877,7 @@ var sections6 = [
|
|
|
2636
2877
|
content: [
|
|
2637
2878
|
{
|
|
2638
2879
|
type: "paragraph",
|
|
2639
|
-
text: "The fastest phase (~30% of cells in seconds). For each document x each schema field, the system checks if the cell can be filled from existing extracted data. **No AI calls** (except rare Haiku fallback for ambiguous lookups)."
|
|
2880
|
+
text: "The fastest phase (~30% of cells in seconds). For each document x each schema field, the system checks if the cell can be filled from existing extracted data. **No AI calls** (except rare Haiku fallback for ambiguous lookups). Phase 1 is the workhorse of cost efficiency \u2014 it fills a large portion of the grid using pre-computed graph matches and deterministic lookups at near-zero cost. As your Field Registry grows from processing more documents, Phase 1 fill rates steadily improve."
|
|
2640
2881
|
},
|
|
2641
2882
|
{
|
|
2642
2883
|
type: "param-table",
|
|
@@ -2723,7 +2964,7 @@ var sections6 = [
|
|
|
2723
2964
|
content: [
|
|
2724
2965
|
{
|
|
2725
2966
|
type: "paragraph",
|
|
2726
|
-
text: "An AI agent reviews the grid's gap patterns and produces a typed strategy
|
|
2967
|
+
text: "An AI agent reviews the grid's gap patterns and produces a typed strategy for each remaining empty cell. The agent uses Anthropic Claude Sonnet to analyze the source document alongside the schema field definitions, any already-resolved values from Phase 1, and reference table codes when available. This context-aware approach allows the AI to use related extracted values as clues for finding dependent data points."
|
|
2727
2968
|
},
|
|
2728
2969
|
{
|
|
2729
2970
|
type: "paragraph",
|
|
@@ -2812,7 +3053,7 @@ var sections6 = [
|
|
|
2812
3053
|
content: [
|
|
2813
3054
|
{
|
|
2814
3055
|
type: "paragraph",
|
|
2815
|
-
text: "Cross-field sanity checks. Flags are **informational only** \u2014 they never block output but help you prioritize review:"
|
|
3056
|
+
text: "Cross-field sanity checks and re-resolution. Phase 3 performs two critical tasks: it re-runs the reference table lookup cascade on values produced by Phase 2 to normalize free-text AI output to your canonical codes, and it runs informational validation checks across related fields. Flags are **informational only** \u2014 they never block output but help you prioritize review:"
|
|
2816
3057
|
},
|
|
2817
3058
|
{
|
|
2818
3059
|
type: "paragraph",
|
|
@@ -2898,7 +3139,7 @@ var sections6 = [
|
|
|
2898
3139
|
content: [
|
|
2899
3140
|
{
|
|
2900
3141
|
type: "paragraph",
|
|
2901
|
-
text: "Context-aware gap filling.
|
|
3142
|
+
text: "Context-aware gap filling and deterministic transforms. Phase 4 serves two purposes: for each empty cell or low-confidence value, AI re-reads the original document with the field instruction and full grid context to find values missed in earlier phases. It also applies deterministic transforms to all cell values \u2014 ISO code normalization, date format standardization, unit conversion \u2014 and evaluates format constraints (regex patterns) with configurable mismatch behaviors. The modifier pipeline runs in a fixed order: format transforms first, then alias mapping, then max_length truncation."
|
|
2902
3143
|
},
|
|
2903
3144
|
{
|
|
2904
3145
|
type: "paragraph",
|
|
@@ -2914,7 +3155,7 @@ var sections6 = [
|
|
|
2914
3155
|
},
|
|
2915
3156
|
{
|
|
2916
3157
|
type: "callout",
|
|
2917
|
-
text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected."
|
|
3158
|
+
text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected. Original values are always preserved in the `original_extractions` table for audit, regardless of whether format constraints clear, flag, or replace them."
|
|
2918
3159
|
}
|
|
2919
3160
|
],
|
|
2920
3161
|
related: [
|
|
@@ -2953,7 +3194,7 @@ var sections6 = [
|
|
|
2953
3194
|
},
|
|
2954
3195
|
{
|
|
2955
3196
|
type: "paragraph",
|
|
2956
|
-
text: "The job detail page provides: a **progress bar** with fill rate, a **phase timeline**, the **strategy panel** (agent actions), a **filter bar** (Show All / Clean / Flagged), and **CSV export** (clean or full with metadata)."
|
|
3197
|
+
text: "The job detail page provides: a **progress bar** with fill rate, a **phase timeline**, the **strategy panel** (agent actions), a **filter bar** (Show All / Clean / Flagged), and **CSV export** (clean or full with metadata). The strategy yield breakdown shows how cells were distributed across resolution methods \u2014 registry transfer, raw extraction mapping, lookup cascade, deterministic compute, LLM extract, and bypass \u2014 giving you a clear picture of pipeline efficiency for each run."
|
|
2957
3198
|
},
|
|
2958
3199
|
{
|
|
2959
3200
|
type: "paragraph",
|
|
@@ -2969,7 +3210,7 @@ var sections6 = [
|
|
|
2969
3210
|
},
|
|
2970
3211
|
{
|
|
2971
3212
|
type: "callout",
|
|
2972
|
-
text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus."
|
|
3213
|
+
text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus. The clean export omits metadata and includes only the extracted values, ready for direct import into downstream systems."
|
|
2973
3214
|
}
|
|
2974
3215
|
],
|
|
2975
3216
|
related: [
|
|
@@ -3008,7 +3249,7 @@ var sections6 = [
|
|
|
3008
3249
|
content: [
|
|
3009
3250
|
{
|
|
3010
3251
|
type: "paragraph",
|
|
3011
|
-
text: "Every cell carries detailed provenance. Hover a cell for confidence; click
|
|
3252
|
+
text: "Every cell carries detailed provenance metadata that makes every extracted value auditable and explainable. Hover a cell for a quick confidence score glance; click it to expand the full provenance panel showing the resolution type, pipeline phase, reasoning trace, and source document references. This transparency is essential for building trust in automated extraction \u2014 you can always understand exactly how and why the platform produced a specific value."
|
|
3012
3253
|
},
|
|
3013
3254
|
{
|
|
3014
3255
|
type: "paragraph",
|
|
@@ -3095,7 +3336,7 @@ var sections6 = [
|
|
|
3095
3336
|
content: [
|
|
3096
3337
|
{
|
|
3097
3338
|
type: "paragraph",
|
|
3098
|
-
text: "Click any cell to edit its value. Corrections are logged with the original value, timestamp, and user. Choose a propagation scope: `this_document_only` or `all_similar` (same field + method + source field across all documents). Corrections feed back as training signals for future runs."
|
|
3339
|
+
text: "Click any cell to edit its value. Corrections are logged with the original value, timestamp, and user. Choose a propagation scope: `this_document_only` or `all_similar` (same field + method + source field across all documents). Corrections feed back as training signals for future runs, helping the system learn from your edits and improve accuracy over time. When you correct a value, the system records both the original AI-extracted value and your correction, creating a complete audit trail that is preserved even after subsequent jobs run."
|
|
3099
3340
|
},
|
|
3100
3341
|
{
|
|
3101
3342
|
type: "paragraph",
|
|
@@ -3111,7 +3352,7 @@ var sections6 = [
|
|
|
3111
3352
|
},
|
|
3112
3353
|
{
|
|
3113
3354
|
type: "callout",
|
|
3114
|
-
text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected."
|
|
3355
|
+
text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected. For recurring field-level errors, consider updating the schema instruction or reference table rather than correcting cells individually across multiple runs."
|
|
3115
3356
|
}
|
|
3116
3357
|
],
|
|
3117
3358
|
related: [
|
|
@@ -3193,9 +3434,25 @@ var sections7 = [
|
|
|
3193
3434
|
type: "paragraph",
|
|
3194
3435
|
text: "Use link keys whenever your documents share identifying information that should connect them. For best results, ensure your field names follow clear naming conventions \u2014 this maximizes the hit rate of the automatic classifier and minimizes the need for manual overrides."
|
|
3195
3436
|
},
|
|
3437
|
+
{
|
|
3438
|
+
type: "paragraph",
|
|
3439
|
+
text: 'High-frequency entity exclusion is an important safeguard. If an entity value appears in more than 30% of all documents \u2014 for example, a generic department name like "Operations" or a common currency code like "USD" \u2014 it is automatically excluded from case formation. Without this filter, a single high-frequency value would pull most documents into one enormous case, making the grouping meaningless. The 30% threshold strikes a balance between connecting genuinely related documents and avoiding over-connection from generic values.'
|
|
3440
|
+
},
|
|
3441
|
+
{
|
|
3442
|
+
type: "list",
|
|
3443
|
+
items: [
|
|
3444
|
+
"Identity: company names, supplier names, person names \u2014 connects documents referencing the same party",
|
|
3445
|
+
"Transaction: contract numbers, PO numbers, invoice numbers \u2014 connects documents in the same transaction chain",
|
|
3446
|
+
"Reference: project codes, cost centers, shared IDs \u2014 connects documents under the same organizational grouping",
|
|
3447
|
+
"Auto-classified by field name patterns (e.g., company_name, invoice_number)",
|
|
3448
|
+
"AI classifier handles ambiguous fields that heuristics cannot resolve",
|
|
3449
|
+
"High-frequency entities (>30% of documents) excluded automatically",
|
|
3450
|
+
"Manual overrides available in the Field Registry"
|
|
3451
|
+
]
|
|
3452
|
+
},
|
|
3196
3453
|
{
|
|
3197
3454
|
type: "callout",
|
|
3198
|
-
text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest."
|
|
3455
|
+
text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest. Manual overrides in the Field Registry take precedence over automatic classifications and persist across future jobs."
|
|
3199
3456
|
}
|
|
3200
3457
|
],
|
|
3201
3458
|
related: [
|
|
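Editor's note on the hunk above: the high-frequency entity exclusion it describes can be sketched as a document-frequency filter. This is a minimal illustration, not Talonic's implementation. The function name `excludeHighFrequencyEntities`, the `{ id, entities }` document shape, and the sample values are all assumptions; only the "more than 30% of all documents" rule comes from the added text.

```javascript
// Hypothetical helper (not part of any Talonic API).
// docs: array of { id, entities: string[] } holding already-normalized entity values.
function excludeHighFrequencyEntities(docs, threshold = 0.3) {
  const docCount = docs.length;
  const seenIn = new Map(); // entity value -> number of documents containing it

  for (const doc of docs) {
    // Count each entity once per document, even if it repeats inside the document.
    for (const entity of new Set(doc.entities)) {
      seenIn.set(entity, (seenIn.get(entity) || 0) + 1);
    }
  }

  const excluded = new Set();
  for (const [entity, count] of seenIn) {
    // "More than 30% of all documents" is read here as a strict comparison.
    if (docCount > 0 && count / docCount > threshold) {
      excluded.add(entity);
    }
  }
  return excluded;
}

// Illustrative data: "usd" appears in every document, the vendor names in one each.
const sampleDocs = [
  { id: "d1", entities: ["usd", "acme"] },
  { id: "d2", entities: ["usd", "globex"] },
  { id: "d3", entities: ["usd", "initech"] },
  { id: "d4", entities: ["usd", "umbrella"] },
];
const excluded = excludeHighFrequencyEntities(sampleDocs);
console.log([...excluded]);
```

In this sample the currency code is dropped from case formation while the vendor names (each in 25% of documents) are kept, so the four documents are not pulled into a single case.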
@@ -3248,9 +3505,24 @@ var sections7 = [
|
|
|
3248
3505
|
type: "paragraph",
|
|
3249
3506
|
text: "For best results, ensure your source documents contain consistent identifiers. The pipeline handles minor variations automatically, but wildly inconsistent naming (e.g., abbreviations vs. full legal names) may require manual link key tuning in the Field Registry."
|
|
3250
3507
|
},
|
|
3508
|
+
{
|
|
3509
|
+
type: "paragraph",
|
|
3510
|
+
text: 'A typical entity linking workflow looks like this: you upload a batch of invoices, contracts, and purchase orders. The pipeline extracts link key values \u2014 vendor names, PO numbers, contract references \u2014 normalizes them, and builds the graph. An invoice referencing "ACME Corp" and a contract referencing "Acme Corporation" both resolve to the same entity node after normalization, so the two documents become connected. If a purchase order also references the same vendor, all three documents end up in the same case.'
|
|
3511
|
+
},
|
|
3512
|
+
{
|
|
3513
|
+
type: "list",
|
|
3514
|
+
items: [
|
|
3515
|
+
"Runs automatically after document extraction \u2014 no manual trigger required",
|
|
3516
|
+
"Normalizes values: lowercasing, suffix stripping (Ltd, Inc, Corp, GmbH), whitespace normalization",
|
|
3517
|
+
"Builds a bipartite graph: document nodes connected to entity nodes via link key edges",
|
|
3518
|
+
"Connected components in the graph become the basis for case formation",
|
|
3519
|
+
"Incremental: new documents extend the existing graph rather than rebuilding it",
|
|
3520
|
+
"Handles minor naming variations automatically; wildly inconsistent names may need manual tuning"
|
|
3521
|
+
]
|
|
3522
|
+
},
|
|
3251
3523
|
{
|
|
3252
3524
|
type: "callout",
|
|
3253
|
-
text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered."
|
|
3525
|
+
text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered. This means your cases stay up-to-date without manual intervention as new documents flow into the workspace."
|
|
3254
3526
|
}
|
|
3255
3527
|
],
|
|
3256
3528
|
related: [
|
|
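Editor's note on the hunk above: the entity linking steps it lists (normalize values, build a bipartite document-entity graph, take connected components as cases) can be sketched as follows. This is an assumption-laden illustration rather than the platform's code. `normalizeEntity`, `formCases`, the `{ id, linkKeys }` shape, and the document IDs are hypothetical, and the `corporation` suffix is added only so the "ACME Corp" / "Acme Corporation" example from the text resolves to one node; the documented suffix list is Ltd, Inc, Corp, GmbH.

```javascript
// Hypothetical normalizer: lowercasing, suffix stripping, whitespace normalization.
function normalizeEntity(value) {
  return value
    .toLowerCase()
    .replace(/[.,]/g, " ") // treat punctuation as whitespace
    .replace(/\b(ltd|inc|corp|corporation|gmbh)\b/g, " ") // "corporation" is an assumption
    .replace(/\s+/g, " ")
    .trim();
}

// Connected components over a bipartite graph, using a small union-find.
function formCases(docs) {
  const parent = new Map();
  const find = (x) => {
    while (parent.get(x) !== x) {
      parent.set(x, parent.get(parent.get(x))); // path halving
      x = parent.get(x);
    }
    return x;
  };
  const union = (a, b) => parent.set(find(a), find(b));

  for (const doc of docs) {
    const docNode = `doc:${doc.id}`;
    parent.set(docNode, docNode);
    for (const raw of doc.linkKeys) {
      const entityNode = `entity:${normalizeEntity(raw)}`;
      if (!parent.has(entityNode)) parent.set(entityNode, entityNode);
      union(docNode, entityNode); // link key edge: document <-> entity
    }
  }

  // Group document IDs by the root of their component.
  const cases = new Map();
  for (const doc of docs) {
    const root = find(`doc:${doc.id}`);
    if (!cases.has(root)) cases.set(root, []);
    cases.get(root).push(doc.id);
  }
  return [...cases.values()];
}

const cases = formCases([
  { id: "inv-1", linkKeys: ["ACME Corp"] },
  { id: "ctr-1", linkKeys: ["Acme Corporation"] },
  { id: "po-1", linkKeys: ["Acme Corp.", "PO-8890"] },
  { id: "inv-2", linkKeys: ["Globex GmbH"] },
]);
console.log(cases);
```

Union-find also fits the incremental behavior the text mentions: a new document only needs to be unioned into the existing structure, with no rebuild of earlier components.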
@@ -3321,11 +3593,30 @@ var sections7 = [
|
|
|
3321
3593
|
},
|
|
3322
3594
|
{
|
|
3323
3595
|
type: "paragraph",
|
|
3324
|
-
text:
|
|
3596
|
+
text: 'Cases display an AI-generated **case label** as the primary title and include anomaly count badges in the header. The label is generated by analyzing the documents and entities in the case to produce a human-readable summary \u2014 for example, "ACME Corp Invoice #4521 → PO #8890". You can rename a case manually if the AI label does not capture the right context. Evidence and Timeline tabs support export to MD, CSV, and JSON for offline review or compliance reporting.'
|
|
3325
3597
|
},
|
|
3326
3598
|
{
|
|
3327
3599
|
type: "paragraph",
|
|
3328
|
-
text: "Additional case operations: **merge** multiple cases into one, **split** a case into separate groups, **pin** or remove documents from a case, and **confirm** or **reject** individual linking edges."
|
|
3600
|
+
text: "Additional case operations: **merge** multiple cases into one, **split** a case into separate groups, **pin** or remove documents from a case, and **confirm** or **reject** individual linking edges. Merging is useful when two cases refer to the same real-world transaction but were not connected by the linking pipeline \u2014 for example, when a vendor uses slightly different names across documents. Splitting lets you break apart a case that was over-connected by a high-frequency entity value. Edge confirmation and rejection feed back into the linking model, improving future case formation accuracy."
|
|
3601
|
+
},
|
|
3602
|
+
{
|
|
3603
|
+
type: "paragraph",
|
|
3604
|
+
text: "Cases follow a lifecycle: **discovered** when the linking engine first identifies a cluster, **confirmed** when a reviewer validates the grouping, **active** during ongoing work, and **resolved** when all documents have been reviewed and processed. The lifecycle status is visible on the cases list page and can be updated from the case detail header. Filtering by lifecycle status makes it easy to focus on cases that need attention."
|
|
3605
|
+
},
|
|
3606
|
+
{
|
|
3607
|
+
type: "list",
|
|
3608
|
+
items: [
|
|
3609
|
+
"AI-generated case labels with manual rename option",
|
|
3610
|
+
"Four tabs: Overview, Anomalies (Advanced mode), Evidence, and Timeline",
|
|
3611
|
+
"Merge, split, pin, remove, confirm, and reject operations",
|
|
3612
|
+
"Lifecycle tracking: discovered → confirmed → active → resolved",
|
|
3613
|
+
"Anomaly count badges in the case header for quick triage",
|
|
3614
|
+
"Export Evidence and Timeline to MD, CSV, or JSON"
|
|
3615
|
+
]
|
|
3616
|
+
},
|
|
3617
|
+
{
|
|
3618
|
+
type: "callout",
|
|
3619
|
+
text: "Case formation runs automatically after entity linking completes. You do not need to create cases manually \u2014 the system discovers them from the document-entity graph. Use merge, split, and edge operations to refine cases when the automatic grouping needs adjustment."
|
|
3329
3620
|
}
|
|
3330
3621
|
],
|
|
3331
3622
|
related: [
|
|
@@ -3374,9 +3665,24 @@ var sections7 = [
|
|
|
3374
3665
|
type: "paragraph",
|
|
3375
3666
|
text: "Most teams use the graph view during initial workspace setup to verify that linking is producing sensible clusters. Once you are confident in your link key configuration, the list view is more practical for day-to-day case review and triage."
|
|
3376
3667
|
},
|
|
3668
|
+
{
|
|
3669
|
+
type: "paragraph",
|
|
3670
|
+
text: "The graph view is particularly useful during onboarding. When you first upload documents to a workspace, the graph gives you immediate visual feedback on whether your link key configuration is producing sensible clusters. If you see one massive cluster with everything connected, a high-frequency entity value may be acting as a bridge \u2014 check the Field Registry and exclude or reclassify the offending field. If you see many disconnected single-document nodes, your documents may lack shared identifiers, or the normalization rules may need adjustment."
|
|
3671
|
+
},
|
|
3672
|
+
{
|
|
3673
|
+
type: "list",
|
|
3674
|
+
items: [
|
|
3675
|
+
"D3-force layout with distinct visual styles for document and entity nodes",
|
|
3676
|
+
"Hover to highlight connections and trace document-entity relationships",
|
|
3677
|
+
"Toggle between graph view and list view from the Cases page",
|
|
3678
|
+
"Case templates auto-discovered after 3+ cases share the same document type pattern",
|
|
3679
|
+
"Templates include a match threshold controlling how closely a case must match",
|
|
3680
|
+
"Missing document type anomalies raised when a case does not match its template"
|
|
3681
|
+
]
|
|
3682
|
+
},
|
|
3377
3683
|
{
|
|
3378
3684
|
type: "callout",
|
|
3379
|
-
text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern."
|
|
3685
|
+
text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern. You can also trigger template discovery manually from the API via POST /cases/templates/discover."
|
|
3380
3686
|
}
|
|
3381
3687
|
],
|
|
3382
3688
|
related: [
|
|
@@ -3453,9 +3759,23 @@ var sections7 = [
|
|
|
3453
3759
|
type: "paragraph",
|
|
3454
3760
|
text: "Use anomaly detection to surface data quality issues that would otherwise require manual comparison across documents. For best results, configure case templates so the **Missing Document Type** detector (D4) can flag incomplete cases. Most teams find that D2 (Field Conflict) and D3 (Duplicate Key Divergence) catch the highest-value issues in procurement and financial workflows."
|
|
3455
3761
|
},
|
|
3762
|
+
{
|
|
3763
|
+
type: "paragraph",
|
|
3764
|
+
text: "A typical workflow starts on the cases list page, where anomaly count badges give you an at-a-glance view of which cases need attention. Click into a case with anomalies, switch to the **Anomalies** tab, and use the severity filter pills to focus on critical issues first. Each anomaly card explains the affected fields and the specific violation detected. Dismiss false positives with the dismiss button \u2014 they remain accessible via the **show dismissed** toggle if you need to revisit them later."
|
|
3765
|
+
},
|
|
3766
|
+
{
|
|
3767
|
+
type: "list",
|
|
3768
|
+
items: [
|
|
3769
|
+
"D1 \u2014 Validation Cluster: multiple validation failures concentrated in the same document or field group",
|
|
3770
|
+
"D2 \u2014 Field Conflict: contradictory values for the same field across documents in a case",
|
|
3771
|
+
"D3 \u2014 Duplicate Key Divergence: shared link key but differing values on fields that should match",
|
|
3772
|
+
"D4 \u2014 Missing Document Type: case template expects a document type that is absent",
|
|
3773
|
+
"D5 \u2014 Value Reuse: identical values across unrelated fields, suggesting copy-paste or extraction errors"
|
|
3774
|
+
]
|
|
3775
|
+
},
|
|
3456
3776
|
{
|
|
3457
3777
|
type: "callout",
|
|
3458
|
-
text: "Anomaly detection requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but not displayed in the case detail page."
|
|
3778
|
+
text: "Anomaly detection requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but not displayed in the case detail page. Toggle Advanced mode from the sidebar to access the full anomaly workflow."
|
|
3459
3779
|
}
|
|
3460
3780
|
],
|
|
3461
3781
|
related: [
|
|
@@ -3515,9 +3835,26 @@ var sections7 = [
|
|
|
3515
3835
|
type: "paragraph",
|
|
3516
3836
|
text: "The checksum validator (S7) uses a parameterized factory pattern \u2014 it accepts a checksum algorithm name and applies the corresponding verification logic. Supported algorithms include Luhn (credit card numbers), ABA (bank routing numbers), IBAN (international bank accounts), and ISBN (book identifiers). For best results, ensure your schema fields are typed correctly so the engine knows which checksum to apply."
|
|
3517
3837
|
},
|
|
3838
|
+
{
|
|
3839
|
+
type: "paragraph",
|
|
3840
|
+
text: "A typical evidence validation workflow starts automatically after extraction and linking. You navigate to a case, open the **Evidence** tab, and immediately see colored badges next to each field value. Red badges indicate failures that need attention \u2014 click a badge to see which validator fired and what the expected format or value was. Use the filter bar to narrow results by status (pass/fail/warning), by document, by category, or by free-text search. Group-by-document collapsible sections let you review one document at a time within a case."
|
|
3841
|
+
},
|
|
3842
|
+
{
|
|
3843
|
+
type: "list",
|
|
3844
|
+
items: [
|
|
3845
|
+
"S1 \u2014 Free-text spillover: unstructured text leaked from adjacent content",
|
|
3846
|
+
"S2 \u2014 Empty value: required field is blank or whitespace-only",
|
|
3847
|
+
"S3 \u2014 Email/URL misclassification: value looks like an email or URL in the wrong field type",
|
|
3848
|
+
"S4 \u2014 Name in URL field: person or company name extracted into a URL-typed field",
|
|
3849
|
+
"S5 \u2014 Alpha in numeric field: alphabetic characters in a numeric-only field",
|
|
3850
|
+
"S6 \u2014 Cross-field duplicate: identical value in multiple unrelated fields on the same document",
|
|
3851
|
+
"S7 \u2014 Checksum validation: Luhn, ABA, IBAN, ISBN verification via parameterized factory",
|
|
3852
|
+
"Domain packs: industry-specific rules (e.g., freight: DOT numbers, MC numbers)"
|
|
3853
|
+
]
|
|
3854
|
+
},
|
|
3518
3855
|
{
|
|
3519
3856
|
type: "callout",
|
|
3520
|
-
text: "Evidence validation results are stored
|
|
3857
|
+
text: "Evidence validation results are stored separately from extraction and linking data. This means you can re-run validation independently without re-extracting documents. Results are keyed by (document_id, entity_id, field_key) for precise field-level tracking."
|
|
3521
3858
|
}
|
|
3522
3859
|
],
|
|
3523
3860
|
related: [
|
|
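Editor's note on the hunk above: the "parameterized factory" described for the S7 checksum validator can be illustrated with a lookup of algorithm names to verification functions. This sketch covers only Luhn and ABA, using the standard public definitions of those two checksums. The names `checksumAlgorithms` and `makeChecksumValidator`, and the shape of the returned result object, are assumptions and not the package's actual API.

```javascript
// Hypothetical registry of checksum algorithms keyed by name.
const checksumAlgorithms = {
  // Luhn: from the rightmost digit, double every second digit, subtract 9 from
  // results above 9, and require the total to be divisible by 10.
  luhn(value) {
    const digits = value.replace(/[\s-]/g, "");
    if (!/^\d+$/.test(digits)) return false;
    let sum = 0;
    let doubleNext = false;
    for (let i = digits.length - 1; i >= 0; i--) {
      let d = digits.charCodeAt(i) - 48;
      if (doubleNext) {
        d *= 2;
        if (d > 9) d -= 9;
      }
      sum += d;
      doubleNext = !doubleNext;
    }
    return sum % 10 === 0;
  },

  // ABA routing number: nine digits weighted 3, 7, 1 (repeating), total divisible by 10.
  aba(value) {
    if (!/^\d{9}$/.test(value)) return false;
    const weights = [3, 7, 1, 3, 7, 1, 3, 7, 1];
    const sum = [...value].reduce((acc, ch, i) => acc + Number(ch) * weights[i], 0);
    return sum % 10 === 0;
  },
};

// Factory: takes an algorithm name, returns a validator bound to that algorithm.
function makeChecksumValidator(algorithm) {
  const verify = checksumAlgorithms[algorithm];
  if (!verify) throw new Error(`Unsupported checksum algorithm: ${algorithm}`);
  return (value) => ({ validator: "S7", algorithm, pass: verify(String(value)) });
}

const validateCard = makeChecksumValidator("luhn");
const validateRouting = makeChecksumValidator("aba");
console.log(validateCard("4111 1111 1111 1111"), validateRouting("021000021"));
```

Keeping the algorithm as a parameter means a field's type can select the checksum, which matches the advice in the text to type schema fields correctly.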
@@ -3572,9 +3909,24 @@ var sections8 = [
|
|
|
3572
3909
|
type: "paragraph",
|
|
3573
3910
|
text: "For best results, create one template per downstream consumer. If your finance team and operations team need different column subsets from the same schema, define two templates rather than manually reconfiguring each export. Most teams version their templates alongside schema changes to maintain backward compatibility with existing integrations."
|
|
3574
3911
|
},
|
|
3912
|
+
{
|
|
3913
|
+
type: "paragraph",
|
|
3914
|
+
text: "To create a dataset template, navigate to **Data Products → Dataset Templates** and click **New Template**. Select the user schema that defines the field set, then configure column mappings to rename, reorder, or exclude fields from the output. Add default transforms \u2014 such as date formatting to ISO 8601, currency normalization to a base currency, or unit conversion \u2014 that run automatically during assembly. Save the template and it becomes available to any team member when creating a new job or assembly."
|
|
3915
|
+
},
|
|
3916
|
+
{
|
|
3917
|
+
type: "list",
|
|
3918
|
+
items: [
|
|
3919
|
+
"Linked to a user schema \u2014 fields are inherited automatically as the schema evolves",
|
|
3920
|
+
"Column mappings: rename, reorder, or exclude fields from the final output",
|
|
3921
|
+
"Default transforms: date formatting, currency normalization, unit conversion",
|
|
3922
|
+
"Independent versioning: evolve the template without affecting existing data products",
|
|
3923
|
+
"Workspace-scoped: any team member can create, edit, or use any template",
|
|
3924
|
+
"One template per downstream consumer is the recommended pattern"
|
|
3925
|
+
]
|
|
3926
|
+
},
|
|
3575
3927
|
{
|
|
3576
3928
|
type: "callout",
|
|
3577
|
-
text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction."
|
|
3929
|
+
text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction. Version your templates alongside schema changes to maintain backward compatibility with existing integrations and downstream consumers."
|
|
3578
3930
|
}
|
|
3579
3931
|
],
|
|
3580
3932
|
related: [
|
|
@@ -3627,9 +3979,24 @@ var sections8 = [
|
|
|
3627
3979
|
type: "paragraph",
|
|
3628
3980
|
text: "Use assemblies whenever you need a repeatable, auditable output for downstream systems or stakeholders. Most teams create one assembly per reporting period or delivery cycle. Because assemblies reference a template, you can regenerate the same output shape from different document sets without reconfiguring columns or transforms each time."
|
|
3629
3981
|
},
|
|
3982
|
+
{
|
|
3983
|
+
type: "paragraph",
|
|
3984
|
+
text: "Assemblies also support incremental updates. When new documents arrive in a source that is already part of an assembly, you can regenerate the assembly to include them without reconfiguring anything. The system re-applies the template, pulls the updated document set, and produces a fresh output. Previous assembly versions are retained for comparison, so you can track how your dataset evolves over successive runs."
|
|
3985
|
+
},
|
|
3986
|
+
{
|
|
3987
|
+
type: "list",
|
|
3988
|
+
items: [
|
|
3989
|
+
"Select a dataset template and one or more document sources to create an assembly",
|
|
3990
|
+
"Column mappings and transforms from the template are applied automatically",
|
|
3991
|
+
"Full traceability from every output row back to its source document",
|
|
3992
|
+
"Incremental updates \u2014 regenerate to include newly arrived documents",
|
|
3993
|
+
"Previous assembly versions retained for comparison and auditing",
|
|
3994
|
+
"Export the assembled dataset as CSV with leading zero preservation"
|
|
3995
|
+
]
|
|
3996
|
+
},
|
|
3630
3997
|
{
|
|
3631
3998
|
type: "callout",
|
|
3632
|
-
text: "Assemblies are the recommended way to produce production datasets. They provide a single audit trail from source documents through extraction, resolution, and validation to the final output."
|
|
3999
|
+
text: "Assemblies are the recommended way to produce production datasets. They provide a single audit trail from source documents through extraction, resolution, and validation to the final output. If your workflow requires repeatable, auditable deliverables, assemblies eliminate the need for manual export configuration on every run."
|
|
3633
4000
|
}
|
|
3634
4001
|
],
|
|
3635
4002
|
related: [
|
|
@@ -3693,7 +4060,22 @@ var sections8 = [
|
|
|
3693
4060
|
},
|
|
3694
4061
|
{
|
|
3695
4062
|
type: "paragraph",
|
|
3696
|
-
text: "ID rules are persisted before generating IDs. Navigate to a data product detail page and use **Apply ID Rules** to generate or **Regenerate IDs** to refresh."
|
|
4063
|
+
text: "ID rules are persisted before generating IDs. Navigate to a data product detail page and use **Apply ID Rules** to generate or **Regenerate IDs** to refresh. The generation process evaluates each row against the configured rules: it reads the source field value, applies the resolution map if one exists, prepends the prefix, and writes the resulting ID. If the source field is empty, the dispenser walks the fallback chain in order until it finds a non-empty value. If all fields in the chain are empty, a prefix-less sequential ID is assigned so no row is left without an identifier."
|
|
4064
|
+
},
|
|
4065
|
+
{
|
|
4066
|
+
type: "paragraph",
|
|
4067
|
+
text: "A typical workflow starts by choosing a high-cardinality field as the source \u2014 contract numbers, invoice IDs, or purchase order references work well because they are unique per document. Next, configure a fallback chain with one or two alternative fields (e.g., document name, then upload date) so the dispenser always has a value to work with. Finally, add a resolution map if your source data contains variant spellings of the same entity. The map normalizes these variants before they become part of the ID, preventing duplicate IDs for rows that refer to the same real-world record."
|
|
4068
|
+
},
|
|
4069
|
+
{
|
|
4070
|
+
type: "list",
|
|
4071
|
+
items: [
|
|
4072
|
+
"Source field: the primary field used to derive each row ID",
|
|
4073
|
+
"Fallback chain: ordered list of alternative fields tried when the source is empty",
|
|
4074
|
+
"Resolution map: key-value lookup that normalizes values before ID generation",
|
|
4075
|
+
"Prefix: optional string prepended to every generated ID for namespacing",
|
|
4076
|
+
"Deterministic: same rules + same data always produces the same IDs",
|
|
4077
|
+
"Non-destructive: regenerating IDs only updates the ID column, all other values remain unchanged"
|
|
4078
|
+
]
|
|
3697
4079
|
},
|
|
3698
4080
|
{
|
|
3699
4081
|
type: "paragraph",
|
|
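Editor's note on the hunk above: the ID generation sequence it describes (read the source field, walk the fallback chain if it is empty, apply the resolution map, prepend the prefix, otherwise assign a prefix-less sequential ID) can be sketched as below. The function name `generateIds`, the rule object's property names, and the field names are illustrative assumptions. Two details are not specified in the text and are assumed here: the resolution map is also applied to fallback values, and sequential IDs are plain counters starting at 1.

```javascript
// Hypothetical sketch of the ID dispenser; not the platform's implementation.
function generateIds(rows, rules) {
  const { sourceField, fallbackChain = [], resolutionMap = {}, prefix = "" } = rules;
  let sequence = 0;

  return rows.map((row) => {
    // Source field first, then the fallback chain in order.
    const candidates = [sourceField, ...fallbackChain];
    const field = candidates.find((name) => String(row[name] ?? "").trim() !== "");

    if (field === undefined) {
      // Every field in the chain is empty: assign a prefix-less sequential ID.
      sequence += 1;
      return { ...row, id: String(sequence) };
    }

    const raw = String(row[field]).trim();
    // Normalize variant spellings before the value becomes part of the ID.
    const resolved = Object.prototype.hasOwnProperty.call(resolutionMap, raw)
      ? resolutionMap[raw]
      : raw;
    return { ...row, id: `${prefix}${resolved}` };
  });
}

const idRules = {
  sourceField: "contract_number",
  fallbackChain: ["document_name", "upload_date"],
  resolutionMap: { "ACME Corp": "acme", "Acme Corporation": "acme" },
  prefix: "CTR-",
};
const withIds = generateIds(
  [
    { contract_number: "C-100" },
    { contract_number: "", document_name: "ACME Corp" },
    { total: "12.50" },
  ],
  idRules
);
console.log(withIds.map((row) => row.id));
```

Because the function is pure and returns copies with only `id` added, it reflects the deterministic and non-destructive properties listed in the hunk: the same rules and data give the same IDs, and other columns are untouched.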
@@ -3738,7 +4120,7 @@ var sections8 = [
|
|
|
3738
4120
|
content: [
|
|
3739
4121
|
{
|
|
3740
4122
|
type: "paragraph",
|
|
3741
|
-
text: "Each data product can generate a **share token** \u2014 a public URL that grants read access without authentication. The delivery website renders three toggle views:"
|
|
4123
|
+
text: "Each data product can generate a **share token** \u2014 a public URL that grants read access without authentication. Share tokens are ideal for distributing finalized datasets to external stakeholders, auditors, or downstream teams that do not have Talonic accounts. The token is scoped to a single data product and can be revoked at any time from the data product detail page without affecting other shared links. The delivery website renders three toggle views:"
|
|
3742
4124
|
},
|
|
3743
4125
|
{
|
|
3744
4126
|
type: "param-table",
|
|
@@ -3762,11 +4144,30 @@ var sections8 = [
|
|
|
3762
4144
|
},
|
|
3763
4145
|
{
|
|
3764
4146
|
type: "paragraph",
|
|
3765
|
-
text: "The delivery website includes the Talonic logo, per-run selection, and **CSV export** with leading zero and long number preservation (values are not coerced to numbers)."
|
|
4147
|
+
text: "The delivery website includes the Talonic logo, per-run selection, and **CSV export** with leading zero and long number preservation (values are not coerced to numbers). When multiple runs exist for the same data product, the delivery website lets viewers switch between runs using a dropdown selector, making it easy to compare outputs across time periods or pipeline configurations. CSV downloads preserve the exact cell values shown in the active view \u2014 including leading zeros on codes like ZIP codes and account numbers \u2014 so recipients can open the file in Excel or Google Sheets without data loss."
|
|
3766
4148
|
},
|
|
3767
4149
|
{
|
|
3768
4150
|
type: "paragraph",
|
|
3769
|
-
text: "**Auto-review** and **auto-resolve singles** are available to streamline the approval process: auto-review uses LLM to propose approve/reject decisions, and auto-resolve singles automatically accepts fields with only one candidate value."
|
|
4151
|
+
text: "**Auto-review** and **auto-resolve singles** are available to streamline the approval process: auto-review uses LLM to propose approve/reject decisions, and auto-resolve singles automatically accepts fields with only one candidate value. Together, these features can reduce the manual review burden by 60-80% on typical workloads. Auto-review examines each pending field against the extraction context and proposes a decision with a confidence indicator, while auto-resolve singles handles the common case where a field has exactly one candidate \u2014 no ambiguity to resolve, so automatic acceptance is safe."
|
|
4152
|
+
},
|
|
4153
|
+
{
|
|
4154
|
+
type: "paragraph",
|
|
4155
|
+
text: "To set up sharing, navigate to the data product detail page and click **Generate Share Link**. The system creates a unique token and displays the public URL. You can copy this URL and send it to anyone \u2014 no Talonic login is required to view the delivery website. If you need to revoke access, delete the share token from the same page. The data product itself is unaffected; only the public URL stops working."
|
|
4156
|
+
},
|
|
4157
|
+
{
|
|
4158
|
+
type: "list",
|
|
4159
|
+
items: [
|
|
4160
|
+
"Three toggle views: Structured Data (raw extraction), Resolved (post-normalization), and Data Product (final assembled output)",
|
|
4161
|
+
"Per-run selector for comparing outputs across pipeline runs or time periods",
|
|
4162
|
+
"CSV export with leading zero and long number preservation \u2014 values are never coerced to numeric types",
|
|
4163
|
+
"Auto-review with LLM-proposed approve/reject decisions for pending fields",
|
|
4164
|
+
"Auto-resolve singles for fields with exactly one candidate value",
|
|
4165
|
+
"Share tokens are revocable and scoped to a single data product"
|
|
4166
|
+
]
|
|
4167
|
+
},
|
|
4168
|
+
{
|
|
4169
|
+
type: "callout",
|
|
4170
|
+
text: "Share tokens grant read-only access. Recipients can view and export data but cannot modify the data product, run new jobs, or access any other workspace resources. Revoke a token at any time from the data product detail page."
|
|
3770
4171
|
}
|
|
3771
4172
|
],
|
|
3772
4173
|
related: [
|
|
@@ -3822,9 +4223,24 @@ var sections9 = [
|
|
|
3822
4223
|
type: "paragraph",
|
|
3823
4224
|
text: "For best results, start with a small set of high-confidence rules and expand over time. Most teams begin with field format checks for critical identifiers (invoice numbers, dates, amounts) and add cross-field consistency rules as they learn their data patterns. Validation failures do not block extraction \u2014 they flag records for review."
|
|
3824
4225
|
},
|
|
4226
|
+
{
|
|
4227
|
+
type: "paragraph",
|
|
4228
|
+
text: "A typical setup workflow looks like this: run your first job, review the extraction results, and identify fields where errors are common. Navigate to **Validation → Checks** and create rules for those fields \u2014 a date format check on invoice_date, a value range check on total_amount, a cross-field consistency check that start_date precedes end_date. On subsequent jobs, Phase 3 evaluates every record against your active rules and flags failures for review in the Approval Queue."
|
|
4229
|
+
},
|
|
4230
|
+
{
|
|
4231
|
+
type: "list",
|
|
4232
|
+
items: [
|
|
4233
|
+
"Field format: verify values match expected patterns (ISO dates, phone numbers with country codes, email addresses)",
|
|
4234
|
+
"Value range: ensure numeric or date values fall within acceptable bounds",
|
|
4235
|
+
"Cross-field consistency: compare two or more fields on the same record (e.g., start date before end date)",
|
|
4236
|
+
"AI-proposed coherence rules: generated from patterns in completed job results, require explicit approval",
|
|
4237
|
+
"Schema-scoped: rules on one schema do not affect other schemas",
|
|
4238
|
+
"Non-blocking: validation failures flag records for review but do not prevent extraction"
|
|
4239
|
+
]
|
|
4240
|
+
},
|
|
3825
4241
|
{
|
|
3826
4242
|
type: "callout",
|
|
3827
|
-
text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently."
|
|
4243
|
+
text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently. Start with a few high-confidence rules and expand as you learn your data patterns."
|
|
3828
4244
|
}
|
|
3829
4245
|
],
|
|
3830
4246
|
related: [
|
|
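Editor's note on the hunk above: the three rule types in its setup workflow (a date format check on `invoice_date`, a value range check on `total_amount`, and a cross-field check that `start_date` precedes `end_date`) can be expressed as small predicates. This is only a sketch of the idea. The `checks` array, the `validateRecord` function, the result shape, and the numeric bounds are assumptions, not the way rules are configured under Validation → Checks.

```javascript
// Hypothetical rule definitions; the range bounds are illustrative only.
const checks = [
  {
    name: "invoice_date format",
    test: (r) => /^\d{4}-\d{2}-\d{2}$/.test(r.invoice_date ?? ""),
  },
  {
    name: "total_amount range",
    test: (r) => Number(r.total_amount) >= 0 && Number(r.total_amount) <= 1000000,
  },
  {
    name: "start_date before end_date",
    test: (r) => Date.parse(r.start_date) < Date.parse(r.end_date),
  },
];

// Non-blocking: the record is always returned, failures only flag it for review.
function validateRecord(record) {
  const failures = checks.filter((check) => !check.test(record)).map((check) => check.name);
  return { record, flagged: failures.length > 0, failures };
}

const cleanResult = validateRecord({
  invoice_date: "2024-03-15",
  total_amount: "1250.00",
  start_date: "2024-01-01",
  end_date: "2024-12-31",
});
const flaggedResult = validateRecord({
  invoice_date: "15/03/2024",
  total_amount: "1250.00",
  start_date: "2024-12-31",
  end_date: "2024-01-01",
});
console.log(cleanResult.flagged, flaggedResult.failures);
```

Returning the record together with its failures mirrors the non-blocking behavior described in the text: extraction output is kept, and flagged records go to review.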
@@ -3876,9 +4292,25 @@ var sections9 = [
|
|
|
3876
4292
|
type: "paragraph",
|
|
3877
4293
|
text: "For best results, create golden samples from a representative mix of document types and complexity levels. Most teams maintain 5-10 golden samples per schema and re-run benchmarks after schema changes, instruction updates, or model upgrades to track quality trends over time."
|
|
3878
4294
|
},
|
|
4295
|
+
{
|
|
4296
|
+
type: "paragraph",
|
|
4297
|
+
text: "A typical benchmarking workflow starts after a schema change or model upgrade. Navigate to **Validation → Golden Samples**, select the samples you want to benchmark, and click **Run Benchmark**. The system re-extracts each document independently and compares every field against your known-correct values. The results page shows a per-field accuracy matrix with pass/fail indicators and AI judge verdicts explaining each comparison. Use this data to pinpoint fields that need schema instruction tuning or additional extraction context."
|
|
4298
|
+
},
|
|
4299
|
+
{
|
|
4300
|
+
type: "list",
|
|
4301
|
+
items: [
|
|
4302
|
+
"Create golden samples by selecting a document and entering known-correct values for each field",
|
|
4303
|
+
"Benchmark runs compare extraction results field-by-field against the golden sample baseline",
|
|
4304
|
+
'AI judge evaluates semantic equivalence (e.g., "United States" matches "US" for country fields)',
|
|
4305
|
+
"Per-field accuracy scores identify exactly which fields are underperforming",
|
|
4306
|
+
"Maintain 5-10 golden samples per schema for representative coverage",
|
|
4307
|
+
"Re-run benchmarks after schema changes, instruction updates, or model upgrades",
|
|
4308
|
+
"Benchmark results are stored historically for tracking quality trends over time"
|
|
4309
|
+
]
|
|
4310
|
+
},
|
|
3879
4311
|
{
|
|
3880
4312
|
type: "callout",
|
|
3881
|
-
text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed."
|
|
4313
|
+
text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed. This separation ensures that ground truth data remains a pure measurement tool without introducing bias into the extraction pipeline."
|
|
3882
4314
|
}
|
|
3883
4315
|
],
|
|
3884
4316
|
related: [
|
|
@@ -3925,9 +4357,25 @@ var sections9 = [
|
|
|
3925
4357
|
type: "paragraph",
|
|
3926
4358
|
text: "Approval gates integrate directly with the delivery pipeline. When a result passes all gates, a `result.approved` signal is emitted automatically. Bind this signal to a destination to create a fully automated flow from document upload through extraction, validation, approval, and delivery \u2014 no manual steps required for high-confidence results."
|
|
3927
4359
|
},
|
|
4360
|
+
{
|
|
4361
|
+
type: "paragraph",
|
|
4362
|
+
text: "A practical example: you configure an approval gate on your invoice schema with 90% confidence, 95% validation pass rate, and 80% field coverage. An invoice extraction scores 96% confidence, passes all validation checks, and has all fields populated. It clears all three gates and is auto-approved \u2014 a `result.approved` signal fires immediately. A second invoice scores 85% confidence. It fails the confidence gate and is routed to the Approval Queue for manual review. This two-track approach lets high-quality results flow through instantly while ensuring borderline cases get human attention."
|
|
4363
|
+
},
|
|
4364
|
+
{
|
|
4365
|
+
type: "list",
|
|
4366
|
+
items: [
|
|
4367
|
+
"Minimum confidence: lowest acceptable extraction confidence score",
|
|
4368
|
+
"Validation pass rate: minimum percentage of validation checks that must pass",
|
|
4369
|
+
"Field coverage: minimum percentage of schema fields with non-empty values",
|
|
4370
|
+
"All three gates must be cleared for auto-approval \u2014 any failure routes to manual review",
|
|
4371
|
+
"Configured per schema for independent tuning per document type",
|
|
4372
|
+
"Emits result.approved signal on auto-approval for delivery pipeline integration",
|
|
4373
|
+
"Start conservative (high thresholds) and loosen as pipeline trust builds"
|
|
4374
|
+
]
|
|
4375
|
+
},
|
|
3928
4376
|
{
|
|
3929
4377
|
type: "callout",
|
|
3930
|
-
text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems."
|
|
4378
|
+
text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to only ship approved rows to your downstream systems. Combined with webhooks, this creates a fully automated flow from document upload through extraction, validation, approval, and delivery with zero manual steps for high-confidence results."
|
|
3931
4379
|
}
|
|
3932
4380
|
],
|
|
3933
4381
|
related: [
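The approval-gate example in this hunk can be sketched as a small threshold check. This is an illustrative sketch only: the object shapes and the names `invoiceGate`, `evaluateApprovalGate`, `minConfidence`, `minValidationPassRate`, and `minFieldCoverage` are hypothetical and do not come from the package. The second invoice's validation and coverage values are also assumed, since the text only gives its confidence.

```javascript
// Hypothetical gate configuration mirroring the thresholds in the example:
// 90% confidence, 95% validation pass rate, 80% field coverage.
const invoiceGate = {
  minConfidence: 0.9,
  minValidationPassRate: 0.95,
  minFieldCoverage: 0.8,
};

// Returns the routing decision plus the names of any thresholds that failed.
// All three thresholds must be met for auto-approval.
function evaluateApprovalGate(gate, result) {
  const failed = [];
  if (result.confidence < gate.minConfidence) failed.push("confidence");
  if (result.validationPassRate < gate.minValidationPassRate) failed.push("validation");
  if (result.fieldCoverage < gate.minFieldCoverage) failed.push("coverage");
  return failed.length === 0
    ? { decision: "auto_approved", signal: "result.approved", failed }
    : { decision: "manual_review", signal: null, failed };
}

// First invoice from the example: 96% confidence, all checks pass, all fields populated.
const first = evaluateApprovalGate(invoiceGate, {
  confidence: 0.96,
  validationPassRate: 1.0,
  fieldCoverage: 1.0,
});

// Second invoice: 85% confidence (other values assumed for illustration).
const second = evaluateApprovalGate(invoiceGate, {
  confidence: 0.85,
  validationPassRate: 1.0,
  fieldCoverage: 1.0,
});

console.log(first.decision);  // auto_approved
console.log(second.decision); // manual_review
```

Returning the list of failed thresholds alongside the decision is a design choice of this sketch; it makes it easier to show a reviewer why an item landed in the queue.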
|
|
@@ -3984,9 +4432,26 @@ var sections9 = [
|
|
|
3984
4432
|
type: "paragraph",
|
|
3985
4433
|
text: "For best results, review flagged items first \u2014 these are records where at least one validation check failed, making them the most likely to contain errors. Most teams assign a daily review cadence and use confidence range filters to prioritize low-confidence items that need the most attention."
|
|
3986
4434
|
},
|
|
4435
|
+
{
|
|
4436
|
+
type: "paragraph",
|
|
4437
|
+
text: "The Approval Queue integrates with the broader delivery pipeline through event-driven signals. Every approve or reject action \u2014 whether manual or via batch operations \u2014 emits the corresponding signal immediately. This means downstream systems receive real-time notifications of review decisions. You can bind these signals to different destinations: approved records to a webhook that triggers an ERP import, rejected records to a Slack notification for the data operations team. The event-driven design decouples review decisions from downstream processing, making the system easy to extend."
|
|
4438
|
+
},
|
|
4439
|
+
{
|
|
4440
|
+
type: "list",
|
|
4441
|
+
items: [
|
|
4442
|
+
"Navigate to Review → Approval Queue to see all pending items",
|
|
4443
|
+
"Filter by status (pending, flagged), schema, or confidence range",
|
|
4444
|
+
"Review detail view shows extracted values alongside the source document",
|
|
4445
|
+
"Provenance trails trace each value back to its origin in the document text",
|
|
4446
|
+
"Inline validation check results show which rules passed and which failed",
|
|
4447
|
+
"Batch approve or reject multiple items at once",
|
|
4448
|
+
"LLM auto-review proposes decisions for pending items with one-click accept or override",
|
|
4449
|
+
"result.approved and result.rejected signals emitted for delivery pipeline integration"
|
|
4450
|
+
]
|
|
4451
|
+
},
|
|
3987
4452
|
{
|
|
3988
4453
|
type: "callout",
|
|
3989
|
-
text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click."
|
|
4454
|
+
text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click. Auto-review is especially effective for high-volume workloads where most items are straightforward \u2014 it lets reviewers focus their attention on the genuinely ambiguous cases."
|
|
3990
4455
|
}
|
|
3991
4456
|
],
|
|
3992
4457
|
related: [
|
|
@@ -4030,11 +4495,11 @@ var sections10 = [
|
|
|
4030
4495
|
content: [
|
|
4031
4496
|
{
|
|
4032
4497
|
type: "paragraph",
|
|
4033
|
-
text: "Push extracted, resolved, and reviewed data to any downstream system. Delivery is a typed, at-least-once pipeline with idempotency keys on the wire, append-only history, and a dead-letter queue for terminal failures."
|
|
4498
|
+
text: "Push extracted, resolved, and reviewed data to any downstream system. Delivery is a typed, at-least-once pipeline with idempotency keys on the wire, append-only history, and a dead-letter queue for terminal failures. The system is fully configurable without code changes \u2014 signals, deliverable resolvers, serializers, and connectors are four orthogonal registries that compose independently, so adding a new destination type never requires changes to the signal or serializer code."
|
|
4034
4499
|
},
|
|
4035
4500
|
{
|
|
4036
4501
|
type: "paragraph",
|
|
4037
|
-
text: "Every delivery flows through a five-stage pipeline:"
|
|
4502
|
+
text: "Every delivery flows through a five-stage pipeline. Producers are stateless \u2014 they only publish typed events into an outbox and never interact with destinations or bindings directly. A background poller drains the outbox every 5 seconds, matches events against active bindings, and enqueues delivery jobs for processing:"
|
|
4038
4503
|
},
|
|
4039
4504
|
{
|
|
4040
4505
|
type: "param-table",
|
|
@@ -4122,7 +4587,7 @@ var sections10 = [
|
|
|
4122
4587
|
content: [
|
|
4123
4588
|
{
|
|
4124
4589
|
type: "paragraph",
|
|
4125
|
-
text: "A destination is a connector + configuration + optional credentials.
|
|
4590
|
+
text: "A destination is a connector + configuration + optional credentials. Seven connectors are live: webhook (HMAC-SHA256), SFTP, Amazon S3, Azure Blob Storage, Google Drive, OneDrive, and Google Sheets. Use **Delivery → Destinations** to manage them from the dashboard, or `POST /v1/delivery/destinations` via the API. Every destination supports a live-ping `POST /v1/delivery/destinations/:id/test` that exercises the full transport envelope with a tiny test payload. File-based destinations use lightweight probes (list directory, head bucket, get container properties) so the test never creates artifacts in your target storage."
|
|
4126
4591
|
},
|
|
4127
4592
|
{
|
|
4128
4593
|
type: "param-table",
|
|
@@ -4215,11 +4680,11 @@ var sections10 = [
|
|
|
4215
4680
|
content: [
|
|
4216
4681
|
{
|
|
4217
4682
|
type: "paragraph",
|
|
4218
|
-
text: "A binding is the routing rule: it joins a **signal filter** (which events?) to a **deliverable type** (what payload shape?) to a **destination** (ship where?) via a **serializer** (encoded how?). On create, the backend validates all four pieces form a compatible triangle \u2014 the serializer must support the resolver's shape, and the connector must support the serializer format."
|
|
4683
|
+
text: "A binding is the routing rule: it joins a **signal filter** (which events?) to a **deliverable type** (what payload shape?) to a **destination** (ship where?) via a **serializer** (encoded how?). On create, the backend validates all four pieces form a compatible triangle \u2014 the serializer must support the resolver's shape, and the connector must support the serializer format. This six-predicate validation ensures you never end up with a misconfigured binding that cannot deliver."
|
|
4219
4684
|
},
|
|
4220
4685
|
{
|
|
4221
4686
|
type: "paragraph",
|
|
4222
|
-
text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. Optional `delivery_policy` overrides the default retry ladder (
|
|
4687
|
+
text: "Optional `field_map` (rename/drop/static rules) lets you reshape the payload without custom code. The three operations compose in order: drop excluded fields first, then rename remaining fields, then inject static values. Optional `delivery_policy` overrides the default retry ladder (7 attempts over ~10 hours with exponential backoff: 0s, 30s, 2min, 8min, 30min, 2h, 8h) and maximum attempts."
|
|
4223
4688
|
},
|
|
4224
4689
|
{
|
|
4225
4690
|
type: "paragraph",
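The `field_map` ordering described in this hunk (drop, then rename, then static) can be illustrated with a short sketch. The shape of the `field_map` object used here (`drop` array, `rename` object, `static` object) and the function name `applyFieldMap` are assumptions made for illustration; only the order of the three operations comes from the text.

```javascript
// Applies a hypothetical field_map to a flat payload object.
// Order follows the documented composition: drop, then rename, then static.
function applyFieldMap(payload, fieldMap) {
  const { drop = [], rename = {}, static: staticValues = {} } = fieldMap;
  const out = {};
  for (const [key, value] of Object.entries(payload)) {
    // 1. Drop excluded fields first.
    if (drop.includes(key)) continue;
    // 2. Rename the remaining fields (keys without a rename rule are kept as-is).
    const hasRule = Object.prototype.hasOwnProperty.call(rename, key);
    out[hasRule ? rename[key] : key] = value;
  }
  // 3. Inject static values last, so they are never dropped or renamed.
  return { ...out, ...staticValues };
}

const reshaped = applyFieldMap(
  { invoice_no: "INV-1", internal_note: "do not ship", total: 100 },
  {
    drop: ["internal_note"],
    rename: { invoice_no: "invoiceNumber" },
    static: { source: "talonic" },
  }
);

console.log(reshaped);
```

Because dropping happens before renaming, a rule that drops `invoice_no` would win over a rule that renames it; in this sketch the static values are merged last and would overwrite a payload field of the same name.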
|
|
@@ -4275,7 +4740,7 @@ var sections10 = [
|
|
|
4275
4740
|
content: [
|
|
4276
4741
|
{
|
|
4277
4742
|
type: "paragraph",
|
|
4278
|
-
text: "The catalog API (`/v1/delivery/catalog/*`) exposes the four registries that drive the binding picker. Use it to populate dropdowns rather than hardcoding lists \u2014 it always reflects the running registry contents."
|
|
4743
|
+
text: "The catalog API (`/v1/delivery/catalog/*`) exposes the four registries that drive the binding picker: signals, deliverables, serializers, and connectors. Use it to populate dropdowns rather than hardcoding lists \u2014 it always reflects the running registry contents. When new signal types or deliverable resolvers are added to the platform, they appear automatically in the catalog without any configuration changes on your end."
|
|
4279
4744
|
},
|
|
4280
4745
|
{
|
|
4281
4746
|
type: "param-table",
|
|
@@ -4391,15 +4856,15 @@ var sections10 = [
|
|
|
4391
4856
|
content: [
|
|
4392
4857
|
{
|
|
4393
4858
|
type: "paragraph",
|
|
4394
|
-
text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies. Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt
|
|
4859
|
+
text: "Every delivery attempt writes a row to `/v1/delivery/items` with its status, HTTP code, error code, and request/response bodies (truncated to 10 KB each). Terminal failures (retry ladder exhausted or permanent 4xx) escalate to `/v1/delivery/dlq`. Both are fully replayable \u2014 replay enqueues a new attempt while preserving the deterministic idempotency key, so receivers that deduplicate on the key will not process the same delivery twice even after multiple replays. Nothing in history is ever mutated; the log is strictly append-only."
|
|
4395
4860
|
},
|
|
4396
4861
|
{
|
|
4397
4862
|
type: "paragraph",
|
|
4398
|
-
text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration."
|
|
4863
|
+
text: "The delivery items log captures the full lifecycle of each attempt: in-flight, succeeded, or failed. Each row includes the attempt number, HTTP status code, error code, duration in milliseconds, and truncated request/response bodies (up to 10 KB each). Use the items endpoint with filters for `binding_id`, `destination_id`, or `status` to narrow results when debugging a specific integration. Item and DLQ IDs are UUIDs, while event IDs are sequential integers for efficient ordering."
|
|
4399
4864
|
},
|
|
4400
4865
|
{
|
|
4401
4866
|
type: "paragraph",
|
|
4402
|
-
text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error, fix the destination configuration, and replay the delivery with a single click or API call."
|
|
4867
|
+
text: "The dead letter queue (DLQ) is your safety net for terminal failures. When the retry ladder is exhausted (default: 7 attempts over ~10 hours) or the destination returns a permanent error (e.g., 401 Unauthorized, 403 Forbidden), the failed delivery moves to the DLQ. From there you can inspect the error details, fix the destination configuration, and replay the delivery with a single click or API call. Destinations returning authentication errors are automatically disabled to prevent further failed attempts until the credentials are updated."
|
|
4403
4868
|
},
|
|
4404
4869
|
{
|
|
4405
4870
|
type: "paragraph",
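As a rough check on the default retry ladder quoted in this diff (0s, 30s, 2min, 8min, 30min, 2h, 8h), the sketch below sums the delays and classifies a response into succeed, retry, or DLQ. It assumes each listed value is the wait before the corresponding attempt, and it borrows the "4xx except 429 is terminal" rule from the webhook documentation; the function names are hypothetical and the platform's internal logic may differ.

```javascript
// Assumed interpretation: delay (in seconds) before attempts 1 through 7.
const DEFAULT_LADDER_SECONDS = [0, 30, 120, 480, 1800, 7200, 28800];

// Seconds elapsed between the first attempt and attempt `n` (1-based).
function elapsedBeforeAttempt(n, ladder = DEFAULT_LADDER_SECONDS) {
  return ladder.slice(0, n).reduce((sum, delay) => sum + delay, 0);
}

// Decides what happens after attempt number `attempt` returned `httpStatus`.
function nextStep(httpStatus, attempt, ladder = DEFAULT_LADDER_SECONDS) {
  if (httpStatus >= 200 && httpStatus < 300) return "succeeded";
  // Permanent client errors skip the remaining retries.
  const permanent = httpStatus >= 400 && httpStatus < 500 && httpStatus !== 429;
  // The ladder is exhausted once the last attempt has failed.
  if (permanent || attempt >= ladder.length) return "dlq";
  return "retry";
}

const totalSeconds = elapsedBeforeAttempt(DEFAULT_LADDER_SECONDS.length);
console.log(totalSeconds);                      // 38430
console.log((totalSeconds / 3600).toFixed(1));  // 10.7
```

Under this reading the seventh attempt happens about 10.7 hours after the first, which is consistent with the "~10 hours" figure in the text.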
|
|
@@ -4826,10 +5291,25 @@ var sections13 = [
|
|
|
4826
5291
|
type: "paragraph",
|
|
4827
5292
|
text: "For best results, create separate API keys for each integration or service that connects to your Talonic workspace. This makes it easy to rotate or revoke a single key without disrupting other integrations. Most teams maintain one key for their ingestion pipeline, one for their BI dashboard, and one for webhook-based automations."
|
|
4828
5293
|
},
|
|
5294
|
+
{
|
|
5295
|
+
type: "paragraph",
|
|
5296
|
+
text: "API keys are SHA-256 hashed at rest, which means the platform never stores the plaintext key after creation. This is a deliberate security measure \u2014 even if the database were compromised, the hashed keys cannot be reversed to their original values. When your integration sends a request, the platform hashes the incoming key and compares it against the stored hash. This design follows the same pattern used by GitHub, Stripe, and other major API providers."
|
|
5297
|
+
},
|
|
5298
|
+
{
|
|
5299
|
+
type: "list",
|
|
5300
|
+
items: [
|
|
5301
|
+
"Prefixed with tlnc_ for easy identification in logs and configuration files",
|
|
5302
|
+
"Passed via the Authorization: Bearer header on every API request",
|
|
5303
|
+
"SHA-256 hashed at rest \u2014 the full key is only shown once at creation",
|
|
5304
|
+
"Three scopes: extract (ingestion), read (query), write (create/modify)",
|
|
5305
|
+
"Create separate keys per integration for independent rotation and revocation",
|
|
5306
|
+
"No limit on the number of keys per workspace"
|
|
5307
|
+
]
|
|
5308
|
+
},
|
|
4829
5309
|
{
|
|
4830
5310
|
type: "callout",
|
|
4831
5311
|
variant: "warning",
|
|
4832
|
-
text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated."
|
|
5312
|
+
text: "Copy the full API key immediately after creation \u2014 it is only displayed once. If you lose the key, you must delete it and create a new one. Existing integrations using the old key will stop working until updated. Store API keys in a secrets manager, not in source code or environment files checked into version control."
|
|
4833
5313
|
},
|
|
4834
5314
|
{
|
|
4835
5315
|
type: "param-table",
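The hash-and-compare flow described for API keys in this hunk can be sketched with Node's built-in `crypto` module. This is a minimal illustration, not the platform's implementation: the function names and the sample key are made up, and the constant-time comparison via `timingSafeEqual` is an addition of this sketch that the text does not mention.

```javascript
const { createHash, timingSafeEqual } = require("node:crypto");

// SHA-256 digest of the key, returned as a Buffer.
function hashApiKey(key) {
  return createHash("sha256").update(key, "utf8").digest();
}

// Hashes the presented key and compares it with the stored hex-encoded hash.
function verifyApiKey(presentedKey, storedHashHex) {
  const stored = Buffer.from(storedHashHex, "hex");
  const presented = hashApiKey(presentedKey);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return stored.length === presented.length && timingSafeEqual(stored, presented);
}

// At creation time: the plaintext key is shown once and only the hash is persisted.
const newKey = "tlnc_example_key_not_real"; // hypothetical key for illustration
const storedHashHex = hashApiKey(newKey).toString("hex");

console.log(verifyApiKey(newKey, storedHashHex));        // true
console.log(verifyApiKey("tlnc_wrong", storedHashHex));  // false
```

Since only the hash is stored, a lost key cannot be recovered from the database, which is why the warning callout asks you to copy the key at creation.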
|
|
@@ -4897,6 +5377,26 @@ var sections13 = [
|
|
|
4897
5377
|
type: "paragraph",
|
|
4898
5378
|
text: "For best results, start with the `/v1/extract` endpoint for document ingestion, then use `/v1/documents` and `/v1/extractions` to retrieve results. As your integration matures, explore delivery bindings, matching configurations, and batch processing to build a fully automated data pipeline."
|
|
4899
5379
|
},
|
|
5380
|
+
{
|
|
5381
|
+
type: "paragraph",
|
|
5382
|
+
text: "Long-running operations like matching runs, batch inference, and resolution runs follow an asynchronous pattern. You submit a request to start the operation and receive a run ID. Poll the status endpoint with that run ID to track progress \u2014 the response includes a percentage-complete indicator and a terminal status when the operation finishes. This pattern keeps HTTP connections short-lived while giving you full visibility into background processing."
|
|
5383
|
+
},
|
|
5384
|
+
{
|
|
5385
|
+
type: "paragraph",
|
|
5386
|
+
text: "Error responses follow a consistent structure across all namespaces. Every error includes a typed error code from the ErrorCode enum, a human-readable message, and the HTTP status code. Common error codes include ENTITY_NOT_FOUND (404), VALIDATION_ERROR (400), and RATE_LIMIT_EXCEEDED (429). Rate limiting is applied per API key with configurable thresholds \u2014 the response headers include X-RateLimit-Remaining and X-RateLimit-Reset so your integration can adapt proactively."
|
|
5387
|
+
},
|
|
5388
|
+
{
|
|
5389
|
+
type: "list",
|
|
5390
|
+
items: [
|
|
5391
|
+
"20+ namespaces covering extraction, documents, schemas, jobs, delivery, linking, cases, quality, and more",
|
|
5392
|
+
"JSON request and response bodies with cursor-based pagination for list operations",
|
|
5393
|
+
"Authenticate with a tlnc_ API key via the Authorization: Bearer header",
|
|
5394
|
+
"Asynchronous operations with polling endpoints for status and progress",
|
|
5395
|
+
"Typed error codes with consistent response structure across all endpoints",
|
|
5396
|
+
"Rate limiting with X-RateLimit-Remaining and X-RateLimit-Reset headers",
|
|
5397
|
+
"Batch processing mode at 50% cost with 48-hour delivery SLA"
|
|
5398
|
+
]
|
|
5399
|
+
},
|
|
4900
5400
|
{
|
|
4901
5401
|
type: "callout",
|
|
4902
5402
|
variant: "info",
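The submit-then-poll pattern described in this hunk might look like the sketch below on the client side. It deliberately takes the status lookup as a caller-supplied function, because the actual status endpoint paths are not shown here. The response fields `status` and `progress` and the terminal values `completed` and `failed` are hypothetical placeholders.

```javascript
// Polls a run until it reaches a terminal status or the poll budget is used up.
// `getStatus(runId)` is an async function supplied by the caller; in a real
// integration it would call the relevant status endpoint with the run ID.
async function pollRun(runId, getStatus, { intervalMs = 2000, maxPolls = 100 } = {}) {
  const terminal = new Set(["completed", "failed"]); // assumed terminal statuses
  for (let i = 0; i < maxPolls; i++) {
    const run = await getStatus(runId);
    if (terminal.has(run.status)) return run;
    // Wait before the next poll to keep request volume low.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Run ${runId} did not finish after ${maxPolls} polls`);
}

// Usage with a stubbed status function that finishes on the second poll.
const fakeResponses = [
  { status: "running", progress: 40 },
  { status: "completed", progress: 100 },
];
let pollCount = 0;
const finished = pollRun("run_123", async () => fakeResponses[pollCount++], { intervalMs: 10 });
finished.then((run) => console.log(run.status, run.progress));
```

Capping the number of polls is a choice of this sketch; it keeps a stuck run from blocking the caller indefinitely.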
|
|
@@ -5045,7 +5545,7 @@ var sections13 = [
|
|
|
5045
5545
|
content: [
|
|
5046
5546
|
{
|
|
5047
5547
|
type: "paragraph",
|
|
5048
|
-
text: "Webhooks push real-time notifications when events occur. All payloads are HMAC-SHA256 signed. Failed deliveries retry with exponential backoff."
|
|
5548
|
+
text: "Webhooks push real-time notifications when events occur in your Talonic workspace. All payloads are HMAC-SHA256 signed using the signing secret configured on your delivery destination, ensuring authenticity and tamper detection. Failed deliveries retry with exponential backoff \u2014 the platform makes progressively spaced attempts before routing terminal failures to the dead-letter queue (DLQ) for manual replay."
|
|
5049
5549
|
},
|
|
5050
5550
|
{
|
|
5051
5551
|
type: "paragraph",
|
|
@@ -5059,6 +5559,27 @@ var sections13 = [
|
|
|
5059
5559
|
type: "paragraph",
|
|
5060
5560
|
text: "Use webhooks when your downstream system needs to react immediately to platform events \u2014 for example, triggering an ERP import when a document is extracted, or notifying a Slack channel when a reviewer rejects a record. For bulk or periodic data transfers, consider using the SFTP, S3, or cloud storage delivery connectors instead."
|
|
5061
5561
|
},
|
|
5562
|
+
{
|
|
5563
|
+
type: "paragraph",
|
|
5564
|
+
text: "To set up a webhook, create a delivery destination of type **webhook** with your endpoint URL and a signing secret. Then create a binding that maps one or more signal types to the destination. When an event fires, the platform constructs the payload, signs it, and delivers it to your endpoint. You can bind multiple signal types to the same destination or spread them across different destinations for different downstream systems."
|
|
5565
|
+
},
|
|
5566
|
+
{
|
|
5567
|
+
type: "list",
|
|
5568
|
+
items: [
|
|
5569
|
+
"HMAC-SHA256 signed payloads for authenticity and tamper detection",
|
|
5570
|
+
"Idempotency key in headers for safe deduplication on retries",
|
|
5571
|
+
"Exponential backoff on delivery failure with configurable retry limits",
|
|
5572
|
+
"Dead-letter queue (DLQ) for terminal failures with manual replay",
|
|
5573
|
+
"Bind any signal type to a webhook destination via delivery bindings",
|
|
5574
|
+
"11 signal types covering document, run, result, and delivery lifecycle events",
|
|
5575
|
+
"Meta-signals (delivery.item.completed/failed) are not re-delivered to avoid loops"
|
|
5576
|
+
]
|
|
5577
|
+
},
|
|
5578
|
+
{
|
|
5579
|
+
type: "callout",
|
|
5580
|
+
variant: "warning",
|
|
5581
|
+
text: "Your webhook endpoint must respond with a 2xx status code within 30 seconds. Non-2xx responses or timeouts trigger the retry schedule. Permanent client errors (4xx except 429) are treated as terminal failures and routed directly to the DLQ without further retries."
|
|
5582
|
+
},
|
|
5062
5583
|
{
|
|
5063
5584
|
type: "param-table",
|
|
5064
5585
|
title: "Delivery signal types (webhook-compatible)",
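On the receiving side, the HMAC-SHA256 signature and idempotency key described in this hunk could be handled roughly as follows. This is a sketch under assumptions: the text does not say how the signature is encoded or which headers carry it, so hex encoding and the parameter names `signature` and `idempotencyKey` are hypothetical, as are the function names.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// Hex-encoded HMAC-SHA256 of the raw request body (hex encoding is assumed).
function signPayload(secret, rawBody) {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Recomputes the signature over the raw body and compares in constant time.
function verifySignature(secret, rawBody, receivedSignatureHex) {
  const expected = Buffer.from(signPayload(secret, rawBody), "hex");
  const received = Buffer.from(receivedSignatureHex, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Keys of deliveries already processed. A real receiver would persist these.
const seenKeys = new Set();

// Returns the HTTP status the endpoint should respond with.
function handleDelivery({ secret, rawBody, signature, idempotencyKey }) {
  if (!verifySignature(secret, rawBody, signature)) return 401;
  // Duplicate delivery (retry or replay): acknowledge without reprocessing.
  if (seenKeys.has(idempotencyKey)) return 200;
  seenKeys.add(idempotencyKey);
  // ... process JSON.parse(rawBody) here ...
  return 200;
}

const secret = "example-signing-secret";
const body = JSON.stringify({ type: "result.approved" });
const delivery = { secret, rawBody: body, signature: signPayload(secret, body), idempotencyKey: "key-1" };

console.log(handleDelivery(delivery)); // 200
console.log(handleDelivery(delivery)); // 200
```

Verify against the raw bytes of the request body rather than a re-serialized object, since re-serialization can change whitespace or key order and break the signature. Answering duplicates with a 2xx matters because any non-2xx response would trigger the retry schedule again.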
|
|
@@ -5161,11 +5682,11 @@ var sections14 = [
|
|
|
5161
5682
|
content: [
|
|
5162
5683
|
{
|
|
5163
5684
|
type: "paragraph",
|
|
5164
|
-
text: "Organizations support role-based access control
|
|
5685
|
+
text: "Organizations support role-based access control that governs who can view, create, edit, and manage resources across your workspace. Access control is enforced at the API level, so permissions apply consistently whether users interact through the web interface, the AI agent, or the public API."
|
|
5165
5686
|
},
|
|
5166
5687
|
{
|
|
5167
5688
|
type: "paragraph",
|
|
5168
|
-
text: "Every user in your organization is assigned one of four roles that determine what they can see and do. Roles are hierarchical \u2014 each level includes all permissions of the levels below it. Choose the most restrictive role that still lets a team member do their job."
|
|
5689
|
+
text: "Every user in your organization is assigned one of four roles that determine what they can see and do. Roles are hierarchical \u2014 each level includes all permissions of the levels below it. Choose the most restrictive role that still lets a team member do their job. For most team members working on data review and extraction, the **Member** role provides everything they need."
|
|
5169
5690
|
},
|
|
5170
5691
|
{
|
|
5171
5692
|
type: "param-table",
|
|
@@ -5195,7 +5716,7 @@ var sections14 = [
|
|
|
5195
5716
|
},
|
|
5196
5717
|
{
|
|
5197
5718
|
type: "paragraph",
|
|
5198
|
-
text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. Manage from the Team page."
|
|
5719
|
+
text: "New members are added via domain matching: company email domains auto-match to your org with **pending** status requiring admin approval. This means anyone who signs up with an email from your company domain is automatically associated with your organization, but cannot access any data until an Admin or Owner explicitly approves them. Manage all team members and pending requests from the Team page in the sidebar."
|
|
5199
5720
|
},
|
|
5200
5721
|
{
|
|
5201
5722
|
type: "paragraph",
|
|
@@ -5262,11 +5783,11 @@ var sections14 = [
|
|
|
5262
5783
|
content: [
|
|
5263
5784
|
{
|
|
5264
5785
|
type: "paragraph",
|
|
5265
|
-
text: "The Usage & Registry page replaces the legacy credits view with a comprehensive cost breakdown. It shows per-feature cost (extraction, OCR, batch, matching), a daily cost chart, and a full call log with model, tokens, and cost per request. The **Master view** (admin only) shows per-customer breakdowns and platform-wide statistics."
|
|
5786
|
+
text: "The Usage & Registry page replaces the legacy credits view with a comprehensive cost breakdown. It shows per-feature cost (extraction, OCR, batch, matching), a daily cost chart, and a full call log with model, tokens, and cost per request. The **Master view** (admin only) shows per-customer breakdowns and platform-wide statistics. Navigate to **Usage** from the sidebar to access all views."
|
|
5266
5787
|
},
|
|
5267
5788
|
{
|
|
5268
5789
|
type: "paragraph",
|
|
5269
|
-
text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events."
|
|
5790
|
+
text: "Understanding your usage patterns helps optimize costs. For example, if extraction dominates your spend, consider using **batch mode** for non-urgent documents to cut that cost in half. If OCR is a significant portion, check whether you are processing image-heavy documents that could benefit from better source quality. The daily cost chart makes it easy to spot usage spikes and correlate them with specific ingestion events or batch completions."
|
|
5270
5791
|
},
|
|
5271
5792
|
{
|
|
5272
5793
|
type: "paragraph",
|
|
@@ -5302,10 +5823,26 @@ var sections14 = [
|
|
|
5302
5823
|
}
|
|
5303
5824
|
]
|
|
5304
5825
|
},
|
|
5826
|
+
{
|
|
5827
|
+
type: "heading",
|
|
5828
|
+
level: 3,
|
|
5829
|
+
id: "cost-optimization",
|
|
5830
|
+
text: "Cost Optimization Tips"
|
|
5831
|
+
},
|
|
5832
|
+
{
|
|
5833
|
+
type: "list",
|
|
5834
|
+
ordered: false,
|
|
5835
|
+
items: [
|
|
5836
|
+
"**Use batch mode** for non-urgent documents \u2014 extraction runs at 50% cost with a 48-hour delivery window.",
|
|
5837
|
+
"**Build your field registry** \u2014 as more fields reach Tier 1 and Tier 2, extraction costs drop because values are resolved via lookup instead of AI calls.",
|
|
5838
|
+
"**Review the per-feature breakdown** weekly to identify which operations dominate your spend.",
|
|
5839
|
+
"**Leverage routing rules** to automatically assign schemas, reducing the number of manual job runs and re-extractions."
|
|
5840
|
+
]
|
|
5841
|
+
},
|
|
5305
5842
|
{
|
|
5306
5843
|
type: "callout",
|
|
5307
5844
|
variant: "info",
|
|
5308
|
-
text: "The call log records every LLM and OCR call with full detail \u2014 model name, input/output token counts, latency, and cost. Use it to audit individual extractions or investigate unexpected cost increases."
|
|
5845
|
+
text: "The call log records every LLM and OCR call with full detail \u2014 model name, input/output token counts, latency, and cost. Use it to audit individual extractions or investigate unexpected cost increases. Each entry links back to the specific document and job that triggered the call."
|
|
5309
5846
|
}
|
|
5310
5847
|
],
|
|
5311
5848
|
related: [
|
|
@@ -5353,7 +5890,7 @@ var sections14 = [
|
|
|
5353
5890
|
},
|
|
5354
5891
|
{
|
|
5355
5892
|
type: "paragraph",
|
|
5356
|
-
text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs."
|
|
5893
|
+
text: "The Admin Panel operates across tenant boundaries, giving administrators visibility into all organizations on the platform. The **usage statistics** view aggregates cost and volume data across all customers, making it straightforward to identify high-usage tenants, track platform growth, and forecast infrastructure needs. Usage data includes per-feature breakdowns (extraction, OCR, batch, matching) and daily cost trends across all tenants."
|
|
5357
5894
|
},
|
|
5358
5895
|
{
|
|
5359
5896
|
type: "paragraph",
|
|
@@ -5383,19 +5920,19 @@ var sections14 = [
|
|
|
5383
5920
|
faq: [
|
|
5384
5921
|
{
|
|
5385
5922
|
question: "What features does the Admin Panel provide?",
|
|
5386
|
-
answer: "Customer management, user management, usage statistics, data clear
|
|
5923
|
+
answer: "Customer management (create, list, delete organizations), user management (cross-tenant user view with account removal), usage statistics (platform-wide cost and volume aggregates), data clear and rebuild (wipe and reprocess all data for a customer), and cross-tenant master registry view (audit field definitions and schemas across tenants)."
|
|
5387
5924
|
},
|
|
5388
5925
|
{
|
|
5389
5926
|
question: "Who can access the Admin Panel?",
|
|
5390
|
-
answer: "The Admin Panel is accessible only to users with admin or superadmin roles, via the user menu in the platform navigation."
|
|
5927
|
+
answer: "The Admin Panel is accessible only to users with admin or superadmin roles, via the user menu in the platform navigation. Regular Members and Viewers do not see the Admin Panel option."
|
|
5391
5928
|
},
|
|
5392
5929
|
{
|
|
5393
5930
|
question: "What does the data clear operation do?",
|
|
5394
|
-
answer: "Data clear wipes all documents, extractions, jobs, results, and related data for a specific customer. It is irreversible and intended for full reprocessing scenarios during onboarding or after major schema changes."
|
|
5931
|
+
answer: "Data clear wipes all documents, extractions, jobs, results, and related data for a specific customer. It is irreversible and intended for full reprocessing scenarios during onboarding or after major schema changes. Always confirm with the customer before executing this operation."
|
|
5395
5932
|
},
|
|
5396
5933
|
{
|
|
5397
5934
|
question: "Can I view usage across all customers?",
|
|
5398
|
-
answer: "Yes. The Admin Panel includes a master registry view that shows cross-tenant usage statistics, per-customer cost breakdowns, and platform-wide aggregates."
|
|
5935
|
+
answer: "Yes. The Admin Panel includes a master registry view that shows cross-tenant usage statistics, per-customer cost breakdowns, and platform-wide aggregates. This is useful for identifying high-usage tenants, tracking platform growth, and forecasting infrastructure needs."
|
|
5399
5936
|
}
|
|
5400
5937
|
],
|
|
5401
5938
|
mentions: [
|
|
@@ -5416,11 +5953,11 @@ var sections14 = [
|
|
|
5416
5953
|
content: [
|
|
5417
5954
|
{
|
|
5418
5955
|
type: "paragraph",
|
|
5419
|
-
text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows."
|
|
5956
|
+
text: "Talonic provides global keyboard shortcuts that work from any page in the platform. These shortcuts let you access common actions without leaving your current context, significantly speeding up daily workflows. Whether you are reviewing extraction results, configuring schemas, or browsing the field registry, the same shortcuts are always available."
|
|
5420
5957
|
},
|
|
5421
5958
|
{
|
|
5422
5959
|
type: "paragraph",
|
|
5423
|
-
text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused."
|
|
5960
|
+
text: "Shortcuts are registered at the application level, meaning they respond regardless of which page or panel is currently active. The platform intercepts the key combination before it reaches the browser, so these shortcuts take priority over default browser bindings when the Talonic window is focused. On macOS, shortcuts use the Command key (`Cmd`); on Windows and Linux, they use `Ctrl`."
|
|
5424
5961
|
},
|
|
5425
5962
|
{
|
|
5426
5963
|
type: "paragraph",
|
|
@@ -5444,6 +5981,11 @@ var sections14 = [
|
|
|
5444
5981
|
type: "global",
|
|
5445
5982
|
description: "Quick extract \u2014 upload and process a document."
|
|
5446
5983
|
},
|
|
5984
|
+
{
|
|
5985
|
+
name: "\u2318I / Ctrl+I",
|
|
5986
|
+
type: "global",
|
|
5987
|
+
description: "Open the AI Agent from any page."
|
|
5988
|
+
},
|
|
5447
5989
|
{
|
|
5448
5990
|
name: "Escape",
|
|
5449
5991
|
type: "global",
|
|
@@ -5451,10 +5993,14 @@ var sections14 = [
|
|
|
5451
5993
|
}
|
|
5452
5994
|
]
|
|
5453
5995
|
},
|
|
5996
|
+
{
|
|
5997
|
+
type: "paragraph",
|
|
5998
|
+
text: "In addition to the global shortcuts, the **AI Agent** can be opened from any page using `Cmd+I` (`Ctrl+I` on Windows and Linux). The agent provides a conversational interface for inspecting data, building schemas, and analyzing extraction quality \u2014 all without navigating away from your current page. Combined with Omnisearch and quick extract, these three shortcuts cover the most common workflow interruptions."
|
|
5999
|
+
},
|
|
5454
6000
|
{
|
|
5455
6001
|
type: "callout",
|
|
5456
6002
|
variant: "info",
|
|
5457
|
-
text: "The **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) is the fastest way to upload a single document. It opens a streamlined upload interface that lets you drag a file and start processing immediately."
|
|
6003
|
+
text: "The **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) is the fastest way to upload a single document. It opens a streamlined upload interface that lets you drag a file and start processing immediately. Use it when you receive a document via email or chat and want instant extraction results."
|
|
5458
6004
|
}
|
|
5459
6005
|
],
|
|
5460
6006
|
related: [
|
|
@@ -5464,15 +6010,15 @@ var sections14 = [
|
|
|
5464
6010
|
faq: [
|
|
5465
6011
|
{
|
|
5466
6012
|
question: "What keyboard shortcuts are available?",
|
|
5467
|
-
answer: "Cmd+K / Ctrl+K for Omnisearch, Cmd+J / Ctrl+J for quick extract (upload and process), and Escape to close overlays, modals, and search."
|
|
6013
|
+
answer: "Four global shortcuts: Cmd+K / Ctrl+K for Omnisearch, Cmd+J / Ctrl+J for quick extract (upload and process), Cmd+I / Ctrl+I to open the AI Agent, and Escape to close overlays, modals, and search. These work from any page in the platform."
|
|
5468
6014
|
},
|
|
5469
6015
|
{
|
|
5470
6016
|
question: "What does the quick extract shortcut do?",
|
|
5471
|
-
answer: "Cmd+J / Ctrl+J opens the quick extract interface, allowing you to upload and process a document directly from any page."
|
|
6017
|
+
answer: "Cmd+J / Ctrl+J opens the quick extract interface, allowing you to upload and process a document directly from any page. It provides a streamlined drag-and-drop area that immediately processes the uploaded file and displays extraction results."
|
|
5472
6018
|
},
|
|
5473
6019
|
{
|
|
5474
6020
|
question: "Do shortcuts work inside modals or overlays?",
|
|
5475
|
-
answer: "The Escape shortcut works inside any modal or overlay to close it. Omnisearch (Cmd+K) works globally, even when other overlays are open. Quick extract (Cmd+J) is available from the main interface."
|
|
6021
|
+
answer: "The Escape shortcut works inside any modal or overlay to close it. Omnisearch (Cmd+K) works globally, even when other overlays are open. The AI Agent shortcut (Cmd+I) also works from any context. Quick extract (Cmd+J) is available from the main interface."
|
|
5476
6022
|
}
|
|
5477
6023
|
],
|
|
5478
6024
|
mentions: ["keyboard shortcuts", "Cmd+K", "Cmd+J", "Escape", "quick extract"]
|
|
@@ -5490,7 +6036,7 @@ var sections15 = [
|
|
|
5490
6036
|
content: [
|
|
5491
6037
|
{
|
|
5492
6038
|
type: "paragraph",
|
|
5493
|
-
text: "Documents can be processed in **batch mode** at 50% cost with a 48-hour delivery window. Toggle batch mode on the upload screen or set
|
|
6039
|
+
text: "Documents can be processed in **batch mode** at 50% cost with a 48-hour delivery window. Toggle batch mode on the upload screen or set `processing_mode=batch` via the API. Batch processing is ideal for large backlog ingestion where real-time results are not required. The cost reduction comes from the provider's native batch API, which schedules processing during off-peak capacity \u2014 there is no loss in extraction quality because the same Claude model and prompts are used as in real-time mode."
|
|
5494
6040
|
},
|
|
5495
6041
|
{
|
|
5496
6042
|
type: "callout",
|
|
@@ -5554,7 +6100,7 @@ var sections15 = [
|
|
|
5554
6100
|
content: [
|
|
5555
6101
|
{
|
|
5556
6102
|
type: "paragraph",
|
|
5557
|
-
text: 'Set `processing_mode=batch` on upload (API) or toggle the "Batch" switch in the upload UI. Stage 1 (OCR + classification) runs immediately so documents appear in your library right away. Stage 2 (Claude extraction) is deferred to the provider\'s batch API for asynchronous processing.'
|
|
6103
|
+
text: 'Set `processing_mode=batch` on upload (API) or toggle the "Batch" switch in the upload UI. Stage 1 (OCR + classification) runs immediately so documents appear in your library right away with their type classification and triage metadata. Stage 2 (Claude extraction) is deferred to the provider\'s batch API for asynchronous processing. While waiting for batch results, documents show a status of `batch_queued` in your library. The system requires a minimum of 100 items per batch \u2014 if fewer documents are uploaded in batch mode, the system falls back to real-time processing with a warning.'
|
|
5558
6104
|
},
|
|
5559
6105
|
{
|
|
5560
6106
|
type: "paragraph",
|
|
@@ -5583,7 +6129,7 @@ var sections15 = [
|
|
|
5583
6129
|
},
|
|
5584
6130
|
{
|
|
5585
6131
|
type: "paragraph",
|
|
5586
|
-
text: "While waiting for batch results, documents show a status of `batch_queued
|
|
6132
|
+
text: "While waiting for batch results, documents show a status of `batch_queued` in your library. Once the provider returns results, the platform applies them through the same post-processing pipeline as real-time extraction \u2014 including markdown pre-processing, field parsing, quality metrics, and extraction metadata computation. If a batch extraction fails to parse, the affected document is retried through the real-time extraction path rather than as a new batch, ensuring the original 48-hour SLA is maintained."
|
|
5587
6133
|
},
|
|
5588
6134
|
{
|
|
5589
6135
|
type: "paragraph",
|
|
@@ -5631,11 +6177,11 @@ var sections15 = [
|
|
|
5631
6177
|
content: [
|
|
5632
6178
|
{
|
|
5633
6179
|
type: "paragraph",
|
|
5634
|
-
text: "The Batches page at `/sources/batches` shows the status of all batch jobs. Each batch progresses through three states: **accumulating** (items collecting), **submitted** (sent to provider), and **completed** (results applied). The page live-syncs with the provider
|
|
6180
|
+
text: "The Batches page at `/sources/batches` shows the status of all batch jobs with real-time updates. Each batch progresses through three states: **accumulating** (items collecting in the queue), **submitted** (sent to the provider's batch API), and **completed** (results received and applied to the corresponding documents). The page live-syncs with the provider so you can monitor progress without manual refreshing. Click any batch to see the detail view with individual items, their processing state, and any errors."
|
|
5635
6181
|
},
|
|
5636
6182
|
{
|
|
5637
6183
|
type: "paragraph",
|
|
5638
|
-
text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents and the batch transitions to **completed** status."
|
|
6184
|
+
text: "Batches are submitted automatically when the accumulation timer fires (every 15 minutes by default) or when the item count threshold is reached, whichever comes first. These intervals are configurable in the pipeline settings. Once submitted, the platform polls the provider hourly to check for completion. When results arrive, they are applied to the corresponding documents \u2014 including field resolution, linking, triage, and delivery events \u2014 and the batch transitions to **completed** status."
|
|
5639
6185
|
},
|
|
5640
6186
|
{
|
|
5641
6187
|
type: "paragraph",
|
|
@@ -5718,11 +6264,11 @@ var sections16 = [
|
|
|
5718
6264
|
content: [
|
|
5719
6265
|
{
|
|
5720
6266
|
type: "paragraph",
|
|
5721
|
-
text:
|
|
6267
|
+
text: 'Upload CSV or Excel files as lookup tables for the matching engine and schema reference strategies. These reference datasets represent your "ground truth" \u2014 the known records you want to match extracted document data against. Each reference dataset is versioned independently and can be shared across multiple schemas and matching configurations without duplication.'
|
|
5722
6268
|
},
|
|
5723
6269
|
{
|
|
5724
6270
|
type: "paragraph",
|
|
5725
|
-
text:
|
|
6271
|
+
text: "Reference data is the foundation of the matching system. Common examples include customer lists, product catalogs, vendor registries, contract databases, and supplier directories. When you upload a reference dataset, the platform indexes all columns and rows for fast lookup during matching runs. You can also import reference data directly from a SQL database connection using `POST /matching/reference-data/from-sql`, which streams rows asynchronously in batches of 500 from your connected MSSQL or PostgreSQL database."
|
|
5726
6272
|
},
|
|
5727
6273
|
{
|
|
5728
6274
|
type: "paragraph",
|
|
@@ -5789,7 +6335,7 @@ var sections16 = [
|
|
|
5789
6335
|
content: [
|
|
5790
6336
|
{
|
|
5791
6337
|
type: "paragraph",
|
|
5792
|
-
text: "Define field-to-field comparisons between extracted data and reference datasets. Each comparison uses a weighted strategy to score matches:"
|
|
6338
|
+
text: "Define field-to-field comparisons between extracted data and reference datasets. Each comparison uses a weighted strategy to score matches. A matching configuration specifies which fields to compare, which strategy to use for each comparison, and the relative weight that determines how much each field contributes to the overall confidence score:"
|
|
5793
6339
|
},
|
|
5794
6340
|
{
|
|
5795
6341
|
type: "param-table",
|
|
@@ -5877,15 +6423,15 @@ var sections16 = [
|
|
|
5877
6423
|
content: [
|
|
5878
6424
|
{
|
|
5879
6425
|
type: "paragraph",
|
|
5880
|
-
text: "Execute a matching run against a reference dataset. Matching runs are processed asynchronously via
|
|
6426
|
+
text: "Execute a matching run against a reference dataset. Matching runs are processed asynchronously via a dedicated job queue, so they do not block your workflow. You can monitor progress from the matching page with real-time updates showing the number of documents processed and estimated time remaining, and cancel running jobs if needed \u2014 partial results from documents already processed are preserved."
|
|
5881
6427
|
},
|
|
5882
6428
|
{
|
|
5883
6429
|
type: "paragraph",
|
|
5884
|
-
text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based search
|
|
6430
|
+
text: "There are two types of runs: **manual runs** use only the deterministic matching strategies (exact, fuzzy, date_range, numeric_range) and complete quickly. **Smart runs** add an AI resolution pass \u2014 after the initial matching, an embedding-based similarity search identifies promising candidates for each low-confidence document, and a Haiku LLM resolver evaluates each candidate in context to improve match quality."
|
|
5885
6431
|
},
|
|
5886
6432
|
{
|
|
5887
6433
|
type: "paragraph",
|
|
5888
|
-
text: "Matching runs are processed asynchronously via a dedicated job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining."
|
|
6434
|
+
text: "Matching runs are processed asynchronously via a dedicated BullMQ job queue, so they do not block your workflow. You can continue working in the platform while a run executes in the background. The matching page shows real-time progress with the number of documents processed and estimated time remaining. You can also trigger an AI resolution pass on a completed run using the `POST /matching/runs/:id/ai-resolve` endpoint to upgrade specific low-confidence results without re-running the entire job."
|
|
5889
6435
|
},
|
|
5890
6436
|
{
|
|
5891
6437
|
type: "paragraph",
|
|
@@ -5942,7 +6488,7 @@ var sections16 = [
|
|
|
5942
6488
|
content: [
|
|
5943
6489
|
{
|
|
5944
6490
|
type: "paragraph",
|
|
5945
|
-
text: "Results are presented per document with the top 5 match candidates. Each candidate includes a confidence score and field-level evidence showing which comparisons contributed to the match
|
|
6491
|
+
text: "Results are presented per document with the top 5 match candidates ranked by weighted confidence score. Each candidate includes a confidence score (0-100%) and field-level evidence showing which comparisons contributed to the match, the strategy used for each field (exact, fuzzy, date_range, or numeric_range), the individual field score, and the actual values that were compared on both sides. This transparency makes it straightforward to verify correct matches and investigate false positives."
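The ranking described above can be sketched as follows. The source does not give the exact scoring formula, so this sketch assumes a normalized weighted average of per-field scores. The names `weighted_confidence` and `top_candidates` and the field names in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def weighted_confidence(field_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-field scores (0.0-1.0) into one score using relative weights.

    Assumption: a normalized weighted average. The platform's real formula
    is not documented in the text above.
    """
    total_weight = sum(weights[name] for name in field_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(field_scores[name] * weights[name] for name in field_scores)
    return weighted_sum / total_weight


def top_candidates(
    candidates: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k best candidates as (candidate_id, score), highest score first."""
    scored = [
        (candidate_id, weighted_confidence(scores, weights))
        for candidate_id, scores in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, with weights `{"name": 3, "date": 1}`, a candidate scoring 1.0 on `name` and 0.5 on `date` gets (3 × 1.0 + 1 × 0.5) / 4 = 0.875, which the UI would show as 87.5%.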
|
|
5946
6492
|
},
|
|
5947
6493
|
{
|
|
5948
6494
|
type: "paragraph",
|
|
@@ -6028,7 +6574,11 @@ var sections17 = [
|
|
|
6028
6574
|
description: "Overview of the Talonic API for extracting structured, schema-validated data from any document with a single API call using HTTPS and JSON.",
|
|
6029
6575
|
content: [
|
|
6030
6576
|
{ type: "paragraph", text: "Extract any document into schema-validated data with a single API call." },
|
|
6031
|
-
{ type: "paragraph", text: "**Base URL:** `https://api.talonic.com` | **Protocol:** HTTPS + JSON | **Auth:** `Bearer tlnc_...`" }
|
|
6577
|
+
{ type: "paragraph", text: "**Base URL:** `https://api.talonic.com` | **Protocol:** HTTPS + JSON | **Auth:** `Bearer tlnc_...`" },
|
|
6578
|
+
{ type: "paragraph", text: "Most integrations start with `POST /v1/extract` to submit a document and receive structured fields back. A typical workflow is: create an API key, upload a file with an optional schema, and consume the JSON response with per-field confidence scores and cost headers." },
|
|
6579
|
+
{ type: "paragraph", text: "The API supports three extraction modes: **auto-detect** (no schema, discovers all fields), **schema-driven** (returns exactly the fields you define), and **query** (filter previously extracted data without re-processing). Every response includes a `request_id` for tracing and support." },
|
|
6580
|
+
{ type: "paragraph", text: "Pair the extract endpoint with `GET /v1/documents` and `GET /v1/extractions` to manage your document library and retrieve results later. Webhook callbacks via `extraction.complete` events eliminate the need for polling on async extractions." },
|
|
6581
|
+
{ type: "callout", text: "All API keys use the `tlnc_` prefix. Create and rotate keys from **Settings \u2192 API Keys** in the dashboard. Keys carry scopes (`extract`, `read`, `write`, `billing`) that control endpoint access." }
|
|
6032
6582
|
],
|
|
6033
6583
|
related: [
|
|
6034
6584
|
{ label: "Authentication", slug: "authentication" },
|
|
@@ -6084,7 +6634,11 @@ var sections17 = [
|
|
|
6084
6634
|
description: "The base URL for all Talonic API endpoints. All requests must use HTTPS and are relative to the v1 base path.",
|
|
6085
6635
|
content: [
|
|
6086
6636
|
{ type: "paragraph", text: "All endpoints are relative to the base URL below. All requests must use HTTPS." },
|
|
6087
|
-
{ type: "code", language: "bash", code: "https://api.talonic.com/v1" }
|
|
6637
|
+
{ type: "code", language: "bash", code: "https://api.talonic.com/v1" },
|
|
6638
|
+
{ type: "paragraph", text: "Most integrations set this as a constant in their HTTP client configuration. A typical request URL looks like `https://api.talonic.com/v1/extract` or `https://api.talonic.com/v1/documents`. All paths in this reference are relative to the `/v1` prefix." },
|
|
6639
|
+
{ type: "paragraph", text: "The API uses standard JSON request and response bodies with `Content-Type: application/json`, except for file uploads which use `multipart/form-data`. Responses include standard HTTP status codes and rate limit headers on every call." },
|
|
6640
|
+
{ type: "paragraph", text: "There is no versioning in the URL beyond `/v1`. Breaking changes will be communicated in advance and introduced under a new version prefix. Non-breaking additions (new fields, new endpoints) are shipped continuously." },
|
|
6641
|
+
{ type: "callout", text: "Plain HTTP requests are rejected. Always use `https://` in your base URL configuration to ensure encrypted transport." }
|
|
6088
6642
|
],
|
|
6089
6643
|
related: [
|
|
6090
6644
|
{ label: "Authentication", slug: "authentication" }
|
|
@@ -6204,6 +6758,9 @@ X-Talonic-Cells-Resolved-AI: 5` },
|
|
|
6204
6758
|
description: "All list endpoints use cursor-based pagination with cursor, limit, and order parameters. Responses include next_cursor and has_more for iteration.",
|
|
6205
6759
|
content: [
|
|
6206
6760
|
{ type: "paragraph", text: "All list endpoints use cursor-based pagination. Pass a `cursor` token from the previous response to fetch the next page." },
|
|
6761
|
+
{ type: "paragraph", text: "Most integrations call list endpoints after bulk ingestion to iterate through results. A typical workflow is to fetch the first page with a `limit`, then loop using `pagination.next_cursor` until `has_more` is `false`." },
|
|
6762
|
+
{ type: "paragraph", text: "The response always includes a `pagination` object with `total`, `limit`, `has_more`, and `next_cursor`. The `total` field reflects the full count of matching items, not just the current page. Use `order` to control sort direction by `created_at`." },
|
|
6763
|
+
{ type: "paragraph", text: "Pair pagination with query filters (e.g. `status`, `after`, `before`, `search`) on endpoints like `GET /v1/documents` and `GET /v1/extractions` to narrow results before paginating. Note that cursors are opaque and short-lived \u2014 do not persist or parse them." },
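The loop described above can be sketched in a few lines. This is a minimal illustration, not the official client. It assumes the item list is returned under a `data` key (the text names only the `pagination` object), and `make_fake_endpoint` is a hypothetical in-memory stand-in for a real list endpoint so the sketch runs without network access.

```python
from typing import Callable, Iterator, List, Optional


def iterate_pages(
    fetch_page: Callable[[Optional[str], int], dict],
    limit: int = 50,
) -> Iterator[dict]:
    """Yield every item from a cursor-paginated list endpoint."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor, limit)
        # Assumption: items live under "data"; check the real response shape.
        yield from page["data"]
        pagination = page["pagination"]
        if not pagination["has_more"]:
            break
        # Treat the cursor as opaque: pass it back unchanged, never parse it.
        cursor = pagination["next_cursor"]


def make_fake_endpoint(items: List[dict]) -> Callable[[Optional[str], int], dict]:
    """Hypothetical stand-in for a list endpoint such as GET /v1/documents."""

    def fetch_page(cursor: Optional[str], limit: int) -> dict:
        # The fake encodes an offset in the cursor; real cursors are opaque.
        start = int(cursor) if cursor else 0
        chunk = items[start:start + limit]
        end = start + len(chunk)
        has_more = end < len(items)
        return {
            "data": chunk,
            "pagination": {
                "total": len(items),
                "limit": limit,
                "has_more": has_more,
                "next_cursor": str(end) if has_more else None,
            },
        }

    return fetch_page
```

In a real integration, `fetch_page` would wrap an HTTP GET that sends `cursor` and `limit` as query parameters. Because cursors are short-lived, finish the loop in one run instead of storing a cursor for later.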
|
|
6207
6764
|
{
|
|
6208
6765
|
type: "param-table",
|
|
6209
6766
|
title: "Request parameters",
|
|
@@ -6291,6 +6848,9 @@ print(f"Fetched {len(all_documents)} documents")`
|
|
|
6291
6848
|
description: "Use the Idempotency-Key header to safely retry POST requests without creating duplicate extractions. Keys are valid for 24 hours.",
|
|
6292
6849
|
content: [
|
|
6293
6850
|
{ type: "paragraph", text: "Pass an `Idempotency-Key` header on POST requests to safely retry without creating duplicate work. If a request with the same key has already been processed, the API returns the cached response." },
|
|
6851
|
+
{ type: "paragraph", text: "Most integrations use idempotency keys when calling `POST /v1/extract` to guard against network timeouts or duplicate submissions. A typical workflow is to generate a UUID per logical operation, attach it as the `Idempotency-Key` header, and retry the same request on failure without risk of double-processing." },
|
|
6852
|
+
{ type: "paragraph", text: "The cached response is stored for **24 hours** and is scoped to your API key. A duplicate request within that window returns the original response body and HTTP status immediately, with no additional credit cost. After 24 hours the key expires and can be reused for a new request." },
|
|
6853
|
+
{ type: "paragraph", text: "Pair idempotency with webhook callbacks (`webhook_url` option) for robust async workflows. Note that reusing a key with different request parameters will still return the first request's cached result \u2014 always generate a fresh key for each distinct operation." },
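The retry pattern described above can be sketched like this. `TransientError`, `post_with_retry`, and `FakeExtractServer` are hypothetical names for this sketch only. The fake server imitates the documented caching behavior so the example runs offline, and it does not model the 24-hour expiry.

```python
import uuid
from typing import Callable, Dict


class TransientError(Exception):
    """Stands in for a network timeout or a dropped connection."""


def post_with_retry(send: Callable[[Dict[str, str]], dict], max_attempts: int = 3) -> dict:
    """Send one logical operation, reusing the same Idempotency-Key on every retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # One fresh UUID per logical operation, generated once outside the loop.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error: Exception = TransientError("no attempt made")
    for _ in range(max_attempts):
        try:
            return send(headers)
        except TransientError as exc:
            last_error = exc
    raise last_error


class FakeExtractServer:
    """Hypothetical server: does the work once per key, then replays the cached response."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}
        self.processed = 0
        self.fail_next = True  # simulate a response lost on the first call

    def send(self, headers: Dict[str, str]) -> dict:
        key = headers["Idempotency-Key"]
        if key not in self.cache:
            self.processed += 1
            self.cache[key] = {"request_id": f"req_{self.processed}"}
        if self.fail_next:
            self.fail_next = False
            raise TransientError("response lost in transit")
        return self.cache[key]
```

Generating the key outside the retry loop is the important design choice. A new key on each attempt would defeat the deduplication and could lead to double-processing.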
|
|
6294
6854
|
{
|
|
6295
6855
|
type: "param-table",
|
|
6296
6856
|
title: "Idempotency details",
|
|
@@ -6774,6 +7334,7 @@ X-Talonic-Cells-Resolved-AI: 5`
|
|
|
6774
7334
|
seoTitle: "Extract Options \u2014 Talonic Docs",
|
|
6775
7335
|
description: "Configure extraction options including output format, strict mode, async processing, webhook callbacks, raw text inclusion, page ranges, and language hints.",
|
|
6776
7336
|
content: [
|
|
7337
|
+
{ type: "paragraph", text: "Pass these options as fields in the `options` JSON object on `POST /v1/extract` to control extraction behavior. Options let you switch between sync and async mode, include raw text, restrict page ranges, and configure webhook delivery." },
|
|
6777
7338
|
{
|
|
6778
7339
|
type: "param-table",
|
|
6779
7340
|
params: [
|
|
@@ -6785,7 +7346,11 @@ X-Talonic-Cells-Resolved-AI: 5`
|
|
|
6785
7346
|
{ name: "page_range", type: "string", description: 'Pages to extract from. E.g. "1-5", "1,3,7-10". PDF only.' },
|
|
6786
7347
|
{ name: "language_hint", type: "string", description: "ISO 639-1 language code hint. Improves extraction for non-English documents." }
|
|
6787
7348
|
]
|
|
6788
|
-
}
|
|
7349
|
+
},
|
|
7350
|
+
{ type: "paragraph", text: "Most integrations use `strict: true` (default) to receive only the schema-defined fields. Set `strict: false` when you want the AI to also return additional fields it discovers beyond your schema. The `async` and `webhook_url` options complement each other \u2014 set `webhook_url` to receive results by callback and avoid polling entirely." },
|
|
7351
|
+
{ type: "paragraph", text: 'The `page_range` option accepts comma-separated page numbers and ranges (e.g. `"1-5"`, `"1,3,7-10"`) and applies only to PDF files. Use `language_hint` with an ISO 639-1 code (e.g. `"de"`, `"ja"`) to improve extraction accuracy for non-English documents, especially when the OCR needs guidance on character sets.' },
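To make the `page_range` syntax concrete, here is a small client-side sketch that expands a range string into page numbers before a request is sent. It is not Talonic's parser. The 1-based numbering, the rejection of descending ranges, and the merging of duplicates are assumptions of this sketch, and `parse_page_range` is a hypothetical helper name.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a string such as "1,3,7-10" into a sorted list of page numbers.

    Assumptions (not stated in the docs): pages are 1-based, a range must be
    ascending, and overlapping entries are merged.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)  # raises ValueError for empty or non-numeric parts
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Validating the string locally like this can catch typos such as `"5-2"` or a trailing comma before an extraction request is submitted.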
|
|
7352
|
+
{ type: "paragraph", text: "Pair `include_raw_text: true` with schema-driven extraction when your downstream system needs both structured data and the original text for audit or display purposes. Note that setting `webhook_url` implicitly enables async behavior \u2014 the response will be `202 Accepted` regardless of the `async` flag." },
|
|
7353
|
+
{ type: "callout", text: 'The `format` option controls the output shape of the `data` field. Use `"json"` (default) for programmatic consumption. CSV format is available on the `GET /v1/extractions/:id/data` endpoint instead.' }
|
|
6789
7354
|
],
|
|
6790
7355
|
related: [
|
|
6791
7356
|
{ label: "POST /v1/extract", slug: "post-extract" },
|
|
@@ -7079,6 +7644,10 @@ var sections19 = [
|
|
|
7079
7644
|
}
|
|
7080
7645
|
}`
|
|
7081
7646
|
},
|
|
7647
|
+
{ type: "paragraph", text: "Most integrations call this endpoint after receiving an `extraction.complete` webhook or after polling a document's status until it reaches `completed`. A typical workflow is to extract a document via `POST /v1/extract`, store the returned `document.id`, then fetch full metadata here when needed." },
|
|
7648
|
+
{ type: "paragraph", text: "The response includes the current `status` field which will be `completed` when extraction has finished, `processing` while in progress, or `error` if something went wrong. Use the `latest_extraction_id` to navigate directly to the extraction result via `GET /v1/extractions/:id`." },
|
|
7649
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/documents/:id/markdown` to retrieve the raw OCR text, or with `GET /v1/extractions/:id/data` for just the structured field values. Note that the `triage` object is only populated after ingestion completes and may be `null` for documents still in processing." },
|
|
7650
|
+
{ type: "callout", variant: "info", text: "The `links.dashboard` URL opens the document directly in the Talonic platform UI, which is useful for sharing with team members who need to review or correct extractions." },
|
|
7082
7651
|
{ type: "heading", level: 2, id: "get-document-errors", text: "Errors" },
|
|
7083
7652
|
{
|
|
7084
7653
|
type: "param-table",
|
|
@@ -7134,6 +7703,9 @@ var sections19 = [
|
|
|
7134
7703
|
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
|
|
7135
7704
|
}`
|
|
7136
7705
|
},
|
|
7706
|
+
{ type: "paragraph", text: "Most integrations call this endpoint as part of a cleanup workflow after data has been exported or when a document was uploaded in error. A typical pattern is to list documents with `GET /v1/documents`, identify candidates for deletion, then call this endpoint for each one." },
|
|
7707
|
+
{ type: "paragraph", text: "The response includes a `deleted` field set to `true` and the `id` of the removed document. There is no soft-delete mechanism \u2014 the original file, OCR markdown, and all extraction results are permanently purged from storage." },
|
|
7708
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/documents/:id` beforehand to verify you are deleting the correct resource. Note that if the document participated in entity linking or cases, those links are removed and affected cases may be recomputed during the next backfill cycle." },
|
|
7137
7709
|
{ type: "heading", level: 2, id: "delete-document-errors", text: "Errors" },
|
|
7138
7710
|
{
|
|
7139
7711
|
type: "param-table",
|
|
@@ -7393,6 +7965,9 @@ var sections20 = [
|
|
|
7393
7965
|
"due_date": "2024-03-15"
|
|
7394
7966
|
}`
|
|
7395
7967
|
},
|
|
7968
|
+
{ type: "paragraph", text: "Most integrations call this endpoint to feed extraction output into downstream systems (CRMs, ERPs, data warehouses) that only need the raw key-value data. A typical workflow is to extract a document, then call this endpoint with the `extraction_id` from the response to get a clean data payload without metadata overhead." },
|
|
7969
|
+
{ type: "paragraph", text: "The response is a flat JSON object where each key is a field name and each value is the extracted value, typed according to the schema (strings, numbers, dates, arrays). Use `?format=csv` to download the same data as a CSV file with field names as headers \u2014 the `Content-Disposition` header provides a suggested filename." },
|
|
7970
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/extractions/:id` when you also need confidence scores, locked field status, or processing metadata. Note that the response shape matches the schema used during extraction \u2014 if no schema was provided, auto-discovered field names are used as keys." },
|
|
7396
7971
|
{ type: "heading", level: 2, id: "get-extraction-fields-errors", text: "Errors" },
|
|
7397
7972
|
{
|
|
7398
7973
|
type: "param-table",
|
|
@@ -7716,6 +8291,9 @@ var sections21 = [
|
|
|
7716
8291
|
}
|
|
7717
8292
|
}`
|
|
7718
8293
|
},
|
|
8294
|
+
{ type: "paragraph", text: "Most integrations call this endpoint before running an extraction to verify the schema definition is correct, or after an update to confirm the new version was applied. A typical workflow is to create a schema with `POST /v1/schemas`, store the returned `id`, then fetch it here whenever you need the current definition." },
|
|
8295
|
+
{ type: "paragraph", text: "The response includes the full `definition` object in normalized JSON Schema format, along with the `version` number and `field_count`. Use the `links.extractions` URL to list all extractions that used this schema, and `links.dashboard` to open it in the platform UI." },
|
|
8296
|
+
{ type: "paragraph", text: "Pair this with `PUT /v1/schemas/:id` to update the definition, or pass the `id` as `schema_id` on `POST /v1/extract` to run schema-driven extraction. Note that both UUID and `SCH-` prefixed short IDs are accepted as the `:id` parameter." },
|
|
7719
8297
|
{ type: "heading", level: 2, id: "get-schema-errors", text: "Errors" },
|
|
7720
8298
|
{
|
|
7721
8299
|
type: "param-table",
|
|
@@ -7913,6 +8491,10 @@ var sections21 = [
|
|
|
7913
8491
|
}
|
|
7914
8492
|
}`
|
|
7915
8493
|
},
|
|
8494
|
+
{ type: "paragraph", text: "Most integrations call this endpoint when extraction requirements evolve \u2014 for example, adding a new field to an invoice schema or renaming an existing one. A typical workflow is to fetch the current schema with `GET /v1/schemas/:id`, modify the `definition`, then send the updated payload here." },
|
|
8495
|
+
{ type: "paragraph", text: "The response includes the updated `definition`, `field_count`, and `version` number. The `updated_at` timestamp reflects when the change was applied. All body parameters are optional \u2014 send only `name`, `definition`, or `description` to update that field without touching the others." },
|
|
8496
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/extractions?schema_id=:id` to review historical extractions that used previous versions. Note that schema versioning is append-only internally, so you can always compare before-and-after definitions through the dashboard." },
|
|
8497
|
+
{ type: "callout", variant: "info", text: "Schema updates do not retroactively change existing extractions. If you need to re-extract documents with the new schema, call `POST /v1/extract` with `document_id` and the updated `schema_id`." },
|
|
7916
8498
|
{ type: "heading", level: 2, id: "update-schema-errors", text: "Errors" },
|
|
7917
8499
|
{
|
|
7918
8500
|
type: "param-table",
|
|
@@ -7968,6 +8550,9 @@ var sections21 = [
|
|
|
7968
8550
|
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
|
|
7969
8551
|
}`
|
|
7970
8552
|
},
|
|
8553
|
+
{ type: "paragraph", text: "Most integrations call this endpoint during cleanup when a schema is no longer needed, or when consolidating duplicate schemas. A typical workflow is to list schemas with `GET /v1/schemas`, identify obsolete ones, then delete them individually by `id`." },
|
|
8554
|
+
{ type: "paragraph", text: "The response confirms deletion with `deleted: true` and the `id` of the removed schema. All extraction results that used this schema remain intact and queryable via `GET /v1/extractions` \u2014 only the schema definition itself is removed from the system." },
|
|
8555
|
+
{ type: "paragraph", text: "Pair this with `GET /v1/schemas/:id` beforehand to review the schema before removing it. Note that deletion is permanent with no undo \u2014 if you need the same structure later, you must recreate it with `POST /v1/schemas`." },
|
|
7971
8556
|
{ type: "heading", level: 2, id: "delete-schema-errors", text: "Errors" },
|
|
7972
8557
|
{
|
|
7973
8558
|
type: "param-table",
|
|
@@ -8545,6 +9130,9 @@ var sections23 = [
|
|
|
8545
9130
|
description: "Create a new input source and receive a source-scoped API key. The key is only shown once in the creation response \u2014 store it securely.",
|
|
8546
9131
|
content: [
|
|
8547
9132
|
{ type: "paragraph", text: "Create a new source to start ingesting documents. The response includes a **source-scoped API key** (`tlnc_sk_*`) that authenticates uploads to this source's endpoint. This key is shown only once \u2014 store it securely immediately after creation." },
|
|
9133
|
+
{ type: "paragraph", text: "The typical workflow is: create a source, store the returned `api_key` securely, then use it to authenticate document uploads to the source's `endpoint` URL. Optionally pass a `default_schema_id` to automatically apply an extraction schema to all documents ingested through this source." },
|
|
9134
|
+
{ type: "paragraph", text: "The response returns the source with `status: active`, `document_count: 0`, and the one-time `api_key` field. The `endpoint` URL is the path for `POST` document uploads. The `links` object includes URLs for the source detail, document list, and dashboard view." },
|
|
9135
|
+
{ type: "paragraph", text: "Store the `api_key` immediately \u2014 it cannot be retrieved again. If lost, delete the source and create a new one. The source type defaults to `api` (programmatic ingestion); use `upload` for manual file uploads or `connector` for third-party integrations like Google Drive or SharePoint." },
|
|
8548
9136
|
{ type: "callout", variant: "warning", text: "The `api_key` is only returned in the creation response. It cannot be retrieved later. If you lose it, delete the source and create a new one." },
|
|
8549
9137
|
{
|
|
8550
9138
|
type: "endpoint",
|
|
@@ -8630,6 +9218,10 @@ var sections23 = [
|
|
|
8630
9218
|
description: "Get source details, update a source name, or delete a source. Documents are retained but unlinked when a source is deleted.",
|
|
8631
9219
|
content: [
|
|
8632
9220
|
{ type: "paragraph", text: "Manage an individual source with GET, PATCH, and DELETE operations on the same path. Retrieve source details, update its name, or permanently delete it. When a source is deleted, its documents are **retained** but unlinked from the source." },
|
|
9221
|
+
{ type: "paragraph", text: "Use `GET` to inspect a source's current status, document count, and default schema assignment. Use `PATCH` to rename a source. Use `DELETE` when a source is no longer needed \u2014 this immediately invalidates the source-scoped API key, so any integration using it will start receiving `401` errors." },
|
|
9222
|
+
{ type: "paragraph", text: "The `GET` response includes `document_count`, `default_schema` (with its `id` if set), and the `endpoint` URL for document ingestion. The `status` field shows the current state \u2014 `active` for API sources, or sync status values for connector-based sources (Google Drive, SharePoint, etc.)." },
|
|
9223
|
+
{ type: "paragraph", text: "Deleting a source retains all its documents in your workspace \u2014 they remain accessible via the documents API and any existing extractions are preserved. Only the source-to-document link is removed. Pair `GET /v1/sources/:id` with `GET /v1/sources/:id/documents` to see documents belonging to a specific source." },
|
|
9224
|
+
{ type: "callout", variant: "info", text: "Deleting a source immediately invalidates its API key. Any integration using that key will receive `401` errors. Documents are retained but unlinked from the source." },
|
|
8633
9225
|
{
|
|
8634
9226
|
type: "endpoint",
|
|
8635
9227
|
method: "GET",
|
|
@@ -9537,7 +10129,11 @@ var sections25 = [
|
|
|
9537
10129
|
"`extraction.complete` \u2014 Extraction finished successfully. Payload includes the full extraction result.",
|
|
9538
10130
|
"`extraction.failed` \u2014 Extraction failed. Payload includes the error details.",
|
|
9539
10131
|
"`document.ingested` \u2014 A new document has been processed and is ready for extraction."
|
|
9540
|
-
] }
|
|
10132
|
+
] },
|
|
10133
|
+
{ type: "paragraph", text: "Most integrations subscribe to `extraction.complete` to trigger downstream processing (e.g. writing structured data to a database or notifying a user). A typical workflow is to pass `webhook_url` on `POST /v1/extract`, then handle the callback payload in your server without polling." },
|
|
10134
|
+
{ type: "paragraph", text: "The `extraction.complete` payload includes the `extraction_id`, `document_id`, `schema_id`, `status`, and `confidence` score. Use the `extraction_id` to fetch the full result via `GET /v1/extractions/:id` if the payload does not contain all the fields you need." },
|
|
10135
|
+
{ type: "paragraph", text: "Pair event handling with [Signature Verification](webhook-security) to ensure payloads are authentic. Note that `extraction.failed` events include an `error` field with a machine-readable code and human-readable message \u2014 use this to decide whether to retry via `POST /v1/extract` with `document_id`." },
|
|
10136
|
+
{ type: "callout", text: "Webhook URLs must be HTTPS endpoints. HTTP URLs are rejected at configuration time to ensure payload confidentiality in transit." }
|
|
9541
10137
|
],
|
|
9542
10138
|
related: [
|
|
9543
10139
|
{ label: "Signature Verification", slug: "webhook-security" },
|
|
@@ -9655,7 +10251,11 @@ echo -n '{"event":"extraction.complete","delivery_id":"dlv_test123","timestamp":
|
|
|
9655
10251
|
"3rd retry \u2014 30 minutes",
|
|
9656
10252
|
"4th retry (final) \u2014 4 hours"
|
|
9657
10253
|
] },
|
|
9658
|
-
{ type: "paragraph", text: "After 4 failed attempts, the delivery is marked as failed. You can check delivery status and replay events from the dashboard." }
|
|
10254
|
+
{ type: "paragraph", text: "After 4 failed attempts, the delivery is marked as failed. You can check delivery status and replay events from the dashboard." },
|
|
10255
|
+
{ type: "paragraph", text: "Most integrations rely on the default retry schedule and only intervene when a delivery reaches the failed state. A typical debugging workflow is to check the delivery history in the dashboard, identify the HTTP status or timeout that caused the failure, then fix the endpoint and replay the event." },
|
|
10256
|
+
{ type: "paragraph", text: "Your endpoint must return a `2xx` status code within **30 seconds** to be considered successful. Non-`2xx` responses (including `3xx` redirects) and timeouts trigger retries. The `X-Talonic-Delivery-Id` header remains the same across retries, so use it for idempotent processing on your end." },
|
|
10257
|
+
{ type: "paragraph", text: "Pair retry awareness with [Signature Verification](webhook-security) to reject spoofed payloads early. Note that the total retry window spans approximately **4.5 hours** from the initial attempt \u2014 if your endpoint is down longer than that, use the dashboard replay feature to re-send missed events." },
|
|
10258
|
+
{ type: "callout", text: "If your endpoint consistently fails, check for firewall rules blocking Talonic IPs, TLS certificate issues, or response timeouts exceeding 30 seconds. The dashboard delivery log shows the HTTP status and error for each attempt." }
|
|
9659
10259
|
],
|
|
9660
10260
|
related: [
|
|
9661
10261
|
{ label: "Webhook Events", slug: "webhook-events" },
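Reviewer note on the retry hunk above: the added text says the `X-Talonic-Delivery-Id` header stays the same across retries and should be used for idempotent processing. A minimal sketch of that idea follows. The in-memory `Set` and the function names are assumptions for illustration (a real deployment would likely need a durable store), and the lower-cased header key assumes a framework that normalizes header names the way Node's `http` module does.

```javascript
// Hypothetical helper: remembers delivery IDs already seen in this process.
// An in-memory Set is an assumption; it is lost on restart.
function createDeliveryDeduper() {
  const seen = new Set();

  // Returns true the first time a delivery ID is seen, false for repeats.
  return function shouldProcess(headers) {
    const deliveryId = headers["x-talonic-delivery-id"];
    if (!deliveryId) {
      return false; // without an ID there is nothing to deduplicate on
    }
    if (seen.has(deliveryId)) {
      return false; // a retry of a delivery that was already handled
    }
    seen.add(deliveryId);
    return true;
  };
}
```

A handler would call the returned function before doing any work and still respond with a `2xx` for duplicates, so that the retry schedule stops.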
|
|
@@ -9673,7 +10273,11 @@ echo -n '{"event":"extraction.complete","delivery_id":"dlv_test123","timestamp":
|
|
|
9673
10273
|
seoTitle: "Webhook Delivery Format \u2014 Talonic Docs",
|
|
9674
10274
|
description: "Webhook delivery format details including POST request structure, JSON body format, and standard headers for event type, signature, delivery ID, and timestamp.",
|
|
9675
10275
|
content: [
|
|
9676
|
-
{ type: "paragraph", text: "Webhooks are delivered as `POST` requests with a JSON body. Configure webhook URLs per-source or per-extraction via the `webhook_url` option on the extract endpoint." }
|
|
10276
|
+
{ type: "paragraph", text: "Webhooks are delivered as `POST` requests with a JSON body. Configure webhook URLs per-source or per-extraction via the `webhook_url` option on the extract endpoint." },
|
|
10277
|
+
{ type: "paragraph", text: "Most integrations configure a single webhook endpoint that handles all event types, using the `X-Talonic-Event` header to route internally. A typical setup is to pass `webhook_url` on `POST /v1/extract` calls, or configure a default URL in the dashboard for all extractions from a specific source." },
|
|
10278
|
+
{ type: "paragraph", text: "Each delivery includes four standard headers: `X-Talonic-Event` (event type), `X-Talonic-Signature` (HMAC-SHA256 for verification), `X-Talonic-Delivery-Id` (unique ID for idempotency), and `X-Talonic-Timestamp` (Unix timestamp). Your endpoint must return a `2xx` status within **30 seconds** or the delivery is considered failed." },
|
|
10279
|
+
{ type: "paragraph", text: "Pair webhook delivery with the [Signature Verification](webhook-security) guide to authenticate incoming payloads. Note that failed deliveries are retried with exponential backoff up to 4 times \u2014 see [Retry Policy](webhook-retry) for the schedule." },
|
|
10280
|
+
{ type: "callout", text: "Use the `X-Talonic-Delivery-Id` header to deduplicate webhook deliveries on your end. Retries reuse the same delivery ID, so you can safely discard duplicates." }
|
|
9677
10281
|
],
|
|
9678
10282
|
related: [
|
|
9679
10283
|
{ label: "Webhook Events", slug: "webhook-events" },
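Reviewer note on the delivery-format hunk above: the added text describes a single endpoint that routes internally on the `X-Talonic-Event` header. One way this could look is sketched below. `routeEvent` and the shape of `handlers` are hypothetical names, not part of any Talonic SDK, and the lower-cased header key is an assumption about the HTTP framework in use.

```javascript
// Hypothetical router: picks a handler by the event type in the header.
// `handlers` maps event names (e.g. "extraction.complete") to functions.
function routeEvent(headers, payload, handlers) {
  const eventType = headers["x-talonic-event"];
  // Own-property check so names like "constructor" never match by accident.
  const known =
    typeof eventType === "string" &&
    Object.prototype.hasOwnProperty.call(handlers, eventType) &&
    typeof handlers[eventType] === "function";

  if (!known) {
    return { handled: false, eventType };
  }
  return { handled: true, eventType, result: handlers[eventType](payload) };
}
```

Unknown event types are reported rather than thrown, so the endpoint can still acknowledge the delivery and log the event for later inspection.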
|
|
@@ -10229,6 +10833,9 @@ var sections27 = [
|
|
|
10229
10833
|
description: "Classify link keys into categories (identity, transaction, reference) using AI. Runs asynchronously on ambiguous fields.",
|
|
10230
10834
|
content: [
|
|
10231
10835
|
{ type: "paragraph", text: "When new fields are extracted, some may not be automatically classified as link keys. The classify endpoint runs AI-powered classification on ambiguous fields to determine whether they are **identity**, **transaction**, or **reference** link keys. This is useful after onboarding new document types or when the field registry grows." },
|
|
10836
|
+
{ type: "paragraph", text: "Call this endpoint after uploading a new batch of documents or after adding a new document type to your workspace. The endpoint returns immediately with the count of fields that were classified \u2014 any graph rebuilding happens asynchronously via a triggered backfill." },
|
|
10837
|
+
{ type: "paragraph", text: "The response includes a `classified` count (number of fields newly assigned a category) and a `backfillTriggered` boolean. When `backfillTriggered` is `true`, entity links across all documents are being rebuilt in the background. Poll the **Backfill** progress endpoint to monitor completion." },
|
|
10838
|
+
{ type: "paragraph", text: "Only fields with a `null` category are evaluated \u2014 already-classified link keys are not re-assessed. To verify which fields were classified, call the **Link Keys** endpoint before and after. If no ambiguous fields remain, `classified` returns `0` and no backfill is triggered." },
|
|
10232
10839
|
{ type: "callout", variant: "info", text: "Classification uses a two-pass approach: rule-based heuristics handle obvious cases (e.g. fields named `invoice_number`), then an LLM call classifies the remaining ambiguous fields. A backfill is automatically triggered when new link keys are identified." },
|
|
10233
10840
|
{
|
|
10234
10841
|
type: "endpoint",
|
|
@@ -10283,6 +10890,9 @@ var sections27 = [
|
|
|
10283
10890
|
description: "Get all entity links for a specific document showing entity values, types, link keys, and linked document IDs.",
|
|
10284
10891
|
content: [
|
|
10285
10892
|
{ type: "paragraph", text: "Retrieve all entity links discovered for a specific document. Each link represents a shared field value \u2014 such as a customer ID or PO number \u2014 that connects this document to others in the workspace. Use this endpoint to understand how a document relates to the rest of your corpus." },
|
|
10893
|
+
{ type: "paragraph", text: "Call this endpoint when building a document detail view or when you need to trace the relationships of a single document before exploring the broader graph. Pass the document UUID as a path parameter \u2014 the endpoint returns all entity links regardless of link key category." },
|
|
10894
|
+
{ type: "paragraph", text: "Each entry in the response includes the **entity_value** (the raw shared value), the **field_key** (which field it was extracted from), and the **link_key_category** (`identity`, `transaction`, or `reference`). Documents with no extracted field values matching other documents return an empty `data` array." },
|
|
10895
|
+
{ type: "paragraph", text: "Use this alongside the **Full Graph** subgraph endpoint to progressively explore the linking graph. Start here for a flat list of connections, then call the subgraph endpoint with `depth=2` to expand outward from the document and discover second-degree relationships." },
|
|
10286
10896
|
{ type: "callout", variant: "info", text: "The `document_count` field on each entity indicates how many documents share that value. A high count on an identity entity (e.g. a vendor ID appearing in 50+ documents) is expected, while a high count on a transaction entity may indicate a data quality issue." },
|
|
10287
10897
|
{
|
|
10288
10898
|
type: "endpoint",
|
|
@@ -10597,6 +11207,10 @@ var sections27 = [
|
|
|
10597
11207
|
description: "List and retrieve cases \u2014 automatically created groups of 2+ related documents linked through shared field values with narrative summaries.",
|
|
10598
11208
|
content: [
|
|
10599
11209
|
{ type: "paragraph", text: "Cases are automatically created groups of two or more documents that are connected through shared **transaction** or **reference** entity values. For example, an invoice, a purchase order, and a delivery note sharing the same PO number form a case. Cases provide a high-level view of document relationships without needing to navigate the full graph." },
|
|
11210
|
+
{ type: "paragraph", text: "Use this endpoint to retrieve all cases in your workspace for building case lists, dashboards, or approval queues. The response is ordered by most recent first based on the earliest document timestamp in each case. Each case includes a `document_count` and a stable `case_key` that you can use for subsequent detail lookups." },
|
|
11211
|
+
{ type: "paragraph", text: "The response includes a `links.self` URL for each case that points to the case detail endpoint. The `label` field contains an auto-generated human-readable name when available, or `null` for cases that have not yet been labelled. The `created_at` field reflects the timestamp of the earliest document in the group." },
|
|
11212
|
+
{ type: "callout", variant: "info", text: "Each document belongs to at most one case. Documents linked only through identity entities (e.g. shared vendor ID) appear as entity groups in the full graph but are not returned by this endpoint." },
|
|
11213
|
+
{ type: "paragraph", text: "Pair this endpoint with **Case Graph** to visualize individual cases, or with **Document-Case Map** for a flat document-to-case lookup. Cases are rebuilt automatically during backfill \u2014 if you have recently reclassified link keys, trigger a backfill first to ensure case assignments are up to date." },
|
|
10600
11214
|
{ type: "list", ordered: false, items: [
|
|
10601
11215
|
"Each case has a deterministic **case key** (hex hash of its document IDs)",
|
|
10602
11216
|
"Cases are created by the linking pipeline during backfill or real-time processing",
|
|
@@ -10671,6 +11285,10 @@ var sections27 = [
|
|
|
10671
11285
|
description: "Retrieve the D3-compatible graph visualization for a single case, showing document nodes and entity edges within the case boundary.",
|
|
10672
11286
|
content: [
|
|
10673
11287
|
{ type: "paragraph", text: "Retrieve the graph structure for a single case, formatted for **D3.js** or similar graph visualization libraries. The response contains only the nodes and edges within the case boundary, making it suitable for rendering focused relationship diagrams." },
|
|
11288
|
+
{ type: "paragraph", text: "The typical workflow is to first list cases via the **Cases** endpoint, then call this endpoint with a specific `case_key` to fetch the renderable graph. This is the primary endpoint for building case-level visualizations in custom UIs or embedded dashboards." },
|
|
11289
|
+
{ type: "paragraph", text: "The response includes both **document nodes** (with filename and inferred document type) and **entity nodes** (with the shared value and link key category). Edges always connect a document to an entity \u2014 never document-to-document directly. Node IDs are stable across requests, so you can preserve force-layout positions between refreshes." },
|
|
11290
|
+
{ type: "callout", variant: "info", text: "The case graph is a strict subset of the full workspace graph. Only entities that contributed to forming the case are included \u2014 high-frequency entities excluded from BFS do not appear." },
|
|
11291
|
+
{ type: "paragraph", text: "Pair this endpoint with **Document Links** to enrich each node with additional entity metadata, or with **Full Graph** when you need cross-case visibility. The graph structure mirrors the full graph format, so the same rendering code works for both." },
|
|
10674
11292
|
{
|
|
10675
11293
|
type: "endpoint",
|
|
10676
11294
|
method: "GET",
|
|
@@ -10738,6 +11356,9 @@ var sections27 = [
|
|
|
10738
11356
|
description: "Get the mapping of documents to their resolved cases. Returns a mapping of document IDs to assigned case keys.",
|
|
10739
11357
|
content: [
|
|
10740
11358
|
{ type: "paragraph", text: "The document-case map provides a flat lookup from document ID to case assignment. Use it to quickly determine which case a document belongs to, or to identify documents that are not part of any case. Documents in **entity groups** (linked only through identity entities) are included with `is_case: false`." },
|
|
11359
|
+
{ type: "paragraph", text: "Call this endpoint when you need to enrich a document list with case membership \u2014 for example, to display a case badge next to each document in a table view. The response is a flat object keyed by document UUID, so lookups are O(1) without client-side joins." },
|
|
11360
|
+
{ type: "paragraph", text: "Each entry includes a `case_key` (the deterministic hex hash identifying the case), a `document_count` (total documents in that case or entity group), and an `is_case` boolean. When `is_case` is `false`, the `case_key` is an empty string \u2014 the document is linked via identity entities only." },
|
|
11361
|
+
{ type: "paragraph", text: "This endpoint pairs well with the **Cases** list endpoint. Use the map for bulk lookups across your document set, and the Cases endpoint when you need case-level metadata like labels or timestamps. Documents with no entity links at all are omitted from the map entirely." },
|
|
10741
11362
|
{ type: "callout", variant: "info", text: "Documents with `is_case: false` are linked to other documents only through identity entities (e.g. same vendor). They appear in the map but do not form a case. Documents with no links at all are not included in the map." },
|
|
10742
11363
|
{
|
|
10743
11364
|
type: "endpoint",
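Reviewer note on the document-case map hunk above: the added text suggests using the flat map to show a case badge next to each document. A small sketch of that lookup follows. It assumes the map has already been unwrapped from the response into a plain object keyed by document UUID. The badge labels and the function name are made up for illustration.

```javascript
// Hypothetical lookup: turns a map entry into a display label.
// caseMap: { [documentId]: { case_key, document_count, is_case } }
function caseBadge(documentId, caseMap) {
  const present = Object.prototype.hasOwnProperty.call(caseMap, documentId);
  if (!present) {
    return "none"; // documents with no entity links are omitted from the map
  }
  // is_case: false means linked via identity entities only (entity group).
  return caseMap[documentId].is_case ? "case" : "entity-group";
}
```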
|
|
@@ -13073,6 +13694,9 @@ var sections31 = [
|
|
|
13073
13694
|
description: "Get metric trends over time for a schema. Returns time-series telemetry data across recent runs for tracking quality changes.",
|
|
13074
13695
|
content: [
|
|
13075
13696
|
{ type: "paragraph", text: "Track how structuring metrics evolve over successive runs for a schema. This endpoint returns a **time-series** of telemetry snapshots, allowing you to detect quality improvements, regressions, or shifts in strategy distribution as your field registry matures." },
|
|
13697
|
+
{ type: "paragraph", text: "Call this endpoint after several extraction runs to build trend charts or to detect regressions. The default window returns the 10 most recent runs \u2014 use the `window` query parameter to expand up to 50 runs for longer-term analysis." },
|
|
13698
|
+
{ type: "paragraph", text: "Each snapshot in the `data` array contains the same metrics as the **Schema Summary** \u2014 `capture_hit_rate`, `synthesize_rate`, `strategy_distribution`, and `tier_funnel` \u2014 plus a `created_at` timestamp and `run_id`. The array is ordered by most recent run first." },
|
|
13699
|
+
{ type: "paragraph", text: "Compare the trend data with the **Schema Fields** endpoint to pinpoint which specific fields are driving changes. A sudden spike in `synthesize_rate` across runs may indicate a new document type that the field registry has not yet learned, while a steady decrease signals healthy registry maturation." },
|
|
13076
13700
|
{ type: "callout", variant: "info", text: "A rising `capture_hit_rate` over time indicates the field registry is learning from extractions and resolving more fields deterministically, reducing LLM costs." },
|
|
13077
13701
|
{
|
|
13078
13702
|
type: "endpoint",
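Reviewer note on the schema-trend hunk above: the added text says snapshots arrive most recent first and that a sudden rise in `synthesize_rate` may point to an unfamiliar document type. The sketch below shows one possible way to flag such jumps. The `0.1` threshold and the function name are arbitrary assumptions; only `synthesize_rate` and `run_id` are taken from the text.

```javascript
// Hypothetical detector: flags runs whose synthesize_rate rose by more than
// `threshold` compared with the run immediately before them.
function findSynthesizeSpikes(snapshots, threshold = 0.1) {
  // The endpoint is described as newest-first; compare in chronological order.
  const chronological = [...snapshots].reverse();
  const spikes = [];
  for (let i = 1; i < chronological.length; i++) {
    const delta =
      chronological[i].synthesize_rate - chronological[i - 1].synthesize_rate;
    if (delta > threshold) {
      spikes.push({ run_id: chronological[i].run_id, delta });
    }
  }
  return spikes;
}
```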
|
|
@@ -13177,6 +13801,9 @@ var sections31 = [
|
|
|
13177
13801
|
description: "Get per-field structuring metrics for a schema including field-level state distribution, capture rates, and strategy breakdown.",
|
|
13178
13802
|
content: [
|
|
13179
13803
|
{ type: "paragraph", text: "Drill down to **individual field performance** within a schema. This endpoint returns per-field capture rates, synthesis rates, the most common strategy used, and the distribution of cell states (filled, empty, skipped). Use it to identify underperforming fields that may need instruction tuning or manual review." },
|
|
13804
|
+
{ type: "paragraph", text: "Call this endpoint after reviewing the **Schema Summary** to investigate which fields are driving low capture rates or high synthesis costs. The field-level breakdown reveals whether issues are concentrated in a few problematic fields or spread evenly across the schema." },
|
|
13805
|
+
{ type: "paragraph", text: "Each entry in the `data` array includes the `field_name`, `capture_rate` and `synthesize_rate` (both 0-1 fractions), the dominant `strategy` (one of `transfer`, `extract`, `compute`, `skip`), and a `state_distribution` object with `filled`, `empty`, and `skipped` counts. Fields with a `strategy` of `extract` are LLM-dependent and contribute most to cost." },
|
|
13806
|
+
{ type: "paragraph", text: "Pair this with the **Schema Trend** endpoint to track how individual field performance changes across runs. Fields that remain stuck on `extract` strategy after multiple runs are strong candidates for adding explicit instructions or seeding the field registry with example values." },
|
|
13180
13807
|
{ type: "callout", variant: "info", text: "Fields with a high `synthesize_rate` and low `capture_rate` are candidates for field registry enrichment or instruction refinement to reduce LLM dependency." },
|
|
13181
13808
|
{
|
|
13182
13809
|
type: "endpoint",
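Reviewer note on the schema-fields hunk above: the callout names fields with a high `synthesize_rate` and low `capture_rate` as candidates for registry enrichment. A sketch of that filter over the `data` array follows. The cut-off values are assumptions chosen for illustration, since the docs text gives no thresholds.

```javascript
// Hypothetical filter: returns the names of fields that lean heavily on
// synthesis. Both rates are 0-1 fractions per the docs text.
function findEnrichmentCandidates(
  fields,
  { minSynthesize = 0.5, maxCapture = 0.3 } = {}
) {
  return fields
    .filter(
      (f) => f.synthesize_rate >= minSynthesize && f.capture_rate <= maxCapture
    )
    .map((f) => f.field_name);
}
```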
|
|
@@ -13258,6 +13885,9 @@ var sections31 = [
|
|
|
13258
13885
|
description: "Get aggregate structuring metrics for a single job run including strategy distribution, tier funnel, and capture hit rate.",
|
|
13259
13886
|
content: [
|
|
13260
13887
|
{ type: "paragraph", text: "Retrieve structuring telemetry for a **specific job run** rather than the latest run for a schema. Use this when you need to inspect the performance of a particular execution, compare two runs side by side, or debug a run that produced unexpected results." },
|
|
13888
|
+
{ type: "paragraph", text: "The typical workflow is to list runs from your jobs pipeline, then call this endpoint with the run UUID to inspect its metrics. This is especially useful when a run produces unexpected accuracy \u2014 the telemetry reveals whether the issue is in capture (registry gaps), synthesis (LLM errors), or strategy selection." },
|
|
13889
|
+
{ type: "paragraph", text: "The response includes `capture_hit_rate`, `synthesize_rate`, `strategy_distribution`, and `tier_funnel` \u2014 identical in shape to the **Schema Summary**. The `schema_id` field identifies which schema was used, allowing you to cross-reference with field-level telemetry. Runs that are still `pending` or `running` return a `404` until they complete." },
|
|
13890
|
+
{ type: "paragraph", text: "To compare two runs, call this endpoint twice with different run IDs and diff the `strategy_distribution` and `tier_funnel` values. Pair with the **Schema Trend** endpoint when you need the full historical view rather than a point-in-time comparison." },
|
|
13261
13891
|
{ type: "callout", variant: "info", text: "The response shape is identical to the Schema Summary endpoint. The only difference is that this endpoint targets a specific run by ID instead of returning the latest run for a schema." },
|
|
13262
13892
|
{
|
|
13263
13893
|
type: "endpoint",
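Reviewer note on the run-telemetry hunk above: the added text recommends calling the endpoint twice and diffing `strategy_distribution`. The sketch below assumes that field is a flat object mapping strategy names to numbers. That shape is not confirmed anywhere in this diff, so treat it as a guess to verify against a real response.

```javascript
// Hypothetical diff: per-strategy change from one run to another.
// Assumes both inputs look like { transfer: 40, extract: 10, ... }.
function diffDistribution(before, after) {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const diff = {};
  for (const key of keys) {
    const delta = (after[key] ?? 0) - (before[key] ?? 0);
    if (delta !== 0) {
      diff[key] = delta; // only strategies that actually changed
    }
  }
  return diff;
}
```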
|
|
@@ -13407,6 +14037,9 @@ var sections32 = [
|
|
|
13407
14037
|
description: "Get detail with expected values or delete a ground-truth dataset. Supports GET (read scope) and DELETE (write scope) on the same path.",
|
|
13408
14038
|
content: [
|
|
13409
14039
|
{ type: "paragraph", text: "Retrieve the full details of a ground-truth dataset including all expected value entries, or permanently delete the dataset. The GET response includes every document-field pair with the expected value, which you can use to audit the benchmark data before running a validation." },
|
|
14040
|
+
{ type: "paragraph", text: "Call GET before starting a validation run to verify that expected values are correct and complete. The `values` array contains every document-field pair with its `expected_value`, `document_id`, and `field_name` \u2014 review these to ensure the benchmark data reflects your current extraction requirements." },
|
|
14041
|
+
{ type: "paragraph", text: "The response includes `entry_count` for a quick size check and `user_schema_id` to confirm schema scope. The `values` array entries each have their own UUID (`id`) and `created_at` timestamp. If the dataset is unscoped (`user_schema_id: null`), it can validate fields across any schema." },
|
|
14042
|
+
{ type: "paragraph", text: "Use DELETE only when the dataset is no longer relevant. Existing validation runs that referenced this dataset are retained with their results intact, but you cannot create new runs against a deleted dataset. To update individual entries, delete and recreate the dataset with corrected values." },
|
|
13410
14043
|
{ type: "callout", variant: "warning", text: "Deleting a ground-truth dataset also removes all associated expected value entries. Existing validation runs that used this dataset are retained but can no longer be re-run." },
|
|
13411
14044
|
{
|
|
13412
14045
|
type: "endpoint",
|
|
@@ -13668,6 +14301,10 @@ var sections32 = [
|
|
|
13668
14301
|
description: "Get validation run detail with accuracy summary or delete a run. Supports GET (read scope) and DELETE (write scope) on the same path.",
|
|
13669
14302
|
content: [
|
|
13670
14303
|
{ type: "paragraph", text: "Retrieve the full details of a validation run including its status, accuracy score, and total comparisons. Or permanently delete a run and its associated results. Use GET to poll a run's status until it reaches `completed`, then fetch the detailed results." },
|
|
14304
|
+
{ type: "paragraph", text: "After creating a validation run, poll this endpoint until the `status` field transitions from `pending` or `running` to `completed` or `failed`. Once completed, the `accuracy` field contains the overall score (0-1) and `total_comparisons` shows how many field-level comparisons were made." },
|
|
14305
|
+
{ type: "paragraph", text: "The response includes `links.results` which points directly to the per-field results endpoint. Once the run reaches `completed` status, follow this link to retrieve the granular comparison data including match types, similarity scores, and LLM judge verdicts." },
|
|
14306
|
+
{ type: "callout", variant: "warning", text: "Deleting a validation run permanently removes all per-field results. The ground-truth dataset and the original job run are not affected. Use DELETE only when you want to clean up outdated or erroneous runs." },
|
|
14307
|
+
{ type: "paragraph", text: "Pair this endpoint with **Create Validation Run** for the create-then-poll workflow, or with **List Validation Runs** to find specific runs by recency. Comparing the `accuracy` values of multiple runs against the same ground-truth dataset is the primary way to track extraction quality over time." },
|
|
13671
14308
|
{
|
|
13672
14309
|
type: "endpoint",
|
|
13673
14310
|
method: "GET",
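Reviewer note on the validation-run hunk above: the added text describes polling until `status` leaves `pending` or `running`. A minimal polling loop is sketched below. `fetchRun` is a hypothetical caller-supplied function that performs the GET and resolves to the parsed run object, and the interval and attempt limits are arbitrary assumptions.

```javascript
// Hypothetical poller: resolves once the run is "completed" or "failed".
async function pollValidationRun(
  fetchRun,
  { intervalMs = 2000, maxAttempts = 30 } = {}
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const run = await fetchRun();
    if (run.status === "completed" || run.status === "failed") {
      return run;
    }
    if (attempt < maxAttempts) {
      // Wait before the next check; no sleep after the final attempt.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Run did not finish after ${maxAttempts} attempts`);
}
```

Callers still need to branch on the returned `status`, since a `failed` run ends the loop just as a `completed` one does.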
|
|
@@ -13919,6 +14556,9 @@ var sections33 = [
|
|
|
13919
14556
|
description: "Get credit transaction history including purchases, deductions, and adjustments with page-based pagination.",
|
|
13920
14557
|
content: [
|
|
13921
14558
|
{ type: "paragraph", text: "Retrieve a chronological log of every credit transaction on your account. Transactions include **purchases** (positive amounts), **consumption deductions** (negative amounts), **bonuses**, and **manual adjustments**. Use this to audit spending and reconcile usage." },
|
|
14559
|
+
{ type: "paragraph", text: "Call this endpoint to build a transaction ledger view or to reconcile credit changes over a billing period. The response uses page-based pagination \u2014 pass `page` and `limit` query parameters to navigate through large transaction histories. The default page size is 20 with a maximum of 100." },
|
|
14560
|
+
{ type: "paragraph", text: "Each transaction includes an `amount` (negative for deductions, positive for purchases), a `type` field (`consumption`, `purchase`, `bonus`, or `adjustment`), and an `operation_type` that identifies the pipeline operation responsible. The `total` field in the response gives the full count for pagination math." },
|
|
14561
|
+
{ type: "paragraph", text: "Use this alongside the **Balance** endpoint to understand how your balance arrived at its current value. For aggregate cost analysis by operation type and model, the **Usage Summary** endpoint provides a more efficient grouped view without per-transaction detail." },
|
|
13922
14562
|
{ type: "callout", variant: "info", text: "Transactions are ordered by most recent first. Each entry includes the `operation_type` that triggered it (e.g. `extraction`, `manual`), making it easy to trace costs back to specific pipeline operations." },
|
|
13923
14563
|
{
|
|
13924
14564
|
type: "endpoint",
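Reviewer note on the credit-transactions hunk above: the added text gives a default page size of 20, a maximum of 100, and a `total` field for pagination math. That arithmetic can be sketched as follows. The function name is made up, and the sketch does not assume whether `page` starts at 0 or 1.

```javascript
// Hypothetical helper: clamps the requested limit to the documented bounds
// (default 20, maximum 100) and works out how many pages `total` spans.
function planPages(total, limit = 20) {
  const clamped = Math.min(Math.max(Math.trunc(limit), 1), 100);
  return { limit: clamped, pages: Math.ceil(total / clamped) };
}
```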
|
|
@@ -14001,6 +14641,9 @@ var sections33 = [
|
|
|
14001
14641
|
description: "Get aggregate credit usage summary broken down by operation type and model for a configurable time period.",
|
|
14002
14642
|
content: [
|
|
14003
14643
|
{ type: "paragraph", text: "Get a high-level view of your API usage grouped by **operation type** and **model**. This endpoint aggregates call counts, token consumption, and estimated costs over a configurable lookback period. Use it to understand which operations drive your spending." },
|
|
14644
|
+
{ type: "paragraph", text: "Call this endpoint to build cost dashboards or to identify which pipeline operations consume the most credits. The default lookback is 30 days \u2014 pass the `days` query parameter to adjust. Each row in the `stats` array represents a unique combination of `operation_type` and `model`." },
|
|
14645
|
+
{ type: "paragraph", text: "The response includes `call_count`, `total_input_tokens`, `total_output_tokens`, `total_cache_read_tokens`, and `total_cost_usd` per grouping. Note that token-based operations (e.g. `extraction` via Claude) report full token breakdowns, while page-based operations (e.g. `document_ai_ocr`) report zero tokens since cost is calculated from pages processed." },
|
|
14646
|
+
{ type: "paragraph", text: "Pair with **Daily Usage** for time-series analysis of the same period, or with **Usage Log** to drill into individual requests behind a high-cost grouping. The `period_days` field in the response confirms the actual lookback window applied." },
|
|
14004
14647
|
{ type: "callout", variant: "info", text: "Cost estimates include all token classes: input tokens, output tokens, cache creation tokens, and cache read tokens. Each is priced at the model-specific rate." },
|
|
14005
14648
|
{
|
|
14006
14649
|
type: "endpoint",
|
|
@@ -14089,6 +14732,10 @@ var sections33 = [
|
|
|
14089
14732
|
description: "Get per-day credit usage breakdown for the specified period (default last 30 days) with call counts and token totals per day.",
|
|
14090
14733
|
content: [
|
|
14091
14734
|
{ type: "paragraph", text: "Get a per-day breakdown of API usage over a configurable period. Each entry includes the total number of API calls, input/output token counts, and estimated cost for that calendar date. Use this for usage trend analysis and daily cost monitoring." },
|
|
14735
|
+
{ type: "paragraph", text: "Call this endpoint to populate daily usage charts or to set up alerting on cost spikes. The default lookback is 30 days \u2014 use the `days` query parameter to widen or narrow the window. Days with zero API calls are omitted from the response array." },
|
|
14736
|
+
{ type: "paragraph", text: "Each entry contains a `date` (YYYY-MM-DD in UTC), `calls` (total API calls), `input_tokens`, `output_tokens`, and `cost_usd`. All timestamps are UTC \u2014 a call made at 23:59 UTC on a given date appears under that UTC date, not the caller's local date." },
|
|
14737
|
+
{ type: "callout", variant: "info", text: "Daily usage is ordered by date ascending, making it ready for time-series charting without client-side sorting. Pair with the **Usage Summary** endpoint for operation-level breakdowns within the same period." },
|
|
14738
|
+
{ type: "paragraph", text: "Combine this endpoint with **Balance** to correlate daily burn against remaining runway. If you notice a cost spike on a specific date, drill into the **Usage Log** to identify the individual requests responsible." },
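Because days with zero API calls are omitted from the response, a chart fed directly from the array will have gaps on its x-axis. The sketch below shows one way to fill them client-side. The helper name `fillMissingDays` and the sample values are hypothetical; only the entry fields (`date`, `calls`, `input_tokens`, `output_tokens`, `cost_usd`) and the UTC `YYYY-MM-DD` date format come from the text above.

```javascript
// Hypothetical client-side helper, not part of the Talonic API.
// Inserts zero-usage entries for UTC dates the response omitted.
function fillMissingDays(entries, startDate, endDate) {
  const byDate = new Map(entries.map((entry) => [entry.date, entry]));
  const filled = [];
  const cursor = new Date(`${startDate}T00:00:00Z`);
  const end = new Date(`${endDate}T00:00:00Z`);
  while (cursor <= end) {
    const date = cursor.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
    filled.push(
      byDate.get(date) ??
        { date, calls: 0, input_tokens: 0, output_tokens: 0, cost_usd: 0 }
    );
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return filled;
}

// Sample data (made up): 2024-01-02 had no calls, so the API omitted it.
const daily = [
  { date: "2024-01-01", calls: 3, input_tokens: 10, output_tokens: 5, cost_usd: 0.01 },
  { date: "2024-01-03", calls: 1, input_tokens: 2, output_tokens: 1, cost_usd: 0.001 },
];
const series = fillMissingDays(daily, "2024-01-01", "2024-01-03");
```

The input is already sorted ascending by date, and the output keeps that order, so `series` can be passed to a charting library without further sorting.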
|
|
14092
14739
|
{
|
|
14093
14740
|
type: "endpoint",
|
|
14094
14741
|
method: "GET",
|
|
@@ -14358,6 +15005,9 @@ var sections34 = [
|
|
|
14358
15005
|
description: "List all tools available to the embedded agent including their impact level (read/write) and descriptions for discovering agent capabilities.",
|
|
14359
15006
|
content: [
|
|
14360
15007
|
{ type: "paragraph", text: "Discover all tools available to the embedded AI agent. Each tool declares its **impact level** \u2014 whether it performs a read-only operation or a mutation \u2014 so you can build permission-aware integrations. Use this endpoint to dynamically generate tool descriptions for external AI agents or to audit available capabilities." },
|
|
15008
|
+
{ type: "paragraph", text: "Call this endpoint at startup to populate your integration's tool registry, or periodically to detect newly added capabilities. The response includes every tool the agent can invoke, with a stable `name` identifier, a human-readable `description`, and the `impact` classification." },
|
|
15009
|
+
{ type: "paragraph", text: "The `totalCount` field gives the total number of tools available. Each tool's `impact` field follows a four-level severity scale: `read`, `draft_mutation`, `live_mutation`, and `irreversible`. Use these levels to build confirmation gates \u2014 for example, auto-approve `read` tools but require user confirmation for `live_mutation` and above." },
|
|
15010
|
+
{ type: "paragraph", text: "Pair this with the **Workspace Context** endpoint to give your external AI agent both situational awareness (context) and available actions (tools). The tool names returned here are stable identifiers that can be referenced in custom orchestration logic or permission policies." },
|
|
14361
15011
|
{ type: "callout", variant: "info", text: "Impact levels follow a severity scale: `read` (no side effects), `draft_mutation` (creates drafts only), `live_mutation` (modifies live data), and `irreversible` (permanent changes like deletion). Use these to implement confirmation gates in your integration." },
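The confirmation-gate idea described above can be sketched as a small ordering check. This is an illustration only: `needsConfirmation` and its `confirmFrom` parameter are hypothetical names, and treating an unknown impact value as requiring confirmation is an assumption, not documented behavior.

```javascript
// Severity order as documented: read < draft_mutation < live_mutation < irreversible.
const IMPACT_LEVELS = ["read", "draft_mutation", "live_mutation", "irreversible"];

// Hypothetical gate: returns true when a tool call should wait for user confirmation.
// By default, anything at `live_mutation` or above is gated.
function needsConfirmation(impact, confirmFrom = "live_mutation") {
  const level = IMPACT_LEVELS.indexOf(impact);
  const threshold = IMPACT_LEVELS.indexOf(confirmFrom);
  if (level === -1 || threshold === -1) {
    return true; // assumption: fail closed on values this client does not recognize
  }
  return level >= threshold;
}
```

Keeping the levels in one ordered array means a stricter policy (for example gating from `draft_mutation`) is a one-argument change rather than a rewrite of the check.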
|
|
14362
15012
|
{
|
|
14363
15013
|
type: "endpoint",
|
|
@@ -14536,6 +15186,9 @@ var sections35 = [
|
|
|
14536
15186
|
description: "Create a matching configuration with field mappings, comparison strategies (exact, fuzzy, date_range, numeric_range), and per-field weights that sum to 1.0.",
|
|
14537
15187
|
content: [
|
|
14538
15188
|
{ type: "paragraph", text: "Create a matching configuration that defines how documents are compared against a reference dataset. Each field mapping specifies a source field (from extracted documents), a target column (in the reference data), a comparison strategy, and a relative weight." },
|
|
15189
|
+
{ type: "paragraph", text: "The typical workflow is: upload reference data via `POST /v1/matching/reference-data`, create a config with field mappings, then trigger a run via `POST /v1/matching/configs/:id/run`. For complex datasets, use `POST /v1/matching/strategies/generate` first to get AI-recommended mappings and weights." },
|
|
15190
|
+
{ type: "paragraph", text: "The response returns the config with the saved `field_mappings`, `threshold` (defaults to 0.85), and `links.runs` URL for triggering runs. The `reference_data_id` is fixed at creation \u2014 to match against a different dataset, create a new config." },
|
|
15191
|
+
{ type: "paragraph", text: "Choose strategies carefully: use `exact` for standardized codes and IDs, `fuzzy` for names with potential typos, `date_range` for dates with tolerance, and `numeric_range` for amounts with rounding differences. Weights must sum to 1.0 \u2014 fields with higher weights have more influence on the overall confidence score." },
|
|
14539
15192
|
{ type: "callout", variant: "info", text: "Field weights should sum to 1.0. The overall confidence score for a match is the weighted sum of per-field scores. Use the **generate strategy** endpoint to get AI-recommended mappings if you are unsure which fields and weights to use." },
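To make the weighted-sum rule concrete, here is a minimal sketch of how an overall confidence score could be derived from per-field scores. It is not Talonic's implementation: the function name, the field names, the sample scores, and the choice to score a missing field as 0 are all assumptions made for illustration. Only the rules that weights sum to 1.0 and that the default `threshold` is 0.85 come from the text above.

```javascript
// Hypothetical helper illustrating the documented weighted-sum rule.
function overallConfidence(fieldScores, weights) {
  const total = Object.values(weights).reduce((sum, w) => sum + w, 0);
  if (Math.abs(total - 1.0) > 1e-9) {
    throw new Error(`weights must sum to 1.0, got ${total}`);
  }
  let score = 0;
  for (const [field, weight] of Object.entries(weights)) {
    // Assumption: a field with no score contributes 0.
    score += weight * (fieldScores[field] ?? 0);
  }
  return score;
}

// Made-up field names and per-field scores in the range 0..1.
const weights = { vendor_name: 0.5, invoice_date: 0.25, amount: 0.25 };
const scores = { vendor_name: 0.8, invoice_date: 1.0, amount: 1.0 };
const confidence = overallConfidence(scores, weights); // 0.5*0.8 + 0.25*1.0 + 0.25*1.0
const meetsDefaultThreshold = confidence >= 0.85; // assumption: inclusive comparison
```

In this example a fuzzy name match of 0.8 on the heaviest field still leaves the row above the default threshold because the two lighter fields match fully.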
|
|
14540
15193
|
{
|
|
14541
15194
|
type: "list",
|
|
@@ -14657,6 +15310,10 @@ var sections35 = [
|
|
|
14657
15310
|
description: "Get matching configuration details, update field mappings and weights, or delete a configuration. Deleting a config does not remove past run results.",
|
|
14658
15311
|
content: [
|
|
14659
15312
|
{ type: "paragraph", text: "Retrieve, update, or delete a matching configuration. Updates to field mappings and thresholds take effect on the next run \u2014 they do not retroactively change past results. Deleting a config removes the configuration but preserves all historical run results for audit purposes." },
|
|
15313
|
+
{ type: "paragraph", text: "Use `GET` to inspect the current field mappings, threshold, and targeting mode before running a match. Use `PUT` to adjust weights, swap strategies, or change the threshold \u2014 a common pattern is to lower the threshold after reviewing low-confidence results, then re-run to capture more matches." },
|
|
15314
|
+
{ type: "paragraph", text: "The `PUT` response returns the full updated config. The `reference_data_id` cannot be changed after creation \u2014 to match against a different dataset, create a new config. The `links.runs` URL provides a convenient shortcut to trigger a new run with the updated config." },
|
|
15315
|
+
{ type: "paragraph", text: "Deleting a config is safe for audit \u2014 all historical run results, including per-document evidence and confidence scores, are preserved. Pair config updates with the generate strategy endpoint to get AI-recommended adjustments based on your reference dataset." },
|
|
15316
|
+
{ type: "callout", variant: "info", text: "Past run results are immutable. Updating field mappings or thresholds only affects future runs \u2014 re-run matching after config changes to see the updated results." },
|
|
14660
15317
|
{
|
|
14661
15318
|
type: "endpoint",
|
|
14662
15319
|
method: "GET",
|
|
@@ -14935,6 +15592,9 @@ var sections35 = [
|
|
|
14935
15592
|
description: "Get the status, progress, and summary of a matching run. Status progresses from queued to running, optionally through ai_resolving, to completed or failed.",
|
|
14936
15593
|
content: [
|
|
14937
15594
|
{ type: "paragraph", text: "Retrieve the current state of a matching run. Poll this endpoint while `status` is `queued`, `running`, or `ai_resolving` to track progress. Once `completed`, the response includes the top 50 results by confidence. Use the results endpoint for full paginated access." },
|
|
15595
|
+
{ type: "paragraph", text: "Poll this endpoint after triggering a run via `POST /v1/matching/configs/:id/run`. A typical polling pattern is to check every 5-10 seconds while `status` is `queued`, `running`, or `ai_resolving`. Use `GET /v1/matching/runs/:id/progress` for lighter-weight progress updates during long runs." },
|
|
15596
|
+
{ type: "paragraph", text: "Once completed, the response includes `rows_processed`, `rows_matched`, and `avg_confidence` at the run level, plus a `results` array with the top 50 matches by confidence. Each result includes `document_id`, `matched_reference_row_id`, `confidence` score, review `status` (`pending`, `approved`, `rejected`), and per-field `evidence` breakdown." },
|
|
15597
|
+
{ type: "paragraph", text: "For the full result set beyond the top 50, use `GET /v1/matching/runs/:id/results` with pagination. Use `POST /v1/matching/runs/:runId/results/:resultId/review` to approve or reject individual matches. If `status` is `ai_resolving`, the run is using Claude Haiku to disambiguate borderline matches \u2014 this phase adds latency but can significantly improve accuracy on ambiguous rows." },
|
|
14938
15598
|
{ type: "callout", variant: "info", text: "The `ai_resolving` status indicates that the run has finished standard matching and is now running an AI resolution pass on low-confidence rows. This pass uses Claude Haiku to disambiguate borderline matches." },
|
|
14939
15599
|
{
|
|
14940
15600
|
type: "endpoint",
|
|
@@ -15037,6 +15697,9 @@ var sections35 = [
|
|
|
15037
15697
|
description: "Retrieve matching results for a completed run. Returns the top 5 candidates per document with weighted confidence scores and per-field evidence breakdowns.",
|
|
15038
15698
|
content: [
|
|
15039
15699
|
{ type: "paragraph", text: "Retrieve the full paginated results for a completed matching run. Each result represents a document matched (or unmatched) against the reference dataset, with a weighted confidence score and per-field evidence breakdown showing how each field contributed to the overall score." },
|
|
15700
|
+
{ type: "paragraph", text: "Use this endpoint after a run completes to review all matches. Filter by `status=pending` to see matches awaiting review, or `status=approved` to see confirmed matches. Paginate with `page` and `limit` \u2014 the run detail endpoint only shows the top 50 results, while this endpoint provides full access." },
|
|
15701
|
+
{ type: "paragraph", text: "Each result includes a per-field `evidence` object showing the strategy used and individual score for each field mapping. A `null` `matched_reference_row_id` means no reference row scored above the configured threshold for that document. The `confidence` score is the weighted sum of per-field scores using the weights from the matching config." },
|
|
15702
|
+
{ type: "paragraph", text: "Use `POST /v1/matching/runs/:runId/results/:resultId/review` to approve or reject individual matches programmatically. Pair with the config detail endpoint to understand which field mappings and thresholds produced these results. Re-run matching with adjusted weights or a lower threshold to capture more matches." },
|
|
15040
15703
|
{ type: "callout", variant: "info", text: "Results with `status: pending` have not been reviewed. Use `POST /v1/matching/runs/:runId/results/:resultId/review` to approve or reject individual matches. Approved matches can be used downstream for data enrichment and reconciliation workflows." },
|
|
15041
15704
|
{
|
|
15042
15705
|
type: "endpoint",
|
|
@@ -15331,6 +15994,9 @@ var sections36 = [
|
|
|
15331
15994
|
description: "Create a delivery destination with connector type, transport config, and authentication. Supported types: webhook, sftp, s3, azure_blob, google_drive, onedrive.",
|
|
15332
15995
|
content: [
|
|
15333
15996
|
{ type: "paragraph", text: "Create a new delivery destination by specifying the connector type, transport configuration, and optional authentication. The `config` and `auth_config` schemas vary by destination type \u2014 see the catalog endpoint for connector capabilities." },
|
|
15997
|
+
{ type: "paragraph", text: "The typical workflow is: create a destination first, then create one or more bindings that route signals to it. Call `GET /v1/delivery/catalog/connectors` to see which connector types are available and what `config` and `auth_config` schemas each expects." },
|
|
15998
|
+
{ type: "paragraph", text: "The response returns the created destination with `is_active: true` and `last_delivery_at: null`. Auth credentials are never echoed back \u2014 use the `has_auth_config` and `has_signing_secret` booleans to confirm they were stored. After creation, use `POST /v1/delivery/destinations/:id/test` to verify connectivity before setting up bindings." },
|
|
15999
|
+
{ type: "paragraph", text: "For webhook destinations, include a `signing_secret` in `auth_config` to enable HMAC-SHA256 request signing. For file-drop destinations (S3, SFTP, Azure Blob), set `payload_cap_bytes` if you need to override the global 5 MiB cap. OAuth destinations (Google Drive, OneDrive) require completing the OAuth flow first." },
|
|
15334
16000
|
{ type: "callout", variant: "info", text: "OAuth-based destinations (google_drive, onedrive) require completing an OAuth flow before creating the destination. Use the OAuth start endpoint to initiate the flow and obtain tokens." },
|
|
15335
16001
|
{
|
|
15336
16002
|
type: "endpoint",
|
|
@@ -15433,6 +16099,9 @@ var sections36 = [
|
|
|
15433
16099
|
description: "Get destination details, update config, delete a destination, or send a test payload to verify connectivity. Auth credentials are always redacted in responses.",
|
|
15434
16100
|
content: [
|
|
15435
16101
|
{ type: "paragraph", text: "Manage a single destination: retrieve its current config, update transport settings or credentials, delete it, or test connectivity. The **test** endpoint probes the destination without delivering real data \u2014 file-drop connectors (S3, SFTP, Azure Blob) verify bucket/container reachability without writing any objects." },
|
|
16102
|
+
{ type: "paragraph", text: "Use `GET` to inspect current config and delivery status. Use `PUT` to rotate credentials or change the target URL/bucket. Use `POST /test` after updating credentials to verify the new config works before live traffic flows through it. Use `DELETE` only when permanently removing a destination." },
|
|
16103
|
+
{ type: "paragraph", text: "The `GET` response includes `last_delivery_at` and `last_delivery_status` to show the most recent delivery attempt. The `is_active` flag indicates whether the destination is enabled \u2014 destinations are automatically disabled on `auth_failed` or `ssrf_blocked` errors. The test endpoint returns `success`, `durationMs`, and an optional `message` describing what was probed." },
|
|
16104
|
+
{ type: "paragraph", text: "If a destination becomes inactive due to auth failure, fix the credentials via `PUT`, then call the test endpoint to verify. The destination will be re-enabled automatically on a successful update. Prefer disabling (`is_active: false` via `PUT`) over deleting when you want to pause delivery but keep the history." },
|
|
15436
16105
|
{ type: "callout", variant: "warning", text: "Deleting a destination cascades to all its bindings, delivery items, and DLQ entries. This is irreversible. Disable the destination (`is_active: false`) instead if you want to preserve history." },
|
|
15437
16106
|
{
|
|
15438
16107
|
type: "endpoint",
|
|
@@ -15721,6 +16390,9 @@ var sections36 = [
|
|
|
15721
16390
|
description: "Create a delivery binding that routes domain signals through a deliverable resolver and serializer to a destination. Includes field mapping and retry policy configuration.",
|
|
15722
16391
|
content: [
|
|
15723
16392
|
{ type: "paragraph", text: "Create a binding that wires a domain event to a destination. The **compatibility triangle** is validated on creation: the signal event type must be compatible with the deliverable resolver, the serializer must support the deliverable shape, and the connector must support the serializer format." },
|
|
16393
|
+
{ type: "paragraph", text: "The typical workflow is: query the catalog endpoints top-down (signals, then deliverables, then serializers, then connectors), pick compatible values, and create the binding. A single event can fan out to multiple bindings \u2014 create separate bindings for each destination or output format you need." },
|
|
16394
|
+
{ type: "paragraph", text: "The response returns the binding with `is_active: true` and `last_status: null`. The `field_map` controls payload projection: use `static` to inject fixed values, `drop` to remove fields, and key-value pairs to rename fields. The `delivery_policy` defaults to 7 attempts with exponential backoff over ~10 hours if omitted." },
|
|
16395
|
+
{ type: "paragraph", text: "After creation, the binding is immediately live \u2014 the next matching signal will trigger delivery. Use `POST /v1/delivery/bindings/:id/preview` (internal) to dry-run the resolve-project-serialize pipeline. Monitor delivery health via the history and DLQ endpoints." },
|
|
15724
16396
|
{ type: "callout", variant: "info", text: "Use the catalog endpoints (`/v1/delivery/catalog/*`) to discover valid combinations before creating a binding. The catalog lists all available signals, deliverables, serializers, and connectors with their compatibility constraints." },
|
|
15725
16397
|
{
|
|
15726
16398
|
type: "endpoint",
|
|
@@ -15823,6 +16495,10 @@ var sections36 = [
|
|
|
15823
16495
|
description: "Get binding details, update signal filters or field maps, delete a binding, or preview the resolved payload for a binding without sending it.",
|
|
15824
16496
|
content: [
|
|
15825
16497
|
{ type: "paragraph", text: "Manage a single delivery binding: retrieve its configuration, update the signal filter or field map, delete it, or preview the payload it would produce. Updates re-validate the compatibility triangle. Deleting a binding stops future routing but allows in-flight deliveries to complete." },
|
|
16498
|
+
{ type: "paragraph", text: "Use `GET` to inspect the current binding config and `last_status`. Use `PUT` to adjust the signal filter, field map, or retry policy \u2014 changes take effect on the next matching event. Use `DELETE` when the binding is no longer needed; in-flight deliveries already in the job queue will still complete." },
|
|
16499
|
+
{ type: "paragraph", text: "The `PUT` response returns the full updated binding. The compatibility triangle is re-validated on every update \u2014 if you change the `signal_filter.event_type` or `serializer_format`, the system verifies the new combination is still valid. The preview endpoint (`POST /preview`) walks the resolve-project-serialize pipeline with a synthetic signal and returns the wire output without delivering." },
|
|
16500
|
+
{ type: "paragraph", text: "Pair updates with the delivery history endpoint to verify the binding is producing expected results. If `last_status` shows `failed`, check the DLQ for error details before adjusting the binding config." },
|
|
16501
|
+
{ type: "callout", variant: "info", text: "The public API preview endpoint currently returns a stub response. The internal preview endpoint is fully functional and walks the full resolve, project, and serialize pipeline with structural fallback." },
|
|
15826
16502
|
{
|
|
15827
16503
|
type: "endpoint",
|
|
15828
16504
|
method: "GET",
|
|
@@ -16031,6 +16707,9 @@ var sections36 = [
|
|
|
16031
16707
|
description: "View delivery attempt history with status, HTTP codes, and timing. Get detail for a single item or replay a failed delivery attempt.",
|
|
16032
16708
|
content: [
|
|
16033
16709
|
{ type: "paragraph", text: "The delivery history tracks every attempt to deliver a payload to a destination. Each attempt is recorded as a **delivery item** with status, timing, HTTP response code, and optional request/response bodies. Use this endpoint to audit delivery performance and debug failures." },
|
|
16710
|
+
{ type: "paragraph", text: "Query items by `binding_id` or `destination_id` to narrow results to a specific delivery path. Filter by `status` to find failures (`failed`) or in-progress attempts (`in_flight`). Use `GET /v1/delivery/items/:id` to inspect the full request and response bodies for a single attempt." },
|
|
16711
|
+
{ type: "paragraph", text: "Each item includes an `idempotency_key` (deterministic SHA-256 of binding ID and event ID) that is sent on the wire so receivers can deduplicate. The `attempt` field is 1-indexed \u2014 multiple items with the same `event_id` and `binding_id` represent retries of the same delivery. Status values are `in_flight`, `succeeded`, or `failed`." },
|
|
16712
|
+
{ type: "paragraph", text: "Use `POST /v1/delivery/items/:id/replay` to re-enqueue a specific attempt with a fresh attempt number but the same idempotency key. For terminal failures, check the DLQ endpoint instead \u2014 items that exhausted all retries are moved there automatically. Pair history inspection with binding and destination detail to diagnose delivery issues end-to-end." },
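Since retries and replays reuse the same idempotency key, a receiver can use it to discard duplicates. The sketch below shows the receiving side only, under stated assumptions: the class name is hypothetical, the in-memory `Set` stands in for whatever durable store a real receiver would use, and how the key arrives on the wire (header or body field) is not specified above, so the caller is expected to extract it.

```javascript
// Hypothetical receiver-side deduplication keyed on the delivery idempotency key.
class DeliveryDeduplicator {
  constructor() {
    // In-memory only; a production receiver would persist seen keys.
    this.seen = new Set();
  }

  // Returns true the first time a key is seen, false for retries and replays.
  accept(idempotencyKey) {
    if (this.seen.has(idempotencyKey)) {
      return false;
    }
    this.seen.add(idempotencyKey);
    return true;
  }
}

const dedup = new DeliveryDeduplicator();
const firstAttempt = dedup.accept("example-key-1");  // new delivery: process it
const replayAttempt = dedup.accept("example-key-1"); // same key again: skip it
```

A receiver would typically still acknowledge a duplicate with a success response, so the sender stops retrying, while skipping the side effects.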
|
|
16034
16713
|
{ type: "callout", variant: "info", text: "Request and response bodies are truncated to 10 KB and retained for a configurable period (default 30 days). After the retention period, bodies are nulled but metadata (status, HTTP code, duration, error code) is preserved indefinitely." },
|
|
16035
16714
|
{
|
|
16036
16715
|
type: "endpoint",
|
|
@@ -16689,6 +17368,9 @@ var sections37 = [
|
|
|
16689
17368
|
description: "Get detailed information for a single extraction batch including item counts, provider, status, and timing. Shows per-item breakdown when the batch is completed.",
|
|
16690
17369
|
content: [
|
|
16691
17370
|
{ type: "paragraph", text: "Retrieve the full batch record including per-item status. Poll this endpoint while `status` is `submitted` to track progress. Once `completed`, each item shows its individual outcome and processing timestamp." },
|
|
17371
|
+
{ type: "paragraph", text: "Use this endpoint to monitor a batch after submission. Poll periodically while `status` is `submitted`; results typically arrive within 24 hours. Once `status` changes to `completed`, `failed`, or `cancelled`, polling can stop. Use the sync endpoint to force an immediate provider check instead of waiting for the hourly poll." },
|
|
17372
|
+
{ type: "paragraph", text: "The response includes `items` \u2014 an array of per-document results. Each item has a `status` (`pending`, `processing`, `completed`, or `failed`), the associated `document_id` and `document_filename`, and a `processed_at` timestamp. The `custom_id` field shows the provider-assigned identifier used when submitting to Anthropic or Bedrock." },
|
|
17373
|
+
{ type: "paragraph", text: "Failed items are automatically retried via **realtime** extraction, never re-batched, to preserve the 48-hour SLA. Check the `errored_count` and `expired_count` fields at the batch level, and individual `items[].error_message` for per-document failure details. Pair with `GET /v1/documents/:id` to check the final extraction status of any document in the batch." },
|
|
16692
17374
|
{ type: "callout", variant: "info", text: "Items that fail extraction in the batch are retried via **realtime** extraction (never re-batched) to preserve the original 48-hour SLA. Check `items[].status` for per-document outcomes." },
|
|
16693
17375
|
{
|
|
16694
17376
|
type: "endpoint",
|
|
@@ -16787,6 +17469,9 @@ var sections37 = [
|
|
|
16787
17469
|
description: "Force a sync with the provider to check for batch results. Useful when you do not want to wait for the hourly automatic poll.",
|
|
16788
17470
|
content: [
|
|
16789
17471
|
{ type: "paragraph", text: "Force an immediate check with the batch provider (Anthropic or Bedrock) for results. By default, batches are polled automatically every hour. Use this endpoint when you need results sooner or want to verify the current provider-side status." },
|
|
17472
|
+
{ type: "paragraph", text: "Call sync when you need results before the next hourly poll. A typical pattern is to submit documents in batch mode, wait a few hours, then call sync to check if results are ready. If the batch is still processing, the response reflects the current provider-side status without changing anything." },
|
|
17473
|
+
{ type: "paragraph", text: "The response returns the full batch object with updated counts. If results are ready, `status` transitions to `completed` and `succeeded_count`, `errored_count`, and `expired_count` are populated. If the batch is still processing on the provider side, `status` remains `submitted` and counts stay at zero." },
|
|
17474
|
+
{ type: "paragraph", text: "Syncing an `accumulating` batch has no effect since it has not been submitted to the provider yet. Syncing a `completed` or `cancelled` batch is safe but returns the same data. Pair with `GET /v1/batches/:id` to inspect per-item results after the sync completes." },
|
|
16790
17475
|
{
|
|
16791
17476
|
type: "endpoint",
|
|
16792
17477
|
method: "POST",
|
|
@@ -16864,6 +17549,9 @@ var sections37 = [
|
|
|
16864
17549
|
description: "Cancel an in-progress extraction batch. Only batches in accumulating or submitted status can be cancelled. Completed batches cannot be rolled back.",
|
|
16865
17550
|
content: [
|
|
16866
17551
|
{ type: "paragraph", text: "Cancel a batch that is still `accumulating` or `submitted`. Cancellation sends a stop request to the provider if the batch was already submitted. Documents in the cancelled batch revert to `batch_queued` status and can be resubmitted or processed via realtime extraction." },
|
|
17552
|
+
{ type: "paragraph", text: "Use cancellation when you need to abort a batch \u2014 for example, if documents were submitted with an incorrect schema or you need results faster via realtime extraction. Cancel as early as possible; items already processed by the provider before the cancellation lands may still have their results applied." },
|
|
17553
|
+
{ type: "paragraph", text: "The response returns the batch with `status: cancelled`. The `succeeded_count` may be non-zero if some items were processed before cancellation took effect. Documents revert to `batch_queued` status and can be re-processed by updating their `processing_mode` to `realtime` or by including them in a new batch." },
|
|
17554
|
+
{ type: "paragraph", text: "Only batches in `accumulating` or `submitted` status can be cancelled \u2014 calling cancel on a `completed`, `failed`, or already `cancelled` batch returns `400`. Pair with `GET /v1/batches/:id` after cancellation to inspect which items were processed before the stop request landed." },
|
|
16867
17555
|
{
|
|
16868
17556
|
type: "endpoint",
|
|
16869
17557
|
method: "POST",
|
|
@@ -17032,6 +17720,9 @@ var sections38 = [
|
|
|
17032
17720
|
description: "Retrieve a case by its key (e.g. CASE-001) including linked documents, shared entities, AI-generated narration, label, and anomaly count.",
|
|
17033
17721
|
content: [
|
|
17034
17722
|
{ type: "paragraph", text: "Retrieve the full detail of a case including its documents, AI-generated narrative summary, and anomaly count. The narrative is generated by Claude and summarizes the relationships between documents in the case." },
|
|
17723
|
+
{ type: "paragraph", text: "Call this endpoint after listing cases to drill into a specific case. The typical workflow is to list cases with filters, then fetch detail for cases that need review. The response includes the full document list and anomaly count, so you can assess case health in a single call." },
|
|
17724
|
+
{ type: "paragraph", text: "The response includes `documents` (array of document objects with `id`, `filename`, `document_type`, and `created_at`), a `narrative` string (or `null` if narration has not been triggered), and `anomaly_count`. The `links` object provides convenience URLs for the case itself and its documents list." },
|
|
17725
|
+
{ type: "paragraph", text: "Pair with `POST /v1/cases/:key/narrate` to generate narratives, and `GET /v1/cases/:key/evidence` to inspect the field-level linking data. If `anomaly_count` is non-zero, fetch the anomalies endpoint to see which structural issues were detected." },
|
|
17035
17726
|
{ type: "callout", variant: "info", text: "The `narrative` field is generated on demand via `POST /v1/cases/:key/narrate`. It will be `null` until narration is triggered for this case." },
|
|
17036
17727
|
{
|
|
17037
17728
|
type: "endpoint",
|
|
@@ -17237,6 +17928,9 @@ var sections38 = [
|
|
|
17237
17928
|
description: "List evidence items within a case. Filter by validation status, source document, category, or free-text search across evidence fields.",
|
|
17238
17929
|
content: [
|
|
17239
17930
|
{ type: "paragraph", text: "Evidence items are the extracted field values from documents in a case, annotated with validation status and confidence scores. Use evidence to audit the data quality within a case and understand which fields link documents together." },
|
|
17931
|
+
{ type: "paragraph", text: "Use this endpoint after fetching case detail to inspect the field-level data that forms the case. A typical workflow is to filter by `status=invalid` to surface extraction issues, or by `document_id` to audit a specific document's contribution to the case." },
|
|
17932
|
+
{ type: "paragraph", text: "Each evidence item includes a `field_key`, extracted `value`, validation `status` (`valid`, `invalid`, or `pending`), the source `document_id`, an optional `category` (e.g. `identity`, `financial`), and a `confidence` score between 0 and 1. The confidence score reflects extraction certainty and is independent of the validation outcome." },
|
|
17933
|
+
{ type: "paragraph", text: "Combine evidence with the anomalies endpoint to get a complete quality picture. Evidence shows individual field values; anomalies show structural patterns across multiple evidence items (e.g. conflicting values for the same field). Use the `search` parameter for free-text queries across all evidence fields." },
|
|
17240
17934
|
{ type: "callout", variant: "info", text: "Evidence is produced by the evidence validation engine, which runs rule-based validators (structural checks, checksum validation, domain packs) against extracted values. Each evidence item records the validation outcome for a specific field on a specific document." },
|
|
17241
17935
|
{
|
|
17242
17936
|
type: "endpoint",
|
|
@@ -17500,6 +18194,9 @@ var sections38 = [
|
|
|
17500
18194
|
description: "Pin or remove documents within a case. Pinned documents are highlighted in the case view and preserved during case operations.",
|
|
17501
18195
|
content: [
|
|
17502
18196
|
{ type: "paragraph", text: "Manage document membership within a case. **Pin** a document to mark it as important \u2014 pinned documents are highlighted in the UI and preserved during split operations. **Remove** a document to detach it from the case entirely." },
|
|
18197
|
+
{ type: "paragraph", text: "Use pinning to flag key documents during case review \u2014 for example, pin the primary invoice in a multi-document case so it stays visible. Use removal when a document was incorrectly linked and should not belong to this case. Both operations are immediate and do not require a recompute." },
|
|
18198
|
+
{ type: "paragraph", text: 'Both pin and remove return `{ "success": true }` on success, and both return `404` if the case or document is not found. The pin status is reflected in the case detail response from `GET /v1/cases/:key`.' },
|
|
18199
|
+
{ type: "paragraph", text: "Pinned documents are preserved in the original partition during split operations \u2014 they always stay with the case they are pinned to. If you plan to split a case, pin the anchor documents first. Removed documents may reappear in the case after a recompute if linking edges still connect them." },
|
|
17503
18200
|
{ type: "callout", variant: "info", text: "Removing a document from a case does not delete the document itself. The document remains in your workspace and may be re-linked into a case during the next recompute cycle if linking edges still exist." },
|
|
17504
18201
|
{
|
|
17505
18202
|
type: "endpoint",
|
|
@@ -18162,6 +18859,9 @@ var sections40 = [
|
|
|
18162
18859
|
description: "List all ground truth datasets used for benchmarking extraction accuracy. Each dataset contains manually verified entries that serve as the gold standard.",
|
|
18163
18860
|
content: [
|
|
18164
18861
|
{ type: "paragraph", text: "Ground truth datasets contain manually verified data entries that serve as the gold standard for measuring extraction accuracy. Create datasets, add entries, then run benchmarks against extraction results." },
|
|
18862
|
+
{ type: "paragraph", text: "Use this endpoint to see all available datasets before creating a benchmark run. A typical workflow is to list datasets, select the one covering the document type you want to evaluate, then pass its `id` to `POST /v1/quality/benchmarks` to start a run." },
|
|
18863
|
+
{ type: "paragraph", text: "Each dataset includes a `name`, optional `description`, `user_schema_id` (if scoped to a schema), `document_count` (number of verified entries), and a `links.self` URL for the detail endpoint. Datasets are returned in descending creation order with cursor-based pagination." },
|
|
18864
|
+
{ type: "paragraph", text: "Create separate datasets for different document types or schema versions to track accuracy independently. Pair with the benchmark endpoints to measure extraction quality over time \u2014 run benchmarks after schema changes or pipeline updates to detect regressions." },
|
|
18165
18865
|
{ type: "list", ordered: false, items: [
|
|
18166
18866
|
"Each dataset contains verified entries mapping documents to expected field values",
|
|
18167
18867
|
"Datasets can be scoped to a specific user schema via `user_schema_id`",
|
|
@@ -18256,6 +18956,10 @@ var sections40 = [
|
|
|
18256
18956
|
description: "Create a new ground truth dataset linked to a schema. The dataset defines the expected extraction output used for accuracy benchmarking.",
|
|
18257
18957
|
content: [
|
|
18258
18958
|
{ type: "paragraph", text: "Create an empty ground truth dataset that you can populate with verified entries. Datasets serve as the baseline for benchmark runs that measure extraction accuracy. After creating a dataset, add entries individually or import them in bulk via CSV." },
|
|
18959
|
+
{ type: "paragraph", text: "The typical workflow is: create the dataset, then populate it using `POST /v1/quality/ground-truth/:id/entries` for individual entries or `POST /v1/quality/ground-truth/:id/entries/import-csv` for bulk import. Once populated, create a benchmark run with `POST /v1/quality/benchmarks`." },
|
|
18960
|
+
{ type: "paragraph", text: "The response returns the dataset with `document_count: 0` since it is initially empty. The `user_schema_id` is `null` unless you associate it with a schema. The `links.self` URL points to the detail endpoint where you can retrieve entries or delete the dataset." },
|
|
18961
|
+
{ type: "paragraph", text: "For best results, aim for at least 30-50 entries per dataset. Linking a dataset to a `user_schema_id` ensures ground truth field names align with your extraction schema, producing more meaningful benchmark comparisons." },
|
|
18962
|
+
{ type: "callout", variant: "info", text: "Field keys in `expected_data` entries should match the field names used in your extraction schema. Unmatched fields are stored but ignored during benchmark comparison." },
|
|
18259
18963
|
{
|
|
18260
18964
|
type: "endpoint",
|
|
18261
18965
|
method: "POST",
|
|
@@ -18330,6 +19034,9 @@ var sections40 = [
|
|
|
18330
19034
|
description: "Retrieve a ground truth dataset by ID with metadata and entry count, or delete it permanently. Deleting a dataset does not remove associated benchmark results.",
|
|
18331
19035
|
content: [
|
|
18332
19036
|
{ type: "paragraph", text: "Retrieve a dataset with its metadata and sample entries, or delete it permanently. The GET response includes a `samples` array with the actual ground truth entries, allowing you to inspect the expected values for each document." },
|
|
19037
|
+
{ type: "paragraph", text: "Use `GET` to inspect the dataset contents before running a benchmark. The `samples` array contains all ground truth entries with their `document_id`, `expected_data` (key-value map of verified field values), and optional `notes`. This lets you verify the dataset is correctly populated." },
|
|
19038
|
+
{ type: "paragraph", text: "The `document_count` field shows how many entries exist. For large datasets, the `samples` array may produce a sizable response. The `user_schema_id` indicates whether the dataset is scoped to a specific extraction schema, which improves benchmark accuracy by ensuring field name alignment." },
|
|
19039
|
+
{ type: "paragraph", text: "Use `DELETE` when a dataset is outdated or no longer needed. Benchmark results that referenced this dataset are preserved for historical tracking \u2014 the benchmark retains the `dataset_id` even after the dataset itself is removed. Create a new dataset with updated entries rather than modifying existing ones." },
|
|
18333
19040
|
{ type: "callout", variant: "warning", text: "Deleting a dataset is permanent. However, benchmark results that used this dataset are retained for historical reference. The benchmark will show the dataset_id but the dataset itself will no longer be retrievable." },
|
|
18334
19041
|
{
|
|
18335
19042
|
type: "endpoint",
|
|
@@ -18595,6 +19302,9 @@ var sections40 = [
|
|
|
18595
19302
|
description: "List benchmark runs that compare extraction results against ground truth datasets. Each run produces per-field accuracy metrics.",
|
|
18596
19303
|
content: [
|
|
18597
19304
|
{ type: "paragraph", text: "Benchmark runs compare your extraction output against ground truth datasets to produce per-field accuracy scores. Each run evaluates every document in the dataset and produces an `accuracy_overall` score along with per-field breakdowns. Use benchmarks to track extraction quality over time and measure the impact of schema or pipeline changes." },
|
|
19305
|
+
{ type: "paragraph", text: "Use this endpoint to see all benchmark runs and their accuracy scores. A typical workflow is to list benchmarks after making schema or pipeline changes, then compare the latest run against previous ones using `GET /v1/quality/benchmarks/compare` to measure improvement or detect regressions." },
|
|
19306
|
+
{ type: "paragraph", text: "Each benchmark includes `status` (`queued`, `running`, `completed`, or `failed`), `accuracy_overall` (0-1 score, null while running), `accuracy_by_field` (per-field breakdown), and `documents_processed`/`documents_total` for progress tracking. The `accuracy_delta` and `compared_to_run_id` fields support cross-run comparisons." },
|
|
19307
|
+
{ type: "paragraph", text: "Run benchmarks regularly after extraction pipeline changes. Pair with `GET /v1/quality/benchmarks/:id/results` for per-document drill-down showing which fields matched and which diverged. Use the compare endpoint to track accuracy trends across multiple runs." },
|
|
18598
19308
|
{
|
|
18599
19309
|
type: "endpoint",
|
|
18600
19310
|
method: "GET",
|
|
@@ -18704,6 +19414,9 @@ var sections40 = [
|
|
|
18704
19414
|
description: "Start a benchmark run that compares a job run output against a ground truth dataset. Produces per-field accuracy scores and overall metrics.",
|
|
18705
19415
|
content: [
|
|
18706
19416
|
{ type: "paragraph", text: "Start a new benchmark run that evaluates your current extraction output against a ground truth dataset. The benchmark compares each document in the dataset entry-by-entry and field-by-field, producing an overall accuracy score and per-field breakdowns." },
|
|
19417
|
+
{ type: "paragraph", text: "The typical workflow is: create a benchmark after making extraction pipeline changes, poll `GET /v1/quality/benchmarks/:id` until `status` is `completed`, then inspect results. Run multiple benchmarks against the same dataset over time to track accuracy trends." },
|
|
19418
|
+
{ type: "paragraph", text: "The response returns the benchmark with `status: queued`, `accuracy_overall: null`, and `documents_processed: 0`. The `documents_total` field reflects how many entries are in the dataset. Poll the detail endpoint to check `status` and `documents_processed` for progress. Once completed, `accuracy_overall` and `accuracy_by_field` are populated." },
|
|
19419
|
+
{ type: "paragraph", text: "Multiple benchmarks can run in parallel against different datasets. Use `GET /v1/quality/benchmarks/compare` after completion to compare two runs side by side. The `dataset_id` is fixed at creation \u2014 to benchmark against a different dataset, create a new run." },
|
|
18707
19420
|
{ type: "callout", variant: "info", text: "Benchmark runs are asynchronous. The endpoint returns immediately with status `queued`. Poll the benchmark detail endpoint or list benchmarks to check when the run completes." },
|
|
18708
19421
|
{
|
|
18709
19422
|
type: "endpoint",
|
|
@@ -19033,6 +19746,9 @@ var sections41 = [
|
|
|
19033
19746
|
description: "Create a new routing rule with conditions on document properties and actions to apply when matched. Conditions can match document type, source, and other metadata.",
|
|
19034
19747
|
content: [
|
|
19035
19748
|
{ type: "paragraph", text: 'Create a rule that automatically applies actions to incoming documents based on their metadata. Conditions define what to match (e.g. document type equals "invoice"), and actions define what to do (e.g. assign the finance schema). Rules are evaluated on every `document_classified` event.' },
|
|
19749
|
+
{ type: "paragraph", text: 'The typical workflow is: create rules ordered by specificity \u2014 put narrow, high-priority rules first (e.g. "contracts from vendor X") and broader catch-all rules last. New rules are active immediately upon creation, so the next classified document will be evaluated against them.' },
|
|
19750
|
+
{ type: "paragraph", text: "The response returns the rule with `is_active: true`, a `trigger_type` of `document_classified`, and the assigned `priority` (defaults to 100 if omitted). The `action_type` is resolved from the `actions` object. Use the reorder endpoint after creation to adjust the priority relative to existing rules." },
|
|
19751
|
+
{ type: "paragraph", text: "Pair with `GET /v1/routing-rules` to verify the full priority chain after creating a rule. Use `source_connection_id` to scope rules to documents from a specific source \u2014 documents from other sources will skip the rule entirely. To test a rule before going live, create it and immediately disable it via `PATCH` with `is_active: false`." },
|
|
19036
19752
|
{ type: "callout", variant: "info", text: "New rules are created with `is_active: true` by default. If you want to test a rule before activating it, create it, then immediately disable it via `PATCH /v1/routing-rules/:id` with `is_active: false`." },
|
|
19037
19753
|
{
|
|
19038
19754
|
type: "endpoint",
|
|
@@ -19135,6 +19851,10 @@ var sections41 = [
|
|
|
19135
19851
|
description: "Retrieve, update, or delete a routing rule by ID. Update conditions, actions, priority, or enabled state. Deleting a rule does not affect previously routed documents.",
|
|
19136
19852
|
content: [
|
|
19137
19853
|
{ type: "paragraph", text: "Retrieve, update, or delete a single routing rule. Updates take effect immediately \u2014 the next `document_classified` event will use the updated rule. Deleting a rule does not retroactively affect documents that were already routed by it." },
|
|
19854
|
+
{ type: "paragraph", text: "Use `GET` to inspect a rule's conditions, actions, and priority. Use `PATCH` to adjust conditions, change the schema assignment, toggle `is_active`, or update the priority. Use `DELETE` when a rule is no longer needed \u2014 previously routed documents are not affected." },
|
|
19855
|
+
{ type: "paragraph", text: "The `PATCH` response returns the full updated rule including the new `updated_at` timestamp. All fields are optional \u2014 only include fields you want to change. The `is_active` toggle lets you temporarily disable a rule without deleting it, which is useful for testing or during maintenance windows." },
|
|
19856
|
+
{ type: "paragraph", text: "After updating priority via `PATCH`, use `GET /v1/routing-rules` to verify the full evaluation order. For bulk priority changes, prefer the `POST /v1/routing-rules/reorder` endpoint instead of patching individual rules. Pair deletion with rule creation to replace a rule atomically." },
|
|
19857
|
+
{ type: "callout", variant: "info", text: "Rule changes only affect future `document_classified` events. Documents already routed by a previous version of the rule retain their assigned schema and routing actions." },
|
|
19138
19858
|
{
|
|
19139
19859
|
type: "endpoint",
|
|
19140
19860
|
method: "GET",
|
|
@@ -19316,6 +20036,9 @@ var sections41 = [
|
|
|
19316
20036
|
description: "Reorder routing rules by providing an ordered array of rule IDs. Priority values are reassigned sequentially based on the new order.",
|
|
19317
20037
|
content: [
|
|
19318
20038
|
{ type: "paragraph", text: "Reassign priority values for all routing rules at once. Pass an ordered array of rule IDs \u2014 the first ID receives priority 1, the second receives priority 2, and so on. This is the recommended way to change evaluation order after initial creation." },
|
|
20039
|
+
{ type: "paragraph", text: "Use this endpoint when you need to rearrange the evaluation order of multiple rules at once \u2014 for example, when promoting a new rule to the top of the chain or inserting a rule between two existing ones. This is more reliable than patching individual rule priorities, which can create gaps or collisions." },
|
|
20040
|
+
{ type: "paragraph", text: "The response returns a `reordered` array with each rule's `id` and new `priority` value. Priority 1 is evaluated first. The reorder takes effect immediately \u2014 the next `document_classified` event uses the new priority sequence." },
|
|
20041
|
+
{ type: "paragraph", text: "List all rules first via `GET /v1/routing-rules` to get the current IDs and order, then construct the reordered array. Include both active and inactive rules in the array to maintain a consistent priority sequence. Omitting any rule ID results in a validation error." },
|
|
19319
20042
|
{ type: "callout", variant: "warning", text: "All active rule IDs must be included in the `rule_ids` array. Omitting any rule returns a validation error. Inactive rules should also be included to maintain a consistent priority sequence." },
|
|
19320
20043
|
{
|
|
19321
20044
|
type: "endpoint",
|
|
@@ -19821,6 +20544,10 @@ var sections44 = [
|
|
|
19821
20544
|
description: "All Talonic API errors return a consistent JSON envelope with a machine-readable code, human-readable message, HTTP status, retryable flag, request ID, and timestamp.",
|
|
19822
20545
|
content: [
|
|
19823
20546
|
{ type: "paragraph", text: "All errors return a consistent JSON envelope. The `retryable` field tells you whether the request can be retried with the same parameters." },
|
|
20547
|
+
{ type: "paragraph", text: "Most integrations parse the `code` field for programmatic error handling and display the `message` field to users. A typical error handler checks `retryable` first \u2014 if `true`, queue the request for retry with exponential backoff; if `false`, surface the `message` to the caller and stop." },
|
|
20548
|
+
{ type: "paragraph", text: "The `request_id` field (prefixed with `req_`) uniquely identifies the failed request and is essential for debugging with Talonic support. The `path` field confirms which endpoint produced the error, and `timestamp` records when it occurred in ISO 8601 format." },
|
|
20549
|
+
{ type: "paragraph", text: "Pair error handling with the [Error Codes](error-codes) reference to map each `code` value to the correct remediation action. Note that `statusCode` always matches the HTTP response status, so you can use either for branching logic in your client." },
|
|
20550
|
+
{ type: "callout", text: "Always log the `request_id` from error responses. When contacting support, include it for faster resolution \u2014 it links directly to the server-side request trace." },
|
|
19824
20551
|
{
|
|
19825
20552
|
type: "code",
|
|
19826
20553
|
title: "Error response envelope",
|