@talonic/docs 0.20.15 → 0.20.16
- package/dist/content.js +2994 -20
- package/package.json +1 -1
package/dist/content.js CHANGED
@@ -457,6 +457,38 @@ var sections = [
  type: "callout",
  variant: "info",
  text: "Talonic uses Anthropic Claude for intelligent extraction and reasoning. The platform handles OCR, classification, field discovery, and schema generation automatically \u2014 you provide documents and define what output you need. Document AI (Mistral) handles OCR by default, with automatic fallback to the Talonic API for unsupported formats."
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "quick-api-example",
+ text: "Quick API Example"
+ },
+ {
+ type: "paragraph",
+ text: "Talonic exposes a full REST API so you can integrate document extraction into any workflow programmatically. The simplest way to get started is to upload a document via the `/v1/sources/:sourceId/documents` endpoint. The platform automatically runs OCR, classification, and extraction \u2014 you poll for results or receive a webhook notification when processing completes."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Upload a document via API",
+ code: 'curl -X POST https://api.talonic.com/v1/sources/src_abc123/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY" \\\n -F "file=@invoice.pdf" \\\n -F "processing_mode=sync"'
+ },
+ {
+ type: "code",
+ language: "json",
+ title: "Response",
+ code: '{\n "id": "doc_7f3a1b2c",\n "status": "completed",\n "document_type": "Invoice",\n "fields_extracted": 24,\n "confidence_avg": 0.94,\n "created_at": "2026-05-07T10:30:00Z"\n}'
+ },
+ {
+ type: "paragraph",
+ text: "Once a document is processed, you can retrieve its extracted fields using the extractions endpoint. Each field includes the extracted value, a confidence score between 0 and 1, and the source text from the original document that the value was derived from. This makes every extraction result fully auditable and traceable back to its origin in the source document."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Retrieve extraction results",
+ code: 'curl https://api.talonic.com/v1/extractions/doc_7f3a1b2c \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
  }
  ],
  related: [
@@ -480,6 +512,10 @@ var sections = [
  {
  question: "How much do instant graph matches cost?",
  answer: "Graph matches (approximately 30% of cells) are free. They are filled from the knowledge graph through deterministic lookup, so no LLM call is needed. Only cells that require AI extraction incur cost. As your knowledge graph grows from processing more documents, the percentage of free graph matches increases, further reducing per-job costs."
+ },
+ {
+ question: "Can I use Talonic via API without the web interface?",
+ answer: "Yes. Talonic exposes a full REST API at api.talonic.com/v1 with 20+ namespaces covering extraction, documents, schemas, jobs, delivery, and more. You can upload documents, retrieve extraction results, create schemas, run jobs, and configure delivery bindings entirely through API calls. Authenticate with a tlnc_ prefixed API key via the Authorization: Bearer header."
  }
  ],
  mentions: [
@@ -560,10 +596,46 @@ var sections = [
  type: "paragraph",
  text: "The **Schema** layer sits on top of the registry and defines what output you need. You can use auto-generated schemas that the platform creates for each document type, or build custom template schemas by selecting specific fields from the registry. When a schema is applied to documents in a **Job**, the 4-phase pipeline fills every cell \u2014 starting with free graph lookups and falling back to AI agents for the remainder. The result is a structured grid where each row is a document and each column is a field."
  },
+ {
+ type: "paragraph",
+ text: "**Cases** add another dimension to document intelligence. When two or more documents share entities \u2014 like a common vendor name, reference number, or project code \u2014 the platform automatically connects them into a case. Cases provide a cross-document view that reveals relationships invisible at the individual document level. For example, a purchase order and its corresponding invoice might share a PO number, linking them into a case that lets you verify consistency across the two documents. The linking engine builds a bipartite graph of document-entity relationships and uses connected components to identify cases automatically."
+ },
  {
  type: "callout",
  variant: "info",
  text: "The **Field Registry** is the heart of the platform. As you process more documents, the registry grows \u2014 fields are clustered semantically, promoted through tiers, and enriched with master extraction instructions. This accumulated knowledge makes every subsequent extraction faster and more accurate."
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "exploring-concepts-via-api",
+ text: "Exploring Concepts via API"
+ },
+ {
+ type: "paragraph",
+ text: "Every core concept in the platform has a corresponding API namespace. You can list documents, browse the field registry, inspect schemas, retrieve job results, and explore cases programmatically. For example, to see all fields discovered across your documents, query the field registry endpoint. The response includes each field's canonical name, tier, occurrence count, and data type \u2014 giving you a complete picture of your knowledge graph."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "List fields in the registry",
+ code: 'curl https://api.talonic.com/v1/schemas \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
+ },
+ {
+ type: "code",
+ language: "json",
+ title: "Response",
+ code: '{\n "data": [\n {\n "id": "sch_inv_001",\n "name": "Invoice",\n "fields_count": 18,\n "document_type": "Invoice",\n "version": 3\n }\n ],\n "meta": { "total": 12, "cursor": "eyJpZCI6MTJ9" }\n}'
+ },
+ {
+ type: "paragraph",
+ text: "The relationship between concepts is visible through the API as well. When you retrieve a job result, each cell includes provenance metadata that tells you which pipeline phase filled it, the confidence score, and the reasoning trace. This end-to-end traceability from source document through the field registry to structured output is what makes Talonic auditable at every step of the pipeline."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Retrieve job results with provenance",
+ code: 'curl https://api.talonic.com/v1/jobs/job_abc123 \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
  }
  ],
  related: [
@@ -587,6 +659,10 @@ var sections = [
  {
  question: "What is the difference between a Generated Schema and a Template Schema?",
  answer: "Generated Schemas are created automatically by the platform based on the document types it discovers. Template Schemas are user-defined for specific output needs \u2014 you choose which fields to include and how they map to the Field Registry."
+ },
+ {
+ question: "Can I explore all core concepts through the API?",
+ answer: "Yes. Every core concept has a corresponding API namespace. Documents are at /v1/documents, schemas at /v1/schemas, jobs at /v1/jobs, cases at /v1/cases, and the field registry at /v1/schemas/fields. You can list, inspect, and manage each concept programmatically with the same data available in the web interface."
  }
  ],
  mentions: [
@@ -643,10 +719,46 @@ var sections = [
  type: "paragraph",
  text: "After results are delivered, the feedback loop closes automatically. Corrections you make during the **Review** stage feed back into the Field Registry, improving future extractions. The platform tracks telemetry across runs \u2014 strategy distribution, capture hit rate, and resolve rate \u2014 so you can monitor how extraction quality improves over time as the knowledge graph accumulates more data."
  },
+ {
+ type: "paragraph",
+ text: "Each stage of the pipeline is observable through both the web interface and the API. The Dashboard provides a visual overview of pipeline progress across all five stages, while the telemetry API exposes the same metrics programmatically. For production deployments, teams often integrate these telemetry endpoints with external monitoring tools like Grafana or Datadog to maintain continuous visibility into pipeline health, extraction quality trends, and processing throughput."
+ },
  {
  type: "callout",
  variant: "info",
  text: "The **Dashboard** provides a real-time view of your pipeline progress with telemetry on strategy distribution, tier funnel, capture hit rate, and per-field state distribution. Use it to understand how well the knowledge graph is performing."
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "automating-the-flow",
+ text: "Automating the Flow via API"
+ },
+ {
+ type: "paragraph",
+ text: "The entire platform flow can be automated through the REST API. Upload documents to a source, wait for extraction to complete (or receive a webhook notification), then create a job run against a schema. The API follows the same seven-stage pipeline as the web interface \u2014 the only difference is that each step is triggered programmatically instead of through the UI. This makes it straightforward to embed Talonic into existing data pipelines or orchestration tools."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Create a job run via API",
+ code: `curl -X POST https://api.talonic.com/v1/jobs \\
+ -H "Authorization: Bearer $TALONIC_API_KEY" \\
+ -H "Content-Type: application/json" \\
+ -d '{
+ "schema_id": "sch_inv_001",
+ "document_ids": ["doc_7f3a1b2c", "doc_9e4d5f6a"]
+ }'`
+ },
+ {
+ type: "code",
+ language: "json",
+ title: "Response",
+ code: '{\n "id": "job_x8k2m9",\n "status": "running",\n "schema_id": "sch_inv_001",\n "documents_count": 2,\n "phase": 1,\n "created_at": "2026-05-07T14:22:00Z"\n}'
+ },
+ {
+ type: "paragraph",
+ text: "For production deployments, combine source connectors with routing rules and delivery bindings to create a fully hands-off pipeline. Documents arrive from Google Drive or S3, routing rules assign the correct schema and trigger extraction, and delivery bindings push approved results to your webhook endpoint or cloud storage. The platform handles retries, deduplication, and error escalation automatically, so you only need to intervene when the dead-letter queue flags a terminal failure."
  }
  ],
  related: [
@@ -670,6 +782,14 @@ var sections = [
  {
  question: "What delivery destinations are supported?",
  answer: "Six live connectors: webhook (with HMAC-SHA256 signing), SFTP, Amazon S3, Azure Blob Storage, Google Drive, and OneDrive. Additional integrations for Sheets, SharePoint, Gmail, Outlook, and HubSpot are planned."
+ },
+ {
+ question: "Can I automate the entire platform flow end-to-end?",
+ answer: "Yes. Combine source connectors (to ingest documents automatically), routing rules (to assign schemas and trigger jobs), and delivery bindings (to push results downstream). This creates a fully automated pipeline where documents flow from ingestion through extraction and delivery with zero manual intervention. You can monitor the pipeline health from the Dashboard or via the telemetry API."
+ },
+ {
+ question: "How do I create a job run via the API?",
+ answer: "POST to /v1/jobs with a schema_id and an array of document_ids. The response includes a job ID that you can poll for status and results. Jobs run asynchronously through the 4-phase pipeline, and you can retrieve partial results while later phases are still running. Configure a webhook with the run.dataspace.completed signal to receive a notification when the full job finishes."
  }
  ],
  mentions: [
@@ -713,6 +833,10 @@ var sections = [
  type: "paragraph",
  text: "The platform includes powerful keyboard shortcuts for fast navigation. Press `Cmd+K` (or `Ctrl+K` on Windows) to open **Omnisearch**, which lets you find documents, schemas, jobs, and fields from anywhere. Press `Cmd+I` to open the **AI Agent** for natural language queries about your workspace. The sidebar can be collapsed to give more screen real estate when reviewing extraction results."
  },
+ {
+ type: "paragraph",
+ text: "If you plan to integrate Talonic into an existing data pipeline, start by creating an API key from **Settings → API Keys**. The REST API mirrors every action available in the web interface, so you can upload documents, retrieve extraction results, create schemas, and configure delivery bindings programmatically. Many teams begin with the UI for initial exploration and then transition to API-driven workflows as their integration matures."
+ },
  {
  type: "callout",
  text: "The fastest path to results: upload documents in **Sources**, then go to **Structuring → Runs → New** to create your first extraction job."
@@ -727,6 +851,38 @@ var sections = [
  "Create a new **Run** by selecting a schema and the documents to process.",
  "Review results in the run view \u2014 each cell shows confidence, provenance, and reasoning."
  ]
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "getting-started-api",
+ text: "Getting Started with the API"
+ },
+ {
+ type: "paragraph",
+ text: "If you prefer to integrate programmatically, the REST API lets you upload documents, retrieve results, and manage schemas without the web interface. Start by creating an API key from Settings, then use the extract endpoint to submit your first document. The response includes extraction status that you can poll until processing completes, or configure a webhook to receive a notification automatically when results are ready."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Submit a document for extraction",
+ code: 'curl -X POST https://api.talonic.com/v1/sources/src_abc123/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY" \\\n -F "file=@contract.pdf"'
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Check document processing status",
+ code: 'curl https://api.talonic.com/v1/documents/doc_7f3a1b2c \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
+ },
+ {
+ type: "code",
+ language: "json",
+ title: "Response",
+ code: '{\n "id": "doc_7f3a1b2c",\n "status": "completed",\n "document_type": "Employment Contract",\n "source_id": "src_abc123",\n "fields_extracted": 31,\n "created_at": "2026-05-07T10:30:00Z"\n}'
+ },
+ {
+ type: "paragraph",
+ text: "Once you are comfortable with single-document extraction, explore batch processing for high-volume workloads. Set `processing_mode=batch` when uploading to defer AI extraction and run at 50% cost with a 48-hour delivery window. This is ideal for historical backlogs where immediate results are not required. You can monitor batch progress from the Sources page or poll the batch status endpoint via the API."
  }
  ],
  related: [
@@ -750,6 +906,14 @@ var sections = [
  {
  question: "What source connections are available?",
  answer: "Ten source connectors: Google Drive, Gmail, SharePoint, OneDrive, Outlook, Teams, Notion, SQL databases (MSSQL/PostgreSQL), Amazon S3, and Azure Blob Storage. You can also upload files manually or ingest via the REST API."
+ },
+ {
+ question: "How do I upload my first document via the API?",
+ answer: "Create an API key from Settings, then POST a file to /v1/sources/:sourceId/documents with your key in the Authorization: Bearer header. The platform processes the document automatically through OCR, classification, and extraction. Poll the document status endpoint or configure a webhook to know when results are ready. Most documents complete processing within one to two minutes."
+ },
+ {
+ question: "What is the recommended approach for teams processing documents at scale?",
+ answer: "Start with a small representative sample of 5-10 documents of the same type. Let the platform extract and classify them, then review the auto-generated schema. This validates the output structure before committing to a large batch. Once the schema is confirmed, upload your full document set. As the knowledge graph matures, an increasing share of cells resolve via free graph lookups rather than AI extraction, reducing cost over time."
  }
  ],
  mentions: ["sidebar", "sources", "structuring", "outputs", "navigation", "Cmd+K", "source connectors"]
@@ -781,6 +945,10 @@ var sections2 = [
  type: "paragraph",
  text: "There are important limitations to be aware of. The agent cannot access external systems or the internet \u2014 it only works with data already in your Talonic workspace. It cannot bypass permission boundaries, so team members with read-only (Viewer) access cannot use the agent to make changes. Long-running operations like full batch extractions cannot be triggered through the agent; those must be initiated from the relevant UI page. The agent also cannot modify field registry entries directly \u2014 those changes flow through the resolution pipeline."
  },
+ {
+ type: "paragraph",
+ text: 'The agent is particularly effective for onboarding new team members. Instead of reading documentation about each platform feature, new users can ask the agent questions like "How many document types do we have?", "What schemas are available?", or "Show me our most common fields." The agent provides instant, contextual answers that help users build a mental model of their workspace. This reduces time-to-productivity for new team members from days to hours.'
+ },
  { type: "heading", level: 3, id: "agent-capabilities", text: "What the Agent Can Do" },
  {
  type: "paragraph",
@@ -831,6 +999,36 @@ var sections2 = [
  description: "Check delivery status and preview binding output."
  }
  ]
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "agent-example-prompts",
+ text: "Example Agent Interactions"
+ },
+ {
+ type: "paragraph",
+ text: "The agent excels at cross-cutting queries that would otherwise require navigating multiple pages. For example, you can ask it to summarize extraction quality across your latest job runs, identify which document types have the lowest confidence scores, or compare field coverage between two schemas. The agent queries the underlying data in real time and streams results, so complex analyses that would take several minutes of manual navigation are answered in seconds."
+ },
+ {
+ type: "paragraph",
+ text: 'Schema creation through the agent is particularly powerful. Describe the fields you need in plain language \u2014 for example, "Create a schema for purchase orders with vendor name, PO number, line items, unit price, and total amount" \u2014 and the agent maps each field to the registry, identifies the best match for each, and creates a draft schema ready for your review. This is faster than manually searching the registry and adding fields one by one through the schema editor.'
+ },
+ {
+ type: "paragraph",
+ text: 'Behind the scenes, the agent uses the same internal APIs as the web interface. When you ask "Show me all invoices processed this week", the agent queries the documents endpoint with date filters and the Invoice document type. When you ask "What is my capture rate?", it reads the telemetry data from the dashboard service. This means the agent always shows the same data you would see in the UI \u2014 there is no separate data layer or cache that could show stale results.'
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Query documents via API (equivalent to agent search)",
+ code: 'curl "https://api.talonic.com/v1/documents?type=Invoice&created_after=2026-05-01" \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
+ },
+ {
+ type: "code",
+ language: "json",
+ title: "Response",
+ code: '{\n "data": [\n {\n "id": "doc_7f3a1b2c",\n "filename": "invoice_2026_0472.pdf",\n "document_type": "Invoice",\n "status": "completed",\n "fields_extracted": 24\n }\n ],\n "meta": { "total": 156, "cursor": "eyJpZCI6MTU2fQ" }\n}'
  }
  ],
  related: [
@@ -857,6 +1055,10 @@ var sections2 = [
  {
  question: "What are good questions to ask the agent?",
  answer: 'Try questions like "Show me all invoices processed this week", "What fields does my Invoice schema have?", "Create a schema for purchase orders with vendor name, PO number, and total amount", or "Why was this document classified as a Service Agreement?" The agent handles both read-only queries and schema creation commands.'
+ },
+ {
+ question: "How does the agent help with onboarding new team members?",
+ answer: 'New team members can ask the agent questions about the workspace to quickly build a mental model of available data \u2014 "How many document types do we have?", "What schemas are available?", or "Show me our most common fields." The agent provides instant, contextual answers that reduce onboarding time from days to hours. It also helps new users discover platform features by suggesting relevant follow-up questions.'
  }
  ],
  mentions: [
@@ -936,9 +1138,39 @@ var sections2 = [
  type: "paragraph",
  text: "The impact level system also respects your team role. Team members with the Viewer role can only trigger `read` level actions through the agent \u2014 attempting commands that would modify data will be rejected with a clear permissions error. Members, Admins, and Owners can trigger higher impact levels according to their role permissions."
  },
+ {
+ type: "paragraph",
+ text: "When the agent classifies a message as a command rather than a question, it evaluates the impact level before executing. The classification happens transparently \u2014 you see which impact level was assigned and what action the agent intends to take before any confirmation is requested. This transparency lets you understand exactly what will happen and make an informed decision about whether to proceed. If the agent misclassifies a question as a command, you can simply decline the confirmation and rephrase your request."
+ },
  {
  type: "callout",
  text: "The agent always operates workshop-first: schema changes create drafts, not live versions. You review and publish when ready. This means you can experiment freely with schema designs through the agent without any risk to your production configuration."
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "impact-api-context",
+ text: "Impact Levels and API Operations"
+ },
+ {
+ type: "paragraph",
+ text: "The impact level system mirrors the permission model of the REST API. Read-level agent actions correspond to GET requests, draft mutations correspond to POST/PATCH operations on draft resources, and live mutations correspond to POST operations that affect published schemas or trigger job runs. When building automations that combine the agent with API calls, keep this mapping in mind \u2014 the same permission boundaries apply regardless of the interface you use."
+ },
+ {
+ type: "paragraph",
+ text: "For example, when the agent creates a schema draft, it internally calls the same endpoint as the API. You can subsequently retrieve or modify that draft through the API if you prefer. This interoperability means you can start exploring in the agent, then switch to the API for scripted workflows without losing any work. Draft schemas created by the agent appear in the same list as drafts created through the UI or API."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "List schema drafts (includes agent-created drafts)",
+ code: 'curl "https://api.talonic.com/v1/schemas?status=draft" \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
+ },
+ {
+ type: "code",
+ language: "json",
+ title: "Response",
+ code: '{\n "data": [\n {\n "id": "sch_draft_001",\n "name": "Purchase Order - Draft",\n "status": "draft",\n "fields_count": 8,\n "created_by": "agent",\n "created_at": "2026-05-07T09:15:00Z"\n }\n ],\n "meta": { "total": 3 }\n}'
  }
  ],
  related: [
@@ -957,6 +1189,14 @@ var sections2 = [
  {
  question: "What happens when I ask the agent to delete something?",
  answer: 'Deletion is classified as an irreversible action. The agent will ask you to type a confirmation keyword (e.g., "DELETE") before proceeding. This prevents accidental data loss from casual or ambiguous requests.'
+ },
+ {
+ question: "Can I retrieve agent-created drafts through the API?",
+ answer: "Yes. Schema drafts created by the agent are standard draft resources. You can list, inspect, modify, and publish them through the REST API at /v1/schemas just like drafts created through the web interface. The created_by field indicates whether a draft was created by the agent, the UI, or the API."
+ },
+ {
+ question: "What happens if the agent misclassifies my question as a command?",
+ answer: 'If the agent classifies a question as a command, you see the assigned impact level and intended action before any confirmation is requested. You can simply decline the confirmation and rephrase your request as a question. For example, instead of "Delete the test schema", try "What schemas do I have that contain the word test?" The agent adapts to your phrasing and routes accordingly.'
  }
  ],
  mentions: ["impact levels", "draft mutation", "live mutation", "workshop-first"]
@@ -998,6 +1238,42 @@ var sections2 = [
  type: "callout",
  variant: "info",
  text: 'Try asking the agent questions like "What is my capture rate?", "Which document types need schemas?", or "Show me recent extraction failures" directly from the dashboard. The suggested prompts adapt to your workspace state, but you can always type any question.'
+ },
+ {
+ type: "heading",
+ level: 3,
+ id: "dashboard-api-metrics",
+ text: "Dashboard Metrics via API"
+ },
+ {
+ type: "paragraph",
+ text: "The same metrics visible on the dashboard are available programmatically through the telemetry API. This lets you build custom dashboards, feed metrics into monitoring tools like Grafana or Datadog, or set up automated alerts when key metrics drop below thresholds. For example, you can track your resolve rate over time and alert your team when it drops, which might indicate that a new document type is introducing unfamiliar fields that need registry attention."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Retrieve telemetry metrics",
+ code: 'curl https://api.talonic.com/v1/telemetry \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
+ },
+ {
+ type: "code",
+ language: "json",
+ title: "Response",
+ code: '{\n "capture_rate": 0.87,\n "resolve_rate": 0.62,\n "synthesize_rate": 0.74,\n "documents_processed": 1247,\n "fields_in_registry": 342,\n "period": "last_30_days"\n}'
+ },
+ {
+ type: "paragraph",
+ text: "The dashboard also integrates with the credit and usage tracking system. You can see at a glance how many credits have been consumed, what percentage came from extraction versus OCR, and how batch processing has reduced your overall costs. This financial visibility is essential for teams managing extraction budgets across multiple departments or clients, as it helps you allocate costs accurately and identify opportunities for optimization."
+ },
+ {
+ type: "paragraph",
+ text: "The suggested prompts system is context-sensitive and adapts to your workspace lifecycle. When you first start using the platform with few documents, the prompts focus on getting started \u2014 uploading documents and creating your first schema. As your workspace matures with hundreds of documents and established schemas, the prompts shift toward quality analysis, extraction optimization, and delivery configuration. This progressive guidance helps teams at every stage of adoption without overwhelming new users with advanced features."
+ },
+ {
+ type: "code",
+ language: "bash",
+ title: "Get workspace overview (similar to dashboard view)",
+ code: 'curl https://api.talonic.com/v1/credits \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
  }
  ],
  related: [
@@ -1016,6 +1292,18 @@ var sections2 = [
|
|
|
1016
1292
|
{
|
|
1017
1293
|
question: "Can I revisit previous conversations with the agent?",
|
|
1018
1294
|
answer: "Yes. Every conversation is preserved in your session history, accessible from the dashboard. You can revisit previous questions, recall how you configured a schema, or pick up where you left off in a previous analysis."
|
|
1295
|
+
},
|
|
1296
|
+
{
|
|
1297
|
+
question: "Can I access dashboard metrics through the API?",
|
|
1298
|
+
answer: "Yes. The telemetry API exposes the same capture rate, resolve rate, and synthesize rate metrics visible on the dashboard. Use GET /v1/telemetry with your API key to retrieve current metrics programmatically. This is useful for building custom dashboards, feeding data into monitoring tools, or setting up automated alerts when metrics drop below acceptable thresholds."
|
|
1299
|
+
},
|
|
1300
|
+
{
|
|
1301
|
+
question: "How often are dashboard metrics refreshed?",
|
|
1302
|
+
answer: "Dashboard metrics are cached with a 30-second refresh interval. This means the data you see is at most 30 seconds old. The telemetry API follows the same caching behavior, so polling more frequently than every 30 seconds will return identical results."
|
|
1303
|
+
},
|
|
1304
|
+
{
|
|
1305
|
+
question: "Can I build a custom dashboard using the API?",
|
|
1306
|
+
answer: "Yes. The telemetry API at /v1/telemetry exposes capture rate, resolve rate, synthesize rate, and other metrics programmatically. The credits API at /v1/credits provides usage and cost data. Combine these endpoints with your preferred visualization tool \u2014 Grafana, Datadog, or a custom BI dashboard \u2014 to build monitoring views tailored to your team's specific needs and alert thresholds."
|
|
1019
1307
|
}
|
|
1020
1308
|
],
|
|
1021
1309
|
mentions: ["dashboard", "suggested prompts", "workspace state", "agent input"]
|
|
@@ -1078,6 +1366,42 @@ var sections3 = [
|
|
|
1078
1366
|
type: "callout",
|
|
1079
1367
|
variant: "info",
|
|
1080
1368
|
text: "Use the **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) to upload a single document from any page without navigating to Sources first. This opens a streamlined upload interface that processes the document immediately and shows results inline."
|
|
1369
|
+
},
|
|
1370
|
+
{
|
|
1371
|
+
type: "heading",
|
|
1372
|
+
level: 3,
|
|
1373
|
+
id: "upload-via-api",
|
|
1374
|
+
text: "Uploading via API"
|
|
1375
|
+
},
|
|
1376
|
+
{
|
|
1377
|
+
type: "paragraph",
|
|
1378
|
+
text: "For automated ingestion pipelines, you can upload documents programmatically through the REST API. The upload endpoint accepts multipart form data with the file and optional metadata. Each uploaded document flows through the same processing pipeline as manual uploads \u2014 OCR, classification, and extraction run automatically. You can optionally set `processing_mode=batch` to defer extraction at 50% cost for non-urgent documents."
|
|
1379
|
+
},
|
|
1380
|
+
{
|
|
1381
|
+
type: "code",
|
|
1382
|
+
language: "bash",
|
|
1383
|
+
title: "Upload a document via API",
|
|
1384
|
+
code: 'curl -X POST https://api.talonic.com/v1/sources/src_abc123/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY" \\\n -F "file=@invoice.pdf" \\\n -F "processing_mode=sync"'
|
|
1385
|
+
},
|
|
1386
|
+
{
|
|
1387
|
+
type: "code",
|
|
1388
|
+
language: "json",
|
|
1389
|
+
title: "Response",
|
|
1390
|
+
code: '{\n "id": "doc_7f3a1b2c",\n "filename": "invoice.pdf",\n "status": "processing",\n "source_id": "src_abc123",\n "created_at": "2026-05-07T10:30:00Z"\n}'
|
|
1391
|
+
},
|
|
1392
|
+
{
|
|
1393
|
+
type: "code",
|
|
1394
|
+
language: "bash",
|
|
1395
|
+
title: "List documents in a source",
|
|
1396
|
+
code: 'curl "https://api.talonic.com/v1/documents?source_id=src_abc123" \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
1397
|
+
},
|
|
1398
|
+
{
|
|
1399
|
+
type: "paragraph",
|
|
1400
|
+
text: "When integrating with external systems, a common pattern is to upload documents via the API and then configure a webhook to receive notifications when extraction completes. This avoids the need to poll for status updates and lets your downstream system react immediately when structured data is available. The `document.extracted` webhook signal fires as soon as all fields have been captured and the document status transitions to `completed`."
|
|
1401
|
+
},
|
|
1402
|
+
{
|
|
1403
|
+
type: "paragraph",
|
|
1404
|
+
text: "For batch ingestion scenarios, set `processing_mode=batch` when uploading to defer AI extraction and process at 50% cost. Batch mode runs OCR and classification immediately but queues the extraction step for deferred processing with a 48-hour delivery window. This is ideal for historical document backlogs where immediate results are not required. You can mix batch and realtime uploads within the same source \u2014 each document's processing mode is independent."
|
|
1081
1405
|
}
|
|
1082
1406
|
],
|
|
1083
1407
|
related: [
|
|
@@ -1101,6 +1425,10 @@ var sections3 = [
|
|
|
1101
1425
|
{
|
|
1102
1426
|
question: "Can I upload large batches of documents?",
|
|
1103
1427
|
answer: "Yes. Large uploads (100+ files) are automatically throttled to prevent pipeline overload. Each document processes independently through OCR, classification, and extraction. For very large batches, consider batch processing mode which defers extraction at 50% cost."
|
|
1428
|
+
},
|
|
1429
|
+
{
|
|
1430
|
+
question: "Can I upload documents programmatically via the API?",
|
|
1431
|
+
answer: "Yes. POST a multipart form request to /v1/sources/:sourceId/documents with the file and an optional processing_mode parameter. Set processing_mode=batch to defer extraction at 50% cost with a 48-hour delivery window, or omit it for immediate processing. The API follows the same pipeline as the UI: documents flow through OCR, classification, and extraction automatically after upload."
|
|
1104
1432
|
}
|
|
1105
1433
|
],
|
|
1106
1434
|
mentions: [
|
|
@@ -1169,6 +1497,42 @@ var sections3 = [
|
|
|
1169
1497
|
type: "callout",
|
|
1170
1498
|
variant: "info",
|
|
1171
1499
|
text: "The processing path is selected automatically based on the file extension \u2014 you do not need to configure anything. If a file type is not recognized, the platform will attempt OCR as a fallback before marking it as unsupported."
|
|
1500
|
+
},
|
|
1501
|
+
{
|
|
1502
|
+
type: "heading",
|
|
1503
|
+
level: 3,
|
|
1504
|
+
id: "format-detection-api",
|
|
1505
|
+
text: "Format Detection in the API"
|
|
1506
|
+
},
|
|
1507
|
+
{
|
|
1508
|
+
type: "paragraph",
|
|
1509
|
+
text: "When uploading files through the REST API, the platform detects the file format from the file extension and routes it to the appropriate processing path automatically. You do not need to specify the file type in the request \u2014 simply upload the file and the platform handles format detection, OCR engine selection, and extraction pipeline routing. The document detail response includes a `processing_pipeline` field that indicates which path was used, so you can verify how each document was processed."
|
|
1510
|
+
},
|
|
1511
|
+
{
|
|
1512
|
+
type: "code",
|
|
1513
|
+
language: "bash",
|
|
1514
|
+
title: "Upload a spreadsheet for processing",
|
|
1515
|
+
code: 'curl -X POST https://api.talonic.com/v1/sources/src_abc123/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY" \\\n -F "file=@financials.xlsx"'
|
|
1516
|
+
},
|
|
1517
|
+
{
|
|
1518
|
+
type: "paragraph",
|
|
1519
|
+
text: "For bulk ingestion scenarios, consider the performance characteristics of each processing path. Text fast-path files (CSV, JSON, TXT) process almost instantly since they require no external API calls. OCR files (PDF, DOCX, XLSX) take longer because they pass through Document AI, but large PDFs are automatically chunked and processed in parallel to maintain throughput. Image files go through AI Vision which is the most computationally intensive path but produces excellent results for receipts, handwritten notes, and diagrams."
|
|
1520
|
+
},
|
|
1521
|
+
{
|
|
1522
|
+
type: "paragraph",
|
|
1523
|
+
text: "When choosing between file formats for the same content, prefer native digital formats over scanned images whenever possible. A native PDF with embedded text processes faster and more accurately than a scanned PDF image, because the native version can use the text fast-path or lighter OCR processing. Similarly, providing data in CSV or JSON format when available eliminates OCR cost entirely and produces the highest confidence extraction results. For spreadsheets, both XLSX and XLS formats are fully supported, along with the macro-enabled XLSM variant."
|
|
1524
|
+
},
|
|
1525
|
+
{
|
|
1526
|
+
type: "code",
|
|
1527
|
+
language: "bash",
|
|
1528
|
+
title: "Check document processing path",
|
|
1529
|
+
code: 'curl https://api.talonic.com/v1/documents/doc_7f3a1b2c \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
1530
|
+
},
|
|
1531
|
+
{
|
|
1532
|
+
type: "code",
|
|
1533
|
+
language: "json",
|
|
1534
|
+
title: "Response",
|
|
1535
|
+
code: '{\n "id": "doc_7f3a1b2c",\n "filename": "financials.xlsx",\n "processing_pipeline": "document_ai",\n "status": "completed",\n "document_type": "Financial Statement",\n "fields_extracted": 42\n}'
|
|
1172
1536
|
}
|
|
1173
1537
|
],
|
|
1174
1538
|
related: [
|
|
@@ -1187,6 +1551,10 @@ var sections3 = [
|
|
|
1187
1551
|
{
|
|
1188
1552
|
question: "How does Talonic handle large PDF files?",
|
|
1189
1553
|
answer: "PDF files that exceed the configured chunk size (default 25 pages) are automatically split into page chunks, processed in parallel, and merged. This ensures even large documents are handled efficiently without timeouts."
|
|
1554
|
+
},
|
|
1555
|
+
{
|
|
1556
|
+
question: "Can I check which processing path was used for a document?",
|
|
1557
|
+
answer: "Yes. The document detail response includes a processing_pipeline field that indicates which path was used \u2014 document_ai for OCR files, vision for images, or direct for text fast-path files. You can retrieve this via the API at GET /v1/documents/:id or view it in the Processing Log tab of the document detail page in the web interface."
|
|
1190
1558
|
}
|
|
1191
1559
|
],
|
|
1192
1560
|
mentions: ["OCR", "AI vision", "text fast-path", "file formats", "PDF", "DOCX", "ZIP"]
|
|
@@ -1262,20 +1630,56 @@ var sections3 = [
|
|
|
1262
1630
|
{
|
|
1263
1631
|
type: "callout",
|
|
1264
1632
|
text: "Documents are marked **complete** after AI extraction finishes. You can start using them in jobs immediately \u2014 no need to wait for field resolution, which runs separately and enriches the registry in the background."
|
|
1265
|
-
}
|
|
1266
|
-
],
|
|
1267
|
-
related: [
|
|
1268
|
-
{ label: "Document Types", slug: "document-types" },
|
|
1269
|
-
{ label: "Document Detail", slug: "document-detail" },
|
|
1270
|
-
{ label: "Field Registry", slug: "field-registry" }
|
|
1271
|
-
],
|
|
1272
|
-
faq: [
|
|
1273
|
-
{
|
|
1274
|
-
question: "How does document processing work in Talonic?",
|
|
1275
|
-
answer: "Documents go through three stages: Document AI OCR (converts to Markdown with annotations), Classification (verifies against 529-type ontology), and AI Data Field Capture (extracts every data point)."
|
|
1276
1633
|
},
|
|
1277
1634
|
{
|
|
1278
|
-
|
|
1635
|
+
type: "heading",
|
|
1636
|
+
level: 3,
|
|
1637
|
+
id: "processing-api",
|
|
1638
|
+
text: "Monitoring Processing via API"
|
|
1639
|
+
},
|
|
1640
|
+
{
|
|
1641
|
+
type: "paragraph",
|
|
1642
|
+
text: "You can monitor document processing programmatically by polling the document detail endpoint. The response includes the current status (processing, completed, extraction_failed, or ocr_failed), the document type assigned during classification, and the number of fields extracted. For high-volume pipelines, configure a webhook with the `document.extracted` signal to receive push notifications instead of polling, which reduces API calls and provides faster response times."
|
|
1643
|
+
},
|
|
1644
|
+
{
|
|
1645
|
+
type: "code",
|
|
1646
|
+
language: "bash",
|
|
1647
|
+
title: "Check document processing status",
|
|
1648
|
+
code: 'curl https://api.talonic.com/v1/documents/doc_7f3a1b2c \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
1649
|
+
},
|
|
1650
|
+
{
|
|
1651
|
+
type: "code",
|
|
1652
|
+
language: "json",
|
|
1653
|
+
title: "Response",
|
|
1654
|
+
code: '{\n "id": "doc_7f3a1b2c",\n "status": "completed",\n "document_type": "Invoice",\n "fields_extracted": 24,\n "processing_pipeline": "document_ai",\n "extraction_attempts": 1,\n "created_at": "2026-05-07T10:30:00Z"\n}'
|
|
1655
|
+
},
|
|
1656
|
+
{
|
|
1657
|
+
type: "paragraph",
|
|
1658
|
+
text: "For documents that fail processing, the API response includes diagnostic information about what went wrong. The `extraction_attempts` field shows how many retries were attempted, and the `status` field distinguishes between OCR failures (`ocr_failed`) and extraction failures (`extraction_failed`). This information helps you decide whether to resubmit the document, try a different file format, or investigate the source document for quality issues like low-resolution scans or corrupted files."
|
|
1659
|
+
},
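Reviewer sketch (not part of the package): one way to act on the status values described above is a small decision helper. The four status strings and the `extraction_attempts` field come from the text. The action names, the `suggestNextStep` function, and the `maxAttempts` cutoff are hypothetical.

```javascript
// Maps a document detail response to a suggested next step.
// Action names and the retry cutoff are illustrative assumptions.
function suggestNextStep(doc, maxAttempts = 3) {
  switch (doc.status) {
    case "completed":
      return "use_results";
    case "processing":
      return "keep_waiting";
    case "ocr_failed":
      // The text suggests checking the file itself (scan quality, corruption).
      return "check_source_file";
    case "extraction_failed":
      // Resubmit while attempts remain, otherwise escalate to a person.
      return (doc.extraction_attempts ?? 0) < maxAttempts ? "resubmit" : "investigate";
    default:
      return "unknown_status";
  }
}

console.log(suggestNextStep({ id: "doc_7f3a1b2c", status: "extraction_failed", extraction_attempts: 1 }));
```

Keeping the decision in a pure function makes it reusable from both a polling loop and a webhook handler.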
|
|
1660
|
+
{
|
|
1661
|
+
type: "paragraph",
|
|
1662
|
+
text: "The processing pipeline also supports `include_markdown=true` on extraction requests, which returns the OCR-generated markdown alongside the extraction results. This is useful for debugging and quality assurance workflows where you need to verify what the AI model saw before extracting fields. You can also retrieve the markdown separately at any time using the dedicated markdown endpoint, which returns the full structured markdown representation of the document including preserved tables, headings, and layout information."
|
|
1663
|
+
},
|
|
1664
|
+
{
|
|
1665
|
+
type: "code",
|
|
1666
|
+
language: "bash",
|
|
1667
|
+
title: "Retrieve the OCR markdown for a document",
|
|
1668
|
+
code: 'curl https://api.talonic.com/v1/documents/doc_7f3a1b2c/markdown \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
1669
|
+
}
|
|
1670
|
+
],
|
|
1671
|
+
related: [
|
|
1672
|
+
{ label: "Document Types", slug: "document-types" },
|
|
1673
|
+
{ label: "Document Detail", slug: "document-detail" },
|
|
1674
|
+
{ label: "Field Registry", slug: "field-registry" }
|
|
1675
|
+
],
|
|
1676
|
+
faq: [
|
|
1677
|
+
{
|
|
1678
|
+
question: "How does document processing work in Talonic?",
|
|
1679
|
+
answer: "Documents go through three stages: Document AI OCR (converts to Markdown with annotations), Classification (verifies against 529-type ontology), and AI Data Field Capture (extracts every data point)."
|
|
1680
|
+
},
|
|
1681
|
+
{
|
|
1682
|
+
question: "When is a document ready to use in jobs?",
|
|
1279
1683
|
answer: "Documents are marked complete after AI extraction finishes. You can start using them in jobs immediately without waiting for further processing."
|
|
1280
1684
|
},
|
|
1281
1685
|
{
|
|
@@ -1337,6 +1741,36 @@ var sections3 = [
|
|
|
1337
1741
|
type: "callout",
|
|
1338
1742
|
variant: "info",
|
|
1339
1743
|
text: "You never need to create document types manually. The ontology is built into the platform and types are assigned automatically during classification. If you disagree with a classification, the AI agent can help you understand why a type was chosen and how the content signals were interpreted."
|
|
1744
|
+
},
|
|
1745
|
+
{
|
|
1746
|
+
type: "heading",
|
|
1747
|
+
level: 3,
|
|
1748
|
+
id: "document-types-api",
|
|
1749
|
+
text: "Working with Document Types via API"
|
|
1750
|
+
},
|
|
1751
|
+
{
|
|
1752
|
+
type: "paragraph",
|
|
1753
|
+
text: "The REST API lets you filter documents by type, which is useful for building type-specific processing pipelines. For example, you might retrieve all invoices for a financial reconciliation workflow, or all employment contracts for an HR onboarding integration. The document type is assigned automatically during classification and included in every document response, so you can route processing logic based on the type without any manual tagging."
|
|
1754
|
+
},
|
|
1755
|
+
{
|
|
1756
|
+
type: "code",
|
|
1757
|
+
language: "bash",
|
|
1758
|
+
title: "List documents by type",
|
|
1759
|
+
code: 'curl "https://api.talonic.com/v1/documents?type=Invoice" \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
1760
|
+
},
|
|
1761
|
+
{
|
|
1762
|
+
type: "code",
|
|
1763
|
+
language: "json",
|
|
1764
|
+
title: "Response",
|
|
1765
|
+
code: '{\n "data": [\n {\n "id": "doc_7f3a1b2c",\n "filename": "invoice_0472.pdf",\n "document_type": "Invoice",\n "status": "completed",\n "fields_extracted": 24\n },\n {\n "id": "doc_9e4d5f6a",\n "filename": "inv_march_2026.pdf",\n "document_type": "Invoice",\n "status": "completed",\n "fields_extracted": 19\n }\n ],\n "meta": { "total": 89, "cursor": "eyJpZCI6ODl9" }\n}'
|
|
1766
|
+
},
|
|
1767
|
+
{
|
|
1768
|
+
type: "paragraph",
|
|
1769
|
+
text: "The classification system works particularly well with mixed-document uploads. When you upload a ZIP archive or folder containing invoices, contracts, and receipts, each document is classified independently. You can then use type-based filtering to process each category through its own schema and job pipeline, even though they were uploaded together. This makes it practical to ingest heterogeneous document batches without pre-sorting."
|
|
1770
|
+
},
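Reviewer sketch (not part of the package): after a mixed upload, type-based routing on the client side could start from a grouping step like the one below. It assumes each document object carries `id` and `document_type` as in the list response above. The `groupByDocumentType` helper and the sample IDs are hypothetical.

```javascript
// Groups document ids by document_type so each group can be sent to its own
// schema and job. Documents without a type fall under "Unclassified Document",
// the fallback type named in the FAQ of this section.
function groupByDocumentType(documents) {
  const groups = new Map();
  for (const doc of documents) {
    const type = doc.document_type || "Unclassified Document";
    if (!groups.has(type)) groups.set(type, []);
    groups.get(type).push(doc.id);
  }
  return groups;
}

// Hypothetical documents from one mixed upload.
const mixedUpload = [
  { id: "doc_a", document_type: "Invoice" },
  { id: "doc_b", document_type: "Employment Contract" },
  { id: "doc_c", document_type: "Invoice" },
];
const groupedByType = groupByDocumentType(mixedUpload);
console.log(Object.fromEntries(groupedByType));
```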
|
|
1771
|
+
{
|
|
1772
|
+
type: "paragraph",
|
|
1773
|
+
text: 'Behind the scenes, classification uses a two-step verification process. First, Document AI OCR produces a free-text type label during the initial processing pass. Then, a type resolution step verifies that label against the actual document content by checking it against the full 529-type ontology. If the label and content signals disagree \u2014 which commonly happens with multilingual documents or ambiguous formats \u2014 the system trusts the content and resolves to the correct canonical type. For example, a German Arbeitsvertrag might initially be labelled as "Service Agreement" by the OCR engine, but the content-based resolution correctly identifies it as "Employment Contract" based on the actual text.'
|
|
1340
1774
|
}
|
|
1341
1775
|
],
|
|
1342
1776
|
related: [
|
|
@@ -1356,6 +1790,14 @@ var sections3 = [
|
|
|
1356
1790
|
{
|
|
1357
1791
|
question: "What happens if a document cannot be classified?",
|
|
1358
1792
|
answer: 'Unresolvable documents are assigned the "Unclassified Document" type. They can still be processed and extracted \u2014 the platform simply cannot map them to a specific canonical type in the 529-type ontology.'
|
|
1793
|
+
},
|
|
1794
|
+
{
|
|
1795
|
+
question: "Can I filter documents by type through the API?",
|
|
1796
|
+
answer: "Yes. The /v1/documents endpoint accepts a type query parameter to filter by document type. For example, GET /v1/documents?type=Invoice returns only documents classified as invoices. This is useful for building type-specific processing pipelines or generating reports by document category."
|
|
1797
|
+
},
|
|
1798
|
+
{
|
|
1799
|
+
question: "How does classification handle multilingual documents?",
|
|
1800
|
+
answer: "The classifier works across all languages by analyzing content signals rather than relying solely on the OCR-produced label. During type resolution, the system checks the first 500 characters of actual document content against the full 529-type ontology using AI. This content-based approach ensures correct classification regardless of the document language \u2014 a German Arbeitsvertrag, a French Contrat de Travail, and an English Employment Contract all map to the same canonical type."
|
|
1359
1801
|
}
|
|
1360
1802
|
],
|
|
1361
1803
|
mentions: [
|
|
@@ -1422,6 +1864,36 @@ var sections3 = [
|
|
|
1422
1864
|
type: "callout",
|
|
1423
1865
|
variant: "info",
|
|
1424
1866
|
text: "You can open the **AI Agent** (`Cmd+I`) from any document detail page. The agent automatically has the current document in scope and can answer questions about its fields, classification, or processing status without you needing to specify which document you mean."
|
|
1867
|
+
},
|
|
1868
|
+
{
|
|
1869
|
+
type: "heading",
|
|
1870
|
+
level: 3,
|
|
1871
|
+
id: "document-detail-api",
|
|
1872
|
+
text: "Accessing Document Details via API"
|
|
1873
|
+
},
|
|
1874
|
+
{
|
|
1875
|
+
type: "paragraph",
|
|
1876
|
+
text: "The document detail information visible in the UI is fully accessible through the REST API. You can retrieve a document's extraction results, download the OCR markdown, and inspect the processing metadata programmatically. This is particularly useful for building quality assurance workflows where you need to verify extraction results across hundreds of documents without clicking through each one manually."
|
|
1877
|
+
},
|
|
1878
|
+
{
|
|
1879
|
+
type: "code",
|
|
1880
|
+
language: "bash",
|
|
1881
|
+
title: "Get extraction results for a document",
|
|
1882
|
+
code: 'curl https://api.talonic.com/v1/extractions/doc_7f3a1b2c \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
1883
|
+
},
|
|
1884
|
+
{
|
|
1885
|
+
type: "code",
|
|
1886
|
+
language: "json",
|
|
1887
|
+
title: "Response",
|
|
1888
|
+
code: '{\n "document_id": "doc_7f3a1b2c",\n "fields": [\n {\n "name": "invoice_number",\n "value": "INV-2026-0472",\n "confidence": 0.98,\n "source_text": "Invoice Number: INV-2026-0472",\n "tier": 1\n },\n {\n "name": "total_amount",\n "value": "4,250.00",\n "confidence": 0.95,\n "source_text": "Total Due: $4,250.00",\n "tier": 1\n }\n ]\n}'
|
|
1889
|
+
},
|
|
1890
|
+
{
|
|
1891
|
+
type: "paragraph",
|
|
1892
|
+
text: "The extraction response includes every field the AI discovered, along with its confidence score and the exact source text from the document. Fields with confidence below 0.70 typically warrant manual review, while fields above 0.90 are highly reliable. You can use this data to build automated validation rules \u2014 for example, flagging documents where critical fields like invoice_number or total_amount have low confidence scores and routing them to a human reviewer."
|
|
1893
|
+
},
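Reviewer sketch (not part of the package): the validation rule described above could be expressed as a short filter. The 0.70 threshold and the field names come from the text and the sample response. The `fieldsNeedingReview` helper, and the choice to treat a missing critical field as needing review, are assumptions.

```javascript
const REVIEW_THRESHOLD = 0.7; // from the text: below 0.70 typically warrants manual review
const CRITICAL_FIELDS = ["invoice_number", "total_amount"];

// Returns the names of critical fields that are missing or below the threshold.
function fieldsNeedingReview(extraction, criticalFields = CRITICAL_FIELDS, threshold = REVIEW_THRESHOLD) {
  const byName = new Map(extraction.fields.map((field) => [field.name, field]));
  return criticalFields.filter((name) => {
    const field = byName.get(name);
    // Assumption: a missing critical field is handled like a low-confidence one.
    return !field || field.confidence < threshold;
  });
}

// Trimmed version of the extraction response shown in the diff.
const sampleExtraction = {
  document_id: "doc_7f3a1b2c",
  fields: [
    { name: "invoice_number", value: "INV-2026-0472", confidence: 0.98, tier: 1 },
    { name: "total_amount", value: "4,250.00", confidence: 0.95, tier: 1 },
  ],
};
const flaggedFields = fieldsNeedingReview(sampleExtraction);
console.log(flaggedFields.length === 0 ? "no review needed" : flaggedFields);
```

A document with a non-empty result would be routed to a human reviewer. An empty result lets it pass through automatically.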
|
|
1894
|
+
{
|
|
1895
|
+
type: "paragraph",
|
|
1896
|
+
text: "The document detail page also provides access to the OCR-generated markdown through both the UI and the API. This markdown representation preserves the document's structure \u2014 headings, tables, lists, and paragraphs \u2014 in a machine-readable format. Comparing the markdown against the extracted fields is a useful debugging technique: if a field is missing from the extraction, check whether the value appears in the markdown. If it does, the extraction model may need a master instruction to guide it to the right location. If it does not, the OCR stage may have missed the content, which can happen with low-resolution scans or complex layouts."
|
|
1425
1897
|
}
|
|
1426
1898
|
],
|
|
1427
1899
|
related: [
|
|
@@ -1441,6 +1913,14 @@ var sections3 = [
|
|
|
1441
1913
|
{
|
|
1442
1914
|
question: "What do the tier badges on fields mean?",
|
|
1443
1915
|
answer: "Tier badges indicate how well-established a field is across your document corpus. Tier 1 (green) are universal core fields, Tier 2 (amber) are established promoted fields, and Tier 3 (gray) are newly discovered emerging fields."
|
|
1916
|
+
},
|
|
1917
|
+
{
|
|
1918
|
+
question: "Can I retrieve extraction results via the API?",
|
|
1919
|
+
answer: "Yes. Use GET /v1/extractions/:documentId to retrieve all extracted fields for a document, including values, confidence scores, source text, and tier information. You can also use GET /v1/documents/:id/markdown to retrieve the OCR-generated markdown of the original document for comparison."
|
|
1920
|
+
},
|
|
1921
|
+
{
|
|
1922
|
+
question: "How do I debug a missing field in extraction results?",
|
|
1923
|
+
answer: "Start by comparing the Raw Extraction tab against the Original File tab to confirm the field exists in the source document. Then check the OCR markdown (available via the Processing Log or the /v1/documents/:id/markdown API endpoint) to verify the OCR stage captured the relevant text. If the text is present in the markdown but the field is missing from extraction, a master instruction may be needed to guide the AI to the correct location. If the text is absent from the markdown, the OCR stage missed it \u2014 try re-uploading a higher-resolution version of the document."
|
|
1444
1924
|
}
|
|
1445
1925
|
],
|
|
1446
1926
|
mentions: [
|
|
@@ -1493,6 +1973,36 @@ var sections3 = [
|
|
|
1493
1973
|
type: "callout",
|
|
1494
1974
|
variant: "info",
|
|
1495
1975
|
text: "Start with a simple routing rule for your most common document type. Once you verify it works correctly, expand to additional types. Rules are evaluated in priority order, so you can add specific overrides without disrupting existing rules. Review the execution history regularly to ensure documents are being routed as expected."
|
|
1976
|
+
},
|
|
1977
|
+
{
|
|
1978
|
+
type: "heading",
|
|
1979
|
+
level: 3,
|
|
1980
|
+
id: "routing-worked-example",
|
|
1981
|
+
text: "Worked Example: Automated Invoice Pipeline"
|
|
1982
|
+
},
|
|
1983
|
+
{
|
|
1984
|
+
type: "paragraph",
|
|
1985
|
+
text: 'Consider a common scenario: your accounting team receives hundreds of invoices weekly via email and cloud storage. Without routing rules, someone must manually navigate to each new document, assign a schema, and trigger extraction. With routing, this entire workflow is automated. First, connect your Gmail or Google Drive source to ingest documents continuously. Then create a routing rule that matches the "Invoice" document type, assigns your Invoice schema, and triggers a job run. Every new invoice that arrives is now automatically processed end-to-end without any manual steps.'
|
|
1986
|
+
},
|
|
1987
|
+
{
|
|
1988
|
+
type: "paragraph",
|
|
1989
|
+
text: "To complete the automation, add a delivery binding that pushes approved results to your ERP system via webhook. The result is a fully hands-off pipeline: invoices arrive from email, the platform classifies them, extracts structured data using your schema, and delivers the results to your downstream system. You only need to intervene when the review queue flags a low-confidence extraction or when a new document type appears that does not match any existing routing rule."
|
|
1990
|
+
},
|
|
1991
|
+
{
|
|
1992
|
+
type: "paragraph",
|
|
1993
|
+
text: 'Rules are evaluated in priority order, which allows you to layer general and specific rules effectively. For example, you might set a priority-10 general rule that routes all "Invoice" documents to a standard invoice schema, then add a priority-1 specific rule that routes invoices from a particular source to a specialized vendor-specific schema. The higher-priority (lower number) rule wins when both match, giving you fine-grained control over routing behavior without duplicating configuration.'
|
|
1994
|
+
},
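Reviewer sketch (not part of the package): the priority behaviour described above, where the lowest number wins, can be modelled in a few lines. The `document_type`, `priority`, and `enabled` fields appear in the sample rule response. The optional `source_id` on a rule and the `pickRoutingRule` helper are hypothetical, and this is not a description of how the platform evaluates rules internally.

```javascript
// Picks the single rule that would apply to a document: among enabled rules
// matching the document type (and source, if the rule names one), the rule
// with the lowest priority number wins.
function pickRoutingRule(rules, doc) {
  const matches = rules.filter(
    (rule) =>
      rule.enabled &&
      rule.document_type === doc.document_type &&
      // Assumption: a rule without source_id applies to every source.
      (rule.source_id === undefined || rule.source_id === doc.source_id)
  );
  if (matches.length === 0) return null;
  return matches.reduce((best, rule) => (rule.priority < best.priority ? rule : best));
}

// Hypothetical general rule plus a source-specific override.
const exampleRules = [
  { id: "rule_general", document_type: "Invoice", priority: 10, schema_id: "sch_inv_001", enabled: true },
  { id: "rule_vendor", document_type: "Invoice", priority: 1, schema_id: "sch_vendor_x", source_id: "src_abc123", enabled: true },
];
const winningRule = pickRoutingRule(exampleRules, { document_type: "Invoice", source_id: "src_abc123" });
console.log(winningRule.id);
```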
|
|
1995
|
+
{
|
|
1996
|
+
type: "code",
|
|
1997
|
+
language: "bash",
|
|
1998
|
+
title: "List routing rules via API",
|
|
1999
|
+
code: 'curl https://api.talonic.com/v1/routing-rules \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2000
|
+
},
|
|
2001
|
+
{
|
|
2002
|
+
type: "code",
|
|
2003
|
+
language: "json",
|
|
2004
|
+
title: "Response",
|
|
2005
|
+
code: '{\n "data": [\n {\n "id": "rule_001",\n "document_type": "Invoice",\n "priority": 1,\n "actions": ["assign_schema", "trigger_job"],\n "schema_id": "sch_inv_001",\n "enabled": true\n }\n ],\n "meta": { "total": 4 }\n}'
|
|
1496
2006
|
}
|
|
1497
2007
|
],
|
|
1498
2008
|
related: [
|
|
@@ -1512,6 +2022,14 @@ var sections3 = [
|
|
|
1512
2022
|
{
|
|
1513
2023
|
question: "Can routing rules fully automate my document processing pipeline?",
|
|
1514
2024
|
answer: "Yes. By combining routing rules with source connectors and delivery bindings, you can create a fully automated pipeline: documents arrive from a connected source, routing rules assign schemas and trigger extraction jobs, and delivery bindings push approved results to downstream systems. For example, a Google Drive folder receiving weekly invoices can be connected as a source with a routing rule that auto-assigns your Invoice schema and triggers extraction. A delivery binding then pushes approved results to your ERP via webhook \u2014 zero manual steps required."
|
|
2025
|
+
},
|
|
2026
|
+
{
|
|
2027
|
+
question: "Can I manage routing rules through the API?",
|
|
2028
|
+
answer: "Yes. Routing rules can be listed and managed through the /v1/routing-rules endpoint. This allows you to programmatically create, update, or disable rules as part of your infrastructure automation. Each rule includes its document type trigger, priority, associated schema, and enabled status."
|
|
2029
|
+
},
|
|
2030
|
+
{
|
|
2031
|
+
question: "How does routing rule priority work?",
|
|
2032
|
+
answer: "Rules are evaluated in priority order: a lower priority number means higher priority. This lets you layer general rules (e.g., priority 10 for all invoices) with specific overrides (e.g., priority 1 for invoices from a particular source). If multiple rules match the same document, only the rule with the lowest priority number executes. This design prevents duplicate processing while giving you fine-grained control over routing behavior for every document type."
|
|
1515
2033
|
}
|
|
1516
2034
|
],
|
|
1517
2035
|
mentions: ["routing rules", "auto-assign", "schema assignment", "document workflows"]
|
|
@@ -1620,6 +2138,32 @@ var sections3 = [
|
|
|
1620
2138
|
{
|
|
1621
2139
|
type: "callout",
|
|
1622
2140
|
text: "Connectors are feature-gated on their OAuth client ID/secret. Without credentials configured, the connector dropdown entry is disabled. Microsoft Teams requires tenant-admin consent for privileged scopes like `ChannelMessage.Read.All`."
|
|
2141
|
+
},
|
|
2142
|
+
{
|
|
2143
|
+
type: "heading",
|
|
2144
|
+
level: 3,
|
|
2145
|
+
id: "connectors-api",
|
|
2146
|
+
text: "Managing Sources via API"
|
|
2147
|
+
},
|
|
2148
|
+
{
|
|
2149
|
+
type: "paragraph",
|
|
2150
|
+
text: "Sources can be created, listed, and managed through the REST API. This is useful for automated provisioning scenarios where you need to set up sources programmatically \u2014 for example, creating a new source for each client or department as part of an onboarding workflow. The API also lets you trigger imports from connected sources, which is how you can build scheduled sync pipelines using external orchestration tools like cron jobs or workflow engines."
|
|
2151
|
+
},
|
|
2152
|
+
{
|
|
2153
|
+
type: "code",
|
|
2154
|
+
language: "bash",
|
|
2155
|
+
title: "List all sources",
|
|
2156
|
+
code: 'curl https://api.talonic.com/v1/sources \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2157
|
+
},
|
|
2158
|
+
{
|
|
2159
|
+
type: "code",
|
|
2160
|
+
language: "json",
|
|
2161
|
+
title: "Response",
|
|
2162
|
+
code: '{\n "data": [\n {\n "id": "src_abc123",\n "name": "Invoices - Google Drive",\n "type": "google_drive",\n "documents_count": 234,\n "status": "connected",\n "batch_processing": false\n },\n {\n "id": "src_def456",\n "name": "Contracts - Manual Upload",\n "type": "upload",\n "documents_count": 47,\n "status": "active",\n "batch_processing": true\n }\n ],\n "meta": { "total": 5 }\n}'
|
|
2163
|
+
},
|
|
2164
|
+
{
|
|
2165
|
+
type: "paragraph",
|
|
2166
|
+
text: "For S3 and Azure Blob connectors, you can also use the API to create connections with encrypted credentials. The platform encrypts all credentials at rest using AES-256-GCM, so sensitive access keys and connection strings are never stored in plaintext. When listing sources through the API, credential values are redacted \u2014 only the connection status and metadata are returned to prevent accidental exposure of secrets in logs or dashboards."
|
|
1623
2167
|
}
|
|
1624
2168
|
],
|
|
1625
2169
|
related: [
|
|
@@ -1700,6 +2244,48 @@ var sections4 = [
|
|
|
1700
2244
|
{
|
|
1701
2245
|
type: "paragraph",
|
|
1702
2246
|
text: "Each registry field maintains two separate embedding vectors: one optimized for **resolution matching** (based on the canonical name and synonyms) and one for **graph visualization** (based on name, type, and instruction). This dual-embedding approach ensures that each concern uses the most appropriate representation. The resolution embedding is what powers the three-band matching during document processing, while the visualization embedding drives the Field Map clustering view."
|
|
2247
|
+
},
|
|
2248
|
+
{
|
|
2249
|
+
type: "heading",
|
|
2250
|
+
level: 3,
|
|
2251
|
+
id: "field-registry-api",
|
|
2252
|
+
text: "Browsing the Registry via API"
|
|
2253
|
+
},
|
|
2254
|
+
{
|
|
2255
|
+
type: "paragraph",
|
|
2256
|
+
text: "The Field Registry is fully accessible through the REST API, allowing you to build custom dashboards, export field data for analysis, or integrate registry information into external tools. You can list all fields with filtering by tier, type, and search terms, retrieve individual field details including all occurrences, and check resolution status. This programmatic access is particularly valuable for teams that need to audit their field registry or build automated monitoring around field quality."
|
|
2257
|
+
},
|
|
2258
|
+
{
|
|
2259
|
+
type: "code",
|
|
2260
|
+
language: "bash",
|
|
2261
|
+
title: "Search the field registry",
|
|
2262
|
+
code: 'curl "https://api.talonic.com/v1/schemas/fields?search=vendor&tier=1" \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2263
|
+
},
|
|
2264
|
+
{
|
|
2265
|
+
type: "code",
|
|
2266
|
+
language: "json",
|
|
2267
|
+
title: "Response",
|
|
2268
|
+
code: '{\n "data": [\n {\n "id": "fld_001",\n "canonical_name": "vendor_name",\n "tier": 1,\n "data_type": "string",\n "occurrence_count": 312,\n "synonyms": ["supplier_name", "company_name"],\n "has_master_instruction": true\n }\n ],\n "meta": { "total": 3 }\n}'
|
|
2269
|
+
},
|
|
2270
|
+
{
|
|
2271
|
+
type: "code",
|
|
2272
|
+
language: "bash",
|
|
2273
|
+
title: "Get field detail with occurrences",
|
|
2274
|
+
code: 'curl https://api.talonic.com/v1/schemas/fields/fld_001 \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2275
|
+
},
|
|
2276
|
+
{
|
|
2277
|
+
type: "paragraph",
|
|
2278
|
+
text: "You can also trigger batch resolution and promotion evaluation through the API. The resolution endpoint starts a batch run that resolves all unresolved field occurrences against the registry, while the promotion endpoint evaluates whether any Tier 3 fields qualify for promotion to Tier 2. These operations run asynchronously, so poll their status to find out when they complete. Running resolution and promotion regularly keeps your registry current as new documents are processed."
|
|
2279
|
+
},
|
|
2280
|
+
{
|
|
2281
|
+
type: "paragraph",
|
|
2282
|
+
text: "The registry also tracks aggregate statistics that help you understand the overall health and maturity of your knowledge graph. The stats endpoint returns total field count, breakdown by tier, instruction coverage percentage, and the number of unresolved occurrences. Teams use these metrics to set quality targets \u2014 for example, aiming for 80% instruction coverage and less than 5% unresolved occurrences before transitioning from pilot to production usage."
|
|
2283
|
+
},
|
|
2284
|
+
{
|
|
2285
|
+
type: "code",
|
|
2286
|
+
language: "bash",
|
|
2287
|
+
title: "Get registry aggregate stats",
|
|
2288
|
+
code: 'curl https://api.talonic.com/v1/schemas/stats \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
1703
2289
|
}
|
|
1704
2290
|
],
|
|
1705
2291
|
related: [
|
|
@@ -1719,6 +2305,14 @@ var sections4 = [
|
|
|
1719
2305
|
{
|
|
1720
2306
|
question: "How does the Field Registry reduce extraction cost?",
|
|
1721
2307
|
answer: "The registry enables lookup-based resolution during job runs. When a field already exists in the registry with sufficient data, its value can be resolved via graph lookup instead of an AI call. Approximately 30% of cells are filled this way \u2014 instantly and at no cost."
|
|
2308
|
+
},
|
|
2309
|
+
{
|
|
2310
|
+
question: "Can I search the field registry via the API?",
|
|
2311
|
+
answer: "Yes. Use GET /v1/schemas/fields with search, tier, and type query parameters to filter registry entries. The response includes each field's canonical name, tier, occurrence count, synonyms, and whether it has a master instruction. This is useful for building custom dashboards or auditing field coverage across your document corpus."
|
|
2312
|
+
},
|
|
2313
|
+
{
|
|
2314
|
+
question: "How do I check the overall health of my field registry?",
|
|
2315
|
+
answer: "Use the GET /v1/schemas/stats endpoint to retrieve aggregate statistics including total field count, tier distribution, instruction coverage percentage, and unresolved occurrence count. Teams typically aim for 80% instruction coverage and less than 5% unresolved occurrences before transitioning from pilot to production. The dashboard telemetry metrics (capture rate, resolve rate, synthesize rate) also reflect registry maturity."
|
|
1722
2316
|
}
|
|
1723
2317
|
],
|
|
1724
2318
|
mentions: [
|
|
@@ -1785,6 +2379,53 @@ var sections4 = [
|
|
|
1785
2379
|
{
|
|
1786
2380
|
type: "callout",
|
|
1787
2381
|
text: "Tier badges appear throughout the platform as the primary quality signal. Tier 1 = green, Tier 2 = amber, Tier 3 = gray. You can see tier badges on the Field Registry page, in document detail views, on schema fields, and in job result grids."
|
|
2382
|
+
},
|
|
2383
|
+
{
|
|
2384
|
+
type: "heading",
|
|
2385
|
+
level: 3,
|
|
2386
|
+
id: "tier-api",
|
|
2387
|
+
text: "Managing Tiers via API"
|
|
2388
|
+
},
|
|
2389
|
+
{
|
|
2390
|
+
type: "paragraph",
|
|
2391
|
+
text: "You can query fields by tier through the registry API to understand the maturity of your knowledge graph. Filtering by tier lets you identify which fields are well-established versus newly emerging, which helps prioritize registry curation efforts. For example, reviewing Tier 3 fields regularly helps you catch misclassifications early before they propagate into job results. You can also manually adjust a field's tier through the API when you have domain knowledge that justifies an early promotion or demotion."
|
|
2392
|
+
},
|
|
2393
|
+
{
|
|
2394
|
+
type: "code",
|
|
2395
|
+
language: "bash",
|
|
2396
|
+
title: "List Tier 3 (emerging) fields",
|
|
2397
|
+
code: 'curl "https://api.talonic.com/v1/schemas/fields?tier=3" \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2398
|
+
},
|
|
2399
|
+
{
|
|
2400
|
+
type: "code",
|
|
2401
|
+
language: "bash",
|
|
2402
|
+
title: "Manually promote a field",
|
|
2403
|
+
code: `curl -X PATCH https://api.talonic.com/v1/schemas/fields/fld_001/tier \\
|
|
2404
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
2405
|
+
-H "Content-Type: application/json" \\
|
|
2406
|
+
-d '{ "tier": 2 }'`
|
|
2407
|
+
},
|
|
2408
|
+
{
|
|
2409
|
+
type: "paragraph",
|
|
2410
|
+
text: "The promotion evaluation can also be triggered programmatically. After a large document ingestion, you may want to run promotion immediately rather than waiting for the next automatic evaluation. The promotion endpoint scans all Tier 3 fields, checks them against the frequency thresholds (5 occurrences or 10% occurrence rate), and promotes qualifying fields to Tier 2. Promoted fields automatically receive synthesized master instructions and become eligible for lookup-based resolution in subsequent job runs."
|
|
2411
|
+
},
|
|
2412
|
+
{
|
|
2413
|
+
type: "code",
|
|
2414
|
+
language: "bash",
|
|
2415
|
+
title: "Trigger promotion evaluation",
|
|
2416
|
+
code: 'curl -X POST https://api.talonic.com/v1/schemas/promotion/run \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2417
|
+
},
|
|
2418
|
+
{
|
|
2419
|
+
type: "code",
|
|
2420
|
+
language: "bash",
|
|
2421
|
+
title: "View promotion candidates",
|
|
2422
|
+
code: 'curl https://api.talonic.com/v1/schemas/emergence \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2423
|
+
},
|
|
2424
|
+
{
|
|
2425
|
+
type: "code",
|
|
2426
|
+
language: "json",
|
|
2427
|
+
title: "Response",
|
|
2428
|
+
code: '{\n "candidates": [\n {\n "field_id": "fld_042",\n "canonical_name": "payment_terms",\n "current_tier": 3,\n "occurrence_count": 7,\n "occurrence_rate": 0.12,\n "qualifies": true\n }\n ],\n "total_candidates": 3\n}'
|
|
1788
2429
|
}
|
|
1789
2430
|
],
|
|
1790
2431
|
related: [
|
|
@@ -1798,11 +2439,15 @@ var sections4 = [
|
|
|
1798
2439
|
},
|
|
1799
2440
|
{
|
|
1800
2441
|
question: "How are fields promoted between tiers?",
|
|
1801
|
-
answer: "Fields are promoted automatically based on frequency thresholds. As more documents are processed and a field appears consistently, it
|
|
2442
|
+
answer: "Fields are promoted automatically based on frequency thresholds \u2014 5 occurrences or a 10% occurrence rate qualifies a Tier 3 field for promotion to Tier 2. As more documents are processed and a field appears consistently, it progresses through the tiers. Promotion is evaluated automatically after every batch resolution run, so fields graduate without manual intervention as your document corpus grows."
|
|
1802
2443
|
},
|
|
1803
2444
|
{
|
|
1804
2445
|
question: "Can I manually change a field's tier?",
|
|
1805
|
-
answer: "Yes. You can manually adjust a field's tier from the registry detail page. This is useful when you know a field is stable enough to promote early, or when you want to demote a field that was promoted prematurely."
|
|
2446
|
+
answer: "Yes. You can manually adjust a field's tier from the registry detail page or via the API at PATCH /v1/schemas/fields/:id/tier. This is useful when you know a field is stable enough to promote early, or when you want to demote a field that was promoted prematurely. Manual tier changes take effect immediately and trigger the same downstream effects as automatic promotion \u2014 instruction synthesis and schema regeneration."
|
|
2447
|
+
},
|
|
2448
|
+
{
|
|
2449
|
+
question: "What is the difference between Tier 1 and Tier 2 fields in practice?",
|
|
2450
|
+
answer: "Both Tier 1 and Tier 2 fields have master instructions and are eligible for lookup-based resolution during job runs. The key difference is coverage and reliability. Tier 1 fields appear across most document types and have the highest occurrence counts, making their lookup matches extremely reliable. Tier 2 fields are well-established within specific document types but may not appear across all types. In job runs, both tiers benefit from free graph lookups, but Tier 1 fields consistently achieve a higher match rate because of their broader occurrence base across the entire document corpus."
|
|
1806
2451
|
}
|
|
1807
2452
|
],
|
|
1808
2453
|
mentions: ["tier system", "Tier 1", "Tier 2", "Tier 3", "field promotion", "quality signal"]
|
|
@@ -1849,6 +2494,48 @@ var sections4 = [
|
|
|
1849
2494
|
type: "callout",
|
|
1850
2495
|
variant: "info",
|
|
1851
2496
|
text: "Manual cluster adjustments are permanent and improve the model for all future documents. If you notice the platform grouping unrelated fields together, split them early \u2014 this prevents incorrect value transfers during job runs. Conversely, merging clusters that should be together improves resolution accuracy across your entire corpus."
|
|
2497
|
+
},
|
|
2498
|
+
{
|
|
2499
|
+
type: "heading",
|
|
2500
|
+
level: 3,
|
|
2501
|
+
id: "clusters-api",
|
|
2502
|
+
text: "Working with Clusters via API"
|
|
2503
|
+
},
|
|
2504
|
+
{
|
|
2505
|
+
type: "paragraph",
|
|
2506
|
+
text: "Semantic clusters can be listed, merged, and split through the REST API. This is particularly useful for teams that need to perform bulk cluster curation or automate cluster management as part of their onboarding pipeline. For example, when setting up a new workspace, you might programmatically merge known field synonyms before processing your first batch of documents, ensuring the registry starts with clean, accurate clusters from day one."
|
|
2507
|
+
},
|
|
2508
|
+
{
|
|
2509
|
+
type: "code",
|
|
2510
|
+
language: "bash",
|
|
2511
|
+
title: "List semantic clusters",
|
|
2512
|
+
code: 'curl https://api.talonic.com/v1/schemas/clusters \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2513
|
+
},
|
|
2514
|
+
{
|
|
2515
|
+
type: "code",
|
|
2516
|
+
language: "bash",
|
|
2517
|
+
title: "Merge two clusters",
|
|
2518
|
+
code: `curl -X POST https://api.talonic.com/v1/schemas/clusters/merge \\
|
|
2519
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
2520
|
+
-H "Content-Type: application/json" \\
|
|
2521
|
+
-d '{
|
|
2522
|
+
"source_cluster_id": "cls_ship_to",
|
|
2523
|
+
"target_cluster_id": "cls_delivery_addr"
|
|
2524
|
+
}'`
|
|
2525
|
+
},
|
|
2526
|
+
{
|
|
2527
|
+
type: "paragraph",
|
|
2528
|
+
text: 'A well-curated cluster set has a direct impact on extraction cost and accuracy. When clusters accurately represent field synonyms, the resolution engine can transfer values between related fields without triggering an AI call. For instance, if "Ship To Address" and "Delivery Address" are correctly clustered together, a document containing "Ship To Address" will automatically resolve when your schema expects "Delivery Address". Without the cluster link, this would require an AI extraction call, increasing both cost and latency.'
|
|
2529
|
+
},
|
|
2530
|
+
{
|
|
2531
|
+
type: "code",
|
|
2532
|
+
language: "json",
|
|
2533
|
+
title: "Cluster detail response",
|
|
2534
|
+
code: '{\n "cluster_id": "cls_vendor",\n "canonical_name": "vendor_name",\n "members": [\n { "field_id": "fld_001", "name": "vendor_name", "occurrence_count": 312 },\n { "field_id": "fld_047", "name": "supplier_name", "occurrence_count": 89 },\n { "field_id": "fld_103", "name": "company_name", "occurrence_count": 45 }\n ],\n "similarity_threshold": 0.82\n}'
|
|
2535
|
+
},
|
|
2536
|
+
{
|
|
2537
|
+
type: "paragraph",
|
|
2538
|
+
text: 'Cluster quality is especially important during the early stages of platform adoption. When your registry is still growing, incorrect clusters can cause value transfers between unrelated fields, leading to extraction errors that are difficult to trace. Take time during the first few weeks to review the Field Map view and split any clusters where the platform has grouped fields that look similar in name but have different semantic meanings. For example, "Order Date" and "Order Number" might score above the 0.80 similarity threshold due to the shared "Order" prefix, but they represent fundamentally different data types and should be separate clusters.'
|
|
1852
2539
|
}
|
|
1853
2540
|
],
|
|
1854
2541
|
related: [
|
|
@@ -1868,6 +2555,14 @@ var sections4 = [
|
|
|
1868
2555
|
{
|
|
1869
2556
|
question: "How do semantic clusters reduce extraction cost?",
|
|
1870
2557
|
answer: 'When a job runs, the resolution engine uses clusters to transfer values between fields that belong to the same cluster. If a document has "Supplier Name" and your schema expects "Vendor Name", the cluster linkage allows the value to transfer automatically without an AI call.'
|
|
2558
|
+
},
|
|
2559
|
+
{
|
|
2560
|
+
question: "Can I merge or split clusters via the API?",
|
|
2561
|
+
answer: "Yes. Use POST /v1/schemas/clusters/merge to combine two clusters and POST /v1/schemas/clusters/split to remove a field from a cluster. These operations are permanent and immediately affect resolution behavior for all future documents and job runs."
|
|
2562
|
+
},
|
|
2563
|
+
{
|
|
2564
|
+
question: "What is the similarity threshold for automatic clustering?",
|
|
2565
|
+
answer: "Fields with semantic similarity >= 0.80 are automatically grouped into the same cluster. Fields in the 0.50-0.79 range are flagged as potential candidates for manual confirmation. Fields below 0.50 are kept separate. The thresholds are designed to prevent false merges while still surfacing useful grouping suggestions that you can review from the Field Map view."
|
|
1871
2566
|
}
|
|
1872
2567
|
],
|
|
1873
2568
|
mentions: [
|
|
@@ -1930,6 +2625,46 @@ var sections4 = [
|
|
|
1930
2625
|
{
|
|
1931
2626
|
type: "callout",
|
|
1932
2627
|
text: "Pending confirmations from the confirm band appear in **Resolution → Pending Confirmations**. Accept to merge into an existing cluster, or reject to create a new field."
|
|
2628
|
+
},
|
|
2629
|
+
{
|
|
2630
|
+
type: "heading",
|
|
2631
|
+
level: 3,
|
|
2632
|
+
id: "resolution-api",
|
|
2633
|
+
text: "Running Resolution via API"
|
|
2634
|
+
},
|
|
2635
|
+
{
|
|
2636
|
+
type: "paragraph",
|
|
2637
|
+
text: "Batch resolution can be triggered programmatically through the REST API. This is useful for teams that want to run resolution on a schedule or immediately after a large document ingestion. The resolution endpoint starts an asynchronous run that processes all unresolved field occurrences against the registry. You can check the resolution status endpoint to monitor how many fields remain unresolved and track progress over time."
|
|
2638
|
+
},
|
|
2639
|
+
{
|
|
2640
|
+
type: "code",
|
|
2641
|
+
language: "bash",
|
|
2642
|
+
title: "Trigger batch resolution",
|
|
2643
|
+
code: 'curl -X POST https://api.talonic.com/v1/schemas/resolution/run \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2644
|
+
},
|
|
2645
|
+
{
|
|
2646
|
+
type: "code",
|
|
2647
|
+
language: "bash",
|
|
2648
|
+
title: "Check resolution status",
|
|
2649
|
+
code: 'curl https://api.talonic.com/v1/schemas/resolution/status \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2650
|
+
},
|
|
2651
|
+
{
|
|
2652
|
+
type: "code",
|
|
2653
|
+
language: "json",
|
|
2654
|
+
title: "Response",
|
|
2655
|
+
code: '{\n "unresolved_count": 14,\n "total_occurrences": 1847,\n "pending_confirmations": 6,\n "last_run_at": "2026-05-07T09:00:00Z"\n}'
|
|
2656
|
+
},
|
|
2657
|
+
{
|
|
2658
|
+
type: "paragraph",
|
|
2659
|
+
text: "For best results, process pending confirmations promptly after each resolution run. Unconfirmed fields in the confirm band (0.50-0.79 similarity) remain in a pending state that can affect downstream extraction accuracy. Confirming correct matches strengthens the cluster and improves future resolution, while rejecting incorrect matches prevents bad data from propagating through the knowledge graph. Teams that review confirmations weekly typically see their auto-band match rate increase steadily over the first few months of platform usage."
|
|
2660
|
+
},
|
|
2661
|
+
{
|
|
2662
|
+
type: "paragraph",
|
|
2663
|
+
text: "The resolution system processes documents concurrently while keeping each document isolated from the others. Each document's fields are resolved in an independent transaction, and occurrence counts are updated atomically using SQL upserts, with up to 3 attempts when a deadlock is detected. This design lets resolution run on hundreds of documents simultaneously without data inconsistency. Occurrence counts are eventually consistent: they converge to the correct values even under high concurrent load."
|
|
2664
|
+
},
|
|
2665
|
+
{
|
|
2666
|
+
type: "paragraph",
|
|
2667
|
+
text: "After each resolution batch, a fixed chain of operations runs automatically: first tier promotion evaluation, then regeneration of the affected schemas, and finally cross-schema view updates. This chain ensures that newly promoted fields immediately appear in auto-generated schemas and that the cross-schema harmonization view stays current. The chain always runs as a single workflow: when promotion detects new Tier 2 fields, the downstream regeneration and view update steps follow in the same run."
|
|
1933
2668
|
}
|
|
1934
2669
|
],
|
|
1935
2670
|
related: [
|
|
@@ -2006,6 +2741,48 @@ var sections4 = [
|
|
|
2006
2741
|
{
|
|
2007
2742
|
type: "callout",
|
|
2008
2743
|
text: 'Click **"Synthesize All"** in the Field Registry to generate instructions for all qualifying fields. This runs the combined pipeline: embed → resolve → synthesize. The operation processes all fields that meet the synthesis criteria in a single batch.'
|
|
2744
|
+
},
|
|
2745
|
+
{
|
|
2746
|
+
type: "heading",
|
|
2747
|
+
level: 3,
|
|
2748
|
+
id: "instructions-api",
|
|
2749
|
+
text: "Managing Instructions via API"
|
|
2750
|
+
},
|
|
2751
|
+
{
|
|
2752
|
+
type: "paragraph",
|
|
2753
|
+
text: "Master instruction synthesis can be triggered and monitored through the REST API. The synthesis status endpoint shows how many fields have instructions versus how many qualify, giving you a clear picture of your registry's instruction coverage. You can trigger synthesis for all qualifying fields at once or for a single field. This is useful for automated workflows where you want to ensure instruction coverage stays above a threshold after each batch ingestion."
|
|
2754
|
+
},
|
|
2755
|
+
{
|
|
2756
|
+
type: "code",
|
|
2757
|
+
language: "bash",
|
|
2758
|
+
title: "Check instruction synthesis status",
|
|
2759
|
+
code: 'curl https://api.talonic.com/v1/schemas/instructions/status \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2760
|
+
},
|
|
2761
|
+
{
|
|
2762
|
+
type: "code",
|
|
2763
|
+
language: "json",
|
|
2764
|
+
title: "Response",
|
|
2765
|
+
code: '{\n "total_fields": 342,\n "with_instruction": 256,\n "qualifying_without_instruction": 18,\n "synthesize_rate": 0.74\n}'
|
|
2766
|
+
},
|
|
2767
|
+
{
|
|
2768
|
+
type: "code",
|
|
2769
|
+
language: "bash",
|
|
2770
|
+
title: "Trigger synthesis for all qualifying fields",
|
|
2771
|
+
code: 'curl -X POST https://api.talonic.com/v1/schemas/instructions/synthesize \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2772
|
+
},
|
|
2773
|
+
{
|
|
2774
|
+
type: "paragraph",
|
|
2775
|
+
text: "The quality of master instructions directly correlates with extraction accuracy. Fields with well-crafted instructions consistently produce higher confidence scores during job runs because the AI model receives specific guidance about where to find the value, what format to expect, and how to disambiguate similar fields. Teams that invest in reviewing and refining master instructions \u2014 either manually or through the synthesis pipeline \u2014 typically see a 10-15% improvement in extraction confidence for Tier 2 fields compared to extraction without instructions."
|
|
2776
|
+
},
|
|
2777
|
+
{
|
|
2778
|
+
type: "paragraph",
|
|
2779
|
+
text: 'A well-written master instruction typically includes three components: where to find the field in the document (e.g., "Look in the header area near the document date"), what format to expect (e.g., "Format as ISO 8601 date: YYYY-MM-DD"), and how to disambiguate from similar fields (e.g., "Prefer the issue date over the due date or payment date"). The AI synthesis process automatically generates instructions following this pattern by analyzing successful extractions, but you can always refine them manually when your domain expertise suggests a better approach.'
|
|
2780
|
+
},
|
|
2781
|
+
{
|
|
2782
|
+
type: "code",
|
|
2783
|
+
language: "bash",
|
|
2784
|
+
title: "Synthesize instructions for a single field",
|
|
2785
|
+
code: 'curl -X POST https://api.talonic.com/v1/schemas/instructions/synthesize/fld_001 \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
2009
2786
|
}
|
|
2010
2787
|
],
|
|
2011
2788
|
related: [
|
|
@@ -2025,6 +2802,14 @@ var sections4 = [
|
|
|
2025
2802
|
{
|
|
2026
2803
|
question: "Can I manually edit a master instruction?",
|
|
2027
2804
|
answer: "Yes. You can view and edit master instructions from the field detail page in the registry. Editing overrides the AI-synthesized version, which is useful when you have domain expertise the AI has not captured."
|
|
2805
|
+
},
|
|
2806
|
+
{
|
|
2807
|
+
question: "How can I check instruction coverage via the API?",
|
|
2808
|
+
answer: "Use GET /v1/schemas/instructions/status to see how many fields have instructions, how many qualify but lack them, and the overall synthesize rate. You can trigger synthesis for all qualifying fields via POST /v1/schemas/instructions/synthesize, or for a single field via POST /v1/schemas/instructions/synthesize/:fieldId."
|
|
2809
|
+
},
|
|
2810
|
+
{
|
|
2811
|
+
question: "What makes a good master instruction?",
|
|
2812
|
+
answer: 'A well-written master instruction includes three components: where to find the field in the document (e.g., "Look in the header area near the document date"), what format to expect (e.g., "Format as ISO 8601 date: YYYY-MM-DD"), and how to disambiguate from similar fields (e.g., "Prefer the issue date over the due date"). The AI synthesis process generates instructions following this pattern automatically, but manual refinement based on domain expertise can further improve accuracy.'
|
|
2028
2813
|
}
|
|
2029
2814
|
],
|
|
2030
2815
|
mentions: [
|
|
@@ -2065,6 +2850,53 @@ var sections5 = [
|
|
|
2065
2850
|
type: "paragraph",
|
|
2066
2851
|
text: "The tier system determines which fields appear in generated schemas. **Tier 1** (core) fields are the most frequently occurring and reliably extracted data points \u2014 they appear in nearly every document of the type. **Tier 2** (established) fields occur in a significant portion of documents and have been validated through repeated extraction. **Tier 3** (emerging) fields are too new or infrequent to be included in generated schemas, but they may be promoted as more documents are processed and their occurrence rate crosses the promotion threshold."
|
|
2067
2852
|
},
|
|
2853
|
+
{
|
|
2854
|
+
type: "code",
|
|
2855
|
+
language: "bash",
|
|
2856
|
+
title: "List generated schemas via API",
|
|
2857
|
+
code: `curl -s https://api.talonic.com/v1/schemas \\
|
|
2858
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
2859
|
+
|
|
2860
|
+
# Response:
|
|
2861
|
+
# {
|
|
2862
|
+
# "schemas": [
|
|
2863
|
+
# {
|
|
2864
|
+
# "id": "sch_abc123",
|
|
2865
|
+
# "document_type": "Invoice",
|
|
2866
|
+
# "version": 3,
|
|
2867
|
+
# "field_count": 24,
|
|
2868
|
+
# "created_at": "2025-04-12T10:30:00Z"
|
|
2869
|
+
# }
|
|
2870
|
+
# ]
|
|
2871
|
+
# }`
|
|
2872
|
+
},
|
|
2873
|
+
{
|
|
2874
|
+
type: "code",
|
|
2875
|
+
language: "bash",
|
|
2876
|
+
title: "Get a generated schema with fields",
|
|
2877
|
+
code: `curl -s https://api.talonic.com/v1/schemas/sch_abc123 \\
|
|
2878
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
2879
|
+
|
|
2880
|
+
# Response:
|
|
2881
|
+
# {
|
|
2882
|
+
# "id": "sch_abc123",
|
|
2883
|
+
# "document_type": "Invoice",
|
|
2884
|
+
# "version": 3,
|
|
2885
|
+
# "fields": [
|
|
2886
|
+
# {
|
|
2887
|
+
# "name": "invoice_number",
|
|
2888
|
+
# "data_type": "string",
|
|
2889
|
+
# "tier": 1,
|
|
2890
|
+
# "occurrence_count": 342,
|
|
2891
|
+
# "master_instruction": "Extract the unique invoice identifier..."
|
|
2892
|
+
# }
|
|
2893
|
+
# ]
|
|
2894
|
+
# }`
|
|
2895
|
+
},
|
|
2896
|
+
{
|
|
2897
|
+
type: "paragraph",
|
|
2898
|
+
text: "When working with generated schemas programmatically, use the list endpoint to discover which document types have been identified, then fetch individual schemas to inspect their fields. The diff between any two versions can be computed by comparing field lists \u2014 added fields appear only in the newer version, removed fields only in the older, and modified fields show differences in data type, tier, or instruction text. This is the same logic the UI uses to render the visual diff view."
|
|
2899
|
+
},
|
|
2068
2900
|
{
|
|
2069
2901
|
type: "callout",
|
|
2070
2902
|
text: "Generated schemas are read-only and cannot be used directly for job execution. To run an extraction job, create a **User Template** and map its fields to the registry. Generated schemas serve as a discovery tool to understand what the platform has found in your documents."
|
|
@@ -2087,6 +2919,10 @@ var sections5 = [
|
|
|
2087
2919
|
{
|
|
2088
2920
|
question: "Can I run an extraction job using a generated schema?",
|
|
2089
2921
|
answer: "No. Generated schemas are read-only references. To run a job, create a User Template, select the fields you need, map them to the registry, and publish a version. Generated schemas are designed as a discovery tool \u2014 use them to understand what the platform has found, then build a focused template for your specific output needs."
|
|
2922
|
+
},
|
|
2923
|
+
{
|
|
2924
|
+
question: "How do I trigger schema regeneration for a specific document type?",
|
|
2925
|
+
answer: "Use the POST /v1/schemas/generate/{typeId} endpoint to regenerate the schema for a single document type, or POST /v1/schemas/generate-all to regenerate all schemas. Regeneration scans the Field Registry for Tier 1 and Tier 2 fields, assembles the schema, and creates a new version. The process is idempotent \u2014 running it multiple times without registry changes produces no new versions."
|
|
2090
2926
|
}
|
|
2091
2927
|
],
|
|
2092
2928
|
mentions: ["generated schemas", "AI-generated", "versioning", "schema diff"]
|
|
@@ -2125,6 +2961,58 @@ var sections5 = [
|
|
|
2125
2961
|
type: "paragraph",
|
|
2126
2962
|
text: "For best results, keep templates focused on a single document type or closely related group of types. A template with 10-20 well-defined fields will produce higher accuracy than one with 50+ fields spanning unrelated domains. If you need different field sets for different document types, create separate templates and run targeted jobs for each."
|
|
2127
2963
|
},
|
|
2964
|
+
{
|
|
2965
|
+
type: "code",
|
|
2966
|
+
language: "bash",
|
|
2967
|
+
title: "Create a user template via API",
|
|
2968
|
+
code: `curl -X POST https://api.talonic.com/v1/schemas \\
|
|
2969
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
2970
|
+
-H "Content-Type: application/json" \\
|
|
2971
|
+
-d '{
|
|
2972
|
+
"name": "Invoice Extraction",
|
|
2973
|
+
"description": "Standard invoice fields for AP processing",
|
|
2974
|
+
"fields": [
|
|
2975
|
+
{ "display_name": "Invoice Number", "data_type": "string" },
|
|
2976
|
+
{ "display_name": "Invoice Date", "data_type": "date" },
|
|
2977
|
+
{ "display_name": "Total Amount", "data_type": "number" },
|
|
2978
|
+
{ "display_name": "Vendor Name", "data_type": "string" }
|
|
2979
|
+
]
|
|
2980
|
+
}'
|
|
2981
|
+
|
|
2982
|
+
# Response:
|
|
2983
|
+
# {
|
|
2984
|
+
# "id": "us_def456",
|
|
2985
|
+
# "name": "Invoice Extraction",
|
|
2986
|
+
# "status": "draft",
|
|
2987
|
+
# "field_count": 4,
|
|
2988
|
+
# "created_at": "2025-04-15T08:00:00Z"
|
|
2989
|
+
# }`
|
|
2990
|
+
},
|
|
2991
|
+
{
|
|
2992
|
+
type: "code",
|
|
2993
|
+
language: "bash",
|
|
2994
|
+
title: "Import a template from a CSV file",
|
|
2995
|
+
code: `curl -X POST https://api.talonic.com/v1/schemas/from-file \\
|
|
2996
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
2997
|
+
-F "file=@invoice_template.csv"
|
|
2998
|
+
|
|
2999
|
+
# Response:
|
|
3000
|
+
# {
|
|
3001
|
+
# "id": "us_ghi789",
|
|
3002
|
+
# "name": "invoice_template",
|
|
3003
|
+
# "status": "draft",
|
|
3004
|
+
# "field_count": 12,
|
|
3005
|
+
# "inferred_types": {
|
|
3006
|
+
# "invoice_number": "string",
|
|
3007
|
+
# "total_amount": "number",
|
|
3008
|
+
# "invoice_date": "date"
|
|
3009
|
+
# }
|
|
3010
|
+
# }`
|
|
3011
|
+
},
|
|
3012
|
+
{
|
|
3013
|
+
type: "paragraph",
|
|
3014
|
+
text: "After creating a template via the API, you can add fields incrementally using the `POST /v1/schemas/{id}/fields` endpoint, trigger automatic registry matching with `POST /v1/schemas/{id}/rematch`, and publish an immutable version when ready. The API workflow mirrors the UI experience \u2014 create, configure, match, publish \u2014 but enables automation for teams that manage schemas programmatically or need to synchronize template definitions across environments."
|
|
3015
|
+
},
|
|
2128
3016
|
{
|
|
2129
3017
|
type: "callout",
|
|
2130
3018
|
text: "You can import templates from Excel, CSV, or JSON files using the **Import from file** option. Column headers become field names, and data types are inferred automatically. This is the fastest way to bootstrap a template from an existing spreadsheet."
|
|
@@ -2147,6 +3035,10 @@ var sections5 = [
|
|
|
2147
3035
|
{
|
|
2148
3036
|
question: "Can I update a published template?",
|
|
2149
3037
|
answer: "Published versions are immutable. To make changes, open the Workshop draft, edit your fields, and publish a new version. The previous version remains available in Version History for reference and diffing. This append-only versioning ensures that historical job results always reference the exact schema that produced them."
|
|
3038
|
+
},
|
|
3039
|
+
{
|
|
3040
|
+
question: "How do I add fields to a template programmatically?",
|
|
3041
|
+
answer: "Use POST /v1/schemas/{id}/fields with a JSON body specifying the display_name, data_type, and optionally a manual_instruction. After adding fields, call POST /v1/schemas/{id}/rematch to trigger automatic registry matching. Fields with exact name matches are linked instantly; semantic matches appear as suggestions for confirmation."
|
|
2150
3042
|
}
|
|
2151
3043
|
],
|
|
2152
3044
|
mentions: ["user templates", "schema creation", "field mapping", "reference tables", "publish"]
|
|
@@ -2234,6 +3126,49 @@ var sections5 = [
|
|
|
2234
3126
|
type: "paragraph",
|
|
2235
3127
|
text: 'For best results, use **manual instructions** sparingly and only for fields that the registry cannot match. A well-written instruction should describe the field in plain language, specify where in the document to look, and note any formatting expectations. Avoid vague instructions like "extract the value" \u2014 instead, write something like "Extract the net payment amount from the invoice summary section, excluding VAT."'
|
|
2236
3128
|
},
|
|
3129
|
+
{
|
|
3130
|
+
type: "code",
|
|
3131
|
+
language: "bash",
|
|
3132
|
+
title: "Add a field with a format constraint, modifier, and output name",
|
|
3133
|
+
code: `curl -X POST https://api.talonic.com/v1/schemas/us_def456/fields \\
|
|
3134
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3135
|
+
-H "Content-Type: application/json" \\
|
|
3136
|
+
-d '{
|
|
3137
|
+
"display_name": "Purchase Order Number",
|
|
3138
|
+
"data_type": "string",
|
|
3139
|
+
"manual_instruction": "Extract the PO number from the order reference section",
|
|
3140
|
+
"constraints": {
|
|
3141
|
+
"format": {
|
|
3142
|
+
"type": "regex",
|
|
3143
|
+
"pattern": "^PO-\\\\d{6}$",
|
|
3144
|
+
"on_format_mismatch": "flag"
|
|
3145
|
+
}
|
|
3146
|
+
},
|
|
3147
|
+
"modifiers": {
|
|
3148
|
+
"max_length": 20
|
|
3149
|
+
},
|
|
3150
|
+
"output_name": "po_number"
|
|
3151
|
+
}'`
|
|
3152
|
+
},
|
|
3153
|
+
{
|
|
3154
|
+
type: "code",
|
|
3155
|
+
language: "bash",
|
|
3156
|
+
title: "Configure a bypass strategy (constant value)",
|
|
3157
|
+
code: `curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_xyz \\
|
|
3158
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3159
|
+
-H "Content-Type: application/json" \\
|
|
3160
|
+
-d '{
|
|
3161
|
+
"strategy": "constant",
|
|
3162
|
+
"constant_value": "USD"
|
|
3163
|
+
}'
|
|
3164
|
+
|
|
3165
|
+
# The field will always resolve to "USD" without any LLM call.
|
|
3166
|
+
# Bypass strategies execute during Phase 1, before AI extraction.`
|
|
3167
|
+
},
|
|
3168
|
+
{
|
|
3169
|
+
type: "paragraph",
|
|
3170
|
+
text: 'Schema features can be combined to build sophisticated field definitions. For example, a "Vendor Code" field might use a reference table for code mapping, a format constraint to validate the output format (`^V\\d{5}$`), an alias modifier to normalize legacy codes, and an output name remap for the downstream ERP system. Each feature operates at a different stage of the pipeline \u2014 bypass strategies in Phase 1, extraction instructions in Phase 2, reference table lookups in Phases 1 and 3, and modifiers plus format constraints in Phase 4 \u2014 so they compose without conflicts.'
|
|
3171
|
+
},
|
|
2237
3172
|
{
|
|
2238
3173
|
type: "callout",
|
|
2239
3174
|
text: "For the complete JSON Schema specification with all features, see the [Full Schema Reference](/docs/platform/schema-features) in the Platform Guide."
|
|
@@ -2256,6 +3191,10 @@ var sections5 = [
|
|
|
2256
3191
|
{
|
|
2257
3192
|
question: "In what order are modifiers applied to extracted values?",
|
|
2258
3193
|
answer: "Modifiers run in a fixed order: format (date/number conversion) first, then alias (value mapping), then max_length (truncation). Constraints are evaluated after all modifiers complete."
|
|
3194
|
+
},
|
|
3195
|
+
{
|
|
3196
|
+
question: "How do I configure a field to skip LLM extraction entirely?",
|
|
3197
|
+
answer: 'Use a bypass strategy on the field. Set strategy to "constant" for a fixed value, "generator" for deterministic IDs, or "reference" for lookup-based resolution. Bypass strategies execute during Phase 1 at zero AI cost. If the bypass fails to produce a value, the field falls through to LLM extraction as a safety net.'
|
|
2259
3198
|
}
|
|
2260
3199
|
],
|
|
2261
3200
|
mentions: [
|
|
@@ -2317,6 +3256,61 @@ var sections5 = [
|
|
|
2317
3256
|
type: "paragraph",
|
|
2318
3257
|
text: "You can trigger a **Rematch** on all fields at any time from the template editor. This is useful after the registry has grown \u2014 fields that were previously unmapped may now find matches as new extractions contribute to the registry. For best results, use descriptive field names that reflect the actual data (e.g., `contract_start_date` rather than `field_1`)."
|
|
2319
3258
|
},
|
|
3259
|
+
{
|
|
3260
|
+
type: "code",
|
|
3261
|
+
language: "bash",
|
|
3262
|
+
title: "Trigger a rematch on all fields in a template",
|
|
3263
|
+
code: `curl -X POST https://api.talonic.com/v1/schemas/us_def456/rematch \\
|
|
3264
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
3265
|
+
|
|
3266
|
+
# Response:
|
|
3267
|
+
# {
|
|
3268
|
+
# "rematched": 12,
|
|
3269
|
+
# "exact_matches": 8,
|
|
3270
|
+
# "semantic_matches": 2,
|
|
3271
|
+
# "unmapped": 2,
|
|
3272
|
+
# "fields": [
|
|
3273
|
+
# {
|
|
3274
|
+
# "display_name": "Invoice Number",
|
|
3275
|
+
# "match_status": "exact",
|
|
3276
|
+
# "match_confidence": 1.0,
|
|
3277
|
+
# "registry_field": "invoice_number"
|
|
3278
|
+
# },
|
|
3279
|
+
# {
|
|
3280
|
+
# "display_name": "Billing Contact",
|
|
3281
|
+
# "match_status": "semantic",
|
|
3282
|
+
# "match_confidence": 0.72,
|
|
3283
|
+
# "registry_field": "contact_name"
|
|
3284
|
+
# }
|
|
3285
|
+
# ]
|
|
3286
|
+
# }`
|
|
3287
|
+
},
|
|
3288
|
+
{
|
|
3289
|
+
type: "code",
|
|
3290
|
+
language: "bash",
|
|
3291
|
+
title: "List fields in the Field Registry",
|
|
3292
|
+
code: `curl -s "https://api.talonic.com/v1/fields?tier=1&limit=20" \\
|
|
3293
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
3294
|
+
|
|
3295
|
+
# Response:
|
|
3296
|
+
# {
|
|
3297
|
+
# "fields": [
|
|
3298
|
+
# {
|
|
3299
|
+
# "id": "fr_abc",
|
|
3300
|
+
# "canonical_name": "invoice_number",
|
|
3301
|
+
# "tier": 1,
|
|
3302
|
+
# "occurrence_count": 342,
|
|
3303
|
+
# "synonyms": ["inv_no", "invoice_id", "bill_number"],
|
|
3304
|
+
# "master_instruction": "Extract the unique invoice identifier..."
|
|
3305
|
+
# }
|
|
3306
|
+
# ],
|
|
3307
|
+
# "total": 156
|
|
3308
|
+
# }`
|
|
3309
|
+
},
|
|
3310
|
+
{
|
|
3311
|
+
type: "paragraph",
|
|
3312
|
+
text: "The rematch endpoint is particularly useful after bulk document ingestion. When a large batch of new documents introduces previously unseen fields into the registry, running a rematch on your templates can upgrade unmapped fields to exact or semantic matches without any manual configuration. The response includes a summary of how many fields changed status, so you can quickly assess whether the rematch had a meaningful impact on your template coverage."
|
|
3313
|
+
},
|
|
2320
3314
|
{
|
|
2321
3315
|
type: "callout",
|
|
2322
3316
|
text: "Field matching is read-only against the registry \u2014 it never creates new registry entries. If no match exists, the field stays unmapped until you provide a manual instruction or new documents introduce the field into the registry."
|
|
@@ -2339,6 +3333,10 @@ var sections5 = [
|
|
|
2339
3333
|
{
|
|
2340
3334
|
question: "Can I re-run field matching after adding more documents?",
|
|
2341
3335
|
answer: "Yes. Use the Rematch button in the template editor to re-run matching against the current registry. Fields that were previously unmapped may find new matches as your registry grows through processing additional documents. For best results, use descriptive field names that reflect the actual data rather than generic labels."
|
|
3336
|
+
},
|
|
3337
|
+
{
|
|
3338
|
+
question: "What confidence threshold determines automatic vs manual matching?",
|
|
3339
|
+
answer: "Matches above 0.8 confidence are auto-accepted \u2014 these are near-certain semantic equivalents. Matches between 0.5 and 0.8 appear as suggestions requiring your confirmation before they are applied. Matches below 0.5 are rejected and the field is classified as unmapped. You can adjust individual match decisions from the template editor at any time."
|
|
2342
3340
|
}
|
|
2343
3341
|
],
|
|
2344
3342
|
mentions: ["field matching", "exact match", "semantic match", "composite", "unmapped"]
|
|
@@ -2391,6 +3389,40 @@ var sections5 = [
|
|
|
2391
3389
|
type: "paragraph",
|
|
2392
3390
|
text: 'For best results, include common variations and abbreviations as separate value entries all pointing to the same key. For example, if your code is `US`, add values for "United States", "USA", "U.S.A.", and "United States of America". The more variations you cover, the more values resolve at Tier 1 (highest confidence) without falling through to fuzzy or AI matching.'
|
|
2393
3391
|
},
|
|
3392
|
+
{
|
|
3393
|
+
type: "code",
|
|
3394
|
+
language: "bash",
|
|
3395
|
+
title: "Example reference table CSV format",
|
|
3396
|
+
code: `# reference_table_contract_types.csv
|
|
3397
|
+
# key,value
|
|
3398
|
+
# std_master,Master Agreement
|
|
3399
|
+
# std_master,Frame Agreement
|
|
3400
|
+
# std_service,Service Agreement
|
|
3401
|
+
# std_service,Service Contract
|
|
3402
|
+
# std_nda,Non-Disclosure Agreement
|
|
3403
|
+
# std_nda,NDA
|
|
3404
|
+
# std_nda,Confidentiality Agreement
|
|
3405
|
+
|
|
3406
|
+
# Attach the same entries to a schema field (abbreviated):
|
|
3407
|
+
curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_contract_type \\
|
|
3408
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3409
|
+
-H "Content-Type: application/json" \\
|
|
3410
|
+
-d '{
|
|
3411
|
+
"reference_table": {
|
|
3412
|
+
"entries": [
|
|
3413
|
+
{ "key": "std_master", "value": "Master Agreement" },
|
|
3414
|
+
{ "key": "std_master", "value": "Frame Agreement" },
|
|
3415
|
+
{ "key": "std_service", "value": "Service Agreement" },
|
|
3416
|
+
{ "key": "std_nda", "value": "Non-Disclosure Agreement" },
|
|
3417
|
+
{ "key": "std_nda", "value": "NDA" }
|
|
3418
|
+
]
|
|
3419
|
+
}
|
|
3420
|
+
}'`
|
|
3421
|
+
},
|
|
3422
|
+
{
|
|
3423
|
+
type: "paragraph",
|
|
3424
|
+
text: "Reference tables support multi-hop resolution chains for complex lookup scenarios. A resolution chain specifies a sequence of reference tables to consult in order: the output of each lookup becomes the input to the next table in the chain. This is useful when your data needs to traverse multiple mapping layers, such as resolving a local product code to a regional code and then to a global SKU. Each hop in the chain applies the same 3-tier lookup cascade independently, so you get the full normalization, fuzzy, and AI fallback at every step."
|
|
3425
|
+
},
|
|
2394
3426
|
{
|
|
2395
3427
|
type: "callout",
|
|
2396
3428
|
text: "Reference table quality directly determines lookup accuracy. A properly loaded table produces 90-100% accurate results within a single run. Review the lookup_failed validation flag in Phase 3 results to identify values that could not be mapped \u2014 these are candidates for adding new entries to your table."
|
|
@@ -2413,6 +3445,10 @@ var sections5 = [
|
|
|
2413
3445
|
{
|
|
2414
3446
|
question: "How should I format my reference table CSV?",
|
|
2415
3447
|
answer: "Use two columns: the first column is the key (output code) and the second is the value (human-readable label). Include common variations and abbreviations as separate rows pointing to the same key for maximum Tier 1 hit rate."
|
|
3448
|
+
},
|
|
3449
|
+
{
|
|
3450
|
+
question: "Can I use multiple reference tables in a resolution chain?",
|
|
3451
|
+
answer: "Yes. Resolution chains allow multi-hop lookups where the output of one reference table becomes the input for the next. This is useful for complex mapping scenarios \u2014 for example, resolving a local product code to a regional code and then to a global SKU. Each hop applies the full 3-tier lookup cascade independently."
|
|
2416
3452
|
}
|
|
2417
3453
|
],
|
|
2418
3454
|
mentions: [
|
|
@@ -2432,19 +3468,47 @@ var sections5 = [
|
|
|
2432
3468
|
content: [
|
|
2433
3469
|
{
|
|
2434
3470
|
type: "paragraph",
|
|
2435
|
-
text: "Templates support a workshop system: **Live** (current published version, read-only), **Workshop** (mutable draft for editing), and **Version History** (timeline with diff summaries). When promoting a draft, the system detects breaking changes (field removals, type changes) and warns you."
|
|
3471
|
+
text: "Templates support a workshop system: **Live** (current published version, read-only), **Workshop** (mutable draft for editing), and **Version History** (timeline with diff summaries). When promoting a draft, the system detects breaking changes (field removals, type changes) and warns you."
|
|
3472
|
+
},
|
|
3473
|
+
{
|
|
3474
|
+
type: "paragraph",
|
|
3475
|
+
text: "Start by editing fields in the **Workshop** draft, then use **Test Extraction** to compare draft results against the live version before publishing. The **Version History** timeline lets you review diff summaries between any two versions, making it easy to trace when a field was added, renamed, or removed and understand the impact on downstream jobs."
|
|
3476
|
+
},
|
|
3477
|
+
{
|
|
3478
|
+
type: "paragraph",
|
|
3479
|
+
text: "The versioning system is append-only \u2014 every time you publish a draft, it creates a new immutable version and the previous version is preserved in the timeline. This means you can always go back and review the exact schema that was used for any historical job. The diff view highlights added fields, removed fields, type changes, and updated instructions, giving you a clear picture of how your schema evolved."
|
|
3480
|
+
},
|
|
3481
|
+
{
|
|
3482
|
+
type: "paragraph",
|
|
3483
|
+
text: "Use the workshop system to iterate safely on your schema without disrupting production jobs. A common workflow is to add a new field in the Workshop, run a **Test Extraction** on a few documents to verify it produces correct values, then publish when satisfied. If a downstream integration depends on a specific field, the breaking change detection will warn you before you accidentally remove or rename it."
|
|
2436
3484
|
},
|
|
2437
3485
|
{
|
|
2438
|
-
type: "
|
|
2439
|
-
|
|
3486
|
+
type: "code",
|
|
3487
|
+
language: "bash",
|
|
3488
|
+
title: "Get schema version history",
|
|
3489
|
+
code: `curl -s "https://api.talonic.com/v1/schemas/us_def456" \\
|
|
3490
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
3491
|
+
|
|
3492
|
+
# Response includes version history:
|
|
3493
|
+
# {
|
|
3494
|
+
# "id": "us_def456",
|
|
3495
|
+
# "name": "Invoice Extraction",
|
|
3496
|
+
# "live_version": 3,
|
|
3497
|
+
# "draft_status": "editing",
|
|
3498
|
+
# "versions": [
|
|
3499
|
+
# { "version": 3, "published_at": "2025-04-20T14:00:00Z", "field_count": 14 },
|
|
3500
|
+
# { "version": 2, "published_at": "2025-04-10T09:00:00Z", "field_count": 12 },
|
|
3501
|
+
# { "version": 1, "published_at": "2025-03-28T16:00:00Z", "field_count": 10 }
|
|
3502
|
+
# ]
|
|
3503
|
+
# }`
|
|
2440
3504
|
},
|
|
2441
3505
|
{
|
|
2442
3506
|
type: "paragraph",
|
|
2443
|
-
text: "The
|
|
3507
|
+
text: "The version history is a complete audit trail of your schema evolution. Each version snapshot captures the full field set, data types, instructions, reference tables, format constraints, and modifiers as they existed at publish time. When investigating a historical job result, you can always reconstruct the exact schema configuration that produced it by looking up the version number referenced in the job run metadata. This traceability is critical for compliance workflows and data governance requirements where you need to prove exactly what extraction rules were in effect at any point in time."
|
|
2444
3508
|
},
|
|
2445
3509
|
{
|
|
2446
3510
|
type: "paragraph",
|
|
2447
|
-
text: "
|
|
3511
|
+
text: "A common pattern for teams with strict governance requirements is to pair schema versioning with the Change Review feature. When Change Review is enabled, publishing a new schema version requires approval from a designated reviewer before it takes effect. This adds an extra safety layer on top of the breaking change detection, ensuring that both the technical impact and the business impact of schema changes are assessed before they reach production."
|
|
2448
3512
|
},
|
|
2449
3513
|
{
|
|
2450
3514
|
type: "callout",
|
|
@@ -2468,6 +3532,10 @@ var sections5 = [
|
|
|
2468
3532
|
{
|
|
2469
3533
|
question: "Can I revert to a previous schema version?",
|
|
2470
3534
|
answer: "Version history is append-only, so you cannot revert directly. However, you can review any previous version in the timeline, compare it with the current live version using the diff view, and manually re-add fields or settings that were changed. This design ensures that every historical job result always references the exact schema version that produced it. For safe iteration, always use the Workshop draft to test changes via Test Extraction before publishing a new version."
|
|
3535
|
+
},
|
|
3536
|
+
{
|
|
3537
|
+
question: "How do I compare two schema versions programmatically?",
|
|
3538
|
+
answer: "Fetch the full schema detail for each version using GET /v1/schemas/{id}. The response includes the complete field list with data types, instructions, and constraints for each version. Compare the field arrays to identify additions, removals, and modifications. The UI diff view performs this same comparison and highlights changes visually."
|
|
2471
3539
|
}
|
|
2472
3540
|
],
|
|
2473
3541
|
mentions: ["versioning", "drafts", "workshop", "live version", "breaking changes"]
|
|
@@ -2499,6 +3567,33 @@ var sections5 = [
|
|
|
2499
3567
|
type: "paragraph",
|
|
2500
3568
|
text: "A typical iteration workflow looks like this: add or modify a field in the Workshop draft, run a test extraction on your sample documents, review the comparison grid to check that the new field produces correct values, adjust the instruction if needed, re-test, and publish when satisfied. This tight feedback loop is the fastest way to refine extraction accuracy without impacting production jobs or consuming unnecessary credits."
|
|
2501
3569
|
},
|
|
3570
|
+
{
|
|
3571
|
+
type: "code",
|
|
3572
|
+
language: "bash",
|
|
3573
|
+
title: "Run a test extraction via the API",
|
|
3574
|
+
code: `# Test extractions use the simplified single-call mode.
|
|
3575
|
+
# From the API, create a job with extraction_type "simple":
|
|
3576
|
+
curl -X POST https://api.talonic.com/v1/jobs \\
|
|
3577
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3578
|
+
-H "Content-Type: application/json" \\
|
|
3579
|
+
-d '{
|
|
3580
|
+
"schema_id": "us_def456",
|
|
3581
|
+
"document_ids": ["doc_001", "doc_002", "doc_003"],
|
|
3582
|
+
"extraction_type": "simple"
|
|
3583
|
+
}'
|
|
3584
|
+
|
|
3585
|
+
# Response:
|
|
3586
|
+
# {
|
|
3587
|
+
# "run_id": "run_test_abc",
|
|
3588
|
+
# "status": "processing",
|
|
3589
|
+
# "extraction_type": "simple",
|
|
3590
|
+
# "document_count": 3
|
|
3591
|
+
# }`
|
|
3592
|
+
},
|
|
3593
|
+
{
|
|
3594
|
+
type: "paragraph",
|
|
3595
|
+
text: "Test extraction results include a cell-by-cell comparison between the draft and live schema outputs. Each cell is classified as improved (the draft produces a better value), regressed (the draft produces a worse value), or unchanged. The comparison accounts for both the extracted value and its confidence score, so you can detect cases where the value is correct but the confidence dropped \u2014 which might indicate a fragile extraction instruction that should be refined before publishing."
|
|
3596
|
+
},
|
|
2502
3597
|
{
|
|
2503
3598
|
type: "callout",
|
|
2504
3599
|
text: "Test extractions do not affect your live data or consume production job credits differently. They are designed for rapid iteration \u2014 run as many tests as you need before publishing. Results are temporary and do not appear in your job history."
|
|
@@ -2521,6 +3616,10 @@ var sections5 = [
|
|
|
2521
3616
|
{
|
|
2522
3617
|
question: "How many documents should I use for a test extraction?",
|
|
2523
3618
|
answer: "Select 3-5 representative documents that cover the variety in your corpus. Include documents with different layouts, data completeness levels, and edge cases to get a reliable preview of how your schema changes perform."
|
|
3619
|
+
},
|
|
3620
|
+
{
|
|
3621
|
+
question: "Does test extraction apply all schema features like format constraints and modifiers?",
|
|
3622
|
+
answer: "Yes. Test extractions run through the same pipeline as production jobs, including reference table lookups, format constraints, modifiers, and bypass strategies. The only difference is the simplified single-call extraction mode, which is faster but still applies all schema features during the transform and validate phase."
|
|
2524
3623
|
}
|
|
2525
3624
|
],
|
|
2526
3625
|
mentions: ["test extraction", "draft comparison", "side-by-side", "preview"]
|
|
@@ -2588,6 +3687,49 @@ var sections5 = [
|
|
|
2588
3687
|
type: "paragraph",
|
|
2589
3688
|
text: 'For best results, create a shared dialect for each downstream system or regional office you deliver to, and name it descriptively (e.g., "SAP Europe" or "US Accounting"). Avoid defining dialects inline on individual schemas unless you have a one-off formatting requirement. Shared dialects reduce maintenance burden and ensure consistency when you add new schemas later.'
|
|
2590
3689
|
},
|
|
3690
|
+
{
|
|
3691
|
+
type: "code",
|
|
3692
|
+
language: "bash",
|
|
3693
|
+
title: "Create a shared dialect via API",
|
|
3694
|
+
code: `curl -X POST https://api.talonic.com/v1/dialects \\
|
|
3695
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3696
|
+
-H "Content-Type: application/json" \\
|
|
3697
|
+
-d '{
|
|
3698
|
+
"name": "EU Accounting",
|
|
3699
|
+
"date_format": "DD.MM.YYYY",
|
|
3700
|
+
"number_locale": "de-DE",
|
|
3701
|
+
"delimiter": ";",
|
|
3702
|
+
"null_representation": "",
|
|
3703
|
+
"boolean_format": "yes/no",
|
|
3704
|
+
"encoding": "UTF-8-BOM"
|
|
3705
|
+
}'
|
|
3706
|
+
|
|
3707
|
+
# Response:
|
|
3708
|
+
# {
|
|
3709
|
+
# "id": "dial_eu_001",
|
|
3710
|
+
# "name": "EU Accounting",
|
|
3711
|
+
# "created_at": "2025-04-18T12:00:00Z"
|
|
3712
|
+
# }`
|
|
3713
|
+
},
|
|
3714
|
+
{
|
|
3715
|
+
type: "code",
|
|
3716
|
+
language: "bash",
|
|
3717
|
+
title: "List all dialects in the workspace",
|
|
3718
|
+
code: `curl -s https://api.talonic.com/v1/dialects \\
|
|
3719
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
3720
|
+
|
|
3721
|
+
# Response:
|
|
3722
|
+
# {
|
|
3723
|
+
# "dialects": [
|
|
3724
|
+
# { "id": "dial_eu_001", "name": "EU Accounting", "date_format": "DD.MM.YYYY" },
|
|
3725
|
+
# { "id": "dial_us_002", "name": "US Standard", "date_format": "MM/DD/YYYY" }
|
|
3726
|
+
# ]
|
|
3727
|
+
# }`
|
|
3728
|
+
},
|
|
3729
|
+
{
|
|
3730
|
+
type: "paragraph",
|
|
3731
|
+
text: "Dialects can be managed programmatically through the full CRUD API: create with POST, retrieve with GET, update with PUT, and delete with DELETE on the /v1/dialects endpoints. This is useful for teams that manage multiple workspaces and want to synchronize formatting conventions across environments. You can export a dialect configuration from one workspace and import it into another by replicating the JSON body, ensuring consistent output formatting across your entire organization."
|
|
3732
|
+
},
|
|
2591
3733
|
{
|
|
2592
3734
|
type: "callout",
|
|
2593
3735
|
text: "If your CSV files show garbled special characters (accents, umlauts, CJK text), switch the encoding to **UTF-8-BOM**. The BOM (byte order mark) tells Excel to interpret the file as UTF-8 instead of the system default encoding."
|
|
@@ -2610,6 +3752,10 @@ var sections5 = [
|
|
|
2610
3752
|
{
|
|
2611
3753
|
question: "Do I need to re-run extractions when I change a dialect?",
|
|
2612
3754
|
answer: "No. Dialects only affect output serialization (exports and deliveries), not how values are stored internally. Changing a dialect takes effect immediately on future exports without re-processing."
|
|
3755
|
+
},
|
|
3756
|
+
{
|
|
3757
|
+
question: "How do I apply a shared dialect to a specific schema?",
|
|
3758
|
+
answer: "Navigate to the schema editor, open the Delivery tab, and select the shared dialect from the dropdown. Alternatively, use the PATCH /v1/schemas/{id} endpoint with a dialect_id field to link the dialect programmatically. The dialect applies to all future exports and deliveries for that schema without re-running extractions."
|
|
2613
3759
|
}
|
|
2614
3760
|
],
|
|
2615
3761
|
mentions: [
|
|
@@ -2681,6 +3827,44 @@ var sections5 = [
|
|
|
2681
3827
|
type: "paragraph",
|
|
2682
3828
|
text: "For best results, audit your schema for fields that never vary across documents \u2014 these are prime candidates for the **constant** strategy. Fields like currency, data source, or processing batch can be set once and never require AI extraction. This reduces per-document processing cost and improves job completion time, especially on large runs with hundreds of documents."
|
|
2683
3829
|
},
|
|
3830
|
+
{
|
|
3831
|
+
type: "code",
|
|
3832
|
+
language: "bash",
|
|
3833
|
+
title: "Configure a generator bypass (deterministic ID)",
|
|
3834
|
+
code: `curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_row_id \\
|
|
3835
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3836
|
+
-H "Content-Type: application/json" \\
|
|
3837
|
+
-d '{
|
|
3838
|
+
"strategy": "generator",
|
|
3839
|
+
"generator_type": "deterministic-id",
|
|
3840
|
+
"generator_config": {
|
|
3841
|
+
"prefix": "INV"
|
|
3842
|
+
}
|
|
3843
|
+
}'
|
|
3844
|
+
|
|
3845
|
+
# Each row will receive a unique, reproducible ID like "INV-a7b3c9d2"
|
|
3846
|
+
# based on a hash of the document and entity attributes.`
|
|
3847
|
+
},
|
|
3848
|
+
{
|
|
3849
|
+
type: "code",
|
|
3850
|
+
language: "bash",
|
|
3851
|
+
title: "Configure a reference bypass (lookup from reference table)",
|
|
3852
|
+
code: `curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_vendor_code \\
|
|
3853
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3854
|
+
-H "Content-Type: application/json" \\
|
|
3855
|
+
-d '{
|
|
3856
|
+
"strategy": "reference",
|
|
3857
|
+
"key_expression": "vendor_name",
|
|
3858
|
+
"reference_table_id": "ref_vendor_codes"
|
|
3859
|
+
}'
|
|
3860
|
+
|
|
3861
|
+
# The field value is resolved by looking up the vendor_name field
|
|
3862
|
+
# against the vendor_codes reference table \u2014 no LLM call needed.`
|
|
3863
|
+
},
|
|
3864
|
+
{
|
|
3865
|
+
type: "paragraph",
|
|
3866
|
+
text: "Bypass strategies are evaluated in Phase 1 of the pipeline, before any AI calls are made. This means bypass fields are resolved in milliseconds at zero credit cost, and their values are immediately available as context for Phase 2 AI extraction of other fields. For schemas with many static or derivable fields, bypass strategies can reduce the number of fields sent to the LLM by 30-50%, which translates directly to faster job completion and lower per-document cost. Audit your schema periodically for new bypass candidates as your understanding of the data matures."
|
|
3867
|
+
},
|
|
2684
3868
|
{
|
|
2685
3869
|
type: "callout",
|
|
2686
3870
|
text: "When a `generator` strategy fails to produce a value, the field falls through to LLM extraction as a safety net \u2014 your data is never left incomplete due to a bypass misconfiguration. Strategy values are normalized via generator mappings in Phase 4 of the pipeline. Bypass strategies execute during Phase 1, before any AI calls are made."
|
|
@@ -2703,6 +3887,10 @@ var sections5 = [
|
|
|
2703
3887
|
{
|
|
2704
3888
|
question: "Do bypass strategies reduce extraction costs?",
|
|
2705
3889
|
answer: "Yes. Fields with bypass strategies skip the AI extraction phase entirely, which reduces both processing time and credit usage. Use constant or reference strategies for fields that do not require document reading."
|
|
3890
|
+
},
|
|
3891
|
+
{
|
|
3892
|
+
question: "What is the difference between a reference table on a field and a reference bypass strategy?",
|
|
3893
|
+
answer: "A reference table on a field normalizes AI-extracted values to canonical codes after extraction (Phases 1 and 3). A reference bypass strategy skips AI extraction entirely and resolves the value by looking up another field in a reference table during Phase 1. Use reference tables when the AI needs to read the document first; use reference bypass when the value can be derived from an already-extracted field without reading the document."
|
|
2706
3894
|
}
|
|
2707
3895
|
],
|
|
2708
3896
|
mentions: [
|
|
@@ -2757,6 +3945,52 @@ var sections5 = [
|
|
|
2757
3945
|
type: "paragraph",
|
|
2758
3946
|
text: 'Choose the mismatch behavior based on your data quality requirements. Use **empty** (the default) when you prefer no data over bad data \u2014 the downstream system will see a blank cell. Use **flag** when you want to review mismatches manually before deciding \u2014 flagged cells appear with an amber dot in the results grid. Use **constant** when your downstream system needs a specific sentinel value like `"N/A"` or `"INVALID"` to trigger its own error handling.'
|
|
2759
3947
|
},
|
|
3948
|
+
{
|
|
3949
|
+
type: "code",
|
|
3950
|
+
language: "bash",
|
|
3951
|
+
title: "Add a format constraint to a schema field",
|
|
3952
|
+
code: `curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_po_number \\
|
|
3953
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3954
|
+
-H "Content-Type: application/json" \\
|
|
3955
|
+
-d '{
|
|
3956
|
+
"constraints": {
|
|
3957
|
+
"format": {
|
|
3958
|
+
"type": "regex",
|
|
3959
|
+
"pattern": "^PO-\\\\d{6}$",
|
|
3960
|
+
"on_format_mismatch": "flag"
|
|
3961
|
+
}
|
|
3962
|
+
}
|
|
3963
|
+
}'
|
|
3964
|
+
|
|
3965
|
+
# Values matching PO-123456 pass through unchanged.
|
|
3966
|
+
# Values like "PO 123456" or "123456" are flagged with an amber dot.
|
|
3967
|
+
# Original values are always preserved in original_extractions for audit.`
|
|
3968
|
+
},
|
|
3969
|
+
{
|
|
3970
|
+
type: "code",
|
|
3971
|
+
language: "bash",
|
|
3972
|
+
title: "Date format constraint with optional time component",
|
|
3973
|
+
code: `# Validate ISO date format with optional time component:
|
|
3974
|
+
curl -X PATCH https://api.talonic.com/v1/schemas/us_def456/fields/fld_date \\
|
|
3975
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
3976
|
+
-H "Content-Type: application/json" \\
|
|
3977
|
+
-d '{
|
|
3978
|
+
"constraints": {
|
|
3979
|
+
"format": {
|
|
3980
|
+
"type": "regex",
|
|
3981
|
+
"pattern": "^\\\\d{4}-\\\\d{2}-\\\\d{2}(T\\\\d{2}:\\\\d{2}:\\\\d{2})?$",
|
|
3982
|
+
"on_format_mismatch": "empty"
|
|
3983
|
+
}
|
|
3984
|
+
}
|
|
3985
|
+
}'
|
|
3986
|
+
|
|
3987
|
+
# Values like "2025-03-15" or "2025-03-15T14:30:00" pass.
|
|
3988
|
+
# Values like "March 15, 2025" are cleared (on_format_mismatch: "empty").`
|
|
3989
|
+
},
|
|
3990
|
+
{
|
|
3991
|
+
type: "paragraph",
|
|
3992
|
+
text: 'Format constraints help keep extracted data compatible with downstream systems. Many ERP and accounting systems reject records with malformed identifiers, dates outside their expected format, or amounts with unexpected characters. Catching these issues at extraction time keeps malformed values from reaching those systems. The three mismatch behaviors control the trade-off: use "empty" when no data is better than bad data, "flag" when you want human review before deciding, and "constant" when your downstream system needs a specific sentinel value to trigger error handling.'
|
|
3993
|
+
},
|
|
2760
3994
|
{
|
|
2761
3995
|
type: "callout",
|
|
2762
3996
|
text: "The regex evaluator includes ReDoS protection: nested quantifiers are rejected and input is capped at 1,000 characters. Use the `(?i)` inline flag for case-insensitive matching. Format constraints support standard JavaScript regex syntax, so you can use character classes, alternation, and lookahead assertions for complex validation patterns."
|
|
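Reviewer sketch for the hunk above: the three mismatch behaviors (empty, flag, constant) can be illustrated with a small function. This is a minimal sketch under stated assumptions, not Talonic's implementation. Only `pattern` and `on_format_mismatch` appear in the documented payload; `applyFormatConstraint` and `constant_value` are hypothetical names introduced for illustration.

```javascript
// Hypothetical illustration of the documented mismatch behaviors.
// `applyFormatConstraint` and `constant_value` are assumed names, not Talonic API.
function applyFormatConstraint(value, constraint) {
  const matches = new RegExp(constraint.pattern).test(value);
  if (matches) {
    // Matching values pass through unchanged.
    return { value, flagged: false };
  }
  switch (constraint.on_format_mismatch) {
    case "flag":
      // Keep the value but mark the cell for manual review.
      return { value, flagged: true };
    case "constant":
      // Replace the value with a sentinel the downstream system recognizes.
      return { value: constraint.constant_value, flagged: false };
    case "empty":
    default:
      // Documented default: prefer no data over bad data.
      return { value: "", flagged: false };
  }
}

const poConstraint = { pattern: "^PO-\\d{6}$", on_format_mismatch: "flag" };
console.log(applyFormatConstraint("PO-123456", poConstraint));
console.log(applyFormatConstraint("PO 123456", poConstraint));
```

The sketch returns a new result object and never mutates its input, which mirrors the documented guarantee that original values remain available for audit.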
@@ -2779,6 +4013,10 @@ var sections5 = [
|
|
|
2779
4013
|
{
|
|
2780
4014
|
question: "Can I use case-insensitive regex patterns?",
|
|
2781
4015
|
answer: "Yes. Use the (?i) inline flag at the start of your pattern for case-insensitive matching. The evaluator supports standard JavaScript regex syntax including character classes, alternation, and lookahead assertions. ReDoS protection is built in \u2014 nested quantifiers are rejected and input is capped at 1,000 characters."
|
|
4016
|
+
},
|
|
4017
|
+
{
|
|
4018
|
+
question: "What happens if my regex pattern causes performance issues?",
|
|
4019
|
+
answer: "The evaluator includes built-in ReDoS (Regular Expression Denial of Service) protection. Patterns with nested quantifiers like (a+)+ are automatically rejected at save time. Additionally, input values are capped at 1,000 characters, preventing pathological backtracking on unexpectedly long strings. These safeguards run transparently \u2014 you do not need to optimize your patterns manually for performance."
|
|
2782
4020
|
}
|
|
2783
4021
|
],
|
|
2784
4022
|
mentions: [
|
|
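Reviewer sketch for the ReDoS FAQ above: one way a guard of this kind could work. The docs state only that nested quantifiers are rejected and input is capped at 1,000 characters. The detection heuristic below, the decision to truncate rather than reject long input, and the names `isPatternAllowed` and `safeTest` are all assumptions made for illustration.

```javascript
// Assumed cap, taken from the documented 1,000-character limit.
const MAX_INPUT_LENGTH = 1000;

// Rough heuristic (an assumption): a group that contains + or * and is itself
// followed by +, * or {. It catches (a+)+ but is not a complete ReDoS detector.
const NESTED_QUANTIFIER = /\([^()]*[+*][^()]*\)[+*{]/;

function isPatternAllowed(pattern) {
  return !NESTED_QUANTIFIER.test(pattern);
}

function safeTest(pattern, input) {
  if (!isPatternAllowed(pattern)) {
    throw new Error("Pattern rejected: nested quantifier");
  }
  // Assumption: over-long input is truncated before matching.
  const capped = String(input).slice(0, MAX_INPUT_LENGTH);
  return new RegExp(pattern).test(capped);
}

console.log(isPatternAllowed("(a+)+"));
console.log(isPatternAllowed("^PO-\\d{6}$"));
```

Rejecting the pattern up front corresponds to the "rejected at save time" behavior, while the length cap bounds the work any accepted pattern can do at evaluation time.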
@@ -2820,6 +4058,59 @@ var sections6 = [
|
|
|
2820
4058
|
type: "paragraph",
|
|
2821
4059
|
text: "The platform supports scaling caps to ensure reliable processing: Phase 2 extraction handles up to 2,000 documents per job, and Phase 4 transforms support up to 1,000 documents. Grid results are flushed to the database in batches of 200 documents per phase. For very large document collections, consider splitting into multiple jobs by document type for optimal results and easier review."
|
|
2822
4060
|
},
|
|
4061
|
+
{
|
|
4062
|
+
type: "code",
|
|
4063
|
+
language: "bash",
|
|
4064
|
+
title: "Create an extraction job via API",
|
|
4065
|
+
code: `curl -X POST https://api.talonic.com/v1/jobs \\
|
|
4066
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
4067
|
+
-H "Content-Type: application/json" \\
|
|
4068
|
+
-d '{
|
|
4069
|
+
"schema_id": "us_def456",
|
|
4070
|
+
"document_ids": ["doc_001", "doc_002", "doc_003"],
|
|
4071
|
+
"extraction_type": "pipeline"
|
|
4072
|
+
}'
|
|
4073
|
+
|
|
4074
|
+
# Response:
|
|
4075
|
+
# {
|
|
4076
|
+
# "run_id": "run_abc123",
|
|
4077
|
+
# "status": "processing",
|
|
4078
|
+
# "extraction_type": "pipeline",
|
|
4079
|
+
# "document_count": 3,
|
|
4080
|
+
# "current_phase": 1,
|
|
4081
|
+
# "created_at": "2025-04-20T10:00:00Z"
|
|
4082
|
+
# }`
|
|
4083
|
+
},
|
|
4084
|
+
{
|
|
4085
|
+
type: "code",
|
|
4086
|
+
language: "bash",
|
|
4087
|
+
title: "Check job status and results",
|
|
4088
|
+
code: `curl -s https://api.talonic.com/v1/jobs/runs/run_abc123 \\
|
|
4089
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4090
|
+
|
|
4091
|
+
# Response:
|
|
4092
|
+
# {
|
|
4093
|
+
# "run_id": "run_abc123",
|
|
4094
|
+
# "status": "completed",
|
|
4095
|
+
# "current_phase": 4,
|
|
4096
|
+
# "grid_stats": {
|
|
4097
|
+
# "total_cells": 36,
|
|
4098
|
+
# "filled_cells": 33,
|
|
4099
|
+
# "fill_rate": 0.917,
|
|
4100
|
+
# "strategy_yield": {
|
|
4101
|
+
# "registry_transfer": 0.44,
|
|
4102
|
+
# "llm_extract": 0.33,
|
|
4103
|
+
# "lookup_cascade": 0.11,
|
|
4104
|
+
# "deterministic_compute": 0.06,
|
|
4105
|
+
# "empty": 0.06
|
|
4106
|
+
# }
|
|
4107
|
+
# }
|
|
4108
|
+
# }`
|
|
4109
|
+
},
|
|
4110
|
+
{
|
|
4111
|
+
type: "paragraph",
|
|
4112
|
+
text: "The strategy yield breakdown in the job response is a powerful diagnostic tool. It shows exactly how each cell was filled \u2014 registry transfers from Phase 1, LLM extractions from Phase 2, lookup cascade matches from Phase 3, and deterministic computations. A high registry_transfer percentage indicates mature field coverage, while a high llm_extract percentage suggests the registry needs more training data. Track strategy yield across jobs to measure how your pipeline efficiency improves over time as the Field Registry grows."
|
|
4113
|
+
},
|
|
2823
4114
|
{
|
|
2824
4115
|
type: "callout",
|
|
2825
4116
|
text: "Results appear progressively as each pipeline phase completes. You do not need to wait for the entire job to finish \u2014 you can begin reviewing Phase 1 results while Phase 2 is still running. The phase timeline on the job detail page shows which phase is active and the cumulative fill rate at each stage."
|
|
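Reviewer sketch for the strategy yield paragraph above: how `fill_rate` and `strategy_yield` relate to individual cells. This is a minimal sketch that assumes a hypothetical cell shape (`{ value, strategy }`). The real response may be computed differently, and `summarizeGrid` is an illustrative name.

```javascript
// Hypothetical cell shape: { value: string | null, strategy: string | null }.
function summarizeGrid(cells) {
  const counts = {};
  let filled = 0;
  for (const cell of cells) {
    // Cells without a value are counted under "empty".
    const key = cell.value === null ? "empty" : cell.strategy;
    counts[key] = (counts[key] || 0) + 1;
    if (cell.value !== null) filled += 1;
  }
  const total = cells.length;
  const strategyYield = {};
  for (const [key, n] of Object.entries(counts)) {
    strategyYield[key] = total === 0 ? 0 : n / total;
  }
  return {
    total_cells: total,
    filled_cells: filled,
    fill_rate: total === 0 ? 0 : filled / total,
    strategy_yield: strategyYield,
  };
}

console.log(summarizeGrid([
  { value: "ACME", strategy: "registry_transfer" },
  { value: "NET30", strategy: "llm_extract" },
  { value: null, strategy: null },
]));
```

Under this definition the yield fractions, including `empty`, sum to 1, and `fill_rate` equals 1 minus the `empty` share.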
@@ -2842,6 +4133,10 @@ var sections6 = [
|
|
|
2842
4133
|
{
|
|
2843
4134
|
question: "How many documents can I include in a single job?",
|
|
2844
4135
|
answer: "Phase 2 supports up to 2,000 documents per job, and Phase 4 supports up to 1,000. For best results, start with smaller batches to validate your schema before scaling up."
|
|
4136
|
+
},
|
|
4137
|
+
{
|
|
4138
|
+
question: "What extraction modes are available?",
|
|
4139
|
+
answer: "Three modes: pipeline (full 4-phase extraction, the default and most thorough), simple (single AI call, faster but less thorough \u2014 used for test extractions), and field_registry (no AI, deterministic strategies only \u2014 useful for benchmarking registry coverage). Choose pipeline for production jobs, simple for quick previews, and field_registry for measuring how much the registry can resolve without AI."
|
|
2845
4140
|
}
|
|
2846
4141
|
],
|
|
2847
4142
|
mentions: ["extraction job", "structured grid", "progressive results", "template selection"]
|
|
@@ -2875,6 +4170,34 @@ var sections6 = [
|
|
|
2875
4170
|
title: "Job Detail \u2014 Phase Timeline",
|
|
2876
4171
|
caption: "The phase timeline shows progress through the pipeline. Each dot represents a stage, highlighted when active."
|
|
2877
4172
|
},
|
|
4173
|
+
{
|
|
4174
|
+
type: "code",
|
|
4175
|
+
language: "bash",
|
|
4176
|
+
title: "Monitor pipeline phase progress",
|
|
4177
|
+
code: `# Poll the job detail endpoint to track phase progress:
|
|
4178
|
+
curl -s https://api.talonic.com/v1/jobs/runs/run_abc123 \\
|
|
4179
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4180
|
+
|
|
4181
|
+
# During execution:
|
|
4182
|
+
# {
|
|
4183
|
+
# "run_id": "run_abc123",
|
|
4184
|
+
# "status": "processing",
|
|
4185
|
+
# "current_phase": 2,
|
|
4186
|
+
# "phase_timings": {
|
|
4187
|
+
# "phase_1": { "started_at": "...", "completed_at": "...", "duration_ms": 1240 },
|
|
4188
|
+
# "phase_2": { "started_at": "...", "completed_at": null }
|
|
4189
|
+
# },
|
|
4190
|
+
# "grid_stats": {
|
|
4191
|
+
# "fill_rate": 0.42,
|
|
4192
|
+
# "filled_cells": 15,
|
|
4193
|
+
# "total_cells": 36
|
|
4194
|
+
# }
|
|
4195
|
+
# }`
|
|
4196
|
+
},
|
|
4197
|
+
{
|
|
4198
|
+
type: "paragraph",
|
|
4199
|
+
text: "The phase timings in the job response let you identify bottlenecks in your extraction pipeline. Phase 1 typically completes in seconds regardless of document count because it uses pre-computed indexes. Phase 2 is the most time-consuming because it involves AI calls \u2014 the duration scales linearly with the number of empty cells and document length. Phase 3 adds minimal overhead with deterministic lookups. Phase 4 duration depends on how many cells remain empty after Phase 2. If Phase 2 consistently takes longer than expected, consider improving your Field Registry coverage to shift more work to Phase 1, or use batch inference for non-urgent jobs at 50% cost."
|
|
4200
|
+
},
|
|
2878
4201
|
{
|
|
2879
4202
|
type: "callout",
|
|
2880
4203
|
text: "Phase order is fixed: Phase 1 → 2 → 3 → 4. Phases are never skipped or reordered. This guarantees that high-confidence deterministic values from Phase 1 are always protected by the confidence gate before AI extraction runs. The confidence gate is the single most important pipeline rule \u2014 once a cell is filled with a high-confidence value, no later phase can overwrite it with a lower-confidence result."
|
|
@@ -2897,6 +4220,10 @@ var sections6 = [
|
|
|
2897
4220
|
{
|
|
2898
4221
|
question: "Why does the pipeline use multiple phases instead of a single AI call?",
|
|
2899
4222
|
answer: "The cascading design minimizes cost and latency. Phase 1 fills cells with deterministic lookups at zero AI cost. Only remaining gaps go to the AI agent in Phase 2, and Phase 4 targets specific empty cells with full context. This is significantly cheaper and faster than sending everything to AI."
|
|
4223
|
+
},
|
|
4224
|
+
{
|
|
4225
|
+
question: "How does the confidence gate protect values across phases?",
|
|
4226
|
+
answer: "The confidence gate is enforced on every grid write. Once a cell is filled with a confidence score of 0.7 or higher, no later phase can overwrite it with a lower-confidence value. This prevents high-quality Phase 1 lookup results (0.95 confidence) from being replaced by lower-confidence Phase 2 AI extractions (0.65 confidence). The gate is the single most important pipeline invariant."
|
|
2900
4227
|
}
|
|
2901
4228
|
],
|
|
2902
4229
|
mentions: ["4-phase pipeline", "fill rate", "progressive rendering", "phase timeline"]
|
|
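Reviewer sketch for the confidence gate FAQ above: the rule can be expressed as a guard on every grid write. The 0.7 threshold comes from the FAQ text. The function name `writeCell`, the cell shape, and the handling of cases the docs do not spell out (a higher-confidence candidate, or an existing value below the threshold) are assumptions.

```javascript
// Threshold taken from the FAQ; everything else here is illustrative.
const CONFIDENCE_GATE = 0.7;

// Returns the cell that should be stored after a write attempt.
function writeCell(existing, candidate) {
  // Empty cells can always be filled.
  if (existing === null || existing.value === null) {
    return candidate;
  }
  // Gate: a value at or above the threshold is never replaced
  // by a lower-confidence value from a later phase.
  if (existing.confidence >= CONFIDENCE_GATE && candidate.confidence < existing.confidence) {
    return existing;
  }
  // Assumption: otherwise the candidate is accepted.
  return candidate;
}

const phase1 = { value: "ACME Corporation", confidence: 0.95, phase: 1 };
const phase2 = { value: "Acme Corp.", confidence: 0.65, phase: 2 };
console.log(writeCell(phase1, phase2).value);
```

Putting the check in the single write path, rather than in each phase, is what makes the rule an invariant: no phase can bypass it.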
@@ -2959,6 +4286,48 @@ var sections6 = [
|
|
|
2959
4286
|
type: "paragraph",
|
|
2960
4287
|
text: `For example, consider an invoice with a "Vendor Name" field. The system first checks the Field Registry for a direct transfer \u2014 if "Vendor Name" was extracted from a previous document and promoted to Tier 1, it resolves instantly at 0.85+ confidence. If no registry match exists, the raw extraction mapping looks for a semantically equivalent field in the document's extracted data (e.g., "supplier_name"). If that also misses, the 3-tier lookup cascade checks the reference table: exact normalization first (0.95), then fuzzy token overlap (~0.70), then AI fallback (0.50). Only if all four strategies fail does the cell pass to Phase 2 for AI extraction.`
|
|
2961
4288
|
},
|
|
4289
|
+
{
|
|
4290
|
+
type: "code",
|
|
4291
|
+
language: "bash",
|
|
4292
|
+
title: "Check Field Registry stats to predict Phase 1 coverage",
|
|
4293
|
+
code: `curl -s https://api.talonic.com/v1/fields/stats \\
|
|
4294
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4295
|
+
|
|
4296
|
+
# Response:
|
|
4297
|
+
# {
|
|
4298
|
+
# "total_fields": 342,
|
|
4299
|
+
# "tier_1": 89,
|
|
4300
|
+
# "tier_2": 156,
|
|
4301
|
+
# "tier_3": 97,
|
|
4302
|
+
# "total_occurrences": 14280,
|
|
4303
|
+
# "avg_occurrence_count": 41.8
|
|
4304
|
+
# }
|
|
4305
|
+
|
|
4306
|
+
# Higher tier_1 + tier_2 counts mean more Phase 1 coverage.
|
|
4307
|
+
# Tier 1 fields resolve at 0.85+ confidence via direct transfer.`
|
|
4308
|
+
},
|
|
4309
|
+
{
|
|
4310
|
+
type: "code",
|
|
4311
|
+
language: "bash",
|
|
4312
|
+
title: "View strategy yield after a completed job",
|
|
4313
|
+
code: `curl -s https://api.talonic.com/v1/jobs/runs/run_abc123 \\
|
|
4314
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
4315
|
+
| jq '.grid_stats.strategy_yield'
|
|
4316
|
+
|
|
4317
|
+
# {
|
|
4318
|
+
# "registry_transfer": 0.44,
|
|
4319
|
+
# "raw_extraction_mapping": 0.08,
|
|
4320
|
+
# "lookup_cascade": 0.11,
|
|
4321
|
+
# "deterministic_compute": 0.06,
|
|
4322
|
+
# "llm_extract": 0.25,
|
|
4323
|
+
# "phase3_lookup": 0.03,
|
|
4324
|
+
# "empty": 0.03
|
|
4325
|
+
# }`
|
|
4326
|
+
},
|
|
4327
|
+
{
|
|
4328
|
+
type: "paragraph",
|
|
4329
|
+
text: "Understanding Phase 1 resolution strategies helps you optimize your pipeline. Registry transfer is the highest-quality strategy \u2014 it maps extracted field occurrences from the Field Registry directly to schema fields. Raw extraction mapping handles cases where the extracted field name does not exactly match the registry but can be inferred from the document raw extraction data. The lookup cascade (normalization, fuzzy, AI fallback) resolves reference table fields. Deterministic compute handles formula-based fields like totals and differences. By monitoring the strategy yield across jobs, you can identify which strategies are carrying the most weight and where to invest in improving coverage."
|
|
4330
|
+
},
|
|
2962
4331
|
{
|
|
2963
4332
|
type: "callout",
|
|
2964
4333
|
text: "Phase 1 fill rates improve over time as your Field Registry grows. The more documents you process, the richer the registry becomes, and the more cells Phase 1 can resolve without AI \u2014 reducing both cost and latency for every subsequent job."
|
|
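Reviewer sketch for the lookup cascade described in this hunk: exact normalization (0.95), then fuzzy token overlap (about 0.70), then AI fallback (0.50). The confidence values come from the docs. The normalization rule, the overlap metric, the 0.5 fuzzy threshold, and the reference table shape are assumptions, and the AI fallback tier is not modeled.

```javascript
// Assumed normalization: lowercase, collapse non-alphanumerics to single spaces.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Assumed metric: shared tokens divided by the larger token set.
function tokenOverlap(a, b) {
  const ta = new Set(normalize(a).split(" ").filter(Boolean));
  const tb = new Set(normalize(b).split(" ").filter(Boolean));
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared += 1;
  return shared / Math.max(ta.size, tb.size);
}

// table: [{ code, labels: [...] }] (hypothetical shape).
function lookupCascade(raw, table, fuzzyThreshold = 0.5) {
  const target = normalize(raw);
  // Tier 1: exact match after normalization.
  for (const entry of table) {
    if (entry.labels.some((label) => normalize(label) === target)) {
      return { code: entry.code, confidence: 0.95, tier: "exact" };
    }
  }
  // Tier 2: best fuzzy token overlap above the threshold.
  let best = null;
  for (const entry of table) {
    for (const label of entry.labels) {
      const score = tokenOverlap(raw, label);
      if (score >= fuzzyThreshold && (best === null || score > best.score)) {
        best = { code: entry.code, score };
      }
    }
  }
  if (best !== null) {
    return { code: best.code, confidence: 0.7, tier: "fuzzy" };
  }
  // Tier 3 (AI fallback) is out of scope for this sketch.
  return null;
}

const contractTypes = [
  { code: "FA", labels: ["Framework Agreement", "Frame Agreement"] },
  { code: "NDA", labels: ["Non-Disclosure Agreement"] },
];
console.log(lookupCascade("frame agreement", contractTypes));
```

Adding a synonym to `labels` moves a value from the fuzzy tier to the exact tier, which is the mechanism behind the advice to grow reference tables with common value variations.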
@@ -2981,6 +4350,10 @@ var sections6 = [
|
|
|
2981
4350
|
{
|
|
2982
4351
|
question: "Does Phase 1 performance improve over time?",
|
|
2983
4352
|
answer: "Yes. As your Field Registry grows from processing more documents, Phase 1 can resolve a higher percentage of cells through graph matches. Mature registries often see Phase 1 fill rates of 60-80%."
|
|
4353
|
+
},
|
|
4354
|
+
{
|
|
4355
|
+
question: "How can I improve my Phase 1 fill rate?",
|
|
4356
|
+
answer: "Three approaches: process more documents to grow the Field Registry (more Tier 1 and Tier 2 fields mean more direct transfers), add comprehensive reference tables with common value variations (improves lookup cascade hits), and use descriptive field names in your schemas that match the canonical registry names (improves exact match rates). Check the strategy_yield in job results to identify which resolution methods are underperforming."
|
|
2984
4357
|
}
|
|
2985
4358
|
],
|
|
2986
4359
|
mentions: [
|
|
@@ -3047,6 +4420,36 @@ var sections6 = [
|
|
|
3047
4420
|
type: "paragraph",
|
|
3048
4421
|
text: "For fields backed by a **reference table**, Phase 2 includes the table's codes and labels directly in the extraction prompt so the AI picks canonical codes rather than free-text labels. This tight integration between reference tables and AI extraction produces cleaner output that requires fewer corrections. Fields with fewer than 50 reference entries get the full table in the prompt; larger tables are handled by the Phase 3 lookup cascade instead."
|
|
3049
4422
|
},
|
|
4423
|
+
{
|
|
4424
|
+
type: "code",
|
|
4425
|
+
language: "bash",
|
|
4426
|
+
title: "View the agent strategy audit trail for a job",
|
|
4427
|
+
code: `curl -s https://api.talonic.com/v1/jobs/runs/run_abc123/strategy \\
|
|
4428
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4429
|
+
|
|
4430
|
+
# Response:
|
|
4431
|
+
# {
|
|
4432
|
+
# "strategies": [
|
|
4433
|
+
# {
|
|
4434
|
+
# "document_id": "doc_001",
|
|
4435
|
+
# "field": "total_amount",
|
|
4436
|
+
# "action": "compute",
|
|
4437
|
+
# "expression": "unit_price * quantity",
|
|
4438
|
+
# "confidence": 0.95
|
|
4439
|
+
# },
|
|
4440
|
+
# {
|
|
4441
|
+
# "document_id": "doc_001",
|
|
4442
|
+
# "field": "payment_terms",
|
|
4443
|
+
# "action": "extract",
|
|
4444
|
+
# "reasoning": "No existing value or computable relationship found"
|
|
4445
|
+
# }
|
|
4446
|
+
# ]
|
|
4447
|
+
# }`
|
|
4448
|
+
},
|
|
4449
|
+
{
|
|
4450
|
+
type: "paragraph",
|
|
4451
|
+
text: "Phase 2 uses prompt caching to reduce costs on multi-group extraction calls. For each document, the first group call writes the document content to the Anthropic cache, and subsequent group calls read from it. This primer-then-parallel dispatch pattern reduces per-document costs by 60-80% compared to uncached extraction. The caching is transparent \u2014 you do not need to configure anything. Cache entries persist for 5 minutes on the Anthropic side, so iterating on the same documents within that window benefits from warm cache hits at even lower cost."
|
|
4452
|
+
},
|
|
3050
4453
|
{
|
|
3051
4454
|
type: "callout",
|
|
3052
4455
|
variant: "warning",
|
|
@@ -3070,6 +4473,10 @@ var sections6 = [
|
|
|
3070
4473
|
{
|
|
3071
4474
|
question: "How many fields does the agent process per AI call?",
|
|
3072
4475
|
answer: "Schema fields are grouped into batches of up to 10 fields per extraction call. This balances extraction quality with throughput \u2014 smaller groups help the AI focus on each field without losing recall."
|
|
4476
|
+
},
|
|
4477
|
+
{
|
|
4478
|
+
question: "How does Phase 2 handle reference table fields?",
|
|
4479
|
+
answer: "For fields backed by a reference table with fewer than 50 entries, Phase 2 includes the full table of codes and labels in the AI prompt so the model picks canonical codes rather than free-text labels. Larger tables are handled by the Phase 3 lookup cascade instead, keeping the prompt size bounded and cost predictable."
|
|
3073
4480
|
}
|
|
3074
4481
|
],
|
|
3075
4482
|
mentions: [
|
|
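Reviewer sketch for the two Phase 2 FAQs above: grouping schema fields into batches of up to 10 per extraction call, and inlining a reference table only when it has fewer than 50 entries. The two limits come from the docs. The function and constant names are hypothetical.

```javascript
// Limits taken from the docs; names are illustrative.
const MAX_FIELDS_PER_CALL = 10;
const MAX_INLINE_REFERENCE_ENTRIES = 50;

// Split schema fields into consecutive groups of at most `size` fields.
function groupFields(fields, size = MAX_FIELDS_PER_CALL) {
  const groups = [];
  for (let i = 0; i < fields.length; i += size) {
    groups.push(fields.slice(i, i + size));
  }
  return groups;
}

// "Fewer than 50 entries" means the full table goes into the prompt;
// larger tables are left to the Phase 3 lookup cascade.
function shouldInlineReferenceTable(entries) {
  return entries.length < MAX_INLINE_REFERENCE_ENTRIES;
}

const fields = Array.from({ length: 23 }, (_, i) => `field_${i + 1}`);
console.log(groupFields(fields).map((group) => group.length));
```

For a 23-field schema this yields three calls per document, and under the prompt caching scheme described in this hunk only the first of those calls writes the document content to the cache.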
@@ -3139,6 +4546,40 @@ var sections6 = [
|
|
|
3139
4546
|
type: "paragraph",
|
|
3140
4547
|
text: "What gets flagged and why depends on cross-field relationships, not just individual values. A **date_sanity** flag fires when temporal fields contradict each other \u2014 for example, a contract end date that falls before the start date, or a signature date after the effective date. An **amount_mismatch** flag fires when a computed total deviates more than 20% from the product of its component values (e.g., monthly rent times term length versus total contract value). The **unexpected_empty** flag fires when a field that appears in over 80% of documents in your registry is missing from this particular document, suggesting the AI may have missed it rather than it being genuinely absent."
|
|
3141
4548
|
},
|
|
4549
|
+
{
|
|
4550
|
+
type: "code",
|
|
4551
|
+
language: "bash",
|
|
4552
|
+
title: "Retrieve validation flags for a completed job",
|
|
4553
|
+
code: `curl -s "https://api.talonic.com/v1/jobs/runs/run_abc123?include=validation_flags" \\
|
|
4554
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4555
|
+
|
|
4556
|
+
# Response includes per-document validation flags:
|
|
4557
|
+
# {
|
|
4558
|
+
# "results": [
|
|
4559
|
+
# {
|
|
4560
|
+
# "document_id": "doc_001",
|
|
4561
|
+
# "validation_flags": [
|
|
4562
|
+
# {
|
|
4563
|
+
# "type": "date_sanity",
|
|
4564
|
+
# "severity": "error",
|
|
4565
|
+
# "fields": ["contract_start_date", "contract_end_date"],
|
|
4566
|
+
# "message": "End date (2024-01-15) precedes start date (2025-03-01)"
|
|
4567
|
+
# },
|
|
4568
|
+
# {
|
|
4569
|
+
# "type": "low_confidence_outlier",
|
|
4570
|
+
# "severity": "warning",
|
|
4571
|
+
# "fields": ["payment_terms"],
|
|
4572
|
+
# "message": "Confidence 0.28 in row with average 0.82"
|
|
4573
|
+
# }
|
|
4574
|
+
# ]
|
|
4575
|
+
# }
|
|
4576
|
+
# ]
|
|
4577
|
+
# }`
|
|
4578
|
+
},
|
|
4579
|
+
{
|
|
4580
|
+
type: "paragraph",
|
|
4581
|
+
text: 'Phase 3 is particularly valuable for catching systematic errors before they reach downstream systems. For example, if your schema includes both a "start date" and "end date" field, the date_sanity check catches temporal inversions that would be difficult to spot in a large grid. Similarly, the amount_mismatch check catches arithmetic inconsistencies between related financial fields \u2014 a total that does not align with its components within a 20% tolerance is flagged for review. These cross-field checks operate on relationships between fields rather than individual values, catching errors that single-field validation cannot detect.'
|
|
4582
|
+
},
|
|
3142
4583
|
{
|
|
3143
4584
|
type: "callout",
|
|
3144
4585
|
text: "Validation flags never modify cell values. They are purely informational annotations that help you prioritize review. The actual cell value and confidence score remain unchanged by Phase 3 flagging."
|
|
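Reviewer sketch for the cross-field checks in this hunk: `date_sanity` and `amount_mismatch` as pure functions over a row. The 20% tolerance and the flag type names come from the docs. The field names (`monthly_rent`, `term_months`, `total_contract_value`) follow the rent example in the text but are assumptions, as is the function name `validateRow`.

```javascript
// Tolerance taken from the docs: more than 20% deviation is flagged.
const AMOUNT_TOLERANCE = 0.2;

// Returns informational flags only; the row is never modified.
function validateRow(row) {
  const flags = [];

  // date_sanity: the end date must not precede the start date.
  if (row.contract_start_date && row.contract_end_date) {
    const start = new Date(row.contract_start_date);
    const end = new Date(row.contract_end_date);
    if (end < start) {
      flags.push({
        type: "date_sanity",
        fields: ["contract_start_date", "contract_end_date"],
      });
    }
  }

  // amount_mismatch: the total deviates >20% from the product of its components.
  const { monthly_rent, term_months, total_contract_value } = row;
  const numeric = [monthly_rent, term_months, total_contract_value]
    .every((n) => typeof n === "number");
  if (numeric) {
    const expected = monthly_rent * term_months;
    const deviation = expected === 0
      ? 0
      : Math.abs(total_contract_value - expected) / Math.abs(expected);
    if (deviation > AMOUNT_TOLERANCE) {
      flags.push({
        type: "amount_mismatch",
        fields: ["monthly_rent", "term_months", "total_contract_value"],
      });
    }
  }
  return flags;
}

console.log(validateRow({
  contract_start_date: "2025-03-01",
  contract_end_date: "2024-01-15",
}));
```

Because the function only returns flags, it matches the documented rule that validation never changes cell values or confidence scores.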
@@ -3161,6 +4602,10 @@ var sections6 = [
|
|
|
3161
4602
|
{
|
|
3162
4603
|
question: "Does Phase 3 modify any cell values?",
|
|
3163
4604
|
answer: "Phase 3 re-runs the reference table lookup cascade to normalize AI-extracted labels to canonical codes. The validation flags themselves are purely informational and do not modify values."
|
|
4605
|
+
},
|
|
4606
|
+
{
|
|
4607
|
+
question: "How does the lookup_failed flag help improve my reference tables?",
|
|
4608
|
+
answer: 'The lookup_failed flag identifies values that the AI extracted but could not map to any code in your reference table. These are direct candidates for adding new entries. Review the flagged values after each job \u2014 if "Frame Agreement" consistently fails lookup against a contract type table, add it as a synonym pointing to the canonical code. Future runs will resolve it at Tier 1 confidence (0.95) without AI fallback.'
|
|
3164
4609
|
}
|
|
3165
4610
|
],
|
|
3166
4611
|
mentions: [
|
|
@@ -3194,6 +4639,26 @@ var sections6 = [
|
|
|
3194
4639
|
type: "paragraph",
|
|
3195
4640
|
text: "Expect Phase 4 to fill 5-15% of remaining empty cells, depending on document complexity and schema coverage. The phase is most effective for fields that require cross-referencing multiple sections of a document or interpreting values in the context of other extracted data. It is less effective for fields that are genuinely absent from the source document \u2014 those will remain empty with an `unresolved` provenance type."
|
|
3196
4641
|
},
|
|
4642
|
+
{
|
|
4643
|
+
type: "code",
|
|
4644
|
+
language: "bash",
|
|
4645
|
+
title: "Resume a failed or stuck job",
|
|
4646
|
+
code: `# If a job fails mid-pipeline, resume from the last completed phase:
|
|
4647
|
+
curl -X POST https://api.talonic.com/v1/jobs/runs/run_abc123/resume \\
|
|
4648
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4649
|
+
|
|
4650
|
+
# Response:
|
|
4651
|
+
# {
|
|
4652
|
+
# "run_id": "run_abc123",
|
|
4653
|
+
# "status": "processing",
|
|
4654
|
+
# "resumed_from_phase": 3,
|
|
4655
|
+
# "message": "Resumed from phase 3 after previous failure"
|
|
4656
|
+
# }`
|
|
4657
|
+
},
|
|
4658
|
+
{
|
|
4659
|
+
type: "paragraph",
|
|
4660
|
+
text: "Phase 4 is where the modifier pipeline and format constraint evaluation happen. The modifier pipeline runs in a strict order \u2014 format transforms first (converting dates and numbers to your target format), then alias mapping (replacing values using a lookup table), and finally max_length truncation. After all modifiers have been applied, format constraints evaluate the final transformed value against the regex pattern. If the value fails, the configured mismatch behavior kicks in. This ordering is important because it means constraints validate the output your downstream system will actually receive, not the raw AI extraction. Original values are always preserved in the original_extractions table, giving you a complete audit trail from raw extraction through every transformation step."
|
|
4661
|
+
},
|
|
3197
4662
|
{
|
|
3198
4663
|
type: "callout",
|
|
3199
4664
|
text: "Phase 4 respects the **confidence gate**: it can only fill empty cells or upgrade cells below the confidence threshold. High-confidence values from Phase 1 are permanently protected. Original values are always preserved in the `original_extractions` table for audit, regardless of whether format constraints clear, flag, or replace them."
|
|
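Reviewer sketch for the Phase 4 ordering described in this hunk: format transform, then alias mapping, then `max_length` truncation, then constraint evaluation on the final value. The order is from the docs. The field configuration shape (`formatTransform`, `aliases`, `maxLength`, `pattern`) and the function name are hypothetical.

```javascript
// Hypothetical field config: { formatTransform?, aliases?, maxLength?, pattern? }.
function runModifierPipeline(rawValue, field) {
  let value = String(rawValue);

  // 1. Format transform (for example, date or number reformatting).
  if (typeof field.formatTransform === "function") {
    value = field.formatTransform(value);
  }
  // 2. Alias mapping via a lookup table.
  if (field.aliases && Object.prototype.hasOwnProperty.call(field.aliases, value)) {
    value = field.aliases[value];
  }
  // 3. max_length truncation.
  if (typeof field.maxLength === "number") {
    value = value.slice(0, field.maxLength);
  }
  // 4. The constraint sees the final transformed value, not the raw extraction.
  const passesConstraint = field.pattern
    ? new RegExp(field.pattern).test(value)
    : true;

  // The raw value is returned alongside the result for audit.
  return { original: rawValue, value, passesConstraint };
}

const dateField = {
  // Assumed transform: DD/MM/YYYY to ISO YYYY-MM-DD.
  formatTransform: (v) => v.replace(/^(\d{2})\/(\d{2})\/(\d{4})$/, "$3-$2-$1"),
  pattern: "^\\d{4}-\\d{2}-\\d{2}$",
};
console.log(runModifierPipeline("15/03/2025", dateField));
```

In this example the raw value would fail the ISO pattern if it were validated first, which is why the docs stress that constraints run after all modifiers.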
@@ -3216,6 +4681,10 @@ var sections6 = [
|
|
|
3216
4681
|
{
|
|
3217
4682
|
question: "What else happens in Phase 4 besides gap filling?",
|
|
3218
4683
|
answer: "Phase 4 also applies deterministic transforms (ISO codes, dates, units), evaluates format constraints (regex validation), and runs the modifier pipeline in a fixed order: format transforms first, then alias mapping, then max_length truncation. Constraint evaluation happens after all modifiers. Original values are always preserved in the original_extractions table for audit, regardless of whether constraints clear, flag, or replace them."
|
|
4684
|
+
},
|
|
4685
|
+
{
|
|
4686
|
+
question: "Can I resume a job that failed during Phase 4?",
|
|
4687
|
+
answer: "Yes. Use the POST /v1/jobs/runs/{id}/resume endpoint. The pipeline detects which phase completed last and resumes from the next phase. Results from completed phases are preserved in the grid \u2014 you never lose work from earlier phases when resuming a failed job."
|
|
3219
4688
|
}
|
|
3220
4689
|
],
|
|
3221
4690
|
mentions: ["Phase 4", "re-read", "gap filling", "confidence gate", "targeted extraction"]
|
|
@@ -3249,6 +4718,42 @@ var sections6 = [
|
|
|
3249
4718
|
type: "paragraph",
|
|
3250
4719
|
text: "For large jobs with hundreds of documents, use a systematic review workflow: first address all **Flagged** rows, then spot-check a random sample of **Clean** rows to build confidence in the overall quality. If you find recurring errors in a specific field, consider updating the schema field's instruction or reference table, then run a new job \u2014 corrections you apply also feed back as training signals for future runs."
|
|
3251
4720
|
},
|
|
4721
|
+
{
|
|
4722
|
+
type: "code",
|
|
4723
|
+
language: "bash",
|
|
4724
|
+
title: "List job results with per-cell provenance",
|
|
4725
|
+
code: `curl -s "https://api.talonic.com/v1/jobs/runs/run_abc123?include=results" \\
|
|
4726
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4727
|
+
|
|
4728
|
+
# Response:
|
|
4729
|
+
# {
|
|
4730
|
+
# "results": [
|
|
4731
|
+
# {
|
|
4732
|
+
# "document_id": "doc_001",
|
|
4733
|
+
# "document_name": "Invoice_ACME_2025.pdf",
|
|
4734
|
+
# "fields": {
|
|
4735
|
+
# "invoice_number": {
|
|
4736
|
+
# "value": "INV-2025-0042",
|
|
4737
|
+
# "confidence": 0.95,
|
|
4738
|
+
# "resolution_type": "graph_match",
|
|
4739
|
+
# "phase": 1
|
|
4740
|
+
# },
|
|
4741
|
+
# "total_amount": {
|
|
4742
|
+
# "value": "12450.00",
|
|
4743
|
+
# "confidence": 0.88,
|
|
4744
|
+
# "resolution_type": "agent_derived",
|
|
4745
|
+
# "phase": 2,
|
|
4746
|
+
# "reasoning": "Extracted from invoice summary table, row 'Total Due'"
|
|
4747
|
+
# }
|
|
4748
|
+
# }
|
|
4749
|
+
# }
|
|
4750
|
+
# ]
|
|
4751
|
+
# }`
|
|
4752
|
+
},
|
|
4753
|
+
{
|
|
4754
|
+
type: "paragraph",
|
|
4755
|
+
text: "The strategy yield breakdown on the job detail page gives you an at-a-glance view of pipeline efficiency. A high percentage of registry_transfer and deterministic_compute cells means your Field Registry is mature and Phase 1 is doing most of the work at zero AI cost. A high percentage of llm_extract cells suggests opportunities to improve \u2014 add more documents to grow the registry, refine reference tables for better lookup coverage, or add bypass strategies for fields with predictable values. Tracking strategy yield across multiple jobs over time is the best way to measure whether your pipeline is becoming more efficient."
|
|
4756
|
+
},
|
|
3252
4757
|
{
|
|
3253
4758
|
type: "callout",
|
|
3254
4759
|
text: "The full CSV export includes metadata columns for each field: confidence score, resolution type, phase number, and reasoning trace. Use this export for audit trails or to analyze extraction performance across your document corpus. The clean export omits metadata and includes only the extracted values, ready for direct import into downstream systems."
|
|
@@ -3271,6 +4776,10 @@ var sections6 = [
|
|
|
3271
4776
|
{
|
|
3272
4777
|
question: "What is the most efficient way to review a large extraction run?",
|
|
3273
4778
|
answer: "Start with the Flagged filter to address cells with validation warnings, low confidence, or format mismatches. Then spot-check a random sample of Clean rows. Focus corrections on recurring field-level patterns rather than individual cells. If you find a field that is consistently wrong, update its manual instruction or reference table in the schema rather than correcting cells one by one \u2014 this improves future runs as well."
|
|
4779
|
+
},
|
|
4780
|
+
{
|
|
4781
|
+
question: "How do I delete a job run I no longer need?",
|
|
4782
|
+
answer: "Use DELETE /v1/jobs/runs/{id} to remove a completed run and its results. This frees storage, but the deletion is permanent: the run and its results cannot be recovered. Consider exporting results via CSV before deleting if you need the data for historical reference or compliance."
|
|
3274
4783
|
}
|
|
3275
4784
|
],
|
|
3276
4785
|
mentions: [
|
|
@@ -3346,6 +4855,27 @@ var sections6 = [
|
|
|
3346
4855
|
type: "paragraph",
|
|
3347
4856
|
text: "Use confidence scores to set your review threshold. Cells above 0.8 are generally reliable and can be trusted without manual verification for most use cases. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. You can use the full CSV export to filter and sort by confidence, making it easy to batch-review low-confidence cells efficiently."
|
|
3348
4857
|
},
|
|
4858
|
+
{
|
|
4859
|
+
type: "code",
|
|
4860
|
+
language: "bash",
|
|
4861
|
+
title: "Export results with full provenance metadata",
|
|
4862
|
+
code: `# The full CSV export includes metadata columns for each field:
|
|
4863
|
+
# field_name, field_name__confidence, field_name__resolution_type,
|
|
4864
|
+
# field_name__phase, field_name__reasoning
|
|
4865
|
+
|
|
4866
|
+
curl -s "https://api.talonic.com/v1/jobs/runs/run_abc123?include=results" \\
|
|
4867
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
4868
|
+
-H "Accept: text/csv"
|
|
4869
|
+
|
|
4870
|
+
# CSV output:
|
|
4871
|
+
# document_name,invoice_number,invoice_number__confidence,invoice_number__resolution_type,...
|
|
4872
|
+
# Invoice_ACME.pdf,INV-2025-0042,0.95,graph_match,...
|
|
4873
|
+
# Contract_Beta.pdf,CTR-2025-001,0.72,agent_derived,...`
|
|
4874
|
+
},
|
|
4875
|
+
{
|
|
4876
|
+
type: "paragraph",
|
|
4877
|
+
text: 'Provenance metadata enables powerful downstream analytics. By exporting results with full metadata, you can build dashboards that track extraction confidence by field, document type, or time period. For example, if you notice that the "payment_terms" field consistently scores below 0.6 confidence, this is a signal to refine its extraction instruction or add reference table entries. Teams that track provenance metrics across jobs can quantify the ROI of schema improvements \u2014 each instruction refinement or reference table update shows up as a measurable confidence increase in subsequent runs.'
|
|
4878
|
+
},
|
|
3349
4879
|
{
|
|
3350
4880
|
type: "callout",
|
|
3351
4881
|
variant: "warning",
|
|
@@ -3369,6 +4899,10 @@ var sections6 = [
|
|
|
3369
4899
|
{
|
|
3370
4900
|
question: "What confidence threshold should I use for manual review?",
|
|
3371
4901
|
answer: "Cells above 0.8 are generally reliable. Cells between 0.5 and 0.8 warrant a quick check. Cells below 0.5 should always be reviewed manually. Use the CSV export to filter by confidence for efficient batch review."
|
|
4902
|
+
},
|
|
4903
|
+
{
|
|
4904
|
+
question: "How can I use provenance data to improve extraction quality?",
|
|
4905
|
+
answer: "Track confidence scores by field across multiple jobs. Fields that consistently score below 0.6 are candidates for improvement \u2014 refine the extraction instruction, add reference table entries, or provide more example documents. The resolution_type breakdown shows whether the system is relying on registry transfers (ideal) or AI extraction (expensive). Shifting more cells from agent_derived to graph_match reduces cost and improves reliability."
|
|
3372
4906
|
}
|
|
3373
4907
|
],
|
|
3374
4908
|
mentions: [
|
|
@@ -3402,6 +4936,43 @@ var sections6 = [
|
|
|
3402
4936
|
type: "paragraph",
|
|
3403
4937
|
text: "For best results, correct the root cause rather than individual symptoms. If a field consistently produces wrong values, update the schema field's **manual instruction** or **reference table** rather than correcting cells one by one. If a reference table code is missing, add it to the table \u2014 future runs will pick it up automatically at Tier 1 confidence (0.95). Corrections are most valuable as a feedback mechanism when they inform schema improvements."
|
|
3404
4938
|
},
|
|
4939
|
+
{
|
|
4940
|
+
type: "code",
|
|
4941
|
+
language: "bash",
|
|
4942
|
+
title: "View job results with correction audit trail",
|
|
4943
|
+
code: `curl -s "https://api.talonic.com/v1/jobs/runs/run_abc123?include=results,corrections" \\
|
|
4944
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
4945
|
+
|
|
4946
|
+
# Response includes correction history per cell:
|
|
4947
|
+
# {
|
|
4948
|
+
# "results": [
|
|
4949
|
+
# {
|
|
4950
|
+
# "document_id": "doc_001",
|
|
4951
|
+
# "fields": {
|
|
4952
|
+
# "vendor_name": {
|
|
4953
|
+
# "value": "ACME Corporation",
|
|
4954
|
+
# "confidence": 1.0,
|
|
4955
|
+
# "resolution_type": "manual_correction",
|
|
4956
|
+
# "corrections": [
|
|
4957
|
+
# {
|
|
4958
|
+
# "original_value": "Acme Corp.",
|
|
4959
|
+
# "corrected_value": "ACME Corporation",
|
|
4960
|
+
# "corrected_by": "user@example.com",
|
|
4961
|
+
# "corrected_at": "2025-04-21T14:30:00Z",
|
|
4962
|
+
# "propagation": "all_similar",
|
|
4963
|
+
# "affected_count": 8
|
|
4964
|
+
# }
|
|
4965
|
+
# ]
|
|
4966
|
+
# }
|
|
4967
|
+
# }
|
|
4968
|
+
# }
|
|
4969
|
+
# ]
|
|
4970
|
+
# }`
|
|
4971
|
+
},
|
|
4972
|
+
{
|
|
4973
|
+
type: "paragraph",
|
|
4974
|
+
text: "Corrections are most powerful when they inform structural improvements to your schema. If you find yourself correcting the same field repeatedly across multiple jobs, this is a strong signal that the underlying schema configuration needs attention. Common fixes include adding missing synonyms to a reference table (so Tier 1 lookup resolves the value automatically), refining the manual extraction instruction (so the AI extracts the correct value in the first place), and adjusting a format constraint pattern (so valid values are not incorrectly flagged). Each of these root-cause fixes eliminates the need for future corrections on that field, compounding time savings across every subsequent job."
|
|
4975
|
+
},
|
|
3405
4976
|
{
|
|
3406
4977
|
type: "callout",
|
|
3407
4978
|
text: "Corrections with **all_similar** propagation apply instantly across all documents in the run. Use this for systematic errors like wrong reference table mappings, but verify the preview count before confirming \u2014 the system shows how many cells will be affected. For recurring field-level errors, consider updating the schema instruction or reference table rather than correcting cells individually across multiple runs."
|
|
@@ -3424,6 +4995,10 @@ var sections6 = [
|
|
|
3424
4995
|
{
|
|
3425
4996
|
question: "Is there an audit trail for corrections?",
|
|
3426
4997
|
answer: "Yes. Every correction logs the original value, the corrected value, the user who made the change, and the timestamp. This audit history is preserved even after subsequent jobs run and is included in full metadata CSV exports. Downstream systems can use this data to distinguish between AI-extracted and human-corrected values."
|
|
4998
|
+
},
|
|
4999
|
+
{
|
|
5000
|
+
question: "What is the difference between this_document_only and all_similar propagation?",
|
|
5001
|
+
answer: "this_document_only corrects a single cell in one document. all_similar finds every cell in the run where the same field was resolved using the same method and source, and applies the same correction to all of them. Use all_similar for systematic errors \u2014 for example, when a reference table consistently maps a vendor name to the wrong code. The system previews the count of affected cells before applying, so you can verify the scope before confirming."
|
|
3427
5002
|
}
|
|
3428
5003
|
],
|
|
3429
5004
|
mentions: [
|
|
@@ -3502,6 +5077,49 @@ var sections7 = [
|
|
|
3502
5077
|
"Manual overrides available in the Field Registry"
|
|
3503
5078
|
]
|
|
3504
5079
|
},
|
|
5080
|
+
{
|
|
5081
|
+
type: "code",
|
|
5082
|
+
language: "bash",
|
|
5083
|
+
title: "List fields with their link key classifications",
|
|
5084
|
+
code: `curl -s "https://api.talonic.com/v1/fields?link_key=true" \\
|
|
5085
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5086
|
+
|
|
5087
|
+
# Response:
|
|
5088
|
+
# {
|
|
5089
|
+
# "fields": [
|
|
5090
|
+
# {
|
|
5091
|
+
# "id": "fr_vendor",
|
|
5092
|
+
# "canonical_name": "vendor_name",
|
|
5093
|
+
# "link_key_category": "identity",
|
|
5094
|
+
# "tier": 1,
|
|
5095
|
+
# "occurrence_count": 289
|
|
5096
|
+
# },
|
|
5097
|
+
# {
|
|
5098
|
+
# "id": "fr_po",
|
|
5099
|
+
# "canonical_name": "purchase_order_number",
|
|
5100
|
+
# "link_key_category": "transaction",
|
|
5101
|
+
# "tier": 1,
|
|
5102
|
+
# "occurrence_count": 415
|
|
5103
|
+
# }
|
|
5104
|
+
# ]
|
|
5105
|
+
# }`
|
|
5106
|
+
},
|
|
5107
|
+
{
|
|
5108
|
+
type: "code",
|
|
5109
|
+
language: "bash",
|
|
5110
|
+
title: "Update a field link key classification",
|
|
5111
|
+
code: `curl -X PATCH https://api.talonic.com/v1/fields/fr_project_code \\
|
|
5112
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
5113
|
+
-H "Content-Type: application/json" \\
|
|
5114
|
+
-d '{ "link_key_category": "reference" }'
|
|
5115
|
+
|
|
5116
|
+
# Manual overrides take precedence over automatic classification
|
|
5117
|
+
# and persist across all future linking runs.`
|
|
5118
|
+
},
|
|
5119
|
+
{
|
|
5120
|
+
type: "paragraph",
|
|
5121
|
+
text: "Link key classification is the foundation of the entire linking and case formation system. Getting the classification right is critical for producing meaningful cases. If too many fields are classified as link keys, cases become over-connected with hundreds of documents grouped together. If too few fields are classified, documents that should be related end up as isolated singletons. Review your link key assignments after the first batch of documents to ensure the automatic classifications make sense for your domain, and use manual overrides to correct any misclassifications."
|
|
5122
|
+
},
|
|
3505
5123
|
{
|
|
3506
5124
|
type: "callout",
|
|
3507
5125
|
text: "Link key classification runs automatically when new fields appear in the registry. You do not need to trigger it manually \u2014 just upload documents and the system handles the rest. Manual overrides in the Field Registry take precedence over automatic classifications and persist across future jobs."
|
|
@@ -3524,6 +5142,10 @@ var sections7 = [
|
|
|
3524
5142
|
{
|
|
3525
5143
|
question: "Can I manually classify a field as a link key?",
|
|
3526
5144
|
answer: "Yes. Navigate to the Field Registry and change any field's link key category. Manual classifications take precedence over automatic ones and persist across future jobs."
|
|
5145
|
+
},
|
|
5146
|
+
{
|
|
5147
|
+
question: "How do I check which fields are currently classified as link keys?",
|
|
5148
|
+
answer: "Use GET /v1/fields?link_key=true to list all fields with a link key classification. The response includes the category (identity, transaction, or reference), tier, and occurrence count for each field. You can also view link key classifications in the Field Registry page of the dashboard."
|
|
3527
5149
|
}
|
|
3528
5150
|
],
|
|
3529
5151
|
mentions: [
|
|
@@ -3572,6 +5194,37 @@ var sections7 = [
|
|
|
3572
5194
|
"Handles minor naming variations automatically; wildly inconsistent names may need manual tuning"
|
|
3573
5195
|
]
|
|
3574
5196
|
},
|
|
5197
|
+
{
|
|
5198
|
+
type: "code",
|
|
5199
|
+
language: "bash",
|
|
5200
|
+
title: "List linking results for a document",
|
|
5201
|
+
code: `curl -s "https://api.talonic.com/v1/linking?document_id=doc_001" \\
|
|
5202
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5203
|
+
|
|
5204
|
+
# Response:
|
|
5205
|
+
# {
|
|
5206
|
+
# "links": [
|
|
5207
|
+
# {
|
|
5208
|
+
# "entity_value": "acme corporation",
|
|
5209
|
+
# "original_value": "ACME Corp.",
|
|
5210
|
+
# "link_key_field": "vendor_name",
|
|
5211
|
+
# "category": "identity",
|
|
5212
|
+
# "connected_documents": ["doc_001", "doc_005", "doc_012"]
|
|
5213
|
+
# },
|
|
5214
|
+
# {
|
|
5215
|
+
# "entity_value": "po-2025-0042",
|
|
5216
|
+
# "original_value": "PO-2025-0042",
|
|
5217
|
+
# "link_key_field": "purchase_order_number",
|
|
5218
|
+
# "category": "transaction",
|
|
5219
|
+
# "connected_documents": ["doc_001", "doc_003"]
|
|
5220
|
+
# }
|
|
5221
|
+
# ]
|
|
5222
|
+
# }`
|
|
5223
|
+
},
|
|
5224
|
+
{
|
|
5225
|
+
type: "paragraph",
|
|
5226
|
+
text: 'The normalization engine handles a comprehensive set of corporate suffixes across multiple languages: Ltd, Inc, Corp, GmbH, AG, SA, SAS, BV, NV, Pty, and many more. It also collapses multiple whitespace characters, strips leading and trailing punctuation, and lowercases all values before comparison. This means "ACME Corp.", "Acme Corporation", "acme corp", and "ACME CORP" all resolve to the same canonical entity. For edge cases where the automatic normalization is insufficient \u2014 such as when a company uses entirely different trading names \u2014 you can manually link documents through the case management interface.'
|
|
5227
|
+
},
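The normalization steps listed above (lowercasing, trimming punctuation, collapsing whitespace, stripping corporate suffixes) can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual code: the suffix list is a short hypothetical subset, `corporation` is included as an assumption so that the four ACME spellings converge, and the function name `normalizeEntity` is invented for the example.

```javascript
// Hypothetical subset of suffixes; the real list is longer and not documented here.
const SUFFIXES = new Set([
  "ltd", "inc", "corp", "corporation", "gmbh", "ag", "sa", "sas", "bv", "nv", "pty"
]);

function normalizeEntity(raw) {
  const tokens = String(raw)
    .toLowerCase()
    // Strip leading and trailing characters that are neither letters nor digits.
    .replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "")
    // Splitting on whitespace runs also collapses repeated spaces.
    .split(/\s+/)
    .filter(Boolean);

  // Drop trailing suffix tokens, but never the last remaining token.
  while (tokens.length > 1) {
    const last = tokens[tokens.length - 1].replace(/[.,]+$/, "");
    if (!SUFFIXES.has(last)) break;
    tokens.pop();
  }
  return tokens.join(" ");
}
```

With this sketch, "ACME Corp.", "Acme Corporation", "acme corp", and "ACME CORP" all reduce to the same key. The canonical value the real engine stores may differ (the sample response above shows "acme corporation"), so treat the output format here as illustrative only.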
|
|
3575
5228
|
{
|
|
3576
5229
|
type: "callout",
|
|
3577
5230
|
text: "Entity linking is incremental \u2014 when new documents arrive, the pipeline extends the existing graph rather than rebuilding it from scratch. Existing cases grow as new connections are discovered. This means your cases stay up-to-date without manual intervention as new documents flow into the workspace."
|
|
@@ -3594,6 +5247,10 @@ var sections7 = [
|
|
|
3594
5247
|
{
|
|
3595
5248
|
question: "What normalization does entity linking apply?",
|
|
3596
5249
|
answer: "Values are lowercased, common suffixes (Ltd, Inc, Corp, etc.) are stripped, and whitespace is normalized. This ensures minor naming variations resolve to the same entity."
|
|
5250
|
+
},
|
|
5251
|
+
{
|
|
5252
|
+
question: "What happens when new documents are added to an existing workspace?",
|
|
5253
|
+
answer: "Entity linking is incremental. New documents extend the existing bipartite graph rather than triggering a rebuild. If a new document shares entity values with existing documents, it is connected to the relevant entity nodes and may be added to existing cases. This happens automatically without manual intervention."
|
|
3597
5254
|
}
|
|
3598
5255
|
],
|
|
3599
5256
|
mentions: [
|
|
@@ -3666,6 +5323,56 @@ var sections7 = [
|
|
|
3666
5323
|
"Export Evidence and Timeline to MD, CSV, or JSON"
|
|
3667
5324
|
]
|
|
3668
5325
|
},
|
|
5326
|
+
{
|
|
5327
|
+
type: "code",
|
|
5328
|
+
language: "bash",
|
|
5329
|
+
title: "List cases with anomaly counts",
|
|
5330
|
+
code: `curl -s "https://api.talonic.com/v1/cases?limit=10" \\
|
|
5331
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5332
|
+
|
|
5333
|
+
# Response:
|
|
5334
|
+
# {
|
|
5335
|
+
# "cases": [
|
|
5336
|
+
# {
|
|
5337
|
+
# "id": "case_abc",
|
|
5338
|
+
# "label": "ACME Corp Invoice #4521 \u2192 PO #8890",
|
|
5339
|
+
# "document_count": 3,
|
|
5340
|
+
# "anomaly_count": 1,
|
|
5341
|
+
# "lifecycle": "discovered",
|
|
5342
|
+
# "created_at": "2025-04-18T10:00:00Z"
|
|
5343
|
+
# }
|
|
5344
|
+
# ],
|
|
5345
|
+
# "total": 42
|
|
5346
|
+
# }`
|
|
5347
|
+
},
|
|
5348
|
+
{
|
|
5349
|
+
type: "code",
|
|
5350
|
+
language: "bash",
|
|
5351
|
+
title: "Get case detail with documents and entities",
|
|
5352
|
+
code: `curl -s https://api.talonic.com/v1/cases/case_abc \\
|
|
5353
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5354
|
+
|
|
5355
|
+
# Response:
|
|
5356
|
+
# {
|
|
5357
|
+
# "id": "case_abc",
|
|
5358
|
+
# "label": "ACME Corp Invoice #4521 \u2192 PO #8890",
|
|
5359
|
+
# "documents": [
|
|
5360
|
+
# { "id": "doc_001", "name": "Invoice_4521.pdf", "type": "Invoice" },
|
|
5361
|
+
# { "id": "doc_003", "name": "PO_8890.pdf", "type": "Purchase Order" },
|
|
5362
|
+
# { "id": "doc_005", "name": "Contract_ACME.pdf", "type": "Contract" }
|
|
5363
|
+
# ],
|
|
5364
|
+
# "entities": [
|
|
5365
|
+
# { "value": "acme corporation", "category": "identity", "document_count": 3 },
|
|
5366
|
+
# { "value": "po-2025-8890", "category": "transaction", "document_count": 2 }
|
|
5367
|
+
# ],
|
|
5368
|
+
# "anomaly_count": 1,
|
|
5369
|
+
# "lifecycle": "discovered"
|
|
5370
|
+
# }`
|
|
5371
|
+
},
|
|
5372
|
+
{
|
|
5373
|
+
type: "paragraph",
|
|
5374
|
+
text: 'Cases are the primary unit of holistic document review. Rather than reviewing documents in isolation, cases let you see all related documents together \u2014 an invoice alongside its purchase order and contract. The AI-generated case label gives you immediate context: "ACME Corp Invoice #4521 \u2192 PO #8890" tells you at a glance what the case is about. The anomaly count badge in the header lets you prioritize cases that need attention, and the lifecycle tracking ensures nothing falls through the cracks as your team processes the queue.'
|
|
5375
|
+
},
|
|
3669
5376
|
{
|
|
3670
5377
|
type: "callout",
|
|
3671
5378
|
text: "Case formation runs automatically after entity linking completes. You do not need to create cases manually \u2014 the system discovers them from the document-entity graph. Use merge, split, and edge operations to refine cases when the automatic grouping needs adjustment."
|
|
@@ -3690,6 +5397,10 @@ var sections7 = [
|
|
|
3690
5397
|
{
|
|
3691
5398
|
question: "Can I merge or split cases?",
|
|
3692
5399
|
answer: "Yes. Merge multiple cases into one from the cases list, or split a case into separate groups from the case detail page. You can also pin or remove individual documents and confirm or reject linking edges."
|
|
5400
|
+
},
|
|
5401
|
+
{
|
|
5402
|
+
question: "How do I track case lifecycle status via the API?",
|
|
5403
|
+
answer: "The case lifecycle field is included in both list and detail responses. Cases progress through four stages: discovered (linking engine first identified the cluster), confirmed (reviewer validated the grouping), active (ongoing work), and resolved (all documents reviewed and processed). Filter the case list by lifecycle to focus on cases that need attention."
|
|
3693
5404
|
}
|
|
3694
5405
|
],
|
|
3695
5406
|
mentions: ["cases", "entity groups", "evidence chain", "AI narration", "document connections", "case label", "merge", "split"]
|
|
@@ -3732,6 +5443,31 @@ var sections7 = [
|
|
|
3732
5443
|
"Missing document type anomalies raised when a case does not match its template"
|
|
3733
5444
|
]
|
|
3734
5445
|
},
|
|
5446
|
+
{
|
|
5447
|
+
type: "code",
|
|
5448
|
+
language: "bash",
|
|
5449
|
+
title: "Trigger case template discovery via API",
|
|
5450
|
+
code: `curl -X POST https://api.talonic.com/v1/cases/templates/discover \\
|
|
5451
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5452
|
+
|
|
5453
|
+
# Response:
|
|
5454
|
+
# {
|
|
5455
|
+
# "templates": [
|
|
5456
|
+
# {
|
|
5457
|
+
# "id": "tpl_001",
|
|
5458
|
+
# "pattern": ["Invoice", "Purchase Order", "Contract"],
|
|
5459
|
+
# "case_count": 12,
|
|
5460
|
+
# "match_threshold": 0.75,
|
|
5461
|
+
# "discovered_at": "2025-04-20T09:00:00Z"
|
|
5462
|
+
# }
|
|
5463
|
+
# ],
|
|
5464
|
+
# "discovered": 1
|
|
5465
|
+
# }`
|
|
5466
|
+
},
|
|
5467
|
+
{
|
|
5468
|
+
type: "paragraph",
|
|
5469
|
+
text: 'Case templates serve as completeness monitors. Once a template is established \u2014 for example, "Invoice + Purchase Order + Contract" \u2014 the Missing Document Type anomaly detector (D4) checks every new case against it. If a case matches the template pattern but is missing one of the expected document types, an anomaly is raised. This is particularly valuable in procurement workflows where a complete audit trail requires specific document types to be present. Templates evolve organically as your workspace grows \u2014 the system proposes new templates whenever it detects recurring patterns across 3 or more cases.'
|
|
5470
|
+
},
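The discovery rule described above (a template is proposed once three or more cases share the same document type pattern) and the completeness check can be sketched like this. The case shape `{ id, documentTypes }` and both function names are assumptions made for the example; how the platform decides that a case "matches" a template is not modeled here.

```javascript
// Sketch only: each case is assumed to look like { id, documentTypes: [...] }.
function discoverTemplates(cases, minCases = 3) {
  const groups = new Map(); // pattern key -> { pattern, caseCount }
  for (const c of cases) {
    // Order and duplicates do not matter, so compare sorted unique types.
    const pattern = [...new Set(c.documentTypes)].sort();
    const key = JSON.stringify(pattern);
    const group = groups.get(key) ?? { pattern, caseCount: 0 };
    group.caseCount += 1;
    groups.set(key, group);
  }
  return [...groups.values()].filter((g) => g.caseCount >= minCases);
}

// Document types a template expects that a given case does not contain.
function missingDocumentTypes(template, documentTypes) {
  const present = new Set(documentTypes);
  return template.pattern.filter((type) => !present.has(type));
}
```

A non-empty result from `missingDocumentTypes` corresponds to the situation in which a Missing Document Type anomaly would be raised.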
|
|
3735
5471
|
{
|
|
3736
5472
|
type: "callout",
|
|
3737
5473
|
text: "Templates are auto-discovered \u2014 you do not need to define them manually. The system analyzes existing cases and proposes templates when it detects at least 3 cases sharing the same document type pattern. You can also trigger template discovery manually through the API via POST /v1/cases/templates/discover."
|
|
@@ -3753,6 +5489,10 @@ var sections7 = [
|
|
|
3753
5489
|
{
|
|
3754
5490
|
question: "Can I switch between graph and list views?",
|
|
3755
5491
|
answer: "Yes. Toggle between the visual D3-force graph and a traditional list view from the Cases page. Both views show the same underlying data \u2014 choose whichever suits your workflow."
|
|
5492
|
+
},
|
|
5493
|
+
{
|
|
5494
|
+
question: "How do case templates trigger anomaly detection?",
|
|
5495
|
+
answer: "Once a template is established (e.g., Invoice + Purchase Order + Contract), the Missing Document Type detector (D4) checks every new case against it. If a case matches the template but is missing an expected document type, a D4 anomaly is raised. This helps you catch incomplete transaction bundles before they reach downstream systems."
|
|
3756
5496
|
}
|
|
3757
5497
|
],
|
|
3758
5498
|
mentions: ["document graph", "D3-force layout", "bipartite graph", "case templates"]
|
|
@@ -3825,6 +5565,33 @@ var sections7 = [
|
|
|
3825
5565
|
"D5 \u2014 Value Reuse: identical values across unrelated fields, suggesting copy-paste or extraction errors"
|
|
3826
5566
|
]
|
|
3827
5567
|
},
|
|
5568
|
+
{
|
|
5569
|
+
type: "code",
|
|
5570
|
+
language: "bash",
|
|
5571
|
+
title: "List anomalies for a specific case",
|
|
5572
|
+
code: `curl -s "https://api.talonic.com/v1/cases/case_abc/anomalies" \\
|
|
5573
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5574
|
+
|
|
5575
|
+
# Response:
|
|
5576
|
+
# {
|
|
5577
|
+
# "anomalies": [
|
|
5578
|
+
# {
|
|
5579
|
+
# "id": "anom_001",
|
|
5580
|
+
# "detector": "D2",
|
|
5581
|
+
# "type": "field_conflict",
|
|
5582
|
+
# "severity": "high",
|
|
5583
|
+
# "fields": ["total_amount"],
|
|
5584
|
+
# "description": "total_amount is 12,450 in Invoice_4521.pdf but 12,500 in PO_8890.pdf",
|
|
5585
|
+
# "dismissed": false,
|
|
5586
|
+
# "detected_at": "2025-04-18T10:05:00Z"
|
|
5587
|
+
# }
|
|
5588
|
+
# ]
|
|
5589
|
+
# }`
|
|
5590
|
+
},
|
|
5591
|
+
{
|
|
5592
|
+
type: "paragraph",
|
|
5593
|
+
text: "Anomaly detection is designed for high-throughput triage. In a workspace processing hundreds of cases per week, the anomaly count badges on the cases list page give you an instant heat map of where problems are concentrated. A case with zero anomalies typically needs only a quick spot-check, while a case with multiple D2 (field conflict) or D3 (duplicate key divergence) anomalies warrants detailed investigation. This severity-based prioritization lets your team focus review effort where it has the highest impact, rather than reviewing every case with equal depth."
|
|
5594
|
+
},
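To make the cross-document comparison concrete, here is a minimal sketch of a field conflict check in the spirit of D2. It is not the platform's detector: the document shape `{ name, fields }` and the function name `detectFieldConflicts` are hypothetical, and values are compared as trimmed strings with no numeric or currency normalization.

```javascript
// Sketch only: each document is assumed to look like { name, fields: { key: value } }.
function detectFieldConflicts(documents) {
  const seen = new Map(); // field -> Map(value -> [document names])
  for (const doc of documents) {
    for (const [field, value] of Object.entries(doc.fields)) {
      if (value === null || value === undefined || value === "") continue;
      const byValue = seen.get(field) ?? new Map();
      const key = String(value).trim();
      byValue.set(key, [...(byValue.get(key) ?? []), doc.name]);
      seen.set(field, byValue);
    }
  }
  // A field conflicts when the case contains more than one distinct value for it.
  const conflicts = [];
  for (const [field, byValue] of seen) {
    if (byValue.size > 1) {
      conflicts.push({ field, values: Object.fromEntries(byValue) });
    }
  }
  return conflicts;
}
```

Applied to the invoice and purchase order from the sample response, this reports `total_amount` with the two differing values and the documents that carry them.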
|
|
3828
5595
|
{
|
|
3829
5596
|
type: "callout",
|
|
3830
5597
|
text: "Anomaly detection requires **Advanced mode** to be enabled. In Simple mode, anomalies are still computed but not displayed in the case detail page. Toggle Advanced mode from the sidebar to access the full anomaly workflow."
|
|
@@ -3847,6 +5614,10 @@ var sections7 = [
|
|
|
3847
5614
|
{
|
|
3848
5615
|
question: "Can I dismiss anomalies?",
|
|
3849
5616
|
answer: "Yes. Each anomaly card includes a dismiss button. Dismissed anomalies are hidden by default but can be revealed using the show dismissed toggle on the Anomalies tab."
|
|
5617
|
+
},
|
|
5618
|
+
{
|
|
5619
|
+
question: "What is the difference between anomaly detection and validation checks?",
|
|
5620
|
+
answer: "Validation checks (Phase 3) operate on individual documents \u2014 they flag issues within a single record like date inconsistencies or format mismatches. Anomaly detection operates at the case level \u2014 it compares values across multiple documents within a case to find conflicts, divergences, and missing patterns. Both are complementary: validation catches per-document issues while anomaly detection catches cross-document issues."
|
|
3850
5621
|
}
|
|
3851
5622
|
],
|
|
3852
5623
|
mentions: ["anomaly detection", "validation cluster", "field conflict", "duplicate key divergence", "value reuse"]
|
|
@@ -3904,6 +5675,40 @@ var sections7 = [
|
|
|
3904
5675
|
"Domain packs: industry-specific rules (e.g., freight: DOT numbers, MC numbers)"
|
|
3905
5676
|
]
|
|
3906
5677
|
},
|
|
5678
|
+
{
|
|
5679
|
+
type: "code",
|
|
5680
|
+
language: "bash",
|
|
5681
|
+
title: "View evidence validation results for a case",
|
|
5682
|
+
code: `curl -s "https://api.talonic.com/v1/cases/case_abc/evidence" \\
|
|
5683
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5684
|
+
|
|
5685
|
+
# Response:
|
|
5686
|
+
# {
|
|
5687
|
+
# "evidence": [
|
|
5688
|
+
# {
|
|
5689
|
+
# "document_id": "doc_001",
|
|
5690
|
+
# "field_key": "credit_card_number",
|
|
5691
|
+
# "validator": "S7",
|
|
5692
|
+
# "algorithm": "luhn",
|
|
5693
|
+
# "status": "fail",
|
|
5694
|
+
# "message": "Luhn checksum failed for value 4111-1111-1111-1112",
|
|
5695
|
+
# "severity": "error"
|
|
5696
|
+
# },
|
|
5697
|
+
# {
|
|
5698
|
+
# "document_id": "doc_001",
|
|
5699
|
+
# "field_key": "total_amount",
|
|
5700
|
+
# "validator": "S5",
|
|
5701
|
+
# "status": "fail",
|
|
5702
|
+
#       "message": "Non-numeric characters found in numeric field: '$12,450.00'",
|
|
5703
|
+
# "severity": "warning"
|
|
5704
|
+
# }
|
|
5705
|
+
# ]
|
|
5706
|
+
# }`
|
|
5707
|
+
},
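The failing S7 result in the sample response above comes from a Luhn checksum. The standard Luhn algorithm is short enough to show in full; the function name `luhnValid` and the choice to ignore spaces and hyphens are assumptions for this sketch, and how the platform's validator pre-processes values is not documented here.

```javascript
// Standard Luhn check. Separators (spaces, hyphens) are ignored by assumption.
function luhnValid(input) {
  const digits = String(input).replace(/[\s-]/g, "");
  if (!/^\d+$/.test(digits)) return false; // reject empty or non-numeric input

  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    // Walk from the rightmost digit; double every second digit.
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9; // same as summing the two digits of the product
    }
    sum += d;
  }
  return sum % 10 === 0;
}

console.log(luhnValid("4111-1111-1111-1111")); // true
console.log(luhnValid("4111-1111-1111-1112")); // false
```

The second call uses the value from the sample response, which fails because changing the final digit breaks the checksum.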
|
|
5708
|
+
{
|
|
5709
|
+
type: "paragraph",
|
|
5710
|
+
text: "The evidence validation engine is extensible through domain packs, which add industry-specific rules without modifying the core validators. Each domain pack is a self-contained module that registers its validators during application startup. The freight domain pack, for example, validates DOT numbers against format rules and verifies MC (Motor Carrier) numbers. Additional packs for financial services, healthcare, and legal domains can be added by creating a new module in the domain-packs directory with validator implementations that follow the standard interface. This plug-in architecture means the validation engine grows with your industry needs without accumulating complexity in the core rule set."
|
|
5711
|
+
},
|
|
3907
5712
|
{
|
|
3908
5713
|
type: "callout",
|
|
3909
5714
|
text: "Evidence validation results are stored separately from extraction and linking data. This means you can re-run validation independently without re-extracting documents. Results are keyed by (document_id, entity_id, field_key) for precise field-level tracking."
|
|
@@ -3926,6 +5731,10 @@ var sections7 = [
|
|
|
3926
5731
|
{
|
|
3927
5732
|
question: "How are evidence validation results displayed?",
|
|
3928
5733
|
answer: "Results appear as colored badges in the Evidence tab of the case detail page. Green indicates pass, red indicates fail, and amber indicates a warning. Use the filter bar to narrow results by status, document, or category."
|
|
5734
|
+
},
|
|
5735
|
+
{
|
|
5736
|
+
question: "Can I re-run evidence validation without re-extracting documents?",
|
|
5737
|
+
answer: "Yes. Evidence validation results are stored separately from extraction data, keyed by (document_id, entity_id, field_key). You can re-run validation independently at any time \u2014 for example, after adding a new domain pack or updating validator rules. The results replace the previous validation run without affecting extracted values or linking data."
|
|
3929
5738
|
}
|
|
3930
5739
|
],
|
|
3931
5740
|
mentions: ["evidence validation", "structural validators", "checksum", "Luhn", "IBAN", "domain packs", "freight"]
|
|
@@ -3976,6 +5785,32 @@ var sections8 = [
|
|
|
3976
5785
|
"One template per downstream consumer is the recommended pattern"
|
|
3977
5786
|
]
|
|
3978
5787
|
},
|
|
5788
|
+
{
|
|
5789
|
+
type: "code",
|
|
5790
|
+
language: "bash",
|
|
5791
|
+
title: "List data products and their statuses",
|
|
5792
|
+
code: `curl -s https://api.talonic.com/v1/data-products \\
|
|
5793
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
5794
|
+
|
|
5795
|
+
# Response:
|
|
5796
|
+
# {
|
|
5797
|
+
# "data_products": [
|
|
5798
|
+
# {
|
|
5799
|
+
# "id": "dp_001",
|
|
5800
|
+
# "name": "Q1 2025 Invoice Extract",
|
|
5801
|
+
# "schema_name": "Invoice Extraction",
|
|
5802
|
+
# "document_count": 245,
|
|
5803
|
+
# "status": "completed",
|
|
5804
|
+
# "created_at": "2025-04-01T08:00:00Z"
|
|
5805
|
+
# }
|
|
5806
|
+
# ],
|
|
5807
|
+
# "total": 8
|
|
5808
|
+
# }`
|
|
5809
|
+
},
|
|
5810
|
+
{
|
|
5811
|
+
type: "paragraph",
|
|
5812
|
+
text: "Dataset templates are the bridge between schemas and production-ready data products. While a schema defines what fields to extract, a template defines how those fields appear in the final output \u2014 column order, renamed headers, excluded fields, and default transforms. This separation means you can create multiple templates from the same schema for different downstream consumers. For example, your finance team might need all 20 invoice fields in EUR format, while your operations team needs only 8 fields in USD format. Two templates, one schema, consistent extraction."
|
|
5813
|
+
},
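The separation between schema and template can be illustrated with a small projection function. This is a sketch under assumptions: the template shape `{ columns: [{ field, header, transform }] }` and the name `applyTemplate` are invented for the example and are not the platform's template format.

```javascript
// Sketch only: the template shape below is hypothetical.
// Each column names a schema field, an optional output header,
// and an optional transform applied to the extracted value.
function applyTemplate(row, template) {
  const output = {};
  for (const { field, header = field, transform } of template.columns) {
    const value = row[field] ?? "";
    output[header] = transform ? transform(value) : value;
  }
  // Fields not listed in the template are excluded from the output.
  return output;
}

const financeTemplate = {
  columns: [
    { field: "vendor_name", header: "Vendor" },
    { field: "invoice_number" },
    { field: "total_amount", header: "Total (EUR)", transform: (v) => `${v} EUR` }
  ]
};
```

Two templates with different `columns` arrays can be applied to the same extracted row, which is the "two templates, one schema" pattern described above.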
|
|
3979
5814
|
{
|
|
3980
5815
|
type: "callout",
|
|
3981
5816
|
text: "Dataset templates are workspace-scoped. Any team member can create, edit, or use a template \u2014 there is no per-user ownership restriction. Version your templates alongside schema changes to maintain backward compatibility with existing integrations and downstream consumers."
|
|
@@ -3998,6 +5833,10 @@ var sections8 = [
|
|
|
3998
5833
|
{
|
|
3999
5834
|
question: "Can I version dataset templates?",
|
|
4000
5835
|
answer: "Yes. Each template is versioned independently from the schema it references. This lets you evolve your output format over time without affecting existing data products built from earlier versions."
|
|
5836
|
+
},
|
|
5837
|
+
{
|
|
5838
|
+
question: "Can I create multiple templates from the same schema?",
|
|
5839
|
+
answer: "Yes. This is the recommended pattern for serving different downstream consumers. Each template can include a different subset of schema fields, apply different column mappings and transforms, and be versioned independently. The underlying extraction is identical \u2014 only the output shape differs."
|
|
4001
5840
|
}
|
|
4002
5841
|
],
|
|
4003
5842
|
mentions: [
|
|
@@ -4046,6 +5885,23 @@ var sections8 = [
|
|
|
4046
5885
|
"Export the assembled dataset as CSV with leading zero preservation"
|
|
4047
5886
|
]
|
|
4048
5887
|
},
|
|
5888
|
+
{
|
|
5889
|
+
type: "code",
|
|
5890
|
+
language: "bash",
|
|
5891
|
+
title: "Export an assembled dataset as CSV",
|
|
5892
|
+
code: `# Download the assembled dataset with leading zero preservation:
|
|
5893
|
+
curl -s "https://api.talonic.com/v1/data-products/dp_001/export?format=csv" \\
|
|
5894
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
5895
|
+
-o "q1_invoices.csv"
|
|
5896
|
+
|
|
5897
|
+
# CSV values are never coerced to numbers \u2014 leading zeros
|
|
5898
|
+
# on codes like ZIP codes and account numbers are preserved.
|
|
5899
|
+
# Fields like "00123" remain "00123", not 123.`
|
|
5900
|
+
},
|
|
5901
|
+
{
|
|
5902
|
+
type: "paragraph",
|
|
5903
|
+
text: "Assemblies support incremental workflows that align with real-world business cadences. A common pattern is to create a weekly assembly that pulls all newly arrived documents from connected sources, applies the template transforms, and produces a fresh output. Because previous assembly versions are retained, you can compare this week's output against last week's to identify changes \u2014 new records added, values updated, or documents removed. This diff capability is particularly valuable for reconciliation workflows where you need to track what changed between reporting periods."
|
|
5904
|
+
},
|
|
4049
5905
|
{
|
|
4050
5906
|
type: "callout",
|
|
4051
5907
|
text: "Assemblies are the recommended way to produce production datasets. They provide a single audit trail from source documents through extraction, resolution, and validation to the final output. If your workflow requires repeatable, auditable deliverables, assemblies eliminate the need for manual export configuration on every run."
|
|
@@ -4068,6 +5924,10 @@ var sections8 = [
|
|
|
4068
5924
|
{
|
|
4069
5925
|
question: "Can an assembly pull from multiple sources?",
|
|
4070
5926
|
answer: "Yes. An assembly can combine documents from any number of sources \u2014 uploaded files, connected drives, email attachments, and more \u2014 into a single structured dataset. This is particularly useful for cross-functional reporting where data arrives through different channels. For example, you can combine invoices from a Google Drive connector, purchase orders uploaded manually, and contracts ingested via the API into a single unified procurement dataset."
|
|
5927
|
+
},
|
|
5928
|
+
{
|
|
5929
|
+
question: "How do I regenerate an assembly with updated documents?",
|
|
5930
|
+
answer: "Navigate to the assembly detail page and click Regenerate, or trigger it via the API. The system re-applies the template, pulls the updated document set from all configured sources, and produces a fresh output. The previous assembly version is retained for comparison. No reconfiguration of columns or transforms is needed \u2014 the template handles everything."
|
|
4071
5931
|
}
|
|
4072
5932
|
],
|
|
4073
5933
|
mentions: [
|
|
@@ -4137,6 +5997,32 @@ var sections8 = [
|
|
|
4137
5997
|
type: "paragraph",
|
|
4138
5998
|
text: 'For best results, choose source fields with high uniqueness \u2014 contract numbers or invoice IDs work well, while generic fields like "status" do not. When your documents contain multiple candidate identifiers, configure a fallback chain so the dispenser always has a value to work with. Most teams use the primary reference number as the source field and the document name as the first fallback.'
|
|
4139
5999
|
},
|
|
6000
|
+
{
|
|
6001
|
+
type: "code",
|
|
6002
|
+
language: "bash",
|
|
6003
|
+
title: "Configure ID dispenser rules for a data product",
|
|
6004
|
+
code: `curl -X POST "https://api.talonic.com/v1/data-products/dp_001/id-rules" \\
|
|
6005
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6006
|
+
-H "Content-Type: application/json" \\
|
|
6007
|
+
-d '{
|
|
6008
|
+
"source_field": "invoice_number",
|
|
6009
|
+
"prefix": "INV",
|
|
6010
|
+
"fallback_chain": ["document_name", "upload_date"],
|
|
6011
|
+
"resolution_map": {
|
|
6012
|
+
"ACME Corp": "ACME",
|
|
6013
|
+
"ACME Corporation": "ACME",
|
|
6014
|
+
"Acme": "ACME"
|
|
6015
|
+
}
|
|
6016
|
+
}'
|
|
6017
|
+
|
|
6018
|
+
# Then apply:
|
|
6019
|
+
# POST /v1/data-products/dp_001/generate-ids
|
|
6020
|
+
# Each row receives an ID like "INV-INV2025042" based on the source field.`
|
|
6021
|
+
},
|
|
6022
|
+
{
|
|
6023
|
+
type: "paragraph",
|
|
6024
|
+
text: "ID dispensers solve a common challenge in document processing: generating stable, meaningful identifiers for output rows that can be used as primary keys in downstream databases. Unlike random UUIDs, dispenser-generated IDs are derived from your actual data \u2014 an invoice number, contract reference, or vendor name \u2014 making them human-readable and traceable. The deterministic nature of the generation means the same document always receives the same ID regardless of when or how many times you regenerate, which is critical for maintaining referential integrity with downstream systems that store these IDs as foreign keys."
|
|
6025
|
+
},
|
|
4140
6026
|
{
|
|
4141
6027
|
type: "callout",
|
|
4142
6028
|
text: "ID generation is deterministic \u2014 running **Regenerate IDs** with the same rules and data always produces the same output. This makes ID dispensers safe to re-run without breaking downstream references."
|
|
@@ -4159,6 +6045,10 @@ var sections8 = [
|
|
|
4159
6045
|
{
|
|
4160
6046
|
question: "Can I regenerate IDs without losing data?",
|
|
4161
6047
|
answer: "Yes. Regenerating IDs only updates the ID column \u2014 all other data product values remain unchanged. The operation is deterministic, so the same rules and data always produce the same IDs."
|
|
6048
|
+
},
|
|
6049
|
+
{
|
|
6050
|
+
question: "What makes a good source field for ID generation?",
|
|
6051
|
+
answer: "Choose fields with high cardinality \u2014 values that are unique or nearly unique per document. Invoice numbers, contract references, and purchase order IDs work well. Avoid generic fields like status or document type, which produce collisions. Configure a fallback chain with 1-2 alternative fields so the dispenser always has a value to work with."
|
|
4162
6052
|
}
|
|
4163
6053
|
],
|
|
4164
6054
|
mentions: ["ID dispenser", "unique identifiers", "fallback chain", "resolution map"]
|
|
@@ -4217,6 +6107,27 @@ var sections8 = [
|
|
|
4217
6107
|
"Share tokens are revocable and scoped to a single data product"
|
|
4218
6108
|
]
|
|
4219
6109
|
},
|
|
6110
|
+
{
|
|
6111
|
+
type: "code",
|
|
6112
|
+
language: "bash",
|
|
6113
|
+
title: "Generate a share token for a data product",
|
|
6114
|
+
code: `curl -X POST "https://api.talonic.com/v1/data-products/dp_001/share" \\
|
|
6115
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6116
|
+
|
|
6117
|
+
# Response:
|
|
6118
|
+
# {
|
|
6119
|
+
# "share_token": "stk_a7b3c9d2e4f6",
|
|
6120
|
+
# "public_url": "https://app.talonic.com/share/stk_a7b3c9d2e4f6",
|
|
6121
|
+
# "created_at": "2025-04-22T10:00:00Z"
|
|
6122
|
+
# }
|
|
6123
|
+
|
|
6124
|
+
# Revoke access at any time:
|
|
6125
|
+
# DELETE /v1/data-products/dp_001/share/stk_a7b3c9d2e4f6`
|
|
6126
|
+
},
|
|
6127
|
+
{
|
|
6128
|
+
type: "paragraph",
|
|
6129
|
+
text: "The delivery website behind each share token provides a complete, self-contained view of the data product without requiring Talonic authentication. Recipients can switch between the three views \u2014 Structured Data (raw extraction), Resolved (post-normalization), and Data Product (final assembled output) \u2014 to inspect the data at different pipeline stages. The per-run selector lets viewers compare outputs across pipeline runs or time periods, which is particularly useful for auditors who need to verify how the data changed between reporting cycles. CSV downloads from the delivery website preserve exact cell values including leading zeros, ensuring recipients can import the data into Excel or database tools without data loss."
|
|
6130
|
+
},
|
|
4220
6131
|
{
|
|
4221
6132
|
type: "callout",
|
|
4222
6133
|
text: "Share tokens grant read-only access. Recipients can view and export data but cannot modify the data product, run new jobs, or access any other workspace resources. Revoke a token at any time from the data product detail page."
|
|
@@ -4240,6 +6151,10 @@ var sections8 = [
|
|
|
4240
6151
|
{
|
|
4241
6152
|
question: "What is auto-resolve singles?",
|
|
4242
6153
|
answer: "Auto-resolve singles automatically accepts fields that have only one candidate value, removing them from the manual review queue. Combined with auto-review, this significantly reduces the volume of items requiring human attention."
|
|
6154
|
+
},
|
|
6155
|
+
{
|
|
6156
|
+
question: "How do I revoke a share token?",
|
|
6157
|
+
answer: "Delete the share token from the data product detail page or via DELETE /v1/data-products/{id}/share/{token}. The public URL stops working immediately. The data product itself is unaffected \u2014 only the public access link is removed. You can generate a new share token at any time if you need to re-share."
|
|
4243
6158
|
}
|
|
4244
6159
|
],
|
|
4245
6160
|
mentions: ["share token", "delivery website", "CSV export", "auto-review", "auto-resolve"]
|
|
@@ -4290,6 +6205,49 @@ var sections9 = [
|
|
|
4290
6205
|
"Non-blocking: validation failures flag records for review but do not prevent extraction"
|
|
4291
6206
|
]
|
|
4292
6207
|
},
|
|
6208
|
+
{
|
|
6209
|
+
type: "code",
|
|
6210
|
+
language: "bash",
|
|
6211
|
+
title: "Run a quality benchmark against golden samples",
|
|
6212
|
+
code: `curl -X POST https://api.talonic.com/v1/quality/benchmark \\
|
|
6213
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6214
|
+
-H "Content-Type: application/json" \\
|
|
6215
|
+
-d '{
|
|
6216
|
+
"schema_id": "us_def456",
|
|
6217
|
+
"dataset_ids": ["gs_001", "gs_002", "gs_003"]
|
|
6218
|
+
}'
|
|
6219
|
+
|
|
6220
|
+
# Response:
|
|
6221
|
+
# {
|
|
6222
|
+
# "benchmark_id": "bench_abc",
|
|
6223
|
+
# "status": "processing",
|
|
6224
|
+
# "sample_count": 3,
|
|
6225
|
+
# "created_at": "2025-04-22T14:00:00Z"
|
|
6226
|
+
# }`
|
|
6227
|
+
},
|
|
6228
|
+
{
|
|
6229
|
+
type: "code",
|
|
6230
|
+
language: "bash",
|
|
6231
|
+
title: "Get benchmark results with per-field accuracy",
|
|
6232
|
+
code: `curl -s https://api.talonic.com/v1/quality/benchmark/bench_abc \\
|
|
6233
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6234
|
+
|
|
6235
|
+
# Response:
|
|
6236
|
+
# {
|
|
6237
|
+
# "benchmark_id": "bench_abc",
|
|
6238
|
+
# "status": "completed",
|
|
6239
|
+
# "overall_accuracy": 0.91,
|
|
6240
|
+
# "field_scores": {
|
|
6241
|
+
# "invoice_number": { "accuracy": 1.0, "verdict": "pass" },
|
|
6242
|
+
# "total_amount": { "accuracy": 0.95, "verdict": "pass" },
|
|
6243
|
+
# "payment_terms": { "accuracy": 0.67, "verdict": "fail" }
|
|
6244
|
+
# }
|
|
6245
|
+
# }`
|
|
6246
|
+
},
|
|
6247
|
+
{
|
|
6248
|
+
type: "paragraph",
|
|
6249
|
+
text: "Validation checks form the first line of defense in your data quality strategy. They catch issues during extraction before data reaches downstream systems. The key to effective validation is starting simple and expanding incrementally. Begin with field format checks for your most critical identifiers \u2014 invoice numbers, dates, and monetary amounts \u2014 where format violations have the highest business impact. Then add cross-field consistency rules as you learn your data patterns. AI-proposed coherence rules can accelerate this process by analyzing patterns in completed job results and suggesting candidate rules that you review before activation."
|
|
6250
|
+
},
|
|
4293
6251
|
{
|
|
4294
6252
|
type: "callout",
|
|
4295
6253
|
text: "Validation checks are schema-scoped. Rules defined on one schema do not affect other schemas in the same workspace. This lets you tailor quality rules to each document type independently. Start with a few high-confidence rules and expand as you learn your data patterns."
|
|
@@ -4312,6 +6270,10 @@ var sections9 = [
|
|
|
4312
6270
|
{
|
|
4313
6271
|
question: "Do validation failures block extraction?",
|
|
4314
6272
|
answer: "No. Validation checks flag records for review but do not prevent extraction from completing. Failed records appear in the Approval Queue for manual inspection."
|
|
6273
|
+
},
|
|
6274
|
+
{
|
|
6275
|
+
question: "How do I measure extraction quality over time?",
|
|
6276
|
+
answer: "Create golden samples (5-10 per schema) and run benchmarks after schema changes or model upgrades. The benchmark compares extraction results field-by-field against your known-correct values, producing per-field accuracy scores with AI judge verdicts. Track these scores over time to measure the impact of schema improvements, instruction refinements, and reference table updates."
|
|
4315
6277
|
}
|
|
4316
6278
|
],
|
|
4317
6279
|
mentions: [
|
|
@@ -4360,6 +6322,31 @@ var sections9 = [
|
|
|
4360
6322
|
"Benchmark results are stored historically for tracking quality trends over time"
|
|
4361
6323
|
]
|
|
4362
6324
|
},
|
|
6325
|
+
{
|
|
6326
|
+
type: "code",
|
|
6327
|
+
language: "bash",
|
|
6328
|
+
title: "List ground truth datasets",
|
|
6329
|
+
code: `curl -s https://api.talonic.com/v1/quality/datasets \\
|
|
6330
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6331
|
+
|
|
6332
|
+
# Response:
|
|
6333
|
+
# {
|
|
6334
|
+
# "datasets": [
|
|
6335
|
+
# {
|
|
6336
|
+
# "id": "gs_001",
|
|
6337
|
+
# "document_name": "Invoice_ACME_sample.pdf",
|
|
6338
|
+
# "schema_id": "us_def456",
|
|
6339
|
+
# "field_count": 14,
|
|
6340
|
+
# "created_at": "2025-03-15T09:00:00Z"
|
|
6341
|
+
# }
|
|
6342
|
+
# ],
|
|
6343
|
+
# "total": 8
|
|
6344
|
+
# }`
|
|
6345
|
+
},
|
|
6346
|
+
{
|
|
6347
|
+
type: "paragraph",
|
|
6348
|
+
text: 'Golden samples are most valuable when they represent the diversity of your document corpus. Include at least one "clean" document with all fields present and correctly formatted, one document with unusual formatting or missing fields, and one document from each major variation you encounter. This coverage ensures that benchmarks test the full range of extraction challenges rather than just the easy cases. Re-run benchmarks after every significant schema change \u2014 new field instructions, updated reference tables, or model upgrades \u2014 to measure whether the change improved or regressed extraction quality.'
|
|
6349
|
+
},
|
|
4363
6350
|
{
|
|
4364
6351
|
type: "callout",
|
|
4365
6352
|
text: "Golden samples are not used during normal extraction \u2014 they exist solely for benchmarking. Changing a golden sample does not affect how documents are processed. This separation ensures that ground truth data remains a pure measurement tool without introducing bias into the extraction pipeline."
|
|
@@ -4382,6 +6369,10 @@ var sections9 = [
|
|
|
4382
6369
|
{
|
|
4383
6370
|
question: "How many golden samples should I create?",
|
|
4384
6371
|
answer: "Most teams maintain 5-10 golden samples per schema, covering a representative mix of document types and complexity levels. Re-run benchmarks after schema changes or model upgrades to track quality trends."
|
|
6372
|
+
},
|
|
6373
|
+
{
|
|
6374
|
+
question: "Does the AI judge handle semantic equivalence in benchmark scoring?",
|
|
6375
|
+
answer: 'Yes. The AI judge accounts for semantic equivalence when comparing extracted values against golden sample ground truth. For example, "United States" and "US" may be scored as a match depending on the field type and context. This prevents false negatives where the extraction is correct but uses a different surface form than the golden sample.'
|
|
4385
6376
|
}
|
|
4386
6377
|
],
|
|
4387
6378
|
mentions: ["golden samples", "ground truth", "benchmark runs", "accuracy scoring", "AI judge"]
|
|
@@ -4425,6 +6416,40 @@ var sections9 = [
|
|
|
4425
6416
|
"Start conservative (high thresholds) and loosen as pipeline trust builds"
|
|
4426
6417
|
]
|
|
4427
6418
|
},
|
|
6419
|
+
{
|
|
6420
|
+
type: "code",
|
|
6421
|
+
language: "bash",
|
|
6422
|
+
title: "Create a delivery binding for auto-approved results",
|
|
6423
|
+
code: `# Step 1: Create a webhook destination
|
|
6424
|
+
curl -X POST https://api.talonic.com/v1/delivery/destinations \\
|
|
6425
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6426
|
+
-H "Content-Type: application/json" \\
|
|
6427
|
+
-d '{
|
|
6428
|
+
"name": "ERP Import Webhook",
|
|
6429
|
+
"type": "webhook",
|
|
6430
|
+
"config": { "url": "https://erp.example.com/api/import" },
|
|
6431
|
+
"signing_secret": "whsec_your_secret_here"
|
|
6432
|
+
}'
|
|
6433
|
+
|
|
6434
|
+
# Step 2: Bind the result.approved signal to it
|
|
6435
|
+
curl -X POST https://api.talonic.com/v1/delivery/bindings \\
|
|
6436
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6437
|
+
-H "Content-Type: application/json" \\
|
|
6438
|
+
-d '{
|
|
6439
|
+
"name": "Ship approved invoices to ERP",
|
|
6440
|
+
"signal_filter": { "event_type": "result.approved" },
|
|
6441
|
+
"deliverable_type": "record.approved",
|
|
6442
|
+
"destination_id": "<destination_id>",
|
|
6443
|
+
"serializer_format": "json"
|
|
6444
|
+
}'
|
|
6445
|
+
|
|
6446
|
+
# Now, when approval gates auto-approve a result, the approved
|
|
6447
|
+
# record is delivered to your ERP automatically \u2014 zero manual steps.`
|
|
6448
|
+
},
|
|
6449
|
+
{
|
|
6450
|
+
type: "paragraph",
|
|
6451
|
+
text: "The combination of approval gates and delivery bindings creates a fully automated pipeline from document upload to downstream system import. High-confidence results that clear all three gates (confidence, validation pass rate, and field coverage) are auto-approved and delivered immediately. Lower-confidence results are routed to the Approval Queue for human review, and once a reviewer approves them, the same delivery binding fires. This two-track approach lets high-confidence results flow through without delay while borderline cases still get human oversight before reaching production systems."
|
|
6452
|
+
},
|
|
4428
6453
|
{
|
|
4429
6454
|
type: "callout",
|
|
4430
6455
|
text: "Approval gates feed the delivery pipeline \u2014 bind a `result.approved` signal to a destination to ship only approved rows to your downstream systems. Combined with webhooks, this creates a fully automated flow from document upload through extraction, validation, approval, and delivery with zero manual steps for high-confidence results."
|
|
@@ -4447,6 +6472,10 @@ var sections9 = [
|
|
|
4447
6472
|
{
|
|
4448
6473
|
question: "What thresholds should I start with?",
|
|
4449
6474
|
answer: "Most teams start with 90% confidence, 95% validation pass rate, and 80% field coverage. Adjust based on the volume of false positives in the approval queue \u2014 loosen thresholds as you gain trust in your pipeline."
|
|
6475
|
+
},
|
|
6476
|
+
{
|
|
6477
|
+
question: "Can I have different approval gates for different schemas?",
|
|
6478
|
+
answer: "Yes. Approval gates are configured per schema, so you can set strict thresholds for high-stakes document types (e.g., financial statements) and looser thresholds for lower-risk types (e.g., internal memos). Each schema operates its own independent gate with its own confidence, pass rate, and coverage criteria."
|
|
4450
6479
|
}
|
|
4451
6480
|
],
|
|
4452
6481
|
mentions: [
|
|
@@ -4501,6 +6530,49 @@ var sections9 = [
|
|
|
4501
6530
|
"result.approved and result.rejected signals emitted for delivery pipeline integration"
|
|
4502
6531
|
]
|
|
4503
6532
|
},
|
|
6533
|
+
{
|
|
6534
|
+
type: "code",
|
|
6535
|
+
language: "bash",
|
|
6536
|
+
title: "List pending items in the Approval Queue via API",
|
|
6537
|
+
code: `curl -s "https://api.talonic.com/v1/review?status=pending&limit=10" \\
|
|
6538
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6539
|
+
|
|
6540
|
+
# Response:
|
|
6541
|
+
# {
|
|
6542
|
+
# "items": [
|
|
6543
|
+
# {
|
|
6544
|
+
# "id": "rev_001",
|
|
6545
|
+
# "document_name": "Invoice_Beta_Corp.pdf",
|
|
6546
|
+
# "schema_name": "Invoice Extraction",
|
|
6547
|
+
# "confidence": 0.78,
|
|
6548
|
+
# "validation_pass_rate": 0.90,
|
|
6549
|
+
# "field_coverage": 0.85,
|
|
6550
|
+
# "flags": ["low_confidence_outlier"],
|
|
6551
|
+
# "status": "pending"
|
|
6552
|
+
# }
|
|
6553
|
+
# ],
|
|
6554
|
+
# "total": 24
|
|
6555
|
+
# }`
|
|
6556
|
+
},
|
|
6557
|
+
{
|
|
6558
|
+
type: "code",
|
|
6559
|
+
language: "bash",
|
|
6560
|
+
title: "Approve or reject a review item",
|
|
6561
|
+
code: `# Approve a single item:
|
|
6562
|
+
curl -X POST https://api.talonic.com/v1/review/rev_001/approve \\
|
|
6563
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6564
|
+
|
|
6565
|
+
# Reject a single item:
|
|
6566
|
+
curl -X POST https://api.talonic.com/v1/review/rev_001/reject \\
|
|
6567
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6568
|
+
|
|
6569
|
+
# Each action emits the corresponding delivery signal
|
|
6570
|
+
# (result.approved or result.rejected) immediately.`
|
|
6571
|
+
},
|
|
6572
|
+
{
|
|
6573
|
+
type: "paragraph",
|
|
6574
|
+
text: "The Approval Queue is the human checkpoint in your data pipeline. It ensures that results below your confidence thresholds receive manual review before reaching downstream systems. For teams processing large volumes, the LLM auto-review feature can dramatically reduce review time \u2014 AI proposes approve or reject decisions based on the extraction context, and reviewers accept or override with a single click. Auto-review is especially effective for straightforward items that just missed the auto-approval threshold, letting reviewers focus their attention on the genuinely ambiguous cases that benefit most from human judgment."
|
|
6575
|
+
},
|
|
4504
6576
|
{
|
|
4505
6577
|
type: "callout",
|
|
4506
6578
|
text: "LLM auto-review is available to accelerate the approval process. When enabled, AI proposes approve or reject decisions for pending items, which you can accept or override with a single click. Auto-review is especially effective for high-volume workloads where most items are straightforward \u2014 it lets reviewers focus their attention on the genuinely ambiguous cases."
|
|
@@ -4523,6 +6595,10 @@ var sections9 = [
|
|
|
4523
6595
|
{
|
|
4524
6596
|
question: "Can I batch approve or reject items?",
|
|
4525
6597
|
answer: "Yes. Select multiple items in the queue and use the batch action buttons to approve or reject them all at once. Each item emits the appropriate delivery signal individually."
|
|
6598
|
+
},
|
|
6599
|
+
{
|
|
6600
|
+
question: "What happens after I approve or reject an item?",
|
|
6601
|
+
answer: "Approving emits a result.approved signal to the delivery pipeline. Rejecting emits a result.rejected signal. If you have delivery bindings configured for these signals, the corresponding payload is delivered to your destination immediately. During batch operations a separate signal is emitted for each item, so every item triggers its own downstream workflow."
|
|
4526
6602
|
}
|
|
4527
6603
|
],
|
|
4528
6604
|
mentions: [
|
|
@@ -4596,6 +6672,43 @@ var sections10 = [
|
|
|
4596
6672
|
type: "paragraph",
|
|
4597
6673
|
text: "For best results, start with a webhook destination to verify your binding configuration end-to-end. Once the payload shape and delivery cadence match your expectations, expand to file-based destinations (S3, SFTP) or spreadsheet destinations (Google Sheets). Most teams create separate bindings for different downstream consumers rather than routing all events to a single destination."
|
|
4598
6674
|
},
|
|
6675
|
+
{
|
|
6676
|
+
type: "code",
|
|
6677
|
+
language: "bash",
|
|
6678
|
+
title: "Create a webhook destination and binding end-to-end",
|
|
6679
|
+
code: `# 1) Create the destination
|
|
6680
|
+
curl -X POST https://api.talonic.com/v1/delivery/destinations \\
|
|
6681
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6682
|
+
-H "Content-Type: application/json" \\
|
|
6683
|
+
-d '{
|
|
6684
|
+
"name": "Ops Webhook",
|
|
6685
|
+
"type": "webhook",
|
|
6686
|
+
"config": { "url": "https://ops.example.com/talonic", "method": "POST" },
|
|
6687
|
+
"signing_secret": "whsec_rotate_monthly"
|
|
6688
|
+
}'
|
|
6689
|
+
# -> { "id": "dest_001", "is_active": true }
|
|
6690
|
+
|
|
6691
|
+
# 2) Create a binding
|
|
6692
|
+
curl -X POST https://api.talonic.com/v1/delivery/bindings \\
|
|
6693
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6694
|
+
-H "Content-Type: application/json" \\
|
|
6695
|
+
-d '{
|
|
6696
|
+
"name": "Notify on extraction",
|
|
6697
|
+
"signal_filter": { "event_type": "document.extracted" },
|
|
6698
|
+
"deliverable_type": "notification",
|
|
6699
|
+
"destination_id": "dest_001",
|
|
6700
|
+
"serializer_format": "json"
|
|
6701
|
+
}'
|
|
6702
|
+
|
|
6703
|
+
# 3) Test the destination
|
|
6704
|
+
curl -X POST https://api.talonic.com/v1/delivery/destinations/dest_001/test \\
|
|
6705
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6706
|
+
# -> { "success": true, "duration_ms": 142 }`
|
|
6707
|
+
},
|
|
6708
|
+
{
|
|
6709
|
+
type: "paragraph",
|
|
6710
|
+
text: "The delivery pipeline is designed for zero-code configuration. Adding a new destination, creating a binding, and testing the connection can all be done via the API or dashboard without writing any integration code. Compatibility triangle validation rejects an incompatible binding configuration before any events are routed, so a misconfigured binding cannot silently drop deliveries. Every attempt is logged with full request/response details, and terminal failures are captured in the dead-letter queue for replay, giving you complete visibility into your delivery pipeline."
|
|
6711
|
+
},
|
|
4599
6712
|
{
|
|
4600
6713
|
type: "callout",
|
|
4601
6714
|
text: "Delivery is at-least-once with deterministic idempotency keys. Receivers should use the `X-Talonic-Idempotency-Key` header (or equivalent metadata for file-based connectors) to deduplicate on their end."
|
|
@@ -4618,6 +6731,10 @@ var sections10 = [
|
|
|
4618
6731
|
{
|
|
4619
6732
|
question: "What serialization formats are supported?",
|
|
4620
6733
|
answer: "Ten formats: json, ndjson, csv, csv_file, xlsx, rows, graph, raw, md, and txt. Each serializer declares which deliverable shapes it supports, and the compatibility triangle validates the combination at binding creation time."
|
|
6734
|
+
},
|
|
6735
|
+
{
|
|
6736
|
+
question: "What is the default retry ladder for failed deliveries?",
|
|
6737
|
+
answer: "The default retry ladder uses 7 attempts with exponential backoff. The delays before each attempt are 0s (immediate), 30s, 2min, 8min, 30min, 2h, and 8h, which add up to roughly 10 hours 40 minutes. After all attempts are exhausted, the delivery moves to the dead-letter queue. You can override the retry schedule per binding via the delivery_policy field."
|
|
4621
6738
|
}
|
|
4622
6739
|
],
|
|
4623
6740
|
mentions: [
|
|
@@ -4693,6 +6810,37 @@ var sections10 = [
|
|
|
4693
6810
|
type: "paragraph",
|
|
4694
6811
|
text: "For best results, always run a live-ping test after creating a destination. The test exercises the full transport envelope \u2014 SSRF validation, payload cap, and authentication \u2014 with a tiny test payload, so you catch configuration errors before real events start flowing. OAuth-based destinations (Google Drive, Google Sheets) require connecting your account first via the OAuth flow in the dashboard."
|
|
4695
6812
|
},
|
|
6813
|
+
{
|
|
6814
|
+
type: "code",
|
|
6815
|
+
language: "bash",
|
|
6816
|
+
title: "Create an S3 destination with test",
|
|
6817
|
+
code: `curl -X POST https://api.talonic.com/v1/delivery/destinations \\
|
|
6818
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6819
|
+
-H "Content-Type: application/json" \\
|
|
6820
|
+
-d '{
|
|
6821
|
+
"name": "Data Lake S3",
|
|
6822
|
+
"type": "s3",
|
|
6823
|
+
"config": {
|
|
6824
|
+
"bucket": "talonic-exports",
|
|
6825
|
+
"region": "us-east-1",
|
|
6826
|
+
"prefix": "extractions/{date}/"
|
|
6827
|
+
},
|
|
6828
|
+
"auth_config": {
|
|
6829
|
+
"type": "access_key",
|
|
6830
|
+
"access_key_id": "AKIA...",
|
|
6831
|
+
"secret_access_key": "..."
|
|
6832
|
+
}
|
|
6833
|
+
}'
|
|
6834
|
+
|
|
6835
|
+
# Test the connection (HeadBucket probe, no objects written); dest_s3 stands in for the id returned by the create call above:
|
|
6836
|
+
curl -X POST https://api.talonic.com/v1/delivery/destinations/dest_s3/test \\
|
|
6837
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
6838
|
+
# -> { "success": true, "message": "bucket 'talonic-exports' reachable" }`
|
|
6839
|
+
},
|
|
6840
|
+
{
|
|
6841
|
+
type: "paragraph",
|
|
6842
|
+
text: "File-based destinations (S3, SFTP, Google Drive, OneDrive, Azure Blob) support configurable filename templates with token substitution. Available tokens include {binding_id}, {event_id}, {customer_id}, {idempotency_key}, {attempt}, {timestamp_iso}, {date}, and {deliverable_type}. For example, a template like `extractions/{date}/{deliverable_type}_{event_id}.json` produces filenames like `extractions/2025-04-22/notification_12345.json`. Path traversal sequences (`..`) are automatically stripped for security. The idempotency key is embedded in the filename for SFTP and OneDrive destinations since these destinations do not support object metadata."
|
|
6843
|
+
},
|
|
4696
6844
|
{
|
|
4697
6845
|
type: "callout",
|
|
4698
6846
|
text: "Destinations can be disabled without deleting them. Set **is_active** to false and no bindings will route events to the destination until you re-enable it."
|
|
@@ -4715,6 +6863,10 @@ var sections10 = [
|
|
|
4715
6863
|
{
|
|
4716
6864
|
question: "Can one destination serve multiple bindings?",
|
|
4717
6865
|
answer: "Yes. A single destination can back any number of bindings, each with its own signal filter, serializer, and field map. This lets you route different event types to the same endpoint with different payload shapes."
|
|
6866
|
+
},
|
|
6867
|
+
{
|
|
6868
|
+
question: "How do OAuth-based destinations (Google Drive, Sheets) handle authentication?",
|
|
6869
|
+
answer: "OAuth destinations require connecting your account through the OAuth flow in the dashboard. Tokens are encrypted at rest using AES-256-GCM and refreshed automatically before expiry. The connector is stateless \u2014 token lifecycle is managed by a shared refresh service so you never need to re-authenticate unless the token is revoked."
|
|
4718
6870
|
}
|
|
4719
6871
|
],
|
|
4720
6872
|
mentions: [
|
|
@@ -4754,6 +6906,34 @@ var sections10 = [
|
|
|
4754
6906
|
type: "paragraph",
|
|
4755
6907
|
text: "For best results, create one binding per downstream consumer per event type. This gives you independent control over payload shape, retry policy, and serialization format for each integration point. Most teams start with a `document.extracted` binding to a webhook and expand to run-level and approval signals as their pipeline matures."
|
|
4756
6908
|
},
|
|
6909
|
+
{
|
|
6910
|
+
type: "code",
|
|
6911
|
+
language: "bash",
|
|
6912
|
+
title: "Create a binding with field_map and custom retry policy",
|
|
6913
|
+
code: `curl -X POST https://api.talonic.com/v1/delivery/bindings \\
|
|
6914
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
6915
|
+
-H "Content-Type: application/json" \\
|
|
6916
|
+
-d '{
|
|
6917
|
+
"name": "Approved records to S3 as CSV",
|
|
6918
|
+
"signal_filter": { "event_type": "result.approved" },
|
|
6919
|
+
"deliverable_type": "record.approved",
|
|
6920
|
+
"destination_id": "dest_s3",
|
|
6921
|
+
"serializer_format": "csv_file",
|
|
6922
|
+
"field_map": {
|
|
6923
|
+
"rename": { "invoice_number": "inv_no", "total_amount": "amount" },
|
|
6924
|
+
"drop": ["internal_notes", "debug_trace"],
|
|
6925
|
+
"static": { "source": "talonic", "pipeline": "production" }
|
|
6926
|
+
},
|
|
6927
|
+
"delivery_policy": {
|
|
6928
|
+
"max_attempts": 5,
|
|
6929
|
+
"backoff_schedule": [0, 10000, 60000, 300000, 1800000]
|
|
6930
|
+
}
|
|
6931
|
+
}'`
|
|
6932
|
+
},
|
|
6933
|
+
{
|
|
6934
|
+
type: "paragraph",
|
|
6935
|
+
text: "The field_map feature eliminates the need for middleware transformation layers between Talonic and your downstream systems. The three operations \u2014 drop, rename, and static injection \u2014 compose in a fixed order: excluded fields are removed first, then remaining fields are renamed to match your downstream schema, then static values are injected. This means you can reshape the payload to match exactly what your ERP, data warehouse, or analytics system expects without writing any code or deploying a transformation service. Most teams create one binding per downstream consumer per event type for independent control over payload shape."
|
|
6936
|
+
},
|
|
4757
6937
|
{
|
|
4758
6938
|
type: "callout",
|
|
4759
6939
|
text: "The binding editor in the dashboard walks you through the compatibility triangle step by step \u2014 only showing serializers and deliverables that are compatible with your chosen signal and destination."
|
|
@@ -4776,6 +6956,10 @@ var sections10 = [
|
|
|
4776
6956
|
{
|
|
4777
6957
|
question: "What is the compatibility triangle?",
|
|
4778
6958
|
answer: "The compatibility triangle validates that the signal, deliverable resolver, serializer, and connector all form a compatible combination. The backend enforces this on every binding create and update to prevent misconfigured delivery routes."
|
|
6959
|
+
},
|
|
6960
|
+
{
|
|
6961
|
+
question: "How do I update a binding without disrupting active deliveries?",
|
|
6962
|
+
answer: "Use PUT /v1/delivery/bindings/{id} to update a binding. The backend re-runs the compatibility triangle validation on every update. Changes take effect on the next event \u2014 in-flight deliveries using the previous configuration complete normally. If you need to pause a binding temporarily, you can disable its destination instead. Note that this also pauses every other binding routed to that destination."
|
|
4779
6963
|
}
|
|
4780
6964
|
],
|
|
4781
6965
|
mentions: [
|
|
@@ -4871,6 +7055,39 @@ var sections10 = [
|
|
|
4871
7055
|
type: "paragraph",
|
|
4872
7056
|
text: "For best results, use the catalog API to populate dropdown menus and configuration forms rather than hardcoding signal or deliverable lists. The catalog always reflects the running registry contents, so new signal types and deliverables appear automatically as the platform evolves."
|
|
4873
7057
|
},
|
|
7058
|
+
{
|
|
7059
|
+
type: "code",
|
|
7060
|
+
language: "bash",
|
|
7061
|
+
title: "Query the delivery catalog API",
|
|
7062
|
+
code: `# List all available signal types:
|
|
7063
|
+
curl -s https://api.talonic.com/v1/delivery/catalog/signals \\
|
|
7064
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7065
|
+
|
|
7066
|
+
# Response:
|
|
7067
|
+
# {
|
|
7068
|
+
# "types": ["document.extracted", "document.extraction_failed",
|
|
7069
|
+
# "run.dataspace.completed", "run.structuring.completed",
|
|
7070
|
+
# "run.resolution.completed", "run.extraction.completed",
|
|
7071
|
+
# "result.approved", "result.rejected", "result.flagged",
|
|
7072
|
+
# "delivery.item.completed", "delivery.item.failed"],
|
|
7073
|
+
# "items": [
|
|
7074
|
+
# {
|
|
7075
|
+
# "type": "document.extracted",
|
|
7076
|
+
# "label": "Document extracted",
|
|
7077
|
+
# "description": "Fired when a document completes extraction"
|
|
7078
|
+
# }
|
|
7079
|
+
# ]
|
|
7080
|
+
# }
|
|
7081
|
+
|
|
7082
|
+
# List compatible deliverables, serializers, and connectors:
|
|
7083
|
+
# GET /v1/delivery/catalog/deliverables
|
|
7084
|
+
# GET /v1/delivery/catalog/serializers
|
|
7085
|
+
# GET /v1/delivery/catalog/connectors`
|
|
7086
|
+
},
|
|
7087
|
+
{
|
|
7088
|
+
type: "paragraph",
|
|
7089
|
+
text: "The catalog API is designed for dynamic UI integration. Rather than hardcoding signal types or deliverable options in your application, query the catalog endpoints to populate dropdowns and configuration forms. This ensures your integration stays current as the platform evolves \u2014 new signal types, deliverable resolvers, and serializers appear automatically in the catalog when they are registered. The catalog also powers the compatibility triangle in the binding editor, filtering options at each step to show only compatible choices."
|
|
7090
|
+
},
|
|
4874
7091
|
{
|
|
4875
7092
|
type: "callout",
|
|
4876
7093
|
text: "The catalog API exposes four endpoints: `/v1/delivery/catalog/signals`, `/v1/delivery/catalog/deliverables`, `/v1/delivery/catalog/serializers`, and `/v1/delivery/catalog/connectors`. Each returns the full registry for that category."
|
|
@@ -4893,6 +7110,10 @@ var sections10 = [
|
|
|
4893
7110
|
{
|
|
4894
7111
|
question: "What are meta-signals?",
|
|
4895
7112
|
answer: "Meta-signals (delivery.item.completed and delivery.item.failed) fire when a delivery attempt itself succeeds or fails. Use them for self-monitoring \u2014 for example, binding delivery.item.failed to a notification webhook for delivery failure alerts."
|
|
7113
|
+
},
|
|
7114
|
+
{
|
|
7115
|
+
question: "How do I set up delivery failure alerts?",
|
|
7116
|
+
answer: "Create a binding with signal_filter event_type set to delivery.item.failed, deliverable_type set to notification, and point it at a webhook destination connected to your alerting system (Slack, PagerDuty, etc.). The built-in loop prevention ensures that a failed alert delivery does not emit another failure signal, so you never get cascading meta-signal loops."
|
|
4896
7117
|
}
|
|
4897
7118
|
],
|
|
4898
7119
|
mentions: [
|
|
@@ -4926,6 +7147,41 @@ var sections10 = [
|
|
|
4926
7147
|
type: "paragraph",
|
|
4927
7148
|
text: "For best results, monitor the DLQ regularly and set up a `delivery.item.failed` meta-signal binding to receive alerts when deliveries fail terminally. Most teams configure a notification webhook for this signal so they are notified immediately rather than discovering failures during a manual review. Request and response bodies older than the configured retention period are automatically cleaned up, but row metadata (status, error code, duration) is retained indefinitely."
|
|
4928
7149
|
},
|
|
7150
|
+
{
|
|
7151
|
+
type: "code",
|
|
7152
|
+
language: "bash",
|
|
7153
|
+
title: "Inspect delivery history and replay from DLQ",
|
|
7154
|
+
code: `# List delivery attempts for a binding:
|
|
7155
|
+
curl -s "https://api.talonic.com/v1/delivery/items?binding_id=bind_001&limit=5" \\
|
|
7156
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7157
|
+
|
|
7158
|
+
# Response:
|
|
7159
|
+
# {
|
|
7160
|
+
# "items": [
|
|
7161
|
+
# {
|
|
7162
|
+
# "id": "item_abc",
|
|
7163
|
+
# "status": "succeeded",
|
|
7164
|
+
# "attempt": 1,
|
|
7165
|
+
# "http_status": 200,
|
|
7166
|
+
# "duration_ms": 142,
|
|
7167
|
+
# "completed_at": "2025-04-22T10:05:00Z"
|
|
7168
|
+
# }
|
|
7169
|
+
# ]
|
|
7170
|
+
# }
|
|
7171
|
+
|
|
7172
|
+
# Check the DLQ for terminal failures:
|
|
7173
|
+
curl -s "https://api.talonic.com/v1/delivery/dlq?binding_id=bind_001" \\
|
|
7174
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7175
|
+
|
|
7176
|
+
# Replay a dead-letter entry (new attempt=1, same idempotency key):
|
|
7177
|
+
curl -X POST https://api.talonic.com/v1/delivery/dlq/dlq_xyz/replay \\
|
|
7178
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7179
|
+
# -> { "replayed": true }`
|
|
7180
|
+
},
|
|
7181
|
+
{
|
|
7182
|
+
type: "paragraph",
|
|
7183
|
+
text: "The delivery history and DLQ together provide complete observability into your outbound data flow. Every attempt is logged with precise timing, HTTP status, error classification, and truncated request/response bodies for debugging. When a delivery fails terminally, the DLQ captures the full context \u2014 which event triggered it, which binding matched, what error occurred, and how many attempts were made. Replay is safe to run multiple times because the idempotency key is deterministic: receivers that deduplicate on the X-Talonic-Idempotency-Key header will not process the same delivery twice, even after multiple replays. Destinations that return authentication errors (401, 403) are automatically disabled to prevent a cascade of failed attempts from consuming your retry budget."
|
|
7184
|
+
},
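The deduplication behaviour described above depends on the receiver tracking the keys it has already handled. A minimal receiver-side sketch follows, assuming the key arrives in the `X-Talonic-Idempotency-Key` header and that header names are lower-cased as Node's `http` module does. `createDeduplicator` and `shouldProcess` are hypothetical names and are not part of any Talonic SDK.

```javascript
// Hypothetical receiver-side helper (illustration only, not a Talonic API).
// Tracks idempotency keys that have already been handled so that replays
// and retries of the same delivery are skipped.
function createDeduplicator() {
  // In-memory set for illustration; a real receiver would likely need a
  // durable store so that keys survive restarts.
  const seen = new Set();

  return function shouldProcess(headers) {
    const key = headers['x-talonic-idempotency-key'];
    if (!key) return true;            // no key present: cannot deduplicate
    if (seen.has(key)) return false;  // already handled: skip this payload
    seen.add(key);
    return true;
  };
}
```

A handler would call `shouldProcess(req.headers)` before doing any work and still acknowledge skipped duplicates with a 2xx response, so the sender does not retry them.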
|
|
4929
7185
|
{
|
|
4930
7186
|
type: "callout",
|
|
4931
7187
|
text: "Replay is safe to run multiple times. The idempotency key is deterministic \u2014 receivers that deduplicate on the key will not process the same delivery twice, even after multiple replays."
|
|
@@ -4948,6 +7204,10 @@ var sections10 = [
|
|
|
4948
7204
|
{
|
|
4949
7205
|
question: "How long are request and response bodies retained?",
|
|
4950
7206
|
answer: "Request and response bodies are cleaned up after the configured retention period (default 30 days) by a daily cleanup job that runs at 03:00 server time. Row metadata \u2014 status, HTTP code, error code, and duration \u2014 is retained indefinitely for audit purposes. Configure the retention period via the delivery.item_body_retention_days setting in pipeline.yaml."
|
|
7207
|
+
},
|
|
7208
|
+
{
|
|
7209
|
+
question: "What happens when a destination returns a 401 or 403 error?",
|
|
7210
|
+
answer: "Authentication failures (401, 403) are classified as permanent errors. The delivery moves to the DLQ immediately without retrying, and the destination is automatically disabled (is_active set to false) to prevent further failed attempts. Fix the credentials, re-enable the destination, and replay the DLQ entry to resume deliveries."
|
|
4951
7211
|
}
|
|
4952
7212
|
],
|
|
4953
7213
|
mentions: [
|
|
@@ -4998,6 +7258,38 @@ var sections11 = [
|
|
|
4998
7258
|
"**Encoding** \u2014 set the character encoding for file exports (UTF-8, ISO-8859-1, etc.)."
|
|
4999
7259
|
]
|
|
5000
7260
|
},
|
|
7261
|
+
{
|
|
7262
|
+
type: "code",
|
|
7263
|
+
language: "bash",
|
|
7264
|
+
title: "Manage shared dialects via API",
|
|
7265
|
+
code: `# List workspace dialects:
|
|
7266
|
+
curl -s https://api.talonic.com/v1/dialects \\
|
|
7267
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7268
|
+
|
|
7269
|
+
# Create a shared dialect:
|
|
7270
|
+
curl -X POST https://api.talonic.com/v1/dialects \\
|
|
7271
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
7272
|
+
-H "Content-Type: application/json" \\
|
|
7273
|
+
-d '{
|
|
7274
|
+
"name": "US Standard",
|
|
7275
|
+
"date_format": "MM/DD/YYYY",
|
|
7276
|
+
"number_locale": "en-US",
|
|
7277
|
+
"delimiter": ",",
|
|
7278
|
+
"null_representation": "",
|
|
7279
|
+
"boolean_format": "true/false",
|
|
7280
|
+
"encoding": "UTF-8"
|
|
7281
|
+
}'
|
|
7282
|
+
|
|
7283
|
+
# Update an existing dialect:
|
|
7284
|
+
curl -X PUT https://api.talonic.com/v1/dialects/dial_us_002 \\
|
|
7285
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
7286
|
+
-H "Content-Type: application/json" \\
|
|
7287
|
+
-d '{ "encoding": "UTF-8-BOM" }'`
|
|
7288
|
+
},
|
|
7289
|
+
{
|
|
7290
|
+
type: "paragraph",
|
|
7291
|
+
text: "Shared dialects are a workspace-level configuration that eliminates formatting inconsistencies across your output. Without them, each schema might use slightly different date formats or number locales, creating data quality issues when records from different schemas are combined in downstream systems. By configuring the dialect once at the workspace level, every schema inherits the same formatting rules automatically. This is particularly important for organizations with multiple teams creating schemas independently \u2014 the shared dialect ensures they all produce structurally consistent output regardless of who created the schema or when."
|
|
7292
|
+
},
|
|
5001
7293
|
{
|
|
5002
7294
|
type: "callout",
|
|
5003
7295
|
variant: "info",
|
|
@@ -5025,6 +7317,10 @@ var sections11 = [
|
|
|
5025
7317
|
{
|
|
5026
7318
|
question: "Do shared dialects affect the extraction process?",
|
|
5027
7319
|
answer: "No. Dialects only affect output formatting \u2014 how extracted values are serialized in exports and deliveries. The extraction and validation phases work with normalized internal representations regardless of dialect settings."
|
|
7320
|
+
},
|
|
7321
|
+
{
|
|
7322
|
+
question: "How do I manage shared dialects across multiple workspaces?",
|
|
7323
|
+
answer: "Use the dialect API (GET, POST, PUT, DELETE on /v1/dialects) to export a dialect configuration from one workspace and replicate it in another. This ensures consistent formatting across your entire organization. If different regions need different formats, create separate workspaces with region-specific shared dialects rather than managing inline overrides on individual schemas."
|
|
5028
7324
|
}
|
|
5029
7325
|
],
|
|
5030
7326
|
mentions: [
|
|
@@ -5060,11 +7356,43 @@ var sections11 = [
|
|
|
5060
7356
|
type: "paragraph",
|
|
5061
7357
|
text: "For best results, keep reference primitives focused on a single domain \u2014 for example, one primitive for country codes, another for currency codes, and another for product categories. This makes each primitive reusable across multiple schemas and simplifies maintenance. When updating a primitive, test the new version against a few sample documents before updating the version reference in production schemas."
|
|
5062
7358
|
},
|
|
7359
|
+
{
|
|
7360
|
+
type: "code",
|
|
7361
|
+
language: "bash",
|
|
7362
|
+
title: "Create and version a reference primitive",
|
|
7363
|
+
code: `# Create a reference primitive:
|
|
7364
|
+
curl -X POST https://api.talonic.com/v1/reference-primitives \\
|
|
7365
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
7366
|
+
-H "Content-Type: application/json" \\
|
|
7367
|
+
-d '{
|
|
7368
|
+
"name": "Country Codes",
|
|
7369
|
+
"entries": [
|
|
7370
|
+
{ "key": "US", "value": "United States" },
|
|
7371
|
+
{ "key": "US", "value": "USA" },
|
|
7372
|
+
{ "key": "DE", "value": "Germany" },
|
|
7373
|
+
{ "key": "DE", "value": "Deutschland" },
|
|
7374
|
+
{ "key": "GB", "value": "United Kingdom" },
|
|
7375
|
+
{ "key": "GB", "value": "UK" }
|
|
7376
|
+
]
|
|
7377
|
+
}'
|
|
7378
|
+
|
|
7379
|
+
# List versions:
|
|
7380
|
+
curl -s https://api.talonic.com/v1/reference-primitives/rp_001/versions \\
|
|
7381
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7382
|
+
|
|
7383
|
+
# Get a specific version:
|
|
7384
|
+
curl -s https://api.talonic.com/v1/reference-primitives/rp_001/versions/2 \\
|
|
7385
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"`
|
|
7386
|
+
},
|
|
5063
7387
|
{
|
|
5064
7388
|
type: "callout",
|
|
5065
7389
|
variant: "info",
|
|
5066
7390
|
text: "Versioning protects production stability. When you update a reference primitive, existing schemas continue using their pinned version until you explicitly update the version reference. This prevents unexpected changes to live extraction pipelines."
|
|
5067
7391
|
},
|
|
7392
|
+
{
|
|
7393
|
+
type: "paragraph",
|
|
7394
|
+
text: "Reference primitives are the centralized source of truth for code mapping across your workspace. Unlike schema-level reference tables that are defined inline and scoped to a single schema, workspace-level primitives are shared resources that any schema can reference. This eliminates duplication \u2014 instead of maintaining identical country code tables across five schemas, you maintain one primitive and reference it from all five. When a new country or code variation needs to be added, you update the primitive once and bump the version reference in each schema. The 3-tier lookup cascade (normalization, fuzzy, AI fallback) runs identically for both primitives and inline tables, so there is no difference in resolution quality."
|
|
7395
|
+
},
|
|
5068
7396
|
{
|
|
5069
7397
|
type: "list",
|
|
5070
7398
|
ordered: false,
|
|
@@ -5097,6 +7425,10 @@ var sections11 = [
|
|
|
5097
7425
|
{
|
|
5098
7426
|
question: "What happens when I update a reference primitive?",
|
|
5099
7427
|
answer: "A new version is created. Existing schemas continue using their pinned version. You must explicitly update the version reference in each schema to use the new data, which protects production pipelines from unexpected changes."
|
|
7428
|
+
},
|
|
7429
|
+
{
|
|
7430
|
+
question: "How do I pin a schema to a specific reference primitive version?",
|
|
7431
|
+
answer: "When configuring a reference table on a schema field, specify both the reference primitive ID and the version number. The schema uses that exact version for all lookups until you explicitly update it. This pin-based approach means you can test a new primitive version against a draft schema without affecting production schemas that are pinned to the previous version."
|
|
5100
7432
|
}
|
|
5101
7433
|
],
|
|
5102
7434
|
mentions: [
|
|
@@ -5143,6 +7475,37 @@ var sections11 = [
|
|
|
5143
7475
|
"**Format constraint changes** \u2014 regex pattern updates or fallback behavior modifications on schema fields."
|
|
5144
7476
|
]
|
|
5145
7477
|
},
|
|
7478
|
+
{
|
|
7479
|
+
type: "code",
|
|
7480
|
+
language: "bash",
|
|
7481
|
+
title: "Check workspace governance settings",
|
|
7482
|
+
code: `# View current workspace settings including change review status:
|
|
7483
|
+
curl -s https://api.talonic.com/v1/workspace/settings \\
|
|
7484
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7485
|
+
|
|
7486
|
+
# Response:
|
|
7487
|
+
# {
|
|
7488
|
+
# "workspace_id": "ws_001",
|
|
7489
|
+
# "change_review_enabled": true,
|
|
7490
|
+
# "pending_changes": 3,
|
|
7491
|
+
# "reviewers": ["admin@example.com"],
|
|
7492
|
+
# "categories_requiring_review": [
|
|
7493
|
+
# "schema_changes",
|
|
7494
|
+
# "dialect_changes",
|
|
7495
|
+
# "reference_primitive_changes",
|
|
7496
|
+
# "delivery_binding_changes",
|
|
7497
|
+
# "routing_rule_changes"
|
|
7498
|
+
# ]
|
|
7499
|
+
# }`
|
|
7500
|
+
},
|
|
7501
|
+
{
|
|
7502
|
+
type: "paragraph",
|
|
7503
|
+
text: "Change review is the governance layer that protects production workspaces from unintended disruptions. In a production environment, a small change to a schema field mapping or reference primitive value can ripple through to every document processed after that point. Without change review, these modifications take effect immediately with no undo for documents already processed with the new configuration. By enabling change review, every modification is queued for approval \u2014 a second pair of eyes verifies the technical and business impact before the change goes live. This is particularly important for organizations with compliance requirements where changes to data extraction rules must be documented and approved."
|
|
7504
|
+
},
|
|
7505
|
+
{
|
|
7506
|
+
type: "paragraph",
|
|
7507
|
+
text: "A common workflow for teams transitioning from development to production is to leave change review disabled during the initial setup phase for faster iteration. Once schemas, dialects, and reference primitives are stable and data is flowing to downstream systems through delivery bindings, enable change review. From that point forward, any modification \u2014 whether it is adding a field to a schema, updating a reference primitive version, or changing a delivery binding filter \u2014 requires explicit approval before it takes effect in the live pipeline."
|
|
7508
|
+
},
|
|
5146
7509
|
{
|
|
5147
7510
|
type: "callout",
|
|
5148
7511
|
variant: "warning",
|
|
@@ -5170,6 +7533,10 @@ var sections11 = [
|
|
|
5170
7533
|
{
|
|
5171
7534
|
question: "Can I bypass change review for urgent fixes?",
|
|
5172
7535
|
answer: "Change review can be disabled temporarily from Workspace Settings if an urgent fix is needed. However, this should be done with caution in production workspaces, and the review requirement should be re-enabled afterward."
|
|
7536
|
+
},
|
|
7537
|
+
{
|
|
7538
|
+
question: "Does change review apply to delivery binding modifications?",
|
|
7539
|
+
answer: "Yes. When change review is enabled, modifications to delivery bindings \u2014 including changes to signal filters, field maps, serializer formats, and destination assignments \u2014 are queued for approval. This prevents accidental routing changes from disrupting active delivery pipelines."
|
|
5173
7540
|
}
|
|
5174
7541
|
],
|
|
5175
7542
|
mentions: [
|
|
@@ -5213,6 +7580,36 @@ var sections12 = [
|
|
|
5213
7580
|
variant: "info",
|
|
5214
7581
|
text: "Omnisearch results update as you type. The materialized index is rebuilt automatically whenever documents are processed or schemas change, so results are always current."
|
|
5215
7582
|
},
|
|
7583
|
+
{
|
|
7584
|
+
type: "code",
|
|
7585
|
+
language: "bash",
|
|
7586
|
+
title: "Search across documents and extracted values via API",
|
|
7587
|
+
code: `curl -s "https://api.talonic.com/v1/search?q=INV-2025-0042" \\
|
|
7588
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7589
|
+
|
|
7590
|
+
# Response:
|
|
7591
|
+
# {
|
|
7592
|
+
# "results": [
|
|
7593
|
+
# {
|
|
7594
|
+
# "category": "document",
|
|
7595
|
+
# "id": "doc_001",
|
|
7596
|
+
# "title": "Invoice_ACME_2025.pdf",
|
|
7597
|
+
# "match_field": "invoice_number",
|
|
7598
|
+
# "match_value": "INV-2025-0042"
|
|
7599
|
+
# },
|
|
7600
|
+
# {
|
|
7601
|
+
# "category": "extracted_value",
|
|
7602
|
+
# "document_id": "doc_003",
|
|
7603
|
+
# "field": "reference_number",
|
|
7604
|
+
# "value": "INV-2025-0042"
|
|
7605
|
+
# }
|
|
7606
|
+
# ]
|
|
7607
|
+
# }`
|
|
7608
|
+
},
|
|
7609
|
+
{
|
|
7610
|
+
type: "paragraph",
|
|
7611
|
+
text: "Omnisearch is powered by a materialized values index that flattens all extracted field values into a searchable structure. This index is updated incrementally as documents are processed \u2014 there is no batch reindex step that could cause search results to lag behind your data. The search covers seven categories: documents (file names and metadata), extracted values (individual data points across all documents), field names (Field Registry canonical definitions), schema names (generated and template schemas), sources (connection names), matching configurations, and delivery bindings. Results are grouped by category and ranked by relevance, so you can quickly distinguish between a document match and an extracted value match."
|
|
7612
|
+
},
|
|
5216
7613
|
{
|
|
5217
7614
|
type: "list",
|
|
5218
7615
|
ordered: false,
|
|
@@ -5248,6 +7645,10 @@ var sections12 = [
|
|
|
5248
7645
|
{
|
|
5249
7646
|
question: "How quickly are new documents searchable in Omnisearch?",
|
|
5250
7647
|
answer: "Documents become searchable as soon as extraction completes. The materialized index is updated automatically during document processing, so there is no manual reindex step."
|
|
7648
|
+
},
|
|
7649
|
+
{
|
|
7650
|
+
question: "Can I search via the API?",
|
|
7651
|
+
answer: "Yes. Use GET /v1/search?q={query} to search across all categories programmatically. The response groups results by category (document, extracted_value, field, schema, source) with match details including the matched field name and value. This is useful for building custom integrations or automating lookups against your document corpus."
|
|
5251
7652
|
}
|
|
5252
7653
|
],
|
|
5253
7654
|
mentions: ["omnisearch", "global search", "Cmd+K", "Ctrl+K", "document search", "materialized values index"]
|
|
@@ -5291,6 +7692,38 @@ var sections12 = [
|
|
|
5291
7692
|
type: "paragraph",
|
|
5292
7693
|
text: 'For best results, save your most common filter combinations as presets. Most teams create presets for categories like "high-value invoices this quarter," "documents missing key fields," or "recently failed extractions." Presets appear as one-click buttons on the Documents page, eliminating the need to rebuild complex filter conditions from scratch each time.'
|
|
5293
7694
|
},
|
|
7695
|
+
{
|
|
7696
|
+
type: "code",
|
|
7697
|
+
language: "bash",
|
|
7698
|
+
title: "Filter documents by extracted field values via API",
|
|
7699
|
+
code: `# Find high-value invoices from a specific vendor:
|
|
7700
|
+
curl -s "https://api.talonic.com/v1/filter?filters=vendor_name%20eq%20%22Acme%20Corp%22%20AND%20total_amount%20gt%205000&limit=10" \\
|
|
7701
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
7702
|
+
|
|
7703
|
+
# Response:
|
|
7704
|
+
# {
|
|
7705
|
+
# "documents": [
|
|
7706
|
+
# {
|
|
7707
|
+
# "id": "doc_001",
|
|
7708
|
+
# "name": "Invoice_ACME_Q1.pdf",
|
|
7709
|
+
# "document_type": "Invoice",
|
|
7710
|
+
# "matched_values": {
|
|
7711
|
+
# "vendor_name": "Acme Corp",
|
|
7712
|
+
# "total_amount": "12450.00"
|
|
7713
|
+
# }
|
|
7714
|
+
# }
|
|
7715
|
+
# ],
|
|
7716
|
+
# "total": 15
|
|
7717
|
+
# }
|
|
7718
|
+
|
|
7719
|
+
# Find documents missing a critical field:
|
|
7720
|
+
curl -s "https://api.talonic.com/v1/filter?filters=payment_terms%20is_empty" \\
|
|
7721
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"`
|
|
7722
|
+
},
|
|
7723
|
+
{
|
|
7724
|
+
type: "paragraph",
|
|
7725
|
+
text: 'Document filters are particularly powerful for quality assurance and operational workflows. Use the `is_empty` operator to find documents where critical fields were not extracted; these are candidates for schema instruction improvement or manual review. Combine `gt` and `between` operators on monetary fields to segment documents by value range for prioritized processing. Because the filter state is URL-serializable, you can build a complex query, copy the URL, and share it with a colleague, who sees exactly the same filtered view without rebuilding the conditions. Save frequently used filter combinations as presets for one-click access to common queries like "missing payment terms this month" or "high-value contracts pending review".'
|
|
7726
|
+
},
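The percent-encoded `filters` parameter in the curl example can be generated rather than written by hand. The sketch below assumes that conditions are joined with ` AND ` and passed in a single `filters` query parameter, as the example request shows. `buildFilterUrl` is a hypothetical helper name.

```javascript
// Hypothetical helper (illustration only): builds a /v1/filter URL from
// plain-text conditions, mirroring the query shape in the curl example.
function buildFilterUrl(conditions, limit) {
  const expression = conditions.join(' AND ');
  // encodeURIComponent encodes spaces as %20 and double quotes as %22.
  let url = 'https://api.talonic.com/v1/filter?filters=' +
    encodeURIComponent(expression);
  if (limit !== undefined) {
    url += '&limit=' + limit;
  }
  return url;
}

const url = buildFilterUrl(
  ['vendor_name eq "Acme Corp"', 'total_amount gt 5000'],
  10
);
console.log(url);
```

Generating the URL this way keeps quoting and spacing consistent when conditions are assembled programmatically.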
|
|
5294
7727
|
{
|
|
5295
7728
|
type: "paragraph",
|
|
5296
7729
|
text: 'For example, to find all invoices from a specific vendor with outstanding amounts, build a filter with `vendor_name eq "Acme Corp"` AND `document_type eq "Invoice"` AND `total_amount gt 5000`. The field autocomplete ensures you are filtering on valid extracted fields, and the materialized index returns results instantly even across thousands of documents. Save this as a preset called "Acme high-value invoices" for one-click access when you need to review that vendor\'s billing history.'
|
|
@@ -5317,6 +7750,10 @@ var sections12 = [
|
|
|
5317
7750
|
{
|
|
5318
7751
|
question: "Can I filter on fields that have no value?",
|
|
5319
7752
|
answer: "Yes. The is_empty operator lets you find documents where a specific field was not extracted or has no value. This is useful for identifying documents that may need reprocessing or manual review."
|
|
7753
|
+
},
|
|
7754
|
+
{
|
|
7755
|
+
question: "Can I use the filter API programmatically for automated reporting?",
|
|
7756
|
+
answer: "Yes. Use GET /v1/filter with query parameters to filter documents by extracted field values programmatically. The filter syntax supports the same operators as the UI \u2014 eq, contains, gt, lt, between, and is_empty. Combine multiple conditions with AND. This is useful for building automated reports, dashboards, or alerts that trigger when documents matching specific criteria are processed."
|
|
5320
7757
|
}
|
|
5321
7758
|
],
|
|
5322
7759
|
mentions: [
|
|
@@ -5396,6 +7833,42 @@ var sections13 = [
|
|
|
5396
7833
|
description: "Create and modify resources."
|
|
5397
7834
|
}
|
|
5398
7835
|
]
|
|
7836
|
+
},
|
|
7837
|
+
{
|
|
7838
|
+
type: "heading",
|
|
7839
|
+
level: 3,
|
|
7840
|
+
id: "api-key-usage",
|
|
7841
|
+
text: "Using Your API Key"
|
|
7842
|
+
},
|
|
7843
|
+
{
|
|
7844
|
+
type: "paragraph",
|
|
7845
|
+
text: "Once you have created an API key, include it in the Authorization header of every request to the Talonic API. The key is passed as a Bearer token. All API endpoints require authentication \u2014 requests without a valid key receive a 401 Unauthorized response. Rate limiting is applied per API key, so each integration has its own independent rate limit allowance."
|
|
7846
|
+
},
|
|
7847
|
+
{
|
|
7848
|
+
type: "code",
|
|
7849
|
+
language: "bash",
|
|
7850
|
+
title: "Authenticate an API request",
|
|
7851
|
+
code: 'curl https://api.talonic.com/v1/documents \\\n -H "Authorization: Bearer tlnc_your_api_key_here"'
|
|
7852
|
+
},
|
|
7853
|
+
{
|
|
7854
|
+
type: "code",
|
|
7855
|
+
language: "json",
|
|
7856
|
+
title: "Response",
|
|
7857
|
+
code: '{\n "data": [\n {\n "id": "doc_7f3a1b2c",\n "filename": "invoice.pdf",\n "status": "completed",\n "document_type": "Invoice"\n }\n ],\n "meta": { "total": 156, "cursor": "eyJpZCI6MTU2fQ" }\n}'
|
|
7858
|
+
},
|
|
7859
|
+
{
|
|
7860
|
+
type: "paragraph",
|
|
7861
|
+
text: "If your API key is compromised, delete it immediately from Settings and create a new one. Because keys are SHA-256 hashed at rest, the platform cannot display the key again after creation \u2014 you must generate a fresh key and update all integrations that used the old one. For production environments, store API keys in a secrets manager like AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault rather than in environment files or source code."
|
|
7862
|
+
},
|
|
7863
|
+
{
|
|
7864
|
+
type: "paragraph",
|
|
7865
|
+
text: "A common pattern for enterprise teams is to create three API keys with different scopes: one extract-only key for your ingestion pipeline, one read-only key for your BI dashboard or monitoring tools, and one write key for administrative automation scripts. This separation ensures that a compromised dashboard key cannot be used to upload documents or modify schemas, and that your ingestion pipeline key cannot access sensitive configuration endpoints. Rotate each key independently on a regular schedule \u2014 monthly rotation is a common best practice for production environments."
|
|
7866
|
+
},
|
|
7867
|
+
{
|
|
7868
|
+
type: "code",
|
|
7869
|
+
language: "bash",
|
|
7870
|
+
title: "Check rate limit headers",
|
|
7871
|
+
code: 'curl -I https://api.talonic.com/v1/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY"\n\n# Response headers include:\n# X-RateLimit-Remaining: 98\n# X-RateLimit-Reset: 1746604800'
|
|
5399
7872
|
}
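A client can use the reset header to decide how long to wait after a 429 response. The sketch below assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which matches the example value above but is not stated explicitly. `msUntilReset` is a hypothetical helper name, and header names are assumed to be lower-cased.

```javascript
// Hypothetical helper (illustration only): returns how many milliseconds
// remain until the rate limit resets.
// Assumes X-RateLimit-Reset is expressed in Unix seconds.
function msUntilReset(headers, nowMs = Date.now()) {
  const resetSeconds = Number(headers['x-ratelimit-reset']);
  if (!Number.isFinite(resetSeconds)) return 0; // header missing or malformed
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```

After a 429, a caller could sleep for `msUntilReset(responseHeaders)` milliseconds before retrying.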
|
|
5400
7873
|
],
|
|
5401
7874
|
related: [
|
|
@@ -5419,6 +7892,14 @@ var sections13 = [
|
|
|
5419
7892
|
{
|
|
5420
7893
|
question: "What are best practices for API key management?",
|
|
5421
7894
|
answer: "Store keys in a secrets manager rather than source code or environment files checked into version control. Create one key per integration so each can be rotated independently. Use the narrowest scope possible \u2014 a read-only dashboard needs only the read scope, not extract or write. Rotate keys on a regular schedule and immediately revoke any key that may have been exposed. Monitor API usage per key to detect anomalies early."
|
|
7895
|
+
},
|
|
7896
|
+
{
|
|
7897
|
+
question: "How do I check my rate limit usage?",
|
|
7898
|
+
answer: "Every API response includes X-RateLimit-Remaining and X-RateLimit-Reset headers. X-RateLimit-Remaining shows how many requests you can make before hitting the limit, and X-RateLimit-Reset shows the Unix timestamp when the limit resets. If you receive a 429 Too Many Requests response, wait until the reset time before retrying."
|
|
7899
|
+
},
|
|
7900
|
+
{
|
|
7901
|
+
question: "What is the recommended API key setup for production?",
|
|
7902
|
+
answer: "Create three separate keys with different scopes: one extract-only key for your ingestion pipeline, one read-only key for dashboards and monitoring, and one write key for administrative automation. This separation ensures a compromised read-only key cannot upload documents or modify schemas. Label each key with the integration it serves and rotate them on a regular monthly schedule."
|
|
5422
7903
|
}
|
|
5423
7904
|
],
|
|
5424
7905
|
mentions: ["API keys", "tlnc_", "SHA-256", "Bearer token", "scopes"]
|
|
@@ -5471,6 +7952,40 @@ var sections13 = [
|
|
|
5471
7952
|
variant: "info",
|
|
5472
7953
|
text: "See the full [API Documentation](/docs) for detailed endpoint specifications, request/response examples, and authentication guides. The API reference is organized by namespace and includes every parameter, status code, and error response."
|
|
5473
7954
|
},
|
|
7955
|
+
{
|
|
7956
|
+
type: "heading",
|
|
7957
|
+
level: 3,
|
|
7958
|
+
id: "api-quick-start",
|
|
7959
|
+
text: "API Quick Start"
|
|
7960
|
+
},
|
|
7961
|
+
{
|
|
7962
|
+
type: "paragraph",
|
|
7963
|
+
text: "The fastest way to start using the API is to upload a document and retrieve its extraction results. The example below shows the complete flow: upload a file to a source, poll until processing completes, then fetch the extracted fields. This three-step pattern is the foundation for any API integration \u2014 from simple one-off extractions to complex automated pipelines."
|
|
7964
|
+
},
|
|
7965
|
+
{
|
|
7966
|
+
type: "code",
|
|
7967
|
+
language: "bash",
|
|
7968
|
+
title: "Step 1: Upload a document",
|
|
7969
|
+
code: 'curl -X POST https://api.talonic.com/v1/sources/src_abc123/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY" \\\n -F "file=@invoice.pdf"'
|
|
7970
|
+
},
|
|
7971
|
+
{
|
|
7972
|
+
type: "code",
|
|
7973
|
+
language: "bash",
|
|
7974
|
+
title: "Step 2: Check processing status",
|
|
7975
|
+
code: 'curl https://api.talonic.com/v1/documents/doc_7f3a1b2c \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
7976
|
+
},
|
|
7977
|
+
{
|
|
7978
|
+
type: "code",
|
|
7979
|
+
language: "bash",
|
|
7980
|
+
title: "Step 3: Retrieve extraction results",
|
|
7981
|
+
code: 'curl https://api.talonic.com/v1/extractions/doc_7f3a1b2c \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
7982
|
+
},
|
|
7983
|
+
{
|
|
7984
|
+
type: "code",
|
|
7985
|
+
language: "json",
|
|
7986
|
+
title: "Extraction response",
|
|
7987
|
+
code: '{\n "document_id": "doc_7f3a1b2c",\n "document_type": "Invoice",\n "fields": [\n {\n "name": "invoice_number",\n "value": "INV-2026-0472",\n "confidence": 0.98\n },\n {\n "name": "vendor_name",\n "value": "Acme Corp",\n "confidence": 0.96\n },\n {\n "name": "total_amount",\n "value": "4,250.00",\n "confidence": 0.95\n }\n ]\n}'
|
|
7988
|
+
},
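The polling step of the flow above can be sketched as a small helper that is independent of any HTTP client. It assumes the document resource reports `status: "completed"` when processing finishes, as in the example upload response. The function name, the injected `getStatus` callback, and the default interval and attempt count are all hypothetical.

```javascript
// Hypothetical polling helper (illustration only, not a Talonic SDK function).
// getStatus is any async function that resolves to the document resource,
// for example a wrapper around GET /v1/documents/:id.
async function pollUntilCompleted(getStatus, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const doc = await getStatus();
    if (doc.status === 'completed') return doc;
    if (attempt < maxAttempts) {
      // Wait before the next status check.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error('Document did not complete within ' + maxAttempts + ' attempts');
}
```

A production integration would also need to stop on whatever failure status the API reports, which is not shown here, or rely on a webhook notification instead of polling.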
|
|
5474
7989
|
{
|
|
5475
7990
|
type: "param-table",
|
|
5476
7991
|
title: "API namespaces",
|
|
@@ -5649,6 +8164,42 @@ var sections13 = [
|
|
|
5649
8164
|
variant: "warning",
|
|
5650
8165
|
text: "Your webhook endpoint must respond with a 2xx status code within 30 seconds. Non-2xx responses or timeouts trigger the retry schedule. Permanent client errors (4xx except 429) are treated as terminal failures and routed directly to the DLQ without further retries."
|
|
5651
8166
|
},
|
|
8167
|
+
{
|
|
8168
|
+
type: "heading",
|
|
8169
|
+
level: 3,
|
|
8170
|
+
id: "webhook-setup-api",
|
|
8171
|
+
text: "Setting Up Webhooks via API"
|
|
8172
|
+
},
|
|
8173
|
+
{
|
|
8174
|
+
type: "paragraph",
|
|
8175
|
+
text: "Webhooks are configured through the delivery API. First create a destination of type webhook with your endpoint URL and signing secret, then create a binding that maps signal types to the destination. You can bind multiple signal types to the same destination or use separate destinations for different event categories. The delivery catalog endpoint lists all available signal types so you can choose which events to subscribe to."
|
|
8176
|
+
},
|
|
8177
|
+
{
|
|
8178
|
+
type: "code",
|
|
8179
|
+
language: "bash",
|
|
8180
|
+
title: "List available webhook signal types",
|
|
8181
|
+
code: 'curl https://api.talonic.com/v1/delivery/catalog/signals \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
8182
|
+
},
|
|
8183
|
+
{
|
|
8184
|
+
type: "code",
|
|
8185
|
+
language: "javascript",
|
|
8186
|
+
title: "Verify a webhook signature (Node.js example)",
|
|
8187
|
+
code: `// Your endpoint receives:
|
|
8188
|
+
// Header: X-Talonic-Signature: sha256=abc123...
|
|
8189
|
+
// Body: {"type":"document.extracted","data":{...}}
|
|
8190
|
+
|
|
8191
|
+
// Verify with:
|
|
8192
|
+
const crypto = require('crypto');
|
|
8193
|
+
const expected = crypto
|
|
8194
|
+
.createHmac('sha256', SIGNING_SECRET)
|
|
8195
|
+
.update(rawBody)
|
|
8196
|
+
.digest('hex');
|
|
8197
|
+
const valid = signature === \`sha256=\${expected}\`;`
|
|
8198
|
+
},
|
|
8199
|
+
{
|
|
8200
|
+
type: "paragraph",
|
|
8201
|
+
text: "For high-reliability integrations, always verify the HMAC-SHA256 signature before processing webhook payloads. Compute the HMAC of the raw request body using your signing secret and compare it to the signature header. This verification confirms that the payload was sent by Talonic and has not been tampered with in transit. Additionally, use the idempotency key header to deduplicate retried deliveries \u2014 store processed idempotency keys and skip any payload you have already handled."
|
|
8202
|
+
},
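The signature check described above can be sketched with a constant-time comparison. This assumes the header value has the form `sha256=<hex digest>`, as in the example, and that `rawBody` is the unparsed request body. `verifySignature` is a hypothetical function name.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper (illustration only): verifies an HMAC-SHA256 webhook
// signature using a constant-time comparison.
function verifySignature(rawBody, signatureHeader, signingSecret) {
  const expected = 'sha256=' + crypto
    .createHmac('sha256', signingSecret)
    .update(rawBody)
    .digest('hex');

  const expectedBuf = Buffer.from(expected, 'utf8');
  const receivedBuf = Buffer.from(String(signatureHeader || ''), 'utf8');

  // timingSafeEqual throws if the lengths differ, so compare lengths first.
  return expectedBuf.length === receivedBuf.length &&
    crypto.timingSafeEqual(expectedBuf, receivedBuf);
}
```

Comparing with `crypto.timingSafeEqual` rather than `===` avoids leaking how many leading characters of a forged signature were correct.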
|
|
5652
8203
|
{
|
|
5653
8204
|
type: "param-table",
|
|
5654
8205
|
title: "Delivery signal types (webhook-compatible)",
|
|
@@ -5728,6 +8279,14 @@ var sections13 = [
|
|
|
5728
8279
|
{
|
|
5729
8280
|
question: "How do I verify webhook signatures?",
|
|
5730
8281
|
answer: "Each webhook payload is signed with HMAC-SHA256 using the signing secret from your delivery destination configuration. Compute the HMAC of the raw request body and compare it to the signature header to verify authenticity. This ensures the payload was sent by Talonic and was not tampered with in transit."
|
|
8282
|
+
},
|
|
8283
|
+
{
|
|
8284
|
+
question: "How do I replay failed webhook deliveries?",
|
|
8285
|
+
answer: "Failed deliveries that exhaust all retries are routed to the dead-letter queue (DLQ). You can replay any DLQ item via POST /v1/delivery/dlq/:id/replay, which enqueues a fresh delivery attempt. Replay is append-only \u2014 it creates a new attempt without modifying the original failure record, preserving the full delivery history for auditing."
|
|
8286
|
+
},
|
|
8287
|
+
{
|
|
8288
|
+
question: "Can I use webhooks with batch processing?",
|
|
8289
|
+
answer: "Yes. When documents uploaded with processing_mode=batch complete extraction, the document.extracted signal fires just as it does for real-time documents. You can bind this signal to a webhook destination to receive notifications when batch results are ready. The run.extraction.completed signal also fires when an entire batch inference run finishes, which is useful for monitoring batch completions at a higher level."
|
|
5731
8290
|
}
|
|
5732
8291
|
],
|
|
5733
8292
|
mentions: [
|
|
@@ -5816,6 +8375,36 @@ var sections14 = [
|
|
|
5816
8375
|
"Assign the appropriate role based on their responsibilities.",
|
|
5817
8376
|
"Optionally, change roles later from the same Team page."
|
|
5818
8377
|
]
|
|
8378
|
+
},
|
|
8379
|
+
{
|
|
8380
|
+
type: "heading",
|
|
8381
|
+
level: 3,
|
|
8382
|
+
id: "team-api-access",
|
|
8383
|
+
text: "API Access and Roles"
|
|
8384
|
+
},
|
|
8385
|
+
{
|
|
8386
|
+
type: "paragraph",
|
|
8387
|
+
text: "API key scopes are separate from team roles, but they work together to enforce access control. An API key created by an Owner can have any combination of scopes (extract, read, write), while team role permissions apply when members interact through the web interface or AI agent. For programmatic integrations, create dedicated API keys with the minimum required scopes rather than sharing a single key across all team members. This separation ensures that revoking a team member's access does not disrupt API-based workflows."
|
|
8388
|
+
},
|
|
8389
|
+
{
|
|
8390
|
+
type: "code",
|
|
8391
|
+
language: "bash",
|
|
8392
|
+
title: "List documents using a read-scoped API key",
|
|
8393
|
+
code: 'curl https://api.talonic.com/v1/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
8394
|
+
},
|
|
8395
|
+
{
|
|
8396
|
+
type: "code",
|
|
8397
|
+
language: "json",
|
|
8398
|
+
title: "Response",
|
|
8399
|
+
code: '{\n "data": [\n {\n "id": "doc_7f3a1b2c",\n "filename": "invoice.pdf",\n "status": "completed",\n "document_type": "Invoice"\n }\n ],\n "meta": { "total": 156, "cursor": "eyJpZCI6MTU2fQ" }\n}'
|
|
8400
|
+
},
|
|
8401
|
+
{
|
|
8402
|
+
type: "paragraph",
|
|
8403
|
+
text: "When onboarding a new team, a common pattern is to have the Owner create the workspace and set up API keys, then invite team members via domain matching. Admins handle day-to-day member approvals and role assignments, while Members focus on document processing and schema configuration. This separation of concerns scales well from small teams of two or three people to larger organizations with dozens of members across multiple departments."
|
|
8404
|
+
},
|
|
8405
|
+
{
|
|
8406
|
+
type: "paragraph",
|
|
8407
|
+
text: 'For security-sensitive organizations, it is important to establish clear API key ownership practices. Only Owners can create and revoke API keys, which prevents team members from generating keys that might bypass role-based access controls. Each key should be labelled with the integration it serves (e.g., "BI Dashboard - Read Only", "ETL Pipeline - Extract") so that when a team member leaves or an integration is decommissioned, the corresponding key can be identified and revoked quickly without guessing which key belongs to which system.'
|
|
5819
8408
|
}
|
|
5820
8409
|
],
|
|
5821
8410
|
related: [
|
|
@@ -5830,7 +8419,7 @@ var sections14 = [
|
|
|
5830
8419
|
},
|
|
5831
8420
|
{
|
|
5832
8421
|
question: "How are new team members added?",
|
|
5833
|
-
answer: "New members are added via domain matching: company email domains auto-match to your organization with pending status. Admin
|
|
8422
|
+
answer: "New members are added via domain matching: company email domains auto-match to your organization with pending status requiring admin approval. An Admin or Owner must explicitly approve each pending member before they gain access to any workspace data. This ensures no unauthorized users can access your documents or extraction results."
|
|
5834
8423
|
},
|
|
5835
8424
|
{
|
|
5836
8425
|
question: "Can I change a team member's role after they join?",
|
|
@@ -5839,6 +8428,10 @@ var sections14 = [
|
|
|
5839
8428
|
{
|
|
5840
8429
|
question: "What happens if I remove a team member?",
|
|
5841
8430
|
answer: "Removing a team member revokes their access to the organization immediately. Their past actions (edits, uploads, approvals) remain in the audit trail. They can be re-added later through the same domain matching process."
|
|
8431
|
+
},
|
|
8432
|
+
{
|
|
8433
|
+
question: "How do API key scopes relate to team roles?",
|
|
8434
|
+
answer: "API key scopes and team roles are separate access control mechanisms that work together to enforce security. API keys have three scopes (extract, read, write) that govern what API operations the key can perform. Team roles (Viewer, Member, Admin, Owner) govern what users can do through the web interface and AI agent. Only Owners can create and revoke API keys. Create dedicated API keys with minimum required scopes for each integration rather than sharing keys across team members."
|
|
5842
8435
|
}
|
|
5843
8436
|
],
|
|
5844
8437
|
mentions: [
|
|
@@ -5922,6 +8515,48 @@ var sections14 = [
|
|
|
5922
8515
|
type: "callout",
|
|
5923
8516
|
variant: "info",
|
|
5924
8517
|
text: "The call log records every LLM and OCR call with full detail \u2014 model name, input/output token counts, latency, and cost. Use it to audit individual extractions or investigate unexpected cost increases. Each entry links back to the specific document and job that triggered the call."
|
|
8518
|
+
},
|
|
8519
|
+
{
|
|
8520
|
+
type: "heading",
|
|
8521
|
+
level: 3,
|
|
8522
|
+
id: "usage-api",
|
|
8523
|
+
text: "Tracking Usage via API"
|
|
8524
|
+
},
|
|
8525
|
+
{
|
|
8526
|
+
type: "paragraph",
|
|
8527
|
+
text: "Usage data is fully accessible through the REST API, enabling you to build custom cost monitoring dashboards or integrate usage tracking into your financial systems. The credits endpoint provides credit balance, transaction history, daily breakdowns, and per-request usage logs. This is particularly useful for organizations that need to allocate extraction costs across departments or client projects based on actual usage."
|
|
8528
|
+
},
|
|
8529
|
+
{
|
|
8530
|
+
type: "code",
|
|
8531
|
+
language: "bash",
|
|
8532
|
+
title: "Get credit balance and usage summary",
|
|
8533
|
+
code: 'curl https://api.talonic.com/v1/credits \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
8534
|
+
},
|
|
8535
|
+
{
|
|
8536
|
+
type: "code",
|
|
8537
|
+
language: "json",
|
|
8538
|
+
title: "Response",
|
|
8539
|
+
code: '{\n "balance": 12450,\n "total_used": 37550,\n "usage_by_feature": {\n "extraction": 22100,\n "ocr": 8900,\n "batch": 4200,\n "matching": 2350\n },\n "period": "current_month"\n}'
|
|
8540
|
+
},
|
|
8541
|
+
{
|
|
8542
|
+
type: "paragraph",
|
|
8543
|
+
text: "For automated cost monitoring, query the daily breakdown endpoint to track spend trends over time. Set up alerts in your monitoring system when daily spend exceeds a threshold \u2014 this catches unexpected usage spikes caused by large document uploads or misconfigured routing rules before they accumulate into significant costs. Teams that monitor usage proactively typically identify optimization opportunities that reduce their monthly extraction costs by 20-30%."
|
|
8544
|
+
},
|
|
8545
|
+
{
|
|
8546
|
+
type: "code",
|
|
8547
|
+
language: "bash",
|
|
8548
|
+
title: "Get daily usage breakdown",
|
|
8549
|
+
code: 'curl https://api.talonic.com/v1/credits/daily \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
8550
|
+
},
|
|
8551
|
+
{
|
|
8552
|
+
type: "code",
|
|
8553
|
+
language: "json",
|
|
8554
|
+
title: "Response",
|
|
8555
|
+
code: '{\n "data": [\n { "date": "2026-05-07", "cost": 12.50, "documents_processed": 45 },\n { "date": "2026-05-06", "cost": 8.30, "documents_processed": 31 },\n { "date": "2026-05-05", "cost": 15.80, "documents_processed": 62 }\n ]\n}'
|
|
8556
|
+
},
|
|
8557
|
+
{
|
|
8558
|
+
type: "paragraph",
|
|
8559
|
+
text: "The per-request usage log is particularly valuable for cost allocation. Each entry includes the document ID, job ID, model used, token counts, and computed cost. This granular data lets you attribute costs to specific documents, document types, or processing jobs. For organizations that charge back extraction costs to departments or clients, this data provides the foundation for accurate, auditable billing. Export the usage log via the API and feed it into your accounting or BI system for automated cost reporting."
|
|
5925
8560
|
}
|
|
5926
8561
|
],
|
|
5927
8562
|
related: [
|
|
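The daily-spend alerting idea from the usage section above could be sketched like this. It assumes the response shape shown in the "Get daily usage breakdown" example (`data` entries with `date`, `cost`, and `documents_processed`); the function names and the threshold value are hypothetical, and fetching the data and wiring the result into a monitoring system are left out.

```javascript
// Sample payload copied from the daily breakdown example response.
const sampleDaily = {
  data: [
    { date: '2026-05-07', cost: 12.50, documents_processed: 45 },
    { date: '2026-05-06', cost: 8.30, documents_processed: 31 },
    { date: '2026-05-05', cost: 15.80, documents_processed: 62 },
  ],
};

// Hypothetical helper: the days whose cost exceeds the given threshold.
function daysOverThreshold(dailyBreakdown, threshold) {
  return dailyBreakdown.data.filter((day) => day.cost > threshold);
}

// Hypothetical helper: average cost per document for each day,
// guarding against days with no processed documents.
function costPerDocument(dailyBreakdown) {
  return dailyBreakdown.data.map((day) => ({
    date: day.date,
    costPerDocument:
      day.documents_processed > 0 ? day.cost / day.documents_processed : 0,
  }));
}

// Example: flag any day above an assumed threshold of 10.
const flagged = daysOverThreshold(sampleDaily, 10).map((day) => day.date);
console.log(flagged);
```

Tracking cost per document alongside the raw daily total helps distinguish a volume spike (more documents) from a per-document cost increase (for example, a routing change).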
@@ -5941,6 +8576,10 @@ var sections14 = [
|
|
|
5941
8576
|
{
|
|
5942
8577
|
question: "How can I reduce my usage costs?",
|
|
5943
8578
|
answer: "Use batch mode for non-urgent documents to cut extraction costs by 50%. Review the per-feature breakdown to identify your highest-cost operations, and use the daily cost chart to spot and investigate usage spikes. Additionally, invest in building your Field Registry \u2014 as more fields reach Tier 1 and Tier 2, values are resolved via deterministic lookup instead of LLM calls, which reduces per-document extraction cost over time. Leverage routing rules to assign schemas automatically, which avoids manual re-extractions and wasted processing."
|
|
8579
|
+
},
|
|
8580
|
+
{
|
|
8581
|
+
question: "Can I access usage data through the API?",
|
|
8582
|
+
answer: "Yes. The /v1/credits endpoint provides credit balance, transaction history, daily cost breakdowns, and per-request usage logs. Use this data to build custom cost dashboards, allocate costs across departments, or set up automated spending alerts in your monitoring system. Each entry links back to the specific document and job that generated the cost."
|
|
5944
8583
|
}
|
|
5945
8584
|
],
|
|
5946
8585
|
mentions: [
|
|
@@ -5975,6 +8614,10 @@ var sections14 = [
|
|
|
5975
8614
|
type: "paragraph",
|
|
5976
8615
|
text: "For best results, limit Admin Panel access to a small group of trusted platform operators. Use the **master registry** view to audit field definitions and schemas across tenants \u2014 this is particularly useful when standardizing extraction configurations or troubleshooting cross-tenant data quality issues."
|
|
5977
8616
|
},
|
|
8617
|
+
{
|
|
8618
|
+
type: "paragraph",
|
|
8619
|
+
text: "The data clear and rebuild operation is designed for specific scenarios: initial onboarding when a customer uploads test documents that should not persist, significant schema changes that require reprocessing the entire corpus, or data quality issues where starting fresh produces better results than incremental fixes. Before executing a data clear, always confirm with the customer and verify that any approved data products have been exported to downstream systems, since the operation permanently removes all extraction history and job results."
|
|
8620
|
+
},
|
|
5978
8621
|
{
|
|
5979
8622
|
type: "list",
|
|
5980
8623
|
ordered: false,
|
|
@@ -5990,6 +8633,36 @@ var sections14 = [
|
|
|
5990
8633
|
type: "callout",
|
|
5991
8634
|
variant: "warning",
|
|
5992
8635
|
text: "The **data clear** operation is irreversible. It deletes all documents, extractions, jobs, and results for the selected customer. Use with caution and only when a full reprocessing is genuinely needed."
|
|
8636
|
+
},
|
|
8637
|
+
{
|
|
8638
|
+
type: "heading",
|
|
8639
|
+
level: 3,
|
|
8640
|
+
id: "admin-workflows",
|
|
8641
|
+
text: "Common Admin Workflows"
|
|
8642
|
+
},
|
|
8643
|
+
{
|
|
8644
|
+
type: "paragraph",
|
|
8645
|
+
text: "The most common admin workflow is onboarding a new customer. Navigate to Customer Management, create the organization, then share the workspace URL with the customer's team lead. As team members sign up with their company email domain, they appear in the pending approval queue. The admin reviews and approves each member, assigning roles based on their responsibilities. Once the team is set up, the admin can create initial API keys and configure source connectors to get the customer started with document ingestion."
|
|
8646
|
+
},
|
|
8647
|
+
{
|
|
8648
|
+
type: "paragraph",
|
|
8649
|
+
text: "For troubleshooting, the master registry view is invaluable. It lets you inspect a customer's field registry, schemas, and extraction results from a single cross-tenant interface. If a customer reports low extraction quality, start by checking their field registry \u2014 a registry with mostly Tier 3 fields and low instruction coverage suggests the knowledge graph has not matured enough. Running a batch resolution followed by instruction synthesis often resolves the issue by promoting fields and generating master extraction instructions."
|
|
8650
|
+
},
|
|
8651
|
+
{
|
|
8652
|
+
type: "paragraph",
|
|
8653
|
+
text: "The Admin Panel also provides tools for platform capacity planning. By reviewing usage statistics across all customers, administrators can identify growth trends, forecast compute and storage needs, and plan infrastructure scaling ahead of demand spikes. The per-customer breakdown reveals which organizations are the heaviest users and which features drive the most cost, helping prioritize optimization efforts where they have the greatest impact."
|
|
8654
|
+
},
|
|
8655
|
+
{
|
|
8656
|
+
type: "code",
|
|
8657
|
+
language: "bash",
|
|
8658
|
+
title: "List all organizations (admin only)",
|
|
8659
|
+
code: 'curl https://api.talonic.com/v1/admin/customers \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
8660
|
+
},
|
|
8661
|
+
{
|
|
8662
|
+
type: "code",
|
|
8663
|
+
language: "json",
|
|
8664
|
+
title: "Response",
|
|
8665
|
+
code: '{\n "data": [\n {\n "id": "cust_001",\n "name": "Acme Corp",\n "members_count": 12,\n "documents_count": 1847,\n "created_at": "2026-01-15T08:00:00Z"\n }\n ],\n "meta": { "total": 8 }\n}'
|
|
5993
8666
|
}
|
|
5994
8667
|
],
|
|
5995
8668
|
related: [
|
|
@@ -6012,6 +8685,10 @@ var sections14 = [
|
|
|
6012
8685
|
{
|
|
6013
8686
|
question: "Can I view usage across all customers?",
|
|
6014
8687
|
answer: "Yes. The Admin Panel includes a master registry view that shows cross-tenant usage statistics, per-customer cost breakdowns, and platform-wide aggregates. This is useful for identifying high-usage tenants, tracking platform growth, and forecasting infrastructure needs."
|
|
8688
|
+
},
|
|
8689
|
+
{
|
|
8690
|
+
question: "How do I troubleshoot low extraction quality for a customer?",
|
|
8691
|
+
answer: "Start by checking the customer's field registry via the master registry view. Look at tier distribution, instruction coverage, and unresolved occurrence count. A registry dominated by Tier 3 fields with low instruction coverage suggests the knowledge graph needs more data. Run batch resolution followed by instruction synthesis to promote fields and generate extraction directives. Then review the telemetry metrics \u2014 capture rate, resolve rate, and synthesize rate \u2014 to confirm improvement."
|
|
6015
8692
|
}
|
|
6016
8693
|
],
|
|
6017
8694
|
mentions: [
|
|
@@ -6080,6 +8757,36 @@ var sections14 = [
|
|
|
6080
8757
|
type: "callout",
|
|
6081
8758
|
variant: "info",
|
|
6082
8759
|
text: "The **quick extract** shortcut (`Cmd+J` / `Ctrl+J`) is the fastest way to upload a single document. It opens a streamlined upload interface that lets you drag a file and start processing immediately. Use it when you receive a document via email or chat and want instant extraction results."
|
|
8760
|
+
},
|
|
8761
|
+
{
|
|
8762
|
+
type: "heading",
|
|
8763
|
+
level: 3,
|
|
8764
|
+
id: "shortcuts-workflow",
|
|
8765
|
+
text: "Shortcuts in Practice"
|
|
8766
|
+
},
|
|
8767
|
+
{
|
|
8768
|
+
type: "paragraph",
|
|
8769
|
+
text: 'Keyboard shortcuts become especially powerful when combined with the AI agent. A typical power-user workflow looks like this: press `Cmd+J` to quick-extract a document that just arrived via email, then press `Cmd+I` to open the agent and ask "What fields did the platform extract from my latest document?" The agent responds with the full list of extracted fields, confidence scores, and the document classification. If you need to find a related document, press `Cmd+K` and search by a field value like the vendor name or invoice number. All three actions happen without navigating away from your current page.'
|
|
8770
|
+
},
|
|
8771
|
+
{
|
|
8772
|
+
type: "paragraph",
|
|
8773
|
+
text: "For teams that process documents throughout the day, these shortcuts save significant time by eliminating navigation clicks. Over a typical workday of processing 20-30 documents, keyboard shortcuts can save 15-20 minutes compared to clicking through the sidebar for each operation. The time savings compound further when reviewing extraction results, since `Cmd+I` (AI agent) provides instant answers to quality questions that would otherwise require navigating to multiple pages."
|
|
8774
|
+
},
|
|
8775
|
+
{
|
|
8776
|
+
type: "paragraph",
|
|
8777
|
+
text: "Omnisearch (`Cmd+K`) deserves special mention because it searches across multiple resource types simultaneously. When you type a query, it returns matching documents, extracted field values, schema names, source names, and field registry entries \u2014 all in a single search overlay. This makes it the fastest way to locate any resource in your workspace. For example, searching for a vendor name returns every document mentioning that vendor, every schema field that matches, and every registry entry related to that concept."
|
|
8778
|
+
},
|
|
8779
|
+
{
|
|
8780
|
+
type: "code",
|
|
8781
|
+
language: "bash",
|
|
8782
|
+
title: "Equivalent API call for quick extract (Cmd+J)",
|
|
8783
|
+
code: 'curl -X POST https://api.talonic.com/v1/sources/src_abc123/documents \\\n -H "Authorization: Bearer $TALONIC_API_KEY" \\\n -F "file=@receipt.jpg"'
|
|
8784
|
+
},
|
|
8785
|
+
{
|
|
8786
|
+
type: "code",
|
|
8787
|
+
language: "bash",
|
|
8788
|
+
title: "Equivalent API call for omnisearch (Cmd+K)",
|
|
8789
|
+
code: 'curl "https://api.talonic.com/v1/search?q=Acme%20Corp" \\\n -H "Authorization: Bearer $TALONIC_API_KEY"'
|
|
6083
8790
|
}
|
|
6084
8791
|
],
|
|
6085
8792
|
related: [
|
|
@@ -6098,6 +8805,14 @@ var sections14 = [
|
|
|
6098
8805
|
{
|
|
6099
8806
|
question: "Do shortcuts work inside modals or overlays?",
|
|
6100
8807
|
answer: "The Escape shortcut works inside any modal or overlay to close it. Omnisearch (Cmd+K) works globally, even when other overlays are open. The AI Agent shortcut (Cmd+I) also works from any context. Quick extract (Cmd+J) is available from the main interface."
|
|
8808
|
+
},
|
|
8809
|
+
{
|
|
8810
|
+
question: "Can I perform the same actions as keyboard shortcuts through the API?",
|
|
8811
|
+
answer: "Yes. Quick extract (Cmd+J) corresponds to POST /v1/sources/:sourceId/documents for uploading a document. Omnisearch (Cmd+K) corresponds to GET /v1/search for querying across documents, fields, and schemas. The AI Agent (Cmd+I) does not have a direct API equivalent since it is a conversational interface, but all the data it queries is accessible through the individual API endpoints."
|
|
8812
|
+
},
|
|
8813
|
+
{
|
|
8814
|
+
question: "Are there keyboard shortcuts for navigation within pages?",
|
|
8815
|
+
answer: "The four global shortcuts (Cmd+K, Cmd+J, Cmd+I, Escape) work from any page. Within specific pages like the job results grid, you can use standard browser keyboard navigation (Tab, Enter, arrow keys) to move between cells and expand provenance panels. The platform does not override standard browser shortcuts beyond the four global bindings, so your usual browser keyboard patterns continue to work."
|
|
6101
8816
|
}
|
|
6102
8817
|
],
|
|
6103
8818
|
mentions: ["keyboard shortcuts", "Cmd+K", "Cmd+J", "Escape", "quick extract"]
|
|
@@ -6143,6 +8858,49 @@ var sections15 = [
|
|
|
6143
8858
|
"**Immediate visibility** \u2014 documents appear in your library right after Stage 1 (OCR + classification).",
|
|
6144
8859
|
"**Automatic result application** \u2014 when the batch completes, results are applied and documents transition to their final status."
|
|
6145
8860
|
]
|
|
8861
|
+
},
|
|
8862
|
+
{
|
|
8863
|
+
type: "code",
|
|
8864
|
+
language: "bash",
|
|
8865
|
+
title: "Upload documents in batch mode via API",
|
|
8866
|
+
code: `curl -X POST https://api.talonic.com/v1/extract \\
|
|
8867
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
8868
|
+
-F "file=@invoices_batch.pdf" \\
|
|
8869
|
+
-F "processing_mode=batch"
|
|
8870
|
+
|
|
8871
|
+
# Response:
|
|
8872
|
+
# {
|
|
8873
|
+
# "document_id": "doc_batch_001",
|
|
8874
|
+
# "status": "batch_queued",
|
|
8875
|
+
# "processing_mode": "batch",
|
|
8876
|
+
# "stage_1_completed": true,
|
|
8877
|
+
# "document_type": "Invoice",
|
|
8878
|
+
# "message": "Stage 1 complete. Stage 2 deferred to batch API."
|
|
8879
|
+
# }`
|
|
8880
|
+
},
|
|
8881
|
+
{
|
|
8882
|
+
type: "code",
|
|
8883
|
+
language: "bash",
|
|
8884
|
+
title: "Check batch status",
|
|
8885
|
+
code: `curl -s https://api.talonic.com/v1/batches \\
|
|
8886
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
8887
|
+
|
|
8888
|
+
# Response:
|
|
8889
|
+
# {
|
|
8890
|
+
# "batches": [
|
|
8891
|
+
# {
|
|
8892
|
+
# "id": "batch_abc",
|
|
8893
|
+
# "status": "submitted",
|
|
8894
|
+
# "item_count": 150,
|
|
8895
|
+
# "submitted_at": "2025-04-22T10:15:00Z",
|
|
8896
|
+
# "provider": "anthropic"
|
|
8897
|
+
# }
|
|
8898
|
+
# ]
|
|
8899
|
+
# }`
|
|
8900
|
+
},
|
|
8901
|
+
{
|
|
8902
|
+
type: "paragraph",
|
|
8903
|
+
text: "Batch inference is particularly cost-effective for periodic bulk operations. A common workflow is to configure a source connection (such as Google Drive or S3) with the batch processing toggle enabled, so all documents ingested through that source are automatically routed to the batch queue. This is ideal for overnight processing of large document volumes \u2014 documents arrive throughout the day, accumulate in the batch queue, and are submitted together to the provider for off-peak processing. By morning, all results have been applied and documents are ready for review at half the extraction cost."
|
|
6146
8904
|
}
|
|
6147
8905
|
],
|
|
6148
8906
|
related: [
|
|
@@ -6166,6 +8924,10 @@ var sections15 = [
|
|
|
6166
8924
|
{
|
|
6167
8925
|
question: "Does batch mode affect extraction quality?",
|
|
6168
8926
|
answer: "No. Batch mode uses the same Claude extraction model and prompts as real-time processing. The only difference is timing \u2014 extraction is deferred to take advantage of provider off-peak pricing."
|
|
8927
|
+
},
|
|
8928
|
+
{
|
|
8929
|
+
question: "Can I mix batch and real-time documents in the same workspace?",
|
|
8930
|
+
answer: "Yes. Batch and real-time documents coexist in the same workspace and library. Each document tracks its processing_mode independently. You can toggle batch mode per upload or per source connection. Time-sensitive documents use real-time processing while bulk backlogs use batch mode \u2014 both produce identical extraction output."
|
|
6169
8931
|
}
|
|
6170
8932
|
],
|
|
6171
8933
|
mentions: ["batch inference", "50% cost", "48-hour delivery", "backlog ingestion", "Message Batches API"]
|
|
@@ -6224,6 +8986,24 @@ var sections15 = [
|
|
|
6224
8986
|
"**Fallback behavior:** Parse failures in batch mode are retried through the real-time extraction path \u2014 never as a new batch \u2014 to maintain the 48-hour SLA.",
|
|
6225
8987
|
"**Minimum threshold:** Batches require at least 100 items (a provider requirement). Uploads below this threshold fall back to real-time processing with a warning."
|
|
6226
8988
|
]
|
|
8989
|
+
},
|
|
8990
|
+
{
|
|
8991
|
+
type: "code",
|
|
8992
|
+
language: "bash",
|
|
8993
|
+
title: "Enable batch processing on a source connection",
|
|
8994
|
+
code: `# Toggle batch mode for all documents from a Google Drive source:
|
|
8995
|
+
curl -X PATCH https://api.talonic.com/v1/sources/src_gdrive_001 \\
|
|
8996
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
8997
|
+
-H "Content-Type: application/json" \\
|
|
8998
|
+
-d '{ "batch_processing": true }'
|
|
8999
|
+
|
|
9000
|
+
# All future documents ingested from this source will use
|
|
9001
|
+
# batch processing mode automatically.
|
|
9002
|
+
# Stage 1 (OCR + classify) still runs immediately.`
|
|
9003
|
+
},
|
|
9004
|
+
{
|
|
9005
|
+
type: "paragraph",
|
|
9006
|
+
text: "The two-stage architecture of batch processing balances immediate feedback against cost optimization. Stage 1 runs immediately: within seconds, you know the document's type and classification, along with its triage metadata, and the document appears in your library. Only Stage 2 (the expensive LLM extraction step) is deferred. This means you can build workflows that react to document arrival (routing rules, notifications, triage) without waiting for batch results, while still saving 50% on the extraction cost. Documents show a clear `batch_queued` status in the library so you always know which documents are waiting for extraction results."
|
|
6227
9007
|
}
|
|
6228
9008
|
],
|
|
6229
9009
|
related: [
|
|
@@ -6247,6 +9027,10 @@ var sections15 = [
|
|
|
6247
9027
|
{
|
|
6248
9028
|
question: "Can I enable batch mode per source?",
|
|
6249
9029
|
answer: "Yes. Each source connection has a batch processing toggle. When enabled, all documents ingested through that source are automatically processed in batch mode."
|
|
9030
|
+
},
|
|
9031
|
+
{
|
|
9032
|
+
question: "Why are image-only documents excluded from batch processing?",
|
|
9033
|
+
answer: "The batch payload is text-only \u2014 it contains the OCR markdown from Stage 1. Image-only documents (PNG, JPG) that have no extractable text require Claude Vision, which uses the image bytes directly. Since image bytes cannot be included in the text-based batch payload, these documents are automatically routed to real-time processing even when batch mode is enabled."
|
|
6250
9034
|
}
|
|
6251
9035
|
],
|
|
6252
9036
|
mentions: [
|
|
@@ -6306,6 +9090,36 @@ var sections15 = [
|
|
|
6306
9090
|
}
|
|
6307
9091
|
]
|
|
6308
9092
|
},
|
|
9093
|
+
{
|
|
9094
|
+
type: "code",
|
|
9095
|
+
language: "bash",
|
|
9096
|
+
title: "Monitor batch progress via API",
|
|
9097
|
+
code: `# List all batches with their statuses:
|
|
9098
|
+
curl -s https://api.talonic.com/v1/batches \\
|
|
9099
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
9100
|
+
|
|
9101
|
+
# Get detail for a specific batch including item states:
|
|
9102
|
+
curl -s https://api.talonic.com/v1/batches/batch_abc \\
|
|
9103
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
9104
|
+
|
|
9105
|
+
# Response:
|
|
9106
|
+
# {
|
|
9107
|
+
# "id": "batch_abc",
|
|
9108
|
+
# "status": "submitted",
|
|
9109
|
+
# "item_count": 150,
|
|
9110
|
+
# "submitted_at": "2025-04-22T10:15:00Z",
|
|
9111
|
+
# "provider": "anthropic",
|
|
9112
|
+
# "items": [
|
|
9113
|
+
# { "document_id": "doc_001", "status": "pending" },
|
|
9114
|
+
# { "document_id": "doc_002", "status": "completed" },
|
|
9115
|
+
# { "document_id": "doc_003", "status": "parse_error", "retried_realtime": true }
|
|
9116
|
+
# ]
|
|
9117
|
+
# }`
|
|
9118
|
+
},
|
|
9119
|
+
{
|
|
9120
|
+
type: "paragraph",
|
|
9121
|
+
text: "The batch detail view is your primary tool for diagnosing issues with batch processing. Each item shows its individual status \u2014 pending, completed, or parse_error with a retried_realtime flag indicating whether the system automatically retried it through the real-time path. Items with parse errors are retried exactly once through the real-time extraction path, ensuring the 48-hour SLA is maintained. If a batch has an unusually high parse error rate, this may indicate a problem with the documents themselves (corrupt files, unusual formatting) rather than a system issue. The crash recovery mechanism ensures that infrastructure disruptions \u2014 application restarts, memory pressure, or network interruptions \u2014 do not leave batches in a permanently stuck state."
|
|
9122
|
+
},
|
|
6309
9123
|
{
|
|
6310
9124
|
type: "callout",
|
|
6311
9125
|
variant: "info",
|
|
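To act on the batch detail response described above, one option is to tally item statuses and compute the parse error rate. The sketch below assumes the response shape from the "Monitor batch progress via API" example (an `items` array with `status` and an optional `retried_realtime` flag); `summarizeBatch` is a hypothetical helper, and what counts as an "unusually high" rate is left for you to decide.

```javascript
// Sample payload based on the batch detail example response.
const sampleBatch = {
  id: 'batch_abc',
  status: 'submitted',
  items: [
    { document_id: 'doc_001', status: 'pending' },
    { document_id: 'doc_002', status: 'completed' },
    { document_id: 'doc_003', status: 'parse_error', retried_realtime: true },
  ],
};

// Hypothetical helper: counts items per status and reports the share of
// items that hit a parse error, plus how many were retried in real time.
function summarizeBatch(batch) {
  const counts = {};
  let retried = 0;
  for (const item of batch.items) {
    counts[item.status] = (counts[item.status] || 0) + 1;
    if (item.retried_realtime) retried += 1;
  }
  const total = batch.items.length;
  const parseErrors = counts.parse_error || 0;
  return {
    counts,
    retriedRealtime: retried,
    parseErrorRate: total > 0 ? parseErrors / total : 0,
  };
}

console.log(summarizeBatch(sampleBatch));
```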
@@ -6333,6 +9147,10 @@ var sections15 = [
|
|
|
6333
9147
|
{
|
|
6334
9148
|
question: "What happens if a batch gets stuck?",
|
|
6335
9149
|
answer: 'The platform includes crash recovery logic. Batches stuck in "processing" for more than 15 minutes are automatically reverted to "submitted" so the next poll cycle retries them. No manual intervention is needed.'
|
|
9150
|
+
},
|
|
9151
|
+
{
|
|
9152
|
+
question: "How do I check the status of a specific document in a batch?",
|
|
9153
|
+
answer: "Use GET /v1/batches/:id to see the batch detail view, which lists every item with its individual status (pending, completed, or parse_error). You can also check the document directly via GET /v1/documents/:id. Batch-queued documents show status batch_queued until results are applied, then transition to their final status."
|
|
6336
9154
|
}
|
|
6337
9155
|
],
|
|
6338
9156
|
mentions: [
|
|
@@ -6377,6 +9195,48 @@ var sections16 = [
|
|
|
6377
9195
|
variant: "info",
|
|
6378
9196
|
text: "You can also import reference data directly from a SQL database connection. The import runs asynchronously \u2014 rows are streamed in batches of 500 and column headers appear immediately so you can preview the structure while the import runs."
|
|
6379
9197
|
},
|
|
9198
|
+
{
|
|
9199
|
+
type: "code",
|
|
9200
|
+
language: "bash",
|
|
9201
|
+
title: "Upload reference data via CSV",
|
|
9202
|
+
code: `curl -X POST https://api.talonic.com/v1/matching/reference-data \\
|
|
9203
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
9204
|
+
-F "file=@vendor_registry.csv"
|
|
9205
|
+
|
|
9206
|
+
# Response:
|
|
9207
|
+
# {
|
|
9208
|
+
# "id": "ref_vendor_001",
|
|
9209
|
+
# "name": "vendor_registry",
|
|
9210
|
+
# "status": "ready",
|
|
9211
|
+
# "row_count": 2450,
|
|
9212
|
+
# "columns": ["vendor_id", "vendor_name", "country", "tax_id"],
|
|
9213
|
+
# "created_at": "2025-04-22T10:00:00Z"
|
|
9214
|
+
# }`
|
|
9215
|
+
},
|
|
9216
|
+
{
|
|
9217
|
+
type: "code",
|
|
9218
|
+
language: "bash",
|
|
9219
|
+
title: "Import reference data from a SQL database",
|
|
9220
|
+
code: `curl -X POST https://api.talonic.com/v1/matching/reference-data/from-sql \\
|
|
9221
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
9222
|
+
-H "Content-Type: application/json" \\
|
|
9223
|
+
-d '{
|
|
9224
|
+
"connection_id": "src_sql_001",
|
|
9225
|
+
"kind": "table",
|
|
9226
|
+
"table_name": "vendors",
|
|
9227
|
+
"schema_name": "public"
|
|
9228
|
+
}'
|
|
9229
|
+
|
|
9230
|
+
# Response (async \u2014 poll status until ready):
|
|
9231
|
+
# {
|
|
9232
|
+
# "id": "ref_vendor_002",
|
|
9233
|
+
# "status": "importing",
|
|
9234
|
+
# "source_meta": {
|
|
9235
|
+
# "connection_id": "src_sql_001",
|
|
9236
|
+
# "table_name": "vendors"
|
|
9237
|
+
# }
|
|
9238
|
+
# }`
|
|
9239
|
+
},
|
|
6380
9240
|
{
|
|
6381
9241
|
type: "list",
|
|
6382
9242
|
ordered: false,
|
|
@@ -6409,6 +9269,10 @@ var sections16 = [
|
|
|
6409
9269
|
{
|
|
6410
9270
|
question: "What happens if I delete a source connection that was used for a SQL import?",
|
|
6411
9271
|
answer: 'The reference data remains intact. Deleting a source connection does not cascade to reference datasets \u2014 the UI shows a "source disconnected" indicator, but the imported data continues to work for matching.'
|
|
9272
|
+
},
|
|
9273
|
+
{
|
|
9274
|
+
question: "How do I refresh reference data from a SQL source?",
|
|
9275
|
+
answer: "Re-run the SQL import using the same connection and table parameters. A new reference dataset is created with the latest data. Update your matching configurations to point to the new dataset version. The previous version remains available for comparison."
|
|
6412
9276
|
}
|
|
6413
9277
|
],
|
|
6414
9278
|
mentions: [
|
|
@@ -6474,6 +9338,45 @@ var sections16 = [
|
|
|
6474
9338
|
variant: "info",
|
|
6475
9339
|
text: "Use **AI strategy generation** when setting up matching for the first time. The platform analyzes your schema fields and reference data columns, then suggests which fields to compare and which strategy to use for each. You can review and adjust the suggestions before saving."
|
|
6476
9340
|
},
|
|
9341
|
+
{
|
|
9342
|
+
type: "code",
|
|
9343
|
+
language: "bash",
|
|
9344
|
+
title: "Create a matching configuration",
|
|
9345
|
+
code: `curl -X POST https://api.talonic.com/v1/matching/configs \\
|
|
9346
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
9347
|
+
-H "Content-Type: application/json" \\
|
|
9348
|
+
-d '{
|
|
9349
|
+
"name": "Invoice to PO Matching",
|
|
9350
|
+
"reference_data_id": "ref_vendor_001",
|
|
9351
|
+
"field_mappings": [
|
|
9352
|
+
{ "source": "vendor_name", "target": "vendor_name", "strategy": "fuzzy", "weight": 0.4 },
|
|
9353
|
+
{ "source": "po_number", "target": "po_number", "strategy": "exact", "weight": 0.35 },
|
|
9354
|
+
{ "source": "total_amount", "target": "amount", "strategy": "numeric_range", "weight": 0.15, "tolerance": 0.02 },
|
|
9355
|
+
{ "source": "invoice_date", "target": "po_date", "strategy": "date_range", "weight": 0.1, "tolerance_days": 7 }
|
|
9356
|
+
]
|
|
9357
|
+
}'`
|
|
9358
|
+
},
|
|
9359
|
+
{
|
|
9360
|
+
type: "code",
|
|
9361
|
+
language: "bash",
|
|
9362
|
+
title: "Generate AI-suggested matching strategy",
|
|
9363
|
+
code: `curl -X POST https://api.talonic.com/v1/matching/strategies/generate \\
|
|
9364
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
9365
|
+
-H "Content-Type: application/json" \\
|
|
9366
|
+
-d '{
|
|
9367
|
+
"schema_id": "us_def456",
|
|
9368
|
+
"reference_data_id": "ref_vendor_001"
|
|
9369
|
+
}'
|
|
9370
|
+
|
|
9371
|
+
# Response:
|
|
9372
|
+
# {
|
|
9373
|
+
# "id": "strat_001",
|
|
9374
|
+
# "suggested_mappings": [
|
|
9375
|
+
# { "source": "vendor_name", "target": "vendor_name", "strategy": "fuzzy", "weight": 0.4, "confidence": 0.95 },
|
|
9376
|
+
# { "source": "invoice_number", "target": "ref_number", "strategy": "exact", "weight": 0.35, "confidence": 0.88 }
|
|
9377
|
+
# ]
|
|
9378
|
+
# }`
|
|
9379
|
+
},
|
|
6477
9380
|
{
|
|
6478
9381
|
type: "list",
|
|
6479
9382
|
ordered: false,
|
|
@@ -6506,6 +9409,10 @@ var sections16 = [
|
|
|
6506
9409
|
{
|
|
6507
9410
|
question: "What is the difference between fuzzy and exact matching?",
|
|
6508
9411
|
answer: "Exact matching requires an identical string (case-insensitive). Fuzzy matching uses token-based comparison with a configurable similarity threshold, making it suitable for fields with minor variations like misspellings, abbreviations, or word reordering."
|
|
9412
|
+
},
|
|
9413
|
+
{
|
|
9414
|
+
question: "How should I set weights for my matching fields?",
|
|
9415
|
+
answer: "Assign high weights (0.3-0.5) to strong identifiers like reference numbers or unique IDs. Assign medium weights (0.1-0.2) to supporting fields like names, dates, and amounts. The weights must sum to 1.0. A common starting pattern is one high-weight exact match on a unique identifier plus two or three lower-weight fuzzy or range matches on supporting fields."
|
|
6509
9416
|
}
|
|
6510
9417
|
],
|
|
6511
9418
|
mentions: [
|
|
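The FAQ entry above states that matching weights must sum to 1.0. A minimal sketch of a client-side check, run before posting a configuration, might look like the following. The helper name `validateWeights` and the rounding tolerance are assumptions for illustration and are not part of the Talonic API; the sample mappings are copied from the "Create a matching configuration" example.

```javascript
// Hypothetical helper (not part of the Talonic API): checks that the
// field-mapping weights sum to 1.0, as the FAQ requires.
// The tolerance only absorbs floating-point rounding; its value is an assumption.
function validateWeights(fieldMappings, tolerance = 1e-9) {
  const total = fieldMappings.reduce((sum, mapping) => sum + mapping.weight, 0);
  return { total, valid: Math.abs(total - 1) <= tolerance };
}

// Mappings taken from the "Create a matching configuration" example.
const fieldMappings = [
  { source: "vendor_name", target: "vendor_name", strategy: "fuzzy", weight: 0.4 },
  { source: "po_number", target: "po_number", strategy: "exact", weight: 0.35 },
  { source: "total_amount", target: "amount", strategy: "numeric_range", weight: 0.15, tolerance: 0.02 },
  { source: "invoice_date", target: "po_date", strategy: "date_range", weight: 0.1, tolerance_days: 7 },
];

console.log(validateWeights(fieldMappings).valid);
```

Comparing against a small tolerance rather than testing `total === 1` avoids false failures from floating-point addition.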
@@ -6552,6 +9459,31 @@ var sections16 = [
|
|
|
6552
9459
|
"Review results when the run completes."
|
|
6553
9460
|
]
|
|
6554
9461
|
},
|
|
9462
|
+
{
|
|
9463
|
+
type: "code",
|
|
9464
|
+
language: "bash",
|
|
9465
|
+
title: "Trigger a matching run and monitor progress",
|
|
9466
|
+
code: `# Start a standard matching run:
|
|
9467
|
+
curl -X POST https://api.talonic.com/v1/matching/configs/cfg_001/run \\
|
|
9468
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
9469
|
+
|
|
9470
|
+
# Start a smart run with AI resolution:
|
|
9471
|
+
curl -X POST https://api.talonic.com/v1/matching/configs/cfg_001/smart-run \\
|
|
9472
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
9473
|
+
|
|
9474
|
+
# Check progress:
|
|
9475
|
+
curl -s "https://api.talonic.com/v1/matching/runs/mrun_001/progress" \\
|
|
9476
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
9477
|
+
# -> { "processed": 180, "total": 245, "estimated_remaining_ms": 32000 }
|
|
9478
|
+
|
|
9479
|
+
# Cancel if needed (partial results preserved):
|
|
9480
|
+
curl -X POST https://api.talonic.com/v1/matching/runs/mrun_001/cancel \\
|
|
9481
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"`
|
|
9482
|
+
},
|
|
9483
|
+
{
|
|
9484
|
+
type: "paragraph",
|
|
9485
|
+
text: "You can also trigger an AI resolution pass on a completed run without re-running the entire job. The `POST /v1/matching/runs/{id}/ai-resolve` endpoint targets only the low-confidence results from the initial deterministic matching pass and applies embedding-based similarity search plus Haiku LLM evaluation to upgrade their scores. This is more efficient than a full smart run when only a small subset of borderline results needs improvement. The AI resolver evaluates each ambiguous candidate in context, weighing the full set of field comparisons and the document content, so it can make a better-informed match decision than the deterministic strategies alone."
|
|
9486
|
+
},
|
|
6555
9487
|
{
|
|
6556
9488
|
type: "callout",
|
|
6557
9489
|
variant: "info",
|
|
@@ -6579,6 +9511,10 @@ var sections16 = [
|
|
|
6579
9511
|
{
|
|
6580
9512
|
question: "Can I cancel a matching run in progress?",
|
|
6581
9513
|
answer: "Yes. You can cancel a running match job from the matching page. Partial results from documents already processed are preserved and available for review."
|
|
9514
|
+
},
|
|
9515
|
+
{
|
|
9516
|
+
question: "Can I upgrade specific low-confidence results without re-running the entire job?",
|
|
9517
|
+
answer: "Yes. Use POST /v1/matching/runs/{id}/ai-resolve to trigger an AI resolution pass on a completed run. This targets only low-confidence results and applies embedding similarity plus Haiku LLM evaluation to improve their match quality, without re-processing documents that already have high-confidence matches."
|
|
6582
9518
|
}
|
|
6583
9519
|
],
|
|
6584
9520
|
mentions: ["matching runs", "async execution", "BullMQ", "progress monitoring", "smart run", "AI resolution"]
|
|
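The progress endpoint shown in this section returns `processed`, `total`, and `estimated_remaining_ms`. As a rough sketch, a client could turn that payload into a readable summary as follows. The helper name `summarizeProgress` and the `done` flag are assumptions for illustration; the source does not say how the API signals completion.

```javascript
// Hypothetical helper: summarizes a progress payload shaped like the
// example response from /v1/matching/runs/{id}/progress.
function summarizeProgress({ processed, total, estimated_remaining_ms }) {
  // Guard against division by zero when no documents are queued.
  const percent = total > 0 ? Math.round((processed / total) * 100) : 0;
  const secondsLeft = Math.ceil(estimated_remaining_ms / 1000);
  // Assumption: a run counts as done once every document has been processed.
  const done = total > 0 && processed >= total;
  return { percent, secondsLeft, done };
}

// Payload copied from the example response in this section.
const progress = { processed: 180, total: 245, estimated_remaining_ms: 32000 };
console.log(summarizeProgress(progress));
```

For the sample payload this gives 73 percent complete with about 32 seconds remaining.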
@@ -6632,6 +9568,40 @@ var sections16 = [
|
|
|
6632
9568
|
variant: "info",
|
|
6633
9569
|
text: "You can **approve or reject** individual match results. Approved matches can be used downstream in delivery pipelines. Rejected matches are excluded from future consideration for that document."
|
|
6634
9570
|
},
|
|
9571
|
+
{
|
|
9572
|
+
type: "code",
|
|
9573
|
+
language: "bash",
|
|
9574
|
+
title: "Get match results for a completed run",
|
|
9575
|
+
code: `curl -s "https://api.talonic.com/v1/matching/runs/mrun_001/results?limit=5" \\
|
|
9576
|
+
-H "Authorization: Bearer $TALONIC_API_KEY"
|
|
9577
|
+
|
|
9578
|
+
# Response:
|
|
9579
|
+
# {
|
|
9580
|
+
# "results": [
|
|
9581
|
+
# {
|
|
9582
|
+
# "id": "mres_001",
|
|
9583
|
+
# "document_id": "doc_001",
|
|
9584
|
+
# "document_name": "Invoice_ACME.pdf",
|
|
9585
|
+
# "top_candidates": [
|
|
9586
|
+
# {
|
|
9587
|
+
# "reference_row_id": "row_42",
|
|
9588
|
+
# "confidence": 87,
|
|
9589
|
+
# "evidence": [
|
|
9590
|
+
# { "field": "vendor_name", "strategy": "fuzzy", "score": 92, "source_value": "Acme Corp", "target_value": "ACME Corporation" },
|
|
9591
|
+
# { "field": "total_amount", "strategy": "numeric_range", "score": 100, "source_value": "12450.00", "target_value": "12450.00" }
|
|
9592
|
+
# ]
|
|
9593
|
+
# }
|
|
9594
|
+
# ]
|
|
9595
|
+
# }
|
|
9596
|
+
# ]
|
|
9597
|
+
# }
|
|
9598
|
+
|
|
9599
|
+
# Approve a match result:
|
|
9600
|
+
curl -X POST "https://api.talonic.com/v1/matching/runs/mrun_001/results/mres_001/review" \\
|
|
9601
|
+
-H "Authorization: Bearer $TALONIC_API_KEY" \\
|
|
9602
|
+
-H "Content-Type: application/json" \\
|
|
9603
|
+
-d '{ "action": "approve", "candidate_index": 0 }'`
|
|
9604
|
+
},
|
|
6635
9605
|
{
|
|
6636
9606
|
type: "paragraph",
|
|
6637
9607
|
text: 'Consider a practical example: you receive an invoice from "Acme Corp" with a total of $12,450 dated 2025-03-15. The matching engine returns the top candidate as "ACME Corporation" in your reference data with a confidence score of 87%. The evidence view shows the vendor name scored 92% via fuzzy match (handling "Corp" vs "Corporation"), the amount scored 100% via numeric_range match, and the date scored 78% via date_range because the reference shows a PO date of 2025-03-10 \u2014 within the 7-day tolerance. You can quickly verify the match is correct and approve it, sending the linked record downstream.'
|
|
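Given a results payload shaped like the response above, one way to build a manual review queue is to collect the results whose best candidate falls below a chosen confidence. This is only a sketch: the helper name `resultsNeedingReview`, the threshold of 90, and the second sample result (`mres_002`) are hypothetical, and it assumes `top_candidates` is ordered best-first, which the source implies through `candidate_index: 0` but does not state.

```javascript
// Hypothetical helper: returns the ids of results that should be reviewed by hand.
// Confidence is on the 0-100 scale used in the example response.
function resultsNeedingReview(response, threshold = 90) {
  return response.results
    .filter((result) => {
      const best = result.top_candidates[0]; // assumed to be sorted best-first
      // Results with no candidate at all are also queued for review.
      return !best || best.confidence < threshold;
    })
    .map((result) => result.id);
}

const response = {
  results: [
    // First entry mirrors the example response; the second is invented for contrast.
    { id: "mres_001", document_id: "doc_001", top_candidates: [{ reference_row_id: "row_42", confidence: 87 }] },
    { id: "mres_002", document_id: "doc_002", top_candidates: [{ reference_row_id: "row_7", confidence: 96 }] },
  ],
};

console.log(resultsNeedingReview(response));
```

With the sample data only `mres_001` is returned, since its top candidate scores 87.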
@@ -6658,6 +9628,10 @@ var sections16 = [
|
|
|
6658
9628
|
{
|
|
6659
9629
|
question: "Why does a match have a low confidence score?",
|
|
6660
9630
|
answer: "Low confidence usually means the fields being compared have significant differences or the matching strategies produced weak scores. Check the per-field evidence to identify which comparisons dragged the score down, then consider adjusting weights or strategies in the matching configuration."
|
|
9631
|
+
},
|
|
9632
|
+
{
|
|
9633
|
+
question: "Do approved match results feed into delivery pipelines?",
|
|
9634
|
+
answer: "Yes. Approved matches can be included in structured exports and delivery payloads alongside extraction data. This enables end-to-end workflows where extracted document data is matched against reference records and the combined result is delivered to your downstream system in a single payload."
|
|
6661
9635
|
}
|
|
6662
9636
|
],
|
|
6663
9637
|
mentions: [
|