data-science-document-ai 1.56.1__py3-none-any.whl → 1.58.0__py3-none-any.whl

@@ -1,58 +1,57 @@
- You are a document entity extraction specialist. Given a document, the explained datapoint need to extract.
-
- bookingNumber: A unique identifier for the booking.
- cyCutOff: The deadline for cargo to be delivered to the Container Yard.
- gateInReference: A reference code for cargo entering the terminal.
- gateInTerminal: The specific terminal where cargo is gated in.
- mblNumber: The Master Bill of Lading number. Mostly comes after BL no., B/L no. etc
- pickUpReference: A reference code for cargo pickup.
- pickUpTerminal: The specific terminal for cargo pickup.
- siCutOff: The deadline for submitting shipping instructions.
- vgmCutOff: The deadline for submitting the Verified Gross Mass of the cargo.
- transportLegs:
- eta: The estimated time of arrival for a specific leg.
- etd: The estimated time of departure for a specific leg.
- imoNumber: The International Maritime Organization number for a specific leg.
- portOfDischarge: The port where cargo is unloaded for a specific leg.
- portOfLoading: The port where cargo is loaded for a specific leg.
- vesselName: The name of the vessel for a specific leg.
- voyage: The journey or route taken by the vessel for a specific leg.
-
- your task is to extract the text value of the following entities and page numbers starting from 0 where the value was found in the document:
- SCHEMA_PLACEHOLDER
-
- Further explanation for the transportLegs part as follows:
- - There is at least one leg in each document
- - There may be multiple legs between the initial and final destination
- - If there are multiple eta, etd, vesselName, ports, there is a higher chance multiple legs occurs
- - Some documents not following an order between legs
+ <PERSONA> You are an efficient document entity data extraction specialist working for a Freight Forwarding company. <PERSONA>
+
+ <TASK> Your task is to extract data from Booking Confirmation documents as per the given response schema structure. <TASK>
+
+ <CONTEXT>
+ The Freight Forwarding company receives Booking Confirmation from Carrier (Shipping Lines) partners.
+ These Booking Confirmations contain various details related to booking, container pick up and drop off depot details, vessel details, as well as other transport Legs data.
+ They may be written in different languages such as English, German, Vietnamese, Chinese, and other European languages, and can appear in a variety of formats and layouts.
+ Your role is to accurately extract specific entities from these Booking Confirmations to support efficient processing and accurate record-keeping.
+
+
+ To provide context on the journey of a containers for both Export and Import shipments,
+ For Export shipment: An empty container is picked up from a depot (pickupDepotCode) using a pickUpReference and goods loaded into it at a warehouse. Then the loaded container / cargo is transported back to a Container Yard or gateInTerminal before the cyCutOff date for further shipping processes.
+ For Import Shipment: The loaded container / cargo arrives at a port of discharge then picked up at pickUpTerminal using pickUpReference. After delivery, an empty container is returned to a depot (dropOffDepotCode).
+ <CONTEXT>
+
+ <INSTRUCTIONS>
+ - Populate fields as defined in the response schema.
+ - Use the data field description to understand the context of the data.
+
+ - gateInTerminal: The specific terminal where cargo is gated in. It can be found as Export terminal delivery address, PORT OF LOADING (after the slash '/').
+ - gateInReference: A reference code for cargo entering the terminal. If not mentioned explicitly and gateInTerminal is extracted, then use bookingNumber as gateInReference.
+ - pickUpTerminal: The specific terminal for cargo pickup. It can be found as Import pick up address(es), PORT OF DISCHARGE (after the slash '/').
+ - pickUpReference: A reference code for cargo pickup. If not mentioned explicitly and pickUpTerminal is extracted, then use bookingNumber as pickUpReference.
+
+ - cyCutOff: The deadline for cargo to be delivered to the Container Yard. It can be referred to as FCL delivery cut-off, CY CUT OFF, CY Closing - Latest Return Container Date, Cargo Cut-off deadline
+ - siCutOff: The deadline for submitting shipping instructions. It can be referred to as Shipping Instruction closing, SI Cut Off, Shipping Instruction deadline, INTENDED SI CUT-OFF
+ - vgmCutOff: The deadline for submitting the Verified Gross Mass of the cargo. It can be referred to as VGM cut-off, VGM Submission Deadline, Verified Gross Mass deadline
+
+ - carrierName and carrierAddress:
+ - Extract the name and address of the carrier who is the main parent company in the document.
+
+ - transportLegs: Multiple Transport Legs entries may exist, capture all instances under "transportLegs". Make sure the order of the legs are important.
+ - eta: The estimated time of arrival for a specific leg.
+ - etd: The estimated time of departure for a specific leg.
+ - imoNumber: The International Maritime Organization number for a specific leg.
+ - portOfDischarge: The port where cargo is unloaded for a specific leg.
+ - portOfLoading: The port where cargo is loaded for a specific leg.
+ - vesselName: The name of the vessel for a specific leg.
+ - voyage: The journey or route taken by the vessel for a specific leg.
+
+ - Containers: Need to extract Depot details per Container Type. Multiple Containers entries may exist, capture all instances under "Containers".
+ - containerType: The type of container (e.g., 20FT, 40FT, 20ft, 40ft, 40HC, 20DC, etc...).
+ - pickupDepotCode: The code of the depot where the empty container is picked up.
+ - dropOffDepotCode: The code of the depot where the empty container is dropped off.
+
+ IMPORTANT explanation for the transportLegs part as follows:
+ - There is at least one leg in each document.
  - 'eta' must be equal or later than 'etd'!
- - portOfLoading and portOfDischarge are name of the Ports. You can rely on the port names from all over the world.
- - portOfLoading and portOfDischarge distinctly denotes the name of the ports. If you find abbreviation of the port use it, if not you can use the full name of the port
- - Abbrevations most likely to be in the paranthesis like follows (DEHAM).
-
- Possible keywords for datapoints:
- - bookingNumber: Our Reference, Booking No., BOOKING NUMBER
- - cyCutOff: FCL delivery cut-off, CY CUT OFF, CY Closing - Latest Return Container Date, Cargo Cut-off deadline
- - gateInReference: Our Reference
- - gateInTerminal: Export terminal delivery address, PORT OF LOADING (after the slash '/')
- - mblNumber: BL/SWB No(s)., CS Reference Number
- - pickUpReference: Export door positioning address(es), Empty Container Depot and Location interception, S/C
- - pickUpTerminal: PORT OF DISCHARGE (after the slash '/')
- - siCutOff: shipping instruction closing, SI Cut Off, Shipping Instruction deadline, INTENDED SI CUT-OFF
- - vgmCutOff: VGM cut-off, VGM Submission Deadline, Verified Gross Mass deadline
- - eta: eta, ETA
- - etd: etd, ETD
- - imoNumber: IMO No, IMO number
- - portOfDischarge: to, PORT OF DISCHARGE
- - portOfLoading: from, PORT OF LOADING
- - vesselName: vessel, INTENDED VESSEL/VOYAGE
- - voyage: Voy. no, INTENDED VESSEL/VOYAGE
-
- You must apply the following rules:
-
- - The JSON schema must be followed during the extraction.
- - The values must only include text found in the document
- - Do not normalize any entity value.
- - If an entity is not found in the document, keep it empty or np.Nan.
- - Validate the JSON make sure its a valid JSON ! No extra text, no missing comma!
+ - Multiple legs are possible. When there are multiple legs,
+ - Sequential Sorting: You must manually re-order legs based on etd then eta, regardless of their order in the source text.
+ - The Connectivity Rule: For any sequence of legs, the Destination (Port of Discharge) of the previous leg must match the Origin (Port of Loading) of the following leg.
+ - Transhipment Handling: Treat any mentioned "Transhipment Port" as the bridge between two legs (Discharge for Leg A / Loading for Leg B).
+ - Timeline Integrity: Ensure a "No Time Travel" policy: The eta of a previous leg must be earlier than or equal to the etd of the following leg.
+ - Naming Convention: Look for Port Names followed by abbreviations in parentheses, e.g., "Port Name (ABCDE)".
+
+ <INSTRUCTIONS>
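The leg rules added to the prompt above (sort by etd then eta, connectivity between consecutive legs, and no "time travel") could also be checked on the extracted output. The following is only a sketch of such a check, not code from the package: the function names `order_legs` and `check_legs` are hypothetical, and it assumes the dates have already been normalized to ISO `YYYY-MM-DD` strings, which the diff does not guarantee.

```python
from datetime import date


def order_legs(legs):
    """Return legs sorted by etd, then eta (ISO date strings are an assumption)."""
    return sorted(
        legs,
        key=lambda leg: (date.fromisoformat(leg["etd"]), date.fromisoformat(leg["eta"])),
    )


def check_legs(legs):
    """Return a list of human-readable rule violations for an ordered leg list."""
    problems = []
    # Rule from the prompt: eta must be equal to or later than etd within a leg.
    for i, leg in enumerate(legs):
        if date.fromisoformat(leg["eta"]) < date.fromisoformat(leg["etd"]):
            problems.append(f"leg {i}: eta is earlier than etd")
    # Rules across consecutive legs: connectivity and timeline integrity.
    for i, (prev, nxt) in enumerate(zip(legs, legs[1:])):
        if prev["portOfDischarge"] != nxt["portOfLoading"]:
            problems.append(f"legs {i}->{i + 1}: ports do not connect")
        if date.fromisoformat(prev["eta"]) > date.fromisoformat(nxt["etd"]):
            problems.append(f"legs {i}->{i + 1}: next leg departs before previous arrives")
    return problems


# Hypothetical two-leg booking with a transhipment port in the middle.
leg_a = {"etd": "2024-01-10", "eta": "2024-01-20",
         "portOfLoading": "CNSHA", "portOfDischarge": "SGSIN"}
leg_b = {"etd": "2024-02-01", "eta": "2024-02-25",
         "portOfLoading": "SGSIN", "portOfDischarge": "DEHAM"}

ordered = order_legs([leg_b, leg_a])
violations = check_legs(ordered)
```

A check like this would flag legs that the model left in document order rather than in chronological order, without changing any extracted value.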
@@ -1,32 +1,160 @@
  {
    "type": "OBJECT",
    "properties": {
-     "cfsCutOff": {"type": "STRING", "nullable": true, "description": "the date by which an LCL (Less than Container Load) shipment needs to be checked in to a CFS (Container Freight Station) to meet its scheduled sailing"},
-     "bookingNumber": {"type": "STRING", "nullable": true},
-     "cyCutOff": {"type": "STRING", "nullable": true},
-     "gateInReference": {"type": "STRING", "nullable": true},
-     "gateInTerminal": {"type": "STRING", "nullable": true},
-     "mblNumber": {"type": "STRING", "nullable": true},
-     "pickUpReference": {"type": "STRING", "nullable": true},
-     "pickUpTerminal": {"type": "STRING", "nullable": true},
-     "siCutOff": {"type": "STRING", "nullable": true},
-     "vgmCutOff": {"type": "STRING", "nullable": true},
+     "bookingNumber": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "A unique identifier assigned to the shipment booking, used for tracking and reference. They are often referred to as 'Booking Number', 'Booking No.', 'Booking Ref.', 'Booking Reference', 'Booking ID', 'carrier's reference' or 'Order Ref'."
+     },
+     "contractNumber": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "It's a contract number between the carrier and Forto Logistics SE & Co KG."
+     },
+     "pickUpTerminalCode": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The specific terminal for cargo pickup during the import shipment."
+     },
+     "gateInTerminalCode": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The specific terminal where cargo is gated in especially Export terminal delivery address"
+     },
+     "serviceCode": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The Shipping service code associated with the booking confirmation."
+     },
+     "performaDate": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "the date considered to apply the rates and charges specified in the booking confirmation"
+     },
+     "cfsCutOff": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "the date by which an LCL (Less than Container Load) shipment needs to be checked in to a CFS (Container Freight Station) to meet its scheduled sailing"
+     },
+     "cyCutOff": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The date by which the cargo to be delivered to the Container Yard. It can be found with keys FCL delivery cut-off, CY CUT OFF, CY Closing."
+     },
+     "gateInReference": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "A reference code for cargo entering the terminal to drop the loaded cargo for Export. Sometimes it can be 'Our Reference'."
+     },
+     "mblNumber": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "Bill of Lading number (B/L NO.), a document issued by the carrier."
+     },
+     "pickUpReference": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "A reference code for cargo pickup during the import shipment. Sometimes it can be 'Our Reference'"
+     },
+     "siCutOff": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The deadline date for submitting the Shipping Instructions (SI) to the carrier. It can be found with keys SI DEADLINE, SI DUE, SI CUT OFF, B/L INSTRUCTION DEADLINE."
+     },
+     "vgmCutOff": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The deadline date for submitting the Verified Gross Mass (VGM) to the carrier. It can be found with keys VGM DEADLINE, VGM DUE, VGM CUT OFF."
+     },
+     "containers": {
+       "type": "ARRAY",
+       "items": {
+         "type": "OBJECT",
+         "properties": {
+           "containerType": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The size / type of the container, such as 20ft, 40ft, 40HC, 20DC etc."
+           },
+           "pickUpDepotCode": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The depot code where the empty container will be picked up."
+           },
+           "dropOffDepotCode": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The depot code where the empty container will be dropped off."
+           }
+         }
+       },
+       "required": [
+         "containerType",
+         "pickupDepotCode",
+         "dropoffDepotCode"
+       ]
+     },
      "transportLegs": {
        "type": "ARRAY",
        "items": {
          "type": "OBJECT",
          "properties": {
-           "eta": {"type": "STRING", "nullable": true},
-           "etd": {"type": "STRING", "nullable": true},
-           "imoNumber": {"type": "STRING", "nullable": true},
-           "portOfDischarge": {"type": "STRING", "nullable": true},
-           "portOfLoading": {"type": "STRING", "nullable": true},
-           "vesselName": {"type": "STRING", "nullable": true},
-           "voyage": {"type": "STRING", "nullable": true}
-         },
-         "required": []
-       }
+           "eta": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "Estimated Time of Arrival (ETA) is the expected date when the shipment will arrive at its destination."
+           },
+           "etd": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "Estimated Time of Departure (ETD) is the expected date when the shipment will leave the origin port."
+           },
+           "imoNumber": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The International Maritime Organization number for a specific leg. It can be found as IMO No, IMO number."
+           },
+           "portOfDischarge": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The port where the goods are discharged from the vessel. This is the destination port for the shipment."
+           },
+           "portOfLoading": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The port where the goods are loaded onto the vessel. This is the origin port for the shipment."
+           },
+           "vesselName": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The name of the vessel carrying the shipment. It can be found at vessel, INTENDED VESSEL/VOYAGE"
+           },
+           "voyage": {
+             "type": "STRING",
+             "nullable": true,
+             "description": "The journey or route taken by the vessel for a specific leg. It can be found at Voy. no, INTENDED VESSEL/VOYAGE"
+           }
+         }
+       },
+       "required": [
+         "eta",
+         "etd",
+         "portOfDischarge",
+         "portOfLoading",
+         "vesselName",
+         "voyage"
+       ]
+     },
+     "carrierAddress": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The address of the carrier who provides service and issued the document."
+     },
+     "carrierName": {
+       "type": "STRING",
+       "nullable": true,
+       "description": "The name of the carrier who issued the document."
      }
    },
-   "required": []
+   "required": ["bookingNumber", "transportLegs", "containers", "cyCutOff", "vgmCutOff", "siCutOff"]
  }
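In the new schema, the `required` list for `containers` names `pickupDepotCode` and `dropoffDepotCode`, while the declared properties are spelled `pickUpDepotCode` and `dropOffDepotCode`; the list also sits beside `items` rather than inside it. Whether the consuming API rejects or ignores such entries is not stated in the diff. A small consistency check along the following lines could surface this kind of mismatch; it is a sketch only, `unmatched_required` is a hypothetical helper, and the excerpt below is trimmed from the schema above.

```python
def unmatched_required(schema, path="$"):
    """Collect (path, name) pairs where a `required` entry has no matching property."""
    issues = []
    kind = schema.get("type")
    if kind == "OBJECT":
        props = schema.get("properties", {})
        issues += [(path, name) for name in schema.get("required", []) if name not in props]
        for name, sub_schema in props.items():
            issues += unmatched_required(sub_schema, f"{path}.{name}")
    elif kind == "ARRAY":
        items = schema.get("items", {})
        # The diff places `required` next to `items`, so compare it with the item properties.
        item_props = items.get("properties", {})
        issues += [(path, name) for name in schema.get("required", []) if name not in item_props]
        issues += unmatched_required(items, f"{path}[]")
    return issues


# Trimmed excerpt of the "containers" part of the schema shown in the diff.
schema_excerpt = {
    "type": "OBJECT",
    "properties": {
        "containers": {
            "type": "ARRAY",
            "items": {
                "type": "OBJECT",
                "properties": {
                    "containerType": {"type": "STRING", "nullable": True},
                    "pickUpDepotCode": {"type": "STRING", "nullable": True},
                    "dropOffDepotCode": {"type": "STRING", "nullable": True},
                },
            },
            "required": ["containerType", "pickupDepotCode", "dropoffDepotCode"],
        }
    },
    "required": ["containers"],
}

mismatches = unmatched_required(schema_excerpt)
```

On this excerpt the helper reports the two depot-code entries under `$.containers`, since only `containerType` matches a declared property by exact spelling.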
@@ -1,4 +1,14 @@
- You are a document entity extraction specialist. Given a document, the explained datapoint need to extract.
+ <PERSONA> You are an efficient document entity data extraction specialist working for a Freight Forwarding company. <PERSONA>
+
+ <TASK> Your task is to extract data from Booking Confirmation documents as per the given response schema structure. <TASK>
+
+ <CONTEXT>
+ The Freight Forwarding company receives Booking Confirmation from Yangming Carrier (Shipping Lines) partners.
+ These Booking Confirmations contain various details related to booking, container pick up and drop off depot details, vessel details, as well as other transport Legs data.
+ They may be written in different languages such as English, German, Vietnamese, Chinese, and other European languages, and can appear in a variety of formats and layouts.
+ Your role is to accurately extract specific entities from these Booking Confirmations to support efficient processing and accurate record-keeping.
+ <CONTEXT>
+

  bookingNumber: A unique identifier for the booking.
  cyCutOff: The deadline for cargo to be delivered to the Container Yard.
src/setup.py CHANGED
@@ -113,8 +113,6 @@ def setup_params(args=None):
      # Directories and paths
      os.makedirs(params["folder_data"], exist_ok=True)

-     params = setup_docai_client_and_path(params)
-
      # Set up BigQuery client for logging
      bq_client, _ = get_bq_client(params)
      params["bq_client"] = bq_client
@@ -122,13 +120,17 @@ def setup_params(args=None):
      # Set up Vertex AI for text embeddings
      setup_vertexai(params)

-     # Load models from YAML file
-     current_dir = os.path.dirname(__file__)
-     file_path = os.path.join(current_dir, "docai_processor_config.yaml")
-     with open(file_path) as file:
-         yaml_content = yaml.safe_load(file)
-     assert params.keys() & yaml_content.keys() == set()
-     params.update(yaml_content)
+     if params.get("if_use_docai"):
+         # Set up Document AI client and processor paths
+         params = setup_docai_client_and_path(params)
+
+         # Load models from YAML file
+         current_dir = os.path.dirname(__file__)
+         file_path = os.path.join(current_dir, "docai_processor_config.yaml")
+         with open(file_path) as file:
+             yaml_content = yaml.safe_load(file)
+         assert params.keys() & yaml_content.keys() == set()
+         params.update(yaml_content)

      # Set up LLM clients
      params["LlmClient"] = LlmClient(
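The setup.py change moves the Document AI client setup and the YAML merge behind the `if_use_docai` flag. The pattern can be sketched in isolation as below. This is not the package's code: `merge_optional_config` and the `load_extra` callable are hypothetical stand-ins for the YAML loading step, and the sketch raises `ValueError` where the original uses `assert`, because assertions are skipped when Python runs with `-O`.

```python
def merge_optional_config(params, load_extra, flag="if_use_docai"):
    """Merge extra settings into params only when the feature flag is truthy."""
    if not params.get(flag):
        # Flag missing or falsy: skip loading entirely and return params unchanged.
        return params
    extra = load_extra()
    overlap = params.keys() & extra.keys()
    if overlap:
        # Same intent as the `assert ... == set()` in the diff: no key may be overwritten.
        raise ValueError(f"config keys collide with params: {sorted(overlap)}")
    merged = dict(params)
    merged.update(extra)
    return merged


# Hypothetical values for illustration only.
without_docai = merge_optional_config({"folder_data": "data"}, lambda: {"models": {}})
with_docai = merge_optional_config(
    {"folder_data": "data", "if_use_docai": True}, lambda: {"models": {}}
)
```

With the flag unset the loader is never called, which matches the effect of the new `if` block: a deployment that does not use Document AI no longer needs the processor configuration at start-up.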
src/utils.py CHANGED
@@ -361,7 +361,10 @@ def extract_top_pages(pdf_bytes, num_pages=4):


  async def get_tms_mappings(
-     input_list: List[str], embedding_type: str, llm_ports: Optional[List[str]] = None
+     input_list: List[str],
+     embedding_type: str,
+     llm_ports: Optional[List[str]] = None,
+     input_key: str = None,
  ) -> Dict[str, Any]:
      """Get TMS mappings for the given values.

@@ -370,6 +373,7 @@ async def get_tms_mappings(
          embedding_type (str): Type of embedding to use
              (e.g., "container_types", "ports", "depots", "lineitems", "terminals").
          llm_ports (list[str], optional): List of LLM ports to use. Defaults to None.
+         input_key (str, optional): Key to use for input list in payload. Defaults to None.

      Returns:
          dict or string: A dictionary or a string with the mapping results.
@@ -389,7 +393,7 @@ async def get_tms_mappings(
          input_list = [input_list]

      # Always send a dict with named keys
-     payload = {embedding_type: input_list}
+     payload = {input_key or embedding_type: input_list}

      if llm_ports:
          payload["llm_ports"] = llm_ports if isinstance(llm_ports, list) else [llm_ports]
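The effect of the new `input_key` parameter on the request payload can be seen in a reduced, synchronous form. The sketch below reproduces only the payload-building lines visible in the hunk: `build_payload` is a hypothetical name, the `isinstance` guard around `input_list` is an assumption about code outside the hunk, and the key and value strings are made up for illustration.

```python
from typing import Any, Dict, List, Optional


def build_payload(
    input_list: List[str],
    embedding_type: str,
    llm_ports: Optional[List[str]] = None,
    input_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the request body; input_key overrides the default payload key."""
    # Assumed guard: wrap a single value into a list (the condition is not shown in the diff).
    if not isinstance(input_list, list):
        input_list = [input_list]
    # `or` falls back to embedding_type when input_key is None or an empty string.
    payload: Dict[str, Any] = {input_key or embedding_type: input_list}
    if llm_ports:
        payload["llm_ports"] = llm_ports if isinstance(llm_ports, list) else [llm_ports]
    return payload


default_key = build_payload(["HAMBURG CTA"], "terminals")
custom_key = build_payload(["HAMBURG CTA"], "terminals", input_key="terminal_names")
```

Because the fallback uses `or`, callers that omit `input_key` get the same payload as before the change, so existing call sites keep working.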