israeli-invoice-parser 0.1.0__tar.gz
- israeli_invoice_parser-0.1.0/LICENSE +21 -0
- israeli_invoice_parser-0.1.0/PKG-INFO +152 -0
- israeli_invoice_parser-0.1.0/README.md +136 -0
- israeli_invoice_parser-0.1.0/pyproject.toml +28 -0
- israeli_invoice_parser-0.1.0/setup.cfg +4 -0
- israeli_invoice_parser-0.1.0/src/israeli-invoice-parser/__init__.py +12 -0
- israeli_invoice_parser-0.1.0/src/israeli-invoice-parser/base_parser.py +40 -0
- israeli_invoice_parser-0.1.0/src/israeli-invoice-parser/pairzon_parser.py +186 -0
- israeli_invoice_parser-0.1.0/src/israeli-invoice-parser/rami_levy_parser.py +208 -0
- israeli_invoice_parser-0.1.0/src/israeli-invoice-parser/weezmo_parser.py +180 -0
- israeli_invoice_parser-0.1.0/src/israeli_invoice_parser.egg-info/PKG-INFO +152 -0
- israeli_invoice_parser-0.1.0/src/israeli_invoice_parser.egg-info/SOURCES.txt +13 -0
- israeli_invoice_parser-0.1.0/src/israeli_invoice_parser.egg-info/dependency_links.txt +1 -0
- israeli_invoice_parser-0.1.0/src/israeli_invoice_parser.egg-info/requires.txt +1 -0
- israeli_invoice_parser-0.1.0/src/israeli_invoice_parser.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Yohay Cohen
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,152 @@
+Metadata-Version: 2.4
+Name: israeli-invoice-parser
+Version: 0.1.0
+Summary: A unified parsing library for Israeli digital grocery and retail receipts
+Author-email: Yohay Cohen <yohaybn@gmail.com>
+Project-URL: Homepage, https://github.com/yohaybn/israeli-invoice-parser-lib
+Project-URL: Bug Tracker, https://github.com/yohaybn/israeli-invoice-parser-lib/issues
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: beautifulsoup4>=4.12.0
+Dynamic: license-file
+
+# israeli-invoice-parser
+
+A unified, highly accurate Python parsing library for Israeli digital receipts, grocery bills, and commercial retail invoices. This library standardizes fragmented vendor payloads (including direct APIs, raw HTML, and complex Nuxt transport matrices) into a single, clean, structured Python dictionary.
+
+## Supported Retailers and Providers
+
+The library supports major Israeli storefronts directly or via central receipt infrastructure aggregators:
+
+* **Rami Levy (רמי לוי)**: Native support for standard digital bills and microservice data streams.
+* **Weezmo / Wee.ai Infrastructure**: Multi-brand validation supporting grocery gateways like **Yohananof (יוחננוף)**, high-street retail setups, and fashion entities (**TopTen**, **Tamnun**).
+* **Pairzon Engine (פיירזון)**: Dynamic resolution for short-link token routers and partner store layouts (e.g., **Osher Ad (אושר עד)**, **Max Stock (מקס סטוק)**, etc.).
+
+> **Current Limitations:** Automated extraction for **Shufersal (שופרסל)** invoices is currently blocked. The endpoint uses robust bot-protection / WAF rules that reject standard programmatic requests. We are actively trying to figure out how to bypass or properly emulate browser signatures to restore this functionality. Contributions or ideas on this technical issue are highly appreciated!
+
+---
+
+## Installation
+
+Install the package via `pip`:
+
+```bash
+pip install israeli-invoice-parser
+
+```
+
+---
+
+## Quick Start and Usage Examples
+
+Every parser inherits from a common interface (`BaseReceiptParser`) and returns a standardized data model, making it simple to process invoices interchangeably.
+
+### 1. Parsing a Rami Levy URL
+
+```python
+from invoice_parser import RamiLevyParser
+
+# Initialize the dedicated parser
+parser = RamiLevyParser()
+
+# Pass a live receipt or invoice URL directly
+url = "https://api-digi.rami-levy.co.il/api/v1/receipts/example-token-12345"
+receipt = parser.parse(url)
+
+# Access standardized fields uniformly
+print(f"Store: {receipt['store_name']}")
+print(f"Total Paid: ₪{receipt['total_paid']}")
+print(f"Date: {receipt['date']} at {receipt['time']}")
+
+for item in receipt['items']:
+    print(f" - {item['description']}: ₪{item['final_price']} (Qty: {item['quantity_or_weight']})")
+
+```
+
+### 2. Parsing a Weezmo / Wee.ai Provider Short-Link (e.g., Yohananof)
+
+```python
+from invoice_parser import WeezmoParser
+
+parser = WeezmoParser()
+
+# Works with central wee.ai tracking tokens or short links
+weezmo_url = "https://wee.ai/r/v123abcd"
+receipt = parser.parse(weezmo_url)
+
+# The parser dynamically extracts real corporate metadata to identify the sub-brand
+print(f"Identified Brand: {receipt['store_name']}") # e.g., 'יוחננוף'
+print(f"Legal Business ID: {receipt['company_legal_id']}")
+
+```
+
+### 3. Standardized Output Format Matrix
+
+Regardless of which vendor parser is called, the output dictionary always complies with the following layout structure:
+
+```python
+{
+    "store_name": "רמי לוי",
+    "company_legal_id": "513770669",
+    "branch_name": "סניף תל אביב",
+    "store_address": "דרך מנחם בגין 123",
+    "store_phone": "03-1234567",
+    "customer_name": "ישראל ישראלי",
+    "date": "23/06/2026",
+    "time": "14:30:00",
+    "receipt_id": "987654321",
+    "total_paid": 245.50,
+    "vat_rate": 17.0,
+    "total_vat_paid": 35.67,
+    "payment_method": "אשראי",
+    "items": [
+        {
+            "description": "חלב תנובה 3%",
+            "barcode": "7290000042431",
+            "is_by_weight": False,
+            "quantity_or_weight": 2.0,
+            "unit_price": 6.50,
+            "original_total_price": 13.00,
+            "is_part_of_deal": True,
+            "deal_text": "2 ב-₪11",
+            "discount_amount": 2.00,
+            "final_price": 11.00,
+            "category_path": ["סופרמרקט"]
+        }
+    ]
+}
+
+```
+
+---
+
+## Contributing and Helping Out
+
+Parsing real-world digital invoices is a game of cat-and-mouse as retailers update their internal schemas. We need your help to make this library resilient!
+
+### Have an Unsupported Receipt / Found a Bug?
+
+If you run into an invoice that fails to parse (such as **Shufersal** or a newly formatted receipt layout):
+
+1. Open a new issue on the [israeli-invoice-parser-lib Bug Tracker](https://github.com/yohaybn/israeli-invoice-parser-lib/issues).
+2. **Crucial:** Provide a **real, live link to the invoice**. Without a working URL, it is impossible to inspect the underlying network payload structure, test backend responses, or map out the necessary payload parameters.
+3. If you have suggestions or workarounds for bypassing Shufersal's anti-bot restrictions, please detail them inside the dedicated discussion issues!
+
+### Want to Add a New Parser?
+
+We warmly welcome pull requests! To contribute a new parser:
+
+1. Subclass `BaseReceiptParser` from `base_parser.py`.
+2. Implement the `.parse(self, source_data: str) -> Dict[str, Any]` method.
+3. Map the data cleanly into our uniform dictionary format.
+4. Submit your PR directly to the [GitHub Repository](https://github.com/yohaybn/israeli-invoice-parser-lib/).
+
+---
+
+## License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
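The README embedded in this PKG-INFO hunk documents one output dictionary for all parsers. The block below is a minimal sketch of how downstream code might consume that layout, assuming only the field names shown there (`items`, `final_price`, `discount_amount`, `total_paid`). The helper `summarize_receipt` and the sample data are hypothetical and are not part of the package.

```python
from typing import Any, Dict


def summarize_receipt(receipt: Dict[str, Any]) -> Dict[str, float]:
    """Aggregate item-level figures from a receipt in the documented layout."""
    items = receipt.get("items", [])
    items_total = round(sum(float(i.get("final_price", 0.0)) for i in items), 2)
    total_discount = round(sum(float(i.get("discount_amount", 0.0)) for i in items), 2)
    return {
        "items_total": items_total,
        "total_discount": total_discount,
        # Part of the paid total that is not explained by the listed items
        "unlisted_amount": round(float(receipt.get("total_paid", 0.0)) - items_total, 2),
    }


# Hypothetical receipt containing only the fields this helper reads
sample = {
    "total_paid": 245.50,
    "items": [{"final_price": 11.00, "discount_amount": 2.00}],
}
summary = summarize_receipt(sample)
print(summary)
```

Because the helper relies on `.get` with defaults, it tolerates parsers that leave some fields empty, at the cost of silently treating missing values as zero.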
@@ -0,0 +1,136 @@
+# israeli-invoice-parser
+
+A unified, highly accurate Python parsing library for Israeli digital receipts, grocery bills, and commercial retail invoices. This library standardizes fragmented vendor payloads (including direct APIs, raw HTML, and complex Nuxt transport matrices) into a single, clean, structured Python dictionary.
+
+## Supported Retailers and Providers
+
+The library supports major Israeli storefronts directly or via central receipt infrastructure aggregators:
+
+* **Rami Levy (רמי לוי)**: Native support for standard digital bills and microservice data streams.
+* **Weezmo / Wee.ai Infrastructure**: Multi-brand validation supporting grocery gateways like **Yohananof (יוחננוף)**, high-street retail setups, and fashion entities (**TopTen**, **Tamnun**).
+* **Pairzon Engine (פיירזון)**: Dynamic resolution for short-link token routers and partner store layouts (e.g., **Osher Ad (אושר עד)**, **Max Stock (מקס סטוק)**, etc.).
+
+> **Current Limitations:** Automated extraction for **Shufersal (שופרסל)** invoices is currently blocked. The endpoint uses robust bot-protection / WAF rules that reject standard programmatic requests. We are actively trying to figure out how to bypass or properly emulate browser signatures to restore this functionality. Contributions or ideas on this technical issue are highly appreciated!
+
+---
+
+## Installation
+
+Install the package via `pip`:
+
+```bash
+pip install israeli-invoice-parser
+
+```
+
+---
+
+## Quick Start and Usage Examples
+
+Every parser inherits from a common interface (`BaseReceiptParser`) and returns a standardized data model, making it simple to process invoices interchangeably.
+
+### 1. Parsing a Rami Levy URL
+
+```python
+from invoice_parser import RamiLevyParser
+
+# Initialize the dedicated parser
+parser = RamiLevyParser()
+
+# Pass a live receipt or invoice URL directly
+url = "https://api-digi.rami-levy.co.il/api/v1/receipts/example-token-12345"
+receipt = parser.parse(url)
+
+# Access standardized fields uniformly
+print(f"Store: {receipt['store_name']}")
+print(f"Total Paid: ₪{receipt['total_paid']}")
+print(f"Date: {receipt['date']} at {receipt['time']}")
+
+for item in receipt['items']:
+    print(f" - {item['description']}: ₪{item['final_price']} (Qty: {item['quantity_or_weight']})")
+
+```
+
+### 2. Parsing a Weezmo / Wee.ai Provider Short-Link (e.g., Yohananof)
+
+```python
+from invoice_parser import WeezmoParser
+
+parser = WeezmoParser()
+
+# Works with central wee.ai tracking tokens or short links
+weezmo_url = "https://wee.ai/r/v123abcd"
+receipt = parser.parse(weezmo_url)
+
+# The parser dynamically extracts real corporate metadata to identify the sub-brand
+print(f"Identified Brand: {receipt['store_name']}") # e.g., 'יוחננוף'
+print(f"Legal Business ID: {receipt['company_legal_id']}")
+
+```
+
+### 3. Standardized Output Format Matrix
+
+Regardless of which vendor parser is called, the output dictionary always complies with the following layout structure:
+
+```python
+{
+    "store_name": "רמי לוי",
+    "company_legal_id": "513770669",
+    "branch_name": "סניף תל אביב",
+    "store_address": "דרך מנחם בגין 123",
+    "store_phone": "03-1234567",
+    "customer_name": "ישראל ישראלי",
+    "date": "23/06/2026",
+    "time": "14:30:00",
+    "receipt_id": "987654321",
+    "total_paid": 245.50,
+    "vat_rate": 17.0,
+    "total_vat_paid": 35.67,
+    "payment_method": "אשראי",
+    "items": [
+        {
+            "description": "חלב תנובה 3%",
+            "barcode": "7290000042431",
+            "is_by_weight": False,
+            "quantity_or_weight": 2.0,
+            "unit_price": 6.50,
+            "original_total_price": 13.00,
+            "is_part_of_deal": True,
+            "deal_text": "2 ב-₪11",
+            "discount_amount": 2.00,
+            "final_price": 11.00,
+            "category_path": ["סופרמרקט"]
+        }
+    ]
+}
+
+```
+
+---
+
+## Contributing and Helping Out
+
+Parsing real-world digital invoices is a game of cat-and-mouse as retailers update their internal schemas. We need your help to make this library resilient!
+
+### Have an Unsupported Receipt / Found a Bug?
+
+If you run into an invoice that fails to parse (such as **Shufersal** or a newly formatted receipt layout):
+
+1. Open a new issue on the [israeli-invoice-parser-lib Bug Tracker](https://github.com/yohaybn/israeli-invoice-parser-lib/issues).
+2. **Crucial:** Provide a **real, live link to the invoice**. Without a working URL, it is impossible to inspect the underlying network payload structure, test backend responses, or map out the necessary payload parameters.
+3. If you have suggestions or workarounds for bypassing Shufersal's anti-bot restrictions, please detail them inside the dedicated discussion issues!
+
+### Want to Add a New Parser?
+
+We warmly welcome pull requests! To contribute a new parser:
+
+1. Subclass `BaseReceiptParser` from `base_parser.py`.
+2. Implement the `.parse(self, source_data: str) -> Dict[str, Any]` method.
+3. Map the data cleanly into our uniform dictionary format.
+4. Submit your PR directly to the [GitHub Repository](https://github.com/yohaybn/israeli-invoice-parser-lib/).
+
+---
+
+## License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
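The "Want to Add a New Parser?" steps in the README hunk above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's own template: the `BaseReceiptParser` class here is a stand-in that mirrors the interface from `base_parser.py` so the block runs on its own, and `ExampleJsonParser` together with its payload keys (`shop`, `id`, `sum`, `lines`, `title`, `qty`, `price`, `paid`) is invented for the example and corresponds to no real retailer.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseReceiptParser(ABC):
    """Stand-in mirroring the interface in base_parser.py so the sketch runs alone."""

    def __init__(self, store_name: str) -> None:
        self.store_name = store_name

    @abstractmethod
    def parse(self, source_data: str) -> Dict[str, Any]:
        ...


class ExampleJsonParser(BaseReceiptParser):
    """Hypothetical parser for an invented vendor payload."""

    def __init__(self) -> None:
        super().__init__(store_name="Example Store")

    def parse(self, source_data: str) -> Dict[str, Any]:
        payload = json.loads(source_data)
        # Only a subset of the documented output fields is filled in here
        receipt: Dict[str, Any] = {
            "store_name": payload.get("shop", self.store_name),
            "receipt_id": str(payload.get("id", "")),
            "total_paid": float(payload.get("sum", 0.0)),
            "items": [],
        }
        for line in payload.get("lines", []):
            quantity = float(line.get("qty", 1.0))
            unit_price = float(line.get("price", 0.0))
            original_total = round(quantity * unit_price, 2)
            final_price = float(line.get("paid", original_total))
            receipt["items"].append({
                "description": line.get("title", ""),
                "quantity_or_weight": quantity,
                "unit_price": unit_price,
                "original_total_price": original_total,
                "discount_amount": max(0.0, round(original_total - final_price, 2)),
                "final_price": final_price,
            })
        return receipt


raw = (
    '{"shop": "Demo", "id": 7, "sum": 11.0, '
    '"lines": [{"title": "Milk", "qty": 2, "price": 6.5, "paid": 11.0}]}'
)
parsed = ExampleJsonParser().parse(raw)
print(parsed["items"][0]["discount_amount"])
```

A real contribution would subclass the package's own base class instead of the stand-in and populate every key of the documented layout.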
@@ -0,0 +1,28 @@
+[build-system]
+requires = ["setuptools>=61.0.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "israeli-invoice-parser"
+version = "0.1.0"
+authors = [
+    { name = "Yohay Cohen", email = "yohaybn@gmail.com" }
+]
+description = "A unified parsing library for Israeli digital grocery and retail receipts"
+readme = "README.md"
+requires-python = ">=3.8"
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+]
+dependencies = [
+    "beautifulsoup4>=4.12.0",
+]
+
+[project.urls]
+"Homepage" = "https://github.com/yohaybn/israeli-invoice-parser-lib"
+"Bug Tracker" = "https://github.com/yohaybn/israeli-invoice-parser-lib/issues"
+
+[tool.setuptools.packages.find]
+where = ["src"]
@@ -0,0 +1,12 @@
+from .base_parser import BaseReceiptParser, NuxtDataHydrator
+from .pairzon_parser import PairzonParser
+from .rami_levy_parser import RamiLevyParser
+from .weezmo_parser import WeezmoParser
+
+__all__ = [
+    "BaseReceiptParser",
+    "NuxtDataHydrator",
+    "PairzonParser",
+    "RamiLevyParser",
+    "WeezmoParser",
+]
@@ -0,0 +1,40 @@
+from abc import ABC, abstractmethod
+from typing import Dict, Any, List
+
+class BaseReceiptParser(ABC):
+    def __init__(self, store_name: str) -> None:
+        self.store_name: str = store_name
+
+    @abstractmethod
+    def parse(self, source_data: str) -> Dict[str, Any]:
+        pass
+
+class NuxtDataHydrator:
+    """
+    Decompresses multi-type transport data matrices from Nuxt 3 back
+    into standard dictionaries, cleanly handling cross-referenced table indices.
+    """
+    def __init__(self, data_list: List[Any]) -> None:
+        self.raw_pool: List[Any] = data_list
+        self.visited = set()
+
+    def hydrate_node(self, node: Any) -> Any:
+        if isinstance(node, int) and 0 <= node < len(self.raw_pool):
+            if node in self.visited:
+                return f"<CircularRef: Index {node}>"
+
+            self.visited.add(node)
+            resolved = self.raw_pool[node]
+            result = self._transform(resolved)
+            self.visited.remove(node)
+            return result
+        return self._transform(node)
+
+    def _transform(self, val: Any) -> Any:
+        if isinstance(val, dict):
+            return {k: self.hydrate_node(v) for k, v in val.items()}
+        elif isinstance(val, list):
+            if val and val[0] in ("ShallowReactive", "Reactive", "ShallowRef", "Ref"):
+                return self.hydrate_node(val[1])
+            return [self.hydrate_node(item) for item in val]
+        return val
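To make the index-resolution behavior of `NuxtDataHydrator` concrete, the sketch below runs it on a small invented pool. The class body is a condensed copy of the one in the `base_parser.py` hunk above, repeated only so the block is self-contained. The `pool` contents are an assumption made for illustration and do not come from a real Rami Levy page.

```python
from typing import Any, List


class NuxtDataHydrator:
    """Condensed copy of the class in the hunk above, repeated so the sketch runs alone."""

    def __init__(self, data_list: List[Any]) -> None:
        self.raw_pool = data_list
        self.visited = set()

    def hydrate_node(self, node: Any) -> Any:
        # Integers inside the pool's range are treated as references to pool slots
        if isinstance(node, int) and 0 <= node < len(self.raw_pool):
            if node in self.visited:
                return f"<CircularRef: Index {node}>"
            self.visited.add(node)
            result = self._transform(self.raw_pool[node])
            self.visited.remove(node)
            return result
        return self._transform(node)

    def _transform(self, val: Any) -> Any:
        if isinstance(val, dict):
            return {k: self.hydrate_node(v) for k, v in val.items()}
        if isinstance(val, list):
            # Reactive wrappers carry the real payload's slot index in position 1
            if val and val[0] in ("ShallowReactive", "Reactive", "ShallowRef", "Ref"):
                return self.hydrate_node(val[1])
            return [self.hydrate_node(item) for item in val]
        return val


# Invented pool: dict values and list entries are slot indices, not literal values
pool = [
    ["ShallowReactive", 1],                # 0: wrapper pointing at slot 1
    {"store": 2, "total": 3, "items": 4},  # 1: root object
    "Rami Levy",                           # 2
    245.5,                                 # 3
    [5],                                   # 4: list of slot indices
    {"name": 6, "self": 5},                # 5: item that refers back to itself
    "Milk",                                # 6
]
hydrated = NuxtDataHydrator(pool).hydrate_node(0)
print(hydrated)
```

Slot 5 refers to itself, so the `visited` set stops the recursion and a placeholder string is returned in place of the nested copy. Note that any in-range integer reached through a dict value or list entry is treated as a slot reference, which matches a flattened pool but would misread a payload that stores literal integers inline.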
@@ -0,0 +1,186 @@
+import json
+import logging
+import os
+import urllib.request
+import urllib.error
+from urllib.parse import urlparse, parse_qs
+from typing import Dict, Any
+from .base_parser import BaseReceiptParser
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger("PairzonParser")
+
+class PairzonParser(BaseReceiptParser):
+    def __init__(self) -> None:
+        # Initialize with a flexible placeholder; we will dynamically
+        # override self.store_name based on the receipt's real corporate metadata.
+        super().__init__(store_name="Pairzon Provider")
+
+    def parse(self, source_data: str) -> Dict[str, Any]:
+        raw_json: str = ""
+
+        if source_data.startswith("http://") or source_data.startswith("https://"):
+            try:
+                parsed_url = urlparse(source_data.strip())
+                queries = parse_qs(parsed_url.query)
+
+                doc_id = queries.get("id", [None])[0]
+                pin_id = queries.get("p", [None])[0]
+
+                headers = {
+                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
+                    'Accept': 'application/json, text/plain, */*',
+                    'Accept-Language': 'he-IL,he;q=0.9,en-US;q=0.8,en;q=0.7'
+                }
+
+                # Custom handler to block automatic 302 jumps so we can catch short-links gracefully
+                class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
+                    def http_error_302(self, req, fp, code, msg, headers):
+                        return fp
+
+                opener = urllib.request.build_opener(NoRedirectHandler())
+
+                # 1. Coordinate Extraction (Direct Parameters vs Short-Link Token Routing)
+                if doc_id and pin_id:
+                    api_url = f"https://{parsed_url.netloc}/v1.0/documents/{doc_id}?p={pin_id}"
+                else:
+                    logger.info("Processing standard short-link sequence. Resolving internal payload routing...")
+                    path_parts = [p for p in parsed_url.path.split('/') if p]
+                    if len(path_parts) < 2:
+                        raise ValueError("The provided Pairzon short-link route structure is invalid or missing paths.")
+
+                    prefix = path_parts[0]
+                    token = path_parts[1]
+
+                    # Target Pairzon's centralized cross-brand mapping api
+                    link_lookup_url = f"https://{parsed_url.netloc}/v1.0/links/{prefix}/{token}"
+                    logger.info(f"Querying endpoint metadata resolver: {link_lookup_url}")
+
+                    lookup_req = urllib.request.Request(link_lookup_url, headers=headers)
+                    try:
+                        with urllib.request.urlopen(lookup_req, timeout=15) as lookup_res:
+                            lookup_data = json.loads(lookup_res.read().decode('utf-8'))
+                            if isinstance(lookup_data, dict) and "data" in lookup_data:
+                                lookup_data = lookup_data["data"]
+
+                            doc_id = lookup_data.get("documentId") or lookup_data.get("id")
+                            pin_id = lookup_data.get("prefix") or prefix
+                    except Exception as e:
+                        logger.warning(f"Metadata link API resolution dropped ({e}). Trying 302 header interception loop...")
+
+                        req = urllib.request.Request(source_data, headers=headers)
+                        with opener.open(req, timeout=15) as response:
+                            redirect_location = response.headers.get('Location', '')
+                            if redirect_location:
+                                parsed_redirect = urlparse(redirect_location)
+                                redirect_queries = parse_qs(parsed_redirect.query)
+                                doc_id = redirect_queries.get("id", [None])[0]
+                                pin_id = redirect_queries.get("p", [None])[0]
+
+                    if doc_id and pin_id:
+                        api_url = f"https://{parsed_url.netloc}/v1.0/documents/{doc_id}?p={pin_id}"
+                    else:
+                        raise ValueError("Failed to extract backend parameters from token signature.")
+
+                # 2. Complete Transaction JSON Fetch
+                logger.info(f"Targeting active data stream gateway: {api_url}")
+                data_req = urllib.request.Request(api_url, headers=headers)
+                with urllib.request.urlopen(data_req, timeout=15) as response:
+                    raw_json = response.read().decode('utf-8')
+
+                os.makedirs("temp", exist_ok=True)
+                with open("temp/pairzon_generic_raw.json", "w", encoding="utf-8") as f:
+                    f.write(raw_json)
+
+            except urllib.error.HTTPError as http_err:
+                error_body = http_err.read().decode('utf-8', errors='ignore')[:500]
+                logger.error(f"Pairzon backend connection returned error code ({http_err.code}). Payload: {error_body}")
+                raise ValueError(f"שגיאת תקשורת מול שרת קבלות פיירזון: {http_err.code}") from http_err  # Hebrew: "Communication error with the Pairzon receipt server"
+            except Exception as e:
+                logger.error(f"Critical execution error resolving Pairzon network document: {e}")
+                raise
+        else:
+            raw_json = source_data
+
+        # 3. Dynamic Mapping & Extraction Grid
+        try:
+            if not raw_json.strip():
+                raise ValueError("The resolved data stream payload came back empty.")
+
+            payload = json.loads(raw_json)
+            if isinstance(payload, dict) and "data" in payload:
+                payload = payload["data"]
+
+            # Dynamic Brand Identification
+            store_info = payload.get("store", {}) or {}
+            biz_info = store_info.get("business", {}) or {}
+
+            # Extract brand name dynamically from business node or fallback to domain signatures
+            dynamic_store_name = biz_info.get("name", store_info.get("name", "רשת קמעונאות")).strip()
+            self.store_name = dynamic_store_name
+            logger.info(f"Dynamic branding identity successfully verified as: '{self.store_name}'")
+
+            # Standardize date segments
+            created_date = payload.get("createdDate", "2026-01-01T00:00:00")
+            date_part, time_part = created_date.split("T") if "T" in created_date else (created_date, "00:00:00")
+            if "-" in date_part:
+                parts = date_part.split("-")
+                formatted_date = f"{parts[2]}/{parts[1]}/{parts[0]}"
+            else:
+                formatted_date = date_part
+
+            unified_receipt: Dict[str, Any] = {
+                "store_name": self.store_name,
+                "company_legal_id": str(biz_info.get("companyLeagalId", payload.get("businessID", "513461053"))),
+                "branch_name": store_info.get("name", "סניף כללי").strip(),
+                "store_address": store_info.get("address", "").strip() or biz_info.get("address", "").strip(),
+                "store_phone": store_info.get("phone", "").strip() or biz_info.get("phone", "").strip(),
+                "customer_name": payload.get("cashierName", "").strip() or None,
+                "date": formatted_date,
+                "time": time_part[:8],
+                "receipt_id": str(payload.get("transactionID", payload.get("id", ""))),
+                "total_paid": float(payload.get("total", 0.0)),
+                "vat_rate": float(payload.get("Vat", 17.0)),
+                "total_vat_paid": float(payload.get("totalVat", 0.0)),
+                "payment_method": (payload.get("payments") or [{}])[0].get("name", "").strip() or "אשראי",
+                "items": []
+            }
+
+            for item in payload.get("items", []):
+                weight = item.get("weight")
+                quantity = float(weight / 1000.0) if weight else float(item.get("quantity", 1.0))
+                unit_price = float(item.get("price", 0.0))
+                final_price = float(item.get("total", quantity * unit_price))
+                expected_total = round(quantity * unit_price, 2)
+
+                deal_description = ""
+                add_info = item.get("additionalInfo", [])
+                if isinstance(add_info, list) and len(add_info) > 0:
+                    deal_description = str(add_info[0].get("key", "")).strip()
+
+                has_deal = False
+                discount_amount = 0.0
+                if deal_description or final_price < expected_total:
+                    has_deal = True
+                    discount_amount = max(0.0, round(expected_total - final_price, 2))
+                    if not deal_description:
+                        deal_description = "מבצע רשת"
+
+                unified_receipt["items"].append({
+                    "description": item.get("name", "פריט").strip(),
+                    "barcode": str(item.get("code")) if item.get("code") else None,
+                    "is_by_weight": bool(weight),
+                    "quantity_or_weight": quantity,
+                    "unit_price": unit_price,
+                    "original_total_price": expected_total,
+                    "is_part_of_deal": has_deal,
+                    "deal_text": deal_description or None,
+                    "discount_amount": discount_amount,
+                    "final_price": final_price,
+                    "category_path": item.get("category", ["כללי"])
+                })
+
+            return unified_receipt
+        except Exception as ex:
+            logger.error(f"Failed parsing inner metrics via Pairzon schema definition matrix: {ex}")
+            raise
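The date and item normalization in the `pairzon_parser.py` hunk above can be isolated into two small functions, which makes the arithmetic easier to check without network access. This is a sketch rather than the package's code: the function names `split_created_date` and `normalize_item` are hypothetical, and reading `weight` as grams is an assumption inferred from the division by 1000 in the original.

```python
from typing import Any, Dict, Tuple


def split_created_date(created_date: str) -> Tuple[str, str]:
    """Turn 'YYYY-MM-DDTHH:MM:SS...' into ('DD/MM/YYYY', 'HH:MM:SS')."""
    if "T" in created_date:
        date_part, time_part = created_date.split("T", 1)
    else:
        date_part, time_part = created_date, "00:00:00"
    if "-" in date_part:
        parts = date_part.split("-")
        date_part = f"{parts[2]}/{parts[1]}/{parts[0]}"
    # Keep only HH:MM:SS, dropping fractional seconds or a timezone suffix
    return date_part, time_part[:8]


def normalize_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Derive quantity, pre-discount total and discount for one payload item."""
    weight = item.get("weight")
    # Assumption: weight is given in grams, so dividing by 1000 yields kilograms
    quantity = float(weight) / 1000.0 if weight else float(item.get("quantity", 1.0))
    unit_price = float(item.get("price", 0.0))
    expected_total = round(quantity * unit_price, 2)
    final_price = float(item.get("total", expected_total))
    return {
        "is_by_weight": bool(weight),
        "quantity_or_weight": quantity,
        "original_total_price": expected_total,
        "discount_amount": max(0.0, round(expected_total - final_price, 2)),
        "final_price": final_price,
    }


date_str, time_str = split_created_date("2026-06-23T14:30:00.000Z")
weighed = normalize_item({"weight": 500, "price": 10.0, "total": 4.0})
counted = normalize_item({"quantity": 2, "price": 6.5, "total": 11.0})
print(date_str, time_str, weighed["discount_amount"], counted["discount_amount"])
```

Keeping these steps as pure functions lets them be unit-tested against saved payloads, while the HTTP resolution logic stays in the parser itself.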
@@ -0,0 +1,208 @@
|
|
|
1
|
+
import json
|
|
2
|
+
import logging
|
|
3
|
+
import os
|
|
4
|
+
import re
|
|
5
|
+
import urllib.request
|
|
6
|
+
import urllib.error
|
|
7
|
+
from typing import Dict, Any
|
|
8
|
+
from bs4 import BeautifulSoup
|
|
9
|
+
from .base_parser import BaseReceiptParser, NuxtDataHydrator
|
|
10
|
+
|
|
11
|
+
logging.basicConfig(level=logging.INFO)
|
|
12
|
+
logger = logging.getLogger("RamiLevyParser")
|
|
13
|
+
|
|
14
|
+
class RamiLevyParser(BaseReceiptParser):
|
|
15
|
+
def __init__(self) -> None:
|
|
16
|
+
super().__init__(store_name="Rami Levy")
|
|
17
|
+
|
|
18
|
+
def parse(self, source_data: str) -> Dict[str, Any]:
|
|
19
|
+
html_content = ""
|
|
20
|
+
|
|
21
|
+
if source_data.startswith("http://") or source_data.startswith("https://"):
|
|
22
|
+
try:
|
|
23
|
+
# Clean and identify short links vs direct data resources
|
|
24
|
+
target_url = source_data.strip()
|
|
25
|
+
logger.info(f"Downloading live Rami Levy content context: {target_url}")
|
|
26
|
+
|
|
27
|
+
# Emulate a complete browser identity to step through security filters
|
|
28
|
+
headers = {
|
|
29
|
+
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
|
|
30
|
+
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
|
|
31
|
+
'Accept-Language': 'he-IL,he;q=0.9,en-US;q=0.8,en;q=0.7',
|
|
32
|
+
'Cache-Control': 'no-cache',
|
|
33
|
+
'Pragma': 'no-cache'
|
|
34
|
+
}
|
|
35
|
+
|
|
36
|
+
req = urllib.request.Request(target_url, headers=headers)
|
|
37
|
+
with urllib.request.urlopen(req, timeout=15) as response:
|
|
38
|
+
html_content = response.read().decode('utf-8')
|
|
39
|
+
|
|
40
|
+
# Save data backup trace locally for evaluation
|
|
41
|
+
os.makedirs("temp", exist_ok=True)
|
|
42
|
+
with open("temp/rami_levy_raw_page.html", "w", encoding="utf-8") as f:
|
|
43
|
+
f.write(html_content)
|
|
44
|
+
|
|
45
|
+
except urllib.error.HTTPError as http_err:
|
|
46
|
+
logger.error(f"Rami Levy gateway connection dropped (HTTP {http_err.code})")
|
|
47
|
+
raise ValueError(f"נכשל חיבור לשרת רמי לוי. קוד שגיאה: {http_err.code}")
|
|
48
|
+
except Exception as e:
|
|
49
|
+
logger.error(f"Failed downloading remote Rami Levy page context: {e}")
|
|
50
|
+
raise e
|
|
51
|
+
else:
|
|
52
|
+
html_content = source_data
|
|
53
|
+
|
|
54
|
+
try:
|
|
55
|
+
raw_json_text = ""
|
|
56
|
+
|
|
57
|
+
# 1. Standard HTML Extraction Flow
|
|
58
|
+
if "<script" in html_content or "<body" in html_content:
|
|
59
|
+
soup = BeautifulSoup(html_content, "html.parser")
|
|
60
|
+
nuxt_script = soup.find("script", id="__NUXT_DATA__")
|
|
61
|
+
if not nuxt_script:
|
|
62
|
+
nuxt_script = soup.find("script", string=re.compile(r'__NUXT_DATA__'))
|
|
63
|
+
|
|
64
|
+
if nuxt_script:
|
|
65
|
+
raw_json_text = nuxt_script.string if nuxt_script.string else nuxt_script.text
|
|
66
|
+
|
|
67
|
+
# 2. API Fallback Flow (If Nuxt text blocks are missing or encoded)
|
|
68
|
+
if not raw_json_text.strip():
|
|
69
|
+
logger.info("Script extraction returned empty text. Attempting API transformation route...")
|
|
70
|
+
# Extract the token directly out of the URL path (e.g. /0fFlup4Bp5Iw_ikNn9ZU)
|
|
71
|
+
url_path = urlparse(source_data).path if source_data.startswith("http") else ""
|
|
72
|
+
token_match = re.search(r'/([^/]+)$', url_path)
|
|
73
|
+
|
|
74
|
+
if token_match:
|
|
75
|
+
token = token_match.group(1)
|
|
76
|
+
# Query Rami Levy's microservice data endpoint directly
|
|
77
|
+
api_fallback_url = f"https://api-digi.rami-levy.co.il/api/v1/receipts/{token}"
|
|
78
|
+
logger.info(f"Querying production backup data endpoint: {api_fallback_url}")
|
|
79
|
+
|
|
80
|
+
api_headers = {
|
|
81
|
+
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
|
|
82
|
+
'Accept': 'application/json'
|
|
83
|
+
}
|
|
84
|
+
fallback_req = urllib.request.Request(api_fallback_url, headers=api_headers)
|
|
85
|
+
with urllib.request.urlopen(fallback_req, timeout=15) as fallback_res:
|
|
86
|
+
raw_json_text = fallback_res.read().decode('utf-8')
|
|
87
|
+
else:
|
|
88
|
+
raise ValueError("Could not extract document tracking tokens from provided link layout.")
|
|
89
|
+
|
|
90
|
+
if not raw_json_text.strip():
|
|
91
|
+
raise ValueError("Rami Levy parsing sequence generated an empty payload text string.")
|
|
92
|
+
|
|
93
|
+
raw_payload = json.loads(raw_json_text.strip())
|
|
94
|
+
|
|
95
|
+
# Back up the decoded JSON payload locally. Create temp/ first: the raw-HTML input path never creates it.
os.makedirs("temp", exist_ok=True)
|
|
96
|
+
with open("temp/rami_levy_raw.json", "w", encoding="utf-8") as f:
|
|
97
|
+
json.dump(raw_payload, f, ensure_ascii=False, indent=2)
|
|
98
|
+
|
|
99
|
+
# Initialize hydration pipeline
|
|
100
|
+
# If the payload is from the backup direct API, it will already be unflattened.
|
|
101
|
+
# If it's standard Nuxt transport data (a flat list), try root indices 5 down to 1 until a receipt node is found.
|
|
102
|
+
if isinstance(raw_payload, list):
|
|
103
|
+
hydrator = NuxtDataHydrator(raw_payload)
|
|
104
|
+
# Search across primary structural indices for receipt context arrays
|
|
105
|
+
receipt_core = None
|
|
106
|
+
for base_index in (5, 4, 3, 2, 1):
|
|
107
|
+
try:
|
|
108
|
+
node = hydrator.hydrate_node(base_index)
|
|
109
|
+
if isinstance(node, dict) and ("items" in node or "branch" in node):
|
|
110
|
+
receipt_core = node
|
|
111
|
+
break
|
|
112
|
+
elif isinstance(node, dict) and "data" in node:
|
|
113
|
+
inner_data = node["data"]
|
|
114
|
+
if isinstance(inner_data, dict):
|
|
115
|
+
# Scan the nested values for a dict that carries the receipt items
|
|
116
|
+
for k, v in inner_data.items():
|
|
117
|
+
if isinstance(v, dict) and "items" in v:
|
|
118
|
+
receipt_core = v
|
|
119
|
+
break
|
|
120
|
+
except Exception:
|
|
121
|
+
continue
|
|
122
|
+
if not receipt_core:
|
|
123
|
+
raise ValueError("Failed to locate receipt parameters inside the Nuxt transport grid.")
|
|
124
|
+
else:
|
|
125
|
+
# Payload is already standard dictionary data from API backup stream
|
|
126
|
+
receipt_core = raw_payload.get("data", raw_payload)
|
|
127
|
+
|
|
128
|
+
branch_info = receipt_core.get("branch", {}) or {}
|
|
129
|
+
company_info = receipt_core.get("company", {}) or {}
|
|
130
|
+
payment_core = receipt_core.get("payments", {}) or {}
|
|
131
|
+
|
|
132
|
+
methods_list = payment_core.get("methods", [])
|
|
133
|
+
primary_method = methods_list[0] if isinstance(methods_list, list) and len(methods_list) > 0 else {}
|
|
134
|
+
|
|
135
|
+
created_at = receipt_core.get("created_at", "2026-01-01T00:00:00.000Z")
|
|
136
|
+
date_part, time_part = created_at.split("T", 1) if "T" in created_at else (created_at, "00:00:00")
|
|
137
|
+
if "-" in date_part:
|
|
138
|
+
parts = date_part.split("-")
|
|
139
|
+
formatted_date = f"{parts[2]}/{parts[1]}/{parts[0]}"
|
|
140
|
+
else:
|
|
141
|
+
formatted_date = date_part
|
|
142
|
+
|
|
143
|
+
unified_receipt: Dict[str, Any] = {
|
|
144
|
+
"store_name": self.store_name,
|
|
145
|
+
"company_legal_id": str(company_info.get("tax_id", receipt_core.get("business_id", "513770669"))),
|
|
146
|
+
"branch_name": str(branch_info.get("name", "רמי לוי סניף")).strip(),
|
|
147
|
+
"store_address": str(company_info.get("address", "")).strip(),
|
|
148
|
+
"store_phone": str(company_info.get("customer_service", {}).get("branch_phone", "")).strip(),
|
|
149
|
+
"customer_name": str(receipt_core.get("customer", {}).get("name", "")).strip() or None,
|
|
150
|
+
"date": formatted_date,
|
|
151
|
+
"time": time_part[:8],
|
|
152
|
+
"receipt_id": str(receipt_core.get("transaction_id", receipt_core.get("id", ""))),
|
|
153
|
+
"total_paid": float(payment_core.get("total", receipt_core.get("total", 0.0))),
|
|
154
|
+
"vat_rate": float(receipt_core.get("vat_rate", 18.0)),
|
|
155
|
+
"total_vat_paid": float(payment_core.get("total_vat", 0.0)),
|
|
156
|
+
"payment_method": str(primary_method.get("name", "אשראי")).strip(),
|
|
157
|
+
"items": []
|
|
158
|
+
}
|
|
159
|
+
|
|
160
|
+
for item in receipt_core.get("items", []):
|
|
161
|
+
if not isinstance(item, dict):
|
|
162
|
+
continue
|
|
163
|
+
|
|
164
|
+
weight_val = item.get("weight")
|
|
165
|
+
quantity = float(weight_val) / 1000.0 if weight_val else float(item.get("quantity", 1.0))
|
|
166
|
+
unit_price = float(item.get("price", 0.0))
|
|
167
|
+
|
|
168
|
+
expected_total = round(quantity * unit_price, 2)
|
|
169
|
+
|
|
170
|
+
discount_amount = 0.0
|
|
171
|
+
deal_description = ""
|
|
172
|
+
add_info = item.get("additional_info", [])
|
|
173
|
+
if isinstance(add_info, list):
|
|
174
|
+
for info_node in add_info:
|
|
175
|
+
if isinstance(info_node, dict) and "value" in info_node:
|
|
176
|
+
val_str = str(info_node.get("value", "0"))
|
|
177
|
+
if "-" in val_str:
|
|
178
|
+
try:
|
|
179
|
+
discount_amount = abs(float(val_str))
|
|
180
|
+
deal_description = str(info_node.get("key", "")).strip()
|
|
181
|
+
except ValueError:
|
|
182
|
+
pass
|
|
183
|
+
|
|
184
|
+
final_price = round(expected_total - discount_amount, 2)
|
|
185
|
+
# Fall back to the item's reported total when the computed price is not positive
|
|
186
|
+
if final_price <= 0 and item.get("total"):
|
|
187
|
+
final_price = float(item.get("total"))
|
|
188
|
+
|
|
189
|
+
has_deal = bool(discount_amount > 0 or deal_description)
|
|
190
|
+
|
|
191
|
+
unified_receipt["items"].append({
|
|
192
|
+
"description": str(item.get("name", "פריט")).strip(),
|
|
193
|
+
"barcode": str(item.get("code")) if item.get("code") else None,
|
|
194
|
+
"is_by_weight": True if weight_val else False,
|
|
195
|
+
"quantity_or_weight": quantity,
|
|
196
|
+
"unit_price": unit_price,
|
|
197
|
+
"original_total_price": expected_total,
|
|
198
|
+
"is_part_of_deal": has_deal,
|
|
199
|
+
"deal_text": deal_description or None,
|
|
200
|
+
"discount_amount": discount_amount,
|
|
201
|
+
"final_price": final_price,
|
|
202
|
+
"category_path": ["סופרמרקט"]
|
|
203
|
+
})
|
|
204
|
+
|
|
205
|
+
return unified_receipt
|
|
206
|
+
except Exception as ex:
|
|
207
|
+
logger.error(f"Error expanding serialized Rami Levy tables: {ex}")
|
|
208
|
+
raise ex
|
|
@@ -0,0 +1,180 @@
|
|
|
1
|
+
import json
|
|
2
|
+
import logging
|
|
3
|
+
import os
|
|
4
|
+
import urllib.request
|
|
5
|
+
import urllib.error
|
|
6
|
+
from urllib.parse import urlparse, parse_qs, urljoin
|
|
7
|
+
from typing import Dict, Any
|
|
8
|
+
from .base_parser import BaseReceiptParser
|
|
9
|
+
|
|
10
|
+
logging.basicConfig(level=logging.INFO)
|
|
11
|
+
logger = logging.getLogger("WeezmoParser")
|
|
12
|
+
|
|
13
|
+
class WeezmoParser(BaseReceiptParser):
|
|
14
|
+
def __init__(self) -> None:
|
|
15
|
+
super().__init__(store_name="Weezmo Provider")
|
|
16
|
+
|
|
17
|
+
def parse(self, source_data: str) -> Dict[str, Any]:
|
|
18
|
+
raw_json: str = ""
|
|
19
|
+
|
|
20
|
+
if source_data.startswith("http://") or source_data.startswith("https://"):
|
|
21
|
+
try:
|
|
22
|
+
base_url = source_data.strip()
|
|
23
|
+
|
|
24
|
+
# Browser-like request headers, so the WAF does not leave the connection hanging
|
|
25
|
+
headers = {
|
|
26
|
+
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
|
|
27
|
+
'Accept': 'application/json, text/plain, */*',
|
|
28
|
+
'Accept-Language': 'he-IL,he;q=0.9,en-US;q=0.8,en;q=0.7',
|
|
29
|
+
'Connection': 'close' # Explicitly close the connection to prevent hanging sockets
|
|
30
|
+
}
|
|
31
|
+
|
|
32
|
+
parsed_url = urlparse(base_url)
|
|
33
|
+
queries = parse_qs(parsed_url.query)
|
|
34
|
+
|
|
35
|
+
receipt_token = queries.get("q", [None])[0]
|
|
36
|
+
|
|
37
|
+
# If it's a short link (wee.ai), catch the 302 redirect location header
|
|
38
|
+
if not receipt_token and "wee.ai" in base_url:
|
|
39
|
+
logger.info("Intercepting wee.ai short-link redirection matrix...")
|
|
40
|
+
|
|
41
|
+
class InterceptRedirectHandler(urllib.request.HTTPRedirectHandler):
|
|
42
|
+
def http_error_302(self, req, fp, code, msg, headers):
|
|
43
|
+
return fp
|
|
44
|
+
|
|
45
|
+
# Build an opener that returns the 302 response instead of following it (the timeout is set on open() below)
|
|
46
|
+
opener = urllib.request.build_opener(InterceptRedirectHandler())
|
|
47
|
+
req = urllib.request.Request(base_url, headers=headers)
|
|
48
|
+
|
|
49
|
+
with opener.open(req, timeout=8) as response:
|
|
50
|
+
redirect_location = response.headers.get('Location', '')
|
|
51
|
+
if redirect_location:
|
|
52
|
+
# Safely handle both absolute and relative paths (/cms.html?q=...)
|
|
53
|
+
if not redirect_location.startswith('http'):
|
|
54
|
+
redirect_location = urljoin(base_url, redirect_location)
|
|
55
|
+
|
|
56
|
+
logger.info(f"Redirection resolved to target landing: {redirect_location}")
|
|
57
|
+
parsed_redirect = urlparse(redirect_location)
|
|
58
|
+
redirect_queries = parse_qs(parsed_redirect.query)
|
|
59
|
+
receipt_token = redirect_queries.get("q", [None])[0]
|
|
60
|
+
|
|
61
|
+
if not receipt_token:
|
|
62
|
+
raise ValueError("Could not extract a valid query token 'q' from the Weezmo link framework.")
|
|
63
|
+
|
|
64
|
+
api_url = f"https://receipts.weezmo.com/api/receipts/{receipt_token}"
|
|
65
|
+
logger.info(f"Targeting Weezmo data provider gateway: {api_url}")
|
|
66
|
+
|
|
67
|
+
# Fetch the receipt JSON with an explicit 10-second timeout
|
|
68
|
+
api_req = urllib.request.Request(api_url, headers=headers)
|
|
69
|
+
with urllib.request.urlopen(api_req, timeout=10) as response:
|
|
70
|
+
raw_json = response.read().decode('utf-8')
|
|
71
|
+
|
|
72
|
+
os.makedirs("temp", exist_ok=True)
|
|
73
|
+
with open("temp/weezmo_generic_raw.json", "w", encoding="utf-8") as f:
|
|
74
|
+
f.write(raw_json)
|
|
75
|
+
|
|
76
|
+
except urllib.error.HTTPError as http_err:
|
|
77
|
+
logger.error(f"Weezmo engine connection exception: {http_err.code}")
|
|
78
|
+
raise ValueError(f"שגיאת תקשורת מול שרת וויזמו: קוד {http_err.code}")  # English: "Communication error with the Weezmo server: code <code>"
|
|
79
|
+
except urllib.error.URLError as url_err:
|
|
80
|
+
logger.error(f"Weezmo network timeout or unresolved destination: {url_err.reason}")
|
|
81
|
+
raise ValueError("חיבור הרשת לשרת וויזמו נותק או הגיע למגבלת זמן (Timeout).")  # English: "The network connection to the Weezmo server was dropped or timed out."
|
|
82
|
+
except Exception as e:
|
|
83
|
+
logger.error(f"Critical execution error resolving Weezmo document: {e}")
|
|
84
|
+
raise e
|
|
85
|
+
else:
|
|
86
|
+
raw_json = source_data
|
|
87
|
+
|
|
88
|
+
try:
|
|
89
|
+
if not raw_json.strip():
|
|
90
|
+
raise ValueError("Payload data stream returned empty string bounds.")
|
|
91
|
+
|
|
92
|
+
payload_data = json.loads(raw_json)
|
|
93
|
+
if isinstance(payload_data, list):
|
|
94
|
+
if len(payload_data) == 0:
|
|
95
|
+
raise ValueError("Weezmo data matrix array is empty.")
|
|
96
|
+
payload = payload_data[0]
|
|
97
|
+
else:
|
|
98
|
+
payload = payload_data
|
|
99
|
+
|
|
100
|
+
branch_info = payload.get("tBranch", {}) or {}
|
|
101
|
+
business_info = payload.get("tBusiness", {}) or {}
|
|
102
|
+
|
|
103
|
+
dynamic_store_name = str(business_info.get("businessName") or branch_info.get("branchName") or "רשת קמעונאות").strip()
|
|
104
|
+
self.store_name = dynamic_store_name
|
|
105
|
+
logger.info(f"Dynamic branding identity successfully verified as: '{self.store_name}'")
|
|
106
|
+
|
|
107
|
+
created_date = payload.get("createdDate", "2026-01-01T00:00:00Z")
|
|
108
|
+
date_part, time_part = created_date.split("T", 1) if "T" in created_date else (created_date, "00:00:00")
|
|
109
|
+
if "-" in date_part:
|
|
110
|
+
parts = date_part.split("-")
|
|
111
|
+
formatted_date = f"{parts[2]}/{parts[1]}/{parts[0]}"
|
|
112
|
+
else:
|
|
113
|
+
formatted_date = date_part
|
|
114
|
+
|
|
115
|
+
payments_list = payload.get("payments", [])
|
|
116
|
+
primary_payment = payments_list[0] if isinstance(payments_list, list) and len(payments_list) > 0 else {}
|
|
117
|
+
|
|
118
|
+
unified_receipt: Dict[str, Any] = {
|
|
119
|
+
"store_name": self.store_name,
|
|
120
|
+
"company_legal_id": str(branch_info.get("vatNumber", payload.get("businessID", "515136893"))),
|
|
121
|
+
"branch_name": str(branch_info.get("branchName", "סניף כללי")).strip(),
|
|
122
|
+
"store_address": str(branch_info.get("branchAddress", "")).strip(),
|
|
123
|
+
"store_phone": str(business_info.get("phone", "")).strip() or str(branch_info.get("branchPhone", "")).strip(),
|
|
124
|
+
"customer_name": str(payload.get("loyalName", "")).strip() or None,
|
|
125
|
+
"date": formatted_date,
|
|
126
|
+
"time": time_part[:8],
|
|
127
|
+
"receipt_id": str(payload.get("transactionNumber", payload.get("id", ""))),
|
|
128
|
+
"total_paid": float(payload.get("total", 0.0)),
|
|
129
|
+
"vat_rate": float(payload.get("vat", 17.0)),
|
|
130
|
+
"total_vat_paid": float(payload.get("vatTotal", 0.0)),
|
|
131
|
+
"payment_method": str(primary_payment.get("name", "אשראי")).strip(),
|
|
132
|
+
"items": []
|
|
133
|
+
}
|
|
134
|
+
|
|
135
|
+
for item in payload.get("items", []):
|
|
136
|
+
if not isinstance(item, dict):
|
|
137
|
+
continue
|
|
138
|
+
|
|
139
|
+
quantity = float(item.get("quantity", 1.0))
|
|
140
|
+
unit_price = float(item.get("price", 0.0))
|
|
141
|
+
final_price = float(item.get("total", quantity * unit_price))
|
|
142
|
+
expected_total = round(quantity * unit_price, 2)
|
|
143
|
+
|
|
144
|
+
discount_amount = 0.0
|
|
145
|
+
deal_description = ""
|
|
146
|
+
additional_data = item.get("additionalData", [])
|
|
147
|
+
|
|
148
|
+
if isinstance(additional_data, list):
|
|
149
|
+
for data_node in additional_data:
|
|
150
|
+
if isinstance(data_node, dict) and "value" in data_node:
|
|
151
|
+
val_str = str(data_node.get("value", ""))
|
|
152
|
+
if "-" in val_str:
|
|
153
|
+
try:
|
|
154
|
+
discount_amount = abs(float(val_str))
|
|
155
|
+
deal_description = str(data_node.get("key", "")).strip()
|
|
156
|
+
except ValueError:
|
|
157
|
+
pass
|
|
158
|
+
|
|
159
|
+
has_deal = bool(discount_amount > 0 or deal_description)
|
|
160
|
+
is_weight = not quantity.is_integer()
|
|
161
|
+
|
|
162
|
+
unified_receipt["items"].append({
|
|
163
|
+
"description": str(item.get("name", "פריט")).strip(),
|
|
164
|
+
"barcode": str(item.get("itemCode")) if item.get("itemCode") else None,
|
|
165
|
+
"is_by_weight": is_weight,
|
|
166
|
+
"quantity_or_weight": quantity,
|
|
167
|
+
"unit_price": unit_price,
|
|
168
|
+
"original_total_price": expected_total,
|
|
169
|
+
"is_part_of_deal": has_deal,
|
|
170
|
+
"deal_text": deal_description or None,
|
|
171
|
+
"discount_amount": discount_amount,
|
|
172
|
+
"final_price": final_price,
|
|
173
|
+
"category_path": ["ביגוד ואופנה"] if self.store_name == "H&O" else ["סופרמרקט"]
|
|
174
|
+
})
|
|
175
|
+
|
|
176
|
+
return unified_receipt
|
|
177
|
+
|
|
178
|
+
except Exception as ex:
|
|
179
|
+
logger.error(f"Error mapping Weezmo dynamic parameters: {ex}")
|
|
180
|
+
raise ex
|
|
@@ -0,0 +1,152 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: israeli-invoice-parser
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: A unified parsing library for Israeli digital grocery and retail receipts
|
|
5
|
+
Author-email: Yohay Cohen <yohaybn@gmail.com>
|
|
6
|
+
Project-URL: Homepage, https://github.com/yohaybn/israeli-invoice-parser-lib
|
|
7
|
+
Project-URL: Bug Tracker, https://github.com/yohaybn/israeli-invoice-parser-lib/issues
|
|
8
|
+
Classifier: Programming Language :: Python :: 3
|
|
9
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
10
|
+
Classifier: Operating System :: OS Independent
|
|
11
|
+
Requires-Python: >=3.8
|
|
12
|
+
Description-Content-Type: text/markdown
|
|
13
|
+
License-File: LICENSE
|
|
14
|
+
Requires-Dist: beautifulsoup4>=4.12.0
|
|
15
|
+
Dynamic: license-file
|
|
16
|
+
|
|
17
|
+
# israeli-invoice-parser
|
|
18
|
+
|
|
19
|
+
A unified Python parsing library for Israeli digital receipts, grocery bills, and commercial retail invoices. It normalizes vendor-specific payloads (direct API responses, raw HTML, and Nuxt `__NUXT_DATA__` transport arrays) into a single structured Python dictionary.
|
|
20
|
+
|
|
21
|
+
## Supported Retailers and Providers
|
|
22
|
+
|
|
23
|
+
The library supports major Israeli storefronts directly or via central receipt infrastructure aggregators:
|
|
24
|
+
|
|
25
|
+
* **Rami Levy (רמי לוי)** — Native support for standard digital bills and microservice data streams.
|
|
26
|
+
* **Weezmo / Wee.ai Infrastructure** — Multi-brand validation supporting grocery gateways like **Yohananof (יוחננוף)**, high-street retail setups, and fashion entities (**TopTen**, **Tamnun**).
|
|
27
|
+
* **Pairzon Engine (פיירזון)** — Dynamic resolution for short-link token routers and partner store layouts (e.g., **Osher Ad (אושר עד)**, **Max Stock (מקס סטוק)**, etc.).
|
|
28
|
+
|
|
29
|
+
> **Current Limitations:** Automated extraction for **Shufersal (שופרסל)** invoices is currently blocked: the endpoint's bot protection (WAF rules) rejects standard programmatic requests. We are still looking for a way to emulate browser signatures well enough to restore this functionality. Contributions or ideas on this issue are highly appreciated!
|
|
30
|
+
|
|
31
|
+
---
|
|
32
|
+
|
|
33
|
+
## Installation
|
|
34
|
+
|
|
35
|
+
Install the package via `pip`:
|
|
36
|
+
|
|
37
|
+
```bash
|
|
38
|
+
pip install israeli-invoice-parser
|
|
39
|
+
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
## Quick Start and Usage Examples
|
|
45
|
+
|
|
46
|
+
Every parser inherits from a common interface (`BaseReceiptParser`) and returns a standardized data model, making it simple to process invoices interchangeably.
|
|
47
|
+
|
|
48
|
+
### 1. Parsing a Rami Levy URL
|
|
49
|
+
|
|
50
|
+
```python
|
|
51
|
+
from invoice_parser import RamiLevyParser
|
|
52
|
+
|
|
53
|
+
# Initialize the dedicated parser
|
|
54
|
+
parser = RamiLevyParser()
|
|
55
|
+
|
|
56
|
+
# Pass a live receipt or invoice URL directly
|
|
57
|
+
url = "https://api-digi.rami-levy.co.il/api/v1/receipts/example-token-12345"
|
|
58
|
+
receipt = parser.parse(url)
|
|
59
|
+
|
|
60
|
+
# Access standardized fields uniformly
|
|
61
|
+
print(f"Store: {receipt['store_name']}")
|
|
62
|
+
print(f"Total Paid: ₪{receipt['total_paid']}")
|
|
63
|
+
print(f"Date: {receipt['date']} at {receipt['time']}")
|
|
64
|
+
|
|
65
|
+
for item in receipt['items']:
|
|
66
|
+
print(f" - {item['description']}: ₪{item['final_price']} (Qty: {item['quantity_or_weight']})")
|
|
67
|
+
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
### 2. Parsing a Weezmo / Wee.ai Provider Short-Link (e.g., Yohananof)
|
|
71
|
+
|
|
72
|
+
```python
|
|
73
|
+
from invoice_parser import WeezmoParser
|
|
74
|
+
|
|
75
|
+
parser = WeezmoParser()
|
|
76
|
+
|
|
77
|
+
# Works with central wee.ai tracking tokens or short links
|
|
78
|
+
weezmo_url = "https://wee.ai/r/v123abcd"
|
|
79
|
+
receipt = parser.parse(weezmo_url)
|
|
80
|
+
|
|
81
|
+
# The parser dynamically extracts real corporate metadata to identify the sub-brand
|
|
82
|
+
print(f"Identified Brand: {receipt['store_name']}") # e.g., 'יוחננוף'
|
|
83
|
+
print(f"Legal Business ID: {receipt['company_legal_id']}")
|
|
84
|
+
|
|
85
|
+
```
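
For wee.ai short links, the Weezmo parser source in this release reads the redirect's `Location` header and takes the `q` query parameter as the receipt token. A minimal offline sketch of that step using only `urllib.parse`; `extract_receipt_token` is a hypothetical helper name, and the URLs in the usage note are made up.

```python
from typing import Optional
from urllib.parse import parse_qs, urljoin, urlparse


def extract_receipt_token(base_url: str, location: str) -> Optional[str]:
    """Resolve a possibly relative Location value and return its 'q' parameter."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    resolved = urljoin(base_url, location)
    values = parse_qs(urlparse(resolved).query).get("q")
    return values[0] if values else None
```

With a relative location such as `/cms.html?q=abc123` and the base `https://wee.ai/r/v123abcd`, the helper returns `abc123`; it returns `None` when no `q` parameter is present.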
|
|
86
|
+
|
|
87
|
+
### 3. Standardized Output Format Matrix
|
|
88
|
+
|
|
89
|
+
Regardless of which vendor parser is called, the output dictionary follows this layout:
|
|
90
|
+
|
|
91
|
+
```python
|
|
92
|
+
{
|
|
93
|
+
"store_name": "רמי לוי",
|
|
94
|
+
"company_legal_id": "513770669",
|
|
95
|
+
"branch_name": "סניף תל אביב",
|
|
96
|
+
"store_address": "דרך מנחם בגין 123",
|
|
97
|
+
"store_phone": "03-1234567",
|
|
98
|
+
"customer_name": "ישראל ישראלי",
|
|
99
|
+
"date": "23/06/2026",
|
|
100
|
+
"time": "14:30:00",
|
|
101
|
+
"receipt_id": "987654321",
|
|
102
|
+
"total_paid": 245.50,
|
|
103
|
+
"vat_rate": 17.0,
|
|
104
|
+
"total_vat_paid": 35.67,
|
|
105
|
+
"payment_method": "אשראי",
|
|
106
|
+
"items": [
|
|
107
|
+
{
|
|
108
|
+
"description": "חלב תנובה 3%",
|
|
109
|
+
"barcode": "7290000042431",
|
|
110
|
+
"is_by_weight": False,
|
|
111
|
+
"quantity_or_weight": 2.0,
|
|
112
|
+
"unit_price": 6.50,
|
|
113
|
+
"original_total_price": 13.00,
|
|
114
|
+
"is_part_of_deal": True,
|
|
115
|
+
"deal_text": "2 ב-₪11",
|
|
116
|
+
"discount_amount": 2.00,
|
|
117
|
+
"final_price": 11.00,
|
|
118
|
+
"category_path": ["סופרמרקט"]
|
|
119
|
+
}
|
|
120
|
+
]
|
|
121
|
+
}
|
|
122
|
+
|
|
123
|
+
```
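
Because every parser returns the same layout, downstream checks can be written once. A rough sketch that verifies the top-level keys are present and compares the sum of the items' `final_price` values with `total_paid`; the name `check_receipt` and the 0.05 tolerance are assumptions for illustration, and real receipts may legitimately differ (rounding, deposits, receipt-level discounts).

```python
from typing import Any, Dict, List

REQUIRED_KEYS = (
    "store_name", "company_legal_id", "branch_name", "store_address",
    "store_phone", "customer_name", "date", "time", "receipt_id",
    "total_paid", "vat_rate", "total_vat_paid", "payment_method", "items",
)


def check_receipt(receipt: Dict[str, Any], tolerance: float = 0.05) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in receipt]
    items_total = round(
        sum(item.get("final_price", 0.0) for item in receipt.get("items", [])), 2
    )
    paid = float(receipt.get("total_paid", 0.0))
    if abs(items_total - paid) > tolerance:
        problems.append(f"items sum to {items_total}, but total_paid is {paid}")
    return problems
```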
|
|
124
|
+
|
|
125
|
+
---
|
|
126
|
+
|
|
127
|
+
## Contributing and Helping Out
|
|
128
|
+
|
|
129
|
+
Parsing real-world digital invoices is a game of cat-and-mouse as retailers update their internal schemas. We need your help to make this library resilient!
|
|
130
|
+
|
|
131
|
+
### Have an Unsupported Receipt / Found a Bug?
|
|
132
|
+
|
|
133
|
+
If you run into an invoice that fails to parse (such as **Shufersal** or a newly formatted receipt layout):
|
|
134
|
+
|
|
135
|
+
1. Open a New Issue on the [israeli-invoice-parser-lib Bug Tracker](https://github.com/yohaybn/israeli-invoice-parser-lib/issues).
|
|
136
|
+
2. **Crucial:** Provide a **real, live link to the invoice**. Without a working URL, it is impossible to inspect the underlying network payload structure, test backend responses, or map out the necessary payload parameters.
|
|
137
|
+
3. If you have suggestions or workarounds for bypassing Shufersal's anti-bot restrictions, please detail them inside the dedicated discussion issues!
|
|
138
|
+
|
|
139
|
+
### Want to Add a New Parser?
|
|
140
|
+
|
|
141
|
+
We warmly welcome pull requests! To contribute a new parser:
|
|
142
|
+
|
|
143
|
+
1. Subclass `BaseReceiptParser` from `base_parser.py`.
|
|
144
|
+
2. Implement the `.parse(self, source_data: str) -> Dict[str, Any]` method.
|
|
145
|
+
3. Map the data cleanly into our uniform dictionary format.
|
|
146
|
+
4. Submit your PR directly to the [GitHub Repository](https://github.com/yohaybn/israeli-invoice-parser-lib/).
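
As a rough illustration of steps 1 to 3, here is a minimal sketch of a new parser. To stay self-contained it defines a stand-in `BaseReceiptParser` whose constructor takes `store_name`, mirroring how the bundled parsers call `super().__init__(store_name=...)`; in a real contribution you would import the class from `base_parser.py` instead. `ExampleStoreParser` and its input field names (`id`, `total`, `lines`, `qty`) are hypothetical, and only a subset of the output fields is filled in.

```python
import json
from typing import Any, Dict


class BaseReceiptParser:
    """Stand-in for the library's base class (an assumption for this sketch)."""

    def __init__(self, store_name: str) -> None:
        self.store_name = store_name

    def parse(self, source_data: str) -> Dict[str, Any]:
        raise NotImplementedError


class ExampleStoreParser(BaseReceiptParser):
    """Hypothetical parser for a made-up JSON payload."""

    def __init__(self) -> None:
        super().__init__(store_name="Example Store")

    def parse(self, source_data: str) -> Dict[str, Any]:
        payload = json.loads(source_data)
        receipt: Dict[str, Any] = {
            "store_name": self.store_name,
            "receipt_id": str(payload.get("id", "")),
            "total_paid": float(payload.get("total", 0.0)),
            "items": [],
        }
        for line in payload.get("lines", []):
            quantity = float(line.get("qty", 1.0))
            unit_price = float(line.get("price", 0.0))
            receipt["items"].append({
                "description": str(line.get("name", "")).strip(),
                "quantity_or_weight": quantity,
                "unit_price": unit_price,
                "final_price": round(quantity * unit_price, 2),
            })
        return receipt
```

A complete parser would populate every key of the uniform dictionary format shown earlier, including the deal and discount fields for each item.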
|
|
147
|
+
|
|
148
|
+
---
|
|
149
|
+
|
|
150
|
+
## License
|
|
151
|
+
|
|
152
|
+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
|
@@ -0,0 +1,13 @@
|
|
|
1
|
+
LICENSE
|
|
2
|
+
README.md
|
|
3
|
+
pyproject.toml
|
|
4
|
+
src/israeli-invoice-parser/__init__.py
|
|
5
|
+
src/israeli-invoice-parser/base_parser.py
|
|
6
|
+
src/israeli-invoice-parser/pairzon_parser.py
|
|
7
|
+
src/israeli-invoice-parser/rami_levy_parser.py
|
|
8
|
+
src/israeli-invoice-parser/weezmo_parser.py
|
|
9
|
+
src/israeli_invoice_parser.egg-info/PKG-INFO
|
|
10
|
+
src/israeli_invoice_parser.egg-info/SOURCES.txt
|
|
11
|
+
src/israeli_invoice_parser.egg-info/dependency_links.txt
|
|
12
|
+
src/israeli_invoice_parser.egg-info/requires.txt
|
|
13
|
+
src/israeli_invoice_parser.egg-info/top_level.txt
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
beautifulsoup4>=4.12.0
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
israeli-invoice-parser
|