akshu-finagent 0.2.0__tar.gz
- akshu_finagent-0.2.0/PKG-INFO +212 -0
- akshu_finagent-0.2.0/README.md +201 -0
- akshu_finagent-0.2.0/akshu_finagent.egg-info/PKG-INFO +212 -0
- akshu_finagent-0.2.0/akshu_finagent.egg-info/SOURCES.txt +21 -0
- akshu_finagent-0.2.0/akshu_finagent.egg-info/dependency_links.txt +1 -0
- akshu_finagent-0.2.0/akshu_finagent.egg-info/requires.txt +4 -0
- akshu_finagent-0.2.0/akshu_finagent.egg-info/top_level.txt +2 -0
- akshu_finagent-0.2.0/finagent/__init__.py +0 -0
- akshu_finagent-0.2.0/finagent/deriver.py +84 -0
- akshu_finagent-0.2.0/finagent/extractors/__init__.py +0 -0
- akshu_finagent-0.2.0/finagent/extractors/geometric.py +219 -0
- akshu_finagent-0.2.0/finagent/geometry.py +61 -0
- akshu_finagent-0.2.0/finagent/locator.py +176 -0
- akshu_finagent-0.2.0/finagent/normalizer.py +206 -0
- akshu_finagent-0.2.0/finagent/pipeline.py +74 -0
- akshu_finagent-0.2.0/finagent/profiler.py +64 -0
- akshu_finagent-0.2.0/finagent/schema.py +205 -0
- akshu_finagent-0.2.0/finagent/validator.py +195 -0
- akshu_finagent-0.2.0/finagent/writer.py +76 -0
- akshu_finagent-0.2.0/finagent_single.py +1338 -0
- akshu_finagent-0.2.0/pyproject.toml +57 -0
- akshu_finagent-0.2.0/setup.cfg +4 -0
- akshu_finagent-0.2.0/tests/test_core.py +164 -0
@@ -0,0 +1,212 @@
+Metadata-Version: 2.4
+Name: akshu-finagent
+Version: 0.2.0
+Summary: Financial statement PDF extraction agent
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+Requires-Dist: pdfplumber>=0.11
+Requires-Dist: pypdf>=6.0
+Requires-Dist: openpyxl>=3.1
+Requires-Dist: rapidfuzz>=3.14
+
+# Financial PDF Extraction Agent (v2)
+
+[](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml)
+[](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
+
+Give it any company's annual-report PDF. It finds the financial statements,
+pulls out the key numbers, **proves each one is correct**, and writes a tidy
+Excel where every value carries its own receipt (page number, the exact label
+it came from, and which checks it passed).
+
+---
+
+## The one idea behind it
+
+> **Accuracy comes from verification, not extraction.**
+
+Reading numbers off a PDF is easy to get *almost* right and very hard to get
+*exactly* right — tables are borderless, layouts differ per company, a "5" in
+the wrong column ruins everything. So this tool doesn't trust what it reads.
+
+Financial statements are **self-verifying**. They obey rules:
+- Assets = Liabilities + Equity
+- Current + Non-current = Total
+- The closing cash in the cash-flow statement equals cash on the balance sheet
+
+So every extracted number is checked against these rules. A value that passes a
+rule is **VERIFIED**. A value that breaks one is **FLAGGED** for a human. We
+never hand over a confident-but-wrong number.
+
+---
+
+## How it works — 7 stages, like an assembly line
+
+```
+PDF
+│
+▼
+┌─────────────┐ What kind of PDF is this? Page sizes, orientation,
+│ 1 PROFILE │ and whether each page has readable text.
+└─────────────┘
+│
+▼
+┌─────────────┐ Some reports print two pages side-by-side on one wide
+│ 2 GEOMETRY │ sheet. Split those back into single logical pages.
+└─────────────┘
+│
+▼
+┌─────────────┐ Which pages ARE the Balance Sheet / P&L / Cash Flow?
+│ 3 LOCATE │ (and prefer the consolidated version)
+└─────────────┘
+│
+▼
+┌─────────────┐ Read the chosen pages line by line: "label … numbers".
+│ 4 EXTRACT │ Borderless-table safe — rebuilds rows from coordinates.
+└─────────────┘
+│
+▼
+┌─────────────┐ Clean the numbers and match each label to our standard
+│ 5 NORMALIZE │ vocabulary ("Turnover" and "Net sales" both -> revenue).
+└─────────────┘
+│
+▼
+┌─────────────┐ Prove the numbers using accounting identities.
+│ 6 VALIDATE │ -> VERIFIED / PROBABLE / FLAGGED / MISSING
+└─────────────┘
+│
+▼
+┌─────────────┐ A number the report never printed but the identities
+│ 6b DERIVE │ pin down exactly? Compute it. Mark it DERIVED.
+└─────────────┘
+│
+▼
+┌─────────────┐ Excel: value + status + page + matched label + checks.
+│ 7 WRITE │ Colour-coded so you see trust level at a glance.
+└─────────────┘
+│
+▼
+metrics.xlsx
+```
+
+**Status colours in the Excel:** 🟢 VERIFIED · 🟡 PROBABLE · 🔴 FLAGGED ·
+🔵 DERIVED · ⬜ MISSING.
+
+---
+
+## Run it
+
+```powershell
+# one report
+.venv\Scripts\python.exe finagent_single.py test_pdfs\TCS_2024-2025.pdf
+
+# scorecard across all 12 test reports
+.venv\Scripts\python.exe benchmark.py
+```
+
+Output lands in `output\<Company>_metrics.xlsx`. All commands use the project
+venv (`.venv`). First-time setup: `pip install -r requirements.txt`.
+
+---
+
+## Install a released build
+
+The project is packaged as a wheel and published on a version tag. Grab a
+build from the [Releases page](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
+and install it:
+
+```powershell
+pip install finagent-0.1.0-py3-none-any.whl
+```
+
+(TestPyPI publishing via OIDC trusted publishing is being wired up — once live,
+`pip install -i https://test.pypi.org/simple/ finagent` will work directly.)
+
+---
+
+## CI/CD
+
+This repo doubles as a hands-on CI/CD build-out (tracked level-by-level in
+[`WORKLOG.md`](WORKLOG.md)):
+
+- **CI** (`.github/workflows/ci.yml`) — every push and PR runs `ruff` + `pytest`
+across Python 3.11 / 3.12 / 3.13 in parallel, with pip caching. A single
+`all-green` check gates merges; `main` is branch-protected, so broken code
+cannot merge.
+- **CD** (`.github/workflows/release.yml`) — pushing a `vX.Y.Z` tag builds the
+wheel + sdist and publishes a GitHub Release with the artifacts attached.
+
+Cut a release:
+
+```powershell
+git tag v0.1.0
+git push origin v0.1.0
+```
+
+---
+
+## Two ways to read the code (same logic, your choice)
+
+| | What | When to read it |
+|---|---|---|
+| **Single file** | `finagent_single.py` | You want to understand or share the *whole thing* top-to-bottom. The 7 stages appear in pipeline order, each in its own clearly-marked section. |
+| **Package** | `finagent/` (one file per stage) | You're maintaining or extending it. Smaller files, one responsibility each. |
+
+Both produce identical results. The package is the source of truth; the single
+file is a flattened, portable build of it.
+
+> 📖 **Want the full, plain-language walkthrough of every function and why it
+> works the way it does?** See [`ARCHITECTURE.md`](ARCHITECTURE.md). It's the
+> guide to use when explaining this project to anyone.
+
+### Package map
+
+| File | Stage | Job |
+|---|---|---|
+| `finagent/profiler.py` | 1 | Per-page: text quality, size, orientation (fast, via pypdf) |
+| `finagent/geometry.py` | 2 | Split two-up A3 sheets into single logical pages |
+| `finagent/locator.py` | 3 | Find + classify statement pages (consolidated BS / PL / CF) |
+| `finagent/extractors/geometric.py` | 4 | Line-based extraction, borderless-table safe (pdfplumber) |
+| `finagent/normalizer.py`| 5 | Parse numbers, strip note columns, fuzzy-match labels to schema |
+| `finagent/validator.py` | 6 | Voting + accounting identities + cross-statement ties |
+| `finagent/deriver.py` | 6b | Fill numbers the report omitted but the identities fix exactly |
+| `finagent/writer.py` | 7 | Excel with value, status, page citation, matched label, checks |
+| `finagent/schema.py` | — | The canonical metric vocabulary + label synonyms |
+| `finagent/pipeline.py` | — | Glue that runs stages 1→7 in order |
+
+---
+
+## What's in this folder
+
+```
+finagent_single.py the whole agent in one file (start here to understand it)
+finagent/ the same logic as a package, one file per stage
+schema is shared ── the vocabulary every stage speaks
+benchmark.py scorecard over every test PDF
+golden_check.py checks extracted values against hand-verified answers
+render_pages.py renders statement pages to PNGs (for eyeballing)
+requirements.txt dependencies: pypdf, pdfplumber, rapidfuzz, openpyxl
+test_pdfs/ 12 deliberately-different real annual reports
+golden/ hand-verified correct answers for grading
+output/ generated Excel files + rendered pages
+scratch/ throwaway debug scripts (safe to ignore)
+graphify-out/ generated knowledge-graph cache (safe to ignore)
+```
+
+### The test set (chosen to break things on purpose)
+
+8+ deliberately-different reports in `test_pdfs/`: Adani (683 pages), Airtel
+(A3 two-up layout), BMW (landscape, in EUR), HDFC (a bank, different schema),
+Newgen (small-cap), Reliance (A3 two-up), TCS (clean baseline), Wilmar
+(foreign report). If it works on all of these, it generalises.
+
+---
+
+## Roadmap (improve only what the benchmark proves)
+
+1. ✅ Walking skeleton: all stages thin but connected
+2. ✅ Geometry fixer: two-up A3 page split (Airtel, Reliance)
+3. ✅ IFRS locator vocabulary ("profit OR loss", comprehensive-income cues)
+4. Second extractor (Docling/TableFormer) + cross-extractor voting
+5. Unit detection (crores/lakhs/millions) + vision-LLM third voter
+6. Bank/NBFC schema variant (HDFC), OCR path for scanned pages
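Reviewer note on the hunk above: the embedded README says every extracted value is checked against accounting identities and then labelled VERIFIED or FLAGGED. The sketch below illustrates that idea for the first identity only (Assets = Liabilities + Equity). It is not taken from `finagent/validator.py`; the function name `check_balance_sheet`, the dictionary keys, and the `tolerance` parameter are all assumptions made for illustration.

```python
from decimal import Decimal

# Status labels borrowed from the README's vocabulary (PROBABLE is omitted here).
VERIFIED, FLAGGED, MISSING = "VERIFIED", "FLAGGED", "MISSING"


def check_balance_sheet(values, tolerance=Decimal("1")):
    """Test Assets = Liabilities + Equity on a dict of extracted values.

    `values` maps hypothetical metric names to Decimal amounts, or to None
    when a value was not found. `tolerance` is an assumed allowance for
    rounding in the printed report.
    """
    needed = ("total_assets", "total_liabilities", "total_equity")
    if any(values.get(name) is None for name in needed):
        return MISSING
    gap = values["total_assets"] - (
        values["total_liabilities"] + values["total_equity"]
    )
    return VERIFIED if abs(gap) <= tolerance else FLAGGED


good = {
    "total_assets": Decimal("1500"),
    "total_liabilities": Decimal("900"),
    "total_equity": Decimal("600"),
}
bad = dict(good, total_assets=Decimal("1550"))  # simulates a misread digit
partial = dict(good, total_equity=None)         # simulates a value not found

print(check_balance_sheet(good))
print(check_balance_sheet(bad))
print(check_balance_sheet(partial))
```

`Decimal` is used rather than `float` so that the comparison is exact for printed figures; whether the package itself does this is not visible in this hunk.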
@@ -0,0 +1,201 @@
+# Financial PDF Extraction Agent (v2)
+
+[](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml)
+[](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
+
+Give it any company's annual-report PDF. It finds the financial statements,
+pulls out the key numbers, **proves each one is correct**, and writes a tidy
+Excel where every value carries its own receipt (page number, the exact label
+it came from, and which checks it passed).
+
+---
+
+## The one idea behind it
+
+> **Accuracy comes from verification, not extraction.**
+
+Reading numbers off a PDF is easy to get *almost* right and very hard to get
+*exactly* right — tables are borderless, layouts differ per company, a "5" in
+the wrong column ruins everything. So this tool doesn't trust what it reads.
+
+Financial statements are **self-verifying**. They obey rules:
+- Assets = Liabilities + Equity
+- Current + Non-current = Total
+- The closing cash in the cash-flow statement equals cash on the balance sheet
+
+So every extracted number is checked against these rules. A value that passes a
+rule is **VERIFIED**. A value that breaks one is **FLAGGED** for a human. We
+never hand over a confident-but-wrong number.
+
+---
+
+## How it works — 7 stages, like an assembly line
+
+```
+PDF
+│
+▼
+┌─────────────┐ What kind of PDF is this? Page sizes, orientation,
+│ 1 PROFILE │ and whether each page has readable text.
+└─────────────┘
+│
+▼
+┌─────────────┐ Some reports print two pages side-by-side on one wide
+│ 2 GEOMETRY │ sheet. Split those back into single logical pages.
+└─────────────┘
+│
+▼
+┌─────────────┐ Which pages ARE the Balance Sheet / P&L / Cash Flow?
+│ 3 LOCATE │ (and prefer the consolidated version)
+└─────────────┘
+│
+▼
+┌─────────────┐ Read the chosen pages line by line: "label … numbers".
+│ 4 EXTRACT │ Borderless-table safe — rebuilds rows from coordinates.
+└─────────────┘
+│
+▼
+┌─────────────┐ Clean the numbers and match each label to our standard
+│ 5 NORMALIZE │ vocabulary ("Turnover" and "Net sales" both -> revenue).
+└─────────────┘
+│
+▼
+┌─────────────┐ Prove the numbers using accounting identities.
+│ 6 VALIDATE │ -> VERIFIED / PROBABLE / FLAGGED / MISSING
+└─────────────┘
+│
+▼
+┌─────────────┐ A number the report never printed but the identities
+│ 6b DERIVE │ pin down exactly? Compute it. Mark it DERIVED.
+└─────────────┘
+│
+▼
+┌─────────────┐ Excel: value + status + page + matched label + checks.
+│ 7 WRITE │ Colour-coded so you see trust level at a glance.
+└─────────────┘
+│
+▼
+metrics.xlsx
+```
+
+**Status colours in the Excel:** 🟢 VERIFIED · 🟡 PROBABLE · 🔴 FLAGGED ·
+🔵 DERIVED · ⬜ MISSING.
+
+---
+
+## Run it
+
+```powershell
+# one report
+.venv\Scripts\python.exe finagent_single.py test_pdfs\TCS_2024-2025.pdf
+
+# scorecard across all 12 test reports
+.venv\Scripts\python.exe benchmark.py
+```
+
+Output lands in `output\<Company>_metrics.xlsx`. All commands use the project
+venv (`.venv`). First-time setup: `pip install -r requirements.txt`.
+
+---
+
+## Install a released build
+
+The project is packaged as a wheel and published on a version tag. Grab a
+build from the [Releases page](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
+and install it:
+
+```powershell
+pip install finagent-0.1.0-py3-none-any.whl
+```
+
+(TestPyPI publishing via OIDC trusted publishing is being wired up — once live,
+`pip install -i https://test.pypi.org/simple/ finagent` will work directly.)
+
+---
+
+## CI/CD
+
+This repo doubles as a hands-on CI/CD build-out (tracked level-by-level in
+[`WORKLOG.md`](WORKLOG.md)):
+
+- **CI** (`.github/workflows/ci.yml`) — every push and PR runs `ruff` + `pytest`
+across Python 3.11 / 3.12 / 3.13 in parallel, with pip caching. A single
+`all-green` check gates merges; `main` is branch-protected, so broken code
+cannot merge.
+- **CD** (`.github/workflows/release.yml`) — pushing a `vX.Y.Z` tag builds the
+wheel + sdist and publishes a GitHub Release with the artifacts attached.
+
+Cut a release:
+
+```powershell
+git tag v0.1.0
+git push origin v0.1.0
+```
+
+---
+
+## Two ways to read the code (same logic, your choice)
+
+| | What | When to read it |
+|---|---|---|
+| **Single file** | `finagent_single.py` | You want to understand or share the *whole thing* top-to-bottom. The 7 stages appear in pipeline order, each in its own clearly-marked section. |
+| **Package** | `finagent/` (one file per stage) | You're maintaining or extending it. Smaller files, one responsibility each. |
+
+Both produce identical results. The package is the source of truth; the single
+file is a flattened, portable build of it.
+
+> 📖 **Want the full, plain-language walkthrough of every function and why it
+> works the way it does?** See [`ARCHITECTURE.md`](ARCHITECTURE.md). It's the
+> guide to use when explaining this project to anyone.
+
+### Package map
+
+| File | Stage | Job |
+|---|---|---|
+| `finagent/profiler.py` | 1 | Per-page: text quality, size, orientation (fast, via pypdf) |
+| `finagent/geometry.py` | 2 | Split two-up A3 sheets into single logical pages |
+| `finagent/locator.py` | 3 | Find + classify statement pages (consolidated BS / PL / CF) |
+| `finagent/extractors/geometric.py` | 4 | Line-based extraction, borderless-table safe (pdfplumber) |
+| `finagent/normalizer.py`| 5 | Parse numbers, strip note columns, fuzzy-match labels to schema |
+| `finagent/validator.py` | 6 | Voting + accounting identities + cross-statement ties |
+| `finagent/deriver.py` | 6b | Fill numbers the report omitted but the identities fix exactly |
+| `finagent/writer.py` | 7 | Excel with value, status, page citation, matched label, checks |
+| `finagent/schema.py` | — | The canonical metric vocabulary + label synonyms |
+| `finagent/pipeline.py` | — | Glue that runs stages 1→7 in order |
+
+---
+
+## What's in this folder
+
+```
+finagent_single.py the whole agent in one file (start here to understand it)
+finagent/ the same logic as a package, one file per stage
+schema is shared ── the vocabulary every stage speaks
+benchmark.py scorecard over every test PDF
+golden_check.py checks extracted values against hand-verified answers
+render_pages.py renders statement pages to PNGs (for eyeballing)
+requirements.txt dependencies: pypdf, pdfplumber, rapidfuzz, openpyxl
+test_pdfs/ 12 deliberately-different real annual reports
+golden/ hand-verified correct answers for grading
+output/ generated Excel files + rendered pages
+scratch/ throwaway debug scripts (safe to ignore)
+graphify-out/ generated knowledge-graph cache (safe to ignore)
+```
+
+### The test set (chosen to break things on purpose)
+
+8+ deliberately-different reports in `test_pdfs/`: Adani (683 pages), Airtel
+(A3 two-up layout), BMW (landscape, in EUR), HDFC (a bank, different schema),
+Newgen (small-cap), Reliance (A3 two-up), TCS (clean baseline), Wilmar
+(foreign report). If it works on all of these, it generalises.
+
+---
+
+## Roadmap (improve only what the benchmark proves)
+
+1. ✅ Walking skeleton: all stages thin but connected
+2. ✅ Geometry fixer: two-up A3 page split (Airtel, Reliance)
+3. ✅ IFRS locator vocabulary ("profit OR loss", comprehensive-income cues)
+4. Second extractor (Docling/TableFormer) + cross-extractor voting
+5. Unit detection (crores/lakhs/millions) + vision-LLM third voter
+6. Bank/NBFC schema variant (HDFC), OCR path for scanned pages
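Reviewer note on the hunk above: stage 5 (NORMALIZE) is described as cleaning numbers and fuzzy-matching each label to a standard vocabulary, so that "Turnover" and "Net sales" both map to revenue. The following is a rough sketch of that step under stated assumptions. The package declares `rapidfuzz` as a dependency; this sketch deliberately swaps in the standard library's `difflib` so that it runs without third-party installs. The `SYNONYMS` table, the function names, and the `cutoff` value are hypothetical and not copied from `finagent/normalizer.py` or `finagent/schema.py`.

```python
import difflib
from decimal import Decimal, InvalidOperation

# Illustrative synonym table only; the real vocabulary is not shown in this diff.
SYNONYMS = {
    "turnover": "revenue",
    "net sales": "revenue",
    "revenue from operations": "revenue",
    "total assets": "total_assets",
}


def parse_amount(text):
    """Turn a printed figure such as "1,234.50" or "(1,234)" into a Decimal.

    Parentheses are read as a negative sign, a common accounting convention.
    Returns None when the text is not a finite number.
    """
    cleaned = text.strip().replace(",", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not amount.is_finite():
        return None
    return -amount if negative else amount


def match_label(label, cutoff=0.8):
    """Map a printed row label to a canonical metric name, or None."""
    key = " ".join(label.lower().split())  # lowercase, collapse whitespace
    hits = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[hits[0]] if hits else None


print(parse_amount("(1,234)"), match_label("Net Sales"))
```

A high cutoff is chosen here so that an unrelated label yields `None` instead of a wrong match, which fits the README's stated preference for flagging over guessing; the threshold the package actually uses is not visible in this diff.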
@@ -0,0 +1,212 @@
+Metadata-Version: 2.4
+Name: akshu-finagent
+Version: 0.2.0
+Summary: Financial statement PDF extraction agent
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+Requires-Dist: pdfplumber>=0.11
+Requires-Dist: pypdf>=6.0
+Requires-Dist: openpyxl>=3.1
+Requires-Dist: rapidfuzz>=3.14
+
+# Financial PDF Extraction Agent (v2)
+
+[](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml)
+[](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
+
+Give it any company's annual-report PDF. It finds the financial statements,
+pulls out the key numbers, **proves each one is correct**, and writes a tidy
+Excel where every value carries its own receipt (page number, the exact label
+it came from, and which checks it passed).
+
+---
+
+## The one idea behind it
+
+> **Accuracy comes from verification, not extraction.**
+
+Reading numbers off a PDF is easy to get *almost* right and very hard to get
+*exactly* right — tables are borderless, layouts differ per company, a "5" in
+the wrong column ruins everything. So this tool doesn't trust what it reads.
+
+Financial statements are **self-verifying**. They obey rules:
+- Assets = Liabilities + Equity
+- Current + Non-current = Total
+- The closing cash in the cash-flow statement equals cash on the balance sheet
+
+So every extracted number is checked against these rules. A value that passes a
+rule is **VERIFIED**. A value that breaks one is **FLAGGED** for a human. We
+never hand over a confident-but-wrong number.
+
+---
+
+## How it works — 7 stages, like an assembly line
+
+```
+PDF
+│
+▼
+┌─────────────┐ What kind of PDF is this? Page sizes, orientation,
+│ 1 PROFILE │ and whether each page has readable text.
+└─────────────┘
+│
+▼
+┌─────────────┐ Some reports print two pages side-by-side on one wide
+│ 2 GEOMETRY │ sheet. Split those back into single logical pages.
+└─────────────┘
+│
+▼
+┌─────────────┐ Which pages ARE the Balance Sheet / P&L / Cash Flow?
+│ 3 LOCATE │ (and prefer the consolidated version)
+└─────────────┘
+│
+▼
+┌─────────────┐ Read the chosen pages line by line: "label … numbers".
+│ 4 EXTRACT │ Borderless-table safe — rebuilds rows from coordinates.
+└─────────────┘
+│
+▼
+┌─────────────┐ Clean the numbers and match each label to our standard
+│ 5 NORMALIZE │ vocabulary ("Turnover" and "Net sales" both -> revenue).
+└─────────────┘
+│
+▼
+┌─────────────┐ Prove the numbers using accounting identities.
+│ 6 VALIDATE │ -> VERIFIED / PROBABLE / FLAGGED / MISSING
+└─────────────┘
+│
+▼
+┌─────────────┐ A number the report never printed but the identities
+│ 6b DERIVE │ pin down exactly? Compute it. Mark it DERIVED.
+└─────────────┘
+│
+▼
+┌─────────────┐ Excel: value + status + page + matched label + checks.
+│ 7 WRITE │ Colour-coded so you see trust level at a glance.
+└─────────────┘
+│
+▼
+metrics.xlsx
+```
+
+**Status colours in the Excel:** 🟢 VERIFIED · 🟡 PROBABLE · 🔴 FLAGGED ·
+🔵 DERIVED · ⬜ MISSING.
+
+---
+
+## Run it
+
+```powershell
+# one report
+.venv\Scripts\python.exe finagent_single.py test_pdfs\TCS_2024-2025.pdf
+
+# scorecard across all 12 test reports
+.venv\Scripts\python.exe benchmark.py
+```
+
+Output lands in `output\<Company>_metrics.xlsx`. All commands use the project
+venv (`.venv`). First-time setup: `pip install -r requirements.txt`.
+
+---
+
+## Install a released build
+
+The project is packaged as a wheel and published on a version tag. Grab a
+build from the [Releases page](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
+and install it:
+
+```powershell
+pip install finagent-0.1.0-py3-none-any.whl
+```
+
+(TestPyPI publishing via OIDC trusted publishing is being wired up — once live,
+`pip install -i https://test.pypi.org/simple/ finagent` will work directly.)
+
+---
+
+## CI/CD
+
+This repo doubles as a hands-on CI/CD build-out (tracked level-by-level in
+[`WORKLOG.md`](WORKLOG.md)):
+
+- **CI** (`.github/workflows/ci.yml`) — every push and PR runs `ruff` + `pytest`
+across Python 3.11 / 3.12 / 3.13 in parallel, with pip caching. A single
+`all-green` check gates merges; `main` is branch-protected, so broken code
+cannot merge.
+- **CD** (`.github/workflows/release.yml`) — pushing a `vX.Y.Z` tag builds the
+wheel + sdist and publishes a GitHub Release with the artifacts attached.
+
+Cut a release:
+
+```powershell
+git tag v0.1.0
+git push origin v0.1.0
+```
+
+---
+
+## Two ways to read the code (same logic, your choice)
+
+| | What | When to read it |
+|---|---|---|
+| **Single file** | `finagent_single.py` | You want to understand or share the *whole thing* top-to-bottom. The 7 stages appear in pipeline order, each in its own clearly-marked section. |
+| **Package** | `finagent/` (one file per stage) | You're maintaining or extending it. Smaller files, one responsibility each. |
+
+Both produce identical results. The package is the source of truth; the single
+file is a flattened, portable build of it.
+
+> 📖 **Want the full, plain-language walkthrough of every function and why it
+> works the way it does?** See [`ARCHITECTURE.md`](ARCHITECTURE.md). It's the
+> guide to use when explaining this project to anyone.
+
+### Package map
+
+| File | Stage | Job |
+|---|---|---|
+| `finagent/profiler.py` | 1 | Per-page: text quality, size, orientation (fast, via pypdf) |
+| `finagent/geometry.py` | 2 | Split two-up A3 sheets into single logical pages |
+| `finagent/locator.py` | 3 | Find + classify statement pages (consolidated BS / PL / CF) |
+| `finagent/extractors/geometric.py` | 4 | Line-based extraction, borderless-table safe (pdfplumber) |
+| `finagent/normalizer.py`| 5 | Parse numbers, strip note columns, fuzzy-match labels to schema |
+| `finagent/validator.py` | 6 | Voting + accounting identities + cross-statement ties |
+| `finagent/deriver.py` | 6b | Fill numbers the report omitted but the identities fix exactly |
+| `finagent/writer.py` | 7 | Excel with value, status, page citation, matched label, checks |
+| `finagent/schema.py` | — | The canonical metric vocabulary + label synonyms |
+| `finagent/pipeline.py` | — | Glue that runs stages 1→7 in order |
+
+---
+
+## What's in this folder
+
+```
+finagent_single.py the whole agent in one file (start here to understand it)
+finagent/ the same logic as a package, one file per stage
+schema is shared ── the vocabulary every stage speaks
+benchmark.py scorecard over every test PDF
+golden_check.py checks extracted values against hand-verified answers
+render_pages.py renders statement pages to PNGs (for eyeballing)
+requirements.txt dependencies: pypdf, pdfplumber, rapidfuzz, openpyxl
+test_pdfs/ 12 deliberately-different real annual reports
+golden/ hand-verified correct answers for grading
+output/ generated Excel files + rendered pages
+scratch/ throwaway debug scripts (safe to ignore)
+graphify-out/ generated knowledge-graph cache (safe to ignore)
+```
+
+### The test set (chosen to break things on purpose)
+
+8+ deliberately-different reports in `test_pdfs/`: Adani (683 pages), Airtel
+(A3 two-up layout), BMW (landscape, in EUR), HDFC (a bank, different schema),
+Newgen (small-cap), Reliance (A3 two-up), TCS (clean baseline), Wilmar
+(foreign report). If it works on all of these, it generalises.
+
+---
+
+## Roadmap (improve only what the benchmark proves)
+
+1. ✅ Walking skeleton: all stages thin but connected
+2. ✅ Geometry fixer: two-up A3 page split (Airtel, Reliance)
+3. ✅ IFRS locator vocabulary ("profit OR loss", comprehensive-income cues)
+4. Second extractor (Docling/TableFormer) + cross-extractor voting
+5. Unit detection (crores/lakhs/millions) + vision-LLM third voter
+6. Bank/NBFC schema variant (HDFC), OCR path for scanned pages
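Reviewer note on the hunk above: stage 6b (DERIVE) is described as computing a number the report never printed when the identities pin it down exactly, and marking it DERIVED. One way to read that, sketched here for Assets = Liabilities + Equity only: if exactly one of the three terms is missing, solve for it; otherwise derive nothing. The function name `derive_missing` and its return shape are assumptions for illustration and are not taken from `finagent/deriver.py`.

```python
from decimal import Decimal


def derive_missing(assets, liabilities, equity):
    """Fill the single missing term of Assets = Liabilities + Equity.

    Each argument is a Decimal, or None when the report did not print it.
    Returns (assets, liabilities, equity, derived_name). derived_name is None
    when nothing was derived: either all three terms were present, or more
    than one was missing and the identity cannot pin a value down.
    """
    terms = (("assets", assets), ("liabilities", liabilities), ("equity", equity))
    missing = [name for name, value in terms if value is None]
    if len(missing) != 1:
        return assets, liabilities, equity, None
    if assets is None:
        assets = liabilities + equity
    elif liabilities is None:
        liabilities = assets - equity
    else:
        equity = assets - liabilities
    return assets, liabilities, equity, missing[0]


result = derive_missing(Decimal("1500"), None, Decimal("600"))
print(result)
```

The caller would tag the term named in `derived_name` as DERIVED so that it stays distinguishable from values read off the page, in line with the separate status colour the README lists for derived values.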
@@ -0,0 +1,21 @@
+README.md
+finagent_single.py
+pyproject.toml
+akshu_finagent.egg-info/PKG-INFO
+akshu_finagent.egg-info/SOURCES.txt
+akshu_finagent.egg-info/dependency_links.txt
+akshu_finagent.egg-info/requires.txt
+akshu_finagent.egg-info/top_level.txt
+finagent/__init__.py
+finagent/deriver.py
+finagent/geometry.py
+finagent/locator.py
+finagent/normalizer.py
+finagent/pipeline.py
+finagent/profiler.py
+finagent/schema.py
+finagent/validator.py
+finagent/writer.py
+finagent/extractors/__init__.py
+finagent/extractors/geometric.py
+tests/test_core.py
@@ -0,0 +1 @@
+
File without changes