akshu-finagent 0.2.0__tar.gz

@@ -0,0 +1,212 @@
1
+ Metadata-Version: 2.4
2
+ Name: akshu-finagent
3
+ Version: 0.2.0
4
+ Summary: Financial statement PDF extraction agent
5
+ Requires-Python: >=3.11
6
+ Description-Content-Type: text/markdown
7
+ Requires-Dist: pdfplumber>=0.11
8
+ Requires-Dist: pypdf>=6.0
9
+ Requires-Dist: openpyxl>=3.1
10
+ Requires-Dist: rapidfuzz>=3.14
11
+
12
+ # Financial PDF Extraction Agent (v2)
13
+
14
+ [![CI](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml/badge.svg)](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml)
15
+ [![Release](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/release.yml/badge.svg)](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
16
+
17
+ Give it any company's annual-report PDF. It finds the financial statements,
18
+ pulls out the key numbers, **proves each one is correct**, and writes a tidy
19
+ Excel where every value carries its own receipt (page number, the exact label
20
+ it came from, and which checks it passed).
21
+
22
+ ---
23
+
24
+ ## The one idea behind it
25
+
26
+ > **Accuracy comes from verification, not extraction.**
27
+
28
+ Reading numbers off a PDF is easy to get *almost* right and very hard to get
29
+ *exactly* right — tables are borderless, layouts differ per company, a "5" in
30
+ the wrong column ruins everything. So this tool doesn't trust what it reads.
31
+
32
+ Financial statements are **self-verifying**. They obey rules:
33
+ - Assets = Liabilities + Equity
34
+ - Current + Non-current = Total
35
+ - The closing cash in the cash-flow statement equals cash on the balance sheet
36
+
37
+ So every extracted number is checked against these rules. A value that passes a
38
+ rule is **VERIFIED**. A value that breaks one is **FLAGGED** for a human. We
39
+ never hand over a confident-but-wrong number.
40
+
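The verification idea above can be sketched in a few lines of Python. This is an illustrative sketch only, not the contents of `finagent/validator.py`: the metric keys (`total_assets`, `cf_closing_cash`, and so on), the `RULES` table, the 0.1% tolerance, and the choice to label values with no applicable rule as PROBABLE are all assumptions made for the example.

```python
def close_enough(a: float, b: float, rel_tol: float = 0.001) -> bool:
    """True when a and b agree within a relative tolerance (reports round their figures)."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1.0)


# Hypothetical rule table: (description, metric keys used, predicate over those values).
RULES = [
    ("assets = liabilities + equity",
     ("total_assets", "total_liabilities", "total_equity"),
     lambda a, l, e: close_enough(a, l + e)),
    ("current + non-current = total assets",
     ("current_assets", "non_current_assets", "total_assets"),
     lambda c, n, t: close_enough(c + n, t)),
    ("closing cash (cash flow) = cash (balance sheet)",
     ("cf_closing_cash", "bs_cash"),
     lambda cf, bs: close_enough(cf, bs)),
]


def classify(metrics: dict) -> dict:
    """Give every metric a status based on the rules it takes part in."""
    # Assumption: a value that no rule could test starts (and stays) PROBABLE.
    statuses = {key: "PROBABLE" for key in metrics}
    for _name, keys, rule in RULES:
        if not all(k in metrics for k in keys):
            continue  # an input is missing, so this rule cannot be evaluated
        ok = rule(*(metrics[k] for k in keys))
        for k in keys:
            if not ok:
                statuses[k] = "FLAGGED"      # one broken rule is enough to flag
            elif statuses[k] != "FLAGGED":
                statuses[k] = "VERIFIED"     # a pass never overrides an earlier flag
    return statuses


example = {
    "total_assets": 1000.0, "total_liabilities": 600.0, "total_equity": 400.0,
    "cf_closing_cash": 50.0, "bs_cash": 55.0, "revenue": 2000.0,
}
result = classify(example)
```

In this example the balance-sheet values come out VERIFIED, the two cash figures are FLAGGED because they disagree, and `revenue` stays PROBABLE because none of the three rules touches it.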
41
+ ---
42
+
43
+ ## How it works — 7 stages, like an assembly line
44
+
45
+ ```
46
+ PDF
47
+
48
+
49
+ ┌─────────────┐ What kind of PDF is this? Page sizes, orientation,
50
+ │ 1 PROFILE │ and whether each page has readable text.
51
+ └─────────────┘
52
+
53
+
54
+ ┌─────────────┐ Some reports print two pages side-by-side on one wide
55
+ │ 2 GEOMETRY │ sheet. Split those back into single logical pages.
56
+ └─────────────┘
57
+
58
+
59
+ ┌─────────────┐ Which pages ARE the Balance Sheet / P&L / Cash Flow?
60
+ │ 3 LOCATE │ (and prefer the consolidated version)
61
+ └─────────────┘
62
+
63
+
64
+ ┌─────────────┐ Read the chosen pages line by line: "label … numbers".
65
+ │ 4 EXTRACT │ Borderless-table safe — rebuilds rows from coordinates.
66
+ └─────────────┘
67
+
68
+
69
+ ┌─────────────┐ Clean the numbers and match each label to our standard
70
+ │ 5 NORMALIZE │ vocabulary ("Turnover" and "Net sales" both -> revenue).
71
+ └─────────────┘
72
+
73
+
74
+ ┌─────────────┐ Prove the numbers using accounting identities.
75
+ │ 6 VALIDATE │ -> VERIFIED / PROBABLE / FLAGGED / MISSING
76
+ └─────────────┘
77
+
78
+
79
+ ┌─────────────┐ A number the report never printed but the identities
80
+ │ 6b DERIVE │ pin down exactly? Compute it. Mark it DERIVED.
81
+ └─────────────┘
82
+
83
+
84
+ ┌─────────────┐ Excel: value + status + page + matched label + checks.
85
+ │ 7 WRITE │ Colour-coded so you see trust level at a glance.
86
+ └─────────────┘
87
+
88
+
89
+ metrics.xlsx
90
+ ```
91
+
92
+ **Status colours in the Excel:** 🟢 VERIFIED · 🟡 PROBABLE · 🔴 FLAGGED ·
93
+ 🔵 DERIVED · ⬜ MISSING.
94
+
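Stage 4's "rebuilds rows from coordinates" can be illustrated with a small sketch. It assumes each word arrives as a dict with `text`, `x0` and `top` keys (the shape `pdfplumber`'s `extract_words()` returns, among other keys); the function name `rebuild_rows` and the 3-point tolerance are hypothetical, and the real extractor may group rows differently.

```python
def rebuild_rows(words, y_tol=3.0):
    """Group word boxes into visual rows by vertical position, then read each row left to right.

    No table borders are needed: two words belong to the same row when their
    `top` coordinates differ by at most `y_tol` points from the row's first word.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1]["top"]) <= y_tol:
            rows[-1]["words"].append(word)
        else:
            rows.append({"top": word["top"], "words": [word]})
    return [
        " ".join(w["text"] for w in sorted(row["words"], key=lambda w: w["x0"]))
        for row in rows
    ]


# Hand-made word boxes standing in for one page of extracted words.
sample_words = [
    {"text": "Total", "x0": 10, "top": 100.0},
    {"text": "1,234", "x0": 300, "top": 101.2},
    {"text": "assets", "x0": 45, "top": 100.4},
    {"text": "Revenue", "x0": 10, "top": 80.0},
    {"text": "5,678", "x0": 300, "top": 80.9},
]
lines = rebuild_rows(sample_words)
```

Here `lines` is `["Revenue 5,678", "Total assets 1,234"]`: the small vertical jitter between a label and its number is absorbed by the tolerance.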
95
+ ---
96
+
97
+ ## Run it
98
+
99
+ ```powershell
100
+ # one report
101
+ .venv\Scripts\python.exe finagent_single.py test_pdfs\TCS_2024-2025.pdf
102
+
103
+ # scorecard across all 12 test reports
104
+ .venv\Scripts\python.exe benchmark.py
105
+ ```
106
+
107
+ Output lands in `output\<Company>_metrics.xlsx`. All commands use the project
108
+ venv (`.venv`). First-time setup: `pip install -r requirements.txt`.
109
+
110
+ ---
111
+
112
+ ## Install a released build
113
+
114
+ The project is packaged as a wheel and published on a version tag. Grab a
115
+ build from the [Releases page](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
116
+ and install it:
117
+
118
+ ```powershell
119
+ pip install akshu_finagent-0.2.0-py3-none-any.whl
120
+ ```
121
+
122
+ (TestPyPI publishing via OIDC trusted publishing is being wired up — once live,
123
+ `pip install -i https://test.pypi.org/simple/ akshu-finagent` will work directly.)
124
+
125
+ ---
126
+
127
+ ## CI/CD
128
+
129
+ This repo doubles as a hands-on CI/CD build-out (tracked level-by-level in
130
+ [`WORKLOG.md`](WORKLOG.md)):
131
+
132
+ - **CI** (`.github/workflows/ci.yml`) — every push and PR runs `ruff` + `pytest`
133
+ across Python 3.11 / 3.12 / 3.13 in parallel, with pip caching. A single
134
+ `all-green` check gates merges; `main` is branch-protected, so broken code
135
+ cannot merge.
136
+ - **CD** (`.github/workflows/release.yml`) — pushing a `vX.Y.Z` tag builds the
137
+ wheel + sdist and publishes a GitHub Release with the artifacts attached.
138
+
139
+ Cut a release:
140
+
141
+ ```powershell
142
+ git tag v0.1.0
143
+ git push origin v0.1.0
144
+ ```
145
+
146
+ ---
147
+
148
+ ## Two ways to read the code (same logic, your choice)
149
+
150
+ | | What | When to read it |
151
+ |---|---|---|
152
+ | **Single file** | `finagent_single.py` | You want to understand or share the *whole thing* top-to-bottom. The 7 stages appear in pipeline order, each in its own clearly-marked section. |
153
+ | **Package** | `finagent/` (one file per stage) | You're maintaining or extending it. Smaller files, one responsibility each. |
154
+
155
+ Both produce identical results. The package is the source of truth; the single
156
+ file is a flattened, portable build of it.
157
+
158
+ > 📖 **Want the full, plain-language walkthrough of every function and why it
159
+ > works the way it does?** See [`ARCHITECTURE.md`](ARCHITECTURE.md). It's the
160
+ > guide to use when explaining this project to anyone.
161
+
162
+ ### Package map
163
+
164
+ | File | Stage | Job |
165
+ |---|---|---|
166
+ | `finagent/profiler.py` | 1 | Per-page: text quality, size, orientation (fast, via pypdf) |
167
+ | `finagent/geometry.py` | 2 | Split two-up A3 sheets into single logical pages |
168
+ | `finagent/locator.py` | 3 | Find + classify statement pages (consolidated BS / PL / CF) |
169
+ | `finagent/extractors/geometric.py` | 4 | Line-based extraction, borderless-table safe (pdfplumber) |
170
+ | `finagent/normalizer.py`| 5 | Parse numbers, strip note columns, fuzzy-match labels to schema |
171
+ | `finagent/validator.py` | 6 | Voting + accounting identities + cross-statement ties |
172
+ | `finagent/deriver.py` | 6b | Fill numbers the report omitted but the identities fix exactly |
173
+ | `finagent/writer.py` | 7 | Excel with value, status, page citation, matched label, checks |
174
+ | `finagent/schema.py` | — | The canonical metric vocabulary + label synonyms |
175
+ | `finagent/pipeline.py` | — | Glue that runs stages 1→7 in order |
176
+
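As a rough illustration of the stage 5 row in the package map (parse numbers, match labels to the schema), here is a standard-library sketch. The package itself depends on `rapidfuzz`; this example swaps in `difflib.SequenceMatcher` so it runs without extra installs. The `SYNONYMS` table, the function names and the 0.8 cutoff are hypothetical and not taken from `finagent/schema.py` or `finagent/normalizer.py`.

```python
import difflib
import re
from typing import Optional

# Hypothetical slice of a canonical vocabulary: metric key -> accepted label synonyms.
SYNONYMS = {
    "revenue": ["revenue from operations", "turnover", "net sales"],
    "total_assets": ["total assets"],
}


def parse_number(text: str) -> Optional[float]:
    """Parse '1,234.5' or '(1,234)' (accounting negative) into a float; None if not numeric."""
    s = text.strip().replace(",", "")
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    if not re.fullmatch(r"-?\d+(\.\d+)?", s):
        return None
    value = float(s)
    return -value if negative else value


def match_label(label: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the metric key whose synonym is most similar to `label`, or None below the cutoff."""
    cleaned = re.sub(r"[^a-z ]", "", label.lower()).strip()
    best_key, best_score = None, cutoff
    for key, synonyms in SYNONYMS.items():
        for synonym in synonyms:
            score = difflib.SequenceMatcher(None, cleaned, synonym).ratio()
            if score >= best_score:
                best_key, best_score = key, score
    return best_key
```

With this table, `match_label("Turnover")` and `match_label("Net Sales")` both resolve to `revenue`, while an unrelated label such as `"Directors' report"` returns `None` rather than a forced match.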
177
+ ---
178
+
179
+ ## What's in this folder
180
+
181
+ ```
182
+ finagent_single.py the whole agent in one file (start here to understand it)
183
+ finagent/ the same logic as a package, one file per stage
184
+ schema is shared ── the vocabulary every stage speaks
185
+ benchmark.py scorecard over every test PDF
186
+ golden_check.py checks extracted values against hand-verified answers
187
+ render_pages.py renders statement pages to PNGs (for eyeballing)
188
+ requirements.txt dependencies: pypdf, pdfplumber, rapidfuzz, openpyxl
189
+ test_pdfs/ 12 deliberately-different real annual reports
190
+ golden/ hand-verified correct answers for grading
191
+ output/ generated Excel files + rendered pages
192
+ scratch/ throwaway debug scripts (safe to ignore)
193
+ graphify-out/ generated knowledge-graph cache (safe to ignore)
194
+ ```
195
+
196
+ ### The test set (chosen to break things on purpose)
197
+
198
+ 12 deliberately different reports in `test_pdfs/`, including Adani (683 pages), Airtel
199
+ (A3 two-up layout), BMW (landscape, in EUR), HDFC (a bank, different schema),
200
+ Newgen (small-cap), Reliance (A3 two-up), TCS (clean baseline), Wilmar
201
+ (foreign report). If it works on all of these, it generalises.
202
+
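The test set mixes two-up A3 sheets (Airtel, Reliance) with a genuinely landscape report (BMW), so aspect ratio alone cannot tell them apart. The sketch below shows one possible heuristic under stated assumptions: page sizes are in PDF points, and a landscape sheet clearly wider than an A4 long edge is treated as two portrait pages side by side. The function name, the constant and the 1.2 factor are hypothetical; `finagent/geometry.py` may decide differently.

```python
A4_LONG_EDGE_PT = 842.0  # long edge of an A4 page in PDF points (rounded)


def split_two_up(width: float, height: float):
    """Return crop boxes (x0, y0, x1, y1) for the logical pages printed on one sheet.

    Assumption: a landscape sheet noticeably wider than A4's long edge
    (for example A3 landscape, about 1191 x 842 pt) holds two portrait pages.
    A landscape A4 sheet (about 842 x 595 pt) is left as a single page.
    """
    is_two_up = width > height and width > A4_LONG_EDGE_PT * 1.2
    if not is_two_up:
        return [(0.0, 0.0, width, height)]
    mid = width / 2
    return [(0.0, 0.0, mid, height), (mid, 0.0, width, height)]


a3_landscape = split_two_up(1190.55, 841.89)   # two boxes: left half and right half
a4_landscape = split_two_up(841.89, 595.28)    # one box: a true landscape page
a4_portrait = split_two_up(595.28, 841.89)     # one box: an ordinary page
```

Each returned box could then be used as a crop region for the extractor, so later stages see single logical pages regardless of how the sheet was printed.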
203
+ ---
204
+
205
+ ## Roadmap (improve only what the benchmark proves)
206
+
207
+ 1. ✅ Walking skeleton: all stages thin but connected
208
+ 2. ✅ Geometry fixer: two-up A3 page split (Airtel, Reliance)
209
+ 3. ✅ IFRS locator vocabulary ("profit OR loss", comprehensive-income cues)
210
+ 4. Second extractor (Docling/TableFormer) + cross-extractor voting
211
+ 5. Unit detection (crores/lakhs/millions) + vision-LLM third voter
212
+ 6. Bank/NBFC schema variant (HDFC), OCR path for scanned pages
@@ -0,0 +1,201 @@
1
+ # Financial PDF Extraction Agent (v2)
2
+
3
+ [![CI](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml/badge.svg)](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml)
4
+ [![Release](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/release.yml/badge.svg)](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
5
+
6
+ Give it any company's annual-report PDF. It finds the financial statements,
7
+ pulls out the key numbers, **proves each one is correct**, and writes a tidy
8
+ Excel where every value carries its own receipt (page number, the exact label
9
+ it came from, and which checks it passed).
10
+
11
+ ---
12
+
13
+ ## The one idea behind it
14
+
15
+ > **Accuracy comes from verification, not extraction.**
16
+
17
+ Reading numbers off a PDF is easy to get *almost* right and very hard to get
18
+ *exactly* right — tables are borderless, layouts differ per company, a "5" in
19
+ the wrong column ruins everything. So this tool doesn't trust what it reads.
20
+
21
+ Financial statements are **self-verifying**. They obey rules:
22
+ - Assets = Liabilities + Equity
23
+ - Current + Non-current = Total
24
+ - The closing cash in the cash-flow statement equals cash on the balance sheet
25
+
26
+ So every extracted number is checked against these rules. A value that passes a
27
+ rule is **VERIFIED**. A value that breaks one is **FLAGGED** for a human. We
28
+ never hand over a confident-but-wrong number.
29
+
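The same identities can also fill a gap when exactly one term is absent, which is the idea behind the DERIVE stage. The following is a minimal sketch, assuming hypothetical metric keys and a hypothetical `IDENTITIES` table of "total = sum of parts" relations; it makes a single pass and is not the logic of `finagent/deriver.py`.

```python
# Hypothetical identities of the form: total = sum(parts).
IDENTITIES = [
    ("total_assets", ("total_liabilities", "total_equity")),
    ("total_assets", ("current_assets", "non_current_assets")),
]


def derive_missing(metrics: dict) -> dict:
    """Return values that an identity pins down exactly but the report did not print.

    A value is derived only when it is the single unknown in an identity;
    with two or more unknowns the identity is skipped.
    """
    derived = {}
    for total, parts in IDENTITIES:
        known = {**metrics, **derived}
        missing = [name for name in (total, *parts) if name not in known]
        if len(missing) != 1:
            continue
        target = missing[0]
        if target == total:
            derived[target] = sum(known[p] for p in parts)
        else:
            others = sum(known[p] for p in parts if p != target)
            derived[target] = known[total] - others
    return derived


filled = derive_missing(
    {"total_assets": 1000.0, "total_liabilities": 600.0, "current_assets": 300.0}
)
```

In this example `filled` contains `total_equity` (400.0) and `non_current_assets` (700.0); in the pipeline's terms, these are the values that would be marked DERIVED rather than VERIFIED.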
30
+ ---
31
+
32
+ ## How it works — 7 stages, like an assembly line
33
+
34
+ ```
35
+ PDF
36
+
37
+
38
+ ┌─────────────┐ What kind of PDF is this? Page sizes, orientation,
39
+ │ 1 PROFILE │ and whether each page has readable text.
40
+ └─────────────┘
41
+
42
+
43
+ ┌─────────────┐ Some reports print two pages side-by-side on one wide
44
+ │ 2 GEOMETRY │ sheet. Split those back into single logical pages.
45
+ └─────────────┘
46
+
47
+
48
+ ┌─────────────┐ Which pages ARE the Balance Sheet / P&L / Cash Flow?
49
+ │ 3 LOCATE │ (and prefer the consolidated version)
50
+ └─────────────┘
51
+
52
+
53
+ ┌─────────────┐ Read the chosen pages line by line: "label … numbers".
54
+ │ 4 EXTRACT │ Borderless-table safe — rebuilds rows from coordinates.
55
+ └─────────────┘
56
+
57
+
58
+ ┌─────────────┐ Clean the numbers and match each label to our standard
59
+ │ 5 NORMALIZE │ vocabulary ("Turnover" and "Net sales" both -> revenue).
60
+ └─────────────┘
61
+
62
+
63
+ ┌─────────────┐ Prove the numbers using accounting identities.
64
+ │ 6 VALIDATE │ -> VERIFIED / PROBABLE / FLAGGED / MISSING
65
+ └─────────────┘
66
+
67
+
68
+ ┌─────────────┐ A number the report never printed but the identities
69
+ │ 6b DERIVE │ pin down exactly? Compute it. Mark it DERIVED.
70
+ └─────────────┘
71
+
72
+
73
+ ┌─────────────┐ Excel: value + status + page + matched label + checks.
74
+ │ 7 WRITE │ Colour-coded so you see trust level at a glance.
75
+ └─────────────┘
76
+
77
+
78
+ metrics.xlsx
79
+ ```
80
+
81
+ **Status colours in the Excel:** 🟢 VERIFIED · 🟡 PROBABLE · 🔴 FLAGGED ·
82
+ 🔵 DERIVED · ⬜ MISSING.
83
+
84
+ ---
85
+
86
+ ## Run it
87
+
88
+ ```powershell
89
+ # one report
90
+ .venv\Scripts\python.exe finagent_single.py test_pdfs\TCS_2024-2025.pdf
91
+
92
+ # scorecard across all 12 test reports
93
+ .venv\Scripts\python.exe benchmark.py
94
+ ```
95
+
96
+ Output lands in `output\<Company>_metrics.xlsx`. All commands use the project
97
+ venv (`.venv`). First-time setup: `pip install -r requirements.txt`.
98
+
99
+ ---
100
+
101
+ ## Install a released build
102
+
103
+ The project is packaged as a wheel and published on a version tag. Grab a
104
+ build from the [Releases page](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
105
+ and install it:
106
+
107
+ ```powershell
108
+ pip install akshu_finagent-0.2.0-py3-none-any.whl
109
+ ```
110
+
111
+ (TestPyPI publishing via OIDC trusted publishing is being wired up — once live,
112
+ `pip install -i https://test.pypi.org/simple/ akshu-finagent` will work directly.)
113
+
114
+ ---
115
+
116
+ ## CI/CD
117
+
118
+ This repo doubles as a hands-on CI/CD build-out (tracked level-by-level in
119
+ [`WORKLOG.md`](WORKLOG.md)):
120
+
121
+ - **CI** (`.github/workflows/ci.yml`) — every push and PR runs `ruff` + `pytest`
122
+ across Python 3.11 / 3.12 / 3.13 in parallel, with pip caching. A single
123
+ `all-green` check gates merges; `main` is branch-protected, so broken code
124
+ cannot merge.
125
+ - **CD** (`.github/workflows/release.yml`) — pushing a `vX.Y.Z` tag builds the
126
+ wheel + sdist and publishes a GitHub Release with the artifacts attached.
127
+
128
+ Cut a release:
129
+
130
+ ```powershell
131
+ git tag v0.1.0
132
+ git push origin v0.1.0
133
+ ```
134
+
135
+ ---
136
+
137
+ ## Two ways to read the code (same logic, your choice)
138
+
139
+ | | What | When to read it |
140
+ |---|---|---|
141
+ | **Single file** | `finagent_single.py` | You want to understand or share the *whole thing* top-to-bottom. The 7 stages appear in pipeline order, each in its own clearly-marked section. |
142
+ | **Package** | `finagent/` (one file per stage) | You're maintaining or extending it. Smaller files, one responsibility each. |
143
+
144
+ Both produce identical results. The package is the source of truth; the single
145
+ file is a flattened, portable build of it.
146
+
147
+ > 📖 **Want the full, plain-language walkthrough of every function and why it
148
+ > works the way it does?** See [`ARCHITECTURE.md`](ARCHITECTURE.md). It's the
149
+ > guide to use when explaining this project to anyone.
150
+
151
+ ### Package map
152
+
153
+ | File | Stage | Job |
154
+ |---|---|---|
155
+ | `finagent/profiler.py` | 1 | Per-page: text quality, size, orientation (fast, via pypdf) |
156
+ | `finagent/geometry.py` | 2 | Split two-up A3 sheets into single logical pages |
157
+ | `finagent/locator.py` | 3 | Find + classify statement pages (consolidated BS / PL / CF) |
158
+ | `finagent/extractors/geometric.py` | 4 | Line-based extraction, borderless-table safe (pdfplumber) |
159
+ | `finagent/normalizer.py`| 5 | Parse numbers, strip note columns, fuzzy-match labels to schema |
160
+ | `finagent/validator.py` | 6 | Voting + accounting identities + cross-statement ties |
161
+ | `finagent/deriver.py` | 6b | Fill numbers the report omitted but the identities fix exactly |
162
+ | `finagent/writer.py` | 7 | Excel with value, status, page citation, matched label, checks |
163
+ | `finagent/schema.py` | — | The canonical metric vocabulary + label synonyms |
164
+ | `finagent/pipeline.py` | — | Glue that runs stages 1→7 in order |
165
+
166
+ ---
167
+
168
+ ## What's in this folder
169
+
170
+ ```
171
+ finagent_single.py the whole agent in one file (start here to understand it)
172
+ finagent/ the same logic as a package, one file per stage
173
+ schema is shared ── the vocabulary every stage speaks
174
+ benchmark.py scorecard over every test PDF
175
+ golden_check.py checks extracted values against hand-verified answers
176
+ render_pages.py renders statement pages to PNGs (for eyeballing)
177
+ requirements.txt dependencies: pypdf, pdfplumber, rapidfuzz, openpyxl
178
+ test_pdfs/ 12 deliberately-different real annual reports
179
+ golden/ hand-verified correct answers for grading
180
+ output/ generated Excel files + rendered pages
181
+ scratch/ throwaway debug scripts (safe to ignore)
182
+ graphify-out/ generated knowledge-graph cache (safe to ignore)
183
+ ```
184
+
185
+ ### The test set (chosen to break things on purpose)
186
+
187
+ 12 deliberately different reports in `test_pdfs/`, including Adani (683 pages), Airtel
188
+ (A3 two-up layout), BMW (landscape, in EUR), HDFC (a bank, different schema),
189
+ Newgen (small-cap), Reliance (A3 two-up), TCS (clean baseline), Wilmar
190
+ (foreign report). If it works on all of these, it generalises.
191
+
192
+ ---
193
+
194
+ ## Roadmap (improve only what the benchmark proves)
195
+
196
+ 1. ✅ Walking skeleton: all stages thin but connected
197
+ 2. ✅ Geometry fixer: two-up A3 page split (Airtel, Reliance)
198
+ 3. ✅ IFRS locator vocabulary ("profit OR loss", comprehensive-income cues)
199
+ 4. Second extractor (Docling/TableFormer) + cross-extractor voting
200
+ 5. Unit detection (crores/lakhs/millions) + vision-LLM third voter
201
+ 6. Bank/NBFC schema variant (HDFC), OCR path for scanned pages
@@ -0,0 +1,212 @@
1
+ Metadata-Version: 2.4
2
+ Name: akshu-finagent
3
+ Version: 0.2.0
4
+ Summary: Financial statement PDF extraction agent
5
+ Requires-Python: >=3.11
6
+ Description-Content-Type: text/markdown
7
+ Requires-Dist: pdfplumber>=0.11
8
+ Requires-Dist: pypdf>=6.0
9
+ Requires-Dist: openpyxl>=3.1
10
+ Requires-Dist: rapidfuzz>=3.14
11
+
12
+ # Financial PDF Extraction Agent (v2)
13
+
14
+ [![CI](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml/badge.svg)](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/ci.yml)
15
+ [![Release](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/actions/workflows/release.yml/badge.svg)](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
16
+
17
+ Give it any company's annual-report PDF. It finds the financial statements,
18
+ pulls out the key numbers, **proves each one is correct**, and writes a tidy
19
+ Excel where every value carries its own receipt (page number, the exact label
20
+ it came from, and which checks it passed).
21
+
22
+ ---
23
+
24
+ ## The one idea behind it
25
+
26
+ > **Accuracy comes from verification, not extraction.**
27
+
28
+ Reading numbers off a PDF is easy to get *almost* right and very hard to get
29
+ *exactly* right — tables are borderless, layouts differ per company, a "5" in
30
+ the wrong column ruins everything. So this tool doesn't trust what it reads.
31
+
32
+ Financial statements are **self-verifying**. They obey rules:
33
+ - Assets = Liabilities + Equity
34
+ - Current + Non-current = Total
35
+ - The closing cash in the cash-flow statement equals cash on the balance sheet
36
+
37
+ So every extracted number is checked against these rules. A value that passes a
38
+ rule is **VERIFIED**. A value that breaks one is **FLAGGED** for a human. We
39
+ never hand over a confident-but-wrong number.
40
+
41
+ ---
42
+
43
+ ## How it works — 7 stages, like an assembly line
44
+
45
+ ```
46
+ PDF
47
+
48
+
49
+ ┌─────────────┐ What kind of PDF is this? Page sizes, orientation,
50
+ │ 1 PROFILE │ and whether each page has readable text.
51
+ └─────────────┘
52
+
53
+
54
+ ┌─────────────┐ Some reports print two pages side-by-side on one wide
55
+ │ 2 GEOMETRY │ sheet. Split those back into single logical pages.
56
+ └─────────────┘
57
+
58
+
59
+ ┌─────────────┐ Which pages ARE the Balance Sheet / P&L / Cash Flow?
60
+ │ 3 LOCATE │ (and prefer the consolidated version)
61
+ └─────────────┘
62
+
63
+
64
+ ┌─────────────┐ Read the chosen pages line by line: "label … numbers".
65
+ │ 4 EXTRACT │ Borderless-table safe — rebuilds rows from coordinates.
66
+ └─────────────┘
67
+
68
+
69
+ ┌─────────────┐ Clean the numbers and match each label to our standard
70
+ │ 5 NORMALIZE │ vocabulary ("Turnover" and "Net sales" both -> revenue).
71
+ └─────────────┘
72
+
73
+
74
+ ┌─────────────┐ Prove the numbers using accounting identities.
75
+ │ 6 VALIDATE │ -> VERIFIED / PROBABLE / FLAGGED / MISSING
76
+ └─────────────┘
77
+
78
+
79
+ ┌─────────────┐ A number the report never printed but the identities
80
+ │ 6b DERIVE │ pin down exactly? Compute it. Mark it DERIVED.
81
+ └─────────────┘
82
+
83
+
84
+ ┌─────────────┐ Excel: value + status + page + matched label + checks.
85
+ │ 7 WRITE │ Colour-coded so you see trust level at a glance.
86
+ └─────────────┘
87
+
88
+
89
+ metrics.xlsx
90
+ ```
91
+
92
+ **Status colours in the Excel:** 🟢 VERIFIED · 🟡 PROBABLE · 🔴 FLAGGED ·
93
+ 🔵 DERIVED · ⬜ MISSING.
94
+
95
+ ---
96
+
97
+ ## Run it
98
+
99
+ ```powershell
100
+ # one report
101
+ .venv\Scripts\python.exe finagent_single.py test_pdfs\TCS_2024-2025.pdf
102
+
103
+ # scorecard across all 12 test reports
104
+ .venv\Scripts\python.exe benchmark.py
105
+ ```
106
+
107
+ Output lands in `output\<Company>_metrics.xlsx`. All commands use the project
108
+ venv (`.venv`). First-time setup: `pip install -r requirements.txt`.
109
+
110
+ ---
111
+
112
+ ## Install a released build
113
+
114
+ The project is packaged as a wheel and published on a version tag. Grab a
115
+ build from the [Releases page](https://github.com/Akshu24Tech/financial-pdf-extraction-agent/releases)
116
+ and install it:
117
+
118
+ ```powershell
119
+ pip install akshu_finagent-0.2.0-py3-none-any.whl
120
+ ```
121
+
122
+ (TestPyPI publishing via OIDC trusted publishing is being wired up — once live,
123
+ `pip install -i https://test.pypi.org/simple/ akshu-finagent` will work directly.)
124
+
125
+ ---
126
+
127
+ ## CI/CD
128
+
129
+ This repo doubles as a hands-on CI/CD build-out (tracked level-by-level in
130
+ [`WORKLOG.md`](WORKLOG.md)):
131
+
132
+ - **CI** (`.github/workflows/ci.yml`) — every push and PR runs `ruff` + `pytest`
133
+ across Python 3.11 / 3.12 / 3.13 in parallel, with pip caching. A single
134
+ `all-green` check gates merges; `main` is branch-protected, so broken code
135
+ cannot merge.
136
+ - **CD** (`.github/workflows/release.yml`) — pushing a `vX.Y.Z` tag builds the
137
+ wheel + sdist and publishes a GitHub Release with the artifacts attached.
138
+
139
+ Cut a release:
140
+
141
+ ```powershell
142
+ git tag v0.1.0
143
+ git push origin v0.1.0
144
+ ```
145
+
146
+ ---
147
+
148
+ ## Two ways to read the code (same logic, your choice)
149
+
150
+ | | What | When to read it |
151
+ |---|---|---|
152
+ | **Single file** | `finagent_single.py` | You want to understand or share the *whole thing* top-to-bottom. The 7 stages appear in pipeline order, each in its own clearly-marked section. |
153
+ | **Package** | `finagent/` (one file per stage) | You're maintaining or extending it. Smaller files, one responsibility each. |
154
+
155
+ Both produce identical results. The package is the source of truth; the single
156
+ file is a flattened, portable build of it.
157
+
158
+ > 📖 **Want the full, plain-language walkthrough of every function and why it
159
+ > works the way it does?** See [`ARCHITECTURE.md`](ARCHITECTURE.md). It's the
160
+ > guide to use when explaining this project to anyone.
161
+
162
+ ### Package map
163
+
164
+ | File | Stage | Job |
165
+ |---|---|---|
166
+ | `finagent/profiler.py` | 1 | Per-page: text quality, size, orientation (fast, via pypdf) |
167
+ | `finagent/geometry.py` | 2 | Split two-up A3 sheets into single logical pages |
168
+ | `finagent/locator.py` | 3 | Find + classify statement pages (consolidated BS / PL / CF) |
169
+ | `finagent/extractors/geometric.py` | 4 | Line-based extraction, borderless-table safe (pdfplumber) |
170
+ | `finagent/normalizer.py`| 5 | Parse numbers, strip note columns, fuzzy-match labels to schema |
171
+ | `finagent/validator.py` | 6 | Voting + accounting identities + cross-statement ties |
172
+ | `finagent/deriver.py` | 6b | Fill numbers the report omitted but the identities fix exactly |
173
+ | `finagent/writer.py` | 7 | Excel with value, status, page citation, matched label, checks |
174
+ | `finagent/schema.py` | — | The canonical metric vocabulary + label synonyms |
175
+ | `finagent/pipeline.py` | — | Glue that runs stages 1→7 in order |
176
+
177
+ ---
178
+
179
+ ## What's in this folder
180
+
181
+ ```
182
+ finagent_single.py the whole agent in one file (start here to understand it)
183
+ finagent/ the same logic as a package, one file per stage
184
+ schema is shared ── the vocabulary every stage speaks
185
+ benchmark.py scorecard over every test PDF
186
+ golden_check.py checks extracted values against hand-verified answers
187
+ render_pages.py renders statement pages to PNGs (for eyeballing)
188
+ requirements.txt dependencies: pypdf, pdfplumber, rapidfuzz, openpyxl
189
+ test_pdfs/ 12 deliberately-different real annual reports
190
+ golden/ hand-verified correct answers for grading
191
+ output/ generated Excel files + rendered pages
192
+ scratch/ throwaway debug scripts (safe to ignore)
193
+ graphify-out/ generated knowledge-graph cache (safe to ignore)
194
+ ```
195
+
196
+ ### The test set (chosen to break things on purpose)
197
+
198
+ 12 deliberately different reports in `test_pdfs/`, including Adani (683 pages), Airtel
199
+ (A3 two-up layout), BMW (landscape, in EUR), HDFC (a bank, different schema),
200
+ Newgen (small-cap), Reliance (A3 two-up), TCS (clean baseline), Wilmar
201
+ (foreign report). If it works on all of these, it generalises.
202
+
203
+ ---
204
+
205
+ ## Roadmap (improve only what the benchmark proves)
206
+
207
+ 1. ✅ Walking skeleton: all stages thin but connected
208
+ 2. ✅ Geometry fixer: two-up A3 page split (Airtel, Reliance)
209
+ 3. ✅ IFRS locator vocabulary ("profit OR loss", comprehensive-income cues)
210
+ 4. Second extractor (Docling/TableFormer) + cross-extractor voting
211
+ 5. Unit detection (crores/lakhs/millions) + vision-LLM third voter
212
+ 6. Bank/NBFC schema variant (HDFC), OCR path for scanned pages
@@ -0,0 +1,21 @@
+ README.md
+ finagent_single.py
+ pyproject.toml
+ akshu_finagent.egg-info/PKG-INFO
+ akshu_finagent.egg-info/SOURCES.txt
+ akshu_finagent.egg-info/dependency_links.txt
+ akshu_finagent.egg-info/requires.txt
+ akshu_finagent.egg-info/top_level.txt
+ finagent/__init__.py
+ finagent/deriver.py
+ finagent/geometry.py
+ finagent/locator.py
+ finagent/normalizer.py
+ finagent/pipeline.py
+ finagent/profiler.py
+ finagent/schema.py
+ finagent/validator.py
+ finagent/writer.py
+ finagent/extractors/__init__.py
+ finagent/extractors/geometric.py
+ tests/test_core.py
@@ -0,0 +1,4 @@
+ pdfplumber>=0.11
+ pypdf>=6.0
+ openpyxl>=3.1
+ rapidfuzz>=3.14
@@ -0,0 +1,2 @@
+ finagent
+ finagent_single
File without changes