hsbc_pdf_statement_parser 1.0.1 → 2.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.ruby-version +1 -0
- data/README.md +93 -0
- data/hsbc_pdf_statement_parser.gemspec +15 -14
- data/lib/hsbc_pdf_statement_parser/reader.rb +24 -0
- data/lib/hsbc_pdf_statement_parser/statement_parser.rb +123 -0
- data/lib/hsbc_pdf_statement_parser/transaction_parser.rb +133 -0
- data/lib/hsbc_pdf_statement_parser/version.rb +1 -1
- data/lib/hsbc_pdf_statement_parser.rb +50 -5
- data/treefmt.toml +6 -0
- metadata +43 -12
- data/README.textile +0 -62
- data/lib/hsbc_pdf_statement_parser/parser.rb +0 -166
- data/lib/hsbc_pdf_statement_parser/pdf_reader_patch.rb +0 -44
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f91ab40500e36cdbad3e6201e3bc88d61c0abe0ca5d800a59403c79f3c330eb0
|
4
|
+
data.tar.gz: 912e76d7d5ae051ed118bee0d68f5d359a012ad59398614755b89eca088050a8
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: dfa5cb3796d4e2473053a3905f8801f7169a8d38d5ec8540a7cc9c11b00b3eeeeddeff7024d1a2c677b0c87e04f651be5e76201436521834d61e2e0a2dbf9330
|
7
|
+
data.tar.gz: d45b1204240c344d47a322888de778f564f44c654e144ea53ea1f26a0f19d3225dee2ea2b74023e5669b167f5bf059fc25e41da06a59be24a6e30a74ca970a1e
|
data/.ruby-version
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
3.2.3
|
data/README.md
ADDED
@@ -0,0 +1,93 @@
|
|
1
|
+
# HSBC PDF Statement Parser
|
2
|
+
|
3
|
+
This is a _very_ quick and dirty gem that swallows downloaded PDF files from HSBC (UK) and parses them into a [Dry::Struct](https://dry-rb.org/gems/dry-struct/1.0/) containing details of the statement + its transactions.
|
4
|
+
|
5
|
+
It exists solely because HSBC doesn’t seem to offer any way of exporting old statements as anything other than PDFs, which makes it a pain in the backside to import anything into any kind of finance packages.
|
6
|
+
You probably shouldn’t use it (see warnings below)
|
7
|
+
|
8
|
+
# Installation
|
9
|
+
|
10
|
+
Using bundler on the command line:
|
11
|
+
|
12
|
+
```shell
|
13
|
+
$ bundle add hsbc_pdf_statement_parser
|
14
|
+
$ bundle
|
15
|
+
```
|
16
|
+
|
17
|
+
## Usage
|
18
|
+
|
19
|
+
This gem exposes one method: `parse`, which takes the path to a statement PDF and returns a `Dry::Struct` representation.
|
20
|
+
|
21
|
+
```ruby
|
22
|
+
require 'hsbc_pdf_statement_parser'
|
23
|
+
|
24
|
+
parsed = HsbcPdfStatementParser.parse( 'path/to/statement.pdf' )
|
25
|
+
parsed.transactions.each do |tx|
|
26
|
+
printf(
|
27
|
+
"[%s] {%-3s} %-40s %7.02f | %7.02f\n",
|
28
|
+
tx.date,
|
29
|
+
tx.type,
|
30
|
+
tx.details.lines.first.strip,
|
31
|
+
tx.change,
|
32
|
+
tx.balance
|
33
|
+
)
|
34
|
+
end
|
35
|
+
```
|
36
|
+
|
37
|
+
### Statement Properties
|
38
|
+
|
39
|
+
- `account_holder`: the name of the account holder _(String)_
|
40
|
+
- `sortcode`: the sortcode shown on the statement _(String)_
|
41
|
+
- `account_number`: the account number shown on the statement _(String)_
|
42
|
+
- `sheets`: the sheets used in the statement _(Range[Int])_
|
43
|
+
- `date_range`: the date range shown on the first page of the statement _(Range[Date])_
|
44
|
+
- `opening_balance`: the opening balance of the statement _(Decimal)_
|
45
|
+
- `closing_balance`: the closing balance of the statement _(Decimal)_
|
46
|
+
- `payments_in`: the total of all transactions into the account _(Decimal)_
|
47
|
+
- `payments_out`: the total of all transactions out of the account _(Decimal)_
|
48
|
+
|
49
|
+
**Note:** `payments_in` and `payments_out` are the totals shown on the first page of the statement; they are not calculated from, or checked against, the parsed transactions.
|
50
|
+
|
51
|
+
Also note that sheet numbers are not guaranteed to be unique. Not sure why this is the case, but I have a few statements where the last sheet of one statement and the first of another have the same sheet number.
|
52
|
+
|
53
|
+
### Transaction Properties
|
54
|
+
|
55
|
+
- `date`: the date of the transaction _(Date)_
|
56
|
+
- `type`: a string representation of the type of the transaction (eg. `DD` for a direct debit, `VIS` for VISA, etc) _(String)_
|
57
|
+
- `details`: a text description of the transaction, which may span multiple lines _(String)_
|
58
|
+
- `paid_in`: the amount paid in, if appropriate _(Decimal, nullable)_
|
59
|
+
- `paid_out`: the amount paid out, if appropriate _(Decimal, nullable)_
|
60
|
+
- `balance`: the balance of the account after the transaction _(Decimal)_
|
61
|
+
- `change`: the calculated change to the balance of the account: negative for debits, positive for credits _(Decimal)_
|
62
|
+
|
63
|
+
**Note:** unlike in V1, `balance` is now always present and is calculated as a running total based on the opening balance of the statement. Where the statement shows a running balance after transactions (seems to be once a day), this is checked and the parser will raise an error if any discrepancy is found.
|
64
|
+
|
65
|
+
## ⚠️ Warnings
|
66
|
+
|
67
|
+
This gem has been thrown together for my own needs, and is offered to the world just in case someone else might want to play around with it.
|
68
|
+
It seems to work pretty well with statements from my Advance account here in the UK, and may also work with other flavours of accounts from elsewhere in the world, but comes with absolutely zero guarantees or assurances.
|
69
|
+
|
70
|
+
That is to say: it seems to work OK for mucking around, but I’d recommend not using it for anything mission-critical, or in a situation that might lead you or others into making any kind of financial decisions.
|
71
|
+
Any dumb financial decisions made are entirely on you =)
|
72
|
+
|
73
|
+
### Wot? No tests
|
74
|
+
|
75
|
+
I have plenty; sadly, the only way of properly testing this code is by parsing real bank statements, and I’m not about to commit any of those to GitHub. Sorry!
|
76
|
+
|
77
|
+
## Upgrading from V1.x
|
78
|
+
|
79
|
+
For various reasons I’ve not worried too much about maintaining backward compatibility, though migration should be relatively minimal:
|
80
|
+
|
81
|
+
### Breaking changes
|
82
|
+
|
83
|
+
1. invocation: `HsbcPdfStatementParser::Parser.new(…)` becomes `HsbcPdfStatementParser.parse(…)`
|
84
|
+
2. when using parsed transactions, `in` and `out` are now `paid_in` and `paid_out` respectively
|
85
|
+
3. any use of `fetch(…)` on parsed transactions will need to be replaced with bare function calls (hash accessors—ie `tx[:date]`—will continue to work)
|
86
|
+
|
87
|
+
### Nonbreaking changes
|
88
|
+
|
89
|
+
Aside from the new properties added to the main Statement type, the biggest difference is that a transaction’s `balance` property is now always specified, whereas it was only specified once a day in V1.x
|
90
|
+
|
91
|
+
---
|
92
|
+
|
93
|
+
Share and enjoy
|
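As a rough illustration of breaking change 2 in the README above, the sketch below renames V1-style transaction hash keys to their V2 names. `V1_TO_V2_KEYS` and `v2_keys` are hypothetical helpers for calling code that still holds V1-shaped hashes; they are not part of the gem, and the sample data is made up.

```ruby
# Hypothetical migration helper: NOT part of hsbc_pdf_statement_parser.
# Maps the V1 transaction keys that were renamed in V2.
V1_TO_V2_KEYS = { in: :paid_in, out: :paid_out }.freeze

# Returns a copy of a V1-style transaction hash using the V2 key names.
# Keys that were not renamed are passed through unchanged.
def v2_keys(v1_tx)
  v1_tx.transform_keys { |key| V1_TO_V2_KEYS.fetch(key, key) }
end

# Made-up sample data in the V1 shape.
legacy = { date: "2024-01-05", type: "DD", details: "ACME LTD", out: 12.5, in: nil }
converted = v2_keys(legacy)
puts converted.keys.inspect
```

The original hash is left untouched, so old and new shapes can coexist while calling code is being migrated.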
data/hsbc_pdf_statement_parser.gemspec
CHANGED
@@ -3,18 +3,19 @@ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
|
|
3
3
|
require "hsbc_pdf_statement_parser/version"
|
4
4
|
|
5
5
|
Gem::Specification.new do |spec|
|
6
|
-
|
7
|
-
spec.
|
8
|
-
spec.
|
9
|
-
|
10
|
-
|
11
|
-
spec.
|
12
|
-
spec.
|
13
|
-
|
14
|
-
|
15
|
-
spec.
|
16
|
-
|
17
|
-
|
18
|
-
spec.add_dependency
|
19
|
-
|
6
|
+
spec.name = "hsbc_pdf_statement_parser"
|
7
|
+
spec.summary = "Quick and dirty RubyGem to parse HSBC’s statement PDFs"
|
8
|
+
spec.license = "MIT"
|
9
|
+
|
10
|
+
spec.authors = "Jon Pearse"
|
11
|
+
spec.email = "hello@jonpearse.net"
|
12
|
+
spec.homepage = "https://github.com/jonpearse/hsbc-pdf-statement-parser"
|
13
|
+
|
14
|
+
spec.version = HsbcPdfStatementParser::VERSION
|
15
|
+
spec.files = `git ls-files`.split($\)
|
16
|
+
|
17
|
+
spec.add_dependency "pdf-reader", "~> 2.9"
|
18
|
+
spec.add_dependency "dry-struct", "~> 1.4"
|
19
|
+
|
20
|
+
spec.add_development_dependency "rufo"
|
20
21
|
end
|
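The `~>` constraints in the gemspec above are pessimistic version constraints. A small sketch using RubyGems' own `Gem::Requirement` shows which versions they admit; the version numbers below are sample inputs chosen for illustration, not a claim about which releases actually exist.

```ruby
require "rubygems" # usually loaded already; explicit so the sketch is self-contained

pdf_reader = Gem::Requirement.new("~> 2.9")
dry_struct = Gem::Requirement.new("~> 1.4")

# "~> 2.9" admits >= 2.9 and < 3.0; "~> 1.4" admits >= 1.4 and < 2.0.
checks = {
  "pdf-reader 2.8.9" => pdf_reader.satisfied_by?(Gem::Version.new("2.8.9")),
  "pdf-reader 2.12.1" => pdf_reader.satisfied_by?(Gem::Version.new("2.12.1")),
  "pdf-reader 3.0.0" => pdf_reader.satisfied_by?(Gem::Version.new("3.0.0")),
  "dry-struct 1.6.0" => dry_struct.satisfied_by?(Gem::Version.new("1.6.0")),
  "dry-struct 2.0.0" => dry_struct.satisfied_by?(Gem::Version.new("2.0.0")),
}

checks.each { |label, ok| puts "#{label}: #{ok}" }
```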
data/lib/hsbc_pdf_statement_parser/reader.rb
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
module HsbcPdfStatementParser
|
2
|
+
class Reader
|
3
|
+
def initialize(filename)
|
4
|
+
@reader = PDF::Reader.new(filename)
|
5
|
+
end
|
6
|
+
|
7
|
+
def first_page
|
8
|
+
@_first_page ||= @reader.pages.first.text
|
9
|
+
end
|
10
|
+
|
11
|
+
def all_text
|
12
|
+
@_all_text ||= @reader.pages.map(&:text).join
|
13
|
+
end
|
14
|
+
|
15
|
+
def statement_blocks
|
16
|
+
@_statement_lines ||= begin
|
17
|
+
@reader.pages.map do |page|
|
18
|
+
match = page.text.match(/BALANCE\s?BROUGHT\s?FORWARD(?:.*?)\n(.*?)BALANCE\s?CARRIED\s?FORWARD/im)
|
19
|
+
match ? match[1] : nil
|
20
|
+
end.compact
|
21
|
+
end
|
22
|
+
end
|
23
|
+
end
|
24
|
+
end
|
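To see what `statement_blocks` in the reader above extracts, here is a minimal sketch that applies the same regular expression to plain strings instead of PDF pages. The page text is invented, and the standalone `statement_blocks(pages)` method is a stand-in that takes an array of strings rather than a `PDF::Reader` instance.

```ruby
# Same pattern as in the diff: capture everything between the line containing
# "BALANCE BROUGHT FORWARD" and the next "BALANCE CARRIED FORWARD".
BLOCK_PATTERN = /BALANCE\s?BROUGHT\s?FORWARD(?:.*?)\n(.*?)BALANCE\s?CARRIED\s?FORWARD/im

# pages: array of page text strings (stand-in for the text of each PDF page).
# Pages without a transaction table are dropped.
def statement_blocks(pages)
  pages.filter_map do |text|
    match = text.match(BLOCK_PATTERN)
    match[1] if match
  end
end

# Invented sample text: a summary page with no table, then a page with one row.
pages = [
  "Account summary\nOpening Balance 100.00\n",
  "BALANCE BROUGHT FORWARD      100.00\n" \
  "02 Jan 24  DD  ACME LTD  10.00  90.00\n" \
  "BALANCE CARRIED FORWARD  90.00\n",
]

blocks = statement_blocks(pages)
puts blocks.inspect
```

Because the `m` flag lets `.` match newlines in Ruby, the lazy `(.*?)` group can span several lines while still stopping at the first "carried forward" marker on the page.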
data/lib/hsbc_pdf_statement_parser/statement_parser.rb
ADDED
@@ -0,0 +1,123 @@
|
|
1
|
+
module HsbcPdfStatementParser
|
2
|
+
class StatementParser
|
3
|
+
attr_reader :_statement_lines
|
4
|
+
|
5
|
+
def initialize(filename)
|
6
|
+
@reader = Reader.new(filename)
|
7
|
+
end
|
8
|
+
|
9
|
+
def parse
|
10
|
+
opening_balance = scan_figure("Opening Balance")
|
11
|
+
closing_balance = scan_figure("Closing Balance")
|
12
|
+
meta = get_meta
|
13
|
+
|
14
|
+
ImportedStatement.new(
|
15
|
+
account_holder: meta[:account_holder],
|
16
|
+
sortcode: meta[:sortcode],
|
17
|
+
account_number: meta[:account_no],
|
18
|
+
sheets: meta[:sheets],
|
19
|
+
date_range: get_date_range,
|
20
|
+
opening_balance: opening_balance,
|
21
|
+
closing_balance: closing_balance,
|
22
|
+
payments_in: scan_figure("Payments In"),
|
23
|
+
payments_out: scan_figure("Payments Out"),
|
24
|
+
transactions: parse_transactions(opening_balance),
|
25
|
+
)
|
26
|
+
end
|
27
|
+
|
28
|
+
private def scan_figure(search_string)
|
29
|
+
# note: I do not like how … general this regex is, but HSBC seems to love putting random noise characters into
|
30
|
+
# opening and closing balance, so there’s only so much I can do about it.
|
31
|
+
match = Regexp.new("#{search_string}(.*?)\n", Regexp::IGNORECASE).match(@reader.first_page)
|
32
|
+
raise ParsingError.new("Could not find #{search_string}") if match.nil?
|
33
|
+
|
34
|
+
match[1].strip.gsub(/[\s£,]/, "").to_d
|
35
|
+
end
|
36
|
+
|
37
|
+
private def get_meta
|
38
|
+
all_meta = @reader.all_text.scan(/Account\s?Name\s+Sortcode\s+Account\s?Number\s+Sheet Number\n+([A-Z\s]+?)\s\s+([\d\-]+)\s\s+(\d+)\s\s+(\d+)\n/i)
|
39
|
+
raise ParsingError.new("Cannot find statement metadata") if all_meta.empty?
|
40
|
+
|
41
|
+
# check everything makes sense
|
42
|
+
raise ParsingError.new("Error parsing account name") unless all_elements_same(all_meta.map(&:first))
|
43
|
+
raise ParsingError.new("Error parsing sort code") unless all_elements_same(all_meta.map { |a| a[1] })
|
44
|
+
raise ParsingError.new("Error parsing account number") unless all_elements_same(all_meta.map { |a| a[2] })
|
45
|
+
|
46
|
+
# get page numbers
|
47
|
+
first_page = all_meta.first[3]
|
48
|
+
last_page = all_meta.last[3]
|
49
|
+
raise ParsingError.new("Error parsing sheet numbers") if (first_page > last_page)
|
50
|
+
|
51
|
+
{
|
52
|
+
sheets: Range.new(first_page.to_i, last_page.to_i),
|
53
|
+
account_holder: all_meta.dig(0, 0).strip,
|
54
|
+
sortcode: all_meta.dig(0, 1).strip,
|
55
|
+
account_no: all_meta.dig(0, 2).strip,
|
56
|
+
}
|
57
|
+
end
|
58
|
+
|
59
|
+
private def get_date_range
|
60
|
+
dates = @reader.first_page.match(/(\d{1,2}) ([a-z]+)(?: (\d{4}))? to (\d{1,2}) ([a-z]+) (\d{4})/i)
|
61
|
+
raise ParsingError.new("Cannot find date range") unless dates
|
62
|
+
dates = %i{start_day start_month start_year end_day end_month end_year}.zip(dates.captures).to_h
|
63
|
+
|
64
|
+
# default the start year
|
65
|
+
dates[:start_year] ||= dates[:end_year]
|
66
|
+
start_date = Date.parse("#{dates[:start_day]} #{dates[:start_month]} #{dates[:start_year]}")
|
67
|
+
end_date = Date.parse("#{dates[:end_day]} #{dates[:end_month]} #{dates[:end_year]}")
|
68
|
+
raise ParsingError.new("Error parsing date range") if (start_date > end_date)
|
69
|
+
|
70
|
+
Range.new(start_date, end_date)
|
71
|
+
end
|
72
|
+
|
73
|
+
private def parse_transactions(opening_balance)
|
74
|
+
# Get the raw information out of the PDF text
|
75
|
+
parser = TransactionParser.new
|
76
|
+
transactions = parser.parse(@reader.statement_blocks)
|
77
|
+
|
78
|
+
# start crosschecking!
|
79
|
+
running_balance = opening_balance
|
80
|
+
transactions.map do |tx|
|
81
|
+
|
82
|
+
# work out what we’re doing
|
83
|
+
running_balance += tx[:paid_in] || 0 - tx[:paid_out]
|
84
|
+
|
85
|
+
# push a calculated balance in if none is present
|
86
|
+
if tx[:balance].nil?
|
87
|
+
tx[:balance] = running_balance
|
88
|
+
else
|
89
|
+
raise TransactionCalculationError.new(tx[:balance], running_balance) if running_balance != tx[:balance]
|
90
|
+
end
|
91
|
+
|
92
|
+
# all good, so create a new Transaction object
|
93
|
+
Transaction.new(tx)
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
private def all_elements_same(arry)
|
98
|
+
arry.uniq.length == 1
|
99
|
+
end
|
100
|
+
end
|
101
|
+
|
102
|
+
class ParsingError < StandardError
|
103
|
+
attr_reader :info
|
104
|
+
|
105
|
+
def initialize(message, info = nil)
|
106
|
+
@info = info
|
107
|
+
super(message)
|
108
|
+
end
|
109
|
+
end
|
110
|
+
|
111
|
+
class TransactionCalculationError < StandardError
|
112
|
+
attr_reader :expected, :calculated
|
113
|
+
|
114
|
+
def initialize(expected, calculated)
|
115
|
+
@expected = expected
|
116
|
+
@calculated = calculated
|
117
|
+
end
|
118
|
+
|
119
|
+
def message
|
120
|
+
"Expected #{@expected} but got #{@calculated}"
|
121
|
+
end
|
122
|
+
end
|
123
|
+
end
|
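The running-balance crosscheck in `parse_transactions` above can be sketched in isolation. This is a simplified stand-in under stated assumptions: rows are plain hashes, `fill_balances` and `BalanceMismatch` are hypothetical names, and the sample amounts are invented. Unlike the original, it returns new hashes rather than mutating its input.

```ruby
require "bigdecimal"
require "bigdecimal/util"

# Hypothetical error class standing in for TransactionCalculationError.
class BalanceMismatch < StandardError; end

# Walks the rows in order, keeping a running total from the opening balance.
# Rows without a printed balance get the calculated one; rows with a printed
# balance are checked against it.
def fill_balances(opening_balance, rows)
  running = opening_balance
  rows.map do |row|
    # `a || 0 - b` parses as `a || (0 - b)`; the parentheses only make that explicit.
    running += row[:paid_in] || (0 - row[:paid_out])

    if row[:balance].nil?
      row.merge(balance: running)
    elsif row[:balance] != running
      raise BalanceMismatch, "Expected #{row[:balance].to_s("F")} but got #{running.to_s("F")}"
    else
      row
    end
  end
end

# Invented sample rows: one debit without a printed balance, one credit with one.
rows = [
  { paid_in: nil, paid_out: "10.00".to_d, balance: nil },
  { paid_in: "25.50".to_d, paid_out: nil, balance: "115.50".to_d },
]

filled = fill_balances("100.00".to_d, rows)
filled.each { |row| puts row[:balance].to_s("F") }
```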
data/lib/hsbc_pdf_statement_parser/transaction_parser.rb
ADDED
@@ -0,0 +1,133 @@
|
|
1
|
+
require "bigdecimal/util"
|
2
|
+
require "date"
|
3
|
+
|
4
|
+
module HsbcPdfStatementParser
|
5
|
+
class TransactionParser
|
6
|
+
attr_reader :transactions
|
7
|
+
|
8
|
+
def parse(input)
|
9
|
+
# We need to parse pages individually because the column widths may not be the same on each page
|
10
|
+
parsed_lines = input.reduce([]) { |lines, page| lines + parse_page(page) }
|
11
|
+
|
12
|
+
# build out transactions from our parsed data
|
13
|
+
@transactions = parse_transactions(parsed_lines)
|
14
|
+
end
|
15
|
+
|
16
|
+
private def parse_page(page_data)
|
17
|
+
# get the maximum line length for this page
|
18
|
+
max_line_length = page_data.lines.map(&:length).max
|
19
|
+
|
20
|
+
# work out the columns being used
|
21
|
+
column_indices = find_columns(page_data, max_line_length)
|
22
|
+
|
23
|
+
# get the column keys and produce a map for parsing
|
24
|
+
column_keys = get_column_keys(column_indices, max_line_length)
|
25
|
+
column_map = column_keys.zip(column_indices).to_h
|
26
|
+
|
27
|
+
# parse out all the
|
28
|
+
page_data.lines.reduce([]) do |parsed, line|
|
29
|
+
next parsed if line.strip.empty?
|
30
|
+
|
31
|
+
parsed << parse_line(line, column_map, max_line_length)
|
32
|
+
end
|
33
|
+
end
|
34
|
+
|
35
|
+
private def find_columns(input, max_line_length)
|
36
|
+
state = Array.new(max_line_length, " ")
|
37
|
+
|
38
|
+
# scan the content + look for obvious columns
|
39
|
+
input.lines.each do |line|
|
40
|
+
# bounce if there’s no point…
|
41
|
+
next if line.strip.empty?
|
42
|
+
|
43
|
+
# map the state
|
44
|
+
line.split("").each.with_index { |chr, i| state[i] = "X" if chr != " " }
|
45
|
+
end
|
46
|
+
|
47
|
+
# We will almost have something like XX XXX XX for the date column, so replace any spaces with an X either side
|
48
|
+
# with an X
|
49
|
+
state = state.join.gsub(/X X/, "XXX")
|
50
|
+
|
51
|
+
# find the columns
|
52
|
+
columns = []
|
53
|
+
state.gsub(/X+/) { columns << Range.new(*$~.offset(0)) }
|
54
|
+
columns
|
55
|
+
end
|
56
|
+
|
57
|
+
private def get_column_keys(indices, max_line_length)
|
58
|
+
# if we have 6 columns, we’re all good…
|
59
|
+
return %i{date type details paid_out paid_in balance} if indices.length == 6
|
60
|
+
|
61
|
+
# if a statement contains no payments in or no payments out, it’ll only have 5 columns, so we need to work out
|
62
|
+
# which the additional column is
|
63
|
+
# We can do this with an assumption that the ‘paid out’ column is usually at around 2/3 of the total line length…
|
64
|
+
has_paid_out = indices.map { |i| i.first.to_f / max_line_length }.any? { |p| p > 0.6 && p < 0.7 }
|
65
|
+
additional = has_paid_out ? :paid_out : :paid_in
|
66
|
+
|
67
|
+
[:date, :type, :details, additional, :balance]
|
68
|
+
end
|
69
|
+
|
70
|
+
private def parse_line(line, map, max_line_length)
|
71
|
+
# lengthen the line
|
72
|
+
line = line.rstrip.ljust(max_line_length)
|
73
|
+
|
74
|
+
# cut things up
|
75
|
+
row = {
|
76
|
+
paid_out: nil,
|
77
|
+
paid_in: nil,
|
78
|
+
}
|
79
|
+
map.each { |k, v| row[k] = empty_to_nil(line[v]) }
|
80
|
+
|
81
|
+
# further processing
|
82
|
+
row[:date] = Date.strptime(row[:date], "%d %b %y") unless row[:date].nil?
|
83
|
+
row[:paid_out] = parse_decimal(row[:paid_out])
|
84
|
+
row[:paid_in] = parse_decimal(row[:paid_in])
|
85
|
+
row[:balance] = parse_decimal(row[:balance])
|
86
|
+
|
87
|
+
row
|
88
|
+
end
|
89
|
+
|
90
|
+
private def parse_decimal(str)
|
91
|
+
return nil if str.nil?
|
92
|
+
|
93
|
+
mult = str.end_with?("D") ? -1 : 1
|
94
|
+
|
95
|
+
str.gsub(",", "").to_d * mult
|
96
|
+
end
|
97
|
+
|
98
|
+
private def empty_to_nil(str)
|
99
|
+
str = str.strip
|
100
|
+
|
101
|
+
str.empty? ? nil : str
|
102
|
+
end
|
103
|
+
|
104
|
+
private def parse_transactions(parsed_lines)
|
105
|
+
current_date = nil
|
106
|
+
current_transaction = nil
|
107
|
+
|
108
|
+
transactions = []
|
109
|
+
|
110
|
+
parsed_lines.each do |parsed|
|
111
|
+
# if we have a new date, store it
|
112
|
+
current_date = parsed[:date] unless parsed[:date].nil?
|
113
|
+
|
114
|
+
# if we have a type, it’s a new transaction, so push any existing info onto the list + reset
|
115
|
+
unless parsed[:type].nil?
|
116
|
+
transactions << current_transaction unless current_transaction.nil?
|
117
|
+
current_transaction = parsed.merge(date: current_date)
|
118
|
+
next
|
119
|
+
end
|
120
|
+
|
121
|
+
# otherwise, merge things together
|
122
|
+
current_transaction = current_transaction.merge(
|
123
|
+
parsed.reject { |_, v| v.nil? },
|
124
|
+
{ details: "#{current_transaction[:details]}\n#{parsed[:details]}".strip }
|
125
|
+
)
|
126
|
+
end
|
127
|
+
|
128
|
+
# shove the last transaction onto the list + return
|
129
|
+
transactions << current_transaction
|
130
|
+
transactions
|
131
|
+
end
|
132
|
+
end
|
133
|
+
end
|
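The column detection in `find_columns` above is easier to follow on a tiny input. The sketch below follows the same idea (mark every character position that is non-blank on any line, close single-space gaps, then read off the runs) on two invented rows. It uses exclusive ranges, a small departure from the `Range.new(*$~.offset(0))` call in the diff, and `ROW` is a hypothetical format string used only to build aligned sample text.

```ruby
# Returns the character ranges that hold non-space text on at least one line.
def find_columns(text)
  lines = text.lines.map(&:chomp).reject { |line| line.strip.empty? }
  width = lines.map(&:length).max

  # Mark each position that is non-blank on any line.
  mask = Array.new(width, " ")
  lines.each do |line|
    line.each_char.with_index { |chr, i| mask[i] = "X" if chr != " " }
  end

  # Close single-space gaps, e.g. inside a date such as "01 Jan 24".
  mask = mask.join.gsub(/X X/, "XXX")

  columns = []
  mask.scan(/X+/) { columns << ($~.begin(0)...$~.end(0)) }
  columns
end

# Invented, fixed-width sample rows: date, type, details, paid out, paid in, balance.
ROW = "%-11s%-5s%-10s%-8s%-7s%s"
sample = [
  format(ROW, "01 Jan 24", "DD", "ACME", "10.00", "", "90.00"),
  format(ROW, "", "VIS", "SHOP", "", "5.00", ""),
].join("\n")

columns = find_columns(sample)
first_line = sample.lines.first.chomp
cells = columns.map { |range| first_line[range].to_s.strip }

puts columns.inspect
puts cells.inspect
```

Detecting columns per page, rather than slicing at fixed offsets as the V1 parser did, is what lets the new parser cope with pages whose column widths differ.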
data/lib/hsbc_pdf_statement_parser.rb
CHANGED
@@ -1,7 +1,52 @@
|
|
1
|
-
require
|
2
|
-
require
|
3
|
-
require
|
4
|
-
require
|
1
|
+
require "bigdecimal/util"
|
2
|
+
require "date"
|
3
|
+
require "dry-struct"
|
4
|
+
require "pdf-reader"
|
5
|
+
|
6
|
+
require "hsbc_pdf_statement_parser/reader"
|
7
|
+
require "hsbc_pdf_statement_parser/statement_parser"
|
8
|
+
require "hsbc_pdf_statement_parser/transaction_parser"
|
5
9
|
|
6
10
|
module HsbcPdfStatementParser
|
7
|
-
|
11
|
+
module Types
|
12
|
+
include Dry.Types()
|
13
|
+
end
|
14
|
+
|
15
|
+
class Transaction < Dry::Struct
|
16
|
+
attribute :date, Types::Date
|
17
|
+
attribute :type, Types::String
|
18
|
+
attribute :details, Types::String
|
19
|
+
attribute :paid_out, Types::Decimal.optional
|
20
|
+
attribute :paid_in, Types::Decimal.optional
|
21
|
+
attribute :balance, Types::Decimal
|
22
|
+
|
23
|
+
def change
|
24
|
+
paid_in || 0 - paid_out
|
25
|
+
end
|
26
|
+
end
|
27
|
+
|
28
|
+
class ImportedStatement < Dry::Struct
|
29
|
+
attribute :account_holder, Types::String
|
30
|
+
attribute :sortcode, Types::String
|
31
|
+
attribute :account_number, Types::String
|
32
|
+
|
33
|
+
attribute :sheets, Types::Instance(Range)
|
34
|
+
attribute :date_range, Types::Instance(Range)
|
35
|
+
|
36
|
+
attribute :opening_balance, Types::Decimal
|
37
|
+
attribute :closing_balance, Types::Decimal
|
38
|
+
attribute :payments_in, Types::Decimal
|
39
|
+
attribute :payments_out, Types::Decimal
|
40
|
+
|
41
|
+
attribute :transactions, Types::Array.of(Transaction)
|
42
|
+
end
|
43
|
+
|
44
|
+
# Parses the passed PDF file and returns an ImportedStatement
|
45
|
+
#
|
46
|
+
# === Parameters
|
47
|
+
#
|
48
|
+
# [filename] the filename to parse
|
49
|
+
def self.parse(filename)
|
50
|
+
StatementParser.new(filename).parse
|
51
|
+
end
|
52
|
+
end
|
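The `change` method above relies on Ruby's operator precedence: `-` binds tighter than `||`, so `paid_in || 0 - paid_out` means `paid_in || (0 - paid_out)`. The sketch below uses a plain `Struct` named `Tx` as a hypothetical, dependency-free stand-in for the gem's `Dry::Struct`-based `Transaction`, with integer amounts for brevity.

```ruby
# Stand-in for the gem's Transaction; a plain Struct keeps the sketch dependency-free.
Tx = Struct.new(:paid_in, :paid_out, keyword_init: true) do
  # Same expression as in the diff, with the implicit grouping written out.
  def change
    paid_in || (0 - paid_out)
  end
end

credit = Tx.new(paid_in: 25, paid_out: nil) # change is the amount paid in
debit = Tx.new(paid_in: nil, paid_out: 10)  # change is the negated amount paid out

puts credit.change
puts debit.change

# With neither amount set, `0 - nil` raises a TypeError.
begin
  Tx.new(paid_in: nil, paid_out: nil).change
rescue TypeError => e
  puts "TypeError: #{e.message}"
end
```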
data/treefmt.toml
ADDED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: hsbc_pdf_statement_parser
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version:
|
4
|
+
version: 2.0.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jon Pearse
|
8
|
-
autorequire:
|
8
|
+
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2024-05-06 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: pdf-reader
|
@@ -16,33 +16,64 @@ dependencies:
|
|
16
16
|
requirements:
|
17
17
|
- - "~>"
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: '2.
|
19
|
+
version: '2.9'
|
20
20
|
type: :runtime
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
24
|
- - "~>"
|
25
25
|
- !ruby/object:Gem::Version
|
26
|
-
version: '2.
|
27
|
-
|
26
|
+
version: '2.9'
|
27
|
+
- !ruby/object:Gem::Dependency
|
28
|
+
name: dry-struct
|
29
|
+
requirement: !ruby/object:Gem::Requirement
|
30
|
+
requirements:
|
31
|
+
- - "~>"
|
32
|
+
- !ruby/object:Gem::Version
|
33
|
+
version: '1.4'
|
34
|
+
type: :runtime
|
35
|
+
prerelease: false
|
36
|
+
version_requirements: !ruby/object:Gem::Requirement
|
37
|
+
requirements:
|
38
|
+
- - "~>"
|
39
|
+
- !ruby/object:Gem::Version
|
40
|
+
version: '1.4'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: rufo
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - ">="
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '0'
|
48
|
+
type: :development
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - ">="
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0'
|
55
|
+
description:
|
28
56
|
email: hello@jonpearse.net
|
29
57
|
executables: []
|
30
58
|
extensions: []
|
31
59
|
extra_rdoc_files: []
|
32
60
|
files:
|
61
|
+
- ".ruby-version"
|
33
62
|
- Gemfile
|
34
63
|
- LICENSE.txt
|
35
|
-
- README.
|
64
|
+
- README.md
|
36
65
|
- hsbc_pdf_statement_parser.gemspec
|
37
66
|
- lib/hsbc_pdf_statement_parser.rb
|
38
|
-
- lib/hsbc_pdf_statement_parser/
|
39
|
-
- lib/hsbc_pdf_statement_parser/
|
67
|
+
- lib/hsbc_pdf_statement_parser/reader.rb
|
68
|
+
- lib/hsbc_pdf_statement_parser/statement_parser.rb
|
69
|
+
- lib/hsbc_pdf_statement_parser/transaction_parser.rb
|
40
70
|
- lib/hsbc_pdf_statement_parser/version.rb
|
71
|
+
- treefmt.toml
|
41
72
|
homepage: https://github.com/jonpearse/hsbc-pdf-statement-parser
|
42
73
|
licenses:
|
43
74
|
- MIT
|
44
75
|
metadata: {}
|
45
|
-
post_install_message:
|
76
|
+
post_install_message:
|
46
77
|
rdoc_options: []
|
47
78
|
require_paths:
|
48
79
|
- lib
|
@@ -57,8 +88,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
57
88
|
- !ruby/object:Gem::Version
|
58
89
|
version: '0'
|
59
90
|
requirements: []
|
60
|
-
rubygems_version: 3.
|
61
|
-
signing_key:
|
91
|
+
rubygems_version: 3.4.19
|
92
|
+
signing_key:
|
62
93
|
specification_version: 4
|
63
94
|
summary: Quick and dirty RubyGem to parse HSBC’s statement PDFs
|
64
95
|
test_files: []
|
data/README.textile
DELETED
@@ -1,62 +0,0 @@
|
|
1
|
-
h1. HSBC PDF Statement Parser
|
2
|
-
|
3
|
-
This is a _very_ quick and dirty gem that swallows downloaded PDF files from HSBC (UK) and parses them into an array of hashes containing each transaction.
|
4
|
-
|
5
|
-
It exists solely because HSBC doesn’t seem to offer any way of exporting old statements as anything other than PDFs, which makes it a pain in the backside to import anything into any kind of finance packages.
|
6
|
-
You probably shouldn’t use it (see warnings below)
|
7
|
-
|
8
|
-
h2. Installation
|
9
|
-
|
10
|
-
Using bundler on the command line:
|
11
|
-
|
12
|
-
<pre>
|
13
|
-
$ bundle add hsbc_pdf_statement_parser
|
14
|
-
$ bundle
|
15
|
-
</pre>
|
16
|
-
|
17
|
-
h2. Usage
|
18
|
-
|
19
|
-
<pre>
|
20
|
-
require 'hsbc_pdf_statement_parser'
|
21
|
-
|
22
|
-
parser = HsbcPdfStatementParser::Parser.new( 'path/to/statement.pdf' )
|
23
|
-
parser.transactions.each do |tx|
|
24
|
-
|
25
|
-
printf( "%s: %-40s %7.02f\n", tx[:date], tx[:details].lines.first.strip, tx[:change] )
|
26
|
-
|
27
|
-
end
|
28
|
-
</pre>
|
29
|
-
|
30
|
-
h3. Methods
|
31
|
-
|
32
|
-
- @statements@ := returns transactions as an array of hashes (see hash keys section, below) _(Array Hash)_
|
33
|
-
- @opening_balance@ := returns the opening balance of the statement _(Float)_
|
34
|
-
- @closing_balance@ := returns the closing balance of the statement _(Float)_
|
35
|
-
- @payments_in@ := returns the total of all payments in _(Float)_
|
36
|
-
- @payments_out@ := returns the total of all payments out _(Float)_
|
37
|
-
|
38
|
-
Note that @payments_in@ and @payments_out@ are read from the header section at the top of the statement, and are not calculated from the transactions in the statement.
|
39
|
-
|
40
|
-
h3. Transaction hash keys
|
41
|
-
|
42
|
-
- @:date@ := a `Date` object describing the date of the transaction _(Date)_
|
43
|
-
- @:type@ := the type of the description, eg `DD` for a direct debit, `VIS` for visa, `(((` for contactless _(String)_
|
44
|
-
- @:details@ := the text description of the transaction. This may span multiple lines _(String)_
|
45
|
-
- @:in@ := the amount entering your account, or nil if an outbound transaction _(Float, nil)_
|
46
|
-
- @:out@ := the amount leaving your account, or nil if an inbound transaction _(Float, nil)_
|
47
|
-
- @:change@ := a calculated field showing the change to your bank balance: negative for debits, positive for credits _(Float)_
|
48
|
-
- @:balance@ := the balance of your account after the transaction, if present in the PDF _(Float, nil)_
|
49
|
-
|
50
|
-
*Note:* that the @:balance@ key is pulled straight from the PDF and will only be present for the last transaction on a particular day. I’m not doing anything even remotely clever here :)
|
51
|
-
|
52
|
-
h2. ⚠️ Warnings
|
53
|
-
|
54
|
-
This gem has been thrown together for my own needs, and is offered to the world just in case someone else might want to play around with it.
|
55
|
-
It seems to work pretty well with statements from my Advance account here in the UK, and may also work with other flavours of accounts from elsewhere in the world, but comes with absolutely zero guarantees or assurances.
|
56
|
-
|
57
|
-
That is to say: it seems to work OK for mucking around, but I’d recommend not using it for anything mission-critical, or in a situation that might lead you or others into making any kind of financial decisions.
|
58
|
-
Any dumb financial decisions made are entirely on you =)
|
59
|
-
|
60
|
-
Also, this gem contains a patch for "pdf-reader":https://github.com/yob/pdf-reader to help it better cope with the weird way in which HSBC seems to generate PDFs. This is unapologetically a massive hack, and really could do with someone far smarter than me to come up with a better solution.
|
61
|
-
For more information, see "pdf-reader issue #169":https://github.com/yob/pdf-reader/issues/169 which goes into a little more detail about what’s going on (my files seem to terminate the image data with @0xE0@, per my patch)
|
62
|
-
|
data/lib/hsbc_pdf_statement_parser/parser.rb
DELETED
@@ -1,166 +0,0 @@
|
|
1
|
-
module HsbcPdfStatementParser
|
2
|
-
|
3
|
-
class Parser
|
4
|
-
|
5
|
-
# Creates a new parser from a PDF file.
|
6
|
-
#
|
7
|
-
# === Parameters
|
8
|
-
#
|
9
|
-
# [filename] the filename to parse
|
10
|
-
def initialize( filename )
|
11
|
-
|
12
|
-
@reader = PDF::Reader.new( filename )
|
13
|
-
|
14
|
-
end
|
15
|
-
|
16
|
-
# Returns an array of the transactions in the document as hashes.
|
17
|
-
#
|
18
|
-
# === Hash keys
|
19
|
-
#
|
20
|
-
# [:date] the date of the transaction _(Date)_
|
21
|
-
# [:type] the type of the transaction, eg ‘VISA’, ‘DD’, ‘ATM’, etc _(String)_
|
22
|
-
# [:details] the details of the transaction. This can span multiple lines _(String)_
|
23
|
-
# [:out] the amount of the transaction, if a debit _(Float, nil)_
|
24
|
-
# [:in] the amount of the transaction, if a credit _(Float, nil)_
|
25
|
-
# [:change] the amount of the transacation: negative if a debit, positive if a credit _(Float)_
|
26
|
-
def transactions
|
27
|
-
|
28
|
-
@_transactions ||= begin
|
29
|
-
|
30
|
-
current_transaction = nil
|
31
|
-
current_date = nil
|
32
|
-
transactions = []
|
33
|
-
|
34
|
-
document_text
|
35
|
-
.scan( /BALANCE\s?BROUGHT\s?FORWARD(?:.*?)\n(.*?)BALANCE\s?CARRIED\s?FORWARD/im )
|
36
|
-
.map{ |text| parse_page( text[0] )}
|
37
|
-
.flatten
|
38
|
-
.each do |line|
|
39
|
-
|
40
|
-
# store the current date
|
41
|
-
current_date = line[:date] unless line[:date].nil?
|
42
|
-
|
43
|
-
# if we have a type, start a new transaction
|
44
|
-
unless line[:type].nil?
|
45
|
-
transactions << current_transaction unless current_transaction.nil?
|
46
|
-
current_transaction = line.merge( date: current_date )
|
47
|
-
next
|
48
|
-
end
|
49
|
-
|
50
|
-
# merge things in
|
51
|
-
current_transaction.merge!( line.select{ |k,v| v }, { details: "#{current_transaction[:details]}\n#{line[:details]}" })
|
52
|
-
|
53
|
-
end
|
54
|
-
|
55
|
-
# dump the final transaction + return
|
56
|
-
transactions << current_transaction unless current_transaction.nil?
|
57
|
-
transactions
|
58
|
-
end
|
59
|
-
|
60
|
-
end
|
61
|
-
|
62
|
-
# Returns the opening balance of the statement read from the table on the first page.
|
63
|
-
def opening_balance
|
64
|
-
|
65
|
-
@_opening_balance ||= scan_figure( 'Opening Balance' )
|
66
|
-
|
67
|
-
end
|
68
|
-
|
69
|
-
# Returns the closing balance of the statement read from the table on the first page.
|
70
|
-
def closing_balance
|
71
|
-
|
72
|
-
@_closing_balance ||= scan_figure( 'Closing Balance' )
|
73
|
-
|
74
|
-
end
|
75
|
-
|
76
|
-
# Returns the total value of payments in during the statement read from the table on the first page (ie: not calculated)
|
77
|
-
def payments_in
|
78
|
-
|
79
|
-
@_payments_in ||= scan_figure( 'Payments In' )
|
80
|
-
|
81
|
-
end
|
82
|
-
|
83
|
-
# Returns the total value of payments out during the statement read from the table on the first page (ie: not calculated)
|
84
|
-
def payments_out
|
85
|
-
|
86
|
-
@_payments_out ||= scan_figure( 'Payments Out' )
|
87
|
-
|
88
|
-
end
|
89
|
-
|
90
|
-
private
|
91
|
-
|
92
|
-
def document_text
|
93
|
-
|
94
|
-
@text ||= begin
|
95
|
-
|
96
|
-
@reader.pages.map( &:text ).join
|
97
|
-
|
98
|
-
end
|
99
|
-
|
100
|
-
end
|
101
|
-
|
102
|
-
def scan_figure( search_string )
|
103
|
-
|
104
|
-
@_first_page ||= @reader.pages.first.text
|
105
|
-
|
106
|
-
match = Regexp.new( "#{search_string}(?:.*?)([0-9\.\,]{4,})", Regexp::IGNORECASE ).match( @_first_page )
|
107
|
-
return nil if match.nil?
|
108
|
-
|
109
|
-
match[1].gsub( ',', '' ).to_f
|
110
|
-
|
111
|
-
end
|
112
|
-
|
113
|
-
def parse_page( page_str )
|
114
|
-
|
115
|
-
# grab lines + get the longest
|
116
|
-
lines = page_str.lines
|
117
|
-
max_length = lines.map( &:length ).max
|
118
|
-
|
119
|
-
lines.map{ |line| parse_line( line.rstrip.ljust( max_length ))}.compact
|
120
|
-
|
121
|
-
end
|
122
|
-
|
123
|
-
def parse_line( line_str )
|
124
|
-
|
125
|
-
# if we’re blank…
|
126
|
-
return nil if line_str.strip.empty?
|
127
|
-
|
128
|
-
# start cutting things up
|
129
|
-
row = {
|
130
|
-
date: empty_to_nil( line_str[0..12] ),
|
131
|
-
type: empty_to_nil( line_str[12..20] ),
|
132
|
-
details: empty_to_nil( line_str[20..70] ),
|
133
|
-
out: empty_to_nil( line_str[70..90] ),
|
134
|
-
in: empty_to_nil( line_str[90..110] ),
|
135
|
-
balance: empty_to_nil( line_str[110..130] )
|
136
|
-
}
|
137
|
-
|
138
|
-
# munge things further
|
139
|
-
row[:date] = Date.strptime( row[:date], '%d %b %y' ) unless row[:date].nil?
|
140
|
-
row[:out] = row[:out].gsub( ',', '' ).to_f unless row[:out].nil?
|
141
|
-
row[:in] = row[:in].gsub( ',', '' ).to_f unless row[:in].nil?
|
142
|
-
row[:balance] = row[:balance].gsub( ',', '' ).to_f unless row[:balance].nil?
|
143
|
-
|
144
|
-
# set a change amount
|
145
|
-
row[:change] = if !row[:out].nil?
|
146
|
-
0 - row[:out]
|
147
|
-
elsif !row[:in].nil?
|
148
|
-
row[:in]
|
149
|
-
else
|
150
|
-
nil
|
151
|
-
end
|
152
|
-
|
153
|
-
# return
|
154
|
-
row
|
155
|
-
|
156
|
-
end
|
157
|
-
|
158
|
-
def empty_to_nil( str )
|
159
|
-
|
160
|
-
( str.strip!.empty? ) ? nil : str
|
161
|
-
|
162
|
-
end
|
163
|
-
|
164
|
-
end
|
165
|
-
|
166
|
-
end
|
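The deleted parser above converted amounts with `to_f`, while the 2.x code uses `to_d`. A short sketch of why that matters once balances are compared for equality, as the new crosscheck does; the amounts are arbitrary sample values.

```ruby
require "bigdecimal"
require "bigdecimal/util"

# Binary floats cannot represent most decimal fractions exactly.
float_sum = "0.10".to_f + "0.20".to_f

# BigDecimal keeps the decimal digits exactly as written.
decimal_sum = "0.10".to_d + "0.20".to_d

puts float_sum == 0.3
puts decimal_sum == "0.30".to_d
```

With floats, an exact-equality check like the running-balance comparison could fail on statements that are in fact consistent, which is presumably part of the motivation for the switch.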
data/lib/hsbc_pdf_statement_parser/pdf_reader_patch.rb
DELETED
@@ -1,44 +0,0 @@
|
|
1
|
-
# This is a horrendous patch to work around issue #169 in the pdf-reader repo (https://github.com/yob/pdf-reader/issues/169).
|
2
|
-
#
|
3
|
-
# Short version is that whatever HSBC is using to generate PDFs doesn’t seem to null-/whitespace-terminate inline image
|
4
|
-
# data. Thus, when PdfReader tries to find the ‘EI’ token when parsing inline media, it can’t and simply runs off the end
|
5
|
-
# of the document, causing a TypeError to be thrown.
|
6
|
-
#
|
7
|
-
# The PDF files I’m getting all seem to end some of the images with xE0, so I’ve simply monkey-patched this into the
|
8
|
-
# library for use with my files.
|
9
|
-
# This may not be the case for anyone else, in which case maybe add whatever your problem character is to the regex and
|
10
|
-
# open a PR should you feel the need.
|
11
|
-
#
|
12
|
-
# Or, y’know, look at the PdfReader source and see if you can work out something better, because this is horrendous :/
|
13
|
-
|
14
|
-
module PdfReaderPatch
|
15
|
-
def self.included( base )
|
16
|
-
base.class_eval do
|
17
|
-
def prepare_inline_token
|
18
|
-
str = "".dup
|
19
|
-
|
20
|
-
buffer = []
|
21
|
-
to_rewind = -3
|
22
|
-
|
23
|
-
until buffer[0] =~ /\s|\0|\xE0/n && buffer[1, 2] == ['E', 'I']
|
24
|
-
chr = @io.read(1)
|
25
|
-
buffer << chr
|
26
|
-
|
27
|
-
if buffer.length > 3
|
28
|
-
str << buffer.shift
|
29
|
-
end
|
30
|
-
|
31
|
-
to_rewind = -2 if buffer.first =~ /\xE0/n
|
32
|
-
end
|
33
|
-
|
34
|
-
str << '\0' if buffer.first == '\0'
|
35
|
-
|
36
|
-
@tokens << string_token(str)
|
37
|
-
|
38
|
-
@io.seek(to_rewind, IO::SEEK_CUR) unless chr.nil?
|
39
|
-
end
|
40
|
-
end
|
41
|
-
end
|
42
|
-
end
|
43
|
-
|
44
|
-
PDF::Reader::Buffer.send( :include, PdfReaderPatch )
|