hsbc_pdf_statement_parser 1.0.1 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 1aeed5005a826d427a328576db61241bab262e44941ae1c0af5d34f7028f1281
4
- data.tar.gz: 1af68498d1b4e4732e42a7422a32930170f4870f85a458b0a166b95e0ef7b338
3
+ metadata.gz: 67d78ba59d35520d3f1c12ba39f5d53358a68c6d6a1eb7fc2aa8e9180a24db40
4
+ data.tar.gz: 5bf2f6a30605cc92ba6c780051956a017b9ba1e4cd973e4ba6e671bece3f48f7
5
5
  SHA512:
6
- metadata.gz: 64d6adeb4671eff35241d94d8f848fe54cbf6c324aa366e75d658d44b5113b468badde10b262ab092ab8b43e87402dad81cc10f9ba4b281c6fefc325c4533199
7
- data.tar.gz: bebec0676834fd0da389f2094017a930c795f103a7477e67062449d31c3598cd9571f5f6dcbb57ab06549e4b761515915e9f211d936bf1e7660452c2497a32db
6
+ metadata.gz: 2cb49c96f8ec208224928924e9bedcfbe1287b0c1847c1ced480fa67ff3ceb50d5b643aa04885a09b627dcb48fab249b152aa579a9b3e40c050862ada3e11af1
7
+ data.tar.gz: 4da639fa609cebb7101e15d47e36b5891486a649e4288884177dfe92373c6d429ecd756893e3e2525b77eff8e465e6a01a12736e37d9c0344a40c3dfb4f56f96
data/README.md ADDED
@@ -0,0 +1,93 @@
1
+ # HSBC PDF Statement Parser
2
+
3
+ This is a _very_ quick and dirty gem that swallows downloaded PDF files from HSBC (UK) and parses them into a [Dry::Struct](https://dry-rb.org/gems/dry-struct/1.0/) containing details of the statement + its transactions.
4
+
5
+ It exists solely because HSBC doesn’t seem to offer any way of exporting old statements as anything other than PDFs, which makes it a pain in the backside to import anything into any kind of finance package.
6
+ You probably shouldn’t use it (see warnings below)
7
+
8
+ ## Installation
9
+
10
+ Using bundler on the command line:
11
+
12
+ ```shell
13
+ $ bundle add hsbc_pdf_statement_parser
14
+ $ bundle
15
+ ```
16
+
17
+ ## Usage
18
+
19
+ This gem exposes one method: `parse`, which takes the path to a statement PDF and returns a `Dry::Struct` representation.
20
+
21
+ ```ruby
22
+ require 'hsbc_pdf_statement_parser'
23
+
24
+ parsed = HsbcPdfStatementParser.parse( 'path/to/statement.pdf' )
25
+ parsed.transactions.each do |tx|
26
+ printf(
27
+ "[%s] {%-3s} %-40s %7.02f | %7.02f\n",
28
+ tx.date,
29
+ tx.type,
30
+ tx.details.lines.first.strip,
31
+ tx.change,
32
+ tx.balance
33
+ )
34
+ end
35
+ ```
36
+
37
+ ### Statement Properties
38
+
39
+ - `account_holder`: the name of the account holder _(String)_
40
+ - `sortcode`: the sortcode shown on the statement _(String)_
41
+ - `account_number`: the account number shown on the statement _(String)_
42
+ - `sheets`: the sheets used in the statement _(Range[Int])_
43
+ - `date_range`: the date range shown on the first page of the statement _(Range[Date])_
44
+ - `opening_balance`: the opening balance of the statement _(Decimal)_
45
+ - `closing_balance`: the closing balance of the statement _(Decimal)_
46
+ - `payments_in`: the total of all transactions into the account _(Decimal)_
47
+ - `payments_out`: the total of all transactions out of the account _(Decimal)_
48
+
49
+ **Note:** `payments_in` and `payments_out` are those shown on the first page of the statement; they are neither calculated from nor checked against the parsed transactions.
50
+
51
+ Also note that sheet numbers are not guaranteed to be unique. Not sure why this is the case, but I have a few statements where the last sheet of one statement and the first of another have the same sheet number.
52
+
53
+ ### Transaction Properties
54
+
55
+ - `date`: the date of the transaction _(Date)_
56
+ - `type`: a string representation of the type of the transaction (eg. `DD` for a direct debit, `VIS` for VISA, etc) _(String)_
57
+ - `details`: a text description of the transaction, which may span multiple lines _(String)_
58
+ - `paid_in`: the amount paid in, if appropriate _(Decimal, nullable)_
59
+ - `paid_out`: the amount paid out, if appropriate _(Decimal, nullable)_
60
+ - `balance`: the balance of the account after the transaction _(Decimal)_
61
+ - `change`: the calculated change to the balance of the account: negative for debits, positive for credits _(Decimal)_
62
+
63
+ **Note:** unlike in V1, `balance` is now always present and is calculated as a running total based on the opening balance of the statement. Where the statement shows a running balance after transactions (seems to be once a day), this is checked and the parser will raise an error if any discrepancy is found.
64
+
65
+ ## ⚠️ Warnings
66
+
67
+ This gem has been thrown together for my own needs, and is offered to the world just in case someone else might want to play around with it.
68
+ It seems to work pretty well with statements from my Advance account here in the UK, and may also work with other flavours of accounts from elsewhere in the world, but comes with absolutely zero guarantees or assurances.
69
+
70
+ That is to say: it seems to work OK for mucking around, but I’d recommend not using it for anything mission-critical, or in a situation that might lead you or others into making any kind of financial decisions.
71
+ Any dumb financial decisions made are entirely on you =)
72
+
73
+ ### Wot? No tests
74
+
75
+ I have plenty; sadly, the only way of properly testing this code is by parsing real bank statements, and I’m not about to commit any of those to GitHub. Sorry!
76
+
77
+ ## Upgrading from V1.x
78
+
79
+ For various reasons I’ve not worried too much about trying to maintain backward compatibility. Migration should be relatively minimal, though:
80
+
81
+ ### Breaking changes
82
+
83
+ 1. invocation: `HsbcPdfStatementParser::Parser.new(…)` becomes `HsbcPdfStatementParser.parse(…)`
84
+ 2. when using parsed transactions, `in` and `out` are now `paid_in` and `paid_out` respectively
85
+ 3. any use of `fetch(…)` on parsed transactions will need to be replaced with bare function calls (hash accessors—ie `tx[:date]`—will continue to work)
86
+
87
+ ### Nonbreaking changes
88
+
89
+ Aside from the new properties added to the main Statement type, the biggest difference is that a transaction’s `balance` property is now always specified, whereas it was only specified once a day in V1.x.
90
+
91
+ ---
92
+
93
+ Share and enjoy
@@ -3,18 +3,17 @@ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
3
3
  require "hsbc_pdf_statement_parser/version"
4
4
 
5
5
  Gem::Specification.new do |spec|
6
-
7
- spec.name = 'hsbc_pdf_statement_parser'
8
- spec.summary = 'Quick and dirty RubyGem to parse HSBC’s statement PDFs'
9
- spec.license = 'MIT'
10
-
11
- spec.authors = 'Jon Pearse'
12
- spec.email = 'hello@jonpearse.net'
13
- spec.homepage = 'https://github.com/jonpearse/hsbc-pdf-statement-parser'
14
-
15
- spec.version = HsbcPdfStatementParser::VERSION
16
- spec.files = `git ls-files`.split($\)
17
-
18
- spec.add_dependency 'pdf-reader', '~> 2.4'
19
-
6
+ spec.name = "hsbc_pdf_statement_parser"
7
+ spec.summary = "Quick and dirty RubyGem to parse HSBC’s statement PDFs"
8
+ spec.license = "MIT"
9
+
10
+ spec.authors = "Jon Pearse"
11
+ spec.email = "hello@jonpearse.net"
12
+ spec.homepage = "https://github.com/jonpearse/hsbc-pdf-statement-parser"
13
+
14
+ spec.version = HsbcPdfStatementParser::VERSION
15
+ spec.files = `git ls-files`.split($\)
16
+
17
+ spec.add_dependency "pdf-reader", "~> 2.9"
18
+ spec.add_dependency "dry-struct", "~> 1.4"
20
19
  end
@@ -0,0 +1,24 @@
1
+ module HsbcPdfStatementParser
2
+ class Reader
3
+ def initialize(filename)
4
+ @reader = PDF::Reader.new(filename)
5
+ end
6
+
7
+ def first_page
8
+ @_first_page ||= @reader.pages.first.text
9
+ end
10
+
11
+ def all_text
12
+ @_all_text ||= @reader.pages.map(&:text).join
13
+ end
14
+
15
+ def statement_blocks
16
+ @_statement_lines ||= begin
17
+ @reader.pages.map do |page|
18
+ match = page.text.match(/BALANCE\s?BROUGHT\s?FORWARD(?:.*?)\n(.*?)BALANCE\s?CARRIED\s?FORWARD/im)
19
+ match ? match[1] : nil
20
+ end.compact
21
+ end
22
+ end
23
+ end
24
+ end
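To see what the regular expression in `statement_blocks` captures, here is a minimal sketch that runs the same pattern over invented page text. The helper name `extract_statement_blocks` and the sample pages are assumptions made for illustration; the real class takes its text from `PDF::Reader` pages.

```ruby
# Hypothetical helper: takes an array of page texts and returns the text found
# between the "brought forward" and "carried forward" lines on each page.
def extract_statement_blocks(page_texts)
  # Same pattern as Reader#statement_blocks: /m lets "." span newlines, /i ignores case.
  pattern = /BALANCE\s?BROUGHT\s?FORWARD(?:.*?)\n(.*?)BALANCE\s?CARRIED\s?FORWARD/im

  # String#[] with a regexp and a group index returns that capture, or nil on no match.
  page_texts.map { |text| text[pattern, 1] }.compact
end

# Invented sample pages; real statement text will be laid out differently.
sample_pages = [
  "Your Statement\n" \
  "01 Jan 22      BALANCE BROUGHT FORWARD          100.00\n" \
  "02 Jan 22 DD   EXAMPLE ENERGY LTD      25.00     75.00\n" \
  "               BALANCE CARRIED FORWARD           75.00\n",
  "A page of marketing text with no transaction table\n",
]

blocks = extract_statement_blocks(sample_pages)
puts blocks.length
puts blocks.first
```

Pages without both marker lines contribute nothing, which is why the second sample page is dropped.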
@@ -0,0 +1,125 @@
1
+ module HsbcPdfStatementParser
2
+ class StatementParser
3
+ attr_reader :_statement_lines
4
+
5
+ def initialize(filename)
6
+ @reader = Reader.new(filename)
7
+
8
+ # puts @reader.first_page
9
+ end
10
+
11
+ def parse
12
+ opening_balance = scan_figure("Opening Balance")
13
+ closing_balance = scan_figure("Closing Balance")
14
+ meta = get_meta
15
+
16
+ ImportedStatement.new(
17
+ account_holder: meta[:account_holder],
18
+ sortcode: meta[:sortcode],
19
+ account_number: meta[:account_no],
20
+ sheets: meta[:sheets],
21
+ date_range: get_date_range,
22
+ opening_balance: opening_balance,
23
+ closing_balance: closing_balance,
24
+ payments_in: scan_figure("Payments In"),
25
+ payments_out: scan_figure("Payments Out"),
26
+ transactions: parse_transactions(opening_balance),
27
+ )
28
+ end
29
+
30
+ private def scan_figure(search_string)
31
+ # note: I do not like how … general this regex is, but HSBC seems to love putting random noise characters into
32
+ # opening and closing balance, so there’s only so much I can do about it.
33
+ match = Regexp.new("#{search_string}(.*?)\n", Regexp::IGNORECASE).match(@reader.first_page)
34
+ raise ParsingError.new("Could not find #{search_string}") if match.nil?
35
+
36
+ match[1].strip.gsub(",", "").gsub(/\s/, "").to_d
37
+ end
38
+
39
+ private def get_meta
40
+ all_meta = @reader.all_text.scan(/Account\s?Name\s+Sortcode\s+Account\s?Number\s+Sheet Number\n+([A-Z\s]+?)\s\s+([\d\-]+)\s\s+(\d+)\s\s+(\d+)\n/i)
41
+ raise ParsingError.new("Cannot find statement metadata") if all_meta.empty?
42
+
43
+ # check everything makes sense
44
+ raise ParsingError.new("Error parsing account name") unless all_elements_same(all_meta.map(&:first))
45
+ raise ParsingError.new("Error parsing sort code") unless all_elements_same(all_meta.map { |a| a[1] })
46
+ raise ParsingError.new("Error parsing account number") unless all_elements_same(all_meta.map { |a| a[2] })
47
+
48
+ # get page numbers
49
+ first_page = all_meta.first[3]
50
+ last_page = all_meta.last[3]
51
+ raise ParsingError.new("Error parsing sheet numbers") if first_page.to_i > last_page.to_i
52
+
53
+ {
54
+ sheets: Range.new(first_page.to_i, last_page.to_i),
55
+ account_holder: all_meta.dig(0, 0).strip,
56
+ sortcode: all_meta.dig(0, 1).strip,
57
+ account_no: all_meta.dig(0, 2).strip,
58
+ }
59
+ end
60
+
61
+ private def get_date_range
62
+ dates = @reader.first_page.match(/(\d{1,2}) ([a-z]+)(?: (\d{4}))? to (\d{1,2}) ([a-z]+) (\d{4})/i)
63
+ raise ParsingError.new("Cannot find date range") unless dates
64
+ dates = %i{start_day start_month start_year end_day end_month end_year}.zip(dates.captures).to_h
65
+
66
+ # default the start year
67
+ dates[:start_year] ||= dates[:end_year]
68
+ start_date = Date.parse("#{dates[:start_day]} #{dates[:start_month]} #{dates[:start_year]}")
69
+ end_date = Date.parse("#{dates[:end_day]} #{dates[:end_month]} #{dates[:end_year]}")
70
+ raise ParsingError.new("Error parsing date range") if (start_date > end_date)
71
+
72
+ Range.new(start_date, end_date)
73
+ end
74
+
75
+ private def parse_transactions(opening_balance)
76
+ # Get the raw information out of the PDF text
77
+ parser = TransactionParser.new
78
+ transactions = parser.parse(@reader.statement_blocks)
79
+
80
+ # start crosschecking!
81
+ running_balance = opening_balance
82
+ transactions.map do |tx|
83
+
84
+ # work out what we’re doing
85
+ running_balance += tx[:paid_in] || (0 - tx[:paid_out])
86
+
87
+ # push a calculated balance in if none is present
88
+ if tx[:balance].nil?
89
+ tx[:balance] = running_balance
90
+ else
91
+ raise TransactionCalculationError.new(tx[:balance], running_balance) if running_balance != tx[:balance]
92
+ end
93
+
94
+ # all good, so create a new Transaction object
95
+ Transaction.new(tx)
96
+ end
97
+ end
98
+
99
+ private def all_elements_same(arry)
100
+ arry.uniq.length == 1
101
+ end
102
+ end
103
+
104
+ class ParsingError < StandardError
105
+ attr_reader :info
106
+
107
+ def initialize(message, info = nil)
108
+ @info = info
109
+ super(message)
110
+ end
111
+ end
112
+
113
+ class TransactionCalculationError < StandardError
114
+ attr_reader :expected, :calculated
115
+
116
+ def initialize(expected, calculated)
117
+ @expected = expected
118
+ @calculated = calculated
119
+ end
120
+
121
+ def message
122
+ "Expected #{@expected} but got #{@calculated}"
123
+ end
124
+ end
125
+ end
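The crosscheck in `parse_transactions` can be tried in isolation. The following is a rough sketch using plain hashes and `BigDecimal`; the function name `fill_running_balances`, the `ArgumentError`, and the sample rows are assumptions for illustration and not part of the gem, which raises `TransactionCalculationError` and builds `Transaction` structs.

```ruby
require "bigdecimal"
require "bigdecimal/util"

# Hypothetical stand-in for the running-balance crosscheck.
# Each row is assumed to carry exactly one of :paid_in or :paid_out.
def fill_running_balances(opening_balance, rows)
  running = opening_balance

  rows.map do |row|
    # Credits add to the balance, debits subtract from it.
    running += row[:paid_in] || (0 - row[:paid_out])

    if row[:balance].nil?
      # No printed balance on this row: use the calculated one.
      row.merge(balance: running)
    elsif row[:balance] == running
      row
    else
      raise ArgumentError, "Expected #{row[:balance].to_s("F")} but got #{running.to_s("F")}"
    end
  end
end

# Invented sample data.
rows = [
  { details: "COFFEE", paid_out: "2.50".to_d, paid_in: nil, balance: nil },
  { details: "SALARY", paid_out: nil, paid_in: "1000.00".to_d, balance: "1097.50".to_d },
]

checked = fill_running_balances("100.00".to_d, rows)
checked.each { |row| puts "#{row[:details]}: #{row[:balance].to_s("F")}" }
```

Using `BigDecimal` rather than `Float` matters here because the check is an exact equality comparison.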
@@ -0,0 +1,133 @@
1
+ require "bigdecimal/util"
2
+ require "date"
3
+
4
+ module HsbcPdfStatementParser
5
+ class TransactionParser
6
+ attr_reader :transactions
7
+
8
+ def parse(input)
9
+ # We need to parse pages individually because the column widths may not be the same on each page
10
+ parsed_lines = input.reduce([]) { |lines, page| lines + parse_page(page) }
11
+
12
+ # build out transactions from our parsed data
13
+ @transactions = parse_transactions(parsed_lines)
14
+ end
15
+
16
+ private def parse_page(page_data)
17
+ # get the maximum line length for this page
18
+ max_line_length = page_data.lines.map(&:length).max
19
+
20
+ # work out the columns being used
21
+ column_indices = find_columns(page_data, max_line_length)
22
+
23
+ # get the column keys and produce a map for parsing
24
+ column_keys = get_column_keys(column_indices, max_line_length)
25
+ column_map = column_keys.zip(column_indices).to_h
26
+
27
+ # parse out all the non-empty lines
28
+ page_data.lines.reduce([]) do |parsed, line|
29
+ next parsed if line.strip.empty?
30
+
31
+ parsed << parse_line(line, column_map, max_line_length)
32
+ end
33
+ end
34
+
35
+ private def find_columns(input, max_line_length)
36
+ state = Array.new(max_line_length, " ")
37
+
38
+ # scan the content + look for obvious columns
39
+ input.lines.each do |line|
40
+ # bounce if there’s no point…
41
+ next if line.strip.empty?
42
+
43
+ # map the state
44
+ line.split("").each.with_index { |chr, i| state[i] = "X" if chr != " " }
45
+ end
46
+
47
+ # We will almost certainly have something like XX XXX XX for the date column, so replace any single space that has
48
+ # an X on either side with an X
49
+ state = state.join.gsub(/X X/, "XXX")
50
+
51
+ # find the columns
52
+ columns = []
53
+ state.gsub(/X+/) { columns << Range.new(*$~.offset(0)) }
54
+ columns
55
+ end
56
+
57
+ private def get_column_keys(indices, max_line_length)
58
+ # if we have 6 columns, we’re all good…
59
+ return %i{date type details paid_out paid_in balance} if indices.length == 6
60
+
61
+ # if a statement contains no payments in or no payments out, it’ll only have 5 columns, so we need to work out
62
+ # which the additional column is
63
+ # We can do this with an assumption that the ‘paid out’ column is usually at around 2/3 of the total line length…
64
+ has_paid_out = indices.map { |i| i.first.to_f / max_line_length }.any? { |p| p > 0.6 && p < 0.7 }
65
+ additional = has_paid_out ? :paid_out : :paid_in
66
+
67
+ [:date, :type, :details, additional, :balance]
68
+ end
69
+
70
+ private def parse_line(line, map, max_line_length)
71
+ # lengthen the line
72
+ line = line.rstrip.ljust(max_line_length)
73
+
74
+ # cut things up
75
+ row = {
76
+ paid_out: nil,
77
+ paid_in: nil,
78
+ }
79
+ map.each { |k, v| row[k] = empty_to_nil(line[v]) }
80
+
81
+ # further processing
82
+ row[:date] = Date.strptime(row[:date], "%d %b %y") unless row[:date].nil?
83
+ row[:paid_out] = parse_decimal(row[:paid_out])
84
+ row[:paid_in] = parse_decimal(row[:paid_in])
85
+ row[:balance] = parse_decimal(row[:balance])
86
+
87
+ row
88
+ end
89
+
90
+ private def parse_decimal(str)
91
+ return nil if str.nil?
92
+
93
+ mult = str.end_with?("D") ? -1 : 1
94
+
95
+ str.gsub(",", "").to_d * mult
96
+ end
97
+
98
+ private def empty_to_nil(str)
99
+ str = str.strip
100
+
101
+ str.empty? ? nil : str
102
+ end
103
+
104
+ private def parse_transactions(parsed_lines)
105
+ current_date = nil
106
+ current_transaction = nil
107
+
108
+ transactions = []
109
+
110
+ parsed_lines.each do |parsed|
111
+ # if we have a new date, store it
112
+ current_date = parsed[:date] unless parsed[:date].nil?
113
+
114
+ # if we have a type, it’s a new transaction, so push any existing info onto the list + reset
115
+ unless parsed[:type].nil?
116
+ transactions << current_transaction unless current_transaction.nil?
117
+ current_transaction = parsed.merge(date: current_date)
118
+ next
119
+ end
120
+
121
+ # otherwise, merge things together
122
+ current_transaction = current_transaction.merge(
123
+ parsed.reject { |_, v| v.nil? },
124
+ { details: "#{current_transaction[:details]}\n#{parsed[:details]}".strip }
125
+ )
126
+ end
127
+
128
+ # shove the last transaction onto the list + return
129
+ transactions << current_transaction unless current_transaction.nil?
130
+ transactions
131
+ end
132
+ end
133
+ end
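The column detection in `find_columns` is easier to follow on a small input. Below is a minimal stand-alone sketch of the same occupancy-mask idea; `detect_columns`, the row format, and the sample rows are invented for illustration, and it returns exclusive ranges, which differs slightly from the gem's `Range.new(*$~.offset(0))`.

```ruby
# Hypothetical stand-alone version of the occupancy-mask approach.
def detect_columns(text)
  lines = text.lines.map(&:chomp).reject { |line| line.strip.empty? }

  # One slot per character position; "X" marks positions used by any line.
  mask = Array.new(lines.map(&:length).max, " ")
  lines.each do |line|
    line.each_char.with_index { |chr, i| mask[i] = "X" if chr != " " }
  end

  # Bridge single-space gaps so "01 Jan 22" counts as one column.
  bridged = mask.join.gsub("X X", "XXX")

  # Every run of X is a column; record it as an exclusive character range.
  columns = []
  bridged.scan(/X+/) do
    match = Regexp.last_match
    columns << (match.begin(0)...match.end(0))
  end
  columns
end

# Invented fixed-width rows with at least two spaces between columns.
row_format = "%-9s  %-3s  %-18s  %8s  %8s  %9s"
sample = [
  format(row_format, "01 Jan 22", "DD", "EXAMPLE ENERGY", "25.00", "", "75.00"),
  format(row_format, "", "CR", "EXAMPLE PAYROLL", "", "100.00", "175.00"),
].join("\n")

columns = detect_columns(sample)
cells = sample.lines.map { |line| columns.map { |range| line[range].to_s.strip } }
cells.each { |row| p row }
```

Because the mask is built from every line on the page, a column shows up even if only one row uses it, which is why the gem detects columns per page.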
@@ -1,3 +1,3 @@
1
1
  module HsbcPdfStatementParser
2
- VERSION = "1.0.1"
2
+ VERSION = "2.0.0"
3
3
  end
@@ -1,7 +1,52 @@
1
- require 'date'
2
- require 'pdf-reader'
3
- require 'hsbc_pdf_statement_parser/pdf_reader_patch'
4
- require 'hsbc_pdf_statement_parser/parser'
1
+ require "bigdecimal/util"
2
+ require "date"
3
+ require "dry-struct"
4
+ require "pdf-reader"
5
+
6
+ require "hsbc_pdf_statement_parser/reader"
7
+ require "hsbc_pdf_statement_parser/statement_parser"
8
+ require "hsbc_pdf_statement_parser/transaction_parser"
5
9
 
6
10
  module HsbcPdfStatementParser
7
- end
11
+ module Types
12
+ include Dry.Types()
13
+ end
14
+
15
+ class Transaction < Dry::Struct
16
+ attribute :date, Types::Date
17
+ attribute :type, Types::String
18
+ attribute :details, Types::String
19
+ attribute :paid_out, Types::Decimal.optional
20
+ attribute :paid_in, Types::Decimal.optional
21
+ attribute :balance, Types::Decimal
22
+
23
+ def change
24
+ paid_in || (0 - paid_out)
25
+ end
26
+ end
27
+
28
+ class ImportedStatement < Dry::Struct
29
+ attribute :account_holder, Types::String
30
+ attribute :sortcode, Types::String
31
+ attribute :account_number, Types::String
32
+
33
+ attribute :sheets, Types::Instance(Range)
34
+ attribute :date_range, Types::Instance(Range)
35
+
36
+ attribute :opening_balance, Types::Decimal
37
+ attribute :closing_balance, Types::Decimal
38
+ attribute :payments_in, Types::Decimal
39
+ attribute :payments_out, Types::Decimal
40
+
41
+ attribute :transactions, Types::Array.of(Transaction)
42
+ end
43
+
44
+ # Parses the passed PDF file and returns an ImportedStatement
45
+ #
46
+ # === Parameters
47
+ #
48
+ # [filename] the filename to parse
49
+ def self.parse(filename)
50
+ StatementParser.new(filename).parse
51
+ end
52
+ end
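As a quick illustration of how `change` behaves, `Transaction#change` returns `paid_in` when it is present and otherwise the negated `paid_out`. The sketch below mimics that with a plain `Struct`; `sketch_transaction` is a hypothetical stand-in and not the gem's `Dry::Struct` class.

```ruby
require "bigdecimal"
require "bigdecimal/util"

# Hypothetical stand-in for Transaction, limited to the two amount fields.
sketch_transaction = Struct.new(:paid_in, :paid_out, keyword_init: true) do
  # Positive for credits, negative for debits.
  def change
    paid_in || (0 - paid_out)
  end
end

credit = sketch_transaction.new(paid_in: "1000.00".to_d, paid_out: nil)
debit = sketch_transaction.new(paid_in: nil, paid_out: "2.50".to_d)

puts credit.change.to_s("F")
puts debit.change.to_s("F")
```

A row with neither amount set would raise here, since `0 - nil` is not defined; the sketch assumes every transaction carries one of the two.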
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: hsbc_pdf_statement_parser
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.1
4
+ version: 2.0.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Jon Pearse
8
- autorequire:
8
+ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2019-11-26 00:00:00.000000000 Z
11
+ date: 2022-02-26 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: pdf-reader
@@ -16,15 +16,29 @@ dependencies:
16
16
  requirements:
17
17
  - - "~>"
18
18
  - !ruby/object:Gem::Version
19
- version: '2.4'
19
+ version: '2.9'
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - "~>"
25
25
  - !ruby/object:Gem::Version
26
- version: '2.4'
27
- description:
26
+ version: '2.9'
27
+ - !ruby/object:Gem::Dependency
28
+ name: dry-struct
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '1.4'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '1.4'
41
+ description:
28
42
  email: hello@jonpearse.net
29
43
  executables: []
30
44
  extensions: []
@@ -32,17 +46,18 @@ extra_rdoc_files: []
32
46
  files:
33
47
  - Gemfile
34
48
  - LICENSE.txt
35
- - README.textile
49
+ - README.md
36
50
  - hsbc_pdf_statement_parser.gemspec
37
51
  - lib/hsbc_pdf_statement_parser.rb
38
- - lib/hsbc_pdf_statement_parser/parser.rb
39
- - lib/hsbc_pdf_statement_parser/pdf_reader_patch.rb
52
+ - lib/hsbc_pdf_statement_parser/reader.rb
53
+ - lib/hsbc_pdf_statement_parser/statement_parser.rb
54
+ - lib/hsbc_pdf_statement_parser/transaction_parser.rb
40
55
  - lib/hsbc_pdf_statement_parser/version.rb
41
56
  homepage: https://github.com/jonpearse/hsbc-pdf-statement-parser
42
57
  licenses:
43
58
  - MIT
44
59
  metadata: {}
45
- post_install_message:
60
+ post_install_message:
46
61
  rdoc_options: []
47
62
  require_paths:
48
63
  - lib
@@ -57,8 +72,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
57
72
  - !ruby/object:Gem::Version
58
73
  version: '0'
59
74
  requirements: []
60
- rubygems_version: 3.0.6
61
- signing_key:
75
+ rubygems_version: 3.2.22
76
+ signing_key:
62
77
  specification_version: 4
63
78
  summary: Quick and dirty RubyGem to parse HSBC’s statement PDFs
64
79
  test_files: []
data/README.textile DELETED
@@ -1,62 +0,0 @@
1
- h1. HSBC PDF Statement Parser
2
-
3
- This is a _very_ quick and dirty gem that swallows downloaded PDF files from HSBC (UK) and parses them into an array of hashes containing each transaction.
4
-
5
- It exists solely because HSBC doesn’t seem to offer any way of exporting old statements as anything other than PDFs, which makes it a pain in the backside to import anything into any kind of finance packages.
6
- You probably shouldn’t use it (see warnings below)
7
-
8
- h2. Installation
9
-
10
- Using bundler on the command line:
11
-
12
- <pre>
13
- $ bundle add hsbc_pdf_statement_parser
14
- $ bundle
15
- </pre>
16
-
17
- h2. Usage
18
-
19
- <pre>
20
- require 'hsbc_pdf_statement_parser'
21
-
22
- parser = HsbcPdfStatementParser::Parser.new( 'path/to/statement.pdf' )
23
- parser.transactions.each do |tx|
24
-
25
- printf( "%s: %-40s %7.02f\n", tx[:date], tx[:details].lines.first.strip, tx[:change] )
26
-
27
- end
28
- </pre>
29
-
30
- h3. Methods
31
-
32
- - @statements@ := returns transactions as an array of hashes (see hash keys section, below) _(Array Hash)_
33
- - @opening_balance@ := returns the opening balance of the statement _(Float)_
34
- - @closing_balance@ := returns the closing balance of the statement _(Float)_
35
- - @payments_in@ := returns the total of all payments in _(Float)_
36
- - @payments_out@ := returns the total of all payments out _(Float)_
37
-
38
- Note that @payments_in@ and @payments_out@ are read from the header section at the top of the statement, and are not calculated from the transactions in the statement.
39
-
40
- h3. Transaction hash keys
41
-
42
- - @:date@ := a `Date` object describing the date of the transaction _(Date)_
43
- - @:type@ := the type of the description, eg `DD` for a direct debit, `VIS` for visa, `(((` for contactless _(String)_
44
- @:details@ := the text description of the transaction. This may span multiple lines _(String)_
45
- - @:in@ := the amount entering your account, or nil if an outbound transaction _(Float, nil)_
46
- - @:out@ := the amount leaving your account, or nil if an inbound transaction _(Float, nil)_
47
- - @:change@ := a calculated field showing the change to your bank balance: negative for debits, positive for credits _(Float)_
48
- - @:balance@ := the balance of your account after the transaction, if present in the PDF _(Float, nil)_
49
-
50
- *Note:* that the @:balance@ key is pulled straight from the PDF and will only be present for the last transaction on a particular day. I’m not doing anything even remotely clever here :)
51
-
52
- h2. ⚠️ Warnings
53
-
54
- This gem has been thrown together for my own needs, and is offered to the world just in case someone else might want to play around with it.
55
- It seems to work pretty well with statements from my Advance account here in the UK, and may also work with other flavours of accounts from elsewhere in the world, but comes with absolutely zero guarantees or assurances.
56
-
57
- That is to say: it seems to work OK for mucking around, but I’d recommend not using it for anything mission-critical, or in a situation that might lead you or others into making any kind of financial decisions.
58
- Any dumb financial decisions made are entirely on you =)
59
-
60
- Also, this gem contains a patch for "pdf-reader":https://github.com/yob/pdf-reader to help it better cope with the weird way in which HSBC seems to generate PDFs. This is unapologetically a massive hack, and really could do with someone far smarter than me to come up with a better solution.
61
- For more information, see "pdf-reader issue #169":https://github.com/yob/pdf-reader/issues/169 which goes into a little more detail about what’s going on (my files seem to terminate the image data with @0xE0@, per my patch)
62
-
@@ -1,166 +0,0 @@
1
- module HsbcPdfStatementParser
2
-
3
- class Parser
4
-
5
- # Creates a new parser from a PDF file.
6
- #
7
- # === Parameters
8
- #
9
- # [filename] the filename to parse
10
- def initialize( filename )
11
-
12
- @reader = PDF::Reader.new( filename )
13
-
14
- end
15
-
16
- # Returns an array of the transactions in the document as hashes.
17
- #
18
- # === Hash keys
19
- #
20
- # [:date] the date of the transaction _(Date)_
21
- # [:type] the type of the transaction, eg ‘VISA’, ‘DD’, ‘ATM’, etc _(String)_
22
- # [:details] the details of the transaction. This can span multiple lines _(String)_
23
- # [:out] the amount of the transaction, if a debit _(Float, nil)_
24
- # [:in] the amount of the transaction, if a credit _(Float, nil)_
25
- # [:change] the amount of the transaction: negative if a debit, positive if a credit _(Float)_
26
- def transactions
27
-
28
- @_transactions ||= begin
29
-
30
- current_transaction = nil
31
- current_date = nil
32
- transactions = []
33
-
34
- document_text
35
- .scan( /BALANCE\s?BROUGHT\s?FORWARD(?:.*?)\n(.*?)BALANCE\s?CARRIED\s?FORWARD/im )
36
- .map{ |text| parse_page( text[0] )}
37
- .flatten
38
- .each do |line|
39
-
40
- # store the current date
41
- current_date = line[:date] unless line[:date].nil?
42
-
43
- # if we have a type, start a new transaction
44
- unless line[:type].nil?
45
- transactions << current_transaction unless current_transaction.nil?
46
- current_transaction = line.merge( date: current_date )
47
- next
48
- end
49
-
50
- # merge things in
51
- current_transaction.merge!( line.select{ |k,v| v }, { details: "#{current_transaction[:details]}\n#{line[:details]}" })
52
-
53
- end
54
-
55
- # dump the final transaction + return
56
- transactions << current_transaction unless current_transaction.nil?
57
- transactions
58
- end
59
-
60
- end
61
-
62
- # Returns the opening balance of the statement read from the table on the first page.
63
- def opening_balance
64
-
65
- @_opening_balance ||= scan_figure( 'Opening Balance' )
66
-
67
- end
68
-
69
- # Returns the closing balance of the statement read from the table on the first page.
70
- def closing_balance
71
-
72
- @_closing_balance ||= scan_figure( 'Closing Balance' )
73
-
74
- end
75
-
76
- # Returns the total value of payments in during the statement read from the table on the first page (ie: not calculated)
77
- def payments_in
78
-
79
- @_payments_in ||= scan_figure( 'Payments In' )
80
-
81
- end
82
-
83
- # Returns the total value of payments out during the statement read from the table on the first page (ie: not calculated)
84
- def payments_out
85
-
86
- @_payments_out ||= scan_figure( 'Payments Out' )
87
-
88
- end
89
-
90
- private
91
-
92
- def document_text
93
-
94
- @text ||= begin
95
-
96
- @reader.pages.map( &:text ).join
97
-
98
- end
99
-
100
- end
101
-
102
- def scan_figure( search_string )
103
-
104
- @_first_page ||= @reader.pages.first.text
105
-
106
- match = Regexp.new( "#{search_string}(?:.*?)([0-9\.\,]{4,})", Regexp::IGNORECASE ).match( @_first_page )
107
- return nil if match.nil?
108
-
109
- match[1].gsub( ',', '' ).to_f
110
-
111
- end
112
-
113
- def parse_page( page_str )
114
-
115
- # grab lines + get the longest
116
- lines = page_str.lines
117
- max_length = lines.map( &:length ).max
118
-
119
- lines.map{ |line| parse_line( line.rstrip.ljust( max_length ))}.compact
120
-
121
- end
122
-
123
- def parse_line( line_str )
124
-
125
- # if we’re blank…
126
- return nil if line_str.strip.empty?
127
-
128
- # start cutting things up
129
- row = {
130
- date: empty_to_nil( line_str[0..12] ),
131
- type: empty_to_nil( line_str[12..20] ),
132
- details: empty_to_nil( line_str[20..70] ),
133
- out: empty_to_nil( line_str[70..90] ),
134
- in: empty_to_nil( line_str[90..110] ),
135
- balance: empty_to_nil( line_str[110..130] )
136
- }
137
-
138
- # munge things further
139
- row[:date] = Date.strptime( row[:date], '%d %b %y' ) unless row[:date].nil?
140
- row[:out] = row[:out].gsub( ',', '' ).to_f unless row[:out].nil?
141
- row[:in] = row[:in].gsub( ',', '' ).to_f unless row[:in].nil?
142
- row[:balance] = row[:balance].gsub( ',', '' ).to_f unless row[:balance].nil?
143
-
144
- # set a change amount
145
- row[:change] = if !row[:out].nil?
146
- 0 - row[:out]
147
- elsif !row[:in].nil?
148
- row[:in]
149
- else
150
- nil
151
- end
152
-
153
- # return
154
- row
155
-
156
- end
157
-
158
- def empty_to_nil( str )
159
-
160
- ( str.strip!.empty? ) ? nil : str
161
-
162
- end
163
-
164
- end
165
-
166
- end
@@ -1,44 +0,0 @@
1
- # This is a horrendous patch to work around issue #169 in the pdf-reader repo (https://github.com/yob/pdf-reader/issues/169).
2
- #
3
- # Short version is that whatever HSBC is using to generate PDFs doesn’t seem to null-/whitespace-terminate inline image
4
- # data. Thus, when PdfReader tries to find the ‘EI’ token when parsing inline media, it can’t and simply runs off the end
5
- # of the document, causing a TypeError to be thrown.
6
- #
7
- # The PDF files I’m getting all seem to end some of the images with xE0, so I’ve simply monkey-patched this into the
8
- # library for use with my files.
9
- # This may not be the case for anyone else, in which case maybe add whatever your problem character is to the regex and
10
- # open a PR should you feel the need.
11
- #
12
- # Or, y’know, look at the PdfReader source and see if you can work out something better, because this is horrendous :/
13
-
14
- module PdfReaderPatch
15
- def self.included( base )
16
- base.class_eval do
17
- def prepare_inline_token
18
- str = "".dup
19
-
20
- buffer = []
21
- to_rewind = -3
22
-
23
- until buffer[0] =~ /\s|\0|\xE0/n && buffer[1, 2] == ['E', 'I']
24
- chr = @io.read(1)
25
- buffer << chr
26
-
27
- if buffer.length > 3
28
- str << buffer.shift
29
- end
30
-
31
- to_rewind = -2 if buffer.first =~ /\xE0/n
32
- end
33
-
34
- str << '\0' if buffer.first == '\0'
35
-
36
- @tokens << string_token(str)
37
-
38
- @io.seek(to_rewind, IO::SEEK_CUR) unless chr.nil?
39
- end
40
- end
41
- end
42
- end
43
-
44
- PDF::Reader::Buffer.send( :include, PdfReaderPatch )