simple_xlsx_reader 1.0.5 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: e2b04473235c5ed2c2764f62a627fa6f16816c36e0fcff3497be229f8666a0f7
-   data.tar.gz: 9367b0082f31e9cb208d9f97ed6cb67d5276a459562809460694602339dfdaad
+   metadata.gz: 979490ce3bd7f0482879fb5fb5465e10ad1b07c1488d0a544950131d9063050a
+   data.tar.gz: 412d0040a586cc5ee4acdd4a2f74dd74f3bf9eb781a35d8a36c12f6caadc566c
  SHA512:
-   metadata.gz: cd42f7a0b8830a2f01703dca10ae779b973566ad25e3b74d31dc3693977fa5b2b3442e47bc1a3b50723bae3bb9f31facd923f1eaba06b51cc8b927e7fb207cf3
-   data.tar.gz: 38ecb026b0ad5a1985d88349a839a9d2972f85596504e6f300686f9751169a3c8d62582e79119106085a9cadc066517206da117993c3a30f48a5a0c58f256b4c
+   metadata.gz: 00c01bc0c2a393eb35e458411dfeab55b8bf30cee2661324cbd97a175baf0ceb31a881b1b2b7bd668a2b475ff008372c1428908340e30769308884355fdd46e8
+   data.tar.gz: 81b1b26806a97c56710cab64aa22212985dea82b308e2fbba6835f4ea7a69b79067268bb13537999594dc5722928f1df235938355a7d4a51b58ae7ed4af1d093
@@ -0,0 +1,38 @@
+ # This workflow uses actions that are not certified by GitHub.
+ # They are provided by a third-party and are governed by
+ # separate terms of service, privacy policy, and support
+ # documentation.
+ # This workflow will download a prebuilt Ruby version, install dependencies and run tests with Rake
+ # For more information see: https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby
+
+ name: Ruby
+
+ on:
+   push:
+     branches: [ "master" ]
+   pull_request:
+     branches: [ "master" ]
+
+ permissions:
+   contents: read
+
+ jobs:
+   test:
+
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         ruby-version: ['2.6', '2.7', '3.0']
+
+     steps:
+     - uses: actions/checkout@v3
+     - name: Set up Ruby
+     # To automatically get bug fixes and new Ruby versions for ruby/setup-ruby,
+     # change this to (see https://github.com/ruby/setup-ruby#versioning):
+     # uses: ruby/setup-ruby@v1
+       uses: ruby/setup-ruby@2b019609e2b0f1ea1a2bc8ca11cb82ab46ada124
+       with:
+         ruby-version: ${{ matrix.ruby-version }}
+         bundler-cache: true # runs 'bundle install' and caches installed gems automatically
+     - name: Run tests
+       run: bundle exec rake
data/CHANGELOG.md CHANGED
@@ -1,3 +1,10 @@
+ ### 2.0.0
+
+ * SPEED
+   * Reimplement internals in terms of a SAX parser
+   * Change `SimpleXlsxReader::Sheet#rows` to be a `RowsProxy` that streams `#each`
+ * Convenience - use `rows#each(headers: true)` to get header names while enumerating rows
+
  ### 1.0.5
 
  * Support string or io input via `SimpleXlsxReader#parse` (@kalsan, @til)
data/README.md CHANGED
@@ -1,88 +1,214 @@
- # SimpleXlsxReader [![Build Status](https://travis-ci.org/woahdae/simple_xlsx_reader.svg?branch=master)](https://travis-ci.org/woahdae/simple_xlsx_reader)
+ # SimpleXlsxReader
 
- An xlsx reader for Ruby that parses xlsx cell values into plain ruby
- primitives and dates/times.
+ A [fast](#performance) xlsx reader for Ruby that parses xlsx cell values into
+ plain ruby primitives and dates/times.
 
  This is *not* a rewrite of excel in Ruby. Font styles, for
  example, are parsed to determine whether a cell is a number or a date,
  then forgotten. We just want to get the data, and get out!
 
- ## Usage
-
- ### Summary:
+ ## Summary (now with stream parsing):
 
      doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
      doc.sheets # => [<#SXR::Sheet>, ...]
      doc.sheets.first.name # 'Sheet1'
-     doc.sheets.first.rows # [['Header 1', 'Header 2', ...]
-                             ['foo', 2, ...]]
+     doc.sheets.first.rows # <SXR::Document::RowsProxy>
+     doc.sheets.first.rows.each # an <Enumerator> ready to chain or stream
+     doc.sheets.first.rows.each {} # Streams the rows to your block
+     doc.sheets.first.rows.each(headers: true) {} # Streams row-hashes
+     doc.sheets.first.rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+     doc.sheets.first.rows.slurp # Slurps rows into memory as a 2D array
 
- That's it!
+ That's the gist of it!
 
- ### Load Errors
+ See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0-pre/lib/simple_xlsx_reader/document.rb) object.
 
- By default, cell load errors (ex. if a date cell contains the string
- 'hello') result in a SimpleXlsxReader::CellLoadError.
+ ## Why?
 
- If you would like to provide better error feedback to your users, you
- can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
- true`, and load errors will instead be inserted into Sheet#load_errors keyed
- by [rownum, colnum].
+ ### Accurate
 
- ### More
+ This project was started years ago, primarily because other Ruby xlsx parsers
+ didn't import data with the correct types. Numbers as strings, dates as numbers,
+ hyperlinks with inaccessible URLs, or - subtly buggy - simple dates as DateTime
+ objects. If your app uses a timezone offset, depending on what timezone and
+ what time of day you load the xlsx file, your dates might end up a day off!
+ SimpleXlsxReader understands all these correctly.
 
- Here's the totality of the public api, in code:
+ ### Idiomatic
 
-     module SimpleXlsxReader
-       def self.open(file_path)
-         Document.new(file_path: file_path).tap(&:sheets)
-       end
+ Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
+ SimpleXlsxReader strives to be fairly idiomatic Ruby:
 
-       def self.parse(string_or_io)
-         Document.new(string_or_io: string_or_io).tap(&:sheets)
+     # quick example having fun w/ ruby
+     doc = SimpleXlsxReader.open(path_or_io)
+     doc.sheets.first.rows.each(headers: {id: /ID/})
+       .with_index.with_object({}) do |(row, index), acc|
+         acc[row[:id]] = index
        end
 
-       class Document
-         attr_reader :string_or_io
-
-         def initialize(legacy_file_path = nil, file_path: nil, string_or_io: nil)
-           ((file_path || legacy_file_path).nil? ^ string_or_io.nil?) ||
-             fail(ArgumentError, 'either file_path or string_or_io must be provided')
-
-           @string_or_io = string_or_io || File.new(file_path || legacy_file_path)
-         end
-
-         def sheets
-           @sheets ||= Mapper.new(xml).load_sheets
-         end
-
-         def to_hash
-           sheets.inject({}) {|acc, sheet| acc[sheet.name] = sheet.rows; acc}
-         end
-
-         def xml
-           Xml.load(string_or_io)
-         end
-
-         class Sheet < Struct.new(:name, :rows)
-           def headers
-             rows[0]
-           end
-
-           def data
-             rows[1..-1]
-           end
-
-           # Load errors will be a hash of the form:
-           # {
-           #   [rownum, colnum] => '[error]'
-           # }
-           def load_errors
-             @load_errors ||= {}
-           end
-         end
+ ### Now faster
+
+ Finally, as of v2.0, SimpleXlsxReader is the fastest and most
+ memory-efficient parser. Previously this project couldn't reasonably load
+ anything over ~10k rows. Other parsers could load 100k+ rows, but were still
+ taking ~1gb RSS to do so, even "streaming," which seemed excessive. So a SAX
+ implementation was born. See [performance](#performance) for details.
+
+ ## Usage
+
+ ### Streaming
+
+ SimpleXlsxReader is performant by default - If you use
+ `rows.each {|row| ...}` it will stream the XLSX rows to your block without
+ loading either the sheet XML or the full sheet data into memory.
+
+ You can also chain `rows.each` with other Enumerable functions without
+ triggering a slurp, and you have lots of ways to find and map headers while
+ streaming.
+
+ If you had an excel sheet representing this data:
+
+ ```
+ | Hero ID | Hero Name | Location |
+ | 13576 | Samus Aran | Planet Zebes |
+ | 117 | John Halo | Ring World |
+ | 9704133 | Iron Man | Planet Earth |
+ ```
+
+ Get a handle on the rows proxy:
+
+ `rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows`
+
+ Simple streaming (kinda boring):
+
+ `rows.each { |row| ... }`
+
+ Streaming with headers, and how about a little enumerable chaining:
+
+ ```
+ # Map of hero names by ID: { 117 => 'John Halo', ... }
+
+ rows.each(headers: true).with_object({}) do |row, acc|
+   acc[row['Hero ID']] = row['Hero Name']
+ end
+ ```
+
+ Sometimes though you have some junk at the top of your spreadsheet:
+
+ ```
+ | Unofficial Report | | |
+ | Dont tell Nintendo | Yes "John Halo" I know | |
+ | | | |
+ | Hero ID | Hero Name | Location |
+ | 13576 | Samus Aran | Planet Zebes |
+ | 117 | John Halo | Ring World |
+ | 9704133 | Iron Man | Planet Earth |
+ ```
+
+ For this, `headers` can be a hash whose keys replace headers and whose values
+ help find the correct header row:
+
+ ```
+ # Same map of hero names by ID: { 117 => 'John Halo', ... }
+
+ rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
+   acc[row[:id]] = row[:name]
+ end
+ ```
+
+ If your header-to-attribute mapping is more complicated than key/value, you
+ can do the mapping elsewhere, but use a block to find the header row:
+
+ ```
+ # Example roughly analogous to some production code mapping a single spreadsheet
+ # across many objects. Might be a simpler way now that we have the headers-hash
+ # feature.
+
+ object_map = { Hero => { id: 'Hero ID', name: 'Hero Name', location: 'Location' } }
+
+ HEADERS = ['Hero ID', 'Hero Name', 'Location']
+
+ rows.each(headers: ->(row) { (HEADERS & row).any? }) do |row|
+   object_map.each_pair do |klass, attribute_map|
+     attributes =
+       attribute_map.each_pair.with_object({}) do |(key, header), attrs|
+         attrs[key] = row[header]
        end
-     end
+
+     klass.new(attributes)
+   end
+ end
+ ```
+
+ ### Slurping
+
+ To make SimpleXlsxReader rows act like an array, for use with legacy
+ SimpleXlsxReader apps or otherwise, we still support slurping the whole array
+ into memory. The good news is even when doing this, the xlsx worksheet & shared
+ string files are never loaded as a (big) Nokogiri doc, so that's nice.
+
+ By default, to prevent accidental slurping, `<RowsProxy>` will throw an exception
+ if you try to access it with array methods like `[]` and `shift` without
+ explicitly slurping first. You can slurp either by calling `rows.slurp` or
+ globally by setting `SimpleXlsxReader.configuration.auto_slurp = true`.
+
+ Once slurped, enumerable methods on `rows` will use the slurped data
+ (i.e. not re-parse the sheet), and those Array-like methods will work.
+
+ We don't support all Array methods, just the few we have used in real projects,
+ as we transition towards streaming instead.
+
+ ### Load Errors
+
+ By default, cell load errors (ex. if a date cell contains the string
+ 'hello') result in a SimpleXlsxReader::CellLoadError.
+
+ If you would like to provide better error feedback to your users, you
+ can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
+ true`, and load errors will instead be inserted into Sheet#load_errors keyed
+ by [rownum, colnum]:
+
+     {
+       [rownum, colnum] => '[error]'
+     }
+
+ ### Performance
+
+ SimpleXlsxReader is (as of this writing) the fastest and most memory efficient
+ Ruby xlsx parser.
+
+ Recent updates here have focused on large spreadsheets with especially
+ non-unique strings in sheets using xlsx' shared strings feature
+ (Excel-generated spreadsheets always use this). Other projects have implemented
+ streaming parsers for the sheet data, but currently none stream while loading
+ the shared strings file, which is the second-largest file in an xlsx archive
+ and can represent millions of strings in large files.
+
+ For more details, see [my fork of @shkm's excel benchmark project](https://github.com/woahdae/excel-parsing-benchmarks), but here's the summary:
+
+ 1mb excel file, 10,000 rows of sample "sales records" with a fair amount of
+ non-unique strings (ran on an M1 Macbook Pro):
+
+ | Gem                | Parses/second | RSS Increase | Allocated Mem | Retained Mem | Allocated Objects | Retained Objects |
+ |--------------------|---------------|--------------|---------------|--------------|-------------------|------------------|
+ | simple_xlsx_reader | 1.13          | 36.94mb      | 614.51mb      | 1.13kb       | 8796275           | 3                |
+ | roo                | 0.75          | 74.0mb       | 164.47mb      | 2.18kb       | 2128396           | 4                |
+ | creek              | 0.65          | 107.55mb     | 581.38mb      | 3.3kb        | 7240760           | 16               |
+ | xsv                | 0.61          | 75.66mb      | 2127.42mb     | 3.66kb       | 5922563           | 10               |
+ | rubyxl             | 0.27          | 373.52mb     | 716.7mb       | 2.18kb       | 10612577          | 4                |
+
+ Here is a benchmark for the "worst" file I've seen, a 26mb file whose shared
+ strings represent 10% of the archive (note, MemoryProfiler has too much
+ overhead to reasonably measure allocations so that analysis was left off, and
+ we just measure total time for one parse):
+
+ | Gem                | Time    | RSS Increase |
+ |--------------------|---------|--------------|
+ | simple_xlsx_reader | 28.71s  | 148.77mb     |
+ | roo                | 40.25s  | 1322.08mb    |
+ | xsv                | 45.82s  | 391.27mb     |
+ | creek              | 60.63s  | 886.81mb     |
+ | rubyxl             | 238.68s | 9136.3mb     |
 
  ## Installation
 
data/Rakefile CHANGED
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  require "bundler/gem_tasks"
 
  require 'rake/testtask'
@@ -6,4 +8,4 @@ Rake::TestTask.new do |t|
    t.libs << 'test'
  end
 
- task :default => [:test]
+ task default: [:test]
@@ -0,0 +1,147 @@
+ # frozen_string_literal: true
+
+ require 'forwardable'
+
+ module SimpleXlsxReader
+
+   ##
+   # Main class for the public API. See the README for usage examples,
+   # or read the code, it's pretty friendly.
+   class Document
+     attr_reader :file_path
+
+     def initialize(file_path)
+       @file_path = file_path
+     end
+
+     def sheets
+       @sheets ||= Loader.new(file_path).init_sheets
+     end
+
+     # Expensive because it slurps all the sheets into memory,
+     # probably only appropriate for testing
+     def to_hash
+       sheets.each_with_object({}) { |sheet, acc| acc[sheet.name] = sheet.rows.to_a; }
+     end
+
+     # `rows` is a RowsProxy that responds to #each
+     class Sheet
+       extend Forwardable
+
+       attr_reader :name, :rows
+
+       def_delegators :rows, :load_errors, :slurp
+
+       def initialize(name:, sheet_parser:)
+         @name = name
+         @rows = RowsProxy.new(sheet_parser: sheet_parser)
+       end
+
+       # Legacy - consider `rows.each(headers: true)` for better performance
+       def headers
+         rows.slurped![0]
+       end
+
+       # Legacy - consider `rows` or `rows.each(headers: true)` for better
+       # performance
+       def data
+         rows.slurped![1..-1]
+       end
+     end
+
+     # Waits until we call #each with a block to parse the rows
+     class RowsProxy
+       include Enumerable
+
+       attr_reader :slurped, :load_errors
+
+       def initialize(sheet_parser:)
+         @sheet_parser = sheet_parser
+         @slurped = nil
+         @load_errors = {}
+       end
+
+       # By default, #each streams the rows to the provided block, either as
+       # arrays, or as header => cell value pairs if provided a `headers:`
+       # argument.
+       #
+       # `headers` can be:
+       #
+       # * `true` - simply takes the first row as the header row
+       # * block - calls the block with successive rows until the block returns
+       #   true, which it then uses that row for the headers. All data prior to
+       #   finding the headers is ignored.
+       # * hash - transforms the header row by replacing cells with keys matched
+       #   by value, ex. `{id: /ID|Identity/, name: /Name/i, date: 'Date'}` would
+       #   potentially yield the row `{id: 5, name: 'Jane', date: [Date object]}`
+       #   instead of the headers from the sheet. It would also search for the
+       #   row that matches at least one header, in case the header row isn't the
+       #   first.
+       #
+       # If rows have been slurped, #each will iterate the slurped rows instead.
+       #
+       # Note, calls to this after slurping will raise if given the `headers:`
+       # argument, as that's handled by the sheet parser. If this is important
+       # to someone, speak up and we could potentially support it.
+       def each(headers: false, &block)
+         if slurped?
+           raise '#each does not support headers with slurped rows' if headers
+
+           slurped.each(&block)
+         elsif block_given?
+           # It's possible to slurp while yielding to the block, which would
+           # null out @sheet_parser, so let's just keep track of it here too
+           sheet_parser = @sheet_parser
+           @sheet_parser.parse(headers: headers, &block).tap do
+             @load_errors = sheet_parser.load_errors
+           end
+         else
+           to_enum(:each, headers: headers)
+         end
+       end
+
+       # Mostly for legacy support, I'm not aware of a use case for doing this
+       # when you don't have to.
+       #
+       # Note that #each will use slurped results if available, and since we're
+       # leveraging Enumerable, all the other Enumerable methods will too.
+       def slurp
+         # possibly release sheet parser from memory on next GC run;
+         # untested, but it can hold a lot of stuff, so worth a try
+         @slurped ||= to_a.tap { @sheet_parser = nil }
+       end
+
+       def slurped?
+         !!@slurped
+       end
+
+       def slurped!
+         check_slurped
+
+         slurped
+       end
+
+       def [](*args)
+         check_slurped
+
+         slurped[*args]
+       end
+
+       def shift(*args)
+         check_slurped
+
+         slurped.shift(*args)
+       end
+
+       private
+
+       def check_slurped
+         slurp if SimpleXlsxReader.configuration.auto_slurp
+         return if slurped?
+
+         raise 'Called a slurp-y method without explicitly slurping;'\
+               ' use #each or call rows.slurp first'
+       end
+     end
+   end
+ end
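
Note on the `RowsProxy` design above: the following is a minimal, self-contained sketch of the same pattern (stream from `#each`, return an Enumerator when no block is given, and guard Array-style access until an explicit slurp). `MiniRows`, `header_keys`, and the array-backed `source` are hypothetical stand-ins, not the gem's code; the real class delegates header handling to its SAX sheet parser. Keeping the original header text as the key for unmatched columns is an assumption of this sketch.

```ruby
# Hypothetical stand-in for RowsProxy. `source` is any object whose #each
# yields row arrays; the real gem streams rows from a SAX sheet parser.
class MiniRows
  include Enumerable

  def initialize(source)
    @source = source
    @slurped = nil
  end

  # Streams rows to the block; without a block, returns an Enumerator so
  # callers can chain (with_index, with_object, ...) without slurping.
  def each(headers: false, &block)
    return to_enum(:each, headers: headers) unless block

    if @slurped
      raise ArgumentError, 'headers: is not supported after slurping' if headers

      return @slurped.each(&block)
    end
    return @source.each(&block) unless headers

    keys = nil
    @source.each do |row|
      if keys
        block.call(keys.zip(row).to_h)        # data row -> hash keyed by header
      else
        keys = header_keys(row, headers)      # still searching for the header row
      end
    end
  end

  # Loads every row into memory once; later calls reuse the cached array.
  def slurp
    @slurped ||= to_a
  end

  # Array-style access is refused until the caller slurps explicitly.
  def [](*args)
    raise 'call #slurp before using Array-style access' unless @slurped

    @slurped[*args]
  end

  private

  # Returns the keys for a header row, or nil if `row` is not the header row.
  def header_keys(row, headers)
    return row if headers == true

    keys = row.map do |cell|
      match = headers.find { |_key, pattern| pattern === cell }
      match ? match.first : cell              # assumption: unmatched cells keep their text
    end
    keys == row ? nil : keys                  # nothing matched: keep searching
  end
end

sheet = [
  ['Unofficial Report', nil, nil],
  ['Hero ID', 'Hero Name', 'Location'],
  [13576, 'Samus Aran', 'Planet Zebes'],
  [117, 'John Halo', 'Ring World']
]

rows = MiniRows.new(sheet)
names_by_id = rows.each(headers: { id: /ID/, name: /Name/ }).with_object({}) do |row, acc|
  acc[row[:id]] = row[:name]
end
p names_by_id
```

Because `#each` is the only place rows are produced, every Enumerable method inherits the stream-or-slurped behaviour for free, which appears to be the point of including `Enumerable` in the diff above.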
@@ -0,0 +1,30 @@
+ # frozen_string_literal: true
+
+ module SimpleXlsxReader
+   # We support hyperlinks as a "type" even though they're technically
+   # represented either as a function or an external reference in the xlsx spec.
+   #
+   # Since having hyperlink data in our sheet usually means we might want to do
+   # something primarily with the URL (store it in the database, download it, etc),
+   # we go through extra effort to parse the function or follow the reference
+   # to represent the hyperlink primarily as a URL. However, maybe we do want
+   # the hyperlink "friendly name" part (as MS calls it), so here we've subclassed
+   # string to tack on the friendly name. This means 80% of us that just want
+   # the URL value will have to do nothing extra, but the 20% that might want the
+   # friendly name can access it.
+   #
+   # Note, by default, the value we would get by just asking the cell would
+   # be the "friendly name" and *not* the URL, which is tucked away in the
+   # function definition or a separate "relationships" meta-document.
+   #
+   # See MS documentation on the HYPERLINK function for some background:
+   # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
+   class Hyperlink < String
+     attr_reader :friendly_name
+
+     def initialize(url, friendly_name = nil)
+       @friendly_name = friendly_name
+       super(url)
+     end
+   end
+ end
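
To illustrate what the `String` subclass above buys callers, here is a small standalone sketch. The class body is copied from the diff so the snippet runs without the gem; the URL and friendly name are made-up sample values.

```ruby
# Standalone copy of the class from the diff above, so this runs without the gem.
module SimpleXlsxReader
  class Hyperlink < String
    attr_reader :friendly_name

    def initialize(url, friendly_name = nil)
      @friendly_name = friendly_name
      super(url)
    end
  end
end

# Sample values are hypothetical.
link = SimpleXlsxReader::Hyperlink.new('https://example.com/report', 'Quarterly report')

puts link                                    # prints the URL, like any String
puts link.friendly_name                      # the extra label travels with it
puts link == 'https://example.com/report'    # compares by content with plain Strings
puts link.start_with?('https://')            # String methods work unchanged
```

One caveat worth keeping in mind: strings derived from a `Hyperlink` (for example via `upcase` or slicing) should not be expected to carry the friendly name, so read `friendly_name` from the original cell value.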
@@ -0,0 +1,46 @@
+ # frozen_string_literal: true
+
+ module SimpleXlsxReader
+   class Loader
+     # For performance reasons, excel uses an optional SpreadsheetML feature
+     # that puts all strings in a separate xml file, and then references
+     # them by their index in that file.
+     #
+     # http://msdn.microsoft.com/en-us/library/office/gg278314.aspx
+     class SharedStringsParser < Nokogiri::XML::SAX::Document
+       def self.parse(file)
+         new.tap do |parser|
+           Nokogiri::XML::SAX::Parser.new(parser).parse(file)
+         end.result
+       end
+
+       def initialize
+         @result = []
+         @composite = false
+         @extract = false
+       end
+
+       attr_reader :result
+
+       def start_element(name, _attrs = [])
+         case name
+         when 'si' then @current_string = +"" # UTF-8 variant of String.new
+         when 't' then @extract = true
+         end
+       end
+
+       def characters(string)
+         return unless @extract
+
+         @current_string << string
+       end
+
+       def end_element(name)
+         case name
+         when 't' then @extract = false
+         when 'si' then @result << @current_string
+       end
+       end
+     end
+   end
+ end
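
The state machine in `SharedStringsParser` can be followed without Nokogiri by feeding the same three callbacks by hand. The sketch below is an illustration only: `SharedStringsHandler` and the `events` list are hypothetical, and the events stand in for what a SAX parser would emit for a shared strings file containing one plain string and one rich-text string split across two runs.

```ruby
# Hypothetical handler mirroring the callbacks in the diff above, minus the
# Nokogiri base class, so the event flow can be traced with plain Ruby.
class SharedStringsHandler
  attr_reader :result

  def initialize
    @result = []
    @extract = false
    @current_string = nil
  end

  def start_element(name, _attrs = [])
    case name
    when 'si' then @current_string = +''   # new shared string item begins
    when 't'  then @extract = true         # text inside <t> belongs to the item
    end
  end

  def characters(string)
    # Appending (rather than assigning) matters: one <si> can hold several
    # <t> runs, and a SAX parser may deliver one text node in several pieces.
    @current_string << string if @extract
  end

  def end_element(name)
    case name
    when 't'  then @extract = false
    when 'si' then @result << @current_string
    end
  end
end

# Events a SAX parser would emit for roughly:
#   <sst><si><t>Plain</t></si>
#        <si><r><t>Rich </t></r><r><t>text</t></r></si></sst>
events = [
  [:start_element, 'sst'],
  [:start_element, 'si'], [:start_element, 't'], [:characters, 'Plain'],
  [:end_element, 't'], [:end_element, 'si'],
  [:characters, "\n  "],                    # whitespace between items is ignored
  [:start_element, 'si'],
  [:start_element, 'r'], [:start_element, 't'], [:characters, 'Rich '],
  [:end_element, 't'], [:end_element, 'r'],
  [:start_element, 'r'], [:start_element, 't'], [:characters, 'text'],
  [:end_element, 't'], [:end_element, 'r'],
  [:end_element, 'si'],
  [:end_element, 'sst']
]

handler = SharedStringsHandler.new
events.each { |callback, arg| handler.public_send(callback, arg) }
p handler.result
```

Each `<si>` contributes exactly one entry to `result`, so a cell's shared-string index can be used directly as an array index, which is what lets the sheet parser resolve strings without holding the XML document in memory.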