simple_xlsx_reader 1.0.5 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: e2b04473235c5ed2c2764f62a627fa6f16816c36e0fcff3497be229f8666a0f7
-   data.tar.gz: 9367b0082f31e9cb208d9f97ed6cb67d5276a459562809460694602339dfdaad
+   metadata.gz: 979490ce3bd7f0482879fb5fb5465e10ad1b07c1488d0a544950131d9063050a
+   data.tar.gz: 412d0040a586cc5ee4acdd4a2f74dd74f3bf9eb781a35d8a36c12f6caadc566c
  SHA512:
-   metadata.gz: cd42f7a0b8830a2f01703dca10ae779b973566ad25e3b74d31dc3693977fa5b2b3442e47bc1a3b50723bae3bb9f31facd923f1eaba06b51cc8b927e7fb207cf3
-   data.tar.gz: 38ecb026b0ad5a1985d88349a839a9d2972f85596504e6f300686f9751169a3c8d62582e79119106085a9cadc066517206da117993c3a30f48a5a0c58f256b4c
+   metadata.gz: 00c01bc0c2a393eb35e458411dfeab55b8bf30cee2661324cbd97a175baf0ceb31a881b1b2b7bd668a2b475ff008372c1428908340e30769308884355fdd46e8
+   data.tar.gz: 81b1b26806a97c56710cab64aa22212985dea82b308e2fbba6835f4ea7a69b79067268bb13537999594dc5722928f1df235938355a7d4a51b58ae7ed4af1d093
@@ -0,0 +1,38 @@
+ # This workflow uses actions that are not certified by GitHub.
+ # They are provided by a third-party and are governed by
+ # separate terms of service, privacy policy, and support
+ # documentation.
+ # This workflow will download a prebuilt Ruby version, install dependencies and run tests with Rake
+ # For more information see: https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby
+
+ name: Ruby
+
+ on:
+   push:
+     branches: [ "master" ]
+   pull_request:
+     branches: [ "master" ]
+
+ permissions:
+   contents: read
+
+ jobs:
+   test:
+
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         ruby-version: ['2.6', '2.7', '3.0']
+
+     steps:
+     - uses: actions/checkout@v3
+     - name: Set up Ruby
+     # To automatically get bug fixes and new Ruby versions for ruby/setup-ruby,
+     # change this to (see https://github.com/ruby/setup-ruby#versioning):
+     # uses: ruby/setup-ruby@v1
+       uses: ruby/setup-ruby@2b019609e2b0f1ea1a2bc8ca11cb82ab46ada124
+       with:
+         ruby-version: ${{ matrix.ruby-version }}
+         bundler-cache: true # runs 'bundle install' and caches installed gems automatically
+     - name: Run tests
+       run: bundle exec rake
data/CHANGELOG.md CHANGED
@@ -1,3 +1,10 @@
+ ### 2.0.0
+
+ * SPEED
+   * Reimplement internals in terms of a SAX parser
+   * Change `SimpleXlsxReader::Sheet#rows` to be a `RowsProxy` that streams `#each`
+ * Convenience - use `rows#each(headers: true)` to get header names while enumerating rows
+
  ### 1.0.5
 
  * Support string or io input via `SimpleXlsxReader#parse` (@kalsan, @til)
data/README.md CHANGED
@@ -1,88 +1,214 @@
- # SimpleXlsxReader [![Build Status](https://travis-ci.org/woahdae/simple_xlsx_reader.svg?branch=master)](https://travis-ci.org/woahdae/simple_xlsx_reader)
+ # SimpleXlsxReader
 
- An xlsx reader for Ruby that parses xlsx cell values into plain ruby
- primitives and dates/times.
+ A [fast](#performance) xlsx reader for Ruby that parses xlsx cell values into
+ plain ruby primitives and dates/times.
 
  This is *not* a rewrite of excel in Ruby. Font styles, for
  example, are parsed to determine whether a cell is a number or a date,
  then forgotten. We just want to get the data, and get out!
 
- ## Usage
-
- ### Summary:
+ ## Summary (now with stream parsing):
 
      doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
      doc.sheets # => [<#SXR::Sheet>, ...]
      doc.sheets.first.name # 'Sheet1'
-     doc.sheets.first.rows # [['Header 1', 'Header 2', ...]
-     ['foo', 2, ...]]
+     doc.sheets.first.rows # <SXR::Document::RowsProxy>
+     doc.sheets.first.rows.each # an <Enumerator> ready to chain or stream
+     doc.sheets.first.rows.each {} # Streams the rows to your block
+     doc.sheets.first.rows.each(headers: true) {} # Streams row-hashes
+     doc.sheets.first.rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+     doc.sheets.first.rows.slurp # Slurps rows into memory as a 2D array
 
- That's it!
+ That's the gist of it!
 
- ### Load Errors
+ See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0-pre/lib/simple_xlsx_reader/document.rb) object.
 
- By default, cell load errors (ex. if a date cell contains the string
- 'hello') result in a SimpleXlsxReader::CellLoadError.
+ ## Why?
 
- If you would like to provide better error feedback to your users, you
- can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
- true`, and load errors will instead be inserted into Sheet#load_errors keyed
- by [rownum, colnum].
+ ### Accurate
 
- ### More
+ This project was started years ago, primarily because other Ruby xlsx parsers
+ didn't import data with the correct types. Numbers as strings, dates as numbers,
+ hyperlinks with inaccessible URLs, or - subtly buggy - simple dates as DateTime
+ objects. If your app uses a timezone offset, depending on what timezone and
+ what time of day you load the xlsx file, your dates might end up a day off!
+ SimpleXlsxReader understands all these correctly.
 
- Here's the totality of the public api, in code:
+ ### Idiomatic
 
-     module SimpleXlsxReader
-       def self.open(file_path)
-         Document.new(file_path: file_path).tap(&:sheets)
-       end
+ Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
+ SimpleXlsxReader strives to be fairly idiomatic Ruby:
 
-       def self.parse(string_or_io)
-         Document.new(string_or_io: string_or_io).tap(&:sheets)
+     # quick example having fun w/ ruby
+     doc = SimpleXlsxReader.open(path_or_io)
+     doc.sheets.first.rows.each(headers: {id: /ID/})
+       .with_index.with_object({}) do |(row, index), acc|
+         acc[row[:id]] = index
        end
 
-       class Document
-         attr_reader :string_or_io
-
-         def initialize(legacy_file_path = nil, file_path: nil, string_or_io: nil)
-           ((file_path || legacy_file_path).nil? ^ string_or_io.nil?) ||
-             fail(ArgumentError, 'either file_path or string_or_io must be provided')
-
-           @string_or_io = string_or_io || File.new(file_path || legacy_file_path)
-         end
-
-         def sheets
-           @sheets ||= Mapper.new(xml).load_sheets
-         end
-
-         def to_hash
-           sheets.inject({}) {|acc, sheet| acc[sheet.name] = sheet.rows; acc}
-         end
-
-         def xml
-           Xml.load(string_or_io)
-         end
-
-         class Sheet < Struct.new(:name, :rows)
-           def headers
-             rows[0]
-           end
-
-           def data
-             rows[1..-1]
-           end
-
-           # Load errors will be a hash of the form:
-           # {
-           #   [rownum, colnum] => '[error]'
-           # }
-           def load_errors
-             @load_errors ||= {}
-           end
-         end
+ ### Now faster
+
+ Finally, as of v2.0, SimpleXlsxReader is the fastest and most
+ memory-efficient parser. Previously this project couldn't reasonably load
+ anything over ~10k rows. Other parsers could load 100k+ rows, but were still
+ taking ~1gb RSS to do so, even "streaming," which seemed excessive. So a SAX
+ implementation was born. See [performance](#performance) for details.
+
+ ## Usage
+
+ ### Streaming
+
+ SimpleXlsxReader is performant by default - If you use
+ `rows.each {|row| ...}` it will stream the XLSX rows to your block without
+ loading either the sheet XML or the full sheet data into memory.
+
+ You can also chain `rows.each` with other Enumerable functions without
+ triggering a slurp, and you have lots of ways to find and map headers while
+ streaming.
+
+ If you had an excel sheet representing this data:
+
+ ```
+ | Hero ID | Hero Name | Location |
+ | 13576 | Samus Aran | Planet Zebes |
+ | 117 | John Halo | Ring World |
+ | 9704133 | Iron Man | Planet Earth |
+ ```
+
+ Get a handle on the rows proxy:
+
+ `rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows`
+
+ Simple streaming (kinda boring):
+
+ `rows.each { |row| ... }`
+
+ Streaming with headers, and how about a little enumerable chaining:
+
+ ```
+ # Map of hero names by ID: { 117 => 'John Halo', ... }
+
+ rows.each(headers: true).with_object({}) do |row, acc|
+   acc[row['Hero ID']] = row['Hero Name']
+ end
+ ```
+
+ Sometimes though you have some junk at the top of your spreadsheet:
+
+ ```
+ | Unofficial Report | | |
+ | Dont tell Nintendo | Yes "John Halo" I know | |
+ | | | |
+ | Hero ID | Hero Name | Location |
+ | 13576 | Samus Aran | Planet Zebes |
+ | 117 | John Halo | Ring World |
+ | 9704133 | Iron Man | Planet Earth |
+ ```
+
+ For this, `headers` can be a hash whose keys replace headers and whose values
+ help find the correct header row:
+
+ ```
+ # Same map of hero names by ID: { 117 => 'John Halo', ... }
+
+ rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
+   acc[row[:id]] = row[:name]
+ end
+ ```
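
As a rough illustration of the headers-hash idea just described, the plain-Ruby sketch below scans rows until one contains a cell matching any pattern, then swaps the matched cells for their keys. It is not the gem's implementation (that lives in the sheet parser, which this diff does not show); the `locate_headers` helper, the in-memory `sheet` array, and the choice to leave unmatched header cells untouched are all assumptions made for the example.

```ruby
# Hypothetical helper, not part of simple_xlsx_reader: returns the index of the
# first row with at least one matching cell, plus that row with matched cells
# replaced by their hash keys.
def locate_headers(rows, matchers)
  rows.each_with_index do |row, index|
    matched = false
    mapped = row.map do |cell|
      # Regexp#=== tests for a match, String#=== tests for equality
      key, = matchers.find { |_key, pattern| pattern === cell.to_s }
      matched ||= !key.nil?
      key || cell
    end
    return [index, mapped] if matched
  end
  nil
end

# Stand-in for a parsed sheet with junk rows above the header row
sheet = [
  ['Unofficial Report', nil, nil],
  [nil, nil, nil],
  ['Hero ID', 'Hero Name', 'Location'],
  [13576, 'Samus Aran', 'Planet Zebes'],
  [117, 'John Halo', 'Ring World']
]

index, headers = locate_headers(sheet, { id: /ID/, name: /Name/ })

# Everything before the header row is ignored; later rows become hashes
records = sheet.drop(index + 1).map { |row| headers.zip(row).to_h }
names_by_id = records.each_with_object({}) { |rec, acc| acc[rec[:id]] = rec[:name] }
```

Here `names_by_id` ends up keyed by hero ID, mirroring the README example, while the unmatched `'Location'` column keeps its original string key in this sketch.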
+
+ If your header-to-attribute mapping is more complicated than key/value, you
+ can do the mapping elsewhere, but use a block to find the header row:
+
+ ```
+ # Example roughly analogous to some production code mapping a single spreadsheet
+ # across many objects. Might be a simpler way now that we have the headers-hash
+ # feature.
+
+ object_map = { Hero => { id: 'Hero ID', name: 'Hero Name', location: 'Location' } }
+
+ HEADERS = ['Hero ID', 'Hero Name', 'Location']
+
+ rows.each(headers: ->(row) { (HEADERS & row).any? }) do |row|
+   object_map.each_pair do |klass, attribute_map|
+     attributes =
+       attribute_map.each_pair.with_object({}) do |(key, header), attrs|
+         attrs[key] = row[header]
        end
-     end
+
+     klass.new(attributes)
+   end
+ end
+ ```
+
+ ### Slurping
+
+ To make SimpleXlsxReader rows act like an array, for use with legacy
+ SimpleXlsxReader apps or otherwise, we still support slurping the whole array
+ into memory. The good news is even when doing this, the xlsx worksheet & shared
+ string files are never loaded as a (big) Nokogiri doc, so that's nice.
+
+ By default, to prevent accidental slurping, `<RowsProxy>` will throw an exception
+ if you try to access it with array methods like `[]` and `shift` without
+ explicitly slurping first. You can slurp either by calling `rows.slurp` or
+ globally by setting `SimpleXlsxReader.configuration.auto_slurp = true`.
+
+ Once slurped, enumerable methods on `rows` will use the slurped data
+ (i.e. not re-parse the sheet), and those Array-like methods will work.
+
+ We don't support all Array methods, just the few we have used in real projects,
+ as we transition towards streaming instead.
+
+ ### Load Errors
+
+ By default, cell load errors (ex. if a date cell contains the string
+ 'hello') result in a SimpleXlsxReader::CellLoadError.
+
+ If you would like to provide better error feedback to your users, you
+ can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
+ true`, and load errors will instead be inserted into Sheet#load_errors keyed
+ by [rownum, colnum]:
+
+     {
+       [rownum, colnum] => '[error]'
+     }
+
+ ### Performance
+
+ SimpleXlsxReader is (as of this writing) the fastest and most memory efficient
+ Ruby xlsx parser.
+
+ Recent updates here have focused on large spreadsheets with especially
+ non-unique strings in sheets using xlsx' shared strings feature
+ (Excel-generated spreadsheets always use this). Other projects have implemented
+ streaming parsers for the sheet data, but currently none stream while loading
+ the shared strings file, which is the second-largest file in an xlsx archive
+ and can represent millions of strings in large files.
+
+ For more details, see [my fork of @shkm's excel benchmark project](https://github.com/woahdae/excel-parsing-benchmarks), but here's the summary:
+
+ 1mb excel file, 10,000 rows of sample "sales records" with a fair amount of
+ non-unique strings (ran on an M1 Macbook Pro):
+
+ | Gem | Parses/second | RSS Increase | Allocated Mem | Retained Mem | Allocated Objects | Retained Objects |
+ |--------------------|---------------|--------------|---------------|--------------|-------------------|------------------|
+ | simple_xlsx_reader | 1.13 | 36.94mb | 614.51mb | 1.13kb | 8796275 | 3 |
+ | roo | 0.75 | 74.0mb | 164.47mb | 2.18kb | 2128396 | 4 |
+ | creek | 0.65 | 107.55mb | 581.38mb | 3.3kb | 7240760 | 16 |
+ | xsv | 0.61 | 75.66mb | 2127.42mb | 3.66kb | 5922563 | 10 |
+ | rubyxl | 0.27 | 373.52mb | 716.7mb | 2.18kb | 10612577 | 4 |
+
+ Here is a benchmark for the "worst" file I've seen, a 26mb file whose shared
+ strings represent 10% of the archive (note, MemoryProfiler has too much
+ overhead to reasonably measure allocations so that analysis was left off, and
+ we just measure total time for one parse):
+
+ | Gem | Time | RSS Increase |
+ |--------------------|---------|--------------|
+ | simple_xlsx_reader | 28.71s | 148.77mb |
+ | roo | 40.25s | 1322.08mb |
+ | xsv | 45.82s | 391.27mb |
+ | creek | 60.63s | 886.81mb |
+ | rubyxl | 238.68s | 9136.3mb |
 
  ## Installation
 
data/Rakefile CHANGED
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  require "bundler/gem_tasks"
 
  require 'rake/testtask'
@@ -6,4 +8,4 @@ Rake::TestTask.new do |t|
    t.libs << 'test'
  end
 
- task :default => [:test]
+ task default: [:test]
@@ -0,0 +1,147 @@
+ # frozen_string_literal: true
+
+ require 'forwardable'
+
+ module SimpleXlsxReader
+
+   ##
+   # Main class for the public API. See the README for usage examples,
+   # or read the code, it's pretty friendly.
+   class Document
+     attr_reader :file_path
+
+     def initialize(file_path)
+       @file_path = file_path
+     end
+
+     def sheets
+       @sheets ||= Loader.new(file_path).init_sheets
+     end
+
+     # Expensive because it slurps all the sheets into memory,
+     # probably only appropriate for testing
+     def to_hash
+       sheets.each_with_object({}) { |sheet, acc| acc[sheet.name] = sheet.rows.to_a; }
+     end
+
+     # `rows` is a RowsProxy that responds to #each
+     class Sheet
+       extend Forwardable
+
+       attr_reader :name, :rows
+
+       def_delegators :rows, :load_errors, :slurp
+
+       def initialize(name:, sheet_parser:)
+         @name = name
+         @rows = RowsProxy.new(sheet_parser: sheet_parser)
+       end
+
+       # Legacy - consider `rows.each(headers: true)` for better performance
+       def headers
+         rows.slurped![0]
+       end
+
+       # Legacy - consider `rows` or `rows.each(headers: true)` for better
+       # performance
+       def data
+         rows.slurped![1..-1]
+       end
+     end
+
+     # Waits until we call #each with a block to parse the rows
+     class RowsProxy
+       include Enumerable
+
+       attr_reader :slurped, :load_errors
+
+       def initialize(sheet_parser:)
+         @sheet_parser = sheet_parser
+         @slurped = nil
+         @load_errors = {}
+       end
+
+       # By default, #each streams the rows to the provided block, either as
+       # arrays, or as header => cell value pairs if provided a `headers:`
+       # argument.
+       #
+       # `headers` can be:
+       #
+       # * `true` - simply takes the first row as the header row
+       # * block - calls the block with successive rows until the block returns
+       #   true, which it then uses that row for the headers. All data prior to
+       #   finding the headers is ignored.
+       # * hash - transforms the header row by replacing cells with keys matched
+       #   by value, ex. `{id: /ID|Identity/, name: /Name/i, date: 'Date'}` would
+       #   potentially yield the row `{id: 5, name: 'Jane', date: [Date object]}`
+       #   instead of the headers from the sheet. It would also search for the
+       #   row that matches at least one header, in case the header row isn't the
+       #   first.
+       #
+       # If rows have been slurped, #each will iterate the slurped rows instead.
+       #
+       # Note, calls to this after slurping will raise if given the `headers:`
+       # argument, as that's handled by the sheet parser. If this is important
+       # to someone, speak up and we could potentially support it.
+       def each(headers: false, &block)
+         if slurped?
+           raise '#each does not support headers with slurped rows' if headers
+
+           slurped.each(&block)
+         elsif block_given?
+           # It's possible to slurp while yielding to the block, which would
+           # null out @sheet_parser, so let's just keep track of it here too
+           sheet_parser = @sheet_parser
+           @sheet_parser.parse(headers: headers, &block).tap do
+             @load_errors = sheet_parser.load_errors
+           end
+         else
+           to_enum(:each, headers: headers)
+         end
+       end
+
+       # Mostly for legacy support, I'm not aware of a use case for doing this
+       # when you don't have to.
+       #
+       # Note that #each will use slurped results if available, and since we're
+       # leveraging Enumerable, all the other Enumerable methods will too.
+       def slurp
+         # possibly release sheet parser from memory on next GC run;
+         # untested, but it can hold a lot of stuff, so worth a try
+         @slurped ||= to_a.tap { @sheet_parser = nil }
+       end
+
+       def slurped?
+         !!@slurped
+       end
+
+       def slurped!
+         check_slurped
+
+         slurped
+       end
+
+       def [](*args)
+         check_slurped
+
+         slurped[*args]
+       end
+
+       def shift(*args)
+         check_slurped
+
+         slurped.shift(*args)
+       end
+
+       private
+
+       def check_slurped
+         slurp if SimpleXlsxReader.configuration.auto_slurp
+         return if slurped?
+
+         raise 'Called a slurp-y method without explicitly slurping;'\
+               ' use #each or call rows.slurp first'
+       end
+     end
+   end
+ end
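
The `RowsProxy` in the hunk above combines three behaviours: `#each` streams when given a block, returns an enumerator when not, and replays cached rows once slurped. The standalone sketch below shows that pattern in miniature so it can be run without the gem. `LazyRows`, its block-based row source, and the `parse_count` counter are hypothetical stand-ins for illustration only; the real class delegates to a sheet parser and also handles `headers:` and load errors.

```ruby
# Hypothetical miniature of the streaming/slurping pattern; not the gem's class.
class LazyRows
  include Enumerable

  attr_reader :parse_count

  # The block plays the role of the sheet parser: it receives a callable
  # and pushes one row at a time into it.
  def initialize(&source)
    @source = source
    @slurped = nil
    @parse_count = 0
  end

  def each(&block)
    # No block: hand back an Enumerator so callers can chain lazily
    return to_enum(:each) unless block_given?
    # Already slurped: replay the cached rows instead of re-parsing
    return @slurped.each(&block) if @slurped

    @parse_count += 1
    @source.call(block)
    self
  end

  # Load every row into memory once; later calls reuse the cached array
  def slurp
    @slurped ||= to_a
  end
end

rows = LazyRows.new do |emit|
  emit.call([13576, 'Samus Aran'])
  emit.call([117, 'John Halo'])
end

# Chaining off the enumerator streams rows without caching them (parse #1)
names_by_id = rows.each.with_object({}) { |(id, name), acc| acc[id] = name }

# Slurping parses once more (parse #2) and caches the result
rows.slurp

# Enumerable methods now read the cache, so no further parsing happens
ids = rows.map(&:first)
```

Because `Enumerable` is built on `#each`, routing `#each` through the cache is enough to make `map`, `select`, and the rest reuse slurped data, which is the design point the comments in the hunk make.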
@@ -0,0 +1,30 @@
+ # frozen_string_literal: true
+
+ module SimpleXlsxReader
+   # We support hyperlinks as a "type" even though they're technically
+   # represented either as a function or an external reference in the xlsx spec.
+   #
+   # Since having hyperlink data in our sheet usually means we might want to do
+   # something primarily with the URL (store it in the database, download it, etc),
+   # we go through extra effort to parse the function or follow the reference
+   # to represent the hyperlink primarily as a URL. However, maybe we do want
+   # the hyperlink "friendly name" part (as MS calls it), so here we've subclassed
+   # string to tack on the friendly name. This means 80% of us that just want
+   # the URL value will have to do nothing extra, but the 20% that might want the
+   # friendly name can access it.
+   #
+   # Note, by default, the value we would get by just asking the cell would
+   # be the "friendly name" and *not* the URL, which is tucked away in the
+   # function definition or a separate "relationships" meta-document.
+   #
+   # See MS documentation on the HYPERLINK function for some background:
+   # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
+   class Hyperlink < String
+     attr_reader :friendly_name
+
+     def initialize(url, friendly_name = nil)
+       @friendly_name = friendly_name
+       super(url)
+     end
+   end
+ end
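
To make the "string with a friendly name tacked on" design concrete, here is a small runnable sketch. It copies the class body from the hunk above into a hypothetical `Demo` module so it can run without the gem; the URL and friendly name are made-up sample values.

```ruby
module Demo
  # Same shape as the Hyperlink class in the diff, reproduced for illustration
  class Hyperlink < String
    attr_reader :friendly_name

    def initialize(url, friendly_name = nil)
      @friendly_name = friendly_name
      super(url)
    end
  end
end

link = Demo::Hyperlink.new('https://example.com/report', 'Quarterly report')

# Callers that only want the URL can treat the cell value as a plain String
same_as_url = (link == 'https://example.com/report')
is_https = link.start_with?('https://')

# Callers that want the display text ask for it explicitly
label = link.friendly_name
```

The trade-off of subclassing `String` is that the value drops into existing string-handling code unchanged, at the cost of the friendly name being easy to overlook unless you know to ask for it.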
@@ -0,0 +1,46 @@
+ # frozen_string_literal: true
+
+ module SimpleXlsxReader
+   class Loader
+     # For performance reasons, excel uses an optional SpreadsheetML feature
+     # that puts all strings in a separate xml file, and then references
+     # them by their index in that file.
+     #
+     # http://msdn.microsoft.com/en-us/library/office/gg278314.aspx
+     class SharedStringsParser < Nokogiri::XML::SAX::Document
+       def self.parse(file)
+         new.tap do |parser|
+           Nokogiri::XML::SAX::Parser.new(parser).parse(file)
+         end.result
+       end
+
+       def initialize
+         @result = []
+         @composite = false
+         @extract = false
+       end
+
+       attr_reader :result
+
+       def start_element(name, _attrs = [])
+         case name
+         when 'si' then @current_string = +"" # UTF-8 variant of String.new
+         when 't' then @extract = true
+         end
+       end
+
+       def characters(string)
+         return unless @extract
+
+         @current_string << string
+       end
+
+       def end_element(name)
+         case name
+         when 't' then @extract = false
+         when 'si' then @result << @current_string
+         end
+       end
+     end
+   end
+ end
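
The event-driven approach above can be tried without Nokogiri. The sketch below swaps in REXML's stream listener (bundled with standard Ruby installs) for Nokogiri's SAX document, so the callback names differ (`tag_start`, `text`, `tag_end`), but the state machine is the same: start a new string at `<si>`, collect text only while inside `<t>`, and push the string at `</si>`. The `SharedStringsListener` class and the sample XML are assumptions made for the example, not code from the gem.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical REXML-based analogue of the SAX parser in the diff
class SharedStringsListener
  include REXML::StreamListener

  attr_reader :result

  def initialize
    @result = []
    @extract = false
  end

  def tag_start(name, _attrs)
    case name
    when 'si' then @current_string = +''
    when 't' then @extract = true
    end
  end

  # Whitespace between tags also arrives here; it is skipped unless inside <t>
  def text(string)
    @current_string << string if @extract
  end

  def tag_end(name)
    case name
    when 't' then @extract = false
    when 'si' then @result << @current_string
    end
  end
end

# The second <si> holds two rich-text runs, which are joined into one string
sample_xml = <<~XML
  <sst>
    <si><t>Hero ID</t></si>
    <si><r><t>Samus </t></r><r><t>Aran</t></r></si>
  </sst>
XML

listener = SharedStringsListener.new
REXML::Document.parse_stream(sample_xml, listener)

shared_strings = listener.result

# A sheet cell that stores shared-string index 1 resolves by array lookup
cell_value = shared_strings[1]
```

Only the resulting array of strings is kept; no document tree is built for the shared strings file, which is the memory saving the README's performance section attributes to this design.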