simple_xlsx_reader 1.0.2 → 2.0.0
- checksums.yaml +5 -5
- data/.github/workflows/ruby.yml +38 -0
- data/.travis.yml +8 -0
- data/CHANGELOG.md +22 -0
- data/README.md +190 -57
- data/Rakefile +3 -1
- data/lib/simple_xlsx_reader/document.rb +147 -0
- data/lib/simple_xlsx_reader/hyperlink.rb +30 -0
- data/lib/simple_xlsx_reader/loader/shared_strings_parser.rb +46 -0
- data/lib/simple_xlsx_reader/loader/sheet_parser.rb +256 -0
- data/lib/simple_xlsx_reader/loader/style_types_parser.rb +115 -0
- data/lib/simple_xlsx_reader/loader/workbook_parser.rb +39 -0
- data/lib/simple_xlsx_reader/loader.rb +199 -0
- data/lib/simple_xlsx_reader/version.rb +3 -1
- data/lib/simple_xlsx_reader.rb +23 -442
- data/simple_xlsx_reader.gemspec +4 -2
- data/test/date1904_test.rb +5 -4
- data/test/datetime_test.rb +17 -10
- data/test/gdocs_sheet.xlsx +0 -0
- data/test/gdocs_sheet_test.rb +16 -0
- data/test/lower_case_sharedstrings_test.rb +9 -4
- data/test/performance_test.rb +86 -89
- data/test/sesame_street_blog.xlsx +0 -0
- data/test/shared_strings.xml +4 -0
- data/test/simple_xlsx_reader_test.rb +835 -320
- data/test/test_helper.rb +4 -1
- data/test/test_xlsx_builder.rb +104 -0
- metadata +38 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
-
-  metadata.gz:
-  data.tar.gz:
+SHA256:
+  metadata.gz: 979490ce3bd7f0482879fb5fb5465e10ad1b07c1488d0a544950131d9063050a
+  data.tar.gz: 412d0040a586cc5ee4acdd4a2f74dd74f3bf9eb781a35d8a36c12f6caadc566c
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 00c01bc0c2a393eb35e458411dfeab55b8bf30cee2661324cbd97a175baf0ceb31a881b1b2b7bd668a2b475ff008372c1428908340e30769308884355fdd46e8
+  data.tar.gz: 81b1b26806a97c56710cab64aa22212985dea82b308e2fbba6835f4ea7a69b79067268bb13537999594dc5722928f1df235938355a7d4a51b58ae7ed4af1d093
data/.github/workflows/ruby.yml
ADDED
@@ -0,0 +1,38 @@
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+# This workflow will download a prebuilt Ruby version, install dependencies and run tests with Rake
+# For more information see: https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby
+
+name: Ruby
+
+on:
+  push:
+    branches: [ "master" ]
+  pull_request:
+    branches: [ "master" ]
+
+permissions:
+  contents: read
+
+jobs:
+  test:
+
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        ruby-version: ['2.6', '2.7', '3.0']
+
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Ruby
+    # To automatically get bug fixes and new Ruby versions for ruby/setup-ruby,
+    # change this to (see https://github.com/ruby/setup-ruby#versioning):
+    # uses: ruby/setup-ruby@v1
+      uses: ruby/setup-ruby@2b019609e2b0f1ea1a2bc8ca11cb82ab46ada124
+      with:
+        ruby-version: ${{ matrix.ruby-version }}
+        bundler-cache: true # runs 'bundle install' and caches installed gems automatically
+    - name: Run tests
+      run: bundle exec rake
data/.travis.yml
ADDED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,24 @@
+### 2.0.0
+
+* SPEED
+  * Reimplement internals in terms of a SAX parser
+  * Change `SimpleXlsxReader::Sheet#rows` to be a `RowsProxy` that streams `#each`
+* Convenience - use `rows#each(headers: true)` to get header names while enumerating rows
+
+### 1.0.5
+
+* Support string or io input via `SimpleXlsxReader#parse` (@kalsan, @til)
+
+### 1.0.4
+
+* Fix Windows + RubyZip 1.2.1 bug preventing files from being read
+* Add ability to parse hyperlinks
+* Support files exported from Google Docs (@Strnadj)
+
+### 1.0.3
+
+Broken on Ruby 1.9; yanked.
+
 ### 1.0.2
 
 * Fix Ruby 1.9.3-specific bug preventing parsing most sheets [middagj, eritiro]
@@ -5,6 +26,7 @@
 * You don't always have a numFmtId column, and that's OK
 * Sometimes 'sharedStrings.xml' can be 'sharedstrings.xml'
 * Fixed parsing times very close to 12/30/1899 [Valeriy Utyaganov]
+* Be more flexible with custom formats using a numFmtId < 164
 
 ### 1.0.1
 
data/README.md
CHANGED
@@ -1,81 +1,214 @@
 # SimpleXlsxReader
 
-
-primitives and dates/times.
+A [fast](#performance) xlsx reader for Ruby that parses xlsx cell values into
+plain ruby primitives and dates/times.
 
 This is *not* a rewrite of excel in Ruby. Font styles, for
 example, are parsed to determine whether a cell is a number or a date,
 then forgotten. We just want to get the data, and get out!
 
-##
-
-### Summary:
+## Summary (now with stream parsing):
 
     doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
     doc.sheets # => [<#SXR::Sheet>, ...]
     doc.sheets.first.name # 'Sheet1'
-    doc.sheets.first.rows #
-
+    doc.sheets.first.rows # <SXR::Document::RowsProxy>
+    doc.sheets.first.rows.each # an <Enumerator> ready to chain or stream
+    doc.sheets.first.rows.each {} # Streams the rows to your block
+    doc.sheets.first.rows.each(headers: true) {} # Streams row-hashes
+    doc.sheets.first.rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+    doc.sheets.first.rows.slurp # Slurps rows into memory as a 2D array
 
-That's it!
+That's the gist of it!
 
-
+See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0-pre/lib/simple_xlsx_reader/document.rb) object.
 
-
-'hello') result in a SimpleXlsxReader::CellLoadError.
+## Why?
 
-
-
-
-
+### Accurate
+
+This project was started years ago, primarily because other Ruby xlsx parsers
+didn't import data with the correct types. Numbers as strings, dates as numbers,
+hyperlinks with inaccessible URLs, or - subtly buggy - simple dates as DateTime
+objects. If your app uses a timezone offset, depending on what timezone and
+what time of day you load the xlsx file, your dates might end up a day off!
+SimpleXlsxReader understands all these correctly.
 
-###
+### Idiomatic
 
-
+Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
+SimpleXlsxReader strives to be fairly idiomatic Ruby:
 
-
-
-
+    # quick example having fun w/ ruby
+    doc = SimpleXlsxReader.open(path_or_io)
+    doc.sheets.first.rows.each(headers: {id: /ID/})
+      .with_index.with_object({}) do |(row, index), acc|
+        acc[row[:id]] = index
     end
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+### Now faster
+
+Finally, as of v2.0, SimpleXlsxReader is the fastest and most
+memory-efficient parser. Previously this project couldn't reasonably load
+anything over ~10k rows. Other parsers could load 100k+ rows, but were still
+taking ~1gb RSS to do so, even "streaming," which seemed excessive. So a SAX
+implementation was born. See [performance](#performance) for details.
+
+## Usage
+
+### Streaming
+
+SimpleXlsxReader is performant by default - If you use
+`rows.each {|row| ...}` it will stream the XLSX rows to your block without
+loading either the sheet XML or the full sheet data into memory.
+
+You can also chain `rows.each` with other Enumerable functions without
+triggering a slurp, and you have lots of ways to find and map headers while
+streaming.
+
+If you had an excel sheet representing this data:
+
+```
+| Hero ID | Hero Name  | Location     |
+| 13576   | Samus Aran | Planet Zebes |
+| 117     | John Halo  | Ring World   |
+| 9704133 | Iron Man   | Planet Earth |
+```
+
+Get a handle on the rows proxy:
+
+`rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows`
+
+Simple streaming (kinda boring):
+
+`rows.each { |row| ... }`
+
+Streaming with headers, and how about a little enumerable chaining:
+
+```
+# Map of hero names by ID: { 117 => 'John Halo', ... }
+
+rows.each(headers: true).with_object({}) do |row, acc|
+  acc[row['Hero ID']] = row['Hero Name']
+end
+```
+
+Sometimes though you have some junk at the top of your spreadsheet:
+
+```
+| Unofficial Report  |                        |              |
+| Dont tell Nintendo | Yes "John Halo" I know |              |
+|                    |                        |              |
+| Hero ID            | Hero Name              | Location     |
+| 13576              | Samus Aran             | Planet Zebes |
+| 117                | John Halo              | Ring World   |
+| 9704133            | Iron Man               | Planet Earth |
+```
+
+For this, `headers` can be a hash whose keys replace headers and whose values
+help find the correct header row:
+
+```
+# Same map of hero names by ID: { 117 => 'John Halo', ... }
+
+rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
+  acc[row[:id]] = row[:name]
+end
+```
+
+If your header-to-attribute mapping is more complicated than key/value, you
+can do the mapping elsewhere, but use a block to find the header row:
+
+```
+# Example roughly analogous to some production code mapping a single spreadsheet
+# across many objects. Might be a simpler way now that we have the headers-hash
+# feature.
+
+object_map = { Hero => { id: 'Hero ID', name: 'Hero Name', location: 'Location' } }
+
+HEADERS = ['Hero ID', 'Hero Name', 'Location']
+
+rows.each(headers: ->(row) { (HEADERS & row).any? }) do |row|
+  object_map.each_pair do |klass, attribute_map|
+    attributes =
+      attribute_map.each_pair.with_object({}) do |(key, header), attrs|
+        attrs[key] = row[header]
       end
-
+
+    klass.new(attributes)
+  end
+end
+```
+
+### Slurping
+
+To make SimpleXlsxReader rows act like an array, for use with legacy
+SimpleXlsxReader apps or otherwise, we still support slurping the whole array
+into memory. The good news is even when doing this, the xlsx worksheet & shared
+string files are never loaded as a (big) Nokogiri doc, so that's nice.
+
+By default, to prevent accidental slurping, `<RowsProxy>` will throw an exception
+if you try to access it with array methods like `[]` and `shift` without
+explicitly slurping first. You can slurp either by calling `rows.slurp` or
+globally by setting `SimpleXlsxReader.configuration.auto_slurp = true`.
+
+Once slurped, enumerable methods on `rows` will use the slurped data
+(i.e. not re-parse the sheet), and those Array-like methods will work.
+
+We don't support all Array methods, just the few we have used in real projects,
+as we transition towards streaming instead.
+
+### Load Errors
+
+By default, cell load errors (ex. if a date cell contains the string
+'hello') result in a SimpleXlsxReader::CellLoadError.
+
+If you would like to provide better error feedback to your users, you
+can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
+true`, and load errors will instead be inserted into Sheet#load_errors keyed
+by [rownum, colnum]:
+
+    {
+      [rownum, colnum] => '[error]'
+    }
+
+### Performance
+
+SimpleXlsxReader is (as of this writing) the fastest and most memory efficient
+Ruby xlsx parser.
+
+Recent updates here have focused on large spreadsheets with especially
+non-unique strings in sheets using xlsx' shared strings feature
+(Excel-generated spreadsheets always use this). Other projects have implemented
+streaming parsers for the sheet data, but currently none stream while loading
+the shared strings file, which is the second-largest file in an xlsx archive
+and can represent millions of strings in large files.
+
+For more details, see [my fork of @shkm's excel benchmark project](https://github.com/woahdae/excel-parsing-benchmarks), but here's the summary:
+
+1mb excel file, 10,000 rows of sample "sales records" with a fair amount of
+non-unique strings (ran on an M1 Macbook Pro):
+
+| Gem                | Parses/second | RSS Increase | Allocated Mem | Retained Mem | Allocated Objects | Retained Objects |
+|--------------------|---------------|--------------|---------------|--------------|-------------------|------------------|
+| simple_xlsx_reader | 1.13          | 36.94mb      | 614.51mb      | 1.13kb       | 8796275           | 3                |
+| roo                | 0.75          | 74.0mb       | 164.47mb      | 2.18kb       | 2128396           | 4                |
+| creek              | 0.65          | 107.55mb     | 581.38mb      | 3.3kb        | 7240760           | 16               |
+| xsv                | 0.61          | 75.66mb      | 2127.42mb     | 3.66kb       | 5922563           | 10               |
+| rubyxl             | 0.27          | 373.52mb     | 716.7mb       | 2.18kb       | 10612577          | 4                |
+
+Here is a benchmark for the "worst" file I've seen, a 26mb file whose shared
+strings represent 10% of the archive (note, MemoryProfiler has too much
+overhead to reasonably measure allocations so that analysis was left off, and
+we just measure total time for one parse):
+
+| Gem                | Time    | RSS Increase |
+|--------------------|---------|--------------|
+| simple_xlsx_reader | 28.71s  | 148.77mb     |
+| roo                | 40.25s  | 1322.08mb    |
+| xsv                | 45.82s  | 391.27mb     |
+| creek              | 60.63s  | 886.81mb     |
+| rubyxl             | 238.68s | 9136.3mb     |
 
 ## Installation
 
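Reviewer's note on the README hunk: the new text describes `headers:` as a hash whose keys replace header names and whose values help find the header row. The sketch below illustrates that idea on plain arrays, with no xlsx file involved. `HeaderSketch.each_mapped_row` is a hypothetical helper written for this note; it is not the gem's implementation, and the "at least one matching cell" rule is taken from the comment in `document.rb`.

```ruby
# Hypothetical illustration only; not part of simple_xlsx_reader.
module HeaderSketch
  # Skips rows until one has at least one cell matching a pattern, then
  # yields each later row as a hash keyed by the replacement names.
  def self.each_mapped_row(rows, header_map)
    return to_enum(:each_mapped_row, rows, header_map) unless block_given?

    keys = nil # per-column keys; stays nil until the header row is found
    rows.each do |row|
      if keys
        yield keys.zip(row).to_h
      else
        candidate = row.map do |cell|
          key, = header_map.find { |_key, pattern| pattern === cell }
          key || cell # unmatched columns keep their original header text
        end
        keys = candidate unless candidate == row
      end
    end
  end
end

sample_rows = [
  ['Unofficial Report', nil, nil],
  ['Hero ID', 'Hero Name', 'Location'],
  [13576, 'Samus Aran', 'Planet Zebes'],
  [117, 'John Halo', 'Ring World']
]

names_by_id =
  HeaderSketch.each_mapped_row(sample_rows, { id: /ID/, name: /Name/ })
              .with_object({}) { |row, acc| acc[row[:id]] = row[:name] }
# names_by_id == { 13576 => 'Samus Aran', 117 => 'John Halo' }
```

Returning an Enumerator when no block is given is what makes the `.with_object` chaining in the README examples possible without loading every row first.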
data/Rakefile
CHANGED
data/lib/simple_xlsx_reader/document.rb
ADDED
@@ -0,0 +1,147 @@
+# frozen_string_literal: true
+
+require 'forwardable'
+
+module SimpleXlsxReader
+
+  ##
+  # Main class for the public API. See the README for usage examples,
+  # or read the code, it's pretty friendly.
+  class Document
+    attr_reader :file_path
+
+    def initialize(file_path)
+      @file_path = file_path
+    end
+
+    def sheets
+      @sheets ||= Loader.new(file_path).init_sheets
+    end
+
+    # Expensive because it slurps all the sheets into memory,
+    # probably only appropriate for testing
+    def to_hash
+      sheets.each_with_object({}) { |sheet, acc| acc[sheet.name] = sheet.rows.to_a; }
+    end
+
+    # `rows` is a RowsProxy that responds to #each
+    class Sheet
+      extend Forwardable
+
+      attr_reader :name, :rows
+
+      def_delegators :rows, :load_errors, :slurp
+
+      def initialize(name:, sheet_parser:)
+        @name = name
+        @rows = RowsProxy.new(sheet_parser: sheet_parser)
+      end
+
+      # Legacy - consider `rows.each(headers: true)` for better performance
+      def headers
+        rows.slurped![0]
+      end
+
+      # Legacy - consider `rows` or `rows.each(headers: true)` for better
+      # performance
+      def data
+        rows.slurped![1..-1]
+      end
+    end
+
+    # Waits until we call #each with a block to parse the rows
+    class RowsProxy
+      include Enumerable
+
+      attr_reader :slurped, :load_errors
+
+      def initialize(sheet_parser:)
+        @sheet_parser = sheet_parser
+        @slurped = nil
+        @load_errors = {}
+      end
+
+      # By default, #each streams the rows to the provided block, either as
+      # arrays, or as header => cell value pairs if provided a `headers:`
+      # argument.
+      #
+      # `headers` can be:
+      #
+      # * `true` - simply takes the first row as the header row
+      # * block - calls the block with successive rows until the block returns
+      #   true, which it then uses that row for the headers. All data prior to
+      #   finding the headers is ignored.
+      # * hash - transforms the header row by replacing cells with keys matched
+      #   by value, ex. `{id: /ID|Identity/, name: /Name/i, date: 'Date'}` would
+      #   potentially yield the row `{id: 5, name: 'Jane', date: [Date object]}`
+      #   instead of the headers from the sheet. It would also search for the
+      #   row that matches at least one header, in case the header row isn't the
+      #   first.
+      #
+      # If rows have been slurped, #each will iterate the slurped rows instead.
+      #
+      # Note, calls to this after slurping will raise if given the `headers:`
+      # argument, as that's handled by the sheet parser. If this is important
+      # to someone, speak up and we could potentially support it.
+      def each(headers: false, &block)
+        if slurped?
+          raise '#each does not support headers with slurped rows' if headers
+
+          slurped.each(&block)
+        elsif block_given?
+          # It's possible to slurp while yielding to the block, which would
+          # null out @sheet_parser, so let's just keep track of it here too
+          sheet_parser = @sheet_parser
+          @sheet_parser.parse(headers: headers, &block).tap do
+            @load_errors = sheet_parser.load_errors
+          end
+        else
+          to_enum(:each, headers: headers)
+        end
+      end
+
+      # Mostly for legacy support, I'm not aware of a use case for doing this
+      # when you don't have to.
+      #
+      # Note that #each will use slurped results if available, and since we're
+      # leveraging Enumerable, all the other Enumerable methods will too.
+      def slurp
+        # possibly release sheet parser from memory on next GC run;
+        # untested, but it can hold a lot of stuff, so worth a try
+        @slurped ||= to_a.tap { @sheet_parser = nil }
+      end
+
+      def slurped?
+        !!@slurped
+      end
+
+      def slurped!
+        check_slurped
+
+        slurped
+      end
+
+      def [](*args)
+        check_slurped
+
+        slurped[*args]
+      end
+
+      def shift(*args)
+        check_slurped
+
+        slurped.shift(*args)
+      end
+
+      private
+
+      def check_slurped
+        slurp if SimpleXlsxReader.configuration.auto_slurp
+        return if slurped?
+
+        raise 'Called a slurp-y method without explicitly slurping;'\
+          ' use #each or call rows.slurp first'
+      end
+    end
+  end
+end
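Reviewer's note on `RowsProxy`: the class combines three ideas. `#each` streams when given a block, returns an Enumerator when not, and array-style access is refused until the caller slurps. A stripped-down sketch of that pattern follows, assuming a lambda in place of the real sheet parser. `LazyRows` and its `source` argument are hypothetical names for illustration, and the sketch leaves out `headers:`, `load_errors` and `auto_slurp`.

```ruby
# Hypothetical stand-in for RowsProxy; a callable replaces the sheet parser.
class LazyRows
  include Enumerable

  def initialize(source)
    @source = source # callable that yields rows one at a time to a block
    @slurped = nil
  end

  # With a block: stream rows. Without one: return an Enumerator to chain.
  def each(&block)
    return to_enum(:each) unless block_given?
    return @slurped.each(&block) if @slurped

    @source.call(&block)
  end

  # Materialize every row once; later calls reuse the same array.
  def slurp
    @slurped ||= to_a
  end

  # Array-style access is refused until the caller opts in via #slurp.
  def [](*args)
    raise 'call #slurp before using array methods' unless @slurped

    @slurped[*args]
  end
end

parse_count = 0 # how many times the fake "parse" has run
rows = LazyRows.new(lambda do |&block|
  parse_count += 1
  [[1, 'a'], [2, 'b'], [3, 'c']].each(&block)
end)

first_two = rows.each.first(2) # streams and stops early; nothing is retained
rows.slurp                     # one more parse, results kept in memory
third = rows[2]                # served from the slurped array
```

Because `Enumerable` is built on `#each`, methods such as `map` and `select` pick up the slurped array automatically once it exists, which matches the comment on `#slurp` in the hunk above.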
data/lib/simple_xlsx_reader/hyperlink.rb
ADDED
@@ -0,0 +1,30 @@
+# frozen_string_literal: true
+
+module SimpleXlsxReader
+  # We support hyperlinks as a "type" even though they're technically
+  # represented either as a function or an external reference in the xlsx spec.
+  #
+  # Since having hyperlink data in our sheet usually means we might want to do
+  # something primarily with the URL (store it in the database, download it, etc),
+  # we go through extra effort to parse the function or follow the reference
+  # to represent the hyperlink primarily as a URL. However, maybe we do want
+  # the hyperlink "friendly name" part (as MS calls it), so here we've subclassed
+  # string to tack on the friendly name. This means 80% of us that just want
+  # the URL value will have to do nothing extra, but the 20% that might want the
+  # friendly name can access it.
+  #
+  # Note, by default, the value we would get by just asking the cell would
+  # be the "friendly name" and *not* the URL, which is tucked away in the
+  # function definition or a separate "relationships" meta-document.
+  #
+  # See MS documentation on the HYPERLINK function for some background:
+  # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
+  class Hyperlink < String
+    attr_reader :friendly_name
+
+    def initialize(url, friendly_name = nil)
+      @friendly_name = friendly_name
+      super(url)
+    end
+  end
+end
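Reviewer's note on `Hyperlink`: a short usage sketch of the String subclass, showing that the value behaves as the URL while the friendly name rides along. The class body is copied from the hunk above so the snippet runs without the gem installed; the URL and label are made-up sample values.

```ruby
# Copy of the class from the diff so this snippet is self-contained.
module SimpleXlsxReader
  class Hyperlink < String
    attr_reader :friendly_name

    def initialize(url, friendly_name = nil)
      @friendly_name = friendly_name
      super(url)
    end
  end
end

# Sample values, invented for illustration.
link = SimpleXlsxReader::Hyperlink.new('https://example.com/report', 'Q3 report')

same_as_url = (link == 'https://example.com/report') # true: compares as the URL
is_https    = link.start_with?('https://')           # true: String methods apply
label       = link.friendly_name                     # "Q3 report"
```

The design choice, as the code comment explains, is that callers who only need the URL can treat the cell value as an ordinary string and ignore `friendly_name` entirely.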
data/lib/simple_xlsx_reader/loader/shared_strings_parser.rb
ADDED
@@ -0,0 +1,46 @@
+# frozen_string_literal: true
+
+module SimpleXlsxReader
+  class Loader
+    # For performance reasons, excel uses an optional SpreadsheetML feature
+    # that puts all strings in a separate xml file, and then references
+    # them by their index in that file.
+    #
+    # http://msdn.microsoft.com/en-us/library/office/gg278314.aspx
+    class SharedStringsParser < Nokogiri::XML::SAX::Document
+      def self.parse(file)
+        new.tap do |parser|
+          Nokogiri::XML::SAX::Parser.new(parser).parse(file)
+        end.result
+      end
+
+      def initialize
+        @result = []
+        @composite = false
+        @extract = false
+      end
+
+      attr_reader :result
+
+      def start_element(name, _attrs = [])
+        case name
+        when 'si' then @current_string = +"" # UTF-8 variant of String.new
+        when 't' then @extract = true
+        end
+      end
+
+      def characters(string)
+        return unless @extract
+
+        @current_string << string
+      end
+
+      def end_element(name)
+        case name
+        when 't' then @extract = false
+        when 'si' then @result << @current_string
+        end
+      end
+    end
+  end
+end
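Reviewer's note on `SharedStringsParser`: the three callbacks form a small state machine that collects text inside `<t>` elements and emits one string per `<si>`. To show that without requiring Nokogiri, the sketch below re-implements the callbacks on a plain object and feeds it a hand-written event sequence. `SharedStringsSketch` and `EVENTS` are hypothetical names for this note, and the event list is what a SAX parser would be expected to emit for the XML in the comment.

```ruby
# Hypothetical stand-in with the same callbacks, minus the Nokogiri superclass.
class SharedStringsSketch
  attr_reader :result

  def initialize
    @result = []
    @extract = false
  end

  def start_element(name, _attrs = [])
    case name
    when 'si' then @current_string = +'' # start a new, mutable string
    when 't' then @extract = true        # text from here on belongs to it
    end
  end

  def characters(string)
    @current_string << string if @extract
  end

  def end_element(name)
    case name
    when 't' then @extract = false
    when 'si' then @result << @current_string
    end
  end
end

# Events for:
#   <si><t>Hero ID</t></si>
#   <si><r><t>Samus</t></r><r><rPr/><t> Aran</t></r></si>
EVENTS = [
  [:start_element, 'si'], [:start_element, 't'], [:characters, 'Hero ID'],
  [:end_element, 't'], [:end_element, 'si'],
  [:start_element, 'si'],
  [:start_element, 'r'], [:start_element, 't'], [:characters, 'Samus'],
  [:end_element, 't'], [:end_element, 'r'],
  [:start_element, 'r'], [:start_element, 'rPr'], [:end_element, 'rPr'],
  [:start_element, 't'], [:characters, ' Aran'], [:end_element, 't'],
  [:end_element, 'r'],
  [:end_element, 'si']
].freeze

handler = SharedStringsSketch.new
EVENTS.each { |callback, arg| handler.public_send(callback, arg) }
strings = handler.result # ["Hero ID", "Samus Aran"]
```

The second `<si>` shows why text is appended rather than assigned: a rich-text entry splits one cell's string across several `<t>` runs. A cell that references shared string index 1 would then resolve to `strings[1]`.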