simple_xlsx_reader 4.0.1 → 5.0.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +7 -0
- data/README.md +38 -29
- data/lib/simple_xlsx_reader/hyperlink.rb +11 -12
- data/lib/simple_xlsx_reader/loader/sheet_parser.rb +3 -3
- data/lib/simple_xlsx_reader/version.rb +1 -1
- data/test/performance_test.rb +1 -1
- data/test/simple_xlsx_reader_test.rb +47 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8552d34f153cbdc6561c40725488d193e9aa48debcded0af24d32daf01b2f951
+  data.tar.gz: 2a0fecdec3698bb16717244fc7bf9b45b4fe0f6b216038e9823f9a5fea2ea8fa
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 77f99e8ad1020f0313171dcd0b14f7200fdf116e16de312146eb66a4d9347e94a0bf1cb4483f606975cd8bc776e80995473485271e05ee0a11136ef72cdeeae5
+  data.tar.gz: 7ee3ed8c37df6632981bd6eeb301de5f852df0f66534ce91593923cf1b51aa1dc0b07aed224d5d88cbd4b1f8a6901fdb17164e6e9f22fb10d4e5d90a3c24f437
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,10 @@
+### 5.0.0
+
+* Change SimpleXlsxReader::Hyperlink to default to the visible cell value
+  instead of the hyperlink URL, which in the case of mailto hyperlinks is
+  surprising.
+* Fix blank content when parsing docs from string (@codemole)
+
 ### 4.0.1
 
 * Fix nil error when handling some inline strings
data/README.md
CHANGED
@@ -9,15 +9,17 @@ then forgotten. We just want to get the data, and get out!
 
 ## Summary (now with stream parsing):
 
-
-
-
-
-
-
-
-
-
+```ruby
+doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
+doc.sheets # => [<#SXR::Sheet>, ...]
+doc.sheets.first.name # 'Sheet1'
+rows = doc.sheet.first.rows # <SXR::Document::RowsProxy>
+rows.each # an <Enumerator> ready to chain or stream
+rows.each {} # Streams the rows to your block
+rows.each(headers: true) {} # Streams row-hashes
+rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+rows.slurp # Slurps rows into memory as a 2D array
+```
 
 That's the gist of it!
 
@@ -29,7 +31,8 @@ See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0
 
 This project was started years ago, primarily because other Ruby xlsx parsers
 didn't import data with the correct types. Numbers as strings, dates as numbers,
-hyperlinks
+[hyperlinks](https://github.com/woahdae/simple_xlsx_reader/blob/master/lib/simple_xlsx_reader/hyperlink.rb)
+with inaccessible URLs, or - subtly buggy - simple dates as DateTime
 objects. If your app uses a timezone offset, depending on what timezone and
 what time of day you load the xlsx file, your dates might end up a day off!
 SimpleXlsxReader understands all these correctly.
@@ -39,12 +42,14 @@ SimpleXlsxReader understands all these correctly.
 Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
 SimpleXlsxReader strives to be fairly idiomatic Ruby:
 
-
-
-
-
-
-
+```ruby
+# quick example having fun w/ ruby
+doc = SimpleXlsxReader.open(path_or_io)
+doc.sheets.first.rows.each(headers: {id: /ID/})
+  .with_index.with_object({}) do |(row, index), acc|
+    acc[row[:id]] = index
+  end
+```
 
 ### Now faster
 
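The `headers: {id: /ID/}` chain in the hunk above can be tried without the gem or a workbook. The following is only a sketch under stated assumptions: `RAW_ROWS` is made-up stand-in data and `each_with_headers` is a hypothetical helper, not part of SimpleXlsxReader. It imitates the documented behavior (find the header row by pattern, then stream row-hashes) so the enumerator chaining can be seen end to end.

```ruby
# Stand-in for a parsed sheet: a junk row, the header row, then data rows.
RAW_ROWS = [
  ['Exported 2023-06-17', nil],
  ['Hero ID', 'Name'],
  [117, 'John Halo'],
  [343, 'Guilty Spark']
].freeze

# Hypothetical helper (NOT the gem's API): find the first row in which every
# pattern matches some cell, then yield each later row as a hash keyed by the
# caller's names. Returns an Enumerator when called without a block.
def each_with_headers(rows, patterns)
  return enum_for(:each_with_headers, rows, patterns) unless block_given?

  columns = nil
  rows.each do |row|
    if columns
      yield columns.transform_values { |i| row[i] }
    else
      found = patterns.transform_values do |pattern|
        row.index { |cell| pattern.match?(cell.to_s) }
      end
      # Keep scanning until every requested header has been located.
      columns = found unless found.value?(nil)
    end
  end
end

# Same chain shape as the README example: map each ID to its row index.
index_by_id =
  each_with_headers(RAW_ROWS, { id: /ID/ })
    .with_index.with_object({}) do |(row, index), acc|
      acc[row[:id]] = index
    end

p index_by_id # => {117=>0, 343=>1}
```

Because the helper returns an `Enumerator` when no block is given, `with_index` and `with_object` compose the same way they do on the gem's rows proxy in the README snippet.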
@@ -77,15 +82,19 @@ If you had an excel sheet representing this data:
 
 Get a handle on the rows proxy:
 
-
+```ruby
+rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows
+```
 
 Simple streaming (kinda boring):
 
-
+```ruby
+rows.each { |row| ... }
+````
 
 Streaming with headers, and how about a little enumerable chaining:
 
-```
+```ruby
 # Map of hero names by ID: { 117 => 'John Halo', ... }
 
 rows.each(headers: true).with_object({}) do |row, acc|
@@ -108,7 +117,7 @@ Sometimes though you have some junk at the top of your spreadsheet:
 For this, `headers` can be a hash whose keys replace headers and whose values
 help find the correct header row:
 
-```
+```ruby
 # Same map of hero names by ID: { 117 => 'John Halo', ... }
 
 rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
@@ -119,7 +128,7 @@ end
 If your header-to-attribute mapping is more complicated than key/value, you
 can do the mapping elsewhere, but use a block to find the header row:
 
-```
+```ruby
 # Example roughly analogous to some production code mapping a single spreadsheet
 # across many objects. Might be a simpler way now that we have the headers-hash
 # feature.
@@ -168,9 +177,11 @@ can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
 true`, and load errors will instead be inserted into Sheet#load_errors keyed
 by [rownum, colnum]:
 
-
-
-
+```ruby
+{
+  [rownum, colnum] => '[error]'
+}
+```
 
 ### Performance
 
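As a small illustration of the `[rownum, colnum] => '[error]'` shape described in the hunk above, here is a sketch using a plain hash. The literal below is hypothetical stand-in data, not output from the gem, and the error messages and index values are invented for the example.

```ruby
# Hypothetical stand-in for what Sheet#load_errors might hold after parsing
# with catch_cell_load_errors enabled; keys are [rownum, colnum] arrays.
load_errors = {
  [2, 0] => 'invalid date',
  [5, 3] => 'invalid value for Float'
}

# Array keys are compared by content, so a cell can be looked up directly.
cell_error = load_errors[[2, 0]]

# Destructure the key in the block parameters to build readable messages.
report = load_errors.map do |(rownum, colnum), message|
  "row #{rownum}, col #{colnum}: #{message}"
end

puts cell_error
puts report
```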
@@ -233,11 +244,9 @@ This project follows [semantic versioning 1.0](http://semver.org/spec/v1.0.0.htm
 Remember to write tests, think about edge cases, and run the existing
 suite.
 
-
-
-
-or you're way off that, there is probably a performance regression in
-your code.
+The full suite contains a performance test that on an M1 MBP runs the final
+large file in about five seconds. Check out that test before & after your
+change to check for performance changes.
 
 Then, the standard stuff:
 
data/lib/simple_xlsx_reader/hyperlink.rb
CHANGED
@@ -4,27 +4,26 @@ module SimpleXlsxReader
   # We support hyperlinks as a "type" even though they're technically
   # represented either as a function or an external reference in the xlsx spec.
   #
-  #
-  #
-  #
-  #
-  #
-  # string to tack on the friendly name. This means 80% of us that just want
-  # the URL value will have to do nothing extra, but the 20% that might want the
-  # friendly name can access it.
+  # In practice, hyperlinks are usually a link or a mailto. In the case of a
+  # link, we probably want to follow it to download something, but in the case
+  # of an email, we probably just want the email and not the mailto. So we
+  # represent a hyperlink primarily as it is seen by the user, following the
+  # principle of least surprise, but the url is accessible via #url.
   #
-  #
-  #
-  #
+  # Microsoft calls the visible part of a hyperlink cell the "friendly name,"
+  # so we expose that as a method too, in case you want to be explicit about
+  # how you're accessing it.
   #
   # See MS documentation on the HYPERLINK function for some background:
   # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
   class Hyperlink < String
     attr_reader :friendly_name
+    attr_reader :url
 
     def initialize(url, friendly_name = nil)
       @friendly_name = friendly_name
-
+      @url = url
+      super(friendly_name || url)
     end
   end
 end
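To make the behavior change in this hunk concrete, here is a minimal sketch that mirrors the class body shown above. It uses a stand-in named `DemoHyperlink` (a hypothetical name chosen so it is not mistaken for the gem's own `SimpleXlsxReader::Hyperlink`), and the URLs are invented for the example.

```ruby
# Stand-in copying the constructor from the diff above: the string value is
# the friendly name when one is given, falling back to the URL otherwise.
class DemoHyperlink < String
  attr_reader :friendly_name, :url

  def initialize(url, friendly_name = nil)
    @friendly_name = friendly_name
    @url = url
    super(friendly_name || url)
  end
end

email = DemoHyperlink.new('mailto:jane@example.com', 'jane@example.com')
bare  = DemoHyperlink.new('https://example.com/report.xlsx')

puts email      # the visible cell text, without the mailto: prefix
puts email.url  # the underlying link target is still available
puts bare       # no friendly name, so the URL is the string value
```

Since the object is still a `String`, code that compared or printed cell values keeps working; only callers that relied on the string being the URL need to switch to `#url`.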
data/lib/simple_xlsx_reader/loader/sheet_parser.rb
CHANGED
@@ -31,10 +31,9 @@ module SimpleXlsxReader
         @url = nil # silence warnings
         @function = nil # silence warnings
         @capture = nil # silence warnings
+        @captured = nil # silence warnings
         @dimension = nil # silence warnings
 
-        @file_io.rewind # in case we've already parsed this once
-
         # In this project this is only used for GUI-made hyperlinks (as opposed
         # to FUNCTION-based hyperlinks). Unfortunately the're needed to parse
         # the spreadsheet, and they come AFTER the sheet data. So, solution is
@@ -44,9 +43,10 @@ module SimpleXlsxReader
         if xrels_file&.grep(/hyperlink/)&.any?
           xrels_file.rewind
           load_gui_hyperlinks # represented as hyperlinks_by_cell
-          @file_io.rewind
         end
 
+        @file_io.rewind # in case we've already parsed this once
+
         Nokogiri::XML::SAX::Parser.new(self).parse(@file_io)
       end
 
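The hunk above moves `@file_io.rewind` so that it always runs immediately before the SAX parse. As a general illustration of why an IO that has already been read yields blank content until it is rewound, here is a sketch using only `StringIO` from the standard library. It is not the gem's code, the XML snippet is invented, and it does not claim to reproduce the exact failure fixed in 5.0.0.

```ruby
require 'stringio'

io = StringIO.new("<Relationship Type=\"hyperlink\"/>\n<sheetData/>\n")

# Scanning with grep reads every line, leaving the position at end of stream.
has_links = io.grep(/hyperlink/).any?

# A second read without rewinding finds nothing left to consume.
after_scan = io.read

# Rewinding resets the position, so the full content is readable again.
io.rewind
after_rewind = io.read

puts has_links
puts after_scan.empty?
puts after_rewind.length
```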
data/test/performance_test.rb
CHANGED
@@ -70,7 +70,7 @@ describe 'SimpleXlsxReader Benchmark' do
   let(:styles) do
     # s='0' above refers to the value of numFmtId at cellXfs index 0,
     # which is in this case 'General' type
-
+    _styles =
       <<-XML
         <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
           <cellXfs count="1">
data/test/simple_xlsx_reader_test.rb
CHANGED
@@ -92,7 +92,7 @@ describe SimpleXlsxReader do
         body: 'The Greatest',
         created_at: Time.parse('2002-01-01 11:00:00 UTC'),
         count: 1,
-        "URL" => '
+        "URL" => 'This uses the HYPERLINK() function'
       )
 
       _(rows.slurped?).must_equal false
@@ -1031,6 +1031,52 @@ describe SimpleXlsxReader do
     end
   end
 
+  describe 'parsing documents with non-hyperlinked rels' do
+    let(:rels) do
+      [
+        Nokogiri::XML(
+          <<-XML
+            <?xml version="1.0" encoding="UTF-8"?>
+            <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"></Relationships>
+          XML
+        ).remove_namespaces!
+      ]
+    end
+
+    describe 'when document is opened as path' do
+      before do
+        @row = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a[0]
+      end
+
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+
+    describe 'when document is parsed as a String' do
+      before do
+        output = File.binread(xlsx.archive.path)
+        @row = SimpleXlsxReader.parse(output).sheets[0].rows.to_a[0]
+      end
+
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+
+    describe 'when document is parsed as StringIO' do
+      before do
+        stream = StringIO.new(File.binread(xlsx.archive.path), 'rb')
+        @row = SimpleXlsxReader.parse(stream).sheets[0].rows.to_a[0]
+        stream.close
+      end
+
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+  end
+
   # https://support.microsoft.com/en-us/office/available-number-formats-in-excel-0afe8f52-97db-41f1-b972-4b46e9f1e8d2
   describe 'numeric fields styled as "General"' do
     let(:misc_numbers_path) do
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: simple_xlsx_reader
 version: !ruby/object:Gem::Version
-  version:
+  version: 5.0.0
 platform: ruby
 authors:
 - Woody Peterson
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-06-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri