simple_xlsx_reader 4.0.1 → 5.0.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +7 -0
- data/README.md +38 -29
- data/lib/simple_xlsx_reader/hyperlink.rb +11 -12
- data/lib/simple_xlsx_reader/loader/sheet_parser.rb +3 -3
- data/lib/simple_xlsx_reader/version.rb +1 -1
- data/test/performance_test.rb +1 -1
- data/test/simple_xlsx_reader_test.rb +47 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8552d34f153cbdc6561c40725488d193e9aa48debcded0af24d32daf01b2f951
+  data.tar.gz: 2a0fecdec3698bb16717244fc7bf9b45b4fe0f6b216038e9823f9a5fea2ea8fa
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 77f99e8ad1020f0313171dcd0b14f7200fdf116e16de312146eb66a4d9347e94a0bf1cb4483f606975cd8bc776e80995473485271e05ee0a11136ef72cdeeae5
+  data.tar.gz: 7ee3ed8c37df6632981bd6eeb301de5f852df0f66534ce91593923cf1b51aa1dc0b07aed224d5d88cbd4b1f8a6901fdb17164e6e9f22fb10d4e5d90a3c24f437
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,10 @@
+### 5.0.0
+
+* Change SimpleXlsxReader::Hyperlink to default to the visible cell value
+  instead of the hyperlink URL, which in the case of mailto hyperlinks is
+  surprising.
+* Fix blank content when parsing docs from string (@codemole)
+
 ### 4.0.1
 
 * Fix nil error when handling some inline strings
data/README.md
CHANGED
@@ -9,15 +9,17 @@ then forgotten. We just want to get the data, and get out!
 
 ## Summary (now with stream parsing):
 
-
-
-
-
-
-
-
-
-
+```ruby
+doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
+doc.sheets # => [<#SXR::Sheet>, ...]
+doc.sheets.first.name # 'Sheet1'
+rows = doc.sheet.first.rows # <SXR::Document::RowsProxy>
+rows.each # an <Enumerator> ready to chain or stream
+rows.each {} # Streams the rows to your block
+rows.each(headers: true) {} # Streams row-hashes
+rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+rows.slurp # Slurps rows into memory as a 2D array
+```
 
 That's the gist of it!
 
@@ -29,7 +31,8 @@ See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0
 
 This project was started years ago, primarily because other Ruby xlsx parsers
 didn't import data with the correct types. Numbers as strings, dates as numbers,
-hyperlinks
+[hyperlinks](https://github.com/woahdae/simple_xlsx_reader/blob/master/lib/simple_xlsx_reader/hyperlink.rb)
+with inaccessible URLs, or - subtly buggy - simple dates as DateTime
 objects. If your app uses a timezone offset, depending on what timezone and
 what time of day you load the xlsx file, your dates might end up a day off!
 SimpleXlsxReader understands all these correctly.
@@ -39,12 +42,14 @@ SimpleXlsxReader understands all these correctly.
 Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
 SimpleXlsxReader strives to be fairly idiomatic Ruby:
 
-
-
-
-
-
-
+```ruby
+# quick example having fun w/ ruby
+doc = SimpleXlsxReader.open(path_or_io)
+doc.sheets.first.rows.each(headers: {id: /ID/})
+  .with_index.with_object({}) do |(row, index), acc|
+    acc[row[:id]] = index
+  end
+```
 
 ### Now faster
 
@@ -77,15 +82,19 @@ If you had an excel sheet representing this data:
 
 Get a handle on the rows proxy:
 
-
+```ruby
+rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows
+```
 
 Simple streaming (kinda boring):
 
-
+```ruby
+rows.each { |row| ... }
+````
 
 Streaming with headers, and how about a little enumerable chaining:
 
-```
+```ruby
 # Map of hero names by ID: { 117 => 'John Halo', ... }
 
 rows.each(headers: true).with_object({}) do |row, acc|
@@ -108,7 +117,7 @@ Sometimes though you have some junk at the top of your spreadsheet:
 For this, `headers` can be a hash whose keys replace headers and whose values
 help find the correct header row:
 
-```
+```ruby
 # Same map of hero names by ID: { 117 => 'John Halo', ... }
 
 rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
@@ -119,7 +128,7 @@ end
 If your header-to-attribute mapping is more complicated than key/value, you
 can do the mapping elsewhere, but use a block to find the header row:
 
-```
+```ruby
 # Example roughly analogous to some production code mapping a single spreadsheet
 # across many objects. Might be a simpler way now that we have the headers-hash
 # feature.
@@ -168,9 +177,11 @@ can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
 true`, and load errors will instead be inserted into Sheet#load_errors keyed
 by [rownum, colnum]:
 
-
-
-
+```ruby
+{
+  [rownum, colnum] => '[error]'
+}
+```
 
 ### Performance
 
@@ -233,11 +244,9 @@ This project follows [semantic versioning 1.0](http://semver.org/spec/v1.0.0.htm
 Remember to write tests, think about edge cases, and run the existing
 suite.
 
-
-
-
-or you're way off that, there is probably a performance regression in
-your code.
+The full suite contains a performance test that on an M1 MBP runs the final
+large file in about five seconds. Check out that test before & after your
+change to check for performance changes.
 
 Then, the standard stuff:
 
data/lib/simple_xlsx_reader/hyperlink.rb
CHANGED
@@ -4,27 +4,26 @@ module SimpleXlsxReader
   # We support hyperlinks as a "type" even though they're technically
   # represented either as a function or an external reference in the xlsx spec.
   #
-  #
-  #
-  #
-  #
-  #
-  # string to tack on the friendly name. This means 80% of us that just want
-  # the URL value will have to do nothing extra, but the 20% that might want the
-  # friendly name can access it.
+  # In practice, hyperlinks are usually a link or a mailto. In the case of a
+  # link, we probably want to follow it to download something, but in the case
+  # of an email, we probably just want the email and not the mailto. So we
+  # represent a hyperlink primarily as it is seen by the user, following the
+  # principle of least surprise, but the url is accessible via #url.
   #
-  #
-  #
-  #
+  # Microsoft calls the visible part of a hyperlink cell the "friendly name,"
+  # so we expose that as a method too, in case you want to be explicit about
+  # how you're accessing it.
   #
   # See MS documentation on the HYPERLINK function for some background:
   # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
   class Hyperlink < String
     attr_reader :friendly_name
+    attr_reader :url
 
     def initialize(url, friendly_name = nil)
       @friendly_name = friendly_name
-
+      @url = url
+      super(friendly_name || url)
     end
   end
 end
data/lib/simple_xlsx_reader/loader/sheet_parser.rb
CHANGED
@@ -31,10 +31,9 @@ module SimpleXlsxReader
   @url = nil # silence warnings
   @function = nil # silence warnings
   @capture = nil # silence warnings
+  @captured = nil # silence warnings
   @dimension = nil # silence warnings
 
-  @file_io.rewind # in case we've already parsed this once
-
   # In this project this is only used for GUI-made hyperlinks (as opposed
   # to FUNCTION-based hyperlinks). Unfortunately the're needed to parse
   # the spreadsheet, and they come AFTER the sheet data. So, solution is
@@ -44,9 +43,10 @@ module SimpleXlsxReader
   if xrels_file&.grep(/hyperlink/)&.any?
     xrels_file.rewind
     load_gui_hyperlinks # represented as hyperlinks_by_cell
-    @file_io.rewind
   end
 
+  @file_io.rewind # in case we've already parsed this once
+
   Nokogiri::XML::SAX::Parser.new(self).parse(@file_io)
 end
 
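The hunk above moves `@file_io.rewind` so that it runs immediately before the SAX parse. The diff does not spell out how the stream ended up away from its start, so what follows is only a generic illustration using a plain `StringIO` with hypothetical variable names: an IO that has already been read yields nothing on the next read until it is rewound.

```ruby
require 'stringio'

# Hypothetical stand-in for sheet XML held in memory.
io = StringIO.new('<row><c>Cell A</c></row>')

first_read = io.read  # consumes the stream; the position is now at the end
blank_read = io.read  # nothing left to read, so this is an empty string

io.rewind             # move back to the start right before the next consumer
reread = io.read      # the full content is available again

puts blank_read.empty?     # true
puts first_read == reread  # true
```

Rewinding as the last step before handing the IO to its consumer means no earlier read can leave the consumer with an empty stream.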
data/test/performance_test.rb
CHANGED
@@ -70,7 +70,7 @@ describe 'SimpleXlsxReader Benchmark' do
 let(:styles) do
   # s='0' above refers to the value of numFmtId at cellXfs index 0,
   # which is in this case 'General' type
-
+  _styles =
     <<-XML
       <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
         <cellXfs count="1">
data/test/simple_xlsx_reader_test.rb
CHANGED
@@ -92,7 +92,7 @@ describe SimpleXlsxReader do
   body: 'The Greatest',
   created_at: Time.parse('2002-01-01 11:00:00 UTC'),
   count: 1,
-  "URL" => '
+  "URL" => 'This uses the HYPERLINK() function'
 )
 
 _(rows.slurped?).must_equal false
@@ -1031,6 +1031,52 @@ describe SimpleXlsxReader do
     end
   end
 
+  describe 'parsing documents with non-hyperlinked rels' do
+    let(:rels) do
+      [
+        Nokogiri::XML(
+          <<-XML
+            <?xml version="1.0" encoding="UTF-8"?>
+            <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"></Relationships>
+          XML
+        ).remove_namespaces!
+      ]
+    end
+
+    describe 'when document is opened as path' do
+      before do
+        @row = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a[0]
+      end
+
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+
+    describe 'when document is parsed as a String' do
+      before do
+        output = File.binread(xlsx.archive.path)
+        @row = SimpleXlsxReader.parse(output).sheets[0].rows.to_a[0]
+      end
+
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+
+    describe 'when document is parsed as StringIO' do
+      before do
+        stream = StringIO.new(File.binread(xlsx.archive.path), 'rb')
+        @row = SimpleXlsxReader.parse(stream).sheets[0].rows.to_a[0]
+        stream.close
+      end
+
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+  end
+
   # https://support.microsoft.com/en-us/office/available-number-formats-in-excel-0afe8f52-97db-41f1-b972-4b46e9f1e8d2
   describe 'numeric fields styled as "General"' do
     let(:misc_numbers_path) do
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: simple_xlsx_reader
 version: !ruby/object:Gem::Version
-  version:
+  version: 5.0.0
 platform: ruby
 authors:
 - Woody Peterson
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-06-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri