simple_xlsx_reader 4.0.0 → 5.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d60e969d7d2db69578d543b6ebab36d29b3e3e88b4d62d67d00218b09b644cc5
4
- data.tar.gz: 36e12c7d95b6319f8f3bb565a1bc1b8eea6d1ca44928b3188a6892bf1f0c6513
3
+ metadata.gz: 8552d34f153cbdc6561c40725488d193e9aa48debcded0af24d32daf01b2f951
4
+ data.tar.gz: 2a0fecdec3698bb16717244fc7bf9b45b4fe0f6b216038e9823f9a5fea2ea8fa
5
5
  SHA512:
6
- metadata.gz: 6610958e6cb393e6013d303dd541f80a19d91415f6ebbe1d03162b52580ac361ad3f7e8e9fef5904a1daae72fe0774a5a83f47617c75ac185748c78e2c828e5a
7
- data.tar.gz: f556a9d31d48aa7cfeb0a1a9194736f2740ae3a2c868ed6a65fc197411351b28751d9caa72fd2bbbeb7eb22acd9ba0e2d53606f414bd36843f811ccb93d80ed2
6
+ metadata.gz: 77f99e8ad1020f0313171dcd0b14f7200fdf116e16de312146eb66a4d9347e94a0bf1cb4483f606975cd8bc776e80995473485271e05ee0a11136ef72cdeeae5
7
+ data.tar.gz: 7ee3ed8c37df6632981bd6eeb301de5f852df0f66534ce91593923cf1b51aa1dc0b07aed224d5d88cbd4b1f8a6901fdb17164e6e9f22fb10d4e5d90a3c24f437
data/CHANGELOG.md CHANGED
@@ -1,3 +1,21 @@
1
+ ### 5.0.0
2
+
3
+ * Change SimpleXlsxReader::Hyperlink to default to the visible cell value
4
+ instead of the hyperlink URL, which in the case of mailto hyperlinks is
5
+ surprising.
6
+ * Fix blank content when parsing docs from string (@codemole)
7
+
8
+ ### 4.0.1
9
+
10
+ * Fix nil error when handling some inline strings
11
+
12
+ Inline strings are almost exclusively used by non-Excel XLSX
13
+ implementations, but are valid, and sometimes have nil chunks.
14
+
15
+ Also, inline strings weren't preserving whitespace if Nokogiri is
16
+ parsing the string in chunks, as it does when encountering escaped
17
+ characters. Fixed.
18
+
1
19
  ### 4.0.0
2
20
 
3
21
  * Fix percentage rounding errors. Previously we were dividing by 100, when we
data/README.md CHANGED
@@ -9,15 +9,17 @@ then forgotten. We just want to get the data, and get out!
9
9
 
10
10
  ## Summary (now with stream parsing):
11
11
 
12
- doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
13
- doc.sheets # => [<#SXR::Sheet>, ...]
14
- doc.sheets.first.name # 'Sheet1'
15
- doc.sheets.first.rows # <SXR::Document::RowsProxy>
16
- doc.sheets.first.rows.each # an <Enumerator> ready to chain or stream
17
- doc.sheets.first.rows.each {} # Streams the rows to your block
18
- doc.sheets.first.rows.each(headers: true) {} # Streams row-hashes
19
- doc.sheets.first.rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
20
- doc.sheets.first.rows.slurp # Slurps rows into memory as a 2D array
12
+ ```ruby
13
+ doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
14
+ doc.sheets # => [<#SXR::Sheet>, ...]
15
+ doc.sheets.first.name # 'Sheet1'
16
+ rows = doc.sheets.first.rows # <SXR::Document::RowsProxy>
17
+ rows.each # an <Enumerator> ready to chain or stream
18
+ rows.each {} # Streams the rows to your block
19
+ rows.each(headers: true) {} # Streams row-hashes
20
+ rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
21
+ rows.slurp # Slurps rows into memory as a 2D array
22
+ ```
21
23
 
22
24
  That's the gist of it!
23
25
 
@@ -29,7 +31,8 @@ See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0
29
31
 
30
32
  This project was started years ago, primarily because other Ruby xlsx parsers
31
33
  didn't import data with the correct types. Numbers as strings, dates as numbers,
32
- hyperlinks with inaccessible URLs, or - subtly buggy - simple dates as DateTime
34
+ [hyperlinks](https://github.com/woahdae/simple_xlsx_reader/blob/master/lib/simple_xlsx_reader/hyperlink.rb)
35
+ with inaccessible URLs, or - subtly buggy - simple dates as DateTime
33
36
  objects. If your app uses a timezone offset, depending on what timezone and
34
37
  what time of day you load the xlsx file, your dates might end up a day off!
35
38
  SimpleXlsxReader understands all these correctly.
@@ -39,12 +42,14 @@ SimpleXlsxReader understands all these correctly.
39
42
  Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
40
43
  SimpleXlsxReader strives to be fairly idiomatic Ruby:
41
44
 
42
- # quick example having fun w/ ruby
43
- doc = SimpleXlsxReader.open(path_or_io)
44
- doc.sheets.first.rows.each(headers: {id: /ID/})
45
- .with_index.with_object({}) do |(row, index), acc|
46
- acc[row[:id]] = index
47
- end
45
+ ```ruby
46
+ # quick example having fun w/ ruby
47
+ doc = SimpleXlsxReader.open(path_or_io)
48
+ doc.sheets.first.rows.each(headers: {id: /ID/})
49
+ .with_index.with_object({}) do |(row, index), acc|
50
+ acc[row[:id]] = index
51
+ end
52
+ ```
48
53
 
49
54
  ### Now faster
50
55
 
@@ -77,15 +82,19 @@ If you had an excel sheet representing this data:
77
82
 
78
83
  Get a handle on the rows proxy:
79
84
 
80
- `rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows`
85
+ ```ruby
86
+ rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows
87
+ ```
81
88
 
82
89
  Simple streaming (kinda boring):
83
90
 
84
- `rows.each { |row| ... }`
91
+ ```ruby
92
+ rows.each { |row| ... }
93
+ ```
85
94
 
86
95
  Streaming with headers, and how about a little enumerable chaining:
87
96
 
88
- ```
97
+ ```ruby
89
98
  # Map of hero names by ID: { 117 => 'John Halo', ... }
90
99
 
91
100
  rows.each(headers: true).with_object({}) do |row, acc|
@@ -108,7 +117,7 @@ Sometimes though you have some junk at the top of your spreadsheet:
108
117
  For this, `headers` can be a hash whose keys replace headers and whose values
109
118
  help find the correct header row:
110
119
 
111
- ```
120
+ ```ruby
112
121
  # Same map of hero names by ID: { 117 => 'John Halo', ... }
113
122
 
114
123
  rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
@@ -119,7 +128,7 @@ end
119
128
  If your header-to-attribute mapping is more complicated than key/value, you
120
129
  can do the mapping elsewhere, but use a block to find the header row:
121
130
 
122
- ```
131
+ ```ruby
123
132
  # Example roughly analogous to some production code mapping a single spreadsheet
124
133
  # across many objects. Might be a simpler way now that we have the headers-hash
125
134
  # feature.
@@ -168,9 +177,11 @@ can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
168
177
  true`, and load errors will instead be inserted into Sheet#load_errors keyed
169
178
  by [rownum, colnum]:
170
179
 
171
- {
172
- [rownum, colnum] => '[error]'
173
- }
180
+ ```ruby
181
+ {
182
+ [rownum, colnum] => '[error]'
183
+ }
184
+ ```
174
185
 
175
186
  ### Performance
176
187
 
@@ -233,11 +244,9 @@ This project follows [semantic versioning 1.0](http://semver.org/spec/v1.0.0.htm
233
244
  Remember to write tests, think about edge cases, and run the existing
234
245
  suite.
235
246
 
236
- Note that as of commit 665cbafdde, the most extreme end of the
237
- linear-time performance test, which is 10,000 rows (12 columns), runs in
238
- ~4 seconds on Ruby 2.1 on a 2012 MBP. If the linear time assertion fails
239
- or you're way off that, there is probably a performance regression in
240
- your code.
247
+ The full suite contains a performance test that on an M1 MBP runs the final
248
+ large file in about five seconds. Check out that test before & after your
249
+ change to check for performance changes.
241
250
 
242
251
  Then, the standard stuff:
243
252
 
@@ -4,27 +4,26 @@ module SimpleXlsxReader
4
4
  # We support hyperlinks as a "type" even though they're technically
5
5
  # represented either as a function or an external reference in the xlsx spec.
6
6
  #
7
- # Since having hyperlink data in our sheet usually means we might want to do
8
- # something primarily with the URL (store it in the database, download it, etc),
9
- # we go through extra effort to parse the function or follow the reference
10
- # to represent the hyperlink primarily as a URL. However, maybe we do want
11
- # the hyperlink "friendly name" part (as MS calls it), so here we've subclassed
12
- # string to tack on the friendly name. This means 80% of us that just want
13
- # the URL value will have to do nothing extra, but the 20% that might want the
14
- # friendly name can access it.
7
+ # In practice, hyperlinks are usually a link or a mailto. In the case of a
8
+ # link, we probably want to follow it to download something, but in the case
9
+ # of an email, we probably just want the email and not the mailto. So we
10
+ # represent a hyperlink primarily as it is seen by the user, following the
11
+ # principle of least surprise, but the url is accessible via #url.
15
12
  #
16
- # Note, by default, the value we would get by just asking the cell would
17
- # be the "friendly name" and *not* the URL, which is tucked away in the
18
- # function definition or a separate "relationships" meta-document.
13
+ # Microsoft calls the visible part of a hyperlink cell the "friendly name,"
14
+ # so we expose that as a method too, in case you want to be explicit about
15
+ # how you're accessing it.
19
16
  #
20
17
  # See MS documentation on the HYPERLINK function for some background:
21
18
  # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
22
19
  class Hyperlink < String
23
20
  attr_reader :friendly_name
21
+ attr_reader :url
24
22
 
25
23
  def initialize(url, friendly_name = nil)
26
24
  @friendly_name = friendly_name
27
- super(url)
25
+ @url = url
26
+ super(friendly_name || url)
28
27
  end
29
28
  end
30
29
  end
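
To make the behavior change in this hunk concrete, here is a minimal sketch. It copies the new class body into a hypothetical `HyperlinkSketch` module so it runs without the gem installed. The module name and the example addresses are illustrative assumptions, not part of the gem.

```ruby
# Standalone copy of the 5.0.0 Hyperlink class body from the hunk above.
# HyperlinkSketch is a hypothetical namespace used only for this illustration.
module HyperlinkSketch
  class Hyperlink < String
    attr_reader :friendly_name
    attr_reader :url

    def initialize(url, friendly_name = nil)
      @friendly_name = friendly_name
      @url = url
      # The string value is now the visible text, falling back to the URL.
      super(friendly_name || url)
    end
  end
end

# A mailto hyperlink whose visible text is just the address (example data).
link = HyperlinkSketch::Hyperlink.new('mailto:jane@example.com', 'jane@example.com')
puts link               # jane@example.com
puts link.url           # mailto:jane@example.com
puts link.friendly_name # jane@example.com

# With no friendly name, the string value falls back to the URL.
bare = HyperlinkSketch::Hyperlink.new('https://www.example.com')
puts bare               # https://www.example.com
```

Code written against 4.0.0 that compared the cell value to a URL would presumably need to switch to `#url`, as the changed expectation in the test hunk further down (`"URL" => 'This uses the HYPERLINK() function'`) suggests.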
@@ -31,10 +31,9 @@ module SimpleXlsxReader
31
31
  @url = nil # silence warnings
32
32
  @function = nil # silence warnings
33
33
  @capture = nil # silence warnings
34
+ @captured = nil # silence warnings
34
35
  @dimension = nil # silence warnings
35
36
 
36
- @file_io.rewind # in case we've already parsed this once
37
-
38
37
  # In this project this is only used for GUI-made hyperlinks (as opposed
39
38
  # to FUNCTION-based hyperlinks). Unfortunately the're needed to parse
40
39
  # the spreadsheet, and they come AFTER the sheet data. So, solution is
@@ -44,9 +43,10 @@ module SimpleXlsxReader
44
43
  if xrels_file&.grep(/hyperlink/)&.any?
45
44
  xrels_file.rewind
46
45
  load_gui_hyperlinks # represented as hyperlinks_by_cell
47
- @file_io.rewind
48
46
  end
49
47
 
48
+ @file_io.rewind # in case we've already parsed this once
49
+
50
50
  Nokogiri::XML::SAX::Parser.new(self).parse(@file_io)
51
51
  end
52
52
 
@@ -80,7 +80,7 @@ module SimpleXlsxReader
80
80
  captured =
81
81
  begin
82
82
  SimpleXlsxReader::Loader.cast(
83
- string.strip, @type, @style,
83
+ string, @type, @style,
84
84
  url: @url || hyperlinks_by_cell&.[](@cell_name),
85
85
  shared_strings: shared_strings,
86
86
  base_date: base_date
@@ -99,7 +99,7 @@ module SimpleXlsxReader
99
99
  else
100
100
  @load_errors[[row_idx, col_idx]] = e.message
101
101
 
102
- string.strip
102
+ string
103
103
  end
104
104
  end
105
105
 
@@ -111,7 +111,7 @@ module SimpleXlsxReader
111
111
  # to make it not do this (looked, couldn't find it).
112
112
  #
113
113
  # Loading the workbook test/chunky_utf8.xlsx repros the issue.
114
- @captured = @captured ? @captured + captured : captured
114
+ @captured = @captured ? @captured + (captured || '') : captured
115
115
  end
116
116
 
117
117
  def end_element(name)
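
The two changes in this file (dropping `.strip` on each chunk and guarding against a `nil` chunk) can be sketched in isolation. This is a simplified stand-in, not the gem's parser: `accumulate` and `accumulate_stripped` are hypothetical helpers, and the chunk arrays are made-up examples of how a SAX parser might split text around an escaped character.

```ruby
# Hypothetical helper mirroring the 5.0.0 accumulation line:
#   @captured = @captured ? @captured + (captured || '') : captured
def accumulate(chunks)
  chunks.reduce(nil) do |acc, chunk|
    acc ? acc + (chunk || '') : chunk
  end
end

# Hypothetical helper that strips every chunk before joining, to illustrate
# the whitespace loss described in the 4.0.1 changelog entry.
def accumulate_stripped(chunks)
  chunks.compact.map(&:strip).join
end

# Text split around an escaped character arrives as several chunks.
chunks = ['Link A ', '&', ' Link B']
puts accumulate(chunks)          # Link A & Link B
puts accumulate_stripped(chunks) # Link A&Link B

# A nil chunk in the middle no longer raises a TypeError on String#+.
puts accumulate(['Cell ', nil, 'A']) # Cell A
```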
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SimpleXlsxReader
4
- VERSION = '4.0.0'
4
+ VERSION = '5.0.0'
5
5
  end
@@ -70,7 +70,7 @@ describe 'SimpleXlsxReader Benchmark' do
70
70
  let(:styles) do
71
71
  # s='0' above refers to the value of numFmtId at cellXfs index 0,
72
72
  # which is in this case 'General' type
73
- styles =
73
+ _styles =
74
74
  <<-XML
75
75
  <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
76
76
  <cellXfs count="1">
@@ -92,7 +92,7 @@ describe SimpleXlsxReader do
92
92
  body: 'The Greatest',
93
93
  created_at: Time.parse('2002-01-01 11:00:00 UTC'),
94
94
  count: 1,
95
- "URL" => 'http://www.example.com/hyperlink-function'
95
+ "URL" => 'This uses the HYPERLINK() function'
96
96
  )
97
97
 
98
98
  _(rows.slurped?).must_equal false
@@ -122,6 +122,52 @@ describe SimpleXlsxReader do
122
122
 
123
123
  let(:reader) { SimpleXlsxReader.open(xlsx.archive.path) }
124
124
 
125
+ describe 'when parsing escaped characters' do
126
+ let(:escaped_content) do
127
+ '&lt;a href="https://www.example.com"&gt;Link A&lt;/a&gt; &amp;bull; &lt;a href="https://www.example.com"&gt;Link B&lt;/a&gt;'
128
+ end
129
+
130
+ let(:unescaped_content) do
131
+ '<a href="https://www.example.com">Link A</a> &bull; <a href="https://www.example.com">Link B</a>'
132
+ end
133
+
134
+ let(:sheet) do
135
+ <<~XML
136
+ <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
137
+ <dimension ref="A1:B1" />
138
+ <sheetData>
139
+ <row r="1">
140
+ <c r="A1" s="1" t="s">
141
+ <v>0</v>
142
+ </c>
143
+ <c r='B1' s='0'>
144
+ <v>#{escaped_content}</v>
145
+ </c>
146
+ </row>
147
+ </sheetData>
148
+ </worksheet>
149
+ XML
150
+ end
151
+
152
+ let(:shared_strings) do
153
+ <<~XML
154
+ <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
155
+ <si>
156
+ <t>#{escaped_content}</t>
157
+ </si>
158
+ </sst>
159
+ XML
160
+ end
161
+
162
+ it 'loads correctly using inline strings' do
163
+ _(reader.sheets[0].rows.slurp[0][0]).must_equal(unescaped_content)
164
+ end
165
+
166
+ it 'loads correctly using shared strings' do
167
+ _(reader.sheets[0].rows.slurp[0][1]).must_equal(unescaped_content)
168
+ end
169
+ end
170
+
125
171
  describe 'Sheet#rows#each(headers: true)' do
126
172
  let(:sheet) do
127
173
  <<~XML
@@ -929,7 +975,7 @@ describe SimpleXlsxReader do
929
975
  )
930
976
  )
931
977
  end
932
-
978
+
933
979
  it "reads 'Generic' cells with numbers as numbers" do
934
980
  _(@row[9]).must_equal 1
935
981
  end
@@ -985,6 +1031,52 @@ describe SimpleXlsxReader do
985
1031
  end
986
1032
  end
987
1033
 
1034
+ describe 'parsing documents with non-hyperlinked rels' do
1035
+ let(:rels) do
1036
+ [
1037
+ Nokogiri::XML(
1038
+ <<-XML
1039
+ <?xml version="1.0" encoding="UTF-8"?>
1040
+ <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"></Relationships>
1041
+ XML
1042
+ ).remove_namespaces!
1043
+ ]
1044
+ end
1045
+
1046
+ describe 'when document is opened as path' do
1047
+ before do
1048
+ @row = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a[0]
1049
+ end
1050
+
1051
+ it 'reads cell content' do
1052
+ _(@row[0]).must_equal 'Cell A'
1053
+ end
1054
+ end
1055
+
1056
+ describe 'when document is parsed as a String' do
1057
+ before do
1058
+ output = File.binread(xlsx.archive.path)
1059
+ @row = SimpleXlsxReader.parse(output).sheets[0].rows.to_a[0]
1060
+ end
1061
+
1062
+ it 'reads cell content' do
1063
+ _(@row[0]).must_equal 'Cell A'
1064
+ end
1065
+ end
1066
+
1067
+ describe 'when document is parsed as StringIO' do
1068
+ before do
1069
+ stream = StringIO.new(File.binread(xlsx.archive.path), 'rb')
1070
+ @row = SimpleXlsxReader.parse(stream).sheets[0].rows.to_a[0]
1071
+ stream.close
1072
+ end
1073
+
1074
+ it 'reads cell content' do
1075
+ _(@row[0]).must_equal 'Cell A'
1076
+ end
1077
+ end
1078
+ end
1079
+
988
1080
  # https://support.microsoft.com/en-us/office/available-number-formats-in-excel-0afe8f52-97db-41f1-b972-4b46e9f1e8d2
989
1081
  describe 'numeric fields styled as "General"' do
990
1082
  let(:misc_numbers_path) do
@@ -57,7 +57,6 @@ TestXlsxBuilder = Struct.new(:shared_strings, :styles, :sheets, :workbook, :rels
57
57
  self.styles ||= DEFAULTS[:styles]
58
58
  self.sheets ||= [DEFAULTS[:sheet]]
59
59
  self.rels ||= []
60
- self.shared_strings ||= []
61
60
  end
62
61
 
63
62
  def archive
@@ -76,7 +75,7 @@ TestXlsxBuilder = Struct.new(:shared_strings, :styles, :sheets, :workbook, :rels
76
75
  styles_file.write(styles)
77
76
  end
78
77
 
79
- if shared_strings.any?
78
+ if shared_strings
80
79
  zip.get_output_stream('xl/sharedStrings.xml') do |ss_file|
81
80
  ss_file.write(shared_strings)
82
81
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: simple_xlsx_reader
3
3
  version: !ruby/object:Gem::Version
4
- version: 4.0.0
4
+ version: 5.0.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Woody Peterson
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2023-03-05 00:00:00.000000000 Z
11
+ date: 2023-06-17 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri