RubyGems - simple_xlsx_reader - Versions diffs - 4.0.0 → 5.0.0 - Mend

simple_xlsx_reader 4.0.0 → 5.0.0

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +18 -0
data/README.md +38 -29
data/lib/simple_xlsx_reader/hyperlink.rb +11 -12
data/lib/simple_xlsx_reader/loader/sheet_parser.rb +6 -6
data/lib/simple_xlsx_reader/version.rb +1 -1
data/test/performance_test.rb +1 -1
data/test/simple_xlsx_reader_test.rb +94 -2
data/test/test_xlsx_builder.rb +1 -2
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d60e969d7d2db69578d543b6ebab36d29b3e3e88b4d62d67d00218b09b644cc5
-  data.tar.gz: 36e12c7d95b6319f8f3bb565a1bc1b8eea6d1ca44928b3188a6892bf1f0c6513
+  metadata.gz: 8552d34f153cbdc6561c40725488d193e9aa48debcded0af24d32daf01b2f951
+  data.tar.gz: 2a0fecdec3698bb16717244fc7bf9b45b4fe0f6b216038e9823f9a5fea2ea8fa
 SHA512:
-  metadata.gz: 6610958e6cb393e6013d303dd541f80a19d91415f6ebbe1d03162b52580ac361ad3f7e8e9fef5904a1daae72fe0774a5a83f47617c75ac185748c78e2c828e5a
-  data.tar.gz: f556a9d31d48aa7cfeb0a1a9194736f2740ae3a2c868ed6a65fc197411351b28751d9caa72fd2bbbeb7eb22acd9ba0e2d53606f414bd36843f811ccb93d80ed2
+  metadata.gz: 77f99e8ad1020f0313171dcd0b14f7200fdf116e16de312146eb66a4d9347e94a0bf1cb4483f606975cd8bc776e80995473485271e05ee0a11136ef72cdeeae5
+  data.tar.gz: 7ee3ed8c37df6632981bd6eeb301de5f852df0f66534ce91593923cf1b51aa1dc0b07aed224d5d88cbd4b1f8a6901fdb17164e6e9f22fb10d4e5d90a3c24f437

data/CHANGELOG.md CHANGED

@@ -1,3 +1,21 @@
+### 5.0.0
+* Change SimpleXlsxReader::Hyperlink to default to the visible cell value
+  instead of the hyperlink URL, which in the case of mailto hyperlinks is
+  surprising.
+* Fix blank content when parsing docs from string (@codemole)
+### 4.0.1
+* Fix nil error when handling some inline strings
+  Inline strings are almost exclusively used by non-Excel XLSX
+  implementations, but are valid, and sometimes have nil chunks.
+  Also, inline strings weren't preserving whitespace if Nokogiri is
+  parsing the string in chunks, as it does when encountering escaped
+  characters. Fixed.
 ### 4.0.0
 * Fix percentage rounding errors. Previously we were dividing by 100, when we

data/README.md CHANGED

@@ -9,15 +9,17 @@ then forgotten. We just want to get the data, and get out!
 ## Summary (now with stream parsing):
-    doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
-    doc.sheets # => [<#SXR::Sheet>, ...]
-    doc.sheets.first.name # 'Sheet1'
-    doc.sheets.first.rows # <SXR::Document::RowsProxy>
-    doc.sheets.first.rows.each # an <Enumerator> ready to chain or stream
-    doc.sheets.first.rows.each {} # Streams the rows to your block
-    doc.sheets.first.rows.each(headers: true) {} # Streams row-hashes
-    doc.sheets.first.rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
-    doc.sheets.first.rows.slurp # Slurps rows into memory as a 2D array
+```ruby
+doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
+doc.sheets # => [<#SXR::Sheet>, ...]
+doc.sheets.first.name # 'Sheet1'
+rows = doc.sheets.first.rows # <SXR::Document::RowsProxy>
+rows.each # an <Enumerator> ready to chain or stream
+rows.each {} # Streams the rows to your block
+rows.each(headers: true) {} # Streams row-hashes
+rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+rows.slurp # Slurps rows into memory as a 2D array
+```
 That's the gist of it!
@@ -29,7 +31,8 @@ See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0
 This project was started years ago, primarily because other Ruby xlsx parsers
 didn't import data with the correct types. Numbers as strings, dates as numbers,
-hyperlinks with inaccessible URLs, or - subtly buggy - simple dates as DateTime
+[hyperlinks](https://github.com/woahdae/simple_xlsx_reader/blob/master/lib/simple_xlsx_reader/hyperlink.rb)
+with inaccessible URLs, or - subtly buggy - simple dates as DateTime
 objects. If your app uses a timezone offset, depending on what timezone and
 what time of day you load the xlsx file, your dates might end up a day off!
 SimpleXlsxReader understands all these correctly.
@@ -39,12 +42,14 @@ SimpleXlsxReader understands all these correctly.
 Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
 SimpleXlsxReader strives to be fairly idiomatic Ruby:
-    # quick example having fun w/ ruby
-    doc = SimpleXlsxReader.open(path_or_io)
-    doc.sheets.first.rows.each(headers: {id: /ID/})
-      .with_index.with_object({}) do |(row, index), acc|
-        acc[row[:id]] = index
-      end
+```ruby
+# quick example having fun w/ ruby
+doc = SimpleXlsxReader.open(path_or_io)
+doc.sheets.first.rows.each(headers: {id: /ID/})
+  .with_index.with_object({}) do |(row, index), acc|
+    acc[row[:id]] = index
+end
+```
 ### Now faster
@@ -77,15 +82,19 @@ If you had an excel sheet representing this data:
 Get a handle on the rows proxy:
-`rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows`
+```ruby
+rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows
+```
 Simple streaming (kinda boring):
-`rows.each { |row| ... }`
+```ruby
+rows.each { |row| ... }
+````
 Streaming with headers, and how about a little enumerable chaining:
-```
+```ruby
 # Map of hero names by ID: { 117 => 'John Halo', ... }
 rows.each(headers: true).with_object({}) do |row, acc|
@@ -108,7 +117,7 @@ Sometimes though you have some junk at the top of your spreadsheet:
 For this, `headers` can be a hash whose keys replace headers and whose values
 help find the correct header row:
-```
+```ruby
 # Same map of hero names by ID: { 117 => 'John Halo', ... }
 rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
@@ -119,7 +128,7 @@ end
 If your header-to-attribute mapping is more complicated than key/value, you
 can do the mapping elsewhere, but use a block to find the header row:
-```
+```ruby
 # Example roughly analogous to some production code mapping a single spreadsheet
 # across many objects. Might be a simpler way now that we have the headers-hash
 # feature.
@@ -168,9 +177,11 @@ can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
 true`, and load errors will instead be inserted into Sheet#load_errors keyed
 by [rownum, colnum]:
-    {
-      [rownum, colnum] => '[error]'
-    }
+```ruby
+{
+  [rownum, colnum] => '[error]'
+}
+```
 ### Performance
@@ -233,11 +244,9 @@ This project follows [semantic versioning 1.0](http://semver.org/spec/v1.0.0.htm
 Remember to write tests, think about edge cases, and run the existing
 suite.
-Note that as of commit 665cbafdde, the most extreme end of the
-linear-time performance test, which is 10,000 rows (12 columns), runs in
-~4 seconds on Ruby 2.1 on a 2012 MBP. If the linear time assertion fails
-or you're way off that, there is probably a performance regression in
-your code.
+The full suite contains a performance test that on an M1 MBP runs the final
+large file in about five seconds. Check out that test before & after your
+change to check for performance changes.
 Then, the standard stuff:

data/lib/simple_xlsx_reader/hyperlink.rb CHANGED

@@ -4,27 +4,26 @@ module SimpleXlsxReader
   # We support hyperlinks as a "type" even though they're technically
   # represented either as a function or an external reference in the xlsx spec.
   #
-  # Since having hyperlink data in our sheet usually means we might want to do
-  # something primarily with the URL (store it in the database, download it, etc),
-  # we go through extra effort to parse the function or follow the reference
-  # to represent the hyperlink primarily as a URL. However, maybe we do want
-  # the hyperlink "friendly name" part (as MS calls it), so here we've subclassed
-  # string to tack on the friendly name. This means 80% of us that just want
-  # the URL value will have to do nothing extra, but the 20% that might want the
-  # friendly name can access it.
+  # In practice, hyperlinks are usually a link or a mailto. In the case of a
+  # link, we probably want to follow it to download something, but in the case
+  # of an email, we probably just want the email and not the mailto. So we
+  # represent a hyperlink primarily as it is seen by the user, following the
+  # principle of least surprise, but the url is accessible via #url.
   #
-  # Note, by default, the value we would get by just asking the cell would
-  # be the "friendly name" and *not* the URL, which is tucked away in the
-  # function definition or a separate "relationships" meta-document.
+  # Microsoft calls the visible part of a hyperlink cell the "friendly name,"
+  # so we expose that as a method too, in case you want to be explicit about
+  # how you're accessing it.
   #
   # See MS documentation on the HYPERLINK function for some background:
   # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
   class Hyperlink < String
     attr_reader :friendly_name
+    attr_reader :url
     def initialize(url, friendly_name = nil)
       @friendly_name = friendly_name
-      super(url)
+      @url = url
+      super(friendly_name || url)
     end
   end
 end
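
The behavior change in this file can be seen in isolation. The sketch below copies the 5.0.0 class body from the diff above so that it runs without the gem installed; the URLs and cell text are made-up values for illustration only.

```ruby
# Class body copied from the 5.0.0 side of the diff above (module wrapper
# omitted so the snippet is self-contained).
class Hyperlink < String
  attr_reader :friendly_name
  attr_reader :url

  def initialize(url, friendly_name = nil)
    @friendly_name = friendly_name
    @url = url
    super(friendly_name || url)
  end
end

# Hypothetical cell values, not taken from the gem's test fixtures.
mail = Hyperlink.new('mailto:jane@example.com', 'jane@example.com')
bare = Hyperlink.new('https://www.example.com')

puts mail               # jane@example.com (the visible text is now the string value)
puts mail.url           # mailto:jane@example.com
puts mail.friendly_name # jane@example.com
puts bare               # https://www.example.com (falls back to the URL)
```

Code written against 4.x that treated the cell value itself as the URL will likely need to call `#url` after upgrading, since the string value is now the friendly name whenever one is present.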

data/lib/simple_xlsx_reader/loader/sheet_parser.rb CHANGED

@@ -31,10 +31,9 @@ module SimpleXlsxReader
         @url = nil # silence warnings
         @function = nil # silence warnings
         @capture = nil # silence warnings
+        @captured = nil # silence warnings
         @dimension = nil # silence warnings
-        @file_io.rewind # in case we've already parsed this once
         # In this project this is only used for GUI-made hyperlinks (as opposed
         # to FUNCTION-based hyperlinks). Unfortunately the're needed to parse
         # the spreadsheet, and they come AFTER the sheet data. So, solution is
@@ -44,9 +43,10 @@ module SimpleXlsxReader
         if xrels_file&.grep(/hyperlink/)&.any?
           xrels_file.rewind
           load_gui_hyperlinks # represented as hyperlinks_by_cell
-          @file_io.rewind
         end
+        @file_io.rewind # in case we've already parsed this once
         Nokogiri::XML::SAX::Parser.new(self).parse(@file_io)
       end
@@ -80,7 +80,7 @@ module SimpleXlsxReader
         captured =
           begin
             SimpleXlsxReader::Loader.cast(
-              string.strip, @type, @style,
+              string, @type, @style,
               url: @url || hyperlinks_by_cell&.[](@cell_name),
               shared_strings: shared_strings,
               base_date: base_date
@@ -99,7 +99,7 @@ module SimpleXlsxReader
             else
               @load_errors[[row_idx, col_idx]] = e.message
-              string.strip
+              string
             end
           end
@@ -111,7 +111,7 @@ module SimpleXlsxReader
         # to make it not do this (looked, couldn't find it).
         #
         # Loading the workbook test/chunky_utf8.xlsx repros the issue.
-        @captured = @captured ? @captured + captured : captured
+        @captured = @captured ? @captured + (captured || '') : captured
       end
       def end_element(name)
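
The last three hunks work together: chunks are no longer stripped before being cast, and a nil chunk no longer breaks the concatenation. The following is a minimal sketch of that accumulation rule outside the parser. `append_chunk` is a hypothetical helper that mirrors the changed `@captured` line, and the chunk boundaries are invented for illustration; the boundaries Nokogiri actually produces may differ.

```ruby
# Hypothetical helper mirroring the 5.0.0 accumulation line:
#   @captured = @captured ? @captured + (captured || '') : captured
def append_chunk(captured_so_far, chunk)
  captured_so_far ? captured_so_far + (chunk || '') : chunk
end

# Invented chunking of the text "Link A & Link B", with a nil chunk thrown in.
chunks = ['Link A', nil, ' ', '&', ' ', 'Link B']

# 5.0.0 behavior: chunks are concatenated as-is, nil is treated as empty.
kept = chunks.reduce(nil) { |acc, chunk| append_chunk(acc, chunk) }

# Stripping each chunk first (as the removed `string.strip` calls did)
# loses the whitespace that sits in its own chunk.
stripped = chunks.map { |chunk| chunk&.strip }
                 .reduce(nil) { |acc, chunk| append_chunk(acc, chunk) }

puts kept     # Link A & Link B
puts stripped # Link A&Link B
```

Without the `|| ''` guard, the nil chunk would raise a `TypeError` when added to the accumulated string, which matches the "nil error" described in the 4.0.1 changelog entry.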

data/lib/simple_xlsx_reader/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SimpleXlsxReader
-  VERSION = '4.0.0'
+  VERSION = '5.0.0'
 end

data/test/performance_test.rb CHANGED

@@ -70,7 +70,7 @@ describe 'SimpleXlsxReader Benchmark' do
   let(:styles) do
     # s='0' above refers to the value of numFmtId at cellXfs index 0,
     # which is in this case 'General' type
-    styles =
+    _styles =
       <<-XML
         <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
           <cellXfs count="1">

data/test/simple_xlsx_reader_test.rb CHANGED

@@ -92,7 +92,7 @@ describe SimpleXlsxReader do
         body: 'The Greatest',
         created_at: Time.parse('2002-01-01 11:00:00 UTC'),
         count: 1,
-        "URL" => 'http://www.example.com/hyperlink-function'
+        "URL" => 'This uses the HYPERLINK() function'
       )
       _(rows.slurped?).must_equal false
@@ -122,6 +122,52 @@ describe SimpleXlsxReader do
   let(:reader) { SimpleXlsxReader.open(xlsx.archive.path) }
+  describe 'when parsing escaped characters' do
+    let(:escaped_content) do
+      '&lt;a href="https://www.example.com"&gt;Link A&lt;/a&gt; &amp;bull; &lt;a href="https://www.example.com"&gt;Link B&lt;/a&gt;'
+    end
+    let(:unescaped_content) do
+      '<a href="https://www.example.com">Link A</a> &bull; <a href="https://www.example.com">Link B</a>'
+    end
+    let(:sheet) do
+      <<~XML
+        <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
+          <dimension ref="A1:B1" />
+          <sheetData>
+            <row r="1">
+              <c r="A1" s="1" t="s">
+                <v>0</v>
+              </c>
+              <c r='B1' s='0'>
+                <v>#{escaped_content}</v>
+              </c>
+            </row>
+          </sheetData>
+        </worksheet>
+      XML
+    end
+    let(:shared_strings) do
+      <<~XML
+        <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
+          <si>
+            <t>#{escaped_content}</t>
+          </si>
+        </sst>
+      XML
+    end
+    it 'loads correctly using inline strings' do
+      _(reader.sheets[0].rows.slurp[0][0]).must_equal(unescaped_content)
+    end
+    it 'loads correctly using shared strings' do
+      _(reader.sheets[0].rows.slurp[0][1]).must_equal(unescaped_content)
+    end
+  end
   describe 'Sheet#rows#each(headers: true)' do
     let(:sheet) do
       <<~XML
@@ -929,7 +975,7 @@ describe SimpleXlsxReader do
         )
       )
     end
     it "reads 'Generic' cells with numbers as numbers" do
       _(@row[9]).must_equal 1
     end
@@ -985,6 +1031,52 @@ describe SimpleXlsxReader do
     end
   end
+  describe 'parsing documents with non-hyperlinked rels' do
+    let(:rels) do
+      [
+        Nokogiri::XML(
+          <<-XML
+          <?xml version="1.0" encoding="UTF-8"?>
+          <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"></Relationships>
+          XML
+        ).remove_namespaces!
+      ]
+    end
+    describe 'when document is opened as path' do
+      before do
+        @row = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a[0]
+      end
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+    describe 'when document is parsed as a String' do
+      before do
+        output = File.binread(xlsx.archive.path)
+        @row = SimpleXlsxReader.parse(output).sheets[0].rows.to_a[0]
+      end
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+    describe 'when document is parsed as StringIO' do
+      before do
+        stream = StringIO.new(File.binread(xlsx.archive.path), 'rb')
+        @row = SimpleXlsxReader.parse(stream).sheets[0].rows.to_a[0]
+        stream.close
+      end
+      it 'reads cell content' do
+        _(@row[0]).must_equal 'Cell A'
+      end
+    end
+  end
   # https://support.microsoft.com/en-us/office/available-number-formats-in-excel-0afe8f52-97db-41f1-b972-4b46e9f1e8d2
   describe 'numeric fields styled as "General"' do
     let(:misc_numbers_path) do
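
The new String and StringIO tests exercise the relocated `@file_io.rewind` call in sheet_parser.rb. The diff does not say exactly how the stream position was being lost, so the snippet below only demonstrates the general Ruby IO behavior that makes an unconditional rewind before parsing a safe choice. `read_twice_then_rewind` is a hypothetical helper and the XML string is a placeholder.

```ruby
require 'stringio'

# Hypothetical helper: read an IO to the end, read again without rewinding,
# then rewind and read once more.
def read_twice_then_rewind(io)
  first = io.read
  second = io.read # position is at EOF, so nothing is left to read
  io.rewind
  third = io.read
  [first, second, third]
end

first, second, third = read_twice_then_rewind(StringIO.new('<sheetData/>'))

puts first.inspect  # "<sheetData/>"
puts second.inspect # ""
puts third.inspect  # "<sheetData/>"
```

Any consumer that reads the same stream a second time without rewinding sees empty content, which is consistent with the "blank content" symptom named in the 5.0.0 changelog entry.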

data/test/test_xlsx_builder.rb CHANGED

@@ -57,7 +57,6 @@ TestXlsxBuilder = Struct.new(:shared_strings, :styles, :sheets, :workbook, :rels
     self.styles ||= DEFAULTS[:styles]
     self.sheets ||= [DEFAULTS[:sheet]]
     self.rels ||= []
-    self.shared_strings ||= []
   end
   def archive
@@ -76,7 +75,7 @@ TestXlsxBuilder = Struct.new(:shared_strings, :styles, :sheets, :workbook, :rels
         styles_file.write(styles)
       end
-      if shared_strings.any?
+      if shared_strings
         zip.get_output_stream('xl/sharedStrings.xml') do |ss_file|
           ss_file.write(shared_strings)
         end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: simple_xlsx_reader
 version: !ruby/object:Gem::Version
-  version: 4.0.0
+  version: 5.0.0
 platform: ruby
 authors:
 - Woody Peterson
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-03-05 00:00:00.000000000 Z
+date: 2023-06-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri