RubyGems: simple_xlsx_reader 1.0.5 → 2.0.0

Files changed (24)

checksums.yaml +4 -4
data/.github/workflows/ruby.yml +38 -0
data/CHANGELOG.md +7 -0
data/README.md +190 -64
data/Rakefile +3 -1
data/lib/simple_xlsx_reader/document.rb +147 -0
data/lib/simple_xlsx_reader/hyperlink.rb +30 -0
data/lib/simple_xlsx_reader/loader/shared_strings_parser.rb +46 -0
data/lib/simple_xlsx_reader/loader/sheet_parser.rb +256 -0
data/lib/simple_xlsx_reader/loader/style_types_parser.rb +115 -0
data/lib/simple_xlsx_reader/loader/workbook_parser.rb +39 -0
data/lib/simple_xlsx_reader/loader.rb +199 -0
data/lib/simple_xlsx_reader/version.rb +3 -1
data/lib/simple_xlsx_reader.rb +23 -519
data/test/date1904_test.rb +5 -4
data/test/datetime_test.rb +17 -10
data/test/gdocs_sheet_test.rb +6 -5
data/test/lower_case_sharedstrings_test.rb +9 -4
data/test/performance_test.rb +85 -88
data/test/shared_strings.xml +4 -0
data/test/simple_xlsx_reader_test.rb +785 -375
data/test/test_helper.rb +4 -1
data/test/test_xlsx_builder.rb +104 -0
metadata +16 -6

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: e2b04473235c5ed2c2764f62a627fa6f16816c36e0fcff3497be229f8666a0f7
-  data.tar.gz: 9367b0082f31e9cb208d9f97ed6cb67d5276a459562809460694602339dfdaad
+  metadata.gz: 979490ce3bd7f0482879fb5fb5465e10ad1b07c1488d0a544950131d9063050a
+  data.tar.gz: 412d0040a586cc5ee4acdd4a2f74dd74f3bf9eb781a35d8a36c12f6caadc566c
 SHA512:
-  metadata.gz: cd42f7a0b8830a2f01703dca10ae779b973566ad25e3b74d31dc3693977fa5b2b3442e47bc1a3b50723bae3bb9f31facd923f1eaba06b51cc8b927e7fb207cf3
-  data.tar.gz: 38ecb026b0ad5a1985d88349a839a9d2972f85596504e6f300686f9751169a3c8d62582e79119106085a9cadc066517206da117993c3a30f48a5a0c58f256b4c
+  metadata.gz: 00c01bc0c2a393eb35e458411dfeab55b8bf30cee2661324cbd97a175baf0ceb31a881b1b2b7bd668a2b475ff008372c1428908340e30769308884355fdd46e8
+  data.tar.gz: 81b1b26806a97c56710cab64aa22212985dea82b308e2fbba6835f4ea7a69b79067268bb13537999594dc5722928f1df235938355a7d4a51b58ae7ed4af1d093

data/.github/workflows/ruby.yml ADDED

@@ -0,0 +1,38 @@
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+# This workflow will download a prebuilt Ruby version, install dependencies and run tests with Rake
+# For more information see: https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby
+name: Ruby
+on:
+  push:
+    branches: [ "master" ]
+  pull_request:
+    branches: [ "master" ]
+permissions:
+  contents: read
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        ruby-version: ['2.6', '2.7', '3.0']
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Ruby
+    # To automatically get bug fixes and new Ruby versions for ruby/setup-ruby,
+    # change this to (see https://github.com/ruby/setup-ruby#versioning):
+    # uses: ruby/setup-ruby@v1
+      uses: ruby/setup-ruby@2b019609e2b0f1ea1a2bc8ca11cb82ab46ada124
+      with:
+        ruby-version: ${{ matrix.ruby-version }}
+        bundler-cache: true # runs 'bundle install' and caches installed gems automatically
+    - name: Run tests
+      run: bundle exec rake

data/CHANGELOG.md CHANGED

@@ -1,3 +1,10 @@
+### 2.0.0
+* SPEED
+  * Reimplement internals in terms of a SAX parser
+  * Change `SimpleXlsxReader::Sheet#rows` to be a `RowsProxy` that streams `#each`
+* Convenience - use `rows#each(headers: true)` to get header names while enumerating rows
 ### 1.0.5
 * Support string or io input via `SimpleXlsxReader#parse` (@kalsan, @til)

data/README.md CHANGED

@@ -1,88 +1,214 @@
-# SimpleXlsxReader [![Build Status](https://travis-ci.org/woahdae/simple_xlsx_reader.svg?branch=master)](https://travis-ci.org/woahdae/simple_xlsx_reader)
+# SimpleXlsxReader
-An xlsx reader for Ruby that parses xlsx cell values into plain ruby
-primitives and dates/times.
+A [fast](#performance) xlsx reader for Ruby that parses xlsx cell values into
+plain ruby primitives and dates/times.
 This is *not* a rewrite of excel in Ruby. Font styles, for
 example, are parsed to determine whether a cell is a number or a date,
 then forgotten. We just want to get the data, and get out!
-## Usage
-### Summary:
+## Summary (now with stream parsing):
     doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
     doc.sheets # => [<#SXR::Sheet>, ...]
     doc.sheets.first.name # 'Sheet1'
-    doc.sheets.first.rows # [['Header 1', 'Header 2', ...]
-                             ['foo', 2, ...]]
+    doc.sheets.first.rows # <SXR::Document::RowsProxy>
+    doc.sheets.first.rows.each # an <Enumerator> ready to chain or stream
+    doc.sheets.first.rows.each {} # Streams the rows to your block
+    doc.sheets.first.rows.each(headers: true) {} # Streams row-hashes
+    doc.sheets.first.rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+    doc.sheets.first.rows.slurp # Slurps rows into memory as a 2D array
-That's it!
+That's the gist of it!
-### Load Errors
+See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0-pre/lib/simple_xlsx_reader/document.rb) object.
-By default, cell load errors (ex. if a date cell contains the string
-'hello') result in a SimpleXlsxReader::CellLoadError.
+## Why?
-If you would like to provide better error feedback to your users, you
-can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
-true`, and load errors will instead be inserted into Sheet#load_errors keyed
-by [rownum, colnum].
+### Accurate
-### More
+This project was started years ago, primarily because other Ruby xlsx parsers
+didn't import data with the correct types. Numbers as strings, dates as numbers,
+hyperlinks with inaccessible URLs, or - subtly buggy - simple dates as DateTime
+objects. If your app uses a timezone offset, depending on what timezone and
+what time of day you load the xlsx file, your dates might end up a day off!
+SimpleXlsxReader understands all these correctly.
-Here's the totality of the public api, in code:
+### Idiomatic
-    module SimpleXlsxReader
-      def self.open(file_path)
-        Document.new(file_path: file_path).tap(&:sheets)
-      end
+Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
+SimpleXlsxReader strives to be fairly idiomatic Ruby:
-      def self.parse(string_or_io)
-        Document.new(string_or_io: string_or_io).tap(&:sheets)
+    # quick example having fun w/ ruby
+    doc = SimpleXlsxReader.open(path_or_io)
+    doc.sheets.first.rows.each(headers: {id: /ID/})
+      .with_index.with_object({}) do |(row, index), acc|
+        acc[row[:id]] = index
       end
-      class Document
-        attr_reader :string_or_io
-        def initialize(legacy_file_path = nil, file_path: nil, string_or_io: nil)
-          ((file_path || legacy_file_path).nil? ^ string_or_io.nil?) ||
-            fail(ArgumentError, 'either file_path or string_or_io must be provided')
-          @string_or_io = string_or_io || File.new(file_path || legacy_file_path)
-        end
-        def sheets
-          @sheets ||= Mapper.new(xml).load_sheets
-        end
-        def to_hash
-          sheets.inject({}) {|acc, sheet| acc[sheet.name] = sheet.rows; acc}
-        end
-        def xml
-          Xml.load(string_or_io)
-        end
-        class Sheet < Struct.new(:name, :rows)
-          def headers
-            rows[0]
-          end
-          def data
-            rows[1..-1]
-          end
-          # Load errors will be a hash of the form:
-          # {
-          #   [rownum, colnum] => '[error]'
-          # }
-          def load_errors
-            @load_errors ||= {}
-          end
-        end
+### Now faster
+Finally, as of v2.0, SimpleXlsxReader is the fastest and most
+memory-efficient parser. Previously this project couldn't reasonably load
+anything over ~10k rows. Other parsers could load 100k+ rows, but were still
+taking ~1gb RSS to do so, even "streaming," which seemed excessive. So a SAX
+implementation was born. See [performance](#performance) for details.
+## Usage
+### Streaming
+SimpleXlsxReader is performant by default - If you use
+`rows.each {|row| ...}` it will stream the XLSX rows to your block without
+loading either the sheet XML or the full sheet data into memory.
+You can also chain `rows.each` with other Enumerable functions without
+triggering a slurp, and you have lots of ways to find and map headers while
+streaming.
+If you had an excel sheet representing this data:
+```
+| Hero ID | Hero Name  | Location     |
+| 13576   | Samus Aran | Planet Zebes |
+| 117     | John Halo  | Ring World   |
+| 9704133 | Iron Man   | Planet Earth |
+```
+Get a handle on the rows proxy:
+`rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows`
+Simple streaming (kinda boring):
+`rows.each { |row| ... }`
+Streaming with headers, and how about a little enumerable chaining:
+```
+# Map of hero names by ID: { 117 => 'John Halo', ... }
+rows.each(headers: true).with_object({}) do |row, acc|
+  acc[row['Hero ID']] = row['Hero Name']
+end
+```
+Sometimes though you have some junk at the top of your spreadsheet:
+```
+| Unofficial Report  |                        |              |
+| Dont tell Nintendo | Yes "John Halo" I know |              |
+|                    |                        |              |
+| Hero ID            | Hero Name              | Location     |
+| 13576              | Samus Aran             | Planet Zebes |
+| 117                | John Halo              | Ring World   |
+| 9704133            | Iron Man               | Planet Earth |
+```
+For this, `headers` can be a hash whose keys replace headers and whose values
+help find the correct header row:
+```
+# Same map of hero names by ID: { 117 => 'John Halo', ... }
+rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
+  acc[row[:id]] = row[:name]
+end
+```
+If your header-to-attribute mapping is more complicated than key/value, you
+can do the mapping elsewhere, but use a block to find the header row:
+```
+# Example roughly analogous to some production code mapping a single spreadsheet
+# across many objects. Might be a simpler way now that we have the headers-hash
+# feature.
+object_map = { Hero => { id: 'Hero ID', name: 'Hero Name', location: 'Location' } }
+HEADERS = ['Hero ID', 'Hero Name', 'Location']
+rows.each(headers: ->(row) { (HEADERS & row).any? }) do |row|
+  object_map.each_pair do |klass, attribute_map|
+    attributes =
+      attribute_map.each_pair.with_object({}) do |(key, header), attrs|
+        attrs[key] = row[header]
       end
-    end
+    klass.new(attributes)
+  end
+end
+```
+### Slurping
+To make SimpleXlsxReader rows act like an array, for use with legacy
+SimpleXlsxReader apps or otherwise, we still support slurping the whole array
+into memory. The good news is even when doing this, the xlsx worksheet & shared
+string files are never loaded as a (big) Nokogiri doc, so that's nice.
+By default, to prevent accidental slurping, `<RowsProxy>` will throw an exception
+if you try to access it with array methods like `[]` and `shift` without
+explicitly slurping first. You can slurp either by calling `rows.slurp` or
+globally by setting `SimpleXlsxReader.configuration.auto_slurp = true`.
+Once slurped, enumerable methods on `rows` will use the slurped data
+(i.e. not re-parse the sheet), and those Array-like methods will work.
+We don't support all Array methods, just the few we have used in real projects,
+as we transition towards streaming instead.
+### Load Errors
+By default, cell load errors (ex. if a date cell contains the string
+'hello') result in a SimpleXlsxReader::CellLoadError.
+If you would like to provide better error feedback to your users, you
+can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
+true`, and load errors will instead be inserted into Sheet#load_errors keyed
+by [rownum, colnum]:
+    {
+      [rownum, colnum] => '[error]'
+    }
+### Performance
+SimpleXlsxReader is (as of this writing) the fastest and most memory efficient
+Ruby xlsx parser.
+Recent updates here have focused on large spreadsheets with especially
+non-unique strings in sheets using xlsx' shared strings feature
+(Excel-generated spreadsheets always use this). Other projects have implemented
+streaming parsers for the sheet data, but currently none stream while loading
+the shared strings file, which is the second-largest file in an xlsx archive
+and can represent millions of strings in large files.
+For more details, see [my fork of @shkm's excel benchmark project](https://github.com/woahdae/excel-parsing-benchmarks), but here's the summary:
+1mb excel file, 10,000 rows of sample "sales records" with a fair amount of
+non-unique strings (ran on an M1 Macbook Pro):
+| Gem                | Parses/second | RSS Increase | Allocated Mem | Retained Mem | Allocated Objects | Retained Objects |
+|--------------------|---------------|--------------|---------------|--------------|-------------------|------------------|
+| simple_xlsx_reader | 1.13          | 36.94mb      | 614.51mb      | 1.13kb       | 8796275           | 3                |
+| roo                | 0.75          | 74.0mb       | 164.47mb      | 2.18kb       | 2128396           | 4                |
+| creek              | 0.65          | 107.55mb     | 581.38mb      | 3.3kb        | 7240760           | 16               |
+| xsv                | 0.61          | 75.66mb      | 2127.42mb     | 3.66kb       | 5922563           | 10               |
+| rubyxl             | 0.27          | 373.52mb     | 716.7mb       | 2.18kb       | 10612577          | 4                |
+Here is a benchmark for the "worst" file I've seen, a 26mb file whose shared
+strings represent 10% of the archive (note, MemoryProfiler has too much
+overhead to reasonably measure allocations so that analysis was left off, and
+we just measure total time for one parse):
+| Gem                | Time    | RSS Increase |
+|--------------------|---------|--------------|
+| simple_xlsx_reader | 28.71s  | 148.77mb     |
+| roo                | 40.25s  | 1322.08mb    |
+| xsv                | 45.82s  | 391.27mb     |
+| creek              | 60.63s  | 886.81mb     |
+| rubyxl             | 238.68s | 9136.3mb     |
 ## Installation
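
Reviewer note: the `headers:` hash option described in the README hunk above (keys replace header cells, values locate the header row) can be illustrated on plain arrays. This is a minimal sketch of the documented behaviour only, not the gem's implementation; `each_with_mapped_headers` and `header_key` are hypothetical helper names, and the sample rows are adapted from the README example.

```ruby
# Hypothetical helper (not part of simple_xlsx_reader): returns the mapped key
# for a header cell, or the cell itself when no matcher applies.
def header_key(cell, header_map)
  pair = header_map.find { |_key, matcher| matcher === cell }
  pair ? pair.first : cell
end

# Hypothetical helper: skips rows until one contains at least one matching
# header cell, then yields every later row as a hash keyed by mapped headers.
def each_with_mapped_headers(rows, header_map)
  keys = nil
  rows.each do |row|
    if keys
      yield keys.zip(row).to_h
    elsif row.any? { |cell| header_map.each_value.any? { |m| m === cell } }
      keys = row.map { |cell| header_key(cell, header_map) }
    end
  end
end

sheet = [
  ['Unofficial Report', nil, nil],
  ['Dont tell Nintendo', 'Yes "John Halo" I know', nil],
  [nil, nil, nil],
  ['Hero ID', 'Hero Name', 'Location'],
  [13576, 'Samus Aran', 'Planet Zebes'],
  [117, 'John Halo', 'Ring World']
]

names_by_id = {}
each_with_mapped_headers(sheet, { id: /ID/, name: /Name/ }) do |row|
  names_by_id[row[:id]] = row[:name]
end
```

Unmapped header cells (here `'Location'`) keep their original text as the hash key in this sketch; whether the gem does the same is not stated in the diff.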

data/Rakefile CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require "bundler/gem_tasks"
 require 'rake/testtask'
@@ -6,4 +8,4 @@ Rake::TestTask.new do |t|
   t.libs << 'test'
 end
-task :default => [:test]
+task default: [:test]

data/lib/simple_xlsx_reader/document.rb ADDED

@@ -0,0 +1,147 @@
+# frozen_string_literal: true
+require 'forwardable'
+module SimpleXlsxReader
+  ##
+  # Main class for the public API. See the README for usage examples,
+  # or read the code, it's pretty friendly.
+  class Document
+    attr_reader :file_path
+    def initialize(file_path)
+      @file_path = file_path
+    end
+    def sheets
+      @sheets ||= Loader.new(file_path).init_sheets
+    end
+    # Expensive because it slurps all the sheets into memory,
+    # probably only appropriate for testing
+    def to_hash
+      sheets.each_with_object({}) { |sheet, acc| acc[sheet.name] = sheet.rows.to_a; }
+    end
+    # `rows` is a RowsProxy that responds to #each
+    class Sheet
+      extend Forwardable
+      attr_reader :name, :rows
+      def_delegators :rows, :load_errors, :slurp
+      def initialize(name:, sheet_parser:)
+        @name = name
+        @rows = RowsProxy.new(sheet_parser: sheet_parser)
+      end
+      # Legacy - consider `rows.each(headers: true)` for better performance
+      def headers
+        rows.slurped![0]
+      end
+      # Legacy - consider `rows` or `rows.each(headers: true)` for better
+      # performance
+      def data
+        rows.slurped![1..-1]
+      end
+    end
+    # Waits until we call #each with a block to parse the rows
+    class RowsProxy
+      include Enumerable
+      attr_reader :slurped, :load_errors
+      def initialize(sheet_parser:)
+        @sheet_parser = sheet_parser
+        @slurped = nil
+        @load_errors = {}
+      end
+      # By default, #each streams the rows to the provided block, either as
+      # arrays, or as header => cell value pairs if provided a `headers:`
+      # argument.
+      #
+      # `headers` can be:
+      #
+      # * `true` - simply takes the first row as the header row
+      # * block - calls the block with successive rows until the block returns
+      #   true, which it then uses that row for the headers. All data prior to
+      #   finding the headers is ignored.
+      # * hash - transforms the header row by replacing cells with keys matched
+      #   by value, ex. `{id: /ID|Identity/, name: /Name/i, date: 'Date'}` would
+      #   potentially yield the row `{id: 5, name: 'Jane', date: [Date object]}`
+      #   instead of the headers from the sheet. It would also search for the
+      #   row that matches at least one header, in case the header row isn't the
+      #   first.
+      #
+      # If rows have been slurped, #each will iterate the slurped rows instead.
+      #
+      # Note, calls to this after slurping will raise if given the `headers:`
+      # argument, as that's handled by the sheet parser. If this is important
+      # to someone, speak up and we could potentially support it.
+      def each(headers: false, &block)
+        if slurped?
+          raise '#each does not support headers with slurped rows' if headers
+          slurped.each(&block)
+        elsif block_given?
+          # It's possible to slurp while yielding to the block, which would
+          # null out @sheet_parser, so let's just keep track of it here too
+          sheet_parser = @sheet_parser
+          @sheet_parser.parse(headers: headers, &block).tap do
+            @load_errors = sheet_parser.load_errors
+          end
+        else
+          to_enum(:each, headers: headers)
+        end
+      end
+      # Mostly for legacy support, I'm not aware of a use case for doing this
+      # when you don't have to.
+      #
+      # Note that #each will use slurped results if available, and since we're
+      # leveraging Enumerable, all the other Enumerable methods will too.
+      def slurp
+        # possibly release sheet parser from memory on next GC run;
+        # untested, but it can hold a lot of stuff, so worth a try
+        @slurped ||= to_a.tap { @sheet_parser = nil }
+      end
+      def slurped?
+        !!@slurped
+      end
+      def slurped!
+        check_slurped
+        slurped
+      end
+      def [](*args)
+        check_slurped
+        slurped[*args]
+      end
+      def shift(*args)
+        check_slurped
+        slurped.shift(*args)
+      end
+      private
+      def check_slurped
+        slurp if SimpleXlsxReader.configuration.auto_slurp
+        return if slurped?
+        raise 'Called a slurp-y method without explicitly slurping;'\
+          ' use #each or call rows.slurp first'
+      end
+    end
+  end
+end
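
Reviewer note: the `RowsProxy` added above combines three ideas: `#each` streams from the parser, `#each` without a block returns an enumerator, and Array-style access is refused until the rows are slurped. A stripped-down sketch of that pattern follows. `FakeSheetParser` and `LazyRows` are hypothetical stand-ins written for this note; header handling, load errors and the `auto_slurp` setting are deliberately left out.

```ruby
# Hypothetical stand-in for the gem's sheet parser: yields rows one at a time
# and counts how often a full parse is started.
class FakeSheetParser
  attr_reader :parse_count

  def initialize(rows)
    @rows = rows
    @parse_count = 0
  end

  def parse
    @parse_count += 1
    @rows.each { |row| yield row }
  end
end

# Simplified proxy (not the gem's RowsProxy): streams until slurped.
class LazyRows
  include Enumerable

  def initialize(parser)
    @parser = parser
    @slurped = nil
  end

  def each(&block)
    return to_enum(:each) unless block_given?
    return @slurped.each(&block) if @slurped

    @parser.parse(&block)
  end

  # Loads all rows into memory once; later Enumerable calls reuse them.
  def slurp
    @slurped ||= to_a
  end

  def [](*args)
    raise 'call #slurp before using Array-style access' unless @slurped

    @slurped[*args]
  end
end

parser = FakeSheetParser.new([['Hero ID', 'Hero Name'], [117, 'John Halo']])
rows = LazyRows.new(parser)

first_row = rows.first # streams and stops after the first row
rows.slurp             # parses once more and caches the result
second_row = rows[1]   # Array-style access is allowed after slurping
```

Because `Enumerable` is built on `#each`, every other Enumerable method in this sketch streams before the slurp and reads the cached array afterwards, which is the behaviour the comments in the diff describe.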

data/lib/simple_xlsx_reader/hyperlink.rb ADDED

@@ -0,0 +1,30 @@
+# frozen_string_literal: true
+module SimpleXlsxReader
+  # We support hyperlinks as a "type" even though they're technically
+  # represented either as a function or an external reference in the xlsx spec.
+  #
+  # Since having hyperlink data in our sheet usually means we might want to do
+  # something primarily with the URL (store it in the database, download it, etc),
+  # we go through extra effort to parse the function or follow the reference
+  # to represent the hyperlink primarily as a URL. However, maybe we do want
+  # the hyperlink "friendly name" part (as MS calls it), so here we've subclassed
+  # string to tack on the friendly name. This means 80% of us that just want
+  # the URL value will have to do nothing extra, but the 20% that might want the
+  # friendly name can access it.
+  #
+  # Note, by default, the value we would get by just asking the cell would
+  # be the "friendly name" and *not* the URL, which is tucked away in the
+  # function definition or a separate "relationships" meta-document.
+  #
+  # See MS documentation on the HYPERLINK function for some background:
+  # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
+  class Hyperlink < String
+    attr_reader :friendly_name
+    def initialize(url, friendly_name = nil)
+      @friendly_name = friendly_name
+      super(url)
+    end
+  end
+end
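
Reviewer note: a short usage sketch of the `String` subclass added above. The class body is copied from the diff (without the module wrapper) so the example runs without the gem installed; the URL and friendly name are made-up sample values.

```ruby
# Copy of the class from the diff above, reproduced so the example is
# self-contained.
class Hyperlink < String
  attr_reader :friendly_name

  def initialize(url, friendly_name = nil)
    @friendly_name = friendly_name
    super(url)
  end
end

link = Hyperlink.new('https://example.com/report', 'Quarterly report')

same_as_url = (link == 'https://example.com/report') # compares as the URL
label = link.friendly_name                           # extra data on request
plain = "#{link}"                                    # interpolation yields a plain String
```

One consequence of this design worth keeping in mind: the friendly name lives on the `Hyperlink` instance, so a plain `String` built from it (for example by interpolation) no longer responds to `friendly_name`.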

data/lib/simple_xlsx_reader/loader/shared_strings_parser.rb ADDED

@@ -0,0 +1,46 @@
+# frozen_string_literal: true
+module SimpleXlsxReader
+  class Loader
+    # For performance reasons, excel uses an optional SpreadsheetML feature
+    # that puts all strings in a separate xml file, and then references
+    # them by their index in that file.
+    #
+    # http://msdn.microsoft.com/en-us/library/office/gg278314.aspx
+    class SharedStringsParser < Nokogiri::XML::SAX::Document
+      def self.parse(file)
+        new.tap do |parser|
+          Nokogiri::XML::SAX::Parser.new(parser).parse(file)
+        end.result
+      end
+      def initialize
+        @result = []
+        @composite = false
+        @extract = false
+      end
+      attr_reader :result
+      def start_element(name, _attrs = [])
+        case name
+        when 'si' then @current_string = +"" # UTF-8 variant of String.new
+        when 't' then @extract = true
+        end
+      end
+      def characters(string)
+        return unless @extract
+        @current_string << string
+      end
+      def end_element(name)
+        case name
+        when 't' then @extract = false
+        when 'si' then @result << @current_string
+        end
+      end
+    end
+  end
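
Reviewer note: the SAX handler above is a small state machine: `si` starts a new string, `t` switches text capture on and off, and the closing `si` stores the result. The sketch below replays the same callbacks by hand, assuming the events a SAX parser would plausibly emit for the sample XML in the comment. `SharedStringsSketch` is a hypothetical copy that does not inherit from `Nokogiri::XML::SAX::Document`, so it runs without Nokogiri.

```ruby
# Simplified handler mirroring the callbacks in the diff above.
class SharedStringsSketch
  attr_reader :result

  def initialize
    @result = []
    @extract = false
  end

  def start_element(name, _attrs = [])
    case name
    when 'si' then @current_string = +''
    when 't' then @extract = true
    end
  end

  def characters(string)
    @current_string << string if @extract
  end

  def end_element(name)
    case name
    when 't' then @extract = false
    when 'si' then @result << @current_string
    end
  end
end

# Assumed event sequence for this hypothetical input:
# <sst>
#   <si><t>Plain</t></si>
#   <si><r><t>Rich </t></r><r><t>text</t></r></si>
# </sst>
events = [
  [:start_element, 'sst'],
  [:start_element, 'si'], [:start_element, 't'], [:characters, 'Plain'],
  [:end_element, 't'], [:end_element, 'si'],
  [:start_element, 'si'], [:characters, "\n"], # whitespace outside <t> is ignored
  [:start_element, 'r'], [:start_element, 't'], [:characters, 'Rich '],
  [:end_element, 't'], [:end_element, 'r'],
  [:start_element, 'r'], [:start_element, 't'], [:characters, 'text'],
  [:end_element, 't'], [:end_element, 'r'],
  [:end_element, 'si'],
  [:end_element, 'sst']
]

handler = SharedStringsSketch.new
events.each { |callback, arg| handler.public_send(callback, arg) }
shared_strings = handler.result
```

The second entry shows why the handler accumulates into `@current_string` instead of storing each `t` element separately: a rich-text string is split across several runs, and the cell index must still point at one combined string.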
+end