simple_xlsx_reader 1.0.5 → 2.0.0

Files changed (24)

checksums.yaml +4 -4
data/.github/workflows/ruby.yml +38 -0
data/CHANGELOG.md +7 -0
data/README.md +190 -64
data/Rakefile +3 -1
data/lib/simple_xlsx_reader/document.rb +147 -0
data/lib/simple_xlsx_reader/hyperlink.rb +30 -0
data/lib/simple_xlsx_reader/loader/shared_strings_parser.rb +46 -0
data/lib/simple_xlsx_reader/loader/sheet_parser.rb +256 -0
data/lib/simple_xlsx_reader/loader/style_types_parser.rb +115 -0
data/lib/simple_xlsx_reader/loader/workbook_parser.rb +39 -0
data/lib/simple_xlsx_reader/loader.rb +199 -0
data/lib/simple_xlsx_reader/version.rb +3 -1
data/lib/simple_xlsx_reader.rb +23 -519
data/test/date1904_test.rb +5 -4
data/test/datetime_test.rb +17 -10
data/test/gdocs_sheet_test.rb +6 -5
data/test/lower_case_sharedstrings_test.rb +9 -4
data/test/performance_test.rb +85 -88
data/test/shared_strings.xml +4 -0
data/test/simple_xlsx_reader_test.rb +785 -375
data/test/test_helper.rb +4 -1
data/test/test_xlsx_builder.rb +104 -0
metadata +16 -6

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: e2b04473235c5ed2c2764f62a627fa6f16816c36e0fcff3497be229f8666a0f7
-  data.tar.gz: 9367b0082f31e9cb208d9f97ed6cb67d5276a459562809460694602339dfdaad
+  metadata.gz: 979490ce3bd7f0482879fb5fb5465e10ad1b07c1488d0a544950131d9063050a
+  data.tar.gz: 412d0040a586cc5ee4acdd4a2f74dd74f3bf9eb781a35d8a36c12f6caadc566c
 SHA512:
-  metadata.gz: cd42f7a0b8830a2f01703dca10ae779b973566ad25e3b74d31dc3693977fa5b2b3442e47bc1a3b50723bae3bb9f31facd923f1eaba06b51cc8b927e7fb207cf3
-  data.tar.gz: 38ecb026b0ad5a1985d88349a839a9d2972f85596504e6f300686f9751169a3c8d62582e79119106085a9cadc066517206da117993c3a30f48a5a0c58f256b4c
+  metadata.gz: 00c01bc0c2a393eb35e458411dfeab55b8bf30cee2661324cbd97a175baf0ceb31a881b1b2b7bd668a2b475ff008372c1428908340e30769308884355fdd46e8
+  data.tar.gz: 81b1b26806a97c56710cab64aa22212985dea82b308e2fbba6835f4ea7a69b79067268bb13537999594dc5722928f1df235938355a7d4a51b58ae7ed4af1d093

data/.github/workflows/ruby.yml ADDED

@@ -0,0 +1,38 @@
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+# This workflow will download a prebuilt Ruby version, install dependencies and run tests with Rake
+# For more information see: https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby
+name: Ruby
+on:
+  push:
+    branches: [ "master" ]
+  pull_request:
+    branches: [ "master" ]
+permissions:
+  contents: read
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        ruby-version: ['2.6', '2.7', '3.0']
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Ruby
+    # To automatically get bug fixes and new Ruby versions for ruby/setup-ruby,
+    # change this to (see https://github.com/ruby/setup-ruby#versioning):
+    # uses: ruby/setup-ruby@v1
+      uses: ruby/setup-ruby@2b019609e2b0f1ea1a2bc8ca11cb82ab46ada124
+      with:
+        ruby-version: ${{ matrix.ruby-version }}
+        bundler-cache: true # runs 'bundle install' and caches installed gems automatically
+    - name: Run tests
+      run: bundle exec rake

data/CHANGELOG.md CHANGED

@@ -1,3 +1,10 @@
+### 2.0.0
+* SPEED
+  * Reimplement internals in terms of a SAX parser
+  * Change `SimpleXlsxReader::Sheet#rows` to be a `RowsProxy` that streams `#each`
+* Convenience - use `rows#each(headers: true)` to get header names while enumerating rows
 ### 1.0.5
 * Support string or io input via `SimpleXlsxReader#parse` (@kalsan, @til)

data/README.md CHANGED

@@ -1,88 +1,214 @@
-# SimpleXlsxReader [![Build Status](https://travis-ci.org/woahdae/simple_xlsx_reader.svg?branch=master)](https://travis-ci.org/woahdae/simple_xlsx_reader)
+# SimpleXlsxReader
-An xlsx reader for Ruby that parses xlsx cell values into plain ruby
-primitives and dates/times.
+A [fast](#performance) xlsx reader for Ruby that parses xlsx cell values into
+plain ruby primitives and dates/times.
 This is *not* a rewrite of excel in Ruby. Font styles, for
 example, are parsed to determine whether a cell is a number or a date,
 then forgotten. We just want to get the data, and get out!
-## Usage
-### Summary:
+## Summary (now with stream parsing):
     doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')
     doc.sheets # => [<#SXR::Sheet>, ...]
     doc.sheets.first.name # 'Sheet1'
-    doc.sheets.first.rows # [['Header 1', 'Header 2', ...]
-                             ['foo', 2, ...]]
+    doc.sheets.first.rows # <SXR::Document::RowsProxy>
+    doc.sheets.first.rows.each # an <Enumerator> ready to chain or stream
+    doc.sheets.first.rows.each {} # Streams the rows to your block
+    doc.sheets.first.rows.each(headers: true) {} # Streams row-hashes
+    doc.sheets.first.rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
+    doc.sheets.first.rows.slurp # Slurps rows into memory as a 2D array
-That's it!
+That's the gist of it!
-### Load Errors
+See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0-pre/lib/simple_xlsx_reader/document.rb) object.
-By default, cell load errors (ex. if a date cell contains the string
-'hello') result in a SimpleXlsxReader::CellLoadError.
+## Why?
-If you would like to provide better error feedback to your users, you
-can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
-true`, and load errors will instead be inserted into Sheet#load_errors keyed
-by [rownum, colnum].
+### Accurate
-### More
+This project was started years ago, primarily because other Ruby xlsx parsers
+didn't import data with the correct types. Numbers as strings, dates as numbers,
+hyperlinks with inaccessible URLs, or - subtly buggy - simple dates as DateTime
+objects. If your app uses a timezone offset, depending on what timezone and
+what time of day you load the xlsx file, your dates might end up a day off!
+SimpleXlsxReader understands all these correctly.
-Here's the totality of the public api, in code:
+### Idiomatic
-    module SimpleXlsxReader
-      def self.open(file_path)
-        Document.new(file_path: file_path).tap(&:sheets)
-      end
+Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
+SimpleXlsxReader strives to be fairly idiomatic Ruby:
-      def self.parse(string_or_io)
-        Document.new(string_or_io: string_or_io).tap(&:sheets)
+    # quick example having fun w/ ruby
+    doc = SimpleXlsxReader.open(path_or_io)
+    doc.sheets.first.rows.each(headers: {id: /ID/})
+      .with_index.with_object({}) do |(row, index), acc|
+        acc[row[:id]] = index
       end
-      class Document
-        attr_reader :string_or_io
-        def initialize(legacy_file_path = nil, file_path: nil, string_or_io: nil)
-          ((file_path || legacy_file_path).nil? ^ string_or_io.nil?) ||
-            fail(ArgumentError, 'either file_path or string_or_io must be provided')
-          @string_or_io = string_or_io || File.new(file_path || legacy_file_path)
-        end
-        def sheets
-          @sheets ||= Mapper.new(xml).load_sheets
-        end
-        def to_hash
-          sheets.inject({}) {|acc, sheet| acc[sheet.name] = sheet.rows; acc}
-        end
-        def xml
-          Xml.load(string_or_io)
-        end
-        class Sheet < Struct.new(:name, :rows)
-          def headers
-            rows[0]
-          end
-          def data
-            rows[1..-1]
-          end
-          # Load errors will be a hash of the form:
-          # {
-          #   [rownum, colnum] => '[error]'
-          # }
-          def load_errors
-            @load_errors ||= {}
-          end
-        end
+### Now faster
+Finally, as of v2.0, SimpleXlsxReader is the fastest and most
+memory-efficient parser. Previously this project couldn't reasonably load
+anything over ~10k rows. Other parsers could load 100k+ rows, but were still
+taking ~1gb RSS to do so, even "streaming," which seemed excessive. So a SAX
+implementation was born. See [performance](#performance) for details.
+## Usage
+### Streaming
+SimpleXlsxReader is performant by default - If you use
+`rows.each {|row| ...}` it will stream the XLSX rows to your block without
+loading either the sheet XML or the full sheet data into memory.
+You can also chain `rows.each` with other Enumerable functions without
+triggering a slurp, and you have lots of ways to find and map headers while
+streaming.
+If you had an excel sheet representing this data:
+```
+| Hero ID | Hero Name  | Location     |
+| 13576   | Samus Aran | Planet Zebes |
+| 117     | John Halo  | Ring World   |
+| 9704133 | Iron Man   | Planet Earth |
+```
+Get a handle on the rows proxy:
+`rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows`
+Simple streaming (kinda boring):
+`rows.each { |row| ... }`
+Streaming with headers, and how about a little enumerable chaining:
+```
+# Map of hero names by ID: { 117 => 'John Halo', ... }
+rows.each(headers: true).with_object({}) do |row, acc|
+  acc[row['Hero ID']] = row['Hero Name']
+end
+```
+Sometimes though you have some junk at the top of your spreadsheet:
+```
+| Unofficial Report  |                        |              |
+| Dont tell Nintendo | Yes "John Halo" I know |              |
+|                    |                        |              |
+| Hero ID            | Hero Name              | Location     |
+| 13576              | Samus Aran             | Planet Zebes |
+| 117                | John Halo              | Ring World   |
+| 9704133            | Iron Man               | Planet Earth |
+```
+For this, `headers` can be a hash whose keys replace headers and whose values
+help find the correct header row:
+```
+# Same map of hero names by ID: { 117 => 'John Halo', ... }
+rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
+  acc[row[:id]] = row[:name]
+end
+```
+If your header-to-attribute mapping is more complicated than key/value, you
+can do the mapping elsewhere, but use a block to find the header row:
+```
+# Example roughly analogous to some production code mapping a single spreadsheet
+# across many objects. Might be a simpler way now that we have the headers-hash
+# feature.
+object_map = { Hero => { id: 'Hero ID', name: 'Hero Name', location: 'Location' } }
+HEADERS = ['Hero ID', 'Hero Name', 'Location']
+rows.each(headers: ->(row) { (HEADERS & row).any? }) do |row|
+  object_map.each_pair do |klass, attribute_map|
+    attributes =
+      attribute_map.each_pair.with_object({}) do |(key, header), attrs|
+        attrs[key] = row[header]
       end
-    end
+    klass.new(attributes)
+  end
+end
+```
+### Slurping
+To make SimpleXlsxReader rows act like an array, for use with legacy
+SimpleXlsxReader apps or otherwise, we still support slurping the whole array
+into memory. The good news is even when doing this, the xlsx worksheet & shared
+string files are never loaded as a (big) Nokogiri doc, so that's nice.
+By default, to prevent accidental slurping, `<RowsProxy>` will throw an exception
+if you try to access it with array methods like `[]` and `shift` without
+explicitly slurping first. You can slurp either by calling `rows.slurp` or
+globally by setting `SimpleXlsxReader.configuration.auto_slurp = true`.
+Once slurped, enumerable methods on `rows` will use the slurped data
+(i.e. not re-parse the sheet), and those Array-like methods will work.
+We don't support all Array methods, just the few we have used in real projects,
+as we transition towards streaming instead.
+### Load Errors
+By default, cell load errors (ex. if a date cell contains the string
+'hello') result in a SimpleXlsxReader::CellLoadError.
+If you would like to provide better error feedback to your users, you
+can set `SimpleXlsxReader.configuration.catch_cell_load_errors =
+true`, and load errors will instead be inserted into Sheet#load_errors keyed
+by [rownum, colnum]:
+    {
+      [rownum, colnum] => '[error]'
+    }
+### Performance
+SimpleXlsxReader is (as of this writing) the fastest and most memory efficient
+Ruby xlsx parser.
+Recent updates here have focused on large spreadsheets with especially
+non-unique strings in sheets using xlsx' shared strings feature
+(Excel-generated spreadsheets always use this). Other projects have implemented
+streaming parsers for the sheet data, but currently none stream while loading
+the shared strings file, which is the second-largest file in an xlsx archive
+and can represent millions of strings in large files.
+For more details, see [my fork of @shkm's excel benchmark project](https://github.com/woahdae/excel-parsing-benchmarks), but here's the summary:
+1mb excel file, 10,000 rows of sample "sales records" with a fair amount of
+non-unique strings (ran on an M1 Macbook Pro):
+| Gem                | Parses/second | RSS Increase | Allocated Mem | Retained Mem | Allocated Objects | Retained Objects |
+|--------------------|---------------|--------------|---------------|--------------|-------------------|------------------|
+| simple_xlsx_reader | 1.13          | 36.94mb      | 614.51mb      | 1.13kb       | 8796275           | 3                |
+| roo                | 0.75          | 74.0mb       | 164.47mb      | 2.18kb       | 2128396           | 4                |
+| creek              | 0.65          | 107.55mb     | 581.38mb      | 3.3kb        | 7240760           | 16               |
+| xsv                | 0.61          | 75.66mb      | 2127.42mb     | 3.66kb       | 5922563           | 10               |
+| rubyxl             | 0.27          | 373.52mb     | 716.7mb       | 2.18kb       | 10612577          | 4                |
+Here is a benchmark for the "worst" file I've seen, a 26mb file whose shared
+strings represent 10% of the archive (note, MemoryProfiler has too much
+overhead to reasonably measure allocations so that analysis was left off, and
+we just measure total time for one parse):
+| Gem                | Time    | RSS Increase |
+|--------------------|---------|--------------|
+| simple_xlsx_reader | 28.71s  | 148.77mb     |
+| roo                | 40.25s  | 1322.08mb    |
+| xsv                | 45.82s  | 391.27mb     |
+| creek              | 60.63s  | 886.81mb     |
+| rubyxl             | 238.68s | 9136.3mb     |
 ## Installation
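
The header-hash behaviour described in the README hunk above (find the first row where at least one cell matches, then yield later rows keyed by the hash's keys) can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: `HeaderSketch` and `each_row` are hypothetical names, and the input is an in-memory array of arrays rather than a parsed sheet.

```ruby
# Hypothetical helper, not part of simple_xlsx_reader.
module HeaderSketch
  # rows: any Enumerable of row arrays; header_map: { key => Regexp or String }
  def self.each_row(rows, header_map)
    return to_enum(:each_row, rows, header_map) unless block_given?

    columns = nil # column index => key, set once the header row is found
    rows.each do |row|
      if columns.nil?
        found = {}
        row.each_with_index do |cell, index|
          key, = header_map.find { |_key, pattern| pattern === cell.to_s }
          found[index] = key if key
        end
        columns = found unless found.empty?
        next # junk rows and the header row itself are not yielded
      end

      yield columns.each_with_object({}) { |(index, key), acc| acc[key] = row[index] }
    end
  end
end

sheet = [
  ['Unofficial Report', nil, nil],
  ['Hero ID', 'Hero Name', 'Location'],
  [13576, 'Samus Aran', 'Planet Zebes'],
  [117, 'John Halo', 'Ring World']
]

names_by_id = HeaderSketch.each_row(sheet, { id: /ID/, name: /Name/ })
                          .with_object({}) { |row, acc| acc[row[:id]] = row[:name] }
# => { 13576 => 'Samus Aran', 117 => 'John Halo' }
```

Columns whose header matches no pattern (here `Location`) are simply left out of the yielded hashes in this sketch; whether the gem does the same is not shown in the diff.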

data/Rakefile CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require "bundler/gem_tasks"
 require 'rake/testtask'
@@ -6,4 +8,4 @@ Rake::TestTask.new do |t|
   t.libs << 'test'
 end
-task :default => [:test]
+task default: [:test]

data/lib/simple_xlsx_reader/document.rb ADDED

@@ -0,0 +1,147 @@
+# frozen_string_literal: true
+require 'forwardable'
+module SimpleXlsxReader
+  ##
+  # Main class for the public API. See the README for usage examples,
+  # or read the code, it's pretty friendly.
+  class Document
+    attr_reader :file_path
+    def initialize(file_path)
+      @file_path = file_path
+    end
+    def sheets
+      @sheets ||= Loader.new(file_path).init_sheets
+    end
+    # Expensive because it slurps all the sheets into memory,
+    # probably only appropriate for testing
+    def to_hash
+      sheets.each_with_object({}) { |sheet, acc| acc[sheet.name] = sheet.rows.to_a; }
+    end
+    # `rows` is a RowsProxy that responds to #each
+    class Sheet
+      extend Forwardable
+      attr_reader :name, :rows
+      def_delegators :rows, :load_errors, :slurp
+      def initialize(name:, sheet_parser:)
+        @name = name
+        @rows = RowsProxy.new(sheet_parser: sheet_parser)
+      end
+      # Legacy - consider `rows.each(headers: true)` for better performance
+      def headers
+        rows.slurped![0]
+      end
+      # Legacy - consider `rows` or `rows.each(headers: true)` for better
+      # performance
+      def data
+        rows.slurped![1..-1]
+      end
+    end
+    # Waits until we call #each with a block to parse the rows
+    class RowsProxy
+      include Enumerable
+      attr_reader :slurped, :load_errors
+      def initialize(sheet_parser:)
+        @sheet_parser = sheet_parser
+        @slurped = nil
+        @load_errors = {}
+      end
+      # By default, #each streams the rows to the provided block, either as
+      # arrays, or as header => cell value pairs if provided a `headers:`
+      # argument.
+      #
+      # `headers` can be:
+      #
+      # * `true` - simply takes the first row as the header row
+      # * block - calls the block with successive rows until the block returns
+      #   true, at which point that row is used for the headers. All data prior
+      #   to finding the headers is ignored.
+      # * hash - transforms the header row by replacing cells with keys matched
+      #   by value, ex. `{id: /ID|Identity/, name: /Name/i, date: 'Date'}` would
+      #   potentially yield the row `{id: 5, name: 'Jane', date: [Date object]}`
+      #   instead of the headers from the sheet. It would also search for the
+      #   row that matches at least one header, in case the header row isn't the
+      #   first.
+      #
+      # If rows have been slurped, #each will iterate the slurped rows instead.
+      #
+      # Note, calls to this after slurping will raise if given the `headers:`
+      # argument, as that's handled by the sheet parser. If this is important
+      # to someone, speak up and we could potentially support it.
+      def each(headers: false, &block)
+        if slurped?
+          raise '#each does not support headers with slurped rows' if headers
+          slurped.each(&block)
+        elsif block_given?
+          # It's possible to slurp while yielding to the block, which would
+          # null out @sheet_parser, so let's just keep track of it here too
+          sheet_parser = @sheet_parser
+          @sheet_parser.parse(headers: headers, &block).tap do
+            @load_errors = sheet_parser.load_errors
+          end
+        else
+          to_enum(:each, headers: headers)
+        end
+      end
+      # Mostly for legacy support, I'm not aware of a use case for doing this
+      # when you don't have to.
+      #
+      # Note that #each will use slurped results if available, and since we're
+      # leveraging Enumerable, all the other Enumerable methods will too.
+      def slurp
+        # possibly release sheet parser from memory on next GC run;
+        # untested, but it can hold a lot of stuff, so worth a try
+        @slurped ||= to_a.tap { @sheet_parser = nil }
+      end
+      def slurped?
+        !!@slurped
+      end
+      def slurped!
+        check_slurped
+        slurped
+      end
+      def [](*args)
+        check_slurped
+        slurped[*args]
+      end
+      def shift(*args)
+        check_slurped
+        slurped.shift(*args)
+      end
+      private
+      def check_slurped
+        slurp if SimpleXlsxReader.configuration.auto_slurp
+        return if slurped?
+        raise 'Called a slurp-y method without explicitly slurping;'\
+          ' use #each or call rows.slurp first'
+      end
+    end
+  end
+end
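
The core of `RowsProxy` is a familiar Ruby pattern: include `Enumerable`, return `to_enum` when no block is given, and guard array-style access behind an explicit slurp. The stripped-down sketch below shows that pattern on its own. `LazyRows` and `CountingSource` are hypothetical names; `CountingSource` stands in for the sheet parser and only counts how often it is read.

```ruby
# Hypothetical stand-in for RowsProxy; `source` is any object responding to
# #each and plays the role of the sheet parser.
class LazyRows
  include Enumerable

  def initialize(source)
    @source = source
    @slurped = nil
  end

  def each(&block)
    return to_enum(:each) unless block_given?

    (@slurped || @source).each(&block)
  end

  # Reads everything once and keeps it; later iteration reuses the array.
  def slurp
    @slurped ||= to_a
  end

  def [](*args)
    raise 'call #slurp before using array-style access' unless @slurped

    @slurped[*args]
  end
end

# Counts full or partial passes over the underlying rows.
class CountingSource
  attr_reader :passes

  def initialize(rows)
    @rows = rows
    @passes = 0
  end

  def each(&block)
    @passes += 1
    @rows.each(&block)
  end
end

source = CountingSource.new([%w[a b], %w[c d]])
rows = LazyRows.new(source)

rows.each            # no block: returns an Enumerator, the source is not read
first = rows.first   # one pass over the source, stops after the first row
rows.slurp           # second pass, result is kept in memory
second = rows[1]     # array-style access now works
rows.map(&:first)    # served from the slurped array, no further pass
```

After this runs, `source.passes` is 2: one pass for `first` and one for `slurp`. Everything after the slurp is served from memory, which mirrors the "will use the slurped data (i.e. not re-parse the sheet)" note in the README.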

data/lib/simple_xlsx_reader/hyperlink.rb ADDED

@@ -0,0 +1,30 @@
+# frozen_string_literal: true
+module SimpleXlsxReader
+  # We support hyperlinks as a "type" even though they're technically
+  # represented either as a function or an external reference in the xlsx spec.
+  #
+  # Since having hyperlink data in our sheet usually means we might want to do
+  # something primarily with the URL (store it in the database, download it, etc),
+  # we go through extra effort to parse the function or follow the reference
+  # to represent the hyperlink primarily as a URL. However, maybe we do want
+  # the hyperlink "friendly name" part (as MS calls it), so here we've subclassed
+  # string to tack on the friendly name. This means 80% of us that just want
+  # the URL value will have to do nothing extra, but the 20% that might want the
+  # friendly name can access it.
+  #
+  # Note, by default, the value we would get by just asking the cell would
+  # be the "friendly name" and *not* the URL, which is tucked away in the
+  # function definition or a separate "relationships" meta-document.
+  #
+  # See MS documentation on the HYPERLINK function for some background:
+  # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f
+  class Hyperlink < String
+    attr_reader :friendly_name
+    def initialize(url, friendly_name = nil)
+      @friendly_name = friendly_name
+      super(url)
+    end
+  end
+end
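
A short usage sketch of the `String` subclass approach described in the comment above. The class body is copied from the diff (without the surrounding module) so the example runs on its own; the URL and friendly name are made-up values.

```ruby
# Copy of the class from the diff, minus the SimpleXlsxReader namespace.
class Hyperlink < String
  attr_reader :friendly_name

  def initialize(url, friendly_name = nil)
    @friendly_name = friendly_name
    super(url)
  end
end

# Hypothetical values for illustration only.
link = Hyperlink.new('https://example.com/report', 'Quarterly report')

link == 'https://example.com/report' # compares as a plain URL string
link.start_with?('https://')         # ordinary String methods still work
link.friendly_name                   # the display text, available on request
```

Because the object is a `String`, code that only wants the URL needs no changes, while callers that care about the display text can ask for `friendly_name`.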

data/lib/simple_xlsx_reader/loader/shared_strings_parser.rb ADDED

@@ -0,0 +1,46 @@
+# frozen_string_literal: true
+module SimpleXlsxReader
+  class Loader
+    # For performance reasons, excel uses an optional SpreadsheetML feature
+    # that puts all strings in a separate xml file, and then references
+    # them by their index in that file.
+    #
+    # http://msdn.microsoft.com/en-us/library/office/gg278314.aspx
+    class SharedStringsParser < Nokogiri::XML::SAX::Document
+      def self.parse(file)
+        new.tap do |parser|
+          Nokogiri::XML::SAX::Parser.new(parser).parse(file)
+        end.result
+      end
+      def initialize
+        @result = []
+        @composite = false
+        @extract = false
+      end
+      attr_reader :result
+      def start_element(name, _attrs = [])
+        case name
+        when 'si' then @current_string = +"" # UTF-8 variant of String.new
+        when 't' then @extract = true
+        end
+      end
+      def characters(string)
+        return unless @extract
+        @current_string << string
+      end
+      def end_element(name)
+        case name
+        when 't' then @extract = false
+        when 'si' then @result << @current_string
+        end
+      end
+    end
+  end
+end
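
To see how the callbacks above build the shared strings table, the sketch below reuses the same state machine in plain Ruby and feeds it events by hand, in the order a SAX parser would emit them. It deliberately avoids Nokogiri so it runs anywhere. `SharedStringsSketch` and the `events` list are hypothetical and written for illustration only.

```ruby
# Same callback names as the parser in the diff, but without the Nokogiri
# superclass, so the events must be supplied manually.
class SharedStringsSketch
  attr_reader :result

  def initialize
    @result = []
    @extract = false
  end

  def start_element(name, _attrs = [])
    case name
    when 'si' then @current_string = +'' # new, mutable string per <si>
    when 't' then @extract = true
    end
  end

  def characters(string)
    @current_string << string if @extract
  end

  def end_element(name)
    case name
    when 't' then @extract = false
    when 'si' then @result << @current_string
    end
  end
end

# Events for:
# <sst><si><t>Hello</t></si><si><r><t>Wor</t></r><r><t>ld</t></r></si></sst>
events = [
  [:start_element, 'sst'],
  [:start_element, 'si'], [:start_element, 't'], [:characters, 'Hello'],
  [:end_element, 't'], [:end_element, 'si'],
  [:start_element, 'si'],
  [:start_element, 'r'], [:start_element, 't'], [:characters, 'Wor'],
  [:end_element, 't'], [:end_element, 'r'],
  [:start_element, 'r'], [:start_element, 't'], [:characters, 'ld'],
  [:end_element, 't'], [:end_element, 'r'],
  [:end_element, 'si'],
  [:end_element, 'sst']
]

handler = SharedStringsSketch.new
events.each { |callback, arg| handler.public_send(callback, arg) }
handler.result # => ['Hello', 'World']
```

The second `<si>` holds two rich-text runs. Because the buffer is reset only on `<si>`, not on each `<t>`, the runs are concatenated into one entry, and a cell that references shared string index 1 resolves to `'World'`.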