xsv 0.3.15 → 1.0.0.pre.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 84691bdf67411e4a1e2d28c23a04e6656285247820d65c3124382261b0aba27e
4
- data.tar.gz: 2ae8332a7e5e857039b729abee1478f063663e29b4eb8b9f45a4ae3ac2632916
3
+ metadata.gz: 2c7de4a72e380ca950dd3f076353c6e275d67314491216708ab2abe28ca1c3b2
4
+ data.tar.gz: 583a01b780ee1b95f9558d28f8700b885874822c133d5c17b4f3f2e94e0bb6e3
5
5
  SHA512:
6
- metadata.gz: 666a4b190de6a64d676fc5dd66d5d836255b7c5b8114f21f53ce57b9d09deaa1cfd4f644c31018a48619449d5962dd86b6bdd41355861462275601fc7486c386
7
- data.tar.gz: 3c6c957f46e31dac95f7c77f533ab43c52db1733f1d309b7a6e6bdc7e503da72963a3fe4f80a8269d46db436977bbc37c0b425203c2d75c8a6ef1be889714eb3
6
+ metadata.gz: 6a39ec94b6c3b34ce9367e9e4e2790bae830924b60a10938b54cc7c5f40749d2ebef29bfeaa9d7f9587c19c9cafdd6de9335b6aa8a8f9775dbff8632a867177d
7
+ data.tar.gz: 12967f46a655720dd2262ddaf1acd357d24a846e0160715e8652b116fa991a766250e55cd77ac69e85c315065e1d13458d04684f421524fc3978358c940a5868
@@ -4,10 +4,7 @@ language: ruby
4
4
  cache: bundler
5
5
  rvm:
6
6
  - 2.5.8
7
- - 2.7.1
8
- env:
9
- - "rubyzip=1.3.0"
10
- - "rubyzip=2.2.0"
11
- - "ox=2.9"
12
- - "ox=2.13.2"
7
+ - 3.0
8
+ - truffleruby
9
+ - jruby
13
10
  before_install: gem install bundler
@@ -1,5 +1,30 @@
1
1
  # Xsv Changelog
2
2
 
3
+ ## 1.0.0.pre.2 2021-01-22
4
+
5
+ - Reduce allocations in XML parser
6
+ - Return strings with the correct encoding
7
+ - Handle XML entities
8
+
9
+ ## 1.0.0.pre 2021-01-18
10
+
11
+ - Switch to a minimalistic XML parser in native Ruby (#21)
12
+ - Ruby 3.0 compatibility
13
+ - Various internal cleanup and optimization
14
+ - API is backwards compatible with 0.3.x
15
+
16
+ ## 0.3.18 2020-09-30
17
+
18
+ - Improve inline string support (#18)
19
+
20
+ ## 0.3.17 2020-07-03
21
+
22
+ - Fix parsing of empty worksheets (#17)
23
+
24
+ ## 0.3.16 2020-06-03
25
+
26
+ - Support complex numbers (#16)
27
+
3
28
  ## 0.3.15 2020-06-02
4
29
 
5
30
  - Fix issue with workbooks that don't contain shared strings (#15)
data/README.md CHANGED
@@ -3,7 +3,7 @@
3
3
  [![Travis CI](https://img.shields.io/travis/martijn/xsv/master)](https://travis-ci.org/martijn/xsv)
4
4
  [![Yard Docs](http://img.shields.io/badge/yard-docs-blue.svg)](https://rubydoc.info/github/martijn/xsv)
5
5
 
6
- Xsv is a fast, lightweight parser for Office Open XML spreadsheet files
6
+ Xsv is a fast, lightweight, pure Ruby parser for Office Open XML spreadsheet files
7
7
  (commonly known as Excel or .xlsx files). It strives to be minimal in the
8
8
  sense that it provides nothing a CSV reader wouldn't, meaning it only
9
9
  deals with minimal formatting and cannot create or modify documents.
@@ -11,8 +11,9 @@ deals with minimal formatting and cannot create or modify documents.
11
11
  Xsv is designed for worksheets with a single table of data, optionally
12
12
  with a header row. It only casts values to basic Ruby types (integer, float,
13
13
  date and time) and does not deal with most formatting or more advanced
14
- functionality. The goal is to allow for fast parsing of large worksheets with
15
- minimal RAM and CPU consumption.
14
+ functionality. It strives for fast processing of large worksheets with
15
+ minimal RAM and CPU consumption and has been in production use since the earliest
16
+ versions.
16
17
 
17
18
  Xsv stands for 'Excel Separated Values', because Excel just gets in the way.
18
19
 
@@ -32,11 +33,14 @@ Or install it yourself as:
32
33
 
33
34
  $ gem install xsv
34
35
 
35
- Xsv targets ruby ~> 2.6 and depends on `rubyzip` and `ox`.
36
+ Xsv targets Ruby >= 2.5 and has just a single dependency, `rubyzip`. It has been
37
+ tested successfully with MRI, JRuby, and TruffleRuby. Due to the lack of
38
+ native extensions, it should work well in multi-threaded environments or in Ractor
39
+ when that becomes stable.
36
40
 
37
41
  ## Usage
38
42
 
39
- Xsv has two modes of operation. By default it returns an array for
43
+ Xsv has two modes of operation. By default, it returns an array for
40
44
  each row in the sheet:
41
45
 
42
46
  ```ruby
@@ -75,15 +79,15 @@ end
75
79
  sheet[1] # => {"header1" => "value1", "header2" => "value2"}
76
80
  ```
77
81
 
78
- Be aware that hash mode will lead to unpredictable results if you have multiple
79
- columns with the same name!
82
+ Be aware that hash mode will lead to unpredictable results if the worksheet
83
+ has multiple columns with the same header.
80
84
 
81
- `Xsv::Workbook.open` accepts a filename, or a IO or String containing a workbook.
85
+ `Xsv::Workbook.open` accepts a filename, or an IO or String containing a workbook.
82
86
 
83
87
  `Xsv::Sheet` implements `Enumerable` so you can call methods like `#first`,
84
- `#filter`/`#select` and `#map` on it.
88
+ `#filter`/`#select`, and `#map` on it.
85
89
 
86
- The sheets could be accessed by index or by name:
90
+ The sheets can be accessed by index or by name:
87
91
 
88
92
  ```ruby
89
93
  x = Xsv::Workbook.open("sheet.xlsx")
@@ -93,7 +97,7 @@ sheet = x.sheets[0] # gets sheet by index
93
97
  sheet = x.sheets_by_name('Name').first # gets sheet by name
94
98
  ```
95
99
 
96
- To get all the workbook's sheets names:
100
+ To get all the sheet names:
97
101
 
98
102
  ```ruby
99
103
  sheet_names = x.sheets.map(&:name)
@@ -123,6 +127,16 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
123
127
 
124
128
  To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
125
129
 
130
+ ## Performance and Benchmarks
131
+
132
+ Xsv is faster and more memory efficient than other gems because of two things: it only _reads values_ from Excel files and it uses a SAX parser instead of a DOM parser. If you want to read some background on this, check out my blog post on
133
+ [Efficient XML parsing in Ruby](https://storck.io/posts/efficient-xml-parsing-in-ruby/).
134
+
135
+ Jamie Schembri did a shootout of Xsv against various other Excel reading gems comparing parsing speed, memory usage, and allocations.
136
+ Check out his blog post: [Faster Excel parsing in Ruby](https://blog.schembri.me/post/faster-excel-parsing-in-ruby/).
137
+
138
+ Before 1.0, Xsv used a native extension for XML parsing, which was faster than the pure Ruby parser (on MRI). Even with the pure Ruby parser, Xsv generally still outperforms other Ruby parsing gems.
139
+
126
140
  ## Contributing
127
141
 
128
142
  Bug reports and pull requests are welcome on GitHub at https://github.com/martijn/xsv.
data/lib/xsv.rb CHANGED
@@ -1,17 +1,18 @@
1
1
  # frozen_string_literal: true
2
- require "date"
3
- require "ox"
4
2
 
5
- require "xsv/helpers"
6
- require "xsv/relationships_handler"
7
- require "xsv/shared_strings_parser"
8
- require "xsv/sheet"
9
- require "xsv/sheet_bounds_handler"
10
- require "xsv/sheet_rows_handler"
11
- require "xsv/sheets_ids_handler"
12
- require "xsv/styles_handler"
13
- require "xsv/version"
14
- require "xsv/workbook"
3
+ require 'date'
4
+
5
+ require 'xsv/helpers'
6
+ require 'xsv/sax_parser'
7
+ require 'xsv/relationships_handler'
8
+ require 'xsv/shared_strings_parser'
9
+ require 'xsv/sheet'
10
+ require 'xsv/sheet_bounds_handler'
11
+ require 'xsv/sheet_rows_handler'
12
+ require 'xsv/sheets_ids_handler'
13
+ require 'xsv/styles_handler'
14
+ require 'xsv/version'
15
+ require 'xsv/workbook'
15
16
 
16
17
  # XSV is a fast, lightweight parser for Office Open XML spreadsheet files
17
18
  # (commonly known as Excel or .xlsx files). It strives to be minimal in the
@@ -1,52 +1,54 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  module Helpers
4
5
  # The default OOXML Spreadsheet number formats according to the ECMA standard
5
6
  # User formats are appended from index 174 onward
6
7
  BUILT_IN_NUMBER_FORMATS = {
7
- 1 => "0",
8
- 2 => "0.00",
9
- 3 => "#, ##0",
10
- 4 => "#, ##0.00",
11
- 5 => "$#, ##0_);($#, ##0)",
12
- 6 => "$#, ##0_);[Red]($#, ##0)",
13
- 7 => "$#, ##0.00_);($#, ##0.00)",
14
- 8 => "$#, ##0.00_);[Red]($#, ##0.00)",
15
- 9 => "0%",
16
- 10 => "0.00%",
17
- 11 => "0.00E+00",
18
- 12 => "# ?/?",
19
- 13 => "# ??/??",
20
- 14 => "m/d/yyyy",
21
- 15 => "d-mmm-yy",
22
- 16 => "d-mmm",
23
- 17 => "mmm-yy",
24
- 18 => "h:mm AM/PM",
25
- 19 => "h:mm:ss AM/PM",
26
- 20 => "h:mm",
27
- 21 => "h:mm:ss",
28
- 22 => "m/d/yyyy h:mm",
29
- 37 => "#, ##0_);(#, ##0)",
30
- 38 => "#, ##0_);[Red](#, ##0)",
31
- 39 => "#, ##0.00_);(#, ##0.00)",
32
- 40 => "#, ##0.00_);[Red](#, ##0.00)",
33
- 45 => "mm:ss",
34
- 46 => "[h]:mm:ss",
35
- 47 => "mm:ss.0",
36
- 48 => "##0.0E+0",
37
- 49 => "@",
8
+ 1 => '0',
9
+ 2 => '0.00',
10
+ 3 => '#, ##0',
11
+ 4 => '#, ##0.00',
12
+ 5 => '$#, ##0_);($#, ##0)',
13
+ 6 => '$#, ##0_);[Red]($#, ##0)',
14
+ 7 => '$#, ##0.00_);($#, ##0.00)',
15
+ 8 => '$#, ##0.00_);[Red]($#, ##0.00)',
16
+ 9 => '0%',
17
+ 10 => '0.00%',
18
+ 11 => '0.00E+00',
19
+ 12 => '# ?/?',
20
+ 13 => '# ??/??',
21
+ 14 => 'm/d/yyyy',
22
+ 15 => 'd-mmm-yy',
23
+ 16 => 'd-mmm',
24
+ 17 => 'mmm-yy',
25
+ 18 => 'h:mm AM/PM',
26
+ 19 => 'h:mm:ss AM/PM',
27
+ 20 => 'h:mm',
28
+ 21 => 'h:mm:ss',
29
+ 22 => 'm/d/yyyy h:mm',
30
+ 37 => '#, ##0_);(#, ##0)',
31
+ 38 => '#, ##0_);[Red](#, ##0)',
32
+ 39 => '#, ##0.00_);(#, ##0.00)',
33
+ 40 => '#, ##0.00_);[Red](#, ##0.00)',
34
+ 45 => 'mm:ss',
35
+ 46 => '[h]:mm:ss',
36
+ 47 => 'mm:ss.0',
37
+ 48 => '##0.0E+0',
38
+ 49 => '@'
38
39
  }.freeze
39
40
 
40
- MINUTE = 60.freeze
41
- HOUR = 3600.freeze
42
- A_CODEPOINT = "A".ord.freeze
41
+ MINUTE = 60
42
+ HOUR = 3600
43
+ A_CODEPOINT = 'A'.ord.freeze
43
44
  # The epoch for all dates in OOXML Spreadsheet documents
44
45
  EPOCH = Date.new(1899, 12, 30).freeze
45
46
 
46
47
  # Return the index number for the given Excel column name (i.e. "A1" => 0)
47
48
  def column_index(col)
48
49
  col.each_codepoint.reduce(0) do |sum, n|
49
- break sum - 1 if n < A_CODEPOINT # reached a number
50
+ break sum - 1 if n < A_CODEPOINT # reached a number
51
+
50
52
  sum * 26 + (n - A_CODEPOINT + 1)
51
53
  end
52
54
  end
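To make the behaviour of `column_index` in the hunk above concrete, here is a standalone sketch. The method body is copied from the diff; the constant is redefined at top level only so the snippet runs without the gem, and the sample references are made up. It assumes the reference includes a row number, as cell references inside a worksheet do.

```ruby
# Standalone copy for illustration; in the gem this lives in Xsv::Helpers.
A_CODEPOINT = 'A'.ord

def column_index(col)
  col.each_codepoint.reduce(0) do |sum, n|
    # Digits sort before 'A', so the first digit marks the end of the column letters.
    break sum - 1 if n < A_CODEPOINT

    # Read the letters as a base-26 number where A = 1 ... Z = 26.
    sum * 26 + (n - A_CODEPOINT + 1)
  end
end

p column_index('A1')   # => 0
p column_index('Z5')   # => 25
p column_index('AA10') # => 26
```

The subtraction of 1 happens only when a digit is reached, which is what turns the 1-based letter value into a 0-based array index.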
@@ -59,9 +61,7 @@ module Xsv
59
61
  # Return a time as a string for the given Excel time value
60
62
  def parse_time(number)
61
63
  # Disregard date part
62
- if number > 0
63
- number = number - number.truncate
64
- end
64
+ number -= number.truncate if number.positive?
65
65
 
66
66
  base = number * 24
67
67
 
@@ -70,11 +70,11 @@ module Xsv
70
70
 
71
71
  # Compensate for rounding errors
72
72
  if minutes >= 60
73
- hours = hours + (minutes / 60)
73
+ hours += (minutes / 60)
74
74
  minutes = minutes % 60
75
75
  end
76
76
 
77
- "%02d:%02d" % [hours, minutes]
77
+ format('%02d:%02d', hours, minutes)
78
78
  end
79
79
 
80
80
  # Returns a time including a date as a {Time} object
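The two `parse_time` hunks above can be stitched into a runnable sketch. The diff does not show the lines between `base = number * 24` and the rounding compensation, so the computation of `hours` and `minutes` below is an assumption that is merely consistent with the surrounding code, not a quote from the gem.

```ruby
def parse_time(number)
  # Disregard date part
  number -= number.truncate if number.positive?

  base = number * 24

  # ASSUMPTION: these two lines sit between the hunks and are not shown in the diff.
  hours = base.truncate
  minutes = ((base - hours) * 60).round

  # Compensate for rounding errors
  if minutes >= 60
    hours += (minutes / 60)
    minutes = minutes % 60
  end

  format('%02d:%02d', hours, minutes)
end

puts parse_time(0.5)      # half a day
puts parse_time(43831.25) # the integer (date) part is dropped first
```

Under these assumptions the fraction of a day is what determines the result, so `0.5` and `12345.5` would format identically.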
@@ -92,8 +92,10 @@ module Xsv
92
92
 
93
93
  # Returns a number as either Integer or Float
94
94
  def parse_number(string)
95
- if string.include? "."
95
+ if string.include? '.'
96
96
  string.to_f
97
+ elsif string.include? 'E'
98
+ Complex(string).to_f
97
99
  else
98
100
  string.to_i
99
101
  end
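The new `elsif` branch above appears to be what the changelog entry for 0.3.16 refers to: values written in scientific notation without a decimal point. A minimal standalone sketch of the three branches, with the method copied from the diff and made-up inputs:

```ruby
def parse_number(string)
  if string.include? '.'
    string.to_f
  elsif string.include? 'E'
    # Kernel#Complex understands exponent notation; to_f works because the imaginary part is zero.
    Complex(string).to_f
  else
    string.to_i
  end
end

p parse_number('42')   # Integer
p parse_number('3.14') # Float
p parse_number('1E3')  # Float, via the new branch
```

Without the extra branch, `'1E3'.to_i` would stop at the first non-digit and return `1`.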
@@ -103,11 +105,11 @@ module Xsv
103
105
  def parse_number_format(number, format)
104
106
  number = parse_number(number) if number.is_a?(String)
105
107
 
106
- if is_datetime_format?(format)
108
+ if datetime_format?(format)
107
109
  parse_datetime(number)
108
- elsif is_date_format?(format)
110
+ elsif date_format?(format)
109
111
  parse_date(number)
110
- elsif is_time_format?(format)
112
+ elsif time_format?(format)
111
113
  parse_time(number)
112
114
  else
113
115
  number
@@ -115,20 +117,22 @@ module Xsv
115
117
  end
116
118
 
117
119
  # Tests if the given format string includes both date and time
118
- def is_datetime_format?(format)
119
- is_date_format?(format) && is_time_format?(format)
120
+ def datetime_format?(format)
121
+ date_format?(format) && time_format?(format)
120
122
  end
121
123
 
122
124
  # Tests if the given format string is a date
123
- def is_date_format?(format)
125
+ def date_format?(format)
124
126
  return false if format.nil?
127
+
125
128
  # If it contains at least 2 sequences of d's, m's or y's it's a date!
126
129
  format.scan(/[dmy]+/).length > 1
127
130
  end
128
131
 
129
132
  # Tests if the given format string is a time
130
- def is_time_format?(format)
133
+ def time_format?(format)
131
134
  return false if format.nil?
135
+
132
136
  # If it contains at least 2 sequences of h's, m's or s's it's a time!
133
137
  format.scan(/[hms]+/).length > 1
134
138
  end
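The renamed predicates above classify a format string purely by counting runs of letters. The following standalone sketch copies the three methods out of the module and applies them to a few of the built-in format strings listed earlier in this diff:

```ruby
def date_format?(format)
  return false if format.nil?

  # At least two runs of d's, m's or y's
  format.scan(/[dmy]+/).length > 1
end

def time_format?(format)
  return false if format.nil?

  # At least two runs of h's, m's or s's
  format.scan(/[hms]+/).length > 1
end

def datetime_format?(format)
  date_format?(format) && time_format?(format)
end

['m/d/yyyy', 'h:mm', 'm/d/yyyy h:mm', '0.00'].each do |fmt|
  puts "#{fmt}: date=#{date_format?(fmt)} time=#{time_format?(fmt)} datetime=#{datetime_format?(fmt)}"
end
```

Because `m` counts toward both patterns, a string such as `'h:mm'` is a time (runs `h` and `mm`) but not a date (only the single run `mm`).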
@@ -1,40 +1,23 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  # RelationshipsHandler parses the "xl/_rels/workbook.xml.rels" file to get the existing relationships.
4
5
  # This is used internally when opening a workbook.
5
- class RelationshipsHandler < Ox::Sax
6
+ class RelationshipsHandler < SaxParser
6
7
  def self.get_relations(io)
7
8
  relations = []
8
- handler = new do |relation|
9
- relations << relation
10
- end
11
9
 
12
- Ox.sax_parse(handler, io.read)
13
- return relations
14
- end
10
+ new { |relation| relations << relation }.parse(io)
15
11
 
16
- # Ox::Sax implementation
12
+ relations
13
+ end
17
14
 
18
15
  def initialize(&block)
19
16
  @block = block
20
- @relationship = {}
21
- end
22
-
23
- def start_element(name)
24
- @relationship = {} if name == :Relationship
25
- end
26
-
27
- def attr(name, value)
28
- case name
29
- when :Id, :Type, :Target
30
- @relationship[name] = value
31
- end
32
17
  end
33
18
 
34
- def end_element(name)
35
- return unless name == :Relationship
36
-
37
- @block.call(@relationship)
19
+ def start_element(name, attrs)
20
+ @block.call(attrs.slice(:Id, :Type, :Target)) if name == 'Relationship'
38
21
  end
39
22
  end
40
23
  end
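The rewritten handler above relies on `Hash#slice` (available since Ruby 2.5, which matches the gem's minimum Ruby version) to keep only the attributes it needs. A small sketch with a made-up attribute hash; `relation_from` and the `TargetMode` entry are hypothetical and exist only for this illustration:

```ruby
# Hypothetical helper mirroring the expression used in start_element.
def relation_from(attrs)
  attrs.slice(:Id, :Type, :Target)
end

attrs = {
  Id: 'rId1',
  Type: 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet',
  Target: 'worksheets/sheet1.xml',
  TargetMode: 'Internal' # extra attribute, dropped by slice
}

p relation_from(attrs)
p relation_from({ Id: 'rId2' }) # missing keys are simply omitted
```

Compared with the three-callback Ox version, the whole element is available in one call, so no per-element state needs to be kept between callbacks.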
@@ -0,0 +1,88 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Xsv
4
+ class SaxParser
5
+ ATTR_REGEX = /((\S+)="(.*?)")/m.freeze
6
+
7
+ def parse(io)
8
+ state = :look_start
9
+ if io.is_a?(String)
10
+ pbuf = io.dup
11
+ eof_reached = true
12
+ must_read = false
13
+ else
14
+ pbuf = String.new(capacity: 8192)
15
+ eof_reached = false
16
+ must_read = true
17
+ end
18
+
19
+ loop do
20
+ if must_read
21
+ begin
22
+ pbuf << io.sysread(2048)
23
+ rescue EOFError, TypeError
24
+ # EOFError is thrown by IO, rubyzip returns nil from sysread on EOF
25
+ eof_reached = true
26
+ end
27
+
28
+ must_read = false
29
+ end
30
+
31
+ if state == :look_start
32
+ if (o = pbuf.index('<'))
33
+ chars = pbuf.slice!(0, o + 1).chop!.force_encoding('utf-8')
34
+
35
+ if respond_to?(:characters) && !chars.empty?
36
+ if chars.index('&')
37
+ chars.gsub!('&amp;', %q{&})
38
+ chars.gsub!('&apos;', %q{'})
39
+ chars.gsub!('&gt;', %q{>})
40
+ chars.gsub!('&lt;', %q{<})
41
+ chars.gsub!('&quot;', %q{"})
42
+ end
43
+ characters(chars)
44
+ end
45
+
46
+ state = :look_end
47
+ elsif eof_reached
48
+ # Discard anything after the last tag in the document
49
+ break
50
+ else
51
+ # Continue loop to read more data into the buffer
52
+ must_read = true
53
+ next
54
+ end
55
+ end
56
+
57
+ if state == :look_end
58
+ if (o = pbuf.index('>'))
59
+ if (s = pbuf.index(' ')) && s < o
60
+ tag_name = pbuf.slice!(0, s + 1).chop!
61
+ args = pbuf.slice!(0, o - s)
62
+ else
63
+ tag_name = pbuf.slice!(0, o + 1).chop!
64
+ args = nil
65
+ end
66
+
67
+ if tag_name.start_with?('/')
68
+ end_element(tag_name[1..-1]) if respond_to?(:end_element)
69
+ else
70
+ if args.nil?
71
+ start_element(tag_name, nil)
72
+ else
73
+ start_element(tag_name, args.scan(ATTR_REGEX).each_with_object({}) { |m, h| h[m[1].to_sym] = m[2] })
74
+ end_element(tag_name) if args.end_with?('/') && respond_to?(:end_element)
75
+ end
76
+ end
77
+
78
+ state = :look_start
79
+ elsif eof_reached
80
+ raise 'Malformed XML document, looking for end of tag beyond EOF'
81
+ else
82
+ must_read = true
83
+ end
84
+ end
85
+ end
86
+ end
87
+ end
88
+ end
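The attribute handling in the new `SaxParser` can be tried in isolation. The sketch below reuses `ATTR_REGEX` and the `scan` / `each_with_object` expression from the file above; `parse_attrs` is a hypothetical wrapper name and the sample tag fragments are made up.

```ruby
ATTR_REGEX = /((\S+)="(.*?)")/m.freeze

# Hypothetical wrapper around the expression used in SaxParser#parse.
def parse_attrs(args)
  # Each match is [whole, name, value]; names become symbol keys.
  args.scan(ATTR_REGEX).each_with_object({}) { |m, h| h[m[1].to_sym] = m[2] }
end

p parse_attrs(' r="A1" t="s" s="4"')
p parse_attrs('r:id="rId1" name="My Sheet"')
```

Note that, in the code as shown, entity replacement is applied to character data only, so attribute values are passed to `start_element` exactly as they appear in the file.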
@@ -1,13 +1,13 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  # Interpret the sharedStrings.xml file from the workbook
4
5
  # This is used internally when opening a sheet.
5
- class SharedStringsParser < Ox::Sax
6
+ class SharedStringsParser < SaxParser
6
7
  def self.parse(io)
7
8
  strings = []
8
- handler = new { |s| strings << s }
9
- Ox.sax_parse(handler, io.read, skip: :skip_none)
10
- return strings
9
+ new { |s| strings << s }.parse(io)
10
+ strings
11
11
  end
12
12
 
13
13
  def initialize(&block)
@@ -15,24 +15,24 @@ module Xsv
15
15
  @state = nil
16
16
  end
17
17
 
18
- def start_element(name)
18
+ def start_element(name, _attrs)
19
19
  case name
20
- when :si
21
- @current_string = ""
22
- when :t
20
+ when 'si'
21
+ @current_string = ''
22
+ when 't'
23
23
  @state = name
24
24
  end
25
25
  end
26
26
 
27
- def text(value)
28
- @current_string += value if @state == :t
27
+ def characters(value)
28
+ @current_string += value if @state == 't'
29
29
  end
30
30
 
31
31
  def end_element(name)
32
32
  case name
33
- when :si
33
+ when 'si'
34
34
  @block.call(@current_string)
35
- when :t
35
+ when 't'
36
36
  @state = nil
37
37
  end
38
38
  end
@@ -1,4 +1,5 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  # Sheet represents a single worksheet from a workbook and is normally accessed through {Workbook#sheets}
4
5
  #
@@ -39,14 +40,14 @@ module Xsv
39
40
  @headers = []
40
41
  @mode = :array
41
42
  @row_skip = 0
42
- @hidden = ids[:state] == "hidden"
43
+ @hidden = ids[:state] == 'hidden'
43
44
 
44
45
  @last_row, @column_count = SheetBoundsHandler.get_bounds(@io, @workbook)
45
46
  end
46
47
 
47
48
  # @return [String]
48
49
  def inspect
49
- "#<#{self.class.name}:#{self.object_id}>"
50
+ "#<#{self.class.name}:#{object_id}>"
50
51
  end
51
52
 
52
53
  # Returns true if the worksheet is hidden
@@ -60,15 +61,7 @@ module Xsv
60
61
 
61
62
  handler = SheetRowsHandler.new(@mode, empty_row, @workbook, @row_skip, @last_row, &block)
62
63
 
63
- # For smaller sheets, memory performance is a lot better if Ox is
64
- # handed a string. For larger sheets this leads to awful performance.
65
- # This is probably caused by either something in SheetRowsHandler or
66
- # the interaction between Zip::InputStream and Ox
67
- if @size > 100_000_000
68
- Ox.sax_parse(handler, @io)
69
- else
70
- Ox.sax_parse(handler, @io.read)
71
- end
64
+ handler.parse(@io)
72
65
 
73
66
  true
74
67
  end
@@ -82,7 +75,7 @@ module Xsv
82
75
  return row if i == number
83
76
  end
84
77
 
85
- return empty_row
78
+ empty_row
86
79
  end
87
80
 
88
81
  # Load headers in the top row of the worksheet. After parsing of headers
@@ -112,7 +105,7 @@ module Xsv
112
105
  elsif @mode == :hash
113
106
  @mode = :array
114
107
  headers.tap { @mode = :hash }
115
- end
108
+ end || []
116
109
  end
117
110
 
118
111
  def empty_row
@@ -1,9 +1,10 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  # SheetBoundsHandler scans a sheet looking for the outer bounds of the content within.
4
5
  # This is used internally when opening a sheet to deal with worksheets that do not
5
6
  # have a correct dimension tag.
6
- class SheetBoundsHandler < Ox::Sax
7
+ class SheetBoundsHandler < SaxParser
7
8
  include Xsv::Helpers
8
9
 
9
10
  def self.get_bounds(sheet, workbook)
@@ -18,12 +19,11 @@ module Xsv
18
19
  end
19
20
 
20
21
  sheet.rewind
21
- Ox.sax_parse(handler, sheet.read)
22
22
 
23
- return rows, cols
24
- end
23
+ handler.parse(sheet)
25
24
 
26
- # Ox::Sax implementation
25
+ [rows, cols]
26
+ end
27
27
 
28
28
  def initialize(trim_empty_rows, &block)
29
29
  @block = block
@@ -35,36 +35,22 @@ module Xsv
35
35
  @trim_empty_rows = trim_empty_rows
36
36
  end
37
37
 
38
- def start_element(name)
38
+ def start_element(name, attrs)
39
39
  case name
40
- when :c
40
+ when 'c'
41
41
  @state = name
42
- @cell = nil
43
- when :v
42
+ @cell = attrs[:r]
43
+ when 'v'
44
44
  col = column_index(@cell)
45
45
  @maxColumn = col if col > @maxColumn
46
46
  @maxRow = @row if @row > @maxRow
47
- when :row
47
+ when 'row'
48
48
  @state = name
49
- @row = nil
50
- when :dimension
49
+ @row = attrs[:r].to_i
50
+ when 'dimension'
51
51
  @state = name
52
- end
53
- end
54
-
55
- def end_element(name)
56
- if name == :sheetData
57
- @block.call(@maxRow, @maxColumn)
58
- end
59
- end
60
52
 
61
- def attr(name, value)
62
- if @state == :c && name == :r
63
- @cell = value
64
- elsif @state == :row && name == :r
65
- @row = value.to_i
66
- elsif @state == :dimension && name == :ref
67
- _firstCell, lastCell = value.split(":")
53
+ _firstCell, lastCell = attrs[:ref].split(':')
68
54
 
69
55
  if lastCell
70
56
  @maxColumn = column_index(lastCell)
@@ -75,5 +61,9 @@ module Xsv
75
61
  end
76
62
  end
77
63
  end
64
+
65
+ def end_element(name)
66
+ @block.call(@maxRow, @maxColumn) if name == 'sheetData'
67
+ end
78
68
  end
79
69
  end
@@ -1,98 +1,58 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  # This is the core worksheet parser, implemented as an Ox::Sax handler. This is
4
5
  # used internally to enumerate rows.
5
- class SheetRowsHandler < Ox::Sax
6
+ class SheetRowsHandler < SaxParser
6
7
  include Xsv::Helpers
7
8
 
8
- def format_cell
9
- return nil if @current_value.empty?
10
-
11
- case @current_cell[:t]
12
- when "s"
13
- @workbook.shared_strings[@current_value.to_i]
14
- when "str", "inlineStr"
15
- @current_value.dup
16
- when "e" # N/A
17
- nil
18
- when nil, "n"
19
- if @current_cell[:s]
20
- style = @workbook.xfs[@current_cell[:s].to_i]
21
- numFmt = @workbook.numFmts[style[:numFmtId].to_i]
22
-
23
- parse_number_format(@current_value, numFmt)
24
- else
25
- parse_number(@current_value)
26
- end
27
- when "b"
28
- @current_value == "1"
29
- else
30
- raise Xsv::Error, "Encountered unknown column type #{@current_cell[:t]}"
31
- end
32
- end
33
-
34
- # Ox::Sax implementation below
35
-
36
9
  def initialize(mode, empty_row, workbook, row_skip, last_row, &block)
37
- @block = block
38
-
39
- # :sheetData
40
- # :row
41
- # :c
42
- # :v
43
- @state = nil
44
-
45
10
  @mode = mode
46
11
  @empty_row = empty_row
47
12
  @workbook = workbook
48
13
  @row_skip = row_skip
14
+ @last_row = last_row - @row_skip
15
+ @block = block
16
+
17
+ @state = nil
18
+
49
19
  @row_index = 0
50
20
  @current_row = {}
51
21
  @current_row_attrs = {}
52
22
  @current_cell = {}
53
23
  @current_value = String.new
54
- @last_row = last_row
55
24
 
56
- if @mode == :hash
57
- @headers = @empty_row.keys
58
- end
25
+ @headers = @empty_row.keys if @mode == :hash
59
26
  end
60
27
 
61
- def start_element(name)
28
+ def start_element(name, attrs)
62
29
  case name
63
- when :c
30
+ when 'c'
64
31
  @state = name
65
- @current_cell.clear
32
+ @current_cell = attrs
66
33
  @current_value.clear
67
- when :v
34
+ when 'v', 'is'
68
35
  @state = name
69
- when :row
36
+ when 'row'
70
37
  @state = name
71
38
  @current_row = @empty_row.dup
72
- @current_row_attrs.clear
39
+ @current_row_attrs = attrs
40
+ when 't'
41
+ @state = nil unless @state == 'is'
73
42
  else
74
43
  @state = nil
75
44
  end
76
45
  end
77
46
 
78
- def text(value)
79
- if @state == :v
80
- @current_value << value
81
- end
82
- end
83
-
84
- def attr(name, value)
85
- case @state
86
- when :c
87
- @current_cell[name] = value
88
- when :row
89
- @current_row_attrs[name] = value
90
- end
47
+ def characters(value)
48
+ @current_value << value if @state == 'v' || @state == 'is'
91
49
  end
92
50
 
93
51
  def end_element(name)
94
52
  case name
95
- when :c
53
+ when 'v'
54
+ @state = nil
55
+ when 'c'
96
56
  col_index = column_index(@current_cell[:r])
97
57
 
98
58
  case @mode
@@ -101,28 +61,54 @@ module Xsv
101
61
  when :hash
102
62
  @current_row[@headers[col_index]] = format_cell
103
63
  end
104
- when :row
105
- @real_row_number = @current_row_attrs[:r].to_i
106
- @adjusted_row_number = @real_row_number - @row_skip
64
+ when 'row'
65
+ real_row_number = @current_row_attrs[:r].to_i
66
+ adjusted_row_number = real_row_number - @row_skip
107
67
 
108
- if @real_row_number <= @row_skip
109
- return
110
- end
68
+ return if real_row_number <= @row_skip
111
69
 
112
70
  @row_index += 1
113
71
 
114
72
  # Skip first row if we're in hash mode
115
- return if @adjusted_row_number == 1 && @mode == :hash
73
+ return if adjusted_row_number == 1 && @mode == :hash
116
74
 
117
75
  # Pad empty rows
118
- while @row_index < @adjusted_row_number
76
+ while @row_index < adjusted_row_number
119
77
  @block.call(@empty_row)
120
78
  @row_index += 1
121
79
  next
122
80
  end
123
81
 
124
82
  # Do not return empty trailing rows
125
- @block.call(@current_row) unless @row_index > @last_row - @row_skip
83
+ @block.call(@current_row) unless @row_index > @last_row
84
+ end
85
+ end
86
+
87
+ private
88
+
89
+ def format_cell
90
+ return nil if @current_value.empty?
91
+
92
+ case @current_cell[:t]
93
+ when 's'
94
+ @workbook.shared_strings[@current_value.to_i]
95
+ when 'str', 'inlineStr'
96
+ @current_value.strip
97
+ when 'e' # N/A
98
+ nil
99
+ when nil, 'n'
100
+ if @current_cell[:s]
101
+ style = @workbook.xfs[@current_cell[:s].to_i]
102
+ numFmt = @workbook.numFmts[style[:numFmtId].to_i]
103
+
104
+ parse_number_format(@current_value, numFmt)
105
+ else
106
+ parse_number(@current_value)
107
+ end
108
+ when 'b'
109
+ @current_value == '1'
110
+ else
111
+ raise Xsv::Error, "Encountered unknown column type #{@current_cell[:t]}"
126
112
  end
127
113
  end
128
114
  end
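The row padding in `end_element` above can be illustrated without any XML. This is a simplified sketch, not the handler itself: it ignores `@row_skip`, hash mode and the trailing-row check, and `pad_rows` with its `[row_number, values]` input shape is a hypothetical construct for the example.

```ruby
# rows: [row_number, values] pairs in document order (hypothetical input shape).
def pad_rows(rows, empty_row)
  row_index = 0
  padded = []

  rows.each do |row_number, values|
    row_index += 1

    # Emit a placeholder for every row number the file skipped.
    while row_index < row_number
      padded << empty_row
      row_index += 1
    end

    padded << values
  end

  padded
end

p pad_rows([[1, ['a']], [4, ['b']]], [nil])
```

The point of the loop is that worksheets may omit empty rows entirely, so the `r` attribute of each row, not its position in the file, decides where it lands.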
@@ -1,56 +1,23 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  # SheetsIdsHandler interprets the relevant parts of workbook.xml
4
5
  # This is used internally to get the sheets ids, relationship_ids, and names when opening a workbook.
5
- class SheetsIdsHandler < Ox::Sax
6
+ class SheetsIdsHandler < SaxParser
6
7
  def self.get_sheets_ids(io)
7
8
  sheets_ids = []
8
- handler = new do |sheet_ids|
9
- sheets_ids << sheet_ids
10
- end
11
9
 
12
- Ox.sax_parse(handler, io.read)
13
- return sheets_ids
14
- end
10
+ new { |sheet_ids| sheets_ids << sheet_ids }.parse(io)
15
11
 
16
- # Ox::Sax implementation
12
+ sheets_ids
13
+ end
17
14
 
18
15
  def initialize(&block)
19
16
  @block = block
20
- @parsing = false
21
- end
22
-
23
- def start_element(name)
24
- if name == :sheets
25
- @parsing = true
26
- return
27
- end
28
-
29
- return unless name == :sheet
30
-
31
- @sheet_ids = {}
32
- end
33
-
34
- def attr(name, value)
35
- return unless @parsing
36
-
37
- case name
38
- when :name, :sheetId, :state
39
- @sheet_ids[name] = value
40
- when :'r:id'
41
- @sheet_ids[:r_id] = value
42
- end
43
17
  end
44
18
 
45
- def end_element(name)
46
- if name == :sheets
47
- @parsing = false
48
- return
49
- end
50
-
51
- return unless name == :sheet
52
-
53
- @block.call(@sheet_ids)
19
+ def start_element(name, attrs)
20
+ @block.call(attrs.slice(:name, :sheetId, :state, :'r:id')) if name == 'sheet'
54
21
  end
55
22
  end
56
23
  end
@@ -1,59 +1,43 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
4
  # StylesHandler interprets the relevant parts of styles.xml
4
5
  # This is used internally when opening a sheet.
5
- class StylesHandler < Ox::Sax
6
- def self.get_styles(io, numFmts)
7
- @xfs = nil
8
- @numFmts = nil
9
- handler = new(numFmts) do |xfs, numFmts|
6
+ class StylesHandler < SaxParser
7
+ def self.get_styles(io)
8
+ handler = new(Xsv::Helpers::BUILT_IN_NUMBER_FORMATS.dup) do |xfs, numFmts|
10
9
  @xfs = xfs
11
10
  @numFmts = numFmts
12
11
  end
13
12
 
14
- Ox.sax_parse(handler, io.read)
15
- return @xfs, @numFmts
16
- end
13
+ handler.parse(io)
17
14
 
18
- # Ox::Sax implementation
15
+ [@xfs, @numFmts]
16
+ end
19
17
 
20
18
  def initialize(numFmts, &block)
21
19
  @block = block
22
20
  @state = nil
23
21
  @xfs = []
24
22
  @numFmts = numFmts
25
-
26
- @xf = {}
27
- @numFmt = {}
28
23
  end
29
24
 
30
- def start_element(name)
25
+ def start_element(name, attrs)
31
26
  case name
32
- when :cellXfs, :numFmts
33
- @state = name
34
- when :xf
35
- @xf = {}
36
- when :numFmt
37
- @numFmt = {}
38
- end
39
- end
40
-
41
- def attr(name, value)
42
- case @state
43
- when :cellXfs
44
- @xf[name] = value
45
- when :numFmts
46
- @numFmt[name] = value
27
+ when 'cellXfs'
28
+ @state = 'cellXfs'
29
+ when 'xf'
30
+ @xfs << attrs if @state == 'cellXfs'
31
+ when 'numFmt'
32
+ @numFmts[attrs[:numFmtId].to_i] = attrs[:formatCode]
47
33
  end
48
34
  end
49
35
 
50
36
  def end_element(name)
51
- if @state == :cellXfs && name == :xf
52
- @xfs << @xf
53
- elsif @state == :numFmts && name == :numFmt
54
- @numFmts[@numFmt[:numFmtId].to_i] = @numFmt[:formatCode]
55
- elsif name == :styleSheet
37
+ if name == 'styleSheet'
56
38
  @block.call(@xfs, @numFmts)
39
+ elsif name == 'cellXfs'
40
+ @state = nil
57
41
  end
58
42
  end
59
43
  end
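How the collected `xfs` and `numFmts` are later used to pick a format string (as `format_cell` does with a cell's `s` attribute) can be sketched as follows. The table is a trimmed stand-in for `BUILT_IN_NUMBER_FORMATS`, and the custom format id `165`, `build_num_fmts` and `resolve_format` are hypothetical names and values chosen for the example.

```ruby
# Trimmed stand-in for Xsv::Helpers::BUILT_IN_NUMBER_FORMATS.
BUILT_IN = { 14 => 'm/d/yyyy', 20 => 'h:mm' }.freeze

# Start from an unfrozen copy of the built-ins and add each <numFmt> element.
def build_num_fmts(num_fmt_attrs)
  num_fmts = BUILT_IN.dup
  num_fmt_attrs.each { |attrs| num_fmts[attrs[:numFmtId].to_i] = attrs[:formatCode] }
  num_fmts
end

# A cell's s attribute indexes xfs; the xf's numFmtId selects the format string.
def resolve_format(xfs, num_fmts, style_index)
  style = xfs[style_index.to_i]
  num_fmts[style[:numFmtId].to_i]
end

num_fmts = build_num_fmts([{ numFmtId: '165', formatCode: 'yyyy-mm-dd' }])
xfs = [{ numFmtId: '14' }, { numFmtId: '165' }]

p resolve_format(xfs, num_fmts, '0')
p resolve_format(xfs, num_fmts, '1')
```

Copying the frozen constant with `dup` keeps custom formats from one workbook out of the shared table.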
@@ -1,4 +1,5 @@
1
1
  # frozen_string_literal: true
2
+
2
3
  module Xsv
3
- VERSION = "0.3.15"
4
+ VERSION = '1.0.0.pre.2'
4
5
  end
@@ -1,11 +1,11 @@
1
1
  # frozen_string_literal: true
2
- require "zip"
2
+
3
+ require 'zip'
3
4
 
4
5
  module Xsv
5
6
  # An OOXML Spreadsheet document is called a Workbook. A Workbook consists of
6
7
  # multiple Sheets that are available in the array that's accessible through {#sheets}
7
8
  class Workbook
8
-
9
9
  # Access the Sheet objects contained in the workbook
10
10
  # @return [Array<Sheet>]
11
11
  attr_reader :sheets
@@ -15,13 +15,13 @@ module Xsv
15
15
  # Open the workbook of the given filename, string or buffer. For additional
16
16
  # options see {.initialize}
17
17
  def self.open(data, **kws)
18
- if data.is_a?(IO) || data.respond_to?(:read) # is it a buffer?
19
- @workbook = self.new(Zip::File.open_buffer(data), **kws)
20
- elsif data.start_with?("PK\x03\x04") # is it a string containing a filename?
21
- @workbook = self.new(Zip::File.open_buffer(data), **kws)
22
- else # must be a filename
23
- @workbook = self.new(Zip::File.open(data), **kws)
24
- end
18
+ @workbook = if data.is_a?(IO) || data.respond_to?(:read) # is it a buffer?
19
+ new(Zip::File.open_buffer(data), **kws)
20
+ elsif data.start_with?("PK\x03\x04") # is it a string containing a file?
21
+ new(Zip::File.open_buffer(data), **kws)
22
+ else # must be a filename
23
+ new(Zip::File.open(data), **kws)
24
+ end
25
25
  end
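The three-way dispatch in `Workbook.open` above can be exercised without rubyzip. In this sketch, `input_kind` and `ZIP_MAGIC` are hypothetical names; the real method hands the data to `Zip::File.open_buffer` or `Zip::File.open` instead of returning a symbol.

```ruby
require 'stringio'

# Local file header signature that starts a ZIP archive.
ZIP_MAGIC = "PK\x03\x04"

# Hypothetical stand-in for the branching in Workbook.open.
def input_kind(data)
  if data.is_a?(IO) || data.respond_to?(:read) # is it a buffer?
    :buffer
  elsif data.start_with?(ZIP_MAGIC) # is it a string containing a file?
    :zip_string
  else # must be a filename
    :filename
  end
end

p input_kind(StringIO.new('anything'))
p input_kind("PK\x03\x04rest of archive")
p input_kind('sheet.xlsx')
```

The order of the checks matters: an IO-like object is tested first because it does not respond to `start_with?`.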
26
26
 
27
27
  # Open a workbook from an instance of {Zip::File}. Generally it's recommended
@@ -36,19 +36,16 @@ module Xsv
36
36
  @trim_empty_rows = trim_empty_rows
37
37
 
38
38
  @sheets = []
39
- @xfs = []
40
- @numFmts = Xsv::Helpers::BUILT_IN_NUMBER_FORMATS.dup
41
-
42
- fetch_shared_strings
43
- fetch_styles
44
- fetch_sheets_ids
45
- fetch_relationships
46
- fetch_sheets
39
+ @xfs, @numFmts = fetch_styles
40
+ @sheet_ids = fetch_sheet_ids
41
+ @relationships = fetch_relationships
42
+ @shared_strings = fetch_shared_strings
43
+ @sheets = fetch_sheets
47
44
  end
48
45
 
49
46
  # @return [String]
50
47
  def inspect
51
- "#<#{self.class.name}:#{self.object_id}>"
48
+ "#<#{self.class.name}:#{object_id}>"
52
49
  end
53
50
 
54
51
  # Close the handle to the workbook file and leave all resources for the GC to collect
@@ -60,7 +57,7 @@ module Xsv
60
57
  @numFmts = nil
61
58
  @relationships = nil
62
59
  @shared_strings = nil
63
- @sheets_ids = nil
60
+ @sheet_ids = nil
64
61
 
65
62
  true
66
63
  end
@@ -75,42 +72,44 @@ module Xsv
  private
 
  def fetch_shared_strings
- handle = @zip.glob("xl/sharedStrings.xml").first
+ handle = @zip.glob('xl/sharedStrings.xml').first
  return if handle.nil?
 
  stream = handle.get_input_stream
- @shared_strings = SharedStringsParser.parse(stream)
-
- stream.close
+ SharedStringsParser.parse(stream)
+ ensure
+ stream.close if stream
  end
 
  def fetch_styles
- stream = @zip.glob("xl/styles.xml").first.get_input_stream
+ stream = @zip.glob('xl/styles.xml').first.get_input_stream
 
- @xfs, @numFmts = StylesHandler.get_styles(stream, @numFmts)
+ StylesHandler.get_styles(stream)
+ ensure
+ stream.close
  end
 
  def fetch_sheets
- @zip.glob("xl/worksheets/sheet*.xml").sort do |a, b|
+ @zip.glob('xl/worksheets/sheet*.xml').sort do |a, b|
  a.name[/\d+/].to_i <=> b.name[/\d+/].to_i
- end.each do |entry|
- rel = @relationships.detect { |r| entry.name.end_with?(r[:Target]) && r[:Type].end_with?("worksheet") }
- sheet_ids = @sheets_ids.detect { |i| i[:r_id] == rel[:Id] }
- @sheets << Xsv::Sheet.new(self, entry.get_input_stream, entry.size, sheet_ids)
+ end.map do |entry|
+ rel = @relationships.detect { |r| entry.name.end_with?(r[:Target]) && r[:Type].end_with?('worksheet') }
+ sheet_ids = @sheet_ids.detect { |i| i[:"r:id"] == rel[:Id] }
+ Xsv::Sheet.new(self, entry.get_input_stream, entry.size, sheet_ids)
  end
  end
 
- def fetch_sheets_ids
- stream = @zip.glob("xl/workbook.xml").first.get_input_stream
- @sheets_ids = SheetsIdsHandler.get_sheets_ids(stream)
-
+ def fetch_sheet_ids
+ stream = @zip.glob('xl/workbook.xml').first.get_input_stream
+ SheetsIdsHandler.get_sheets_ids(stream)
+ ensure
  stream.close
  end
 
  def fetch_relationships
- stream = @zip.glob("xl/_rels/workbook.xml.rels").first.get_input_stream
- @relationships = RelationshipsHandler.get_relations(stream)
-
+ stream = @zip.glob('xl/_rels/workbook.xml.rels').first.get_input_stream
+ RelationshipsHandler.get_relations(stream)
+ ensure
  stream.close
  end
  end
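The `fetch_*` methods above now return the parsed value and close their stream in a method-level `ensure` clause, where they previously assigned instance variables and closed the stream by hand. A minimal sketch of that pattern, assuming a hypothetical `fetch_upcased` method and a `StringIO` standing in for a zip entry's input stream:

```ruby
require 'stringio'

# Hypothetical example, not Xsv code. The method returns the value of its
# body; the ensure clause runs afterwards and does not change that value.
def fetch_upcased(source)
  return if source.nil? # early return: ensure still runs, with stream == nil

  stream = source       # stands in for handle.get_input_stream
  stream.read.upcase
ensure
  stream.close if stream # guard needed because of the early return above
end

io = StringIO.new('shared strings')
puts fetch_upcased(io)
puts io.closed?
```

The `if stream` guard matters only where the method can return before `stream` is assigned, which is the case for `fetch_shared_strings` in the diff.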
@@ -36,12 +36,11 @@ Gem::Specification.new do |spec|
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
  spec.require_paths = ["lib"]
 
- spec.required_ruby_version = "~> 2.5"
+ spec.required_ruby_version = ">= 2.5"
 
  spec.add_dependency "rubyzip", ">= 1.3", "< 3"
- spec.add_dependency "ox", ">= 2.9"
 
  spec.add_development_dependency "bundler", "< 3"
  spec.add_development_dependency "rake", "~> 13.0"
- spec.add_development_dependency "minitest", "~> 5.0"
+ spec.add_development_dependency "minitest", "~> 5.14.2"
  end
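The change from `"~> 2.5"` to `">= 2.5"` is what allows installation on Ruby 3.0: the pessimistic operator `~> 2.5` means `>= 2.5, < 3.0`. A short sketch using RubyGems' own `Gem::Requirement` class shows the difference; the version numbers tested are illustrative.

```ruby
require 'rubygems'

old_req = Gem::Requirement.new('~> 2.5') # constraint before this change
new_req = Gem::Requirement.new('>= 2.5') # constraint after this change

ruby27 = Gem::Version.new('2.7.2')
ruby30 = Gem::Version.new('3.0.0')

puts old_req.satisfied_by?(ruby27) # true
puts old_req.satisfied_by?(ruby30) # false: "~> 2.5" excludes 3.x
puts new_req.satisfied_by?(ruby30) # true
```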
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: xsv
  version: !ruby/object:Gem::Version
- version: 0.3.15
+ version: 1.0.0.pre.2
  platform: ruby
  authors:
  - Martijn Storck
- autorequire:
+ autorequire:
  bindir: exe
  cert_chain: []
- date: 2020-06-02 00:00:00.000000000 Z
+ date: 2021-01-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rubyzip
@@ -30,20 +30,6 @@ dependencies:
  - - "<"
  - !ruby/object:Gem::Version
  version: '3'
- - !ruby/object:Gem::Dependency
- name: ox
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '2.9'
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ">="
- - !ruby/object:Gem::Version
- version: '2.9'
  - !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
@@ -78,14 +64,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.0'
+ version: 5.14.2
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.0'
+ version: 5.14.2
  description: |2
  Xsv is a fast, lightweight parser for Office Open XML spreadsheet files
  (commonly known as Excel or .xlsx files). It strives to be minimal in the
@@ -109,6 +95,7 @@ files:
  - lib/xsv.rb
  - lib/xsv/helpers.rb
  - lib/xsv/relationships_handler.rb
+ - lib/xsv/sax_parser.rb
  - lib/xsv/shared_strings_parser.rb
  - lib/xsv/sheet.rb
  - lib/xsv/sheet_bounds_handler.rb
@@ -125,23 +112,23 @@ metadata:
  homepage_uri: https://github.com/martijn/xsv
  source_code_uri: https://github.com/martijn/xsv
  changelog_uri: https://github.com/martijn/xsv/CHANGELOG.md
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
- - - "~>"
+ - - ">="
  - !ruby/object:Gem::Version
  version: '2.5'
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - ">"
  - !ruby/object:Gem::Version
- version: '0'
+ version: 1.3.1
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.3
+ signing_key:
  specification_version: 4
  summary: A fast and lightweiggt xlsx parser that provides nothing a CSV parser wouldn't
  test_files: []