rbxl 1.1.0 → 1.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 5213a5a5d1091d4f8927631c50c7c690362eb284ba1eb31ee80bf3d9d0a1ec7b
- data.tar.gz: 2e5120093c09738342b76fb160b7e259649049dcec66b325bdc88a21f59bc9dd
+ metadata.gz: 2b16579845423af49cff940ed7557164236d66b2e3e7f92e8bcece2a69c486e0
+ data.tar.gz: 3976e30db04742ea0b22b5d1edf95bccaf269e0ee6b6d209af3fe14bbaace7f0
  SHA512:
- metadata.gz: '038c1112ff36766d74aea7b9092ace0a0eed88d9ad7c1db28356e3d7598edd52d93d21c128a40c7b4e822719a9e329bca4515118a76ca7fce3fed0a00407f342'
- data.tar.gz: b14444ae769c953832fba7da2c6f09d1371ad14206160cc5d6fd711c583ed33300438322873e9863ba199d79fa60b30e642071546878e9b418530b4dc5d8007f
+ metadata.gz: 2dcba5f510b571dd546ca36b8c86dd607dd406298691e71bf01575e124c15bac6f526bcdb8aa7183f4980fb06c16e7dfe8b4d0f7ce3743c57b214f5e90d6c1c4
+ data.tar.gz: 2cd7f062d75b0af0ae1fdbe13606dcbe5b1c6df0d2aa48be94e7ef2b06bf178ef8ae315d8c9c1c11629b265d9f2829beef60f4127294ac7a91b4de088f4d2a09
data/CHANGELOG.md CHANGED
@@ -1,25 +1,108 @@
  # Changelog
 
- ## 1.1.0
+ All notable changes to this project are documented here. The format is based
+ on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project
+ follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
- - `Rbxl.open` and `Rbxl.new` now default `read_only: true` and `write_only: true` respectively, so the call site no longer needs the boilerplate. Explicitly passing `false` raises `NotImplementedError`.
- - Add `date_conversion: true` to `Rbxl.open`: numeric cells whose style points at a date/time `numFmt` (built-in ids 14–22, 27–36, 45–47, 50–58, or a custom format code containing date tokens) are returned as `Date` or `Time`. Off by default — no change in output shape or throughput when the flag is absent.
- - Fix Ruby reader path so self-closing `<row/>` and `<c/>` elements are iterated instead of silently dropped, and never yield `nil` for a row.
+ ## [1.3.0] - 2026-04-27
 
- ## 1.0.2
+ ### Added
+
+ - `Rbxl.open` (and `Rbxl::ReadOnlyWorkbook.open`) now accept a block. The
+ workbook is yielded and closed automatically when the block returns or
+ raises, matching the `File.open` / `Zip::File.open` idiom. Previously the
+ block was silently ignored.
+ - `Rbxl::UnsupportedFormatError` raised by `Rbxl.open` when the file is not
+ a `.xlsx` container. Legacy `.xls` (BIFF/CFB) inputs are detected by the
+ OLE compound-file magic and reported with a conversion hint, instead of
+ surfacing an opaque `Zip::Error` from rubyzip five frames deep.
+ - `Rbxl::ReadOnlyWorkbook#sheet` now accepts an integer index into
+ `sheet_names` (including negatives, so `sheet(-1)` returns the last
+ sheet), for the common single-sheet case where `book.sheet(0)` reads
+ cleaner than `book.sheet(book.sheet_names.first)`.
+ - `Rbxl::ReadOnlyWorkbook#sheets` iterator over worksheets in workbook
+ order. Returns an `Enumerator` when called without a block, so
+ `book.sheets.first` and `book.sheets.map(&:name)` compose naturally.
+ Worksheet objects are constructed on demand — no eager parse of sibling
+ sheets.
+
+ ## [1.2.0] - 2026-04-23
+
+ ### Changed
+
+ - `WorkbookAlreadySavedError` message now points at the save-once design and
+ the next action (open a fresh `Rbxl.new` for another file) so callers who
+ trip on the constraint don't have to read the source to understand why.
+ - Workbook- and worksheet-level parse failures raise `WorkbookFormatError` /
+ `WorksheetFormatError` with the workbook path and the XML entry or sheet
+ name in the message, replacing generic parser exceptions.
+
+ ### Added
+
+ - Location-aware coverage around malformed workbook and worksheet XML so bad
+ inputs surface the specific entry that failed rather than bubbling up an
+ unlabelled `Nokogiri::XML::SyntaxError`.
+ - README sections covering the write-only model (append-only, save-once,
+ no in-place edit), a "Reading recipes" walkthrough, and an explicit Out
+ of scope entry for read-modify-save workflows.
+
+ ### Fixed
+
+ - Honor Excel's `date1904` workbook setting when `date_conversion: true` is
+ enabled, so Mac-originated workbooks map serial dates to the correct Ruby
+ `Date` and `Time` values.
+
+ ## [1.1.0] - 2026-04-21
+
+ ### Added
+
+ - `date_conversion: true` option for `Rbxl.open`: numeric cells whose style
+ points at a date/time `numFmt` (built-in ids 14–22, 27–36, 45–47, 50–58,
+ or a custom format code containing date tokens) are returned as `Date`
+ or `Time`. Off by default — no change in output shape or throughput when
+ the flag is absent.
+
+ ### Changed
+
+ - `Rbxl.open` and `Rbxl.new` now default `read_only: true` and
+ `write_only: true` respectively, so the call site no longer needs the
+ boilerplate. Explicitly passing `false` raises `NotImplementedError`.
+
+ ### Fixed
+
+ - Ruby reader path now iterates self-closing `<row/>` and `<c/>` elements
+ instead of silently dropping them, and never yields `nil` for a row.
+
+ ## [1.0.2] - 2026-04-17
+
+ ### Added
+
+ - `streaming: true` option for `Rbxl.open` feeds worksheet XML to the
+ native reader in 64 KiB chunks instead of buffering the full worksheet
+ first.
+ - `Rbxl.max_worksheet_bytes` configuration and `Rbxl::WorksheetTooLargeError`
+ so streaming reads can stop oversized worksheet XML entries mid-inflate.
+
+ ### Changed
 
- - Add `streaming: true` to `Rbxl.open` to feed worksheet XML to the native reader in 64 KiB chunks instead of buffering the full worksheet first.
- - Add `Rbxl.max_worksheet_bytes` and `Rbxl::WorksheetTooLargeError` so streaming reads can stop oversized worksheet XML entries mid-inflate.
  - Expand RDoc coverage across the public API.
  - Tighten RBS signatures to match the actual runtime types.
- - Reword public docs and gem metadata to describe reads as row-by-row and writes as append-only, reserving "streaming" for the new opt-in native read path.
+ - Reword public docs and gem metadata to describe reads as row-by-row and
+ writes as append-only, reserving "streaming" for the new opt-in native
+ read path.
+
+ ## [1.0.1] - 2026-04-16
+
+ ### Added
+
+ - Go and Rust benchmark comparisons.
 
- ## 1.0.1
+ ### Fixed
 
- - Fix ZIP64 handling.
- - Add Go and Rust benchmark comparisons.
- - Align `rbxl/native` with Nokogiri's libxml2 to avoid mixed-library warnings at runtime.
+ - ZIP64 handling.
+ - Align `rbxl/native` with Nokogiri's libxml2 to avoid mixed-library
+ warnings at runtime.
 
- ## 1.0.0
+ ## [1.0.0] - 2026-04-16
 
- - Initial 1.0 release.
+ - Initial public release.
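The 1.3.0 entry above describes block-form `Rbxl.open` as following the `File.open` idiom: yield the workbook, return the block's value, and close on both normal return and exception. A minimal sketch of that yield-then-ensure pattern follows. `DemoBook` is a hypothetical stand-in defined here so the example runs without the gem; it is not part of rbxl.

```ruby
# Hypothetical stand-in for a closable workbook; only the open/close
# lifecycle is modelled, no file is actually read.
class DemoBook
  attr_reader :path

  def self.open(path)
    book = new(path)
    # Without a block, hand the open object to the caller, who must close it.
    return book unless block_given?

    begin
      yield book      # the block's value becomes the return value of .open
    ensure
      book.close      # runs on normal return and when the block raises
    end
  end

  def initialize(path)
    @path = path
    @closed = false
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

seen = nil
result = DemoBook.open("report.xlsx") do |book|
  seen = book
  book.path.upcase
end
puts result        # value returned by the block
puts seen.closed?  # the book was closed once the block finished
```

With this shape, callers only need an explicit `close` when they use the non-block form.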
data/README.md CHANGED
@@ -21,12 +21,28 @@ Supported:
 
  Out of scope:
 
+ - in-place editing of an existing `.xlsx` file — rbxl opens workbooks
+ read-only and generates new workbooks write-only, with no read-modify-save
+ path. If you need to open a file, tweak a handful of cells, and write it
+ back preserving everything else, use a full-object-model library instead.
+ - legacy `.xls` (BIFF/CFB) input — rbxl reads OOXML `.xlsx` only. Convert
+ first, e.g. `libreoffice --headless --convert-to xlsx file.xls` or
+ `ssconvert file.xls file.xlsx` (Gnumeric). `Rbxl.open` detects the OLE
+ compound-file magic on open and raises `Rbxl::UnsupportedFormatError`
+ with the conversion hint rather than surfacing an opaque ZIP parse
+ error from rubyzip.
  - preserving arbitrary workbook structure on save
  - rich style round-tripping
  - formulas, images, charts, comments
 
  ## Usage
 
+ `Rbxl.open` defaults to read-only and `Rbxl.new` defaults to write-only;
+ the `read_only:` / `write_only:` keywords remain for call-site clarity and
+ to leave room for a future read/write mode.
+
+ ### Writing a new workbook
+
  ```ruby
  require "rbxl"
 
@@ -38,6 +54,23 @@ sheet.append([2, "bob", 95.5])
  book.save("report.xlsx")
  ```
 
+ Write-only workbooks follow three rules:
+
+ - **Append-only within a sheet.** `sheet.append(row)` is the only way to
+ add data. There is no random-access cell write, no mid-stream edit of a
+ previously appended row.
+ - **Save-once per workbook.** `save` flushes the full `.xlsx` package in a
+ single pass and then closes the workbook. Calling `save` or `add_sheet`
+ again raises `Rbxl::WorkbookAlreadySavedError`. To produce another file,
+ start a new `Rbxl.new`.
+ - **No read-modify-save.** rbxl cannot open an existing `.xlsx` and write
+ back to it (see Out of scope above).
+
+ This is the tradeoff that keeps memory flat: rbxl buffers rows per sheet
+ and never materializes a full workbook object graph.
+
+ ### Reading a workbook
+
  ```ruby
  require "rbxl"
 
@@ -53,11 +86,125 @@ p sheet.calculate_dimension
  book.close
  ```
 
- `Rbxl.open` defaults to read-only and `Rbxl.new` defaults to write-only;
- the `read_only:` / `write_only:` keywords remain for call-site clarity and
- to leave room for a future read/write mode. Write-only workbooks are
- save-once by design this matches the optimized mode tradeoff: low
- flexibility in exchange for simpler memory behavior.
+ ### Reading recipes
+
+ **Plain value arrays (fastest path).** Use `values_only: true` when you
+ only care about the cell values, not their coordinates. Rows come back as
+ frozen `Array<Object>`:
+
+ ```ruby
+ book.sheet("Data").each_row(values_only: true) do |values|
+ id, name, score = values
+ # ...
+ end
+ ```
+
+ **Cell objects with coordinates.** Default `each_row` yields a
+ `Rbxl::Row` wrapping `Rbxl::ReadOnlyCell`s. Use this when you need the
+ Excel coordinate alongside the value:
+
+ ```ruby
+ book.sheet("Data").each_row do |row|
+ row.index # => 2 (1-based worksheet row number)
+ row[0].coordinate # => "A2"
+ row[0].value # => "alice"
+ row.values # => ["alice", 100, true]
+ end
+ ```
+
+ **Skip the header row.** `each_row` without a block returns an
+ `Enumerator`, so chain `drop`:
+
+ ```ruby
+ book.sheet("Data").each_row(values_only: true).drop(1).each do |row|
+ # ...
+ end
+ ```
+
+ **Peek at the first N rows.** `rows(...)` is an enumerator-returning
+ alias that composes well with `take`, `first`, `lazy`, etc.:
+
+ ```ruby
+ book.sheet("Data").rows(values_only: true).first(5)
+ ```
+
+ **Know the data range up-front.** When the workbook has a stored
+ dimension, these are O(1) lookups; otherwise pass `force: true` to scan:
+
+ ```ruby
+ sheet = book.sheet("Data")
+ sheet.max_row # => 500
+ sheet.max_column # => 12
+ sheet.calculate_dimension # => "A1:L500"
+ ```
+
+ **Pad sparse rows to the sheet width.** Without `pad_cells`, a row
+ containing only `A1` and `C1` yields two cells. With `pad_cells: true`,
+ missing cells are filled with `Rbxl::EmptyCell` (or `nil` in values-only
+ mode), aligned to `max_column`:
+
+ ```ruby
+ book.sheet("Sparse").each_row(pad_cells: true, values_only: true).first
+ # => ["left", nil, "right"]
+ ```
+
+ **Leading empty columns aren't padded.** Both default and `pad_cells: true`
+ rows align to the first populated column, not to column A. On a sheet
+ whose dimension is `B1:N100`, every row has 13 entries (columns B–N), not
+ 14. `max_column` still reports `14` (column N, 1-based) — the gap is on
+ the left, not the right. If you need column-A alignment, inspect
+ `calculate_dimension` and prepend the missing `nil`s yourself:
+
+ ```ruby
+ sheet = book.sheet("LeftOffset")
+ sheet.calculate_dimension # => "B1:N100"
+ leading_pad = Array.new(1, nil) # B starts at column 2, so 1 nil
+ sheet.each_row(values_only: true, pad_cells: true) do |row|
+ aligned = leading_pad + row # => [nil, "first B-value", ...]
+ end
+ ```
+
+ **Expand merged cells.** Excel leaves the anchor cell populated and the
+ rest of the merge range empty. Pass `expand_merged: true` to propagate
+ the anchor value across the full range; combine with `pad_cells: true`
+ when you want the result aligned to the sheet's width:
+
+ ```ruby
+ sheet = book.sheet("Merged")
+
+ sheet.rows(values_only: true).to_a
+ # => [["group", "solo"], ["tail"]]
+
+ sheet.rows(values_only: true, pad_cells: true, expand_merged: true).to_a
+ # => [["group", "group", "solo", nil],
+ # ["group", "group", "solo", "tail"]]
+ ```
+
+ **List sheets before opening any.** Sheet XML is only read on first
+ iteration; enumerating names is cheap:
+
+ ```ruby
+ book.sheet_names # => ["Summary", "Detail", "Raw"]
+ book.sheet("Detail").each_row(values_only: true) { |row| ... }
+ ```
+
+ **Locate a bad input.** All rbxl exceptions inherit from `Rbxl::Error`
+ and the messages carry the workbook path and (where relevant) the sheet
+ name, XML entry, or cell coordinate. Rescue at the sheet level:
+
+ ```ruby
+ begin
+ book.sheet("Raw").each_row(values_only: true) { |row| ... }
+ rescue Rbxl::WorksheetFormatError, Rbxl::WorkbookFormatError => e
+ warn e.message # includes workbook path and sheet/entry
+ rescue Rbxl::CellValueError => e
+ warn e.message # includes workbook path, sheet, and coordinate
+ end
+ ```
+
+ `Rbxl::CellValueError` is raised by the cell decoder when
+ `date_conversion: true` is active. The reader is forward-only, so rescue
+ terminates iteration rather than skipping to the next row.
 
  ### Date / time conversion
 
data/Rakefile CHANGED
@@ -1,11 +1,30 @@
  # frozen_string_literal: true
 
  require "bundler/gem_helper"
+ require "rake/testtask"
  require "rdoc/task"
 
  Bundler::GemHelper.install_tasks
 
+ Rake::TestTask.new(:test) do |t|
+ t.libs << "test"
+ t.libs << "lib"
+ t.test_files = FileList["test/**/*_test.rb"]
+ t.warning = false
+ end
+
  RDoc::Task.new(:rdoc) do |rdoc|
  rdoc.main = "README.md"
  rdoc.rdoc_files.include("README.md", "lib/**/*.rb")
  end
+
+ desc "Build the rbxl_native C extension in place"
+ task :compile do
+ ext_dir = File.expand_path("ext/rbxl_native", __dir__)
+ Dir.chdir(ext_dir) do
+ ruby "extconf.rb"
+ sh "make"
+ end
+ end
+
+ task test: :compile
data/lib/rbxl/errors.rb CHANGED
@@ -33,4 +33,23 @@ module Rbxl
  # bytes consumed from the ZIP entry, so high-compression zip-bomb style
  # worksheets are stopped mid-inflate rather than after the fact.
  class WorksheetTooLargeError < Error; end
+
+ # Raised by {Rbxl.open} when the file is not a valid +.xlsx+ container.
+ # Most commonly fires on legacy +.xls+ (BIFF/CFB) files — the message
+ # names the detected format and suggests a conversion path rather than
+ # letting the underlying ZIP parser surface an opaque error.
+ class UnsupportedFormatError < Error; end
+
+ # Raised when workbook-level XML is malformed or internally inconsistent,
+ # for example when +xl/workbook.xml+ cannot be parsed or references a
+ # missing relationship target.
+ class WorkbookFormatError < Error; end
+
+ # Raised when a worksheet XML entry cannot be parsed into rows.
+ class WorksheetFormatError < Error; end
+
+ # Raised when a specific cell cannot be decoded. The message includes the
+ # workbook path, sheet name, and cell coordinate to make bad inputs easy
+ # to locate.
+ class CellValueError < WorksheetFormatError; end
  end
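In the hunk above, `CellValueError` subclasses `WorksheetFormatError`. Ruby tests `rescue` clauses in order, so a clause naming the parent class also catches the subclass when it is listed first. This matters for the README recipe, where the `WorksheetFormatError` clause precedes the `CellValueError` one. A small sketch with stand-in classes under a hypothetical `Demo` module shows the subclass-first ordering. `Demo::Error < StandardError` is an assumption, since the base class definition is not part of this diff.

```ruby
# Stand-in hierarchy mirroring the class lines in the diff; the real
# constants live under Rbxl when the gem is loaded.
module Demo
  class Error < StandardError; end # assumed base; not shown in the diff
  class WorkbookFormatError < Error; end
  class WorksheetFormatError < Error; end
  class CellValueError < WorksheetFormatError; end
end

# Runs the block and reports which rescue clause handled the failure.
def classify
  yield
  :ok
rescue Demo::CellValueError
  # Most specific class first, otherwise the next clause would match it.
  :cell
rescue Demo::WorksheetFormatError, Demo::WorkbookFormatError
  :structure
end

p classify { raise Demo::CellValueError, "bad cell" }
p classify { raise Demo::WorksheetFormatError, "bad sheet XML" }
p classify { :fine }
```

If both branches do the same thing, as in the README recipe where each just warns with the message, the order has no visible effect; it only matters once the cell-level branch is meant to behave differently.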
@@ -30,6 +30,16 @@ module Rbxl
  # Namespace used by the OPC package relationships layer.
  PACKAGE_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
 
+ # First 8 bytes of the OLE Compound File Binary format (legacy .xls,
+ # .doc, .ppt). Sniffed to short-circuit into a typed error before
+ # rubyzip bubbles up an opaque "end of central directory" failure.
+ OLE_CFB_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b.freeze
+ private_constant :OLE_CFB_MAGIC
+
+ # ZIP local file header signature — the first bytes of every .xlsx.
+ ZIP_LOCAL_MAGIC = "PK\x03\x04".b.freeze
+ private_constant :ZIP_LOCAL_MAGIC
+
  # @return [String] filesystem path the workbook was opened from
  attr_reader :path
 
@@ -39,14 +49,28 @@ module Rbxl
  # Convenience constructor equivalent to
  # <tt>new(path, streaming:, date_conversion:)</tt>.
  #
+ # When a block is given, the workbook is yielded to the block and
+ # {#close} is called automatically when the block returns (or raises).
+ # The block's return value is returned to the caller, matching the
+ # +File.open+ / +Zip::File.open+ idiom.
+ #
  # @param path [String, #to_path] path to the <tt>.xlsx</tt> file
  # @param streaming [Boolean] feed worksheet XML to the native parser in
  # chunks (see {Rbxl.open})
  # @param date_conversion [Boolean] convert numeric cells backed by a
  # date/time +numFmt+ to Ruby date/time objects (see {Rbxl.open})
- # @return [Rbxl::ReadOnlyWorkbook]
+ # @yieldparam book [Rbxl::ReadOnlyWorkbook] the opened workbook
+ # @return [Rbxl::ReadOnlyWorkbook, Object] the workbook when no block is
+ # given, otherwise the block's return value
  def self.open(path, streaming: false, date_conversion: false)
- new(path, streaming: streaming, date_conversion: date_conversion)
+ book = new(path, streaming: streaming, date_conversion: date_conversion)
+ return book unless block_given?
+
+ begin
+ yield book
+ ensure
+ book.close
+ end
  end
 
  # Opens the ZIP archive, pre-loads shared strings, and indexes the
@@ -58,6 +82,7 @@ module Rbxl
  # date-style lookup table to produced worksheets
  def initialize(path, streaming: false, date_conversion: false)
  @path = path
+ ensure_xlsx_format!(path)
  @zip = Zip::File.open(path)
  @streaming = streaming
  @date_conversion = date_conversion
@@ -65,21 +90,27 @@ module Rbxl
  @sheet_entries = load_sheet_entries
  @sheet_names = @sheet_entries.keys.freeze
  @date_styles = nil
+ @date_1904 = nil
  @closed = false
  end
 
- # Returns a row-by-row worksheet by visible sheet name.
+ # Returns a row-by-row worksheet by visible sheet name or by 0-based
+ # index into {#sheet_names}. Negative indexes count from the end, so
+ # <tt>sheet(-1)</tt> returns the last sheet.
  #
  # The returned object shares the workbook's ZIP handle. Closing the
  # workbook invalidates any worksheets produced by prior calls.
  #
- # @param name [String] visible sheet name as listed in {#sheet_names}
+ # @param name_or_index [String, Integer] visible sheet name as listed in
+ # {#sheet_names}, or an integer index into that list
  # @return [Rbxl::ReadOnlyWorksheet]
- # @raise [Rbxl::SheetNotFoundError] if +name+ is not present
+ # @raise [Rbxl::SheetNotFoundError] if +name_or_index+ does not resolve
+ # to a sheet
  # @raise [Rbxl::ClosedWorkbookError] if the workbook has been closed
- def sheet(name)
+ def sheet(name_or_index)
  ensure_open!
 
+ name = resolve_sheet_name(name_or_index)
  entry_path = @sheet_entries.fetch(name) do
  raise SheetNotFoundError, "sheet not found: #{name}"
  end
@@ -87,13 +118,32 @@ module Rbxl
  ReadOnlyWorksheet.new(
  zip: @zip,
  entry_path: entry_path,
+ workbook_path: @path,
  shared_strings: @shared_strings,
  name: name,
  streaming: @streaming,
- date_styles: date_styles
+ date_styles: date_styles,
+ date_1904: date_1904?
  )
  end
 
+ # Iterates the workbook's sheets in workbook order. Each worksheet is
+ # constructed on demand, so <tt>sheets.first</tt> allocates only the
+ # first sheet and <tt>sheets.lazy.find { ... }</tt> stops as soon as a
+ # match is found. Returned objects share the same ZIP handle and
+ # cached shared-strings / date-style tables as {#sheet}.
+ #
+ # @yieldparam worksheet [Rbxl::ReadOnlyWorksheet]
+ # @return [Enumerator<Rbxl::ReadOnlyWorksheet>] when no block is given
+ # @return [void] when a block is given
+ # @raise [Rbxl::ClosedWorkbookError] if the workbook has been closed
+ def sheets
+ ensure_open!
+ return enum_for(:sheets) unless block_given?
+
+ @sheet_names.each { |name| yield sheet(name) }
+ end
+
  # Releases the underlying ZIP file handle. Idempotent; subsequent calls
  # are no-ops.
  #
@@ -116,6 +166,30 @@ module Rbxl
  raise ClosedWorkbookError, "workbook has been closed" if closed?
  end
 
+ def resolve_sheet_name(key)
+ return key unless key.is_a?(Integer)
+
+ name = @sheet_names[key]
+ return name if name
+
+ raise SheetNotFoundError, "sheet index out of range: #{key} (#{@sheet_names.length} sheet(s))"
+ end
+
+ def ensure_xlsx_format!(path)
+ header = File.binread(path, 8)
+ return if header.start_with?(ZIP_LOCAL_MAGIC)
+
+ if header.start_with?(OLE_CFB_MAGIC)
+ raise UnsupportedFormatError,
+ "#{path} looks like a legacy .xls (BIFF/CFB). " \
+ "rbxl supports .xlsx (OOXML) only; convert first, e.g. " \
+ "`libreoffice --headless --convert-to xlsx #{File.basename(path.to_s)}`."
+ end
+
+ raise UnsupportedFormatError,
+ "#{path} is not a valid .xlsx (no ZIP signature at offset 0)."
+ end
+
  # Built-in numFmtId values that Excel resolves to date/time formats.
  # Ids outside this set are dates only when the workbook provides a
  # matching custom +<numFmt>+ entry whose format code contains date
@@ -131,6 +205,13 @@ module Rbxl
  @date_styles ||= load_date_styles
  end
 
+ def date_1904?
+ return false unless @date_conversion
+
+ @date_1904 = load_date_1904 if @date_1904.nil?
+ @date_1904
+ end
+
  def load_date_styles
  entry = @zip.find_entry("xl/styles.xml")
  return [].freeze unless entry
@@ -268,13 +349,27 @@ module Rbxl
  rid = node.attribute("r:id")
  next unless name && rid
 
- target = relationships.fetch(rid)
+ target = relationships.fetch(rid) do
+ raise WorkbookFormatError,
+ "workbook #{@path} references missing relationship #{rid.inspect} for sheet #{name.inspect}"
+ end
  sheets[name] = "xl/#{target}".gsub(%r{/+}, "/")
  end
 
  sheets
  end
 
+ def load_date_1904
+ each_xml_node("xl/workbook.xml") do |node|
+ next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
+ next unless node.local_name == "workbookPr"
+
+ return xml_truthy?(node.attribute("date1904"))
+ end
+
+ false
+ end
+
  def load_relationship_targets(entry_path)
  relationships = {}
 
@@ -293,11 +388,20 @@ module Rbxl
  end
 
  def each_xml_node(entry_path)
- io = @zip.get_entry(entry_path).get_input_stream
+ entry = @zip.get_entry(entry_path)
+ raise WorkbookFormatError, "workbook #{@path} is missing required entry #{entry_path.inspect}" unless entry
+
+ io = entry.get_input_stream
  reader = Nokogiri::XML::Reader(io)
  reader.each { |node| yield node }
+ rescue Nokogiri::XML::SyntaxError => e
+ raise WorkbookFormatError, "invalid workbook XML in #{@path} at #{entry_path}: #{e.message}"
  ensure
  io&.close
  end
+
+ def xml_truthy?(value)
+ value == "1" || value == "true"
+ end
  end
  end
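The `ensure_xlsx_format!` check added earlier in this file's hunks reads the first 8 bytes and compares them against the ZIP and OLE compound-file signatures before rubyzip is involved. A standalone sketch of that sniffing step follows, using only the standard library. The function name `sniff_container` and its symbol return values are illustrative assumptions; the two byte strings are copied from the diff. The `.to_s` guards against `File.binread` returning `nil` for an empty file, which is a choice made in this sketch and not something the diff shows.

```ruby
require "tempfile"

# Signatures as they appear in the diff.
OLE_CFB_MAGIC   = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b.freeze
ZIP_LOCAL_MAGIC = "PK\x03\x04".b.freeze

# Classifies a file by its leading bytes (hypothetical helper).
# Returns :zip, :ole_cfb, or :unknown.
def sniff_container(path)
  # binread with a length returns nil at EOF, so normalise to a string.
  header = File.binread(path, 8).to_s
  return :zip if header.start_with?(ZIP_LOCAL_MAGIC)
  return :ole_cfb if header.start_with?(OLE_CFB_MAGIC)

  :unknown
end

# Demonstration on a throwaway file carrying only a ZIP signature.
Tempfile.create("sniff-demo") do |file|
  file.binmode
  file.write("PK\x03\x04".b + "payload".b)
  file.flush
  p sniff_container(file.path)
end
```

A signature match only says the file starts like a ZIP archive; whether it is a well-formed `.xlsx` package is still decided by the later workbook parsing.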
@@ -50,6 +50,7 @@ module Rbxl
 
  # @param zip [Zip::File] open archive shared with the workbook
  # @param entry_path [String] ZIP entry path for this sheet's XML
+ # @param workbook_path [String] filesystem path the workbook was opened from
  # @param shared_strings [Array<String>] pre-decoded shared strings table
  # @param name [String] visible sheet name
  # @param streaming [Boolean] when the native extension is loaded, feed
@@ -59,13 +60,17 @@ module Rbxl
  # id's numFmt is a date/time format. When provided, numeric cells with
  # a matching style are returned as +Date+ or +Time+ instead of +Float+,
  # and the native fast path is bypassed.
- def initialize(zip:, entry_path:, shared_strings:, name:, streaming: false, date_styles: nil)
+ # @param date_1904 [Boolean] whether the workbook uses Excel's 1904 date
+ # system instead of the default 1900 date system
+ def initialize(zip:, entry_path:, workbook_path:, shared_strings:, name:, streaming: false, date_styles: nil, date_1904: false)
  @zip = zip
  @entry_path = entry_path
+ @workbook_path = workbook_path
  @shared_strings = shared_strings
  @name = name
  @streaming = streaming
  @date_styles = date_styles
+ @date_1904 = date_1904
  @disable_native = !date_styles.nil?
  @dimensions = extract_dimensions
  @merge_ranges_by_row = nil
@@ -171,6 +176,7 @@ module Rbxl
 
  cell_type = nil
  cell_style = nil
+ cell_ref = nil
  collecting_value = false
  in_v = false
  raw_value = nil
@@ -178,6 +184,7 @@ module Rbxl
  current_values = nil
  row_depth = nil
  track_style = !@date_styles.nil?
+ wrap_cell_errors = track_style
 
  with_sheet_reader do |reader|
  reader.each do |node|
@@ -192,13 +199,20 @@ module Rbxl
  current_values = nil
  end
  when "c"
+ cell_ref = node.attribute("r")
  cell_type = node.attribute("t")
  cell_style = track_style ? node.attribute("s")&.to_i : nil
  raw_value = nil
  if current_values && node.self_closing?
- current_values << coerce_value(raw_value, cell_type, cell_style)
+ value = if wrap_cell_errors
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
+ else
+ coerce_value(raw_value, cell_type, cell_style)
+ end
+ current_values << value
  cell_type = nil
  cell_style = nil
+ cell_ref = nil
  end
  when "v"
  collecting_value = true
@@ -224,9 +238,15 @@ module Rbxl
  yield current_values.freeze
  current_values = nil
  elsif current_values && node.depth == row_depth + 1
- current_values << coerce_value(raw_value, cell_type, cell_style)
+ value = if wrap_cell_errors
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
+ else
+ coerce_value(raw_value, cell_type, cell_style)
+ end
+ current_values << value
  cell_type = nil
  cell_style = nil
+ cell_ref = nil
  raw_value = nil
  end
  end
@@ -258,6 +278,7 @@ module Rbxl
  value_buffer = +""
  row_depth = nil
  track_style = !@date_styles.nil?
+ wrap_cell_errors = track_style
 
  with_sheet_reader do |reader|
  reader.each do |node|
@@ -289,7 +310,12 @@ module Rbxl
  cell_style = track_style ? node.attribute("s")&.to_i : nil
  raw_value = nil
  if current_cells && node.self_closing?
- current_cells << build_row_entry(cell_ref, coerce_value(raw_value, cell_type, cell_style), values_only)
+ value = if wrap_cell_errors
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
+ else
+ coerce_value(raw_value, cell_type, cell_style)
+ end
+ current_cells << build_row_entry(cell_ref, value, values_only)
  cell_ref = nil
  cell_type = nil
  cell_style = nil
@@ -322,7 +348,12 @@ module Rbxl
  current_row_index = nil
  current_cells = nil
  elsif current_cells && node.depth == row_depth + 1
- current_cells << build_row_entry(cell_ref, coerce_value(raw_value, cell_type, cell_style), values_only)
+ value = if wrap_cell_errors
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
+ else
+ coerce_value(raw_value, cell_type, cell_style)
+ end
+ current_cells << build_row_entry(cell_ref, value, values_only)
  cell_ref = nil
  cell_type = nil
  cell_style = nil
@@ -340,9 +371,14 @@ module Rbxl
  end
 
  def with_sheet_reader
- io = @zip.get_entry(@entry_path).get_input_stream
+ entry = @zip.get_entry(@entry_path)
+ raise WorksheetFormatError, "worksheet #{@name.inspect} is missing XML entry #{@entry_path.inspect} in #{@workbook_path}" unless entry
+
+ io = entry.get_input_stream
  reader = Nokogiri::XML::Reader(io)
  yield reader
+ rescue Nokogiri::XML::SyntaxError => e
+ raise WorksheetFormatError, "invalid worksheet XML for sheet #{@name.inspect} in #{@workbook_path}: #{e.message}"
  ensure
  io&.close
  end
@@ -352,7 +388,10 @@ module Rbxl
  max_bytes = Rbxl.max_worksheet_bytes
  Rbxl::Native.public_send(method_name, io, @shared_strings, max_bytes, &block)
  rescue RuntimeError => e
- raise WorksheetTooLargeError, e.message if e.message&.include?("worksheet bytes exceed limit")
+ if e.message&.include?("worksheet bytes exceed limit")
+ raise WorksheetTooLargeError,
+ "worksheet #{@name.inspect} in #{@workbook_path}: #{e.message}"
+ end
 
  raise
  ensure
@@ -586,6 +625,13 @@ module Rbxl
  end
  end
 
+ def coerce_cell_value(raw_value, type, style_id, coordinate)
+ coerce_value(raw_value, type, style_id)
+ rescue StandardError => e
+ raise CellValueError,
+ "failed to decode cell #{coordinate || '(unknown coordinate)'} on sheet #{@name.inspect} in #{@workbook_path}: #{e.message}"
+ end
+
  # Excel's serial date counts days from 1899-12-31 as serial 1, with a
  # documented leap-year bug for the non-existent 1900-02-29 (serial 60)
  # — for serials >= 60 the day-count is shifted back by one so that
@@ -594,9 +640,15 @@ module Rbxl
  # +Time+ so that both date and time-of-day survive the conversion.
  def excel_serial_to_ruby(serial)
  whole = serial.to_i
- whole -= 1 if whole >= 60
  frac = serial - serial.to_i
- base = Date.new(1899, 12, 31) + whole
+
+ base =
+ if @date_1904
+ Date.new(1904, 1, 1) + whole
+ else
+ whole -= 1 if whole >= 60
+ Date.new(1899, 12, 31) + whole
+ end
 
  return base if frac.zero?
 
data/lib/rbxl/version.rb CHANGED
@@ -1,4 +1,4 @@
  module Rbxl
  # Gem version string, tracked with semantic versioning.
- VERSION = "1.1.0"
+ VERSION = "1.3.0"
  end
@@ -96,7 +96,7 @@ module Rbxl
96
96
  private
97
97
 
98
98
  def ensure_writable!
99
- raise WorkbookAlreadySavedError, "write-only workbook can only be saved once" if @saved
99
+ raise WorkbookAlreadySavedError, "write-only workbook can only be saved once by design; call Rbxl.new to build another workbook" if @saved
100
100
  raise ClosedWorkbookError, "workbook has been closed" if closed?
101
101
  end
102
102
 
data/lib/rbxl.rb CHANGED
@@ -92,6 +92,14 @@ module Rbxl
92
92
  # intent explicitly at the call site. Passing +read_only: false+ raises
93
93
  # {NotImplementedError}; a read/write mode is not available.
94
94
  #
95
+ # When a block is given, the workbook is yielded and automatically
96
+ # closed when the block returns (or raises), mirroring the +File.open+
97
+ # and +Zip::File.open+ idiom:
98
+ #
99
+ # Rbxl.open("report.xlsx") do |book|
100
+ # book.sheet("Report").each_row(values_only: true) { |row| p row }
101
+ # end
102
+ #
95
103
  # With <tt>streaming: true</tt>, the native backend (when loaded) feeds
96
104
  # worksheet XML to the parser in chunks pulled from the ZIP input stream
97
105
  # instead of materializing the entire worksheet as one Ruby string. This
@@ -117,12 +125,15 @@ module Rbxl
117
125
  # the native extension is not loaded.
118
126
  # @param date_conversion [Boolean] convert numeric cells backed by a
119
127
  # date/time +numFmt+ to +Date+ / +Time+ / +DateTime+
120
- # @return [Rbxl::ReadOnlyWorkbook]
128
+ # @yieldparam book [Rbxl::ReadOnlyWorkbook] opened workbook; auto-closed
129
+ # when the block returns
130
+ # @return [Rbxl::ReadOnlyWorkbook, Object] the workbook when no block is
131
+ # given, otherwise the block's return value
121
132
  # @raise [NotImplementedError] if +read_only+ is not +true+
122
- def open(path, read_only: true, streaming: false, date_conversion: false)
133
+ def open(path, read_only: true, streaming: false, date_conversion: false, &block)
123
134
  raise NotImplementedError, "read/write mode is not supported; pass read_only: true" unless read_only
124
135
 
125
- ReadOnlyWorkbook.open(path, streaming: streaming, date_conversion: date_conversion)
136
+ ReadOnlyWorkbook.open(path, streaming: streaming, date_conversion: date_conversion, &block)
126
137
  end
127
138
 
128
139
  # Creates a new workbook in write-only mode.
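Review note: `Rbxl.open` now forwards its block to `ReadOnlyWorkbook.open`, whose body is not part of this hunk. A minimal sketch of how a block-form `open` is commonly written in Ruby, using a hypothetical `SketchBook` class rather than the gem's own implementation:

```ruby
# Hypothetical resource class; it only illustrates the yield-and-close idiom.
class SketchBook
  # Without a block, returns the open book and leaves closing to the caller.
  # With a block, yields the book, closes it even if the block raises,
  # and returns the block's value.
  def self.open(path)
    book = new(path)
    return book unless block_given?

    begin
      yield book
    ensure
      book.close
    end
  end

  attr_reader :path

  def initialize(path)
    @path = path
    @closed = false
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end
```

The `ensure` clause is what makes the "or raises" part of the documented behaviour hold: the book is closed on both the normal and the exceptional exit path.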
data/sig/rbxl.rbs CHANGED
@@ -10,6 +10,7 @@ module Rbxl
10
10
  type dimensions = { ref: String, max_col: Integer, max_row: Integer }
11
11
 
12
12
  def self.open: (pathish path, ?read_only: bool, ?streaming: bool, ?date_conversion: bool) -> ReadOnlyWorkbook
13
+ | [T] (pathish path, ?read_only: bool, ?streaming: bool, ?date_conversion: bool) { (ReadOnlyWorkbook) -> T } -> T
13
14
  def self.new: (?write_only: bool) -> WriteOnlyWorkbook
14
15
 
15
16
  attr_accessor self.max_shared_strings: Integer?
@@ -37,6 +38,9 @@ module Rbxl
37
38
  class WorksheetTooLargeError < Error
38
39
  end
39
40
 
41
+ class UnsupportedFormatError < Error
42
+ end
43
+
40
44
  class Cell
41
45
  attr_accessor value: cell_value
42
46
  attr_accessor coordinate: String?
@@ -84,8 +88,11 @@ module Rbxl
84
88
  attr_reader sheet_names: Array[String]
85
89
 
86
90
  def self.open: (pathish path, ?streaming: bool, ?date_conversion: bool) -> ReadOnlyWorkbook
91
+ | [T] (pathish path, ?streaming: bool, ?date_conversion: bool) { (ReadOnlyWorkbook) -> T } -> T
87
92
  def initialize: (pathish path, ?streaming: bool, ?date_conversion: bool) -> void
88
- def sheet: (String name) -> ReadOnlyWorksheet
93
+ def sheet: (String | Integer name_or_index) -> ReadOnlyWorksheet
94
+ def sheets: () { (ReadOnlyWorksheet) -> void } -> void
95
+ | () -> Enumerator[ReadOnlyWorksheet, void]
89
96
  def close: () -> void
90
97
  def closed?: () -> bool
91
98
  end
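Review note: the signatures above let `sheet` take a name or an index and let `sheets` work with or without a block. The sketch below shows one conventional way to satisfy both shapes. `SketchWorkbook` is hypothetical, sheets are represented by plain name strings, the zero-based index and the `KeyError` are assumptions, and none of this is taken from the gem's implementation.

```ruby
# Hypothetical workbook; only the method shapes follow the RBS signatures.
class SketchWorkbook
  def initialize(names)
    @names = names
  end

  # Looks a sheet up by name (String) or by position (Integer).
  # Zero-based indexing and KeyError are assumptions made for this sketch.
  def sheet(name_or_index)
    name =
      case name_or_index
      when Integer then @names[name_or_index]
      when String then @names.find { |n| n == name_or_index }
      end
    raise KeyError, "sheet not found: #{name_or_index.inspect}" unless name

    name
  end

  # Yields each sheet in order, or returns an Enumerator when no block is given.
  def sheets
    return enum_for(:sheets) unless block_given?

    @names.each { |n| yield sheet(n) }
    nil
  end
end
```

Returning `enum_for(:sheets)` when no block is passed is the standard Ruby idiom behind the second overload, and it lets callers chain `map`, `select`, or `first` on the result.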
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rbxl
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.0
4
+ version: 1.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Taro KOBAYASHI
@@ -43,8 +43,8 @@ dependencies:
43
43
  - - "<"
44
44
  - !ruby/object:Gem::Version
45
45
  version: '2.0'
46
- description: rbxl is a Ruby gem for read-only row-by-row iteration and write-only
47
- XLSX generation, with an optional native extension for faster XML parsing.
46
+ description: rbxl is a fast, low-memory Ruby gem for row-by-row XLSX reads and append-only
47
+ XLSX writes, with an optional native extension for higher-throughput XML parsing.
48
48
  email:
49
49
  - taro@matzlika.co.jp
50
50
  executables: []
@@ -96,6 +96,5 @@ required_rubygems_version: !ruby/object:Gem::Requirement
96
96
  requirements: []
97
97
  rubygems_version: 4.0.3
98
98
  specification_version: 4
99
- summary: A fast, memory-friendly Ruby gem for row-by-row XLSX reads and append-only
100
- writes.
99
+ summary: Fast, low-memory XLSX processing for Ruby.
101
100
  test_files: []