rbxl 1.0.2 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 76445404b974d2ddcd664b9f796fd693b7c5c36d1d56cf34fccc2b7f1fd1b51d
4
- data.tar.gz: e41c2dcccc060b7bb7e3a5608f2f57dfaa7f063daf3f82f1a4fa0bf6f85cb098
3
+ metadata.gz: b7d99201ddbfd10ac1f5173052e0ef0d0bfea0e7e0143bc5e214d28d5cbea335
4
+ data.tar.gz: 513ec07aea3c8888bafd1b60c20f6e508e6ce87a2380c5dbcb536523b09ceab3
5
5
  SHA512:
6
- metadata.gz: f41de8a1367b9033d5391ac8f46ff8b363ae79c6331bd4601bbf64fbbdf6e437c53052f38c7f130aa21833c2d60603853b1553507a6e4c7c291317da3c3f749f
7
- data.tar.gz: fe624cb616255d3437811354c073fe527f2e01bc562b93c3e19732e324aa4b227d50480b68a6e15d858c6bc276cd1da7c8456fd7ff94ea8447dea6a4b9cad70c
6
+ metadata.gz: 1dd2f6856dd7c9452d63f132e52f4336958a8bda63e304b353766ba573ed429b196c76dfa067468dcf0d85f5926de5b002f8f498b4907486fe717096ab20dbb2
7
+ data.tar.gz: 298fc80d0760d5468a7b2c95ae32751eb5c0e6070cd546d77d806ef7aabb057674f98a371d0c6c0320fd9784d89633e9fb278d03bdec4219426d89997a5540cb
data/CHANGELOG.md CHANGED
@@ -1,19 +1,88 @@
1
1
  # Changelog
2
2
 
3
- ## 1.0.2
3
+ All notable changes to this project are documented here. The format is based
4
+ on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project
5
+ follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
6
+
7
+ ## [Unreleased]
8
+
9
+ ## [1.2.0] - 2026-04-23
10
+
11
+ ### Changed
12
+
13
+ - `WorkbookAlreadySavedError` message now points at the save-once design and
14
+ the next action (open a fresh `Rbxl.new` for another file) so callers who
15
+ trip on the constraint don't have to read the source to understand why.
16
+ - Workbook- and worksheet-level parse failures raise `WorkbookFormatError` /
17
+ `WorksheetFormatError` with the workbook path and the XML entry or sheet
18
+ name in the message, replacing generic parser exceptions.
19
+
20
+ ### Added
21
+
22
+ - Location-aware coverage around malformed workbook and worksheet XML so bad
23
+ inputs surface the specific entry that failed rather than bubbling up an
24
+ unlabelled `Nokogiri::XML::SyntaxError`.
25
+ - README sections covering the write-only model (append-only, save-once,
26
+ no in-place edit), a "Reading recipes" walkthrough, and an explicit Out
27
+ of scope entry for read-modify-save workflows.
28
+
29
+ ### Fixed
30
+
31
+ - Honor Excel's `date1904` workbook setting when `date_conversion: true` is
32
+ enabled, so Mac-originated workbooks map serial dates to the correct Ruby
33
+ `Date` and `Time` values.
34
+
35
+ ## [1.1.0] - 2026-04-21
36
+
37
+ ### Added
38
+
39
+ - `date_conversion: true` option for `Rbxl.open`: numeric cells whose style
40
+ points at a date/time `numFmt` (built-in ids 14–22, 27–36, 45–47, 50–58,
41
+ or a custom format code containing date tokens) are returned as `Date`
42
+ or `Time`. Off by default — no change in output shape or throughput when
43
+ the flag is absent.
44
+
45
+ ### Changed
46
+
47
+ - `Rbxl.open` and `Rbxl.new` now default `read_only: true` and
48
+ `write_only: true` respectively, so the call site no longer needs the
49
+ boilerplate. Explicitly passing `false` raises `NotImplementedError`.
50
+
51
+ ### Fixed
52
+
53
+ - Ruby reader path now iterates self-closing `<row/>` and `<c/>` elements
54
+ instead of silently dropping them, and never yields `nil` for a row.
55
+
56
+ ## [1.0.2] - 2026-04-17
57
+
58
+ ### Added
59
+
60
+ - `streaming: true` option for `Rbxl.open` feeds worksheet XML to the
61
+ native reader in 64 KiB chunks instead of buffering the full worksheet
62
+ first.
63
+ - `Rbxl.max_worksheet_bytes` configuration and `Rbxl::WorksheetTooLargeError`
64
+ so streaming reads can stop oversized worksheet XML entries mid-inflate.
65
+
66
+ ### Changed
4
67
 
5
- - Add `streaming: true` to `Rbxl.open` to feed worksheet XML to the native reader in 64 KiB chunks instead of buffering the full worksheet first.
6
- - Add `Rbxl.max_worksheet_bytes` and `Rbxl::WorksheetTooLargeError` so streaming reads can stop oversized worksheet XML entries mid-inflate.
7
68
  - Expand RDoc coverage across the public API.
8
69
  - Tighten RBS signatures to match the actual runtime types.
9
- - Reword public docs and gem metadata to describe reads as row-by-row and writes as append-only, reserving "streaming" for the new opt-in native read path.
70
+ - Reword public docs and gem metadata to describe reads as row-by-row and
71
+ writes as append-only, reserving "streaming" for the new opt-in native
72
+ read path.
73
+
74
+ ## [1.0.1] - 2026-04-16
75
+
76
+ ### Added
77
+
78
+ - Go and Rust benchmark comparisons.
10
79
 
11
- ## 1.0.1
80
+ ### Fixed
12
81
 
13
- - Fix ZIP64 handling.
14
- - Add Go and Rust benchmark comparisons.
15
- - Align `rbxl/native` with Nokogiri's libxml2 to avoid mixed-library warnings at runtime.
82
+ - ZIP64 handling.
83
+ - Align `rbxl/native` with Nokogiri's libxml2 to avoid mixed-library
84
+ warnings at runtime.
16
85
 
17
- ## 1.0.0
86
+ ## [1.0.0] - 2026-04-16
18
87
 
19
- - Initial 1.0 release.
88
+ - Initial public release.
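
The 1.1.0 and 1.2.0 entries above describe converting Excel serial numbers to `Date` / `Time` and honoring the `date1904` workbook setting. The following is a minimal standalone sketch of that arithmetic, not the gem's code: the method name `serial_to_ruby` and its `date_1904:` keyword are hypothetical. It assumes the 1900 system treats serial 1 as 1900-01-01 and counts a non-existent 1900-02-29 at serial 60, and that the 1904 system counts days from 1904-01-01.

```ruby
require "date"

# Hypothetical helper for illustration only; not part of rbxl's public API.
def serial_to_ruby(serial, date_1904: false)
  whole = serial.floor
  frac = serial - whole

  date =
    if date_1904
      # 1904 system: serial 0 is 1904-01-01.
      Date.new(1904, 1, 1) + whole
    else
      # 1900 system: serial 60 is the phantom 1900-02-29, so every serial
      # from 60 onward is shifted back by one day.
      whole -= 1 if whole >= 60
      Date.new(1899, 12, 31) + whole
    end

  # Whole-number serials stay a Date; fractional ones keep the time of day.
  return date if frac.zero?

  Time.new(date.year, date.month, date.day) + (frac * 86_400).round
end
```

Under these assumptions the same calendar day has serials 1462 apart in the two systems, which is why ignoring `date1904` shifts every date in a 1904-based workbook by roughly four years.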
data/README.md CHANGED
@@ -1,5 +1,7 @@
1
1
  # rbxl
2
2
 
3
+ [![Gem Version](https://badge.fury.io/rb/rbxl.svg?icon=si%3Arubygems)](https://badge.fury.io/rb/rbxl)
4
+
3
5
  Fast, memory-friendly Ruby gem for row-by-row `.xlsx` reads and append-only writes.
4
6
 
5
7
  `rbxl` is built for the two workbook workflows that scale cleanly:
@@ -10,26 +12,35 @@ Fast, memory-friendly Ruby gem for row-by-row `.xlsx` reads and append-only writ
10
12
  The API is intentionally small and `openpyxl`-inspired, with an optional
11
13
  native extension for faster XML parsing when you need more throughput.
12
14
 
13
- Current scope is intentionally small:
15
+ Supported:
14
16
 
15
- - `write_only` workbook generation
16
- - `read_only` row-by-row iteration
17
- - `close()` for read-only workbooks
18
- - minimal `openpyxl`-like API
17
+ - write-only workbook generation
18
+ - read-only row-by-row iteration
19
+ - opt-in date/time conversion driven by the workbook's `numFmt` styles
19
20
  - optional C extension (`rbxl/native`) for maximum performance
20
21
 
21
- Out of scope for this MVP:
22
+ Out of scope:
22
23
 
24
+ - in-place editing of an existing `.xlsx` file — rbxl opens workbooks
25
+ read-only and generates new workbooks write-only, with no read-modify-save
26
+ path. If you need to open a file, tweak a handful of cells, and write it
27
+ back preserving everything else, use a full-object-model library instead.
23
28
  - preserving arbitrary workbook structure on save
24
29
  - rich style round-tripping
25
30
  - formulas, images, charts, comments
26
31
 
27
32
  ## Usage
28
33
 
34
+ `Rbxl.open` defaults to read-only and `Rbxl.new` defaults to write-only;
35
+ the `read_only:` / `write_only:` keywords remain for call-site clarity and
36
+ to leave room for a future read/write mode.
37
+
38
+ ### Writing a new workbook
39
+
29
40
  ```ruby
30
41
  require "rbxl"
31
42
 
32
- book = Rbxl.new(write_only: true)
43
+ book = Rbxl.new
33
44
  sheet = book.add_sheet("Report")
34
45
  sheet.append(["id", "name", "score"])
35
46
  sheet.append([1, "alice", 100])
@@ -37,10 +48,27 @@ sheet.append([2, "bob", 95.5])
37
48
  book.save("report.xlsx")
38
49
  ```
39
50
 
51
+ Write-only workbooks follow three rules:
52
+
53
+ - **Append-only within a sheet.** `sheet.append(row)` is the only way to
54
+ add data. There is no random-access cell write, no mid-stream edit of a
55
+ previously appended row.
56
+ - **Save-once per workbook.** `save` flushes the full `.xlsx` package in a
57
+ single pass and then closes the workbook. Calling `save` or `add_sheet`
58
+ again raises `Rbxl::WorkbookAlreadySavedError`. To produce another file,
59
+ start a new `Rbxl.new`.
60
+ - **No read-modify-save.** rbxl cannot open an existing `.xlsx` and write
61
+ back to it (see Out of scope above).
62
+
63
+ This is the tradeoff that keeps memory flat: rbxl buffers rows per sheet
64
+ and never materializes a full workbook object graph.
65
+
66
+ ### Reading a workbook
67
+
40
68
  ```ruby
41
69
  require "rbxl"
42
70
 
43
- book = Rbxl.open("report.xlsx", read_only: true)
71
+ book = Rbxl.open("report.xlsx")
44
72
  sheet = book.sheet("Report")
45
73
 
46
74
  sheet.each_row do |row|
@@ -52,8 +80,136 @@ p sheet.calculate_dimension
52
80
  book.close
53
81
  ```
54
82
 
55
- `write_only` workbooks are save-once by design. This matches the optimized
56
- mode tradeoff: low flexibility in exchange for simpler memory behavior.
83
+ ### Reading recipes
84
+
85
+ **Plain value arrays (fastest path).** Use `values_only: true` when you
86
+ only care about the cell values, not their coordinates. Rows come back as
87
+ frozen `Array<Object>`:
88
+
89
+ ```ruby
90
+ book.sheet("Data").each_row(values_only: true) do |values|
91
+ id, name, score = values
92
+ # ...
93
+ end
94
+ ```
95
+
96
+ **Cell objects with coordinates.** Default `each_row` yields a
97
+ `Rbxl::Row` wrapping `Rbxl::ReadOnlyCell`s. Use this when you need the
98
+ Excel coordinate alongside the value:
99
+
100
+ ```ruby
101
+ book.sheet("Data").each_row do |row|
102
+ row.index # => 2 (1-based worksheet row number)
103
+ row[0].coordinate # => "A2"
104
+ row[0].value # => "alice"
105
+ row.values # => ["alice", 100, true]
106
+ end
107
+ ```
108
+
109
+ **Skip the header row.** `each_row` without a block returns an
110
+ `Enumerator`, so chain `drop`:
111
+
112
+ ```ruby
113
+ book.sheet("Data").each_row(values_only: true).drop(1).each do |row|
114
+ # ...
115
+ end
116
+ ```
117
+
118
+ **Peek at the first N rows.** `rows(...)` is an enumerator-returning
119
+ alias that composes well with `take`, `first`, `lazy`, etc.:
120
+
121
+ ```ruby
122
+ book.sheet("Data").rows(values_only: true).first(5)
123
+ ```
124
+
125
+ **Know the data range up-front.** When the workbook has a stored
126
+ dimension, these are O(1) lookups; otherwise pass `force: true` to scan:
127
+
128
+ ```ruby
129
+ sheet = book.sheet("Data")
130
+ sheet.max_row # => 500
131
+ sheet.max_column # => 12
132
+ sheet.calculate_dimension # => "A1:L500"
133
+ ```
134
+
135
+ **Pad sparse rows to the sheet width.** Without `pad_cells`, a row
136
+ containing only `A1` and `C1` yields two cells. With `pad_cells: true`,
137
+ missing cells are filled with `Rbxl::EmptyCell` (or `nil` in values-only
138
+ mode), aligned to `max_column`:
139
+
140
+ ```ruby
141
+ book.sheet("Sparse").each_row(pad_cells: true, values_only: true).first
142
+ # => ["left", nil, "right"]
143
+ ```
144
+
145
+ **Expand merged cells.** Excel leaves the anchor cell populated and the
146
+ rest of the merge range empty. Pass `expand_merged: true` to propagate
147
+ the anchor value across the full range; combine with `pad_cells: true`
148
+ when you want the result aligned to the sheet's width:
149
+
150
+ ```ruby
151
+ sheet = book.sheet("Merged")
152
+
153
+ sheet.rows(values_only: true).to_a
154
+ # => [["group", "solo"], ["tail"]]
155
+
156
+ sheet.rows(values_only: true, pad_cells: true, expand_merged: true).to_a
157
+ # => [["group", "group", "solo", nil],
158
+ # ["group", "group", "solo", "tail"]]
159
+ ```
160
+
161
+ **List sheets before opening any.** Sheet XML is only read on first
162
+ iteration; enumerating names is cheap:
163
+
164
+ ```ruby
165
+ book.sheet_names # => ["Summary", "Detail", "Raw"]
166
+ book.sheet("Detail").each_row(values_only: true) { |row| ... }
167
+ ```
168
+
169
+ **Locate a bad input.** All rbxl exceptions inherit from `Rbxl::Error`
170
+ and the messages carry the workbook path and (where relevant) the sheet
171
+ name, XML entry, or cell coordinate. Rescue at the sheet level:
172
+
173
+ ```ruby
174
+ begin
175
+ book.sheet("Raw").each_row(values_only: true) { |row| ... }
176
+ rescue Rbxl::CellValueError => e
177
+ warn e.message # includes workbook path, sheet, and coordinate
178
+ rescue Rbxl::WorksheetFormatError, Rbxl::WorkbookFormatError => e
179
+ warn e.message # includes workbook path and sheet/entry
180
+ end
181
+ ```
182
+
183
+ `Rbxl::CellValueError` is raised by the cell decoder when
184
+ `date_conversion: true` is active. The reader is forward-only, so rescue
185
+ terminates iteration rather than skipping to the next row.
186
+
187
+ ### Date / time conversion
188
+
189
+ Numeric cells in `.xlsx` files are serial days since 1899-12-31; whether
190
+ they display as `44562`, `2022-01-01`, or `12:00` depends on the cell's
191
+ `numFmt` style. `rbxl` leaves cells as raw `Float` by default so the read
192
+ path stays allocation-light. Pass `date_conversion: true` to opt into
193
+ interpreting the style:
194
+
195
+ ```ruby
196
+ require "rbxl"
197
+
198
+ book = Rbxl.open("schedule.xlsx", date_conversion: true)
199
+ book.sheet("Timeline").each_row(values_only: true) do |row|
200
+ row.each { |v| p v } # => Date / Time / Float / String / ...
201
+ end
202
+ book.close
203
+ ```
204
+
205
+ With the flag on, `rbxl` parses `xl/styles.xml` once at first use and
206
+ converts numeric cells whose style maps to a built-in date `numFmtId`
207
+ (14–22, 27–36, 45–47, 50–58) or to a custom `formatCode` containing date
208
+ tokens. Whole-number serials return `Date`; fractional serials return
209
+ `Time` so the time-of-day portion is preserved. The flag is off by
210
+ default; leaving it off skips the styles parse entirely and keeps the
211
+ native fast path in use. Turning it on routes reads through the pure-Ruby
212
+ worksheet parser.
57
213
 
58
214
  ## Native C Extension
59
215
 
data/lib/rbxl/errors.rb CHANGED
@@ -33,4 +33,17 @@ module Rbxl
33
33
  # bytes consumed from the ZIP entry, so high-compression zip-bomb style
34
34
  # worksheets are stopped mid-inflate rather than after the fact.
35
35
  class WorksheetTooLargeError < Error; end
36
+
37
+ # Raised when workbook-level XML is malformed or internally inconsistent,
38
+ # for example when +xl/workbook.xml+ cannot be parsed or references a
39
+ # missing relationship target.
40
+ class WorkbookFormatError < Error; end
41
+
42
+ # Raised when a worksheet XML entry cannot be parsed into rows.
43
+ class WorksheetFormatError < Error; end
44
+
45
+ # Raised when a specific cell cannot be decoded. The message includes the
46
+ # workbook path, sheet name, and cell coordinate to make bad inputs easy
47
+ # to locate.
48
+ class CellValueError < WorksheetFormatError; end
36
49
  end
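
Because `CellValueError` inherits from `WorksheetFormatError`, a `rescue` clause for the parent class also catches the child, and Ruby tries rescue clauses from top to bottom. A clause for `CellValueError` therefore has to come before one for `WorksheetFormatError` to be reachable. The sketch below uses stand-in classes that mirror the hierarchy in this hunk so it runs without the gem; the `Demo` module and `classify` method are hypothetical names.

```ruby
module Demo
  # Stand-ins mirroring the hierarchy shown in errors.rb.
  class Error < StandardError; end
  class WorkbookFormatError < Error; end
  class WorksheetFormatError < Error; end
  class CellValueError < WorksheetFormatError; end

  # Raises the given class and reports which rescue clause handled it.
  def self.classify(error_class)
    raise error_class, "boom"
  rescue CellValueError
    # Most specific class first, otherwise the next clause would win.
    :cell
  rescue WorksheetFormatError, WorkbookFormatError
    :format
  rescue Error
    :other
  end
end
```

Callers that do not need the coordinate-specific branch can rescue `WorksheetFormatError` alone and still receive cell-level failures.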
@@ -36,14 +36,17 @@ module Rbxl
36
36
  # @return [Array<String>] visible sheet names in workbook order
37
37
  attr_reader :sheet_names
38
38
 
39
- # Convenience constructor equivalent to <tt>new(path, streaming:)</tt>.
39
+ # Convenience constructor equivalent to
40
+ # <tt>new(path, streaming:, date_conversion:)</tt>.
40
41
  #
41
42
  # @param path [String, #to_path] path to the <tt>.xlsx</tt> file
42
43
  # @param streaming [Boolean] feed worksheet XML to the native parser in
43
44
  # chunks (see {Rbxl.open})
45
+ # @param date_conversion [Boolean] convert numeric cells backed by a
46
+ # date/time +numFmt+ to Ruby date/time objects (see {Rbxl.open})
44
47
  # @return [Rbxl::ReadOnlyWorkbook]
45
- def self.open(path, streaming: false)
46
- new(path, streaming: streaming)
48
+ def self.open(path, streaming: false, date_conversion: false)
49
+ new(path, streaming: streaming, date_conversion: date_conversion)
47
50
  end
48
51
 
49
52
  # Opens the ZIP archive, pre-loads shared strings, and indexes the
@@ -51,13 +54,18 @@ module Rbxl
51
54
  #
52
55
  # @param path [String, #to_path] path to the <tt>.xlsx</tt> file
53
56
  # @param streaming [Boolean] forwarded to produced worksheets
54
- def initialize(path, streaming: false)
57
+ # @param date_conversion [Boolean] lazily load styles.xml and forward the
58
+ # date-style lookup table to produced worksheets
59
+ def initialize(path, streaming: false, date_conversion: false)
55
60
  @path = path
56
61
  @zip = Zip::File.open(path)
57
62
  @streaming = streaming
63
+ @date_conversion = date_conversion
58
64
  @shared_strings = load_shared_strings
59
65
  @sheet_entries = load_sheet_entries
60
66
  @sheet_names = @sheet_entries.keys.freeze
67
+ @date_styles = nil
68
+ @date_1904 = nil
61
69
  @closed = false
62
70
  end
63
71
 
@@ -77,7 +85,16 @@ module Rbxl
77
85
  raise SheetNotFoundError, "sheet not found: #{name}"
78
86
  end
79
87
 
80
- ReadOnlyWorksheet.new(zip: @zip, entry_path: entry_path, shared_strings: @shared_strings, name: name, streaming: @streaming)
88
+ ReadOnlyWorksheet.new(
89
+ zip: @zip,
90
+ entry_path: entry_path,
91
+ workbook_path: @path,
92
+ shared_strings: @shared_strings,
93
+ name: name,
94
+ streaming: @streaming,
95
+ date_styles: date_styles,
96
+ date_1904: date_1904?
97
+ )
81
98
  end
82
99
 
83
100
  # Releases the underlying ZIP file handle. Idempotent; subsequent calls
@@ -102,6 +119,72 @@ module Rbxl
102
119
  raise ClosedWorkbookError, "workbook has been closed" if closed?
103
120
  end
104
121
 
122
+ # Built-in numFmtId values that Excel resolves to date/time formats.
123
+ # Ids outside this set are dates only when the workbook provides a
124
+ # matching custom +<numFmt>+ entry whose format code contains date
125
+ # tokens. See ECMA-376 part 1 §18.8.30.
126
+ BUILTIN_DATE_FMT_IDS = Set.new([14, 15, 16, 17, 18, 19, 20, 21, 22,
127
+ 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
128
+ 45, 46, 47, 50, 51, 52, 53, 54, 55, 56,
129
+ 57, 58]).freeze
130
+
131
+ def date_styles
132
+ return nil unless @date_conversion
133
+
134
+ @date_styles ||= load_date_styles
135
+ end
136
+
137
+ def date_1904?
138
+ return false unless @date_conversion
139
+
140
+ @date_1904 = load_date_1904 if @date_1904.nil?
141
+ @date_1904
142
+ end
143
+
144
+ def load_date_styles
145
+ entry = @zip.find_entry("xl/styles.xml")
146
+ return [].freeze unless entry
147
+
148
+ custom_date_ids = Set.new
149
+ date_styles = []
150
+ in_cell_xfs = false
151
+
152
+ each_xml_node("xl/styles.xml") do |node|
153
+ case node.node_type
154
+ when Nokogiri::XML::Reader::TYPE_ELEMENT
155
+ case node.local_name
156
+ when "cellXfs"
157
+ in_cell_xfs = true
158
+ when "numFmt"
159
+ id = node.attribute("numFmtId")
160
+ code = node.attribute("formatCode")
161
+ custom_date_ids << id.to_i if id && code && date_format_code?(code)
162
+ when "xf"
163
+ next unless in_cell_xfs
164
+
165
+ fmt_id_int = node.attribute("numFmtId")&.to_i
166
+ date_styles << (!fmt_id_int.nil? &&
167
+ (BUILTIN_DATE_FMT_IDS.include?(fmt_id_int) || custom_date_ids.include?(fmt_id_int)))
168
+ end
169
+ when Nokogiri::XML::Reader::TYPE_END_ELEMENT
170
+ in_cell_xfs = false if node.local_name == "cellXfs"
171
+ end
172
+ end
173
+
174
+ date_styles.freeze
175
+ end
176
+
177
+ # Quoted literals, bracketed directives (e.g. [Red], [$-409]), and
178
+ # backslash-escaped characters never introduce date tokens, so strip
179
+ # them before looking for +y/m/d/h/s+.
180
+ def date_format_code?(code)
181
+ stripped = code.dup
182
+ stripped.gsub!(/\[[^\]]*\]/, "")
183
+ stripped.gsub!(/"[^"]*"/, "")
184
+ stripped.gsub!(/\\./, "")
185
+ stripped.match?(/[ymdhs]/i)
186
+ end
187
+
105
188
  def load_shared_strings
106
189
  entry = @zip.find_entry("xl/sharedStrings.xml")
107
190
  return [] unless entry
@@ -195,13 +278,27 @@ module Rbxl
195
278
  rid = node.attribute("r:id")
196
279
  next unless name && rid
197
280
 
198
- target = relationships.fetch(rid)
281
+ target = relationships.fetch(rid) do
282
+ raise WorkbookFormatError,
283
+ "workbook #{@path} references missing relationship #{rid.inspect} for sheet #{name.inspect}"
284
+ end
199
285
  sheets[name] = "xl/#{target}".gsub(%r{/+}, "/")
200
286
  end
201
287
 
202
288
  sheets
203
289
  end
204
290
 
291
+ def load_date_1904
292
+ each_xml_node("xl/workbook.xml") do |node|
293
+ next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
294
+ next unless node.local_name == "workbookPr"
295
+
296
+ return xml_truthy?(node.attribute("date1904"))
297
+ end
298
+
299
+ false
300
+ end
301
+
205
302
  def load_relationship_targets(entry_path)
206
303
  relationships = {}
207
304
 
@@ -220,11 +317,20 @@ module Rbxl
220
317
  end
221
318
 
222
319
  def each_xml_node(entry_path)
223
- io = @zip.get_entry(entry_path).get_input_stream
320
+ entry = @zip.find_entry(entry_path)
321
+ raise WorkbookFormatError, "workbook #{@path} is missing required entry #{entry_path.inspect}" unless entry
322
+
323
+ io = entry.get_input_stream
224
324
  reader = Nokogiri::XML::Reader(io)
225
325
  reader.each { |node| yield node }
326
+ rescue Nokogiri::XML::SyntaxError => e
327
+ raise WorkbookFormatError, "invalid workbook XML in #{@path} at #{entry_path}: #{e.message}"
226
328
  ensure
227
329
  io&.close
228
330
  end
331
+
332
+ def xml_truthy?(value)
333
+ value == "1" || value == "true"
334
+ end
229
335
  end
230
336
  end
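
The date-style lookup added in this file combines two checks: a fixed set of built-in `numFmtId` values, and a token scan over custom format codes after stripping bracketed directives, quoted literals, and backslash escapes. A minimal standalone sketch of that idea follows. The id set and the regular expressions are copied from the hunk above; the helper `date_style_flags` and its arguments are hypothetical and stand in for the XML walk over `xl/styles.xml`.

```ruby
require "set"

# Built-in date/time numFmtId values, as listed in the diff above.
BUILTIN_DATE_FMT_IDS = Set.new([*14..22, *27..36, *45..47, *50..58]).freeze

# True when a custom format code still contains a date/time token after
# bracketed directives, quoted literals, and escaped characters are removed.
def date_format_code?(code)
  stripped = code
    .gsub(/\[[^\]]*\]/, "")
    .gsub(/"[^"]*"/, "")
    .gsub(/\\./, "")
  stripped.match?(/[ymdhs]/i)
end

# Hypothetical helper: given the numFmtId of each <xf> in cellXfs (in order)
# and a hash of custom numFmtId => formatCode, return one boolean per style.
def date_style_flags(xf_num_fmt_ids, custom_formats)
  custom_date_ids = custom_formats.select { |_id, code| date_format_code?(code) }.keys.to_set
  xf_num_fmt_ids.map do |id|
    BUILTIN_DATE_FMT_IDS.include?(id) || custom_date_ids.include?(id)
  end
end
```

The resulting array is indexed by a cell's `s` attribute, so a reader can decide per cell whether a numeric value should be treated as a date. The token scan is a heuristic: an elapsed-time code such as `[h]:mm:ss` still matches because `mm` and `ss` remain after the bracketed part is stripped.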
@@ -50,17 +50,28 @@ module Rbxl
50
50
 
51
51
  # @param zip [Zip::File] open archive shared with the workbook
52
52
  # @param entry_path [String] ZIP entry path for this sheet's XML
53
+ # @param workbook_path [String] filesystem path the workbook was opened from
53
54
  # @param shared_strings [Array<String>] pre-decoded shared strings table
54
55
  # @param name [String] visible sheet name
55
56
  # @param streaming [Boolean] when the native extension is loaded, feed
56
57
  # worksheet XML to the parser in chunks instead of reading the entry
57
58
  # into memory first
58
- def initialize(zip:, entry_path:, shared_strings:, name:, streaming: false)
59
+ # @param date_styles [Array<Boolean>, nil] +true+ at a style id when the
60
+ # id's numFmt is a date/time format. When provided, numeric cells with
61
+ # a matching style are returned as +Date+ or +Time+ instead of +Float+,
62
+ # and the native fast path is bypassed.
63
+ # @param date_1904 [Boolean] whether the workbook uses Excel's 1904 date
64
+ # system instead of the default 1900 date system
65
+ def initialize(zip:, entry_path:, workbook_path:, shared_strings:, name:, streaming: false, date_styles: nil, date_1904: false)
59
66
  @zip = zip
60
67
  @entry_path = entry_path
68
+ @workbook_path = workbook_path
61
69
  @shared_strings = shared_strings
62
70
  @name = name
63
71
  @streaming = streaming
72
+ @date_styles = date_styles
73
+ @date_1904 = date_1904
74
+ @disable_native = !date_styles.nil?
64
75
  @dimensions = extract_dimensions
65
76
  @merge_ranges_by_row = nil
66
77
  @merge_anchor_values = {}
@@ -164,12 +175,16 @@ module Rbxl
164
175
  end
165
176
 
166
177
  cell_type = nil
178
+ cell_style = nil
179
+ cell_ref = nil
167
180
  collecting_value = false
168
181
  in_v = false
169
182
  raw_value = nil
170
183
  value_buffer = +""
171
184
  current_values = nil
172
185
  row_depth = nil
186
+ track_style = !@date_styles.nil?
187
+ wrap_cell_errors = track_style
173
188
 
174
189
  with_sheet_reader do |reader|
175
190
  reader.each do |node|
@@ -179,9 +194,26 @@ module Rbxl
179
194
  when "row"
180
195
  current_values = []
181
196
  row_depth = node.depth
197
+ if node.self_closing?
198
+ yield current_values.freeze
199
+ current_values = nil
200
+ end
182
201
  when "c"
202
+ cell_ref = node.attribute("r")
183
203
  cell_type = node.attribute("t")
204
+ cell_style = track_style ? node.attribute("s")&.to_i : nil
184
205
  raw_value = nil
206
+ if current_values && node.self_closing?
207
+ value = if wrap_cell_errors
208
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
209
+ else
210
+ coerce_value(raw_value, cell_type, cell_style)
211
+ end
212
+ current_values << value
213
+ cell_type = nil
214
+ cell_style = nil
215
+ cell_ref = nil
216
+ end
185
217
  when "v"
186
218
  collecting_value = true
187
219
  in_v = true
@@ -202,12 +234,19 @@ module Rbxl
202
234
  raw_value = raw_value ? raw_value << value_buffer : value_buffer.dup
203
235
  collecting_value = false
204
236
  end
205
- elsif node.depth == row_depth
237
+ elsif current_values && node.depth == row_depth
206
238
  yield current_values.freeze
207
239
  current_values = nil
208
240
  elsif current_values && node.depth == row_depth + 1
209
- current_values << coerce_value(raw_value, cell_type)
241
+ value = if wrap_cell_errors
242
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
243
+ else
244
+ coerce_value(raw_value, cell_type, cell_style)
245
+ end
246
+ current_values << value
210
247
  cell_type = nil
248
+ cell_style = nil
249
+ cell_ref = nil
211
250
  raw_value = nil
212
251
  end
213
252
  end
@@ -231,12 +270,15 @@ module Rbxl
231
270
  current_cells = nil
232
271
  cell_ref = nil
233
272
  cell_type = nil
273
+ cell_style = nil
234
274
  current_col_index = 0
235
275
  collecting_value = false
236
276
  in_v = false
237
277
  raw_value = nil
238
278
  value_buffer = +""
239
279
  row_depth = nil
280
+ track_style = !@date_styles.nil?
281
+ wrap_cell_errors = track_style
240
282
 
241
283
  with_sheet_reader do |reader|
242
284
  reader.each do |node|
@@ -248,6 +290,14 @@ module Rbxl
248
290
  current_col_index = 0
249
291
  current_cells = []
250
292
  row_depth = node.depth
293
+ if node.self_closing?
294
+ emit_row(current_cells, current_row_index,
295
+ pad_cells: pad_cells, expand_merged: expand_merged,
296
+ values_only: values_only, &block)
297
+ last_row_index = current_row_index
298
+ current_row_index = nil
299
+ current_cells = nil
300
+ end
251
301
  when "c"
252
302
  cell_ref = node.attribute("r")
253
303
  if cell_ref
@@ -257,7 +307,19 @@ module Rbxl
257
307
  cell_ref = "#{column_name(current_col_index)}#{current_row_index}"
258
308
  end
259
309
  cell_type = node.attribute("t")
310
+ cell_style = track_style ? node.attribute("s")&.to_i : nil
260
311
  raw_value = nil
312
+ if current_cells && node.self_closing?
313
+ value = if wrap_cell_errors
314
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
315
+ else
316
+ coerce_value(raw_value, cell_type, cell_style)
317
+ end
318
+ current_cells << build_row_entry(cell_ref, value, values_only)
319
+ cell_ref = nil
320
+ cell_type = nil
321
+ cell_style = nil
322
+ end
261
323
  when "v"
262
324
  collecting_value = true
263
325
  in_v = true
@@ -278,17 +340,23 @@ module Rbxl
278
340
  raw_value = raw_value ? raw_value << value_buffer : value_buffer.dup
279
341
  collecting_value = false
280
342
  end
281
- elsif node.depth == row_depth
282
- current_cells = pad_row(current_cells, current_row_index, values_only: values_only) if pad_cells
283
- current_cells = expand_merged_cells(current_cells, current_row_index, values_only: values_only) if expand_merged
284
- yield values_only ? extract_values(current_cells).freeze : Row.new(index: current_row_index, cells: current_cells)
343
+ elsif current_cells && node.depth == row_depth
344
+ emit_row(current_cells, current_row_index,
345
+ pad_cells: pad_cells, expand_merged: expand_merged,
346
+ values_only: values_only, &block)
285
347
  last_row_index = current_row_index
286
348
  current_row_index = nil
287
349
  current_cells = nil
288
350
  elsif current_cells && node.depth == row_depth + 1
289
- current_cells << build_row_entry(cell_ref, coerce_value(raw_value, cell_type), values_only)
351
+ value = if wrap_cell_errors
352
+ coerce_cell_value(raw_value, cell_type, cell_style, cell_ref)
353
+ else
354
+ coerce_value(raw_value, cell_type, cell_style)
355
+ end
356
+ current_cells << build_row_entry(cell_ref, value, values_only)
290
357
  cell_ref = nil
291
358
  cell_type = nil
359
+ cell_style = nil
292
360
  raw_value = nil
293
361
  end
294
362
  end
@@ -296,10 +364,21 @@ module Rbxl
296
364
  end
297
365
  end
298
366
 
367
+ def emit_row(cells, row_index, pad_cells:, expand_merged:, values_only:)
368
+ cells = pad_row(cells, row_index, values_only: values_only) if pad_cells
369
+ cells = expand_merged_cells(cells, row_index, values_only: values_only) if expand_merged
370
+ yield values_only ? extract_values(cells).freeze : Row.new(index: row_index, cells: cells)
371
+ end
372
+
299
373
  def with_sheet_reader
300
- io = @zip.get_entry(@entry_path).get_input_stream
374
+ entry = @zip.find_entry(@entry_path)
375
+ raise WorksheetFormatError, "worksheet #{@name.inspect} is missing XML entry #{@entry_path.inspect} in #{@workbook_path}" unless entry
376
+
377
+ io = entry.get_input_stream
301
378
  reader = Nokogiri::XML::Reader(io)
302
379
  yield reader
380
+ rescue Nokogiri::XML::SyntaxError => e
381
+ raise WorksheetFormatError, "invalid worksheet XML for sheet #{@name.inspect} in #{@workbook_path}: #{e.message}"
303
382
  ensure
304
383
  io&.close
305
384
  end
@@ -309,7 +388,10 @@ module Rbxl
309
388
  max_bytes = Rbxl.max_worksheet_bytes
310
389
  Rbxl::Native.public_send(method_name, io, @shared_strings, max_bytes, &block)
311
390
  rescue RuntimeError => e
312
- raise WorksheetTooLargeError, e.message if e.message&.include?("worksheet bytes exceed limit")
391
+ if e.message&.include?("worksheet bytes exceed limit")
392
+ raise WorksheetTooLargeError,
393
+ "worksheet #{@name.inspect} in #{@workbook_path}: #{e.message}"
394
+ end
313
395
 
314
396
  raise
315
397
  ensure
@@ -527,7 +609,7 @@ module Rbxl
527
609
  @merge_ranges_by_row ||= extract_merge_ranges_by_row
528
610
  end
529
611
 
530
- def coerce_value(raw_value, type)
612
+ def coerce_value(raw_value, type, style_id = nil)
531
613
  case type
532
614
  when "s"
533
615
  @shared_strings[raw_value.to_i]
@@ -536,10 +618,44 @@ module Rbxl
536
618
  when "b"
537
619
  raw_value == "1"
538
620
  else
539
- infer_scalar(raw_value)
621
+ value = infer_scalar(raw_value)
622
+ return value unless @date_styles && style_id && value.is_a?(Numeric) && @date_styles[style_id]
623
+
624
+ excel_serial_to_ruby(value)
540
625
  end
541
626
  end
542
627
 
628
+ def coerce_cell_value(raw_value, type, style_id, coordinate)
629
+ coerce_value(raw_value, type, style_id)
630
+ rescue StandardError => e
631
+ raise CellValueError,
632
+ "failed to decode cell #{coordinate || '(unknown coordinate)'} on sheet #{@name.inspect} in #{@workbook_path}: #{e.message}"
633
+ end
634
+
635
+ # Excel's serial date treats 1900-01-01 as serial 1, with a
636
+ # documented leap-year bug for the non-existent 1900-02-29 (serial 60)
637
+ # — for serials >= 60 the day-count is shifted back by one so that
638
+ # post-1900 dates line up with the proleptic Gregorian calendar.
639
+ # Whole-number serials are returned as +Date+; fractional serials as
640
+ # +Time+ so that both date and time-of-day survive the conversion.
641
+ def excel_serial_to_ruby(serial)
642
+ whole = serial.to_i
643
+ frac = serial - serial.to_i
644
+
645
+ base =
646
+ if @date_1904
647
+ Date.new(1904, 1, 1) + whole
648
+ else
649
+ whole -= 1 if whole >= 60
650
+ Date.new(1899, 12, 31) + whole
651
+ end
652
+
653
+ return base if frac.zero?
654
+
655
+ seconds = (frac * 86_400).round
656
+ Time.new(base.year, base.month, base.day) + seconds
657
+ end
658
+
543
659
  def infer_scalar(raw_value)
544
660
  return nil if raw_value.nil? || raw_value.empty?
545
661
 
data/lib/rbxl/version.rb CHANGED
@@ -1,4 +1,4 @@
1
1
  module Rbxl
2
2
  # Gem version string, tracked with semantic versioning.
3
- VERSION = "1.0.2"
3
+ VERSION = "1.2.0"
4
4
  end
@@ -96,7 +96,7 @@ module Rbxl
96
96
  private
97
97
 
98
98
  def ensure_writable!
99
- raise WorkbookAlreadySavedError, "write-only workbook can only be saved once" if @saved
99
+ raise WorkbookAlreadySavedError, "write-only workbook can only be saved once by design; call Rbxl.new to build another workbook" if @saved
100
100
  raise ClosedWorkbookError, "workbook has been closed" if closed?
101
101
  end
102
102
 
data/lib/rbxl.rb CHANGED
@@ -1,6 +1,7 @@
1
1
  require "cgi"
2
2
  require "date"
3
3
  require "nokogiri"
4
+ require "set"
4
5
  require "stringio"
5
6
  require "zip"
6
7
 
@@ -32,16 +33,19 @@ require_relative "rbxl/write_only_worksheet"
32
33
  #
33
34
  # require "rbxl"
34
35
  #
35
- # book = Rbxl.open("report.xlsx", read_only: true)
36
+ # book = Rbxl.open("report.xlsx")
36
37
  # sheet = book.sheet("Report")
37
38
  # sheet.each_row(values_only: true) { |values| p values }
38
39
  # book.close
39
40
  #
41
+ # Pass <tt>date_conversion: true</tt> to return Date/Time objects for
42
+ # numeric cells that carry a date +numFmt+ style.
43
+ #
40
44
  # == Writing
41
45
  #
42
46
  # require "rbxl"
43
47
  #
44
- # book = Rbxl.new(write_only: true)
48
+ # book = Rbxl.new
45
49
  # sheet = book.add_sheet("Report")
46
50
  # sheet << ["id", "name", "score"]
47
51
  # sheet << [1, "alice", 100]
@@ -84,9 +88,9 @@ module Rbxl
84
88
 
85
89
  # Opens an existing workbook in read-only row-by-row mode.
86
90
  #
87
- # The +read_only+ keyword is required and must be +true+. It exists to
88
- # mark the intent explicitly and to leave room for a future read/write
89
- # mode without changing the default behavior of {.open}.
91
+ # The +read_only+ keyword defaults to +true+ and exists to mark the
92
+ # intent explicitly at the call site. Passing +read_only: false+ raises
93
+ # {NotImplementedError}; a read/write mode is not available.
90
94
  #
91
95
  # With <tt>streaming: true</tt>, the native backend (when loaded) feeds
92
96
  # worksheet XML to the parser in chunks pulled from the ZIP input stream
@@ -97,29 +101,41 @@ module Rbxl
97
101
  # differs — and typically pays back a few percent of throughput on small
98
102
  # sheets in exchange for the flat memory profile.
99
103
  #
104
+ # With <tt>date_conversion: true</tt>, numeric cells whose style points at
105
+ # a date/time +numFmt+ (built-in ids 14–22, 27–36, 45–47, 50–58, or any
106
+ # custom format code containing a date/time token) are returned as
107
+ # +Date+, +Time+, or +DateTime+ instead of a raw serial +Float+. The flag
108
+ # is off by default to preserve byte-for-byte behavior and skip the
109
+ # styles.xml parse for workbooks that don't need it; enabling it
110
+ # disables the native fast path and routes reads through the Ruby
111
+ # worksheet parser.
112
+ #
100
113
  # @param path [String, #to_path] filesystem path to an <tt>.xlsx</tt> file
101
- # @param read_only [Boolean] must be +true+ for the current API
114
+ # @param read_only [Boolean] retained for call-site clarity; must be +true+
102
115
  # @param streaming [Boolean] feed worksheet XML to the native parser in
103
116
  # chunks instead of fully inflating the entry in advance. Ignored when
104
117
  # the native extension is not loaded.
118
+ # @param date_conversion [Boolean] convert numeric cells backed by a
119
+ # date/time +numFmt+ to +Date+ / +Time+ / +DateTime+
105
120
  # @return [Rbxl::ReadOnlyWorkbook]
106
- # @raise [ArgumentError] if +read_only+ is not +true+
107
- def open(path, read_only: false, streaming: false)
108
- raise ArgumentError, "read_only: true is required for this MVP" unless read_only
121
+ # @raise [NotImplementedError] if +read_only+ is not +true+
122
+ def open(path, read_only: true, streaming: false, date_conversion: false)
123
+ raise NotImplementedError, "read/write mode is not supported; pass read_only: true" unless read_only
109
124
 
110
- ReadOnlyWorkbook.open(path, streaming: streaming)
125
+ ReadOnlyWorkbook.open(path, streaming: streaming, date_conversion: date_conversion)
111
126
  end
112
127
 
113
128
  # Creates a new workbook in write-only mode.
114
129
  #
115
- # The +write_only+ keyword is required and must be +true+ to make the
116
- # save-once, append-only contract obvious at the call site.
130
+ # The +write_only+ keyword defaults to +true+ and exists to mark the
131
+ # save-once, append-only contract explicitly. Passing
132
+ # +write_only: false+ raises {NotImplementedError}.
117
133
  #
118
- # @param write_only [Boolean] must be +true+ for the current API
134
+ # @param write_only [Boolean] retained for call-site clarity; must be +true+
119
135
  # @return [Rbxl::WriteOnlyWorkbook]
120
- # @raise [ArgumentError] if +write_only+ is not +true+
121
- def new(write_only: false)
122
- raise ArgumentError, "write_only: true is required for this MVP" unless write_only
136
+ # @raise [NotImplementedError] if +write_only+ is not +true+
137
+ def new(write_only: true)
138
+ raise NotImplementedError, "read/write mode is not supported; pass write_only: true" unless write_only
123
139
 
124
140
  WriteOnlyWorkbook.new
125
141
  end
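One consequence of switching the guard from `ArgumentError` to `NotImplementedError` is worth spelling out: in Ruby, `NotImplementedError` descends from `ScriptError`, not `StandardError`, so a bare `rescue` or `rescue StandardError` does not catch it. The sketch below mirrors the keyword guard with a hypothetical `SketchXl` module (not part of rbxl) that returns a plain hash instead of a workbook.

```ruby
# Hypothetical stand-in mirroring the keyword guard from the diff.
module SketchXl
  def self.open(path, read_only: true, streaming: false, date_conversion: false)
    raise NotImplementedError, "read/write mode is not supported; pass read_only: true" unless read_only

    { path: path, streaming: streaming, date_conversion: date_conversion }
  end
end

# The keyword can now be omitted because it defaults to true.
opts = SketchXl.open("report.xlsx")

# Passing false raises, and only the explicit clause catches it.
caught =
  begin
    SketchXl.open("report.xlsx", read_only: false)
  rescue StandardError
    :standard_error
  rescue NotImplementedError
    :not_implemented
  end
```

Callers that previously wrapped `Rbxl.open` or `Rbxl.new` in `rescue ArgumentError` for this case would presumably need to name `NotImplementedError` explicitly after upgrading.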
data/sig/rbxl.rbs CHANGED
@@ -1,7 +1,7 @@
1
1
  module Rbxl
2
2
  VERSION: String
3
3
 
4
- type cell_value = String | Integer | Float | bool | nil
4
+ type cell_value = String | Integer | Float | bool | Date | Time | nil
5
5
  type pathish = String | Pathname
6
6
  type row_input = Array[untyped] | Enumerator[untyped, untyped]
7
7
  type row_values = Array[cell_value]
@@ -9,7 +9,7 @@ module Rbxl
9
9
  type row_cells = Array[row_cell]
10
10
  type dimensions = { ref: String, max_col: Integer, max_row: Integer }
11
11
 
12
- def self.open: (pathish path, ?read_only: bool, ?streaming: bool) -> ReadOnlyWorkbook
12
+ def self.open: (pathish path, ?read_only: bool, ?streaming: bool, ?date_conversion: bool) -> ReadOnlyWorkbook
13
13
  def self.new: (?write_only: bool) -> WriteOnlyWorkbook
14
14
 
15
15
  attr_accessor self.max_shared_strings: Integer?
@@ -83,8 +83,8 @@ module Rbxl
83
83
  attr_reader path: String
84
84
  attr_reader sheet_names: Array[String]
85
85
 
86
- def self.open: (pathish path, ?streaming: bool) -> ReadOnlyWorkbook
87
- def initialize: (pathish path, ?streaming: bool) -> void
86
+ def self.open: (pathish path, ?streaming: bool, ?date_conversion: bool) -> ReadOnlyWorkbook
87
+ def initialize: (pathish path, ?streaming: bool, ?date_conversion: bool) -> void
88
88
  def sheet: (String name) -> ReadOnlyWorksheet
89
89
  def close: () -> void
90
90
  def closed?: () -> bool
@@ -94,7 +94,7 @@ module Rbxl
94
94
  attr_reader name: String
95
95
  attr_reader dimensions: dimensions?
96
96
 
97
- def initialize: (zip: untyped, entry_path: String, shared_strings: Array[String], name: String, ?streaming: bool) -> void
97
+ def initialize: (zip: untyped, entry_path: String, shared_strings: Array[String], name: String, ?streaming: bool, ?date_styles: Array[bool]?) -> void
98
98
 
99
99
  def each_row: (?pad_cells: bool, ?values_only: bool, ?expand_merged: bool) { (Row | row_values) -> void } -> void
100
100
  | (?pad_cells: bool, ?values_only: bool, ?expand_merged: bool) -> Enumerator[Row | row_values, void]
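With `cell_value` widened to include `Date` and `Time`, code that consumes row values may need two extra branches. The following is a small sketch of one way to handle the union. `format_cell` is a hypothetical helper and the output formats are arbitrary choices, not anything the gem prescribes.

```ruby
require "date"

# Hypothetical formatter covering every member of the widened union:
# String | Integer | Float | bool | Date | Time | nil.
def format_cell(value)
  case value
  when nil then ""
  when Time then value.strftime("%Y-%m-%d %H:%M:%S")
  when Date then value.iso8601
  when true, false then value ? "TRUE" : "FALSE"
  else value.to_s
  end
end

row = [1, "alice", 99.5, true, Date.new(2026, 4, 23), Time.new(2026, 4, 23, 9, 30, 0), nil]
formatted = row.map { |value| format_cell(value) }
```

The signature lists only `Date` and `Time`, which matches the two return types of the conversion method in this diff.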
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: rbxl
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.2
4
+ version: 1.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Taro KOBAYASHI
@@ -43,8 +43,8 @@ dependencies:
43
43
  - - "<"
44
44
  - !ruby/object:Gem::Version
45
45
  version: '2.0'
46
- description: rbxl is a Ruby gem for read-only row-by-row iteration and write-only
47
- XLSX generation, with an optional native extension for faster XML parsing.
46
+ description: rbxl is a fast, low-memory Ruby gem for row-by-row XLSX reads and append-only
47
+ XLSX writes, with an optional native extension for higher-throughput XML parsing.
48
48
  email:
49
49
  - taro@matzlika.co.jp
50
50
  executables: []
@@ -96,6 +96,5 @@ required_rubygems_version: !ruby/object:Gem::Requirement
96
96
  requirements: []
97
97
  rubygems_version: 4.0.3
98
98
  specification_version: 4
99
- summary: A fast, memory-friendly Ruby gem for row-by-row XLSX reads and append-only
100
- writes.
99
+ summary: Fast, low-memory XLSX processing for Ruby.
101
100
  test_files: []