rbxl 1.0.1 → 1.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: e9bedc3242085871b368d031e7791aeb925d8d2a53329aebaf1776a0a0d273eb
4
- data.tar.gz: e4d6594b3c7d19b63f429b5cb5680df1d4e6e762dd86c2ab33499c97a5389918
3
+ metadata.gz: 76445404b974d2ddcd664b9f796fd693b7c5c36d1d56cf34fccc2b7f1fd1b51d
4
+ data.tar.gz: e41c2dcccc060b7bb7e3a5608f2f57dfaa7f063daf3f82f1a4fa0bf6f85cb098
5
5
  SHA512:
6
- metadata.gz: fac56fdc22b72ff9bf75c3273e8e9a61fbab953c3ddb280618522c121a71a8d530f09792d4c80682b94309b7042604af42b5bfeeab41420e67611cb57d0a57de
7
- data.tar.gz: 72f58522b5d7d9a0e1ca16e153a578871e8a2355e12a9066c7f1b1ec1026c72124bdbd7ec72c9e293caf74dfcff0e21386c09db1155c7d2c7549e04ef93abdec
6
+ metadata.gz: f41de8a1367b9033d5391ac8f46ff8b363ae79c6331bd4601bbf64fbbdf6e437c53052f38c7f130aa21833c2d60603853b1553507a6e4c7c291317da3c3f749f
7
+ data.tar.gz: fe624cb616255d3437811354c073fe527f2e01bc562b93c3e19732e324aa4b227d50480b68a6e15d858c6bc276cd1da7c8456fd7ff94ea8447dea6a4b9cad70c
data/CHANGELOG.md CHANGED
@@ -1,5 +1,13 @@
1
1
  # Changelog
2
2
 
3
+ ## 1.0.2
4
+
5
+ - Add `streaming: true` to `Rbxl.open` to feed worksheet XML to the native reader in 64 KiB chunks instead of buffering the full worksheet first.
6
+ - Add `Rbxl.max_worksheet_bytes` and `Rbxl::WorksheetTooLargeError` so streaming reads can stop oversized worksheet XML entries mid-inflate.
7
+ - Expand RDoc coverage across the public API.
8
+ - Tighten RBS signatures to match the actual runtime types.
9
+ - Reword public docs and gem metadata to describe reads as row-by-row and writes as append-only, reserving "streaming" for the new opt-in native read path.
10
+
3
11
  ## 1.0.1
4
12
 
5
13
  - Fix ZIP64 handling.
data/README.md CHANGED
@@ -1,11 +1,19 @@
1
1
  # rbxl
2
2
 
3
- `openpyxl` inspired Ruby gem for large-ish `.xlsx` files.
3
+ Fast, memory-friendly Ruby gem for row-by-row `.xlsx` reads and append-only writes.
4
+
5
+ `rbxl` is built for the two workbook workflows that scale cleanly:
6
+
7
+ - read-only row-by-row iteration
8
+ - write-only workbook generation
9
+
10
+ The API is intentionally small and `openpyxl`-inspired, with an optional
11
+ native extension for faster XML parsing when you need more throughput.
4
12
 
5
13
  Current scope is intentionally small:
6
14
 
7
15
  - `write_only` workbook generation
8
- - `read_only` row streaming
16
+ - `read_only` row-by-row iteration
9
17
  - `close()` for read-only workbooks
10
18
  - minimal `openpyxl`-like API
11
19
  - optional C extension (`rbxl/native`) for maximum performance
@@ -62,6 +70,20 @@ book.sheet("Data").rows(values_only: true).each { |row| process(row) }
62
70
  book.close
63
71
  ```
64
72
 
73
+ For large worksheets where peak memory matters more than squeezing out the
74
+ last few percent of throughput, opt into chunk-fed worksheet inflation:
75
+
76
+ ```ruby
77
+ require "rbxl"
78
+ require "rbxl/native"
79
+
80
+ Rbxl.max_worksheet_bytes = 64 * 1024 * 1024
81
+
82
+ book = Rbxl.open("large.xlsx", read_only: true, streaming: true)
83
+ book.sheet("Data").rows(values_only: true).each { |row| process(row) }
84
+ book.close
85
+ ```
86
+
65
87
  The C extension is **opt-in by design**:
66
88
 
67
89
  - **Portability first**: `require "rbxl"` alone works everywhere Ruby and
@@ -77,9 +99,17 @@ The C extension is **opt-in by design**:
77
99
  compile the C extension. If libxml2 is not found, compilation is silently
78
100
  skipped and the gem installs successfully without it. You only notice when
79
101
  you try `require "rbxl/native"`.
80
- - **Current boundary cost is explicit**: worksheet ZIP entries are still
102
+ - **Default path buffers the worksheet**: the worksheet ZIP entry is
81
103
  inflated into a Ruby string before crossing into C. The extension removes
82
104
  XML parse overhead, but not ZIP I/O or that intermediate buffer.
105
+ - **Opt-in streaming**: passing `streaming: true` to `Rbxl.open` feeds the
106
+ worksheet XML to the native parser in 64 KiB chunks pulled from the ZIP
107
+ input stream, so peak memory stays roughly independent of sheet size.
108
+ Pair with `Rbxl.max_worksheet_bytes` to cap uncompressed worksheet
109
+ inflation and stop high-compression zip-bomb style entries mid-inflate.
110
+ Throughput is usually within a few percent of the default path. Without
111
+ `require "rbxl/native"`, the flag is accepted but the pure-Ruby reader
112
+ still takes the buffered path.
83
113
 
84
114
  Requirements for the C extension:
85
115
 
@@ -88,7 +118,7 @@ Requirements for the C extension:
88
118
 
89
119
  ## Design Notes
90
120
 
91
- - Writer avoids a full workbook object graph and streams rows into sheet XML.
121
+ - Writer avoids a full workbook object graph; rows are buffered per sheet and the XML is emitted in a single pass at `save`.
92
122
  - Reader uses a pull parser for worksheet XML so it can iterate rows without building the full DOM.
93
123
  - Strings written by the MVP use `inlineStr` to avoid shared string bookkeeping during generation.
94
124
  - Reader supports both shared strings and inline strings.
@@ -96,22 +126,27 @@ Requirements for the C extension:
96
126
 
97
127
  ## Development
98
128
 
129
+ Development in this repository assumes Ruby 3.4.8 (`.ruby-version`).
130
+
99
131
  ```bash
100
132
  bundle install
101
133
  cd benchmark && npm install && cd ..
102
134
 
103
135
  # Run tests (pure Ruby)
104
- ruby -Ilib -Itest test/rbxl_test.rb
136
+ bundle exec ruby -Ilib -Itest test/rbxl_test.rb
105
137
 
106
138
  # Run tests (with native extension)
107
139
  cd ext/rbxl_native && ruby extconf.rb && make && cd ../..
108
- ruby -Ilib -Itest -r rbxl/native test/rbxl_test.rb
109
- ruby -Ilib -Itest test/fast_ext_test.rb
140
+ bundle exec ruby -Ilib -Itest -r rbxl/native test/rbxl_test.rb
141
+ bundle exec ruby -Ilib -Itest test/fast_ext_test.rb
110
142
 
111
143
  # Benchmarks
112
- ruby -Ilib benchmark/compare.rb # pure Ruby
113
- ruby -Ilib -r rbxl/native benchmark/compare.rb # with native
114
- RBXL_BENCH_WARMUP=1 RBXL_BENCH_ITERATIONS=5 ruby -Ilib benchmark/read_modes.rb
144
+ bundle exec ruby -Ilib benchmark/compare.rb # pure Ruby
145
+ bundle exec ruby -Ilib -r rbxl/native benchmark/compare.rb # with native
146
+ RBXL_BENCH_WARMUP=1 RBXL_BENCH_ITERATIONS=5 bundle exec ruby -Ilib benchmark/read_modes.rb
147
+
148
+ # Generate API docs
149
+ bundle exec rake rdoc
115
150
  ```
116
151
 
117
152
  ## Benchmarks
@@ -128,30 +163,34 @@ best read as:
128
163
 
129
164
  5000 rows x 10 columns, Ruby 3.4 / Python 3.13 / Node 24:
130
165
 
131
- ![Benchmark chart](benchmark/chart.png)
166
+ ![Benchmark chart](benchmark/chart-20260417-044037.png)
132
167
 
133
168
  ### Portable Baseline (`require "rbxl"`)
134
169
 
135
170
  | benchmark | real (s) |
136
171
  |---|---|
137
172
  | rbxl write | 0.08 |
138
- | rbxl read | 0.33 |
139
- | rbxl read values | 0.23 |
173
+ | rbxl read | 0.29 |
174
+ | rbxl read values | 0.22 |
175
+ | fast_excel write | 0.18 |
176
+ | fast_excel write constant | 0.12 |
140
177
  | exceljs write | 0.08 |
141
- | exceljs read | 0.17 |
178
+ | exceljs read | 0.19 |
142
179
  | sheetjs write | 0.13 |
143
- | sheetjs read | 0.19 |
144
- | openpyxl write | 0.35 |
145
- | openpyxl read | 0.22 |
180
+ | sheetjs read | 0.20 |
181
+ | openpyxl write | 0.36 |
182
+ | openpyxl read | 0.21 |
146
183
  | openpyxl read values | 0.18 |
184
+ | excelize write | 0.15 |
185
+ | excelize read | 0.14 |
147
186
 
148
187
  ### Performance Mode (`require "rbxl/native"`)
149
188
 
150
189
  | benchmark | real (s) | vs exceljs/openpyxl |
151
190
  |---|---|---|
152
- | rbxl write | **0.04** | about 2x / 9x faster |
153
- | rbxl read | **0.07** | about 2.6x / 3.2x faster |
154
- | rbxl read values | **0.03** | about 6.8x faster than openpyxl values |
191
+ | rbxl write | **0.05** | about 1.8x faster than exceljs, 2.5x faster than fast_excel constant, 7.7x faster than openpyxl |
192
+ | rbxl read | **0.09** | about 2.3x faster than exceljs, 2.4x faster than openpyxl |
193
+ | rbxl read values | **0.04** | about 4.8x faster than openpyxl values |
155
194
 
156
195
  The comparison script uses these libraries when available:
157
196
 
@@ -159,12 +198,14 @@ Benchmark notes:
159
198
 
160
199
  - `RBXL_BENCH_WARMUP` and `RBXL_BENCH_ITERATIONS` control warmup and repeated runs.
161
200
  - Read comparisons use the same `rbxl.xlsx` fixture for `rbxl`, `roo`, `rubyXL`, and `openpyxl`.
201
+ - `fast_excel` adds write-only comparisons for both its default mode and `constant_memory: true`.
162
202
  - JS comparisons use the same `rbxl.xlsx` fixture for `exceljs` and `sheetjs`.
163
203
  - Write comparisons still measure each library producing its own workbook.
164
204
  - `rss_delta_kb` is best-effort process RSS on Linux and should be treated as directional.
165
205
  - Install JS benchmark dependencies with `cd benchmark && npm install`.
166
206
 
167
207
  - `rbxl` for write/read
208
+ - `fast_excel` for write / constant-memory write
168
209
  - `exceljs` for write/read
169
210
  - `sheetjs` for write/read
170
211
  - `excelize` (Go) for write/read
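The "vs" column in the Performance Mode table is a ratio of wall-clock times. As a rough cross-check, the sketch below recomputes those ratios from the rounded values printed in the two tables. The `speedup` helper and the constant names are illustrative and not part of the benchmark scripts. Because the inputs are rounded to two decimals, the recomputed ratios come out slightly lower than the published ones, which were presumably derived from unrounded timings.

```ruby
# Rounded timings in seconds, copied from the 1.0.2 benchmark tables.
NATIVE = { write: 0.05, read: 0.09, read_values: 0.04 }.freeze
OTHERS = {
  exceljs_write: 0.08,
  exceljs_read: 0.19,
  fast_excel_constant_write: 0.12,
  openpyxl_write: 0.36,
  openpyxl_read: 0.21,
  openpyxl_read_values: 0.18
}.freeze

# Illustrative helper: how many times faster `ours` is than `theirs`,
# rounded to one decimal place.
def speedup(theirs, ours)
  (theirs / ours).round(1)
end

# Pair each competitor timing with the matching native rbxl timing.
PAIRS = {
  exceljs_write: :write,
  fast_excel_constant_write: :write,
  openpyxl_write: :write,
  exceljs_read: :read,
  openpyxl_read: :read,
  openpyxl_read_values: :read_values
}.freeze

PAIRS.each do |other, ours|
  puts format("%-28s %.1fx", other, speedup(OTHERS[other], NATIVE[ours]))
end
```

Treat the output as directional only: a timing that rounds to 0.05 s could be anywhere between 0.045 s and 0.055 s, which is enough to move a ratio by several tenths.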
data/Rakefile CHANGED
@@ -1,5 +1,11 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require "bundler/gem_helper"
4
+ require "rdoc/task"
4
5
 
5
6
  Bundler::GemHelper.install_tasks
7
+
8
+ RDoc::Task.new(:rdoc) do |rdoc|
9
+ rdoc.main = "README.md"
10
+ rdoc.rdoc_files.include("README.md", "lib/**/*.rb")
11
+ end
@@ -359,11 +359,15 @@ static void on_characters(void *ctx, const xmlChar *ch, int len)
359
359
  /* Ensure-style cleanup wrapper */
360
360
  /* ------------------------------------------------------------------ */
361
361
 
362
+ #define IO_READ_CHUNK_BYTES (64 * 1024)
363
+
362
364
  typedef struct {
363
365
  parse_ctx *ctx;
364
366
  xmlParserCtxtPtr parser;
365
- const char *data;
366
- long data_len;
367
+ const char *data; /* string mode only */
368
+ long data_len; /* string mode only */
369
+ VALUE io; /* io mode only (Qnil in string mode) */
370
+ long max_bytes; /* io mode cap; 0 = unbounded */
367
371
  } parse_args;
368
372
 
369
373
  static VALUE do_parse(VALUE arg)
@@ -375,6 +379,39 @@ static VALUE do_parse(VALUE arg)
375
379
  return Qnil;
376
380
  }
377
381
 
382
+ static VALUE do_parse_io(VALUE arg)
383
+ {
384
+ parse_args *a = (parse_args *)arg;
385
+ static ID id_read = 0;
386
+ if (!id_read) id_read = rb_intern("read");
387
+ VALUE chunk_size = INT2NUM(IO_READ_CHUNK_BYTES);
388
+ long total = 0;
389
+
390
+ while (1) {
391
+ VALUE chunk = rb_funcall(a->io, id_read, 1, chunk_size);
392
+ if (NIL_P(chunk)) break;
393
+ Check_Type(chunk, T_STRING);
394
+
395
+ long n = RSTRING_LEN(chunk);
396
+ if (n == 0) break;
397
+
398
+ total += n;
399
+ if (a->max_bytes > 0 && total > a->max_bytes) {
400
+ a->ctx->error = 1;
401
+ snprintf(a->ctx->error_msg, sizeof(a->ctx->error_msg),
402
+ "worksheet bytes exceed limit (%ld)", a->max_bytes);
403
+ break;
404
+ }
405
+
406
+ xmlParseChunk(a->parser, RSTRING_PTR(chunk), (int)n, 0);
407
+ if (a->ctx->error) break;
408
+ }
409
+
410
+ /* Terminate the parser so any trailing buffered state flushes. */
411
+ xmlParseChunk(a->parser, NULL, 0, 1);
412
+ return Qnil;
413
+ }
414
+
378
415
  static VALUE cleanup_parse(VALUE arg)
379
416
  {
380
417
  parse_args *a = (parse_args *)arg;
@@ -392,7 +429,7 @@ static VALUE cleanup_parse(VALUE arg)
392
429
  /* Common parse setup */
393
430
  /* ------------------------------------------------------------------ */
394
431
 
395
- static VALUE run_parse(parse_ctx *ctx, VALUE xml_str)
432
+ static xmlParserCtxtPtr setup_push_parser(parse_ctx *ctx)
396
433
  {
397
434
  xmlSAXHandler handler;
398
435
  memset(&handler, 0, sizeof(handler));
@@ -408,11 +445,25 @@ static VALUE run_parse(parse_ctx *ctx, VALUE xml_str)
408
445
  rb_raise(rb_eRuntimeError, "failed to create libxml2 parser context");
409
446
  }
410
447
 
411
- /* Disable network access and limit entity expansion */
412
- xmlCtxtUseOptions(parser,
413
- XML_PARSE_NONET | XML_PARSE_NOENT | XML_PARSE_HUGE);
448
+ /* XXE / entity-expansion defense:
449
+ * - NONET: no network access
450
+ * - NOENT omitted: user-defined entities are NOT substituted, so
451
+ * external entities are never resolved and billion-laughs style
452
+ * expansion cannot trigger. Predefined entities (&amp; etc.) still
453
+ * reach the characters callback via libxml2's default SAX2 handler.
454
+ * - HUGE omitted: keep libxml2's built-in parser limits active.
455
+ * Real xlsx files stay well under these limits (Excel caps cell text
456
+ * at 32,767 chars), so no throughput loss. */
457
+ xmlCtxtUseOptions(parser, XML_PARSE_NONET);
458
+ return parser;
459
+ }
414
460
 
415
- parse_args args = { ctx, parser, RSTRING_PTR(xml_str), RSTRING_LEN(xml_str) };
461
+ static VALUE run_parse(parse_ctx *ctx, VALUE xml_str)
462
+ {
463
+ xmlParserCtxtPtr parser = setup_push_parser(ctx);
464
+ parse_args args = { ctx, parser,
465
+ RSTRING_PTR(xml_str), RSTRING_LEN(xml_str),
466
+ Qnil, 0 };
416
467
 
417
468
  /* rb_ensure guarantees cleanup even if rb_yield raises */
418
469
  rb_ensure(do_parse, (VALUE)&args, cleanup_parse, (VALUE)&args);
@@ -424,6 +475,20 @@ static VALUE run_parse(parse_ctx *ctx, VALUE xml_str)
424
475
  return INT2NUM(ctx->row_count);
425
476
  }
426
477
 
478
+ static VALUE run_parse_io(parse_ctx *ctx, VALUE io, long max_bytes)
479
+ {
480
+ xmlParserCtxtPtr parser = setup_push_parser(ctx);
481
+ parse_args args = { ctx, parser, NULL, 0, io, max_bytes };
482
+
483
+ rb_ensure(do_parse_io, (VALUE)&args, cleanup_parse, (VALUE)&args);
484
+
485
+ if (ctx->error) {
486
+ rb_raise(rb_eRuntimeError, "rbxl_native: %s", ctx->error_msg);
487
+ }
488
+
489
+ return INT2NUM(ctx->row_count);
490
+ }
491
+
427
492
  /* ------------------------------------------------------------------ */
428
493
  /* Ruby method: Rbxl::Native.parse_sheet(xml_string, shared_strings) */
429
494
  /* ------------------------------------------------------------------ */
@@ -473,6 +538,59 @@ static VALUE rb_native_parse_full(VALUE self, VALUE xml_str, VALUE shared_string
473
538
  return run_parse(&ctx, xml_str);
474
539
  }
475
540
 
541
+ /* ------------------------------------------------------------------ */
542
+ /* Ruby method: Rbxl::Native.parse_sheet_io(io, shared_strings, max_bytes) */
543
+ /* Chunk-fed streaming variant of parse_sheet. */
544
+ /* max_bytes may be nil to disable the worksheet byte cap. */
545
+ /* ------------------------------------------------------------------ */
546
+
547
+ static VALUE rb_native_parse_io(VALUE self, VALUE io, VALUE shared_strings, VALUE max_bytes)
548
+ {
549
+ (void)self;
550
+ Check_Type(shared_strings, T_ARRAY);
551
+
552
+ long max = NIL_P(max_bytes) ? 0 : NUM2LONG(max_bytes);
553
+
554
+ parse_ctx ctx;
555
+ memset(&ctx, 0, sizeof(ctx));
556
+ ctx.shared_strings = shared_strings;
557
+ ctx.shared_strings_len = RARRAY_LEN(shared_strings);
558
+ ctx.current_row = Qnil;
559
+ ctx.full_mode = 0;
560
+ dynbuf_init(&ctx.text_buf);
561
+ dynbuf_init(&ctx.raw_buf);
562
+
563
+ return run_parse_io(&ctx, io, max);
564
+ }
565
+
566
+ /* ------------------------------------------------------------------ */
567
+ /* Ruby method: Rbxl::Native.parse_sheet_full_io(io, shared_strings, max_bytes) */
568
+ /* ------------------------------------------------------------------ */
569
+
570
+ static VALUE rb_native_parse_full_io(VALUE self, VALUE io, VALUE shared_strings, VALUE max_bytes)
571
+ {
572
+ (void)self;
573
+ Check_Type(shared_strings, T_ARRAY);
574
+
575
+ long max = NIL_P(max_bytes) ? 0 : NUM2LONG(max_bytes);
576
+
577
+ VALUE mRbxl = rb_const_get(rb_cObject, rb_intern("Rbxl"));
578
+
579
+ parse_ctx ctx;
580
+ memset(&ctx, 0, sizeof(ctx));
581
+ ctx.shared_strings = shared_strings;
582
+ ctx.shared_strings_len = RARRAY_LEN(shared_strings);
583
+ ctx.current_row = Qnil;
584
+ ctx.full_mode = 1;
585
+ ctx.cReadOnlyCell = rb_const_get(mRbxl, rb_intern("ReadOnlyCell"));
586
+ ctx.cRow = rb_const_get(mRbxl, rb_intern("Row"));
587
+ dynbuf_init(&ctx.text_buf);
588
+ dynbuf_init(&ctx.raw_buf);
589
+ dynbuf_init(&ctx.cell_ref);
590
+
591
+ return run_parse_io(&ctx, io, max);
592
+ }
593
+
476
594
  /* ================================================================== */
477
595
  /* Native writer — generate sheet XML from Ruby Array of Arrays */
478
596
  /* ================================================================== */
@@ -673,5 +791,7 @@ void Init_rbxl_native(void)
673
791
  VALUE mNative = rb_define_module_under(mRbxl, "Native");
674
792
  rb_define_module_function(mNative, "parse_sheet", rb_native_parse, 2);
675
793
  rb_define_module_function(mNative, "parse_sheet_full", rb_native_parse_full, 2);
794
+ rb_define_module_function(mNative, "parse_sheet_io", rb_native_parse_io, 3);
795
+ rb_define_module_function(mNative, "parse_sheet_full_io", rb_native_parse_full_io, 3);
676
796
  rb_define_module_function(mNative, "generate_sheet", rb_native_generate, 1);
677
797
  }
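The control flow of `do_parse_io` can be sketched in plain Ruby to make the cap semantics easier to follow. This is a minimal illustration, not the extension's code: `each_capped_chunk` and `ByteLimitExceeded` are hypothetical names, the block stands in for `xmlParseChunk`, and a `StringIO` stands in for the ZIP input stream. As in the C source, a cap of `0` means unbounded and the check runs before the chunk is handed to the parser.

```ruby
require "stringio"

CHUNK_BYTES = 64 * 1024 # same value as IO_READ_CHUNK_BYTES in the C source

# Hypothetical stand-in for the error the C code records in ctx->error.
class ByteLimitExceeded < StandardError; end

# Reads +io+ in fixed-size chunks and yields each chunk to the block.
# Raises ByteLimitExceeded as soon as the running total passes +max_bytes+,
# before the offending chunk is yielded. Returns the total bytes consumed.
def each_capped_chunk(io, max_bytes: 0, chunk_bytes: CHUNK_BYTES)
  total = 0
  while (chunk = io.read(chunk_bytes))
    break if chunk.empty?

    total += chunk.bytesize
    if max_bytes > 0 && total > max_bytes
      raise ByteLimitExceeded, "worksheet bytes exceed limit (#{max_bytes})"
    end

    yield chunk
  end
  total
end

# Usage: 150,000 bytes arrive as two full chunks plus a remainder.
io = StringIO.new("x" * 150_000)
sizes = []
each_capped_chunk(io) { |chunk| sizes << chunk.bytesize }
puts sizes.inspect

# With a 100,000-byte cap the second chunk trips the limit.
begin
  each_capped_chunk(StringIO.new("x" * 150_000), max_bytes: 100_000) { |_chunk| }
rescue ByteLimitExceeded => e
  puts e.message
end
```

Only one chunk is alive at a time, which is why peak memory in this style of loop depends on the chunk size rather than on the size of the input.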
data/lib/rbxl/cell.rb CHANGED
@@ -1,3 +1,18 @@
1
1
  module Rbxl
2
+ # Generic value-object cell used by the pure-Ruby reader path.
3
+ #
4
+ # Yielded as an element of {Rbxl::Row#cells} when a worksheet is iterated
5
+ # without +values_only+. Cells are keyword-constructed and expose the
6
+ # decoded Ruby value plus the Excel-style coordinate.
7
+ #
8
+ # cell = Rbxl::Cell.new(value: 42, coordinate: "B3")
9
+ # cell.value # => 42
10
+ # cell.coordinate # => "B3"
11
+ #
12
+ # @!attribute [rw] value
13
+ # @return [Object] decoded Ruby value for the cell (String, Numeric,
14
+ # Boolean, or +nil+)
15
+ # @!attribute [rw] coordinate
16
+ # @return [String, nil] Excel-style coordinate such as +"B3"+
2
17
  Cell = Struct.new(:value, :coordinate, keyword_init: true)
3
18
  end
@@ -1,11 +1,26 @@
1
1
  module Rbxl
2
+ # Placeholder cell returned when a coordinate in a padded row has no data.
3
+ #
4
+ # Used only when {Rbxl::ReadOnlyWorksheet#each_row} is called with
5
+ # <tt>pad_cells: true</tt>. The object carries the synthetic coordinate so
6
+ # that downstream code can still locate the slot in the worksheet grid.
7
+ #
8
+ # cell = Rbxl::EmptyCell.new(coordinate: "C5")
9
+ # cell.coordinate # => "C5"
10
+ # cell.value # => nil
2
11
  class EmptyCell
12
+ # @return [String] Excel-style coordinate such as +"C5"+
3
13
  attr_reader :coordinate
4
14
 
15
+ # @param coordinate [String] Excel-style coordinate
5
16
  def initialize(coordinate:)
6
17
  @coordinate = coordinate
7
18
  end
8
19
 
20
+ # Always +nil+; exposed so callers can treat {EmptyCell} like any other
21
+ # cell object without a type check.
22
+ #
23
+ # @return [nil]
9
24
  def value
10
25
  nil
11
26
  end
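The point of giving `EmptyCell` a `value` method is that a padded row can be processed without type checks. The sketch below shows that idea with stand-in classes that copy the shapes from the diff, wrapped in a hypothetical `PaddedRowSketch` module so it runs without the gem installed. The row contents are made up for illustration.

```ruby
module PaddedRowSketch
  # Stand-ins mirroring Rbxl::Cell and Rbxl::EmptyCell as shown in the diff.
  Cell = Struct.new(:value, :coordinate, keyword_init: true)

  class EmptyCell
    attr_reader :coordinate

    def initialize(coordinate:)
      @coordinate = coordinate
    end

    # Always nil, so callers can treat this like any other cell.
    def value
      nil
    end
  end

  # A hypothetical padded row: B1 has no data, so it is an EmptyCell.
  ROW = [
    Cell.new(value: "id", coordinate: "A1"),
    EmptyCell.new(coordinate: "B1"),
    Cell.new(value: 42, coordinate: "C1")
  ].freeze
end

# Both cell kinds answer #value and #coordinate, so one map handles the row.
values = PaddedRowSketch::ROW.map(&:value)
coordinates = PaddedRowSketch::ROW.map(&:coordinate)
puts values.inspect
puts coordinates.inspect
```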
data/lib/rbxl/errors.rb CHANGED
@@ -1,7 +1,36 @@
1
1
  module Rbxl
2
+ # Base class for all errors raised by Rbxl. Rescue this class to catch any
3
+ # library-specific failure without catching unrelated +StandardError+
4
+ # subclasses from the caller's code.
2
5
  class Error < StandardError; end
6
+
7
+ # Raised by {Rbxl::ReadOnlyWorkbook#sheet} when the requested sheet name
8
+ # is not present in the workbook.
3
9
  class SheetNotFoundError < Error; end
10
+
11
+ # Raised when an operation is attempted against a workbook whose
12
+ # underlying resources have already been released via +close+.
4
13
  class ClosedWorkbookError < Error; end
14
+
15
+ # Raised by {Rbxl::WriteOnlyWorkbook#save} when the workbook has already
16
+ # been persisted once. Write-only workbooks are save-once by design.
5
17
  class WorkbookAlreadySavedError < Error; end
18
+
19
+ # Raised by {Rbxl::ReadOnlyWorksheet#calculate_dimension} when the sheet
20
+ # lacks a stored +<dimension>+ element and the caller has not opted into
21
+ # scanning the worksheet with <tt>force: true</tt>.
6
22
  class UnsizedWorksheetError < Error; end
23
+
24
+ # Raised when the shared strings table in an opened workbook exceeds the
25
+ # configured count or byte limits (see {Rbxl.max_shared_strings} and
26
+ # {Rbxl.max_shared_string_bytes}). Guards against malicious or malformed
27
+ # +.xlsx+ files that would otherwise exhaust memory before the first row
28
+ # is read.
29
+ class SharedStringsTooLargeError < Error; end
30
+
31
+ # Raised when a worksheet's XML payload exceeds {Rbxl.max_worksheet_bytes}
32
+ # while iterating in +streaming: true+ mode. Applies to the uncompressed
33
+ # bytes consumed from the ZIP entry, so high-compression zip-bomb style
34
+ # worksheets are stopped mid-inflate rather than after the fact.
35
+ class WorksheetTooLargeError < Error; end
7
36
  end
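Because every error class above inherits from a single base class, callers can rescue the base to catch any library-specific failure. The sketch below illustrates that pattern together with the count and byte limits described for the shared strings table. It uses a hypothetical `RbxlErrorsSketch` module with stand-in classes and a made-up `check_shared_strings!` helper, so it runs without the gem; the gem itself enforces these limits while parsing, not on a finished array.

```ruby
module RbxlErrorsSketch
  # Stand-ins mirroring part of the hierarchy in errors.rb.
  class Error < StandardError; end
  class SharedStringsTooLargeError < Error; end
  class WorksheetTooLargeError < Error; end

  # Hypothetical helper: walks the strings and raises as soon as either
  # limit is passed. A nil limit disables that check.
  def self.check_shared_strings!(strings, max_count:, max_bytes:)
    total_bytes = 0
    strings.each_with_index do |string, index|
      total_bytes += string.bytesize
      if max_bytes && total_bytes > max_bytes
        raise SharedStringsTooLargeError,
              "shared strings total size exceeds limit #{max_bytes}"
      end
      if max_count && index + 1 > max_count
        raise SharedStringsTooLargeError,
              "shared strings count exceeds limit #{max_count}"
      end
    end
    strings.size
  end
end

# Rescuing the base class catches the specific subclass as well.
begin
  RbxlErrorsSketch.check_shared_strings!(%w[ab cd ef], max_count: 2, max_bytes: nil)
rescue RbxlErrorsSketch::Error => e
  puts "#{e.class.name}: #{e.message}"
end
```

Rescuing the narrow class is still preferable when the caller wants to handle oversized input differently from, say, a missing sheet.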
data/lib/rbxl/native.rb CHANGED
@@ -1,9 +1,22 @@
1
1
  require "nokogiri"
2
2
 
3
+ # Opt-in loader for the libxml2-backed native extension.
4
+ #
5
+ # Requiring this file replaces the pure-Ruby worksheet XML parser and
6
+ # serializer with a C implementation that uses libxml2's SAX2 API directly.
7
+ # The public API exposed by {Rbxl} is unchanged; only the hot paths are
8
+ # swapped.
9
+ #
10
+ # The shared object is located in one of two places:
11
+ #
12
+ # 1. An installed gem layout (+rbxl_native/rbxl_native.so+ on the load path).
13
+ # 2. A development build tree under <tt>ext/rbxl_native/</tt>.
14
+ #
15
+ # If neither is available a +LoadError+ is raised with guidance on how to
16
+ # build the extension.
3
17
  begin
4
18
  require "rbxl_native/rbxl_native"
5
19
  rescue LoadError
6
- # Try loading from ext/ build directory (development)
7
20
  ext_path = File.expand_path("../../ext/rbxl_native", __dir__)
8
21
  so = Dir.glob(File.join(ext_path, "**", "rbxl_native.{so,bundle,dll}")).first
9
22
  if so
@@ -1,3 +1,13 @@
1
1
  module Rbxl
2
+ # Immutable cell value object used by the read-only worksheet path.
3
+ #
4
+ # Produced during row-by-row iteration when cells are yielded without
5
+ # +values_only+. Implemented as a +Data+ class so instances are frozen and
6
+ # hash-equal by value.
7
+ #
8
+ # @!attribute [r] coordinate
9
+ # @return [String] Excel-style coordinate such as +"A1"+
10
+ # @!attribute [r] value
11
+ # @return [Object, nil] decoded Ruby value (String, Numeric, Boolean, or +nil+)
2
12
  ReadOnlyCell = Data.define(:coordinate, :value)
3
13
  end
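The doc comment says `Data` instances are frozen and hash-equal by value. A short sketch of what that means in practice follows; `ReadOnlyCellSketch` is a stand-in with the same definition as in the diff, so the example runs without the gem. It assumes Ruby 3.2 or newer, where `Data` is available.

```ruby
# Stand-in with the same member list as Rbxl::ReadOnlyCell in the diff.
ReadOnlyCellSketch = Data.define(:coordinate, :value)

a = ReadOnlyCellSketch.new(coordinate: "A1", value: 1)
b = ReadOnlyCellSketch.new(coordinate: "A1", value: 1)

# Instances are frozen and expose no setters.
puts a.frozen?
puts a.respond_to?(:value=)

# Equality and hashing are by value, so equal cells collapse as Hash keys.
puts a == b
seen = { a => :seen }
puts seen[b].inspect

# "Changing" a cell means building a new one; the original is untouched.
updated = a.with(value: 2)
puts updated.value
puts a.value
```

This is the main behavioral difference from the `Struct`-based `Rbxl::Cell`, which is documented above with read-write attributes.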
@@ -1,24 +1,75 @@
1
1
  module Rbxl
2
+ # Read-only workbook backed by a ZIP archive.
3
+ #
4
+ # The workbook opens the underlying <tt>.xlsx</tt> once and keeps a single
5
+ # +Zip::File+ handle open for the lifetime of the object. Worksheets are
6
+ # opened lazily via {#sheet}, so callers can process very large sheets
7
+ # without materializing the full workbook in memory.
8
+ #
9
+ # Typical use:
10
+ #
11
+ # book = Rbxl.open("big.xlsx", read_only: true)
12
+ # begin
13
+ # book.sheet_names # => ["Data"]
14
+ # book.sheet("Data").each_row do |row|
15
+ # process(row.values)
16
+ # end
17
+ # ensure
18
+ # book.close
19
+ # end
20
+ #
21
+ # After {#close} every subsequent {#sheet} call raises
22
+ # {Rbxl::ClosedWorkbookError}.
2
23
  class ReadOnlyWorkbook
24
+ # Namespace for the main SpreadsheetML schema.
3
25
  MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
26
+
27
+ # Namespace used for document-level relationships.
4
28
  REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
29
+
30
+ # Namespace used by the OPC package relationships layer.
5
31
  PACKAGE_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
6
32
 
7
- attr_reader :path, :sheet_names
33
+ # @return [String] filesystem path the workbook was opened from
34
+ attr_reader :path
8
35
 
9
- def self.open(path)
10
- new(path)
36
+ # @return [Array<String>] visible sheet names in workbook order
37
+ attr_reader :sheet_names
38
+
39
+ # Convenience constructor equivalent to <tt>new(path, streaming:)</tt>.
40
+ #
41
+ # @param path [String, #to_path] path to the <tt>.xlsx</tt> file
42
+ # @param streaming [Boolean] feed worksheet XML to the native parser in
43
+ # chunks (see {Rbxl.open})
44
+ # @return [Rbxl::ReadOnlyWorkbook]
45
+ def self.open(path, streaming: false)
46
+ new(path, streaming: streaming)
11
47
  end
12
48
 
13
- def initialize(path)
49
+ # Opens the ZIP archive, pre-loads shared strings, and indexes the
50
+ # worksheet entries keyed by visible sheet name.
51
+ #
52
+ # @param path [String, #to_path] path to the <tt>.xlsx</tt> file
53
+ # @param streaming [Boolean] forwarded to produced worksheets
54
+ def initialize(path, streaming: false)
14
55
  @path = path
15
56
  @zip = Zip::File.open(path)
57
+ @streaming = streaming
16
58
  @shared_strings = load_shared_strings
17
59
  @sheet_entries = load_sheet_entries
18
60
  @sheet_names = @sheet_entries.keys.freeze
19
61
  @closed = false
20
62
  end
21
63
 
64
+ # Returns a row-by-row worksheet by visible sheet name.
65
+ #
66
+ # The returned object shares the workbook's ZIP handle. Closing the
67
+ # workbook invalidates any worksheets produced by prior calls.
68
+ #
69
+ # @param name [String] visible sheet name as listed in {#sheet_names}
70
+ # @return [Rbxl::ReadOnlyWorksheet]
71
+ # @raise [Rbxl::SheetNotFoundError] if +name+ is not present
72
+ # @raise [Rbxl::ClosedWorkbookError] if the workbook has been closed
22
73
  def sheet(name)
23
74
  ensure_open!
24
75
 
@@ -26,9 +77,13 @@ module Rbxl
26
77
  raise SheetNotFoundError, "sheet not found: #{name}"
27
78
  end
28
79
 
29
- ReadOnlyWorksheet.new(zip: @zip, entry_path: entry_path, shared_strings: @shared_strings, name: name)
80
+ ReadOnlyWorksheet.new(zip: @zip, entry_path: entry_path, shared_strings: @shared_strings, name: name, streaming: @streaming)
30
81
  end
31
82
 
83
+ # Releases the underlying ZIP file handle. Idempotent; subsequent calls
84
+ # are no-ops.
85
+ #
86
+ # @return [void]
32
87
  def close
33
88
  return if closed?
34
89
 
@@ -36,6 +91,7 @@ module Rbxl
36
91
  @closed = true
37
92
  end
38
93
 
94
+ # @return [Boolean] whether {#close} has been called
39
95
  def closed?
40
96
  @closed
41
97
  end
@@ -50,7 +106,18 @@ module Rbxl
50
106
  entry = @zip.find_entry("xl/sharedStrings.xml")
51
107
  return [] unless entry
52
108
 
109
+ max_count = Rbxl.max_shared_strings
110
+ max_bytes = Rbxl.max_shared_string_bytes
111
+
112
+ # Reject zip-bomb style entries up front using the ZIP directory's
113
+ # declared uncompressed size, before allocating any decompression buffer.
114
+ if max_bytes && entry.size && entry.size > max_bytes
115
+ raise SharedStringsTooLargeError,
116
+ "shared strings uncompressed size #{entry.size} exceeds limit #{max_bytes}"
117
+ end
118
+
53
119
  strings = []
120
+ total_bytes = 0
54
121
  io = entry.get_input_stream
55
122
  reader = Nokogiri::XML::Reader(io)
56
123
 
@@ -92,7 +159,17 @@ module Rbxl
92
159
  when "rPh"
93
160
  in_phonetic = false
94
161
  when "si"
95
- strings << current_fragments.join.freeze
162
+ value = current_fragments.join.freeze
163
+ total_bytes += value.bytesize
164
+ if max_bytes && total_bytes > max_bytes
165
+ raise SharedStringsTooLargeError,
166
+ "shared strings total size exceeds limit #{max_bytes}"
167
+ end
168
+ strings << value
169
+ if max_count && strings.size > max_count
170
+ raise SharedStringsTooLargeError,
171
+ "shared strings count exceeds limit #{max_count}"
172
+ end
96
173
  in_si = false
97
174
  in_run = false
98
175
  in_phonetic = false