rbxl 1.0.1 → 1.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -0
- data/README.md +61 -20
- data/Rakefile +6 -0
- data/ext/rbxl_native/native.c +127 -7
- data/lib/rbxl/cell.rb +15 -0
- data/lib/rbxl/empty_cell.rb +15 -0
- data/lib/rbxl/errors.rb +29 -0
- data/lib/rbxl/native.rb +14 -1
- data/lib/rbxl/read_only_cell.rb +10 -0
- data/lib/rbxl/read_only_workbook.rb +83 -6
- data/lib/rbxl/read_only_worksheet.rb +119 -7
- data/lib/rbxl/row.rb +34 -1
- data/lib/rbxl/version.rb +2 -1
- data/lib/rbxl/write_only_cell.rb +19 -1
- data/lib/rbxl/write_only_workbook.rb +42 -1
- data/lib/rbxl/write_only_worksheet.rb +41 -0
- data/lib/rbxl.rb +96 -2
- data/sig/rbxl.rbs +128 -0
- metadata +6 -3
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 76445404b974d2ddcd664b9f796fd693b7c5c36d1d56cf34fccc2b7f1fd1b51d
|
|
4
|
+
data.tar.gz: e41c2dcccc060b7bb7e3a5608f2f57dfaa7f063daf3f82f1a4fa0bf6f85cb098
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: f41de8a1367b9033d5391ac8f46ff8b363ae79c6331bd4601bbf64fbbdf6e437c53052f38c7f130aa21833c2d60603853b1553507a6e4c7c291317da3c3f749f
|
|
7
|
+
data.tar.gz: fe624cb616255d3437811354c073fe527f2e01bc562b93c3e19732e324aa4b227d50480b68a6e15d858c6bc276cd1da7c8456fd7ff94ea8447dea6a4b9cad70c
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,13 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 1.0.2
|
|
4
|
+
|
|
5
|
+
- Add `streaming: true` to `Rbxl.open` to feed worksheet XML to the native reader in 64 KiB chunks instead of buffering the full worksheet first.
|
|
6
|
+
- Add `Rbxl.max_worksheet_bytes` and `Rbxl::WorksheetTooLargeError` so streaming reads can stop oversized worksheet XML entries mid-inflate.
|
|
7
|
+
- Expand RDoc coverage across the public API.
|
|
8
|
+
- Tighten RBS signatures to match the actual runtime types.
|
|
9
|
+
- Reword public docs and gem metadata to describe reads as row-by-row and writes as append-only, reserving "streaming" for the new opt-in native read path.
|
|
10
|
+
|
|
3
11
|
## 1.0.1
|
|
4
12
|
|
|
5
13
|
- Fix ZIP64 handling.
|
data/README.md
CHANGED
|
@@ -1,11 +1,19 @@
|
|
|
1
1
|
# rbxl
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
Fast, memory-friendly Ruby gem for row-by-row `.xlsx` reads and append-only writes.
|
|
4
|
+
|
|
5
|
+
`rbxl` is built for the two workbook workflows that scale cleanly:
|
|
6
|
+
|
|
7
|
+
- read-only row-by-row iteration
|
|
8
|
+
- write-only workbook generation
|
|
9
|
+
|
|
10
|
+
The API is intentionally small and `openpyxl`-inspired, with an optional
|
|
11
|
+
native extension for faster XML parsing when you need more throughput.
|
|
4
12
|
|
|
5
13
|
Current scope is intentionally small:
|
|
6
14
|
|
|
7
15
|
- `write_only` workbook generation
|
|
8
|
-
- `read_only` row
|
|
16
|
+
- `read_only` row-by-row iteration
|
|
9
17
|
- `close()` for read-only workbooks
|
|
10
18
|
- minimal `openpyxl`-like API
|
|
11
19
|
- optional C extension (`rbxl/native`) for maximum performance
|
|
@@ -62,6 +70,20 @@ book.sheet("Data").rows(values_only: true).each { |row| process(row) }
|
|
|
62
70
|
book.close
|
|
63
71
|
```
|
|
64
72
|
|
|
73
|
+
For large worksheets where peak memory matters more than squeezing out the
|
|
74
|
+
last few percent of throughput, opt into chunk-fed worksheet inflation:
|
|
75
|
+
|
|
76
|
+
```ruby
|
|
77
|
+
require "rbxl"
|
|
78
|
+
require "rbxl/native"
|
|
79
|
+
|
|
80
|
+
Rbxl.max_worksheet_bytes = 64 * 1024 * 1024
|
|
81
|
+
|
|
82
|
+
book = Rbxl.open("large.xlsx", read_only: true, streaming: true)
|
|
83
|
+
book.sheet("Data").rows(values_only: true).each { |row| process(row) }
|
|
84
|
+
book.close
|
|
85
|
+
```
|
|
86
|
+
|
|
65
87
|
The C extension is **opt-in by design**:
|
|
66
88
|
|
|
67
89
|
- **Portability first**: `require "rbxl"` alone works everywhere Ruby and
|
|
@@ -77,9 +99,17 @@ The C extension is **opt-in by design**:
|
|
|
77
99
|
compile the C extension. If libxml2 is not found, compilation is silently
|
|
78
100
|
skipped and the gem installs successfully without it. You only notice when
|
|
79
101
|
you try `require "rbxl/native"`.
|
|
80
|
-
- **
|
|
102
|
+
- **Default path buffers the worksheet**: the worksheet ZIP entry is
|
|
81
103
|
inflated into a Ruby string before crossing into C. The extension removes
|
|
82
104
|
XML parse overhead, but not ZIP I/O or that intermediate buffer.
|
|
105
|
+
- **Opt-in streaming**: passing `streaming: true` to `Rbxl.open` feeds the
|
|
106
|
+
worksheet XML to the native parser in 64 KiB chunks pulled from the ZIP
|
|
107
|
+
input stream, so peak memory stays roughly independent of sheet size.
|
|
108
|
+
Pair with `Rbxl.max_worksheet_bytes` to cap uncompressed worksheet
|
|
109
|
+
inflation and stop high-compression zip-bomb style entries mid-inflate.
|
|
110
|
+
Throughput is usually within a few percent of the default path. Without
|
|
111
|
+
`require "rbxl/native"`, the flag is accepted but the pure-Ruby reader
|
|
112
|
+
still takes the buffered path.
|
|
83
113
|
|
|
84
114
|
Requirements for the C extension:
|
|
85
115
|
|
|
@@ -88,7 +118,7 @@ Requirements for the C extension:
|
|
|
88
118
|
|
|
89
119
|
## Design Notes
|
|
90
120
|
|
|
91
|
-
- Writer avoids a full workbook object graph
|
|
121
|
+
- Writer avoids a full workbook object graph; rows are buffered per sheet and the XML is emitted in a single pass at `save`.
|
|
92
122
|
- Reader uses a pull parser for worksheet XML so it can iterate rows without building the full DOM.
|
|
93
123
|
- Strings written by the MVP use `inlineStr` to avoid shared string bookkeeping during generation.
|
|
94
124
|
- Reader supports both shared strings and inline strings.
|
|
@@ -96,22 +126,27 @@ Requirements for the C extension:
|
|
|
96
126
|
|
|
97
127
|
## Development
|
|
98
128
|
|
|
129
|
+
Development in this repository assumes Ruby 3.4.8 (`.ruby-version`).
|
|
130
|
+
|
|
99
131
|
```bash
|
|
100
132
|
bundle install
|
|
101
133
|
cd benchmark && npm install && cd ..
|
|
102
134
|
|
|
103
135
|
# Run tests (pure Ruby)
|
|
104
|
-
ruby -Ilib -Itest test/rbxl_test.rb
|
|
136
|
+
bundle exec ruby -Ilib -Itest test/rbxl_test.rb
|
|
105
137
|
|
|
106
138
|
# Run tests (with native extension)
|
|
107
139
|
cd ext/rbxl_native && ruby extconf.rb && make && cd ../..
|
|
108
|
-
ruby -Ilib -Itest -r rbxl/native test/rbxl_test.rb
|
|
109
|
-
ruby -Ilib -Itest test/fast_ext_test.rb
|
|
140
|
+
bundle exec ruby -Ilib -Itest -r rbxl/native test/rbxl_test.rb
|
|
141
|
+
bundle exec ruby -Ilib -Itest test/fast_ext_test.rb
|
|
110
142
|
|
|
111
143
|
# Benchmarks
|
|
112
|
-
ruby -Ilib benchmark/compare.rb # pure Ruby
|
|
113
|
-
ruby -Ilib -r rbxl/native benchmark/compare.rb # with native
|
|
114
|
-
RBXL_BENCH_WARMUP=1 RBXL_BENCH_ITERATIONS=5 ruby -Ilib benchmark/read_modes.rb
|
|
144
|
+
bundle exec ruby -Ilib benchmark/compare.rb # pure Ruby
|
|
145
|
+
bundle exec ruby -Ilib -r rbxl/native benchmark/compare.rb # with native
|
|
146
|
+
RBXL_BENCH_WARMUP=1 RBXL_BENCH_ITERATIONS=5 bundle exec ruby -Ilib benchmark/read_modes.rb
|
|
147
|
+
|
|
148
|
+
# Generate API docs
|
|
149
|
+
bundle exec rake rdoc
|
|
115
150
|
```
|
|
116
151
|
|
|
117
152
|
## Benchmarks
|
|
@@ -128,30 +163,34 @@ best read as:
|
|
|
128
163
|
|
|
129
164
|
5000 rows x 10 columns, Ruby 3.4 / Python 3.13 / Node 24:
|
|
130
165
|
|
|
131
|
-

|
|
166
|
+

|
|
132
167
|
|
|
133
168
|
### Portable Baseline (`require "rbxl"`)
|
|
134
169
|
|
|
135
170
|
| benchmark | real (s) |
|
|
136
171
|
|---|---|
|
|
137
172
|
| rbxl write | 0.08 |
|
|
138
|
-
| rbxl read | 0.
|
|
139
|
-
| rbxl read values | 0.
|
|
173
|
+
| rbxl read | 0.29 |
|
|
174
|
+
| rbxl read values | 0.22 |
|
|
175
|
+
| fast_excel write | 0.18 |
|
|
176
|
+
| fast_excel write constant | 0.12 |
|
|
140
177
|
| exceljs write | 0.08 |
|
|
141
|
-
| exceljs read | 0.
|
|
178
|
+
| exceljs read | 0.19 |
|
|
142
179
|
| sheetjs write | 0.13 |
|
|
143
|
-
| sheetjs read | 0.
|
|
144
|
-
| openpyxl write | 0.
|
|
145
|
-
| openpyxl read | 0.
|
|
180
|
+
| sheetjs read | 0.20 |
|
|
181
|
+
| openpyxl write | 0.36 |
|
|
182
|
+
| openpyxl read | 0.21 |
|
|
146
183
|
| openpyxl read values | 0.18 |
|
|
184
|
+
| excelize write | 0.15 |
|
|
185
|
+
| excelize read | 0.14 |
|
|
147
186
|
|
|
148
187
|
### Performance Mode (`require "rbxl/native"`)
|
|
149
188
|
|
|
150
189
|
| benchmark | real (s) | vs exceljs/openpyxl |
|
|
151
190
|
|---|---|---|
|
|
152
|
-
| rbxl write | **0.
|
|
153
|
-
| rbxl read | **0.
|
|
154
|
-
| rbxl read values | **0.
|
|
191
|
+
| rbxl write | **0.05** | about 1.8x faster than exceljs, 2.5x faster than fast_excel constant, 7.7x faster than openpyxl |
|
|
192
|
+
| rbxl read | **0.09** | about 2.3x faster than exceljs, 2.4x faster than openpyxl |
|
|
193
|
+
| rbxl read values | **0.04** | about 4.8x faster than openpyxl values |
|
|
155
194
|
|
|
156
195
|
The comparison script uses these libraries when available:
|
|
157
196
|
|
|
@@ -159,12 +198,14 @@ Benchmark notes:
|
|
|
159
198
|
|
|
160
199
|
- `RBXL_BENCH_WARMUP` and `RBXL_BENCH_ITERATIONS` control warmup and repeated runs.
|
|
161
200
|
- Read comparisons use the same `rbxl.xlsx` fixture for `rbxl`, `roo`, `rubyXL`, and `openpyxl`.
|
|
201
|
+
- `fast_excel` adds write-only comparisons for both its default mode and `constant_memory: true`.
|
|
162
202
|
- JS comparisons use the same `rbxl.xlsx` fixture for `exceljs` and `sheetjs`.
|
|
163
203
|
- Write comparisons still measure each library producing its own workbook.
|
|
164
204
|
- `rss_delta_kb` is best-effort process RSS on Linux and should be treated as directional.
|
|
165
205
|
- Install JS benchmark dependencies with `cd benchmark && npm install`.
|
|
166
206
|
|
|
167
207
|
- `rbxl` for write/read
|
|
208
|
+
- `fast_excel` for write / constant-memory write
|
|
168
209
|
- `exceljs` for write/read
|
|
169
210
|
- `sheetjs` for write/read
|
|
170
211
|
- `excelize` (Go) for write/read
|
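The multipliers in the "Performance Mode" table of the README diff above can be sanity-checked against the rounded timings printed in the two tables. The sketch below is illustrative only: `NATIVE`, `OTHERS`, and `speedups` are hypothetical names, and the inputs are the two-decimal values from the tables. From those rounded values it yields 1.6x, 2.4x and 7.2x for write, 2.1x and 2.3x for read, and 4.5x for read values, slightly below the published 1.8x, 2.5x, 7.7x, 2.3x, 2.4x and 4.8x. The published figures were presumably computed from unrounded measurements; that is an assumption, not something the diff states.

```ruby
# Rounded timings in seconds, copied from the README tables in the diff.
NATIVE = { write: 0.05, read: 0.09, read_values: 0.04 }.freeze
OTHERS = {
  write: { "exceljs" => 0.08, "fast_excel constant" => 0.12, "openpyxl" => 0.36 },
  read: { "exceljs" => 0.19, "openpyxl" => 0.21 },
  read_values: { "openpyxl values" => 0.18 }
}.freeze

# Speedup = competitor time / rbxl native time, rounded to one decimal place.
def speedups(native, others)
  others.to_h do |bench, rivals|
    [bench, rivals.transform_values { |secs| (secs / native.fetch(bench)).round(1) }]
  end
end

speedups(NATIVE, OTHERS).each do |bench, ratios|
  puts "#{bench}: " + ratios.map { |name, ratio| "#{ratio}x vs #{name}" }.join(", ")
end
```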
data/Rakefile
CHANGED
data/ext/rbxl_native/native.c
CHANGED
|
@@ -359,11 +359,15 @@ static void on_characters(void *ctx, const xmlChar *ch, int len)
|
|
|
359
359
|
/* Ensure-style cleanup wrapper */
|
|
360
360
|
/* ------------------------------------------------------------------ */
|
|
361
361
|
|
|
362
|
+
#define IO_READ_CHUNK_BYTES (64 * 1024)
|
|
363
|
+
|
|
362
364
|
typedef struct {
|
|
363
365
|
parse_ctx *ctx;
|
|
364
366
|
xmlParserCtxtPtr parser;
|
|
365
|
-
const char *data;
|
|
366
|
-
long data_len;
|
|
367
|
+
const char *data; /* string mode only */
|
|
368
|
+
long data_len; /* string mode only */
|
|
369
|
+
VALUE io; /* io mode only (Qnil in string mode) */
|
|
370
|
+
long max_bytes; /* io mode cap; 0 = unbounded */
|
|
367
371
|
} parse_args;
|
|
368
372
|
|
|
369
373
|
static VALUE do_parse(VALUE arg)
|
|
@@ -375,6 +379,39 @@ static VALUE do_parse(VALUE arg)
|
|
|
375
379
|
return Qnil;
|
|
376
380
|
}
|
|
377
381
|
|
|
382
|
+
static VALUE do_parse_io(VALUE arg)
|
|
383
|
+
{
|
|
384
|
+
parse_args *a = (parse_args *)arg;
|
|
385
|
+
static ID id_read = 0;
|
|
386
|
+
if (!id_read) id_read = rb_intern("read");
|
|
387
|
+
VALUE chunk_size = INT2NUM(IO_READ_CHUNK_BYTES);
|
|
388
|
+
long total = 0;
|
|
389
|
+
|
|
390
|
+
while (1) {
|
|
391
|
+
VALUE chunk = rb_funcall(a->io, id_read, 1, chunk_size);
|
|
392
|
+
if (NIL_P(chunk)) break;
|
|
393
|
+
Check_Type(chunk, T_STRING);
|
|
394
|
+
|
|
395
|
+
long n = RSTRING_LEN(chunk);
|
|
396
|
+
if (n == 0) break;
|
|
397
|
+
|
|
398
|
+
total += n;
|
|
399
|
+
if (a->max_bytes > 0 && total > a->max_bytes) {
|
|
400
|
+
a->ctx->error = 1;
|
|
401
|
+
snprintf(a->ctx->error_msg, sizeof(a->ctx->error_msg),
|
|
402
|
+
"worksheet bytes exceed limit (%ld)", a->max_bytes);
|
|
403
|
+
break;
|
|
404
|
+
}
|
|
405
|
+
|
|
406
|
+
xmlParseChunk(a->parser, RSTRING_PTR(chunk), (int)n, 0);
|
|
407
|
+
if (a->ctx->error) break;
|
|
408
|
+
}
|
|
409
|
+
|
|
410
|
+
/* Terminate the parser so any trailing buffered state flushes. */
|
|
411
|
+
xmlParseChunk(a->parser, NULL, 0, 1);
|
|
412
|
+
return Qnil;
|
|
413
|
+
}
|
|
414
|
+
|
|
378
415
|
static VALUE cleanup_parse(VALUE arg)
|
|
379
416
|
{
|
|
380
417
|
parse_args *a = (parse_args *)arg;
|
|
@@ -392,7 +429,7 @@ static VALUE cleanup_parse(VALUE arg)
|
|
|
392
429
|
/* Common parse setup */
|
|
393
430
|
/* ------------------------------------------------------------------ */
|
|
394
431
|
|
|
395
|
-
static
|
|
432
|
+
static xmlParserCtxtPtr setup_push_parser(parse_ctx *ctx)
|
|
396
433
|
{
|
|
397
434
|
xmlSAXHandler handler;
|
|
398
435
|
memset(&handler, 0, sizeof(handler));
|
|
@@ -408,11 +445,25 @@ static VALUE run_parse(parse_ctx *ctx, VALUE xml_str)
|
|
|
408
445
|
rb_raise(rb_eRuntimeError, "failed to create libxml2 parser context");
|
|
409
446
|
}
|
|
410
447
|
|
|
411
|
-
/*
|
|
412
|
-
|
|
413
|
-
|
|
448
|
+
/* XXE / entity-expansion defense:
|
|
449
|
+
* - NONET: no network access
|
|
450
|
+
* - NOENT omitted: user-defined entities are NOT substituted, so
|
|
451
|
+
* external entities are never resolved and billion-laughs style
|
|
452
|
+
     * expansion cannot trigger. Predefined entities (&amp; etc.) still
|
|
453
|
+
* reach the characters callback via libxml2's default SAX2 handler.
|
|
454
|
+
* - HUGE omitted: keep libxml2's built-in parser limits active.
|
|
455
|
+
* Real xlsx files stay well under these limits (Excel caps cell text
|
|
456
|
+
* at 32,767 chars), so no throughput loss. */
|
|
457
|
+
xmlCtxtUseOptions(parser, XML_PARSE_NONET);
|
|
458
|
+
return parser;
|
|
459
|
+
}
|
|
414
460
|
|
|
415
|
-
|
|
461
|
+
static VALUE run_parse(parse_ctx *ctx, VALUE xml_str)
|
|
462
|
+
{
|
|
463
|
+
xmlParserCtxtPtr parser = setup_push_parser(ctx);
|
|
464
|
+
parse_args args = { ctx, parser,
|
|
465
|
+
RSTRING_PTR(xml_str), RSTRING_LEN(xml_str),
|
|
466
|
+
Qnil, 0 };
|
|
416
467
|
|
|
417
468
|
/* rb_ensure guarantees cleanup even if rb_yield raises */
|
|
418
469
|
rb_ensure(do_parse, (VALUE)&args, cleanup_parse, (VALUE)&args);
|
|
@@ -424,6 +475,20 @@ static VALUE run_parse(parse_ctx *ctx, VALUE xml_str)
|
|
|
424
475
|
return INT2NUM(ctx->row_count);
|
|
425
476
|
}
|
|
426
477
|
|
|
478
|
+
static VALUE run_parse_io(parse_ctx *ctx, VALUE io, long max_bytes)
|
|
479
|
+
{
|
|
480
|
+
xmlParserCtxtPtr parser = setup_push_parser(ctx);
|
|
481
|
+
parse_args args = { ctx, parser, NULL, 0, io, max_bytes };
|
|
482
|
+
|
|
483
|
+
rb_ensure(do_parse_io, (VALUE)&args, cleanup_parse, (VALUE)&args);
|
|
484
|
+
|
|
485
|
+
if (ctx->error) {
|
|
486
|
+
rb_raise(rb_eRuntimeError, "rbxl_native: %s", ctx->error_msg);
|
|
487
|
+
}
|
|
488
|
+
|
|
489
|
+
return INT2NUM(ctx->row_count);
|
|
490
|
+
}
|
|
491
|
+
|
|
427
492
|
/* ------------------------------------------------------------------ */
|
|
428
493
|
/* Ruby method: Rbxl::Native.parse_sheet(xml_string, shared_strings) */
|
|
429
494
|
/* ------------------------------------------------------------------ */
|
|
@@ -473,6 +538,59 @@ static VALUE rb_native_parse_full(VALUE self, VALUE xml_str, VALUE shared_string
|
|
|
473
538
|
return run_parse(&ctx, xml_str);
|
|
474
539
|
}
|
|
475
540
|
|
|
541
|
+
/* ------------------------------------------------------------------ */
|
|
542
|
+
/* Ruby method: Rbxl::Native.parse_sheet_io(io, shared_strings, max_bytes) */
|
|
543
|
+
/* Chunk-fed streaming variant of parse_sheet. */
|
|
544
|
+
/* max_bytes may be nil to disable the worksheet byte cap. */
|
|
545
|
+
/* ------------------------------------------------------------------ */
|
|
546
|
+
|
|
547
|
+
static VALUE rb_native_parse_io(VALUE self, VALUE io, VALUE shared_strings, VALUE max_bytes)
|
|
548
|
+
{
|
|
549
|
+
(void)self;
|
|
550
|
+
Check_Type(shared_strings, T_ARRAY);
|
|
551
|
+
|
|
552
|
+
long max = NIL_P(max_bytes) ? 0 : NUM2LONG(max_bytes);
|
|
553
|
+
|
|
554
|
+
parse_ctx ctx;
|
|
555
|
+
memset(&ctx, 0, sizeof(ctx));
|
|
556
|
+
ctx.shared_strings = shared_strings;
|
|
557
|
+
ctx.shared_strings_len = RARRAY_LEN(shared_strings);
|
|
558
|
+
ctx.current_row = Qnil;
|
|
559
|
+
ctx.full_mode = 0;
|
|
560
|
+
dynbuf_init(&ctx.text_buf);
|
|
561
|
+
dynbuf_init(&ctx.raw_buf);
|
|
562
|
+
|
|
563
|
+
return run_parse_io(&ctx, io, max);
|
|
564
|
+
}
|
|
565
|
+
|
|
566
|
+
/* ------------------------------------------------------------------ */
|
|
567
|
+
/* Ruby method: Rbxl::Native.parse_sheet_full_io(io, shared_strings, max_bytes) */
|
|
568
|
+
/* ------------------------------------------------------------------ */
|
|
569
|
+
|
|
570
|
+
static VALUE rb_native_parse_full_io(VALUE self, VALUE io, VALUE shared_strings, VALUE max_bytes)
|
|
571
|
+
{
|
|
572
|
+
(void)self;
|
|
573
|
+
Check_Type(shared_strings, T_ARRAY);
|
|
574
|
+
|
|
575
|
+
long max = NIL_P(max_bytes) ? 0 : NUM2LONG(max_bytes);
|
|
576
|
+
|
|
577
|
+
VALUE mRbxl = rb_const_get(rb_cObject, rb_intern("Rbxl"));
|
|
578
|
+
|
|
579
|
+
parse_ctx ctx;
|
|
580
|
+
memset(&ctx, 0, sizeof(ctx));
|
|
581
|
+
ctx.shared_strings = shared_strings;
|
|
582
|
+
ctx.shared_strings_len = RARRAY_LEN(shared_strings);
|
|
583
|
+
ctx.current_row = Qnil;
|
|
584
|
+
ctx.full_mode = 1;
|
|
585
|
+
ctx.cReadOnlyCell = rb_const_get(mRbxl, rb_intern("ReadOnlyCell"));
|
|
586
|
+
ctx.cRow = rb_const_get(mRbxl, rb_intern("Row"));
|
|
587
|
+
dynbuf_init(&ctx.text_buf);
|
|
588
|
+
dynbuf_init(&ctx.raw_buf);
|
|
589
|
+
dynbuf_init(&ctx.cell_ref);
|
|
590
|
+
|
|
591
|
+
return run_parse_io(&ctx, io, max);
|
|
592
|
+
}
|
|
593
|
+
|
|
476
594
|
/* ================================================================== */
|
|
477
595
|
/* Native writer — generate sheet XML from Ruby Array of Arrays */
|
|
478
596
|
/* ================================================================== */
|
|
@@ -673,5 +791,7 @@ void Init_rbxl_native(void)
|
|
|
673
791
|
VALUE mNative = rb_define_module_under(mRbxl, "Native");
|
|
674
792
|
rb_define_module_function(mNative, "parse_sheet", rb_native_parse, 2);
|
|
675
793
|
rb_define_module_function(mNative, "parse_sheet_full", rb_native_parse_full, 2);
|
|
794
|
+
rb_define_module_function(mNative, "parse_sheet_io", rb_native_parse_io, 3);
|
|
795
|
+
rb_define_module_function(mNative, "parse_sheet_full_io", rb_native_parse_full_io, 3);
|
|
676
796
|
rb_define_module_function(mNative, "generate_sheet", rb_native_generate, 1);
|
|
677
797
|
}
|
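The chunk-fed loop in `do_parse_io` above can be mirrored in plain Ruby to show how the byte cap interacts with chunked reads. This is a minimal sketch, not rbxl code: `feed_in_chunks`, `ByteLimitExceeded`, and `CHUNK_BYTES` are hypothetical names, a `StringIO` stands in for the ZIP input stream, and the block stands in for `xmlParseChunk`. Unlike the C code, which records the error and raises only after cleanup, the sketch raises directly.

```ruby
require "stringio"

# Hypothetical names for illustration; none of these exist in rbxl.
CHUNK_BYTES = 64 * 1024

class ByteLimitExceeded < StandardError; end

# Reads +io+ in fixed-size chunks and yields each chunk to the block.
# Raises ByteLimitExceeded as soon as the running total passes +max_bytes+
# (nil or 0 disables the cap), so the offending chunk is never forwarded.
# Returns the total number of bytes forwarded.
def feed_in_chunks(io, max_bytes: nil, chunk_bytes: CHUNK_BYTES)
  total = 0
  loop do
    chunk = io.read(chunk_bytes)
    break if chunk.nil? || chunk.empty?

    total += chunk.bytesize
    if max_bytes && max_bytes.positive? && total > max_bytes
      raise ByteLimitExceeded, "worksheet bytes exceed limit (#{max_bytes})"
    end

    yield chunk
  end
  total
end

# 150 KiB of input arrives as two full chunks plus a remainder.
sizes = []
feed_in_chunks(StringIO.new("x" * (150 * 1024))) { |chunk| sizes << chunk.bytesize }
```

Because the cap is checked before a chunk is handed on, at most one chunk beyond the limit is ever read, which is what lets an oversized entry be rejected partway through inflation rather than after the whole worksheet has been buffered.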
data/lib/rbxl/cell.rb
CHANGED
|
@@ -1,3 +1,18 @@
|
|
|
1
1
|
module Rbxl
|
|
2
|
+
# Generic value-object cell used by the pure-Ruby reader path.
|
|
3
|
+
#
|
|
4
|
+
# Yielded as an element of {Rbxl::Row#cells} when a worksheet is iterated
|
|
5
|
+
# without +values_only+. Cells are keyword-constructed and expose the
|
|
6
|
+
# decoded Ruby value plus the Excel-style coordinate.
|
|
7
|
+
#
|
|
8
|
+
# cell = Rbxl::Cell.new(value: 42, coordinate: "B3")
|
|
9
|
+
# cell.value # => 42
|
|
10
|
+
# cell.coordinate # => "B3"
|
|
11
|
+
#
|
|
12
|
+
# @!attribute [rw] value
|
|
13
|
+
# @return [Object] decoded Ruby value for the cell (String, Numeric,
|
|
14
|
+
# Boolean, or +nil+)
|
|
15
|
+
# @!attribute [rw] coordinate
|
|
16
|
+
# @return [String, nil] Excel-style coordinate such as +"B3"+
|
|
2
17
|
Cell = Struct.new(:value, :coordinate, keyword_init: true)
|
|
3
18
|
end
|
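The behaviour the new comments describe follows from `Struct.new(..., keyword_init: true)` itself. The sketch below assumes nothing about rbxl beyond the one-line definition shown above; it redefines the same struct under a hypothetical `DemoXl` module so it runs without the gem.

```ruby
# Stand-in for the Struct shown in the diff, under a hypothetical module.
module DemoXl
  Cell = Struct.new(:value, :coordinate, keyword_init: true)
end

cell = DemoXl::Cell.new(value: 42, coordinate: "B3")

# Struct members are read/write, matching the [rw] attribute tags.
cell.value = 43

# Structs compare by member values.
same = DemoXl::Cell.new(value: 43, coordinate: "B3")
equal_by_value = (cell == same)

# Omitted keywords default to nil, hence the "String, nil" return type.
no_coordinate = DemoXl::Cell.new(value: 1).coordinate
```

With `keyword_init: true`, positional construction such as `DemoXl::Cell.new(42, "B3")` raises `ArgumentError`, which is why the comments call the cells keyword-constructed.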
data/lib/rbxl/empty_cell.rb
CHANGED
|
@@ -1,11 +1,26 @@
|
|
|
1
1
|
module Rbxl
|
|
2
|
+
# Placeholder cell returned when a coordinate in a padded row has no data.
|
|
3
|
+
#
|
|
4
|
+
# Used only when {Rbxl::ReadOnlyWorksheet#each_row} is called with
|
|
5
|
+
# <tt>pad_cells: true</tt>. The object carries the synthetic coordinate so
|
|
6
|
+
# that downstream code can still locate the slot in the worksheet grid.
|
|
7
|
+
#
|
|
8
|
+
# cell = Rbxl::EmptyCell.new(coordinate: "C5")
|
|
9
|
+
# cell.coordinate # => "C5"
|
|
10
|
+
# cell.value # => nil
|
|
2
11
|
class EmptyCell
|
|
12
|
+
# @return [String] Excel-style coordinate such as +"C5"+
|
|
3
13
|
attr_reader :coordinate
|
|
4
14
|
|
|
15
|
+
# @param coordinate [String] Excel-style coordinate
|
|
5
16
|
def initialize(coordinate:)
|
|
6
17
|
@coordinate = coordinate
|
|
7
18
|
end
|
|
8
19
|
|
|
20
|
+
# Always +nil+; exposed so callers can treat {EmptyCell} like any other
|
|
21
|
+
# cell object without a type check.
|
|
22
|
+
#
|
|
23
|
+
# @return [nil]
|
|
9
24
|
def value
|
|
10
25
|
nil
|
|
11
26
|
end
|
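The point of `EmptyCell#value` returning `nil` is that a padded row can be processed without type checks. A minimal sketch, using stand-in classes under a hypothetical `DemoXl` module that copy the definitions shown in this diff; the row itself is made-up sample data, not output from rbxl.

```ruby
# Stand-ins mirroring the definitions in the diff; not the real rbxl classes.
module DemoXl
  Cell = Struct.new(:value, :coordinate, keyword_init: true)

  class EmptyCell
    attr_reader :coordinate

    def initialize(coordinate:)
      @coordinate = coordinate
    end

    # Always nil, so callers can treat it like any other cell.
    def value
      nil
    end
  end
end

# A padded row as it might look with pad_cells: true (sample data).
padded_row = [
  DemoXl::Cell.new(value: "id", coordinate: "A1"),
  DemoXl::EmptyCell.new(coordinate: "B1"),
  DemoXl::Cell.new(value: 7, coordinate: "C1")
]

# Both cell types respond to #value and #coordinate, so no is_a? check is needed.
values = padded_row.map(&:value)
blank_coordinates = padded_row.select { |cell| cell.value.nil? }.map(&:coordinate)
```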
data/lib/rbxl/errors.rb
CHANGED
|
@@ -1,7 +1,36 @@
|
|
|
1
1
|
module Rbxl
|
|
2
|
+
# Base class for all errors raised by Rbxl. Rescue this class to catch any
|
|
3
|
+
# library-specific failure without catching unrelated +StandardError+
|
|
4
|
+
# subclasses from the caller's code.
|
|
2
5
|
class Error < StandardError; end
|
|
6
|
+
|
|
7
|
+
# Raised by {Rbxl::ReadOnlyWorkbook#sheet} when the requested sheet name
|
|
8
|
+
# is not present in the workbook.
|
|
3
9
|
class SheetNotFoundError < Error; end
|
|
10
|
+
|
|
11
|
+
# Raised when an operation is attempted against a workbook whose
|
|
12
|
+
# underlying resources have already been released via +close+.
|
|
4
13
|
class ClosedWorkbookError < Error; end
|
|
14
|
+
|
|
15
|
+
# Raised by {Rbxl::WriteOnlyWorkbook#save} when the workbook has already
|
|
16
|
+
# been persisted once. Write-only workbooks are save-once by design.
|
|
5
17
|
class WorkbookAlreadySavedError < Error; end
|
|
18
|
+
|
|
19
|
+
# Raised by {Rbxl::ReadOnlyWorksheet#calculate_dimension} when the sheet
|
|
20
|
+
# lacks a stored +<dimension>+ element and the caller has not opted into
|
|
21
|
+
# scanning the worksheet with <tt>force: true</tt>.
|
|
6
22
|
class UnsizedWorksheetError < Error; end
|
|
23
|
+
|
|
24
|
+
# Raised when the shared strings table in an opened workbook exceeds the
|
|
25
|
+
# configured count or byte limits (see {Rbxl.max_shared_strings} and
|
|
26
|
+
# {Rbxl.max_shared_string_bytes}). Guards against malicious or malformed
|
|
27
|
+
# +.xlsx+ files that would otherwise exhaust memory before the first row
|
|
28
|
+
# is read.
|
|
29
|
+
class SharedStringsTooLargeError < Error; end
|
|
30
|
+
|
|
31
|
+
# Raised when a worksheet's XML payload exceeds {Rbxl.max_worksheet_bytes}
|
|
32
|
+
# while iterating in +streaming: true+ mode. Applies to the uncompressed
|
|
33
|
+
# bytes consumed from the ZIP entry, so high-compression zip-bomb style
|
|
34
|
+
# worksheets are stopped mid-inflate rather than after the fact.
|
|
35
|
+
class WorksheetTooLargeError < Error; end
|
|
7
36
|
end
|
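The hierarchy documented above lets callers rescue one specific failure, every library failure, or neither. The sketch below reproduces only the class layout under a hypothetical `DemoXl` module so it runs without the gem; `classify_failure` is an illustrative helper, not part of rbxl.

```ruby
# Same shape as the hierarchy in the diff, under a hypothetical module.
module DemoXl
  class Error < StandardError; end
  class SheetNotFoundError < Error; end
  class WorksheetTooLargeError < Error; end
end

# Maps the outcome of a block to a symbol. The specific subclass is listed
# before the base class; unrelated StandardErrors are not rescued here.
def classify_failure
  yield
  :ok
rescue DemoXl::WorksheetTooLargeError
  :too_large
rescue DemoXl::Error
  :library_error
end

classify_failure { raise DemoXl::WorksheetTooLargeError, "cap hit" }   # => :too_large
classify_failure { raise DemoXl::SheetNotFoundError, "no such sheet" } # => :library_error
```

An `ArgumentError` raised inside the block propagates to the caller unchanged, which is the separation the base-class comment describes.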
data/lib/rbxl/native.rb
CHANGED
|
@@ -1,9 +1,22 @@
|
|
|
1
1
|
require "nokogiri"
|
|
2
2
|
|
|
3
|
+
# Opt-in loader for the libxml2-backed native extension.
|
|
4
|
+
#
|
|
5
|
+
# Requiring this file replaces the pure-Ruby worksheet XML parser and
|
|
6
|
+
# serializer with a C implementation that uses libxml2's SAX2 API directly.
|
|
7
|
+
# The public API exposed by {Rbxl} is unchanged; only the hot paths are
|
|
8
|
+
# swapped.
|
|
9
|
+
#
|
|
10
|
+
# The shared object is located in one of two places:
|
|
11
|
+
#
|
|
12
|
+
# 1. An installed gem layout (+rbxl_native/rbxl_native.so+ on the load path).
|
|
13
|
+
# 2. A development build tree under <tt>ext/rbxl_native/</tt>.
|
|
14
|
+
#
|
|
15
|
+
# If neither is available a +LoadError+ is raised with guidance on how to
|
|
16
|
+
# build the extension.
|
|
3
17
|
begin
|
|
4
18
|
require "rbxl_native/rbxl_native"
|
|
5
19
|
rescue LoadError
|
|
6
|
-
# Try loading from ext/ build directory (development)
|
|
7
20
|
ext_path = File.expand_path("../../ext/rbxl_native", __dir__)
|
|
8
21
|
so = Dir.glob(File.join(ext_path, "**", "rbxl_native.{so,bundle,dll}")).first
|
|
9
22
|
if so
|
data/lib/rbxl/read_only_cell.rb
CHANGED
|
@@ -1,3 +1,13 @@
|
|
|
1
1
|
module Rbxl
|
|
2
|
+
# Immutable cell value object used by the read-only worksheet path.
|
|
3
|
+
#
|
|
4
|
+
# Produced during row-by-row iteration when cells are yielded without
|
|
5
|
+
# +values_only+. Implemented as a +Data+ class so instances are frozen and
|
|
6
|
+
# hash-equal by value.
|
|
7
|
+
#
|
|
8
|
+
# @!attribute [r] coordinate
|
|
9
|
+
# @return [String] Excel-style coordinate such as +"A1"+
|
|
10
|
+
# @!attribute [r] value
|
|
11
|
+
# @return [Object, nil] decoded Ruby value (String, Numeric, Boolean, or +nil+)
|
|
2
12
|
ReadOnlyCell = Data.define(:coordinate, :value)
|
|
3
13
|
end
|
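The "frozen and hash-equal by value" wording above describes what `Data.define` provides on Ruby 3.2 and later. A short sketch, again under a hypothetical `DemoXl` module so that it does not depend on rbxl being installed:

```ruby
# Stand-in for the Data class shown in the diff (requires Ruby 3.2+).
module DemoXl
  ReadOnlyCell = Data.define(:coordinate, :value)
end

a = DemoXl::ReadOnlyCell.new(coordinate: "A1", value: "x")
b = DemoXl::ReadOnlyCell.new("A1", "x") # positional arguments are also accepted

frozen      = a.frozen?            # instances are frozen on creation
equal       = (a == b)             # equality compares member values
usable_key  = { a => :found }[b]   # eql?/hash agree, so b finds a's entry
has_setter  = a.respond_to?(:value=) # no setters are generated

# "Changing" a cell means building a new one; the original is untouched.
updated = a.with(value: "y")
```

Note that only the `Data` instance is frozen; whether the wrapped value (for example a String) is itself frozen depends on how it was built.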
data/lib/rbxl/read_only_workbook.rb
CHANGED
|
@@ -1,24 +1,75 @@
|
|
|
1
1
|
module Rbxl
|
|
2
|
+
# Read-only workbook backed by a ZIP archive.
|
|
3
|
+
#
|
|
4
|
+
# The workbook opens the underlying <tt>.xlsx</tt> once and keeps a single
|
|
5
|
+
# +Zip::File+ handle open for the lifetime of the object. Worksheets are
|
|
6
|
+
# opened lazily via {#sheet}, so callers can process very large sheets
|
|
7
|
+
# without materializing the full workbook in memory.
|
|
8
|
+
#
|
|
9
|
+
# Typical use:
|
|
10
|
+
#
|
|
11
|
+
# book = Rbxl.open("big.xlsx", read_only: true)
|
|
12
|
+
# begin
|
|
13
|
+
# book.sheet_names # => ["Data"]
|
|
14
|
+
# book.sheet("Data").each_row do |row|
|
|
15
|
+
# process(row.values)
|
|
16
|
+
# end
|
|
17
|
+
# ensure
|
|
18
|
+
# book.close
|
|
19
|
+
# end
|
|
20
|
+
#
|
|
21
|
+
# After {#close} every subsequent {#sheet} call raises
|
|
22
|
+
# {Rbxl::ClosedWorkbookError}.
|
|
2
23
|
class ReadOnlyWorkbook
|
|
24
|
+
# Namespace for the main SpreadsheetML schema.
|
|
3
25
|
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
|
|
26
|
+
|
|
27
|
+
# Namespace used for document-level relationships.
|
|
4
28
|
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
|
|
29
|
+
|
|
30
|
+
# Namespace used by the OPC package relationships layer.
|
|
5
31
|
PACKAGE_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
|
|
6
32
|
|
|
7
|
-
|
|
33
|
+
# @return [String] filesystem path the workbook was opened from
|
|
34
|
+
attr_reader :path
|
|
8
35
|
|
|
9
|
-
|
|
10
|
-
|
|
36
|
+
# @return [Array<String>] visible sheet names in workbook order
|
|
37
|
+
attr_reader :sheet_names
|
|
38
|
+
|
|
39
|
+
# Convenience constructor equivalent to <tt>new(path, streaming:)</tt>.
|
|
40
|
+
#
|
|
41
|
+
# @param path [String, #to_path] path to the <tt>.xlsx</tt> file
|
|
42
|
+
# @param streaming [Boolean] feed worksheet XML to the native parser in
|
|
43
|
+
# chunks (see {Rbxl.open})
|
|
44
|
+
# @return [Rbxl::ReadOnlyWorkbook]
|
|
45
|
+
def self.open(path, streaming: false)
|
|
46
|
+
new(path, streaming: streaming)
|
|
11
47
|
end
|
|
12
48
|
|
|
13
|
-
|
|
49
|
+
# Opens the ZIP archive, pre-loads shared strings, and indexes the
|
|
50
|
+
# worksheet entries keyed by visible sheet name.
|
|
51
|
+
#
|
|
52
|
+
# @param path [String, #to_path] path to the <tt>.xlsx</tt> file
|
|
53
|
+
# @param streaming [Boolean] forwarded to produced worksheets
|
|
54
|
+
def initialize(path, streaming: false)
|
|
14
55
|
@path = path
|
|
15
56
|
@zip = Zip::File.open(path)
|
|
57
|
+
@streaming = streaming
|
|
16
58
|
@shared_strings = load_shared_strings
|
|
17
59
|
@sheet_entries = load_sheet_entries
|
|
18
60
|
@sheet_names = @sheet_entries.keys.freeze
|
|
19
61
|
@closed = false
|
|
20
62
|
end
|
|
21
63
|
|
|
64
|
+
# Returns a row-by-row worksheet by visible sheet name.
|
|
65
|
+
#
|
|
66
|
+
# The returned object shares the workbook's ZIP handle. Closing the
|
|
67
|
+
# workbook invalidates any worksheets produced by prior calls.
|
|
68
|
+
#
|
|
69
|
+
# @param name [String] visible sheet name as listed in {#sheet_names}
|
|
70
|
+
# @return [Rbxl::ReadOnlyWorksheet]
|
|
71
|
+
# @raise [Rbxl::SheetNotFoundError] if +name+ is not present
|
|
72
|
+
# @raise [Rbxl::ClosedWorkbookError] if the workbook has been closed
|
|
22
73
|
def sheet(name)
|
|
23
74
|
ensure_open!
|
|
24
75
|
|
|
@@ -26,9 +77,13 @@ module Rbxl
|
|
|
26
77
|
raise SheetNotFoundError, "sheet not found: #{name}"
|
|
27
78
|
end
|
|
28
79
|
|
|
29
|
-
ReadOnlyWorksheet.new(zip: @zip, entry_path: entry_path, shared_strings: @shared_strings, name: name)
|
|
80
|
+
ReadOnlyWorksheet.new(zip: @zip, entry_path: entry_path, shared_strings: @shared_strings, name: name, streaming: @streaming)
|
|
30
81
|
end
|
|
31
82
|
|
|
83
|
+
# Releases the underlying ZIP file handle. Idempotent; subsequent calls
|
|
84
|
+
# are no-ops.
|
|
85
|
+
#
|
|
86
|
+
# @return [void]
|
|
32
87
|
def close
|
|
33
88
|
return if closed?
|
|
34
89
|
|
|
@@ -36,6 +91,7 @@ module Rbxl
|
|
|
36
91
|
@closed = true
|
|
37
92
|
end
|
|
38
93
|
|
|
94
|
+
# @return [Boolean] whether {#close} has been called
|
|
39
95
|
def closed?
|
|
40
96
|
@closed
|
|
41
97
|
end
|
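The `close` / `closed?` pair documented above, together with the `ensure_open!` guard called from `sheet`, follows a common idempotent-close pattern. The sketch below is an illustration under stated assumptions: `DemoXl::Workbook` is hypothetical, a `StringIO` stands in for the `Zip::File` handle, and the body of `ensure_open!` is a guess, since the diff only shows the call site.

```ruby
require "stringio"

module DemoXl
  class ClosedWorkbookError < StandardError; end

  # Hypothetical resource wrapper; not the rbxl class.
  class Workbook
    def initialize(io)
      @io = io
      @closed = false
    end

    # Any operation that touches the handle goes through the guard first.
    def read_all
      ensure_open!
      @io.rewind
      @io.read
    end

    # Idempotent: the second and later calls return without touching the handle.
    def close
      return if closed?

      @io.close
      @closed = true
    end

    def closed?
      @closed
    end

    private

    # Assumed implementation: raise once the handle has been released.
    def ensure_open!
      raise ClosedWorkbookError, "workbook is closed" if closed?
    end
  end
end

book = DemoXl::Workbook.new(StringIO.new("rows"))
begin
  book.read_all
ensure
  book.close
end
```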
|
@@ -50,7 +106,18 @@ module Rbxl
|
|
|
50
106
|
entry = @zip.find_entry("xl/sharedStrings.xml")
|
|
51
107
|
return [] unless entry
|
|
52
108
|
|
|
109
|
+
max_count = Rbxl.max_shared_strings
|
|
110
|
+
max_bytes = Rbxl.max_shared_string_bytes
|
|
111
|
+
|
|
112
|
+
# Reject zip-bomb style entries up front using the ZIP directory's
|
|
113
|
+
# declared uncompressed size, before allocating any decompression buffer.
|
|
114
|
+
if max_bytes && entry.size && entry.size > max_bytes
|
|
115
|
+
raise SharedStringsTooLargeError,
|
|
116
|
+
"shared strings uncompressed size #{entry.size} exceeds limit #{max_bytes}"
|
|
117
|
+
end
|
|
118
|
+
|
|
53
119
|
strings = []
|
|
120
|
+
total_bytes = 0
|
|
54
121
|
io = entry.get_input_stream
|
|
55
122
|
reader = Nokogiri::XML::Reader(io)
|
|
56
123
|
|
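The shared-strings limits in this file work in two stages: an up-front rejection based on the size the ZIP directory declares, then running count and byte totals enforced as each string is completed. The sketch below shows that shape in isolation. `DemoXl.collect_strings` and `TooLargeError` are hypothetical names, and an array of strings stands in for the XML reader loop; it is a sketch of the technique, not the library's implementation.

```ruby
module DemoXl
  class TooLargeError < StandardError; end

  # declared_size: the uncompressed size claimed by the archive (may be nil).
  # max_count / max_bytes: nil disables the corresponding limit.
  def self.collect_strings(strings, declared_size:, max_count:, max_bytes:)
    # Stage 1: cheap rejection before any decompression work happens.
    if max_bytes && declared_size && declared_size > max_bytes
      raise TooLargeError, "declared size #{declared_size} exceeds limit #{max_bytes}"
    end

    # Stage 2: enforce the limits against what is actually read, since the
    # declared size cannot be trusted on its own.
    total_bytes = 0
    collected = []
    strings.each do |string|
      value = string.dup.freeze
      total_bytes += value.bytesize
      if max_bytes && total_bytes > max_bytes
        raise TooLargeError, "total size exceeds limit #{max_bytes}"
      end

      collected << value
      if max_count && collected.size > max_count
        raise TooLargeError, "count exceeds limit #{max_count}"
      end
    end
    collected
  end
end

strings = DemoXl.collect_strings(%w[id name], declared_size: 120, max_count: 10, max_bytes: 1024)
```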
|
@@ -92,7 +159,17 @@ module Rbxl
|
|
|
92
159
|
when "rPh"
|
|
93
160
|
in_phonetic = false
|
|
94
161
|
when "si"
|
|
95
|
-
|
|
162
|
+
value = current_fragments.join.freeze
|
|
163
|
+
total_bytes += value.bytesize
|
|
164
|
+
if max_bytes && total_bytes > max_bytes
|
|
165
|
+
raise SharedStringsTooLargeError,
|
|
166
|
+
"shared strings total size exceeds limit #{max_bytes}"
|
|
167
|
+
end
|
|
168
|
+
strings << value
|
|
169
|
+
if max_count && strings.size > max_count
|
|
170
|
+
raise SharedStringsTooLargeError,
|
|
171
|
+
"shared strings count exceeds limit #{max_count}"
|
|
172
|
+
end
|
|
96
173
|
in_si = false
|
|
97
174
|
in_run = false
|
|
98
175
|
in_phonetic = false
|