@rip-lang/csv 1.2.3 → 1.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +101 -19
- package/csv.rip +83 -2
- package/package.json +2 -2
package/README.md CHANGED

@@ -129,6 +129,35 @@ rows = CSV.read '="01",hello\n', excel: true
 rows = CSV.read str, relax: true
 ```
 
+## Special Cases (Relax + Excel)
+
+When `relax: true` and `excel: true` are both enabled, the parser recovers
+from common real-world CSV malformations — stray quotes, unescaped embedded
+quotes, and Excel `="..."` literals. These patterns appear frequently in
+exports from systems like Labcorp, legacy Excel, and other enterprise tools.
+
+The following table shows how the parser handles each case:
+
+| Row | Input | Fields | Key behavior |
+|-----|-------|--------|--------------|
+| 0 | `"AAA "BBB",CCC,"DDD"` | 3 | Stray quotes recovered (relax) |
+| 1 | `"CHUI, LOK HANG "BENNY",…,=""` | 5 | Stray quotes + excel empty |
+| 2 | `"Don",="007",10,"Ed"` | 4 | Excel literal preserves leading zero |
+| 6 | `Charlie or "Chuck",=B2 + B3,9` | 3 | Unquoted stray quotes + bare formula |
+| 10 | `A,B,C",D` | 4 | Trailing stray quote preserved |
+| 12 | `…,"CHO, JOELLE "JOJO"",08/19/2022` | 7 | Stray quotes + excel literals |
+| 14 | `"CHO, JOELLE "JOJO"",456` | 3 | Stray quotes (relax) |
+| 15 | `"CHO, JOELLE ""JOJO""",456` | 3 | Properly doubled quotes — same result |
+| 16 | `=,=x,x=,="x",="","","=",…` | 11 | Full excel + quoting matrix |
+
+```coffee
+# Parse messy real-world CSV with both modes enabled
+rows = CSV.read str, relax: true, excel: true
+
+# Load a Labcorp file
+rows = CSV.load! 'labcorp.csv', relax: true, excel: true, headers: true
+```
+
 ## Writing
 
 ### Basic Writing
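The Excel-literal unwrapping described in the added README section can be sketched in Python. This is a simplified, hypothetical model of the `="..."` rule only — not the package's actual parser, which handles these literals inline during tokenization:

```python
# Hypothetical simplified model: a field written as ="007" is unwrapped to
# the literal string 007, preserving the leading zero that a spreadsheet
# would otherwise strip. Anything else passes through unchanged.
def unwrap_excel_literal(field: str) -> str:
    if field.startswith('="') and field.endswith('"') and len(field) >= 3:
        return field[2:-1]
    return field

print(unwrap_excel_literal('="007"'))  # 007
print(unwrap_excel_literal('=""'))     # (empty string)
print(unwrap_excel_literal('hello'))   # hello
```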
@@ -235,27 +264,80 @@ CSV.writer(opts) # create reusable Writer instance
 CSV.formatRow(row, opts) # format single row -> string
 ```
 
+## CLI
+
+The library doubles as a command-line tool for converting CSV files:
+
+```bash
+# Clean up a malformed Labcorp file
+bun csv.rip -r -e input.csv output.csv
+
+# Protect leading zeros for Google Sheets / Excel
+bun csv.rip -r -e -z input.csv output.csv
+
+# Pipe to stdout
+bun csv.rip -r -e input.csv
+
+# Show version
+bun csv.rip -v
+```
+
+```
+Usage: bun csv.rip [options] <input> [output]
+
+Read options:
+  -r, --relax    Recover from stray/malformed quotes
+  -e, --excel    Handle Excel ="..." literals on input
+  -s, --strip    Strip whitespace from fields
+
+Write options:
+  -z, --zeros    Protect leading zeros with ="0123"
+
+General:
+  -v, --version  Show version
+  -h, --help     Show this help
+
+If output is omitted, writes to stdout.
+```
+
 ## Performance
 
-The parser consistently delivers **
-CSV files
-
-| File | Size | Rows |
-
-|
-|
-|
-|
-|
-|
-
-
-
-
-
-
-
-
+The parser consistently delivers **250-530 MB/s** throughput on real-world
+CSV files with `relax: true, excel: true` enabled:
+
+| File | Size | Rows | Time | Throughput |
+|------|------|------|------|------------|
+| Geodata | 24.8 MB | 662,061 | 75ms | 329 MB/s |
+| Medical records | 22.8 MB | 93,963 | 86ms | 264 MB/s |
+| Japanese postal codes | 10.9 MB | 124,565 | 29ms | 370 MB/s |
+| Japanese postal codes (100K) | 8.8 MB | 100,000 | 20ms | 442 MB/s |
+| Labcorp charges | 3.6 MB | 20,035 | 7ms | 528 MB/s |
+| UTF-8 data | 2.5 MB | 30,000 | 7ms | 354 MB/s |
+| Mixed data | 2.6 MB | 30,000 | 8ms | 340 MB/s |
+| Lab results | 1.8 MB | 4,894 | 7ms | 254 MB/s |
+
+Quote-free files hit the fast path (~440 MB/s). Files with quoted fields
+use the full path (~300 MB/s). The relax+excel heuristics add zero overhead
+on clean data — they only fire when an actual stray quote is encountered.
+
+### Comparison with Other JS Parsers
+
+Benchmarked against the [uDSV benchmark suite](https://github.com/leeoniya/uDSV/tree/main/bench)
+(the most comprehensive JS CSV benchmark), which tests ~20 parsers on Bun:
+
+| Parser | Strings | Quoted | Large (36 MB) | Notes |
+|--------|---------|--------|---------------|-------|
+| **Rip CSV** | **~370 MB/s** | **~330 MB/s** | **~329 MB/s** | indexOf ratchet, relax+excel |
+| uDSV | 287 MB/s | 188 MB/s | 293 MB/s | Fastest pure-JS parser (5KB) |
+| csv-simple-parser | 223 MB/s | 206 MB/s | 233 MB/s | |
+| d3-dsv | 275 MB/s | 110 MB/s | 285 MB/s | |
+| PapaParse | 252 MB/s | 59 MB/s | 292 MB/s | Drops 4x on quoted data |
+| csv-parse/sync | 20 MB/s | 19 MB/s | 18 MB/s | Node.js built-in |
+
+Rip CSV is in the same tier as uDSV — the acknowledged fastest JS CSV parser —
+while also supporting relax mode and Excel literal recovery that no other
+parser offers. On quoted files, Rip CSV is **5x faster** than PapaParse and
+**15x faster** than csv-parse.
 
 ## Roadmap
 
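As a quick sanity check on the added performance table, throughput follows directly from size and time. A sketch of the arithmetic (the table's millisecond values are rounded, so recomputed figures land near, not exactly on, the reported numbers):

```python
# Throughput (MB/s) = size in MB / time in seconds. Times in the README
# table are rounded to whole ms, so recomputed values differ slightly.
def throughput_mb_s(size_mb: float, time_ms: float) -> float:
    return size_mb / (time_ms / 1000.0)

print(round(throughput_mb_s(24.8, 75)))  # close to the reported 329 MB/s
print(round(throughput_mb_s(10.9, 29)))  # close to the reported 370 MB/s
```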
package/csv.rip CHANGED

@@ -236,6 +236,11 @@ def readFull(str, cfg)
     if c is quoteCode or (excel and c is EQ and str.charCodeAt(pos + 1) is quoteCode)
       if excel and c is EQ
         pos += 2 # skip ="
+        # relax: skip extra " in ="" when followed by content (not a real empty literal)
+        if relax and pos < len and str.charCodeAt(pos) is quoteCode
+          p2 = pos + quote.length
+          if p2 < len and str.charCodeAt(p2) isnt sepCode and str.charCodeAt(p2) isnt nlCode
+            pos += quote.length
       else
         pos += 1 # skip opening quote
 
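The heuristic added in this hunk can be modeled in Python: after consuming `="`, a further `"` that is immediately followed by more content (rather than a separator or newline) is a stray quote and gets skipped. A simplified sketch with the separator and newline hard-coded — not the actual rip engine, which works on char codes and a configurable separator:

```python
# Hypothetical model: pos points just past '="'. Return the position after
# skipping a stray extra quote, or pos unchanged if '=""' is a real empty
# excel literal (i.e. the extra quote is followed by , or newline or EOF).
def skip_extra_quote(s: str, pos: int) -> int:
    if pos < len(s) and s[pos] == '"':
        p2 = pos + 1
        if p2 < len(s) and s[p2] not in (',', '\n'):
            return pos + 1  # stray quote: skip it
    return pos

print(skip_extra_quote('=""abc', 2))  # 3 — extra quote skipped, content follows
print(skip_extra_quote('="",x', 2))   # 2 — real empty excel literal, kept
```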
@@ -260,6 +265,11 @@ def readFull(str, cfg)
       if pos < len and str.charCodeAt(pos) is quoteCode
         field += quote
         pos += quote.length
+        # relax+excel heuristic: "",=" means close, not just escape
+        if relax and excel and pos < len and str.charCodeAt(pos) is sepCode
+          p2 = pos + sepLen
+          if p2 < len and str.charCodeAt(p2) is EQ and p2 + 1 < len and str.charCodeAt(p2 + 1) is quoteCode
+            break
         continue
       else
         # backslash escape: \" -> "
@@ -277,8 +287,17 @@ def readFull(str, cfg)
       unless relax
         throw new Error "CSV: unexpected character after quote at position #{pos}"
 
-      # relax mode:
-
+      # relax mode: stray quote — scan through to next quote (permissive approach)
+      q2 = str.indexOf(quote, pos)
+      unless q2 >= 0
+        field += str.slice(pos)
+        pos = len
+        break
+      field += quote + str.slice(pos, q2 + quote.length)
+      pos = q2 + quote.length
+      break if pos >= len
+      c2 = str.charCodeAt(pos)
+      break if c2 is sepCode or c2 is nlCode
       continue
 
     # push field and consume trailing delimiter
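The relax branch added above keeps the stray quote as literal content and scans forward to the next quote with `indexOf`. A simplified Python model of that scan — hypothetical, with the quote character hard-coded and without the parser's surrounding loop:

```python
# Model of stray-quote recovery: given the field text accumulated so far,
# the full input s, and pos pointing at the unexpected character after a
# closing quote, keep the quote as content and scan through the next quote.
# Returns the updated (field, pos).
def recover_stray(field: str, s: str, pos: int) -> tuple[str, int]:
    q2 = s.find('"', pos)
    if q2 < 0:                            # no further quote: consume the rest
        return field + s[pos:], len(s)
    # keep the stray quote plus everything through the next quote
    return field + '"' + s[pos:q2 + 1], q2 + 1

# Row 0 from the README table: '"AAA "BBB",CCC' recovers to field 'AAA "BBB"'
print(recover_stray('AAA ', '"AAA "BBB",CCC', 6))  # ('AAA "BBB"', 10)
```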
@@ -305,6 +324,7 @@ def readFull(str, cfg)
     else if c is sepCode
       row.push ''
       pos += sepLen
+      row.push '' if pos >= len or str.charCodeAt(pos) is nlCode
 
     # === unquoted field ===
     else
@@ -316,6 +336,7 @@ def readFull(str, cfg)
       if s >= 0 and (nl is -1 or s < nl)
         row.push str.slice(pos, s)
         pos = s + sepLen
+        row.push '' if pos >= len or str.charCodeAt(pos) is nlCode
       else if nl >= 0
         row.push str.slice(pos, nl)
         pos = nl + crlfLen(str, nl)
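The two one-line additions in these hunks cover the trailing-separator case: a line ending in the separator should yield a final empty field. A minimal Python sketch of the rule, assuming a comma separator (not the package's actual indexOf-ratchet engine):

```python
# After consuming a separator, if input ends (or a newline follows), the row
# gets a final empty field — so 'a,b,' splits into three fields, not two.
def split_line(line: str) -> list[str]:
    row, pos = [], 0
    while pos <= len(line):
        s = line.find(',', pos)
        if s < 0:
            row.append(line[pos:])
            break
        row.append(line[pos:s])
        pos = s + 1
        if pos >= len(line):  # trailing separator: emit the empty field
            row.append('')
            break
    return row

print(split_line('a,b,'))  # ['a', 'b', '']
print(split_line('a,b'))   # ['a', 'b']
```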
@@ -429,3 +450,63 @@ export CSV =
   # format a single row (convenience — creates a one-shot Writer)
   formatRow: (row, opts = {}) ->
     new Writer(opts).row(row)
+
+# ==============================================================================
+# CLI — run directly with: bun csv.rip [options] <input> [output]
+# ==============================================================================
+
+if import.meta.main
+  VERSION = Bun.file(import.meta.dir + "/package.json").json!.version
+
+  args = process.argv.slice(2)
+  readOpts = {relax: false, excel: false, strip: false}
+  writeOpts = {excel: false}
+  files = []
+
+  for arg in args
+    switch arg
+      when '-v', '--version' then (p "csv #{VERSION}"; exit)
+      when '-h', '--help'
+        p """
+          csv #{VERSION} — Fast, flexible CSV parser and writer
+
+          Usage: bun csv.rip [options] <input> [output]
+
+          Read options:
+            -r, --relax    Recover from stray/malformed quotes
+            -e, --excel    Handle Excel ="..." literals on input
+            -s, --strip    Strip whitespace from fields
+
+          Write options:
+            -z, --zeros    Protect leading zeros with ="0123"
+
+          General:
+            -v, --version  Show version
+            -h, --help     Show this help
+
+          If output is omitted, writes to stdout.
+        """
+        exit
+      when '-r', '--relax' then readOpts.relax = true
+      when '-e', '--excel' then readOpts.excel = true
+      when '-s', '--strip' then readOpts.strip = true
+      when '-z', '--zeros' then writeOpts.excel = true
+      else files.push arg
+
+  unless files.length
+    p "csv: no input file specified (use --help for usage)"
+    exit 1
+
+  input = files[0]
+  output = files[1]
+
+  str = Bun.file(input).text!
+  cfg = probe(str, readOpts)
+  rows = if cfg.hasQuotes then readFull(cfg.str, cfg) else readFast(cfg.str, cfg)
+  writer = new Writer(writeOpts)
+
+  if output
+    Bun.write! output, writer.rows(rows)
+    p "#{rows.length} rows: #{input} -> #{output}"
+  else
+    process.stdout.write writer.rows(rows)
package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@rip-lang/csv",
-  "version": "1.
+  "version": "1.3.0",
   "description": "Fast, flexible CSV parser and writer for Rip — indexOf ratchet engine, auto-detection, zero dependencies",
   "type": "module",
   "main": "csv.rip",
@@ -8,7 +8,7 @@
     ".": "./csv.rip"
   },
   "scripts": {
-    "test": "rip test/
+    "test": "rip test/test.rip"
   },
   "keywords": [
     "csv",