@rip-lang/csv 1.2.3 → 1.3.0

Files changed (3)
  1. package/README.md +101 -19
  2. package/csv.rip +83 -2
  3. package/package.json +2 -2
package/README.md CHANGED
@@ -129,6 +129,35 @@ rows = CSV.read '="01",hello\n', excel: true
  rows = CSV.read str, relax: true
  ```
 
+ ## Special Cases (Relax + Excel)
+
+ When `relax: true` and `excel: true` are both enabled, the parser recovers
+ from common real-world CSV malformations — stray quotes, unescaped embedded
+ quotes, and Excel `="..."` literals. These patterns appear frequently in
+ exports from systems like Labcorp, legacy Excel, and other enterprise tools.
+
+ The following table shows how the parser handles each case:
+
+ | Row | Input | Fields | Key behavior |
+ |-----|-------|--------|-------------|
+ | 0 | `"AAA "BBB",CCC,"DDD"` | 3 | Stray quotes recovered (relax) |
+ | 1 | `"CHUI, LOK HANG "BENNY",…,=""` | 5 | Stray quotes + excel empty |
+ | 2 | `"Don",="007",10,"Ed"` | 4 | Excel literal preserves leading zero |
+ | 6 | `Charlie or "Chuck",=B2 + B3,9` | 3 | Unquoted stray quotes + bare formula |
+ | 10 | `A,B,C",D` | 4 | Trailing stray quote preserved |
+ | 12 | `…,"CHO, JOELLE "JOJO"",08/19/2022` | 7 | Stray quotes + excel literals |
+ | 14 | `"CHO, JOELLE "JOJO"",456` | 3 | Stray quotes (relax) |
+ | 15 | `"CHO, JOELLE ""JOJO""",456` | 3 | Properly doubled quotes — same result |
+ | 16 | `=,=x,x=,="x",="","","=",…` | 11 | Full excel + quoting matrix |
+
+ ```coffee
+ # Parse messy real-world CSV with both modes enabled
+ rows = CSV.read str, relax: true, excel: true
+
+ # Load a Labcorp file
+ rows = CSV.load! 'labcorp.csv', relax: true, excel: true, headers: true
+ ```
+
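As a cross-check of the table above, the two well-formed cases (Excel `="..."` literals, row 2, and properly doubled quotes, row 15) can be reproduced with a tiny standalone parser. This Python sketch is illustrative only; it is not the library's implementation (the library is written in Rip), and it does not attempt the relax-mode stray-quote recoveries:

```python
def parse_line(line):
    """Toy single-line CSV parser: well-formed "quoted" fields ("" escapes a
    quote) plus Excel ="..." literals. Illustrative only; the real library
    uses an indexOf-based scanner and adds relax recovery on top."""
    fields, pos, n = [], 0, len(line)
    while True:
        if line.startswith('="', pos):            # Excel literal: ="007" -> 007
            end = line.index('"', pos + 2)
            fields.append(line[pos + 2:end])
            pos = end + 1
        elif pos < n and line[pos] == '"':        # quoted field
            pos += 1
            buf = []
            while pos < n:
                if line[pos] == '"':
                    if line[pos + 1:pos + 2] == '"':   # doubled quote -> literal "
                        buf.append('"')
                        pos += 2
                    else:
                        pos += 1                        # closing quote
                        break
                else:
                    buf.append(line[pos])
                    pos += 1
            fields.append(''.join(buf))
        else:                                     # bare field
            end = line.find(',', pos)
            if end == -1:
                fields.append(line[pos:])
                pos = n
            else:
                fields.append(line[pos:end])
                pos = end
        if pos >= n:
            break
        pos += 1                                  # consume the separator
        if pos >= n:                              # trailing comma: empty field
            fields.append('')
            break
    return fields

assert parse_line('"Don",="007",10,"Ed"') == ['Don', '007', '10', 'Ed']
assert parse_line('"CHO, JOELLE ""JOJO""",456') == ['CHO, JOELLE "JOJO"', '456']
```

The stray-quote rows are precisely the inputs this toy parser cannot handle, which is what the `relax: true` heuristics in `csv.rip` are for.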
  ## Writing
 
  ### Basic Writing
@@ -235,27 +264,80 @@ CSV.writer(opts) # create reusable Writer instance
  CSV.formatRow(row, opts) # format single row -> string
  ```
 
+ ## CLI
+
+ The library doubles as a command-line tool for converting CSV files:
+
+ ```bash
+ # Clean up a malformed Labcorp file
+ bun csv.rip -r -e input.csv output.csv
+
+ # Protect leading zeros for Google Sheets / Excel
+ bun csv.rip -r -e -z input.csv output.csv
+
+ # Pipe to stdout
+ bun csv.rip -r -e input.csv
+
+ # Show version
+ bun csv.rip -v
+ ```
+
+ ```
+ Usage: bun csv.rip [options] <input> [output]
+
+ Read options:
+   -r, --relax      Recover from stray/malformed quotes
+   -e, --excel      Handle Excel ="..." literals on input
+   -s, --strip      Strip whitespace from fields
+
+ Write options:
+   -z, --zeros      Protect leading zeros with ="0123"
+
+ General:
+   -v, --version    Show version
+   -h, --help       Show this help
+
+ If output is omitted, writes to stdout.
+ ```
+
  ## Performance
 
- The parser consistently delivers **300-430 MB/s** throughput on real-world
- CSV files, scaling linearly from kilobytes to gigabytes:
-
- | File | Size | Rows | Fields/row | Time | Throughput |
- |------|------|------|-----------|------|-----------|
- | Medical records | 10.5 MB | 43,962 | 44 | 39ms | 269 MB/s |
- | Japanese postal codes | 10.9 MB | 124,565 | 15 | 26ms | 414 MB/s |
- | Geodata | 24.8 MB | 662,061 | 6 | 65ms | 382 MB/s |
- | Lab results (large) | 137.3 MB | 493,962 | 44 | 466ms | 294 MB/s |
- | Lab results (XL) | 315.8 MB | 997,195 | 44 | 1.1s | 287 MB/s |
- | Lab results (1GB+) | 1.2 GB | 3,497,822 | 44 | 4.1s | 298 MB/s |
-
- Quote-free files hit the fast path (~420 MB/s). Files with quoted fields
- use the full path (~300 MB/s). The `each` callback mode is slightly faster
- than array mode since it skips array allocation.
-
- For context, popular JS CSV parsers typically achieve 30-120 MB/s (Papa Parse,
- csv-parse, d3-dsv). This library is comfortably in the top tier of the JS
- ecosystem.
+ The parser consistently delivers **250-530 MB/s** throughput on real-world
+ CSV files with `relax: true, excel: true` enabled:
+
+ | File | Size | Rows | Time | Throughput |
+ |------|------|------|------|-----------|
+ | Geodata | 24.8 MB | 662,061 | 75ms | 329 MB/s |
+ | Medical records | 22.8 MB | 93,963 | 86ms | 264 MB/s |
+ | Japanese postal codes | 10.9 MB | 124,565 | 29ms | 370 MB/s |
+ | Japanese postal codes (100K) | 8.8 MB | 100,000 | 20ms | 442 MB/s |
+ | Labcorp charges | 3.6 MB | 20,035 | 7ms | 528 MB/s |
+ | UTF-8 data | 2.5 MB | 30,000 | 7ms | 354 MB/s |
+ | Mixed data | 2.6 MB | 30,000 | 8ms | 340 MB/s |
+ | Lab results | 1.8 MB | 4,894 | 7ms | 254 MB/s |
+
+ Quote-free files hit the fast path (~440 MB/s). Files with quoted fields
+ use the full path (~300 MB/s). The relax+excel heuristics add zero overhead
+ on clean data; they only fire when an actual stray quote is encountered.
+
+ ### Comparison with Other JS Parsers
+
+ Benchmarked against the [uDSV benchmark suite](https://github.com/leeoniya/uDSV/tree/main/bench)
+ (the most comprehensive JS CSV benchmark), which tests ~20 parsers on Bun:
+
+ | Parser | Strings | Quoted | Large (36 MB) | Notes |
+ |--------|---------|--------|---------------|-------|
+ | **Rip CSV** | **~370 MB/s** | **~330 MB/s** | **~329 MB/s** | indexOf ratchet, relax+excel |
+ | uDSV | 287 MB/s | 188 MB/s | 293 MB/s | Fastest pure-JS parser (5KB) |
+ | csv-simple-parser | 223 MB/s | 206 MB/s | 233 MB/s | |
+ | d3-dsv | 275 MB/s | 110 MB/s | 285 MB/s | |
+ | PapaParse | 252 MB/s | 59 MB/s | 292 MB/s | Drops 4x on quoted data |
+ | csv-parse/sync | 20 MB/s | 19 MB/s | 18 MB/s | Sync API of the csv-parse package |
+
+ Rip CSV is in the same tier as uDSV — the acknowledged fastest JS CSV parser —
+ while also supporting relax mode and Excel literal recovery that no other
+ parser offers. On quoted files, Rip CSV is **5x faster** than PapaParse and
+ **15x faster** than csv-parse.
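The throughput column is just size divided by wall-clock time. A quick sanity check against the Geodata row; the small gap versus the reported 329 MB/s comes from the time being rounded to whole milliseconds:

```python
def mb_per_s(size_mb, time_ms):
    """Throughput in MB/s: megabytes divided by seconds."""
    return size_mb / (time_ms / 1000.0)

# Geodata: 24.8 MB parsed in 75 ms
print(round(mb_per_s(24.8, 75)))   # about 331 MB/s, vs. 329 reported
```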
 
  ## Roadmap
 
package/csv.rip CHANGED
@@ -236,6 +236,11 @@ def readFull(str, cfg)
   if c is quoteCode or (excel and c is EQ and str.charCodeAt(pos + 1) is quoteCode)
     if excel and c is EQ
       pos += 2 # skip ="
+      # relax: skip extra " in ="" when followed by content (not a real empty literal)
+      if relax and pos < len and str.charCodeAt(pos) is quoteCode
+        p2 = pos + quote.length
+        if p2 < len and str.charCodeAt(p2) isnt sepCode and str.charCodeAt(p2) isnt nlCode
+          pos += quote.length
     else
       pos += 1 # skip opening quote
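In Python terms, the `=""` disambiguation added above amounts to a one-character lookahead past the quote; the function name and separator set below are illustrative, not the library's:

```python
def skip_extra_quote(s, pos):
    """After consuming =" decide what the next character means:
    a quote followed by a separator/newline closes a real empty ="" literal;
    a quote followed by content is a stray extra quote and is skipped.
    Returns the new scan position (illustrative names, not the library's)."""
    if pos < len(s) and s[pos] == '"':
        p2 = pos + 1
        if p2 < len(s) and s[p2] not in (',', '\n'):
            return pos + 1          # stray quote: skip it, keep parsing content
    return pos                      # real empty literal (or nothing to skip)

assert skip_extra_quote('="",x', 2) == 2   # ="" then separator: keep as-is
assert skip_extra_quote('=""x,y', 2) == 3  # ="" then content: skip the quote
```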
 
@@ -260,6 +265,11 @@ def readFull(str, cfg)
   if pos < len and str.charCodeAt(pos) is quoteCode
     field += quote
     pos += quote.length
+    # relax+excel heuristic: "",=" means close, not just escape
+    if relax and excel and pos < len and str.charCodeAt(pos) is sepCode
+      p2 = pos + sepLen
+      if p2 < len and str.charCodeAt(p2) is EQ and p2 + 1 < len and str.charCodeAt(p2 + 1) is quoteCode
+        break
     continue
   else
     # backslash escape: \" -> "
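The lookahead added in this hunk can be stated on its own: after treating `""` as an escaped quote, peek at the next characters; if they are `,="`, a new Excel literal starts, so the doubled quote must really have closed the field. A hedged Python sketch (names are invented for illustration):

```python
def escaped_quote_actually_closes(s, pos):
    """After treating "" as an escaped quote, peek ahead: if the very next
    characters are ,=" then the next field is an Excel literal, so the
    doubled quote was really a field closer, not an escape.
    `s` is the raw line, `pos` the index just past the doubled quote."""
    if pos < len(s) and s[pos] == ',':
        return s[pos + 1:pos + 3] == '="'
    return False

# ..."",="007"... : the "" closed the field
assert escaped_quote_actually_closes('",="007"', 1)
assert not escaped_quote_actually_closes('",abc', 1)
```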
@@ -277,8 +287,17 @@ def readFull(str, cfg)
   unless relax
     throw new Error "CSV: unexpected character after quote at position #{pos}"
 
-  # relax mode: treat the quote as literal, keep scanning
-  field += quote
+  # relax mode: stray quote, scan through to the next quote
+  q2 = str.indexOf(quote, pos)
+  unless q2 >= 0
+    field += str.slice(pos)
+    pos = len
+    break
+  field += quote + str.slice(pos, q2 + quote.length)
+  pos = q2 + quote.length
+  break if pos >= len
+  c2 = str.charCodeAt(pos)
+  break if c2 is sepCode or c2 is nlCode
   continue
 
   # push field and consume trailing delimiter
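The new relax branch replaces the old one-character recovery with a scan: find the next quote with a single `indexOf`, keep the whole span as literal text, and close the field only if a separator or newline (or end of input) follows. The same logic in a standalone Python sketch (illustrative names):

```python
def recover_stray(s, pos, field):
    """Relax-mode sketch: a quote appeared to close the field but ordinary
    text follows, so treat it as stray. Append it plus everything through
    the next quote; close the field if a separator/newline (or end of
    input) follows. Returns (field, new_pos, closed)."""
    q2 = s.find('"', pos)
    if q2 < 0:                          # no further quote: take the rest
        return field + s[pos:], len(s), True
    field += '"' + s[pos:q2 + 1]        # stray quote + span, quote inclusive
    pos = q2 + 1
    if pos >= len(s) or s[pos] in (',', '\n'):
        return field, pos, True         # field ends here
    return field, pos, False            # another stray: caller keeps scanning

# '"AAA "BBB",CCC': the quote at index 5 "closed" the field, but B follows
assert recover_stray('"AAA "BBB",CCC', 6, 'AAA ') == ('AAA "BBB"', 10, True)
```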
@@ -305,6 +324,7 @@ def readFull(str, cfg)
   else if c is sepCode
     row.push ''
     pos += sepLen
+    row.push '' if pos >= len or str.charCodeAt(pos) is nlCode
 
   # === unquoted field ===
   else
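The `row.push ''` guard fixes a fencepost: a separator at end of line (or end of input) implies one more, empty, field, so `A,B,` has three fields. The rule in isolation, as a naive Python splitter (the real code also checks for a following newline):

```python
def read_row(line):
    """Separator loop with the trailing-empty-field guard: a comma at the
    very end of the input implies one more (empty) field."""
    row, pos, n = [], 0, len(line)
    while pos < n:
        end = line.find(',', pos)
        if end == -1:
            row.append(line[pos:])
            pos = n
        else:
            row.append(line[pos:end])
            pos = end + 1
            if pos >= n:        # the guard: separator at end of input
                row.append('')
    return row

assert read_row('A,B,') == ['A', 'B', '']
assert read_row('A,B,C') == ['A', 'B', 'C']
```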
@@ -316,6 +336,7 @@ def readFull(str, cfg)
   if s >= 0 and (nl is -1 or s < nl)
     row.push str.slice(pos, s)
     pos = s + sepLen
+    row.push '' if pos >= len or str.charCodeAt(pos) is nlCode
   else if nl >= 0
     row.push str.slice(pos, nl)
     pos = nl + crlfLen(str, nl)
@@ -429,3 +450,63 @@ export CSV =
   # format a single row (convenience — creates a one-shot Writer)
   formatRow: (row, opts = {}) ->
     new Writer(opts).row(row)
+
+ # ==============================================================================
+ # CLI — run directly with: bun csv.rip [options] <input> [output]
+ # ==============================================================================
+
+ if import.meta.main
+   VERSION = Bun.file(import.meta.dir + "/package.json").json!.version
+
+   args = process.argv.slice(2)
+   readOpts = {relax: false, excel: false, strip: false}
+   writeOpts = {excel: false}
+   files = []
+
+   for arg in args
+     switch arg
+       when '-v', '--version' then (p "csv #{VERSION}"; exit)
+       when '-h', '--help'
+         p """
+           csv #{VERSION} — Fast, flexible CSV parser and writer
+
+           Usage: bun csv.rip [options] <input> [output]
+
+           Read options:
+             -r, --relax      Recover from stray/malformed quotes
+             -e, --excel      Handle Excel ="..." literals on input
+             -s, --strip      Strip whitespace from fields
+
+           Write options:
+             -z, --zeros      Protect leading zeros with ="0123"
+
+           General:
+             -v, --version    Show version
+             -h, --help       Show this help
+
+           If output is omitted, writes to stdout.
+           """
+         exit
+       when '-r', '--relax' then readOpts.relax = true
+       when '-e', '--excel' then readOpts.excel = true
+       when '-s', '--strip' then readOpts.strip = true
+       when '-z', '--zeros' then writeOpts.excel = true
+       else files.push arg
+
+   unless files.length
+     p "csv: no input file specified (use --help for usage)"
+     exit 1
+
+   input = files[0]
+   output = files[1]
+
+   str = Bun.file(input).text!
+   cfg = probe(str, readOpts)
+   rows = if cfg.hasQuotes then readFull(cfg.str, cfg) else readFast(cfg.str, cfg)
+   writer = new Writer(writeOpts)
+
+   if output
+     Bun.write! output, writer.rows(rows)
+     p "#{rows.length} rows: #{input} -> #{output}"
+   else
+     process.stdout.write writer.rows(rows)
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@rip-lang/csv",
-   "version": "1.2.3",
+   "version": "1.3.0",
    "description": "Fast, flexible CSV parser and writer for Rip — indexOf ratchet engine, auto-detection, zero dependencies",
    "type": "module",
    "main": "csv.rip",
@@ -8,7 +8,7 @@
      ".": "./csv.rip"
    },
    "scripts": {
-     "test": "rip test/basic.rip"
+     "test": "rip test/test.rip"
    },
    "keywords": [
      "csv",