fastcsv 0.0.2 → 0.0.3
- checksums.yaml +4 -4
- data/.travis.yml +11 -0
- data/README.md +37 -2
- data/TESTS.md +42 -0
- data/ext/fastcsv/fastcsv.c +281 -223
- data/ext/fastcsv/fastcsv.rl +149 -72
- data/fastcsv.gemspec +1 -1
- data/lib/fastcsv.rb +130 -0
- data/spec/fastcsv_spec.rb +189 -57
- data/spec/fixtures/csv.csv +3 -0
- data/spec/fixtures/iso-8859-1-quoted.csv +1 -0
- data/spec/fixtures/utf-8-quoted.csv +1 -0
- data/spec/spec_helper.rb +5 -0
- data/test/csv/base.rb +8 -0
- data/test/csv/line_endings.gz +0 -0
- data/test/csv/test_csv_parsing.rb +221 -0
- data/test/csv/test_csv_writing.rb +97 -0
- data/test/csv/test_data_converters.rb +263 -0
- data/test/csv/test_encodings.rb +339 -0
- data/test/csv/test_features.rb +317 -0
- data/test/csv/test_headers.rb +289 -0
- data/test/csv/test_interface.rb +362 -0
- data/test/csv/test_row.rb +349 -0
- data/test/csv/test_table.rb +420 -0
- data/test/csv/ts_all.rb +20 -0
- data/test/runner.rb +36 -0
- data/test/with_different_ofs.rb +17 -0
- metadata +38 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 1f1c56bdc3a7600dfb311eee7a98f07fe0d0e575
+data.tar.gz: 2fae9ec519e877178a81dfb6f139c0ce94c974d2
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: d9063634cca29ad95e961ee7bfa7269dc906c39a70a8317e35ea6ffc5991bd0caadf5b985040638d392bcbd49f1931d991261377c9a94c04d31b0147a8e9d721
+data.tar.gz: 51314a4e948a996ad1546097ac7ad5c9d9d72eb6e8ef38a8874039920814445bee5de121225b81d5e2d461037b7fe23b1595c6be1f54dc93576aa7497681d91f
data/.travis.yml
ADDED
data/README.md
CHANGED
@@ -1,12 +1,19 @@
 # FastCSV
 
 [![Gem Version](https://badge.fury.io/rb/fastcsv.svg)](http://badge.fury.io/rb/fastcsv)
+[![Build Status](https://secure.travis-ci.org/opennorth/fastcsv.png)](http://travis-ci.org/opennorth/fastcsv)
 [![Dependency Status](https://gemnasium.com/opennorth/fastcsv.png)](https://gemnasium.com/opennorth/fastcsv)
+[![Coverage Status](https://coveralls.io/repos/opennorth/fastcsv/badge.png?branch=master)](https://coveralls.io/r/opennorth/fastcsv)
+[![Code Climate](https://codeclimate.com/github/opennorth/fastcsv.png)](https://codeclimate.com/github/opennorth/fastcsv)
 
 A fast [Ragel](http://www.colm.net/open-source/ragel/)-based CSV parser.
 
+**Only reads CSVs using `"` as the quote character, `,` as the delimiter and `\r`, `\n` or `\r\n` as the line terminator.**
+
 ## Usage
 
+`FastCSV.raw_parse` is implemented in C and is the fastest way to read CSVs with FastCSV.
+
 ```ruby
 require 'fastcsv'
 
@@ -33,6 +40,18 @@ FastCSV.raw_parse("\xF1\n", encoding: 'iso-8859-1:utf-8') do |row|
 end
 ```
 
+FastCSV can be used as a drop-in replacement for [CSV](http://ruby-doc.org/stdlib-2.1.1/libdoc/csv/rdoc/CSV.html) (replace `CSV` with `FastCSV`) except:
+
+* The `:quote_char` (`"`), `:col_sep` (`,`) and `:row_sep` (`:auto`) options are ignored. [#2](https://github.com/opennorth/fastcsv/issues/2)
+* If FastCSV raises an error, you can't continue reading. [#3](https://github.com/opennorth/fastcsv/issues/3) Its error messages don't perfectly match those of CSV.
+
+A few minor caveats:
+
+* Use `FastCSV.parse_line(string, options)` instead of `string.parse_csv(options)`.
+* If you were passing CSV an IO object on which you had wrapped `#gets` (for example, as described in [this article](http://graysoftinc.com/rubies-in-the-rough/decorators-verses-the-mix-in)), `#gets` will not be called.
+* The `:field_size_limit` option is ignored. If you need to prevent DoS attacks – the [ostensible reason](http://ruby-doc.org/stdlib-2.1.1/libdoc/csv/rdoc/CSV.html#new-method) for this option – limit the size of the input, not the size of quoted fields.
+* FastCSV doesn't support UTF-16 or UTF-32. See [UTF-8 Everywhere](http://utf8everywhere.org/).
+
 ## Development
 
 ragel -G2 ext/fastcsv/fastcsv.rl
@@ -40,10 +59,26 @@ end
 rake compile
 gem uninstall fastcsv
 rake install
+rake
+rspec test/runner.rb test/csv
+
+### Implementation
+
+FastCSV implements its Ragel-based CSV parser in C at `FastCSV::Parser`.
+
+FastCSV is a subclass of [CSV](http://ruby-doc.org/stdlib-2.1.1/libdoc/csv/rdoc/CSV.html). It overrides `#shift`, replacing the parsing code, in order to act as a drop-in replacement.
+
+FastCSV's `raw_parse` requires a block to which it yields one row at a time. FastCSV uses [Fiber](http://www.ruby-doc.org/core-2.1.1/Fiber.html)s to pass control back to `#shift` while parsing.
+
+CSV delegates IO methods to the IO object it's reading. IO methods that move the pointer within the file like `rewind` changes the behavior of CSV's `#shift`. However, FastCSV's C code won't take notice. We therefore null the Fiber whenever the pointer is moved, so that `#shift` uses a new Fiber.
+
+CSV's `#shift` runs the regular expression in the `:skip_lines` option against a row's raw text. `FastCSV::Parser` implements a `row` method, which returns the most recently parsed row's raw text.
+
+FastCSV is tested against the same tests as CSV. See [TESTS.md](https://github.com/opennorth/fastcsv/blob/master/TESTS.md) for details.
 
 ## Why?
 
-We evaluated [many CSV Ruby gems](https://github.com/jpmckinney/csv-benchmark#benchmark), and they were either too slow or had implementation errors. [rcsv](https://github.com/fiksu/rcsv) is fast and [libcsv](http://sourceforge.net/projects/libcsv/)-based, but it skips blank rows (Ruby's CSV module returns an empty array) and silently fails on input with an unclosed quote
+We evaluated [many CSV Ruby gems](https://github.com/jpmckinney/csv-benchmark#benchmark), and they were either too slow or had implementation errors. [rcsv](https://github.com/fiksu/rcsv) is fast and [libcsv](http://sourceforge.net/projects/libcsv/)-based, but it skips blank rows (Ruby's CSV module returns an empty array) and silently fails on input with an unclosed quote. [bamfcsv](https://github.com/jondistad/bamfcsv) is well implemented, but it's considerably slower on large files. We looked for Ragel-based CSV parsers to copy, but they either had implementation errors or could not handle large files. [commas](https://github.com/aklt/commas/blob/master/csv.rl) looks good, but it performs a memory check on each character, which is overkill.
 
 ## Bugs? Questions?
 
@@ -51,6 +86,6 @@ This project's main repository is on GitHub: [http://github.com/opennorth/fastcs
 
 ## Acknowledgements
 
-Started as a Ruby 2.1 fork of MoonWolf <moonwolf@moonwolf.com>'s CSVScan, found in [this commit](https://github.com/nickstenning/csvscan/commit/11ec30f71a27cc673bca09738ee8a63942f416f0.patch). CSVScan uses Ragel code from [HPricot](https://github.com/hpricot/hpricot/blob/master/ext/hpricot_scan/hpricot_scan.rl) from [this commit](https://github.com/hpricot/hpricot/blob/908a4ae64bc8b935c4415c47ca6aea6492c6ce0a/ext/hpricot_scan/hpricot_scan.rl).
+Started as a Ruby 2.1 fork of MoonWolf <moonwolf@moonwolf.com>'s CSVScan, found in [this commit](https://github.com/nickstenning/csvscan/commit/11ec30f71a27cc673bca09738ee8a63942f416f0.patch). CSVScan uses Ragel code from [HPricot](https://github.com/hpricot/hpricot/blob/master/ext/hpricot_scan/hpricot_scan.rl) from [this commit](https://github.com/hpricot/hpricot/blob/908a4ae64bc8b935c4415c47ca6aea6492c6ce0a/ext/hpricot_scan/hpricot_scan.rl). Most of the Ruby (i.e. non-C, non-Ragel) methods are copied from [CSV](https://github.com/ruby/ruby/blob/ab337e61ecb5f42384ba7d710c36faf96a454e5c/lib/csv.rb).
 
 Copyright (c) 2014 Open North Inc., released under the MIT license
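The Implementation notes added to the README say that `raw_parse` yields rows to a block, that Fibers hand control back to `#shift`, and that the Fiber is dropped whenever the IO pointer moves. A minimal sketch of that pattern in plain Ruby might look like this. `toy_raw_parse` and `PullReader` are hypothetical stand-ins, not FastCSV's actual code, and the toy parser only splits on commas (no quoting).

```ruby
# Hypothetical stand-in for a block-based parser such as FastCSV.raw_parse:
# it yields one row at a time and cannot be paused by the caller.
def toy_raw_parse(text)
  text.each_line { |line| yield line.chomp.split(',') }
end

# Hypothetical pull-style reader; illustrates the Fiber hand-off only.
class PullReader
  def initialize(text)
    @text = text
    rewind
  end

  # Returns the next row, or nil once the input is exhausted.
  def shift
    return nil if @done

    # Lazily start a Fiber that runs the block-based parser and
    # suspends itself after every row.
    @fiber ||= Fiber.new do
      toy_raw_parse(@text) { |row| Fiber.yield(row) }
      @done = true
      nil # value returned by the final resume
    end
    @fiber.resume
  end

  # Dropping the Fiber makes the next #shift start parsing from scratch,
  # mirroring the "null the Fiber when the pointer moves" idea.
  def rewind
    @fiber = nil
    @done = false
  end
end

reader = PullReader.new("a,b\nc,d\n")
p reader.shift # => ["a", "b"]
reader.rewind
p reader.shift # => ["a", "b"] again, from a fresh Fiber
```

The `@done` flag is a design choice of this sketch: resuming a finished Fiber raises `FiberError`, so the reader remembers that parsing has ended and keeps returning `nil`.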
data/TESTS.md
ADDED
@@ -0,0 +1,42 @@
+Here are some notes on maintaining the `test/` directory.
+
+1. Download Ruby and [test CSV](http://ruby-doc.org/core-2.1.0/doc/contributing_rdoc.html#label-Running+tests).
+
+git clone https://github.com/ruby/ruby.git
+cd ruby
+git co v2_1_2
+gem uninstall minitest
+gem install minitest --version 4.7.5
+ruby test/runner.rb test/csv
+
+1. Copy the tests into the project. All the tests should pass.
+
+cd PROJECT
+mkdir test
+cp path/to/ruby/test/runner.rb test
+cp path/to/ruby/test/with_different_ofs.rb test
+cp -r path/to/ruby/test/csv test/csv
+ruby test/runner.rb test/csv
+
+1. Replace `\bCSV\b` with `FastCSV`. And run:
+
+sed -i.bak '1s;^;require "fastcsv"\
+;' test/runner.rb
+
+1. In `test_interface.rb`, replace `\\t|;|(?<=\S)\|(?=\S)` with `,`. In `test_encodings.rb`, replace `(?<=[^\s{])\|(?=\S)` with `,` and replace `Encoding.list` with `Encoding.list.reject{|e| e.name[/\AUTF-\d\d/]}`. These changes are because `:col_sep`, `:row_sep` and `:quote_char` are ignored and because UTF-16 and UTF-32 aren't supported.
+
+1. Comment these tests because `:col_sep`, `:row_sep` and `:quote_char` are ignored:
+
+* `test_csv_parsing.rb`: the first part of `test_malformed_csv`
+* `test_features.rb`: `test_col_sep`, `test_row_sep`, `test_quote_char`, `test_leading_empty_fields_with_multibyte_col_sep_bug_fix`
+* `test_headers.rb`: `test_csv_header_string_inherits_separators`
+
+1. Comment these tests in `test_csv_encoding.rb` because UTF-16 and UTF-32 aren't supported:
+
+* `test_parses_utf16be_encoding`
+* the second part of `test_open_allows_you_to_set_encodings`
+* the second part of `test_foreach_allows_you_to_set_encodings`
+* the second part of `test_read_allows_you_to_set_encodings`
+* the second line of `encode_for_tests`
+
+1. Comment `test_field_size_limit_controls_lookahead` in `test_csv_parsing.rb` (`:field_size_limit` not supported). FastCSV reads one more line than CSV in `test_malformed_csv`, but not sure that's worth mirroring.
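TESTS.md swaps `Encoding.list` for `Encoding.list.reject{|e| e.name[/\AUTF-\d\d/]}` so that the encoding tests skip UTF-16 and UTF-32. The snippet below is a small sketch of what that filter keeps and drops. The regular expression is quoted from TESTS.md; the helper names `wide_utf?` and `supported_encodings` are illustrative only and do not appear in the gem.

```ruby
# Pattern quoted from TESTS.md: an encoding name that starts with "UTF-"
# followed by two digits, which covers the UTF-16 and UTF-32 families.
WIDE_UTF = /\AUTF-\d\d/

# Hypothetical helper: true for encodings the filter removes.
def wide_utf?(encoding)
  !encoding.name[WIDE_UTF].nil?
end

# Hypothetical helper: the list the modified tests would iterate over.
def supported_encodings
  Encoding.list.reject { |e| wide_utf?(e) }
end

puts wide_utf?(Encoding::UTF_8)     # false: only one digit follows "UTF-"
puts wide_utf?(Encoding::UTF_16LE)  # true
puts wide_utf?(Encoding::UTF_32BE)  # true
puts supported_encodings.include?(Encoding::ISO_8859_1) # true
```

Because the pattern requires two digits, single-digit names such as `UTF-8` pass through untouched, which is why the filter can stay this short.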