fastcsv 0.0.2 → 0.0.3
- checksums.yaml +4 -4
- data/.travis.yml +11 -0
- data/README.md +37 -2
- data/TESTS.md +42 -0
- data/ext/fastcsv/fastcsv.c +281 -223
- data/ext/fastcsv/fastcsv.rl +149 -72
- data/fastcsv.gemspec +1 -1
- data/lib/fastcsv.rb +130 -0
- data/spec/fastcsv_spec.rb +189 -57
- data/spec/fixtures/csv.csv +3 -0
- data/spec/fixtures/iso-8859-1-quoted.csv +1 -0
- data/spec/fixtures/utf-8-quoted.csv +1 -0
- data/spec/spec_helper.rb +5 -0
- data/test/csv/base.rb +8 -0
- data/test/csv/line_endings.gz +0 -0
- data/test/csv/test_csv_parsing.rb +221 -0
- data/test/csv/test_csv_writing.rb +97 -0
- data/test/csv/test_data_converters.rb +263 -0
- data/test/csv/test_encodings.rb +339 -0
- data/test/csv/test_features.rb +317 -0
- data/test/csv/test_headers.rb +289 -0
- data/test/csv/test_interface.rb +362 -0
- data/test/csv/test_row.rb +349 -0
- data/test/csv/test_table.rb +420 -0
- data/test/csv/ts_all.rb +20 -0
- data/test/runner.rb +36 -0
- data/test/with_different_ofs.rb +17 -0
- metadata +38 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1f1c56bdc3a7600dfb311eee7a98f07fe0d0e575
+  data.tar.gz: 2fae9ec519e877178a81dfb6f139c0ce94c974d2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d9063634cca29ad95e961ee7bfa7269dc906c39a70a8317e35ea6ffc5991bd0caadf5b985040638d392bcbd49f1931d991261377c9a94c04d31b0147a8e9d721
+  data.tar.gz: 51314a4e948a996ad1546097ac7ad5c9d9d72eb6e8ef38a8874039920814445bee5de121225b81d5e2d461037b7fe23b1595c6be1f54dc93576aa7497681d91f
|
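The hunk above changes only the SHA1 and SHA512 digests recorded for `metadata.gz` and `data.tar.gz`. As a minimal sketch of how such a recorded digest could be checked against a file's bytes, the Ruby below uses the standard `digest` library. The helper name `digest_matches?` and the sample input are illustrative assumptions, not part of RubyGems or fastcsv.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper (not a RubyGems API): returns true when the hex digest
# of +bytes+ under +algorithm+ ('SHA1' or 'SHA512') equals +expected_hex+.
def digest_matches?(bytes, algorithm, expected_hex)
  klass = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }.fetch(algorithm)
  klass.hexdigest(bytes) == expected_hex.strip.downcase
end

# In practice +bytes+ would come from File.binread('data.tar.gz') inside an
# unpacked .gem archive; a short literal keeps this sketch self-contained.
sample = 'abc'
puts digest_matches?(sample, 'SHA1', 'a9993e364706816aba3e25717850c26c9cd0d89d')
```

Passing an unknown algorithm name raises `KeyError` from `Hash#fetch`, which seems preferable to silently skipping the check.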
data/.travis.yml
ADDED
data/README.md
CHANGED
@@ -1,12 +1,19 @@
 # FastCSV
 
 [](http://badge.fury.io/rb/fastcsv)
+[](http://travis-ci.org/opennorth/fastcsv)
 [](https://gemnasium.com/opennorth/fastcsv)
+[](https://coveralls.io/r/opennorth/fastcsv)
+[](https://codeclimate.com/github/opennorth/fastcsv)
 
 A fast [Ragel](http://www.colm.net/open-source/ragel/)-based CSV parser.
 
+**Only reads CSVs using `"` as the quote character, `,` as the delimiter and `\r`, `\n` or `\r\n` as the line terminator.**
+
 ## Usage
 
+`FastCSV.raw_parse` is implemented in C and is the fastest way to read CSVs with FastCSV.
+
 ```ruby
 require 'fastcsv'
 
@@ -33,6 +40,18 @@ FastCSV.raw_parse("\xF1\n", encoding: 'iso-8859-1:utf-8') do |row|
 end
 ```
 
+FastCSV can be used as a drop-in replacement for [CSV](http://ruby-doc.org/stdlib-2.1.1/libdoc/csv/rdoc/CSV.html) (replace `CSV` with `FastCSV`) except:
+
+* The `:quote_char` (`"`), `:col_sep` (`,`) and `:row_sep` (`:auto`) options are ignored. [#2](https://github.com/opennorth/fastcsv/issues/2)
+* If FastCSV raises an error, you can't continue reading. [#3](https://github.com/opennorth/fastcsv/issues/3) Its error messages don't perfectly match those of CSV.
+
+A few minor caveats:
+
+* Use `FastCSV.parse_line(string, options)` instead of `string.parse_csv(options)`.
+* If you were passing CSV an IO object on which you had wrapped `#gets` (for example, as described in [this article](http://graysoftinc.com/rubies-in-the-rough/decorators-verses-the-mix-in)), `#gets` will not be called.
+* The `:field_size_limit` option is ignored. If you need to prevent DoS attacks – the [ostensible reason](http://ruby-doc.org/stdlib-2.1.1/libdoc/csv/rdoc/CSV.html#new-method) for this option – limit the size of the input, not the size of quoted fields.
+* FastCSV doesn't support UTF-16 or UTF-32. See [UTF-8 Everywhere](http://utf8everywhere.org/).
+
 ## Development
 
 ragel -G2 ext/fastcsv/fastcsv.rl
@@ -40,10 +59,26 @@ end
 rake compile
 gem uninstall fastcsv
 rake install
+rake
+rspec test/runner.rb test/csv
+
+### Implementation
+
+FastCSV implements its Ragel-based CSV parser in C at `FastCSV::Parser`.
+
+FastCSV is a subclass of [CSV](http://ruby-doc.org/stdlib-2.1.1/libdoc/csv/rdoc/CSV.html). It overrides `#shift`, replacing the parsing code, in order to act as a drop-in replacement.
+
+FastCSV's `raw_parse` requires a block to which it yields one row at a time. FastCSV uses [Fiber](http://www.ruby-doc.org/core-2.1.1/Fiber.html)s to pass control back to `#shift` while parsing.
+
+CSV delegates IO methods to the IO object it's reading. IO methods that move the pointer within the file like `rewind` changes the behavior of CSV's `#shift`. However, FastCSV's C code won't take notice. We therefore null the Fiber whenever the pointer is moved, so that `#shift` uses a new Fiber.
+
+CSV's `#shift` runs the regular expression in the `:skip_lines` option against a row's raw text. `FastCSV::Parser` implements a `row` method, which returns the most recently parsed row's raw text.
+
+FastCSV is tested against the same tests as CSV. See [TESTS.md](https://github.com/opennorth/fastcsv/blob/master/TESTS.md) for details.
 
 ## Why?
 
-We evaluated [many CSV Ruby gems](https://github.com/jpmckinney/csv-benchmark#benchmark), and they were either too slow or had implementation errors. [rcsv](https://github.com/fiksu/rcsv) is fast and [libcsv](http://sourceforge.net/projects/libcsv/)-based, but it skips blank rows (Ruby's CSV module returns an empty array) and silently fails on input with an unclosed quote
+We evaluated [many CSV Ruby gems](https://github.com/jpmckinney/csv-benchmark#benchmark), and they were either too slow or had implementation errors. [rcsv](https://github.com/fiksu/rcsv) is fast and [libcsv](http://sourceforge.net/projects/libcsv/)-based, but it skips blank rows (Ruby's CSV module returns an empty array) and silently fails on input with an unclosed quote. [bamfcsv](https://github.com/jondistad/bamfcsv) is well implemented, but it's considerably slower on large files. We looked for Ragel-based CSV parsers to copy, but they either had implementation errors or could not handle large files. [commas](https://github.com/aklt/commas/blob/master/csv.rl) looks good, but it performs a memory check on each character, which is overkill.
 
 ## Bugs? Questions?
 
@@ -51,6 +86,6 @@ This project's main repository is on GitHub: [http://github.com/opennorth/fastcs
 
 ## Acknowledgements
 
-Started as a Ruby 2.1 fork of MoonWolf <moonwolf@moonwolf.com>'s CSVScan, found in [this commit](https://github.com/nickstenning/csvscan/commit/11ec30f71a27cc673bca09738ee8a63942f416f0.patch). CSVScan uses Ragel code from [HPricot](https://github.com/hpricot/hpricot/blob/master/ext/hpricot_scan/hpricot_scan.rl) from [this commit](https://github.com/hpricot/hpricot/blob/908a4ae64bc8b935c4415c47ca6aea6492c6ce0a/ext/hpricot_scan/hpricot_scan.rl).
+Started as a Ruby 2.1 fork of MoonWolf <moonwolf@moonwolf.com>'s CSVScan, found in [this commit](https://github.com/nickstenning/csvscan/commit/11ec30f71a27cc673bca09738ee8a63942f416f0.patch). CSVScan uses Ragel code from [HPricot](https://github.com/hpricot/hpricot/blob/master/ext/hpricot_scan/hpricot_scan.rl) from [this commit](https://github.com/hpricot/hpricot/blob/908a4ae64bc8b935c4415c47ca6aea6492c6ce0a/ext/hpricot_scan/hpricot_scan.rl). Most of the Ruby (i.e. non-C, non-Ragel) methods are copied from [CSV](https://github.com/ruby/ruby/blob/ab337e61ecb5f42384ba7d710c36faf96a454e5c/lib/csv.rb).
 
 Copyright (c) 2014 Open North Inc., released under the MIT license
|
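The Implementation notes added to the README above describe turning a block-yielding parser into a pull-style `#shift` by means of a Fiber, and discarding that Fiber when the IO pointer moves. A minimal pure-Ruby sketch of that pattern follows. It assumes a hypothetical `yield_rows` method standing in for `FastCSV.raw_parse` (with a naive comma split, not real CSV parsing) and a hypothetical `PullReader` class; neither is fastcsv's actual code.

```ruby
# Hypothetical stand-in for a block-yielding parser such as FastCSV.raw_parse.
# It splits naively on commas and ignores quoting.
def yield_rows(text)
  text.each_line { |line| yield line.chomp.split(',') }
end

# Hypothetical reader exposing a pull-style #shift on top of yield_rows.
class PullReader
  def initialize(text)
    @text = text
    @fiber = nil
    @done = false
  end

  # Returns the next row, or nil once the input is exhausted.
  def shift
    return nil if @done
    @fiber ||= Fiber.new do
      # Each parsed row suspends the fiber and becomes the value of #resume.
      yield_rows(@text) { |row| Fiber.yield(row) }
      @done = true
      nil
    end
    @fiber.resume
  end

  # Dropping the fiber makes the next #shift start parsing from the beginning.
  def rewind
    @fiber = nil
    @done = false
  end
end

reader = PullReader.new("a,b\n1,2\n")
p reader.shift
p reader.shift
reader.rewind
p reader.shift
```

The `@done` flag avoids resuming a finished fiber, which would otherwise raise `FiberError`.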
data/TESTS.md
ADDED
@@ -0,0 +1,42 @@
+Here are some notes on maintaining the `test/` directory.
+
+1. Download Ruby and [test CSV](http://ruby-doc.org/core-2.1.0/doc/contributing_rdoc.html#label-Running+tests).
+
+git clone https://github.com/ruby/ruby.git
+cd ruby
+git co v2_1_2
+gem uninstall minitest
+gem install minitest --version 4.7.5
+ruby test/runner.rb test/csv
+
+1. Copy the tests into the project. All the tests should pass.
+
+cd PROJECT
+mkdir test
+cp path/to/ruby/test/runner.rb test
+cp path/to/ruby/test/with_different_ofs.rb test
+cp -r path/to/ruby/test/csv test/csv
+ruby test/runner.rb test/csv
+
+1. Replace `\bCSV\b` with `FastCSV`. And run:
+
+sed -i.bak '1s;^;require "fastcsv"\
+;' test/runner.rb
+
+1. In `test_interface.rb`, replace `\\t|;|(?<=\S)\|(?=\S)` with `,`. In `test_encodings.rb`, replace `(?<=[^\s{])\|(?=\S)` with `,` and replace `Encoding.list` with `Encoding.list.reject{|e| e.name[/\AUTF-\d\d/]}`. These changes are because `:col_sep`, `:row_sep` and `:quote_char` are ignored and because UTF-16 and UTF-32 aren't supported.
+
+1. Comment these tests because `:col_sep`, `:row_sep` and `:quote_char` are ignored:
+
+* `test_csv_parsing.rb`: the first part of `test_malformed_csv`
+* `test_features.rb`: `test_col_sep`, `test_row_sep`, `test_quote_char`, `test_leading_empty_fields_with_multibyte_col_sep_bug_fix`
+* `test_headers.rb`: `test_csv_header_string_inherits_separators`
+
+1. Comment these tests in `test_csv_encoding.rb` because UTF-16 and UTF-32 aren't supported:
+
+* `test_parses_utf16be_encoding`
+* the second part of `test_open_allows_you_to_set_encodings`
+* the second part of `test_foreach_allows_you_to_set_encodings`
+* the second part of `test_read_allows_you_to_set_encodings`
+* the second line of `encode_for_tests`
+
+1. Comment `test_field_size_limit_controls_lookahead` in `test_csv_parsing.rb` (`:field_size_limit` not supported). FastCSV reads one more line than CSV in `test_malformed_csv`, but not sure that's worth mirroring.
|
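The rename and separator steps listed in TESTS.md above are regular-expression substitutions over the copied test files. The Ruby sketch below applies the quoted patterns to short sample strings to show what they match and what they leave alone. The helper names `rename_csv`, `separators_to_commas` and `supported_encodings` are illustrative assumptions, not part of fastcsv or its test suite.

```ruby
# Patterns quoted in TESTS.md.
CSV_CONSTANT   = /\bCSV\b/
INTERFACE_SEPS = /\\t|;|(?<=\S)\|(?=\S)/

# Renames the bare CSV constant but leaves names such as TestCSV untouched.
def rename_csv(source)
  source.gsub(CSV_CONSTANT, 'FastCSV')
end

# Turns a literal backslash-t, a semicolon, or a pipe squeezed between
# non-space characters into a comma. Block parameters (|row|) and the ||
# operator survive when they have whitespace on at least one side.
def separators_to_commas(source)
  source.gsub(INTERFACE_SEPS, ',')
end

# Drops encodings whose names start with UTF- plus two digits (UTF-16*, UTF-32*).
def supported_encodings
  Encoding.list.reject { |e| e.name[/\AUTF-\d\d/] }
end

puts rename_csv('TestCSV CSV::Row')
puts separators_to_commas('"1;2;3"')
puts separators_to_commas('rows.each do |row| a || b end')
puts supported_encodings.include?(Encoding::UTF_16LE)
```

Running the substitutions on a sample first is a cheap way to confirm that a pattern does not touch block syntax before applying it to whole files.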