gn_crossmap 0.1.4 → 0.1.5
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -0
- data/README.md +49 -16
- data/exe/crossmap +5 -1
- data/gn_crossmap.gemspec +2 -0
- data/lib/gn_crossmap/collector.rb +6 -2
- data/lib/gn_crossmap/column_collector.rb +3 -0
- data/lib/gn_crossmap/errors.rb +2 -0
- data/lib/gn_crossmap/reader.rb +7 -1
- data/lib/gn_crossmap/resolver.rb +5 -2
- data/lib/gn_crossmap/result_processor.rb +15 -10
- data/lib/gn_crossmap/sci_name_collector.rb +2 -0
- data/lib/gn_crossmap/version.rb +1 -1
- data/lib/gn_crossmap/writer.rb +11 -4
- data/lib/gn_crossmap.rb +5 -2
- metadata +31 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 30c84b5910edf24b6b67fc45c1f80be36b29a183
+  data.tar.gz: d7481bc663e1b4edd225c83f711a84092fb9d602
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5754ec3e9d65a69cf8fcf83aaeaa2c24fd4b565917ba37b860a0468e5f9e46ba1e6dcd37dbdb28935313469f704aa84b6faab79584a238b1e17d2e6284b23ac5
+  data.tar.gz: 11ba4423f9c31529f5f1d17ad25609ccfae8d7d6694efea0bcf30137e65e661334a89e30f9259dfd9934676f9bfa76ec59d4ffee07f0d79afffdb9a5bcc67abd
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,21 @@
 gn_crossmap CHANGELOG
 =====================
 
+0.1.5
+-----
+
+* @dimus - #5 - All original fields are now preserved in the output file.
+
+* @dimus - #3 - If ingest has more than 10K rows -- user will see logging events
+
+* @dimus - #4 Bug - Error messages if headers are missing necessary fields
+
+* @dimus - #2 - Header fields are now allowed trailing spaces
+
+* @dimus - #7 Bug - Empty rank does not break crossmapping anymore
+
+* @dimus - #1 Bug - add missing rest-client gem
+
 0.1.4
 -----
 - [Dmitry Mozzherin][dimus] - Bug fixes
data/README.md
CHANGED
@@ -10,6 +10,8 @@ in [GN Resolver][resolver].
 
 Checklist has to be in a CSV format.
 
+[Issues on waffle.io][waffle]
+
 Compatibility
 -------------
 
@@ -35,6 +37,29 @@ Or install it yourself as:
 Usage
 -----
 
+### Usage from command line
+
+    # to see help
+    $ crossmap --help
+
+    # to compare with default source (Catalogue of Life)
+    $ crossmap -i my_list.csv -o my_list_col.csv
+
+    # to compare with other source (Index Fungorum in this example)
+    $ crossmap -i my_list.csv -o my_list_if.csv -d 5
+
+### Usage as Ruby Library
+
+```ruby
+require "gn_crossmap"
+
+# If you want to change logger -- default Logging is to standard output
+GnCrossmap.logger = MyCustomLogger.new
+
+GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5)
+```
+
+
 ### Input file format
 
 - Comma Separated File with names of fields in first row.
@@ -92,27 +117,33 @@ TaxonId|kingdom|subkingdom|phylum|subphylum|superclass|class|subclass|cohort|sup
 
 More examples can be found in [spec/files][files] directory
 
-###
-
-    # to see help
-    $ crossmap --help
-
-    # to compare with default source (Catalogue of Life)
-    $ crossmap -i my_list.csv -o my_list_col.csv
+### Output file format
 
-
-    $ crossmap -i my_list.csv -o my_list_if.csv -d 5
+[Output][output] includes following fields:
 
-
+Field | Description
+---------------------|-----------------------------------------------------------
+taxonID | original ID attached to a name in the checklist
+scientificName | name from the checklist
+matchedScientificName| name matched from the GN Reolver data source
+matchedCanonicalForm | canonical form of the matched name
+rank | rank from the source (if it was given/inferred)
+matchedRank | corresponding rank from the data source
+matchType | what kind of match it is
+editDistance | for fuzzy-matching -- how many characters differ between checklist and data source name
+score | heuristic score from 0 to 1 where 1 is a good match, 0.5 match requires further human investigation
 
-
-require "gn_crossmap"
+#### Types of Matches
 
-
-GnCrossmap.logger = MyCustomLogger.new
+The output fomat returns 7 possible types of matches:
 
-
-
+1. **Exact match** - The exact name was matched (but ignoring non-ascii characters)
+2. **Exact match by canonical form of a name** - The canonical form of the name (a version of a scientific name that contains complete versions of the latin words, but lacks insertions of subtaxa, annotations, or authority information) was matched
+3. **Fuzzy match by canonical form** - The canonical form gave a fuzzy (detecting lexical or spelling variations of a name using Tony Rees' algorithm TAXAMATCH) match
+4. **Partial exact match by species part of canonical form** - The canonical form returned a partial but exact match
+5. **Partial fuzzy match by species part of canonical form** - The canonical form returned a partial, fuzzy match
+6. **Exact match by genus part of a canonical form** - The genus part of the canonical form of the species name returned an exact match
+7. **[Blank]** - No match
 
 Development
 -----------
@@ -160,3 +191,5 @@ See [LICENSE][license] for details.
 [license]: https://github.com/GlobalNamesArchitecture/gn_crossmap/blob/master/LICENSE
 [terms]: http://rs.tdwg.org/dwc/terms
 [files]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files
+[output]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files/output-example.csv
+[waffle]: https://waffle.io/GlobalNamesArchitecture/gn_crossmap
|
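Review note: the README's description of `score` (1 is a good match, 0.5 needs human investigation) suggests a simple triage of output rows. The following is a minimal sketch, not part of gn_crossmap. It assumes rows were already parsed into hashes with a `:score` key, and the `triage` helper and its 0.5 cutoff are hypothetical. The key name is also an assumption, because the README table calls the column `score` while the `Writer` header added in this release uses a different name.

```ruby
# Hypothetical helper: split crossmap output rows into three groups by their
# heuristic score. The cutoff is an assumption made for illustration.
def triage(rows, review_at_or_below: 0.5)
  groups = { accepted: [], review: [], unmatched: [] }
  rows.each do |row|
    score = row[:score]
    if score.nil? || score.to_s.strip.empty?
      groups[:unmatched] << row # blank match fields: nothing was found
    elsif score.to_f <= review_at_or_below
      groups[:review] << row    # weak match, a person should look at it
    else
      groups[:accepted] << row
    end
  end
  groups
end

rows = [
  { id: "1", score: "1.0" },
  { id: "2", score: "0.5" },
  { id: "3", score: nil }
]
triage(rows).each { |name, members| puts "#{name}: #{members.size}" }
```

Scores arrive as strings when read back from CSV, which is why the sketch converts with `to_f` before comparing.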
data/exe/crossmap
CHANGED
@@ -20,4 +20,8 @@ end
 Trollop.die :input, "must be set" if opts[:input].nil?
 Trollop.die :input, "file must exist" unless File.exist?(opts[:input])
 
-
+begin
+  GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id])
+rescue GnCrossmapError => e
+  GnCrossmap.logger.error(e.message)
+end
data/gn_crossmap.gemspec
CHANGED
@@ -27,6 +27,8 @@ Gem::Specification.new do |gem|
 
   gem.add_dependency "trollop", "~> 2.1"
   gem.add_dependency "biodiversity", "~> 3.1"
+  gem.add_dependency "rest-client", "~> 1.8"
+  gem.add_dependency "logger-colors", "~> 1.0"
 
   gem.add_development_dependency "bundler", "~> 1.7"
   gem.add_development_dependency "rake", "~> 10.0"
|
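Review note: both new runtime dependencies use the pessimistic `~>` operator. As a quick check of what `"~> 1.8"` admits, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the `allowed?` helper is a hypothetical name introduced only for this illustration.

```ruby
require "rubygems"

# Hypothetical helper: does `version` satisfy the gemspec-style constraint?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

%w(1.7.9 1.8.0 1.9.3 2.0.0).each do |version|
  puts "rest-client #{version}: #{allowed?('~> 1.8', version)}"
end
```

With a two-segment constraint such as `"~> 1.8"`, any 1.x release from 1.8 upward qualifies and 2.0 does not.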
data/lib/gn_crossmap/collector.rb
CHANGED
@@ -17,14 +17,18 @@ module GnCrossmap
     private
 
     def init_fields_collector
-      @fields = @row.map { |f| f.downcase.to_sym }
+      @fields = @row.map { |f| f.to_s.strip.downcase.to_sym }
       @collector = collector_factory
+      err = "taxonID must be present in the csv header"
+      fail GnCrossmapError, err unless @fields.include?(:taxonid)
     end
 
     def collect_data
       @row = @fields.zip(@row).to_h
       data = @collector.id_name_rank(@row)
-
+      return unless data
+      data[:original] = @row.values
+      @data << data
     end
 
     def collector_factory
data/lib/gn_crossmap/column_collector.rb
CHANGED
@@ -9,6 +9,9 @@ module GnCrossmap
 
     def initialize(fields)
       @fields = fields
+      err = "At least some of these fields must exist in " \
+            "the CSV header: '#{RANKS.join('\', \'')}'"
+      fail GnCrossmapError, err if (RANKS - @fields).size == RANKS.size
     end
 
     def id_name_rank(row)
|
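Review note: the two collector hunks above normalize header cells and then reject headers that lack required columns. The sketch below illustrates that idea under stated assumptions: `HeaderError` stands in for the gem's `GnCrossmapError` (errors.rb is not shown in this diff), the `RANKS` list here is a made-up subset, and putting both checks in one method is a simplification of the gem's split between `Collector` and `ColumnCollector`.

```ruby
# Stand-in for GnCrossmapError, whose definition is not visible in this diff.
class HeaderError < RuntimeError; end

# Hypothetical subset of rank columns; the gem's real RANKS is not shown.
RANKS = %i(kingdom phylum class order family genus species).freeze

# Same normalization as the new init_fields_collector line: nil-safe,
# trimmed, lower-cased, symbolized.
def normalize_header(row)
  row.map { |f| f.to_s.strip.downcase.to_sym }
end

def validate_header!(fields)
  unless fields.include?(:taxonid)
    fail HeaderError, "taxonID must be present in the csv header"
  end
  # Equivalent to (RANKS - fields).size == RANKS.size: no rank column at all.
  fail HeaderError, "no rank columns in the csv header" if (RANKS & fields).empty?
  fields
end

fields = normalize_header(["TaxonID ", " Genus", "Species  "])
p validate_header!(fields)
```

The `to_s` call matters because the CSV parser yields `nil` for an empty header cell, and `nil.downcase` would raise before any useful error message could be produced.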
data/lib/gn_crossmap/reader.rb
CHANGED
@@ -2,9 +2,12 @@ module GnCrossmap
   # Reads supplied csv file and creates ruby structure to compare
   # with a Global Names Resolver source
   class Reader
+    attr_reader :original_fields
+
     def initialize(csv_path)
       @csv_file = csv_path
       @col_sep = col_sep
+      @original_fields = nil
     end
 
     def read
@@ -21,7 +24,10 @@ module GnCrossmap
 
     def parse_input
       dc = Collector.new
-      CSV.open(@csv_file, col_sep: @col_sep).
+      CSV.open(@csv_file, col_sep: @col_sep).each_with_index do |row, i|
+        @original_fields = row.dup if @original_fields.nil?
+        i += 1
+        GnCrossmap.log("Ingesting #{i}th csv row") if i % 10_000 == 0
         dc.process_row(row)
       end
       dc.data
|
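Review note: the reader now logs once every 10,000 rows by incrementing the zero-based index inside the block. A possible alternative, sketched here with a hypothetical `ingest` helper that collects messages instead of calling `GnCrossmap.log`, is `each.with_index(1)`, which starts counting at 1 and removes the manual `i += 1`.

```ruby
# Hypothetical helper: return the progress messages that would be logged
# while walking `rows`, one message per `step` rows.
def ingest(rows, step: 10_000)
  messages = []
  rows.each.with_index(1) do |_row, i|
    messages << "Ingesting #{i}th csv row" if (i % step).zero?
  end
  messages
end

puts ingest(Array.new(25) { ["id", "name"] }, step: 10)
```

Because the counter is one-based, the first message appears after the tenth row rather than on the very first one.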
data/lib/gn_crossmap/resolver.rb
CHANGED
@@ -7,6 +7,7 @@ module GnCrossmap
       @processor = GnCrossmap::ResultProcessor.new(writer)
       @ds_id = data_source_id
       @count = 0
+      @current_data = {}
       @batch = 200
     end
 
@@ -32,7 +33,9 @@ module GnCrossmap
     end
 
     def collect_names(slice)
+      @current_data = {}
       slice.each_with_object("") do |row, str|
+        @current_data[row[:id]] = row[:original]
         @processor.input[row[:id]] = { rank: row[:rank] }
         str << "#{row[:id]}|#{row[:name]}\n"
       end
@@ -40,7 +43,7 @@ module GnCrossmap
 
     def remote_resolve(names)
       res = RestClient.post(URL, data: names, data_source_ids: @ds_id)
-      @processor.process(res)
+      @processor.process(res, @current_data)
     rescue RestClient::Exception
       single_remote_resolve(names)
     end
@@ -51,7 +54,7 @@ module GnCrossmap
           res = RestClient.post(URL, data: name, data_source_ids: @ds_id)
           @processor.process(res)
         rescue RestClient::Exception => e
-          GnCrossmap.
+          GnCrossmap.logger.error("Resolver broke on '#{name}': #{e.message}")
           next
         end
       end
|
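Review note: `collect_names` now does two things per batch: it builds the `id|name` payload sent to the resolver and it remembers each row's original cells by id. The sketch below isolates that step. The `build_batch` and `each_batch` names are hypothetical, and using `each_slice` to form batches is an assumption, since the hunk shows only `@batch = 200` and a `slice` parameter.

```ruby
# Hypothetical helper mirroring collect_names: returns the newline-delimited
# "id|name" payload plus an id => original-cells lookup for one batch.
def build_batch(slice)
  originals = {}
  payload = slice.each_with_object(String.new) do |row, str|
    originals[row[:id]] = row[:original]
    str << "#{row[:id]}|#{row[:name]}\n"
  end
  [payload, originals]
end

# Assumption: batches are formed with each_slice; the diff does not show this.
def each_batch(data, size: 200)
  data.each_slice(size).map { |slice| build_batch(slice) }
end

data = [
  { id: "1", name: "Aus bus", original: ["1", "Aus bus", "note"] },
  { id: "2", name: "Cus dus", original: ["2", "Cus dus", ""] }
]
payload, originals = build_batch(data)
puts payload
p originals["2"]
```

Resetting the lookup for every batch, as the hunk does with `@current_data = {}`, keeps memory bounded by the batch size rather than by the whole input file.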
data/lib/gn_crossmap/result_processor.rb
CHANGED
@@ -17,7 +17,8 @@ module GnCrossmap
       @input = {}
     end
 
-    def process(result)
+    def process(result, original_data)
+      @original_data = original_data
       res = rubyfy(result)
       res[:data].each do |d|
         d[:results].nil? ? write_empty_result(d) : write_result(d)
@@ -31,22 +32,26 @@ module GnCrossmap
     end
 
     def write_empty_result(datum)
-      res = [datum[:supplied_id]
-
+      res = @original_data[datum[:supplied_id]]
+      res += [datum[:supplied_name_string], nil, nil,
+              @input[datum[:supplied_id]][:rank], nil, nil, nil, nil]
       @writer.write(res)
     end
 
     def write_result(datum)
-      datum[:results].each do |
-
-               r[:name_string], r[:canonical_form],
-               @input[datum[:supplied_id]][:rank],
-               matched_rank(r), matched_type(r),
-               r[:edit_distance], r[:score]]
-        @writer.write(res)
+      datum[:results].each do |result|
+        @writer.write(compile_result(datum, result))
       end
     end
 
+    def compile_result(datum, result)
+      @original_data[datum[:supplied_id]] +
+        [datum[:supplied_name_string], result[:name_string],
+         result[:canonical_form], @input[datum[:supplied_id]][:rank],
+         matched_rank(result), matched_type(result),
+         result[:edit_distance], result[:score]]
+    end
+
     def matched_rank(record)
       record[:classification_path_ranks].split("|").last
     end
data/lib/gn_crossmap/version.rb
CHANGED
data/lib/gn_crossmap/writer.rb
CHANGED
@@ -1,12 +1,11 @@
 module GnCrossmap
   # Saves output from GN Resolver to disk
   class Writer
-    def initialize(output_path)
+    def initialize(output_path, original_fields)
       @path = output_path
+      @output_fields = output_fields(original_fields)
       @output = CSV.open(@path, "w:utf-8")
-      @output <<
-                  :matchedCanonicalForm, :rank, :matchedRank, :matchType,
-                  :editDistance, :score]
+      @output << @output_fields
       GnCrossmap.log("Open output file '#{@path}'")
     end
 
@@ -18,5 +17,13 @@ module GnCrossmap
       GnCrossmap.log("Close output file '#{@path}'")
       @output.close
     end
+
+    private
+
+    def output_fields(original_fields)
+      original_fields + [:inputName, :matchedName, :matchedCanonicalForm,
+                         :inputRank, :matchedRank, :matchedType,
+                         :matchedEditDistance, :marchedScore]
+    end
   end
 end
|
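Review note: with this release the output header is the original header plus eight appended columns, and each data row is the original cells plus eight appended values. The sketch below checks that the two stay the same width. The appended names are copied verbatim from `output_fields`, including the `:marchedScore` spelling. The `output_header` and `output_row` helpers and the keys of the `match` hash are hypothetical stand-ins for the gem's `Writer` and `ResultProcessor`.

```ruby
# Column names appended by output_fields in this diff, copied verbatim.
APPENDED_FIELDS = [:inputName, :matchedName, :matchedCanonicalForm,
                   :inputRank, :matchedRank, :matchedType,
                   :matchedEditDistance, :marchedScore].freeze

def output_header(original_fields)
  original_fields + APPENDED_FIELDS
end

# Hypothetical row builder that follows compile_result's ordering.
# A nil match yields the eight-value shape used for unmatched names.
def output_row(original, input_name, input_rank, match = nil)
  match ||= {}
  original + [input_name, match[:name_string], match[:canonical_form],
              input_rank, match[:rank], match[:type],
              match[:edit_distance], match[:score]]
end

header = output_header(%w(taxonID scientificName))
row = output_row(["1", "Aus bus"], "Aus bus", "species",
                 name_string: "Aus bus L.", canonical_form: "Aus bus",
                 rank: "species", type: "Exact match",
                 edit_distance: 0, score: 0.98)
puts header.size == row.size
```

Array `+` returns a new array, so neither the stored original cells nor the header passed in by the reader is mutated.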
data/lib/gn_crossmap.rb
CHANGED
@@ -1,7 +1,9 @@
 require "csv"
 require "rest_client"
 require "logger"
+require "logger/colors"
 require "biodiversity"
+require "gn_crossmap/errors"
 require "gn_crossmap/version"
 require "gn_crossmap/reader"
 require "gn_crossmap/writer"
@@ -17,8 +19,9 @@ module GnCrossmap
     attr_writer :logger
 
     def run(input, output, data_source_id)
-
-
+      reader = Reader.new(input)
+      data = reader.read
+      writer = Writer.new(output, reader.original_fields)
       Resolver.new(writer, data_source_id).resolve(data)
       output
     end
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: gn_crossmap
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.5
 platform: ruby
 authors:
 - Dmitry Mozzherin
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2015-05-
+date: 2015-05-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: trollop
@@ -38,6 +38,34 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.1'
+- !ruby/object:Gem::Dependency
+  name: rest-client
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.8'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.8'
+- !ruby/object:Gem::Dependency
+  name: logger-colors
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -148,6 +176,7 @@ files:
 - lib/gn_crossmap.rb
 - lib/gn_crossmap/collector.rb
 - lib/gn_crossmap/column_collector.rb
+- lib/gn_crossmap/errors.rb
 - lib/gn_crossmap/reader.rb
 - lib/gn_crossmap/resolver.rb
 - lib/gn_crossmap/result_processor.rb