gn_crossmap 0.1.4 → 0.1.5
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -0
- data/README.md +49 -16
- data/exe/crossmap +5 -1
- data/gn_crossmap.gemspec +2 -0
- data/lib/gn_crossmap/collector.rb +6 -2
- data/lib/gn_crossmap/column_collector.rb +3 -0
- data/lib/gn_crossmap/errors.rb +2 -0
- data/lib/gn_crossmap/reader.rb +7 -1
- data/lib/gn_crossmap/resolver.rb +5 -2
- data/lib/gn_crossmap/result_processor.rb +15 -10
- data/lib/gn_crossmap/sci_name_collector.rb +2 -0
- data/lib/gn_crossmap/version.rb +1 -1
- data/lib/gn_crossmap/writer.rb +11 -4
- data/lib/gn_crossmap.rb +5 -2
- metadata +31 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 30c84b5910edf24b6b67fc45c1f80be36b29a183
|
4
|
+
data.tar.gz: d7481bc663e1b4edd225c83f711a84092fb9d602
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 5754ec3e9d65a69cf8fcf83aaeaa2c24fd4b565917ba37b860a0468e5f9e46ba1e6dcd37dbdb28935313469f704aa84b6faab79584a238b1e17d2e6284b23ac5
|
7
|
+
data.tar.gz: 11ba4423f9c31529f5f1d17ad25609ccfae8d7d6694efea0bcf30137e65e661334a89e30f9259dfd9934676f9bfa76ec59d4ffee07f0d79afffdb9a5bcc67abd
|
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,21 @@
|
|
1
1
|
gn_crossmap CHANGELOG
|
2
2
|
=====================
|
3
3
|
|
4
|
+
0.1.5
|
5
|
+
-----
|
6
|
+
|
7
|
+
* @dimus - #5 - All original fields are now preserved in the output file.
|
8
|
+
|
9
|
+
* @dimus - #3 - If ingest has more than 10K rows -- user will see logging events
|
10
|
+
|
11
|
+
* @dimus - #4 Bug - Error messages if headers are missing necessary fields
|
12
|
+
|
13
|
+
* @dimus - #2 - Header fields are now allowed trailing spaces
|
14
|
+
|
15
|
+
* @dimus - #7 Bug - Empty rank does not break crossmapping anymore
|
16
|
+
|
17
|
+
* @dimus - #1 Bug - add missing rest-client gem
|
18
|
+
|
4
19
|
0.1.4
|
5
20
|
-----
|
6
21
|
- [Dmitry Mozzherin][dimus] - Bug fixes
|
data/README.md
CHANGED
@@ -10,6 +10,8 @@ in [GN Resolver][resolver].
|
|
10
10
|
|
11
11
|
Checklist has to be in a CSV format.
|
12
12
|
|
13
|
+
[Issues on waffle.io][waffle]
|
14
|
+
|
13
15
|
Compatibility
|
14
16
|
-------------
|
15
17
|
|
@@ -35,6 +37,29 @@ Or install it yourself as:
|
|
35
37
|
Usage
|
36
38
|
-----
|
37
39
|
|
40
|
+
### Usage from command line
|
41
|
+
|
42
|
+
# to see help
|
43
|
+
$ crossmap --help
|
44
|
+
|
45
|
+
# to compare with default source (Catalogue of Life)
|
46
|
+
$ crossmap -i my_list.csv -o my_list_col.csv
|
47
|
+
|
48
|
+
# to compare with other source (Index Fungorum in this example)
|
49
|
+
$ crossmap -i my_list.csv -o my_list_if.csv -d 5
|
50
|
+
|
51
|
+
### Usage as Ruby Library
|
52
|
+
|
53
|
+
```ruby
|
54
|
+
require "gn_crossmap"
|
55
|
+
|
56
|
+
# If you want to change logger -- default Logging is to standard output
|
57
|
+
GnCrossmap.logger = MyCustomLogger.new
|
58
|
+
|
59
|
+
GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5)
|
60
|
+
```
|
61
|
+
|
62
|
+
|
38
63
|
### Input file format
|
39
64
|
|
40
65
|
- Comma Separated File with names of fields in first row.
|
@@ -92,27 +117,33 @@ TaxonId|kingdom|subkingdom|phylum|subphylum|superclass|class|subclass|cohort|sup
|
|
92
117
|
|
93
118
|
More examples can be found in [spec/files][files] directory
|
94
119
|
|
95
|
-
###
|
96
|
-
|
97
|
-
# to see help
|
98
|
-
$ crossmap --help
|
99
|
-
|
100
|
-
# to compare with default source (Catalogue of Life)
|
101
|
-
$ crossmap -i my_list.csv -o my_list_col.csv
|
120
|
+
### Output file format
|
102
121
|
|
103
|
-
|
104
|
-
$ crossmap -i my_list.csv -o my_list_if.csv -d 5
|
122
|
+
[Output][output] includes following fields:
|
105
123
|
|
106
|
-
|
124
|
+
Field | Description
|
125
|
+
---------------------|-----------------------------------------------------------
|
126
|
+
taxonID | original ID attached to a name in the checklist
|
127
|
+
scientificName | name from the checklist
|
128
|
+
matchedScientificName| name matched from the GN Resolver data source
|
129
|
+
matchedCanonicalForm | canonical form of the matched name
|
130
|
+
rank | rank from the source (if it was given/inferred)
|
131
|
+
matchedRank | corresponding rank from the data source
|
132
|
+
matchType | what kind of match it is
|
133
|
+
editDistance | for fuzzy-matching -- how many characters differ between checklist and data source name
|
134
|
+
score | heuristic score from 0 to 1 where 1 is a good match, 0.5 match requires further human investigation
|
107
135
|
|
108
|
-
|
109
|
-
require "gn_crossmap"
|
136
|
+
#### Types of Matches
|
110
137
|
|
111
|
-
|
112
|
-
GnCrossmap.logger = MyCustomLogger.new
|
138
|
+
The output format returns 7 possible types of matches:
|
113
139
|
|
114
|
-
|
115
|
-
|
140
|
+
1. **Exact match** - The exact name was matched (but ignoring non-ascii characters)
|
141
|
+
2. **Exact match by canonical form of a name** - The canonical form of the name (a version of a scientific name that contains complete versions of the latin words, but lacks insertions of subtaxa, annotations, or authority information) was matched
|
142
|
+
3. **Fuzzy match by canonical form** - The canonical form gave a fuzzy (detecting lexical or spelling variations of a name using Tony Rees' algorithm TAXAMATCH) match
|
143
|
+
4. **Partial exact match by species part of canonical form** - The canonical form returned a partial but exact match
|
144
|
+
5. **Partial fuzzy match by species part of canonical form** - The canonical form returned a partial, fuzzy match
|
145
|
+
6. **Exact match by genus part of a canonical form** - The genus part of the canonical form of the species name returned an exact match
|
146
|
+
7. **[Blank]** - No match
|
116
147
|
|
117
148
|
Development
|
118
149
|
-----------
|
@@ -160,3 +191,5 @@ See [LICENSE][license] for details.
|
|
160
191
|
[license]: https://github.com/GlobalNamesArchitecture/gn_crossmap/blob/master/LICENSE
|
161
192
|
[terms]: http://rs.tdwg.org/dwc/terms
|
162
193
|
[files]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files
|
194
|
+
[output]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files/output-example.csv
|
195
|
+
[waffle]: https://waffle.io/GlobalNamesArchitecture/gn_crossmap
|
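The README's output table describes `score` as a heuristic from 0 to 1, where 0.5 calls for human review. As a minimal sketch of how a consumer of the output file might act on that, the snippet below splits rows into confident matches and rows to review. The helper name `partition_by_score`, the `score_field` parameter, and the 0.9 cutoff are assumptions made for illustration and are not part of gn_crossmap. The score column name is passed in because the README table and the `Writer` in this diff name that column differently.

```ruby
require "csv"

# Hypothetical helper, not part of gn_crossmap.
# Returns two arrays of row hashes: [confident, needs_review].
def partition_by_score(csv_text, score_field:, threshold: 0.9)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  rows.partition do |row|
    score = row[score_field].to_s
    # Rows without a score (no match) always go to the review pile.
    !score.empty? && score.to_f >= threshold
  end
end

sample = <<~CSV
  taxonID,scientificName,score
  1,Aus bus,0.99
  2,Aus cus,0.5
  3,Aus dus,
CSV

confident, review = partition_by_score(sample, score_field: "score")
puts "confident: #{confident.size}, review: #{review.size}"
```

The cutoff is a judgment call; the README only says that 1 is a good match and 0.5 needs a closer look.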
data/exe/crossmap
CHANGED
@@ -20,4 +20,8 @@ end
|
|
20
20
|
Trollop.die :input, "must be set" if opts[:input].nil?
|
21
21
|
Trollop.die :input, "file must exist" unless File.exist?(opts[:input])
|
22
22
|
|
23
|
-
|
23
|
+
begin
|
24
|
+
GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id])
|
25
|
+
rescue GnCrossmapError => e
|
26
|
+
GnCrossmap.logger.error(e.message)
|
27
|
+
end
|
data/gn_crossmap.gemspec
CHANGED
@@ -27,6 +27,8 @@ Gem::Specification.new do |gem|
|
|
27
27
|
|
28
28
|
gem.add_dependency "trollop", "~> 2.1"
|
29
29
|
gem.add_dependency "biodiversity", "~> 3.1"
|
30
|
+
gem.add_dependency "rest-client", "~> 1.8"
|
31
|
+
gem.add_dependency "logger-colors", "~> 1.0"
|
30
32
|
|
31
33
|
gem.add_development_dependency "bundler", "~> 1.7"
|
32
34
|
gem.add_development_dependency "rake", "~> 10.0"
|
data/lib/gn_crossmap/collector.rb
CHANGED
@@ -17,14 +17,18 @@ module GnCrossmap
|
|
17
17
|
private
|
18
18
|
|
19
19
|
def init_fields_collector
|
20
|
-
@fields = @row.map { |f| f.downcase.to_sym }
|
20
|
+
@fields = @row.map { |f| f.to_s.strip.downcase.to_sym }
|
21
21
|
@collector = collector_factory
|
22
|
+
err = "taxonID must be present in the csv header"
|
23
|
+
fail GnCrossmapError, err unless @fields.include?(:taxonid)
|
22
24
|
end
|
23
25
|
|
24
26
|
def collect_data
|
25
27
|
@row = @fields.zip(@row).to_h
|
26
28
|
data = @collector.id_name_rank(@row)
|
27
|
-
|
29
|
+
return unless data
|
30
|
+
data[:original] = @row.values
|
31
|
+
@data << data
|
28
32
|
end
|
29
33
|
|
30
34
|
def collector_factory
|
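The hunk above normalizes header cells with `to_s.strip.downcase.to_sym` and fails when `:taxonid` is absent. The standalone sketch below reproduces that behaviour outside the gem. `HeaderError` and `normalize_header` are hypothetical stand-ins, since the contents of `errors.rb` are not shown in this diff.

```ruby
# Stand-in for the gem's GnCrossmapError (an assumption; errors.rb is not shown).
class HeaderError < StandardError; end

# Hypothetical helper mirroring the header handling in the hunk above.
def normalize_header(row)
  # to_s guards against nil cells; strip tolerates surrounding spaces.
  fields = row.map { |f| f.to_s.strip.downcase.to_sym }
  unless fields.include?(:taxonid)
    raise HeaderError, "taxonID must be present in the csv header"
  end
  fields
end

p normalize_header(["TaxonId ", " scientificName", "Rank"])
```

Calling `to_s` before `strip` matters because `CSV` yields `nil` for an empty header cell, and `nil` does not respond to `strip` or `downcase`.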
data/lib/gn_crossmap/column_collector.rb
CHANGED
@@ -9,6 +9,9 @@ module GnCrossmap
|
|
9
9
|
|
10
10
|
def initialize(fields)
|
11
11
|
@fields = fields
|
12
|
+
err = "At least some of these fields must exist in " \
|
13
|
+
"the CSV header: '#{RANKS.join('\', \'')}'"
|
14
|
+
fail GnCrossmapError, err if (RANKS - @fields).size == RANKS.size
|
12
15
|
end
|
13
16
|
|
14
17
|
def id_name_rank(row)
|
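The guard added above, `(RANKS - @fields).size == RANKS.size`, is true only when none of the rank columns appear in the header. A small sketch of that array-difference check follows. `SAMPLE_RANKS` is an assumed list for illustration; the gem's actual `RANKS` constant is not visible in this diff.

```ruby
# Assumed rank list for illustration only; the gem defines its own RANKS.
SAMPLE_RANKS = %i[kingdom phylum class order family genus species].freeze

# Hypothetical helper: true when at least one rank column is in the header.
def any_rank_field?(fields, ranks = SAMPLE_RANKS)
  # Array difference removes every rank found in fields; if anything was
  # removed, the remainder is shorter than the full rank list.
  (ranks - fields).size < ranks.size
end

puts any_rank_field?(%i[taxonid genus species])
puts any_rank_field?(%i[taxonid scientificname])
```

An equivalent formulation is `!(ranks & fields).empty?`, which some readers may find easier to scan.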
data/lib/gn_crossmap/reader.rb
CHANGED
@@ -2,9 +2,12 @@ module GnCrossmap
|
|
2
2
|
# Reads supplied csv file and creates ruby structure to compare
|
3
3
|
# with a Global Names Resolver source
|
4
4
|
class Reader
|
5
|
+
attr_reader :original_fields
|
6
|
+
|
5
7
|
def initialize(csv_path)
|
6
8
|
@csv_file = csv_path
|
7
9
|
@col_sep = col_sep
|
10
|
+
@original_fields = nil
|
8
11
|
end
|
9
12
|
|
10
13
|
def read
|
@@ -21,7 +24,10 @@ module GnCrossmap
|
|
21
24
|
|
22
25
|
def parse_input
|
23
26
|
dc = Collector.new
|
24
|
-
CSV.open(@csv_file, col_sep: @col_sep).
|
27
|
+
CSV.open(@csv_file, col_sep: @col_sep).each_with_index do |row, i|
|
28
|
+
@original_fields = row.dup if @original_fields.nil?
|
29
|
+
i += 1
|
30
|
+
GnCrossmap.log("Ingesting #{i}th csv row") if i % 10_000 == 0
|
25
31
|
dc.process_row(row)
|
26
32
|
end
|
27
33
|
dc.data
|
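The reader change above logs a message every 10,000 rows while ingesting. The sketch below shows the same pattern in isolation, using the standard library `Logger` rather than `GnCrossmap.log`. The `ingest` method and its `every:` parameter are hypothetical names introduced for this example.

```ruby
require "logger"

# Hypothetical helper: yields each row and logs progress at a fixed interval.
def ingest(rows, logger, every: 10_000)
  rows.each_with_index do |row, index|
    count = index + 1 # each_with_index is zero-based
    logger.info("Ingesting #{count}th csv row") if (count % every).zero?
    yield row
  end
end

logger = Logger.new($stdout)
processed = 0
ingest(1..25, logger, every: 10) { |_row| processed += 1 }
puts "processed #{processed} rows"
```

Keeping the interval as a parameter makes the behaviour easy to exercise with small inputs; the diff itself hard-codes `10_000`.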
data/lib/gn_crossmap/resolver.rb
CHANGED
@@ -7,6 +7,7 @@ module GnCrossmap
|
|
7
7
|
@processor = GnCrossmap::ResultProcessor.new(writer)
|
8
8
|
@ds_id = data_source_id
|
9
9
|
@count = 0
|
10
|
+
@current_data = {}
|
10
11
|
@batch = 200
|
11
12
|
end
|
12
13
|
|
@@ -32,7 +33,9 @@ module GnCrossmap
|
|
32
33
|
end
|
33
34
|
|
34
35
|
def collect_names(slice)
|
36
|
+
@current_data = {}
|
35
37
|
slice.each_with_object("") do |row, str|
|
38
|
+
@current_data[row[:id]] = row[:original]
|
36
39
|
@processor.input[row[:id]] = { rank: row[:rank] }
|
37
40
|
str << "#{row[:id]}|#{row[:name]}\n"
|
38
41
|
end
|
@@ -40,7 +43,7 @@ module GnCrossmap
|
|
40
43
|
|
41
44
|
def remote_resolve(names)
|
42
45
|
res = RestClient.post(URL, data: names, data_source_ids: @ds_id)
|
43
|
-
@processor.process(res)
|
46
|
+
@processor.process(res, @current_data)
|
44
47
|
rescue RestClient::Exception
|
45
48
|
single_remote_resolve(names)
|
46
49
|
end
|
@@ -51,7 +54,7 @@ module GnCrossmap
|
|
51
54
|
res = RestClient.post(URL, data: name, data_source_ids: @ds_id)
|
52
55
|
@processor.process(res)
|
53
56
|
rescue RestClient::Exception => e
|
54
|
-
GnCrossmap.
|
57
|
+
GnCrossmap.logger.error("Resolver broke on '#{name}': #{e.message}")
|
55
58
|
next
|
56
59
|
end
|
57
60
|
end
|
data/lib/gn_crossmap/result_processor.rb
CHANGED
@@ -17,7 +17,8 @@ module GnCrossmap
|
|
17
17
|
@input = {}
|
18
18
|
end
|
19
19
|
|
20
|
-
def process(result)
|
20
|
+
def process(result, original_data)
|
21
|
+
@original_data = original_data
|
21
22
|
res = rubyfy(result)
|
22
23
|
res[:data].each do |d|
|
23
24
|
d[:results].nil? ? write_empty_result(d) : write_result(d)
|
@@ -31,22 +32,26 @@ module GnCrossmap
|
|
31
32
|
end
|
32
33
|
|
33
34
|
def write_empty_result(datum)
|
34
|
-
res = [datum[:supplied_id]
|
35
|
-
|
35
|
+
res = @original_data[datum[:supplied_id]]
|
36
|
+
res += [datum[:supplied_name_string], nil, nil,
|
37
|
+
@input[datum[:supplied_id]][:rank], nil, nil, nil, nil]
|
36
38
|
@writer.write(res)
|
37
39
|
end
|
38
40
|
|
39
41
|
def write_result(datum)
|
40
|
-
datum[:results].each do |
|
41
|
-
|
42
|
-
r[:name_string], r[:canonical_form],
|
43
|
-
@input[datum[:supplied_id]][:rank],
|
44
|
-
matched_rank(r), matched_type(r),
|
45
|
-
r[:edit_distance], r[:score]]
|
46
|
-
@writer.write(res)
|
42
|
+
datum[:results].each do |result|
|
43
|
+
@writer.write(compile_result(datum, result))
|
47
44
|
end
|
48
45
|
end
|
49
46
|
|
47
|
+
def compile_result(datum, result)
|
48
|
+
@original_data[datum[:supplied_id]] +
|
49
|
+
[datum[:supplied_name_string], result[:name_string],
|
50
|
+
result[:canonical_form], @input[datum[:supplied_id]][:rank],
|
51
|
+
matched_rank(result), matched_type(result),
|
52
|
+
result[:edit_distance], result[:score]]
|
53
|
+
end
|
54
|
+
|
50
55
|
def matched_rank(record)
|
51
56
|
record[:classification_path_ranks].split("|").last
|
52
57
|
end
|
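In the hunks above, both `write_empty_result` and `compile_result` append eight match columns to the original input row, so every output row has the same width whether or not a match was found. The sketch below illustrates that layout. `compile_row` and the `:matched_rank` and `:match_type` hash keys are assumptions made for this example; in the gem those two values come from the `matched_rank` and `matched_type` methods.

```ruby
# Hypothetical helper: original fields followed by eight match columns.
# An empty match hash yields nils, mirroring the unmatched-row case.
def compile_row(original, supplied_name, rank, match = {})
  original + [supplied_name, match[:name_string], match[:canonical_form],
              rank, match[:matched_rank], match[:match_type],
              match[:edit_distance], match[:score]]
end

original = ["1", "Aus bus L.", "species"]
match = { name_string: "Aus bus Linnaeus", canonical_form: "Aus bus",
          matched_rank: "species", match_type: "Exact match",
          edit_distance: 0, score: 0.99 }

p compile_row(original, "Aus bus L.", "species", match)
p compile_row(original, "Aus bus L.", "species")
```

Because `Array#+` returns a new array, the stored original row is not mutated when several results are written for the same input name.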
data/lib/gn_crossmap/version.rb
CHANGED
data/lib/gn_crossmap/writer.rb
CHANGED
@@ -1,12 +1,11 @@
|
|
1
1
|
module GnCrossmap
|
2
2
|
# Saves output from GN Resolver to disk
|
3
3
|
class Writer
|
4
|
-
def initialize(output_path)
|
4
|
+
def initialize(output_path, original_fields)
|
5
5
|
@path = output_path
|
6
|
+
@output_fields = output_fields(original_fields)
|
6
7
|
@output = CSV.open(@path, "w:utf-8")
|
7
|
-
@output <<
|
8
|
-
:matchedCanonicalForm, :rank, :matchedRank, :matchType,
|
9
|
-
:editDistance, :score]
|
8
|
+
@output << @output_fields
|
10
9
|
GnCrossmap.log("Open output file '#{@path}'")
|
11
10
|
end
|
12
11
|
|
@@ -18,5 +17,13 @@ module GnCrossmap
|
|
18
17
|
GnCrossmap.log("Close output file '#{@path}'")
|
19
18
|
@output.close
|
20
19
|
end
|
20
|
+
|
21
|
+
private
|
22
|
+
|
23
|
+
def output_fields(original_fields)
|
24
|
+
original_fields + [:inputName, :matchedName, :matchedCanonicalForm,
|
25
|
+
:inputRank, :matchedRank, :matchedType,
|
26
|
+
:matchedEditDistance, :marchedScore]
|
27
|
+
end
|
21
28
|
end
|
22
29
|
end
|
data/lib/gn_crossmap.rb
CHANGED
@@ -1,7 +1,9 @@
|
|
1
1
|
require "csv"
|
2
2
|
require "rest_client"
|
3
3
|
require "logger"
|
4
|
+
require "logger/colors"
|
4
5
|
require "biodiversity"
|
6
|
+
require "gn_crossmap/errors"
|
5
7
|
require "gn_crossmap/version"
|
6
8
|
require "gn_crossmap/reader"
|
7
9
|
require "gn_crossmap/writer"
|
@@ -17,8 +19,9 @@ module GnCrossmap
|
|
17
19
|
attr_writer :logger
|
18
20
|
|
19
21
|
def run(input, output, data_source_id)
|
20
|
-
|
21
|
-
|
22
|
+
reader = Reader.new(input)
|
23
|
+
data = reader.read
|
24
|
+
writer = Writer.new(output, reader.original_fields)
|
22
25
|
Resolver.new(writer, data_source_id).resolve(data)
|
23
26
|
output
|
24
27
|
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: gn_crossmap
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.5
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Dmitry Mozzherin
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2015-05-
|
11
|
+
date: 2015-05-28 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: trollop
|
@@ -38,6 +38,34 @@ dependencies:
|
|
38
38
|
- - "~>"
|
39
39
|
- !ruby/object:Gem::Version
|
40
40
|
version: '3.1'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: rest-client
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - "~>"
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '1.8'
|
48
|
+
type: :runtime
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - "~>"
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '1.8'
|
55
|
+
- !ruby/object:Gem::Dependency
|
56
|
+
name: logger-colors
|
57
|
+
requirement: !ruby/object:Gem::Requirement
|
58
|
+
requirements:
|
59
|
+
- - "~>"
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '1.0'
|
62
|
+
type: :runtime
|
63
|
+
prerelease: false
|
64
|
+
version_requirements: !ruby/object:Gem::Requirement
|
65
|
+
requirements:
|
66
|
+
- - "~>"
|
67
|
+
- !ruby/object:Gem::Version
|
68
|
+
version: '1.0'
|
41
69
|
- !ruby/object:Gem::Dependency
|
42
70
|
name: bundler
|
43
71
|
requirement: !ruby/object:Gem::Requirement
|
@@ -148,6 +176,7 @@ files:
|
|
148
176
|
- lib/gn_crossmap.rb
|
149
177
|
- lib/gn_crossmap/collector.rb
|
150
178
|
- lib/gn_crossmap/column_collector.rb
|
179
|
+
- lib/gn_crossmap/errors.rb
|
151
180
|
- lib/gn_crossmap/reader.rb
|
152
181
|
- lib/gn_crossmap/resolver.rb
|
153
182
|
- lib/gn_crossmap/result_processor.rb
|