gn_crossmap 0.1.4 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 2860a0033f9ebfcf9d71d4ffb2cf9fa9d29ea7f3
-   data.tar.gz: b09ff0e2ab214279979e268adcaece8207e8c9ed
+   metadata.gz: 30c84b5910edf24b6b67fc45c1f80be36b29a183
+   data.tar.gz: d7481bc663e1b4edd225c83f711a84092fb9d602
  SHA512:
-   metadata.gz: 7c291b3a606eebf4da9ea8abe686bb046c63d30f34e6842346cd65a14a695d8b3f5f9af6c3f1f92a64ff6be69bc6bd8286d4a7a80386b8b239b2b65db1d8ccaa
-   data.tar.gz: 22bd5787bf73447fb8b4fc2795cec9d74ee68357e0163f90b1279d6fda5a9efa2e43c3a3c6656b158ad961923fdb9c27018f811a2babb647bb6eb7347d1b915a
+   metadata.gz: 5754ec3e9d65a69cf8fcf83aaeaa2c24fd4b565917ba37b860a0468e5f9e46ba1e6dcd37dbdb28935313469f704aa84b6faab79584a238b1e17d2e6284b23ac5
+   data.tar.gz: 11ba4423f9c31529f5f1d17ad25609ccfae8d7d6694efea0bcf30137e65e661334a89e30f9259dfd9934676f9bfa76ec59d4ffee07f0d79afffdb9a5bcc67abd
data/CHANGELOG.md CHANGED
@@ -1,6 +1,21 @@
  gn_crossmap CHANGELOG
  =====================

+ 0.1.5
+ -----
+
+ * @dimus - #5 - All original fields are now preserved in the output file.
+
+ * @dimus - #3 - If the input has more than 10K rows, the user will see logging events
+
+ * @dimus - #4 Bug - Error messages are shown if headers are missing necessary fields
+
+ * @dimus - #2 - Header fields may now have trailing spaces
+
+ * @dimus - #7 Bug - Empty rank no longer breaks crossmapping
+
+ * @dimus - #1 Bug - Add missing rest-client gem
+
  0.1.4
  -----
  - [Dmitry Mozzherin][dimus] - Bug fixes
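The two header-related entries (#2 and #4) can be illustrated with a minimal sketch. The helper `normalize_header` and the class `HeaderError` are hypothetical names used only here; the `to_s.strip.downcase.to_sym` chain and the `taxonID` check mirror what the code changes further down do.

```ruby
# Hypothetical stand-in for the gem's GnCrossmapError.
class HeaderError < RuntimeError; end

# Normalizes raw header cells: nil-safe, trimmed, lowercased, symbolized.
# Raises when the mandatory taxonID column is absent.
def normalize_header(row)
  fields = row.map { |f| f.to_s.strip.downcase.to_sym }
  unless fields.include?(:taxonid)
    raise HeaderError, "taxonID must be present in the csv header"
  end
  fields
end

fields = normalize_header(["taxonID ", " scientificName", "Rank"])
p fields # [:taxonid, :scientificname, :rank]
```

Calling `to_s` first means an empty (nil) header cell becomes an empty symbol instead of raising on `strip`.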
data/README.md CHANGED
@@ -10,6 +10,8 @@ in [GN Resolver][resolver].

  Checklist has to be in a CSV format.

+ [Issues on waffle.io][waffle]
+
  Compatibility
  -------------

@@ -35,6 +37,29 @@ Or install it yourself as:
  Usage
  -----

+ ### Usage from command line
+
+     # to see help
+     $ crossmap --help
+
+     # to compare with default source (Catalogue of Life)
+     $ crossmap -i my_list.csv -o my_list_col.csv
+
+     # to compare with other source (Index Fungorum in this example)
+     $ crossmap -i my_list.csv -o my_list_if.csv -d 5
+
+ ### Usage as Ruby Library
+
+ ```ruby
+ require "gn_crossmap"
+
+ # If you want to change logger -- default Logging is to standard output
+ GnCrossmap.logger = MyCustomLogger.new
+
+ GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5)
+ ```
+
+
  ### Input file format

  - Comma Separated File with names of fields in first row.
@@ -92,27 +117,33 @@ TaxonId|kingdom|subkingdom|phylum|subphylum|superclass|class|subclass|cohort|sup

  More examples can be found in [spec/files][files] directory

- ### Usage from command line
-
-     # to see help
-     $ crossmap --help
-
-     # to compare with default source (Catalogue of Life)
-     $ crossmap -i my_list.csv -o my_list_col.csv
+ ### Output file format

-     # to compare with other source (Index Fungorum in this example)
-     $ crossmap -i my_list.csv -o my_list_if.csv -d 5
+ [Output][output] includes the following fields:

- ### Usage as Ruby Library
+ Field                | Description
+ ---------------------|-----------------------------------------------------------
+ taxonID              | original ID attached to a name in the checklist
+ scientificName       | name from the checklist
+ matchedScientificName| name matched from the GN Resolver data source
+ matchedCanonicalForm | canonical form of the matched name
+ rank                 | rank from the source (if it was given/inferred)
+ matchedRank          | corresponding rank from the data source
+ matchType            | what kind of match it is
+ editDistance         | for fuzzy matching: how many characters differ between the checklist name and the data source name
+ score                | heuristic score from 0 to 1, where 1 is a good match and a score of 0.5 requires further human investigation

- ```ruby
- require "gn_crossmap"
+ #### Types of Matches

- # If you want to change logger -- default Logging is to standard output
- GnCrossmap.logger = MyCustomLogger.new
+ The output format returns 7 possible types of matches:

- GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5)
- ```
+ 1. **Exact match** - The exact name was matched (ignoring non-ASCII characters)
+ 2. **Exact match by canonical form of a name** - The canonical form of the name (a version of a scientific name that contains complete versions of the Latin words, but lacks insertions of subtaxa, annotations, or authority information) was matched
+ 3. **Fuzzy match by canonical form** - The canonical form gave a fuzzy match (lexical or spelling variations of a name are detected using Tony Rees' TAXAMATCH algorithm)
+ 4. **Partial exact match by species part of canonical form** - The canonical form returned a partial but exact match
+ 5. **Partial fuzzy match by species part of canonical form** - The canonical form returned a partial, fuzzy match
+ 6. **Exact match by genus part of a canonical form** - The genus part of the canonical form of the species name returned an exact match
+ 7. **[Blank]** - No match

  Development
  -----------
@@ -160,3 +191,5 @@ See [LICENSE][license] for details.
  [license]: https://github.com/GlobalNamesArchitecture/gn_crossmap/blob/master/LICENSE
  [terms]: http://rs.tdwg.org/dwc/terms
  [files]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files
+ [output]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files/output-example.csv
+ [waffle]: https://waffle.io/GlobalNamesArchitecture/gn_crossmap
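The README's description of the score column (1 is a good match, 0.5 needs human review, blank means no match) suggests a simple triage step when reading the output. The sketch below is one way to do that; the helper `triage` and the 0.9 cutoff are assumptions made for illustration and are not defined by the gem.

```ruby
# Assumed cutoff for this sketch only; the gem does not define one.
REVIEW_BELOW = 0.9

# Classifies a raw CSV score cell (String or nil) for follow-up.
def triage(score_cell)
  text = score_cell.to_s.strip
  return :no_match if text.empty?
  Float(text) < REVIEW_BELOW ? :review : :accept
end

%w[0.988 0.5].each { |cell| puts "#{cell} -> #{triage(cell)}" }
puts "(blank) -> #{triage(nil)}"
```

`Float()` is used instead of `to_f` so that a malformed cell raises rather than silently becoming 0.0.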
data/exe/crossmap CHANGED
@@ -20,4 +20,8 @@ end
  Trollop.die :input, "must be set" if opts[:input].nil?
  Trollop.die :input, "file must exist" unless File.exist?(opts[:input])

- GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id])
+ begin
+   GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id])
+ rescue GnCrossmapError => e
+   GnCrossmap.logger.error(e.message)
+ end
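The rescue-and-log pattern in this hunk can be tried in isolation with the standard library logger. In this sketch `CrossmapError` and `run_job` are hypothetical stand-ins for `GnCrossmapError` and `GnCrossmap.run`, and the logger writes to a `StringIO` so the result can be inspected.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for GnCrossmapError.
class CrossmapError < RuntimeError; end

# Hypothetical stand-in for GnCrossmap.run that always fails.
def run_job
  raise CrossmapError, "taxonID must be present in the csv header"
end

log_output = StringIO.new
logger = Logger.new(log_output)

begin
  run_job
rescue CrossmapError => e
  # Only the message is logged; no backtrace reaches the user.
  logger.error(e.message)
end

puts log_output.string
```

Because the error is rescued and nothing exits explicitly, a script written this way still finishes with status 0. If callers depend on the exit status, an `exit 1` after logging would be one option; the diff itself does not do this.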
data/gn_crossmap.gemspec CHANGED
@@ -27,6 +27,8 @@ Gem::Specification.new do |gem|

    gem.add_dependency "trollop", "~> 2.1"
    gem.add_dependency "biodiversity", "~> 3.1"
+   gem.add_dependency "rest-client", "~> 1.8"
+   gem.add_dependency "logger-colors", "~> 1.0"

    gem.add_development_dependency "bundler", "~> 1.7"
    gem.add_development_dependency "rake", "~> 10.0"
@@ -17,14 +17,18 @@ module GnCrossmap
      private

      def init_fields_collector
-       @fields = @row.map { |f| f.downcase.to_sym }
+       @fields = @row.map { |f| f.to_s.strip.downcase.to_sym }
        @collector = collector_factory
+       err = "taxonID must be present in the csv header"
+       fail GnCrossmapError, err unless @fields.include?(:taxonid)
      end

      def collect_data
        @row = @fields.zip(@row).to_h
        data = @collector.id_name_rank(@row)
-       @data << data if data
+       return unless data
+       data[:original] = @row.values
+       @data << data
      end

      def collector_factory
@@ -9,6 +9,9 @@ module GnCrossmap

      def initialize(fields)
        @fields = fields
+       err = "At least some of these fields must exist in " \
+             "the CSV header: '#{RANKS.join('\', \'')}'"
+       fail GnCrossmapError, err if (RANKS - @fields).size == RANKS.size
      end

      def id_name_rank(row)
@@ -0,0 +1,2 @@
+ # Error to raise in case of problems
+ class GnCrossmapError < RuntimeError; end
@@ -2,9 +2,12 @@ module GnCrossmap
    # Reads supplied csv file and creates ruby structure to compare
    # with a Global Names Resolver source
    class Reader
+     attr_reader :original_fields
+
      def initialize(csv_path)
        @csv_file = csv_path
        @col_sep = col_sep
+       @original_fields = nil
      end

      def read
@@ -21,7 +24,10 @@ module GnCrossmap

      def parse_input
        dc = Collector.new
-       CSV.open(@csv_file, col_sep: @col_sep).each do |row|
+       CSV.open(@csv_file, col_sep: @col_sep).each_with_index do |row, i|
+         @original_fields = row.dup if @original_fields.nil?
+         i += 1
+         GnCrossmap.log("Ingesting #{i}th csv row") if i % 10_000 == 0
          dc.process_row(row)
        end
        dc.data
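The hunk above converts the zero-based index to a one-based count by hand (`i += 1`). A minimal sketch of the same progress logging using `each.with_index(1)`, which starts counting at 1, is shown below. The rows are made-up in-memory data and the messages are collected in an array instead of being sent to `GnCrossmap.log`.

```ruby
# Made-up in-memory rows standing in for a large CSV file.
rows = Array.new(25_000) { |n| ["id#{n}", "Name #{n}"] }

messages = []
rows.each.with_index(1) do |_row, i|
  # Report progress on every 10,000th row, as the diff does.
  messages << "Ingesting #{i}th csv row" if (i % 10_000).zero?
end

puts messages
```

With 25,000 rows this yields two messages, for rows 10,000 and 20,000.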
@@ -7,6 +7,7 @@ module GnCrossmap
        @processor = GnCrossmap::ResultProcessor.new(writer)
        @ds_id = data_source_id
        @count = 0
+       @current_data = {}
        @batch = 200
      end

@@ -32,7 +33,9 @@ module GnCrossmap
      end

      def collect_names(slice)
+       @current_data = {}
        slice.each_with_object("") do |row, str|
+         @current_data[row[:id]] = row[:original]
          @processor.input[row[:id]] = { rank: row[:rank] }
          str << "#{row[:id]}|#{row[:name]}\n"
        end
@@ -40,7 +43,7 @@ module GnCrossmap

      def remote_resolve(names)
        res = RestClient.post(URL, data: names, data_source_ids: @ds_id)
-       @processor.process(res)
+       @processor.process(res, @current_data)
      rescue RestClient::Exception
        single_remote_resolve(names)
      end
@@ -51,7 +54,7 @@ module GnCrossmap
            res = RestClient.post(URL, data: name, data_source_ids: @ds_id)
            @processor.process(res)
          rescue RestClient::Exception => e
-           GnCrossmap.log("Resolver broke on '#{name}': #{e}")
+           GnCrossmap.logger.error("Resolver broke on '#{name}': #{e.message}")
            next
          end
        end
@@ -17,7 +17,8 @@ module GnCrossmap
        @input = {}
      end

-     def process(result)
+     def process(result, original_data)
+       @original_data = original_data
        res = rubyfy(result)
        res[:data].each do |d|
          d[:results].nil? ? write_empty_result(d) : write_result(d)
@@ -31,22 +32,26 @@ module GnCrossmap
      end

      def write_empty_result(datum)
-       res = [datum[:supplied_id], datum[:supplied_name_string], nil, nil,
-              @input[datum[:supplied_id]][:rank], nil, nil, nil, nil]
+       res = @original_data[datum[:supplied_id]]
+       res += [datum[:supplied_name_string], nil, nil,
+               @input[datum[:supplied_id]][:rank], nil, nil, nil, nil]
        @writer.write(res)
      end

      def write_result(datum)
-       datum[:results].each do |r|
-         res = [datum[:supplied_id], datum[:supplied_name_string],
-                r[:name_string], r[:canonical_form],
-                @input[datum[:supplied_id]][:rank],
-                matched_rank(r), matched_type(r),
-                r[:edit_distance], r[:score]]
-         @writer.write(res)
+       datum[:results].each do |result|
+         @writer.write(compile_result(datum, result))
        end
      end

+     def compile_result(datum, result)
+       @original_data[datum[:supplied_id]] +
+         [datum[:supplied_name_string], result[:name_string],
+          result[:canonical_form], @input[datum[:supplied_id]][:rank],
+          matched_rank(result), matched_type(result),
+          result[:edit_distance], result[:score]]
+     end
+
      def matched_rank(record)
        record[:classification_path_ranks].split("|").last
      end
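Both `write_empty_result` and `compile_result` build the output row by appending match fields to the original CSV cells. The sketch below, using made-up values and a reduced set of match fields, shows that `+` and `+=` each produce a new array, so the cached original row can safely be reused when a name has several results.

```ruby
# Made-up data: original CSV cells keyed by the supplied ID.
original_data = { "1" => ["1", "Puma concolor", "species"] }
match_fields = ["Puma concolor", "Puma concolor (Linnaeus, 1771)", 0.988]

# Array#+ returns a new array, so the stored original row is untouched.
row = original_data["1"] + match_fields

# `res += [...]` is shorthand for `res = res + [...]`: also non-mutating.
res = original_data["1"]
res += match_fields

p original_data["1"].size # 3
p row.size                # 6
```

If an ID is missing from the hash, the lookup returns nil and `+` raises `NoMethodError`, so every ID sent to the resolver needs an entry.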
@@ -43,6 +43,8 @@ module GnCrossmap
        else
          normalize_rank(@parsed_name[:details][0][:infraspecies][-1][:rank])
        end
+     rescue NoMethodError
+       nil
      end

      def normalize_rank(rank)
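The `rescue NoMethodError` guard in this hunk presumably covers parsed names that lack infraspecies details. An alternative, sketched here with a hypothetical helper name and made-up parser output, is `dig` (available since Ruby 2.3), which returns nil as soon as an intermediate lookup is nil. Unlike a method-wide rescue, it does not hide unrelated `NoMethodError`s raised elsewhere in the method.

```ruby
# Hypothetical helper; the hash shape mimics the lookup chain in the diff.
def last_infraspecies_rank(parsed_name)
  parsed_name.dig(:details, 0, :infraspecies, -1, :rank)
end

with_rank = { details: [{ infraspecies: [{ rank: "var." }] }] }
without   = { details: [{ species: { string: "concolor" } }] }

p last_infraspecies_rank(with_rank) # "var."
p last_infraspecies_rank(without)   # nil
```

`dig` still raises `TypeError` if an intermediate value is an object that does not support `dig`, such as a String.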
@@ -1,6 +1,6 @@
  # Namespace module for crossmapping checklists to GN sources
  module GnCrossmap
-   VERSION = "0.1.4"
+   VERSION = "0.1.5"

    def self.version
      VERSION
@@ -1,12 +1,11 @@
  module GnCrossmap
    # Saves output from GN Resolver to disk
    class Writer
-     def initialize(output_path)
+     def initialize(output_path, original_fields)
        @path = output_path
+       @output_fields = output_fields(original_fields)
        @output = CSV.open(@path, "w:utf-8")
-       @output << [:taxonID, :scientificName, :matchedScientificName,
-                   :matchedCanonicalForm, :rank, :matchedRank, :matchType,
-                   :editDistance, :score]
+       @output << @output_fields
        GnCrossmap.log("Open output file '#{@path}'")
      end

@@ -18,5 +17,13 @@ module GnCrossmap
        GnCrossmap.log("Close output file '#{@path}'")
        @output.close
      end
+
+     private
+
+     def output_fields(original_fields)
+       original_fields + [:inputName, :matchedName, :matchedCanonicalForm,
+                          :inputRank, :matchedRank, :matchedType,
+                          :matchedEditDistance, :marchedScore]
+     end
    end
  end
data/lib/gn_crossmap.rb CHANGED
@@ -1,7 +1,9 @@
  require "csv"
  require "rest_client"
  require "logger"
+ require "logger/colors"
  require "biodiversity"
+ require "gn_crossmap/errors"
  require "gn_crossmap/version"
  require "gn_crossmap/reader"
  require "gn_crossmap/writer"
@@ -17,8 +19,9 @@ module GnCrossmap
      attr_writer :logger

      def run(input, output, data_source_id)
-       data = Reader.new(input).read
-       writer = Writer.new(output)
+       reader = Reader.new(input)
+       data = reader.read
+       writer = Writer.new(output, reader.original_fields)
        Resolver.new(writer, data_source_id).resolve(data)
        output
      end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: gn_crossmap
  version: !ruby/object:Gem::Version
-   version: 0.1.4
+   version: 0.1.5
  platform: ruby
  authors:
  - Dmitry Mozzherin
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2015-05-11 00:00:00.000000000 Z
+ date: 2015-05-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: trollop
@@ -38,6 +38,34 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '3.1'
+ - !ruby/object:Gem::Dependency
+   name: rest-client
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.8'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.8'
+ - !ruby/object:Gem::Dependency
+   name: logger-colors
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
  - !ruby/object:Gem::Dependency
    name: bundler
    requirement: !ruby/object:Gem::Requirement
@@ -148,6 +176,7 @@ files:
  - lib/gn_crossmap.rb
  - lib/gn_crossmap/collector.rb
  - lib/gn_crossmap/column_collector.rb
+ - lib/gn_crossmap/errors.rb
  - lib/gn_crossmap/reader.rb
  - lib/gn_crossmap/resolver.rb
  - lib/gn_crossmap/result_processor.rb