gn_crossmap 0.1.4 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 2860a0033f9ebfcf9d71d4ffb2cf9fa9d29ea7f3
-   data.tar.gz: b09ff0e2ab214279979e268adcaece8207e8c9ed
+   metadata.gz: 30c84b5910edf24b6b67fc45c1f80be36b29a183
+   data.tar.gz: d7481bc663e1b4edd225c83f711a84092fb9d602
  SHA512:
-   metadata.gz: 7c291b3a606eebf4da9ea8abe686bb046c63d30f34e6842346cd65a14a695d8b3f5f9af6c3f1f92a64ff6be69bc6bd8286d4a7a80386b8b239b2b65db1d8ccaa
-   data.tar.gz: 22bd5787bf73447fb8b4fc2795cec9d74ee68357e0163f90b1279d6fda5a9efa2e43c3a3c6656b158ad961923fdb9c27018f811a2babb647bb6eb7347d1b915a
+   metadata.gz: 5754ec3e9d65a69cf8fcf83aaeaa2c24fd4b565917ba37b860a0468e5f9e46ba1e6dcd37dbdb28935313469f704aa84b6faab79584a238b1e17d2e6284b23ac5
+   data.tar.gz: 11ba4423f9c31529f5f1d17ad25609ccfae8d7d6694efea0bcf30137e65e661334a89e30f9259dfd9934676f9bfa76ec59d4ffee07f0d79afffdb9a5bcc67abd
data/CHANGELOG.md CHANGED
@@ -1,6 +1,21 @@
  gn_crossmap CHANGELOG
  =====================

+ 0.1.5
+ -----
+
+ * @dimus - #5 - All original fields are now preserved in the output file.
+
+ * @dimus - #3 - If ingest has more than 10K rows -- user will see logging events
+
+ * @dimus - #4 Bug - Error messages if headers are missing necessary fields
+
+ * @dimus - #2 - Header fields are now allowed trailing spaces
+
+ * @dimus - #7 Bug - Empty rank does not break crossmapping anymore
+
+ * @dimus - #1 Bug - add missing rest-client gem
+
  0.1.4
  -----
  - [Dmitry Mozzherin][dimus] - Bug fixes
data/README.md CHANGED
@@ -10,6 +10,8 @@ in [GN Resolver][resolver].

  Checklist has to be in a CSV format.

+ [Issues on waffle.io][waffle]
+
  Compatibility
  -------------

@@ -35,6 +37,29 @@ Or install it yourself as:
  Usage
  -----

+ ### Usage from command line
+
+ # to see help
+ $ crossmap --help
+
+ # to compare with default source (Catalogue of Life)
+ $ crossmap -i my_list.csv -o my_list_col.csv
+
+ # to compare with other source (Index Fungorum in this example)
+ $ crossmap -i my_list.csv -o my_list_if.csv -d 5
+
+ ### Usage as Ruby Library
+
+ ```ruby
+ require "gn_crossmap"
+
+ # If you want to change logger -- default Logging is to standard output
+ GnCrossmap.logger = MyCustomLogger.new
+
+ GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5)
+ ```
+
+
  ### Input file format

  - Comma Separated File with names of fields in first row.
@@ -92,27 +117,33 @@ TaxonId|kingdom|subkingdom|phylum|subphylum|superclass|class|subclass|cohort|sup

  More examples can be found in [spec/files][files] directory

- ### Usage from command line
-
- # to see help
- $ crossmap --help
-
- # to compare with default source (Catalogue of Life)
- $ crossmap -i my_list.csv -o my_list_col.csv
+ ### Output file format

- # to compare with other source (Index Fungorum in this example)
- $ crossmap -i my_list.csv -o my_list_if.csv -d 5
+ [Output][output] includes following fields:

- ### Usage as Ruby Library
+ Field | Description
+ ---------------------|-----------------------------------------------------------
+ taxonID | original ID attached to a name in the checklist
+ scientificName | name from the checklist
+ matchedScientificName| name matched from the GN Resolver data source
+ matchedCanonicalForm | canonical form of the matched name
+ rank | rank from the source (if it was given/inferred)
+ matchedRank | corresponding rank from the data source
+ matchType | what kind of match it is
+ editDistance | for fuzzy-matching -- how many characters differ between checklist and data source name
+ score | heuristic score from 0 to 1 where 1 is a good match, 0.5 match requires further human investigation

- ```ruby
- require "gn_crossmap"
+ #### Types of Matches

- # If you want to change logger -- default Logging is to standard output
- GnCrossmap.logger = MyCustomLogger.new
+ The output format returns 7 possible types of matches:

- GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5)
- ```
+ 1. **Exact match** - The exact name was matched (but ignoring non-ascii characters)
+ 2. **Exact match by canonical form of a name** - The canonical form of the name (a version of a scientific name that contains complete versions of the latin words, but lacks insertions of subtaxa, annotations, or authority information) was matched
+ 3. **Fuzzy match by canonical form** - The canonical form gave a fuzzy (detecting lexical or spelling variations of a name using Tony Rees' algorithm TAXAMATCH) match
+ 4. **Partial exact match by species part of canonical form** - The canonical form returned a partial but exact match
+ 5. **Partial fuzzy match by species part of canonical form** - The canonical form returned a partial, fuzzy match
+ 6. **Exact match by genus part of a canonical form** - The genus part of the canonical form of the species name returned an exact match
+ 7. **[Blank]** - No match

  Development
  -----------
@@ -160,3 +191,5 @@ See [LICENSE][license] for details.
  [license]: https://github.com/GlobalNamesArchitecture/gn_crossmap/blob/master/LICENSE
  [terms]: http://rs.tdwg.org/dwc/terms
  [files]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files
+ [output]: https://github.com/GlobalNamesArchitecture/gn_crossmap/tree/master/spec/files/output-example.csv
+ [waffle]: https://waffle.io/GlobalNamesArchitecture/gn_crossmap
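
The README's description of `score` suggests a simple triage pass over the output file. The sketch below is illustrative only: `rows_needing_review` is a hypothetical helper, not part of gn_crossmap, the 0.5 cutoff is an assumption taken from the README wording, and the score column name is a parameter because the header actually written by a given version may differ from the README table.

```ruby
require "csv"

# Hypothetical helper (not part of gn_crossmap): returns the rows whose score
# is missing or at/below the cutoff, i.e. candidates for manual checking.
def rows_needing_review(csv_text, score_field: "score", cutoff: 0.5)
  CSV.parse(csv_text, headers: true).select do |row|
    score = row[score_field]
    score.nil? || score.strip.empty? || score.to_f <= cutoff
  end
end

# Made-up sample in the shape of the README table (only three columns shown).
SAMPLE_OUTPUT = <<~CSV
  taxonID,scientificName,score
  1,Aus bus,0.988
  2,Cus dus,0.5
  3,Eus fus,
CSV

rows_needing_review(SAMPLE_OUTPUT).each do |row|
  puts "#{row['taxonID']}: #{row['scientificName']}"
end
```

With the sample data, rows 2 and 3 are reported: one sits exactly at the cutoff and the other has no score at all.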
data/exe/crossmap CHANGED
@@ -20,4 +20,8 @@ end
  Trollop.die :input, "must be set" if opts[:input].nil?
  Trollop.die :input, "file must exist" unless File.exist?(opts[:input])

- GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id])
+ begin
+   GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id])
+ rescue GnCrossmapError => e
+   GnCrossmap.logger.error(e.message)
+ end
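
The rescue added above turns a failed run into a logged error rather than a stack trace. A minimal, self-contained sketch of the same pattern follows; `DemoCrossmapError`, `demo_run` and `run_with_logging` are hypothetical stand-ins, not the gem's API.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for the gem's GnCrossmapError.
class DemoCrossmapError < RuntimeError; end

# Hypothetical stand-in for GnCrossmap.run: fails when taxonID is absent.
def demo_run(headers)
  unless headers.include?(:taxonid)
    raise DemoCrossmapError, "taxonID must be present in the csv header"
  end
  :done
end

# Logs the domain error and returns nil; any other exception still propagates.
def run_with_logging(headers, logger)
  demo_run(headers)
rescue DemoCrossmapError => e
  logger.error(e.message)
  nil
end

log = StringIO.new
run_with_logging([:scientificname], Logger.new(log))
puts log.string
```

One design point to weigh: because the error is only logged, a script written this way still ends normally and exits with status 0, so callers that need to detect failure may prefer an explicit non-zero exit.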
data/gn_crossmap.gemspec CHANGED
@@ -27,6 +27,8 @@ Gem::Specification.new do |gem|

    gem.add_dependency "trollop", "~> 2.1"
    gem.add_dependency "biodiversity", "~> 3.1"
+   gem.add_dependency "rest-client", "~> 1.8"
+   gem.add_dependency "logger-colors", "~> 1.0"

    gem.add_development_dependency "bundler", "~> 1.7"
    gem.add_development_dependency "rake", "~> 10.0"
@@ -17,14 +17,18 @@ module GnCrossmap
      private

      def init_fields_collector
-       @fields = @row.map { |f| f.downcase.to_sym }
+       @fields = @row.map { |f| f.to_s.strip.downcase.to_sym }
        @collector = collector_factory
+       err = "taxonID must be present in the csv header"
+       fail GnCrossmapError, err unless @fields.include?(:taxonid)
      end

      def collect_data
        @row = @fields.zip(@row).to_h
        data = @collector.id_name_rank(@row)
-       @data << data if data
+       return unless data
+       data[:original] = @row.values
+       @data << data
      end

      def collector_factory
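
The header normalization in this hunk can be tried in isolation. In the sketch below, `normalize_headers` is a hypothetical helper that wraps the same expression; it shows why `to_s` and `strip` matter: Ruby's CSV parser returns `nil` for an empty unquoted cell, and calling `downcase` on `nil` would raise `NoMethodError`.

```ruby
# Hypothetical helper wrapping the expression from the hunk above.
def normalize_headers(row)
  row.map { |field| field.to_s.strip.downcase.to_sym }
end

headers = normalize_headers(["taxonID ", " ScientificName", nil])
p headers
# => [:taxonid, :scientificname, :""]

# The presence check from the hunk then works regardless of case or padding.
puts headers.include?(:taxonid)
```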
@@ -9,6 +9,9 @@ module GnCrossmap

      def initialize(fields)
        @fields = fields
+       err = "At least some of these fields must exist in " \
+             "the CSV header: '#{RANKS.join('\', \'')}'"
+       fail GnCrossmapError, err if (RANKS - @fields).size == RANKS.size
      end

      def id_name_rank(row)
@@ -0,0 +1,2 @@
+ # Error to raise in case of problems
+ class GnCrossmapError < RuntimeError; end
@@ -2,9 +2,12 @@ module GnCrossmap
    # Reads supplied csv file and creates ruby structure to compare
    # with a Global Names Resolver source
    class Reader
+     attr_reader :original_fields
+
      def initialize(csv_path)
        @csv_file = csv_path
        @col_sep = col_sep
+       @original_fields = nil
      end

      def read
  def read
@@ -21,7 +24,10 @@ module GnCrossmap

      def parse_input
        dc = Collector.new
-       CSV.open(@csv_file, col_sep: @col_sep).each do |row|
+       CSV.open(@csv_file, col_sep: @col_sep).each_with_index do |row, i|
+         @original_fields = row.dup if @original_fields.nil?
+         i += 1
+         GnCrossmap.log("Ingesting #{i}th csv row") if i % 10_000 == 0
          dc.process_row(row)
        end
        dc.data
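
The progress logging in this hunk relies on `each_with_index` being zero-based, hence the `i += 1`. A sketch of the same idea with a one-based enumerator follows. `ingest_messages` and its `every:` parameter are hypothetical, and the messages are collected in an array instead of being sent to `GnCrossmap.log` so that the example runs standalone.

```ruby
require "csv"
require "stringio"

# Hypothetical helper: streams rows from an IO and records a progress
# message every `every` rows, counting from 1 (the header row included).
def ingest_messages(io, every: 10_000)
  messages = []
  CSV.new(io).each.with_index(1) do |_row, i|
    messages << "Ingesting #{i}th csv row" if (i % every).zero?
  end
  messages
end

puts ingest_messages(StringIO.new("a,b\n1,2\n3,4\n5,6\n"), every: 2)
```

As in the hunk, the header row is counted as row 1, so the reported numbers are file rows rather than data rows.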
@@ -7,6 +7,7 @@ module GnCrossmap
        @processor = GnCrossmap::ResultProcessor.new(writer)
        @ds_id = data_source_id
        @count = 0
+       @current_data = {}
        @batch = 200
      end

@@ -32,7 +33,9 @@ module GnCrossmap
      end

      def collect_names(slice)
+       @current_data = {}
        slice.each_with_object("") do |row, str|
+         @current_data[row[:id]] = row[:original]
          @processor.input[row[:id]] = { rank: row[:rank] }
          str << "#{row[:id]}|#{row[:name]}\n"
        end
@@ -40,7 +43,7 @@ module GnCrossmap

      def remote_resolve(names)
        res = RestClient.post(URL, data: names, data_source_ids: @ds_id)
-       @processor.process(res)
+       @processor.process(res, @current_data)
      rescue RestClient::Exception
        single_remote_resolve(names)
      end
@@ -51,7 +54,7 @@ module GnCrossmap
            res = RestClient.post(URL, data: name, data_source_ids: @ds_id)
            @processor.process(res)
          rescue RestClient::Exception => e
-           GnCrossmap.log("Resolver broke on '#{name}': #{e}")
+           GnCrossmap.logger.error("Resolver broke on '#{name}': #{e.message}")
            next
          end
        end
@@ -17,7 +17,8 @@ module GnCrossmap
        @input = {}
      end

-     def process(result)
+     def process(result, original_data)
+       @original_data = original_data
        res = rubyfy(result)
        res[:data].each do |d|
          d[:results].nil? ? write_empty_result(d) : write_result(d)
@@ -31,22 +32,26 @@ module GnCrossmap
      end

      def write_empty_result(datum)
-       res = [datum[:supplied_id], datum[:supplied_name_string], nil, nil,
-              @input[datum[:supplied_id]][:rank], nil, nil, nil, nil]
+       res = @original_data[datum[:supplied_id]]
+       res += [datum[:supplied_name_string], nil, nil,
+               @input[datum[:supplied_id]][:rank], nil, nil, nil, nil]
        @writer.write(res)
      end

      def write_result(datum)
-       datum[:results].each do |r|
-         res = [datum[:supplied_id], datum[:supplied_name_string],
-                r[:name_string], r[:canonical_form],
-                @input[datum[:supplied_id]][:rank],
-                matched_rank(r), matched_type(r),
-                r[:edit_distance], r[:score]]
-         @writer.write(res)
+       datum[:results].each do |result|
+         @writer.write(compile_result(datum, result))
        end
      end

+     def compile_result(datum, result)
+       @original_data[datum[:supplied_id]] +
+         [datum[:supplied_name_string], result[:name_string],
+          result[:canonical_form], @input[datum[:supplied_id]][:rank],
+          matched_rank(result), matched_type(result),
+          result[:edit_distance], result[:score]]
+     end
+
      def matched_rank(record)
        record[:classification_path_ranks].split("|").last
      end
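
The row assembly above boils down to "original CSV fields, then match fields". The sketch below is a simplified, hypothetical version (`compile_row` is not the gem's method and only a few match columns are included). It also illustrates that `Array#+` returns a new array, so the cached original row is not mutated between results. One observation from the resolver hunks, offered as a reading of this diff and not as a confirmed bug report: the fallback call `@processor.process(res)` appears as an unchanged context line, while `process` now takes two required arguments, so that path would presumably raise `ArgumentError`.

```ruby
# Hypothetical, simplified version of the row assembly shown above.
# original_data: { id => original CSV row }, input: { id => { rank: ... } }
def compile_row(original_data, input, datum, result = {})
  id = datum[:supplied_id]
  original_data.fetch(id) +
    [datum[:supplied_name_string], result[:name_string],
     result[:canonical_form], input.fetch(id)[:rank]]
end

original_data = { "1" => ["1", "Aus bus L."] }
input = { "1" => { rank: "species" } }
datum = { supplied_id: "1", supplied_name_string: "Aus bus L." }
match = { name_string: "Aus bus Linn.", canonical_form: "Aus bus" }

p compile_row(original_data, input, datum, match)
p compile_row(original_data, input, datum) # unmatched: nils in the match columns
p original_data["1"]                       # still the two original fields
```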
@@ -43,6 +43,8 @@ module GnCrossmap
        else
          normalize_rank(@parsed_name[:details][0][:infraspecies][-1][:rank])
        end
+     rescue NoMethodError
+       nil
      end

      def normalize_rank(rank)
@@ -1,6 +1,6 @@
  # Namespace module for crossmapping checklists to GN sources
  module GnCrossmap
-   VERSION = "0.1.4"
+   VERSION = "0.1.5"

    def self.version
      VERSION
@@ -1,12 +1,11 @@
  module GnCrossmap
    # Saves output from GN Resolver to disk
    class Writer
-     def initialize(output_path)
+     def initialize(output_path, original_fields)
        @path = output_path
+       @output_fields = output_fields(original_fields)
        @output = CSV.open(@path, "w:utf-8")
-       @output << [:taxonID, :scientificName, :matchedScientificName,
-                   :matchedCanonicalForm, :rank, :matchedRank, :matchType,
-                   :editDistance, :score]
+       @output << @output_fields
        GnCrossmap.log("Open output file '#{@path}'")
      end

@@ -18,5 +17,13 @@ module GnCrossmap
        GnCrossmap.log("Close output file '#{@path}'")
        @output.close
      end
+
+     private
+
+     def output_fields(original_fields)
+       original_fields + [:inputName, :matchedName, :matchedCanonicalForm,
+                          :inputRank, :matchedRank, :matchedType,
+                          :matchedEditDistance, :marchedScore]
+     end
    end
  end
data/lib/gn_crossmap.rb CHANGED
@@ -1,7 +1,9 @@
  require "csv"
  require "rest_client"
  require "logger"
+ require "logger/colors"
  require "biodiversity"
+ require "gn_crossmap/errors"
  require "gn_crossmap/version"
  require "gn_crossmap/reader"
  require "gn_crossmap/writer"
@@ -17,8 +19,9 @@ module GnCrossmap
      attr_writer :logger

      def run(input, output, data_source_id)
-       data = Reader.new(input).read
-       writer = Writer.new(output)
+       reader = Reader.new(input)
+       data = reader.read
+       writer = Writer.new(output, reader.original_fields)
        Resolver.new(writer, data_source_id).resolve(data)
        output
      end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: gn_crossmap
  version: !ruby/object:Gem::Version
-   version: 0.1.4
+   version: 0.1.5
  platform: ruby
  authors:
  - Dmitry Mozzherin
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2015-05-11 00:00:00.000000000 Z
+ date: 2015-05-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: trollop
@@ -38,6 +38,34 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '3.1'
+ - !ruby/object:Gem::Dependency
+   name: rest-client
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.8'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.8'
+ - !ruby/object:Gem::Dependency
+   name: logger-colors
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
  - !ruby/object:Gem::Dependency
    name: bundler
    requirement: !ruby/object:Gem::Requirement
@@ -148,6 +176,7 @@ files:
  - lib/gn_crossmap.rb
  - lib/gn_crossmap/collector.rb
  - lib/gn_crossmap/column_collector.rb
+ - lib/gn_crossmap/errors.rb
  - lib/gn_crossmap/reader.rb
  - lib/gn_crossmap/resolver.rb
  - lib/gn_crossmap/result_processor.rb