gn_crossmap 0.2.2 → 1.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 0c5590798834035bde904c19913ec08731c58a31
- data.tar.gz: 28d1b65e628d614f6aa7709094a7e49a3af26052
+ metadata.gz: 779a3a9193e896242a7e8717fd98bc779b417937
+ data.tar.gz: d8e9c4c7d72447a62d35f57fe26e33e9b5a69a29
  SHA512:
- metadata.gz: 5cbb64951ae0aeae2d207c59012afe17666c1b7e599bdbe4444c0b51b27de5ebc48d20e69594a42c890c88d239ce6f9860ea68f119ebd48c09e03155ad494ca5
- data.tar.gz: 9b2d6c03842b7fb969368d7ecdd0297a1df27b10f0eeee97eca67181fa455d5fa2c552ad37f2ea35ee11291eecaed52feda43f95bb7ba7a00e462afd9b881f73
+ metadata.gz: 83f6f2f6be28c5891d93e5f6e5e09ac54481f4334640ef3b500822301bb5d1ea2cf0d5ca94f4ba0a2eb7b7e7afcd67fed8b37ad76c61ac184c2d9932d080ccf3
+ data.tar.gz: ae609d842d36b96de15ffe3013fe49e762da963191609486757d7ba9bbc5e85cdf674a19e24916fdcac7214d37adb1a9c089c95cd89ba9c4c7360256858cca11
data/.gitignore CHANGED
@@ -11,3 +11,4 @@ output.csv
  .vim.custom
  t
  gn_crossmap-*.gem
+ .byebug_history
data/CHANGELOG.md CHANGED
@@ -1,5 +1,13 @@
  # ``gn_crossmap`` CHANGELOG
 
+ ## 1.0.0
+
+ * @dimus - #18 output file optionally removes original fields except `taxonID`
+
+ * @dimus - #19 `acceptedName` field is filled for all matched names
+
+ * @dimus - #22 output is now tab-separated instead of comma-separated
+
  ## 0.2.2
 
  * @dimus - gem update
data/README.md CHANGED
@@ -27,37 +27,75 @@ gem 'gn_crossmap'
 
  And then execute:
 
- bundle
+ ```bash
+ bundle
+ ```
 
  Or install it yourself as:
 
- gem install gn_crossmap
+ ```bash
+ gem install gn_crossmap
+ ```
 
  ## Usage
 
  ### Usage from command line
 
- # to see help
- crossmap --help
+ ```bash
+ # to see help
+ crossmap --help
+
+ # to compare with default source (Catalogue of Life)
+ crossmap -i my_list.csv -o my_list_col.csv
+
+ # to compare with other source (Index Fungorum in this example)
+ crossmap -i my_list.csv -o my_list_if.csv -d 5
+
+ # to use standard input and/or output
+ cat my_list.csv | crossmap -i - -o - > output
+
+ # to keep only taxonID from original input
+ cat my_list.csv | crossmap -i my_list.csv -s
+ ```
+
+ ### Usage as Ruby Library (API description)
+
+ #### `GnCrossmap.run`
+
+ Compares an input list to a data source from [GN Resolver][resolver] and
+ writes result into an output file.
+
+ ```ruby
+ GnCrossmap.run(input, output, data_source_id, skip_original)
+ ```
+
+ ``input``
+ : (string) Either a path to a csv file with list of names, or "-" which
+ designates `STDIN`
+
+ ``output``
+ : (string) Either a path to the output file, or "-" which designates `STDOUT`
+
+ ``data_source_id``
+ : (integer) id of a data source from [GN resolver][resolver]
 
- # to compare with default source (Catalogue of Life)
- crossmap -i my_list.csv -o my_list_col.csv
+ ``skip_original``
+ : (boolean) if true, only `taxonID` is preserved from original data. Otherwise
+ all original data is preserved
 
- # to compare with other source (Index Fungorum in this example)
- crossmap -i my_list.csv -o my_list_if.csv -d 5
+ #### `GnCrossmap.logger=`
 
- # to use standard input and/or output
- cat my_list.csv | crossmap -i - -o - > output
+ Sets a custom logger (default is `STDERR`)
 
- ### Usage as Ruby Library
+ #### Usage Example
 
  ```ruby
  require "gn_crossmap"
 
- # If you want to change logger -- default Logging is to standard output
+ # If you want to change logger -- default Logging is to standard error
  GnCrossmap.logger = MyCustomLogger.new
 
- GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5)
+ GnCrossmap.run("path/to/input.csv", "path/to/output.csv", 5, true)
  ```
 
  ### Input file format
@@ -136,7 +174,7 @@ score | heuristic score from 0 to 1 where 1 is a good match, 0.5
 
  The output format returns 7 possible types of matches:
 
- 1. **Exact match** - The exact name was matched (but ignoring non-ascii characters)
+ 1. **Exact string match** - The exact name was matched (but ignoring non-ascii characters)
  2. **Exact match by canonical form of a name** - The canonical form of the name (a version of a scientific name that contains complete versions of the latin words, but lacks insertions of subtaxa, annotations, or authority information) was matched
  3. **Fuzzy match by canonical form** - The canonical form gave a fuzzy (detecting lexical or spelling variations of a name using Tony Rees' algorithm TAXAMATCH) match
  4. **Partial exact match by species part of canonical form** - The canonical form returned a partial but exact match
@@ -178,7 +216,7 @@ See [LICENSE][license] for details.
  [cov-link]: https://coveralls.io/r/GlobalNamesArchitecture/gn_crossmap?branch=master
  [code-badge]: https://codeclimate.com/github/GlobalNamesArchitecture/gn_crossmap/badges/gpa.svg
  [code-link]: https://codeclimate.com/github/GlobalNamesArchitecture/gn_crossmap
- [dep-badge]: https://gemnasium.com/GlobalNamesArchitecture/gn_crossmap.png
+ [dep-badge]: https://gemnasium.com/GlobalNamesArchitecture/gn_crossmap.svg
  [dep-link]: https://gemnasium.com/GlobalNamesArchitecture/gn_crossmap
  [resolver]: http://resolver.globalnames.org/data_sources
  [rubygems]: https://rubygems.org
data/exe/crossmap CHANGED
@@ -16,6 +16,8 @@ opts = Trollop.options do
  opt(:output, "Path to output file", default: OUTPUT)
  opt(:data_source_id, "Data source id from GN Resolver",
  default: CATALOGUE_OF_LIFE)
+ opt(:skip_original, "If given, only 'taxonID' is shown " \
+ "from the original input", type: :boolean)
  end
 
  Trollop.die :input, "must be set" if opts[:input].nil?
@@ -24,7 +26,8 @@ unless File.exist?(opts[:input]) || opts[:input] == "-"
  end
 
  begin
- GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id])
+ GnCrossmap.run(opts[:input], opts[:output], opts[:data_source_id],
+ opts[:skip_original])
  rescue GnCrossmapError => e
  GnCrossmap.logger.error(e.message)
  end
data/lib/gn_crossmap.rb CHANGED
@@ -22,9 +22,9 @@ module GnCrossmap
  class << self
  attr_writer :logger
 
- def run(input, output, data_source_id)
+ def run(input, output, data_source_id, skip_original)
  input_io, output_io = io(input, output)
- reader = Reader.new(input_io, input_name(input))
+ reader = Reader.new(input_io, input_name(input), skip_original)
  data = reader.read
  writer = Writer.new(output_io, reader.original_fields,
  output_name(output))
@@ -3,10 +3,11 @@ module GnCrossmap
  class Collector
  attr_reader :data
 
- def initialize
+ def initialize(skip_original)
  @data = []
  @fields = nil
  @collector = nil
+ @skip_original = skip_original
  end
 
  def process_row(row)
@@ -20,7 +21,12 @@ module GnCrossmap
  @fields = @row.map { |f| prepare_field(f) }
  @collector = collector_factory
  err = "taxonID must be present in the csv header"
- raise GnCrossmapError, err unless @fields.include?(:taxonid)
+ raise GnCrossmapError, err unless taxon_id?
+ end
+
+ def taxon_id?
+ @taxon_id_index = @fields.index(:taxonid)
+ !@taxon_id_index.nil?
  end
 
  def prepare_field(field)
@@ -32,10 +38,14 @@ module GnCrossmap
  @row = @fields.zip(@row).to_h
  data = @collector.id_name_rank(@row)
  return unless data
- data[:original] = @row.values
+ data[:original] = prepare_original
  @data << data
  end
 
+ def prepare_original
+ @skip_original ? [@row[:taxonid]] : @row.values
+ end
+
  def collector_factory
  if @fields.include?(:scientificname)
  SciNameCollector.new(@fields)
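
The effect of `prepare_original` can be tried outside the gem. This is a minimal sketch, assuming the row has already been paired with normalized header symbols as in `@fields.zip(@row).to_h`; the method name `original_values` and the sample data are hypothetical and not part of gn_crossmap.

```ruby
# Hypothetical standalone analogue of Collector#prepare_original.
# fields: normalized header symbols; row: the raw values of one csv row.
def original_values(fields, row, skip_original)
  named = fields.zip(row).to_h
  # With skip_original only the taxonID value survives; otherwise every value.
  skip_original ? [named[:taxonid]] : named.values
end

SAMPLE_FIELDS = %i[taxonid scientificname rank].freeze
SAMPLE_ROW = ["42", "Puma concolor", "species"].freeze

ALL_VALUES = original_values(SAMPLE_FIELDS, SAMPLE_ROW, false)
ID_ONLY = original_values(SAMPLE_FIELDS, SAMPLE_ROW, true)
```

Either way the result is an array, so downstream code that concatenates the original values with the match fields should not need to distinguish the two cases.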
@@ -4,11 +4,12 @@ module GnCrossmap
  class Reader
  attr_reader :original_fields
 
- def initialize(csv_io, input_name)
+ def initialize(csv_io, input_name, skip_original)
  @csv_io = csv_io
  @col_sep = col_sep
  @original_fields = nil
  @input_name = input_name
+ @skip_original = skip_original
  end
 
  def read
@@ -25,15 +26,27 @@ module GnCrossmap
  end
 
  def parse_input
- dc = Collector.new
+ dc = Collector.new(@skip_original)
  csv = CSV.new(@csv_io, col_sep: col_sep)
  csv.each_with_index do |row, i|
- @original_fields = row.dup if @original_fields.nil?
+ @original_fields = headers(row) if @original_fields.nil?
  i += 1
  GnCrossmap.log("Ingesting #{i}th csv row") if (i % 10_000).zero?
  dc.process_row(row)
  end && @csv_io.close
  dc.data
  end
+
+ def headers(row)
+ hdrs = row.dup
+ @skip_original ? taxon_id_header(hdrs) : hdrs
+ end
+
+ def taxon_id_header(hdrs)
+ hdrs.each do |h|
+ return [h] if h =~ /taxonid\s*$/i
+ end
+ []
+ end
  end
  end
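
The header selection added to `Reader` can be sketched as a free-standing method. This is an illustration only: `select_headers` and the sample header rows are hypothetical names, and the regular expression is copied from `taxon_id_header` in the hunk above.

```ruby
# Hypothetical standalone analogue of Reader#headers and #taxon_id_header.
def select_headers(row, skip_original)
  return row.dup unless skip_original

  # Keep the first header whose text ends in "taxonid" (case-insensitive,
  # trailing whitespace allowed), or nothing if no header matches.
  taxon_id = row.find { |h| h =~ /taxonid\s*$/i }
  taxon_id ? [taxon_id] : []
end

SAMPLE_HEADERS = ["taxonID", "scientificName", "rank"].freeze

KEPT_ALL = select_headers(SAMPLE_HEADERS, false)
KEPT_ID = select_headers(SAMPLE_HEADERS, true)
PREFIXED = select_headers(["acceptedTaxonID", "taxonID"], true)
```

The pattern is anchored only at the end of the string, so in this sketch a header such as `acceptedTaxonID` is selected when it appears before `taxonID`.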
@@ -3,7 +3,7 @@ module GnCrossmap
  class ResultProcessor
  MATCH_TYPES = {
  0 => "No match",
- 1 => "Exact match",
+ 1 => "Exact string match",
  2 => "Canonical form exact match",
  3 => "Canonical form fuzzy match",
  4 => "Partial canonical form match",
@@ -55,7 +55,7 @@ module GnCrossmap
  [matched_type(result), datum[:supplied_name_string],
  result[:name_string], result[:canonical_form],
  @input[datum[:supplied_id]][:rank], matched_rank(result),
- synonym, result[:current_name_string],
+ synonym, result[:current_name_string] || result[:name_string],
  result[:edit_distance], result[:score], result[:taxon_id]]
  end
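
The `||` fallback that fills `acceptedName` for every matched name can be illustrated in isolation. This is a sketch under the assumption that a result is a hash with `:name_string` and `:current_name_string` keys, as the hunk above suggests; the method name `accepted_name` and the sample hashes are hypothetical.

```ruby
# Hypothetical helper mirroring the expression added to ResultProcessor.
def accepted_name(result)
  # Fall back to the matched name when no current (accepted) name is given.
  result[:current_name_string] || result[:name_string]
end

SYNONYM_RESULT = {
  name_string: "Felis concolor",
  current_name_string: "Puma concolor"
}.freeze

ACCEPTED_RESULT = {
  name_string: "Puma concolor",
  current_name_string: nil
}.freeze

FROM_SYNONYM = accepted_name(SYNONYM_RESULT)
FROM_ACCEPTED = accepted_name(ACCEPTED_RESULT)
```

In Ruby `||` falls through only on `nil` or `false`, so an empty string in `:current_name_string` would be returned as is rather than replaced by the matched name.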
 
@@ -1,6 +1,6 @@
  # Namespace module for crossmapping checklists to GN sources
  module GnCrossmap
- VERSION = "0.2.2".freeze
+ VERSION = "1.0.0".freeze
 
  def self.version
  VERSION
@@ -4,7 +4,7 @@ module GnCrossmap
  def initialize(output_io, original_fields, output_name)
  @output_io = output_io
  @output_fields = output_fields(original_fields)
- @output = CSV.new(@output_io)
+ @output = CSV.new(@output_io, col_sep: "\t")
  @output << @output_fields
  @output_name = output_name
  GnCrossmap.log("Open output to #{@output_name}")
@@ -25,7 +25,7 @@ module GnCrossmap
  original_fields + [:matchedType, :inputName, :matchedName,
  :matchedCanonicalForm, :inputRank, :matchedRank,
  :synonymStatus, :acceptedName, :matchedEditDistance,
- :marchedScore, :matchTaxonID]
+ :matchedScore, :matchTaxonID]
  end
  end
  end
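
The switch to tab-separated output relies on the `col_sep` option of Ruby's standard `CSV` class. The following is a minimal sketch that writes to an in-memory `StringIO` instead of a file; the header subset and the sample row are made up for illustration and do not come from the gem.

```ruby
require "csv"
require "stringio"

# Write two rows with a tab as the column separator, as Writer now does.
buffer = StringIO.new
tsv = CSV.new(buffer, col_sep: "\t")
tsv << %w[taxonID matchedType matchedName]
tsv << ["1", "Exact string match", "Puma concolor (Linnaeus, 1771)"]

TSV_TEXT = buffer.string

# Reading the text back requires the same separator.
TSV_ROWS = CSV.parse(TSV_TEXT, col_sep: "\t")
```

Consumers of the 1.0.0 output would need to pass the same `col_sep: "\t"` when reading the file back, since the default separator is still a comma.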
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: gn_crossmap
  version: !ruby/object:Gem::Version
- version: 0.2.2
+ version: 1.0.0
  platform: ruby
  authors:
  - Dmitry Mozzherin
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2016-11-07 00:00:00.000000000 Z
+ date: 2016-11-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: trollop