biodiversity 3.1.10 → 3.2.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: d4a11c6b12ca173da86c7baa1242b9639e54d93c
4
- data.tar.gz: 8c47f66efd6f8ee2b3e3dce894927a4cc73237a1
3
+ metadata.gz: d7bb0304f5e151933f5350780677b9a47a099716
4
+ data.tar.gz: 7adaf2c1bfce44db79bc2c04d75c584d53957fd0
5
5
  SHA512:
6
- metadata.gz: 8e542a0cca44cef3a63acd4dc520db4465cb2a598aafcb30e2161e65ad6110b9a146ef01325da77f3e7474c5409a0b75f1b28eda7c59a7fe1ffb0cc59162fb6e
7
- data.tar.gz: 36f21aed88acc4405c147b908de36d6109010935fec99108c90ca73d6dabd2b942a08a1eca04fd28a644bbec51a12b5836272ef2280160b4321f2f434760f665
6
+ metadata.gz: 1a450a93fb07f985b5f1e7761e669ed772c8add2c01955e4a8e70a575f3b5f0bc86b1b215507260c48a1604b386b60543f27048c7a65119f71a4b0ddfd7bcefe
7
+ data.tar.gz: 19090297f99d64580b4b6012a06729ede74fd3ccf7fb9ffdfeecd6480d2a1a0bd0ce8dc8654747f51a60301464cbaebd1f9693ceb54f2827140163d72314f62b
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format progress
2
+ --color
3
+ --require spec_helper
data/.ruby-version CHANGED
@@ -1 +1 @@
1
- 2.1.5
1
+ 2.1.6
data/CHANGELOG CHANGED
@@ -1,3 +1,8 @@
1
+ 3.2.0 -- added UUID version 5 identifiers for every name string, better
2
+ normalizing for the names with apostrophes, underscore-formatted names are
3
+ supported. Minor version increase because of change in the output format ("id"
4
+ field)
5
+
1
6
  3.1.10 -- NPV viruses added
2
7
 
3
8
  3.1.9 -- more virus keywords, better handling of apostrophes in
data/README.md CHANGED
@@ -1,10 +1,10 @@
1
1
  Biodiversity
2
2
  ============
3
3
 
4
- [![Gem Version][1]][2]
5
- [![Continuous Integration Status][3]][4]
6
- [![CodePolice][5]][6]
7
- [![Dependency Status][7]][8]
4
+ [![Gem Version][gem_svg]][gem_link]
5
+ [![Continuous Integration Status][ci_svg]][ci_link]
6
+ [![CodePolice][cc_svg]][cc_link]
7
+ [![Dependency Status][deps_svg]][deps_link]
8
8
 
9
9
  Parses taxonomic scientific name and breaks it into semantic elements.
10
10
 
@@ -12,10 +12,12 @@ Parses taxonomic scientific name and breaks it into semantic elements.
12
12
  Support for Ruby 1.8.7 IS DROPPED. Both biodiversity and
13
13
  biodiversity19 will be for Ruby > 1.9.1 and will be identical gems.
14
14
 
15
- biodiversity19 is now deprecated and will be phased out in 2015.
15
+ biodiversity19 is now deprecated and will not be updated anymore.
16
16
  You are strongly encouraged to change your dependencies from
17
17
  biodiversity19 to biodiversity
18
18
 
19
+ Follow [biodiversity issues][waffle] on waffle.io
20
+
19
21
  Installation
20
22
  ------------
21
23
 
@@ -46,7 +48,7 @@ you can use a socket server
46
48
  parserver -h
47
49
  Usage: parserver [options]
48
50
 
49
- -r, --canonical_with_rank Adds infraspecies rank
51
+ -r, --canonical_with_rank Adds infraspecies rank
50
52
  to canonical forms
51
53
 
52
54
  -o, --output=output Specifies the type of the output:
@@ -65,7 +67,7 @@ you can use a socket server
65
67
 
66
68
  With default settings you can access parserserver via 4334 port using a
67
69
  socket client library of your programming language. You can find
68
- [socket client script example][9] in the examples directory of the gem.
70
+ [socket client script example][socket_example] in the examples directory of the gem.
69
71
 
70
72
  If you want to check if socket server works for you:
71
73
 
@@ -93,76 +95,94 @@ of scientific name
93
95
 
94
96
  You can use it as a library in Ruby, JRuby etc.
95
97
 
96
- require 'biodiversity'
97
98
 
98
- parser = ScientificNameParser.new
99
+ ```ruby
100
+ require 'biodiversity'
101
+
102
+ parser = ScientificNameParser.new
99
103
 
100
- #to find version number
101
- ScientificNameParser.version
104
+ #to find version number
105
+ ScientificNameParser.version
102
106
 
103
- # to fix capitalization in canonicals
104
- ScientificNameParser.fix_case("QUERCUS (QUERCUS) ALBA")
105
- # Output: Quercus (Quercus) alba
107
+ # to fix capitalization in canonicals
108
+ ScientificNameParser.fix_case("QUERCUS (QUERCUS) ALBA")
109
+ # Output: Quercus (Quercus) alba
106
110
 
107
- # to parse a scientific name into a ruby hash
108
- parser.parse("Plantago major")
111
+ # to parse a scientific name into a ruby hash
112
+ parser.parse("Plantago major")
109
113
 
110
- #to get json representation
111
- parser.parse("Plantago").to_json
112
- #or
113
- parser.parse("Plantago")
114
- parser.all_json
114
+ #to get json representation
115
+ parser.parse("Plantago").to_json
116
+ #or
117
+ parser.parse("Plantago")
118
+ parser.all_json
115
119
 
116
- # to clean name up
117
- parser.parse(" Plantago major ")[:scientificName][:normalized]
120
+ # to clean name up
121
+ parser.parse(" Plantago major ")[:scientificName][:normalized]
118
122
 
119
- # to get only cleaned up latin part of the name
120
- parser.parse("Pseudocercospora dendrobii (H.C. Burnett) U. \
121
- Braun & Crous 2003")[:scientificName][:canonical]
123
+ # to get only cleaned up latin part of the name
124
+ parser.parse("Pseudocercospora dendrobii (H.C. Burnett) U. \
125
+ Braun & Crous 2003")[:scientificName][:canonical]
122
126
 
123
- # to get detailed information about elements of the name
124
- parser.parse("Pseudocercospora dendrobii (H.C. Burnett 1883) U. \
125
- Braun & Crous 2003")[:scientificName][:details]
127
+ # to get detailed information about elements of the name
128
+ parser.parse("Pseudocercospora dendrobii (H.C. Burnett 1883) U. \
129
+ Braun & Crous 2003")[:scientificName][:details]
130
+ ```
126
131
 
127
132
  Returned result is not always linear, if name is complex. To get simple linear
128
133
  representation of the name you can use:
129
134
 
130
- parser.parse("Pseudocercospora dendrobii (H.C. Burnett) \
131
- U. Braun & Crous 2003")[:scientificName][:position]
132
- # returns {0=>["genus", 16], 17=>["species", 26],
133
- # 28=>["author_word", 32], 33=>["author_word", 40],
134
- # 42=>["author_word", 44], 45=>["author_word", 50],
135
- # 53=>["author_word", 58], 59=>["year", 63]}
136
- # where the key is the char index of the start of
137
- # a word, first element of the value is a semantic meaning
138
- # of the word, second element of the value is the character index
139
- # of end of the word
135
+
136
+ ```ruby
137
+ parser.parse("Pseudocercospora dendrobii (H.C. Burnett) \
138
+ U. Braun & Crous 2003")[:scientificName][:position]
139
+ # returns {0=>["genus", 16], 17=>["species", 26],
140
+ # 28=>["author_word", 32], 33=>["author_word", 40],
141
+ # 42=>["author_word", 44], 45=>["author_word", 50],
142
+ # 53=>["author_word", 58], 59=>["year", 63]}
143
+ # where the key is the char index of the start of
144
+ # a word, first element of the value is a semantic meaning
145
+ # of the word, second element of the value is the character index
146
+ # of end of the word
147
+ ```
140
148
 
141
149
  'Surrogate' is a broad group which includes 'Barcode of Life' names, and various
142
150
  undetermined names with cf. sp. spp. nr. in them:
143
-
144
- parser.parse("Coleoptera BOLD:1234567")[:scientificName][:surrogate]
145
151
 
146
- To parse using several CPUs (4 seem to be optimal)
152
+ ```ruby
153
+ parser.parse("Coleoptera BOLD:1234567")[:scientificName][:surrogate]
154
+ ```
155
+ ### What is "id" in the parsed results?
156
+
157
+ ID field contains UUID v5 hexadecimal string. ID is generated out of bytes
158
+ from the name string itself, and identical id can be generated using [any
159
+ popular programming language][uuid_examples]. You can read more about UUID
160
+ version 5 in a [blog post][uuid_blog]
161
+
162
+ ### Parse using several CPUs (4 threads seem to be optimal)
147
163
 
148
- parser = ParallelParser.new
149
- # ParallelParser.new(4) will try to run 4 processes if hardware allows
150
- array_of_names = ["Betula alba", "Homo sapiens"....]
151
- parser.parse(array_of_names)
152
- # Output: {"Betula alba" => {:scientificName...},
153
- # "Homo sapiens" => {:scientificName...}, ...}
164
+ ```ruby
165
+ parser = ParallelParser.new
166
+ # ParallelParser.new(4) will try to run 4 processes if hardware allows
167
+ array_of_names = ["Betula alba", "Homo sapiens"....]
168
+ parser.parse(array_of_names)
169
+ # Output: {"Betula alba" => {:scientificName...},
170
+ # "Homo sapiens" => {:scientificName...}, ...}
171
+ ```
154
172
 
155
- parallel parser takes list of names and returns back a hash with names as
173
+ parallel parser takes list of names and returns back a hash with names as
156
174
  keys and parsed data as values
157
175
 
158
- To get canonicals with ranks for infraspecific epithets:
176
+ ### Canonicals with ranks for infraspecific epithets:
159
177
 
160
- parser = ScientificNameParser.new(canonical_with_rank: true)
161
- parser.parse('Cola cordifolia var. puberula \
162
- A. Chev.')[:scientificName][:canonical]
163
- # Output: Cola cordifolia var. puberula
178
+ ```ruby
179
+ parser = ScientificNameParser.new(canonical_with_rank: true)
180
+ parser.parse('Cola cordifolia var. puberula \
181
+ A. Chev.')[:scientificName][:canonical]
182
+ # Output: Cola cordifolia var. puberula
183
+ ```
164
184
 
165
- To resolve lsid and get back RDF file
185
+ ### Resolving lsid and geting back RDF file
166
186
 
167
187
  LsidResolver.resolve("urn:lsid:ubio.org:classificationbank:2232671")
168
188
 
@@ -174,7 +194,7 @@ If nnparse or parserver do not start -- try to run
174
194
  gem uninstall biodiversity
175
195
  gem uninstall biodiversity19
176
196
 
177
- and make sure you remove all versions and all nnparse and parserver scripts.
197
+ and make sure you remove all versions and all nnparse and parserver scripts.
178
198
  Then install biodiversity again
179
199
 
180
200
  gem install biodiversity
@@ -184,18 +204,22 @@ It should fix the problem.
184
204
  Copyright
185
205
  ---------
186
206
 
187
- Authors: [Dmitry Mozzherin][10]
188
-
189
- Copyright (c) 2008-2015 Marine Biological Laboratory. See LICENSE for
190
- further details.
191
-
192
- [1]: https://badge.fury.io/rb/biodiversity.png
193
- [2]: http://badge.fury.io/rb/biodiversity
194
- [3]: https://secure.travis-ci.org/GlobalNamesArchitecture/biodiversity.png
195
- [4]: http://travis-ci.org/GlobalNamesArchitecture/biodiversity
196
- [5]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity.png
197
- [6]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity
198
- [7]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity.png
199
- [8]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity
200
- [9]: http://bit.ly/149iLm5
201
- [10]: https://github.com/dimus
207
+ Authors: [Dmitry Mozzherin][dimus]
208
+
209
+ Copyright (c) 2008-2015 Marine Biological Laboratory. See [LICENSE][license]
210
+ for further details.
211
+
212
+ [gem_svg]: https://badge.fury.io/rb/biodiversity.svg
213
+ [gem_link]: http://badge.fury.io/rb/biodiversity
214
+ [ci_svg]: https://secure.travis-ci.org/GlobalNamesArchitecture/biodiversity.svg
215
+ [ci_link]: http://travis-ci.org/GlobalNamesArchitecture/biodiversity
216
+ [cc_svg]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity.svg
217
+ [cc_link]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity
218
+ [deps_svg]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity.svg
219
+ [deps_link]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity
220
+ [socket_example]: http://bit.ly/149iLm5
221
+ [dimus]: https://github.com/dimus
222
+ [license]: https://github.com/GlobalNamesArchitecture/biodiversity/blob/master/LICENSE
223
+ [waffle]: https://waffle.io/GlobalNamesArchitecture/biodiversity
224
+ [uuid_examples]: https://github.com/GlobalNamesArchitecture/gn_uuid_examples
225
+ [uuid_blog]: http://globalnamesarchitecture.github.io/crossmap/gna/2015/05/31/gn-uuid-0-5-0.html
data/biodiversity.gemspec CHANGED
@@ -19,6 +19,7 @@ Gem::Specification.new do |gem|
19
19
  gem.add_runtime_dependency "treetop", "~> 1.4.1"
20
20
  gem.add_runtime_dependency "parallel", "~> 1.4"
21
21
  gem.add_runtime_dependency "unicode_utils", "~> 1.4"
22
+ gem.add_runtime_dependency "gn_uuid", "~> 0.5"
22
23
 
23
24
  gem.add_development_dependency "bundler", "~> 1.6"
24
25
  gem.add_development_dependency "rake", "~> 10.4"
@@ -1,7 +1,8 @@
1
1
  # encoding: UTF-8
2
- require_relative 'parser/scientific_name_clean'
3
- require_relative 'parser/scientific_name_dirty'
4
- require_relative 'parser/scientific_name_canonical'
2
+ require "gn_uuid"
3
+ require_relative "parser/scientific_name_clean"
4
+ require_relative "parser/scientific_name_dirty"
5
+ require_relative "parser/scientific_name_canonical"
5
6
 
6
7
  module PreProcessor
7
8
  NOTES = /\s+(species\s+group|species\s+complex|group|author)\b.*$/i
@@ -24,9 +25,10 @@ module PreProcessor
24
25
  def self.clean(a_string)
25
26
  [NOTES, TAXON_CONCEPTS1, TAXON_CONCEPTS2,
26
27
  TAXON_CONCEPTS3, NOMEN_CONCEPTS, LAST_WORD_JUNK].each do |i|
27
- a_string = a_string.gsub(i, '')
28
+ a_string = a_string.gsub(i, "")
28
29
  end
29
- a_string = a_string.tr('ſ','s') #old 's'
30
+ a_string = a_string.tr("ſ","s") #old "s"
31
+ a_string = a_string.tr("_", " ") if a_string.strip.match(/\s/).nil?
30
32
  a_string
31
33
  end
32
34
  end
@@ -36,7 +38,7 @@ end
36
38
  # Examples
37
39
  #
38
40
  # parser = ParallelParser.new(4)
39
- # parser.parse(['Betula L.', 'Pardosa moesta'])
41
+ # parser.parse(["Betula L.", "Pardosa moesta"])
40
42
  class ParallelParser
41
43
 
42
44
  # Public: Initialize ParallelParser.
@@ -45,7 +47,7 @@ class ParallelParser
45
47
  # If processes number is not set it will be determined
46
48
  # automatically.
47
49
  def initialize(processes_num = nil)
48
- require 'parallel'
50
+ require "parallel"
49
51
  cpu_num
50
52
  if processes_num.to_i > 0
51
53
  @processes_num = [processes_num, cpu_num - 1].min
@@ -66,7 +68,7 @@ class ParallelParser
66
68
  # Examples
67
69
  #
68
70
  # parser = ParallelParser.new(4)
69
- # parser.parse(['Homo sapiens L.', 'Quercus quercus'])
71
+ # parser.parse(["Homo sapiens L.", "Quercus quercus"])
70
72
  #
71
73
  # Returns a Hash with scientific names as a key, and parsing results as
72
74
  # a value.
@@ -108,7 +110,8 @@ class ScientificNameParser
108
110
 
109
111
  FAILED_RESULT = ->(name) do
110
112
  { scientificName:
111
- { parsed: false, verbatim: name.to_s.strip, error: 'Parser error' }
113
+ { id: GnUUID.uuid(name), parsed: false, verbatim: name,
114
+ error: "Parser internal error" }
112
115
  }
113
116
  end
114
117
 
@@ -121,7 +124,7 @@ class ScientificNameParser
121
124
  words_num = name_ary.size
122
125
  res = nil
123
126
  if words_num == 1
124
- res = name_ary[0].gsub(/[\(\)\{\}]/, '')
127
+ res = name_ary[0].gsub(/[\(\)\{\}]/, "")
125
128
  if res.size > 1
126
129
  res = UnicodeUtils.upcase(res[0]) + UnicodeUtils.downcase(res[1..-1])
127
130
  else
@@ -135,15 +138,15 @@ class ScientificNameParser
135
138
  word1 = name_ary[0]
136
139
  end
137
140
  if name_ary[1].match(/^\(/)
138
- word2 = name_ary[1].gsub(/\)$/, '') + ')'
141
+ word2 = name_ary[1].gsub(/\)$/, "") + ")"
139
142
  word2 = word2[0] + UnicodeUtils.upcase(word2[1]) +
140
143
  UnicodeUtils.downcase(word2[2..-1])
141
144
  else
142
145
  word2 = UnicodeUtils.downcase(name_ary[1])
143
146
  end
144
- res = word1 + ' ' +
145
- word2 + ' ' +
146
- name_ary[2..-1].map { |w| UnicodeUtils.downcase(w) }.join(' ')
147
+ res = word1 + " " +
148
+ word2 + " " +
149
+ name_ary[2..-1].map { |w| UnicodeUtils.downcase(w) }.join(" ")
147
150
  res.strip!
148
151
  end
149
152
  res
@@ -152,7 +155,7 @@ class ScientificNameParser
152
155
 
153
156
  def initialize(opts = {})
154
157
  @canonical_with_rank = !!opts[:canonical_with_rank]
155
- @verbatim = ''
158
+ @verbatim = ""
156
159
  @clean = ScientificNameCleanParser.new
157
160
  @dirty = ScientificNameDirtyParser.new
158
161
  @canonical = ScientificNameCanonicalParser.new
@@ -180,23 +183,23 @@ class ScientificNameParser
180
183
  end
181
184
 
182
185
  def parse(a_string)
183
- @verbatim = a_string.strip
186
+ @verbatim = a_string
184
187
  a_string = PreProcessor::clean(a_string)
185
188
 
186
189
  if virus?(a_string)
187
- @parsed = { verbatim: a_string, virus: true }
190
+ @parsed = { verbatim: @verbatim, virus: true }
188
191
  elsif noparse?(a_string)
189
- @parsed = { verbatim: a_string }
192
+ @parsed = { verbatim: @verbatim }
190
193
  else
191
194
  begin
192
195
  @parsed = @clean.parse(a_string) || @dirty.parse(a_string)
193
196
  unless @parsed
194
197
  index = @dirty.index || @clean.index
195
198
  salvage_match = a_string[0..index].split(/\s+/)[0..-2]
196
- salvage_string = salvage_match ? salvage_match.join(' ') : a_string
199
+ salvage_string = salvage_match ? salvage_match.join(" ") : a_string
197
200
  @parsed = @dirty.parse(salvage_string) ||
198
201
  @canonical.parse(a_string) ||
199
- { verbatim: a_string }
202
+ { verbatim: @verbatim }
200
203
  end
201
204
  rescue
202
205
  @parsed = FAILED_RESULT.(@verbatim)
@@ -205,12 +208,14 @@ class ScientificNameParser
205
208
 
206
209
  def @parsed.verbatim=(a_string)
207
210
  @verbatim = a_string
211
+ @id = GnUUID.uuid(@verbatim)
208
212
  end
209
213
 
210
214
  def @parsed.all(opts = {})
211
215
  canonical_with_rank = !!opts[:canonical_with_rank]
212
216
  parsed = self.class != Hash
213
- res = { parsed: parsed, parser_version: ScientificNameParser::version}
217
+ res = { id: @id, parsed: parsed,
218
+ parser_version: ScientificNameParser::version}
214
219
  if parsed
215
220
  hybrid = self.hybrid rescue false
216
221
  res.merge!({
@@ -226,7 +231,7 @@ class ScientificNameParser
226
231
  res.merge!(self)
227
232
  end
228
233
  if (canonical_with_rank &&
229
- canonical.count(' ') > 1 &&
234
+ canonical.count(" ") > 1 &&
230
235
  res[:details][0][:infraspecies])
231
236
  ScientificNameParser.add_rank_to_canonical(res)
232
237
  end
@@ -235,11 +240,11 @@ class ScientificNameParser
235
240
  end
236
241
 
237
242
  def @parsed.pos_json
238
- self.pos.to_json rescue ''
243
+ self.pos.to_json rescue ""
239
244
  end
240
245
 
241
246
  def @parsed.all_json
242
- self.all.to_json rescue ''
247
+ self.all.to_json rescue ""
243
248
  end
244
249
 
245
250
  @parsed.verbatim = @verbatim
@@ -256,7 +261,7 @@ class ScientificNameParser
256
261
  surrogate2 = /\b(spp|sp|nr|cf)[\.]?[\s]*$/i
257
262
  is_surrogate = false
258
263
 
259
- ai_index = pos.index('annotation_identification')
264
+ ai_index = pos.index("annotation_identification")
260
265
  if ai_index
261
266
  ai = name[pos[ai_index - 1]..pos[ai_index + 1]]
262
267
  is_surrogate = true if ai.match(/^(spp|cf|sp|nr)/)
@@ -267,15 +272,13 @@ class ScientificNameParser
267
272
  end
268
273
 
269
274
  def self.add_rank_to_canonical(parsed)
270
- parts = parsed[:canonical].split(' ')
275
+ parts = parsed[:canonical].split(" ")
271
276
  name_ary = parts[0..1]
272
277
  parsed[:details][0][:infraspecies].each do |data|
273
278
  infrasp = data[:string]
274
279
  rank = data[:rank]
275
- name_ary << (rank && rank != 'n/a' ? "#{rank} #{infrasp}" : infrasp)
280
+ name_ary << (rank && rank != "n/a" ? "#{rank} #{infrasp}" : infrasp)
276
281
  end
277
- parsed[:canonical] = name_ary.join(' ')
282
+ parsed[:canonical] = name_ary.join(" ")
278
283
  end
279
-
280
284
  end
281
-