RubyGems - biodiversity - Versions diffs - 3.1.10 → 3.2.0 - Mend

biodiversity 3.1.10 → 3.2.0

Files changed (20)

checksums.yaml +4 -4
data/.rspec +3 -0
data/.ruby-version +1 -1
data/CHANGELOG +5 -0
data/README.md +95 -71
data/biodiversity.gemspec +1 -0
data/lib/biodiversity/parser.rb +33 -30
data/lib/biodiversity/parser/scientific_name_clean.rb +45 -36
data/lib/biodiversity/parser/scientific_name_clean.treetop +1 -1
data/lib/biodiversity/version.rb +1 -1
data/spec/biodiversity_spec.rb +0 -2
data/spec/files/t.rb +15 -0
data/spec/files/test_data.txt +345 -335
data/spec/files/test_data.txt.new +463 -0
data/spec/guid/lsid.spec.rb +0 -2
data/spec/parser/scientific_name_canonical_spec.rb +0 -1
data/spec/parser/scientific_name_clean_spec.rb +0 -2
data/spec/parser/scientific_name_dirty_spec.rb +0 -1
data/spec/parser/scientific_name_spec.rb +5 -4
metadata +20 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: d4a11c6b12ca173da86c7baa1242b9639e54d93c
-  data.tar.gz: 8c47f66efd6f8ee2b3e3dce894927a4cc73237a1
+  metadata.gz: d7bb0304f5e151933f5350780677b9a47a099716
+  data.tar.gz: 7adaf2c1bfce44db79bc2c04d75c584d53957fd0
 SHA512:
-  metadata.gz: 8e542a0cca44cef3a63acd4dc520db4465cb2a598aafcb30e2161e65ad6110b9a146ef01325da77f3e7474c5409a0b75f1b28eda7c59a7fe1ffb0cc59162fb6e
-  data.tar.gz: 36f21aed88acc4405c147b908de36d6109010935fec99108c90ca73d6dabd2b942a08a1eca04fd28a644bbec51a12b5836272ef2280160b4321f2f434760f665
+  metadata.gz: 1a450a93fb07f985b5f1e7761e669ed772c8add2c01955e4a8e70a575f3b5f0bc86b1b215507260c48a1604b386b60543f27048c7a65119f71a4b0ddfd7bcefe
+  data.tar.gz: 19090297f99d64580b4b6012a06729ede74fd3ccf7fb9ffdfeecd6480d2a1a0bd0ce8dc8654747f51a60301464cbaebd1f9693ceb54f2827140163d72314f62b

data/.rspec ADDED

@@ -0,0 +1,3 @@
+--format progress
+--color
+--require spec_helper

data/.ruby-version CHANGED

@@ -1 +1 @@
-2.1.5
+2.1.6

data/CHANGELOG CHANGED

@@ -1,3 +1,8 @@
+3.2.0 -- added UUID version 5 identifiers for every name string, better
+normalizing for the names with apostrophes, underscore-formatted names are
+supported. Minor version increase because of change in the output format ("id"
+field)
 3.1.10 -- NPV viruses added
 3.1.9 -- more virus keywords, better handling of apostrophes in

data/README.md CHANGED

@@ -1,10 +1,10 @@
 Biodiversity
 ============
-[![Gem Version][1]][2]
-[![Continuous Integration Status][3]][4]
-[![CodePolice][5]][6]
-[![Dependency Status][7]][8]
+[![Gem Version][gem_svg]][gem_link]
+[![Continuous Integration Status][ci_svg]][ci_link]
+[![CodePolice][cc_svg]][cc_link]
+[![Dependency Status][deps_svg]][deps_link]
 Parses taxonomic scientific name and breaks it into semantic elements.
@@ -12,10 +12,12 @@ Parses taxonomic scientific name and breaks it into semantic elements.
 Support for Ruby 1.8.7 IS DROPPED. Both biodiversity and
 biodiversity19 will be for Ruby > 1.9.1 and will be identical gems.
-biodiversity19 is now deprecated and will be phased out in 2015.
+biodiversity19 is now deprecated and will not be updated anymore.
 You are strongly encouraged to change your dependencies from
 biodiversity19 to biodiversity
+Follow [biodiversity issues][waffle] on waffle.io
 Installation
 ------------
@@ -46,7 +48,7 @@ you can use a socket server
     parserver -h
     Usage: parserver [options]
-        -r, --canonical_with_rank        Adds infraspecies rank
+        -r, --canonical_with_rank        Adds infraspecies rank
                                          to canonical forms
         -o, --output=output              Specifies the type of the output:
@@ -65,7 +67,7 @@ you can use a socket server
 With default settings you can access parserserver via 4334 port using a
 socket client library of your programming language.  You can find
-[socket client script example][9] in the examples directory of the gem.
+[socket client script example][socket_example] in the examples directory of the gem.
 If you want to check if socket server works for you:
@@ -93,76 +95,94 @@ of scientific name
 You can use it as a library in Ruby, JRuby etc.
-    require 'biodiversity'
-    parser = ScientificNameParser.new
+```ruby
+require 'biodiversity'
+parser = ScientificNameParser.new
-    #to find version number
-    ScientificNameParser.version
+#to find version number
+ScientificNameParser.version
-    # to fix capitalization in canonicals
-    ScientificNameParser.fix_case("QUERCUS (QUERCUS) ALBA")
-    # Output: Quercus (Quercus) alba
+# to fix capitalization in canonicals
+ScientificNameParser.fix_case("QUERCUS (QUERCUS) ALBA")
+# Output: Quercus (Quercus) alba
-    # to parse a scientific name into a ruby hash
-    parser.parse("Plantago major")
+# to parse a scientific name into a ruby hash
+parser.parse("Plantago major")
-    #to get json representation
-    parser.parse("Plantago").to_json
-    #or
-    parser.parse("Plantago")
-    parser.all_json
+#to get json representation
+parser.parse("Plantago").to_json
+#or
+parser.parse("Plantago")
+parser.all_json
-    # to clean name up
-    parser.parse("      Plantago       major    ")[:scientificName][:normalized]
+# to clean name up
+parser.parse("      Plantago       major    ")[:scientificName][:normalized]
-    # to get only cleaned up latin part of the name
-    parser.parse("Pseudocercospora dendrobii (H.C. Burnett) U. \
-    Braun & Crous 2003")[:scientificName][:canonical]
+# to get only cleaned up latin part of the name
+parser.parse("Pseudocercospora dendrobii (H.C. Burnett) U. \
+Braun & Crous 2003")[:scientificName][:canonical]
-    # to get detailed information about elements of the name
-    parser.parse("Pseudocercospora dendrobii (H.C. Burnett 1883) U. \
-    Braun & Crous 2003")[:scientificName][:details]
+# to get detailed information about elements of the name
+parser.parse("Pseudocercospora dendrobii (H.C. Burnett 1883) U. \
+Braun & Crous 2003")[:scientificName][:details]
+```
 Returned result is not always linear, if name is complex. To get simple linear
 representation of the name you can use:
-    parser.parse("Pseudocercospora dendrobii (H.C. Burnett) \
-    U. Braun & Crous 2003")[:scientificName][:position]
-    # returns {0=>["genus", 16], 17=>["species", 26],
-    # 28=>["author_word", 32], 33=>["author_word", 40],
-    # 42=>["author_word", 44], 45=>["author_word", 50],
-    # 53=>["author_word", 58], 59=>["year", 63]}
-    # where the key is the char index of the start of
-    # a word, first element of the value is a semantic meaning
-    # of the word, second element of the value is the character index
-    # of end of the word
+```ruby
+parser.parse("Pseudocercospora dendrobii (H.C. Burnett) \
+U. Braun & Crous 2003")[:scientificName][:position]
+# returns {0=>["genus", 16], 17=>["species", 26],
+# 28=>["author_word", 32], 33=>["author_word", 40],
+# 42=>["author_word", 44], 45=>["author_word", 50],
+# 53=>["author_word", 58], 59=>["year", 63]}
+# where the key is the char index of the start of
+# a word, first element of the value is a semantic meaning
+# of the word, second element of the value is the character index
+# of end of the word
+```
 'Surrogate' is a broad group which includes 'Barcode of Life' names, and various
 undetermined names with cf. sp. spp. nr. in them:
-    parser.parse("Coleoptera BOLD:1234567")[:scientificName][:surrogate]
-To parse using several CPUs (4 seem to be optimal)
+```ruby
+parser.parse("Coleoptera BOLD:1234567")[:scientificName][:surrogate]
+```
+### What is "id" in the parsed results?
+ID field contains UUID v5 hexadecimal string. ID is generated out of bytes
+from the name string itself, and identical id can be generated using [any
+popular programming language][uuid_examples]. You can read more about UUID
+version 5 in a [blog post][uuid_blog]
+### Parse using several CPUs (4 threads seem to be optimal)
-    parser = ParallelParser.new
-    # ParallelParser.new(4) will try to run 4 processes if hardware allows
-    array_of_names = ["Betula alba", "Homo sapiens"....]
-    parser.parse(array_of_names)
-    # Output: {"Betula alba" => {:scientificName...},
-    # "Homo sapiens" => {:scientificName...}, ...}
+```ruby
+parser = ParallelParser.new
+# ParallelParser.new(4) will try to run 4 processes if hardware allows
+array_of_names = ["Betula alba", "Homo sapiens"....]
+parser.parse(array_of_names)
+# Output: {"Betula alba" => {:scientificName...},
+# "Homo sapiens" => {:scientificName...}, ...}
+```
-parallel parser takes list of names and returns back a hash with names as
+parallel parser takes list of names and returns back a hash with names as
 keys and parsed data as values
-To get canonicals with ranks for infraspecific epithets:
+### Canonicals with ranks for infraspecific epithets:
-    parser = ScientificNameParser.new(canonical_with_rank: true)
-    parser.parse('Cola cordifolia var. puberula \
-    A. Chev.')[:scientificName][:canonical]
-    # Output: Cola cordifolia var. puberula
+```ruby
+parser = ScientificNameParser.new(canonical_with_rank: true)
+parser.parse('Cola cordifolia var. puberula \
+A. Chev.')[:scientificName][:canonical]
+# Output: Cola cordifolia var. puberula
+```
-To resolve lsid and get back RDF file
+### Resolving lsid and geting back RDF file
     LsidResolver.resolve("urn:lsid:ubio.org:classificationbank:2232671")
@@ -174,7 +194,7 @@ If nnparse or parserver do not start -- try to run
     gem uninstall biodiversity
     gem uninstall biodiversity19
-and make sure you remove all versions and all nnparse and parserver scripts.
+and make sure you remove all versions and all nnparse and parserver scripts.
 Then install biodiversity again
     gem install biodiversity
@@ -184,18 +204,22 @@ It should fix the problem.
 Copyright
 ---------
-Authors: [Dmitry Mozzherin][10]
-Copyright (c) 2008-2015 Marine Biological Laboratory. See LICENSE for
-further details.
-[1]: https://badge.fury.io/rb/biodiversity.png
-[2]: http://badge.fury.io/rb/biodiversity
-[3]: https://secure.travis-ci.org/GlobalNamesArchitecture/biodiversity.png
-[4]: http://travis-ci.org/GlobalNamesArchitecture/biodiversity
-[5]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity.png
-[6]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity
-[7]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity.png
-[8]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity
-[9]: http://bit.ly/149iLm5
-[10]: https://github.com/dimus
+Authors: [Dmitry Mozzherin][dimus]
+Copyright (c) 2008-2015 Marine Biological Laboratory. See [LICENSE][license]
+for further details.
+[gem_svg]: https://badge.fury.io/rb/biodiversity.svg
+[gem_link]: http://badge.fury.io/rb/biodiversity
+[ci_svg]: https://secure.travis-ci.org/GlobalNamesArchitecture/biodiversity.svg
+[ci_link]: http://travis-ci.org/GlobalNamesArchitecture/biodiversity
+[cc_svg]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity.svg
+[cc_link]: https://codeclimate.com/github/GlobalNamesArchitecture/biodiversity
+[deps_svg]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity.svg
+[deps_link]: https://gemnasium.com/GlobalNamesArchitecture/biodiversity
+[socket_example]: http://bit.ly/149iLm5
+[dimus]: https://github.com/dimus
+[license]: https://github.com/GlobalNamesArchitecture/biodiversity/blob/master/LICENSE
+[waffle]: https://waffle.io/GlobalNamesArchitecture/biodiversity
+[uuid_examples]: https://github.com/GlobalNamesArchitecture/gn_uuid_examples
+[uuid_blog]: http://globalnamesarchitecture.github.io/crossmap/gna/2015/05/31/gn-uuid-0-5-0.html
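
Note on the new "id" field: the README text added in this diff says the ID is a UUID v5 string generated from the bytes of the name string. The sketch below shows how a version 5 UUID is derived in general (RFC 4122: SHA-1 over the namespace bytes followed by the name bytes), using only Ruby's standard library. It does not call the gn_uuid gem. The helper name `uuid_v5` is hypothetical, and seeding the namespace from the DNS namespace plus the domain "globalnames.org" is an assumption made for illustration, so the printed value should be checked against `GnUUID.uuid` before relying on it.

```ruby
require "digest/sha1"

# RFC 4122 predefined namespace for fully qualified domain names.
DNS_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

# Hypothetical helper, not part of biodiversity or gn_uuid.
# Builds a version 5 (SHA-1, name-based) UUID string.
def uuid_v5(namespace_uuid, name)
  # Namespace UUID as 16 raw bytes.
  ns_bytes = [namespace_uuid.delete("-")].pack("H*")
  # Hash namespace bytes + name bytes, keep the first 16 bytes.
  digest = Digest::SHA1.digest(ns_bytes + name.b)
  bytes = digest.bytes.first(16)
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # set version to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # set RFC 4122 variant
  hex = bytes.pack("C*").unpack("H*").first
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

# Assumption: a namespace derived from a domain name; the gem's actual
# namespace may differ.
NAME_NAMESPACE = uuid_v5(DNS_NAMESPACE, "globalnames.org")

# The same name string always yields the same identifier.
puts uuid_v5(NAME_NAMESPACE, "Homo sapiens")
```

Because the identifier depends only on the namespace and the exact bytes of the name string, any difference in whitespace or capitalization produces a different ID. This is consistent with the parser.rb change below, which stops stripping the verbatim string before it is passed to `GnUUID.uuid`.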

data/biodiversity.gemspec CHANGED

@@ -19,6 +19,7 @@ Gem::Specification.new do |gem|
   gem.add_runtime_dependency "treetop", "~> 1.4.1"
   gem.add_runtime_dependency "parallel", "~> 1.4"
   gem.add_runtime_dependency "unicode_utils", "~> 1.4"
+  gem.add_runtime_dependency "gn_uuid", "~> 0.5"
   gem.add_development_dependency "bundler", "~> 1.6"
   gem.add_development_dependency "rake", "~> 10.4"

data/lib/biodiversity/parser.rb CHANGED

@@ -1,7 +1,8 @@
 # encoding: UTF-8
-require_relative 'parser/scientific_name_clean'
-require_relative 'parser/scientific_name_dirty'
-require_relative 'parser/scientific_name_canonical'
+require "gn_uuid"
+require_relative "parser/scientific_name_clean"
+require_relative "parser/scientific_name_dirty"
+require_relative "parser/scientific_name_canonical"
 module PreProcessor
   NOTES = /\s+(species\s+group|species\s+complex|group|author)\b.*$/i
@@ -24,9 +25,10 @@ module PreProcessor
   def self.clean(a_string)
     [NOTES, TAXON_CONCEPTS1, TAXON_CONCEPTS2,
      TAXON_CONCEPTS3, NOMEN_CONCEPTS, LAST_WORD_JUNK].each do |i|
-      a_string = a_string.gsub(i, '')
+      a_string = a_string.gsub(i, "")
     end
-    a_string = a_string.tr('ſ','s') #old 's'
+    a_string = a_string.tr("ſ","s") #old "s"
+    a_string = a_string.tr("_", " ") if a_string.strip.match(/\s/).nil?
     a_string
   end
 end
@@ -36,7 +38,7 @@ end
 # Examples
 #
 # parser = ParallelParser.new(4)
-# parser.parse(['Betula L.', 'Pardosa moesta'])
+# parser.parse(["Betula L.", "Pardosa moesta"])
 class ParallelParser
   # Public: Initialize ParallelParser.
@@ -45,7 +47,7 @@ class ParallelParser
   #                 If processes number is not set it will be determined
   #                 automatically.
   def initialize(processes_num = nil)
-    require 'parallel'
+    require "parallel"
     cpu_num
     if processes_num.to_i > 0
       @processes_num = [processes_num, cpu_num - 1].min
@@ -66,7 +68,7 @@ class ParallelParser
   # Examples
   #
   # parser = ParallelParser.new(4)
-  # parser.parse(['Homo sapiens L.', 'Quercus quercus'])
+  # parser.parse(["Homo sapiens L.", "Quercus quercus"])
   #
   # Returns a Hash with scientific names as a key, and parsing results as
   # a value.
@@ -108,7 +110,8 @@ class ScientificNameParser
   FAILED_RESULT = ->(name) do
     { scientificName:
-      { parsed: false, verbatim: name.to_s.strip,  error: 'Parser error' }
+      { id: GnUUID.uuid(name), parsed: false, verbatim: name,
+        error: "Parser internal error" }
     }
   end
@@ -121,7 +124,7 @@ class ScientificNameParser
     words_num = name_ary.size
     res = nil
     if words_num == 1
-      res = name_ary[0].gsub(/[\(\)\{\}]/, '')
+      res = name_ary[0].gsub(/[\(\)\{\}]/, "")
       if res.size > 1
         res = UnicodeUtils.upcase(res[0]) + UnicodeUtils.downcase(res[1..-1])
       else
@@ -135,15 +138,15 @@ class ScientificNameParser
         word1 = name_ary[0]
       end
       if name_ary[1].match(/^\(/)
-        word2 = name_ary[1].gsub(/\)$/, '') + ')'
+        word2 = name_ary[1].gsub(/\)$/, "") + ")"
         word2 = word2[0] + UnicodeUtils.upcase(word2[1]) +
           UnicodeUtils.downcase(word2[2..-1])
       else
         word2 = UnicodeUtils.downcase(name_ary[1])
       end
-      res = word1 + ' ' +
-        word2 + ' ' +
-        name_ary[2..-1].map { |w| UnicodeUtils.downcase(w) }.join(' ')
+      res = word1 + " " +
+        word2 + " " +
+        name_ary[2..-1].map { |w| UnicodeUtils.downcase(w) }.join(" ")
       res.strip!
     end
     res
@@ -152,7 +155,7 @@ class ScientificNameParser
   def initialize(opts = {})
     @canonical_with_rank = !!opts[:canonical_with_rank]
-    @verbatim = ''
+    @verbatim = ""
     @clean = ScientificNameCleanParser.new
     @dirty = ScientificNameDirtyParser.new
     @canonical = ScientificNameCanonicalParser.new
@@ -180,23 +183,23 @@ class ScientificNameParser
   end
   def parse(a_string)
-    @verbatim = a_string.strip
+    @verbatim = a_string
     a_string = PreProcessor::clean(a_string)
     if virus?(a_string)
-      @parsed = { verbatim: a_string, virus: true }
+      @parsed = { verbatim: @verbatim, virus: true }
     elsif noparse?(a_string)
-      @parsed = { verbatim: a_string }
+      @parsed = { verbatim: @verbatim }
     else
       begin
         @parsed = @clean.parse(a_string) || @dirty.parse(a_string)
         unless @parsed
           index = @dirty.index || @clean.index
           salvage_match = a_string[0..index].split(/\s+/)[0..-2]
-          salvage_string = salvage_match ? salvage_match.join(' ') : a_string
+          salvage_string = salvage_match ? salvage_match.join(" ") : a_string
           @parsed =  @dirty.parse(salvage_string) ||
                      @canonical.parse(a_string) ||
-                     { verbatim: a_string }
+                     { verbatim: @verbatim }
         end
       rescue
         @parsed = FAILED_RESULT.(@verbatim)
@@ -205,12 +208,14 @@ class ScientificNameParser
     def @parsed.verbatim=(a_string)
       @verbatim = a_string
+      @id = GnUUID.uuid(@verbatim)
     end
     def @parsed.all(opts = {})
       canonical_with_rank = !!opts[:canonical_with_rank]
       parsed = self.class != Hash
-      res = { parsed: parsed, parser_version: ScientificNameParser::version}
+      res = { id: @id, parsed: parsed,
+              parser_version: ScientificNameParser::version}
       if parsed
         hybrid = self.hybrid rescue false
         res.merge!({
@@ -226,7 +231,7 @@ class ScientificNameParser
         res.merge!(self)
       end
       if (canonical_with_rank &&
-          canonical.count(' ') > 1 &&
+          canonical.count(" ") > 1 &&
           res[:details][0][:infraspecies])
         ScientificNameParser.add_rank_to_canonical(res)
       end
@@ -235,11 +240,11 @@ class ScientificNameParser
     end
     def @parsed.pos_json
-      self.pos.to_json rescue ''
+      self.pos.to_json rescue ""
     end
     def @parsed.all_json
-      self.all.to_json rescue ''
+      self.all.to_json rescue ""
     end
     @parsed.verbatim = @verbatim
@@ -256,7 +261,7 @@ class ScientificNameParser
     surrogate2 = /\b(spp|sp|nr|cf)[\.]?[\s]*$/i
     is_surrogate = false
-    ai_index = pos.index('annotation_identification')
+    ai_index = pos.index("annotation_identification")
     if ai_index
       ai = name[pos[ai_index - 1]..pos[ai_index + 1]]
       is_surrogate = true if ai.match(/^(spp|cf|sp|nr)/)
@@ -267,15 +272,13 @@ class ScientificNameParser
   end
   def self.add_rank_to_canonical(parsed)
-    parts = parsed[:canonical].split(' ')
+    parts = parsed[:canonical].split(" ")
     name_ary = parts[0..1]
     parsed[:details][0][:infraspecies].each do |data|
       infrasp = data[:string]
       rank = data[:rank]
-      name_ary << (rank && rank != 'n/a' ? "#{rank} #{infrasp}" : infrasp)
+      name_ary << (rank && rank != "n/a" ? "#{rank} #{infrasp}" : infrasp)
     end
-    parsed[:canonical] = name_ary.join(' ')
+    parsed[:canonical] = name_ary.join(" ")
   end
 end
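
Note on underscore-formatted names: the line added to `PreProcessor.clean` in this diff replaces underscores with spaces only when the stripped string contains no whitespace at all. The sketch below copies that single line into a standalone method so the rule can be tried without installing the gem. The method name `underscores_to_spaces` is hypothetical, and the rest of `PreProcessor.clean` is omitted.

```ruby
# Hypothetical standalone copy of the one line added to PreProcessor.clean.
def underscores_to_spaces(a_string)
  # Only treat "_" as a word separator when the name has no real whitespace.
  a_string = a_string.tr("_", " ") if a_string.strip.match(/\s/).nil?
  a_string
end

p underscores_to_spaces("Homo_sapiens")     # "Homo sapiens"
p underscores_to_spaces("Homo sapiens_var") # "Homo sapiens_var" (unchanged)
p underscores_to_spaces(" Plantago_major ") # " Plantago major "
```

The whitespace check runs on the stripped string, so leading or trailing spaces do not prevent the conversion, while a name that already contains an internal space keeps its underscores.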