RubyGems - taxonifi - Versions diffs - 0.1.0 → 0.2.0 - Mend

taxonifi 0.1.0 → 0.2.0

Files changed (47)

data/Gemfile +1 -0
data/Gemfile.lock +24 -7
data/README.rdoc +5 -6
data/Rakefile +1 -1
data/VERSION +1 -1
data/lib/assessor/row_assessor.rb +25 -18
data/lib/export/format/base.rb +96 -1
data/lib/export/format/obo_nomenclature.rb +71 -0
data/lib/export/format/prolog.rb +59 -0
data/lib/export/format/species_file.rb +303 -193
data/lib/lumper/clump.rb +112 -0
data/lib/lumper/lumper.rb +71 -45
data/lib/lumper/lumps/parent_child_name_collection.rb +79 -15
data/lib/models/author_year.rb +1 -2
data/lib/models/base.rb +56 -51
data/lib/models/collection.rb +16 -1
data/lib/models/name.rb +56 -15
data/lib/models/name_collection.rb +70 -19
data/lib/models/ref.rb +17 -0
data/lib/models/ref_collection.rb +2 -1
data/lib/models/shared_class_methods.rb +29 -0
data/lib/models/species_name.rb +14 -12
data/lib/splitter/parser.rb +1 -2
data/lib/splitter/tokens.rb +1 -1
data/lib/taxonifi.rb +12 -0
data/lib/utils/array.rb +17 -0
data/lib/utils/hash.rb +17 -0
data/taxonifi.gemspec +116 -0
data/test/file_fixtures/Fossil.csv +11 -0
data/test/file_fixtures/Lygaeoidea.csv +1 -1
data/test/file_fixtures/names.csv +1 -0
data/test/helper.rb +14 -0
data/test/test_export_prolog.rb +14 -0
data/test/test_exporter.rb +23 -0
data/test/test_lumper_clump.rb +75 -0
data/test/test_lumper_names.rb +67 -9
data/test/test_lumper_parent_child_name_collection.rb +47 -3
data/test/test_lumper_refs.rb +22 -7
data/test/test_obo_nomenclature.rb +14 -0
data/test/test_parser.rb +4 -2
data/test/test_splitter_tokens.rb +9 -0
data/test/test_taxonifi_accessor.rb +21 -15
data/test/test_taxonifi_base.rb +25 -0
data/test/test_taxonifi_name.rb +41 -4
data/test/test_taxonifi_name_collection.rb +54 -17
data/test/test_taxonifi_species_name.rb +1 -1
metadata +34 -5

data/Gemfile CHANGED

@@ -12,6 +12,7 @@ group :development do
   gem "rdoc", "~> 3.12"
   gem "bundler", "> 1.0.0"
   gem "jeweler", "~> 1.8.3"
+  gem "activerecord", "3.2.8"
   gem "debugger"
 #  gem "ruby-debug19"
 #  gem "simplecov", ">= 0"

data/Gemfile.lock CHANGED

@@ -1,29 +1,46 @@
 GEM
   remote: http://rubygems.org/
   specs:
+    activemodel (3.2.8)
+      activesupport (= 3.2.8)
+      builder (~> 3.0.0)
+    activerecord (3.2.8)
+      activemodel (= 3.2.8)
+      activesupport (= 3.2.8)
+      arel (~> 3.0.2)
+      tzinfo (~> 0.3.29)
+    activesupport (3.2.8)
+      i18n (~> 0.6)
+      multi_json (~> 1.0)
+    arel (3.0.2)
+    builder (3.0.0)
     columnize (0.3.6)
-    debugger (1.1.1)
+    debugger (1.2.0)
       columnize (>= 0.3.1)
-      debugger-linecache (~> 1.1)
-      debugger-ruby_core_source (~> 1.1)
-    debugger-linecache (1.1.1)
+      debugger-linecache (~> 1.1.1)
+      debugger-ruby_core_source (~> 1.1.3)
+    debugger-linecache (1.1.2)
       debugger-ruby_core_source (>= 1.1.1)
-    debugger-ruby_core_source (1.1.1)
+    debugger-ruby_core_source (1.1.3)
     git (1.2.5)
-    jeweler (1.8.3)
+    i18n (0.6.1)
+    jeweler (1.8.4)
       bundler (~> 1.0)
       git (>= 1.2.5)
       rake
       rdoc
-    json (1.6.6)
+    json (1.7.5)
+    multi_json (1.3.6)
     rake (0.9.2.2)
     rdoc (3.12)
       json (~> 1.4)
+    tzinfo (0.3.33)
 PLATFORMS
   ruby
 DEPENDENCIES
+  activerecord (= 3.2.8)
   bundler (> 1.0.0)
   debugger
   jeweler (~> 1.8.3)

data/README.rdoc CHANGED

@@ -4,19 +4,18 @@ There will always be "legacy" taxonomic data that needs shuffling around. The ta
 Overall, the goal is to provide well documented (and unit-tested) coded that is broadly useful, and vanilla enough to encourage other to fork and hack on their own.
 == Source
-Source is available at https://github.com/SpeciesFile/taxonifi.  The rdoc API is also viewable at http://taxonifi.speciesfile.org, (though those docs may lag behind commits to github).
+Source is available at https://github.com/SpeciesFile/taxonifi .  The rdoc API is also viewable at http://taxonifi.speciesfile.org , (though those docs may lag behind commits to github).
 == What's next?
-Before you jump on board you should also check out similar code from the Global Names team at https://github.com/GlobalNamesArchitecture. Future integration and merging of shared functionality is planned.  Code will be released in an "early-and-often" approach
+Before you jump on board you should also check out similar code from the Global Names team at https://github.com/GlobalNamesArchitecture. Future integration and merging of shared functionality is planned.  Code will be released in an "early-and-often" approach.
+Taxonifi is presently coded for convience, not speed (though it's not necessarily slow). It assumes that conversion processes are typically one-offs that can afford to run over a longer period of time (read minutes rather than seconds). Reading, and fully parsing into objects, around 25k rows of nomenclature (class to species, inc. author year, = ~45k names) in to memory as Taxonifi objects benchmarks at around 2 minutes. Faster indexing is planned as needed, likely using Redis (see GNA link above).
 = Getting started
 taxonifi is coded for Ruby 1.9.3, it has not been tested on earlier versions (though it will certainly not work with 1.8.7).
 Using Ruby Version Manager (RVM, https://rvm.io/ ) is highly recommend. You can test your version of Ruby by doinging "ruby -v" in your terminal.
-Taxonifi is presently coded for convience, not speed (though it's not necessarily slow). It assumes that conversion processes are typically one-offs that can afford to run over a longer period of time (read minutes rather than seconds). Reading, and fully parsing into objects, around 25k rows of nomenclature (class to species, inc. author year, = ~45k names) in to memory as Taxonifi objects benchmarks at around 2 minutes. Faster indexing is planned as needed, likely using Redis (see GNA link above).
 To install:
    gem install taxonifi
@@ -76,7 +75,7 @@ There are collections of specific types (e.g. taxonomic names, geographic names)
     csv = CSV.parse(string, {headers: true})
-    nc = Taxonifi::Lumper.create_name_collection(csv)  # => Taxonifi::Model::NameCollection
+    nc = Taxonifi::Lumper.create_name_collection(:csv => csv)  # => Taxonifi::Model::NameCollection
     nc.collection.first                                # => Taxonifi::Model::Name
     nc.collection.first.name                           # => "Fooidae"

data/Rakefile CHANGED

@@ -15,7 +15,7 @@ require 'jeweler'
 Jeweler::Tasks.new do |gem|
   # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
   gem.name = "taxonifi"
-  gem.homepage = "http://github.com/mjy/taxonifi"
+  gem.homepage = "http://github.com/SpeciesFile/taxonifi"
   gem.license = "MIT"
   gem.summary = %Q{A general purpose framework for scripted handling of taxonomic names}
   gem.description = %Q{Taxonifi contains simple models and utilties of use in for parsing lists of taxonomic name (life) related metadata}

data/VERSION CHANGED

@@ -1 +1 @@
-0.1.0
+0.2.0

data/lib/assessor/row_assessor.rb CHANGED

@@ -36,7 +36,8 @@ module Taxonifi
         end
       end
-      # Return the last column with data, scoped by lump if provided.
+      # Return an Array of ["header", value] for the last column with data, scoped by lump if provided.
+      # If there is nothing available in the scope provided return [nil, nil]
       def self.last_available(csv_row, lump = nil)
         if lump.nil?
           csv_row.entries.reverse.each do |c,v|
@@ -47,25 +48,31 @@ module Taxonifi
             return [l, csv_row[l.to_s]] if !csv_row[l.to_s].nil?
           end
         end
+        [nil, nil]
       end
       # Return the rank (symbol) of the taxon name rank.  Raises
       # if no name detected.
       def self.lump_name_rank(csv_row)
-        lumps = Taxonifi::Lumper.available_lumps(csv_row.headers)
+        # Rather than just check individual columns for data ensure a complete lump is present
+        lumps = intersecting_lumps_with_data(csv_row, [:species, :genera, :higher])
         if lumps.include?(:species) # has to be a species name
-          if csv_row[:subspecies].nil?
-            return :species
+          if !csv_row['variety'].nil?
+            return :variety
           else
-            return :subspecies
+            if csv_row['subspecies'].nil?
+              return :species
+            else
+              return :subspecies
+            end
           end
         elsif lumps.include?(:genera)
-          if csv_row[:subgenus].nil?
+          if csv_row['subgenus'].nil?
             return :genus
           else
             return :subgenus
           end
-        else
+        elsif lumps.include?(:higher)
           return Taxonifi::Assessor::RowAssessor.last_available(csv_row, Taxonifi::Lumper::LUMPS[:higher]).first.to_sym
         end
@@ -75,11 +82,11 @@ module Taxonifi
       # Return the column representing the parent of the name
       # represented in this row.
-      def self.parent_taxon_column(csv_row)
-        lumps = Taxonifi::Lumper.available_lumps(csv_row.headers)
-        last = last_available(csv_row, Taxonifi::RANKS)
-        last_available(csv_row, Taxonifi::RANKS[0..Taxonifi::RANKS.index(last[0])-1])
-      end
+      # TODO: DEPRECATE, same f(n) as last_available when scoped properly
+      # def self.parent_taxon_column(csv_row)
+      #   last = last_available(csv_row, Taxonifi::RANKS)
+      #   last_available(csv_row, Taxonifi::RANKS[0..Taxonifi::RANKS.index(last[0])-1])
+      # end
       # Return an Array of headers that represent taxonomic ranks.
       def self.rank_headers(headers)
@@ -92,13 +99,13 @@ module Taxonifi
       end
       # Return lumps for which at least one column has data.
-      def self.intersecting_lumps_with_data(row, lumps_to_try = nil)
-        lumps_to_try ||= Taxonifi::Lumper::LUMPS.keys
+      def self.intersecting_lumps_with_data(csv_row, lumps_to_try = nil)
+        lumps_to_try ||= Taxonifi::Lumper.intersecting_lumps(csv_row.headers)
         lumps = []
         lumps_to_try.each do |l|
           has_data = false
           Taxonifi::Lumper::LUMPS[l].each do |c|
-            if !row[c].nil? && !row[c].empty?
+            if !csv_row[c].nil? && !csv_row[c].empty?
               has_data = true
               break
             end
@@ -109,13 +116,13 @@ module Taxonifi
       end
       # Return lumps that have data for all columns.
-      def self.lumps_with_data(row, lumps_to_try = nil)
-        lumps_to_try ||= Taxonifi::Lumper::LUMPS.keys
+      def self.lumps_with_data(csv_row, lumps_to_try = nil)
+        lumps_to_try ||= Taxonifi::Lumper.available_lumps(csv_row.headers) # Taxonifi::Lumper::LUMPS.keys
         lumps = []
         lumps_to_try.each do |l|
           has_data = true
           Taxonifi::Lumper::LUMPS[l].each do |c|
-            if row[c].nil? || row[c].empty?
+            if csv_row[c].nil? || csv_row[c].empty?
               has_data = false
               break
             end
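
The `last_available` hunk above adds a `[nil, nil]` fallback so the method always returns a two-element Array. A minimal sketch of that contract follows. It assumes a plain Hash with String keys stands in for a `CSV::Row`, and that a lump is walked from its last rank to its first (the loop header falls outside the hunk). `last_available_sketch` is a hypothetical name, not part of taxonifi.

```ruby
# Hypothetical stand-in for RowAssessor.last_available; a Hash with String
# keys plays the role of a CSV::Row (an assumption, not the gem's code).
def last_available_sketch(row, lump = nil)
  if lump.nil?
    # No scope given: walk every column from right to left.
    row.to_a.reverse.each { |header, value| return [header, value] unless value.nil? }
  else
    # Scoped: only consider the columns named in the lump, last rank first.
    lump.reverse.each { |l| return [l, row[l.to_s]] unless row[l.to_s].nil? }
  end
  [nil, nil] # the fallback added in 0.2.0
end

row = { 'family' => 'Fooidae', 'genus' => 'Foo', 'species' => nil }

p last_available_sketch(row)                     # => ["genus", "Foo"]
p last_available_sketch(row, [:family, :genus])  # => [:genus, "Foo"]
p last_available_sketch(row, [:species])         # => [nil, nil]
```

With the fallback in place, destructuring the result (`header, value = ...`) works in every case, but chaining `.first.to_sym` on an empty scope would still raise, since `nil` does not respond to `to_sym`.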

data/lib/export/format/base.rb CHANGED

@@ -2,8 +2,72 @@ module Taxonifi::Export
   # All export classes inherit from Taxonifi::Export::Base
   class Base
+    # Hash.  An index of taxonomic ranks.
+    # See https://phenoscape.svn.sourceforge.net/svnroot/phenoscape/trunk/vocab/taxonomic_rank.obo
+    # Site: https://www.phenoscape.org/wiki/Taxonomic_Rank_Vocabulary
+    # Values of -1 have no correspondance in that ontology.
+    # Nt all values are supported. Not all values are included.
+    TAXRANKS = {
+      'taxonomic_rank' =>          0,
+      'variety'        =>          16,
+      'bio-variety'    =>          32,
+      'subspecies' =>              23,
+      'form' =>                    26,
+      'species' =>                 5,
+      'species complex' =>         12,
+      'species subgroup' =>        11,
+      'species group' =>           10,
+      'species series' =>          -1,
+      'series'  =>                 31,
+      'infragenus' =>              43,
+      'subgenus' =>                9,
+      'genus' =>                   5,
+      'genus group' =>             -1,
+      'subtribe' =>                28,
+      'tribe' =>                   25,
+      'supertribe' =>              57,
+      'infrafamily' =>             41,
+      'subfamily' =>               24,
+      'subfamily group' =>         -1,
+      'family' =>                  4,
+      'epifamily' =>               -1,
+      'superfamily' =>             18,
+      'superfamily group' =>       -1,
+      'subinfraordinal group' =>   -1,
+      'infraorder' =>              13,
+      'suborder' =>                14,
+      'order' =>                   3,
+      'mirorder' =>                -1,
+      'superorder' =>              20,
+      'magnorder' =>               -1,
+      'parvorder' =>               21,
+      'cohort' =>                  -1,
+      'supercohort' =>             -1,
+      'infraclass' =>              19,
+      'subclass' =>                7,
+      'class' =>                   2,
+      'superclass' =>              15,
+      'infraphylum' =>             40,
+      'subphylum' =>               8,
+      'phylum' =>                  1,
+      'superphylum' =>             27,
+      'infrakingdom' =>            44,
+      'subkingdom' =>              29,
+      'kingdom' =>                 17,
+      'superkingdom' =>            22,
+      'life' =>                    -1,
+      'unknown' =>                 -1,
+      'section' =>                 30
+    }
     EXPORT_BASE =  File.expand_path(File.join(Dir.home(), 'taxonifi', 'export'))
-    attr_accessor :base_export_path, :export_folder
+    # String. Defaults to EXPORT_BASE.
+    attr_accessor :base_export_path
+    # String. The folder to dump output files to, subclassess contain a reasonably named default.
+    attr_accessor :export_folder
     def initialize(options = {})
       opts = {
@@ -39,5 +103,36 @@ module Taxonifi::Export
       f.close
     end
+    # TODO: Used?!
+    # Returns a new writeable File under the
+    def new_output_file(filename = 'foo')
+      File.new( File.expand_path(File.join(export_path, filename)), 'w+')
+    end
+    # TODO: Move to a SQL library.
+    # Returns a String, an INSERT statement derived from the passed values Hash.
+    def sql_insert_statement(tbl = nil, values = {})
+      return "nope" if tbl.nil?
+      "INSERT INTO #{tbl} (#{values.keys.sort.join(",")}) VALUES (#{values.keys.sort.collect{|k| sqlize(values[k])}.join(",")});"
+    end
+    # TODO: Move to a SQL library.
+    # Returns a String that has been SQL proofed based on its class.
+    def sqlize(value)
+      case value.class.to_s
+      when 'String'
+        "'#{sanitize(value)}'"
+      else
+        value
+      end
+    end
+    # TODO: Move to SQL/String library.
+    # Returns a String with quotes handled for SQL.
+    def sanitize(value)
+      value.to_s.gsub(/'/,"''")
+    end
   end
 end
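
To see what the new SQL helpers in this file produce, the sketch below copies `sanitize`, `sqlize` and `sql_insert_statement` into a standalone module so they can run outside `Taxonifi::Export::Base`. The module name `SqlSketch` and the column names are hypothetical; `tblPeople` is taken from the Prolog export manifest later in this diff.

```ruby
# Standalone copies of the helpers shown in the diff, wrapped in a
# hypothetical module (SqlSketch) purely for illustration.
module SqlSketch
  module_function

  # Doubles single quotes so the value can sit inside a SQL string literal.
  def sanitize(value)
    value.to_s.gsub(/'/, "''")
  end

  # Quotes Strings; every other class is passed through unchanged.
  def sqlize(value)
    case value.class.to_s
    when 'String'
      "'#{sanitize(value)}'"
    else
      value
    end
  end

  # Builds an INSERT statement with columns in sorted key order.
  def sql_insert_statement(tbl = nil, values = {})
    return "nope" if tbl.nil?
    keys = values.keys.sort
    "INSERT INTO #{tbl} (#{keys.join(",")}) VALUES (#{keys.collect { |k| sqlize(values[k]) }.join(",")});"
  end
end

puts SqlSketch.sql_insert_statement('tblPeople', :id => 1, :name => "O'Brien")
# prints: INSERT INTO tblPeople (id,name) VALUES (1,'O''Brien');
```

Two behaviours are worth noting when reading the diff: a `nil` value is interpolated as an empty string rather than `NULL`, and quote doubling alone is weaker than parameterised queries when the input is untrusted.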

data/lib/export/format/obo_nomenclature.rb ADDED

@@ -0,0 +1,71 @@
+module Taxonifi::Export
+  # Writes a OBO formatted file for all names in a name collection.
+  # !! Does not write synonyms out.
+  # Follows the TTO example.
+  class OboNomenclature < Taxonifi::Export::Base
+     attr_accessor :name_collection, :namespace
+    def initialize(options = {})
+      opts = {
+        :nc => Taxonifi::Model::NameCollection.new,
+        :export_folder => 'obo_nomenclature',
+        :starting_id => 1,
+        :namespace => 'XYZ'
+      }.merge!(options)
+      super(opts)
+      raise Taxonifi::Export::ExportError, 'NameCollection not passed to OboNomenclature export.' if ! opts[:nc].class == Taxonifi::Model::NameCollection
+      @name_collection = opts[:nc]
+      @namespace = opts[:namespace]
+      @time = Time.now.strftime("%D %T").gsub('/',":")
+      @empty_quotes = ""
+    end
+    # Writes the file.
+    def export()
+      super
+      f = new_output_file('obo_nomenclature.obo')
+      # header
+      f.puts 'format-version: 1.2'
+      f.puts "date: #{@time}"
+      f.puts 'saved-by: someone'
+      f.puts 'auto-generated-by: Taxonifi'
+      f.puts 'synonymtypedef: COMMONNAME "common name"'
+      f.puts 'synonymtypedef: MISSPELLING "misspelling" EXACT'
+      f.puts 'synonymtypedef: TAXONNAMEUSAGE "name with (author year)" NARROW'
+      f.puts "default-namespace: #{@namespace}"
+      f.puts "ontology: FIX-ME-taxonifi-ontology\n\n"
+      # terms
+      @name_collection.collection.each do |n|
+        f.puts '[Term]'
+        f.puts "id: #{id_string(n)}"
+        f.puts "name: #{n.name}"
+        f.puts "is_a: #{id_string(n.parent)} ! #{n.parent.name}" if n.parent
+        f.puts "property_value: has_rank #{rank_string(n)}"
+        f.puts
+      end
+      # typedefs
+      f.puts "[Typedef]"
+      f.puts "id: has_rank"
+      f.puts "name: has taxonomic rank"
+      f.puts "is_metadata_tag: true"
+      true
+    end
+    def rank_string(name)
+      "TAXRANK:#{TAXRANKS[name.rank].to_s.rjust(7,"0")}"
+    end
+    def id_string(name)
+      "#{@namespace}:#{name.id.to_s.rjust(7,"0")}"
+    end
+  end # End class
+end # End module
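
The identifier formatting used by `id_string` and `rank_string` can be tried in isolation. This is a simplified sketch: the methods below take plain values instead of a `Taxonifi::Model::Name`, and `TAXRANKS_SUBSET` is a hypothetical three-entry excerpt of the `TAXRANKS` Hash added to `base.rb` in this diff.

```ruby
# Hypothetical excerpt of the TAXRANKS table from export/format/base.rb.
TAXRANKS_SUBSET = { 'family' => 4, 'genus' => 5, 'genus group' => -1 }

# Simplified signature: takes the rank String directly rather than a Name.
def rank_string(rank)
  "TAXRANK:#{TAXRANKS_SUBSET[rank].to_s.rjust(7, "0")}"
end

# Simplified signature: takes the namespace and numeric id directly.
def id_string(namespace, id)
  "#{namespace}:#{id.to_s.rjust(7, "0")}"
end

puts id_string('XYZ', 42)        # prints: XYZ:0000042
puts rank_string('family')       # prints: TAXRANK:0000004
puts rank_string('genus group')  # prints: TAXRANK:00000-1
```

Because `rjust` pads the String form of the number, ranks mapped to `-1` come out as `TAXRANK:00000-1`, which may be worth filtering before the term is written.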

data/lib/export/format/prolog.rb ADDED

@@ -0,0 +1,59 @@
+module Taxonifi::Export
+  # Dumps tables identical to the existing structure in SpeciesFile.
+  # Will only work in the pre Identity world.  Will reconfigure
+  # as templates for Jim's work after the fact.
+  class Prolog < Taxonifi::Export::Base
+    attr_accessor :name_collection
+    attr_accessor :ref_collection
+    attr_accessor :pub_collection
+    attr_accessor :author_index
+    attr_accessor :genus_names, :species_names, :nomenclator
+    attr_accessor :authorized_user_id, :time
+    attr_accessor :starting_ref_id
+    def initialize(options = {})
+      opts = {
+        :nc => Taxonifi::Model::NameCollection.new,
+        :export_folder => 'prolog',
+        :starting_ref_id => 1,                              # should be configured elsewhere... but
+        :manifest => %w{tblPubs tblRefs tblPeople tblRefAuthors tblTaxa tblGenusNames tblSpeciesNames tblNomenclator tblCites}
+      }.merge!(options)
+      @manifest = opts[:manifest]
+      super(opts)
+      raise Taxonifi::Export::ExportError, 'NameCollection not passed to SpeciesFile export.' if ! opts[:nc].class == Taxonifi::Model::NameCollection
+      #   raise Taxonifi::Export::ExportError, 'You must provide authorized_user_id for species_file export initialization.' if opts[:authorized_user_id].nil?
+      #  @name_collection = opts[:nc]
+      #  @pub_collection = {} # title => id
+      #  @authorized_user_id = opts[:authorized_user_id]
+      #  @author_index = {}
+      #  @starting_ref_id = opts[:starting_ref_id]
+      #
+      #  # Careful here, at present we are just generating Reference micro-citations from our names, so the indexing "just works"
+      #  # because it's all internal.  There will is a strong potential for key collisions if this pipeline is modified to
+      #  # include references external to the initialized name_collection.  See also export_references.
+      #  #
+      #  # @by_author_reference_index = {}
+      #  @genus_names = {}
+      #  @species_names = {}
+      #  @nomenclator = {}
+      @time = Time.now.strftime("%F %T")
+      @empty_quotes = ""
+    end
+    def export()
+      super
+      configure_folders
+      str = ["FOO"]
+      write_file('foo.pl', str.join("\n\n"))
+      true
+    end
+  end # End class
+end # End module
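
Both new exporters guard their `:nc` option with `if ! opts[:nc].class == Taxonifi::Model::NameCollection`. Ruby applies unary `!` before `==`, so that expression compares `false` with the class and appears never to be true. The sketch below shows the difference, using a hypothetical stand-in class and one possible alternative based on `is_a?`; it is not taken from the gem.

```ruby
# Hypothetical stand-in class; only the shape of the guard matters here.
class NameCollectionSketch; end

def guard_as_written?(nc)
  # Parsed as (!nc.class) == NameCollectionSketch, i.e. false == <Class>.
  ! nc.class == NameCollectionSketch
end

def guard_with_is_a?(nc)
  # One possible alternative: true whenever nc is not a NameCollectionSketch.
  !nc.is_a?(NameCollectionSketch)
end

p guard_as_written?("not a collection")       # => false
p guard_with_is_a?("not a collection")        # => true
p guard_with_is_a?(NameCollectionSketch.new)  # => false
```

Note that `is_a?` also accepts subclasses, which differs slightly from a strict class comparison.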