taxonifi 0.1.0 → 0.2.0

Files changed (47)
  1. data/Gemfile +1 -0
  2. data/Gemfile.lock +24 -7
  3. data/README.rdoc +5 -6
  4. data/Rakefile +1 -1
  5. data/VERSION +1 -1
  6. data/lib/assessor/row_assessor.rb +25 -18
  7. data/lib/export/format/base.rb +96 -1
  8. data/lib/export/format/obo_nomenclature.rb +71 -0
  9. data/lib/export/format/prolog.rb +59 -0
  10. data/lib/export/format/species_file.rb +303 -193
  11. data/lib/lumper/clump.rb +112 -0
  12. data/lib/lumper/lumper.rb +71 -45
  13. data/lib/lumper/lumps/parent_child_name_collection.rb +79 -15
  14. data/lib/models/author_year.rb +1 -2
  15. data/lib/models/base.rb +56 -51
  16. data/lib/models/collection.rb +16 -1
  17. data/lib/models/name.rb +56 -15
  18. data/lib/models/name_collection.rb +70 -19
  19. data/lib/models/ref.rb +17 -0
  20. data/lib/models/ref_collection.rb +2 -1
  21. data/lib/models/shared_class_methods.rb +29 -0
  22. data/lib/models/species_name.rb +14 -12
  23. data/lib/splitter/parser.rb +1 -2
  24. data/lib/splitter/tokens.rb +1 -1
  25. data/lib/taxonifi.rb +12 -0
  26. data/lib/utils/array.rb +17 -0
  27. data/lib/utils/hash.rb +17 -0
  28. data/taxonifi.gemspec +116 -0
  29. data/test/file_fixtures/Fossil.csv +11 -0
  30. data/test/file_fixtures/Lygaeoidea.csv +1 -1
  31. data/test/file_fixtures/names.csv +1 -0
  32. data/test/helper.rb +14 -0
  33. data/test/test_export_prolog.rb +14 -0
  34. data/test/test_exporter.rb +23 -0
  35. data/test/test_lumper_clump.rb +75 -0
  36. data/test/test_lumper_names.rb +67 -9
  37. data/test/test_lumper_parent_child_name_collection.rb +47 -3
  38. data/test/test_lumper_refs.rb +22 -7
  39. data/test/test_obo_nomenclature.rb +14 -0
  40. data/test/test_parser.rb +4 -2
  41. data/test/test_splitter_tokens.rb +9 -0
  42. data/test/test_taxonifi_accessor.rb +21 -15
  43. data/test/test_taxonifi_base.rb +25 -0
  44. data/test/test_taxonifi_name.rb +41 -4
  45. data/test/test_taxonifi_name_collection.rb +54 -17
  46. data/test/test_taxonifi_species_name.rb +1 -1
  47. metadata +34 -5
data/Gemfile CHANGED
@@ -12,6 +12,7 @@ group :development do
12
12
  gem "rdoc", "~> 3.12"
13
13
  gem "bundler", "> 1.0.0"
14
14
  gem "jeweler", "~> 1.8.3"
15
+ gem "activerecord", "3.2.8"
15
16
  gem "debugger"
16
17
  # gem "ruby-debug19"
17
18
  # gem "simplecov", ">= 0"
data/Gemfile.lock CHANGED
@@ -1,29 +1,46 @@
1
1
  GEM
2
2
  remote: http://rubygems.org/
3
3
  specs:
4
+ activemodel (3.2.8)
5
+ activesupport (= 3.2.8)
6
+ builder (~> 3.0.0)
7
+ activerecord (3.2.8)
8
+ activemodel (= 3.2.8)
9
+ activesupport (= 3.2.8)
10
+ arel (~> 3.0.2)
11
+ tzinfo (~> 0.3.29)
12
+ activesupport (3.2.8)
13
+ i18n (~> 0.6)
14
+ multi_json (~> 1.0)
15
+ arel (3.0.2)
16
+ builder (3.0.0)
4
17
  columnize (0.3.6)
5
- debugger (1.1.1)
18
+ debugger (1.2.0)
6
19
  columnize (>= 0.3.1)
7
- debugger-linecache (~> 1.1)
8
- debugger-ruby_core_source (~> 1.1)
9
- debugger-linecache (1.1.1)
20
+ debugger-linecache (~> 1.1.1)
21
+ debugger-ruby_core_source (~> 1.1.3)
22
+ debugger-linecache (1.1.2)
10
23
  debugger-ruby_core_source (>= 1.1.1)
11
- debugger-ruby_core_source (1.1.1)
24
+ debugger-ruby_core_source (1.1.3)
12
25
  git (1.2.5)
13
- jeweler (1.8.3)
26
+ i18n (0.6.1)
27
+ jeweler (1.8.4)
14
28
  bundler (~> 1.0)
15
29
  git (>= 1.2.5)
16
30
  rake
17
31
  rdoc
18
- json (1.6.6)
32
+ json (1.7.5)
33
+ multi_json (1.3.6)
19
34
  rake (0.9.2.2)
20
35
  rdoc (3.12)
21
36
  json (~> 1.4)
37
+ tzinfo (0.3.33)
22
38
 
23
39
  PLATFORMS
24
40
  ruby
25
41
 
26
42
  DEPENDENCIES
43
+ activerecord (= 3.2.8)
27
44
  bundler (> 1.0.0)
28
45
  debugger
29
46
  jeweler (~> 1.8.3)
data/README.rdoc CHANGED
@@ -4,19 +4,18 @@ There will always be "legacy" taxonomic data that needs shuffling around. The ta
4
4
  Overall, the goal is to provide well documented (and unit-tested) coded that is broadly useful, and vanilla enough to encourage other to fork and hack on their own.
5
5
 
6
6
  == Source
7
- Source is available at https://github.com/SpeciesFile/taxonifi. The rdoc API is also viewable at http://taxonifi.speciesfile.org, (though those docs may lag behind commits to github).
7
+ Source is available at https://github.com/SpeciesFile/taxonifi . The rdoc API is also viewable at http://taxonifi.speciesfile.org , (though those docs may lag behind commits to github).
8
8
 
9
9
  == What's next?
10
10
 
11
- Before you jump on board you should also check out similar code from the Global Names team at https://github.com/GlobalNamesArchitecture. Future integration and merging of shared functionality is planned. Code will be released in an "early-and-often" approach
11
+ Before you jump on board you should also check out similar code from the Global Names team at https://github.com/GlobalNamesArchitecture. Future integration and merging of shared functionality is planned. Code will be released in an "early-and-often" approach.
12
+
13
+ Taxonifi is presently coded for convience, not speed (though it's not necessarily slow). It assumes that conversion processes are typically one-offs that can afford to run over a longer period of time (read minutes rather than seconds). Reading, and fully parsing into objects, around 25k rows of nomenclature (class to species, inc. author year, = ~45k names) in to memory as Taxonifi objects benchmarks at around 2 minutes. Faster indexing is planned as needed, likely using Redis (see GNA link above).
12
14
 
13
15
  = Getting started
14
16
  taxonifi is coded for Ruby 1.9.3, it has not been tested on earlier versions (though it will certainly not work with 1.8.7).
15
17
  Using Ruby Version Manager (RVM, https://rvm.io/ ) is highly recommend. You can test your version of Ruby by doinging "ruby -v" in your terminal.
16
18
 
17
- Taxonifi is presently coded for convience, not speed (though it's not necessarily slow). It assumes that conversion processes are typically one-offs that can afford to run over a longer period of time (read minutes rather than seconds). Reading, and fully parsing into objects, around 25k rows of nomenclature (class to species, inc. author year, = ~45k names) in to memory as Taxonifi objects benchmarks at around 2 minutes. Faster indexing is planned as needed, likely using Redis (see GNA link above).
18
-
19
-
20
19
  To install:
21
20
 
22
21
  gem install taxonifi
@@ -76,7 +75,7 @@ There are collections of specific types (e.g. taxonomic names, geographic names)
76
75
 
77
76
  csv = CSV.parse(string, {headers: true})
78
77
 
79
- nc = Taxonifi::Lumper.create_name_collection(csv) # => Taxonifi::Model::NameCollection
78
+ nc = Taxonifi::Lumper.create_name_collection(:csv => csv) # => Taxonifi::Model::NameCollection
80
79
 
81
80
  nc.collection.first # => Taxonifi::Model::Name
82
81
  nc.collection.first.name # => "Fooidae"
data/Rakefile CHANGED
@@ -15,7 +15,7 @@ require 'jeweler'
15
15
  Jeweler::Tasks.new do |gem|
16
16
  # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
17
17
  gem.name = "taxonifi"
18
- gem.homepage = "http://github.com/mjy/taxonifi"
18
+ gem.homepage = "http://github.com/SpeciesFile/taxonifi"
19
19
  gem.license = "MIT"
20
20
  gem.summary = %Q{A general purpose framework for scripted handling of taxonomic names}
21
21
  gem.description = %Q{Taxonifi contains simple models and utilties of use in for parsing lists of taxonomic name (life) related metadata}
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.1.0
1
+ 0.2.0
data/lib/assessor/row_assessor.rb CHANGED
@@ -36,7 +36,8 @@ module Taxonifi
36
36
  end
37
37
  end
38
38
 
39
- # Return the last column with data, scoped by lump if provided.
39
+ # Return an Array of ["header", value] for the last column with data, scoped by lump if provided.
40
+ # If there is nothing available in the scope provided return [nil, nil]
40
41
  def self.last_available(csv_row, lump = nil)
41
42
  if lump.nil?
42
43
  csv_row.entries.reverse.each do |c,v|
@@ -47,25 +48,31 @@ module Taxonifi
47
48
  return [l, csv_row[l.to_s]] if !csv_row[l.to_s].nil?
48
49
  end
49
50
  end
51
+ [nil, nil]
50
52
  end
51
53
 
52
54
  # Return the rank (symbol) of the taxon name rank. Raises
53
55
  # if no name detected.
54
56
  def self.lump_name_rank(csv_row)
55
- lumps = Taxonifi::Lumper.available_lumps(csv_row.headers)
57
+ # Rather than just check individual columns for data ensure a complete lump is present
58
+ lumps = intersecting_lumps_with_data(csv_row, [:species, :genera, :higher])
56
59
  if lumps.include?(:species) # has to be a species name
57
- if csv_row[:subspecies].nil?
58
- return :species
60
+ if !csv_row['variety'].nil?
61
+ return :variety
59
62
  else
60
- return :subspecies
63
+ if csv_row['subspecies'].nil?
64
+ return :species
65
+ else
66
+ return :subspecies
67
+ end
61
68
  end
62
69
  elsif lumps.include?(:genera)
63
- if csv_row[:subgenus].nil?
70
+ if csv_row['subgenus'].nil?
64
71
  return :genus
65
72
  else
66
73
  return :subgenus
67
74
  end
68
- else
75
+ elsif lumps.include?(:higher)
69
76
  return Taxonifi::Assessor::RowAssessor.last_available(csv_row, Taxonifi::Lumper::LUMPS[:higher]).first.to_sym
70
77
  end
71
78
 
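The change to last_available above adds a [nil, nil] fallback for a scope with no data. A minimal sketch of that contract follows. It assumes a plain Hash as a stand-in for the CSV::Row (both answer [] and enumerate header/value pairs), and the iteration over the lump is reconstructed from the visible hunk rather than copied from the gem, so treat it as an illustration only.

```ruby
# Hypothetical stand-in for RowAssessor.last_available, not the gem's code.
# `row` is a Hash here; the gem passes a CSV::Row.
def last_available(row, lump = nil)
  if lump.nil?
    # Walk the columns right to left and return the first one holding data.
    row.to_a.reverse.each do |header, value|
      return [header, value] unless value.nil?
    end
  else
    # Same walk, restricted to the headers named by the lump.
    lump.reverse.each do |l|
      return [l, row[l.to_s]] unless row[l.to_s].nil?
    end
  end
  [nil, nil] # nothing available in the requested scope
end

row = { 'family' => 'Fooidae', 'genus' => 'Foo', 'species' => nil }

p last_available(row)                        # => ["genus", "Foo"]
p last_available(row, [:family, :genus])     # => [:genus, "Foo"]
p last_available(row, [:order, :suborder])   # => [nil, nil]
```

With this fallback a caller that chains .first.to_sym onto the result, as lump_name_rank does, still needs to know the scope has data, since nil does not respond to to_sym.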
@@ -75,11 +82,11 @@ module Taxonifi
75
82
 
76
83
  # Return the column representing the parent of the name
77
84
  # represented in this row.
78
- def self.parent_taxon_column(csv_row)
79
- lumps = Taxonifi::Lumper.available_lumps(csv_row.headers)
80
- last = last_available(csv_row, Taxonifi::RANKS)
81
- last_available(csv_row, Taxonifi::RANKS[0..Taxonifi::RANKS.index(last[0])-1])
82
- end
85
+ # TODO: DEPRECATE, same f(n) as last_available when scoped properly
86
+ # def self.parent_taxon_column(csv_row)
87
+ # last = last_available(csv_row, Taxonifi::RANKS)
88
+ # last_available(csv_row, Taxonifi::RANKS[0..Taxonifi::RANKS.index(last[0])-1])
89
+ # end
83
90
 
84
91
  # Return an Array of headers that represent taxonomic ranks.
85
92
  def self.rank_headers(headers)
@@ -92,13 +99,13 @@ module Taxonifi
92
99
  end
93
100
 
94
101
  # Return lumps for which at least one column has data.
95
- def self.intersecting_lumps_with_data(row, lumps_to_try = nil)
96
- lumps_to_try ||= Taxonifi::Lumper::LUMPS.keys
102
+ def self.intersecting_lumps_with_data(csv_row, lumps_to_try = nil)
103
+ lumps_to_try ||= Taxonifi::Lumper.intersecting_lumps(csv_row.headers)
97
104
  lumps = []
98
105
  lumps_to_try.each do |l|
99
106
  has_data = false
100
107
  Taxonifi::Lumper::LUMPS[l].each do |c|
101
- if !row[c].nil? && !row[c].empty?
108
+ if !csv_row[c].nil? && !csv_row[c].empty?
102
109
  has_data = true
103
110
  break
104
111
  end
@@ -109,13 +116,13 @@ module Taxonifi
109
116
  end
110
117
 
111
118
  # Return lumps that have data for all columns.
112
- def self.lumps_with_data(row, lumps_to_try = nil)
113
- lumps_to_try ||= Taxonifi::Lumper::LUMPS.keys
119
+ def self.lumps_with_data(csv_row, lumps_to_try = nil)
120
+ lumps_to_try ||= Taxonifi::Lumper.available_lumps(csv_row.headers) # Taxonifi::Lumper::LUMPS.keys
114
121
  lumps = []
115
122
  lumps_to_try.each do |l|
116
123
  has_data = true
117
124
  Taxonifi::Lumper::LUMPS[l].each do |c|
118
- if row[c].nil? || row[c].empty?
125
+ if csv_row[c].nil? || csv_row[c].empty?
119
126
  has_data = false
120
127
  break
121
128
  end
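The two methods above differ only in whether a lump needs data in at least one column or in every column. A compact sketch of that distinction is given here. The LUMPS table is a hypothetical two-entry excerpt (the real Taxonifi::Lumper::LUMPS is not shown in this diff), the row is a Hash rather than a CSV::Row, and has_data? is a helper invented for the example.

```ruby
# Hypothetical excerpt; the gem's LUMPS table has more entries and may differ.
LUMPS = {
  species: %w[genus species],
  higher:  %w[order family]
}

# Helper for this sketch only: true when a cell is neither nil nor empty.
def has_data?(value)
  !value.nil? && !value.empty?
end

# Lumps for which at least one column has data.
def intersecting_lumps_with_data(row, lumps_to_try = LUMPS.keys)
  lumps_to_try.select { |l| LUMPS[l].any? { |c| has_data?(row[c]) } }
end

# Lumps for which every column has data.
def lumps_with_data(row, lumps_to_try = LUMPS.keys)
  lumps_to_try.select { |l| LUMPS[l].all? { |c| has_data?(row[c]) } }
end

row = { 'order' => 'Hemiptera', 'family' => 'Fooidae', 'genus' => 'Foo', 'species' => '' }

p intersecting_lumps_with_data(row) # => [:species, :higher]
p lumps_with_data(row)              # => [:higher]
```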
data/lib/export/format/base.rb CHANGED
@@ -2,8 +2,72 @@ module Taxonifi::Export
2
2
 
3
3
  # All export classes inherit from Taxonifi::Export::Base
4
4
  class Base
5
+
6
+ # Hash. An index of taxonomic ranks.
7
+ # See https://phenoscape.svn.sourceforge.net/svnroot/phenoscape/trunk/vocab/taxonomic_rank.obo
8
+ # Site: https://www.phenoscape.org/wiki/Taxonomic_Rank_Vocabulary
9
+ # Values of -1 have no correspondance in that ontology.
10
+ # Nt all values are supported. Not all values are included.
11
+ TAXRANKS = {
12
+ 'taxonomic_rank' => 0,
13
+ 'variety' => 16,
14
+ 'bio-variety' => 32,
15
+ 'subspecies' => 23,
16
+ 'form' => 26,
17
+ 'species' => 5,
18
+ 'species complex' => 12,
19
+ 'species subgroup' => 11,
20
+ 'species group' => 10,
21
+ 'species series' => -1,
22
+ 'series' => 31,
23
+ 'infragenus' => 43,
24
+ 'subgenus' => 9,
25
+ 'genus' => 5,
26
+ 'genus group' => -1,
27
+ 'subtribe' => 28,
28
+ 'tribe' => 25,
29
+ 'supertribe' => 57,
30
+ 'infrafamily' => 41,
31
+ 'subfamily' => 24,
32
+ 'subfamily group' => -1,
33
+ 'family' => 4,
34
+ 'epifamily' => -1,
35
+ 'superfamily' => 18,
36
+ 'superfamily group' => -1,
37
+ 'subinfraordinal group' => -1,
38
+ 'infraorder' => 13,
39
+ 'suborder' => 14,
40
+ 'order' => 3,
41
+ 'mirorder' => -1,
42
+ 'superorder' => 20,
43
+ 'magnorder' => -1,
44
+ 'parvorder' => 21,
45
+ 'cohort' => -1,
46
+ 'supercohort' => -1,
47
+ 'infraclass' => 19,
48
+ 'subclass' => 7,
49
+ 'class' => 2,
50
+ 'superclass' => 15,
51
+ 'infraphylum' => 40,
52
+ 'subphylum' => 8,
53
+ 'phylum' => 1,
54
+ 'superphylum' => 27,
55
+ 'infrakingdom' => 44,
56
+ 'subkingdom' => 29,
57
+ 'kingdom' => 17,
58
+ 'superkingdom' => 22,
59
+ 'life' => -1,
60
+ 'unknown' => -1,
61
+ 'section' => 30
62
+ }
63
+
5
64
  EXPORT_BASE = File.expand_path(File.join(Dir.home(), 'taxonifi', 'export'))
6
- attr_accessor :base_export_path, :export_folder
65
+
66
+ # String. Defaults to EXPORT_BASE.
67
+ attr_accessor :base_export_path
68
+
69
+ # String. The folder to dump output files to, subclassess contain a reasonably named default.
70
+ attr_accessor :export_folder
7
71
 
8
72
  def initialize(options = {})
9
73
  opts = {
@@ -39,5 +103,36 @@ module Taxonifi::Export
39
103
  f.close
40
104
  end
41
105
 
106
+ # TODO: Used?!
107
+ # Returns a new writeable File under the
108
+ def new_output_file(filename = 'foo')
109
+ File.new( File.expand_path(File.join(export_path, filename)), 'w+')
110
+ end
111
+
112
+
113
+ # TODO: Move to a SQL library.
114
+ # Returns a String, an INSERT statement derived from the passed values Hash.
115
+ def sql_insert_statement(tbl = nil, values = {})
116
+ return "nope" if tbl.nil?
117
+ "INSERT INTO #{tbl} (#{values.keys.sort.join(",")}) VALUES (#{values.keys.sort.collect{|k| sqlize(values[k])}.join(",")});"
118
+ end
119
+
120
+ # TODO: Move to a SQL library.
121
+ # Returns a String that has been SQL proofed based on its class.
122
+ def sqlize(value)
123
+ case value.class.to_s
124
+ when 'String'
125
+ "'#{sanitize(value)}'"
126
+ else
127
+ value
128
+ end
129
+ end
130
+
131
+ # TODO: Move to SQL/String library.
132
+ # Returns a String with quotes handled for SQL.
133
+ def sanitize(value)
134
+ value.to_s.gsub(/'/,"''")
135
+ end
136
+
42
137
  end
43
138
  end
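To show what the three SQL helpers added to Base produce, here is a standalone sketch. The method bodies follow the diff above, but they are wrapped in a hypothetical SqlSketch module so they run without the gem, and the table and column names are made up for the example.

```ruby
# Hypothetical wrapper module; in the gem these are instance methods on
# Taxonifi::Export::Base.
module SqlSketch
  module_function

  # Doubles single quotes so the value can sit inside a SQL string literal.
  def sanitize(value)
    value.to_s.gsub(/'/, "''")
  end

  # Quotes Strings; every other class is passed through untouched.
  def sqlize(value)
    case value.class.to_s
    when 'String'
      "'#{sanitize(value)}'"
    else
      value
    end
  end

  # Builds an INSERT statement with columns in sorted key order.
  def sql_insert_statement(tbl = nil, values = {})
    return "nope" if tbl.nil?
    keys = values.keys.sort
    "INSERT INTO #{tbl} (#{keys.join(",")}) VALUES (#{keys.collect { |k| sqlize(values[k]) }.join(",")});"
  end
end

stmt = SqlSketch.sql_insert_statement('tblTaxa', 'Name' => "O'Brien", 'TaxonNameID' => 7)
puts stmt
# INSERT INTO tblTaxa (Name,TaxonNameID) VALUES ('O''Brien',7);
```

Two behaviours are worth knowing when reusing this pattern: a nil value is joined as an empty string rather than NULL, and a missing table name yields the literal string "nope" instead of raising.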
data/lib/export/format/obo_nomenclature.rb ADDED
@@ -0,0 +1,71 @@
1
+
2
+ module Taxonifi::Export
3
+
4
+ # Writes a OBO formatted file for all names in a name collection.
5
+ # !! Does not write synonyms out.
6
+ # Follows the TTO example.
7
+ class OboNomenclature < Taxonifi::Export::Base
8
+
9
+ attr_accessor :name_collection, :namespace
10
+
11
+ def initialize(options = {})
12
+ opts = {
13
+ :nc => Taxonifi::Model::NameCollection.new,
14
+ :export_folder => 'obo_nomenclature',
15
+ :starting_id => 1,
16
+ :namespace => 'XYZ'
17
+ }.merge!(options)
18
+
19
+ super(opts)
20
+ raise Taxonifi::Export::ExportError, 'NameCollection not passed to OboNomenclature export.' if ! opts[:nc].class == Taxonifi::Model::NameCollection
21
+ @name_collection = opts[:nc]
22
+ @namespace = opts[:namespace]
23
+ @time = Time.now.strftime("%D %T").gsub('/',":")
24
+ @empty_quotes = ""
25
+ end
26
+
27
+ # Writes the file.
28
+ def export()
29
+ super
30
+ f = new_output_file('obo_nomenclature.obo')
31
+
32
+ # header
33
+ f.puts 'format-version: 1.2'
34
+ f.puts "date: #{@time}"
35
+ f.puts 'saved-by: someone'
36
+ f.puts 'auto-generated-by: Taxonifi'
37
+ f.puts 'synonymtypedef: COMMONNAME "common name"'
38
+ f.puts 'synonymtypedef: MISSPELLING "misspelling" EXACT'
39
+ f.puts 'synonymtypedef: TAXONNAMEUSAGE "name with (author year)" NARROW'
40
+ f.puts "default-namespace: #{@namespace}"
41
+ f.puts "ontology: FIX-ME-taxonifi-ontology\n\n"
42
+
43
+ # terms
44
+ @name_collection.collection.each do |n|
45
+ f.puts '[Term]'
46
+ f.puts "id: #{id_string(n)}"
47
+ f.puts "name: #{n.name}"
48
+ f.puts "is_a: #{id_string(n.parent)} ! #{n.parent.name}" if n.parent
49
+ f.puts "property_value: has_rank #{rank_string(n)}"
50
+ f.puts
51
+ end
52
+
53
+ # typedefs
54
+ f.puts "[Typedef]"
55
+ f.puts "id: has_rank"
56
+ f.puts "name: has taxonomic rank"
57
+ f.puts "is_metadata_tag: true"
58
+
59
+ true
60
+ end
61
+
62
+ def rank_string(name)
63
+ "TAXRANK:#{TAXRANKS[name.rank].to_s.rjust(7,"0")}"
64
+ end
65
+
66
+ def id_string(name)
67
+ "#{@namespace}:#{name.id.to_s.rjust(7,"0")}"
68
+ end
69
+
70
+ end # End class
71
+ end # End module
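The identifier formatting in rank_string and id_string relies on String#rjust to zero-pad to seven characters. The sketch below reproduces that formatting outside the gem. It assumes a three-entry excerpt of TAXRANKS and takes the rank, namespace and id as plain arguments instead of a Name object, so the signatures differ from the exporter's.

```ruby
# Excerpt of the TAXRANKS table from base.rb, values as they appear in the diff.
TAXRANKS = { 'genus' => 5, 'family' => 4, 'cohort' => -1 }

# Signature simplified for the sketch: takes the rank string directly.
def rank_string(rank)
  "TAXRANK:#{TAXRANKS[rank].to_s.rjust(7, "0")}"
end

# Signature simplified for the sketch: namespace and id passed explicitly.
def id_string(namespace, id)
  "#{namespace}:#{id.to_s.rjust(7, "0")}"
end

puts id_string('XYZ', 42)     # XYZ:0000042
puts rank_string('genus')     # TAXRANK:0000005
puts rank_string('cohort')    # TAXRANK:00000-1
puts rank_string('not_a_rank') # TAXRANK:0000000
```

The last two lines illustrate edge cases of this padding approach: a rank mapped to -1 is padded to 00000-1, and a rank missing from the table is padded to 0000000, which is indistinguishable from the 'taxonomic_rank' root entry.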
data/lib/export/format/prolog.rb ADDED
@@ -0,0 +1,59 @@
1
+
2
+ module Taxonifi::Export
3
+
4
+ # Dumps tables identical to the existing structure in SpeciesFile.
5
+ # Will only work in the pre Identity world. Will reconfigure
6
+ # as templates for Jim's work after the fact.
7
+ class Prolog < Taxonifi::Export::Base
8
+ attr_accessor :name_collection
9
+ attr_accessor :ref_collection
10
+ attr_accessor :pub_collection
11
+ attr_accessor :author_index
12
+ attr_accessor :genus_names, :species_names, :nomenclator
13
+ attr_accessor :authorized_user_id, :time
14
+ attr_accessor :starting_ref_id
15
+
16
+ def initialize(options = {})
17
+ opts = {
18
+ :nc => Taxonifi::Model::NameCollection.new,
19
+ :export_folder => 'prolog',
20
+ :starting_ref_id => 1, # should be configured elsewhere... but
21
+ :manifest => %w{tblPubs tblRefs tblPeople tblRefAuthors tblTaxa tblGenusNames tblSpeciesNames tblNomenclator tblCites}
22
+ }.merge!(options)
23
+
24
+ @manifest = opts[:manifest]
25
+
26
+ super(opts)
27
+ raise Taxonifi::Export::ExportError, 'NameCollection not passed to SpeciesFile export.' if ! opts[:nc].class == Taxonifi::Model::NameCollection
28
+ # raise Taxonifi::Export::ExportError, 'You must provide authorized_user_id for species_file export initialization.' if opts[:authorized_user_id].nil?
29
+ # @name_collection = opts[:nc]
30
+ # @pub_collection = {} # title => id
31
+ # @authorized_user_id = opts[:authorized_user_id]
32
+ # @author_index = {}
33
+ # @starting_ref_id = opts[:starting_ref_id]
34
+ #
35
+ # # Careful here, at present we are just generating Reference micro-citations from our names, so the indexing "just works"
36
+ # # because it's all internal. There will is a strong potential for key collisions if this pipeline is modified to
37
+ # # include references external to the initialized name_collection. See also export_references.
38
+ # #
39
+ # # @by_author_reference_index = {}
40
+ # @genus_names = {}
41
+ # @species_names = {}
42
+ # @nomenclator = {}
43
+
44
+ @time = Time.now.strftime("%F %T")
45
+ @empty_quotes = ""
46
+ end
47
+
48
+ def export()
49
+ super
50
+ configure_folders
51
+ str = ["FOO"]
52
+
53
+ write_file('foo.pl', str.join("\n\n"))
54
+
55
+ true
56
+ end
57
+
58
+ end # End class
59
+ end # End module
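Both new exporters guard their :nc option with the expression `! opts[:nc].class == Taxonifi::Model::NameCollection`. Because unary ! binds more tightly than ==, Ruby evaluates this as `(!opts[:nc].class) == ...`, which compares false to a class. The sketch below demonstrates the effect using a hypothetical empty NameCollection class as a stand-in; guard_intended is one possible reading of the intent, not something taken from the gem.

```ruby
# Hypothetical stand-in for Taxonifi::Model::NameCollection.
class NameCollection; end

# The condition exactly as written in the diff.
# Evaluates as (!nc.class) == NameCollection, i.e. false == NameCollection.
def guard_as_written(nc)
  ! nc.class == NameCollection
end

# One way to express the apparent intent (an assumption, not the gem's code).
def guard_intended(nc)
  !nc.is_a?(NameCollection)
end

p guard_as_written("not a collection")  # => false, so the raise never fires
p guard_as_written(NameCollection.new)  # => false
p guard_intended("not a collection")    # => true
p guard_intended(NameCollection.new)    # => false
```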