taxamatch_rb 1.0.1 → 1.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 3f9cbd9334dff96ed1723f1487bc3bb89805c4f4
+   data.tar.gz: 5928a559f917d9908d251cb873285e0242d60476
+ SHA512:
+   metadata.gz: d598f616d3f34f1cfc051b1e6ee049075eb3a87d5714d3712075434eead6f88e3667bde46edfd0a5cd7a8c868b3139d4ca9802f21205f38e918b4eba07e92324
+   data.tar.gz: d429b8e677be3fe170f01c66dc07c002fa05e77f14fc89e78dce4f6ded814c3a52477e44ebc0fb8688bf40cba686da45e5ede275d54cebdf6d3daa7f518637e5
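The `checksums.yaml` file records SHA1 and SHA512 digests for the two archives packed inside the `.gem` file. A minimal sketch of how such a digest could be checked with Ruby's standard `digest` library follows; the helper name `checksum_matches?` and the sample bytes are illustrative assumptions, not part of the gem.

```ruby
require "digest/sha1"
require "digest/sha2"

# Hypothetical helper: compares the digest of +bytes+ with an expected
# hex string, using an algorithm name as it appears in checksums.yaml.
def checksum_matches?(bytes, expected_hex, algorithm = "SHA512")
  digest_class = {
    "SHA1" => Digest::SHA1,
    "SHA512" => Digest::SHA512
  }.fetch(algorithm)
  digest_class.hexdigest(bytes) == expected_hex.downcase
end

# Stand-in data; a real check would read metadata.gz or data.tar.gz
# (extracted from the .gem archive) in binary mode instead.
sample = "abc"
expected = Digest::SHA512.hexdigest(sample)

puts checksum_matches?(sample, expected)
puts checksum_matches?(sample + "x", expected)
```

Passing the algorithm name as a string keeps the helper close to the layout of `checksums.yaml`, where each top-level key names the digest.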
data/.gitignore ADDED
@@ -0,0 +1,9 @@
+ /.bundle/
+ /.yardoc
+ /Gemfile.lock
+ /_yardoc/
+ /coverage/
+ /doc/
+ /pkg/
+ /spec/reports/
+ /tmp/
data/.rspec ADDED
@@ -0,0 +1,3 @@
+ --format documentation
+ --color
+ --require spec_helper
data/.rubocop.yml ADDED
@@ -0,0 +1,9 @@
+ AllCops:
+   Exclude:
+     - features/**/*
+     - .bundle/**/*
+     - bundle_bin/**/*
+ Style/StringLiterals:
+   EnforcedStyle: double_quotes
+ Style/DotPosition:
+   EnforcedStyle: trailing
data/.travis.yml ADDED
@@ -0,0 +1,10 @@
+ rvm:
+   - 2.0
+   - 2.1
+   - 2.2
+ script:
+   - bundle exec rake
+ branches:
+   only:
+     - master
+
data/CHANGELOG CHANGED
@@ -1,3 +1,5 @@
+ 1.1.0 - create gem with bundle instead of jeweler, refactoring
+
  1.0.0 - fixed a parsing problem with infraspecies without string,
  upgraded version to 1 because the signature of the gem has stabilized

@@ -12,3 +14,4 @@ upgraded version to 1 because the signature of the gem has stabilized
  characters, all utf-8 characters unknown to normalizer are becoming '?'

  0.9.1 - updated gems
+
@@ -0,0 +1,31 @@
+ Contributor Code of Conduct
+ ===========================
+
+ As contributors and maintainers of this project, we pledge to respect all
+ people who contribute through reporting issues, posting feature requests,
+ updating documentation, submitting pull requests or patches, and other
+ activities.
+
+ We are committed to making participation in this project a harassment-free
+ experience for everyone, regardless of level of experience, gender, gender
+ identity and expression, sexual orientation, disability, personal appearance,
+ body size, race, age, or religion.
+
+ Examples of unacceptable behavior by participants include the use of sexual
+ language or imagery, derogatory comments or personal attacks, trolling, public
+ or private harassment, insults, or other unprofessional conduct.
+
+ Project maintainers have the right and responsibility to remove, edit, or
+ reject comments, commits, code, wiki edits, issues, and other contributions
+ that are not aligned to this Code of Conduct. Project maintainers who do not
+ follow the Code of Conduct may be removed from the project team.
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported by opening an issue or contacting one or more of the project
+ maintainers.
+
+ This Code of Conduct is adapted from the [Contributor
+ Covenant][1], [version 1.0.0][2]
+
+ [1]: http://contributor-covenant.org
+ [2]: http://contributor-covenant.org/version/1/0/0/
data/Gemfile CHANGED
@@ -1,19 +1,4 @@
  source 'https://rubygems.org'
- require 'yaml'

- gem 'biodiversity','~> 3.1.0'
- gem 'damerau-levenshtein', '~> 0.5.4'
- gem 'json', '~> 1.7.7'
-
- group :test do
-   gem 'rake', '~> 10.0'
-   gem 'rake-compiler', '~> 0.8'
-   gem 'rspec', '~> 2.13'
-   gem 'cucumber', '~> 1.3'
-   gem 'bundler', '~> 1.3'
-   gem 'jeweler', '~> 1.8'
-   gem 'debugger', '~> 1.5'
-   gem 'ruby-prof', '~> 0.13'
-   gem 'shoulda', '~> 3.5'
-   gem 'mocha', '~> 0.13'
- end
+ # Specify your gem's dependencies in taxamatch_rb.gemspec
+ gemspec
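With the `gemspec` directive, Bundler reads dependencies from `taxamatch_rb.gemspec` instead of the Gemfile. That file is not part of this diff, so the following is only a sketch of what such a specification might look like. The name, summary, and author come from the removed Jeweler block in the Rakefile; the dependency constraints are copied from the removed Gemfile lines and are assumed, not confirmed, to carry over.

```ruby
require "rubygems"

# Illustrative specification only; the real taxamatch_rb.gemspec may differ.
spec = Gem::Specification.new do |s|
  s.name    = "taxamatch_rb"
  s.version = "1.1.0"
  s.summary = "Implementation of Tony Rees Taxamatch algorithms"
  s.authors = ["Dmitry Mozzherin"]

  # Runtime dependencies (constraints assumed from the old Gemfile).
  s.add_dependency "biodiversity", "~> 3.1.0"
  s.add_dependency "damerau-levenshtein", "~> 0.5.4"

  # Development-only dependencies (a subset, for illustration).
  s.add_development_dependency "rake", "~> 10.0"
  s.add_development_dependency "rspec", "~> 2.13"
end

puts spec.runtime_dependencies.map(&:name).inspect
puts spec.development_dependencies.map(&:name).inspect
```

Keeping dependencies in the gemspec means the published gem and the development environment draw on a single list.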
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2015 Marine Biological Laboratory
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md CHANGED
@@ -1,75 +1,102 @@
- Taxamatch_Rb
+ taxamatch_rb
  ============

  [![Gem Version][1]][2]
  [![Continuous Integration Status][3]][4]
- [![Dependency Status][5]][6]
+ [![Coverage Status][5]][6]
+ [![CodePolice][7]][8]
+ [![Dependency Status][9]][10]

- Taxamatch_Rb is a ruby implementation of Taxamatch algorithms
- [developed by Tony Rees][7]:
+ `taxamatch_rb` is a ruby implementation of Taxamatch algorithms
+ [developed by Tony Rees][11]:

  The purpose of Taxamatch gem is to facilitate fuzzy comparison of
  two scientific name renderings to find out if they actually point to
  the same scientific name.

-     require 'taxamatch_rb'
-     tm = Taxamatch::Base.new
-     tm.taxamatch('Homo sapien', 'Homo sapiens') #returns true
-     tm.taxamatch('Homo sapiens Linnaeus', 'Hommo sapens (Linn. 1758)') #returns true
-     tm.taxamatch('Homo sapiens Mozzherin', 'Homo sapiens Linnaeus') #returns false
+ ```ruby
+ require 'taxamatch_rb'
+ tm = Taxamatch::Base.new
+ tm.taxamatch('Homo sapien', 'Homo sapiens') #returns true
+ tm.taxamatch('Homo sapiens Linnaeus', 'Hommo sapens (Linn. 1758)') #returns true
+ tm.taxamatch('Homo sapiens Mozzherin', 'Homo sapiens Linnaeus') #returns false
+ ```

- Taxamatch_Rb is compatible with ruby versions 1.9.1 and higher
+ `taxamatch_rb` is compatible with ruby versions 1.9.1 and higher

  Installation
  ------------

-     sudo gem install taxamatch_rb
+ ```bash
+ $ sudo gem install taxamatch_rb
+ ```

  Usage
  -----

-     require 'taxamatch_rb'
+ ```ruby
+ require "taxamatch_rb"

-     tm = Taxamatch::Base.new
+ # To find version
+ Taxamatch.version
+
+ # To start new instance of taxamatch
+ tm = Taxamatch::Base.new
+ ```

  * compare full scientific names

-     tm.taxamatch('Hommo sapiens L.', 'Homo sapiens Linnaeus')
+ ```ruby
+ tm.taxamatch("Hommo sapiens L.", "Homo sapiens Linnaeus")
+ ```

  * preparse names for the matching (necessary for large databases of scientific names)

-     p = Taxamatch::Atomizer.new
-     parsed_name1 = p.parse('Monacanthus fronticinctus Günther 1867 sec. Eschmeyer 2004')
-     parsed_name2 = p.parse('Monacanthus fronticinctus (Gunther, 1867)')
+ ```ruby
+ p = Taxamatch::Atomizer.new
+ parsed_name1 = p.parse("Monacanthus fronticinctus Günther 1867 sec. Eschmeyer 2004")
+ parsed_name2 = p.parse("Monacanthus fronticinctus (Gunther, 1867)")
+ ```

  * compare preparsed names

-     tm.taxamatch_preparsed(parsed_name1, parsed_name2)
+ ```ruby
+ tm.taxamatch_preparsed(parsed_name1, parsed_name2)
+ ```

  * compare genera

-     tm.match_genera('Monacanthus', 'MONOCANTUS')
+ ```ruby
+ tm.match_genera("Monacanthus", "MONOCANTUS")
+ ```

  * compare species

-     tm.match_species('fronticinctus', 'frontecinctus')
+ ```ruby
+ tm.match_species("fronticinctus", "frontecinctus")
+ ```

  * compare authors and years

-     Taxamatch::Authmatch.authmatch(['Linnaeus'], ['L','Muller'], [1786], [1787])
-
+ ```ruby
+ Taxamatch::Authmatch.authmatch(["Linnaeus"], ["L","Muller"], [1786], [1787])
+ ```

  You can find more examples in spec section of the code

  Copyright
  ---------

- Copyright (c) 2009-2013 Marine Biological Laboratory. See LICENSE for details.
+ Copyright (c) 2009-2015 Marine Biological Laboratory. See LICENSE for details.

  [1]: https://badge.fury.io/rb/taxamatch_rb.png
  [2]: http://badge.fury.io/rb/taxamatch_rb
  [3]: https://secure.travis-ci.org/GlobalNamesArchitecture/taxamatch_rb.png
  [4]: http://travis-ci.org/GlobalNamesArchitecture/taxamatch_rb
- [5]: https://gemnasium.com/GlobalNamesArchitecture/taxamatch_rb.png
- [6]: https://gemnasium.com/GlobalNamesArchitecture/taxamatch_rb
- [7]: http://www.cmar.csiro.au/datacentre/taxamatch.htm
+ [5]: https://coveralls.io/repos/GlobalNamesArchitecture/taxamatch_rb/badge.png
+ [6]: https://coveralls.io/r/GlobalNamesArchitecture/taxamatch_rb
+ [7]: https://codeclimate.com/github/GlobalNamesArchitecture/taxamatch_rb.png
+ [8]: https://codeclimate.com/github/GlobalNamesArchitecture/taxamatch_rb
+ [9]: https://gemnasium.com/GlobalNamesArchitecture/taxamatch_rb.png
+ [10]: https://gemnasium.com/GlobalNamesArchitecture/taxamatch_rb
+ [11]: http://www.cmar.csiro.au/datacentre/taxamatch.htm
data/Rakefile CHANGED
@@ -1,47 +1,18 @@
- require 'rubygems'
+ require "bundler/gem_tasks"
+ require "rake"
+ require "rspec/core"
+ require "rspec/core/rake_task"
+ require "cucumber"
+ require "cucumber/rake/task"

- require 'bundler'
- begin
-   Bundler.setup(:default, :development)
- rescue Bundler::BundlerError => e
-   $stderr.puts e.message
-   $stderr.puts 'Run `bundle install` to install missing gems'
-   exit e.status_code
- end
-
- require 'rake'
-
- begin
-   require 'jeweler'
-   Jeweler::Tasks.new do |gem|
-     gem.name = 'taxamatch_rb'
-     gem.summary = 'Implementation of Tony Rees Taxamatch algorithms'
-     gem.description = 'This gem implements algorithm ' +
-       'for fuzzy matching scientific names developed by Tony Rees'
-     gem.email = 'dmozzherin@gmail.com'
-     gem.homepage = 'http://github.com/GlobalNamesArchitecture/taxamatch_rb'
-     gem.authors = ['Dmitry Mozzherin']
-     gem.files = FileList['[A-Z]*',
-       '*.gemspec', '{bin,generators,lib,spec}/**/*']
-     gem.files -= FileList['lib/**/*.bundle', 'lib/**/*.dll', 'lib/**/*.so']
-     gem.files += FileList['ext/**/*.c']
-     gem.extensions = FileList['ext/**/extconf.rb']
-   end
-
- rescue LoadError
-   puts 'Jeweler (or a dependency) not available.' +
-     ' Install it with: sudo gem install jeweler'
- end
-
- require 'rspec/core'
- require 'rspec/core/rake_task'
  RSpec::Core::RakeTask.new(:spec) do |spec|
    spec.pattern = FileList['spec/**/*_spec.rb']
  end

- RSpec::Core::RakeTask.new(:rcov) do |spec|
-   spec.pattern = 'spec/**/*_spec.rb'
-   spec.rcov = true
+ # Cucumber::Rake::Task.new(:features)
+ Cucumber::Rake::Task.new(:features) do |t|
+   t.cucumber_opts = "features --format pretty"
  end

- task :default => :spec
+ task default: [:features, :spec]
+
data/bin/console ADDED
@@ -0,0 +1,14 @@
+ #!/usr/bin/env ruby
+
+ require "bundler/setup"
+ require "taxamatch_rb"
+
+ # You can add fixtures and/or initialization code here to make experimenting
+ # with your gem easier. You can also use a different console, if you like.
+
+ # (If you use this, don't forget to add pry to your Gemfile!)
+ # require "pry"
+ # Pry.start
+
+ require "irb"
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,7 @@
+ #!/bin/bash
+ set -euo pipefail
+ IFS=$'\n\t'
+
+ bundle install
+
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,154 @@
+ module Taxamatch
+   # Matches name strings of scientific names
+   class Base
+     def initialize
+       @parser = Taxamatch::Atomizer.new
+       @dlm = DamerauLevenshtein
+     end
+
+     def taxamatch(str1, str2, return_boolean = true)
+       preparsed_1 = @parser.parse(str1)
+       preparsed_2 = @parser.parse(str2)
+       match = taxamatch_preparsed(preparsed_1, preparsed_2)
+       return_boolean ? (!!match && match["match"]) : match
+     end
+
+     def taxamatch_preparsed(preparsed_1, preparsed_2)
+       result = nil
+       if preparsed_1[:uninomial] && preparsed_2[:uninomial]
+         result = match_uninomial(preparsed_1, preparsed_2)
+       elsif preparsed_1[:genus] && preparsed_2[:genus]
+         result = match_multinomial(preparsed_1, preparsed_2)
+       end
+       if result && result["match"]
+         result["match"] = match_authors(preparsed_1, preparsed_2)
+       end
+       result
+     rescue StandardError
+       nil
+     end
+
+     def match_uninomial(preparsed_1, preparsed_2)
+       match_genera(preparsed_1[:uninomial], preparsed_2[:uninomial])
+     end
+
+     def match_multinomial(preparsed_1, preparsed_2)
+       gen_match = match_genera(preparsed_1[:genus], preparsed_2[:genus])
+       sp_match = match_species(preparsed_1[:species], preparsed_2[:species])
+       total_length = preparsed_1[:genus][:string].size +
+         preparsed_2[:genus][:string].size +
+         preparsed_1[:species][:string].size +
+         preparsed_2[:species][:string].size
+       if preparsed_1[:infraspecies] && preparsed_2[:infraspecies]
+         infrasp_match = match_species(preparsed_1[:infraspecies][0],
+                                       preparsed_2[:infraspecies][0])
+         total_length += preparsed_1[:infraspecies][0][:string].size +
+           preparsed_2[:infraspecies][0][:string].size
+         match_hash = match_matches(gen_match, sp_match, infrasp_match)
+       elsif (preparsed_1[:infraspecies] && !preparsed_2[:infraspecies]) ||
+             (!preparsed_1[:infraspecies] && preparsed_2[:infraspecies])
+         match_hash = { "match" => false,
+                        "edit_distance" => 5,
+                        "phonetic_match" => false }
+         total_length += preparsed_1[:infraspecies] ?
+           preparsed_1[:infraspecies][0][:string].size :
+           preparsed_2[:infraspecies][0][:string].size
+       else
+         match_hash = match_matches(gen_match, sp_match)
+       end
+       match_hash.merge({ "score" =>
+         (1 - match_hash["edit_distance"]/(total_length/2)) })
+       match_hash
+     end
+
+     def match_genera(genus1, genus2, opts = {})
+       genus1_length = genus1[:normalized].size
+       genus2_length = genus2[:normalized].size
+       opts = { with_phonetic_match: true }.merge(opts)
+       min_length = [genus1_length, genus2_length].min
+       unless opts[:with_phonetic_match]
+         genus1[:phonetized] = "A"
+         genus2[:phonetized] = "B"
+       end
+       match = false
+       ed = @dlm.distance(genus1[:normalized],
+                          genus2[:normalized], 1, 3) #TODO put block = 2
+       return { "edit_distance" => ed,
+                "phonetic_match" => false,
+                "match" => false } if ed/min_length.to_f > 0.2
+       return { "edit_distance" => ed,
+                "phonetic_match" => true,
+                "match" => true } if genus1[:phonetized] == genus2[:phonetized]
+
+       match = true if ed <= 3 && (min_length > ed * 2) &&
+         (ed < 2 || genus1[0] == genus2[0])
+       { "edit_distance" => ed, "match" => match, "phonetic_match" => false }
+     end
+
+     def match_species(sp1, sp2, opts = {})
+       sp1_length = sp1[:normalized].size
+       sp2_length = sp2[:normalized].size
+       opts = { with_phonetic_match: true }.merge(opts)
+       min_length = [sp1_length, sp2_length].min
+       unless opts[:with_phonetic_match]
+         sp1[:phonetized] = "A"
+         sp2[:phonetized] = "B"
+       end
+       sp1[:phonetized] = Taxamatch::Phonetizer.normalize_ending sp1[:phonetized]
+       sp2[:phonetized] = Taxamatch::Phonetizer.normalize_ending sp2[:phonetized]
+       match = false
+       ed = @dlm.distance(sp1[:normalized],
+                          sp2[:normalized], 1, 4) #TODO put block 4
+       return { "edit_distance" => ed,
+                "phonetic_match" => false,
+                "match" => false } if ed/min_length.to_f > 0.3334
+       return {"edit_distance" => ed,
+               "phonetic_match" => true,
+               "match" => true} if sp1[:phonetized] == sp2[:phonetized]
+
+       match = true if ed <= 4 &&
+         (min_length >= ed * 2) &&
+         (ed < 2 || sp1[:normalized][0] == sp2[:normalized][0]) &&
+         (ed < 4 || sp1[:normalized][0...3] == sp2[:normalized][0...3])
+       { "edit_distance" => ed, "match" => match, "phonetic_match" => false }
+     end
+
+     def match_authors(preparsed_1, preparsed_2)
+       p1 = { normalized_authors: [], years: [] }
+       p2 = { normalized_authors: [], years: [] }
+       if preparsed_1[:infraspecies] || preparsed_2[:infraspecies]
+         p1 = preparsed_1[:infraspecies].last if preparsed_1[:infraspecies]
+         p2 = preparsed_2[:infraspecies].last if preparsed_2[:infraspecies]
+       elsif preparsed_1[:species] || preparsed_2[:species]
+         p1 = preparsed_1[:species] if preparsed_1[:species]
+         p2 = preparsed_2[:species] if preparsed_2[:species]
+       elsif preparsed_1[:uninomial] && preparsed_2[:uninomial]
+         p1 = preparsed_1[:uninomial]
+         p2 = preparsed_2[:uninomial]
+       end
+       au1 = p1[:normalized_authors]
+       au2 = p2[:normalized_authors]
+       yr1 = p1[:years]
+       yr2 = p2[:years]
+       return true if au1.empty? || au2.empty?
+       score = Taxamatch::Authmatch.authmatch(au1, au2, yr1, yr2)
+       score == 0 ? false : true
+     end
+
+     def match_matches(genus_match, species_match, infraspecies_match = nil)
+       match = species_match
+       if infraspecies_match
+         match["edit_distance"] += infraspecies_match["edit_distance"]
+         match["match"] &&= infraspecies_match["match"]
+         match["phonetic_match"] &&= infraspecies_match["phonetic_match"]
+       end
+       match["edit_distance"] += genus_match["edit_distance"]
+       if match["edit_distance"] > (infraspecies_match ? 6 : 4)
+         match["match"] = false
+       end
+       match["match"] &&= genus_match["match"]
+       match["phonetic_match"] &&= genus_match["phonetic_match"]
+       match
+     end
+   end
+ end
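The threshold logic in `match_genera` can be tried in isolation. The sketch below is a simplified stand-in, not the gem's code: it uses a plain Levenshtein distance instead of the `damerau-levenshtein` gem (so transpositions cost two edits), skips the phonetic shortcut, takes plain strings rather than parsed hashes, and compares the first characters of the upcased strings. The names `edit_distance`, `genus_match?`, and `score` are hypothetical. The `score` helper shows one way a fractional score could be computed; in Ruby, `/` between two integers truncates and `Hash#merge` returns a new hash without changing the receiver, so float division is used here.

```ruby
# Plain Levenshtein distance (insert, delete, substitute), row by row.
def edit_distance(a, b)
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Simplified genus rule: reject when the distance exceeds 20% of the
# shorter name, then apply the length and first-letter conditions.
def genus_match?(genus1, genus2)
  g1 = genus1.upcase
  g2 = genus2.upcase
  ed = edit_distance(g1, g2)
  min_length = [g1.size, g2.size].min
  return false if ed / min_length.to_f > 0.2
  ed <= 3 && min_length > ed * 2 && (ed < 2 || g1[0] == g2[0])
end

# Fractional score; the float divisor avoids integer truncation.
def score(distance, total_length)
  1 - distance / (total_length / 2.0)
end

puts genus_match?("Monacanthus", "MONOCANTUS")
puts genus_match?("Aus", "Bus")
puts score(2, 40)
```

Short names fail the 20% check quickly: a single edit in a three-letter genus is already a third of its length.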
@@ -0,0 +1,7 @@
+ module Taxamatch
+   VERSION = "1.1.0"
+
+   def self.version
+     VERSION
+   end
+ end