casento 1.0.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 20c6b7fa068577af60901f17c7b8293aa335f989
+   data.tar.gz: ea2fc21a6de595bff9cb824ae187e1f29d9a9e09
+ SHA512:
+   metadata.gz: 723d6b373218bf84a4d849a5246f369d0ac083c8ac443891f0d3dafe23fef7d5f1480b44e266af78cdf51cada175760883e3c6383fd5e2225cc9f87e4e2b03de
+   data.tar.gz: 021e6a560cb3655272c3b9ee6f09647f87d8ae0f51b0ad367519ae4649127e25c5ae1c5f69f4b2eec57aaeb4a2dbd701874789ca6c22edfd42f818b0f3dce6b9
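The checksums above are hex digests of the two archives packed inside the `.gem` file. A minimal sketch of how they could be checked by hand follows, assuming the archives have already been unpacked into a directory and the YAML has been parsed into a nested hash. The helper name `checksum_mismatches` is hypothetical and is not part of RubyGems or of this gem.

```ruby
require "digest"

# Hypothetical helper: compare files in `dir` against a hash shaped like
# checksums.yaml ({ "SHA1" => { "metadata.gz" => "<hex>" }, ... }).
# Returns the [algorithm, filename] pairs whose digest does not match.
def checksum_mismatches(checksums, dir)
  digesters = { "SHA1" => Digest::SHA1, "SHA512" => Digest::SHA512 }
  mismatches = []
  checksums.each do |algorithm, files|
    digester = digesters.fetch(algorithm)
    files.each do |filename, expected|
      actual = digester.file(File.join(dir, filename)).hexdigest
      mismatches << [algorithm, filename] unless actual == expected
    end
  end
  mismatches
end
```

An empty return value means every listed file matched its recorded digest.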
data/.gitignore ADDED
@@ -0,0 +1,9 @@
+ /.bundle/
+ /.yardoc
+ /Gemfile.lock
+ /_yardoc/
+ /coverage/
+ /doc/
+ /pkg/
+ /spec/reports/
+ /tmp/
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in casento.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2016 Ken-ichi Ueda
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,36 @@
+ # Description
+ The California Academy of Sciences has databased a great deal of their
+ entomological specimen data, but it's generally only available through
+ their own [website](http://researcharchive.calacademy.org/research/entomology/EntInv/index.asp)
+ with no API and no machine-readable export functionality. This gem attempts to
+ ameliorate the situation by scraping data and presenting it in a
+ machine-readable format.
+
+ Since it is just a scraper, it is brittle, but it is still better than nothing.
+
+ # Installation
+
+ This is a Ruby gem, so you'll need [Ruby](https://www.ruby-lang.org) and [RubyGems](https://rubygems.org/) installed. Then:
+
+ `gem install casento`
+
+ or if you just want to build and install locally:
+
+ ```bash
+ git clone git@github.com:kueda/casento.git
+ cd casento
+ gem build casento.gemspec
+ gem install casento-x.x.x.gem
+ ```
+
+ # Examples
+
+ `casento help` should get you started, but here are some ways I use it:
+
+ ```bash
+ # List all records of Hemipenthes in California
+ casento checklist Hemipenthes --state California --country U.S.A.
+
+ # Export a checklist of all bee fly genera from California to CSV
+ casento checklist Bombyliidae --state California --country U.S.A. --rank genus -f csv > bombyliidae-genera-ca.csv
+ ```
data/Rakefile ADDED
@@ -0,0 +1,10 @@
+ require "bundler/gem_tasks"
+ require "rake/testtask"
+
+ Rake::TestTask.new(:test) do |t|
+   t.libs << "test"
+   t.libs << "lib"
+   t.test_files = FileList['test/**/*_test.rb', 'test/**/*_spec.rb']
+ end
+
+ task :default => :test
data/bin/casento ADDED
@@ -0,0 +1,120 @@
+ #!/usr/bin/env ruby
+ require 'commander/import'
+ require 'casento'
+
+ program :name, 'Casento'
+ program :version, Casento::VERSION
+ program :description, <<-EOT
+
+   Command-line tool for scraping data from
+   http://researcharchive.calacademy.org/research/entomology/EntInv. The only
+   option that isn't mostly self-explanatory is the taxon name. It will be
+   smart about matching names ending in "-dae" to families, names ending in
+   "-ptera" to orders, but otherwise it will assume a name in the "Genus
+   species subspecies" format.
+
+   One major caveat: the CAS website is super fussy about certain place names,
+   e.g. "U.S.A.", which must be the acronym, and must have all three periods.
+   Play with the underlying web page and check the URL parameters to resolve
+   problems like that.
+
+ EOT
+
+ global_option "--name NAME", String
+ global_option "--family FAMILY", String
+ global_option "--genus GENUS", String
+ global_option "--species SPECIES", String
+ global_option "-n", '--country COUNTRY', String, "Country / Nation, e.g. U.S.A."
+ global_option "-s", '--state STATE', String, "Full name of state-level political body, e.g. California"
+ global_option "-c", '--county COUNTY', String, "Full name of county-level political body, e.g. Alameda"
+ global_option "-f", '--format table|csv', String, "Output format"
+
+ def search(args, opts)
+   params = {}
+   if name = (opts.name || args[0])
+     if name =~ /dae$/
+       params[:family] = name
+     elsif name =~ /ptera$/
+       params[:order] = name
+     else
+       params[:genus], params[:species], params[:subspecies] = name.split
+     end
+   end
+   params[:family] ||= opts.family
+   params[:genus] ||= opts.genus
+   params[:species] ||= opts.species
+   params[:country] ||= opts.country
+   params[:state] ||= opts.state
+   params[:county] ||= opts.county
+   Casento.search(params)
+ end
+
+ command :occurrences do |c|
+   c.syntax = "casento occurrences [taxon name]"
+   c.description = "List occurrences matching the search parameters."
+   c.action do |args, opts|
+     occurrences = search(args, opts)
+     longest_name = occurrences.map(&:name).sort_by(&:size).last || ""
+     fields = %w(order family genus species country state county)
+     if opts.format == "csv"
+       puts fields.join(',')
+       occurrences.each do |o|
+         puts fields.map{|f| o.send(f)}.join(",")
+       end
+     else
+       puts "Found #{occurrences.size} occurrences:"
+       puts
+       puts fields.map{|f| f.ljust(longest_name.size) }.join(' ')
+       occurrences.each do |o|
+         puts fields.map{|f| o.send(f).to_s.ljust(longest_name.size)}.join(' ')
+       end
+     end
+   end
+ end
+
+ command :checklist do |c|
+   c.syntax = "casento checklist [taxon name]"
+   c.description = "Generate a checklist of unique taxa from occurrences matching the search parameters."
+   c.option "--rank RANK", String, "Generate a checklist for names at this rank"
+   c.action do |args, opts|
+     occurrences = search(args, opts)
+     longest_name = ""
+     unique_name = if opts.rank == "species" || opts.rank.blank?
+       "scientific_name"
+     else
+       opts.rank
+     end
+     uniques = occurrences.uniq{|o|
+       longest_name = o.send(unique_name).to_s if o.send(unique_name).to_s.size > longest_name.size
+       o.send(unique_name)
+     }.reject{|o|
+       o.send(unique_name).blank? || (opts.rank == "species" && o.species.blank?)
+     }.sort{|a,b|
+       a.send(unique_name) <=> b.send(unique_name)
+     }
+     fields = %w(order)
+     fields += case opts.rank
+     when "family"
+       %w(family)
+     when "genus"
+       %w(family genus)
+     when "species"
+       %w(family genus species)
+     else
+       %w(family genus species infraspecific_epithet)
+     end
+     if opts.format == "csv"
+       puts fields.join(',')
+       uniques.each do |o|
+         puts fields.map{|f| o.send(f)}.join(",")
+       end
+     else
+       puts "Found #{occurrences.size} occurrences of #{uniques.size} taxa:"
+       puts
+       puts fields.map{|f| f.ljust(longest_name.size)}.join(" ")
+       uniques.each do |o|
+         puts fields.map{|f| o.send(f).to_s.ljust(longest_name.size)}.join(" ")
+       end
+     end
+   end
+ end
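The taxon-name handling in `search` is the least obvious part of the CLI, so here is a minimal standalone sketch of the same suffix heuristic. The method name `params_for_name` is hypothetical (the gem keeps this logic inline), and the sketch leaves out the option fallbacks.

```ruby
# Hypothetical standalone version of the name-routing step in `search`.
# Names ending in "dae" are treated as families, names ending in "ptera"
# as orders, and anything else is split into genus / species / subspecies.
def params_for_name(name)
  params = {}
  return params if name.nil? || name.strip.empty?
  if name =~ /dae$/
    params[:family] = name
  elsif name =~ /ptera$/
    params[:order] = name
  else
    # Multiple assignment fills missing parts with nil.
    params[:genus], params[:species], params[:subspecies] = name.split
  end
  params
end
```

Because the match is purely on the suffix, a genus whose name happens to end in "ptera" would be sent as an order. Passing `--genus` explicitly avoids that.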
data/casento.gemspec ADDED
@@ -0,0 +1,35 @@
+ #encoding: utf-8
+ lib = File.expand_path('../lib', __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+ require 'casento/version'
+
+ Gem::Specification.new do |spec|
+   spec.name = "casento"
+   spec.version = Casento::VERSION
+   spec.authors = ["Ken-ichi Ueda"]
+   spec.email = ["kenichi.ueda@gmail.com"]
+
+   spec.summary = "Tool reading the Entomology General Collection Database at the California Academy of Sciences"
+   spec.description = %q{
+     The California Academy of Sciences has databased a great deal of their
+     entomological specimen data, but it's generally only available through
+     their own website with no API and no machine-readable export
+     functionality. This gem attempts to ameliorate the situation by scraping
+     data and presenting it in a machine-readable format.
+   }
+   spec.homepage = "https://github.com/kueda/casento"
+   spec.license = "MIT"
+
+   spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+   spec.bindir = "bin"
+   spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+   spec.require_paths = ["lib"]
+
+   spec.add_development_dependency "bundler", "~> 1.10"
+   spec.add_development_dependency "rake", "~> 10.0"
+   spec.add_development_dependency "minitest", "~> 5.8"
+   spec.add_development_dependency "m", "~> 1.4"
+   spec.add_runtime_dependency 'nokogiri', "~> 1.6"
+   spec.add_runtime_dependency 'activesupport', "~> 4.2"
+   spec.add_runtime_dependency 'commander', "~> 4.3"
+ end
data/lib/casento.rb ADDED
@@ -0,0 +1,78 @@
+ # encoding: utf-8
+ require "uri"
+ require "ostruct"
+ require "nokogiri"
+ require "open-uri"
+ require "active_support/core_ext/object/blank"
+ require "active_support/core_ext/hash"
+ require "active_support/inflector"
+ require "casento/version"
+ require "casento/occurrence"
+
+ module Casento
+   def self.search( opts = {} )
+     occurrences = []
+     page = 1
+     loop do
+       page_results = get_page( page, opts )
+       break if page_results.blank?
+       occurrences += page_results
+       page += 1
+     end
+     occurrences
+   end
+
+   def self.get_page( page, opts )
+     opts.symbolize_keys!
+     # puts "opts: #{opts.inspect}"
+     url = "http://researcharchive.calacademy.org/research/entomology/EntInv/index.asp?"
+     params = {
+       "Page" => page,
+       "xAction" => "Search"
+     }
+     params["Country"] = opts[:country] if opts[:country]
+     params["StateProv"] = opts[:state] if opts[:state]
+     params["County"] = opts[:county] if opts[:county]
+     params["Ord"] = opts[:order] if opts[:order]
+     %w(Family Genus Species Subspecies).each do |rank|
+       val = opts[rank.to_sym] || opts[rank.downcase.to_sym]
+       params[rank] = val if val
+     end
+     url = "#{url}#{URI.encode_www_form( params )}"
+     # puts "opening #{url}"
+     headers = %w(
+       order
+       family
+       genus
+       species
+       subspecies
+       country
+       state
+       county
+       url
+     )
+     occurrences = []
+     open( url ) do |response|
+       # puts "response: #{response}"
+       html = Nokogiri::HTML(response.read)
+       html.xpath("//tr[td[@class='tdata']]").each do |tr|
+         occ = Occurrence.new
+         tr.css("td").each_with_index do |td, i|
+           # puts "td: #{td}"
+           val = if headers[i] == "url"
+             td.at("a")[:href]
+           else
+             td.inner_text.to_s
+           end
+           val = val.gsub(/[[:space:]]+$/, "").strip
+           next if val.blank?
+           next if val =~ /not entered/
+           # puts "setting #{headers[i]} to #{val}"
+           occ.send("#{headers[i]}=", val)
+         end
+         occurrences << occ
+       end
+     end
+     occurrences
+   end
+ end
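`Casento.search` combines two ideas: building a query string from the search options, and requesting pages until one comes back empty. The sketch below separates them so they can be tried without network access. The names `page_url` and `collect_pages` and the `fetch_page` callable are hypothetical stand-ins; the real gem does both steps inside `get_page` and parses the HTML with Nokogiri. Only the country and state parameters are shown.

```ruby
require "uri"

# Build the query URL for one page of results (subset of the real parameters).
def page_url(page, opts = {})
  base = "http://researcharchive.calacademy.org/research/entomology/EntInv/index.asp?"
  params = { "Page" => page, "xAction" => "Search" }
  params["Country"] = opts[:country] if opts[:country]
  params["StateProv"] = opts[:state] if opts[:state]
  "#{base}#{URI.encode_www_form(params)}"
end

# Request successive pages until one is empty. `fetch_page` is a hypothetical
# callable that takes a URL and returns an array of rows (or nil).
def collect_pages(opts, fetch_page)
  results = []
  page = 1
  loop do
    rows = fetch_page.call(page_url(page, opts))
    break if rows.nil? || rows.empty?
    results.concat(rows)
    page += 1
  end
  results
end
```

`URI.encode_www_form` leaves periods unescaped, so a country of `"U.S.A."` reaches the site exactly as typed, which matters given how strict the site is about that spelling.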
data/lib/casento/occurrence.rb ADDED
@@ -0,0 +1,65 @@
+ #encoding: utf-8
+ module Casento
+   class Occurrence < OpenStruct
+     def kingdom
+       "Animalia"
+     end
+
+     def phylum
+       "Arthropoda"
+     end
+
+     def dwc_class
+       "Insecta"
+     end
+
+     def specificEpithet
+       species
+     end
+
+     def infraspecificEpithet
+       subspecies
+     end
+
+     def scientificName
+       name
+     end
+
+     def state
+       super.to_s.gsub(/\(state of\)/i, "").strip
+     end
+
+     def stateProvince
+       state
+     end
+
+     %w(
+       kingdom
+       phylum
+       order
+       family
+       genus
+       specificEpithet
+       infraspecificEpithet
+       scientificName
+       stateProvince
+     ).each do |m|
+       define_method "dwc_#{m}" do
+         send(m)
+       end
+       if m.to_s.underscore != m.to_s
+         define_method m.to_s.underscore do
+           send(m)
+         end
+       end
+     end
+
+     def name
+       [genus, species, subspecies].compact.join(" ").strip
+     end
+
+     def species
+       super =~ /spp/ ? nil : super
+     end
+   end
+ end
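The `define_method` loop in `Occurrence` gives each Darwin Core style accessor two extra spellings: a `dwc_` prefixed one and, for camelCase names, a snake_case one. The sketch below reproduces that pattern in isolation. It is a simplification under stated assumptions: `SketchOccurrence` is a hypothetical plain class rather than an `OpenStruct` subclass, and a small regex replaces ActiveSupport's `underscore`.

```ruby
# Hypothetical plain-Ruby illustration of the alias-generation pattern.
class SketchOccurrence
  attr_accessor :genus, :species, :subspecies

  def initialize(genus: nil, species: nil, subspecies: nil)
    @genus = genus
    @species = species
    @subspecies = subspecies
  end

  def specificEpithet
    species
  end

  def scientificName
    [genus, species, subspecies].compact.join(" ")
  end

  %w(genus specificEpithet scientificName).each do |m|
    # dwc_-prefixed alias, e.g. dwc_scientificName
    define_method("dwc_#{m}") { send(m) }
    # snake_case alias for camelCase names, e.g. scientific_name
    # (simplified stand-in for ActiveSupport's String#underscore)
    snake = m.gsub(/([a-z])([A-Z])/) { "#{$1}_#{$2.downcase}" }
    define_method(snake) { send(m) } unless snake == m
  end
end
```

The CLI relies on the snake_case spellings: `checklist` calls `scientific_name` and `infraspecific_epithet` through `send`.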
data/lib/casento/version.rb ADDED
@@ -0,0 +1,4 @@
+ #encoding: utf-8
+ module Casento
+   VERSION = "1.0.0"
+ end
metadata ADDED
@@ -0,0 +1,159 @@
+ --- !ruby/object:Gem::Specification
+ name: casento
+ version: !ruby/object:Gem::Version
+   version: 1.0.0
+ platform: ruby
+ authors:
+ - Ken-ichi Ueda
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2016-05-08 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.10'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.10'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '10.0'
+ - !ruby/object:Gem::Dependency
+   name: minitest
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.8'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '5.8'
+ - !ruby/object:Gem::Dependency
+   name: m
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.4'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.4'
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.6'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.6'
+ - !ruby/object:Gem::Dependency
+   name: activesupport
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '4.2'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '4.2'
+ - !ruby/object:Gem::Dependency
+   name: commander
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '4.3'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '4.3'
+ description: "\n The California Academy of Sciences has databased a great deal
+   of their\n entomological specimen data, but it's generally only available through\n
+   \ their own website with no API and no machine-readable export\n functionality.
+   This gem attempts to ameliorate the situation by scraping\n data and presenting
+   it in a machine-readable format.\n "
+ email:
+ - kenichi.ueda@gmail.com
+ executables:
+ - casento
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".gitignore"
+ - Gemfile
+ - LICENSE
+ - README.md
+ - Rakefile
+ - bin/casento
+ - casento.gemspec
+ - lib/casento.rb
+ - lib/casento/occurrence.rb
+ - lib/casento/version.rb
+ homepage: https://github.com/kueda/casento
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.4.6
+ signing_key:
+ specification_version: 4
+ summary: Tool reading the Entomology General Collection Database at the California
+   Academy of Sciences
+ test_files: []
+ has_rdoc: