RubyGems - harvestdor-indexer - Versions diffs - 0.0.3 - Mend

harvestdor-indexer 0.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15)

data/.gitignore +25 -0
data/.yardopts +3 -0
data/Gemfile +5 -0
data/LICENSE.txt +5 -0
data/README.rdoc +113 -0
data/Rakefile +56 -0
data/harvestdor-indexer.gemspec +43 -0
data/lib/harvestdor-indexer.rb +213 -0
data/lib/harvestdor-indexer/version.rb +6 -0
data/spec/config/ap.yml +61 -0
data/spec/config/ap_blacklist.txt +5 -0
data/spec/config/ap_whitelist.txt +5 -0
data/spec/spec_helper.rb +21 -0
data/spec/unit/harvestdor-indexer_spec.rb +327 -0
metadata +233 -0

data/.gitignore ADDED

@@ -0,0 +1,25 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+.travis
+.rvmrc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+spec/test_logs
+test/tmp
+test/version_tmp
+tmp
+logs
+.DS_Store
+*.tmproj
+tmtags
+.idea/*

data/.yardopts ADDED

@@ -0,0 +1,3 @@
+--title 'Harvestdor-Indexer Gem Documentation'
+lib/**/*.rb -
+README.rdoc LICENSE.txt

data/Gemfile ADDED

@@ -0,0 +1,5 @@
+source 'https://rubygems.org'
+source "http://sul-gems.stanford.edu"
+# See harvestdor-indexer.gemspec for this gem's dependencies
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,5 @@
+Copyright (c) 20XX-2012.  The Board of Trustees of the Leland Stanford Junior University. All rights reserved.
+Redistribution and use of this distribution in source and binary forms, with or without modification, are permitted provided that: The above copyright notice and this permission notice appear in all copies and supporting documentation; The name, identifiers, and trademarks of The Board of Trustees of the Leland Stanford Junior University are not used in advertising or publicity without the express prior written permission of The Board of Trustees of the Leland Stanford Junior University; Recipients acknowledge that this distribution is made available as a research courtesy, "as is", potentially with defects, without any obligation on the part of The Board of Trustees of the Leland Stanford Junior University to provide support, services, or repair;
+THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, WITH REGARD TO THIS SOFTWARE, INCLUDING WITHOUT LIMITATION ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, AND IN NO EVENT SHALL THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, TORT (INCLUDING NEGLIGENCE) OR STRICT LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

data/README.rdoc ADDED

@@ -0,0 +1,113 @@
+= Harvestdor::Indexer
+A Gem to harvest meta/data from DOR and the skeleton code to index it and write to Solr.
+== Installation
+Add this line to your application's Gemfile:
+    gem 'harvestdor-indexer'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install harvestdor-indexer
+== Usage
+You must override the index method and provide configuration options.  It is recommended to write a script to run it, too - example below.
+=== Configuration / Set up
+Create a yml config file for your collection going to a Solr index.
+See  spec/config/ap.yml for an example.
+You will want to copy that file and change the following settings:
+1. log_name
+2. default_set (in OAI harvesting params section)
+2a. other OAI harvesting params
+3. blacklist or whitelist if you are using them
+You can also pass in non-default configurations as a hash, after the path to the yml file:
+  indexer = Harvestdor::Indexer.new(config_yml_path, {:oai_repository_url => 'http://my_oai.org', :default_from_date => '2012-12-01'})
+=== Override the Harvestdor::Indexer.index method
+In your code, override this method from the Harvestdor::Indexer class
+# create Solr doc for the druid and add it to Solr, unless it is on the blacklist.
+#  NOTE: don't forget to send commit to Solr, either once at end (already in harvest_and_index), or for each add, or ...
+def index druid
+  if blacklist.include?(druid)
+    logger.info("Druid #{druid} is on the blacklist and will have no Solr doc created")
+  else
+    logger.error("You must override the index method to transform druids into Solr docs and add them to Solr")
+    doc_hash = {}
+    doc_hash[:id] = druid
+    # doc_hash[:title_tsim] = smods_rec(druid).short_title
+    # you might add things from Indexer level class here
+    #  (e.g. things that are the same across all documents in the harvest)
+    solr_client.add(doc_hash)
+    # logger.info("Just created Solr doc for #{druid}")
+    # TODO: provide call to code to update DOR object's workflow datastream??
+  end
+end
+=== Run it
+(bundle install)
+I suggest you write a script to run the code.  Your script might look like this:
+	#!/usr/bin/env ruby
+	$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..'))
+	$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
+	require 'rubygems'
+	begin
+	  require 'your_indexer'
+	rescue LoadError
+	  require 'bundler/setup'
+	  require 'your_indexer'
+	end
+	config_yml_path = ARGV.pop
+	if config_yml_path.nil?
+	  puts "** You must provide the full path to a config yml file **"
+	  exit
+	end
+	indexer = Harvestdor::Indexer.new(config_yml_path)
+	indexer.harvest_and_index
+Then you run the script like so:
+	 ./bin/indexer config/(your coll).yml
+I suggest you run your code on harvestdor-dev, as it is already set up to be able to harvest from the DOR OAI provider
+== Contributing
+# Fork it
+# Create your feature branch (`git checkout -b my-new-feature`)
+# Write code and tests.
+# Commit your changes (`git commit -am 'Added some feature'`)
+# Push to the branch (`git push origin my-new-feature`)
+# Create new Pull Request
+== Releases
+* <b>0.0.3</b> add methods for public_xml, content_metadata, identity_metadata ...
+* <b>0.0.2</b> better model code for index method (thanks, Bess!)
+* <b>0.0.1</b> initial commit
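
The README's "Override the Harvestdor::Indexer.index method" section can be sketched roughly as follows. This is a minimal illustration and not the gem's own code: `InMemorySolr` and `StandInIndexer` are hypothetical stand-ins so the example runs without the gem or a Solr server, and the `:collection_ssim` field is an invented Solr field name. In a real project the subclass would presumably inherit from `Harvestdor::Indexer` instead.

```ruby
# Hypothetical stand-in for the RSolr client, so the sketch runs without Solr.
class InMemorySolr
  attr_reader :docs

  def initialize
    @docs = []
  end

  # Record the document instead of sending it to a Solr server.
  def add(doc_hash)
    @docs << doc_hash
  end

  def commit
    true
  end
end

# Hypothetical stand-in for Harvestdor::Indexer, exposing only what the
# overridden index method touches (solr_client and blacklist).
class StandInIndexer
  attr_reader :solr_client, :blacklist

  def initialize(blacklist = [])
    @solr_client = InMemorySolr.new
    @blacklist = blacklist
  end
end

# In a real project this would be: class MyIndexer < Harvestdor::Indexer
class MyIndexer < StandInIndexer
  # Build a Solr doc for the druid and add it, unless it is blacklisted.
  def index(druid)
    return if blacklist.include?(druid)
    # :collection_ssim is an invented field, used here only for illustration.
    doc_hash = { :id => druid, :collection_ssim => 'my_collection' }
    solr_client.add(doc_hash)
  end
end

indexer = MyIndexer.new(['oo111oo1111'])
%w[oo000oo0000 oo111oo1111].each { |druid| indexer.index(druid) }
indexer.solr_client.commit
```

Keeping the blacklist check inside `index` mirrors the base method shown in the README, so a subclass that replaces `index` wholesale still needs to repeat that check.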

data/Rakefile ADDED

@@ -0,0 +1,56 @@
+require "bundler/gem_tasks"
+require 'rake'
+require 'bundler'
+require 'rspec/core/rake_task'
+require 'yard'
+require 'yard/rake/yardoc_task'
+require 'dlss/rake/dlss_release'
+Dlss::Release.new
+begin
+  Bundler.setup(:default, :development)
+rescue Bundler::BundlerError => e
+  $stderr.puts e.message
+  $stderr.puts "Run `bundle install` to install missing gems"
+  exit e.status_code
+end
+desc "DO NOT USE! use dlss_release"
+task :release
+task :default => :ci
+desc "run continuous integration suite (tests, coverage, docs)"
+task :ci => [:rspec, :doc]
+task :spec => :rspec
+desc "run specs EXCEPT integration specs"
+RSpec::Core::RakeTask.new(:spec_fast) do |spec|
+  spec.rspec_opts = ["-c", "-f progress", "--tty", "-t ~integration", "-r ./spec/spec_helper.rb"]
+end
+RSpec::Core::RakeTask.new(:rspec) do |spec|
+  spec.rspec_opts = ["-c", "-f progress", "--tty", "-r ./spec/spec_helper.rb"]
+end
+# Use yard to build docs
+begin
+  project_root = File.expand_path(File.dirname(__FILE__))
+  doc_dest_dir = File.join(project_root, 'doc')
+  YARD::Rake::YardocTask.new(:doc) do |yt|
+    yt.files = Dir.glob(File.join(project_root, 'lib', '**', '*.rb')) +
+                 [ File.join(project_root, 'README.rdoc') ]
+    yt.options = ['--output-dir', doc_dest_dir, '--readme', 'README.rdoc', '--title', 'Harvestdor Gem Documentation']
+  end
+rescue LoadError
+  desc "Generate YARD Documentation"
+  task :doc do
+    abort "Please install the YARD gem to generate rdoc."
+  end
+end

data/harvestdor-indexer.gemspec ADDED

@@ -0,0 +1,43 @@
+# -*- encoding: utf-8 -*-
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'harvestdor-indexer/version'
+Gem::Specification.new do |gem|
+  gem.name          = "harvestdor-indexer"
+  gem.version       = Harvestdor::Indexer::VERSION
+  gem.authors       = ["Naomi Dushay"]
+  gem.email         = ["ndushay@stanford.edu"]
+  gem.description   = %q{Harvest DOR object metadata via a relationship (e.g. hydra:isGovernedBy rdf:resource="info:fedora/druid:hy787xj5878") and dates, plus code framework to write Solr docs to index}
+  gem.summary       = %q{Harvest DOR object metadata and index it to Solr}
+  gem.homepage      = "https://consul.stanford.edu/display/chimera/Chimera+project"
+  gem.files         = `git ls-files`.split($/)
+  gem.executables   = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
+  gem.test_files    = gem.files.grep(%r{^spec/})
+  gem.require_paths = ["lib"]
+  gem.add_dependency 'rsolr'
+  # sul-gems
+  gem.add_dependency 'harvestdor'
+  gem.add_dependency 'stanford-mods'
+  # Runtime dependencies
+  # gem.add_runtime_dependency 'nokogiri'
+  # Development dependencies
+  # Bundler will install these gems too if you've checked out harvestdor-indexer source from git and run 'bundle install'
+  # It will not add these as dependencies if you require harvestdor-indexer for other projects
+  gem.add_development_dependency "lyberteam-gems-devel", ">= 1.0"
+  gem.add_development_dependency "rake"
+  # docs
+  gem.add_development_dependency "rdoc"
+  gem.add_development_dependency "yard"
+  # tests
+  gem.add_development_dependency 'rspec'
+  gem.add_development_dependency 'simplecov'
+  gem.add_development_dependency 'simplecov-rcov'
+  # gem.add_development_dependency 'ruby-debug19'
+end

data/lib/harvestdor-indexer.rb ADDED

@@ -0,0 +1,213 @@
+# external gems
+require 'confstruct'
+require 'rsolr'
+# sul-dlss gems
+require 'harvestdor'
+require 'stanford-mods'
+# stdlib
+require 'logger'
+require "harvestdor-indexer/version"
+module Harvestdor
+  # Base class to harvest from DOR via harvestdor gem and then index
+  class Indexer
+    def initialize yml_path, options = {}
+      @yml_path = yml_path
+      config.configure(YAML.load_file(yml_path)) if yml_path
+      config.configure options
+      yield(config) if block_given?
+    end
+    def config
+      @config ||= Confstruct::Configuration.new()
+    end
+    def logger
+      @logger ||= load_logger(config.log_dir, config.log_name)
+    end
+    # per this Indexer's config options
+    #  harvest the druids via OAI
+    #   create a Solr profiling document for each druid
+    #   write the result to the Solr index
+    def harvest_and_index
+      if whitelist.empty?
+        druids.each { |druid| index druid }
+      else
+        whitelist.each { |druid| index druid }
+      end
+      solr_client.commit
+      logger.info("Finished processing: final Solr commit returned.")
+    end
+    # return Array of druids contained in the OAI harvest indicated by OAI params in yml configuration file
+    # @return [Array<String>] or enumeration over it, if block is given.  (strings are druids, e.g. ab123cd1234)
+    def druids
+      @druids ||= harvestdor_client.druids_via_oai
+    end
+    # create Solr doc for the druid and add it to Solr, unless it is on the blacklist.
+    #  NOTE: don't forget to send commit to Solr, either once at end (already in harvest_and_index), or for each add, or ...
+    def index druid
+      if blacklist.include?(druid)
+        logger.info("Druid #{druid} is on the blacklist and will have no Solr doc created")
+      else
+        logger.fatal("You must override the index method to transform druids into Solr docs and add them to Solr")
+        begin
+          #logger.debug "About to index #{druid}"
+          doc_hash = {}
+          doc_hash[:id] = druid
+          # doc_hash[:title_tsim] = smods_rec(druid).short_title
+          # you might add things from Indexer level class here
+          #  (e.g. things that are the same across all documents in the harvest)
+          solr_client.add(doc_hash)
+          # logger.debug("Just created Solr doc for #{druid}")
+          # TODO: provide call to code to update DOR object's workflow datastream??
+        rescue => e
+          logger.error "Failed to index #{druid}: #{e.message}"
+        end
+      end
+    end
+    # return the MODS for the druid as a Stanford::Mods::Record object
+    # @param [String] druid e.g. ab123cd4567
+    # @return [Stanford::Mods::Record] created from the MODS xml for the druid
+    def smods_rec druid
+      ng_doc = harvestdor_client.mods druid
+      raise "Empty MODS metadata for #{druid}: #{ng_doc.to_xml}" if ng_doc.root.xpath('//text()').empty?
+      mods_rec = Stanford::Mods::Record.new
+      mods_rec.from_nk_node(ng_doc.root)
+      mods_rec
+    end
+    # the public xml for this DOR object, from the purl page
+    # @param [String] druid e.g. ab123cd4567
+    # @return [Nokogiri::XML::Document] the public xml for the DOR object
+    def public_xml druid
+      ng_doc = harvestdor_client.public_xml druid
+      raise "No public xml for #{druid}" if !ng_doc
+      raise "Empty public xml for #{druid}: #{ng_doc.to_xml}" if ng_doc.root.xpath('//text()').empty?
+      ng_doc
+    end
+    # the contentMetadata for this DOR object, from the purl public xml
+    # @param [String] druid e.g. ab123cd4567
+    # @return [Nokogiri::XML::Document] the contentMetadata for the DOR object
+    def content_metadata druid
+      ng_doc = harvestdor_client.content_metadata druid
+      raise "No contentMetadata for #{druid}" if !ng_doc || !ng_doc.root
+      ng_doc
+    end
+    # the identityMetadata for this DOR object, from the purl public xml
+    # @param [String] druid e.g. ab123cd4567
+    # @return [Nokogiri::XML::Document] the identityMetadata for the DOR object
+    def identity_metadata druid
+      ng_doc = harvestdor_client.identity_metadata druid
+      raise "No identityMetadata for #{druid}" if !ng_doc || !ng_doc.root
+      ng_doc
+    end
+    # the rightsMetadata for this DOR object, from the purl public xml
+    # @param [String] druid e.g. ab123cd4567
+    # @return [Nokogiri::XML::Document] the rightsMetadata for the DOR object
+    def rights_metadata druid
+      ng_doc = harvestdor_client.rights_metadata druid
+      raise "No rightsMetadata for #{druid}" if !ng_doc || !ng_doc.root
+      ng_doc
+    end
+    # the RDF for this DOR object, from the purl public xml
+    # @param [String] druid e.g. ab123cd4567
+    # @return [Nokogiri::XML::Document] the RDF for the DOR object
+    def rdf druid
+      ng_doc = harvestdor_client.rdf druid
+      raise "No RDF for #{druid}" if !ng_doc || !ng_doc.root
+      ng_doc
+    end
+    def solr_client
+      @solr_client ||= RSolr.connect(config.solr.to_hash)
+    end
+    # @return an Array of druids ('oo000oo0000') that should NOT be processed
+    def blacklist
+      # avoid trying to load the file multiple times
+      if !@blacklist && !@loaded_blacklist
+        @blacklist = load_blacklist(config.blacklist) if config.blacklist
+      end
+      @blacklist ||= []
+    end
+    # @return an Array of druids ('oo000oo0000') that should be processed
+    def whitelist
+      # avoid trying to load the file multiple times
+      if !@whitelist && !@loaded_whitelist
+        @whitelist = load_whitelist(config.whitelist) if config.whitelist
+      end
+      @whitelist ||= []
+    end
+    protected #---------------------------------------------------------------------
+    def harvestdor_client
+      @harvestdor_client ||= Harvestdor::Client.new({:config_yml_path => @yml_path})
+    end
+    # populate @blacklist as an Array of druids ('oo000oo0000') that will NOT be processed
+    #  by reading the File at the indicated path
+    # @param [String] path - path of file containing a list of druids
+    def load_blacklist path
+      if path && !@loaded_blacklist
+        @loaded_blacklist = true
+        @blacklist = load_id_list path
+      end
+    end
+    # populate @whitelist as an Array of druids ('oo000oo0000') that WILL be processed
+    #  (unless a druid is also on the blacklist)
+    #  by reading the File at the indicated path
+    # @param [String] path - path of file containing a list of druids
+    def load_whitelist path
+      if path && !@loaded_whitelist
+        @loaded_whitelist = true
+        @whitelist = load_id_list path
+      end
+    end
+    # return an Array of druids ('oo000oo0000')
+    #   populated by reading the File at the indicated path
+    # @param [String] path - path of file containing a list of druids
+    # @return [Array<String>] an Array of druids
+    def load_id_list path
+      if path
+        list = []
+        File.foreach(path) { |line|
+          list << line.gsub(/\s+/, '') if !line.gsub(/\s+/, '').empty? && !line.strip.start_with?('#')
+        }
+        list
+      end
+    rescue
+      msg = "Unable to find list of druids at " + path
+      logger.fatal msg
+      raise msg
+    end
+    # Global, memoized, lazy initialized instance of a logger
+    # @param [String] log_dir directory in which to write the log file
+    # @param [String] log_name name of log file
+    def load_logger(log_dir, log_name)
+      Dir.mkdir(log_dir) unless File.directory?(log_dir)
+      @logger ||= Logger.new(File.join(log_dir, log_name), 'daily')
+    end
+  end # Indexer class
+end # Harvestdor module
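
The filtering done by `load_id_list` above (drop blank lines and `#` comment lines, strip all whitespace from the rest) can be tried in isolation. The following is a sketch only: `read_druid_list` is a hypothetical helper name, not a method of the gem, and a `Tempfile` stands in for a real blacklist or whitelist file.

```ruby
require 'tempfile'

# Hypothetical helper, not part of the gem: returns the druids listed in the
# file at path, skipping blank lines and lines that begin with '#'.
def read_druid_list(path)
  File.foreach(path).each_with_object([]) do |line, list|
    druid = line.gsub(/\s+/, '')          # remove all whitespace, as load_id_list does
    next if druid.empty? || druid.start_with?('#')
    list << druid
  end
end

# Write a small sample list, modelled on spec/config/ap_blacklist.txt.
list_file = Tempfile.new('druids')
list_file.write("# druids should be in the form aa111bb2222\n\noo111oo1111\n  oo222oo2222  \n")
list_file.close

puts read_druid_list(list_file.path).inspect
```

`File.foreach` is used here because it closes the file once iteration finishes.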

data/lib/harvestdor-indexer/version.rb ADDED

@@ -0,0 +1,6 @@
+module Harvestdor
+  class Indexer
+    # this is the Ruby Gem version
+    VERSION = "0.0.3"
+  end
+end

data/spec/config/ap.yml ADDED

@@ -0,0 +1,61 @@
+# You will want to copy this file and change the following settings:
+# 1. log_dir, log_name
+# 2. default_set (in OAI harvesting params section)
+#  2a. other OAI harvesting params
+# 3. blacklist or whitelist if you are using them
+# 4. Solr baseurl
+# log_dir:  directory for log file  (default logs, relative to harvestdor gem path)
+log_dir: spec/test_logs
+# log_name: name of log file  (default: harvestdor.log)
+log_name: ap-test.log
+# purl: url for the DOR purl server (used to get ContentMetadata, etc.)
+purl: http://purl.stanford.edu
+# ---------- White and Black list parameters -----
+# name of file containing druids that will NOT be processed even if they are harvested via OAI
+#  either give absolute path or path relative to where the command will be executed
+#blacklist: config/ap_blacklist.txt
+# name of file containing druids that WILL be processed (all others will be ignored)
+#  either give absolute path or path relative to where the command will be executed
+#whitelist: config/ap_whitelist.txt
+# ----------- SOLR index (that we're writing INTO) parameters ------------
+solr:
+  url: https://sul-solr-test.stanford.edu/solr/mods_profiler
+#  url: http://localhost:8080/solr/mods_profiler
+  # timeouts are in seconds;  read_timeout -> open/read, open_timeout -> connection open
+  read_timeout: 60
+  open_timeout: 60
+# ---------- OAI harvesting parameters -----------
+# oai_repository_url:  URL of the OAI data provider
+oai_repository_url: https://dor-oaiprovider-prod.stanford.edu/oai
+# default_set:  default set for harvest  (default: nil)
+#   can be overridden on calls to harvest_ids and harvest_records
+default_set: is_governed_by_hy787xj5878
+# default_metadata_prefix:  default metadata prefix to be used for harvesting  (default: mods)
+#   can be overridden on calls to harvest_ids and harvest_records
+# default_from_date:  default from date for harvest  (default: nil)
+#   can be overridden on calls to harvest_ids and harvest_records
+# default_until_date:  default until date for harvest  (default: nil)
+#   can be overridden on calls to harvest_ids and harvest_records
+# oai_client_debug:  true for OAI::Client debug mode  (default: false)
+# Additional options to pass to Faraday http client (https://github.com/technoweenie/faraday)
+http_options:
+  ssl:
+    verify: false
+  # timeouts are in seconds;  timeout -> open/read, open_timeout -> connection open
+  timeout: 180
+  open_timeout: 180
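
The indexer's constructor applies the yml settings first and the options hash second, so a value passed in options should take precedence over the same key in a file like this one. The sketch below approximates that ordering with a plain Hash and `Hash#merge`. It is an assumption-laden illustration: the gem actually uses Confstruct, whose merge behaviour (for example for nested sections such as `solr`) may differ, and `effective_config` is a hypothetical helper name.

```ruby
require 'yaml'

# Hypothetical helper: load settings from YAML text, then let the options
# hash override them. Only top-level keys are symbolized in this sketch.
def effective_config(yaml_text, options = {})
  from_file = YAML.safe_load(yaml_text).each_with_object({}) do |(key, value), hash|
    hash[key.to_sym] = value
  end
  from_file.merge(options)
end

yaml_text = <<~YML
  log_name: ap-test.log
  default_set: is_governed_by_hy787xj5878
YML

config = effective_config(yaml_text, :default_from_date => '2012-12-01',
                                     :log_name => 'my-collection.log')
puts config.inspect
```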

data/spec/config/ap_blacklist.txt ADDED

@@ -0,0 +1,5 @@
+# blacklist containing druids that should NOT be processed.
+# druids should be in the form aa111bb2222
+oo111oo1111
+oo222oo2222

data/spec/config/ap_whitelist.txt ADDED

@@ -0,0 +1,5 @@
+# whitelist containing the specific druids to be processed (all others will be ignored)
+# druids should be in the form aa111bb2222
+oo000oo0000
+oo222oo2222
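
Taken together with `harvest_and_index` and `index` in lib/harvestdor-indexer.rb, the two list files interact as follows: a non-empty whitelist replaces the OAI-harvested druids, and the blacklist is then applied to whichever set was chosen. The sketch below restates that rule as one pure function. `druids_to_index` is a hypothetical name, and unlike the gem it takes the harvested druids as an argument rather than skipping the OAI call when a whitelist is present.

```ruby
# Hypothetical helper, not part of the gem: decide which druids get a Solr doc.
# A non-empty whitelist wins over the harvested list; blacklisted druids are
# always dropped.
def druids_to_index(harvested, whitelist, blacklist)
  candidates = whitelist.empty? ? harvested : whitelist
  candidates.reject { |druid| blacklist.include?(druid) }
end

blacklist = %w[oo111oo1111 oo222oo2222]   # as in ap_blacklist.txt
whitelist = %w[oo000oo0000 oo222oo2222]   # as in ap_whitelist.txt
harvested = %w[oo000oo0000 oo111oo1111 oo222oo2222 oo333oo3333]

puts druids_to_index(harvested, [], blacklist).inspect
puts druids_to_index(harvested, whitelist, blacklist).inspect
```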

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,21 @@
+# for test coverage
+require 'simplecov'
+require 'simplecov-rcov'
+class SimpleCov::Formatter::MergedFormatter
+  def format(result)
+     SimpleCov::Formatter::HTMLFormatter.new.format(result)
+     SimpleCov::Formatter::RcovFormatter.new.format(result)
+  end
+end
+SimpleCov.formatter = SimpleCov::Formatter::MergedFormatter
+SimpleCov.start do
+  add_filter "/spec/"
+end
+$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
+$LOAD_PATH.unshift(File.dirname(__FILE__))
+require 'harvestdor-indexer'
+#RSpec.configure do |config|
+#end

data/spec/unit/harvestdor-indexer_spec.rb ADDED

@@ -0,0 +1,327 @@
+require 'spec_helper'
+describe Harvestdor::Indexer do
+  before(:all) do
+    @config_yml_path = File.join(File.dirname(__FILE__), "..", "config", "ap.yml")
+    @indexer = Harvestdor::Indexer.new(@config_yml_path)
+    require 'yaml'
+    @yaml = YAML.load_file(@config_yml_path)
+    @hdor_client = @indexer.send(:harvestdor_client)
+    @fake_druid = 'oo000oo0000'
+    @blacklist_path = File.join(File.dirname(__FILE__), "../config/ap_blacklist.txt")
+    @whitelist_path = File.join(File.dirname(__FILE__), "../config/ap_whitelist.txt")
+  end
+  describe "logging" do
+    it "should write the log file to the directory indicated by log_dir" do
+      @indexer.logger.info("indexer_spec logging test message")
+      File.exist?(File.join(@yaml['log_dir'], @yaml['log_name'])).should == true
+    end
+  end
+  it "should initialize the harvestdor_client from the config" do
+    @hdor_client.should be_an_instance_of(Harvestdor::Client)
+    @hdor_client.config.default_set.should == @yaml['default_set']
+  end
+  context "harvest_and_index" do
+    before(:all) do
+      @doc_hash = {
+        :id => @fake_druid
+      }
+    end
+    it "should call druids_via_oai and then call :add on rsolr connection" do
+      @hdor_client.should_receive(:druids_via_oai).and_return([@fake_druid])
+      @indexer.solr_client.should_receive(:add).with(@doc_hash)
+      @indexer.solr_client.should_receive(:commit)
+      @indexer.harvest_and_index
+    end
+    it "should not process druids in blacklist" do
+      indexer = Harvestdor::Indexer.new(@config_yml_path, {:blacklist => @blacklist_path})
+      hdor_client = indexer.send(:harvestdor_client)
+      hdor_client.should_receive(:druids_via_oai).and_return(['oo000oo0000', 'oo111oo1111', 'oo222oo2222', 'oo333oo3333'])
+      indexer.solr_client.should_receive(:add).with(hash_including({:id => 'oo000oo0000'}))
+      indexer.solr_client.should_not_receive(:add).with(hash_including({:id => 'oo111oo1111'}))
+      indexer.solr_client.should_not_receive(:add).with(hash_including({:id => 'oo222oo2222'}))
+      indexer.solr_client.should_receive(:add).with(hash_including({:id => 'oo333oo3333'}))
+      indexer.solr_client.should_receive(:commit)
+      indexer.harvest_and_index
+    end
+    it "should only process druids in whitelist if it exists" do
+      indexer = Harvestdor::Indexer.new(@config_yml_path, {:whitelist => @whitelist_path})
+      hdor_client = indexer.send(:harvestdor_client)
+      hdor_client.should_not_receive(:druids_via_oai)
+      indexer.solr_client.should_receive(:add).with(hash_including({:id => 'oo000oo0000'}))
+      indexer.solr_client.should_receive(:add).with(hash_including({:id => 'oo222oo2222'}))
+      indexer.solr_client.should_receive(:commit)
+      indexer.harvest_and_index
+    end
+    it "should not process druid if it is in both blacklist and whitelist" do
+      indexer = Harvestdor::Indexer.new(@config_yml_path, {:blacklist => @blacklist_path, :whitelist => @whitelist_path})
+      hdor_client = indexer.send(:harvestdor_client)
+      hdor_client.should_not_receive(:druids_via_oai)
+      indexer.solr_client.should_receive(:add).with(hash_including({:id => 'oo000oo0000'}))
+      indexer.solr_client.should_receive(:commit)
+      indexer.harvest_and_index
+    end
+    it "should only call :commit on rsolr connection once" do
+      indexer = Harvestdor::Indexer.new(@config_yml_path)
+      hdor_client = indexer.send(:harvestdor_client)
+      hdor_client.should_receive(:druids_via_oai).and_return(['1', '2', '3'])
+      indexer.solr_client.should_receive(:add).exactly(3).times
+      indexer.solr_client.should_receive(:commit).once
+      indexer.harvest_and_index
+    end
+  end
+  it "druids method should call druids_via_oai method on harvestdor_client" do
+    @hdor_client.should_receive(:druids_via_oai)
+    @indexer.druids
+  end
+  context "smods_rec method" do
+    before(:all) do
+      @fake_druid = 'oo000oo0000'
+      @ns_decl = "xmlns='#{Mods::MODS_NS}'"
+      @mods_xml = "<mods #{@ns_decl}><note>hi</note></mods>"
+      @ng_mods_xml = Nokogiri::XML(@mods_xml)
+    end
+    it "should call mods method on harvestdor_client" do
+      @hdor_client.should_receive(:mods).with(@fake_druid).and_return(@ng_mods_xml)
+      @indexer.smods_rec(@fake_druid)
+    end
+    it "should return Stanford::Mods::Record object" do
+      @hdor_client.should_receive(:mods).with(@fake_druid).and_return(@ng_mods_xml)
+      @indexer.smods_rec(@fake_druid).should be_an_instance_of(Stanford::Mods::Record)
+    end
+    it "should raise exception if MODS xml for the druid is empty" do
+      @hdor_client.stub(:mods).with(@fake_druid).and_return(Nokogiri::XML("<mods #{@ns_decl}/>"))
+      expect { @indexer.smods_rec(@fake_druid) }.to raise_error(RuntimeError, Regexp.new("^Empty MODS metadata for #{@fake_druid}: <"))
+    end
+    it "should raise exception if there is no MODS xml for the druid" do
+      expect { @indexer.smods_rec(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingMods)
+    end
+  end
+  context "public_xml related methods" do
+    before(:all) do
+      @id_md_xml = "<identityMetadata><objectId>druid:#{@fake_druid}</objectId></identityMetadata>"
+      @cntnt_md_xml = "<contentMetadata type='image' objectId='#{@fake_druid}'>foo</contentMetadata>"
+      @rights_md_xml = "<rightsMetadata><access type=\"discover\"><machine><world>bar</world></machine></access></rightsMetadata>"
+      @rdf_xml = "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'><rdf:Description rdf:about=\"info:fedora/druid:#{@fake_druid}\">relationship!</rdf:Description></rdf:RDF>"
+      @pub_xml = "<publicObject id='druid:#{@fake_druid}'>#{@id_md_xml}#{@cntnt_md_xml}#{@rights_md_xml}#{@rdf_xml}</publicObject>"
+      @ng_pub_xml = Nokogiri::XML(@pub_xml)
+    end
+    context "#public_xml" do
+      it "should call public_xml method on harvestdor_client" do
+        @hdor_client.should_receive(:public_xml).with(@fake_druid).and_return(@ng_pub_xml)
+        @indexer.public_xml @fake_druid
+      end
+      it "retrieves entire public xml as a Nokogiri::XML::Document" do
+        @hdor_client.should_receive(:public_xml).with(@fake_druid).and_return(@ng_pub_xml)
+        px = @indexer.public_xml @fake_druid
+        px.should be_kind_of(Nokogiri::XML::Document)
+        px.root.name.should == 'publicObject'
+        px.root.attributes['id'].text.should == "druid:#{@fake_druid}"
+      end
+      it "raises exception if public xml for the druid is empty" do
+        @hdor_client.should_receive(:public_xml).with(@fake_druid).and_return(Nokogiri::XML("<publicObject/>"))
+        expect { @indexer.public_xml(@fake_druid) }.to raise_error(RuntimeError, Regexp.new("^Empty public xml for #{@fake_druid}: <"))
+      end
+      it "raises Harvestdor::Errors::MissingPurlPage if there is no purl page for the druid" do
+        expect { @indexer.public_xml(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingPurlPage)
+      end
+      it "raises error if there is no public_xml page for the druid" do
+        @hdor_client.should_receive(:public_xml).with(@fake_druid).and_return(nil)
+        expect { @indexer.public_xml(@fake_druid) }.to raise_error(RuntimeError, "No public xml for #{@fake_druid}")
+      end
+    end
+    context "#content_metadata" do
+      it "returns a Nokogiri::XML::Document derived from the public xml" do
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(@ng_pub_xml)
+        cm = @indexer.content_metadata(@fake_druid)
+        cm.should be_kind_of(Nokogiri::XML::Document)
+        cm.root.should_not == nil
+        cm.root.name.should == 'contentMetadata'
+        cm.root.attributes['objectId'].text.should == @fake_druid
+        cm.root.text.strip.should == 'foo'
+      end
+      it "raises Harvestdor::Errors::MissingPurlPage if there is no purl page for the druid" do
+        expect { @indexer.content_metadata(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingPurlPage)
+      end
+      it "should raise exception if there is no contentMetadata in the public xml" do
+        pub_xml = "<publicObject id='druid:#{@fake_druid}'>#{@id_md_xml}</publicObject>"
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(Nokogiri::XML(pub_xml))
+        expect { @indexer.content_metadata(@fake_druid) }.to raise_error(RuntimeError, "No contentMetadata for #{@fake_druid}")
+      end
+      it "raises RuntimeError if nil is returned by Harvestdor::Client.contentMetadata for the druid" do
+        @hdor_client.should_receive(:content_metadata).with(@fake_druid).and_return(nil)
+        expect { @indexer.content_metadata(@fake_druid) }.to raise_error(RuntimeError, "No contentMetadata for #{@fake_druid}")
+      end
+      it "raises MissingContentMetadata error if there is no contentMetadata in the public_xml for the druid" do
+        URI::HTTP.any_instance.should_receive(:open)
+        expect { @indexer.content_metadata(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingContentMetadata)
+      end
+    end
+    context "#identity_metadata" do
+      it "returns a Nokogiri::XML::Document derived from the public xml" do
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(@ng_pub_xml)
+        im = @indexer.identity_metadata(@fake_druid)
+        im.should be_kind_of(Nokogiri::XML::Document)
+        im.root.should_not == nil
+        im.root.name.should == 'identityMetadata'
+        im.root.text.strip.should == "druid:#{@fake_druid}"
+      end
+      it "raises Harvestdor::Errors::MissingPurlPage if there is no purl page for the druid" do
+        expect { @indexer.identity_metadata(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingPurlPage)
+      end
+      it "should raise exception if there is no identityMetadata in the public xml" do
+        pub_xml = "<publicObject id='druid:#{@fake_druid}'>#{@cntnt_md_xml}</publicObject>"
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(Nokogiri::XML(pub_xml))
+        expect { @indexer.identity_metadata(@fake_druid) }.to raise_error(RuntimeError, "No identityMetadata for #{@fake_druid}")
+      end
+      it "raises RuntimeError if nil is returned by Harvestdor::Client.identityMetadata for the druid" do
+        @hdor_client.should_receive(:identity_metadata).with(@fake_druid).and_return(nil)
+        expect { @indexer.identity_metadata(@fake_druid) }.to raise_error(RuntimeError, "No identityMetadata for #{@fake_druid}")
+      end
+      it "raises MissingIdentityMetadata error if there is no identityMetadata in the public_xml for the druid" do
+        URI::HTTP.any_instance.should_receive(:open)
+        expect { @indexer.identity_metadata(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingIdentityMetadata)
+      end
+    end
+    context "#rights_metadata" do
+      it "returns a Nokogiri::XML::Document derived from the public xml" do
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(@ng_pub_xml)
+        rm = @indexer.rights_metadata(@fake_druid)
+        rm.should be_kind_of(Nokogiri::XML::Document)
+        rm.root.should_not == nil
+        rm.root.name.should == 'rightsMetadata'
+        rm.root.text.strip.should == "bar"
+      end
+      it "raises Harvestdor::Errors::MissingPurlPage if there is no purl page for the druid" do
+        expect { @indexer.rights_metadata(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingPurlPage)
+      end
+      it "should raise exception if there is no rightsMetadata in the public xml" do
+        pub_xml = "<publicObject id='druid:#{@fake_druid}'>#{@cntnt_md_xml}</publicObject>"
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(Nokogiri::XML(pub_xml))
+        expect { @indexer.rights_metadata(@fake_druid) }.to raise_error(RuntimeError, "No rightsMetadata for #{@fake_druid}")
+      end
+      it "raises RuntimeError if nil is returned by Harvestdor::Client.rightsMetadata for the druid" do
+        @hdor_client.should_receive(:rights_metadata).with(@fake_druid).and_return(nil)
+        expect { @indexer.rights_metadata(@fake_druid) }.to raise_error(RuntimeError, "No rightsMetadata for #{@fake_druid}")
+      end
+      it "raises MissingRightsMetadata error if there is no rightsMetadata in the public_xml for the druid" do
+        URI::HTTP.any_instance.should_receive(:open)
+        expect { @indexer.rights_metadata(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingRightsMetadata)
+      end
+    end
+    context "#rdf" do
+      it "returns a Nokogiri::XML::Document derived from the public xml" do
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(@ng_pub_xml)
+        rdf_doc = @indexer.rdf(@fake_druid)
+        rdf_doc.should be_kind_of(Nokogiri::XML::Document)
+        rdf_doc.root.should_not == nil
+        rdf_doc.root.name.should == 'RDF'
+        rdf_doc.root.text.strip.should == "relationship!"
+      end
+      it "raises Harvestdor::Errors::MissingPurlPage if there is no purl page for the druid" do
+        expect { @indexer.rdf(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingPurlPage)
+      end
+      it "should raise exception if there is no rdf in the public xml" do
+        pub_xml = "<publicObject id='druid:#{@fake_druid}'>#{@cntnt_md_xml}</publicObject>"
+        Harvestdor.stub(:public_xml).with(@fake_druid, @indexer.config.purl).and_return(Nokogiri::XML(pub_xml))
+        expect { @indexer.rdf(@fake_druid) }.to raise_error(RuntimeError, "No RDF for #{@fake_druid}")
+      end
+      it "raises RuntimeError if nil is returned by Harvestdor::Client.rdf for the druid" do
+        @hdor_client.should_receive(:rdf).with(@fake_druid).and_return(nil)
+        expect { @indexer.rdf(@fake_druid) }.to raise_error(RuntimeError, "No RDF for #{@fake_druid}")
+      end
+      it "raises MissingRDF error if there is no rdf in the public_xml for the druid" do
+        URI::HTTP.any_instance.should_receive(:open)
+        expect { @indexer.rdf(@fake_druid) }.to raise_error(Harvestdor::Errors::MissingRDF)
+      end
+    end
+  end
+  context "blacklist" do
+    it "should be an Array with an entry for each non-empty line in the file" do
+      @indexer.send(:load_blacklist, @blacklist_path)
+      @indexer.send(:blacklist).should be_an_instance_of(Array)
+      @indexer.send(:blacklist).size.should == 2
+    end
+    it "should be empty Array if there was no blacklist config setting" do
+      indexer = Harvestdor::Indexer.new(@config_yml_path)
+      indexer.send(:blacklist).should == []
+    end
+    context "load_blacklist" do
+      it "should not be called if there was no blacklist config setting" do
+        indexer = Harvestdor::Indexer.new(@config_yml_path)
+        indexer.should_not_receive(:load_blacklist)
+        hdor_client = indexer.send(:harvestdor_client)
+        hdor_client.should_receive(:druids_via_oai).and_return([@fake_druid])
+        indexer.solr_client.should_receive(:add)
+        indexer.solr_client.should_receive(:commit)
+        indexer.harvest_and_index
+      end
+      it "should only try to load a blacklist once" do
+        indexer = Harvestdor::Indexer.new(@config_yml_path, {:blacklist => @blacklist_path})
+        indexer.send(:blacklist)
+        File.any_instance.should_not_receive(:open)
+        indexer.send(:blacklist)
+      end
+      it "should log an error message and raise RuntimeError if it can't find the indicated blacklist file" do
+        exp_msg = 'Unable to find list of druids at bad_path'
+        indexer = Harvestdor::Indexer.new(@config_yml_path, {:blacklist => 'bad_path'})
+        indexer.logger.should_receive(:fatal).with(exp_msg)
+        expect { indexer.send(:load_blacklist, 'bad_path') }.to raise_error(exp_msg)
+      end
+    end
+  end # blacklist
+  context "whitelist" do
+    it "should be an Array with an entry for each non-empty line in the file" do
+      @indexer.send(:load_whitelist, @whitelist_path)
+      @indexer.send(:whitelist).should be_an_instance_of(Array)
+      @indexer.send(:whitelist).size.should == 2
+    end
+    it "should be empty Array if there was no whitelist config setting" do
+      indexer = Harvestdor::Indexer.new(@config_yml_path)
+      indexer.send(:whitelist).should == []
+    end
+    context "load_whitelist" do
+      it "should not be called if there was no whitelist config setting" do
+        indexer = Harvestdor::Indexer.new(@config_yml_path)
+        indexer.should_not_receive(:load_whitelist)
+        hdor_client = indexer.send(:harvestdor_client)
+        hdor_client.should_receive(:druids_via_oai).and_return([@fake_druid])
+        indexer.solr_client.should_receive(:add)
+        indexer.solr_client.should_receive(:commit)
+        indexer.harvest_and_index
+      end
+      it "should only try to load a whitelist once" do
+        indexer = Harvestdor::Indexer.new(@config_yml_path, {:whitelist => @whitelist_path})
+        indexer.send(:whitelist)
+        File.any_instance.should_not_receive(:open)
+        indexer.send(:whitelist)
+      end
+      it "should log an error message and raise RuntimeError if it can't find the indicated whitelist file" do
+        exp_msg = 'Unable to find list of druids at bad_path'
+        indexer = Harvestdor::Indexer.new(@config_yml_path, {:whitelist => 'bad_path'})
+        indexer.logger.should_receive(:fatal).with(exp_msg)
+        expect { indexer.send(:load_whitelist, 'bad_path') }.to raise_error(exp_msg)
+      end
+    end
+  end # whitelist
+  it "solr_client should initialize the rsolr client using the options from the config" do
+    indexer = Harvestdor::Indexer.new(nil, Confstruct::Configuration.new(:solr => { :url => 'http://localhost:2345', :a => 1 }) )
+    RSolr.should_receive(:connect).with(hash_including(:a => 1, :url => 'http://localhost:2345')).and_return('foo')
+    indexer.solr_client
+  end
+end
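
The blacklist and whitelist specs above pin down three behaviours: the list has one entry per non-empty line, it is read from disk at most once, and a missing file is logged at fatal level before a RuntimeError is raised. The sketch below is one way to satisfy those expectations. It is not the gem's implementation: the class name DruidListLoader and its constructor arguments are hypothetical, and treating whitespace-only lines as empty is an assumption.

```ruby
require 'logger'

# Hypothetical helper (not part of harvestdor-indexer) that mirrors the
# behaviour described by the blacklist/whitelist specs.
class DruidListLoader
  attr_reader :logger

  # path is nil when no list was configured.
  def initialize(path = nil, logger = Logger.new($stderr))
    @path = path
    @logger = logger
  end

  # Memoized: the file is read on first access only.
  # With no configured path the list is simply empty.
  def druids
    @druids ||= @path ? load_list(@path) : []
  end

  private

  # One entry per non-empty line; surrounding whitespace is stripped
  # (assumption: whitespace-only lines count as empty).
  def load_list(path)
    File.readlines(path).map(&:strip).reject(&:empty?)
  rescue Errno::ENOENT
    msg = "Unable to find list of druids at #{path}"
    logger.fatal(msg)
    raise msg # a bare String raises RuntimeError
  end
end
```

Calling DruidListLoader.new('bad_path').druids logs the message and raises RuntimeError, which matches the expectation in the "can't find the indicated file" examples.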

metadata ADDED

@@ -0,0 +1,233 @@
+--- !ruby/object:Gem::Specification
+name: harvestdor-indexer
+version: !ruby/object:Gem::Version
+  version: 0.0.3
+  prerelease:
+platform: ruby
+authors:
+- Naomi Dushay
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-03-08 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rsolr
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: harvestdor
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: stanford-mods
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: lyberteam-gems-devel
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rdoc
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: simplecov
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: simplecov-rcov
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Harvest DOR object metadata via a relationship (e.g. hydra:isGovernedBy
+  rdf:resource="info:fedora/druid:hy787xj5878") and dates, plus code framework to
+  write Solr docs to index
+email:
+- ndushay@stanford.edu
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .yardopts
+- Gemfile
+- LICENSE.txt
+- README.rdoc
+- Rakefile
+- harvestdor-indexer.gemspec
+- lib/harvestdor-indexer.rb
+- lib/harvestdor-indexer/version.rb
+- spec/config/ap.yml
+- spec/config/ap_blacklist.txt
+- spec/config/ap_whitelist.txt
+- spec/spec_helper.rb
+- spec/unit/harvestdor-indexer_spec.rb
+homepage: https://consul.stanford.edu/display/chimera/Chimera+project
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: -2920299245033359379
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: -2920299245033359379
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.24
+signing_key:
+specification_version: 3
+summary: Harvest DOR object metadata and index it to Solr
+test_files:
+- spec/config/ap.yml
+- spec/config/ap_blacklist.txt
+- spec/config/ap_whitelist.txt
+- spec/spec_helper.rb
+- spec/unit/harvestdor-indexer_spec.rb
+has_rdoc:
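
For orientation, the dependency section of the metadata above corresponds to gemspec declarations along the following lines. This is a reconstruction from the YAML only, not the contents of the package's harvestdor-indexer.gemspec; the layout and the use of loops are assumptions. A dependency declared without a version constraint is recorded as '>= 0', which is what the metadata shows for every entry except lyberteam-gems-devel.

```ruby
require 'rubygems'

# Illustrative reconstruction from the metadata; the real gemspec may differ.
spec = Gem::Specification.new do |s|
  s.name          = 'harvestdor-indexer'
  s.version       = '0.0.3'
  s.authors       = ['Naomi Dushay']
  s.email         = ['ndushay@stanford.edu']
  s.summary       = 'Harvest DOR object metadata and index it to Solr'
  s.homepage      = 'https://consul.stanford.edu/display/chimera/Chimera+project'
  s.require_paths = ['lib']

  # Runtime dependencies: no constraint given, so RubyGems stores '>= 0'.
  %w[rsolr harvestdor stanford-mods].each { |name| s.add_dependency name }

  # Development dependencies; only lyberteam-gems-devel carries a constraint.
  s.add_development_dependency 'lyberteam-gems-devel', '>= 1.0'
  %w[rake rdoc yard rspec simplecov simplecov-rcov].each do |name|
    s.add_development_dependency name
  end
end

puts spec.runtime_dependencies.map(&:name).inspect
```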