RubyGems - traject - Versions diffs - 1.0.0.beta.2 → 1.0.0.beta.3 - Mend

traject 1.0.0.beta.2 → 1.0.0.beta.3

Files changed (12)

checksums.yaml +4 -4
data/doc/batch_execution.md +60 -8
data/doc/extending.md +3 -1
data/doc/settings.md +2 -2
data/lib/traject/indexer.rb +29 -26
data/lib/traject/macros/marc_format_classifier.rb +7 -4
data/lib/traject/version.rb +1 -1
data/test/indexer/macros_marc21_semantics_test.rb +50 -4
data/test/indexer/macros_marc21_test.rb +6 -0
data/test/marc_format_classifier_test.rb +5 -1
data/test/test_helper.rb +10 -0
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: cb35b4c5ba302cb865b459bfac6859ef6be68927
-  data.tar.gz: 14964b88428d0a827932cbf17194a77c56de1091
+  metadata.gz: fba53ed999c0449a6de13e4e1455399431c4e052
+  data.tar.gz: 5af370ada3fd4f779607bd7e3f294607a7a20208
 SHA512:
-  metadata.gz: b3ae114fe4a11baaf6f6470d35d5df339a32b8f7e747f012310ee90ed55b982754d196abcc25ec0375d8bf988ca5e3ebd2c71ef7df159bc7007e6cdae9c69643
-  data.tar.gz: 74c6c6170860cf1cd4883d644f9c23151a703ff996592abb8668a655dcff885c108c7e04e13419f6c4ba727384b3bab3dff896c177382284b8d12b98cc711225
+  metadata.gz: f062f33724bdf1d11260edb675a239ebfcd834238dc98806c638d1a9c955acb5819fd2c6fbf15f023b683b2acd68913d100c8ede7f75ca62382e783ecba429da
+  data.tar.gz: d0efcb96f544f7cdc42517cb08a067f8d1f20de1fcbf34a53af1adbdca4b127105f7e659fed11625966605bdef79d5f23433dc0e209b159a447686fa887ab75e

data/doc/batch_execution.md CHANGED

@@ -18,7 +18,9 @@ with jruby 1.7.x or later, this should be default, recommend
 you use jruby 1.7.x.
 Especially when running under a cron job, it can be difficult to
-set things up so traject runs under jruby.
+set things up so traject runs under jruby -- and then when you add
+bundler into it, things can get positively byzantine. It's not you,
+this gets confusing.
 It can sometimes be useful to create a wrapper script for traject
 that takes care of making sure it's running under the right ruby
@@ -31,8 +33,11 @@ Simply run with:
     chruby-exec jruby -- traject {other arguments}
 Whether specifying that directly in a crontab, or in a shell script
-that needs to call traject, etc. So simple you might not need
-a wrapper script, but it might still be convenient to create one. Say
+that needs to call traject, etc. In a crontab environment, it'll actually need
+you to set PATH and SHELL variables, as specified in the [chruby docs](https://github.com/postmodern/chruby/wiki/Cron)
+So simple you might not need a wrapper script, but it might still be convenient to create one. Say
 you put a `jruby-traject` at `/usr/local/bin/jruby-traject`, that
 looks like this:
@@ -40,9 +45,55 @@ looks like this:
     chruby-exec jruby -- traject "$@"
-Now any account, in a crontab, in an interactive shell, wherever,
-can just execute `jruby-traject {arguments}`, and execute traject
-in a jruby environment.
+Now you can can just execute `jruby-traject {arguments}`, and execute traject
+in a jruby environment. (In a crontab, you'll still need to fix your
+PATH and SHELL env variables for `chruby-exec` to work, either in the
+crontab or in this wrapper script)
+### chruby monster wrapper script
+I am still not sure if this is a good idea, but here's an example of
+a wrapper script for chruby that will take care of the ENV even
+when running in a crontab, use chruby-exec only if jruby isn't
+already the default ruby, and add in `bundle exec` too.
+~~~bash
+#!/usr/bin/env bash
+# A wrapper for traject that uses chruby to make sure jruby
+# is being used before calling traject, and then calls
+# traject with bundle exec from within our traject project
+# dir.
+# Make sure /usr/local/bin is in PATH for chruby-exec,
+# which it's not ordinarily in a cronjob.
+if [[ ":$PATH:" != *":/usr/local/bin:"* ]]
+then
+  export PATH=$PATH:/usr/local/bin
+fi
+# chruby needs SHELL set, which it won't be from a crontab
+export SHELL=/bin/bash
+# Find the dir based on location of this wrapper script,
+# then use that dir to cd to for the bundle exec to find
+# the right Gemfile.
+traject_dir=$(cd `dirname "${BASH_SOURCE[0]}"` && pwd)
+# do we need to use chruby to switch to jruby?
+if [[ "$(ruby -v)" == *jruby* ]]
+then
+  ruby_picker="" # nothing needed "
+else
+  ruby_picker="chruby-exec jruby --"
+fi
+cmd="BUNDLE_GEMFILE=$traject_dir/Gemfile $ruby_picker bundle exec traject $@"
+echo $cmd
+eval $cmd
+~~~
+This monster script can perhaps be adapted for rbenv or rvm.
 ### for rbenv
@@ -62,7 +113,7 @@ If you're running inside a cronjob, things get a bit trickier,
 because rbenv isn't normally set up in the limited environment
 of cron tasks. One way to deal with this is to have your
 cronjob explicitly execute in a bash login shell, that
-will then have rbenv set up so long as it's running
+will then have rbenv set up -- so long as it's running
 under an account with rbenv set up properly!
     # in a cronfile
@@ -99,6 +150,7 @@ Now any account, in a crontab, in an interactive shell, wherever,
 can just execute `jruby-traject {arguments}`, and execute traject
 in a jruby environment.
 ### Bundler too?
 If you're running with bundler too, you could make a wrapper file specific to
@@ -188,4 +240,4 @@ do whatever you can make yell, just write ruby.
 For automated batch execution, we recommend you consider using
 bundler to manage any gem dependencies. See the [Extending
 With Your Own Code](./extending.md) traject docs for
-information on how traject integrates with bundler.
+information on how traject integrates with bundler.
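
The chruby wrapper script in this diff assembles its command as a single string and runs it through `eval`, so arguments containing spaces are split again by the shell. As a rough illustration only, the same decision logic could be written in Ruby with the command kept as an array. The helper name `traject_command` and the `/opt/traject` path are hypothetical and not part of traject, and how `chruby-exec` itself treats the arguments it receives is outside what this sketch shows.

```ruby
# Hypothetical helper (not part of traject): builds the environment and
# argument list that the bash wrapper assembles as one string.
def traject_command(traject_args, traject_dir:, jruby_active:)
  env  = { "BUNDLE_GEMFILE" => File.join(traject_dir, "Gemfile") }
  argv = []
  # Only go through chruby-exec when the current ruby is not already jruby.
  argv.concat(%w[chruby-exec jruby --]) unless jruby_active
  argv.concat(%w[bundle exec traject])
  argv.concat(traject_args)
  [env, argv]
end

env, argv = traject_command(
  ["-c", "my config.rb"],
  traject_dir: "/opt/traject",          # assumed project directory
  jruby_active: RUBY_ENGINE == "jruby"  # comparable to the script's `ruby -v` check
)
puts env.inspect
puts argv.inspect
# A real wrapper would finish with: exec(env, *argv)
```

Keeping the command as an array means `"my config.rb"` stays one argument up to the point of `exec`.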

data/doc/extending.md CHANGED

@@ -16,6 +16,7 @@ of a couple traject features meant to make it easier.
 * Traject `-I` argument command line can be used to list directories to
   add to the load path, similar to the `ruby -I` argument. You
   can then 'require' local project files from the load path.
+  * Or modify the ruby `$LOAD_PATH` manually at the top of a traject config file you are loading.
   * translation map files found in a
     "./translation_maps" subdir on the load path will be found
     for Traject translation maps.
@@ -155,7 +156,8 @@ by running `bundler init`, probably in the directory
 right next to your traject config files.
 Then specify what gems your traject project will use,
-possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html).
+possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html) --
+**do** include `gem 'traject'` in the Gemfile.
 Run `bundle install` from the directory with the Gemfile, on any system
 at any time, to make sure specified gems are installed.
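
As a small illustration of the `$LOAD_PATH` option added in this diff, a config file could begin with something like the following. The `lib` directory name and the `my_local_macros` file are assumptions for illustration, and the sketch assumes `__FILE__` resolves to the config file's own path when traject loads it.

```ruby
# Top of a traject config file (sketch).
# "lib" is an assumed directory name next to this config file.
local_lib = File.expand_path("../lib", __FILE__)
$LOAD_PATH.unshift(local_lib) unless $LOAD_PATH.include?(local_lib)

# Files under that directory can now be loaded by name, e.g.
#   require 'my_local_macros'   # hypothetical file lib/my_local_macros.rb
```

The `include?` guard keeps the directory from being added twice if the config file is evaluated more than once.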

data/doc/settings.md CHANGED

@@ -47,8 +47,8 @@ settings are applied first of all. It's recommended you use `provide`.
 * `log.level`:  Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info,
               or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
-* `log.batch_size`: If set to a number N (or string representation), will output a progress line to INFO
-   log, every N records.
+* `log.batch_size`: If set to a number N (or string representation), will output a progress line to DEBUG
+   log, every N records. (use -d to turn logging to DEBUG to see.)
 * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
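
The practical effect of moving the `log.batch_size` progress line from INFO to DEBUG can be sketched with Ruby's standard-library `Logger`, used here purely as a stand-in for Yell (the logging library traject uses). The helper name `progress_output` and the message text are hypothetical.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: logs a progress line at DEBUG every `batch_size`
# records and returns everything the logger actually wrote.
def progress_output(level, batch_size, total)
  io = StringIO.new
  logger = Logger.new(io)
  logger.level = level
  logger.formatter = proc { |severity, _time, _progname, msg| "#{severity} #{msg}\n" }

  (1..total).each do |count|
    logger.debug("read #{count} records") if count % batch_size == 0
  end
  io.string
end

puts progress_output(Logger::INFO, 2, 6).inspect  # nothing is written at INFO
puts progress_output(Logger::DEBUG, 2, 6)         # three progress lines at DEBUG
```

With the level left at its default of 'info', the progress lines are dropped; they only appear once the level is lowered to debug, which is what the `-d` hint in the settings text refers to.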

data/lib/traject/indexer.rb CHANGED

@@ -2,11 +2,13 @@ require 'yell'
 require 'traject'
 require 'traject/qualified_const_get'
+require 'traject/thread_pool'
 require 'traject/indexer/settings'
 require 'traject/marc_reader'
 require 'traject/marc4j_reader'
 require 'traject/json_writer'
+require 'traject/solrj_writer'
 require 'traject/macros/marc21'
 require 'traject/macros/basic'
@@ -73,7 +75,7 @@ require 'traject/macros/basic'
 #  The default writer is the SolrJWriter, using Java SolrJ to
 #  write to a Solr.  A few other built-in writers are available,
 #  but it's anticipated more will be created as plugins or local
-#  code for special purposes.
+#  code for special purposes.
 #
 #  You can set alternate writers by setting a Class object directly
 #  with the #writer_class method, or by the 'writer_class_name' Setting,
@@ -167,38 +169,39 @@ class Traject::Indexer
   attr_writer :logger
-  # Just calculates the arg that's gonna be given to Yell.new
-  # or SomeLogger.new
-  def logger_argument
-    specified = settings["log.file"] || "STDERR"
-    case specified
-    when "STDOUT" then STDOUT
-    when "STDERR" then STDERR
-    else specified
-    end
-  end
-  # Second arg to Yell.new, options hash, calculated from
-  # settings
-  def logger_options
-    # formatter, default is fairly basic
+  def logger_format
     format = settings["log.format"] || "%d %5L %m"
     format = case format
-    when "false" then false
-    when "" then nil
-    else format
+      when "false" then false
+      when "" then nil
+      else format
     end
-    level = settings["log.level"] || "info"
-    {:format => format, :level => level}
   end
   # Create logger according to settings
   def create_logger
+    logger_level  = settings["log.level"] || "info"
     # log everything to STDERR or specified logfile
-    logger = Yell.new( logger_argument, logger_options )
+    logger = Yell.new
+    logger.format = logger_format
+    logger.level  = logger_level
+    logger_destination = settings["log.file"] || "STDERR"
+    # We intentionally repeat the logger_level
+    # on the adapter, so it will stay there if overall level
+    # is changed.
+    case logger_destination
+    when "STDERR"
+      logger.adapter :stderr, level: logger_level, format: logger_format
+    when "STDOUT"
+      logger.adapter :stdout, level: logger_level, format: logger_format
+    else
+      logger.adapter :file, logger_destination, level: logger_level, format: logger_format
+    end
     # ADDITIONALLY log error and higher to....
     if settings["log.error_file"]
       logger.adapter :file, settings["log.error_file"], :level => 'gte.error'
@@ -329,7 +332,7 @@ class Traject::Indexer
       if log_batch_size && (count % log_batch_size == 0)
         batch_rps = log_batch_size / (Time.now - batch_start_time)
         overall_rps = count / (Time.now - start_time)
-        logger.info "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall"
+        logger.debug "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall"
         batch_start_time = Time.now
       end
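
The setting-to-logger mapping that the reworked `create_logger` performs can be isolated into two small pure functions. This is a sketch that mirrors the `case` statements in the diff without depending on Yell. The names `normalize_log_format` and `adapter_arguments` are hypothetical, and a plain `Hash` stands in for traject's settings object.

```ruby
# Mirrors the `logger_format` case statement from the diff:
# "false" maps to false, "" maps to nil, anything else passes through.
def normalize_log_format(setting)
  format = setting || "%d %5L %m"
  case format
  when "false" then false
  when ""      then nil
  else format
  end
end

# Hypothetical helper: the leading arguments `create_logger` would pass
# to `logger.adapter` for a given "log.file" setting.
def adapter_arguments(log_file_setting)
  destination = log_file_setting || "STDERR"
  case destination
  when "STDERR" then [:stderr]
  when "STDOUT" then [:stdout]
  else [:file, destination]
  end
end

settings = { "log.file" => "traject.log" }  # plain Hash standing in for settings
puts normalize_log_format(settings["log.format"]).inspect
puts adapter_arguments(settings["log.file"]).inspect
```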

data/lib/traject/macros/marc_format_classifier.rb CHANGED

@@ -114,12 +114,15 @@ module Traject
       # * If it has any RDA 338, then it's print if it has a value of
       #   volume, sheet, or card.
       # * If it does not have an RDA 338, it's print if and only if it has
-      #   NO 245$h GMD.
+      #   no 245$h GMD.
       #
       # * Here at JH, for legacy reasons we also choose to not
       #   call it print if it's already been marked audio, but
       #   we do that in a different method.
       #
+      # Note that any record that has neither a 245 nor a 338rda is going
+      # to be marked print
+      #
       # This algorithm is definitely going to get some things wrong in
       # both directions, with real world data. But seems to be good enough.
       def print?
@@ -137,7 +140,7 @@ module Traject
             end
           end
         else
-          normalized_gmd.length == 0
+          normalized_gmd.length == 0
         end
       end
@@ -145,8 +148,8 @@ module Traject
       # resource. But sometimes resort to 245$h GMD too.
       def online?
         # field 007, byte 0 c="electronic" byte 1 r="remote" ==> sure Online
-        found_007 = record.find do |field|
-          field.tag == "007" && field.value.slice(0) == "c" && field.value.slice(1) == "r"
+        found_007 = record.fields('007').find do |field|
+          field.value.slice(0) == "c" && field.value.slice(1) == "r"
         end
         return true if found_007
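
The revised 007 check in `online?` can be tried out without the marc gem using small stand-in objects. `FakeField`, `FakeRecord`, and `remote_electronic_007?` are hypothetical names for this sketch. Only the `fields('007').find` block is taken from the diff.

```ruby
# Stand-ins for ruby-marc objects, only so the sketch runs without the gem.
FakeField = Struct.new(:tag, :value)

class FakeRecord
  def initialize(fields)
    @fields = fields
  end

  # Returns only the fields carrying the given tag.
  def fields(tag)
    @fields.select { |f| f.tag == tag }
  end
end

# Same test as the diff: 007 byte 0 is "c" (electronic) and byte 1 is "r" (remote).
def remote_electronic_007?(record)
  found = record.fields('007').find do |field|
    field.value.slice(0) == "c" && field.value.slice(1) == "r"
  end
  !found.nil?
end

online_record = FakeRecord.new([FakeField.new("001", "000000000"),
                                FakeField.new("007", "cr |||")])
empty_like    = FakeRecord.new([FakeField.new("001", "000000000")])

puts remote_electronic_007?(online_record)
puts remote_electronic_007?(empty_like)
```

The second record has only an 001 control field, like the `empty_record` test helper added in this release, and the check returns false for it.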

data/lib/traject/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Traject
-  VERSION = "1.0.0.beta.2"
+  VERSION = "1.0.0.beta.3"
 end

data/test/indexer/macros_marc21_semantics_test.rb CHANGED

@@ -28,6 +28,8 @@ describe "Traject::Macros::Marc21Semantics" do
     output = @indexer.map_record(@record)
     assert_equal %w{47971712},  output["oclcnum"]
+    assert_equal({}, @indexer.map_record(empty_record))
   end
   it "#marc_series_facet" do
@@ -40,6 +42,8 @@ describe "Traject::Macros::Marc21Semantics" do
     # trims punctuation too
     assert_equal ["Big bands"], output["series_facet"]
+    assert_equal({}, @indexer.map_record(empty_record))
   end
   describe "marc_sortable_author" do
@@ -54,6 +58,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Herman, Edward S.   Manufacturing consent the political economy of the mass media Edward S. Herman and Noam Chomsky ; with a new introduction by the authors"], output["author_sort"]
+      assert_equal [""], @indexer.map_record(empty_record)['author_sort']
     end
     it "respects non-filing" do
       @record = MARC::Reader.new(support_file_path  "the_business_ren.marc").to_a.first
@@ -61,6 +67,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Business renaissance quarterly [electronic resource]."], output["author_sort"]
+      assert_equal [""], @indexer.map_record(empty_record)['author_sort']
     end
   end
@@ -71,6 +79,8 @@ describe "Traject::Macros::Marc21Semantics" do
     it "works" do
       output = @indexer.map_record(@record)
       assert_equal ["Manufacturing consent : the political economy of the mass media"], output["title_sort"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "respects non-filing" do
       @record = MARC::Reader.new(support_file_path  "the_business_ren.marc").to_a.first
@@ -95,6 +105,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["English", "French", "German", "Italian", "Spanish", "Russian"], output["languages"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -108,6 +120,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Larger ensemble, Unspecified", "Piano", "Soprano voice", "Tenor voice", "Violin", "Larger ensemble, Ethnic", "Guitar", "Voices, Unspecified"], output["instrumentation"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -126,16 +140,29 @@ describe "Traject::Macros::Marc21Semantics" do
       @record = MARC::Reader.new(support_file_path  "louis_armstrong.marc").to_a.first
       output = @indexer.map_record(@record)
-      assert_equal ["bb01", "bb01.s", "bb", "bb.s", "oe"],
-        output["instrument_codes"]
+      assert_equal ["bb01", "bb01.s", "bb", "bb.s", "oe"], output["instrument_codes"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
   describe "publication_date" do
     # there are way too many edge cases for us to test em all, but we'll test some of em.
+    it "works when there's no date information" do
+      assert_equal nil,  Marc21Semantics.publication_date(empty_record)
+    end
+    it "uses macro correctly with no date info" do
+      @indexer.instance_eval {to_field "date", marc_publication_date }
+      assert_equal({}, @indexer.map_record(empty_record))
+    end
     it "pulls out 008 date_type s" do
       @record = MARC::Reader.new(support_file_path  "manufacturing_consent.marc").to_a.first
       assert_equal 2002, Marc21Semantics.publication_date(@record)
     end
     it "uses start date for date_type c continuing resource" do
       @record = MARC::Reader.new(support_file_path  "the_business_ren.marc").to_a.first
@@ -182,18 +209,24 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Language & Literature"], output["discipline_facet"]
     end
     it "maps to default" do
       @record = MARC::Reader.new(support_file_path  "musical_cage.marc").to_a.first
       output = @indexer.map_record(@record)
       assert_equal ["Unknown"], output["discipline_facet"]
+      assert_equal(["Unknown"], @indexer.map_record(empty_record)['discipline_facet'])
     end
     it "maps to nothing if none and no default" do
       @indexer.instance_eval {to_field "discipline_no_default", marc_lcc_to_broad_category(:default => nil)}
       @record = MARC::Reader.new(support_file_path  "musical_cage.marc").to_a.first
       output = @indexer.map_record(@record)
       assert_nil output["discipline_no_default"]
+      assert_nil @indexer.map_record(empty_record)["discipline_no_default"]
     end
     describe "LCC_REGEX" do
@@ -212,13 +245,15 @@ describe "Traject::Macros::Marc21Semantics" do
       @record = MARC::Reader.new(support_file_path  "multi_geo.marc").to_a.first
       output = @indexer.map_record(@record)
-      assert_equal ["Europe", "Middle East", "Africa, North", "Agora (Athens, Greece)", "Rome (Italy)", "Italy"],
-        output["geo_facet"]
+      assert_equal ["Europe", "Middle East", "Africa, North", "Agora (Athens, Greece)", "Rome (Italy)", "Italy"], output["geo_facet"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "maps nothing on a record with no geo" do
       @record = MARC::Reader.new(support_file_path  "manufacturing_consent.marc").to_a.first
       output = @indexer.map_record(@record)
       assert_nil output["geo_facet"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -232,6 +267,8 @@ describe "Traject::Macros::Marc21Semantics" do
       assert_equal ["Early modern, 1500-1700", "17th century", "Great Britain: Puritan Revolution, 1642-1660", "Great Britain: Civil War, 1642-1649", "1642-1660"],
         output["era_facet"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -241,6 +278,7 @@ describe "Traject::Macros::Marc21Semantics" do
       str = Marc21Semantics.assemble_lcsh(field)
       assert_equal "Psychoanalysis and literature — England — History — 19th century", str
     end
     it "ignores numeric subfields" do
@@ -277,6 +315,9 @@ describe "Traject::Macros::Marc21Semantics" do
         assert output["lcsh"].length > 0, "outputs data"
         assert output["lcsh"].include?("Eliot, George, 1819-1880 — Characters"), "includes a string its supposed to"
+        assert_equal({}, @indexer.map_record(empty_record))
       end
     end
   end
@@ -292,6 +333,8 @@ describe "Traject::Macros::Marc21Semantics" do
       end
       output = @indexer.map_record(@record)
       assert_equal ['Business renaissance quarterly'], output['title_phrase']
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "works with :include_original" do
@@ -300,6 +343,7 @@ describe "Traject::Macros::Marc21Semantics" do
       end
       output = @indexer.map_record(@record)
       assert_equal ['The Business renaissance quarterly', 'Business renaissance quarterly'], output['title_phrase']
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "doesn't do anything if you don't include the first subfield" do
@@ -308,6 +352,8 @@ describe "Traject::Macros::Marc21Semantics" do
       end
       output = @indexer.map_record(@record)
       assert_equal ['[electronic resource].'], output['title_phrase']
+      assert_equal({}, @indexer.map_record(empty_record))
     end

data/test/indexer/macros_marc21_test.rb CHANGED

@@ -26,6 +26,8 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_equal ["Manufacturing consent : the political economy of the mass media /"], output["title"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "respects :first=>true option" do
@@ -36,6 +38,7 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_length 1, output["other_id"]
     end
     it "trims punctuation with :trim_punctuation => true" do
@@ -46,6 +49,8 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_equal ["Manufacturing consent : the political economy of the mass media"], output["title"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "respects :default option" do
@@ -70,6 +75,7 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_equal ["eng"], output['lang1']
       assert_equal ["eng", "eng"], output['lang2']
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "fails on an extra/misspelled argument to extract_marc" do

data/test/marc_format_classifier_test.rb CHANGED

@@ -10,7 +10,11 @@ def classifier_for(filename)
 end
 describe "MarcFormatClassifier" do
+  it "returns 'Print' when there's no other data" do
+    assert_equal ['Print'],  MarcFormatClassifier.new( empty_record  ).formats
+  end
   describe "genre" do
     # We don't have the patience to test every case, just a sampling
     it "says book" do

data/test/test_helper.rb CHANGED

@@ -37,6 +37,16 @@ def assert_start_with(start_with, obj, msg = nil)
   assert obj.start_with?(start_with), msg
 end
+# An empty record, for making sure extractors and macros work when
+# the fields they're looking for aren't there
+def empty_record
+  rec = MARC::Record.new
+  rec.append(MARC::ControlField.new('001', '000000000'))
+  rec
+end
 # pretends to be a SolrJ HTTPServer-like thing, just kind of mocks it up
 # and records what happens and simulates errors in some cases.
 class MockSolrServer

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: traject
 version: !ruby/object:Gem::Version
-  version: 1.0.0.beta.2
+  version: 1.0.0.beta.3
 platform: ruby
 authors:
 - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-10-17 00:00:00.000000000 Z
+date: 2013-10-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: marc
@@ -286,7 +286,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: 1.3.1
 requirements: []
 rubyforge_project:
-rubygems_version: 2.1.5
+rubygems_version: 2.1.9
 signing_key:
 specification_version: 4
 summary: Index MARC to Solr; or generally process source records to hash-like structures