RubyGems - traject - Versions diffs - 1.0.0.beta.2 → 1.0.0.beta.3 - Mend

traject 1.0.0.beta.2 → 1.0.0.beta.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

checksums.yaml +4 -4
data/doc/batch_execution.md +60 -8
data/doc/extending.md +3 -1
data/doc/settings.md +2 -2
data/lib/traject/indexer.rb +29 -26
data/lib/traject/macros/marc_format_classifier.rb +7 -4
data/lib/traject/version.rb +1 -1
data/test/indexer/macros_marc21_semantics_test.rb +50 -4
data/test/indexer/macros_marc21_test.rb +6 -0
data/test/marc_format_classifier_test.rb +5 -1
data/test/test_helper.rb +10 -0
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: cb35b4c5ba302cb865b459bfac6859ef6be68927
-  data.tar.gz: 14964b88428d0a827932cbf17194a77c56de1091
+  metadata.gz: fba53ed999c0449a6de13e4e1455399431c4e052
+  data.tar.gz: 5af370ada3fd4f779607bd7e3f294607a7a20208
 SHA512:
-  metadata.gz: b3ae114fe4a11baaf6f6470d35d5df339a32b8f7e747f012310ee90ed55b982754d196abcc25ec0375d8bf988ca5e3ebd2c71ef7df159bc7007e6cdae9c69643
-  data.tar.gz: 74c6c6170860cf1cd4883d644f9c23151a703ff996592abb8668a655dcff885c108c7e04e13419f6c4ba727384b3bab3dff896c177382284b8d12b98cc711225
+  metadata.gz: f062f33724bdf1d11260edb675a239ebfcd834238dc98806c638d1a9c955acb5819fd2c6fbf15f023b683b2acd68913d100c8ede7f75ca62382e783ecba429da
+  data.tar.gz: d0efcb96f544f7cdc42517cb08a067f8d1f20de1fcbf34a53af1adbdca4b127105f7e659fed11625966605bdef79d5f23433dc0e209b159a447686fa887ab75e

data/doc/batch_execution.md CHANGED

@@ -18,7 +18,9 @@ with jruby 1.7.x or later, this should be default, recommend
 you use jruby 1.7.x.
 Especially when running under a cron job, it can be difficult to
-set things up so traject runs under jruby.
+set things up so traject runs under jruby -- and then when you add
+bundler into it, things can get positively byzantine. It's not you,
+this gets confusing.
 It can sometimes be useful to create a wrapper script for traject
 that takes care of making sure it's running under the right ruby
@@ -31,8 +33,11 @@ Simply run with:
     chruby-exec jruby -- traject {other arguments}
 Whether specifying that directly in a crontab, or in a shell script
-that needs to call traject, etc. So simple you might not need
-a wrapper script, but it might still be convenient to create one. Say
+that needs to call traject, etc. In a crontab environment, it'll actually need
+you to set PATH and SHELL variables, as specified in the [chruby docs](https://github.com/postmodern/chruby/wiki/Cron)
+So simple you might not need a wrapper script, but it might still be convenient to create one. Say
 you put a `jruby-traject` at `/usr/local/bin/jruby-traject`, that
 looks like this:
@@ -40,9 +45,55 @@ looks like this:
     chruby-exec jruby -- traject "$@"
-Now any account, in a crontab, in an interactive shell, wherever,
-can just execute `jruby-traject {arguments}`, and execute traject
-in a jruby environment.
+Now you can can just execute `jruby-traject {arguments}`, and execute traject
+in a jruby environment. (In a crontab, you'll still need to fix your
+PATH and SHELL env variables for `chruby-exec` to work, either in the
+crontab or in this wrapper script)
+### chruby monster wrapper script
+I am still not sure if this is a good idea, but here's an example of
+a wrapper script for chruby that will take care of the ENV even
+when running in a crontab, use chruby-exec only if jruby isn't
+already the default ruby, and add in `bundle exec` too.
+~~~bash
+#!/usr/bin/env bash
+# A wrapper for traject that uses chruby to make sure jruby
+# is being used before calling traject, and then calls
+# traject with bundle exec from within our traject project
+# dir.
+# Make sure /usr/local/bin is in PATH for chruby-exec,
+# which it's not ordinarily in a cronjob.
+if [[ ":$PATH:" != *":/usr/local/bin:"* ]]
+then
+  export PATH=$PATH:/usr/local/bin
+fi
+# chruby needs SHELL set, which it won't be from a crontab
+export SHELL=/bin/bash
+# Find the dir based on location of this wrapper script,
+# then use that dir to cd to for the bundle exec to find
+# the right Gemfile.
+traject_dir=$(cd `dirname "${BASH_SOURCE[0]}"` && pwd)
+# do we need to use chruby to switch to jruby?
+if [[ "$(ruby -v)" == *jruby* ]]
+then
+  ruby_picker="" # nothing needed "
+else
+  ruby_picker="chruby-exec jruby --"
+fi
+cmd="BUNDLE_GEMFILE=$traject_dir/Gemfile $ruby_picker bundle exec traject $@"
+echo $cmd
+eval $cmd
+~~~
+This monster script can perhaps be adapted for rbenv or rvm.
 ### for rbenv
@@ -62,7 +113,7 @@ If you're running inside a cronjob, things get a bit trickier,
 because rbenv isn't normally set up in the limited environment
 of cron tasks. One way to deal with this is to have your
 cronjob explicitly execute in a bash login shell, that
-will then have rbenv set up so long as it's running
+will then have rbenv set up -- so long as it's running
 under an account with rbenv set up properly!
     # in a cronfile
@@ -99,6 +150,7 @@ Now any account, in a crontab, in an interactive shell, wherever,
 can just execute `jruby-traject {arguments}`, and execute traject
 in a jruby environment.
 ### Bundler too?
 If you're running with bundler too, you could make a wrapper file specific to
@@ -188,4 +240,4 @@ do whatever you can make yell, just write ruby.
 For automated batch execution, we recommend you consider using
 bundler to manage any gem dependencies. See the [Extending
 With Your Own Code](./extending.md) traject docs for
-information on how traject integrates with bundler.
+information on how traject integrates with bundler.
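
The PATH and SHELL handling that the new wrapper script does in bash could also be done from a Ruby launcher. The following is only a sketch of that idea: `ensure_path_entry` is a hypothetical helper, not part of traject or chruby, and `/usr/local/bin` is taken from the script in the diff above.

```ruby
# Hypothetical helper: append `dir` to a PATH-style string unless it is
# already one of its entries. Same intent as the bash test
# [[ ":$PATH:" != *":/usr/local/bin:"* ]] in the wrapper script.
def ensure_path_entry(path, dir)
  entries = path.to_s.split(File::PATH_SEPARATOR)
  return path.to_s if entries.include?(dir)
  (entries + [dir]).join(File::PATH_SEPARATOR)
end

# cron usually starts jobs with a minimal environment, so patch it
# before shelling out to chruby-exec.
ENV["PATH"] = ensure_path_entry(ENV["PATH"], "/usr/local/bin")
# Only set SHELL when it is missing (the bash wrapper sets it unconditionally).
ENV["SHELL"] ||= "/bin/bash"
```

Splitting on the separator rather than matching a substring avoids false positives such as `/usr/local/bin2`.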

data/doc/extending.md CHANGED

@@ -16,6 +16,7 @@ of a couple traject features meant to make it easier.
 * Traject `-I` argument command line can be used to list directories to
   add to the load path, similar to the `ruby -I` argument. You
   can then 'require' local project files from the load path.
+  * Or modify the ruby `$LOAD_PATH` manually at the top of a traject config file you are loading.
   * translation map files found in a
     "./translation_maps" subdir on the load path will be found
     for Traject translation maps.
@@ -155,7 +156,8 @@ by running `bundler init`, probably in the directory
 right next to your traject config files.
 Then specify what gems your traject project will use,
-possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html).
+possibly with version restrictions, in the [Gemfile](http://bundler.io/v1.3/gemfile.html) --
+**do** include `gem 'traject'` in the Gemfile.
 Run `bundle install` from the directory with the Gemfile, on any system
 at any time, to make sure specified gems are installed.
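
The added bullet about modifying `$LOAD_PATH` at the top of a config file might look like the sketch below. The `lib` directory name and the `my_local_macros` file are assumptions for illustration, and the sketch assumes `__FILE__` resolves to the config file's own path when traject loads it.

```ruby
# At the top of a traject config file: make ./lib (relative to this
# file) requirable. The "lib" directory name is an assumption.
local_lib = File.expand_path("lib", File.dirname(__FILE__))
$LOAD_PATH.unshift(local_lib) unless $LOAD_PATH.include?(local_lib)

# Local project files under ./lib can now be loaded by name, e.g.
# require 'my_local_macros'   # hypothetical file ./lib/my_local_macros.rb
```

This has the same effect as passing the directory with `-I` on the command line, but keeps the setup inside the config file.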

data/doc/settings.md CHANGED

@@ -47,8 +47,8 @@ settings are applied first of all. It's recommended you use `provide`.
 * `log.level`:  Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info,
               or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
-* `log.batch_size`: If set to a number N (or string representation), will output a progress line to INFO
-   log, every N records.
+* `log.batch_size`: If set to a number N (or string representation), will output a progress line to DEBUG
+   log, every N records. (use -d to turn logging to DEBUG to see.)
 * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
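
The practical effect of moving the `log.batch_size` progress line from INFO to DEBUG can be shown with a small stand-alone sketch. It uses Ruby's standard `Logger` rather than Yell, and `process_with_progress` and `progress_lines` are hypothetical names, not traject methods.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the indexer loop: emit a progress line at
# DEBUG every `batch_size` records.
def process_with_progress(records, logger, batch_size)
  count = 0
  records.each do |_record|
    count += 1
    logger.debug("read #{count} records") if batch_size && (count % batch_size == 0)
  end
  count
end

# Run the loop against an in-memory log and count the lines written.
def progress_lines(level, total, batch_size)
  io = StringIO.new
  logger = Logger.new(io)
  logger.level = level
  process_with_progress(1..total, logger, batch_size)
  io.string.lines.length
end

progress_lines(Logger::INFO, 10, 5)   # progress lines are filtered out
progress_lines(Logger::DEBUG, 10, 5)  # progress lines are written
```

With the default level of 'info' the progress lines no longer appear, which is why the updated setting description points to `-d`.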

data/lib/traject/indexer.rb CHANGED

@@ -2,11 +2,13 @@ require 'yell'
 require 'traject'
 require 'traject/qualified_const_get'
+require 'traject/thread_pool'
 require 'traject/indexer/settings'
 require 'traject/marc_reader'
 require 'traject/marc4j_reader'
 require 'traject/json_writer'
+require 'traject/solrj_writer'
 require 'traject/macros/marc21'
 require 'traject/macros/basic'
@@ -73,7 +75,7 @@ require 'traject/macros/basic'
 #  The default writer is the SolrJWriter, using Java SolrJ to
 #  write to a Solr.  A few other built-in writers are available,
 #  but it's anticipated more will be created as plugins or local
-#  code for special purposes.
+#  code for special purposes.
 #
 #  You can set alternate writers by setting a Class object directly
 #  with the #writer_class method, or by the 'writer_class_name' Setting,
@@ -167,38 +169,39 @@ class Traject::Indexer
   attr_writer :logger
-  # Just calculates the arg that's gonna be given to Yell.new
-  # or SomeLogger.new
-  def logger_argument
-    specified = settings["log.file"] || "STDERR"
-    case specified
-    when "STDOUT" then STDOUT
-    when "STDERR" then STDERR
-    else specified
-    end
-  end
-  # Second arg to Yell.new, options hash, calculated from
-  # settings
-  def logger_options
-    # formatter, default is fairly basic
+  def logger_format
     format = settings["log.format"] || "%d %5L %m"
     format = case format
-    when "false" then false
-    when "" then nil
-    else format
+      when "false" then false
+      when "" then nil
+      else format
     end
-    level = settings["log.level"] || "info"
-    {:format => format, :level => level}
   end
   # Create logger according to settings
   def create_logger
+    logger_level  = settings["log.level"] || "info"
     # log everything to STDERR or specified logfile
-    logger = Yell.new( logger_argument, logger_options )
+    logger = Yell.new
+    logger.format = logger_format
+    logger.level  = logger_level
+    logger_destination = settings["log.file"] || "STDERR"
+    # We intentionally repeat the logger_level
+    # on the adapter, so it will stay there if overall level
+    # is changed.
+    case logger_destination
+    when "STDERR"
+      logger.adapter :stderr, level: logger_level, format: logger_format
+    when "STDOUT"
+      logger.adapter :stdout, level: logger_level, format: logger_format
+    else
+      logger.adapter :file, logger_destination, level: logger_level, format: logger_format
+    end
     # ADDITIONALLY log error and higher to....
     if settings["log.error_file"]
       logger.adapter :file, settings["log.error_file"], :level => 'gte.error'
@@ -329,7 +332,7 @@ class Traject::Indexer
       if log_batch_size && (count % log_batch_size == 0)
         batch_rps = log_batch_size / (Time.now - batch_start_time)
         overall_rps = count / (Time.now - start_time)
-        logger.info "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall"
+        logger.debug "Traject::Indexer#process, read #{count} records at id:#{id_string(record)}; #{'%.0f' % batch_rps}/s this batch, #{'%.0f' % overall_rps}/s overall"
         batch_start_time = Time.now
       end
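
The settings handling in the reworked `logger_format` and `create_logger` can be read as two small mappings. The sketch below restates them with a plain Hash standing in for traject's settings object; `normalize_log_format` and `log_destination` are hypothetical names, and no Yell calls are made.

```ruby
DEFAULT_LOG_FORMAT = "%d %5L %m"

# Mirrors logger_format: the string "false" maps to false, an empty
# string maps to nil, anything else is passed through unchanged.
def normalize_log_format(settings)
  format = settings["log.format"] || DEFAULT_LOG_FORMAT
  case format
  when "false" then false
  when ""      then nil
  else format
  end
end

# Mirrors the case statement in create_logger: returns the adapter type
# followed by any argument it needs (the file path for :file).
def log_destination(settings)
  destination = settings["log.file"] || "STDERR"
  case destination
  when "STDERR" then [:stderr]
  when "STDOUT" then [:stdout]
  else [:file, destination]
  end
end

log_destination({})                                # => [:stderr]
log_destination({ "log.file" => "traject.log" })   # => [:file, "traject.log"]
```

In the diff, the level and format are also passed to the chosen adapter, which the code comment explains as keeping the level on the adapter even if the overall level is changed later.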

data/lib/traject/macros/marc_format_classifier.rb CHANGED

@@ -114,12 +114,15 @@ module Traject
       # * If it has any RDA 338, then it's print if it has a value of
       #   volume, sheet, or card.
       # * If it does not have an RDA 338, it's print if and only if it has
-      #   NO 245$h GMD.
+      #   no 245$h GMD.
       #
       # * Here at JH, for legacy reasons we also choose to not
       #   call it print if it's already been marked audio, but
       #   we do that in a different method.
       #
+      # Note that any record that has neither a 245 nor a 338rda is going
+      # to be marked print
+      #
       # This algorithm is definitely going to get some things wrong in
       # both directions, with real world data. But seems to be good enough.
       def print?
@@ -137,7 +140,7 @@ module Traject
             end
           end
         else
-          normalized_gmd.length == 0
+          normalized_gmd.length == 0
         end
       end
@@ -145,8 +148,8 @@ module Traject
       # resource. But sometimes resort to 245$h GMD too.
       def online?
         # field 007, byte 0 c="electronic" byte 1 r="remote" ==> sure Online
-        found_007 = record.find do |field|
-          field.tag == "007" && field.value.slice(0) == "c" && field.value.slice(1) == "r"
+        found_007 = record.fields('007').find do |field|
+          field.value.slice(0) == "c" && field.value.slice(1) == "r"
         end
         return true if found_007
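
The rule described in the `print?` comments, including the newly documented case of a record with neither a 245 nor an RDA 338, can be sketched independently of MARC parsing. `print_by_rule?` is a hypothetical function, and its inputs are assumed to be already extracted and normalized; this is not the classifier's actual code.

```ruby
PRINT_CARRIER_TYPES = %w[volume sheet card].freeze

# rda_338_values: carrier-type terms taken from RDA 338 fields (may be empty)
# gmd:            normalized 245$h GMD string ("" or nil when absent)
def print_by_rule?(rda_338_values, gmd)
  if rda_338_values.any?
    # With an RDA 338, print only if a print carrier type is present.
    rda_338_values.any? { |value| PRINT_CARRIER_TYPES.include?(value) }
  else
    # Without an RDA 338, print if and only if there is no GMD.
    gmd.to_s.empty?
  end
end

print_by_rule?([], "")   # the empty-record case: no 338 and no GMD
```

An otherwise empty record therefore falls through to the GMD check and counts as print, which matches the new classifier test that expects `['Print']` for `empty_record`.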

data/lib/traject/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Traject
-  VERSION = "1.0.0.beta.2"
+  VERSION = "1.0.0.beta.3"
 end

data/test/indexer/macros_marc21_semantics_test.rb CHANGED

@@ -28,6 +28,8 @@ describe "Traject::Macros::Marc21Semantics" do
     output = @indexer.map_record(@record)
     assert_equal %w{47971712},  output["oclcnum"]
+    assert_equal({}, @indexer.map_record(empty_record))
   end
   it "#marc_series_facet" do
@@ -40,6 +42,8 @@ describe "Traject::Macros::Marc21Semantics" do
     # trims punctuation too
     assert_equal ["Big bands"], output["series_facet"]
+    assert_equal({}, @indexer.map_record(empty_record))
   end
   describe "marc_sortable_author" do
@@ -54,6 +58,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Herman, Edward S.   Manufacturing consent the political economy of the mass media Edward S. Herman and Noam Chomsky ; with a new introduction by the authors"], output["author_sort"]
+      assert_equal [""], @indexer.map_record(empty_record)['author_sort']
     end
     it "respects non-filing" do
       @record = MARC::Reader.new(support_file_path  "the_business_ren.marc").to_a.first
@@ -61,6 +67,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Business renaissance quarterly [electronic resource]."], output["author_sort"]
+      assert_equal [""], @indexer.map_record(empty_record)['author_sort']
     end
   end
@@ -71,6 +79,8 @@ describe "Traject::Macros::Marc21Semantics" do
     it "works" do
       output = @indexer.map_record(@record)
       assert_equal ["Manufacturing consent : the political economy of the mass media"], output["title_sort"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "respects non-filing" do
       @record = MARC::Reader.new(support_file_path  "the_business_ren.marc").to_a.first
@@ -95,6 +105,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["English", "French", "German", "Italian", "Spanish", "Russian"], output["languages"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -108,6 +120,8 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Larger ensemble, Unspecified", "Piano", "Soprano voice", "Tenor voice", "Violin", "Larger ensemble, Ethnic", "Guitar", "Voices, Unspecified"], output["instrumentation"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -126,16 +140,29 @@ describe "Traject::Macros::Marc21Semantics" do
       @record = MARC::Reader.new(support_file_path  "louis_armstrong.marc").to_a.first
       output = @indexer.map_record(@record)
-      assert_equal ["bb01", "bb01.s", "bb", "bb.s", "oe"],
-        output["instrument_codes"]
+      assert_equal ["bb01", "bb01.s", "bb", "bb.s", "oe"], output["instrument_codes"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
   describe "publication_date" do
     # there are way too many edge cases for us to test em all, but we'll test some of em.
+    it "works when there's no date information" do
+      assert_equal nil,  Marc21Semantics.publication_date(empty_record)
+    end
+    it "uses macro correctly with no date info" do
+      @indexer.instance_eval {to_field "date", marc_publication_date }
+      assert_equal({}, @indexer.map_record(empty_record))
+    end
     it "pulls out 008 date_type s" do
       @record = MARC::Reader.new(support_file_path  "manufacturing_consent.marc").to_a.first
       assert_equal 2002, Marc21Semantics.publication_date(@record)
     end
     it "uses start date for date_type c continuing resource" do
       @record = MARC::Reader.new(support_file_path  "the_business_ren.marc").to_a.first
@@ -182,18 +209,24 @@ describe "Traject::Macros::Marc21Semantics" do
       output = @indexer.map_record(@record)
       assert_equal ["Language & Literature"], output["discipline_facet"]
     end
     it "maps to default" do
       @record = MARC::Reader.new(support_file_path  "musical_cage.marc").to_a.first
       output = @indexer.map_record(@record)
       assert_equal ["Unknown"], output["discipline_facet"]
+      assert_equal(["Unknown"], @indexer.map_record(empty_record)['discipline_facet'])
     end
     it "maps to nothing if none and no default" do
       @indexer.instance_eval {to_field "discipline_no_default", marc_lcc_to_broad_category(:default => nil)}
       @record = MARC::Reader.new(support_file_path  "musical_cage.marc").to_a.first
       output = @indexer.map_record(@record)
       assert_nil output["discipline_no_default"]
+      assert_nil @indexer.map_record(empty_record)["discipline_no_default"]
     end
     describe "LCC_REGEX" do
@@ -212,13 +245,15 @@ describe "Traject::Macros::Marc21Semantics" do
       @record = MARC::Reader.new(support_file_path  "multi_geo.marc").to_a.first
       output = @indexer.map_record(@record)
-      assert_equal ["Europe", "Middle East", "Africa, North", "Agora (Athens, Greece)", "Rome (Italy)", "Italy"],
-        output["geo_facet"]
+      assert_equal ["Europe", "Middle East", "Africa, North", "Agora (Athens, Greece)", "Rome (Italy)", "Italy"], output["geo_facet"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "maps nothing on a record with no geo" do
       @record = MARC::Reader.new(support_file_path  "manufacturing_consent.marc").to_a.first
       output = @indexer.map_record(@record)
       assert_nil output["geo_facet"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -232,6 +267,8 @@ describe "Traject::Macros::Marc21Semantics" do
       assert_equal ["Early modern, 1500-1700", "17th century", "Great Britain: Puritan Revolution, 1642-1660", "Great Britain: Civil War, 1642-1649", "1642-1660"],
         output["era_facet"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
   end
@@ -241,6 +278,7 @@ describe "Traject::Macros::Marc21Semantics" do
       str = Marc21Semantics.assemble_lcsh(field)
       assert_equal "Psychoanalysis and literature — England — History — 19th century", str
     end
     it "ignores numeric subfields" do
@@ -277,6 +315,9 @@ describe "Traject::Macros::Marc21Semantics" do
         assert output["lcsh"].length > 0, "outputs data"
         assert output["lcsh"].include?("Eliot, George, 1819-1880 — Characters"), "includes a string its supposed to"
+        assert_equal({}, @indexer.map_record(empty_record))
       end
     end
   end
@@ -292,6 +333,8 @@ describe "Traject::Macros::Marc21Semantics" do
       end
       output = @indexer.map_record(@record)
       assert_equal ['Business renaissance quarterly'], output['title_phrase']
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "works with :include_original" do
@@ -300,6 +343,7 @@ describe "Traject::Macros::Marc21Semantics" do
       end
       output = @indexer.map_record(@record)
       assert_equal ['The Business renaissance quarterly', 'Business renaissance quarterly'], output['title_phrase']
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "doesn't do anything if you don't include the first subfield" do
@@ -308,6 +352,8 @@ describe "Traject::Macros::Marc21Semantics" do
       end
       output = @indexer.map_record(@record)
       assert_equal ['[electronic resource].'], output['title_phrase']
+      assert_equal({}, @indexer.map_record(empty_record))
     end

data/test/indexer/macros_marc21_test.rb CHANGED

@@ -26,6 +26,8 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_equal ["Manufacturing consent : the political economy of the mass media /"], output["title"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "respects :first=>true option" do
@@ -36,6 +38,7 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_length 1, output["other_id"]
     end
     it "trims punctuation with :trim_punctuation => true" do
@@ -46,6 +49,8 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_equal ["Manufacturing consent : the political economy of the mass media"], output["title"]
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "respects :default option" do
@@ -70,6 +75,7 @@ describe "Traject::Macros::Marc21" do
       output = @indexer.map_record(@record)
       assert_equal ["eng"], output['lang1']
       assert_equal ["eng", "eng"], output['lang2']
+      assert_equal({}, @indexer.map_record(empty_record))
     end
     it "fails on an extra/misspelled argument to extract_marc" do

data/test/marc_format_classifier_test.rb CHANGED

@@ -10,7 +10,11 @@ def classifier_for(filename)
 end
 describe "MarcFormatClassifier" do
+  it "returns 'Print' when there's no other data" do
+    assert_equal ['Print'],  MarcFormatClassifier.new( empty_record  ).formats
+  end
   describe "genre" do
     # We don't have the patience to test every case, just a sampling
     it "says book" do

data/test/test_helper.rb CHANGED

@@ -37,6 +37,16 @@ def assert_start_with(start_with, obj, msg = nil)
   assert obj.start_with?(start_with), msg
 end
+# An empty record, for making sure extractors and macros work when
+# the fields they're looking for aren't there
+def empty_record
+  rec = MARC::Record.new
+  rec.append(MARC::ControlField.new('001', '000000000'))
+  rec
+end
 # pretends to be a SolrJ HTTPServer-like thing, just kind of mocks it up
 # and records what happens and simulates errors in some cases.
 class MockSolrServer

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: traject
 version: !ruby/object:Gem::Version
-  version: 1.0.0.beta.2
+  version: 1.0.0.beta.3
 platform: ruby
 authors:
 - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-10-17 00:00:00.000000000 Z
+date: 2013-10-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: marc
@@ -286,7 +286,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: 1.3.1
 requirements: []
 rubyforge_project:
-rubygems_version: 2.1.5
+rubygems_version: 2.1.9
 signing_key:
 specification_version: 4
 summary: Index MARC to Solr; or generally process source records to hash-like structures