traject 0.0.2 → 0.9.1

Files changed (63)
  1. data/Gemfile +4 -0
  2. data/README.md +85 -61
  3. data/Rakefile +5 -0
  4. data/bin/traject +31 -3
  5. data/doc/settings.md +74 -13
  6. data/lib/tasks/load_maps.rake +48 -0
  7. data/lib/traject/indexer/settings.rb +75 -0
  8. data/lib/traject/indexer.rb +255 -45
  9. data/lib/traject/json_writer.rb +4 -2
  10. data/lib/traject/macros/marc21.rb +18 -6
  11. data/lib/traject/macros/marc21_semantics.rb +405 -0
  12. data/lib/traject/macros/marc_format_classifier.rb +180 -0
  13. data/lib/traject/marc4j_reader.rb +160 -0
  14. data/lib/traject/marc_extractor.rb +33 -17
  15. data/lib/traject/marc_reader.rb +14 -11
  16. data/lib/traject/solrj_writer.rb +247 -9
  17. data/lib/traject/thread_pool.rb +154 -0
  18. data/lib/traject/translation_map.rb +46 -4
  19. data/lib/traject/util.rb +30 -0
  20. data/lib/traject/version.rb +1 -1
  21. data/lib/translation_maps/lcc_top_level.yaml +26 -0
  22. data/lib/translation_maps/marc_genre_007.yaml +9 -0
  23. data/lib/translation_maps/marc_genre_leader.yaml +22 -0
  24. data/lib/translation_maps/marc_geographic.yaml +589 -0
  25. data/lib/translation_maps/marc_instruments.yaml +102 -0
  26. data/lib/translation_maps/marc_languages.yaml +490 -0
  27. data/test/indexer/each_record_test.rb +34 -0
  28. data/test/indexer/macros_marc21_semantics_test.rb +206 -0
  29. data/test/indexer/macros_marc21_test.rb +10 -1
  30. data/test/indexer/map_record_test.rb +78 -8
  31. data/test/indexer/read_write_test.rb +43 -10
  32. data/test/indexer/settings_test.rb +60 -4
  33. data/test/indexer/to_field_test.rb +39 -0
  34. data/test/marc4j_reader_test.rb +75 -0
  35. data/test/marc_extractor_test.rb +62 -0
  36. data/test/marc_format_classifier_test.rb +91 -0
  37. data/test/marc_reader_test.rb +12 -0
  38. data/test/solrj_writer_test.rb +146 -43
  39. data/test/test_helper.rb +50 -0
  40. data/test/test_support/245_no_ab.marc +1 -0
  41. data/test/test_support/880_with_no_6.utf8.marc +1 -0
  42. data/test/test_support/bad_subfield_code.marc +1 -0
  43. data/test/test_support/date_resort_to_260.marc +1 -0
  44. data/test/test_support/date_type_r_missing_date2.marc +1 -0
  45. data/test/test_support/date_with_u.marc +1 -0
  46. data/test/test_support/demo_config.rb +153 -0
  47. data/test/test_support/emptyish_record.marc +1 -0
  48. data/test/test_support/louis_armstrong.marc +1 -0
  49. data/test/test_support/manuscript_online_thesis.marc +1 -0
  50. data/test/test_support/microform_online_conference.marc +1 -0
  51. data/test/test_support/multi_era.marc +1 -0
  52. data/test/test_support/multi_geo.marc +1 -0
  53. data/test/test_support/musical_cage.marc +1 -0
  54. data/test/test_support/one-marc8.mrc +1 -0
  55. data/test/test_support/online_only.marc +1 -0
  56. data/test/test_support/packed_041a_lang.marc +1 -0
  57. data/test/test_support/the_business_ren.marc +1 -0
  58. data/test/translation_map_test.rb +8 -0
  59. data/test/translation_maps/properties_map.properties +5 -0
  60. data/traject.gemspec +1 -1
  61. data/vendor/marc4j/README.md +17 -0
  62. data/vendor/marc4j/lib/marc4j-2.5.1-beta.jar +0 -0
  63. metadata +81 -2
data/Gemfile CHANGED
@@ -2,3 +2,7 @@ source 'https://rubygems.org'
 
  # Specify your gem's dependencies in traject.gemspec
  gemspec
+
+ group :development do
+   gem "nokogiri" # used only for rake tasks load_maps:
+ end
data/README.md CHANGED
@@ -9,16 +9,16 @@ them somewhere.
 
  ## Background/Goals
 
- Based on both the successes and failures of previous MARC indexing attempts -- including the venerable SolrMarc which we greatly appreciate and from which we've learned a lot -- I decided that to create a solution that worked at remaining pain points, I needed jruby -- ruby on the JVM.
+ Existing tools for indexing Marc to Solr have served many of us for many years, but I was having more and more difficulty working with them, and difficulty providing the custom logic I needed in a maintainable way. I realized that for me, to create a tool with the flexibility, maintainability, and performance I wanted, I would need to do it in jruby (ruby on the JVM).
 
- Traject aims to:
+ Some goals:
 
- * Be simple and straightforward for simple use cases, hopefully being accessible even to non-rubyists, although it's in ruby
- * Be composed of modular and re-composible elements, to provide flexibility for non-common use cases. You should be able to
-   do your own thing wherever you want, without having to give up
-   the already implemented parts you still do want, mixing and matching at will.
- * Easily support re-use and sharing of mapping rules, within an installation and between organizations.
- * Have a parsimonious or 'elegant' internal architecture, only a few architectural concepts to understand that everything else is built on, to hopefully make the internals easy to work with and maintain.
+ * Aim to be accessible even to non-rubyists
+ * Concise and maintainable local configuration -- including an only gradual increase in difficulty to write your own simple logic.
+ * Support reusable and shareable mapping logic routines.
+ * Built of modular and composable elements: if you want to change part of what traject does, you should be able to do so without having to reimplement other things you don't want to change.
+ * A maintainable internal architecture, well-factored with separated concerns and DRY logic. Aim to be comprehensible to newcomer developers, and well-covered by tests.
+ * High performance, using multi-threaded concurrency where appropriate to maximize throughput. Actual throughput depends on the complexity of your mapping rules and the capacity of your server(s), but I am getting throughput 2-5x greater than with previous solutions.
 
 
  ## Installation
@@ -57,26 +57,42 @@ in a config file:
 
  settings do
    # Where to find solr server to write to
-   store "solr.url", "http://example.org/solr"
+   provide "solr.url", "http://example.org/solr"
+
+   # If you are connecting to Solr 1.x, you need to set
+   # for SolrJ compatibility:
+   # provide "solrj_writer.parser_class_name", "XMLResponseParser"
 
    # solr.version doesn't currently do anything, but set it
    # anyway, in the future it will warn you if you have settings
    # that may not work with your version.
-   store "solr.version", "4.3.0"
+   provide "solr.version", "4.3.0"
 
    # default source type is binary, traject can't guess
    # you have to tell it.
-   store "marc_source.type", "xml"
+   provide "marc_source.type", "xml"
 
    # settings can be set on command line instead of
    # config file too.
 
    # various others...
-   store "solrj_writer.commit_on_close", "true"
+   provide "solrj_writer.commit_on_close", "true"
+
+   # By default, we use the Traject::Marc4JReader, which
+   # can read marc8 and ISO8859_1 -- if your records are all in UTF8,
+   # the pure-ruby MarcReader may be faster...
+   # provide "reader_class_name", "Traject::MarcReader"
+   # If you ARE using the Marc4JReader, it defaults to "BESTGUESS"
+   # as to encoding when reading binary, you may want to tell it instead
+   provide "marc4j_reader.source_encoding", "MARC8" # or UTF-8 or ISO8859_1
  end
  ~~~
 
- See, docs page on [Settings][./doc/settings.md] for list
+ `provide` will only set the key if it was previously unset, so the first
+ setting wins, and the command line comes first of all and overrides everything.
+ You can also use `store` if you want to force-set; last set wins.
+
+ See the docs page on [Settings](./doc/settings.md) for a list
  of all standardized settings.
 
  ### Indexing Rules
@@ -158,7 +174,7 @@ for mapping form MARC codes to user-displayable strings. See Traject::Translatio
 
  #### Direct indexing logic vs. Macros
 
- It turns out all those functions we saw above used with `to_field` -- `literal`, `serialized_marc`, `extract_all_marc_values, and `extract_marc` -- are what Traject calls 'macros'.
+ It turns out all those functions we saw above used with `to_field` -- `literal`, `serialized_marc`, `extract_all_marc_values`, and `extract_marc` -- are what Traject calls 'macros'.
 
  They are all actually built based upon a more basic element of
  indexing functionality, which you can always drop down to, and
@@ -178,7 +194,8 @@ used to define a block of logic that can be stored and executed later. When the
 
  The third argument is a `Traject::Indexer::Context` object that can
  be used for more advanced functionality, including caching expensive
- per-record calculations, writing out to more than one output field at a time (TODO example), or taking account of current Traject Settings in your logic.
+ per-record calculations, writing out to more than one output field at a time, or taking account of current Traject Settings in your logic. The third argument is optional; you can supply
+ a two-argument block too.
 
  You can always drop out to this basic direct use whenever you need
  special purpose logic, directly in the config file, writing in
@@ -197,12 +214,12 @@ end
  # marc_extract does, you may want to use the Traject::MarcExtractor
  # class
  to_field "weirdo" do |record, accumulator, context|
-   list = MarcExtractor.new(record, "700a").extract
+   list = MarcExtractor.extract_by_spec(record, "700a")
    # combine all the 700a's in ONE string, cause we're weird
    list = list.join(" ")
    accumulator << list
  end
- ~~~~
+ ~~~
 
  You can also *combine* a macro and a direct block for some
  post-processing. In this case, the `accumulator` parameter
@@ -220,6 +237,54 @@ If you find yourself repeating code a lot in direct blocks, you
  can supply your _own_ macros, for local use, or even to share
  with others in a ruby gem. See docs [Macros](./doc/macros.md)
 
+ #### each_record
+
+ There is also a method `each_record`, which is like `to_field`, but without
+ a specific field. It can be used for other side-effects of your choice, or
+ even for writing to multiple fields.
+
+ ~~~ruby
+ each_record do |record, context|
+   # example of writing to two fields at once.
+   (x, y) = Something.do_stuff
+   (context["one_field"] ||= []) << x
+   (context["another_field"] ||= []) << y
+ end
+ ~~~
+
+ You could write or use macros for `each_record` too. It's suggested that
+ such a macro take the field names it will affect as arguments (example?)
+
+ `each_record` and `to_field` calls will be processed in one single sequence, guaranteed
+ to run in the order they were defined.
+
+ ~~~ruby
+ to_field("foo") {...} # will be called first on each record
+ each_record {...}     # will always be called AFTER above has potentially added values
+ to_field("foo") {...} # and will be called after each of the preceding for each record
+ ~~~
+
+ #### Built-in MARC21 Semantics
+
+ There is another package of 'macros' that comes with Traject for extracting semantics
+ from Marc21. These are sometimes 'opinionated', using heuristics or algorithms
+ that are not inherently part of Marc21, but have proven useful in actual practice.
+
+ It's not loaded by default; you can use straight ruby `require` and `extend`
+ to load the macros into the indexer.
+
+ ~~~ruby
+ # in a traject config file, extend so we can use methods from...
+ require 'traject/macros/marc21_semantics'
+ extend Traject::Macros::Marc21Semantics
+
+ to_field "date", marc_publication_date
+ to_field "author_sort", marc_sortable_author
+ to_field "inst_facet", marc_instrumentation_humanized
+ ~~~
+
+ See the documented list of macros available in [Marc21Semantics](./lib/traject/macros/marc21_semantics.rb)
+
  ## Command Line
 
  The simplest invocation is:
@@ -241,8 +306,7 @@ If you leave off the marc_file, traject will try to read from stdin. You can onl
  cat some/dir/*.marc | traject -c conf_file.rb
 
  You can set any setting on the command line with `-s key=value`.
- This will over-ride any settings from conf files. (TODO, I don't
- think over-riding works, it's actually a bit tricky)
+ This will over-ride any settings set with `provide` in conf files.
 
  traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solr.url=http://example.com/solr -s solrj_writer.commit_on_close=true
 
@@ -292,46 +356,6 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
 
  ## TODO
 
- * Logging
-   * it's doing no logging of it's own
-   * It's not properly setting up the solrj logging
-   * Making solrj and it's own logging go to same place, accross jruby bridge, not sure
-     (I want all of this code BUT the Solr writing stuff to be usable under MRI too,
-     I want to repurpose the mapping code for DISPLAY too)
-
- * Error handling. Related to logging. Catch errors indexing
-   particular records, make
-   sure they are logged in an obvious place, make sure processing proceeds with other
-   records (if it should!) etc.
-
- * Distro and the SolrJ jars. Right now the SolrJ jars are included in the gem (although they
-   aren't actually loaded until you try to use the SolrJWriter). This is not neccesarily
-   best. other possibilities:
-   * Put them in their own gem
-   * Make the end-user download them theirselves, possibly providing the ivy.xml's to do so for
-     them.
-
- * Various performance improvements, this is not optimized yet. Some improvements
-   may challenge architecture, when they involve threading.
-   * Profile and optimize marc loading -- right now just using ruby-marc, always.
-   * Profile/optimize marc serialization back to stored filed, right now it uses
-     known-to-be-slow rexml as part of ruby-marc.
-   * Use threads for the mapping step? With celluloid, or threach, or other? Does
-     this require thinking more about thread safety of existing code?
-   * Use threads for writing to solr?
-     * I am not sure about using the solrj ConcurrentUpdateSolrServer -- among other
-       things, it seems to swallow solr errors, that i'm not sure we want to do.
-     * But we can batch docs ourselves before HttpServer#add'ing them -- every
-       solrj HTTPServer#add is an http transaction, but you can give it an ARRAY
-       to load multiple at once -- and still get the errors, I think. (Have to test)
-       Could be perf nearly as good as concurrentupdate? Or do that, but then make each
-       HttpServer#add in one of our own manual threads (Celluloid? Or raw?), so
-       continued processing doesn't block?
-
- * Reading Marc8. It can't do it yet. Easiest way would be using Marc4j to read, or using it as a transcoder anyway. Don't really want to write marc8 transcoder in ruby.
-
- * We need something like `to_field`, but without actually being
-   for mapping to a specific output field. For generic pre or post-processing, or multi-output-field logic. `before_record do &block`, `after_record do &block` , `on_each_record do &block`, one or more of those.
 
  * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?
 
@@ -340,8 +364,8 @@ for mapping to a specific output field. For generic pre or post-processing, or m
  * Either way, all optional/configurable of course. based
    on Settings.
 
- * More macros. Not all the built-in functionality that comes with SolrMarc is here yet. It can be provided as macros, either built in, or distro'd in other gems. If really needed as macros, and not just something local configs build themselves as needed out of the parts already here.
-
  * Command line code. It's only 150 lines, but it's kind of messy
    jammed into one file *and lacks tests*. I couldn't figure out
    what to do with it or how to test it. Needs a bit of love.
+
+ * Optional built-in jetty stop/start to allow indexing to Solr that wasn't running before. maybe https://github.com/projecthydra/jettywrapper ?
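
Taken together, the README changes above describe a fairly complete config file for this release. The following is a minimal sketch of one, hedged as an illustration: the Solr URL and the `title_t`, `pub_date`, and `author_facet` field names are assumptions, not part of the gem, and the exact MARC specs are examples.

~~~ruby
# demo_config.rb -- run with: traject -c demo_config.rb your_records.mrc
require 'traject/macros/marc21_semantics'
extend Traject::Macros::Marc21Semantics

settings do
  provide "solr.url", "http://localhost:8983/solr"   # illustrative URL
  provide "solrj_writer.commit_on_close", "true"
  provide "marc4j_reader.source_encoding", "MARC8"
end

to_field "id",       extract_marc("001")
to_field "title_t",  extract_marc("245ab")
to_field "pub_date", marc_publication_date

# a macro combined with a direct block for post-processing
to_field "author_facet", extract_marc("100a:700a") do |record, accumulator, context|
  accumulator.map! { |value| value.sub(/[.,]\s*\z/, '') } # trim trailing punctuation
end

# each_record used for a side-effect-style rule, mirroring the README example
each_record do |record, context|
  (context["leader_type"] ||= []) << record.leader[6]
end
~~~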
data/Rakefile CHANGED
@@ -14,3 +14,8 @@ Rake::TestTask.new do |t|
    t.pattern = 'test/**/*_test.rb'
    t.libs.push 'test', 'test_support'
  end
+
+ # Not documented well, but this seems to be
+ # the way to load rake tasks from other files
+ #import "lib/tasks/load_map.rake"
+ Dir.glob('lib/tasks/*.rake').each { |r| import r}
data/bin/traject CHANGED
@@ -14,13 +14,16 @@ require 'traject'
  require 'traject/indexer'
 
 
+ orig_argv = ARGV.dup
+
 
  opts = Slop.new(:strict => true) do
    banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
 
    on 'v', 'version', "print version information to stderr"
+   on 'd', 'debug', "Include debug log, -s log.level=debug"
    on 'h', 'help', "print usage information to stderr"
-   on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array, :required => true
+   on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
    on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
    on :r, :reader, "Set reader class, shortcut for `-s reader_class_name=*`", :argument => true
    on :w, :writer, "Set writer class, shortcut for `-s writer_class_name=*`", :argument => true
@@ -48,11 +51,12 @@ options = opts.to_hash
 
  if options[:version]
    $stderr.puts "traject version #{Traject::VERSION}"
+   exit 1
  end
 
  if options[:help]
    $stderr.puts opts.help
-   exit 0
+   exit 1
  end
 
  # have to use Slop object to tell diff between
@@ -87,6 +91,10 @@ settings = {}
    end
  end
 
+
+ if options[:debug]
+   settings["log.level"] = "debug"
+ end
  if options[:writer]
    settings["writer_class_name"] = options[:writer]
  end
@@ -112,6 +120,14 @@ end
  indexer = Traject::Indexer.new
  indexer.settings( settings )
 
+ unless options[:conf] && options[:conf].length > 0
+   $stderr.puts "Error: Missing required configuration file"
+   $stderr.puts "Exiting..."
+   $stderr.puts
+   $stderr.puts opts.help
+   exit 2
+ end
+
  options[:conf].each do |conf_path|
    begin
      indexer.instance_eval(File.open(conf_path).read, conf_path)
@@ -128,6 +144,14 @@ options[:conf].each do |conf_path|
    end
  end
 
+ ## SAFE TO LOG STARTING HERE.
+ #
+ # Shouldn't log before config files are read above, because
+ # config files set up logger
+ ##############
+ indexer.logger.info("executing with arguments: `#{orig_argv.join(' ')}`")
+
+
  # ARGF might be perfect for this, but problems with it include:
  # * jruby is broken, no way to set it's encoding, leads to encoding errors reading non-ascii
  #   https://github.com/jruby/jruby/issues/891
@@ -145,9 +169,13 @@ if ARGV.length > 1
    exit 1
  end
  if ARGV.length == 0
+   indexer.logger.info "Reading from STDIN..."
    io = $stdin
  else
+   indexer.logger.info "Reading from #{ARGV.first}"
    io = File.open(ARGV.first, 'r')
  end
 
- indexer.process(io)
+ result = indexer.process(io)
+
+ exit 1 unless result # non-zero exit status on process telling us there's problems.
data/doc/settings.md CHANGED
@@ -6,29 +6,90 @@ used for grouping and namespacing.
 
  Values are usually strings, but occasionally something else.
 
- Settings can be set in configuration files, or on the command
- line.
+ Settings can be set in configuration files, usually like:
+
+ ~~~ruby
+ settings do
+   provide "key", "value"
+ end
+ ~~~
+
+ or on the command line: `-s key=value`. There are also some command line shortcuts
+ for commonly used settings, see `traject -h`.
 
  ## Known settings
 
- * json_writer.pretty_print: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
+ * `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Note,
+   yes, this is fixed to STDERR, regardless of your logging setup.
+   * `.` for every batch of records read and parsed
+   * `^` for every batch of records batched and queued for adding to solr
+     (possibly in thread pool)
+   * `%` for completion of a Solr 'add'
+   * `!` when threadpool for solr add has a full queue, so solr add is
+     going to happen in calling queue -- means solr adding can't
+     keep up with production.
+
+ * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
+
+ * `log.file`: filename to send logging, or 'STDOUT' or 'STDERR' for those streams. Default STDERR
+
+ * `log.error_file`: Default nil, if set then all log lines of ERROR and higher will be _additionally_
+   sent to error file named.
+
+ * `log.format`: Formatting string used by Yell logger. https://github.com/rudionrails/yell/wiki/101-formatting-log-messages
+
+ * `log.level`: Log this level and above. Default 'info', set to eg 'debug' to get potentially more logging info,
+   or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level
 
- * marc_source.type: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
+ * `log.batch_progress`: If set to a number N (or string representation), will output a progress line to INFO
+   log, every N records.
 
- * reader_class_name: a Traject Reader class, used by the indexer as a source of records. Default Traject::MarcReader. See Traject::Indexer for more info. Command-line shortcut `-r`
+ * `marc_source.type`: default 'binary'. Can also set to 'xml' or (not yet implemented todo) 'json'. Command line shortcut `-t`
 
- * solr.url: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
+ * `marc4j_reader.jar_dir`: Path to a directory containing Marc4J jar file to use. All .jar's in dir will
+   be loaded. If unset, uses marc4j.jar bundled with traject.
 
- * solrj.jar_dir: SolrJWriter needs to load Java .jar files with SolrJ. It will load from a packaged SolrJ, but you can load your own SolrJ (different version etc) by specifying a directory. All *.jar in directory will be loaded.
+ * `marc4j_reader.permissive`: Used by Marc4JReader only when marc.source_type is 'binary', boolean, argument to the underlying MarcPermissiveStreamReader. Default true.
 
- * solr.version: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control
+ * `marc4j_reader.source_encoding`: Used by Marc4JReader only when marc.source_type is 'binary', encoding strings accepted
+   by marc4j MarcPermissiveStreamReader. Default "BESTGUESS", also "UTF-8", "MARC8"
+
+ * `processing_thread_pool`: Default 3. Main thread pool used for processing records with input rules. Choose a
+   pool size based on size of your machine, and complexity of your indexing rules.
+   Probably no reason for it ever to be more than number of cores on indexing machine.
+   But this is the first thread_pool to try increasing for better performance on a multi-core machine.
+
+   A pool here can sometimes result in multi-threaded committing to Solr too with the
+   SolrJWriter, as processing worker threads will do their own commits to solr if the
+   solrj_writer.thread_pool is full. Having a multi-threaded pool here can help even out throughput
+   through Solr's pauses for committing too.
+
+ * `reader_class_name`: a Traject Reader class, used by the indexer as a source of records. Default Traject::Marc4jReader. If you don't need to read marc binary with Marc8 encoding, the pure ruby MarcReader may give you better performance. Command-line shortcut `-r`
+
+ * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
+
+ * `solrj.jar_dir`: SolrJWriter needs to load Java .jar files with SolrJ. It will load from a packaged SolrJ, but you can load your own SolrJ (different version etc) by specifying a directory. All *.jar in directory will be loaded.
+
+ * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control
  change some default settings, and/or sanity check and warn you if you're doing something
- that might not work with that version of solr. Set now for help in the future.
+ that might not work with that version of solr. Set now for help in the future.
+
+ * `solrj_writer.batch_size`: size of batches that SolrJWriter will send docs to Solr in. Default 200. Set to nil,
+   0, or 1, and SolrJWriter will do one http transaction per document, no batching.
+
+ * `solrj_writer.commit_on_close`: default false, set to true to have SolrJWriter send an explicit commit message to Solr after indexing.
 
- * solrj_writer.commit_on_close: default false, set to true to have SolrJWriter send an explicit commit message to Solr after indexing.
+ * `solrj_writer.parser_class_name`: Set to "XMLResponseParser" or "BinaryResponseParser". Will be instantiated and passed to the solrj.SolrServer with setResponseParser. Default nil, use SolrServer default. To talk to a solr 1.x, you will want to set to "XMLResponseParser"
 
- * solrj_writer.parser_class_name: Set to "XMLResponseParser" or "BinaryResponseParser". Will be instantiated and passed to the solrj.SolrServer with setResponseParser. Default nil, use SolrServer default. To talk to a solr 1.x, you will want to set to "XMLResponseParser"
+ * `solrj_writer.server_class_name`: String name of a solrj.SolrServer subclass to be used by SolrJWriter. Default "HttpSolrServer"
 
- * solrj_writer.server_class_name: String name of a solrj.SolrServer subclass to be used by SolrJWriter. Default "HttpSolrServer"
+ * `solrj_writer.thread_pool`: Defaults to 1 (single bg thread). A thread pool is used for submitting docs
+   to solr. Set to 0 or nil to disable threading. Set to 1,
+   there will still be a single bg thread doing the adds.
+   May make sense to set higher than number of cores on your
+   indexing machine, as these threads will mostly be waiting
+   on Solr. Speed/capacity of your solr might be more relevant.
+   Note that processing_thread_pool threads can end up submitting
+   to solr too, if solrj_writer.thread_pool is full.
 
- * writer_class_name: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJWriter, also available Traject::JsonWriter. See Traject::Indexer for more info. Command line shortcut `-w`
+ * `writer_class_name`: a Traject Writer class, used by indexer to send processed dictionaries off. Default Traject::SolrJWriter, also available Traject::JsonWriter. See Traject::Indexer for more info. Command line shortcut `-w`
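
As a worked example of the settings documented above, a throughput-oriented configuration might look like the sketch below; the specific numbers, URL, and file names are illustrative starting points, not recommendations from the gem.

~~~ruby
settings do
  provide "solr.url", "http://localhost:8983/solr"   # illustrative

  # concurrency knobs: the processing pool first, then the solr submit pool
  provide "processing_thread_pool",   4
  provide "solrj_writer.thread_pool", 2
  provide "solrj_writer.batch_size",  200   # docs per HTTP add

  # logging
  provide "log.file",           "traject.log"
  provide "log.error_file",     "traject_errors.log"
  provide "log.batch_progress", 10_000      # INFO progress line every N records
end
~~~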
data/lib/tasks/load_maps.rake ADDED
@@ -0,0 +1,48 @@
+ require 'net/http'
+ require 'open-uri'
+
+
+
+
+ namespace :load_maps do
+
+   desc "Load MARC geo codes by screen-scraping LC"
+   task :marc_geographic do
+     begin
+       require 'nokogiri'
+     rescue LoadError => e
+       $stderr.puts "\n load_maps:marc_geographic task requires nokogiri"
+       $stderr.puts " Try `gem install nokogiri` and try again. Exiting...\n\n"
+       exit 1
+     end
+
+     source_url = "http://www.loc.gov/marc/geoareas/gacs_code.html"
+
+     filename = ENV["OUTPUT_TO"] || File.expand_path("../../translation_maps/marc_geographic.yaml", __FILE__)
+     file = File.open( filename, "w:utf-8" )
+
+     $stderr.puts "Writing to `#{filename}` ..."
+
+     html = Nokogiri::HTML(open(source_url).read)
+
+     file.puts "# Translation map for marc geographic codes constructed by `rake load_maps:marc_geographic` task"
+     file.puts "# Scraped from #{source_url} at #{Time.now}"
+     file.puts "# Intentionally includes discontinued codes."
+
+     file.puts "\n"
+     html.css("tr").each do |line|
+       code = line.css("td.code").inner_text.strip
+       unless code.nil? || code.empty?
+         code.gsub!(/^\-/, '') # treat discontinued code like any other
+
+         label = line.css("td[2]").inner_text.strip
+
+         label.gsub!(/\n */, ' ') # get rid of newlines that file now sometimes contains, bah.
+         label.gsub!("'", "''")   # yaml escapes single-quotes by doubling them, weird but true.
+
+         file.puts "'#{code}': '#{label}'"
+       end
+     end
+     $stderr.puts "Done."
+   end
+ end
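
The task above regenerates lib/translation_maps/marc_geographic.yaml. A hypothetical indexing rule using that map might look like the sketch below; the `geographic_facet` field name and the trailing-hyphen cleanup are assumptions for illustration, and Traject::TranslationMap is assumed to support hash-style lookup.

~~~ruby
# sketch for a traject config file
geo_map = Traject::TranslationMap.new("marc_geographic")

to_field "geographic_facet" do |record, accumulator, context|
  codes = Traject::MarcExtractor.extract_by_spec(record, "043a")
  # 043$a values are usually padded with trailing hyphens ("n-us---");
  # strip the padding before looking the code up in the scraped map.
  accumulator.concat codes.collect { |code| geo_map[code.gsub(/\-+\s*\z/, '')] }.compact
end
~~~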
data/lib/traject/indexer/settings.rb ADDED
@@ -0,0 +1,75 @@
+ require 'hashie'
+
+ # A Hash of settings for a Traject::Indexer, which also ends up passed along
+ # to other objects Traject::Indexer interacts with.
+ #
+ # Enhanced with a few features from Hashie, to make it for
+ # instance string/symbol indifferent
+ #
+ # #provide(key, value) is added, to do like settings[key] ||= value,
+ # set only if not already set (but unlike ||=, nil or false can count as already set)
+ #
+ # Also has an interesting 'defaults' system, meant to play along
+ # with configuration file 'provide' statements. There is a built-in hash of
+ # defaults, which will be lazily filled in if accessed and not yet
+ # set. (nil can count as set, though!). If they haven't been lazily
+ # set yet, then #provide will still fill them in. But you can also call
+ # fill_in_defaults! to fill all defaults in, if you know configuration
+ # files have all been loaded, and want to fill them in for inspection.
+ class Traject::Indexer
+   class Settings < Hash
+     include Hashie::Extensions::MergeInitializer # can init with hash
+     include Hashie::Extensions::IndifferentAccess
+
+     # Hashie bug Issue #100 https://github.com/intridea/hashie/pull/100
+     alias_method :store, :indifferent_writer
+
+     def initialize(*args)
+       super
+       self.default_proc = lambda do |hash, key|
+         if self.class.defaults.has_key?(key)
+           return hash[key] = self.class.defaults[key]
+         else
+           return nil
+         end
+       end
+     end
+
+     # a cautious store, which only saves key=value if
+     # there was not already a value for #key. Can be used
+     # to set settings that can be overridden on command line,
+     # or general first-set-wins settings.
+     def provide(key, value)
+       unless has_key? key
+         store(key, value)
+       end
+     end
+
+     # reverse_merge copied from ActiveSupport, pretty straightforward,
+     # modified to make sure we return a Settings
+     def reverse_merge(other_hash)
+       self.class.new(other_hash).merge(self)
+     end
+
+     def reverse_merge!(other_hash)
+       replace(reverse_merge(other_hash))
+     end
+
+     def fill_in_defaults!
+       self.reverse_merge!(self.class.defaults)
+     end
+
+     def self.defaults
+       @@defaults ||= {
+         "reader_class_name" => "Traject::Marc4JReader",
+         "writer_class_name" => "Traject::SolrJWriter",
+         "marc_source.type" => "binary",
+         "marc4j_reader.permissive" => true,
+         "marc4j_reader.source_encoding" => "BESTGUESS",
+         "solrj_writer.batch_size" => 200,
+         "solrj_writer.thread_pool" => 1,
+         "processing_thread_pool" => 3
+       }
+     end
+   end
+ end
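
A small stand-alone sketch of how the provide/store and lazy-defaults behavior of the class above plays out, assuming the gem's usual require entry points; the values used are illustrative.

~~~ruby
require 'traject'
require 'traject/indexer'

settings = Traject::Indexer::Settings.new("solr.url" => "http://example.org/solr")

settings.provide("solr.url", "http://other.example.org/solr") # ignored: key already present
settings.provide("marc_source.type", "xml")                   # set: key was absent
settings.store("marc_source.type", "binary")                  # force-set: last write wins

settings[:"marc_source.type"]       # => "binary"  (string/symbol indifferent access)
settings["solrj_writer.batch_size"] # => 200, lazily filled in from Settings.defaults
~~~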