traject 3.1.0.rc1 → 3.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 06c28d37f9aafafe709a146c7612e5b5d8a5c58a61fd1502823a38dc52b9d05b
4
- data.tar.gz: 2e38b2b8c4030456f3757ae6062231268110d68ef07e10cab722b4074ccd570c
3
+ metadata.gz: 01ca968682bb3fc2a8313131ef6344bfc9e5418b767b2900c3d799caa356d016
4
+ data.tar.gz: a3fd6c9a3bec88c6ba592500ea170357b059533f28c5ba3fb2fe72de39702a2a
5
5
  SHA512:
6
- metadata.gz: 04561a77a3e6f2073198983b5bf7d4e35cc9f52bccc1211487cc4c850b0f0b0fc9395a7c87e6ed90061f4a15af57516434d260c649fbc43ea65a0c6435194818
7
- data.tar.gz: c7312156c3be556218e319e35ae76aa97fbae5fad6720dbce2e4a046ec90603f5de34fe2cb055425fb3da499922fba50c7d4a6445858793bb0a4fb26cf8f7b29
6
+ metadata.gz: 93547927e90b7947588c1983bd37de4651722bebaff7aaad2d3965ec46eed8c647971f1ce093beb86a5c47c2efa999516d00fb6956682c98913cd54ea5a1a2b8
7
+ data.tar.gz: 128ea6e2517711f2324541215f814430f963b0042ae321ebc3f29646398990dc09d2a307acc32fb89a70142d74763fdfee6a61d220043622b5b8b11fbcd645d8
@@ -0,0 +1,26 @@
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ branches: [ master ]
6
+ pull_request:
7
+ branches: ['**']
8
+
9
+ jobs:
10
+ tests:
11
+ runs-on: ubuntu-latest
12
+ strategy:
13
+ fail-fast: false
14
+ matrix:
15
+ ruby: [ '2.4', '2.5', '2.6', '2.7', 'jruby-9.1', 'jruby-9.2' ]
16
+ name: Ruby ${{ matrix.ruby }}
17
+ steps:
18
+ - uses: actions/checkout@v2
19
+ - name: Set up Ruby
20
+ uses: ruby/setup-ruby@v1
21
+ with:
22
+ ruby-version: ${{ matrix.ruby }}
23
+ - name: Install dependencies
24
+ run: bundle install --jobs 4 --retry 3
25
+ - name: Run tests
26
+ run: bundle exec rake
data/CHANGES.md CHANGED
@@ -1,5 +1,47 @@
1
1
  # Changes
2
2
 
3
+ ## Next
4
+
5
+ *
6
+
7
+ *
8
+
9
+ *
10
+
11
+ ## 3.5.0
12
+
13
+ * `traject -v` and `traject -h` correctly return 0 exit code indicating success.
14
+
15
+ * upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
16
+
17
+ * the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
18
+
19
+
20
+ ## 3.4.0
21
+
22
+ * XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
23
+
24
+ ## 3.3.0
25
+
26
+ * `Traject::Macros::Marc21Semantics.publication_date` now gets date from 264 before 260. https://github.com/traject/traject/pull/233
27
+
28
+ * Allow hashie 4.x in gemspec https://github.com/traject/traject/pull/234
29
+
30
+ * Allow `http` gem 4.x versions. https://github.com/traject/traject/pull/236
31
+
32
+ * Can now call class-level Indexer.configure multiple times https://github.com/sciencehistory/scihist_digicoll/pull/525
33
+
34
+ ## 3.2.0
35
+
36
+ * NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogiri to parse in strict mode, so it will immediately raise on ill-formed XML, instead of nokogiri's default to do what it can with it. https://github.com/traject/traject/pull/226
37
+
38
+ * SolrJsonWriter
39
+
40
+ * Utility method `delete_all!` sends a delete all query to the Solr URL endpoint. https://github.com/traject/traject/pull/227
41
+
42
+ * Allow basic auth configuration of the default http client via `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password`. https://github.com/traject/traject/pull/231
43
+
44
+
3
45
  ## 3.1.0
4
46
 
5
47
  ### Added
@@ -24,6 +66,10 @@
24
66
 
25
67
  * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to HTTPClient instance. https://github.com/traject/traject/pull/219
26
68
 
69
+ * Only runs thread pool shutdown code (and logging) if there is a `solr_writer.batch_size` greater than 0. Keep it out of the logs if it was a no-op anyway.
70
+
71
+ * Logs at DEBUG level every time it sends an update request to solr
72
+
27
73
  * Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using Jruby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find now way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
28
74
 
29
75
  * LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
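Note on the 3.5.0 entry about credentials embedded in `solr.url`: the sketch below shows how standard URI syntax carries basic auth credentials, using only Ruby's `uri` library. The helper name `basic_auth_from_url` is hypothetical and is not part of traject; how SolrJsonWriter actually consumes the credentials is not shown in this diff.

```ruby
require "uri"

# Hypothetical helper, not part of traject: pull basic auth credentials
# out of a URL such as the one given in `solr.url`.
def basic_auth_from_url(url)
  uri = URI.parse(url)
  # URI#user and URI#password are nil when the URL has no userinfo part.
  { user: uri.user, password: uri.password, host: uri.host, path: uri.path }
end

with_auth    = basic_auth_from_url("http://user:pass@example.org/solr")
without_auth = basic_auth_from_url("http://example.org:8983/solr")

puts with_auth[:user].inspect
puts without_auth[:user].inspect
```

Credentials containing reserved characters would need percent-encoding in the URL; that case is left out of this sketch.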
data/README.md CHANGED
@@ -9,7 +9,7 @@ Traject can also be generalized to a set of tools for getting structured data fr
9
9
  **Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
10
10
 
11
11
  [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
12
- [![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)
12
+ [![CI Status](https://github.com/traject/traject/workflows/CI/badge.svg?branch=master)](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
13
13
 
14
14
 
15
15
  ## Background/Goals
@@ -175,6 +175,8 @@ TranslationMap use above is just one example of a transformation macro, that tra
175
175
  * `append("--after each value")`
176
176
  * `gsub(/regex/, "replacement")`
177
177
  * `split(" ")`: take values and split them, possibly result in multiple values.
178
+ * `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
179
+ eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })`
178
180
 
179
181
  You can add on as many transformation macros as you want, they will be applied to output in order.
180
182
 
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
83
83
  ### Writing to solr
84
84
 
85
85
  * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
86
- * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
86
+
87
+ * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
87
88
 
88
89
  * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
89
90
 
@@ -93,6 +94,9 @@ settings are applied first of all. It's recommended you use `provide`.
93
94
 
94
95
  * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
95
96
 
97
+ * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
98
+ auth credentials in `solr.url` using standard URI syntax.
99
+
96
100
 
97
101
  ### Dealing with MARC data
98
102
 
data/doc/xml.md CHANGED
@@ -72,6 +72,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
72
72
  to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
73
73
  ```
74
74
 
75
+ ### selecting attribute values
76
+
77
+ Just works, using xpath syntax for selecting an attribute:
78
+
79
+
80
+ ```ruby
81
+ # gets status value in: <oai:header status="something">
82
+ to_field "status", extract_xpath("//oai:record/oai:header/@status")
83
+ ```
84
+
75
85
 
76
86
  ### selecting non-text nodes
77
87
 
@@ -29,10 +29,10 @@ module Traject
29
29
  self.console = $stderr
30
30
 
31
31
  self.orig_argv = argv.dup
32
- self.remaining_argv = argv
33
32
 
34
- self.slop = create_slop!
35
- self.options = parse_options(self.remaining_argv)
33
+ self.slop = create_slop!(argv)
34
+ self.options = self.slop
35
+ self.remaining_argv = self.slop.arguments
36
36
  end
37
37
 
38
38
  # Returns true on success or false on failure; may also raise exceptions;
@@ -40,11 +40,11 @@ module Traject
40
40
  def execute
41
41
  if options[:version]
42
42
  self.console.puts "traject version #{Traject::VERSION}"
43
- return
43
+ return true
44
44
  end
45
45
  if options[:help]
46
- self.console.puts slop.help
47
- return
46
+ self.console.puts slop.to_s
47
+ return true
48
48
  end
49
49
 
50
50
 
@@ -179,11 +179,11 @@ module Traject
179
179
  end
180
180
 
181
181
  def arg_check!
182
- if options[:command] == "process" && (options[:conf].nil? || options[:conf].length == 0)
182
+ if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
183
183
  self.console.puts "Error: Missing required configuration file"
184
184
  self.console.puts "Exiting..."
185
185
  self.console.puts
186
- self.console.puts self.slop.help
186
+ self.console.puts self.slop.to_s
187
187
  exit 2
188
188
  end
189
189
  end
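Note on the `return` to `return true` change in `execute`: a bare `return` yields `nil`, which is falsy, so a caller that maps the result to a process exit status would report failure. The sketch below is a rough illustration only; `FakeCommand` and `exit_status_for` are hypothetical names, and the mapping of the boolean to an exit status is an assumption about the calling script, which is not part of this diff.

```ruby
# Hypothetical stand-in for a command object. `execute` follows the
# "true on success, false on failure" convention from the comment above.
class FakeCommand
  def initialize(options)
    @options = options
  end

  def execute
    if @options[:version]
      puts "fake version 0.0.0"
      return true # a bare `return` would yield nil, which is falsy
    end
    true
  end
end

# Assumed caller logic: turn the boolean result into an exit status.
def exit_status_for(result)
  result ? 0 : 1
end

puts exit_status_for(FakeCommand.new(version: true).execute)
puts exit_status_for(nil)
```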
@@ -234,28 +234,36 @@ module Traject
234
234
  end
235
235
 
236
236
 
237
- def create_slop!
238
- return Slop.new(:strict => true) do
239
- banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
237
+ def create_slop!(argv)
238
+ options = Slop::Options.new do |o|
239
+ o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
240
240
 
241
- on 'v', 'version', "print version information to stderr"
242
- on 'd', 'debug', "Include debug log, -s log.level=debug"
243
- on 'h', 'help', "print usage information to stderr"
244
- on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
245
- on :i, 'indexer', "Traject indexer class name or shortcut", :argument => true, default: "marc"
246
- on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
247
- on :r, :reader, "Set reader class, shortcut for -s reader_class_name=", :argument => true
248
- on :o, "output_file", "output file for Writer classes that write to files", :argument => true
249
- on :w, :writer, "Set writer class, shortcut for -s writer_class_name=", :argument => true
250
- on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
251
- on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
252
- on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
241
+ o.on '-v', '--version', "print version information to stderr"
242
+ o.on '-d', '--debug', "Include debug log, -s log.level=debug"
243
+ o.on '-h', '--help', "print usage information to stderr"
244
+ o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
245
+ o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
246
+ o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
247
+ o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
248
+ o.string "-o", "--output_file", "output file for Writer classes that write to files"
249
+ o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
250
+ o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
251
+ o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
252
+ o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
253
253
 
254
- on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
254
+ o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
255
255
 
256
- on "stdin", "read input from stdin"
257
- on "debug-mode", "debug logging, single threaded, output human readable hashes"
256
+ o.on "--stdin", "read input from stdin"
257
+ o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
258
258
  end
259
+
260
+ options.parse(argv)
261
+ rescue Slop::Error => e
262
+ self.console.puts "Error: #{e.message}"
263
+ self.console.puts "Exiting..."
264
+ self.console.puts
265
+ self.console.puts options.to_s
266
+ exit 1
259
267
  end
260
268
 
261
269
  def initialize_indexer!
@@ -267,22 +275,5 @@ module Traject
267
275
 
268
276
  return indexer
269
277
  end
270
-
271
- def parse_options(argv)
272
-
273
- begin
274
- self.slop.parse!(argv)
275
- rescue Slop::Error => e
276
- self.console.puts "Error: #{e.message}"
277
- self.console.puts "Exiting..."
278
- self.console.puts
279
- self.console.puts slop.help
280
- exit 1
281
- end
282
-
283
- return self.slop.to_hash
284
- end
285
-
286
-
287
278
  end
288
279
  end
@@ -190,7 +190,7 @@ class Traject::Indexer
190
190
  instance_eval(&block)
191
191
  end
192
192
 
193
- ## Class level configure block accepted too, and applied at instantiation
193
+ ## Class level configure block(s) accepted too, and applied at instantiation
194
194
  # before instance-level configuration.
195
195
  #
196
196
  # EXPERIMENTAL, implementation may change in ways that effect some uses.
@@ -199,8 +199,14 @@ class Traject::Indexer
199
199
  # Note that settings set by 'provide' in subclass can not really be overridden
200
200
  # by 'provide' in a next level subclass. Use self.default_settings instead, with
201
201
  # call to super.
202
+ #
203
+ # You can call this .configure multiple times, blocks are added to a list, and
204
+ # will be used to initialize an instance in order.
205
+ #
206
+ # The main downside of this workaround implementation is performance, even though
207
+ # defined at load-time on class level, blocks are all executed on every instantiation.
202
208
  def self.configure(&block)
203
- @class_configure_block = block
209
+ (@class_configure_blocks ||= []) << block
204
210
  end
205
211
 
206
212
  def self.apply_class_configure_block(instance)
@@ -208,8 +214,10 @@ class Traject::Indexer
208
214
  if self.superclass.respond_to?(:apply_class_configure_block)
209
215
  self.superclass.apply_class_configure_block(instance)
210
216
  end
211
- if @class_configure_block
212
- instance.configure(&@class_configure_block)
217
+ if @class_configure_blocks && !@class_configure_blocks.empty?
218
+ @class_configure_blocks.each do |block|
219
+ instance.configure(&block)
220
+ end
213
221
  end
214
222
  end
215
223
 
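Note on the `@class_configure_blocks` change: the pattern is a class-level list of blocks that is replayed, superclass first, on every new instance. A minimal sketch with a toy class follows. `ToyIndexer`, `ChildIndexer`, and `step` are hypothetical names and not Traject's API, and calling `apply_class_configure_block` from `initialize` is an assumption based on the "applied at instantiation" comment.

```ruby
# Toy class, not Traject::Indexer: accumulates class-level configure
# blocks and replays them, superclass first, on each new instance.
class ToyIndexer
  def self.configure(&block)
    (@class_configure_blocks ||= []) << block
  end

  def self.apply_class_configure_block(instance)
    if superclass.respond_to?(:apply_class_configure_block)
      superclass.apply_class_configure_block(instance)
    end
    (@class_configure_blocks || []).each { |block| instance.configure(&block) }
  end

  attr_reader :steps

  def initialize
    @steps = []
    self.class.apply_class_configure_block(self)
  end

  # Instance-level configure: evaluate the block against this instance.
  def configure(&block)
    instance_eval(&block)
  end

  def step(name)
    @steps << name
  end

  configure { step :parent }
end

class ChildIndexer < ToyIndexer
  configure { step :child_first }
  configure { step :child_second }
end

p ChildIndexer.new.steps
```

Each class keeps its own list because `@class_configure_blocks` is a class-level instance variable. As the source comment notes, every block runs again on every instantiation.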
@@ -15,7 +15,7 @@ module Traject::Macros
15
15
  # field/substring specification.
16
16
  #
17
17
  # First argument is a string spec suitable for the MarcExtractor, see
18
- # MarcExtractor::parse_string_spec.
18
+ # Traject::MarcExtractor::Spec.
19
19
  #
20
20
  # Second arg is optional options, including options valid on MarcExtractor.new,
21
21
  # and others. By default, will de-duplicate results, but see :allow_duplicates
@@ -42,11 +42,11 @@ module Traject::Macros
42
42
  #
43
43
  # * :translation_map => String: translate with named translation map looked up in load
44
44
  # path, uses Tranject::TranslationMap.new(translation_map_arg).
45
- # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
45
+ # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
46
46
  #
47
47
  # * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
48
48
  # have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
49
- # `extract_marc(whatever), trim_punctuation
49
+ # `extract_marc(whatever), trim_punctuation`
50
50
  #
51
51
  # * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
52
52
  #
@@ -26,19 +26,19 @@ module Traject::Macros
26
26
  accumulator.concat list.uniq if list
27
27
  end
28
28
  end
29
-
29
+
30
30
  # If a num begins with a known OCLC prefix, return it without the prefix.
31
31
  # otherwise nil.
32
32
  #
33
- # Allow (OCoLC) and/or ocn/ocm/on
34
-
33
+ # Allow (OCoLC) and/or ocn/ocm/on
34
+
35
35
  OCLCPAT = /
36
36
  \A\s*
37
37
  (?:(?:\(OCoLC\)) |
38
38
  (?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
39
39
  )(\d+)
40
40
  /x
41
-
41
+
42
42
  def self.oclcnum_extract(num)
43
43
  if m = OCLCPAT.match(num)
44
44
  return m[1]
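Note on `OCLCPAT`: this hunk changes only whitespace, but the pattern is easier to follow with sample inputs. The sketch below copies the regular expression into a standalone script. The top-level `oclcnum_extract` here is a simplified stand-in, since the hunk shows only the first lines of the real method.

```ruby
# Standalone copy of the pattern shown in the hunk above.
OCLCPAT = /
  \A\s*
  (?:(?:\(OCoLC\)) |
     (?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
  )(\d+)
/x

# Simplified stand-in: return the digits after a known OCLC prefix,
# or nil when the value does not start with one.
def oclcnum_extract(num)
  m = OCLCPAT.match(num)
  m && m[1]
end

["(OCoLC)12345", "ocm00012345", "(OCoLC)ocn123", "on987", "12345"].each do |value|
  puts "#{value.inspect} => #{oclcnum_extract(value).inspect}"
end
```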
@@ -364,13 +364,16 @@ module Traject::Macros
364
364
  end
365
365
  end
366
366
  end
367
- # Okay, nothing from 008, try 260
367
+ # Okay, nothing from 008, first try 264, then try 260
368
368
  if found_date.nil?
369
+ v264c = MarcExtractor.cached("264c", :separator => nil).extract(record).first
369
370
  v260c = MarcExtractor.cached("260c", :separator => nil).extract(record).first
370
371
  # just try to take the first four digits out of there, we're not going to try
371
372
  # anything crazy.
372
- if m = /(\d{4})/.match(v260c)
373
+ if m = /(\d{4})/.match(v264c)
373
374
  found_date = m[1].to_i
375
+ elsif m = /(\d{4})/.match(v260c)
376
+ found_date = m[1].to_i
374
377
  end
375
378
  end
376
379
 
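Note on the 264-before-260 fallback: the new branch takes the first four-digit run from 264$c and falls back to 260$c only when 264$c yields nothing. The sketch below uses plain strings in place of MARC records. `year_from_264_then_260` is a hypothetical name, and the sample values are made up for illustration.

```ruby
# Hypothetical helper working on plain strings rather than MARC records:
# prefer a four-digit year from 264$c, fall back to 260$c.
def year_from_264_then_260(v264c, v260c)
  if (m = /(\d{4})/.match(v264c.to_s))
    m[1].to_i
  elsif (m = /(\d{4})/.match(v260c.to_s))
    m[1].to_i
  end
end

p year_from_264_then_260("c2014", "c1999.")
p year_from_264_then_260(nil, "c1999.")
p year_from_264_then_260("[n.d.]", nil)
```

In the real method this step runs only when no date was found in the 008 field.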
@@ -519,11 +522,11 @@ module Traject::Macros
519
522
 
520
523
  # Extracts LCSH-carrying fields, and formatting them
521
524
  # as a pre-coordinated LCSH string, for instance suitable for including
522
- # in a facet.
525
+ # in a facet.
523
526
  #
524
527
  # You can supply your own list of fields as a spec, but for significant
525
528
  # customization you probably just want to write your own method in
526
- # terms of the Marc21Semantics.assemble_lcsh method.
529
+ # terms of the Marc21Semantics.assemble_lcsh method.
527
530
  def marc_lcsh_formatted(options = {})
528
531
  spec = options[:spec] || "600:610:611:630:648:650:651:654:662"
529
532
  subd_separator = options[:subdivison_separator] || " — "
@@ -540,17 +543,17 @@ module Traject::Macros
540
543
  end
541
544
 
542
545
  # Takes a MARC::Field and formats it into a pre-coordinated LCSH string
543
- # with subdivision seperators in the right place.
546
+ # with subdivision seperators in the right place.
544
547
  #
545
548
  # For 600 fields especially, need to not just join with subdivision seperator
546
549
  # to take acount of $a$d$t -- for other fields, might be able to just
547
- # join subfields, not sure.
550
+ # join subfields, not sure.
548
551
  #
549
552
  # WILL strip trailing period from generated string, contrary to some LCSH practice.
550
553
  # Our data is inconsistent on whether it has period or not, this was
551
- # the easiest way to standardize.
554
+ # the easiest way to standardize.
552
555
  #
553
- # Default subdivision seperator is em-dash with spaces, set to '--' if you want.
556
+ # Default subdivision seperator is em-dash with spaces, set to '--' if you want.
554
557
  #
555
558
  # Cite: "Dash (-) that precedes a subdivision in an extended 600 subject heading
556
559
  # is not carried in the MARC record. It may be system generated as a display constant
@@ -26,9 +26,15 @@ module Traject
26
26
  # Make sure to avoid text content that was all blank, which is "between the children"
27
27
  # whitespace.
28
28
  result = result.collect do |n|
29
- n.xpath('.//text()').collect(&:text).tap do |arr|
30
- arr.reject! { |s| s =~ (/\A\s+\z/) }
31
- end.join(" ")
29
+ if n.kind_of?(Nokogiri::XML::Attr)
30
+ # attribute value
31
+ n.value
32
+ else
33
+ # text from node
34
+ n.xpath('.//text()').collect(&:text).tap do |arr|
35
+ arr.reject! { |s| s =~ (/\A\s+\z/) }
36
+ end.join(" ")
37
+ end
32
38
  end
33
39
  else
34
40
  # just put all matches in accumulator as Nokogiri::XML::Node's
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
2
2
 
3
3
  module Traject
4
4
  # MarcExtractor is a class for extracting lists of strings from a MARC::Record,
5
- # according to specifications. See #parse_string_spec for description of string
6
- # string arguments used to specify extraction. See #initialize for options
7
- # that can be set controlling extraction.
5
+ # according to specifications. See Traject::MarcExtractor::Spec for description
6
+ # of string string arguments used to specify extraction. See #initialize for
7
+ # options that can be set controlling extraction.
8
8
  #
9
9
  # Examples:
10
10
  #
@@ -21,6 +21,9 @@ module Traject
21
21
  # If you need to use namespaces here, you need to have them registered with
22
22
  # `nokogiri.default_namespaces`. If your source docs use namespaces, you DO need
23
23
  # to use them in your each_record_xpath.
24
+ # * nokogiri.strict_mode: if set to `true` or `"true"`, ask Nokogiri to parse in 'strict'
25
+ # mode, it will raise a `Nokogiri::XML::SyntaxError` if the XML is not well-formed, instead
26
+ # of trying to take its best-guess correction. https://nokogiri.org/tutorials/ensuring_well_formed_markup.html
24
27
  # * nokogiri_reader.extra_xpath_hooks: Experimental in progress, see below.
25
28
  #
26
29
  # ## nokogiri_reader.extra_xpath_hooks: For handling nodes outside of your each_record_xpath
@@ -87,7 +90,11 @@ module Traject
87
90
  end
88
91
 
89
92
  def each
90
- whole_input_doc = Nokogiri::XML.parse(input_stream)
93
+ config_proc = if settings["nokogiri.strict_mode"]
94
+ proc { |config| config.strict }
95
+ end
96
+
97
+ whole_input_doc = Nokogiri::XML.parse(input_stream, &config_proc)
91
98
 
92
99
  if each_record_xpath
93
100
  whole_input_doc.xpath(each_record_xpath, default_namespaces).each do |matching_node|