traject 3.1.0.rc1 → 3.5.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 06c28d37f9aafafe709a146c7612e5b5d8a5c58a61fd1502823a38dc52b9d05b
4
- data.tar.gz: 2e38b2b8c4030456f3757ae6062231268110d68ef07e10cab722b4074ccd570c
3
+ metadata.gz: 01ca968682bb3fc2a8313131ef6344bfc9e5418b767b2900c3d799caa356d016
4
+ data.tar.gz: a3fd6c9a3bec88c6ba592500ea170357b059533f28c5ba3fb2fe72de39702a2a
5
5
  SHA512:
6
- metadata.gz: 04561a77a3e6f2073198983b5bf7d4e35cc9f52bccc1211487cc4c850b0f0b0fc9395a7c87e6ed90061f4a15af57516434d260c649fbc43ea65a0c6435194818
7
- data.tar.gz: c7312156c3be556218e319e35ae76aa97fbae5fad6720dbce2e4a046ec90603f5de34fe2cb055425fb3da499922fba50c7d4a6445858793bb0a4fb26cf8f7b29
6
+ metadata.gz: 93547927e90b7947588c1983bd37de4651722bebaff7aaad2d3965ec46eed8c647971f1ce093beb86a5c47c2efa999516d00fb6956682c98913cd54ea5a1a2b8
7
+ data.tar.gz: 128ea6e2517711f2324541215f814430f963b0042ae321ebc3f29646398990dc09d2a307acc32fb89a70142d74763fdfee6a61d220043622b5b8b11fbcd645d8
@@ -0,0 +1,26 @@
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ branches: [ master ]
6
+ pull_request:
7
+ branches: ['**']
8
+
9
+ jobs:
10
+ tests:
11
+ runs-on: ubuntu-latest
12
+ strategy:
13
+ fail-fast: false
14
+ matrix:
15
+ ruby: [ '2.4', '2.5', '2.6', '2.7', 'jruby-9.1', 'jruby-9.2' ]
16
+ name: Ruby ${{ matrix.ruby }}
17
+ steps:
18
+ - uses: actions/checkout@v2
19
+ - name: Set up Ruby
20
+ uses: ruby/setup-ruby@v1
21
+ with:
22
+ ruby-version: ${{ matrix.ruby }}
23
+ - name: Install dependencies
24
+ run: bundle install --jobs 4 --retry 3
25
+ - name: Run tests
26
+ run: bundle exec rake
data/CHANGES.md CHANGED
@@ -1,5 +1,47 @@
1
1
  # Changes
2
2
 
3
+ ## Next
4
+
5
+ *
6
+
7
+ *
8
+
9
+ *
10
+
11
+ ## 3.5.0
12
+
13
+ * `traject -v` and `traject -h` correctly return 0 exit code indicating success.
14
+
15
+ * upgrade to slop gem 4.x, which carries with it a slightly different format of human-readable command-line arg errors, should be otherwise invisible.
16
+
17
+ * the SolrJsonWriter now supports HTTP basic auth credentials embedded in `solr.url` or `solr.update_url`, eg `http://user:pass@example.org/solr` https://github.com/traject/traject/pull/262
18
+
19
+
20
+ ## 3.4.0
21
+
22
+ * XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
23
+
24
+ ## 3.3.0
25
+
26
+ * `Traject::Macros::Marc21Semantics.publication_date` now gets date from 264 before 260. https://github.com/traject/traject/pull/233
27
+
28
+ * Allow hashie 4.x in gemspec https://github.com/traject/traject/pull/234
29
+
30
+ * Allow `http` gem 4.x versions. https://github.com/traject/traject/pull/236
31
+
32
+ * Can now call class-level Indexer.configure multiple times https://github.com/sciencehistory/scihist_digicoll/pull/525
33
+
34
+ ## 3.2.0
35
+
36
+ * NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogori to parse in strict mode, so it will immediately raise on ill-formed XML, instead of nokogiri's default to do what it can with it. https://github.com/traject/traject/pull/226
37
+
38
+ * SolrJsonWriter
39
+
40
+ * Utility method `delete_all!` sends a delete all query to the Solr URL endpoint. https://github.com/traject/traject/pull/227
41
+
42
+ * Allow basic auth configuration of the default http client via `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password`. https://github.com/traject/traject/pull/231
43
+
44
+
3
45
  ## 3.1.0
4
46
 
5
47
  ### Added
@@ -24,6 +66,10 @@
24
66
 
25
67
  * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to HTTPClient instance. https://github.com/traject/traject/pull/219
26
68
 
69
+ * Only runs thread pool shutdown code (and logging) if there is a `solr_writer.batch_size` greater than 0. Keep it out of the logs if it was a no-op anyway.
70
+
71
+ * Logs at DEBUG level every time it sends an update request to solr
72
+
27
73
  * Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using Jruby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find now way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
28
74
 
29
75
  * LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
data/README.md CHANGED
@@ -9,7 +9,7 @@ Traject can also be generalized to a set of tools for getting structured data fr
9
9
  **Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
10
10
 
11
11
  [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
12
- [![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)
12
+ [![CI Status](https://github.com/traject/traject/workflows/CI/badge.svg?branch=master)](https://github.com/traject/traject/actions?query=workflow%3ACI+branch%3Amaster)
13
13
 
14
14
 
15
15
  ## Background/Goals
@@ -175,6 +175,8 @@ TranslationMap use above is just one example of a transformation macro, that tra
175
175
  * `append("--after each value")`
176
176
  * `gsub(/regex/, "replacement")`
177
177
  * `split(" ")`: take values and split them, possibly result in multiple values.
178
+ * `transform(proc)`: transform each existing macro using a proc, kind of like `map`.
179
+ eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })
178
180
 
179
181
  You can add on as many transformation macros as you want, they will be applied to output in order.
180
182
 
@@ -83,7 +83,8 @@ settings are applied first of all. It's recommended you use `provide`.
83
83
  ### Writing to solr
84
84
 
85
85
  * `json_writer.pretty_print`: used by the JsonWriter, if set to true, will output pretty printed json (with added whitespace) for easier human readability. Default false.
86
- * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`.
86
+
87
+ * `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line short-cut `-u`. (Can include embedded HTTP basic auth as eg `http://user:pass@example.org/solr`)
87
88
 
88
89
  * `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.
89
90
 
@@ -93,6 +94,9 @@ settings are applied first of all. It's recommended you use `provide`.
93
94
 
94
95
  * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than number of cores on your indexing machine, as these threads will mostly be waiting on Solr. Speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
95
96
 
97
+ * `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default but when both are set the default writer is configured with basic auth. You can also just embed basic
98
+ auth credentials in `solr.url` using standard URI syntax.
99
+
96
100
 
97
101
  ### Dealing with MARC data
98
102
 
data/doc/xml.md CHANGED
@@ -72,6 +72,16 @@ You can use all the standard transforation macros in Traject::Macros::Transforma
72
72
  to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
73
73
  ```
74
74
 
75
+ ### selecting attribute values
76
+
77
+ Just works, using xpath syntax for selecting an attribute:
78
+
79
+
80
+ ```ruby
81
+ # gets status value in: <oai:header status="something">
82
+ to_field "status", extract_xpath("//oai:record/oai:header/@status")
83
+ ```
84
+
75
85
 
76
86
  ### selecting non-text nodes
77
87
 
@@ -29,10 +29,10 @@ module Traject
29
29
  self.console = $stderr
30
30
 
31
31
  self.orig_argv = argv.dup
32
- self.remaining_argv = argv
33
32
 
34
- self.slop = create_slop!
35
- self.options = parse_options(self.remaining_argv)
33
+ self.slop = create_slop!(argv)
34
+ self.options = self.slop
35
+ self.remaining_argv = self.slop.arguments
36
36
  end
37
37
 
38
38
  # Returns true on success or false on failure; may also raise exceptions;
@@ -40,11 +40,11 @@ module Traject
40
40
  def execute
41
41
  if options[:version]
42
42
  self.console.puts "traject version #{Traject::VERSION}"
43
- return
43
+ return true
44
44
  end
45
45
  if options[:help]
46
- self.console.puts slop.help
47
- return
46
+ self.console.puts slop.to_s
47
+ return true
48
48
  end
49
49
 
50
50
 
@@ -179,11 +179,11 @@ module Traject
179
179
  end
180
180
 
181
181
  def arg_check!
182
- if options[:command] == "process" && (options[:conf].nil? || options[:conf].length == 0)
182
+ if options[:command] == "process" && (!options[:conf] || options[:conf].length == 0)
183
183
  self.console.puts "Error: Missing required configuration file"
184
184
  self.console.puts "Exiting..."
185
185
  self.console.puts
186
- self.console.puts self.slop.help
186
+ self.console.puts self.slop.to_s
187
187
  exit 2
188
188
  end
189
189
  end
@@ -234,28 +234,36 @@ module Traject
234
234
  end
235
235
 
236
236
 
237
- def create_slop!
238
- return Slop.new(:strict => true) do
239
- banner "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
237
+ def create_slop!(argv)
238
+ options = Slop::Options.new do |o|
239
+ o.banner = "traject [options] -c configuration.rb [-c config2.rb] file.mrc"
240
240
 
241
- on 'v', 'version', "print version information to stderr"
242
- on 'd', 'debug', "Include debug log, -s log.level=debug"
243
- on 'h', 'help', "print usage information to stderr"
244
- on 'c', 'conf', 'configuration file path (repeatable)', :argument => true, :as => Array
245
- on :i, 'indexer', "Traject indexer class name or shortcut", :argument => true, default: "marc"
246
- on :s, :setting, "settings: `-s key=value` (repeatable)", :argument => true, :as => Array
247
- on :r, :reader, "Set reader class, shortcut for -s reader_class_name=", :argument => true
248
- on :o, "output_file", "output file for Writer classes that write to files", :argument => true
249
- on :w, :writer, "Set writer class, shortcut for -s writer_class_name=", :argument => true
250
- on :u, :solr, "Set solr url, shortcut for -s solr.url=", :argument => true
251
- on :t, :marc_type, "xml, json or binary. shortcut for -s marc_source.type=", :argument => true
252
- on :I, "load_path", "append paths to ruby $LOAD_PATH", :argument => true, :as => Array, :delimiter => ":"
241
+ o.on '-v', '--version', "print version information to stderr"
242
+ o.on '-d', '--debug', "Include debug log, -s log.level=debug"
243
+ o.on '-h', '--help', "print usage information to stderr"
244
+ o.array '-c', '--conf', 'configuration file path (repeatable)', :delimiter => nil
245
+ o.string "-i", '--indexer', "Traject indexer class name or shortcut", :default => "marc"
246
+ o.array "-s", "--setting", "settings: `-s key=value` (repeatable)", :delimiter => nil
247
+ o.string "-r", "--reader", "Set reader class, shortcut for -s reader_class_name="
248
+ o.string "-o", "--output_file", "output file for Writer classes that write to files"
249
+ o.string "-w", "--writer", "Set writer class, shortcut for -s writer_class_name="
250
+ o.string "-u", "--solr", "Set solr url, shortcut for -s solr.url="
251
+ o.string "-t", "--marc_type", "xml, json or binary. shortcut for -s marc_source.type="
252
+ o.array "-I", "--load_path", "append paths to ruby $LOAD_PATH", :delimiter => ":"
253
253
 
254
- on :x, "command", "alternate traject command: process (default); marcout; commit", :argument => true, :default => "process"
254
+ o.string "-x", "--command", "alternate traject command: process (default); marcout; commit", :default => "process"
255
255
 
256
- on "stdin", "read input from stdin"
257
- on "debug-mode", "debug logging, single threaded, output human readable hashes"
256
+ o.on "--stdin", "read input from stdin"
257
+ o.on "--debug-mode", "debug logging, single threaded, output human readable hashes"
258
258
  end
259
+
260
+ options.parse(argv)
261
+ rescue Slop::Error => e
262
+ self.console.puts "Error: #{e.message}"
263
+ self.console.puts "Exiting..."
264
+ self.console.puts
265
+ self.console.puts options.to_s
266
+ exit 1
259
267
  end
260
268
 
261
269
  def initialize_indexer!
@@ -267,22 +275,5 @@ module Traject
267
275
 
268
276
  return indexer
269
277
  end
270
-
271
- def parse_options(argv)
272
-
273
- begin
274
- self.slop.parse!(argv)
275
- rescue Slop::Error => e
276
- self.console.puts "Error: #{e.message}"
277
- self.console.puts "Exiting..."
278
- self.console.puts
279
- self.console.puts slop.help
280
- exit 1
281
- end
282
-
283
- return self.slop.to_hash
284
- end
285
-
286
-
287
278
  end
288
279
  end
@@ -190,7 +190,7 @@ class Traject::Indexer
190
190
  instance_eval(&block)
191
191
  end
192
192
 
193
- ## Class level configure block accepted too, and applied at instantiation
193
+ ## Class level configure block(s) accepted too, and applied at instantiation
194
194
  # before instance-level configuration.
195
195
  #
196
196
  # EXPERIMENTAL, implementation may change in ways that effect some uses.
@@ -199,8 +199,14 @@ class Traject::Indexer
199
199
  # Note that settings set by 'provide' in subclass can not really be overridden
200
200
  # by 'provide' in a next level subclass. Use self.default_settings instead, with
201
201
  # call to super.
202
+ #
203
+ # You can call this .configure multiple times, blocks are added to a list, and
204
+ # will be used to initialize an instance in order.
205
+ #
206
+ # The main downside of this workaround implementation is performance, even though
207
+ # defined at load-time on class level, blocks are all executed on every instantiation.
202
208
  def self.configure(&block)
203
- @class_configure_block = block
209
+ (@class_configure_blocks ||= []) << block
204
210
  end
205
211
 
206
212
  def self.apply_class_configure_block(instance)
@@ -208,8 +214,10 @@ class Traject::Indexer
208
214
  if self.superclass.respond_to?(:apply_class_configure_block)
209
215
  self.superclass.apply_class_configure_block(instance)
210
216
  end
211
- if @class_configure_block
212
- instance.configure(&@class_configure_block)
217
+ if @class_configure_blocks && !@class_configure_blocks.empty?
218
+ @class_configure_blocks.each do |block|
219
+ instance.configure(&block)
220
+ end
213
221
  end
214
222
  end
215
223
 
@@ -15,7 +15,7 @@ module Traject::Macros
15
15
  # field/substring specification.
16
16
  #
17
17
  # First argument is a string spec suitable for the MarcExtractor, see
18
- # MarcExtractor::parse_string_spec.
18
+ # Traject::MarcExtractor::Spec.
19
19
  #
20
20
  # Second arg is optional options, including options valid on MarcExtractor.new,
21
21
  # and others. By default, will de-duplicate results, but see :allow_duplicates
@@ -42,11 +42,11 @@ module Traject::Macros
42
42
  #
43
43
  # * :translation_map => String: translate with named translation map looked up in load
44
44
  # path, uses Tranject::TranslationMap.new(translation_map_arg).
45
- # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)
45
+ # **Instead**, use `extract_marc(whatever), translation_map(translation_map_arg)`
46
46
  #
47
47
  # * :trim_punctuation => true; trims leading/trailing punctuation using standard algorithms that
48
48
  # have shown themselves useful with Marc, using Marc21.trim_punctuation. **Instead**, use
49
- # `extract_marc(whatever), trim_punctuation
49
+ # `extract_marc(whatever), trim_punctuation`
50
50
  #
51
51
  # * :default => String: if otherwise empty, add default value. **Instead**, use `extract_marc(whatever), default("default value")`
52
52
  #
@@ -26,19 +26,19 @@ module Traject::Macros
26
26
  accumulator.concat list.uniq if list
27
27
  end
28
28
  end
29
-
29
+
30
30
  # If a num begins with a known OCLC prefix, return it without the prefix.
31
31
  # otherwise nil.
32
32
  #
33
- # Allow (OCoLC) and/or ocn/ocm/on
34
-
33
+ # Allow (OCoLC) and/or ocn/ocm/on
34
+
35
35
  OCLCPAT = /
36
36
  \A\s*
37
37
  (?:(?:\(OCoLC\)) |
38
38
  (?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
39
39
  )(\d+)
40
40
  /x
41
-
41
+
42
42
  def self.oclcnum_extract(num)
43
43
  if m = OCLCPAT.match(num)
44
44
  return m[1]
@@ -364,13 +364,16 @@ module Traject::Macros
364
364
  end
365
365
  end
366
366
  end
367
- # Okay, nothing from 008, try 260
367
+ # Okay, nothing from 008, first try 264, then try 260
368
368
  if found_date.nil?
369
+ v264c = MarcExtractor.cached("264c", :separator => nil).extract(record).first
369
370
  v260c = MarcExtractor.cached("260c", :separator => nil).extract(record).first
370
371
  # just try to take the first four digits out of there, we're not going to try
371
372
  # anything crazy.
372
- if m = /(\d{4})/.match(v260c)
373
+ if m = /(\d{4})/.match(v264c)
373
374
  found_date = m[1].to_i
375
+ elsif m = /(\d{4})/.match(v260c)
376
+ found_date = m[1].to_i
374
377
  end
375
378
  end
376
379
 
@@ -519,11 +522,11 @@ module Traject::Macros
519
522
 
520
523
  # Extracts LCSH-carrying fields, and formatting them
521
524
  # as a pre-coordinated LCSH string, for instance suitable for including
522
- # in a facet.
525
+ # in a facet.
523
526
  #
524
527
  # You can supply your own list of fields as a spec, but for significant
525
528
  # customization you probably just want to write your own method in
526
- # terms of the Marc21Semantics.assemble_lcsh method.
529
+ # terms of the Marc21Semantics.assemble_lcsh method.
527
530
  def marc_lcsh_formatted(options = {})
528
531
  spec = options[:spec] || "600:610:611:630:648:650:651:654:662"
529
532
  subd_separator = options[:subdivison_separator] || " — "
@@ -540,17 +543,17 @@ module Traject::Macros
540
543
  end
541
544
 
542
545
  # Takes a MARC::Field and formats it into a pre-coordinated LCSH string
543
- # with subdivision seperators in the right place.
546
+ # with subdivision seperators in the right place.
544
547
  #
545
548
  # For 600 fields especially, need to not just join with subdivision seperator
546
549
  # to take acount of $a$d$t -- for other fields, might be able to just
547
- # join subfields, not sure.
550
+ # join subfields, not sure.
548
551
  #
549
552
  # WILL strip trailing period from generated string, contrary to some LCSH practice.
550
553
  # Our data is inconsistent on whether it has period or not, this was
551
- # the easiest way to standardize.
554
+ # the easiest way to standardize.
552
555
  #
553
- # Default subdivision seperator is em-dash with spaces, set to '--' if you want.
556
+ # Default subdivision seperator is em-dash with spaces, set to '--' if you want.
554
557
  #
555
558
  # Cite: "Dash (-) that precedes a subdivision in an extended 600 subject heading
556
559
  # is not carried in the MARC record. It may be system generated as a display constant
@@ -26,9 +26,15 @@ module Traject
26
26
  # Make sure to avoid text content that was all blank, which is "between the children"
27
27
  # whitespace.
28
28
  result = result.collect do |n|
29
- n.xpath('.//text()').collect(&:text).tap do |arr|
30
- arr.reject! { |s| s =~ (/\A\s+\z/) }
31
- end.join(" ")
29
+ if n.kind_of?(Nokogiri::XML::Attr)
30
+ # attribute value
31
+ n.value
32
+ else
33
+ # text from node
34
+ n.xpath('.//text()').collect(&:text).tap do |arr|
35
+ arr.reject! { |s| s =~ (/\A\s+\z/) }
36
+ end.join(" ")
37
+ end
32
38
  end
33
39
  else
34
40
  # just put all matches in accumulator as Nokogiri::XML::Node's
@@ -2,9 +2,9 @@ require 'traject/marc_extractor_spec'
2
2
 
3
3
  module Traject
4
4
  # MarcExtractor is a class for extracting lists of strings from a MARC::Record,
5
- # according to specifications. See #parse_string_spec for description of string
6
- # string arguments used to specify extraction. See #initialize for options
7
- # that can be set controlling extraction.
5
+ # according to specifications. See Traject::MarcExtractor::Spec for description
6
+ # of string string arguments used to specify extraction. See #initialize for
7
+ # options that can be set controlling extraction.
8
8
  #
9
9
  # Examples:
10
10
  #
@@ -21,6 +21,9 @@ module Traject
21
21
  # If you need to use namespaces here, you need to have them registered with
22
22
  # `nokogiri.default_namespaces`. If your source docs use namespaces, you DO need
23
23
  # to use them in your each_record_xpath.
24
+ # * nokogiri.strict_mode: if set to `true` or `"true"`, ask Nokogiri to parse in 'strict'
25
+ # mode, it will raise a `Nokogiri::XML::SyntaxError` if the XML is not well-formed, instead
26
+ # of trying to take it's best-guess correction. https://nokogiri.org/tutorials/ensuring_well_formed_markup.html
24
27
  # * nokogiri_reader.extra_xpath_hooks: Experimental in progress, see below.
25
28
  #
26
29
  # ## nokogiri_reader.extra_xpath_hooks: For handling nodes outside of your each_record_xpath
@@ -87,7 +90,11 @@ module Traject
87
90
  end
88
91
 
89
92
  def each
90
- whole_input_doc = Nokogiri::XML.parse(input_stream)
93
+ config_proc = if settings["nokogiri.strict_mode"]
94
+ proc { |config| config.strict }
95
+ end
96
+
97
+ whole_input_doc = Nokogiri::XML.parse(input_stream, &config_proc)
91
98
 
92
99
  if each_record_xpath
93
100
  whole_input_doc.xpath(each_record_xpath, default_namespaces).each do |matching_node|