traject 3.0.0.alpha.2 → 3.0.0
- checksums.yaml +4 -4
- data/.travis.yml +2 -3
- data/README.md +20 -15
- data/doc/xml.md +8 -0
- data/lib/traject/marc_reader.rb +8 -7
- data/lib/traject/version.rb +1 -1
- data/test/indexer/nokogiri_indexer_test.rb +11 -0
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: cf92e5467d32d37b681a36ae1ffbd2995bbf3e0def938b13d74831a939b68632
|
4
|
+
data.tar.gz: 7c4693ded4a9a8b0e9c599e7489aaefdf9806dfffce6b20ae6054def9ba8c156
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 9e12113a6f53aa9c7629c072df80b1e347f432d069bd30dbb35d73373fccc3fa341682b281a65c778aa2a3eae9fb7b2d52c81c2f39aa17d348074ecb8b9c2512
|
7
|
+
data.tar.gz: 6f2294bce5deb181a20db0977f8ab7e73e8e1cda6e86d8ee562fabd7a8cce2c683011be8f3955ccafd0165787dbaf774e7c3571220f5d6e797eaf6fe8a02577d
|
data/.travis.yml
CHANGED
@@ -4,15 +4,14 @@ cache: bundler
|
|
4
4
|
# at downloading jruby, and
|
5
5
|
sudo: true
|
6
6
|
rvm:
|
7
|
-
- 2.
|
8
|
-
- 2.4.3
|
7
|
+
- 2.4.4
|
9
8
|
- 2.5.1
|
10
9
|
- "2.6.0-preview2"
|
11
10
|
# avoid having travis install jdk on MRI builds where we don't need it.
|
12
11
|
matrix:
|
13
12
|
include:
|
14
13
|
- jdk: openjdk8
|
15
|
-
rvm: jruby-9.
|
14
|
+
rvm: jruby-9.1.17.0
|
16
15
|
- jdk: openjdk8
|
17
16
|
rvm: jruby-9.2.0.0
|
18
17
|
allow_failures:
|
data/README.md
CHANGED
@@ -26,7 +26,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
|
|
26
26
|
|
27
27
|
## Installation
|
28
28
|
|
29
|
-
Traject runs under jruby (9.
|
29
|
+
Traject runs under jruby (9.1.x or higher), MRI ruby (2.3.x or higher), or probably any other ruby platform.
|
30
30
|
|
31
31
|
Once you have ruby installed, just `$ gem install traject`.
|
32
32
|
|
@@ -135,12 +135,6 @@ For the syntax and complete possibilities of the specification string argument t
|
|
135
135
|
|
136
136
|
To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
|
137
137
|
|
138
|
-
There is one special MARC-specific transformation macro, that strips punctuation from beginning and end of values using heuristics designed for AACR2 in MARC:
|
139
|
-
|
140
|
-
```ruby
|
141
|
-
to_field "title", extract_marc("245abc"), trim_punctuation
|
142
|
-
```
|
143
|
-
|
144
138
|
### XML mode, extract_xml
|
145
139
|
|
146
140
|
See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
|
@@ -190,15 +184,15 @@ Example:
|
|
190
184
|
to_field "something", extract_xpath("//value"), strip, default("no value"), prepend("Extracted value: ")
|
191
185
|
```
|
192
186
|
|
193
|
-
###
|
187
|
+
### Some more MARC-specific utility methods
|
194
188
|
|
195
|
-
Other built-in methods that can be used with `to_field` include:
|
189
|
+
Other built-in methods that can be used with `to_field` for MARC specifically include:
|
196
190
|
|
197
|
-
|
191
|
+
Strip punctuation from beginning and end of values using heuristics designed for AACR2 in MARC:
|
198
192
|
|
199
|
-
|
200
|
-
to_field "
|
201
|
-
|
193
|
+
```ruby
|
194
|
+
to_field "title", extract_marc("245abc"), trim_punctuation
|
195
|
+
```
|
202
196
|
|
203
197
|
the current record serialized back out as MARC, in binary, XML, or json:
|
204
198
|
|
@@ -218,7 +212,7 @@ text of all fields in a range:
|
|
218
212
|
|
219
213
|
All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
|
220
214
|
|
221
|
-
|
215
|
+
### More complex canned MARC semantic logic
|
222
216
|
|
223
217
|
Some more complex (and opinionated/subjective) algorithms for deriving semantics from Marc are also packaged with Traject, but not available by default. To make them available to your indexing, you just need to use ruby `require` and `extend`.
|
224
218
|
|
@@ -265,7 +259,7 @@ in a configuration file, using a ruby block, which looks like this:
|
|
265
259
|
~~~
|
266
260
|
|
267
261
|
`do |record, accumulator| ... ` is the definition of a ruby block taking
|
268
|
-
two arguments. The first one passed in will be a MARC
|
262
|
+
two arguments. The first one passed in will be a source record (eg MARC or XML). The
|
269
263
|
second is an array, you add values to the array to send them to
|
270
264
|
output.
|
271
265
|
|
@@ -296,6 +290,17 @@ use ruby methods like `map!` to modify it:
|
|
296
290
|
If you find yourself repeating boilerplate code in your custom logic, you can
|
297
291
|
even create your own 'macros' (like `extract_marc`). `extract_marc`, `translation_map`, `first_only` and other macros are nothing more than methods that return ruby lambda objects of the same format as the blocks you write for custom logic.
|
298
292
|
|
293
|
+
In fact, in addition to a literal block on the end, you can pass as many `proc` objects as you want to transform data.
|
294
|
+
|
295
|
+
```ruby
|
296
|
+
to_field( "something", extract_xpath("//title"),
|
297
|
+
->(record, acc) { acc << "extra value" },
|
298
|
+
method_that_returns_a_proc
|
299
|
+
) do |rec, acc|
|
300
|
+
whatever_to(acc)
|
301
|
+
end
|
302
|
+
```
|
303
|
+
|
299
304
|
For tips, gotchas, and a more complete explanation of how this works, see
|
300
305
|
additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
|
301
306
|
|
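The README change above says that `to_field` accepts any number of proc objects in addition to a trailing block. A minimal plain-Ruby sketch of that idea, without Traject itself: `run_steps` and `upcase_all` are hypothetical names invented for illustration, and the loop only approximates how a rule might call each step in order with a shared accumulator. It is not Traject's actual implementation.

```ruby
# Hypothetical stand-in for running the steps of one indexing rule.
# Each step is called with the source record and the shared accumulator array.
def run_steps(record, steps)
  accumulator = []
  steps.each { |step| step.call(record, accumulator) }
  accumulator
end

# A "macro" in this sketch is just a method that returns a lambda
# taking (record, accumulator), like the blocks written for custom logic.
def upcase_all
  ->(_record, acc) { acc.map!(&:upcase) }
end

# Two literal lambdas: one extracts a value, one appends a fixed value.
extract_title = ->(record, acc) { acc << record[:title] }
add_extra     = ->(_record, acc) { acc << "extra value" }

# The record here is a plain hash, assumed only for this example.
result = run_steps({ title: "moby dick" }, [extract_title, add_extra, upcase_all])
p result
```

Here `result` ends up as `["MOBY DICK", "EXTRA VALUE"]`: each step sees the values that earlier steps left in the accumulator, which is why the order of the procs matters.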
data/doc/xml.md
CHANGED
@@ -58,6 +58,14 @@ to_field "title", extract_xpath("/oai:record/oai:metadata/oai:dc/dc:title", ns:
|
|
58
58
|
})
|
59
59
|
```
|
60
60
|
|
61
|
+
If you are accessing a nokogiri method directly, like in `some_record.xpath`, the registered default namespaces aren't known by nokogiri -- but they are available in the indexer as `default_namespaces`, so can be referenced and passed into the nokogiri method:
|
62
|
+
|
63
|
+
```ruby
|
64
|
+
each_record do |record|
|
65
|
+
log( record.xpath("//dc:title"), default_namespaces )
|
66
|
+
end
|
67
|
+
```
|
68
|
+
|
61
69
|
You can use all the standard transforation macros in Traject::Macros::Transformation:
|
62
70
|
|
63
71
|
```ruby
|
data/lib/traject/marc_reader.rb
CHANGED
@@ -9,12 +9,12 @@ require 'traject/ndj_reader'
|
|
9
9
|
# the gem traject-marc4j_reader.
|
10
10
|
#
|
11
11
|
# By default assumes binary MARC encoding, please set marc_source.type setting
|
12
|
-
# for XML or json. If binary, please set marc_source.encoding with char encoding.
|
12
|
+
# for XML or json. If binary, please set marc_source.encoding with char encoding.
|
13
13
|
#
|
14
14
|
# ## Settings
|
15
15
|
|
16
16
|
# * "marc_source.type": serialization type. default 'binary'
|
17
|
-
# * "binary". standard ISO 2709 "binary" MARC format,
|
17
|
+
# * "binary". standard ISO 2709 "binary" MARC format,
|
18
18
|
# will use ruby-marc MARC::Reader (Note, if you are using
|
19
19
|
# type 'binary', you probably want to also set 'marc_source.encoding')
|
20
20
|
# * "xml", MarcXML, will use ruby-marc MARC::XMLReader
|
@@ -23,15 +23,16 @@ require 'traject/ndj_reader'
|
|
23
23
|
# allowed, and no unescpaed internal newlines allowed in the json
|
24
24
|
# objects -- we just read line by line, and assume each line is a
|
25
25
|
# marc-in-json. http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
|
26
|
-
# will use Traject::NDJReader which uses MARC::Record.new_from_hash.
|
26
|
+
# will use Traject::NDJReader which uses MARC::Record.new_from_hash.
|
27
27
|
# * "marc_source.encoding": Only used for marc_source.type 'binary', character encoding
|
28
28
|
# of the source marc records. Can be any
|
29
29
|
# encoding recognized by ruby, OR 'MARC-8'. For 'MARC-8', content will
|
30
|
-
# be transcoded (by ruby-marc) to UTF-8 in internal MARC::Record Strings.
|
30
|
+
# be transcoded (by ruby-marc) to UTF-8 in internal MARC::Record Strings.
|
31
31
|
# Default nil, meaning let MARC::Reader use it's default, which will
|
32
|
-
#
|
32
|
+
# be your system's Encoding.default_external, which will probably be UTF-8.
|
33
|
+
# (but may be something unexpected/undesired on Windows, where you may want to set this explicitly.)
|
33
34
|
# Right now Traject::MarcReader is hard-coded to transcode to UTF-8 as
|
34
|
-
# an internal encoding.
|
35
|
+
# an internal encoding.
|
35
36
|
# * "marc_reader.xml_parser": For XML type, which XML parser to tell Marc::Reader
|
36
37
|
# to use. Anything recognized by [Marc::Reader :parser
|
37
38
|
# argument](http://rdoc.info/github/ruby-marc/ruby-marc/MARC/XMLReader).
|
@@ -75,7 +76,7 @@ class Traject::MarcReader
|
|
75
76
|
Traject::NDJReader.new(self.input_stream, settings)
|
76
77
|
else
|
77
78
|
args = { :invalid => :replace }
|
78
|
-
args[:external_encoding] = settings["marc_source.encoding"]
|
79
|
+
args[:external_encoding] = settings["marc_source.encoding"]
|
79
80
|
MARC::Reader.new(self.input_stream, args)
|
80
81
|
end
|
81
82
|
end
|
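The comment change above notes that a nil `marc_source.encoding` falls back to `Encoding.default_external`. A small plain-Ruby sketch of why that default matters, assuming a hypothetical ISO-8859-1 byte string rather than a real MARC record (no ruby-marc or Traject calls are used):

```ruby
# What ruby assumes for external data when nothing else is specified.
default_encoding = Encoding.default_external

# Hypothetical source bytes: "café" encoded as ISO-8859-1 (0xE9 is "é").
raw = "caf\xE9".b

# Interpreted as UTF-8, these bytes do not form a valid string.
valid_as_utf8 = raw.dup.force_encoding("UTF-8").valid_encoding?

# Naming the real source encoding lets ruby transcode to UTF-8 correctly.
transcoded = raw.encode("UTF-8", "ISO-8859-1")

puts "default external: #{default_encoding}"
puts "valid as UTF-8?   #{valid_as_utf8}"
puts "transcoded:       #{transcoded}"
```

In a Traject configuration, the corresponding step would be to set `marc_source.encoding` explicitly in the settings rather than rely on the platform default.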
data/lib/traject/version.rb
CHANGED
data/test/indexer/nokogiri_indexer_test.rb
CHANGED
@@ -56,6 +56,17 @@ describe "Traject::NokogiriIndexer" do
|
|
56
56
|
refute_empty results.last["rights"]
|
57
57
|
end
|
58
58
|
|
59
|
+
it "exposes nokogiri.namespaces setting in default_namespaces" do
|
60
|
+
namespaces = @namespaces
|
61
|
+
@indexer.configure do
|
62
|
+
settings do
|
63
|
+
provide "nokogiri.namespaces", namespaces
|
64
|
+
end
|
65
|
+
end
|
66
|
+
@indexer.settings.fill_in_defaults!
|
67
|
+
assert_equal namespaces, @indexer.default_namespaces
|
68
|
+
end
|
69
|
+
|
59
70
|
describe "xpath to non-terminal element" do
|
60
71
|
before do
|
61
72
|
@xml = <<-EOS
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: traject
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 3.0.0
|
4
|
+
version: 3.0.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jonathan Rochkind
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2018-
|
12
|
+
date: 2018-10-12 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: concurrent-ruby
|
@@ -381,9 +381,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
381
381
|
version: '0'
|
382
382
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
383
383
|
requirements:
|
384
|
-
- - "
|
384
|
+
- - ">="
|
385
385
|
- !ruby/object:Gem::Version
|
386
|
-
version:
|
386
|
+
version: '0'
|
387
387
|
requirements: []
|
388
388
|
rubyforge_project:
|
389
389
|
rubygems_version: 2.7.7
|