traject 3.0.0.alpha.2 → 3.0.0
- checksums.yaml +4 -4
- data/.travis.yml +2 -3
- data/README.md +20 -15
- data/doc/xml.md +8 -0
- data/lib/traject/marc_reader.rb +8 -7
- data/lib/traject/version.rb +1 -1
- data/test/indexer/nokogiri_indexer_test.rb +11 -0
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cf92e5467d32d37b681a36ae1ffbd2995bbf3e0def938b13d74831a939b68632
+  data.tar.gz: 7c4693ded4a9a8b0e9c599e7489aaefdf9806dfffce6b20ae6054def9ba8c156
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9e12113a6f53aa9c7629c072df80b1e347f432d069bd30dbb35d73373fccc3fa341682b281a65c778aa2a3eae9fb7b2d52c81c2f39aa17d348074ecb8b9c2512
+  data.tar.gz: 6f2294bce5deb181a20db0977f8ab7e73e8e1cda6e86d8ee562fabd7a8cce2c683011be8f3955ccafd0165787dbaf774e7c3571220f5d6e797eaf6fe8a02577d
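The SHA256 and SHA512 entries in `checksums.yaml` are digests of the gem's `metadata.gz` and `data.tar.gz` members. As a minimal sketch, assuming one of those files has already been extracted locally (the helper name and file path are hypothetical, not part of traject or RubyGems), a file could be compared against a published digest with Ruby's standard `digest` library:

```ruby
require "digest"

# Hypothetical helper: returns true when the file at `path` hashes to
# `expected_hex` under SHA256 (the default) or SHA512.
def checksum_matches?(path, expected_hex, algorithm: :sha256)
  digest_class = algorithm == :sha512 ? Digest::SHA512 : Digest::SHA256
  digest_class.file(path).hexdigest == expected_hex.strip.downcase
end

# Example call (the local path is an assumption):
#   checksum_matches?(
#     "metadata.gz",
#     "cf92e5467d32d37b681a36ae1ffbd2995bbf3e0def938b13d74831a939b68632"
#   )
```

Checking both algorithms is cheap, so a caller might reasonably require both digests to match before trusting a downloaded file.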
data/.travis.yml
CHANGED
@@ -4,15 +4,14 @@ cache: bundler
 # at downloading jruby, and
 sudo: true
 rvm:
-  - 2.
-  - 2.4.3
+  - 2.4.4
   - 2.5.1
   - "2.6.0-preview2"
 # avoid having travis install jdk on MRI builds where we don't need it.
 matrix:
   include:
     - jdk: openjdk8
-      rvm: jruby-9.
+      rvm: jruby-9.1.17.0
     - jdk: openjdk8
       rvm: jruby-9.2.0.0
   allow_failures:
data/README.md
CHANGED
@@ -26,7 +26,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
 
 ## Installation
 
-Traject runs under jruby (9.
+Traject runs under jruby (9.1.x or higher), MRI ruby (2.3.x or higher), or probably any other ruby platform.
 
 Once you have ruby installed, just `$ gem install traject`.
 
@@ -135,12 +135,6 @@ For the syntax and complete possibilities of the specification string argument t
 
 To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
 
-There is one special MARC-specific transformation macro, that strips punctuation from beginning and end of values using heuristics designed for AACR2 in MARC:
-
-```ruby
-to_field "title", extract_marc("245abc"), trim_punctuation
-```
-
 ### XML mode, extract_xml
 
 See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
@@ -190,15 +184,15 @@ Example:
 to_field "something", extract_xpath("//value"), strip, default("no value"), prepend("Extracted value: ")
 ```
 
-###
+### Some more MARC-specific utility methods
 
-Other built-in methods that can be used with `to_field` include:
+Other built-in methods that can be used with `to_field` for MARC specifically include:
 
-
+Strip punctuation from beginning and end of values using heuristics designed for AACR2 in MARC:
 
-
-to_field "
-
+```ruby
+to_field "title", extract_marc("245abc"), trim_punctuation
+```
 
 the current record serialized back out as MARC, in binary, XML, or json:
 
@@ -218,7 +212,7 @@ text of all fields in a range:
 
 All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
 
-
+### More complex canned MARC semantic logic
 
 Some more complex (and opinionated/subjective) algorithms for deriving semantics from Marc are also packaged with Traject, but not available by default. To make them available to your indexing, you just need to use ruby `require` and `extend`.
 
@@ -265,7 +259,7 @@ in a configuration file, using a ruby block, which looks like this:
 ~~~
 
 `do |record, accumulator| ... ` is the definition of a ruby block taking
-two arguments. The first one passed in will be a MARC
+two arguments. The first one passed in will be a source record (eg MARC or XML). The
 second is an array, you add values to the array to send them to
 output.
 
@@ -296,6 +290,17 @@ use ruby methods like `map!` to modify it:
 If you find yourself repeating boilerplate code in your custom logic, you can
 even create your own 'macros' (like `extract_marc`). `extract_marc`, `translation_map`, `first_only` and other macros are nothing more than methods that return ruby lambda objects of the same format as the blocks you write for custom logic.
 
+In fact, in addition to a literal block on the end, you can pass as many `proc` objects as you want to transform data.
+
+```ruby
+to_field( "something", extract_xpath("//title"),
+  ->(record, acc) { acc << "extra value" },
+  method_that_returns_a_proc
+) do |rec, acc|
+  whatever_to(acc)
+end
+```
+
 For tips, gotchas, and a more complete explanation of how this works, see
 additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
 
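The README text in this diff describes macros as methods that return lambdas with the same `(record, accumulator)` shape as a custom-logic block, and says any number of such procs can be chained before a final block. A rough sketch of that idea in plain Ruby follows. It does not use Traject, and `run_steps` and `upcase_all` are hypothetical names invented for illustration, not Traject's implementation:

```ruby
# Hypothetical stand-in for a macro: a method that returns a lambda taking
# (record, accumulator), the same shape as a custom-logic block.
def upcase_all
  ->(_record, acc) { acc.map!(&:upcase) }
end

# Hypothetical helper, not Traject's implementation: call every step in order
# against one shared accumulator array, then return what was accumulated.
def run_steps(record, *procs, &block)
  acc = []
  steps = block ? procs + [block] : procs
  steps.each { |step| step.call(record, acc) }
  acc
end

record = { "title" => "traject" } # a plain Hash standing in for a source record

values = run_steps(
  record,
  ->(rec, acc) { acc << rec["title"] },
  ->(_rec, acc) { acc << "extra value" },
  upcase_all
) { |_rec, acc| acc.map! { |v| "Extracted value: #{v}" } }

puts values.inspect
```

Because every step mutates the same array, order matters: a step that adds values should come before the steps that are meant to transform them.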
|
data/doc/xml.md
CHANGED
@@ -58,6 +58,14 @@ to_field "title", extract_xpath("/oai:record/oai:metadata/oai:dc/dc:title", ns:
 })
 ```
 
+If you are accessing a nokogiri method directly, like in `some_record.xpath`, the registered default namespaces aren't known by nokogiri -- but they are available in the indexer as `default_namespaces`, so can be referenced and passed into the nokogiri method:
+
+```ruby
+each_record do |record|
+  log( record.xpath("//dc:title"), default_namespaces )
+end
+```
+
 You can use all the standard transforation macros in Traject::Macros::Transformation:
 
 ```ruby
data/lib/traject/marc_reader.rb
CHANGED
@@ -9,12 +9,12 @@ require 'traject/ndj_reader'
 # the gem traject-marc4j_reader.
 #
 # By default assumes binary MARC encoding, please set marc_source.type setting
-# for XML or json. If binary, please set marc_source.encoding with char encoding.
+# for XML or json. If binary, please set marc_source.encoding with char encoding.
 #
 # ## Settings
 
 # * "marc_source.type": serialization type. default 'binary'
-# * "binary". standard ISO 2709 "binary" MARC format,
+# * "binary". standard ISO 2709 "binary" MARC format,
 # will use ruby-marc MARC::Reader (Note, if you are using
 # type 'binary', you probably want to also set 'marc_source.encoding')
 # * "xml", MarcXML, will use ruby-marc MARC::XMLReader
@@ -23,15 +23,16 @@ require 'traject/ndj_reader'
 # allowed, and no unescpaed internal newlines allowed in the json
 # objects -- we just read line by line, and assume each line is a
 # marc-in-json. http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/
-# will use Traject::NDJReader which uses MARC::Record.new_from_hash.
+# will use Traject::NDJReader which uses MARC::Record.new_from_hash.
 # * "marc_source.encoding": Only used for marc_source.type 'binary', character encoding
 # of the source marc records. Can be any
 # encoding recognized by ruby, OR 'MARC-8'. For 'MARC-8', content will
-# be transcoded (by ruby-marc) to UTF-8 in internal MARC::Record Strings.
+# be transcoded (by ruby-marc) to UTF-8 in internal MARC::Record Strings.
 # Default nil, meaning let MARC::Reader use it's default, which will
-#
+# be your system's Encoding.default_external, which will probably be UTF-8.
+# (but may be something unexpected/undesired on Windows, where you may want to set this explicitly.)
 # Right now Traject::MarcReader is hard-coded to transcode to UTF-8 as
-# an internal encoding.
+# an internal encoding.
 # * "marc_reader.xml_parser": For XML type, which XML parser to tell Marc::Reader
 # to use. Anything recognized by [Marc::Reader :parser
 # argument](http://rdoc.info/github/ruby-marc/ruby-marc/MARC/XMLReader).
@@ -75,7 +76,7 @@ class Traject::MarcReader
         Traject::NDJReader.new(self.input_stream, settings)
       else
         args = { :invalid => :replace }
-        args[:external_encoding] = settings["marc_source.encoding"]
+        args[:external_encoding] = settings["marc_source.encoding"]
         MARC::Reader.new(self.input_stream, args)
       end
     end
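The updated comment says that a nil `marc_source.encoding` falls back to `Encoding.default_external`, and the reader passes `:invalid => :replace` alongside the external encoding. The effect of choosing a source encoding can be illustrated with core Ruby alone. The sketch below is an assumption-laden illustration, not ruby-marc's or Traject's code, and `to_utf8` is a hypothetical helper:

```ruby
# Hypothetical helper, not part of Traject or ruby-marc: interpret raw bytes
# as `source_encoding` (falling back to Encoding.default_external when nil)
# and transcode to UTF-8, replacing bytes that cannot be converted.
def to_utf8(bytes, source_encoding = nil)
  source = source_encoding || Encoding.default_external
  tagged = bytes.dup.force_encoding(source)
  # scrub also covers the case where the source is already UTF-8, in which
  # encode may leave the string untouched.
  tagged.encode("UTF-8", invalid: :replace, undef: :replace).scrub
end

latin1_bytes = "caf\xE9".b                 # "café" encoded as ISO-8859-1
puts to_utf8(latin1_bytes, "ISO-8859-1")   # explicit source encoding
puts Encoding.default_external             # what a nil setting would fall back to
```

If the same bytes were read with the wrong source encoding, the accented character would come out replaced or garbled, which is why setting the encoding explicitly is suggested on systems where the default is not UTF-8.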
data/lib/traject/version.rb
CHANGED
data/test/indexer/nokogiri_indexer_test.rb
CHANGED
@@ -56,6 +56,17 @@ describe "Traject::NokogiriIndexer" do
     refute_empty results.last["rights"]
   end
 
+  it "exposes nokogiri.namespaces setting in default_namespaces" do
+    namespaces = @namespaces
+    @indexer.configure do
+      settings do
+        provide "nokogiri.namespaces", namespaces
+      end
+    end
+    @indexer.settings.fill_in_defaults!
+    assert_equal namespaces, @indexer.default_namespaces
+  end
+
   describe "xpath to non-terminal element" do
     before do
       @xml = <<-EOS
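The new test and the `doc/xml.md` addition both concern handing a prefix-to-URI namespace mapping to an XPath call. The sketch below illustrates why such a mapping is needed. It deliberately uses REXML rather than Nokogiri so that it runs without extra gems, and the `NAMESPACES` constant is a hypothetical stand-in for whatever the indexer's `default_namespaces` would return:

```ruby
require "rexml/document"

# Hypothetical mapping, standing in for the indexer's `default_namespaces`.
NAMESPACES = { "dc" => "http://purl.org/dc/elements/1.1/" }

xml = <<~XML
  <record xmlns:d="http://purl.org/dc/elements/1.1/">
    <d:title>Hello</d:title>
  </record>
XML

doc = REXML::Document.new(xml)

# The document declares the prefix "d", but the query uses "dc"; the explicit
# mapping is what lets the XPath engine resolve the "dc" prefix to a URI.
titles = REXML::XPath.match(doc, "//dc:title", NAMESPACES).map(&:text)
puts titles.inspect
```

With Nokogiri the same principle should apply: the mapping would be passed as the namespace argument of the node's `xpath` call, since prefixes in a query are matched by URI and not by the prefix text used in the document.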
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: traject
 version: !ruby/object:Gem::Version
-  version: 3.0.0
+  version: 3.0.0
 platform: ruby
 authors:
 - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-
+date: 2018-10-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby
@@ -381,9 +381,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "
+  - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: '0'
 requirements: []
 rubyforge_project:
 rubygems_version: 2.7.7