traject 2.3.1 → 2.3.2
- checksums.yaml +4 -4
- data/CHANGES.md +8 -0
- data/README.md +7 -6
- data/doc/indexing_rules.md +3 -3
- data/lib/traject/macros/marc21.rb +1 -1
- data/lib/traject/macros/marc21_semantics.rb +3 -1
- data/lib/traject/version.rb +1 -1
- data/test/indexer/macros_marc21_test.rb +2 -0
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1536be14599f2f0777b79a6bc27717ad0350223f
+  data.tar.gz: 8cc6327ca07889c69526f3a19b4e3b91b5512c65
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 05126d1932a31c7fb97f571619139c287b71afe4f3638ec7e72e73518df8c42f765cdaed7e646b64f4c13ad724b95d8f99e1dc61243b07aac0d2ab7d38bf9241
+  data.tar.gz: 2035e5bc42067a3c0ac598f894ac59c1309244d47afceaaa7b66a7dd4bfd034e8395c1e71b90d2f6d5b5d68efa78ea98c0bfc0681dff23b39276fdb7bed6b5b2
data/CHANGES.md
CHANGED
@@ -1,5 +1,13 @@
 # Changes
 
+## 2.3.2
+* Change to `extract_marc` to work around a threadsafe problem in JRuby/MRI where
+  regexps were unsafely shared between threads. (@codeforkjeff)
+* Make trim-punctuation safe for non-just-ASCII text (thanks to @dunn and @redlibrarian)
+
+## 2.3.1
+* Update README with more info about new nil-related options
+
 ## 2.3.0
 * Allow nil values, empty fields, and deduplication
 
data/README.md
CHANGED
@@ -2,11 +2,13 @@
 
 An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
 
+(Questions about use are welcome here or on the [google group](https://groups.google.com/forum/#!forum/traject-users))
+
 You might use [traject](https://github.com/traject/traject) to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
 
 Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data to Solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable for debugging by a human.
 
-**Traject is stable, mature software, that is already being used in production by its authors.**
+**Traject is stable, mature software, that is already being used in production by its authors and several other institutions.**
 
 [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
 [![Build Status](https://travis-ci.org/traject/traject.png)](https://travis-ci.org/traject/traject)
@@ -113,7 +115,7 @@ data out of a MARC record according to a tag/subfield specification.
 
 # 245 subfields a, p, and s. 130, all subfields.
 # built-in punctuation trimming routine.
-to_field "title_t", extract_marc("
+to_field "title_t", extract_marc("245aps:130", :trim_punctuation => true)
 
 # Can limit to certain indicators with || chars.
 # "*" is a wildcard in indicator spec. So this is
@@ -129,7 +131,7 @@ data out of a MARC record according to a tag/subfield specification.
 to_field "language_code", extract_marc("008[35-37]")
 ~~~
 
-`extract_marc` by default includes all 'alternate script' linked fields
+`extract_marc` by default includes all 'alternate script' linked fields corresponding to matched specifications, but you can turn that off, or extract *only* corresponding 880s.
 
 ~~~ruby
 to_field "title", extract_marc("245abc", :alternate_script => false)
@@ -140,7 +142,7 @@ By default, specifications with multiple subfields (e.g. "240abc") will produce
 
 For the syntax and complete possibilities of the specification string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
 
-`extract_marc` also supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping
+`extract_marc` also supports `translation maps` similar to SolrMarc's. There are some translation maps provided by traject, and you can also define your own, in yaml or ruby. Translation maps are especially useful for mapping from MARC codes to user-displayable strings:
 
 ~~~ruby
 # "translation_map" will be passed to Traject::TranslationMap.new
@@ -278,7 +280,7 @@ results, or writing to more than one field at once.
 
 For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
 
-There is also an `after_processing` method that can be used to register logic that will be called after the entire has been processed. You can use it for whatever custom ruby code you might want for your app (send an email? Clean up a log file? Trigger a Solr replication?)
+There is also an `after_processing` method that can be used to register logic that will be called after the entire input has been processed. You can use it for whatever custom ruby code you might want for your app (send an email? Clean up a log file? Trigger a Solr replication?)
 
 ~~~ruby
 after_processing do
@@ -305,7 +307,6 @@ The [SolrJWriter](https://github.com/traject/traject-solrj_writer) is packaged s
 and will be useful if you need to index to Solr's older than version 3.2. It requires Jruby.
 
 You can easily write your own Readers and Writers if you'd like, see comments at top
-
 of [Traject::Indexer](lib/traject/indexer.rb).
 
 
data/doc/indexing_rules.md
CHANGED
@@ -67,7 +67,7 @@ created." In ruby, lambdas and blocks are closures. Method definitions
 are not, which most of us have run across much to our chagrin.
 
 Within the context of `traject`, this means you can define a variable
-outside of a `to_field` or `each_record` block and it will be
+outside of a `to_field` or `each_record` block and it will be available
 inside those blocks. And you only have to define it once.
 
 That's useful to do for any object that is even a bit expensive
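The closure behaviour this passage relies on can be sketched in plain Ruby, without traject. The `lookup` hash and both helper names are hypothetical, chosen only for illustration: a lambda (or block) sees a local variable defined outside it, while a method definition does not.

```ruby
# Hypothetical stand-in for an object that is expensive to build;
# it is created once, outside any block.
lookup = { "eng" => "English", "fre" => "French" }

# A lambda is a closure: it can read `lookup` from the enclosing scope.
translate = lambda { |code| lookup.fetch(code, code) }
translate.call("eng") # => "English"

# A method definition opens a fresh scope, so `lookup` is not visible here.
def translate_via_method(code)
  lookup.fetch(code, code)
end

method_error = nil
begin
  translate_via_method("eng")
rescue NameError => e
  # The bare identifier `lookup` is undefined inside the method body.
  method_error = e.class
end
```

In a traject configuration the same rule is what lets a `to_field` block reuse an object built once at the top of the file.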
@@ -190,7 +190,7 @@ to_field('foo'), macro_returning_dup_values do |rec, acc|
 end
 ```
 
-##
+## Manipulating `context.output_hash` directly
 
 If you ask for the context argument, a [Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Indexer/Context)), you have access to `context.output_hash`, which is
 the hash of already transformed output that will be sent to Solr (or any other Writer).
@@ -218,7 +218,7 @@ context.output_hash['fieldname'] = ['fuzzy_wuzzies']
 
 Thus, `each_record` blocks have no `accumulator` argument: instead they either take a single `record` argument; or both a `record` and a `context`.
 
-`each_record` is useful for logging or
+`each_record` is useful for logging or notifying, computing intermediate
 results, or writing to more than one field at once.
 
 ~~~ruby
data/lib/traject/macros/marc21.rb
CHANGED
@@ -233,7 +233,7 @@ module Traject::Macros
 str = str.sub(/ *[ ,\/;:] *\Z/, '')
 
 # trailing period if it is preceded by at least three letters (possibly preceded and followed by whitespace)
-str = str.sub(/(
+str = str.sub(/( *[[:word:]]{3,})\. *\Z/, '\1')
 
 # single square bracket characters if they are the start and/or end
 # chars and there are no internal square brackets.
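The changelog entry about making trim-punctuation safe for non-ASCII text can be illustrated with a small standalone sketch. The removed line is truncated in this diff, so the `ascii_only` pattern below is an assumption about what an ASCII-only version might look like, not the gem's earlier code. In Ruby, `\w` matches only `[a-zA-Z0-9_]`, while the POSIX bracket `[[:word:]]` also matches non-ASCII letters.

```ruby
# Assumption: a pattern built on \w, which is ASCII-only in Ruby
# even when the string is UTF-8.
ascii_only = /( *\w\w\w)\. *\Z/

# Unicode-aware variant: [[:word:]] also matches accented letters.
unicode_aware = /( *[[:word:]]{3,})\. *\Z/

title = "Le r\u00E9ve." # "Le reve." with an accented first "e"

kept    = title.sub(ascii_only, '\1')    # period stays: the accented letter breaks the \w run
trimmed = title.sub(unicode_aware, '\1') # period removed

# Plain ASCII titles are trimmed by both patterns.
ascii_title = "Feminism and art.".sub(unicode_aware, '\1') # => "Feminism and art"
```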
data/lib/traject/macros/marc21_semantics.rb
CHANGED
@@ -200,7 +200,9 @@ module Traject::Macros
   # sometimes multiple language codes are jammed together in one subfield, and
   # we need to separate ourselves. sigh.
   unless value.length == 3
-
+    # split into an array of 3-length substrs; JRuby has problems with regexes
+    # across threads, which is why we don't use String#scan here.
+    value = value.chars.each_slice(3).map(&:join)
   end
   value
 end.flatten
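As a quick check of the regex-free split added above, the sketch below runs the same `chars.each_slice(3).map(&:join)` chain on a made-up value. The `String#scan` line is shown only as a regex-based equivalent for comparison; it is an assumption, not the line removed in this diff, which is not visible here.

```ruby
# Hypothetical value: three language codes run together in one subfield.
value = "engfreger"

# Regex-free split, as in the added line of the diff.
with_slices = value.chars.each_slice(3).map(&:join)
# => ["eng", "fre", "ger"]

# Regex-based equivalent, for comparison only.
with_scan = value.scan(/.{1,3}/)
# => ["eng", "fre", "ger"]

# A trailing partial code ends up in a shorter final chunk.
partial = "engfr".chars.each_slice(3).map(&:join)
# => ["eng", "fr"]
```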
data/lib/traject/version.rb
CHANGED
data/test/indexer/macros_marc21_test.rb
CHANGED
@@ -128,6 +128,8 @@ describe "Traject::Macros::Marc21" do
 
   # This one was a bug before
   assert_equal "Feminism and art", Marc21.trim_punctuation("Feminism and art.")
+
+  assert_equal "Le réve", Marc21.trim_punctuation("Le réve.")
 end
 
 it "uses :translation_map" do
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: traject
 version: !ruby/object:Gem::Version
-  version: 2.3.
+  version: 2.3.2
 platform: ruby
 authors:
 - Jonathan Rochkind
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-11-03 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: concurrent-ruby