traject 0.17.0 → 1.0.0.beta.1
- checksums.yaml +4 -4
- data/README.md +27 -6
- data/doc/indexing_rules.md +1 -1
- data/lib/traject/marc_extractor.rb +1 -2
- data/lib/traject/util.rb +0 -2
- data/lib/traject/version.rb +1 -1
- data/test/indexer/macros_marc21_semantics_test.rb +12 -0
- data/traject.gemspec +2 -2
- metadata +5 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 19fae7c4d428bc40fd5c568f13daa7583fcebb2a
|
4
|
+
data.tar.gz: 9c71c57d86f6b9585451e958596a757cc75c432d
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 6730cd527d1ea285a5d60ebddeb95950dae2fdb5c548b905627eddf7a4a985e1891a039f409577903f181c60613861a3798fede6d5ecc8a942140cbb7a2d63d4
|
7
|
+
data.tar.gz: ee66b43f7bed57fcb21d82e7cd82c6e49c973bf7780b7a9f9c95fb5afc28744ffdb03e63d9d9b28bf567123571fa760ea9fed30fa62d20cf0ebfd0ef95d5880a
|
data/README.md
CHANGED
@@ -6,10 +6,10 @@ Might be used to index MARC data for a Solr-based discovery product like [Blackl
|
|
6
6
|
Traject might also be generalized to a set of tools for getting structured data from a source, and sending it to a destination.
|
7
7
|
|
8
8
|
|
9
|
-
**Traject is nearing 1.0, it is robust, feature-rich and
|
9
|
+
**Traject is nearing 1.0, it is robust, feature-rich and being used in production by authors -- feedback invited**
|
10
10
|
|
11
11
|
[![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
|
12
|
-
[![Build Status](https://travis-ci.org/
|
12
|
+
[![Build Status](https://travis-ci.org/traject-project/traject.png)](https://travis-ci.org/traject-project/traject)
|
13
13
|
|
14
14
|
|
15
15
|
## Background/Goals
|
@@ -147,8 +147,10 @@ data out of a MARC record according to a tag/subfield specification.
|
|
147
147
|
to matched specifications, but you can turn that off, or extract *only* corresponding
|
148
148
|
880s.
|
149
149
|
|
150
|
+
~~~ruby
|
150
151
|
to_field "title", extract_marc("245abc", :alternate_script => false)
|
151
152
|
to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
|
153
|
+
~~~
|
152
154
|
|
153
155
|
By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
|
154
156
|
|
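The single-subfield versus multi-subfield behaviour described in the README context line above can be sketched in plain Ruby. This is an illustration only: `collect_subfields` is a hypothetical helper, plain `[code, value]` pairs stand in for a MARC field, and the space joiner is an assumption rather than something stated in this diff. It is not traject's `extract_marc` implementation.

```ruby
# Hypothetical stand-in for the documented behaviour; NOT traject's API.
# `subfields` is an array of [code, value] pairs from one MARC field.
# `wanted_codes` is the list of subfield codes named in the specification.
def collect_subfields(subfields, wanted_codes, joiner = " ")
  values = subfields
    .select { |code, _value| wanted_codes.include?(code) }
    .map { |_code, value| value }

  # A single-subfield spec (like "020a") yields one string per matching subfield.
  return values if wanted_codes.length == 1

  # A multi-subfield spec (like "240abc") yields one joined string per field.
  values.empty? ? [] : [values.join(joiner)]
end

isbn_field  = [["a", "111"], ["a", "222"]]
title_field = [["a", "Title"], ["b", "subtitle"], ["c", "by X"]]

collect_subfields(isbn_field, ["a"])          # => ["111", "222"]
collect_subfields(title_field, %w[a b c])     # => ["Title subtitle by X"]
```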
@@ -173,20 +175,25 @@ To see all options for `extract_marc`, see the [method documentation](http://rdo
|
|
173
175
|
Other built-in methods that can be used with `to_field` include a hard-coded
|
174
176
|
literal string:
|
175
177
|
|
178
|
+
~~~ruby
|
176
179
|
to_field "source", literal("LIB_CATALOG")
|
180
|
+
~~~
|
177
181
|
|
178
182
|
The current record serialized back out as MARC, in binary, XML, or json:
|
179
183
|
|
184
|
+
~~~ruby
|
180
185
|
# or :format => "json" for marc-in-json
|
181
186
|
# or :format => "binary", by default Base64-encoded for Solr
|
182
187
|
# 'binary' field, or, for more like what SolrMarc did, without
|
183
188
|
# escaping:
|
184
189
|
to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false, :allow_oversized => true)
|
190
|
+
~~~
|
185
191
|
|
186
192
|
Text of all fields in a range:
|
187
193
|
|
194
|
+
~~~ruby
|
188
195
|
to_field "text", extract_all_marc_values(:from => 100, :to => 899)
|
189
|
-
|
196
|
+
~~~
|
190
197
|
|
191
198
|
All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
|
192
199
|
|
@@ -198,6 +205,7 @@ them available to your indexing, you just need to use ruby `require` and `extend
|
|
198
205
|
|
199
206
|
A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
|
200
207
|
|
208
|
+
~~~ruby
|
201
209
|
require 'traject/macros/marc21_semantics'
|
202
210
|
extend Traject::Macros::Marc21Semantics
|
203
211
|
|
@@ -205,15 +213,17 @@ A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macr
|
|
205
213
|
to_field 'broad_subject', marc_lcc_to_broad_category
|
206
214
|
to_field "geographic_facet", marc_geo_facet
|
207
215
|
# And several more
|
216
|
+
~~~
|
208
217
|
|
209
218
|
And, there's a routine for classifying MARC to an internal
|
210
219
|
format/genre/type vocabulary:
|
211
220
|
|
221
|
+
~~~ruby
|
212
222
|
require 'traject/macros/marc_format_classifier'
|
213
223
|
extend Traject::Macros::MarcFormats
|
214
224
|
|
215
225
|
to_field 'format_facet', marc_formats
|
216
|
-
|
226
|
+
~~~
|
217
227
|
|
218
228
|
## Custom logic
|
219
229
|
|
@@ -221,6 +231,7 @@ The built-in routines are there for your convenience, but if you need
|
|
221
231
|
something local or custom, you can write ruby logic directly
|
222
232
|
in a configuration file, using a ruby block, which looks like this:
|
223
233
|
|
234
|
+
~~~ruby
|
224
235
|
to_field "id" do |record, accumulator|
|
225
236
|
# take the record's 001, prefix it with "bib_",
|
226
237
|
# and then add it to the 'accumulator' argument,
|
@@ -229,6 +240,7 @@ in a configuration file, using a ruby block, which looks like this:
|
|
229
240
|
value = "bib_#{value}"
|
230
241
|
accumulator << value
|
231
242
|
end
|
243
|
+
~~~
|
232
244
|
|
233
245
|
`do |record, accumulator|` is the definition of a ruby block taking
|
234
246
|
two arguments. The first one passed in will be a MARC record. The
|
@@ -239,21 +251,25 @@ Here's a more realistic example that shows how you'd get the
|
|
239
251
|
record type byte 06 out of a MARC leader, then translate it
|
240
252
|
to a human-readable string with a TranslationMap
|
241
253
|
|
254
|
+
~~~ruby
|
242
255
|
to_field "marc_type" do |record, accumulator|
|
243
256
|
leader06 = record.leader.byteslice(6)
|
244
257
|
# this translation map doesn't actually exist, but could
|
245
258
|
accumulator << TranslationMap.new("marc_leader")[ leader06 ]
|
246
259
|
end
|
260
|
+
~~~
|
247
261
|
|
248
262
|
You can also add a block onto the end of a built-in 'macro', to
|
249
263
|
further customize the output. The `accumulator` passed to your block
|
250
264
|
will already have values in it from the first step, and you can
|
251
265
|
use ruby methods like `map!` to modify it:
|
252
266
|
|
267
|
+
~~~ruby
|
253
268
|
to_field "big_title", extract_marc("245abcdefg") do |record, accumulator|
|
254
269
|
# put it all in all uppercase, I don't know why.
|
255
270
|
accumulator.map! {|v| v.upcase}
|
256
271
|
end
|
272
|
+
~~~
|
257
273
|
|
258
274
|
There are many more things you can do with custom logic blocks like this too,
|
259
275
|
including additional features we haven't discussed yet.
|
@@ -357,7 +373,6 @@ Also see `-I load_path` option and suggestions for Bundler use under Extending W
|
|
357
373
|
See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
358
374
|
|
359
375
|
|
360
|
-
|
361
376
|
## Extending With Your Own Code
|
362
377
|
|
363
378
|
Traject config files are full live ruby files, where you can do anything,
|
@@ -393,7 +408,11 @@ Own Code](./doc/extending.md)
|
|
393
408
|
|
394
409
|
* [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
|
395
410
|
* [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
|
396
|
-
|
411
|
+
* Plugin extensions: Gems that add functionality to traject
|
412
|
+
* [traject_alephsequential_reader](https://github.com/traject-project/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Alpeh ILS.
|
413
|
+
* [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
|
414
|
+
* [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
|
415
|
+
|
397
416
|
|
398
417
|
# Development
|
399
418
|
|
@@ -415,6 +434,8 @@ and/or extra files in ./docs -- as appropriate for what needs to be docs.
|
|
415
434
|
**Inline api docs** Note that our [`.yardopts` file](./.yardopts) used by rdoc.info to generate
|
416
435
|
online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
|
417
436
|
|
437
|
+
Bundler rake tasks included for gem releases: `rake release`
|
438
|
+
|
418
439
|
## TODO
|
419
440
|
|
420
441
|
|
data/doc/indexing_rules.md
CHANGED
@@ -58,7 +58,7 @@ you need to modify the array in-place.
|
|
58
58
|
The third optional context argument
|
59
59
|
|
60
60
|
The third optional argument is a
|
61
|
-
[Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/
|
61
|
+
[Traject::Indexer::Context](./lib/traject/indexer/context.rb) ([rdoc](http://rdoc.info/github/traject-project/traject/Traject/Indexer/Context))
|
62
62
|
object. Most of the time you don't need it, but you can use it for
|
63
63
|
some sophisticated functionality, for example using these Context methods:
|
64
64
|
|
data/lib/traject/marc_extractor.rb
CHANGED
@@ -36,8 +36,7 @@ module Traject
|
|
36
36
|
# and includes a tag and a a byte slice specification.
|
37
37
|
#
|
38
38
|
# "008[35-37]:007[5]""
|
39
|
-
# => bytes 35-37 inclusive of any field 008, and byte 5 of any field 007
|
40
|
-
# "LDR" as a pseudo-tag to take byte slices of leader?)
|
39
|
+
# => bytes 35-37 inclusive of any field 008, and byte 5 of any field 007
|
41
40
|
#
|
42
41
|
# * subfields and indicators can only be provided for marc data/variable fields
|
43
42
|
# * byte slice can only be provided for marc control fields (generally tags less than 010)
|
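The byte-slice form of the specification string shown in the comment above (`"008[35-37]:007[5]"`) could be parsed roughly as follows. This is a minimal sketch under stated assumptions: `ByteSlice` and `parse_byte_slice_spec` are hypothetical names introduced here, and the real parsing in `Traject::MarcExtractor` may differ.

```ruby
# Hypothetical parser, for illustration only; not traject's implementation.
ByteSlice = Struct.new(:tag, :bytes)

def parse_byte_slice_spec(spec)
  spec.split(":").map do |part|
    # Three-digit tag, then [N] or [N-M] with inclusive byte positions.
    m = /\A(\d{3})\[(\d+)(?:-(\d+))?\]\z/.match(part.strip)
    raise ArgumentError, "unrecognized spec part: #{part.inspect}" unless m

    first = m[2].to_i
    last  = m[3] ? m[3].to_i : first
    ByteSlice.new(m[1], first..last)
  end
end

slices = parse_byte_slice_spec("008[35-37]:007[5]")

# Made-up 40-character control field value with "eng" at positions 35-37.
field_008 = ("|" * 35) + "eng" + " d"
language  = field_008[slices[0].bytes]
```

Because the range is inclusive, `008[35-37]` selects three bytes, which matches the "bytes 35-37 inclusive" wording in the comment.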
data/lib/traject/util.rb
CHANGED
@@ -56,8 +56,6 @@ module Traject
|
|
56
56
|
Object.const_set("HttpSolrServer", org.apache.solr.client.solrj.impl.HttpSolrServer) unless defined? ::HttpSolrServer
|
57
57
|
Object.const_set("SolrInputDocument", org.apache.solr.common.SolrInputDocument) unless defined? ::SolrInputDocument
|
58
58
|
rescue NameError => e
|
59
|
-
# /Users/jrochkind/code/solrj-gem/lib"
|
60
|
-
|
61
59
|
included_jar_dir = File.expand_path("../../vendor/solrj/lib", File.dirname(__FILE__))
|
62
60
|
|
63
61
|
jardir = settings["solrj.jar_dir"] || included_jar_dir
|
data/lib/traject/version.rb
CHANGED
data/test/indexer/macros_marc21_semantics_test.rb
CHANGED
@@ -157,6 +157,18 @@ describe "Traject::Macros::Marc21Semantics" do
|
|
157
157
|
@record = MARC::Reader.new(support_file_path "date_type_r_missing_date2.marc").to_a.first
|
158
158
|
assert_equal 1957, Marc21Semantics.publication_date(@record)
|
159
159
|
end
|
160
|
+
|
161
|
+
it "works correctly with date type 'q'" do
|
162
|
+
val = @record['008'].value
|
163
|
+
val[6] = 'q'
|
164
|
+
val[7..10] = '191u'
|
165
|
+
val[11..14] = '192u'
|
166
|
+
@record['008'].value = val
|
167
|
+
|
168
|
+
# Date should be date1 + date2 / 2 = (1910 + 1929) / 2 = 1919
|
169
|
+
estimate_tolerance = 30
|
170
|
+
assert_equal 1919, Marc21Semantics.publication_date(@record, estimate_tolerance)
|
171
|
+
end
|
160
172
|
end
|
161
173
|
|
162
174
|
describe "marc_lcc_to_broad_category" do
|
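The arithmetic behind the new date type 'q' test can be worked through in a few lines. The sketch below is an assumption about how such an estimate might be computed, not a copy of `Marc21Semantics.publication_date`: `estimate_questionable_year` is a hypothetical helper, and returning `nil` when the range exceeds the tolerance is assumed behaviour.

```ruby
# Hypothetical helper; illustrates the midpoint arithmetic from the test only.
# date1/date2 are the four-character 008 date strings, where 'u' marks an
# unknown digit.
def estimate_questionable_year(date1, date2, tolerance)
  low  = date1.tr("u", "0").to_i   # earliest reading: '191u' -> 1910
  high = date2.tr("u", "9").to_i   # latest reading:   '192u' -> 1929

  # Assumed behaviour: refuse to guess when the range is wider than tolerance.
  return nil if (high - low) > tolerance

  (low + high) / 2                 # integer midpoint
end

estimate_questionable_year("191u", "192u", 30)   # => 1919
estimate_questionable_year("191u", "192u", 10)   # => nil
```

Note that the midpoint is `(1910 + 1929) / 2`, with the sum parenthesized before dividing; Ruby integer division truncates 1919.5 to 1919, which is the value the test asserts.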
data/traject.gemspec
CHANGED
@@ -9,7 +9,7 @@ Gem::Specification.new do |spec|
|
|
9
9
|
spec.authors = ["Jonathan Rochkind", "Bill Dueber"]
|
10
10
|
spec.email = ["none@nowhere.org"]
|
11
11
|
spec.summary = %q{Index MARC to Solr; or generally process source records to hash-like structures}
|
12
|
-
spec.homepage = "http://github.com/
|
12
|
+
spec.homepage = "http://github.com/traject-project/traject"
|
13
13
|
spec.license = "MIT"
|
14
14
|
|
15
15
|
spec.files = `git ls-files`.split($/)
|
@@ -21,7 +21,7 @@ Gem::Specification.new do |spec|
|
|
21
21
|
|
22
22
|
|
23
23
|
spec.add_dependency "marc", ">= 0.7.1"
|
24
|
-
spec.add_dependency "marc-marc4j", ">=0.1.1"
|
24
|
+
spec.add_dependency "marc-marc4j", ">=0.1.1" # use and convert marc4j
|
25
25
|
spec.add_dependency "hashie", ">= 2.0.5", "< 2.1" # used for Indexer#settings
|
26
26
|
spec.add_dependency "slop", ">= 3.4.5", "< 4.0" # command line parsing
|
27
27
|
spec.add_dependency "yell" # logging
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: traject
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 1.0.0.beta.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jonathan Rochkind
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2013-10-
|
12
|
+
date: 2013-10-14 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: marc
|
@@ -265,7 +265,7 @@ files:
|
|
265
265
|
- vendor/solrj/lib/solr-solrj-4.3.1.jar
|
266
266
|
- vendor/solrj/lib/wstx-asl-3.2.7.jar
|
267
267
|
- vendor/solrj/lib/zookeeper-3.4.5.jar
|
268
|
-
homepage: http://github.com/
|
268
|
+
homepage: http://github.com/traject-project/traject
|
269
269
|
licenses:
|
270
270
|
- MIT
|
271
271
|
metadata: {}
|
@@ -280,9 +280,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
280
280
|
version: '0'
|
281
281
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
282
282
|
requirements:
|
283
|
-
- - '
|
283
|
+
- - '>'
|
284
284
|
- !ruby/object:Gem::Version
|
285
|
-
version:
|
285
|
+
version: 1.3.1
|
286
286
|
requirements: []
|
287
287
|
rubyforge_project:
|
288
288
|
rubygems_version: 2.1.5
|