traject 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/Gemfile +1 -1
- data/doc/indexing_rules.md +2 -2
- data/lib/traject/macros/marc21_semantics.rb +13 -5
- data/lib/traject/translation_map.rb +28 -32
- data/lib/traject/version.rb +1 -1
- data/test/indexer/macros_marc21_semantics_test.rb +19 -0
- data/test/translation_map_test.rb +73 -7
- data/traject.gemspec +1 -0
- metadata +44 -30
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 4ae9c6a2d87868021cae1b48637592238387d8a1
|
4
|
+
data.tar.gz: 578c645162da3560ff5e01a28cf43b36e82734f8
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: e0bf13c4ff3cab492b6be8922ae22e33311f701756910526eb3206522774b1519db07324531ebd2c366d5907854c78bb6cde0d65eeff78258513d37fae1a3a57
|
7
|
+
data.tar.gz: dd251b15afafe2a11cbefe8493e145d9b9b883183b5f884a5b61557046c74b699427fe2766d5df587e27729f6c22052b2f672b756aba8fee33149f6abbcc4f40
|
data/Gemfile
CHANGED
data/doc/indexing_rules.md
CHANGED
@@ -62,7 +62,7 @@ The third optional argument is a
|
|
62
62
|
object. Most of the time you don't need it, but you can use it for
|
63
63
|
some sophisticated functionality, for example using these Context methods:
|
64
64
|
|
65
|
-
* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard.
|
65
|
+
* `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
|
66
66
|
* `context.position` The position of the record in the input file (e.g., was it the first record, seoncd, etc.). Useful for error reporting
|
67
67
|
* `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples
|
68
68
|
* `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
|
@@ -262,4 +262,4 @@ args for `each_record`.
|
|
262
262
|
|
263
263
|
* **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.
|
264
264
|
|
265
|
-
* **By default, `trajcet` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
|
265
|
+
* **By default, `trajcet` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
|
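The thread-safety note in `indexing_rules.md` above can be illustrated with a small sketch. This is not traject code: `LookupCache` and its `fetch` method are hypothetical names, and the only assumption is that some state (here an expensive lookup result) is shared across records, and so across worker threads. State that lives only in the record-specific `context.clipboard` is described above as not persisting between records.

```ruby
# Hypothetical helper (not part of traject): a cache shared by all
# indexing threads, guarded by a Mutex so concurrent records cannot
# corrupt it or compute the same value twice.
class LookupCache
  def initialize
    @mutex  = Mutex.new
    @values = {}
  end

  # Returns the cached value for key, computing it with the given
  # block at most once, even when called from several threads.
  def fetch(key)
    @mutex.synchronize do
      @values.fetch(key) { @values[key] = yield }
    end
  end
end

cache        = LookupCache.new
computations = 0

# Simulate four worker threads that each look up the same ten keys.
threads = 4.times.map do
  Thread.new do
    100.times do |i|
      cache.fetch(i % 10) do
        computations += 1 # runs while the mutex is held
        "value-#{i % 10}"
      end
    end
  end
end
threads.each(&:join)

puts computations # each of the ten keys is computed once
```

Holding the lock while the block runs keeps the sketch simple at the cost of serializing slow lookups; a real configuration might prefer a finer-grained scheme.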
data/lib/traject/macros/marc21_semantics.rb
CHANGED
@@ -26,15 +26,23 @@ module Traject::Macros
|
|
26
26
|
accumulator.concat list.uniq if list
|
27
27
|
end
|
28
28
|
end
|
29
|
+
|
29
30
|
# If a num begins with a known OCLC prefix, return it without the prefix.
|
30
31
|
# otherwise nil.
|
32
|
+
#
|
33
|
+
# Allow (OCoLC) and/or ocn/ocm/on
|
34
|
+
|
35
|
+
OCLCPAT = /
|
36
|
+
\A\s*
|
37
|
+
(?:(?:\(OCoLC\)) |
|
38
|
+
(?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
|
39
|
+
)(\d+)
|
40
|
+
/x
|
41
|
+
|
31
42
|
def self.oclcnum_extract(num)
|
32
|
-
|
33
|
-
|
34
|
-
# it had the prefix, which we've now stripped
|
35
|
-
return stripped
|
43
|
+
if OCLCPAT.match(num)
|
44
|
+
return $1
|
36
45
|
else
|
37
|
-
# it didn't have the prefix
|
38
46
|
return nil
|
39
47
|
end
|
40
48
|
end
|
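To see what the new `OCLCPAT` pattern accepts, here is a standalone sketch. The regular expression is copied from the hunk above with indentation added; the method is a free-standing approximation of `oclcnum_extract` that uses `MatchData#[]` rather than `$1`, so it is an illustration and not the library method itself. The sample values come from the test added in this release.

```ruby
# Pattern copied from the diff: an optional "(OCoLC)" prefix and/or an
# "ocm", "ocn" or "on" prefix, followed by the digits we want.
OCLCPAT = /
  \A\s*
  (?:(?:\(OCoLC\)) |
     (?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
  )(\d+)
/x

# Standalone approximation of the extraction step (illustrative only):
# returns the digits after a known OCLC prefix, or nil otherwise.
def oclcnum_extract(num)
  match = OCLCPAT.match(num)
  match ? match[1] : nil
end

samples = [
  "(OCoLC)ocm111111111",
  "(OCoLC)222222222",
  "ocm333333333",
  "ocn444444444",
  "(OCoLC)on666666666",
  "777777777" # no known prefix, so not treated as an OCLC number
]

samples.each do |value|
  puts "#{value.inspect} => #{oclcnum_extract(value).inspect}"
end
```

Note that the pattern anchors only the start of the string, so any text after the digits is ignored.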
data/lib/traject/translation_map.rb
CHANGED
@@ -1,6 +1,7 @@
|
|
1
1
|
require 'traject'
|
2
2
|
|
3
3
|
require 'yaml'
|
4
|
+
require 'dot-properties'
|
4
5
|
|
5
6
|
|
6
7
|
module Traject
|
@@ -14,7 +15,8 @@ module Traject
|
|
14
15
|
#
|
15
16
|
# What makes it more useful than a stunted hash is it's ability to load
|
16
17
|
# the hash definitions from configuration files, either pure ruby,
|
17
|
-
# yaml, or
|
18
|
+
# yaml, or java .properties file (not all .properties features may
|
19
|
+
# be supported, we use dot-properties gem for reading)
|
18
20
|
#
|
19
21
|
# traject's `extract_marc` macro allows you to specify a :translation_map=>filename argument
|
20
22
|
# that will automatically find and use a translation map on the resulting data:
|
@@ -197,7 +199,9 @@ module Traject
|
|
197
199
|
# Returns a dup of internal hash, dup so you can modify it
|
198
200
|
# if you like.
|
199
201
|
def to_hash
|
200
|
-
@hash.dup
|
202
|
+
dup = @hash.dup
|
203
|
+
dup.delete("__default__")
|
204
|
+
dup
|
201
205
|
end
|
202
206
|
|
203
207
|
# Run every element of an array through this translation map,
|
@@ -224,6 +228,24 @@ module Traject
|
|
224
228
|
array.replace( self.translate_array(array))
|
225
229
|
end
|
226
230
|
|
231
|
+
# Return a new TranslationMap that results from merging argument on top of self.
|
232
|
+
# Can be useful for taking an existing translation map, but merging a few
|
233
|
+
# overrides on top.
|
234
|
+
#
|
235
|
+
# merged_map = TranslationMap.new(something).merge TranslationMap.new(else)
|
236
|
+
# #...
|
237
|
+
# merged_map.translate_array(something) # etc
|
238
|
+
#
|
239
|
+
# If a default is set in the second map, it will merge over the first too.
|
240
|
+
#
|
241
|
+
# You can also pass in a plain hash as an arg, instead of an existing TranslationMap:
|
242
|
+
#
|
243
|
+
# TranslationMap.new(something).merge("overridden_key" => "value", "a" => "")
|
244
|
+
def merge(other_map)
|
245
|
+
default = other_map.default || self.default
|
246
|
+
TranslationMap.new(self.to_hash.merge(other_map.to_hash), :default => default)
|
247
|
+
end
|
248
|
+
|
227
249
|
class NotFound < Exception
|
228
250
|
def initialize(path)
|
229
251
|
super("No translation map definition file found at 'translation_maps/#{path}.[rb|yaml|properties]' in load path: #{$LOAD_PATH}")
|
@@ -232,36 +254,10 @@ module Traject
|
|
232
254
|
|
233
255
|
protected
|
234
256
|
|
235
|
-
#
|
236
|
-
#
|
237
|
-
|
238
|
-
|
239
|
-
#
|
240
|
-
# We do NOT handle split lines, don't do that!
|
241
|
-
def self.read_properties(file_name)
|
242
|
-
hash = {}
|
243
|
-
i = 0
|
244
|
-
f = File.open(file_name)
|
245
|
-
f.each_line do |line|
|
246
|
-
i += 1
|
247
|
-
|
248
|
-
line.strip!
|
249
|
-
|
250
|
-
# skip blank lines
|
251
|
-
next if line.empty?
|
252
|
-
|
253
|
-
# skip comment lines
|
254
|
-
next if line =~ /^\s*[!\#].*$/
|
255
|
-
|
256
|
-
if line =~ /\A([^:=]+)[\:\=]\s*(.*)\s*\Z/
|
257
|
-
hash[$1.strip] = $2
|
258
|
-
else
|
259
|
-
raise IOError.new("Can't parse from #{file_name} line #{i}: #{line}")
|
260
|
-
end
|
261
|
-
end
|
262
|
-
f.close
|
263
|
-
|
264
|
-
return hash
|
257
|
+
# We use dot-properties gem for reading .properties files,
|
258
|
+
# return a hash.
|
259
|
+
def self.read_properties(file_name)
|
260
|
+
return DotProperties.load(file_name).to_h
|
265
261
|
end
|
266
262
|
|
267
263
|
end
|
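The `to_hash` and `merge` changes to `translation_map.rb` above can be sketched without loading traject. `MiniTranslationMap` below is a hypothetical stand-in written for this illustration, not `Traject::TranslationMap`; in particular, the way it picks up a `"__default__"` key in its constructor is an assumption. The `to_hash` and `merge` bodies follow the added lines in the diff.

```ruby
# Hypothetical stand-in for illustration; NOT Traject::TranslationMap.
class MiniTranslationMap
  attr_reader :default

  # Assumption: an explicit default wins, otherwise a "__default__"
  # entry in the hash (if any) is used as the default.
  def initialize(hash, default: nil)
    @hash    = hash.dup
    @default = default || @hash["__default__"]
  end

  # Look up a key, falling back to the default when it is missing.
  def [](key)
    @hash.fetch(key) { default }
  end

  # As in the diff: return a copy without the "__default__" entry.
  def to_hash
    copy = @hash.dup
    copy.delete("__default__")
    copy
  end

  # As in the diff: the other map's entries and default win. A plain
  # Hash also works here, because Hash responds to both #default
  # (nil unless set) and #to_hash.
  def merge(other)
    merged_default = other.default || default
    MiniTranslationMap.new(to_hash.merge(other.to_hash), default: merged_default)
  end
end

base = MiniTranslationMap.new(
  { "key1" => "value1", "other" => "original", "__default__" => "BASE_DEFAULT" }
)
merged = base.merge({ "other" => "OVERRIDE", "new" => "NEW" })

puts merged["key1"]
puts merged["other"]
puts merged["not in the map"]
puts merged.to_hash.key?("__default__")
```

Because `to_hash` strips `"__default__"` before merging, the default travels only through the explicit `default` argument, which is what lets the second map's default override the first.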
data/lib/traject/version.rb
CHANGED
data/test/indexer/macros_marc21_semantics_test.rb
CHANGED
@@ -31,6 +31,25 @@ describe "Traject::Macros::Marc21Semantics" do
|
|
31
31
|
|
32
32
|
assert_equal({}, @indexer.map_record(empty_record))
|
33
33
|
end
|
34
|
+
|
35
|
+
it "deals with all prefixed OCLC nunbers" do
|
36
|
+
@record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)ocm111111111']))
|
37
|
+
@record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)222222222']))
|
38
|
+
@record.append(MARC::DataField.new('035', ' ', ' ', ['a', 'ocm333333333']))
|
39
|
+
@record.append(MARC::DataField.new('035', ' ', ' ', ['a', 'ocn444444444']))
|
40
|
+
@record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)ocn555555555']))
|
41
|
+
@record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)on666666666']))
|
42
|
+
@record.append(MARC::DataField.new('035', ' ', ' ', ['a', '777777777'])) # not OCLC number
|
43
|
+
|
44
|
+
@indexer.instance_eval do
|
45
|
+
to_field "oclcnum", oclcnum
|
46
|
+
end
|
47
|
+
output = @indexer.map_record(@record)
|
48
|
+
|
49
|
+
assert_equal %w{47971712 111111111 222222222 333333333 444444444 555555555 666666666}, output["oclcnum"]
|
50
|
+
end
|
51
|
+
|
52
|
+
|
34
53
|
|
35
54
|
it "#marc_series_facet" do
|
36
55
|
@record = MARC::Reader.new(support_file_path "louis_armstrong.marc").to_a.first
|
data/test/translation_map_test.rb
CHANGED
@@ -109,7 +109,7 @@ describe "TranslationMap" do
|
|
109
109
|
|
110
110
|
assert_equal "DEFAULT LITERAL", map["not in the map"]
|
111
111
|
end
|
112
|
-
|
112
|
+
|
113
113
|
it "respects __default__ __passthrough__" do
|
114
114
|
map = Traject::TranslationMap.new("default_passthrough")
|
115
115
|
|
@@ -135,16 +135,82 @@ describe "TranslationMap" do
|
|
135
135
|
assert_equal ["one"], values
|
136
136
|
end
|
137
137
|
|
138
|
-
|
139
|
-
|
138
|
+
describe "#to_hash" do
|
139
|
+
it "produces a hash" do
|
140
|
+
map = Traject::TranslationMap.new("yaml_map")
|
141
|
+
|
142
|
+
hash = map.to_hash
|
143
|
+
|
144
|
+
assert_kind_of Hash, hash
|
145
|
+
|
146
|
+
assert ! hash.frozen?, "#to_hash result is not frozen"
|
147
|
+
|
148
|
+
refute_same hash, map.to_hash, "each #to_hash result is a copy"
|
149
|
+
end
|
150
|
+
|
151
|
+
it "does not include __default__ key" do
|
152
|
+
map = Traject::TranslationMap.new("default_passthrough")
|
153
|
+
|
154
|
+
refute map.to_hash.has_key?("__default__")
|
155
|
+
assert_nil map.to_hash["__default__"]
|
156
|
+
end
|
157
|
+
|
158
|
+
end
|
159
|
+
|
160
|
+
describe "#merge" do
|
161
|
+
it "merges" do
|
162
|
+
original = Traject::TranslationMap.new("yaml_map")
|
163
|
+
override = Traject::TranslationMap.new("other" => "OVERRIDE", "new" => "NEW")
|
164
|
+
|
165
|
+
merged = original.merge(override)
|
140
166
|
|
141
|
-
|
167
|
+
assert_equal "value1", merged["key1"]
|
168
|
+
assert_equal "OVERRIDE", merged["other"]
|
169
|
+
assert_equal "NEW", merged["new"]
|
170
|
+
end
|
171
|
+
|
172
|
+
it "passes through default from first map when no default in second" do
|
173
|
+
original = Traject::TranslationMap.new("yaml_map", :default => "DEFAULT_VALUE")
|
174
|
+
override = Traject::TranslationMap.new("other" => "OVERRIDE")
|
175
|
+
|
176
|
+
merged = original.merge(override)
|
177
|
+
|
178
|
+
assert_equal "DEFAULT_VALUE", merged.default
|
179
|
+
assert_equal "DEFAULT_VALUE", merged["SOME_KEY_NOT_MATCHED"]
|
180
|
+
end
|
181
|
+
|
182
|
+
it "passes through default from second map when no default in first" do
|
183
|
+
original = Traject::TranslationMap.new("yaml_map")
|
184
|
+
override = Traject::TranslationMap.new({"other" => "OVERRIDE"}, :default => "DEFAULT_VALUE")
|
185
|
+
|
186
|
+
merged = original.merge(override)
|
142
187
|
|
143
|
-
|
188
|
+
assert_equal "DEFAULT_VALUE", merged.default
|
189
|
+
assert_equal "DEFAULT_VALUE", merged["SOME_KEY_NOT_MATCHED"]
|
190
|
+
end
|
191
|
+
|
192
|
+
it "merges second default on top of first" do
|
193
|
+
original = Traject::TranslationMap.new("yaml_map", :default => "DEFAULT_VALUE")
|
194
|
+
override = Traject::TranslationMap.new({"other" => "OVERRIDE"}, :default => "NEW_DEFAULT_VALUE")
|
144
195
|
|
145
|
-
|
196
|
+
merged = original.merge(override)
|
197
|
+
|
198
|
+
assert_equal "NEW_DEFAULT_VALUE", merged.default
|
199
|
+
assert_equal "NEW_DEFAULT_VALUE", merged["SOME_KEY_NOT_MATCHED"]
|
200
|
+
end
|
201
|
+
|
202
|
+
it "merges in a plain hash too" do
|
203
|
+
original = Traject::TranslationMap.new("yaml_map")
|
204
|
+
merged = original.merge(
|
205
|
+
"other" => "OVERRIDE",
|
206
|
+
"new" => "NEW"
|
207
|
+
)
|
208
|
+
|
209
|
+
assert_equal "value1", merged["key1"]
|
210
|
+
assert_equal "OVERRIDE", merged["other"]
|
211
|
+
assert_equal "NEW", merged["new"]
|
212
|
+
end
|
146
213
|
|
147
|
-
refute_same hash, map.to_hash, "each #to_hash result is a copy"
|
148
214
|
end
|
149
215
|
|
150
216
|
end
|
data/traject.gemspec
CHANGED
@@ -25,6 +25,7 @@ Gem::Specification.new do |spec|
|
|
25
25
|
spec.add_dependency "hashie", ">= 2.0.5", "< 2.1" # used for Indexer#settings
|
26
26
|
spec.add_dependency "slop", ">= 3.4.5", "< 4.0" # command line parsing
|
27
27
|
spec.add_dependency "yell" # logging
|
28
|
+
spec.add_dependency "dot-properties", ">= 0.1.1" # reading java style .properties
|
28
29
|
|
29
30
|
spec.add_development_dependency "bundler", "~> 1.3"
|
30
31
|
spec.add_development_dependency "rake"
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: traject
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.
|
4
|
+
version: 1.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Jonathan Rochkind
|
@@ -9,38 +9,48 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date:
|
12
|
+
date: 2014-04-07 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
|
+
requirement: !ruby/object:Gem::Requirement
|
16
|
+
requirements:
|
17
|
+
- - '>='
|
18
|
+
- !ruby/object:Gem::Version
|
19
|
+
version: 0.8.0
|
15
20
|
name: marc
|
21
|
+
prerelease: false
|
22
|
+
type: :runtime
|
16
23
|
version_requirements: !ruby/object:Gem::Requirement
|
17
24
|
requirements:
|
18
25
|
- - '>='
|
19
26
|
- !ruby/object:Gem::Version
|
20
27
|
version: 0.8.0
|
28
|
+
- !ruby/object:Gem::Dependency
|
21
29
|
requirement: !ruby/object:Gem::Requirement
|
22
30
|
requirements:
|
23
31
|
- - '>='
|
24
32
|
- !ruby/object:Gem::Version
|
25
|
-
version: 0.
|
33
|
+
version: 0.1.1
|
34
|
+
name: marc-marc4j
|
26
35
|
prerelease: false
|
27
36
|
type: :runtime
|
28
|
-
- !ruby/object:Gem::Dependency
|
29
|
-
name: marc-marc4j
|
30
37
|
version_requirements: !ruby/object:Gem::Requirement
|
31
38
|
requirements:
|
32
39
|
- - '>='
|
33
40
|
- !ruby/object:Gem::Version
|
34
41
|
version: 0.1.1
|
42
|
+
- !ruby/object:Gem::Dependency
|
35
43
|
requirement: !ruby/object:Gem::Requirement
|
36
44
|
requirements:
|
37
45
|
- - '>='
|
38
46
|
- !ruby/object:Gem::Version
|
39
|
-
version: 0.
|
47
|
+
version: 2.0.5
|
48
|
+
- - <
|
49
|
+
- !ruby/object:Gem::Version
|
50
|
+
version: '2.1'
|
51
|
+
name: hashie
|
40
52
|
prerelease: false
|
41
53
|
type: :runtime
|
42
|
-
- !ruby/object:Gem::Dependency
|
43
|
-
name: hashie
|
44
54
|
version_requirements: !ruby/object:Gem::Requirement
|
45
55
|
requirements:
|
46
56
|
- - '>='
|
@@ -49,18 +59,18 @@ dependencies:
|
|
49
59
|
- - <
|
50
60
|
- !ruby/object:Gem::Version
|
51
61
|
version: '2.1'
|
62
|
+
- !ruby/object:Gem::Dependency
|
52
63
|
requirement: !ruby/object:Gem::Requirement
|
53
64
|
requirements:
|
54
65
|
- - '>='
|
55
66
|
- !ruby/object:Gem::Version
|
56
|
-
version:
|
67
|
+
version: 3.4.5
|
57
68
|
- - <
|
58
69
|
- !ruby/object:Gem::Version
|
59
|
-
version: '
|
70
|
+
version: '4.0'
|
71
|
+
name: slop
|
60
72
|
prerelease: false
|
61
73
|
type: :runtime
|
62
|
-
- !ruby/object:Gem::Dependency
|
63
|
-
name: slop
|
64
74
|
version_requirements: !ruby/object:Gem::Requirement
|
65
75
|
requirements:
|
66
76
|
- - '>='
|
@@ -69,72 +79,76 @@ dependencies:
|
|
69
79
|
- - <
|
70
80
|
- !ruby/object:Gem::Version
|
71
81
|
version: '4.0'
|
82
|
+
- !ruby/object:Gem::Dependency
|
72
83
|
requirement: !ruby/object:Gem::Requirement
|
73
84
|
requirements:
|
74
85
|
- - '>='
|
75
86
|
- !ruby/object:Gem::Version
|
76
|
-
version:
|
77
|
-
|
78
|
-
- !ruby/object:Gem::Version
|
79
|
-
version: '4.0'
|
87
|
+
version: '0'
|
88
|
+
name: yell
|
80
89
|
prerelease: false
|
81
90
|
type: :runtime
|
82
|
-
- !ruby/object:Gem::Dependency
|
83
|
-
name: yell
|
84
91
|
version_requirements: !ruby/object:Gem::Requirement
|
85
92
|
requirements:
|
86
93
|
- - '>='
|
87
94
|
- !ruby/object:Gem::Version
|
88
95
|
version: '0'
|
96
|
+
- !ruby/object:Gem::Dependency
|
89
97
|
requirement: !ruby/object:Gem::Requirement
|
90
98
|
requirements:
|
91
99
|
- - '>='
|
92
100
|
- !ruby/object:Gem::Version
|
93
|
-
version:
|
101
|
+
version: 0.1.1
|
102
|
+
name: dot-properties
|
94
103
|
prerelease: false
|
95
104
|
type: :runtime
|
96
|
-
- !ruby/object:Gem::Dependency
|
97
|
-
name: bundler
|
98
105
|
version_requirements: !ruby/object:Gem::Requirement
|
99
106
|
requirements:
|
100
|
-
- -
|
107
|
+
- - '>='
|
101
108
|
- !ruby/object:Gem::Version
|
102
|
-
version:
|
109
|
+
version: 0.1.1
|
110
|
+
- !ruby/object:Gem::Dependency
|
103
111
|
requirement: !ruby/object:Gem::Requirement
|
104
112
|
requirements:
|
105
113
|
- - ~>
|
106
114
|
- !ruby/object:Gem::Version
|
107
115
|
version: '1.3'
|
116
|
+
name: bundler
|
108
117
|
prerelease: false
|
109
118
|
type: :development
|
110
|
-
- !ruby/object:Gem::Dependency
|
111
|
-
name: rake
|
112
119
|
version_requirements: !ruby/object:Gem::Requirement
|
113
120
|
requirements:
|
114
|
-
- -
|
121
|
+
- - ~>
|
115
122
|
- !ruby/object:Gem::Version
|
116
|
-
version: '
|
123
|
+
version: '1.3'
|
124
|
+
- !ruby/object:Gem::Dependency
|
117
125
|
requirement: !ruby/object:Gem::Requirement
|
118
126
|
requirements:
|
119
127
|
- - '>='
|
120
128
|
- !ruby/object:Gem::Version
|
121
129
|
version: '0'
|
130
|
+
name: rake
|
122
131
|
prerelease: false
|
123
132
|
type: :development
|
124
|
-
- !ruby/object:Gem::Dependency
|
125
|
-
name: minitest
|
126
133
|
version_requirements: !ruby/object:Gem::Requirement
|
127
134
|
requirements:
|
128
135
|
- - '>='
|
129
136
|
- !ruby/object:Gem::Version
|
130
137
|
version: '0'
|
138
|
+
- !ruby/object:Gem::Dependency
|
131
139
|
requirement: !ruby/object:Gem::Requirement
|
132
140
|
requirements:
|
133
141
|
- - '>='
|
134
142
|
- !ruby/object:Gem::Version
|
135
143
|
version: '0'
|
144
|
+
name: minitest
|
136
145
|
prerelease: false
|
137
146
|
type: :development
|
147
|
+
version_requirements: !ruby/object:Gem::Requirement
|
148
|
+
requirements:
|
149
|
+
- - '>='
|
150
|
+
- !ruby/object:Gem::Version
|
151
|
+
version: '0'
|
138
152
|
description:
|
139
153
|
email:
|
140
154
|
- none@nowhere.org
|
@@ -288,7 +302,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
288
302
|
version: '0'
|
289
303
|
requirements: []
|
290
304
|
rubyforge_project:
|
291
|
-
rubygems_version: 2.1.
|
305
|
+
rubygems_version: 2.1.9
|
292
306
|
signing_key:
|
293
307
|
specification_version: 4
|
294
308
|
summary: Index MARC to Solr; or generally process source records to hash-like structures
|