traject 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Gemfile +1 -1
- data/doc/indexing_rules.md +2 -2
- data/lib/traject/macros/marc21_semantics.rb +13 -5
- data/lib/traject/translation_map.rb +28 -32
- data/lib/traject/version.rb +1 -1
- data/test/indexer/macros_marc21_semantics_test.rb +19 -0
- data/test/translation_map_test.rb +73 -7
- data/traject.gemspec +1 -0
- metadata +44 -30
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: 4ae9c6a2d87868021cae1b48637592238387d8a1
+ data.tar.gz: 578c645162da3560ff5e01a28cf43b36e82734f8
  SHA512:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: e0bf13c4ff3cab492b6be8922ae22e33311f701756910526eb3206522774b1519db07324531ebd2c366d5907854c78bb6cde0d65eeff78258513d37fae1a3a57
+ data.tar.gz: dd251b15afafe2a11cbefe8493e145d9b9b883183b5f884a5b61557046c74b699427fe2766d5df587e27729f6c22052b2f672b756aba8fee33149f6abbcc4f40
data/Gemfile
CHANGED
data/doc/indexing_rules.md
CHANGED
@@ -62,7 +62,7 @@ The third optional argument is a
  object. Most of the time you don't need it, but you can use it for
  some sophisticated functionality, for example using these Context methods:

- * `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard.
+ * `context.clipboard` A hash into which you can stuff values that you want to pass from one indexing step to another. For example, if you go through a bunch of work to query a database and get a result you'll need more than once, stick the results somewhere in the clipboard. This clipboard is record-specific, and won't persist between records.
  * `context.position` The position of the record in the input file (e.g., was it the first record, seoncd, etc.). Useful for error reporting
  * `context.output_hash` A hash mapping the field names (generally defined in `to_field` calls) to an array of values to be sent to the writer associated with that field. This allows you to modify what goes to the writer without going through a `to_field` call -- you can just set `context.output_hash['myfield'] = ['my', 'values']` and you're set. See below for more examples
  * `context.skip!(msg)` An assertion that this record should be ignored. No more indexing steps will be called, no results will be sent to the writer, and a `debug`-level log message will be written stating that the record was skipped.
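The record-specific clipboard described in this hunk can be illustrated with a small stand-alone sketch. It does not use traject itself: `SketchContext`, `index_record`, and the two step lambdas are hypothetical stand-ins, assumed here only to show values passing between steps within one record and not carrying over to the next.

```ruby
# Stand-in for the per-record context; NOT traject's real Context class.
SketchContext = Struct.new(:clipboard, :output_hash)

# Hypothetical step that does "expensive" work once and stores the result.
lookup_step = lambda do |record, context|
  context.clipboard[:title_upcased] = record[:title].upcase
end

# Hypothetical later step that reuses whatever is on the clipboard.
field_step = lambda do |record, context|
  context.output_hash["title_display"] = [context.clipboard[:title_upcased]]
end

# Each record gets a brand-new context, so the clipboard starts out empty.
def index_record(record, steps)
  context = SketchContext.new({}, {})
  steps.each { |step| step.call(record, context) }
  context
end

first  = index_record({ title: "first" },  [lookup_step, field_step])
second = index_record({ title: "second" }, [field_step]) # lookup step never ran
```

In this sketch `second` sees an empty clipboard even though `first` populated one, which is the behaviour the added sentence documents.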
@@ -262,4 +262,4 @@ args for `each_record`.

  * **Once you call `context.skip!(msg)` no more index steps will be run for that record**. So if you have any cleanup code, you'll need to make sure to call it yourself.

- * **By default, `trajcet` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
+ * **By default, `trajcet` indexing runs multi-threaded**. In the current implementation, the indexing steps for one record are *not* split across threads, but different records can be processed simultaneously by more than one thread. That means you need to make sure your code is thread-safe (or always set `processing_thread_pool` to 0).
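Since different records may be processed on different threads, any state shared across records needs protection. A minimal sketch of one common approach, guarding shared state with a `Mutex`, is shown here; the `SafeCounter` class and the thread counts are hypothetical and are not part of traject.

```ruby
# Hypothetical shared state that several record-processing threads update.
class SafeCounter
  def initialize
    @mutex  = Mutex.new
    @counts = Hash.new(0)
  end

  # Serialize updates so concurrent increments are not lost.
  def increment(key)
    @mutex.synchronize { @counts[key] += 1 }
  end

  # Hand out a copy so callers never touch the guarded hash directly.
  def to_h
    @mutex.synchronize { @counts.dup }
  end
end

counter = SafeCounter.new

# Simulate four worker threads, each handling 250 records.
threads = 4.times.map do
  Thread.new do
    250.times { counter.increment("book") }
  end
end
threads.each(&:join)

totals = counter.to_h
```

The alternative named in the documentation line above, setting `processing_thread_pool` to 0, avoids the problem by not processing records concurrently at all.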
data/lib/traject/macros/marc21_semantics.rb
CHANGED
@@ -26,15 +26,23 @@ module Traject::Macros
  accumulator.concat list.uniq if list
  end
  end
+
  # If a num begins with a known OCLC prefix, return it without the prefix.
  # otherwise nil.
+ #
+ # Allow (OCoLC) and/or ocn/ocm/on
+
+ OCLCPAT = /
+ \A\s*
+ (?:(?:\(OCoLC\)) |
+ (?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
+ )(\d+)
+ /x
+
  def self.oclcnum_extract(num)
-
-
- # it had the prefix, which we've now stripped
- return stripped
+ if OCLCPAT.match(num)
+ return $1
  else
- # it didn't have the prefix
  return nil
  end
  end
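To see what the new pattern accepts, the following stand-alone sketch copies `OCLCPAT` from the hunk above and wraps it in a top-level helper. The helper is an illustration only: it is defined outside the `Traject::Macros` module and uses `MatchData#[]` instead of `$1`, and the sample values are taken from the test added in this release.

```ruby
# Pattern copied from the hunk above: optional "(OCoLC)" and/or an
# ocm/ocn/on prefix, followed by the digits we want to capture.
OCLCPAT = /
  \A\s*
  (?:(?:\(OCoLC\)) |
     (?:\(OCoLC\))?(?:(?:ocm)|(?:ocn)|(?:on))
  )(\d+)
/x

# Illustrative top-level helper (the real method lives in the macros module).
def oclcnum_extract(num)
  match = OCLCPAT.match(num)
  match ? match[1] : nil
end

samples = [
  "(OCoLC)ocm111111111",
  "(OCoLC)222222222",
  "ocn444444444",
  "(OCoLC)on666666666",
  "777777777" # no known prefix, so not treated as an OCLC number
]

extracted = samples.map { |value| oclcnum_extract(value) }
```

Note that the pattern is anchored only at the start of the string, so trailing non-digit characters after the number are ignored rather than rejected.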
data/lib/traject/translation_map.rb
CHANGED
@@ -1,6 +1,7 @@
  require 'traject'

  require 'yaml'
+ require 'dot-properties'


  module Traject
@@ -14,7 +15,8 @@ module Traject
  #
  # What makes it more useful than a stunted hash is it's ability to load
  # the hash definitions from configuration files, either pure ruby,
- # yaml, or
+ # yaml, or java .properties file (not all .properties features may
+ # be supported, we use dot-properties gem for reading)
  #
  # traject's `extract_marc` macro allows you to specify a :translation_map=>filename argument
  # that will automatically find and use a translation map on the resulting data:
@@ -197,7 +199,9 @@ module Traject
  # Returns a dup of internal hash, dup so you can modify it
  # if you like.
  def to_hash
- @hash.dup
+ dup = @hash.dup
+ dup.delete("__default__")
+ dup
  end

  # Run every element of an array through this translation map,
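The effect of the `to_hash` change can be sketched in isolation. `SketchMap` below is a hypothetical stand-in, not traject's `TranslationMap`; it assumes only that the internal hash may carry a `"__default__"` entry, as the hunk above implies.

```ruby
# Hypothetical stand-in for a translation map; not traject's TranslationMap.
class SketchMap
  def initialize(hash)
    @hash = hash
  end

  # Copy the internal hash and drop the "__default__" entry from the copy,
  # mirroring the three added lines in the hunk above.
  def to_hash
    copy = @hash.dup
    copy.delete("__default__")
    copy
  end

  # Exposed only so the demo can inspect the internal state.
  def internal_keys
    @hash.keys
  end
end

map = SketchMap.new({ "a" => "Apple", "__default__" => "Unknown" })

exported = map.to_hash
exported["b"] = "Banana" # mutating the copy leaves the map untouched
```

Because the deletion happens on the copy, the map keeps its default internally while callers of `to_hash` see only real entries.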
@@ -224,6 +228,24 @@ module Traject
  array.replace( self.translate_array(array))
  end

+ # Return a new TranslationMap that results from merging argument on top of self.
+ # Can be useful for taking an existing translation map, but merging a few
+ # overrides on top.
+ #
+ # merged_map = TranslationMap.new(something).merge TranslationMap.new(else)
+ # #...
+ # merged_map.translate_array(something) # etc
+ #
+ # If a default is set in the second map, it will merge over the first too.
+ #
+ # You can also pass in a plain hash as an arg, instead of an existing TranslationMap:
+ #
+ # TranslationMap.new(something).merge("overridden_key" => "value", "a" => "")
+ def merge(other_map)
+ default = other_map.default || self.default
+ TranslationMap.new(self.to_hash.merge(other_map.to_hash), :default => default)
+ end
+
  class NotFound < Exception
  def initialize(path)
  super("No translation map definition file found at 'translation_maps/#{path}.[rb|yaml|properties]' in load path: #{$LOAD_PATH}")
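The merge semantics added here (entries from the argument win, and the argument's default wins when it has one) can be sketched with a simplified stand-in. `MergeSketch` is hypothetical: it takes the default as a second positional argument and does not model traject-specific defaults such as `__passthrough__`.

```ruby
# Simplified, hypothetical stand-in for TranslationMap's merge behaviour.
class MergeSketch
  attr_reader :default

  def initialize(hash, default = nil)
    @hash    = hash.dup
    @default = default
  end

  # Look up a key, falling back to the map's default when it is missing.
  def [](key)
    @hash.fetch(key, @default)
  end

  def to_hash
    @hash.dup
  end

  # Entries and default from `other` win; self's default is the fallback.
  # A plain Hash also works here: Hash#default returns nil unless one was
  # set, and Hash#to_hash returns the hash itself.
  def merge(other)
    MergeSketch.new(to_hash.merge(other.to_hash), other.default || default)
  end
end

base     = MergeSketch.new({ "key1" => "value1", "other" => "value2" }, "BASE_DEFAULT")
override = MergeSketch.new({ "other" => "OVERRIDE", "new" => "NEW" })

merged    = base.merge(override)
from_hash = base.merge({ "other" => "PLAIN" })
```

Relying on `default` and `to_hash` is what lets the real method accept either another map or a plain hash without a type check.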
@@ -232,36 +254,10 @@ module Traject

  protected

- #
- #
-
-
- #
- # We do NOT handle split lines, don't do that!
- def self.read_properties(file_name)
- hash = {}
- i = 0
- f = File.open(file_name)
- f.each_line do |line|
- i += 1
-
- line.strip!
-
- # skip blank lines
- next if line.empty?
-
- # skip comment lines
- next if line =~ /^\s*[!\#].*$/
-
- if line =~ /\A([^:=]+)[\:\=]\s*(.*)\s*\Z/
- hash[$1.strip] = $2
- else
- raise IOError.new("Can't parse from #{file_name} line #{i}: #{line}")
- end
- end
- f.close
-
- return hash
+ # We use dot-properties gem for reading .properties files,
+ # return a hash.
+ def self.read_properties(file_name)
+ return DotProperties.load(file_name).to_h
  end

  end
data/lib/traject/version.rb
CHANGED
data/test/indexer/macros_marc21_semantics_test.rb
CHANGED
@@ -31,6 +31,25 @@ describe "Traject::Macros::Marc21Semantics" do

  assert_equal({}, @indexer.map_record(empty_record))
  end
+
+ it "deals with all prefixed OCLC nunbers" do
+ @record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)ocm111111111']))
+ @record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)222222222']))
+ @record.append(MARC::DataField.new('035', ' ', ' ', ['a', 'ocm333333333']))
+ @record.append(MARC::DataField.new('035', ' ', ' ', ['a', 'ocn444444444']))
+ @record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)ocn555555555']))
+ @record.append(MARC::DataField.new('035', ' ', ' ', ['a', '(OCoLC)on666666666']))
+ @record.append(MARC::DataField.new('035', ' ', ' ', ['a', '777777777'])) # not OCLC number
+
+ @indexer.instance_eval do
+ to_field "oclcnum", oclcnum
+ end
+ output = @indexer.map_record(@record)
+
+ assert_equal %w{47971712 111111111 222222222 333333333 444444444 555555555 666666666}, output["oclcnum"]
+ end
+
+

  it "#marc_series_facet" do
  @record = MARC::Reader.new(support_file_path "louis_armstrong.marc").to_a.first
data/test/translation_map_test.rb
CHANGED
@@ -109,7 +109,7 @@ describe "TranslationMap" do

  assert_equal "DEFAULT LITERAL", map["not in the map"]
  end
-
+
  it "respects __default__ __passthrough__" do
  map = Traject::TranslationMap.new("default_passthrough")

@@ -135,16 +135,82 @@ describe "TranslationMap" do
  assert_equal ["one"], values
  end

-
-
+ describe "#to_hash" do
+ it "produces a hash" do
+ map = Traject::TranslationMap.new("yaml_map")
+
+ hash = map.to_hash
+
+ assert_kind_of Hash, hash
+
+ assert ! hash.frozen?, "#to_hash result is not frozen"
+
+ refute_same hash, map.to_hash, "each #to_hash result is a copy"
+ end
+
+ it "does not include __default__ key" do
+ map = Traject::TranslationMap.new("default_passthrough")
+
+ refute map.to_hash.has_key?("__default__")
+ assert_nil map.to_hash["__default__"]
+ end
+
+ end
+
+ describe "#merge" do
+ it "merges" do
+ original = Traject::TranslationMap.new("yaml_map")
+ override = Traject::TranslationMap.new("other" => "OVERRIDE", "new" => "NEW")
+
+ merged = original.merge(override)

-
+ assert_equal "value1", merged["key1"]
+ assert_equal "OVERRIDE", merged["other"]
+ assert_equal "NEW", merged["new"]
+ end
+
+ it "passes through default from first map when no default in second" do
+ original = Traject::TranslationMap.new("yaml_map", :default => "DEFAULT_VALUE")
+ override = Traject::TranslationMap.new("other" => "OVERRIDE")
+
+ merged = original.merge(override)
+
+ assert_equal "DEFAULT_VALUE", merged.default
+ assert_equal "DEFAULT_VALUE", merged["SOME_KEY_NOT_MATCHED"]
+ end
+
+ it "passes through default from second map when no default in first" do
+ original = Traject::TranslationMap.new("yaml_map")
+ override = Traject::TranslationMap.new({"other" => "OVERRIDE"}, :default => "DEFAULT_VALUE")
+
+ merged = original.merge(override)

-
+ assert_equal "DEFAULT_VALUE", merged.default
+ assert_equal "DEFAULT_VALUE", merged["SOME_KEY_NOT_MATCHED"]
+ end
+
+ it "merges second default on top of first" do
+ original = Traject::TranslationMap.new("yaml_map", :default => "DEFAULT_VALUE")
+ override = Traject::TranslationMap.new({"other" => "OVERRIDE"}, :default => "NEW_DEFAULT_VALUE")

-
+ merged = original.merge(override)
+
+ assert_equal "NEW_DEFAULT_VALUE", merged.default
+ assert_equal "NEW_DEFAULT_VALUE", merged["SOME_KEY_NOT_MATCHED"]
+ end
+
+ it "merges in a plain hash too" do
+ original = Traject::TranslationMap.new("yaml_map")
+ merged = original.merge(
+ "other" => "OVERRIDE",
+ "new" => "NEW"
+ )
+
+ assert_equal "value1", merged["key1"]
+ assert_equal "OVERRIDE", merged["other"]
+ assert_equal "NEW", merged["new"]
+ end

- refute_same hash, map.to_hash, "each #to_hash result is a copy"
  end

  end
data/traject.gemspec
CHANGED
@@ -25,6 +25,7 @@ Gem::Specification.new do |spec|
  spec.add_dependency "hashie", ">= 2.0.5", "< 2.1" # used for Indexer#settings
  spec.add_dependency "slop", ">= 3.4.5", "< 4.0" # command line parsing
  spec.add_dependency "yell" # logging
+ spec.add_dependency "dot-properties", ">= 0.1.1" # reading java style .properties

  spec.add_development_dependency "bundler", "~> 1.3"
  spec.add_development_dependency "rake"
metadata
CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: traject
  version: !ruby/object:Gem::Version
- version: 1.
+ version: 1.1.0
  platform: ruby
  authors:
  - Jonathan Rochkind
@@ -9,38 +9,48 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date:
+ date: 2014-04-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: 0.8.0
  name: marc
+ prerelease: false
+ type: :runtime
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
  version: 0.8.0
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
- version: 0.
+ version: 0.1.1
+ name: marc-marc4j
  prerelease: false
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: marc-marc4j
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
  version: 0.1.1
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
- version: 0.
+ version: 2.0.5
+ - - <
+ - !ruby/object:Gem::Version
+ version: '2.1'
+ name: hashie
  prerelease: false
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: hashie
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
@@ -49,18 +59,18 @@ dependencies:
  - - <
  - !ruby/object:Gem::Version
  version: '2.1'
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
- version:
+ version: 3.4.5
  - - <
  - !ruby/object:Gem::Version
- version: '
+ version: '4.0'
+ name: slop
  prerelease: false
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: slop
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
@@ -69,72 +79,76 @@ dependencies:
  - - <
  - !ruby/object:Gem::Version
  version: '4.0'
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
- version:
-
- - !ruby/object:Gem::Version
- version: '4.0'
+ version: '0'
+ name: yell
  prerelease: false
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: yell
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
- version:
+ version: 0.1.1
+ name: dot-properties
  prerelease: false
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: bundler
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
- version:
+ version: 0.1.1
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - ~>
  - !ruby/object:Gem::Version
  version: '1.3'
+ name: bundler
  prerelease: false
  type: :development
- - !ruby/object:Gem::Dependency
- name: rake
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - -
+ - - ~>
  - !ruby/object:Gem::Version
- version: '
+ version: '1.3'
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
  version: '0'
+ name: rake
  prerelease: false
  type: :development
- - !ruby/object:Gem::Dependency
- name: minitest
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
  - !ruby/object:Gem::Version
  version: '0'
+ name: minitest
  prerelease: false
  type: :development
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: '0'
  description:
  email:
  - none@nowhere.org
@@ -288,7 +302,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.1.
+ rubygems_version: 2.1.9
  signing_key:
  specification_version: 4
  summary: Index MARC to Solr; or generally process source records to hash-like structures