traject 3.0.0 → 3.4.0
- checksums.yaml +4 -4
- data/.travis.yml +3 -4
- data/CHANGES.md +65 -0
- data/README.md +9 -4
- data/doc/indexing_rules.md +5 -6
- data/doc/programmatic_use.md +25 -1
- data/doc/settings.md +4 -0
- data/doc/xml.md +12 -0
- data/lib/traject/indexer.rb +40 -4
- data/lib/traject/indexer/context.rb +45 -0
- data/lib/traject/indexer/step.rb +8 -12
- data/lib/traject/line_writer.rb +36 -4
- data/lib/traject/macros/marc21.rb +2 -2
- data/lib/traject/macros/marc21_semantics.rb +15 -12
- data/lib/traject/macros/nokogiri_macros.rb +9 -3
- data/lib/traject/nokogiri_reader.rb +17 -19
- data/lib/traject/oai_pmh_nokogiri_reader.rb +9 -3
- data/lib/traject/solr_json_writer.rb +167 -29
- data/lib/traject/version.rb +1 -1
- data/lib/translation_maps/marc_languages.yaml +77 -48
- data/test/delimited_writer_test.rb +14 -16
- data/test/indexer/class_level_configuration_test.rb +127 -0
- data/test/indexer/context_test.rb +64 -1
- data/test/indexer/error_handler_test.rb +18 -0
- data/test/indexer/macros/macros_marc21_semantics_test.rb +4 -0
- data/test/indexer/nokogiri_indexer_test.rb +35 -0
- data/test/nokogiri_reader_test.rb +66 -3
- data/test/solr_json_writer_test.rb +175 -7
- data/test/test_support/date_resort_to_264.marc +1 -0
- data/traject.gemspec +4 -4
- metadata +37 -16
checksums.yaml CHANGED

````diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c30572335810dc620f9a169df6f8f374512d3c472ea34bc03068106959fd1463
+  data.tar.gz: 3181c37e41e80416487d730e1983bc647daf480a3a308db30e294c7587adc644
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 83b73a10113e75106a0fb7af9bec79802d2e3f5c8f3e07742f33a52642a9441c20769072f8ea5bd532011b7d172db6ca007121d6874f705008e1a5a511ca1ff8
+  data.tar.gz: d9c53588e8adbd76764c20012baf702591276d84c2cd64ed0bb0d5b742699607a2287d30a2fb4f70c1bf6f8a7338d716d989b208c8983c2a87eeddbd6d96dd3d
````
data/.travis.yml CHANGED

````diff
@@ -6,13 +6,12 @@ sudo: true
 rvm:
   - 2.4.4
   - 2.5.1
-  -
+  - 2.6.1
+  - 2.7.0
 # avoid having travis install jdk on MRI builds where we don't need it.
 matrix:
   include:
     - jdk: openjdk8
       rvm: jruby-9.1.17.0
     - jdk: openjdk8
-      rvm: jruby-9.2.
-  allow_failures:
-    - rvm: "2.6.0-preview2"
+      rvm: jruby-9.2.6.0
````
data/CHANGES.md CHANGED

````diff
@@ -1,5 +1,70 @@
 # Changes
 
+## Next
+
+*
+
+*
+
+## 3.4.0
+
+* XML-mode `extract_xpath` now supports extracting attribute values with xpath @attr syntax.
+
+## 3.3.0
+
+* `Traject::Macros::Marc21Semantics.publication_date` now gets date from 264 before 260. https://github.com/traject/traject/pull/233
+
+* Allow hashie 4.x in gemspec https://github.com/traject/traject/pull/234
+
+* Allow `http` gem 4.x versions. https://github.com/traject/traject/pull/236
+
+* Can now call class-level Indexer.configure multiple times https://github.com/sciencehistory/scihist_digicoll/pull/525
+
+## 3.2.0
+
+* NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogiri to parse in strict mode, so it will immediately raise on ill-formed XML, instead of Nokogiri's default of doing what it can with it. https://github.com/traject/traject/pull/226
+
+* SolrJsonWriter
+
+  * Utility method `delete_all!` sends a delete-all query to the Solr URL endpoint. https://github.com/traject/traject/pull/227
+
+  * Allow basic auth configuration of the default http client via `solr_writer.basic_auth_user` and `solr_writer.basic_auth_password`. https://github.com/traject/traject/pull/231
+
+## 3.1.0
+
+### Added
+
+* Context#add_output is added, convenient for custom ruby code.
+
+      each_record do |record, context|
+        context.add_output "key", something_from(record)
+      end
+
+  https://github.com/traject/traject/pull/220
+
+* Class-level indexer configuration, for custom indexer subclasses, is now available with a class-level `configure` method. Warning: Indexers are still expensive to instantiate, though. https://github.com/traject/traject/pull/213
+
+* SolrJsonWriter
+
+  * SolrJsonWriter has new settings to control commit semantics: `solr_writer.solr_update_args` and `solr_writer.commit_solr_update_args`; both have hash values that are Solr update handler query params. https://github.com/traject/traject/pull/215
+
+  * SolrJsonWriter has a `delete(solr-unique-key)` method. Does not currently use any batching or threading. https://github.com/traject/traject/pull/214
+
+  * SolrJsonWriter: when MaxSkippedRecordsExceeded is raised, it will have a #cause that is the last error which resulted in MaxSkippedRecordsExceeded. Some error reporting systems, including Rails, will automatically log #cause, so that's helpful. https://github.com/traject/traject/pull/216
+
+  * SolrJsonWriter now respects a `solr_writer.http_timeout` setting, in seconds, to be passed to the HTTPClient instance. https://github.com/traject/traject/pull/219
+
+  * Only runs thread pool shutdown code (and logging) if there is a `solr_writer.batch_size` greater than 0. Keeps it out of the logs if it was a no-op anyway.
+
+  * Logs at DEBUG level every time it sends an update request to solr.
+
+* Nokogiri dependency for the NokogiriReader increased to `~> 1.9`. When using JRuby `each_record_xpath`, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find no way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. https://github.com/traject/traject/pull/209
+
+* LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) https://github.com/traject/traject/pull/211
+
+* Traject::Indexer will now use a Logger(-compatible) instance passed in in setting 'logger' https://github.com/traject/traject/pull/217
+
 ## 3.0.0
 
 ### Changed/Backwards Incompatibilities
````
data/README.md CHANGED

````diff
@@ -19,7 +19,7 @@ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (Univer
 * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
 * Easy to program, easy to read, easy to modify.
 * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with solr even under MRI. Traject is intended to be usable to process millions of records.
-* Composed of decoupled components, for flexibility and extensibility.
+* Composed of decoupled components, for flexibility and extensibility.
 * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
 * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options that can combine to deal with any of your local needs.
@@ -135,7 +135,7 @@ For the syntax and complete possibilities of the specification string argument t
 
 To see all options for `extract_marc`, see the [extract_marc](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc) method documentation.
 
-### XML mode,
+### XML mode, extract_xpath
 
 See our [xml guide](./doc/xml.md) for more XML examples, but you will usually use extract_xpath.
 
@@ -175,6 +175,8 @@ TranslationMap use above is just one example of a transformation macro, that tra
 * `append("--after each value")`
 * `gsub(/regex/, "replacement")`
 * `split(" ")`: take values and split them, possibly resulting in multiple values.
+* `transform(proc)`: transform each existing value using a proc, kind of like `map`.
+  eg `to_field "something", extract_xml("//author"), transform( ->(author) { "#{author.last}, #{author.first}" })`
 
 You can add on as many transformation macros as you want, they will be applied to output in order.
 
@@ -311,12 +313,15 @@ like `to_field`, is executed for every record, but without being tied
 to a specific output field.
 
 `each_record` can be used for logging or notifying, computing intermediate
-results, or
+results, or more complex ruby logic.
 
 ~~~ruby
 each_record do |record|
   some_custom_logging(record)
 end
+each_record do |record, context|
+  context.add_output(:some_value, extract_some_value_from_record(record))
+end
 ~~~
 
 For more on `each_record`, see [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
@@ -405,7 +410,7 @@ writer class in question.
 
 ## The traject command Line
 
-(If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./
+(If you are interested in running traject in an embedded/programmatic context instead of as a standalone command-line batch process, please see docs on [Programmatic Use](./doc/programmatic_use.md) )
````
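The `transform(proc)` macro added in this README diff behaves conceptually like `map` over the accumulated values. A gem-free sketch of that idea (this mimics traject's `(record, accumulator, context)` step shape, but is not traject's actual implementation):

```ruby
# A standalone sketch of what a transform-style macro does: it returns a
# step (lambda) that maps each already-accumulated value through a proc.
def transform(a_proc)
  ->(record, accumulator, context) { accumulator.map!(&a_proc) }
end

# "Last, First" -> "First Last", applied to every accumulated value
accumulator = ["Smith, Jane", "Doe, John"]
step = transform(->(name) { name.split(", ").reverse.join(" ") })
step.call(nil, accumulator, nil)
accumulator # => ["Jane Smith", "John Doe"]
```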
data/doc/indexing_rules.md CHANGED

````diff
@@ -247,13 +247,12 @@ each_record do |record, context|
 end
 
 each_record do |record, context|
-
+  if eligible_for_things?(record)
+    (val1, val2) = calculate_two_things_from(record)
 
-
-
-
-    context.output_hash["second_field"] ||= []
-    context.output_hash["second_field"] << val2
+    context.add_output("first_field", val1)
+    context.add_output("second_field", val2)
+  end
 end
````
data/doc/programmatic_use.md CHANGED

````diff
@@ -48,6 +48,30 @@ indexer = Traject::Indexer.new(settings) do
 end
 ```
 
+### Configuring indexer subclasses
+
+Indexing step configuration is historically done in traject at the indexer _instance_ level, either programmatically or by applying a "configuration file" to an indexer instance.
+
+But you can also define your own indexer sub-class with indexing steps built in, using the class-level `configure` method.
+
+This is an EXPERIMENTAL feature, implementation may change. https://github.com/traject/traject/pull/213
+
+```ruby
+class MyIndexer < Traject::Indexer
+  configure do
+    settings do
+      provide "solr.url", Rails.application.config.my_solr_url
+    end
+
+    to_field "our_name", literal("University of Whatever")
+  end
+end
+```
+
+These settings and indexing steps are now "hard-coded" into that subclass. You can still provide additional configuration at the instance level, as normal. You can also make a subclass of that `MyIndexer` class, which will inherit configuration from MyIndexer, and can supply its own additional class-level configuration too.
+
+Note that due to how the implementation is done, instantiating an indexer is still _relatively_ expensive (class-level configuration is only actually executed on instantiation). You will still get better performance by re-using a global instance of your indexer subclass, instead of, say, instantiating one per object to be indexed.
+
 ## Running the indexer
 
 ### process: probably not what you want
@@ -157,7 +181,7 @@ You may want to consider instead creating one or more configured "global" indexe
 
 * Readers, and the Indexer#process method, are not thread-safe. Which is why using Indexer#process, which uses a fixed reader, is not thread-safe, and why when sharing a global indexer we want to use `process_record`, `map_record`, or `process_with` as above.
 
-It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
+It ought to be safe to use a global Indexer concurrently in several threads, with the `map_record`, `process_record` or `process_with` methods -- so long as your indexing rules and writers are thread-safe, as they usually will be and always ought to be.
 
 ### An example
````
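The class-level `configure` behavior described above (blocks stored per class, replayed at instantiation, superclass blocks first, instance-level configuration last) can be sketched without the gem. `MiniIndexer` and its toy `to_field` here are simplified stand-ins, not traject's real classes:

```ruby
# Simplified stand-in for traject's class-level configure: each class keeps
# its own list of configure blocks; on instantiation, superclass blocks are
# replayed first, then the subclass's own, then any instance-level block.
class MiniIndexer
  def self.configure(&block)
    (@class_configure_blocks ||= []) << block
  end

  def self.apply_class_configure_blocks(instance)
    # recurse up first, so superclass configuration applies before subclass
    superclass.apply_class_configure_blocks(instance) if superclass.respond_to?(:apply_class_configure_blocks)
    (@class_configure_blocks || []).each { |block| instance.instance_eval(&block) }
  end

  attr_reader :fields

  def initialize(&block)
    @fields = []
    self.class.apply_class_configure_blocks(self)
    instance_eval(&block) if block
  end

  # toy version of traject's to_field indexing step
  def to_field(name)
    @fields << name
  end
end

class BaseIndexer < MiniIndexer
  configure { to_field "our_name" }
end

class ChildIndexer < BaseIndexer
  configure { to_field "extra_field" }
end

ChildIndexer.new.fields # => ["our_name", "extra_field"]
```

Note that, as the gem docs warn, all blocks run on every instantiation; nothing is baked in at class-definition time.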
data/doc/settings.md CHANGED

````diff
@@ -93,6 +93,8 @@ settings are applied first of all. It's recommended you use `provide`.
 
 * `solr_writer.thread_pool`: defaults to 1 (single bg thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading. Set to 1, there will still be a single bg thread doing the adds. May make sense to set higher than the number of cores on your indexing machine, as these threads will mostly be waiting on Solr; speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if solr_json_writer.thread_pool is full.
 
+* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default, but when both are set the default writer is configured with basic auth.
+
 
 ### Dealing with MARC data
 
@@ -119,6 +121,8 @@ settings are applied first of all. It's recommended you use `provide`.
 
 * `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log to. Default "INFO"; set to "DEBUG" etc. if desired.
 
+* `logger`: Ignore all the other logger settings, just pass a `Logger`-compatible logger instance in directly.
+
````
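The new `logger` setting short-circuits logger construction, per the `create_logger` change elsewhere in this release: an explicitly supplied instance wins over all other `log.*` settings. A plain-Ruby sketch of that precedence (`build_logger` is a stand-in name, not traject's actual method):

```ruby
require "logger"

# Sketch of the precedence the 'logger' setting introduces: an explicitly
# supplied Logger-compatible instance wins; otherwise build from log.* settings.
def build_logger(settings)
  return settings["logger"] if settings["logger"] # explicit instance wins

  level = settings["log.level"] || "info"
  logger = Logger.new($stderr)
  logger.level = Logger.const_get(level.upcase) # e.g. Logger::INFO
  logger
end

custom = Logger.new(File::NULL)
build_logger("logger" => custom).equal?(custom) # => true
build_logger({}).level == Logger::INFO          # => true
```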
data/doc/xml.md CHANGED

````diff
@@ -72,6 +72,16 @@ You can use all the standard transformation macros in Traject::Macros::Transforma
 to_field "something", extract_xpath("//value"), first_only, translation_map("some_map"), default("no value")
 ```
 
+### selecting attribute values
+
+Just works, using xpath syntax for selecting an attribute:
+
+```ruby
+# gets status value in: <oai:header status="something">
+to_field "status", extract_xpath("//oai:record/oai:header/@status")
+```
+
 
 ### selecting non-text nodes
 
@@ -133,6 +143,8 @@ The NokogiriReader parser should be relatively performant though, allowing you t
 
 (There is a half-finished `ExperimentalStreamingNokogiriReader` available, but it is experimental, half-finished, may disappear or change in backwards compat at any time, problematic, not recommended for production use, etc.)
 
+Note also that in JRuby, when using `each_record_xpath` with the NokogiriReader, the extracted individual documents may have xmlns declarations in different places than you may expect, although they will still be semantically equivalent for namespace processing. This is due to the Nokogiri JRuby implementation, and we could find no good way to ensure consistent behavior with MRI. See: https://github.com/sparklemotion/nokogiri/issues/1875
+
 ### Jruby
 
 It may be that nokogiri JRuby is just much slower than nokogiri MRI (at least when namespaces are involved?) It may be that our workaround to a [JRuby bug involving namespaces on moving nodes](https://github.com/sparklemotion/nokogiri/issues/1774) doesn't help.
````
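traject's `extract_xpath` uses Nokogiri, but the `@attr` selection added in 3.4.0 is plain XPath semantics. A stdlib-only illustration with REXML (used here only because it needs no gems; this is not the traject API, and namespaces are dropped for brevity):

```ruby
require "rexml/document"

# XPath attribute selection, as used by extract_xpath("...@status") above:
# the trailing /@status step selects the attribute node, not the element.
xml = <<~XML
  <record>
    <header status="deleted"/>
  </record>
XML

doc = REXML::Document.new(xml)
statuses = REXML::XPath.match(doc, "//header/@status").map(&:value)
statuses # => ["deleted"]
```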
data/lib/traject/indexer.rb CHANGED

````diff
@@ -180,6 +180,7 @@ class Traject::Indexer
     @index_steps = []
     @after_processing_steps = []
 
+    self.class.apply_class_configure_block(self)
     instance_eval(&block) if block
   end
 
@@ -189,6 +190,38 @@ class Traject::Indexer
     instance_eval(&block)
   end
 
+  # Class-level configure block(s) accepted too, and applied at instantiation
+  # before instance-level configuration.
+  #
+  # EXPERIMENTAL, implementation may change in ways that affect some uses.
+  # https://github.com/traject/traject/pull/213
+  #
+  # Note that settings set by 'provide' in a subclass can not really be overridden
+  # by 'provide' in a next-level subclass. Use self.default_settings instead, with
+  # a call to super.
+  #
+  # You can call this .configure multiple times; blocks are added to a list, and
+  # will be used to initialize an instance in order.
+  #
+  # The main downside of this workaround implementation is performance: even though
+  # defined at load-time on the class level, blocks are all executed on every instantiation.
+  def self.configure(&block)
+    (@class_configure_blocks ||= []) << block
+  end
+
+  def self.apply_class_configure_block(instance)
+    # Make sure we inherit from a superclass that has a class-level ivar @class_configure_block
+    if self.superclass.respond_to?(:apply_class_configure_block)
+      self.superclass.apply_class_configure_block(instance)
+    end
+    if @class_configure_blocks && !@class_configure_blocks.empty?
+      @class_configure_blocks.each do |block|
+        instance.configure(&block)
+      end
+    end
+  end
+
 
   # Pass a string file path, a Pathname, or a File object, for
   # a config file to load into indexer.
@@ -258,10 +291,9 @@ class Traject::Indexer
       "log.batch_size.severity" => "info",
 
       # how to post-process the accumulator
-
-
-
-      "allow_empty_fields" => false
+      Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES => false,
+      Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES => true,
+      Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS => false
     }.freeze
   end
 
@@ -349,6 +381,10 @@ class Traject::Indexer
 
   # Create logger according to settings
   def create_logger
+    if settings["logger"]
+      # none of the other settings matter, we just got a logger
+      return settings["logger"]
+    end
 
     logger_level = settings["log.level"] || "info"
````
data/lib/traject/indexer/context.rb CHANGED

````diff
@@ -82,6 +82,51 @@ class Traject::Indexer
     str
   end
 
+  # Add values to an array in context.output_hash with the specified key/field_name(s).
+  # Creates the array in output_hash if currently nil.
+  #
+  # Post-processing/filtering:
+  #
+  # * uniqs accumulator, unless settings["allow_duplicate_values"] is set.
+  # * Removes nil values unless settings["allow_nil_values"] is set.
+  # * Will not add an empty array to output_hash (will leave it nil instead)
+  #   unless settings["allow_empty_fields"] is set.
+  #
+  # Multiple values can be added with multiple arguments (we avoid an array argument meaning
+  # multiple values, to accommodate odd use cases where the array itself is desired in the output_hash value)
+  #
+  # @param field_name [String,Symbol,Array<String>,Array<Symbol>] A key to set in output_hash, or
+  #   an array of such keys.
+  #
+  # @example add one value
+  #   context.add_output(:additional_title, "a title")
+  #
+  # @example add multiple values as multiple params
+  #   context.add_output("additional_title", "a title", "another title")
+  #
+  # @example add multiple values as multiple params from array using ruby splat operator
+  #   context.add_output(:some_key, *array_of_values)
+  #
+  # @example add to multiple keys in output hash
+  #   context.add_output(["key1", "key2"], "value")
+  #
+  # @return [Traject::Context] self
+  #
+  # Note: for historical reasons the relevant settings key *names* are in constants in Traject::Indexer::ToFieldStep,
+  # but the settings don't just apply to ToFieldSteps.
+  def add_output(field_name, *values)
+    values.compact! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_NIL_VALUES]
+
+    return self if values.empty? and not (self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_EMPTY_FIELDS])
+
+    Array(field_name).each do |key|
+      accumulator = (self.output_hash[key.to_s] ||= [])
+      accumulator.concat values
+      accumulator.uniq! unless self.settings && self.settings[Traject::Indexer::ToFieldStep::ALLOW_DUPLICATE_VALUES]
+    end
+
+    return self
+  end
 end
````
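The post-processing rules documented for `Context#add_output` (compact nils, skip empties, uniq, fan out to multiple keys) can be modeled standalone. This toy `add_output` mirrors the documented behavior against a plain hash; it is not the gem's implementation, and `settings:` here is an ordinary keyword argument rather than traject's settings object:

```ruby
# Toy model of Context#add_output's documented post-processing.
# Settings keys mirror traject's allow_* setting names.
def add_output(output_hash, field_name, *values, settings: {})
  values.compact! unless settings["allow_nil_values"]       # drop nils
  return output_hash if values.empty? && !settings["allow_empty_fields"]

  Array(field_name).each do |key|                           # one or many keys
    accumulator = (output_hash[key.to_s] ||= [])            # create array lazily
    accumulator.concat(values)
    accumulator.uniq! unless settings["allow_duplicate_values"]
  end
  output_hash
end

out = {}
add_output(out, :title, "a", nil, "a") # nil removed, duplicate "a" uniq'd
add_output(out, :title)                # no values: key left as-is, no empty array added
add_output(out, ["k1", "k2"], "v")     # one value fanned out to two keys
out # => {"title"=>["a"], "k1"=>["v"], "k2"=>["v"]}
```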
data/lib/traject/indexer/step.rb CHANGED

````diff
@@ -145,24 +145,20 @@ class Traject::Indexer
     return accumulator
   end
 
-
-  #
-  #
+
+  # These constants are here for historical/legacy reasons; they really oughta
+  # live in Traject::Context, but in case anyone is referring to them
+  # we'll leave them here for now.
   ALLOW_NIL_VALUES = "allow_nil_values".freeze
   ALLOW_EMPTY_FIELDS = "allow_empty_fields".freeze
   ALLOW_DUPLICATE_VALUES = "allow_duplicate_values".freeze
 
+  # Add the accumulator to the context with the correct field name(s).
+  # Do post-processing on the accumulator (remove nil values, allow empty
+  # fields, etc.)
   def add_accumulator_to_context!(accumulator, context)
-    accumulator.compact! unless context.settings[ALLOW_NIL_VALUES]
-    return if accumulator.empty? and not (context.settings[ALLOW_EMPTY_FIELDS])
-
     # field_name can actually be an array of field names
-
-    context.output_hash[a_field_name] ||= []
-
-    existing_accumulator = context.output_hash[a_field_name].concat(accumulator)
-    existing_accumulator.uniq! unless context.settings[ALLOW_DUPLICATE_VALUES]
-  end
+    context.add_output(field_name, *accumulator)
   end
 end
````
data/lib/traject/line_writer.rb CHANGED

````diff
@@ -8,12 +8,35 @@ require 'thread'
 # This does not seem to affect performance much, as far as I could tell
 # benchmarking.
 #
-# Output will be sent to `settings["output_file"]` string path, or else
-# `settings["output_stream"]` (ruby IO object), or else stdout.
-#
 # This class can be sub-classed to write out different serialized
 # representations -- subclasses will just override the #serialize
 # method. For instance, see JsonWriter.
+#
+# ## Output
+#
+# The main functionality this class provides is logic for choosing, based on
+# settings, what file or bytestream to send output to.
+#
+# You can supply `settings["output_file"]` with a _file path_. LineWriter
+# will open up a `File` to write to.
+#
+# Or you can supply `settings["output_stream"]` with any ruby IO object, such as an
+# open `File` object or anything else.
+#
+# If neither is supplied, will write to `$stdout`.
+#
+# ## Closing the output stream
+#
+# The LineWriter tries to guess whether it should call `close` on the output
+# stream it's writing to, when the LineWriter instance is closed. For instance,
+# if you passed in a `settings["output_file"]` with a path, and the LineWriter
+# opened up a `File` object for you, it should close it for you.
+#
+# But for historical reasons, LineWriter doesn't just use that signal; it tries
+# to guess generally about when to call close. If for some reason it gets it wrong,
+# just set `settings["close_output_on_close"]` to `true` or `false`.
+# (String `"true"` or `"false"` are also acceptable, for convenience in setting
+# options on the command line.)
 class Traject::LineWriter
   attr_reader :settings
   attr_reader :write_mutex, :output_file
@@ -57,7 +80,16 @@ class Traject::LineWriter
   end
 
   def close
-    @output_file.close
+    @output_file.close if should_close_stream?
+  end
+
+  def should_close_stream?
+    if settings["close_output_on_close"].nil?
+      !(@output_file.nil? || @output_file.tty? || @output_file == $stdout || @output_file == $stderr)
+    else
+      settings["close_output_on_close"].to_s == "true"
+    end
   end
 
 end
````