traject 0.16.0 → 0.17.0

Files changed (53)
  1. checksums.yaml +7 -0
  2. data/.yardopts +1 -0
  3. data/README.md +183 -191
  4. data/bench/bench.rb +1 -1
  5. data/doc/batch_execution.md +14 -0
  6. data/doc/extending.md +14 -12
  7. data/doc/indexing_rules.md +265 -0
  8. data/lib/traject/command_line.rb +12 -41
  9. data/lib/traject/debug_writer.rb +32 -13
  10. data/lib/traject/indexer.rb +101 -24
  11. data/lib/traject/indexer/settings.rb +18 -17
  12. data/lib/traject/json_writer.rb +32 -11
  13. data/lib/traject/line_writer.rb +6 -6
  14. data/lib/traject/macros/basic.rb +1 -1
  15. data/lib/traject/macros/marc21.rb +17 -13
  16. data/lib/traject/macros/marc21_semantics.rb +27 -25
  17. data/lib/traject/macros/marc_format_classifier.rb +39 -25
  18. data/lib/traject/marc4j_reader.rb +36 -22
  19. data/lib/traject/marc_extractor.rb +79 -75
  20. data/lib/traject/marc_reader.rb +33 -25
  21. data/lib/traject/mock_reader.rb +9 -10
  22. data/lib/traject/ndj_reader.rb +7 -7
  23. data/lib/traject/null_writer.rb +1 -1
  24. data/lib/traject/qualified_const_get.rb +12 -2
  25. data/lib/traject/solrj_writer.rb +61 -52
  26. data/lib/traject/thread_pool.rb +45 -45
  27. data/lib/traject/translation_map.rb +59 -27
  28. data/lib/traject/util.rb +3 -3
  29. data/lib/traject/version.rb +1 -1
  30. data/lib/traject/yaml_writer.rb +1 -1
  31. data/test/debug_writer_test.rb +7 -7
  32. data/test/indexer/each_record_test.rb +4 -4
  33. data/test/indexer/macros_marc21_semantics_test.rb +12 -12
  34. data/test/indexer/macros_marc21_test.rb +10 -10
  35. data/test/indexer/macros_test.rb +1 -1
  36. data/test/indexer/map_record_test.rb +6 -6
  37. data/test/indexer/read_write_test.rb +43 -4
  38. data/test/indexer/settings_test.rb +2 -2
  39. data/test/indexer/to_field_test.rb +8 -8
  40. data/test/marc4j_reader_test.rb +4 -4
  41. data/test/marc_extractor_test.rb +33 -25
  42. data/test/marc_format_classifier_test.rb +3 -3
  43. data/test/marc_reader_test.rb +2 -2
  44. data/test/test_helper.rb +3 -3
  45. data/test/test_support/demo_config.rb +52 -48
  46. data/test/translation_map_test.rb +22 -4
  47. data/test/translation_maps/bad_ruby.rb +2 -2
  48. data/test/translation_maps/both_map.rb +1 -1
  49. data/test/translation_maps/default_literal.rb +1 -1
  50. data/test/translation_maps/default_passthrough.rb +1 -1
  51. data/test/translation_maps/ruby_map.rb +1 -1
  52. metadata +7 -31
  53. data/doc/macros.md +0 -103
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: ab462aadfb1252846b617cf1adb288eeb519b353
4
+ data.tar.gz: 7eac38dd8ac32e1dbfd417686ff04f95c108f011
5
+ SHA512:
6
+ metadata.gz: 331350a2a93083b10710943e71bdf31b30bb3c6aeed9dde97f05fd232eaa34681a7ac0bcdf0d7aae9e37fd6ea7b9d3e4da1c840f036ed9abc6d57a01aea02e12
7
+ data.tar.gz: 381e2c56dc2b92e0b91330bf20275e47462b86c1301ef723f31297cae1702b1f2fb77e6b8a016cf9213879f4c593ce001b68e854649613f12ba9a96238dc9da2
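The digests recorded in `checksums.yaml` can be recomputed locally. The following is only a minimal sketch using Ruby's standard `digest` library. The helper name `gem_checksums` and the file paths are hypothetical, and it assumes `metadata.gz` and `data.tar.gz` have already been unpacked from the `.gem` archive.

```ruby
require "digest/sha1"
require "digest/sha2"

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of a file,
# keyed by the same algorithm names that checksums.yaml uses.
def gem_checksums(path)
  {
    "SHA1"   => Digest::SHA1.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Example use (the path is an assumption):
#   gem_checksums("metadata.gz")["SHA1"]
```

Comparing the returned strings against the values listed above would show whether an unpacked file matches what was published.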
data/.yardopts CHANGED
@@ -1,2 +1,3 @@
1
+ --markup markdown
1
2
  -
2
3
  doc/*.md
data/README.md CHANGED
@@ -1,11 +1,12 @@
1
1
  # Traject
2
2
 
3
- Tools for indexing MARC records to Solr.
3
+ Tools for reading MARC records, transforming them with indexing rules, and indexing to Solr.
4
+ Might be used to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
4
5
 
5
- Generalizable to tools for configuring mapping records to associative array data structures, and sending
6
- them somewhere.
6
+ Traject might also be generalized to a set of tools for getting structured data from a source, and sending it to a destination.
7
7
 
8
- **Currently under development, not production ready**
8
+
9
+ **Traject is nearing 1.0, it is robust, feature-rich and ready for trial use**
9
10
 
10
11
  [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
11
12
  [![Build Status](https://travis-ci.org/jrochkind/traject.png)](https://travis-ci.org/jrochkind/traject)
@@ -13,23 +14,18 @@ them somewhere.
13
14
 
14
15
  ## Background/Goals
15
16
 
16
- Existing tools for indexing Marc to Solr served us well for many years, and have many features.
17
- But we were having more and more difficulty with them, including in extending/customizing in maintainable ways.
18
- We realized that to create a tool with the API (internal and external) we wanted, we could do a better
19
- job with jruby (ruby on the JVM).
17
+ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
18
+
19
+ Traject was born out of our experience with similar tools, including the very popular and useful [solrmarc](https://code.google.com/p/solrmarc/) by Bob Haschart; and Bill Dueber's own [marc2solr](http://github.com/billdueber/marc2solr/).
20
20
 
21
- * **Easy to use**, getting started with standard use cases should be easy, even for non-rubyists.
22
- * **Support customization and flexiblity**, common customization use cases, including simple local
23
- logic, should be very easy. More sophisticated and even complex customization use cases should still be possible,
24
- changing just the parts of traject you want to change.
25
- * **Maintainable local logic**, supporting sharing of reusable logic via ruby gems.
26
- * **Comprehensible internal logic**; well-covered by tests, well-factored separation of concerns,
27
- easy for newcomer developers who know ruby to understand the codebase.
28
- * **High performance**, using multi-threaded concurrency where appropriate to maximize throughput.
29
- traject likely will provide higher throughput than other similar solutions.
30
- * **Well-behaved shell script**, for painless integration in batch processes and cronjobs, with
31
- exit codes, sufficiently flexible control of logging, proper use of stderr, etc.
21
+ We're comfortable programming (especially in a dynamic language), and want to be able to experiment with different indexing patterns quickly, easily, and testably; but are admittedly less comfortable in Java. In order to have a tool with the APIs and usage patterns convenient for us, we found we could do it better in JRuby -- Ruby on the JVM.
32
22
 
23
+ * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
24
+ * Easy to program, easy to read, easy to modify.
25
+ * Fast. Traject by default indexes using multiple threads, on multiple cpu cores.
26
+ * Composed of decoupled components, for flexibility and extensibility. The whole code base is only 6400 lines of code, more than a third of which is tests.
27
+ * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
28
+ * Designed with batch execution in mind: flexible logging, good exit codes, good use of stdin/stdout/stderr.
33
29
 
34
30
 
35
31
  ## Installation
@@ -41,25 +37,30 @@ Then just `gem install traject`.
41
37
 
42
38
  ( **Note**: We may later provide an all-in-one .jar distribution, which does not require you to install jruby or use on your system. This is hypothetically possible. Is it a good idea?)
43
39
 
44
- # Usage
45
40
 
46
- ## Configuration file format
41
+ ## Configuration files
47
42
 
48
- The traject command-line utility requires you to supply it with a configuration file. So let's start by describing the configuration file.
43
+ traject is configured using configuration files. To get a sense of what they look like, you can
44
+ take a look at our sample non-trivial configuration file,
45
+ [demo_config.rb](./test/test_support/demo_config.rb), which you'd run like
46
+ `traject -c path/to/demo_config.rb marc_file.marc`.
49
47
 
50
48
  Configuration files are actually just ruby -- so by convention they end in `.rb`.
51
49
 
52
50
  We hope you can write basic useful configuration files without being a ruby expert,
53
- they give you a subset of ruby to work with. But the full power
51
+ traject gives you some easy functions to use for common directives. But the full power
54
52
  of ruby is available to you if needed.
55
53
 
56
54
  **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
57
55
  call ordinary ruby `require` in config files, etc., too, to load
58
56
  external functionality. See more at Extending Logic below.
59
57
 
58
+ You can keep your settings and indexing rules in one config file,
59
+ or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environment-specific things?)
60
+
60
61
  There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
61
62
 
62
- ### Settings
63
+ ## Settings
63
64
 
64
65
  Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are. They look like this
65
66
  in a config file:
@@ -105,91 +106,58 @@ You can also use `store` if you want to force-set, last set wins.
105
106
  See, docs page on [Settings](./doc/settings.md) for list
106
107
  of all standardized settings.
107
108
 
108
- ### Indexing Rules
109
-
110
- You can keep your settings and indexing rules in one config file,
111
- or split them accross multiple config files however you like. (Connection details vs indexing? Common things vs environmental specific things?)
112
109
 
113
- The main tool for indexing rules is the `to_field` command.
114
- Which can be used with a few standard functions.
110
+ ## Indexing rules: Let's start with `to_field` and `extract_marc`
115
111
 
116
- ~~~ruby
117
- # configuration.rb
118
-
119
- # The first arguent, 'source' in this case, is what Solr field we're
120
- # sending to. And the 'literal' function supplies a hard-coded
121
- # constant string literal.
122
- to_field "source", literal("LIB_CATALOG")
123
-
124
- # you can call 'to_field' multiple times, additional values
125
- # are concatenated
126
- to_field "source", literal("ANOTHER ONE")
127
-
128
- # Serialize the marc record back out and
129
- # put it in a solr field.
130
- to_field "marc_record", serialized_marc(:format => "xml")
131
-
132
- # or :format => "json" for marc-in-json
133
- # or :format => "binary", by default Base64-encoded for Solr
134
- # 'binary' field, or, for more like what SolrMarc did, without
135
- # escaping:
136
- to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false)
137
-
138
- # Take ALL of the text from the marc record, useful for
139
- # a catch-all field. Actually by default only takes
140
- # from tags 100 to 899.
141
- to_field "text", extract_all_marc_values
142
-
143
- # Now we have a simple example of the general utility function
144
- # `extract_marc`
145
- to_field "id", extract_marc("001", :first => true)
146
- ~~~
112
+ There are a few methods that can be used to create indexing rules, but the
113
+ one you'll use most often is called `to_field`, and establishes a rule
114
+ to extract content to a particular named output field.
147
115
 
148
- `extract_marc` takes a marc tag/subfield specification, and optional
149
- arguments. `:first => true` means if the specification returned multiple values, ignore all bet the first. It is wise to use this
150
- *whenever you have a non-multi-valued solr field* even if you think "There should only be one 001 field anyway!", to deal with unexpected
151
- data properly.
116
+ The extraction rule can use built-in 'macros', or, as we'll see later,
117
+ entirely custom logic.
152
118
 
153
- Other examples of the specification string, which can include multiple tag mentions, as well as subfields and indicators:
119
+ The built-in macro you'll use the most is `extract_marc`, to extract
120
+ data out of a MARC record according to a tag/subfield specification.
154
121
 
155
122
  ~~~ruby
156
- # 245 subfields a, p, and s. 130, all subfields.
157
- # built-in punctuation trimming routine.
158
- to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)
159
-
160
- # Can limit to certain indicators with || chars.
161
- # "*" is a wildcard in indicator spec. So
162
- # 856 with first indicator '0', subfield u.
163
- to_field "email_addresses", extract_marc("856|0*|u")
164
-
165
- # Can list tag twice with different field combinations
166
- # to extract separately
167
- to_field "isbn", extract_marc("245a:245abcde")
123
+ # Take the value of the first 001 field, and put
124
+ # it in output field 'id', to be indexed in Solr
125
+ # field 'id'
126
+ to_field "id", extract_marc("001", :first => true)
127
+
128
+ # 245 subfields a, p, and s. 130, all subfields.
129
+ # built-in punctuation trimming routine.
130
+ to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)
131
+
132
+ # Can limit to certain indicators with || chars.
133
+ # "*" is a wildcard in indicator spec. So
134
+ # 856 with first indicator '0', subfield u.
135
+ to_field "email_addresses", extract_marc("856|0*|u")
136
+
137
+ # Can list tag twice with different field combinations
138
+ # to extract separately
139
+ to_field "isbn", extract_marc("245a:245abcde")
140
+
141
+ # For MARC Control ('fixed') fields, you can optionally
142
+ # use square brackets to take a byte offset.
143
+ to_field "langauge_code", extract_marc("008[35-37]")
168
144
  ~~~
169
145
 
170
- The `extract_marc` function *by default* includes any linked
171
- MARC `880` fields with alternate-script versions. Another reason
172
- to use the `:first` option if you really only want one.
146
+ `extract_marc` by default includes all 'alternate script' linked fields corresponding
147
+ to matched specifications, but you can turn that off, or extract *only* corresponding
148
+ 880s.
173
149
 
174
- By default, specifications with multiple subfields (like "240abc") will produce
175
- one single string of output for each matching field. Specifications
176
- with single subfields (like "020a") will split subfields and produce
177
- an output string for each matching subfield.
150
+ to_field "title", extract_marc("245abc", :alternate_script => false)
151
+ to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
178
152
 
179
- For MARC control (aka 'fixed') fields, you can use square
180
- brackets to take a slice by byte offset.
153
+ By default, specifications with multiple subfields (like "240abc") will produce one single string of output for each matching field. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
181
154
 
182
- ~~~ruby
183
- to_field "langauge_code", extract_marc("008[35-37]")
184
- ~~~
185
-
186
- For more information on extraction specifications, see
187
- the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
155
+ For the syntax and complete possibilities of the specification
156
+ string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
188
157
 
189
158
  `extract_marc` also supports `translation maps` similar
190
159
  to SolrMarc's. There are some translation maps provided by traject,
191
- and you can also define your own. translation maps can be supplied
192
- in yaml or ruby. Translation maps are especially useful
160
+ and you can also define your own, in yaml or ruby. Translation maps are especially useful
193
161
  for mapping from MARC codes to user-displayable strings:
194
162
 
195
163
  ~~~ruby
@@ -198,131 +166,152 @@ for mapping form MARC codes to user-displayable strings:
198
166
  to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
199
167
  ~~~
200
168
 
201
- See [Traject::TranslationMap](./lib/traject/translation_map.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/TranslationMap)) for more info on translation mapping.
169
+ To see all options for `extract_marc`, see the [method documentation](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc)
202
170
 
203
- #### Direct indexing logic vs. Macros
171
+ ## other built-in utility macros
204
172
 
205
- It turns out all those functions we saw above used with `to_field` -- `literal`, `serialized_marc`, `extract_all_marc_values`, and `extract_marc` -- are what Traject calls 'macros'.
173
+ Other built-in methods that can be used with `to_field` include a hard-coded
174
+ literal string:
206
175
 
207
- They are all actually built based upon a more basic element of
208
- indexing functionality, which you can always drop down to, and
209
- which is used to build the macros. The basic use of `to_field`,
210
- with directly specified logic instead of using a macro, looks like this:
176
+ to_field "source", literal("LIB_CATALOG")
211
177
 
212
- ~~~ruby
213
- to_field "source" do |record, accumulator, context|
214
- accumulator << "LIB CATALOG"
215
- end
216
- ~~~~
178
+ The current record serialized back out as MARC, in binary, XML, or json:
217
179
 
218
- That's actually equivalent to the macro we used earlier: `to_field("source"), literal("LIB_CATALOG")`.
180
+ # or :format => "json" for marc-in-json
181
+ # or :format => "binary", by default Base64-encoded for Solr
182
+ # 'binary' field, or, for more like what SolrMarc did, without
183
+ # escaping:
184
+ to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false, :allow_oversized => true)
219
185
 
220
- This direct use of to_field happens to be a ruby "block", which is
221
- used to define a block of logic that can be stored and executed later. When the block is called, first argument (`record` above) is the marc_record being indexed (a ruby-marc MARC::Record object), and the second argument (`accumulator`) is a ruby array used to accumulate output values.
186
+ Text of all fields in a range:
222
187
 
223
- The third argument is a `Traject::Indexer::Context` object that can
224
- be used for more advanced functionality, including caching expensive
225
- per-record calculations, writing out to more than one output field at a time, or taking account of current Traject Settings in your logic. The third argument is optional, you can supply
226
- a two-argument block too.
188
+ to_field "text", extract_all_marc_values(:from => 100, :to => 899)
227
189
 
228
- You can always drop out to this basic direct use whenever you need
229
- special purpose logic, directly in the config file, writing in
230
- ruby:
231
190
 
232
- ~~~ruby
233
- # this is more or less nonsense, just an example
234
- to_field "weird_title" do |record, accumlator, context|
235
- field = record['245']
236
- title = field['a']
237
- title.upcase! if field.indicator1 = '1'
238
- accumulator << title
239
- end
191
+ All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
240
192
 
241
- # To make use of marc extraction by specification, just like
242
- # marc_extract does, you may want to use the Traject::MarcExtractor
243
- # class
244
- to_field "weirdo" do |record, accumulator, context|
245
- # use MarcExtractor.cached for performance, globally
246
- # caching the MarcExtractor we create. See docs
247
- # at MarcExtractor.
248
- list = MarcExtractor.cached("700a").extract(record)
193
+ ## more complex canned MARC semantic logic
249
194
 
250
- # combine all the 700a's in ONE string, cause we're weird
251
- list = list.join(" ")
195
+ Some more complex (and opinionated/subjective) algorithms for deriving semantics
196
+ from Marc are also packaged with Traject, but not available by default. To make
197
+ them available to your indexing, you just need to use ruby `require` and `extend`.
252
198
 
253
- accumulator << list
254
- end
255
- ~~~
199
+ A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
256
200
 
257
- You can also *combine* a macro and a direct block for some
258
- post-processing. In this case, the `accumulator` parameter
259
- in our block will start out with the values left by
260
- the `extract_marc`:
201
+ require 'traject/macros/marc21_semantics'
202
+ extend Traject::Macros::Marc21Semantics
261
203
 
262
- ~~~ruby
263
- to_field "subjects", extract_marc("600:650:610") do |record, accumulator, context|
264
- # for some reason we want to uppercase all our subjects
265
- accumulator.collect! {|s| s.upcase }
266
- end
267
- ~~~
204
+ to_field 'title_sort', marc_sortable_title
205
+ to_field 'broad_subject', marc_lcc_to_broad_category
206
+ to_field "geographic_facet", marc_geo_facet
207
+ # And several more
208
+
209
+ And, there's a routine for classifying MARC to an internal
210
+ format/genre/type vocabulary:
211
+
212
+ require 'traject/macros/marc_format_classifier'
213
+ extend Traject::Macros::MarcFormats
214
+
215
+ to_field 'format_facet', marc_formats
216
+
217
+
218
+ ## Custom logic
219
+
220
+ The built-in routines are there for your convenience, but if you need
221
+ something local or custom, you can write ruby logic directly
222
+ in a configuration file, using a ruby block, which looks like this:
223
+
224
+ to_field "id" do |record, accumulator|
225
+ # take the record's 001, prefix it with "bib_",
226
+ # and then add it to the 'accumulator' argument,
227
+ # to send it to the specified output field
228
+ value = record['001']
229
+ value = "bib_#{value}"
230
+ accumulator << value
231
+ end
232
+
233
+ `do |record, accumulator|` is the definition of a ruby block taking
234
+ two arguments. The first one passed in will be a MARC record. The
235
+ second is an array, you add values to the array to send them to
236
+ output.
237
+
238
+ Here's a more realistic example that shows how you'd get the
239
+ record type byte 06 out of a MARC leader, then translate it
240
+ to a human-readable string with a TranslationMap
241
+
242
+ to_field "marc_type" do |record, accumulator|
243
+ leader06 = record.leader.byteslice(6)
244
+ # this translation map doesn't actually exist, but could
245
+ accumulator << TranslationMap.new("marc_leader")[ leader06 ]
246
+ end
268
247
 
269
- If you find yourself repeating code a lot in direct blocks, you
270
- can supply your _own_ macros, for local use, or even to share
271
- with others in a ruby gem. See docs [Macros](./doc/macros.md)
248
+ You can also add a block onto the end of a built-in 'macro', to
249
+ further customize the output. The `accumulator` passed to your block
250
+ will already have values in it from the first step, and you can
251
+ use ruby methods like `map!` to modify it:
272
252
 
273
- #### each_record
253
+ to_field "big_title", extract_marc("245abcdefg") do |record, accumulator|
254
+ # put it all in all uppercase, I don't know why.
255
+ accumulator.map! {|v| v.upcase}
256
+ end
274
257
 
275
- There is also a method `each_record`, which is like `to_field`, but without
276
- a specific field. It can be used for other side-effects of your choice, or
277
- even for writing to multiple fields.
258
+ There are many more things you can do with custom logic blocks like this too,
259
+ including additional features we haven't discussed yet.
260
+
261
+ If you find yourself repeating boilerplate code in your custom logic, you can
262
+ even create your own 'macros' (like `extract_marc`). `extract_marc` and other
263
+ macros are nothing more than methods that return ruby lambda objects of
264
+ the same format as the blocks you write for custom logic.
265
+
266
+ For tips, gotchas, and a more complete explanation of how this works, see
267
+ additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
268
+
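To make the "macros are methods that return lambdas" point concrete, here is a minimal sketch in plain Ruby with no traject dependency. `prefixed_literal` and `run_rule` are hypothetical names used only for illustration. In real use the indexer calls the lambda itself, passing a MARC record and the output accumulator.

```ruby
# A hypothetical macro: a method that returns a lambda with the same
# (record, accumulator) signature as a custom logic block.
def prefixed_literal(prefix, value)
  lambda do |record, accumulator|
    accumulator << "#{prefix}#{value}"
  end
end

# Stand-in for what an indexer does with one rule: call the macro's lambda,
# then call any trailing block with the same accumulator for post-processing.
def run_rule(record, macro, &block)
  accumulator = []
  macro.call(record, accumulator)
  block.call(record, accumulator) if block
  accumulator
end

# The record is unused by this macro, so nil stands in for a MARC record.
values = run_rule(nil, prefixed_literal("bib_", "12345")) do |_record, acc|
  acc.map!(&:upcase)
end
# values is ["BIB_12345"]
```

Because the macro and the trailing block share one accumulator, the block sees whatever the macro already added. This is the same pattern as appending a block to `extract_marc`.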
269
+ ## each_record and after_processing
270
+
271
+ In addition to `to_field`, an `each_record` method is available, which,
272
+ like `to_field`, is executed for every record, but without being tied
273
+ to a specific field.
274
+
275
+ `each_record` can be used for logging or notifying; computing intermediate
276
+ results; or writing to more than one field at once.
278
277
 
279
278
  ~~~ruby
280
- each_record do |record, context|
281
- # example of writing to two fields at once.
282
- (x, y) = Something.do_stuff
283
- (context["one_field"] ||= []) << x
284
- (context["another_field"] ||= []) << y
279
+ each_record do |record|
280
+ some_custom_logging(record)
285
281
  end
286
282
  ~~~
287
283
 
288
- You could write or use macros for `each_record` too. It's suggested that
289
- such a macro take the field names it will effect as arguments (example?)
284
+ For more on `each_record`, see documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
290
285
 
291
- `each_record` and `to_field` calls will be processed in one big order, guaranteed
292
- in order.
286
+ There is also an `after_processing` method that can be used to register
287
+ logic that will be called after the entire input has been processed. You can use it for whatever custom
288
+ ruby code you might want for your app (send an email? Clean up a log file? Trigger
289
+ a Solr replication?)
293
290
 
294
291
  ~~~ruby
295
- to_field("foo") {...} # will be called first on each record
296
- each_record {...} # will always be called AFTER above has potentially added values
297
- to_field("foo") {...} # and will be called after each of the preceding for each record
292
+ after_processing do
293
+ whatever_ruby_code
294
+ end
298
295
  ~~~
299
296
 
300
- #### Sample config
301
297
 
302
- A fairly complex sample config file can be found at [./test/test_support/demo_config.rb](./test/test_support/demo_config.rb)
298
+ ## Writers
303
299
 
304
- #### Built-in MARC21 Semantics
300
+ Traject uses modular 'Writer' classes to take the output hashes from transformation, and
301
+ send them somewhere or do something useful with them.
305
302
 
306
- There is another package of 'macros' that comes with Traject for extracting semantics
307
- from Marc21. These are sometimes 'opinionated', using heuristics or algorithms
308
- that are not inherently part of Marc21, but have proven useful in actual practice.
303
+ By default traject uses the [Traject::SolrJWriter](lib/traject/solrj_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJWriter)) to send to Solr for indexing.
304
+ A couple other writers are available too, mostly for debugging purposes:
305
+ [Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
306
+ and [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
309
307
 
310
- It's not loaded by default, you can use straight ruby `require` and `extend`
311
- to load the macros into the indexer.
308
+ You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
309
+ or on the command-line as a shortcut with `-w Traject::DebugWriter`.
312
310
 
313
- ~~~ruby
314
- # in a traject config file, extend so we can use methods from...
315
- require 'traject/macros/marc21_semantics'
316
- extend Traject::Macros::Marc21Semantics
317
-
318
- to_field "date", marc_publication_date
319
- to_field "author_sort", marc_sortable_author
320
- to_field "inst_facet", marc_instrumentation_humanized
321
- ~~~
311
+ You can write your own Readers and Writers if you'd like, see comments at top
312
+ of [Traject::Indexer](lib/traject/indexer.rb).
322
313
 
323
- See documented list of macros available in [Marc21Semantics](./lib/traject/macros/marc21_semantics.rb)
324
-
325
- ## Command Line
314
+ ## The traject command Line
326
315
 
327
316
  The simplest invocation is:
328
317
 
@@ -363,7 +352,7 @@ Use `-u` as a shortcut for `s solr.url=X`
363
352
 
364
353
  Run `traject -h` to see the command line help screen listing all available options.
365
354
 
366
- Also see `-I load_path` and `-G Gemfile` options under Extending With Your Own Code.
355
+ Also see `-I load_path` option and suggestions for Bundler use under Extending With Your Own Code.
367
356
 
368
357
  See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
369
358
 
@@ -396,9 +385,9 @@ Own Code](./doc/extending.md)
396
385
  * translation map files found on the load path or in a
397
386
  "./translation_maps" subdir on the load path will be found
398
387
  for Traject translation maps.
399
- * Traject `-G` command line can be used to tell traject to use
400
- bundler with a `Gemfile` located at current working dirctory
401
- (or give an argument to `-G ./some/myGemfile`)
388
+ * Use [Bundler](http://bundler.io/) with traject simply by creating a Gemfile with `bundler init`,
389
+ and then running command line with `bundle exec traject` or
390
+ even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
402
391
 
403
392
  ## More
404
393
 
@@ -423,6 +412,9 @@ this with other developers first!)
423
412
  Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README,
424
413
  and/or extra files in ./docs -- as appropriate for what needs to be docs.
425
414
 
415
+ **Inline api docs** Note that our [`.yardopts` file](./.yardopts) used by rdoc.info to generate
416
+ online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
417
+
426
418
  ## TODO
427
419
 
428
420