traject 2.0.0-java

Files changed (104)
  1. checksums.yaml +7 -0
  2. data/.gitignore +18 -0
  3. data/.travis.yml +27 -0
  4. data/.yardopts +3 -0
  5. data/Gemfile +12 -0
  6. data/LICENSE.txt +20 -0
  7. data/README.md +461 -0
  8. data/Rakefile +21 -0
  9. data/bench/bench.rb +30 -0
  10. data/bin/traject +16 -0
  11. data/doc/batch_execution.md +243 -0
  12. data/doc/extending.md +190 -0
  13. data/doc/indexing_rules.md +265 -0
  14. data/doc/other_commands.md +47 -0
  15. data/doc/settings.md +101 -0
  16. data/lib/tasks/load_maps.rake +48 -0
  17. data/lib/traject.rb +11 -0
  18. data/lib/traject/command_line.rb +301 -0
  19. data/lib/traject/csv_writer.rb +34 -0
  20. data/lib/traject/debug_writer.rb +47 -0
  21. data/lib/traject/delimited_writer.rb +110 -0
  22. data/lib/traject/indexer.rb +613 -0
  23. data/lib/traject/indexer/settings.rb +110 -0
  24. data/lib/traject/json_writer.rb +51 -0
  25. data/lib/traject/line_writer.rb +63 -0
  26. data/lib/traject/macros/basic.rb +9 -0
  27. data/lib/traject/macros/marc21.rb +223 -0
  28. data/lib/traject/macros/marc21_semantics.rb +584 -0
  29. data/lib/traject/macros/marc_format_classifier.rb +197 -0
  30. data/lib/traject/marc_extractor.rb +410 -0
  31. data/lib/traject/marc_reader.rb +89 -0
  32. data/lib/traject/mock_reader.rb +97 -0
  33. data/lib/traject/ndj_reader.rb +40 -0
  34. data/lib/traject/null_writer.rb +22 -0
  35. data/lib/traject/qualified_const_get.rb +40 -0
  36. data/lib/traject/solr_json_writer.rb +277 -0
  37. data/lib/traject/thread_pool.rb +161 -0
  38. data/lib/traject/translation_map.rb +267 -0
  39. data/lib/traject/util.rb +52 -0
  40. data/lib/traject/version.rb +3 -0
  41. data/lib/traject/yaml_writer.rb +9 -0
  42. data/lib/translation_maps/lcc_top_level.yaml +26 -0
  43. data/lib/translation_maps/marc_genre_007.yaml +9 -0
  44. data/lib/translation_maps/marc_genre_leader.yaml +22 -0
  45. data/lib/translation_maps/marc_geographic.yaml +589 -0
  46. data/lib/translation_maps/marc_instruments.yaml +102 -0
  47. data/lib/translation_maps/marc_languages.yaml +490 -0
  48. data/test/debug_writer_test.rb +38 -0
  49. data/test/delimited_writer_test.rb +104 -0
  50. data/test/indexer/each_record_test.rb +59 -0
  51. data/test/indexer/macros_marc21_semantics_test.rb +391 -0
  52. data/test/indexer/macros_marc21_test.rb +190 -0
  53. data/test/indexer/macros_test.rb +40 -0
  54. data/test/indexer/map_record_test.rb +209 -0
  55. data/test/indexer/read_write_test.rb +101 -0
  56. data/test/indexer/settings_test.rb +152 -0
  57. data/test/indexer/to_field_test.rb +77 -0
  58. data/test/marc_extractor_test.rb +412 -0
  59. data/test/marc_format_classifier_test.rb +98 -0
  60. data/test/marc_reader_test.rb +110 -0
  61. data/test/solr_json_writer_test.rb +248 -0
  62. data/test/test_helper.rb +90 -0
  63. data/test/test_support/245_no_ab.marc +1 -0
  64. data/test/test_support/880_with_no_6.utf8.marc +1 -0
  65. data/test/test_support/bad_subfield_code.marc +1 -0
  66. data/test/test_support/bad_utf_byte.utf8.marc +1 -0
  67. data/test/test_support/date_resort_to_260.marc +1 -0
  68. data/test/test_support/date_type_r_missing_date2.marc +1 -0
  69. data/test/test_support/date_with_u.marc +1 -0
  70. data/test/test_support/demo_config.rb +155 -0
  71. data/test/test_support/emptyish_record.marc +1 -0
  72. data/test/test_support/escaped_character_reference.marc8.marc +1 -0
  73. data/test/test_support/george_eliot.marc +1 -0
  74. data/test/test_support/hebrew880s.marc +1 -0
  75. data/test/test_support/louis_armstrong.marc +1 -0
  76. data/test/test_support/manufacturing_consent.marc +1 -0
  77. data/test/test_support/manuscript_online_thesis.marc +1 -0
  78. data/test/test_support/microform_online_conference.marc +1 -0
  79. data/test/test_support/multi_era.marc +1 -0
  80. data/test/test_support/multi_geo.marc +1 -0
  81. data/test/test_support/musical_cage.marc +1 -0
  82. data/test/test_support/nature.marc +1 -0
  83. data/test/test_support/one-marc8.mrc +1 -0
  84. data/test/test_support/online_only.marc +1 -0
  85. data/test/test_support/packed_041a_lang.marc +1 -0
  86. data/test/test_support/test_data.utf8.json +30 -0
  87. data/test/test_support/test_data.utf8.marc.xml +2609 -0
  88. data/test/test_support/test_data.utf8.mrc +1 -0
  89. data/test/test_support/test_data.utf8.mrc.gz +0 -0
  90. data/test/test_support/the_business_ren.marc +1 -0
  91. data/test/translation_map_test.rb +225 -0
  92. data/test/translation_maps/bad_ruby.rb +8 -0
  93. data/test/translation_maps/bad_yaml.yaml +1 -0
  94. data/test/translation_maps/both_map.rb +1 -0
  95. data/test/translation_maps/both_map.yaml +1 -0
  96. data/test/translation_maps/default_literal.rb +10 -0
  97. data/test/translation_maps/default_passthrough.rb +10 -0
  98. data/test/translation_maps/marc_040a_translate_test.yaml +1 -0
  99. data/test/translation_maps/properties_map.properties +5 -0
  100. data/test/translation_maps/ruby_map.rb +10 -0
  101. data/test/translation_maps/translate_array_test.yaml +8 -0
  102. data/test/translation_maps/yaml_map.yaml +7 -0
  103. data/traject.gemspec +47 -0
  104. metadata +382 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 7ace7c33d88c92e40e7600984d0e037435c1ecf7
4
+ data.tar.gz: 9b052363bbabaed5a498e1e803fa9ec6d6c4cfa6
5
+ SHA512:
6
+ metadata.gz: a8f3159d608efed85d37e703e5ea2965e1cd9e67c481e181790114dc4bf33e5003116f982800b36e53ad2c4ffb1350a16eb989778decb63c5e36a56e7f150622
7
+ data.tar.gz: ed74374f0593b3c1b17e8808a7b030d3d5d8fe0661a7cba93a16db04c28494d0d4e5917a08fb529e51e68871c3d046af02a16e519ac571573485505d13ae5138
data/.gitignore ADDED
@@ -0,0 +1,18 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ .DS_Store
7
+ Gemfile.lock
8
+ InstalledFiles
9
+ _yardoc
10
+ coverage
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ vendor/solrj/ivy
data/.travis.yml ADDED
@@ -0,0 +1,27 @@
1
+ language: ruby
2
+ rvm:
3
+ - jruby-19mode
4
+ - jruby-head
5
+ - 1.9
6
+ - 2.1
7
+ - 2.2
8
+ - rbx-2
9
+ jdk:
10
+ - openjdk7
11
+ - openjdk6
12
+ matrix:
13
+ exclude:
14
+ - rvm: 1.9
15
+ jdk: openjdk7
16
+ - rvm: 2.1
17
+ jdk: openjdk7
18
+ - rvm: rbx-2
19
+ jdk: openjdk7
20
+ - rvm: jruby-head
21
+ jdk: openjdk6
22
+ - rvm: 2.2
23
+ jdk: openjdk6
24
+ allow_failures:
25
+ - rvm: jruby-head
26
+
27
+ bundler_args: --without debug
data/.yardopts ADDED
@@ -0,0 +1,3 @@
1
+ --markup markdown
2
+ -
3
+ doc/*.md
data/Gemfile ADDED
@@ -0,0 +1,12 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in traject.gemspec
4
+ gemspec
5
+
6
+ group :development do
7
+ gem "nokogiri" # used only for rake tasks load_maps:
8
+ end
9
+
10
+ group :debug do
11
+ gem "ruby-debug", :platform => "jruby"
12
+ end
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
1
+ MIT License
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,461 @@
1
+ # Traject
2
+
3
+ An easy to use, high-performance, flexible and extensible MARC to Solr indexer.
4
+
5
+ You might use traject to index MARC data for a Solr-based discovery product like [Blacklight](https://github.com/projectblacklight/blacklight) or [VUFind](http://vufind.org/).
6
+
7
+ Traject can also be generalized to a set of tools for getting structured data from a source, and transforming it to a hash-like object to send to a destination. In addition to sending data
8
+ to solr, Traject can produce json or yaml files, tab-delimited files, CSV files, and output suitable
9
+ for debugging by a human.
10
+
11
+ **Traject is stable, mature software, that is already being used in production by its authors.**
12
+
13
+ [![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
14
+ [![Build Status](https://travis-ci.org/traject-project/traject.png)](https://travis-ci.org/traject-project/traject)
15
+
16
+
17
+ ## Background/Goals
18
+
19
+ Initially by Jonathan Rochkind (Johns Hopkins Libraries) and Bill Dueber (University of Michigan Libraries).
20
+
21
+ * Basic configuration files can be easily written even by non-rubyists, with a few simple directives traject provides. But config files are 'ruby all the way down', so we can provide a gradual slope to more complex needs, with the full power of ruby.
22
+ * Easy to program, easy to read, easy to modify.
23
+ * Fast. Traject by default indexes using multiple threads, on multiple cpu cores, when the underlying
24
+ ruby implementation (i.e., JRuby) allows it, and can use a separate thread for communication with
25
+ solr even under MRI.
26
+ * Composed of decoupled components, for flexibility and extensibility.
27
+ * Designed to support local code and configuration that's maintainable and testable, and can be shared between projects as ruby gems.
28
+ * Easy to split configuration between multiple files, for simple "pick-and-choose" command line options
29
+ that can combine to deal with any of your local needs.
30
+
31
+
32
+ ## Installation
33
+
34
+ Traject runs under jruby (1.7.x or higher), MRI ruby (1.9.3 or higher), or probably any other ruby platform.
35
+
36
+ **Traject runs much faster on JRuby** where it can use multi-core parallelism, and the Java
37
+ Marc4J marc reader. If performance is a concern, you should run traject on JRuby.
38
+
39
+ Some options for installing a ruby other than your system-provided one are [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme).
40
+
41
+ Once you have ruby, just `$ gem install traject`.
42
+
43
+ ( **Note**: We might in the future provide an all-in-one .jar distribution, which does not require you to install jruby on your system, for those who want the multi-threading of jruby without having to actually install it. Let us know if interested.).
44
+
45
+
46
+ ## Configuration files
47
+
48
+ traject is configured using configuration files. To get a sense of what they look like, you can
49
+ take a look at our sample basic configuration file,
50
+ [demo_config.rb](./test/test_support/demo_config.rb). You could run traject with that configuration file
51
+ as: `traject -c path/to/demo_config.rb marc_file.marc`.
52
+
53
+ Configuration files are actually just ruby -- so by convention they end in `.rb`.
54
+
55
+ We hope you can write basic useful configuration files without much ruby experience, since
56
+ traject gives you some easy functions to use for common directives. But the full power
57
+ of ruby is available to you if needed.
58
+
59
+ **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
60
+ call ordinary ruby `require` in config files, etc., too, to load
61
+ external functionality. See more at Extending Logic below.
62
+
63
+ You can keep your settings and indexing rules in one config file,
64
+ or split them across multiple config files however you like. (Connection details vs indexing? Common things vs environment-specific things?)
65
+
66
+ There are two main categories of directives in your configuration files: _Settings_, and _Indexing Rules_.
67
+
68
+ ## Settings
69
+
70
+ Settings are a flat list of key/value pairs, where the keys are always strings and the values usually are. They look like this
71
+ in a config file:
72
+
73
+ ~~~ruby
74
+ # configuration_file.rb
75
+ # Note that "#" is a comment, cause it's just ruby
76
+
77
+ settings do
78
+ # Where to find solr server to write to
79
+ provide "solr.url", "http://example.org/solr"
80
+
81
+ # solr.version doesn't currently do anything, but set it
82
+ # anyway, in the future it will warn you if you have settings
83
+ # that may not work with your version.
84
+ provide "solr.version", "4.3.0"
85
+
86
+ # default source type is binary, traject can't guess
87
+ # you have to tell it.
88
+ provide "marc_source.type", "xml"
89
+
90
+ # various others...
91
+ provide "solr_writer.commit_on_close", "true"
92
+
93
+ # The default writer is the Traject::SolrJsonWriter. The default
94
+ # reader is Marc4JReader (using Java Marc4J library) on Jruby,
95
+ # MarcReader (using ruby-marc) otherwise.
96
+ end
97
+ ~~~
98
+
99
+ `provide` will only set the key if it was previously unset, so first
100
+ setting wins, and command-line comes first of all and overrides everything.
101
+ You can also use `store` if you want to force-set, last set wins.
102
+
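The difference between `provide` and `store` can be pictured with a small stand-in class. This is only a sketch of the semantics described above, not traject's own settings implementation; the class name `SketchSettings` and the example URLs are assumptions made for illustration.

```ruby
# Illustrative stand-in only; this is not Traject's settings class.
class SketchSettings
  def initialize
    @values = {}
  end

  # Set the key only if it is still unset, so the first setting wins.
  def provide(key, value)
    @values[key] = value unless @values.key?(key)
  end

  # Always set the key, so the last setting wins.
  def store(key, value)
    @values[key] = value
  end

  def [](key)
    @values[key]
  end
end

settings = SketchSettings.new
settings.store("solr.url", "http://cli.example.org/solr") # stands in for a value set earlier, e.g. on the command line
settings.provide("solr.url", "http://example.org/solr")   # ignored, key already set
settings.provide("marc_source.type", "xml")               # applied, key was unset
puts settings["solr.url"]
puts settings["marc_source.type"]
```

Because an earlier value blocks a later `provide`, a config file can supply defaults without clobbering anything set before it was loaded.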
103
+ See the docs page on [Settings](./doc/settings.md) for a list
104
+ of all standardized settings.
105
+
106
+
107
+ ## Indexing rules: Let's start with 'to_field' and 'extract_marc'
108
+
109
+ There are a few methods that can be used to create indexing rules, but the
110
+ one you'll use most commonly is called `to_field`, and establishes a rule
111
+ to extract content to a particular named output field.
112
+
113
+ A `to_field` extraction rule can use built-in 'macros', or, as we'll see later,
114
+ entirely custom logic.
115
+
116
+ The built-in macro you'll use the most is `extract_marc`, to extract
117
+ data out of a MARC record according to a tag/subfield specification.
118
+
119
+ ~~~ruby
120
+ # Take the value of the first 001 field, and put
121
+ # it in output field 'id', to be indexed in Solr
122
+ # field 'id'
123
+ to_field "id", extract_marc("001", :first => true)
124
+
125
+ # 245 subfields n, p, and s. 130, all subfields.
126
+ # built-in punctuation trimming routine.
127
+ to_field "title_t", extract_marc("245nps:130", :trim_punctuation => true)
128
+
129
+ # Can limit to certain indicators with || chars.
130
+ # "*" is a wildcard in indicator spec. So this is
131
+ # 856 with first indicator '0', subfield u.
132
+ to_field "email_addresses", extract_marc("856|0*|u")
133
+
134
+ # Can list tag twice with different field combinations
135
+ # to extract separately
136
+ to_field "isbn", extract_marc("245a:245abcde")
137
+
138
+ # For MARC Control ('fixed') fields, you can optionally
139
+ # use square brackets to take a byte offset.
140
+ to_field "language_code", extract_marc("008[35-37]")
141
+ ~~~
142
+
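The square-bracket byte offset in a spec like `008[35-37]` can be pictured as a plain byte slice of the control field's value, assuming the usual MARC convention of zero-based, inclusive positions. The helper name `sketch_byte_range` and the 008 value below are made up for illustration; this is not how traject itself is implemented.

```ruby
# Hypothetical helper: take bytes first..last (zero-based, inclusive)
# from a control field value.
def sketch_byte_range(value, first, last)
  value.byteslice(first..last)
end

# A made-up 40-character 008 value, padded so the language code
# lands at positions 35-37.
field_008 = "850101s1985    nyu".ljust(35) + "eng" + " d"

puts sketch_byte_range(field_008, 35, 37)
```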
143
+ `extract_marc` by default includes all 'alternate script' linked fields corresponding
144
+ to matched specifications, but you can turn that off, or extract *only* corresponding
145
+ 880s.
146
+
147
+ ~~~ruby
148
+ to_field "title", extract_marc("245abc", :alternate_script => false)
149
+ to_field "title_vernacular", extract_marc("245abc", :alternate_script => :only)
150
+ ~~~
151
+
152
+ By default, specifications with multiple subfields (like "240abc") will produce one single string of output per field (for each '240'), with the concatenation of each matched subfield. Specifications with single subfields (like "020a") will split subfields and produce an output string for each matching subfield.
153
+
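The joining and splitting behavior just described can be sketched in plain ruby. Here a field is modeled as an array of `[code, value]` pairs, and `sketch_extract` is a hypothetical helper, not a traject method. Joining with a single space is an assumption of this sketch.

```ruby
# Hypothetical stand-in for extraction from ONE field, modeled as an
# array of [subfield_code, value] pairs.
def sketch_extract(subfields, wanted_codes, separator = " ")
  matched = subfields.select { |code, _value| wanted_codes.include?(code) }
                     .map { |_code, value| value }
  return [] if matched.empty?

  if wanted_codes.length > 1
    # Several codes in the spec: one concatenated string per field.
    [matched.join(separator)]
  else
    # A single code in the spec: one string per matching subfield.
    matched
  end
end

field_240 = [["a", "Works."], ["k", "Selections."], ["l", "English"]]
field_020 = [["a", "0123456789"], ["a", "9780123456786"], ["c", "$10.00"]]

p sketch_extract(field_240, %w[a k l])
p sketch_extract(field_020, %w[a])
```

The first call yields a single joined string for the field, while the second yields one string per `a` subfield.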
154
+ For the syntax and complete possibilities of the specification
155
+ string argument to extract_marc, see docs at the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
156
+
157
+ `extract_marc` also supports `translation maps` similar
158
+ to SolrMarc's. There are some translation maps provided by traject,
159
+ and you can also define your own, in yaml or ruby. Translation maps are especially useful
160
+ for mapping from MARC codes to user-displayable strings:
161
+
162
+ ~~~ruby
163
+ # "translation_map" will be passed to Traject::TranslationMap.new
164
+ # and the created map used to translate all values
165
+ to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_languages")
166
+ ~~~
167
+
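Conceptually, a translation map behaves like a hash lookup applied to every extracted value. The sketch below uses an ordinary ruby Hash and a hypothetical `sketch_translate` helper rather than `Traject::TranslationMap`. Dropping unmapped codes is a choice made for this sketch; what a real map does with them may depend on its configured default.

```ruby
# Hypothetical helper: translate each value through a plain Hash,
# dropping values that have no entry in the map.
def sketch_translate(values, map)
  values.map { |value| map[value] }.compact
end

# A tiny hand-written map; real maps would live in yaml or ruby files.
language_map = { "eng" => "English", "fre" => "French", "ger" => "German" }

p sketch_translate(["eng", "fre", "xxx"], language_map)
```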
168
+ To see all options for `extract_marc`, see the [method documentation](http://rdoc.info/gems/traject/Traject/Macros/Marc21:extract_marc)
169
+
170
+ ## other built-in utility macros
171
+
172
+ Other built-in methods that can be used with `to_field` include a hard-coded
173
+ literal string:
174
+
175
+ ~~~ruby
176
+ to_field "source", literal("LIB_CATALOG")
177
+ ~~~
178
+
179
+ The current record serialized back out as MARC, in binary, XML, or json:
180
+
181
+ ~~~ruby
182
+ # or :format => "json" for marc-in-json
183
+ # or :format => "binary", by default Base64-encoded for Solr
184
+ # 'binary' field, or, for more like what SolrMarc did, without
185
+ # escaping:
186
+ to_field "marc_record_raw", serialized_marc(:format => "binary", :binary_escape => false, :allow_oversized => true)
187
+ ~~~
188
+
189
+ Text of all fields in a range:
190
+
191
+ ~~~ruby
192
+ to_field "text", extract_all_marc_values(:from => 100, :to => 899)
193
+ ~~~
194
+
195
+ All of these methods are defined at [Traject::Macros::Marc21](./lib/traject/macros/marc21.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21))
196
+
197
+ ## more complex canned MARC semantic logic
198
+
199
+ Some more complex (and opinionated/subjective) algorithms for deriving semantics
200
+ from Marc are also packaged with Traject, but not available by default. To make
201
+ them available to your indexing, you just need to use ruby `require` and `extend`.
202
+
203
+ A number of methods are in [Traject::Macros::Marc21Semantics](./lib/traject/macros/marc21_semantics.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/Macros/Marc21Semantics))
204
+
205
+ ~~~ruby
206
+ require 'traject/macros/marc21_semantics'
207
+ extend Traject::Macros::Marc21Semantics
208
+
209
+ to_field 'title_sort', marc_sortable_title
210
+ to_field 'broad_subject', marc_lcc_to_broad_category
211
+ to_field "geographic_facet", marc_geo_facet
212
+ # And several more
213
+ ~~~
214
+
215
+ And, there's a routine for classifying MARC to an internal
216
+ format/genre/type vocabulary:
217
+
218
+ ~~~ruby
219
+ require 'traject/macros/marc_format_classifier'
220
+ extend Traject::Macros::MarcFormats
221
+
222
+ to_field 'format_facet', marc_formats
223
+ ~~~
224
+
225
+ (Alternately, see the [traject_umich_format](https://github.com/billdueber/traject_umich_format) gem for the often-ridiculously-complex
226
+ logic used at the University of Michigan.)
227
+
228
+ ## Custom logic
229
+
230
+ The built-in routines are there for your convenience, but if you need
231
+ something local or custom, you can write ruby logic directly
232
+ in a configuration file, using a ruby block, which looks like this:
233
+
234
+ ~~~ruby
235
+ to_field "id" do |record, accumulator|
236
+ # take the record's 001, prefix it with "bib_",
237
+ # and then add it to the 'accumulator' argument,
238
+ # to send it to the specified output field
239
+ value = record['001'].value
240
+ value = "bib_#{value}"
241
+ accumulator << value
242
+ end
243
+ ~~~
244
+
245
+ `do |record, accumulator| ... ` is the definition of a ruby block taking
246
+ two arguments. The first one passed in will be a MARC record. The
247
+ second is an array, you add values to the array to send them to
248
+ output.
249
+
250
+ Here's another example that shows how you'd get the
251
+ record type byte 06 out of a MARC leader, then translate it
252
+ to a human-readable string with a TranslationMap
253
+
254
+ ~~~ruby
255
+ to_field "marc_type" do |record, accumulator|
256
+ leader06 = record.leader.byteslice(6)
257
+ # this translation map doesn't actually exist, but could
258
+ accumulator << Traject::TranslationMap.new("marc_leader")[ leader06 ]
259
+ end
260
+ ~~~
261
+
262
+ You can also add a block onto the end of a built-in 'macro', to
263
+ further customize the output. The `accumulator` passed to your block
264
+ will already have values in it from the first step, and you can
265
+ use ruby methods like `map!` to modify it:
266
+
267
+ ~~~ruby
268
+ to_field "big_title", extract_marc("245abcdefg") do |record, accumulator|
269
+ # put it all in all uppercase, I don't know why.
270
+ accumulator.map! {|v| v.upcase}
271
+ end
272
+ ~~~
273
+
274
+ If you find yourself repeating boilerplate code in your custom logic, you can
275
+ even create your own 'macros' (like `extract_marc`). `extract_marc` and other
276
+ macros are nothing more than methods that return ruby lambda objects of
277
+ the same format as the blocks you write for custom logic.
278
+
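The idea that a macro is just a method returning a lambda can be sketched without traject at all. In this sketch `prefixed_value` is a hypothetical macro and the record is a plain Hash standing in for a MARC record, so the lookup would differ against a real record.

```ruby
# Hypothetical macro: returns a lambda with the same
# (record, accumulator) signature as a to_field block.
def prefixed_value(key, prefix)
  lambda do |record, accumulator|
    value = record[key]
    accumulator << "#{prefix}#{value}" unless value.nil?
  end
end

# Stand-in record: a plain Hash, not a MARC::Record.
record = { "001" => "12345" }
accumulator = []

# First step: the macro's lambda fills the accumulator.
prefixed_value("001", "bib_").call(record, accumulator)

# Second step: a follow-up block sees the values already collected
# and can modify them in place.
accumulator.map!(&:upcase)

p accumulator
```

In a config file, a method like this would be passed in the same position as `extract_marc`, for example `to_field "id", prefixed_value("001", "bib_")`.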
279
+ For tips, gotchas, and a more complete explanation of how this works, see
280
+ additional documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md)
281
+
282
+ ## each_record and after_processing
283
+
284
+ In addition to `to_field`, an `each_record` method is available, which,
285
+ like `to_field`, is executed for every record, but without being tied
286
+ to a specific field.
287
+
288
+ `each_record` can be used for logging or notifying; computing intermediate
289
+ results; or writing to more than one field at once.
290
+
291
+ ~~~ruby
292
+ each_record do |record|
293
+ some_custom_logging(record)
294
+ end
295
+ ~~~
296
+
297
+ For more on `each_record`, see documentation page on [Indexing Rules: Macros and Custom Logic](./doc/indexing_rules.md).
298
+
299
+ There is also an `after_processing` method that can be used to register
300
+ logic that will be called after the entire input has been processed. You can use it for whatever custom
301
+ ruby code you might want for your app (send an email? Clean up a log file? Trigger
302
+ a Solr replication?)
303
+
304
+ ~~~ruby
305
+ after_processing do
306
+ whatever_ruby_code
307
+ end
308
+ ~~~
309
+
310
+
311
+ ## Readers and Writers
312
+
313
+ Traject uses modular 'Writer' classes to take the output hashes from transformation, and
314
+ send them somewhere or do something useful with them.
315
+
316
+ By default traject uses the [Traject::SolrJsonWriter](lib/traject/solr_json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/SolrJsonWriter)) to send to Solr for indexing.
317
+ Several other writers are also built-in:
318
+ * [Traject::DebugWriter](lib/traject/debug_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DebugWriter))
319
+ * [Traject::JsonWriter](lib/traject/json_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/JsonWriter))
320
+ * [Traject::YamlWriter](lib/traject/yaml_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/YamlWriter))
321
+ * [Traject::DelimitedWriter](lib/traject/delimited_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/DelimitedWriter))
322
+ * [Traject::CSVWriter](lib/traject/csv_writer.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/CSVWriter))
323
+
324
+ You set which writer is being used in settings (`provide "writer_class_name", "Traject::DebugWriter"`),
325
+ or with the shortcut command line argument `-w Traject::DebugWriter`.
326
+
327
+ The [SolrJWriter](https://github.com/traject-project/traject-solrj_writer) is packaged separately,
328
+ and will be useful if you need to index to Solr versions older than 3.2. It requires JRuby.
329
+
330
+ You can easily write your own Readers and Writers if you'd like; see the comments at the top
331
+ of [Traject::Indexer](lib/traject/indexer.rb).
332
+
333
+
334
+ ## The traject command Line
335
+
336
+ The simplest invocation is:
337
+
338
+ traject -c conf_file.rb marc_file.mrc
339
+
340
+ Traject assumes marc files are in ISO 2709 MARC 'binary' format; it is not
341
+ currently able to guess other marc format types like XML from filenames or content. If you are reading
342
+ marc files in another format, you need to tell traject, either with the `marc_source.type` setting or with the command-line shortcut:
343
+
344
+ traject -c conf.rb -t xml marc_file.xml
345
+
346
+ You can supply more than one conf file to traject with repeated `-c` arguments.
347
+
348
+ traject -c connection_conf.rb -c indexing_conf.rb marc_file.mrc
349
+
350
+ If you supply a `--stdin` argument, traject will try to read from stdin.
351
+ You can only supply one marc file at a time, but we can take advantage of stdin to get around this:
352
+
353
+ cat some/dir/*.marc | traject -c conf_file.rb --stdin
354
+
355
+ You can set any setting on the command line with `-s key=value`.
356
+ This will override any settings set with `provide` in conf files.
357
+
358
+ traject -c conf_file.rb marc_file -s solr.url=http://somewhere/solr -s solr_writer.commit_on_close=true
359
+
360
+ There are some built-in command-line option shortcuts for useful
361
+ settings:
362
+
363
+ Use `--debug-mode` to output in a human-readable format, instead of sending to solr.
364
+ Also turns on debug logging and restricts processing to single-threaded. Useful for
365
+ debugging or sanity checking.
366
+
367
+ traject --debug-mode -c conf_file.rb marc_file
368
+
369
+ Use `-u` as a shortcut for `-s solr.url=X`
370
+
371
+ traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
372
+
373
+ Run `traject -h` to see the command line help screen listing all available options.
374
+
375
+ Also see `-I load_path` option and suggestions for Bundler use under Extending With Your Own Code.
376
+
377
+ See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
378
+
379
+
380
+ ## Extending With Your Own Code
381
+
382
+ Traject config files are full live ruby files, where you can do anything,
383
+ including declaring new classes, etc.
384
+
385
+ However, beyond limited trivial logic, you'll want to organize your
386
+ code reasonably into separate files, not jam everything into config
387
+ files.
388
+
389
+ Traject wants to make sure it makes it convenient for you to do so,
390
+ whether project-specific logic in files local to the traject project,
391
+ or in ruby gems that can be shared between projects.
392
+
393
+ There are standard ruby mechanisms you can use to do this, and
394
+ traject provides a couple features to make sure this remains
395
+ convenient with the traject command line.
396
+
397
+ For more information, see documentation page on [Extending With Your
398
+ Own Code](./doc/extending.md)
399
+
400
+ **Expert summary** :
401
+ * The traject `-I` command line argument can be used to list directories to
402
+ add to the load path, similar to the `ruby -I` argument. You
403
+ can then 'require' local project files from the load path.
404
+ * translation map files found on the load path or in a
405
+ "./translation_maps" subdir on the load path will be found
406
+ for Traject translation maps.
407
+ * Use [Bundler](http://bundler.io/) with traject simply by creating a Gemfile with `bundle init`,
408
+ and then running command line with `bundle exec traject` or
409
+ even `BUNDLE_GEMFILE=path/to/Gemfile bundle exec traject`
410
+
411
+ ## More
412
+
413
+ * [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
414
+ * [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
415
+ * Plugin extensions: Gems that add functionality to traject
416
+ * [traject_alephsequential_reader](https://github.com/traject-project/traject_alephsequential_reader/): read MARC files serialized in the AlephSequential format, as output by Ex Libris's Aleph ILS.
417
+ * [traject_horizon](https://github.com/jrochkind/traject_horizon): Export MARC records directly from a Horizon ILS rdbms, as serialized MARC or to index into Solr.
418
+ * [traject_umich_format](https://github.com/billdueber/traject_umich_format/): opinionated code and associated macros to extract format (book, audio file, etc.) and types (bibliography, conference report, etc.) from a MARC record. Code mirrors that used by the University of Michigan, and is an alternate approach to that taken by the `marc_formats` macro in `Traject::Macros::MarcFormatClassifier`.
419
+ * [traject-solrj_writer](https://github.com/traject-project/traject-solrj_writer): a jruby-only writer that uses the solrj .jar to talk directly to solr. Your only option for speaking to a solr version < 3.2, which is when the json handler was added to solr.
420
+ * [traject_marc4j_reader](https://github.com/billdueber/traject_marc4j_reader): Packaged with traject automatically on jruby. A JRuby-only reader for
421
+ reading marc records using the Marc4J library, fastest MARC reading on JRuby.
422
+
423
+ # Development
424
+
425
+ Run tests with `rake test` or just `rake`. Tests are written using Minitest (please, no rspec). We use the spec-style describe/it to
426
+ list the tests -- but generally prefer unit-style "assert_*" methods
427
+ to make actual assertions, for clarity.
428
+
429
+ To make a pull request, please make a feature branch *created from the master branch*, not from an existing feature branch. (If you need to do a feature branch dependent on an existing not-yet merged feature branch... discuss
430
+ this with other developers first!)
431
+
432
+ Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README,
433
+ and/or extra files in ./doc -- as appropriate for what needs to be documented.
434
+
435
+ **Inline api docs** Note that our [`.yardopts` file](./.yardopts) used by rdoc.info to generate
436
+ online api docs has a `--markup markdown` specified -- inline class/method docs are in markdown, not rdoc.
437
+
438
+ Bundler rake tasks included for gem releases: `rake release`
439
+ * Every traject release needs to be done twice: once while running MRI, and then
440
+ again after switching to JRuby. The JRuby release is identical except that it includes
441
+ a gemspec dependency on the Marc4JReader gem.
442
+
443
+ ## TODO
444
+
445
+ * Readers and index rules helpers for reading XML files as input? Maybe.
446
+
447
+ * Writers for writing to stores other than Solr? ElasticSearch? Maybe.
448
+
449
+ * Unicode normalization. Has to normalize to NFKC on way out to index. Except for serialized marc field and other exceptions? Except maybe don't have to, rely on solr analyzer to do it?
450
+
451
+ * Should it normalize to NFC on the way in, to make sure translation maps and other string comparisons match properly?
452
+
453
+ * Either way, all optional/configurable of course. based
454
+ on Settings.
455
+
456
+ * CommandLine class isn't covered by tests -- it's written using functionality
457
+ from Indexer and other classes that are well-covered, but the CommandLine itself
458
+ probably needs some tests -- especially covering error handling, which probably
459
+ needs a bit more attention and using exceptions instead of exits, etc.
460
+
461
+ * Optional built-in jetty stop/start to allow indexing to Solr that wasn't running before. maybe https://github.com/projecthydra/jettywrapper ?