stevedore-uploader 1.0.0-java → 1.0.1-java

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: ea9762ea02f37f5e56993854cb08ae63f21f7e1e
-  data.tar.gz: f525070b72ea4f56e44096de00ec4789a059b042
+  metadata.gz: 3370d3545b000618352d42bf5437a536f9f9ce5e
+  data.tar.gz: 537acba2cff72af43b16e762f4082848cb548281
 SHA512:
-  metadata.gz: 1e6cd4cbae534a1bd67091daea696c0cb33daf5dc94c740be2b6f6ce0fbf23195c02d05adfbe00d618e6b36f3bd890e8223fa7c8b7f02a3644aaa4e988b6d1b4
-  data.tar.gz: d4b4971924a013b3bf7a64b375a6e2fcf5246fa29a975099415c1128a0d0ebbb9d0526776d98f8f81e79ffbe6dea21e1f6eeacf713485785c9f8f4b8f62f749b
+  metadata.gz: a5dbc4618ab5517701aa57c6ef09a5cb45dcdfe5bb495640e092e3796915be796313bc47f831fdc8ae3e6c2ece5af8dc0281f9ccc8595839e89bec8565e903a0
+  data.tar.gz: 5e9c8df53922d30e2748720f16ffa8c782ba4e978fbbade650da30e4da7833559494aa0b77c084ed230c96eba316d5412a485163d07b07c5ac6c108e03d0a180
data/README.md CHANGED
@@ -5,7 +5,7 @@ A tool for uploading documents into [Stevedore](https://github.com/newsdev/steve
 
 Stevedore is essentially an ElasticSearch endpoint with a customizable frontend attached to it. Stevedore's primary document store is ElasticSearch, so `stevedore-uploader`'s primary task is merely uploading documents to ElasticSearch, with a few attributes that Stevedore depends on. Getting a new document set ready for search requires a few steps, but this tool helps with the hardest one: converting the documents you want to search into a format that ElasticSearch understands. Customizing the search interface is often not necessary, but if it is, information on how to do that is in the [Stevedore](https://github.com/newsdev/stevedore) repository.
 
-Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/). A `case` statement in `bin/upload_to_elasticsearch.rb` distinguishes between a few default types, like emails and text blobs (and PRs adding new ones would be appreciated); for specialized types, the `do!` method takes a block allowing you to modify the documents with just a few lines of Ruby.
+Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/). Stevedore distinguishes between a few default types, like emails and text blobs (and PRs adding new ones would be appreciated); for specialized types, the `do!` method takes a block allowing you to modify the documents with just a few lines of Ruby.
 
 For more details on the entire workflow, see [Stevedore](https://github.com/newsdev/stevedore).
 
@@ -47,14 +47,14 @@ bundle exec ruby bin/upload_to_elasticsearch.rb --index=jrubytest --host=https:/
 If you choose to process documents from S3, you should upload those documents using your choice of tool -- but `awscli` is a good choice. *Stevedore-Uploader does NOT upload documents to S3 on your behalf.*
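For illustration, one way to do that initial upload from Ruby, using the `aws-sdk` gem that the uploader itself already uses for downloads; the bucket and key prefix here are hypothetical, and `awscli` works just as well:

````
require 'aws-sdk'

# a sketch, not part of stevedore-uploader: copy a local folder into S3
s3 = Aws::S3::Resource.new(region: 'us-east-1')
bucket = s3.bucket('my-document-bucket') # hypothetical bucket name
Dir.glob('documents/**/*').select { |f| File.file?(f) }.each do |path|
  bucket.object("my-dump/#{path}").upload_file(path) # hypothetical key prefix
end
````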
 
 If you need to process documents in a specialized, customized way, follow this example:
-
+````
 uploader = Stevedore::ESUploader.new(ES_HOST, ES_INDEX, S3_BUCKET, S3_PATH_PREFIX) # S3_BUCKET, S3_PATH_PREFIX are optional
 uploader.do! FOLDER do |doc, filename, content, metadata|
   next if doc.nil?
   doc["analyzed"]["metadata"]["date"] = Date.parse(File.basename(filename).split("_")[-2])
   doc["analyzed"]["metadata"]["title"] = my_title_getter_function(File.basename(filename))
 end
-
+````
 
 Questions?
 ==========
bin/upload_to_elasticsearch.rb CHANGED
@@ -45,7 +45,7 @@ if __FILE__ == $0
       options.text_column = text_column
     end
 
-    opts.on("-o", "--[no-]ocr", "Run verbosely") do |v|
+    opts.on("-o", "--[no-]ocr", "don't attempt to OCR any PDFs, even if they contain no text") do |v|
       options.ocr = v
     end
 
@@ -81,7 +81,6 @@ ES_INDEX = if options.index.nil? || options.index == ''
 end
 
 S3_BUCKET = FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : options.s3bucket
-raise ArgumentError, 's3 buckets other than int-data-dumps aren\'t supported by the frontend yet' if S3_BUCKET != 'int-data-dumps'
 ES_HOST = options.host || "localhost:9200"
 S3_PATH = options.s3path || options.index
 S3_BASEPATH = "https://#{S3_BUCKET}.s3.amazonaws.com/#{S3_PATH}"
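A note on the reworded `--[no-]ocr` flag: with Ruby's `OptionParser`, a `--[no-]ocr` spec accepts both `--ocr` and `--no-ocr`, passing `true` or `false` to the block. A self-contained sketch (the default value here is hypothetical):

````
require 'optparse'
require 'ostruct'

options = OpenStruct.new(ocr: true) # hypothetical default
OptionParser.new do |opts|
  opts.on("-o", "--[no-]ocr", "don't attempt to OCR any PDFs, even if they contain no text") do |v|
    options.ocr = v # --ocr (or -o) yields true; --no-ocr yields false
  end
end.parse!(["--no-ocr"])
puts options.ocr # => false
````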
@@ -15,7 +15,7 @@ module Stevedore
       t.message_to = metadata["Message-To"]
       t.message_from = metadata["Message-From"]
       t.message_cc = metadata["Message-Cc"]
-      t.subject = metadata["subject"]
+      t.title = t.subject = metadata["subject"]
       t.attachments = metadata["X-Attachments"].to_s.split("|").map do |raw_attachment_filename|
         attachment_filename = CGI::unescape(raw_attachment_filename)
         possible_filename = File.join(File.dirname(filepath), attachment_filename)
data/lib/split_archive.rb CHANGED
@@ -1,77 +1,104 @@
-# splits zip, mbox and pst files into their constituent documents -- mesages and attachments
+# splits zip, mbox (and eventually pst) files into their constituent documents -- messages and attachments
 # and puts them into a tmp folder
 # which is then parsed normally
-require 'mapi/msg'
+
 require 'tmpdir'
-require 'mapi/pst'
+require 'mail'
 require 'zip'
+require 'pst' # for PST files
+
 
 # splits PST and Mbox formats
+module Stevedore
+  class ArchiveSplitter
+    HANDLED_FORMATS = ["zip", "mbox", "pst"]
 
-class ArchiveSplitter
-  def self.split(archive_filename)
-    # if it's a PST use split_pst
-    # if it's an mbox, use split_pst
-    # return a list of files
-    Dir.mktmpdir do |dir|
-      #TODO should probably do magic byte searching et.c
-      extension = dir.archive_filename.split(".")[-1]
+    def self.split(archive_filename)
+      # if it's a PST use split_pst
+      # if it's an mbox, use split_mbox, etc.
+      # return a list of files
+      Enumerator.new do |yielder|
+        Dir.mktmpdir do |tmpdir|
+          #TODO should probably do magic byte searching etc.
+          extension = archive_filename.split(".")[-1]
+          puts "splitting #{archive_filename}"
+          constituent_files = if extension == "mbox"
+            self.split_mbox(archive_filename)
+          elsif extension == "pst"
+            self.split_pst(archive_filename)
+          elsif extension == "zip"
+            self.split_zip(archive_filename)
+          end
+          # should yield a relative filename
+          # and a lambda that will write the file contents to the given filename
+          FileUtils.mkdir_p(File.join(tmpdir, File.basename(archive_filename)))
 
-      constituent_files = if extension == "mbox"
-        self.split_mbox(archive_filename)
-      elsif extension == "pst"
-        self.split_pst(archive_filename)
-      elsif extension == "zip"
-        self.split_zip(archive_filename)
-      end
-      # should yield a relative filename
-      # and a lambda that will write the file contents to the given filename
+          constituent_files.each_with_index do |basename_contents_lambda, idx|
+            basename, contents_lambda = *basename_contents_lambda
+            tmp_filename = File.join(tmpdir, File.basename(archive_filename), basename.gsub("/", "") )
+            contents_lambda.call(tmp_filename)
+            yielder.yield tmp_filename, File.join(File.basename(archive_filename), basename)
+          end
+        end
+      end
+    end
 
-      constituent_files.each do |filename, contents_lambda|
-        contents_lambda.call(File.join(dir, File.basename(archive_filename), filename ))
+    def self.split_pst(archive_filename)
+      pstfile = Java::ComPFF::PSTFile.new(archive_filename)
+      idx = 0
+      folders = pstfile.root.sub_folders.inject({}) do |memo,f|
+        memo[f.name] = f
+        memo
       end
-    end
-  end
-end
+      Enumerator.new do |yielder|
+        folders.each do |folder_name, folder|
+          while mail = folder.getNextChild
 
-class MailArchiveSplitter
+            eml_str = mail.get_transport_message_headers + mail.get_body
 
-  def self.split_pst(archive_filename)
-    pst = Mapi::Pst.new open(archive_filename)
-    pst.each_with_index do |mail, idx|
-      msg = Mapi::Msg.load mail
-      yield "#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << mail } }
-      msg.attachments.each do |attachment|
-        yield attachment.filename, lambda{|fn| open(fn, 'wb'){|fh| attachment.save fh }}
+            yielder << ["#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << eml_str } }]
+            attachment_count = mail.get_number_of_attachments
+            attachment_count.times do |attachment_idx|
+              attachment = mail.get_attachment(attachment_idx)
+              attachment_filename = attachment.get_filename
+              yielder << ["#{idx}-#{attachment_filename}", lambda {|fn| open(fn, 'wb'){ |fh| fh << attachment.get_file_input_stream.to_io.read }}]
+            end
+            idx += 1
+          end
+        end
       end
     end
 
-  end
-
-  def self.split_mbox(archive_filename)
-    # stolen shamelessly from the Ruby Enumerable docs, actually
-    # split mails in mbox (slice before Unix From line after an empty line)
-    open(archive_filename) do |fh|
-      f.slice_before(empty: true) do |line, h|
-        previous_was_empty = h[:empty]
-        h[:empty] = line == "\n"
-        previous_was_empty && line.start_with?("From ")
-      end.each_with_index do |mail, idx|
-        mail.pop if mail.last == "\n" # remove last line if prexent
-        yield "#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| f << mail } }
-        msg.attachments.each do |attachment|
-          yield attachment.filename, lambda{|fn| open(fn, 'wb'){|fh| attachment.save f }}
+    def self.split_mbox(archive_filename)
+      # stolen shamelessly from the Ruby Enumerable docs, actually
+      # split mails in mbox (slice before Unix From line after an empty line)
+      Enumerator.new do |yielder|
+        open(archive_filename) do |fh|
+          fh.slice_before(empty: true) do |line, h|
+            previous_was_empty = h[:empty]
+            h[:empty] = line == "\n" || line == "\r\n" || line == "\r"
+            previous_was_empty && line.start_with?("From ")
+          end.each_with_index do |mail_str, idx|
+            mail_str.pop if mail_str.last == "\n" # remove last line if present
+            yielder << ["#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << mail_str.join("") } }]
+            mail = Mail.new mail_str.join("")
+            mail.attachments.each do |attachment|
+              yielder << [attachment.filename, lambda{|fn| open(fn, 'wb'){|fh| attachment.save fh }}]
+            end
+          end
         end
       end
     end
-  end
 
-  def self.split_zip(archive_filename)
-    Zip::File.open(archive_filename) do |zip_file|
-      zip_file.each do |entry|
-        yield entry.names, lambda{|fn| entryhextract(fn) }
+    def self.split_zip(archive_filename)
+      Zip::File.open(archive_filename) do |zip_file|
+        Enumerator.new do |yielder|
+          zip_file.each do |entry|
+            yielder << [entry.name, lambda{|fn| entry.extract(fn) }]
+          end
+        end
       end
     end
-  end
 
+  end
 end
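`ArchiveSplitter.split` now returns an `Enumerator` of `[tmp_filename, relative_name]` pairs rather than yielding to a block, so callers can walk an archive's constituent documents lazily. A minimal usage sketch, assuming the class and its requires are already loaded and a local `inbox.mbox` exists:

````
Stevedore::ArchiveSplitter.split("inbox.mbox").each do |tmp_filename, relative_name|
  # tmp_filename is an extracted file on disk (inside a tmpdir);
  # relative_name is e.g. "inbox.mbox/0.eml", suitable for display or ID generation
  puts "#{relative_name}: #{File.size(tmp_filename)} bytes"
end
````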
data/lib/stevedore-uploader.rb CHANGED
@@ -20,7 +20,7 @@ module Stevedore
   class ESUploader
     #creates blobs
     attr_reader :errors
-    attr_accessor :should_ocr, :slice_size
+    attr_accessor :should_ocr, :slice_size, :client
 
     def initialize(es_host, es_index, s3_bucket=nil, s3_path=nil)
       @errors = []
@@ -33,16 +33,15 @@ module Stevedore
         },
       )
       @es_index = es_index
-      @s3_bucket = s3_bucket || FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : 'int-data-dumps'
+      @s3_bucket = s3_bucket #|| (Stevedore::ESUploader.const_defined?(FOLDER) && FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : nil)
       @s3_basepath = "https://#{s3_bucket}.s3.amazonaws.com/#{s3_path || es_index}"
 
-
       @slice_size = 100
 
       @should_ocr = false
 
       self.create_index!
-      self.create_mappings!
+      self.add_mapping(:doc, MAPPING)
     end
 
     def create_index!
@@ -80,82 +79,28 @@ module Stevedore
       end
     end
 
-    def create_mappings!
+    def add_mapping(type, mapping)
       @client.indices.put_mapping({
         index: @es_index,
-        type: :doc,
+        type: type,
         body: {
           "_id" => {
-            path: "sha1"
+            path: "id"
           },
-          properties: { # feel free to add more, this is the BARE MINIMUM the UI depends on
-            sha1: {type: :string, index: :not_analyzed},
-            title: { type: :string, analyzer: :keyword },
-            source_url: {type: :string, index: :not_analyzed},
-            modifiedDate: { type: :date, format: "dateOptionalTime" },
-            _updated_at: { type: :date },
-            analyzed: {
-              properties: {
-                body: {
-                  type: :string,
-                  index_options: :offsets,
-                  term_vector: :with_positions_offsets,
-                  store: true,
-                  fields: {
-                    snowball: {
-                      type: :string,
-                      index: "analyzed",
-                      analyzer: 'snowball_analyzer' ,
-                      index_options: :offsets,
-                      term_vector: :with_positions_offsets,
-                    }
-                  }
-                },
-                metadata: {
-                  properties: {
-                    # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
-                    "Message-From" => {
-                      type: "string",
-                      fields: {
-                        email: {
-                          type: "string",
-                          analyzer: "email_analyzer"
-                        },
-                        "Message-From" => {
-                          type: "string"
-                        }
-                      }
-                    },
-                    "Message-To" => {
-                      type: "string",
-                      fields: {
-                        email: {
-                          type: "string",
-                          analyzer: "email_analyzer"
-                        },
-                        "Message-To" => {
-                          type: "string"
-                        }
-                      }
-                    }
-                  }
-                }
-              }
-            }
-          }
+          properties: mapping
         }
       }) # was "rescue nil" but that obscured meaningful errors
     end
 
-    def bulk_upload_to_es!(data)
+    def bulk_upload_to_es!(data, type=nil)
       return nil if data.empty?
       begin
-        resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+        resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: type || 'doc', data: datum }} }
         puts resp if resp[:errors]
       rescue JSON::GeneratorError
         data.each do |datum|
           begin
-            @client.bulk body: [datum].map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+            @client.bulk body: [datum].map{|datum| {index: {_index: @es_index, _type: type || 'doc', data: datum }} }
           rescue JSON::GeneratorError
             next
          end
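Taken together, these two changes make the mapping and upload path generic over document types: `add_mapping` accepts any type name and properties hash where `create_mappings!` hard-coded `:doc` and the full mapping, and `bulk_upload_to_es!` now takes an optional type. An illustrative sketch; the host, index name, and mapping here are hypothetical:

````
uploader = Stevedore::ESUploader.new("localhost:9200", "my-index")
uploader.add_mapping(:email, {
  sha1: {type: :string, index: :not_analyzed},
  title: {type: :string, analyzer: :keyword}
})
# index documents under the custom type instead of the default 'doc'
uploader.bulk_upload_to_es!([{"sha1" => "b4c0...", "title" => "an email"}], :email)
````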
@@ -166,8 +111,6 @@ module Stevedore
     end
 
     def process_document(filename, filename_for_s3)
-
-
       begin
         puts "begin to process #{filename}"
         # puts "size: #{File.size(filename)}"
@@ -183,6 +126,7 @@ module Stevedore
         puts "skipping #{filename} for being too big"
         return nil
       end
+      puts metadata["Content-Type"].inspect
 
       # TODO: factor these out in favor of the yield/block situation down below.
       # this should (eventually) be totally generic, but perhaps handle common
@@ -206,6 +150,7 @@ module Stevedore
         end.join("\n\n")
         # e.g. Analysis-Corporation-2.png.pdf or Torture.pdf
         files = Dir["#{pdf_basename}.png.pdf"] + (Dir["#{pdf_basename}-*.png.pdf"].sort_by{|pdf| Regexp.new("#{pdf_basename}-([0-9]+).png.pdf").match(pdf)[1].to_i })
+        return nil if files.empty?
        system('pdftk', *files, "cat", "output", "#{pdf_basename}.ocr.pdf")
        content, _ = Rika.parse_content_and_metadata("#{pdf_basename}.ocr.pdf")
        puts "OCRed content (#{File.basename(filename)}) length: #{content.length}"
@@ -251,12 +196,14 @@ module Stevedore
       end
     end
 
-    def do!(target_folder_path, output_stream=STDOUT)
-      output_stream.puts "Processing documents from #{target_folder_path}"
+    def do!(target_path, output_stream=STDOUT)
+      output_stream.puts "Processing documents from #{target_path}"
 
       docs_so_far = 0
+      # use_s3 = false # option to set this (an option to set document URLs to be relative to the search engine root) is TK
+      @s3_bucket = target_path.gsub(/s3:\/\//i, '').split("/", 2).first if @s3_bucket.nil? && target_path.downcase.include?('s3://')
 
-      if target_folder_path.downcase.include?("s3://")
+      if target_path.downcase.include?("s3://")
         Dir.mktmpdir do |dir|
           Aws.config.update({
             region: 'us-east-1', # TODO should be configurable
@@ -264,7 +211,7 @@ module Stevedore
           s3 = Aws::S3::Resource.new
 
           bucket = s3.bucket(@s3_bucket)
-          s3_path_without_bucket = target_folder_path.gsub(/s3:\/\//i, '').split("/", 2).last
+          s3_path_without_bucket = target_path.gsub(/s3:\/\//i, '').split("/", 2).last
           bucket.objects(:prefix => s3_path_without_bucket).each_slice(@slice_size) do |slice_of_objs|
             docs_so_far += slice_of_objs.size
 
@@ -282,17 +229,30 @@ module Stevedore
              File.open(tmp_filename, 'wb'){|f| f << body.nil? ? '' : body.chars.select(&:valid_encoding?).join}
            end
            download_filename = "https://#{@s3_bucket}.s3.amazonaws.com/" + obj.key
-           doc, content, metadata = process_document(tmp_filename, download_filename)
-           begin
-             FileUtils.rm(tmp_filename)
-           rescue Errno::ENOENT
-             # try to delete, but no biggie if it doesn't work for some weird reason.
+
+           # is this file an archive that contains a bunch of documents we should index separately?
+           # obviously, there is not a strict definition here.
+           # emails in mailboxes are split into an email and attachments
+           # but, for now, standalone emails are treated as one document
+           # PDFs can (theoretically) contain documents as "attachments" -- those aren't handled here either.
+           if ArchiveSplitter::HANDLED_FORMATS.include?(tmp_filename.split(".")[-1])
+             ArchiveSplitter.split(tmp_filename).map do |constituent_file, constituent_basename|
+               doc, content, metadata = process_document(constituent_file, download_filename)
+               doc["sha1"] = Digest::SHA1.hexdigest(download_filename + File.basename(constituent_basename)) # since these files all share a download URL (that of the archive), we need to come up with a custom sha1
+               yield doc, obj.key, content, metadata if block_given?
+               FileUtils.rm(constituent_file) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+               doc
+             end
+           else
+             doc, content, metadata = process_document(tmp_filename, download_filename)
+             yield doc, obj.key, content, metadata if block_given?
+             FileUtils.rm(tmp_filename) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+             [doc]
            end
-           yield doc, obj.key, content, metadata if block_given?
-           doc
          end
          begin
-           resp = bulk_upload_to_es!(slice_of_objs.compact)
+           resp = bulk_upload_to_es!(slice_of_objs.compact.flatten(1)) # flatten, in case there's an archive
+           puts resp.inspect if resp && resp["errors"]
          rescue Manticore::Timeout, Manticore::SocketException
            output_stream.puts("retrying at #{Time.now}")
            retry
@@ -302,28 +262,42 @@ module Stevedore
          end
        end
      else
-       Dir[target_folder_path + (target_folder_path.include?('*') ? '' : '/**/*')].each_slice(@slice_size) do |slice_of_files|
+       list_of_files = File.file?(target_path) ? [target_path] : Dir[File.join(target_path, target_path.include?('*') ? '' : '**/*')]
+       list_of_files.each_slice(@slice_size) do |slice_of_files|
          output_stream.puts "starting a set of #{@slice_size}"
          docs_so_far += slice_of_files.size
 
          slice_of_files.map! do |filename|
            next unless File.file?(filename)
-
-           filename_basepath = filename.gsub(target_folder_path, '')
-           if use_s3
+           filename_basepath = filename.gsub(target_path, '')
+           # if use_s3 # turning this on TK
            download_filename = @s3_basepath + filename_basepath
+           # else
+           #   download_filename = "/files/#{@es_index}/#{filename_basepath}"
+           # end
+
+           # is this file an archive that contains a bunch of documents we should index separately?
+           # obviously, there is not a strict definition here.
+           # emails in mailboxes are split into an email and attachments
+           # but, for now, standalone emails are treated as one document
+           # PDFs can (theoretically) contain documents as "attachments" -- those aren't handled here either.
+           if ArchiveSplitter::HANDLED_FORMATS.include?(filename.split(".")[-1])
+             ArchiveSplitter.split(filename).map do |constituent_file, constituent_basename|
+               doc, content, metadata = process_document(constituent_file, download_filename)
+               doc["sha1"] = Digest::SHA1.hexdigest(download_filename + File.basename(constituent_basename)) # since these files all share a download URL (that of the archive), we need to come up with a custom sha1
+               yield doc, filename, content, metadata if block_given?
+               # FileUtils.rm(constituent_file) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+               doc
+             end
           else
-             download_filename = "/files/#{@es_index}/#{filename_basepath}"
+             doc, content, metadata = process_document(filename, download_filename )
+             yield doc, filename, content, metadata if block_given?
+             [doc]
           end
-
-           doc, content, metadata = process_document(filename, download_filename )
-           yield doc, filename, content, metadata if block_given?
-           doc
          end
         begin
-           puts "uploading"
-           resp = bulk_upload_to_es!(slice_of_files.compact)
-           puts resp.inspect if JSON.parse(resp)["errors"]
+           resp = bulk_upload_to_es!(slice_of_files.compact.flatten(1)) # flatten, in case there's an archive
+           puts resp.inspect if resp && resp["errors"]
         rescue Manticore::Timeout, Manticore::SocketException => e
           output_stream.puts e.inspect
           output_stream.puts "Upload error: #{e} #{e.message}."
@@ -337,4 +311,59 @@ module Stevedore
       end
     end
   end
+  MAPPING = {
+    sha1: {type: :string, index: :not_analyzed},
+    title: { type: :string, analyzer: :keyword },
+    source_url: {type: :string, index: :not_analyzed},
+    modifiedDate: { type: :date, format: "dateOptionalTime" },
+    _updated_at: { type: :date },
+    analyzed: {
+      properties: {
+        body: {
+          type: :string,
+          index_options: :offsets,
+          term_vector: :with_positions_offsets,
+          store: true,
+          fields: {
+            snowball: {
+              type: :string,
+              index: "analyzed",
+              analyzer: 'snowball_analyzer' ,
+              index_options: :offsets,
+              term_vector: :with_positions_offsets,
+            }
+          }
+        },
+        metadata: {
+          properties: {
+            # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
+            "Message-From" => {
+              type: "string",
+              fields: {
+                email: {
+                  type: "string",
+                  analyzer: "email_analyzer"
+                },
+                "Message-From" => {
+                  type: "string"
+                }
+              }
+            },
+            "Message-To" => {
+              type: "string",
+              fields: {
+                email: {
+                  type: "string",
+                  analyzer: "email_analyzer"
+                },
+                "Message-To" => {
+                  type: "string"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
 end
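One consequence of the archive handling, illustrated: every constituent of an archive shares the archive's download URL, so `do!` assigns each constituent a custom `sha1` derived from that URL plus the member's basename. A worked sketch with hypothetical names:

````
require 'digest'

download_filename = "https://my-bucket.s3.amazonaws.com/my-index/inbox.mbox" # hypothetical URL
constituent_basename = "inbox.mbox/3.eml"
doc_id = Digest::SHA1.hexdigest(download_filename + File.basename(constituent_basename))
# each member (0.eml, 1.eml, ...) of the same archive gets a distinct, stable ID
````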
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: stevedore-uploader
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.1
 platform: java
 authors:
 - Jeremy B. Merrill
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-04-19 00:00:00.000000000 Z
+date: 2016-05-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   requirement: !ruby/object:Gem::Requirement
@@ -94,6 +94,48 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.6'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.2
+  name: pst
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.2
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+  name: mail
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
+  name: rubyzip
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
 description: TK
 email: jeremy.merrill@nytimes.com
 executables: []