stevedore-uploader 1.0.0-java → 1.0.1-java
- checksums.yaml +4 -4
- data/README.md +3 -3
- data/bin/upload_to_elasticsearch.rb +1 -2
- data/lib/parsers/stevedore_email.rb +1 -1
- data/lib/split_archive.rb +81 -54
- data/lib/stevedore-uploader.rb +120 -91
- metadata +44 -2
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3370d3545b000618352d42bf5437a536f9f9ce5e
+  data.tar.gz: 537acba2cff72af43b16e762f4082848cb548281
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a5dbc4618ab5517701aa57c6ef09a5cb45dcdfe5bb495640e092e3796915be796313bc47f831fdc8ae3e6c2ece5af8dc0281f9ccc8595839e89bec8565e903a0
+  data.tar.gz: 5e9c8df53922d30e2748720f16ffa8c782ba4e978fbbade650da30e4da7833559494aa0b77c084ed230c96eba316d5412a485163d07b07c5ac6c108e03d0a180
data/README.md CHANGED
@@ -5,7 +5,7 @@ A tool for uploading documents into [Stevedore](https://github.com/newsdev/steve
 
 Stevedore is essentially an ElasticSearch endpoint with a customizable frontend attached to it. Stevedore's primary document store is ElasticSearch, so `stevedore-uploader`'s primary task is merely uploading documents to ElasticSearch, with a few attributes that Stevedore depends on. Getting a new document set ready for search requires a few steps, but this tool helps with the hardest one: Converting the documents you want to search into a format that ElasticSearch understands. Customizing the search interface is often not necessary, but if it is, information on how to do that is in the [Stevedore](https://github.com/newsdev/stevedore) repository.
 
-Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/).
+Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/). Stevedore distinguishes between a few default types, like emails and text blobs (and PRs adding new ones would be appreciated); for specialized types, the `do!` method takes a block that lets you modify the documents with just a few lines of Ruby.
 
 For more details on the entire workflow, see [Stevedore](https://github.com/newsdev/stevedore)
 
@@ -47,14 +47,14 @@ bundle exec ruby bin/upload_to_elasticsearch.rb --index=jrubytest --host=https:/
 If you choose to process documents from S3, you should upload those documents using your choice of tool -- but `awscli` is a good choice. *Stevedore-Uploader does NOT upload documents to S3 on your behalf.*
 
 If you need to process documents in a specialized, customized way, follow this example:
-
+````
 uploader = Stevedore::ESUploader.new(ES_HOST, ES_INDEX, S3_BUCKET, S3_PATH_PREFIX) # S3_BUCKET, S3_PATH_PREFIX are optional
 uploader.do! FOLDER do |doc, filename, content, metadata|
   next if doc.nil?
   doc["analyzed"]["metadata"]["date"] = Date.parse(File.basename(filename).split("_")[-2])
   doc["analyzed"]["metadata"]["title"] = my_title_getter_function(File.basename(filename))
 end
-
+````
 
 Questions?
 ==========
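For readers who want to try the customization hook above outside the repo, a self-contained sketch follows; the host, index, folder, and filename scheme are illustrative assumptions, not values the gem ships with:

```ruby
require 'date'
require 'stevedore-uploader'

# hypothetical host and index -- substitute your own
uploader = Stevedore::ESUploader.new("localhost:9200", "my-documents")

uploader.do!("/tmp/my-documents") do |doc, filename, content, metadata|
  next if doc.nil? # documents that failed to parse arrive as nil
  # assumes filenames like "report_2016-05-26_final.pdf"
  doc["analyzed"]["metadata"]["date"]  = Date.parse(File.basename(filename).split("_")[-2])
  doc["analyzed"]["metadata"]["title"] = File.basename(filename, ".*")
end
```

The block runs once per document, after Tika extraction but before upload, so whatever it writes into `doc` is what gets indexed.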
data/bin/upload_to_elasticsearch.rb CHANGED
@@ -45,7 +45,7 @@ if __FILE__ == $0
       options.text_column = text_column
     end
 
-    opts.on("-o", "--[no-]ocr", "
+    opts.on("-o", "--[no-]ocr", "don't attempt to OCR any PDFs, even if they contain no text") do |v|
       options.ocr = v
     end
 
@@ -81,7 +81,6 @@ ES_INDEX = if options.index.nil? || options.index == ''
 end
 
 S3_BUCKET = FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : options.s3bucket
-raise ArgumentError, 's3 buckets other than int-data-dumps aren\'t supported by the frontend yet' if S3_BUCKET != 'int-data-dumps'
 ES_HOST = options.host || "localhost:9200"
 S3_PATH = options.s3path || options.index
 S3_BASEPATH = "https://#{S3_BUCKET}.s3.amazonaws.com/#{S3_PATH}"
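The restored help text rides on OptionParser's negatable-switch form: declaring `--[no-]ocr` makes the parser accept both `--ocr` (the block receives `true`) and `--no-ocr` (the block receives `false`). A minimal sketch of that stdlib behavior, independent of the uploader:

```ruby
require 'optparse'
require 'ostruct'

options = OpenStruct.new
OptionParser.new do |opts|
  opts.on("-o", "--[no-]ocr", "don't attempt to OCR any PDFs, even if they contain no text") do |v|
    options.ocr = v # true for -o/--ocr, false for --no-ocr
  end
end.parse!(["--no-ocr"])

p options.ocr # => false
```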
data/lib/parsers/stevedore_email.rb CHANGED
@@ -15,7 +15,7 @@ module Stevedore
       t.message_to = metadata["Message-To"]
       t.message_from = metadata["Message-From"]
       t.message_cc = metadata["Message-Cc"]
-      t.subject = metadata["subject"]
+      t.title = t.subject = metadata["subject"]
       t.attachments = metadata["X-Attachments"].to_s.split("|").map do |raw_attachment_filename|
         attachment_filename = CGI::unescape(raw_attachment_filename)
         possible_filename = File.join(File.dirname(filepath), attachment_filename)
data/lib/split_archive.rb CHANGED
@@ -1,77 +1,104 @@
-# splits zip, mbox and pst files into their constituent documents -- mesages and attachments
+# splits zip, mbox (and eventually pst) files into their constituent documents -- messages and attachments
 # and puts them into a tmp folder
 # which is then parsed normally
-
+
 require 'tmpdir'
-require '
+require 'mail'
 require 'zip'
+require 'pst' # for PST files
+
 
 # splits PST and Mbox formats
+module Stevedore
+  class ArchiveSplitter
+    HANDLED_FORMATS = ["zip", "mbox", "pst"]
 
-
-
-
-
-
-
-
-
+    def self.split(archive_filename)
+      # if it's a PST use split_pst
+      # if it's an mbox, use split_mbox, etc.
+      # return a list of files
+      Enumerator.new do |yielder|
+        Dir.mktmpdir do |tmpdir|
+          #TODO should probably do magic byte searching etc.
+          extension = archive_filename.split(".")[-1]
+          puts "splitting #{archive_filename}"
+          constituent_files = if extension == "mbox"
+            self.split_mbox(archive_filename)
+          elsif extension == "pst"
+            self.split_pst(archive_filename)
+          elsif extension == "zip"
+            self.split_zip(archive_filename)
+          end
+          # should yield a relative filename
+          # and a lambda that will write the file contents to the given filename
+          FileUtils.mkdir_p(File.join(tmpdir, File.basename(archive_filename)))
 
-
-
-
-
-
-
-
-
-
+          constituent_files.each_with_index do |basename_contents_lambda, idx|
+            basename, contents_lambda = *basename_contents_lambda
+            tmp_filename = File.join(tmpdir, File.basename(archive_filename), basename.gsub("/", ""))
+            contents_lambda.call(tmp_filename)
+            yielder.yield tmp_filename, File.join(File.basename(archive_filename), basename)
+          end
+        end
+      end
+    end
 
-
-
+    def self.split_pst(archive_filename)
+      pstfile = Java::ComPFF::PSTFile.new(archive_filename)
+      idx = 0
+      folders = pstfile.root.sub_folders.inject({}) do |memo, f|
+        memo[f.name] = f
+        memo
       end
-
-
-
+      Enumerator.new do |yielder|
+        folders.each do |folder_name, folder|
+          while mail = folder.getNextChild
 
-
+            eml_str = mail.get_transport_message_headers + mail.get_body
 
-
-
-
-
-
-
-
+            yielder << ["#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << eml_str } }]
+            attachment_count = mail.get_number_of_attachments
+            attachment_count.times do |attachment_idx|
+              attachment = mail.get_attachment(attachment_idx)
+              attachment_filename = attachment.get_filename
+              yielder << ["#{idx}-#{attachment_filename}", lambda {|fn| open(fn, 'wb'){ |fh| fh << attachment.get_file_input_stream.to_io.read }}]
+            end
+            idx += 1
+          end
+        end
       end
     end
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    def self.split_mbox(archive_filename)
+      # stolen shamelessly from the Ruby Enumerable docs, actually
+      # split mails in mbox (slice before Unix From line after an empty line)
+      Enumerator.new do |yielder|
+        open(archive_filename) do |fh|
+          fh.slice_before(empty: true) do |line, h|
+            previous_was_empty = h[:empty]
+            h[:empty] = line == "\n" || line == "\r\n" || line == "\r"
+            previous_was_empty && line.start_with?("From ")
+          end.each_with_index do |mail_str, idx|
+            mail_str.pop if mail_str.last == "\n" # remove last line if present
+            yielder << ["#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << mail_str.join("") } }]
+            mail = Mail.new mail_str.join("")
+            mail.attachments.each do |attachment|
+              yielder << [attachment.filename, lambda{|fn| open(fn, 'wb'){|fh| attachment.save fh }}]
+            end
+          end
+        end
+      end
+    end
-end
 
-
-
-
-
+    def self.split_zip(archive_filename)
+      Zip::File.open(archive_filename) do |zip_file|
+        Enumerator.new do |yielder|
+          zip_file.each do |entry|
+            yielder << [entry.name, lambda{|fn| entry.extract(fn) }]
+          end
+        end
      end
    end
-end
 
+  end
 end
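Each `split_*` method returns a lazy `Enumerator` of `[basename, lambda]` pairs, and `split` materializes those into temp files, yielding `[tmp_filename, relative_name]` pairs to the caller. A sketch of how a caller might consume it; the mbox path is hypothetical:

```ruby
require_relative 'split_archive'

# Iteration drives the Enumerator: each message or attachment is written to a
# temp file, then yielded along with a name scoped to the archive, e.g.
# "mail.mbox/0.eml". The temp directory is cleaned up when iteration finishes,
# so process each file inside the block.
Stevedore::ArchiveSplitter.split("/tmp/mail.mbox").each do |tmp_filename, relative_name|
  puts "extracted #{relative_name} -> #{tmp_filename}"
end
```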
data/lib/stevedore-uploader.rb CHANGED
@@ -20,7 +20,7 @@ module Stevedore
   class ESUploader
     #creates blobs
     attr_reader :errors
-    attr_accessor :should_ocr, :slice_size
+    attr_accessor :should_ocr, :slice_size, :client
 
     def initialize(es_host, es_index, s3_bucket=nil, s3_path=nil)
       @errors = []
@@ -33,16 +33,15 @@ module Stevedore
         },
       )
       @es_index = es_index
-      @s3_bucket = s3_bucket
+      @s3_bucket = s3_bucket #|| (Stevedore::ESUploader.const_defined?(FOLDER) && FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : nil)
       @s3_basepath = "https://#{s3_bucket}.s3.amazonaws.com/#{s3_path || es_index}"
 
-
       @slice_size = 100
 
       @should_ocr = false
 
       self.create_index!
-      self.
+      self.add_mapping(:doc, MAPPING)
     end
 
     def create_index!
@@ -80,82 +79,28 @@ module Stevedore
       end
     end
 
-    def
+    def add_mapping(type, mapping)
       @client.indices.put_mapping({
         index: @es_index,
-        type:
+        type: type,
         body: {
           "_id" => {
-            path: "
+            path: "id"
           },
-          properties:
-            sha1: {type: :string, index: :not_analyzed},
-            title: { type: :string, analyzer: :keyword },
-            source_url: {type: :string, index: :not_analyzed},
-            modifiedDate: { type: :date, format: "dateOptionalTime" },
-            _updated_at: { type: :date },
-            analyzed: {
-              properties: {
-                body: {
-                  type: :string,
-                  index_options: :offsets,
-                  term_vector: :with_positions_offsets,
-                  store: true,
-                  fields: {
-                    snowball: {
-                      type: :string,
-                      index: "analyzed",
-                      analyzer: 'snowball_analyzer' ,
-                      index_options: :offsets,
-                      term_vector: :with_positions_offsets,
-                    }
-                  }
-                },
-                metadata: {
-                  properties: {
-                    # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
-                    "Message-From" => {
-                      type: "string",
-                      fields: {
-                        email: {
-                          type: "string",
-                          analyzer: "email_analyzer"
-                        },
-                        "Message-From" => {
-                          type: "string"
-                        }
-                      }
-                    },
-                    "Message-To" => {
-                      type: "string",
-                      fields: {
-                        email: {
-                          type: "string",
-                          analyzer: "email_analyzer"
-                        },
-                        "Message-To" => {
-                          type: "string"
-                        }
-                      }
-                    }
-                  }
-                }
-              }
-            }
-          }
+          properties: mapping
         }
       }) # was "rescue nil" but that obscured meaningful errors
     end
 
-    def bulk_upload_to_es!(data)
+    def bulk_upload_to_es!(data, type=nil)
       return nil if data.empty?
       begin
-        resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+        resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: type || 'doc', data: datum }} }
         puts resp if resp[:errors]
       rescue JSON::GeneratorError
         data.each do |datum|
           begin
-            @client.bulk body: [datum].map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+            @client.bulk body: [datum].map{|datum| {index: {_index: @es_index, _type: type || 'doc', data: datum }} }
           rescue JSON::GeneratorError
             next
           end
@@ -166,8 +111,6 @@ module Stevedore
     end
 
     def process_document(filename, filename_for_s3)
-
-
       begin
         puts "begin to process #{filename}"
         # puts "size: #{File.size(filename)}"
@@ -183,6 +126,7 @@ module Stevedore
           puts "skipping #{filename} for being too big"
           return nil
         end
+        puts metadata["Content-Type"].inspect
 
        # TODO: factor these out in favor of the yield/block situation down below.
        # this should (eventually) be totally generic, but perhaps handle common
@@ -206,6 +150,7 @@ module Stevedore
          end.join("\n\n")
          # e.g. Analysis-Corporation-2.png.pdf or Torture.pdf
          files = Dir["#{pdf_basename}.png.pdf"] + (Dir["#{pdf_basename}-*.png.pdf"].sort_by{|pdf| Regexp.new("#{pdf_basename}-([0-9]+).png.pdf").match(pdf)[1].to_i })
+          return nil if files.empty?
          system('pdftk', *files, "cat", "output", "#{pdf_basename}.ocr.pdf")
          content, _ = Rika.parse_content_and_metadata("#{pdf_basename}.ocr.pdf")
          puts "OCRed content (#{File.basename(filename)}) length: #{content.length}"
@@ -251,12 +196,14 @@ module Stevedore
       end
     end
 
-    def do!(
-      output_stream.puts "Processing documents from #{
+    def do!(target_path, output_stream=STDOUT)
+      output_stream.puts "Processing documents from #{target_path}"
 
       docs_so_far = 0
+      # use_s3 = false # option to set this (an option to set document URLs to be relative to the search engine root) is TK
+      @s3_bucket = target_path.gsub(/s3:\/\//i, '').split("/", 2).first if @s3_bucket.nil? && target_path.downcase.include?('s3://')
 
-      if
+      if target_path.downcase.include?("s3://")
        Dir.mktmpdir do |dir|
          Aws.config.update({
            region: 'us-east-1', # TODO should be configurable
@@ -264,7 +211,7 @@ module Stevedore
          s3 = Aws::S3::Resource.new
 
          bucket = s3.bucket(@s3_bucket)
-          s3_path_without_bucket =
+          s3_path_without_bucket = target_path.gsub(/s3:\/\//i, '').split("/", 2).last
          bucket.objects(:prefix => s3_path_without_bucket).each_slice(@slice_size) do |slice_of_objs|
            docs_so_far += slice_of_objs.size
 
@@ -282,17 +229,30 @@ module Stevedore
              File.open(tmp_filename, 'wb'){|f| f << body.nil? ? '' : body.chars.select(&:valid_encoding?).join}
            end
            download_filename = "https://#{@s3_bucket}.s3.amazonaws.com/" + obj.key
-
-
-
-
-
+
+            # is this file an archive that contains a bunch of documents we should index separately?
+            # obviously, there is not a strict definition here.
+            # emails in mailboxes are split into an email and attachments
+            # but, for now, standalone emails are treated as one document
+            # PDFs can (theoretically) contain documents as "attachments" -- those aren't handled here either.
+            if ArchiveSplitter::HANDLED_FORMATS.include?(tmp_filename.split(".")[-1])
+              ArchiveSplitter.split(tmp_filename).map do |constituent_file, constituent_basename|
+                doc, content, metadata = process_document(constituent_file, download_filename)
+                doc["sha1"] = Digest::SHA1.hexdigest(download_filename + File.basename(constituent_basename)) # since these files all share a download URL (that of the archive), we need to come up with a custom sha1
+                yield doc, obj.key, content, metadata if block_given?
+                FileUtils.rm(constituent_file) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+                doc
+              end
+            else
+              doc, content, metadata = process_document(tmp_filename, download_filename)
+              yield doc, obj.key, content, metadata if block_given?
+              FileUtils.rm(tmp_filename) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+              [doc]
            end
-            yield doc, obj.key, content, metadata if block_given?
-            doc
          end
          begin
-            resp = bulk_upload_to_es!(slice_of_objs.compact)
+            resp = bulk_upload_to_es!(slice_of_objs.compact.flatten(1)) # flatten, in case there's an archive
+            puts resp.inspect if resp && resp["errors"]
          rescue Manticore::Timeout, Manticore::SocketException
            output_stream.puts("retrying at #{Time.now}")
            retry
@@ -302,28 +262,42 @@ module Stevedore
          end
        end
      else
-
+        list_of_files = File.file?(target_path) ? [target_path] : Dir[File.join(target_path, target_path.include?('*') ? '' : '**/*')]
+        list_of_files.each_slice(@slice_size) do |slice_of_files|
          output_stream.puts "starting a set of #{@slice_size}"
          docs_so_far += slice_of_files.size
 
          slice_of_files.map! do |filename|
            next unless File.file?(filename)
-
-
-            if use_s3
+            filename_basepath = filename.gsub(target_path, '')
+            # if use_s3 # turning this on TK
            download_filename = @s3_basepath + filename_basepath
+            # else
+            #   download_filename = "/files/#{@es_index}/#{filename_basepath}"
+            # end
+
+            # is this file an archive that contains a bunch of documents we should index separately?
+            # obviously, there is not a strict definition here.
+            # emails in mailboxes are split into an email and attachments
+            # but, for now, standalone emails are treated as one document
+            # PDFs can (theoretically) contain documents as "attachments" -- those aren't handled here either.
+            if ArchiveSplitter::HANDLED_FORMATS.include?(filename.split(".")[-1])
+              ArchiveSplitter.split(filename).map do |constituent_file, constituent_basename|
+                doc, content, metadata = process_document(constituent_file, download_filename)
+                doc["sha1"] = Digest::SHA1.hexdigest(download_filename + File.basename(constituent_basename)) # since these files all share a download URL (that of the archive), we need to come up with a custom sha1
+                yield doc, filename, content, metadata if block_given?
+                # FileUtils.rm(constituent_file) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+                doc
+              end
            else
-
+              doc, content, metadata = process_document(filename, download_filename )
+              yield doc, filename, content, metadata if block_given?
+              [doc]
            end
-
-            doc, content, metadata = process_document(filename, download_filename )
-            yield doc, filename, content, metadata if block_given?
-            doc
          end
          begin
-
-            resp
-            puts resp.inspect if JSON.parse(resp)["errors"]
+            resp = bulk_upload_to_es!(slice_of_files.compact.flatten(1)) # flatten, in case there's an archive
+            puts resp.inspect if resp && resp["errors"]
          rescue Manticore::Timeout, Manticore::SocketException => e
            output_stream.puts e.inspect
            output_stream.puts "Upload error: #{e} #{e.message}."
@@ -337,4 +311,59 @@ module Stevedore
      end
    end
  end
+  MAPPING = {
+    sha1: {type: :string, index: :not_analyzed},
+    title: { type: :string, analyzer: :keyword },
+    source_url: {type: :string, index: :not_analyzed},
+    modifiedDate: { type: :date, format: "dateOptionalTime" },
+    _updated_at: { type: :date },
+    analyzed: {
+      properties: {
+        body: {
+          type: :string,
+          index_options: :offsets,
+          term_vector: :with_positions_offsets,
+          store: true,
+          fields: {
+            snowball: {
+              type: :string,
+              index: "analyzed",
+              analyzer: 'snowball_analyzer',
+              index_options: :offsets,
+              term_vector: :with_positions_offsets,
+            }
+          }
+        },
+        metadata: {
+          properties: {
+            # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
+            "Message-From" => {
+              type: "string",
+              fields: {
+                email: {
+                  type: "string",
+                  analyzer: "email_analyzer"
+                },
+                "Message-From" => {
+                  type: "string"
+                }
+              }
+            },
+            "Message-To" => {
+              type: "string",
+              fields: {
+                email: {
+                  type: "string",
+                  analyzer: "email_analyzer"
+                },
+                "Message-To" => {
+                  type: "string"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
 end
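With the mapping hoisted into the `MAPPING` constant and a `type` parameter threaded through `add_mapping` and `bulk_upload_to_es!`, callers can register and upload a second document type beside the default `doc`. A hedged sketch; the host, index, `memo` type, and its fields are made-up examples:

```ruby
require 'stevedore-uploader'

uploader = Stevedore::ESUploader.new("localhost:9200", "my-index")

# register a hypothetical "memo" type with a minimal mapping
uploader.add_mapping(:memo, {
  title: { type: :string, analyzer: :keyword },
  sha1:  { type: :string, index: :not_analyzed }
})

# the new type argument routes these documents to _type "memo" instead of "doc"
uploader.bulk_upload_to_es!(
  [{ "id" => "memo-1", "title" => "An example memo", "sha1" => "da39a3ee" }],
  :memo
)
```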
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: stevedore-uploader
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.1
 platform: java
 authors:
 - Jeremy B. Merrill
 autorequire: 
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-05-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   requirement: !ruby/object:Gem::Requirement
@@ -94,6 +94,48 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.6'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.2
+  name: pst
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.2
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+  name: mail
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
+  name: rubyzip
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
 description: TK
 email: jeremy.merrill@nytimes.com
 executables: []
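The three new runtime dependencies correspond to the `require`s added in `split_archive.rb`. For local development, the equivalent Gemfile lines would look like this (a sketch; the gemspec above is authoritative):

```ruby
# Gemfile (JRuby) -- mirrors the new runtime dependencies declared above
gem 'pst',     '~> 0.0.2' # reads Outlook PST archives
gem 'mail',    '~> 2.6'   # parses mbox messages and their attachments
gem 'rubyzip', '~> 1.2'   # extracts zip entries
```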